Scope applications for Teamlyzer data #2

Open
TSFelg opened this issue Apr 8, 2021 · 6 comments

@TSFelg
Owner

TSFelg commented Apr 8, 2021

Teamlyzer has an open database with several open datasets of salaries in Portugal (not only tech). It would be interesting to scope how this data can be used in Fairly. It can be used simply to train the model with more data, or it can be used to expand the coverage of job types beyond tech.

@ghost

ghost commented Apr 9, 2021

We can add a link to, or embed, the current URL of Fairly to increase the visibility of this tool. Or maybe some type of integration, since we are working on the same problem.

There are also thousands of Portuguese salaries shared in the Stack Overflow surveys or on knowyourworth; the problem, as always, is the normalization.

@TSFelg
Owner Author

TSFelg commented Apr 9, 2021

That's quite interesting. I have to take a look at the data, but eventually some type of integration would make sense.

Also, the extra visibility would be appreciated :)

@TSFelg
Owner Author

TSFelg commented Apr 9, 2021

Also, would you mind elaborating a bit more when you say the problem is the normalization? I imagine it's the fact that different datasets collect different variables, but would appreciate your feedback on the typical issues you face.

@ghost

ghost commented Apr 9, 2021

> Also, would you mind elaborating a bit more when you say the problem is the normalization? I imagine it's the fact that different datasets collect different variables, but would appreciate your feedback on the typical issues you face.

Yeah, each survey has a different structure. Take seniority: some surveys use years of experience, others use labels such as senior, junior, middle, and so on.

The same goes for role: a back-end golang developer earns much more than a back-end php developer, so converting both to "back-end developer" ignores this type of detail.

And from my experience, all datasets always need some manual validation, especially surveys with open fields, to check for potential fake data like junior | 150k | lisbon
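To make the seniority mismatch concrete, here is a minimal sketch of mapping both styles of survey answer onto one scale. The function names, the alias table, and the year cut-offs (under 2 years is junior, under 5 is middle) are all hypothetical and would need to be agreed on against real data.

```python
from typing import Optional

# Hypothetical aliases for seniority labels seen in different surveys.
LABEL_ALIASES = {
    "jr": "junior",
    "junior": "junior",
    "mid": "middle",
    "middle": "middle",
    "sr": "senior",
    "senior": "senior",
}


def seniority_from_years(years: float) -> str:
    """Map years of experience to a label (cut-offs are assumptions)."""
    if years < 2:
        return "junior"
    if years < 5:
        return "middle"
    return "senior"


def normalize_seniority(
    label: Optional[str] = None, years: Optional[float] = None
) -> Optional[str]:
    """Prefer a recognised label, fall back to years, else return None."""
    if label is not None:
        key = label.strip().lower()
        if key in LABEL_ALIASES:
            return LABEL_ALIASES[key]
    if years is not None:
        return seniority_from_years(years)
    return None
```

Rows that give neither a known label nor years come back as `None`, so they can be set aside for manual review rather than guessed.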

@TSFelg
Owner Author

TSFelg commented Apr 9, 2021

Thanks, that's great info! I think a lot of those are very interesting machine learning challenges, so I'm quite excited to try and tackle them :)

For example, it's possible to formulate a modelling strategy that can leverage both data which only specifies back end and data that specifies the languages/frameworks. Some outlier detection can also help to detect those types of fake cases, not necessarily automatically, but at least by making them stand out so that a human can confirm whether it's bad data or not.
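As a rough illustration of the outlier idea, the sketch below flags salaries that sit far from the median of their seniority group, using a robust z-score based on the median absolute deviation (MAD). This is one possible technique, not something Fairly already does; the row layout, the `flag_salary_outliers` name, the sample figures, and the 3.5 threshold are assumptions for the example.

```python
from collections import defaultdict
from statistics import median


def flag_salary_outliers(rows, threshold=3.5):
    """Return rows whose salary is far from their seniority group's median."""
    by_group = defaultdict(list)
    for row in rows:
        by_group[row["seniority"]].append(row["salary"])

    flagged = []
    for row in rows:
        salaries = by_group[row["seniority"]]
        med = median(salaries)
        mad = median(abs(s - med) for s in salaries)
        if mad == 0:
            continue  # no spread in this group, so nothing to compare against
        # 0.6745 rescales the MAD so the score is comparable to a z-score.
        score = 0.6745 * abs(row["salary"] - med) / mad
        if score > threshold:
            flagged.append(row)
    return flagged


# Made-up sample rows, including the "junior | 150k | lisbon" style case.
sample_rows = [
    {"seniority": "junior", "salary": 18000, "city": "porto"},
    {"seniority": "junior", "salary": 20000, "city": "lisbon"},
    {"seniority": "junior", "salary": 22000, "city": "braga"},
    {"seniority": "junior", "salary": 19000, "city": "lisbon"},
    {"seniority": "junior", "salary": 150000, "city": "lisbon"},
    {"seniority": "senior", "salary": 45000, "city": "lisbon"},
    {"seniority": "senior", "salary": 50000, "city": "porto"},
    {"seniority": "senior", "salary": 55000, "city": "lisbon"},
]

flagged = flag_salary_outliers(sample_rows)
```

The flagged rows are only candidates: the point is to make them stand out for a human to confirm, not to drop them automatically.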

@ghost

ghost commented Apr 9, 2021

> modelling strategy that can both leverage data which only specifies back end as well as data that specifies the languages/frameworks

In that case I think you need some Named Entity Recognition framework. Maybe this paper can be helpful.

These guys are doing awesome work with NER: https://www.glasssquid.io/try-analyze
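Before reaching for a full NER model, a much simpler keyword lookup can serve as a baseline for splitting free-text job titles into a role family and a language. The sketch below is that stand-in, not NER itself; the vocabularies, `ParsedRole`, and `parse_role` are hypothetical names for illustration.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical vocabularies: canonical name -> aliases found in job titles.
ROLE_FAMILIES = {
    "back-end": ("back-end", "backend", "back end"),
    "front-end": ("front-end", "frontend", "front end"),
}
LANGUAGES = {
    "golang": ("golang", "go"),
    "php": ("php",),
    "python": ("python",),
}


class ParsedRole(NamedTuple):
    family: Optional[str]
    language: Optional[str]


def _find(text, vocabulary):
    """Return the first canonical name whose alias appears as a whole word."""
    for canonical, aliases in vocabulary.items():
        for alias in aliases:
            pattern = r"(?<!\w)" + re.escape(alias) + r"(?!\w)"
            if re.search(pattern, text):
                return canonical
    return None


def parse_role(title: str) -> ParsedRole:
    """Split a job title into role family and language, either may be None."""
    text = title.lower()
    return ParsedRole(_find(text, ROLE_FAMILIES), _find(text, LANGUAGES))
```

A title that only says "Back end developer" keeps its family and gets `None` for the language, so both kinds of rows can sit in the same table without collapsing the golang/php distinction where it is known.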
