Update to latest model (#205)
* Update configs to latest model

* Update model cards and readmes with new model info

* Add disclaimer to pipeline metric summary analysis

* Use gpu option in bervectorizer

* Add some model and data folder download messages

* Dont use args in tests

* use logger in downlon public data error

* Update tests, and logs, and only output the 4 entities we care about
lizgzil authored Dec 8, 2023
1 parent 794f826 commit ade0887
Showing 16 changed files with 77 additions and 45 deletions.
11 changes: 6 additions & 5 deletions docs/source/labelling.md
@@ -1,16 +1,17 @@
# Entity Labelling

To extract skills from job adverts we took an approach of training a named entity recognition (NER) model to predict which parts of job adverts were skills ("skill entities") and which were experiences ("experience entities").
To extract skills from job adverts we took an approach of training a named entity recognition (NER) model to predict which parts of job adverts were skills ("skill entities"), which were experiences ("experience entities") and which were job benefits ("benefit entities").

To train the NER model we needed labelled data. First we created a random sample of job adverts and got them into a form needed for labelling using [Label Studio](https://labelstud.io/). More about this labelling process can be found in the [`skill_ner` pipeline](https://nestauk.github.io/ojd_daps_skills/pipeline/skill_ner/README.md).
To train the NER model we needed labelled data. First we created a random sample of job adverts and got them into a form needed for labelling using [Label Studio](https://labelstud.io/) and also [Prodigy](https://prodi.gy/). More about this labelling process can be found in the [`skill_ner` pipeline](https://nestauk.github.io/ojd_daps_skills/pipeline/skill_ner/README.md).

There are 3 entity labels in our training data:
There are 4 entity labels in our training data:

1. `SKILL`
2. `MULTISKILL`
3. `EXPERIENCE`
4. `BENEFIT`

The user interface for this labelling task looks like:
The user interface for the labelling task in label-studio looks like:

![](../../outputs/reports/figures/label_studio.png)

@@ -27,4 +28,4 @@ Sometimes there were no entities to label:

### Training dataset

For the current NER model, 5641 entities in 375 job adverts from our dataset of job adverts were labelled; 354 are multiskill, 4696 are skill, and 608 were experience entities. 20% of the labelled entities were held out as a test set to evaluate the models.
For the current NER model (20230808), 8971 entities in 500 job adverts from our dataset were labelled: 443 multiskill, 7313 skill, 852 experience and 363 benefit entities. 20% of the labelled entities were held out as a test set to evaluate the models.
25 changes: 13 additions & 12 deletions docs/source/model_card.md
@@ -2,7 +2,7 @@

This page contains information for different parts of the skills extraction and mapping pipeline. We detail the two main parts of the pipeline; the extract skills pipeline and the skills to taxonomy mapping pipeline.

Developed by data scientists in Nesta’s Data Analytics Practice, (last updated on 23-11-2022).
Developed by data scientists in Nesta’s Data Analytics Practice, (last updated on 29-09-2023).

- [Model Card: Extract Skills](extract_skills_card)
- [Model Card: Skills to Taxonomy Mapping](mapping_card)
@@ -17,38 +17,39 @@ _The extracting skills pipeline._

### Summary

- Train a Named Entity Recognition (NER) spaCy component to extract skills, multiskills and experience entities from job adverts.
- Train a Named Entity Recognition (NER) spaCy component to extract skills, multiskills, experience and benefits entities from job adverts.
- Predict whether a skill entity is a multiskill using scikit-learn's SVM model. Features are: length of entity; whether 'and' is in the entity; whether ',' is in the entity.
- Split multiskills, where possible, based on semantic rules.
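The three features listed above can be sketched in a few lines of plain Python. This is only an illustration, not the project's own feature code: the function name `multiskill_features` is hypothetical, and it assumes that "length" means character length and that 'and' is matched as a whole word.

```python
def multiskill_features(entity_text: str) -> list[int]:
    """Illustrative feature vector: [length, contains 'and', contains ','].

    Assumptions (not confirmed by the source): length is counted in
    characters, and 'and' must appear as a whole whitespace-separated word.
    """
    tokens = entity_text.lower().split()
    return [
        len(entity_text),          # length of the entity
        int("and" in tokens),      # 1 if 'and' appears as a word
        int("," in entity_text),   # 1 if the entity contains a comma
    ]


examples = ["communication", "excel, presenting and maths skills"]
features = [multiskill_features(text) for text in examples]
```

Vectors of this shape could then be passed to any binary classifier, such as an SVM, to separate single skills from multiskills.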

### Training

- For the NER model, 375 job adverts were labelled for skills, multiskills and experience.
- As of 15th November 2022, **5641** entities in 375 job adverts from OJO were labelled;
- **354** are multiskill, **4696** are skill, and **608** were experience entities. 20% of the labelled entities were held out as a test set to evaluate the models.
- For the NER model, 500 job adverts were labelled for skills, multiskills, experience and benefits.
- As of 8th August 2023, **8971** entities in 500 job adverts from OJO were labelled;
- **443** are multiskill, **7313** are skill, **852** were experience entities, and **363** were benefit entities. 20% of the labelled entities were held out as a test set to evaluate the models.

The NER model we trained used [spaCy's](https://spacy.io/) NER neural network architecture. Their NER architecture _"features a sophisticated word embedding strategy using subword features and 'Bloom' embeddings, a deep convolutional neural network with residual connections, and a novel transition-based approach to named entity parsing"_ - more about this [here](https://spacy.io/universe/project/video-spacys-ner-model).

You can read more about the creation of the labelling data [here](./labelling.md).

### NER Metrics

- A metric in the Python library nervaluate ([read more here](https://pypi.org/project/nervaluate/)) was used to calculate F1, precision and recall for the NER and SVM classifier on the held-out test set. As of 15th November 2022, the results are as follows:
- A metric in the Python library nervaluate ([read more here](https://pypi.org/project/nervaluate/)) was used to calculate F1, precision and recall for the NER and SVM classifier on the held-out test set. As of 8th August 2023, the results are as follows:

| Entity | F1 | Precision | Recall |
| ---------- | ----- | --------- | ------ |
| Skill | 0.586 | 0.679 | 0.515 |
| Experience | 0.506 | 0.648 | 0.416 |
| All | 0.563 | 0.643 | 0.500 |
| Skill | 0.612 | 0.712 | 0.537 |
| Experience | 0.524 | 0.647 | 0.441 |
| Benefit | 0.531 | 0.708 | 0.425 |
| All | 0.590 | 0.680 | 0.521 |

- These metrics use partial entity matching.
- More details of the evaluation performance across both the NER model and the SVM model can be found in `outputs/models/ner_model/20220825/train_details.json`
- More details of the evaluation performance across both the NER model and the SVM model can be found in `outputs/models/ner_model/20230808/train_details.json`
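As a sanity check on the updated metrics table, F1 can be recomputed as the harmonic mean of the reported precision and recall. The sketch below is a consistency check on the rounded values for the 8th August 2023 rows only; it does not reproduce how nervaluate scores entities, and the helper name `f1_score` is introduced here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the rows dated 8th August 2023
reported = {
    "Skill": (0.712, 0.537),
    "Experience": (0.647, 0.441),
    "Benefit": (0.708, 0.425),
    "All": (0.680, 0.521),
}

# Round to three decimals, matching the precision used in the table
recomputed = {name: round(f1_score(p, r), 3) for name, (p, r) in reported.items()}
```

The recomputed values agree with the F1 column to three decimal places, which suggests the three columns are internally consistent.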

### Multiskill Metrics

- The same training data and held out test set used for the NER model was used to evaluate the SVM model. On a held out test set, the SVM model achieved 91% accuracy.
- The same training data and held out test set used for the NER model was used to evaluate the SVM model. On a held out test set, the SVM model achieved 94% accuracy.
- When evaluating the multiskill splitter algorithm rules, 253 multiskill spans were labelled as ‘good’, ‘ok’ or ‘bad’ splits. Of the 253 multiskill spans, 80 were split. Of the splits, 66% were ‘good’, 9% were ‘ok’ and 25% were ‘bad’.
- More details of the evaluation performance across both the NER model and the SVM model can be found in `outputs/models/ner_model/20220825/train_details.json`
- More details of the evaluation performance across both the NER model and the SVM model can be found in `outputs/models/ner_model/20230808/train_details.json`

### Caveats and Recommendations

2 changes: 1 addition & 1 deletion docs/source/pipeline_summary.md
@@ -23,7 +23,7 @@ For further information or feedback please contact Liz Gallagher, India Kerle or

## Metrics

There is no exact way to evaluate how well our pipeline works; however we have several proxies to better understand how our approach compares.
There is no exact way to evaluate how well our pipeline works; however we have several proxies to better understand how our approach compares. The analysis in this section was performed using the results of the `20220825` model. We believe the newer `20230808` model will improve these results, but the analysis hasn't been repeated.

### Comparison 1 - Top skill groups per occupation comparison to ESCO essential skill groups per occupation

2 changes: 1 addition & 1 deletion ojd_daps_skills/config/extract_skills_esco.yaml
@@ -1,4 +1,4 @@
ner_model_path: "outputs/models/ner_model/20220825/"
ner_model_path: "outputs/models/ner_model/20230808/"
taxonomy_name: "esco"
taxonomy_path: "outputs/data/skill_ner_mapping/esco_data_formatted.csv"
clean_job_ads: True
2 changes: 1 addition & 1 deletion ojd_daps_skills/config/extract_skills_lightcast.yaml
@@ -1,4 +1,4 @@
ner_model_path: "outputs/models/ner_model/20220825/"
ner_model_path: "outputs/models/ner_model/20230808/"
taxonomy_name: "lightcast"
taxonomy_path: "outputs/data/skill_ner_mapping/lightcast_data_formatted.csv"
clean_job_ads: True
@@ -1,4 +1,4 @@
ner_model_path: "outputs/models/ner_model/20220825/"
ner_model_path: "outputs/models/ner_model/20230808/"
taxonomy_name: "lightcast"
taxonomy_path: "escoe_extension/outputs/data/skill_ner_mapping/lightcast_data_formatted.csv"
clean_job_ads: True
2 changes: 1 addition & 1 deletion ojd_daps_skills/config/extract_skills_template.yaml
@@ -1,7 +1,7 @@
#This is a template config file - we have added definitions to parameters that you will need to modify for your own taxonomy

#the relative path to the trained NER model
ner_model_path: "outputs/models/ner_model/20220825/"
ner_model_path: "outputs/models/ner_model/20230808/"
#the relative path to where
taxonomy_path: "path/to/formatted_taxonomy.csv"
#the name of your own taxonomy
2 changes: 1 addition & 1 deletion ojd_daps_skills/config/extract_skills_toy.yaml
@@ -1,4 +1,4 @@
ner_model_path: "outputs/models/ner_model/20220825/"
ner_model_path: "outputs/models/ner_model/20230808/"
taxonomy_name: "toy"
taxonomy_path: ""
clean_job_ads: True
8 changes: 5 additions & 3 deletions ojd_daps_skills/getters/download_public_data.py
@@ -1,4 +1,4 @@
from ojd_daps_skills import PUBLIC_DATA_FOLDER_NAME, PROJECT_DIR
from ojd_daps_skills import PUBLIC_DATA_FOLDER_NAME, PROJECT_DIR, logger

import os
import boto3
@@ -7,6 +7,7 @@
from botocore.config import Config
from zipfile import ZipFile


def download():
"""Download public data. Expected to run once on first use."""
s3 = boto3.client(
@@ -25,11 +26,12 @@ def download():
zip_ref.extractall(PROJECT_DIR)

os.remove(f"{public_data_dir}.zip")
logger.info(f"Data folder downloaded from {public_data_dir}")

except ClientError as ce:
print(f"Error: {ce}")
logger.warning(f"Error: {ce}")
except FileNotFoundError as fnfe:
print(f"Error: {fnfe}")
logger.warning(f"Error: {fnfe}")


if __name__ == "__main__":
5 changes: 4 additions & 1 deletion ojd_daps_skills/pipeline/extract_skills/extract_skills.py
@@ -64,9 +64,12 @@ def __init__(
"Neccessary files are not downloaded. Downloading ~1GB of neccessary files."
)
download()
else:
logger.info("Model files found locally")
else:
self.base_path = "escoe_extension/"
self.s3 = True
logger.info("Will be downloading data and models directly from S3")
pass

self.taxonomy_name = self.config["taxonomy_name"]
@@ -146,7 +149,7 @@ def load(

self.nlp = self.job_ner.load_model(self.ner_model_path, s3_download=self.s3)

self.labels = self.nlp.get_pipe("ner").labels + ("MULTISKILL",)
self.labels = ("BENEFIT", "SKILL", "MULTISKILL", "EXPERIENCE")

logger.info(f"Loading '{self.taxonomy_name}' taxonomy information")
if self.taxonomy_name == "toy":
10 changes: 7 additions & 3 deletions ojd_daps_skills/pipeline/skill_ner/README.md
@@ -1,6 +1,6 @@
# Skill NER

## Label data
## Label data using label-studio

### Creating a sample of the OJO data

@@ -79,9 +79,13 @@ For the labelling done at the end of June 2022, we labelled the chunk of 400 job

The outputs of this labelling are stored in `s3://open-jobs-lake/escoe_extension/outputs/skill_span_labels/`.

### Merging labelled files
## Label data using Prodigy

Since multiple people labelled files from different locations, we merge the labelled data using the following command:
We labelled another batch of job adverts using [Prodigy](https://prodi.gy/). This was to avail of their active learning capabilities. Details of how we labelled job adverts this way are given in [the Prodigy labelling README](./ojd_daps_skills/ojd_daps_skills/pipeline/skill_ner/prodigy/README.md).

## Merging labelled files

Since multiple people labelled files from different locations, and we labelled in both label-studio and Prodigy, we merge the labelled data using the following command:

```
python ojd_daps_skills/pipeline/skill_ner/combine_labels.py
4 changes: 2 additions & 2 deletions ojd_daps_skills/pipeline/skill_ner/get_skills.py
@@ -4,7 +4,7 @@
Running
python ojd_daps_skills/pipeline/skill_ner/get_skills.py
--model_path outputs/models/ner_model/20220825/
--model_path outputs/models/ner_model/20230808/
--output_file_dir escoe_extension/outputs/data/skill_ner/skill_predictions/
--job_adverts_filename escoe_extension/inputs/data/skill_ner/data_sample/20220622_sampled_job_ads.json
@@ -40,7 +40,7 @@ def parse_arguments(parser):
parser.add_argument(
"--model_path",
help="The path to the model you want to make predictions with",
default="outputs/models/ner_model/20220825/",
default="outputs/models/ner_model/20230808/",
)

parser.add_argument(
5 changes: 3 additions & 2 deletions ojd_daps_skills/pipeline/skill_ner/ner_spacy.py
@@ -512,11 +512,12 @@ def load_model(self, model_folder, s3_download=True):
self.ms_classifier = pickle.load(
open(os.path.join(model_folder, "ms_classifier.pkl"), "rb")
)
return self.nlp
except OSError:
logger.info(
logger.warning(
"Model not found locally - you may need to download it from S3 (set s3_download to True)"
)
return self.nlp
return None


def parse_arguments(parser):
28 changes: 22 additions & 6 deletions ojd_daps_skills/tests/test_extract_skills.py
@@ -5,8 +5,6 @@
from ojd_daps_skills.utils.text_cleaning import short_hash
from ojd_daps_skills.pipeline.extract_skills.extract_skills import ExtractSkills

es = ExtractSkills(local=True)

job_adverts = [
"The job involves communication and maths skills",
"The job involves excel and presenting skills. You need good excel skills",
@@ -15,10 +13,16 @@

def test_load():

es = ExtractSkills(local=True)
es.load()

assert isinstance(es.nlp, spacy.lang.en.English)
assert es.labels == ("EXPERIENCE", "SKILL", "MULTISKILL")
assert all(
[
label in es.labels
for label in ["EXPERIENCE", "SKILL", "MULTISKILL", "BENEFIT"]
]
)
assert es.skill_mapper
assert (
len(
@@ -31,6 +35,9 @@ def test_load():

def test_get_skills():

es = ExtractSkills(local=True)
es.load()

predicted_skills = es.get_skills(job_adverts)

# The keys are the labels for every job prediction
@@ -46,6 +53,9 @@ def test_get_skills():

def test_map_skills():

es = ExtractSkills(local=True)
es.load()

predicted_skills = es.get_skills(job_adverts)
matched_skills = es.map_skills(predicted_skills)

@@ -56,13 +66,17 @@ def test_map_skills():
*[[skill[1][0] for skill in skills["SKILL"]] for skills in matched_skills]
)
)
assert (
set(test_skills).difference(set(es.taxonomy_info["hier_name_mapper"].values()))
== set()
tax_skills_and_hier_names = set(
es.taxonomy_skills["description"].tolist()
+ list(es.taxonomy_info["hier_name_mapper"].values())
)
assert set(test_skills).difference(tax_skills_and_hier_names) == set()


def test_map_no_skills():
es = ExtractSkills(local=True)
es.load()

job_adverts = ["nothing", "we want excel skills", "we want communication skills"]
extract_matched_skills = es.extract_skills(job_adverts)
assert len(job_adverts) == len(extract_matched_skills)
@@ -72,6 +86,8 @@ def test_hardcoded_mapping():
"""
The mapped results using the algorithm should be the same as the hardcoded results
"""
es = ExtractSkills(local=True)
es.load()

hard_coded_skills = {
"3267542715426065": {
6 changes: 4 additions & 2 deletions ojd_daps_skills/utils/bert_vectorizer.py
@@ -2,6 +2,7 @@
import time
from ojd_daps_skills import logger
import logging
import torch


class BertVectorizer:
@@ -13,7 +14,7 @@ class BertVectorizer:
def __init__(
self,
bert_model_name="sentence-transformers/all-MiniLM-L6-v2",
multi_process=True,
multi_process=False,
batch_size=32,
verbose=True,
):
@@ -27,7 +28,8 @@ def __init__(
logger.setLevel(logging.ERROR)

def fit(self, *_):
self.bert_model = SentenceTransformer(self.bert_model_name)
device = torch.device(f"cuda:0" if torch.cuda.is_available() else "cpu")
self.bert_model = SentenceTransformer(self.bert_model_name, device=device)
self.bert_model.max_seq_length = 512
return self

8 changes: 5 additions & 3 deletions outputs/reports/skills_extraction.md
@@ -16,11 +16,11 @@ This process means we can extract skills from thousands of job adverts and analy

## Labelling data

To train the NER model we needed labelled data. First we created a random sample of job adverts and got them into a form needed for labelling using [Label Studio](https://labelstud.io/). More about this labelling process can be found in the `skill_ner` pipeline [README.md](./ojd_daps_skills/ojd_daps_skills/pipeline/skill_ner/README.md).
To train the NER model we needed labelled data. First we created a random sample of job adverts and got them into a form needed for labelling using [Label Studio](https://labelstud.io/); we then labelled a second batch using [Prodigy](https://prodi.gy/). More about this labelling process can be found in the `skill_ner` pipeline [README.md](./ojd_daps_skills/ojd_daps_skills/pipeline/skill_ner/README.md).

![](figures/label_studio.png)

As of 11th July 2022 we have labelled 3400 entities; 404 (12%) are multiskill, 2603 (77%) are skill, and 393 (12%) are experience entities.
As of 8th August 2023 we have labelled 8971 entities; 443 (5%) are multiskill, 7313 (82%) are skill, 852 (9%) are experience entities and 363 (4%) are benefit entities.
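The label breakdown for 8th August 2023 can be recomputed from the raw counts. The sketch below sums the per-label counts and converts each to a whole-percent share of the total. The variable names are illustrative, and rounding to the nearest integer is an assumption about how the percentages were produced.

```python
# Per-label entity counts as stated for 8th August 2023
counts = {"MULTISKILL": 443, "SKILL": 7313, "EXPERIENCE": 852, "BENEFIT": 363}

# The total should match the overall number of labelled entities
total = sum(counts.values())

# Share of each label, rounded to the nearest whole percent
shares = {label: round(100 * n / total) for label, n in counts.items()}
```

Because each share is rounded independently, the rounded percentages are not guaranteed to add up to exactly 100.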

### Multiskill labels

@@ -60,7 +60,7 @@ A summary of the experiments with training the model is below.

| Date (model name) | Base model | Training size | Evaluation size | Number of iterations | Drop out rate | Learning rate | Convert multiskill? | Other info | Skill F1 | Experience F1 | All F1 | Multiskill test score |
| ----------------- | -------------- | --------------- | --------------- | -------------------- | ------------- | ------------- | ------------------- | ------------------------------------------------------------------------------------------------ | -------- | ------------- | ------ | --------------------- |
| 20230808 | en_core_web_lg | 400 (7149 ents) | 100 (1805 ents) | 100 | 0.1 | 0.001 | True | More data, different base model, BENEFIT label data | 0.61 | 0.52 | 0.59 | 0.94 |
| 20230808\*\* | en_core_web_lg | 400 (7149 ents) | 100 (1805 ents) | 100 | 0.1 | 0.001 | True | More data, different base model, BENEFIT label data | 0.61 | 0.52 | 0.59 | 0.94 |
| 20220825 | blank en | 300 (4508 ents) | 75 (1133 ents) | 100 | 0.1 | 0.001 | True | Changed hyperparams, more data | 0.59 | 0.51 | 0.56 | 0.91 |
| 20220729\* | blank en | 196 (2850 ents) | 49 (636 ents) | 50 | 0.3 | 0.001 | True | More data, padding in cleaning but do fix_entity_annotations after fix_all_formatting to sort it | 0.57 | 0.44 | 0.54 | 0.87 |
| 20220729_nopad | blank en | 196 | 49 | 50 | 0.3 | 0.001 | True | No padding in cleaning, more data | 0.52 | 0.33 | 0.45 | 0.87 |
@@ -124,6 +124,8 @@ More in-depth metrics for `20220714`:

\* For model `20220714` we relabelled the MULTISKILL labels in the dataset - we were trying to see whether some of them should actually be single skills, or could be separated into single skills rather than (as we found) labelling a large span as a multiskill. This process increased our number of labelled skill entities (from 2603 to 2887) and decreased the number of multiskill entities (from 404 to 218), resulting in a net increase in entities labelled (from 3400 to 3498).

\*\* For model `20230808` we included BENEFIT labels in some of the labelled data.

### Parameter tuning

For model `20220825` onwards we changed our hyperparameters after some additional experimentation revealed improvements could be made. This experimentation was on a dataset of 375 job adverts in total.