Update to latest model (#205)

* Update configs to latest model
* Update model cards and readmes with new model info
* Add disclaimer to pipeline metric summary analysis
* Use gpu option in BertVectorizer
* Add some model and data folder download messages
* Don't use args in tests
* Use logger in download public data error
* Update tests and logs, and only output the 4 entities we care about
nestauk · Dec 8, 2023 · ade0887
1 parent 794f826
Showing 16 changed files with 77 additions and 45 deletions.
diff --git a/docs/source/labelling.md b/docs/source/labelling.md
@@ -1,16 +1,17 @@
 # Entity Labelling
 
-To extract skills from job adverts we took an approach of training a named entity recognition (NER) model to predict which parts of job adverts were skills ("skill entities") and which were experiences ("experience entities").
+To extract skills from job adverts we took an approach of training a named entity recognition (NER) model to predict which parts of job adverts were skills ("skill entities"), which were experiences ("experience entities") and which were job benefits ("benefit entities").
 
-To train the NER model we needed labelled data. First we created a random sample of job adverts and got them into a form needed for labelling using [Label Studio](https://labelstud.io/). More about this labelling process can be found in the [`skill_ner` pipeline](https://nestauk.github.io/ojd_daps_skills/pipeline/skill_ner/README.md).
+To train the NER model we needed labelled data. First we created a random sample of job adverts and got them into a form needed for labelling using [Label Studio](https://labelstud.io/) and also [Prodigy](https://prodi.gy/). More about this labelling process can be found in the [`skill_ner` pipeline](https://nestauk.github.io/ojd_daps_skills/pipeline/skill_ner/README.md).
 
-There are 3 entity labels in our training data:
+There are 4 entity labels in our training data:
 
 1. `SKILL`
 2. `MULTISKILL`
 3. `EXPERIENCE`
+4. `BENEFIT`
 
-The user interface for this labelling task looks like:
+The user interface for the labelling task in label-studio looks like:
 
 ![](../../outputs/reports/figures/label_studio.png)
 
@@ -27,4 +28,4 @@ Sometimes there were no entities to label:
 
 ### Training dataset
 
-For the current NER model, 5641 entities in 375 job adverts from our dataset of job adverts were labelled; 354 are multiskill, 4696 are skill, and 608 were experience entities. 20% of the labelled entities were held out as a test set to evaluate the models.
+For the current NER model (20230808), 8971 entities in 500 job adverts from our dataset of job adverts were labelled; 443 are multiskill, 7313 are skill, 852 were experience entities, and 363 were benefit entities. 20% of the labelled entities were held out as a test set to evaluate the models.
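As a quick consistency check on the figures quoted for the `20230808` dataset, the per-label counts can be summed and converted to shares. This is only a sketch: the name `label_counts` is illustrative and does not come from the repository.

```python
# Entity counts quoted for the 20230808 labelled dataset.
label_counts = {
    "SKILL": 7313,
    "MULTISKILL": 443,
    "EXPERIENCE": 852,
    "BENEFIT": 363,
}

# The four label counts should add up to the quoted total of 8971 entities.
total = sum(label_counts.values())

# Share of each label among all labelled entities.
shares = {label: count / total for label, count in label_counts.items()}

print(f"Total entities: {total}")
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```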
diff --git a/docs/source/model_card.md b/docs/source/model_card.md
@@ -2,7 +2,7 @@
 
 This page contains information for different parts of the skills extraction and mapping pipeline. We detail the two main parts of the pipeline; the extract skills pipeline and the skills to taxonomy mapping pipeline.
 
-Developed by data scientists in Nesta’s Data Analytics Practice, (last updated on 23-11-2022).
+Developed by data scientists in Nesta’s Data Analytics Practice, (last updated on 29-09-2023).
 
 - [Model Card: Extract Skills](extract_skills_card)
 - [Model Card: Skills to Taxonomy Mapping](mapping_card)
@@ -17,38 +17,39 @@ _The extracting skills pipeline._
 
 ### Summary
 
-- Train a Named Entity Recognition (NER) spaCy component to extract skills, multiskills and experience entities from job adverts.
+- Train a Named Entity Recognition (NER) spaCy component to extract skills, multiskills, experience and benefits entities from job adverts.
 - Predict whether or not a skill is multi-skill or not using scikit learn's SVM model. Features are length of entity; if 'and' in entity; if ',' in entity.
 - Split multiskills, where possible, based on semantic rules.
 
 ### Training
 
-- For the NER model, 375 job adverts were labelled for skills, multiskills and experience.
-- As of 15th November 2022, **5641** entities in 375 job adverts from OJO were labelled;
-- **354** are multiskill, **4696** are skill, and **608** were experience entities. 20% of the labelled entities were held out as a test set to evaluate the models.
+- For the NER model, 500 job adverts were labelled for skills, multiskills, experience and benefits.
+- As of 8th August 2023, **8971** entities in 500 job adverts from OJO were labelled;
+- **443** are multiskill, **7313** are skill, **852** were experience entities, and **363** were benefit entities. 20% of the labelled entities were held out as a test set to evaluate the models.
 
 The NER model we trained used [spaCy's](https://spacy.io/) NER neural network architecture. Their NER architecture _"features a sophisticated word embedding strategy using subword features and 'Bloom' embeddings, a deep convolutional neural network with residual connections, and a novel transition-based approach to named entity parsing"_ - more about this [here](https://spacy.io/universe/project/video-spacys-ner-model).
 
 You can read more about the creation of the labelling data [here](./labelling.md).
 
 ### NER Metrics
 
-- A metric in the python library nerevaluate ([read more here](https://pypi.org/project/nervaluate/)) was used to calculate F1, precision and recall for the NER and SVM classifier on the held-out test set. As of 15th November 2022, the results are as follows:
+- A metric in the python library nerevaluate ([read more here](https://pypi.org/project/nervaluate/)) was used to calculate F1, precision and recall for the NER and SVM classifier on the held-out test set. As of 8th August 2023, the results are as follows:
 
 | Entity     | F1    | Precision | Recall |
 | ---------- | ----- | --------- | ------ |
-| Skill      | 0.586 | 0.679     | 0.515  |
-| Experience | 0.506 | 0.648     | 0.416  |
-| All        | 0.563 | 0.643     | 0.500  |
+| Skill      | 0.612 | 0.712     | 0.537  |
+| Experience | 0.524 | 0.647     | 0.441  |
+| Benefit    | 0.531 | 0.708     | 0.425  |
+| All        | 0.590 | 0.680     | 0.521  |
 
 - These metrics use partial entity matching.
-- More details of the evaluation performance across both the NER model and the SVM model can be found in `outputs/models/ner_model/20220825/train_details.json`
+- More details of the evaluation performance across both the NER model and the SVM model can be found in `outputs/models/ner_model/20230808/train_details.json`
 
 ### Multiskill Metrics
 
-- The same training data and held out test set used for the NER model was used to evaluate the SVM model. On a held out test set, the SVM model achieved 91% accuracy.
+- The same training data and held out test set used for the NER model was used to evaluate the SVM model. On a held out test set, the SVM model achieved 94% accuracy.
 - When evaluating the multiskill splitter algorithm rules, 253 multiskill spans were labelled as ‘good’, ‘ok’ or ‘bad’ splits. Of the 253 multiskill spans, 80 were split. Of the splits, 66% were ‘good’, 9% were ‘ok’ and 25% were ‘bad’.
-- More details of the evaluation performance across both the NER model and the SVM model can be found in `outputs/models/ner_model/20220825/train_details.json`
+- More details of the evaluation performance across both the NER model and the SVM model can be found in `outputs/models/ner_model/20230808/train_details.json`
 
 ### Caveats and Recommendations
 

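The updated NER metrics table can be sanity-checked by recomputing F1 as the harmonic mean of precision and recall. The sketch below assumes the reported F1 values follow that standard definition; the helper `f1_score` and the `reported` dictionary are illustrative names, not part of the repository or of nervaluate. Because the inputs are already rounded to three decimals, the comparison uses a small tolerance.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (reported F1, precision, recall) for the 20230808 model, as listed in the table.
reported = {
    "Skill": (0.612, 0.712, 0.537),
    "Experience": (0.524, 0.647, 0.441),
    "Benefit": (0.531, 0.708, 0.425),
    "All": (0.590, 0.680, 0.521),
}

for entity, (f1, precision, recall) in reported.items():
    recomputed = f1_score(precision, recall)
    # Inputs are rounded, so allow a difference of up to 0.001.
    consistent = abs(recomputed - f1) < 0.001
    print(f"{entity}: reported={f1:.3f} recomputed={recomputed:.4f} consistent={consistent}")
```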
diff --git a/docs/source/pipeline_summary.md b/docs/source/pipeline_summary.md
@@ -23,7 +23,7 @@ For further information or feedback please contact Liz Gallagher, India Kerle or
 
 ## Metrics
 
-There is no exact way to evaluate how well our pipeline works; however we have several proxies to better understand how our approach compares.
+There is no exact way to evaluate how well our pipeline works; however we have several proxies to better understand how our approach compares. The analysis in this section was performed using the results of the `20220825` model. We believe the newer `20230808` model will improve these results, but the analysis hasn't been repeated.
 
 ### Comparison 1 - Top skill groups per occupation comparison to ESCO essential skill groups per occupation
 

diff --git a/ojd_daps_skills/config/extract_skills_esco.yaml b/ojd_daps_skills/config/extract_skills_esco.yaml
@@ -1,4 +1,4 @@
-ner_model_path: "outputs/models/ner_model/20220825/"
+ner_model_path: "outputs/models/ner_model/20230808/"
 taxonomy_name: "esco"
 taxonomy_path: "outputs/data/skill_ner_mapping/esco_data_formatted.csv"
 clean_job_ads: True

diff --git a/ojd_daps_skills/config/extract_skills_lightcast.yaml b/ojd_daps_skills/config/extract_skills_lightcast.yaml
@@ -1,4 +1,4 @@
-ner_model_path: "outputs/models/ner_model/20220825/"
+ner_model_path: "outputs/models/ner_model/20230808/"
 taxonomy_name: "lightcast"
 taxonomy_path: "outputs/data/skill_ner_mapping/lightcast_data_formatted.csv"
 clean_job_ads: True

diff --git a/ojd_daps_skills/config/extract_skills_lightcast_evaluation.yaml b/ojd_daps_skills/config/extract_skills_lightcast_evaluation.yaml
@@ -1,4 +1,4 @@
-ner_model_path: "outputs/models/ner_model/20220825/"
+ner_model_path: "outputs/models/ner_model/20230808/"
 taxonomy_name: "lightcast"
 taxonomy_path: "escoe_extension/outputs/data/skill_ner_mapping/lightcast_data_formatted.csv"
 clean_job_ads: True

diff --git a/ojd_daps_skills/config/extract_skills_template.yaml b/ojd_daps_skills/config/extract_skills_template.yaml
@@ -1,7 +1,7 @@
 #This is a template config file - we have added definitions to parameters that you will need to modify for your own taxonomy
 
 #the relative path to the trained NER model
-ner_model_path: "outputs/models/ner_model/20220825/"
+ner_model_path: "outputs/models/ner_model/20230808/"
 #the relative path to where
 taxonomy_path: "path/to/formatted_taxonomy.csv"
 #the name of your own taxonomy

diff --git a/ojd_daps_skills/config/extract_skills_toy.yaml b/ojd_daps_skills/config/extract_skills_toy.yaml
@@ -1,4 +1,4 @@
-ner_model_path: "outputs/models/ner_model/20220825/"
+ner_model_path: "outputs/models/ner_model/20230808/"
 taxonomy_name: "toy"
 taxonomy_path: ""
 clean_job_ads: True

diff --git a/ojd_daps_skills/getters/download_public_data.py b/ojd_daps_skills/getters/download_public_data.py
@@ -1,4 +1,4 @@
-from ojd_daps_skills import PUBLIC_DATA_FOLDER_NAME, PROJECT_DIR
+from ojd_daps_skills import PUBLIC_DATA_FOLDER_NAME, PROJECT_DIR, logger
 
 import os
 import boto3
@@ -7,6 +7,7 @@
 from botocore.config import Config
 from zipfile import ZipFile
 
+
 def download():
     """Download public data. Expected to run once on first use."""
     s3 = boto3.client(
@@ -25,11 +26,12 @@ def download():
             zip_ref.extractall(PROJECT_DIR)
 
         os.remove(f"{public_data_dir}.zip")
+        logger.info(f"Data folder downloaded from {public_data_dir}")
 
     except ClientError as ce:
-        print(f"Error: {ce}")
+        logger.warning(f"Error: {ce}")
     except FileNotFoundError as fnfe:
-        print(f"Error: {fnfe}")
+        logger.warning(f"Error: {fnfe}")
 
 
 if __name__ == "__main__":

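The change in `download_public_data.py` replaces `print` with logger calls in the exception handlers. A minimal sketch of that pattern is below. It assumes the package-level `logger` behaves like a standard `logging.Logger`; the logger name `download_sketch` and the helper `run_download` are hypothetical and stand in for the real `download()` function.

```python
import logging

# Hypothetical stand-in for the package-level logger imported in the diff.
logger = logging.getLogger("download_sketch")


def run_download(fetch) -> bool:
    """Call `fetch` and log the outcome instead of printing it.

    Returns True on success and False if the expected file is missing.
    """
    try:
        fetch()
    except FileNotFoundError as error:
        # Lazy %-formatting lets the logging module build the message only if needed.
        logger.warning("Download failed: %s", error)
        return False
    logger.info("Data folder downloaded")
    return True
```

Returning a boolean is a design choice of this sketch only; it gives callers a way to react to a failed download, which a log message alone does not.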
diff --git a/ojd_daps_skills/pipeline/extract_skills/extract_skills.py b/ojd_daps_skills/pipeline/extract_skills/extract_skills.py
@@ -64,9 +64,12 @@ def __init__(
                     "Neccessary files are not downloaded. Downloading ~1GB of neccessary files."
                 )
                 download()
+            else:
+                logger.info("Model files found locally")
         else:
             self.base_path = "escoe_extension/"
             self.s3 = True
+            logger.info("Will be downloading data and models directly from S3")
             pass
 
         self.taxonomy_name = self.config["taxonomy_name"]
@@ -146,7 +149,7 @@ def load(
 
         self.nlp = self.job_ner.load_model(self.ner_model_path, s3_download=self.s3)
 
-        self.labels = self.nlp.get_pipe("ner").labels + ("MULTISKILL",)
+        self.labels = ("BENEFIT", "SKILL", "MULTISKILL", "EXPERIENCE")
 
         logger.info(f"Loading '{self.taxonomy_name}' taxonomy information")
         if self.taxonomy_name == "toy":

diff --git a/ojd_daps_skills/pipeline/skill_ner/README.md b/ojd_daps_skills/pipeline/skill_ner/README.md
@@ -1,6 +1,6 @@
 # Skill NER
 
-## Label data
+## Label data using label-studio
 
 ### Creating a sample of the OJO data
 
@@ -79,9 +79,13 @@ For the labelling done at the end of June 2022, we labelled the chunk of 400 job
 
 The outputs of this labelled are stored in `s3://open-jobs-lake/escoe_extension/outputs/skill_span_labels/`.
 
-### Merging labelled files
+## Label data using Prodigy
 
-Since multiple people labelled files from different locations, we merge the labelled data using the following command:
+We labelled another batch of job adverts using [Prodigy](https://prodi.gy/). This was to avail of their active learning capabilities. Details of how we labelled job adverts this way are given in [the Prodigy labelling README](./ojd_daps_skills/ojd_daps_skills/pipeline/skill_ner/prodigy/README.md).
+
+## Merging labelled files
+
+Since multiple people labelled files from different locations, and we labelled in both label-studio and Prodigy, we merge the labelled data using the following command:
 
 ```
 python ojd_daps_skills/pipeline/skill_ner/combine_labels.py

diff --git a/ojd_daps_skills/pipeline/skill_ner/get_skills.py b/ojd_daps_skills/pipeline/skill_ner/get_skills.py
@@ -4,7 +4,7 @@
 Running
 
 python ojd_daps_skills/pipeline/skill_ner/get_skills.py
-    --model_path outputs/models/ner_model/20220825/
+    --model_path outputs/models/ner_model/20230808/
     --output_file_dir escoe_extension/outputs/data/skill_ner/skill_predictions/
     --job_adverts_filename escoe_extension/inputs/data/skill_ner/data_sample/20220622_sampled_job_ads.json
 
@@ -40,7 +40,7 @@ def parse_arguments(parser):
     parser.add_argument(
         "--model_path",
         help="The path to the model you want to make predictions with",
-        default="outputs/models/ner_model/20220825/",
+        default="outputs/models/ner_model/20230808/",
     )
 
     parser.add_argument(

diff --git a/ojd_daps_skills/pipeline/skill_ner/ner_spacy.py b/ojd_daps_skills/pipeline/skill_ner/ner_spacy.py
@@ -512,11 +512,12 @@ def load_model(self, model_folder, s3_download=True):
             self.ms_classifier = pickle.load(
                 open(os.path.join(model_folder, "ms_classifier.pkl"), "rb")
             )
+            return self.nlp
         except OSError:
-            logger.info(
+            logger.warning(
                 "Model not found locally - you may need to download it from S3 (set s3_download to True)"
             )
-        return self.nlp
+            return None
 
 
 def parse_arguments(parser):

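With this change `load_model` returns `None` when the model files are not found, so a caller that goes on to use the result may want to fail early with a clear message. The sketch below shows one way to do that. The helper `load_or_raise` and its `loader` argument are hypothetical; in the repository the loader would be something like a call to `load_model` with the model folder already bound.

```python
from typing import Callable, Optional


def load_or_raise(loader: Callable[[], Optional[object]], model_folder: str) -> object:
    """Run `loader` and raise a descriptive error if it returns None."""
    nlp = loader()
    if nlp is None:
        raise FileNotFoundError(
            f"No NER model found in {model_folder!r}; "
            "download it first or enable the S3 download option."
        )
    return nlp
```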
diff --git a/ojd_daps_skills/tests/test_extract_skills.py b/ojd_daps_skills/tests/test_extract_skills.py
@@ -5,8 +5,6 @@
 from ojd_daps_skills.utils.text_cleaning import short_hash
 from ojd_daps_skills.pipeline.extract_skills.extract_skills import ExtractSkills
 
-es = ExtractSkills(local=True)
-
 job_adverts = [
     "The job involves communication and maths skills",
     "The job involves excel and presenting skills. You need good excel skills",
@@ -15,10 +13,16 @@
 
 def test_load():
 
+    es = ExtractSkills(local=True)
     es.load()
 
     assert isinstance(es.nlp, spacy.lang.en.English)
-    assert es.labels == ("EXPERIENCE", "SKILL", "MULTISKILL")
+    assert all(
+        [
+            label in es.labels
+            for label in ["EXPERIENCE", "SKILL", "MULTISKILL", "BENEFIT"]
+        ]
+    )
     assert es.skill_mapper
     assert (
         len(
@@ -31,6 +35,9 @@ def test_load():
 
 def test_get_skills():
 
+    es = ExtractSkills(local=True)
+    es.load()
+
     predicted_skills = es.get_skills(job_adverts)
 
     # The keys are the labels for every job prediction
@@ -46,6 +53,9 @@ def test_get_skills():
 
 def test_map_skills():
 
+    es = ExtractSkills(local=True)
+    es.load()
+
     predicted_skills = es.get_skills(job_adverts)
     matched_skills = es.map_skills(predicted_skills)
 
@@ -56,13 +66,17 @@ def test_map_skills():
             *[[skill[1][0] for skill in skills["SKILL"]] for skills in matched_skills]
         )
     )
-    assert (
-        set(test_skills).difference(set(es.taxonomy_info["hier_name_mapper"].values()))
-        == set()
+    tax_skills_and_hier_names = set(
+        es.taxonomy_skills["description"].tolist()
+        + list(es.taxonomy_info["hier_name_mapper"].values())
     )
+    assert set(test_skills).difference(tax_skills_and_hier_names) == set()
 
 
 def test_map_no_skills():
+    es = ExtractSkills(local=True)
+    es.load()
+
     job_adverts = ["nothing", "we want excel skills", "we want communication skills"]
     extract_matched_skills = es.extract_skills(job_adverts)
     assert len(job_adverts) == len(extract_matched_skills)
@@ -72,6 +86,8 @@ def test_hardcoded_mapping():
     """
     The mapped results using the algorithm should be the same as the hardcoded results
     """
+    es = ExtractSkills(local=True)
+    es.load()
 
     hard_coded_skills = {
         "3267542715426065": {

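The updated `test_load` checks that each expected label is present rather than requiring an exact tuple, which makes the test independent of label order. The same check can be written as a set-subset test, sketched below. `EXPECTED_LABELS` and `has_expected_labels` are illustrative names, not part of the test suite.

```python
# The four entity labels the pipeline is expected to output.
EXPECTED_LABELS = {"EXPERIENCE", "SKILL", "MULTISKILL", "BENEFIT"}


def has_expected_labels(labels) -> bool:
    """Return True if every expected label appears in `labels`, in any order."""
    return EXPECTED_LABELS.issubset(labels)


# Order does not matter, and extra labels would not cause a failure.
print(has_expected_labels(("BENEFIT", "SKILL", "MULTISKILL", "EXPERIENCE")))
print(has_expected_labels(("EXPERIENCE", "SKILL", "MULTISKILL")))
```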
diff --git a/ojd_daps_skills/utils/bert_vectorizer.py b/ojd_daps_skills/utils/bert_vectorizer.py
@@ -2,6 +2,7 @@
 import time
 from ojd_daps_skills import logger
 import logging
+import torch
 
 
 class BertVectorizer:
@@ -13,7 +14,7 @@ class BertVectorizer:
     def __init__(
         self,
         bert_model_name="sentence-transformers/all-MiniLM-L6-v2",
-        multi_process=True,
+        multi_process=False,
         batch_size=32,
         verbose=True,
     ):
@@ -27,7 +28,8 @@ def __init__(
             logger.setLevel(logging.ERROR)
 
     def fit(self, *_):
-        self.bert_model = SentenceTransformer(self.bert_model_name)
+        device = torch.device(f"cuda:0" if torch.cuda.is_available() else "cpu")
+        self.bert_model = SentenceTransformer(self.bert_model_name, device=device)
         self.bert_model.max_seq_length = 512
         return self
 

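The device selection added to `BertVectorizer.fit` can be isolated into a small helper, sketched below. `pick_device` is a hypothetical name, and the fallback to `"cpu"` when `torch` is not installed is an addition of this sketch, not something the diff does. The returned string can be passed as the `device` argument of `SentenceTransformer`.

```python
def pick_device() -> str:
    """Return "cuda:0" if a CUDA GPU is visible to PyTorch, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Assumption of this sketch: without PyTorch, fall back to the CPU.
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"


# Example use (not executed here):
# SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device=pick_device())
print(pick_device())
```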
diff --git a/outputs/reports/skills_extraction.md b/outputs/reports/skills_extraction.md
@@ -16,11 +16,11 @@ This process means we can extract skills from thousands of job adverts and analy
 
 ## Labelling data
 
-To train the NER model we needed labelled data. First we created a random sample of job adverts and got them into a form needed for labelling using [Label Studio](https://labelstud.io/). More about this labelling process can be found in the `skill_ner` pipeline [README.md](./ojd_daps_skills/ojd_daps_skills/pipeline/skill_ner/README.md).
+To train the NER model we needed labelled data. First we created a random sample of job adverts and got them into a form needed for labelling using [Label Studio](https://labelstud.io/); we then labelled a second batch using [Prodigy](https://prodi.gy/). More about this labelling process can be found in the `skill_ner` pipeline [README.md](./ojd_daps_skills/ojd_daps_skills/pipeline/skill_ner/README.md).
 
 ![](figures/label_studio.png)
 
-As of 11th July 2022 we have labelled 3400 entities; 404 (12%) are multiskill, 2603 (77%) are skill, and 393 (12%) are experience entities.
+As of 8th August 2023 we have labelled 8971 entities; 443 (5%) are multiskill, 7313 (82%) are skill, 852 (10%) are experience entities and 363 (4%) are benefit entities.
 
 ### Multiskill labels
 
@@ -60,7 +60,7 @@ A summary of the experiments with training the model is below.
 
 | Date (model name) | Base model     | Training size   | Evaluation size | Number of iterations | Drop out rate | Learning rate | Convert multiskill? | Other info                                                                                       | Skill F1 | Experience F1 | All F1 | Multiskill test score |
 | ----------------- | -------------- | --------------- | --------------- | -------------------- | ------------- | ------------- | ------------------- | ------------------------------------------------------------------------------------------------ | -------- | ------------- | ------ | --------------------- |
-| 20230808          | en_core_web_lg | 400 (7149 ents) | 100 (1805 ents) | 100                  | 0.1           | 0.001         | True                | More data, different base model, BENEFIT label data                                              | 0.61     | 0.52          | 0.59   | 0.94                  |
+| 20230808\*\*      | en_core_web_lg | 400 (7149 ents) | 100 (1805 ents) | 100                  | 0.1           | 0.001         | True                | More data, different base model, BENEFIT label data                                              | 0.61     | 0.52          | 0.59   | 0.94                  |
 | 20220825          | blank en       | 300 (4508 ents) | 75 (1133 ents)  | 100                  | 0.1           | 0.001         | True                | Changed hyperparams, more data                                                                   | 0.59     | 0.51          | 0.56   | 0.91                  |
 | 20220729\*        | blank en       | 196 (2850 ents) | 49 (636 ents)   | 50                   | 0.3           | 0.001         | True                | More data, padding in cleaning but do fix_entity_annotations after fix_all_formatting to sort it | 0.57     | 0.44          | 0.54   | 0.87                  |
 | 20220729_nopad    | blank en       | 196             | 49              | 50                   | 0.3           | 0.001         | True                | No padding in cleaning, more data                                                                | 0.52     | 0.33          | 0.45   | 0.87                  |
@@ -124,6 +124,8 @@ More in-depth metrics for `20220714`:
 
 \* For model `20220714` we relabelled the MULTISKILL labels in the dataset - we were trying to see whether some of them should actually be single skills, or could be separated into single skills rather than (as we found) labelling a large span as a multiskill. This process increased our number of labelled skill entities (from 2603 to 2887) and decreased the number of multiskill entities (from 404 to 218), resulting in a net increase in entities labelled (from 3400 to 3498).
 
+\*\* For model `20230808` we included BENEFIT labels in some of the labelled data.
+
 ### Parameter tuning
 
 For model `20220825` onwards we changed our hyperparameters after some additional experimentation revealed improvements could be made. This experimentation was on a dataset of 375 job adverts in total.