BERT-based medical text de-identification

Requirements:

This project uses the Transformers library for Python. The pretrained BERT model is SciBERT, which was trained on scientific corpora. The model is downloaded automatically the first time you run the code.
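As a rough illustration of the setup described above, the sketch below loads a SciBERT checkpoint for token classification with Transformers. The checkpoint name `allenai/scibert_scivocab_uncased`, the helper name `load_scibert`, and the label order are assumptions made for this example; the notebooks may use a different variant or configuration. The label names are taken from the Results table.

```python
# Label set taken from the Results table; the order here is an arbitrary choice.
LABELS = ["OTHER", "CONTACT", "DATE", "LOCATION", "NAME", "UNIQUE ID"]

# Assumed checkpoint name; the notebooks may use a different SciBERT variant.
MODEL_NAME = "allenai/scibert_scivocab_uncased"


def load_scibert(model_name=MODEL_NAME, labels=LABELS):
    """Return (tokenizer, model) set up for token classification over `labels`."""
    # Imported lazily so this module can be read without Transformers installed.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}

    # Both calls download the files on first use and reuse the local cache afterwards.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForTokenClassification.from_pretrained(
        model_name,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

Calling `tokenizer, model = load_scibert()` triggers the one-time download mentioned above; the classification head on top of SciBERT is randomly initialised and still has to be trained.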

How to reproduce the results / train your own BERT-based de-identification model for a different dataset

Download the MIMIC-III dataset and extract the "NOTEEVENTS.csv" file.

Run the Pre-processing notebook to preprocess the data, then run the De-identification NB notebook to train a SciBERT model for de-identification.

Note:

We used only a 500-million-character subset of the full dataset. To run on the full dataset, change the slicing indices in the Pre-processing notebook. Since the MIMIC-III dataset provides no PHI annotations, we re-identified the data by filling in realistic-looking fake data.
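The re-identification step can be pictured roughly as follows. This is a minimal sketch, not the code from the Pre-processing notebook: it assumes that PHI placeholders look like `[**First Name 12**]`, uses small in-memory lists in place of the repository's surrogate files (such as `last_names.txt` or `addresses.txt`), and relies on a hypothetical keyword rule (`classify`) to decide which class a placeholder belongs to. Whitespace tokenisation is also a simplification.

```python
import random
import re

# Matches placeholders of the assumed form [**...**].
PLACEHOLDER = re.compile(r"\[\*\*(.*?)\*\*\]")

# Tiny in-memory stand-ins for the surrogate lists (illustrative values only).
SURROGATES = {
    "NAME": ["Alice", "Robert", "Nguyen"],
    "LOCATION": ["Mercy General Hospital", "12 Oak Street"],
}


def classify(inner):
    """Map placeholder text to a PHI class (hypothetical keyword rules)."""
    text = inner.lower()
    if "name" in text:
        return "NAME"
    if "hospital" in text or "location" in text or "address" in text:
        return "LOCATION"
    return None  # placeholder types not handled in this sketch


def reidentify(note, rng):
    """Return (tokens, labels) with handled placeholders replaced by fake values."""
    tokens, labels = [], []
    pos = 0
    for match in PLACEHOLDER.finditer(note):
        # Text before the placeholder is non-PHI.
        for tok in note[pos:match.start()].split():
            tokens.append(tok)
            labels.append("OTHER")
        phi_class = classify(match.group(1))
        if phi_class is None:
            # Leave unhandled placeholders in place and treat them as non-PHI.
            replacement, phi_class = match.group(0), "OTHER"
        else:
            replacement = rng.choice(SURROGATES[phi_class])
        # Every token of the surrogate value receives the PHI label.
        for tok in replacement.split():
            tokens.append(tok)
            labels.append(phi_class)
        pos = match.end()
    # Remaining text after the last placeholder.
    for tok in note[pos:].split():
        tokens.append(tok)
        labels.append("OTHER")
    return tokens, labels
```

For example, `reidentify("Pt [**First Name 12**] seen at [**Hospital1 18**] today.", random.Random(0))` yields a token list in which the placeholders are replaced by surrogate values, together with a parallel label list that can serve as training annotation.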

Train test split:

We train on 75% of the data and test on the remaining 25%.
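A 75/25 split of this kind can be sketched as below. The function name `split_examples`, the shuffling, and the fixed seed are assumptions for illustration; the notebook may split at a different granularity (notes, sentences, or token windows) or use a library helper instead.

```python
import random


def split_examples(examples, test_fraction=0.25, seed=0):
    """Shuffle the examples and return (train, test) lists."""
    items = list(examples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(items)
    n_test = int(round(len(items) * test_fraction))
    return items[n_test:], items[:n_test]
```

With 100 examples, `split_examples(range(100))` returns 75 training items and 25 test items, and every example ends up in exactly one of the two lists.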

Results:

Class	Positive Predictive Value (Precision)	Sensitivity (Recall)	F1-score
CONTACT	0.98336516	0.98241133	0.98288801
DATE	0.99244865	0.99275729	0.99260295
LOCATION	0.99569739	0.99507795	0.99538757
NAME	0.99179696	0.99013438	0.99096497
OTHER (non-PHI)	0.99934905	0.99934959	0.99934932
UNIQUE ID	0.96716848	0.97340426	0.97027635
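For reference, the three columns are related as sketched below: precision is TP / (TP + FP), recall is TP / (TP + FN), and the F1-score is their harmonic mean. The code assumes token-level evaluation over parallel lists of true and predicted labels, which is an assumption about how the numbers above were computed; the function names are illustrative.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def per_class_scores(true_labels, pred_labels):
    """Return {class: (precision, recall, f1)} from two parallel label lists."""
    pairs = list(zip(true_labels, pred_labels))
    scores = {}
    for cls in sorted(set(true_labels) | set(pred_labels)):
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[cls] = (precision, recall, f1_from(precision, recall))
    return scores
```

As a consistency check, `f1_from(0.98336516, 0.98241133)` reproduces the CONTACT F1-score in the table to within rounding.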

Credits:

The tutorial by Tobias Sterbak was very helpful for training the BERT model.
