This document briefly describes a pipeline to extract diagnoses from a medical report along with a (rough) estimate of their reliability (confirmed, suspected, excluded).
The current implementation is mainly based on keyword matching and is hence very basic. An initial evaluation shows that the diagnoses are extracted well; however, their reliability is not detected very reliably, as the language around suspected or excluded cases can be involved. Because of the way the pipeline is currently implemented, there is a bias towards 'confirmed' diagnoses, i.e. the pipeline may label a diagnosis as confirmed whereas the actual diagnosis was merely a suspicion or even an exclusion.
The pipeline is based on the same framework as the deidentification pipeline and hence shares many commonalities.
The input is the same as for the annotation tool, i.e. reports from KISIM, either read directly from a database or already imported as GATE documents.
The pipeline generates for every diagnosis the following output:
- document ID/report ID
- annotation text found related to diagnosis (mainly for debugging)
- code (ICD-10)
- reliability of diagnosis: confirmed, suspected, excluded
Note that several diagnoses may be extracted per report. Furthermore, the reliabilities may conflict, that is, the same diagnosis may appear twice in a report with different reliabilities. Depending on the use case, the downstream system needs to resolve this "conflict".
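How such a conflict is resolved is left to the downstream system. As a minimal sketch, assuming the output rows are available as `(report ID, code, reliability)` tuples, one possible policy is to let the most cautious label win, which also counteracts the bias towards 'confirmed'. The `PRIORITY` ordering and the function name `resolve_conflicts` are hypothetical and not part of the pipeline:

```python
from collections import defaultdict

# Hypothetical policy: when the same code occurs with different reliabilities
# in one report, the most cautious label wins. Adjust to the use case.
PRIORITY = {"excluded": 0, "suspected": 1, "confirmed": 2}


def resolve_conflicts(rows):
    """Keep one reliability per (report ID, ICD-10 code) pair.

    rows: iterable of (report_id, code, reliability) tuples.
    """
    by_key = defaultdict(list)
    for report_id, code, reliability in rows:
        by_key[(report_id, code)].append(reliability)
    # Pick the label with the lowest priority value for every pair.
    return {
        key: min(labels, key=PRIORITY.__getitem__)
        for key, labels in by_key.items()
    }


rows = [
    ("r1", "G35", "confirmed"),
    ("r1", "G35", "suspected"),
    ("r2", "G35", "confirmed"),
]
resolved = resolve_conflicts(rows)
```

Other policies (for example keeping all labels and flagging the report for manual review) may fit better, depending on how costly a wrongly confirmed diagnosis is.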
These fields can be written back to a database table and/or written to a text file for further analysis in pandas/Excel.
The configuration for reading from and writing to the database is the same as for the deidentification pipeline. There are just a few more configuration keys related to the naming of the fields generated by the diagnosis extraction pipeline:
annotationtext_field_name=annotationtext
reliability_field_name=reliability
code_field_name=code
These properties can be changed to match the destination table schema. The same field names should then be used in the `dest_columns` property, for example:
dest_columns=reportnr,fcode,dat,fallnr,annotationtext,code,reliability
Many aspects of the pipeline can be tweaked by editing text files.
For every ICD-10 code to be extracted, a few keywords/names for the condition need to be provided.
The keyword configuration file is a text file where each row corresponds to one ICD-10 code. The fields are separated by `;` and are as follows:
- ICD-10 code, for example `G35`
- keywords separated by `,`, for example `Multiple Sklerosis,Encephalitis disseminata`
- blacklist paths: paths in the document structure which should not be searched for the keyword, e.g. `Header,Fragestellung` (see also "Paths in Field Tree" in components.md)
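To illustrate the row format described above, here is a minimal parsing sketch. It is not the pipeline's own parser; the `KeywordEntry` class and `parse_keyword_line` function are hypothetical names, and treating a missing third field as an empty blacklist is an assumption:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KeywordEntry:
    code: str
    keywords: List[str]
    blacklist_paths: List[str]


def parse_keyword_line(line: str) -> KeywordEntry:
    """Split one row into ICD-10 code, keywords and blacklist paths."""
    fields = [f.strip() for f in line.strip().split(";")]
    # Assumption: a missing blacklist field means "no blacklisted paths".
    fields += [""] * (3 - len(fields))
    code, keywords, blacklist = fields[:3]
    return KeywordEntry(
        code=code,
        keywords=[k.strip() for k in keywords.split(",") if k.strip()],
        blacklist_paths=[p.strip() for p in blacklist.split(",") if p.strip()],
    )


entry = parse_keyword_line(
    "G35;Multiple Sklerosis,Encephalitis disseminata;Header,Fragestellung"
)
```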
In order to assess the reliability of a diagnosis, the 'reliability context' of the diagnosis is determined. This is done in a very crude way by looking for keywords in the neighborhood of a diagnosis keyword. For instance, `ausgeschlossen` (German for "excluded") shortly after a diagnosis term like `Multiple Sklerose` could mean that the diagnosis is excluded.
These keywords can be configured in the reliability context configuration file.
This is a text file with one keyword per line, where the fields are separated by `;`. The fields are as follows:
- context keyword, e.g. `ausgeschlossen`
- context name: `ExclusionContext` or `SuspectionContext`
- left extent: number of tokens to the left of the context keyword for which the context is valid
- right extent: number of tokens to the right of the context keyword for which the context is valid
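The window logic described above can be sketched as follows. This is an illustration under several assumptions, not the pipeline's implementation (which relies on GATE annotations): the example row `ausgeschlossen;ExclusionContext;5;0` and its extent values are made up, tokens are obtained by splitting on whitespace, context keywords are single tokens, and `ContextRule`, `parse_context_line` and `context_at` are hypothetical names:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ContextRule:
    keyword: str
    context_name: str
    left_extent: int
    right_extent: int


def parse_context_line(line: str) -> ContextRule:
    """Parse one row: keyword;context name;left extent;right extent."""
    keyword, name, left, right = [f.strip() for f in line.strip().split(";")]
    return ContextRule(keyword, name, int(left), int(right))


def context_at(tokens: List[str], position: int,
               rules: List[ContextRule]) -> Optional[str]:
    """Return the name of the first context covering tokens[position], if any."""
    for rule in rules:
        for i, token in enumerate(tokens):
            if token.lower() != rule.keyword.lower():
                continue
            # The context spans left_extent tokens before and
            # right_extent tokens after the context keyword.
            if i - rule.left_extent <= position <= i + rule.right_extent:
                return rule.context_name
    return None


# Hypothetical rule: the exclusion context reaches 5 tokens to the left.
rules = [parse_context_line("ausgeschlossen;ExclusionContext;5;0")]
tokens = "Multiple Sklerose konnte ausgeschlossen werden".split()
```

Here `context_at(tokens, 1, rules)` places the token `Sklerose` inside the exclusion context, whereas a token after `ausgeschlossen` falls outside because the right extent is 0.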
The entry point of the pipeline is the `org.ratschlab.structuring.DiagnosisExtractionCmd` class. It has the following usage (note that not all options are currently implemented):
Usage: <main class> [--json-input] [--xml-input]
[--doc-id-filter=<docIdFilterPath>]
[--doc-type-filter=<docTypeFilterPath>]
[--max-docs=<maxDocs>]
[--output-corpus-dir=<outputCorpusDir>]
[--skip-docs=<skipDocs>] -c=<pipelineConfigFile>
[-d=<databaseConfigPath>] [-i=<corpusInputDir>]
[-o=<outputFile>] [-t=<threads>]
--doc-id-filter=<docIdFilterPath>
Path to file id list to consider
--doc-type-filter=<docTypeFilterPath>
Path to file type list to consider
--json-input Assumes input dir consists of json files, one per
report (testing purposes)
--max-docs=<maxDocs> Maximum number of docs to process
--output-corpus-dir=<outputCorpusDir>
Output GATE Corpus Dir
--skip-docs=<skipDocs> Skipping number of docs (useful to just work on a slice
of the corpus)
--xml-input Assumes input dir consists of xml files, one per report
(testing purposes)
-c=<pipelineConfigFile> Config file
-d=<databaseConfigPath> DB config path
-i=<corpusInputDir> Input corpus dir
-o=<outputFile> Output Txt File
-t=<threads> Number of threads
The pipeline executes the following steps for every document:
- tokenization of the imported report
- diagnosis keywords are annotated
- reliability contexts are annotated
- JAPE rules are run to
  - determine the reliability of a diagnosis by either checking whether it is within some reliability context or whether it matches a certain language pattern (e.g. `Verdacht auf ...`)
  - remove some false positives like `some_diagnosis Sprechstunde` or `some_diagnosis Abklaerung`
- consolidation of diagnosis annotations:
  * adding `confirmed` reliability as default, if no other reliability could be determined
  * removing duplicates
- writing to file and/or database
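The consolidation step can be pictured with the following sketch. It is only an approximation of what the pipeline does on GATE annotations: annotations are modelled as plain dictionaries, `consolidate` is a hypothetical name, and treating two annotations with the same code and reliability as duplicates is an assumption:

```python
def consolidate(annotations):
    """Apply the default reliability and drop duplicate annotations.

    annotations: list of dicts with 'text', 'code' and an optional
    'reliability' entry, all belonging to one document.
    """
    seen = set()
    result = []
    for ann in annotations:
        # Default to 'confirmed' if no other reliability was determined.
        reliability = ann.get("reliability") or "confirmed"
        # Assumption: same code and reliability counts as a duplicate.
        key = (ann["code"], reliability)
        if key in seen:
            continue
        seen.add(key)
        result.append({**ann, "reliability": reliability})
    return result


annotations = [
    {"text": "Multiple Sklerose", "code": "G35"},
    {"text": "Encephalitis disseminata", "code": "G35", "reliability": None},
    {"text": "Multiple Sklerose", "code": "G35", "reliability": "suspected"},
]
consolidated = consolidate(annotations)
```

Because `confirmed` is the fallback, every diagnosis whose context was not recognised ends up as confirmed, which is the source of the bias mentioned in the introduction.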