Diagnosis Extraction Pipeline

This document briefly describes a pipeline that extracts diagnoses from a medical report, along with a rough estimate of the reliability of each diagnosis (confirmed, suspected, or excluded).

Approach

The current implementation is mainly based on keyword matching and is hence very basic. An initial evaluation shows that diagnoses are extracted well; however, their reliability is not detected very reliably, as the language around suspected or excluded cases can be involved. Because of the way the pipeline is currently implemented, there is a bias towards 'confirmed' diagnoses, i.e. the pipeline may label a diagnosis as confirmed even though the report states only a suspicion or even an exclusion.

Pipeline Overview

The pipeline is built on the same framework as the deidentification pipeline and hence has much in common with it.

The input is the same as for the annotation tool, i.e. reports from KISIM, either read directly from a database or already imported as GATE documents.

For every diagnosis, the pipeline generates the following output:

  • document ID/report ID
  • annotation text found related to diagnosis (mainly for debugging)
  • code (ICD-10)
  • reliability of diagnosis: confirmed, suspected, excluded

Note that several diagnoses may be extracted per report. Furthermore, the reliabilities may conflict, that is, the same diagnosis may appear twice in a report with different reliabilities. Depending on the use case, the downstream system needs to resolve this "conflict".

These fields can be written back to a database table and/or written to a text file for further analysis in pandas or Excel.
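
As an illustration of how a downstream system might resolve conflicting reliabilities, here is a minimal Python sketch. The resolution policy (excluded wins over suspected, which wins over confirmed) is an assumption made for this example, chosen because the pipeline falls back to confirmed by default; it is not part of the pipeline. The tuple layout of the rows is likewise hypothetical.

```python
from collections import defaultdict

# Hypothetical policy: a lower number wins. "confirmed" ranks last because
# the pipeline assigns it by default when no other evidence is found.
PRIORITY = {"excluded": 0, "suspected": 1, "confirmed": 2}


def resolve_conflicts(rows):
    """Collapse rows to one reliability per (report ID, ICD-10 code)."""
    grouped = defaultdict(set)
    for report_id, code, reliability in rows:
        grouped[(report_id, code)].add(reliability)
    return {
        key: min(labels, key=PRIORITY.__getitem__)
        for key, labels in grouped.items()
    }


# Made-up sample rows: (report ID, ICD-10 code, reliability)
sample_rows = [
    ("r1", "G35", "confirmed"),
    ("r1", "G35", "suspected"),
    ("r2", "G35", "confirmed"),
]
resolved = resolve_conflicts(sample_rows)
```

A different use case may call for a different policy, for example keeping all labels and flagging the report for manual review.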

Configuration

Database Configuration

The configuration for reading from and writing to the database is the same as for the deidentification pipeline. There are just a few additional configuration keys, which name the fields generated by the diagnosis extraction pipeline:

annotationtext_field_name=annotationtext
reliability_field_name=reliability
code_field_name=code

These properties can be changed to match the destination table schema. The configured names should then also be used in the dest_columns property.

For example:

dest_columns=reportnr,fcode,dat,fallnr,annotationtext,code,reliability
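
A quick consistency check between the field-name keys and dest_columns could look like the following Python sketch. It assumes a plain key=value file; the real configuration is read by the pipeline itself, and the helper names here (parse_properties, missing_dest_columns) are made up for illustration.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def missing_dest_columns(props):
    """Return configured output field names that are absent from dest_columns."""
    dest = [c.strip() for c in props.get("dest_columns", "").split(",")]
    keys = (
        "annotationtext_field_name",
        "reliability_field_name",
        "code_field_name",
    )
    return [props[k] for k in keys if k in props and props[k] not in dest]


# Deliberately inconsistent example: code_field_name was renamed to icd10,
# but dest_columns still lists the old name "code".
config = """
annotationtext_field_name=annotationtext
reliability_field_name=reliability
code_field_name=icd10
dest_columns=reportnr,fcode,dat,fallnr,annotationtext,code,reliability
"""
missing = missing_dest_columns(parse_properties(config))
```

Here missing contains icd10, signalling that dest_columns needs to be updated as well.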

Pipeline Configuration

Many aspects of the pipeline can be tweaked by editing text files.

Keywords Configuration

For every ICD-10 code to be extracted, a few keywords or names for the condition need to be provided. The keyword configuration file is a text file in which each row corresponds to one ICD-10 code. The fields of a row are separated by ; and are as follows:

  • ICD-10 code, for example G35
  • keywords separated by commas (,), for example Multiple Sklerosis,Encephalitis disseminata
  • blacklist paths: paths in the document structure that should not be searched for the keywords, e.g. Header,Fragestellung (see also Paths in Field Tree in components.md)
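
To make the row format concrete, the sketch below parses one row in Python. The sample row is assembled from the examples in the list above and is not taken from an actual configuration file; the KeywordEntry class and the tolerance for a missing blacklist field are assumptions of this example.

```python
from dataclasses import dataclass


@dataclass
class KeywordEntry:
    code: str
    keywords: list
    blacklist_paths: list


def parse_keyword_row(row):
    """Split one 'code;keywords;blacklist paths' row into its parts."""
    fields = [f.strip() for f in row.split(";")]
    # Assumption: the blacklist field may be missing, so pad to three fields.
    fields += [""] * (3 - len(fields))
    code, keywords, blacklist = fields[:3]

    def split_list(value):
        return [v.strip() for v in value.split(",") if v.strip()]

    return KeywordEntry(code, split_list(keywords), split_list(blacklist))


entry = parse_keyword_row(
    "G35;Multiple Sklerosis,Encephalitis disseminata;Header,Fragestellung"
)
```
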

Reliability Context Configuration

In order to assess the reliability of a diagnosis, its 'reliability context' is determined. This is done in a very crude way by looking for keywords in the neighborhood of a diagnosis keyword. For instance, ausgeschlossen ("excluded") shortly after a diagnosis term like Multiple Sklerose could mean that the diagnosis is excluded.

These keywords can be configured in the reliability context configuration file. This is a text file with one keyword per line, where the fields are separated by ;. The fields are as follows:

  • context keyword, e.g. ausgeschlossen
  • context name: ExclusionContext or SuspectionContext
  • left extent: number of tokens to the left of the context keyword for which the context is valid
  • right extent: number of tokens to the right of the context keyword for which the context is valid
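
The window logic described by these fields can be sketched in Python as follows. This is only an approximation of the idea, not the pipeline's actual implementation (which works on GATE annotations); the extents 5 and 2 in the sample row are made-up values, and matching is done on whitespace-separated tokens for simplicity.

```python
def parse_context_row(row):
    """Parse 'keyword;context name;left extent;right extent'."""
    keyword, name, left, right = [f.strip() for f in row.split(";")]
    return keyword, name, int(left), int(right)


def contexts_at(tokens, index, context_rows):
    """Return the names of contexts whose token window covers tokens[index]."""
    found = []
    for keyword, name, left, right in context_rows:
        for pos, token in enumerate(tokens):
            in_window = pos - left <= index <= pos + right
            if token.lower() == keyword.lower() and in_window:
                found.append(name)
    return found


# Hypothetical configuration row: 5 tokens to the left, 2 to the right.
context_rows = [parse_context_row("ausgeschlossen;ExclusionContext;5;2")]
tokens = "Eine Multiple Sklerose wurde ausgeschlossen".split()
hits = contexts_at(tokens, tokens.index("Sklerose"), context_rows)
```

Because ausgeschlossen typically follows the diagnosis term, the left extent is the one that matters in this example.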

Invocation

The entry point of the pipeline is the org.ratschlab.structuring.DiagnosisExtractionCmd class. It has the following usage (note that not all options are currently implemented).

 Usage: <main class> [--json-input] [--xml-input]
                     [--doc-id-filter=<docIdFilterPath>]
                     [--doc-type-filter=<docTypeFilterPath>]
                     [--max-docs=<maxDocs>]
                     [--output-corpus-dir=<outputCorpusDir>]
                     [--skip-docs=<skipDocs>] -c=<pipelineConfigFile>
                     [-d=<databaseConfigPath>] [-i=<corpusInputDir>]
                     [-o=<outputFile>] [-t=<threads>]
       --doc-id-filter=<docIdFilterPath>
                              Path to file id list to consider
       --doc-type-filter=<docTypeFilterPath>
                              Path to file type list to consider
       --json-input           Assumes input dir consists of json files, one per
                                report (testing purposes)
       --max-docs=<maxDocs>   Maximum number of docs to process
       --output-corpus-dir=<outputCorpusDir>
                              Output GATE Corpus Dir
       --skip-docs=<skipDocs> Skipping number of docs (useful to just work on a slice
                                of the corpus)
       --xml-input            Assumes input dir consists of xml files, one per report
                                (testing purposes)
   -c=<pipelineConfigFile>    Config file
   -d=<databaseConfigPath>    DB config path
   -i=<corpusInputDir>        Input corpus dir
   -o=<outputFile>            Output Txt File
   -t=<threads>               Number of threads
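
When the pipeline is launched from a script, the command line can be assembled programmatically. The Python sketch below only builds the argument list from options shown in the usage text above. The classpath and all file paths are hypothetical placeholders, and launching the class via java -cp is an assumption about the deployment.

```python
def build_command(config, classpath, db_config=None, input_dir=None,
                  output_file=None, max_docs=None, threads=None):
    """Assemble a java command line for DiagnosisExtractionCmd."""
    cmd = [
        "java", "-cp", classpath,
        "org.ratschlab.structuring.DiagnosisExtractionCmd",
        f"-c={config}",  # -c is the only option the usage marks as required
    ]
    optional = [
        ("-d", db_config),
        ("-i", input_dir),
        ("-o", output_file),
        ("--max-docs", max_docs),
        ("-t", threads),
    ]
    for flag, value in optional:
        if value is not None:
            cmd.append(f"{flag}={value}")
    return cmd


# All paths below are placeholders for illustration only.
cmd = build_command(
    config="pipeline.conf",
    classpath="structuring.jar",
    input_dir="corpus/",
    output_file="diagnoses.txt",
    max_docs=100,
)
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the placeholders are replaced with real paths.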

Pipeline Description

The pipeline executes the following steps for every document:

  1. Tokenize the imported report
  2. Annotate diagnosis keywords
  3. Annotate reliability contexts
  4. Run JAPE rules to
    • determine the reliability of a diagnosis, either by checking whether it lies within some reliability context or whether it matches a certain language pattern (e.g. Verdacht auf ...)
    • remove some false positives such as some_diagnosis Sprechstunde or some_diagnosis Abklaerung
  5. Consolidate diagnosis annotations by
    • adding confirmed as the default reliability if no other reliability could be determined
    • removing duplicates
  6. Write the results to a file and/or the database
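
The consolidation step can be sketched in Python as follows. The dictionary layout of an annotation and the definition of a duplicate (same report, code, and reliability) are assumptions of this example; the actual pipeline operates on GATE annotations.

```python
def consolidate(annotations):
    """Apply the default reliability and drop duplicates, keeping order."""
    seen = set()
    result = []
    for ann in annotations:
        # Fall back to "confirmed" when no reliability was determined.
        reliability = ann.get("reliability") or "confirmed"
        row = (ann["report_id"], ann["code"], reliability)
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result


# Made-up annotations for one report.
annotations = [
    {"report_id": "r1", "code": "G35"},
    {"report_id": "r1", "code": "G35", "reliability": "suspected"},
    {"report_id": "r1", "code": "G35", "reliability": None},
]
consolidated = consolidate(annotations)
```

Note that the two remaining rows still carry different reliabilities for the same diagnosis; resolving that is left to the downstream system.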