Skip to content

Latest commit

 

History

History
210 lines (168 loc) · 9.66 KB

README.md

File metadata and controls

210 lines (168 loc) · 9.66 KB

Coptic NLP Pipeline

An end-to-end NLP pipeline for Coptic text in UTF-8 encoding.

Online production version available as a web interface at: https://tools.copticscriptorium.org/coptic-nlp/

The pipeline supports normalization, segmentation (at the word and subword levels), part of speech tagging, lemmatization, language of origin detection, sentence splitting, syntactic dependency parsing, multiword expression recognition, entity recognition, Wikification, and more.

🔥New🔥: The coptic-nlp now supports both Sahidic and Bohairic Coptic dialect varieties!

Installation

The NLP pipeline can run as a script or as part of the included web interface via a web server (e.g. Apache, using index.py as a landing page). To run it:

  • Clone this repository into the directory that the script should run on
  • Ensure that the relevant user has permissions to run scripts in the directory
  • Make sure the dependencies under Requirements are installed

Requirements

Python libraries

The NLP pipeline will run on Python 2.7+ or Python 3.5+ (2.6 and lower are not supported). See requirements.txt for required libraries.

External dependencies

The pipeline also requires perl for segmentation. If you want to use the old marmot tagger instead of flair (not recommended), or the old Malt parser model or Marmot tagging model (also not recommended) instead of the Python models, java needs to be available. Additionally if you want to use the old TreeTagger for POS tagging and lemmatization, TreeTagger must be installed. These are not included in the distribution but the script will offer to attempt to download them if they are missing.

Note on using TreeTagger with older Linux distributions: the latest TreeTagger binaries do not run on some older Linux distributions. When automatically downloading TreeTagger, the script will attempt to notice this. If you receive the error FATAL: kernel too old, please contact @amir-zeldes or open an issue describing your Linux version so it can be added to the script handler. The compatible older version of TreeTagger can be downloaded manually from http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/tree-tagger-linux-3.2-old5.tar.gz. This should not be necessary is you are using our recommended Python tagger built using flair.

Command line usage

usage: python coptic_nlp.py [OPTIONS] files

positional arguments:
  files                 File name or pattern of files to process (e.g. *.txt)

optional arguments:
  -h, --help            show this help message and exit

standard module options:
  -u, --unary           Binarize unary XML milestone tags
  -t, --tag             Do POS tagging
  -l, --lemma           Do lemmatization
  -n, --norm            Do normalization
  -m, --multiword       Tag multiword expressions
  -b, --breaklines      Add line tags at line breaks
  -p, --parse           Parse with dependency parser
  -e, --etym            Add etymolgical language of origin for loan words
  -s SENT, --sent SENT  XML tag to split sentences, e.g. verse for <verse ..>
                        (otherwise PUNCT tag is used to split sentences); 
                        use -s=predict to use neural segmenter instead
  -o {pipes,sgml,conllu}, --outmode {pipes,sgml,conllu}
                        Output SGML, conllu or tokenize with pipes
  --dialect {auto,sahidic,bohairic}
                        Coptic dialect of input data (default: auto-detect)

less common options:
  -f, --finitestate     Use old finite-state tokenizer (less accurate)
  -d {0,1,2,3}, --detokenize {0,1,2,3}
                        Re-group non-standard bound groups (a.k.a.
                        'laytonize') - 1=normal 2=aggressive 3=smart
  --segment_merged      When re-grouping bound groups, assume merged groups
                        have segmentation boundary between them
  -q, --quiet           Suppress verbose messages
  -x EXTENSION, --extension EXTENSION
                        Extension for SGML mode output files (default: tt)
  --stdout              Print output to stdout, do not create output file
  --para                Add <p> tags if <hi rend="ekthetic"> is present
  --space               Add spaces around punctuation
  --from_pipes          Tokenization is indicated in input via pipes
  --dirout DIROUT       Optional output directory (default: this dir)
  --meta META           Add fixed meta data string read from this file name
  --parse_only          Only add a parse to an existing tagged SGML input
  --no_tok              Do not tokenize at all, input is one token per line
  --pos_spans           Harvest POS tags and lemmas from SGML spans
  --merge_parse         Merge/add a parse into a ready SGML file
  --version             Print version number and quit
  --treetagger          Tag using TreeTagger instead of flair
  --marmot              Tag using Marmot instead of flair
  --malt                Parse using MaltParser instead of Diaparser (requires Java)
  --no_gold_parse       Do not use UD_Coptic cache for gold parses
  --processing_meta     Add segmentation/tagging/parsing/entities="auto"
  --old_testament       Use Old Testament identities (Jesus means Jesus son of Naue i.e. Joshua, etc.)

Example usage

Add norm, lemma, parse, tag, unary tags, find multiword expressions and do language recognition:
> python coptic_nlp.py -penmult infile.txt        

Just tokenize a file using pipes and dashes:
> python coptic_nlp.py -o pipes infile.txt       

Tokenize with pipes and mark up line breaks, conservatively detokenize bound groups, assume seg boundary at merge site:
> python coptic_nlp.py -b -d 1 --segment_merged -o pipes infile.txt

Normalize, tag, lemmatize, find multiword expressions and parse, splitting sentences by <verse> tags:
> python coptic_nlp.py -pnmlt -s verse infile.txt       

Add full analyses to a whole directory of *.xml files, output to a specified directory:
> python coptic_nlp.py -penmult --dirout /home/cop/out/ *.xml

Parse a tagged SGML file into CoNLL tabular format for treebanking, use translation tag to recognize sentences:
> python coptic_nlp.py --no_tok --parse_only --pos_spans -s translation infile.tt

Merge a parse into a tagged SGML file's <norm> tags, use translation tag to recognize sentences:
> python coptic_nlp.py --merge_parse --pos_spans -s translation infile.tt

Input formats

The pipeline accepts the following kinds of input:

  • Plain text, with bound groups separated by underscores or spaces.
    • Note that if punctuation has not been separated from bound groups, you can use the --space option to attempt to automatically separate punctuation
    • If your Coptic text represents line breaks as new line characters, you can automatically add line break tags using -b / --breaklines
    • Gold tokenization information may be present in the input as pipes between part-of-speech bearing units and hyphens between morphemes
  • XML/SGML input, with bound groups separated by underscores or spaces. The script will retain XML tags as-is around Coptic text.
  • Coptic Scriptorium style TreeTagger SGML, with normalized units in tags such as .
    • This input format is used when adding a parse to an existing .tt file using the --merge_parse option
    • It is also possible to enrich .tt files with additional annotations, e.g. only add language of origin tags to an existing analysis

Some example inputs

  1. Plain text input, untokenized, no meaningful line breaks, spaces separating bound groups
Ⲁϥⲥⲱⲧⲙ ⲛϭⲓⲡϣⲃⲏⲣ ⲛ̄ⲧⲙⲛⲧⲁⲧⲥⲱⲧⲙ
  1. XML input with meaningful line breaks and gold tokenization, underscores separating bound groups
Ⲁ|ϥ|ⲥ<hi renf="large">ⲱ</hi>ⲧⲙ_
ⲛϭⲓ|ⲡ|ϣⲃⲏⲣ_ⲛ̄|ⲧ|ⲙⲛⲧ-ⲁ
ⲧ-ⲥⲱⲧⲙ
  1. TreeTagger SGML input with some analyses, but for example missing the syntactic parse
<lb n="1">
<norm_group norm_group="ⲁϥⲥⲱⲧⲙ" orig_group="ⲁϥⲥⲱⲧⲙ">
<norm norm="ⲁ" orig="Ⲁ" lemma="ⲁ" pos="APST">
Ⲁ
</norm>
<norm norm="ϥ" orig="ϥ" lemma="ⲛⲧⲟϥ" pos="PPERS">
ϥ
</norm>
<norm norm="ⲥⲱⲧⲙ" orig="ⲥⲱⲧⲙ" lemma="ⲥⲱⲧⲙ" pos="V">
ⲥ
<hi rend="large">
ⲱ
</hi>
ⲧⲙ
</norm>
</norm_group>
</lb>
<lb n="2">
<norm_group norm_group="ⲛϭⲓⲡϣⲃⲏⲣ" orig_group="ⲛϭⲓⲡϣⲃⲏⲣ">
<norm norm="ⲛϭⲓ" orig="ⲛϭⲓ" lemma="ⲛϭⲓ" pos="PTC">
ⲛϭⲓ
</norm>
<norm norm="ⲡ" orig="ⲡ" lemma="ⲡ" pos="ART">
ⲡ
</norm>
<norm norm="ϣⲃⲏⲣ" orig="ϣⲃⲏⲣ" lemma="ϣⲃⲏⲣ" pos="N">
ϣⲃⲏⲣ
</norm>
</norm_group>
<norm_group norm_group="ⲛⲧⲙⲛⲧⲁⲧⲥⲱⲧⲙ" orig_group="ⲛ̄ⲧⲙⲛⲧⲁⲧⲥⲱⲧⲙ">
<norm norm="ⲛ" orig="ⲛ̄" lemma="ⲛ" pos="PREP">
ⲛ
</norm>
<norm norm="ⲧ" orig="ⲧ" lemma="ⲡ" pos="ART">
ⲧ
</norm>
<norm norm="ⲙⲛⲧⲁⲧⲥⲱⲧⲙ" orig="ⲙⲛⲧⲁⲧⲥⲱⲧⲙ" lemma="ⲙⲛⲧⲁⲧⲥⲱⲧⲙ" pos="N">
<morph morph="ⲙⲛⲧ">
ⲙⲛⲧ
</morph>
<morph morph="ⲁⲧ">
ⲁ
</lb>
<lb n="3">
ⲧ
</morph>
<morph morph="ⲥⲱⲧⲙ">
ⲥⲱⲧⲙ
</morph>
</norm>
</norm_group>
</lb>

Testing installation

If all requirements are installed correctly, you can verify that modules are working correctly by running the built-in unit tests:

python run_tests.py

Acknowledgments

We would like to thank Tito Orlandi (CMCL) for contributing Sahidic lexicon data to the project, Hany Takla (Saint Shenouda the Archimandrite Society) for much of the Bohairic data that the tools are based on and Pishoy Georgios for contributing lexical data for Bohairic. A full list of contributors and collaborators in other projects can be found on the Coptic Scriptorium website.