version 1.0
rayliu521 committed Jun 17, 2014
0 parents commit fc9a6f6
Showing 3 changed files with 38 additions and 0 deletions.
38 changes: 38 additions & 0 deletions README.md
@@ -0,0 +1,38 @@
TreeTagger part-of-speech tagging models for Sahidic Coptic
===========================================================
The part-of-speech tagging models are for use with the freely available TreeTagger
(http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/). These models are based
on the guidelines of the Coptic SCRIPTORIUM project, which closely follow Layton's (2004)
grammar. The lexicon used by the tagger is based on a lexicon kindly provided by Prof.
Tito Orlandi and the CMCL project (http://cmcl.let.uniroma1.it/). Please cite the CMCL
project whenever publishing research using the tagging models.

There are two different models: one for the coarse-grained tagset, with 22 tags, and one
for the fine-grained tagset, which distinguishes 44 tags (including individual tags for
each positive and negative conjugation base). For details on the tagset, see the
documentation on the Coptic SCRIPTORIUM web page.

To use the models, download and unzip the TreeTagger. In the folder bin/ you will find
the TreeTagger executable, which requires one of the two parameter files to run. TreeTagger
also expects an input file in a one-token-per-line format. For example, the input file input.txt could
include the following tokens (in UTF-8!):

    p
    noute
    pe
    .

These will be tagged as:

    p ART
    noute N
    pe COP
    . PUNCT
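
The input and output formats above can also be handled from a script. The following is a minimal Python sketch, not part of the TreeTagger distribution: the function names `write_tokens` and `read_tagged` are hypothetical, and the reader assumes that each output line holds a token and its tag separated by whitespace, as in the example above.

```python
from pathlib import Path


def write_tokens(tokens, path):
    """Write the tokens to `path`, one token per line, encoded as UTF-8."""
    Path(path).write_text("\n".join(tokens) + "\n", encoding="utf-8")


def read_tagged(path):
    """Read tagger output into a list of (token, tag) pairs.

    Assumption: every non-empty line contains a token followed by its tag,
    separated by whitespace (a space or a tab). Other lines are skipped.
    """
    pairs = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        parts = line.split()
        if len(parts) >= 2:
            pairs.append((parts[0], parts[1]))
    return pairs
```

For instance, `write_tokens(["p", "noute", "pe", "."], "input.txt")` would produce the input file shown above, and `read_tagged("output.txt")` would return the token and tag pairs once the tagger has been run.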

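To run the tagger, run the TreeTagger executable as follows (Windows example; coptic.par stands for whichever of the two parameter files you use, coptic_coarse.par or coptic_fine.par):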

    tree-tagger.exe coptic.par -token input.txt output.txt

The option -token tells the TreeTagger to print each token next to its tag in the output. The tagger
itself always expects input that is already tokenized, one token per line. For a Coptic tokenizer,
see the Coptic SCRIPTORIUM project web page. Further options, such as allowing for SGML tags in the
input, are documented in the TreeTagger documentation.
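
The same command line can be issued from a script. The sketch below is one possible way to do this with Python's standard `subprocess` module. The helper names `build_command` and `run_tagger` are hypothetical, the argument order simply mirrors the command shown above, and the executable and parameter file paths are assumptions that depend on where TreeTagger and the models were unpacked.

```python
import subprocess


def build_command(executable, parameter_file, input_file, output_file):
    """Return the argument list for one tagger run.

    The order mirrors the README command:
    tree-tagger.exe <parameter file> -token <input> <output>
    """
    return [executable, parameter_file, "-token", input_file, output_file]


def run_tagger(executable, parameter_file, input_file, output_file):
    """Run the tagger once; raises CalledProcessError on a non-zero exit status."""
    command = build_command(executable, parameter_file, input_file, output_file)
    subprocess.run(command, check=True)
```

A call such as `run_tagger("tree-tagger.exe", "coptic_coarse.par", "input.txt", "output.txt")` would then correspond to the Windows example, using the coarse-grained model.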
Binary file added coptic_coarse.par
Binary file not shown.
Binary file added coptic_fine.par
Binary file not shown.