0 parents
commit fc9a6f6
Showing 3 changed files with 38 additions and 0 deletions.
@@ -0,0 +1,38 @@
TreeTagger part-of-speech tagging models for Sahidic Coptic
===========================================================
These part-of-speech tagging models are for use with the freely available TreeTagger
(http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/). The models are based
on the guidelines of the Coptic SCRIPTORIUM project, which closely follow Layton's (2004)
grammar. The lexicon used by the tagger is based on a lexicon kindly provided by Prof.
Tito Orlandi and the CMCL project (http://cmcl.let.uniroma1.it/). Please cite the CMCL
project whenever publishing research using the tagging models.

There are two different models: one for the coarse-grained tagset, with 22 tags, and one
for the fine-grained tagset, which distinguishes 44 tags (including individual tags for
each positive and negative conjugation base). For details on the tagsets, see the
documentation on the Coptic SCRIPTORIUM web page.

To use the models, download and unzip the TreeTagger. In the folder bin/ you will find
the TreeTagger executable, which requires one of the two parameter files to run. TreeTagger
also expects an input file in a one-token-per-line format. For example, the input file input.txt could
include the following tokens (in UTF-8!):

p
noute
pe
.

These will be tagged as:

p ART
noute N
pe COP
. PUNCT
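
The input and output formats shown above can be handled with a few lines of Python. The following is a minimal sketch, not part of the models or of TreeTagger: the helper names write_tokens and parse_tagged are hypothetical, and it assumes that each output line holds one token and one tag separated by whitespace, as in the example above.

```python
def write_tokens(tokens, path):
    """Write one token per line, encoded as UTF-8 (hypothetical helper)."""
    # newline="\n" keeps line endings identical on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for token in tokens:
            handle.write(token + "\n")


def parse_tagged(text):
    """Turn tagger output into a list of (token, tag) pairs.

    Assumption: token and tag are separated by whitespace. Lines that
    do not contain exactly these two fields are skipped.
    """
    pairs = []
    for line in text.splitlines():
        fields = line.strip().rsplit(None, 1)
        if len(fields) == 2:
            pairs.append((fields[0], fields[1]))
    return pairs
```

For instance, write_tokens(["p", "noute", "pe", "."], "input.txt") would produce the input file described above, and parse_tagged applied to the contents of the output file would give the token and tag pairs for further processing.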

To run the tagger, invoke the TreeTagger executable as follows (Windows example):

tree-tagger.exe coptic.par -token input.txt output.txt

The option -token tells the TreeTagger to print each input token next to its tag in the output.
The input itself must already be tokenized, one token per line. For a Coptic tokenizer,
see the Coptic SCRIPTORIUM project web page. Further options, such as allowing for SGML tags in the
input, are documented in the TreeTagger documentation.
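
When the tagger is called from a script, the command line above can be assembled programmatically. This is a hedged sketch rather than an official wrapper: build_command and run_tagger are hypothetical names, the argument order simply mirrors the Windows example above, and the paths passed in are whatever applies on your system.

```python
import subprocess


def build_command(executable, parameter_file, input_file, output_file):
    """Return the argument list mirroring the example command above."""
    return [executable, parameter_file, "-token", input_file, output_file]


def run_tagger(executable, parameter_file, input_file, output_file):
    """Run the tagger and raise CalledProcessError on a non-zero exit code."""
    command = build_command(executable, parameter_file, input_file, output_file)
    # Passing a list (not a single string) avoids shell quoting problems.
    subprocess.run(command, check=True)
    return command
```

A call such as run_tagger("tree-tagger.exe", "coptic.par", "input.txt", "output.txt") would then correspond to the command shown above, provided the executable and the parameter file are found at those locations.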
Binary file not shown.
Binary file not shown.