
Commit

Merge branch 'master' into PROPS
kosloot committed May 10, 2024
2 parents 1d3c920 + a7f1e9d commit f33647d
Showing 8 changed files with 152 additions and 35 deletions.
70 changes: 64 additions & 6 deletions README.rst
Original file line number Diff line number Diff line change
@@ -1,10 +1,6 @@
.. image:: https://github.com/proycon/foliatools/actions/workflows/foliatools.yml/badge.svg?branch=master
:target: https://github.com/proycon/foliatools/actions/

.. image:: http://readthedocs.org/projects/foliatools/badge/?version=latest
:target: http://foliatools.readthedocs.io/en/latest/?badge=latest
:alt: Documentation Status

.. image:: http://applejack.science.ru.nl/lamabadge.php/foliatools
:target: http://applejack.science.ru.nl/languagemachines/

@@ -149,7 +145,7 @@ TEI P5 documents can be processed. Some notable things that are supported:
* Gaps
* Text markup (highlighting, ``<hi>``), emphasis, foreign, term, mentioned, names and places
* Limited corrections
* Conversion of `lightweigth linguistic annotation <https://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.linguistic.html>`_.
* Conversion of `lightweight linguistic annotation <https://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.linguistic.html>`_.
* Linguistic segments: sentences (``<s>``) & words (``<w>``), but **not** ``<cl>`` nor ``<phr>``.
* Basic tokenisation (spacing) information (TEI's ``@join`` attribute)
* Limited metadata
Expand All @@ -163,14 +159,74 @@ Specifically not supported (yet), non-exhaustive list:
* Contextual information
* Feature structures (``<fs>``, ``<f>``)

FoLiA to STAM
^^^^^^^^^^^^^^^^^^^^^^^^^^

`STAM <https://annotation.github.io/stam>`__ is a stand-off model for text
annotation. It does not prescribe any vocabulary at all but allows one to
reuse existing vocabularies. The `folia2stam` tool converts FoLiA documents to
STAM, preserving the vocabulary that FoLiA predefines regarding annotation types, common attributes, etc.

**Supported:**

* Conversion of text structure including divisions, paragraphs, headers & titles, lists, figures, tables (limited), front matter, back
matter.
* Conversion of inline and span annotation

**Not supported yet:**

* Only tokenised documents (i.e. with word elements) are implemented currently
* Conversion of text markup annotation
* Certain higher-order annotation is not converted yet
* No explicit tree structure is built yet for hierarchical annotations like syntax annotation
* Do note that there is no conversion back from STAM to FoLiA XML currently (that would be complicated for multiple reasons, so it might never be realised).

**Vocabulary conversion:**

Both FoLiA and STAM have the notion of a *set* or *annotation dataset*. In
FoLiA the scope of such a set is to define the vocabulary used for a particular
annotation type (e.g. a tagset). FoLiA itself already defines what annotation
types exist. In STAM an annotation dataset is a broader notion and all
vocabulary, even the notion of a word or sentence, comes from a set, as nothing
is predefined at all aside from the STAM model's primitives.

We map most of the vocabulary of FoLiA itself to a STAM dataset with ID
`https://w3id.org/folia/v2/`. All of FoLiA's annotation types, element types, and
common attributes are defined in this set.

Each FoLiA set definition maps to a STAM dataset with the same set ID (URI). The
STAM dataset defines a `class` key that corresponds to FoLiA's *class*
attribute. Any FoLiA subsets (for features) also translate to key identifiers.
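The mapping described above can be sketched with plain Python data structures (the set URI, class, and subset names below are hypothetical examples, and the helper function is illustrative rather than part of `folia2stam`):

```python
# Sketch of the FoLiA class/subset -> STAM key/value mapping described above.
# The set URI, class name, and subset values here are hypothetical examples.

def folia_class_to_stam_data(foliaset, folia_class, subsets=None):
    """Translate a FoLiA class (plus any feature subsets) into STAM-style
    key/value data, all within the dataset identified by the FoLiA set ID."""
    data = [{"set": foliaset, "key": "class", "value": folia_class}]
    for subset, value in (subsets or {}).items():
        # FoLiA feature subsets become additional keys in the same dataset
        data.append({"set": foliaset, "key": subset, "value": value})
    return data

data = folia_class_to_stam_data(
    "https://example.org/pos-tagset", "NOUN", {"number": "plural"})
```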

The declarations inside a FoLiA document will be explicitly expressed in STAM as well;
each STAM dataset will have an annotation that points to it (with a
DataSetSelector). This annotation has data with key `declaration` (set
`https://w3id.org/folia/v2/`) that marks it as a declaration for a specific type;
the value is something like `pos-annotation` and corresponds one-to-one to the declaration
element used in FoLiA XML. Additionally, this annotation also has data with key
`annotationtype` (same set as above), whose value corresponds to the
annotation type (lowercased, e.g. `pos`).
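The declaration data described above can be sketched as follows; this mirrors the structure the converter emits for each declaration, though the helper function itself is illustrative:

```python
# Sketch of the STAM data attached to a declaration annotation, as described
# above. Only the data structure is shown; the actual converter attaches it
# to the dataset via a DataSetSelector.
FOLIA_NAMESPACE = "https://w3id.org/folia/v2/"

def declaration_data(annotationtype):
    """Build the STAM data for one FoLiA declaration; e.g. 'pos' yields a
    'pos-annotation' declaration plus the lowercased annotation type."""
    value = annotationtype.lower()
    return [
        {"key": "declaration", "value": f"{value}-annotation", "set": FOLIA_NAMESPACE},
        {"key": "annotationtype", "value": value, "set": FOLIA_NAMESPACE},
    ]

data = declaration_data("pos")
```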

The FoLiA to STAM conversion is RDF-ready. That is, all identifiers are valid
IRIs and all FoLiA vocabulary (`https://w3id.org/folia/v2/`) is backed by `a formal ontology <https://github.com/proycon/folia/blob/master/schemas/folia.ttl>`_ using RDF and SKOS.

FoLiA set definitions, if defined, are already in SKOS (or in the legacy
format).

Being RDF-ready means that the STAM model produced by `folia2stam` can in turn
easily be exported to W3C Web Annotations. Tooling for that conversion will
be provided in `STAM Tools <https://github.com/annotation/stam-tools>`_.



FoLiA to Salt
^^^^^^^^^^^^^^^^^^^^^^^^^^

`Salt <https://corpus-tools.org/salt/>`_ is a graph based annotation model that is designed to act as an intermediate
format in the conversion between various annotation formats. It is used by the conversion tool `Pepper <https://corpus-tools.org/pepper/>`_. Our FoLiA to Salt converter, however, is a standalone tool that is part of these FoLiA tools, rather than integrated into Pepper. You can use ``folia2salt`` to convert FoLiA XML to Salt XML and subsequently use Pepper to do conversions to other formats such as TCF, PAULA, TigerXML, GrAF, Annis, etc. (there is no guarantee, though, that everything can be preserved accurately in each conversion).

The current state of this conversion is summarised below:
The current state of this conversion is summarised below; it is, however, not
likely that this particular tool will be developed any further:

* Conversion of FoLiA tokens to salt SToken nodes
* The converter only supports tokenised FoLiA documents
@@ -205,3 +261,5 @@ Our Salt conversion tries to preserve as much of the FoLiA as possible, we exten
specifying namespaces to hold and group the annotation type and set of an annotation. SLabel elements with the same
namespace should often be considered together.
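The grouping idea above can be sketched in Python; the label names, namespaces, and values here are made up for illustration and do not reflect the converter's actual output:

```python
from collections import defaultdict

# Illustrative only: namespaces, names, and values are invented to show how
# SLabel entries sharing a namespace can be considered together.
labels = [
    {"namespace": "folia::pos", "name": "class", "value": "NOUN"},
    {"namespace": "folia::pos", "name": "set", "value": "https://example.org/pos-tagset"},
    {"namespace": "folia::lemma", "name": "class", "value": "book"},
]

def group_by_namespace(labels):
    """Group SLabel-like records by their namespace."""
    grouped = defaultdict(list)
    for label in labels:
        grouped[label["namespace"]].append(label)
    return dict(grouped)

grouped = group_by_namespace(labels)
```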



5 changes: 5 additions & 0 deletions codemeta-harvest.json
@@ -0,0 +1,5 @@
{
"name": "FoLiA tools",
"developmentStatus": [ "https://www.repostatus.org/#active", "https://w3id.org/research-technology-readiness-levels#Level9Proven" ],
"applicationCategory": [ "https://vocabs.dariah.eu/tadirah/annotating", "https://w3id.org/nwo-research-fields#ComputationalLinguisticsandPhilology", "https://w3id.org/nwo-research-fields#TextualAndLinguisticCorpora" ]
}
2 changes: 1 addition & 1 deletion foliatools/__init__.py
@@ -1,3 +1,3 @@
"""FoLiA-tools contains various Python-based command line tools for working with FoLiA XML (Format for Linguistic Annotation)"""

VERSION = "2.5.4"
VERSION = "2.5.7"
86 changes: 65 additions & 21 deletions foliatools/folia2stam.py
@@ -6,14 +6,13 @@
import os
import argparse
import glob
from collections import OrderedDict
from foliatools import VERSION as TOOLVERSION
from typing import Generator, Optional
from typing import Generator
import folia.main as folia
import stam

#Namespace for STAM annotationset and for RDF, not the same as the XML namespace because that one is very old and hard to resolve
FOLIA_NAMESPACE = "https://w3id.org/folia/"
FOLIA_NAMESPACE = "https://w3id.org/folia/v2/"


def processdir(d, annotationstore: stam.AnnotationStore, **kwargs):
@@ -43,6 +42,29 @@ def convert(f, annotationstore: stam.AnnotationStore, **kwargs):
for key, value in doc.metadata.items():
annotationstore.annotate(target=selector, data={"key":key,"value":value,"set":"metadata"}) #TODO: make metadata set configurable

for annotationtype, foliaset in doc.annotations:
if foliaset:
try:
dataset = annotationstore.dataset(foliaset)
except stam.StamError:
dataset = annotationstore.add_dataset(foliaset)
selector = stam.Selector.datasetselector(dataset)
value = folia.annotationtype2str(annotationtype)
if value:
value = value.lower()
annotationstore.annotate(target=selector, data=[{
"key":"declaration",
"value": f"{value}-annotation",
"set": FOLIA_NAMESPACE
},
{
"key":"annotationtype",
"value": value,
"set": FOLIA_NAMESPACE
},
])



def convert_tokens(doc: folia.Document, annotationstore: stam.AnnotationStore, **kwargs) -> stam.TextResource:
"""Convert FoLiA tokens (w) and text content to STAM. Returns a STAM resource"""
Expand All @@ -54,8 +76,23 @@ def convert_tokens(doc: folia.Document, annotationstore: stam.AnnotationStore, *
for word in doc.words():
if not word.id:
raise Exception("Only documents in which all words have IDs can be converted. Consider preprocessing with foliaid first.")
if kwargs.get('debug'):
print(f"Processing FoLiA word {word.id}...",file=sys.stderr)


textstart = len(text)
if text:
if prevword:
ancestors = set(word.ancestors(folia.AbstractStructureElement))
prevancestors = set(prevword.ancestors(folia.AbstractStructureElement))
delimiters = [ ancestor.gettextdelimiter() for ancestor in prevancestors - ancestors ]
if delimiters:
delimiters.sort(key= lambda x: len(x), reverse=True)
text += delimiters[0]
textstart += len(delimiters[0])
elif prevword.space:
text += " "
textstart += 1
try:
text += word.text()
except folia.NoSuchText:
Expand All @@ -74,30 +111,34 @@ def convert_tokens(doc: folia.Document, annotationstore: stam.AnnotationStore, *
word._begin = textstart
word._end = textend

if text and textstart != textend:
if word.space or (prevword and word.parent != prevword.parent):
text += " "
prevword = word

if not text:
raise Exception(f"Document {doc.filename} has no text!")

if kwargs['external_resources']:
#write text as standoff document
if kwargs.get('debug'):
print(f"Writing text as stand-off document and adding it as a resource in the STAM model",file=sys.stderr)
filename = os.path.join(kwargs['outputdir'], doc.id + ".txt")
with open(filename,'w',encoding='utf-8') as f:
f.write(text)
#reads it again and associates it with the store:
#reads it again and associate it with the store:
resource = annotationstore.add_resource(id=doc.id, filename=filename)
else:
if kwargs.get('debug'):
print(f"Adding resource to STAM model",file=sys.stderr)
resource = annotationstore.add_resource(id=doc.id, text=text)

for token in tokens:
if kwargs.get('debug'):
print(f"Adding token to STAM: {token}",file=sys.stderr)
word_stam = annotationstore.annotate(id=token["id"],
target=stam.Selector.textselector(resource, stam.Offset.simple(token["begin"], token["end"])),
data=token["data"])

word_folia = doc[token["id"]]
convert_inline_annotation(word_folia, word_stam, annotationstore, **kwargs )
if word_folia:
convert_inline_annotation(word_folia, word_stam, annotationstore, **kwargs )

return resource

@@ -243,8 +284,10 @@ def convert_inline_annotation(word: folia.Word, word_stam: stam.Annotation, anno
list(convert_common_attributes(annotation_folia)) + \
list(convert_features(annotation_folia))
if annotation_folia.id:
if kwargs.get('debug'): print(f"Adding inline annotation with Data ID {annotation_folia.id}, data: {data}",file=sys.stderr)
annotationstore.annotate(id=annotation_folia.id, target=selector, data=data)
else:
if kwargs.get('debug'): print(f"Adding inline annotation: {data}",file=sys.stderr)
annotationstore.annotate(target=selector, data=data)

#TODO: list(convert_higher_order(annotation_folia))
@@ -400,15 +443,17 @@ def convert_span_annotation(doc: folia.Document, annotationstore: stam.Annotatio
def convert_type_information(annotation: folia.AbstractElement) -> Generator[dict,None,None]:
if annotation.XMLTAG:
yield { "set":FOLIA_NAMESPACE,
"id": f"{FOLIA_NAMESPACE}elementtype/{annotation.XMLTAG}",
"id": f"{annotation.__class__.__name__}",
"key": "elementtype",
"value": annotation.XMLTAG}
if annotation.ANNOTATIONTYPE:
value = folia.annotationtype2str(annotation.ANNOTATIONTYPE).lower()
yield {"set": FOLIA_NAMESPACE,
"id":f"{FOLIA_NAMESPACE}annotationtype/{value}",
"key":"annotationtype",
"value":value}
value = folia.annotationtype2str(annotation.ANNOTATIONTYPE)
if value:
value = value.lower()
yield {"set": FOLIA_NAMESPACE,
"id":f"{value.capitalize()}AnnotationType",
"key":"annotationtype",
"value":value}

def convert_common_attributes(annotation: folia.AbstractElement) -> Generator[dict,None,None]:
"""Convert common FoLiA attributes"""
Expand All @@ -423,7 +468,6 @@ def convert_common_attributes(annotation: folia.AbstractElement) -> Generator[di

if annotation.confidence is not None:
yield {"set":FOLIA_NAMESPACE,
"id":f"{FOLIA_NAMESPACE}confidence/{annotation.confidence}",
"key":"confidence",
"value":annotation.confidence}

Expand All @@ -440,20 +484,19 @@ def convert_common_attributes(annotation: folia.AbstractElement) -> Generator[di
if annotation.datetime is not None:
value = annotation.datetime.strftime("%Y-%m-%dT%H:%M:%S")
yield { "set":FOLIA_NAMESPACE,
"id":f"{FOLIA_NAMESPACE}datetime/{value}",
"key":"datetime",
"value":value} #MAYBE TODO: convert to STAM's internal datetime type?

if annotation.processor:
yield { "set":FOLIA_NAMESPACE,
"key":"processor/id",
"key":"processorId",
"value":annotation.processor.id}
yield { "set":FOLIA_NAMESPACE,
"key":"processor/name",
"key":"processorName",
"value":annotation.processor.name}
yield { "set":FOLIA_NAMESPACE,
"id":f"{FOLIA_NAMESPACE}processor/type/{annotation.processor.type}",
"key":"processor/type",
"id":f"{annotation.processor.type.capitalize()}ProcessorType",
"key":"processorType",
"value":annotation.processor.type}

def convert_features(annotation: folia.AbstractElement):
@@ -493,6 +536,7 @@ def main():
parser.add_argument('--inline-annotations-mode',type=str, help="What STAM selector to use to translate FoLiA's inline annotations? Can be set to AnnotationSelector (reference the tokens) or TextSelector (directly reference the text)", action='store', default="TextSelector", required=False)
parser.add_argument('--span-annotations-mode',type=str, help="What STAM selector to use to translate FoLiA's span annotations? Can be set to AnnotationSelector (reference the tokens) or TextSelector (directly reference the text)", action='store', default="TextSelector", required=False)
parser.add_argument('--external-resources',"-X",help="Serialize text to external/stand-off text files rather than including them in the JSON", action='store_true')
parser.add_argument('--debug',"-D",help="Enable debug mode, produces extra output to stderr", action='store_true')
parser.add_argument('files', nargs='*', help='Files (and/or directories) to convert. All will be added to a single STAM annotation store.')
args = parser.parse_args()

10 changes: 8 additions & 2 deletions foliatools/foliaspec2rdf.py
@@ -71,7 +71,7 @@ def main():
majorversion = spec['version'].split(".")[0]

print(\
f"""@prefix folia: <{spec['namespace']}/v{majorversion}#> .
f"""@prefix folia: <{spec['rdfnamespace']}> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
Expand All @@ -94,7 +94,7 @@ def main():
folia:Elements a skos:ConceptScheme ;
dc:title "FoLiA Elements" ;
dc:description "Defines FoLiA elements. These correspond to element in FoLiA XML" .
dc:description "Defines FoLiA elements. These correspond to elements in FoLiA XML" .
folia:Element a rdfs:Class .
### Element Properties ###
@@ -126,6 +126,12 @@ def main():
rdfs:domain folia:Element ;
rdfs:range folia:Element .
folia:processortype a rdf:Property ;
rdfs:range folia:ProcessorType .
folia:AutoProcessorType a folia:ProcessorType .
folia:ManualProcessorType a folia:ProcessorType .
""")

for prop in BOOLPROPERTIES:
6 changes: 5 additions & 1 deletion foliatools/tei2folia.py
@@ -79,7 +79,10 @@ def convert(filename, transformer, parser=None, **kwargs):
else:
with open(filename,'rb') as f:
parsedsource = lxml.etree.parse(f, parser)
transformed = transformer(parsedsource,quiet="true")
transform_kwargs = {"quiet":"true"}
if kwargs.get('docid'):
transform_kwargs['docid'] = f"'{kwargs['docid']}'"
transformed = transformer(parsedsource,**transform_kwargs)
if 'intermediate' in kwargs and kwargs['intermediate']:
print(str(lxml.etree.tostring(transformed,encoding='utf-8'),'utf-8'))
try:
@@ -257,6 +260,7 @@ def main():
parser.add_argument('-P','--leaveparts',help="Do *NOT* resolve temporary parts", action='store_true', default=False)
parser.add_argument('-N','--leavenotes',help="Do *NOT* resolve inline notes (t-gap)", action='store_true', default=False)
parser.add_argument('-i','--ids',help="Generate IDs for all structural elements", action='store_true', default=False)
parser.add_argument('--docid',type=str, help="Set FoLiA document ID", action='store', default=False)
parser.add_argument('-f','--forcenamespace',help="Force a TEI namespace even if the input document has none", action='store_true', default=False)
parser.add_argument('files', nargs='+', help='TEI Files to process')
args = parser.parse_args()
6 changes: 3 additions & 3 deletions setup.py
@@ -11,11 +11,11 @@ def read(fname):

setup(
name = "FoLiA-tools",
version = "2.5.5", #also change in __init__.py
version = "2.5.7", #also change in __init__.py
author = "Maarten van Gompel",
author_email = "[email protected]",
description = ("FoLiA-tools contains various Python-based command line tools for working with FoLiA XML (Format for Linguistic Annotation)"),
license = "GPL",
license = "GPL-3.0-only",
keywords = ["nlp", "computational linguistics", "search", "folia", "annotation"],
url = "https://proycon.github.io/folia",
packages=['foliatools'],
@@ -76,5 +76,5 @@ def read(fname):
},
#include_package_data=True,
package_data = {'foliatools': ['*.xsl']},
install_requires=['folia >= 2.5.4', 'lxml >= 2.2','docutils', 'pyyaml', 'langid','conllu', 'requests','stam >= 0.1.0']
install_requires=['folia >= 2.5.9', 'lxml >= 2.2','docutils', 'pyyaml', 'langid','conllu', 'requests','stam >= 0.4.0']
)
