
Commit

Merge branch 'master' into PROPS
kosloot committed May 10, 2024
2 parents 1d3c920 + a7f1e9d commit f33647d
Showing 8 changed files with 152 additions and 35 deletions.
70 changes: 64 additions & 6 deletions README.rst
Original file line number Diff line number Diff line change
@@ -1,10 +1,6 @@
.. image:: https://github.com/proycon/foliatools/actions/workflows/foliatools.yml/badge.svg?branch=master
:target: https://github.com/proycon/foliatools/actions/

.. image:: http://readthedocs.org/projects/foliatools/badge/?version=latest
:target: http://foliatools.readthedocs.io/en/latest/?badge=latest
:alt: Documentation Status

.. image:: http://applejack.science.ru.nl/lamabadge.php/foliatools
:target: http://applejack.science.ru.nl/languagemachines/

@@ -149,7 +145,7 @@ TEI P5 documents can be processed. Some notable things that are supported:
* Gaps
* Text markup (highlighting, ``<hi>``), emphasis, foreign, term, mentioned, names and places
* Limited corrections
* Conversion of `lightweigth linguistic annotation <https://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.linguistic.html>`_.
* Conversion of `lightweight linguistic annotation <https://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.linguistic.html>`_.
* Linguistic segments: sentences (``<s>``) & words (``<w>``), but **not** ``<cl>`` nor ``<phr>``.
* Basic tokenisation (spacing) information (TEI's ``@join`` attribute)
* Limited metadata
Expand All @@ -163,14 +159,74 @@ Specifically not supported (yet), non-exhaustive list:
* Contextual information
* Feature structures (``<fs>``, ``<f>``)

FoLiA to STAM
^^^^^^^^^^^^^^^^^^^^^^^^^^

`STAM <https://annotation.github.io/stam>`__ is a stand-off model for text
annotation. It does not prescribe any vocabulary at all but allows one to
reuse existing vocabularies. The `folia2stam` tool converts FoLiA documents to
STAM, preserving the vocabulary that FoLiA predefines regarding annotation types, common attributes, etc.

**Supported:**

* Conversion of text structure including divisions, paragraphs, headers & titles, lists, figures, tables (limited), front matter, back
matter.
* Conversion of inline and span annotation

**Not supported yet:**

* Only tokenised documents (i.e. with word elements) are implemented currently
* Conversion of text markup annotation
* Certain higher-order annotation is not converted yet
* No explicit tree structure is built yet for hierarchical annotations like syntax annotation
* Do note that there is no conversion back from STAM to FoLiA XML currently (that would be complicated for multiple reasons, so it might never be realised).

**Vocabulary conversion:**

Both FoLiA and STAM have the notion of a *set* or *annotation dataset*. In
FoLiA the scope of such a set is to define the vocabulary used for a particular
annotation type (e.g. a tagset). FoLiA itself already defines what annotation
types exist. In STAM an annotation dataset is a broader notion and all
vocabulary, even the notion of a word or sentence, comes from a set, as nothing
is predefined at all aside from the STAM model's primitives.

We map most of the vocabulary of FoLiA itself to a STAM dataset with ID
`https://w3id.org/folia/v2/`. All of FoLiA's annotation types, element types, and
common attributes are defined in this set.

Each FoLiA set definition maps to a STAM dataset with the same set ID (URI). The
STAM dataset defines a `class` key that corresponds to FoLiA's *class*
attribute. Any FoLiA subsets (for features) also translate to key identifiers.
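The mapping described above can be sketched with plain Python data structures (the set URI, class, and subset names below are hypothetical examples, and the helper function is illustrative rather than part of `folia2stam`):

```python
# Sketch of the FoLiA class/subset -> STAM key/value mapping described above.
# The set URI, class name, and subset values here are hypothetical examples.

def folia_class_to_stam_data(foliaset, folia_class, subsets=None):
    """Translate a FoLiA class (plus any feature subsets) into STAM-style
    key/value data, all within the dataset identified by the FoLiA set ID."""
    data = [{"set": foliaset, "key": "class", "value": folia_class}]
    for subset, value in (subsets or {}).items():
        # FoLiA feature subsets become additional keys in the same dataset
        data.append({"set": foliaset, "key": subset, "value": value})
    return data

data = folia_class_to_stam_data(
    "https://example.org/pos-tagset", "NOUN", {"number": "plural"})
```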

The declarations inside a FoLiA document will be explicitly expressed in STAM as well;
each STAM dataset will have an annotation that points to it (with a
DataSetSelector). This annotation has data with key `declaration` (set
`https://w3id.org/folia/v2/`) that marks it as a declaration for a specific type;
the value is something like `pos-annotation` and corresponds one-to-one to the declaration
element used in FoLiA XML. Additionally, this annotation also has data with key
`annotationtype` (same set as above), whose value corresponds to the
annotation type (lowercased, e.g. `pos`).
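The declaration data described above can be sketched as follows; this mirrors the structure the converter emits for each declaration, though the helper function itself is illustrative:

```python
# Sketch of the STAM data attached to a declaration annotation, as described
# above. Only the data structure is shown; the actual converter attaches it
# to the dataset via a DataSetSelector.
FOLIA_NAMESPACE = "https://w3id.org/folia/v2/"

def declaration_data(annotationtype):
    """Build the STAM data for one FoLiA declaration; e.g. 'pos' yields a
    'pos-annotation' declaration plus the lowercased annotation type."""
    value = annotationtype.lower()
    return [
        {"key": "declaration", "value": f"{value}-annotation", "set": FOLIA_NAMESPACE},
        {"key": "annotationtype", "value": value, "set": FOLIA_NAMESPACE},
    ]

data = declaration_data("pos")
```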

The FoLiA to STAM conversion is RDF-ready. That is, all identifiers are valid
IRIs and all FoLiA vocabulary (`https://w3id.org/folia/v2/`) is backed by `a formal ontology <https://github.com/proycon/folia/blob/master/schemas/folia.ttl>`_ using RDF and SKOS.

FoLiA set definitions, if defined, are already in SKOS (or in the legacy
format).

Being RDF-ready means that the STAM model produced by `folia2stam` can in turn
easily be exported to W3C Web Annotations. Tooling for that conversion will
be provided in `STAM Tools <https://github.com/annotation/stam-tools>`_.



FoLiA to Salt
^^^^^^^^^^^^^^^^^^^^^^^^^^

`Salt <https://corpus-tools.org/salt/>`_ is a graph based annotation model that is designed to act as an intermediate
format in the conversion between various annotation formats. It is used by the conversion tool `Pepper <https://corpus-tools.org/pepper/>`_. Our FoLiA to Salt converter, however, is a standalone tool that is part of these FoLiA tools, rather than integrated into Pepper. You can use ``folia2salt`` to convert FoLiA XML to Salt XML and subsequently use Pepper to do conversions to other formats such as TCF, PAULA, TigerXML, GrAF, Annis, etc. (there is no guarantee, though, that everything can be preserved accurately in each conversion).

The current state of this conversion is summarised below:
The current state of this conversion is summarised below; it is, however, not
likely that this particular tool will be developed any further:

* Conversion of FoLiA tokens to salt SToken nodes
* The converter only supports tokenised FoLiA documents
@@ -205,3 +261,5 @@ Our Salt conversion tries to preserve as much of the FoLiA as possible, we exten
specifying namespaces to hold and group the annotation type and set of an annotation. SLabel elements with the same
namespace should often be considered together.
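The grouping idea above can be sketched in Python; the label names, namespaces, and values here are made up for illustration and do not reflect the converter's actual output:

```python
from collections import defaultdict

# Illustrative only: namespaces, names, and values are invented to show how
# SLabel entries sharing a namespace can be considered together.
labels = [
    {"namespace": "folia::pos", "name": "class", "value": "NOUN"},
    {"namespace": "folia::pos", "name": "set", "value": "https://example.org/pos-tagset"},
    {"namespace": "folia::lemma", "name": "class", "value": "book"},
]

def group_by_namespace(labels):
    """Group SLabel-like records by their namespace."""
    grouped = defaultdict(list)
    for label in labels:
        grouped[label["namespace"]].append(label)
    return dict(grouped)

grouped = group_by_namespace(labels)
```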



5 changes: 5 additions & 0 deletions codemeta-harvest.json
@@ -0,0 +1,5 @@
{
"name": "FoLiA tools",
"developmentStatus": [ "https://www.repostatus.org/#active", "https://w3id.org/research-technology-readiness-levels#Level9Proven" ],
"applicationCategory": [ "https://vocabs.dariah.eu/tadirah/annotating", "https://w3id.org/nwo-research-fields#ComputationalLinguisticsandPhilology", "https://w3id.org/nwo-research-fields#TextualAndLinguisticCorpora" ]
}
2 changes: 1 addition & 1 deletion foliatools/__init__.py
@@ -1,3 +1,3 @@
"""FoLiA-tools contains various Python-based command line tools for working with FoLiA XML (Format for Linguistic Annotation)"""

VERSION = "2.5.4"
VERSION = "2.5.7"
86 changes: 65 additions & 21 deletions foliatools/folia2stam.py
@@ -6,14 +6,13 @@
import os
import argparse
import glob
from collections import OrderedDict
from foliatools import VERSION as TOOLVERSION
from typing import Generator, Optional
from typing import Generator
import folia.main as folia
import stam

#Namespace for STAM annotationset and for RDF, not the same as the XML namespace because that one is very old and hard to resolve
FOLIA_NAMESPACE = "https://w3id.org/folia/"
FOLIA_NAMESPACE = "https://w3id.org/folia/v2/"


def processdir(d, annotationstore: stam.AnnotationStore, **kwargs):
@@ -43,6 +42,29 @@ def convert(f, annotationstore: stam.AnnotationStore, **kwargs):
for key, value in doc.metadata.items():
annotationstore.annotate(target=selector, data={"key":key,"value":value,"set":"metadata"}) #TODO: make metadata set configurable

for annotationtype, foliaset in doc.annotations:
if foliaset:
try:
dataset = annotationstore.dataset(foliaset)
except stam.StamError:
dataset = annotationstore.add_dataset(foliaset)
selector = stam.Selector.datasetselector(dataset)
value = folia.annotationtype2str(annotationtype)
if value:
value = value.lower()
annotationstore.annotate(target=selector, data=[{
"key":"declaration",
"value": f"{value}-annotation",
"set": FOLIA_NAMESPACE
},
{
"key":"annotationtype",
"value": value,
"set": FOLIA_NAMESPACE
},
])



def convert_tokens(doc: folia.Document, annotationstore: stam.AnnotationStore, **kwargs) -> stam.TextResource:
"""Convert FoLiA tokens (w) and text content to STAM. Returns a STAM resource"""
Expand All @@ -54,8 +76,23 @@ def convert_tokens(doc: folia.Document, annotationstore: stam.AnnotationStore, *
for word in doc.words():
if not word.id:
raise Exception("Only documents in which all words have IDs can be converted. Consider preprocessing with foliaid first.")
if kwargs.get('debug'):
print(f"Processing FoLiA word {word.id}...",file=sys.stderr)


textstart = len(text)
if text:
if prevword:
ancestors = set(word.ancestors(folia.AbstractStructureElement))
prevancestors = set(prevword.ancestors(folia.AbstractStructureElement))
delimiters = [ ancestor.gettextdelimiter() for ancestor in prevancestors - ancestors ]
if delimiters:
delimiters.sort(key= lambda x: len(x), reverse=True)
text += delimiters[0]
textstart += len(delimiters[0])
elif prevword.space:
text += " "
textstart += 1
try:
text += word.text()
except folia.NoSuchText:
Expand All @@ -74,30 +111,34 @@ def convert_tokens(doc: folia.Document, annotationstore: stam.AnnotationStore, *
word._begin = textstart
word._end = textend

if text and textstart != textend:
if word.space or (prevword and word.parent != prevword.parent):
text += " "
prevword = word

if not text:
raise Exception(f"Document {doc.filename} has no text!")

if kwargs['external_resources']:
#write text as standoff document
if kwargs.get('debug'):
print(f"Writing text as stand-off document and adding it as a resource in the STAM model",file=sys.stderr)
filename = os.path.join(kwargs['outputdir'], doc.id + ".txt")
with open(filename,'w',encoding='utf-8') as f:
f.write(text)
#reads it again and associates it with the store:
#reads it again and associate it with the store:
resource = annotationstore.add_resource(id=doc.id, filename=filename)
else:
if kwargs.get('debug'):
print(f"Adding resource to STAM model",file=sys.stderr)
resource = annotationstore.add_resource(id=doc.id, text=text)

for token in tokens:
if kwargs.get('debug'):
print(f"Adding token to STAM: {token}",file=sys.stderr)
word_stam = annotationstore.annotate(id=token["id"],
target=stam.Selector.textselector(resource, stam.Offset.simple(token["begin"], token["end"])),
data=token["data"])

word_folia = doc[token["id"]]
convert_inline_annotation(word_folia, word_stam, annotationstore, **kwargs )
if word_folia:
convert_inline_annotation(word_folia, word_stam, annotationstore, **kwargs )

return resource

@@ -243,8 +284,10 @@ def convert_inline_annotation(word: folia.Word, word_stam: stam.Annotation, anno
list(convert_common_attributes(annotation_folia)) + \
list(convert_features(annotation_folia))
if annotation_folia.id:
if kwargs.get('debug'): print(f"Adding inline annotation with Data ID {annotation_folia.id}, data: {data}",file=sys.stderr)
annotationstore.annotate(id=annotation_folia.id, target=selector, data=data)
else:
if kwargs.get('debug'): print(f"Adding inline annotation: {data}",file=sys.stderr)
annotationstore.annotate(target=selector, data=data)

#TODO: list(convert_higher_order(annotation_folia))
@@ -400,15 +443,17 @@ def convert_span_annotation(doc: folia.Document, annotationstore: stam.Annotatio
def convert_type_information(annotation: folia.AbstractElement) -> Generator[dict,None,None]:
if annotation.XMLTAG:
yield { "set":FOLIA_NAMESPACE,
"id": f"{FOLIA_NAMESPACE}elementtype/{annotation.XMLTAG}",
"id": f"{annotation.__class__.__name__}",
"key": "elementtype",
"value": annotation.XMLTAG}
if annotation.ANNOTATIONTYPE:
value = folia.annotationtype2str(annotation.ANNOTATIONTYPE).lower()
yield {"set": FOLIA_NAMESPACE,
"id":f"{FOLIA_NAMESPACE}annotationtype/{value}",
"key":"annotationtype",
"value":value}
value = folia.annotationtype2str(annotation.ANNOTATIONTYPE)
if value:
value = value.lower()
yield {"set": FOLIA_NAMESPACE,
"id":f"{value.capitalize()}AnnotationType",
"key":"annotationtype",
"value":value}

def convert_common_attributes(annotation: folia.AbstractElement) -> Generator[dict,None,None]:
"""Convert common FoLiA attributes"""
Expand All @@ -423,7 +468,6 @@ def convert_common_attributes(annotation: folia.AbstractElement) -> Generator[di

if annotation.confidence is not None:
yield {"set":FOLIA_NAMESPACE,
"id":f"{FOLIA_NAMESPACE}confidence/{annotation.confidence}",
"key":"confidence",
"value":annotation.confidence}

Expand All @@ -440,20 +484,19 @@ def convert_common_attributes(annotation: folia.AbstractElement) -> Generator[di
if annotation.datetime is not None:
value = annotation.datetime.strftime("%Y-%m-%dT%H:%M:%S")
yield { "set":FOLIA_NAMESPACE,
"id":f"{FOLIA_NAMESPACE}datetime/{value}",
"key":"datetime",
"value":value} #MAYBE TODO: convert to STAM's internal datetime type?

if annotation.processor:
yield { "set":FOLIA_NAMESPACE,
"key":"processor/id",
"key":"processorId",
"value":annotation.processor.id}
yield { "set":FOLIA_NAMESPACE,
"key":"processor/name",
"key":"processorName",
"value":annotation.processor.name}
yield { "set":FOLIA_NAMESPACE,
"id":f"{FOLIA_NAMESPACE}processor/type/{annotation.processor.type}",
"key":"processor/type",
"id":f"{annotation.processor.type.capitalize()}ProcessorType",
"key":"processorType",
"value":annotation.processor.type}

def convert_features(annotation: folia.AbstractElement):
@@ -493,6 +536,7 @@ def main():
parser.add_argument('--inline-annotations-mode',type=str, help="What STAM selector to use to translate FoLiA's inline annotations? Can be set to AnnotationSelector (reference the tokens) or TextSelector (directly reference the text)", action='store', default="TextSelector", required=False)
parser.add_argument('--span-annotations-mode',type=str, help="What STAM selector to use to translate FoLiA's span annotations? Can be set to AnnotationSelector (reference the tokens) or TextSelector (directly reference the text)", action='store', default="TextSelector", required=False)
parser.add_argument('--external-resources',"-X",help="Serialize text to external/stand-off text files rather than including them in the JSON", action='store_true')
parser.add_argument('--debug',"-D",help="Enable debug mode, produces extra output to stderr", action='store_true')
parser.add_argument('files', nargs='*', help='Files (and/or directories) to convert. All will be added to a single STAM annotation store.')
args = parser.parse_args()

10 changes: 8 additions & 2 deletions foliatools/foliaspec2rdf.py
@@ -71,7 +71,7 @@ def main():
majorversion = spec['version'].split(".")[0]

print(\
f"""@prefix folia: <{spec['namespace']}/v{majorversion}#> .
f"""@prefix folia: <{spec['rdfnamespace']}> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
Expand All @@ -94,7 +94,7 @@ def main():
folia:Elements a skos:ConceptScheme ;
dc:title "FoLiA Elements" ;
dc:description "Defines FoLiA elements. These correspond to element in FoLiA XML" .
dc:description "Defines FoLiA elements. These correspond to elements in FoLiA XML" .
folia:Element a rdfs:Class .
### Element Properties ###
@@ -126,6 +126,12 @@ def main():
rdfs:domain folia:Element ;
rdfs:range folia:Element .
folia:processortype a rdf:Property ;
rdfs:range folia:ProcessorType .
folia:AutoProcessorType a folia:ProcessorType .
folia:ManualProcessorType a folia:ProcessorType .
""")

for prop in BOOLPROPERTIES:
6 changes: 5 additions & 1 deletion foliatools/tei2folia.py
@@ -79,7 +79,10 @@ def convert(filename, transformer, parser=None, **kwargs):
else:
with open(filename,'rb') as f:
parsedsource = lxml.etree.parse(f, parser)
transformed = transformer(parsedsource,quiet="true")
transform_kwargs = {"quiet":"true"}
if kwargs.get('docid'):
transform_kwargs['docid'] = f"'{kwargs['docid']}'"
transformed = transformer(parsedsource,**transform_kwargs)
if 'intermediate' in kwargs and kwargs['intermediate']:
print(str(lxml.etree.tostring(transformed,encoding='utf-8'),'utf-8'))
try:
@@ -257,6 +260,7 @@ def main():
parser.add_argument('-P','--leaveparts',help="Do *NOT* resolve temporary parts", action='store_true', default=False)
parser.add_argument('-N','--leavenotes',help="Do *NOT* resolve inline notes (t-gap)", action='store_true', default=False)
parser.add_argument('-i','--ids',help="Generate IDs for all structural elements", action='store_true', default=False)
parser.add_argument('--docid',type=str, help="Set FoLiA document ID", action='store', default=False)
parser.add_argument('-f','--forcenamespace',help="Force a TEI namespace even if the input document has none", action='store_true', default=False)
parser.add_argument('files', nargs='+', help='TEI Files to process')
args = parser.parse_args()
6 changes: 3 additions & 3 deletions setup.py
@@ -11,11 +11,11 @@ def read(fname):

setup(
name = "FoLiA-tools",
version = "2.5.5", #also change in __init__.py
version = "2.5.7", #also change in __init__.py
author = "Maarten van Gompel",
author_email = "[email protected]",
description = ("FoLiA-tools contains various Python-based command line tools for working with FoLiA XML (Format for Linguistic Annotation)"),
license = "GPL",
license = "GPL-3.0-only",
keywords = ["nlp", "computational linguistics", "search", "folia", "annotation"],
url = "https://proycon.github.io/folia",
packages=['foliatools'],
@@ -76,5 +76,5 @@ def read(fname):
},
#include_package_data=True,
package_data = {'foliatools': ['*.xsl']},
install_requires=['folia >= 2.5.4', 'lxml >= 2.2','docutils', 'pyyaml', 'langid','conllu', 'requests','stam >= 0.1.0']
install_requires=['folia >= 2.5.9', 'lxml >= 2.2','docutils', 'pyyaml', 'langid','conllu', 'requests','stam >= 0.4.0']
)
