Interoperability with Salt, Pepper and ANNIS #85

Open
proycon opened this issue Aug 13, 2020 · 20 comments
Assignees
Labels
enhancement waiting waiting for feedback or another issue/process to finish

Comments

@proycon
Owner

proycon commented Aug 13, 2020

This issue came up in discussions with @luutuntin who was looking for a search and retrieval tool capable of handling FoLiA. There is some FoLiA support in both Blacklab and MTAS, but both may not sufficiently cover all of FoLiA's expressive abilities (tree handling in particular).

ANNIS is another well-developed and interesting solution, but right now there is no FoLiA support. ANNIS relies on a conversion tool called Pepper to support a great variety of input formats. Pepper in turn uses a low-level graph-based model called Salt as its intermediate model, which in turn can export to a variety of formats again (including ANNIS' format).

To enhance interoperability, it would be a good idea to implement conversion from FoLiA to the Salt model (and possibly vice versa, but with much lower priority).

To write such a converter we could:

  1. Implement it as an extension to Pepper. However, Pepper and Salt are both Java-based, and we have no proper Java-based FoLiA library (I'm very reluctant to start one, as we already have extensive libraries for Python, C++ and Rust).
  2. Implement it as a standalone tool, possibly serialising to SaltXML. This allows us to leverage an existing FoLiA library (although we lose the benefit of the Salt library), and keeps things a bit simpler.

Update: we are picking option 2

@proycon proycon self-assigned this Aug 13, 2020
@luutuntin

luutuntin commented Aug 15, 2020

Information about SaltXML (XMI) format

Most relevant:

Useful:

@proycon
Owner Author

proycon commented Aug 20, 2020

Here's an additional salt example, converted from the TCF v0.4 example on their website: https://download.anaproy.nl/tcf04-karin-wl.salt

@luutuntin

And this is an example of document-structure (vs corpus-structure), I suppose.

proycon added a commit to proycon/foliatools that referenced this issue Aug 20, 2020
proycon added a commit to proycon/foliatools that referenced this issue Aug 20, 2020
proycon added a commit to proycon/foliatools that referenced this issue Aug 25, 2020
@proycon
Owner Author

proycon commented Aug 25, 2020

This comment tracks the current state of the folia2salt implementation in foliatools. Not all is a priority and some may not be implemented for the time being:

  • - Conversion of FoLiA tokens to salt SToken nodes
    • The converter only supports tokenised FoLiA documents
    • - Add support for FoLiA hiddenword
  • - Text extraction (from tokens) to STextualDS node and conversion to STextualRelation edges
    • preserves untokenised text only to a certain degree (using FoLiA's token spacing information only)
    • - Support for multiple text classes
  • - Conversion of FoLiA Inline Annotation (pos, lemma etc) to salt SAnnotation labels
  • - Conversion of FoLiA Structure Annotation (sentences, paragraphs, etc) to salt SSpan nodes and SSpanRelation edges
    • converted structures will directly relate to the underlying token nodes rather than to a structural hierarchy like in FoLiA
  • - Conversion of simple FoLiA Span Annotation (entities etc) to salt SSpan nodes and SSpanRelation edges
    • - Conversion of nested Span Annotation (syntax etc) to SSpan nodes and SDominanceRelation edges
    • - Conversion of Span Annotation including span roles (dependencies etc) to SSpan nodes and SDominanceRelation edges
  • - Grouping of annotation types/sets in salt SLayer nodes
  • - Conversion of FoLiA higher order elements:
    • - Features
    • - Comments
    • - Descriptions
    • - Relations
    • - Metrics
    • - Span Relations
    • - String annotation
    • - Alternative annotation
    • - Corrections
  • - Conversion of FoLiA subtoken annotation (morphology/phonology)
  • - Conversion of FoLiA phonetic content (as an extra STextualDS node and STextualRelation edges)
  • - Conversion of FoLiA references to audio/video sources and timing information
  • - Convert FoLiA native metadata
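
The text-extraction item in the list above can be pictured with a minimal sketch. This is not the folia2salt code: `Token`, `build_text` and the `space` flag are hypothetical names standing in for FoLiA's token spacing information. The resulting string plays the role of the STextualDS text, and each offset pair is the kind of information an STextualRelation edge would carry for its token.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical stand-in for a FoLiA token."""
    text: str
    space: bool = True  # assumed: True if a space follows this token


def build_text(tokens):
    """Concatenate tokens into one text and record (start, end) offsets."""
    parts = []
    offsets = []
    pos = 0
    for i, tok in enumerate(tokens):
        start = pos
        parts.append(tok.text)
        pos += len(tok.text)
        offsets.append((start, pos))
        # Only insert a space between tokens, never after the last one.
        if tok.space and i < len(tokens) - 1:
            parts.append(" ")
            pos += 1
    return "".join(parts), offsets


tokens = [
    Token("Hello", space=False),
    Token(","),
    Token("world", space=False),
    Token("!"),
]
text, offsets = build_text(tokens)
print(text)
print(offsets)
```

Because only the spacing flag is consulted, any other whitespace in the original untokenised text (line breaks, double spaces) would not survive, which matches the "only to a certain degree" caveat above.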

@proycon
Owner Author

proycon commented Aug 25, 2020

I think I have a decent converter implementation now. The big question is whether my resulting Salt XML is actually valid and can be parsed by Pepper. Testing that will be the next step (Pepper seems to have a Salt Validator, so that should help). In order to do that, though, I can't get around writing an sCorpusGraph in a file called saltProject.salt.

The next step after that is to see if Pepper's ANNIS conversion is actually usable (or other conversions for that matter). I'll leave that part up to @luutuntin, if you don't mind :)

I'm certainly not expecting lossless conversions when converting from this to all of the output formats Pepper supports. It's hard to achieve that through an intermediate format without knowing the specifics of the input and output formats.

proycon added a commit to proycon/foliatools that referenced this issue Aug 26, 2020
@proycon
Owner Author

proycon commented Aug 26, 2020

Well, now that the conversion is done, I'm trying to get things to validate and process with Pepper, and hopefully fix anything I got wrong in my converter. This proves to be much more difficult than I had anticipated, as I can't even get Pepper to import Salt XML properly. I'm a bit stuck at this point.

I tried building a conversion/validation workflow with three steps: a SaltXML importer, a SaltValidator and a DoNothingExporter. It doesn't look like any documents get processed (it says 0 of 4; how it gets the number '4' is a mystery to me, as there is only one document in my test corpus).

--------------------------- pepper job status ---------------------------
id:                     'la7st384
active documents:       0 of 4
status:                 initializing
- no documents found to display progress -
-------------------------------------------------------------------------

+----------------------------------- step 1 -----------------------------------+
|importer:      SaltXMLImporter                                                |
|path:          file:/home/proycon/exp/pepper/saltin/                          |
|corpus index:  0                                                              |
|properties:                                                                   |
|               pepper.after.reportCorpusGraph:false                                 |
|               pepper.after.tokenize:   false                                 |
|                                                                              |
+----------------------------------- step 2 -----------------------------------+
|manipulator:   SaltValidator                                                  |
|path:          null                                                           |
|properties:                                                                   |
|               pepper.after.reportCorpusGraph:false                                 |
|               pepper.after.tokenize:   false                                 |
|                                                                              |
+----------------------------------- step 3 -----------------------------------+
|exporter:      DoNothingExporter                                              |
|path:          file:/home/proycon/exp/pepper/saltout/                         |
|properties:                                                                   |
|               pepper.after.reportCorpusGraph:false                                 |
|               pepper.after.tokenize:   false                                 |
|                                                                              |
+------------------------------------------------------------------------------+

--------------------------- pepper job status ---------------------------
id:                     'la7st384
active documents:       0 of 4
status:                 ended
- no documents found to display progress -
-------------------------------------------------------------------------

Unfortunately, there's no validation information to go by yet, so I set out to test a similar Pepper pipeline by reimporting Salt XML that Pepper itself had produced (converted from a TCF source). I get almost exactly the same output (0 of 4 documents)...

My initial test corpus (one document) produced by the new converter: https://download.anaproy.nl/foliasalt.tar.gz

@proycon
Owner Author

proycon commented Aug 26, 2020

^--- I cross-posted the issue, with some further context, to the pepper issue tracker.

@proycon proycon added waiting waiting for feedback or another issue/process to finish and removed in progress labels Aug 26, 2020
@ghost

ghost commented Aug 26, 2020

I just looked into https://download.anaproy.nl/foliasalt.tar.gz and found that saltProject.salt is an sDocumentStructure instead of an sCorpusStructure:

<?xml version='1.0' encoding='utf-8'?>
<saltCommon:SaltProject xmlns:sDocumentStructure="sDocumentStructure" xmlns:xmi="http://www.omg.org/XMI" xmlns:saltCore="saltCore" xmlns:saltCommon="saltCommon" xmlns:sCorpusStructure="sCorpusStructure" xmi:version="2.0">

@proycon
Owner Author

proycon commented Aug 26, 2020

It just declares the sDocumentStructure namespace with an identical prefix (which indeed isn't really used in this context, but its presence should be irrelevant); the root tag itself is in the saltCommon namespace. The way Salt uses XML namespaces is a bit weird though: they only pertain to some elements and they are not proper URIs. There is no default XML namespace set in any of the examples.
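
This can be checked with any namespace-aware XML parser. Here is a minimal sketch using Python's standard `xml.etree.ElementTree`; the root element is the one quoted above, self-closed so that it parses on its own, and `SNIPPET` is just an illustrative name.

```python
import xml.etree.ElementTree as ET

# Root element as quoted above, self-closed so that it parses on its own.
SNIPPET = (
    '<saltCommon:SaltProject xmlns:sDocumentStructure="sDocumentStructure" '
    'xmlns:xmi="http://www.omg.org/XMI" xmlns:saltCore="saltCore" '
    'xmlns:saltCommon="saltCommon" xmlns:sCorpusStructure="sCorpusStructure" '
    'xmi:version="2.0"/>'
)

root = ET.fromstring(SNIPPET)

# ElementTree reports qualified names as "{namespace}localname".
namespace, _, localname = root.tag[1:].partition("}")
print(namespace, localname)

# Namespace declarations are not attributes; only xmi:version remains.
print(root.attrib)
```

The declared but unused prefixes do not show up anywhere in the parsed element, so they have no bearing on which namespace the root is in.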

@ghost

ghost commented Aug 26, 2020

Thank you.

@luutuntin

When I compare the foliasalt corpus and other examples, foliasalt doesn't have xmlns:xsi, and therefore uses xmi:type, instead of xsi:type. Does this difference matter?

@proycon
Owner Author

proycon commented Aug 26, 2020

Good point! That's definitely a mistake on my part. These are precisely the things I'd hope a good validator would catch. I'll fix it.

I don't think it's the root cause of the pepper issue because that one also fails if I try to reimport the TCF->Salt corpus.
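
In the absence of a validator that reports this, a quick check along the following lines could flag the difference. It is only a sketch: `type_attribute_report` and the sample document are made up for illustration and are not taken from the foliasalt corpus. The xsi namespace URI is the standard XML Schema instance namespace; the xmi one is the URI from the root element quoted earlier.

```python
import xml.etree.ElementTree as ET

XMI_NS = "http://www.omg.org/XMI"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"


def type_attribute_report(xml_text):
    """Count elements typed via xsi:type versus xmi:type."""
    counts = {"xsi": 0, "xmi": 0}
    for element in ET.fromstring(xml_text).iter():
        if f"{{{XSI_NS}}}type" in element.attrib:
            counts["xsi"] += 1
        if f"{{{XMI_NS}}}type" in element.attrib:
            counts["xmi"] += 1
    return counts


# Made-up two-node document, not taken from the foliasalt corpus.
SAMPLE = """
<root xmlns:xmi="http://www.omg.org/XMI"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <nodes xsi:type="sDocumentStructure:STextualDS"/>
  <nodes xmi:type="sDocumentStructure:SToken"/>
</root>
"""
report = type_attribute_report(SAMPLE)
print(report)
```

A non-zero `xmi` count would point at elements that still need to be switched to `xsi:type`.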

proycon added a commit to proycon/foliatools that referenced this issue Aug 26, 2020
@proycon
Owner Author

proycon commented Aug 26, 2020

OK, that did help! We have some progress! The original error is gone (for now) and I get a Java traceback, so it's definitely trying to parse more. Unfortunately the feedback isn't very verbose, so it'll be a bit tricky to pinpoint exactly where the culprit is.

full stack trace:
org.corpus_tools.pepper.modules.exceptions.PepperModuleException: Failed to import corpus by module. Nested exception was:
        at org.corpus_tools.pepper.core.PepperJobImpl.importCorpusStructures(PepperJobImpl.java:594)
        at org.corpus_tools.pepper.core.PepperJobImpl.convert(PepperJobImpl.java:930)
        at org.corpus_tools.pepper.cli.PepperStarter.convert(PepperStarter.java:534)
        at org.corpus_tools.pepper.cli.PepperStarter.main(PepperStarter.java:1437)
Caused by: org.corpus_tools.salt.exceptions.SaltResourceException: Cannot find a target node '//@nodes.1' for relation.
        at org.corpus_tools.salt.util.internal.persistence.SaltXML10Handler.startElement(SaltXML10Handler.java:247)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(AbstractSAXParser.java:510)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanStartElement(XMLDocumentFragmentScannerImpl.java:1397)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2710)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:605)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:534)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:888)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:824)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1216)
        at java.xml/com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:635)
        at org.corpus_tools.salt.util.SaltUtil.loadObjects(SaltUtil.java:483)
        at org.corpus_tools.salt.util.SaltUtil.load(SaltUtil.java:434)
        at org.corpus_tools.salt.util.SaltUtil.loadCorpusGraph(SaltUtil.java:720)
        at org.corpus_tools.salt.util.SaltUtil.loadCorpusGraph(SaltUtil.java:687)
        at org.corpus_tools.salt.common.impl.SCorpusGraphImpl.load(SCorpusGraphImpl.java:372)
        at org.corpus_tools.pepper.modules.coreModules.SaltXMLImporter.importCorpusStructure(SaltXMLImporter.java:106)
        at org.corpus_tools.pepper.core.ModuleControllerImpl$1.run(ModuleControllerImpl.java:245)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)

proycon added a commit to proycon/foliatools that referenced this issue Aug 26, 2020
@proycon
Owner Author

proycon commented Aug 26, 2020

Ok, further progress, the one above is solved too. I missed a few xsi:type attributes. As long as I get parsing tracebacks now I can hopefully pinpoint and fix it.
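
For readers hitting a similar "Cannot find a target node" message, here is a rough sketch of a pre-check on node references. It assumes, based only on the error message above, that references have the form `//@nodes.N`, and it further assumes that N is a zero-based index into the list of nodes. `dangling_references` and the edge dictionaries are hypothetical. In this thread the cause turned out to be missing xsi:type attributes, which this check would not catch; it only covers the simpler case of an index that is out of range or malformed.

```python
import re

# Assumed reference form, taken from the error message: "//@nodes.N"
NODE_REF = re.compile(r"^//@nodes\.(\d+)$")


def dangling_references(node_count, edges):
    """Return (edge_index, attribute, reference) triples that do not resolve.

    `edges` is a list of dicts with "source" and "target" references.
    N is assumed to be a zero-based index into the node list.
    """
    problems = []
    for i, edge in enumerate(edges):
        for attr in ("source", "target"):
            ref = edge.get(attr, "")
            match = NODE_REF.match(ref)
            if match is None or int(match.group(1)) >= node_count:
                problems.append((i, attr, ref))
    return problems


edges = [
    {"source": "//@nodes.0", "target": "//@nodes.1"},
    {"source": "//@nodes.0", "target": "//@nodes.2"},
]
problems = dangling_references(2, edges)
print(problems)
```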

@proycon proycon added in progress and removed waiting waiting for feedback or another issue/process to finish labels Aug 26, 2020
@luutuntin

Great. We are moving steadily.

proycon added a commit to proycon/foliatools that referenced this issue Aug 26, 2020
@proycon
Owner Author

proycon commented Aug 26, 2020

I solved a few parser errors and now I'm back at the same '0 of 4' situation we started with. But at least now I can be assured that it did some parsing (even though the output doesn't really show that). I'll try some conversion (e.g. annis) to see how that looks.

PS: I updated https://download.anaproy.nl/foliasalt.tar.gz with the new results.

@proycon
Owner Author

proycon commented Sep 2, 2020

The annis conversion seems to work although pepper does raise one exception for which the cause is unclear to me:

Exception in thread "pool-5-thread-1" org.corpus_tools.salt.exceptions.SaltException: An exception occured while traversing the graph 'salt:/foliacorpus/example.deep' with path 'null'. because of null.
        at org.corpus_tools.salt.core.impl.GraphTraverserModule$Traverser.run(GraphTraverserModule.java:486)
        at org.corpus_tools.salt.core.impl.GraphTraverserModule.traverse(GraphTraverserModule.java:173)
        at org.corpus_tools.salt.core.impl.SGraphImpl.traverse(SGraphImpl.java:241)
        at org.corpus_tools.salt.core.impl.SGraphImpl.traverse(SGraphImpl.java:232)
        at org.corpus_tools.peppermodules.annis.SSpanningRelation2ANNISMapper.run(SSpanningRelation2ANNISMapper.java:82)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.NullPointerException
        at org.corpus_tools.peppermodules.annis.SRelation2ANNISMapper.writeNodeTabEntry(SRelation2ANNISMapper.java:742)
        at org.corpus_tools.peppermodules.annis.SRelation2ANNISMapper.mapSNode(SRelation2ANNISMapper.java:715)
        at org.corpus_tools.peppermodules.annis.SRelation2ANNISMapper.mapSNode(SRelation2ANNISMapper.java:496)
        at org.corpus_tools.peppermodules.annis.SRelation2ANNISMapper.nodeReached(SRelation2ANNISMapper.java:310)
        at org.corpus_tools.peppermodules.annis.SSpanningRelation2ANNISMapper.nodeReached(SSpanningRelation2ANNISMapper.java:162)
        at org.corpus_tools.salt.core.impl.GraphTraverserModule$Traverser.run(GraphTraverserModule.java:391)
        ... 7 more

I tried some other conversions too:

  • Conversion to TCF produced only a token and sentence layer, losing all the annotations.
  • Conversion to PAULA failed with another error and looks incomplete.

Unfortunately the error messages are often too cryptic and make no clear reference to the actual salt input that failed. The Salt validator in Pepper also didn't lead to any output, so I assume it considers everything okay.

@proycon
Owner Author

proycon commented Sep 2, 2020

A first version of folia2salt is now released as part of foliatools v2.3.0. It is still to be considered highly experimental, though.

@proycon proycon added waiting waiting for feedback or another issue/process to finish and removed in progress labels Sep 2, 2020
@parkervg

parkervg commented Jan 11, 2021

Hi,

I took the example foliasalt document (https://download.anaproy.nl/foliasalt.tar.gz) and ran it through a simple Pepper workflow file that converts SaltXML to ANNIS: salt_to_annis.pepper.zip.

Running that, we still get the same ominous '0 of 4' message, in addition to some other error logs:

Exception in thread "pool-4-thread-1" org.corpus_tools.salt.exceptions.SaltException: An exception occured while traversing the graph 'salt:/foliacorpus/example.deep' with path 'null'. because of null.
        at org.corpus_tools.salt.core.impl.GraphTraverserModule$Traverser.run(GraphTraverserModule.java:486)
        at org.corpus_tools.salt.core.impl.GraphTraverserModule.traverse(GraphTraverserModule.java:173)
        at org.corpus_tools.salt.core.impl.SGraphImpl.traverse(SGraphImpl.java:241)
        at org.corpus_tools.salt.core.impl.SGraphImpl.traverse(SGraphImpl.java:232)
        at org.corpus_tools.peppermodules.annis.SSpanningRelation2ANNISMapper.run(SSpanningRelation2ANNISMapper.java:82)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
        at java.base/java.lang.Thread.run(Thread.java:832)
Caused by: java.lang.NullPointerException
        at org.corpus_tools.peppermodules.annis.SRelation2ANNISMapper.writeNodeTabEntry(SRelation2ANNISMapper.java:742)
        at org.corpus_tools.peppermodules.annis.SRelation2ANNISMapper.mapSNode(SRelation2ANNISMapper.java:715)
        at org.corpus_tools.peppermodules.annis.SRelation2ANNISMapper.mapSNode(SRelation2ANNISMapper.java:496)
        at org.corpus_tools.peppermodules.annis.SRelation2ANNISMapper.nodeReached(SRelation2ANNISMapper.java:310)
        at org.corpus_tools.peppermodules.annis.SSpanningRelation2ANNISMapper.nodeReached(SSpanningRelation2ANNISMapper.java:162)
        at org.corpus_tools.salt.core.impl.GraphTraverserModule$Traverser.run(GraphTraverserModule.java:391)
        ... 7 more
replaced invalid ANNIS identifier FoLiA::pos::https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/frog-mbpos-cgn with FoLiA%3A%3Apos%3A%3Ahttps%3A%2F%2Fraw%2Egithubusercontent%2Ecom%2Fproycon%2Ffolia%2Fmaster%2Fsetdefinitions%2Ffrog-mbpos-cgn
replaced invalid ANNIS identifier FoLiA::token::https://raw.githubusercontent.com/LanguageMachines/uctodata/folia1.4/setdefinitions/tokconfig-nld.foliaset.ttl with FoLiA%3A%3Atoken%3A%3Ahttps%3A%2F%2Fraw%2Egithubusercontent%2Ecom%2FLanguageMachines%2Fuctodata%2Ffolia1%2E4%2Fsetdefinitions%2Ftokconfig-nld%2Efoliaset%2Ettl
replaced invalid ANNIS identifier feature/head with feature%2Fhead
replaced invalid ANNIS identifier feature/spectype with feature%2Fspectype
replaced invalid ANNIS identifier FoLiA::lemma::https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/frog-mblem-nl with FoLiA%3A%3Alemma%3A%3Ahttps%3A%2F%2Fraw%2Egithubusercontent%2Ecom%2Fproycon%2Ffolia%2Fmaster%2Fsetdefinitions%2Ffrog-mblem-nl
replaced invalid ANNIS identifier feature/pvtijd with feature%2Fpvtijd
replaced invalid ANNIS identifier feature/pvagr with feature%2Fpvagr
replaced invalid ANNIS identifier feature/wvorm with feature%2Fwvorm
replaced invalid ANNIS identifier feature/vztype with feature%2Fvztype
replaced invalid ANNIS identifier feature/npagr with feature%2Fnpagr
replaced invalid ANNIS identifier feature/lwtype with feature%2Flwtype
replaced invalid ANNIS identifier feature/naamval with feature%2Fnaamval
replaced invalid ANNIS identifier feature/positie with feature%2Fpositie
replaced invalid ANNIS identifier feature/numtype with feature%2Fnumtype
replaced invalid ANNIS identifier feature/conjtype with feature%2Fconjtype
replaced invalid ANNIS identifier feature/getal with feature%2Fgetal
replaced invalid ANNIS identifier feature/genus with feature%2Fgenus
replaced invalid ANNIS identifier feature/ntype with feature%2Fntype
replaced invalid ANNIS identifier feature/graad with feature%2Fgraad
replaced invalid ANNIS identifier feature/buiging with feature%2Fbuiging
replaced invalid ANNIS identifier feature/vwtype with feature%2Fvwtype
replaced invalid ANNIS identifier feature/persoon with feature%2Fpersoon
replaced invalid ANNIS identifier feature/pdtype with feature%2Fpdtype
replaced invalid ANNIS identifier feature/status with feature%2Fstatus
replaced invalid ANNIS identifier feature/getal-n with feature%2Fgetal-n

We have an Annis deployment up at https://annis.ling.brandeis.edu/annis-gui/ that you can check out; the Annis corpus from the resulting Pepper conversion is loaded up as foliacorpus. Despite it being loaded up without an error, you can see that the node annotations are malformed:
[Screenshot: Screen Shot 2021-01-11 at 1 27 10 PM]

I'll try doing some digging into why this happens, but wanted to add a log here with all these resources in case you have any ideas for fixing this behavior.

@proycon
Owner Author

proycon commented Jan 13, 2021

Thanks for the feedback. As you can see, the conversion is still very experimental. I encode things pretty verbosely in Salt, and use its namespace functionality extensively to encode both the annotation types and the FoLiA sets, as I want to preserve as much data as possible. Salt is not very prescriptive, so I took some liberties without knowing exactly how they would translate in further conversion steps. There is even some duplication in the data: you might already have enough information if you just look at the 'simplified' annotations in the salt namespace with names such as pos and lemma, which summarize some of the more complex fields. You can even instruct folia2salt to output only these simplified annotations and omit all others (but you will lose information, and there may be clashes if there are, for example, multiple pos tags in multiple sets).

So I wouldn't say the annotations are 'malformed', but some of these identifiers don't translate well to ANNIS, and it seems Pepper URL-encoded them. I agree they're not very interpretable for end users this way. Does the simplified annotation show in the interface too? (I admit I know virtually nothing about ANNIS itself.)
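
The replacements in the log are consistent with plain percent-encoding, which means the original identifiers can be recovered. The sketch below is not Pepper's code; `encode_identifier` is a hypothetical helper that merely reproduces the pattern visible in the log lines above, including the fact that dots appear as `%2E` there.

```python
from urllib.parse import quote, unquote


def encode_identifier(name):
    """Percent-encode everything except letters, digits, '-', '_' and '~'."""
    # quote() leaves '.' untouched, but the log shows it encoded as %2E.
    return quote(name, safe="").replace(".", "%2E")


original = (
    "FoLiA::pos::https://raw.githubusercontent.com/proycon/folia/"
    "master/setdefinitions/frog-mbpos-cgn"
)
encoded = encode_identifier(original)
decoded = unquote(encoded)

print(encode_identifier("feature/head"))
print(encoded)
print(decoded == original)
```

Decoding with `unquote` gives back the namespaced identifier, so a display layer could in principle show the readable form even if the stored name stays encoded.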
