Interoperability with Salt, Pepper and ANNIS #85
Here's an additional salt example, converted from the TCF v0.4 example on their website: https://download.anaproy.nl/tcf04-karin-wl.salt |
And this is an example of document-structure (vs corpus-structure), I suppose. |
…tributes, and features (proycon/folia#85)
…tokens) + refactoring (proycon/folia#85)
This comment tracks the current state of the folia2salt implementation in foliatools. Not all is a priority and some may not be implemented for the time being:
I think I have a decent converter implementation now. The big question is whether my resulting Salt XML is actually valid and can be parsed by Pepper. Testing that will be the next step (Pepper seems to have a Salt Validator, so that should help). To do that, though, I can't get around writing a sCorpusGraph in a file called saltProject.salt. The step after that is to see whether Pepper's ANNIS conversion is actually usable (or other conversions for that matter); I'll leave that part up to @luutuntin if you don't mind :) I'm certainly not expecting lossless conversions when converting from this to all of the output formats Pepper supports. That is hard to achieve through an intermediate format without knowing the specifics of the input and output format.
…lso implemented metadata conversion (proycon/folia#85)
Well, now that the conversion is done I'm trying to get things to validate and process with Pepper, and hopefully fix anything I got wrong in my converter. This is proving much more difficult than I had anticipated, as I can't even get Pepper to import Salt XML properly. I'm a bit stuck at this point. I tried building a conversion/validation workflow with three steps: a SaltXML importer, a SaltValidator and a DoNothingExporter. It doesn't look like any documents get processed (it says 0 of 4; how it gets the number '4' is a mystery to me, as there is only one document in my test corpus).
Unfortunately, there's not really any validation information to go by yet, so I set out to test a similar pepper pipeline by reimporting salt XML pepper itself outputted (conversion from TCF source). I get almost exactly the same output (0 of 4 documents)... My initial test corpus (one document) outputted by the new converter: https://download.anaproy.nl/foliasalt.tar.gz |
^--- I cross-posted the issue, with some further context, to the pepper issue tracker. |
I just looked into https://download.anaproy.nl/foliasalt.tar.gz and found that saltProject.salt is a sDocumentStructure instead of sCorpusStructure:
<?xml version='1.0' encoding='utf-8'?>
<saltCommon:SaltProject xmlns:sDocumentStructure="sDocumentStructure" xmlns:xmi="http://www.omg.org/XMI" xmlns:saltCore="saltCore" xmlns:saltCommon="saltCommon" xmlns:sCorpusStructure="sCorpusStructure" xmi:version="2.0">
It just declares the sDocumentStructure namespace under an identically named prefix (which indeed isn't used in this context, but its presence should be irrelevant); the root tag itself is in the saltCommon namespace. The way Salt uses XML namespaces is a bit weird, though: they only pertain to some elements and they are not proper URIs. There is no default XML namespace set in any of the examples.
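To make the namespace point concrete, here is a minimal Python sketch using only the standard library. It parses the root element quoted above and reports which namespace the root tag actually belongs to, independent of any other prefixes that happen to be declared. The helper name root_namespace_and_name is made up for this illustration, and the root element is self-closed here only so that the snippet parses on its own.

```python
import xml.etree.ElementTree as ET

# Root element as quoted in the thread; self-closed here (an assumption for
# the sake of a standalone example) so that it is well-formed by itself.
SNIPPET = (
    "<?xml version='1.0' encoding='utf-8'?>\n"
    '<saltCommon:SaltProject xmlns:sDocumentStructure="sDocumentStructure" '
    'xmlns:xmi="http://www.omg.org/XMI" xmlns:saltCore="saltCore" '
    'xmlns:saltCommon="saltCommon" xmlns:sCorpusStructure="sCorpusStructure" '
    'xmi:version="2.0"/>'
)


def root_namespace_and_name(xml_text):
    """Return (namespace, local name) of the root element (hypothetical helper)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    # ElementTree expands prefixed tags to the form "{namespace}localname".
    if root.tag.startswith("{"):
        namespace, local = root.tag[1:].split("}", 1)
        return namespace, local
    return "", root.tag


print(root_namespace_and_name(SNIPPET))
```

Declaring xmlns:sDocumentStructure does not put the root element into that namespace; only the prefix on the tag itself decides that.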
Thank you. |
When I compare the foliasalt corpus and other examples, foliasalt doesn't have |
Good point! That's definitely a mistake on my part. These are precisely the things I'd hope a good validator would catch. I'll fix it. I don't think it's the root cause of the Pepper issue, because that one also fails if I try to reimport the TCF->Salt corpus.
ok, that did help! we have some progress! The original error is gone (for now) and I get a java traceback error, so it's definitely trying to parse more. The feedback isn't very verbose unfortunately so it'll be a bit tricky to pinpoint exactly where the culprit is.
Ok, further progress, the one above is solved too. I missed a few xsi:type attributes. As long as I get parsing tracebacks now I can hopefully pinpoint and fix it. |
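Since Pepper's feedback is sparse, a quick pre-check outside Pepper can help spot this class of problem. The following is only a sketch: it assumes that the elements which should carry xsi:type are the unprefixed nodes, edges and labels elements, which is an assumption and not something confirmed in this thread. The function name missing_xsi_type and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Expanded name of the xsi:type attribute as ElementTree reports it.
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

# Hypothetical, heavily shortened document; not taken from the real corpus.
SAMPLE = """<sDocumentStructure:SDocumentGraph
    xmlns:sDocumentStructure="sDocumentStructure"
    xmlns:saltCore="saltCore"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <nodes xsi:type="sDocumentStructure:STextualDS">
    <labels namespace="salt" name="SNAME" value="T::text"/>
  </nodes>
  <nodes>
    <labels xsi:type="saltCore:SFeature" namespace="salt" name="SNAME" value="T::tok1"/>
  </nodes>
</sDocumentStructure:SDocumentGraph>"""


def missing_xsi_type(xml_text, tags=("nodes", "edges", "labels")):
    """List (tag, attributes) for elements in `tags` that lack xsi:type.

    Which tags need the attribute is an assumption; adjust `tags` as needed.
    """
    root = ET.fromstring(xml_text.encode("utf-8"))
    missing = []
    for elem in root.iter():  # document order
        if elem.tag in tags and XSI_TYPE not in elem.attrib:
            missing.append((elem.tag, dict(elem.attrib)))
    return missing


for tag, attrs in missing_xsi_type(SAMPLE):
    print(tag, attrs)
```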
Great. We are moving steadily. |
I solved a few parser errors and now I'm back at the same '0 of 4' situation we started with. But at least now I can be assured that it did some parsing (even though the output doesn't really show that). I'll try some conversion (e.g. annis) to see how that looks. PS: I updated https://download.anaproy.nl/foliasalt.tar.gz with the new results. |
The annis conversion seems to work although pepper does raise one exception for which the cause is unclear to me:
I tried some other conversions too:
Unfortunately the error messages are often too cryptic and make no clear reference to the actual salt input that failed. The Salt validator in Pepper also didn't lead to any output, so I assume it considers everything okay. |
A first version of folia2salt is now released as part of foliatools v2.3.0. It is still to be considered highly experimental, though.
Hi, I've tried taking the example foliasalt document (https://download.anaproy.nl/foliasalt.tar.gz) and ran it through a simple Pepper workflow file to convert SaltXML to Annis, salt_to_annis.pepper.zip. Running that, we still get the same ominous '0 of 4' message, in addition to some other error logs:
We have an Annis deployment up at https://annis.ling.brandeis.edu/annis-gui/ that you can check out; the Annis corpus from the resulting Pepper conversion is loaded up as foliacorpus. Despite it being loaded up without an error, you can see that the node annotations are malformed: I'll try doing some digging into why this happens, but wanted to add a log here with all these resources in case you have any ideas for fixing this behavior. |
Thanks for the feedback. As you can see, the conversion is still very experimental. I encode things pretty verbosely in Salt, and use its namespace functionality extensively to encode both the annotation types and the FoLiA sets, as I want to preserve as much data as possible. Salt is not very prescriptive, so I took some liberties without knowing exactly how they would translate in further conversion steps. There is even some duplication in the data: you might already have enough information if you just look at the 'simplified' annotations that are in the salt namespace with names such as pos and lemma, which summarize some of the more complex fields. You can even instruct folia2salt to output only these simplified annotations and omit all others (but you will lose information, and there may be clashes if there are, for example, multiple pos tags in multiple sets). So I wouldn't say the annotations are 'malformed', but some of these identifiers don't translate well to ANNIS and it seems Pepper URL-encoded them. I agree they're not very interpretable for end users this way. Does the simplified annotation show in the interface too? (I admit I know virtually nothing about ANNIS itself.)
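To illustrate why such identifiers end up hard to read, here is a small Python sketch of generic percent-encoding with the standard library. It only shows what URL-encoding does to a URL-like identifier; whether Pepper applies exactly this encoding is not confirmed here, and the set URL below is a made-up example, not a real FoLiA set.

```python
from urllib.parse import quote, unquote

# Hypothetical set identifier, used only for illustration.
SET_ID = "https://example.org/sets/pos"


def percent_encode(identifier):
    """Percent-encode every reserved character, including ':' and '/'."""
    return quote(identifier, safe="")


encoded = percent_encode(SET_ID)
print(encoded)           # the form an end user would see
print(unquote(encoded))  # decoding recovers the original identifier
```

The encoding is reversible, so no information is lost; the result is just not pleasant to read or type in a query interface.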
This issue came up in discussions with @luutuntin who was looking for a search and retrieval tool capable of handling FoLiA. There is some FoLiA support in both Blacklab and MTAS, but both may not sufficiently cover all of FoLiA's expressive abilities (tree handling in particular).
ANNIS is another well-developed and interesting solution, but right now there is no FoLiA support. ANNIS relies on a conversion tool called Pepper to support a great variety of input formats. Pepper in turn uses a low-level graph-based model called Salt as its intermediate model, which in turn can export to a variety of formats again (including ANNIS' format).
To enhance interoperability, it would be a good idea to implement conversion from FoLiA to the Salt model (and possibly vice versa, but with much lower priority).
To write such a converter we could:
Update: we are picking option 2