FoLiA to Salt XML: attempting to build a new convertor but major problem setting up a validation pipeline with pepper #131

Closed
proycon opened this issue Aug 26, 2020 · 1 comment

proycon commented Aug 26, 2020

I could use some help, and I'm hoping this project is still maintained. I like the Salt model and Pepper's aim of providing conversions between different formats. To facilitate interoperability, I am writing a convertor from FoLiA to Salt (see proycon/folia#85). FoLiA is a rich XML-based format for linguistic annotation, popular in the Netherlands and Flanders and developed in the scope of CLARIN/CLARIAH. I decided to implement the convertor outside of Pepper, as part of foliatools, because I'm not a fan of Java and wanted to leverage the existing Python-based FoLiA library. So I'm outputting Salt XML myself.

The Salt model is documented, but the Salt XML serialization isn't really, so I relied mostly on existing examples that Pepper had output, such as a conversion from a TCF example.

Now that my convertor is implemented, I'm trying to get its output to validate and process with Pepper, and hopefully to resolve anything I got wrong in the convertor. This is proving much more difficult than I had anticipated, as I can't even get Pepper to import Salt XML properly. I'm a bit stuck at this point and hoping to get some help here.

I tried building a conversion/validation workflow with three steps, a SaltXML importer, a SaltValidator and a DoNothingExporter. It doesn't look like any documents get processed (it says 0 of 4, how it gets the number '4' is a mystery to me as there is only one document in my test corpus). To me this looks like nothing got processed or validated:

--------------------------- pepper job status ---------------------------
id:                     'la7st384
active documents:       0 of 4
status:                 initializing
- no documents found to display progress -
-------------------------------------------------------------------------

+----------------------------------- step 1 -----------------------------------+
|importer:      SaltXMLImporter                                                |
|path:          file:/home/proycon/exp/pepper/saltin/                          |
|corpus index:  0                                                              |
|properties:                                                                   |
|               pepper.after.reportCorpusGraph:false                                 |
|               pepper.after.tokenize:   false                                 |
|                                                                              |
+----------------------------------- step 2 -----------------------------------+
|manipulator:   SaltValidator                                                  |
|path:          null                                                           |
|properties:                                                                   |
|               pepper.after.reportCorpusGraph:false                                 |
|               pepper.after.tokenize:   false                                 |
|                                                                              |
+----------------------------------- step 3 -----------------------------------+
|exporter:      DoNothingExporter                                              |
|path:          file:/home/proycon/exp/pepper/saltout/                         |
|properties:                                                                   |
|               pepper.after.reportCorpusGraph:false                                 |
|               pepper.after.tokenize:   false                                 |
|                                                                              |
+------------------------------------------------------------------------------+

--------------------------- pepper job status ---------------------------
id:                     'la7st384
active documents:       0 of 4
status:                 ended
- no documents found to display progress -
-------------------------------------------------------------------------

Unfortunately, there's no validation information to go by yet, so I set out to test a similar Pepper pipeline by reimporting Salt XML that Pepper itself had output (a conversion from a TCF source). That way I can test Pepper as-is and rule out that my convertor did anything wrong. I get almost exactly the same output (0 of 4 documents) in that case.

My initial test corpus (one document) outputted by the new converter: https://download.anaproy.nl/foliasalt.tar.gz
Salt output after conversion from TCF using Pepper (my reference example): https://download.anaproy.nl/tcfsalt.tar.gz

I'm hoping someone can point me at what's going wrong? How can I properly validate my Salt XML output?
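
As a first sanity check before involving Pepper at all, something like the sketch below could confirm that every file in the corpus directory at least parses as XML. It is only a well-formedness check, not Salt validation, and it says nothing about whether Pepper will accept the files. The function name `check_wellformed` and the `*.salt` file pattern are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def check_wellformed(corpus_dir, pattern="*.salt"):
    """Return (path, error message) pairs for files that fail to parse as XML.

    Assumption: the Salt XML files in the corpus directory match `pattern`.
    This checks XML well-formedness only, not conformance to the Salt model.
    """
    failures = []
    # Walk the directory tree recursively so nested corpus folders are covered.
    for path in sorted(Path(corpus_dir).rglob(pattern)):
        try:
            ET.parse(str(path))
        except ET.ParseError as exc:
            failures.append((str(path), str(exc)))
    return failures
```

Calling `check_wellformed("saltin")` on the importer's input directory would return an empty list if all matching files parse; any entries in the list name the file and the parser's error message.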

proycon commented Aug 26, 2020

Pepper's output is a bit confusing, and the validation option I enabled did not explicitly report anything afaik,
but after fixing some problems in my convertor, it does seem to get parsed and I can get some conversions.

Closing this for now.

proycon closed this as completed Aug 26, 2020