comprehensive linguistic annotation #105

osherenko · 2022-11-07T19:21:17Z

I am developing an annotation and would like to specify, for particular words, not only lemmas and POS but also etymological or morphological information. How should I do it?

proycon · 2022-11-09T19:49:52Z

For morphological information, check the respective section in the documentation: Morphological annotation

For data on etymology there is no specific element, and I'd probably need to see an example of what such an annotation looks like in your data to give the best advice. It might be added as a higher-order feature using <feat> with one of the existing types.

osherenko · 2022-11-10T09:56:35Z

How would you annotate the morphology of the irregular verb "went"? "go" is not a part of it.

BTW, there is a broken link to the API class: https://folia.readthedocs.io/en/latest/morphological_annotation.html doesn't exist.

I want to specify etymological data within a word annotation as its definition from a dictionary, similar to the annotation of the phoneme in an utterance:

<utt xml:id="example.utt.1" src="helloworld.mp3" begintime="00:00:01.000" endtime="00:00:02.000">
  <ph>helˈoʊ wɝːld</ph>
  <w xml:id="example.utt.1.w.1" begintime="00:00:00.000" endtime="00:00:01.000">
    <ph>helˈoʊ</ph>
    <etymology>early 19th century: variant of earlier hollo ; related to holla.</etymology>
  </w>
  <w xml:id="example.utt.1.w.2" begintime="00:00:01.000" endtime="00:00:02.000">
    <ph>wɝːld</ph>
  </w>
</utt>

Moreover, I want to specify linguistic information in the sentence annotation, such as dependencies and the grammatical parse. For example, the sentence annotation of "The strongest rain ever recorded in India..." (https://nlp.stanford.edu/software/lex-parser.shtml#Sample) should include the grammatical tree

(ROOT (S (S (NP (NP (DT The) (JJS strongest) (NN rain)) (VP (ADVP (RB ever)) (VBN recorded) (PP (IN in) (NP (NNP India)))))

and the dependencies:

det(rain-3, The-1)
amod(rain-3, strongest-2)
nsubj(shut-8, rain-3)
nsubj(snapped-16, rain-3)
nsubj(closed-20, rain-3)
nsubj(forced-23, rain-3)
advmod(recorded-5, ever-4)
partmod(rain-3, recorded-5)
prep_in(recorded-5, India-7)


osherenko · 2022-11-10T11:19:06Z

Probably, the etymology annotation can resemble the sense annotation https://folia.readthedocs.io/en/latest/sense_annotation.html

proycon · 2022-11-14T12:50:32Z

> How would you annotate the morphology of the irregular verb "went"? "go" is not a part of it.

True, "go" would be the lemma, in the simplest annotation form:

<w>
  <t>went</t>
  <lemma class="go" />
</w>

If you want to express it in a morphological structure, you could do the following:

<w>
  <t>went</t>
  <lemma class="go" />
  <morphology>
    <morpheme class="stem">
       <t>went</t>
       <lemma class="go" />
       <pos class="V" />
    </morpheme>
  </morphology>
</w>

There's a bit of duplication here to show that you can choose on what level you want to express certain things (like pos/lemma/sense). Most of what you can express on the word level is also valid on the morpheme level.

Note: the classes belong to a user-defined set definition, FoLiA itself does not prescribe them.
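To make the two levels concrete, here is a minimal Python sketch that reads the example above with the standard library's xml.etree.ElementTree instead of a FoLiA library. It assumes the snippet is used as-is, without the FoLiA namespace or set declarations a real document would carry, and the helper name lemma_classes is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The word from the example above, without namespace or set declarations.
WORD_XML = """<w>
  <t>went</t>
  <lemma class="go" />
  <morphology>
    <morpheme class="stem">
       <t>went</t>
       <lemma class="go" />
       <pos class="V" />
    </morpheme>
  </morphology>
</w>"""


def lemma_classes(word):
    """Return (word-level lemma class, list of morpheme-level lemma classes)."""
    word_lemma = word.find("lemma")  # direct child of <w> only
    morpheme_lemmas = [
        lemma.get("class")
        for lemma in word.findall("./morphology/morpheme/lemma")
    ]
    word_class = word_lemma.get("class") if word_lemma is not None else None
    return word_class, morpheme_lemmas


word = ET.fromstring(WORD_XML)
print(lemma_classes(word))
```

Both levels report "go" here, which is the duplication mentioned above; dropping either <lemma> element would leave the other level intact.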

There are some further examples at https://folia.readthedocs.io/en/latest/morphological_annotation.html , which show that you can also nest morphemes and associate extra features with them (such as function, or any other you can invent). This is also the place where you might want to express etymology:

<w>
  <t>went</t>
  <lemma class="go" />
  <morphology>
    <morpheme class="stem">
       <t>went</t>
       <lemma class="go" />
       <pos class="V" />
       <feat subset="etymology" class="wenden" />
    </morpheme>
  </morphology>
</w>

In this case it's an extra feature on the morpheme. Rather than using the full description I opted for a shorter 'class' here, for which some external database would ideally provide a full definition (and the set of your morphology annotation would determine what that database is). But of course you may also just put the full definition in as the class.
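As a rough illustration of resolving such a short class against an external resource, the following Python sketch looks up every etymology feature in a plain dictionary. The dictionary ETYMOLOGY_DB and its placeholder text are hypothetical stand-ins for whatever database the set would point to, and the XML is parsed with the standard library rather than a FoLiA library.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the external database; the text is a placeholder.
ETYMOLOGY_DB = {
    "wenden": "placeholder definition for the class 'wenden'",
}

MORPHEME_XML = """<morpheme class="stem">
  <t>went</t>
  <lemma class="go" />
  <pos class="V" />
  <feat subset="etymology" class="wenden" />
</morpheme>"""


def etymologies(element, database):
    """Return (class, definition or None) for each etymology feature found."""
    result = []
    for feat in element.iter("feat"):
        if feat.get("subset") == "etymology":
            cls = feat.get("class")
            result.append((cls, database.get(cls)))
    return result


morpheme = ET.fromstring(MORPHEME_XML)
for cls, definition in etymologies(morpheme, ETYMOLOGY_DB):
    print(cls, "->", definition)
```

A class that is missing from the database comes back with None, so the annotation stays valid even when no definition is available.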

> Moreover, I want to specify linguistic information in the sentence annotation such as dependencies and the grammatical parse.

FoLiA accommodates these through dependency annotation and syntactic annotation, respectively.
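Before flat Stanford-style dependency lines like the ones quoted earlier in this thread can be written as dependency annotation, they have to be split into relation, head and dependent. The Python sketch below covers only that parsing step and does not produce FoLiA XML; the Dependency tuple and the function name are illustrative.

```python
import re
from typing import NamedTuple


class Dependency(NamedTuple):
    relation: str
    head: str
    head_index: int
    dependent: str
    dependent_index: int


# Matches e.g. "det(rain-3, The-1)": relation(head-index, dependent-index)
DEP_PATTERN = re.compile(r"(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)")


def parse_dependencies(text):
    """Parse all Stanford-style dependency triples found in text."""
    return [
        Dependency(rel, head, int(head_i), dep, int(dep_i))
        for rel, head, head_i, dep, dep_i in DEP_PATTERN.findall(text)
    ]


SAMPLE = "det(rain-3, The-1) amod(rain-3, strongest-2) advmod(recorded-5, ever-4)"

for dependency in parse_dependencies(SAMPLE):
    print(dependency)
```

The token indices are kept because they identify which word in the sentence each head and dependent refers to, which is what a span-style annotation needs.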

> BTW, there is a broken link to the API class -- https://folia.readthedocs.io/en/latest/morphological_annotation.html doesn't exist.

Oops indeed, thanks! I have corrected the mistake.

I hope this provides some more clarity?

proycon · 2022-11-14T12:59:51Z

One of the points arising from this issue is whether we want to introduce an explicit <etymology> element in FoLiA. It would be an inline annotation element very similar to <sense>. We could then do <etymology class="wenden" />, which may be nicer than using a feature structure.
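To show how the two notations relate, here is a small Python sketch that rewrites the feature-based form into the proposed element. It uses only the standard library, ignores namespaces and set declarations, and the function name promote_etymology is invented for this illustration.

```python
import xml.etree.ElementTree as ET

MORPHEME_XML = """<morpheme class="stem">
  <t>went</t>
  <lemma class="go" />
  <pos class="V" />
  <feat subset="etymology" class="wenden" />
</morpheme>"""


def promote_etymology(root):
    """Replace <feat subset="etymology"> children by <etymology> elements.

    Works in place on the descendants of root and returns the number of
    replacements made.
    """
    replaced = 0
    for parent in list(root.iter()):
        for position, child in enumerate(list(parent)):
            if child.tag == "feat" and child.get("subset") == "etymology":
                etymology = ET.Element("etymology", {"class": child.get("class", "")})
                etymology.tail = child.tail  # keep the original whitespace
                parent[position] = etymology
                replaced += 1
    return replaced


morpheme = ET.fromstring(MORPHEME_XML)
promote_etymology(morpheme)
print(ET.tostring(morpheme, encoding="unicode"))
```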

osherenko · 2022-11-16T08:50:25Z

An explicit etymology tag is indeed nicer.

Another question is about multimodal injections in annotations, similar to Example 1.7.1 in the documentation. Can FoLiA accept multimodal injections? You spoke about OCR corrections. My annotation should annotate texts not only lexically, but also statistically, semantically, or cognitively. It should also consider properties of the original image used by the OCR to extract the texts.

Another issue is the annotation of a corpus containing several texts. In this case, each text tag should hold the source of a text, like the src entity in the Utterance annotation, and that source should be saved in the output XML file.


osherenko · 2022-11-18T08:57:59Z

It might be a good idea to use Metric annotation. For example:
<text src="...">
<metric class="charlength" value="4" />
</text>

proycon · 2022-11-18T14:25:42Z

> An explicit etymology tag is indeed nicer

I'll work on adding one.

> Another issue is the annotation of a corpus containing several texts.

Just do one FoLiA document per text. It's not recommended to put the whole corpus in a single FoLiA file.

> Can FoLiA accept multimodal injections?

You can have multiple text layers and multiple phonological layers, yes, if that's what you mean. One limitation in FoLiA, however, is that there can only be one canonical tokenization for everything.

> It might be a good idea to use Metric annotation
>
> <text src="...">
> <metric class="charlength" value="4" />
> </text>

I'm not entirely sure what your exact use case is for this, but you can indeed use metric annotation for all kinds of measurements on whatever you want.
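As one possible reading of the charlength example, the Python sketch below computes the character length of a word and attaches it as a metric element. It builds plain XML with the standard library, attaches the metric to a word rather than to the text body, and uses a helper name chosen for illustration; none of this is the FoLiA library API.

```python
import xml.etree.ElementTree as ET


def add_charlength_metric(element, text):
    """Append a <metric class="charlength"> child holding len(text)."""
    metric = ET.SubElement(element, "metric")
    metric.set("class", "charlength")
    metric.set("value", str(len(text)))
    return metric


word = ET.Element("w")
ET.SubElement(word, "t").text = "went"
add_charlength_metric(word, "went")

print(ET.tostring(word, encoding="unicode"))
```

Any other measurement (token count, OCR confidence, and so on) would follow the same pattern with a different class and value.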

> It should also consider properties of the original image used by the OCR to extract the texts.

proycon · 2022-11-18T15:04:17Z

@osherenko Please check what you think of the example in the above commit d43bbc8 and the documentation in commit db5499f . That's my proposed solution for annotating etymology in FoLiA.


osherenko · 2022-11-21T09:50:17Z

The etymology example looks very nice and "set" is a great addition (can it be empty if the set is unclear?).

Could you explain why "It's not recommended to put the whole corpus in a single FoLiA file"? In this case, each FoLiA file needs a descriptive name and there are many FoLiA annotations in separate files, which can be problematic because the file names must be cross-platform.

Moreover, I wonder why the Statement and the Utterance annotations have the src entity and the text tag does not. Actually, in the current implementation, I can have several doc tags in a single FoLiA file and add the src entity to the doc tag. Unfortunately, if I save such an annotation, the src entity is not present in the XML output.

BTW, I am particularly interested in the PyNLPI library. Do you have a special mailing group for questions?


proycon · 2022-11-21T11:51:56Z

> The etymology example looks very nice and "set" is a great addition (can it be empty if the set is unclear?)

Yes, in that case an implicit set is assigned internally. The whole 'set' concept is at the core of FoLiA's paradigm.

> Could you explain why "It's not recommended to put the whole corpus in a single FoLiA file"? In this case, each FoLiA file needs a descriptive name and there are many FoLiA annotations in separate files, which can be problematic because the file names must be cross-platform.

A corpus is often quite big; putting it all into a single XML file results in a big file which, when loaded into memory, blows up even more. FoLiA is designed as a document-based format. Note that FoLiA *does* typically group all annotations into the same file, so you have one text with all its annotations in a single XML file (and multiple such files for an entire corpus). How to divide the corpus into separate texts is up to you; do whatever makes most sense for your use case.
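On the concern about cross-platform file names, one option is to derive each document's ID and file name from a restricted character set. The Python sketch below is only an illustration of that idea: the naming scheme, the ".folia.xml" suffix and both function names are assumptions, not something FoLiA prescribes.

```python
import re
import unicodedata
from pathlib import Path


def safe_document_id(title, index):
    """Build an ASCII-only ID from a title, prefixed with a running number."""
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of other characters into a single hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", ascii_title).strip("-").lower()
    return f"doc{index:04d}-{slug}" if slug else f"doc{index:04d}"


def plan_output_files(titles, directory):
    """Map each text title to one output path, one document per text."""
    return [
        Path(directory) / (safe_document_id(title, index) + ".folia.xml")
        for index, title in enumerate(titles, start=1)
    ]


for path in plan_output_files(["Hello, World!", "Zürich: a report"], "corpus"):
    print(path)
```

The running number keeps names unique even when two titles reduce to the same slug, and the leading letters keep the ID usable as an XML identifier.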

> Moreover, I wonder why the Statement and the Utterance annotations have the src entity and the text tag does not.

The src attribute is a speech attribute; it refers back to the audio/video. In a speech context, you usually have <speech> rather than <text> as the main body tag, and <speech> does allow the src attribute.

> Actually, in the current implementation, I can have several doc tags in a single FoLiA file and add the src entity to the doc tag. Unfortunately, if I save such an annotation, the src entity is not present in the XML output.

I'm not sure what you mean here, can you show an example?

> BTW, I am particularly interested in the PyNLPI library. Do you have a special mailing group for questions?

You probably mean the foliapy library (it used to be part of pynlpl but was split out several years ago). You can use the issue tracker at https://github.com/proycon/foliapy for specific questions about that library.

osherenko · 2022-11-21T15:07:29Z

> > Actually, in the current implementation, I can have several doc tags in a single FoLiA file and add the src entity to the doc tag. Unfortunately, if I save such an annotation, the src entity is not present in the XML output.
>
> I'm not sure what you mean here, can you show an example?

For example, I can instantiate a document as

doc = folia.Document(id='id', src="src")

When I store the document using

doc.save("annotation.xml")

the src is unavailable in the XML file.

> > BTW, I am particularly interested in the PyNLPI library. Do you have a special mailing group for questions?
>
> You probably mean the foliapy library (it used to be part of pynlpl but was split out several years ago). You can use the issue tracker at https://github.com/proycon/foliapy for specific questions about that library.

You use the library in foliafreqlist.py as

from pynlpl.statistics import FrequencyList

I am particularly interested in search algorithms in pynlpl (Chapter 7 in the documentation), for example, using regular expressions like https://nlp.stanford.edu/software/tokensregex.html.


proycon · 2022-11-28T12:18:15Z

> For example, I can instantiate a document as doc = folia.Document(id='id', src="src"). When I store the document using doc.save("annotation.xml"), the src is unavailable in the XML file.

That's because src is not a valid attribute on a FoLiA Document as a whole.

> I am particularly interested in search algorithms in pynlpl (Chapter 7 in the documentation), for example, using regular expressions like https://nlp.stanford.edu/software/tokensregex.html.

If you want something like Stanford Tokensregex, then look at the FoLiA Query Language instead (https://folia.readthedocs.io/en/latest/fql.html). The search algorithms in pynlpl are very generic, and the implementation is fairly old; you'll probably find better ones elsewhere.

kosloot · 2022-11-29T19:04:05Z

I am surprised that doc = folia.Document(id='id', src="src") doesn't generate an error message.

proycon · 2022-11-30T15:41:25Z

> I am surprised that doc = folia.Document(id='id', src="src") doesn't generate an error message.

True, me too.
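For readers wondering how a constructor can accept an invalid keyword without complaint, the following Python sketch contrasts a lenient and a strict constructor. This is a generic illustration only: both classes are invented here, and whether the library's Document constructor actually works like the lenient one is an assumption, not something confirmed in this thread.

```python
class LenientDocument:
    """Hypothetical constructor that collects arbitrary keyword arguments."""

    def __init__(self, **kwargs):
        # Only known keys are read; anything else is silently dropped.
        self.id = kwargs.get("id")


class StrictDocument:
    """Hypothetical constructor that rejects unknown keyword arguments."""

    ALLOWED = {"id"}

    def __init__(self, **kwargs):
        unknown = set(kwargs) - self.ALLOWED
        if unknown:
            raise TypeError(f"unexpected keyword argument(s): {sorted(unknown)}")
        self.id = kwargs.get("id")


lenient = LenientDocument(id="id", src="src")  # no error, src is lost
print(lenient.id, hasattr(lenient, "src"))

try:
    StrictDocument(id="id", src="src")
except TypeError as error:
    print("rejected:", error)
```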

osherenko · 2022-11-30T16:55:24Z

You probably expected an unexpected-keyword error in the constructor. Maybe an exception is not implemented in the constructor. If you want to reproduce:

if __name__ == "__main__":
    doc = folia.Document(id='id', src="src")

If I call

text = doc.add(folia.Text, src="src")

I get an unexpected keyword error.

BTW, it struck me that I can place the src as a string in doc.metadata['desc'] to annotate the source of the document. Or is it better to add a Description annotation like

desc = doc.add(folia.Description, value="text source %s" % src)


proycon self-assigned this Nov 14, 2022

proycon added the question label Nov 14, 2022

proycon added a commit to proycon/foliapy that referenced this issue Nov 14, 2022

added missing subtoken classes to documentation (proycon/folia#105)

4876a90

proycon added a commit that referenced this issue Nov 18, 2022

Added EtymologyAnnotation to specification #105

9b316ce

proycon added a commit that referenced this issue Nov 18, 2022

added example for new etymology annotation #105

d43bbc8

proycon added a commit that referenced this issue Nov 18, 2022

added documentation skeleton for etymology annotation #105

db5499f

proycon added a commit to proycon/foliapy that referenced this issue Nov 18, 2022

implement etymology annotation (proycon/folia#105), corrected some old mentions of 'token annotation' to 'inline annotation'

a944621

proycon added a commit that referenced this issue Nov 18, 2022

regenerated documentation #105

ef04117
