comprehensive linguistic annotation #105

Open
osherenko opened this issue Nov 7, 2022 · 16 comments

@osherenko

I am developing an annotation and would like to specify, for particular words, not only lemmas and POS but also etymological or morphological information. How should I do this?

@proycon
Owner

proycon commented Nov 9, 2022

For morphological information, see the relevant section of the documentation: Morphological annotation

For data on etymology there is no specific element, and I'd probably need to see an example of what such an annotation looks like in your data to give the best advice. It might be added as a higher-order feature using <feat> with one of the existing types.

@osherenko
Author

osherenko commented Nov 10, 2022 via email

@osherenko
Author

Probably, the etymology annotation can resemble the sense annotation https://folia.readthedocs.io/en/latest/sense_annotation.html

@proycon proycon self-assigned this Nov 14, 2022
@proycon
Owner

proycon commented Nov 14, 2022

> How would you annotate the morphology of the irregular verb "went"? "go" is not a part of it.

True, "go" would be the lemma, in the simplest annotation form:

<w>
  <t>went</t>
  <lemma class="go" />
</w>

If you want to express it in a morphological structure, you could do the following:

<w>
  <t>went</t>
  <lemma class="go" />
  <morphology>
    <morpheme class="stem">
       <t>went</t>
       <lemma class="go" />
       <pos class="V" />
    </morpheme>
  </morphology>
</w>

There's a bit of duplication here to show that you can choose at which level you want to express certain things (like pos/lemma/sense). Most of what you can express on the word level is also valid on the morpheme level.

Note: the classes belong to a user-defined set definition; FoLiA itself does not prescribe them.

There are some further examples at https://folia.readthedocs.io/en/latest/morphological_annotation.html, which show that you can also do nesting and associate extra features with morphemes (such as function, or any other you can invent). This is also the place where you might want to express etymology:

<w>
  <t>went</t>
  <lemma class="go" />
  <morphology>
    <morpheme class="stem">
       <t>went</t>
       <lemma class="go" />
       <pos class="V" />
       <feat subset="etymology" class="wenden" />
    </morpheme>
  </morphology>
</w>

In this case it's an extra feature on the morpheme. Rather than using the full description, I opted for a shortened 'class' here; ideally some external database would provide its full definition (and the set of your morphology annotation would determine which database that is). But of course you may also just put the full definition in the class.
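
For readers who want to read such a feature back out programmatically, here is a minimal sketch using only Python's standard `xml.etree.ElementTree`. It assumes the fragment exactly as written in this comment, without the XML namespace and set declarations a full FoLiA document carries. The helper name `morpheme_features` is hypothetical and not part of any FoLiA library.

```python
import xml.etree.ElementTree as ET

# The fragment from this comment; a full FoLiA document would also declare a namespace.
FRAGMENT = """
<w>
  <t>went</t>
  <lemma class="go" />
  <morphology>
    <morpheme class="stem">
       <t>went</t>
       <lemma class="go" />
       <pos class="V" />
       <feat subset="etymology" class="wenden" />
    </morpheme>
  </morphology>
</w>
"""


def morpheme_features(word):
    """Return one dict per morpheme with its text, lemma, pos and <feat> subsets (hypothetical helper)."""
    result = []
    for morpheme in word.iterfind("./morphology/morpheme"):
        lemma = morpheme.find("lemma")
        pos = morpheme.find("pos")
        info = {
            "class": morpheme.get("class"),
            "text": morpheme.findtext("t"),
            "lemma": lemma.get("class") if lemma is not None else None,
            "pos": pos.get("class") if pos is not None else None,
        }
        # Each <feat> contributes a subset -> class pair, e.g. etymology -> wenden.
        for feat in morpheme.iterfind("feat"):
            info[feat.get("subset")] = feat.get("class")
        result.append(info)
    return result


word = ET.fromstring(FRAGMENT)
features = morpheme_features(word)
print(features)
```

Because the etymology is stored as an ordinary `<feat>`, the same loop picks up any other subset you define on the morpheme without further changes.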

> Moreover, I want to specify linguistic information in the sentence annotation such as dependencies and the grammatical parse.

That can be accommodated in FoLiA by dependency annotation and syntactic annotation, respectively.

> BTW, there is a broken link to the API class -- https://folia.readthedocs.io/en/latest/morphological_annotation.html doesn't exist.

Oops indeed, thanks! I have corrected the mistake.

I hope this provides some more clarity?

@proycon
Owner

proycon commented Nov 14, 2022

One of the points arising from this issue is whether we want to introduce an explicit <etymology> element in FoLiA. It would be an inline annotation element very much like <sense>. We could then write <etymology class="wenden" />, which may be nicer than using a feature structure.
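
To make the proposal concrete, here is a rough sketch of what a word carrying such an element could look like when built with Python's standard `xml.etree.ElementTree`. It rests on assumptions: at this point in the thread `<etymology>` is only a proposal, placing it directly under `<w>` next to `<lemma>` is inferred from the comparison with `<sense>`, and `word_with_etymology` is a hypothetical helper rather than a FoLiA library call.

```python
import xml.etree.ElementTree as ET


def word_with_etymology(text, lemma, etymology):
    """Build a <w> element with the proposed inline <etymology> annotation (hypothetical helper)."""
    w = ET.Element("w")
    ET.SubElement(w, "t").text = text
    ET.SubElement(w, "lemma", {"class": lemma})
    # Assumption: the proposed element takes a class attribute, as in the comment above.
    ET.SubElement(w, "etymology", {"class": etymology})
    return w


w = word_with_etymology("went", "go", "wenden")
xml_string = ET.tostring(w, encoding="unicode")
print(xml_string)
```

Compared with the `<feat subset="etymology" ...>` variant, the annotation here has its own element name, so it can be looked up directly instead of by filtering features on their subset.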

@osherenko
Author

osherenko commented Nov 16, 2022 via email

@osherenko
Author

osherenko commented Nov 18, 2022

It might be a good idea to use Metric annotation. For example:
<text src="...">
<metric class="charlength" value="4" />
</text>

@proycon
Owner

proycon commented Nov 18, 2022

> An explicit etymology tag is indeed nicer

I'll work on adding one.

> Another issue is the annotation of a corpus containing several texts.

Just do one FoLiA document per text. It's not recommended to put the whole corpus in a single FoLiA file.

> Can FoLiA accept multimodal injections?

You can have multiple text layers and multiple phonological layers, yes, if that's what you mean. One limitation in FoLiA, however, is that there can only be one canonical tokenization for everything.

> It might be a good idea to use Metric annotation
>
> <text src="...">
> <metric class="charlength" value="4" />
> </text>

I'm not entirely sure what your exact use case is for this, but you can indeed use metric annotation for all kinds of measurements on whatever you want.
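
As an illustration of one such measurement, the sketch below computes a character length and records it as a `<metric>` element using Python's standard `xml.etree.ElementTree`. The `charlength` class comes from the example quoted above. Attaching the metric to a `<w>` element instead of `<text>`, and the helper name `add_charlength_metric`, are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET


def add_charlength_metric(element, text):
    """Attach <metric class="charlength" value="N" /> to element, with N = len(text) (hypothetical helper)."""
    return ET.SubElement(
        element, "metric", {"class": "charlength", "value": str(len(text))}
    )


# Assumption for illustration: the measurement is attached to a single word.
w = ET.Element("w")
ET.SubElement(w, "t").text = "went"
metric = add_charlength_metric(w, "went")
print(ET.tostring(w, encoding="unicode"))
```

The same pattern would apply to any other measurement: pick a class name from your own set definition and store the measured value as a string.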

It should also consider
properties of the original image used by the OCR to extract the texts.

@proycon
Owner

proycon commented Nov 18, 2022

@osherenko Please let me know what you think of the example in the above commit d43bbc8 and the documentation in commit db5499f. That's my proposed solution for annotating etymology in FoLiA.

proycon added a commit to proycon/foliapy that referenced this issue Nov 18, 2022
…d mentions of 'token annotation' to 'inline annotation'
proycon added a commit that referenced this issue Nov 18, 2022
@osherenko
Author

osherenko commented Nov 21, 2022 via email

@proycon
Owner

proycon commented Nov 21, 2022 via email

@osherenko
Author

osherenko commented Nov 21, 2022 via email

@proycon
Owner

proycon commented Nov 28, 2022 via email

@kosloot
Collaborator

kosloot commented Nov 29, 2022

I am surprised that doc = folia.Document(id='id', src="src") doesn't generate an error message.

@proycon
Owner

proycon commented Nov 30, 2022

> I am surprised that doc = folia.Document(id='id', src="src") doesn't generate an error message.

True, me too.

@osherenko
Author

osherenko commented Nov 30, 2022 via email
