comprehensive linguistic annotation #105
For morphological information, check the respective section in the documentation: Morphological annotation <https://folia.readthedocs.io/en/latest/morphological_annotation.html>. For data on etymology there is no specific element, and I'd probably need to see an example of what such an annotation looks like in your data to give the best advice. It might be added as a higher-order feature using <feat> with one of the existing types. |
How would you annotate the morphology of the irregular verb "went"? "go" is
not a part of it. BTW, there is a broken link to the API class --
https://folia.readthedocs.io/en/latest/morphological_annotation.html doesn't
exist.
I want to specify etymological data within a word annotation as its
definition from a dictionary, similar to the annotation of the phoneme in
an utterance.
<utt xml:id="example.utt.1" src="helloworld.mp3"
     begintime="00:00:01.000" endtime="00:00:02.000">
  <ph>helˈoʊ wɝːld</ph>
  <w xml:id="example.utt.1.w.1" begintime="00:00:00.000"
     endtime="00:00:01.000">
    <ph>helˈoʊ</ph>
    <etymology>early 19th century: variant of earlier hollo;
    related to holla.</etymology>
  </w>
  <w xml:id="example.utt.1.w.2" begintime="00:00:01.000"
     endtime="00:00:02.000">
    <ph>wɝːld</ph>
  </w>
</utt>
Moreover, I want to specify linguistic information in the sentence
annotation, such as dependencies and the grammatical parse. For example, the
sentence annotation of "The strongest rain ever recorded in India..."
<https://nlp.stanford.edu/software/lex-parser.shtml#Sample> should include
the grammatical tree
(ROOT
  (S
    (S
      (NP
        (NP (DT The) (JJS strongest) (NN rain))
        (VP
          (ADVP (RB ever))
          (VBN recorded)
          (PP (IN in)
            (NP (NNP India)))))
and dependencies:
det(rain-3, The-1)
amod(rain-3, strongest-2)
nsubj(shut-8, rain-3)
nsubj(snapped-16, rain-3)
nsubj(closed-20, rain-3)
nsubj(forced-23, rain-3)
advmod(recorded-5, ever-4)
partmod(rain-3, recorded-5)
prep_in(recorded-5, India-7)
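The dependency lines above follow the pattern relation(head-index, dependent-index). As a minimal sketch, assuming that pattern holds for every line, they could be turned into structured records in Python before being mapped onto any annotation format. The names Dependency and parse_dependency are made up for this illustration and belong to no library.

```python
import re
from typing import NamedTuple


class Dependency(NamedTuple):
    """One typed dependency, e.g. det(rain-3, The-1)."""
    relation: str
    head: str
    head_index: int
    dependent: str
    dependent_index: int


# relation(head-i, dependent-j); relation names may contain underscores (prep_in)
_DEP_PATTERN = re.compile(r"^(\w+)\((.+)-(\d+), (.+)-(\d+)\)$")


def parse_dependency(line: str) -> Dependency:
    """Parse a single dependency line or raise ValueError if it does not match."""
    match = _DEP_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"not a dependency line: {line!r}")
    relation, head, head_index, dependent, dependent_index = match.groups()
    return Dependency(relation, head, int(head_index), dependent, int(dependent_index))


lines = [
    "det(rain-3, The-1)",
    "amod(rain-3, strongest-2)",
    "prep_in(recorded-5, India-7)",
]
deps = [parse_dependency(line) for line in lines]
```

The token indices are kept as integers so that each record can later be matched to the word with the same position in the sentence.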
On Wed, 9 Nov 2022 at 20:50, Maarten van Gompel <***@***.***> wrote:
> For morphological information, check the respective section in the
> documentation: Morphological annotation
> <https://folia.readthedocs.io/en/latest/morphological_annotation.html>
> For data on etymology there is no specific element, and I'd probably need
> to see an example of what such an annotation looks like in your data to give
> the best advice. It might be added as a higher-order feature using <feat>
> with one of the existing types.
|
Probably, the etymology annotation can resemble the sense annotation https://folia.readthedocs.io/en/latest/sense_annotation.html |
True, "go" would be the lemma, in the simplest annotation form:
  <w>
    <t>went</t>
    <lemma class="go" />
  </w>
If you want to express it in a morphological structure, you could do the following:
  <w>
    <t>went</t>
    <lemma class="go" />
    <morphology>
      <morpheme class="stem">
        <t>went</t>
        <lemma class="go" />
        <pos class="V" />
      </morpheme>
    </morphology>
  </w>
There's a bit of duplication here to show that you can choose at what level you want to express certain things (like pos/lemma/sense). Most of what you can express at the word level is also valid at the morpheme level. Note: the classes belong to a user-defined set definition; FoLiA itself does not prescribe them. There are some further examples at https://folia.readthedocs.io/en/latest/morphological_annotation.html , which show that you can also do nesting and associate extra features with morphemes (such as
  <w>
    <t>went</t>
    <lemma class="go" />
    <morphology>
      <morpheme class="stem">
        <t>went</t>
        <lemma class="go" />
        <pos class="V" />
        <feat subset="etymology" class="wenden" />
      </morpheme>
    </morphology>
  </w>
In this case it's an extra feature on the morpheme. Rather than using the full description I opted for a shortened 'class' here, for which ideally some external database would provide a full definition (and the set of your
Dependencies and the grammatical parse can be accommodated in FoLiA by dependency annotation and syntactic annotation, respectively.
Oops indeed, thanks! I have corrected the mistake. I hope this provides some more clarity? |
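To show how the pieces of the morphology example relate to each other, here is a minimal sketch that reads the "went" fragment with Python's standard xml.etree.ElementTree. It is not the FoLiApy API. The fragment is copied without the FoLiA namespace and document wrapper, so the plain tag lookups below are an assumption that would need adjusting for a real FoLiA file, and describe_word is a name invented for this example.

```python
import xml.etree.ElementTree as ET

# Fragment as posted in the thread (no namespace, no surrounding document).
FRAGMENT = """
<w>
  <t>went</t>
  <lemma class="go" />
  <morphology>
    <morpheme class="stem">
      <t>went</t>
      <lemma class="go" />
      <pos class="V" />
      <feat subset="etymology" class="wenden" />
    </morpheme>
  </morphology>
</w>
"""


def describe_word(xml_text):
    """Collect word-level and morpheme-level annotations into a plain dict."""
    word = ET.fromstring(xml_text)
    lemma = word.find("lemma")  # direct child only: the word-level lemma
    info = {
        "text": word.findtext("t"),
        "lemma": lemma.get("class") if lemma is not None else None,
        "morphemes": [],
    }
    for morpheme in word.iter("morpheme"):
        pos = morpheme.find("pos")
        # Each <feat> maps a subset (here "etymology") to a class
        features = {f.get("subset"): f.get("class") for f in morpheme.findall("feat")}
        info["morphemes"].append({
            "type": morpheme.get("class"),
            "text": morpheme.findtext("t"),
            "pos": pos.get("class") if pos is not None else None,
            "features": features,
        })
    return info


summary = describe_word(FRAGMENT)
```

The resulting dictionary keeps the word-level lemma separate from the morpheme-level annotations, which mirrors the choice of annotation level described above.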
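One of the points arising from this issue is whether we want to introduce an explicit <etymology> element in FoLiA. It would be an inline annotation element very much like <sense>. We could then do <etymology class="wenden" />, which may be nicer than using a feature structure. |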
An explicit etymology tag is indeed nicer. Another question is about
multimodal injections in annotations, similar to Example 1.7.1 in the
documentation.
Can FoLiA accept multimodal injections? You spoke about OCR corrections. My
annotation should annotate texts not only lexically, but also
statistically, semantically, or cognitively. It should also consider
properties of the original image used by the OCR to extract the texts.
Another issue is the annotation of a corpus containing several texts. In
this case, each text tag should hold the source of its text, like the src
attribute in the Utterance annotation, and be saved in the output XML file.
On Mon, 14 Nov 2022 at 14:00, Maarten van Gompel <***@***.***> wrote:
> One of the points arising from this issue is whether we want to introduce
> an explicit <etymology> element in FoLiA. It would be an inline
> annotation element very much like <sense>. We could then do <etymology
> class="wenden" />, which may be nicer than using a feature structure.
|
It might be a good idea to use metric annotation. For example: |
I'll work on adding one.
Just do one FoLiA document per text. It's not recommended to put the whole corpus in a single FoLiA file.
You can have multiple text layers and multiple phonological layers, yes, if that's what you mean. One limitation in FoLiA, however, is that there can only be one canonical tokenization for everything.
I'm not entirely sure what your exact use case is for this, but you can indeed use metric annotation for all kinds of measurements on whatever you want.
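To make the metric suggestion a little more concrete, here is a minimal sketch, not taken from the thread, that attaches measurements to a word using only Python's standard library. As far as I understand the FoLiA documentation, a metric is written as a <metric> element with class and value attributes; please verify that against the documentation. The class names charlength and ocr-confidence are hypothetical and would have to come from your own set definition, and the fragment leaves out the FoLiA namespace and annotation declarations.

```python
import xml.etree.ElementTree as ET


def add_metric(parent, metric_class, value):
    """Append <metric class="..." value="..."/> to parent (assumed FoLiA shape)."""
    metric = ET.SubElement(parent, "metric")
    metric.set("class", metric_class)
    metric.set("value", str(value))
    return metric


word = ET.Element("w")
ET.SubElement(word, "t").text = "went"

# Hypothetical classes: a statistical measure and an OCR-related measure
add_metric(word, "charlength", len("went"))
add_metric(word, "ocr-confidence", 0.93)

xml_text = ET.tostring(word, encoding="unicode")
```

Because every measurement is just a class/value pair, statistical and image-related properties can share one mechanism and differ only in the set they are declared in.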
|
@osherenko Please check what you think of the example in the above commit d43bbc8 and the documentation in commit db5499f . That's my proposed solution for annotating etymology in FoLiA. |
The etymology example looks very nice and "set" is a great addition (can it
be empty if the set is unclear?)
Could you explain why "It's not recommended to put the whole corpus in a
single FoLiA file"? In this case, each FoLiA file needs a descriptive name
and there are many FoLiA annotations in separate files, which can be
problematic because the file names must be cross-platform. Moreover, I
wonder why the Statement and the Utterance annotations have the src attribute
and the text tag does not. Actually, in the current implementation, I can have
several doc tags in a single FoLiA file and add the src attribute to the doc
tag. Unfortunately, if I save such an annotation, the src attribute is not
present in the XML output.
BTW, I am particularly interested in the PyNLPl library. Do you have
a special mailing group for questions?
On Fri, 18 Nov 2022 at 16:04, Maarten van Gompel <***@***.***> wrote:
> @osherenko <https://github.com/osherenko> Please check what you think of
> the example in the above commit d43bbc8 and the documentation in commit
> db5499f. That's my proposed solution for annotating etymology in FoLiA.
|
> The etymology example looks very nice and "set" is a great addition (can it
> be empty if the set is unclear?)
Yes, in that case an implicit set is assigned internally. The whole
'set' concept is at the core of FoLiA's paradigm.
> Could you explain why "It's not recommended to put the whole corpus in a
> single FoLiA file"? In this case, each FoLiA file needs a descriptive name
> and there are many FoLiA annotations in separate files, which can be
> problematic because the file names must be cross-platform.
A corpus is often quite big; putting it all into a single XML file
results in a big file which, when loaded into memory, blows up even
more. FoLiA is designed as a document-based format.
Note that FoLiA *does* typically group all annotations into the same file,
so you have one text with all its annotations in a single XML file (and multiple
such files for an entire corpus). How to divide the corpus into separate texts is
up to you, whatever makes most sense for your use case.
> Moreover, I
> wonder why the Statement and the Utterance annotations have the src attribute
> and the text tag does not.
The src attribute is a speech attribute; it refers back to the audio/video.
In a speech context, you usually have <speech> rather than <text> as the
main body tag (which does allow the src attribute).
> Actually, in the current implementation, I can have
> several doc tags in a single FoLiA file and add the src attribute to the doc
> tag. Unfortunately, if I save such an annotation, the src attribute is not
> present in the XML output.
I'm not sure what you mean here, can you show an example?
> BTW, I am particularly interested in the PyNLPl library. Do you have
> a special mailing group for questions?
You probably mean the foliapy library (it used to be part of pynlpl but
was split out several years ago). You can use the issue tracker at
https://github.com/proycon/foliapy for specific questions about that
library.
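On the concern about cross-platform file names when keeping one FoLiA document per text: one possible approach, sketched here purely as an assumption and not as anything FoLiA prescribes, is to derive each file name from a text identifier restricted to a conservative ASCII character set. The function name portable_filename and the .folia.xml suffix are choices made for this illustration.

```python
import re
import unicodedata


def portable_filename(text_id, extension=".folia.xml"):
    """Derive a conservative, lowercase ASCII file name from a text identifier."""
    # Decompose accented characters, then drop everything outside ASCII
    ascii_id = unicodedata.normalize("NFKD", text_id).encode("ascii", "ignore").decode("ascii")
    # Collapse each run of disallowed characters into a single underscore
    slug = re.sub(r"[^A-Za-z0-9._-]+", "_", ascii_id).strip("._-")
    if not slug:
        raise ValueError(f"cannot derive a file name from {text_id!r}")
    # Lowercase avoids clashes on case-insensitive file systems
    return slug.lower() + extension


name = portable_filename("Émile Zola: Germinal (1885)")
```

Two different identifiers can still collapse to the same name, so a real corpus would also need a uniqueness check, for example a mapping from the original identifier to the generated file name.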
|
> Actually, in the current implementation, I can have
> several doc tags in a single FoliA file and add the src entity to the doc
> tag. Unfortunately, If I save such annotation, the src entity is not
> present in the XML output.
> I'm not sure what you mean here, can you show an example?
For example, I can instantiate a document as
    doc = folia.Document(id='id', src="src")
When I store the document using
    doc.save("annotation.xml")
the src is unavailable in the XML file.
> BTW, I am particularly interested in the PyNLPl library. Do you have
> a special mailing group for questions?
> You probably mean the foliapy library (it used to be part of pynlpl but
> was split out several years ago). You can use the issue tracker at
> https://github.com/proycon/foliapy for specific questions about that
> library.
In foliafreqlist.py you use the library as
    from pynlpl.statistics import FrequencyList
I am particularly interested in search algorithms in pynlpl (Chapter 7 in
the documentation), for example, using regular expressions like
https://nlp.stanford.edu/software/tokensregex.html.
|
> For example, I can instantiate a document as
>     doc = folia.Document(id='id', src="src")
> When I store the document using
>     doc.save("annotation.xml")
> the src is unavailable in the XML file.
That's because src is not a valid attribute on a FoLiA Document as a
whole.
> I am particularly interested in search algorithms in pynlpl (Chapter 7 in
> the documentation), for example, using regular expressions like
> https://nlp.stanford.edu/software/tokensregex.html.
If you want something like Stanford TokensRegex then look at the FoLiA
Query Language instead: https://folia.readthedocs.io/en/latest/fql.html
The search algorithms in pynlpl are very generic (and the implementation
is fairly old; you'll probably find better ones elsewhere).
|
I am surprised that doc = folia.Document(id='id', src="src") doesn't generate an error message. |
True, me too. |
You probably expect an unexpected-keyword error in the constructor. Maybe
the constructor simply does not raise an exception for unknown keywords. If
you want to reproduce:
    from folia import main as folia  # assuming the FoLiApy package

    if __name__ == "__main__":
        doc = folia.Document(id='id', src="src")
If I call
    text = doc.add(folia.Text, src="src")
I get an unexpected keyword error.
BTW, it struck me that I can place the src as a string in doc.metadata['desc']
to annotate the source of the document. Or is it better to add a
Description annotation like
    desc = doc.add(folia.Description, value="text source %s" % src)
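The difference between the two calls (one silently accepting src, the other raising an error) is consistent with a common Python pattern, illustrated below with two hypothetical classes. This is only a sketch of the general mechanism. It is not FoLiApy's actual implementation, which I have not checked, and the names LenientDocument, StrictDocument, and KNOWN are invented.

```python
class LenientDocument:
    """Hypothetical constructor that swallows unknown keyword arguments."""

    def __init__(self, id, **kwargs):
        self.id = id
        # Unknown keywords land in kwargs and are never looked at again
        self.ignored = sorted(kwargs)


class StrictDocument:
    """Hypothetical constructor that rejects unknown keyword arguments."""

    KNOWN = {"id", "version"}  # made-up whitelist for this illustration

    def __init__(self, **kwargs):
        unknown = set(kwargs) - self.KNOWN
        if unknown:
            raise TypeError(
                "unexpected keyword argument(s): " + ", ".join(sorted(unknown))
            )
        self.id = kwargs.get("id")


lenient = LenientDocument(id="id", src="src")  # no error, src is dropped

try:
    StrictDocument(id="id", src="src")
    strict_error = None
except TypeError as exc:
    strict_error = str(exc)
```

With the lenient pattern the src value never reaches the serialised output, which would match the symptom described above; the strict pattern reports the problem at construction time instead.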
On Tue, 29 Nov 2022 at 20:04, Ko van der Sloot <***@***.***> wrote:
> I am surprised that doc = folia.Document(id='id', src="src") doesn't
> generate an error message.
On Wed, 30 Nov 2022 at 16:41, Maarten van Gompel <***@***.***> wrote:
> > I am surprised that doc = folia.Document(id='id', src="src") doesn't
> > generate an error message.
> True, me too.
|
I am developing an annotation and would like to specify, for particular words, not only lemmas and POS but also etymological or morphological information. How should I do it?