Preparing a training corpus #9

djbpitt · 2019-02-03T17:54:10Z

Dan comments:

It seems like you will have to spend a good amount of time doing some rhyme annotation by hand, because as you say, the rhyme data is not very good. This might be a problem. Unless your corpus does actually come with annotations you can use, I'd suggest taking a small, manageable sample of the corpus and working with that first. Maybe all the poems by one poet or something. After annotating for whatever you want to be classifying (e.g. approximate vs exact rhyme, which lines rhyme, etc), you could then work on rule-based and/or ML methods for recognizing those same annotations using a test/train split of the annotated portion. If you annotate a small amount of your data and try to run your rule-based or ML classifier on the rest of the data, you'll have the difficulty of evaluating how well it performs.

Some parts of the corpus have useful rhyme annotation, but for the most part the representation of rhyme in the corpus is messy, incomplete, and inconsistent. I think that starting with the works of one poet might risk overfitting, that is, training to recognize that poet’s rhyme conventions, rather than rhyme conventions more broadly.
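One way to make the overfitting concern measurable is to hold out entire poets, so that a classifier is always evaluated on poets it never saw during training. The following is only a minimal sketch: the `poet` key, the `split_by_poet` name, and the record layout are hypothetical and do not come from the corpus.

```python
import random


def split_by_poet(poems, test_fraction=0.2, seed=0):
    """Split poem records into train/test so that no poet appears in both.

    `poems` is a list of dicts, each with a hypothetical 'poet' key.
    At least one poet is always held out for testing.
    """
    poets = sorted({poem["poet"] for poem in poems})
    rng = random.Random(seed)  # seeded so the split is reproducible
    rng.shuffle(poets)
    n_test = max(1, round(len(poets) * test_fraction))
    test_poets = set(poets[:n_test])
    train = [poem for poem in poems if poem["poet"] not in test_poets]
    test = [poem for poem in poems if poem["poet"] in test_poets]
    return train, test
```

If performance on the held-out poets is much worse than on a random split of poems, that would suggest the model has learned individual poets' conventions rather than rhyme more broadly.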

With that said, there is low-hanging fruit. For example, Aleksandr Puškin’s Eugene Onegin contains about 400 14-line stanzas, all with the same rhyme scheme, which means that I don’t have to tag the rhyme manually. On the other hand, I have to think about whether I would be training on Russian rhyme, or on Puškin’s rhyme, or on just Eugene Onegin rhyme. Requires further thought on my part!
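Because the scheme is fixed, rhyme labels for each stanza could be generated rather than tagged. The sketch below assumes the Onegin stanza scheme written as `ababccddeffegg` (ignoring the feminine/masculine distinction) and assumes each stanza is already available as a list of 14 line strings; the function names are hypothetical.

```python
from itertools import combinations

# Onegin stanza rhyme scheme, ignoring feminine vs. masculine endings.
ONEGIN_SCHEME = "ababccddeffegg"


def rhyme_pairs(scheme):
    """Return sorted (i, j) pairs of 0-based line indices sharing a rhyme letter."""
    by_letter = {}
    for index, letter in enumerate(scheme):
        by_letter.setdefault(letter, []).append(index)
    pairs = []
    for indices in by_letter.values():
        pairs.extend(combinations(indices, 2))
    return sorted(pairs)


def label_stanza(lines, scheme=ONEGIN_SCHEME):
    """Label every pair of lines in a stanza as rhyming (True) or not (False).

    Raises ValueError if the stanza length does not match the scheme,
    so irregular or incomplete stanzas are not silently mislabeled.
    """
    if len(lines) != len(scheme):
        raise ValueError(f"expected {len(scheme)} lines, got {len(lines)}")
    positive = set(rhyme_pairs(scheme))
    return [
        (lines[i], lines[j], (i, j) in positive)
        for i, j in combinations(range(len(lines)), 2)
    ]
```

Each stanza yields 7 rhyming pairs and 84 non-rhyming pairs, so any classifier trained on these labels would need to account for that imbalance.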

djbpitt added corpus training labels Feb 3, 2019
