Preparing a training corpus #9

Open
djbpitt opened this issue Feb 3, 2019 · 0 comments


djbpitt commented Feb 3, 2019

Dan comments:

It seems like you will have to spend a good amount of time doing some rhyme annotation by hand, because as you say, the rhyme data is not very good. This might be a problem. Unless your corpus does actually come with annotations you can use, I'd suggest taking a small, manageable sample of the corpus and working with that first. Maybe all the poems by one poet or something. After annotating for whatever you want to be classifying (e.g., approximate vs. exact rhyme, which lines rhyme, etc.), you could then work on rule-based and/or ML methods for recognizing those same annotations using a test/train split of the annotated portion. If you annotate a small amount of your data and try to run your rule-based or ML classifier on the rest of the data, you'll have difficulty evaluating how well it performs, because the rest has no annotations to compare against.
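A minimal sketch of the test/train split Dan describes, assuming the annotated portion is just a list of poem identifiers. The function name, the 80/20 ratio, the seed, and the `poem_N` identifiers are all placeholders, not anything fixed by the corpus:

```python
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of `items` and split it into (train, test)."""
    shuffled = list(items)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical identifiers standing in for hand-annotated poems.
annotated = [f"poem_{i}" for i in range(10)]
train, test = train_test_split(annotated)
```

Because both halves come from the annotated portion, the classifier's output on `test` can be scored against the hand annotation.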

Some parts of the corpus have useful rhyme annotation, but for the most part the representation of rhyme in the corpus is messy, incomplete, and inconsistent. I think that starting with the works of one poet might risk overfitting, that is, training to recognize that poet’s rhyme conventions, rather than rhyme conventions more broadly.
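One possible way to check for that kind of overfitting, once more than one poet has been annotated, is to hold out each poet in turn and train on the others. This is only a sketch: the `(poet, poem_id)` pairs, the function name, and the poet labels are hypothetical.

```python
from collections import defaultdict


def leave_one_poet_out(poems):
    """Yield (held_out_poet, train_ids, test_ids), holding out each poet in turn."""
    by_poet = defaultdict(list)
    for poet, poem_id in poems:
        by_poet[poet].append(poem_id)
    for held_out in sorted(by_poet):
        # Train on every poem whose poet is not the held-out one.
        train = [pid for poet, ids in by_poet.items() if poet != held_out for pid in ids]
        yield held_out, train, by_poet[held_out]


# Placeholder data: three hypothetical poets with a few annotated poems.
poems = [("poet_A", "a1"), ("poet_A", "a2"), ("poet_B", "b1"), ("poet_C", "c1")]
folds = list(leave_one_poet_out(poems))
```

A large gap between the score on a held-out poet and the score on a random split would suggest that the model has learned one poet's conventions rather than rhyme more broadly.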

With that said, there is low-hanging fruit. For example, Aleksandr Puškin’s Eugene Onegin contains about 400 14-line stanzas, all with the same rhyme scheme, which means that I don’t have to tag the rhyme manually. On the other hand, I have to think about whether I would be training on Russian rhyme, on Puškin’s rhyme, or just on Eugene Onegin rhyme. Requires further thought on my part!
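As a rough illustration of why a fixed scheme removes the manual tagging, the rhyming line pairs of a stanza can be derived from the scheme string alone. The Onegin stanza is conventionally written `aBaBccDDeFFeGG`, with letter case distinguishing feminine from masculine rhymes; the sketch below ignores that distinction, and the function name is made up for this example:

```python
from itertools import combinations

# Conventional notation for the Onegin stanza; case marks rhyme gender.
ONEGIN_SCHEME = "aBaBccDDeFFeGG"


def rhyme_pairs(scheme):
    """Return 0-based (i, j) line pairs that share a rhyme letter in `scheme`."""
    letters = scheme.lower()  # ignore the feminine/masculine distinction
    return [
        (i, j)
        for i, j in combinations(range(len(letters)), 2)
        if letters[i] == letters[j]
    ]


pairs = rhyme_pairs(ONEGIN_SCHEME)
# pairs == [(0, 2), (1, 3), (4, 5), (6, 7), (8, 11), (9, 10), (12, 13)]
```

Applying the same seven pairs to every stanza would yield positive rhyme examples without hand annotation, though only for this one text.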
