Wordlists and training texts contain lots of errors #1
Comments
The word lists and training text were generated by using a web crawler, so the undesirable effects you mentioned are to be expected.
Using a web crawler on German texts will normally not find words like "drauBen" (instead of "draußen"), unless you crawl OCR results which were made with English language settings. It looks like Ray crawled Google Books. What happens if Google learns from Google? At some point there will be lots of evidence that "drauBen" is correct. :-) Searching for "drauBen" (with Google Search, of course) already finds texts outside of Google Books, but maybe generated by Google Translate. So using a web crawler is fine as long as it only crawls more reliable content (German text corpora, German Wikipedia, German newspapers, German books from Wikisource or Project Gutenberg, ...).
tesseract-ocr/tesseract#654 (comment) theraysmith commented on Jan 23, 2017
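To make the "drauBen" problem above concrete: crawled German text could be screened for that OCR artifact before it is used as training text. The following is only a rough sketch of such a filter; the regex heuristic, the script name and the I/O convention are my own assumptions, not part of any existing pipeline.

```python
import re
import sys

# Heuristic (an assumption, not taken from any official tool): a capital "B"
# between two lowercase letters almost never occurs in correctly spelled
# German, but it is a typical OCR substitution for "ß"
# (e.g. "drauBen" instead of "draußen").
OCR_SZ_ARTIFACT = re.compile(r"[a-zäöü]B[a-zäöü]")

def looks_clean(line: str) -> bool:
    """Return True if the line shows no sign of the B-for-ß artifact."""
    return OCR_SZ_ARTIFACT.search(line) is None

if __name__ == "__main__":
    # Hypothetical usage: python filter_crawled_text.py < crawled.txt > filtered.txt
    for line in sys.stdin:
        if looks_clean(line):
            sys.stdout.write(line)
```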
Wikipedia? Other Wikimedia wikis?
Let's say we provide corpus text. Is there only the slightest chance that retraining …
IMO (I did not try it yet) it should be possible, at least for LSTM: see the wiki page on training from scratch.
Thanks for your assessment. I guess reproducing the current models would be very useful before trying to improve them. I'll give it a try. And yes, I am only interested in LSTM training.
My own experience with legacy training is different. It was quite easy to train a usable Fraktur model (frk.traineddata), but up to now I did not succeed in training a similar LSTM model from scratch. Legacy training only requires a selection of good fonts and a short training text which includes all glyphs, so it is sufficient to make an artificial text listing those glyphs.
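For illustration, generating such an artificial glyph-listing training text takes only a few lines of Python. This is just a sketch; the file names frk.charset and frk.training_text are made-up examples, not files from the repo.

```python
# Minimal sketch: build a short artificial training text that lists every glyph
# of a character set, as described above for legacy training.
# Assumption: "frk.charset" contains one glyph per line (UTF-8).
with open("frk.charset", encoding="utf-8") as f:
    glyphs = [line.strip() for line in f if line.strip()]

# Group the glyphs into artificial five-character "words" so that every glyph
# appears at least once in the text.
words = ["".join(glyphs[i:i + 5]) for i in range(0, len(glyphs), 5)]

with open("frk.training_text", "w", encoding="utf-8") as out:
    # Wrap after ten "words" per line to keep the lines short.
    for i in range(0, len(words), 10):
        out.write(" ".join(words[i:i + 10]) + "\n")
```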
Just to make sure: by reproducing, I mean more or less exactly reproducing the current stock models.
I am afraid that reproducing the current models won't be possible, maybe not even with Google internal information. If the text used for training was extracted from Internet sources (it looks like that), then that extraction cannot be reproduced. The original extracted text would be needed, also how it was distributed over the trained fonts and which parameters were used for the training. Most of the current models have known deficits, so maybe it is not a great loss if they cannot be reproduced exactly. The important thing is finding a way to get new models from scratch without those deficits, but with comparable or better quality, and with a 100 % defined training process.
Just to be clear regarding my statement about the legacy engine: Fraktur fonts belong to the special fonts.
Another issue is that some of the fonts they used for training are not open source fonts and cost some $$.
@wrznr, I think that Ray's statement is the best piece of information which we currently have on the training done by Google.
A 1 GB text file for a single language which was taken from "all the www" is not only too large to be easily handled, but will also contain lots of copyrighted text. That might be a major reason why such files could not be shared.
@stweil I missed that piece of information. Thanks. I always thought that the training texts would be part of the data repos. If this is not the case, I really think we should make an effort and come up with re-trainable alternatives. Wikipedia could be a good source for the texts.
The small training texts in the data repos were sufficient for the legacy model. I have no idea how large the training texts for the LSTM models must be. Wikipedia can contribute training text, but that text uses modern language and is not formatted like printed books. Wikisource offers older texts, and other projects (like Project Gutenberg) also offer the typical book layout. I expect a higher quality from those sources than from a more random www sample. Maybe we can also use other large existing text corpora.
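As an illustration of using such book-style sources, here is a rough sketch of pulling plain text from a Project Gutenberg ebook and stripping the license header and footer. The URL is a placeholder and the START/END marker pattern is an assumption about the usual file layout, so both need to be checked per file.

```python
import re
import urllib.request

# Placeholder URL: replace XXXX with a real ebook number from gutenberg.org.
URL = "https://www.gutenberg.org/cache/epub/XXXX/pgXXXX.txt"

raw = urllib.request.urlopen(URL).read().decode("utf-8", errors="replace")

# Strip the Project Gutenberg license header and footer. The "*** START OF ..."
# and "*** END OF ..." marker lines reflect the usual layout, but verify them
# for each file before relying on this.
start = re.search(r"\*\*\* ?START OF.*", raw)
end = re.search(r"\*\*\* ?END OF.*", raw)
body = raw[start.end():end.start()] if start and end else raw

with open("gutenberg_book.txt", "w", encoding="utf-8") as out:
    out.write(body.strip() + "\n")
```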
Just my 2 cents as a comment on what the basic language models should be:
Personally, I gave up the idea of distinguishing orthographies at the boundaries of 1750, 1830, 1875, 1901 and 1996. Now I just divide my corpora into periods of 50 years like 1800-1849, 1850-1899, etc.; it is always possible to combine them into longer periods. Modern, because I assume the majority of users need modern language. Archives and libraries have other requirements and can help themselves.
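A tiny sketch (function name and usage are mine) of mapping a publication year to such a 50-year period label:

```python
def period_label(year: int) -> str:
    """Map a publication year to a 50-year period label such as '1800-1849'."""
    start = (year // 50) * 50
    return f"{start}-{start + 49}"

# Quick check of the binning.
assert period_label(1823) == "1800-1849"
assert period_label(1876) == "1850-1899"
```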
Of all the available corpora I know, https://wortschatz.uni-leipzig.de/de/download provides random "proper" sentences in different sizes and domains (e.g. news, web, Wikipedia). For German there are up to 300 M sentences, which is IMHO not very handy to process. The license is friendly: …
To give an idea of what 1 M sentences mean: they need ~100 MB of text. According to Zipf's law, the average of ~18 tokens per sentence is very constant across the German corpora. The size of the alphabet (unique chars/graphemes) differs a lot, because some corpora also include non-Latin scripts like Greek, Arabic, Hebrew, Cyrillic, Chinese, and emoticons.
BTW: none of the corpora I know is free of spelling errors; even the DTA still has errors like … I am not sure if size matters for training, i.e. whether there would be a gain in accuracy when using 1 GB of text instead of 100 MB, or whether it would even degrade. Other works using CTC/(B)LSTM show a stagnation with increasing dictionary sizes up to 90 K words (morphemes or surface forms). HMMs degrade early, but exactly that was the reason to use CTC.
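For anyone who wants to reproduce such numbers on their own corpus, here is a minimal sketch, assuming a plain UTF-8 file with one sentence per line (roughly the layout of the Leipzig Wortschatz downloads) and crude whitespace tokenization; the file name is just an example.

```python
import os

# Example file name; assumed to be UTF-8, one sentence per line.
PATH = "deu_news_1M-sentences.txt"

sentences = 0
tokens = 0
alphabet = set()

with open(PATH, encoding="utf-8") as f:
    for line in f:
        sentence = line.strip()
        if not sentence:
            continue
        sentences += 1
        tokens += len(sentence.split())   # crude whitespace tokenization
        alphabet.update(sentence)

print(f"file size           : {os.path.getsize(PATH) / 1e6:.1f} MB")
print(f"sentences           : {sentences}")
print(f"avg tokens/sentence : {tokens / max(sentences, 1):.1f}")
print(f"alphabet size       : {len(alphabet)} unique characters")
```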
IMHO the current character set of …
A short test with codespell (which only finds the most common typos for English) found more than 1000 errors in eng.wordlist. The German wordlist deu.wordlist contains the well known B / ß confusion and also other errors. The training texts also contain similar errors. In addition, I noticed many foreign (Turkish?) words in the German text.
Are such errors critical for the trained model which is based on that data?
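A quick way to get an impression of the B / ß problem in deu.wordlist is a simple heuristic scan. This is only a sketch; the regex heuristic is my own assumption and will produce some false positives for genuine foreign words.

```python
import re

# Heuristic (my assumption, not a rule from the repo): a capital "B" directly
# after a lowercase letter inside a word is almost always an OCR confusion for
# "ß" in German (e.g. "drauBen", "FuBball").
SUSPICIOUS = re.compile(r"[a-zäöü]B")

suspects = []
with open("deu.wordlist", encoding="utf-8") as f:
    for word in (line.strip() for line in f):
        if word and SUSPICIOUS.search(word):
            suspects.append(word)

print(f"{len(suspects)} suspicious entries, for example: {suspects[:10]}")
```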