Django Korektor is a brigam based "Did you mean?" proof of concept.
Have you ever heard about Google's "Did you mean" feature? Django Korektor is a simplified proof of concept based od bigrams intersection between spellchecked query and database set.
Django Korektor contains optimized database models. Learning management commands to import your huge language datasets and finally test spellcheck. It finds closest bigrams match, corrects and preserves any separators. Database structure should be effecient enough for production use. (e.g. 5 word query checked over 1 million words in 0.3s on cheapest Digital Ocean droplet :)
Cassical Damereau errors introduced by F.J. Damereau in 1964:
- Substitution
ALPHABET -> ALPHSBET
- Deletion
ALPHABET -> ALPHBET
- Insertion
ALPHABET -> ALPHAABET
- Transposition
ALPHABET -> ALPHBAET
A bigram or digram is every sequence of two adjacent elements in a string of tokens, which are typically letters, syllables, or words; they are n-grams for n=2. The frequency distribution of bigrams in a string are commonly used for simple statistical analysis of text in many applications, including in computational linguistics, cryptography, speech recognition, and so on. Source: Wiki
For example: "Spellcheck" broken down to bigrams will result in [Sp, pe, el, ll, lc, ch, he, ec, ck]
Install from pip repository
$ pip install django-korektor
Add djkorektor
to your installed apps:
INSTALLED_APPS = (
...
'djkorektor'
)
Create database tables with Django's syncdb:
$ cd /path/to/app
$ python manage.py syncdb
or just export schema to create database tables by yourself
$ python manage.py sqlall djkorektor > djkorektor_database_schema.sql
will create five tables with sample data:
- djkorektor_bigrams (bigrams for specific language ~2700 rows per language)
- djkorektor_locales (static table of all locales)
- djkorektor_words (words for specific language, will be the biggest table ~1 mil. rows and more per language)
- djkorektor_words_bigrams (word broken to bigrams, biggest but simple table)
- djkorektor_words_pairs (words pairs, left right bigrams of words)
Learning from command line:
$ python manage.py djkorektor --import_word="Bigrams are fun! It is raining, let's dance together. It will be my pleasure." --locale=en_US
Spellcheck test from command line:
$ python manage.py djkorektor --spell="It is fn to dence" --locale=en_US
Spellcheck from your app using management command in view:
from django.core.management import call_command
spellchecked = call_command('djkorektor',locale="en_US",spell="It is fn to dence")
will return dictionary
{ your_input: It is fn to dence
did_you_mean: It is fun to dance
did_you_mean_markdown: It is *fun* to *dance*
did_you_mean_html: It is <i>fun</i> to <i>dance</i> }
- Django Korektor can't fix a word if you did't learned it. Basically you need to import huge language specific dataset of correct phrases. For example a lot of newspaper articles (by constant learning from any rss feed).
- Django Korektor will only fix certain misspells between words. For example "It is fan to dance" will result as correct phrase.
- Django Korektor usage of context is limited. For example "Icland is icland" will somehow result in "Iceland is iceland"