Django Korektor

Django Korektor is a bigram-based "Did you mean?" proof of concept.

Have you ever heard of Google's "Did you mean" feature? Django Korektor is a simplified proof of concept based on the intersection of bigrams between the spellchecked query and the words stored in the database.

Django Korektor contains optimized database models and management commands for learning (importing your large language datasets) and for testing the spellcheck. It finds the closest bigram match, corrects the word, and preserves any separators. The database structure should be efficient enough for production use (e.g. a 5-word query checked against 1 million words in 0.3 s on the cheapest DigitalOcean droplet).
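The matching idea can be sketched in plain Python, without a database. This is an illustration only, not Django Korektor's actual implementation: the function names, the in-memory `vocabulary` set, the Dice coefficient as the score, and the `^`/`$` word-boundary padding are all assumptions made for this example. The padding is there because a very short typo such as "fn" shares no plain bigram with "fun".

```python
import re


def padded_bigrams(word):
    """Set of adjacent character pairs, with ^ and $ marking word edges."""
    padded = "^" + word.lower() + "$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def closest(word, vocabulary):
    """Return the known word sharing the most bigrams with `word`.

    Words already in the vocabulary, and words with no bigram overlap
    at all, are returned unchanged.
    """
    if word.lower() in vocabulary:
        return word
    grams = padded_bigrams(word)
    best, best_score = word, 0.0
    for candidate in sorted(vocabulary):  # sorted for a stable tie-break
        cand_grams = padded_bigrams(candidate)
        # Dice coefficient: 2 * |A & B| / (|A| + |B|)
        score = 2 * len(grams & cand_grams) / (len(grams) + len(cand_grams))
        if score > best_score:
            best, best_score = candidate, score
    return best


def correct(query, vocabulary):
    """Correct each word and keep every separator exactly as typed."""
    # The capturing group makes re.split return the separators as well.
    parts = re.split(r"(\W+)", query)
    return "".join(
        closest(p, vocabulary) if p and not re.fullmatch(r"\W+", p) else p
        for p in parts
    )


vocabulary = {"it", "is", "fun", "to", "dance", "raining", "together"}
print(correct("It is fn, to dence!", vocabulary))
# It is fun, to dance!
```

A linear scan over the vocabulary is fine for a demonstration, but it would not scale to a million words; that is the job of the indexed bigram tables described under Installation.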

Misspell theory

Classical Damerau errors, introduced by F. J. Damerau in 1964:

Substitution ALPHABET -> ALPHSBET
Deletion ALPHABET -> ALPHBET
Insertion ALPHABET -> ALPHAABET
Transposition ALPHABET -> ALPHBAET
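Each of the four error types above counts as a single edit. A minimal sketch of the optimal string alignment variant of the Damerau-Levenshtein distance makes that concrete. It is shown for illustration only; Django Korektor itself matches on bigrams rather than on edit distance, and the function name `osa_distance` is made up for this example.

```python
def osa_distance(a, b):
    """Minimum number of substitutions, deletions, insertions and
    adjacent transpositions needed to turn `a` into `b`."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i leading characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j leading characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                # transposition of two adjacent characters
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


for typo in ("ALPHSBET", "ALPHBET", "ALPHAABET", "ALPHBAET"):
    print(typo, osa_distance("ALPHABET", typo))  # each line ends with 1
```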

What is a bigram?

A bigram or digram is every sequence of two adjacent elements in a string of tokens, which are typically letters, syllables, or words; they are n-grams for n=2. The frequency distribution of bigrams in a string is commonly used for simple statistical analysis of text in many applications, including computational linguistics, cryptography, speech recognition, and so on. Source: Wikipedia

For example, "Spellcheck" broken down into bigrams results in [Sp, pe, el, ll, lc, ch, he, ec, ck].
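The same split takes one line of Python. The helper name `bigrams` is only illustrative and is not part of the djkorektor API:

```python
def bigrams(word):
    """Return every pair of adjacent characters in `word`, in order."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


print(bigrams("Spellcheck"))
# ['Sp', 'pe', 'el', 'll', 'lc', 'ch', 'he', 'ec', 'ck']
```

A word of n characters yields n - 1 bigrams, so a single-character word yields none.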

Installation

Install from PyPI with pip:

$ pip install django-korektor

Add djkorektor to your installed apps:

INSTALLED_APPS = (
    ...
    'djkorektor'
)

Create database tables with Django's syncdb:

$ cd /path/to/app
$ python manage.py syncdb

Or export the schema and create the database tables yourself:

$ python manage.py sqlall djkorektor > djkorektor_database_schema.sql

This creates five tables with sample data:

djkorektor_bigrams (bigrams for a specific language, ~2700 rows per language)
djkorektor_locales (static table of all locales)
djkorektor_words (words for a specific language; the biggest table, 1 million rows or more per language)
djkorektor_words_bigrams (words broken down into bigrams; a big but simple table)
djkorektor_words_pairs (word pairs: the left and right neighbours of each word)
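To show why a word-to-bigram join table makes lookups cheap, here is a self-contained sketch using an in-memory SQLite database. Only the table names come from the list above; every column name, the `learn` and `candidates` helpers, and the query itself are assumptions for illustration and may differ from the real djkorektor models.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names below are hypothetical; only the table names are real.
conn.executescript("""
    CREATE TABLE djkorektor_words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE djkorektor_bigrams (id INTEGER PRIMARY KEY, bigram TEXT UNIQUE);
    CREATE TABLE djkorektor_words_bigrams (
        word_id INTEGER REFERENCES djkorektor_words(id),
        bigram_id INTEGER REFERENCES djkorektor_bigrams(id),
        PRIMARY KEY (word_id, bigram_id)
    );
""")


def split_bigrams(word):
    return {word[i:i + 2] for i in range(len(word) - 1)}


def learn(word):
    """Store a word and link it to each of its bigrams."""
    conn.execute("INSERT OR IGNORE INTO djkorektor_words (word) VALUES (?)", (word,))
    (word_id,) = conn.execute(
        "SELECT id FROM djkorektor_words WHERE word = ?", (word,)).fetchone()
    for gram in split_bigrams(word):
        conn.execute(
            "INSERT OR IGNORE INTO djkorektor_bigrams (bigram) VALUES (?)", (gram,))
        (bigram_id,) = conn.execute(
            "SELECT id FROM djkorektor_bigrams WHERE bigram = ?", (gram,)).fetchone()
        conn.execute(
            "INSERT OR IGNORE INTO djkorektor_words_bigrams VALUES (?, ?)",
            (word_id, bigram_id))


def candidates(misspelled, limit=3):
    """Known words ranked by how many bigrams they share with the input."""
    grams = sorted(split_bigrams(misspelled))
    if not grams:
        return []
    placeholders = ",".join("?" * len(grams))
    query = f"""
        SELECT w.word, COUNT(*) AS shared
        FROM djkorektor_words_bigrams wb
        JOIN djkorektor_bigrams b ON b.id = wb.bigram_id
        JOIN djkorektor_words w ON w.id = wb.word_id
        WHERE b.bigram IN ({placeholders})
        GROUP BY w.id
        ORDER BY shared DESC, w.word
        LIMIT ?
    """
    return conn.execute(query, (*grams, limit)).fetchall()


for word in ("dance", "raining", "together", "pleasure"):
    learn(word)

print(candidates("dence"))
# [('dance', 2)]
```

Only rows whose bigram appears in the query are touched, so the cost depends on how many words share those bigrams rather than on the total size of the words table.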

Example use

Learning from command line:

$ python manage.py djkorektor --import_word="Bigrams are fun! It is raining, let's dance together. It will be my pleasure." --locale=en_US

Spellcheck test from command line:

$ python manage.py djkorektor --spell="It is fn to dence" --locale=en_US

Spellcheck from your app by calling the management command in a view:

from django.core.management import call_command
spellchecked = call_command('djkorektor', locale="en_US", spell="It is fn to dence")

This returns a dictionary:

{
    'your_input': 'It is fn to dence',
    'did_you_mean': 'It is fun to dance',
    'did_you_mean_markdown': 'It is *fun* to *dance*',
    'did_you_mean_html': 'It is <i>fun</i> to <i>dance</i>',
}
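As a rough illustration of how the markdown and HTML variants relate to the plain correction, the sketch below rebuilds a dictionary of the same shape from an input phrase and its corrected form. `build_result` is a hypothetical helper written for this example, not a function provided by djkorektor, and it assumes both phrases split into the same number of tokens.

```python
import re
from html import escape


def build_result(your_input, did_you_mean):
    """Mark every token that differs between the input and the correction."""
    # The capturing group keeps separators so they are re-emitted untouched.
    original = re.split(r"(\W+)", your_input)
    corrected = re.split(r"(\W+)", did_you_mean)
    if len(original) != len(corrected):
        raise ValueError("phrases must split into the same number of tokens")
    markdown, html = [], []
    for old, new in zip(original, corrected):
        changed = old != new
        markdown.append(f"*{new}*" if changed else new)
        html.append(f"<i>{escape(new)}</i>" if changed else escape(new))
    return {
        "your_input": your_input,
        "did_you_mean": did_you_mean,
        "did_you_mean_markdown": "".join(markdown),
        "did_you_mean_html": "".join(html),
    }


result = build_result("It is fn to dence", "It is fun to dance")
print(result["did_you_mean_markdown"])
# It is *fun* to *dance*
```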

Please note

Django Korektor can't fix a word it has not learned. You need to import a large, language-specific dataset of correct phrases, for example a large collection of newspaper articles (e.g. by continuously learning from an RSS feed).
Django Korektor only fixes certain misspellings. For example, "It is fan to dance" is accepted as a correct phrase, since "fan" is itself a valid word.
Django Korektor's use of context is limited. For example, "Icland is icland" will result in "Iceland is iceland".
