This repository provides preprocessing modules for Turkish text data, implemented for general-domain text. The modules include:
- Sentence Splitting (Sentence Boundary Disambiguation)
- Tokenization
- Normalization
- Stemming
- Stopword Removal
- A hand-written Naive Bayes classifier for general-purpose use
First, clone this repository into your project from your terminal and make sure you have a suitable Python version (>= 3.6):

```bash
git clone [email protected]:karahan-sahin/Turkish_Preprocessing.git
```
After cloning the repository, create and activate a virtual environment and install the necessary packages:

```bash
cd Turkish_Preprocessing/
python -m venv toolkit_venv
source toolkit_venv/bin/activate
pip install -r requirements.txt
```
After that, go back to the parent folder that contains the repository and use the toolkit in your notebook or script:

```bash
cd ..
mv Turkish_Preprocessing/example/demo.ipynb .
```
Create a sample text for the usage examples below (they are also available in demo.ipynb):

```python
sample_text = """...politika faizi olan bir hafta vadeli repo ihale faiz oranının yüzde 16’dan yüzde 15’e indirilmesine karar vermiştir. Küresel iktisadi..."""
```
```python
from Turkish_Preprocessing.modules.tokenizer import Tokenizer

tokenizer = Tokenizer()
```
After you import the module, you need to initialize an instance of the tokenizer; during this step the necessary models are trained. Then you can use either RuleBasedTokenizer, which tokenizes with a hand-written, non-overlapping set of rules:
```python
tokens = tokenizer.RuleBasedTokenizer(sample_text)
```
RuleBasedTokenizer also includes normalization rules for multi-word expressions and other normalization methods.
Or LogisticTokenizer, a binary Logistic Regression classifier trained character by character to decide whether a character ends a token:
```python
tokens = tokenizer.LogisticTokenizer(sample_text)
```
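A minimal, hypothetical sketch of this character-level idea, using scikit-learn's LogisticRegression, hand-picked character-class features, and a one-sentence toy training text (none of which come from this repository), might look like this:

```python
# Hypothetical sketch of a char-level boundary classifier (not the library's code).
from sklearn.linear_model import LogisticRegression

def char_features(text, i):
    """Character-class features for the previous, current, and next character."""
    def cls(c):
        if c.isalpha():
            return 0
        if c.isdigit():
            return 1
        if c.isspace():
            return 2
        return 3  # punctuation / other
    prev_c = text[i - 1] if i > 0 else " "
    next_c = text[i + 1] if i + 1 < len(text) else " "
    return [cls(prev_c), cls(text[i]), cls(next_c)]

# Toy supervision: label a character 1 if it ends a token (here, simply any
# non-space character followed by whitespace or by the end of the string).
train_text = "Faiz oranı yüzde 15'e indirildi ."
X, y = [], []
for i, ch in enumerate(train_text):
    nxt = train_text[i + 1] if i + 1 < len(train_text) else " "
    X.append(char_features(train_text, i))
    y.append(1 if (not ch.isspace()) and nxt.isspace() else 0)

clf = LogisticRegression().fit(X, y)

def logistic_tokenize(text):
    """Cut a token wherever the classifier predicts an end-of-token character."""
    tokens, start = [], 0
    for i in range(len(text)):
        if clf.predict([char_features(text, i)])[0] == 1:
            token = text[start:i + 1].strip()
            if token:
                tokens.append(token)
            start = i + 1
    return tokens

print(logistic_tokenize("Karar yüzde 15'e indirildi ."))
```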
```python
from Turkish_Preprocessing.modules.splitter import SentenceSplitter

splitter = SentenceSplitter()
```
The Sentence Splitter also provides one rule-based and one machine-learning-based splitter (a Naive Bayes classifier) that detects whether a punctuation mark is a sentence boundary or not:
```python
sentences = splitter.RuleBasedTokenizer(sample_text)
sentences = splitter.NaiveBayesSplitter(sample_text)
```
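As a rough illustration of the boundary-classification idea, the sketch below labels each candidate punctuation mark as a boundary or not from a few contextual features. It substitutes scikit-learn's BernoulliNB and toy hand-labelled feature vectors for the repository's hand-written Naive Bayes, so the real features and training procedure will differ:

```python
# Hypothetical sketch: classify punctuation marks as sentence boundaries
# (scikit-learn stand-in for the repository's hand-written Naive Bayes).
from sklearn.naive_bayes import BernoulliNB

BOUNDARY_CANDIDATES = ".!?"

def punct_features(text, i):
    """Binary context features for the punctuation mark at index i."""
    prev_c = text[i - 1] if i > 0 else " "
    next_c = text[i + 1] if i + 1 < len(text) else " "
    next2_c = text[i + 2] if i + 2 < len(text) else " "
    return [
        int(next_c.isspace()),   # followed by whitespace
        int(next2_c.isupper()),  # next word starts with a capital letter
        int(prev_c.isdigit()),   # preceded by a digit, e.g. "yüzde 15."
        int(prev_c.isupper()),   # preceded by an uppercase letter (abbreviation-like)
    ]

# Toy training data: feature vectors labelled 1 for boundary, 0 otherwise.
X_train = [[1, 1, 0, 0], [1, 1, 1, 0], [1, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 1]]
y_train = [1, 1, 0, 0, 0]
clf = BernoulliNB().fit(X_train, y_train)

def naive_bayes_split(text):
    """Cut the text after every punctuation mark classified as a boundary."""
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch in BOUNDARY_CANDIDATES and clf.predict([punct_features(text, i)])[0] == 1:
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(naive_bayes_split("Faiz yüzde 15'e indirildi. Küresel görünüm zayıfladı."))
```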
The Normalization module handles the normalization of newlines, multi-word expressions, and other special tokens:
```python
from Turkish_Preprocessing.modules.normalization import Normalization

normalization = Normalization()
normalized_text = normalization.normalize(sample_text)
```
The module also exposes the individual normalization methods:
```python
normalization.normalize_abbreviation()
normalization.adapt_newline()
normalization.normalize_urls()
normalization.normalize_emails()
normalization.normalize_hashtags()
normalization.remove_punctuation()
normalization.MWE_Normalization()
```
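As a rough illustration of how such normalization can be implemented, here is a regex-based sketch; the placeholder tags and patterns are assumptions rather than the module's actual behaviour:

```python
# Hypothetical regex-based normalization sketch (tags and patterns are assumptions).
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
HASHTAG_RE = re.compile(r"#\w+")

def normalize(text):
    """Replace special tokens with placeholder tags and collapse newlines."""
    text = URL_RE.sub("<URL>", text)
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = HASHTAG_RE.sub("<HASHTAG>", text)
    text = re.sub(r"\s*\n\s*", " ", text)  # join broken lines into one paragraph
    return text

print(normalize("Detaylar için https://example.com adresine bakınız.\nİletişim: [email protected] #faiz"))
```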
The Stemmer works only at the token level: you need to pass tokens one by one to get the stem and the detected list of morphological features. The Stemmer was built with regular expressions obtained from the Universal Dependencies framework.
```python
from Turkish_Preprocessing.modules.stemmer import Stemmer

stemmer = Stemmer()
stems, morphs = stemmer.get_stem(token)
```
For example, the token "etmektedir" yields:

```python
{
    "Token": "etmektedir",
    "Stem": "et",
    "Morphology": ['Copula=GenCop', 'Case=Loc', 'PersonNumber=V1pl', 'Polarity=Neg']
}
```
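For intuition, here is a heavily simplified sketch of regex-based suffix stripping. The three suffix rules and their feature labels are illustrative assumptions; the actual Stemmer uses a much larger rule set derived from Universal Dependencies:

```python
# Hypothetical sketch of regex-based suffix stripping (tiny illustrative rule set,
# not the repository's Universal Dependencies-derived rules).
import re

# (suffix pattern, UD-style feature label) pairs, longest suffixes first.
SUFFIX_RULES = [
    (r"(mektedir|maktadır)$", "Copula=GenCop"),
    (r"(lar|ler)$", "Number=Plur"),
    (r"(de|da|te|ta)$", "Case=Loc"),
]

def get_stem(token):
    """Iteratively strip known suffixes and collect their feature labels."""
    stem, morphs = token, []
    changed = True
    while changed:
        changed = False
        for pattern, feature in SUFFIX_RULES:
            stripped = re.sub(pattern, "", stem)
            if stripped and stripped != stem:
                morphs.append(feature)
                stem = stripped
                changed = True
                break
    return stem, morphs

print(get_stem("etmektedir"))  # ('et', ['Copula=GenCop'])
print(get_stem("okullarda"))   # ('okul', ['Case=Loc', 'Number=Plur'])
```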
Stopword removal supports both a dynamic approach, which takes the top_k most frequent words from the corpus, and a frozen lexicon from NLTK's stopwords; either option can be selected:
```python
from Turkish_Preprocessing.modules.stopwords import StopwordRemoval

stopword = StopwordRemoval("custom", top_k=100)
stopword = StopwordRemoval("lexicon")

tokens = stopword.remove(tokenized_text)
```
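The dynamic option can be pictured with the frequency-based sketch below; the helper names are hypothetical and this is only an illustration of the top_k idea, not the module's code:

```python
# Hypothetical sketch of the dynamic "top_k most frequent words" approach.
from collections import Counter

def build_stopword_list(tokens, top_k=100):
    """Treat the top_k most frequent tokens in the corpus as stopwords."""
    counts = Counter(token.lower() for token in tokens)
    return {word for word, _ in counts.most_common(top_k)}

def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in the stopword set."""
    return [token for token in tokens if token.lower() not in stopwords]

corpus_tokens = ["faiz", "oranı", "ve", "yüzde", "ve", "karar", "ve"]
stops = build_stopword_list(corpus_tokens, top_k=1)  # here only "ve" qualifies
print(remove_stopwords(corpus_tokens, stops))         # ['faiz', 'oranı', 'yüzde', 'karar']
```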