Skip to content

v6.0.0

Compare
Choose a tag to compare
@hyunwoongko hyunwoongko released this 27 Apr 14:23
· 28 commits to main since this release
91355e2

KSS: Korean String processing Suite

GitHub release Issues Tests on Ubuntu Tests on MacOS Tests on Windows

KSS is a Korean string processing suite that provides various functions for processing Korean strings. It is designed to be simple and easy to use, and it is designed to be used in various fields such as natural language processing, data preprocessing, and data analysis.

Usage

1. Basic Usage

All functions can be used by creating an instance of the Kss class and calling the instance with the inputs.

from kss import Kss

module = Kss("MODULE_NAME")
output = module("YOUR_INPUT_STRING", **kwargs)

2. Available Modules

If you want to check the available modules, you can use the available() function.

from kss import Kss

Kss.available()
['augment', 'collocate', 'g2p', 'hangulize', 'split_hanja', 'is_hanja', 'hanja2hangul', 'h2j', 'h2hcj', 'j2h', 'j2hcj', 'hcj2h', 'hcj2j', 'is_jamo', 'is_jamo_modern', 'is_hcj', 'is_hcj_modern', 'is_hangul_char', 'select_josa', 'combine_josa', 'extract_keywords', 'split_morphemes', 'paradigm', 'anonymize', 'clean_news', 'is_completed_form', 'get_all_completed_form_hangul_chars', 'get_all_incompleted_form_hangul_chars', 'filter_out', 'half2full', 'reduce_char_repeats', 'reduce_emoticon_repeats', 'remove_invisible_chars', 'normalize', 'preprocess', 'qwerty', 'romanize', 'is_unsafe', 'split_sentences', 'correct_spacing', 'summarize_sentences']

3. Checking the usage of each module

If you want to check the usage of each module, you can use the help() function.

from kss import Kss

module = Kss("split_sentences")
module.help()
Split texts into sentences.

Args:
    text (Union[str, List[str], Tuple[str]]): single text or list/tuple of texts
    backend (str): morpheme analyzer backend. 'mecab', 'pecab', 'punct' are supported
    num_workers (Union[int, str]): the number of multiprocessing workers
    strip (bool): strip all sentences or not
    return_morphemes (bool): whether to return morphemes or not
    ignores (List[str]): list of strings to ignore

Returns:
    Union[List[str], List[List[str]]]: outputs of sentence splitting

Examples:
    >>> from kss import Kss
    >>> split_sentences = Kss("split_sentences")
    >>> text = "회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요 다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다 강남역 맛집 토끼정의 외부 모습."
    >>> split_sentences(text)
    ['회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요', '다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다', '강남역 맛집 토끼정의 외부 모습.']

4. Multiprocessing

If you input a list of strings, Kss will automatically use multiprocessing to process the strings in parallel.
And you can set the number of processes to use by setting the num_workers parameter.
If you input num_workers<2, Kss will not use multiprocessing.

from kss import Kss

module = Kss("MODULE_NAME")

# using all cores
output = module(["YOUR_INPUT_STRING1", "YOUR_INPUT_STRING2", ...], **kwargs)
# using 4 cores
output = module(["YOUR_INPUT_STRING1", "YOUR_INPUT_STRING2", ...], num_workers=4, **kwargs)
# using 1 core (no multiprocessing)
output = module(["YOUR_INPUT_STRING1", "YOUR_INPUT_STRING2", ...], num_workers=1, **kwargs)

5. Backward Compatibility

The old version of Kss used functional usage. KSS also supports this for backward compatibility.

from kss import split_sentences

output = split_sentences("YOUR_INPUT_STRING", **kwargs)

Supported Modules

See here for more details.