diff --git a/README.md b/README.md
index b19b768..adcd196 100644
--- a/README.md
+++ b/README.md
@@ -25,7 +25,7 @@ Install from source:
 ### Troubleshooting
-OpusFilter should generally work fine on Python 3.8 to 3.11. In the case of troubles, try installing the exact versions in `requirements.txt`:
+OpusFilter should generally work fine on Python 3.8 to 3.12. In the case of troubles, try installing the exact versions in `requirements.txt`:
 * `pip install -r requirements.txt`
diff --git a/docs/CHANGELOG.md b/docs/CHANGELOG.md
index b11bd69..745e705 100644
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
@@ -9,7 +9,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Changed
-- make pycld2 and fasttext libraries optional
+- make `pycld2` and `fasttext` libraries optional
+- replace `langid.py` library with `py3langid`
 - update github workflows and include Python 3.12 tests
 ## [3.1.0] - 2024-06-05
diff --git a/docs/CONTRIBUTING.md b/docs/CONTRIBUTING.md
index 506eb41..9a0de65 100644
--- a/docs/CONTRIBUTING.md
+++ b/docs/CONTRIBUTING.md
@@ -5,7 +5,7 @@ issues page. We are also happy to consider pull requests. There are a few rules for pull requests:
 * Make a pull request to the `develop` branch instead of `master`.
-* The code should support at least Python versions from 3.8 to 3.11.
+* The code should support at least Python versions from 3.8 to 3.12.
 * Please follow [PEP 8](https://www.python.org/dev/peps/pep-0008/). Exception: The maximum line length is 127 characters instead of 79.
 * Especially for new features, please include test cases for unit testing.
@@ -20,7 +20,7 @@ skips the respective tests if not.)
 GitHub workflows defined in the project run automatically `flake8` checks and unit testing with `pytest` using Python 3.8, 3.9, 3.10,
-and 3.11.
+3.11, and 3.12.
 Especially for larger contributions, consider using a code analysis tool like [Pylint](https://github.com/PyCQA/pylint).
 Install it
diff --git a/docs/filters/script_and_language_identification_filters.md b/docs/filters/script_and_language_identification_filters.md
index 020c69d..cf1a665 100644
--- a/docs/filters/script_and_language_identification_filters.md
+++ b/docs/filters/script_and_language_identification_filters.md
@@ -35,7 +35,7 @@ Filter segments based on their language identification confidence scores.
 Parameters:
 * `languages`: expected languages (ISO639 language codes) for the segments
-* `id_method`: language indentification method (`langid` for using the `langid` library, `cld2` for using the `cld2` library, or `fasttext` for using a `fasttext` model; the default is `langid`)
+* `id_method`: language indentification method (`langid`, `lingua`, `cld2`, `fasttext`; default `langid`)
 * `thresholds`: minimum identification confidence score for the segments (a single float or a list of floats per language)
 * `fasttext_model_path`: path for a `fasttext` model (required only for the `fasttext` method; default `null`)
 * `langid_languages`: limit detection to a list of possible languages (valid only for the `langid` method; default `null`)
@@ -44,8 +44,15 @@ Parameters:
 Returned scores are the language identification confidence scores from a given identification method for the segments. The scores range from 0 to 1. In filtering, all values have to be greater than the minimum thresholds. Negative threshold can be used to skip filtering for a language.
-See [langid.py](https://github.com/saffsd/langid.py) and
-[pycld2](https://github.com/aboSamoor/pycld2) for the method-specific
-options. A pretrained `fasttext` model can be downloaded from
-[fasttext.cc/docs/en/language-identification.html](https://fasttext.cc/docs/en/language-identification.html).
-The `cld2` and `fasttext` methods require [installing optional libraries](../installation.md).
+Currently the following identification methods are supported:
+
+* `langid` (default) :cite:`lui-baldwin-2012-langid`
+  * See https://github.com/adbar/py3langid
+* `lingua`
+  * See https://github.com/pemistahl/lingua-py
+* `cld2`
+  * See https://github.com/CLD2Owners/cld2
+  * Requires [installing optional libraries](../installation.md).
+* `fasttext` :cite:`joulin-etal-2016-fasttext` and :cite:`joulin-etal-2017-bag`
+  * A pretrained model can be downloaded from [fasttext.cc/docs/en/language-identification.html](https://fasttext.cc/docs/en/language-identification.html).
+  * Requires [installing optional libraries](../installation.md).
diff --git a/docs/installation.md b/docs/installation.md
index d9e0473..c14bee3 100644
--- a/docs/installation.md
+++ b/docs/installation.md
@@ -12,14 +12,14 @@ Install from source:
 Note that all required libraries are not available to install via PyPI on Windows OS. On Linux and MacOS, it should work directly for Python
-versions from 3.8 to 3.11.
+versions from 3.8 to 3.12.
 ## Required libraries
 * beautifulsoup4
 * opus-fast-mosestokenizer
 * graphviz
-* langid
+* py3langid
 * matplotlib
 * morfessor
 * OpusTools
@@ -41,11 +41,11 @@ See `setup.py` for possible version requirements.
 ### FastText and PyCLD2 language identification
-The language identification methods currently supported out-of-the-box
-are [langid](https://github.com/saffsd/langid.py) and
+The language identification libraries currently supported out-of-the-box
+are [py3langid](https://github.com/adbar/py3langid) and
 [lingua](https://github.com/pemistahl/lingua-py). The support for for
-[pycld2](https://github.com/aboSamoor/pycld2) and
-[fasttext models](https://fasttext.cc/docs/en/language-identification.html)
+[PyCLD2](https://github.com/aboSamoor/pycld2) and
+[FastText models](https://fasttext.cc/docs/en/language-identification.html)
 have been changed to optional due to the lack of support especially for newer Python versions.
diff --git a/opusfilter/filters.py b/opusfilter/filters.py
index 13f1461..bf2058f 100644
--- a/opusfilter/filters.py
+++ b/opusfilter/filters.py
@@ -334,8 +334,8 @@ def __init__(self, languages=None, id_method='langid', thresholds=None,
     def init_langid(self, langid_languages):
         """Initialize langid identifier"""
-        from langid.langid import LanguageIdentifier, model
-        self.identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
+        from py3langid.langid import LanguageIdentifier, MODEL_FILE
+        self.identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)
         if langid_languages:
             self.identifier.set_languages(langid_languages)
diff --git a/requirements.txt b/requirements.txt
index 623a808..44a157f 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -5,7 +5,7 @@ opustools
 jieba>=0.42
 beautifulsoup4>=4.8.2
 graphviz>=0.16
-langid==1.1.6
+py3langid==0.3.0
 matplotlib>=3.3.0
 opus-fast-mosestokenizer>=0.0.8.5
 pandas>=1.0.0
diff --git a/setup.py b/setup.py
index 6161b90..74dab3e 100644
--- a/setup.py
+++ b/setup.py
@@ -8,7 +8,7 @@
     "opustools",
     "beautifulsoup4>=4.8.0",
     "graphviz",
-    "langid",
+    "py3langid>=0.2.2",
     "matplotlib",
     "morfessor",
     "opus-fast-mosestokenizer>=0.0.8.5",
@@ -30,6 +30,7 @@
 ]
 fasttext_require = [
+    "py3langid<0.3.0", # 0.3.0 requires numpy 2.0.0
     "numpy<2.0.0",
     "fasttext"
 ]
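Review note: the threshold rule quoted in the `script_and_language_identification_filters.md` hunk ("all values have to be greater than the minimum thresholds", with a negative threshold skipping a language) can be sketched in plain Python. This is only an illustration of the documented wording, not OpusFilter's actual code: `passes_thresholds` is a hypothetical helper, and the strict `>` comparison is an assumption taken from the phrase "greater than".

```python
from numbers import Real


def passes_thresholds(scores, thresholds):
    """Return True if every score is strictly greater than its threshold.

    Hypothetical helper for illustration only; not part of OpusFilter.
    """
    if isinstance(thresholds, Real):
        # A single float is applied to every language.
        thresholds = [thresholds] * len(scores)
    if len(scores) != len(thresholds):
        raise ValueError("need exactly one threshold per score")
    # Scores lie in [0, 1], so any negative threshold always passes,
    # which is how a language can be excluded from filtering.
    return all(score > threshold for score, threshold in zip(scores, thresholds))


if __name__ == "__main__":
    print(passes_thresholds([0.92, 0.81], 0.5))        # both scores above 0.5
    print(passes_thresholds([0.92, 0.31], 0.5))        # second score too low
    print(passes_thresholds([0.92, 0.0], [0.5, -1]))   # second language skipped
```

Because the scores are bounded below by 0, the "skip" behaviour needs no special case in this sketch: a threshold of `-1` is satisfied by every possible score.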