replace langid.py with py3langid

Helsinki-NLP · Jun 26, 2024 · 55ec9a7
1 parent 9fbe7d0

Showing 8 changed files with 29 additions and 20 deletions.
diff --git a/README.md b/README.md
@@ -25,7 +25,7 @@ Install from source:
 
 ### Troubleshooting
 
-OpusFilter should generally work fine on Python 3.8 to 3.11. In the case of troubles, try installing the exact versions in `requirements.txt`:
+OpusFilter should generally work fine on Python 3.8 to 3.12. In the case of troubles, try installing the exact versions in `requirements.txt`:
 
 * `pip install -r requirements.txt`
 

diff --git a/docs/CHANGELOG.md b/docs/CHANGELOG.md
@@ -9,7 +9,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Changed
 
-- make pycld2 and fasttext libraries optional
+- make `pycld2` and `fasttext` libraries optional
+- replace `langid.py` library with `py3langid`
 - update github workflows and include Python 3.12 tests
 
 ## [3.1.0] - 2024-06-05

diff --git a/docs/CONTRIBUTING.md b/docs/CONTRIBUTING.md
@@ -5,7 +5,7 @@ issues page. We are also happy to consider pull requests. There are a
 few rules for pull requests:
 
 * Make a pull request to the `develop` branch instead of `master`.
-* The code should support at least Python versions from 3.8 to 3.11.
+* The code should support at least Python versions from 3.8 to 3.12.
 * Please follow [PEP 8](https://www.python.org/dev/peps/pep-0008/). Exception: The maximum line length is 127 characters instead of 79.
 * Especially for new features, please include test cases for unit testing.
 
@@ -20,7 +20,7 @@ skips the respective tests if not.)
 
 GitHub workflows defined in the project run automatically `flake8`
 checks and unit testing with `pytest` using Python 3.8, 3.9, 3.10,
-and 3.11.
+3.11, and 3.12.
 
 Especially for larger contributions, consider using a code analysis
 tool like [Pylint](https://github.com/PyCQA/pylint). Install it

diff --git a/docs/filters/script_and_language_identification_filters.md b/docs/filters/script_and_language_identification_filters.md
@@ -35,7 +35,7 @@ Filter segments based on their language identification confidence scores.
 Parameters:
 
 * `languages`: expected languages (ISO639 language codes) for the segments
-* `id_method`: language indentification method (`langid` for using the `langid` library, `cld2` for using the `cld2` library, or `fasttext` for using a `fasttext` model; the default is `langid`)
+* `id_method`: language identification method (`langid`, `lingua`, `cld2`, `fasttext`; default `langid`)
 * `thresholds`: minimum identification confidence score for the segments (a single float or a list of floats per language)
 * `fasttext_model_path`: path for a `fasttext` model (required only for the `fasttext` method; default `null`)
 * `langid_languages`: limit detection to a list of possible languages (valid only for the `langid` method; default `null`)
@@ -44,8 +44,15 @@ Parameters:
 
 Returned scores are the language identification confidence scores from a given identification method for the segments. The scores range from 0 to 1. In filtering, all values have to be greater than the minimum thresholds. Negative threshold can be used to skip filtering for a language.
 
-See [langid.py](https://github.com/saffsd/langid.py) and
-[pycld2](https://github.com/aboSamoor/pycld2) for the method-specific
-options. A pretrained `fasttext` model can be downloaded from
-[fasttext.cc/docs/en/language-identification.html](https://fasttext.cc/docs/en/language-identification.html).
-The `cld2` and `fasttext` methods require [installing optional libraries](../installation.md).
+Currently the following identification methods are supported:
+
+* `langid` (default) :cite:`lui-baldwin-2012-langid`
+  * See https://github.com/adbar/py3langid
+* `lingua`
+  * See https://github.com/pemistahl/lingua-py
+* `cld2`
+  * See https://github.com/CLD2Owners/cld2
+  * Requires [installing optional libraries](../installation.md).
+* `fasttext` :cite:`joulin-etal-2016-fasttext` and :cite:`joulin-etal-2017-bag`
+  * A pretrained model can be downloaded from [fasttext.cc/docs/en/language-identification.html](https://fasttext.cc/docs/en/language-identification.html).
+  * Requires [installing optional libraries](../installation.md).
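
A minimal sketch of the threshold rule described in this hunk: a segment pair passes only if every confidence score is greater than the threshold for its language, and a negative threshold effectively disables the check, because scores lie between 0 and 1. The function name `accept_scores` is hypothetical and is not taken from OpusFilter; the strict comparison is an assumption based on the wording "greater than".

```python
def accept_scores(scores, thresholds):
    """Return True if every score exceeds its per-language threshold.

    Hypothetical helper for illustration only; the strict ">" follows the
    documentation wording and may differ from the library's comparison.
    """
    return all(score > threshold for score, threshold in zip(scores, thresholds))


# Both segments are confidently identified: the pair is kept.
print(accept_scores([0.98, 0.95], [0.9, 0.9]))

# The second segment falls below its threshold: the pair is rejected.
print(accept_scores([0.98, 0.40], [0.9, 0.9]))

# A negative threshold skips filtering for the second language.
print(accept_scores([0.98, 0.40], [0.9, -1]))
```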
diff --git a/docs/installation.md b/docs/installation.md
@@ -12,14 +12,14 @@ Install from source:
 
 Note that all required libraries are not available to install via PyPI
 on Windows OS. On Linux and MacOS, it should work directly for Python
-versions from 3.8 to 3.11.
+versions from 3.8 to 3.12.
 
 ## Required libraries
 
 * beautifulsoup4
 * opus-fast-mosestokenizer
 * graphviz
-* langid
+* py3langid
 * matplotlib
 * morfessor
 * OpusTools
@@ -41,11 +41,11 @@ See `setup.py` for possible version requirements.
 
 ### FastText and PyCLD2 language identification
 
-The language identification methods currently supported out-of-the-box
-are [langid](https://github.com/saffsd/langid.py) and
+The language identification libraries currently supported out-of-the-box
+are [py3langid](https://github.com/adbar/py3langid) and
 [lingua](https://github.com/pemistahl/lingua-py). The support for
-[pycld2](https://github.com/aboSamoor/pycld2) and
-[fasttext models](https://fasttext.cc/docs/en/language-identification.html)
+[PyCLD2](https://github.com/aboSamoor/pycld2) and
+[FastText models](https://fasttext.cc/docs/en/language-identification.html)
 has been made optional due to the lack of support, especially
 for newer Python versions.
 

diff --git a/opusfilter/filters.py b/opusfilter/filters.py
@@ -334,8 +334,8 @@ def __init__(self, languages=None, id_method='langid', thresholds=None,
 
     def init_langid(self, langid_languages):
         """Initialize langid identifier"""
-        from langid.langid import LanguageIdentifier, model
-        self.identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
+        from py3langid.langid import LanguageIdentifier, MODEL_FILE
+        self.identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)
         if langid_languages:
             self.identifier.set_languages(langid_languages)
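
A standalone sketch of the `py3langid` calls used in this hunk, shown outside the filter class. The helper name `make_identifier` and the sample sentence are hypothetical; the import is guarded because `py3langid` is a third-party package that may not be installed.

```python
# Guarded import: py3langid is an external dependency.
try:
    from py3langid.langid import LanguageIdentifier, MODEL_FILE
except ImportError:
    LanguageIdentifier = None
    MODEL_FILE = None


def make_identifier(languages=None):
    """Build a py3langid identifier, optionally restricted to some languages.

    Hypothetical helper mirroring the calls in init_langid above.
    """
    if LanguageIdentifier is None:
        raise RuntimeError("py3langid is not installed")
    # norm_probs=True normalizes confidence scores to the 0..1 range.
    identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)
    if languages:
        identifier.set_languages(languages)
    return identifier


if LanguageIdentifier is not None:
    identifier = make_identifier(["en", "fi"])
    # classify() returns a (language code, confidence) pair.
    lang, score = identifier.classify("This sentence is written in English.")
    print(lang, score)
```

Compared with the removed `langid` call, the model is loaded from a pickled file (`MODEL_FILE`) instead of an in-memory model string; the rest of the identifier interface used here is unchanged.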
 

diff --git a/requirements.txt b/requirements.txt
@@ -5,7 +5,7 @@ opustools
 jieba>=0.42
 beautifulsoup4>=4.8.2
 graphviz>=0.16
-langid==1.1.6
+py3langid==0.3.0
 matplotlib>=3.3.0
 opus-fast-mosestokenizer>=0.0.8.5
 pandas>=1.0.0

diff --git a/setup.py b/setup.py
@@ -8,7 +8,7 @@
     "opustools",
     "beautifulsoup4>=4.8.0",
     "graphviz",
-    "langid",
+    "py3langid>=0.2.2",
     "matplotlib",
     "morfessor",
     "opus-fast-mosestokenizer>=0.0.8.5",
@@ -30,6 +30,7 @@
 ]
 
 fasttext_require = [
+    "py3langid<0.3.0",  # 0.3.0 requires numpy 2.0.0
     "numpy<2.0.0",
     "fasttext"
 ]