replace langid.py with py3langid
svirpioj committed Jun 26, 2024
1 parent 9fbe7d0 commit 55ec9a7
Showing 8 changed files with 29 additions and 20 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -25,7 +25,7 @@ Install from source:

### Troubleshooting

-OpusFilter should generally work fine on Python 3.8 to 3.11. In the case of troubles, try installing the exact versions in `requirements.txt`:
+OpusFilter should generally work fine on Python 3.8 to 3.12. In the case of troubles, try installing the exact versions in `requirements.txt`:

* `pip install -r requirements.txt`

3 changes: 2 additions & 1 deletion docs/CHANGELOG.md
@@ -9,7 +9,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Changed

-- make pycld2 and fasttext libraries optional
+- make `pycld2` and `fasttext` libraries optional
+- replace `langid.py` library with `py3langid`
- update github workflows and include Python 3.12 tests

## [3.1.0] - 2024-06-05
4 changes: 2 additions & 2 deletions docs/CONTRIBUTING.md
@@ -5,7 +5,7 @@ issues page. We are also happy to consider pull requests. There are a
few rules for pull requests:

* Make a pull request to the `develop` branch instead of `master`.
-* The code should support at least Python versions from 3.8 to 3.11.
+* The code should support at least Python versions from 3.8 to 3.12.
* Please follow [PEP 8](https://www.python.org/dev/peps/pep-0008/). Exception: The maximum line length is 127 characters instead of 79.
* Especially for new features, please include test cases for unit testing.

@@ -20,7 +20,7 @@ skips the respective tests if not.)

GitHub workflows defined in the project run automatically `flake8`
checks and unit testing with `pytest` using Python 3.8, 3.9, 3.10,
-and 3.11.
+3.11, and 3.12.

Especially for larger contributions, consider using a code analysis
tool like [Pylint](https://github.com/PyCQA/pylint). Install it
19 changes: 13 additions & 6 deletions docs/filters/script_and_language_identification_filters.md
@@ -35,7 +35,7 @@ Filter segments based on their language identification confidence scores.
Parameters:

* `languages`: expected languages (ISO639 language codes) for the segments
-* `id_method`: language indentification method (`langid` for using the `langid` library, `cld2` for using the `cld2` library, or `fasttext` for using a `fasttext` model; the default is `langid`)
+* `id_method`: language indentification method (`langid`, `lingua`, `cld2`, `fasttext`; default `langid`)
* `thresholds`: minimum identification confidence score for the segments (a single float or a list of floats per language)
* `fasttext_model_path`: path for a `fasttext` model (required only for the `fasttext` method; default `null`)
* `langid_languages`: limit detection to a list of possible languages (valid only for the `langid` method; default `null`)
@@ -44,8 +44,15 @@ Parameters:

Returned scores are the language identification confidence scores from a given identification method for the segments. The scores range from 0 to 1. In filtering, all values have to be greater than the minimum thresholds. Negative threshold can be used to skip filtering for a language.

-See [langid.py](https://github.com/saffsd/langid.py) and
-[pycld2](https://github.com/aboSamoor/pycld2) for the method-specific
-options. A pretrained `fasttext` model can be downloaded from
-[fasttext.cc/docs/en/language-identification.html](https://fasttext.cc/docs/en/language-identification.html).
-The `cld2` and `fasttext` methods require [installing optional libraries](../installation.md).
+Currently the following identification methods are supported:
+
+* `langid` (default) :cite:`lui-baldwin-2012-langid`
+  * See https://github.com/adbar/py3langid
+* `lingua`
+  * See https://github.com/pemistahl/lingua-py
+* `cld2`
+  * See https://github.com/CLD2Owners/cld2
+  * Requires [installing optional libraries](../installation.md).
+* `fasttext` :cite:`joulin-etal-2016-fasttext` and :cite:`joulin-etal-2017-bag`
+  * A pretrained model can be downloaded from [fasttext.cc/docs/en/language-identification.html](https://fasttext.cc/docs/en/language-identification.html).
+  * Requires [installing optional libraries](../installation.md).
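
The threshold rule stated in the documentation hunk above (scores range from 0 to 1, every score must be greater than its minimum threshold, and a negative threshold skips filtering for that language) can be sketched as follows. This is a minimal illustration, not OpusFilter's own code: the function name `accept_scores` is hypothetical, and applying a single float to every language is an assumption based on the `thresholds` parameter description.

```python
from typing import Sequence, Union


def accept_scores(scores: Sequence[float],
                  thresholds: Union[float, Sequence[float]]) -> bool:
    """Return True if every score is strictly greater than its threshold.

    Hypothetical helper illustrating the documented rule; it is not part
    of the OpusFilter API.
    """
    # Assumption: a single number is applied to every language.
    if isinstance(thresholds, (int, float)):
        thresholds = [thresholds] * len(scores)
    if len(thresholds) != len(scores):
        raise ValueError("need exactly one threshold per score")
    # Scores lie in [0, 1], so any negative threshold always passes,
    # which effectively disables filtering for that language.
    return all(score > threshold
               for score, threshold in zip(scores, thresholds))


print(accept_scores([0.95, 0.90], 0.8))         # True
print(accept_scores([0.95, 0.50], [0.8, 0.8]))  # False
print(accept_scores([0.95, 0.00], [0.8, -1]))   # True
```

Note that the comparison is strict, so a score of exactly 0 fails a threshold of 0 but passes any negative threshold.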
12 changes: 6 additions & 6 deletions docs/installation.md
@@ -12,14 +12,14 @@ Install from source:

Note that all required libraries are not available to install via PyPI
on Windows OS. On Linux and MacOS, it should work directly for Python
-versions from 3.8 to 3.11.
+versions from 3.8 to 3.12.

## Required libraries

* beautifulsoup4
* opus-fast-mosestokenizer
* graphviz
-* langid
+* py3langid
* matplotlib
* morfessor
* OpusTools
@@ -41,11 +41,11 @@ See `setup.py` for possible version requirements.

### FastText and PyCLD2 language identification

-The language identification methods currently supported out-of-the-box
-are [langid](https://github.com/saffsd/langid.py) and
+The language identification libraries currently supported out-of-the-box
+are [py3langid](https://github.com/adbar/py3langid) and
[lingua](https://github.com/pemistahl/lingua-py). The support for for
-[pycld2](https://github.com/aboSamoor/pycld2) and
-[fasttext models](https://fasttext.cc/docs/en/language-identification.html)
+[PyCLD2](https://github.com/aboSamoor/pycld2) and
+[FastText models](https://fasttext.cc/docs/en/language-identification.html)
have been changed to optional due to the lack of support especially
for newer Python versions.

4 changes: 2 additions & 2 deletions opusfilter/filters.py
@@ -334,8 +334,8 @@ def __init__(self, languages=None, id_method='langid', thresholds=None,

     def init_langid(self, langid_languages):
         """Initialize langid identifier"""
-        from langid.langid import LanguageIdentifier, model
-        self.identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
+        from py3langid.langid import LanguageIdentifier, MODEL_FILE
+        self.identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)
         if langid_languages:
             self.identifier.set_languages(langid_languages)

2 changes: 1 addition & 1 deletion requirements.txt
@@ -5,7 +5,7 @@ opustools
jieba>=0.42
beautifulsoup4>=4.8.2
graphviz>=0.16
-langid==1.1.6
+py3langid==0.3.0
matplotlib>=3.3.0
opus-fast-mosestokenizer>=0.0.8.5
pandas>=1.0.0
3 changes: 2 additions & 1 deletion setup.py
@@ -8,7 +8,7 @@
"opustools",
"beautifulsoup4>=4.8.0",
"graphviz",
-"langid",
+"py3langid>=0.2.2",
"matplotlib",
"morfessor",
"opus-fast-mosestokenizer>=0.0.8.5",
@@ -30,6 +30,7 @@
]

fasttext_require = [
+"py3langid<0.3.0", # 0.3.0 requires numpy 2.0.0
"numpy<2.0.0",
"fasttext"
]
