Commit

Deploying to gh-pages from @ d5f2118 🚀
svirpioj committed Aug 14, 2024
1 parent c56fff3 commit d2a642e
Showing 46 changed files with 358 additions and 254 deletions.
2 changes: 1 addition & 1 deletion .buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: c1d81864d9a486aace2e262a23a8d628
config: 049ffd4e121bdafbe656d7a50163aab1
tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file modified .doctrees/CHANGELOG.doctree
Binary file not shown.
Binary file modified .doctrees/CONTRIBUTING.doctree
Binary file not shown.
Binary file modified .doctrees/environment.pickle
Binary file not shown.
Binary file not shown.
Binary file modified .doctrees/functions/downloading_and_selecting_data.doctree
Binary file not shown.
Binary file modified .doctrees/installation.doctree
Binary file not shown.
274 changes: 148 additions & 126 deletions CHANGELOG.html

Large diffs are not rendered by default.

10 changes: 5 additions & 5 deletions CONTRIBUTING.html
@@ -4,7 +4,7 @@
<meta charset="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" />

<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Contributing &mdash; OpusFilter 3.1.0 documentation</title>
<title>Contributing &mdash; OpusFilter 3.2.0 documentation</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css?v=80d5e7a1" />
<link rel="stylesheet" type="text/css" href="_static/css/theme.css?v=19f00094" />

@@ -15,7 +15,7 @@

<script src="_static/jquery.js?v=5d32c60e"></script>
<script src="_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
<script data-url_root="./" id="documentation_options" src="_static/documentation_options.js?v=ff68d3e4"></script>
<script data-url_root="./" id="documentation_options" src="_static/documentation_options.js?v=6d170d0f"></script>
<script src="_static/doctools.js?v=888ff710"></script>
<script src="_static/sphinx_highlight.js?v=4825356b"></script>
<script src="_static/js/theme.js"></script>
@@ -37,7 +37,7 @@
OpusFilter
</a>
<div class="version">
3.1
3.2
</div>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="search.html" method="get">
@@ -121,7 +121,7 @@ <h1>Contributing<a class="headerlink" href="#contributing" title="Permalink to t
few rules for pull requests:</p>
<ul class="simple">
<li><p>Make a pull request to the <code class="docutils literal notranslate"><span class="pre">develop</span></code> branch instead of <code class="docutils literal notranslate"><span class="pre">master</span></code>.</p></li>
<li><p>The code should support at least Python versions from 3.8 to 3.11.</p></li>
<li><p>The code should support at least Python versions from 3.8 to 3.12.</p></li>
<li><p>Please follow <a class="reference external" href="https://www.python.org/dev/peps/pep-0008/">PEP 8</a>. Exception: The maximum line length is 127 characters instead of 79.</p></li>
<li><p>Especially for new features, please include test cases for unit testing.</p></li>
</ul>
@@ -134,7 +134,7 @@ <h1>Contributing<a class="headerlink" href="#contributing" title="Permalink to t
skips the respective tests if not.)</p>
<p>GitHub workflows defined in the project automatically run <code class="docutils literal notranslate"><span class="pre">flake8</span></code>
checks and unit testing with <code class="docutils literal notranslate"><span class="pre">pytest</span></code> using Python 3.8, 3.9, 3.10,
and 3.11.</p>
3.11, and 3.12.</p>
<p>Especially for larger contributions, consider using a code analysis
tool like <a class="reference external" href="https://github.com/PyCQA/pylint">Pylint</a>. Install it
e.g. via <code class="docutils literal notranslate"><span class="pre">pip</span></code>, run <code class="docutils literal notranslate"><span class="pre">pylint</span> <span class="pre">opusfilter/</span></code> in the project root and fix
15 changes: 14 additions & 1 deletion _sources/CHANGELOG.md.txt
@@ -7,6 +7,18 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [3.2.0] - 2024-08-14

### Changed

- make `pycld2` and `fasttext` libraries optional
- replace `langid.py` library with `py3langid`
- update github workflows and include Python 3.12 tests

### Fixed

- `OpusRead` interface using `moses` format (requires `opustools >= 1.6.2`)

## [3.1.0] - 2024-06-05

### Added
@@ -204,7 +216,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
First tagged version.


[Unreleased]: https://github.com/Helsinki-NLP/OpusFilter/compare/3.1.0...develop
[Unreleased]: https://github.com/Helsinki-NLP/OpusFilter/compare/3.2.0...develop
[3.2.0]: https://github.com/Helsinki-NLP/OpusFilter/compare/3.1.0...3.2.0
[3.1.0]: https://github.com/Helsinki-NLP/OpusFilter/compare/3.0.0...3.1.0
[3.0.0]: https://github.com/Helsinki-NLP/OpusFilter/compare/2.6.0...3.0.0
[2.6.0]: https://github.com/Helsinki-NLP/OpusFilter/compare/2.5.1...2.6.0
4 changes: 2 additions & 2 deletions _sources/CONTRIBUTING.md.txt
@@ -5,7 +5,7 @@ issues page. We are also happy to consider pull requests. There are a
few rules for pull requests:

* Make a pull request to the `develop` branch instead of `master`.
* The code should support at least Python versions from 3.8 to 3.11.
* The code should support at least Python versions from 3.8 to 3.12.
* Please follow [PEP 8](https://www.python.org/dev/peps/pep-0008/). Exception: The maximum line length is 127 characters instead of 79.
* Especially for new features, please include test cases for unit testing.

@@ -20,7 +20,7 @@ skips the respective tests if not.)

GitHub workflows defined in the project automatically run `flake8`
checks and unit testing with `pytest` using Python 3.8, 3.9, 3.10,
and 3.11.
3.11, and 3.12.

Especially for larger contributions, consider using a code analysis
tool like [Pylint](https://github.com/PyCQA/pylint). Install it
18 changes: 13 additions & 5 deletions _sources/filters/script_and_language_identification_filters.md.txt
@@ -35,7 +35,7 @@ Filter segments based on their language identification confidence scores.
Parameters:

* `languages`: expected languages (ISO639 language codes) for the segments
* `id_method`: language identification method (`langid` for using the `langid` library, `cld2` for using the `cld2` library, or `fasttext` for using a `fasttext` model; the default is `langid`)
* `id_method`: language identification method (`langid`, `lingua`, `cld2`, `fasttext`; default `langid`)
* `thresholds`: minimum identification confidence score for the segments (a single float or a list of floats per language)
* `fasttext_model_path`: path for a `fasttext` model (required only for the `fasttext` method; default `null`)
* `langid_languages`: limit detection to a list of possible languages (valid only for the `langid` method; default `null`)
@@ -44,7 +44,15 @@ Parameters:

The returned scores are the language identification confidence scores that the chosen identification method assigns to the segments. The scores range from 0 to 1. In filtering, every score has to be greater than its minimum threshold. A negative threshold can be used to skip filtering for a language.

See [langid.py](https://github.com/saffsd/langid.py) and
[pycld2](https://github.com/aboSamoor/pycld2) for the method-specific
options. A pretrained `fasttext` model can be downloaded from
[fasttext.cc/docs/en/language-identification.html](https://fasttext.cc/docs/en/language-identification.html).
Currently the following identification methods are supported:

* `langid` (default) :cite:`lui-baldwin-2012-langid`
* See https://github.com/adbar/py3langid
* `lingua`
* See https://github.com/pemistahl/lingua-py
* `cld2`
* See https://github.com/CLD2Owners/cld2
* Requires [installing optional libraries](../installation.md).
* `fasttext` :cite:`joulin-etal-2016-fasttext` and :cite:`joulin-etal-2017-bag`
* A pretrained model can be downloaded from [fasttext.cc/docs/en/language-identification.html](https://fasttext.cc/docs/en/language-identification.html).
* Requires [installing optional libraries](../installation.md).
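
The threshold rule described for `thresholds` (every score must be greater than its minimum, and a negative threshold effectively disables the check for that language) can be sketched as follows. This is a minimal illustration only, not OpusFilter's actual implementation; the function name `accept` and its signature are hypothetical.

```python
from typing import Sequence, Union


def accept(scores: Sequence[float],
           thresholds: Union[float, Sequence[float]]) -> bool:
    """Return True if every score is strictly greater than its threshold.

    Hypothetical helper: `scores` holds one confidence value (0 to 1) per
    segment, `thresholds` is a single float or one float per language.
    """
    if isinstance(thresholds, (int, float)):
        # A single threshold applies to all languages.
        thresholds = [thresholds] * len(scores)
    if len(thresholds) != len(scores):
        raise ValueError("need exactly one threshold per score")
    # Scores are never below 0, so a negative threshold always passes.
    return all(score > limit for score, limit in zip(scores, thresholds))
```

With this sketch, `accept([0.9, 0.0], [0.5, -1])` passes because the second language is skipped by its negative threshold, while `accept([0.9, 0.3], 0.5)` fails on the second segment.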
7 changes: 6 additions & 1 deletion _sources/functions/downloading_and_selecting_data.md.txt
@@ -11,11 +11,16 @@ Parameters:
* `source_language`: language code for the source language
* `target_language`: language code for the target language
* `release`: version of the corpus in OPUS
* `preprocessing`: `raw` for untokenized and `xml` for tokenized segments
* `preprocessing`: `moses` or `raw` for untokenized and `xml` for tokenized segments
* `src_output`: output file for source language
* `tgt_output`: output file for target language
* `suppress_prompts`: `false` (default) prompts user to confirm before download, `true` to download without prompting

The `moses` preprocessing type (available with `OpusTools` version
1.6.2 and above) is recommended for those corpora for which it
exists. The output is equivalent to `raw`, but in some cases it can
significantly reduce the amount of data downloaded in the process.
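
As a rough sketch of how a script might check the version requirement before choosing `moses`, the helper below compares release numbers. It assumes the installed distribution is named `opustools` (as in the changelog entry `opustools >= 1.6.2`); the function names are hypothetical and not part of OpusFilter or OpusTools.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(text: str) -> tuple:
    """Turn '1.6.2' into (1, 6, 2); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def supports_moses(installed: str, minimum: str = "1.6.2") -> bool:
    """Return True if the given version string is at least `minimum`."""
    return parse_release(installed) >= parse_release(minimum)


def installed_opustools_supports_moses() -> bool:
    """Check the installed package; False if it is not installed."""
    try:
        return supports_moses(version("opustools"))
    except PackageNotFoundError:
        return False
```

A configuration generator could fall back to `raw` when `installed_opustools_supports_moses()` returns `False`, since the output of the two preprocessing types is described as equivalent.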

## concatenate

Concatenate two or more text files.
34 changes: 25 additions & 9 deletions _sources/installation.md.txt
@@ -12,20 +12,18 @@ Install from source:

Note that not all required libraries are available to install via PyPI
on Windows. On Linux and macOS, it should work directly for Python
versions from 3.8 to 3.11.
versions from 3.8 to 3.12.

## Required libraries

* beautifulsoup4
* opus-fast-mosestokenizer
* fasttext
* graphviz
* langid
* py3langid
* matplotlib
* morfessor
* OpusTools
* pandas
* pycld2
* rapidfuzz
* ruamel.yaml
* regex
@@ -41,24 +39,42 @@ See `setup.py` for possible version requirements.

## Optional libraries and tools

### FastText and PyCLD2 language identification

The language identification libraries currently supported out-of-the-box
are [py3langid](https://github.com/adbar/py3langid) and
[lingua](https://github.com/pemistahl/lingua-py). Support for
[PyCLD2](https://github.com/aboSamoor/pycld2) and
[FastText models](https://fasttext.cc/docs/en/language-identification.html)
has been made optional because these libraries lack support,
especially for newer Python versions.

The PyCLD2 support can be installed automatically with pip by
including the extras `[pycld2]` or `[all]` (e.g.
`pip install opusfilter[pycld2]`).

The support for FastText models can be installed automatically with
pip by including the extras `[fasttext]` or `[all]` (e.g.
`pip install opusfilter[fasttext]`).
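
Because PyCLD2 and FastText are now optional, a script may want to check which identification methods are usable before building a configuration. The sketch below is one way to do that with the standard library; the mapping from method names to importable module names is an assumption, and `available_methods` is a hypothetical helper rather than an OpusFilter function.

```python
import importlib.util
from typing import Dict, List

# Method name -> importable module name (assumed import names).
ID_METHOD_MODULES: Dict[str, str] = {
    "langid": "py3langid",
    "lingua": "lingua",
    "cld2": "pycld2",
    "fasttext": "fasttext",
}


def available_methods(modules: Dict[str, str] = ID_METHOD_MODULES) -> List[str]:
    """List the methods whose backing module can be found for import."""
    return [
        method
        for method, module in modules.items()
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module) is not None
    ]
```

Calling `available_methods()` in an environment installed without the `[pycld2]` or `[fasttext]` extras would be expected to omit `cld2` and `fasttext` from the result.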

### Jieba and MeCab word segmentation

For Chinese tokenization (word segmentation), you can use the
[jieba](https://github.com/fxsjy/jieba) library. It can be installed
automatically with pip by including the extras `[jieba]` or `[all]`
(e.g. `pip install opusfilter[all]`).
(e.g. `pip install opusfilter[jieba]`).

For Japanese tokenization (word segmentation), you can use the
[MeCab](https://github.com/SamuraiT/mecab-python3) library. It can be installed
automatically with pip by including the extras `[mecab]` or `[all]`
(e.g. `pip install opusfilter[all]`).
(e.g. `pip install opusfilter[mecab]`).

### LASER sentence embeddings

For using sentence embeddings filters, you need to install
`laserembeddings` (https://github.com/yannvgn/laserembeddings). It can
be installed automatically with pip by including the extras `[laser]`
or `[all]` (e.g. `pip install opusfilter[all]`). The package will also
or `[all]` (e.g. `pip install opusfilter[laser]`). The package will also
require a number of additional libraries, including PyTorch, jieba,
and MeCab. Note that you also need to download the prebuilt models
with `python -m laserembeddings download-models`.
@@ -68,12 +84,12 @@ with `python -m laserembeddings download-models`.
For using n-gram language model filters, you need to install the
Python wrapper for VariKN (https://github.com/vsiivola/variKN). It can
be installed automatically with pip by including the extras `[varikn]`
or `[all]` (e.g. `pip install opusfilter[all]`).
or `[all]` (e.g. `pip install opusfilter[varikn]`).

### Eflomal word alignment

For using word alignment filters, you need to install eflomal
(https://github.com/robertostling/eflomal). It can be installed
automatically with pip by including the extras `[eflomal]` or `[all]`
(e.g. `pip install opusfilter[all]`). Note that you will need `Cython`
(e.g. `pip install opusfilter[eflomal]`). Note that you will need `Cython`
for the installation.
2 changes: 1 addition & 1 deletion _static/documentation_options.js
@@ -1,6 +1,6 @@
var DOCUMENTATION_OPTIONS = {
URL_ROOT: document.getElementById("documentation_options").getAttribute('data-url_root'),
VERSION: '3.1.0',
VERSION: '3.2.0',
LANGUAGE: 'en',
COLLAPSE_INDEX: false,
BUILDER: 'html',
6 changes: 3 additions & 3 deletions automatic_configuration.html
@@ -4,7 +4,7 @@
<meta charset="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" />

<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Automatic configuration generation &mdash; OpusFilter 3.1.0 documentation</title>
<title>Automatic configuration generation &mdash; OpusFilter 3.2.0 documentation</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css?v=80d5e7a1" />
<link rel="stylesheet" type="text/css" href="_static/css/theme.css?v=19f00094" />

@@ -15,7 +15,7 @@

<script src="_static/jquery.js?v=5d32c60e"></script>
<script src="_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
<script data-url_root="./" id="documentation_options" src="_static/documentation_options.js?v=ff68d3e4"></script>
<script data-url_root="./" id="documentation_options" src="_static/documentation_options.js?v=6d170d0f"></script>
<script src="_static/doctools.js?v=888ff710"></script>
<script src="_static/sphinx_highlight.js?v=4825356b"></script>
<script src="_static/js/theme.js"></script>
@@ -37,7 +37,7 @@
OpusFilter
</a>
<div class="version">
3.1
3.2
</div>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="search.html" method="get">
6 changes: 3 additions & 3 deletions command_line_tools.html
@@ -4,7 +4,7 @@
<meta charset="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" />

<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Command line tools for analysis &mdash; OpusFilter 3.1.0 documentation</title>
<title>Command line tools for analysis &mdash; OpusFilter 3.2.0 documentation</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css?v=80d5e7a1" />
<link rel="stylesheet" type="text/css" href="_static/css/theme.css?v=19f00094" />

@@ -15,7 +15,7 @@

<script src="_static/jquery.js?v=5d32c60e"></script>
<script src="_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
<script data-url_root="./" id="documentation_options" src="_static/documentation_options.js?v=ff68d3e4"></script>
<script data-url_root="./" id="documentation_options" src="_static/documentation_options.js?v=6d170d0f"></script>
<script src="_static/doctools.js?v=888ff710"></script>
<script src="_static/sphinx_highlight.js?v=4825356b"></script>
<script src="_static/js/theme.js"></script>
@@ -37,7 +37,7 @@
OpusFilter
</a>
<div class="version">
3.1
3.2
</div>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="search.html" method="get">
6 changes: 3 additions & 3 deletions filters/alignment_model_filters.html
@@ -4,7 +4,7 @@
<meta charset="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" />

<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Alignment model filters &mdash; OpusFilter 3.1.0 documentation</title>
<title>Alignment model filters &mdash; OpusFilter 3.2.0 documentation</title>
<link rel="stylesheet" type="text/css" href="../_static/pygments.css?v=80d5e7a1" />
<link rel="stylesheet" type="text/css" href="../_static/css/theme.css?v=19f00094" />

@@ -15,7 +15,7 @@

<script src="../_static/jquery.js?v=5d32c60e"></script>
<script src="../_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
<script data-url_root="../" id="documentation_options" src="../_static/documentation_options.js?v=ff68d3e4"></script>
<script data-url_root="../" id="documentation_options" src="../_static/documentation_options.js?v=6d170d0f"></script>
<script src="../_static/doctools.js?v=888ff710"></script>
<script src="../_static/sphinx_highlight.js?v=4825356b"></script>
<script src="../_static/js/theme.js"></script>
@@ -37,7 +37,7 @@
OpusFilter
</a>
<div class="version">
3.1
3.2
</div>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
6 changes: 3 additions & 3 deletions filters/custom_filters.html
@@ -4,7 +4,7 @@
<meta charset="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" />

<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Custom filters &mdash; OpusFilter 3.1.0 documentation</title>
<title>Custom filters &mdash; OpusFilter 3.2.0 documentation</title>
<link rel="stylesheet" type="text/css" href="../_static/pygments.css?v=80d5e7a1" />
<link rel="stylesheet" type="text/css" href="../_static/css/theme.css?v=19f00094" />

@@ -15,7 +15,7 @@

<script src="../_static/jquery.js?v=5d32c60e"></script>
<script src="../_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
<script data-url_root="../" id="documentation_options" src="../_static/documentation_options.js?v=ff68d3e4"></script>
<script data-url_root="../" id="documentation_options" src="../_static/documentation_options.js?v=6d170d0f"></script>
<script src="../_static/doctools.js?v=888ff710"></script>
<script src="../_static/sphinx_highlight.js?v=4825356b"></script>
<script src="../_static/js/theme.js"></script>
@@ -37,7 +37,7 @@
OpusFilter
</a>
<div class="version">
3.1
3.2
</div>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
6 changes: 3 additions & 3 deletions filters/language_model_filters.html
@@ -4,7 +4,7 @@
<meta charset="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" />

<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Language model filters &mdash; OpusFilter 3.1.0 documentation</title>
<title>Language model filters &mdash; OpusFilter 3.2.0 documentation</title>
<link rel="stylesheet" type="text/css" href="../_static/pygments.css?v=80d5e7a1" />
<link rel="stylesheet" type="text/css" href="../_static/css/theme.css?v=19f00094" />

@@ -15,7 +15,7 @@

<script src="../_static/jquery.js?v=5d32c60e"></script>
<script src="../_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
<script data-url_root="../" id="documentation_options" src="../_static/documentation_options.js?v=ff68d3e4"></script>
<script data-url_root="../" id="documentation_options" src="../_static/documentation_options.js?v=6d170d0f"></script>
<script src="../_static/doctools.js?v=888ff710"></script>
<script src="../_static/sphinx_highlight.js?v=4825356b"></script>
<script src="../_static/js/theme.js"></script>
@@ -37,7 +37,7 @@
OpusFilter
</a>
<div class="version">
3.1
3.2
</div>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../search.html" method="get">