Commit

Merge main into transformers-batching
fexfl committed Jan 14, 2025
2 parents 7f32161 + d24d170 commit c52b2b6
Showing 3 changed files with 18 additions and 7 deletions.
4 changes: 2 additions & 2 deletions .pre-commit-config.yaml
@@ -1,10 +1,10 @@
repos:
- repo: https://github.com/psf/black
-rev: 24.4.2
+rev: 24.10.0
hooks:
- id: black
- repo: https://github.com/pycqa/flake8
-rev: 7.1.0
+rev: 7.1.1
hooks:
- id: flake8
- repo: https://github.com/kynan/nbstripout
8 changes: 5 additions & 3 deletions mailcom/inout.py
@@ -30,8 +30,8 @@ def list_of_files(self):
if len(self.email_list) == 0:
raise ValueError(
"""The directory {} does not contain .eml or .html files.
-Please check that the directory is containing the
-email data files""".format(
+Please check that the directory is containing the email
+data files""".format(
mypath
)
)
@@ -83,7 +83,9 @@ def validate_data(self):
pass

def data_to_xml(self, text):
-my_item_func = lambda x: "content" # noqa
+def my_item_func(x):
+return "content"
+
xml = dicttoxml(text, custom_root="email", item_func=my_item_func)
return xml.decode()
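
The hunk above swaps a lambda assignment (which flake8 reports as E731 unless silenced with `# noqa`) for a named function. The following is a minimal sketch of why the two forms behave the same when passed as an item-naming callback. `wrap_items` is a hypothetical stand-in written for this illustration only; it is not part of `dicttoxml` or of `mailcom`.

```python
def my_item_func(parent):
    # Named-function form from the commit: ignore the parent name
    # and always label list items "content".
    return "content"


# Previous form; flake8 flags assigning a lambda to a name (E731).
old_item_func = lambda x: "content"  # noqa: E731


def wrap_items(parent, items, item_func):
    """Hypothetical serializer: names each list item via the callback."""
    tag = item_func(parent)
    return "".join(f"<{tag}>{item}</{tag}>" for item in items)


new_xml = wrap_items("email", ["Hello", "World"], my_item_func)
old_xml = wrap_items("email", ["Hello", "World"], old_item_func)
print(new_xml == old_xml)
```

Besides satisfying the linter without a `# noqa` marker, the named function carries a real `__name__`, which tends to make tracebacks easier to read.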

13 changes: 11 additions & 2 deletions notebook/demo.ipynb
@@ -35,7 +35,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"Below, the input files are loaded from the given `input_dir` directory. You can provide relative or absolute paths to the directory that contains your `eml` or `html` files. All files of the `eml` or `htlm` file type in that directory will be considered input files."
+"The cell below defines a function used to display the result in the end, and highlight all named entities found in the text. It is used for demonstration purposes in this example."
]
},
{
@@ -67,6 +67,13 @@
" return text"
]
},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Below, the input files are loaded from the given `input_dir` directory. You can provide relative or absolute paths to the directory that contains your `eml` or `html` files. All files of the `eml` or `html` file type in that directory will be considered input files."
+]
+},
{
"cell_type": "code",
"execution_count": null,
@@ -99,7 +106,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"In the cell below, the emails are looped over and the text is extracted. The text is then split into sentences and the sentences are pseudonymized. The pseudonymized sentences are then joined back into a text and saved to a new file."
+"In the cell below, the emails are looped over and the text is extracted. The text is then split into sentences and the sentences are pseudonymized. The pseudonymized sentences are then joined back into a text and saved to a new file.\n",
+"\n",
+"The input text is displayed and the found named entities are highlighted for demonstration. Note that emails (all words containing '@') are filtered out seperately and thus not highlighted here."
]
},
{