New dataset notifications: fix keyword matching #4413

Merged
merged 1 commit into from
Jan 14, 2025

AntoineAugusti
Member

Fixes #4411

Adapts NewDatagouvDatasetsJob, the job that identifies datasets worth referencing on the PAN. The previous code was too lax: for a "vélos" (bikes) category with the keyword "vélo", it would have flagged a dataset tagged developpement-durable as relevant, because "developpement" contains the substring "velo".

I therefore adapt the method that strips accents (it previously stripped spaces as well…) and the keyword-matching method. A plain String.contains? is not enough.
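
To illustrate the false positive described above, here is a small sketch comparing a substring check with a whole-word check. The variable names are illustrative only and are not taken from the job's code.

```elixir
tag = "developpement-durable"
keyword = "velo"

# Substring check: "velo" is found inside "de-velo-ppement".
substring_match = String.contains?(tag, keyword)
# => true

# Whole-word check: split on whitespace, then compare complete tokens.
whole_word_match = tag |> String.split(~r/\s+/) |> Enum.member?(keyword)
# => false
```

The whole-word variant only matches when a complete token equals the keyword, which is the stricter behaviour this PR is after.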

@AntoineAugusti AntoineAugusti requested a review from a team as a code owner January 14, 2025 08:20
@@ -257,8 +265,27 @@ defmodule Transport.Jobs.NewDatagouvDatasetsJob do
"velo"
iex> normalize("Châteauroux")
"chateauroux"
iex> normalize("J'adore manger")
"j'adore manger"
Member Author

Previously we had this:

iex> "J'adore manger" |> String.normalize(:nfd) |> String.replace(~r/[^A-z]/u, "") |> String.downcase()
"jadoremanger"

That was not the intended behaviour: spaces must be preserved for keyword matching.
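
One possible way to strip accents while keeping spaces and apostrophes, consistent with the doctests quoted above, is sketched below. The module name `NormalizeSketch` and the choice of removing Unicode combining marks (`\p{Mn}`) are assumptions made for illustration; the implementation actually merged in this PR may differ.

```elixir
defmodule NormalizeSketch do
  # Hypothetical helper, not the function from the PR.
  # NFD decomposition turns "â" into "a" followed by a combining mark;
  # removing only the combining marks leaves letters, spaces and
  # punctuation untouched.
  def normalize(str) do
    str
    |> String.normalize(:nfd)
    |> String.replace(~r/\p{Mn}/u, "")
    |> String.downcase()
  end
end

NormalizeSketch.normalize("Châteauroux")
# => "chateauroux"
NormalizeSketch.normalize("J'adore manger")
# => "j'adore manger"
```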

Comment on lines 224 to 234
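# Multi-word keywords need a substring check; single-word keywords are matched as whole words.
{words_with_spaces, words_without_spaces} = Enum.split_with(searches, &String.contains?(&1, " "))

# True when at least one whole word of the normalized string is a single-word keyword.
match_without_spaces =
  not (str
       |> normalize()
       |> String.split(~r/\s+/)
       |> MapSet.new()
       |> MapSet.disjoint?(MapSet.new(words_without_spaces)))

# True when the normalized string contains at least one multi-word keyword.
match_with_spaces = str |> normalize() |> String.contains?(words_with_spaces)

match_without_spaces || match_with_spaces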
Member Author

This is not the clearest code; if you have a refactoring idea, I'm all ears 👌
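
As one possible refactoring direction, the two checks could be folded into a single `Enum.any?/2` over the keywords. This is only a sketch: the module name `KeywordMatchSketch`, the function name `matches?/2`, and the inlined `normalize/1` are assumptions, not code from this PR, and like the reviewed snippet it assumes the keywords are already normalized.

```elixir
defmodule KeywordMatchSketch do
  # Hypothetical normalization: strip accents, keep spaces, downcase.
  def normalize(str) do
    str
    |> String.normalize(:nfd)
    |> String.replace(~r/\p{Mn}/u, "")
    |> String.downcase()
  end

  # Returns true if any keyword matches: keywords containing a space are
  # matched as substrings, the others must equal a whole word.
  def matches?(str, searches) do
    normalized = normalize(str)
    words = normalized |> String.split(~r/\s+/, trim: true) |> MapSet.new()

    Enum.any?(searches, fn search ->
      if String.contains?(search, " ") do
        String.contains?(normalized, search)
      else
        MapSet.member?(words, search)
      end
    end)
  end
end

KeywordMatchSketch.matches?("Aménagements vélo", ["velo"])
# => true
KeywordMatchSketch.matches?("developpement-durable", ["velo"])
# => false
KeywordMatchSketch.matches?("Vélos en libre service", ["libre service"])
# => true
```

Keeping one decision per keyword may read more directly than building two separate booleans, at the cost of checking keywords one by one instead of using a set intersection.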

"libre-service",
"libre service",
"scooter"
Member Author

This entry was previously duplicated; I also added the plural form.

@AntoineAugusti AntoineAugusti force-pushed the new_datagouv_datasets_job_full_words branch from df66671 to 567edc4 Compare January 14, 2025 08:24
@AntoineAugusti AntoineAugusti force-pushed the new_datagouv_datasets_job_full_words branch from 567edc4 to c7444ea Compare January 14, 2025 08:27
@ptitfred ptitfred self-assigned this Jan 14, 2025
@AntoineAugusti AntoineAugusti added this pull request to the merge queue Jan 14, 2025
Merged via the queue into master with commit 50d0e86 Jan 14, 2025
4 checks passed
@AntoineAugusti AntoineAugusti deleted the new_datagouv_datasets_job_full_words branch January 14, 2025 10:47
Development

Successfully merging this pull request may close these issues.

New dataset notifications on datagouv - problem with keywords
2 participants