Add stop words and use them to filter text matches
Stop words are very common words that carry basically no information.
Usually, stop word lists are language specific and you can easily see
why: "hat" might be a normal word in English, but carries no
information in German. "these" might be a stop word in English, but is
a useful word in German. Unfortunately we don't have the luxury of only
supporting one language and in fact: we don't even know the language
of a certain document. So we are kind of forced to have a combined list.
I created this semi-manually by combining DE and EN (the only languages
we currently support), making sure that words that carry meaning in any
of the languages are not marked as stop words. Additional languages can
be added in the future, but each new one decreases the usefulness of the
list.
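
The combination rule described above can be illustrated with a toy sketch. The actual list was put together semi-manually, so this is not how it was generated; `combine_stop_words` and the tiny word lists are hypothetical and only show the rule "a word that carries meaning in any supported language is never a stop word".

```rust
use std::collections::BTreeSet;

/// Merges per-language stop word candidates and drops every word that is
/// meaningful in at least one supported language. Hypothetical helper.
fn combine_stop_words(
    candidates: &[&[&str]],
    meaningful_somewhere: &[&str],
) -> BTreeSet<String> {
    let keep_out: BTreeSet<String> =
        meaningful_somewhere.iter().map(|w| w.to_lowercase()).collect();

    candidates
        .iter()
        .flat_map(|list| list.iter())
        .map(|w| w.to_lowercase())
        .filter(|w| !keep_out.contains(w))
        .collect()
}

fn main() {
    // Heavily shortened, made-up candidate lists.
    let en = ["the", "a", "these", "and"];
    let de = ["der", "und", "hat"];
    // "hat" is a normal word in English, "these" is a useful word in German.
    let meaningful = ["hat", "these"];

    let combined = combine_stop_words(&[&en[..], &de[..]], &meaningful);
    println!("{combined:?}");
}
```

Every added language extends `meaningful_somewhere`, which is why each new language makes the combined list less useful.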

Once the need arises, we can also easily add the feature to configure
your own stop words.

We could simply send these stop words to Meili and instruct it to ignore
them. Unfortunately, that has some disadvantages, as Meili doesn't deal
nicely with stop words IMO: especially in phrase search, the
highlighting is broken and might confuse users. Phrase search still
kind of works, but from reading the docs, I think that with the stop
words "the" and "a", searching for "foo the bar" will also find documents
with the text "foo a bar". See https://github.com/orgs/meilisearch/discussions/793

So instead, we just use the stop words to filter out matches in texts.
That doesn't improve indexing speed, search speed, or index size in
Meili, but it can vastly reduce the size of the GQL response to the
frontend and makes the frontend less likely to choke on these useless
matches.
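
A minimal sketch of this kind of filtering, under stated assumptions: the diff only shows that `is_stop_word` is imported from `util`, not how it is implemented. The word list, the case-insensitive comparison, the trimming, and the `filter_matches` helper below are all hypothetical.

```rust
use std::collections::HashSet;
use std::sync::OnceLock;

// Hypothetical, heavily shortened stop word list.
const STOP_WORDS: &[&str] = &["a", "and", "der", "die", "the", "und"];

/// Lazily builds the lookup set once and reuses it afterwards.
fn stop_word_set() -> &'static HashSet<&'static str> {
    static SET: OnceLock<HashSet<&'static str>> = OnceLock::new();
    SET.get_or_init(|| STOP_WORDS.iter().copied().collect())
}

/// Returns `true` if `snippet` is exactly one stop word. Ignoring case and
/// surrounding whitespace is an assumption of this sketch.
fn is_stop_word(snippet: &str) -> bool {
    stop_word_set().contains(snippet.trim().to_lowercase().as_str())
}

/// Drops every match whose text consists of a single stop word.
fn filter_matches<'a>(matches: &[&'a str]) -> Vec<&'a str> {
    matches.iter().copied().filter(|m| !is_stop_word(m)).collect()
}

fn main() {
    let matches = ["The", "lecture", "und", "the lecture"];
    let kept = filter_matches(&matches);
    println!("{kept:?}");
}
```

Here only "lecture" and "the lecture" survive: a match is dropped only when the whole snippet is a stop word, so multi-word matches that merely contain one are kept.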

We might still use our stop words for more in the future (ignoring
matches in metadata or even sending them to Meili once Meili fixes its
problems).
LukasKalbertodt committed Jan 21, 2025
1 parent b027ef5 commit 250be1b
Showing 7 changed files with 448 additions and 4 deletions.
1 change: 1 addition & 0 deletions backend/Cargo.lock

1 change: 1 addition & 0 deletions backend/Cargo.toml
@@ -17,6 +17,7 @@ embed-in-debug = ["reinda/always-prod"]


[dependencies]
ahash = "0.8"
anyhow = { version = "1.0.71", features = ["backtrace"] }
base64 = "0.22.1"
bincode = "1.3.3"
18 changes: 17 additions & 1 deletion backend/src/search/event.rs
@@ -18,7 +18,11 @@ use crate::{
util::{base64_decode, BASE64_DIGITS},
};

-use super::{realm::Realm, util::{self, FieldAbilities}, IndexItem, IndexItemKind, SearchId};
+use super::{
+    realm::Realm,
+    util::{self, is_stop_word, FieldAbilities},
+    IndexItem, IndexItemKind, SearchId,
+};



@@ -366,6 +370,18 @@ impl TextSearchIndex {
continue;
}

// Get correct indices and the actual text snippet. Unfortunately,
// Meilisearch might sometimes return invalid indices that slice
// UTF-8 codepoints in half, so we need to protect against that.
let start = ceil_char_boundary(&self.texts, match_range.start);
let end = ceil_char_boundary(&self.texts, match_range.start + match_range.length);
let snippet = &self.texts[start..end];

// If the match is a single stop word, we ignore it.
if is_stop_word(snippet) {
continue;
}

let slot = self.lookup(match_range);
let matches = entries.entry(slot as u32).or_insert_with(Vec::new);
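
The `ceil_char_boundary` helper used in the hunk above is not part of this diff. A minimal sketch of what such a helper could look like, assuming it takes a string and a byte index and rounds the index up to the next UTF-8 character boundary (the signature and the clamping to the string length are assumptions):

```rust
/// Returns the smallest index `>= idx` that lies on a UTF-8 char boundary
/// of `s`, clamped to `s.len()`. Hypothetical stand-in for the helper
/// called in the diff.
fn ceil_char_boundary(s: &str, idx: usize) -> usize {
    if idx >= s.len() {
        return s.len();
    }
    let mut i = idx;
    // `s.len()` is always a boundary, so this loop terminates.
    while !s.is_char_boundary(i) {
        i += 1;
    }
    i
}

fn main() {
    let text = "Grüße aus Zürich";
    // "ü" occupies bytes 2..4, so byte index 3 points into its middle.
    let start = ceil_char_boundary(text, 3);
    let end = ceil_char_boundary(text, 3 + 4);
    // Slicing with the adjusted indices cannot panic on a split codepoint.
    println!("{:?}", &text[start..end]);
}
```

Rounding both ends up means a match range that starts inside a multi-byte character loses that character instead of causing a panic when slicing.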

2 changes: 1 addition & 1 deletion backend/src/search/mod.rs
@@ -336,7 +336,7 @@ pub(crate) async fn rebuild_if_necessary(
for task in tasks {
util::wait_on_task(task, meili).await?;
}
-info!("Completely rebuild search index");
+info!("Completely rebuilt search index");

meili.meta_index.add_or_replace(&[meta::Meta::current_clean()], None).await
.context("failed to update index version document (clean)")?;