[DOCS] Add full-text search overview

leemthompo · Jan 2, 2025 · 0ea93d2
1 parent 7f37edf
commit 0ea93d2

Showing 4 changed files with 139 additions and 0 deletions.
diff --git a/docs/reference/analysis/tokenizers.asciidoc b/docs/reference/analysis/tokenizers.asciidoc
@@ -1,6 +1,13 @@
 [[analysis-tokenizers]]
 == Tokenizer reference
 
+[NOTE]
+====
+{es}'s text analysis produces meaningful _linguistic_ tokens (like words and phrases) optimized for search relevance scoring.
+This differs from neural tokenizers, which break text into smaller subword units and numerical vectors for machine learning models.
+For example, an {es} analyzer that applies stemming reduces "searching" to the searchable word token "search", while a neural tokenizer might split it into ["sea", "##rch", "##ing"] for model consumption.
+====
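To make the idea of linguistic tokens concrete, the following Python sketch imitates a tiny analysis chain: tokenize, lowercase, drop stop words, then stem. It is only an illustration. The names `toy_analyze` and `toy_stem`, the stop word list, and the suffix rules are all assumptions made for this example, and none of it reflects how {es} analyzers are implemented.

```python
import re

# Illustrative stop word list and suffix rules. These are assumptions for the
# sketch, not the lists or stemmers that ship with any real analyzer.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to"}
SUFFIXES = ("ing", "ed", "s")


def toy_stem(token: str) -> str:
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def toy_analyze(text: str) -> list:
    """Lowercase, split on non-alphanumeric characters, drop stop words, stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


print(toy_analyze("Searching for the QUICK dogs"))  # ['search', 'quick', 'dog']
```

Each output token is a whole, searchable word rather than a subword fragment, which is the distinction the note above draws.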
+
 A _tokenizer_ receives a stream of characters, breaks it up into individual
 _tokens_ (usually individual words), and outputs a stream of _tokens_. For
 instance, a <<analysis-whitespace-tokenizer,`whitespace`>> tokenizer breaks

diff --git a/docs/reference/images/search/full-text-search-overview.svg b/docs/reference/images/search/full-text-search-overview.svg
diff --git a/docs/reference/search/search-your-data/full-text-search.asciidoc b/docs/reference/search/search-your-data/full-text-search.asciidoc
@@ -0,0 +1,68 @@
+[[full-text-search]]
+== Full-text search
+
+.Hands-on introduction to full-text search
+[TIP]
+====
+Would you prefer to jump straight into a hands-on tutorial?
+Refer to our quick start <<full-text-filter-tutorial,full-text search tutorial>>.
+====
+
+Full-text search, also known as lexical search, is a technique for fast, efficient searching through text fields in documents.
+Both documents and search queries are analyzed so that {es} can return https://www.elastic.co/what-is/search-relevance[relevant] results rather than only exact term matches.
+Fields of type <<text-field-type,`text`>> are analyzed and indexed for full-text search.
+
+Built on decades of information retrieval research, full-text search in {es} is a compute-efficient, deterministic approach that scales predictably with data volume.
+Full-text search is a core building block of many production-grade search solutions.
+Combine full-text search with <<semantic-search,semantic search using vectors>> to build modern hybrid search applications.
+
+[discrete]
+[[full-text-search-how-it-works]]
+=== How full-text search works
+
+The following diagram illustrates the components of full-text search. The query text also undergoes text analysis, so it is transformed in the same way as the indexed text.
+
+image::images/search/full-text-search-overview.svg[Components of full-text search from analysis to relevance scoring, align=center, width=500]
+
+At a high level, full-text search involves the following:
+
+* <<analysis-overview,*Text analysis*>>: Analysis consists of a pipeline of sequential transformations. Text is converted into a format optimized for searching through steps such as stemming, lowercasing, and stop word removal. {es} provides a number of built-in <<analysis-analyzers,analyzers>> (including language-specific analyzers) and tokenizers, and you can also create custom analyzers.
++
+[TIP]
+====
+Refer to <<test-analyzer,Test an analyzer>> to learn how to test an analyzer and inspect the tokens and metadata it generates.
+====
+* *Inverted index*: After analysis is complete, {es} builds an inverted index from the resulting tokens.
+An inverted index is a data structure that maps each token to the documents that contain it.
+It's made up of two key components:
+** *Dictionary*: A sorted list of all unique terms in the collection of documents in your index.
+** *Posting list*: For each term, a list of document IDs where the term appears, along with optional metadata like term frequency and position.
+* *Relevance scoring*: Results are ranked by how relevant they are to the given query. The relevance score of each document is represented by a positive floating-point number called the `_score`. The higher the `_score`, the more relevant the document.
++
+The default <<index-modules-similarity,similarity algorithm>> {es} uses for calculating relevance scores is https://en.wikipedia.org/wiki/Okapi_BM25[Okapi BM25], a variation of the https://en.wikipedia.org/wiki/Tf–idf[TF-IDF algorithm]. BM25 calculates relevance scores based on term frequency, document frequency, and document length.
+Refer to this https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables[technical blog post] for a deep dive into BM25.
+* *Full-text search query*: Query text is analyzed <<analysis-index-search-time,the same way as the indexed text>>, and the resulting tokens are used to search the inverted index.
++
+Query DSL supports a number of <<full-text-queries,full-text queries>>.
++
+As of 8.17, {esql} also supports <<esql-search-functions,full-text search>> functions.
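The inverted index and relevance scoring steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not {es} or Lucene code: the sample documents, the function names `build_index` and `bm25_search`, and the use of the textbook BM25 formula with `k1=1.2` and `b=0.75` are all choices made for this example. Real implementations differ in details such as constant factors and length encoding.

```python
import math
from collections import Counter, defaultdict

# Hypothetical sample documents, already reduced to lowercase tokens by an
# earlier analysis step.
DOCS = {
    1: ["quick", "brown", "fox"],
    2: ["quick", "search", "engine"],
    3: ["search", "search", "relevance", "score"],
}


def build_index(docs):
    """Map each term to a posting list of {doc_id: term_frequency}."""
    postings = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, freq in Counter(tokens).items():
            postings[term][doc_id] = freq
    # Sorting the keys mimics a sorted term dictionary.
    return dict(sorted(postings.items()))


def bm25_search(query_tokens, docs, postings, k1=1.2, b=0.75):
    """Rank documents with the textbook BM25 formula (k1 and b are tunable)."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    scores = defaultdict(float)
    for term in query_tokens:
        plist = postings.get(term, {})
        if not plist:
            continue  # term appears in no document
        # Rarer terms get a higher inverse document frequency weight.
        idf = math.log(1 + (n_docs - len(plist) + 0.5) / (len(plist) + 0.5))
        for doc_id, freq in plist.items():
            # Longer-than-average documents are penalized through this norm.
            norm = k1 * (1 - b + b * len(docs[doc_id]) / avg_len)
            scores[doc_id] += idf * freq * (k1 + 1) / (freq + norm)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


index = build_index(DOCS)
print(index["search"])  # {2: 1, 3: 2}
print(bm25_search(["search"], DOCS, index))
```

In this sample, document 3 ranks above document 2 for the query token `search` because the term occurs twice there, even though document 3 is slightly longer than average. Query text would pass through the same analysis step as the documents before being looked up in the index.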
+
+[discrete]
+[[full-text-search-learn-more]]
+=== Learn more
+
+.Getting Started
+* <<full-text-filter-tutorial,Hands-on full-text search tutorial>> 
+
+.Core Concepts
+* <<text,Text fields>>
+* <<analysis,Text analysis>>
+* <<analysis-tokenizers,Tokenizers>>
+* <<analysis-analyzers,Analyzers>>
+
+.Search APIs
+* <<full-text-queries,Full-text queries using Query DSL>> 
+* <<esql-search-functions,Full-text search functions in {esql}>>
+
+.Advanced Topics
+* https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables[Practical BM25: Part 2 - The BM25 Algorithm and its Variables]
diff --git a/docs/reference/search/search-your-data/search-your-data.asciidoc b/docs/reference/search/search-your-data/search-your-data.asciidoc
@@ -42,7 +42,9 @@ DSL, with a simplified user experience. Create search applications based on your
 {es} indices, build queries using search templates, and easily preview your
 results directly in the Kibana Search UI.
 
+include
 include::search-api.asciidoc[]
+include::full-text-search.asciidoc[]
 include::../../how-to/recipes.asciidoc[]
 // ☝️ search relevance recipes
 include::retrievers-overview.asciidoc[]