From 46e92bb4932299ad975447a4568937b0dc6fcfe2 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Istv=C3=A1n=20Zolt=C3=A1n=20Szab=C3=B3?=
Date: Wed, 11 Oct 2023 17:33:20 +0200
Subject: [PATCH] [DOCS] Fixes conflicts.

---
 .../ingest/processors/inference.asciidoc | 36 ++--
 .../semantic-search-elser.asciidoc | 164 +++++++++---------
 .../semantic-search/field-mappings.asciidoc | 2 +-
 .../generate-embeddings.asciidoc | 40 ++---
 .../semantic-search/hybrid-search.asciidoc | 2 +-
 .../semantic-search/search.asciidoc | 6 +-
 6 files changed, 129 insertions(+), 121 deletions(-)

diff --git a/docs/reference/ingest/processors/inference.asciidoc b/docs/reference/ingest/processors/inference.asciidoc
index 75b667e634cdb..5f0fedfd7902c 100644
--- a/docs/reference/ingest/processors/inference.asciidoc
+++ b/docs/reference/ingest/processors/inference.asciidoc
@@ -15,20 +15,27 @@ ingested in the pipeline.
 .{infer-cap} Options
 [options="header"]
 |======
-| Name | Required | Default | Description
-| `model_id` . | yes | - | (String) The ID or alias for the trained model, or the ID of the deployment.
-| `input_output` | no | - | (List) Input fields for inference and output (destination) fields for the inference results. This options is incompatible with the `target_field` and `field_map` options.
+| Name | Required | Default | Description
+| `model_id` | yes | - | (String) The ID or alias for the trained model, or the ID of the deployment.
+| `input_output` | no | - | (List) Input fields for {infer} and output (destination) fields for the {infer} results. This option is incompatible with the `target_field` and `field_map` options. 
+| `target_field` | no | `ml.inference.` | (String) Field added to incoming documents to contain results objects.
+| `field_map` | no | If defined, the model's default field map | (Object) Maps the document field names to the known field names of the model. This mapping takes precedence over any default mappings provided in the model configuration.
 | `inference_config` | no | The default settings defined in the model | (Object) Contains the inference type and its options.
-| `ignore_missing` | no | `false` | (Boolean) If `true` and any of the input fields defined in `input_ouput` are missing then those missing fields are quietly ignored, otherwise a missing field causes a failure. Only applies when using `input_output` configurations to explicitly list the input fields.
+| `ignore_missing` | no | `false` | (Boolean) If `true` and any of the input fields defined in `input_output` are missing, then those missing fields are quietly ignored; otherwise, a missing field causes a failure. Only applies when using `input_output` configurations to explicitly list the input fields.
 include::common-options.asciidoc[]
 |======

+IMPORTANT: You cannot use the `input_output` field with the `target_field` and
+`field_map` fields. For NLP models, use the `input_output` option. For
+{dfanalytics} models, use the `target_field` and `field_map` options.
+
+
 [discrete]
 [[inference-input-output-example]]
 ==== Configuring input and output fields
-Select the `content` field for inference and write the result to `content_embedding`.
+
+Select the `content` field for inference and write the result to
+`content_embedding`.

 [source,js]
 --------------------------------------------------
@@ -47,9 +54,11 @@ Select the `content` field for inference and write the result to `content_embedd
 // NOTCONSOLE

 ==== Configuring multiple inputs
-The `content` and `title` fields will be read from the incoming document
-and sent to the model for the inference. 
The inference output is written
-to `content_embedding` and `title_embedding` respectively.
+
+The `content` and `title` fields will be read from the incoming document and
+sent to the model for inference. The inference output is written to
+`content_embedding` and `title_embedding` respectively.
+
 [source,js]
 --------------------------------------------------
 {
@@ -73,9 +82,9 @@ to `content_embedding` and `title_embedding` respectively.
 Selecting the input fields with `input_output` is incompatible with the
 `target_field` and `field_map` options.

-Data frame analytics models must use the `target_field` to specify the
-root location results are written to and optionally a `field_map` to map
-field names in the input document to the model input fields.
+{dfanalytics-cap} models must use the `target_field` to specify the root
+location that results are written to and optionally a `field_map` to map field
+names in the input document to the model input fields.

 [source,js]
 --------------------------------------------------
@@ -92,6 +101,7 @@ field names in the input document to the model input fields.
 --------------------------------------------------
 // NOTCONSOLE

+
 [discrete]
 [[inference-processor-classification-opt]]
 ==== {classification-cap} configuration options
diff --git a/docs/reference/search/search-your-data/semantic-search-elser.asciidoc b/docs/reference/search/search-your-data/semantic-search-elser.asciidoc
index 082bb2ae2e020..03bce8ff23a46 100644
--- a/docs/reference/search/search-your-data/semantic-search-elser.asciidoc
+++ b/docs/reference/search/search-your-data/semantic-search-elser.asciidoc
@@ -42,11 +42,11 @@ you must provide suitably sized nodes yourself.
 [[elser-mappings]]
 ==== Create the index mapping

-First, the mapping of the destination index - the index that contains the tokens
-that the model created based on your text - must be created. The destination
-index must have a field with the
-<> field type to index the
-ELSER output. 
+First, create the mapping of the destination index - the index that contains
+the tokens that the model created based on your text. The destination index
+must have a field with the
+<> or <> field
+type to index the ELSER output.

 NOTE: ELSER output must be ingested into a field with the `sparse_vector` or
 `rank_features` field type. Otherwise, {es} interprets the token-weight pairs as
@@ -61,10 +61,10 @@ PUT my-index
 {
   "mappings": {
     "properties": {
-      "ml.tokens": { <1>
+      "content_embedding": { <1>
         "type": "sparse_vector" <2>
       },
-      "text": { <3>
+      "content": { <3>
         "type": "text" <4>
       }
     }
@@ -72,10 +72,12 @@ PUT my-index
   }
 }
 ----
 // TEST[skip:TBD]
-<1> The name of the field to contain the generated tokens.
+<1> The name of the field to contain the generated tokens. It must be referenced
+in the {infer} pipeline configuration in the next step.
 <2> The field to contain the tokens is a `sparse_vector` field.
 <3> The name of the field from which to create the sparse vector representation.
-In this example, the name of the field is `text`.
+In this example, the name of the field is `content`. It must be referenced in the
+{infer} pipeline configuration in the next step.
 <4> The field type which is text in this example.

 To learn how to optimize space, refer to the <> section.
@@ -91,32 +93,33 @@ that is being ingested in the pipeline. 
[source,console] ---- -PUT _ingest/pipeline/elser-v2-test -{ - "processors": [ - { - "inference": { - "model_id": ".elser_model_2", - "target_field": "ml", - "field_map": { <1> - "text": "text_field" - }, - "inference_config": { - "text_expansion": { <2> - "results_field": "tokens" - } - } - } - } - ] +PUT _ingest/pipeline/elser-v2-test +{ + "processors": [ + { + "inference": { + "model_id": ".elser_model_2", + "input_output": [ <1> + { + "input_field": "content", + "output_field": "content_embedding" + } + ] + } + } + ] } ---- -// TEST[skip:TBD] -<1> The `field_map` object maps the input document field name (which is `text` -in this example) to the name of the field that the model expects (which is -always `text_field`). -<2> The `text_expansion` inference type needs to be used in the {infer} ingest -processor. +<1> Configuration object that defines the `input_field` for the {infer} process +and the `output_field` that will contain the {infer} results. + +//// +[source,console] +---- +DELETE _ingest/pipeline/elser-v2-test +---- +// TEST[continued] +//// [discrete] @@ -132,11 +135,11 @@ a list of relevant text passages. All unique passages, along with their IDs, have been extracted from that data set and compiled into a https://github.com/elastic/stack-docs/blob/main/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv[tsv file]. -Download the file and upload it to your cluster using the -{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer] -in the {ml-app} UI. Assign the name `id` to the first column and `text` to the -second column. The index name is `test-data`. Once the upload is complete, you -can see an index named `test-data` with 182469 documents. +Download the file and upload it to your cluster using the +{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer] +in the {ml-app} UI. Assign the name `id` to the first column and `content` to +the second column. The index name is `test-data`. 
Once the upload is complete,
+you can see an index named `test-data` with 182469 documents.


 [discrete]
@@ -183,8 +186,8 @@ follow the progress.

 To perform semantic search, use the `text_expansion` query, and provide the
 query text and the ELSER model ID. The example below uses the query text "How to
-avoid muscle soreness after running?", the `ml.tokens` field contains the
-generated ELSER output:
+avoid muscle soreness after running?". The `content_embedding` field contains
+the generated ELSER output:

 [source,console]
 ----
@@ -192,7 +195,7 @@ GET my-index/_search
 {
   "query":{
      "text_expansion":{
-        "ml.tokens":{
+        "content_embedding":{
            "model_id":".elser_model_2",
            "model_text":"How to avoid muscle soreness after running?"
         }
@@ -209,40 +212,41 @@ weights.

 [source,consol-result]
 ----
-"hits":[
-  {
-     "_index":"my-index",
-     "_id":"978UAYgBKCQMet06sLEy",
-     "_score":18.612831,
-     "_ignored":[
-       "text.keyword"
-     ],
-     "_source":{
-        "id":7361587,
-        "text":"For example, if you go for a run, you will mostly use the muscles in your lower body. Give yourself 2 days to rest those muscles so they have a chance to heal before you exercise them again. Not giving your muscles enough time to rest can cause muscle damage, rather than muscle development.",
-        "ml":{
-           "tokens":{
-              "muscular":0.075696334,
-              "mostly":0.52380747,
-              "practice":0.23430172,
-              "rehab":0.3673556,
-              "cycling":0.13947526,
-              "your":0.35725075,
-              "years":0.69484913,
-              "soon":0.005317828,
-              "leg":0.41748235,
-              "fatigue":0.3157955,
-              "rehabilitation":0.13636169,
-              "muscles":1.302141,
-              "exercises":0.36694175,
-              (...) 
- }, - "model_id":".elser_model_2" - } +"hits": { + "total": { + "value": 10000, + "relation": "gte" + }, + "max_score": 26.199875, + "hits": [ + { + "_index": "my-index", + "_id": "FPr9HYsBag9jXmT8lEpI", + "_score": 26.199875, + "_source": { + "content_embedding": { + "muscular": 0.2821541, + "bleeding": 0.37929374, + "foods": 1.1718726, + "delayed": 1.2112266, + "cure": 0.6848574, + "during": 0.5886185, + "fighting": 0.35022718, + "rid": 0.2752442, + "soon": 0.2967024, + "leg": 0.37649947, + "preparation": 0.32974035, + "advance": 0.09652356, + (...) + }, + "id": 1713868, + "model_id": ".elser_model_2", + "content": "For example, if you go for a run, you will mostly use the muscles in your lower body. Give yourself 2 days to rest those muscles so they have a chance to heal before you exercise them again. Not giving your muscles enough time to rest can cause muscle damage, rather than muscle development." } - }, - (...) -] + }, + (...) + ] +} ---- // NOTCONSOLE @@ -275,7 +279,7 @@ GET my-index/_search "should": [ { "text_expansion": { - "ml.tokens": { + "content_embedding": { "model_text": "How to avoid muscle soreness after running?", "model_id": ".elser_model_2", "boost": 1 <2> @@ -328,8 +332,8 @@ reindexing will not be required in the future! It's important to carefully consider this trade-off and make sure that excluding the ELSER terms from the source aligns with your specific requirements and use case. 
-The mapping that excludes `ml.tokens` from the `_source` field can be created -by the following API call: +The mapping that excludes `content_embedding` from the `_source` field can be +created by the following API call: [source,console] ---- @@ -338,14 +342,14 @@ PUT my-index "mappings": { "_source": { "excludes": [ - "ml.tokens" + "content_embedding" ] }, "properties": { - "ml.tokens": { + "content_embedding": { "type": "sparse_vector" }, - "text": { + "content": { "type": "text" } } diff --git a/docs/reference/tab-widgets/semantic-search/field-mappings.asciidoc b/docs/reference/tab-widgets/semantic-search/field-mappings.asciidoc index 0228078e8ce39..170604f41dc8c 100644 --- a/docs/reference/tab-widgets/semantic-search/field-mappings.asciidoc +++ b/docs/reference/tab-widgets/semantic-search/field-mappings.asciidoc @@ -17,7 +17,7 @@ PUT my-index { "mappings": { "properties": { - "my_embeddings.tokens": { <1> + "my_tokens": { <1> "type": "sparse_vector" <2> }, "my_text_field": { <3> diff --git a/docs/reference/tab-widgets/semantic-search/generate-embeddings.asciidoc b/docs/reference/tab-widgets/semantic-search/generate-embeddings.asciidoc index 786f40fe141bd..caf6523783b02 100644 --- a/docs/reference/tab-widgets/semantic-search/generate-embeddings.asciidoc +++ b/docs/reference/tab-widgets/semantic-search/generate-embeddings.asciidoc @@ -15,32 +15,26 @@ This is how an ingest pipeline that uses the ELSER model is created: [source,console] ---- -PUT _ingest/pipeline/my-text-embeddings-pipeline -{ +PUT _ingest/pipeline/my-text-embeddings-pipeline +{ "description": "Text embedding pipeline", - "processors": [ - { - "inference": { - "model_id": ".elser_model_2", - "target_field": "my_embeddings", - "field_map": { <1> - "my_text_field": "text_field" - }, - "inference_config": { - "text_expansion": { <2> - "results_field": "tokens" - } - } - } - } - ] + "processors": [ + { + "inference": { + "model_id": ".elser_model_2", + "input_output": [ <1> + { + "input_field": 
"my_text_field",
+          "output_field": "my_tokens"
+        }
+      ]
+    }
+  ]
}
----
-<1> The `field_map` object maps the input document field name (which is
-`my_text_field` in this example) to the name of the field that the model expects
-(which is always `text_field`).
-<2> The `text_expansion` inference type needs to be used in the inference ingest
-processor.
+<1> Configuration object that defines the `input_field` for the {infer} process
+and the `output_field` that will contain the {infer} results.

 To ingest data through the pipeline to generate tokens with ELSER, refer to the
 <> section of the tutorial. After you successfully
diff --git a/docs/reference/tab-widgets/semantic-search/hybrid-search.asciidoc b/docs/reference/tab-widgets/semantic-search/hybrid-search.asciidoc
index a99bdf3c8722b..f7d9ee1ad6443 100644
--- a/docs/reference/tab-widgets/semantic-search/hybrid-search.asciidoc
+++ b/docs/reference/tab-widgets/semantic-search/hybrid-search.asciidoc
@@ -21,7 +21,7 @@ GET my-index/_search
 {
   "query": {
     "text_expansion": {
-      "my_embeddings.tokens": {
+      "my_tokens": {
        "model_id": ".elser_model_2",
        "model_text": "the query string"
      }
diff --git a/docs/reference/tab-widgets/semantic-search/search.asciidoc b/docs/reference/tab-widgets/semantic-search/search.asciidoc
index d1cd31fbe4309..315328add07f0 100644
--- a/docs/reference/tab-widgets/semantic-search/search.asciidoc
+++ b/docs/reference/tab-widgets/semantic-search/search.asciidoc
@@ -2,8 +2,8 @@

 ELSER text embeddings can be queried using a
 <>. The text expansion
-query enables you to query a rank features field, by providing the model ID of
-the NLP model, and the query text:
+query enables you to query a rank features field or a sparse vector field by
+providing the model ID of the NLP model and the query text:

 [source,console]
 ----
@@ -11,7 +11,7 @@ GET my-index/_search
 {
   "query":{
      "text_expansion":{
-        "my_embeddings.tokens":{ <1>
+        "my_tokens":{ <1>
            "model_id":".elser_model_2",
            "model_text":"the query string"
         }