Commit

[DOCS] Fixes conflicts.

szabosteve committed Oct 11, 2023
1 parent 948f1fa commit 46e92bb
Showing 6 changed files with 129 additions and 121 deletions.
36 changes: 23 additions & 13 deletions docs/reference/ingest/processors/inference.asciidoc
.{infer-cap} Options
[options="header"]
|======
| Name | Required | Default | Description
| `model_id` | yes | - | (String) The ID or alias for the trained model, or the ID of the deployment.
| `input_output` | no | - | (List) Input fields for {infer} and output (destination) fields for the {infer} results. This option is incompatible with the `target_field` and `field_map` options.
| `target_field` | no | `ml.inference.<processor_tag>` | (String) Field added to incoming documents to contain results objects.
| `field_map` | no | If defined, the model's default field map | (Object) Maps the document field names to the known field names of the model. This mapping takes precedence over any default mappings provided in the model configuration.
| `inference_config` | no | The default settings defined in the model | (Object) Contains the {infer} type and its options.
| `ignore_missing` | no | `false` | (Boolean) If `true` and any of the input fields defined in `input_output` are missing, those missing fields are quietly ignored; otherwise a missing field causes a failure. Only applies when using `input_output` configurations to explicitly list the input fields.
include::common-options.asciidoc[]
|======

IMPORTANT: You cannot use the `input_output` field with the `target_field` and
`field_map` fields. For NLP models, use the `input_output` option. For
{dfanalytics} models, use the `target_field` and `field_map` options.


[discrete]
[[inference-input-output-example]]
==== Configuring input and output fields
Select the `content` field for inference and write the result to
`content_embedding`.

[source,js]
--------------------------------------------------
{
  "inference": {
    "model_id": "model_deployment_for_inference",
    "input_output": [
      {
        "input_field": "content",
        "output_field": "content_embedding"
      }
    ]
  }
}
--------------------------------------------------
// NOTCONSOLE
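
To try such a processor end to end, you can wrap it in a pipeline definition
and run it through the simulate API. A minimal sketch - the model ID and the
sample document are illustrative only:

[source,console]
----
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "inference": {
          "model_id": "model_deployment_for_inference", <1>
          "input_output": [
            {
              "input_field": "content",
              "output_field": "content_embedding"
            }
          ]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "content": "Sample text to run through the model." <2>
      }
    }
  ]
}
----
// TEST[skip:TBD]
<1> Illustrative model ID; use the ID or alias of a model deployed in your
cluster.
<2> Illustrative sample document containing the configured `input_field`.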

==== Configuring multiple inputs

The `content` and `title` fields will be read from the incoming document and
sent to the model for inference. The inference output is written to
`content_embedding` and `title_embedding` respectively.

[source,js]
--------------------------------------------------
{
  "inference": {
    "model_id": "model_deployment_for_inference",
    "input_output": [
      {
        "input_field": "content",
        "output_field": "content_embedding"
      },
      {
        "input_field": "title",
        "output_field": "title_embedding"
      }
    ]
  }
}
--------------------------------------------------
// NOTCONSOLE

Selecting the input fields with `input_output` is incompatible with
the `target_field` and `field_map` options.

{dfanalytics-cap} models must use the `target_field` to specify the root
location that results are written to, and optionally a `field_map` to map field
names in the input document to the model input fields.

[source,js]
--------------------------------------------------
{
  "inference": {
    "model_id": "model_deployment_for_inference",
    "target_field": "FlightDelayMin_prediction_infer",
    "field_map": {
      "your_field": "my_field"
    },
    "inference_config": { "regression": {} }
  }
}
--------------------------------------------------
// NOTCONSOLE


[discrete]
[[inference-processor-classification-opt]]
==== {classification-cap} configuration options
164 changes: 84 additions & 80 deletions docs/reference/search/search-your-data/semantic-search-elser.asciidoc
[[elser-mappings]]
==== Create the index mapping

First, create the mapping of the destination index - the index that contains
the tokens that the model creates based on your text. The destination index
must have a field with the <<sparse-vector, `sparse_vector`>> or
<<rank-features,`rank_features`>> field type to index the ELSER output.

NOTE: ELSER output must be ingested into a field with the `sparse_vector` or
`rank_features` field type. Otherwise, {es} interprets the token-weight pairs as
a massive amount of fields in a document.

[source,console]
----
PUT my-index
{
  "mappings": {
    "properties": {
      "content_embedding": { <1>
        "type": "sparse_vector" <2>
      },
      "content": { <3>
        "type": "text" <4>
      }
    }
  }
}
----
// TEST[skip:TBD]
<1> The name of the field to contain the generated tokens. It must be
referenced in the {infer} pipeline configuration in the next step.
<2> The field to contain the tokens is a `sparse_vector` field.
<3> The name of the field from which to create the sparse vector representation.
In this example, the name of the field is `content`. It must be referenced in
the {infer} pipeline configuration in the next step.
<4> The field type, which is `text` in this example.

To learn how to optimize space, refer to the <<save-space>> section.
Create an ingest pipeline with an {infer} processor to use ELSER to infer
against the data that is being ingested in the pipeline.

[source,console]
----
PUT _ingest/pipeline/elser-v2-test
{
  "processors": [
    {
      "inference": {
        "model_id": ".elser_model_2",
        "input_output": [ <1>
          {
            "input_field": "content",
            "output_field": "content_embedding"
          }
        ]
      }
    }
  ]
}
----
// TEST[skip:TBD]
<1> Configuration object that defines the `input_field` for the {infer} process
and the `output_field` that will contain the {infer} results.

////
[source,console]
----
DELETE _ingest/pipeline/elser-v2-test
----
// TEST[continued]
////
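
After the pipeline exists, you can index a single document through it to verify
the setup - a minimal sketch, with an illustrative document ID and content:

[source,console]
----
PUT my-index/_doc/1?pipeline=elser-v2-test
{
  "content": "Sample text to create embeddings for." <1>
}
----
// TEST[skip:TBD]
<1> Illustrative document; the `content` field matches the pipeline's
`input_field`, so the resulting document also contains a `content_embedding`
field with the generated tokens.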


[discrete]
==== Load data

The data set that you use in this example is the
https://microsoft.github.io/msmarco/[MS MARCO Passage Ranking data set]. It
consists of 200 queries, each accompanied by
a list of relevant text passages. All unique passages, along with their IDs,
have been extracted from that data set and compiled into a
https://github.com/elastic/stack-docs/blob/main/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv[tsv file].

Download the file and upload it to your cluster using the
{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer]
in the {ml-app} UI. Assign the name `id` to the first column and `content` to
the second column. The index name is `test-data`. Once the upload is complete,
you can see an index named `test-data` with 182469 documents.


[discrete]
==== Semantic search by using the `text_expansion` query

To perform semantic search, use the `text_expansion` query, and provide the
query text and the ELSER model ID. The example below uses the query text "How
to avoid muscle soreness after running?"; the `content_embedding` field
contains the generated ELSER output:

[source,console]
----
GET my-index/_search
{
   "query":{
      "text_expansion":{
         "content_embedding":{
            "model_id":".elser_model_2",
            "model_text":"How to avoid muscle soreness after running?"
         }
      }
   }
}
----
// TEST[skip:TBD]

The result is the top-scored documents that are closest in meaning to your
query text from the `my-index` index, sorted by their relevancy. The result
also contains the extracted tokens for each relevant search result with their
weights.

[source,console-result]
----
"hits":[
{
"_index":"my-index",
"_id":"978UAYgBKCQMet06sLEy",
"_score":18.612831,
"_ignored":[
"text.keyword"
],
"_source":{
"id":7361587,
"text":"For example, if you go for a run, you will mostly use the muscles in your lower body. Give yourself 2 days to rest those muscles so they have a chance to heal before you exercise them again. Not giving your muscles enough time to rest can cause muscle damage, rather than muscle development.",
"ml":{
"tokens":{
"muscular":0.075696334,
"mostly":0.52380747,
"practice":0.23430172,
"rehab":0.3673556,
"cycling":0.13947526,
"your":0.35725075,
"years":0.69484913,
"soon":0.005317828,
"leg":0.41748235,
"fatigue":0.3157955,
"rehabilitation":0.13636169,
"muscles":1.302141,
"exercises":0.36694175,
(...)
},
"model_id":".elser_model_2"
}
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": 26.199875,
"hits": [
{
"_index": "my-index",
"_id": "FPr9HYsBag9jXmT8lEpI",
"_score": 26.199875,
"_source": {
"content_embedding": {
"muscular": 0.2821541,
"bleeding": 0.37929374,
"foods": 1.1718726,
"delayed": 1.2112266,
"cure": 0.6848574,
"during": 0.5886185,
"fighting": 0.35022718,
"rid": 0.2752442,
"soon": 0.2967024,
"leg": 0.37649947,
"preparation": 0.32974035,
"advance": 0.09652356,
(...)
},
"id": 1713868,
"model_id": ".elser_model_2",
"content": "For example, if you go for a run, you will mostly use the muscles in your lower body. Give yourself 2 days to rest those muscles so they have a chance to heal before you exercise them again. Not giving your muscles enough time to rest can cause muscle damage, rather than muscle development."
}
},
(...)
]
},
(...)
]
}
----
// NOTCONSOLE

[discrete]
==== Combining semantic search with other queries

You can combine the `text_expansion` query with other queries in a compound
query. For example, use the `text_expansion` query in a `should` clause of a
`bool` query together with a full-text query, and use `boost` values to weight
the clauses:

[source,console]
----
GET my-index/_search
{
  "query": {
    "bool": { <1>
      "should": [
        {
          "text_expansion": {
            "content_embedding": {
              "model_text": "How to avoid muscle soreness after running?",
              "model_id": ".elser_model_2",
              "boost": 1 <2>
            }
          }
        },
        {
          "query_string": {
            "query": "toxins",
            "boost": 4 <3>
          }
        }
      ]
    }
  }
}
----
// TEST[skip:TBD]
<1> Both queries are in a `should` clause of a `bool` query.
<2> The `boost` value is `1` for the `text_expansion` query, which is the
default.
<3> The `boost` value is `4` for the `query_string` query, so it has a higher
weight in the final score.

[discrete]
[[save-space]]
==== Saving disk space by excluding the ELSER tokens from document source

The tokens generated by ELSER must be indexed to be usable in the
`text_expansion` query, but they do not need to be retained in the document
source. You can save disk space by excluding the tokens from `_source`.
However, this is a space-saving optimization that should only be applied if you
are certain that reindexing will not be required in the future! It's important
to carefully consider this trade-off and make sure that excluding the ELSER
terms from the source aligns with your specific requirements and use case.

The mapping that excludes `content_embedding` from the `_source` field can be
created by the following API call:

[source,console]
----
PUT my-index
{
  "mappings": {
    "_source": {
      "excludes": [
        "content_embedding"
      ]
    },
    "properties": {
      "content_embedding": {
        "type": "sparse_vector"
      },
      "content": {
        "type": "text"
      }
    }
  }
}
----
// TEST[skip:TBD]
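
Because the tokens are indexed but excluded from `_source`, retrieving a
document from this index returns its text fields only, while `text_expansion`
queries against `content_embedding` keep working - a quick check, with an
illustrative document ID:

[source,console]
----
GET my-index/_doc/1
----
// TEST[skip:TBD]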
[source,console]
----
PUT my-index
{
  "mappings": {
    "properties": {
      "my_tokens": { <1>
        "type": "sparse_vector" <2>
      },
      "my_text_field": { <3>
        "type": "text" <4>
      }
    }
  }
}
----
This is how an ingest pipeline that uses the ELSER model is created:

[source,console]
----
PUT _ingest/pipeline/my-text-embeddings-pipeline
{
  "description": "Text embedding pipeline",
  "processors": [
    {
      "inference": {
        "model_id": ".elser_model_2",
        "input_output": [ <1>
          {
            "input_field": "my_text_field",
            "output_field": "my_tokens"
          }
        ]
      }
    }
  ]
}
----
<1> Configuration object that defines the `input_field` for the {infer} process
and the `output_field` that will contain the {infer} results.

To ingest data through the pipeline to generate tokens with ELSER, refer to the
<<reindexing-data-elser>> section of the tutorial. After you successfully
ingested documents by using the pipeline, your index will contain the tokens
that ELSER generated.
[source,console]
----
GET my-index/_search
{
   "query": {
      "text_expansion": {
         "my_tokens": {
            "model_id": ".elser_model_2",
            "model_text": "the query string"
         }
      }
   }
}
----