
[CI] XPackRestIT class failing #120816

Open
elasticsearchmachine opened this issue Jan 24, 2025 · 16 comments
Labels
  • medium-risk: An open issue or test failure that is a medium risk to future releases
  • :ml: Machine learning
  • stateful: Marking issues only relevant for stateful releases
  • Team:ML: Meta label for the ML team
  • >test-failure: Triaged test failures from CI
  • v8.18.1

Comments

@elasticsearchmachine (Collaborator) commented Jan 24, 2025

Build Scans:

Reproduction Line:

./gradlew ":x-pack:plugin:yamlRestTest" --tests "org.elasticsearch.xpack.test.rest.XPackRestIT.test {p0=esql/30_types/version}" -Dtests.seed=AD456B68687F4652 -Dtests.locale=so-Latn-SO -Dtests.timezone=America/Creston -Druntime.java=23

Applicable branches:
8.18

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

java.lang.IllegalStateException: Exception when waiting for [.ml-notifications-000002] template to be created

Issue Reasons:

  • [8.18] 2 failures in class org.elasticsearch.xpack.test.rest.XPackRestIT (0.3% fail rate in 577 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

@elasticsearchmachine elasticsearchmachine added :Search Relevance/Ranking Scoring, rescoring, rank evaluation. >test-failure Triaged test failures from CI labels Jan 24, 2025
@elasticsearchmachine (Collaborator, Author)

This has been muted on branch main

Mute Reasons:

  • [main] 60 failures in class org.elasticsearch.xpack.test.rest.XPackRestIT (6.0% fail rate in 999 executions)
  • [main] 2 failures in step part3 (2.6% fail rate in 78 executions)
  • [main] 40 failures in step rest-compatibility (12.1% fail rate in 330 executions)
  • [main] 2 failures in step rest-compat (2.6% fail rate in 77 executions)
  • [main] 16 failures in step part-3 (5.3% fail rate in 300 executions)
  • [main] 3 failures in pipeline elasticsearch-intake (3.7% fail rate in 81 executions)
  • [main] 51 failures in pipeline elasticsearch-pull-request (15.2% fail rate in 336 executions)

Build Scans:

@elasticsearchmachine elasticsearchmachine added needs:risk Requires assignment of a risk label (low, medium, blocker) Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch labels Jan 24, 2025
@elasticsearchmachine (Collaborator, Author)

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine elasticsearchmachine removed the needs:risk Requires assignment of a risk label (low, medium, blocker) label Jan 24, 2025
@benwtrent benwtrent added :ml Machine learning and removed :Search Relevance/Ranking Scoring, rescoring, rank evaluation. labels Jan 24, 2025
@elasticsearchmachine elasticsearchmachine added Team:ML Meta label for the ML team and removed Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch labels Jan 24, 2025
@elasticsearchmachine (Collaborator, Author)

Pinging @elastic/ml-core (Team:ML)

@benwtrent (Member)

Looking at the logs, it looks like ML is crashing the yaml test nodes.

The final lines of the log include:

[2025-01-24T08:54:06,548][INFO ][o.e.c.m.MetadataCreateIndexService] [yamlRestTest-0] [.ml-notifications-000002] creating index, cause [auto(bulk api)], templates [.ml-notifications-000002], shards [1]/[1]
[2025-01-24T08:54:06,549][INFO ][o.e.c.r.a.AllocationService] [yamlRestTest-0] updating number_of_replicas to [0] for indices [.ml-notifications-000002]
[2025-01-24T08:54:06,563][WARN ][o.e.x.m.i.l.ModelLoadingService] [yamlRestTest-0] [a-classification-model] failed to load model definition
org.elasticsearch.ResourceNotFoundException: Could not find trained model definition [a-classification-model]
	at org.elasticsearch.xpack.ml.inference.persistence.TrainedModelProvider.lambda$getTrainedModelForInference$16(TrainedModelProvider.java:599) ~[?:?]
	at org.elasticsearch.xpack.ml.inference.persistence.ChunkedTrainedModelRestorer.doSearch(ChunkedTrainedModelRestorer.java:162) ~[?:?]
	at org.elasticsearch.xpack.ml.inference.persistence.ChunkedTrainedModelRestorer.lambda$restoreModelDefinition$0(ChunkedTrainedModelRestorer.java:134) ~[?:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:977) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1575) ~[?:?]
[2025-01-24T08:54:06,566][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [yamlRestTest-0] fatal error in thread [elasticsearch[yamlRestTest-0][ml_utility][T#2]], exiting
java.lang.AssertionError: Used bytes: [-520] must be >= 0
	at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addWithoutBreaking(ChildMemoryCircuitBreaker.java:207) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.xpack.ml.inference.loadingservice.ModelLoadingService.lambda$loadModel$10(ModelLoadingService.java:490) ~[?:?]
	at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:64) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:265) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.xpack.ml.inference.persistence.TrainedModelProvider.lambda$getTrainedModelForInference$16(TrainedModelProvider.java:601) ~[?:?]
	at org.elasticsearch.xpack.ml.inference.persistence.ChunkedTrainedModelRestorer.doSearch(ChunkedTrainedModelRestorer.java:162) ~[?:?]
	at org.elasticsearch.xpack.ml.inference.persistence.ChunkedTrainedModelRestorer.lambda$restoreModelDefinition$0(ChunkedTrainedModelRestorer.java:134) ~[?:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:977) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1575) ~[?:?]
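The fatal error is the JVM assertion in ChildMemoryCircuitBreaker firing because the breaker's used-bytes counter went negative. As a hedged illustration (not Elasticsearch code; all names here are hypothetical), unbalanced accounting can produce exactly this state, e.g. when a failure path releases more bytes than the load path ever reserved:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of memory accounting in a child circuit breaker.
// Class and method names are illustrative, not the real Elasticsearch API.
class SketchBreaker {
    private final AtomicLong usedBytes = new AtomicLong();

    // Adjusts the counter without tripping the breaker, mirroring the idea of
    // addWithoutBreaking. In the real class, running with -ea (as test JVMs do)
    // a negative total fails the "Used bytes: [...] must be >= 0" assertion,
    // which is fatal and takes the node down.
    long addWithoutBreaking(long bytes) {
        return usedBytes.addAndGet(bytes);
    }

    long usedBytes() {
        return usedBytes.get();
    }
}
```

If a model load reserves an estimated 480 bytes but the failure callback releases 520 (say, a size computed differently on the error path), the counter lands at -40 and the invariant is violated.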

@benwtrent (Member)

Higher up in the node logs I see this as well, but this might not be a big deal?

[2025-01-24T19:03:42,096][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [index-1/reCq_J_ZQDOykvbkKLQfjw] deleting index
[2025-01-24T19:03:42,096][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-annotations-000001/jOdeeSGfR3uB8b_VSZ0-oQ] deleting index
[2025-01-24T19:03:42,131][ERROR][o.e.x.m.MlInitializationService] [yamlRestTest-0] Error creating ML annotations index or aliases
org.elasticsearch.index.IndexNotFoundException: no such index [.ml-annotations-000001]
	at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.notFoundException(IndexNameExpressionResolver.java:685) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.ensureAliasOrIndexExists(IndexNameExpressionResolver.java:1321) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.resolveExpressionsToResources(IndexNameExpressionResolver.java:373) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.dataStreams(IndexNameExpressionResolver.java:235) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.dataStreamNames(IndexNameExpressionResolver.java:220) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.admin.indices.alias.TransportIndicesAliasesAction.masterOperation(TransportIndicesAliasesAction.java:125) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.admin.indices.alias.TransportIndicesAliasesAction.masterOperation(TransportIndicesAliasesAction.java:60) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.support.master.TransportMasterNodeAction.executeMasterOperation(TransportMasterNodeAction.java:111) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.lambda$doStart$3(TransportMasterNodeAction.java:222) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.ActionRunnable$4.doRun(ActionRunnable.java:101) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]

@benwtrent (Member)

It seems at least one ML test is to blame for some of the valid failures:

ml/3rd_party_deployment/Test start deployment fails while model download in progress

@benwtrent (Member) commented Jan 24, 2025

Many test failures in a build are from this:

Exception when waiting for [.ml-notifications-000002] template to be created

https://gradle-enterprise.elastic.co/s/4rprshounwmig

@davidkyle ^ Maybe related to #120405?

This is also the build where the node crashed. That may be unrelated, but the crash happened because more bytes were subtracted from the circuit breaker than the model had added.

@davidkyle davidkyle added medium-risk An open issue or test failure that is a medium risk to future releases and removed blocker labels Jan 25, 2025
@davidkyle (Member)

The XPackRestIT suite has been unmuted in #120859

Most of the PR build failures relate to an error in monitoring

       java.lang.AssertionError: got unexpected warning header [
           299 Elasticsearch-9.0.0-9d5d6ab40efe5c07d61dc2c3695875391a78ff4e "[xpack.monitoring.collection.enabled] setting was deprecated in Elasticsearch and will be removed in a future release. See the deprecation changes documentation for the next major version."
       ]

All those stem from this draft PR #120718 that changes the wording in the deprecation message. We can safely ignore those.

Exception when waiting for [.ml-notifications-000002] template to be created

This error is from the test run where the node stopped due to the CB exception. It comes from the startup of the next test, which waits for a condition that can never be satisfied because the node has gone. Later in the console output we see the client failing to connect to the node:

Caused by: java.net.ConnectException: Connection refused

That leaves a few ML issues to investigate

1 Circuit Breaker Error

[2025-01-24T08:54:06,566][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [yamlRestTest-0] fatal error in thread [elasticsearch[yamlRestTest-0][ml_utility][T#2]], exiting
java.lang.AssertionError: Used bytes: [-520] must be >= 0
	at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addWithoutBreaking(ChildMemoryCircuitBreaker.java:207) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]

I have absolutely no idea how this can occur, but it did, and it took down the node, which caused many test failures.

2 Exception logged by MlInitializationService

[2025-01-24T19:03:42,131][ERROR][o.e.x.m.MlInitializationService] [yamlRestTest-0] Error creating ML annotations index or aliases
org.elasticsearch.index.IndexNotFoundException: no such index [.ml-annotations-000001]

This did not cause any test failures, but it is suspicious. Here we have a race condition between the post-test feature reset and the MlInitializationService, which after startup creates the .ml-annotations-000001 index and adds an alias. The error is that the index was deleted by the post-test feature reset just as MlInitializationService was trying to add the alias for it.

It might be possible for MlInitializationService to check the feature reset status before performing its actions.
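A hedged sketch of that idea, with entirely hypothetical names (this is not the real MlInitializationService API): initialization work is skipped while a feature reset is in flight, instead of racing the index delete and hitting IndexNotFoundException.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: guard initialization against a concurrent feature
// reset. Flag and method names are illustrative, not Elasticsearch's API.
class AnnotationsIndexInitializer {
    private final AtomicBoolean resetInProgress = new AtomicBoolean(false);

    void onFeatureResetStarted()  { resetInProgress.set(true); }
    void onFeatureResetFinished() { resetInProgress.set(false); }

    // Returns what the service decided to do; a real implementation would
    // create .ml-annotations-000001 and add its alias on the "created" path.
    String maybeCreateAnnotationsIndex() {
        if (resetInProgress.get()) {
            // A reset is deleting ML indices; defer to the next trigger
            // (e.g. a later cluster-state change) instead of racing it.
            return "skipped";
        }
        return "created";
    }
}
```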

3 Bug in ML feature reset

It seems at least one ML test is to blame for some of the valid failures:
ml/3rd_party_deployment/Test start deployment fails while model download in progress

In this case and others the failure was because the ml feature reset API call failed (timed out).

Cause

Probably related to #120405

I've removed the blocker label as the suite has now been unmuted and set this to medium risk while we investigate the above.

@elasticsearchmachine (Collaborator, Author)

This has been muted on branch main

Mute Reasons:

  • [main] 39 failures in class org.elasticsearch.xpack.test.rest.XPackRestIT (4.0% fail rate in 979 executions)
  • [main] 2 failures in step part3 (3.3% fail rate in 60 executions)
  • [main] 8 failures in step part-3 (3.4% fail rate in 234 executions)
  • [main] 27 failures in step rest-compatibility (10.8% fail rate in 249 executions)
  • [main] 2 failures in step rest-compat (3.2% fail rate in 62 executions)
  • [main] 3 failures in pipeline elasticsearch-intake (4.8% fail rate in 63 executions)
  • [main] 31 failures in pipeline elasticsearch-pull-request (12.3% fail rate in 253 executions)

Build Scans:

@davidkyle (Member)

elasticsearch-intake #16416 / part3 is the circuit breaker assertion (Used bytes: [-520] must be >= 0) taking down the node again.

The PR build failures are unrelated to ML

I've muted the inference tests causing the circuit breaker to trigger and unmuted the test suite in #120897

davidkyle added a commit that referenced this issue Jan 27, 2025
Mute failing inference_crud yml tests and unmute the rest of XPackRestIT
For #120816
@elasticsearchmachine (Collaborator, Author)

This has been muted on branch main

Mute Reasons:

  • [main] 14 failures in class org.elasticsearch.xpack.test.rest.XPackRestIT (2.0% fail rate in 707 executions)
  • [main] 6 failures in step part-3 (3.2% fail rate in 185 executions)
  • [main] 5 failures in step rest-compatibility (2.4% fail rate in 209 executions)
  • [main] 10 failures in pipeline elasticsearch-pull-request (4.7% fail rate in 211 executions)

Build Scans:

@benwtrent (Member)

ugh:

java.lang.IllegalStateException: Exception when waiting for [.ml-notifications-000002] template to be created

Seems to be causing timeouts. We are definitely fighting a bot here to keep this from muting the suite.

@breskeby breskeby added blocker and removed blocker labels Jan 30, 2025
@kkrik-es (Contributor)

org.elasticsearch.xpack.test.rest.XPackRestIT is still fully muted. @davidkyle can you please have this checked and addressed? Let's start with updating muted-tests.yml to filter out the offending tests only, if a fix needs more time.

@davidkyle (Member) commented Jan 31, 2025

The bot is muting the suite because the failures occur in test setup: an assertion takes down the node, and then the next test fails. In elasticsearch-intake #16817 / part3 the assertion failure comes from the ml auditor code:

[2025-01-29T18:42:43,105][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [yamlRestTest-0] fatal error in thread [elasticsearch[yamlRestTest-0][management][T#1]], exiting
java.lang.AssertionError: null
	at org.elasticsearch.xpack.core.common.notifications.AbstractAuditor.writeBacklog(AbstractAuditor.java:175) ~[?:?]
	at org.elasticsearch.xpack.transform.notifications.TransformAuditor.writeBacklog(TransformAuditor.java:86) ~[?:?]
	at org.elasticsearch.xpack.core.common.notifications.AbstractAuditor.lambda$indexDoc$0(AbstractAuditor.java:123) ~[?:?]
	at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:257) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]

java.lang.IllegalStateException: Exception when waiting for [.ml-notifications-000002] template to be created

The cause is a connection exception because the node has gone, and once the node has gone every subsequent test fails, racking up the failure count.

The PR failures are unrelated and are genuine test failures stemming from changes in open PRs. Most are from these two PRs: #121099 and #121078. The signal from these failures is being misinterpreted: they are due to uncommitted code in the PRs, not code in the main branch. However, these failures are part of the reason the bot is muting the suite.

elasticsearch-intake #16817 / part3 is a transform test so I've muted all ml and transform tests while we figure this one out, see #121377

The failing assertion is protected by a null check (https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/common/notifications/AbstractAuditor.java#L175-L179) and will not trigger an error in production code. The assertion means that something unexpected has happened, probably due to some race condition. For this reason I don't consider this a blocker, and I will remove the label once the suite is unmuted by merging #121377.
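For context on why this only kills test nodes: Java assert statements are evaluated only when the JVM runs with -ea, which the test infrastructure enables; without it, execution falls through to the guarding null check. A simplified, hypothetical sketch of the pattern (not the real AbstractAuditor code):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of the writeBacklog pattern: the assert documents an
// invariant and is fatal under -ea (test JVMs), while the null check keeps
// production safe if the invariant is ever violated.
class BacklogWriter {
    static int writeBacklog(Queue<String> backlog) {
        assert backlog != null : "writeBacklog should not be called with a null backlog";
        if (backlog == null) {
            return 0; // production path: degrade gracefully rather than crash
        }
        int written = 0;
        while (backlog.poll() != null) {
            written++; // a real auditor would index each audit document here
        }
        return written;
    }
}
```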

@davidkyle davidkyle removed the blocker label Feb 3, 2025
@elasticsearchmachine (Collaborator, Author)

This has been muted on branch 8.18

Mute Reasons:

  • [8.18] 2 failures in class org.elasticsearch.xpack.test.rest.XPackRestIT (0.3% fail rate in 577 executions)

Build Scans:

elasticsearchmachine added a commit that referenced this issue Feb 4, 2025
@benwtrent benwtrent added blocker v8.18.1 stateful Marking issues only relevant for stateful releases labels Feb 4, 2025
@davidkyle davidkyle removed the blocker label Feb 5, 2025
@davidkyle (Member)

The latest failure is the auditor assertion again

[2025-01-29T18:42:43,105][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [yamlRestTest-0] fatal error in thread [elasticsearch[yamlRestTest-0][management][T#1]], exiting
java.lang.AssertionError: null
	at org.elasticsearch.xpack.core.common.notifications.AbstractAuditor.writeBacklog(AbstractAuditor.java:175) ~[?:?]

XPackRestIT is unmuted and the ml tests muted for 8.18 in #121765
Backports to 9.0 and 8.19 are in progress
