Flaky test: test_everserver_status_contains_max_runtime_failure #9738

Open
xjules opened this issue Jan 14, 2025 · 1 comment · May be fixed by #9758

xjules commented Jan 14, 2025

This test fails occasionally on the PR runs with the following:

mock_server = None, change_to_tmpdir = None
min_config = {'config_path': '.', 'controls': [{'max': 0.1, 'min': 0, 'name': 'my_control', 'type': 'well_control', ...}], 'forward_model': ['sleep 5'], 'install_jobs': [{'name': 'sleep', 'source': 'SLEEP_job'}], ...}

    @patch("sys.argv", ["name", "--config-file", "config_minimal.yml"])
    def test_everserver_status_contains_max_runtime_failure(
        mock_server, change_to_tmpdir, min_config
    ):
        config_file = "config_minimal.yml"
    
        Path("SLEEP_job").write_text("EXECUTABLE sleep", encoding="utf-8")
        min_config["simulator"] = {"max_runtime": 2}
        min_config["forward_model"] = ["sleep 5"]
        min_config["install_jobs"] = [{"name": "sleep", "source": "SLEEP_job"}]
    
        config = EverestConfig(**min_config)
        config.dump(config_file)
    
        everserver.main()
        status = everserver_status(
            ServerConfig.get_everserver_status_path(config.output_dir)
        )
    
        assert status["status"] == ServerStatus.failed
        print(status["message"])
>       assert (
            "sleep Failed with: The run is cancelled due to reaching MAX_RUNTIME"
            in status["message"]
        )
E       assert 'sleep Failed with: The run is cancelled due to reaching MAX_RUNTIME' in 'Traceback (most recent call last):\n  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/everest/detached/jobs/everserver.py", line 317, in main\n    status, message = _get_optimization_status(run_model.exit_code, shared_data)\n                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/everest/detached/jobs/everserver.py", line 387, in _get_optimization_status\n    messages = _failed_realizations_messages(shared_data)\n               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/everest/detached/jobs/everserver.py", line 397, in _failed_realizations_messages\n    failed = shared_data[SIM_PROGRESS_ENDPOINT]["status"]["failed"]\n             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^\nKeyError: \'status\'\n'

Relevant warning:

WARNING  _ert.forward_model_runner.client:client.py:128 client-35528cd8 failed to get acknowledgment on the b'CONNECT'. Resending.
WARNING  _ert.forward_model_runner.client:client.py:128 dispatch-56328fc6 failed to get acknowledgment on the b'CONNECT'. Resending.
WARNING  _ert.forward_model_runner.client:client.py:128 client-35528cd8 failed to get acknowledgment on the b'CONNECT'. Resending.
WARNING  _ert.forward_model_runner.client:client.py:128 dispatch-56328fc6 failed to get acknowledgment on the b'CONNECT'. Resending.
ERROR    ert.run_models.base_run_model:base_run_model.py:556 unexpected error: client-35528cd8 Failed to send b'CONNECT' after retries!
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/ert/run_models/base_run_model.py", line 519, in run_monitor
    async with Monitor(ee_config.get_connection_info()) as monitor:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/_ert/forward_model_runner/client.py", line 56, in __aenter__
    await self.connect()
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/_ert/forward_model_runner/client.py", line 86, in connect
    await self.send(CONNECT_MSG, retries=1)
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/_ert/forward_model_runner/client.py", line 146, in send
    raise ClientConnectionError(
_ert.forward_model_runner.client.ClientConnectionError: client-35528cd8 Failed to send b'CONNECT' after retries!
ERROR    ert.ensemble_evaluator._ensemble:_ensemble.py:292 Traceback (most recent call last):
Unexpected exception in ensemble: 
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/ert/ensemble_evaluator/_ensemble.py", line 277, in _evaluate_inner
    await event_unary_send(event_creator(Id.ENSEMBLE_STARTED))
Unexpected exception in ensemble: 
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/ert/ensemble_evaluator/_ensemble.py", line 195, in send_event
    async with Client(url, token) as client:

See more here: https://github.com/equinor/ert/actions/runs/12768054939/job/35587654666
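
The `KeyError: 'status'` at the end of the pytest output above arises because the progress entry in `shared_data` was never populated with a `status` key, presumably because the evaluator connection failed before any progress was reported. A minimal sketch of that lookup follows, assuming `shared_data` is a plain nested dict. The value given to `SIM_PROGRESS_ENDPOINT` and both helper names are hypothetical, and the tolerant variant only illustrates the failure mode; it is not a proposed fix for the connection problem.

```python
# Assumed value for illustration only; the real constant lives in everest.
SIM_PROGRESS_ENDPOINT = "sim_progress"


def failed_count_strict(shared_data: dict) -> int:
    """Direct indexing, as in the traceback: raises KeyError if 'status' is missing."""
    return shared_data[SIM_PROGRESS_ENDPOINT]["status"]["failed"]


def failed_count_tolerant(shared_data: dict) -> int:
    """Hypothetical variant that treats a missing 'status' entry as zero failures."""
    progress = shared_data.get(SIM_PROGRESS_ENDPOINT, {})
    return progress.get("status", {}).get("failed", 0)


# Progress entry exists but was never filled in, as when the monitor never connected.
empty_progress = {SIM_PROGRESS_ENDPOINT: {}}
```

With `empty_progress`, `failed_count_strict` raises `KeyError: 'status'`, which is why the status message holds a traceback instead of the expected MAX_RUNTIME text.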

@xjules xjules added the bug label Jan 14, 2025
@xjules xjules moved this to Todo in SCOUT Jan 14, 2025
@sondreso sondreso added flaky-test and removed bug labels Jan 15, 2025

xjules commented Jan 15, 2025

After some testing with @larsevj, we realized that there might be a discrepancy between the actual QueueSystem names and how the EvaluatorServerConfig is set up.

This condition is always false (QueueSystem.LOCAL comes from ert and should be in capitals, I guess):

if run_model._queue_config.queue_system == QueueSystem.LOCAL:

which means the everest server will only ever use the tcp protocol for EvaluatorServerConfig.

While this fixture in conftest:

def evaluator_server_config_generator():

will only ever use the ipc protocol for EvaluatorServerConfig.

I think the discrepancy might have been introduced around this commit: 0c7bf35
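
The always-false comparison described above can be reproduced in isolation. This is a minimal sketch, assuming the mismatch is one of letter case between a lowercase configured value and an upper-case enum value. `QueueSystemSketch` and both `pick_protocol*` helpers are hypothetical stand-ins, not the actual ert or everest definitions.

```python
from enum import Enum


class QueueSystemSketch(str, Enum):
    """Hypothetical stand-in for a str-backed queue system enum with upper-case values."""

    LOCAL = "LOCAL"
    LSF = "LSF"


def pick_protocol(queue_system) -> str:
    """Mimic the branch from the issue: ipc for a local queue, tcp otherwise."""
    return "ipc" if queue_system == QueueSystemSketch.LOCAL else "tcp"


def pick_protocol_normalised(queue_system: str) -> str:
    """Same branch, but the value is upper-cased and converted to the enum first."""
    member = QueueSystemSketch(queue_system.upper())
    return "ipc" if member == QueueSystemSketch.LOCAL else "tcp"


# "local" != "LOCAL", so the first helper never selects ipc for a lowercase value.
lowercase_result = pick_protocol("local")
normalised_result = pick_protocol_normalised("local")
```

Here `lowercase_result` is `"tcp"` and `normalised_result` is `"ipc"`. A silent mismatch like this would make the server and the test fixture choose different protocols without raising any error.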
