Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scheduler asks more than 1 worker to upload its result #428

Closed
mcornaton opened this issue Sep 23, 2021 · 6 comments
Closed

Scheduler asks more than 1 worker to upload its result #428

mcornaton opened this issue Sep 23, 2021 · 6 comments

Comments

@mcornaton
Copy link
Contributor

Sometimes, the scheduler asks more than 1 worker to upload their results. This could be seen in the following logs:

2021-09-22 20:57:24.166  INFO 1 --- [onPool-worker-3] com.iexec.core.task.TaskService          : Received task update request [chainTaskId:0xd6608a3a6b075331040ccc7936030d41235e68b8eb4553ccda79f4ebda8139fe]
2021-09-22 20:57:34.916  INFO 1 --- [    Scheduled-2] c.i.c.d.r.UnnotifiedAbstractDetector     : Detected confirmed missing update (replicate) [is:REVEALING, should:REVEALED, taskId:0xd6608a3a6b075331040ccc7936030d41235e68b8eb4553ccda79f4ebda8139fe]
2021-09-22 20:57:34.926  INFO 1 --- [    Scheduled-2] c.i.core.replicate.ReplicatesService     : Replicate update request [status:REVEALED, chainTaskId:0xd6608a3a6b075331040ccc7936030d41235e68b8eb4553ccda79f4ebda8139fe, walletAddress:0x2ab2674aa374fe6415d11f0a8fcbd8027fc1e6a9, details:ReplicateStatusDetails(chainReceipt=ChainReceipt(blockNumber=426, txHash=null), resultLink=null, chainCallbackData=null, errorMessage=null, cause=null, stdout=null)]
2021-09-22 20:57:34.933  INFO 1 --- [    Scheduled-2] c.i.common.chain.Web3jAbstractService    : Verified check [try:1, function:isRevealedTrue, args:[0xd6608a3a6b075331040ccc7936030d41235e68b8eb4553ccda79f4ebda8139fe, 0x2ab2674aa374fe6415d11f0a8fcbd8027fc1e6a9], maxTry:3, msToWait:6000, msPerBlock:1000]
2021-09-22 20:57:34.937  INFO 1 --- [onPool-worker-3] com.iexec.core.task.TaskService          : Received task update request [chainTaskId:0xd6608a3a6b075331040ccc7936030d41235e68b8eb4553ccda79f4ebda8139fe]
2021-09-22 20:57:34.937  INFO 1 --- [    Scheduled-5] c.i.c.t.update.TaskUpdateRequestManager  : Waiting requests from publisher [queueSize:0]
2021-09-22 20:57:34.937  INFO 1 --- [    Scheduled-2] c.i.core.replicate.ReplicatesService     : Replicate updated successfully [newStatus:REVEALED, newStatusCause:null nextAction:PLEASE_WAIT, chainTaskId:0xd6608a3a6b075331040ccc7936030d41235e68b8eb4553ccda79f4ebda8139fe, walletAddress:0x2ab2674aa374fe6415d11f0a8fcbd8027fc1e6a9]
2021-09-22 20:57:34.937  INFO 1 --- [ool-10-thread-1] c.i.c.t.update.TaskUpdateRequestManager  : Published task update request [chainTaskId:0xd6608a3a6b075331040ccc7936030d41235e68b8eb4553ccda79f4ebda8139fe, queueSize:1]
2021-09-22 20:57:34.940  INFO 1 --- [    Scheduled-2] c.i.c.d.r.UnnotifiedAbstractDetector     : Detected confirmed missing update (replicate) [is:REVEALING, should:REVEALED, taskId:0xd6608a3a6b075331040ccc7936030d41235e68b8eb4553ccda79f4ebda8139fe]
2021-09-22 20:57:34.941  INFO 1 --- [onPool-worker-3] com.iexec.core.task.TaskService          : UpdateTaskStatus suceeded [chainTaskId:0xd6608a3a6b075331040ccc7936030d41235e68b8eb4553ccda79f4ebda8139fe, currentStatus:CONSENSUS_REACHED, newStatus:AT_LEAST_ONE_REVEALED]
2021-09-22 20:57:34.944  INFO 1 --- [onPool-worker-3] com.iexec.core.task.TaskService          : UpdateTaskStatus suceeded [chainTaskId:0xd6608a3a6b075331040ccc7936030d41235e68b8eb4553ccda79f4ebda8139fe, currentStatus:AT_LEAST_ONE_REVEALED, newStatus:RESULT_UPLOAD_REQUESTED]
2021-09-22 20:57:34.945  INFO 1 --- [onPool-worker-3] c.i.core.replicate.ReplicatesService     : Replicate update request [status:RESULT_UPLOAD_REQUESTED, chainTaskId:0xd6608a3a6b075331040ccc7936030d41235e68b8eb4553ccda79f4ebda8139fe, walletAddress:0x2ab2674aa374fe6415d11f0a8fcbd8027fc1e6a9, details:null]
2021-09-22 20:57:34.949  INFO 1 --- [ool-10-thread-1] c.i.c.t.update.TaskUpdateRequestManager  : Published task update request [chainTaskId:0xd6608a3a6b075331040ccc7936030d41235e68b8eb4553ccda79f4ebda8139fe, queueSize:1]
2021-09-22 20:57:34.949  INFO 1 --- [onPool-worker-3] c.i.core.replicate.ReplicatesService     : Replicate updated successfully [newStatus:RESULT_UPLOAD_REQUESTED, newStatusCause:null nextAction:null, chainTaskId:0xd6608a3a6b075331040ccc7936030d41235e68b8eb4553ccda79f4ebda8139fe, walletAddress:0x2ab2674aa374fe6415d11f0a8fcbd8027fc1e6a9]
2021-09-22 20:57:34.950  INFO 1 --- [    Scheduled-2] c.i.core.replicate.ReplicatesService     : Replicate update request [status:REVEALED, chainTaskId:0xd6608a3a6b075331040ccc7936030d41235e68b8eb4553ccda79f4ebda8139fe, walletAddress:0x1a69b2eb604db8eba185df03ea4f5288dcbbd248, details:ReplicateStatusDetails(chainReceipt=ChainReceipt(blockNumber=426, txHash=null), resultLink=null, chainCallbackData=null, errorMessage=null, cause=null, stdout=null)]
2021-09-22 20:57:34.951  INFO 1 --- [onPool-worker-3] c.i.core.task.listener.TaskListeners     : Received PleaseUploadEvent [chainTaskId:0xd6608a3a6b075331040ccc7936030d41235e68b8eb4553ccda79f4ebda8139fe, workerWallet:0x2ab2674aa374fe6415d11f0a8fcbd8027fc1e6a9] 
2021-09-22 20:57:34.955  INFO 1 --- [    Scheduled-5] c.i.c.t.update.TaskUpdateRequestManager  : Waiting requests from publisher [queueSize:0]
2021-09-22 20:57:34.955  INFO 1 --- [onPool-worker-5] com.iexec.core.task.TaskService          : Received task update request [chainTaskId:0xd6608a3a6b075331040ccc7936030d41235e68b8eb4553ccda79f4ebda8139fe]
2021-09-22 20:57:34.956  INFO 1 --- [onPool-worker-3] c.iexec.core.pubsub.NotificationService  : Sent TaskNotification [chainTaskId:0xd6608a3a6b075331040ccc7936030d41235e68b8eb4553ccda79f4ebda8139fe, type:PLEASE_UPLOAD, workers:[0x2ab2674aa374fe6415d11f0a8fcbd8027fc1e6a9]]
2021-09-22 20:57:34.957  INFO 1 --- [onPool-worker-3] c.i.core.task.listener.TaskListeners     : NotifyUploadingWorker completed[uploadingWorkerWallet=0x2ab2674aa374fe6415d11f0a8fcbd8027fc1e6a9]
2021-09-22 20:57:34.962  INFO 1 --- [    Scheduled-2] c.i.common.chain.Web3jAbstractService    : Verified check [try:1, function:isRevealedTrue, args:[0xd6608a3a6b075331040ccc7936030d41235e68b8eb4553ccda79f4ebda8139fe, 0x1a69b2eb604db8eba185df03ea4f5288dcbbd248], maxTry:3, msToWait:6000, msPerBlock:1000]
2021-09-22 20:57:34.966  INFO 1 --- [ool-10-thread-1] c.i.c.t.update.TaskUpdateRequestManager  : Published task update request [chainTaskId:0xd6608a3a6b075331040ccc7936030d41235e68b8eb4553ccda79f4ebda8139fe, queueSize:1]
2021-09-22 20:57:34.966  INFO 1 --- [    Scheduled-2] c.i.core.replicate.ReplicatesService     : Replicate updated successfully [newStatus:REVEALED, newStatusCause:null nextAction:PLEASE_WAIT, chainTaskId:0xd6608a3a6b075331040ccc7936030d41235e68b8eb4553ccda79f4ebda8139fe, walletAddress:0x1a69b2eb604db8eba185df03ea4f5288dcbbd248]
2021-09-22 20:57:34.966  INFO 1 --- [    Scheduled-5] c.i.c.t.update.TaskUpdateRequestManager  : Waiting requests from publisher [queueSize:0]
2021-09-22 20:57:34.966  INFO 1 --- [io-13000-exec-4] c.i.core.replicate.ReplicatesService     : Replicate update request [status:RESULT_UPLOADING, chainTaskId:0xd6608a3a6b075331040ccc7936030d41235e68b8eb4553ccda79f4ebda8139fe, walletAddress:0x2ab2674aa374fe6415d11f0a8fcbd8027fc1e6a9, details:null]
2021-09-22 20:57:34.967  INFO 1 --- [onPool-worker-5] com.iexec.core.task.TaskService          : Received task update request [chainTaskId:0xd6608a3a6b075331040ccc7936030d41235e68b8eb4553ccda79f4ebda8139fe]
2021-09-22 20:57:34.971  INFO 1 --- [ool-10-thread-1] c.i.c.t.update.TaskUpdateRequestManager  : Published task update request [chainTaskId:0xd6608a3a6b075331040ccc7936030d41235e68b8eb4553ccda79f4ebda8139fe, queueSize:1]
2021-09-22 20:57:34.971  INFO 1 --- [onPool-worker-3] com.iexec.core.task.TaskService          : Received task update request [chainTaskId:0xd6608a3a6b075331040ccc7936030d41235e68b8eb4553ccda79f4ebda8139fe]
2021-09-22 20:57:34.970  INFO 1 --- [io-13000-exec-4] c.i.core.replicate.ReplicatesService     : Replicate updated successfully [newStatus:RESULT_UPLOADING, newStatusCause:null nextAction:PLEASE_CONTINUE, chainTaskId:0xd6608a3a6b075331040ccc7936030d41235e68b8eb4553ccda79f4ebda8139fe, walletAddress:0x2ab2674aa374fe6415d11f0a8fcbd8027fc1e6a9]
2021-09-22 20:57:34.971  INFO 1 --- [    Scheduled-5] c.i.c.t.update.TaskUpdateRequestManager  : Waiting requests from publisher [queueSize:0]
2021-09-22 20:57:34.974  INFO 1 --- [onPool-worker-5] com.iexec.core.task.TaskService          : UpdateTaskStatus suceeded [chainTaskId:0xd6608a3a6b075331040ccc7936030d41235e68b8eb4553ccda79f4ebda8139fe, currentStatus:RESULT_UPLOAD_REQUESTED, newStatus:RESULT_UPLOAD_REQUESTED]

Note that the last line should not happen as a task can't be transitioned from a status to the same status.

@mcornaton mcornaton changed the title SCheduler asks more than 1 worker to upload its result Scheduler asks more than 1 worker to upload its result Sep 23, 2021
@jeremyjams
Copy link
Member

IMO it is the root cause of iExecBlockchainComputing/iexec-result-proxy#21 (the result proxy might be smarter though).

@mcornaton
Copy link
Contributor Author

mcornaton commented Sep 23, 2021

I agree.
As far as I have understood this bug, the first thing to do is to add some sort of synchronization to TaskService#requestUpload so that if the task already has an uploadingWorkerWalletAddress, nothing should happen.
We should also look for the real cause of the issue. There's probably some part of the workflow that's not clear.

@mcornaton
Copy link
Contributor Author

mcornaton commented Sep 28, 2021

In a bunch of 100 local runs, 5 runs have failed: 2 because of a timeout and 3 because of the scheduler asking both workers to upload.


During an usual run, the method TaskService#requestUpload is called once. During a failing run, this method is called a bunch of times (the maximum I have seen is currently 10 times).
In all tests, TaskService#consensusReached2AtLeastOneReveal2UploadRequested seems to be the first method that calls TaskService#requestUpload. It is also the only one in successful tests. However, in failing tests, there are also TaskService#resultUploading2Uploaded (currently line 582) and TaskService#uploadRequested2UploadingResult (line 525) that can be called multiple times.

@mcornaton
Copy link
Contributor Author

mcornaton commented Sep 28, 2021

In our 3 failing tests we can see that the workers haven't been able to upload their result onto IPFS because of an internal server error:

com.iexec.worker.feign.BaseFeignClient   : Failed to make http call [action:getIpfsHashForTask, status:500 INTERNAL_SERVER_ERROR, attempts:3]
com.iexec.worker.result.ResultService    : Cannot get tee web2 result link (result-proxy issue) [chainTaskId:0x71d953138e0293b77eeae25b5114cbcc5df53ef9d99c3a153fcd9fe615668883]

This issue seems to arise because of a non-unique IpfsName id.

org.springframework.dao.IncorrectResultSizeDataAccessException: Query { "$java" : Query: { "taskId" : "0x1a29ebb8b64998a7a29ca30dfa8b8d33dc673aea030c5eadb90d07fb913d31d8"}, Fields: {}, Sort: {} } returned non unique result.
	at ...
	at com.sun.proxy.$Proxy98.findByTaskId(Unknown Source) ~[na:na]
	at com.iexec.resultproxy.ipfs.task.IpfsNameService.getIpfsHashForTask(IpfsNameService.java:34) ~[classes!/:na]
	at com.iexec.resultproxy.proxy.ProxyController.getIpfsHashForTask(ProxyController.java:176) ~[classes!/:na]

That's the issue explained in iExecBlockchainComputing/iexec-result-proxy#21. As pointed out there, both issues are probably linked.

@mcornaton
Copy link
Contributor Author

mcornaton commented Sep 29, 2021

Cause:

After having added some logs to ``, we can see that we have a clear race-condition.
New code:

    private void uploadRequested2UploadingResult(Task task) {
        boolean isTaskInUploadRequested = task.getCurrentStatus().equals(RESULT_UPLOAD_REQUESTED);

        final int uploadingReplicatesCount = replicatesService.getNbReplicatesWithCurrentStatus(task.getChainTaskId(),
                ReplicateStatus.RESULT_UPLOADING);
        boolean isThereAWorkerUploading = uploadingReplicatesCount > 0;
        log.info("Uploading workers: {}", uploadingReplicatesCount);

        final int requestedToUploadReplicatesCount = replicatesService.getNbReplicatesWithCurrentStatus(task.getChainTaskId(),
                ReplicateStatus.RESULT_UPLOAD_REQUESTED);
        boolean isThereAWorkerRequestedToUpload = requestedToUploadReplicatesCount > 0;
        log.info("Requested to upload workers: {}", requestedToUploadReplicatesCount);

        if (!isTaskInUploadRequested) {
            return;
        }

        if (isThereAWorkerUploading) {
            updateTaskStatusAndSave(task, RESULT_UPLOADING);
            return;
        }

        if (!isThereAWorkerRequestedToUpload) {
            requestUpload(task);
        }
    }

Logs:

2021-09-29 07:24:16.744  INFO 1 --- [io-13000-exec-7] c.i.core.replicate.ReplicatesService     : Replicate update request [status:RESULT_UPLOADING, chainTaskId:0x8dc6fa700b627a92ec56cb967bffa8176f2ea7fb2a681dcea9fe6a7158eeee4e, walletAddress:0x2ab2674aa374fe6415d11f0a8fcbd8027fc1e6a9, details:null]
2021-09-29 07:24:16.746  INFO 1 --- [ool-10-thread-1] c.i.c.t.update.TaskUpdateRequestManager  : Published task update request [chainTaskId:0x8dc6fa700b627a92ec56cb967bffa8176f2ea7fb2a681dcea9fe6a7158eeee4e, queueSize:1]
2021-09-29 07:24:16.746  INFO 1 --- [nPool-worker-19] com.iexec.core.task.TaskService          : Received task update request [chainTaskId:0x8dc6fa700b627a92ec56cb967bffa8176f2ea7fb2a681dcea9fe6a7158eeee4e]
2021-09-29 07:24:16.746  INFO 1 --- [    Scheduled-5] c.i.c.t.update.TaskUpdateRequestManager  : Waiting requests from publisher [queueSize:0]
2021-09-29 07:24:16.746  INFO 1 --- [   Scheduled-12] c.i.core.replicate.ReplicatesService     : Replicate updated successfully [newStatus:REVEALED, newStatusCause:null nextAction:PLEASE_WAIT, chainTaskId:0x8dc6fa700b627a92ec56cb967bffa8176f2ea7fb2a681dcea9fe6a7158eeee4e, walletAddress:0x1a69b2eb604db8eba185df03ea4f5288dcbbd248]
2021-09-29 07:24:16.749  INFO 1 --- [nPool-worker-19] com.iexec.core.task.TaskService          : Uploading workers: 0
2021-09-29 07:24:16.750  INFO 1 --- [io-13000-exec-7] c.i.core.replicate.ReplicatesService     : Replicate updated successfully [newStatus:RESULT_UPLOADING, newStatusCause:null nextAction:PLEASE_CONTINUE, chainTaskId:0x8dc6fa700b627a92ec56cb967bffa8176f2ea7fb2a681dcea9fe6a7158eeee4e, walletAddress:0x2ab2674aa374fe6415d11f0a8fcbd8027fc1e6a9]
2021-09-29 07:24:16.750  INFO 1 --- [onPool-worker-5] com.iexec.core.task.TaskService          : Received task update request [chainTaskId:0x8dc6fa700b627a92ec56cb967bffa8176f2ea7fb2a681dcea9fe6a7158eeee4e]
2021-09-29 07:24:16.750  INFO 1 --- [    Scheduled-5] c.i.c.t.update.TaskUpdateRequestManager  : Waiting requests from publisher [queueSize:0]
2021-09-29 07:24:16.750  INFO 1 --- [ool-10-thread-1] c.i.c.t.update.TaskUpdateRequestManager  : Published task update request [chainTaskId:0x8dc6fa700b627a92ec56cb967bffa8176f2ea7fb2a681dcea9fe6a7158eeee4e, queueSize:1]
2021-09-29 07:24:16.752  INFO 1 --- [nPool-worker-19] com.iexec.core.task.TaskService          : Requested to upload workers: 0
2021-09-29 07:24:16.754 ERROR 1 --- [nPool-worker-19] com.iexec.core.task.TaskService          : [DEBUG concurrency]

java.lang.Exception: [DEBUG concurrency]
	at com.iexec.core.task.TaskService.requestUpload(TaskService.java:603)

What happens there:

  1. Replicate 2 is in RESULT_UPLOAD_REQUESTED status.
  2. Scheduler receives an update request for Replicate 2 to transition it to RESULT_UPLOADING.
  3. Replicate 1 is updated to REVEALED status.
    1. Task update is triggered.
    2. Task is in RESULT_UPLOAD_REQUESTED status.
  4. Method uploadRequested2UploadingResult is called.
    1. uploadingReplicatesCount is equal to 0 because there's no uploading replicate.
    2. Replicate 2 update is achieved: it is now in RESULT_UPLOADING status.
    3. requestedToUploadReplicatesCount is equal to 0 because there's no more requested to upload replicate (remember that Replicate 2 has transitioned from RESULT_UPLOAD_REQUESTED to RESULT_UPLOADING).
    4. requestUpload is called another time and a potentially new worker is selected to upload its result - note that the newly selected worker can be the already-uploading worker.

How to fix it?

The first patch to do would be to revert the tests order. Then, the race condition would be reverted so that a worker can be counted for both tests. That's no more an issue as we would only use the uploadingReplicatesCount.
The second patch should be a check on the task in requestUpload to verify there's no currently uploading replicate.

@mcornaton
Copy link
Contributor Author

Fixed in #454.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants