file to VBS transfer fails after a few scans with "socket is not in listening state" #10
Comments
The detailed logs (thanks!) point at a likely suspect, one that has been encountered before. The UDT file descriptor numbers shown are somewhat predictable and indicate that in "normal operating mode" the code (attempts to) close the file descriptor of the previous transfer, whereas in the failing transfer it tries to close the one from two transfers ago. That points at a race condition in the multi-threaded bookkeeping of open/close file descriptors; will look into that. Since you indicate you have both ends under control, would it be possible to check whether the combination jive5ab-3.0 <=> jive5ab-3.0 gives the same problems?
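For illustration only (this is not jive5ab code; all names are hypothetical): the suspected pattern is a non-atomic hand-over of "the previous transfer's descriptor" between transfer threads, which would explain closing the descriptor from two transfers ago when two transfers overlap. A minimal sketch of such a race, and the obvious mutex fix:

```cpp
#include <cstdio>
#include <mutex>
#include <thread>

// Hypothetical shared bookkeeping: each finished transfer closes the
// descriptor left over from the transfer before it.
static int previous_fd = -1;
static std::mutex fd_mutex;

// RACE: the read of previous_fd and its update are not one atomic step,
// so two overlapping transfers can both read the same value -- one of
// them then "closes" the descriptor from two transfers ago.
void finish_transfer_unsafe(int my_fd) {
    int stale = previous_fd;
    previous_fd = my_fd;
    if (stale != -1)
        std::printf("unsafe: closing fd %d\n", stale);
}

// Guarding the read+update makes the hand-over atomic.
void finish_transfer_safe(int my_fd) {
    std::lock_guard<std::mutex> lock(fd_mutex);
    int stale = previous_fd;
    previous_fd = my_fd;
    if (stale != -1)
        std::printf("safe: closing fd %d\n", stale);
}

int main() {
    std::thread a(finish_transfer_safe, 3);
    std::thread b(finish_transfer_safe, 4);
    a.join();
    b.join();
}
```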
With "both ends under control", I unfortunately only meant the "running" of things, not compiling. But I will ask Haystack(the other end) if they can let me have jive5ab v3 as well. |
I have a theory: I noted that I had let two transfers run without any rate limiting option. The machine in question is not very powerful and had maxed out the CPU core assigned to dealing with IRQs for this NIC (i.e. the ksoftirqd process took 100% CPU). I suspect this must have led to lost packets. Because, in this instance, the same interface was used for control and data traffic, it could be that some control messages (close socket?) were lost? Data packets would also have been lost, so I'm re-running with "--resume" to check this. So far it seems to have found some loss, which would support my theory.
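To help verify the packet-loss theory, here is a small sketch (assuming Linux, where /proc/net/dev lists per-interface counters as RX bytes/packets/errs/drop, ...) that prints the kernel's RX drop counter for each interface. Comparing the values before and after a transfer shows whether packets were dropped while ksoftirqd was saturated:

```cpp
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Print the RX "drop" counter for every interface in /proc/net/dev.
int main() {
    std::ifstream dev("/proc/net/dev");
    std::string line;
    std::getline(dev, line);   // skip the two header lines
    std::getline(dev, line);
    while (std::getline(dev, line)) {
        std::istringstream ss(line);
        std::string ifname;    // e.g. "eth0:"
        unsigned long long bytes, packets, errs, drop;
        ss >> ifname >> bytes >> packets >> errs >> drop;
        std::cout << ifname << " rx_drop=" << drop << '\n';
    }
}
```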
Good thinking. I had tried to reproduce locally (in-house @jive) but couldn't trigger it. My thoughts run along the lines of long+fat links = lots of data in flight; if there isn't enough time to flush everything to disk, maybe that triggers the race condition. You could also add
Experienced the same issue with a transfer to Vienna. The command sent was
Many scans worked; the last one was vo1074_ow_074-1815_7, but the following file, vo1074_ow_074-1816b_0, failed. Receiver side:
On the client side, a possibly interesting log chunk is:
Interestingly, I don't see any "scan_set" command in the sender jive5ab log for the problematic scan. There is one for "vo1074_ow_074-1815_7", which works, but not for the next one, "vo1074_ow_074-1816b_0". Curious. Anyway, I want to try the EDT instead.
I'm using m5copy 1.60 (latest from GitHub) to transfer data from one end (j5ab 2.90) to another (j5ab 3.0). After a few scans (not always the same number) I get this error from m5copy:
The receiver log file says (for one scan which works, and the next one that fails):
The sender j5ab logfile says (for all scans):
I have full control over both ends, so I am fairly sure there is no other transfer in place on these ports or these instances. Any idea why this happens?
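For reference, m5copy transfers are specified as source and destination URLs; a hypothetical file-to-vbs invocation (host names, paths, and the scan name are placeholders here, and 2620 is assumed as the jive5ab control port) looks roughly like:

```
m5copy file://<sender>:2620/<path>/<scan>.vdif vbs://<receiver>:2620/
```

Here file:// reads from a plain file on the sending jive5ab, and vbs:// writes a FlexBuff-style scattered (vbs) recording on the receiving one.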