Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

file to VBS transfer fails after a few scans with "socket is not in listening state" #10

Open
varenius opened this issue Feb 26, 2021 · 5 comments

Comments

@varenius
Copy link

I'm using m5copy 1.60 (latest from github) to transfer data from one end (j5ab 2.90) to another (j5ab 3.0). After a few scans (not always the same number) I get this error by m5copy

oso_oper@Mark6-4026:~$ ./m5copy file://:2680/mnt/raid3/vo1021_k2/*.vdif vbs://130.242.10.3:32625/ -udt -p 2681
$Id: m5copy,v 1.60 2020/31/03 20:23:00 verkout Exp $
    copy VLBI data from somewhere to elsewhere
               (c) H. Verkouter

SRC::FILE [127.0.0.1:2680] /mnt/raid3/vo1021_k2/*.vdif  ===>  DST::VBS [130.242.10.3:32625:<controlIP>] 
127.0.0.1::/mnt/raid3/vo1021_k2/vo1021_k2_021-1800.vdif => 130.242.10.3::vo1021_k2_021-1800
127.0.0.1::/mnt/raid3/vo1021_k2/vo1021_k2_021-1801.vdif => 130.242.10.3::vo1021_k2_021-1801
127.0.0.1::/mnt/raid3/vo1021_k2/vo1021_k2_021-1803b.vdif => 130.242.10.3::vo1021_k2_021-1803b
127.0.0.1::/mnt/raid3/vo1021_k2/vo1021_k2_021-1809b.vdif => 130.242.10.3::vo1021_k2_021-1809b
Unacceptable return code 1 from query 'mode=none' - !record=  1 ; [acceptable: [0]]
-1

The receiver log file says (for one scans which works, and the next one that fails):

202105708:20:45 2021-02-26 07:20:45.99: Processing command 'record=on:vo1021_k2_021-1803b'
202105708:20:46 2021-02-26 07:20:46.29: Failed to close fd#132206458 (udt) - if UDT, may already be closed
202105708:20:46 2021-02-26 07:20:46.50: getsok_udt: req. server socket@:2681
202105708:20:46 2021-02-26 07:20:46.50: getsok_udt: listening on interface 
202105708:20:46 2021-02-26 07:20:46.50: Reply: !record=  0  ;
202105708:20:46 2021-02-26 07:20:46.50: netreader: waiting for incoming connection
202105708:20:46 2021-02-26 07:20:46.72: netreader: incoming dataconnection from 192.52.62.236:45546
202105708:20:46 2021-02-26 07:20:46.72: udtreader: read fd=132206456 rd:1312 wr:1312 bs:268435200
202105708:20:46 2021-02-26 07:20:46.82: Processing command 'echo=off'
202105708:20:56 2021-02-26 07:20:56.26: 
202105708:20:56 #################### WARNING ####################
202105708:20:56   mountpoint /mnt/disk34  POSSIBLY BAD!
202105708:20:56   failed to write 268435200 bytes to /mnt/disk34/vo1021_k2_021-1803b/vo1021_k2_021-1803b.00000004
202105708:20:56     - No space left on device
202105708:20:56   removing it from list!
202105708:20:56 #################################################
202105708:21:06 2021-02-26 07:21:06.05: 
202105708:21:06 #################### WARNING ####################
202105708:21:06   mountpoint /mnt/disk33  POSSIBLY BAD!
202105708:21:06   failed to write 268435200 bytes to /mnt/disk33/vo1021_k2_021-1803b/vo1021_k2_021-1803b.00000013
202105708:21:06     - No space left on device
202105708:21:06   removing it from list!
202105708:21:06 #################################################
202105708:22:34 2021-02-26 07:22:34.18: Reply: !echo= 0 ;
202105708:22:34 2021-02-26 07:22:34.28: Processing command 'record?'
202105708:22:34 2021-02-26 07:22:34.28: Reply: !record?  0 : on : 3 : vo1021_k2_021-1803b : 30559449056 ;
202105708:22:35 2021-02-26 07:22:35.04: udtreader: error Connection was broken. (2001)
202105708:22:35 2021-02-26 07:22:35.04: udtreader: partial block; adjusting block size by -120048000
202105708:22:35 2021-02-26 07:22:35.04: udtreader: stopping. read 30750000000 (28.6382Gbyte)
202105708:22:35 2021-02-26 07:22:35.04: scopedfd: closing fd=132206457 (udt)
202105708:22:35 2021-02-26 07:22:35.39: Processing command 'record?'
202105708:22:35 2021-02-26 07:22:35.39: Reply: !record?  0 : on : 3 : vo1021_k2_021-1803b : 30750000000 ;
202105708:22:35 2021-02-26 07:22:35.73: net2vbs guard function: transfer done
202105708:22:36 2021-02-26 07:22:36.50: Processing command 'record?'
202105708:22:36 2021-02-26 07:22:36.50: Reply: !record?  0 : off ;
202105708:22:37 2021-02-26 07:22:37.61: Processing command 'record?'
202105708:22:37 2021-02-26 07:22:37.61: Reply: !record?  0 : off ;
202105708:22:38 2021-02-26 07:22:38.72: Processing command 'record?'
202105708:22:38 2021-02-26 07:22:38.72: Reply: !record?  0 : off ;
202105708:22:38 2021-02-26 07:22:38.83: Processing command 'record=off'
202105708:22:38 2021-02-26 07:22:38.83: Reply: !record=  6 : Not doing record ;
202105708:22:38 2021-02-26 07:22:38.93: Processing command 'mode=none'
202105708:22:38 2021-02-26 07:22:38.93: Reply: !mode= 0 ;
202105708:22:49 2021-02-26 07:22:49.07: Processing command 'mode=VDIF_32768-8192-256-1'
202105708:22:49 2021-02-26 07:22:49.07: Reply: !mode= 0 ;
202105708:22:49 2021-02-26 07:22:49.18: Processing command 'record=on:vo1021_k2_021-1809b'
202105708:22:58 2021-02-26 07:22:58.92: Failed to close fd#132206456 (udt) - if UDT, may already be closed
202105708:22:59 2021-02-26 07:22:59.12: getsok_udt: req. server socket@:2681
202105708:22:59 2021-02-26 07:22:59.12: getsok_udt: listening on interface 
202105708:22:59 2021-02-26 07:22:59.12: Reply: !record=  0  ;
202105708:22:59 2021-02-26 07:22:59.12: netreader: waiting for incoming connection
202105708:22:59 2021-02-26 07:22:59.12: Processing command 'record=off'
202105708:22:59 2021-02-26 07:22:59.12: close_filedescriptor: closed fd#132206455
202105708:22:59 OH NOES! A step threw an exception:
202105708:22:59 **** /usr/local/src/jive5ab-3.0.0/src/getsok_udt.cc@229 [(afd=UDT::accept(fd, (struct sockaddr*)&remote, &islen))!=-1] fails UDTError:Operation not supported: Socket is not in listening state.
202105708:22:59 2021-02-26 07:22:59.12: net2vbs guard function: transfer done
202105708:22:59 2021-02-26 07:22:59.62: Reply: !record=  1 ;
202105708:22:59 2021-02-26 07:22:59.62: Processing command 'mode=none'
202105708:22:59 2021-02-26 07:22:59.62: Reply: !mode= 0 ;
202105708:22:59 2021-02-26 07:22:59.73: Processing command 'runtime=fYt6rMdE:delete'
202105708:22:59 2021-02-26 07:22:59.73: Cleaning up runtime
202105708:22:59 2021-02-26 07:22:59.73: Reply: !runtime = 0 : default ;
202105708:22:59 2021-02-26 07:22:59.84: Processing command 'runtime=default'
202105708:22:59 2021-02-26 07:22:59.84: Reply: !runtime = 0 : default ;

The sender j5ab logfile says (for all scans)

2021-02-26 07:12:23.86: main: jive5a [ jive5ab : 2.9.0 : 64bit : release : Mark6-4026 : 29-May-2019 : 12h09m53s : nossapi] ready
2021-02-26 07:13:14.10: mk6info - found 0 FlexBuff mountpoints
2021-02-26 07:16:39.87: mk6info - found 0 FlexBuff mountpoints
2021-02-26 07:16:41.99: open_file: opened /mnt/raid3/vo1021_k2/vo1021_k2_021-1800.vdif as fd=8
2021-02-26 07:16:41.99: udtwriter: writing to fd=909351429
2021-02-26 07:16:41.99: fdreader: start reading from fd=8, 0->30750000000
2021-02-26 07:16:41.99: udtwriter: switch to ipd=0 [set=0, , theoretical=0]
2021-02-26 07:18:27.17: fdreader: done 30750000000, 28.6382Gbyte
2021-02-26 07:18:27.40: udtwriter: stopping. wrote 30750000000 (28.6382Gbyte)
2021-02-26 07:18:43.10: open_file: opened /mnt/raid3/vo1021_k2/vo1021_k2_021-1801.vdif as fd=8
2021-02-26 07:18:43.10: udtwriter: writing to fd=909351428
2021-02-26 07:18:43.10: fdreader: start reading from fd=8, 0->30750000000
2021-02-26 07:18:43.56: udtwriter: switch to ipd=0 [set=0, , theoretical=0]
2021-02-26 07:20:29.99: fdreader: done 30750000000, 28.6382Gbyte
2021-02-26 07:20:30.22: udtwriter: stopping. wrote 30750000000 (28.6382Gbyte)
2021-02-26 07:20:46.77: open_file: opened /mnt/raid3/vo1021_k2/vo1021_k2_021-1803b.vdif as fd=8
2021-02-26 07:20:46.77: udtwriter: writing to fd=909351427
2021-02-26 07:20:46.77: fdreader: start reading from fd=8, 0->30750000000
2021-02-26 07:20:47.20: udtwriter: switch to ipd=0 [set=0, , theoretical=0]
2021-02-26 07:22:33.45: fdreader: done 30750000000, 28.6382Gbyte
2021-02-26 07:22:33.68: udtwriter: stopping. wrote 30750000000 (28.6382Gbyte)

I have full control over both ends, so I am fairly sure there is no other transfer in place on these ports or these instances. Any idea why this happens?

@haavee
Copy link
Member

haavee commented Feb 26, 2021

The detailed logs (thanks!) point at a likely suspect; one that was been encountered before.

The UDT file descriptor numbers shown are somewhat predictable and indicate that in "normal operating mode" the code (attempts to) close the file descriptor of the previous transfer and the failing transfer it tries to close the one from two transfers ago.

It seems to point at a race condition in the multi-thread bookkeeping of open/close file descriptors; will look into that.

Since you indicate to have both ends under control, would it be possible to check if the combination jive5ab-3.0 <=> jive5ab-3.0 gives the same problems?

@varenius
Copy link
Author

With "both ends under control", I unfortunately only meant the "running" of things, not compiling. But I will ask Haystack(the other end) if they can let me have jive5ab v3 as well.

@varenius
Copy link
Author

I have a theory: I noted that I had let two transfers run without any rate limiting option. The machine in question is not very powerful, and had maxed out the CPU core assigned to dealing with IRQs for this NIC (i.e. the ksoftirq process took 100% CPU). I suspect this must have lead to lost packets. Because, for this instance, the same interface was used for control and data traffic, it could be that some control messages (close socket?) were lost? Data packets would also be lost, so I'm re'running with "--resume" to check this. So far it seems to have found some loss, which would support my theory.

@haavee
Copy link
Member

haavee commented Feb 28, 2021

Good thinking. I had tried to reproduce locally (in-house @jive) but couldn't trigger. My thoughts are along long+fat links = lots of data in flight and if there isn't enough time to flush everything to disk, maybe that triggers the race condition. You could also add -e 10 to the command line (run m5copy w/o arguments for all commanddline options):

-e NNN effectively allows more time for a slow receiver to flush data to disk, probably lengthening the time between transfers, hopefully less chance of triggering the race condition.

@varenius
Copy link
Author

varenius commented Mar 17, 2021

Experienced the same issue with transfer to Vienna. The command sent was
m5copy vbs://:2621/vo1074_ow* file://VIEN:2620/gpfs/cdata/incoming/ow/ -udt --resume -e 10 -r 2G
EDIT: So, in this case it happened in VBS--> FILE and not in file--> VBS.

Many scans worked, last one was vo1074_ow_074-1815_7, but the following file vo1074_ow_074-1816b_0 failed.

Receiver side:

2021-03-17 07:08:40.13: Processing command 'net2file=open:/gpfs/cdata/incoming/ow/vo1074_ow_074-1815_7.m5a,a'
2021-03-17 07:08:40.13: open_file: opened /gpfs/cdata/incoming/ow/vo1074_ow_074-1815_7.m5a as fd=7
2021-03-17 07:08:40.13: getsok_udt: req. server socket@:2630
2021-03-17 07:08:40.13: fdwriter: writing to fd=7
2021-03-17 07:08:40.13: getsok_udt: listening on interface
2021-03-17 07:08:40.13: Reply: !net2file=  0 : 0 ;
2021-03-17 07:08:40.13: netreader: waiting for incoming connection
2021-03-17 07:08:40.20: netreader: incoming dataconnection from 130.242.10.2:57882
2021-03-17 07:08:40.20: udtreader: read fd=1058892436 rd:1472 wr:1472 bs:2096128
2021-03-17 07:08:41.24: Processing command 'echo=off'
2021-03-17 07:09:03.30: Reply: !echo= 0 ;
2021-03-17 07:09:03.33: Processing command 'net2file?'
2021-03-17 07:09:03.33: Reply: !net2file?  0 : active : 3479572480 ;
2021-03-17 07:09:04.36: Processing command 'net2file?'
2021-03-17 07:09:04.36: Reply: !net2file?  0 : active : 3615820800 ;
2021-03-17 07:09:05.40: Processing command 'net2file?'
2021-03-17 07:09:05.40: Reply: !net2file?  0 : active : 3764645888 ;
2021-03-17 07:09:05.62: udtreader: error Connection was broken. (2001)
2021-03-17 07:09:05.62: udtreader: partial block; adjusting block size by -770368
2021-03-17 07:09:05.62: udtreader: stopping. read 3793221312 (3.53271Gbyte)
2021-03-17 07:09:05.62: scopedfd: closing fd=1058892437 (udt)
2021-03-17 07:09:05.62: fdwriter: stopping. wrote 3793221312 (3.53271Gbyte)
2021-03-17 07:09:05.62: net2file guard function: transfer done
2021-03-17 07:09:05.62: close_filedescriptor: closed fd#1058892436
2021-03-17 07:09:05.62: close_filedescriptor: closed fd#7
2021-03-17 07:09:06.43: Processing command 'net2file?'
2021-03-17 07:09:06.43: Reply: !net2file?  0 : inactive : 0 ;
2021-03-17 07:09:07.46: Processing command 'net2file?'
2021-03-17 07:09:07.46: Reply: !net2file?  0 : inactive : 0 ;
2021-03-17 07:09:08.49: Processing command 'net2file?'
2021-03-17 07:09:08.49: Reply: !net2file?  0 : inactive : 0 ;
2021-03-17 07:09:08.52: Processing command 'net2file=close'
2021-03-17 07:09:08.52: Reply: !net2file=  6 : Not doing net2file yet ;
2021-03-17 07:09:18.56: Processing command 'net2file=open:/gpfs/cdata/incoming/ow/vo1074_ow_074-1816b_0.m5a,a'
2021-03-17 07:09:58.85: open_file: opened /gpfs/cdata/incoming/ow/vo1074_ow_074-1816b_0.m5a as fd=7
2021-03-17 07:09:58.85: getsok_udt: req. server socket@:2630
2021-03-17 07:09:58.85: fdwriter: writing to fd=7
2021-03-17 07:09:58.86: getsok_udt: listening on interface
2021-03-17 07:09:58.86: Reply: !net2file=  0 : 0 ;
2021-03-17 07:09:58.86: netreader: waiting for incoming connection
2021-03-17 07:09:58.86: Processing command 'net2file=close'
2021-03-17 07:09:58.86: close_filedescriptor: closed fd#1058892435
2021-03-17 07:09:58.86: close_filedescriptor: closed fd#7
2021-03-17 07:09:58.86: fdwriter: stopping. wrote 0 (0byte)
OH NOES! A step threw an exception:
**** /home/lv71338/jamiem1/software/jive5ab_src/jive5ab/src/getsok_udt.cc@229 [(afd=UDT::accept(fd, (struct sockaddr*)&remote, &islen))!=-1] fails UDTError:Operation not supported: Socket is not in listening state.
2021-03-17 07:09:58.86: net2file guard function: transfer done
2021-03-17 07:09:58.86: Processing command 'runtime=u44MPqrk:delete'
2021-03-17 07:09:58.86: Cleaning up runtime
2021-03-17 07:09:58.86: Reply: !net2file=  0 ;!runtime = 0 : default ;

Client side, a possibly interesting log chunk is

202107608:08:30 2021-03-17 07:08:30.06: Processing command 'disk2net=disconnect'
202107608:08:30 2021-03-17 07:08:30.06: Reply: !disk2net=  6 : Not doing disk2net ;
202107608:08:40 2021-03-17 07:08:40.14: Processing command 'record=mk6:0'
202107608:08:40 2021-03-17 07:08:40.14: Reply: !record=  0 ;
202107608:08:40 2021-03-17 07:08:40.14: Processing command 'scan_set=vo1074_ow_074-1815_7::'
202107608:08:40 2021-03-17 07:08:40.14: openfile_type: found 3793221312 bytes in 15 chunks, 100%
202107608:08:40 2021-03-17 07:08:40.14: Reply: !scan_set= 0 ;
202107608:08:40 2021-03-17 07:08:40.14: Processing command 'scan_set?'
202107608:08:40 2021-03-17 07:08:40.14: Reply: !scan_set? 0 : -1 : vo1074_ow_074-1815_7 : 0 : 3793221312 ;
202107608:08:40 2021-03-17 07:08:40.14: Processing command 'disk2net=connect:193.170.79.54'
202107608:08:40 2021-03-17 07:08:40.14: compute_theoretical_ipd: ntrack=0 @0 dgsize=1472 => n_pkt_p_s=0
202107608:08:40 2021-03-17 07:08:40.14: openfile_type: found 3793221312 bytes in 15 chunks, 100%
202107608:08:40 2021-03-17 07:08:40.14: open_vbs: opened vo1074_ow_074-1815_7 as fd=2147483647 [vbs]
202107608:08:40 2021-03-17 07:08:40.14: getsok_udt: trying 193.170.79.54{193.170.79.54}:2630 ...
202107608:08:40 2021-03-17 07:08:40.21: Reply: !disk2net=  1 ;
202107608:08:40 2021-03-17 07:08:40.21: udtwriter: writing to fd=1043591041
202107608:08:40 2021-03-17 07:08:40.21: Processing command 'disk2net?'
202107608:08:40 2021-03-17 07:08:40.21: Reply: !disk2net?  0 : connected : 193.170.79.54 : 0 : 0 : 3793221312 ;
202107608:08:40 2021-03-17 07:08:40.21: Processing command 'disk2net=on::'
202107608:08:40 2021-03-17 07:08:40.21: Reply: !disk2net=  0 ;
202107608:08:40 2021-03-17 07:08:40.21: Processing command 'disk2net?'
202107608:08:40 2021-03-17 07:08:40.21: Reply: !disk2net?  0 : connected : 193.170.79.54 : 0 : 0 : 3793221312 ;
202107608:08:40 2021-03-17 07:08:40.21: vbsreader_c: start reading fd#2147483647, 0 => 3793221312
202107608:08:40 2021-03-17 07:08:40.24: udtwriter: switch to ipd=0 [set=0, , theoretical=0]
202107608:08:41 2021-03-17 07:08:41.21: Processing command 'disk2net?'
202107608:08:41 2021-03-17 07:08:41.21: Reply: !disk2net?  0 : active : 193.170.79.54 : 0 : 197036032 : 3793221312 ;
202107608:08:41 2021-03-17 07:08:41.21: Processing command 'echo=off'
202107608:09:02 2021-03-17 07:09:02.53: vbsreader_c: done 3.53271Gbyte
202107608:09:02 2021-03-17 07:09:02.72: udtwriter: re-enabling lingering on UDT socket for 180s
202107608:09:02 2021-03-17 07:09:02.72: udtwriter: stopping. wrote 3793221312 (3.53271Gbyte)
202107608:09:02 2021-03-17 07:09:02.72: disk2net(vbs) guard function: transfer done
202107608:09:02 2021-03-17 07:09:02.72: close_vbs_c: closing fd#2147483647
202107608:09:03 2021-03-17 07:09:03.27: Reply: !echo= 0 ;
202107608:09:05 2021-03-17 07:09:05.59: close_filedescriptor: closed fd#1043591041
202107608:09:05 2021-03-17 07:09:05.59: disk2net(vbs) finalization done.
202107608:09:08 2021-03-17 07:09:08.49: Processing command 'disk2net=disconnect'
202107608:09:08 2021-03-17 07:09:08.49: Reply: !disk2net=  6 : Not doing disk2net ;
202107608:09:23 2021-03-17 07:09:23.54: Processing command 'disk2net=disconnect'
202107608:09:23 2021-03-17 07:09:23.54: Reply: !disk2net=  6 : Not doing disk2net ;
202107608:09:28 2021-03-17 07:09:28.55: Processing command 'echo=on'
202107608:09:28 2021-03-17 07:09:28.55: Reply: !echo= 0 ;
202107608:09:28 2021-03-17 07:09:28.55: Processing command 'runtime=i5OADMNY:delete'
202107608:09:28 2021-03-17 07:09:28.55: Cleaning up runtime
202107608:09:28 2021-03-17 07:09:28.55: Reply: !runtime = 0 : default ;
202107608:09:28 2021-03-17 07:09:28.55: Processing command 'runtime=default'
202107608:09:28 2021-03-17 07:09:28.55: Reply: !runtime = 0 : default ;

Interestingly, I don't see any "scan_set" command in the sender jive5ab log for the problematic scan. There is one for "vo1074_ow_074-1815_7 " which works, but not for the next one "vo1074_ow_074-1816b_0". Curious?

Anyway, I want to try the EDT instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants