I'm having trouble with the test function. I got the train function working after some minor modifications to account for different dependency versions, but the test function does not run. In the train function, NCCL is initialized at some point, and the epochs and steps start running after that. In the test function, NCCL is never initialized: the process prints a few unrelated warnings and then produces no further output. I've waited a considerable time in case it just needed longer, but it never proceeds.
Specifically, I see this when running the train function (and some more logs, after which the epochs and steps start):
[1,1]<stderr>:To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
[1,2]<stderr>:WARNING:tensorflow:AutoGraph could not transform <function StreamReader.__init__.<locals>.<lambda> at 0x7faa4d884f70> and will run it as-is.
[1,2]<stderr>:Cause: could not parse the source code of <function StreamReader.__init__.<locals>.<lambda> at 0x7faa4d884f70>: no matching AST found
[1,2]<stderr>:To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
[1,2]<stderr>:AutoGraph could not transform <function StreamReader.__init__.<locals>.<lambda> at 0x7faa4d884f70> and will run it as-is.
[1,2]<stderr>:Cause: could not parse the source code of <function StreamReader.__init__.<locals>.<lambda> at 0x7faa4d884f70>: no matching AST found
[1,2]<stderr>:To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
[1,3]<stderr>:2021-12-01 15:41:35.619073: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
[1,0]<stderr>:2021-12-01 15:41:35.622922: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
[1,2]<stderr>:2021-12-01 15:41:35.623069: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
[1,1]<stderr>:2021-12-01 15:41:35.623122: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
[1,0]<stdout>:1120-154839-mblqlrmu-10-199-200-5:1359:1554 [0] NCCL INFO Bootstrap : Using eth0:10.199.200.5<0>
[1,0]<stdout>:1120-154839-mblqlrmu-10-199-200-5:1359:1554 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[1,0]<stdout>:1120-154839-mblqlrmu-10-199-200-5:1359:1554 [0] NCCL INFO NET/IB : No device found.
When running the test function, this is what I see (after which nothing happens):
[1,0]<stdout>:worker_rank:0, world_size:1, shuffle:False, seed:0, directory:/tmp/mind/dev, files:['/tmp/mind/dev/behaviors_0.tsv', '/tmp/mind/dev/behaviors_1.tsv', '/tmp/mind/dev/behaviors_2.tsv', '/tmp/mind/dev/behaviors_3.tsv']
[1,0]<stderr>:2021-12-01 15:47:18.943032: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
[1,0]<stderr>:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,0]<stderr>:WARNING:tensorflow:AutoGraph could not transform <function StreamReaderTest.__init__.<locals>.<lambda> at 0x7f1ec9f7c4c0> and will run it as-is.
[1,0]<stderr>:Cause: could not parse the source code of <function StreamReaderTest.__init__.<locals>.<lambda> at 0x7f1ec9f7c4c0>: no matching AST found
[1,0]<stderr>:To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
[1,0]<stderr>:AutoGraph could not transform <function StreamReaderTest.__init__.<locals>.<lambda> at 0x7f1ec9f7c4c0> and will run it as-is.
[1,0]<stderr>:Cause: could not parse the source code of <function StreamReaderTest.__init__.<locals>.<lambda> at 0x7f1ec9f7c4c0>: no matching AST found
[1,0]<stderr>:To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
[1,0]<stderr>:2021-12-01 15:47:19.011939: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
I added some print statements, and everything seems to work up to the point where I iterate over the dataloader for the steps; from that point on, no further print statements fire.
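One way to narrow this down further might be to wrap the dataloader iteration in a watchdog that dumps the stack of every thread when a single `next()` call stalls. The following is only a minimal sketch using Python's standard `faulthandler` module; `iterate_with_watchdog` and its parameters are hypothetical names of my own and are not part of this repository.

```python
import faulthandler
import sys
import time


def iterate_with_watchdog(iterable, timeout_s=60.0, log=None):
    """Yield items from `iterable`; if one next() call takes longer than
    `timeout_s` seconds, dump the traceback of every thread to `log`.

    Hypothetical debugging helper, not part of the project.
    `log` must be a real file object (it needs a fileno()).
    """
    if log is None:
        log = sys.stderr
    iterator = iter(iterable)
    step = 0
    while True:
        # Arm the watchdog before blocking in next().
        faulthandler.dump_traceback_later(timeout_s, exit=False, file=log)
        start = time.monotonic()
        try:
            item = next(iterator)
        except StopIteration:
            return
        finally:
            # Disarm whether next() returned, finished, or raised.
            faulthandler.cancel_dump_traceback_later()
        elapsed = time.monotonic() - start
        # flush=True so the line is not held back by output buffering.
        print(f"step {step}: next() returned after {elapsed:.3f}s",
              file=log, flush=True)
        step += 1
        yield item
```

Assuming the test loop iterates over an object named `dataloader`, it could be used as `for batch in iterate_with_watchdog(dataloader, timeout_s=120): ...`. If the hang is inside `__next__`, the dumped tracebacks should show which call each thread is blocked in.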
When calling the test function without Horovod, I get stuck at this point (the 'Dataloader__next__ called' message is one I added myself):