Test function doesn't work #4

Open
Luuk99 opened this issue Dec 1, 2021 · 0 comments

I'm having trouble with the test function. I got the train function working after some minor modifications for different dependency versions, but the test function doesn't run. In the train function, NCCL is activated at some point, and the epochs and steps start running after that. With the test function, NCCL is never activated: the run prints a few unrelated warning logs and then hangs with no further output. I waited a considerable time in case it just needed longer, but nothing changed.

Specifically, I see this when running the train function (and some more logs, after which the epochs and steps start):

[1,1]<stderr>:To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
[1,2]<stderr>:WARNING:tensorflow:AutoGraph could not transform <function StreamReader.__init__.<locals>.<lambda> at 0x7faa4d884f70> and will run it as-is.
[1,2]<stderr>:Cause: could not parse the source code of <function StreamReader.__init__.<locals>.<lambda> at 0x7faa4d884f70>: no matching AST found
[1,2]<stderr>:To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
[1,2]<stderr>:AutoGraph could not transform <function StreamReader.__init__.<locals>.<lambda> at 0x7faa4d884f70> and will run it as-is.
[1,2]<stderr>:Cause: could not parse the source code of <function StreamReader.__init__.<locals>.<lambda> at 0x7faa4d884f70>: no matching AST found
[1,2]<stderr>:To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
[1,3]<stderr>:2021-12-01 15:41:35.619073: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
[1,0]<stderr>:2021-12-01 15:41:35.622922: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
[1,2]<stderr>:2021-12-01 15:41:35.623069: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
[1,1]<stderr>:2021-12-01 15:41:35.623122: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
[1,0]<stdout>:1120-154839-mblqlrmu-10-199-200-5:1359:1554 [0] NCCL INFO Bootstrap : Using eth0:10.199.200.5<0>
[1,0]<stdout>:1120-154839-mblqlrmu-10-199-200-5:1359:1554 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[1,0]<stdout>:1120-154839-mblqlrmu-10-199-200-5:1359:1554 [0] NCCL INFO NET/IB : No device found.

When running the test function, this is what I see (after which nothing happens):

[1,0]<stdout>:worker_rank:0, world_size:1, shuffle:False, seed:0, directory:/tmp/mind/dev, files:['/tmp/mind/dev/behaviors_0.tsv', '/tmp/mind/dev/behaviors_1.tsv', '/tmp/mind/dev/behaviors_2.tsv', '/tmp/mind/dev/behaviors_3.tsv']
[1,0]<stderr>:2021-12-01 15:47:18.943032: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
[1,0]<stderr>:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,0]<stderr>:WARNING:tensorflow:AutoGraph could not transform <function StreamReaderTest.__init__.<locals>.<lambda> at 0x7f1ec9f7c4c0> and will run it as-is.
[1,0]<stderr>:Cause: could not parse the source code of <function StreamReaderTest.__init__.<locals>.<lambda> at 0x7f1ec9f7c4c0>: no matching AST found
[1,0]<stderr>:To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
[1,0]<stderr>:AutoGraph could not transform <function StreamReaderTest.__init__.<locals>.<lambda> at 0x7f1ec9f7c4c0> and will run it as-is.
[1,0]<stderr>:Cause: could not parse the source code of <function StreamReaderTest.__init__.<locals>.<lambda> at 0x7f1ec9f7c4c0>: no matching AST found
[1,0]<stderr>:To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
[1,0]<stderr>:2021-12-01 15:47:19.011939: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)

I tried adding some print statements, and everything seems to work up to the point where I iterate over the dataloader for the steps. From that point on, no further print statements fire.
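Print statements only show that the loop stalls, not which call it stalls in. A possible next step is to have Python dump the stack of every thread when a single step takes too long, using the standard-library `faulthandler` module. The following is a minimal sketch under assumptions: `iterate_with_watchdog`, `timeout_s`, `max_steps`, and `file` are hypothetical names, and `dataloader` stands for whatever iterable the test function loops over (its real API is not shown in this issue).

```python
import faulthandler
import sys


def iterate_with_watchdog(dataloader, timeout_s=60.0, max_steps=None, file=None):
    """Yield items from `dataloader`; dump all thread stacks if a step stalls.

    Hypothetical helper: `dataloader` is assumed to be any Python iterable.
    """
    out = file if file is not None else sys.stderr
    iterator = iter(dataloader)
    steps = 0
    while max_steps is None or steps < max_steps:
        # Arm the watchdog: if next() has not returned after timeout_s seconds,
        # the tracebacks of all threads are written to `out`. With exit=False
        # the process keeps running after the dump.
        faulthandler.dump_traceback_later(timeout_s, exit=False, file=out)
        try:
            batch = next(iterator)
        except StopIteration:
            break
        finally:
            # Disarm the watchdog once the step returned or the iterator ended.
            faulthandler.cancel_dump_traceback_later()
        steps += 1
        yield batch
```

Wrapping the loop as `for batch in iterate_with_watchdog(dataloader, timeout_s=120): ...` would leave normal runs unchanged. On a hang, the dump shows which frame each thread is blocked in, for example a queue read or a file read.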

When calling the test function without Horovod, I get stuck at this point (I added the 'Dataloader __next__ called' line myself):

DataLoader __iter__()
Dataloader __next__ called
worker_rank:0, world_size:1, shuffle:False, seed:0, directory:/tmp/mind/dev, files:['/tmp/mind/dev/behaviors_0.tsv', '/tmp/mind/dev/behaviors_1.tsv', '/tmp/mind/dev/behaviors_2.tsv', '/tmp/mind/dev/behaviors_3.tsv']
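
Because the run blocks inside the first `__next__` call even without Horovod, it may help to turn the silent hang into an explicit error while debugging. The sketch below runs `next()` in a background thread and raises `TimeoutError` if no item arrives in time. It is an illustration only: `next_with_timeout` and `timeout_s` are hypothetical names, and the iterator is assumed to be an ordinary Python iterator. The stuck thread is not killed, it is merely abandoned as a daemon thread, so this suits diagnosis rather than production use.

```python
import queue
import threading


def next_with_timeout(iterator, timeout_s):
    """Return next(iterator), or raise TimeoutError if it takes too long."""
    result = queue.Queue(maxsize=1)

    def worker():
        # Run next() off the main thread and report how it finished.
        try:
            result.put(("ok", next(iterator)))
        except StopIteration:
            result.put(("stop", None))
        except BaseException as exc:  # forward any other failure to the caller
            result.put(("error", exc))

    threading.Thread(target=worker, daemon=True).start()
    try:
        kind, value = result.get(timeout=timeout_s)
    except queue.Empty:
        raise TimeoutError(f"next() did not return within {timeout_s} s") from None
    if kind == "stop":
        raise StopIteration
    if kind == "error":
        raise value
    return value
```

Calling `next_with_timeout(iter(dataloader), 120)` once before the real loop would show whether the very first batch is ever produced.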