nan problem when training #7

Open
huangjian2015 opened this issue Dec 3, 2018 · 1 comment
Comments

@huangjian2015

Hello, thanks for your contribution. I ran into a problem: during the first epoch the loss becomes nan and stays nan afterwards, like this:

Epoch 1: 28%|#################7 | 47/167 [10:03<25:40, 12.84s/it, acc=24.5, loss=260, step=47]I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 69102 get requests, put_count=76956 evicted_count=7000 eviction_rate=0.0909611 and unsatisfied allocation rate=0
Epoch 1: 29%|##################4 | 49/167 [10:24<25:04, 12.75s/it, acc=25.5, loss=237, step=49]I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 16280 get requests, put_count=18313 evicted_count=1000 eviction_rate=0.054606 and unsatisfied allocation rate=0
Epoch 1: 30%|##################8 | 50/167 [10:35<24:47, 12.71s/it, acc=24.3, loss=262, step=50]I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 43740 get requests, put_count=48876 evicted_count=4000 eviction_rate=0.0818398 and unsatisfied allocation rate=0
Epoch 1: 99%|##############################################################6| 166/167 [32:06<00:11, 11.61s/it, acc=32, loss=nan, step=166]wait!
Epoch 1: 100%|###############################################################| 167/167 [32:17<00:00, 11.60s/it, acc=32, loss=nan, step=167]
Epoch 2: 13%|########4 | 22/167 [03:51<25:23, 10.51s/it, acc=32, loss=nan, step=189]I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 28810067 get requests, put_count=28811000 evicted_count=2000 eviction_rate=6.94179e-05 and unsatisfied allocation rate=9.47585e-05
Epoch 2: 99%|##############################################################6| 166/167 [29:29<00:10, 10.66s/it, acc=32, loss=nan, step=333]wait!
Epoch 2: 100%|###############################################################| 167/167 [29:40<00:00, 10.66s/it, acc=32, loss=nan, step=334

Did you encounter this problem?

@cdyangbo
Owner

Reduce the learning rate.
Then fine-tune the batch size and the learning rate.
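
As a rough illustration of why a smaller learning rate can help, here is a toy sketch in plain Python. It is not this repository's training code: the loss `f(w) = w * w` and the helper names `first_non_finite_step` and `find_stable_learning_rate` are assumptions made up for this example. With too large a step size, plain gradient descent overshoots, the loss overflows, and every later step reports a non-finite value, much like the `loss=nan` in the log above.

```python
import math


def first_non_finite_step(learning_rate, steps=2000, w0=1.0):
    """Run gradient descent on the toy loss f(w) = w * w.

    Returns the index of the first step whose loss is inf or nan,
    or None if the loss stays finite for all `steps` iterations.
    """
    w = w0
    for step in range(steps):
        loss = w * w
        if not math.isfinite(loss):
            return step
        grad = 2.0 * w                 # derivative of w * w
        w = w - learning_rate * grad   # plain gradient descent update
    return None


def find_stable_learning_rate(initial_lr, max_halvings=20):
    """Halve the learning rate until the toy run stays finite.

    Hypothetical helper: a real project would more likely tune the
    rate by hand or with a scheduler.
    """
    lr = initial_lr
    for _ in range(max_halvings):
        if first_non_finite_step(lr) is None:
            return lr
        lr /= 2.0
    raise RuntimeError("no stable learning rate found")


if __name__ == "__main__":
    print("lr=1.5 diverges at step:", first_non_finite_step(1.5))
    print("lr=0.1 diverges at step:", first_non_finite_step(0.1))
    print("stable lr found:", find_stable_learning_rate(1.5))
```

In a real model the same idea applies: stop or flag the run as soon as the loss is no longer finite, lower the learning rate, and retry. The batch size interacts with the learning rate, so it is usually worth adjusting the two together.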
