ValueError: Incompatible shapes between op input and calculated input gradient. #4

Open
nikdnaik opened this issue Apr 2, 2018 · 17 comments · May be fixed by #29

Comments

@nikdnaik

nikdnaik commented Apr 2, 2018

While running the "cifar10_macro_search.sh" script, I get the following error. Is it related to the tensorflow version? I am using 1.6.0.

Build train graph
Tensor("child/layer_0/case/cond/Merge:0", shape=(?, 36, 32, 32), dtype=float32)
Tensor("child/layer_1/skip/bn/Identity:0", shape=(?, 36, 32, 32), dtype=float32)
Tensor("child/layer_2/skip/bn/Identity:0", shape=(?, 36, 32, 32), dtype=float32)
Tensor("child/layer_3/pool_at_3/from_4/bn/Identity:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child/layer_4/skip/bn/Identity:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child/layer_5/skip/bn/Identity:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child/layer_6/skip/bn/Identity:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child/layer_7/pool_at_7/from_8/bn/Identity:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child/layer_8/skip/bn/Identity:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child/layer_9/skip/bn/Identity:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child/layer_10/skip/bn/Identity:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child/layer_11/skip/bn/Identity:0", shape=(?, 36, 8, 8), dtype=float32)
Model has 697860 params
Traceback (most recent call last):
  File "src/cifar10/main.py", line 359, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "src/cifar10/main.py", line 355, in main
    train()
  File "src/cifar10/main.py", line 223, in train
    ops = get_ops(images, labels)
  File "src/cifar10/main.py", line 171, in get_ops
    child_model.connect_controller(controller_model)
  File "/home/nikhil/google_enas/src/cifar10/general_child.py", line 705, in connect_controller
    self._build_train()
  File "/home/nikhil/google_enas/src/cifar10/general_child.py", line 633, in _build_train
    num_replicas=self.num_replicas)
  File "/home/nikhil/google_enas/src/utils.py", line 125, in get_train_ops
    grads = tf.gradients(loss, tf_variables)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py", line 641, in gradients
    (op.name, i, t_in.shape, in_grad.shape))
ValueError: Incompatible shapes between op input and calculated input gradient. Forward operation: child/layer_11/case/cond/cond/cond/cond/cond/cond/Merge. Input index: 0. Original input shape: (). Calculated input gradient shape: (?, 36, 8, 8)

@hyhieu
Collaborator

hyhieu commented Apr 2, 2018

Thanks for noting this.

We just re-ran the script and didn't see the issue. We use TensorFlow 1.4.1 though.

We'll update to 1.7 soon and let you know if we have the problem.

@nikdnaik
Author

nikdnaik commented Apr 2, 2018

Thanks for the response. I suspect it is related to this open bug, which affects TensorFlow 1.5.0 onwards: tensorflow/tensorflow#15214

@ahundt

ahundt commented Apr 10, 2018

I'm on 1.7 and I've been able to reproduce this error.

@ahundt

ahundt commented Apr 16, 2018

@hyhieu @melodyguan sorry to bug you but would it be possible to give this a try? I'm unable to run the cifar10 model search due to this issue.

@hyhieu
Collaborator

hyhieu commented Apr 16, 2018

Hi @ahundt. We tried running the scripts on TF 1.4 and didn't have this problem. We have heard of this version discrepancy issue. We'll fix it, but it will take a while.

@ahundt

ahundt commented Apr 18, 2018

Yes, I had read the rest of this issue; I was just confirming that the problem remains on 1.7. I'm able to start micro search without issue (I haven't completed a full run yet), even though it calls the same function that reports the gradient input shape error reproduced below.

Here is the macro search error I get:
error.txt


Model has 697860 params
Traceback (most recent call last):
  File "enas/cifar10/main.py", line 359, in <module>
    tf.app.run()
  File "/home/ahundt/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "enas/cifar10/main.py", line 355, in main
    train()
  File "enas/cifar10/main.py", line 223, in train
    ops = get_ops(images, labels)
  File "enas/cifar10/main.py", line 171, in get_ops
    child_model.connect_controller(controller_model)
  File "/home/ahundt/src/enas/enas/cifar10/general_child.py", line 705, in connect_controller
    self._build_train()
  File "/home/ahundt/src/enas/enas/cifar10/general_child.py", line 633, in _build_train
    num_replicas=self.num_replicas)
  File "/home/ahundt/src/enas/enas/utils.py", line 127, in get_train_ops
    grads = tf.gradients(loss, tf_variables)
  File "/home/ahundt/.local/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 488, in gradients
    gate_gradients, aggregation_method, stop_gradients)
  File "/home/ahundt/.local/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 655, in _GradientsHelper
    (op.name, i, t_in.shape, in_grad.shape))
ValueError: Incompatible shapes between op input and calculated input gradient.  Forward operation: child/layer_11/case/cond/cond/cond/cond/cond/cond/Merge.  Input index: 0. Original input shape: ().  Calculated input gradient shape: (?, 36, 8, 8)

@AbelardLiu

I guess this problem is caused by the tensor "child/layer_11/case/cond/cond/cond/cond/cond/cond/Constant", which has shape () and comes from the following code:
out = tf.case(branches, default=lambda: tf.constant(0, tf.float32), exclusive=True)
After changing this line to the following, I can run ENAS:
out = tf.case(branches, default=lambda: tf.constant(0, tf.float32, shape=[self.batch_size, out_filters, inp_h, inp_w]), exclusive=True)
But this change only supports the "NCHW" data format.
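To make the reasoning above concrete, here is a minimal pure-Python sketch of why the scalar default clashes with the gradient shape and how the default's shape could be chosen per data format. It is not TensorFlow's implementation and not the fix in #29: `shapes_compatible` and `default_branch_shape` are hypothetical helpers, `None` stands for the unknown dimension printed as `?` in the logs, and the batch size of 128 is an assumed value.

```python
def shapes_compatible(a, b):
    """Simplified static-shape check: None means 'unknown dimension'."""
    if len(a) != len(b):
        # Different ranks can never describe the same tensor.
        return False
    return all(x is None or y is None or x == y for x, y in zip(a, b))


def default_branch_shape(data_format, batch_size, out_filters, inp_h, inp_w):
    """Hypothetical helper: shape for the tf.case default, per data format."""
    if data_format == "NCHW":
        return [batch_size, out_filters, inp_h, inp_w]
    if data_format == "NHWC":
        return [batch_size, inp_h, inp_w, out_filters]
    raise ValueError("Unknown data_format: {!r}".format(data_format))


# Gradient shape reported in the error message: (?, 36, 8, 8).
grad_shape = (None, 36, 8, 8)

# Scalar default, as in the original line: rank 0 versus rank 4.
scalar_ok = shapes_compatible((), grad_shape)

# Shaped default, as in the workaround (batch size 128 is an assumption).
shaped = default_branch_shape("NCHW", 128, 36, 8, 8)
shaped_ok = shapes_compatible(tuple(shaped), grad_shape)

print(scalar_ok, shaped_ok)
```

Under these assumptions, the list returned by `default_branch_shape` would be passed as the `shape=` argument of `tf.constant` in the default branch, with the axis order swapped for "NHWC".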

@ahundt

ahundt commented Apr 22, 2018

@AbelardLiu Awesome! That gave me the info I needed to create a proper fix which should work in all cases, see #29

@AbelardLiu

@ahundt Great! I'll merge this commit and test it.
By the way, do you run ENAS with Python 2 or Python 3?
Thanks!

@ahundt

ahundt commented Apr 25, 2018

I run on python2

@harewei

harewei commented May 16, 2018

By the way, I'm using Python3.6.5 with TensorFlow 1.7.0, and the fix by @ahundt does indeed fix this issue.

@shiyongde

The same bug occurs with TensorFlow 1.8 and CUDA 9.1.

Build train graph
Tensor("child/layer_0/case/cond/Merge:0", shape=(?, 36, 32, 32), dtype=float32)
Tensor("child/layer_1/skip/bn/Identity:0", shape=(?, 36, 32, 32), dtype=float32)
Tensor("child/layer_2/skip/bn/Identity:0", shape=(?, 36, 32, 32), dtype=float32)
Tensor("child/layer_3/pool_at_3/from_4/bn/Identity:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child/layer_4/skip/bn/Identity:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child/layer_5/skip/bn/Identity:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child/layer_6/skip/bn/Identity:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child/layer_7/pool_at_7/from_8/bn/Identity:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child/layer_8/skip/bn/Identity:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child/layer_9/skip/bn/Identity:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child/layer_10/skip/bn/Identity:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child/layer_11/skip/bn/Identity:0", shape=(?, 36, 8, 8), dtype=float32)
Model has 697860 params
Traceback (most recent call last):
  File "src/cifar10/main.py", line 359, in <module>
    tf.app.run()
  File "/data5/xiawei/program/python2/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "src/cifar10/main.py", line 355, in main
    train()
  File "src/cifar10/main.py", line 223, in train
    ops = get_ops(images, labels)
  File "src/cifar10/main.py", line 171, in get_ops
    child_model.connect_controller(controller_model)
  File "/data3/xiawei/work/enas/enas/src/cifar10/general_child.py", line 705, in connect_controller
    self._build_train()
  File "/data3/xiawei/work/enas/enas/src/cifar10/general_child.py", line 633, in _build_train
    num_replicas=self.num_replicas)
  File "/data3/xiawei/work/enas/enas/src/utils.py", line 125, in get_train_ops
    grads = tf.gradients(loss, tf_variables)
  File "/data5/xiawei/program/python2/local/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 494, in gradients
    gate_gradients, aggregation_method, stop_gradients)
  File "/data5/xiawei/program/python2/local/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 669, in _GradientsHelper
    (op.name, i, t_in.shape, in_grad.shape))
ValueError: Incompatible shapes between op input and calculated input gradient. Forward operation: child/layer_11/case/cond/cond/cond/cond/cond/cond/Merge. Input index: 0. Original input shape: (). Calculated input gradient shape: (?, 36, 8, 8)

@upwindflys

@AbelardLiu Thanks a lot, the fix works for me. But why does out.set_shape not work? I'm confused.

@pingguokiller

@AbelardLiu Thanks a lot, the fix works for me. But why does out.set_shape not work? I'm confused.

The code below works on TF 1.13:
out = tf.reshape(out, (-1, inp_h, inp_w, out_filters))

"tensor.set_shape" only sets the static shape. In most situations, we should use "tf.reshape" to set the dynamic shape.
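The distinction drawn here between static and dynamic shape can be illustrated with a toy model in plain Python. This is only a sketch of the general idea, not TensorFlow code and not an explanation of this specific failure: `ToyTensor` and `reshape` are hypothetical stand-ins in which `set_shape` refines shape metadata in place, while `reshape` produces a new tensor whose shape is computed from the data, with `-1` inferred.

```python
class ToyTensor(object):
    """Toy stand-in for a graph tensor: static shape metadata plus runtime data."""

    def __init__(self, static_shape, data):
        self.static_shape = tuple(static_shape)  # None = unknown dimension
        self.data = list(data)                   # flat runtime values

    def set_shape(self, shape):
        """Refine the static metadata only; the runtime data is untouched."""
        shape = tuple(shape)
        if len(shape) != len(self.static_shape):
            raise ValueError("incompatible ranks: %r vs %r"
                             % (self.static_shape, shape))
        merged = []
        for old, new in zip(self.static_shape, shape):
            if old is not None and new is not None and old != new:
                raise ValueError("incompatible dims: %r vs %r" % (old, new))
            merged.append(new if old is None else old)
        self.static_shape = tuple(merged)


def reshape(tensor, shape):
    """Return a NEW tensor with the requested shape; -1 is inferred from the data."""
    known = 1
    for dim in shape:
        if dim != -1:
            known *= dim
    resolved = tuple(len(tensor.data) // known if dim == -1 else dim
                     for dim in shape)
    return ToyTensor(resolved, tensor.data)


t = ToyTensor((None, 4), range(8))
t.set_shape((2, 4))          # only fills in the unknown dimension
r = reshape(t, (-1, 2))      # new tensor; first dimension inferred as 4

print(t.static_shape, r.static_shape)
```

In this toy, `set_shape` can only add information that agrees with what is already known, whereas `reshape` yields a separate result with its own shape.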

@maryam089

maryam089 commented Sep 15, 2020

@ahundt I'm using Python 3.6 with TensorFlow 1.5.1, and my issue is:
[Report Error]ValueError: Incompatible shapes between op input and calculated input gradient. conv2d_transpose
Any idea how to solve it?

@maryam089

The problem has been solved. The issue was with the strides.

@laksheenmendis

I also got the same issue with Python 3.7 and TensorFlow 1.15.2, and the fix by @ahundt (#29) resolved it. I believe it should be merged into the master branch for future users (I spent a lot of time on this issue).
