
[training] bootstrapping training #99

Merged: 1 commit merged into main on Aug 19, 2024

Conversation

@pilkicTT (Contributor) commented Aug 16, 2024

Initial changes for bootstrapping training.

  • expanding pybuda.compile to accept loss and optimizer modules
  • running autograd
  • lowering forward and backward graphs into tt-mlir as two separate functions - the resulting binary has two programs (fwd and bwd)
  • expanding CompiledModel with backward()

For initial training support, we are going with executing the loss and optimizer modules on the CPU.
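
For illustration, here is a minimal sketch of what a training step could look like with these changes. It assumes torch loss/optimizer modules; the keyword names passed to pybuda.compile and the exact CompiledModel call/backward signatures are assumptions, since this PR only states that compile now accepts loss and optimizer modules.

```python
import torch
import pybuda

# Small torch module used only for this sketch.
model = torch.nn.Linear(32, 32, bias=False)
loss_fn = torch.nn.L1Loss()                              # loss runs on CPU for now
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # optimizer runs on CPU for now

# Keyword names are illustrative; the PR only says compile accepts loss and optimizer modules.
compiled = pybuda.compile(model, loss=loss_fn, optimizer=optimizer)

inputs = torch.rand(1, 32)
targets = torch.zeros(1, 32)

outputs = compiled(inputs)             # 'fwd' program executes on device
loss = loss_fn(outputs[0], targets)    # loss computed on CPU
loss.backward()                        # CPU backward of the loss produces output gradients
compiled.backward()                    # 'bwd' program executes on device
optimizer.step()                       # parameter update on CPU
```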

Currently, the most problematic part is isolating and lowering the fwd and bwd parts of the graph, since autograd creates a single graph. For now, I've hacked it up using GraphTraversalContext. In changes to follow, we will separate these two into different graphs. That should make things much cleaner - e.g., each graph will have precisely defined inputs and outputs, unlike now, where we need to deduce what those are for a particular subgraph.
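
As a purely self-contained toy (not pybuda's actual graph_lib or GraphTraversalContext API), the snippet below illustrates the deduction problem: once fwd and bwd nodes share one graph, every edge crossing the fwd/bwd boundary has to be discovered and promoted to a forward output / backward input.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    epoch: str                      # "fwd" or "bwd"
    operands: list = field(default_factory=list)


def boundary_edges(nodes):
    """Return names of fwd nodes consumed by bwd nodes (fwd outputs / bwd inputs)."""
    crossing = []
    for node in nodes:
        if node.epoch != "bwd":
            continue
        for operand in node.operands:
            if operand.epoch == "fwd":
                crossing.append(operand.name)
    return crossing


# Tiny example: out = x * w on the forward side; grad_w = grad_out * x on the backward side.
x = Node("x", "fwd")
w = Node("w", "fwd")
out = Node("multiply", "fwd", [x, w])
grad_out = Node("grad_out", "bwd")
grad_w = Node("grad_w", "bwd", [grad_out, x])

print(boundary_edges([x, w, out, grad_out, grad_w]))   # ['x'] -> must become a forward output
```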

Handling of intermediate results and storing them as outputs of the forward pass to be used later in the backward pass is NOT implemented. This will follow after the autograd changes are done and we have two separate graphs.

For separating backward and forward executions, I've experimented with two approaches:

  1. creating two binaries
  2. creating one binary with two programs

The second option seems cleaner and will simplify handling of the binary, but in the future we might support both approaches.
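
As a toy illustration of why a single binary with two programs is easier to handle (the dictionary below stands in for the compiled artifact; the real tt-mlir runtime interface is not part of this PR and is not reproduced here):

```python
# Toy stand-in for one compiled artifact exposing two programs by name.
binary = {
    "fwd": lambda activations: [a * 2.0 for a in activations],
    "bwd": lambda gradients: [g * 2.0 for g in gradients],
}


def run_step(binary, activations, output_gradients):
    """One training step dispatches both phases against the same artifact."""
    outputs = binary["fwd"](activations)                # forward program
    input_gradients = binary["bwd"](output_gradients)   # backward program
    return outputs, input_gradients


print(run_step(binary, [1.0, 2.0], [0.5, 0.5]))
```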

A basic test for training a single parametrized multiply op is added.
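
A pure-torch golden reference for that kind of test might look roughly like the sketch below; the real test presumably swaps the plain torch forward for the compiled device forward, and the test name, shapes, and hyperparameters here are made up.

```python
import torch


def test_multiply_training_golden():
    torch.manual_seed(0)
    weight = torch.rand(1, 32, requires_grad=True)       # the single trained parameter
    optimizer = torch.optim.SGD([weight], lr=0.1)
    loss_fn = torch.nn.L1Loss()

    inputs = torch.rand(1, 32)
    targets = torch.zeros(1, 32)

    losses = []
    for _ in range(10):
        optimizer.zero_grad()
        outputs = inputs * weight          # the op under test: parametrized multiply
        loss = loss_fn(outputs, targets)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())

    # Driving the weight toward zero should reduce the L1 loss over the run.
    assert losses[-1] < losses[0]
```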

Issues #17 #18 #19 #20

@pilkicTT changed the title from "[training" to "[training] bootstrapping training" on Aug 16, 2024
@vladimirjovanovicTT (Contributor) commented Aug 16, 2024

"Handling of intermediate results and storing them as outputs of the forward pass to be used later in the backward pass is NOT implemented."

  • This is perfectly OK as this will be a part of the first MNIST training milestone.

Resolved review thread on pybuda/csrc/graph_lib/utils.hpp
autograd_engine = pyautograd.AutogradEngine(graph, autograd_config)

graph = autograd_engine.run()
dump_graph(graph, graph_name, "post_autograd")

context.losses = calculate_grads(outputs, dev, intermediate_tensors, False, context.losses)
# GOLDEN:
# context.losses = calculate_grads(outputs, dev, intermediate_tensors, False, context.losses)
Reviewer (Contributor):

Why is this one commented out? We don't track Golden for the initial implementation?

Author (@pilkicTT, Contributor):

Commented out because it uses the deleted TTDevice object. I plan to file issues for a golden verification refactor - currently, the calculations needed for golden verification are sprinkled throughout compile.py, so they need to be centralized and the dependency on TTDevice removed.


if self.compiled_graph_state.graph.training():
    # For executing loss and its backward graph on CPU, we need to tell torch to compute gradients
    for output in outputs:
        output.requires_grad_(True)  # loop body truncated in this excerpt; presumably marks each output for autograd
Reviewer (Contributor):

Do you think this needs to be part of compiler logic? E.g., can we set the PyTorch model into .train() mode and just have a check here that the PyTorch module is in a valid state?

My reasoning is that if the user executes something on the CPU, they should handle it properly. For ease of use, the compiler should have robust checks to inform the user if something is out of order.
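
To make that suggestion concrete, a hypothetical version of this split could keep only a validation step in the runtime; this is a sketch of the reviewer's alternative, not what the PR merges, and the function name is made up.

```python
import torch


def validate_outputs_for_cpu_backward(outputs: list[torch.Tensor]) -> None:
    """Raise if the graph outputs are not set up to feed a CPU loss and its backward pass."""
    for i, out in enumerate(outputs):
        if not out.requires_grad:
            raise RuntimeError(
                f"Output {i} does not track gradients; put the module in training "
                "mode / enable requires_grad before running the CPU loss and backward."
            )


# Hypothetical call site, instead of mutating the outputs:
#     if self.compiled_graph_state.graph.training():
#         validate_outputs_for_cpu_backward(outputs)
```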

Author (@pilkicTT, Contributor):

"Do you think this needs to be part of compiler logic?"

I am not sure - but what I know for certain is that the APIs are still messy, and we'll need to iterate on this. 😄

We can sync offline further, but the reason I put this logic here is that we are creating/returning the output tensors... so, if we are creating them and know that they will later be used in the backward pass, why not set the flag...

Reviewer (Contributor):

My 2 cents - APIs are messy by design, as described in #15 :)

Resolved review thread on pybuda/test/mlir/test_training.py
@vladimirjovanovicTT (Contributor) left a review:

LGTM

@pilkicTT force-pushed the pilkic/init-training branch from 312cefa to 737fcb7 on August 19, 2024 at 15:56
@pilkicTT merged commit fffeed3 into main on Aug 19, 2024 (2 checks passed)
@pilkicTT deleted the pilkic/init-training branch on August 19, 2024 at 16:06