Uplift 2024-06-27 #34

Merged
merged 29 commits into from
Jun 27, 2024
Commits (29)
7a6fa97
Fix core dumped and constraint issue
dsudhakarTT Jun 12, 2024
e4e3b40
Try padding with queue as fallback to padding with nop
dgolubovicTT Jun 12, 2024
0cb7c5d
Shape definitions and calculation of operand inputs in RGG
vbrkicTT Jun 13, 2024
3e8d19f
[Ribbon2] Switch to util function for linked output nodes.
nobradovictt Jun 13, 2024
c2e1c33
Move sparse matmul op tests from sanity
kmilanovicTT Jun 13, 2024
ca2fc89
Add CCM test for PIDNet in Wormhole_B0(pytorch)
meenakshiramanathan1 Jun 10, 2024
5d34756
Enable NOC and DRAM estimates by default
rpavlovicTT Jun 13, 2024
761678a
Revert of "Revert "Adding a support for generating TTI image for Blac…
sdjordjevicTT Jun 13, 2024
6238cf8
Add skipped codegen model variants in wh_b0 and gs
chandrasekaranpradeep Jun 12, 2024
f1fa8c9
Add test for IndexCopy operator
vobojevicTT Jun 14, 2024
3870713
Validate netlist for matmul
kmilanovicTT Jun 4, 2024
f1b5563
[padding] add queue instead of padding if adding only queue fixes all…
dgolubovicTT Jun 13, 2024
93aeccd
Switch off NOC bandwidth estimates for T5 and flan-T5 benchmark models.
vcanicTT Jun 16, 2024
9d4a2b9
Add monodle model demos for PyTorch in Wormhole & Grayskull
ashokkumarkannan1 Jun 17, 2024
d9c8f7d
Disable data movement, DRAM and NOC, estimates by default.
vcanicTT Jun 18, 2024
4c82da5
[Balancer/GS] Fix partial datacopy ops related OpModel choice mismatc…
nobradovictt Jun 15, 2024
3a6e3b6
Add missing tests for Concatenate operator
vobojevicTT Jun 18, 2024
b867b41
Add more operators to PyBuda repository
vbrkicTT Jun 18, 2024
ebf480b
Fix pybuda n300 failures
ashokkumarkannan1 Jun 19, 2024
245243d
[Balancer] Migrate policy MinimizeGrid to PolicyManager.
nobradovictt Jun 19, 2024
f2aed91
Adding few more exception rules in python script
sdjordjevicTT Jun 20, 2024
8d5363e
[fork-join] fix merge queue and nop instructions
pilkicTT Jun 10, 2024
a6d6093
[test-cleanup] removing legacy ribbon flag
pilkicTT Jun 19, 2024
b0afaa8
Merge community changes, fix spelling and adding main guard
vmilosevic Jun 20, 2024
05ff6c7
Connect multiple open nodes
vbrkicTT Jun 24, 2024
42bafb0
Update submodules
vmilosevic Jun 27, 2024
ab395c8
Add path to erisc hex files.
vcanicTT Jun 21, 2024
b9c0cc0
Update ERISC path in setup.py.
vcanicTT Jun 24, 2024
8772356
Update gitlab action to build for wormhole_b0
vmilosevic Jun 27, 2024
15 changes: 13 additions & 2 deletions .github/workflows/build-artifacts.yml
@@ -3,15 +3,26 @@ name: Build artifacts
on:
workflow_dispatch:
workflow_call:
push:
branches:
- main
pull_request:
branches:
- main

env:
PYTHON_VERSION: "python3.10"

jobs:
build-artifacts:

strategy:
matrix:
arch: ["grayskull"]
include:
- arch: grayskull
env_script: env_for_silicon.sh
- arch: wormhole_b0
env_script: env_for_wormhole_b0.sh
runs-on: ubuntu-22.04
steps:
- uses: actions/checkout@v4
@@ -21,4 +32,4 @@ jobs:
- name: Update submodule
run: git submodule update --init --recursive
- name: Build for ${{ matrix.arch }}
run: source env_for_silicon.sh
run: source ${{ matrix.env_script }}
13 changes: 0 additions & 13 deletions .github/workflows/post-commit-workflow.yml

This file was deleted.

13 changes: 0 additions & 13 deletions .github/workflows/pull-request-workflow.yml

This file was deleted.

2 changes: 2 additions & 0 deletions Makefile
@@ -49,6 +49,8 @@ DOCSDIR = $(OUT)/docs
SUBMODULESDIR = $(OUT)/submodules
TORCHVISIONDIR = build_deps/vision

export TT_BACKEND_ERISC_PRECOMPILED_BINARIES_PATH=./erisc_hex/

# Top level flags, compiler, defines etc.

#WARNINGS ?= -Wall -Wextra
1 change: 1 addition & 0 deletions README.debug.md
@@ -109,6 +109,7 @@
* PYBUDA\_RIBBON2\_CALCULATE\_TARGET\_CYCLES: Calculate target cycles for every epoch within Ribbon2 balancing policy. (default: 0/False)
* PYBUDA\_RIBBON2\_CALCULATE\_TARGET\_CYCLES\_APPLY\_FILTERING: Apply filtering on GS search space while calculating dynamic cycles per epoch within Ribbon2 balancing policy. (default: 0/False)
* PYBUDA\_RIBBON\_LEGACY: Use legacy Ribbon balancing policy. (default: 0/False)
* PYBUDA\_MAXIMIZE\_GRID: Reverse logic of MinimizeGrid policy. Maximize grid size for all ops. (default: 0/False)
* PYBUDA\_ENABLE\_HOST\_INPUT\_NOP\_BUFFERING: Enable nop buffering of input host read. (default: 0/False)
* PYBUDA\_AUTO\_RECOMPILE: Triggers handling of backend compile error and recompiles the model. (default: 1/True)
* PYBUDA\_AUTO\_RECOMPILE\_TARGET\_CYCLES: Enables adjustment of target cycles during recompile if no errors from backend have been previously handled. Requires PYBUDA\_AUTO\_RECOMPILE to be enabled to work. (default: 0/False)
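
A hypothetical usage sketch (not part of this diff): flags such as the newly documented `PYBUDA_MAXIMIZE_GRID` are plain environment variables, so they can be enabled from the shell or from Python before the workload is compiled, for example:

    # Illustrative only: enable an opt-in PyBuda flag via the environment.
    # Flags in the list above default to 0/False; setting them to "1" turns them on.
    import os

    os.environ["PYBUDA_MAXIMIZE_GRID"] = "1"  # must be set before the compiler reads it

    import pybuda  # assumption: import and compile after the variable is set
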
1 change: 0 additions & 1 deletion README.md
@@ -49,4 +49,3 @@ Set `LD_LIBRARY_PATH` to the location of `third_party/budabackend/build/lib` - p
## Silicon

See README.silicon.md for details on how to run on silicon.

20 changes: 10 additions & 10 deletions docs/public/developer.rst
@@ -125,7 +125,7 @@ User Visible Constants
++++++++++++++++++++++

Constant registers are implemented as objects which can be referenced
whereever a vector can be used.
wherever a vector can be used.

* Grayskull:

@@ -230,8 +230,8 @@ Library

Below ``Vec`` means any vector type.

Grayskulll and Wormhole
^^^^^^^^^^^^^^^^^^^^^^^
Grayskull and Wormhole
^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: c++

@@ -396,8 +396,8 @@ For example:
l_reg[LRegs::LReg1] = x; // this is necessary at the end of the function
// to preserve the value in LReg1 (if desired)

Miscelaneous
************
Miscellaneous
*************

Register Pressure Management
++++++++++++++++++++++++++++
@@ -413,7 +413,7 @@ loads dst_reg[0] and dst_reg[1] into temporary LREGs (as expected).

The compiler will not spill registers. Exceeding the number of registers
available will result in the cryptic: ``error: cannot store SFPU register
(reigster spill?) - exiting!`` without a line number.
(register spill?) - exiting!`` without a line number.

The compiler does a reasonable job with lifetime analysis when assigning
variables to registers. Reloading or recalculating results helps the compiler
@@ -448,7 +448,7 @@ The ``SFPREPLAY`` instruction available on Wormhole allows the RISCV processor
to submit up to 32 SFP instructions at once. The compiler looks for sequences
of instructions that repeat, stores these and then "replays" them later.

The current implemention of this is very much first cut: it does not handle
The current implementation of this is very much first cut: it does not handle
kernels with rolled up loops very well. Best performance is typically attained by
unrolling the top level loop and then letting the compiler find the repetitions
and replace them with ``SFPREPLAY``. This works well when the main loop
@@ -494,15 +494,15 @@ Register Spilling
+++++++++++++++++

The compiler does not implement register spilling. Since Grayskull only has 4
LRegs, running out of registers is a common occurence. If you see the
following: ``error: cannot store SFPU register (reigster spill?) - exiting!``
LRegs, running out of registers is a common occurrence. If you see the
following: ``error: cannot store SFPU register (register spill?) - exiting!``
you have most likely run out of registers.

Error Messages
++++++++++++++

Unfortunately, many errors are attributed to the code in the wrapper rather than in the code
being written. For example, using an unitialized variable would show an error at a macro
being written. For example, using an uninitialized variable would show an error at a macro
called by a wrapper function before showing the line number in the user's code.

Function Calls
2 changes: 1 addition & 1 deletion docs/public/installation.rst
@@ -38,7 +38,7 @@ Python Environment Installation

It is strongly recommended to use virtual environments for each project utilizing PyBUDA and Python dependencies. Creating a new virtual environment with PyBUDA and libraries is very easy.

Prerequisites (detailed sections below) for python envirnment installation are listed here:
Prerequisites (detailed sections below) for python environment installation are listed here:

* `Setup HugePages (below) <#setup-hugepages>`_
* `PCI Driver Installation (below) <#pci-driver-installation>`_
4 changes: 2 additions & 2 deletions docs/public/terminology.rst
@@ -27,7 +27,7 @@ The dense tensor math unit in Tensix. It performs bulk tensor math operations, s

SFPU
----
Tensix SIMD engine, used for various miscellaneous activations operations, such as exponents, square roots, softmax, topK, and others.
Tensix SIMD engine, used for various miscellaneous activation operations, such as exponents, square roots, softmax, topK, and others.

Unpacker
--------
@@ -49,7 +49,7 @@ A collection of ops that fits onto one chip. In a typical workflow, epoch code w

Buffer
------
A reserved location in local memory, DRAM, or host memory. Buffers are used either as desinations for operation outputs, sources for operation inputs, or temporary locations for intermediate data.
A reserved location in local memory, DRAM, or host memory. Buffers are used either as destinations for operation outputs, sources for operation inputs, or temporary locations for intermediate data.

Pipe
----
88 changes: 46 additions & 42 deletions docs/public/user_guide.rst
@@ -17,28 +17,30 @@ Compiling and running a PyBuda workload is as easy as:
import pybuda
import torch
from transformers import BertModel, BertConfig

# Download the model from huggingface
model = BertModel.from_pretrained("bert-base-uncased")

# Wrap the pytorch model in a PyBuda module wrapper
module = pybuda.PyTorchModule("bert_encoder", model.encoder)

# Create a tenstorrent device
tt0 = pybuda.TTDevice(
"tt0",
module=module,
arch=pybuda.BackendDevice.Wormhole_B0,
devtype=pybuda.BackendType.Silicon,
)

# Create an input tensor
seq_len = 128
input = torch.randn(1, seq_len, model.config.hidden_size)

# Compile and run inference
output_queue = pybuda.run_inference(inputs=[input])
print(output_queue.get())

# Guard in the main module to avoid creating subprocesses recursively.
if __name__ == "__main__":
# Download the model from huggingface
model = BertModel.from_pretrained("bert-base-uncased")

# Wrap the pytorch model in a PyBuda module wrapper
module = pybuda.PyTorchModule("bert_encoder", model.encoder)

# Create a tenstorrent device
tt0 = pybuda.TTDevice(
"tt0",
module=module,
arch=pybuda.BackendDevice.Wormhole_B0,
devtype=pybuda.BackendType.Silicon,
)

# Create an input tensor
seq_len = 128
input = torch.randn(1, seq_len, model.config.hidden_size)

# Compile and run inference
output_queue = pybuda.run_inference(inputs=[input])
print(output_queue.get())


Framework Support
@@ -90,7 +92,7 @@ PyBuda API and workflow is flexible enough that some of these steps can be merge
Devices
*******

PyBuda makes it easy to distribute a workload onto a heterogenous set of devices available to you. This can be one or more
PyBuda makes it easy to distribute a workload onto a heterogeneous set of devices available to you. This can be one or more
Tenstorrent devices, CPUs, or GPUs. Each device that will be used to run your workflow needs to be declared by creating the appropriate
device type and giving it a unique name:
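
(The docs' example block is collapsed by this hunk; the following is a minimal illustrative sketch of device declaration, with made-up device names.)

.. code-block:: python

    import pybuda

    # Each device used by the workflow is declared up front with a unique name.
    tt0 = pybuda.TTDevice("tt0")      # a Tenstorrent device
    cpu0 = pybuda.CPUDevice("cpu0")   # a host CPU device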

@@ -121,7 +123,7 @@ To run a module on a device, it needs to be "placed" on it
tt0.place_module(mod)

This tells PyBuda that module ``mod`` needs to be compiled and executed on device ``tt0``. In this case, ``mod`` is a native PyBuda module. To
simiarly place a PyTorch module onto a Tenstorrent device, the module must be wrapped in a :py:class:`PyTorchModule<pybuda.PyTorchModule>` wrapper:
similarly place a PyTorch module onto a Tenstorrent device, the module must be wrapped in a :py:class:`PyTorchModule<pybuda.PyTorchModule>` wrapper:

.. code-block:: python

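(The wrapped-module example is collapsed here; below is a minimal sketch under the same API, with an illustrative module and shapes.)

.. code-block:: python

    import torch
    import pybuda

    # Wrap an ordinary PyTorch module so it can be placed on a Tenstorrent device.
    torch_module = torch.nn.Linear(32, 32)
    mod = pybuda.PyTorchModule("direct_pt", torch_module)
    tt0.place_module(mod)
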
@@ -147,7 +149,7 @@ PyBuda provides all-in-one APIs for compiling and running workloads, :py:func:`r
For inference, and simple training setups, this is the simplest way to get up and running.

Alternatively, the models can be compiled in a separate step, using the :py:func:`initialize_pipeline<pybuda.initialize_pipeline>` call,
which optioanlly takes sample inputs, if none have been pushed into the first device. Once the compilation has completed, the user
which optionally takes sample inputs, if none have been pushed into the first device. Once the compilation has completed, the user
can run :py:func:`run_forward<pybuda.run_forward>` pass through the pipeline for inference, or a loop of
:py:func:`run_forward<pybuda.run_forward>`, :py:func:`run_backward<pybuda.run_backward>`, and :py:func:`run_optimizer<pybuda.run_optimizer>`
calls to manually implement a training loop:
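
(The docs' loop example is collapsed by this hunk; the sketch below only illustrates the shape of such a loop using the calls named above. The ``training=True`` argument and ``num_steps`` are assumptions, not taken from the collapsed example.)

.. code-block:: python

    # Compile once, then drive the pipeline manually.
    pybuda.initialize_pipeline(training=True)

    for _ in range(num_steps):   # num_steps is illustrative
        pybuda.run_forward()
        pybuda.run_backward()
        pybuda.run_optimizer()
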
@@ -165,10 +167,10 @@ calls to manually implement a training loop:
CPU Fallback
************

If there are operators in the workload that are unsuppored by PyBuda, the user can create a CPUDevice and place module containing those
If there are operators in the workload that are unsupported by PyBuda, the user can create a CPUDevice and place module containing those
operators onto that CPUDevice. If enabled, PyBuda is capable of doing this automatically.

If a TTDevice contains unsuppored operators, during compilation, the device will be split into mupltiple devices (TTDevice and CPUDevice). If
If a TTDevice contains unsupported operators, during compilation, the device will be split into multiple devices (TTDevice and CPUDevice). If
the CPUDevice is at the front of the pipeline (i.e. the unsupported ops are in the first half of the graph), any inputs pushed to the TTDevice
will be redirected to the correct CPUDevice.

@@ -214,7 +216,7 @@ Output queues hold PyBuda tensors. For each PyBuda tensor, user can convert it b
output_in_tf = output_q[0].to_framework("tensorflow")

Advanced training scenarios sometimes require accumulated gradients to be retrieved and analyzed. For those cases, PyBuda provides an
:py::func:`API<pybuda.get_parameter_gradients>` that retrieves a dictionary of all currently accumulated gradients on a device. This can be used to
:py:func:`API<pybuda.get_parameter_gradients>` that retrieves a dictionary of all currently accumulated gradients on a device. This can be used to
debug or analyze data, or even run a manual optimizer and push new weights onto the device.
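
(A one-line sketch of the API named above, with an illustrative device variable:)

.. code-block:: python

    # Fetch the currently accumulated gradients from device tt0 for inspection.
    gradients = pybuda.get_parameter_gradients(tt0)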

Saving and Loading Models
@@ -623,20 +625,22 @@ Here is a simple example to (1) tag operations of interest and (2) fetch interme
matmul2 = pybuda.op.Matmul("matmul2", matmul1_gelu, self.weights2)
return matmul2

# Configure Pybuda compilation options to include a list of operations to collect intermediate tensors
tagged_operations = ["matmul1", "gelu"]
pybuda.set_configuration_options(op_intermediates_to_save=tagged_operations)
# Guard in the main module to avoid creating subprocesses recursively.
if __name__ == "__main__":
# Configure Pybuda compilation options to include a list of operations to collect intermediate tensors
tagged_operations = ["matmul1", "gelu"]
pybuda.set_configuration_options(op_intermediates_to_save=tagged_operations)

# Invoke the run_inference API to create device, compile and run module on device:
output_q = pybuda.run_inference(PyBudaTestModule("test_module"), inputs=[torch.randn(1, 32, 32)])
# Invoke the run_inference API to create device, compile and run module on device:
output_q = pybuda.run_inference(PyBudaTestModule("test_module"), inputs=[torch.randn(1, 32, 32)])

# After running inference, the intermediates queue will contain the ordered list of tagged intermediates
intermediates_queue = pybuda.get_intermediates_queue()
matmul1_tensor, gelu_tensor = intermediates_queue.get()
# After running inference, the intermediates queue will contain the ordered list of tagged intermediates
intermediates_queue = pybuda.get_intermediates_queue()
matmul1_tensor, gelu_tensor = intermediates_queue.get()

# Print tensor values recorded from device inference
print(matmul1_tensor)
print(gelu_tensor)
# Print tensor values recorded from device inference
print(matmul1_tensor)
print(gelu_tensor)


Multiple Devices
Expand All @@ -647,7 +651,7 @@ Using Multiple Tenstorrent Devices

PyBuda makes it easy to parallelize workloads onto multiple devices. A single :py:class:`TTDevice<pybuda.TTDevice>` can be used as a wrapper to any number of available
Tenstorrent devices accessible to the host - either locally or through ethernet. The PyBuda compiler will then break up the workload over
assigned devices using either pipeline or model parllelism strategies, or a combination of both.
assigned devices using either pipeline or model parallelism strategies, or a combination of both.

The easiest way to use all available hardware is to set ``num_chips`` parameter in :py:class:`TTDevice<pybuda.TTDevice>` to 0, which instructs it to auto-detect and use everything it can find.
However, ``num_chips`` and ``chip_ids`` parameters can be used to select a subset of available hardware:
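
(The docs' snippet is collapsed here; an illustrative sketch of the two options described above, with made-up device names.)

.. code-block:: python

    # Auto-detect and use every available chip...
    tt_all = pybuda.TTDevice("tt_all", num_chips=0)

    # ...or restrict the device to an explicit subset of chips.
    tt_subset = pybuda.TTDevice("tt_subset", chip_ids=[0, 1])
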
@@ -765,7 +769,7 @@ The following Python code generates a Multi-Model TTI in a manner identical to t

model_binary_loc = "device_images_to_merge"
models_to_merge = ["bert_large", "deit", "hrnet", "inception", "mobilenet_v1", "mobilenet_v2", "mobilenet_v3", "resnet", "unet", "vit"]
target_arch = "wormhole_b0
target_arch = "wormhole_b0"
merged_model_location = "multi_model_workload.tti"

# Individual Model Generation Code Goes Here
@@ -776,7 +780,7 @@ The following Python code generates a Multi-Model TTI in a manner identical to t

During the model fusion process, the API presented above is responsible for performing memory reallocation. Users may be interested in the memory footprint of the fused model (both Device and Host DRAM).

To fullfil this requirement, the tool reports memory utilization post reallocation. An example using a model compiled for Wormhole (with 6 Device and upto 4 Host DRAM channels) is provided below.
To fulfill this requirement, the tool reports memory utilization post reallocation. An example using a model compiled for Wormhole (with 6 Device and up to 4 Host DRAM channels) is provided below.

.. code-block:: bash

9 changes: 9 additions & 0 deletions pybuda/csrc/balancer/exceptions.hpp
@@ -57,6 +57,15 @@ class BudaOpNodeLegalizerFailureInfo
return opModelFailureCountByType[failureReason];
}

// Returns the total number of failures targeted by padding. Padding aims to resolve these failures.
std::uint32_t getFailuresCountTargetedByPadding() const
{
return opModelFailureCountByType[UserAccessPreventsStreaming] +
opModelFailureCountByType[OperandAccessPreventsStreaming] +
opModelFailureCountByType[OperandAndUserAccessPreventsStreaming] +
opModelFailureCountByType[InputBufferAllocationFailure];
}

std::string toString() const
{
std::string result = "Op model failure counts by type: \n";