Uplift 2024-06-27 #34

Merged
merged 29 commits into from
Jun 27, 2024
Commits (29)
7a6fa97
Fix core dumped and constraint issue
dsudhakarTT Jun 12, 2024
e4e3b40
Try padding with queue as fallback to padding with nop
dgolubovicTT Jun 12, 2024
0cb7c5d
Shape definitions and calculation of operand inputs in RGG
vbrkicTT Jun 13, 2024
3e8d19f
[Ribbon2] Switch to util function for linked output nodes.
nobradovictt Jun 13, 2024
c2e1c33
Move sparse matmul op tests from sanity
kmilanovicTT Jun 13, 2024
ca2fc89
Add CCM test for PIDNet in Wormhole_B0(pytorch)
meenakshiramanathan1 Jun 10, 2024
5d34756
Enable NOC and DRAM estimates by default
rpavlovicTT Jun 13, 2024
761678a
Revert of "Revert "Adding a support for generating TTI image for Blac…
sdjordjevicTT Jun 13, 2024
6238cf8
Add skipped codegen model variants in wh_b0 and gs
chandrasekaranpradeep Jun 12, 2024
f1fa8c9
Add test for IndexCopy operator
vobojevicTT Jun 14, 2024
3870713
Validate netlist for matmul
kmilanovicTT Jun 4, 2024
f1b5563
[padding] add queue instead of padding if adding only queue fixes all…
dgolubovicTT Jun 13, 2024
93aeccd
Switch off NOC bandwidth estimates for T5 and flan-T5 benchmark models.
vcanicTT Jun 16, 2024
9d4a2b9
Add monodle model demos for PyTorch in Wormhole & Grayskull
ashokkumarkannan1 Jun 17, 2024
d9c8f7d
Disable data movement, DRAM and NOC, estimates by default.
vcanicTT Jun 18, 2024
4c82da5
[Balancer/GS] Fix partial datacopy ops related OpModel choice mismatc…
nobradovictt Jun 15, 2024
3a6e3b6
Add missing tests for Concatenate operator
vobojevicTT Jun 18, 2024
b867b41
Add more operators to PyBuda repository
vbrkicTT Jun 18, 2024
ebf480b
Fix pybuda n300 failures
ashokkumarkannan1 Jun 19, 2024
245243d
[Balancer] Migrate policy MinimizeGrid to PolicyManager.
nobradovictt Jun 19, 2024
f2aed91
Adding few more exception rules in python script
sdjordjevicTT Jun 20, 2024
8d5363e
[fork-join] fix merge queue and nop instructions
pilkicTT Jun 10, 2024
a6d6093
[test-cleanup] removing legacy ribbon flag
pilkicTT Jun 19, 2024
b0afaa8
Merge community changes, fix spelling and adding main guard
vmilosevic Jun 20, 2024
05ff6c7
Connect multiple open nodes
vbrkicTT Jun 24, 2024
42bafb0
Update submodules
vmilosevic Jun 27, 2024
ab395c8
Add path to erisc hex files.
vcanicTT Jun 21, 2024
b9c0cc0
Update ERISC path in setup.py.
vcanicTT Jun 24, 2024
8772356
Update gitlab action to build for wormhole_b0
vmilosevic Jun 27, 2024
15 changes: 13 additions & 2 deletions .github/workflows/build-artifacts.yml
@@ -3,15 +3,26 @@ name: Build artifacts
on:
workflow_dispatch:
workflow_call:
push:
branches:
- main
pull_request:
branches:
- main

env:
PYTHON_VERSION: "python3.10"

jobs:
build-artifacts:

strategy:
matrix:
arch: ["grayskull"]
include:
- arch: grayskull
env_script: env_for_silicon.sh
- arch: wormhole_b0
env_script: env_for_wormhole_b0.sh
runs-on: ubuntu-22.04
steps:
- uses: actions/checkout@v4
@@ -21,4 +32,4 @@ jobs:
- name: Update submodule
run: git submodule update --init --recursive
- name: Build for ${{ matrix.arch }}
run: source env_for_silicon.sh
run: source ${{ matrix.env_script }}
13 changes: 0 additions & 13 deletions .github/workflows/post-commit-workflow.yml

This file was deleted.

13 changes: 0 additions & 13 deletions .github/workflows/pull-request-workflow.yml

This file was deleted.

2 changes: 2 additions & 0 deletions Makefile
@@ -49,6 +49,8 @@ DOCSDIR = $(OUT)/docs
SUBMODULESDIR = $(OUT)/submodules
TORCHVISIONDIR = build_deps/vision

export TT_BACKEND_ERISC_PRECOMPILED_BINARIES_PATH=./erisc_hex/

# Top level flags, compiler, defines etc.

#WARNINGS ?= -Wall -Wextra
1 change: 1 addition & 0 deletions README.debug.md
@@ -109,6 +109,7 @@
* PYBUDA\_RIBBON2\_CALCULATE\_TARGET\_CYCLES: Calculate target cycles for every epoch within Ribbon2 balancing policy. (default: 0/False)
* PYBUDA\_RIBBON2\_CALCULATE\_TARGET\_CYCLES\_APPLY\_FILTERING: Apply filtering on GS search space while calculating dynamic cycles per epoch within Ribbon2 balancing policy. (default: 0/False)
* PYBUDA\_RIBBON\_LEGACY: Use legacy Ribbon balancing policy. (default: 0/False)
* PYBUDA\_MAXIMIZE\_GRID: Reverse logic of MinimizeGrid policy. Maximize grid size for all ops. (default: 0/False)
* PYBUDA\_ENABLE\_HOST\_INPUT\_NOP\_BUFFERING: Enable nop buffering of input host read. (default: 0/False)
* PYBUDA\_AUTO\_RECOMPILE: Triggers handling of backend compile error and recompiles the model. (default: 1/True)
* PYBUDA\_AUTO\_RECOMPILE\_TARGET\_CYCLES: Enables adjustment of target cycles during recompile if no errors from backend have been previously handled. Requires PYBUDA\_AUTO\_RECOMPILE to be enabled to work. (default: 0/False)
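
A hypothetical usage sketch (not part of this diff): flags such as the newly documented `PYBUDA_MAXIMIZE_GRID` are plain environment variables, so they can be enabled from the shell or from Python before the workload is compiled, for example:

    # Illustrative only: enable an opt-in PyBuda flag via the environment.
    # Flags in the list above default to 0/False; setting them to "1" turns them on.
    import os

    os.environ["PYBUDA_MAXIMIZE_GRID"] = "1"  # must be set before the compiler reads it

    import pybuda  # assumption: import and compile after the variable is set
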
1 change: 0 additions & 1 deletion README.md
@@ -49,4 +49,3 @@ Set `LD_LIBRARY_PATH` to the location of `third_party/budabackend/build/lib` - p
## Silicon

See README.silicon.md for details on how to run on silicon.

20 changes: 10 additions & 10 deletions docs/public/developer.rst
@@ -125,7 +125,7 @@ User Visible Constants
++++++++++++++++++++++

Constant registers are implemented as objects which can be referenced
whereever a vector can be used.
wherever a vector can be used.

* Grayskull:

@@ -230,8 +230,8 @@ Library

Below ``Vec`` means any vector type.

Grayskulll and Wormhole
^^^^^^^^^^^^^^^^^^^^^^^
Grayskull and Wormhole
^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: c++

@@ -396,8 +396,8 @@ For example:
l_reg[LRegs::LReg1] = x; // this is necessary at the end of the function
// to preserve the value in LReg1 (if desired)

Miscelaneous
************
Miscellaneous
*************

Register Pressure Management
++++++++++++++++++++++++++++
@@ -413,7 +413,7 @@ loads dst_reg[0] and dst_reg[1] into temporary LREGs (as expected).

The compiler will not spill registers. Exceeding the number of registers
available will result in the cryptic: ``error: cannot store SFPU register
(reigster spill?) - exiting!`` without a line number.
(register spill?) - exiting!`` without a line number.

The compiler does a reasonable job with lifetime analysis when assigning
variables to registers. Reloading or recalculating results helps the compiler
@@ -448,7 +448,7 @@ The ``SFPREPLAY`` instruction available on Wormhole allows the RISCV processor
to submit up to 32 SFP instructions at once. The compiler looks for sequences
of instructions that repeat, stores these and then "replays" them later.

The current implemention of this is very much first cut: it does not handle
The current implementation of this is very much first cut: it does not handle
kernels with rolled up loops very well. Best performance is typically attained by
unrolling the top level loop and then letting the compiler find the repetitions
and replace them with ``SFPREPLAY``. This works well when the main loop
@@ -494,15 +494,15 @@ Register Spilling
+++++++++++++++++

The compiler does not implement register spilling. Since Grayskull only has 4
LRegs, running out of registers is a common occurence. If you see the
following: ``error: cannot store SFPU register (reigster spill?) - exiting!``
LRegs, running out of registers is a common occurrence. If you see the
following: ``error: cannot store SFPU register (register spill?) - exiting!``
you have most likely run out of registers.

Error Messages
++++++++++++++

Unfortunately, many errors are attributed to the code in the wrapper rather than in the code
being written. For example, using an unitialized variable would show an error at a macro
being written. For example, using an uninitialized variable would show an error at a macro
called by a wrapper function before showing the line number in the user's code.

Function Calls
2 changes: 1 addition & 1 deletion docs/public/installation.rst
@@ -38,7 +38,7 @@ Python Environment Installation

It is strongly recommended to use virtual environments for each project utilizing PyBUDA and Python dependencies. Creating a new virtual environment with PyBUDA and libraries is very easy.

Prerequisites (detailed sections below) for python envirnment installation are listed here:
Prerequisites (detailed sections below) for python environment installation are listed here:

* `Setup HugePages (below) <#setup-hugepages>`_
* `PCI Driver Installation (below) <#pci-driver-installation>`_
4 changes: 2 additions & 2 deletions docs/public/terminology.rst
@@ -27,7 +27,7 @@ The dense tensor math unit in Tensix. It performs bulk tensor math operations, s

SFPU
----
Tensix SIMD engine, used for various miscellaneous activations operations, such as exponents, square roots, softmax, topK, and others.
Tensix SIMD engine, used for various miscellaneous activation operations, such as exponents, square roots, softmax, topK, and others.

Unpacker
--------
@@ -49,7 +49,7 @@ A collection of ops that fits onto one chip. In a typical workflow, epoch code w

Buffer
------
A reserved location in local memory, DRAM, or host memory. Buffers are used either as desinations for operation outputs, sources for operation inputs, or temporary locations for intermediate data.
A reserved location in local memory, DRAM, or host memory. Buffers are used either as destinations for operation outputs, sources for operation inputs, or temporary locations for intermediate data.

Pipe
----
88 changes: 46 additions & 42 deletions docs/public/user_guide.rst
@@ -17,28 +17,30 @@ Compiling and running a PyBuda workload is as easy as:
import pybuda
import torch
from transformers import BertModel, BertConfig

# Download the model from huggingface
model = BertModel.from_pretrained("bert-base-uncased")

# Wrap the pytorch model in a PyBuda module wrapper
module = pybuda.PyTorchModule("bert_encoder", model.encoder)

# Create a tenstorrent device
tt0 = pybuda.TTDevice(
"tt0",
module=module,
arch=pybuda.BackendDevice.Wormhole_B0,
devtype=pybuda.BackendType.Silicon,
)

# Create an input tensor
seq_len = 128
input = torch.randn(1, seq_len, model.config.hidden_size)

# Compile and run inference
output_queue = pybuda.run_inference(inputs=[input])
print(output_queue.get())

# Guard in the main module to avoid creating subprocesses recursively.
if __name__ == "__main__":
# Download the model from huggingface
model = BertModel.from_pretrained("bert-base-uncased")

# Wrap the pytorch model in a PyBuda module wrapper
module = pybuda.PyTorchModule("bert_encoder", model.encoder)

# Create a tenstorrent device
tt0 = pybuda.TTDevice(
"tt0",
module=module,
arch=pybuda.BackendDevice.Wormhole_B0,
devtype=pybuda.BackendType.Silicon,
)

# Create an input tensor
seq_len = 128
input = torch.randn(1, seq_len, model.config.hidden_size)

# Compile and run inference
output_queue = pybuda.run_inference(inputs=[input])
print(output_queue.get())


Framework Support
@@ -90,7 +92,7 @@ PyBuda API and workflow is flexible enough that some of these steps can be merge
Devices
*******

PyBuda makes it easy to distribute a workload onto a heterogenous set of devices available to you. This can be one or more
PyBuda makes it easy to distribute a workload onto a heterogeneous set of devices available to you. This can be one or more
Tenstorrent devices, CPUs, or GPUs. Each device that will be used to run your workflow needs to be declared by creating the appropriate
device type and giving it a unique name:
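
(The docs' example block is collapsed by this hunk; the following is a minimal illustrative sketch of device declaration, with made-up device names.)

.. code-block:: python

    import pybuda

    # Each device used by the workflow is declared up front with a unique name.
    tt0 = pybuda.TTDevice("tt0")      # a Tenstorrent device
    cpu0 = pybuda.CPUDevice("cpu0")   # a host CPU device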

@@ -121,7 +123,7 @@ To run a module on a device, it needs to be "placed" on it
tt0.place_module(mod)

This tells PyBuda that module ``mod`` needs to be compiled and executed on device ``tt0``. In this case, ``mod`` is a native PyBuda module. To
simiarly place a PyTorch module onto a Tenstorrent device, the module must be wrapped in a :py:class:`PyTorchModule<pybuda.PyTorchModule>` wrapper:
similarly place a PyTorch module onto a Tenstorrent device, the module must be wrapped in a :py:class:`PyTorchModule<pybuda.PyTorchModule>` wrapper:

.. code-block:: python

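(The wrapped-module example is collapsed here; below is a minimal sketch under the same API, with an illustrative module and shapes.)

.. code-block:: python

    import torch
    import pybuda

    # Wrap an ordinary PyTorch module so it can be placed on a Tenstorrent device.
    torch_module = torch.nn.Linear(32, 32)
    mod = pybuda.PyTorchModule("direct_pt", torch_module)
    tt0.place_module(mod)
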
@@ -147,7 +149,7 @@ PyBuda provides all-in-one APIs for compiling and running workloads, :py:func:`r
For inference, and simple training setups, this is the simplest way to get up and running.

Alternatively, the models can be compiled in a separate step, using the :py:func:`initialize_pipeline<pybuda.initialize_pipeline>` call,
which optioanlly takes sample inputs, if none have been pushed into the first device. Once the compilation has completed, the user
which optionally takes sample inputs, if none have been pushed into the first device. Once the compilation has completed, the user
can run :py:func:`run_forward<pybuda.run_forward>` pass through the pipeline for inference, or a loop of
:py:func:`run_forward<pybuda.run_forward>`, :py:func:`run_backward<pybuda.run_backward>`, and :py:func:`run_optimizer<pybuda.run_optimizer>`
calls to manually implement a training loop:
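
(The docs' loop example is collapsed by this hunk; the sketch below only illustrates the shape of such a loop using the calls named above. The ``training=True`` argument and ``num_steps`` are assumptions, not taken from the collapsed example.)

.. code-block:: python

    # Compile once, then drive the pipeline manually.
    pybuda.initialize_pipeline(training=True)

    for _ in range(num_steps):   # num_steps is illustrative
        pybuda.run_forward()
        pybuda.run_backward()
        pybuda.run_optimizer()
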
@@ -165,10 +167,10 @@ calls to manually implement a training loop:
CPU Fallback
************

If there are operators in the workload that are unsuppored by PyBuda, the user can create a CPUDevice and place module containing those
If there are operators in the workload that are unsupported by PyBuda, the user can create a CPUDevice and place module containing those
operators onto that CPUDevice. If enabled, PyBuda is capable of doing this automatically.

If a TTDevice contains unsuppored operators, during compilation, the device will be split into mupltiple devices (TTDevice and CPUDevice). If
If a TTDevice contains unsupported operators, during compilation, the device will be split into multiple devices (TTDevice and CPUDevice). If
the CPUDevice is at the front of the pipeline (i.e. the unsupported ops are in the first half of the graph), any inputs pushed to the TTDevice
will be redirected to the correct CPUDevice.

@@ -214,7 +216,7 @@ Output queues hold PyBuda tensors. For each PyBuda tensor, user can convert it b
output_in_tf = output_q[0].to_framework("tensorflow")

Advanced training scenarios sometimes require accumulated gradients to be retrieved and analyzed. For those cases, PyBuda provides an
:py::func:`API<pybuda.get_parameter_gradients>` that retrieves a dictionary of all currently accumulated gradients on a device. This can be used to
:py:func:`API<pybuda.get_parameter_gradients>` that retrieves a dictionary of all currently accumulated gradients on a device. This can be used to
debug or analyze data, or even run a manual optimizer and push new weights onto the device.
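
(A one-line sketch of the API named above, with an illustrative device variable:)

.. code-block:: python

    # Fetch the currently accumulated gradients from device tt0 for inspection.
    gradients = pybuda.get_parameter_gradients(tt0)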

Saving and Loading Models
@@ -623,20 +625,22 @@ Here is a simple example to (1) tag operations of interest and (2) fetch interme
matmul2 = pybuda.op.Matmul("matmul2", matmul1_gelu, self.weights2)
return matmul2

# Configure Pybuda compilation options to include a list of operations to collect intermediate tensors
tagged_operations = ["matmul1", "gelu"]
pybuda.set_configuration_options(op_intermediates_to_save=tagged_operations)
# Guard in the main module to avoid creating subprocesses recursively.
if __name__ == "__main__":
# Configure Pybuda compilation options to include a list of operations to collect intermediate tensors
tagged_operations = ["matmul1", "gelu"]
pybuda.set_configuration_options(op_intermediates_to_save=tagged_operations)

# Invoke the run_inference API to create device, compile and run module on device:
output_q = pybuda.run_inference(PyBudaTestModule("test_module"), inputs=[torch.randn(1, 32, 32)])
# Invoke the run_inference API to create device, compile and run module on device:
output_q = pybuda.run_inference(PyBudaTestModule("test_module"), inputs=[torch.randn(1, 32, 32)])

# After running inference, the intermediates queue will contain the ordered list of tagged intermediates
intermediates_queue = pybuda.get_intermediates_queue()
matmul1_tensor, gelu_tensor = intermediates_queue.get()
# After running inference, the intermediates queue will contain the ordered list of tagged intermediates
intermediates_queue = pybuda.get_intermediates_queue()
matmul1_tensor, gelu_tensor = intermediates_queue.get()

# Print tensor values recorded from device inference
print(matmul1_tensor)
print(gelu_tensor)
# Print tensor values recorded from device inference
print(matmul1_tensor)
print(gelu_tensor)


Multiple Devices
Expand All @@ -647,7 +651,7 @@ Using Multiple Tenstorrent Devices

PyBuda makes it easy to parallelize workloads onto multiple devices. A single :py:class:`TTDevice<pybuda.TTDevice>` can be used as a wrapper to any number of available
Tenstorrent devices accessible to the host - either locally or through ethernet. The PyBuda compiler will then break up the workload over
assigned devices using either pipeline or model parllelism strategies, or a combination of both.
assigned devices using either pipeline or model parallelism strategies, or a combination of both.

The easiest way to use all available hardware is to set ``num_chips`` parameter in :py:class:`TTDevice<pybuda.TTDevice>` to 0, which instructs it to auto-detect and use everything it can find.
However, ``num_chips`` and ``chip_ids`` parameters can be used to select a subset of available hardware:
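
(The docs' snippet is collapsed here; an illustrative sketch of the two options described above, with made-up device names.)

.. code-block:: python

    # Auto-detect and use every available chip...
    tt_all = pybuda.TTDevice("tt_all", num_chips=0)

    # ...or restrict the device to an explicit subset of chips.
    tt_subset = pybuda.TTDevice("tt_subset", chip_ids=[0, 1])
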
@@ -765,7 +769,7 @@ The following Python code generates a Multi-Model TTI in a manner identical to t

model_binary_loc = "device_images_to_merge"
models_to_merge = ["bert_large", "deit", "hrnet", "inception", "mobilenet_v1", "mobilenet_v2", "mobilenet_v3", "resnet", "unet", "vit"]
target_arch = "wormhole_b0
target_arch = "wormhole_b0"
merged_model_location = "multi_model_workload.tti"

# Individual Model Generation Code Goes Here
@@ -776,7 +780,7 @@ The following Python code generates a Multi-Model TTI in a manner identical to t

During the model fusion process, the API presented above is responsible for performing memory reallocation. Users may be interested in the memory footprint of the fused model (both Device and Host DRAM).

To fullfil this requirement, the tool reports memory utilization post reallocation. An example using a model compiled for Wormhole (with 6 Device and upto 4 Host DRAM channels) is provided below.
To fulfill this requirement, the tool reports memory utilization post reallocation. An example using a model compiled for Wormhole (with 6 Device and up to 4 Host DRAM channels) is provided below.

.. code-block:: bash

9 changes: 9 additions & 0 deletions pybuda/csrc/balancer/exceptions.hpp
@@ -57,6 +57,15 @@ class BudaOpNodeLegalizerFailureInfo
return opModelFailureCountByType[failureReason];
}

// Returns the total number of failures targeted by padding. Padding aims to resolve these failures.
std::uint32_t getFailuresCountTargetedByPadding() const
{
return opModelFailureCountByType[UserAccessPreventsStreaming] +
opModelFailureCountByType[OperandAccessPreventsStreaming] +
opModelFailureCountByType[OperandAndUserAccessPreventsStreaming] +
opModelFailureCountByType[InputBufferAllocationFailure];
}

std::string toString() const
{
std::string result = "Op model failure counts by type: \n";