diff --git a/docs/public/developer.rst b/docs/public/developer.rst
index 091fb723..c0b2b928 100644
--- a/docs/public/developer.rst
+++ b/docs/public/developer.rst
@@ -125,7 +125,7 @@ User Visible Constants
 ++++++++++++++++++++++
 
 Constant registers are implemented as objects which can be referenced
-whereever a vector can be used.
+wherever a vector can be used.
 
 * Grayskull:
 
@@ -230,8 +230,8 @@ Library
 
 Below ``Vec`` means any vector type.
 
-Grayskulll and Wormhole
-^^^^^^^^^^^^^^^^^^^^^^^
+Grayskull and Wormhole
+^^^^^^^^^^^^^^^^^^^^^^
 
 .. code-block:: c++
 
@@ -396,8 +396,8 @@ For example:
     l_reg[LRegs::LReg1] = x; // this is necessary at the end of the function
                              // to preserve the value in LReg1 (if desired)
 
-Miscelaneous
-************
+Miscellaneous
+*************
 
 Register Pressure Management
 ++++++++++++++++++++++++++++
@@ -413,7 +413,7 @@ loads dst_reg[0] and dst_reg[1] into temporary LREGs (as expected).
 
 The compiler will not spill registers. Exceeding the number of registers
 available will result in the cryptic: ``error: cannot store SFPU register
-(reigster spill?) - exiting!`` without a line number.
+(register spill?) - exiting!`` without a line number.
 
 The compiler does a reasonable job with lifetime analysis when assigning
 variables to registers. Reloading or recalculating results helps the compiler
@@ -448,7 +448,7 @@ The ``SFPREPLAY`` instruction available on Wormhole allows the RISCV processor
 to submit up to 32 SFP instructions at once. The compiler looks for sequences
 of instructions that repeat, stores these and then "replays" them later.
 
-The current implemention of this is very much first cut: it does not handle
+The current implementation of this is very much first cut: it does not handle
 kernels with rolled up loops very well. Best performance is typically attained
 by unrolling the top level loop and then letting the compiler find the repetitions
 and replace them with ``SFPREPLAY``. This works well when the main loop
@@ -494,15 +494,15 @@ Register Spilling
 +++++++++++++++++
 
 The compiler does not implement register spilling. Since Grayskull only has 4
-LRegs, running out of registers is a common occurence. If you see the
-following: ``error: cannot store SFPU register (reigster spill?) - exiting!``
+LRegs, running out of registers is a common occurrence. If you see the
+following: ``error: cannot store SFPU register (register spill?) - exiting!``
 you have most likely run out of registers.
 
 Error Messages
 ++++++++++++++
 
 Unfortunately, many errors are attributed to the code in the wrapper rather than in the code
-being written. For example, using an unitialized variable would show an error at a macro
+being written. For example, using an uninitialized variable would show an error at a macro
 called by a wrapper function before showing the line number in the user's code.
 
 Function Calls
diff --git a/docs/public/installation.rst b/docs/public/installation.rst
index cfa7d589..8797b41c 100644
--- a/docs/public/installation.rst
+++ b/docs/public/installation.rst
@@ -38,7 +38,7 @@ Python Environment Installation
 It is strongly recommended to use virtual environments for each project utilizing PyBUDA and Python dependencies.
 Creating a new virtual environment with PyBUDA and libraries is very easy.
 
-Prerequisites (detailed sections below) for python envirnment installation are listed here:
+Prerequisites (detailed sections below) for python environment installation are listed here:
 
 * `Setup HugePages (below) <#setup-hugepages>`_
 * `PCI Driver Installation (below) <#pci-driver-installation>`_
diff --git a/docs/public/terminology.rst b/docs/public/terminology.rst
index 1ff12b47..c492ac63 100644
--- a/docs/public/terminology.rst
+++ b/docs/public/terminology.rst
@@ -27,7 +27,7 @@ The dense tensor math unit in Tensix. It performs bulk tensor math operations, s
 
 SFPU
 ----
-Tensix SIMD engine, used for various miscellaneous activations operations, such as exponents, square roots, softmax, topK, and others.
+Tensix SIMD engine, used for various miscellaneous activation operations, such as exponents, square roots, softmax, topK, and others.
 
 Unpacker
 --------
@@ -49,7 +49,7 @@ A collection of ops that fits onto one chip. In a typical workflow, epoch code w
 
 Buffer
 ------
-A reserved location in local memory, DRAM, or host memory. Buffers are used either as desinations for operation outputs, sources for operation inputs, or temporary locations for intermediate data.
+A reserved location in local memory, DRAM, or host memory. Buffers are used either as destinations for operation outputs, sources for operation inputs, or temporary locations for intermediate data.
 
 Pipe
 ----
diff --git a/docs/public/user_guide.rst b/docs/public/user_guide.rst
index 118b9269..9a4c6db0 100644
--- a/docs/public/user_guide.rst
+++ b/docs/public/user_guide.rst
@@ -90,7 +90,7 @@ PyBuda API and workflow is flexible enough that some of these steps can be merge
 
 Devices
 *******
-PyBuda makes it easy to distribute a workload onto a heterogenous set of devices available to you. This can be one or more
+PyBuda makes it easy to distribute a workload onto a heterogeneous set of devices available to you. This can be one or more
 Tenstorrent devices, CPUs, or GPUs. Each device that will be used to run your workflow needs to be declared by creating the
 appropriate device type and giving it a unique name:
 
@@ -121,7 +121,7 @@ To run a module on a device, it needs to be "placed" on it
     tt0.place_module(mod)
 
 This tells PyBuda that module ``mod`` needs to be compiled and executed on device ``tt0``. In this case, ``mod`` is a native PyBuda module. To
-simiarly place a PyTorch module onto a Tenstorrent device, the module must be wrapped in a :py:class:`PyTorchModule` wrapper:
+similarly place a PyTorch module onto a Tenstorrent device, the module must be wrapped in a :py:class:`PyTorchModule` wrapper:
 
 .. code-block:: python
 
@@ -147,7 +147,7 @@ PyBuda provides all-in-one APIs for compiling and running workloads, :py:func:`r
 For inference, and simple training setups, this is the simplest way to get up and running.
 
 Alternatively, the models can be compiled in a separate step, using the :py:func:`initialize_pipeline` call,
-which optioanlly takes sample inputs, if none have been pushed into the first device. Once the compilation has completed, the user
+which optionally takes sample inputs, if none have been pushed into the first device. Once the compilation has completed, the user
 can run :py:func:`run_forward` pass through the pipeline for inference, or a loop of :py:func:`run_forward`, :py:func:`run_backward`, and :py:func:`run_optimizer`
 calls to manually implement a training loop:
 
@@ -165,10 +165,10 @@ calls to manually implement a training loop:
 
 CPU Fallback
 ************
-If there are operators in the workload that are unsuppored by PyBuda, the user can create a CPUDevice and place module containing those
+If there are operators in the workload that are unsupported by PyBuda, the user can create a CPUDevice and place module containing those
 operators onto that CPUDevice. If enabled, PyBuda is capable of doing this automatically.
 
-If a TTDevice contains unsuppored operators, during compilation, the device will be split into mupltiple devices (TTDevice and CPUDevice). If
+If a TTDevice contains unsupported operators, during compilation, the device will be split into multiple devices (TTDevice and CPUDevice). If
 the CPUDevice is at the front of the pipeline (i.e. the unsupported ops are in the first half of the graph), any inputs pushed to the TTDevice
 will be redirected to the correct CPUDevice.
 
@@ -647,7 +647,7 @@ Using Multiple Tenstorrent Devices
 
 PyBuda makes it easy to parallelize workloads onto multiple devices. A single :py:class:`TTDevice` can be used as a wrapper to any number
 of available Tenstorrent devices accessible to the host - either locally or through ethernet. The PyBuda compiler will then break up the workload over
-assigned devices using either pipeline or model parllelism strategies, or a combination of both.
+assigned devices using either pipeline or model parallelism strategies, or a combination of both.
 
 The easiest way to use all available hardware is to set ``num_chips`` parameter in :py:class:`TTDevice` to 0, which instructs it to auto-detect
 and use everything it can find. However, ``num_chips`` and ``chip_ids`` parameters can be used to select a subset of available hardware:
@@ -776,7 +776,7 @@ The following Python code generates a Multi-Model TTI in a manner identical to t
 During the model fusion process, the API presented above is responsible for performing memory reallocation. Users may be interested in
 the memory footprint of the fused model (both Device and Host DRAM).
 
-To fullfil this requirement, the tool reports memory utilization post reallocation. An example using a model compiled for Wormhole (with 6 Device and upto 4 Host DRAM channels) is provided below.
+To fulfill this requirement, the tool reports memory utilization post reallocation. An example using a model compiled for Wormhole (with 6 Device and up to 4 Host DRAM channels) is provided below.
 
 .. code-block:: bash