All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Support models with subgraphs in `tools/ort-quantize.py` script and adjust configuration so that it produces usable results with more models (#530)
- Added initial optimized implementation of `MatMulInteger` for x64 (AVX2, AVX512 VNNI) and Arm 64 (with dotprod extensions) (#528, #535, #537, #541, #542, #543)
- Optimized and vectorized `DynamicQuantizeLinear` and `QuantizeLinear` operations (#531, #532, #538)
- Fixed edge case bug with incorrect handling of fused MatMul-Add operations when K dimension (LHS column count) is zero (#526)
- Fixed panic in `Conv` operator if group count is zero (#523)
- Support `MatMulInteger` operators where zero point is a vector (#521)
- Added ModernBERT masked word prediction example (#520)
- Refactored matrix multiplication internals to prepare for supporting additional data types and architectures (#510, #511, #513, #519)
- Optimized Softmax by using multiplication-by-reciprocal instead of division (#516)
- Optimized matrix multiplication with specialized code for edge tiles (#505), more efficient indexing into LHS / A input (#512) and more aggressive unrolling (#518)
- Fuse RMSNorm subgraphs (#497)
- Optimized Gather with fast path for common case of axis=0 and faster general case (#496)
- Error if a dimension size specified with `--size` does not match any model input (#517)
- Made output less noisy when a dimension size repeated in many inputs is not specified and is defaulted to 1 (#517)
- Prefix timing for each run with a run number (#508)
- Fuse and vectorize Swish activation function used in CLIP and other models (#493).
- Avoid redundant zeroing of output buffer in `Gather` operator (#492)
- Fuse `MatMul` + `Mul` or `Div` by constant on either inputs or outputs (#487, #489). In Transformers this occurs in the context of Scaled Dot Product Attention.
- Fix panic if `Model::run` is passed an input or output node ID which refers to an operator node rather than a value or constant node (#485).
- Support prepacked weights. This increases model load time and memory usage but decreases inference time. Weight pre-packing is disabled by default and can be enabled via `ModelOptions::prepack_weights` (#483).
- Support fusing LayerNormalization operator variants that don't use a bias, such as found in ModernBERT and other models (#470).
- Support `DepthToSpace` operator (#468)
- Support fusing `Add(MatMul(a, b), bias)` subgraphs (#462)
- Improved Where operator performance by removing an old "fast" path that is now slower than the standard path (#460)
- Optimized ReduceMean, ReduceL2 operators using SIMD (#457)
- Unified and optimized implementation of normalization operators (BatchNormalization, InstanceNormalization, LayerNormalization) using SIMD (#456, #457, #465, #469, #471).
- Added Nougat PDF-to-Markdown OCR example (#448).
- Make Depth Anything example support variants with 3D (instead of 4D) outputs (#447).
- Make error message more helpful if converting a model `Output` into an `NdTensor` fails due to a rank (dimension count) mismatch (#446)
- Release buffer back to memory pool in Concat op if in-place concatenation is not possible (#426)
- Enable Resize operations to be very cheap if the target size is the same as the input size (#423)
- Reduced some unnecessary memory reservation when constructing model graph (#422).
- Added CLIP example (#421). This computes similarity between images and text labels.
- Added data type information to model inputs and outputs (#420)
- Support vector inputs in MatMul operator (#418)
- Support additional DETR-based models in the DETR example, such as Table Transformer (#413)
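The Softmax entry in this list (#516) swaps a per-element division for one reciprocal followed by multiplications. The scalar sketch below only illustrates that idea; the function name `softmax_in_place` is hypothetical and this is not RTen's vectorized implementation.

```rust
/// Numerically stable softmax over a slice, written in place.
/// Hypothetical helper for illustration only.
fn softmax_in_place(xs: &mut [f32]) {
    // An empty axis has nothing to normalize.
    if xs.is_empty() {
        return;
    }

    // Subtract the maximum before exponentiating to avoid overflow.
    let max = xs.iter().copied().fold(f32::NEG_INFINITY, f32::max);

    let mut sum = 0.0f32;
    for x in xs.iter_mut() {
        *x = (*x - max).exp();
        sum += *x;
    }

    // One division for the whole slice, then a multiply per element,
    // instead of dividing every element by `sum`.
    let inv_sum = 1.0 / sum;
    for x in xs.iter_mut() {
        *x *= inv_sum;
    }
}
```

Multiplication is typically cheaper than division on current CPUs, which is presumably why the change pays off; the result can differ from true division in the last bit.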
Breaking changes: The result of `TensorBase::reshaped` now has a shorter lifetime as it may be an owned tensor instead of a view. Method call chains that used `reshaped` in the middle may need to be split into separate statements.
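The lifetime effect described above can be reproduced with the standard library's `Cow`. The sketch below uses a hypothetical `Buf` type as a stand-in for a tensor; it is an analogy for why chains need splitting, not rten-tensor's actual API.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for a tensor: a flat buffer plus a flag saying
/// whether its elements are laid out contiguously.
struct Buf {
    data: Vec<f32>,
    contiguous: bool,
}

impl Buf {
    /// Borrows the data when it is contiguous, otherwise returns a copy.
    /// Because the result may own its data, it cannot be treated as a
    /// plain view that lives as long as `self`.
    fn reshaped(&self) -> Cow<'_, [f32]> {
        if self.contiguous {
            Cow::Borrowed(&self.data)
        } else {
            Cow::Owned(self.data.clone())
        }
    }
}

fn first_elem(buf: &Buf) -> f32 {
    // Writing `let first = buf.reshaped().first();` and using `first` in a
    // later statement does not compile: the temporary `Cow` is dropped at
    // the end of that statement. Binding the result first keeps it alive.
    let reshaped = buf.reshaped();
    let first = reshaped.first();
    *first.expect("buffer must not be empty")
}
```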
- Support indexing 1D tensors using scalars instead of arrays (#480).
- Support using slice ranges with steps in `TensorBase::slice` (#464)
- `TensorBase::reshaped` now copies its input instead of panicking if the input is non-contiguous. As a result it returns a copy-on-write (maybe owned) tensor with a shorter lifetime.
- Support `Lowercase`, `Replace`, `Sequence` normalizers (#451)
- Support all the Unicode normalizers (NFC, NFD, NFKC, NFKD) (#450)
- Support `Digits`, `Sequence`, `Split` pre-tokenizers (#449)
- Add `Tokenizer::from_file` convenience method (#445)
- Added `Tokenizer::{encode, decode}` methods for more ergonomic tokenization and de-tokenization of text (#429)
- Started to revise tokenization pipeline to follow the one used by HuggingFace Tokenizers (#428, #429, #430, #440, #441, #443, #444, #452)
- Support `end_of_word_suffix` in BPE model (#425)
This release adds Serde support for rten tensors and several optimizations which allow the Whisper example to run significantly faster.
- Support (de-)serializing tensors using Serde (#402)
- Output transcription speed as a multiple of real-time in Whisper example (#403)
- Support longer audio inputs and normalize inputs in wav2vec2 speech recognition example (#400)
- Fixed an issue where metadata associated with output value nodes was lost after a graph fusion. In the Whisper example this prevented several Transpose-MatMul fusions from being used (#401).
- Added fast path for ArgMin / ArgMax for case when axis has unit stride (#411)
- Optimized GatherND by avoiding redundant zeroing of output and adding fast path for contiguous inputs (#410)
- Optimized copying of tensors with 5+ dimensions (#409)
- Operators in subgraphs which capture their first input from a parent graph can now run in-place (#407)
- After the initial execution plan is created, it is now re-ordered to enable more operations to run in-place (#405)
- The strategy for reserving capacity for KV-cache growth has been modified to work with models that don't append to KV-cache inputs on the first run. This benefits Hugging Face "merged" transformer models with "past" and "no-past" branches (#408)
- The `NodeId` type used to identify model inputs and outputs is now an opaque `u32`-sized type instead of a `usize` (#381)
- The tensor slicing APIs (`TensorBase::slice` etc.) now infer the rank of the output automatically, instead of requiring the caller to specify it. See #367.
- Added Whisper speech recognition example (#397)
- Added background removal example using RMBG (#344)
- Support i8 and u8 tensors in operator inputs, outputs and model weights (#345).
- Support 8-bit int tensors in Cast, Gather, GatherElements, GatherND, ScatterElements, ScatterND, Expand, Flatten, Reshape, Squeeze, Transpose, Pad, Unsqueeze ops (#387)
- Implement `QuantizeLinear`, `DequantizeLinear` and `DynamicQuantizeLinear` ops (#346)
- Added reference implementation of `MatMulInteger`. Quantized models using this operator will now run, but very slowly. Optimized execution for quantized models will come in future releases (#356).
- Support f16 models in model converter by widening to f32 (#372). This is an interim measure until f16 tensors are properly supported in RTen.
- Added YOLOv11 support to YOLO example (#374)
- Fixed AVX-512 build (#376)
- Fixed graph optimizations not being applied correctly when a fused operation feeds directly into a subsequent fused operation (#369)
- Fixed errors when running WebAssembly builds compiled without SIMD support (#348)
- Made `NodeId` a u32-sized type with a niche, reducing the size of various internal data structures (#381)
- Optimized `Cast` op when source and dest types are the same (#388)
- Avoid unnecessary copying in `Squeeze` and `Unsqueeze` ops (#339, #340)
- Added `--no-optimize` flag to enable testing impact of graph optimizations (#368)
- Added more context to token generation errors (#396)
- Support `cache_position` input in models exported from Optimum (#395)
- Added API for modifying model outputs ("logits") before sampling (#393, #394)
- Support the new `merges` format in tokenizer.json files exported by current versions of HuggingFace Transformers (#392)
- Added `normalize_image` utility (#343)
- Improved debug formatting of tensors (#377)
- Changed `TensorBase::slice` to infer the rank of the output based on the rank of the input and the number of index entries in the slice arguments (#367).
- Added speech detection example using Silero VAD (#338)
- Support int tensors in ArgMin and ArgMax ops (#329)
- Support "reflect" padding mode (#326)
- Fixed panic with certain combinations of input, kernel size and padding in depthwise convolution (#336)
- Fixed attempted out-of-bounds slice in depthwise convolution when input tensor has a row stride that exceeds the row length (#335)
- Fixed conversion of `auto_pad` attribute for Conv operator (#333)
- Round timings to microseconds in verbose log (#331)
- Fixed panic when slicing empty tensors (#325)
- Fixed 1D convolution failing with non-contiguous inputs (#324)
- Fixed conversion of shape information for scalar tensors (#323)
- Fixed panic in softmax if the size of the normalized axis is zero (#322)
- Added `--mmap` flag to load model using memory mapping instead of reading whole file into a buffer (#330)
This release adds the infrastructure to support subgraphs, which are used in control flow operators like `If`, plus an implementation of the `If` operator and a TrOCR example which uses it.
- Added `Model::load_static_slice` API which can be used to load models embedded in the binary with `include_bytes!`. Thanks @hsfzxjy.
- Added TrOCR example (#304)
- Support `If` operator (#306)
- Added full support for `Einsum` operator (#297, #299, #300, #302, #303)
- Added `--quiet` flag (#313)
- Inputs named `use_cache_branch` now get a default value of `0` (ddf4109)
- Support models with cross-attention KV caches that are computed on the first run of the decoder (#318). This is used by Hugging Face models for encoder-decoder systems.
- Support models without a KV cache (#305)
- Added `Tensor::remove_axis` (b823d46)
- Added `Tensor::from_storage_and_layout` (54d2941)
- The BPE tokenizer no longer complains if a tokenizer contains tokens in the vocabulary which are never generated by merges and are not added special tokens (18e9b2a)
- The `rten-convert` tool now generates models in the V2 format by default (#272). These models can only be loaded by RTen version 0.11.0 or later. The V1 format can be generated by specifying the `--v1` flag. The `rten` crate can load both V1 and V2 format models. See the `.rten` file format documentation for more details.
- The `reduce_{max, min, sum}` tensor methods have moved from the `FloatOperators` trait to the `Operators` trait (#274).
- Added Segment Anything example (#295). This supports the original SAM models plus several derivatives with lighter-weight image encoders.
- Added chatbot example using Qwen2 (#282). This also works with SmolLM.
- `Model::load_mmap` docs now have a better explanation of the memory and performance impact (ce0b717)
- Added partial support for `Einsum` operator (#295).
- Avoid allocations in most cases when broadcasting tensor shapes (c4b5f26).
- Strides of size-1 dimensions are ignored when determining whether a tensor is contiguous (#292). This allows more operations to use fast paths for contiguous tensors.
- Optimized `LayerNormalization` and `ReduceMean` (#291)
- Added fast-path for `Resize` operator when input scale is 1 (#290)
- Return input buffer to pool in `Cast` operator if input needs to be copied (#289).
- Implemented `LayerNormalization` fusion (#280)
- Implemented `GELU` fusion (#277)
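The contiguity rule from #292 in the list above can be written as a short check. The function `is_contiguous` below is a hypothetical sketch for row-major layouts. It assumes `shape` and `strides` have the same length and is not the rten-tensor implementation.

```rust
/// Returns true if a row-major tensor with the given shape and strides
/// occupies one dense block of memory. Strides of size-1 dimensions are
/// ignored, because that dimension is never stepped along.
fn is_contiguous(shape: &[usize], strides: &[usize]) -> bool {
    // Walk from the innermost dimension outwards, tracking the stride a
    // dense layout would have at each step.
    let mut expected_stride = 1;
    for (&size, &stride) in shape.iter().zip(strides).rev() {
        if size == 1 {
            continue;
        }
        if stride != expected_stride {
            return false;
        }
        expected_stride *= size;
    }
    true
}
```

For example, shape `[2, 1, 3]` with strides `[3, 99, 1]` counts as contiguous under this rule even though the middle stride is arbitrary.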
- Inputs with names matching the pattern `*_ids` now use zero as the auto-generated input value (78cd621)
- `TopKSampler` now supports specifying a temperature (65b837b)
- Added `Generator::append_prompt` to append to prompt after initial generation. This is useful for chat-like applications (5ef3cb2)
- Fixed an issue where `attention_mask` input had the wrong size (cae6134)
- The `tensor` and `ndtensor` macros have been deprecated in favor of `Tensor::from` and `NdTensor::from` (#286).
- `Tensor::from` now supports creating tensors from scalar values (d2ca876)
- `Tensor::lanes` iterator performance was improved by making the iterators exact-sized and fused (9e31556)
- Token IDs are now represented as `u32` rather than `usize`, for consistency with rten-generate (#288).
- The `vocab` mapping in `tokenizer.json` files is now used to determine token IDs when decoding (#287).
- Fixed a crash in WebAssembly due to unsupported use of `Instant::now` (#283).
- The `inputs` argument to `Model::run` now accepts a `Vec<(NodeId, InputOrOutput)>` instead of `&[(NodeId, Input)]`, where `InputOrOutput` is an enum that is either an owned `Tensor` or a `TensorView`. This enables passing ownership of an input to `Model::run`, which in turn enables efficient in-place updates to cache-like inputs. The `InputOrOutput` type implements `From` for tensors and tensor views, so code such as `model.run(&[(input_id, tensor_view.into())], output_ids, None)` becomes `model.run(vec![(input_id, tensor_view.into())], output_ids, None)`.
- Add a new version of the `.rten` file format which supports models over 2GB in size. The `rten-convert` tool still generates V1 models by default but will generate the V2 format if the `--v2` flag is provided (#260).
- Support `Gelu` operator (#248)
- Prevent `Model::partial_run` from propagating values through randomized operators (#240).
- Improved accuracy of timing metrics and eliminated unaccounted for ("[Other]") time (#254).
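The `Model::run` change in the list above relies on an enum that holds either owned or borrowed data and on `From` conversions into it. The sketch below mirrors that shape with plain `Vec<f32>` and `&[f32]` stand-ins. The `run` function and its "add one" behaviour are hypothetical and only show how ownership allows in-place updates.

```rust
/// Simplified stand-in for an input that is either owned or a view.
enum InputOrOutput<'a> {
    Owned(Vec<f32>),
    View(&'a [f32]),
}

impl From<Vec<f32>> for InputOrOutput<'_> {
    fn from(data: Vec<f32>) -> Self {
        InputOrOutput::Owned(data)
    }
}

impl<'a> From<&'a [f32]> for InputOrOutput<'a> {
    fn from(data: &'a [f32]) -> Self {
        InputOrOutput::View(data)
    }
}

/// Hypothetical "model" that adds one to every element of every input.
/// Owned inputs are updated in place and returned without a new
/// allocation; borrowed views must be copied.
fn run(inputs: Vec<(u32, InputOrOutput<'_>)>) -> Vec<Vec<f32>> {
    inputs
        .into_iter()
        .map(|(_id, input)| match input {
            InputOrOutput::Owned(mut buf) => {
                for x in buf.iter_mut() {
                    *x += 1.0;
                }
                buf
            }
            InputOrOutput::View(view) => view.iter().map(|x| x + 1.0).collect(),
        })
        .collect()
}
```

Taking a `Vec` by value is what lets the callee consume owned inputs; a borrowed slice of inputs could only hand out references.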
This release adds a new graph optimization step as part of loading models. This performs fusions and other optimizations to speed up inference. These optimizations are enabled by default, but can be disabled via options in `ModelOptions`.
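To make "fusion" concrete, here is a toy rewrite pass over a tiny expression tree. It fuses the `Mul(X, Sigmoid(X))` pattern, one of the fusions named in this release, into a single `Silu` node. The `Expr` type and the `fuse` and `eval` functions are hypothetical; RTen's graph representation and optimizer are not shown here.

```rust
/// Toy expression tree with a single scalar input.
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Input,
    Sigmoid(Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
    Silu(Box<Expr>),
}

/// Rewrites `Mul(x, Sigmoid(x))` (in either operand order) into `Silu(x)`.
fn fuse(expr: Expr) -> Expr {
    match expr {
        Expr::Input => Expr::Input,
        Expr::Sigmoid(x) => Expr::Sigmoid(Box::new(fuse(*x))),
        Expr::Silu(x) => Expr::Silu(Box::new(fuse(*x))),
        Expr::Mul(a, b) => {
            // Fuse the operands first so nested patterns are handled.
            let a = fuse(*a);
            let b = fuse(*b);
            match (a, b) {
                (x, Expr::Sigmoid(y)) if x == *y => Expr::Silu(Box::new(x)),
                (Expr::Sigmoid(y), x) if x == *y => Expr::Silu(Box::new(x)),
                (a, b) => Expr::Mul(Box::new(a), Box::new(b)),
            }
        }
    }
}

fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// Evaluates the tree; the fused node computes both steps at once.
fn eval(expr: &Expr, input: f32) -> f32 {
    match expr {
        Expr::Input => input,
        Expr::Sigmoid(x) => sigmoid(eval(x, input)),
        Expr::Mul(a, b) => eval(a, input) * eval(b, input),
        Expr::Silu(x) => {
            let v = eval(x, input);
            v * sigmoid(v)
        }
    }
}
```

The fused tree evaluates its operand once instead of twice and needs no intermediate result for the `Sigmoid` output, which is the kind of saving a fusion aims for.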
- Improved parallelism in the `Softmax` operator (#258)
- Made `Tensor::inner_iter` faster (#259)
- Made `Gather`, `Concat` and `Unsqueeze` operators faster for small inputs. These operations are common in subgraphs that operate on tensor shapes (#255, #256, #257).
- Optimized vector-matrix multiplication (#250, #253). This benefits transformer decoder inference when the batch size is 1.
- Fuse `Mul(X, Sigmoid(X))` subgraphs into a `Silu` operation. This speeds up YOLOv8 by 8%. See #246.
- Further reduce small allocations during graph execution (#243, #245).
- Fuse `MatMul(Transpose(X), Y)` subgraphs to avoid materializing the transposed matrix (#242).
- Perform constant propagation when loading models (#241).
- Enabled `Concat` operator to run in-place if the caller has specifically reserved space in the first input's buffer (#239).
- Cache the last-used execution plan. This avoids recomputing the sequence of execution steps when a model is run in a loop (#234).
- Improved performance of unary operators for non-contiguous inputs (#223)
- Optimized `Where` operator for non-contiguous inputs (#213)
- Optimized variadic operators (#212)
- Optimized `Pow` operator (#219)
This is a new crate which provides a convenient `Iterator`-based interface for running auto-regressive decoder models. See the `gpt2` and `distilvit` examples in the `rten-examples` crate for code samples.
- Support more primitive element types in `NdTensor::from` (#226).
- Added Byte Pair Encoding (BPE) tokenizer (#227)
- RTen now creates its own Rayon thread pool instead of using the global Rayon thread pool, with the number of threads configured to match the physical rather than logical core count. This improves performance on systems with Simultaneous Multi-Threading (also known as SMT or Hyper-Threading, found on most x86_64 CPUs), but can lead to contention if the calling application has its own multi-threaded parallelism. Applications may need to adjust their own use of threading to avoid this. RTen provides functions for applications to run their own tasks within this thread pool. See #183.
- Fixed conversion of `Transpose` operators without a `perm` attribute (#201)
- The `RunError` type returned by `Model::run` is now exported (#206)
- Made `Resize` operator parallel over rows. This benefits resize operations on images with large spatial dimensions and few channels (#208).
- Improved performance of `Conv` operator on Intel CPUs with a mitigation for the Gather Data Sampling / "Downfall" vulnerability applied. This affects most 6th-11th generation Intel CPUs (#204).
- Optimized `Concat` operator when input is not contiguous (eg. following a `Slice` op) (#204)
- Improved performance of `GRU` operator by combining operations on separate gates (#188)
- Improved performance of binary operators on non-contiguous tensors (#190)
- Added `--n_iters` flag to control how many times the model is run (#202)
- Optimize model by performing constant propagation before running the model (#202)
- Made it easier to specify sizes for dynamic inputs. The new syntax is `--size dim_name=size`. Additionally the size for dynamic dimensions defaults to 1. See #182.
- Added `--version` flag (#181)
- Added `serde_traits` feature which implements serde `Serialize` and `Deserialize` traits for geometry types (Thanks @luketpeterson, #198)
- Added `Tensor::split_at` and `Tensor::split_at_mut` (#205, #207)
- `Tensor::{axis_chunks, axis_chunks_mut}` iterators now preserve the layout in their output type (#207).
- The internal crate providing portable SIMD and vectorized math functions was split into two. rten-simd now contains the portable SIMD code. rten-vecmath contains the vectorized math functions.
This release contains breaking changes to the model loading APIs and code using the `TensorBase` type directly (as opposed to aliases like `Tensor`). See the notes for the `rten` and `rten-tensor` crates respectively.
- The `Model::load` API now takes a `Vec<u8>` rather than `&[u8]` as an argument. This enables it to avoid copying data internally. For the most common use case of loading a model from disk, use the new `Model::load_file` API.
- The `Model::load_with_ops` API has been replaced by `ModelOptions::with_ops`.
- Added `Model::load_file` API for more convenient loading of a model from a file (#174)
- Added `Model::load_mmap` API for zero-copy loading of models by using memory maps. This can be faster than `Model::load` for very large models (#174).
- Added Piper text-to-speech example (#161)
- Support 1D inputs and padding in `ConvTranspose` (#156)
- Support `GatherND` operator (#155)
- Support `Softplus` operator (#146)
- Support converting ONNX models containing unnamed operator nodes (#143)
- Support `RandomNormal`, `RandomNormalLike`, `RandomUniformLike` operators (#144)
- Fixed incorrect calculation of update slice size in `ScatterND` operator (#157)
- Fixed incorrect conversion of `axis` attribute for `ArgMin` and `ArgMax` operators (#142)
- Fixed uninitialized read in `Gemm` operator when `alpha != 1` and `beta == 0` (#150)
- Fixed `NonMaxSuppression` operator missing overlap of boxes due to confusion of X/Y coordinates (#177)
- Optimize `Gather`, `NonZero` operators by allocating from memory pool (#168)
- Optimize `Slice` operator when slice ranges contain negative steps (#167)
- Optimize `Pad` operator by making copying of non-contiguous views more efficient (#166)
- Optimize `Conv` operator by avoiding redundant zeroing of packing buffers and optimizing `im2col` setup (#165)
- Optimize `ConvTranspose` by fusing bias addition into `col2im` transform (#159)
- Parallelize `AveragePool` operator (#138)
- Improved model loading performance by avoiding copying weights in `Model::load` (#174)
- The mask matrix argument to `find_contours` now uses `bool` instead of `i32` for elements. This improves performance / reduces memory usage for large masks.
This release changes the signature of the `TensorBase` struct from `TensorBase<T, S: AsRef<[T]>, L: MutLayout>` to `TensorBase<S: Storage, L: MutLayout>`. The element type is now available via `S::Elem`. The type of `S` used by views has changed from slices to new custom types. The `TensorBase::from_data` method still accepts both `Vec<T>` and slices as the `data` argument, and will convert to the appropriate storage struct.
Code using the type aliases (`Tensor`, `TensorView`, `TensorViewMut` etc.) does not need to change.
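The pattern behind this signature change, moving the element type from a separate generic parameter to an associated type on the storage, can be sketched as follows. Everything here is a simplified assumption: the `Storage` trait, `ViewData` and the one-parameter `TensorBase` are illustrative stand-ins, and the layout parameter is omitted.

```rust
/// Simplified stand-in for a storage trait with an associated element type.
trait Storage {
    type Elem;
    fn as_slice(&self) -> &[Self::Elem];
}

/// Owned storage.
impl<T> Storage for Vec<T> {
    type Elem = T;
    fn as_slice(&self) -> &[T] {
        self
    }
}

/// Custom borrowed storage type, used instead of a bare slice.
struct ViewData<'a, T>(&'a [T]);

impl<'a, T> Storage for ViewData<'a, T> {
    type Elem = T;
    fn as_slice(&self) -> &[T] {
        self.0
    }
}

/// The tensor is generic over storage only; the element type is `S::Elem`.
struct TensorBase<S: Storage> {
    data: S,
}

impl<S: Storage> TensorBase<S> {
    /// Works for owned and borrowed tensors alike.
    fn sum(&self) -> S::Elem
    where
        S::Elem: Copy + std::iter::Sum<S::Elem>,
    {
        self.data.as_slice().iter().copied().sum()
    }
}
```

One fewer generic parameter means signatures that are generic over tensors only need to name `S`, with the element type following from it.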
- Added `TensorBase::{as_cow, into_cow}` (named after `std::borrow::Cow`) to convert tensor storage to a type which is `Cow`-like. This is useful for writing code which works with either borrowed or owned tensors (#153).
- Added missing checks for equality between old/new layout lengths in reshape operations (#170, #171)
- Improved internal checks that storage slicing does not lead to out-of-bounds accesses (#163)
- Refactored tensor storage types to fix a violation of Rust's unique ownership rules for mutable slices. This enables tests for rten-tensor and code using this crate to be run under Miri (#148).
- Revised SIMD traits to make working with masks more ergonomic and efficient (#152). Integer and floating point types with the same number of lanes will now use the same mask type.
- Added `Alloc` trait which provides a simple allocator interface, and `*_in`-suffixed variants of several `TensorBase` methods, which allows specifying an allocator for the returned tensor's data buffer (#123).
- Fixed crashes in several functions when running on pre-AVX2 x64 CPUs (see `rten` changes)
- Support `Elu` operator (#132)
- Support `Reduce*` operators that take `axes` as a dynamic input rather than static attribute (#132)
- Fixed crash in several operators when running on x64 CPUs that do not support AVX-2 instructions (#131, #134)
- Added a buffer pool that enables reuse of operator output and temporary buffers, avoiding the overhead of allocating and freeing large buffers using the system allocator (#108). Statistics about buffer pool usage are printed as part of `RTEN_TIMING` output.
- Fixed a `MatMul` performance regression introduced in v0.7.0 due to virtual calls to get kernel tile size (#101)
- Optimize convolutions by using SIMD operations for im2col transform (#104)
- Parallelize depthwise convolution (#102)
- Avoid redundant zeroing of buffers in `Conv`, `OneHot`, and various unary operations (#97, #99, #101, #106)
- Optimize `Unsqueeze` by running in-place where possible (#96)
- Optimize vector-matrix products where matrix is transposed (#94)
- Reduced graph execution overhead by using faster hashing (#92)
- Optimize `ScatterND` (#91)
- Support AVX-512 acceleration for `Exp`, `Sigmoid`, `Tanh`, `Softmax` and `Erf` operators (#131). This requires nightly Rust and the `avx512` feature enabled.
- Add `Tensor::merge_axes` method to simplify layouts (#78)
- Add `Tensor::{uninit, assume_init}` methods for working with uninitialized buffers (#82)
- Reduced `Graph::run` overhead by reducing allocations (#89)
- Added `Model::partial_run` API to speed up autoregressive / recurrent models by precomputing parts of the graph that depend only on inputs that are unchanging across loop iterations (#86)
- Optimize `MatMul` and binary operators by avoiding unnecessary zeroing of output buffers (#82, #88)
- Fixed incorrect output from `Gemm` operator when the bias is zero and the "C" input contained infinities / NaNs (#81)
- Optimize matrix packing operations on Intel CPUs using AVX-2 instructions (#80)
- Optimize `Transpose` operations where input dimensions are powers of 2 by using blocking and tiling (#78)
- Exclude test files and tools from published crate (#77)
- Optimize RNN operators for the case where the input sequence is short, by avoiding prepacking of weights in this case (#74)
- Updated AVX-512 support to work with latest Rust nightly releases (#58)
- Improved performance of vector-matrix product operations (#61)
- Slightly improved WASM matrix multiplication performance with a dedicated kernel (#64)
- Fixed conversion of RNN operators (LSTM, GRU) that explicitly declare the direction as forward (#67)
- Support tensors with 3 or 5+ dimensions in `BatchNormalization` operator (#68)
- Support `RandomUniform` operator (#69)
- Improve matrix prepacking performance by eliminating unnecessary zero-initialization of buffers (#70)
- Changed `OperatorType` enum in .rten schema from byte to ubyte, to allow for more operator types in future (#56)
- Made `Model` instances `Send`, enabling use with PyO3 (#55)
- The ONNX => rten model conversion tool is now an installable Python package called `rten-convert` (#53)
- Implemented `ReduceSumSquare` operator (36bbf89f)
- Support `count_include_pad` attr in AveragePool operator (09ecb729)
- Support license/version/provenance metadata in RTen models (#48)
- Fix error when a negative index was used with `Gather` operator (573ded4c)
- Improve performance of `MatMul` operator when row count of LHS is small and batch size is large (#51)
- Optimized `find_contours` for large images (c471a6c, 7a14f43)
- Optimize `TensorBase::map` for contiguous tensors (5562fd23)
- Add `TensorBase::{from_fn, from_simple_fn}` (5e654ea0)
- Add `TensorBase::try_from_data` (18817907)
- Support `get_unchecked` on owned/mutable tensors (06b02eaf)
- Updated rten-vecmath dependency to latest version
The static and dynamic tensor types (`NdTensorBase`, `TensorBase`) have been unified into a single implementation. Most code uses these via type aliases (`NdTensor`, `Tensor` etc.), which remain the same. However there have been some API changes as a result:
- The `View` and `NdView` traits were combined into `AsView`. The recommended way to import this trait is via the prelude (`use rten_tensor::prelude::*`)
- Some inherent methods of `TensorBase` moved to the `AsView` trait. You may need to add additional imports of this trait or the prelude.
- `NdTensor::from_data` now has the same API signature as `Tensor::from_data`. This means the order of arguments is reversed compared to before. It is now `from_data(shape, data)`. Creating tensors with custom strides is now done via `from_data_with_strides` or `from_slice_with_strides`.
- Tensor methods for broadcasting and reshaping tensors now determine the rank of the result from the type of the shape argument. If passed an array, they return a static-rank view. If passed a slice, they return a dynamic-rank view.
- Methods that insert, remove or swap axes now have an `_axis` suffix (eg. `move_axis`). Previously some of these methods had a `_dim` suffix.
- The `slice` method now always returns a static rank view. Usage is `tensor.slice::<M, _>(range)` where `M` is the rank of the result. To create a view with a dynamic dimension count, use `tensor.slice_dyn(range)` instead.
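The rule that the rank of a result follows from the type of the shape argument (array gives static rank, slice gives dynamic rank) can be modelled with a trait that has an associated type. The names `ShapeArg`, `NdLayout`, `DynLayout` and `reshaped_layout` below are hypothetical and only illustrate the mechanism, not rten-tensor's own traits.

```rust
/// Layout whose rank `N` is part of the type.
struct NdLayout<const N: usize> {
    shape: [usize; N],
}

/// Layout whose rank is only known at runtime.
struct DynLayout {
    shape: Vec<usize>,
}

/// Maps a shape argument type to the layout type it produces.
trait ShapeArg {
    type Layout;
    fn into_layout(self) -> Self::Layout;
}

/// Arrays carry their length in the type, so the result is static-rank.
impl<const N: usize> ShapeArg for [usize; N] {
    type Layout = NdLayout<N>;
    fn into_layout(self) -> NdLayout<N> {
        NdLayout { shape: self }
    }
}

/// Slices have a runtime length, so the result is dynamic-rank.
impl<'a> ShapeArg for &'a [usize] {
    type Layout = DynLayout;
    fn into_layout(self) -> DynLayout {
        DynLayout { shape: self.to_vec() }
    }
}

/// The return type is chosen by the argument type at compile time.
fn reshaped_layout<S: ShapeArg>(shape: S) -> S::Layout {
    shape.into_layout()
}
```

Calling `reshaped_layout([2usize, 3])` yields an `NdLayout<2>`, while passing `dims.as_slice()` for a `Vec<usize>` yields a `DynLayout`.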
- Implemented LayerNormalization operator (#44)
- Added "Depth Anything" monocular depth estimation example (#44)
- Added support for `align_corners` value for `coordinate_transformation_mode` attr in Resize operator (#44).
- Optimized index iteration for tensors (d3fd3c9)
- Optimized col2im transform used by ConvTranspose (fbc541b)
- Optimized depthwise convolution (20e83e8)
- Improved performance on Arm via a better optimized GEMM kernel (#32) and vectorized kernels for other functions (#31).
- Improved inference performance on ARM (#30)
- Fix softmax operator on non-x64 / wasm32 platforms (59f4815)
Initial release.