Dev more bench #8

Merged: 7 commits, Nov 23, 2023

19 changes: 19 additions & 0 deletions Cargo.toml
@@ -23,12 +23,16 @@ num_cpus = {version = "1.0", optional=true}
[dev-dependencies]
criterion = { version = "*", features = ["html_reports"] }
rand = { version = "*", features = ["small_rng", "alloc"] }
atomic = {version = "0.5.3"}
rayon = {version = "*"}

[build-dependencies]
cxx-build = "*"

# BENCHMARKS

## misc

[[bench]]
name = "layout"
harness = false
@@ -41,10 +45,25 @@ harness = false
name = "view_access"
harness = false

## blas speedup measures

[[bench]]
name = "axpy"
path = "benches/blas-speedup/axpy.rs"
harness = false

[[bench]]
name = "gemv"
path = "benches/blas-speedup/gemv.rs"
harness = false

[[bench]]
name = "gemm"
path = "benches/blas-speedup/gemm.rs"
harness = false

## library overhead measures

[[bench]]
name = "hardcoded_gemm"
harness = false
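
Each new `[[bench]]` entry sets `harness = false` so that criterion can provide its own `main` through `criterion_main!` instead of the default libtest harness, and the `path` keys point the BLAS speedup benchmarks to the new `benches/blas-speedup/` sub-folder.
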
64 changes: 48 additions & 16 deletions README.md
@@ -23,27 +23,66 @@ This makes limit-testing a fundamental part of the project.

## Quickstart

The PoC itself is a library, but you can run benchmarks and examples out of the box.

### Benchmarks

Benchmarks can be run using the following command:

```bash
# all benchmarks
cargo bench
# a specific benchmark
cargo bench --bench bench_name
```

All results are compiled to the `target/criterion/` folder. The following
benchmarks are available:

- `layout`: Matrix-vector product computation; used to put numbers on the importance of data layout in memory.
- `view_init`: Compares the initialization performance of regular vectors to that of [Views][view]; used to spot
  potential scaling issues induced by the more complex structure of Views.
- `view_access`: Compares the data access performance of regular vectors to that of [Views][view]; used to spot
  potential scaling issues induced by the more complex structure of Views.
- `axpy` / `gemv` / `gemm`: Measure the speedup obtained on basic BLAS implementations by running the same kernel
  serially first, then with parallelization on CPU. _Meant to be executed with features enabled_.
- `hardcoded_gemm`: Computes the same operation as the `gemm` benchmark, but using a hardcoded implementation
  instead of the PoC's methods. Used to assess the additional cost induced by the library.
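
For the speedup benchmarks, the serial and parallel runs are registered as separate entries of the same criterion group (e.g. `exec-serial` and `exec-devicecpu` for `gemm`), so both timings end up side by side in the generated report.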


### Examples

```bash
cargo run --example hello-world
```

The following examples are available:

- `hello-world`: ...
- `openmp-parallel`: ...
- `hello_world`: ...
- `hello_world_omp`: ...


### Documentation

A concise documentation can be generated and accessed using the following command:

```bash
cargo doc --open --no-deps
```

## Features

Using `features`, the crate can be compiled to use different backends for the execution of parallel sections.
These can also be enabled in benchmarks.

```bash
cargo build --features <FEATURE>
```

Available features:

- `rayon`: Uses the [rayon][2] crate to handle parallelization on CPU.
- `threads`: Uses [`std::thread`] methods to handle parallelization on CPU.
- `gpu`: Currently used as a way to gate GPU usage, as this cannot be done in pure Rust.
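
For instance, `cargo bench --bench gemm --features rayon` should enable the `rayon` backend for the GEMM speedup benchmark. As a rough illustration of how such feature gating works (a hypothetical sketch, not the PoC's actual code; the function name is made up):

```rust
// Hypothetical illustration of feature-gated backends; not taken from the PoC.
// With `--features rayon`, the parallel version is compiled in; otherwise the
// serial fallback is used.

#[cfg(feature = "rayon")]
fn sum_parallel(data: &[f64]) -> f64 {
    use rayon::prelude::*;
    data.par_iter().sum()
}

#[cfg(not(feature = "rayon"))]
fn sum_parallel(data: &[f64]) -> f64 {
    data.iter().sum()
}

fn main() {
    let data = vec![1.0_f64; 1_000_000];
    println!("sum = {}", sum_parallel(&data));
}
```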

## Compilation

The build script will read the `CXX` environment variable to choose which C++ compiler to use
for Rust/C++ interop. Note that the crate itself does not currently use C++ code, only examples
do.
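For example, `CXX=clang++ cargo build` should build the examples' C++ parts with clang++ rather than the system default compiler.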

## References

@@ -54,16 +93,9 @@ cargo doc --open --no-deps
- `move` keyword semantics & implementation: [link][MOVE]


### Functor Implementation

- A very specific answer to a very specific rust-lang issue: [link][FNIMPL]



[1]: https://kokkos.github.io/kokkos-core-wiki/index.html
[2]: https://docs.rs/rayon/latest/rayon/

[NDARRAY]: https://docs.rs/ndarray/latest/ndarray/
[CONSTG]: https://doc.rust-lang.org/reference/items/generics.html
[FNIMPL]: https://github.com/rust-lang/rust/issues/29625#issuecomment-1692602873
[MOVE]: https://stackoverflow.com/questions/30288782/what-are-move-semantics-in-rust
File renamed without changes.
168 changes: 168 additions & 0 deletions benches/blas-speedup/gemm.rs
@@ -0,0 +1,168 @@
use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion};
use poc_kokkos_rs::{
functor::KernelArgs,
routines::{
parallel_for,
parameters::{ExecutionPolicy, ExecutionSpace, RangePolicy, Schedule},
},
view::{parameters::Layout, ViewOwned},
};
use rand::{
distributions::{Distribution, Uniform},
rngs::SmallRng,
SeedableRng,
};

// Serial GEMM
fn f1(
length: usize,
aa_init: Vec<f64>,
bb_init: Vec<f64>,
cc_init: Vec<f64>,
alpha: f64,
beta: f64,
) {
let mut aa = ViewOwned::new_from_data(aa_init, Layout::Right, [length, length]);
let mut bb = ViewOwned::new_from_data(bb_init, Layout::Left, [length, length]); // optimal layout since we iterate inside columns :)
let mut cc = ViewOwned::new_from_data(cc_init, Layout::Right, [length, length]);
black_box(&mut aa);
black_box(&mut bb);
black_box(&mut cc);

let execp = ExecutionPolicy {
space: ExecutionSpace::Serial,
range: RangePolicy::RangePolicy(0..length),
schedule: Schedule::Static,
};

// C = alpha * A * B + beta * C
let gemm_kernel = |arg: KernelArgs<1>| match arg {
// lines
KernelArgs::Index1D(i) => {
// cols
for j in 0..length {
// b[k, j]: successive k values are adjacent in memory since bb was init using a layout left
let ab_ij: f64 = (0..length).map(|k| aa.get([i, k]) * bb.get([k, j])).sum();
let val: f64 = alpha * ab_ij + beta * cc.get([i, j]);
cc.set([i, j], val);
}
}
KernelArgs::IndexND(_) => unimplemented!(),
KernelArgs::Handle => unimplemented!(),
};
parallel_for(execp, gemm_kernel).unwrap();
black_box(&cc);
}

// DeviceCPU GEMM
fn f2(
length: usize,
aa_init: Vec<f64>,
bb_init: Vec<f64>,
cc_init: Vec<f64>,
alpha: f64,
beta: f64,
) {
let mut aa = ViewOwned::new_from_data(aa_init, Layout::Right, [length, length]);
let mut bb = ViewOwned::new_from_data(bb_init, Layout::Left, [length, length]); // optimal layout since we iterate inside columns :)
let mut cc = ViewOwned::new_from_data(cc_init, Layout::Right, [length, length]);
black_box(&mut aa);
black_box(&mut bb);
black_box(&mut cc);

let execp = ExecutionPolicy {
space: ExecutionSpace::DeviceCPU,
range: RangePolicy::RangePolicy(0..length),
schedule: Schedule::Static,
};

// C = alpha * A * B + beta * C
let gemm_kernel = |arg: KernelArgs<1>| match arg {
// lines
KernelArgs::Index1D(i) => {
// cols
for j in 0..length {
// all b[k, j] for k values are adjacent in memory thanks to the LayoutLeft
let ab_ij: f64 = (0..length).map(|k| aa.get([i, k]) * bb.get([k, j])).sum();
let val: f64 = alpha * ab_ij + beta * cc.get([i, j]);
cc.set([i, j], val);
}
}
KernelArgs::IndexND(_) => unimplemented!(),
KernelArgs::Handle => unimplemented!(),
};
parallel_for(execp, gemm_kernel).unwrap();
black_box(&cc);
}

pub fn criterion_benchmark(c: &mut Criterion) {
// Generate/Define the input
const DATA_SIZE: u32 = 10;
let length = 2_usize.pow(DATA_SIZE);
let seed: u64 = 9817498146784;
let mut rng = SmallRng::seed_from_u64(seed);
let range: Uniform<f64> = rand::distributions::Uniform::new(0.0, 100.0);
let aa_init: Vec<f64> = (0..length * length)
.map(|_| range.sample(&mut rng))
.collect();
let bb_init: Vec<f64> = (0..length * length)
.map(|_| range.sample(&mut rng))
.collect();
let cc_init: Vec<f64> = (0..length * length)
.map(|_| range.sample(&mut rng))
.collect();
let alpha: f64 = range.sample(&mut rng);
let beta: f64 = range.sample(&mut rng);

let mut group = c.benchmark_group("gemm");
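// NOTE: the input vectors are cloned inside each `b.iter` closure, so the cost
// of those copies is included in the measured time of both variants.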
group.bench_with_input(
BenchmarkId::new("exec-serial", ""),
&(
length,
aa_init.clone(),
bb_init.clone(),
cc_init.clone(),
alpha,
beta,
),
|b, (length, aa_init, bb_init, cc_init, alpha, beta)| {
b.iter(|| {
f1(
*length,
aa_init.clone(),
bb_init.clone(),
cc_init.clone(),
*alpha,
*beta,
)
})
},
);
group.bench_with_input(
BenchmarkId::new("exec-devicecpu", ""),
&(
length,
aa_init.clone(),
bb_init.clone(),
cc_init.clone(),
alpha,
beta,
),
|b, (length, aa_init, bb_init, cc_init, alpha, beta)| {
b.iter(|| {
f2(
*length,
aa_init.clone(),
bb_init.clone(),
cc_init.clone(),
*alpha,
*beta,
)
})
},
);
group.finish()
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
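
The `axpy` and `gemv` benchmarks registered in Cargo.toml are not shown in this diff. For orientation only, here is a minimal sketch of what a serial AXPY kernel could look like with this API; it assumes the 1D `ViewOwned` constructor and accessors mirror the 2D ones used in the GEMM benchmark above, and the function name is made up, so the actual `benches/blas-speedup/axpy.rs` may differ.

```rust
// Hypothetical sketch only, not the actual benches/blas-speedup/axpy.rs.
use criterion::black_box;
use poc_kokkos_rs::{
    functor::KernelArgs,
    routines::{
        parallel_for,
        parameters::{ExecutionPolicy, ExecutionSpace, RangePolicy, Schedule},
    },
    view::{parameters::Layout, ViewOwned},
};

// Serial AXPY: y = alpha * x + y
fn axpy_serial(length: usize, x_init: Vec<f64>, y_init: Vec<f64>, alpha: f64) {
    let x = ViewOwned::new_from_data(x_init, Layout::Right, [length]);
    let mut y = ViewOwned::new_from_data(y_init, Layout::Right, [length]);
    black_box(&mut y);

    let execp = ExecutionPolicy {
        space: ExecutionSpace::Serial,
        range: RangePolicy::RangePolicy(0..length),
        schedule: Schedule::Static,
    };

    let axpy_kernel = |arg: KernelArgs<1>| match arg {
        KernelArgs::Index1D(i) => {
            let val = alpha * x.get([i]) + y.get([i]);
            y.set([i], val);
        }
        KernelArgs::IndexND(_) => unimplemented!(),
        KernelArgs::Handle => unimplemented!(),
    };
    parallel_for(execp, axpy_kernel).unwrap();
    black_box(&y);
}
```
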
File renamed without changes.