
GSoC 2018 Project Work submission

Student : Fady Essam

Organization : Boost

Mentor : Stefan Seefeld

Project title: Add GPU computations to uBLAS

Contents:

Overview

Boost

Boost is a collection of free, peer-reviewed C++ libraries. We emphasize libraries that work well with the C++ Standard Library. Boost libraries are intended to be widely useful, and usable across a broad spectrum of applications. The Boost license encourages both commercial and non-commercial use.

Boost.uBLAS

uBLAS is a C++ template class library that provides BLAS level 1, 2, 3 functionality for dense, packed and sparse matrices. The design and implementation unify mathematical notation via operator overloading and efficient code generation via expression templates.
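
To give a small taste of that design, here is a minimal sketch of plain (CPU) uBLAS usage; the sizes and fill values are arbitrary:

#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/io.hpp>
#include <iostream>

namespace ublas = boost::numeric::ublas;

int main()
{
  ublas::matrix<double> a(3, 3, 1.0); // 3x3 matrix filled with 1.0
  ublas::matrix<double> b(3, 3, 2.0); // 3x3 matrix filled with 2.0

  // operator overloading gives the mathematical notation; expression
  // templates defer evaluation until the assignments below
  ublas::matrix<double> c = 2.0 * a + b;
  ublas::matrix<double> d = ublas::prod(a, b); // matrix product (BLAS level 3)

  std::cout << c << "\n" << d << "\n";
}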

GSoC Idea

The project description is simple: add support for multicore parallel and GPU computations to uBLAS! The realization is not straightforward, though. Boost.uBLAS is CPU-only. If the compiler is able to vectorize, uBLAS can benefit from it. Here we want to extend Boost.uBLAS with support for parallel architectures and GPU computations, to enable it to handle big-data or deep-learning workloads.

The student will first have to understand how uBLAS works and how it generates and optimizes code with the expression-template mechanism, and then start adding options to enable the use of Boost.Compute. Tests will be done on multicore systems and on graphics cards or computers that support Boost.Compute (through OpenCL, for example).

We expect the basic matrix operations to be implemented this way. The code will have to be thoroughly documented and a tutorial document provided. We prefer quality of implementation over exhaustiveness.

Proposal

Here's a link to the accepted proposal (it includes the milestones too).

To sum it up, the proposal was to add functions that perform the matrix operations on any device that supports OpenCL, plus a mechanism for keeping matrices on the device if the user wants to do further computations there. I finished that during the project and then extended the scope to include vectors, their operations, and matrix-vector operations.

Documentation

Understanding the project

The usual uBLAS uses the CPU to operate on data sequentially, as opposed to operating on devices like GPUs (we will keep the GPU as our example). Matrix operations like the product (gemm, which is the heart of deep learning), element-wise operations, and others, as well as vector operations, can be done in parallel across many cores.

Doing operations on the GPU using OpenCL involves some fixed overhead, so it's not suitable for operating on small amounts of data.

But for big data (as in deep learning) the situation is totally different. For example, the product of two 2000x2000 matrices takes about 1,000,000 ms on the CPU (on my device, a Core i7 5500U), so the ~200 ms overhead of OpenCL is relatively negligible, and because the GPU has many more cores the same product took about 1,200 ms on the GPU (including copying the data back to the host).
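
If you want to reproduce such a measurement yourself, here is a minimal sketch using std::chrono and the opencl::prod overload documented below; the sizes and fill values are arbitrary, and the timings will of course differ per machine:

#define ENABLE_OPENCL
#include <boost/numeric/ublas/matrix.hpp>
#include <chrono>
#include <iostream>

namespace ublas = boost::numeric::ublas;
namespace opencl = boost::numeric::ublas::opencl;
namespace compute = boost::compute;

int main()
{
  opencl::library lib; //to initialize the opencl api

  compute::device device = compute::system::default_device();
  compute::context context(device);
  compute::command_queue queue(context, device);

  ublas::matrix<float> a(2000, 2000, 1); //arbitrary fill value
  ublas::matrix<float> b(2000, 2000, 1);

  auto t0 = std::chrono::steady_clock::now();
  ublas::matrix<float> c_cpu = ublas::prod(a, b); //plain uBLAS on the CPU
  auto t1 = std::chrono::steady_clock::now();
  ublas::matrix<float> c_gpu = opencl::prod(a, b, queue); //includes host<->device copies
  auto t2 = std::chrono::steady_clock::now();

  std::cout << "cpu: " << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count() << " ms\n";
  std::cout << "gpu: " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms\n";
}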

Here's a graph to give you an idea of the difference between CPU and GPU time versus matrix size (multiplying two matrices). It compares the performance of an Intel Core i7 5500U (CPU) and an AMD R5 255 (GPU).

[Graph: CPU vs GPU time against matrix size]

Let's zoom out a bit to get an idea of how the performance scales with size in both cases.

[Graph: CPU vs GPU time against larger matrix sizes]

How to use it (Get started)

Dependencies

  1. OpenCL: you must have the OpenCL SDK for the device you intend to run the OpenCL operations on

  2. clBLAS: you need to have the clBLAS library and build it on your system

How to set up the machine to enable the OpenCL uBLAS:

  • first you need to get the two dependencies described above (clBLAS & OpenCL):
    • OpenCL: download the OpenCL SDK provided by your vendor
    • clBLAS: download the library and use CMake to generate a build (a VS solution, for example) with the options you need, then build it for your device
  • you need to set their paths up in the Boost configuration file as follows:

      using opencl : : <include>path/to/cl.h <search>path/to/openclLibrary ;

      using clblas : : <include>path/to/clblas.h <search>path/to/clblasLibrary ;
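
For example, on a hypothetical setup (the paths below are placeholders; point them at wherever your OpenCL SDK and clBLAS build actually live) those two lines might look like:

      using opencl : : <include>C:/OpenCL-SDK/include/cl.h <search>C:/OpenCL-SDK/lib ;

      using clblas : : <include>C:/clBLAS/include/clblas.h <search>C:/clBLAS/build/library ;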
    
    

How to enable the opencl library functions in code:

  1. before including <boost/numeric/ublas/matrix.hpp> you must use "#define ENABLE_OPENCL" to enable including the OpenCL and clBLAS headers in the matrix.hpp file

  2. at the beginning of your code you should declare boost::numeric::ublas::opencl::library lib;

     to gain its constructor and destructor, which initialize (and release) the clBLAS library

  3. determine which device you want to use, and use its context and command queue in the operations, like:

     compute::device device = compute::system::devices().at(DEVICE_NUMBER_ON_THE_SYSTEM);
    
     compute::context context(device);
    
     compute::command_queue queue(context, device);
  4. congrats 😄 you've got the OpenCL operations working

note: this all might be unclear now, but refer to the tutorials below to get a clear understanding

How to run the benchmarks & generate a similar graph for your device vs the CPU

  • make sure you have set the clBLAS and OpenCL paths in the Boost configuration file, to be able to build the OpenCL benchmarks
  • go to the benchmarks folder and run the jamfile to build all the source files
  • note: if you want to change the benchmarked sizes, just open the operation's source file and edit the vector 'times' with the sizes you want (see the sketch after this list)
  • note: for a matrix operation a size means matrix(size, size), but for a vector operation it means vector(size)
  • run the operation(s) you need to plot
  • each operation you run will produce a file containing its benchmarking data
  • run plot.py and pass the paths of the benchmarking-data files as arguments; you can pass as many files as you like:

      python plot.py path/to/file1 path/to/file2

  • you've got your graph!
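
For example, the edit to an operation's source file might look like this (a sketch; the exact element type and surrounding code depend on that benchmark's source, 'times' being the vector of sizes mentioned above):

      std::vector<int> times = {10, 100, 500, 1000, 2000}; // benchmark sizes: matrix(size, size) or vector(size)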

How to run the OpenCL tests

  • make sure you have set the clBLAS and OpenCL paths in the Boost configuration file, to be able to build the OpenCL tests
  • run b2 in the test folder to build the tests; they will be built and run by default with the rest of the tests

Classes

  • boost::numeric::ublas::opencl::storage

    it is used as a tag to easily indicate that the data of a matrix resides on a device that supports OpenCL

  • boost::numeric::ublas::matrix<T, L, opencl::storage>

    it is a special case of the boost::numeric::ublas::matrix<T, L, A> class which indicates that the data of this matrix is not on the host, but on a device that supports OpenCL.

it supports functions like:

void from_host(boost::numeric::ublas::matrix<T, L, A>& m, boost::compute::command_queue& queue)

which takes a matrix already on the host and copies it to this matrix on the device, using the command queue passed as a parameter and that queue's device (the matrix on the device must have the same size1 and size2)

void to_host(boost::numeric::ublas::matrix<T, L, A>& m, boost::compute::command_queue& queue)

which takes a host matrix of the same size as the matrix on the device and copies the content from the device to the host

  • boost::numeric::ublas::vector<T, opencl::storage>

    it is a special case of boost::numeric::ublas::vector that works as a container for vectors on an OpenCL device and implements the same API as boost::numeric::ublas::matrix<T, L, opencl::storage>
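
To illustrate the container API above, here is a minimal sketch of a host-to-device-and-back round trip (the sizes and fill value are arbitrary; the device/queue setup is the same as in the examples below):

#define ENABLE_OPENCL
#include <boost/numeric/ublas/matrix.hpp>

namespace ublas = boost::numeric::ublas;
namespace opencl = boost::numeric::ublas::opencl;
namespace compute = boost::compute;

int main()
{
  opencl::library lib; //to initialize the opencl api

  compute::device device = compute::system::default_device();
  compute::context context(device);
  compute::command_queue queue(context, device);

  ublas::matrix<float> m(100, 100, 5); //host matrix
  ublas::matrix<float, ublas::basic_row_major<>, opencl::storage> m_device(100, 100, context); //device matrix of the same size

  m_device.from_host(m, queue); //copy host -> device
  // ... run opencl operations on m_device here ...
  m_device.to_host(m, queue);   //copy the content back into the host matrix
}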

Operations

note: all supported operations are smart enough to work with any combination of row_major and column_major matrices

Almost all operations implement this API, which has three overloaded functions:

  1. one takes 2 matrices (or vectors) already on an OpenCL device and outputs a matrix that stays on the same device
  2. one takes two matrices (or vectors) on the host, copies them to the device, does the operation, and copies the result back to the host (see the sketch after the prod API below)
  3. the same as (2), but returns the result as a return value

Here's the prod function API described in detail:

  • ublas::matrix<T, F, A> prod(ublas::matrix<T, F, A>& a, ublas::matrix<T, F, A>& b, boost::compute::command_queue& queue)

    it takes two matrices that are originally not on the GPU, moves them to the GPU, multiplies them on the
    queue, and returns the result

  • void prod(ublas::matrix<T, F, A>& a, ublas::matrix<T, F, A>& b, ublas::matrix<T, F, A>& result, boost::compute::command_queue& queue)

    does the same as the previous function, but takes a reference to the result matrix as input and puts the result values in it

  • void prod(ublas::matrix<T, F, opencl::storage>& a, ublas::matrix<T, F, opencl::storage>& b, ublas::matrix<T, F, opencl::storage>& result, boost::compute::command_queue& queue)

    does the same as the previous function, but the three matrices a, b & result are not on the host; they are all on the same device, and the queue belongs to that device too (it doesn't involve copying data to or from the host, so it's much faster)
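
For the second overload, which the examples further below don't cover, a minimal sketch looks like this (the sizes, fill values and device choice are arbitrary):

#define ENABLE_OPENCL
#include <boost/numeric/ublas/matrix.hpp>

namespace ublas = boost::numeric::ublas;
namespace opencl = boost::numeric::ublas::opencl;
namespace compute = boost::compute;

int main()
{
  opencl::library lib; //to initialize the opencl api

  compute::device device = compute::system::default_device();
  compute::context context(device);
  compute::command_queue queue(context, device);

  ublas::matrix<float> a(500, 500, 100);
  ublas::matrix<float> b(500, 500, 100);
  ublas::matrix<float> result(500, 500);

  //copies a and b to the device, multiplies them there,
  //and writes the product back into 'result' on the host
  opencl::prod(a, b, result, queue);
}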

Supported operations

| operation | CPU uBLAS | ublas::opencl |
| --- | --- | --- |
| prod (matrix-matrix) | ✔️ | ✔️ |
| prod (matrix-vector) | ✔️ | ✔️ |
| prod (vector-matrix) | ✔️ | ✔️ |
| inner_prod | ✔️ | ✔️ |
| outer_prod | ✔️ | ✔️ |
| trans | ✔️ | ✔️ |
| swap | ✔️ | ✔️ |
| element_prod | ✔️ | ✔️ |
| element_div | ✔️ | ✔️ |
| operator + (matrix-matrix) | ✔️ | ✔️ (as element_add) |
| operator + (vector-vector) | ✔️ | ✔️ (as element_add) |
| operator + (matrix-constant) | | ✔️ (as element_add) |
| operator + (vector-constant) | | ✔️ (as element_add) |
| operator - (matrix-matrix) | ✔️ | ✔️ (as element_sub) |
| operator - (vector-vector) | ✔️ | ✔️ (as element_sub) |
| operator - (matrix-constant) | | ✔️ (as element_sub) |
| operator - (vector-constant) | | ✔️ (as element_sub) |
| element_scale (matrix-constant) | | ✔️ (called element_scale and not element_prod because, for complex numbers, result(i,j).real = m(i,j).real * constant.real and result(i,j).imag = m(i,j).imag * constant.imag) |
| element_scale (vector-constant) | | ✔️ (same as above, with result[i] and v[i]) |
| norm1 | ✔️ | ✔️ (for vectors of double and float) |
| norm2 | ✔️ | ✔️ |

Any element-wise operation is also supported in ublas::opencl through the element_wise function.

Examples

1. using the OpenCL operations with data copied to the GPU and back

//enable including "opencl_core.hpp" and "operations.hpp" (must be done before including matrix.hpp to get the opencl functionality)
#define ENABLE_OPENCL 
#include <boost/numeric/ublas/matrix.hpp>

namespace ublas = boost::numeric::ublas;
namespace opencl = boost::numeric::ublas::opencl;
namespace compute = boost::compute;

int main()
{
  opencl::library lib; //to initialize the opencl api

  // choose the device you want to operate on and get its context and queue
  compute::device device = compute::system::devices().at(1); //change 1 to the device number you want or use default_device()
  compute::context context(device);
  compute::command_queue queue(context, device);


  ublas::matrix<float> a(500, 500, 100); //initialize it with any value (100 for example)
  ublas::matrix<float> b(500, 500, 100); //initialize it with any value (100 for example)


  ublas::matrix<float> result = opencl::prod(a, b, queue); //pass the command_queue of the device you want the operation executed on

}

2. using the OpenCL operations without copying data back and forth (the data is copied only once to the OpenCL device and then kept there to do multiple operations on it)

//enable including "opencl_core.hpp" and "operations.hpp" (must be done before including matrix.hpp to get the opencl functionality)
#define ENABLE_OPENCL 
#include <boost/numeric/ublas/matrix.hpp>

namespace ublas = boost::numeric::ublas;
namespace opencl = boost::numeric::ublas::opencl;
namespace compute = boost::compute;
typedef ublas::matrix<float, ublas::basic_row_major<>, opencl::storage> device_matrix;
typedef ublas::matrix<float> host_matrix;

int main()
{
  opencl::library lib; //to initialize the opencl api

  // choose the device you want to operate on and get its context and queue
  compute::device device = compute::system::devices().at(1); //change 1 to the device number you want or use default_device()
  compute::context context(device);
  compute::command_queue queue(context, device);


  host_matrix a(500, 500, 100); //initialize it with any value (100 for example)
  device_matrix a_device(a, queue); //queue is the command_queue that does the copying

  host_matrix b(500, 500, 100); //initialize it with any value (100 for example)
  device_matrix b_device(b, queue); //queue is the command_queue that does the copying

  //initialize result matrices on device to hold the result
  device_matrix result_prod_device(500, 500, context);
  device_matrix result_element_prod_device(500, 500, context);


  //note that no data copying from or to device happen here
  //so you can do multiple operations without copying back and forth
  opencl::prod(a_device, b_device, result_prod_device, queue); //pass the command_queue of the device you want the operation executed on
  opencl::element_prod(a_device, b_device, result_element_prod_device, queue); //pass the command_queue of the device you want the operation executed on



  //if you want to get the data into a host matrix
  host_matrix result_prod_host(500, 500);

  result_prod_device.to_host(result_prod_host, queue);

}

Implementation

Branch

This is the branch I was working on

Commit List

All the commits I made can be found here:

The files added during the project

  • added opencl/opencl_core.hpp: contains the classes and the library set-up API

  • added opencl/operations.hpp: has the implementation of all the supported operations

  • added all the files in test/opencl/: the files there test all the supported operations with all the data types those operations support. note: they are tested with each commit using CI (Travis CI and AppVeyor)

  • added benchmarks for all the uBLAS operations (the CPU ones and the OpenCL ones): the CPU benchmark files sit directly in the benchmarks/ folder, and all the OpenCL ones are in benchmarks/opencl/

note: the new benchmark programs produce files that are plotted into graphs for comparison using benchmarks/plot.py (for more information refer to the documentation above)

Acknowledgement

I want to thank my mentor Stefan Seefeld, who supported me with all the guidance and information needed while designing and implementing the project. I also want to thank Google for giving me the opportunity to contribute to such a big project and improve my experience through it.
