Hw3 #430

Open
wants to merge 15 commits into base: HW3

20 changes: 17 additions & 3 deletions HW3/P2/mandelbrot.cl
@@ -9,11 +9,25 @@ mandelbrot(__global __read_only float *coords_real,
const int y = get_global_id(1);

float c_real, c_imag;
float z_real, z_imag;
float z_real, z_imag, z_real_new, z_imag_new;
float mag2;
int iter;

if ((x < w) && (y < h)) {
// YOUR CODE HERE
;
iter = 1;
c_real = coords_real[y*w + x];
c_imag = coords_imag[y*w + x];
z_real = c_real;
z_imag = c_imag;
mag2 = z_real*z_real + z_imag*z_imag;
while ((mag2 < 4) && (iter < max_iter)){
z_real_new = z_real*z_real - z_imag*z_imag + c_real;
z_imag_new = 2*z_real*z_imag + c_imag;
mag2 = z_real_new*z_real_new + z_imag_new*z_imag_new;
z_real = z_real_new;
z_imag = z_imag_new;
iter = iter + 1;
}
out_counts[y*w + x] = iter;
}
}
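For a quick sanity check of the escape-time loop added above, here is a serial C sketch that mirrors the kernel's iteration; the sample points and the max_iter value of 511 are arbitrary choices for illustration, not part of the assignment.

#include <stdio.h>

/* Same iteration as the kernel: z starts at c, iter starts at 1,
   and the loop stops when |z|^2 >= 4 or max_iter is reached. */
static int mandel_iters(float c_real, float c_imag, int max_iter) {
    float z_real = c_real, z_imag = c_imag;
    float mag2 = z_real * z_real + z_imag * z_imag;
    int iter = 1;
    while ((mag2 < 4) && (iter < max_iter)) {
        float z_real_new = z_real * z_real - z_imag * z_imag + c_real;
        float z_imag_new = 2 * z_real * z_imag + c_imag;
        mag2 = z_real_new * z_real_new + z_imag_new * z_imag_new;
        z_real = z_real_new;
        z_imag = z_imag_new;
        iter = iter + 1;
    }
    return iter;
}

int main(void) {
    printf("c = 0 + 0i       -> %d iterations (never escapes, hits max_iter)\n",
           mandel_iters(0.0f, 0.0f, 511));
    printf("c = 2 + 2i       -> %d iterations (|z|^2 >= 4 immediately)\n",
           mandel_iters(2.0f, 2.0f, 511));
    printf("c = -0.75 + 0.1i -> %d iterations\n",
           mandel_iters(-0.75f, 0.1f, 511));
    return 0;
}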
13 changes: 13 additions & 0 deletions HW3/P3/P3.txt
@@ -0,0 +1,13 @@

The best configuration for my machine is:

configuration ('coalesced', 512, 128): 0.000331392 seconds

The coalesced read is faster than the blocked read on average, for the same number of
work groups and workers, because more threads can do useful work on the same block of
fetched memory. In the blocked reads, once a thread fetches its block to sum, other
threads may have to wait to fetch their own blocks of memory. In the coalesced reads,
a fetched block of memory is more likely to contain elements needed by several threads
than in the blocked scheme, so more threads can sum elements simultaneously more often.
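To make the difference concrete, here is a minimal serial C sketch of the two index patterns; the array length and the pretend thread count are made up for illustration. Each line prints which elements of x a given thread would touch.

#include <stdio.h>

#define N     20   /* pretend array length */
#define GSIZE 4    /* pretend get_global_size(0) */

int main(void) {
    int k = (N + GSIZE - 1) / GSIZE;   /* ceil(N / GSIZE): block length per thread */

    for (int tid = 0; tid < GSIZE; tid++) {
        printf("thread %d coalesced:", tid);
        for (int i = tid; i < N; i += GSIZE)            /* stride = global size */
            printf(" x[%d]", i);
        printf("\n");

        printf("thread %d blocked:  ", tid);
        for (int j = 0; j < k && tid * k + j < N; j++)  /* contiguous chunk */
            printf(" x[%d]", tid * k + j);
        printf("\n");
    }
    return 0;
}

Adjacent threads read adjacent elements in the coalesced scheme, so one fetched block of memory serves several threads at once; in the blocked scheme each thread walks its own separate chunk.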
52 changes: 42 additions & 10 deletions HW3/P3/sum.cl
@@ -5,11 +5,16 @@ __kernel void sum_coalesced(__global float* x,
{
float sum = 0;
size_t local_id = get_local_id(0);

int i, j, gID, gSize, temp, lSize, loglSize;

gID = get_global_id(0);
gSize = get_global_size(0);
lSize = get_local_size(0);

// thread i (i.e., with i = get_global_id()) should add x[i],
// x[i + get_global_size()], ... up to N-1, and store in sum.
for (;;) { // YOUR CODE HERE
; // YOUR CODE HERE
for (i = gID; i < N; i += gSize) {
sum = sum + x[i];
}

fast[local_id] = sum;
@@ -24,8 +29,17 @@
// You can assume get_local_size(0) is a power of 2.
//
// See http://www.nehalemlabs.net/prototype/blog/2014/06/16/parallel-programming-with-opencl-and-python-parallel-reduce/
for (;;) { // YOUR CODE HERE
; // YOUR CODE HERE
loglSize = 1;
temp = lSize >> 1;
while (temp > 1){
temp = temp >> 1;
loglSize = loglSize + 1;
}

Review comment: Instead of using the while loop above, you can incorporate this into the
following for loop by setting the initial j to lSize >> 1 (i.e., lSize/2) and decreasing it
by half on each iteration.
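A short serial C sketch of this suggestion (lSize below is an arbitrary power of two): starting j at lSize >> 1 and halving it yields exactly the offsets lSize >> 1, lSize >> 2, ..., 1 that the submitted loop derives from loglSize, so the separate log2 computation can be dropped.

#include <stdio.h>

int main(void) {
    int lSize = 128;                          /* example local size (power of two) */
    for (int j = lSize >> 1; j > 0; j >>= 1)
        printf("reduction pass with offset %d\n", j);
    /* In the kernel each pass would then be:
     *   if (local_id < j) fast[local_id] += fast[local_id + j];
     *   barrier(CLK_LOCAL_MEM_FENCE);
     */
    return 0;
}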

for (j = 1; j <= loglSize; j++) {
if (local_id < (lSize >> j)) {
fast[local_id] = fast[local_id] + fast[local_id + (lSize >> j)];
}
barrier(CLK_LOCAL_MEM_FENCE);
}

if (local_id == 0) partial[get_group_id(0)] = fast[0];
@@ -38,7 +52,8 @@ __kernel void sum_blocked(__global float* x,
{
float sum = 0;
size_t local_id = get_local_id(0);
int k = ceil(float(N) / get_global_size(0));
int k = ceil((float)N / get_global_size(0));
int j, gID, temp, loglSize, lSize, minS;

// thread with global_id 0 should add 0..k-1
// thread with global_id 1 should add k..2k-1
@@ -48,8 +63,16 @@
//
// Be careful that each thread stays in bounds, both relative to
// size of x (i.e., N), and the range it's assigned to sum.
for (;;) { // YOUR CODE HERE
; // YOUR CODE HERE
lSize = get_local_size(0);
gID = get_global_id(0);
if (k-1 < N - k*gID){
minS = k;
}
else{
minS = N - k*gID;
}

Review comment: You can simplify the if statement above and just put the bounds into the
following for loop condition: ...; j < k*(gID + 1) && (k*gID + j) < N; ...
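One way to read this suggestion, as a serial C sketch with made-up sizes: keep j as the within-block offset and fold both bounds into the loop condition, so the separate minS computation is no longer needed.

#include <stdio.h>

int main(void) {
    int N = 10, gSize = 4;                    /* toy problem size and thread count */
    int k = (N + gSize - 1) / gSize;          /* ceil(N / gSize) */

    for (int gID = 0; gID < gSize; gID++) {
        float sum = 0;
        for (int j = 0; j < k && k * gID + j < N; j++)
            sum += (float)(k * gID + j);      /* stand-in for x[k*gID + j] */
        printf("thread %d partial sum = %g\n", gID, sum);
    }
    return 0;
}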

for (j = 0; j < minS; j++) {
sum = sum + x[k*gID + j];
}

fast[local_id] = sum;
@@ -64,8 +87,17 @@ __kernel void sum_blocked(__global float* x,
// You can assume get_local_size(0) is a power of 2.
//
// See http://www.nehalemlabs.net/prototype/blog/2014/06/16/parallel-programming-with-opencl-and-python-parallel-reduce/
for (;;) { // YOUR CODE HERE
; // YOUR CODE HERE
loglSize = 1;
temp = lSize >> 1;
while (temp > 1){
temp = temp >> 1;
loglSize = loglSize + 1;
}
for (j = 1; j <= loglSize; j++) {
if (local_id < (lSize >> j)) {
fast[local_id] = fast[local_id] + fast[local_id + (lSize >> j)];
}
barrier(CLK_LOCAL_MEM_FENCE);
}

if (local_id == 0) partial[get_group_id(0)] = fast[0];
4 changes: 2 additions & 2 deletions HW3/P3/tune.py
@@ -23,7 +23,7 @@ def create_data(N):
times = {}

for num_workgroups in 2 ** np.arange(3, 10):
partial_sums = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, 4 * num_workgroups + 4)
partial_sums = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, 4 * num_workgroups)
host_partial = np.empty(num_workgroups).astype(np.float32)
for num_workers in 2 ** np.arange(2, 8):
local = cl.LocalMemory(num_workers * 4)
@@ -40,7 +40,7 @@ def create_data(N):
format(num_workgroups, num_workers, seconds))

for num_workgroups in 2 ** np.arange(3, 10):
partial_sums = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, 4 * num_workgroups + 4)
partial_sums = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, 4 * num_workgroups)
host_partial = np.empty(num_workgroups).astype(np.float32)
for num_workers in 2 ** np.arange(2, 8):
local = cl.LocalMemory(num_workers * 4)
85 changes: 72 additions & 13 deletions HW3/P4/median_filter.cl
@@ -1,5 +1,6 @@
#include "median9.h"


// 3x3 median filter
__kernel void
median_3x3(__global __read_only float *in_values,
@@ -12,23 +13,81 @@ median_3x3(__global __read_only float *in_values,
// Note: It may be easier for you to implement median filtering
// without using the local buffer, first, then adjust your code to
// use such a buffer after you have that working.
int gID, lID, x, y, lx, ly, gSizeX, gSizeY,
lSizeX, lSizeY, xTemp, yTemp, xUse, yUse,
buf_corner_x, buf_corner_y, buf_x, buf_y, row;
// the code below is adapted from the lecture code on halos
x = get_global_id(0);
y = get_global_id(1);
lx = get_local_id(0);
ly = get_local_id(1);
gSizeX = get_global_size(0);
gSizeY = get_global_size(1);
lSizeX = get_local_size(0);
lSizeY = get_local_size(1);


gID = gSizeX*y + x;
lID = lSizeX*ly + lx;

buf_corner_x = x - lx - halo;
buf_corner_y = y - ly - halo;

// Load into buffer (with 1-pixel halo).
//
// It may be helpful to consult HW3 Problem 5, and
// https://github.com/harvard-cs205/OpenCL-examples/blob/master/load_halo.cl
//
// Note that globally out-of-bounds pixels should be replaced
// with the nearest valid pixel's value.
buf_x = lx + halo;
buf_y = ly + halo;

if ((y < h) && (x < w)){

Review comment: This if statement should not be here. It may cause partial population of the
buffer (specifically, it can skip loading the column just past the right-most column of the
input matrix). You should look again at HW3 Problem 5 and
https://github.com/harvard-cs205/OpenCL-examples/blob/master/load_halo.cl
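For reference, a serial mock-up of the clamped halo load used by the linked load_halo.cl example; the image size and the workgroup position below are invented. There is no outer bounds check around the load, and the min/max-style clamping also handles coordinates that are more than one pixel out of bounds, which is the concern raised in the comments further down.

#include <stdio.h>

static int clampi(int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); }

int main(void) {
    enum { W = 5, H = 4, HALO = 1, LSIZE = 3 };            /* toy sizes */
    enum { BUF_W = LSIZE + 2 * HALO, BUF_H = LSIZE + 2 * HALO };
    float in_values[H][W], buffer[BUF_H][BUF_W];

    for (int y = 0; y < H; y++)                            /* fake input image */
        for (int x = 0; x < W; x++)
            in_values[y][x] = (float)(y * W + x);

    /* Workgroup whose first pixel is (3, 3): its halo runs off the right
       and bottom edges of the image, so those reads must be clamped. */
    int buf_corner_x = 3 - HALO, buf_corner_y = 3 - HALO;

    for (int lid = 0; lid < BUF_W; lid++) {                /* one "thread" per buffer column */
        int x_use = clampi(buf_corner_x + lid, 0, W - 1);
        for (int row = 0; row < BUF_H; row++) {
            int y_use = clampi(buf_corner_y + row, 0, H - 1);
            buffer[row][lid] = in_values[y_use][x_use];
        }
    }

    for (int row = 0; row < BUF_H; row++) {                /* print the loaded tile */
        for (int lid = 0; lid < BUF_W; lid++) printf("%5.0f", buffer[row][lid]);
        printf("\n");
    }
    return 0;
}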

if (lID < buf_w){ // only work with buf_w threads
xTemp = buf_corner_x + lID;
xUse = xTemp;
if (xTemp < 0){ // if pixel out of bounds, add compensation steps to find closest in bound pixel
xUse += 1;
}
if (xTemp > w - 1){
xUse -= 1;

Review comment: This would lead to accessing wrong values if the buffer extends more than
one pixel past the right edge of the input matrix.

}
for (row = 0; row < buf_h; row++) {
yTemp = buf_corner_y + row;
yUse = yTemp;
if (yTemp < 0){
yUse += 1;
}
if (yTemp > h - 1){
yUse -= 1;

Review comment: This would lead to accessing wrong values if the buffer extends more than
one pixel past the bottom edge of the input matrix.

}
buffer[row * buf_w + lID] = in_values[yUse*gSizeX + xUse]; // copy the pixel (or the closest in-bounds pixel) from global memory into the buffer
}
}
}

// Compute 3x3 median for each pixel in core (non-halo) pixels
//
// We've given you median9.h, and included it above, so you can
// use the median9() function.
barrier(CLK_LOCAL_MEM_FENCE);
if ((y < h) && (x < w)){
out_values[gID] = median9(buffer[(buf_y-1)*buf_w + (buf_x-1)], // take median of 8 neighbors and current pixel
buffer[(buf_y-1)*buf_w + (buf_x)],
buffer[(buf_y-1)*buf_w + (buf_x+1)],
buffer[(buf_y)*buf_w + (buf_x-1)],
buffer[(buf_y)*buf_w + (buf_x)],
buffer[(buf_y)*buf_w + (buf_x+1)],
buffer[(buf_y+1)*buf_w + (buf_x-1)],
buffer[(buf_y+1)*buf_w + (buf_x)],
buffer[(buf_y+1)*buf_w + (buf_x+1)]);
}

// Load into buffer (with 1-pixel halo).
//
// It may be helpful to consult HW3 Problem 5, and
// https://github.com/harvard-cs205/OpenCL-examples/blob/master/load_halo.cl
//
// Note that globally out-of-bounds pixels should be replaced
// with the nearest valid pixel's value.

// Each thread in the valid region (x < w, y < h) should write
// back its 3x3 neighborhood median.

// Compute 3x3 median for each pixel in core (non-halo) pixels
//
// We've given you median9.h, and included it above, so you can
// use the median9() function.


// Each thread in the valid region (x < w, y < h) should write
// back its 3x3 neighborhood median.
}
7 changes: 4 additions & 3 deletions HW3/P4/median_filter.py
@@ -1,8 +1,8 @@
from __future__ import division
import pyopencl as cl
import numpy as np
import imread
import pylab
import os.path

def round_up(global_size, group_size):
r = global_size % group_size
@@ -51,15 +51,16 @@ def numpy_median(image, iterations=10):
properties=cl.command_queue_properties.PROFILING_ENABLE)
print 'The queue is using the device:', queue.device.name

program = cl.Program(context, open('median_filter.cl').read()).build(options='')
curdir = os.path.dirname(os.path.realpath(__file__))
program = cl.Program(context, open('median_filter.cl').read()).build(options=['-I', curdir])

host_image = np.load('image.npz')['image'].astype(np.float32)[::2, ::2].copy()
host_image_filtered = np.zeros_like(host_image)

gpu_image_a = cl.Buffer(context, cl.mem_flags.READ_WRITE, host_image.size * 4)
gpu_image_b = cl.Buffer(context, cl.mem_flags.READ_WRITE, host_image.size * 4)

local_size = (8, 8) # 64 pixels per work group
local_size = (4, 4)  # 16 pixels per work group
global_size = tuple([round_up(g, l) for g, l in zip(host_image.shape[::-1], local_size)])
width = np.int32(host_image.shape[1])
height = np.int32(host_image.shape[0])
86 changes: 86 additions & 0 deletions HW3/P5/P5.txt
@@ -0,0 +1,86 @@


Explanation:

Part 1:

This is the base code.

Part 2:

This improves on Part 1 because each buffer value is updated with its grandparent's value,
which is guaranteed to be less than or equal to the current buffer value.
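A serial C sketch of the grandparent idea (the label array below is invented; the real kernel does this per pixel, in parallel): replacing each label with its grandparent's label can only keep it the same or lower it, and it collapses chains much faster than following a single parent link per iteration.

#include <stdio.h>
#include <string.h>

int main(void) {
    /* labels[i] points at an index whose label is <= labels[i] (a chain here) */
    int labels[8] = {0, 0, 1, 2, 3, 4, 5, 6};
    int n = 8;

    for (int pass = 1; pass <= 3; pass++) {
        int old[8];
        memcpy(old, labels, sizeof(old));    /* snapshot, mimicking one parallel pass */
        for (int i = 0; i < n; i++)
            labels[i] = old[old[i]];         /* grandparent update: never increases a label */
        printf("after pass %d:", pass);
        for (int i = 0; i < n; i++) printf(" %d", labels[i]);
        printf("\n");
    }
    return 0;
}

With grandparent updates, the chain of length 7 collapses to all zeros in three passes, whereas following one parent link per pass would take seven.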

Part 3:

This improves on Part 2 because a pixel's parent label is updated, via atomic min, to the
pixel's own value whenever that value is smaller. However, the time per iteration increases
because of the atomic (min) operation.
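As an illustration of where the atomic cost comes from, here is one common way an atomic fetch-min can be built from compare-and-swap in plain C11 (this is only an analogy; the kernel uses OpenCL's built-in atomic_min). When several threads hit the same location, failed compare-and-swap attempts force retries, and that is the serialization that makes each iteration slower.

#include <stdatomic.h>
#include <stdio.h>

/* Lower *obj to value if value is smaller; retry on contention. */
static void fetch_min(atomic_int *obj, int value) {
    int old = atomic_load(obj);
    while (value < old &&
           !atomic_compare_exchange_weak(obj, &old, value)) {
        /* a failed exchange refreshes old; loop again while we can still lower it */
    }
}

int main(void) {
    atomic_int parent;
    atomic_init(&parent, 7);
    fetch_min(&parent, 3);   /* lowers the label to 3 */
    fetch_min(&parent, 5);   /* no effect: 5 is not smaller than 3 */
    printf("parent = %d\n", atomic_load(&parent));   /* prints 3 */
    return 0;
}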

Part 4:

Having a single thread update the buffer regions with grandparent values is less efficient
on average: the time per iteration is roughly twice that of Part 3. Even though many adjacent
pixels have equal buffer values after enough iterations, the reduced number of memory reads
does not outweigh the loss of parallelism between threads. If more threads were used, for
example with smaller work-group tiles, even more reads of the labels array would occur. So
having one thread remember previously fetched grandparent values may start to perform better
than having every thread fetch from memory simultaneously (which partially serializes, since
more threads end up waiting on memory) as the number of threads grows even larger.

Part 5:

If a standard min operation were used instead of atomic min, the time per iteration would
decrease because the serialization imposed by the atomic operation goes away. The final result
would still be correct: even if one thread overwrites a pixel's parent with a larger value than
another thread wrote, that value is still no larger than the parent's original value, so at
worst the number of iterations increases. Note that a label can only be "increased" in this
sense within a single iteration. Between iterations, label values cannot increase, because each
pixel's value from the previous iteration is combined with any new label through the min
operator. Thus, after the current iteration finishes, every label is less than or equal to its
value from the previous iteration.


Results:

Maze 1

Part1:

Finished after 915 iterations, 36.084992 ms total, 0.0394371497268 ms per iteration
Found 2 regions

Part 2:

Finished after 529 iterations, 20.321376 ms total, 0.0384146994329 ms per iteration
Found 2 regions

Part 3:

Finished after 12 iterations, 0.611552 ms total, 0.0509626666667 ms per iteration
Found 2 regions

Part 4:

Finished after 11 iterations, 1.224416 ms total, 0.111310545455 ms per iteration
Found 2 regions

Maze 2

Part 1:
Finished after 532 iterations, 20.138752 ms total, 0.0378547969925 ms per iteration
Found 35 regions

Part 2:
Finished after 276 iterations, 10.62384 ms total, 0.038492173913 ms per iteration
Found 35 regions

Part 3:
Finished after 11 iterations, 0.539008 ms total, 0.0490007272727 ms per iteration
Found 35 regions

Part 4:
Finished after 10 iterations, 1.11216 ms total, 0.111216 ms per iteration
Found 35 regions

