Skip to content

Latest commit

 

History

History
368 lines (293 loc) · 15.2 KB

irregular-flattening.rst

File metadata and controls

368 lines (293 loc) · 15.2 KB

Irregular Flattening

In this chapter, we investigate a number of challenging irregular algorithms, which cannot be dealt with directly using Futhark's moderate flattening technique discussed in :numref:`moderate-flattening`.

Flattening by Expansion

For dealing with large non-regular problems, we need ways to regularise the problems so that they become tractable with the regular parallel techniques that we have seen demonstrated previously. One way to regularise a problem is by padding data such that the data fits a regular parallel schema. However, by doing so, we run the risk that the program will use too many parallel resources for computations on the padding data. This problem will arise, in particular, if the data is very irregular. As a simple, and also visualisable, example, consider the task of determining the points that make up a number of line segments given by sets of two points in a 2D grid. Whereas we may easily devise an algorithm for determining the grid points that make up a single line segment, it is not immediately obvious how we can efficiently regularise the problem of drawing multiple line segments, as each line segment will end up being represented by a different number of points. If we choose to implement a padding regularisation scheme by introducing a notion of ''an empty point'', each line can be represented as the same number of points, which will allow us to map over an array of such line points for processing the lines using regular parallelism. However, the cost we pay is that even the smallest line will be represented as the same number of points as the longest line.

Another strategy for regularisation is to flatten the irregular parallelism into regular parallelism and use segmented operations to process each particular object. It turns out that there, in many cases, is a simple approach to implement such flattening, using, as we shall see, a technique called expansion, which will take care of all the knitty gritty details of the flattening. The expansion approach is centered around a function that we shall call expand, which, as the name suggests, expands a source array into a longer target array, by expanding each individual source element into multiple target elements, which can then be processed in parallel.

For implementing the expand function using only parallel operations, we shall make use of the segmented helper functions defined in :numref:`regular-flattening`. In particular, we shall make use of the functions replicated_iota, segmented_replicate, and segmented_iota.

Here is the generic type of the expand function:

val expand 'a 'b : (sz: a -> i32) -> (get: a -> i32 -> b) -> []a -> []b

The function expands a source array into a target array given (1) a function that determines, for each source element, how many target elements it expands to and (2) a function that computes a particular target element based on a source element and the target element number associated with the source. As an example, the expression expand (\x->x) (*) [2,3,1] returns the array [0,2,0,3,6,0]. The function is defined as follows:

.. literalinclude:: src/segmented.fut
   :lines: 69-73


Drawing Lines

In this section we demonstrate how to apply the flattening-by-expansion technique for obtaining a work efficient line drawing routine that draws lines fully in parallel. The technique resembles the development by Blelloch :cite:`blelloch1990vector` with the difference that it makes use of the expand function defined in the previous section. Given a number of line segments, each defined by its end points (x_1,y_1) and (x_2,y_2), the algorithm will find the set of all points constituting all the line segments.

We first present an algorithm that will find all points that constitutes a single line segment. For computing this set, observe that the number of points that make up the constituting set is the maximum of |x_2-x_1| and |y_2-y_1|, the absolute values of the difference in x-coordinates and y-coordinates, respectively. Using this observation, the algorithm can independently compute the constituting set by first calculating the proper direction and slope of a line, relative to a particular starting point.

The simple line drawing routine is given as follows:

.. literalinclude:: src/lines_seq.fut
   :lines: 4-26

Futhark code that uses the linepoints function for drawing concrete lines is shown below:

.. literalinclude:: src/lines_seq.fut
   :lines: 28-48

The function main sets up a grid and calls the function drawlines, which takes care of sequentially updating the grid with constituting points for each line, computed using the linepoints function. The resulting points look like this:

.. only:: latex

   .. image:: img/lines_grid.pdf
      :width: 600px

.. only:: html

   .. image:: img/lines_grid.svg

An unfortunate problem with the line drawing routine shown above is that it draws the lines sequentially, one by one, and therefore makes only very limited use of a GPU's parallel cores. There are various ways one may mitigate this problem. One way could be to use map to draw lines in parallel. However, such an approach will require some kind of padding to ensure that the map function will compute data of the same length, no matter the length of the line. A more resource aware approach will apply a flattening technique for computing all points defined by all lines simultaneously. Using the expand function defined in the previous section, all we need to do to implement this approach is to provide (1) a function that determines for a given line, the number of points that make up the line and (2) a function that determines the n'th point of a particular line, given the index n. The code for such an approach looks as follows:

.. literalinclude:: src/lines_flat2.fut
   :lines: 28-49

Notice that the function get_point_in_line distinguishes between whether the number of points in the line is counted by the x-axis or the y-axis. Notice also that the flattening technique can be applied only because all lines have the same color. Otherwise, when two lines intersect, the result would be undefined, due to the fact that scatter results in undefined behaviour when multiple values are written into the same location of an array.

Drawing Triangles

Another example of an algorithm worthy of flattening is an algorithm for drawing triangles. The algorithm that we present here is based on the assumption that we already have a function for drawing multiple horizontal lines in parallel. Luckily, we have such a function! The algorithm is based on the property that any triangle can be split into an upper triangle with a horizontal baseline and a lower triangle with a horizontal ceiling. Just as the algorithm for drawing lines makes use of the expand function defined earlier, so will the flattened algorithm for drawing triangles. A triangle is defined by the three points representing the corners of the triangle:

type triangle = (point, point, point)

We shall make the assumption that the three points that define the triangle have already been sorted according to the y-axis. Thus, we can assume that the first point is the top point, the third point is the lowest point, and the second point is the middle point (according to the y-axis).

The first function we need to pass to the expand function is a function that determines the number of horizontal lines in the triangle:

.. literalinclude:: src/triangles.fut
   :lines: 62-63

The second function we need to pass to the expand function is somewhat more involved. We first define a function dxdy, which computes the inverse slope of a line between two points:

.. literalinclude:: src/triangles.fut
   :lines: 65-69

We can now define the function that, given a triangle and the horizontal line number in the triangle (counted from the top), returns the corresponding line:

.. literalinclude:: src/triangles.fut
   :lines: 71-85

The function distinguishes between whether the line to compute resides in the upper or the lower subtriangle. Finally, we can define a parallel, work-efficient function that converts a number of triangles into lines:

.. literalinclude:: src/triangles.fut
   :lines: 87-91

To see the code in action, here is a function that draws three triangles on a grid of height 30 and width 62:

.. literalinclude:: src/triangles.fut
   :lines: 91-97

The function makes use of both the lines_of_triangles function that we have defined here and the work efficient drawlines function defined previously. Here is a plot of the result:

.. only:: latex

   .. image:: img/triangles_grid.pdf
      :width: 600px

.. only:: html

   .. image:: img/triangles_grid.svg

Primes by Expansion

We saw earlier in :numref:`counting-primes` how we could implement a parallel algorithm for finding the number of primes below a given number. We also found, however, that the algorithm presented was not work-efficient. It is possible to implement a work-efficient algorithm using the expand function. We will leave the task as an exercise for the reader.

Complex Flattening

Unfortunately, the flattening-by-expansion technique does not suit all irregular problems. We shall now investigate how we can flatten a highly irregular algorithm such as quick-sort. The Quick-sort algorithm can be presented very elegantly in a functional language. The function qsort that we will define has the following type:

val qsort 't [n] : (t -> t -> bool) -> [n]t -> [n]t

Given a comparison function (<=) and an array of elements xs, qsort (<=) xs returns an array with the elements in xs sorted according to <=. Consider the following pseudo-code, which, unfortunately, is not immediately Futhark code:

def qsort (<=) xs =
  if length xs < 2 then xs
  else let (left,middle,right) = partition (<=) xs[length xs / 2] xs
       in qsort (<=) left ++ middle ++ qsort (<=) right

Here the function partition returns three arrays with the first array containing elements smaller than the pivot element xs[length xs / 2], the second array containing elements equal to the pivot element, and the third array containing elements that are greater than the pivot element. There are multiple problems with this code. First, the code makes use of recursion, which is not supported by Futhark. Second, the kind of recursion used is not tail-recursion, which means that it is not directly obvious how to eliminate the recursion. Third, it is not clear how the code can avoid using an excessive amount of memory instead of making use of inplace-updates for the sorting. Finally, it seems that the code is inherently task-parallel in nature and not particularly data-parallel.

The solution is to solve a slightly more general problem. More precisely, we shall set out to sort a number of segments, simultaneously, where each segment comprises a part of the array. Notice that we are interested in supporting a notion of partial segmentation, for which the segments of interest are disjoint but do not necessarily together span the entire array. In particular, the algorithm does not need to sort segments containing previously chosen pivot values. Such segments are already located in the correct positions, which means that they need not be moved around by the segmented quick sort implementation.

We first define a type sgm that specifies a segment of an underlying one-dimensional array of values:

.. literalinclude:: src/quick_sort.fut
   :lines: 25-25

At top-level, the function qsort is defined as follows, assuming a function step of type (t -> t -> bool) -> *[n]t -> []sgm -> (*[n]t,[]sgm):

.. literalinclude:: src/quick_sort.fut
   :lines: 89-93

The step function is called initially with the array to be sorted as argument together with a singleton array containing a segment denoting the entire array to be sorted. The step function is called iteratively until the returned array of segments is empty. The job of the step function is to divide each segment into three new segments based on pivot values found for each segment. After the step function has reordered the values in the segments, the middle segment (containing values equal to a pivot) need not be dealt with again in the further process. A new array of segment descriptors is then defined and after removing empty segment descriptors, the resulting array of non-empty segment descriptors is returned by the step function together with the reordered value array.

Before we can define the step function, we first define a few helper functions. Using the functions segmented_iota and segmented_replicate, defined earlier, we can define a function for finding all the indexes represented by an array of segments:

.. literalinclude:: src/quick_sort.fut
   :lines: 28-33

We also define a function info that, given an ordering function and two elements, returns -1 if the first element is less than the second element, 0 if the elements are identical, and 1 if the first element is greater than the second element:

.. literalinclude:: src/quick_sort.fut
   :lines: 14-16

The following two functions tripit and tripadd are used for converting the classification of elements into subsegments:

.. literalinclude:: src/quick_sort.fut
   :lines: 18-23

We can now define the function step that, besides from an ordering function, takes as arguments (1) the array containing values and (2) an array of segments to be sorted. The function returns a pair of a reordered array of values and a new array of segments to be sorted:

.. literalinclude:: src/quick_sort.fut
   :lines: 35-79

The algorithm has best case work complexity O(n) (when all elements are identical), worst case work complexity O(n^2), and an average case work complexity of O(n \log n). It has best depth complexity O(1), worst depth complexity O(n) and average depth complexity O(\log n).