I switched gears since the checkpoint and proposal. The CloudLight implementation was more of a graphics project and didn't actually have as much to do with parallel programming as I had hoped.
Instead, I made a CPU based parallel BVH builder for raytracing, utilizing both multithreading with OpenMP and SIMD through intrinsics. This builder is heavily inspired by two of Ingo Wald's papers. 1 2
Raytracing is a method to generate images based on tracing the path of light. A virtual camera is placed in the scene, and rays are traced through pixels in the screen towards the scene geometry. For each ray, we find the closest intersection and color it.
My raytracer is based on texts by Peter Shirley, but is heavily modified and stripped of features. I also make a few assumptions:
For each ray that intersects the scene, we create two rays - a shadow ray and a diffuse ray. The shadow ray performs another ray-scene intersection but is directed at the point light. If we hit an object and the distance is less than the distance to the point light, we are in shadow. Otherwise we add the contribution of the light.
Here is an image generated with just the shadow rays.
The diffuse rays help us account for indirect light, such as light bouncing off nearby walls. We cap the maximum number of bounces, and also perform a russian roulette to determine if the ray should terminate at each bounce. (You could trace all rays to max depth, but given an equal amount of runtime I felt that the russian roulette generated images with less noise.) The next image has both shadow rays and diffuse rays.
We also create multiple rays per pixel, so that we can reduce the noise in our images.
While raytracing is conceptually rather simple, a quick examination shows us a large problem.
for every pixel for every sample for every triangle check intersection ...
For the image above, that is 640480100*36 = 1,105,920,000 ray triangle intersections! Not including bounce or shadow rays!
Essentially, we are in O(pixels * samples * triangles). As a project focused on parallelism, there isn't much we can do to reduce the number of pixels or samples - both result in a degradation in image quality. Instead, we trace many rays at the same time using OpenMP.
However at this point the raytracer is still fairly slow on larger images, only getting a few thousand rays/sec. We need to do something about the triangles.
A bounding volume hierarchy is essentially a spatial tree. We divide the scene into two bounding boxes at each node, until we reach leaf elements (the triangles). This allows us to perform intersection tests against the nodes instead of going through every single triangle. If we don't hit a node, it's very clear we will not hit any triangles in that node.
There are many ways to implement a BVH. The simplest is to sort all the elements relative to some axis (x, y, z) and subdivide down the median. This allows us to get our raytracer up to around 1 million rays/sec on simpler scenes. This is much better than before, but a better way of creating a BVH is to use a Surface Area Heuristic (SAH).
For a true SAH based BVH, you consider every possible splitting plane amongst the triangles in your node, and calculate the bounding boxes of all triangles to the left and right of the splitting plane. The best plane to split is the one with the least surface area, since we expect rays to be coming from any direction with equal probability. This slide from 15-462 has a bit more detail.
The first paper by Wald suggests creating bins that span the bounding box of the centroids of all triangles in our current node. We create these bins along the widest axis of our centroid bounding box.
We can iterate through all the triangles in our node and place them in a bin based on their centroid, and keep track of the bounding box for each bin (full bounding box, not just centroids). This allows us to approximate a SAH much faster than sweeping across all triangles. In fact, as few as 8 or 4 bins seems to perform quite a good job, but 16 was the sweet spot for me (and Wald). Our raytracer can now do a few million rays/sec on more complicated scenes. Much better than the median split BVH. The remaining optimizations do not significantly affect the quality of the created BVH, so raytracing results will be omitted.
There are three phases when we construct the BVH:
Plane evaluation is done serially because it must be. Besides, there are only 16 bins to check against so this is a trivially small portion of our computation.
We can parallelize across binning and partitioning, assigning a different portion of our triangles to different threads, while being careful about cache alignment boundaries to prevent false sharing. For binning, each thread updates its own local bins, and a reduction is performed serially across the bins afterwards.
For partitioning, we have two global queues where elements are pushed either to the left or right based on which side of the splitting plane they fall on. A single consumer thread takes these elements from the queue and places them into a different triangle array. To save on memory usage, we allocate 2 arrays of N triangles, and swap between the two arrays at each step of the node. Since the BVH is constructed recursively, we are guaranteed that no two threads will be modifying the same area of the array.
And since the BVH is constructed recursively, we can spawn a new task to produce the left and right subtrees of the current node. However this means that we would probably require a few levels of BVH node construction before we can create enough tasks to saturate our CPU. Wald mentions these horizontal/vertical work sharing issues and proposes adaptive switching solutions and ...
There's a simpler way: OpenMP Tasks. By representing all of the parallelism in our code as OpenMP tasks, we essentially avoid this problem altogether.
OpenMP tasks are basically treated as a work queue. Every task created will be pushed on the task queue, and threads will take tasks from the queue as they complete other ones. There is no guarantee on the order these tasks will execute, so we do need to add a little bit of synchronization to our code.
Thanks to OpenMP doing some heavy lifting, the horizontal/vertical work sharing problem is eliminated and we see almost perfect speedup.
The speedup graph was generated using my quad-core (8 threads) laptop CPU, and a Power8 node configured to be SMT1. As can se seen in the graph, speedup on my laptop CPU tapers off at around 4 cores, indicating that we are compute bound. We also see that Sibenik doesn't achieve ideal speedup. This is because Sibenik is significantly smaller than Sponza and thus doesn't generate enough work to saturate my CPU.
On just about every modern CPU, we have SIMD processing capability. My laptop has AVX2, which is 8 floats wide. So this is potentially another 8x speedup we are leaving untapped if we ignore SIMD!
This is where Wald's second paper comes into play.
First, we need to restructure all of our data since the current layout is not SIMD friendly. Wald adopts an Array of Structures of Arrays utilizing quantized and packed unorm10 data type to fit 16 elements into 3 cache lines. Since I'm not using the Xeon Phi, I don't have hardware support for unorm10, and since we are compute heavy it would be a very bad idea to add more compute to deal with this.
Instead I adopt the following layout:
In the fragments, we store the minimum and maximum coordinates of the bounding box for each triangle, as well as an ID to that triangle into a global array. This layout allows us to load chunks of the element arrays directly into SIMD, instead of relying on a gather if we simply had Arrays of Structs.
The binning phase is done 1 element at a time. This might sound a bit odd, since we just laid it out to utilize the SIMD, and each element should only fall into one bin. There are a few reasons for doing it this way.
Instead, we opt to update all 8 bins in parallel. Since previously an element mapped only into one bin, we need to change the meaning of our bins. Instead of bins being the region between splitting planes, in my code the bins are the splitting planes. Wald doesn't mention this, but it also allows us to do a cute trick where we can actually treat this as 9 separate bins in our 8 wide SIMD, so we even get a little bit more quality than representing 1 bin per SIMD lane.
Each element is placed either to the left or right variables relative to the bin using mask and blend intrinsics. We keep count of the number of elements to the right (since left is just n-right), and update the bounding boxes of the left and right elements.
Wald also suggests performing 3 passes over each fragment, and updating the x, then y, then z, respectively in each of the passes to avoid possible issues with regards to SIMD registers having to be swapped to/from memory due to oversaturation. I saw no benefit in my code from doing this, in fact it got slightly worse. The reason for this is doing 3 passes results in extra computation that we avoid in 1 pass.
Same as before, but we don't need to compute the bounding boxes since we've been updating them when we bin the elements!
Huh. Our SIMD is 8 wide so we expect 8x speedup with 1 thread. Instead we only see 1.8x speedup. What's going on?
You might have noticed I skipped the partitioning section when discussing SIMD optimizations. Let's look at another graph comparing serial vs SIMD single threaded execution time.
As you can see, our binning is indeed about 8x faster. Evaluation is pretty trivial too. What's going on with the partitioning?
My laptop supports AVX2. Wald wrote his paper for the Phi, which supports AVX512. The key difference between AVX2 and AVX512 in this regard is AVX2 does not have a scatter instruction. We can compute the centroid bounding boxes with SIMD to pass onto the constructors for our children nodes (this is the slight speedup you see here), but we are stuck partitioning the elements serially again!
So it's clear now why Wald picked the Xeon Phi instead of a CPU, and it's not just because of the massive boost in compute.
AVX512 also has a lot more instructions available. What Wald does in 12 masked SIMD min/maxes I have to do in 48 operations: Blend, And, Convert, Min.
This raises a few questions with regards to CPU BVH building:
Based on the measurements I took for sorting based BVH, a fully parallel sorting based BVH was built as quicky as a single threaded SAH based build. Assuming the best case scenario, we can only expect at most 2x speedup from a sorting based build. This might not be the right answer.
For memory regards, this also suggests that other architectures be considered, such as FPGA based approaches where you could stream the data through the accelerator in one pass, changing this from being compute heavy to bandwidth bound.
The SIMD version uses almost 3x as much memory due to precomputation requirements.
Here are some images I made during testing:
Sponza looks really boring without textures.
I don't understand gh-pages.