PiMD is a SIMD intrinsics library for the Broadcom VideoCoreIV-AG100-R GPU found on all Raspberry Pi models. No general-purpose libraries currently exist for the Raspberry Pi GPU. The goal of this library is to provide an accessible interface for taking advantage of the SIMD processors in the GPU, while providing performance comparable to application-specific QPU code. Our library is extremely versatile and can be used to implement nearly any purely data parallel algorithm.
For more complete information about the Raspberry Pi's GPU, please see the VideoCore IV 3D Architecture Reference Guide.
At the core of the VideoCore IV graphics processing unit is a set of 12 special-purpose floating-point shader processors, termed Quad Processors (QPUs). Each QPU can be regarded as a 16-way 32-bit SIMD processor with an instruction cycle time of four system clocks. Our library uses only 8 of the QPUs in order to make it easier to evenly divide work among them (see VPM).
Each QPU contains four general-purpose accumulators as well as two large register-file memories, each with 64 registers. 32 locations on each of the A and B regfiles are general purpose registers while the other 32 are used to access register-space I/O. Our library uses the first 32 locations on each regfile to maintain user-defined variables during computation.
Uniforms are 32-bit values stored in memory that can be read sequentially by the QPU. Our library uses uniforms to pass in function arguments, including memory addresses for general memory lookups (see TMU).
Each QPU has shared access to two Texture and Memory Lookup Units (TMUs). The TMUs can be used for general memory lookups. Each TMU has a FIFO request queue, allowing the pipelining of up to four memory requests. Our library takes full advantage of this by aggressively prefetching data in order to hide the latency of memory accesses.
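The prefetching pattern described above can be sketched on the CPU. The snippet below emulates a TMU-style FIFO of depth four: requests are issued ahead of consumption so that, in steady state, a result is always waiting when it is needed. The `tmu_fifo` type and its helper functions are our own illustrative names; the actual library issues TMU requests from QPU assembly.

```c
#include <assert.h>
#include <stddef.h>

#define TMU_FIFO_DEPTH 4  /* up to four outstanding requests per TMU */

/* A toy request queue standing in for the TMU FIFO. */
typedef struct {
    const float *slots[TMU_FIFO_DEPTH];
    size_t head, count;
} tmu_fifo;

static void tmu_request(tmu_fifo *q, const float *addr) {
    assert(q->count < TMU_FIFO_DEPTH);  /* never over-fill the FIFO */
    q->slots[(q->head + q->count++) % TMU_FIFO_DEPTH] = addr;
}

static float tmu_receive(tmu_fifo *q) {
    assert(q->count > 0);
    const float *addr = q->slots[q->head];
    q->head = (q->head + 1) % TMU_FIFO_DEPTH;
    q->count--;
    return *addr;  /* on real hardware, memory latency is paid here */
}

/* Sum an array while keeping the request queue full, so the latency of
 * each fetch overlaps with the requests issued before it. */
float sum_prefetched(const float *data, size_t n) {
    tmu_fifo q = {0};
    size_t issued = 0, consumed = 0;
    float total = 0.0f;

    while (issued < n && issued < TMU_FIFO_DEPTH)  /* prime the pipeline */
        tmu_request(&q, &data[issued++]);
    while (consumed < n) {
        total += tmu_receive(&q);                  /* consume oldest */
        consumed++;
        if (issued < n)
            tmu_request(&q, &data[issued++]);      /* keep queue full */
    }
    return total;
}
```

The same issue-early, receive-late structure is what the library's generated assembly follows, just with real TMU fetches in place of the toy queue.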
The Vertex Pipe Memory (VPM) is a 4KB memory buffer shared among all the QPUs and intended for transferring data between the GPU and main memory. The QPUs view the VPM as a 2D array of 32-bit words, 16 words wide and 64 words high. Our library partitions the VPM into eight 512-byte sections, allocating one to each QPU. Thus, each QPU is responsible for computing 8 16-wide 32-bit vectors at a time.
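The arithmetic behind this partitioning is easy to check: 4KB shared among 8 QPUs leaves 512 bytes per QPU, and each 16-wide vector of 32-bit words occupies 64 bytes, giving 8 vectors per partition. A few constants make the relationship explicit (the names are ours, not the library's):

```c
#include <stdint.h>

enum {
    VPM_BYTES     = 4 * 1024,                 /* total VPM size         */
    NUM_QPUS      = 8,                        /* QPUs used by PiMD      */
    LANES         = 16,                       /* SIMD width of one QPU  */
    WORD_BYTES    = sizeof(uint32_t),         /* 32-bit words           */

    BYTES_PER_QPU = VPM_BYTES / NUM_QPUS,     /* 512-byte partition     */
    VEC_BYTES     = LANES * WORD_BYTES,       /* 64 bytes per vector    */
    VECS_PER_QPU  = BYTES_PER_QPU / VEC_BYTES /* 8 vectors per QPU      */
};
```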
Each QPU has shared access to a Special Functions Unit (SFU) which can perform several less frequently used ‘exotic’ operations, including SQRT, RECIPSQRT, LOG, EXP. Our library provides access to these SFU functions.
Unlike Intel or NEON SIMD intrinsics, QPU code cannot be compiled alongside CPU code. QPU code is written in an assembly language specific to the QPU and compiled separately from any CPU code. In order to run code on the QPU, the compiled byte-code and all data must be structured in a specific way and passed in shared memory to the GPU. The primary goal of our library was to abstract this complicated process away from the user and allow them to easily implement data parallel algorithms without having to worry about writing assembly, transferring data, or dividing work among the QPUs.
We worked to make a general interface that could be used with nearly any easily parallelizable problem. We designed our library around a model of defining functions, consisting of QPU operations, and calling those functions on various inputs. This familiar model, used by nearly every higher-level programming language, makes our library extremely accessible, even to novice programmers.
Example code that implements SAXPY using the PiMD library:
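The original listing is not reproduced here, so the sketch below shows what such a function definition might look like, emulated on the CPU. The op names (`OP_LOAD`, `OP_SMUL`, `OP_VADD`, `OP_STORE`), the `pimd_*` types, and the interpreter are our assumptions, modeled on the OP_* and argument conventions described below; they are not the library's actual API.

```c
/* Hypothetical PiMD-style SAXPY (y = a*x + y), emulated on the CPU.
 * All names here are illustrative assumptions, not the real library. */
#include <stddef.h>

typedef enum { OP_LOAD, OP_SMUL, OP_VADD, OP_STORE } pimd_opcode;

typedef struct {
    pimd_opcode op;
    int arg;            /* index into the argument list */
} pimd_op;

typedef union {
    float scalar;       /* same syntax for scalars, vectors, pointers */
    float *vec;
} pimd_arg;

/* Apply each op to every element in turn, mimicking the "single
 * working vector" model to which all operations are applied. */
static void pimd_run(const pimd_op *fn, size_t nops,
                     pimd_arg *args, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float work = 0.0f;                      /* working-vector lane */
        for (size_t k = 0; k < nops; k++) {
            pimd_arg a = args[fn[k].arg];
            switch (fn[k].op) {
            case OP_LOAD:  work = a.vec[i];  break;
            case OP_SMUL:  work *= a.scalar; break;
            case OP_VADD:  work += a.vec[i]; break;
            case OP_STORE: a.vec[i] = work;  break;
            }
        }
    }
}

/* SAXPY as a series of ops: vector load, scalar float multiply,
 * vector float add, and a store -- the sequence the text describes. */
void saxpy(float a, float *x, float *y, size_t n) {
    pimd_op fn[] = {
        { OP_LOAD,  0 },   /* working = x[i] */
        { OP_SMUL,  1 },   /* working *= a   */
        { OP_VADD,  2 },   /* working += y[i] */
        { OP_STORE, 2 },   /* y[i] = working  */
    };
    pimd_arg args[] = { { .vec = x }, { .scalar = a }, { .vec = y } };
    pimd_run(fn, sizeof fn / sizeof fn[0], args, n);
}
```

On the real library, the same four-operation definition would be assembled to QPU code and executed across the 16-wide vectors rather than interpreted per element.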
Our model abstracts away all notions of size related to the QPU execution. From the user's perspective, operations are being applied to every element in their input array simultaneously. In our model, there is a single working vector to which all operations are applied. Arguments to operations on this working vector are fetched and loaded automatically by our library, which optimally prefetches values in order to hide memory latency. In addition, there are four hardware variables that can be used to store and retrieve values during computation. These variables are implemented as groups of registers, allowing users to use multiple variables without requiring more expensive memory requests.
Our library defines a set of operations that correspond to instructions on the QPU. Users create QPU functions by defining a series of these instructions. As you can see in the SAXPY example above, our function is defined as a vector load, a scalar float multiply, a vector float add, and a store. Each operation defines the input that it requires (for example, OP_ADD, OP_SADD, and OP_VADD take a variable, scalar, and vector argument respectively). This interface is simple yet powerful, with function operations that map almost directly to hardware instructions.
Arguments passed to PiMD functions must adhere to the specification implicitly defined by the series of operations that compose that function. In order to minimize boilerplate code needed to define arguments, the burden of ensuring that arguments are the correct type is passed along to the user. As a result, the same syntax is used to create arguments from integers, floats, and pointers, which can be seen in the SAXPY example above.
In order to assess the performance of our library, we compared the performance of a variety of functions on the following implementations.
We tested algorithms encompassing a variety of memory and computation profiles in order to see how our library performs and what types of problems are well suited for execution on the QPU.
From the above graphs we can see that our library performed extremely well. It repeatedly outperformed both the CPU and NEON SIMD implementations on all varieties of tasks. Although these tests may not be an entirely fair comparison because the CPU and NEON SIMD implementations were limited to one thread, they give us a good indication that the library is effective in speeding up data parallel computations.
We can also see that relative to the other implementations, our library performed the best on the compute-bound test consisting of only floating-point operations. This suggests that the PiMD library is better suited for algorithms with high arithmetic intensities, as opposed to ones that require large memory transfers. This is unsurprising, as transferring data to and from memory is often a limiting factor in other GPGPU frameworks.
After implementing this library we learned quite a bit about the architecture and capabilities of the Broadcom VideoCoreIV-AG100-R GPU. Our major takeaways are:
Our PiMD library provides an accessible, flexible, and powerful interface for the VideoCore GPU. The main benefits of our library are:
As mentioned previously, our comparison may be slightly unfair because the CPU and NEON SIMD implementations are single-threaded. It would be insightful to compare the performance of our library, which takes advantage of 8 QPUs, to multi-threaded CPU and NEON SIMD implementations that can take advantage of the Raspberry Pi's quad-core CPU.
Our results suggest that PiMD may also be well suited for image manipulations, which are often highly data parallel with high arithmetic intensity. Assessing the performance of our library on the following image manipulation algorithm could provide further useful data.
The majority of our time so far has been spent reading through the VideoCore IV 3D Architecture Reference Guide to gain a better understanding of the GPU architecture. We have also read through much of the relevant Raspberry Pi userland source code and learned how to interface with the GPU through the mailbox property interface. As a proof of concept, we have successfully used this interface to pull various diagnostic data from the GPU, including clock speed and temperature. We feel that we have gained a sufficient understanding of these components and believe that our project, as outlined in our proposal, is feasible.
We have determined that we will need to use an assembler in order to translate QPU assembly instructions into object code that can be executed on the QPU. We have decided to use vc4asm, as it seems to have the most features and best support among the QPU assemblers that we found. Writing the assembly code should not be difficult: we have a very good understanding of how to transfer memory between the GPU and the host using QPU assembly instructions, and our library will only consist of primitive vector operations such as add, multiply, etc., which map nicely to instructions in the QPU instruction set.
We have also looked at some existing code and have successfully run trivial programs on the QPU. We have also done a significant amount of informal testing in order to better understand the capabilities and limitations of various parts of the GPU architecture and instruction set, including TMU memory fetches, VPM memory writes, branch statements, and QPU synchronization instructions.
We are slightly behind our originally proposed schedule due to other commitments that have taken priority in the past couple of weeks. Our current project schedule is still the one outlined in our proposal, and we feel that we are still on pace to deliver all deliverables listed there. Writing the assembly code should not be challenging because we have a clear understanding of how to transfer memory to and from the GPU, and our library only consists of primitive vector operations which map extremely nicely to QPU instructions.
For the parallelism competition we will deliver a library interface that allows users to process vectors of data using the QPUs. This library will support all standard primitive logical and mathematical operations such as add, multiply, and, or, etc. At the competition we plan to show a demo of a common parallel algorithm running on the QPU. We will also demo a comparison of the same algorithm being run using only the CPU, using the NEON SIMD on the CPU, and (possibly) using existing graphics libraries on the Raspberry Pi. We will also include graphs of performance that we will have previously measured.
As mentioned previously, we now have a solid understanding of the GPU architecture and how to interface with it, so there are no major issues regarding knowledge of the problem. The majority of work going forward will be designing and implementing our library as well as writing the multiple implementations of the algorithms we will use to test it. There will likely be some significant design considerations and trade-offs when defining the interface for our library. While these are not explicit concerns, they are decisions that have to be made thoughtfully to ensure that our library is general purpose and easy to use while remaining powerful.
We will implement a SIMD intrinsics library for the Broadcom VideoCoreIV-AG100-R GPU found on all Raspberry Pi models. We will compare the performance of common data parallel algorithms implemented with our library to implementations using the traditional ARM CPU in the Raspberry Pi.
In addition to an ARM CPU, all Raspberry Pi models have a VideoCore IV graphics processing unit. At the core of this architecture is a set of 12 special-purpose floating-point shader processors, termed Quad Processors (QPUs). For all intents and purposes, each QPU can be regarded as a 16-way 32-bit SIMD processor with an instruction cycle time of four system clocks. Internally, each QPU is a 4-way SIMD processor multiplexed to 16 ways by executing the same instruction for four clock cycles on four different 4-way vectors. Each QPU is dual-issue and contains two independent (and asymmetric) ALUs, an 'add' unit and a 'mul' unit, allowing the QPU to perform one 32-bit floating-point vector add and one 32-bit floating-point vector multiply on each clock cycle.
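This multiplexing scheme can be modeled on the CPU: one logical 16-wide instruction executes as four passes over 4-wide sub-vectors, one pass per clock. The sketch below is an illustrative model of that execution pattern, not QPU code.

```c
#define LANES 16   /* logical SIMD width seen by the programmer */
#define PHYS   4   /* physical 4-way SIMD datapath inside a QPU */

/* One logical 16-wide vector add, issued as four 4-wide passes:
 * the same instruction repeats for four "clocks", each handling a
 * different 4-element slice of the 16-element vector. */
void qpu_vadd_model(float *dst, const float *a, const float *b) {
    for (int clock = 0; clock < LANES / PHYS; clock++)  /* 4 clocks  */
        for (int lane = 0; lane < PHYS; lane++) {       /* 4-way SIMD */
            int i = clock * PHYS + lane;
            dst[i] = a[i] + b[i];
        }
}
```

The result is indistinguishable from a true 16-wide add, which is why the QPU can be treated as a 16-way SIMD processor with a four-clock instruction cycle.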
The architecture of this GPU is designed specifically for multimedia processing, namely audio, video, and graphics. The QPUs are closely coupled to 3D hardware on the chip specifically for fragment shading. Raspbian, the primary operating system for the Raspberry Pi, provides hardware-accelerated implementations of OpenGL ES 1.1, OpenGL ES 2.0, OpenVG 1.1, EGL, OpenMAX, and 1080p30 H.264 high-profile decoding, which take advantage of this partially specialized hardware. There is currently no library for more general (non-multimedia) data parallel computation on the Raspberry Pi that uses the QPUs without the additional 3D computation hardware. Such a library would allow developers to massively improve the performance of data parallel algorithms on the Raspberry Pi by not limiting them to the ARM CPU alone.
The primary challenge of this project is gaining a sufficient understanding of the VideoCore IV GPU in order to implement a SIMD library using its architecture. Outside of the existing Raspberry Pi source code and an architecture reference guide released by Broadcom, there is very little documentation about this hardware. Creating our library will require a great deal of reading and experimenting to determine what steps need to be taken in order to execute instructions on the GPU.
In addition, we will need to figure out how to access the QPUs directly and avoid using unnecessary hardware components of the 3D pipeline on the GPU. This is especially tricky because the hardware was designed for fragment shading in order to provide reasonably efficient implementations of the aforementioned libraries. As a result, the existing source code (and to a lesser extent the architecture reference) will be less than perfectly helpful when writing our SIMD library.
We will also have to give careful consideration when defining the interface for our library. We need to ensure that our library functions are specific enough to be mapped to the QPUs but general enough to allow our library to be used in a wide variety of data parallel problems.
We will be running and testing our code on a Raspberry Pi 2 Model B. Our code base will start from scratch but will rely on the GPU interface functions defined in the Raspberry Pi userland source code. We will also be heavily using the official Broadcom VideoCore IV 3D Architecture Reference Guide while investigating ways to interface with the GPU.
Our library should make it possible to implement a wide variety of data-parallel algorithms to run on the Raspberry Pi and achieve better performance than implementations that run only on the ARM CPU. Ignoring memory latency and other sources of overhead, the perfect speedup for our library using the 12 250MHz QPUs over a single 900MHz CPU core is 3.33x. More practically, we hope to achieve at least a 2x speedup on completely data parallel algorithms. Ideally we would like to see a 2.5x-3x speedup after further optimizing our library.
We will demo our library by explaining the interface and types and functions it contains. We will show speedup graphs of our library on various data parallel algorithms and possibly show program output if our tests produce compelling results.
We will be writing and testing our code through Raspbian running on a Raspberry Pi 2 Model B for obvious reasons. The majority of our library will be written in C, likely with some C++ or assembly, as these are the prototypical systems languages and the ones used in the Raspberry Pi source code to interface with the GPU.