PiMD
Pi (Instruction) Multiple Data


Patrick Koenig & Nate Horan

Writeup

Summary

PiMD is a SIMD intrinsics library for the Broadcom VideoCoreIV-AG100-R GPU found on all Raspberry Pi models. Currently, no general-purpose compute libraries exist for the Raspberry Pi GPU. The goal of this library is to provide an accessible interface for taking advantage of the SIMD processors in the GPU while providing performance comparable to application-specific QPU code. Our library is extremely versatile and can be used to implement nearly any purely data-parallel algorithm.


Background

For more complete information about the Raspberry Pi's GPU, please see the VideoCore IV 3D Architecture Reference Guide.

QPU

At the core of the VideoCore IV graphics processing unit is a set of 12 special-purpose floating-point shader processors, termed Quad Processors (QPUs). Each QPU can be regarded as a 16-way 32-bit SIMD processor with an instruction cycle time of four system clocks. Our library uses only 8 of the QPUs in order to make it easier to evenly divide work among them (see VPM).


Registers

Each QPU contains four general-purpose accumulators as well as two large register-file memories, each with 64 registers. 32 locations on each of the A and B regfiles are general purpose registers while the other 32 are used to access register-space I/O. Our library uses the first 32 locations on each regfile to maintain user-defined variables during computation.


Uniforms

Uniforms are 32-bit values stored in memory that can be read sequentially by the QPU. Our library uses uniforms to pass in function arguments, including memory addresses for general memory lookups (see TMU).


TMU

Each QPU has shared access to two Texture and Memory Lookup Units (TMUs). The TMUs can be used for general memory lookups. Each TMU has a FIFO request queue, allowing the pipelining of up to four memory requests. Our library takes full advantage of this by aggressively prefetching data in order to hide the latency of memory accesses.
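
The idea is easiest to see in host-side form. The following is a minimal sketch that only models the pattern (the queue and helper functions are stand-ins invented for exposition; on the QPU, issuing a request is a write to a TMU address register and receiving a result is a read of the TMU result register): each request is issued four elements ahead of the one being consumed, so the latency of each fetch is hidden behind useful work.

#include <cstdio>
#include <queue>

// Host-side model of the TMU's four-deep request FIFO (illustrative only).
static std::queue<float> fifo;                        // stands in for the TMU FIFO
static void issue_request(const float *addr) { fifo.push(*addr); }
static float receive_result() { float v = fifo.front(); fifo.pop(); return v; }

int main() {
    const int FIFO_DEPTH = 4, n = 16;
    float data[n];
    for (int i = 0; i < n; i++) data[i] = (float)i;

    // Prime the FIFO, then keep each new request four elements ahead of
    // the one being consumed, so memory latency overlaps computation.
    for (int i = 0; i < FIFO_DEPTH && i < n; i++) issue_request(&data[i]);
    for (int i = 0; i < n; i++) {
        float v = receive_result();                   // oldest outstanding request
        if (i + FIFO_DEPTH < n) issue_request(&data[i + FIFO_DEPTH]);
        printf("%f\n", v);                            // stand-in for real work
    }
    return 0;
}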


VPM

The Vertex Pipeline Memory (VPM) is a 4KB memory buffer shared among all the QPUs and intended for transferring data between the GPU and main memory. The QPUs view the VPM as a 2D array of 32-bit words, 16 words wide and 64 words high. Our library partitions the VPM into eight 512-byte sections, allocating one to each QPU. Thus, each QPU is responsible for computing eight 16-wide 32-bit vectors at a time.


SFU

Each QPU has shared access to a Special Functions Unit (SFU) which can perform several less frequently used 'exotic' operations, including SQRT, RECIPSQRT, LOG, and EXP. Our library provides access to these SFU functions.
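
For instance, an elementwise square root might look like the sketch below, using the function interface described in the Approach section. This is a sketch only: OP_SQRT is a hypothetical identifier following the OP_* naming convention, and it assumes SFU operations take no extra argument; the actual names are defined in pimd.h.

#include <pimd.h>

// Sketch: elementwise square root via the SFU. OP_SQRT is a hypothetical
// identifier; the real names for the SFU operations are defined in pimd.h.
int vsqrt(int N, float X[], float result[]) {
    int mb = pimd_open();

    PimdArg ops[] = {
        OP_VLOAD,   // Load a vector of X
        OP_SQRT,    // SFU square root (hypothetical name)
        OP_STORE    // Store the result
    };
    PimdFunction function = PimdFunction(mb, ops, 3);

    PimdArg args[] = { &X, &result };   // OP_SQRT needs no argument
    int ret = function.call(args, 2, N, 10000);

    function.free();
    pimd_close(mb);
    return ret;
}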

Approach

Unlike Intel or NEON SIMD intrinsics, QPU code cannot be compiled alongside CPU code. QPU code is written in an assembly language specific to the QPU and compiled separately from any CPU code. In order to run code on the QPU, the compiled byte-code and all data must be structured in a specific way and passed in shared memory to the GPU. The primary goal of the library is to abstract this complicated process away from the user and allow them to easily implement data-parallel algorithms without having to worry about writing assembly, transferring data, or dividing work among the QPUs.
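
For reference, the raw launch sequence that the library hides looks roughly like the following sketch. It is modeled on the mailbox helpers from the Raspberry Pi hello_fft sample (mbox_open, qpu_enable, mem_alloc, mem_lock, mapmem, execute_qpu); the alignment, flag value, and buffer layout here are illustrative rather than our library's actual code.

#include "mailbox.h"   // mailbox helpers from the Raspberry Pi hello_fft sample

// Sketch of the raw QPU launch sequence that PiMD wraps.
static int run_on_qpus(unsigned num_qpus, unsigned size) {
    int mb = mbox_open();
    qpu_enable(mb, 1);                                  // power up the QPUs

    unsigned handle = mem_alloc(mb, size, 4096, 0x4);   // GPU-side buffer
    unsigned bus = mem_lock(mb, handle);                // address the GPU sees
    void *virt = mapmem(bus & 0x3fffffff, size);        // ARM-side mapping

    // Copy the assembled QPU code, uniforms, and input data into the
    // buffer, then build the per-QPU table of {uniforms address, code
    // address} that execute_qpu() takes as its control list. (Omitted.)

    unsigned ret = execute_qpu(mb, num_qpus, bus, 1, 10000);  // launch and wait

    unmapmem(virt, size);
    mem_unlock(mb, handle);
    mem_free(mb, handle);
    qpu_enable(mb, 0);
    mbox_close(mb);
    return (int)ret;
}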


We worked to make a general interface that could be used with nearly any easily parallelizable problem. We designed our library around a model of defining functions, consisting of QPU operations, and calling those functions on various inputs. This familiar model, used by nearly every higher-level programming language, makes our library extremely accessible, even to novice programmers.


Example code that implements SAXPY using the PiMD library:

#include <pimd.h>

int saxpy(int N, float scale, float X[], float Y[], float result[]) {
    // Open the mailbox interface to interact with the GPU.
    int mb = pimd_open();

    // Define the operations
    PimdArg ops[] = {
        OP_VLOAD,   // Load a vector
        OP_SFMUL,   // Scale
        OP_VFADD,   // Second vector
        OP_STORE    // Result
    };

    // Create the function
    PimdFunction function = PimdFunction(mb, ops, 4);

    // Define the PiMD arguments
    PimdArg args[] = {
        &X,         // Input vector
        scale,      // Scale
        &Y,         // Second vector
        &result     // Result
    };

    // Call the function
    int ret = function.call(args, 4, N, 10000);

    // Free the shared memory mapped by the function call
    function.free();

    // Close the mailbox interface
    pimd_close(mb);

    return ret;
}

Abstraction

Our model abstracts away all notions of size related to QPU execution. From the user's perspective, operations are applied to every element of the input array simultaneously. In our model, there is a single working vector to which all operations are applied. Arguments to operations on this working vector are fetched and loaded automatically by our library, which prefetches values in order to hide memory latency. In addition, there are 4 hardware variables that can be used to store and retrieve values during computation. These variables are implemented as groups of registers, allowing users to use multiple variables without requiring more expensive memory requests.
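
For illustration, the sketch below computes 2 * (X + Y) by saving the intermediate sum into a variable and adding it back onto the working vector. OP_VARST (save the working vector into variable 0) and OP_FADD (add a variable, in the float flavor) are hypothetical names extrapolated from the naming conventions in the Operations section below, and the argument encoding for variable indices is an assumption.

#include <pimd.h>

// Sketch: result = 2 * (X + Y), holding the intermediate sum in hardware
// variable 0. OP_VARST and OP_FADD are hypothetical identifiers
// extrapolated from the library's OP_* naming convention.
int double_sum(int N, float X[], float Y[], float result[]) {
    int mb = pimd_open();

    PimdArg ops[] = {
        OP_VLOAD,   // working vector <- X
        OP_VFADD,   // working vector += Y
        OP_VARST,   // variable 0 <- working vector (hypothetical)
        OP_FADD,    // working vector += variable 0 (hypothetical)
        OP_STORE    // result <- working vector
    };
    PimdFunction function = PimdFunction(mb, ops, 5);

    PimdArg args[] = { &X, &Y, 0, 0, &result };  // 0s select variable 0 (assumed)
    int ret = function.call(args, 5, N, 10000);

    function.free();
    pimd_close(mb);
    return ret;
}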


Operations

Our library defines a set of operations that correspond to instructions on the QPU. Users create QPU functions by defining a series of these operations. As can be seen in the SAXPY example above, that function is defined as a vector load, a scalar float multiply, a vector float add, and a store. Each operation defines the input that it requires (for example, OP_ADD, OP_SADD, and OP_VADD take a variable, scalar, and vector argument respectively). This interface is simple yet powerful because its operations map almost directly to hardware instructions.
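
For illustration, a function computing result = (X + Y) + c on integer vectors might look like the following sketch, showing the argument kind each operation consumes (OP_VADD a vector, OP_SADD a scalar). It assumes OP_VLOAD and OP_STORE also accept integer data, and otherwise mirrors the SAXPY example above.

#include <pimd.h>

// Sketch: result = (X + Y) + c, showing the argument kind each op takes.
int add_offset(int N, int c, int X[], int Y[], int result[]) {
    int mb = pimd_open();

    PimdArg ops[] = {
        OP_VLOAD,   // working vector <- X
        OP_VADD,    // working vector += Y  (vector argument)
        OP_SADD,    // working vector += c  (scalar argument)
        OP_STORE    // result <- working vector
    };
    PimdFunction function = PimdFunction(mb, ops, 4);

    PimdArg args[] = { &X, &Y, c, &result };
    int ret = function.call(args, 4, N, 10000);

    function.free();
    pimd_close(mb);
    return ret;
}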


Arguments

Arguments passed to PiMD functions must adhere to the specification implicitly defined by the series of operations that compose the function. In order to minimize the boilerplate code needed to define arguments, the burden of ensuring that arguments have the correct type is passed along to the user. As a result, the same syntax is used to create arguments from integers, floats, and pointers, as can be seen in the SAXPY example above.


Results


Testing

In order to assess our library, we compared the performance of a variety of functions across the following implementations.

  • CPU, single-threaded
  • CPU NEON SIMD, single-threaded
  • PiMD GPU library

We tested algorithms encompassing a variety of memory and computation profiles in order to see how our library performs and what types of problems are well suited for execution on the QPUs (a sketch of a timing harness for these comparisons follows the list below).

  • Benchmarks
    • Bandwidth-Bound: Repeatedly load and store every element in an input array to and from main memory. We used a large enough collection of memory addresses to ensure that values are not being cached. We tested this benchmark on inputs with 5M, 10M, 15M, and 20M elements.
    • Compute-Bound: Repeatedly perform floating-point operations on a 1024-element array. We tested this benchmark on functions with 100K, 200K, 300K, and 400K operations.
  • Algorithms
    • SAXPY: Compute a * X + Y, where a is a scalar and X, Y are vectors. We tested SAXPY on inputs with 5M, 10M, 15M, and 20M elements.
  • Image Processing
    • Gaussian Blur: 2D convolution implementing a Gaussian blur. We tested this function on a variety of input images and the performance and results using our library were comparable to other Gaussian blur implementations.
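
A minimal wall-clock harness for this kind of comparison might look like the following sketch; saxpy_cpu is a reference loop written here for illustration, and saxpy() refers to the PiMD example in the Approach section.

#include <cstdio>
#include <ctime>

// Declaration of the PiMD saxpy() from the example in the Approach section.
int saxpy(int N, float scale, float X[], float Y[], float result[]);

// Reference single-threaded CPU implementation to compare against.
static void saxpy_cpu(int N, float a, const float *X, const float *Y, float *out) {
    for (int i = 0; i < N; i++)
        out[i] = a * X[i] + Y[i];
}

static double now_sec() {
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec + t.tv_nsec * 1e-9;
}

int main() {
    const int N = 20 * 1000 * 1000;   // 20M elements, as in the benchmarks
    float *X = new float[N], *Y = new float[N], *out = new float[N];
    for (int i = 0; i < N; i++) { X[i] = 1.0f; Y[i] = 2.0f; }

    double t0 = now_sec();
    saxpy_cpu(N, 2.0f, X, Y, out);
    double t_cpu = now_sec() - t0;

    t0 = now_sec();
    saxpy(N, 2.0f, X, Y, out);
    double t_gpu = now_sec() - t0;

    printf("CPU %.3fs  PiMD %.3fs  speedup %.2fx\n", t_cpu, t_gpu, t_cpu / t_gpu);
    delete[] X; delete[] Y; delete[] out;
    return 0;
}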

Speedup

  • Bandwidth-Bound
    • NEON from Seq: 1.82x
    • PiMD from Seq: 3.76x
    • PiMD from NEON: 2.07x

  • Compute-Bound
    • NEON from Seq: 1.25x
    • PiMD from Seq: 40.66x
    • PiMD from NEON: 32.53x

  • SAXPY
    • NEON from Seq: 1.11x
    • PiMD from Seq: 3.79x
    • PiMD from NEON: 3.41x

From these results we can see that our library performed extremely well. It repeatedly outperformed both the CPU and NEON SIMD implementations on all varieties of tasks. Although these tests may not be an entirely fair comparison because the CPU and NEON SIMD implementations were limited to one thread, they give us a good indication that the library is effective in speeding up data-parallel computations.

We can also see that, relative to the other implementations, our library performed best on the compute-bound test consisting of only floating-point operations. This suggests that the PiMD library is better suited to algorithms with high arithmetic intensity, as opposed to ones that require large memory transfers. This is unsurprising, as transferring data to and from memory is often a limiting factor in other GPGPU frameworks.


Conclusion

After implementing this library we learned quite a bit about the architecture and capabilities of the Broadcom VideoCoreIV-AG100-R GPU. Our major takeaways are:

  • It is surprisingly powerful hardware when used effectively and efficiently. Its performance is more than comparable with that of the NEON SIMD unit on the ARM CPU.
  • It is primarily limited by memory size. The latest Raspberry Pi models have only 1GB of main memory, which severely limits the size of input data, which must be duplicated in memory to be used on the GPU. In addition, because the VPM memory buffer is only 4KB, QPU execution requires frequent memory transfers to store results.
  • The Raspberry Pi GPU is an extremely good value in terms of cost per FLOP, exceeding many supercomputers. However, the Raspberry Pi's hardware limitations make it impractical to exploit this for less expensive large-scale computations.

Our PiMD library provides an accessible, flexible, and powerful interface for the Videocore GPU. The main benefits of our library are:

  • It enables developers to take advantage of the GPU without needing an in-depth understanding of its architecture. Writing QPU assembly requires a sufficient understanding of all the subtle rules and restrictions regarding various instructions and registers, as well as how to correctly use the I/O-mapped registers.
  • It exposes the full set of QPU arithmetic instructions, allowing users to define nearly any complex function by composing it from lower-level operations.
  • Our library has very low overhead because it maps arithmetic operations directly to their corresponding instructions. Furthermore, by efficiently scheduling memory fetches and stores, we achieve nearly optimal memory performance.

Future Work

As mentioned previously, our comparison may be slightly unfair because the CPU and NEON SIMD implementations are single-threaded. It would be insightful to compare the performance of our library, which takes advantage of 8 QPUs, to multi-threaded CPU and NEON SIMD implementations that can take advantage of the Raspberry Pi's quad-core CPU.

Our results suggest that PiMD may also be well suited for image manipulations, which are often highly data parallel with high arithmetic intensity. Assessing the performance of our library on the following image manipulation algorithms could provide further useful data.

  • RGB Manipulation: Single image tint, shade, color swap, and opacity filters.
  • Blend Images: Blend multiple images according to given parameters for each channel.
  • Edge Detection: Compute pixel differences implementing edge detection.

References

  1. VideoCore IV 3D Architecture Reference Guide - http://www.broadcom.com/docs/support/videocore/VideoCoreIV-AG100-R.pdf
  2. vc4asm Macroassembler for VideoCore IV - http://maazl.de/project/vc4asm/doc
  3. Raspberry Pi ARM side GPU libraries - https://github.com/raspberrypi/userland
  4. Videocore QPU Tutorial - https://github.com/hermanhermitage/videocoreiv-qpu

Checkpoint


Completed Work

The majority of our time so far has been spent reading through the VideoCore IV 3D Architecture Reference Guide to gain a better understanding of the GPU architecture. We have also read through much of the relevant Raspberry Pi userland source code and learned how to interface with the GPU through the mailbox property interface. As a proof of concept, we have successfully used this interface to pull various diagnostic data from the GPU, including clock speed and temperature. We feel that we have gained a sufficient understanding of these components and believe that our project, as outlined in our proposal, is feasible.


We have determined that we will need to use an assembler in order to translate QPU assembly instructions into object code that can be executed on the QPU. We have decided to use vc4asm, as it seems to have the most features and the best support among the QPU assemblers we found. Writing the assembly code will not be difficult: we have a very good understanding of how to transfer memory between the GPU and the host using QPU assembly instructions. Furthermore, our library will consist only of primitive vector operations such as add and multiply, which map nicely to instructions in the QPU instruction set.


We have also looked at some existing code and have successfully run trivial programs on the QPU. In addition, we have done a significant amount of informal testing in order to better understand the capabilities and limitations of various parts of the GPU architecture and instruction set, including TMU memory fetches, VPM memory writes, branch statements, and QPU synchronization instructions.


Updated Project Schedule

We are slightly behind our originally proposed schedule due to other commitments that have taken priority over the past couple of weeks. Our current project schedule is still the one outlined in our proposal, and we feel that we are still on pace to complete all deliverables listed there. Writing the assembly code should not be challenging because we have a clear understanding of how to transfer memory to and from the GPU, and our library consists only of primitive vector operations which map extremely nicely to QPU instructions.


Deliverables and Parallelism Competition

For the parallelism competition we will deliver a library interface that allows users to process vectors of data using the QPUs. This library will support all standard primitive logical and mathematical operations such as add, multiply, and, or, etc. At the competition we plan to show a demo of a common parallel algorithm running on the QPUs. We will also demo a comparison of the same algorithm being run using only the CPU, using the NEON SIMD on the CPU, and (possibly) using existing graphics libraries on the Raspberry Pi. We will also include graphs of the performance we will have previously measured.


Issues and Concerns

As mentioned previously, we now have a solid understanding of the GPU architecture and how to interface with it, so there are no major issues regarding knowledge of the problem. The majority of the work going forward will be designing and implementing our library, as well as writing the multiple implementations of the algorithms we will use to test it. There will likely be some significant design considerations and trade-offs to make when defining the interface for our library. While these are not explicit concerns, they are decisions that must be made thoughtfully to ensure that our library is general-purpose and easy to use while remaining powerful.

Proposal


Summary

We will implement a SIMD intrinsics library for the Broadcom VideoCoreIV-AG100-R GPU found on all Raspberry Pi models. We will compare the performance of common data parallel algorithms implemented with our library to implementations using the traditional ARM CPU in the Raspberry Pi.


Background

In addition to an ARM CPU, all Raspberry Pi models have a VideoCore IV graphics processing unit. At the core of this architecture is a set of 12 special-purpose floating-point shader processors, termed Quad Processors (QPUs). For all intents and purposes, each QPU can be regarded as a 16-way 32-bit SIMD processor with an instruction cycle time of four system clocks. Internally, each QPU is a 4-way SIMD processor multiplexed to 16 ways by executing the same instruction for four clock cycles on four different 4-way vectors. Each QPU is dual-issue and contains two independent (and asymmetric) ALUs, an 'add' unit and a 'mul' unit, allowing the QPU to perform one 32-bit floating-point vector add and one 32-bit floating-point vector multiply on each clock cycle (at 250MHz, this gives a nominal peak of 12 QPUs × 4 lanes × 2 FLOPs per cycle × 250MHz = 24 GFLOPS).


The architecture of this GPU is designed specifically for multimedia processing, namely audio, video, and graphics. The QPUs are closely coupled to 3D hardware on the chip specifically for fragment shading. Raspbian, the primary operating system for the Raspberry Pi, provides hardware-accelerated implementations of OpenGL ES 1.1, OpenGL ES 2.0, OpenVG 1.1, EGL, OpenMAX, and 1080p30 H.264 high-profile decoding, which take advantage of this partially specialized hardware. There is currently no library for more general (non-multimedia) data-parallel computation on the Raspberry Pi that uses the QPUs without the additional 3D computation hardware. Such a library would allow developers to massively improve the performance of data-parallel algorithms on the Raspberry Pi by not limiting them to the ARM CPU.


The Challenge

The primary challenge of this project is gaining a sufficient understanding of the VideoCore IV GPU in order to implement a SIMD library using its architecture. Outside of the existing Raspberry Pi source code and an architecture reference guide released by Broadcom, there is very little documentation about this hardware. Creating our library will require a great deal of reading and experimenting to determine what steps need to be taken in order to execute instructions on the GPU.


In addition, we will need to figure out how to access the QPUs directly and avoid using unnecessary hardware components of the 3D pipeline on the GPU. This is especially tricky because the hardware was designed for fragment shading in order to provide reasonably efficient implementations of the aforementioned libraries. As a result, the existing source code (and to a lesser extent the architecture reference) will be less than perfectly helpful when writing our SIMD library.


We will also have to give careful consideration when defining the interface for our library. We need to ensure that our library functions are specific enough to be mapped to the QPUs but general enough to allow our library to be used in a wide variety of data parallel problems.


Resources

We will be running and testing our code on a Raspberry Pi 2 Model B. Our code base will start from scratch but will rely on the GPU interface functions defined in the Raspberry Pi userland source code. We will also make heavy use of the official Broadcom VideoCore IV 3D Architecture Reference Guide while investigating ways to interface with the GPU.


Goals and Deliverables


Plan to Achieve

  • Gain an understanding of the architecture of the VideoCore IV 3D GPU and how its hardware can be used for parallel execution.
  • Successfully run code on the GPU, using the QPUs to process vectors of data.
  • Use our understanding of the hardware architecture and interface to define an interface for a SIMD library that allows for general data parallel execution on the QPUs.
  • Implement our SIMD library according to the interface we have defined.
  • Benchmark the performance of common data parallel algorithms using our library. Compare the performance of our library to CPU-only implementations, implementations that use the NEON SIMD on the CPU, and implementations that use the existing graphics libraries on the Raspberry Pi.

Hope to Achieve

  • Extend our library to support the two hardware threads available on each QPU.
  • Use the Texture and Memory Lookup Units (TMUs) for possibly faster reads from main memory. Reads and writes to main memory are typically done using the Vertex Cache Manager & DMA (VCM & VCD) and the VPM DMA Writer (VDW). The TMUs only support main memory lookups but share a 256KB L2 cache, which may allow for faster reads, especially in out-of-place algorithms.

Our library should make it possible to implement a wide variety of data-parallel algorithms to run on the Raspberry Pi and achieve better performance than implementations that run only on the ARM CPU. Ignoring memory latency and sources of overhead, perfect speedup for our library using the 12 250MHz QPUs over the single 900MHz CPU is 3.33x. More practically, we hope to achieve at least a 2x speedup on completely data parallel algorithms. Ideally we would like to see a 2.5x-3x speedup after further optimizing our library.


We will demo our library by explaining the interface and types and functions it contains. We will show speedup graphs of our library on various data parallel algorithms and possibly show program output if our tests produce compelling results.


Platform Choice

We will be writing and testing our code through Raspbian running on a Raspberry Pi 2 Model B for obvious reasons. The majority of our library will be written in C, likely with some C++ or assembly, as these are the prototypical systems languages and the ones used in the Raspberry Pi source code to interface with the GPU.


Schedule


Week 1

  • Read through the architecture reference and Raspberry Pi userland source code to gain a high-level understanding of the GPU pipeline on the Raspberry Pi.
  • Pull code from an existing library and attempt to reduce it to the bare minimum needed to successfully run on the GPU. Determine what each line of code does and its importance to the overall pipeline.

Week 2

  • Define a limited number of extremely simple SIMD functions that will be the core of our library. Carefully define types that strike an appropriate balance between being general and being easily mapped to the GPU hardware.
  • Begin implementing our minimal library by writing code from scratch that will interface with the GPU.

Week 3

  • Continue implementing our minimal library and run tests to ensure that it is correct and robust to a variety of inputs.
  • Add more complex functions to our library where we feel they would be useful and/or necessary when implementing some common data parallel algorithms.

Week 4

  • Finish implementing any parts of the library we added the previous week.
  • Implement a handful of highly data parallel algorithms using our library and using CPU only code. Compare performance over a variety of input sizes and observe speedup when using the GPU.

Week 5

  • If we have extra time, consider extending our library to either 1) add support for running two hardware threads on the QPUs or 2) use the TMUs and shared L2 to achieve faster main memory lookups on the GPU.