Sunday, August 22, 2010

GPGPU - General Purpose computation on Graphics Processing Units

GPGPU, as the name suggests, stands for "General Purpose computation on Graphics Processing Units". It is the technique of using a GPU, which is normally used for rendering computer graphics, to perform large-scale computations that are traditionally handled by the CPU.
As of this writing, CPUs have at most 12 cores (the AMD Opteron 6000 series; Intel Xeon processors currently offer up to 8), whereas graphics cards these days come with 240 or more cores, although each GPU core is much simpler than a CPU core. So you can see why there is a growing interest in using GPUs for computation.
By using a GPU, an ordinary user can turn a Personal Computer into a Personal "Super" Computer.
NVIDIA has such an offering, its Tesla line of GPU computing processors. For more information : Tesla
Fig. 1. A pictorial representation of the architectural difference between an Intel Penryn processor and the GPU of an NVIDIA GeForce GTX 280 graphics card.

In GPGPU, the application developer modifies the application so that its compute-intensive kernels are mapped to the GPU, while the rest of the work is still handled by the CPU. To perform this mapping, the developer has to rewrite those functions to expose their parallelism, i.e. the code has to be re-written to use parallel computing.
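To make the idea of mapping a kernel to the GPU more concrete, here is a minimal CUDA sketch. It assumes a CUDA-capable NVIDIA GPU and the CUDA toolkit (nvcc), and uses SAXPY (y = a*x + y) as a stand-in for a compute-intensive kernel; the names saxpy_kernel, saxpy_cpu and saxpy_matches_cpu are hypothetical, chosen only for illustration. The serial CPU loop is rewritten so that each GPU thread handles one element, while the CPU still allocates memory, copies data and launches the kernel.

```cuda
#include <cassert>
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// GPU kernel: the loop "for (i = 0; i < n; ++i)" becomes one thread per i.
__global__ void saxpy_kernel(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {                                    // guard the last, partial block
        y[i] = a * x[i] + y[i];
    }
}

// Serial CPU version, used as the reference result.
void saxpy_cpu(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
}

// Hypothetical helper: runs SAXPY on the GPU and compares it with the CPU result.
bool saxpy_matches_cpu(int n, float a) {
    std::vector<float> x(n), y(n, 2.0f), y_ref(n, 2.0f);
    for (int i = 0; i < n; ++i) x[i] = static_cast<float>(i);

    const size_t bytes = n * sizeof(float);
    float *d_x = nullptr, *d_y = nullptr;
    if (cudaMalloc(reinterpret_cast<void **>(&d_x), bytes) != cudaSuccess) return false;
    if (cudaMalloc(reinterpret_cast<void **>(&d_y), bytes) != cudaSuccess) {
        cudaFree(d_x);
        return false;
    }

    // CPU side: copy inputs to GPU memory.
    cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    saxpy_kernel<<<blocks, threads>>>(n, a, d_x, d_y);
    cudaError_t err = cudaGetLastError();

    // Blocking copy back; it waits for the kernel to finish first.
    if (err == cudaSuccess) err = cudaMemcpy(y.data(), d_y, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
    if (err != cudaSuccess) return false;

    saxpy_cpu(n, a, x.data(), y_ref.data());
    for (int i = 0; i < n; ++i)
        if (std::fabs(y[i] - y_ref[i]) > 1e-5f * std::fabs(y_ref[i]) + 1e-5f) return false;
    return true;
}

int main() {
    bool ok = saxpy_matches_cpu(1 << 20, 2.0f);
    std::printf("GPU result %s CPU result\n", ok ? "matches" : "does not match");
    return ok ? 0 : 1;
}
```

It can be compiled with something like `nvcc saxpy.cu -o saxpy`. The block size of 256 threads is an arbitrary but common choice, and the `if (i < n)` guard handles sizes that are not a multiple of it.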

NVIDIA uses its well-known CUDA (Compute Unified Device Architecture) to give its GPUs their extraordinary compute power.
Some features of the NVIDIA Tesla series of GPU Computing processors are:

GPUs powered by the Fermi generation of the CUDA architecture: Deliver cluster performance at 1/10th the cost and 1/20th the power of CPU-only systems based on the latest quad-core CPUs.
448 CUDA Cores: Deliver up to 515 Gigaflops of double-precision peak performance in each GPU, enabling a single workstation to deliver a Teraflop or more of performance. Single-precision peak performance is over a Teraflop per GPU.
ECC Memory: Meets a critical requirement for computing accuracy and reliability in workstations. Protects data in memory to enhance data integrity and reliability for applications. Register files, L1/L2 caches, shared memory, and DRAM are all ECC protected.
Desktop Cluster Performance: A single workstation with multiple GPUs solves large-scale problems faster than a small server cluster.
Up to 6GB of GDDR5 memory per GPU: Maximizes performance and reduces data transfers by keeping larger data sets in local memory attached directly to the GPU.
NVIDIA Parallel DataCache™: Accelerates algorithms such as physics solvers, ray tracing, and sparse matrix multiplication, where data addresses are not known beforehand. This includes a configurable L1 cache per Streaming Multiprocessor block and a unified L2 cache shared by all of the processor cores.
NVIDIA GigaThread™ Engine: Maximizes throughput through context switching that is 10x faster than in the previous architecture, concurrent kernel execution, and improved thread block scheduling.
Asynchronous Transfer: Turbocharges system performance by transferring data over the PCIe bus while the computing cores are crunching other data. Even applications with heavy data-transfer requirements, such as seismic processing, can maximize computing efficiency by transferring data to local memory before it is needed.
CUDA programming environment with broad support of programming languages and APIs: Choose C, C++, OpenCL, DirectCompute, or Fortran to express application parallelism and take advantage of the “Fermi” GPU’s innovative architecture. The NVIDIA Parallel Nsight™ tool is available for Microsoft Visual Studio developers.
High Speed PCIe Gen 2.0 Data Transfer: Maximizes bandwidth between the host system and the Tesla processors. Enables Tesla systems to work with virtually any PCIe-compliant host system with an open PCIe x16 slot.
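The "Asynchronous Transfer" and concurrent-execution features above are exposed in CUDA through streams. Below is a minimal sketch of that pattern, assuming a CUDA-capable GPU; the kernel scale_kernel, the helper pipelined_scale and the choice of 4 chunks are hypothetical, for illustration only. The data is split into chunks, and each chunk's copy-in, kernel and copy-out are queued on its own stream, so the PCIe copy for one chunk can overlap the computation on another. How much actually overlaps depends on the GPU's copy engines.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical per-element work standing in for "crunching other data".
__global__ void scale_kernel(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Scales `total` floats in kStreams chunks, one CUDA stream per chunk.
bool pipelined_scale(int total, float factor) {
    const int kStreams = 4;
    if (total % kStreams != 0) return false;  // keep the sketch simple
    const int chunk = total / kStreams;
    const size_t chunk_bytes = chunk * sizeof(float);

    float *h_data = nullptr, *d_data = nullptr;
    // Page-locked (pinned) host memory is needed for cudaMemcpyAsync
    // to run asynchronously with respect to the host.
    if (cudaMallocHost(reinterpret_cast<void **>(&h_data), total * sizeof(float)) != cudaSuccess)
        return false;
    if (cudaMalloc(reinterpret_cast<void **>(&d_data), total * sizeof(float)) != cudaSuccess) {
        cudaFreeHost(h_data);
        return false;
    }
    for (int i = 0; i < total; ++i) h_data[i] = 1.0f;

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    // Queue copy-in, kernel and copy-out for each chunk on its own stream.
    for (int s = 0; s < kStreams; ++s) {
        const int offset = s * chunk;
        cudaMemcpyAsync(d_data + offset, h_data + offset, chunk_bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        scale_kernel<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + offset, chunk, factor);
        cudaMemcpyAsync(h_data + offset, d_data + offset, chunk_bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaError_t err = cudaDeviceSynchronize();  // wait for every stream to finish

    bool ok = (err == cudaSuccess);
    for (int i = 0; ok && i < total; ++i) ok = (h_data[i] == factor);

    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return ok;
}

int main() {
    bool ok = pipelined_scale(1 << 22, 3.0f);
    std::printf("pipelined transfer %s\n", ok ? "succeeded" : "failed");
    return ok ? 0 : 1;
}
```

Using more, smaller chunks gives more opportunity for overlap but adds per-transfer overhead; 4 is just a simple starting point.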

Fig. 2. A pictorial representation of the difference in performance between a CPU and a GPU over the years.
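Peak figures such as the 515 Gigaflops quoted in the Tesla feature list are theoretical maxima obtained by multiplication, not measured results. A worked sketch, assuming a 1.15 GHz processor clock (the Tesla C2050/C2070 specification, not stated in this post), one fused multiply-add (2 floating-point operations) per CUDA core per cycle in single precision, and double precision running at half the single-precision rate on Fermi-based Tesla parts:

```latex
\begin{aligned}
P_{\text{SP}} &= 448\ \text{cores} \times 1.15\ \text{GHz} \times 2\ \tfrac{\text{FLOP}}{\text{cycle}} = 1030.4\ \text{GFLOPS} \\
P_{\text{DP}} &= \tfrac{1}{2}\, P_{\text{SP}} = 515.2\ \text{GFLOPS}
\end{aligned}
```

Real applications usually reach only a fraction of these peaks, since memory bandwidth and PCIe transfers often limit performance.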


GPU computing has many applications, including MATLAB acceleration, physically based simulation and physics engines, tone mapping, audio signal processing, computational finance, data mining, analytics and databases, molecular dynamics, et cetera.


NVIDIA Published PDF (Must Read) : PDF Link

I hope this post wasn't too mind-boggling. Queries and comments are always welcome.


  1. Your post seems highly NVIDIA-centric. Can you highlight the main differences between CUDA, FireStream and OpenCL from a developer's point of view?
    Also, your table seems to have been copied straight from . Instead of copying and pasting, a link would've sufficed.

  2. Thanks for the quick response. I was hoping there'd be one from you. :P

    Yes, I know this post is NVIDIA-centric. That's because, as far as I know, NVIDIA is the main driving force behind commercial and personal GPU computing, though I may be wrong and may not have researched enough.

    Regarding copying the table: many readers read only what's in a post and don't go and look at the mentioned sites, so I thought I'd let others know by copying the whole table. If there is a better solution, please enlighten me so that I can apply it in future posts. Though it doesn't matter, the source you mention for that table is incorrect. I copied the table directly from the NVIDIA Tesla site, which I have referenced in my post.