## CUDA: An update

My work with Nvidia's CUDA technology and parallel computing continues (see here). I was able to get the code made available by Pedro Bicudo and Nuno Cardoso (see here) up and running on my machine. This is a code for SU(2) QCD, and these colleagues are currently working on the SU(3) version. The code was written specifically for machines that support GPU computing with the CUDA architecture.

Initially, I was able to get link configurations for lattices as large as 14^4, not very large but useful for some simple analysis. Following a suggestion by Nuno, I changed one parameter in the code (the number of threads per block) from 16 to 8, and the simulation reached the impressive lattice volume of 64^4! I can only do computations in single precision, as my graphics cards were built in 2008, before double precision became available. But I am now in a position to do professional analysis of lattice simulations.
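To give a feeling for what the jump from 14^4 to 64^4 means, here is a rough back-of-the-envelope sketch of the storage needed for one link configuration in single precision. It assumes that each SU(2) link is stored as four real numbers; this is an assumption for illustration, not a statement about how the code by Bicudo and Cardoso actually lays out its data, and the function names are hypothetical.

```python
BYTES_PER_FLOAT = 4   # single precision
DIRECTIONS = 4        # one link per direction at each site in four dimensions
REALS_PER_LINK = 4    # assumption: an SU(2) link stored as four real numbers


def lattice_sites(extent):
    """Number of sites of a hypercubic lattice with the given extent."""
    return extent ** 4


def link_storage_bytes(extent, reals_per_link=REALS_PER_LINK):
    """Bytes needed for all links of one configuration in single precision."""
    return lattice_sites(extent) * DIRECTIONS * reals_per_link * BYTES_PER_FLOAT


for extent in (14, 64):
    size = link_storage_bytes(extent)
    print(f"{extent}^4: {lattice_sites(extent)} sites, "
          f"{size / 2**20:.1f} MiB for the links")
```

Under this assumption a 14^4 configuration needs only a couple of MiB, while a 64^4 configuration needs 1 GiB for the links alone, which is more than the 512 MB available on a single GPU of my cards.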

As a reminder, this is the current configuration of my machine:

• CPU: Intel Core 2 Duo E8500 at 3.16 GHz per core, 6 MB cache.
• 4 GB of DDR3 RAM.
• 2 9800 GX2 graphics cards, each with two GPUs and 512 MB of GDDR3 RAM per GPU. So, I have 4 GPUs at work.
• Motherboard XFX 790i Ultra (3-way SLI).
• PSU Cooler Master Silent Pro Gold 1000 W.
• Windows 7 Ultimate 64 bit
• CUDA Toolkit 3.2
• Visual Studio 2008 SP1
• Parallel Nsight (Nvidia debugger for CUDA)

This configuration performs at 2 Tflops in single precision, and I have reached the lattice volumes reported above for lattice QCD. The output file for a single run was about 4 GB. The simulation needs some debugging after porting, as some values in the output file are zeros when they should not be. The plaquette values, on the other hand, look good. Nuno produced new code from the old one, but I was not able to get it running properly even though it compiled correctly.
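For readers unfamiliar with the plaquette check mentioned above, here is a minimal CPU sketch of how an average plaquette can be computed for SU(2). It is not the CUDA code by Bicudo and Cardoso: it assumes links stored as unit quaternions (a0, a1, a2, a3), meaning U = a0 I + i a·σ, with periodic boundary conditions, and all function names are hypothetical.

```python
import itertools
import math
import random


def su2_mul(a, b):
    """Product of two SU(2) elements given as (a0, a1, a2, a3)."""
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return (
        a0 * b0 - a1 * b1 - a2 * b2 - a3 * b3,
        a0 * b1 + b0 * a1 - (a2 * b3 - a3 * b2),
        a0 * b2 + b0 * a2 - (a3 * b1 - a1 * b3),
        a0 * b3 + b0 * a3 - (a1 * b2 - a2 * b1),
    )


def su2_dagger(a):
    """Hermitian conjugate (the inverse for SU(2))."""
    return (a[0], -a[1], -a[2], -a[3])


def random_su2(rng):
    """Random SU(2) element: a normalized four-vector of Gaussian numbers."""
    v = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)


def make_lattice(extent, hot=False, seed=0):
    """Links indexed by (site, direction); cold start = all identity."""
    rng = random.Random(seed)
    identity = (1.0, 0.0, 0.0, 0.0)
    links = {}
    for site in itertools.product(range(extent), repeat=4):
        for mu in range(4):
            links[site, mu] = random_su2(rng) if hot else identity
    return links


def shift(site, mu, extent):
    """Neighbouring site in direction mu with periodic boundaries."""
    s = list(site)
    s[mu] = (s[mu] + 1) % extent
    return tuple(s)


def average_plaquette(links, extent):
    """Average of (1/2) Tr U_plaquette over all sites and planes."""
    total, count = 0.0, 0
    for site in itertools.product(range(extent), repeat=4):
        for mu in range(4):
            for nu in range(mu + 1, 4):
                u = su2_mul(links[site, mu], links[shift(site, mu, extent), nu])
                u = su2_mul(u, su2_dagger(links[shift(site, nu, extent), mu]))
                u = su2_mul(u, su2_dagger(links[site, nu]))
                total += u[0]  # (1/2) Tr U equals the a0 component
                count += 1
    return total / count


print(average_plaquette(make_lattice(3), 3))  # cold start: prints 1.0
```

A cold start (all links equal to the identity) gives an average plaquette of exactly 1, while a completely random start gives a value close to 0, so the quantity makes a quick sanity check on a configuration.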

Over the weekend I am planning to upgrade the machine further. I will install another 9800 GX2 card (this one is from XFX, while the others are from EVGA and Nvidia respectively, but they are identical since Nvidia is the only manufacturer) and 4 GB of RAM, reaching the 8 GB maximum supported by my motherboard. The aim of this upgrade is to evaluate both the gluon propagator and the spectrum at very large volumes, comparable with the works presented at the milestone Regensburg 2007 meeting. I would also like to get some code to solve $\lambda\phi^4$ theory in order to check my mapping theorem in four dimensions. I would like to emphasize that Rafael Frigori proved it correct in 2+1 dimensions (see here).

I will report back on the blog after the upgrade. As I find more time for this, I should be able to produce some useful results, which I hope to post here.

Frigori, R. (2010). Screening masses in quenched (2+1)d Yang–Mills theory: Universality from dynamics? Nuclear Physics B, 833(1-2), 17-27. DOI: 10.1016/j.nuclphysb.2010.02.021