## Back to CUDA

11/02/2013

It is about two years ago when I wrote my last post about CUDA technology by NVIDIA (see here). At that time I added two new graphic cards to my PC, being on the verge to reach 3 Tflops in single precision for lattice computations.  Indeed, I have had an unlucky turn of events and these cards went back to the seller as they were not working properly and I was completely refunded. Meantime, also the motherboard failed and the hardware was largely changed  and so, I have been for a lot of time without the opportunity to work with CUDA and performing intensive computations as I planned. As it is well-known, one can find a lot of software exploiting this excellent technology provided by NVIDIA and, during these years, it has been spreading largely, both in academia and industry, making life of researchers a lot easier. Personally, I am using it also at my workplace and it is really exciting to have such a computational capability at your hand at a really affordable price.

Now, I am newly able to equip my personal computer at home with a powerful Tesla card. Some of these cards are currently dismissed as they are at the end of activity, due to upgrades of more modern ones, and so can be found at a really small price in bid sites like ebay. So, I bought a Tesla M1060 for about 200 euros. As the name says, this card has not been conceived for a personal computer but rather for servers produced by some OEMs. This can also be realized when we look at the card and see a passive cooler. This means that the card should have a proper physical dimension to enter into a server while the active dissipation through fans should be eventually provided by the server itself. Indeed, I added an 80mm Enermax fan to my chassis (also Enermax Enlobal)  to be granted that the motherboard temperature does not reach too high values. My motherboard is an ASUS P8P67 Deluxe. This is  a very good card, as usual for ASUS, providing three PCIe 2.0 slots and, in principle, one can add up to three video cards together. But if you have a couple of NVIDIA cards in SLI configuration, the slots work at x8. A single video card will work at x16.  Of course, if you plan to work with these configurations, you will need a proper PSU. I have a Cooler Master Silent Pro Gold 1000 W and I am well beyond my needs. This is what remains from my preceding configuration and is performing really well. I have also changed my CPU being this now an Intel i3-2125 with two cores at 3.30 GHz and 3Mb Cache. Finally, I added  16 Gb of Corsair Vengeance DDR3 RAM.

The installation of the card went really smooth and I have got it up and running in a few minutes on Windows 8 Pro 64 Bit,  after the installation of the proper drivers. I checked with Matlab 2011b and PGI compilers with CUDA Toolkit 5.0 properly installed. All worked fine. I would like to spend a few words about PGI compilers that are realized by The Portland Group. I have got a trial license at home and tested them while at my workplace we have a fully working license. These compilers make the realization of accelerated CUDA code absolutely easy. All you need is to insert into your C or Fortran code some preprocessing directives. I have executed some performance tests and the gain is really impressive without ever writing a single line of CUDA code. These compilers can be easily introduced into Matlab to yield mex-files or S-functions even if they are not yet supported by Mathworks (they should!) and also this I have verified without too much difficulty both for C and Fortran.

Finally, I would like to give you an idea on the way I will use CUDA technology for my aims. What I am doing right now is porting some good code for the scalar field and I would like to use it in the limit of large self-interaction to derive the spectrum of the theory. It is well-known that if you take the limit of the self-interaction going to infinity you recover the Ising model. But I would like to see what happens with intermediate but large values as I was not able to get any hint from literature on this, notwithstanding this is the workhorse for any people doing lattice computations. What seems to matter today is to show triviality at four dimensions, a well-acquired evidence. As soon as the accelerate code will run properly, I plan to share it here as it is very easy to get good code to do lattice QCD but it is very difficult to get good code for scalar field theory as well. Stay tuned!

## CUDA: Upgrading to 3 Tflops

29/03/2011

When I was a graduate student I heard a lot about the wonderful performances of a Cray-1 parallel computer and the promises to explore unknown fields of knowledge with this unleashed power. This admirable machine reached a peak of 250 Mflops. Its near parent, Cray-2, performed at 1700 Mflops and for scientists this was indeed a new era in the help to attack difficult mathematical problems. But when you look at QCD all these seem just toys for a kindergarten and one is not even able to perform the simplest computations to extract meaningful physical results. So, physicists started to project very specialized machines to hope to improve the situation.

Today the situation is changed dramatically. The reason is that the increasing need for computation to perform complex tasks on a video output requires extended parallel computation capability for very simple mathematical tasks. But these mathematical tasks is all one needs to perform scientific computations. The flagship company in this area is Nvidia that produced CUDA for their graphic cards. This means that today one can have outperforming parallel computation on a desktop computer and we are talking of some Teraflops capability! All this at a very affordable cost. With few bucks you can have on your desktop a machine performing thousand times better than a legendary Cray machine. Now, a counterpart machine of a Cray-1 is a CUDA cluster breaking the barrier of Petaflops! Something people were dreaming of just a few years ago.  This means that you can do complex and meaningful QCD computations in your office, when you like, without the need to share CPU time with anybody and pushing your machine at its best. All this with costs that are not a concern anymore.

So, with this opportunity in sight, I jumped on this bandwagon and a few months ago I upgraded my desktop computer at home into a CUDA supercomputer. The first idea was just to buy old material from Ebay at very low cost to build on what already was on my machine. On 2008 the top of the GeForce Nvidia cards was a 9800 GX2. This card comes equipped with a couple of GPUs with 128 cores each one, 0.5 Gbyte of ram for each GPU and support for CUDA architecture 1.1. No double precision available. This option started to be present with cards having CUDA architecture 1.3 some time later. You can find a card of this on Ebay for about 100-120 euros. You will also need a proper motherboard. Indeed, again on 2008, Nvidia produced nForce 790i Ultra properly fitted for these aims. This card is fitted for a 3-way SLI configuration and as my readers know, I installed till 3 9800 GX2 cards on it. I have got this card on Ebay for a similar pricing as for the video cards. Also, before to start this adventure, I already had a 750 W Cooler Master power supply. It took no much time to have this hardware up and running reaching the considerable computational power of 2 Tflops in single precision, all this with hardware at least 3 years old! For the operating system I chose Windows 7 Ultimate 64 bit after an initial failure with Linux Ubuntu 64 bit.

There is a wide choice in the web for software to run for QCD. The most widespread is surely the MILC code. This code is written for a multi-processor environment and represents the effort of several people spanning several years of development. It is well written and rather well documented. From this code a lot of papers on lattice QCD have gone through the most relevant archival journals. Quite recently they started to port this code on CUDA GPUs following a trend common to all academia. Of course, for my aims, being a lone user of CUDA and having no much time for development, I had the no much attractive perspective to try the porting of this code on GPUs. But, in the same time when I upgraded my machine, Pedro Bicudo and Nuno Cardoso published their paper on arxiv (see here) and made promptly available their code for SU(2) QCD on CUDA GPUs. You can download their up-to-date code here (if you plan to use this code just let them know as they are very helpful). So, I ported this code, originally written for Linux, to Windows 7  and I have got it up and running obtaining a right output for a lattice till $56^4$ working just in single precision as, for this hardware configuration, no double precision was available. The execution time was acceptable to few seconds on GPUs and some more at the start of the program due to CPU and GPUs exchanges. So, already at this stage I am able to be productive at a professional level with lattice computations. Just a little complain is in order here. In the web it is very easy to find good code to perform lattice QCD but nothing is possible to find for post-processing of configurations. This code is as important as the former: Without computation of observables one can do nothing with configurations or whatever else lattice QCD yields on whatever powerful machine. So, I think it would be worthwhile to have both codes available to get spectra, propagators and so on starting by a standard configuration file independently on the program that generated it. Similarly, it appears almost impossible to get lattice code for computations on lattice scalar field theory (thank you a lot to Colin Morningstar for providing me code for 2+1dimensions!). This is a workhorse for people learning lattice computation and would be helpful, at least for pedagogical reasons, to make it available in the same way QCD code is. But now, I leave aside complains and go to the most interesting part of this post: The upgrading.

In these days I made another effort to improve my machine. The idea is to improve in performance like larger lattices and shorter execution times while reducing overheating and noise. Besides, the hardware I worked with was so old that the architecture did not make available double precision. So, I decided to buy a couple of GeForce 580 GTX. This is the top of the GeForce cards (590 GTX is a couple of 580 GTX on a single card) and yields 1.5 Tflops in single precision (9800 GX2 stopped at 1 Tflops in single precision). It has Fermi architecture (CUDA 2.0) and grants double precision at a possible performance of at least 0.5 Tflops. But as happens for all video cards, a model has several producers and these producers may decide to change something in performance. After some difficulties with the dealer, I was able to get a couple of high-performance MSI N580GTX Twin Frozr II/OC at a very convenient price. With respect to Nvidia original card, these come overclocked, with a proprietary cooler system that grants a temperature reduced of 19°C with respect to the original card. Besides, higher quality components were used. I received these cards yesterday and I have immediately installed them. In a few minutes Windows 7 installed the drivers. I recompiled my executable and finally I performed a successful computation to $66^4$ with the latest version of Nuno and Pedro code. Then, I checked the temperature of the card with Nvidia System Monitor and I saw a temperature of 60° C for each card and the cooler working at 106%. This was at least 24°C lesser than my 9800 GX2 cards! Execution times were at least reduced to a half on GPUs. This new configuration grants 3 Tflops in single precision and at least 1 Tflops in double precision. My present hardware configuration is the following:

So far, I have had no much time to experiment with the new hardware. I hope to say more to you in the near future. Just stay tuned!

Nuno Cardoso, & Pedro Bicudo (2010). SU(2) Lattice Gauge Theory Simulations on Fermi GPUs J.Comput.Phys.230:3998-4010,2011 arXiv: 1010.4834v2