## Back to CUDA

11/02/2013

About two years have passed since my last post about NVIDIA's CUDA technology (see here). At that time I had added two new graphics cards to my PC and was on the verge of reaching 3 Tflops in single precision for lattice computations. Unluckily, those cards were not working properly, so they went back to the seller and I was fully refunded. In the meantime the motherboard also failed and the hardware was largely changed, so for a long time I have had no opportunity to work with CUDA and perform the intensive computations I had planned. As is well known, a lot of software exploits this excellent technology and, during these years, it has spread widely in both academia and industry, making the life of researchers a lot easier. Personally, I also use it at my workplace, and it is really exciting to have such computational capability at hand at a really affordable price.

Now I am again able to equip my personal computer at home with a powerful Tesla card. Some of these cards are currently being retired as they reach the end of their service life and are replaced by more modern ones, so they can be found at a really low price on auction sites like eBay. So, I bought a Tesla M1060 for about 200 euros. As the name says, this card was not conceived for a personal computer but rather for servers produced by some OEMs. This is also evident from its passive cooler: the card is sized to fit into a server, while active cooling through fans is expected to be provided by the server itself. For this reason I added an 80 mm Enermax fan to my chassis (also Enermax Enlobal) to make sure that the motherboard temperature does not get too high. My motherboard is an ASUS P8P67 Deluxe. This is a very good board, as usual for ASUS, providing three PCIe 2.0 slots so that, in principle, one can install up to three video cards together. But if you have a couple of NVIDIA cards in SLI configuration, the slots work at x8, while a single video card works at x16. Of course, if you plan to work with these configurations, you will need a proper PSU. I have a Cooler Master Silent Pro Gold 1000 W, which is well beyond my needs. It is what remains of my preceding configuration and it is performing really well. I have also changed my CPU, which is now an Intel i3-2125 with two cores at 3.30 GHz and 3 MB of cache. Finally, I added 16 GB of Corsair Vengeance DDR3 RAM.

The installation of the card went really smoothly and I had it up and running in a few minutes on Windows 8 Pro 64 bit after installing the proper drivers. I checked it with Matlab 2011b and the PGI compilers, with CUDA Toolkit 5.0 properly installed. All worked fine. I would like to spend a few words on the PGI compilers, which are produced by The Portland Group. I have a trial license at home and tested them there, while at my workplace we have a full license. These compilers make writing accelerated CUDA code absolutely easy: all you need is to insert some compiler directives into your C or Fortran code. I have run some performance tests and the gain is really impressive, without ever writing a single line of CUDA code. These compilers can be easily used with Matlab to build mex-files or S-functions even if they are not yet supported by Mathworks (they should be!), and I have verified this too, without too much difficulty, for both C and Fortran.

Finally, I would like to give you an idea of how I will use CUDA technology for my aims. What I am doing right now is porting some good code for the scalar field, and I would like to use it in the limit of large self-interaction to derive the spectrum of the theory. It is well known that if you take the limit of the self-interaction going to infinity you recover the Ising model. But I would like to see what happens for intermediate but large values, as I was not able to get any hint from the literature on this, even though this is the workhorse for anybody doing lattice computations. What seems to matter today is showing triviality in four dimensions, for which the evidence is well established. As soon as the accelerated code runs properly, I plan to share it here, since it is very easy to get good code for lattice QCD but very difficult to get equally good code for scalar field theory. Stay tuned!
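To make the Ising limit mentioned above explicit, here is a short worked equation. It assumes one common lattice parametrization of the scalar theory (hopping parameter $\kappa$ and self-coupling $\lambda$); the code I am porting may well use a different but equivalent form.

$$S = \sum_x \Big[ -2\kappa \sum_{\mu=1}^{d} \phi_x \phi_{x+\hat\mu} + \phi_x^2 + \lambda \left(\phi_x^2 - 1\right)^2 \Big]$$

For $\lambda \to \infty$ the weight $e^{-\lambda(\phi_x^2-1)^2}$ suppresses every configuration with $\phi_x^2 \neq 1$, so the field is frozen to $\phi_x = s_x = \pm 1$. The term $\phi_x^2$ then becomes a constant and the action reduces to

$$S \to -2\kappa \sum_x \sum_{\mu=1}^{d} s_x s_{x+\hat\mu} + \text{const},$$

which is the Ising model with nearest-neighbour coupling $2\kappa$. Intermediate but large values of $\lambda$ are exactly the regime where $\phi_x$ still fluctuates a little around $\pm 1$.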

## CUDA: Upgrading to 3 Tflops

29/03/2011

When I was a graduate student I heard a lot about the wonderful performance of the Cray-1 vector supercomputer and the promise of exploring unknown fields of knowledge with this unleashed power. This admirable machine reached a peak of 250 Mflops. Its close relative, the Cray-2, performed at 1700 Mflops, and for scientists this was indeed a new era in attacking difficult mathematical problems. But when you look at QCD, all these machines seem just kindergarten toys, and one is not even able to perform the simplest computations to extract meaningful physical results. So, physicists started to design very specialized machines in the hope of improving the situation.

Today the situation has changed dramatically. The reason is that the increasing demand for complex video output requires extensive parallel computation capability for very simple mathematical tasks. But these mathematical tasks are all one needs to perform scientific computations. The flagship company in this area is Nvidia, which produced CUDA for its graphics cards. This means that today one can have outstanding parallel computation on a desktop computer, and we are talking of some Teraflops! All this at a very affordable cost. For a few bucks you can have on your desktop a machine performing thousands of times better than a legendary Cray machine. Today's counterpart of a Cray-1 is a CUDA cluster breaking the Petaflops barrier, something people were dreaming of just a few years ago. This means that you can do complex and meaningful QCD computations in your office, whenever you like, without the need to share CPU time with anybody and while pushing your machine to its best. All this at costs that are no longer a concern.

So, with this opportunity in sight, I jumped on the bandwagon and a few months ago I upgraded my desktop computer at home into a CUDA supercomputer. The first idea was just to buy old hardware on eBay at very low cost to build on what was already in my machine. In 2008 the top of Nvidia's GeForce cards was the 9800 GX2. This card comes equipped with a couple of GPUs with 128 cores each, 0.5 GB of RAM for each GPU and support for CUDA architecture 1.1. No double precision is available: this option appeared some time later with cards having CUDA architecture 1.3. You can find one of these cards on eBay for about 100-120 euros. You will also need a proper motherboard. Indeed, again in 2008, Nvidia produced the nForce 790i Ultra, well suited to this aim. This board supports a 3-way SLI configuration and, as my readers know, I installed up to three 9800 GX2 cards on it. I got the board on eBay at a price similar to that of the video cards. Also, before starting this adventure, I already had a 750 W Cooler Master power supply. It did not take much time to have this hardware up and running, reaching the considerable computational power of 2 Tflops in single precision, all this with hardware at least 3 years old! For the operating system I chose Windows 7 Ultimate 64 bit after an initial failure with Ubuntu Linux 64 bit.

There is a wide choice of software on the web for QCD. The most widespread is surely the MILC code. This code is written for a multi-processor environment and represents the effort of several people over several years of development. It is well written and rather well documented. A lot of papers on lattice QCD based on this code have appeared in the most relevant archival journals. Quite recently its developers started to port it to CUDA GPUs, following a trend common to all academia. Of course, being a lone user of CUDA with not much time for development, I faced the not very attractive prospect of porting this code to GPUs myself. But, at the same time as I upgraded my machine, Pedro Bicudo and Nuno Cardoso published their paper on arXiv (see here) and promptly made available their code for SU(2) QCD on CUDA GPUs. You can download their up-to-date code here (if you plan to use this code just let them know, as they are very helpful). So, I ported this code, originally written for Linux, to Windows 7 and got it up and running, obtaining correct output for lattices up to $56^4$, working just in single precision since, for this hardware configuration, no double precision was available. The execution time was acceptable: a few seconds on the GPUs, plus some more at the start of the program due to exchanges between the CPU and the GPUs. So, already at this stage I am able to be productive at a professional level with lattice computations. Just a little complaint is in order here. On the web it is very easy to find good code to perform lattice QCD, but nothing can be found for the post-processing of configurations. This code is as important as the former: without the computation of observables one can do nothing with configurations or whatever else lattice QCD yields, on however powerful a machine.
So, I think it would be worthwhile to have both kinds of code available, to get spectra, propagators and so on starting from a standard configuration file, independently of the program that generated it. Similarly, it appears almost impossible to get code for computations in lattice scalar field theory (thanks a lot to Colin Morningstar for providing me with code for 2+1 dimensions!). This is a workhorse for people learning lattice computation, and it would be helpful, at least for pedagogical reasons, to make it available in the same way QCD code is. But now I leave complaints aside and go to the most interesting part of this post: the upgrade.

In these days I made another effort to improve my machine. The idea is to improve performance, meaning larger lattices and shorter execution times, while reducing overheating and noise. Besides, the hardware I worked with was so old that its architecture did not offer double precision. So, I decided to buy a couple of GeForce 580 GTX. This is the top of the GeForce cards (the 590 GTX is a couple of 580 GTX on a single card) and yields 1.5 Tflops in single precision (the 9800 GX2 stopped at 1 Tflops in single precision). It has the Fermi architecture (CUDA 2.0) and grants double precision at a possible performance of at least 0.5 Tflops. But, as happens for all video cards, a model has several manufacturers and these may decide to change something in performance. After some difficulties with the dealer, I was able to get a couple of high-performance MSI N580GTX Twin Frozr II/OC at a very convenient price. With respect to the original Nvidia card, these come overclocked, with a proprietary cooling system that grants a temperature 19 °C lower than the original card. Besides, higher quality components are used. I received these cards yesterday and immediately installed them. In a few minutes Windows 7 installed the drivers. I recompiled my executable and finally performed a successful computation up to $66^4$ with the latest version of Nuno and Pedro's code. Then I checked the temperature of the cards with Nvidia System Monitor and saw 60 °C for each card, with the cooler working at 106%. This was at least 24 °C less than my 9800 GX2 cards! Execution times on the GPUs were reduced at least by half. This new configuration grants 3 Tflops in single precision and at least 1 Tflops in double precision. My present hardware configuration is the following:

So far, I have not had much time to experiment with the new hardware. I hope to tell you more in the near future. Just stay tuned!

Nuno Cardoso, & Pedro Bicudo (2010). SU(2) Lattice Gauge Theory Simulations on Fermi GPUs. J. Comput. Phys. 230:3998-4010, 2011. arXiv: 1010.4834v2

## CUDA: Lattice QCD at your desktop

15/03/2011

As my readers know, I have built a CUDA machine on my desktop for a few bucks to have lattice QCD at home. There are a couple of reasons to write this post, and the most important of them is that Pedro Bicudo and Nuno Cardoso have had their paper published in an archival journal (see here). They produced very good code for SU(2) lattice QCD on a CUDA machine (download link), which I have up and running on my computer. They are working on the SU(3) version, which is almost ready. I hope to say more about this in the very near future. Currently, I am porting to my machine the MILC code for the computation of the gluon propagator from the configurations I am able to generate with Nuno and Pedro’s code. This MILC code fits my needs quite well and is very well written. This task will take me some time, and unfortunately I do not have much of it.

Presently, Nuno and Pedro’s code runs perfectly on my machine (see my preceding post here). There was no problem in the code: I had just missed a compiler option needed to make the GPUs communicate through the MPI library. Once I corrected this, everything ran like a charm. From a hardware standpoint, I was unable to get my machine working perfectly with three cards, and the reason was just overheating. A chip on the motherboard ended up beneath one of the video cards, resulting in erratic behavior of the chipset. Windows 7 even saw a floppy disk drive when I have none! So, I decided to work with just two cards, and now the system works perfectly, is stable and Windows 7 always sees four GPUs.

Nuno sent me an updated version of their code. I will get it running as soon as possible. Of course, I expect this porting to be as smooth as before and to take just a few minutes of my time. I suggested that he keep their site up to date with the latest version of the code, as it is continuously evolving.

Another important reason to write this post is that I am migrating from my old GeForce 9800 GX2 cards to a couple of the latest GeForce 580 GTX with Fermi architecture. This will cost less than one thousand euros, and I will be able to get 3 Tflops in single precision and 1 Tflops in double precision, with more RAM for each GPU. The ambition is to upgrade my CUDA machine to the computational capabilities that, in 2007, made a breakthrough in the lattice studies of the propagators for Yang-Mills theory. The main idea is to have code for both Yang-Mills and scalar field theories running under CUDA, comparing their quantum behavior in the infrared limit, an idea pioneered quite recently by Rafael Frigori (see here). Rafael showed through lattice computations that my mapping theorem (see here and references therein) also holds in 2+1 dimensions.

The GeForce 580 GTX cards that I bought are from MSI (see here). These cards are overclocked with respect to the standard product and come at a very convenient price. I should say that my hardware is already stable and I am able to produce software right now. But this upgrade will take me into the Fermi architecture, opening up the possibility of double precision on CUDA. I hope to report here in the near future about this new architecture and its advantages.
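The single-precision figures quoted for these cards are theoretical peaks, and they can be roughly reproduced with a one-line formula: cores times shader clock times floating-point operations per cycle. The sketch below is only illustrative. The clocks and the flops-per-cycle counts are nominal reference values that I am assuming here, not measurements on my machine, and the overclocked MSI cards will sit somewhat above the reference number.

```python
def peak_gflops(cores, shader_clock_ghz, flops_per_cycle):
    """Theoretical single-precision peak of one GPU, in Gflops."""
    return cores * shader_clock_ghz * flops_per_cycle


# Assumed nominal reference specifications (not measured here):
# 9800 GX2: 128 cores per GPU, 1.5 GHz shader clock,
#           multiply-add plus multiply counted as 3 flops per cycle.
# 580 GTX : 512 cores, 1.544 GHz shader clock,
#           fused multiply-add counted as 2 flops per cycle.
gx2_card = 2 * peak_gflops(128, 1.5, 3)      # two GPUs on one card
gtx580_card = peak_gflops(512, 1.544, 2)     # one GPU per card

print(f"9800 GX2, one card : {gx2_card / 1000:.2f} Tflops")
print(f"580 GTX, one card  : {gtx580_card / 1000:.2f} Tflops")
print(f"580 GTX, two cards : {2 * gtx580_card / 1000:.2f} Tflops")
```

Under these assumptions one 9800 GX2 comes out slightly above 1 Tflops and two 580 GTX slightly above 3 Tflops, in line with the round numbers used in these posts. Real lattice code reaches only a fraction of such peaks.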

Nuno Cardoso, & Pedro Bicudo (2010). SU(2) Lattice Gauge Theory Simulations on Fermi GPUs. J. Comput. Phys. 230:3998-4010, 2011. arXiv: 1010.4834v2

Rafael B. Frigori (2009). Screening masses in quenched (2+1)d Yang-Mills theory: universality from dynamics? Nuclear Physics B, Volume 833, Issues 1-2, 1 July 2010, Pages 17-27 arXiv: 0912.2871v2

Marco Frasca (2010). Mapping theorem and Green functions in Yang-Mills theory PoS(FacesQCD)039, 2011 arXiv: 1011.3643v3

16/02/2011

As promised (see here) I am here to talk again about my CUDA machine. I have done the following upgrades:

• Added 4 GB of RAM and now I have 8 GB of DDR3 RAM clocked at 1333 MHz. This is the maximum allowed by my motherboard.
• Added the third 9800 GX2 graphics card. This is an XFX while the other two that I have already installed are EVGA and Nvidia respectively. These three cards are not perfectly identical, as the EVGA is overclocked by the manufacturer and the firmware may not be the same on all of them.

At the start of the upgrade process things were not so straightforward. Sometimes the BIOS complained at boot about the position of the cards in the three PCI Express 2.0 slots and the system did not start at all. But once I found the right arrangement by permuting the three cards, Windows 7 recognized all of them, the latest Nvidia drivers installed like a charm and the Nvidia System Monitor showed the physical status of all the GPUs. Heat is a concern here, as the video cards work at about 70 °C while the rest of the hardware is at about 50 °C. The case is always open and I intend to keep it so, to minimize the risk of overheating.

The main problem arose when I tried to run my CUDA applications from a command window. I have a simple program that just enumerates the GPUs in the system, and the lattice program of Pedro Bicudo and Nuno Cardoso can also check the system to identify the exact set of resources it needs to perform at its best. Both applications, which I recompiled on the upgraded platform, saw just a single GPU. It was impossible, at first, to get meaningful behavior from the system. I thought that this could be a hardware problem and contacted XFX support for my motherboard. I bought my motherboard second-hand, but I was able to register the product thanks to the seller, who had already done so. People at XFX were very helpful and fast in giving me an answer. The technician essentially told me that the system should work, and he gave me some advice to identify possible problems. I would like to recall that a 9800 GX2 contains two GPUs, and so I have six GPUs to work with. I checked the whole system again until I got the nice configuration above, with Windows 7 seeing all the cards. Just one point remained unanswered: why my CUDA applications did not see the right number of GPUs. This had been an old problem for Nvidia and was overcome with a driver revision long before I tried for myself. Currently, my driver is 266.58, the latest one. The solution came out unexpectedly. It was enough to change a multi-GPU setting in the Performance menu of the Nvidia monitor, and I got 5 GPUs instead of just 1. This is not six, but I fear that I cannot do better. The applications now work fine. I recompiled them all and successfully ran the lattice computation up to a $76^4$ lattice in single precision! With these numbers I am already able to perform professional work in lattice computations at home.
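For readers who want to repeat the GPU-enumeration check described above, here is a minimal sketch. It is not my original program: as an assumption for illustration, it uses Python's `ctypes` to call the CUDA driver API (`cuInit`, `cuDeviceGetCount`, `cuDeviceGet`, `cuDeviceGetName`) from the driver library installed with the Nvidia drivers. The library file names tried below are the usual ones, but they too are assumptions about your system.

```python
import ctypes
import sys


def list_cuda_devices():
    """Return the names of the CUDA devices seen by the driver.

    Returns None when no CUDA driver library can be loaded and an empty
    list when the driver is present but reports an error or no device.
    """
    if sys.platform == "win32":
        loader, lib_names = ctypes.WinDLL, ["nvcuda.dll"]
    else:
        loader, lib_names = ctypes.CDLL, ["libcuda.so.1", "libcuda.so"]

    lib = None
    for name in lib_names:
        try:
            lib = loader(name)
            break
        except OSError:
            continue
    if lib is None:
        return None

    # Every driver API call returns 0 (CUDA_SUCCESS) when it succeeds.
    if lib.cuInit(0) != 0:
        return []
    count = ctypes.c_int(0)
    if lib.cuDeviceGetCount(ctypes.byref(count)) != 0:
        return []

    names = []
    for ordinal in range(count.value):
        device = ctypes.c_int(0)
        if lib.cuDeviceGet(ctypes.byref(device), ordinal) != 0:
            continue
        buffer = ctypes.create_string_buffer(256)
        if lib.cuDeviceGetName(buffer, len(buffer), device) == 0:
            names.append(buffer.value.decode(errors="replace"))
    return names


if __name__ == "__main__":
    devices = list_cuda_devices()
    if devices is None:
        print("No CUDA driver library found")
    else:
        print(f"{len(devices)} CUDA device(s) visible")
        for index, name in enumerate(devices):
            print(f"  GPU {index}: {name}")
```

If the count printed here is smaller than the number of GPUs shown by the operating system, the problem is on the driver or settings side rather than in your application, which is exactly the situation I ran into.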

Then I spent some time setting up the development environment with the Parallel Nsight debugger and Visual Studio 2008 for 64-bit applications. So far, I have been able to build the executable of the lattice simulation under VS 2008. My aim is to debug it to understand why some values in the output become zero when they should not. I would also like to understand why the new version of the lattice simulation that Nuno sent me does not seem to work properly on my platform. It took me some time to configure Parallel Nsight for my machine. You need at least two graphics cards to run it, and you have to activate PhysX, in the Nvidia Performance monitor, on the card that will not run your application. This was a simple enough task, as the online manual of the debugger is well written. The enclosed examples are also very useful. My next weekend will be spent fine-tuning the whole setup and starting to do some work with the lattice simulation.

As I go further with this activity I will keep you informed on my blog. If you want to start such an enterprise yourself, feel free to get in touch with me to overcome the difficulties and hurdles you will encounter. In the end, things proved to be not as complicated as they appeared at the start.

## CUDA: An update

04/02/2011

My activity with Nvidia's CUDA technology and parallel computing is going on (see here). I was able to get the code made available by Pedro Bicudo and Nuno Cardoso (see here) up and running on my machine. This is a code for SU(2) QCD and, currently, these colleagues are working on the SU(3) version. The code has been written directly for a machine supporting GPU computing with the CUDA architecture.

Initially, I was able to get link configurations for lattices as large as $14^4$, not very large but useful for some simple analysis. After a suggestion by Nuno, I modified a parameter in the code (the number of threads per block) from 16 to 8 and the simulation reached the impressive lattice volume of $64^4$! I am only able to do computations in single precision, as my graphics cards were built in 2008 when double precision was yet to come. But now I am in a position to do professional analysis of lattice simulations.
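To get a feeling for what these lattice volumes mean in terms of memory, here is a back-of-the-envelope estimate. It rests on an assumption of mine about the storage layout: each SU(2) link is stored as 4 real numbers (the minimal parametrization of an SU(2) matrix), with one link per direction per site. Nuno and Pedro's code may store links differently, so take the numbers as orders of magnitude only.

```python
def su2_config_bytes(L, dim=4, reals_per_link=4, bytes_per_real=4):
    """Bytes needed to hold one SU(2) link configuration on an L**dim lattice.

    reals_per_link=4 assumes the minimal SU(2) parametrization;
    bytes_per_real=4 is single precision, 8 would be double precision.
    """
    sites = L ** dim
    links = sites * dim            # one link per direction at each site
    return links * reals_per_link * bytes_per_real


for size in (14, 64):
    single = su2_config_bytes(size) / 2 ** 20
    double = su2_config_bytes(size, bytes_per_real=8) / 2 ** 20
    print(f"{size}^4: {single:8.1f} MiB single, {double:8.1f} MiB double")
```

With this layout a $64^4$ configuration needs exactly 1 GiB in single precision, which makes clear why the memory on board of the GPUs, and not only their speed, limits the volumes that can be reached.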

I would like to recall here the current configuration of my machine:

• CPU: Intel Core 2 Duo E8500 at 3.16 GHz per core, 6 MB cache.
• 4 GB of DDR3 RAM.
• 2 9800 GX2 graphics cards with two GPUs each and 512 MB of GDDR3 RAM for each GPU. So, I have 4 GPUs at work.
• Motherboard XFX 790i Ultra (3-way SLI).
• PSU Cooler Master Silent Pro Gold 1000 W.
• Windows 7 Ultimate 64 bit
• CUDA Toolkit 3.2
• Visual Studio 2008 SP1
• Parallel Nsight (Nvidia debugger for CUDA)

This configuration performs at 2 Tflops in single precision and I have reached the performance stated above for lattice QCD. The output file for a single run was about 4 GB. The simulation needs some debugging after porting, as some values in the output file are zero when they should not be. Plaquette values, instead, are good. Nuno produced new code from the old one, but I was not able to get it running properly even though it compiled correctly.
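Since the plaquette is the first sanity check on any configuration, here is a minimal sketch of how its average can be computed for SU(2). This is plain Python written for illustration, not the CUDA code discussed in this post. It assumes each link is stored as four real numbers $(a_0, \vec a)$ representing the matrix $a_0 + i\,\vec a\cdot\vec\sigma$, so that half the trace is simply $a_0$; all function names are my own.

```python
import math
import random


def qmul(a, b):
    """Product of two SU(2) matrices stored as (a0, a1, a2, a3)."""
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return (
        a0 * b0 - a1 * b1 - a2 * b2 - a3 * b3,
        a0 * b1 + b0 * a1 - (a2 * b3 - a3 * b2),
        a0 * b2 + b0 * a2 - (a3 * b1 - a1 * b3),
        a0 * b3 + b0 * a3 - (a1 * b2 - a2 * b1),
    )


def qdag(a):
    """Hermitian conjugate (the inverse, for a unit-norm element)."""
    return (a[0], -a[1], -a[2], -a[3])


def random_su2(rng):
    """Random SU(2) element: a unit vector in four dimensions."""
    v = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)


def shift(site, mu, L):
    """Index of the neighbour of `site` in direction mu (periodic lattice)."""
    stride = L ** mu
    coord = (site // stride) % L
    return site + ((coord + 1) % L - coord) * stride


def make_links(L, dim, rng=None):
    """Cold start (identity links) if rng is None, random links otherwise."""
    identity = (1.0, 0.0, 0.0, 0.0)
    return [[identity if rng is None else random_su2(rng) for _ in range(dim)]
            for _ in range(L ** dim)]


def average_plaquette(links, L, dim):
    """Average of (1/2) Tr U_mu(x) U_nu(x+mu) U_mu(x+nu)^+ U_nu(x)^+."""
    total, count = 0.0, 0
    for site in range(L ** dim):
        for mu in range(dim):
            for nu in range(mu + 1, dim):
                u = qmul(links[site][mu], links[shift(site, mu, L)][nu])
                u = qmul(u, qdag(links[shift(site, nu, L)][mu]))
                u = qmul(u, qdag(links[site][nu]))
                total += u[0]
                count += 1
    return total / count


def gauge_transform(links, L, dim, rng):
    """Apply a random gauge transformation U_mu(x) -> g(x) U_mu(x) g(x+mu)^+."""
    g = [random_su2(rng) for _ in range(L ** dim)]
    return [[qmul(qmul(g[s], links[s][mu]), qdag(g[shift(s, mu, L)]))
             for mu in range(dim)] for s in range(L ** dim)]


if __name__ == "__main__":
    rng = random.Random(2011)
    print("cold start:", average_plaquette(make_links(3, 4), 3, 4))
    print("hot start :", average_plaquette(make_links(3, 4, rng), 3, 4))
```

A cold start gives an average plaquette of exactly 1, and the value must not change under a gauge transformation. Checks of this kind are what I mean when I say that the plaquette values are good.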

During the weekend I am planning to further upgrade the machine. I will install another 9800 GX2 card (this one is an XFX while the others are EVGA and Nvidia respectively, but they are identical as the only producer is Nvidia) and 4 GB of RAM, reaching the maximum of 8 GB of RAM for my motherboard. The aim of this upgrade is to evaluate both the gluon propagator and the spectrum at very large volumes, comparable with the landmark works of Regensburg 2007. I would also like to get some code to solve $\lambda\phi^4$ theory to check my mapping theorem in four dimensions. I would like to emphasize that Rafael Frigori proved it correct in 2+1 dimensions (see here).
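To show how little is needed to start with $\lambda\phi^4$ theory on the lattice, here is a minimal CPU sketch of a Metropolis sweep. It is not the code I am looking for, just a pedagogical illustration. It assumes the common parametrization $S = \sum_x [-2\kappa \sum_\mu \phi_x \phi_{x+\hat\mu} + \phi_x^2 + \lambda(\phi_x^2-1)^2]$, and all names and parameter values are chosen by me for the example.

```python
import math
import random


def neighbours(site, L, dim):
    """Indices of the 2*dim nearest neighbours on a periodic L**dim lattice."""
    out = []
    stride = 1
    for _ in range(dim):
        coord = (site // stride) % L
        out.append(site + ((coord + 1) % L - coord) * stride)
        out.append(site + ((coord - 1) % L - coord) * stride)
        stride *= L
    return out


def local_action(phi, nsum, kappa, lam):
    """Part of the action that depends on the field value at one site."""
    return -2.0 * kappa * phi * nsum + phi * phi + lam * (phi * phi - 1.0) ** 2


def metropolis_sweep(field, L, dim, kappa, lam, step, rng):
    """One Metropolis update of every site; returns the acceptance rate."""
    accepted = 0
    for site in range(len(field)):
        nsum = sum(field[n] for n in neighbours(site, L, dim))
        old = field[site]
        new = old + rng.uniform(-step, step)
        delta = (local_action(new, nsum, kappa, lam)
                 - local_action(old, nsum, kappa, lam))
        if delta <= 0.0 or rng.random() < math.exp(-delta):
            field[site] = new
            accepted += 1
    return accepted / len(field)


def run_demo(L=4, dim=4, kappa=0.12, lam=1.0, sweeps=50, seed=2011):
    """Thermalize a small lattice and return the average of phi squared."""
    rng = random.Random(seed)
    field = [rng.uniform(-1.0, 1.0) for _ in range(L ** dim)]
    for _ in range(sweeps):
        metropolis_sweep(field, L, dim, kappa, lam, 0.5, rng)
    return sum(p * p for p in field) / len(field)


if __name__ == "__main__":
    print("<phi^2> =", run_demo())
```

Each site update needs only the nearest neighbours, which is why this kind of algorithm maps so naturally onto the many simple cores of a GPU: sites that are not neighbours (for instance all the even sites of a checkerboard) can be updated in parallel.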

After the upgrade I will report on the blog. As I get more time for this, I will be able to produce some useful results that I hope to post here.

Frigori, R. (2010). Screening masses in quenched (2+1)d Yang–Mills theory: Universality from dynamics? Nuclear Physics B, 833 (1-2), 17-27 DOI: 10.1016/j.nuclphysb.2010.02.021

## CUDA: Lattice QCD on a Personal Computer

21/01/2011

At the conference “The many faces of QCD” (see here, here and here) I had the opportunity to talk with people doing lattice computations at large computer facilities. They told me that this kind of activity implies the use of large computers, user queues (as these resources are generally shared) and months of computation before seeing the results. Today the situation is changing for the better due to an important technological shift. Indeed, it is well known that graphics cards are built around graphics processing units (GPUs) made of many computational cores that work in parallel. Such cores do very simple computational tasks but, thanks to the parallel architecture, very complex operations can be reduced to a set of such small tasks that are executed in an exceptionally short time. This is the reason why a PC equipped with such an architecture can produce very complex video output with exceptionally good performance.

People at Nvidia had the idea of using these cores just for floating point operations, to exploit them for scientific computations. This is how CUDA (Compute Unified Device Architecture) was born. So, the first Tesla cards, with GPUs but without graphics output, were produced and the development toolkit was made freely available. Nvidia made parallel computation available to the masses. Just by mounting a graphics card with CUDA architecture, anybody can have a desktop computer with Teraflops performance!

As soon as I became aware of the existence of CUDA I decided to jump on this bandwagon, which opened to me the opportunity to do QCD on the lattice at home. So, I upgraded my PC at home with a couple of 9800 GX2 cards (2 GPUs each, with 512 MB of GDDR3 RAM per GPU) having CUDA architecture 1.1. This means that each of these cards can do single precision computations at about 1 Tflops, so my PC can reach 2 Tflops. But I have no double precision. I have also changed my motherboard to an Nvidia 790i Ultra, which supports 3-way SLI, and upgraded the power supply to 1 kW (Cooler Master Silent Pro Gold). I have added 4 GB of DDR3 RAM and kept my CPU, an Intel Core 2 Duo E8500 at 3.16 GHz per core. The interesting point about this configuration is that I bought the three Nvidia cards on eBay as used hardware at a very low cost. So I was in business for very few bucks!

Before this upgrade of my machine I had Windows XP Home 32 bit installed. This operating system was only able to address 3 GB of RAM, and 1 GB of it was used by the two graphics cards. This turned out to be a serious drawback for the whole matter. In a moment I will explain what I did to overcome it.

The next important step was to obtain CUDA code for QCD. CUDA technology is spreading rapidly in the academic environment and a lot of code is available. Initially I thought of the MILC code. There is CUDA code available and the people of the MILC Collaboration were very helpful. This code is built for Linux, and I was not able to get this operating system up and running on my platform. Besides, I would have needed a lot of time to make all this code work for me, and I had to give up, much to my regret. In the meantime, a couple of papers by Pedro Bicudo and Nuno Cardoso appeared (see here and here). Pedro was a nice companion at the conference “The many faces of QCD”, where I had the opportunity to meet him. He was not aware that I had asked his student Nuno for the source code. Nuno was very kind to give me the link, and I downloaded the code. This was a sound starting point for the work on my platform. The code was written for CUDA from the start and so is well optimized. Pedro told me that the optimization phase cost them a lot of work, while writing the initial code was relatively easy. They worked on a Linux platform, so he was surprised when I told him that I intended to port their code to Microsoft Windows. But this is my home PC, all my family uses it, and my attempt to install Ubuntu 64 bit ended in a failure that forced me to use the Windows installation disk to remove the dual boot.

Then, during my Christmas holidays, when I had a lot of time, I started to port Pedro and Nuno's code to Windows XP Home. It was very easy. Their code, entirely written in C++, needed just the insertion of a define. So, setting the path in a DOS box and using nvcc with Visual Studio 2008 (the only compiler Nvidia supports under Windows so far), I was able to get running code, but with a glitch: it was only able to run on my CPU. The reason was that I did not have enough memory under Windows XP 32 bit to complete the compilation of the code for the graphics cards. Indeed, Nvidia's ptxas stopped with an error and I was not able to get the code running on the graphics cards of my computer. But after this step, successful in some respects, I wrote to Pedro and Nuno to inform them of my success in porting the code, at least running on my CPU, to Windows. The code was written so well that very little was needed to port it! Pedro told me that something had to be changed in my machine: above all, the graphics cards should be replaced with more powerful ones. I am aware of this shortcoming, but my budget was not so good at that time. This is surely my next upgrade (a couple of 580 GTX with Fermi architecture supporting double precision).

As I had experienced memory problems, the next step was to move to a 64-bit operating system to use all my 4 GB of RAM. So, on another disk of my computer, I installed Windows 7 Ultimate 64 bit. Also in this case the porting of Pedro and Nuno’s code was very smooth. In a DOS box I got their code up and running again, but this time on my graphics cards and not just on the CPU. As soon as I have the time I will do some computations of observables of SU(2) QCD, exploring the limits of my machine. But this result is from yesterday and I need more time to do some physics.

Pedro informed me that they are working on SU(3), which is more difficult. In the meantime, I have to thank him and his student Nuno very much for the very good job they did and for allowing me to have lattice QCD successfully working on my computer at home. I hope this will represent a good starting point for other people doing this kind of research.

Update: Pedro authorized me to put here the link to download the code. Here it is. Thank you again Pedro!

Nuno Cardoso, & Pedro Bicudo (2010). Lattice SU(2) on GPU’s. arXiv: 1010.1486v1

Nuno Cardoso, & Pedro Bicudo (2010). SU(2) Lattice Gauge Theory Simulations on Fermi GPUs. J. Comput. Phys. 230:3998-4010, 2011. arXiv: 1010.4834v2