CUDA: Lattice QCD at your desktop

As my readers know, I have built a CUDA machine on my desktop for a few bucks so that I can do lattice QCD at home. There are a couple of reasons to write this post, and the most important is that Pedro Bicudo and Nuno Cardoso have had their paper published in an archival journal (see here). They produced a very good code for SU(2) lattice QCD that runs on a CUDA machine (download link), and I have it up and running on my computer. They are working on the SU(3) version, which is almost ready, and I hope to say more about it in the near future. Currently, I am porting MILC code to my machine to compute the gluon propagator from the configurations I can generate with Nuno and Pedro's code. This MILC code fits my needs quite well and is very well written. The task will take me some time, and unfortunately I do not have much of it.

Presently, Nuno and Pedro's code runs perfectly on my machine (see my preceding post here). There was no problem in the code; I had just missed a compiler option needed to make the GPUs communicate through the MPI library. Once I corrected this, everything ran like a charm. From a hardware standpoint, I was unable to get my machine working properly with three cards, and the reason was simply overheating: a chip on the motherboard ended up beneath one of the video cards, resulting in erratic chipset behavior. Windows 7 even reported a floppy disk when I have none! So I decided to work with just two cards, and now the system works perfectly, is stable, and Windows 7 always sees four GPUs.
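
For anyone attempting a similar setup, the key idea is that each MPI rank drives its own GPU. A minimal sketch of the pattern (illustrative only, not the actual code from Nuno and Pedro) could look like this:

    // Minimal sketch: bind each MPI rank to its own GPU so that ranks can
    // exchange data through MPI while computing on separate devices.
    // Build with something like: nvcc mpi_gpu.cu -lmpi (paths depend on your MPI install)
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, ndev;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaGetDeviceCount(&ndev);

        // Each rank selects a distinct device (round-robin if ranks > devices).
        cudaSetDevice(rank % ndev);
        printf("rank %d is using GPU %d of %d\n", rank, rank % ndev, ndev);

        MPI_Finalize();
        return 0;
    }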

Nuno sent me an updated version of their code, and I will get it running as soon as possible. I expect this port will be as smooth as before and will take just a few minutes of my time. I suggested to him that they keep their site up to date with the latest version of the code, since it is evolving continuously.

Another important reason to write this post is that I am migrating from my old GeForce 9800 GX2 cards to a couple of the latest GeForce GTX 580 cards with the Fermi architecture. This will cost less than one thousand euros and will give me about 3 Tflops in single precision and 1 Tflops in double precision, with more RAM for each GPU. The ambition is to bring my CUDA machine up to the computational capabilities that, in 2007, made a breakthrough in lattice studies of the propagators of Yang-Mills theory. The main idea is to have code for both Yang-Mills and scalar field theories running under CUDA and to compare their quantum behavior in the infrared limit, an idea pioneered quite recently by Rafael Frigori (see here). Rafael showed through lattice computations that my mapping theorem (see here and references therein) also holds in 2+1 dimensions.

The GeForce GTX 580 cards I bought are from MSI (see here). They are overclocked with respect to the standard product and come at a very convenient price. I should say that my hardware is already stable and I am able to produce software right now, but this upgrade will move me to the Fermi architecture, opening up the possibility of double precision on CUDA. I hope to report here in the near future about this new architecture and its advantages.

Nuno Cardoso & Pedro Bicudo (2010). SU(2) Lattice Gauge Theory Simulations on Fermi GPUs. J. Comput. Phys. 230:3998-4010, 2011. arXiv: 1010.4834v2

Rafael B. Frigori (2009). Screening masses in quenched (2+1)d Yang-Mills theory: universality from dynamics? Nuclear Physics B, Volume 833, Issues 1-2, 1 July 2010, Pages 17-27. arXiv: 0912.2871v2

Marco Frasca (2010). Mapping theorem and Green functions in Yang-Mills theory. PoS(FacesQCD)039, 2011. arXiv: 1011.3643v3

13 Responses to CUDA: Lattice QCD at your desktop

  1. André says:

    I know that you have already bought your new cards, but are you aware of the fact that there are GTX580 models out there with 3 GB of RAM that do not cost much more than the standard version (at least here in Germany)?

    Regards,

    André (a silent reader who has no clue about your actual field of physics but still enjoys reading your blog and stands back in awe ;))

    • mfrasca says:

      Hi André,

      Thank you for your comment and I am happy that you like my blog.

      Of course, my first choice was a couple of Gainward cards with 3 GB of RAM each. That turned out to be an unfortunate situation: the seller called me, after the purchase was practically complete, to say that the cards were no longer available from the supplier. I was left with the choice of either restarting the whole purchase from scratch (with another seller still to be identified) and waiting to upgrade my machine, or contenting myself with lower-performing cards that would arrive within a few days. I chose the latter, saving about 200 euros, and I will get the cards next Monday. So, in a few days I will let you know.

      Cheers,

      Marco

  2. Nameless says:

    580s are excellent cards and they would certainly smoke the 9800s. But I wouldn't buy 580s right now. There are some really nice cards coming down the pipeline in the next few months. I just read about a dual-460 card (two 460s on one board) that should come out very soon. Two 460s > one 580. And then 3 to 6 months from now we'll have Kepler.

    What is the FP operation profile (typical operations used) in that code? Depending on the profile and the complexity of the code, AMD GPUs can work very well. On paper, they beat NVIDIA in terms of flops. AMD 6970 is rated at 2.7 Tflops single precision. The general problem is that their compiler has all sorts of shortcomings and getting full performance can be tricky.

    • mfrasca says:

      Dear Nameless,

      I have also heard about the 590, which puts two 580s on a single card. I do not know when it will be available or how much it will cost. I needed a good solution right now, and this is the one.

      As for AMD, I would say they lag far behind Nvidia in exploiting this technology for scientific parallel computation. This is why CUDA is expanding so quickly: as you rightly pointed out, it is not the hardware but the great support Nvidia provides around all this. They simply moved faster and more effectively, and AMD now has a long way to go to catch up.

      Cheers,

      Marco

      • Nameless says:

        Their support is mostly limited to a well-written compiler and a bunch of libraries. If you actually want a question answered, the NVIDIA CUDA forums are almost useless. Until two weeks ago, they didn't even have a disassembler for Fermi. (Now they do, as part of the 4.0 RC toolkit.) But CUDA code is more likely to work well out of the box, without tweaking and tuning.

  3. Nameless says:

    BTW, if you’re interested, I could give a shot to porting your code to AMD and report about performance.

    • mfrasca says:

      Dear Nameless,

      I understand that you are a supporter of ATI/AMD. That makes the thing quite challenging, as I have no prejudice. The code is public (thanks to Pedro and Nuno) and you can find the link in my post here. You are free to download it, port it, and post your results here.

      Cheers,

      Marco

      • Nameless says:

        I’m not a supporter of ATI/AMD. I just have a couple of numerical apps that I got ported and optimized for both platforms.

        What command line options would you suggest for testing?

        • mfrasca says:

          Dear Nameless,

          The application has some default values and you can simply run it as is. In my case I have 4 GPUs, so I also use the option --gpu=4; otherwise the program will use just one of them. For my own purposes I need to save the output, so I also add --save. With these options your lattice will be 32^4, but I change this option to see how far my hardware can go. Presently the largest lattice size I have reached is 56^4 (--n=56). The application will output the execution time.
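
          For reference, a typical invocation on my setup looks like this (assuming the compiled binary is named su2, which may differ in your build):

              ./su2 --gpu=4 --save --n=56

          This runs a 56^4 lattice across all four GPUs and saves the generated configurations.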

          Please, note that I will have no way to check your ported application as I have no AMD hardware. Of course, we can compare outputs.

          Cheers,

          Marco

        • Nameless says:

          Okay, I may have underestimated the amount of work needed to port this project; it'll take a couple of days. But I got it running on my Fermi 450. With --nx=64 --ny=64 --nz=32 --nt=32 --sweeps=100, it takes 45 seconds in single precision and 75 seconds in double precision. The whole thing seems to be memory-bound (simple optimizations, such as replacing pow(x, 2.0) with x*x, have very little effect on run time)…
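
          To illustrate what I mean by memory-bound, here is a toy kernel (not from the actual project): it does one multiply per element against two global memory accesses, so bandwidth sets the run time and replacing pow(x, 2.0) with x*x barely registers.

              // Toy CUDA kernel: one flop versus two global memory accesses
              // per element, so memory bandwidth, not arithmetic, limits it.
              __global__ void square(const double *in, double *out, int n) {
                  int i = blockIdx.x * blockDim.x + threadIdx.x;
                  if (i < n)
                      out[i] = in[i] * in[i];  // was: pow(in[i], 2.0)
              }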

  4. mfrasca says:

    Dear Nameless,

    Nice work. Your execution times seem similar to mine.

    It would be helpful to have the code you port to the ATI platform available online, so that other interested users can download and run it.

    Cheers,

    Marco

  5. Nameless says:

    OK, if you’re still interested, right now the biggest sticking point remaining is that the library makes extensive use of CURAND, which is obviously NVIDIA-specific. AMD has an alternative library, which I still need to figure out how to use. I’ll spend more time on that during the weekend. The whole affair is quite educational for me.
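
    For context, the CURAND usage that needs replacing follows roughly this pattern (a minimal sketch of the host API, not the project's actual code; link with -lcurand):

        // Minimal CURAND host-API sketch: fill a device buffer with
        // uniform random floats, the kind of call an AMD port must replace.
        #include <curand.h>
        #include <cuda_runtime.h>

        int main(void) {
            const size_t n = 1 << 20;
            float *dev;
            cudaMalloc(&dev, n * sizeof(float));

            curandGenerator_t gen;
            curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
            curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
            curandGenerateUniform(gen, dev, n);  // n uniforms in (0,1] on the GPU

            curandDestroyGenerator(gen);
            cudaFree(dev);
            return 0;
        }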

    I made some improvements to the original project (which sped up the command line above by about a factor of 2). I sent an email to the person in charge of the project, but I haven't heard anything back; either he's busy or I got caught in the spam filter.

    • mfrasca says:

      Hi Nameless,

      Please make your code widely available once you are done and are certain that everything works properly. Your improvements seem significant, and I hope Nuno will take some time to look at them.

      Cheers,

      Marco
