The HPC Asia 2009 keynote speech given by Dr. William T.C. Kramer introduces us to the NCSA, beginning with its history in the field of HPC and continuing with its latest development, the Blue Waters project. The talk also examines the concepts behind the open-science projects and applications that are slated to run on Blue Waters. Dr. Kramer concludes with his insight into the challenges of the coming Petascale and Exascale eras.
Part 4: Lessons Learned From the Terascale Era
So given what users expect of our machines (and they expect the same things of our Terascale machines), what lessons can we take away from the Terascale era? First, it is important to realize that sustained performance has become significantly disconnected from peak performance. Peak performance and simple benchmark kernels tell you very little about how well real-world applications will sustain performance on a machine.
The next lesson is one we talked about for a long time: will applications scale to the number of cores that are needed and available? For the most part, the answer is “yes.” The application scientists have been very innovative in scaling their applications to large scale, and in the Terascale era they often approached it with weak scaling. And because of the complexity of the code, what we’re seeing is more people working on the same code, what we call community codes. So we have larger and more complex codes, codes combining multiple physics and multiple aspects, with many contributors, in some cases many hundreds or even thousands of people contributing to the same code. That makes code maintenance more complex, but it does get the codes to the point of running at high performance and high scalability. To make that work, significant funding was provided across the board so that the application teams could re-engineer codes originally written for a single processor or small numbers of processors. And new algorithms, things like adaptive mesh refinement and sparse methods, were introduced; they had at least as much to do with the increase in performance and productivity of the codes as Moore’s Law did.
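The weak-scaling approach mentioned above is usually summarized by Gustafson’s law: grow the problem size with the processor count, so the parallel portion of the work expands while the serial portion stays fixed. The sketch below is an illustration of that idea, not something from the talk; the 1% serial fraction is a hypothetical number.

```python
def gustafson_speedup(serial_fraction, n_procs):
    """Gustafson's law: scaled speedup when the problem size grows
    with the processor count (weak scaling).

    serial_fraction -- fraction of time spent in serial work on one
                       processor of the scaled problem (hypothetical here)
    """
    return n_procs - serial_fraction * (n_procs - 1)

# Under weak scaling, a 1% serial fraction still gives ~99% parallel
# efficiency at 10,000 processors, which is why the Terascale
# community codes could keep scaling as machines grew.
for p in (100, 1_000, 10_000):
    s = gustafson_speedup(0.01, p)
    print(f"{p:>6} procs: scaled speedup {s:,.2f} ({s / p:.1%} efficiency)")
```

By contrast, strong scaling (Amdahl’s law, fixed problem size) caps the speedup at `1/serial_fraction`, which is why growing the problem with the machine was the practical route to Terascale.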
Another lesson is that Linux- and Unix-based operating systems are sufficient for Terascale computing. What we need to pay more attention to are things like OS interference in order to continue scaling. Another big challenge is finding information about what the system is doing. In some ways these systems are chaotic, in the sense that a small perturbation in one place can have a very large and unexpected impact somewhere else. The information that comes out of the system can help you diagnose what is going on, and help you discover whether it is properly configured and running correctly. This is particularly important in horizontally integrated systems, where people have one-of-a-kind clusters with software combinations that are unique to that cluster.
The programming languages have stalled somewhat. MPI is a great methodology for Terascale, but it faces challenges on the way to Petascale, because Petascale relies on multi-core. And debugging and performance analysis are for what I call the “strong of heart and mind”; in other words, they are non-trivial. Indeed, most people are not using the tools that we thought they would use to debug at scale. They are too complicated, too hard to use, and too hard to learn.
So, we’ve talked about some of the reasons that sustained performance is sliding. The graphs over here combine peak, LINPACK, and sustained performance. The tallest bar, obviously, is peak. This one here is LINPACK. These are the machines that have been fielded at NERSC over the last 10 years. And we have sustained system performance measures, metrics that we use at NERSC, that we believe represent a reasonable expectation of sustained performance across a range of applications. You can see here that sustained performance is not keeping up. This graph down here shows the ratio of sustained performance to LINPACK performance, and you can see that it has dropped over time. So the memory latency that you’ve heard so much about is impacting us. But the latency and bandwidth of the communication interconnect, in particular the latency for very large-scale programs, are impacting us as well. That will probably continue to affect us as we go through the many-core phase too.
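The sustained-system-performance idea above can be sketched as a composite metric in the spirit of NERSC’s SSP: take the geometric mean of measured per-core application rates across a benchmark suite, then scale by the core count, and compare that against peak. The rates and peak figure below are hypothetical placeholders, not the NERSC data shown on the slide.

```python
from math import prod

def ssp(per_core_rates_gflops, n_cores):
    """Composite sustained-performance figure in the spirit of NERSC's
    SSP metric: geometric mean of per-core application rates (GFLOP/s),
    scaled by the core count. All inputs here are hypothetical."""
    geo_mean = prod(per_core_rates_gflops) ** (1.0 / len(per_core_rates_gflops))
    return geo_mean * n_cores

# Hypothetical per-core rates for a four-application benchmark suite,
# and a hypothetical per-core peak.
rates = [0.4, 1.1, 0.2, 0.8]   # GFLOP/s per core, per application
peak_per_core = 4.0            # GFLOP/s per core

n_cores = 10_000
sustained = ssp(rates, n_cores)
peak = peak_per_core * n_cores
print(f"sustained ~ {sustained:,.0f} GFLOP/s, {sustained / peak:.1%} of peak")
```

The geometric mean is the natural choice here: it keeps one unusually fast benchmark from masking applications that run at a small fraction of peak, which is exactly the gap between the tallest and shortest bars on the slide.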
Throughout the rest of this discussion, I want you to pay attention to the word “sustained,” because I will never use the word “peak” again after this slide. “Sustained” is the only real performance metric that the science and research community wants or needs. So we need to translate our discussion into “sustained” performance and compare things like the value proposition of systems based on the sustained work coming out of them, that is, how productive they are, how useful they are, and how much work comes out of them. Here’s a brief analogy; I guess I do use the word “peak” one more time. My boss, Thom Dunning, says “peak” is like buying a car based solely on the speedometer’s top speed: “You’ll never get up to it and you’ll never use it.” And I would add that looking at just one metric like LINPACK is like buying a NASCAR car (NASCAR being the U.S. auto racing organization). It turns out that in Las Vegas you can rent a NASCAR car and drive it on a NASCAR track. My cousin did that, and he got up to about 160 miles an hour, something he had never done before. He did it in a very specialized car tuned only for that track, with a dedicated set of professionals to help him, and he got to drive that car for one hour. After that, he went back to driving his regular car at 55 or 60 miles an hour, and that’s the performance he gets out of the auto industry.