The HPC Asia 2009 keynote speech given by Dr. William T.C. Kramer introduces us to the NCSA, beginning with its history in the field of HPC and continuing with its latest development, the Blue Waters project. The talk also examines the concepts behind the open-science projects and applications that are slated to run on Blue Waters. Dr. Kramer then concludes with his insight into the challenges of the coming Petascale and Exascale era.
Part 7: More about Blue Waters
Who will use this new system? It’s an Open Science platform. That means that it will support many different types of usage. The benchmarks that are used for the system are only a gross approximation of the usage, because you have to limit your benchmarks to a certain set of cases that are tractable, and you cannot cover all types of science. So there were six primary application benchmarks, three of which essentially require the full system to run on the order of a day to complete the science problems that they were chartered to do. Those benchmarks are in molecular dynamics, quantum chromodynamics, and turbulence. Those are the three full-scale science problems. In addition, there were some kernel tests and other things as well. Materials and weather and climate are another set of applications that are currently chartered at small scale for testing.
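For a rough sense of what "the full system for on the order of a day" means, the arithmetic can be sketched as follows. This is only an illustration: it assumes a sustained rate of exactly 1 petaflop/s and a run of exactly 24 hours, neither of which is a measured Blue Waters figure, and the function names are hypothetical. The second function asks how long the same work would take on a system with 1,000 times less sustained performance, which is the ratio the talk gives when comparing Blue Waters with a typical cluster.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def total_operations(sustained_flops: float, run_seconds: float) -> float:
    """Total floating-point operations done at a constant sustained rate."""
    return sustained_flops * run_seconds


def days_on_slower_system(total_ops: float, slower_flops: float) -> float:
    """Days needed to do the same work at a lower sustained rate."""
    return total_ops / slower_flops / SECONDS_PER_DAY


# Assumption: 1 PFLOP/s sustained, used here only as a round number.
ASSUMED_SUSTAINED_FLOPS = 1e15

ops = total_operations(ASSUMED_SUSTAINED_FLOPS, SECONDS_PER_DAY)
days = days_on_slower_system(ops, ASSUMED_SUSTAINED_FLOPS / 1000)

print(f"{ops:.2e} operations in one day")    # 8.64e+19 operations in one day
print(f"{days:.0f} days at 1/1000 the rate")  # 1000 days at 1/1000 the rate
```

Under these assumptions, a one-day full-system run corresponds to roughly three years on the smaller machine, which is why such problems are treated as full-scale benchmarks rather than something to be tested elsewhere.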
The other important aspect of Blue Waters, at least from the NCSA and the NSF’s point of view, is that this is the major leadership resource for the NSF. It’s the one that has to work, because all those other machines are feeding a community that has been gradually growing and preparing to make use of this resource. This is the one place where that resource exists, so we have to make sure that it works well. The other thing, as I pointed out, is that we’re sure that the applications will change. There will be a turnover, but we do not know what it will be because the applications have not yet been selected. We’re preparing for that as well. They are being selected through a national process that is running as we speak. One thing we know is that most of those projects will not be associated with the University of Illinois or any particular institution; rather, the selection is based on what is best for the overall science and engineering activities of the country. We’re expecting the users to be very experienced. That’s part of the selection criteria. Are they ready to use such a large resource at that scale? Are they able to run applications at other facilities, such as Department of Energy (DOE) facilities, Track 2 systems, or maybe other ones that will be running simultaneously with us? Also, we expect most of the codes to be community codes. There will be many people working on those codes, so we have to be able to accommodate that way of working.
So one of the first things about Blue Waters that you should understand is that it is a large-scale consortium. The actual system being deployed is only one part, the base requirement or the base system. Many aspects of the overall project go beyond that and add value to it. The first is a collaboration of a large number of institutions, mostly located throughout the Midwest of the United States, that are committed to facilitating the effective use of Petascale computing. These universities, schools, and colleges are putting together programs within each of their structures, facilitated by Blue Waters, to bring their computational expertise up to the level of such a large-scale machine, and we’re looking forward to even bigger machines and bigger capabilities.
The next question is, what are the components of our project? The first component is the Petascale computing facility. That’s the facility that is being built right now to house this machine and a few others. It is a state-of-the-art and very environmentally friendly computing facility that will be completed sometime within the next 12 months. It is prepared to house not just Blue Waters but follow-on machines as well. IBM is providing the base system, including the processors, memory, interconnect, and online storage, as well as the systems software and the programming development. We have been very engaged with IBM through the design process for the technology. There’s a team at NCSA as well as at IBM. We just had a meeting last week to do a status check. The teams comprise 60 people across NCSA and IBM working together.
This base machine is sufficient to do Petascale sustained computing for a variety of applications. This is the base component of the program but nowhere near sufficient for everything that’s expected. The project also includes value-added hardware and software. There will be collaborations that will add even more value. Those collaborations are going in the directions that they need to go as we find out more and more about the machines. We have user and production support with LAN connections. That’s an additional effort in services that we’re providing. And on top of that, there is Petascale application collaboration team support. This is specialized support for the early teams that have been identified with application needs, and we have teams of people there to help them prepare their applications so that they can run right away. Then there is an allocation process where other groups will be allocated time, not just on our machines but on the Track 2 machines as well. Tying all those things together is the consortium that I talked about, as well as the education and outreach program, which is a very strong component of our activity, that is, to train the next generation of computational scientists.
What does Blue Waters look like? Compare it to a typical cluster; this one happens to be a cluster called “Abe” at NCSA, provided by Dell with Intel processors. If you compare the differences in the scope of the machine, it’s on the order of a factor of 1,000 more sustained performance. It has on the order of 100,000 more cores. It has quite a bit more memory, storage, and networking capability. I should apologize that much of the detail behind this slide is proprietary and I’m not able to share it with you right now, but it will be made available over the next 18 months.
The value-added collaboration areas, meant to overcome those Terascale issues that I talked about, fall into these general categories. Application programming environments: What does it take for people to make use of such a machine? What tools do they need to be able to do that? Advanced programming models like UPC and CAF: How will they play in the multicore and many-core environments? Common tools infrastructure: We notice that most people are not using debuggers and performance profiling tools very often because they’re so difficult to use. This common tools infrastructure is meant to provide a base on which more advanced tools can be deployed, and deployed more easily. Regarding Petascale and Exascale hierarchical storage systems, Blue Waters will be integrated with large amounts of backing storage that will be managed automatically for the users, getting up to something on the order of an exabyte system that the users would see, with some rotating disks and some other offline or near-line storage.
System monitoring and response. We understand that we have a lot of work to do to bring data out of the system, but the challenge in the collaboration effort is what you do with it when you get it. What types of tools (statistical learning theory, artificial intelligence, modern methodology) can extract the most important characteristics so that you can proactively respond to what you see before a failure, or deal with the failure quickly? Workflow and data management. This covers things like Grid interfaces, being able to work with varying workflows across the system structure, and compilers, with a lot of emphasis on the Petascale application collaboration teams: computational experts embedded with the application experts to be able to scale the applications to the size of the machine. We will define all of these as we go. This is just a high-level breakdown of what we’re doing. We also have industrial outreach, which consists of the facilities and education, as a major component of our project. Here is a picture of our facility, where construction is currently underway. If you go to the NCSA’s web site you can see a live web cam of the construction going on there.
So what do we have today? We have a system we call “Blueprint”, which is the first Power 575 system. It is meant for studying the software environment that we have as well as for providing access to system simulators for the Blue Waters system. Both per-CPU and per-SMP simulators are available, and the new interconnect technology can be simulated as well. Also, we have about 20 application kernels that we are using to monitor and project performance, to measure, and to collaborate on some of the design differences that are being worked out. We also have some POWER6 systems that are used for the development of the hierarchical storage system that I mentioned. Those are in place today and being used.