For several decades now (ever since Gordon Moore conjectured his Moore's "Law"), the computing industry has benefited from ever-increasing processor speed. Application developers therefore had the privilege of planning ahead for more powerful application software, assured that the computing power they required would be available. Processor manufacturers, in turn, could plan to keep increasing processor speeds with the assurance that application developers were ready to use them. In terms of programmer productivity, the most important implication was that most programmers could write sequential code, the only exceptions being those who wrote operating systems or application software for "supercomputers".
A few years ago, this convenient symbiotic relationship began to unravel. The main "culprit" was a limitation of technology: it was no longer feasible to make processors faster, because the heat generated by the faster circuitry could not be dissipated.
Fortunately, Moore's Law continues to hold, so it remains possible to pack an ever-increasing number of transistors onto a chip. Given the continued opportunity provided by Moore's Law, and faced with the heat-dissipation problem, the industry stopped increasing the speed of processors and instead started designing chips with multiple processors. This meant that the industry could still bring out chips of increasing computing power, albeit with each processor "core" not getting any faster.
The downside of this development is that all programmers, and not just a select few, need to write parallel programs to be able to actually use the computing power of multi-processor or multi-core chips. Parallel programming has to go mainstream!
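
As a minimal illustration (not part of the original text), consider a simple reduction that a programmer would once have written as a sequential loop. To exploit a multi-core chip it might instead be spread across hardware threads, for example with standard C++ threads:

    #include <algorithm>
    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        const std::size_t n = 1 << 24;          // 16M elements, all set to 1
        std::vector<int> data(n, 1);

        // One worker per hardware thread; a sequential loop would use one core.
        unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
        std::vector<long long> partial(nthreads, 0);
        std::vector<std::thread> workers;

        for (unsigned t = 0; t < nthreads; ++t) {
            workers.emplace_back([&, t] {
                std::size_t begin = t * n / nthreads;
                std::size_t end   = (t + 1) * n / nthreads;
                // Each thread writes only its own slot, so no locking is needed.
                partial[t] = std::accumulate(data.begin() + begin,
                                             data.begin() + end, 0LL);
            });
        }
        for (auto& w : workers) w.join();

        long long total = std::accumulate(partial.begin(), partial.end(), 0LL);
        std::cout << total << '\n';             // prints 16777216
    }

The sequential version is a single call to std::accumulate; the parallel version must also worry about dividing the work and joining the threads. That extra burden is what now falls on every programmer.
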
There are two challenges a programmer faces when designing parallel programs: ensuring correctness and extracting the maximum possible performance.
Over the last few decades, the Computer Science community has addressed the challenge of correctness fairly well. Besides developing alternative programming paradigms, namely shared-memory and message-passing, it has designed programming languages with constructs that help programmers avoid (though not necessarily eliminate) typical parallel-programming bugbears such as deadlocks and race conditions.
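
As one illustrative example (in C++, chosen here for concreteness and not named in the original), scoped locking constructs such as std::scoped_lock acquire multiple locks using a deadlock-avoidance algorithm and release them automatically, making races and deadlocks easier to avoid, although they cannot by themselves guarantee a correct program:

    #include <iostream>
    #include <mutex>
    #include <thread>

    // Two accounts, each protected by its own mutex.
    struct Account {
        std::mutex m;
        long balance = 0;
    };

    // std::scoped_lock acquires BOTH mutexes atomically with a
    // deadlock-avoidance algorithm; locking them one at a time in
    // inconsistent order across threads could deadlock.
    void transfer(Account& from, Account& to, long amount) {
        std::scoped_lock lock(from.m, to.m);   // RAII: released on return
        from.balance -= amount;
        to.balance   += amount;
    }

    int main() {
        Account a, b;
        a.balance = 1000;

        // Two threads transferring in opposite directions concurrently.
        std::thread t1([&] { for (int i = 0; i < 10000; ++i) transfer(a, b, 1); });
        std::thread t2([&] { for (int i = 0; i < 10000; ++i) transfer(b, a, 1); });
        t1.join();
        t2.join();

        std::cout << a.balance + b.balance << '\n';   // total preserved: 1000
    }

The construct encapsulates a known-good locking discipline, yet nothing stops a programmer from touching balance elsewhere without taking the lock, which is why such constructs help with, rather than ensure, correctness.
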
What is still sorely lacking, however, is a systematic methodology for improving the performance of programs. Getting the best possible performance for a given parallel program out of the underlying hardware is still an art, bordering on "black magic". The industry is in dire need of such a systematic methodology. The primary obstacle to creating it is that we do not yet have a programming model of hardware at the right level of abstraction. We have the ISA (Instruction Set Architecture), which is excellent for specifying functionality but carries no information about hardware performance. At the other end of the spectrum, we have the RTL (Register-Transfer Level) model of hardware, which does provide information about performance but is too detailed (not abstract enough) to be useful to programmers, at least those who are not experts.
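
To make the gap concrete, here is a small illustrative experiment (not from the original text). The two functions below are equivalent at the ISA level, essentially the same loads and adds, yet they typically differ severalfold in running time because of cache behaviour that the ISA does not expose and that an RTL model buries in detail:

    #include <chrono>
    #include <iostream>
    #include <vector>

    constexpr std::size_t N = 4096;

    // Row-major traversal: consecutive addresses, cache-friendly.
    long long sum_row_major(const std::vector<int>& m) {
        long long s = 0;
        for (std::size_t i = 0; i < N; ++i)
            for (std::size_t j = 0; j < N; ++j)
                s += m[i * N + j];
        return s;
    }

    // Column-major traversal: strided addresses, cache-hostile.
    long long sum_col_major(const std::vector<int>& m) {
        long long s = 0;
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t i = 0; i < N; ++i)
                s += m[i * N + j];
        return s;
    }

    int main() {
        std::vector<int> m(N * N, 1);
        auto time = [&](auto f) {
            auto t0 = std::chrono::steady_clock::now();
            volatile long long s = f(m);   // keep the call from being optimized away
            (void)s;
            return std::chrono::duration<double>(
                       std::chrono::steady_clock::now() - t0).count();
        };
        std::cout << "row-major:    " << time(sum_row_major) << " s\n";
        std::cout << "column-major: " << time(sum_col_major) << " s\n";
    }

On typical hardware the row-major version runs several times faster (the exact ratio depends on the machine), even though an ISA-level view of the two functions reveals no meaningful difference. A useful performance model would sit somewhere between these two extremes.
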
The industry today is witnessing an acceleration in the rate at which hardware costs are falling, one example being the one-teraflops double-precision performance of Intel's KNC chip. This has created an exciting opportunity to bring high-performance computing into the mainstream. Making this happen, however, requires a large number of engineers who can design correct and efficient parallel programs.
Today we have a fairly good systematic methodology for designing parallel programs that are functionally correct. Programmer productivity is being further enhanced by the development of Domain-Specific Languages (DSLs). This now needs to be urgently complemented with a systematic methodology for enhancing performance on a given target hardware platform. The development of such a methodology must form one of the core themes of research for the computing community.