| Academic Computing and Communications Center | ||||||||||
|
|
|
|
|
|
|
|
|
|
|
|
The Zen of Parallel Programming |
||||
|
||||
| "Many hands make light work" | ||||
|
"Many hands make light work" is a proverb that works well for people, but differently for computer chips. This article takes a brief look at computers that cooperate on a single task -- parallel programming -- why it is necessary and why it is difficult. If you want to become an expert in high-end computing, you'll need to read something much more technical. But if you are simply curious about this emerging technology, or if you want to surprise your friends with your erudition at your next cocktail party, read on. In essence, parallel programming is quite simple. Say you need to run a long program, but your boss is too impatient to wait for it to finish. Since you are a clever sort, you run half the program on your computer and the other half on your coworker's computer, at the same time. And then, before you confidently go to ask your boss for a raise, your inspiration comes. You suddenly realize that expensive, high-end computer chips give you a lot less computing power per dollar than run-of-the-mill computer chips. For the price of one expensive chip, you could buy five or six ordinary chips, each of which has about half the capability of the high-end chip. If you put them together in one box, why, you could sell a computer that is twice as fast as current models, for less money! Dreaming of stock options and initial product offerings (IPOs), you finish reading this article ... and decide you'd better ask your boss for a slightly smaller raise, before he reads this article, himself. |
||||
| Cooperation has its Limits | ||||
|
So what's wrong with your inspiration? Simply put, the whole is less than
the sum of the parts. The key ingredient is not using the two computers,
but your work in figuring how to divide up the original problem. When you
really think about it, you didn't run the calculation twice as fast, because
it took you a little extra time to transfer the input to the second computer,
and then you had to add the results from the two computers together before
presenting them to your boss.
This is generally true of most calculations -- some parts can be worked on by many processors simultaneously (called the parallel part), but other parts, such as the input or the tally of the final results, can only occupy one processor at a time (called the serial part). For example, suppose you wanted to buy 10 items at an post-holiday sale, each with a different discount. If you had 10 processors, you could use each one to multiply one item's list price by its discount, and you could do all the calculations at the same time. But if you wanted to know your total bill or your total savings, you'd have to add up the resulting numbers on a single processor. (Well, actually, I'll leave it as an exercise to show that you could profitably add up those numbers on more than one -- but much less than 10 -- processors.) It is quite common for the speed of a program to be dominated by the parallel part of its computation when it is run on a single computer. Suppose your program spends 9 minutes doing the parallel part, and only 1 minute doing the serial part. If you have two processors cooperate on the parallel part, that part of the calculation can be reduced to no less than 4.5 minutes. So 10 minutes (original time) divided by 5.5 minutes (total fast time of 4.5 + 1) gives you a speedup of about 1.8. And the speedup (1.8) divided by the number of processors (2) means your use of multiple processors was 90% efficient. Very good for a first attempt. But your boss is still tapping his foot. So, since money is no object, you throw 100 processors at the problem. The parallel part now runs in 0.09 minutes (and for reasons explained later, this would be quite hard to achieve). Now you have a speedup of 9.1 (equal to 10/1.09) but an efficiency of only 9.1% (equal to 9.1/100). The clincher comes when your boss asks you to run 1,000 of these calculations, each one with a different input data set. You quickly realize that if you run these calculations on your one 100-processor parallel machine, it will take 1,090 minutes (1,000 problems times 1.09 minutes per problem). But if you were to run them on 100 serial machines, it would take only 100 minutes (10 problems run on each of the 100 serial machines, with each problem taking 10 minutes). Maybe you won't ask for that raise, after all. Amdahl's LawThis limitation is known in computer science cocktail parties as Amdahl's Law, after the computer designer Gene Amdahl. My point in bringing this up is not to be discouraging, but to show that you must be realistic about the problems you want to solve. Parallel programming cannot make all programs go arbitrarily quickly, no matter how clever the programmer nor how many the processors. In fact, if you have a large number of small problems, and what you need is overall throughput, parallel programming is a bad way to go. (Except, of course, when you split the problems into batches and run them on a set on completely separate processors, which is a form of parallel programming. That's known at the above cocktail parties as "trivially parallel" or "embarrassingly parallel.")Some Limitations Don't MatterOn the other hand, if you want to predict tomorrow's weather, and you can reduce the calculation from 3 days to 0.5 days by using 500 processors, it is worth it. Yes, using the 500 processors is expensive and inefficient. But the alternative -- predicting yesterday's weather -- is just not useful, no matter how cheap it is. Parallel programming is important when "wall clock time" is critical, as opposed to "processor time." If, for example, a surgeon is using an interactive program to visualize a sensitive part of my body during an operation, I don't much care what the efficiency (or processor time) is, provided the response time (or wall clock time) is fast enough to keep his hands from shaking. |
||||
| The Future of High-end Computing | ||||
|
It's clear that you are always better off running on a single super-fast processor
rather than on 10 processors each being 10 times slower. Since chip speeds have
been doubling every 18 months for the last 10 years or so, why not just wait for
the hardware to improve?
First, chip fabrication plants now cost upwards of $1 billion. Faster chips mean smaller feature sizes, smaller wavelengths for lithography, more accurate instruments, more problems with heat dissipation, more crosstalk between transistors, and lots more money for fabrication plants. These problems are likely to slow chip evolution for economic reasons long before the real fundamental physical limits are reached. Second, there are some calculations that just take a lot of multiplications, and even clever algorithms can't reduce the number below a certain minimum. Fact is, if you want to tackle certain problems, you just don't have a choice -- single processors are never, ever going to be fast enough. For these problems, parallel processing is the only possible alternative. (I can say this with some confidence, even though we easily handle today what were the desirable calculations of 10 years ago. Because every time a processor gets twice as fast, the nature of what is "desirable" to do seems to go up by a factor of two. Or more.) Parallel programming is indispensable in the long run, and a very challenging endeavor for the foreseeable future. |
||||
| Different Ways to Split the Pie | ||||
|
Amdahl's Law is perhaps the easiest to understand of parallel computing's
fundamental limitations, but there are others that are important enough
to mention here. Fortunately, most of these can be dealt with, to a greater
or lesser degree, by good engineering practice. They all focus on ways
to bring more and more processors to bear on the parallel part of the calculation
without their interfering with each other.
The next issue is "load balancing," which simply means splitting up the computational work so that each processor gets a fair share. If you have 10 processors, and each one gets exactly 10% of the overall load, then the parallel part runs 10 times faster. But if 9 of the processors aren't doing quite as much work -- perhaps 9% instead of 10% -- then the remaining processor must do 19% and the parallel part will only run 5 times as fast. (Because the first 9 processors will spend half their time waiting for the 10th.) Load balancing doesn't restrict performance in the same way as Amdahl's Law, but it does mean that the engineering requirements for high performance are rather strict. Load balancing is closely related to "computational overhead," that extra calculation you must do just to keep track of the parallel organization. Of course, before you can run a single program on multiple processors, you have to decide just how you are going to split up the computational work among the available processors, and you have many choices. Perhaps you decide to have each processor do a different part of the computation. Just to maintain my quota of buzzwords, this is called "functional parallelism" because each processor is assigned a different function. Or you could have each processor do the same calculation, but on different data; this is called "data parallelism." |
||||
| The Parallel Office | ||||
|
For example, suppose you want to run a simulation of workflow in an office environment. You might assign each processor to represent a different office worker, and the communications between processors will represent the normal flow of forms, memos, and letters. The input is files of information, and the output is reports based on that information.
|
||||
| The Functionally Parallel Office | ||||
In the right circumstances, functional parallelism works really well. For example, your personal computer already has several different processors in it, in addition to its normal "brain" or central processing unit. It also has a special processor to control the graphics display, another one to control the disk drive, and another to control memory accesses. Indeed, your PC's CPU itself is broken into several stages: a stage to fetch the next instruction, a stage to decode the instruction, a stage to do multiplies, a stage to handle loops, and so forth. These stages represent "on-chip functional parallelism" and are a major reason why today's CPUs are much faster than yesterday's. In the wrong circumstances, however, functional parallelism is very inefficient. Ever have an entire office come to a halt, waiting for the boss to sign a particular memo? More importantly, functional parallelism interferes with another very important buzzword: scalability. In current usage, "scalable" has become simply a synonym for "good," but it was originally supposed to mean the ability to assign more and more processors to a given calculation in a useful and efficient manner. Imagine a secretarial pool organized by functional parallelism. For each letter produced, the first secretary types the first paragraph, then passes the paper on to the second secretary to type the second paragraph, and so on. You could double the number of letters processed by doubling the number of secretaries, letting each one type only half a paragraph before passing on the letter. But it doesn't take a rocket scientist to see this approach has its limits. |
||||
| The Data Parallel Office | ||||
Data parallelism is much better for our office: twice as many secretaries will type twice as many letters because each one types a full letter. In a data parallel organization, the workload of each secretary is exactly the same, no matter what the overall workload of the whole office. Only the office manager, who assigns the work, has to work harder when the overall workload goes up. While we're on the subject of the office manager, what would happen if the manager preassigned a year's worth of work on January 1? If done well, it would certainly cut down on the time needed for the manager to communicate with the secretaries. Low overhead equals efficiency, right? This scheme is called "static load balancing" because it is done once when the program starts. Very efficient, but not very flexible. If the manager doesn't estimate properly the amount of work each letter will take, or if a secretary is interrupted by a personal emergency, the overall office load becomes unbalanced. The cure, of course, is called "dynamic load balancing." Keep a stack of letters in an inbasket, and assign the next one to the first available secretary. Dynamic load balancing is flexible and robust but does incur more overhead. In the office setting, dynamic load balancing seems a clear winner, but in parallel programming, both styles are used, depending on circumstances. |
||||
| One Lump or Two? | ||||
|
One last concept, and you'll be ready to suitably impress your fellow party guests.
That's the idea of grain size and communications overhead. "Grain size" is simply
the amount of work a secretary can do before needing to communicate with a coworker.
The problem with each secretary typing half a paragraph isn't with functional
parallelism, per se, but that an increased overall workload could only be accommodated
by chopping the work into smaller and smaller bits, with more communication required
in between the bits. When the amount of time needed to pass paper from one secretary
to another exceeds the amount of time needed to type the requisite fraction of
a paragraph, each secretary will spend the majority of time passing, and a minority
of time typing. They're all working at top speed, mind you, but they're not getting
anything useful done because the grain size and communications overhead are mismatched.
The analogy with computers is clear. Fine-grained parallelism is more flexible and lends itself to the use of more processors. In a word, it is more scalable. But fine-grained parallelism is limited by communications speed. So in the end, fast communications is the major raison d'être for specialized parallel computers. (Pardon the last set of buzzwords: in communications, "low latency" and "high bandwidth" both mean "fast.") An important corollary is that you are always better off with fewer, faster processors for a fixed speed of communication. The idea of using a huge number of cheap (read: slow) processors won't work because your calculation will be overwhelmed by the communications overhead, even provided you are smart enough to achieve good load balancing. Of course, a single, super-fast processor won't work either, despite the the absence of communications overhead and the presence of perfect load balancing, because it is way too expensive. The proper compromise is a medium number of fast but affordable chips (the type used in high-end workstations) together with a fast internal network. Let's put it this way
Comments are appreciated; please send them to |
||||
| The ADN Connection, March/April 1996 | Previous: Does Email Have Teeth? | Next: About the ADN Connection |
| 1999-9-2 connect@uic.edu |
|