How Not To Optimize

Posted by steinlee, 2009-09-02

David Chisnall takes a look at a number of optimization techniques that have been recommended over the years, discussing why they no longer work on modern architectures.

It's often said that there are two rules to follow when optimizing code:

   1. Don't optimize.
   2. Don't optimize yet (experts only).

Although this thinking is slightly facetious, there's a good reason why the second rule is marked as "experts only." Optimizing code means either making it run faster or making it take up less space (ideally both, but in most cases it's a tradeoff).

Generally speaking, you can do two kinds of optimizations:

    * Algorithmic improvements. If you go from using a bubble sort to using a merge sort, for example, your code will be faster. A quicksort will usually make it faster still, although this depends somewhat on the input data.

      Choosing good algorithms is generally the best way of making your code go quickly. This practice has the nice advantage that it works independently of the programming language you use or the architecture of the underlying system. Sometimes, however, even the best algorithm you can design is still too slow.
    * Optimizations specific to a language or implementation. This type of optimization requires a thorough understanding of how your compiler and CPU work. Unfortunately, over the years many of these techniques have been incorporated into rules that made sense for some architectures in the past but are no longer practical.

This article examines some of these popular (but now outdated) rules for optimization.

A Load of Shift

On older CPUs, multiplication was very slow; in particular, multiplying by a constant was no faster than multiplying by a variable. One common trick was to decompose multiplications into sequences of add and shift operations. If you were compiling for an early RISC CPU, this would be done for you, because the CPU had no "multiply" hardware. On other machines, it was done for speed.

Multiplication by a power of two can be translated trivially into a left-shift operation. Essentially, any multiplication can be rewritten in terms of add and shift operations. For example, multiplying a value by 5 is equivalent to multiplying it by 4 and then adding the original value. The following two statements are equivalent:

// Multiply directly
a *= 5;

// Multiply using shift-and-add
a = a + (a << 2);

You can rewrite any multiplication operation in this way, and on a lot of systems this technique used to be much faster for multiplying by a constant. This was such a good trick that some compilers—including GCC—used to do this translation automatically.

Then hardware multipliers got better. CPUs like the AMD Athlon could do integer multiplication quickly. Because most programs contained a lot of multiplications and not many shifts, AMD put two multiply units and one shift unit in the Athlon, which meant that it could do two multiplications at a time, but only one shift. Because most multiplications by a constant require several add-and-shift operations, suddenly the code was faster (not to mention clearer) if you wrote the multiplication instead of the add-and-shift.

Most modern CPUs are superscalar, meaning that they can execute two or more adjacent instructions simultaneously if the operations aren't dependent on each other and the CPU has enough free execution units. The independence requirement is the important part here: two multiplications can often be done in parallel, but each step of the add-and-shift sequence used to simulate a multiplication depends on the result of the previous one, so the sequence can cause pipeline stalls on in-order architectures, and even on out-of-order architectures when independent instructions are pushed too far back in the instruction stream for the scheduler to see them. Modern CPUs get a lot of their speed from pipelining. Many instructions on something like a Core 2 take 10–20 cycles to execute, but because they're decomposed into a large number of independent steps, you can start a second operation only one or two cycles after the first one has started.

This effect is even more pronounced on modern systems. To demonstrate the principle, I wrote two simple programs for the Core 2 Duo. Both loop 1,000,000,000 times, multiplying a running counter by 12345. Program A uses a multiply instruction; program B uses a sequence of shifts and adds (10 in total). Program A takes 3.7 seconds to complete, whereas the shift-and-add version in program B takes 21.3 seconds. Thus, the "naïve" version is almost six times faster than the "optimized" version. This result isn't just because the multiplication is faster: a lot of the extra time is taken up with loads and stores, and program B's add-and-shift version involves a lot of moving of operands between hidden registers inside the execution units and shuffling of data from the end of the pipeline back to the start. Another version, with just a single shift, ran marginally faster than program A's multiplication; even turning a "multiply by 2" into a "left-shift by 1" gives almost no performance improvement (less than 1%) on a Core 2.
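The article doesn't include the benchmark source, so the following is only a minimal sketch of what the two programs might look like; the volatile counter, the loop bound as a compile-time constant, and the particular decomposition of 12345 (five shifts and five adds, since 12345 = 2^13 + 2^12 + 2^5 + 2^4 + 2^3 + 1) are my own assumptions.

#include <stdint.h>
#include <stdio.h>

#define ITERATIONS 1000000000ULL

int main(void)
{
    volatile uint64_t counter = 1;          /* volatile keeps the loop from being optimized away */

    for (uint64_t i = 0; i < ITERATIONS; i++) {
#ifdef USE_MUL
        /* Program A: a single multiply instruction. */
        counter *= 12345;
#else
        /* Program B: 12345 = 2^13 + 2^12 + 2^5 + 2^4 + 2^3 + 1,
           so the multiply becomes five shifts and five adds. */
        uint64_t a = counter;
        counter = (a << 13) + (a << 12) + (a << 5) + (a << 4) + (a << 3) + a;
#endif
    }

    printf("%llu\n", (unsigned long long)counter);
    return 0;
}

Building the file twice, once with -DUSE_MUL and once without, and timing the two binaries gives the kind of comparison described above; the unsigned counter is allowed to wrap around, which doesn't affect the timing.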

The important point here is that just because something used to be slow on old CPUs doesn't mean that it's still slow. You should always time your optimizations and make sure that they really offer an improvement in speed. When I started programming in C, most processors didn't have floating-point units. Any floating-point calculations were implemented entirely in software, and typically were two orders of magnitude slower than integer operations. These days, floating-point operations are done in hardware and are often within a factor of 2 of the speed of integer operations. Floating-point has gone from being something that should be used only when absolutely necessary to something to be avoided only in performance-critical code, or in code intended to run on embedded systems with no floating-point unit (FPU).

Global Arguments

On a lot of older systems, arguments were always passed on the stack. If you wanted to pass a pointer to a subroutine, you had to push it onto the stack, make the call, and then pop it off again. This technique was quite fast, but every call cost a push and a pop, plus a bit of stack space for the copies, so a lot of people started optimizing.

The common optimization was to store a pointer to something used by a set of subroutines in a global variable. Each call then needed only a single load to get at the value. This approach was fairly sensible at the time, but it has a number of problems on a modern system (a short sketch contrasting the two calling styles follows this list):

    * The first problem is that this optimization was originally used in position-dependent code; that is, not in dynamic libraries. Therefore every global variable was allocated a static address. When you loaded a value from a global, you were performing a single load instruction with a constant operand—very fast. On modern systems, and in shared libraries where I've seen people try to use this technique, typically a hidden layer of indirection is involved. Global variables in position-independent code are accessed by loading the base address of the library and then adding an offset. Often the base address is found by loading a value at a non-fixed offset from a value in a register, meaning that you're now requiring several load operations to get at this value—more than you need to read the value from the stack.

      The slowdown from this change comes from the way in which memory hierarchies have evolved. Currently, cache misses are very expensive, whereas on older machines all accesses to variables in memory were at roughly the same speed. This triple-indirection means that you need to use at least three cache lines in order to be able to access the global variable quickly. Cache works best when you have values that are accessed at the same time and near each other in memory, but this optimization has the opposite effect. This approach can increase cache churn—the repeated loading and unloading of bits of memory into and out of the cache—which can have a serious negative impact on performance.
    * The second problem with this technique is that arguments are now very rarely passed in memory. On a modern RISC machine (and even on something like x86–64), the arguments are likely to be passed in registers. On something like x86, the top few stack slots are likely to be aliased with hidden registers, so accessing them is almost as fast as accessing a register. Therefore, the simple approach—just passing the pointer as an argument—often requires no memory accesses, while the "optimized" version can require three or four. I've seen a 10–20% speed improvement by undoing this optimization in some code. As a nice side effect, removing the reference to a global variable made the code reentrant, so it was trivial to split it over several threads and use the power of a modern multicore CPU.
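Here is the sketch promised above. It is a hypothetical before-and-after, not code from the article; the context struct and the function names are invented for illustration.

#include <stdio.h>

struct context { int count; };

/* The old trick: stash the pointer in a global so every helper can find it.
   In a shared library, each use of g_ctx can mean a GOT lookup plus the
   pointer load itself: several dependent memory accesses. */
static struct context *g_ctx;

static void step_global(void)
{
    g_ctx->count++;
}

/* The straightforward version: just pass the pointer. Under the x86-64 or
   any RISC calling convention it arrives in a register, so reaching it needs
   no memory access at all, and the function is reentrant. */
static void step_arg(struct context *ctx)
{
    ctx->count++;
}

int main(void)
{
    struct context ctx = { 0 };

    g_ctx = &ctx;
    step_global();

    step_arg(&ctx);

    printf("%d\n", ctx.count);              /* prints 2 */
    return 0;
}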

Recursion Is Slow!

In the first year of a computer science degree program, you learn about two important things: induction and indirection—or, for the more practical, recursion and pointers. Once you've learned about recursion, you're quickly told that it's a nice theoretical model, but in practical code (in the real world), iteration is always faster: By all means, use a recursive algorithm for prototyping, but if you care about speed, then rewrite it as a loop.

Unlike the other examples we've examined so far, this advice isn't always wrong. I came across an example about a year ago, however, where it was very wrong. Someone had rewritten a nice recursive algorithm as a loop... and found that it ran noticeably slower. Because the programmer was using Visual Studio, my first instinct was to blame the compiler, which wasn't entirely fair; I've only come across one instance where the Microsoft C++ compiler has done anything really stupid. Anyway, I checked out the assembly for both versions. In both cases, it looked sensible. The iterative version did some clever rearrangement that made it quite difficult to follow—and made me very glad I rarely need to write assembly—but still, it ran slowly.

The reason turned out to be visible in the high-level code. The iterative version allocated an array as a local variable and used it as a stack: to push a value, it had to store the value at some offset from the stack pointer and then increment the offset. The recursive version used the call stack directly, with push and pop instructions that each execute as a single, fast operation. Not surprisingly, the hardware stack was much faster than the one implemented entirely in software, and this speed difference was greater than the difference in cost between a function call and a jump.
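As an illustration (this isn't the code from the anecdote; the tree type and the fixed stack bound are assumptions of mine), compare a recursive tree walk with an iterative one that manages its own stack in a local array:

#include <stddef.h>

struct tree_node {
    int value;
    struct tree_node *left, *right;
};

/* Recursive version: the pending nodes live on the call stack, so the
   bookkeeping is just push, pop, and a function call. */
long sum_recursive(const struct tree_node *n)
{
    if (n == NULL)
        return 0;
    return n->value + sum_recursive(n->left) + sum_recursive(n->right);
}

/* Iterative version: the "stack" is a local array indexed by hand, so every
   push and pop is an explicit store to or load from stack[top]. */
long sum_iterative(const struct tree_node *root)
{
    const struct tree_node *stack[1024];    /* assumed depth bound for the sketch */
    size_t top = 0;
    long total = 0;

    if (root != NULL)
        stack[top++] = root;

    while (top > 0) {
        const struct tree_node *n = stack[--top];
        total += n->value;
        if (n->left != NULL)
            stack[top++] = n->left;
        if (n->right != NULL)
            stack[top++] = n->right;
    }
    return total;
}

Every stack[top++] and stack[--top] in the second version is exactly the kind of software stack operation the loop in the anecdote was paying for.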

The moral of this story is that the most natural expression of an algorithm is likely to give you the fastest result. If your algorithm is naturally recursive, torturing it into a loop won't make it faster. If describing it as a loop makes sense, then making it recursive probably won't speed it up (or make it any slower, usually). A lot of compilers will now do tail-recursion optimization: If the last line in a function is a recursive call to the same function, the compiler will replace it with a jump to the start—exactly the same code you would get from a loop.
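For instance, a tail-recursive function such as Euclid's algorithm is a typical candidate; most optimizing compilers can turn the final self-call into a jump, producing essentially the same machine code as the hand-written loop (this is a sketch of the general behavior, not a guarantee for every compiler or optimization level):

/* Tail-recursive: the recursive call is the very last thing the function does. */
unsigned long gcd_recursive(unsigned long a, unsigned long b)
{
    if (b == 0)
        return a;
    return gcd_recursive(b, a % b);         /* no work remains after this call */
}

/* The loop a tail-call-optimizing compiler effectively produces. */
unsigned long gcd_iterative(unsigned long a, unsigned long b)
{
    while (b != 0) {
        unsigned long t = a % b;
        a = b;
        b = t;
    }
    return a;
}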

But It Was Faster!

One common trap in optimizing code is creating micro-benchmarks to test a particular approach, which leads people to make very heavy use of things like function inlining and C++ templates. If you inline a function, you avoid the overhead of the function call, which makes things faster. If you use a C++ template, you get compile-time expansion, so you avoid the cost of a runtime lookup. However, both techniques come at the expense of increased code size. There's a growing belief that code size doesn't matter, but this idea is misguided. The Core 2 Duo has a 32KB level-1 instruction cache and typically around 4MB of level-2 cache shared by instructions and data. The more of its time your code spends running from the L1 cache, the faster it will run; the more of the L2 cache your code occupies, the less is left for your data, and the slower the program will run.

On code that's called frequently, it's fairly common for the cost of inlining it all over the place to be much greater than the cost of a function call. This difference is one reason why a C++ program can end up slower than something written in a more dynamic language, even when micro-benchmarks predicted the opposite result: the C++ program, with its repeated template instantiations, causes far more cache misses because of its larger code size.

In effect, this is just an extension of the "Don't repeat yourself" rule; it applies to compiled code almost as much as to source code. You typically can't fit your code in L1 cache, but if you can fit the code that's used 90% of the time in L2 cache, then you'll see very good performance (that is, until you start getting cache misses for data access).

This effect is hard to measure, because it depends a great deal on the state of the system. When running benchmarks, it's common to avoid running anything else. Unfortunately, that's not how most code ends up being used: it runs concurrently with a lot of other programs, all wanting a slice of that cache. Your program and its data may get only 128KB of that 4MB L2 cache, and that's when you really start to notice the speed problems caused by code bloat.

You Can't See the Tree for the Branches

One of my favorite mis-optimizations is testing for special cases at the start of a function so that they can be short-circuited with an early return. It's one of my favorites because I've been guilty of it myself, and the speed boost from removing it was most satisfying.

In my case, it was in a sparse-array lookup function, where I was using NULL to indicate an empty subtree. At each step, I tested whether a pointer was NULL, and returned NULL if it was. It turned out that this was a very uncommon case: most of the time, when you looked for something, it was there. (If it wasn't, you wanted to do some recovery, and were therefore in a slow code path anyway.) The extra branch was imposing a penalty on every lookup while only marginally speeding up an unusual case.

The solution in my example was to have a self-referential tree node, whose children all pointed back to itself, and to use that instead of NULL as the "not present" value. Every lookup now returned something; a lookup for a missing entry simply walked down this statically defined tree and came back with NULL as its stored value. The interface didn't change, but the code ran 20–50% faster, depending on the machine: the branch overhead had been a major part of the total time spent.
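The article doesn't include the code, but a minimal sketch of the idea might look like the following; the node layout, branching factor, and function names are invented for illustration.

#define FANOUT 16                           /* assumed branching factor */

struct trie_node {
    struct trie_node *child[FANOUT];
    void *value;                            /* stays NULL in the sentinel: "not present" */
};

/* The sentinel: every child slot points back to the sentinel itself.
   Call init_empty_node() once at startup and use &empty_node wherever a
   subtree would otherwise be NULL. */
struct trie_node empty_node;

void init_empty_node(void)
{
    for (int i = 0; i < FANOUT; i++)
        empty_node.child[i] = &empty_node;
}

/* Branch-free descent: a lookup for an absent key just walks around inside
   the sentinel and comes back with its stored NULL value. */
void *lookup(const struct trie_node *root, unsigned key, int levels)
{
    const struct trie_node *n = root;
    for (int i = 0; i < levels; i++) {
        n = n->child[key & (FANOUT - 1)];
        key >>= 4;                          /* 4 == log2(FANOUT) */
    }
    return n->value;
}

The tradeoff is that a miss now costs a few extra pointer chases through the sentinel instead of a test-and-branch on every step of the hot path.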

This issue is part of a more general problem. Optimizing for special cases can add overhead to every use of a function while providing a speed improvement only in those special cases. Unless a special case is particularly common, or orders of magnitude slower than every other case, it's typically not worth the bother.

The VM Does What?

Most of the techniques I've shown here are things that people using low-level languages might try—after all, they're the ones who tend to care about speed—but people using higher-level languages are not immune. A lot of received wisdom is based on old versions of a virtual machine: you may have been told to avoid some construct or use some feature because it was broken in an old version of a virtual machine, and this rule has somehow entered the collective consciousness as "feature X is intrinsically slow."

One example is overuse of the final directive in Java, which can be used to indicate that a class must not be subclassed or that a method can't be overridden. In very early Java virtual machines, final let the virtual machine do some clever optimizations, because it could eliminate the dynamic method lookup. In a modern Java virtual machine, however, the directive is used only by the code-verification part; the optimizer ignores it entirely. Why? Because the VM already knows which classes haven't been subclassed and which methods haven't been overridden, and it applies these optimizations to all of them, not just the ones marked as final. If those assumptions later become wrong, because some newly loaded class subclasses or overrides them, the VM triggers recompilation of the affected code. That's not to say that final should never be used, but it should be reserved for documentation, not for optimization.

Something similar is true with various bits of Erlang code, where certain code structures would trigger optimizations in older versions of the virtual machine, but now don't get any special treatment. If you want to make your code run quickly, give as much semantic information to your compiler and VM as possible. This design makes it easy for a future version of the compiler to identify your algorithm as a special case of something it knows how to optimize, and has the (highly beneficial) side effect of improving code readability and maintainability.

Most of the time, optimization isn't needed. Ruby, a byword for slow virtual machines, is used all over the place, often in poorly written applications that still need only a small fraction of a modern CPU. Don't optimize until you're sure that your code will be too slow without optimization; when you do need to optimize, make sure that your changes actually help. A huge number of factors can affect performance, and not all of them are obvious, so make sure that you take all of the relevant factors into account.
