Trading Off Performance and Code Space
Contributed by: DigiKey
2012-10-24
Embedded systems programmers and boxers do not usually invite comparison, but in one respect they are alike. A boxer must fight within a particular weight limit, so the challenge is to maximize performance (build muscle) within a fixed allowance. Embedded systems are similar: you have only a limited amount of memory, and you need to maximize your performance within that limit.
Often code optimizations that improve performance also reduce memory size (by reducing the amount of code you have to store). However, some optimizations require you to make a trade-off – it can go faster or it can be smaller. Making this decision is hard and depends on the application you are writing, but it is worth being aware of what optimizations fall into this category. This article covers some of the major space vs. speed trade-off optimizations you are likely to come across.
Tables vs. calculation
Suppose you want to calculate sin(x/256) for an unsigned 8-bit value x - i.e. the input represents values in the range 0.0 to 1.0. One method would be to calculate a polynomial approximation of the function. This is likely to be on the order of 5 to 20 instructions (depending on the architecture).
Another option would be to have a lookup table. The size of this table will depend on the required output precision. If you wanted 16 bits of output, then the table would be 256 * 16 bits = 512 bytes. This could reduce the calculation to 1 or 2 instructions but at the expense of memory.
In the above case, the extra memory may be worthwhile if the speed of those sin calculations matters (remember: profile first, optimize later). However, this case is fortunate in that the input is only 8 bits. If the input were 16 bits, the table would require 128 kB of memory (more than many microcontrollers have).
The size of a lookup table goes up linearly with the number of bits in the output and exponentially with the number of bits in the input. So the number of bits in the input is the key factor.
For some algorithms, it is also possible to have hybrid techniques where you do both a lookup and some calculation. In these cases, you can sometimes trade off the number of bits you use for lookup against the amount of calculation you do afterwards. An example of this is using a lookup to get an initial estimate of a function (e.g. 1/x) and then using iterative refinement to get an accurate solution.
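The 1/x example can be sketched as follows, assuming the input has been range-reduced to [1, 2). A 16-entry table supplies a seed that is correct to about 5 bits, and each Newton-Raphson step roughly doubles the number of correct bits, so a coarser table simply costs more iterations:

```c
#include <math.h>

/* Hybrid sketch: small table lookup + iterative refinement for 1/x.
   The seed table is filled with divisions here for brevity; on a real
   target it would be precomputed and stored in ROM. */
static double recip_seed[16];

static void init_recip_seed(void)
{
    for (int i = 0; i < 16; i++)  /* one entry per 1/16th of [1, 2) */
        recip_seed[i] = 1.0 / (1.0 + (i + 0.5) / 16.0);
}

static double recip(double x)     /* assumes 1.0 <= x < 2.0 */
{
    double r = recip_seed[(int)((x - 1.0) * 16.0)];
    r = r * (2.0 - x * r);        /* Newton step: error is squared */
    r = r * (2.0 - x * r);        /* second step: ~3% -> ~1e-6     */
    return r;
}
```

Shrinking the table to 4 entries and adding one more Newton step would reach similar accuracy with a quarter of the memory, which is exactly the lookup-versus-calculation trade described above.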
Loop Unrolling
Loop unrolling transforms a loop into a new loop whose body executes several iterations of the original loop. Consider the following code:
for (int i=0;i<10;i++) {
a[i] = b[i];
}
It can be rewritten as a loop that iterates only five times, but each iteration does two operations:
for (int i=0;i<10;i+=2) {
a[i] = b[i];
a[i+1] = b[i+1];
}
You could also fully unroll the loop to get this code:
a[0] = b[0];
a[1] = b[1];
a[2] = b[2];
a[3] = b[3];
a[4] = b[4];
a[5] = b[5];
a[6] = b[6];
a[7] = b[7];
a[8] = b[8];
a[9] = b[9];
Why is this a good thing? There are two reasons why an unrolled loop (and particularly a fully unrolled loop) may go faster:
- Unrolling a loop can reduce the overhead of exit checks since there are fewer iterations. Furthermore, there are fewer branch instructions, which may have an overhead depending on the architecture. In the case of a fully unrolled loop there are no exit checks at all.
- Unrolling a loop can give rise to other optimizations a compiler can perform (for example in the fully unrolled version above, all the array offsets are constants which is something the compiler may be able to exploit).
Some compilers support a pragma that requests unrolling explicitly (the exact spelling varies between toolchains). Giving an argument sets the unroll factor:
#pragma loop unroll 10
for (int i=0;i<10;i++) {
a[i] = b[i];
}
The following is also equivalent:
#pragma loop unroll
for (int i=0;i<10;i++) {
a[i] = b[i];
}
The unroll pragma without an argument causes the compiler to fully unroll the loop if it can. This is particularly useful if the number of times a loop iterates varies with a #define, but you always want to fully unroll the loop, e.g.:
#pragma loop unroll
for (int i=0;i<N;i++) {
a[i] = b[i];
}
Function inlining
Function inlining replaces a function call with the body of the function definition. As an example, consider the following two functions:
int f(int x) { return (x+5); }
int g(int x) { return (x + f(x)); }
The definition of ‘g’ could be replaced with:
int g(int x) { return (x + (x+5)); }
by taking the body of f and replicating it inside g.
Inlining can increase code size if the function being inlined is called more than once, since the body of the function may be much larger than the call. The advantages of inlining a function call are:
- The calling overhead is removed (i.e., the call and return instructions)
- The register usage may be better
- It can enable further optimizations
C and similar languages (such as the XMOS derivative XC) have the inline keyword that can be added to function definitions. This function qualifier acts as a hint to the compiler that you would like this function to be inlined. It is just a hint though; compilers can inline functions without the keyword and can choose not to inline functions that have the keyword.
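Using the article's f and g, the hint might be written as below (static inline is the most portable form in C; whether the compiler honors it is, as noted, its own decision):

```c
/* Hint to the compiler that f is a good inlining candidate. */
static inline int f(int x) { return (x + 5); }

int g(int x)
{
    /* If the compiler inlines f here, the generated code is the same
       as for return (x + (x + 5)); with no call/return overhead. */
    return (x + f(x));
}
```

Either way the observable behavior is identical; only the generated code differs.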
Summary
This article has shown three major examples of code optimizations that can increase code size: lookup tables, loop unrolling, and function inlining. Deciding when to trade performance against code size is one of the tricky parts of code development, but it is important to be aware of when you can make the trade, how you can make it, and what the benefits are.
Disclaimer: The views and opinions expressed on this website by individual authors and/or forum participants do not represent the views and opinions of DigiKey, nor do they represent official DigiKey policy.