Optimization

The __inline Keyword for Inline Functions

An inline function is a function for which the compiler replaces a call to the function with the code for the function itself. The process of replacing a function call with the function's code is called inlining. When the compiler performs inlining for a function, the function has been inlined.

Overview

To define an inline function, add the __inline keyword to the function definition. The following is an example of a function definition using the __inline keyword:

__inline double square(double x)
{
   return x * x;
}

The __inline keyword causes a function to be inlined only if you specify the optimize option. If optimize is specified, whether __inline is honored depends on the setting of the inline optimizer option. By default, the inline option is in effect whenever the optimizer is run, so if you specify optimize and want the __inline keyword to be ignored, you must also specify the noinline option.

There are no restrictions on how an inline function can be coded. An inline function can declare auto variables and can call other functions, including other inline functions. Inline functions can also be recursive.

Advantages of Using Inline Functions

Since the call to an inline function is replaced with the function itself, the overhead of building a parameter list and calling the function is eliminated in the calling function. Since there is no function call, the overhead associated with entering the function and returning to the caller is eliminated in the called function.

Below is an example of a program that calls a small function that is a good candidate for inlining. The program produces a table of equivalent temperatures on the Fahrenheit and Celsius scales. The conversion from Fahrenheit to Celsius is done by the ftoc function.

#include <stdio.h>

static double ftoc(double);

int main(void)
{
   double fahr, celsius;
   puts("Fahrenheit  Celsius");
   for (fahr = 0.0; fahr <= 300.0; fahr += 20.0) {
      celsius = ftoc(fahr);
      printf("   %4.0f     %6.1f\n", fahr, celsius);
   }
   return 0;
}

static double ftoc(double fahr)
{
   return (5.0 / 9.0) * (fahr - 32.0);
}

As written, the program performs the following operations for each of the 16 iterations of the for loop:

builds a parameter list containing the value of fahr
calls the ftoc function
allocates stack storage for ftoc
calculates the temperature on the Celsius scale
stores the result of the calculation
frees the stack storage
returns to main
assigns the result to celsius.

Suppose ftoc is defined as an inline function by adding the __inline keyword, as follows:

__inline static double ftoc(double fahr)
{
   return (5.0 / 9.0) * (fahr - 32.0);
}

When the program is compiled with the optimizer and the inline option in effect, the compiler replaces the call to ftoc with the code for the ftoc function, as if the program had been written as follows:

#include <stdio.h>

int main(void)
{
   double fahr, celsius;
   puts("Fahrenheit  Celsius");
   for (fahr = 0.0; fahr <= 300.0; fahr += 20.0) {
      celsius = (5.0 / 9.0) * (fahr - 32.0);
      printf("   %4.0f     %6.1f\n", fahr, celsius);
   }
   return 0;
}

Note that the body of ftoc has been substituted for the call in main, and the separate static definition of ftoc has been eliminated. Of the eight steps listed above, only two remain in the loop:

calculate the temperature on the Celsius scale
assign the result to celsius.
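The effect of this transformation can be checked with a small portable sketch. It assumes a C99 compiler, where static inline plays the role of SAS/C's __inline; table_with_call and table_inlined are hypothetical helper names. One fills the temperature table by calling ftoc, the other with the body of ftoc written in place, and both should produce the same 16 values.

```c
#include <stddef.h>

/* C99 "static inline" stands in for SAS/C's __inline here. */
static inline double ftoc(double fahr)
{
   return (5.0 / 9.0) * (fahr - 32.0);
}

/* Fill the table through the function call, as in the original loop. */
static size_t table_with_call(double out[], size_t max)
{
   size_t n = 0;
   double fahr;
   for (fahr = 0.0; fahr <= 300.0 && n < max; fahr += 20.0)
      out[n++] = ftoc(fahr);
   return n;
}

/* Fill the same table with the body of ftoc substituted by hand,
   which is what the compiler does when it inlines the call. */
static size_t table_inlined(double out[], size_t max)
{
   size_t n = 0;
   double fahr;
   for (fahr = 0.0; fahr <= 300.0 && n < max; fahr += 20.0)
      out[n++] = (5.0 / 9.0) * (fahr - 32.0);
   return n;
}
```

In C99, inline is only a request, so whether a given compiler actually expands ftoc is up to that compiler; the values are the same either way.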

Disadvantages of Using Inline Functions

The compiler generates a copy of the code for an inline function at every call to the function. If the function is very large or is called in many different places, the size of the generated code for the program can increase dramatically. In addition, using inline functions may significantly increase the amount of time required to compile the program.

Compiler Options for Inlining

Several compiler options are supported to allow control over the amount of inlining performed by the compiler. These options are discussed in the following sections.

Using the inlocal option to control inlining

The inlocal option can be used to gain some of the benefits of inlining without using the __inline keyword. This option enables the inlining of all static functions that are called exactly once in the source program. By limiting inlining to static functions that are called only once, the inlocal option guarantees that the generated code for the program will be no larger than it would be without inlining. In the preceding example, the same result can be obtained without the __inline keyword by compiling the program with the inlocal option.

Using the complexity option to control inlining

The complexity option provides another way to use inlining without the __inline keyword. If the inline option is in effect, the compiler automatically inlines small static and extern functions even if they are not defined with the __inline keyword. The complexity option defines what counts as small; it takes a value between 0 and 20, inclusive. For example, you can specify complexity(4) (-Kcomplexity=4 under UNIX System Services [USS]), which tells the compiler to inline automatically all functions whose complexity is no higher than 4.

Complexity is a measure of the number of discrete operations defined by the function. In general, the larger the value specified for complexity, the larger the functions that are automatically inlined. The ftoc function, described earlier, has a complexity value of 1. The following function, which multiplies two 10-by-10 matrices, a and b, and stores the result in matrix c, has a complexity value of 8:

void mmult(double c[][10], double a[][10], double b[][10])
{
   int i, j, k;

   for (i = 0; i < 10; i++)
      for (j = 0; j < 10; j++) {
         c[i][j] = 0.0;
         for (k = 0; k < 10; k++)
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
      }
}

The following function, a simple binary search, has a complexity value of 11. It returns the index of the element of list whose value is the same as target. num_els is the number of elements in the list array, and list is sorted alphabetically. If target is not found, the function returns -1.

#include <string.h>

int binsrch(char *target, char *list[], int num_els)
{
   int where, hit;
   int low, high, current;

   low = 0;
   high = num_els - 1;     /* Index of last element; num_els >= 1. */
   current = high / 2;     /* Find middle element of array.        */
   hit = -1;               /* Target not found yet.                */

   do {
      where = strcmp(target, list[current]);
      if (where < 0)       /* Target is in top half of list.    */
         high = current - 1;
      else if (where > 0)  /* Target is in bottom half of list. */
         low = current + 1;
      else
         hit = current;                             /* success  */
      current = (high + low) / 2;
   } while (high >= low && hit < 0);

   return hit;
}
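If the list might be empty, a slightly different loop avoids reading list[0] when num_els is 0. The sketch below is a hypothetical variant, binsrch_checked, not the function shown above: it tests low <= high before every comparison and computes the midpoint as low + (high - low) / 2. Like binsrch, it assumes list is sorted in strcmp order.

```c
#include <string.h>

/* Hypothetical variant of binsrch: checks the bounds before each
   comparison, so an empty list returns -1 without touching list[0]. */
int binsrch_checked(const char *target, char *const list[], int num_els)
{
   int low = 0;
   int high = num_els - 1;                   /* last valid index */

   while (low <= high) {
      int current = low + (high - low) / 2;  /* middle element   */
      int where = strcmp(target, list[current]);
      if (where < 0)
         high = current - 1;                 /* target sorts before current */
      else if (where > 0)
         low = current + 1;                  /* target sorts after current  */
      else
         return current;                     /* success */
   }
   return -1;                                /* target not found */
}
```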

The optimizer default is complexity(0), which means that no functions are considered small enough to inline unless they are defined with the __inline keyword. Note that a high value for complexity can substantially increase the size of the generated code for the compilation.

As mentioned earlier, inline functions can call other inline functions or call themselves recursively. You can control how the compiler generates code for sequences of calls to inline functions and for recursive inline functions by using the depth and rdepth options.

Using the depth option to control inlining

The depth option specifies a limit on the number of nested inline function calls. If inline function f0 calls inline function f1, which in turn calls inline function f2, and so on through fn, then a single call to f0 can cause a significant increase in the size of the function that calls f0.

The following program shows how the compiler inlines functions that call other inline functions. This program computes the length of the hypotenuse of a right triangle whose other two sides have lengths a and b. The main function calls hypot, which in turn calls the square function.

#include <stdio.h>
#include <math.h>

static double hypot(double, double);
static double square(double);

int main(void)
{
   double a, b, c;

   for (a = 1.0; a < 10.0; a += 1.5) {
      b = a + 0.75;
      c = hypot(a, b);
      printf("a = %f, b = %f, c = %f\n", a, b, c);
   }
   return 0;
}

static double hypot(double a, double b)
{
   return sqrt(square(a) + square(b));
}

static double square(double x)
{
   return x * x;
}

If both hypot and square are inline functions, then the compiler generates code for main as if the following program had been used.

#include <stdio.h>
#include <math.h>

int main(void)
{
   double a, b, c;

   for (a = 1.0; a < 10.0; a += 1.5) {
      b = a + 0.75;
      c = sqrt(a * a + b * b);
      printf("a = %f, b = %f, c = %f\n", a, b, c);
   }
   return 0;
}

Note that the square function is inlined in hypot , which is then inlined in main . In this program, the maximum calling depth is 2.
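A portable sketch of the same two-level call chain follows, assuming C99 static inline in place of __inline. To keep it free of <math.h> and its link requirements, it stops at the sum of squares that hypot passes to sqrt; sum_squares and sum_squares_expanded are hypothetical names. A call to sum_squares reaches square at calling depth 2, and sum_squares_expanded shows the code that remains after both levels are inlined.

```c
/* C99 "static inline" stands in for SAS/C's __inline. */
static inline double square(double x)
{
   return x * x;
}

/* Hypothetical: the argument that hypot passes to sqrt.
   square is inlined here, one level below the caller. */
static inline double sum_squares(double a, double b)
{
   return square(a) + square(b);
}

/* The same computation with both levels of inlining done by hand. */
static double sum_squares_expanded(double a, double b)
{
   return a * a + b * b;
}
```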

If a long sequence of inline function calls is defined, then the size of the generated code for a compilation can increase greatly because of the number of functions being inlined. The depth option can be used to control the calling depth of inline functions. If the calling depth exceeds the number specified by the depth option, the compiler stops inlining and generates calls to the functions instead.

By default, the compiler uses a maximum calling depth of 3. The compiler accepts depth option values between 0 and 6, inclusive.

Using the rdepth option to control inlining

If the rdepth option is used, the compiler inlines recursive inline functions. The rdepth option specifies a maximum depth of recursive function calls to be inlined. The following program shows an example of this kind of inlining. The fib function calculates the Fibonacci function for its argument.

#include <stdio.h>

__inline static int fib(int);

int main(void)
{
   int i;

   for (i = 0; i < 10; i++)
      printf("fib(%d) = %d\n", i, fib(i));
   return 0;
}

__inline static int fib(int i)
{
   if (i < 2)
      return i;
   else
      return fib(i-1) + fib(i-2);
}

If the program is compiled using rdepth(2) , then the compiler generates code as if the following program had been used:

#include <stdio.h>

static int fib(int);

int main(void)
{
   int i;
   int result1, result2, result;   /* compiler temporary variables */

   for (i = 0; i < 10; i++) {
      if (i < 2)
         result = i;
      else {
         if ((i - 1) < 2)
            result1 = i - 1;
         else
            result1 = fib((i - 1) - 1) + fib((i - 1) - 2);
         if ((i - 2) < 2)
            result2 = i - 2;
         else
            result2 = fib((i - 2) - 1) + fib((i - 2) - 2);
         result = result1 + result2;
      }
      printf("fib(%d) = %d\n", i, result);
   }
   return 0;
}

static int fib(int i)
{
   if (i < 2)
      return i;
   else
      return fib(i-1) + fib(i-2);
}

The compiler has inlined code equivalent to the first two recursive calls to the fib function. This type of inlining can be very useful with recursive functions that have limited depth.
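The equivalence claimed for rdepth(2) can be checked directly. In the sketch below, fib is the function above without __inline, and fib_rdepth2 is a hypothetical helper that repeats the expanded body from main, with the remaining recursive calls left as ordinary calls to fib.

```c
/* The recursive function from the text, without __inline. */
static int fib(int i)
{
   return (i < 2) ? i : fib(i - 1) + fib(i - 2);
}

/* Hypothetical helper: the body produced at the call site under
   rdepth(2).  Both recursive calls are expanded one level; the
   calls below that level remain real calls to fib. */
static int fib_rdepth2(int i)
{
   int result1, result2;

   if (i < 2)
      return i;
   result1 = ((i - 1) < 2) ? i - 1 : fib((i - 1) - 1) + fib((i - 1) - 2);
   result2 = ((i - 2) < 2) ? i - 2 : fib((i - 2) - 1) + fib((i - 2) - 2);
   return result1 + result2;
}
```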

The maximum depth that can be specified using the rdepth option is 6. A large depth value can cause a large increase in the size of the generated code for the compilation. rdepth(1) is the default; that is, the compiler will not inline recursive functions.

The __actual Keyword for Inline Functions

There is no difference between static and extern functions defined using the __inline keyword. However, keep in mind that the compiler generally does not create a callable function for an inline function. This is not a problem if the function is declared static, because all calls to the function are replaced with its inlined code. However, extern inline functions cannot be called from other compilations, because no callable copy of the function exists.

The __actual keyword can be used in the definition of an inline function. __actual implies __inline, but it also directs the compiler to create a callable copy of the function. An __actual extern function is, of course, callable from other compilations, just like any other extern function.

Note that an extern __inline function in a source file generates no code, whether or not the optimizer is run, unless the __actual keyword is specified. This implies that if you define an extern __inline function in a header file, then to support unoptimized or debug code you must provide an __actual definition in some source file linked with the application.

This restriction does not apply to static __inline functions. When compiling without optimization, a static __inline function is compiled like any other static function. If you define a static __inline function in a header file and compile without optimization, you will have a copy of its code in the object file for each source file that includes the header file. Thus, there is a tradeoff between declaring such functions static, which can waste memory on multiple copies of the code, and declaring them extern, which forces the programmer to add an __actual definition for each such function to an appropriate source file. Which option is better depends on the characteristics and requirements of the application.

Functions that Cannot Be Inlined

The compiler cannot inline a function

that has its address taken
that has a variable length argument list
that is called with an argument list that does not agree with the declared parameter list.

Further Benefits of Inline Functions

There are additional benefits that occur when functions are inlined.

Extending the range of optimization

The value of using inline functions can go far beyond the obvious benefit of reducing function call overhead. In general, the compiler inlines the function and then optimizes the resulting code. Inlining often opens up additional possibilities for optimization. For example, if one or more arguments to an inline function are constant values, the compiler can often perform some of the computations at compile time.

Here is a simple example. Suppose the following program invokes the inline ftoc function given earlier:

#include <stdio.h>

/* The __inline definition of ftoc shown earlier is assumed
   to appear here. */

int main(void)
{
   double fahr, celsius;

   fahr = 212.0;
   celsius = ftoc(fahr);
   printf("%fF is %fC\n", fahr, celsius);
   return 0;
}

After inlining, the program looks like this:

#include <stdio.h>

int main(void)
{
   double fahr, celsius;

   fahr = 212.0;
   celsius = (5.0 / 9.0) * (fahr - 32.0);
   printf("%fF is %fC\n", fahr, celsius);
   return 0;
}

Because fahr is assigned a constant value, the compiler can compute the result of the calculation during compilation and produce code equivalent to the following program:

#include <stdio.h>

int main(void)
{
   printf("%fF is %fC\n", 212.0, 100.0);
   return 0;
}
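A C99 sketch of the same situation follows; static inline stands in for __inline, and boiling_point_c is a hypothetical wrapper. Whether a particular compiler folds ftoc(212.0) to 100.0 at compile time depends on that compiler and its optimization level, but the value computed is the same either way.

```c
/* C99 "static inline" stands in for SAS/C's __inline. */
static inline double ftoc(double fahr)
{
   return (5.0 / 9.0) * (fahr - 32.0);
}

/* Hypothetical wrapper: after inlining, every operand is a
   constant, so an optimizing compiler may reduce the whole
   body to "return 100.0;". */
static double boiling_point_c(void)
{
   return ftoc(212.0);
}
```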

Inline functions as replacements for macros

Since temporary auto variables can be defined in inline functions, often an inline function can be written that is easier to use than a macro.

Consider the problem of writing a function (called strlength) that behaves almost exactly like the Standard strlen function. The one difference is that if the argument to strlength is NULL, strlength returns 0. (The behavior of strlen is undefined if it is called with a NULL argument.) A strlength macro is easily defined as follows:

#define strlength(p) ((p == NULL) ? 0 : strlen(p))

This macro works as described, with one drawback: its argument is evaluated twice, once in the test and once in the call to strlen. This is what is known as an unsafe macro. If it is used with an argument that has side effects, the result is usually incorrect. Suppose that the strlength macro is called as follows:

p = "A TYPICAL STRING";
n = strlength(p++);

The value assigned to n is 15, which is incorrect. (The intended result is 16.) This is because the macro expands to the statement shown here. (Note that p has already been incremented once by the test when its value is passed to strlen, and it is incremented a second time by the call.)

n = ((p++ == NULL) ? 0 : strlen(p++));

However, it is easy to define an inline version of strlength that works correctly, as shown here:

#include <string.h>

__inline size_t strlength(const char *p)
{
   return (p == NULL) ? 0 : strlen(p);
}
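The difference between the two versions can be observed with a short C99 sketch. The macro is renamed STRLENGTH_MACRO (a hypothetical name) so that it can coexist with the function, and static inline replaces __inline. Applied to p++, the function sees the original pointer and advances p once; the macro sees p + 1 and advances p twice.

```c
#include <stddef.h>
#include <string.h>

/* The unsafe macro from the text, renamed: evaluates p twice. */
#define STRLENGTH_MACRO(p) ((p == NULL) ? 0 : strlen(p))

/* Inline function version: the argument expression is evaluated
   exactly once, when it is passed. */
static inline size_t strlength(const char *p)
{
   return (p == NULL) ? 0 : strlen(p);
}
```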

Using inline functions to generate optimized code

As mentioned before, inlining often opens up additional possibilities for optimization. The following example shows how to use inline functions to take advantage of the compiler's capability to optimize the program after inlining is done.

The pow function, part of the C library, computes the value of a raised to a power p, as expressed by the relation

$$r = a^{p}$$

pow can be called with any real values for a and p. The following inline function, called power, supplants the pow function by generating inline code for constant, nonnegative whole-number values of p. For p <= 16.0, the compiler generates code to compute the value directly. For p > 16.0, the compiler generates a loop to compute the result. If p is a variable, is negative, or contains a nonzero fractional part, the compiler generates a call to the library pow function. In no case does the compiler generate code for more than one condition.

#include <lcdef.h>                             /* [1] */
#include <math.h>                              /* [2] */
#undef pow                                     /* [3] */
#define pow(x, p) power(x, p, isnumconst(p))   /* [4] */

static __inline double power(double a, double p, int p_is_constant)
{

      /* Test the exponent to see if it's   */
      /*  - a compile-time numeric constant */
      /*  - a whole number                  */
      /*  - nonnegative                     */
   if (p_is_constant && (int) p == p && (int) p >= 0) {       /* [5] */

      int n = p;

         /* Handle the cases for 0 <= n <= 4 directly. */
      if (n == 0) return 1.0;                                /* [6] */
      else if (n == 1) return a;
      else if (n == 2) return a * a;
      else if (n == 3) return a * a * a;
      else if (n == 4) return (a * a) * (a * a);

         /* Handle 5 <= n <= 16 by calling power            */
         /* recursively.                                    */
         /* Note that power is invoked directly, specifying */
         /* 1 as the value of the p_is_constant argument.   */
         /* This is because the isnumconst macro returns    */
         /* "false" for the expressions (n/2) and           */
         /* ((n+1)/2), which would defeat the optimization. */
      else if (n <= 16)                                      /* [7] */
         return power(a, (double)(n/2), 1) *
                power(a, (double)((n+1)/2), 1);

         /* Handle n > 16 via a loop.  On each pass, a holds   */
         /* (a ** (2 ** k)) for k = 0, 1, 2, ...; the loop     */
         /* multiplies into prod each such power whose bit k   */
         /* is set in n.                                       */
      else {                                                 /* [8] */
         double prod = 1.0;
         for (; n != 0; a *= a, n >>= 1)
            if (n & 1) prod *= a;
         return prod;
      }
   }

      /* Finally, if p is nonconstant, negative, or not a  */
      /* whole number, call the library pow function.  The */
      /* pow macro is defeated by surrounding the name     */
      /* "pow" with parentheses.                           */
   else                                                      /* [9] */
      return (pow)(a, p);
}

The callout numbers [1] through [9] in the code above key the explanation that follows:

[1] <lcdef.h> contains the definition of the isnumconst macro.
[2] <math.h> contains the declaration of the pow function.
[3] The #undef pow preprocessor directive undefines any macro that may be defined for the name pow.
[4] This directive defines a pow macro that causes the power inline function to be used instead of the library pow function. Note that the isnumconst macro is used to determine whether the second argument to power is a numeric constant. The result of isnumconst is passed as the third argument to power. (See Chapter 1, "Introduction to the SAS/C Library," in SAS/C Library Reference, Volume 1 for more information about isnumconst.)
[5] This is a constant expression and is evaluated at compile time. The expression checks whether p is a numeric constant (as determined by isnumconst), a whole number, and greater than or equal to 0.
    If an if-test is a constant expression (as this one is), the compiler evaluates the expression and then generates code only for the then branch or the else branch, depending on the result. In this example, if the result is false (that is, p is not a constant whole number greater than or equal to 0), the compiler ignores the statements that make up the then branch and does not generate code for them. If the result is true, the compiler ignores the else branch and does not generate a call to the library's pow function.
[6] This if-test, as well as the next four, is also a constant expression. As above, the compiler generates code for a return statement only if the result of its test is true. Therefore, if 0 ≤ n ≤ 4, the compiler generates only the return statement that computes aⁿ for n = 0, 1, 2, 3, or 4.
[7] If 5 ≤ n ≤ 16, power is called recursively to evaluate aⁿ. Note that a program using this function should be compiled using the rdepth option with a recursion depth of 6.
[8] For n > 16, power uses a loop to compute aⁿ.
[9] Finally, if p is nonconstant, negative, or not a whole number, power calls the library's pow function to compute the result. Note that parentheses surround the name of the function. This defeats the macro definition for pow and ensures that a true function call is generated.
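The n > 16 branch is ordinary binary (square-and-multiply) exponentiation. As a standalone sketch, the hypothetical function power_loop below repeats that loop with an unsigned exponent, an assumption made so that a negative exponent can never reach the loop; it needs no SAS/C-specific headers.

```c
/* Hypothetical standalone version of the n > 16 branch of power. */
static double power_loop(double a, unsigned n)
{
   double prod = 1.0;

   /* On pass k, a holds a**(2**k); that factor is multiplied
      into prod when bit k of n is set.  The final squaring of a
      after the last bit is harmless because it is never used. */
   for (; n != 0; a *= a, n >>= 1)
      if (n & 1)
         prod *= a;
   return prod;
}
```

Because each pass halves n, the loop runs once per bit of the exponent, which is why it is used only for exponents too large to unroll.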

Here are some examples of the use of the power function:

Example 1

r = pow(a, 0);

Since the if-test (n == 0) is true, the compiler generates code to perform the statement

r = 1.0;

Example 2

r = pow(a, 2);

Since the if-test (n == 2) is true, the compiler generates code to perform the statement

r = a * a;

Example 3

r = pow(x, y);

Since y is not a constant, the compiler generates code to call the library's pow function:

r = (pow)(x, y);

Example 4

r = pow(x, 0.75);

Since 0.75 is not a whole number, the compiler again generates code to call the library's pow function:

r = (pow)(x, 0.75);

Example 5

r = pow(x, 15.0);

Since 15.0 is greater than 4 but not greater than 16, power calls itself recursively. The compiler generates code equivalent to

r = x * x * x * (x * x) * (x * x) *
    (x * x) * (x * x) * (x * x) * (x * x);

The computation above can be performed using only six floating-point multiplications. The following assembler language code illustrates the machine code instructions generated to compute x¹⁵:

LD    0,X      Floating-point register 0 (FPR0) = x.
LD    2,X      FPR2 = x, as well.
MDR   0,2      FPR0 = x * x = x²
MDR   2,0      FPR2 = x * x² = x³
MDR   0,0      FPR0 = x² * x² = x⁴
MDR   2,0      FPR2 = x³ * x⁴ = x⁷
MDR   0,0      FPR0 = x⁴ * x⁴ = x⁸
MDR   2,0      FPR2 = x⁸ * x⁷ = x¹⁵
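To make the six-multiplication sequence concrete, the sketch below mirrors the register operations above in C. The function name pow15 and the local variables fpr0 and fpr2, which stand in for floating-point registers 0 and 2, are hypothetical; this illustrates the multiplication chain, not code any compiler is guaranteed to emit.

```c
/* Hypothetical C rendering of the instruction sequence for x**15. */
static double pow15(double x)
{
   double fpr0 = x, fpr2 = x;   /* LD 0,X and LD 2,X       */

   fpr0 *= fpr2;                /* MDR 0,2: fpr0 = x**2    */
   fpr2 *= fpr0;                /* MDR 2,0: fpr2 = x**3    */
   fpr0 *= fpr0;                /* MDR 0,0: fpr0 = x**4    */
   fpr2 *= fpr0;                /* MDR 2,0: fpr2 = x**7    */
   fpr0 *= fpr0;                /* MDR 0,0: fpr0 = x**8    */
   fpr2 *= fpr0;                /* MDR 2,0: fpr2 = x**15   */
   return fpr2;                 /* six multiplications     */
}
```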

Note that programs that use the power function as shown should be compiled with these options: optimize, rdepth(6), and depth(3). Because the function is defined with the __inline keyword, the complexity option is not required. If the __inline keyword is not used, you need to specify the complexity option with a value of at least 16.
