This article aims to spotlight the potency of compiler optimizations, focusing on the widely used Intel C++ compilers.
Highlights: What are compiler optimizations? | -On | Architecture-targeted optimization | Interprocedural Optimization | -fno-alias | Compiler optimization reports
Any compiler executes a series of steps for converting the high-level source code to the low-level machine code. These involve lexical analysis, syntax analysis, semantic analysis, intermediate code generation (or IR), optimization, and code generation.
During the optimization phase, the compiler meticulously seeks ways to transform a program, aiming for a semantically equivalent output that utilizes fewer resources or executes more rapidly. Techniques employed in this process encompass but are not limited to constant folding, loop optimization, function inlining, and dead code elimination.
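To make these techniques concrete, here is a small, hypothetical C++ fragment (not part of our benchmark) annotated with what an optimizer is typically allowed to do:

double circle_area(double r) {
    const double pi = 3.14159265358979;
    double two_pi = 2.0 * pi;     // constant folding: 2.0 * pi is computed at compile time
    double unused = two_pi * r;   // dead code elimination: the result is never used
    if (false) {                  // the whole branch is provably dead and can be removed
        return 0.0;
    }
    return pi * r * r;            // small functions like this are good inlining candidates
}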
Developers can specify a set of compiler flags during the compilation process, a practice familiar to those using options like “-g” or “-pg” with GCC for debugging and profiling information. As we go ahead, we’ll discuss similar compiler flags we can use while compiling our application with the Intel C++ compiler. These might help you improve your code’s efficiency and performance.
Our test case solves a two-dimensional heat-diffusion problem, where u(x,y,t) is the temperature at point (x,y) at time t.
We essentially have a C++ code performing Jacobi iterations on grids of variable sizes (which we call resolutions). A grid size of 500 means solving a matrix of size 500x500, and so on.
/*
 * One Jacobi iteration step
 */
void jacobi(double *u, double *unew, unsigned sizex, unsigned sizey) {
    int i, j;
    // Each interior point becomes the average of its four neighbours.
    for (j = 1; j < sizex - 1; j++) {
        for (i = 1; i < sizey - 1; i++) {
            unew[i * sizex + j] = 0.25 * (u[i * sizex + (j - 1)] + // left
                                          u[i * sizex + (j + 1)] + // right
                                          u[(i - 1) * sizex + j] + // top
                                          u[(i + 1) * sizex + j]); // bottom
        }
    }
    // Copy the updated grid back into u for the next iteration.
    for (j = 1; j < sizex - 1; j++) {
        for (i = 1; i < sizey - 1; i++) {
            u[i * sizex + j] = unew[i * sizex + j];
        }
    }
}
MFLOP/s stands for “Million Floating Point Operations Per Second.” It is a unit of measurement used to quantify the performance of a computer or processor in terms of floating-point operations. Floating-point operations involve mathematical calculations with decimal or real numbers represented in a floating-point format.
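As a rough sketch of how these numbers can be obtained (the harness below is my own illustration, not the article's code, and it assumes about 4 floating-point operations, 3 additions and 1 multiplication, per interior grid point per sweep):

#include <chrono>

void jacobi(double *u, double *unew, unsigned sizex, unsigned sizey); // kernel shown above

// Hypothetical measurement harness: time 'iterations' Jacobi sweeps on an
// n x n grid and report the achieved MFLOP/s.
double measure_mflops(double *u, double *unew, unsigned n, int iterations) {
    auto start = std::chrono::steady_clock::now();
    for (int it = 0; it < iterations; it++) {
        jacobi(u, unew, n, n);
    }
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    double flops = 4.0 * (n - 2) * (n - 2) * iterations;  // ~4 flops per interior point
    return flops / elapsed.count() / 1.0e6;               // MFLOP/s
}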
Note 1: To provide a stable result, I run the executable 5 times for each resolution and take the average value of the MFLOP/s values.
Note 2: The default optimization level of the Intel C++ compiler is -O2, so it is important to explicitly specify -O0 when compiling the unoptimized baseline.
These are some of the most commonly used compiler flags when one begins exploring compiler optimizations. Ideally, performance would follow -Ofast > -O3 > -O2 > -O1 > -O0, but this doesn't necessarily happen in practice. The key points of these options are as follows:
-O1: Optimizes for speed while avoiding optimizations that tend to increase code size; auto-vectorization is not enabled at this level.
-O2: The default level. Enables vectorization, inlining, and most other speed optimizations.
-O3: Everything in -O2 plus more aggressive loop and memory-access transformations. It helps most in loop-heavy, floating-point codes like ours, but can occasionally slow other codes down.
-Ofast: Sets -O3 together with “-no-prec-div” and “-fp-model fast=2”, trading strict floating-point semantics for speed.
It is evident that all these optimization levels are much faster than our base code (compiled with “-O0”): the execution time is 2–3x lower than the base case. But what about MFLOP/s?
Overall, though only slightly, “-O3” performs the best.
The extra flags used by “-Ofast” (“-no-prec-div -fp-model fast=2”) aren’t giving any additional speedup.
The answer lies in strategic compiler flags. Experimenting with options such as “-xHost” and, more precisely, “-xCORE-AVX512” may allow us to harness the full potential of the machine’s capabilities and tailor optimizations for optimal performance.
-xHost:
Goal: Tells the compiler to generate code for the highest instruction set available on the processor performing the compilation.
Key Features: A convenient way to get architecture-specific optimizations without naming an instruction set explicitly; if the build machine supports AVX-512, that path is used.
Considerations: The binary is tuned to the build machine, so portability to older processors suffers.
-xCORE-AVX512:
Goal: Explicitly instruct the compiler to generate code that utilizes the Intel Advanced Vector Extensions 512 (AVX-512) instruction set.
Key Features: AVX-512 is an advanced SIMD (Single Instruction, Multiple Data) instruction set that offers wider vector registers and additional operations compared to previous versions like AVX2. Enabling this flag allows the compiler to leverage these advanced features for optimized performance.
Considerations: Portability is again the culprit here. The binaries generated with AVX-512 instructions may not run optimally on processors that do not support this instruction set. They may not work at all!
By default, “-xCORE-AVX512” assumes that the program is unlikely to benefit from heavy zmm register usage, so the compiler avoids the zmm registers unless it expects a clear performance gain.
If one plans to use the zmm registers without restrictions, “-qopt-zmm-usage” can be set to “high”. That’s what we’ll be doing as well.
Woohoo!
The remarkable part is that we achieved these results without any substantial manual interventions — simply by incorporating a handful of compiler flags during the application compilation process.
Note: Don’t worry if your hardware doesn’t support AVX-512. The Intel C++ Compiler also supports optimizations for AVX2, AVX, and even SSE. The official documentation has everything you need to know!
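If you want to check at run time whether the CPU you are on actually supports AVX-512 (my own sketch, not code from the article), the __builtin_cpu_supports builtin, available in GCC-compatible compilers including the LLVM-based Intel compilers, can help:

#include <cstdio>

int main() {
    // Query the running CPU for the AVX-512 Foundation feature bit.
    if (__builtin_cpu_supports("avx512f")) {
        std::printf("AVX-512F available: the AVX-512 build is safe to run.\n");
    } else {
        std::printf("No AVX-512F: fall back to an AVX2/SSE build.\n");
    }
    return 0;
}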
IPO is a multi-step process focusing on the interactions between different functions or procedures within a program. It can include many different kinds of optimizations, such as forward substitution, indirect call conversion, and inlining.
-ipo:
Goal: Enables interprocedural optimization, allowing the compiler to analyze and optimize the entire program, beyond individual source files, during compilation.
Key Features:
- Whole-program optimization: “-ipo” performs analysis and optimization across all source files, considering the interactions between functions and procedures throughout the entire program.
- Cross-function and cross-module optimization: the flag facilitates inlining of functions, synchronization of optimizations, and data-flow analysis across different parts of the program.
Considerations: It requires a separate link step. After compiling with “-ipo”, a dedicated link step is needed to generate the final executable; during linking, the compiler performs additional optimizations based on its whole-program view.
-ip:
Goal: Enables interprocedural analysis and propagation, allowing the compiler to perform some interprocedural optimizations without requiring a separate link step.
Key Features:
- Analysis and propagation: “-ip” enables the compiler to perform analysis and data propagation across different functions and modules during compilation. However, it does not perform all the optimizations that require a full program view.
- Faster compilation: unlike “-ipo”, “-ip” doesn’t necessitate a separate linking step, resulting in speedier compilation times. This can be beneficial during development, when quick feedback is essential.
Considerations: Only some limited interprocedural optimizations occur, including function inlining.
In short: “-ipo” generally provides more extensive interprocedural optimization because it involves a separate link step, but it comes at the cost of longer compilation times. “-ip” is a quicker alternative that performs some interprocedural optimizations without requiring a separate link step, making it suitable for development and testing phases.
Since we only care about performance here, and neither compile time nor executable size is a concern, we’ll focus on “-ipo”.
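To illustrate the kind of opportunity “-ipo” unlocks, consider this hypothetical two-file example (file names and contents are mine, purely for illustration). Without whole-program optimization the call crosses a translation-unit boundary and normally cannot be inlined; with “-ipo” the compiler can inline it and fold the constant:

// weight.cpp (separate translation unit)
double stencil_weight() {
    return 0.25;                 // tiny function: an obvious inlining candidate
}

// main.cpp
double stencil_weight();         // only the declaration is visible here

double weighted_sum(double left, double right, double top, double bottom) {
    // Without -ipo this is an opaque external call; with -ipo the compiler can
    // inline stencil_weight() and multiply by the constant 0.25 directly.
    return stencil_weight() * (left + right + top + bottom);
}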
/*
 * One Jacobi iteration step
 */
void jacobi(double *u, double *unew, unsigned sizex, unsigned sizey) {
    int i, j;
    for (j = 1; j < sizex - 1; j++) {
        for (i = 1; i < sizey - 1; i++) {
            unew[i * sizex + j] = 0.25 * (u[i * sizex + (j - 1)] + // left
                                          u[i * sizex + (j + 1)] + // right
                                          u[(i - 1) * sizex + j] + // top
                                          u[(i + 1) * sizex + j]); // bottom
        }
    }
    for (j = 1; j < sizex - 1; j++) {
        for (i = 1; i < sizey - 1; i++) {
            u[i * sizex + j] = unew[i * sizex + j];
        }
    }
}
The jacobi() function takes two pointers to double as parameters and works on them inside nested for loops. When a compiler sees this function in a source file, it has to be very careful.
The expression that computes unew from u takes the average of four neighboring u values. What if u and unew point to the same memory? That would be the classical problem of pointer aliasing.
Modern compilers are very smart, but to stay safe they must assume that aliasing is possible. In scenarios like this, they therefore avoid any optimization that might change the semantics or the output of the code.
In our case, we know that u and unew are different memory locations meant to store different values, so we can safely tell the compiler that there won’t be any aliasing here.
There are two methods. The first is the C “restrict” keyword (available as “__restrict” in most C++ compilers). But it requires changing the code, and we don’t want that for now.
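For completeness, the code-change route would look roughly like this (a sketch using the non-standard but widely supported __restrict qualifier, since the standard restrict keyword exists only in C):

/*
 * Sketch only: the restrict-qualified signature promises the compiler that
 * u and unew never refer to overlapping memory.
 */
void jacobi(double *__restrict u, double *__restrict unew,
            unsigned sizex, unsigned sizey);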
Anything simple? Let’s try “-fno-alias”.
-fno-alias:
Goal: Instructs the compiler to assume that no aliasing occurs in the program.
Key Features: Under the no-aliasing assumption, the compiler can optimize the code more freely, potentially improving performance.
Considerations: The developer has to use this flag carefully; if aliasing does occur after all, the program may produce unexpected outputs.
Well, now we have something!!!
A closer examination of the assembly code (not shared here) and the generated compiler optimization report (see below) reveals the compiler’s savvy application of further loop transformations. These transformations contribute to a highly optimized performance, showcasing the significant impact of compiler directives on code efficiency.
The Intel C++ compiler provides a valuable feature that allows users to generate an optimization report summarizing all the adjustments made for optimization purposes. The report is saved in YAML format and presents a detailed list of the optimizations the compiler applied to the code. For a detailed description, see the official documentation on “-qopt-report”.
Similarly, the Intel C++ compilers (and all the popular ones) also support pragma directives, which are a very handy feature. It’s worth checking out pragmas like ivdep, parallel, simd, and vector in the official documentation.
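As a quick taste (my own sketch, not code from the article), #pragma ivdep tells the Intel compiler to ignore assumed vector dependencies in the loop that immediately follows it:

void jacobi_inner(double *u, double *unew, unsigned sizex, unsigned sizey) {
    for (unsigned j = 1; j < sizex - 1; j++) {
        // Assert that iterations of the inner loop carry no vector dependencies,
        // so the compiler may vectorize it without having to prove this itself.
        #pragma ivdep
        for (unsigned i = 1; i < sizey - 1; i++) {
            unew[i * sizex + j] = 0.25 * (u[i * sizex + (j - 1)] +
                                          u[i * sizex + (j + 1)] +
                                          u[(i - 1) * sizex + j] +
                                          u[(i + 1) * sizex + j]);
        }
    }
}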