Abstract | The fast Fourier transform (FFT) is the cornerstone of many supercomputer applications and therefore needs careful performance tuning. Most often, however the real performance of the FFT implementations is far below the acceptable figures. In this paper we explore several strategies for performance optimisations of the FFT computation, such as enhancing instruction-level parallelism, loop merging, and reducing the memory loads and stores by using a special-purpose automatic loop unroller. Our approach is based on the principle of complete unrolling which we apply to modify the FT kernel of the NAS Parallel Benchmarks (NPB). In experiments on two different IBM SP2 platforms, our automatically generated unrolled FFT subroutine is shown to improve the performance between 40% and 53% in comparison with the original code. Further the execution time of the entire 3-D FFT mega-step of the benchmark is faster than when calls to a similar FFT subroutine from the vendor-optimised PESSL numerical library are used. Preliminary results suggest that the completely unrolled code also outperforms FFTW another high-performance FFT package. Finally, our approach for automatic generation of moderately optimised but specialised codes requires only a modest amount of programming effort. |
---|