Although there are a wide range of fast Fourier transform (FFT) algorithms, involving a wealth of mathematics from number theory to polynomial algebras, the vast majority of FFT implementations in practice employ some variation on the Cooley-Tukey algorithm [9]. The Cooley-Tukey algorithm can be derived in two or three lines of elementary algebra. It can be implemented almost as easily, especially if only power-of-two sizes are desired; numerous popular textbooks list short FFT subroutines for power-of-two sizes, written in the language du jour. The implementation of the Cooley-Tukey algorithm, at least, would therefore seem to be a long-solved problem. In this chapter, however, we will argue that matters are not as straightforward as they might appear.
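For concreteness, a textbook-style recursive radix-2 subroutine of the sort alluded to above can indeed be written in a few lines. The sketch below (in Python for brevity; the name `fft_radix2` is ours, not drawn from any particular textbook) computes the DFT of a power-of-two-length sequence by the standard even/odd decomposition:

```python
import cmath

def fft_radix2(x):
    """Textbook recursive radix-2 Cooley-Tukey DFT.
    The length of x must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    # Split into even- and odd-indexed subsequences and transform each.
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    # Combine the half-size transforms using the twiddle factors
    # exp(-2*pi*i*k/n), exploiting the symmetry of the DFT.
    result = [0] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        result[k] = even[k] + t
        result[k + n // 2] = even[k] - t
    return result
```

As the rest of this chapter argues, however, the simplicity of such a routine is deceptive: it is typically far slower than a highly optimized implementation of the "same" algorithm.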

For many years, the primary route to improving upon the Cooley-Tukey
FFT seemed to be reductions in the count of arithmetic operations,
which often dominated the execution time prior to the ubiquity of fast
floating-point hardware (at least on non-embedded processors). Therefore, great effort was expended towards
finding new algorithms with reduced arithmetic
counts [13], from Winograd's method, which achieves the minimum known number of multiplications (at the price of many more additions), to the split-radix variants of Cooley-Tukey that long held the record for the lowest total count of additions and multiplications for power-of-two sizes.

And yet there is a vast gap between this basic mathematical theory and
the actual practice—highly optimized FFT packages are often an
order of magnitude faster than the textbook subroutines, and the
internal structure to achieve this performance is radically different
from the typical textbook presentation of the “same” Cooley-Tukey
algorithm. For example, Figure 1 plots the ratio of benchmark
speeds between a highly optimized FFT [15], [16] and a
typical textbook radix-2 implementation [38], and the
former is faster by a factor of 5–40 (with a larger ratio as the transform size grows).
In particular, in this chapter we will discuss some of the lessons
learned and the strategies adopted in the FFTW library. FFTW [15], [16] is a widely used free-software library that computes the
discrete Fourier transform (DFT) and its various special cases.
Its performance is competitive even with manufacturer-optimized programs [16],
and this performance is *portable* thanks to the structure of the
algorithms employed, self-optimization techniques, and highly
optimized kernels (FFTW's *codelets*) generated by a
special-purpose compiler.

This chapter is structured as follows. First, in "Review of the Cooley-Tukey FFT", we briefly review the basic ideas behind the Cooley-Tukey algorithm and define some common terminology, especially focusing on the many degrees of freedom that the abstract algorithm allows implementations. Next, in "Goals and Background of the FFTW Project", we provide some context for FFTW's development and stress that performance, while it receives the most publicity, is not necessarily the most important consideration in the implementation of a library of this sort. Third, in "FFTs and the Memory Hierarchy", we consider a basic theoretical model of the computer memory hierarchy and its impact on FFT algorithm choices: quite general considerations push implementations towards large radices and explicitly recursive structure. Unfortunately, general considerations are not sufficient in themselves, so we will explain in "Adaptive Composition of FFT Algorithms" how FFTW self-optimizes for particular machines by selecting its algorithm at runtime from a composition of simple algorithmic steps. Furthermore, "Generating Small FFT Kernels" describes the utility and the principles of automatic code generation used to produce the highly optimized building blocks of this composition, FFTW's codelets. Finally, we will briefly consider an important non-performance issue, in "Numerical Accuracy in FFTs".
