Added pffft

8 years ago · ef07c9d291
--- a/pffft/README.txt
+++ b/pffft/README.txt
@@ -0,0 +1,416 @@
 PFFFT: a pretty fast FFT.

 TL;DR
 --

 PFFFT does 1D Fast Fourier Transforms, of single precision real and
 complex vectors. It tries do it fast, it tries to be correct, and it
 tries to be small. Computations do take advantage of SSE1 instructions
 on x86 cpus, Altivec on powerpc cpus, and NEON on ARM cpus. The
 license is BSD-like.


 Why does it exist:
 --

 I was in search of a good performing FFT library , preferably very
 small and with a very liberal license.

 When one says "fft library", FFTW ("Fastest Fourier Transform in the
 West") is probably the first name that comes to mind -- I guess that
 99% of open-source projects that need a FFT do use FFTW, and are happy
 with it. However, it is quite a large library , which does everything
 fft related (2d transforms, 3d transforms, other transformations such
 as discrete cosine , or fast hartley). And it is licensed under the
 GNU GPL , which means that it cannot be used in non open-source
 products.

 An alternative to FFTW that is really small, is the venerable FFTPACK
 v4, which is available on NETLIB. A more recent version (v5) exists,
 but it is larger as it deals with multi-dimensional transforms. This
 is a library that is written in FORTRAN 77, a language that is now
 considered as a bit antiquated by many. FFTPACKv4 was written in 1985,
 by Dr Paul Swarztrauber of NCAR, more than 25 years ago ! And despite
 its age, benchmarks show it that it still a very good performing FFT
 library, see for example the 1d single precision benchmarks here:
 http://www.fftw.org/speed/opteron-2.2GHz-32bit/ . It is however not
 competitive with the fastest ones, such as FFTW, Intel MKL, AMD ACML,
 Apple vDSP. The reason for that is that those libraries do take
 advantage of the SSE SIMD instructions available on Intel CPUs,
 available since the days of the Pentium III. These instructions deal
 with small vectors of 4 floats at a time, instead of a single float
 for a traditionnal FPU, so when using these instructions one may expect
 a 4-fold performance improvement.

 The idea was to take this fortran fftpack v4 code, translate to C,
 modify it to deal with those SSE instructions, and check that the
 final performance is not completely ridiculous when compared to other
 SIMD FFT libraries. Translation to C was performed with f2c (
 http://www.netlib.org/f2c/ ). The resulting file was a bit edited in
 order to remove the thousands of gotos that were introduced by
 f2c. You will find the fftpack.h and fftpack.c sources in the
 repository, this a complete translation of
 http://www.netlib.org/fftpack/ , with the discrete cosine transform
 and the test program. There is no license information in the netlib
 repository, but it was confirmed to me by the fftpack v5 curators that
 the same terms do apply to fftpack v4:
 http://www.cisl.ucar.edu/css/software/fftpack5/ftpk.html . This is a
 "BSD-like" license, it is compatible with proprietary projects.

 Adapting fftpack to deal with the SIMD 4-element vectors instead of
 scalar single precision numbers was more complex than I originally
 thought, especially with the real transforms, and I ended up writing
 more code than I planned..


 The code:
 --

 Only two files, in good old C, pffft.c and pffft.h . The API is very
 very simple, just make sure that you read the comments in pffft.h.


 Comparison with other FFTs:
 --

 The idea was not to break speed records, but to get a decently fast
 fft that is at least 50% as fast as the fastest FFT -- especially on
 slowest computers . I'm more focused on getting the best performance
 on slow cpus (Atom, Intel Core 1, old Athlons, ARM Cortex-A9...), than
 on getting top performance on today fastest cpus.

 It can be used in a real-time context as the fft functions do not
 perform any memory allocation -- that is why they accept a 'work'
 array in their arguments.

 It is also a bit focused on performing 1D convolutions, that is why it
 provides "unordered" FFTs , and a fourier domain convolution
 operation.


 Benchmark results (cpu tested: core i7 2600, core 2 quad, core 1 duo, atom N270, cortex-A9, cortex-A15, A8X)
 --

 The benchmark shows the performance of various fft implementations measured in 
 MFlops, with the number of floating point operations being defined as 5Nlog2(N)
 for a length N complex fft, and 2.5*Nlog2(N) for a real fft. 
 See http://www.fftw.org/speed/method.html for an explanation of these formulas.

 MacOS Lion, gcc 4.2, 64-bit, fftw 3.3 on a 3.4 GHz core i7 2600

 Built with:

 gcc-4.2 -o test_pffft -arch x86_64 -O3 -Wall -W pffft.c test_pffft.c fftpack.c -L/usr/local/lib -I/usr/local/include/ -DHAVE_VECLIB -framework veclib -DHAVE_FFTW -lfftw3f 

 | input len |real FFTPack|  real vDSP |  real FFTW | real PFFFT | |cplx FFTPack|  cplx vDSP |  cplx FFTW | cplx PFFFT |
 |-----------+------------+------------+------------+------------| |------------+------------+------------+------------|
 |       64  |     2816   |     8596   |     7329   |     8187   | |     2887   |    14898   |    14668   |    11108   |
 |       96  |     3298   |      n/a   |     8378   |     7727   | |     3953   |      n/a   |    15680   |    10878   |
 |      128  |     3507   |    11575   |     9266   |    10108   | |     4233   |    17598   |    16427   |    12000   |
 |      160  |     3391   |      n/a   |     9838   |    10711   | |     4220   |      n/a   |    16653   |    11187   |
 |      192  |     3919   |      n/a   |     9868   |    10956   | |     4297   |      n/a   |    15770   |    12540   |
 |      256  |     4283   |    13179   |    10694   |    13128   | |     4545   |    19550   |    16350   |    13822   |
 |      384  |     3136   |      n/a   |    10810   |    12061   | |     3600   |      n/a   |    16103   |    13240   |
 |      480  |     3477   |      n/a   |    10632   |    12074   | |     3536   |      n/a   |    11630   |    12522   |
 |      512  |     3783   |    15141   |    11267   |    13838   | |     3649   |    20002   |    16560   |    13580   |
 |      640  |     3639   |      n/a   |    11164   |    13946   | |     3695   |      n/a   |    15416   |    13890   |
 |      768  |     3800   |      n/a   |    11245   |    13495   | |     3590   |      n/a   |    15802   |    14552   |
 |      800  |     3440   |      n/a   |    10499   |    13301   | |     3659   |      n/a   |    12056   |    13268   |
 |     1024  |     3924   |    15605   |    11450   |    15339   | |     3769   |    20963   |    13941   |    15467   |
 |     2048  |     4518   |    16195   |    11551   |    15532   | |     4258   |    20413   |    13723   |    15042   |
 |     2400  |     4294   |      n/a   |    10685   |    13078   | |     4093   |      n/a   |    12777   |    13119   |
 |     4096  |     4750   |    16596   |    11672   |    15817   | |     4157   |    19662   |    14316   |    14336   |
 |     8192  |     3820   |    16227   |    11084   |    12555   | |     3691   |    18132   |    12102   |    13813   |
 |     9216  |     3864   |      n/a   |    10254   |    12870   | |     3586   |      n/a   |    12119   |    13994   |
 |    16384  |     3822   |    15123   |    10454   |    12822   | |     3613   |    16874   |    12370   |    13881   |
 |    32768  |     4175   |    14512   |    10662   |    11095   | |     3881   |    14702   |    11619   |    11524   |
 |   262144  |     3317   |    11429   |     6269   |     9517   | |     2810   |    11729   |     7757   |    10179   |
 |  1048576  |     2913   |    10551   |     4730   |     5867   | |     2661   |     7881   |     3520   |     5350   |
 |-----------+------------+------------+------------+------------| |------------+------------+------------+------------|


 Debian 6, gcc 4.4.5, 64-bit, fftw 3.3.1 on a 3.4 GHz core i7 2600

 Built with:
 gcc -o test_pffft -DHAVE_FFTW -msse2 -O3 -Wall -W pffft.c test_pffft.c fftpack.c -L$HOME/local/lib -I$HOME/local/include/ -lfftw3f -lm

 | N (input length) | real FFTPack |   real FFTW  |  real PFFFT  | | cplx FFTPack |   cplx FFTW  |  cplx PFFFT  |
 |------------------+--------------+--------------+--------------| |--------------+--------------+--------------|
 |           64     |      3840    |      7680    |      8777    | |      4389    |     20480    |     11171    |
 |           96     |      4214    |      9633    |      8429    | |      4816    |     22477    |     11238    |
 |          128     |      3584    |     10240    |     10240    | |      5120    |     23893    |     11947    |
 |          192     |      4854    |     11095    |     12945    | |      4854    |     22191    |     14121    |
 |          256     |      4096    |     11703    |     16384    | |      5120    |     23406    |     13653    |
 |          384     |      4395    |     14651    |     12558    | |      4884    |     19535    |     14651    |
 |          512     |      5760    |     13166    |     15360    | |      4608    |     23040    |     15360    |
 |          768     |      4907    |     14020    |     16357    | |      4461    |     19628    |     14020    |
 |         1024     |      5120    |     14629    |     14629    | |      5120    |     20480    |     15754    |
 |         2048     |      5632    |     14080    |     18773    | |      4693    |     12516    |     16091    |
 |         4096     |      5120    |     13653    |     17554    | |      4726    |      7680    |     14456    |
 |         8192     |      4160    |      7396    |     13312    | |      4437    |     14791    |     13312    |
 |         9216     |      4210    |      6124    |     13473    | |      4491    |      7282    |     14970    |
 |        16384     |      3976    |     11010    |     14313    | |      4210    |     11450    |     13631    |
 |        32768     |      4260    |     10224    |     10954    | |      4260    |      6816    |     11797    |
 |       262144     |      3736    |      6896    |      9961    | |      2359    |      8965    |      9437    |
 |      1048576     |      2796    |      4534    |      6453    | |      1864    |      3078    |      5592    |
 |------------------+--------------+--------------+--------------| |--------------+--------------+--------------|



 MacOS Snow Leopard, gcc 4.0, 32-bit, fftw 3.3 on a 1.83 GHz core 1 duo

 Built with:

 gcc -o test_pffft -DHAVE_FFTW -DHAVE_VECLIB -O3 -Wall -W pffft.c test_pffft.c fftpack.c -L/usr/local/lib -I/usr/local/include/ -lfftw3f -framework veclib

 | input len |real FFTPack|  real vDSP |  real FFTW | real PFFFT | |cplx FFTPack|  cplx vDSP |  cplx FFTW | cplx PFFFT |
 |-----------+------------+------------+------------+------------| |------------+------------+------------+------------|
 |      64   |      745   |     2145   |     1706   |     2028   | |      961   |     3356   |     3313   |     2300   |
 |      96   |      877   |      n/a   |     1976   |     1978   | |     1059   |      n/a   |     3333   |     2233   |
 |     128   |      951   |     2808   |     2213   |     2279   | |     1202   |     3803   |     3739   |     2494   |
 |     192   |     1002   |      n/a   |     2456   |     2429   | |     1186   |      n/a   |     3701   |     2508   |
 |     256   |     1065   |     3205   |     2641   |     2793   | |     1302   |     4013   |     3912   |     2663   |
 |     384   |      845   |      n/a   |     2759   |     2499   | |      948   |      n/a   |     3729   |     2504   |
 |     512   |      900   |     3476   |     2956   |     2759   | |      974   |     4057   |     3954   |     2645   |
 |     768   |      910   |      n/a   |     2912   |     2737   | |      975   |      n/a   |     3837   |     2614   |
 |    1024   |      936   |     3583   |     3107   |     3009   | |     1006   |     4124   |     3821   |     2697   |
 |    2048   |     1057   |     3585   |     3091   |     2837   | |     1089   |     3889   |     3701   |     2513   |
 |    4096   |     1083   |     3524   |     3092   |     2733   | |     1039   |     3617   |     3462   |     2364   |
 |    8192   |      874   |     3252   |     2967   |     2363   | |      911   |     3106   |     2789   |     2302   |
 |    9216   |      898   |      n/a   |     2420   |     2290   | |      865   |      n/a   |     2676   |     2204   |
 |   16384   |      903   |     2892   |     2506   |     2421   | |      899   |     3026   |     2797   |     2289   |
 |   32768   |      965   |     2837   |     2550   |     2358   | |      920   |     2922   |     2763   |     2240   |
 |  262144   |      738   |     2422   |     1589   |     1708   | |      610   |     2038   |     1436   |     1091   |
 | 1048576   |      528   |     1207   |      845   |      880   | |      606   |     1020   |      669   |     1036   |
 |-----------+------------+------------+------------+------------| |------------+------------+------------+------------|



 Ubuntu 11.04, gcc 4.5, 32-bit, fftw 3.2 on a 2.66 core 2 quad

 Built with:
 gcc -o test_pffft -DHAVE_FFTW -msse -mfpmath=sse -O3 -Wall -W pffft.c test_pffft.c fftpack.c -L/usr/local/lib -I/usr/local/include/ -lfftw3f -lm

 | input len |real FFTPack|  real FFTW | real PFFFT | |cplx FFTPack|  cplx FFTW | cplx PFFFT |
 |-----------+------------+------------+------------| |------------+------------+------------|
 |       64  |     1920   |     3614   |     5120   | |     2194   |     7680   |     6467   |
 |       96  |     1873   |     3549   |     5187   | |     2107   |     8429   |     5863   |
 |      128  |     2240   |     3773   |     5514   | |     2560   |     7964   |     6827   |
 |      192  |     1765   |     4569   |     7767   | |     2284   |     9137   |     7061   |
 |      256  |     2048   |     5461   |     7447   | |     2731   |     9638   |     7802   |
 |      384  |     1998   |     5861   |     6762   | |     2313   |     9253   |     7644   |
 |      512  |     2095   |     6144   |     7680   | |     2194   |    10240   |     7089   |
 |      768  |     2230   |     5773   |     7549   | |     2045   |    10331   |     7010   |
 |     1024  |     2133   |     6400   |     8533   | |     2133   |    10779   |     7877   |
 |     2048  |     2011   |     7040   |     8665   | |     1942   |    10240   |     7768   |
 |     4096  |     2194   |     6827   |     8777   | |     1755   |     9452   |     6827   |
 |     8192  |     1849   |     6656   |     6656   | |     1752   |     7831   |     6827   |
 |     9216  |     1871   |     5858   |     6416   | |     1643   |     6909   |     6266   |
 |    16384  |     1883   |     6223   |     6506   | |     1664   |     7340   |     6982   |
 |    32768  |     1826   |     6390   |     6667   | |     1631   |     7481   |     6971   |
 |   262144  |     1546   |     4075   |     5977   | |     1299   |     3415   |     3551   |
 |  1048576  |     1104   |     2071   |     1730   | |     1104   |     1149   |     1834   |
 |-----------+------------+------------+------------| |------------+------------+------------|



 Ubuntu 11.04, gcc 4.5, 32-bit, fftw 3.3 on a 1.6 GHz Atom N270

 Built with:
 gcc -o test_pffft -DHAVE_FFTW -msse -mfpmath=sse -O3 -Wall -W pffft.c test_pffft.c fftpack.c -L/usr/local/lib -I/usr/local/include/ -lfftw3f -lm

 | N (input length) | real FFTPack |   real FFTW  |  real PFFFT  | | cplx FFTPack |   cplx FFTW  |  cplx PFFFT  |
 |------------------+--------------+--------------+--------------| |--------------+--------------+--------------|
 |           64     |       452    |      1041    |      1336    | |       549    |      2318    |      1781    |
 |           96     |       444    |      1297    |      1297    | |       503    |      2408    |      1686    |
 |          128     |       527    |      1525    |      1707    | |       543    |      2655    |      1886    |
 |          192     |       498    |      1653    |      1849    | |       539    |      2678    |      1942    |
 |          256     |       585    |      1862    |      2156    | |       594    |      2777    |      2244    |
 |          384     |       499    |      1870    |      1998    | |       511    |      2586    |      1890    |
 |          512     |       562    |      2095    |      2194    | |       542    |      2973    |      2194    |
 |          768     |       545    |      2045    |      2133    | |       545    |      2365    |      2133    |
 |         1024     |       595    |      2133    |      2438    | |       569    |      2695    |      2179    |
 |         2048     |       587    |      2125    |      2347    | |       521    |      2230    |      1707    |
 |         4096     |       495    |      1890    |      1834    | |       492    |      1876    |      1672    |
 |         8192     |       469    |      1548    |      1729    | |       438    |      1740    |      1664    |
 |         9216     |       468    |      1663    |      1663    | |       446    |      1585    |      1531    |
 |        16384     |       453    |      1608    |      1767    | |       398    |      1476    |      1664    |
 |        32768     |       456    |      1420    |      1503    | |       387    |      1388    |      1345    |
 |       262144     |       309    |       385    |       726    | |       262    |       415    |       840    |
 |      1048576     |       280    |       351    |       739    | |       261    |       313    |       797    |
 |------------------+--------------+--------------+--------------| |--------------+--------------+--------------|



 Windows 7, visual c++ 2010 on a 1.6 GHz Atom N270

 Built with:
 cl /Ox -D_USE_MATH_DEFINES /arch:SSE test_pffft.c pffft.c fftpack.c

 (visual c++ is definitively not very good with SSE intrinsics...)

 | N (input length) | real FFTPack |  real PFFFT  | | cplx FFTPack |  cplx PFFFT  |
 |------------------+--------------+--------------| |--------------+--------------|
 |           64     |       173    |      1009    | |       174    |      1159    |
 |           96     |       169    |      1029    | |       188    |      1201    |
 |          128     |       195    |      1242    | |       191    |      1275    |
 |          192     |       178    |      1312    | |       184    |      1276    |
 |          256     |       196    |      1591    | |       186    |      1281    |
 |          384     |       172    |      1409    | |       181    |      1281    |
 |          512     |       187    |      1640    | |       181    |      1313    |
 |          768     |       171    |      1614    | |       176    |      1258    |
 |         1024     |       186    |      1812    | |       178    |      1223    |
 |         2048     |       190    |      1707    | |       186    |      1099    |
 |         4096     |       182    |      1446    | |       177    |       975    |
 |         8192     |       175    |      1345    | |       169    |      1034    |
 |         9216     |       165    |      1271    | |       168    |      1023    |
 |        16384     |       166    |      1396    | |       165    |       949    |
 |        32768     |       172    |      1311    | |       161    |       881    |
 |       262144     |       136    |       632    | |       134    |       629    |
 |      1048576     |       134    |       698    | |       127    |       623    |
 |------------------+--------------+--------------| |--------------+--------------|



 Ubuntu 12.04, gcc-4.7.3, 32-bit, with fftw 3.3.3 (built with --enable-neon), on a 1.2GHz ARM Cortex A9 (Tegra 3)

 Built with:
 gcc-4.7 -O3 -DHAVE_FFTW -march=armv7-a -mtune=cortex-a9 -mfloat-abi=hard -mfpu=neon -ffast-math test_pffft.c pffft.c -o test_pffft_arm fftpack.c -lm -I/usr/local/include/ -L/usr/local/lib/ -lfftw3f

 | input len |real FFTPack|  real FFTW | real PFFFT | |cplx FFTPack|  cplx FFTW | cplx PFFFT |
 |-----------+------------+------------+------------| |------------+------------+------------|
 |       64  |      549   |      452   |      731   | |      512   |      602   |      640   |
 |       96  |      421   |      272   |      702   | |      496   |      571   |      602   |
 |      128  |      498   |      512   |      815   | |      597   |      618   |      652   |
 |      160  |      521   |      536   |      815   | |      586   |      669   |      625   |
 |      192  |      539   |      571   |      883   | |      485   |      597   |      626   |
 |      256  |      640   |      539   |      975   | |      569   |      611   |      671   |
 |      384  |      499   |      610   |      879   | |      499   |      602   |      637   |
 |      480  |      518   |      507   |      877   | |      496   |      661   |      616   |
 |      512  |      524   |      591   |     1002   | |      549   |      678   |      668   |
 |      640  |      542   |      612   |      955   | |      568   |      663   |      645   |
 |      768  |      557   |      613   |      981   | |      491   |      663   |      598   |
 |      800  |      514   |      353   |      882   | |      514   |      360   |      574   |
 |     1024  |      640   |      640   |     1067   | |      492   |      683   |      602   |
 |     2048  |      587   |      640   |      908   | |      486   |      640   |      552   |
 |     2400  |      479   |      368   |      777   | |      422   |      376   |      518   |
 |     4096  |      511   |      614   |      853   | |      426   |      640   |      534   |
 |     8192  |      415   |      584   |      708   | |      386   |      622   |      516   |
 |     9216  |      419   |      571   |      687   | |      364   |      586   |      506   |
 |    16384  |      426   |      577   |      716   | |      398   |      606   |      530   |
 |    32768  |      417   |      572   |      673   | |      399   |      572   |      468   |
 |   262144  |      219   |      380   |      293   | |      255   |      431   |      343   |
 |  1048576  |      202   |      274   |      237   | |      265   |      282   |      355   |
 |-----------+------------+------------+------------| |------------+------------+------------|

 Same platform as above, but this time pffft and fftpack are built with clang 3.2:

 clang -O3 -DHAVE_FFTW -march=armv7-a -mtune=cortex-a9 -mfloat-abi=hard -mfpu=neon -ffast-math test_pffft.c pffft.c -o test_pffft_arm fftpack.c -lm -I/usr/local/include/ -L/usr/local/lib/ -lfftw3f

 | input len |real FFTPack|  real FFTW | real PFFFT | |cplx FFTPack|  cplx FFTW | cplx PFFFT |
 |-----------+------------+------------+------------| |------------+------------+------------|
 |       64  |      427   |      452   |      853   | |      427   |      602   |     1024   |
 |       96  |      351   |      276   |      843   | |      337   |      571   |      963   |
 |      128  |      373   |      512   |      996   | |      390   |      618   |     1054   |
 |      160  |      426   |      536   |      987   | |      375   |      669   |      914   |
 |      192  |      404   |      571   |     1079   | |      388   |      588   |     1079   |
 |      256  |      465   |      539   |     1205   | |      445   |      602   |     1170   |
 |      384  |      366   |      610   |     1099   | |      343   |      594   |     1099   |
 |      480  |      356   |      507   |     1140   | |      335   |      651   |      931   |
 |      512  |      411   |      591   |     1213   | |      384   |      649   |     1124   |
 |      640  |      398   |      612   |     1193   | |      373   |      654   |      901   |
 |      768  |      409   |      613   |     1227   | |      383   |      663   |     1044   |
 |      800  |      411   |      348   |     1073   | |      353   |      358   |      809   |
 |     1024  |      427   |      640   |     1280   | |      413   |      692   |     1004   |
 |     2048  |      414   |      626   |     1126   | |      371   |      640   |      853   |
 |     2400  |      399   |      373   |      898   | |      319   |      368   |      653   |
 |     4096  |      404   |      602   |     1059   | |      357   |      633   |      778   |
 |     8192  |      332   |      584   |      792   | |      308   |      616   |      716   |
 |     9216  |      322   |      561   |      783   | |      299   |      586   |      687   |
 |    16384  |      344   |      568   |      778   | |      314   |      617   |      745   |
 |    32768  |      342   |      564   |      737   | |      314   |      552   |      629   |
 |   262144  |      201   |      383   |      313   | |      227   |      435   |      413   |
 |  1048576  |      187   |      262   |      251   | |      228   |      281   |      409   |
 |-----------+------------+------------+------------| |------------+------------+------------|

 So it looks like, on ARM, gcc 4.7 is the best at scalar floating point
 (the fftpack performance numbers are better with gcc), while clang is
 the best with neon intrinsics (see how pffft perf has improved with
 clang 3.2).


 NVIDIA Jetson TK1 board, gcc-4.8.2. The cpu is a 2.3GHz cortex A15 (Tegra K1).

 Built with:
 gcc -O3 -march=armv7-a -mtune=native -mfloat-abi=hard -mfpu=neon -ffast-math test_pffft.c pffft.c -o test_pffft_arm fftpack.c -lm

 | input len |real FFTPack| real PFFFT | |cplx FFTPack| cplx PFFFT |
 |-----------+------------+------------| |------------+------------|
 |       64  |     1735   |     3308   | |     1994   |     3744   |
 |       96  |     1596   |     3448   | |     1987   |     3572   |
 |      128  |     1807   |     4076   | |     2255   |     3960   |
 |      160  |     1769   |     4083   | |     2071   |     3845   |
 |      192  |     1990   |     4233   | |     2017   |     3939   |
 |      256  |     2191   |     4882   | |     2254   |     4346   |
 |      384  |     1878   |     4492   | |     2073   |     4012   |
 |      480  |     1748   |     4398   | |     1923   |     3951   |
 |      512  |     2030   |     5064   | |     2267   |     4195   |
 |      640  |     1918   |     4756   | |     2094   |     4184   |
 |      768  |     2099   |     4907   | |     2048   |     4297   |
 |      800  |     1822   |     4555   | |     1880   |     4063   |
 |     1024  |     2232   |     5355   | |     2187   |     4420   |
 |     2048  |     2176   |     4983   | |     2027   |     3602   |
 |     2400  |     1741   |     4256   | |     1710   |     3344   |
 |     4096  |     1816   |     3914   | |     1851   |     3349   |
 |     8192  |     1716   |     3481   | |     1700   |     3255   |
 |     9216  |     1735   |     3589   | |     1653   |     3094   |
 |    16384  |     1567   |     3483   | |     1637   |     3244   |
 |    32768  |     1624   |     3240   | |     1655   |     3156   |
 |   262144  |     1012   |     1898   | |      983   |     1503   |
 |  1048576  |      876   |     1154   | |      868   |     1341   |
 |-----------+------------+------------| |------------+------------|

 The performance on the tegra K1 is pretty impressive. I'm not
 including the FFTW numbers as they as slightly below the scalar
 fftpack numbers, so something must be wrong (however it seems to be
 correctly configured and is using neon simd instructions).

 When using clang 3.4 the pffft version is even a bit faster, reaching
 5.7 GFlops for real ffts of size 1024.


 iPad Air 2 with iOS9, xcode 8.0, arm64. The cpu is an Apple A8X, supposedly running at 1.5GHz.

 | input len |real FFTPack|  real vDSP | real PFFFT | |cplx FFTPack|  cplx vDSP | cplx PFFFT |
 |-----------+------------+------------+------------| |------------+------------+------------|
 |       64  |     2517   |     7995   |     6086   | |     2725   |    13006   |     8495   |
 |       96  |     2442   |      n/a   |     6691   | |     2256   |      n/a   |     7991   |
 |      128  |     2664   |    10186   |     7877   | |     2575   |    15115   |     9115   |
 |      160  |     2638   |      n/a   |     8283   | |     2682   |      n/a   |     8806   |
 |      192  |     2903   |      n/a   |     9083   | |     2634   |      n/a   |     8980   |
 |      256  |     3184   |    11452   |    10039   | |     3026   |    15410   |    10199   |
 |      384  |     2665   |      n/a   |    10100   | |     2275   |      n/a   |     9247   |
 |      480  |     2546   |      n/a   |     9863   | |     2341   |      n/a   |     8892   |
 |      512  |     2832   |    12197   |    10989   | |     2547   |    16768   |    10154   |
 |      640  |     2755   |      n/a   |    10461   | |     2569   |      n/a   |     9666   |
 |      768  |     2998   |      n/a   |    11355   | |     2585   |      n/a   |     9813   |
 |      800  |     2516   |      n/a   |    10332   | |     2433   |      n/a   |     9164   |
 |     1024  |     3109   |    12965   |    12114   | |     2869   |    16448   |    10519   |
 |     2048  |     3027   |    12996   |    12023   | |     2648   |    17304   |    10307   |
 |     2400  |     2515   |      n/a   |    10372   | |     2355   |      n/a   |     8443   |
 |     4096  |     3204   |    13603   |    12359   | |     2814   |    16570   |     9780   |
 |     8192  |     2759   |    13422   |    10824   | |     2153   |    15652   |     7884   |
 |     9216  |     2700   |      n/a   |     9938   | |     2241   |      n/a   |     7900   |
 |    16384  |     2280   |    13057   |     7976   | |      593   |     4272   |     2534   |
 |    32768  |      768   |     4269   |     2882   | |      606   |     4405   |     2604   |
 |   262144  |      724   |     3527   |     2630   | |      534   |     2418   |     2157   |
 |  1048576  |      674   |     1467   |     2135   | |      530   |     1621   |     2055   |
 |-----------+------------+------------+------------| |------------+------------+------------|

 I double-checked to make sure I did not make a mistake in the time
 measurements, as the numbers are much higher than what I initially
 expected. They are in fact higher than the number I get on the 2.8GHz
 Xeon of my 2008 mac pro.. (except for FFT lengths >= 32768 where
 having a big cache is useful). A good surprise is also that the perf
 is not too far from apple's vDSP (at least for the real FFT).

--- a/pffft/fftpack.c
+++ b/pffft/fftpack.c
--- a/pffft/fftpack.h
+++ b/pffft/fftpack.h
@@ -0,0 +1,799 @@
 /*
  Interface for the f2c translation of fftpack as found on http://www.netlib.org/fftpack/
  
   FFTPACK license:

   http://www.cisl.ucar.edu/css/software/fftpack5/ftpk.html

   Copyright (c) 2004 the University Corporation for Atmospheric
   Research ("UCAR"). All rights reserved. Developed by NCAR's
   Computational and Information Systems Laboratory, UCAR,
   www.cisl.ucar.edu.

   Redistribution and use of the Software in source and binary forms,
   with or without modification, is permitted provided that the
   following conditions are met:

   - Neither the names of NCAR's Computational and Information Systems
   Laboratory, the University Corporation for Atmospheric Research,
   nor the names of its sponsors or contributors may be used to
   endorse or promote products derived from this Software without
   specific prior written permission.  

   - Redistributions of source code must retain the above copyright
   notices, this list of conditions, and the disclaimer below.

   - Redistributions in binary form must reproduce the above copyright
   notice, this list of conditions, and the disclaimer below in the
   documentation and/or other materials provided with the
   distribution.

   THIS SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
   EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO THE WARRANTIES OF
   MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
   NONINFRINGEMENT. IN NO EVENT SHALL THE CONTRIBUTORS OR COPYRIGHT
   HOLDERS BE LIABLE FOR ANY CLAIM, INDIRECT, INCIDENTAL, SPECIAL,
   EXEMPLARY, OR CONSEQUENTIAL DAMAGES OR OTHER LIABILITY, WHETHER IN AN
   ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
   CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS WITH THE
   SOFTWARE.

   ChangeLog:
   2011/10/02: this is my first release of this file.
 */

 #ifndef FFTPACK_H
 #define FFTPACK_H

 #ifdef __cplusplus
 extern "C" {
 #endif

 // just define FFTPACK_DOUBLE_PRECISION if you want to build it as a double precision fft

 #ifndef FFTPACK_DOUBLE_PRECISION
  typedef float fftpack_real;
  typedef int   fftpack_int;
 #else
  typedef double fftpack_real;
  typedef int    fftpack_int;
 #endif

  void cffti(fftpack_int n, fftpack_real *wsave);

  void cfftf(fftpack_int n, fftpack_real *c, fftpack_real *wsave);

  void cfftb(fftpack_int n, fftpack_real *c, fftpack_real *wsave);

  void rffti(fftpack_int n, fftpack_real *wsave);
  void rfftf(fftpack_int n, fftpack_real *r, fftpack_real *wsave);
  void rfftb(fftpack_int n, fftpack_real *r, fftpack_real *wsave);

  void cosqi(fftpack_int n, fftpack_real *wsave);
  void cosqf(fftpack_int n, fftpack_real *x, fftpack_real *wsave);
  void cosqb(fftpack_int n, fftpack_real *x, fftpack_real *wsave);

  void costi(fftpack_int n, fftpack_real *wsave);
  void cost(fftpack_int n, fftpack_real *x, fftpack_real *wsave);

  void sinqi(fftpack_int n, fftpack_real *wsave);
  void sinqb(fftpack_int n, fftpack_real *x, fftpack_real *wsave);
  void sinqf(fftpack_int n, fftpack_real *x, fftpack_real *wsave);

  void sinti(fftpack_int n, fftpack_real *wsave);
  void sint(fftpack_int n, fftpack_real *x, fftpack_real *wsave);

 #ifdef __cplusplus
 }
 #endif

 #endif /* FFTPACK_H */

 /*

                      FFTPACK

 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

                  version 4  april 1985

     a package of fortran subprograms for the fast fourier
      transform of periodic and other symmetric sequences

                         by

                  paul n swarztrauber

  national center for atmospheric research  boulder,colorado 80307

   which is sponsored by the national science foundation

 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *


 this package consists of programs which perform fast fourier
 transforms for both complex and real periodic sequences and
 certain other symmetric sequences that are listed below.

 1.   rffti     initialize  rfftf and rfftb
 2.   rfftf     forward transform of a real periodic sequence
 3.   rfftb     backward transform of a real coefficient array

 4.   ezffti    initialize ezfftf and ezfftb
 5.   ezfftf    a simplified real periodic forward transform
 6.   ezfftb    a simplified real periodic backward transform

 7.   sinti     initialize sint
 8.   sint      sine transform of a real odd sequence

 9.   costi     initialize cost
 10.  cost      cosine transform of a real even sequence

 11.  sinqi     initialize sinqf and sinqb
 12.  sinqf     forward sine transform with odd wave numbers
 13.  sinqb     unnormalized inverse of sinqf

 14.  cosqi     initialize cosqf and cosqb
 15.  cosqf     forward cosine transform with odd wave numbers
 16.  cosqb     unnormalized inverse of cosqf

 17.  cffti     initialize cfftf and cfftb
 18.  cfftf     forward transform of a complex periodic sequence
 19.  cfftb     unnormalized inverse of cfftf


 ******************************************************************

 subroutine rffti(n,wsave)

  ****************************************************************

 subroutine rffti initializes the array wsave which is used in
 both rfftf and rfftb. the prime factorization of n together with
 a tabulation of the trigonometric functions are computed and
 stored in wsave.

 input parameter

 n       the length of the sequence to be transformed.

 output parameter

 wsave   a work array which must be dimensioned at least 2*n+15.
        the same work array can be used for both rfftf and rfftb
        as long as n remains unchanged. different wsave arrays
        are required for different values of n. the contents of
        wsave must not be changed between calls of rfftf or rfftb.

 ******************************************************************

 subroutine rfftf(n,r,wsave)

 ******************************************************************

 subroutine rfftf computes the fourier coefficients of a real
 perodic sequence (fourier analysis). the transform is defined
 below at output parameter r.

 input parameters

 n       the length of the array r to be transformed.  the method
        is most efficient when n is a product of small primes.
        n may change so long as different work arrays are provided

 r       a real array of length n which contains the sequence
        to be transformed

 wsave   a work array which must be dimensioned at least 2*n+15.
        in the program that calls rfftf. the wsave array must be
        initialized by calling subroutine rffti(n,wsave) and a
        different wsave array must be used for each different
        value of n. this initialization does not have to be
        repeated so long as n remains unchanged thus subsequent
        transforms can be obtained faster than the first.
        the same wsave array can be used by rfftf and rfftb.


 output parameters

 r       r(1) = the sum from i=1 to i=n of r(i)

        if n is even set l =n/2   , if n is odd set l = (n+1)/2

          then for k = 2,...,l

             r(2*k-2) = the sum from i = 1 to i = n of

                  r(i)*cos((k-1)*(i-1)*2*pi/n)

             r(2*k-1) = the sum from i = 1 to i = n of

                 -r(i)*sin((k-1)*(i-1)*2*pi/n)

        if n is even

             r(n) = the sum from i = 1 to i = n of

                  (-1)**(i-1)*r(i)

 *****  note
             this transform is unnormalized since a call of rfftf
             followed by a call of rfftb will multiply the input
             sequence by n.

 wsave   contains results which must not be destroyed between
        calls of rfftf or rfftb.


 ******************************************************************

 subroutine rfftb(n,r,wsave)

 ******************************************************************

 subroutine rfftb computes the real perodic sequence from its
 fourier coefficients (fourier synthesis). the transform is defined
 below at output parameter r.

 input parameters

 n       the length of the array r to be transformed.  the method
        is most efficient when n is a product of small primes.
        n may change so long as different work arrays are provided

 r       a real array of length n which contains the sequence
        to be transformed

 wsave   a work array which must be dimensioned at least 2*n+15.
        in the program that calls rfftb. the wsave array must be
        initialized by calling subroutine rffti(n,wsave) and a
        different wsave array must be used for each different
        value of n. this initialization does not have to be
        repeated so long as n remains unchanged thus subsequent
        transforms can be obtained faster than the first.
        the same wsave array can be used by rfftf and rfftb.


 output parameters

 r       for n even and for i = 1,...,n

             r(i) = r(1)+(-1)**(i-1)*r(n)

                  plus the sum from k=2 to k=n/2 of

                   2.*r(2*k-2)*cos((k-1)*(i-1)*2*pi/n)

                  -2.*r(2*k-1)*sin((k-1)*(i-1)*2*pi/n)

        for n odd and for i = 1,...,n

             r(i) = r(1) plus the sum from k=2 to k=(n+1)/2 of

                  2.*r(2*k-2)*cos((k-1)*(i-1)*2*pi/n)

                 -2.*r(2*k-1)*sin((k-1)*(i-1)*2*pi/n)

 *****  note
             this transform is unnormalized since a call of rfftf
             followed by a call of rfftb will multiply the input
             sequence by n.

 wsave   contains results which must not be destroyed between
        calls of rfftb or rfftf.

 ******************************************************************

 subroutine sinti(n,wsave)

 ******************************************************************

 subroutine sinti initializes the array wsave which is used in
 subroutine sint. the prime factorization of n together with
 a tabulation of the trigonometric functions are computed and
 stored in wsave.

 input parameter

 n       the length of the sequence to be transformed.  the method
        is most efficient when n+1 is a product of small primes.

 output parameter

 wsave   a work array with at least int(2.5*n+15) locations.
        different wsave arrays are required for different values
        of n. the contents of wsave must not be changed between
        calls of sint.

 ******************************************************************

 subroutine sint(n,x,wsave)

 ******************************************************************

 subroutine sint computes the discrete fourier sine transform
 of an odd sequence x(i). the transform is defined below at
 output parameter x.

 sint is the unnormalized inverse of itself since a call of sint
 followed by another call of sint will multiply the input sequence
 x by 2*(n+1).

 the array wsave which is used by subroutine sint must be
 initialized by calling subroutine sinti(n,wsave).

 input parameters

 n       the length of the sequence to be transformed.  the method
        is most efficient when n+1 is the product of small primes.

 x       an array which contains the sequence to be transformed


 wsave   a work array with dimension at least int(2.5*n+15)
        in the program that calls sint. the wsave array must be
        initialized by calling subroutine sinti(n,wsave) and a
        different wsave array must be used for each different
        value of n. this initialization does not have to be
        repeated so long as n remains unchanged thus subsequent
        transforms can be obtained faster than the first.

 output parameters

 x       for i=1,...,n

             x(i)= the sum from k=1 to k=n

                  2*x(k)*sin(k*i*pi/(n+1))

             a call of sint followed by another call of
             sint will multiply the sequence x by 2*(n+1).
             hence sint is the unnormalized inverse
             of itself.

 wsave   contains initialization calculations which must not be
        destroyed between calls of sint.

 ******************************************************************

 subroutine costi(n,wsave)

 ******************************************************************

 subroutine costi initializes the array wsave which is used in
 subroutine cost. the prime factorization of n together with
 a tabulation of the trigonometric functions are computed and
 stored in wsave.

 input parameter

 n       the length of the sequence to be transformed.  the method
        is most efficient when n-1 is a product of small primes.

 output parameter

 wsave   a work array which must be dimensioned at least 3*n+15.
        different wsave arrays are required for different values
        of n. the contents of wsave must not be changed between
        calls of cost.

 ******************************************************************

 subroutine cost(n,x,wsave)

 ******************************************************************

 subroutine cost computes the discrete fourier cosine transform
 of an even sequence x(i). the transform is defined below at output
 parameter x.

 cost is the unnormalized inverse of itself since a call of cost
 followed by another call of cost will multiply the input sequence
 x by 2*(n-1). the transform is defined below at output parameter x

 the array wsave which is used by subroutine cost must be
 initialized by calling subroutine costi(n,wsave).

 input parameters

 n       the length of the sequence x. n must be greater than 1.
        the method is most efficient when n-1 is a product of
        small primes.

 x       an array which contains the sequence to be transformed

 wsave   a work array which must be dimensioned at least 3*n+15
        in the program that calls cost. the wsave array must be
        initialized by calling subroutine costi(n,wsave) and a
        different wsave array must be used for each different
        value of n. this initialization does not have to be
        repeated so long as n remains unchanged thus subsequent
        transforms can be obtained faster than the first.

 output parameters

 x       for i=1,...,n

            x(i) = x(1)+(-1)**(i-1)*x(n)

             + the sum from k=2 to k=n-1

                 2*x(k)*cos((k-1)*(i-1)*pi/(n-1))

             a call of cost followed by another call of
             cost will multiply the sequence x by 2*(n-1)
             hence cost is the unnormalized inverse
             of itself.

 wsave   contains initialization calculations which must not be
        destroyed between calls of cost.

 ******************************************************************

 subroutine sinqi(n,wsave)

 ******************************************************************

 subroutine sinqi initializes the array wsave which is used in
 both sinqf and sinqb. the prime factorization of n together with
 a tabulation of the trigonometric functions are computed and
 stored in wsave.

 input parameter

 n       the length of the sequence to be transformed. the method
        is most efficient when n is a product of small primes.

 output parameter

 wsave   a work array which must be dimensioned at least 3*n+15.
        the same work array can be used for both sinqf and sinqb
        as long as n remains unchanged. different wsave arrays
        are required for different values of n. the contents of
        wsave must not be changed between calls of sinqf or sinqb.

 ******************************************************************

 subroutine sinqf(n,x,wsave)

 ******************************************************************

 subroutine sinqf computes the fast fourier transform of quarter
 wave data. that is , sinqf computes the coefficients in a sine
 series representation with only odd wave numbers. the transform
 is defined below at output parameter x.

 sinqb is the unnormalized inverse of sinqf since a call of sinqf
 followed by a call of sinqb will multiply the input sequence x
 by 4*n.

 the array wsave which is used by subroutine sinqf must be
 initialized by calling subroutine sinqi(n,wsave).


 input parameters

 n       the length of the array x to be transformed.  the method
        is most efficient when n is a product of small primes.

 x       an array which contains the sequence to be transformed

 wsave   a work array which must be dimensioned at least 3*n+15.
        in the program that calls sinqf. the wsave array must be
        initialized by calling subroutine sinqi(n,wsave) and a
        different wsave array must be used for each different
        value of n. this initialization does not have to be
        repeated so long as n remains unchanged thus subsequent
        transforms can be obtained faster than the first.

 output parameters

 x       for i=1,...,n

             x(i) = (-1)**(i-1)*x(n)

                + the sum from k=1 to k=n-1 of

                2*x(k)*sin((2*i-1)*k*pi/(2*n))

             a call of sinqf followed by a call of
             sinqb will multiply the sequence x by 4*n.
             therefore sinqb is the unnormalized inverse
             of sinqf.

 wsave   contains initialization calculations which must not
        be destroyed between calls of sinqf or sinqb.

 ******************************************************************

 subroutine sinqb(n,x,wsave)

 ******************************************************************

 subroutine sinqb computes the fast fourier transform of quarter
 wave data. that is , sinqb computes a sequence from its
 representation in terms of a sine series with odd wave numbers.
 the transform is defined below at output parameter x.

 sinqf is the unnormalized inverse of sinqb since a call of sinqb
 followed by a call of sinqf will multiply the input sequence x
 by 4*n.

 the array wsave which is used by subroutine sinqb must be
 initialized by calling subroutine sinqi(n,wsave).


 input parameters

 n       the length of the array x to be transformed.  the method
        is most efficient when n is a product of small primes.

 x       an array which contains the sequence to be transformed

 wsave   a work array which must be dimensioned at least 3*n+15.
        in the program that calls sinqb. the wsave array must be
        initialized by calling subroutine sinqi(n,wsave) and a
        different wsave array must be used for each different
        value of n. this initialization does not have to be
        repeated so long as n remains unchanged thus subsequent
        transforms can be obtained faster than the first.

 output parameters

 x       for i=1,...,n

             x(i)= the sum from k=1 to k=n of

               4*x(k)*sin((2k-1)*i*pi/(2*n))

             a call of sinqb followed by a call of
             sinqf will multiply the sequence x by 4*n.
             therefore sinqf is the unnormalized inverse
             of sinqb.

 wsave   contains initialization calculations which must not
        be destroyed between calls of sinqb or sinqf.

 ******************************************************************

 subroutine cosqi(n,wsave)

 ******************************************************************

 subroutine cosqi initializes the array wsave which is used in
 both cosqf and cosqb. the prime factorization of n together with
 a tabulation of the trigonometric functions are computed and
 stored in wsave.

 input parameter

 n       the length of the array to be transformed.  the method
        is most efficient when n is a product of small primes.

 output parameter

 wsave   a work array which must be dimensioned at least 3*n+15.
        the same work array can be used for both cosqf and cosqb
        as long as n remains unchanged. different wsave arrays
        are required for different values of n. the contents of
        wsave must not be changed between calls of cosqf or cosqb.

 ******************************************************************

 subroutine cosqf(n,x,wsave)

 ******************************************************************

 subroutine cosqf computes the fast fourier transform of quarter
 wave data. that is , cosqf computes the coefficients in a cosine
 series representation with only odd wave numbers. the transform
 is defined below at output parameter x

 cosqf is the unnormalized inverse of cosqb since a call of cosqf
 followed by a call of cosqb will multiply the input sequence x
 by 4*n.

 the array wsave which is used by subroutine cosqf must be
 initialized by calling subroutine cosqi(n,wsave).


 input parameters

 n       the length of the array x to be transformed.  the method
        is most efficient when n is a product of small primes.

 x       an array which contains the sequence to be transformed

 wsave   a work array which must be dimensioned at least 3*n+15
        in the program that calls cosqf. the wsave array must be
        initialized by calling subroutine cosqi(n,wsave) and a
        different wsave array must be used for each different
        value of n. this initialization does not have to be
        repeated so long as n remains unchanged thus subsequent
        transforms can be obtained faster than the first.

 output parameters

 x       for i=1,...,n

             x(i) = x(1) plus the sum from k=2 to k=n of

                2*x(k)*cos((2*i-1)*(k-1)*pi/(2*n))

             a call of cosqf followed by a call of
             cosqb will multiply the sequence x by 4*n.
             therefore cosqb is the unnormalized inverse
             of cosqf.

 wsave   contains initialization calculations which must not
        be destroyed between calls of cosqf or cosqb.

 ******************************************************************

 subroutine cosqb(n,x,wsave)

 ******************************************************************

 subroutine cosqb computes the fast fourier transform of quarter
 wave data. that is , cosqb computes a sequence from its
 representation in terms of a cosine series with odd wave numbers.
 the transform is defined below at output parameter x.

 cosqb is the unnormalized inverse of cosqf since a call of cosqb
 followed by a call of cosqf will multiply the input sequence x
 by 4*n.

 the array wsave which is used by subroutine cosqb must be
 initialized by calling subroutine cosqi(n,wsave).


 input parameters

 n       the length of the array x to be transformed.  the method
        is most efficient when n is a product of small primes.

 x       an array which contains the sequence to be transformed

 wsave   a work array that must be dimensioned at least 3*n+15
        in the program that calls cosqb. the wsave array must be
        initialized by calling subroutine cosqi(n,wsave) and a
        different wsave array must be used for each different
        value of n. this initialization does not have to be
        repeated so long as n remains unchanged thus subsequent
        transforms can be obtained faster than the first.

 output parameters

 x       for i=1,...,n

             x(i)= the sum from k=1 to k=n of

               4*x(k)*cos((2*k-1)*(i-1)*pi/(2*n))

             a call of cosqb followed by a call of
             cosqf will multiply the sequence x by 4*n.
             therefore cosqf is the unnormalized inverse
             of cosqb.

 wsave   contains initialization calculations which must not
        be destroyed between calls of cosqb or cosqf.

 ******************************************************************

 subroutine cffti(n,wsave)

 ******************************************************************

 subroutine cffti initializes the array wsave which is used in
 both cfftf and cfftb. the prime factorization of n together with
 a tabulation of the trigonometric functions are computed and
 stored in wsave.

 input parameter

 n       the length of the sequence to be transformed

 output parameter

 wsave   a work array which must be dimensioned at least 4*n+15
        the same work array can be used for both cfftf and cfftb
        as long as n remains unchanged. different wsave arrays
        are required for different values of n. the contents of
        wsave must not be changed between calls of cfftf or cfftb.

 ******************************************************************

 subroutine cfftf(n,c,wsave)

 ******************************************************************

 subroutine cfftf computes the forward complex discrete fourier
 transform (the fourier analysis). equivalently , cfftf computes
 the fourier coefficients of a complex periodic sequence.
 the transform is defined below at output parameter c.

 the transform is not normalized. to obtain a normalized transform
 the output must be divided by n. otherwise a call of cfftf
 followed by a call of cfftb will multiply the sequence by n.

 the array wsave which is used by subroutine cfftf must be
 initialized by calling subroutine cffti(n,wsave).

 input parameters


 n      the length of the complex sequence c. the method is
       more efficient when n is the product of small primes. n

 c      a complex array of length n which contains the sequence

 wsave   a real work array which must be dimensioned at least 4n+15
        in the program that calls cfftf. the wsave array must be
        initialized by calling subroutine cffti(n,wsave) and a
        different wsave array must be used for each different
        value of n. this initialization does not have to be
        repeated so long as n remains unchanged thus subsequent
        transforms can be obtained faster than the first.
        the same wsave array can be used by cfftf and cfftb.

 output parameters

 c      for j=1,...,n

           c(j)=the sum from k=1,...,n of

                 c(k)*exp(-i*(j-1)*(k-1)*2*pi/n)

                       where i=sqrt(-1)

 wsave   contains initialization calculations which must not be
        destroyed between calls of subroutine cfftf or cfftb

 ******************************************************************

 subroutine cfftb(n,c,wsave)

 ******************************************************************

 subroutine cfftb computes the backward complex discrete fourier
 transform (the fourier synthesis). equivalently , cfftb computes
 a complex periodic sequence from its fourier coefficients.
 the transform is defined below at output parameter c.

 a call of cfftf followed by a call of cfftb will multiply the
 sequence by n.

 the array wsave which is used by subroutine cfftb must be
 initialized by calling subroutine cffti(n,wsave).

 input parameters


 n      the length of the complex sequence c. the method is
       more efficient when n is the product of small primes.

 c      a complex array of length n which contains the sequence

 wsave   a real work array which must be dimensioned at least 4n+15
        in the program that calls cfftb. the wsave array must be
        initialized by calling subroutine cffti(n,wsave) and a
        different wsave array must be used for each different
        value of n. this initialization does not have to be
        repeated so long as n remains unchanged thus subsequent
        transforms can be obtained faster than the first.
        the same wsave array can be used by cfftf and cfftb.

 output parameters

 c      for j=1,...,n

           c(j)=the sum from k=1,...,n of

                 c(k)*exp(i*(j-1)*(k-1)*2*pi/n)

                       where i=sqrt(-1)

 wsave   contains initialization calculations which must not be
        destroyed between calls of subroutine cfftf or cfftb

 */
--- a/pffft/pffft.c
+++ b/pffft/pffft.c
--- a/pffft/pffft.h
+++ b/pffft/pffft.h
@@ -0,0 +1,177 @@
 /* Copyright (c) 2013  Julien Pommier ( pommier@modartt.com ) 

   Based on original fortran 77 code from FFTPACKv4 from NETLIB,
   authored by Dr Paul Swarztrauber of NCAR, in 1985.

   As confirmed by the NCAR fftpack software curators, the following
   FFTPACKv5 license applies to FFTPACKv4 sources. My changes are
   released under the same terms.

   FFTPACK license:

   http://www.cisl.ucar.edu/css/software/fftpack5/ftpk.html

   Copyright (c) 2004 the University Corporation for Atmospheric
   Research ("UCAR"). All rights reserved. Developed by NCAR's
   Computational and Information Systems Laboratory, UCAR,
   www.cisl.ucar.edu.

   Redistribution and use of the Software in source and binary forms,
   with or without modification, is permitted provided that the
   following conditions are met:

   - Neither the names of NCAR's Computational and Information Systems
   Laboratory, the University Corporation for Atmospheric Research,
   nor the names of its sponsors or contributors may be used to
   endorse or promote products derived from this Software without
   specific prior written permission.  

   - Redistributions of source code must retain the above copyright
   notices, this list of conditions, and the disclaimer below.

   - Redistributions in binary form must reproduce the above copyright
   notice, this list of conditions, and the disclaimer below in the
   documentation and/or other materials provided with the
   distribution.

   THIS SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
   EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO THE WARRANTIES OF
   MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
   NONINFRINGEMENT. IN NO EVENT SHALL THE CONTRIBUTORS OR COPYRIGHT
   HOLDERS BE LIABLE FOR ANY CLAIM, INDIRECT, INCIDENTAL, SPECIAL,
   EXEMPLARY, OR CONSEQUENTIAL DAMAGES OR OTHER LIABILITY, WHETHER IN AN
   ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
   CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS WITH THE
   SOFTWARE.
 */
   
 /*
   PFFFT : a Pretty Fast FFT.

   This is basically an adaptation of the single precision fftpack
   (v4) as found on netlib taking advantage of SIMD instruction found
   on cpus such as intel x86 (SSE1), powerpc (Altivec), and arm (NEON).
   
   For architectures where no SIMD instruction is available, the code
   falls back to a scalar version.  

   Restrictions: 

   - 1D transforms only, with 32-bit single precision.

   - supports only transforms for inputs of length N of the form
   N=(2^a)*(3^b)*(5^c), a >= 5, b >=0, c >= 0 (32, 48, 64, 96, 128,
   144, 160, etc are all acceptable lengths). Performance is best for
   128<=N<=8192.

   - all (float*) pointers in the functions below are expected to
   have an "simd-compatible" alignment, that is 16 bytes on x86 and
   powerpc CPUs.
  
   You can allocate such buffers with the functions
   pffft_aligned_malloc / pffft_aligned_free (or with stuff like
   posix_memalign..)

 */

 #ifndef PFFFT_H
 #define PFFFT_H

 #include <stddef.h> // for size_t

 #ifdef __cplusplus
 extern "C" {
 #endif

  /* opaque struct holding internal stuff (precomputed twiddle factors)
     this struct can be shared by many threads as it contains only
     read-only data.  
  */
  typedef struct PFFFT_Setup PFFFT_Setup;

  /* direction of the transform */
  typedef enum { PFFFT_FORWARD, PFFFT_BACKWARD } pffft_direction_t;
  
  /* type of transform */
  typedef enum { PFFFT_REAL, PFFFT_COMPLEX } pffft_transform_t;

  /*
    prepare for performing transforms of size N -- the returned
    PFFFT_Setup structure is read-only so it can safely be shared by
    multiple concurrent threads. 
  */
  PFFFT_Setup *pffft_new_setup(int N, pffft_transform_t transform);
  void pffft_destroy_setup(PFFFT_Setup *);
  /* 
     Perform a Fourier transform , The z-domain data is stored in the
     most efficient order for transforming it back, or using it for
     convolution. If you need to have its content sorted in the
     "usual" way, that is as an array of interleaved complex numbers,
     either use pffft_transform_ordered , or call pffft_zreorder after
     the forward fft, and before the backward fft.

     Transforms are not scaled: PFFFT_BACKWARD(PFFFT_FORWARD(x)) = N*x.
     Typically you will want to scale the backward transform by 1/N.
     
     The 'work' pointer should point to an area of N (2*N for complex
     fft) floats, properly aligned. If 'work' is NULL, then stack will
     be used instead (this is probably the best strategy for small
     FFTs, say for N < 16384).

     input and output may alias.
  */
  void pffft_transform(PFFFT_Setup *setup, const float *input, float *output, float *work, pffft_direction_t direction);

  /* 
     Similar to pffft_transform, but makes sure that the output is
     ordered as expected (interleaved complex numbers).  This is
     similar to calling pffft_transform and then pffft_zreorder.
     
     input and output may alias.
  */
  void pffft_transform_ordered(PFFFT_Setup *setup, const float *input, float *output, float *work, pffft_direction_t direction);

  /* 
     call pffft_zreorder(.., PFFFT_FORWARD) after pffft_transform(...,
     PFFFT_FORWARD) if you want to have the frequency components in
     the correct "canonical" order, as interleaved complex numbers.
     
     (for real transforms, both 0-frequency and half frequency
     components, which are real, are assembled in the first entry as
     F(0)+i*F(n/2+1). Note that the original fftpack did place
     F(n/2+1) at the end of the arrays).
     
     input and output should not alias.
  */
  void pffft_zreorder(PFFFT_Setup *setup, const float *input, float *output, pffft_direction_t direction);

  /* 
     Perform a multiplication of the frequency components of dft_a and
     dft_b and accumulate them into dft_ab. The arrays should have
     been obtained with pffft_transform(.., PFFFT_FORWARD) and should
     *not* have been reordered with pffft_zreorder (otherwise just
     perform the operation yourself as the dft coefs are stored as
     interleaved complex numbers).
     
     the operation performed is: dft_ab += (dft_a * fdt_b)*scaling
     
     The dft_a, dft_b and dft_ab pointers may alias.
  */
  void pffft_zconvolve_accumulate(PFFFT_Setup *setup, const float *dft_a, const float *dft_b, float *dft_ab, float scaling);

  /*
    the float buffers must have the correct alignment (16-byte boundary
    on intel and powerpc). This function may be used to obtain such
    correctly aligned buffers.  
  */
  void *pffft_aligned_malloc(size_t nb_bytes);
  void pffft_aligned_free(void *);

  /* return 4 or 1 wether support SSE/Altivec instructions was enable when building pffft.c */
  int pffft_simd_size();

 #ifdef __cplusplus
 }
 #endif

 #endif // PFFFT_H
--- a/pffft/test_pffft.c
+++ b/pffft/test_pffft.c
@@ -0,0 +1,419 @@
 /*
  Copyright (c) 2013 Julien Pommier.

  Small test & bench for PFFFT, comparing its performance with the scalar FFTPACK, FFTW, and Apple vDSP

  How to build: 

  on linux, with fftw3:
  gcc -o test_pffft -DHAVE_FFTW -msse -mfpmath=sse -O3 -Wall -W pffft.c test_pffft.c fftpack.c -L/usr/local/lib -I/usr/local/include/ -lfftw3f -lm

  on macos, without fftw3:
  clang -o test_pffft -DHAVE_VECLIB -O3 -Wall -W pffft.c test_pffft.c fftpack.c -L/usr/local/lib -I/usr/local/include/ -framework Accelerate

  on macos, with fftw3:
  clang -o test_pffft -DHAVE_FFTW -DHAVE_VECLIB -O3 -Wall -W pffft.c test_pffft.c fftpack.c -L/usr/local/lib -I/usr/local/include/ -lfftw3f -framework Accelerate

  on windows, with visual c++:
  cl /Ox -D_USE_MATH_DEFINES /arch:SSE test_pffft.c pffft.c fftpack.c
  
  build without SIMD instructions:
  gcc -o test_pffft -DPFFFT_SIMD_DISABLE -O3 -Wall -W pffft.c test_pffft.c fftpack.c -lm

 */

 #include "pffft.h"
 #include "fftpack.h"

 #include <math.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>
 #include <assert.h>
 #include <string.h>

 #ifdef HAVE_SYS_TIMES
 #  include <sys/times.h>
 #  include <unistd.h>
 #endif

 #ifdef HAVE_VECLIB
 #  include <Accelerate/Accelerate.h>
 #endif

 #ifdef HAVE_FFTW
 #  include <fftw3.h>
 #endif

 #define MAX(x,y) ((x)>(y)?(x):(y))

 double frand() {
  return rand()/(double)RAND_MAX;
 }

 #if defined(HAVE_SYS_TIMES)
  inline double uclock_sec(void) {
    static double ttclk = 0.;
    if (ttclk == 0.) ttclk = sysconf(_SC_CLK_TCK);
    struct tms t; return ((double)times(&t)) / ttclk;
  }
 # else
  double uclock_sec(void)
 { return (double)clock()/(double)CLOCKS_PER_SEC; }
 #endif


 /* compare results with the regular fftpack */
 void pffft_validate_N(int N, int cplx) {
  int Nfloat = N*(cplx?2:1);
  int Nbytes = Nfloat * sizeof(float);
  float *ref, *in, *out, *tmp, *tmp2;
  PFFFT_Setup *s = pffft_new_setup(N, cplx ? PFFFT_COMPLEX : PFFFT_REAL);
  int pass;

  if (!s) { printf("Skipping N=%d, not supported\n", N); return; }
  ref = pffft_aligned_malloc(Nbytes);
  in = pffft_aligned_malloc(Nbytes);
  out = pffft_aligned_malloc(Nbytes);
  tmp = pffft_aligned_malloc(Nbytes);
  tmp2 = pffft_aligned_malloc(Nbytes);

  for (pass=0; pass < 2; ++pass) {
    float ref_max = 0;
    int k;
    //printf("N=%d pass=%d cplx=%d\n", N, pass, cplx);
    // compute reference solution with FFTPACK
    if (pass == 0) {
      float *wrk = malloc(2*Nbytes+15*sizeof(float));
      for (k=0; k < Nfloat; ++k) {
        ref[k] = in[k] = frand()*2-1; 
        out[k] = 1e30;
      }
      if (!cplx) {
        rffti(N, wrk);
        rfftf(N, ref, wrk);
        // use our ordering for real ffts instead of the one of fftpack
        {
          float refN=ref[N-1];
          for (k=N-2; k >= 1; --k) ref[k+1] = ref[k]; 
          ref[1] = refN;
        }
      } else {
        cffti(N, wrk);
        cfftf(N, ref, wrk);
      }
      free(wrk);
    }

    for (k = 0; k < Nfloat; ++k) ref_max = MAX(ref_max, fabs(ref[k]));

      
    // pass 0 : non canonical ordering of transform coefficients  
    if (pass == 0) {
      // test forward transform, with different input / output
      pffft_transform(s, in, tmp, 0, PFFFT_FORWARD);
      memcpy(tmp2, tmp, Nbytes);
      memcpy(tmp, in, Nbytes);
      pffft_transform(s, tmp, tmp, 0, PFFFT_FORWARD);
      for (k = 0; k < Nfloat; ++k) {
        assert(tmp2[k] == tmp[k]);
      }

      // test reordering
      pffft_zreorder(s, tmp, out, PFFFT_FORWARD);
      pffft_zreorder(s, out, tmp, PFFFT_BACKWARD);
      for (k = 0; k < Nfloat; ++k) {
        assert(tmp2[k] == tmp[k]);
      }
      pffft_zreorder(s, tmp, out, PFFFT_FORWARD);
    } else {
      // pass 1 : canonical ordering of transform coeffs.
      pffft_transform_ordered(s, in, tmp, 0, PFFFT_FORWARD);
      memcpy(tmp2, tmp, Nbytes);
      memcpy(tmp, in, Nbytes);
      pffft_transform_ordered(s, tmp, tmp, 0, PFFFT_FORWARD);
      for (k = 0; k < Nfloat; ++k) {
        assert(tmp2[k] == tmp[k]);
      }
      memcpy(out, tmp, Nbytes);
    }

    {
      for (k=0; k < Nfloat; ++k) {
        if (!(fabs(ref[k] - out[k]) < 1e-3*ref_max)) {
          printf("%s forward PFFFT mismatch found for N=%d\n", (cplx?"CPLX":"REAL"), N);
          exit(1);
        }
      }
        
      if (pass == 0) pffft_transform(s, tmp, out, 0, PFFFT_BACKWARD);
      else   pffft_transform_ordered(s, tmp, out, 0, PFFFT_BACKWARD);
      memcpy(tmp2, out, Nbytes);
      memcpy(out, tmp, Nbytes);
      if (pass == 0) pffft_transform(s, out, out, 0, PFFFT_BACKWARD);
      else   pffft_transform_ordered(s, out, out, 0, PFFFT_BACKWARD);
      for (k = 0; k < Nfloat; ++k) {
        assert(tmp2[k] == out[k]);
        out[k] *= 1.f/N;
      }
      for (k = 0; k < Nfloat; ++k) {
        if (fabs(in[k] - out[k]) > 1e-3 * ref_max) {
          printf("pass=%d, %s IFFFT does not match for N=%d\n", pass, (cplx?"CPLX":"REAL"), N); break;
          exit(1);
        }
      }
    }

    // quick test of the circular convolution in fft domain
    {
      float conv_err = 0, conv_max = 0;

      pffft_zreorder(s, ref, tmp, PFFFT_FORWARD);
      memset(out, 0, Nbytes);
      pffft_zconvolve_accumulate(s, ref, ref, out, 1.0);
      pffft_zreorder(s, out, tmp2, PFFFT_FORWARD);
      
      for (k=0; k < Nfloat; k += 2) {
        float ar = tmp[k], ai=tmp[k+1];
        if (cplx || k > 0) {
          tmp[k] = ar*ar - ai*ai;
          tmp[k+1] = 2*ar*ai;
        } else {
          tmp[0] = ar*ar;
          tmp[1] = ai*ai;
        }
      }
      
      for (k=0; k < Nfloat; ++k) {
        float d = fabs(tmp[k] - tmp2[k]), e = fabs(tmp[k]);
        if (d > conv_err) conv_err = d;
        if (e > conv_max) conv_max = e;
      }
      if (conv_err > 1e-5*conv_max) {
        printf("zconvolve error ? %g %g\n", conv_err, conv_max); exit(1);
      }
    }

  }

  printf("%s PFFFT is OK for N=%d\n", (cplx?"CPLX":"REAL"), N); fflush(stdout);
  
  pffft_destroy_setup(s);
  pffft_aligned_free(ref);
  pffft_aligned_free(in);
  pffft_aligned_free(out);
  pffft_aligned_free(tmp);
  pffft_aligned_free(tmp2);
 }

 void pffft_validate(int cplx) {
  static int Ntest[] = { 16, 32, 64, 96, 128, 160, 192, 256, 288, 384, 5*96, 512, 576, 5*128, 800, 864, 1024, 2048, 2592, 4000, 4096, 12000, 36864, 0};
  int k;
  for (k = 0; Ntest[k]; ++k) {
    int N = Ntest[k];
    if (N == 16 && !cplx) continue;
    pffft_validate_N(N, cplx);
  }
 }

 int array_output_format = 0;

 void show_output(const char *name, int N, int cplx, float flops, float t0, float t1, int max_iter) {
  float mflops = flops/1e6/(t1 - t0 + 1e-16);
  if (array_output_format) {
    if (flops != -1) {
      printf("|%9.0f   ", mflops);
    } else printf("|      n/a   ");
  } else {
    if (flops != -1) {
      printf("N=%5d, %s %16s : %6.0f MFlops [t=%6.0f ns, %d runs]\n", N, (cplx?"CPLX":"REAL"), name, mflops, (t1-t0)/2/max_iter * 1e9, max_iter);
    }
  }
  fflush(stdout);
 }

 void benchmark_ffts(int N, int cplx) {
  int Nfloat = (cplx ? N*2 : N);
  int Nbytes = Nfloat * sizeof(float);
  float *X = pffft_aligned_malloc(Nbytes), *Y = pffft_aligned_malloc(Nbytes), *Z = pffft_aligned_malloc(Nbytes);

  double t0, t1, flops;

  int k;
  int max_iter = 5120000/N*4;
 #ifdef __arm__
  max_iter /= 4;
 #endif
  int iter;

  for (k = 0; k < Nfloat; ++k) {
    X[k] = 0; //sqrtf(k+1);
  }

  // FFTPack benchmark
  {
    float *wrk = malloc(2*Nbytes + 15*sizeof(float));
    int max_iter_ = max_iter/pffft_simd_size(); if (max_iter_ == 0) max_iter_ = 1;
    if (cplx) cffti(N, wrk);
    else      rffti(N, wrk);
    t0 = uclock_sec();  
    
    for (iter = 0; iter < max_iter_; ++iter) {
      if (cplx) {
        cfftf(N, X, wrk);
        cfftb(N, X, wrk);
      } else {
        rfftf(N, X, wrk);
        rfftb(N, X, wrk);
      }
    }
    t1 = uclock_sec();
    free(wrk);
    
    flops = (max_iter_*2) * ((cplx ? 5 : 2.5)*N*log((double)N)/M_LN2); // see http://www.fftw.org/speed/method.html
    show_output("FFTPack", N, cplx, flops, t0, t1, max_iter_);
  }

 #ifdef HAVE_VECLIB
  int log2N = (int)(log(N)/log(2) + 0.5f);
  if (N == (1<<log2N)) {
    FFTSetup setup;

    setup = vDSP_create_fftsetup(log2N, FFT_RADIX2);
    DSPSplitComplex zsamples;
    zsamples.realp = &X[0];
    zsamples.imagp = &X[Nfloat/2];
    t0 = uclock_sec();  
    for (iter = 0; iter < max_iter; ++iter) {
      if (cplx) {
        vDSP_fft_zip(setup, &zsamples, 1, log2N, kFFTDirection_Forward);
        vDSP_fft_zip(setup, &zsamples, 1, log2N, kFFTDirection_Inverse);
      } else {
        vDSP_fft_zrip(setup, &zsamples, 1, log2N, kFFTDirection_Forward); 
        vDSP_fft_zrip(setup, &zsamples, 1, log2N, kFFTDirection_Inverse);
      }
    }
    t1 = uclock_sec();
    vDSP_destroy_fftsetup(setup);

    flops = (max_iter*2) * ((cplx ? 5 : 2.5)*N*log((double)N)/M_LN2); // see http://www.fftw.org/speed/method.html
    show_output("vDSP", N, cplx, flops, t0, t1, max_iter);
  } else {
    show_output("vDSP", N, cplx, -1, -1, -1, -1);
  }
 #endif
  
 #ifdef HAVE_FFTW
  {
    fftwf_plan planf, planb;
    fftw_complex *in = (fftw_complex*) fftwf_malloc(sizeof(fftw_complex) * N);
    fftw_complex *out = (fftw_complex*) fftwf_malloc(sizeof(fftw_complex) * N);
    memset(in, 0, sizeof(fftw_complex) * N);
    int flags = (N < 40000 ? FFTW_MEASURE : FFTW_ESTIMATE);  // measure takes a lot of time on largest ffts
    //int flags = FFTW_ESTIMATE;
    if (cplx) {
      planf = fftwf_plan_dft_1d(N, (fftwf_complex*)in, (fftwf_complex*)out, FFTW_FORWARD, flags);
      planb = fftwf_plan_dft_1d(N, (fftwf_complex*)in, (fftwf_complex*)out, FFTW_BACKWARD, flags);
    } else {
      planf = fftwf_plan_dft_r2c_1d(N, (float*)in, (fftwf_complex*)out, flags);
      planb = fftwf_plan_dft_c2r_1d(N, (fftwf_complex*)in, (float*)out, flags);
    }

    t0 = uclock_sec();  
    for (iter = 0; iter < max_iter; ++iter) {
      fftwf_execute(planf);
      fftwf_execute(planb);
    }
    t1 = uclock_sec();

    fftwf_destroy_plan(planf);
    fftwf_destroy_plan(planb);
    fftwf_free(in); fftwf_free(out);

    flops = (max_iter*2) * ((cplx ? 5 : 2.5)*N*log((double)N)/M_LN2); // see http://www.fftw.org/speed/method.html
    show_output((flags == FFTW_MEASURE ? "FFTW (meas.)" : " FFTW (estim)"), N, cplx, flops, t0, t1, max_iter);
  }
 #endif  

  // PFFFT benchmark
  {
    PFFFT_Setup *s = pffft_new_setup(N, cplx ? PFFFT_COMPLEX : PFFFT_REAL);
    if (s) {
      t0 = uclock_sec();  
      for (iter = 0; iter < max_iter; ++iter) {
        pffft_transform(s, X, Z, Y, PFFFT_FORWARD);
        pffft_transform(s, X, Z, Y, PFFFT_BACKWARD);
      }
      t1 = uclock_sec();
      pffft_destroy_setup(s);
    
      flops = (max_iter*2) * ((cplx ? 5 : 2.5)*N*log((double)N)/M_LN2); // see http://www.fftw.org/speed/method.html
      show_output("PFFFT", N, cplx, flops, t0, t1, max_iter);
    }
  }

  if (!array_output_format) {
    printf("--\n");
  }

  pffft_aligned_free(X);
  pffft_aligned_free(Y);
  pffft_aligned_free(Z);
 }

 #ifndef PFFFT_SIMD_DISABLE
 void validate_pffft_simd(); // a small function inside pffft.c that will detect compiler bugs with respect to simd instruction 
 #endif

 int main(int argc, char **argv) {
  int Nvalues[] = { 64, 96, 128, 160, 192, 256, 384, 5*96, 512, 5*128, 3*256, 800, 1024, 2048, 2400, 4096, 8192, 9*1024, 16384, 32768, 256*1024, 1024*1024, -1 };
  int i;

  if (argc > 1 && strcmp(argv[1], "--array-format") == 0) {
    array_output_format = 1;
  }

 #ifndef PFFFT_SIMD_DISABLE
  validate_pffft_simd();
 #endif
  pffft_validate(1);
  pffft_validate(0);
  if (!array_output_format) {
    for (i=0; Nvalues[i] > 0; ++i) {
      benchmark_ffts(Nvalues[i], 0 /* real fft */);
    }
    for (i=0; Nvalues[i] > 0; ++i) {
      benchmark_ffts(Nvalues[i], 1 /* cplx fft */);
    }
  } else {
    printf("| input len ");
    printf("|real FFTPack");
 #ifdef HAVE_VECLIB
    printf("|  real vDSP ");
 #endif
 #ifdef HAVE_FFTW
    printf("|  real FFTW ");
 #endif
    printf("| real PFFFT | ");

    printf("|cplx FFTPack");
 #ifdef HAVE_VECLIB
    printf("|  cplx vDSP ");
 #endif
 #ifdef HAVE_FFTW
    printf("|  cplx FFTW ");
 #endif
    printf("| cplx PFFFT |\n");
    for (i=0; Nvalues[i] > 0; ++i) {
      printf("|%9d  ", Nvalues[i]);
      benchmark_ffts(Nvalues[i], 0); 
      printf("| ");
      benchmark_ffts(Nvalues[i], 1);
      printf("|\n");
    }
    printf(" (numbers are given in MFlops)\n");
  }


  return 0;
 }