| @@ -1,11 +1,9 @@ | |||
| RACK_DIR ?= ../.. | |||
| SLUG = Befaco | |||
| VERSION = 0.6.0 | |||
| FLAGS = -I./pffft -DPFFFT_SIMD_DISABLE | |||
| SOURCES += $(wildcard src/*.cpp) pffft/pffft.c | |||
| SOURCES += $(wildcard src/*.cpp) | |||
| DISTRIBUTABLES += $(wildcard LICENSE*) res | |||
| RACK_DIR ?= ../.. | |||
| include $(RACK_DIR)/plugin.mk | |||
| @@ -1,416 +0,0 @@ | |||
| PFFFT: a pretty fast FFT. | |||
| TL;DR | |||
| -- | |||
| PFFFT does 1D Fast Fourier Transforms, of single precision real and | |||
| complex vectors. It tries do it fast, it tries to be correct, and it | |||
| tries to be small. Computations do take advantage of SSE1 instructions | |||
| on x86 cpus, Altivec on powerpc cpus, and NEON on ARM cpus. The | |||
| license is BSD-like. | |||
| Why does it exist: | |||
| -- | |||
| I was in search of a good performing FFT library , preferably very | |||
| small and with a very liberal license. | |||
| When one says "fft library", FFTW ("Fastest Fourier Transform in the | |||
| West") is probably the first name that comes to mind -- I guess that | |||
| 99% of open-source projects that need a FFT do use FFTW, and are happy | |||
| with it. However, it is quite a large library , which does everything | |||
| fft related (2d transforms, 3d transforms, other transformations such | |||
| as discrete cosine , or fast hartley). And it is licensed under the | |||
| GNU GPL , which means that it cannot be used in non open-source | |||
| products. | |||
| An alternative to FFTW that is really small, is the venerable FFTPACK | |||
| v4, which is available on NETLIB. A more recent version (v5) exists, | |||
| but it is larger as it deals with multi-dimensional transforms. This | |||
| is a library that is written in FORTRAN 77, a language that is now | |||
| considered as a bit antiquated by many. FFTPACKv4 was written in 1985, | |||
| by Dr Paul Swarztrauber of NCAR, more than 25 years ago ! And despite | |||
| its age, benchmarks show it that it still a very good performing FFT | |||
| library, see for example the 1d single precision benchmarks here: | |||
| http://www.fftw.org/speed/opteron-2.2GHz-32bit/ . It is however not | |||
| competitive with the fastest ones, such as FFTW, Intel MKL, AMD ACML, | |||
| Apple vDSP. The reason for that is that those libraries do take | |||
| advantage of the SSE SIMD instructions available on Intel CPUs, | |||
| available since the days of the Pentium III. These instructions deal | |||
| with small vectors of 4 floats at a time, instead of a single float | |||
| for a traditionnal FPU, so when using these instructions one may expect | |||
| a 4-fold performance improvement. | |||
| The idea was to take this fortran fftpack v4 code, translate to C, | |||
| modify it to deal with those SSE instructions, and check that the | |||
| final performance is not completely ridiculous when compared to other | |||
| SIMD FFT libraries. Translation to C was performed with f2c ( | |||
| http://www.netlib.org/f2c/ ). The resulting file was a bit edited in | |||
| order to remove the thousands of gotos that were introduced by | |||
| f2c. You will find the fftpack.h and fftpack.c sources in the | |||
| repository, this a complete translation of | |||
| http://www.netlib.org/fftpack/ , with the discrete cosine transform | |||
| and the test program. There is no license information in the netlib | |||
| repository, but it was confirmed to me by the fftpack v5 curators that | |||
| the same terms do apply to fftpack v4: | |||
| http://www.cisl.ucar.edu/css/software/fftpack5/ftpk.html . This is a | |||
| "BSD-like" license, it is compatible with proprietary projects. | |||
| Adapting fftpack to deal with the SIMD 4-element vectors instead of | |||
| scalar single precision numbers was more complex than I originally | |||
| thought, especially with the real transforms, and I ended up writing | |||
| more code than I planned.. | |||
| The code: | |||
| -- | |||
| Only two files, in good old C, pffft.c and pffft.h . The API is very | |||
| very simple, just make sure that you read the comments in pffft.h. | |||
| Comparison with other FFTs: | |||
| -- | |||
| The idea was not to break speed records, but to get a decently fast | |||
| fft that is at least 50% as fast as the fastest FFT -- especially on | |||
| slowest computers . I'm more focused on getting the best performance | |||
| on slow cpus (Atom, Intel Core 1, old Athlons, ARM Cortex-A9...), than | |||
| on getting top performance on today fastest cpus. | |||
| It can be used in a real-time context as the fft functions do not | |||
| perform any memory allocation -- that is why they accept a 'work' | |||
| array in their arguments. | |||
| It is also a bit focused on performing 1D convolutions, that is why it | |||
| provides "unordered" FFTs , and a fourier domain convolution | |||
| operation. | |||
| Benchmark results (cpu tested: core i7 2600, core 2 quad, core 1 duo, atom N270, cortex-A9, cortex-A15, A8X) | |||
| -- | |||
| The benchmark shows the performance of various fft implementations measured in | |||
| MFlops, with the number of floating point operations being defined as 5Nlog2(N) | |||
| for a length N complex fft, and 2.5*Nlog2(N) for a real fft. | |||
| See http://www.fftw.org/speed/method.html for an explanation of these formulas. | |||
| MacOS Lion, gcc 4.2, 64-bit, fftw 3.3 on a 3.4 GHz core i7 2600 | |||
| Built with: | |||
| gcc-4.2 -o test_pffft -arch x86_64 -O3 -Wall -W pffft.c test_pffft.c fftpack.c -L/usr/local/lib -I/usr/local/include/ -DHAVE_VECLIB -framework veclib -DHAVE_FFTW -lfftw3f | |||
| | input len |real FFTPack| real vDSP | real FFTW | real PFFFT | |cplx FFTPack| cplx vDSP | cplx FFTW | cplx PFFFT | | |||
| |-----------+------------+------------+------------+------------| |------------+------------+------------+------------| | |||
| | 64 | 2816 | 8596 | 7329 | 8187 | | 2887 | 14898 | 14668 | 11108 | | |||
| | 96 | 3298 | n/a | 8378 | 7727 | | 3953 | n/a | 15680 | 10878 | | |||
| | 128 | 3507 | 11575 | 9266 | 10108 | | 4233 | 17598 | 16427 | 12000 | | |||
| | 160 | 3391 | n/a | 9838 | 10711 | | 4220 | n/a | 16653 | 11187 | | |||
| | 192 | 3919 | n/a | 9868 | 10956 | | 4297 | n/a | 15770 | 12540 | | |||
| | 256 | 4283 | 13179 | 10694 | 13128 | | 4545 | 19550 | 16350 | 13822 | | |||
| | 384 | 3136 | n/a | 10810 | 12061 | | 3600 | n/a | 16103 | 13240 | | |||
| | 480 | 3477 | n/a | 10632 | 12074 | | 3536 | n/a | 11630 | 12522 | | |||
| | 512 | 3783 | 15141 | 11267 | 13838 | | 3649 | 20002 | 16560 | 13580 | | |||
| | 640 | 3639 | n/a | 11164 | 13946 | | 3695 | n/a | 15416 | 13890 | | |||
| | 768 | 3800 | n/a | 11245 | 13495 | | 3590 | n/a | 15802 | 14552 | | |||
| | 800 | 3440 | n/a | 10499 | 13301 | | 3659 | n/a | 12056 | 13268 | | |||
| | 1024 | 3924 | 15605 | 11450 | 15339 | | 3769 | 20963 | 13941 | 15467 | | |||
| | 2048 | 4518 | 16195 | 11551 | 15532 | | 4258 | 20413 | 13723 | 15042 | | |||
| | 2400 | 4294 | n/a | 10685 | 13078 | | 4093 | n/a | 12777 | 13119 | | |||
| | 4096 | 4750 | 16596 | 11672 | 15817 | | 4157 | 19662 | 14316 | 14336 | | |||
| | 8192 | 3820 | 16227 | 11084 | 12555 | | 3691 | 18132 | 12102 | 13813 | | |||
| | 9216 | 3864 | n/a | 10254 | 12870 | | 3586 | n/a | 12119 | 13994 | | |||
| | 16384 | 3822 | 15123 | 10454 | 12822 | | 3613 | 16874 | 12370 | 13881 | | |||
| | 32768 | 4175 | 14512 | 10662 | 11095 | | 3881 | 14702 | 11619 | 11524 | | |||
| | 262144 | 3317 | 11429 | 6269 | 9517 | | 2810 | 11729 | 7757 | 10179 | | |||
| | 1048576 | 2913 | 10551 | 4730 | 5867 | | 2661 | 7881 | 3520 | 5350 | | |||
| |-----------+------------+------------+------------+------------| |------------+------------+------------+------------| | |||
| Debian 6, gcc 4.4.5, 64-bit, fftw 3.3.1 on a 3.4 GHz core i7 2600 | |||
| Built with: | |||
| gcc -o test_pffft -DHAVE_FFTW -msse2 -O3 -Wall -W pffft.c test_pffft.c fftpack.c -L$HOME/local/lib -I$HOME/local/include/ -lfftw3f -lm | |||
| | N (input length) | real FFTPack | real FFTW | real PFFFT | | cplx FFTPack | cplx FFTW | cplx PFFFT | | |||
| |------------------+--------------+--------------+--------------| |--------------+--------------+--------------| | |||
| | 64 | 3840 | 7680 | 8777 | | 4389 | 20480 | 11171 | | |||
| | 96 | 4214 | 9633 | 8429 | | 4816 | 22477 | 11238 | | |||
| | 128 | 3584 | 10240 | 10240 | | 5120 | 23893 | 11947 | | |||
| | 192 | 4854 | 11095 | 12945 | | 4854 | 22191 | 14121 | | |||
| | 256 | 4096 | 11703 | 16384 | | 5120 | 23406 | 13653 | | |||
| | 384 | 4395 | 14651 | 12558 | | 4884 | 19535 | 14651 | | |||
| | 512 | 5760 | 13166 | 15360 | | 4608 | 23040 | 15360 | | |||
| | 768 | 4907 | 14020 | 16357 | | 4461 | 19628 | 14020 | | |||
| | 1024 | 5120 | 14629 | 14629 | | 5120 | 20480 | 15754 | | |||
| | 2048 | 5632 | 14080 | 18773 | | 4693 | 12516 | 16091 | | |||
| | 4096 | 5120 | 13653 | 17554 | | 4726 | 7680 | 14456 | | |||
| | 8192 | 4160 | 7396 | 13312 | | 4437 | 14791 | 13312 | | |||
| | 9216 | 4210 | 6124 | 13473 | | 4491 | 7282 | 14970 | | |||
| | 16384 | 3976 | 11010 | 14313 | | 4210 | 11450 | 13631 | | |||
| | 32768 | 4260 | 10224 | 10954 | | 4260 | 6816 | 11797 | | |||
| | 262144 | 3736 | 6896 | 9961 | | 2359 | 8965 | 9437 | | |||
| | 1048576 | 2796 | 4534 | 6453 | | 1864 | 3078 | 5592 | | |||
| |------------------+--------------+--------------+--------------| |--------------+--------------+--------------| | |||
| MacOS Snow Leopard, gcc 4.0, 32-bit, fftw 3.3 on a 1.83 GHz core 1 duo | |||
| Built with: | |||
| gcc -o test_pffft -DHAVE_FFTW -DHAVE_VECLIB -O3 -Wall -W pffft.c test_pffft.c fftpack.c -L/usr/local/lib -I/usr/local/include/ -lfftw3f -framework veclib | |||
| | input len |real FFTPack| real vDSP | real FFTW | real PFFFT | |cplx FFTPack| cplx vDSP | cplx FFTW | cplx PFFFT | | |||
| |-----------+------------+------------+------------+------------| |------------+------------+------------+------------| | |||
| | 64 | 745 | 2145 | 1706 | 2028 | | 961 | 3356 | 3313 | 2300 | | |||
| | 96 | 877 | n/a | 1976 | 1978 | | 1059 | n/a | 3333 | 2233 | | |||
| | 128 | 951 | 2808 | 2213 | 2279 | | 1202 | 3803 | 3739 | 2494 | | |||
| | 192 | 1002 | n/a | 2456 | 2429 | | 1186 | n/a | 3701 | 2508 | | |||
| | 256 | 1065 | 3205 | 2641 | 2793 | | 1302 | 4013 | 3912 | 2663 | | |||
| | 384 | 845 | n/a | 2759 | 2499 | | 948 | n/a | 3729 | 2504 | | |||
| | 512 | 900 | 3476 | 2956 | 2759 | | 974 | 4057 | 3954 | 2645 | | |||
| | 768 | 910 | n/a | 2912 | 2737 | | 975 | n/a | 3837 | 2614 | | |||
| | 1024 | 936 | 3583 | 3107 | 3009 | | 1006 | 4124 | 3821 | 2697 | | |||
| | 2048 | 1057 | 3585 | 3091 | 2837 | | 1089 | 3889 | 3701 | 2513 | | |||
| | 4096 | 1083 | 3524 | 3092 | 2733 | | 1039 | 3617 | 3462 | 2364 | | |||
| | 8192 | 874 | 3252 | 2967 | 2363 | | 911 | 3106 | 2789 | 2302 | | |||
| | 9216 | 898 | n/a | 2420 | 2290 | | 865 | n/a | 2676 | 2204 | | |||
| | 16384 | 903 | 2892 | 2506 | 2421 | | 899 | 3026 | 2797 | 2289 | | |||
| | 32768 | 965 | 2837 | 2550 | 2358 | | 920 | 2922 | 2763 | 2240 | | |||
| | 262144 | 738 | 2422 | 1589 | 1708 | | 610 | 2038 | 1436 | 1091 | | |||
| | 1048576 | 528 | 1207 | 845 | 880 | | 606 | 1020 | 669 | 1036 | | |||
| |-----------+------------+------------+------------+------------| |------------+------------+------------+------------| | |||
| Ubuntu 11.04, gcc 4.5, 32-bit, fftw 3.2 on a 2.66 core 2 quad | |||
| Built with: | |||
| gcc -o test_pffft -DHAVE_FFTW -msse -mfpmath=sse -O3 -Wall -W pffft.c test_pffft.c fftpack.c -L/usr/local/lib -I/usr/local/include/ -lfftw3f -lm | |||
| | input len |real FFTPack| real FFTW | real PFFFT | |cplx FFTPack| cplx FFTW | cplx PFFFT | | |||
| |-----------+------------+------------+------------| |------------+------------+------------| | |||
| | 64 | 1920 | 3614 | 5120 | | 2194 | 7680 | 6467 | | |||
| | 96 | 1873 | 3549 | 5187 | | 2107 | 8429 | 5863 | | |||
| | 128 | 2240 | 3773 | 5514 | | 2560 | 7964 | 6827 | | |||
| | 192 | 1765 | 4569 | 7767 | | 2284 | 9137 | 7061 | | |||
| | 256 | 2048 | 5461 | 7447 | | 2731 | 9638 | 7802 | | |||
| | 384 | 1998 | 5861 | 6762 | | 2313 | 9253 | 7644 | | |||
| | 512 | 2095 | 6144 | 7680 | | 2194 | 10240 | 7089 | | |||
| | 768 | 2230 | 5773 | 7549 | | 2045 | 10331 | 7010 | | |||
| | 1024 | 2133 | 6400 | 8533 | | 2133 | 10779 | 7877 | | |||
| | 2048 | 2011 | 7040 | 8665 | | 1942 | 10240 | 7768 | | |||
| | 4096 | 2194 | 6827 | 8777 | | 1755 | 9452 | 6827 | | |||
| | 8192 | 1849 | 6656 | 6656 | | 1752 | 7831 | 6827 | | |||
| | 9216 | 1871 | 5858 | 6416 | | 1643 | 6909 | 6266 | | |||
| | 16384 | 1883 | 6223 | 6506 | | 1664 | 7340 | 6982 | | |||
| | 32768 | 1826 | 6390 | 6667 | | 1631 | 7481 | 6971 | | |||
| | 262144 | 1546 | 4075 | 5977 | | 1299 | 3415 | 3551 | | |||
| | 1048576 | 1104 | 2071 | 1730 | | 1104 | 1149 | 1834 | | |||
| |-----------+------------+------------+------------| |------------+------------+------------| | |||
| Ubuntu 11.04, gcc 4.5, 32-bit, fftw 3.3 on a 1.6 GHz Atom N270 | |||
| Built with: | |||
| gcc -o test_pffft -DHAVE_FFTW -msse -mfpmath=sse -O3 -Wall -W pffft.c test_pffft.c fftpack.c -L/usr/local/lib -I/usr/local/include/ -lfftw3f -lm | |||
| | N (input length) | real FFTPack | real FFTW | real PFFFT | | cplx FFTPack | cplx FFTW | cplx PFFFT | | |||
| |------------------+--------------+--------------+--------------| |--------------+--------------+--------------| | |||
| | 64 | 452 | 1041 | 1336 | | 549 | 2318 | 1781 | | |||
| | 96 | 444 | 1297 | 1297 | | 503 | 2408 | 1686 | | |||
| | 128 | 527 | 1525 | 1707 | | 543 | 2655 | 1886 | | |||
| | 192 | 498 | 1653 | 1849 | | 539 | 2678 | 1942 | | |||
| | 256 | 585 | 1862 | 2156 | | 594 | 2777 | 2244 | | |||
| | 384 | 499 | 1870 | 1998 | | 511 | 2586 | 1890 | | |||
| | 512 | 562 | 2095 | 2194 | | 542 | 2973 | 2194 | | |||
| | 768 | 545 | 2045 | 2133 | | 545 | 2365 | 2133 | | |||
| | 1024 | 595 | 2133 | 2438 | | 569 | 2695 | 2179 | | |||
| | 2048 | 587 | 2125 | 2347 | | 521 | 2230 | 1707 | | |||
| | 4096 | 495 | 1890 | 1834 | | 492 | 1876 | 1672 | | |||
| | 8192 | 469 | 1548 | 1729 | | 438 | 1740 | 1664 | | |||
| | 9216 | 468 | 1663 | 1663 | | 446 | 1585 | 1531 | | |||
| | 16384 | 453 | 1608 | 1767 | | 398 | 1476 | 1664 | | |||
| | 32768 | 456 | 1420 | 1503 | | 387 | 1388 | 1345 | | |||
| | 262144 | 309 | 385 | 726 | | 262 | 415 | 840 | | |||
| | 1048576 | 280 | 351 | 739 | | 261 | 313 | 797 | | |||
| |------------------+--------------+--------------+--------------| |--------------+--------------+--------------| | |||
| Windows 7, visual c++ 2010 on a 1.6 GHz Atom N270 | |||
| Built with: | |||
| cl /Ox -D_USE_MATH_DEFINES /arch:SSE test_pffft.c pffft.c fftpack.c | |||
| (visual c++ is definitively not very good with SSE intrinsics...) | |||
| | N (input length) | real FFTPack | real PFFFT | | cplx FFTPack | cplx PFFFT | | |||
| |------------------+--------------+--------------| |--------------+--------------| | |||
| | 64 | 173 | 1009 | | 174 | 1159 | | |||
| | 96 | 169 | 1029 | | 188 | 1201 | | |||
| | 128 | 195 | 1242 | | 191 | 1275 | | |||
| | 192 | 178 | 1312 | | 184 | 1276 | | |||
| | 256 | 196 | 1591 | | 186 | 1281 | | |||
| | 384 | 172 | 1409 | | 181 | 1281 | | |||
| | 512 | 187 | 1640 | | 181 | 1313 | | |||
| | 768 | 171 | 1614 | | 176 | 1258 | | |||
| | 1024 | 186 | 1812 | | 178 | 1223 | | |||
| | 2048 | 190 | 1707 | | 186 | 1099 | | |||
| | 4096 | 182 | 1446 | | 177 | 975 | | |||
| | 8192 | 175 | 1345 | | 169 | 1034 | | |||
| | 9216 | 165 | 1271 | | 168 | 1023 | | |||
| | 16384 | 166 | 1396 | | 165 | 949 | | |||
| | 32768 | 172 | 1311 | | 161 | 881 | | |||
| | 262144 | 136 | 632 | | 134 | 629 | | |||
| | 1048576 | 134 | 698 | | 127 | 623 | | |||
| |------------------+--------------+--------------| |--------------+--------------| | |||
| Ubuntu 12.04, gcc-4.7.3, 32-bit, with fftw 3.3.3 (built with --enable-neon), on a 1.2GHz ARM Cortex A9 (Tegra 3) | |||
| Built with: | |||
| gcc-4.7 -O3 -DHAVE_FFTW -march=armv7-a -mtune=cortex-a9 -mfloat-abi=hard -mfpu=neon -ffast-math test_pffft.c pffft.c -o test_pffft_arm fftpack.c -lm -I/usr/local/include/ -L/usr/local/lib/ -lfftw3f | |||
| | input len |real FFTPack| real FFTW | real PFFFT | |cplx FFTPack| cplx FFTW | cplx PFFFT | | |||
| |-----------+------------+------------+------------| |------------+------------+------------| | |||
| | 64 | 549 | 452 | 731 | | 512 | 602 | 640 | | |||
| | 96 | 421 | 272 | 702 | | 496 | 571 | 602 | | |||
| | 128 | 498 | 512 | 815 | | 597 | 618 | 652 | | |||
| | 160 | 521 | 536 | 815 | | 586 | 669 | 625 | | |||
| | 192 | 539 | 571 | 883 | | 485 | 597 | 626 | | |||
| | 256 | 640 | 539 | 975 | | 569 | 611 | 671 | | |||
| | 384 | 499 | 610 | 879 | | 499 | 602 | 637 | | |||
| | 480 | 518 | 507 | 877 | | 496 | 661 | 616 | | |||
| | 512 | 524 | 591 | 1002 | | 549 | 678 | 668 | | |||
| | 640 | 542 | 612 | 955 | | 568 | 663 | 645 | | |||
| | 768 | 557 | 613 | 981 | | 491 | 663 | 598 | | |||
| | 800 | 514 | 353 | 882 | | 514 | 360 | 574 | | |||
| | 1024 | 640 | 640 | 1067 | | 492 | 683 | 602 | | |||
| | 2048 | 587 | 640 | 908 | | 486 | 640 | 552 | | |||
| | 2400 | 479 | 368 | 777 | | 422 | 376 | 518 | | |||
| | 4096 | 511 | 614 | 853 | | 426 | 640 | 534 | | |||
| | 8192 | 415 | 584 | 708 | | 386 | 622 | 516 | | |||
| | 9216 | 419 | 571 | 687 | | 364 | 586 | 506 | | |||
| | 16384 | 426 | 577 | 716 | | 398 | 606 | 530 | | |||
| | 32768 | 417 | 572 | 673 | | 399 | 572 | 468 | | |||
| | 262144 | 219 | 380 | 293 | | 255 | 431 | 343 | | |||
| | 1048576 | 202 | 274 | 237 | | 265 | 282 | 355 | | |||
| |-----------+------------+------------+------------| |------------+------------+------------| | |||
| Same platform as above, but this time pffft and fftpack are built with clang 3.2: | |||
| clang -O3 -DHAVE_FFTW -march=armv7-a -mtune=cortex-a9 -mfloat-abi=hard -mfpu=neon -ffast-math test_pffft.c pffft.c -o test_pffft_arm fftpack.c -lm -I/usr/local/include/ -L/usr/local/lib/ -lfftw3f | |||
| | input len |real FFTPack| real FFTW | real PFFFT | |cplx FFTPack| cplx FFTW | cplx PFFFT | | |||
| |-----------+------------+------------+------------| |------------+------------+------------| | |||
| | 64 | 427 | 452 | 853 | | 427 | 602 | 1024 | | |||
| | 96 | 351 | 276 | 843 | | 337 | 571 | 963 | | |||
| | 128 | 373 | 512 | 996 | | 390 | 618 | 1054 | | |||
| | 160 | 426 | 536 | 987 | | 375 | 669 | 914 | | |||
| | 192 | 404 | 571 | 1079 | | 388 | 588 | 1079 | | |||
| | 256 | 465 | 539 | 1205 | | 445 | 602 | 1170 | | |||
| | 384 | 366 | 610 | 1099 | | 343 | 594 | 1099 | | |||
| | 480 | 356 | 507 | 1140 | | 335 | 651 | 931 | | |||
| | 512 | 411 | 591 | 1213 | | 384 | 649 | 1124 | | |||
| | 640 | 398 | 612 | 1193 | | 373 | 654 | 901 | | |||
| | 768 | 409 | 613 | 1227 | | 383 | 663 | 1044 | | |||
| | 800 | 411 | 348 | 1073 | | 353 | 358 | 809 | | |||
| | 1024 | 427 | 640 | 1280 | | 413 | 692 | 1004 | | |||
| | 2048 | 414 | 626 | 1126 | | 371 | 640 | 853 | | |||
| | 2400 | 399 | 373 | 898 | | 319 | 368 | 653 | | |||
| | 4096 | 404 | 602 | 1059 | | 357 | 633 | 778 | | |||
| | 8192 | 332 | 584 | 792 | | 308 | 616 | 716 | | |||
| | 9216 | 322 | 561 | 783 | | 299 | 586 | 687 | | |||
| | 16384 | 344 | 568 | 778 | | 314 | 617 | 745 | | |||
| | 32768 | 342 | 564 | 737 | | 314 | 552 | 629 | | |||
| | 262144 | 201 | 383 | 313 | | 227 | 435 | 413 | | |||
| | 1048576 | 187 | 262 | 251 | | 228 | 281 | 409 | | |||
| |-----------+------------+------------+------------| |------------+------------+------------| | |||
| So it looks like, on ARM, gcc 4.7 is the best at scalar floating point | |||
| (the fftpack performance numbers are better with gcc), while clang is | |||
| the best with neon intrinsics (see how pffft perf has improved with | |||
| clang 3.2). | |||
| NVIDIA Jetson TK1 board, gcc-4.8.2. The cpu is a 2.3GHz cortex A15 (Tegra K1). | |||
| Built with: | |||
| gcc -O3 -march=armv7-a -mtune=native -mfloat-abi=hard -mfpu=neon -ffast-math test_pffft.c pffft.c -o test_pffft_arm fftpack.c -lm | |||
| | input len |real FFTPack| real PFFFT | |cplx FFTPack| cplx PFFFT | | |||
| |-----------+------------+------------| |------------+------------| | |||
| | 64 | 1735 | 3308 | | 1994 | 3744 | | |||
| | 96 | 1596 | 3448 | | 1987 | 3572 | | |||
| | 128 | 1807 | 4076 | | 2255 | 3960 | | |||
| | 160 | 1769 | 4083 | | 2071 | 3845 | | |||
| | 192 | 1990 | 4233 | | 2017 | 3939 | | |||
| | 256 | 2191 | 4882 | | 2254 | 4346 | | |||
| | 384 | 1878 | 4492 | | 2073 | 4012 | | |||
| | 480 | 1748 | 4398 | | 1923 | 3951 | | |||
| | 512 | 2030 | 5064 | | 2267 | 4195 | | |||
| | 640 | 1918 | 4756 | | 2094 | 4184 | | |||
| | 768 | 2099 | 4907 | | 2048 | 4297 | | |||
| | 800 | 1822 | 4555 | | 1880 | 4063 | | |||
| | 1024 | 2232 | 5355 | | 2187 | 4420 | | |||
| | 2048 | 2176 | 4983 | | 2027 | 3602 | | |||
| | 2400 | 1741 | 4256 | | 1710 | 3344 | | |||
| | 4096 | 1816 | 3914 | | 1851 | 3349 | | |||
| | 8192 | 1716 | 3481 | | 1700 | 3255 | | |||
| | 9216 | 1735 | 3589 | | 1653 | 3094 | | |||
| | 16384 | 1567 | 3483 | | 1637 | 3244 | | |||
| | 32768 | 1624 | 3240 | | 1655 | 3156 | | |||
| | 262144 | 1012 | 1898 | | 983 | 1503 | | |||
| | 1048576 | 876 | 1154 | | 868 | 1341 | | |||
| |-----------+------------+------------| |------------+------------| | |||
| The performance on the tegra K1 is pretty impressive. I'm not | |||
| including the FFTW numbers as they as slightly below the scalar | |||
| fftpack numbers, so something must be wrong (however it seems to be | |||
| correctly configured and is using neon simd instructions). | |||
| When using clang 3.4 the pffft version is even a bit faster, reaching | |||
| 5.7 GFlops for real ffts of size 1024. | |||
| iPad Air 2 with iOS9, xcode 8.0, arm64. The cpu is an Apple A8X, supposedly running at 1.5GHz. | |||
| | input len |real FFTPack| real vDSP | real PFFFT | |cplx FFTPack| cplx vDSP | cplx PFFFT | | |||
| |-----------+------------+------------+------------| |------------+------------+------------| | |||
| | 64 | 2517 | 7995 | 6086 | | 2725 | 13006 | 8495 | | |||
| | 96 | 2442 | n/a | 6691 | | 2256 | n/a | 7991 | | |||
| | 128 | 2664 | 10186 | 7877 | | 2575 | 15115 | 9115 | | |||
| | 160 | 2638 | n/a | 8283 | | 2682 | n/a | 8806 | | |||
| | 192 | 2903 | n/a | 9083 | | 2634 | n/a | 8980 | | |||
| | 256 | 3184 | 11452 | 10039 | | 3026 | 15410 | 10199 | | |||
| | 384 | 2665 | n/a | 10100 | | 2275 | n/a | 9247 | | |||
| | 480 | 2546 | n/a | 9863 | | 2341 | n/a | 8892 | | |||
| | 512 | 2832 | 12197 | 10989 | | 2547 | 16768 | 10154 | | |||
| | 640 | 2755 | n/a | 10461 | | 2569 | n/a | 9666 | | |||
| | 768 | 2998 | n/a | 11355 | | 2585 | n/a | 9813 | | |||
| | 800 | 2516 | n/a | 10332 | | 2433 | n/a | 9164 | | |||
| | 1024 | 3109 | 12965 | 12114 | | 2869 | 16448 | 10519 | | |||
| | 2048 | 3027 | 12996 | 12023 | | 2648 | 17304 | 10307 | | |||
| | 2400 | 2515 | n/a | 10372 | | 2355 | n/a | 8443 | | |||
| | 4096 | 3204 | 13603 | 12359 | | 2814 | 16570 | 9780 | | |||
| | 8192 | 2759 | 13422 | 10824 | | 2153 | 15652 | 7884 | | |||
| | 9216 | 2700 | n/a | 9938 | | 2241 | n/a | 7900 | | |||
| | 16384 | 2280 | 13057 | 7976 | | 593 | 4272 | 2534 | | |||
| | 32768 | 768 | 4269 | 2882 | | 606 | 4405 | 2604 | | |||
| | 262144 | 724 | 3527 | 2630 | | 534 | 2418 | 2157 | | |||
| | 1048576 | 674 | 1467 | 2135 | | 530 | 1621 | 2055 | | |||
| |-----------+------------+------------+------------| |------------+------------+------------| | |||
| I double-checked to make sure I did not make a mistake in the time | |||
| measurements, as the numbers are much higher than what I initially | |||
| expected. They are in fact higher than the number I get on the 2.8GHz | |||
| Xeon of my 2008 mac pro.. (except for FFT lengths >= 32768 where | |||
| having a big cache is useful). A good surprise is also that the perf | |||
| is not too far from apple's vDSP (at least for the real FFT). | |||
| @@ -1,799 +0,0 @@ | |||
| /* | |||
| Interface for the f2c translation of fftpack as found on http://www.netlib.org/fftpack/ | |||
| FFTPACK license: | |||
| http://www.cisl.ucar.edu/css/software/fftpack5/ftpk.html | |||
| Copyright (c) 2004 the University Corporation for Atmospheric | |||
| Research ("UCAR"). All rights reserved. Developed by NCAR's | |||
| Computational and Information Systems Laboratory, UCAR, | |||
| www.cisl.ucar.edu. | |||
| Redistribution and use of the Software in source and binary forms, | |||
| with or without modification, is permitted provided that the | |||
| following conditions are met: | |||
| - Neither the names of NCAR's Computational and Information Systems | |||
| Laboratory, the University Corporation for Atmospheric Research, | |||
| nor the names of its sponsors or contributors may be used to | |||
| endorse or promote products derived from this Software without | |||
| specific prior written permission. | |||
| - Redistributions of source code must retain the above copyright | |||
| notices, this list of conditions, and the disclaimer below. | |||
| - Redistributions in binary form must reproduce the above copyright | |||
| notice, this list of conditions, and the disclaimer below in the | |||
| documentation and/or other materials provided with the | |||
| distribution. | |||
| THIS SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, | |||
| EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO THE WARRANTIES OF | |||
| MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND | |||
| NONINFRINGEMENT. IN NO EVENT SHALL THE CONTRIBUTORS OR COPYRIGHT | |||
| HOLDERS BE LIABLE FOR ANY CLAIM, INDIRECT, INCIDENTAL, SPECIAL, | |||
| EXEMPLARY, OR CONSEQUENTIAL DAMAGES OR OTHER LIABILITY, WHETHER IN AN | |||
| ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN | |||
| CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS WITH THE | |||
| SOFTWARE. | |||
| ChangeLog: | |||
| 2011/10/02: this is my first release of this file. | |||
| */ | |||
| #ifndef FFTPACK_H | |||
| #define FFTPACK_H | |||
| #ifdef __cplusplus | |||
| extern "C" { | |||
| #endif | |||
| // just define FFTPACK_DOUBLE_PRECISION if you want to build it as a double precision fft | |||
| #ifndef FFTPACK_DOUBLE_PRECISION | |||
| typedef float fftpack_real; | |||
| typedef int fftpack_int; | |||
| #else | |||
| typedef double fftpack_real; | |||
| typedef int fftpack_int; | |||
| #endif | |||
| void cffti(fftpack_int n, fftpack_real *wsave); | |||
| void cfftf(fftpack_int n, fftpack_real *c, fftpack_real *wsave); | |||
| void cfftb(fftpack_int n, fftpack_real *c, fftpack_real *wsave); | |||
| void rffti(fftpack_int n, fftpack_real *wsave); | |||
| void rfftf(fftpack_int n, fftpack_real *r, fftpack_real *wsave); | |||
| void rfftb(fftpack_int n, fftpack_real *r, fftpack_real *wsave); | |||
| void cosqi(fftpack_int n, fftpack_real *wsave); | |||
| void cosqf(fftpack_int n, fftpack_real *x, fftpack_real *wsave); | |||
| void cosqb(fftpack_int n, fftpack_real *x, fftpack_real *wsave); | |||
| void costi(fftpack_int n, fftpack_real *wsave); | |||
| void cost(fftpack_int n, fftpack_real *x, fftpack_real *wsave); | |||
| void sinqi(fftpack_int n, fftpack_real *wsave); | |||
| void sinqb(fftpack_int n, fftpack_real *x, fftpack_real *wsave); | |||
| void sinqf(fftpack_int n, fftpack_real *x, fftpack_real *wsave); | |||
| void sinti(fftpack_int n, fftpack_real *wsave); | |||
| void sint(fftpack_int n, fftpack_real *x, fftpack_real *wsave); | |||
| #ifdef __cplusplus | |||
| } | |||
| #endif | |||
| #endif /* FFTPACK_H */ | |||
| /* | |||
| FFTPACK | |||
| * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * | |||
| version 4 april 1985 | |||
| a package of fortran subprograms for the fast fourier | |||
| transform of periodic and other symmetric sequences | |||
| by | |||
| paul n swarztrauber | |||
| national center for atmospheric research boulder,colorado 80307 | |||
| which is sponsored by the national science foundation | |||
| * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * | |||
| this package consists of programs which perform fast fourier | |||
| transforms for both complex and real periodic sequences and | |||
| certain other symmetric sequences that are listed below. | |||
| 1. rffti initialize rfftf and rfftb | |||
| 2. rfftf forward transform of a real periodic sequence | |||
| 3. rfftb backward transform of a real coefficient array | |||
| 4. ezffti initialize ezfftf and ezfftb | |||
| 5. ezfftf a simplified real periodic forward transform | |||
| 6. ezfftb a simplified real periodic backward transform | |||
| 7. sinti initialize sint | |||
| 8. sint sine transform of a real odd sequence | |||
| 9. costi initialize cost | |||
| 10. cost cosine transform of a real even sequence | |||
| 11. sinqi initialize sinqf and sinqb | |||
| 12. sinqf forward sine transform with odd wave numbers | |||
| 13. sinqb unnormalized inverse of sinqf | |||
| 14. cosqi initialize cosqf and cosqb | |||
| 15. cosqf forward cosine transform with odd wave numbers | |||
| 16. cosqb unnormalized inverse of cosqf | |||
| 17. cffti initialize cfftf and cfftb | |||
| 18. cfftf forward transform of a complex periodic sequence | |||
| 19. cfftb unnormalized inverse of cfftf | |||
| ****************************************************************** | |||
| subroutine rffti(n,wsave) | |||
| **************************************************************** | |||
| subroutine rffti initializes the array wsave which is used in | |||
| both rfftf and rfftb. the prime factorization of n together with | |||
| a tabulation of the trigonometric functions are computed and | |||
| stored in wsave. | |||
| input parameter | |||
| n the length of the sequence to be transformed. | |||
| output parameter | |||
| wsave a work array which must be dimensioned at least 2*n+15. | |||
| the same work array can be used for both rfftf and rfftb | |||
| as long as n remains unchanged. different wsave arrays | |||
| are required for different values of n. the contents of | |||
| wsave must not be changed between calls of rfftf or rfftb. | |||
| ****************************************************************** | |||
| subroutine rfftf(n,r,wsave) | |||
| ****************************************************************** | |||
| subroutine rfftf computes the fourier coefficients of a real | |||
| perodic sequence (fourier analysis). the transform is defined | |||
| below at output parameter r. | |||
| input parameters | |||
| n the length of the array r to be transformed. the method | |||
| is most efficient when n is a product of small primes. | |||
| n may change so long as different work arrays are provided | |||
| r a real array of length n which contains the sequence | |||
| to be transformed | |||
| wsave a work array which must be dimensioned at least 2*n+15. | |||
| in the program that calls rfftf. the wsave array must be | |||
| initialized by calling subroutine rffti(n,wsave) and a | |||
| different wsave array must be used for each different | |||
| value of n. this initialization does not have to be | |||
| repeated so long as n remains unchanged thus subsequent | |||
| transforms can be obtained faster than the first. | |||
| the same wsave array can be used by rfftf and rfftb. | |||
| output parameters | |||
| r r(1) = the sum from i=1 to i=n of r(i) | |||
| if n is even set l =n/2 , if n is odd set l = (n+1)/2 | |||
| then for k = 2,...,l | |||
| r(2*k-2) = the sum from i = 1 to i = n of | |||
| r(i)*cos((k-1)*(i-1)*2*pi/n) | |||
| r(2*k-1) = the sum from i = 1 to i = n of | |||
| -r(i)*sin((k-1)*(i-1)*2*pi/n) | |||
| if n is even | |||
| r(n) = the sum from i = 1 to i = n of | |||
| (-1)**(i-1)*r(i) | |||
| ***** note | |||
| this transform is unnormalized since a call of rfftf | |||
| followed by a call of rfftb will multiply the input | |||
| sequence by n. | |||
| wsave contains results which must not be destroyed between | |||
| calls of rfftf or rfftb. | |||
| ****************************************************************** | |||
| subroutine rfftb(n,r,wsave) | |||
| ****************************************************************** | |||
| subroutine rfftb computes the real perodic sequence from its | |||
| fourier coefficients (fourier synthesis). the transform is defined | |||
| below at output parameter r. | |||
| input parameters | |||
| n the length of the array r to be transformed. the method | |||
| is most efficient when n is a product of small primes. | |||
| n may change so long as different work arrays are provided | |||
| r a real array of length n which contains the sequence | |||
| to be transformed | |||
| wsave a work array which must be dimensioned at least 2*n+15. | |||
| in the program that calls rfftb. the wsave array must be | |||
| initialized by calling subroutine rffti(n,wsave) and a | |||
| different wsave array must be used for each different | |||
| value of n. this initialization does not have to be | |||
| repeated so long as n remains unchanged thus subsequent | |||
| transforms can be obtained faster than the first. | |||
| the same wsave array can be used by rfftf and rfftb. | |||
| output parameters | |||
| r for n even and for i = 1,...,n | |||
| r(i) = r(1)+(-1)**(i-1)*r(n) | |||
| plus the sum from k=2 to k=n/2 of | |||
| 2.*r(2*k-2)*cos((k-1)*(i-1)*2*pi/n) | |||
| -2.*r(2*k-1)*sin((k-1)*(i-1)*2*pi/n) | |||
| for n odd and for i = 1,...,n | |||
| r(i) = r(1) plus the sum from k=2 to k=(n+1)/2 of | |||
| 2.*r(2*k-2)*cos((k-1)*(i-1)*2*pi/n) | |||
| -2.*r(2*k-1)*sin((k-1)*(i-1)*2*pi/n) | |||
| ***** note | |||
| this transform is unnormalized since a call of rfftf | |||
| followed by a call of rfftb will multiply the input | |||
| sequence by n. | |||
| wsave contains results which must not be destroyed between | |||
| calls of rfftb or rfftf. | |||
| ****************************************************************** | |||
| subroutine sinti(n,wsave) | |||
| ****************************************************************** | |||
| subroutine sinti initializes the array wsave which is used in | |||
| subroutine sint. the prime factorization of n together with | |||
| a tabulation of the trigonometric functions are computed and | |||
| stored in wsave. | |||
| input parameter | |||
| n the length of the sequence to be transformed. the method | |||
| is most efficient when n+1 is a product of small primes. | |||
| output parameter | |||
| wsave a work array with at least int(2.5*n+15) locations. | |||
| different wsave arrays are required for different values | |||
| of n. the contents of wsave must not be changed between | |||
| calls of sint. | |||
| ****************************************************************** | |||
| subroutine sint(n,x,wsave) | |||
| ****************************************************************** | |||
| subroutine sint computes the discrete fourier sine transform | |||
| of an odd sequence x(i). the transform is defined below at | |||
| output parameter x. | |||
| sint is the unnormalized inverse of itself since a call of sint | |||
| followed by another call of sint will multiply the input sequence | |||
| x by 2*(n+1). | |||
| the array wsave which is used by subroutine sint must be | |||
| initialized by calling subroutine sinti(n,wsave). | |||
| input parameters | |||
| n the length of the sequence to be transformed. the method | |||
| is most efficient when n+1 is the product of small primes. | |||
| x an array which contains the sequence to be transformed | |||
| wsave a work array with dimension at least int(2.5*n+15) | |||
| in the program that calls sint. the wsave array must be | |||
| initialized by calling subroutine sinti(n,wsave) and a | |||
| different wsave array must be used for each different | |||
| value of n. this initialization does not have to be | |||
| repeated so long as n remains unchanged thus subsequent | |||
| transforms can be obtained faster than the first. | |||
| output parameters | |||
| x for i=1,...,n | |||
| x(i)= the sum from k=1 to k=n | |||
| 2*x(k)*sin(k*i*pi/(n+1)) | |||
| a call of sint followed by another call of | |||
| sint will multiply the sequence x by 2*(n+1). | |||
| hence sint is the unnormalized inverse | |||
| of itself. | |||
| wsave contains initialization calculations which must not be | |||
| destroyed between calls of sint. | |||
| ****************************************************************** | |||
| subroutine costi(n,wsave) | |||
| ****************************************************************** | |||
| subroutine costi initializes the array wsave which is used in | |||
| subroutine cost. the prime factorization of n together with | |||
| a tabulation of the trigonometric functions are computed and | |||
| stored in wsave. | |||
| input parameter | |||
| n the length of the sequence to be transformed. the method | |||
| is most efficient when n-1 is a product of small primes. | |||
| output parameter | |||
| wsave a work array which must be dimensioned at least 3*n+15. | |||
| different wsave arrays are required for different values | |||
| of n. the contents of wsave must not be changed between | |||
| calls of cost. | |||
| ****************************************************************** | |||
| subroutine cost(n,x,wsave) | |||
| ****************************************************************** | |||
| subroutine cost computes the discrete fourier cosine transform | |||
| of an even sequence x(i). the transform is defined below at output | |||
| parameter x. | |||
| cost is the unnormalized inverse of itself since a call of cost | |||
| followed by another call of cost will multiply the input sequence | |||
| x by 2*(n-1). the transform is defined below at output parameter x | |||
| the array wsave which is used by subroutine cost must be | |||
| initialized by calling subroutine costi(n,wsave). | |||
| input parameters | |||
| n the length of the sequence x. n must be greater than 1. | |||
| the method is most efficient when n-1 is a product of | |||
| small primes. | |||
| x an array which contains the sequence to be transformed | |||
| wsave a work array which must be dimensioned at least 3*n+15 | |||
| in the program that calls cost. the wsave array must be | |||
| initialized by calling subroutine costi(n,wsave) and a | |||
| different wsave array must be used for each different | |||
| value of n. this initialization does not have to be | |||
| repeated so long as n remains unchanged thus subsequent | |||
| transforms can be obtained faster than the first. | |||
| output parameters | |||
| x for i=1,...,n | |||
| x(i) = x(1)+(-1)**(i-1)*x(n) | |||
| + the sum from k=2 to k=n-1 | |||
| 2*x(k)*cos((k-1)*(i-1)*pi/(n-1)) | |||
| a call of cost followed by another call of | |||
| cost will multiply the sequence x by 2*(n-1) | |||
| hence cost is the unnormalized inverse | |||
| of itself. | |||
| wsave contains initialization calculations which must not be | |||
| destroyed between calls of cost. | |||
| ****************************************************************** | |||
| subroutine sinqi(n,wsave) | |||
| ****************************************************************** | |||
| subroutine sinqi initializes the array wsave which is used in | |||
| both sinqf and sinqb. the prime factorization of n together with | |||
| a tabulation of the trigonometric functions are computed and | |||
| stored in wsave. | |||
| input parameter | |||
| n the length of the sequence to be transformed. the method | |||
| is most efficient when n is a product of small primes. | |||
| output parameter | |||
| wsave a work array which must be dimensioned at least 3*n+15. | |||
| the same work array can be used for both sinqf and sinqb | |||
| as long as n remains unchanged. different wsave arrays | |||
| are required for different values of n. the contents of | |||
| wsave must not be changed between calls of sinqf or sinqb. | |||
| ****************************************************************** | |||
| subroutine sinqf(n,x,wsave) | |||
| ****************************************************************** | |||
| subroutine sinqf computes the fast fourier transform of quarter | |||
| wave data. that is , sinqf computes the coefficients in a sine | |||
| series representation with only odd wave numbers. the transform | |||
| is defined below at output parameter x. | |||
| sinqb is the unnormalized inverse of sinqf since a call of sinqf | |||
| followed by a call of sinqb will multiply the input sequence x | |||
| by 4*n. | |||
| the array wsave which is used by subroutine sinqf must be | |||
| initialized by calling subroutine sinqi(n,wsave). | |||
| input parameters | |||
| n the length of the array x to be transformed. the method | |||
| is most efficient when n is a product of small primes. | |||
| x an array which contains the sequence to be transformed | |||
| wsave a work array which must be dimensioned at least 3*n+15. | |||
| in the program that calls sinqf. the wsave array must be | |||
| initialized by calling subroutine sinqi(n,wsave) and a | |||
| different wsave array must be used for each different | |||
| value of n. this initialization does not have to be | |||
| repeated so long as n remains unchanged thus subsequent | |||
| transforms can be obtained faster than the first. | |||
| output parameters | |||
| x for i=1,...,n | |||
| x(i) = (-1)**(i-1)*x(n) | |||
| + the sum from k=1 to k=n-1 of | |||
| 2*x(k)*sin((2*i-1)*k*pi/(2*n)) | |||
| a call of sinqf followed by a call of | |||
| sinqb will multiply the sequence x by 4*n. | |||
| therefore sinqb is the unnormalized inverse | |||
| of sinqf. | |||
| wsave contains initialization calculations which must not | |||
| be destroyed between calls of sinqf or sinqb. | |||
| ****************************************************************** | |||
| subroutine sinqb(n,x,wsave) | |||
| ****************************************************************** | |||
| subroutine sinqb computes the fast fourier transform of quarter | |||
| wave data. that is , sinqb computes a sequence from its | |||
| representation in terms of a sine series with odd wave numbers. | |||
| the transform is defined below at output parameter x. | |||
| sinqf is the unnormalized inverse of sinqb since a call of sinqb | |||
| followed by a call of sinqf will multiply the input sequence x | |||
| by 4*n. | |||
| the array wsave which is used by subroutine sinqb must be | |||
| initialized by calling subroutine sinqi(n,wsave). | |||
| input parameters | |||
| n the length of the array x to be transformed. the method | |||
| is most efficient when n is a product of small primes. | |||
| x an array which contains the sequence to be transformed | |||
| wsave a work array which must be dimensioned at least 3*n+15. | |||
| in the program that calls sinqb. the wsave array must be | |||
| initialized by calling subroutine sinqi(n,wsave) and a | |||
| different wsave array must be used for each different | |||
| value of n. this initialization does not have to be | |||
| repeated so long as n remains unchanged thus subsequent | |||
| transforms can be obtained faster than the first. | |||
| output parameters | |||
| x for i=1,...,n | |||
| x(i)= the sum from k=1 to k=n of | |||
| 4*x(k)*sin((2k-1)*i*pi/(2*n)) | |||
| a call of sinqb followed by a call of | |||
| sinqf will multiply the sequence x by 4*n. | |||
| therefore sinqf is the unnormalized inverse | |||
| of sinqb. | |||
| wsave contains initialization calculations which must not | |||
| be destroyed between calls of sinqb or sinqf. | |||
| ****************************************************************** | |||
| subroutine cosqi(n,wsave) | |||
| ****************************************************************** | |||
| subroutine cosqi initializes the array wsave which is used in | |||
| both cosqf and cosqb. the prime factorization of n together with | |||
| a tabulation of the trigonometric functions are computed and | |||
| stored in wsave. | |||
| input parameter | |||
| n the length of the array to be transformed. the method | |||
| is most efficient when n is a product of small primes. | |||
| output parameter | |||
| wsave a work array which must be dimensioned at least 3*n+15. | |||
| the same work array can be used for both cosqf and cosqb | |||
| as long as n remains unchanged. different wsave arrays | |||
| are required for different values of n. the contents of | |||
| wsave must not be changed between calls of cosqf or cosqb. | |||
| ****************************************************************** | |||
| subroutine cosqf(n,x,wsave) | |||
| ****************************************************************** | |||
| subroutine cosqf computes the fast fourier transform of quarter | |||
| wave data. that is , cosqf computes the coefficients in a cosine | |||
| series representation with only odd wave numbers. the transform | |||
| is defined below at output parameter x | |||
| cosqf is the unnormalized inverse of cosqb since a call of cosqf | |||
| followed by a call of cosqb will multiply the input sequence x | |||
| by 4*n. | |||
| the array wsave which is used by subroutine cosqf must be | |||
| initialized by calling subroutine cosqi(n,wsave). | |||
| input parameters | |||
| n the length of the array x to be transformed. the method | |||
| is most efficient when n is a product of small primes. | |||
| x an array which contains the sequence to be transformed | |||
| wsave a work array which must be dimensioned at least 3*n+15 | |||
| in the program that calls cosqf. the wsave array must be | |||
| initialized by calling subroutine cosqi(n,wsave) and a | |||
| different wsave array must be used for each different | |||
| value of n. this initialization does not have to be | |||
| repeated so long as n remains unchanged thus subsequent | |||
| transforms can be obtained faster than the first. | |||
| output parameters | |||
| x for i=1,...,n | |||
| x(i) = x(1) plus the sum from k=2 to k=n of | |||
| 2*x(k)*cos((2*i-1)*(k-1)*pi/(2*n)) | |||
| a call of cosqf followed by a call of | |||
| cosqb will multiply the sequence x by 4*n. | |||
| therefore cosqb is the unnormalized inverse | |||
| of cosqf. | |||
| wsave contains initialization calculations which must not | |||
| be destroyed between calls of cosqf or cosqb. | |||
| ****************************************************************** | |||
| subroutine cosqb(n,x,wsave) | |||
| ****************************************************************** | |||
| subroutine cosqb computes the fast fourier transform of quarter | |||
| wave data. that is , cosqb computes a sequence from its | |||
| representation in terms of a cosine series with odd wave numbers. | |||
| the transform is defined below at output parameter x. | |||
| cosqb is the unnormalized inverse of cosqf since a call of cosqb | |||
| followed by a call of cosqf will multiply the input sequence x | |||
| by 4*n. | |||
| the array wsave which is used by subroutine cosqb must be | |||
| initialized by calling subroutine cosqi(n,wsave). | |||
| input parameters | |||
| n the length of the array x to be transformed. the method | |||
| is most efficient when n is a product of small primes. | |||
| x an array which contains the sequence to be transformed | |||
| wsave a work array that must be dimensioned at least 3*n+15 | |||
| in the program that calls cosqb. the wsave array must be | |||
| initialized by calling subroutine cosqi(n,wsave) and a | |||
| different wsave array must be used for each different | |||
| value of n. this initialization does not have to be | |||
| repeated so long as n remains unchanged thus subsequent | |||
| transforms can be obtained faster than the first. | |||
| output parameters | |||
| x for i=1,...,n | |||
| x(i)= the sum from k=1 to k=n of | |||
| 4*x(k)*cos((2*k-1)*(i-1)*pi/(2*n)) | |||
| a call of cosqb followed by a call of | |||
| cosqf will multiply the sequence x by 4*n. | |||
| therefore cosqf is the unnormalized inverse | |||
| of cosqb. | |||
| wsave contains initialization calculations which must not | |||
| be destroyed between calls of cosqb or cosqf. | |||
| ****************************************************************** | |||
| subroutine cffti(n,wsave) | |||
| ****************************************************************** | |||
| subroutine cffti initializes the array wsave which is used in | |||
| both cfftf and cfftb. the prime factorization of n together with | |||
| a tabulation of the trigonometric functions are computed and | |||
| stored in wsave. | |||
| input parameter | |||
| n the length of the sequence to be transformed | |||
| output parameter | |||
| wsave a work array which must be dimensioned at least 4*n+15 | |||
| the same work array can be used for both cfftf and cfftb | |||
| as long as n remains unchanged. different wsave arrays | |||
| are required for different values of n. the contents of | |||
| wsave must not be changed between calls of cfftf or cfftb. | |||
| ****************************************************************** | |||
| subroutine cfftf(n,c,wsave) | |||
| ****************************************************************** | |||
| subroutine cfftf computes the forward complex discrete fourier | |||
| transform (the fourier analysis). equivalently , cfftf computes | |||
| the fourier coefficients of a complex periodic sequence. | |||
| the transform is defined below at output parameter c. | |||
| the transform is not normalized. to obtain a normalized transform | |||
| the output must be divided by n. otherwise a call of cfftf | |||
| followed by a call of cfftb will multiply the sequence by n. | |||
| the array wsave which is used by subroutine cfftf must be | |||
| initialized by calling subroutine cffti(n,wsave). | |||
| input parameters | |||
| n the length of the complex sequence c. the method is | |||
| more efficient when n is the product of small primes. n | |||
| c a complex array of length n which contains the sequence | |||
| wsave a real work array which must be dimensioned at least 4n+15 | |||
| in the program that calls cfftf. the wsave array must be | |||
| initialized by calling subroutine cffti(n,wsave) and a | |||
| different wsave array must be used for each different | |||
| value of n. this initialization does not have to be | |||
| repeated so long as n remains unchanged thus subsequent | |||
| transforms can be obtained faster than the first. | |||
| the same wsave array can be used by cfftf and cfftb. | |||
| output parameters | |||
| c for j=1,...,n | |||
| c(j)=the sum from k=1,...,n of | |||
| c(k)*exp(-i*(j-1)*(k-1)*2*pi/n) | |||
| where i=sqrt(-1) | |||
| wsave contains initialization calculations which must not be | |||
| destroyed between calls of subroutine cfftf or cfftb | |||
| ****************************************************************** | |||
| subroutine cfftb(n,c,wsave) | |||
| ****************************************************************** | |||
| subroutine cfftb computes the backward complex discrete fourier | |||
| transform (the fourier synthesis). equivalently , cfftb computes | |||
| a complex periodic sequence from its fourier coefficients. | |||
| the transform is defined below at output parameter c. | |||
| a call of cfftf followed by a call of cfftb will multiply the | |||
| sequence by n. | |||
| the array wsave which is used by subroutine cfftb must be | |||
| initialized by calling subroutine cffti(n,wsave). | |||
| input parameters | |||
| n the length of the complex sequence c. the method is | |||
| more efficient when n is the product of small primes. | |||
| c a complex array of length n which contains the sequence | |||
| wsave a real work array which must be dimensioned at least 4n+15 | |||
| in the program that calls cfftb. the wsave array must be | |||
| initialized by calling subroutine cffti(n,wsave) and a | |||
| different wsave array must be used for each different | |||
| value of n. this initialization does not have to be | |||
| repeated so long as n remains unchanged thus subsequent | |||
| transforms can be obtained faster than the first. | |||
| the same wsave array can be used by cfftf and cfftb. | |||
| output parameters | |||
| c for j=1,...,n | |||
| c(j)=the sum from k=1,...,n of | |||
| c(k)*exp(i*(j-1)*(k-1)*2*pi/n) | |||
| where i=sqrt(-1) | |||
| wsave contains initialization calculations which must not be | |||
| destroyed between calls of subroutine cfftf or cfftb | |||
| */ | |||
| @@ -1,177 +0,0 @@ | |||
| /* Copyright (c) 2013 Julien Pommier ( pommier@modartt.com ) | |||
| Based on original fortran 77 code from FFTPACKv4 from NETLIB, | |||
| authored by Dr Paul Swarztrauber of NCAR, in 1985. | |||
| As confirmed by the NCAR fftpack software curators, the following | |||
| FFTPACKv5 license applies to FFTPACKv4 sources. My changes are | |||
| released under the same terms. | |||
| FFTPACK license: | |||
| http://www.cisl.ucar.edu/css/software/fftpack5/ftpk.html | |||
| Copyright (c) 2004 the University Corporation for Atmospheric | |||
| Research ("UCAR"). All rights reserved. Developed by NCAR's | |||
| Computational and Information Systems Laboratory, UCAR, | |||
| www.cisl.ucar.edu. | |||
| Redistribution and use of the Software in source and binary forms, | |||
| with or without modification, is permitted provided that the | |||
| following conditions are met: | |||
| - Neither the names of NCAR's Computational and Information Systems | |||
| Laboratory, the University Corporation for Atmospheric Research, | |||
| nor the names of its sponsors or contributors may be used to | |||
| endorse or promote products derived from this Software without | |||
| specific prior written permission. | |||
| - Redistributions of source code must retain the above copyright | |||
| notices, this list of conditions, and the disclaimer below. | |||
| - Redistributions in binary form must reproduce the above copyright | |||
| notice, this list of conditions, and the disclaimer below in the | |||
| documentation and/or other materials provided with the | |||
| distribution. | |||
| THIS SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, | |||
| EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO THE WARRANTIES OF | |||
| MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND | |||
| NONINFRINGEMENT. IN NO EVENT SHALL THE CONTRIBUTORS OR COPYRIGHT | |||
| HOLDERS BE LIABLE FOR ANY CLAIM, INDIRECT, INCIDENTAL, SPECIAL, | |||
| EXEMPLARY, OR CONSEQUENTIAL DAMAGES OR OTHER LIABILITY, WHETHER IN AN | |||
| ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN | |||
| CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS WITH THE | |||
| SOFTWARE. | |||
| */ | |||
| /* | |||
| PFFFT : a Pretty Fast FFT. | |||
| This is basically an adaptation of the single precision fftpack | |||
| (v4) as found on netlib taking advantage of SIMD instruction found | |||
| on cpus such as intel x86 (SSE1), powerpc (Altivec), and arm (NEON). | |||
| For architectures where no SIMD instruction is available, the code | |||
| falls back to a scalar version. | |||
| Restrictions: | |||
| - 1D transforms only, with 32-bit single precision. | |||
| - supports only transforms for inputs of length N of the form | |||
| N=(2^a)*(3^b)*(5^c), a >= 5, b >=0, c >= 0 (32, 48, 64, 96, 128, | |||
| 144, 160, etc are all acceptable lengths). Performance is best for | |||
| 128<=N<=8192. | |||
| - all (float*) pointers in the functions below are expected to | |||
| have an "simd-compatible" alignment, that is 16 bytes on x86 and | |||
| powerpc CPUs. | |||
| You can allocate such buffers with the functions | |||
| pffft_aligned_malloc / pffft_aligned_free (or with stuff like | |||
| posix_memalign..) | |||
| */ | |||
| #ifndef PFFFT_H | |||
| #define PFFFT_H | |||
| #include <stddef.h> // for size_t | |||
| #ifdef __cplusplus | |||
| extern "C" { | |||
| #endif | |||
| /* opaque struct holding internal stuff (precomputed twiddle factors) | |||
| this struct can be shared by many threads as it contains only | |||
| read-only data. | |||
| */ | |||
| typedef struct PFFFT_Setup PFFFT_Setup; | |||
| /* direction of the transform */ | |||
| typedef enum { PFFFT_FORWARD, PFFFT_BACKWARD } pffft_direction_t; | |||
| /* type of transform */ | |||
| typedef enum { PFFFT_REAL, PFFFT_COMPLEX } pffft_transform_t; | |||
| /* | |||
| prepare for performing transforms of size N -- the returned | |||
| PFFFT_Setup structure is read-only so it can safely be shared by | |||
| multiple concurrent threads. | |||
| */ | |||
| PFFFT_Setup *pffft_new_setup(int N, pffft_transform_t transform); | |||
| void pffft_destroy_setup(PFFFT_Setup *); | |||
| /* | |||
| Perform a Fourier transform , The z-domain data is stored in the | |||
| most efficient order for transforming it back, or using it for | |||
| convolution. If you need to have its content sorted in the | |||
| "usual" way, that is as an array of interleaved complex numbers, | |||
| either use pffft_transform_ordered , or call pffft_zreorder after | |||
| the forward fft, and before the backward fft. | |||
| Transforms are not scaled: PFFFT_BACKWARD(PFFFT_FORWARD(x)) = N*x. | |||
| Typically you will want to scale the backward transform by 1/N. | |||
| The 'work' pointer should point to an area of N (2*N for complex | |||
| fft) floats, properly aligned. If 'work' is NULL, then stack will | |||
| be used instead (this is probably the best strategy for small | |||
| FFTs, say for N < 16384). | |||
| input and output may alias. | |||
| */ | |||
| void pffft_transform(PFFFT_Setup *setup, const float *input, float *output, float *work, pffft_direction_t direction); | |||
| /* | |||
| Similar to pffft_transform, but makes sure that the output is | |||
| ordered as expected (interleaved complex numbers). This is | |||
| similar to calling pffft_transform and then pffft_zreorder. | |||
| input and output may alias. | |||
| */ | |||
| void pffft_transform_ordered(PFFFT_Setup *setup, const float *input, float *output, float *work, pffft_direction_t direction); | |||
| /* | |||
| call pffft_zreorder(.., PFFFT_FORWARD) after pffft_transform(..., | |||
| PFFFT_FORWARD) if you want to have the frequency components in | |||
| the correct "canonical" order, as interleaved complex numbers. | |||
| (for real transforms, both 0-frequency and half frequency | |||
| components, which are real, are assembled in the first entry as | |||
| F(0)+i*F(n/2+1). Note that the original fftpack did place | |||
| F(n/2+1) at the end of the arrays). | |||
| input and output should not alias. | |||
| */ | |||
| void pffft_zreorder(PFFFT_Setup *setup, const float *input, float *output, pffft_direction_t direction); | |||
| /* | |||
| Perform a multiplication of the frequency components of dft_a and | |||
| dft_b and accumulate them into dft_ab. The arrays should have | |||
| been obtained with pffft_transform(.., PFFFT_FORWARD) and should | |||
| *not* have been reordered with pffft_zreorder (otherwise just | |||
| perform the operation yourself as the dft coefs are stored as | |||
| interleaved complex numbers). | |||
| the operation performed is: dft_ab += (dft_a * fdt_b)*scaling | |||
| The dft_a, dft_b and dft_ab pointers may alias. | |||
| */ | |||
| void pffft_zconvolve_accumulate(PFFFT_Setup *setup, const float *dft_a, const float *dft_b, float *dft_ab, float scaling); | |||
| /* | |||
| the float buffers must have the correct alignment (16-byte boundary | |||
| on intel and powerpc). This function may be used to obtain such | |||
| correctly aligned buffers. | |||
| */ | |||
| void *pffft_aligned_malloc(size_t nb_bytes); | |||
| void pffft_aligned_free(void *); | |||
| /* return 4 or 1 wether support SSE/Altivec instructions was enable when building pffft.c */ | |||
| int pffft_simd_size(); | |||
| #ifdef __cplusplus | |||
| } | |||
| #endif | |||
| #endif // PFFFT_H | |||
| @@ -1,419 +0,0 @@ | |||
| /* | |||
| Copyright (c) 2013 Julien Pommier. | |||
| Small test & bench for PFFFT, comparing its performance with the scalar FFTPACK, FFTW, and Apple vDSP | |||
| How to build: | |||
| on linux, with fftw3: | |||
| gcc -o test_pffft -DHAVE_FFTW -msse -mfpmath=sse -O3 -Wall -W pffft.c test_pffft.c fftpack.c -L/usr/local/lib -I/usr/local/include/ -lfftw3f -lm | |||
| on macos, without fftw3: | |||
| clang -o test_pffft -DHAVE_VECLIB -O3 -Wall -W pffft.c test_pffft.c fftpack.c -L/usr/local/lib -I/usr/local/include/ -framework Accelerate | |||
| on macos, with fftw3: | |||
| clang -o test_pffft -DHAVE_FFTW -DHAVE_VECLIB -O3 -Wall -W pffft.c test_pffft.c fftpack.c -L/usr/local/lib -I/usr/local/include/ -lfftw3f -framework Accelerate | |||
| on windows, with visual c++: | |||
| cl /Ox -D_USE_MATH_DEFINES /arch:SSE test_pffft.c pffft.c fftpack.c | |||
| build without SIMD instructions: | |||
| gcc -o test_pffft -DPFFFT_SIMD_DISABLE -O3 -Wall -W pffft.c test_pffft.c fftpack.c -lm | |||
| */ | |||
| #include "pffft.h" | |||
| #include "fftpack.h" | |||
| #include <math.h> | |||
| #include <stdio.h> | |||
| #include <stdlib.h> | |||
| #include <time.h> | |||
| #include <assert.h> | |||
| #include <string.h> | |||
| #ifdef HAVE_SYS_TIMES | |||
| # include <sys/times.h> | |||
| # include <unistd.h> | |||
| #endif | |||
| #ifdef HAVE_VECLIB | |||
| # include <Accelerate/Accelerate.h> | |||
| #endif | |||
| #ifdef HAVE_FFTW | |||
| # include <fftw3.h> | |||
| #endif | |||
| #define MAX(x,y) ((x)>(y)?(x):(y)) | |||
| double frand() { | |||
| return rand()/(double)RAND_MAX; | |||
| } | |||
| #if defined(HAVE_SYS_TIMES) | |||
| inline double uclock_sec(void) { | |||
| static double ttclk = 0.; | |||
| if (ttclk == 0.) ttclk = sysconf(_SC_CLK_TCK); | |||
| struct tms t; return ((double)times(&t)) / ttclk; | |||
| } | |||
| # else | |||
| double uclock_sec(void) | |||
| { return (double)clock()/(double)CLOCKS_PER_SEC; } | |||
| #endif | |||
| /* compare results with the regular fftpack */ | |||
| void pffft_validate_N(int N, int cplx) { | |||
| int Nfloat = N*(cplx?2:1); | |||
| int Nbytes = Nfloat * sizeof(float); | |||
| float *ref, *in, *out, *tmp, *tmp2; | |||
| PFFFT_Setup *s = pffft_new_setup(N, cplx ? PFFFT_COMPLEX : PFFFT_REAL); | |||
| int pass; | |||
| if (!s) { printf("Skipping N=%d, not supported\n", N); return; } | |||
| ref = pffft_aligned_malloc(Nbytes); | |||
| in = pffft_aligned_malloc(Nbytes); | |||
| out = pffft_aligned_malloc(Nbytes); | |||
| tmp = pffft_aligned_malloc(Nbytes); | |||
| tmp2 = pffft_aligned_malloc(Nbytes); | |||
| for (pass=0; pass < 2; ++pass) { | |||
| float ref_max = 0; | |||
| int k; | |||
| //printf("N=%d pass=%d cplx=%d\n", N, pass, cplx); | |||
| // compute reference solution with FFTPACK | |||
| if (pass == 0) { | |||
| float *wrk = malloc(2*Nbytes+15*sizeof(float)); | |||
| for (k=0; k < Nfloat; ++k) { | |||
| ref[k] = in[k] = frand()*2-1; | |||
| out[k] = 1e30; | |||
| } | |||
| if (!cplx) { | |||
| rffti(N, wrk); | |||
| rfftf(N, ref, wrk); | |||
| // use our ordering for real ffts instead of the one of fftpack | |||
| { | |||
| float refN=ref[N-1]; | |||
| for (k=N-2; k >= 1; --k) ref[k+1] = ref[k]; | |||
| ref[1] = refN; | |||
| } | |||
| } else { | |||
| cffti(N, wrk); | |||
| cfftf(N, ref, wrk); | |||
| } | |||
| free(wrk); | |||
| } | |||
| for (k = 0; k < Nfloat; ++k) ref_max = MAX(ref_max, fabs(ref[k])); | |||
| // pass 0 : non canonical ordering of transform coefficients | |||
| if (pass == 0) { | |||
| // test forward transform, with different input / output | |||
| pffft_transform(s, in, tmp, 0, PFFFT_FORWARD); | |||
| memcpy(tmp2, tmp, Nbytes); | |||
| memcpy(tmp, in, Nbytes); | |||
| pffft_transform(s, tmp, tmp, 0, PFFFT_FORWARD); | |||
| for (k = 0; k < Nfloat; ++k) { | |||
| assert(tmp2[k] == tmp[k]); | |||
| } | |||
| // test reordering | |||
| pffft_zreorder(s, tmp, out, PFFFT_FORWARD); | |||
| pffft_zreorder(s, out, tmp, PFFFT_BACKWARD); | |||
| for (k = 0; k < Nfloat; ++k) { | |||
| assert(tmp2[k] == tmp[k]); | |||
| } | |||
| pffft_zreorder(s, tmp, out, PFFFT_FORWARD); | |||
| } else { | |||
| // pass 1 : canonical ordering of transform coeffs. | |||
| pffft_transform_ordered(s, in, tmp, 0, PFFFT_FORWARD); | |||
| memcpy(tmp2, tmp, Nbytes); | |||
| memcpy(tmp, in, Nbytes); | |||
| pffft_transform_ordered(s, tmp, tmp, 0, PFFFT_FORWARD); | |||
| for (k = 0; k < Nfloat; ++k) { | |||
| assert(tmp2[k] == tmp[k]); | |||
| } | |||
| memcpy(out, tmp, Nbytes); | |||
| } | |||
| { | |||
| for (k=0; k < Nfloat; ++k) { | |||
| if (!(fabs(ref[k] - out[k]) < 1e-3*ref_max)) { | |||
| printf("%s forward PFFFT mismatch found for N=%d\n", (cplx?"CPLX":"REAL"), N); | |||
| exit(1); | |||
| } | |||
| } | |||
| if (pass == 0) pffft_transform(s, tmp, out, 0, PFFFT_BACKWARD); | |||
| else pffft_transform_ordered(s, tmp, out, 0, PFFFT_BACKWARD); | |||
| memcpy(tmp2, out, Nbytes); | |||
| memcpy(out, tmp, Nbytes); | |||
| if (pass == 0) pffft_transform(s, out, out, 0, PFFFT_BACKWARD); | |||
| else pffft_transform_ordered(s, out, out, 0, PFFFT_BACKWARD); | |||
| for (k = 0; k < Nfloat; ++k) { | |||
| assert(tmp2[k] == out[k]); | |||
| out[k] *= 1.f/N; | |||
| } | |||
| for (k = 0; k < Nfloat; ++k) { | |||
| if (fabs(in[k] - out[k]) > 1e-3 * ref_max) { | |||
| printf("pass=%d, %s IFFFT does not match for N=%d\n", pass, (cplx?"CPLX":"REAL"), N); break; | |||
| exit(1); | |||
| } | |||
| } | |||
| } | |||
| // quick test of the circular convolution in fft domain | |||
| { | |||
| float conv_err = 0, conv_max = 0; | |||
| pffft_zreorder(s, ref, tmp, PFFFT_FORWARD); | |||
| memset(out, 0, Nbytes); | |||
| pffft_zconvolve_accumulate(s, ref, ref, out, 1.0); | |||
| pffft_zreorder(s, out, tmp2, PFFFT_FORWARD); | |||
| for (k=0; k < Nfloat; k += 2) { | |||
| float ar = tmp[k], ai=tmp[k+1]; | |||
| if (cplx || k > 0) { | |||
| tmp[k] = ar*ar - ai*ai; | |||
| tmp[k+1] = 2*ar*ai; | |||
| } else { | |||
| tmp[0] = ar*ar; | |||
| tmp[1] = ai*ai; | |||
| } | |||
| } | |||
| for (k=0; k < Nfloat; ++k) { | |||
| float d = fabs(tmp[k] - tmp2[k]), e = fabs(tmp[k]); | |||
| if (d > conv_err) conv_err = d; | |||
| if (e > conv_max) conv_max = e; | |||
| } | |||
| if (conv_err > 1e-5*conv_max) { | |||
| printf("zconvolve error ? %g %g\n", conv_err, conv_max); exit(1); | |||
| } | |||
| } | |||
| } | |||
| printf("%s PFFFT is OK for N=%d\n", (cplx?"CPLX":"REAL"), N); fflush(stdout); | |||
| pffft_destroy_setup(s); | |||
| pffft_aligned_free(ref); | |||
| pffft_aligned_free(in); | |||
| pffft_aligned_free(out); | |||
| pffft_aligned_free(tmp); | |||
| pffft_aligned_free(tmp2); | |||
| } | |||
| void pffft_validate(int cplx) { | |||
| static int Ntest[] = { 16, 32, 64, 96, 128, 160, 192, 256, 288, 384, 5*96, 512, 576, 5*128, 800, 864, 1024, 2048, 2592, 4000, 4096, 12000, 36864, 0}; | |||
| int k; | |||
| for (k = 0; Ntest[k]; ++k) { | |||
| int N = Ntest[k]; | |||
| if (N == 16 && !cplx) continue; | |||
| pffft_validate_N(N, cplx); | |||
| } | |||
| } | |||
| int array_output_format = 0; | |||
| void show_output(const char *name, int N, int cplx, float flops, float t0, float t1, int max_iter) { | |||
| float mflops = flops/1e6/(t1 - t0 + 1e-16); | |||
| if (array_output_format) { | |||
| if (flops != -1) { | |||
| printf("|%9.0f ", mflops); | |||
| } else printf("| n/a "); | |||
| } else { | |||
| if (flops != -1) { | |||
| printf("N=%5d, %s %16s : %6.0f MFlops [t=%6.0f ns, %d runs]\n", N, (cplx?"CPLX":"REAL"), name, mflops, (t1-t0)/2/max_iter * 1e9, max_iter); | |||
| } | |||
| } | |||
| fflush(stdout); | |||
| } | |||
| void benchmark_ffts(int N, int cplx) { | |||
| int Nfloat = (cplx ? N*2 : N); | |||
| int Nbytes = Nfloat * sizeof(float); | |||
| float *X = pffft_aligned_malloc(Nbytes), *Y = pffft_aligned_malloc(Nbytes), *Z = pffft_aligned_malloc(Nbytes); | |||
| double t0, t1, flops; | |||
| int k; | |||
| int max_iter = 5120000/N*4; | |||
| #ifdef __arm__ | |||
| max_iter /= 4; | |||
| #endif | |||
| int iter; | |||
| for (k = 0; k < Nfloat; ++k) { | |||
| X[k] = 0; //sqrtf(k+1); | |||
| } | |||
| // FFTPack benchmark | |||
| { | |||
| float *wrk = malloc(2*Nbytes + 15*sizeof(float)); | |||
| int max_iter_ = max_iter/pffft_simd_size(); if (max_iter_ == 0) max_iter_ = 1; | |||
| if (cplx) cffti(N, wrk); | |||
| else rffti(N, wrk); | |||
| t0 = uclock_sec(); | |||
| for (iter = 0; iter < max_iter_; ++iter) { | |||
| if (cplx) { | |||
| cfftf(N, X, wrk); | |||
| cfftb(N, X, wrk); | |||
| } else { | |||
| rfftf(N, X, wrk); | |||
| rfftb(N, X, wrk); | |||
| } | |||
| } | |||
| t1 = uclock_sec(); | |||
| free(wrk); | |||
| flops = (max_iter_*2) * ((cplx ? 5 : 2.5)*N*log((double)N)/M_LN2); // see http://www.fftw.org/speed/method.html | |||
| show_output("FFTPack", N, cplx, flops, t0, t1, max_iter_); | |||
| } | |||
| #ifdef HAVE_VECLIB | |||
| int log2N = (int)(log(N)/log(2) + 0.5f); | |||
| if (N == (1<<log2N)) { | |||
| FFTSetup setup; | |||
| setup = vDSP_create_fftsetup(log2N, FFT_RADIX2); | |||
| DSPSplitComplex zsamples; | |||
| zsamples.realp = &X[0]; | |||
| zsamples.imagp = &X[Nfloat/2]; | |||
| t0 = uclock_sec(); | |||
| for (iter = 0; iter < max_iter; ++iter) { | |||
| if (cplx) { | |||
| vDSP_fft_zip(setup, &zsamples, 1, log2N, kFFTDirection_Forward); | |||
| vDSP_fft_zip(setup, &zsamples, 1, log2N, kFFTDirection_Inverse); | |||
| } else { | |||
| vDSP_fft_zrip(setup, &zsamples, 1, log2N, kFFTDirection_Forward); | |||
| vDSP_fft_zrip(setup, &zsamples, 1, log2N, kFFTDirection_Inverse); | |||
| } | |||
| } | |||
| t1 = uclock_sec(); | |||
| vDSP_destroy_fftsetup(setup); | |||
| flops = (max_iter*2) * ((cplx ? 5 : 2.5)*N*log((double)N)/M_LN2); // see http://www.fftw.org/speed/method.html | |||
| show_output("vDSP", N, cplx, flops, t0, t1, max_iter); | |||
| } else { | |||
| show_output("vDSP", N, cplx, -1, -1, -1, -1); | |||
| } | |||
| #endif | |||
| #ifdef HAVE_FFTW | |||
| { | |||
| fftwf_plan planf, planb; | |||
| fftw_complex *in = (fftw_complex*) fftwf_malloc(sizeof(fftw_complex) * N); | |||
| fftw_complex *out = (fftw_complex*) fftwf_malloc(sizeof(fftw_complex) * N); | |||
| memset(in, 0, sizeof(fftw_complex) * N); | |||
| int flags = (N < 40000 ? FFTW_MEASURE : FFTW_ESTIMATE); // measure takes a lot of time on largest ffts | |||
| //int flags = FFTW_ESTIMATE; | |||
| if (cplx) { | |||
| planf = fftwf_plan_dft_1d(N, (fftwf_complex*)in, (fftwf_complex*)out, FFTW_FORWARD, flags); | |||
| planb = fftwf_plan_dft_1d(N, (fftwf_complex*)in, (fftwf_complex*)out, FFTW_BACKWARD, flags); | |||
| } else { | |||
| planf = fftwf_plan_dft_r2c_1d(N, (float*)in, (fftwf_complex*)out, flags); | |||
| planb = fftwf_plan_dft_c2r_1d(N, (fftwf_complex*)in, (float*)out, flags); | |||
| } | |||
| t0 = uclock_sec(); | |||
| for (iter = 0; iter < max_iter; ++iter) { | |||
| fftwf_execute(planf); | |||
| fftwf_execute(planb); | |||
| } | |||
| t1 = uclock_sec(); | |||
| fftwf_destroy_plan(planf); | |||
| fftwf_destroy_plan(planb); | |||
| fftwf_free(in); fftwf_free(out); | |||
| flops = (max_iter*2) * ((cplx ? 5 : 2.5)*N*log((double)N)/M_LN2); // see http://www.fftw.org/speed/method.html | |||
| show_output((flags == FFTW_MEASURE ? "FFTW (meas.)" : " FFTW (estim)"), N, cplx, flops, t0, t1, max_iter); | |||
| } | |||
| #endif | |||
| // PFFFT benchmark | |||
| { | |||
| PFFFT_Setup *s = pffft_new_setup(N, cplx ? PFFFT_COMPLEX : PFFFT_REAL); | |||
| if (s) { | |||
| t0 = uclock_sec(); | |||
| for (iter = 0; iter < max_iter; ++iter) { | |||
| pffft_transform(s, X, Z, Y, PFFFT_FORWARD); | |||
| pffft_transform(s, X, Z, Y, PFFFT_BACKWARD); | |||
| } | |||
| t1 = uclock_sec(); | |||
| pffft_destroy_setup(s); | |||
| flops = (max_iter*2) * ((cplx ? 5 : 2.5)*N*log((double)N)/M_LN2); // see http://www.fftw.org/speed/method.html | |||
| show_output("PFFFT", N, cplx, flops, t0, t1, max_iter); | |||
| } | |||
| } | |||
| if (!array_output_format) { | |||
| printf("--\n"); | |||
| } | |||
| pffft_aligned_free(X); | |||
| pffft_aligned_free(Y); | |||
| pffft_aligned_free(Z); | |||
| } | |||
| #ifndef PFFFT_SIMD_DISABLE | |||
| void validate_pffft_simd(); // a small function inside pffft.c that will detect compiler bugs with respect to simd instruction | |||
| #endif | |||
| int main(int argc, char **argv) { | |||
| int Nvalues[] = { 64, 96, 128, 160, 192, 256, 384, 5*96, 512, 5*128, 3*256, 800, 1024, 2048, 2400, 4096, 8192, 9*1024, 16384, 32768, 256*1024, 1024*1024, -1 }; | |||
| int i; | |||
| if (argc > 1 && strcmp(argv[1], "--array-format") == 0) { | |||
| array_output_format = 1; | |||
| } | |||
| #ifndef PFFFT_SIMD_DISABLE | |||
| validate_pffft_simd(); | |||
| #endif | |||
| pffft_validate(1); | |||
| pffft_validate(0); | |||
| if (!array_output_format) { | |||
| for (i=0; Nvalues[i] > 0; ++i) { | |||
| benchmark_ffts(Nvalues[i], 0 /* real fft */); | |||
| } | |||
| for (i=0; Nvalues[i] > 0; ++i) { | |||
| benchmark_ffts(Nvalues[i], 1 /* cplx fft */); | |||
| } | |||
| } else { | |||
| printf("| input len "); | |||
| printf("|real FFTPack"); | |||
| #ifdef HAVE_VECLIB | |||
| printf("| real vDSP "); | |||
| #endif | |||
| #ifdef HAVE_FFTW | |||
| printf("| real FFTW "); | |||
| #endif | |||
| printf("| real PFFFT | "); | |||
| printf("|cplx FFTPack"); | |||
| #ifdef HAVE_VECLIB | |||
| printf("| cplx vDSP "); | |||
| #endif | |||
| #ifdef HAVE_FFTW | |||
| printf("| cplx FFTW "); | |||
| #endif | |||
| printf("| cplx PFFFT |\n"); | |||
| for (i=0; Nvalues[i] > 0; ++i) { | |||
| printf("|%9d ", Nvalues[i]); | |||
| benchmark_ffts(Nvalues[i], 0); | |||
| printf("| "); | |||
| benchmark_ffts(Nvalues[i], 1); | |||
| printf("|\n"); | |||
| } | |||
| printf(" (numbers are given in MFlops)\n"); | |||
| } | |||
| return 0; | |||
| } | |||
| @@ -143,11 +143,11 @@ void Rampage::step() { | |||
| outputs[RISING_A_OUTPUT + c].value = (rising ? 10.0 : 0.0); | |||
| outputs[FALLING_A_OUTPUT + c].value = (falling ? 10.0 : 0.0); | |||
| lights[RISING_A_LIGHT + c].value = (rising ? 1.0 : 0.0); | |||
| lights[FALLING_A_LIGHT + c].value = (falling ? 1.0 : 0.0); | |||
| lights[RISING_A_LIGHT + c].setBrightnessSmooth(rising ? 1.0 : 0.0); | |||
| lights[FALLING_A_LIGHT + c].setBrightnessSmooth(falling ? 1.0 : 0.0); | |||
| outputs[EOC_A_OUTPUT + c].value = (endOfCyclePulse[c].process(engineGetSampleTime()) ? 10.0 : 0.0); | |||
| outputs[OUT_A_OUTPUT + c].value = out[c]; | |||
| lights[OUT_A_LIGHT + c].value = out[c] / 10.0; | |||
| lights[OUT_A_LIGHT + c].setBrightnessSmooth(out[c] / 10.0); | |||
| } | |||
| // Logic | |||
| @@ -162,9 +162,9 @@ void Rampage::step() { | |||
| outputs[MIN_OUTPUT].value = fminf(a, b); | |||
| outputs[MAX_OUTPUT].value = fmaxf(a, b); | |||
| // Lights | |||
| lights[COMPARATOR_LIGHT].value = outputs[COMPARATOR_OUTPUT].value / 10.0; | |||
| lights[MIN_LIGHT].value = outputs[MIN_OUTPUT].value / 10.0; | |||
| lights[MAX_LIGHT].value = outputs[MAX_OUTPUT].value / 10.0; | |||
| lights[COMPARATOR_LIGHT].setBrightnessSmooth(outputs[COMPARATOR_OUTPUT].value / 10.0); | |||
| lights[MIN_LIGHT].setBrightnessSmooth(outputs[MIN_OUTPUT].value / 10.0); | |||
| lights[MAX_LIGHT].setBrightnessSmooth(outputs[MAX_OUTPUT].value / 10.0); | |||
| } | |||
| @@ -4,6 +4,7 @@ | |||
| #include "dsp/samplerate.hpp" | |||
| #include "dsp/ringbuffer.hpp" | |||
| #include "dsp/filter.hpp" | |||
| #include "dsp/fir.hpp" | |||
| #include "pffft.h" | |||
| @@ -27,115 +28,6 @@ void springReverbInit() { | |||
| } | |||
| struct RealTimeConvolver { | |||
| // `kernelBlocks` number of contiguous FFT blocks of size `blockSize` | |||
| // indexed by [i * blockSize*2 + j] | |||
| float *kernelFfts = NULL; | |||
| float *inputFfts = NULL; | |||
| float *outputTail = NULL; | |||
| float *tmpBlock = NULL; | |||
| size_t blockSize; | |||
| size_t kernelBlocks = 0; | |||
| size_t inputPos = 0; | |||
| // kiss_fftr_cfg fft_cfg; | |||
| // kiss_fftr_cfg ifft_cfg; | |||
| PFFFT_Setup *pffft; | |||
| /** blocksize should be >=32 and a power of 2 */ | |||
| RealTimeConvolver(size_t blockSize) { | |||
| this->blockSize = blockSize; | |||
| pffft = pffft_new_setup(blockSize*2, PFFFT_REAL); | |||
| outputTail = new float[blockSize](); | |||
| tmpBlock = new float[blockSize*2](); | |||
| } | |||
| ~RealTimeConvolver() { | |||
| clear(); | |||
| delete[] outputTail; | |||
| delete[] tmpBlock; | |||
| pffft_destroy_setup(pffft); | |||
| } | |||
| void clear() { | |||
| if (kernelFfts) { | |||
| pffft_aligned_free(kernelFfts); | |||
| kernelFfts = NULL; | |||
| } | |||
| if (inputFfts) { | |||
| pffft_aligned_free(inputFfts); | |||
| inputFfts = NULL; | |||
| } | |||
| kernelBlocks = 0; | |||
| inputPos = 0; | |||
| } | |||
| void setKernel(const float *kernel, size_t length) { | |||
| clear(); | |||
| if (!kernel || length == 0) | |||
| return; | |||
| // Round up to the nearest factor blockSize | |||
| kernelBlocks = (length - 1) / blockSize + 1; | |||
| // Allocate blocks | |||
| kernelFfts = (float*) pffft_aligned_malloc(sizeof(float) * blockSize*2 * kernelBlocks); | |||
| inputFfts = (float*) pffft_aligned_malloc(sizeof(float) * blockSize*2 * kernelBlocks); | |||
| memset(inputFfts, 0, sizeof(float) * blockSize*2 * kernelBlocks); | |||
| for (size_t i = 0; i < kernelBlocks; i++) { | |||
| // Pad each block with zeros | |||
| memset(tmpBlock, 0, sizeof(float) * blockSize*2); | |||
| size_t len = min((int) blockSize, (int) (length - i*blockSize)); | |||
| memcpy(tmpBlock, &kernel[i*blockSize], sizeof(float)*len); | |||
| // Compute fft | |||
| pffft_transform(pffft, tmpBlock, &kernelFfts[blockSize*2 * i], NULL, PFFFT_FORWARD); | |||
| } | |||
| } | |||
| /** Applies reverb to input | |||
| input and output must be size blockSize | |||
| */ | |||
| void processBlock(const float *input, float *output) { | |||
| if (kernelBlocks == 0) { | |||
| memset(output, 0, sizeof(float) * blockSize); | |||
| return; | |||
| } | |||
| // Step input position | |||
| inputPos = (inputPos + 1) % kernelBlocks; | |||
| // Pad block with zeros | |||
| memset(tmpBlock, 0, sizeof(float) * blockSize*2); | |||
| memcpy(tmpBlock, input, sizeof(float) * blockSize); | |||
| // Compute input fft | |||
| pffft_transform(pffft, tmpBlock, &inputFfts[blockSize*2 * inputPos], NULL, PFFFT_FORWARD); | |||
| // Create output fft | |||
| memset(tmpBlock, 0, sizeof(float) * blockSize*2); | |||
| // convolve input fft by kernel fft | |||
| // Note: This is the CPU bottleneck loop | |||
| for (size_t i = 0; i < kernelBlocks; i++) { | |||
| size_t pos = (inputPos - i + kernelBlocks) % kernelBlocks; | |||
| pffft_zconvolve_accumulate(pffft, &kernelFfts[blockSize*2 * i], &inputFfts[blockSize*2 * pos], tmpBlock, 1.0); | |||
| } | |||
| // Compute output | |||
| pffft_transform(pffft, tmpBlock, tmpBlock, NULL, PFFFT_BACKWARD); | |||
| // Add block tail from last output block | |||
| for (size_t i = 0; i < blockSize; i++) { | |||
| tmpBlock[i] += outputTail[i]; | |||
| } | |||
| // Copy output block to output | |||
| for (size_t i = 0; i < blockSize; i++) { | |||
| // Scale based on FFT | |||
| output[i] = tmpBlock[i] / blockSize; | |||
| } | |||
| // Set tail | |||
| for (size_t i = 0; i < blockSize; i++) { | |||
| outputTail[i] = tmpBlock[i + blockSize]; | |||
| } | |||
| } | |||
| }; | |||
| #define BLOCKSIZE 1024 | |||
| struct SpringReverb : Module { | |||