Author | SHA1 | Message | Date |
---|---|---|---|
|
b5161908e0 |
SBR DSP: fix SSE code to not use SSE2 instructions.
movq from SSE register _to_ memory is an SSE2 instruction. Use the SSE movlps function instead that does the same thing. Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de> Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com> |
13 years ago |
|
2784d18791 |
SBR DSP x86: implement SSE sbr_hf_g_filt
Unrolling the main loop to process, instead of 4 elements: - 8: minor gain of 2 cycles (not worth the extra object size) - 2: loss of 8 cycles. Assigning STEP to a register is a loss. Output address (Y) is almost always unaligned. Timings: - C (32/64 bits): 117/109 cycles - SSE: 57 cycles Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com> |
13 years ago |
|
34454c761f |
SBR DSP x86: implement SSE sbr_sum_square_sse
The 32bits targets have been compiled with -mfpmath=sse for proper reference. sbr_sum_square C /32bits: 82c (unrolled)/102c C /64bits: 69c (unrolled)/82c SSE/32bits: 42c SSE/64bits: 31c Use of SSE4.1 dpps to perform the final sum is slower. Not unrolling to perform 8 operations in a loop yields 10 more cycles. Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com> |
13 years ago |