| Author | SHA1 | Message | Date | 
|---|---|---|---|
| | 2b140a3d09 | x86: use 32-bit source registers with movd instruction. yasm tolerates mismatch between movd/movq and source register size, adjusting the instruction according to the register. nasm is more strict. Signed-off-by: Mans Rullgard <mans@mansr.com> | 13 years ago |
| | 6797d1948b | x86: rv40: Mark rv40_weight functions as MMX2; they use MMX2 instructions. | 13 years ago |
| | 110d0cdc9d | rv40dsp x86: MMX/MMX2/3DNow/SSE2/SSSE3 implementations of MC. Code mostly inspired by vp8's MC, however: (1) its MMX2 horizontal filter is worse because it can't take advantage of the coefficient redundancy; (2) that same coefficient redundancy allows better code for non-SSSE3 versions. Benchmark (rounded to tens of unit), listed as V8x8 / H8x8 / 2D8x8 / V16x16 / H16x16 / 2D16x16: C 445 / 358 / 985 / 1785 / 1559 / 3280; MMX\* 219 / 271 / 478 / 714 / 929 / 1443; SSE2 131 / 158 / 294 / 425 / 515 / 892; SSSE3 120 / 122 / 248 / 387 / 390 / 763. End result is overall around a 15% speedup for SSSE3 version (on 6 sequences); all loop filter functions now take around 55% of decoding time, while luma MC dsp functions are around 6%, chroma ones are 1.3% and biweight around 2.3%. Signed-off-by: Diego Biurrun <diego@biurrun.de> | 13 years ago |
| | 2130bd8f5b | rv40dsp x86: use only one register, for both increment and loop counter. Around 10 cycles faster for luma. Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com> | 13 years ago |
| | 272b252c01 | rv40dsp: implement prescaled versions for biweight. Quite often, the original weights are multiple of 512. By prescaling them by 1/512 when they are computed (once per frame), no intermediate shifting is needed, and no prescaling on each call either. The x86 code already used that trick. Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com> | 13 years ago |
| | e5c9de2ab7 | rv40: x86 SIMD for biweight. Provide MMX, SSE2 and SSSE3 versions, with a fast-path when the weights are multiples of 512 (which is often the case when the values round up nicely). \*_TIMER report for the 16x16 and 8x8 cases: C: 9015 decicycles in 16, 524257 runs, 31 skips; 2656 decicycles in 8, 524271 runs, 17 skips. MMX: 4156 decicycles in 16, 262090 runs, 54 skips; 1206 decicycles in 8, 262131 runs, 13 skips. MMX on fast-path: 2760 decicycles in 16, 524222 runs, 66 skips; 995 decicycles in 8, 524252 runs, 36 skips. SSE2: 2163 decicycles in 16, 262131 runs, 13 skips; 832 decicycles in 8, 262137 runs, 7 skips. SSE2 with fast path: 1783 decicycles in 16, 524276 runs, 12 skips; 711 decicycles in 8, 524283 runs, 5 skips. SSSE3: 2117 decicycles in 16, 262136 runs, 8 skips; 814 decicycles in 8, 262143 runs, 1 skips. SSSE3 with fast path: 1315 decicycles in 16, 524285 runs, 3 skips; 578 decicycles in 8, 524286 runs, 2 skips. This means around a 4% speedup for some sequences. Signed-off-by: Diego Biurrun <diego@biurrun.de> | 14 years ago |
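
The prescaling idea described in commits 272b252c01 and e5c9de2ab7 can be illustrated with a small C sketch. This is not the code from those commits: the per-pixel formula (a shift by 9 on each product, then a rounding term of 0x10 and a shift by 5) and every function name below are assumptions chosen for illustration, as is the premise that the two weights sum to 16384. The point it shows is that when both weights are multiples of 512, dividing them by 512 once up front gives exactly the same pixel as shifting each product by 9 on every call.

```c
#include <stdint.h>

/* Hypothetical general path: full-precision weights, with each
 * product shifted right by 9 for every pixel. */
static uint8_t biweight_shifted(uint8_t a, uint8_t b, int w1, int w2)
{
    return (uint8_t)((((w2 * a) >> 9) + ((w1 * b) >> 9) + 0x10) >> 5);
}

/* Hypothetical fast path: pw1 and pw2 are the weights already
 * divided by 512, so no intermediate shift is needed. */
static uint8_t biweight_prescaled(uint8_t a, uint8_t b, int pw1, int pw2)
{
    return (uint8_t)((pw2 * a + pw1 * b + 0x10) >> 5);
}

/* Returns 1 when both weights are multiples of 512, i.e. when the
 * low 9 bits of each weight are zero. */
static int can_prescale(int w1, int w2)
{
    return ((w1 | w2) & 0x1FF) == 0;
}

/* Counts the pixel pairs for which the two paths disagree, given
 * the full-precision weights w1 and w2. */
static int count_mismatches(int w1, int w2)
{
    int bad = 0;
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            if (biweight_shifted((uint8_t)a, (uint8_t)b, w1, w2) !=
                biweight_prescaled((uint8_t)a, (uint8_t)b, w1 >> 9, w2 >> 9))
                bad++;
    return bad;
}
```

A caller would evaluate `can_prescale` once, when the weights are computed, and pick the prescaled routine only if it returns 1. For weights that are not multiples of 512, the truncated weights give different results, so the general path is still needed.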