This work is sponsored by, and copyright, Google.
Previously all subpartitions except the eob=1 (DC) case ran with
the same runtime:
vp9_inv_dct_dct_16x16_sub16_add_neon: 1373.2
vp9_inv_dct_dct_32x32_sub32_add_neon: 8089.0
By skipping individual 8x16 or 8x32 pixel slices in the first pass,
we reduce the runtime of these functions like this:
vp9_inv_dct_dct_16x16_sub1_add_neon: 235.3
vp9_inv_dct_dct_16x16_sub2_add_neon: 1036.7
vp9_inv_dct_dct_16x16_sub4_add_neon: 1036.7
vp9_inv_dct_dct_16x16_sub8_add_neon: 1036.7
vp9_inv_dct_dct_16x16_sub12_add_neon: 1372.1
vp9_inv_dct_dct_16x16_sub16_add_neon: 1372.1
vp9_inv_dct_dct_32x32_sub1_add_neon: 555.1
vp9_inv_dct_dct_32x32_sub2_add_neon: 5190.2
vp9_inv_dct_dct_32x32_sub4_add_neon: 5180.0
vp9_inv_dct_dct_32x32_sub8_add_neon: 5183.1
vp9_inv_dct_dct_32x32_sub12_add_neon: 6161.5
vp9_inv_dct_dct_32x32_sub16_add_neon: 6155.5
vp9_inv_dct_dct_32x32_sub20_add_neon: 7136.3
vp9_inv_dct_dct_32x32_sub24_add_neon: 7128.4
vp9_inv_dct_dct_32x32_sub28_add_neon: 8098.9
vp9_inv_dct_dct_32x32_sub32_add_neon: 8098.8
I.e. in general a very minor overhead for the full subpartition case due
to the additional cmps, but a significant speedup for the cases when we
only need to process a small part of the actual input data.
This is cherrypicked from libav commits
cad42fadcd and
a0c443a398.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
This work is sponsored by, and copyright, Google.
Previously all subpartitions except the eob=1 (DC) case ran with
the same runtime:
Cortex A7 A8 A9 A53
vp9_inv_dct_dct_16x16_sub16_add_neon: 3188.1 2435.4 2499.0 1969.0
vp9_inv_dct_dct_32x32_sub32_add_neon: 18531.7 16582.3 14207.6 12000.3
By skipping individual 4x16 or 4x32 pixel slices in the first pass,
we reduce the runtime of these functions like this:
vp9_inv_dct_dct_16x16_sub1_add_neon: 274.6 189.5 211.7 235.8
vp9_inv_dct_dct_16x16_sub2_add_neon: 2064.0 1534.8 1719.4 1248.7
vp9_inv_dct_dct_16x16_sub4_add_neon: 2135.0 1477.2 1736.3 1249.5
vp9_inv_dct_dct_16x16_sub8_add_neon: 2446.7 1828.7 1993.6 1494.7
vp9_inv_dct_dct_16x16_sub12_add_neon: 2832.4 2118.3 2266.5 1735.1
vp9_inv_dct_dct_16x16_sub16_add_neon: 3211.7 2475.3 2523.5 1983.1
vp9_inv_dct_dct_32x32_sub1_add_neon: 756.2 456.7 862.0 553.9
vp9_inv_dct_dct_32x32_sub2_add_neon: 10682.2 8190.4 8539.2 6762.5
vp9_inv_dct_dct_32x32_sub4_add_neon: 10813.5 8014.9 8518.3 6762.8
vp9_inv_dct_dct_32x32_sub8_add_neon: 11859.6 9313.0 9347.4 7514.5
vp9_inv_dct_dct_32x32_sub12_add_neon: 12946.6 10752.4 10192.2 8280.2
vp9_inv_dct_dct_32x32_sub16_add_neon: 14074.6 11946.5 11001.4 9008.6
vp9_inv_dct_dct_32x32_sub20_add_neon: 15269.9 13662.7 11816.1 9762.6
vp9_inv_dct_dct_32x32_sub24_add_neon: 16327.9 14940.1 12626.7 10516.0
vp9_inv_dct_dct_32x32_sub28_add_neon: 17462.7 15776.1 13446.2 11264.7
vp9_inv_dct_dct_32x32_sub32_add_neon: 18575.5 17157.0 14249.3 12015.1
I.e. in general a very minor overhead for the full subpartition case due
to the additional loads and cmps, but a significant speedup for the cases
when we only need to process a small part of the actual input data.
In common VP9 content in a few inspected clips, 70-90% of the non-dc-only
16x16 and 32x32 IDCTs only have nonzero coefficients in the upper left
8x8 or 16x16 subpartitions respectively.
This is cherrypicked from libav commit
9c8bc74c2b.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
This avoids reloading them if they haven't been clobbered, if the
first pass also was idct.
This is similar to what was done in the aarch64 version.
This is cherrypicked from libav commit
3c87039a40.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Since the same parameter is used for both input and output,
the name inout is more fitting.
This matches the naming used below in the dmbutterfly macro.
This is cherrypicked from libav commit
79566ec8c7.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
The clobbering tests in checkasm are only invoked when testing
correctness, so this bug didn't show up when benchmarking the
dc-only version.
This is cherrypicked from libav commit
4d960a1185.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
This is one instruction less for thumb, and only have got
1/2 arm/thumb specific instructions.
This is cherrypicked from libav commit
e5b0fc170f.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
The latter is 1 cycle faster on a cortex-53 and since the operands are
bytewise (or larger) bitmask (impossible to overflow to zero) both are
equivalent.
This is cherrypicked from libav commit
e7ae8f7a71.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Since aarch64 has enough free general purpose registers use them to
branch to the appropiate storage code. 1-2 cycles faster for the
functions using loop_filter 8/16, ... on a cortex-a53. Mixed results
(up to 2 cycles faster/slower) on a cortex-a57.
This is cherrypicked from libav commit
d7595de0b2.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Seemingly ff_clear_block_sse assumed that the block array is aligned,
so make sure it is.
Fixes ticket #6079
Signed-off-by: James Almer <jamrial@gmail.com>
when hlsenc use flag second_level_segment_index,
second_level_segment_size and second_level_segment_duration,
the rename is ok but the output filename always use the old filename
so move the rename operation after the close the ts file and
before open new segment
Reported-by: Christian Johannesen <chrisjohannesen@gmail.com>
Reviewed-by: Bodecs Bela <bodecsb@vivanet.hu>
Signed-off-by: Steven Liu <lq@chinaffmpeg.org>
CID: 1396852
check the devices_list alloc status,
and release the devices_list when alloc devices error
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Steven Liu <lq@chinaffmpeg.org>
Fixes compilation with hardcoded tables after eaff1aa09e
and e71b8119e7
Reviewed-by: Timo Rothenpieler <timo@rothenpieler.org>
Signed-off-by: James Almer <jamrial@gmail.com>
The current condition can trigger in cases where it shouldn't, with
unexpected results.
Make sure that:
- container cropping is really based on the original dimensions from the
caller
- those dimenions are discarded on size change
The code is still quite hacky and eventually should be deprecated and
removed, with the decision about which cropping is used delegated to the
caller.
Introducing enforced sync points in arbitrary places is bad for
performance. Since the vast majority of receiving code (QSV VPP or
encoders, retrieving frames through hwcontext) will do the syncing, this
change should not be visible to most callers. But bumping micro just in
case.
This is also consistent with what VAAPI hwaccel does.
We can pick the correct slice index directly from the ID3D11VideoDecoderOutputView
casted from data[3].
Signed-off-by: Anton Khirnov <anton@khirnov.net>
No need to loop through the known surfaces, we'll use the requested surface
anyway.
The loop is only done for DXVA2.
Signed-off-by: Anton Khirnov <anton@khirnov.net>