* commit '7c55fac7dfa8bad9644dea5d03309da30be69563':
fate: Add test for webp
Noop, we already have a variety of webp tests, including a fate-webp target,
which would collide with this test.
Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
__MAC_10_11 can be present in updated revision of an older SDK so it
can't reliably detect availability of kAudioFormatEnhancedAC3 constant.
Fixes: b4daa2c40f ('lavc/audiotoolboxdec: add eac3 decoder')
Cc: Rodger Combs <rodger.combs@gmail.com>
Signed-off-by: Dmitry Kalinkin <dmitry.kalinkin@gmail.com>
Previous version reviewed by: Rodger Combs <rodger.combs@gmail.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
This work is sponsored by, and copyright, Google.
These are ported from the ARM version; thanks to the larger
amount of registers available, we can do the 16x16 and 32x32
transforms in slices 8 pixels wide instead of 4. This gives
a speedup of around 1.4x compared to the 32 bit version.
The fact that aarch64 doesn't have the same d/q register
aliasing makes some of the macros quite a bit simpler as well.
Examples of runtimes vs the 32 bit version, on a Cortex A53:
ARM AArch64
vp9_inv_adst_adst_4x4_add_neon: 90.0 87.7
vp9_inv_adst_adst_8x8_add_neon: 400.0 354.7
vp9_inv_adst_adst_16x16_add_neon: 2526.5 1827.2
vp9_inv_dct_dct_4x4_add_neon: 74.0 72.7
vp9_inv_dct_dct_8x8_add_neon: 271.0 256.7
vp9_inv_dct_dct_16x16_add_neon: 1960.7 1372.7
vp9_inv_dct_dct_32x32_add_neon: 11988.9 8088.3
vp9_inv_wht_wht_4x4_add_neon: 63.0 57.7
The speedup vs C code (2-4x) is smaller than in the 32 bit case,
mostly because the C code ends up significantly faster (around
1.6x faster, with GCC 5.4) when built for aarch64.
Examples of runtimes vs C on a Cortex A57 (for a slightly older version
of the patch):
A57 gcc-5.3 neon
vp9_inv_adst_adst_4x4_add_neon: 152.2 60.0
vp9_inv_adst_adst_8x8_add_neon: 948.2 288.0
vp9_inv_adst_adst_16x16_add_neon: 4830.4 1380.5
vp9_inv_dct_dct_4x4_add_neon: 153.0 58.6
vp9_inv_dct_dct_8x8_add_neon: 789.2 180.2
vp9_inv_dct_dct_16x16_add_neon: 3639.6 917.1
vp9_inv_dct_dct_32x32_add_neon: 20462.1 4985.0
vp9_inv_wht_wht_4x4_add_neon: 91.0 49.8
The asm is around factor 3-4 faster than C on the cortex-a57 and the asm
is around 30-50% faster on the a57 compared to the a53.
Signed-off-by: Martin Storsjö <martin@martin.st>
This work is sponsored by, and copyright, Google.
These are ported from the ARM version; thanks to the larger
amount of registers available, we can do the loop filters with
16 pixels at a time. The implementation is fully templated, with
a single macro which can generate versions for both 8 and
16 pixels wide, for both 4, 8 and 16 pixels loop filters
(and the 4/8 mixed versions as well).
For the 8 pixel wide versions, it is pretty close in speed (the
v_4_8 and v_8_8 filters are the best examples of this; the h_4_8
and h_8_8 filters seem to get some gain in the load/transpose/store
part). For the 16 pixels wide ones, we get a speedup of around
1.2-1.4x compared to the 32 bit version.
Examples of runtimes vs the 32 bit version, on a Cortex A53:
ARM AArch64
vp9_loop_filter_h_4_8_neon: 144.0 127.2
vp9_loop_filter_h_8_8_neon: 207.0 182.5
vp9_loop_filter_h_16_8_neon: 415.0 328.7
vp9_loop_filter_h_16_16_neon: 672.0 558.6
vp9_loop_filter_mix2_h_44_16_neon: 302.0 203.5
vp9_loop_filter_mix2_h_48_16_neon: 365.0 305.2
vp9_loop_filter_mix2_h_84_16_neon: 365.0 305.2
vp9_loop_filter_mix2_h_88_16_neon: 376.0 305.2
vp9_loop_filter_mix2_v_44_16_neon: 193.2 128.2
vp9_loop_filter_mix2_v_48_16_neon: 246.7 218.4
vp9_loop_filter_mix2_v_84_16_neon: 248.0 218.5
vp9_loop_filter_mix2_v_88_16_neon: 302.0 218.2
vp9_loop_filter_v_4_8_neon: 89.0 88.7
vp9_loop_filter_v_8_8_neon: 141.0 137.7
vp9_loop_filter_v_16_8_neon: 295.0 272.7
vp9_loop_filter_v_16_16_neon: 546.0 453.7
The speedup vs C code in checkasm tests is around 2-7x, which is
pretty much the same as for the 32 bit version. Even if these functions
are faster than their 32 bit equivalent, the C version that we compare
to also became around 1.3-1.7x faster than the C version in 32 bit.
Based on START_TIMER/STOP_TIMER wrapping around a few individual
functions, the speedup vs C code is around 4-5x.
Examples of runtimes vs C on a Cortex A57 (for a slightly older version
of the patch):
A57 gcc-5.3 neon
loop_filter_h_4_8_neon: 256.6 93.4
loop_filter_h_8_8_neon: 307.3 139.1
loop_filter_h_16_8_neon: 340.1 254.1
loop_filter_h_16_16_neon: 827.0 407.9
loop_filter_mix2_h_44_16_neon: 524.5 155.4
loop_filter_mix2_h_48_16_neon: 644.5 173.3
loop_filter_mix2_h_84_16_neon: 630.5 222.0
loop_filter_mix2_h_88_16_neon: 697.3 222.0
loop_filter_mix2_v_44_16_neon: 598.5 100.6
loop_filter_mix2_v_48_16_neon: 651.5 127.0
loop_filter_mix2_v_84_16_neon: 591.5 167.1
loop_filter_mix2_v_88_16_neon: 855.1 166.7
loop_filter_v_4_8_neon: 271.7 65.3
loop_filter_v_8_8_neon: 312.5 106.9
loop_filter_v_16_8_neon: 473.3 206.5
loop_filter_v_16_16_neon: 976.1 327.8
The speed-up compared to the C functions is 2.5 to 6 and the cortex-a57
is again 30-50% faster than the cortex-a53.
Signed-off-by: Martin Storsjö <martin@martin.st>
* commit 'e48746deec48e9ff195841bc3266b4e153a878cd':
checkasm: h264dsp: Move the x and y variables into the randomize_buffer macro
Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
* commit '82b7525173f20702a8cbc26ebedbf4b69b8fecec':
Add an OpenH264 decoder wrapper
This commit is a noop, see c5d326f551
Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
* commit '785c25443b56adb6dbbb78d68cccbd9bd4a42e05':
movenc: Apply offsets on timestamps when peeking into interleaving queues
Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
* commit 'eccfb9778ae939764d17457f34338d140832d9e1':
qsvdec_hevc: add the UID of the HEVC HW decoder plugin
Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
* commit 'c3f113d58488df7594a489bdbb993a69ad47063c':
vf_hwdownload: allocate the destination frame for the pool size
Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
* commit 'fdfe01365d579189d9a55b3741dba2ac46eb1df8':
hwcontext: allocate the destination frame for the pool size
Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
* commit '5fcae3b3f93fd02b3d1e009b9d9b17410fca9498':
hwcontext: clarify the behaviour of transfer_data() for cropped frames
Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
* commit '94ebf5565849e4dc036d2ca43979571ed3736457':
avconv: restructure sending EOF to filters
Noop, as its a fixup to a previously skipped commit
Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
* commit 'd2e56cf753a6c462041dee897d9d0c90f349988c':
avconv: move flushing the queued frames to configure_filtergraph()
Noop, as its a fixup to a previously skipped commit
Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
The Intel binary iHD driver does not support the
VASurfaceAttribMemoryType, so surface allocation will fail when using
it.
(cherry picked from commit 2124711b95)
If no string argument is supplied when av_hwdevice_ctx_create() is
called to create a VAAPI device, we currently only try the default
X11 display (that is, $DISPLAY) to find a device, and will therefore
fail in the absence of an X server to connect to. Change the logic
to also look for a device via the first DRM render node (that is,
"/dev/dri/renderD128"), which is probably the right thing to use in
most simple configurations which only have one DRM device.
(cherry picked from commit 121f34d5f0)
No longer leaks memory when used with a driver with the "render does
not destroy param buffers" quirk (i.e. Intel i965).
(cherry picked from commit 221ffca631)
Fixes ticket #5871.
The driver being used is detected inside av_hwdevice_ctx_init() and
the quirks field then set from a table of known device. If this
behaviour is unwanted, the user can also set the quirks field
manually.
Also adds the Intel i965 driver quirk (it does not destroy parameter
buffers used in a call to vaRenderPicture()) and detects that driver
to set it.
(cherry picked from commit 4926fa9a4a)
This prevents a division by zero in read_packet.
Reviewed-by: Paul B Mahol <onemda@gmail.com>
Signed-off-by: Andreas Cadhalpun <Andreas.Cadhalpun@googlemail.com>
Set up the encoder with a hardware context which will match the one
the decoder will use when it starts later.
Includes 02c2761973dfc886e94a60a9c7d6d30c296d5b8c, with additional
hackery to get around a3a0230a98 being
skipped.
* commit '8a62d2c28fbacd1ae20c35887a1eecba2be14371':
vaapi_encode: Maintain a pool of bitstream output buffers
Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
* commit '4a081f224e12f4227ae966bcbdd5384f22121ecf':
libavcodec: fix constness in clobber test avcodec_open2() wrappers
Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
* commit '02c2761973dfc886e94a60a9c7d6d30c296d5b8c':
avconv_qsv: use the device creation API
Not merged, our ffmpeg hwaccel infra is not quite the same as avconvs.
Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
* commit '232399e3ee219d16d0e0d482c9f31a26202d4993':
avconv: pass the hwaccel frames context to the decoder
Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
* commit 'a3a0230a9870b9018dc7415ae5872784d524cfe5':
avconv: init filtergraphs only after we have a frame on each input
This commit is a noop since it doesn't apply cleanly due to differences
in the dataflow between avconv and ffmpeg, and thus fixing this in the
scope of a merge is unfeasible.
Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
* commit '3e265ca58f0505470186dce300ab66a6eac3978e':
avconv: do packet ts rescaling in write_packet()
This commit is a noop since it doesn't apply cleanly due to differences
in the dataflow between avconv and ffmpeg, and thus fixing this in the
scope of a merge is unfeasible.
Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
* commit 'ba7397baef796ca3991fe1c921bc91054407c48b':
avconv: factor out initializing stream parameters for encoding
Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
The handling of the other block sizes was limited to 'SCALED == 0' in
commit dc96c0f9fc96bf4167633befc074394062793322, so this assert should
be disabled, too, as it can now be triggered.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Cadhalpun <Andreas.Cadhalpun@googlemail.com>
Fixes building for Windows x86 with MSVC using the link libraries distributed with the CUDA SDK.
check_lib2 is required here because it includes the header to get the full signature of the
function, including the stdcall calling convention and all of its arguments, which enables
the linker to determine the fully qualified object name and resolve it through the import
library, since the CUDA SDK libraries do not include un-qualified aliases.