Netflix Finds x265 20% More Efficient than VP9
Netflix compared 5,000 clips from 500 titles in its library using the x264, x265, and libvpx codecs. x265's implementation of HEVC was the clear winner on quality and efficiency, but whether that matters in light of compatibility and licensing issues isn't so obvious.
The most significant finding from Netflix’s large-scale codec comparison earlier this week is that, according to Netflix’s VMAF quality benchmark, HEVC was about 20% more efficient than VP9 at 360p, 720p, and 1080p resolutions. This leads to two questions: Is this the final word on HEVC vs. VP9 quality? If so, does it matter? I'll address both questions in this article. Let’s start with a closer look at Netflix’s findings.
Netflix’s Presentation
Netflix’s presentation was entitled, “A Large-Scale Video Codec Comparison of x264, x265 and Libvpx for Practical VOD applications.” Netflix’s Jan De Cock presented the results at the August 31 SPIE Applications of Digital Image Processing conference. The presentation is online here starting about 60 minutes in. Netflix plans to release a paper detailing their findings, but that wasn’t available for this article.
De Cock made several points up front. First, the study focused exclusively on VOD, as opposed to live. Second, Netflix compared actual codecs, not specifications. That is, the company didn’t compare the H.264/AVC, HEVC, and VP9 encoding specs; it compared implementations of those specs, specifically the x264, x265, and libvpx codecs. This is potentially significant because, though x264 is widely regarded as the highest-quality H.264 codec, x265 recently ranked fourth out of six HEVC codecs in the Moscow University HEVC/H.265 Video Codecs Comparison (PDF available here). If you run your own tests with one of the HEVC codecs that surpassed x265, the advantage over libvpx might be even greater. For this article, I’ll write VP9 rather than libvpx, which doesn’t quite roll off the tongue or keyboard, but in every case I mean libvpx, Google’s own implementation of VP9.
By any measure, Netflix’s test set was exhaustive: 5,000 12-second clips taken from 500 titles in Netflix’s library, drawn from 1080p and 4K sources of animated and real-world content with a wide diversity of detail and motion. Netflix encoded at three resolutions (360p, 720p, and 1080p) using the most exhaustive, complex settings available for each codec (e.g., placebo for x265 and x264). All encodes were single-threaded, with no slices, tiles, or wavefronts, and used a keyframe interval of 4 seconds.
Beyond these settings, Netflix tuned their encodes for two scenarios. The first tuned for the highest possible PSNR (Peak Signal-to-Noise Ratio) scores by enabling the tune-for-PSNR option and disabling aq-mode and the psycho-visual settings. The second tuned for visual quality by disabling PSNR tuning and enabling aq-mode, psy-rd in x264 and x265, and psy-rdoq in x265. These distinctions will be important to the results presented below.
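As a rough illustration, the two tuning scenarios might map onto x264 and x265 command-line options as sketched below. Netflix's exact command lines weren't part of the presentation, so the helper name build_args, the 24 fps frame rate, the AQ mode number, and the psy-rd/psy-rdoq strengths are all assumptions chosen for illustration; only the placebo preset, the 4-second keyframe interval, the single-threaded constraint, and the on/off state of the tuning options come from the description above.

```python
def build_args(codec, tuning, fps=24):
    """Build an illustrative encoder argument list (not Netflix's actual command).

    codec  -- "x264" or "x265"
    tuning -- "psnr" or "visual"
    fps    -- assumed source frame rate, used to turn 4 seconds into frames
    """
    if codec not in ("x264", "x265"):
        raise ValueError("unsupported codec: %r" % codec)

    keyint = 4 * fps  # 4-second keyframe interval expressed in frames
    args = [codec, "--preset", "placebo", "--keyint", str(keyint)]

    # Approximate the single-threaded, no-wavefront setup.
    if codec == "x264":
        args += ["--threads", "1"]
    else:
        args += ["--frame-threads", "1", "--no-wpp"]

    if tuning == "psnr":
        # --tune psnr turns off adaptive quantization and psycho-visual options.
        args += ["--tune", "psnr"]
    elif tuning == "visual":
        # Strength values below are placeholders, not Netflix's settings.
        if codec == "x264":
            args += ["--aq-mode", "1", "--psy-rd", "1.0:0.0"]
        else:
            args += ["--aq-mode", "1", "--psy-rd", "2.0", "--psy-rdoq", "1.0"]
    else:
        raise ValueError("unsupported tuning: %r" % tuning)
    return args


print(" ".join(build_args("x265", "visual")))
```

Input, output, and rate-control options are left out on purpose; they depend on the pass being run.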
Netflix encoded the files using content-adaptive three-pass encoding, where a first pass used CRF mode (x264 and x265) or constant QP mode (VP9) to determine the appropriate target bitrate. Then Netflix encoded the files to that target using two-pass VBR. Netflix used 8 CRF/QP targets to create a rate-quality curve for each encoded file. Overall, Netflix produced more than 720,000 encoded videos, totaling more than 200 million encoded frames.
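Those totals are consistent with a simple product of the test dimensions described above. The breakdown below is an inference from those dimensions rather than a figure Netflix stated, and the 24 fps frame rate is an assumption.

```python
clips = 5000          # 12-second clips from 500 titles
resolutions = 3       # 360p, 720p, 1080p
codecs = 3            # x264, x265, libvpx
quality_points = 8    # CRF/QP targets per rate-quality curve
tunings = 2           # tuned for PSNR, tuned for visual quality

encodes = clips * resolutions * codecs * quality_points * tunings

clip_seconds = 12
assumed_fps = 24      # assumption: typical film/TV frame rate
frames = encodes * clip_seconds * assumed_fps

print(encodes)  # 720000
print(frames)   # 207360000
```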
Netflix applied six quality metrics to the encoded files: PSNR, Structural Similarity (SSIM), PSNR-based mean square error (PSNR MSE), Multi-Scale SSIM (MS-SSIM), Visual Information Fidelity (VIF), and Video Multimethod Assessment Fusion (VMAF), a metric largely designed by Netflix.
The Results Please
Netflix presented three sets of results. The first analyzed the files tuned for PSNR using the PSNR benchmark. According to these results, x265 outperformed VP9 by an average of about 6.6% over 360p, 720p, and 1080p resolutions.
The second set of results assessed the files encoded for visual quality using the MS-SSIM benchmark. In these results, x265 outperformed VP9 by about 3% on average over the three tested resolutions, though VP9 outperformed x265 in the 1080p configuration.
The third set of results, presented in Figure 1, analyzed the files encoded for visual quality with the VMAF benchmark. In these tests, x265 proved 20.1% more efficient than VP9, on average, over all three tested resolutions, though VP9 proved 31.6% more efficient than x264 at 720p and 42.6% at 1080p, numbers you’ll see confirmed in actual deployments below.
Figure 1. In this test grouping, x265 outperformed VP9 by just over 20% on average.
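The article doesn't say how these efficiency percentages were computed. A common way to compare two rate-quality curves is to average the bitrate difference at equal quality over the quality range both curves cover, which is the idea behind the Bjøntegaard delta rate (BD-rate). The sketch below is a simplified, dependency-free variant that uses piecewise-linear interpolation of log bitrate instead of the polynomial fit in standard BD-rate; the function names and the sample points are made up for illustration.

```python
import math


def _interp(x, xs, ys):
    """Linearly interpolate y at x; xs must be strictly increasing."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
    raise ValueError("x is outside the curve's quality range")


def average_bitrate_difference(ref, test, steps=100):
    """Average bitrate change (%) of `test` relative to `ref` at equal quality.

    Each curve is a list of (bitrate_kbps, quality_score) points.
    A negative result means `test` needs less bitrate than `ref`.
    """
    def prepare(curve):
        pts = sorted((q, math.log(rate)) for rate, q in curve)
        return [p[0] for p in pts], [p[1] for p in pts]

    ref_q, ref_log = prepare(ref)
    test_q, test_log = prepare(test)

    # Only compare where both curves have data.
    lo = max(ref_q[0], test_q[0])
    hi = min(ref_q[-1], test_q[-1])
    if hi <= lo:
        raise ValueError("curves do not overlap in quality")

    # Midpoint sampling of the log-bitrate gap across the shared range.
    total = 0.0
    for i in range(steps):
        q = lo + (hi - lo) * (i + 0.5) / steps
        total += _interp(q, test_q, test_log) - _interp(q, ref_q, ref_log)
    return (math.exp(total / steps) - 1) * 100


# Made-up curves: the second needs 20% less bitrate at every quality level.
ref_curve = [(1000, 70), (2000, 80), (4000, 90)]
test_curve = [(800, 70), (1600, 80), (3200, 90)]
print(round(average_bitrate_difference(ref_curve, test_curve), 1))  # -20.0
```

Averaging in the log domain keeps a few high-bitrate points from dominating the result.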
I asked Netflix which set of results they felt was most significant. Their response was, “We believe that VMAF results will have the best correlation to user perception of quality. We use this metric, and sanity-check against other metrics (PSNR, SSIM, VIF, etc.) internally.” In other words, according to Netflix, the results shown in Figure 1 were the most relevant of the three.
Analyzing the Analysis
I asked Netflix about their motivations for performing this test, and they answered, “We wanted to understand the current state of the x265 and libvpx codec implementations when used to generate non-realtime encodes optimized for OTT use case. It was important to see how the codecs performed when testing on a diverse set of premium content from our catalog. This test can help us find areas of improvement for the different codecs.” So, just as you would expect, the tests were designed to benchmark codecs for high-volume encodes of premium content.
While clearly relevant to other premium OTT vendors, it’s unclear whether you can generalize these results to other forms of content, and the results raise a few questions. For example, user-generated content sites like YouTube have to encode a much broader range of content, from shaky smartphone videos to VR input. Corporations encode a completely different range of videos, mostly low-motion training videos, talking heads, or even PowerPoint or Camtasia-based videos. Do Netflix’s findings apply to these types of clips?
A similar question can be asked about the encoding settings, which are much more complex than typically used for actual production. For example, the last time I tested, the x265 placebo preset took over 20 times longer to encode than the medium preset that’s usually recommended (and delivered about 17% higher quality). In encoding shops operating at capacity, this means that using the placebo settings would increase encoding costs by a factor of 20. Of course, when you distribute each encoded file tens of millions of times, encoding time/cost really doesn’t matter. Would the results have changed if Netflix had used settings more typically used by other encoding shops?
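The trade-off between encoding cost and distribution volume can be sketched as a break-even calculation. Only the 20x cost multiplier comes from the paragraph above; the function name, the base encoding cost, and the per-view delivery saving are hypothetical values chosen to show the shape of the arithmetic, not measured figures.

```python
def break_even_views(base_encode_cost, cost_multiplier, delivery_saving_per_view):
    """Views needed before a slower preset's delivery savings repay its extra encode cost."""
    extra_cost = base_encode_cost * (cost_multiplier - 1)
    return extra_cost / delivery_saving_per_view


# Hypothetical: $10 to encode a title at medium, 20x that at placebo,
# and $0.001 of bandwidth saved per view thanks to the more efficient encode.
views = break_even_views(10.0, 20, 0.001)
print(round(views))  # 190000
```

With inputs like these, a title viewed tens of millions of times clears the break-even point many times over, while a corporate training video watched a few hundred times never does.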
In addition, while VMAF has tremendous credibility by virtue of its heritage, it’s a new benchmark. Neither x265 nor VP9 has been tuned for the test, a common practice that sounds like gaming the system, but it's valid if the metric accurately measures what it purports to. Will the results be the same once both codecs are fully tuned for VMAF?
Finally, VMAF is just one of multiple metrics that claim to accurately predict subjective scoring. Others include the ClearView System Option-Sarnoff JND available with Video Clarity's ClearView tools, the Difference Mean Opinion Score (DMOS) available with the Tektronix Picture Quality Analyzer (PQA), and the SSIMplus metric, which is available in the SSIMWave Video QoE Monitor (SQM). Before shutting the door on the HEVC vs. VP9 quality debate, it would be interesting to see the results from these other tools.
Overall, Netflix’s tests clearly show that x265 is 20% more efficient than libvpx when encoding premium content using the most stringent settings, and measuring quality with VMAF. Whether you can apply those conclusions to production encodes of other forms of content remains to be seen, as does how the codecs, or even the benchmark, will stack up in twelve to eighteen months. But even if you take Netflix’s results at face value, does it matter?
OK, x265 is Better by 20%. Does it Matter?
Let’s start by agreeing that better is undeniably better. No question. But other factors, including cost and usability/accessibility, are often equally important or even more so.
Let’s look at the browser market for computers and notebooks. Here, VP9 is available in Chrome, Firefox, and Edge, which, by some measures, account for up to 89% of all browsers, while HEVC is available only in Microsoft Edge, and then only where hardware decode is also present. For producers who have migrated to HTML5, converting from H.264 to VP9 is simple, and can deliver substantial benefits.
For example, JWPlayer recently deployed VP9 streams packaged with DASH for four publishers with large video catalogues. Over the course of 6,000 hours of video playback, JWPlayer sent DASH/VP9 streams to desktop and mobile users with compatible browsers, and served HLS/H.264 to the rest. As shown in Figure 2, the average HLS user consumed 40% more bandwidth than those playing VP9. Despite the bandwidth difference, 65% of VP9/DASH streams were viewed at 1080p, compared to 28% for H.264/HLS. In these real-world trials, VP9 delivered an overall improved quality of experience at substantially lower cost to the publisher.
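Note that "40% more bandwidth for HLS" is not the same as "40% less bandwidth for VP9," because the two percentages use different baselines. A quick conversion, with a function name chosen here for illustration:

```python
def saving_from_overhead(overhead_pct):
    """Convert 'A uses X% more than B' into 'B uses Y% less than A'."""
    return (1 - 1 / (1 + overhead_pct / 100)) * 100


# HLS/H.264 users consumed 40% more bandwidth than VP9/DASH users.
print(round(saving_from_overhead(40), 1))  # 28.6
```

So in this trial, VP9/DASH viewers used roughly 29% less bandwidth than HLS/H.264 viewers, while watching at 1080p more than twice as often.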