Recently, we were delighted to make a change that will improve bandwidth management for our customers. The Kaltura transcoding logic now applies Content Aware Encoding (CAE) to all content ingested into Kaltura (with exceptions made for a few specific customers to account for their special needs).
Until fairly recently, video was encoded according to set “bitrate ladders”, where each bitrate was paired with a resolution at several set intervals (such as 235 Kbps/320×240, 375 Kbps/384×288, etc.). If too low a resolution is chosen, the picture will appear fuzzy to viewers. If too high a bitrate is chosen, the viewer will experience buffering, while too low a bitrate will result in annoying encoding “artifacts”. The goal is to balance these factors for an optimal viewer experience. The bitrate ladder was applied to all content to try to achieve an acceptable playback experience and bandwidth usage on average, without taking specific videos’ content into account.
Content Aware Encoding, on the other hand, examines each individual video’s characteristics and optimizes encoding accordingly. The point of CAE is to reduce the playback bandwidth, but still provide the same quality viewing experience. In many cases, the new bandwidth can be as low as half that of ‘non-CAE’ videos.
Basically, this means that the transcoding logic optimizes the encoding procedure for each content source, based on its complexity level. Less complex videos, such as videos with simpler animation or uncluttered backgrounds, require lower bitrates to reach a sufficient quality level. Higher-complexity videos, such as those with very busy backgrounds or lots of movement, require a higher bitrate. Choosing the lowest acceptable playback bitrate for each individual video means more end users will get a better streaming experience, with less buffering, all the while maintaining playback quality.
Limiting the encoded rendition output for content where high-bitrate renditions are unnecessary yields a substantial gain in bitrate efficiency.
For more details, read on.
As noted before, the non-CAE encoding process applies the same encoding ladder to all ingested sources. Furthermore, all the assets are forced to the max rendition bitrate. In our case, this meant generating all 6 flavors included in our default flavor set, even if there was no visual difference between the lowest flavor and the highest flavor.
Until quite recently, most companies used this mode. But now, that’s changing.
At the beginning of this year, Netflix published an article that describes their approach to CAE and their attempts to integrate this technology. Shortly afterwards, Streaming Media published a commentary on the Netflix article, along with their own version of CAE.
In the article, Netflix described the way they worked out their CAE flow: in-depth research of transcoding quality issues and comparison between various combinations of frame size, bitrate, and other parameters. This included both automatic quality measurements and manual inspections of the generated test content. The resultant automatic flow analyzes those combinations to get the optimal encoding configuration for every new ‘title’/content. The results are very good, but the process is very time- and resource-consuming.
In the Streaming Media article, Jan Ozer described a much simpler approach that is based on the ffmpeg capability to force a specific quality level (via the CRF parameter). Examining encodings with different CRF values made it possible to define which CRF suits the quality requirements and makes the encoding as visually similar to the source as possible at the lowest possible bitrate. This way, a maximal encoding bitrate can be defined for each source. From this, it is possible to derive the rest of the ‘encoding ladder’.
This approach is much simpler than Netflix’s solution, but it is also less precise.
Examining YouTube renditions shows that they do not stick to a rigid encoding ladder: different clips get different rendition/bitrate spreads. This means that YouTube applies CAE as well. This was confirmed a couple of months ago, when YouTube published an article explaining how they used machine learning with Google Brain to implement their Content Aware Encoding.
Although it was not called ‘content aware’, Kaltura’s previous transcoding mode had some CAE capabilities.
Still, this approach was not efficient. In many cases, there was no need for HD flavors, even though the source bitrate was very high, because there was no visual difference between HD assets and lower bitrate assets. For example, webcast videos, lectures, and user-generated content all looked the same in HD and in lower bitrate flavors.
So there were definitely gains to be made with Content Aware Encoding. But which approach was best for Kaltura?
The Netflix approach requires extensive processing for every source. Kaltura’s daily ingest load is 10K-60K entries, so it wouldn’t be efficient to add that level of additional processing. On the other hand, YouTube’s flow evolved from long machine learning research. We’re not planning on challenging Google’s machine learning expertise any time soon.
After we experimented with many approaches, the one described by Streaming Media looked promising.
The first phase was to verify that the ‘forced quality’ (CRF) approach can predict the required max bitrate.
This phase involved defining 6 content categories: Film, Simple Animation, Real World Action, Talking Heads, Screencam, and Webcasting. For each category, we examined several samples. For each, we evaluated on multiple criteria:
These limited-scope tests verified the initial Streaming Media assumptions: we can use a CRF=23 forced-quality rendition to predict the required rendition bitrate. This is the ‘Source Complexity’ value.
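As a rough illustration of the forced-quality measurement, the sketch below builds an ffmpeg command line for a CRF=23 x264 encode and converts the resulting file size into an average bitrate. The function names and file paths are hypothetical, and the choice of libx264 with audio disabled is an assumption; this is not Kaltura's actual command line.

```python
def crf_probe_command(source: str, output: str, crf: int = 23) -> list[str]:
    """Build an ffmpeg command for a constant-quality (CRF) video-only encode.

    Assumption: libx264 is the encoder; audio is dropped (-an) so that the
    output size reflects video complexity only.
    """
    return [
        "ffmpeg", "-y",        # overwrite the output file if it exists
        "-i", source,
        "-c:v", "libx264",
        "-crf", str(crf),      # forced quality level instead of a target bitrate
        "-an",
        output,
    ]


def average_bitrate_kbps(size_bytes: int, duration_sec: float) -> float:
    """Average bitrate of an encoded file, in Kbps (1 Kbps = 1000 bits/s)."""
    if duration_sec <= 0:
        raise ValueError("duration must be positive")
    return size_bytes * 8 / duration_sec / 1000
```

The command list can be handed to `subprocess.run`, and the size of the output file divided by the clip duration then gives a complexity estimate in Kbps.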
The Source Complexity should be determined before any asset conversion can take place. The simplistic way would be to run full source conversion (with CRF=23) in order to get the Source Complexity value. Although it is much simpler and shorter than Netflix’s flow, conversion of an HD source into an HD rendition might last several hours in Kaltura’s current transcoding environment. Applying this to ALL of the sources would double (or at least significantly increase) the entry time-to-ready and would probably overload Kaltura transcoding resources.
Therefore, we decided to limit the Complexity evaluation time.
The method used was to generate 20 samples (1 second each), spread throughout the whole file. We calculated average I and P frame sizes and used them in order to estimate the final bitrate. This kind of processing takes on average around 30 seconds, and for the longest case, approximately 60 seconds.
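The sampling idea can be sketched as follows. The exact formula Kaltura uses to turn average I and P frame sizes into a bitrate is not given here, so this example assumes a simple GOP model (one I frame followed by P frames only); the function names, the GOP length, and the frame rate parameters are all illustrative assumptions.

```python
def sample_offsets(duration_sec: float, count: int = 20,
                   sample_len: float = 1.0) -> list[float]:
    """Start times (seconds) of `count` samples spread evenly over the file."""
    if count < 1:
        raise ValueError("count must be at least 1")
    usable = max(duration_sec - sample_len, 0.0)
    if count == 1 or usable == 0.0:
        return [0.0]
    step = usable / (count - 1)
    return [round(i * step, 3) for i in range(count)]


def estimate_bitrate_kbps(avg_i_bytes: float, avg_p_bytes: float,
                          fps: float, gop_frames: int) -> float:
    """Estimate the full-file bitrate from average frame sizes.

    Assumed model: every GOP holds one I frame and (gop_frames - 1) P frames.
    """
    if gop_frames < 1 or fps <= 0:
        raise ValueError("gop_frames and fps must be positive")
    bytes_per_gop = avg_i_bytes + (gop_frames - 1) * avg_p_bytes
    gops_per_sec = fps / gop_frames
    return bytes_per_gop * gops_per_sec * 8 / 1000
```

For a 20-second clip, `sample_offsets(20.0)` places one sample at the start of each second; for longer files the same 20 samples are spaced further apart, which is what keeps the evaluation time roughly constant.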
The verification process required a much larger number of samples. We randomly selected several thousand entries from the last 3 months. This list was used for all following proof-of-concept tests.
For the verification tests, around 100 samples were used. In most cases, the sampled complexity evaluation results were ~10-20% higher than the ‘non-sampled’ complexity evaluation. But there were several cases when the sampled results were ~30-40% lower than ‘non-sampled’. Since the sampled complexity value would be used to set the max rendition bitrate, the resulting rendition files in those cases would have insufficient quality.
In order to avoid quality reduction issues, the final encoding flow limits the Content Aware Encoding ‘gain’ to at most 50% of the video bitrate value in the transcoding parameters. For example, if the transcoding parameters bitrate is set to 4000Kbps and the source complexity bitrate is 1000Kbps, the CAE logic will set the max bitrate to 2000Kbps (even though the complexity level is 1000Kbps).
For roughly 1000 random sample entries, we used the assets’ command lines to generate the proof-of-concept files. The video bitrate in those command lines was changed to the higher of the recommended source complexity bitrate (see above) and 50% of the predefined asset bitrate. All the other encoding parameters remained the same.
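The gain limit amounts to a simple clamp, sketched below. The function and parameter names are hypothetical, and capping the result at the configured bitrate when the source is more complex than the preset is an assumption that the text does not state explicitly.

```python
def cae_max_bitrate_kbps(configured_kbps: float, complexity_kbps: float,
                         max_gain: float = 0.5) -> float:
    """Pick the max rendition bitrate under a limited CAE gain.

    The bitrate never drops below (1 - max_gain) of the configured value,
    and (assumption) never rises above the configured value.
    """
    if not 0.0 <= max_gain <= 1.0:
        raise ValueError("max_gain must be between 0 and 1")
    floor_kbps = configured_kbps * (1.0 - max_gain)
    return min(configured_kbps, max(complexity_kbps, floor_kbps))
```

With the numbers from the example, `cae_max_bitrate_kbps(4000, 1000)` gives 2000.0, while a source whose complexity is 3000Kbps would be encoded at 3000Kbps.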
Lowering the bitrate causes some quality degradation. The goal of this proof-of-concept phase was to check whether the final quality of the files generated in ‘Content Aware’ mode was still sufficient, despite the lower bitrate. PSNR was used as the main quality metric because there is quite a lot of data linking PSNR values to subjective quality perception. For each sample, PSNR was calculated both for the original asset files and for the POC renditions.
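For readers unfamiliar with the metric, PSNR is derived from the mean squared error between reference and encoded pixel values: PSNR = 10 · log10(MAX² / MSE), in dB. The minimal sketch below works on flat lists of 8-bit sample values; the function name is hypothetical, and a real measurement would run per frame over decoded video (for instance with ffmpeg's `psnr` filter) rather than on Python lists.

```python
import math


def psnr_db(reference: list[int], encoded: list[int],
            max_value: int = 255) -> float:
    """Peak signal-to-noise ratio in dB between two equal-length sample lists."""
    if not reference or len(reference) != len(encoded):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, encoded)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10 * math.log10(max_value ** 2 / mse)
```

Higher values mean the encode is closer to the reference, so the comparison in this phase is the PSNR delta between the original asset and the POC rendition of the same source.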
The following are some PSNR values-of-interest:
Therefore:
For high quality sources, the PSNR delta between POC and asset was, in most cases, very small: in the ‘GOOD’ to ‘OK’ range (see above). For low quality sources, the PSNR difference was disturbingly high: up to 1-2 (before the 50% gain limitation, it was up to 4 and sometimes higher). But despite the large PSNR gap, there was no visual quality difference between POC files and the assets.
We activated Content Aware Encoding for several test customers to see whether there were any issues or complaints. In parallel, we tested some of the content (both playback & PSNR) and monitored to make sure that no customer complaints or issues cropped up.
The next step was to activate CAE for one of our ‘content intensive’ customers. The following are some resulting stats:
| | Samples | AVG Source | AVG Asset |
| CAE | 3921 | 7602 | 7952 |
| NON-CAE | 6041 | 7258 | 9663 |
This is a comparison of 3 weeks of the content intensive customer’s load, with and without CAE, for sources with a bitrate above 2000Kbps.
Assuming that the content style in both 3-week periods is approximately the same, the ‘AVG Asset’ of the CAE period is ~18% lower than for NON-CAE (7952 vs. 9663). The customer had no issues or complaints.
Here we are today. Having thoroughly tested CAE to our satisfaction, we have now officially changed the Kaltura transcoding logic to apply Content Aware Encoding (CAE) to all content ingested into Kaltura. The default ‘ContentAwareness’ value was changed to 0.5, activating CAE as a default for all Kaltura transcodings.
We’re excited to move into this new age of Content Aware Encoding. By reducing the chance of buffering without significantly impacting the crispness of the image, we can offer our clients and their end users a significantly improved overall experience, while reducing the bandwidth load. Enjoy!