
Loop filtering technology based on deep learning

1. What is deep learning?

Deep learning is a branch of machine learning: a family of algorithms, built on artificial neural networks, that perform representation learning on data [1]. Typically, simple representations are learned first (such as image edges), then higher-level, more abstract representations are built on top of them (such as image corners, and then parts of objects), until a high-level representation suitable for a specific task is obtained. Because the learned representations are usually stacked in many layers, the approach is called deep learning.

The convolutional neural network (Convolutional Neural Network, CNN) is a common type of neural network in deep learning. Its architecture makes it particularly well suited to image processing tasks, so CNNs have achieved great success in image classification, object recognition, super-resolution, and many other fields.

2. What is loop filtering?

Lossy image/video compression introduces an irreversible quantization operation, so there is distortion between the reconstructed image and the original image. Loop filtering filters the reconstructed image to reduce this distortion, thereby improving compression efficiency and reconstruction quality. Common filters in video coding standards include the deblocking filter (Deblocking Filter, [2]), sample adaptive offset (Sample Adaptive Offset, [3]), and the adaptive loop filter (Adaptive Loop Filter, [4]). Note that a given distorted image may correspond to infinitely many possible original images, so the quality enhancement here is an ill-posed problem: hardly any filter can restore the original image perfectly.
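To make the notion of distortion concrete, the sketch below measures the difference between an original and a reconstructed image with PSNR, the usual quality metric in this field. The random noise standing in for codec distortion is an illustrative assumption, not output from any codec discussed here.

```python
import numpy as np

def psnr(original: np.ndarray, reconstructed: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two 8-bit images, derived from MSE."""
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images: no distortion
    return 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
orig = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
# Stand-in for a compressed reconstruction: original plus mild noise.
recon = np.clip(orig.astype(np.float64) + rng.normal(0, 5, size=(64, 64)), 0, 255).astype(np.uint8)
quality = psnr(orig, recon)
```

A loop filter succeeds exactly when it raises this number toward (but, for an ill-posed problem, never to) infinity.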

3. Why can deep learning do loop filtering?

The most common way to attack an ill-posed problem is to turn prior knowledge about the signal into regularization constraints, thereby shrinking the solution space. The convolutional network architecture itself has been shown to capture the priors of natural images well [5], so filtering an image with a CNN amounts to applying an implicit regularization. Moreover, a CNN can learn further task-specific priors (here, for removing compression distortion) from massive training data, establishing a mapping from the distorted image to the original image and thus accomplishing image quality enhancement.

4. How is it done?

An early attempt is [6], which uses a CNN to replace the loop filters (deblocking and SAO) in HEVC (High Efficiency Video Coding); it enhances the quality of the compressed video and achieves good results.

To accomplish this task, training data must be prepared first, i.e., distorted images paired with the corresponding originals, so that the CNN can capture the distorted-to-original mapping during training. To this end, the authors compress the training-set images with HEVC to obtain distorted reconstructed images. Pairing these with the corresponding originals yields a large number of {distorted image, original image} training samples.
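The pairing step above can be sketched as follows. The patch size, stride, and the noise used as a stand-in for HEVC compression are illustrative assumptions, not values from [6].

```python
import numpy as np

def extract_patch_pairs(original, distorted, patch=32, stride=32):
    """Cut aligned {distorted, original} patch pairs from one image pair."""
    h, w = original.shape[:2]
    pairs = []
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            pairs.append((distorted[y:y + patch, x:x + patch],   # network input
                          original[y:y + patch, x:x + patch]))   # training target
    return pairs

rng = np.random.default_rng(1)
orig = rng.integers(0, 256, (96, 128), dtype=np.uint8)
# Placeholder distortion; in [6] this would be the HEVC reconstruction.
dist = np.clip(orig + rng.normal(0, 3, orig.shape), 0, 255).astype(np.uint8)
pairs = extract_patch_pairs(orig, dist, patch=32, stride=32)
```

In practice the same loop runs over every image in the training set, producing the large sample pool the text describes.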

Next, the network structure must be determined. The CNN here accepts an MxM distorted image as input and outputs an MxM filtered image. Loop filtering therefore belongs to the category of image quality enhancement, so one can draw on network architectures used for low-level computer vision tasks such as denoising and super-resolution. In this paper, the authors designed the 4-layer CNN shown in Figure 1; see Table 1 for the specific configuration. Because the input and output resolutions of such tasks coincide, and to preserve precise spatial position information, the network contains no downsampling operations (this is the common choice; some networks do downsample to enlarge the spatial receptive field while increasing the number of convolution channels to maintain high-fidelity reconstruction). Convolution kernels of two different sizes are used in the two middle layers to obtain multi-scale features. In addition, the network contains a skip connection that adds the input image to the output of the last CNN layer. The network's actual output thus becomes the residual between the original and the reconstructed image, embodying the idea of residual learning [7]. Residual learning accelerates convergence and improves the performance reached after convergence.
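The skip connection behind residual learning can be shown in a few lines: the network predicts only a residual, and the input is added back. The "oracle" residual predictor below is a placeholder for a trained CNN, used only to show the wiring.

```python
import numpy as np

def forward_with_skip(distorted, net):
    """Residual learning: output = input + predicted residual."""
    residual_hat = net(distorted)     # CNN output (here: a stand-in)
    return distorted + residual_hat   # skip connection adds the input back

orig = np.arange(16.0).reshape(4, 4)
dist = orig + 2.0                     # toy constant distortion
perfect_net = lambda x: orig - x      # oracle residual predictor (placeholder)
out = forward_with_skip(dist, perfect_net)
```

Because the residual is typically small and zero-mean, it is an easier target for the network than the full image, which is the intuition behind the faster convergence noted above.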

Figure 1. Network structure in [6]: a 4-layer fully convolutional network

Table 1. Network configuration in [6]

To drive the CNN training, a loss function is also required; the authors adopt the MSE loss L(θ) = (1/N) · Σ_{i=1}^{N} ||F(X_i; θ) − Y_i||². Here N is the number of training samples, X_i and Y_i denote the distorted image and the corresponding original image respectively, F(·; θ) is the network mapping, and θ represents all parameters of the convolutional network (weights and biases).
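The MSE loss described above is just a per-pixel mean squared error; a minimal version is given below (not code from the paper).

```python
import numpy as np

def mse_loss(pred_batch, target_batch):
    """Mean squared error between network output and original images,
    averaged over all samples and pixels, as frameworks typically do."""
    diff = pred_batch.astype(np.float64) - target_batch.astype(np.float64)
    return np.mean(diff ** 2)

pred = np.array([[1.0, 2.0], [3.0, 4.0]])
target = np.array([[1.0, 2.0], [3.0, 6.0]])
loss = mse_loss(pred, target)  # one pixel off by 2 -> 4/4 = 1.0
```

In PyTorch the equivalent would be the built-in MSE criterion; the point here is only that the loss compares F(X; θ) against Y element-wise.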

At this point, the network weights can be updated by stochastic gradient descent, which a deep learning framework (such as PyTorch) usually performs automatically; the parameters obtained after training are the ones that are best on average over the whole training set. A reader may ask: the parameters are optimal relative to the training set, but how do they perform on unseen data? We usually assume that the training set and the test set are sampled from the same distribution, which ensures that the CNN performs similarly on the test set. To improve generalization, the training set should contain enough samples to reflect the true distribution of the data. In addition, there are other ways to keep the network from overfitting, such as imposing a 1-norm or 2-norm constraint on the CNN parameters θ; such methods are collectively called regularization, a very important research area in deep learning.
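As a toy illustration of gradient-based training with a 2-norm (weight-decay) regularizer, the sketch below runs full-batch gradient descent on a least-squares problem small enough to verify by hand; the learning rate, penalty weight, and data are illustrative, not from any paper cited here.

```python
import numpy as np

def gd_step(w, X, y, lr=0.05, lam=1e-4):
    """One gradient step on loss = ||Xw - y||^2 / n + lam * ||w||^2.
    The 2*lam*w term is the gradient of the 2-norm regularizer."""
    n = len(y)
    grad = 2.0 / n * X.T @ (X @ w - y) + 2.0 * lam * w
    return w - lr * grad

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                 # noiseless targets for this toy problem
w = np.zeros(3)
for _ in range(500):           # iterate until (near) convergence
    w = gd_step(X=X, y=y, w=w)
```

With a tiny penalty weight the regularizer barely biases the solution, but the same 2-norm term is what keeps a large CNN's weights from overfitting.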

The authors trained a separate model for each QP point, namely QP in {22, 27, 32, 37}, then integrated the models into the reference software for testing; the final performance is shown in Table 2. The comparison anchor is HM-16.0 under the All-Intra configuration. Because testing was done only under All-Intra, the CNN filtering can also be regarded as a post-processing operation; if the filtered frame were used as a reference for coding subsequent frames, it would become a loop filter.

Table 2. BD-rate savings of [6] compared with HEVC

5. JVET-U0068

Next, let's focus on the loop filter design in one of this year's JVET proposals [8]. The algorithm, called DAM (Deep-filters with Adaptive Model-selection), is designed for VTM-9.0, the reference software of the latest generation of video coding standards, and includes new features such as model selection.

Figure 2. (a) The CNN architecture used by DAM, where M denotes the number of feature maps and N denotes the spatial resolution of the feature maps; (b) structure of a residual unit.

Network architecture: The backbone of the DAM filtering method is shown in Figure 2 above. To enlarge the receptive field and reduce complexity, the network contains a convolution layer with stride 2, which halves the spatial resolution of the feature maps in both the horizontal and vertical directions relative to the input. The feature maps output by this layer pass through several sequentially stacked residual units; a final convolution layer then takes the features from the last residual unit and outputs 4 feature maps. Finally, a shuffle layer generates the filtered image with the same spatial resolution as the input. Other details of this architecture are as follows:

  1. All convolution layers use 3x3 kernels. For the internal convolution layers, the number of feature maps is set to 128. PReLU is used as the activation function.

  2. Different models are trained for different slice types.

  3. When training the CNN filter for intra slices, prediction and partitioning information is also fed into the network.
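The stride-2 downsampling plus the final shuffle layer rely on a depth-to-space rearrangement: with factor 2, the 4 output feature maps are interleaved back into one full-resolution image. A minimal numpy version is sketched below, following the common PixelShuffle channel ordering; the proposal's exact layout is not specified here.

```python
import numpy as np

def pixel_shuffle(x, r=2):
    """Depth-to-space: rearrange (C*r*r, H, W) features into (C, H*r, W*r)."""
    c_rr, h, w = x.shape
    c = c_rr // (r * r)
    x = x.reshape(c, r, r, h, w)     # split channels into (c, r, r)
    x = x.transpose(0, 3, 1, 4, 2)   # interleave: (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)

feat = np.arange(4 * 2 * 2).reshape(4, 2, 2)  # 4 half-resolution feature maps
out = pixel_shuffle(feat, r=2)                # one 4x4 full-resolution map
```

Each 2x2 output block draws one sample from each of the 4 input channels, which is why the network needs exactly r*r = 4 feature maps before the shuffle.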

Adaptive model selection: First, each slice or CTU can decide whether to use the CNN-based filter at all. Second, when a slice or CTU does use the CNN-based filter, it can further decide which of three candidate models to use. To this end, different models are trained with QP values in {17, 22, 27, 32, 37, 42}. Denoting the QP used to encode the current sequence as q, the candidate models are the three trained for {q, q-5, q-10}. The selection is based on a rate-distortion cost, and the relevant mode information is then written into the bitstream.
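The selection step can be sketched as a rate-distortion comparison over the three candidates. The distortion values, signalling bits, and lambda below are illustrative placeholders, not data from the proposal.

```python
def select_model(candidates, lam):
    """Pick the candidate minimizing the RD cost J = D + lam * R.
    candidates: list of (model_id, distortion, rate_bits)."""
    best = min(candidates, key=lambda c: c[1] + lam * c[2])
    return best[0]

q = 37                       # sequence QP; candidates are q, q-5, q-10
candidates = [               # (model id, SSD after filtering, signalling bits)
    (f"model_qp{q}",      1200.0, 2),
    (f"model_qp{q - 5}",  1150.0, 2),
    (f"model_qp{q - 10}", 1300.0, 2),
]
choice = select_model(candidates, lam=30.0)
```

The same J = D + λR comparison, extended with a "filter off" candidate, also covers the per-slice/per-CTU on-off decision described above.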

Inference: PyTorch is used to run online inference of the CNN inside VTM. Table 3 below gives the relevant network information for the inference stage.


Table 3. Network information during inference

Training: With PyTorch as the training platform, CNN filters are trained separately for intra slices and inter slices, and different models are trained to fit different QP points. The network information for the training stage is listed in Table 4.


Table 4. Network information during training

Experimental results: The authors integrated DAM into VTM-9.0 for testing, with the existing deblocking filter and SAO turned off and ALF (and CCALF) placed after the CNN-based filter. The test results are shown in Tables 5-7. Under the AI, RA, and LDB configurations, the BD-rate savings for the Y, Cb, and Cr channels are {8.33%, 23.11%, 23.55%}, {10.28%, 28.22%, 27.97%}, and {9.46%, 23.74%, 23.09%}, respectively. In addition, Figure 3 shows a subjective comparison before and after applying DAM.

RA         Y        U        V        EncT  DecT
Class A1   -9.25%   -22.08%  -22.88%  248%  21338%
Class A2   -11.20%  -29.17%  -28.76%  238%  19116%
Class B    -9.79%   -31.05%  -29.57%  248%  23968%
Class C    -10.97%  -28.59%  -29.18%  196%  21502%
Class E
Overall    -10.28%  -28.22%  -27.97%  231%  21743%
Class D    -12.17%  -28.99%  -30.27%  187%  20512%
Class F    -5.27%   -16.93%  -16.58%  337%  9883%

Table 5. Performance of DAM on VTM-9.0 (RA)

LDB        Y        U        V        EncT     DecT
Class A1
Class A2
Class B    -8.69%   -24.51%  -25.26%  235%     10572%
Class C    -10.21%  -24.50%  -24.60%  189%     11145%
Class E    -9.75%   -21.46%  -17.47%  455%     8730%
Overall    -9.46%   -23.74%  -23.09%  258%     10257%
Class D    -11.56%  -26.59%  -27.98%  182%     12071%
Class F    -5.05%   -15.83%  -15.41%  331.70%  5723%

Table 6. Performance of DAM on VTM-9.0 (LDB)

AI         Y        U        V        EncT  DecT
Class A1   -7.04%   -19.72%  -20.94%  423%  14896%
Class A2   -7.32%   -24.01%  -22.73%  268%  13380%
Class B    -7.48%   -24.24%  -24.06%  240%  13606%
Class C    -8.90%   -22.61%  -25.08%  176%  10061%
Class E    -11.30%  -24.37%  -24.11%  274%  14814%
Overall    -8.33%   -23.11%  -23.55%  257%  13065%
Class D    -8.75%   -22.68%  -24.96%  158%  10571%
Class F    -5.03%   -15.94%  -15.38%  170%  9364%

Table 7. Performance of DAM on VTM-9.0 (AI)

Figure 3. Left: original frame (from the JVET common test sequence BlowingBubbles); middle: VTM-11.0 compression, QP 42, reconstruction quality 27.78 dB; right: VTM-11.0 + DAM, QP 42, reconstruction quality 28.02 dB.

6. Summary and Outlook

In recent years, research on loop filtering (including post-processing) has mainly been distributed across the following directions (but is not limited to them):

  1. Using more input information: besides the reconstructed frame, information such as prediction and partitioning can also be fed to the CNN [9];
  2. Using more sophisticated network structures [9, 10, 11];
  3. Exploiting the correlation between video frames to improve performance [12, 13];
  4. Designing a unified model for different quality levels [14].

Overall, deep-learning-based coding tools are flourishing. While they show attractive performance, they also incur high complexity; to make these deep coding tools practical, optimizing the complexity-performance trade-off will be a very important research direction.

7. References

[1] Wikipedia contributors. "Deep learning," Wikipedia, The Free Encyclopedia, 10 Mar. 2021. Web. 10 Mar. 2021. <zh.wikipedia.org/w/index.php…>.

[2] Norkin, Andrey, et al. "HEVC deblocking filter," IEEE Transactions on Circuits and Systems for Video Technology 22.12 (2012): 1746-1754.

[3] Fu, Chih-Ming, et al. "Sample adaptive offset in the HEVC standard," IEEE Transactions on Circuits and Systems for Video technology 22.12 (2012): 1755-1764.

[4] Tsai, Chia-Yang, et al. "Adaptive loop filtering for video coding," IEEE Journal of Selected Topics in Signal Processing 7.6 (2013): 934-945.

[5] Ulyanov, Dmitry, et al. "Deep image prior," Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.

[6] Dai, Yuanying, et al. "A convolutional neural network approach for post-processing in HEVC intra coding," International Conference on Multimedia Modeling. Springer, Cham, 2017.

[7] He, Kaiming, et al. "Deep residual learning for image recognition," Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

[8] Li, Yue, et al. "Convolutional neural network-based in-loop filter with adaptive model selection," JVET-U0068. Teleconference, 2021.

[9] Lin, Weiyao, et al. "Partition-aware adaptive switching neural networks for post-processing in HEVC," IEEE Transactions on Multimedia 22.11 (2019): 2749-2763.

[10] Ma, Di, et al. "MFRNet: a new CNN architecture for post-processing and in-loop filtering," IEEE Journal of Selected Topics in Signal Processing (2020).

[11] Zhang, Yongbing, et al. "Residual highway convolutional neural networks for in-loop filtering in HEVC," IEEE Transactions on image processing 27.8 (2018): 3827-3841.

[12] Yang, Ren, et al. "Multi-frame quality enhancement for compressed video," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.

[13] Li, Tianyi, et al. "A deep learning approach for multi-frame in-loop filter of HEVC," IEEE Transactions on Image Processing 28.11 (2019): 5663-5678.

[14] Zhou, Lulu, et al. "Convolutional neural network filter (CNNF) for intra frame," JVET-I0022. Gwangju, 2018.

Copyright notice
This article was created by the Byte video cloud technology team. Please include a link to the original when reprinting. Thanks.
https://cdmana.com/2021/07/20210730173821904h.html
