Applying CUDA Architecture to Accelerate Full Search Block Matching Algorithm for High Performance Motion Estimation in Video Encoding

2011 23rd International Symposium on Computer Architecture and High Performance Computing
Eduarda Monteiro, Bruno Vizzotto, Cláudio Diniz, Bruno Zatt, Sergio Bampi
Informatics Institute - PPGC - PGMICRO
Federal University of Rio Grande do Sul (UFRGS)
Porto Alegre, Brazil
{ermonteiro, bbvizzotto, cmdiniz, bzatt, bampi}

Abstract- This work presents a parallel GPU-based solution for the Motion Estimation (ME) process in a video encoding system. We propose a way to partition the steps of the Full Search block matching algorithm in the CUDA architecture. We also compare the performance achieved by this solution with a theoretical model and with two other implementations (sequential, and parallel using the OpenMP library). We obtained an O(n²/log²n) speed-up, which fits the proposed theoretical model for different search areas. It represents a gain of up to 600x compared to the serial implementation and 66x compared to the parallel OpenMP implementation.

Keywords- Motion Estimation; H.264/AVC; GPU; CUDA.

I. INTRODUCTION

In the past decade, the demand for high-quality digital video applications has drawn the attention of industry and academia to the development of advanced video coding techniques and standards. This resulted in the publication of H.264/AVC [1], the state-of-the-art video coding standard, which provides higher coding efficiency at the cost of increased computational complexity compared with previous standards (MPEG-2, MPEG-4, H.263). Boosted by the evolution of codecs and devices, digital video coding is now present in a wide range of applications, including TV broadcasting, video conferencing, video surveillance, and portable multimedia devices, to list a few. Among all the innovative tools featured by the latest video coding standards, Motion Estimation (ME) is the most important for obtaining significant coding gains.
In H.264/AVC, ME provides even more coding efficiency due to the introduction of bi-prediction and variable block-size capabilities [1]. These new tools represent a significant complexity increase in comparison to predecessor standards, posing a big challenge to real-time encoding of high-definition video.

Motion Estimation exploits the temporal redundancy of a video by searching reference frames (previously encoded frames) for the region most similar to each region of the current frame. It is performed on a block-by-block basis: for each block of a video frame, a block matching algorithm finds the best ‘match’ in a reference frame. The best ‘match’ is defined by a similarity criterion, e.g. the Sum of Absolute Differences (SAD), the most commonly used in commercial and research implementations. Once the best matching candidate is selected, a motion vector (MV) pointing to that position is calculated. The MV indicates to the video decoder the location, in the reference frame, of the most similar block, which must be used to ‘predict’ the current block. In this way, only the motion vector and the residues, i.e. the pixel-by-pixel difference between the current block and the best matching block, are transmitted to the decoder. The block matching algorithm and the similarity criterion are important encoder issues that are not standardized by H.264/AVC.

The block matching task requires intensive computation and memory communication, representing 80% of the total computational complexity of current video encoders [3]. Some block matching algorithms for ME have great potential for parallelism. Full Search (FS) [2], the optimal solution, searches for the best match exhaustively inside a search area by calculating the similarity for every candidate block position (with a pixel-by-pixel offset). In the Full Search algorithm, a search area of 128x128 pixels and a block size of 4x4 give (128 - 4 + 1)² = 15625 candidate blocks to be evaluated.
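The candidate count and the exhaustive search described above can be illustrated with a minimal C++ sketch. It is only an illustration under stated assumptions, not the authors' implementation: the names `candidate_count`, `Match`, `sad_4x4` and `full_search_4x4` are hypothetical, the search area is assumed to be a square row-major array of 8-bit luma samples, and candidate positions are expressed as offsets inside that area.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <vector>

using std::uint8_t;
using std::uint32_t;

// Hypothetical helper: number of candidate positions when a square block
// slides one pixel at a time inside a square search area.
constexpr int candidate_count(int search_size, int block_size) {
    return (search_size - block_size + 1) * (search_size - block_size + 1);
}

// The figure quoted in the text: 128x128 search area, 4x4 block.
static_assert(candidate_count(128, 4) == 15625, "125 x 125 candidate positions");

// Hypothetical result type: best offset inside the search area and its SAD.
struct Match {
    int x;
    int y;
    uint32_t sad;
};

// SAD between a 4x4 current block and the candidate whose top-left corner is
// (cx, cy) in a row-major search area with `stride` samples per row.
uint32_t sad_4x4(const uint8_t* cur, const uint8_t* area, int stride, int cx, int cy) {
    uint32_t sum = 0;
    for (int j = 0; j < 4; ++j)
        for (int i = 0; i < 4; ++i)
            sum += std::abs(int(cur[j * 4 + i]) - int(area[(cy + j) * stride + cx + i]));
    return sum;
}

// Exhaustive search: evaluate every candidate and keep the lowest SAD.
// Each SAD is independent of the others, which is what makes FS parallelizable.
Match full_search_4x4(const uint8_t* cur, const std::vector<uint8_t>& area, int search_size) {
    assert(int(area.size()) == search_size * search_size);
    Match best{0, 0, std::numeric_limits<uint32_t>::max()};
    for (int cy = 0; cy + 4 <= search_size; ++cy)
        for (int cx = 0; cx + 4 <= search_size; ++cx) {
            const uint32_t s = sad_4x4(cur, area.data(), search_size, cx, cy);
            if (s < best.sad) best = {cx, cy, s};
        }
    return best;
}
```

Calling `full_search_4x4(cur, area, 128)` would evaluate all 15625 candidates for one 4x4 block; the returned offset plays the role of the motion vector relative to the top-left corner of the search area.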
However, the SAD calculation for one block has no data dependencies on other blocks, which enables simultaneous parallel processing. In other words, ME using the FS algorithm has promising potential for efficient implementation on massively parallel architectures.

Recently, the academic and industrial parallel processing communities have turned their attention to Graphics Processing Units (GPUs). GPUs were originally developed for graphics processing and 3D rendering but, due to their great potential for parallelization, their capabilities and flexibility were extended to target general-purpose processing. These devices are referred to as GPGPU (General Purpose GPU). The FERMI architecture [3], first proposed by NVIDIA in 2009, is the most popular and prominent GPGPU solution available nowadays. It delivers a large increase in computing performance over previous architectures. Its main innovations are the inclusion of two levels of cache (L1 and L2), the addition of cores and special units for data handling, and double-precision floating point units. The CUDA architecture (Compute Unified Device Architecture) [4] was proposed by NVIDIA in 2007 [3] with the objective of exploiting the high degree of parallelism inherent to their graphics devices. The great computational power offered by this type of technology has given this architecture great prominence in diverse areas, especially in the scientific community.

1550-6533/11 $26.00 © 2011 IEEE
DOI 10.1109/SBAC-PAD.2011.19

By exploiting the inherent parallelism of the ME FS algorithm and the large parallel processing power of recent GPUs, this work presents a parallel GPU-based solution for the FS block matching algorithm implemented on CUDA.
We propose an efficient mapping of the FS algorithm to the CUDA programming model. Further, the performance of our solution running motion estimation on real video sequences is compared to the state-of-the-art and to in-house serial and OpenMP [14] implementations. It is also compared to the theoretical complexity calculated in terms of computation and communication.

This paper is organized as follows. Section II presents the state-of-the-art related work. Basic Motion Estimation concepts are presented in Section III. Section IV describes our Motion Estimation Full Search block matching algorithm developed in CUDA. Section V shows the results, analysis, and comparisons with the state-of-the-art. Section VI concludes the work.

II. RELATED WORK

The H.264/AVC standard is considered the state-of-the-art in video coding because it achieves better results than earlier standards (e.g. MPEG-2). Different video encoding software solutions for H.264/AVC have been developed since its standardization, e.g. the JM reference software [5] and the x264 free software library [6]. However, this software has either no libraries for GPU acceleration or only proprietary ones (the case of x264).

Some works aiming to accelerate H.264/AVC video encoding using GPUs can be found in the literature. The works described in [7]-[9] focus specifically on the implementation of ME algorithms on GPU cards, which is directly related to the scope of our work.

Chen et al. [7] presented a Motion Estimation algorithm using GPU and the CUDA architecture, considering the Full Search algorithm. This work divides the ME into different steps to achieve high parallelism with low data transfer between CPU and GPU memory. These steps are: (i) SAD values calculation for fixed-size blocks of 4x4 pixels; (ii) SAD values calculation for variable block sizes; (iii) SAD comparisons for integer accuracy; (iv) SAD comparisons for fractional accuracy; (v) refinement of ME for fractional accuracy.
Because the work of Chen et al. considers variable block sizes, a large number of candidate blocks must be evaluated, which limited its performance. Moreover, the ME with fractional accuracy needed a refinement step, which also affected the presented results. The device considered in that paper is an NVIDIA GeForce 8800GTX with CUDA architecture.

Lin et al. [8] focus on efficient parallelization of Motion Estimation. That paper does not consider the CUDA architecture, since CUDA had not yet been proposed at the time of that work. Thus, it uses texture memory as the main means of data management. The proposed algorithm is based on a technique called multi-pass encoding and considers the Full Search algorithm. The main drawback of this approach is the performance limitation imposed by the multiple iteration steps for SAD calculation and SAD value comparisons. The device considered in that paper is an NVIDIA GeForce 6800GT.

Lee et al. [9] present three alternatives for ME on GPU based on the Full Search algorithm: Integer Accuracy, Fractional Accuracy, and Integer Accuracy considering three reference frames in parallel. The best performance results were achieved with integer accuracy considering three reference frames. The solution with fractional accuracy enables further refinement of the matching algorithm (increasing quality), but it is performance-wise inefficient on GPUs due to data dependencies. In that work the considered device was an NVIDIA GeForce 7800GT.

III. MOTION ESTIMATION

Fig. 1 shows the block diagram of a generic video encoder. The blocks illustrated in Fig. 1 have the common goal of reducing the existing types of redundancy in digital videos: temporal, spatial, and entropic. In summary, the main blocks are: (i) inter-frame prediction: aims to reduce the temporal redundancy between frames of a video, i.e.,
it focuses on the correlation between temporally neighboring frames; (ii) intra-frame prediction: reduces spatial redundancy in the current frame (the frame being encoded), i.e., it focuses on the correlation between the pixels distributed within the same frame; (iii) transforms and quantization: responsible for reducing information that is irrelevant to the human visual system in order to achieve higher coding gain; (iv) entropy coding: focuses on reducing entropic redundancy, i.e., it concerns the representation of coded symbols, assigning smaller codes to the most frequent symbols.

Figure 1. H.264/AVC Encoder System Diagram.

The ME process (part of inter-frame prediction, see Fig. 1) is detailed in Fig. 2. Basically, the ME identifies motion between neighboring frames in a scene and provides a map of displacement using motion vectors. Firstly, one (or more) reference frames are selected within the group of pictures (GOP). Next, for each block inside the current frame, a search for the most similar block is performed. The search is bounded by a region called the search area (filled area in Fig. 2). The search area is the region in the reference frame typically centered on the same relative position as the current block. For each block, a motion vector (represented as a tuple of x and y coordinates) is calculated, pointing to the position of the block with the highest similarity in the reference frame. Thus, as a product of this process, only the motion vector and the residual data are transmitted instead of the original information regarding the entire frames.

Figure 2. Optimum Motion Estimation Algorithm Diagram.

A. Full Search Algorithm

Considering that any block can be chosen when the ME is being performed, all the blocks that constitute the search area can be classified as candidate blocks. In this way, a search algorithm determines how the candidate block search moves inside the search area in order to find the best match.

Among the different search algorithms for Motion Estimation, Full Search is considered in this work. The Full Search algorithm aims to find the ‘best match’ between the block of the current frame and all possible positions inside the search area set in the reference frame. Thus, this algorithm is computationally more expensive than the other algorithms in the literature. Despite this, the algorithm is considered optimal because it is capable of generating improved motion vectors (best matches), resulting in the best video quality and best encoding efficiency compared with all the fast motion estimation algorithms [2].

B. Similarity Criteria

To find the best match, a metric is required to evaluate the differences between the current and candidate blocks. In this context, a similarity criterion must be used. This criterion is also known as a distortion criterion. The distortion (or difference) is inversely proportional to the degree of similarity between the blocks.

Different similarity criteria are often used in video coding: (i) Mean Square Error (MSE); (ii) Sum of Absolute Transformed Differences (SATD); (iii) Sum of Absolute Differences (SAD). Considering its simplicity, the SAD is the most used similarity criterion for ME and it is also adopted in this work. SAD calculates the distortion between the current block and each candidate block in the search area by adding the pixel-by-pixel absolute differences:

[Equation (1)]

The SAD definition is presented in (1), where w is the width and h is the height of both the candidate and current blocks. The candidate block is chosen when it presents the lowest SAD value, i.e. the lowest distortion in relation to the current block. The position (x,y) of the best candidate block is represented by the motion vector.

C. Theoretical Model

A theoretical model was used to analyze the Full Search algorithm behavior in advance. The PRAM (Parallel Random-Access Machine) model was considered to calculate the parallel complexity. However, PRAM alone is not enough to model the parallel complexity of the GPU architecture, since this paradigm assumes shared memory and thus neglects practical issues such as communication time. We therefore extended the PRAM model by adding the concept of communication between CPU and GPU.

The main goal of this analysis is to establish a comparative basis for the experimental results. The complexity calculation was based on the following variables: (i) size of the current block: 4x4; (ii) size of the search area: n x n pixels, where n refers to the search area width and height (square search area); (iii) frame resolution: M x N, where M and N are the frame width and height in pixels, respectively; (iv) similarity criterion: SAD.

1) Sequential Complexity: In order to calculate the sequential complexity of the ME FS algorithm, we considered the total number of subtractions, absolute value calculations, SAD operations, and SAD comparisons for a 4x4 block. The number of blocks in the search area and the frame resolution were also taken into account. In conclusion, the algorithm presents a complexity of O(n²).

[Equation (2)]

2) Parallel Complexity (PRAM): To calculate the parallel complexity of the ME FS algorithm we first used the PRAM model. A granularity of log(n) x log(n) was considered for each sub-problem inside the n x n search area. Thus, the obtained parallel complexity is O(log²n), based on:

• Number of processors P(n): considering that each thread is responsible for a region of log(n) x log(n).

[Equation (3)]

• Execution Time Tp(n): the execution considers the number of subtractions and absolute value calculations, additions, and SAD comparisons between regions of size log²n:

[Equation (4)]

• Total Cost C(n): the cost to implement a parallel program (P(n) * Tp(n)):

[Equation (5)]

3) Communication: The hardware architecture considered in this work is based on a GPU as a co-processor attached to the CPU and, consequently, the CPU-GPU communication time is of key importance for the application performance. In addition to the complexity obtained by the PRAM model, we included a concept of communication based on latency (l) and throughput (d). This way, we obtained a more realistic model which takes into consideration two data transfers (input data from CPU to GPU and the calculation output from GPU to CPU) and the execution of one function (f(n)):

[Equation (6)]

It can be noted that the communication time is negligible when compared with the computation time. The expected speed-up is then shown in (7).

[Equation (7)]

IV. MOTION ESTIMATION IMPLEMENTATION FOR CUDA ARCHITECTURE

Given the computational complexity introduced by ME, and to take advantage of the significant potential for parallelism inherent to the Full Search algorithm, we propose here a highly parallelizable solution for this algorithm on graphics cards (GPUs). Among the different architectural options for general purpose GPU programming, the CUDA architecture [4] was used. CUDA, proposed in 2007 by NVIDIA, provides a programming interface for graphics processors. This architecture, based on the SIMD (Single Instruction, Multiple Data) approach, is adequate for block matching algorithms as it provides massive data-level parallelism.

In this work, the Full Search algorithm was implemented in the C++ language using CUDA functions. SAD was used as the similarity criterion to choose the best match among the candidate blocks. Only integer accuracy ME was considered. Our experiments used videos of different resolutions, such as CIF (352x288), HD720p (1280x720) and HD1080p (1920x1080), given as input parameters.

The hardware platform used in this work is composed of a CPU and a GPU. The algorithm proposed in this paper is presented in Fig. 3. Initially, the CPU divides the video into frames, from which the reference frames and current frames are selected. These frames are transferred to GPU memory, as shown in Fig. 3. The ME under CUDA proposed in this paper is composed of two steps: (i) SAD values calculation for all candidate blocks inside the search area; (ii) comparison of the SAD values of all candidate blocks to find the best matches (lowest SAD). Finally, the ME results, with all motion vectors generated for the current frame, are transferred to CPU memory and stored in a text file.

Figure 3. Proposed Algorithm Flow.

A library called Thrust [1?] was used for data manipulation between CPU and GPU. This library was developed for the CUDA architecture in order to facilitate the creation of parallel applications. Using this library, a CPU-GPU transfer requires only a simple assignment (the assignment operator is overloaded).

The execution of this application on the GPU is carried out according to the processing hierarchy of the CUDA architecture. In this hierarchy there are three important concepts: (i) thread: the basic unit of processing; (ii) block: a set of threads; (iii) grid: composed of many blocks. This way, we defined a programming model for this algorithm as shown in Fig. 4.

Based on the concepts presented in Fig. 4, the ME parallelization on the GPU was performed considering the following entities: (i) Kernel: the procedure to be executed in parallel on the GPU. The kernel is started by the CPU. This implementation is based on only one kernel, which is responsible for the execution of ME under CUDA, through SAD values calculation and SAD values comparison, searching for the lowest SAD value (ME under CUDA, Fig. 3); (ii) Thread: each thread is responsible for the computation of one 4x4 video block. We used the 4x4 video block as the basis for block-by-block comparison to achieve finer motion granularity than the macroblock (16x16 pixels); (iii) Block: the threads that execute the kernel are organized in blocks.