Block-based motion estimation analysis for lip reading user authentication systems

KHALED ALGHATHBAR, HANAN A. MAHMOUD
Centre of Excellence in Information Assurance, King Saud University, Riyadh, Kingdom of Saudi Arabia

Abstract: This paper proposes a lip reading technique for speech recognition using motion estimation analysis. The method described here is a sub-system of the Silent Pass project. Silent Pass is a lip reading password entry system for security applications: it provides user authentication based on lip reading of a password. Motion estimation is performed on image sequences of lip movement representing speech. In this methodology, the motion estimation is computed without extracting the speaker's lip contours or location, which yields robust visual features for the lip movements representing utterances. Our methodology comprises two phases, a training phase and a recognition phase. In both phases, an n × n video frame of the image sequence for an utterance (an alphanumeric character, a word, or, in more complicated analysis, a sentence) is divided into m × m blocks. Our method calculates and fits eight curves for each frame, each curve representing the motion estimation of that frame in a specific direction. These eight curves form the feature set of a frame and are extracted in an unsupervised manner. The feature set consists of the integral values of the motion estimation. These features are expected to be highly effective in the training phase. The feature sets characterize specific utterances with no additional acoustic features. A corpus of utterances and their motion estimation features is built in the training phase. The recognition phase extracts the feature set from the new image sequence of the lip movement of an utterance and compares it to the corpus using the mean square error metric for recognition.
Key-Words: Lip reading, Speech recognition, Motion estimation, User authentication, Feature extraction.

1 Introduction
Automatic Speech Recognition (ASR) systems play a successful role in recognizing speech with high accuracy rates [1]. Although high recognition accuracy can be obtained for clean speech with state-of-the-art technology even when the vocabulary is large, accuracy degrades considerably in noisy environments. Increasing robustness to noisy environments is therefore one of the most important issues in ASR. Speech recognition systems that combine auditory and visual information have been demonstrated to outperform audio-only systems. Most multi-modal speech recognition methods use visual features, typically lip information, in addition to acoustic features [3]. Visual information helps distinguish acoustically similar sounds, such as the nasal sounds /n/, /m/, and /ng/ [7, 10]. Lip reading has become a hot topic in human-computer interaction (HCI) and audio-visual speech recognition (AVSR). Lip reading systems can be used in many applications, such as aids for the hearing impaired, noisy environments where speech is highly unrecognizable, and, as recently suggested by our research in [2], password entry schemes. This paper concentrates on the visual-only lip reading system, which has also attracted significant interest. Feature extraction is a crucial part of a lip reading system, and various visual features have been proposed in the literature. In general, feature extraction methods are pixel based, where features are taken directly from the image; lip-contour based, where a detection model is used to extract the mouth area; or some combination of the two.
A typical way to extract pixel-based features is through image transforms: the Discrete Cosine Transform (DCT) [7, 8, 10-12], Principal Component Analysis (PCA) [5-9, 13], the Discrete Wavelet Transform (DWT) [7], and Linear Discriminant Analysis (LDA) [8] have all been employed for lip reading.

WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS, ISSN: 1790-0832, Issue 5, Volume 6, May 2009

Other feature extraction methods use motion analysis of image sequences of lip movement during speech. Mase et al. reported a lip reading system for recognizing connected English digits using optical-flow analysis [10]. Optical-flow analysis offers several advantages for audio-visual bimodal speech recognition: the optical-flow vectors are calculated without any prior knowledge of the shape of the object [11], [12], so the visual features can be detected without extracting lip locations and contours [13], [14]. However, such methods inherit the main disadvantages of the optical-flow methodology [15], [16].

In this paper, we propose using motion estimation analysis for robust speech recognition from lip reading alone. Visual features are extracted from the image sequences and used for model training and recognition. Block-based motion estimation techniques extract the visual features blindly, without any prior knowledge of the lip location. This paper is organized as follows: previous work is reviewed in Section 2; the principle of the motion estimation method is explained in Section 3; the proposed lip reading speech recognition system is described in Section 4; the experimental setup and results are shown in Section 5; conclusions and future work follow in Section 6.

2 Previous Work
There has been much research in recent years on building automatic visual speech recognition systems (automated lip reading).
Many systems have been designed to show that speech recognition is possible using only visual information. Some researchers have compared a number of visual feature sets in an attempt to find those that yield the best recognition performance. Such systems are either speaker dependent or speaker independent. In the literature, experiments examine isolated vowels, CVC (consonant-vowel-consonant) syllables, isolated words, connected digits, and continuous speech [1-3]. The recognition engine takes many forms: some systems are based on dynamic time warping, others use neural network architectures, and an increasing number rely on hidden Markov models (HMMs). Petajan developed one of the first audio-visual recognition systems [6]. In that system, a camera captures the mouth image and thresholds it into two levels; the mouth images are analyzed to derive the open area, perimeter, height, and width of the mouth. Speech is first processed by the acoustic recognizer to produce a few candidate words, which are then passed to the visual recognizer for the final decision. The system was later modified to use dynamic time warping, in which features of the binary images such as height, width, and perimeter, along with the derivatives of these quantities, serve as input to an HMM-based visual recognition system. He sought the combination of parameters that would lead to the best recognition performance; the feature set he settled upon was dominated by derivative information, showing that the dynamics of the visual feature set are important for speech recognition. The same conclusion was reached in a study that used optical flow as input to a visual speech recognizer [8]. In [7] it was shown that the physical dimensions of the mouth can provide good recognition performance.
They analyzed recognition performance on VCV syllables. Reflective markers were placed on the speaker's mouth and used to extract 14 distances, sampled at 30 frames per second. They experimented with both a mean squared error distance and an optimized weighted mean squared error distance; equal weighting of the parameters led to a 78% viseme recognition rate. In [3], the pixel values of the mouth image are fed to a multi-layer network with no feature extraction of the mouth height or width. Other researchers combined visual features, either geometric parameters such as mouth height and width or non-geometric parameters such as the wavelet transform of the mouth images, into a joint feature vector [8]. Researchers have also tried to convert mouth movements directly into spoken speech. In [6], a system called the "image-input microphone" takes the mouth image as input, analyzes lip features such as mouth width and height, and derives the corresponding vocal-tract transfer function, which is then used to synthesize the speech waveform. The advantage of the image-input microphone is that it is not affected by acoustic noise, and it is therefore well suited to noisy environments.

3 Motion Estimation Analysis
Motion estimation removes temporal redundancy among video frames and is a computation-intensive operation in the video encoding process. Block-based schemes assume that each block of the current frame is obtained by translating some corresponding region of a reference frame. Motion estimation tries to identify this best-matching region in the reference frame for every block in the current frame.

Fig. 1 Block-based motion estimation

In Fig. 1, the gray block on the right corresponds to the current block, and the gray area on the left represents the best match found for the current block in the reference frame. The displacement between them is called the motion vector. The search range specified by the baseline H.263 standard allows motion vectors to range between -15 and +16 pixels in either dimension, i.e. a search window of size 32×32 about the search centre. Block-matching algorithms (BMAs) for motion estimation (ME) have been widely adopted by current video compression standards, such as H.261, H.263, MPEG-1, MPEG-2, MPEG-4, and H.264 [1], due to their effectiveness and simple implementation. The most straightforward BMA is the full search (FS), which exhaustively evaluates all possible candidate blocks within the search window. However, full search is very computationally intensive and can consume up to 80% of the encoder's computational power. This limitation makes ME the main bottleneck in real-time video coding applications, including lip reading systems. Consequently, fast BMAs are used to decrease the computational cost at the expense of some accuracy in determining the correct motion vectors. Many fast BMAs have been proposed, such as the three-step search (TSS), the four-step search (4SS), and the block-based diamond search (DS). In this paper we employ these block-based techniques to induce the motion vector feature set, evaluate the performance of the different techniques, and present recognition error percentages for the different methodologies and training algorithms.

3.1 Full-search block-matching algorithm
The full-search block-matching algorithm (FSBM) finds the best match for a reference block in the current frame within a search area S in the previous frame. The criterion for the best match is the candidate block with the minimum distortion when compared with the reference block.
The measure used for calculating distortion is the sum of absolute differences (SAD) of intensity values between the two blocks. The SAD for the candidate block of size N × N at displacement (u,v) is defined as:

    SAD(u,v) = Σ_{i=1}^{N} Σ_{j=1}^{N} |v(i,j) − u(i+u, j+v)|   (1)

where v(i,j) is the intensity value at position (i,j) of the reference block and u(i+u, j+v) that at position (i+u, j+v) of the candidate block in the search area S. The search area is formed by extending the reference block by a search range w on each side, giving a search area of (2w+N)^2 pixels. As a result, there are (2w+1) candidate blocks in each of the horizontal and vertical directions, i.e. a total of (2w+1)^2 candidate blocks to be searched for each reference block. The distortion value is computed for each candidate block and the minimum value SAD_min is found. The block-matching process generates a motion vector (u,v)_min and the corresponding distortion value SAD_min.

4 Visual Lip Reading System
In this section we discuss the proposed lip reading technique. It is composed of two phases. The first is the training phase, which produces features extracted from image sequences of the lip movements of different utterances. The second is the recognition phase, in which the lip movement of a new utterance is compared against the output of the training phase and recognized.

4.1 Training phase and feature extraction
An image sequence is captured at a frame rate of 30 frames/sec and a resolution of 360×240. A block-based motion estimation technique is used: the motion vectors representing the motion of each block are computed from a pair of consecutive images, and we extract the motion vectors of the different blocks. Feature extraction in the training phase is illustrated in the algorithm below. The block-matching algorithm used can be full search, 3SS, 4SS, or DS. It produces motion vectors with values from -3 to +3 in one of the eight geographical directions. This restriction on the motion vectors is justified because lip motion during an utterance is very limited; at an effective rate of 15 frame pairs per second, the motion between compared frames is very slow. The diagram in Fig. 2 illustrates Block 1 of the training algorithm, Algorithm Training(W). Each video frame of the utterance lip movement sequence is fed to the frame division module and divided into 8×8 blocks. Each of these blocks (block_1 to block_n) is fed to the motion estimation module for motion analysis and production of the block's motion vector, yielding a set of n motion vectors {mv(block_1), ..., mv(block_n)}. Many videos of the same utterance are fed iteratively to Block 1 to calculate average motion vectors for each block of a frame. The set of average motion vectors is fed to Block 2, where eight curves are built from the average motion vectors of each frame. Each curve plots the average motion vector of a block against the block's location in the video frame, represents the average motion of the blocks in a certain direction (we restrict the directions to the eight geographical directions), and is fitted to a continuous curve. Each such curve represents one motion feature of the utterance video: the feature is the area under the curve, obtained by the area calculation module by integrating the curve (equation 2). The number of motion features for an utterance video is 8f, where f is the number of frames in the utterance video. Each utterance and its motion features are stored in the utterance database.
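The full search of Section 3.1, which underlies the motion estimation module above, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the function name and the NumPy frame representation are our own assumptions:

```python
import numpy as np

def full_search(ref_block, prev_frame, x, y, w):
    """Exhaustive block matching: compare the N x N reference block
    (taken from the current frame at position (x, y)) against all
    (2w+1)^2 candidate positions in the previous frame, and return the
    motion vector (u, v) with the minimum SAD (equation 1)."""
    n = ref_block.shape[0]
    best_sad, best_mv = np.inf, (0, 0)
    for u in range(-w, w + 1):
        for v in range(-w, w + 1):
            cx, cy = x + u, y + v
            # Skip candidates that fall outside the frame.
            if cx < 0 or cy < 0 or cx + n > prev_frame.shape[0] or cy + n > prev_frame.shape[1]:
                continue
            candidate = prev_frame[cx:cx + n, cy:cy + n]
            sad = int(np.abs(ref_block.astype(int) - candidate.astype(int)).sum())
            if sad < best_sad:
                best_sad, best_mv = sad, (u, v)
    return best_mv, best_sad
```

With 8×8 blocks and w = 3, this evaluates at most 49 candidates per block, matching the ±3 motion vector range used by the training algorithm.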
Algorithm Training(Word: W)
Begin
  For word W, repeat the following steps j times with different speakers {
    1. Record a video of speaker S_j lip reading the word W;
    2. Divide the video into n frames;
    3. For (k = 0; k < n; k = k + 2) {
       a. Divide frame k into m blocks, each of size 8×8 pixels;
       b. Calculate the motion vectors M of all m blocks between frame k and frame k+1;
          M = set of motion vectors = {mv_i, i = 1 to 8}, where i is one of the eight principal geographical directions;
       c. For i = 1 to 8 {
          1. Draw and fit a discrete graph DG_i(k) of mv_i against block location, where block locations are numbered in a spiral fashion starting from the centre of each 64×64 block;
          2. Area_i(k) = ∫ DG_i(k);   (2)
       }
    }
    4. feature_set_j = {Area_i^j(k), i = 1 to 8, for all k}
  }
  Training-feature-set(W) = {average_j(Area_i^j(k)), i = 1 to 8, for all k}
End

4.2 Explanation of the training algorithm
1. The video sequence of the lip-read word has n frames. Each frame is divided into blocks of 8×8 pixels for the motion vector calculations, as we assume that an 8×8 block moves in translational motion as one unit.
2. To draw the graphs DG_i(k), a frame is divided into regions of 64×64 pixels, i.e. 8×8 blocks each of size 8×8 pixels. The location of a block is numbered in a spiral fashion starting from the centre of each 64×64 region.
3. There are 8 curves representing the motion vectors of the video sequence.
4. The integral of each curve, i.e. the area under the curve, is calculated by an approximate method.

4.3 Lip Reading Recognition Phase
The recognition phase is similar to the training phase. It starts with the lip movement video of an unknown utterance, which is to be recognized. This video is fed to Block 1, where the motion vectors are calculated and fitted into 8f curves. The integral values of these curves, the areas under them, are fed to the comparison module.
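The spiral block numbering used in step c.1 of the training algorithm above can be made concrete with a small helper. The paper does not give the exact traversal, so the walk below (outward from one of the four central cells, expanding clockwise) is one plausible ordering, offered only as a sketch:

```python
def spiral_order(n=8):
    """Map each (row, col) of an n x n grid of blocks to a spiral index,
    walking outward from a central cell. For n = 8 this numbers the 8x8
    grid of 8x8-pixel blocks inside a 64x64 region from its centre."""
    r = c = n // 2 - 1            # start at a central cell, (3, 3) for n = 8
    order, idx, steps, d = {}, 0, 1, 0
    directions = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # right, down, left, up
    while idx < n * n:
        for _ in range(2):        # run lengths grow every two turns: 1,1,2,2,3,3,...
            dr, dc = directions[d]
            for _ in range(steps):
                if 0 <= r < n and 0 <= c < n:   # record only in-bounds cells
                    order[(r, c)] = idx
                    idx += 1
                r += dr
                c += dc
            d = (d + 1) % 4
        steps += 1
    return order
```

The resulting index gives the x-axis position of each block's motion vector when the discrete graphs DG_i(k) are drawn.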
The comparison module compares the motion features of the input video against the utterances stored in the database using the mean square error function. The MSE is calculated between the input utterance's features and those of every utterance stored in the database, as shown in equation 3. The minimum MSE (MMSE) is then computed, as in equation 4, to choose the candidate utterance most similar to the input utterance.

Algorithm Recognize(WORD: video)
Begin
  Divide the WORD video into n frames;
  1. For (k = 0; k < n; k = k + 2) {
     a. Divide frame k into m blocks, each of size 8×8 pixels;
     b. Calculate the motion vectors M of all m blocks between frame k and frame k+1;
        M = set of motion vectors = {mv_i, i = 1 to 8}, where i is one of the eight principal geographical directions;
     c. For i = 1 to 8 {
        1. Draw and fit a discrete graph DG_i(k) of mv_i against block location, where block locations are numbered in a spiral fashion starting from the centre of each 64×64 block;
        2. fArea_i(k) = ∫ DG_i(k);
     }
  }
  2. feature(WORD) = {fArea_i(k), i = 1 to 8, for all k}
  3. Calculate the mean square error MSE_j between the input utterance and utterance j:

        MSE_j = Σ_{i=1}^{8} Σ_k (fArea_i(k) − Area_i^j(k))^2   (3)

     Calculate the minimum mean square error MMSE over the x utterances in the database:

        MMSE = min(MSE_j), j = 1 to x   (4)

  4. If MMSE > threshold, the WORD is unrecognizable; otherwise WORD = W_d, where W_d is the word in the corpus corresponding to the minimum MSE.
End

Fig. 2.a Block 1 (motion estimation and motion vector extraction for one video sequence)
Fig. 2.b Populating the utterance database (many videos of the same utterance pass through Block 1; average motion vectors are curve-fitted into 8 curves per frame, the area calculation module yields 8 areas per frame, and these utterance features are added to the database)
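The recognition step of equations 3 and 4 amounts to a nearest-neighbour search under squared error with a rejection threshold. A minimal sketch, assuming the feature sets are stored as dictionaries keyed by (direction, frame-pair index); the names and data layout are our own illustration, not the paper's implementation:

```python
def recognize(input_features, corpus, threshold):
    """Match an input feature set {(i, k): area} against a corpus of stored
    utterance feature sets using MSE (eq. 3) and the minimum-MSE rule (eq. 4).
    Returns the best-matching word, or None if MMSE exceeds the threshold."""
    best_word, mmse = None, float("inf")
    for word, stored in corpus.items():
        # Equation 3: sum of squared differences over all directions i and frame pairs k.
        mse = sum((input_features[key] - stored[key]) ** 2 for key in stored)
        # Equation 4: keep the minimum MSE over the corpus.
        if mse < mmse:
            best_word, mmse = word, mse
    return best_word if mmse <= threshold else None
```

The threshold implements the final step of Algorithm Recognize: a sufficiently poor best match is rejected as unrecognizable rather than mapped to the nearest stored word.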