A Multifaceted Investigation into Speech Reading

Trent W. Lewis and David M.W. Powers
School of Informatics and Engineering, Flinders University of South Australia, Australia

Abstract. Speech reading is the act of speech perception using both acoustic and visual information. This is something we all do as humans, and it can be utilised by machines to improve traditional speech recognition systems. We have been following a line of research that started with a simple audio-visual speech recognition system and has grown into what is now a multifaceted investigation into speech reading. This paper overviews our feature extraction technique, red exclusion, and its analysis using neural networks, and then looks at several neural network integration architectures for speech reading.

1 Introduction

Automatic speech recognition (ASR) performs well under restricted conditions (rates of up to 98-99% word accuracy). When we step outside these boundaries, however, performance can be severely degraded and the utility of such systems comes under fire [2]. The question then arises: how are humans still able to recognise speech in unfavourable conditions such as a busy office, a train station or a construction site? Is our acoustic apparatus performing an enormous amount of noise filtering and reduction, or are we using another source of information? It is in fact the latter, which may be an answer to robust speech recognition.

Work from the areas of psychology and linguistics has shed much light on how humans perceive speech, not only acoustically but also visually, such as lip-reading in deaf people. This has evolved into what is now known as speech reading, or audio-visual speech recognition (AVSR) for the engineer [6]. The most important finding from this research is that normally hearing people do rely on vision for speech perception and that the set of visually perceivable speech sounds forms a complementary set to that of the acoustically perceivable sounds in the presence of noise.
This set of visually perceivable speech sounds has been named visemes, that is, visual phonemes [19].

Researchers in the fields of engineering and computer science have taken these ideas and applied them to traditional speech recognition systems with very encouraging results (for a comprehensive review see [18]). Although only minimal improvement is found under optimal conditions, improvements using a degraded acoustic signal have been large [9]. For example, Meier et al. have reported up to a 50% error reduction when vision is incorporated.

For the development of a successful AVSR system a truly multi-talented group is required. Expertise is needed to interpret findings from psycholinguistics, together with a solid grounding in traditional acoustic speech recognition and a grasp of the computer vision techniques for relevant visual feature extraction. However, a new problem also arises with AVSR: how best to combine the acoustic and visual signals without the result being worse than acoustic or visual recognition alone, that is, catastrophic fusion [14]. This is a lively research area in AVSR, and the effectiveness of different fusion techniques, such as early, intermediate and late integration, is still being decided.

Our initial interest in this area was in the fusion of acoustic and visual data; however, this interest soon spread to other areas of AVSR, especially visual feature extraction. For our preliminary investigations we attempted to use pre-existing techniques to move us quickly to the final stage of fusion. Acoustic feature extraction is a mature field and did not pose a problem, but visual feature extraction proved difficult. These difficulties are outlined in section 2.2, as well as our solution to the problem, red exclusion, and our continuing research into this area using neural networks (NN).
The final sections report on results using some of the approaches we are pursuing, based on using NNs for the recognition process in both the acoustic and visual signals as well as the fusion process.

2 Feature Extraction

2.1 Acoustic Features

According to Schafer and Rabiner, the choice of the representation of the (acoustic) speech signal is critical [17]. Many different representations of speech have been developed, including simple waveform codings, time and frequency domain techniques, linear predictive coding, and nonlinear or homomorphic representations. Here, we focus on the homomorphic representations, especially the mel-cepstrum representation.

The mel-frequency scale is defined as a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz [4]. This representation is preferred by many in the speech community as it more closely resembles the subjective human perception of sinewave pitch [3,15]. A compact representation of the phonetically important features of the speech signal can be encoded by a set of mel-cepstrum coefficients, with the cepstral coefficients being the Fourier transform representation of the log magnitude spectrum.

The mel-cepstrum representation of acoustic speech has had great success in all areas of speech processing, including speech recognition. It has been found to be a more robust, reliable feature set for speech recognition than other forms of representation [4,15]. Thus, it was decided that this was the best representation to use for the following recognition experiments. Moreover, the cepstrum has been found to be invaluable in identifying the voicing of particular speech segments [17].

To extract the mel-cepstrum coefficients from the speech signal, the Matlab speech processing toolbox VOICEBOX was used [3], exploiting the first 12 cepstral coefficients, 12 delta-cepstral coefficients, 1 log-power and 1 delta log-power [14].
This is a total of 26 features per acoustic frame, and 130 per data vector (5 frames), which is comparable to the number of visual features discussed in the next section. These features are used in NNs for the ASR and AVSR experiments discussed in section 4.

2.2 Visual Features

The accurate extraction of lip features for recognition is a very important first step in AVSR. Moreover, the consistency of the extraction is critical if it is to be used for a variety of conditions and people. According to Bregler, Manke, Hild, and Waibel [2], broadly speaking there exist two different schools of thought when it comes to visual processing. At one extreme, there are those who believe that the feature extraction stage should reduce the visual input to the smallest possible number of hand-crafted features, such as deformable templates [8]. This type of approach has the advantage that the number of visual inputs is drastically reduced, potentially speeding up subsequent processing, reducing variability, and increasing generalisability. However, this approach has been heavily criticised, as fitting a model to each frame can be time consuming [16] and, most importantly, the model may exclude linguistically relevant information [2,7].

The opponents of this approach believe that only minimal processing should be applied to the found mouth image, so as to minimise the amount of information lost due to any transformation. For example, Gray et al. [7] found that simply using the difference between the current and previous frames produced results that were better than using PCA. However, in this approach the feature vector is equal to the size of the image (40x60 in most cases), which is potentially orders of magnitude larger than in a model-based approach. This can become a problem depending on the choice of recognition system and training regime; however, successful systems have been developed with both HMMs and NNs using this approach [13,14].
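
As a rough illustration of the minimal-processing approach, the frame-difference feature attributed above to Gray et al. [7] can be sketched in plain Python. The function name, the synthetic frames, and the reading of 40x60 as 40 rows by 60 columns are assumptions made for this sketch and are not details taken from the paper.

```python
def frame_difference(current, previous):
    """Pixel-wise difference of two equally sized grayscale frames,
    flattened row by row into a single feature vector."""
    if len(current) != len(previous) or any(
        len(row_c) != len(row_p) for row_c, row_p in zip(current, previous)
    ):
        raise ValueError("frames must have the same shape")
    return [
        c - p
        for row_c, row_p in zip(current, previous)
        for c, p in zip(row_c, row_p)
    ]


# Two synthetic mouth images (hypothetical data, not from the paper).
ROWS, COLS = 40, 60  # assumed orientation of the 40x60 image
previous = [[0] * COLS for _ in range(ROWS)]
current = [[0] * COLS for _ in range(ROWS)]
current[20][30] = 255  # a single pixel brightens between the two frames

features = frame_difference(current, previous)
print(len(features))  # one feature per pixel: 40 * 60 = 2400
```

The vector length equals the number of pixels, which shows why this style of feature is so much larger than a handful of template parameters.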
2.3 The Basis of Red Exclusion

In a previous paper we demonstrated that many of the current pixel-based techniques do not adequately identify the lip corners, or even the lip region in some cases [11]. This led us to define our own lip feature extraction technique.

The idea behind red exclusion is simple and born out of similar work in entire face extraction. The predominant face extraction technique is based on the idea that human skin colour occupies a very small spectrum of colour space when total brightness is accounted for [10]. Surprising to many is that this theory appears to hold across different races and skin colours, and it is thus a very robust and relatively quick process for face extraction.

Some have attempted to extract the mouth from the face based on similar ideas using the same colour space. For example, Wark, Sridharan, and Chandran [21] used equation (1) to identify candidate lip pixels.

$$L_{lim} \le \frac{R}{G} \le U_{lim} \qquad (1)$$

where R and G are the red and green colour components, respectively, and L_lim and U_lim are the lower and upper boundaries that define which values of R/G are considered lip pixels. Others have attempted this type of extraction using hue and saturation, or a modification of the R and G components as in (2), which is the colour space used for face extraction. Unfortunately, the techniques outlined did not prove adequate for our purposes, as shown in [11].

$$r = \frac{R}{R + G + B}, \qquad g = \frac{G}{R + G + B} \qquad (2)$$

To solve the problem we took a step back from the situation and assessed what we were attempting to do. This is, in essence, to separate an object (the lips) from a very similar background (the face). In this case the similarity was that both objects were predominantly red. So, we reasoned, any contrast that arose between the two objects would be in the levels of green and/or blue contained in the objects.
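
For reference, the two earlier colour filters, the R/G band test of (1) and the brightness-normalised components of (2), might be sketched in Python as follows. The function names, the limit values, the sample pixels, and the handling of zero denominators are all assumptions chosen for illustration; the paper does not give them.

```python
def rg_ratio_mask(pixels, lower, upper):
    """Equation (1): flag pixels whose R/G ratio lies in [lower, upper]."""
    mask = []
    for r, g, _b in pixels:
        if g == 0:
            # Ratio undefined; treated as "not lip" here (an assumption).
            mask.append(False)
        else:
            mask.append(lower <= r / g <= upper)
    return mask


def chromaticity(r, g, b):
    """Equation (2): red and green normalised by total brightness."""
    total = r + g + b
    if total == 0:
        return 0.0, 0.0  # convention for a black pixel (an assumption)
    return r / total, g / total


# Hypothetical RGB pixels and hypothetical limits L_lim = 1.5, U_lim = 3.0.
pixels = [(200, 100, 90), (180, 150, 120), (90, 0, 10)]
print(rg_ratio_mask(pixels, 1.5, 3.0))
print(chromaticity(100, 50, 50))
```

Both filters rely mainly on the red component, which is the property that the reasoning above identifies as shared by lips and skin.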
After our empirical investigation, we found that indeed an excellent contrast was exhibited when filtering the image with (3), the log giving even more contrast compared with the linear form.

$$\log\left(\frac{G}{B}\right) \qquad (3)$$

By thresholding (in log or linear form) and then applying morphological operators, such as opening and closing, the mouth area can easily be identified. Figure 1a is a grayscale-enhanced¹ image after red exclusion (RE) on an image of one of the subjects, and Figure 1b is an example of the visual features used for recognition. From the selected visual features, meta-features such as height, width, and motion (given two frames) can be calculated, and all of these features are then used as inputs into NNs for visual and audio-visual speech recognition.

¹ A grayscale image with pixels meeting a criterion, e.g. RE, highlighted.

Fig. 1. a) Example of red exclusion, and b) visual features used for recognition.

3 Investigating Red Exclusion

We are currently undertaking research investigating the how and why of red exclusion and whether it is related to the psychophysical contrast functions of mammalian vision and the associated theories and models of colour opponency. Moreover, our research also seeks to find out whether component separation algorithms and self-organizing NNs, given the task of discriminating between objects solely based on colour (Red, Green, Blue), result in the formation of a traditional colour opponent system (Blue-Yellow, Green-Red, Black-White). Furthermore, such research may be able to give further support to the colour opponent theory by demonstrating its utility in an engineering context.

3.1 A Visual Examination

As a preliminary investigation into whether the properties of red exclusion are related to those of the colour opponents, we performed a visual examination of the contrasting abilities of a modified red exclusion equation (4).
$$\frac{kG + (1 - k)R}{B}, \qquad 0 \le k \le 1 \qquad (4)$$

Note that the k = 1 case is equivalent to RE and, under the simple assumption that Yellow = Red + Green, a value of k = 0.5 is equivalent to contrasting the Blue-Yellow opponents. However, we have not yet found precise indications as to exactly which yellow hue is the physiological opponent of blue.

Under a visual examination, that is, displaying the filtered image on the screen, the mouth shape can easily be identified in the range 0.5 ≤ k ≤ 1. Below 0.5, however, visual discrimination of the mouth becomes difficult, and at a value of k = 0, that is R/B, it is impossible to distinguish the mouth area from the surrounding face (see figure 2).

3.2 Neural Networks, Red Exclusion, and Opponent Colours

This simple examination does indeed lend support to the idea that a colour opponent approach to mouth extraction, such as red exclusion, can be more fruitful than other techniques. Moreover, these initial experiments confirm that the B-Y opponent is also close to optimal for facial features.

The previous section outlined a preliminary investigation into red exclusion. However, this did not tell us anything useful in terms of how to better use red exclusion for automatic extraction. Thus, we need to analytically find which coefficients or weightings given to the Red, Green, and
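
As an illustration of the filter family in (4) and its log-form special case (3), the following minimal Python sketch evaluates both for a single pixel. The function names, the small constant EPS that guards against division by zero, and the sample pixel values are assumptions introduced here; they are not taken from the paper, and no claim is made about which output values correspond to lip or skin.

```python
import math

EPS = 1e-6  # avoids division by zero and log(0); an assumption of this sketch


def opponent_filter(r, g, b, k=1.0):
    """Equation (4): (k*G + (1 - k)*R) / B for one pixel, with 0 <= k <= 1."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    return (k * g + (1.0 - k) * r) / (b + EPS)


def red_exclusion(r, g, b):
    """Equation (3): log(G / B), the log form of the k = 1 case."""
    return math.log((g + EPS) / (b + EPS))


# A hypothetical RGB pixel, used only to exercise the functions.
r, g, b = 100, 80, 40
for k in (0.0, 0.5, 1.0):
    # k = 0 gives R/B, k = 0.5 gives (R + G) / (2B), k = 1 gives G/B.
    print(k, round(opponent_filter(r, g, b, k), 3))
print(round(red_exclusion(r, g, b), 3))
```

Applying such a function to every pixel of an image, then thresholding and cleaning the result with morphological opening and closing, would follow the procedure described in section 2.3; those later steps are omitted here.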