A pilot study on augmented speech communication based on Electro-Magnetic Articulography

Panikos Heracleous a,*, Pierre Badin b, Gérard Bailly b, Norihiro Hagita a

a ATR, Intelligent Robotics and Communication Laboratories, 2-2-2 Hikaridai Seika-cho, Soraku-gun, Kyoto-fu 619-0288, Japan
b GIPSA-lab, Speech and Cognition Department, UMR 5216, CNRS-Grenoble University, 961 rue de la Houille Blanche, Domaine universitaire, F-38402 Saint Martin d'Hères cedex, France

* Corresponding author. Fax: +81 774 95 1408. E-mail address: panikos@atr.jp (P. Heracleous).

Pattern Recognition Letters 32 (2011) 1119-1125. doi:10.1016/j.patrec.2011.02.009. © 2011 Elsevier B.V. All rights reserved.

Article history: Received 7 May 2010; available online 19 February 2011. Communicated by R.C. Guido.

Keywords: Augmented speech; Electro-Magnetic Articulography (EMA); Automatic speech recognition; Hidden Markov models (HMMs); Fusion; Noise robustness

Abstract

Speech is the most natural form of communication for human beings. However, in situations where audio speech is not available because of disability or adverse environmental conditions, people may resort to alternative methods such as augmented speech, that is, audio speech supplemented or replaced by other modalities, such as audiovisual speech or Cued Speech. This article introduces augmented speech communication based on Electro-Magnetic Articulography (EMA). Movements of the tongue, lips, and jaw are tracked by EMA and are used as features to create hidden Markov models (HMMs). In addition, automatic phoneme recognition experiments are conducted to examine the possibility of recognizing speech from articulation only, that is, without any audio information. The results obtained are promising and confirm that the phonetic features characterizing articulation are as discriminating as those characterizing acoustics (except for voicing). This article also describes experiments conducted in noisy environments using fused audio and EMA parameters. It has been observed that when EMA parameters are fused with noisy audio speech, the recognition rate increases significantly as compared with using noisy audio speech only.

1. Introduction

Speech is the most natural form of communication for human beings and is often described as a unimodal communication channel. However, it is well known that speech is multi-modal in nature and includes the auditive, visual, and tactile modalities (i.e., as in Tadoma communication (Reed et al., 1992)). Other less natural modalities such as electromyographic signals, invisible articulator display, or brain electrical or electromagnetic activity (Miyawaki et al., 2008) can also be considered. Therefore, in situations where audio speech is not available or is corrupted because of disability or adverse environmental conditions, people may resort to alternative methods such as augmented speech.

Augmented speech is used by people with speech production or perception disorders and makes it possible for them to use their inborn abilities to communicate with each other as well as with people without disorders. Augmented speech is also increasingly used in pronunciation training for second-language learners, in speech therapy for speech-retarded children, and in speech perception and production rehabilitation for hearing-impaired people. Augmented speech is robust against audio noise and can thus be used even when communication occurs in noisy environments. Since augmented speech can be silent, it can also be used in situations where privacy in communication is desirable.

In several automatic speech recognition systems, visual information from lip/mouth and facial movements has been used in combination with audio signals.
In such cases, the visual information is used to complement the audio information and to improve the system's robustness against acoustic noise (Potamianos et al., 2003).

For orally educated deaf or hearing-impaired people, lip reading remains a crucial speech modality, though it is not sufficient to achieve full communication. Therefore, Cornett (1967) developed the Cued Speech system as a supplement to lip reading. Recently, the first author of this article presented studies on automatic Cued Speech recognition using hand gestures in combination with lip/mouth information (Heracleous et al., 2009).

Several other studies have addressed the problem of alternative speech communication based on speech modalities other than audio speech. A method for communication based on inaudible speech received through body tissues has been introduced using the Non-Audible Murmur (NAM) microphone. NAM microphones have been used for receiving and automatically recognizing the speech sounds of speech-impaired people, for ensuring privacy in communication, and for achieving robustness against noise (Nakamura et al., 2008; Heracleous et al., 2007). The NAM microphone has also been used in speech synthesis and voice conversion studies (Toda and Shikano, 2005; Tran et al., 2008).

A few researchers have addressed the problem of augmented speech based on the activation signals of the muscles produced during speech production (Jou et al., 2006). The OUISPER project (Hueber et al., 2008) attempts to automatically recognize and resynthesize speech based on signals of tongue movements captured by an ultrasound device in combination with lip information.

The present study aims to assess the possibility of developing automatic speech recognition based on articulatory information only, that is, without any audio information. It also aims to quantify the contribution of the tongue, which is usually an invisible articulator, in comparison with that of the lips. An Electro-Magnetic Articulography (EMA) device (Perkell et al., 1992) was used to track the movements of the tongue, jaw, and lips during speech production. These parameters were used as features to create HMMs, and automatic phoneme recognition experiments were conducted. Similar studies dealing with the automatic recognition of articulatory speech have been introduced in (Uraga and Hain, 2006; Wrench and Richmond, 2000; Fagan et al., 2008). This article, however, focuses on the automatic recognition of the EMA parameters alone, without the use of any audio information. In addition, the current study was conducted using French data, which exhibit differences in articulation from the English language.

In the current study, the visible and invisible articulators are recognized separately in order to investigate the possible differences between them. Finally, the visible and invisible articulatory parameters were fused using multi-stream HMM decision fusion, and a common experiment was conducted.

The structure of the paper is as follows. Section 2 focuses on the methodology used in this study: the tracking methods, the corpus, the statistical modeling, and the fusion methods applied are introduced.
In Section 3, experimental results for phoneme classification in clean and noisy environments are reported. In Section 4, remaining problems are discussed, and finally, Section 5 presents the conclusions of the current study.

2. Methods

2.1. Tracking the movements of lips, jaw, and tongue

For the tracking of articulatory movements, an EMA device presents a good compromise: the Carstens AG100 used in the present study can simultaneously track the vertical and horizontal coordinates, in the midsagittal plane, of 10 receiver coils that can be glued to the various oro-facial articulators inside and outside the vocal tract. The sampling frequency was 500 Hz, and the accuracy of the system was better than 0.1 cm. The coils have the advantage of tracking flesh points, i.e., physical locations of the articulators, in contrast to medical imaging techniques that provide only contours. A drawback of this technique is the poor spatial resolution related to the limited number of points. However, it has been shown that the number of degrees of freedom of the articulators involved in speech (jaw, lips, tongue, velum) is limited, and that a small but sufficient number of locations allows measurements to be retrieved with good accuracy (Badin et al., 2010). Finally, another important drawback of EMA is its partially invasive nature: the receiver coils have a diameter of about 0.3 cm and must be connected to the device by thin wires that can slightly interfere with articulation.

Fig. 1a shows the coils attached to the tongue; the coils were fixed using dental glue. The authors used six coils of the AG100, as illustrated in Fig. 1b: a jaw coil was attached to the lower incisors, whereas a tip coil, a mid coil, and a back coil were attached at approximately 1.2 cm, 4.2 cm, and 7.3 cm from the extremity of the tongue, respectively; an upper lip coil and a lower lip coil were attached to the boundaries between the vermilion and the skin in the midsagittal plane. Another two coils, attached to the upper incisors and to the nose, served as references for alignment. The audio speech signal was recorded at a sampling frequency of 22,050 Hz, in synchronization with the EMA parameters, which were recorded at a sampling frequency of 500 Hz.

Fig. 1. (a) Photo of the EMA receptor coils attached to the subject's tongue; (b) locations of the six active coils in the midsagittal plane, superposed for ease of interpretation on an MRI image on which the speech articulators have been outlined.

2.2. Corpus

The corpus consisted of two repetitions of 224 nonsense vowel-consonant-vowel (VCV) sequences (slow speech, where C is one of the 16 French consonants and V is one of the 14 French oral and nasal vowels); two repetitions of 109 pairs of consonant-vowel-consonant (CVC) real French words, differing only by a single cue (the French version of the Diagnostic Rhyme Test); 68 short French sentences; and 9 longer French sentences. The continuous sentences were used in order to increase the training data. The corpus contained 4081 allophones (i.e., 40% VCV, 30% CVC, and 30% from continuous sentences). For HMM training and testing, 2721 (i.e., two-thirds) and 1360 (i.e., one-third) of the phones were used, respectively. The test data contained 682 vowel instances and 568 consonant instances.
The phoneme instances were extracted from the sentences using a forced alignment based on the audio signal, followed by a manual correction of the segmentations; the training and test utterances consisted of isolated phones.

2.3. Hidden Markov models

Hidden Markov models (HMMs) (Rabiner, 1989) are statistical models that have become the most popular speech models in automatic speech recognition. The main reason for this success is their ability to characterize speech signals in a mathematically tractable way.

A hidden Markov model consists of a finite set of states, each of which is associated with a statistical distribution. The states are connected, and these connections are characterized by their transition probabilities. An HMM can be defined by the following elements:

- The number of states of the model, $N$. The number of states and the possible connections between states are defined by the user according to the task.
- The number of observations, $M$. If the observations are continuous, a continuous probability density function is used instead of a set of discrete probabilities. Usually, the probability density is approximated by a weighted sum of $M$ Gaussian distributions, described by their means and variances. These parameters are computed during HMM training using training data and a parameter estimation algorithm such as Baum-Welch re-estimation.
- A set of state transition probabilities $A = \{a_{ij}\}$,

  $$a_{ij} = p(q_{t+1} = j \mid q_t = i), \quad 1 \le i, j \le N, \qquad (1)$$

  where $q_t$ is the current state. The transition probabilities must satisfy the constraints

  $$a_{ij} > 0, \quad 1 \le i, j \le N, \qquad (2)$$

  and $\sum_{j=1}^{N} a_{ij} = 1, \quad 1 \le i \le N$.
- A probability distribution for each state, $B = \{b_j(k)\}$. In the case of continuous observations, a probability density function is used, usually approximated by a weighted sum of $M$ Gaussians,

  $$b_j(o_t) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}(o_t; \mu_{jm}, \Sigma_{jm}), \qquad (3)$$

  where $c_{jm}$ are the Gaussian weights, $\mu_{jm}$ the means, $\Sigma_{jm}$ the covariance matrices, $o_t$ the input feature vector, and $\mathcal{N}(o_t; \mu_{jm}, \Sigma_{jm})$ the Gaussian distribution.
- The initial state distribution, $\pi = \{\pi_i\}$, where

  $$\pi_i = p(q_1 = i), \quad 1 \le i \le N. \qquad (4)$$

2.4. Statistical modeling

In this study, 38 context-independent, left-to-right, 3-state phoneme HMMs with no skip transitions were used, with eight Gaussians per state and diagonal covariance matrices. The audio signal was downsampled to 16,000 Hz; subsequently, 12 Mel-Frequency Cepstral Coefficients (MFCC) (Furui, 1986; Muralishankar and O'Shaughnessy, 2008), along with their first and second derivatives, were extracted. The EMA signal was downsampled to 100 Hz in order to be synchronized with the audio feature extraction rate (i.e., 10 ms). Because the EMA coordinates were partially correlated, a global Principal Component Analysis (PCA) (Pearson, 1901) was applied before HMM modeling. Three articulatory HMM sets were trained using all the PCA components, along with their first and second derivatives. In the first HMM set, the coordinates of the lips and jaw (LJ) were used (i.e., 6 PCA components with first and second derivatives). In the second HMM set, the parameters of the tongue (T) were used (i.e., 6 PCA components with first and second derivatives). Finally, a common HMM set for lips, jaw, and tongue (LJT) was created (i.e., 6 lip PCA components and 6 tongue PCA components, each with first and second derivatives). For training and testing, the HTK 3.1 toolkit (Young et al., 2001) was used.
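As a concrete illustration of this articulatory front-end, the following Python sketch derives an EMA feature stream along the lines described above: the 500 Hz coil trajectories are downsampled to the 100 Hz frame rate, a global PCA is applied, and first and second derivatives are appended. The array layout, the resampling routine, and the delta formula are assumptions made for illustration; the experiments themselves were run with HTK, not with this code.

```python
# Illustrative sketch of the articulatory front-end of Section 2.4, assuming a
# NumPy array `ema_500hz` of raw coil coordinates (frames x coordinates, 500 Hz).
# scipy/scikit-learn stand in for the authors' HTK-based pipeline.
import numpy as np
from scipy.signal import resample_poly
from sklearn.decomposition import PCA

def deltas(x, width=2):
    """First-order regression coefficients over a +/- `width` frame window."""
    pad = np.pad(x, ((width, width), (0, 0)), mode="edge")
    num = sum(k * (pad[width + k:len(x) + width + k] - pad[width - k:len(x) + width - k])
              for k in range(1, width + 1))
    return num / (2 * sum(k * k for k in range(1, width + 1)))

def ema_features(ema_500hz, n_components=6):
    """Downsample to the 100 Hz frame rate, apply a global PCA, append derivatives."""
    ema_100hz = resample_poly(ema_500hz, up=1, down=5, axis=0)  # 500 Hz -> 100 Hz (10 ms frames)
    static = PCA(n_components=n_components).fit_transform(ema_100hz)
    d1 = deltas(static)                                         # first derivatives
    d2 = deltas(d1)                                             # second derivatives
    return np.hstack([static, d1, d2])                          # e.g. 18 dims per articulator group
```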
2.5. Fusion methods

In this section, the fusion methods used to integrate the audio signal with the EMA signal are introduced. Several fusion methods have been introduced previously (Nefian et al., 2002; Hennecke et al., 1996; Nakamura et al., 2002; Adjoudani and Benoît, 1996; Chen, 2001; Pearson, 1901). In this study, a feature fusion method and two decision fusion methods were used: a state-synchronous and a state-asynchronous fusion method.

2.5.1. Concatenative feature fusion

Feature concatenation is the simplest state-synchronous fusion method. It uses the concatenation of the synchronous audio speech and EMA feature vectors as the joint feature vector

$$O_t^{AE} = \left[ O_t^{(A)T}, O_t^{(E)T} \right]^T \in \mathbb{R}^D, \qquad (5)$$

where $O_t^{AE}$ is the joint audio-EMA feature vector, $O_t^{(A)}$ the audio feature vector, $O_t^{(E)}$ the EMA feature vector, and $D$ the dimension of the joint feature vector. In these experiments, the dimensions of the audio stream and of the EMA stream were both 36; the dimension $D$ of the joint audio-EMA feature vector was therefore 72.

2.5.2. Multi-stream HMM decision fusion

Multi-stream HMM decision fusion is a state-synchronous decision fusion that captures the reliability of each stream by combining the likelihoods of single-stream HMM classifiers (Potamianos et al., 2003). The emission likelihood of the multi-stream HMM is the product of the emission likelihoods of the single-stream components, weighted appropriately by stream weights. Given the combined observation vector $O$, that is, the audio and EMA elements, the emission probability of the multi-stream HMM is given by

$$b_j(O_t) = \prod_{s=1}^{S} \left[ \sum_{m=1}^{M_s} c_{jsm}\, \mathcal{N}(O_{st}; \mu_{jsm}, \Sigma_{jsm}) \right]^{\lambda_s}, \qquad (6)$$

where $\mathcal{N}(O; \mu, \Sigma)$ is the value at $O$ of a multivariate Gaussian with mean $\mu$ and covariance matrix $\Sigma$, and $S$ is the number of streams. For each stream $s$, $M_s$ Gaussians in a mixture are used, each weighted by $c_{jsm}$. The contribution of each stream is weighted by $\lambda_s$. In this study, we assume that the stream weights do not depend on the state $j$ or the time $t$. However, a constraint was applied, namely

$$\lambda_e = 1 - \lambda_a, \quad \forall\, \lambda_a \in (0, 1), \qquad (7)$$

where $\lambda_a$ is the audio stream weight and $\lambda_e$ is the EMA stream weight. In these experiments, the weights were adjusted experimentally to 0.7 and 0.3, respectively. The selected weights were obtained by maximizing the accuracy over several experiments.
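To make Eq. (6) concrete, the sketch below evaluates the multi-stream emission likelihood of one state in the log domain, given per-stream Gaussian mixture parameters and the stream exponents $\lambda_a$ and $\lambda_e$. The dictionary layout of the mixture parameters is a hypothetical convenience for illustration, not HTK's internal representation.

```python
# Hedged sketch of the multi-stream emission likelihood of Eq. (6), in the log domain.
# The GMM parameter layout (dicts of weights, means, diagonal variances) is assumed.
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def log_multistream_emission(obs_per_stream, gmms_per_stream, stream_weights=(0.7, 0.3)):
    """
    obs_per_stream  : observation vectors, one per stream (audio, EMA).
    gmms_per_stream : dicts with keys 'c' (M,), 'mu' (M, D_s), 'var' (M, D_s)
                      holding the state's mixture for each stream.
    stream_weights  : exponents (lambda_a, lambda_e), constrained to sum to 1.
    Returns log b_j(O_t) = sum_s lambda_s * log sum_m c_sm N(o_s; mu_sm, Sigma_sm).
    """
    log_b = 0.0
    for o, gmm, lam in zip(obs_per_stream, gmms_per_stream, stream_weights):
        log_components = [np.log(c) + multivariate_normal.logpdf(o, mean=mu, cov=np.diag(var))
                          for c, mu, var in zip(gmm["c"], gmm["mu"], gmm["var"])]
        log_b += lam * logsumexp(log_components)
    return log_b
```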
2.5.3. Late fusion

A disadvantage of the previously described fusion methods is the assumption of synchrony between the two streams. In this study, late fusion was applied to allow asynchrony between the audio and EMA streams. In the late fusion method, two single HMM-based classifiers were used, one for the audio speech and one for the EMA speech. For each test utterance (i.e., isolated phone), the two classifiers provided an output list that included all the phone hypotheses along with their likelihoods. Following that, the separate unimodal hypotheses were combined into bi-modal hypotheses using the weighted likelihoods,

$$\log P_{AE}(h) = \lambda_a \log P_A(h \mid O_A) + \lambda_e \log P_E(h \mid O_E), \qquad (8)$$

where $\log P_{AE}(h)$ is the score of the combined bi-modal hypothesis $h$, $\log P_A(h \mid O_A)$ the score of $h$ provided by the audio classifier, and $\log P_E(h \mid O_E)$ the score of $h$ provided by the EMA classifier. $\lambda_a$ and $\lambda_e$ are the stream weights, with the same constraint as applied in multi-stream HMM fusion.

The procedure described here finally resulted in a combined N-best list, in which the top hypothesis was selected as the bi-modal output. A similar method was also introduced in (Potamianos et al., 2003). Fig. 2 shows the integration of the audio signal with the EMA signal based on the early and late fusion methods. In the case of early fusion, the feature vectors of the two modalities were concatenated into a combined vector, and a common HMM set was trained; in the decoding stage, a common classifier received the combined feature vectors. In the case of late fusion, two separate HMM sets were trained, for the audio and the articulatory modality, respectively; the system provided two unimodal results, which were combined into the hypothesized bimodal output.

Fig. 2. (a) Audio speech integrated with the EMA signal based on the feature fusion and multi-stream HMM decision fusion methods. (b) Audio speech integrated with the EMA signal based on the late fusion method.
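The late fusion rule of Eq. (8) reduces to a weighted sum of unimodal log-scores over the shared hypothesis set, followed by re-ranking. A minimal sketch, assuming each classifier returns a dictionary mapping phone labels to log-likelihoods (an assumed data structure, not the authors' implementation):

```python
# Minimal sketch of the late (N-best) fusion rule of Eq. (8); the per-phone
# log-likelihood dictionaries are assumed inputs.
def late_fusion(audio_scores, ema_scores, lambda_a=0.7, lambda_e=0.3):
    """Combine unimodal log-likelihoods into bi-modal scores and rank them."""
    combined = {h: lambda_a * audio_scores[h] + lambda_e * ema_scores[h]
                for h in audio_scores if h in ema_scores}
    # Sorting yields the combined N-best list; the top entry is the bi-modal output.
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Example usage with toy log-likelihoods:
# nbest = late_fusion({"p": -12.3, "b": -13.1}, {"p": -8.4, "b": -8.6})
# recognized = nbest[0][0]
```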
3. Results

3.1. Phoneme classification in clean environments

Fig. 3 shows the results obtained for vowel, consonant, and phoneme classification using EMA and audio parameters based on multi-stream HMM decision fusion. It can be observed that EMA captures the speech information with very high accuracy. In addition, the results show that the EMA tongue parameters capture the speech information better than the EMA LJ parameters, and that integrating the LJ and tongue parameters leads to higher accuracy than using LJ or tongue separately. The vowel classification accuracy when using EMA parameters is 93.1%, compared with 99.1% when audio parameters are used. In the case of consonant and phoneme recognition, however, larger differences are obtained, since EMA cannot capture voicing: a higher number of confusions appear between voiced and unvoiced consonants articulated in the same place (e.g., /p/ and /b/, /t/ and /d/, etc.).

Fig. 3. Classification accuracy using EMA and audio parameters when using lips and jaw features (LJ), tongue features (T), lips, jaw, and tongue features (LJT), and audio features.

To decide whether the results are statistically significant, McNemar's test was performed (Gillick and Cox, 1989). Table 1 shows the statistical significance of the accuracy differences. The p-values, with a 95% confidence interval, were less than 0.001, indicating that statistically significant differences in phoneme accuracy were observed in all cases.

Table 1. Statistical significance of the accuracy differences when using lips, tongue, lips and tongue, and audio features.

                Lips   Tongue   Lips+Tongue   Audio
Lips             -      Yes        Yes         Yes
Tongue           -       -         Yes         Yes
Lips+Tongue      -       -          -          Yes

In addition to the phoneme recognition based on multi-stream HMM decision fusion, an additional experiment was conducted using the LJ and T EMA parameters integrated with simple concatenative feature fusion. The phoneme classification accuracy was 77.8%, the vowel classification accuracy 92.3%, and the consonant classification accuracy 74.7%. These accuracies are slightly lower than those obtained with multi-stream HMM decision fusion.

To show the differences in phoneme recognition using lip and tongue EMA parameters, the errors that occurred in plosive and fricative recognition were analyzed. Table 2 shows the confusion matrix of the French plosives when lip and jaw parameters were used. In the French language, there are six plosives in three pairs: {/p/, /b/}, {/t/, /d/}, and {/k/, /g/}. In each pair, the plosives are articulated in the same place, the first being unvoiced and the second voiced. Table 2 shows a high number of confusions between plosives articulated in the same place. As the EMA information does not contain any voicing, voiced and unvoiced plosives articulated at the same place are highly confusable. The results also show that the plosives articulated at the back (i.e., /k/ and /g/) have the lowest classification accuracy.

Table 2. Confusion matrix of the French plosives when lip and jaw parameters were used. The left column indicates the test phones; the top row indicates the recognized phones; %c is the percentage of correctly classified instances.

       /p/  /b/  /t/  /d/  /k/  /g/    %c
/p/     11    5    1    1    2    0   55.0
/b/      9   19    0    0    0    0   67.9
/t/      1    0   35    9    0    2   74.5
/d/      0    0    7   19    1    2   65.5
/k/      1    1    0    3    6    7   33.3
/g/      0    0    1    2    7   10   50.0

Table 3 shows the confusion matrix of the French plosives when tongue parameters were used. As shown, confusions appear between voiced/unvoiced phonemes articulated in the same place, with a very small number of inter-confusions (i.e., confusions with the other pairs). Compared with the case of using lip parameters, the number of confusions is much lower. Also, /k/ and /g/, which are articulated at the back, are classified with higher accuracy than with the lip parameters. However, in this case too, the absence of voicing results in confusions between voiced/unvoiced counterparts.

Table 3. Confusion matrix of the French plosives when tongue parameters were used. The left column indicates the test phones; the top row indicates the recognized phones; %c is the percentage of correctly classified instances.

       /p/  /b/  /t/  /d/  /k/  /g/    %c
/p/     13    7    0    0    0    0   65.0
/b/      3   23    0    2    0    0   82.1
/t/      0    0   42    3    2    0   89.4
/d/      0    0    5   24    0    0   82.8
/k/      0    0    0    0   13    5   72.2
/g/      0    0    0    0    5   15   75.0

Table 4 shows the confusion matrix of the French fricatives when lip and jaw parameters were used. In the French language there are three pairs of fricatives, each consisting of a voiced/unvoiced counterpart: {/f/, /v/}, {/s/, /z/}, and {/sh/, /zh/}. The results obtained show that, because of the absence of voicing, voiced/unvoiced counterparts are highly confusable.

Table 4. Confusion matrix of the French fricatives when lip and jaw parameters were used. The left column indicates the test phones; the top row indicates the recognized phones; %c is the percentage of correctly classified instances.

       /f/  /v/  /s/  /z/  /sh/  /zh/    %c
/f/      7   10    0    0     0     0   41.2
/v/      6   17    0    0     0     0   73.9
/s/      0    0   17   10     0     2   58.6
/z/      0    0    6   16     0     0   72.7
/sh/     0    0    0    1    14     9   58.3
/zh/     0    0    0    0     8    11   57.9

Table 5 shows the confusion matrix of the fricatives when tongue parameters were used. The same phenomenon is observed with these parameters as well: voiced/unvoiced fricatives articulated in the same place show a high number of confusions. Compared with the case of using lip parameters, however, the number of confusions is lower.

Table 5. Confusion matrix of the French fricatives when tongue parameters were used. The left column indicates the test phones; the top row indicates the recognized phones; %c is the percentage of correctly classified instances.

       /f/  /v/  /s/  /z/  /sh/  /zh/    %c
/f/      8    9    0    0     0     0   47.1
/v/      5   17    0    1     0     0   73.9
/s/      0    0   20    9     0     0   69.0
/z/      0    0    5   17     0     0   77.3
/sh/     0    0    0    1    15     8   62.5
/zh/     0    0    0    0     6    13   68.4

The effect of the absence of voicing information in the EMA signal can be illustrated by clustering the voiced/unvoiced consonants articulated in the same place (e.g., {/p/, /b/}, {/t/, /d/}, {/k/, /g/}, etc.). In this experiment, all HMM models were used, but in the evaluation of the results the voiced/unvoiced counterparts were clustered. Fig. 4 shows the results obtained for consonant recognition when clustering was applied. It can be observed that the consonant accuracy increases significantly: when the LJT EMA parameters are used, the accuracy rises to 91.2%, compared with 75.2% when clustering is not applied. It should be noted that one important articulator, the velum, was not recorded in the EMA setup used in this experiment. As a result, the nasal feature cannot be recovered; therefore, /m/ and /n/ should be further clustered with {/p/, /b/} and {/t/, /d/}, respectively. This in turn leads to a 96.7% consonant accuracy. The confusion scores also confirm that /l/ is sufficiently contrasted in the midsagittal section.

Fig. 4. Consonant classification accuracy when consonants articulated in the same place are clustered.
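The scoring-time clustering described above can be expressed as a simple relabeling applied to both the reference and the hypothesized phones before accuracy is computed. The sketch below is only an illustration: the label symbols are SAMPA-like placeholders rather than the exact symbols used in the experiments, and the nasal merge is made optional because the velum was not recorded.

```python
# Hedged sketch of the scoring-time clustering of voiced/unvoiced counterparts;
# label strings are illustrative, not the authors' phone set.
PLACE_CLUSTERS = {
    "b": "p", "d": "t", "g": "k",   # voiced plosives -> unvoiced counterparts
    "v": "f", "z": "s", "Z": "S",   # voiced fricatives -> unvoiced counterparts
}
NASAL_CLUSTERS = {"m": "p", "n": "t"}  # optional further merge (velum not recorded)

def cluster(label, merge_nasals=False):
    mapping = dict(PLACE_CLUSTERS, **NASAL_CLUSTERS) if merge_nasals else PLACE_CLUSTERS
    return mapping.get(label, label)

def clustered_accuracy(references, hypotheses, merge_nasals=False):
    """Accuracy after mapping clustered counterparts to a common representative."""
    pairs = list(zip(references, hypotheses))
    hits = sum(cluster(r, merge_nasals) == cluster(h, merge_nasals) for r, h in pairs)
    return hits / len(pairs)
```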
3.2. Phoneme classification in noisy environments

In these experiments, simulated noisy data at several signal-to-noise ratio (SNR) levels were fused with the EMA parameters and tested using multi-condition HMMs (i.e., HMMs trained using EMA data and noisy audio data at different SNR levels). The first stream consisted of 12 MFCC, 12 delta MFCC, and 12 delta-delta MFCC parameters. The second stream consisted of 12 EMA PCA parameters, along with their first and second derivatives.

Fig. 5 shows the comparison of the three fusion methods when white noise was used. In the case of an SNR of -10 dB, the accuracy was 79.1% when using feature fusion, 81.3% when using multi-stream HMM fusion, and 83.4% when using late fusion. In the case of clean speech, the accuracy was 85.67% when using feature fusion, 92.1% when using multi-stream HMM fusion, and 94.1% when using late fusion. The highest accuracy was thus achieved when late fusion was used; the lowest performance was obtained when concatenative feature fusion was used. A possible reason is that feature fusion cannot capture the reliability of each stream. Another possible reason for lower