SPEECH FEATURE EXTRACTION USING INDEPENDENT COMPONENT ANALYSIS

Jong-Hwan Lee (1), Ho-Young Jung (2), Te-Won Lee (3), Soo-Young Lee (1)

(1) Brain Science Research Center and Department of Electrical Engineering, Korea Advanced Institute of Science and Technology, 373-1 Kusong-Dong, Yusong-Gu, Taejon, 305-701 Korea (TEL: +82-42-869-8031, FAX: +82-42-869-8570, E-mail: jhlee@neuron.kaist.ac.kr)
(2) Electronics and Telecommunications Research Institute, 161 Kajong-dong, Yusong-Gu, Taejon, 305-350, Korea
(3) Computational Neurobiology Laboratory, The Salk Institute, 10010 N. Torrey Pines Road, La Jolla, California 92037, USA, and the Institute for Neural Computation, University of California, San Diego, USA

ABSTRACT

In this paper we propose new speech features obtained by applying independent component analysis (ICA) to human speech. When ICA is applied to speech signals for efficient encoding, the adapted basis functions resemble Gabor-like features. Because the trained basis functions contain some redundancy, we select a subset of them by a reordering method. The reordered basis functions run almost monotonically from low-frequency to high-frequency vectors, which is consistent with the fact that human speech carries much more information in the low-frequency range. These features can be used in automatic speech recognition systems, and the proposed method gives considerably better recognition rates than conventional mel-frequency cepstral features.

1. INTRODUCTION

Speech signals contain independent higher-order statistical structure. Independent component analysis (ICA) has previously been used to extract feature vectors based on such higher-order statistics from natural scenes and natural sound [1], [2]. These features are localized in both time (space) and frequency. However, no such features had been extracted from human speech for speech recognition. In this paper we report the extraction of Gabor-like features from natural human speech. The extracted features look like bandpass filters, each with a center frequency and a limited bandwidth.

In many filter-bank approaches, bandpass filters are designed to have mel-scaled center frequencies by a mathematical procedure, and their bandwidths are likewise determined by abstract mathematical properties. In auditory-model feature extraction, the filter bank mimics the characteristics of the basilar membrane (BM): in the cochlea of the inner ear, the input speech signal induces mechanical vibration of the basilar membrane, and each position along the membrane responds to localized frequency information in the signal. Auditory-based feature extraction therefore models each bandpass filter on these frequency characteristics of the basilar membrane.

(This research was supported as a Brain Science & Engineering Research Program by the Korean Ministry of Science and Technology.)

In contrast, the basis vectors trained in this paper reflect the statistical properties of the input speech better than such filter-bank methods. For each time frame, feature coefficient vectors are obtained using the trained basis vectors. Finally, recognition rates with the ICA-based features are compared to those with mel-frequency cepstral coefficients (MFCCs) on an isolated-word recognition task.

2. EXTRACTING SPEECH FEATURES USING ICA

To extract independent feature vectors from speech signals, the ICA algorithm is applied to a large number of human speech segments.
An ICA network is trained to obtain the independent components u from a speech segment x; the trained weight matrix W extracts the basis-function coefficients u from x. ICA assumes that the observation x is a linear mixture of the independent components u. If A denotes the inverse matrix of W, the columns of A represent the basis feature vectors of the observation x:

    u = W x,    x = A u.

To extract the basis functions one must train either the mixing matrix A or the unmixing matrix W; here we trained the unmixing matrix W.

Figure 1: ICA network for training the basis vectors (inputs x_1, ..., x_N; outputs u_1, ..., u_N through W; nonlinearity g(u) producing y_1, ..., y_N).

The learning rule is based on maximization of the joint entropy H(y) and is given by [3]

    \Delta W \propto \frac{\partial I(y; x)}{\partial W} = \frac{\partial H(y)}{\partial W},    (1)

    \Delta W \propto [W^T]^{-1} + \frac{\partial p(u)/\partial u}{p(u)} \, x^T,    (2)

where p(u) denotes the approximation of the speech-signal probability density function, with p(u_i) = \partial y_i / \partial u_i = \partial g(u_i) / \partial u_i. Here g(u) is a nonlinear function that approximates the cumulative distribution function of the source signal u [3].

The natural gradient is introduced to improve the convergence speed [4]. In particular, this method does not require the inverse of the matrix W and yields the rule

    \Delta W \propto \frac{\partial H(y)}{\partial W} W^T W = [I - \varphi(u) u^T] \, W,    (3)

where \varphi(u) is related to the source probability density function and is called the score function.

Figure 2: (a) Ordered row vectors of the unmixing matrix W; (b) their frequency spectra.

Using the learning rule of Eq. (3), W is updated iteratively in a gradient-ascent manner until convergence. Let N denote the size of the speech segments, which are generated randomly from the training speech signals. Fig. 1 shows the basis-vector training network.

Figure 3: (a) Ordered basis vectors (column vectors of the mixing matrix A); (b) their frequency spectra.

The ICA network has N inputs and N outputs, and N basis vectors are produced from the N-by-N matrix A (A = W^{-1}).

3. SELECTION OF DOMINANT FEATURE VECTORS

For speech recognition, one may select dominant feature vectors from the N basis vectors. The ICA algorithm finds as many independent components as the dimensionality of the input, and may therefore produce redundant components. Several techniques have been proposed to reduce this redundancy [5].

In this paper, the contribution of each basis vector to the speech signal and the variability of its coefficients are considered. The contribution means the power of the basis vector in the speech signal, and the L_2-norm ||a_i||, where a_i is the i-th column vector of A, can represent the relative importance of the basis vectors. Therefore, M dominant feature vectors can be selected from the N basis vectors ordered by decreasing L_2-norm. The variability denotes the variance of the basis-vector coefficients, which can likewise represent the relative importance of the basis vectors for recognizing speech.

Fig. 4(a) shows the L_2-norms of the reordered basis vectors, and (b) shows the coefficient variance of the corresponding basis vectors. The two ordering methods yield almost the same basis-vector order, and basis vectors beyond about the 30th are negligible in both contribution and variability. The M selected feature vectors constitute an M-channel filter bank and provide a spectral vector for every time frame. A sketch of the training and selection steps appears below.
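As an illustration of Eq. (3) and the L_2-norm/variance ordering just described, here is a minimal Python/NumPy sketch. It assumes pre-whitened speech segments are already available as an array; the function names (train_ica, select_dominant), the fixed learning rate, and the batching details are illustrative choices, not the authors' code. The sign score function corresponds to the Laplacian source assumption used in Section 4.

```python
import numpy as np

def train_ica(segments, lr=0.001, batch=100, sweeps=300):
    """Natural-gradient infomax ICA, Eq. (3): dW = [I - phi(u) u^T] W.

    segments : (num_segments, N) array of pre-whitened speech segments.
    Uses phi(u) = sign(u), the score function of a Laplacian source prior.
    (The paper anneals the learning rate 0.001 -> 0.0005 -> 0.0001;
    a single fixed rate is used here for brevity.)
    """
    n = segments.shape[1]
    W = np.eye(n)                                  # unmixing matrix, identity init
    I = np.eye(n)
    for _ in range(sweeps):
        np.random.shuffle(segments)                # new segment order each sweep
        for start in range(0, len(segments), batch):
            X = segments[start:start + batch]      # (batch, N)
            U = X @ W.T                            # u = W x, one row per segment
            phi = np.sign(U)                       # Laplacian score function
            dW = (I - (phi.T @ U) / len(X)) @ W    # natural-gradient update, Eq. (3)
            W += lr * dW
    return W

def select_dominant(W, coeffs, M=20):
    """Order basis vectors (columns of A = W^{-1}) by decreasing L2-norm
    contribution and keep the top M; also report coefficient variances."""
    A = np.linalg.inv(W)
    contribution = np.linalg.norm(A, axis=0)       # ||a_i|| for each column
    variance = coeffs.var(axis=0)                  # variability criterion
    order = np.argsort(contribution)[::-1]         # decreasing contribution
    return A[:, order[:M]], order[:M], variance[order[:M]]
```

Returning the index order as well lets the caller pick out the matching rows of W, which act as the analysis filters in the recognition experiments below.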
Figure 4: (a) Basis-vector contributions (L_2-norms) to the speech signals; (b) variance of the basis-vector coefficients, both plotted against basis-vector index.

4. TRAINING USING REAL DATA AND RECOGNITION EXPERIMENTS

To train the basis vectors from natural human speech, 75 phonetically balanced Korean words uttered by 59 speakers were used. Speech segments of 50 samples each, i.e., a 3.1 ms time interval at a 16 kHz sampling rate, were generated randomly. In total 10^5 segments were generated, and each segment was pre-whitened to improve the convergence speed [1]. The whitening filter W_z is

    W_z = \langle (x - \langle x \rangle)(x - \langle x \rangle)^T \rangle^{-1/2}.

This removes both first- and second-order statistics from the input data x and makes the covariance matrix of x equal to I.

The unmixing matrix W of the ICA network was then obtained with the learning rule of Eq. (3) using the speech segments. W was initialized to the identity matrix, and \varphi(u) was taken to be the sign function.

Figure 5: (a) Center frequencies of the ICA-trained basis vectors; (b) center frequencies of the MFCC filter banks, both plotted against filter index.

This choice of score function is based on the Laplacian distribution of real speech-signal components. It improves the coding efficiency of speech signals, since most of the coefficients of u are almost zero and the important information about the speech signal is encoded by the few large coefficients in the tails of the Laplacian distribution.

300 sweeps through the whole set of segments were performed, and W was updated after every 100 input speech segments. The learning rate was fixed at 0.001 during the first 100 sweeps, 0.0005 during the next 100 sweeps, and 0.0001 during the last 100 sweeps. The resulting unmixing vectors are shown in Fig. 2(a), and their frequency magnitude spectra in Fig. 2(b). Finally, 50 basis vectors were obtained from the columns of the inverse matrix of W. Figs. 3(a) and (b) show the 50 basis vectors and their frequency magnitude spectra, ordered by contribution to the speech signals; the corresponding contribution values are shown in Fig. 4(a). Several of the learned basis functions are localized in both frequency and time and resemble Gabor-like filters. In Fig. 3(b), the basis vectors are almost ordered from low-frequency to high-frequency components, which follows from the relatively larger energy at low frequencies in human speech.

Figure 6: Block diagram of feature extraction: from N basis functions, select M basis functions; frame analysis of the speech signal for each channel yields M-channel outputs and a frame-vector sequence of M spectral components.

The ICA-based features were applied to an isolated-word recognition task. The vocabulary consists of 75 Korean words, uttered once by 38 speakers to form the training data and by 10 speakers to form the test data. Whole-word models were used for classification, each represented by a 15-state left-to-right continuous-density hidden Markov model (CDHMM). Speech features were extracted with the top M feature vectors of Fig. 3(a) over a 30 ms time window every 10 ms. The energies of the M spectral components were scaled logarithmically, and 13 cepstral coefficients were extracted. Fig. 6 shows the block diagram of this feature-extraction process; a sketch of the frame-level computation follows.
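To make the frame-analysis stage concrete, the following is a minimal sketch under the settings reported above (16 kHz sampling, 30 ms window, 10 ms shift, 13 cepstral coefficients). Treating each selected ICA filter as an FIR bandpass filter and taking the log channel energy per frame is one plausible reading of Fig. 6, not a detail the paper spells out; the function name ica_cepstra is illustrative.

```python
import numpy as np
from scipy.fftpack import dct

def ica_cepstra(signal, filters, fs=16000, win_ms=30, hop_ms=10, n_ceps=13):
    """ICA-filter-bank cepstra: filter each frame with the M selected
    basis filters, log-scale the channel energies, then apply a DCT.

    filters : (M, L) array, one selected ICA filter per row (L = 50 here).
    Returns an array of shape (num_frames, n_ceps).
    """
    win = int(fs * win_ms / 1000)   # 480 samples at 16 kHz
    hop = int(fs * hop_ms / 1000)   # 160 samples
    feats = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win]
        # channel energy: total squared output of each filter over the frame
        energies = [np.sum(np.convolve(frame, f, mode='valid') ** 2)
                    for f in filters]
        log_e = np.log(np.asarray(energies) + 1e-10)   # log-scaled energies
        feats.append(dct(log_e, type=2, norm='ortho')[:n_ceps])
    return np.array(feats)
```

The same skeleton would reproduce the MFCC baseline described next if the ICA filters were replaced by 18 mel-scaled bandpass filters.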
For comparison, standard MFCC features were extracted. In the MFCC feature-extraction process, a filter bank with the 18 mel-scaled center frequencies shown in Fig. 5(b) was used; the 18 logarithmically scaled spectral components were then transformed into 13 cepstral coefficients by the discrete cosine transform (DCT).

Figs. 5(a) and (b) show the center frequencies of the ICA basis vectors and of the MFCC filter bank. In contrast to the mel-scaled MFCC filter bank, the ICA basis vectors have approximately linearly distributed center frequencies. The 30th basis vector has a center frequency of about 4600 Hz, and within this range one can obtain sufficient spectral information to recognize speech.

Table 1: Recognition results.

    Method                        Error rate (%)
    MFCC                          3.8
    Proposed, 10 basis vectors    5.1
    Proposed, 20 basis vectors    2.0
    Proposed, 30 basis vectors    2.4
    Proposed, 40 basis vectors    3.9
    Proposed, 50 basis vectors    4.3

Table 1 shows the performance of the standard MFCC and the proposed feature-extraction method for various values of M. When 20 feature vectors were selected in the contribution order of Fig. 4(a), the proposed method yielded a 47.4% error reduction relative to the standard MFCC. This result shows that only a few active coefficients of u suffice to encode the speech signal.

5. CONCLUSION

In this paper we proposed new speech features for speech recognition based on the information-maximization algorithm of ICA. ICA was successfully applied to extract features that efficiently encode speech signals. Many of the extracted features are localized in both time and frequency and closely resemble Gabor filters. These features form a new filter bank in which each filter provides localized spectral components for every time frame of the input speech. The new features demonstrated better recognition performance than the standard MFCC features: the ICA basis vectors capture the higher-order structure of speech signals better than the MFCC filter bank.

6. REFERENCES

[1] Bell, A.J. and Sejnowski, T.J.: 'The "independent components" of natural scenes are edge filters', Vision Research, 1997, 37, (23), pp. 3327-3338.
[2] Bell, A.J. and Sejnowski, T.J.: 'Learning the higher-order structure of a natural sound', Network: Computation in Neural Systems, 1996, 7, pp. 261-266.
[3] Lee, T.W.: 'Independent Component Analysis: Theory and Applications', Kluwer Academic Publishers, 1998.
[4] Amari, S., Cichocki, A., and Yang, H.: 'A new learning algorithm for blind signal separation', Advances in Neural Information Processing Systems, 1996, 8, pp. 757-763.
[5] Bartlett, M.S., Lades, H.M., and Sejnowski, T.J.: 'Independent component representations for face recognition', Proceedings of the SPIE Symposium on Electronic Imaging: Science and Technology; Conference on Human Vision and Electronic Imaging III, San Jose, California, January 1998.
[6] Oja, E.: 'The nonlinear PCA learning rule in independent component analysis', Neurocomputing, 1997, 17, (1), pp. 25-46.
[7] Olshausen, B.A. and Field, D.J.: 'Emergence of simple-cell receptive field properties by learning a sparse code for natural images', Nature, 1996, 381, pp. 607-609.