SPEECH FEATURE EXTRACTION USING INDEPENDENT COMPONENT ANALYSIS

Jong-Hwan Lee^1, Ho-Young Jung^2, Te-Won Lee^3, Soo-Young Lee^1

^1 Brain Science Research Center and Department of Electrical Engineering, Korea Advanced Institute of Science and Technology, 373-1 Kusong-Dong, Yusong-Gu, Taejon, 305-701 Korea (TEL: +82-42-869-8031, FAX: +82-42-869-8570, E-mail: jhlee@neuron.kaist.ac.kr)
^2 Electronics and Telecommunications Research Institute, 161 Kajong-dong, Yusong-Gu, Taejon, 305-350, Korea
^3 Computational Neurobiology Laboratory, The Salk Institute, 10010 N. Torrey Pines Road, La Jolla, California 92037, USA, and the Institute for Neural Computation, University of California, San Diego, USA
ABSTRACT
In this paper, we propose new speech features obtained by applying independent component analysis (ICA) to human speech. When ICA is applied to speech signals for efficient encoding, the adapted basis functions resemble Gabor-like features. Because the trained basis functions contain some redundancy, we select a subset of them by a reordering method. The basis functions are ordered almost monotonically from low-frequency to high-frequency basis vectors, which is consistent with the fact that human speech carries most of its information in the low-frequency range. These features can be used in automatic speech recognition systems, and the proposed method gives considerably better recognition rates than conventional mel-frequency cepstral features.
1. INTRODUCTION
Speech signals are composed of independent higher-order statistical characteristics. Independent component analysis (ICA) has been used to extract feature vectors based on these higher-order statistics from natural scenes and natural sound [1], [2]. These features are localized in both time (space) and frequency. However, no such features have been extracted from human speech for speech recognition. In this paper, we report the extraction of Gabor-like features from natural human speech. The extracted speech features look like bandpass filters in that they have center frequencies and limited bandwidths. In many filter-bank approaches, bandpass filters are designed mathematically to have mel-scaled center frequencies, and their bandwidths are likewise determined by abstract mathematical properties. In auditory-model feature extraction, the filter bank mimics the characteristics of the basilar membrane (BM). In the inner ear's cochlea, the input speech signal induces mechanical vibration of the basilar membrane, and each position along the membrane responds to localized frequency information in the speech signal. Auditory-based feature extraction therefore models each bandpass filter on these frequency characteristics of the basilar membrane.
This research was supported as a Brain Science & Engineering Research Program by the Korean Ministry of Science and Technology.
On the other hand, the basis vectors trained in this paper reflect the statistical properties of the input speech better than other filter-bank methods. For each time frame, feature coefficient vectors are obtained using the trained basis vectors. Finally, recognition rates with the ICA-based features are compared to those with mel-frequency cepstral coefficients (MFCCs) on isolated-word recognition tasks.
2. EXTRACTING SPEECH FEATURES USING ICA
To extract independent feature vectors from speech signals, the ICA algorithm is applied to a large number of human speech segments. An ICA network is trained to obtain independent components u from a speech segment x, and the trained weight matrix W extracts the basis-function coefficients u from x. ICA assumes that the observation x is a linear mixture of the independent components u. If A denotes the inverse matrix of W, then the columns of A represent the basis feature vectors of the observation x:

u = W x, \quad x = A u

To extract the basis functions one has to train either the mixing matrix A or the unmixing matrix W; here we trained the unmixing matrix W.
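As a concrete illustration, the linear ICA model above can be sketched in a few lines of NumPy. The matrices and the segment here are random placeholders standing in for trained values, not anything from the paper:

```python
import numpy as np

# Sketch of the linear ICA model: each observed speech segment x is
# modeled as a mixture x = A u, and the unmixing matrix W recovers
# the basis-function coefficients u = W x.
rng = np.random.default_rng(0)

N = 50                            # segment length in samples (Section 4)
W = rng.standard_normal((N, N))   # stands in for a trained unmixing matrix
A = np.linalg.inv(W)              # columns of A are the basis vectors

x = rng.standard_normal(N)        # one speech segment (placeholder data)
u = W @ x                         # basis-function coefficients
x_rec = A @ u                     # reconstruction from the basis

assert np.allclose(x, x_rec)     # x = A u inverts u = W x exactly
```

Because A is defined as the exact inverse of W, the reconstruction A u recovers x up to floating-point error.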
Figure 1: ICA network for training the basis vectors.

The learning rule is based on maximization of the joint entropy H(y), and is represented as [3]

\Delta W \propto \frac{\partial I(y; x)}{\partial W} = \frac{\partial H(y)}{\partial W} \quad (1)

\Delta W \propto [W^T]^{-1} + \frac{\partial p(u)/\partial u}{p(u)} x^T \quad (2)

where p(u) denotes the approximation of the speech-signal probability density function, p(u_i) = \partial y_i / \partial u_i = \partial g(u_i) / \partial u_i. Here g(u) is a nonlinear function that approximates the cumulative distribution function of the source signal u [3]. The natural gradient is also introduced to improve the convergence speed [4]. In particular, this method does not require the inverse of the matrix W, and provides the following rule:
\Delta W \propto \frac{\partial H(y)}{\partial W} W^T W = [I - \varphi(u) u^T] W, \quad (3)

where \varphi(u) is related to the source probability density function and is called the score function.

Figure 2: (a) Ordered row vectors of the unmixing matrix W; (b) their frequency spectra.

Using the learning rule in Eq. (3), W is iteratively updated in a gradient-ascent manner until convergence. Let N denote the size of the speech segments, which are randomly drawn from the training speech signals. Fig. 1 shows the basis-vector training network.

Figure 3: (a) Ordered basis vectors (column vectors of the mixing matrix A); (b) their frequency spectra.

The ICA network is composed of N inputs and N outputs, and N basis vectors are produced from the N-by-N matrix A (A = W^{-1}).
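A single batch update of Eq. (3) can be sketched as follows, using \varphi(u) = sign(u) (the choice made in Section 4 for Laplacian-distributed sources). The batch size, learning rate, and toy data are illustrative, not the paper's training setup:

```python
import numpy as np

# Natural-gradient infomax update of Eq. (3): dW ∝ (I - φ(u) u^T) W,
# with the score function φ(u) = sign(u) for a Laplacian source prior.
def ica_update(W, X, lr=0.001):
    """One batch update. X has shape (N, batch); columns are segments."""
    N, batch = X.shape
    U = W @ X                              # coefficients for the batch
    phi = np.sign(U)                       # score function, φ(u) = sign(u)
    dW = (np.eye(N) - (phi @ U.T) / batch) @ W   # batch-averaged Eq. (3)
    return W + lr * dW

rng = np.random.default_rng(0)
N = 50
W = np.eye(N)                              # identity init, as in Section 4
X = rng.laplace(size=(N, 100))             # placeholder batch of 100 segments
W = ica_update(W, X)
```

In a full run this update would be applied repeatedly over all segments until convergence; note that, unlike Eq. (2), no matrix inverse is needed.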
3. SELECTION OF DOMINANT FEATURE VECTORS
For speech recognition, one may select dominant feature vectors from the N basis vectors. The ICA algorithm finds as many independent components as the dimensionality of the input, and may therefore produce redundant components. Several techniques have been proposed to reduce this redundancy [5]. In this paper, both the contribution of each basis vector to the speech signal and the variability of its coefficients are considered.

The contribution means the power of the basis vector in the speech signals, and the L2-norm ||a_i||, where a_i is the i-th column vector of A, can represent the relative importance of the basis vectors. Therefore, from the N basis vectors ordered by decreasing L2-norm, the M dominant feature vectors can be selected.

The variability denotes the variance of the basis-vector coefficients, and this too can represent the relative importance of the basis vectors in recognizing speech signals.

Fig. 4(a) shows the L2-norms of the reordered basis vectors, and (b) shows the coefficient variance of the corresponding basis vectors. The two ordering methods yield almost the same basis-vector order, and basis vectors beyond about the 30th are negligible in both contribution and variability.

The selected M feature vectors constitute an M-channel filter bank, and provide a spectral vector for every time frame.

Figure 4: (a) Basis-vector contributions to the speech signals; (b) variance of the basis-vector coefficients.
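The two selection criteria above can be sketched as follows. The mixing matrix and coefficients here are random stand-ins; only the ordering logic reflects the text:

```python
import numpy as np

# Sketch of Section 3: contribution of the i-th basis vector is the
# L2-norm of the i-th column of A; variability is the variance of its
# coefficient u_i across segments. Select the top M by contribution.
rng = np.random.default_rng(0)
N, n_seg, M = 50, 1000, 20

A = rng.standard_normal((N, N))          # stands in for a trained mixing matrix
U = np.linalg.inv(A) @ rng.standard_normal((N, n_seg))  # coefficients u = W x

contribution = np.linalg.norm(A, axis=0)  # ||a_i|| for each column of A
variability = U.var(axis=1)               # variance of each coefficient

# Order basis vectors by decreasing contribution and keep the M dominant ones.
order = np.argsort(contribution)[::-1]
A_dominant = A[:, order[:M]]
assert A_dominant.shape == (N, M)
```

With trained matrices, the paper reports that the contribution and variability orderings nearly coincide; random data will of course not show that.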
4. TRAINING USING REAL DATA AND RECOGNITION EXPERIMENTS
To train the basis vectors on natural human speech, 75 phonetically balanced Korean words uttered by 59 speakers were used. Speech segments of 50 samples each, i.e., a 3.1 ms time interval at a 16 kHz sampling rate, were randomly generated. In total 10^5 segments were generated, and each segment was pre-whitened to improve the convergence speed [1]. The whitening filter W_z is

W_z = \langle (x - \langle x \rangle)(x - \langle x \rangle)^T \rangle^{-1/2}

This removes both the first- and second-order statistics from the input data x and makes the covariance matrix of x equal to I. Then the unmixing matrix W of the ICA network was obtained by the learning rule of Eq. (3) using the speech segments.
W was initialized to the identity matrix, and \varphi(u) was assumed to be a sign function.

Figure 5: (a) Center frequencies of the ICA-trained basis vectors; (b) center frequencies of the MFCC filter banks.

This assumption is based on the Laplacian distribution of real speech-signal components. It improves the coding efficiency of the speech signals, since most of the coefficients of u are almost zero and only a few important pieces of information about the speech signals are encoded in the tails of the Laplacian distribution.

300 sweeps through the whole set of segments were performed, and W was updated every 100 input speech segments. The learning rate was fixed at 0.001 during the first 100 sweeps, 0.0005 during the next 100 sweeps, and 0.0001 during the last 100 sweeps. The obtained unmixing vectors are shown in Fig. 2(a), and (b) shows their frequency magnitude spectra. Finally, 50 basis vectors were obtained from the columns of the inverse matrix of W. Fig. 3(a) and (b) show the 50 basis vectors and their frequency magnitude spectra, ordered by their contribution to the speech signals. The corresponding contribution values are shown in Fig. 4(a). Several of the learned basis functions are localized in both frequency and time and resemble Gabor-like filters. In Fig. 3(b), the basis vectors are ordered almost monotonically from low-frequency to high-frequency components, which stems from the relatively larger energy in the low frequencies of human speech.
Figure 6: Block diagram of feature extraction.

The ICA-based features were applied to an isolated-word recognition task. The vocabulary consists of 75 Korean words, and 38 and 10 speakers uttered each word once to form the training and test data, respectively. Whole-word models were used for classification, represented by 15-state left-to-right continuous-density hidden Markov models (CDHMMs). Speech features were extracted with the top M feature vectors in Fig. 3(a) on a 30 ms time window every 10 ms. The M spectral components' energies were scaled logarithmically, and 13 cepstral coefficients were extracted. Fig. 6 shows the block diagram of this feature extraction process.

For comparison, standard MFCC features were extracted. In the MFCC feature extraction process, a filter bank with the 18 mel-scaled center frequencies shown in Fig. 5(b) was used. The logarithmically scaled 18 spectral components were then transformed into 13 cepstral coefficients by the discrete cosine transform (DCT).

Fig. 5(a) and (b) show the center frequencies of the ICA basis vectors and the MFCC filter bank. In comparison with the MFCC filter banks, the ICA basis vectors have linearly distributed center frequencies. The 30th basis vector has a center frequency of about 4600 Hz, and up to this range one may obtain sufficient spectral information to recognize speech signals.

Table 1: Recognition results.

Method               Error rate (%)
MFCC                 3.8
Proposed, 10 basis   5.1
Proposed, 20 basis   2.0
Proposed, 30 basis   2.4
Proposed, 40 basis   3.9
Proposed, 50 basis   4.3

Table 1 shows the performance of the standard MFCC and the proposed feature extraction method for various values of M. When 20 feature vectors were selected in the contribution order shown in Fig. 4(a), the proposed method yielded a 47.4% error reduction relative to the standard MFCC. This result shows that only a few active coefficients of u are sufficient for encoding the speech signal.
5. CONCLUSION
In this paper, we proposed new speech features for speech recognition using the information-maximization algorithm of ICA. ICA was successfully applied to extract features that efficiently encode speech signals. Many of the extracted features are localized in both time and frequency and closely resemble Gabor filters. These speech features form a new filter bank, and each filter provides localized spectral components for every time frame of the input speech. The new features demonstrated better recognition performance than the standard MFCC features; the ICA basis vectors capture the higher-order structure of speech signals better than the MFCC filter bank.
6. REFERENCES
[1] Bell A.J. and Sejnowski T.J.: 'The "independent components" of natural scenes are edge filters', Vision Research, 1997, vol. 37, (23), pp. 3327-3338
[2] Bell A.J. and Sejnowski T.J.: 'Learning the higher-order structure of a natural sound', Network: Computation in Neural Systems, 1996, 7, pp. 261-266
[3] Lee T.W.: 'Independent Component Analysis - Theory and Applications', Kluwer Academic Publishers, 1998
[4] Amari S., Cichocki A., and Yang H.: 'A new learning algorithm for blind signal separation', Advances in Neural Information Processing Systems, 1996, 8, pp. 757-763
[5] Bartlett M.S., Lades H.M., and Sejnowski T.J.: 'Independent component representations for face recognition', Proceedings of the SPIE Symposium on Electronic Imaging: Science and Technology; Conference on Human Vision and Electronic Imaging III, San Jose, California, January 1998
[6] Oja E.: 'The nonlinear PCA learning rule in independent component analysis', Neurocomputing, 1997, 17, (1), pp. 25-46
[7] Olshausen B.A. and Field D.J.: 'Emergence of simple-cell receptive field properties by learning a sparse code for natural images', Nature, 1996, 381, pp. 607-609
