FORMANT ESTIMATION OF SPEECH SIGNALS USING SUBSPACE-BASED SPECTRAL ANALYSIS

Sotiris Karabetsos, Pirros Tsiakoulis, Stavroula-Evita Fotinea, Ioannis Dologlou
Institute for Language and Speech Processing (ILSP), Artemidos 6 & Epidavrou, Maroussi, Athens, Greece

ABSTRACT

The objective of this paper is to propose a signal processing scheme that employs subspace-based spectral analysis for the purpose of formant estimation of speech signals. Specifically, the scheme is based on decimative spectral estimation that uses Eigenanalysis and SVD (Singular Value Decomposition). The underlying model assumes a decomposition of the processed signal into complex damped sinusoids. In the case of formant tracking, the algorithm is applied to a small number of the autocorrelation coefficients of a speech frame. The proposed scheme is evaluated on both artificial and real speech utterances from the TIMIT database. For the former, comparative results against standard methods are provided, which indicate that the proposed methodology successfully estimates formant trajectories.

1. INTRODUCTION

Formant estimation of speech signals is of special concern, since formants constitute an important feature set that can be used in a wide range of applications, spanning from phoneme classification to speech synthesis and recognition [1]. Many methods for formant estimation rely on an all-pole assumption for the vocal tract transfer function, derived from a Linear Predictive (LP) analysis of speech. More robust estimation is achieved when a continuity constraint on the formant candidates is applied, either rule-based or through dynamic programming [1][2][3]. Moreover, alternative signal processing approaches have also been proposed. For example, in [4] the AM-FM modulation model and the multiband demodulation analysis scheme are applied to formant tracking. Another approach is presented in [5], in which formants are estimated by pre-filtering the speech signal with a time-varying adaptive filter prior to spectral peak estimation. An improvement on the latter scheme has been proposed in [6].
Other recent approaches include the processing of the differential phase spectrum [7] and the utilization of particle filters [8]. In this paper, an alternative approach is proposed which is based on spectral estimation of speech signals using subspace-based techniques. Early investigations and results were presented in [9], which provided first indications of the potential of the method. The current paper reports on the full algorithmic scheme, resulting in a more robust methodology that is evaluated using both synthetic and natural speech signals and, at the same time, is compared against current formant tracking methods. Specifically, the method exploited here is called DESED (DEcimative Spectral Estimation by factor D), which has been successfully used in NMR spectroscopy ([10], [11]) and, for the case of formant estimation, is modified to process a small number of the autocorrelation coefficients of a speech frame. For the rest of this paper we call this algorithm DESED-ACOR (DESED on AutoCORrelation of speech). Notice that the autocorrelation coefficients seem to convey information about the formants of a speech signal [12]. In addition, the method performs Eigenanalysis and SVD (Singular Value Decomposition) of Hankel matrices structured from signal samples. The underlying model assumes a decomposition of the processed signal into complex damped sinusoids. The number and the accuracy of the estimated sinusoids depend on the requested model order. For the case of formant estimation, since the number of requested frequencies is known within a specified frequency range (e.g. usually four formants in the range 0 - 4 kHz), the required model order is known exactly.

The rest of the paper is organized as follows. In Section 2, we present the DESED-ACOR algorithm and give the details of the proposed scheme for robust formant estimation. In Section 3, we present the experimental evaluation of the algorithm through comparative results on synthetic speech signals and illustrative examples of real (male and female) speech signals from the TIMIT database.
Finally, in Section 4, concluding remarks are discussed.

2. THE DESED-ACOR ALGORITHM

The schematic diagram of the proposed methodology is shown in Figure 1. An optional resampling process is available in order to adjust the original sampling frequency to an appropriate level when decimation is to be used. Then, the speech signal is passed through a low-pass FIR filter that has a twofold purpose. First, it filters the signal so as to retain only the frequency band of interest and, second, it acts as an anti-alias filter when decimation is used. Note that an FIR filter does not introduce any new poles, since its transfer function contains only zeros. Furthermore, the speech signal is passed through a pre-emphasis filter and then divided into overlapping frames. The pre-emphasis factor is tunable. The analysis can be either frame-synchronous or pitch-synchronous. Pitch estimation is done using the algorithm presented in [13]. Pitch-synchronous analysis is used when the sampling frequency is high enough to ensure enough samples over a pitch period. Prior to autocorrelation estimation, every analysis frame is multiplied by a Hamming window.

Figure 1: The proposed formant estimation scheme. Processing stages: x(n), resampling (optional), LPF (anti-aliasing low-pass filtering), pre-emphasis filtering, pitch-synchronous analysis (optional), windowing, estimation of autocorrelation, DESED-ACOR, formants.

The DESED-ACOR algorithm processes a small number of the autocorrelation coefficients and yields the estimates of the formant frequencies. DESED-ACOR is outlined in Algorithm 1. Experimental observations have shown that the number of autocorrelation lags depends on the sampling frequency and the requested number of poles. Moreover, for a model order p we need at least 4p lags. Typical values for the number of autocorrelation lags range from 24 to 64. The autocorrelation function is estimated as

$$r_{xx}(n) = \sum_{k=0}^{N_s-n-1} x(k)\,x(k+n) \qquad (1)$$

where x(k) is the k-th sample of a speech frame, $N_s$ is the length of a speech frame and n is bounded by the requested number of lags. We further notice that, for the case of formant frequency estimation, the model order is known exactly. The assumed signal model is of the form

$$r_{xx}(n) = \sum_{i=1}^{p} g_i z_i^{n}, \qquad n = 0, \ldots, N-1 \qquad (2)$$

where p is the number of complex damped sinusoids that comprise the measured signal, $g_i$ are the complex amplitudes, $z_i$ are the signal poles and $N = \lambda p$, where the integer constant $\lambda$ takes values $\lambda \geq 4$. As far as formant estimation is concerned, the objective is to estimate the frequencies $f_i$, for $i = 1, \ldots, p$.

Algorithm 1: DESED-ACOR
- Construct the $L \times M$ Hankel matrix $S$ from the N data points $r_{xx}(n)$ of (1), where $L - D \leq M$, $p \leq L - D$, $L + M - 1 = N$ and D is the decimation factor.
- Construct the matrices $S_{\downarrow D}$ and $S_{\uparrow D}$ as the D-order lower shift (top D rows deleted) and the D-order upper shift (bottom D rows deleted) equivalents of $S$.
- Employ the SVD $S_{\uparrow D} = U \Sigma V^{H}$ and truncate to order p by retaining only the largest p singular values. This results in the enhanced version $S_{\uparrow D,e}$.
- Compute the matrix $X = S_{\downarrow D}\,\mathrm{pinv}(S_{\uparrow D,e})$. The eigenvalues $\lambda_i$ of $X$ give the decimated signal pole estimates, which in turn give the estimates for the formant frequencies.

In this paper we have chosen to evaluate the least squares version of DESED (DESED-LS) instead of its total least squares counterpart (DESED-TLS), since preliminary experiments indicated that the latter did not perform better than the former in formant estimation. This has also been found true for the case of NMR spectroscopy [10]. However, a quantitative comparison between the two solutions would be beneficial and is planned for future research.

3. EXPERIMENTAL RESULTS

The formant tracking scheme was tested on several utterances, uttered by both male and female speakers. In order to quantitatively evaluate the proposed algorithm, comparison tests were performed on synthetic speech signals generated with the Klatt parallel synthesizer [14]. The performance of DESED-ACOR is compared to that of two publicly available speech analysis tools, namely Praat (http://www.praat.org) and WaveSurfer (http://www.speech.kth.se/wavesurfer/).
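The steps of Algorithm 1 in Section 2, together with the autocorrelation estimate of (1), can be sketched in NumPy as follows. This is a minimal illustration and not the authors' implementation. The function names, the default choice of L, the selection of the p largest-magnitude eigenvalues and the pole-to-frequency mapping f = angle(λ)·fs/(2πD) are assumptions made here for illustration. The orientation of the two shifted matrices follows the "top D rows deleted" and "bottom D rows deleted" wording of the algorithm.

```python
import numpy as np


def autocorr(frame, n_lags):
    """Eq. (1): r_xx(n) = sum_{k=0}^{Ns-n-1} x(k) x(k+n), for n = 0..n_lags-1."""
    x = np.asarray(frame, dtype=float)
    ns = len(x)
    return np.array([np.dot(x[:ns - n], x[n:]) for n in range(n_lags)])


def desed_ls_poles(r, p, D=1, L=None):
    """Least-squares DESED sketch: estimate p decimated poles z_i**D from r."""
    r = np.asarray(r)
    N = len(r)
    if L is None:
        # Assumption: pick L so the shifted matrices are close to square.
        L = (N + 1 + D) // 2
    M = N + 1 - L  # enforces L + M - 1 = N
    if not (p <= L - D <= M):
        raise ValueError("need p <= L - D <= M")

    # Hankel matrix S[l, m] = r[l + m]
    S = r[np.arange(L)[:, None] + np.arange(M)[None, :]]
    S_up = S[:L - D]   # bottom D rows deleted
    S_down = S[D:]     # top D rows deleted

    # Rank-p truncation of S_up; its pseudo-inverse is built directly
    # from the p retained singular triplets.
    U, s, Vh = np.linalg.svd(S_up, full_matrices=False)
    pinv_e = (Vh[:p].conj().T / s[:p]) @ U[:, :p].conj().T

    X = S_down @ pinv_e
    lam = np.linalg.eigvals(X)
    # Assumption: the p largest-magnitude eigenvalues are the signal poles,
    # the remaining ones are numerically zero.
    return lam[np.argsort(-np.abs(lam))[:p]]


def pole_frequencies(poles, fs, D=1):
    """Map decimated poles in the upper half-plane to frequencies in Hz."""
    poles = np.asarray(poles)
    upper = poles[poles.imag > 0]
    return np.sort(np.angle(upper) * fs / (2 * np.pi * D))
```

Under these assumptions, a windowed frame sampled at `fs` would be processed as `pole_frequencies(desed_ls_poles(autocorr(frame, 32), p=8, D=2), fs, D=2)`. The mapping is unambiguous only for frequencies below fs/(2D), which is the band the low-pass filter of Section 2 is meant to retain.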
Experimentation also included sentences from the TIMIT database, presented later in this section. Since there is no systematic way or straightforward criterion to assess formant estimates, for these sentences evaluation is qualitative, by visual inspection with the aid of the corresponding spectrograms.

We have evaluated all three methods in voiced regions of three synthetic speech signals (namely synth_i, i = 1, 2, 3) having a total duration of 1 sec. In order to provide comparative results, we have tried to exclude regions where any of the methods fails to estimate a formant value. For all methods we have manually tuned the parameters for the best possible estimation (note that for examples synth_1 and synth_2, Praat was set up to request five formants, since otherwise it was unable to locate the first formant).

Figure 2 presents the averaged RMS (Root Mean Squared) formant frequency estimation errors for all the synthetic signals used in the experiments. It is observed that the DESED-ACOR method achieves reliable formant estimates, especially for the first formant, where in all cases it has the smallest RMS error. Furthermore, in most cases DESED-ACOR is among the best two formant estimators and in general provides estimates comparable to those of Praat and WaveSurfer.

Figure 2: Formant tracking averaged RMS errors (Hz) for DESED-ACOR, WaveSurfer and Praat: (a) on Synth_1, (b) on Synth_2 and (c) on Synth_3. The horizontal axis denotes the formant number.

Additionally, it should be noted that, in the case of DESED-ACOR, no stylization or post-processing of the formant trajectories and estimates is performed. An example of formant tracking with DESED-ACOR, together with the actual formants used in synthesis, is depicted in Figure 3. It is shown that the formant estimates closely follow the real trajectories, failing only at unvoiced regions or transitions.
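The RMS error used for the comparison on synthetic speech can be computed per formant along the following lines. This is a sketch under stated assumptions: the array layout (frames by formants), the use of NaN to mark frames where a tracker returned no value, and the function name `formant_rms_error` are all introduced here for illustration and are not taken from the paper.

```python
import numpy as np


def formant_rms_error(estimated, reference):
    """Per-formant RMS error in Hz between two (frames x formants) tracks.

    Frames in which either track contains NaN are excluded, mirroring the
    idea of leaving out regions where a method fails to give an estimate.
    """
    est = np.asarray(estimated, dtype=float)
    ref = np.asarray(reference, dtype=float)
    if est.shape != ref.shape or est.ndim != 2:
        raise ValueError("tracks must be 2-D arrays of identical shape")

    valid = ~(np.isnan(est).any(axis=1) | np.isnan(ref).any(axis=1))
    if not valid.any():
        raise ValueError("no frame has estimates for all formants")

    diff = est[valid] - ref[valid]
    return np.sqrt(np.mean(diff ** 2, axis=0))
```

To compare several trackers on the same frames, the validity masks of all of them would have to be combined before averaging; the sketch handles one estimated track against one reference.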
Generally, experimentation showed that DESED-ACOR is able to reliably estimate the whole set of formants, while its efficiency and best estimation accuracy (minimum RMS error for every individual formant) depend on the proper adjustment of the parameters of the signal processing stages that precede the DESED algorithm. For example, it has been observed that adjusting the pre-emphasis factor decreases the estimation accuracy for the first formant while increasing it for the fourth formant. However, a systematic investigation of the parameter values and the interrelation between them is the main goal of planned research. Similarly, as mentioned before, the parameter values of the two other estimators that DESED-ACOR is compared with are also manually adjusted to make their estimation results as good as possible.

An example of real speech concerns a sentence from a male speaker from the TIMIT database. The signal comprises the English utterance "She had your dark suit in greasy wash-water all year". Figure 4 illustrates the estimated formant trajectories superimposed on the corresponding spectrogram. The sampling frequency is 16 kHz. A frame size of 256 samples with 50% overlap was used. The requested number of formants was set to four (p = 8), while the decimation factor was 2 and the number of autocorrelation lags was 32. The cut-off frequency of the FIR filter was set to 4 kHz. The algorithm manages to track smooth formant trajectories with decreased dispersion. This is confirmed by the high density regions of the corresponding spectrogram, which indicate high energy frequency bands (peaks) of the signal's spectrum.

Figure 3: DESED-ACOR formant tracking on synthetic speech. Continuous lines denote actual formant tracks. Dots indicate DESED-ACOR estimates.
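The front-end settings quoted for the male TIMIT utterance (16 kHz sampling, 4 kHz cut-off, 256-sample frames, 50% overlap, Hamming window) can be put together as a small NumPy sketch. Several details are assumptions, since the paper does not state them: the windowed-sinc design and the 101-tap length of the FIR filter, the pre-emphasis factor of 0.97, and all function names. The optional resampling and pitch-synchronous stages are left out.

```python
import numpy as np


def lowpass_fir(cutoff_hz, fs, num_taps=101):
    """Windowed-sinc low-pass FIR (all-zero, so it introduces no poles)."""
    n = np.arange(num_taps) - (num_taps - 1) / 2.0
    fc = cutoff_hz / fs  # cut-off in cycles per sample
    h = 2.0 * fc * np.sinc(2.0 * fc * n) * np.hamming(num_taps)
    return h / h.sum()  # unit gain at DC


def pre_emphasis(x, alpha=0.97):
    """y(n) = x(n) - alpha * x(n-1); alpha is the tunable factor."""
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - alpha * x[:-1])


def split_frames(x, frame_size=256, overlap=0.5):
    """Cut x into overlapping frames and apply a Hamming window to each."""
    x = np.asarray(x, dtype=float)
    if len(x) < frame_size:
        raise ValueError("signal shorter than one frame")
    hop = int(round(frame_size * (1.0 - overlap)))
    n_frames = 1 + (len(x) - frame_size) // hop
    idx = np.arange(frame_size)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_size)


def front_end(x, fs, cutoff_hz=4000.0, frame_size=256, overlap=0.5, alpha=0.97):
    """Low-pass filter, pre-emphasise and frame a speech signal."""
    x = np.asarray(x, dtype=float)
    y = np.convolve(x, lowpass_fir(cutoff_hz, fs), mode="same")
    y = pre_emphasis(y, alpha)
    return split_frames(y, frame_size, overlap)
```

Each row of the returned array is one windowed analysis frame, from which the requested number of autocorrelation lags (32 in the example above) would then be computed.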
Figure 4: Formant trajectory estimation using the DESED-ACOR (DESED-LS on autocorrelation coefficients) method on an utterance of a male speaker from the TIMIT database.

Figure 5: Formant tracking of DESED-ACOR (dots, yellow), WaveSurfer (hexagons, blue) and Praat (crosses, red) on a selected region of the natural speech of Figure 4.

Figure 5 depicts the estimated formant trajectories derived from DESED-ACOR, WaveSurfer and Praat on a selected region of the example illustrated in Figure 4. It is seen that DESED-ACOR achieves consistent formant tracks which are comparable to those of WaveSurfer and Praat, although the latter performs poorly in some cases.

Another example is that of a sentence from the TIMIT database uttered by a female speaker. The utterance is again "She had your dark suit in greasy wash-water all year". Figure 6 depicts the estimated formant trajectories superimposed on the corresponding spectrogram. The original sampling frequency of 16 kHz is converted to 24 kHz. A frame size of 512 samples with 80% overlap was used. The requested number of formants was set to five (p = 10), while the decimation factor was 1 and the number of autocorrelation lags was 40. The cut-off frequency of the FIR filter was set to 6 kHz, since formant values for female speakers are usually higher than for male speakers. In accordance with the previous example, the algorithm manages to track the whole set of formants and produces smooth trajectories with decreased dispersion. This is again confirmed by the high density regions of the corresponding spectrogram, where the high energy frequency bands (peaks) of the signal's spectrum are well modelled.

4. CONCLUSIONS

In this paper, we have investigated the use of subspace-based techniques for formant frequency estimation of speech signals and we have proposed a signal processing scheme to enhance their ability for robust tracking. The signal model assumes a decomposition into complex damped sinusoids, which act as formant candidates.
Furthermore, we have seen that the autocorrelation lags convey much of the information about the speech formants, so that an exact model order, depending on the number of formants, can be used. Consequently, this leads to an improved capability for successful estimation of formant trajectories. In addition, some refinements in the signal analysis process were introduced in order to obtain a robust algorithm for formant extraction.

5. ACKNOWLEDGEMENTS

This work has been partially supported by the National Technical University grant THALIS/M.I.R.C. 22.

Figure 6: Formant trajectory estimation using the DESED-ACOR method on an utterance of a female speaker from the TIMIT database.

REFERENCES

[1] M. Lee, J. van Santen, B. Mobius, and J. Olive, "Formant Tracking using Context-Dependent Phonemic Information," IEEE Trans. Acoust., Speech and Audio Processing, vol. 13, no. 5, Sept. 2005.
[2] R. W. Schafer and L. R. Rabiner, "System for automatic formant analysis of voiced speech," J. Acoust. Soc. Am., vol. 47, no. 2, 1970.
[3] D. Talkin, "Speech Formant Trajectory Estimation Using Dynamic Programming with Modulated Transition Costs," J. Acoust. Soc. Am., S1, pp. S55.
[4] A. Potamianos and P. Maragos, "Speech Formant and Bandwidth Tracking using Multiband Energy Demodulation," J. Acoust. Soc. Am., vol. 99, no. 6, June.
[5] A. Rao and R. Kumaresan, "On decomposing Speech into Modulated Components," IEEE Trans. Acoust., Speech and Audio Processing, vol. 8, no. 3, May 2000.
[6] K. Mustafa and I. C. Bruce, "Robust Formant Tracking for Continuous Speech with Speaker Variability," IEEE Trans. Acoust., Speech and Audio Processing, vol. PP, no. X, pp. 1-1, accepted for publication, Jan. 2005.
[7] B. Bozkurt, T. Dutoit, B. Doval, and C. D'Alessandro, "Improved differential Phase Spectrum processing for Formant tracking," in Proc. InterSpeech-ICSLP 2004, Jeju Island, Korea, 2004.
[8] Y. Shi and E. Chang, "Spectrogram-Based Formant Tracking via Particle Filters," in Proc. ICASSP 2003, Hong Kong, China, Apr. 2003, pp. I-168-I-171.
[9] S. Karabetsos, P. Tsiakoulis, S-E. Fotinea, and I. Dologlou, "On the Use of a Decimative Spectral Estimation Method Based on Eigenanalysis and SVD for Formant and Bandwidth Tracking of Speech Signals," in Proc. Interspeech-2005, Lisbon, Portugal, 2005.
[10] S-E. Fotinea, I. Dologlou, and G. Carayannis, "A new decimative spectral estimation method with unconstrained model order and decimation factor," in Total Least Squares and Errors-in-Variables Modeling: Analysis, Algorithms and Applications, S. Van Huffel and P. Lemmerling (Eds), Kluwer Academic Publishers, 2002.
[11] S-E. Fotinea, I. Dologlou, and G. Carayannis, "Decimation and SVD to estimate exponentially damped sinusoids in the presence of noise," in Proc. ICASSP 2001, Utah, USA, 2001.
[12] G. Carayannis and P. Jospa, "On the Analysis of Autocorrelation Function for Speech Spectra Estimation Application for Nasality detection," in Proc. ICASSP 1977, vol. 2.
[13] I. Dologlou and G. Carayannis, "Pitch Detection based on zero-phase Filtering," Speech Communication, vol. 8, no. 4.
[14] D. H. Klatt, "Software for a cascade/parallel formant synthesizer," J. Acoust. Soc. Am., vol. 67, 1980.