
A Turbo-Style Algorithm for Lexical Baseforms Estimation

Ghinwa F. Choueiter, Mesrob I. Ohannessian, Stephanie Seneff, and James R. Glass
MIT Computer Science and Artificial Intelligence Laboratory
32 Vassar Street, Cambridge, MA 02139
{ghinwa, mesrob, seneff, glass}@mit.edu

(This research was supported by the Industrial Technology Research Institute (ITRI) in Taiwan.)

ABSTRACT

In this research, an iterative and unsupervised Turbo-style algorithm is presented and implemented for the task of automatic lexical acquisition. The algorithm makes use of spoken examples of both spellings and words, and fuses information from letter and subword recognizers to boost the overall lexical learning performance. The algorithm is tested on a challenging lexicon of restaurant and street names, and evaluated in terms of spelling accuracy and letter error rate. Absolute improvements of 7.2% and 3% (15.5% relative improvement) are obtained in the spelling accuracy and the letter error rate respectively, following only 2 iterations of the algorithm.

Index Terms: Turbo-style, spelling, pronunciation, lexical acquisition

1. INTRODUCTION

In speech recognition systems, automatic lexical update is the process of introducing new entries into the phonetic dictionary as well as refining pre-existing ones. Such an update process can be triggered by newly acquired information, such as a spoken example of an unknown word or its spelling. The capability of automatically learning a reliable estimate of a lexical entry (both spelling and phonetic baseform) of a word from spoken examples can prove quite beneficial. For example, consider spoken dialogue systems, which have been slowly emerging as a natural solution for information retrieval applications [1]. Such systems often suffer from dialogue breakdown at critical points that convey important information, such as named entities or geographical locations. One successful approach proposed for error recovery in dialogue systems lies in speak-and-spell models, which prompt the user for the spelling of an unrecognized word [2, 3]. In such cases, both the spoken spelling and the spoken word are available. The question that this research attempts to answer is: given both the spoken spelling and the spoken word, how well can a valid lexical entry in a dictionary be learned?

This research introduces an unsupervised iterative technique, denoted the Turbo-style algorithm, and applies it to the task of automatic lexical acquisition. In particular, spoken examples from two complementary domains, spelling and pronunciation, are presented to a letter and a subword recognizer respectively. The output of each recognizer is then processed by a bi-directional letter-to-sound (L2S) model and injected back into the other recognizer in the form of soft bias information. Such a set-up is denoted a Turbo-style learning algorithm, since it is inspired by the principles of Turbo Codes [4]. The term Turbo Code is in turn a reference to turbo-charged engines, where part of the output power is fed back to the engine to improve the performance of the whole system.

There has been significant research on automatic lexical generation [5, 6, 7]. However, the novel contribution of this work is two-fold: (1) spoken examples of both the spelling and the word are used, as opposed to the word only, and (2) a bi-directional L2S model is used to exchange bias information between the spelling and pronunciation domains so as to boost the overall performance of the tandem model. It is worth noting that the set-up does not consult a lexicon when estimating the spelling.
The basic principle of the proposed algorithm is the fusion of several sources of information, and it can be generalized to different set-ups. For example, a recent approach to unsupervised pattern discovery in speech produces reliable clusters of similar speech patterns [8]. The generated clusters can be processed by multiple subword recognizers whose outputs can be fused to boost the pronunciation recognition performance.

In the rest of the paper, the Turbo-style algorithm is described in Section 2, and the implementation components in Section 3. The experimental set-up and parameter tuning are depicted in Sections 4 and 5 respectively. Section 6 reports the results, and Section 7 concludes with a summary.

2. THE TURBO-STYLE ALGORITHM

In this section, the Turbo-style iterative algorithm is presented. The basic principle behind the proposed algorithm is to have two complementary recognizers, spelling and pronunciation, exchange bias information such that the performance of both systems is improved. In this particular implementation, the letter recognizer first generates an $N$-best list, which is projected into the complementary subword domain using a bi-directional L2S model. The projected $N$-best list is used to bias the subword LM, by injecting into it the pronunciations that best match the estimated spelling. A similar procedure is repeated in the subword domain. The algorithm is illustrated in Figure 1, and the steps for a pair of spoken spelling and word are as follows (a code sketch of the loop is given after the list):

(1) The spoken spelling is presented to the letter recognizer, and a letter $N_1$-best list is generated.
(2) The letter $N_1$-best list is processed by the L2S model, and a subword $M_1$-best list is produced.
(3) A bias subword language model (LM) is trained with the subword $M_1$-best list, and interpolated with the base subword LM by a factor $w_1$. The interpolated LM becomes the new base subword LM.
(4) A subword recognizer is built with the new interpolated subword LM, the spoken word is presented to the subword recognizer, and a subword $M_2$-best list is generated.
(5) The subword $M_2$-best list is processed by the S2L model, and a letter $N_2$-best list is produced.
(6) A bias letter LM is trained with the letter $N_2$-best list, and the bias letter LM is interpolated with the base letter LM by a factor $w_2$. The interpolated LM becomes the new base letter LM.
(7) A letter recognizer is built with the new interpolated letter LM.
(8) Go back to Step (1).

[Figure 1 omitted. Fig. 1. Illustration of the unsupervised Turbo-style algorithm used to refine the estimates of the spelling and the pronunciation of a new word. Tunable variables: $N_1$, $N_2$, $M_1$, $M_2$, $w_1$, and $w_2$.]

Figure 1 shows 7 parameters that need to be set: $N_1$, $M_1$, $w_1$, $N_2$, $M_2$, $w_2$, and $K$, the number of iterations performed. The tuning of these parameters is described in Section 5.
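To make the control flow concrete, the following is a minimal Python sketch of the loop above. The helpers `recognize_letters`, `recognize_subwords`, `l2s`, `s2l`, `train_lm`, and `interpolate` are hypothetical stand-ins for the letter/subword recognizers, the bi-directional L2S/S2L projections, and the LM training and interpolation steps; none of these names come from the paper's implementation. The default parameter values are the ones tuned in Section 5.

```python
def turbo_baseform_estimation(spoken_spelling, spoken_word,
                              letter_lm, subword_lm,
                              N1=20, M1=1000, w1=0.8,
                              M2=50, N2=1000, w2=0.4, K=2):
    """Sketch of Steps (1)-(8) for one spelling/word pair.

    All helper functions are hypothetical stand-ins, not the
    paper's actual code.
    """
    subword_mbest = None
    for _ in range(K):
        # (1) Letter recognizer -> letter N1-best list of spellings.
        letter_nbest = recognize_letters(spoken_spelling, letter_lm, n=N1)

        # (2) Project letters to subwords with the L2S model.
        subword_proj = l2s(letter_nbest, m=M1)

        # (3) Train a bias subword LM and interpolate with the base LM;
        #     w1 weights the bias LM: p = (1 - w1) * p_base + w1 * p_bias.
        subword_lm = interpolate(subword_lm, train_lm(subword_proj), w=w1)

        # (4) Subword recognizer (rebuilt with the new LM) -> M2-best list.
        subword_mbest = recognize_subwords(spoken_word, subword_lm, n=M2)

        # (5) Project subwords back to letters with the S2L model.
        letter_proj = s2l(subword_mbest, m=N2)

        # (6)-(7) Train a bias letter LM, interpolate, rebuild recognizer.
        letter_lm = interpolate(letter_lm, train_lm(letter_proj), w=w2)
        # (8) Loop back to Step (1).

    # After K iterations, a final pass yields the refined spelling estimate.
    return recognize_letters(spoken_spelling, letter_lm, n=N1), subword_mbest
```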
3. IMPLEMENTATION COMPONENTS

3.1. The Bi-Directional L2S/S2L Model

The bi-directional L2S model used in this research is based on a context-free grammar (CFG) designed to encode positional and phonological constraints in sub-syllabic structures. The CFG-based subword model is described in detail in [9], and evaluated successfully on the task of automatic pronunciation generation in [10]. Briefly, the CFG describes all possible ways sub-syllabic structures map to subword units, as well as all possible ways subword units map to spellings. The CFG pre-terminals are the subword units, which encode only pronunciation information, and the terminals are letter clusters, which encode spelling. The total numbers of pre-terminals and terminals are 677 and 1573 respectively. A by-product of the CFG is an automatically derived mapping between subwords and their spellings, which results in hybrid units denoted spellnemes. (Other researchers have used the term graphones for these types of units, e.g., Bisani and Ney [7].) The spellneme units facilitate the generation of a statistical L2S model. The L2S model, $T_{L2U}$, is modeled using finite-state transducers (FSTs) as follows:

$$T_{L2U} = T_{L2SP} \circ G_{SP} \circ T_{SP2U} \qquad (1)$$

where $T_{L2SP}$ and $T_{SP2U}$ are mappings from letters to spellnemes and from spellnemes to subwords respectively, and $G_{SP}$ is a spellneme trigram. A search through $T_{L2U}$ produces an $N$-best list of pronunciations corresponding to the input spelling. An S2L model is generated similarly.

3.2. The Subword and Letter Recognizers

The subword recognizer is modeled as a weighted FST, $R_S$:

$$R_S = C \circ P \circ L_S \circ G_S \qquad (2)$$

where $C$ denotes the mapping from context-dependent model labels to context-independent phone labels, $P$ the phonological rules that map phone labels to phoneme sequences, $L_S$ the subword lexicon, which is a mapping from phonemic units to subwords obtained from the CFG, and $G_S$ the subword trigram. A search through $R_S$ produces an $N$-best list of pronunciations corresponding to the spoken word.

The letter recognizer is similarly implemented as a weighted FST, $R_L$. The letter lexicon, $L_L$, contains 27 entries: the 26 letters of the alphabet and the apostrophe.
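The cascades in Eqs. (1) and (2) are ordinary weighted FST compositions. As a toy illustration of the mechanics only (not the paper's models), the sketch below composes relations represented as dicts of weighted input/output string pairs, with costs adding as in the tropical semiring. The miniature symbols are invented for this example; a real system would use an FST toolkit such as OpenFST.

```python
from collections import defaultdict

# A "transducer" here is a dict: input string -> list of (output, cost).
# This is only a toy stand-in for weighted FSTs; costs add under
# composition, as in the tropical semiring used by typical recognizers.

def compose(t1, t2):
    """Relational composition (t1 followed by t2): t1's outputs feed t2."""
    result = defaultdict(list)
    for x, pairs in t1.items():
        for y, c1 in pairs:
            for z, c2 in t2.get(y, []):
                result[x].append((z, c1 + c2))
    return dict(result)

# Invented miniature example in the spirit of Eq. (1):
# letters -> spellnemes (T_L2SP), a spellneme "LM" cost (G_SP),
# and spellnemes -> subwords (T_SP2U).
T_L2SP = {"cab": [("k_c ae_a b_b", 0.0)]}
G_SP   = {"k_c ae_a b_b": [("k_c ae_a b_b", 1.2)]}  # trigram cost
T_SP2U = {"k_c ae_a b_b": [("k ae b", 0.0)]}

T_L2U = compose(compose(T_L2SP, G_SP), T_SP2U)
print(T_L2U)  # {'cab': [('k ae b', 1.2)]}
```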
4. EXPERIMENTAL SET-UP

The SUMMIT segment-based speech recognition system is used [11] in all the experiments. Context-dependent diphone acoustic models are used with an MFCC (Mel-Frequency Cepstral Coefficient) based feature representation. The diphones are modeled with diagonal Gaussian mixture models with a maximum of 75 Gaussians per model, and are trained on telephone speech. The spellneme trigram, $G_{SP}$, used by the L2S model is built with 55k parsed nouns extracted from the LDC pronlex dictionary. The letter trigram, $G_L$, is trained with 300k Google words, and the subword trigram, $G_S$, with the same set parsed with the L2S model.

For the purpose of this research, 603 Massachusetts restaurant and street names were recorded together with their spoken spellings. This set is part of a larger data collection effort described in more detail in [10]. The 603 spelling/word pairs are split into a development (Dev) set of 300 pairs and a Test set of 303.

5. PARAMETER TUNING

In this section, the process of setting the parameters of the algorithm is presented. There are various ways of approaching such a problem, and the choice here is to set $N_1$ and $M_2$ separately, while $M_1$ and $w_1$ are tuned simultaneously, and similarly for $N_2$ and $w_2$.

$N_1$ and $M_2$ correspond to the numbers of top candidate spellings and pronunciations generated by the letter and subword recognizers respectively. $N_1$ is chosen to achieve an effective compromise between capturing the correct spelling and weeding out incorrect ones. This is done by presenting the Dev data to the letter recognizer and monitoring the depth of the correct spelling in the top 100 candidates. By this process, $N_1$ is empirically set to 20.

In a similar procedure on the pronunciation side, $M_2$ is empirically set to 50. However, it is worth noting that while reference spellings are available for the letter set-up, no references are available for the subword set-up. To avoid having to manually transcribe subword baseforms, the L2S model is used to automatically generate them [10].

$N_2$ and $w_2$ denote the number of top candidate spellings produced by the S2L model and the weight of the letter bias LM respectively. They are tuned to improve the performance of the letter recognizer on the Dev set. Performance is evaluated in terms of spelling match rate: a match is when the correct word occurs in the $N_1$-best list generated by the letter recognizer, where $N_1 = 20$. Since $M_2 = 50$, a subword 50-best list is processed by the S2L model, producing a spelling $N_2$-best list, where $N_2$ = 20, 100, 500, 1000, 5000, 10000. For each value of $N_2$, a bias LM is trained with the spelling $N_2$-best list and interpolated with a base LM. The interpolation weight, $w_2$, is varied between 0 and 1 in 0.2 steps. For each ($N_2$, $w_2$) pair, a letter recognizer is built and the spelling 20-best list is generated. Figure 2 reports the performance as a function of $N_2$ and $w_2$, and illustrates that mid-range values of both $N_2$ and $w_2$ are best. Based on this, $N_2$ is set to 1000 and $w_2$ to 0.4.

[Figure 2 omitted. Fig. 2. The spelling match rate, in a 20-best spelling list, evaluated on the Dev set as a function of $\log_{10}(N_2)$ and $w_2$.]

$M_1$ and $w_1$ correspond to the number of top candidate subword sequences generated by the L2S model and the weight of the subword bias LM respectively. They are tuned similarly to $N_2$ and $w_2$. For lack of space, we only report that $M_1$ is set to 1000 and $w_1$ to 0.8. Compared to $w_2$, the results indicate that the subword recognizer is more confident about the bias information obtained from the letter domain than vice versa. This is expected since the spelling domain is more constrained, and hence more reliable, than the subword one.

$K$ corresponds to the number of iterations of the Turbo-style algorithm. To set $K$, the algorithm is run on the Dev set until little change in performance is observed. The results are reported in Table 1 in terms of spelling match rates. The first column is the iteration number, where iteration 0 refers to the initial results prior to receiving any bias information from the complementary domain. The second to fifth columns give the spelling match rates in the top 1, 10, 20, and 100 spelling candidates. The results in Table 1 show substantial improvement in the spelling match rates following iteration 2. For example, the top 1 spelling accuracy improves by an absolute 5.7%. It is noted here that the results of the 0th iteration correspond to the spelling recognizer alone, without any feedback from the pronunciation domain. Based on the observation that no significant improvement occurs beyond iteration 3, $K$ is set to 2.

Iteration #   Top 1    Top 10   Top 20   Top 100
0             19.3%    50.6%    57.6%    77.6%
1             24.3%    53.6%    62.3%    78%
2             25%      56.3%    62.6%    76.6%
3             25%      56%      62.6%    76.6%

Table 1. Top 1, 10, 20, and 100 spelling match rates on the Dev set as a function of the iteration number.
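As a concrete reading of the metric in Table 1, the following is a small sketch of the spelling match rate (a hypothetical helper, not the paper's evaluation code): a pair counts as a match when the reference spelling occurs anywhere in the top-n candidate list, so n = 1 corresponds to the Top 1 column and n = 100 to the Top 100 column.

```python
def spelling_match_rate(nbest_lists, references, n=20):
    """Fraction of utterances whose reference spelling appears in the
    top-n recognizer candidates.

    nbest_lists[i] is the ranked spelling list for utterance i, and
    references[i] is its true spelling. Case-insensitive comparison
    is an assumption made for this illustration.
    """
    matches = sum(
        ref.lower() in (cand.lower() for cand in cands[:n])
        for cands, ref in zip(nbest_lists, references)
    )
    return matches / len(references)

# Example: top-20 match rate on a Dev set of N-best lists and references.
# rate = spelling_match_rate(dev_nbest, dev_refs, n=20)
```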