A natural acoustic front-end for Interactive TV in the EU-Project DICIT

L. Marquardt (a), P. Svaizer (b), E. Mabande (a), A. Brutti (b), C. Zieger (b), M. Omologo (b), and W. Kellermann (a)

(a) Multimedia Communications and Signal Processing, University of Erlangen-Nuremberg, Cauerstr. 7, 91058 Erlangen, Germany
(b) Fondazione Bruno Kessler - irst, Via Sommarive 18, 38100 Trento, Italy

E-mail addresses: {marquardt,mabande,wk} (a), {svaizer,brutti,zieger,omologo} (b)

Abstract

"Distant-talking Interfaces for Control of Interactive TV" (DICIT) is a European Union-funded project whose main objective is to integrate distant-talking voice interaction as a complementary modality to the use of a remote control in interactive TV systems. Hands-free and seamless control enables a natural user-system interaction, providing a suitable means to greatly ease information retrieval. In the given living room scenario, the system recognizes commands spoken by multiple and possibly moving users, even in the presence of background noise and TV surround audio. This paper focuses on the multichannel acoustic front-end (MCAF) processing for acoustic scene interpretation, which is based on the combination of multi-channel acoustic echo cancellation, blind source separation, beamforming, acoustic event classification, and multiple speaker localization. The fully functional DICIT prototype consists of the MCAF, automatic speech recognition, natural language understanding, mixed-initiative dialogue, and satellite connection.

1. Introduction

The goal of DICIT [1] is to provide a user-friendly multimodal interface that allows voice-based access to a virtual smart assistant for interacting with TV-related digital devices and infotainment services, such as digital TV, Hi-Fi audio devices, etc., in a typical living room.
Multiple and possibly moving users can use their voice for controlling the TV, e.g., requesting information about an upcoming program and scheduling its recording, without the need for any hand-held or head-mounted gear. This scenario requires real-time-capable acoustic signal processing techniques which compensate for the impairments of the desired speech signals by acoustic echoes from the loudspeakers, local interferers, ambient noise, and reverberation. Accordingly, one of the key components of the prototypes developed within the DICIT project is the combination of state-of-the-art multichannel acoustic echo cancellation (MC-AEC), beamforming (BF), blind source separation (BSS), smart speech filtering (SSF) based on acoustic event detection and classification, and multiple source localization (SLOC) techniques.

The subsequent sections of this paper are structured as follows: In Sect. 2 we describe the general architecture of the overall DICIT system. The acoustic front-end, a crucial building block of the DICIT system, is presented in Sect. 3: We first describe the currently fully integrated front-end, which is based on MC-AEC, BF, SLOC, and SSF (see also the video at [1]). An alternative approach under development, featuring BSS, MC-AEC, and SSF, is presented next. Conclusions and an outlook on next steps and further possible improvements are given in Sect. 4.

This work was partially supported by the European Commission within the DICIT project under contract number 034624.

2. The DICIT System

In the following, we first describe the architecture of the overall DICIT system, outline the functionality of its most important components, and briefly describe the hardware used.

2.1. System architecture

The main building blocks of the DICIT system are the signal acquisition and playback hardware, the acoustic front-end processing, the automatic speech recognition (ASR) and natural language understanding (NLU) unit, and the actual dialogue manager (DM), as depicted in Fig. 1.

Figure 1. DICIT architecture

The first block comprises the hardware for signal acquisition and reproduction, as detailed in the upper part of Fig. 2. Its main components are the 13-channel microphone array and a multichannel loudspeaker system, which capture the acoustic signals from the environment and play back the digitally mixed outputs from the TV and the dialogue system in stereo format, respectively. Note that the TV system comprises a remote control device as well as a set-top box (STB) platform providing access to on-air satellite signals.

The acoustic front-end processing, which will be described in detail in Sect. 3, extracts the desired speech from the microphone signals and passes it to the subsequent ASR. Given the state of the art in robust speech recognition, it is still crucial for the targeted environment to remove, to the greatest possible extent, any signal impairments due to reverberation, background noise, interferers, and acoustic feedback from the loudspeakers to the microphones, and to forward to the ASR only those signal segments that can reliably be classified as user speech.

Continuous speech recognition technology in DICIT is based on IBM Embedded ViaVoice (EVV) [2]. Acoustic models have been trained in order to optimize the recognition performance given the distant-talking voice characteristics as well as the typical noisy and reverberant conditions of the addressed scenario. The ASR output is interpreted by an NLU unit which employs a statistical model called the Multi-level Action Classifier [2]. This processing chain has been optimized for the English, German, and Italian languages to enable multilinguality as an additional feature of the system.

The DM finally manages all interactions with user input and system output and interfaces to external data and devices. Depending on the NLU output or remote control input, the DM is primarily responsible for information retrieval from the electronic program guide (EPG) and for controlling the TV/STB system.
Feedback to the user is possible by acoustic means via speech generation and by visual means via the screen.

2.2. Hardware setup

Apart from the microphone array and the loudspeakers, the hardware setup consists of AD/DA converters, preamplifiers, the STB, and two PCs. The use of two PCs was necessitated by the use of two different operating systems. The first, Linux-based PC is equipped with a multichannel digital soundcard and hosts the acoustic front-end processing modules. The second, Windows-based PC hosts ASR, NLU, and the DM. Communication between the two PCs is established via a standard TCP-based internet protocol. The video signal from the STB is displayed on an LCD screen or via a video projector.

3. Acoustic front-end

The acoustic front-end foresees different combinations of signal processing components. Its configuration depends primarily on computational constraints and the requirements of the specific scenario. The following subsections describe two practically relevant architectures, both featuring MC-AEC but differing with respect to the employed spatial processing and source localization techniques. While the first configuration is part of the current DICIT prototype and uses beamforming and traditional correlation-based source localization, the BSS-based architecture, which aims at extended functionality and reduces the number of microphones, is currently being integrated.

3.1. BF-/SLOC-based front-end

The DICIT prototype is based on an acoustic front-end which efficiently combines stereo acoustic echo cancellation (SAEC), BF, SLOC, and SSF. The front-end and its connection to the signal acquisition and playback stage are depicted in Fig. 2.

Figure 2. Acoustic front-end (based on BF and SLOC)

We first consider the structure of the entire BF-/SLOC-based front-end before its individual components are described in more detail.
While BF extracts the speech signal originating from the desired look direction with minimum distortion and suppresses unwanted noise and interference [3], AEC compensates for the acoustic coupling between loudspeakers and sensors [4]. Since the scenario implies an almost unconstrained and possibly time-varying user position, a correspondingly adaptive BF structure was employed. Its combination with the SAEC structure was guided by the principles laid out in [5]: Since applying SAEC to all 13 microphone signals is computationally too expensive, SAEC was placed behind the BF structure. A set of five data-independent beamformers is computed in parallel; these cover the possible speaker positions and track moving users by switching between beams. Thereby, the AECs do not need to track time-varying beamformers. Instead of one SAEC behind each beamformer output, only one SAEC is calculated for the beam covering the source of interest. Assuming that beam-switches occur infrequently, the necessary readaptation of the SAEC filter coefficients is acceptable. The reuse of AEC filter coefficients determined for previously selected beamformers further reduces the impact of occasionally switching beams.

The selection of the beamformer output to be passed to the SAEC is made by the source localization. As the SLOC needs to use microphone signals which still contain acoustic echoes of the TV audio signals, a priori knowledge of the loudspeaker positions has to be exploited to exclude the TV loudspeakers as sources of interest.

Finally, the SSF module analyzes the output of the SAEC in order to detect speech segments from the user. For a robust system it is crucial that only the desired speech segments, and no nonstationary noise or echo residuals, are passed to the ASR. The corresponding decision is supported by the SLOC information.

As an example, Fig. 3 shows the effect of the front-end processing for a recording containing five control utterances ("ok", "set volume to seven", "CNN", "set volume to five", and "show me the EPG") of a speaker at a distance of 2.5 meters from the microphone array in broadside direction, in a room with a reverberation time of 300 ms, a background noise level of 36 dB SPL, and real TV audio output. The upper subplot shows a single microphone input, while the lower plot depicts the AEC output together with the correct segmentation by the SSF unit. The cancellation of TV loudspeaker echoes is characterized by a mean echo return loss enhancement (ERLE) of 28 dB calculated over the last five seconds. (The delay between microphone input and AEC output is 20 ms.)

Figure 3. Acoustic front-end processing (upper panel: microphone signal; lower panel: segmented signal after BF, AEC, and SSF; time axis t [s])

The following paragraphs outline the algorithms that have been chosen and adapted for the described scenario.

Beamforming. To account for the wideband nature of speech and ensure good spatial selectivity, a nested array-based BF design was chosen [6], using 13 microphones for four subarrays, one of which uses seven microphones and three of which use five microphones each, with spacings of 0.32 m, 0.16 m, 0.08 m, and 0.04 m, respectively. These subarrays operate in the frequency bands of 100-900 Hz, 900-1800 Hz, 1800-3600 Hz, and 3600-8000 Hz, respectively.

In the acoustic front-end, the BF module consists of a filter-and-sum beamformer (FSB) and five steering units (SU). The FSB based on a Dolph-Chebyshev design (FSB-DC) [7] with FIR filters of length 512 taps was selected here for its good spatial selectivity and its robustness to sensor calibration errors. The steering units consist of sets of fractional delay filters [8] which steer the beam to the five predefined look directions.
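As a rough illustration of what a steering unit has to provide, the following sketch computes the per-microphone delays for a uniform linear subarray under a far-field (plane-wave) assumption. The subarray size and spacing are taken from the text; the sampling rate, the speed of sound, the angle convention, and the five look directions are assumptions made only for this example and are not stated in the paper. The resulting delays are generally non-integer, which is why fractional delay filters are needed.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def steering_delays(num_mics, spacing_m, angle_deg, fs_hz):
    """Delays (in samples) that time-align a plane wave on a uniform linear array.

    Assumed convention: angle_deg is measured from broadside (0 degrees)
    towards the positive end of the array axis.
    """
    theta = math.radians(angle_deg)
    centre = (num_mics - 1) / 2.0
    # Microphone positions along the array axis, centred on the array midpoint.
    positions = [(m - centre) * spacing_m for m in range(num_mics)]
    # Microphones closer to the source receive the wavefront earlier
    # and therefore need the larger delay.
    raw = [x * math.sin(theta) / SPEED_OF_SOUND * fs_hz for x in positions]
    # Shift so that every delay is non-negative (causal delay filters).
    offset = min(raw)
    return [d - offset for d in raw]


# Seven-microphone subarray with 0.32 m spacing (from the text).
# Sampling rate and look directions are hypothetical.
FS_HZ = 16000
LOOK_DIRECTIONS_DEG = [-60, -30, 0, 30, 60]
delay_table = {
    angle: steering_delays(7, 0.32, angle, FS_HZ)
    for angle in LOOK_DIRECTIONS_DEG
}
```

Because the delays depend only on the geometry and the look direction, such a table can be computed once and reused for every signal block.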
The steering units are inserted after the FSB filtering of the individual channels. Thereby, the FSB filtering of the microphone signals is required only once for all beams, and only the delaying and the summation of the microphone channels have to be carried out for each beam.

Multi-channel Acoustic Echo Cancellation. The algorithm employed for the current acoustic front-end is based on the generalized frequency-domain adaptive filtering (GFDAF) paradigm [9]. Exploiting the computational efficiency of the FFT to minimize the computational load, it also accounts for the cross-correlations among the different reproduction channels to accelerate convergence of the filters and, consequently, achieves a more efficient echo suppression. This is crucial in the given scenario, as user movements have to be expected, which in turn imply rapid changes of the impulse responses of the loudspeaker-enclosure-microphone (LEM) system that has to be identified by the adaptive filters.

Since the stereo channels of the TV audio are usually very similar and therefore not only highly auto-correlated but also often strongly cross-correlated, a preceding channel decorrelation (see Fig. 2) allows a further acceleration of the filter convergence. Apart from breaking up the interchannel correlation, the introduced signal manipulations must not cause audible artifacts. For the discussed acoustic front-end, the phase modulation-based approach according to [10] has been implemented, which reconciles the requirements of low complexity and convergence support with the demand for not impairing subjective audio quality, especially the spatial image of the reproduced sound.

Due to the combination of a single AEC with the switched beamformer described above, the AEC sees a different acoustic echo path after each beam-switch.
To avoid the need to readapt the AEC filters starting from non-matching coefficients, the filter coefficients that were identified in the previous use of the respective beam can be used as a starting point for readaptation [5]. In the given scenario, this proves to be very efficient, as underlined by Fig. 4, where the ERLE is compared for adaptation without (left) and with (right) coefficient buffering following a beam-switch of the DICIT beamformer at t = 2 s, given continuous TV audio output.

Figure 4. Effect of beam-switching without and with coefficient buffering (inst. ERLE [dB] versus t [s])

Source Localization. Acoustic maps, computed on a grid of points in an enclosure, express the plausibility of sound being generated at those points and hence represent a valid solution to the SLOC problem. In particular, the global coherence field (GCF) [11], also known as SRP-PHAT [3], combines the information obtained through a generalized cross-correlation phase transform (GCC-PHAT) [12] analysis at different microphone pairs. Given a GCF map, the SLOC problem can be addressed by picking the peaks appearing at the spatial points corresponding to active acoustic sources. In DICIT, the subarray consisting of seven microphones at 0.32 m distance is used for GCF computation, as it guarantees good performance at a reasonable computational cost.

In order to avoid beam-switching during silence phases and to reduce the impact of false beam-switches due to faulty localization estimates, the SLOC module provides the BF with a new position estimate only if the map peak is above a given fixed threshold. In fact, the amplitude of the peak is correlated with the relevance of acoustic activity and can therefore act as an embedded acoustic activity detector. If the map peak is below the chosen threshold, the previous position is kept.
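The peak-gated update described above can be sketched for a single microphone pair. This is a deliberately reduced illustration and not the DICIT implementation: it evaluates GCC-PHAT for one pair only (a GCF map would combine many pairs over a spatial grid), it uses a naive DFT to stay free of external libraries, and the function names, test signal, and threshold value are assumptions introduced for the example.

```python
import cmath
import random


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat(x1, x2):
    """Return (lag, peak): lag > 0 means x1 is delayed relative to x2."""
    n = len(x1) + len(x2)  # zero-padding avoids circular wrap-around
    spec1 = dft(list(x1) + [0.0] * (n - len(x1)))
    spec2 = dft(list(x2) + [0.0] * (n - len(x2)))
    cross = [a * b.conjugate() for a, b in zip(spec1, spec2)]
    # PHAT weighting: discard the magnitude and keep only the phase.
    whitened = [c / max(abs(c), 1e-12) for c in cross]
    corr = [v.real for v in dft(whitened, inverse=True)]
    best = max(range(n), key=lambda i: corr[i])
    lag = best if best < n // 2 else best - n
    return lag, corr[best]


def update_position(previous, candidate, peak, threshold):
    """Accept the new estimate only if the peak exceeds the threshold."""
    return candidate if peak > threshold else previous


# Synthetic test signal: mic_a receives the source 3 samples later than mic_b.
rng = random.Random(0)
source = [rng.uniform(-1.0, 1.0) for _ in range(16)]
mic_a = [0.0, 0.0, 0.0] + source
mic_b = source + [0.0, 0.0, 0.0]

lag, peak = gcc_phat(mic_a, mic_b)
# The threshold of 0.5 is hypothetical; the paper only states that it is fixed.
position = update_position(previous=None, candidate=lag, peak=peak, threshold=0.5)
```

In this simplified setting the "position" is just a time lag in samples; mapping lags from several pairs onto a grid of candidate points is what turns the same idea into an acoustic map.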
Besides robustness, promptness is a crucial requirement for the module, so that the system can quickly steer the beam toward the speaker as soon as he/she starts speaking. A memoryless localization is therefore employed in combination with a post-processing step whose goal is to suppress outliers, i.e., isolated estimates located far away from the current speaker area.

As the SLOC module operates on the microphone signals still containing the TV echoes, estimating the position of the user requires suppression of the loudspeaker signals. In DICIT, the loudspeaker contributions are removed at the GCC-PHAT level by exploiting the knowledge of their positions relative to the microphone array. The approach is derived from the multiple source localization approach in [13], treating the single user plus the TV loudspeakers as multiple simultaneously active sources. Fig. 5 shows an example of a GCF map before (left) and after (right) the removal of the loudspeaker contributions (bright colors represent high values; the stereo loudspeakers and the DICIT array are schematically depicted on the right side of each plot). Only after the de-emphasis of the loudspeakers does the user position (indicated by the circle) correspond to the highest activity region, as visible in the right plot.

Experiments conducted on Wizard of Oz data collected in reverberant rooms [14] show that the SLOC module estimates the source position with an RMS error of 7.5 degrees.

Figure 5. GCF map before and after the removal of the loudspeaker contributions

Smart Speech Filtering. After the signal processing by MC-AEC, sound produced by the TV has been almost completely cancelled from the beamformer output; therefore, user commands can be detected on the basis of the dynamics of the resulting signal. Constraints are applied concerning the minimum duration of utterances and the maximum duration of pauses between words in order to isolate potentially relevant signal segments.
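The two duration constraints mentioned above can be illustrated with a small segmentation routine. It is a sketch under assumptions: the input is taken to be a per-frame activity flag (for example, obtained by thresholding the frame energy), and the function name and the parameter values are hypothetical rather than taken from the DICIT system.

```python
def find_segments(active, min_frames, max_pause_frames):
    """Group active frames into segments.

    Pauses of at most max_pause_frames are bridged (pauses between words);
    segments shorter than min_frames are discarded (too short to be an
    utterance). Returns (start, end) frame indices with end exclusive.
    """
    segments = []
    start = None
    last_active = None
    for i, flag in enumerate(active):
        if not flag:
            continue
        if start is None:
            start = i
        elif i - last_active - 1 > max_pause_frames:
            # The pause was too long: close the segment and open a new one.
            segments.append((start, last_active + 1))
            start = i
        last_active = i
    if start is not None:
        segments.append((start, last_active + 1))
    return [(s, e) for s, e in segments if e - s >= min_frames]


# Hypothetical activity pattern: two words separated by a one-frame pause,
# followed by a long silence and an isolated one-frame click.
activity = [0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0]
segments = find_segments(activity, min_frames=3, max_pause_frames=1)
```

Here the short pause at frame 3 is bridged so that frames 1 to 5 form one segment, while the isolated active frame at index 10 is rejected as too short.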
Additionally, only signal segments exhibiting a sufficient spatial coherence at the microphones, possibly produced by a speaker in an area in front of and oriented towards the TV, are retained. Thus, speakers in other areas or not addressing DICIT can be ignored. SLOC information is exploited at this stage in order to take into account both the speaker's position and likely orientation [15].

3.2. BSS-based front-end

The BSS-based front-end described in the following represents an alternative approach to the front-end presented in the previous section and is currently being integrated into the overall system. Fig. 6 shows the corresponding block diagram.

Figure 6. Acoustic front-end (based on BSS)

Since BSS can be interpreted as a set of adaptive null-beamformers, it replaces the functionality of the data-independent beamformers and the source localization of the first approach. One major advantage of the BSS-based front-end is the reduction of the number of microphones. For the envisaged BSS-based front-end, only two sensors will be needed, which is expected to be of great importance with respect to the overall system complexity, user acceptance, and cost. A second benefit is that, in contrast to the prototype based on the front-end described in Sect. 3.1, which can currently extract only one active user, BSS using two sensor signals is also able to extract two simultaneously speaking users.
In any case, two streams of data will be delivered to the following SSF module, carrying the following signals:

• If no user is active, two zero-valued signals arrive at the SSF component,
• If one user is active, the user's signal will appear in one SSF input and will be attenuated in the other SSF input,
• If two users are simultaneously active, each SSF input will be dominated by one user signal.

BSS can be combined with AEC in two different ways. The AEC can be performed directly on the microphone inputs, or it can be applied at a later stage to the BSS outputs. Taking into account considerations described in [5, 16], we concentrated on the "AEC-first" alternative, as shown in Fig. 6. The SLOC module depicted in Fig. 6 represents an additional source of information and might supplement the BSS-inherent source localization and thus also help to improve the decisions to be made by the SSF. SSF first processes the two input streams provided by BSS in order to detect speech segments and reject any non-speech event by means of an acoustic event classifier. Moreover, because SSF here has to work on more than one input stream, it is likely that two simultaneously active speakers will create two streams with valid speech segments. Therefore, it must be decided which speech signal to pass to the ASR and which one to reject. This decision can be based on speaker identification. The algorithms for MC-AEC and the related signal decorrelation will be the same as in the preceding Sect. 3.1. The following paragraphs introduce the components which are only used within the BSS-based architecture.

Blind Source Separation. The extraction of up to two simultaneously active sources with two microphones corresponds to the overdetermined or the determined BSS case, respectively.
Approaches based on independent component analysis (ICA) are well suited for both cases, requiring only the assumption of statistical independence of the original source signals.

Here, we consider a broadband BSS approach based on the TRINICON framework [17]. For the development of the BSS-based front-end, we implemented an efficient second-order-statistics (SOS) version of the TRINICON update rule [18].

While BSS recovers the original source signals from a (possibly reverberant) sound mixture without a priori knowledge about the locations of the sources, the BSS demixing filters also contain information on the source locations. One way to retrieve the localization information has been presented in [19]. It relies on the ability of a broadband BSS algorithm to perform blind adaptive identification of the acoustical environment for two microphone channels. Thus, two time differences of arrival (TDOAs) can be extracted by identifying the highest peaks in the BSS filters, corresponding to the direct paths.

Acoustic Event Classification and Speaker Identification for SSF. In the foreseen scenario, a classification step may be necessary to discriminate actual speech segments from other interfering events (phone ringing, sneezing, laughing, ...). The foreseen acoustic event classification (AECL) is based on a set of mel-frequency cepstral coefficients (MFCCs) as acoustic signal features. A score is computed by comparing the observed feature vector with Gaussian mixture models (GMMs), trained on examples of the considered acoustic events. The best match in terms of average likelihood provides the classification of the signal segment [20].

Moreover, when the classified event is speech, it may be necessary to classify the speaker identity as well. In this case, a speaker identification (SID) capability must be introduced in the SSF, consisting of the two steps of feature extraction and score computation.
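The GMM scoring step used for acoustic event classification can be sketched as follows. The example is illustrative only: it assumes diagonal covariance matrices, uses toy two-dimensional feature vectors in place of MFCCs, averages the log-likelihood over frames, and the class names and model parameters are invented for the demonstration rather than trained on data.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance at point x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_avg_loglik(frames, gmm):
    """Average per-frame log-likelihood; gmm is a list of (weight, mean, var)."""
    total = 0.0
    for x in frames:
        logs = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
        top = max(logs)
        # Log-sum-exp over the mixture components for numerical stability.
        total += top + math.log(sum(math.exp(value - top) for value in logs))
    return total / len(frames)


def classify(frames, models):
    """Return the name of the model with the highest average log-likelihood."""
    return max(models, key=lambda name: gmm_avg_loglik(frames, models[name]))


# Hypothetical models (not trained on real data).
models = {
    "speech": [(1.0, [0.0, 0.0], [1.0, 1.0])],
    "phone_ring": [
        (0.5, [5.0, 5.0], [1.0, 1.0]),
        (0.5, [6.0, 4.0], [1.0, 1.0]),
    ],
}

segment = [[0.1, -0.2], [0.3, 0.1], [-0.4, 0.2]]
label = classify(segment, models)
```

A segment would be forwarded to the ASR only if the best-scoring model is the speech model; in a realistic system the feature dimension and the number of mixture components would be considerably larger.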
The acoustic features are again MFCCs, while the scoring is accomplished by combining the results of two sub-systems implementing a GMM and a support vector machine (SVM), respectively [21]. In the GMM-based sub-system, speaker-dependent models are obtained through maximum a posteriori (MAP) adaptation of the mean vectors, starting from a universal background model (UBM) that represents the background speaker population. In the SVM-based sub-system, elements belonging to non-linearly separable classes are discriminated on the basis of a binary classification, operated by non-linear kernel functions. When more than one speaker is active, the performance of SID is strongly related to the amount of residual interfering speech that may be present at the BSS output. The effect of BSS on SID performance will be further investigated.

4. Conclusions and outlook

In this paper we presented the multichannel acoustic front-end of an already fully functional prototype for interactive TV, which has been developed within the EU-funded project DICIT. We also introduced an alternative architecture based on BSS, which extends the functionality of the BF-/SLOC-based approach for multi-user scenarios.

The BF-/SLOC-based front-end supports one user whose movements can be tracked fast enough by the SLOC module, so that the combination of BF and AEC guarantees a good signal quality of the desired speech. The SSF can thus pass the user commands to a subsequent ASR while rejecting undesired residual disturbances. The front-end architecture accounts for computational constraints with an efficient