A TENTATIVE TYPOLOGY OF AUDIO SOURCE SEPARATION TASKS

Emmanuel Vincent, Xavier Rodet, Axel Röbel
IRCAM, Analysis-Synthesis Group
1, place Igor Stravinsky
F-75004 PARIS, FRANCE
emmanuel.vincent@ircam.fr

Cédric Févotte, Éric Le Carpentier
IRCCyN, ADTS Group
1, rue de la Noë – BP 92 101
F-44321 NANTES CEDEX 03, FRANCE
cedric.fevotte@irccyn.ec-nantes.fr

Rémi Gribonval, Laurent Benaroya, Frédéric Bimbot
IRISA, METISS Project
Campus de Beaulieu
F-35042 RENNES CEDEX, FRANCE
remi.gribonval@irisa.fr

This work is part of a Junior Researchers Project funded by GdR ISIS (CNRS). See http://www.ircam.fr/anasyn/ISIS/ for some insights on the Project.

ABSTRACT

We propose a preliminary step towards the construction of a global evaluation framework for Blind Audio Source Separation (BASS) algorithms. BASS covers many potential applications that involve a more restricted number of tasks. An algorithm may perform well on some tasks and poorly on others. Various factors affect the difficulty of each task and the criteria that should be used to assess the performance of algorithms that try to address it. Thus a typology of BASS tasks would greatly help the building of an evaluation framework. We describe some typical BASS applications and propose some qualitative criteria to evaluate separation in each case. We then list some of the tasks to be accomplished and present a possible classification scheme.

1. INTRODUCTION

Blind Audio Source Separation (BASS) has been a subject of intense work in recent years. Several models have emerged, such as Independent Component Analysis (ICA) [1] and Sparse Decompositions (SD) [2], and it is now more or less well known how to solve the separation problem under these models with efficient and robust algorithms. However, BASS is not just about solving some tractable model (e.g. finding independent or sparse components); it is about recovering results that make sense according to the target application.

BASS covers many applications, such as high quality separation of musical sources, signal/speech enhancement, multimedia document indexing, speech recognition in a “cocktail party” environment or source localization for auditory scene analysis. Depending on the application, BASS algorithms have to address different tasks. For example, some applications require finding the number of sources given the observations, and others require recovering the source signals given the observations and the structure of the mixing system.

A given separation algorithm may perform well on some tasks and poorly on others. Depending on the task, various factors affect the difficulty of the separation, and distinct criteria may be used to evaluate the performance of an algorithm and to compare it to other algorithms.

The first step in determining which task(s) a given separation algorithm may achieve is to list and classify some of the interesting tasks. As tasks and applications are related, this also requires listing and classifying typical applications of BASS. We attempt to address these questions in this paper. As a further step, we propose in a companion paper [3] some numerical criteria to evaluate the performance of BASS algorithms on some of these tasks.

In Section 2, we present two large classes of BASS applications. In Sections 3 and 4, we give some examples from each class and we identify some candidate qualitative criteria to measure separation quality and separation difficulty.
In Section 5, we list some of the tasks to be addressed by BASS algorithms and we present a possible typology.

Let us emphasize that this paper should be considered as a preliminary proposal that does not contain final results but rather presents some thoughts on the definition, the typology and the evaluation of BASS tasks. We hope the source separation community will consider these topics more closely, so that the construction of an agreed-upon evaluation framework for BASS algorithms will become possible.

4th International Symposium on Independent Component Analysis and Blind Signal Separation (ICA2003), April 2003, Nara, Japan

2. BASS APPLICATIONS

An important distinction that can be made among BASS applications is whether or not the output of the algorithm is a set of extracted sources that are intended to be listened to. We term these two categories Audio Quality Oriented (AQO) and Significance Oriented (SO) applications.

AQO applications extract sources that are listened to, either straight after separation or after some audio post-processing. Most of the literature focuses on this goal by using ICA- and SD-related methods (see [4] for a review of ICA methods applied to audio signals). Some criteria for separation quality and separation difficulty have been proposed in [5, 6], and we propose some others in our companion paper [3].

In SO applications, the extracted sources and/or mixing parameters are processed to obtain information at more abstract levels, in order to find a representation of the observations related to human perception. For instance, looking for the number and the kind of instruments in a musical excerpt falls within the scope of SO separation. Separation quality criteria are generally less demanding than in AQO applications because the aim of SO separation is only to keep specific features of the sources.
Thus, a rough separation may be sufficient (possibly with high distortion), depending on the robustness of the subsequent feature extraction algorithms.

An important remark is that separation and information extraction do not have to be separate processes. For example, the auditory system uses a priori and contextual information to perform separation, which means recognition can help separation.

For the sake of clarity, let us define the few notations used in the rest of this paper. The general (possibly convolutive) BASS problem is expressed using the matrix of filters formalism as $\mathbf{x} = \mathbf{A} \star \mathbf{s}$, where $\mathbf{s}$ is the vector of the $n$ sources $(s_j)_{1 \le j \le n}$, $\mathbf{x}$ is the vector of the $m$ observations $(x_i)_{1 \le i \le m}$, and $\mathbf{A}$ is the matrix of mixing filters. Note that we limit ourselves to linear time-invariant mixing systems. A similar analysis could be carried out by extending the model to non-linear time-variant systems, to take into account the dynamic compression applied to radio broadcasts or the spatial movements of the sources, for example.

3. AUDIO QUALITY ORIENTED SEPARATION

Within AQO applications, we can distinguish two major families of applications. The first is related to applications where we are interested in each individual extracted source, while the second corresponds to applications where the goal is to listen to a new mixture of the sources.

3.1. One versus all

The one versus all problem consists in extracting one sort of sound (the target source $s_j$) from a mixture. Generally the other sources are considered as noise.

Some examples include restoration of old monophonic musical recordings [7], speech de-noising and de-reverberation for auditory prostheses or mobile phones [8] and extraction of some interesting sounds in a polyphonic musical excerpt for electronic music creation.

In this context, a “good” separation requires estimating the source with a high Signal to Noise Ratio (SNR).
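As a minimal sketch of this criterion, the SNR of an estimated source can be computed by treating the difference between the estimate and the true source as noise. The function name `snr_db` and the toy signals below are assumptions made for illustration; they are not the numerical criteria of the companion paper [3], which further account for gain indeterminacies.

```python
import math

def snr_db(target, estimate):
    """Signal to Noise Ratio in dB, treating (estimate - target) as noise."""
    signal_power = sum(t * t for t in target)
    noise_power = sum((e - t) ** 2 for t, e in zip(target, estimate))
    if noise_power == 0.0:
        return math.inf  # perfect estimate
    return 10.0 * math.log10(signal_power / noise_power)

# Hypothetical toy signals: the estimate is off by 0.1 on every sample.
target = [1.0, -1.0, 1.0, -1.0]
estimate = [1.1, -0.9, 1.1, -0.9]
print(f"{snr_db(target, estimate):.1f} dB")  # error amplitude is 1/10 of the signal: about 20 dB
```

This plain energy ratio ignores perceptual effects and assumes the estimate is already aligned and scaled like the true source.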
The SNR criterion can be modified to model the specificities of hearing, such as masking phenomena, as does the criterion introduced in [9]. Other criteria can be used to evaluate the separation quality when an exact estimation is not needed. This is often the case when indeterminacies arise due to convolutive mixing. For some applications, it may be sufficient to recover some filtered version of the target source $s_j$ and not $s_j$ itself [10, 11]. The “naturalness” of these versions could be measured by criteria like timbre distortion [12] or comparisons with a database of room impulse responses. For other applications, one may wish to extract the contribution of $s_j$ to each sensor [13], that is to say to estimate the multichannel signal $[A_{1j} \star s_j, \ldots, A_{mj} \star s_j]^T$. In such cases, quality criteria may have to take into account the difference between the perceived spatial direction of $s_j$ when listening to the estimated multichannel signal and to $\mathbf{x}$.

The problem of extracting several sources in order to listen to them separately falls in the one versus all category too. For each source, a global SNR can be computed and can be decomposed into the contributions of crosstalk (remainder of the other sources), additive noise and algorithmic artifacts [3].

The one versus all problem may be tackled at various difficulty levels. The algorithms are influenced by the number of sources, the number of sensors, the noise level, the dependency between the sources, the kind of mixing (instantaneous vs convolutive), etc. The blind case is usually addressed by ICA [1] and SD [2]. Other algorithms can handle a priori information, like a model of the playing musical instrument and its musical score used in [14] or the video of the lips corresponding to a noisy speech signal used in [8].

3.2. Audio scene modification

Audio scene modification consists in obtaining a new mixture $\mathbf{B} \star [f_1(s_1), \ldots, f_n(s_n)]^T$, for example by extracting all the sources $(s_j)$ from the original observations $\mathbf{x}$, applying an adapted audio processing $f_j$ to each source and remixing the tracks using a possibly different mixing matrix $\mathbf{B}$, in order to listen to the result. Let us note that prior extraction of each source is not a requirement in such applications; it is only a convenient way to describe the desired result and a possible way to achieve it.

Examples include re-mastering of a stereo CD, blind multichannel diffusion of stereo recordings [15], spatial interpolation [16] and cancellation of the voice in a song for “automatic karaoke”.

Evaluation of the separation results may rely on calculating the SNR of the estimated remixed scene w.r.t. the expected result (that is to say the scene constructed by remixing the true sources). Depending on the signal “zones” affected by post-processing and remixing, quality criteria may be a little less restrictive than for the one versus all problem. For example, when the purpose is to increase slightly the “presence” of an instrument in a CD, distortion or crosstalk in the extracted instrument will not count for much in the final result. Indeed, the other sources will most likely mask the zones containing crosstalk after remixing, and even larger zones, since auditory masking effects usually come into play [9].

Difficulties encountered by the algorithms include those of the one versus all problem: they are influenced by the number of sources, the number of sensors, etc. But the amount of change introduced by the intermediate audio processing $f_j$ and the new number of channels obtained after remixing with $\mathbf{B}$ also play a role.
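The influence of the amount of change can be illustrated with a small sketch. It assumes an instantaneous remix (each entry of the remixing matrix is a simple gain rather than a filter) and per-source processings that are plain gains; the helper names `remix` and `scene_snr_db`, the toy signals and the crosstalk level are all hypothetical. The estimated scene is compared with the scene obtained by remixing the true sources.

```python
import math

def remix(sources, processings, gains):
    """Instantaneous remix: channel c is sum_j gains[c][j] * f_j(s_j)."""
    processed = [[f(v) for v in s] for s, f in zip(sources, processings)]
    n_samples = len(processed[0])
    return [
        [sum(g * p[t] for g, p in zip(row, processed)) for t in range(n_samples)]
        for row in gains
    ]

def scene_snr_db(expected, estimated):
    """SNR in dB of an estimated scene w.r.t. the expected scene."""
    signal = sum(v * v for ch in expected for v in ch)
    noise = sum(
        (e - v) ** 2
        for ch_e, ch_v in zip(estimated, expected)
        for e, v in zip(ch_e, ch_v)
    )
    return math.inf if noise == 0.0 else 10.0 * math.log10(signal / noise)

# Toy data: true sources, and imperfect estimates where 10% of
# source 2 has leaked into the estimate of source 1 (crosstalk).
true_sources = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
est_sources = [[1.0, 0.1, -1.0, -0.1], [0.0, 0.9, 0.0, -0.9]]
mono = [[1.0, 1.0]]  # hypothetical remixing matrix: one channel, unit gains

boost = [lambda v: 1.2 * v, lambda v: v]   # slightly raise source 1
cancel = [lambda v: 0.0 * v, lambda v: v]  # remove source 1 entirely

snr_boost = scene_snr_db(remix(true_sources, boost, mono),
                         remix(est_sources, boost, mono))
snr_cancel = scene_snr_db(remix(true_sources, cancel, mono),
                          remix(est_sources, cancel, mono))
print(f"slight boost: {snr_boost:.1f} dB, cancellation: {snr_cancel:.1f} dB")
```

With the same imperfect estimates, the slight boost yields a much higher scene SNR than the cancellation, because most of the separation error is remixed back into place when the processing changes little.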
For instance, given a mono recording, it is more difficult to cancel one of its sources or to broadcast each of its sources on one or more channel(s) than to slightly raise the volume of one of its sources. As in the one versus all problem, algorithms may or may not use a priori information. Such information serves to determine a satisfactory $\mathbf{B}$ and $f_j$ to achieve the desired effect. In blind remixing, this can only be done by relying on directly computable features, such as the direction or the instantaneous power of the sources. When models of the sources are available, it becomes conceivable to name each source (i.e. to label it with the right model), so that it can undergo more specific treatments. One can think of raising the level of “the voice” in a recording if a model of “voice” is available.

4. SIGNIFICANCE ORIENTED SEPARATION

SO applications aim at retrieving source features and/or mixing parameters to describe complex audio signals at various cognitive levels, focusing on different aspects of sound [17]. The goal of finding an exhaustive description of a complex audio scene is called Auditory Scene Analysis (ASA) [18]. Most SO applications can be seen as by-products of ASA.

The main applications of SO separation concern the indexing of audiovisual databases and the construction of intelligent hearing systems. Depending on the application, one may need low level descriptive elements, high level ones or both. Some examples of descriptive elements are the score of each instrument in a musical excerpt [19], the text pronounced by a speaker in a noisy environment [20, 21], or the spatial position of the sources w.r.t. the sensors in a “real world” recording [22]. Other descriptions consist in telling the name of each instrument and the musical genre [23], identifying the speaker [24], linking audio sources and corresponding visual objects on a video [25], etc.

The purpose of SO separation is to preserve as much as possible the features used to compute the descriptive elements.
To evaluate the quality of a global description consisting of many descriptive elements, the quality of each element is first evaluated separately by a distinct criterion. The quality of continuous-valued descriptive parameters, such as the positions of the sources, is measured by simple distances. The evaluation of discrete-valued descriptive parameters, such as the name of the instrument, is done by calculating misclassification or recognition error rates [26]. The quality of the whole audio description may then be expressed by a weighted combination of all these criteria. When such a weighting is hard to choose objectively, it may be preferable to conduct a series of listening tests to obtain a global separation/description grade [27, 18].

Criteria for the separation difficulty depend on the application. The number of sensors, their selectivity and the amount of reverberation in the environment affect the retrieval of the mixing parameters. Source classification is more or less difficult according to the number of classes to recognize and the robustness of the feature calculation. For some applications, a real-time constraint also applies.

5. AUDIO SOURCE SEPARATION TASKS

As we have shown, AQO and SO separation are used in many different applications, each one having its own evaluation criteria.
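As one illustration of such application-specific criteria, the SO evaluation scheme of Section 4 (a distance for continuous-valued elements, an error rate for discrete-valued ones, then a weighted combination) might be sketched as follows. The function names, the toy description and the weights are assumptions chosen for the example; as noted in Section 4, an objective choice of weights is precisely the hard part.

```python
import math

def position_error(true_pos, est_pos):
    """Euclidean distance between true and estimated source positions."""
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(true_pos, est_pos)))

def error_rate(true_labels, est_labels):
    """Fraction of misclassified discrete descriptive elements."""
    wrong = sum(t != e for t, e in zip(true_labels, est_labels))
    return wrong / len(true_labels)

def description_score(criteria, weights):
    """Weighted combination of per-element criteria (lower is better)."""
    return sum(weights[name] * value for name, value in criteria.items())

# Hypothetical description: one source position and four instrument names.
criteria = {
    "position": position_error((0.0, 0.0), (3.0, 4.0)),
    "instrument": error_rate(
        ["piano", "violin", "voice", "flute"],
        ["piano", "cello", "voice", "flute"],
    ),
}
weights = {"position": 0.1, "instrument": 1.0}  # hypothetical weights
print(description_score(criteria, weights))
```

Each criterion is evaluated separately before being combined, so a listening test or a different weighting can replace the last step without touching the per-element measures.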
However, these applications correspond to a smaller number of tasks to be accomplished by BASS algorithms, depending on the relevant objects in the model, the kind of mixing and the amount of available information.

A task is specified by the nature of the objects that the algorithm takes as an input, the nature of its output, and a qualitative description of how the quality of the output should be assessed.

Task | Input | Output
Counting | (none) | $n$
Blind mixing identification | structure of $\mathbf{A}$ (not always) | $\mathbf{A} \mathbf{\Pi} \mathbf{D}$
Blind source extraction | structure of $\mathbf{A}$ | $\{s_j\}$ or $\{A_{ij} \star s_j\}$
Blind remixing | structure of $\mathbf{A}$, generic $\mathbf{B}$ and $f$ | $\mathbf{B} \star [f_1(s_1), \ldots, f_n(s_n)]^T$
Detection | source models $(\mathcal{M}_k)$ | number $n_k$ of sources following $\mathcal{M}_k$
Identification / Representation | model of $\mathbf{s}$ | description of $\mathbf{s}$ and $\mathbf{A}$
Source extraction | model + description of $\mathbf{s}$ and $\mathbf{A}$ | $(s_j)$ or $(A_{ij} \star s_j)$
Remixing | model + description of $\mathbf{s}$ and $\mathbf{A}$, adapted $\mathbf{B}$ and $f$ | $\mathbf{B} \star [f_1(s_1), \ldots, f_n(s_n)]^T$

Table 1. Some BASS tasks (see Section 5 for comments and previous Sections for notations)

Table 1 lists some tasks according to the input-output scheme of the algorithms, using the notations of the previous Sections. In the 'Input' column, we only list what comes in addition to the observations $\mathbf{x}$. The (qualitative) evaluation criteria are implicit, but defining relevant and agreed-upon procedures to evaluate the performance of an algorithm on a task is indeed a crucial step.

Note that for some tasks the well-known indeterminacies of the BASS problem are explicitly expressed in the description of the output, using a permutation matrix $\mathbf{\Pi}$ and a diagonal matrix of filters $\mathbf{D}$, or using the different delimiters $(\,)$ and $\{\,\}$ for ordered and unordered sets.

For the remixing task, the input includes the new mixing matrix and the audio processing to perform on each source.
In the non-blind case, it is conceivable to specify this intrinsically so that a given processing is performed on a source identified by some model (“the piano” for example). This is denoted by 'adapted' B and f. However, in the blind situation, it seems very hard to specify the nature of the source that should undergo a given audio processing. One may thus be restricted to specifying a source in terms of “the loudest source” or “the leftmost source” on a stereo recording. In Table 1, this is denoted by 'generic' B and f.

We tried to choose the names of the tasks in correspondence with what is used in the literature. For example, the Blind Mixing Identification task contains as a special case what is usually called Blind System Identification [28]. The Detection task is close to the Verification problem in speaker recognition [29] and to the Classification problem in audiovisual database indexing. The Remixing task includes the Cancellation problem, which consists in cancelling one source in the mixture.

The main distinction we propose between tasks is whether models of the sources (generally learned from a database of samples of the sources) are available or not. The difference between blind tasks and their semi-blind “counterparts” is indeed quite important.

However, contrary to other contributions on the subject [5, 6], we group in each task the instantaneous mixing case and the convolutive one. In fact, we believe the various possible structures of $\mathbf{A}$ should be considered as various difficulty criteria (or subtasks) for the same overall task, rather than separate tasks. By structure of $\mathbf{A}$, we mean information such as the number of sources or the length of the mixing filters (simple gain, gain-delay, short FIR, IIR with few parameters). For some tasks, this structure is given as input to the algorithms.

For some non-blind tasks, we also group the problems where a model of the sources is given and those where a description of the sources is also available.
The term model covers all sorts of general signal models, such as the hidden Markov models used in [8], the modified additive models used in [12] or even a physical model of the source instrument [30]. Source models can also contain learned information about source interaction, for example parameters describing the degree of independence between them [31]. There can also be models of the mixing system. In most cases, the definition of a task does not include a specific type of model that an algorithm can rely on in order to solve the task. Generally, the algorithm is trained (prior to running on “live” data) using some training samples of each source. In this context, a description may be any kind of knowledge that restricts the models depending on the particular piece of signal considered: a temporal segmentation, a musical score, the size of the recording room. As for the distinction between instantaneous and convolutive problems, we believe that giving or withholding this kind of description corresponds to different difficulty levels inside an overall task, not to separate tasks. This puts together rather different problems, for example extracting a piano and a violin playing together with the only information that there are a piano and a violin, or performing the same extraction knowing their scores. However, these are the extreme cases among many intermediate assumptions. Sometimes, only one of the sources is learned and only an imperfect score is available, as in [14].

6. CONCLUSION

In this paper, we described some of the most typical applications encountered in the field of BASS, and we proposed to group these applications into two main categories: AQO (Audio Quality Oriented) and SO (Significance Oriented) separation.
AQO applications aim at extracting sources for a listening purpose, whereas in SO applications the extracted sources are used for classification and description.

For each application, we stressed some of the audio specificities to be taken into account when designing related BASS algorithms. We proposed some qualitative criteria to evaluate the performance of a given algorithm for the application, and the difficulty of the application itself depending on various factors such as the available amount of prior information, the noise level, the instantaneous or convolutive nature of the mixtures, etc.

This led us to propose a tentative typology of the corresponding tasks to be solved by BASS algorithms, according to the input-output scheme of the algorithms. The main classification axis is the distinction between blind and non-blind tasks. We retained the classical distinction between instantaneous and convolutive mixing as different levels of difficulty for a given task.

We should insist here that the proposed typology is intended to serve as a preliminary proposal, and we encourage researchers in the community to share their ideas about BASS task typology and evaluation or related topics using the discussion list on our dedicated web-site [32].

7. FUTURE WORK

This work constitutes an important step towards the definition of a global evaluation framework for BASS algorithms under different tasks. However, in this article we have only described candidate qualitative criteria. Hence, the next steps should consist in transforming these criteria into numerical formulae and in building a “smart” benchmark of test signals according to the various BASS tasks we identified.

In a related paper [3], we present quantitative criteria to measure the performance of source separation algorithms on the (Blind) Source Extraction and the (Blind) Remixing tasks.
These criteria measure the performance in terms of interferences, noise and algorithmic artifacts, by properly taking into account the gain indeterminacies of source separation. They can be used in (over-)determined as well as under-determined problems. Other tasks require the design of other relevant numerical criteria.

A database structure and some labeled test signals are also readily available online [32].

Finally, it seems that some BASS applications such as audio scene modification have been less studied than the one versus all problem, for instance, despite their generally less demanding requirements. We hope this work will trigger interest in new research goals.

8. ACKNOWLEDGMENTS

This work has been performed within a Junior Researchers Project “Resources for Audio Signal Separation” funded by GdR ISIS (CNRS). The goal of the project is to identify the specificities of audio signal separation, to suggest relevant numerical criteria to evaluate separation quality, and to gather test signals of calibrated difficulty level, in order to evaluate the performance of existing and future algorithms.

Some evaluation routines, a database of audio signals and a discussion list can be found on the web-site [32].

9. REFERENCES

[1] J.-F. Cardoso, “Blind source separation: statistical principles,” in IEEE Proc., 1998, vol. 90, pp. 2009–2026.
[2] M. Zibulevsky and B.A. Pearlmutter, “Blind source separation by sparse decomposition in a signal dictionary,” Neural Computation, vol. 13, no. 4, 2001.
[3] R. Gribonval, L. Benaroya, E. Vincent, and C. Févotte, “Proposals for performance measurement in source separation,” in Proc. Int. Workshop on ICA and BSS (ICA’03), 2003, submitted.
[4] K. Torkkola, “Blind separation for audio signals - are we there yet?,” in Proc. Int. Workshop on ICA and BSS (ICA’99), 1999.
[5] D. Schobben, K. Torkkola, and P. Smaragdis, “Evaluation of blind signal separation methods,” in Proc. Int. Workshop on ICA and BSS (ICA’99), 1999, pp. 261–266.
[6] R.H. Lambert, “Difficulty measures and figures of merit for source separation,” in Proc. Int. Workshop on ICA and BSS (ICA’99), 1999, pp. 133–138.
[7] Olivier Cappé, Techniques de réduction de bruit pour la restauration d’enregistrements musicaux, Ph.D. thesis, Télécom Paris, 1993.