A STUDY OF TEMPO TRACKING ALGORITHMS FROM POLYPHONIC MUSIC SIGNALS

COST 276 Workshop, March 31 - April 1, 2003

M. Alonso, B. David, G. Richard
École Nationale Supérieure des Télécommunications (ENST)
Dépt. Traitement du Signal et des Images
46, rue Barrault, 75634 Paris Cedex 13
Email: miguel.alonso, bertrand.david, gael.richard@enst.fr

ABSTRACT

In this paper, algorithms for tempo tracking from polyphonic music signals are introduced. These new methods are based on the association of a filter bank with robust pitch detection algorithms such as the spectral sum or spectral product, and are further improved by using an onset detector in each band. They are then compared to two of the most reliable methods in the literature using a small, manually annotated database of short sequences of music signals. It is shown that, despite their simplicity, these new approaches are very efficient and outperform the tested methods.

1. INTRODUCTION

The enormous amount of unstructured multimedia data available nowadays, and the spread of its use as a data source in many applications, introduce new challenges for researchers in information and signal processing. The continuously growing amount of digital multimedia information increases the difficulty of its access and management, thus hampering its practical usefulness. As a consequence, there is a clear need for content-based multimedia data indexing, processing and retrieval techniques.

If multimedia and multimodal approaches represent essential challenges, the more classical approach consisting in building new analysis or indexing technologies for a single medium is still needed to overcome the limitations of today's approaches. For example, in audio, numerous problems remain in extracting high-level descriptors directly from polyphonic music signals. The tempo (or beat) is one of the most important descriptors, since many applications can be derived from the automatic recognition of the rhythmic structure of a music signal:

- automatic rhythmic alignment of multiple instruments, channels or musical pieces (for mixing or karaoké),
- automatic indexing, segmentation and style classification of music databases,
- beat-driven computer graphics (virtual dancers, etc.).

Tempo or beat analysis of musical signals is a domain of research that receives growing interest, as shown by the variety of recent publications [8], [10], [1], [12], [6]. The problem is apparently simple: most people, even without any musical knowledge, have no difficulty finding the beat of a musical performance. However, automatic recognition is more complex, especially for music styles that do not feature strong rhythmic patterns (such as classical music or jazz).

While earlier approaches focused on MIDI signals (or simple real audio signals such as purely percussive signals [9]), today's approaches deal directly with polyphonic music. Scheirer [8] proposed a method associating a filter bank with a set of comb filters. Simpler methods were introduced by Seppänen [10], using sound onset detection, and by Tzanetakis [12], in the context of musical genre classification. Another approach was proposed by Goto [2] to infer the hierarchical beat structure. In most of these works, the rhythm (or periodicity) detection is based on a simple inter-onset time detection or on the traditional autocorrelation method.

In this paper, several algorithms for tempo tracking are introduced.
These methods are based on the association of a filter bank with robust pitch detection algorithms such as the spectral sum or spectral product. The performance of these algorithms is evaluated against some of the most reliable algorithms in the literature using a small, manually annotated database of short sequences of music signals. The paper is organized as follows: the next section describes our new algorithms, including some minor improvements to the original Scheirer algorithm; results compared to two methods from the literature are given in section 3; finally, section 4 draws some conclusions.

2. TEMPO ESTIMATION ALGORITHMS

Most of the algorithms designed to estimate the tempo of musical pieces [8, 7, 3, 12] are based on the same basic steps. In particular, they all process several frequency bands separately, combining the results in the end. According to the experimental results found in [8], this simplification preserves rhythm perception for most music signals, and it has proven efficient for many kinds of frequency-band decompositions. These basic steps are:

1. subband decomposition of the signal, provided by a filter bank,
2. onset detection in each subband,
3. estimation of the periodicity in each subband,
4. combination of the results to obtain the overall tempo.

The differences between the algorithms found in the literature lie in the implementation of these steps. For instance, the well-established Scheirer algorithm uses a six-band IIR filter bank for the first stage, whereas an eight-band filter bank is used in [7] and a 21-band nearly-critical-band filter bank in [4]. Different techniques are also used to detect the onset times: based on half-wave or full-wave rectification, using the envelope or its squared value, deriving the difference function or the relative difference function, applying thresholding or not.

According to these descriptions, the structure of the whole system appears as sketched in the flow diagram of Fig. 1.

[Figure 1. Proposed tempo estimation system flow diagram: the input x(t) is split by filters H_0(z), ..., H_7(z); each subband passes through envelope extraction, onset detection and periodicity detection; the bandwise results are summed and a peak-picking stage yields the tempo output.]

Our tempo estimation approach uses the general psychoacoustic simplification proposed by [8] and also adopted by Paulus [7]. The input signal is first divided into eight non-overlapping frequency bands using a filter bank of sixth-order Butterworth filters. The lowest band is obtained by lowpass filtering with a cutoff frequency of 100 Hz; the seven higher bands are logarithmically distributed in frequency between 100 Hz and half the sampling frequency (8000 Hz in our experiments), as suggested by Paulus in [7]. We obtain the subband signals x_k(t), where k = 0, ..., 7.
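As a rough illustration of this first stage, here is a minimal Python sketch of such a filter bank, assuming numpy and scipy are available. The function name, the exact placement of the logarithmic band edges and the small margin kept below the Nyquist frequency are our own choices; the paper only fixes the filter order, the number of bands and the 100 Hz lowest cutoff.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def subband_decomposition(x, fs=16000, n_bands=8, f_low=100.0):
    """Split x into eight non-overlapping bands: one lowpass band below
    f_low and seven bandpass bands logarithmically spaced between f_low
    and the Nyquist frequency, all sixth-order Butterworth filters."""
    nyq = fs / 2.0
    # Eight logarithmically spaced edge frequencies give seven bandpass bands.
    edges = np.logspace(np.log10(f_low), np.log10(nyq), n_bands)
    # Lowest band: sixth-order lowpass with a 100 Hz cutoff.
    sos = butter(6, f_low, btype='lowpass', fs=fs, output='sos')
    subbands = [sosfilt(sos, x)]
    for lo, hi in zip(edges[:-1], edges[1:]):
        hi = min(hi, 0.99 * nyq)  # keep the upper edge strictly below Nyquist
        sos = butter(6, [lo, hi], btype='bandpass', fs=fs, output='sos')
        subbands.append(sosfilt(sos, x))
    return subbands  # the subband signals x_k(t), k = 0, ..., 7
```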
Next, the extraction of the signal's envelope is carried out. This is a fundamental step, aiming at precisely finding the sound onset points provided to the tempo estimation algorithms. This task is accomplished using a system mostly based on the onset detector proposed by Klapuri in [3, 4]; a flow diagram is presented in Fig. 2.

[Figure 2. Envelope extraction and onset detector flow diagram: half-wave rectification, squaring, lowpass filtering, d/dt(log A(t)), thresholding.]

In each subband, the input signal is first half-wave rectified and squared. Then, the amplitude envelope A_k(t) of each frequency channel is calculated by convolving the subband signal with a 100 ms descending half-Hanning window (a linear-phase lowpass filter), and the output of each band is decimated by a factor of 16 in order to reduce the computational burden of the following stages. For the bandwise onset detection we use the first-order relative difference function W_k(t), which gives the amount of change in the signal in relation to its absolute level. This is equivalent to differentiating the logarithm of the amplitude envelope, as given by Eq. (1):

    W_k(t) = \frac{d}{dt} \log(A_k(t))                                   (1)

The relative difference function is a psychoacoustically relevant measure, since the perceived increase in signal amplitude is relative to its level: the same amount of increase is more prominent in a quiet signal [3, 4]. Hence, we detect onset components by a peak-picking operation which looks for peaks above a given threshold. The threshold value was found experimentally to be around 1.5 \sigma_{W_k}, where \sigma_{W_k} stands for the standard deviation of the signal W_k(t).
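The envelope extraction and onset detection stage could then be sketched as follows, again in Python. The epsilon guard inside the logarithm, the exact construction of the descending half-Hanning window and the crude thresholding used as peak picking are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def onset_detection(x_k, fs=16000, decim=16, win_ms=100):
    """Envelope extraction and bandwise onset detection: half-wave
    rectify and square, smooth with a 100 ms descending half-Hanning
    window, decimate by 16, then threshold the relative difference
    function of Eq. (1) at 1.5 times its standard deviation."""
    v = np.maximum(x_k, 0.0) ** 2                # half-wave rectification, squaring
    n = int(fs * win_ms / 1000)
    half_hann = np.hanning(2 * n + 1)[n:]        # descending half-Hanning window
    a_k = np.convolve(v, half_hann)[:len(v)]     # amplitude envelope A_k(t)
    a_k = a_k[::decim]                           # decimation (envelope is already smoothed)
    eps = 1e-12                                  # guard against log(0), our choice
    w_k = np.diff(np.log(a_k + eps))             # discrete W_k(t) of Eq. (1), up to a scale
    thr = 1.5 * np.std(w_k)                      # experimental threshold from the text
    return np.where(w_k > thr, w_k, 0.0)         # pulse-train-like onset signal
```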
Up to this point, two of the proposed tempo estimation systems (spectral and summary autocovariance function) follow the same principle. From here on they are treated separately: one operates in the frequency domain, the other in the time domain.

2.1. Spectral methods

Due to their strong relationship, two spectral tempo estimation methods are presented together: the harmonic spectral sum and the harmonic spectral product. Both methods come from traditional pitch determination techniques. At the output of the onset detection block, the subband signals have the appearance of quasi-periodic pulse trains. In order to find the bandwise fundamental frequency of such pulse trains, the Fourier transform of the subband signals, X_k(e^{j\omega_n}), is calculated. Prior to the FFT calculation, the subband signals are zero-padded to a size l_x given by:

    l_x = 2^{\lfloor \log_2(\mathrm{length}(x_k(t))) \rfloor + 2}        (2)

where \lfloor \cdot \rfloor stands for the integer part.

2.1.1. Spectral Sum

The spectral sum is a reliable pitch determination technique. Its principle relies on the assumption that the power spectrum of the signal is formed of strong harmonics located at integer multiples of the signal's fundamental frequency. To find this frequency, the power spectrum is compressed by a factor l and the resulting spectra are added. In normalized frequency, this is expressed by Eq. (3):

    S_k(e^{j\omega_n}) = \sum_{l=1}^{M} |X_k(e^{j l \omega_n})|^2   for \omega_n < \pi/M    (3)

Consequently, the signal's fundamental frequency is strongly reinforced. In addition, all subband spectral sums are added together, which strengthens the fundamental frequency even more; it then appears as an easily detectable prominent peak, as depicted in Fig. 3 for a particular music signal. There, we can clearly see the most salient peak located roughly at a frequency of 1.05 Hz, which corresponds to a beat rate of 63 beats per minute (BPM).

[Figure 3. Spectral sum of a music signal: normalized magnitude (dB) versus frequency (Hz).]

In our implementation, M was set to 6 and the fundamental frequency search was carried out in the interval from 5/6 Hz to 5 Hz, which corresponds to a beat rate between 50 and 300 BPM.

2.1.2. Spectral Product

As briefly mentioned, another pitch estimation method closely related to the spectral sum is the spectral product. In normalized frequency it is defined by Eq. (4):

    S_k(e^{j\omega_n}) = \prod_{l=1}^{M} |X_k(e^{j l \omega_n})|^2   for \omega_n < \pi/M    (4)

As in the preceding method, M was set to 6 for the spectral product implementation, and the fundamental frequency search was performed within the same frequency interval. Fig. 4 depicts the result for the aforesaid signal. Once again, the most salient peak is located at about the same frequency of 1.05 Hz. Note, however, that the spectral product method yields a much more prominent peak, relative to the secondary peaks, than the spectral sum method.

[Figure 4. Spectral product of a music signal: normalized magnitude (dB) versus frequency (Hz).]
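A minimal sketch of both spectral methods applied to one subband onset signal follows, under the same assumptions as above; the compression of the power spectrum by subsampling its bins (spec[::l]) and the use of a real FFT are our own implementation choices.

```python
import numpy as np

def spectral_sum_product(w_k, M=6):
    """Harmonic spectral sum, Eq. (3), and harmonic spectral product,
    Eq. (4), of one bandwise onset signal, after zero padding to the
    FFT size of Eq. (2)."""
    # Eq. (2): next power of two above the signal length, times four.
    l_x = 2 ** (int(np.floor(np.log2(len(w_k)))) + 2)
    spec = np.abs(np.fft.rfft(w_k, l_x)) ** 2
    n_use = len(spec) // M               # restrict to omega_n < pi / M
    s_sum = np.zeros(n_use)
    s_prod = np.ones(n_use)
    for l in range(1, M + 1):
        compressed = spec[::l][:n_use]   # power spectrum compressed by factor l
        s_sum += compressed              # Eq. (3)
        s_prod *= compressed             # Eq. (4)
    return s_sum, s_prod

# The bandwise results are then summed across the eight subbands, and the
# tempo peak is searched between 5/6 Hz (50 BPM) and 5 Hz (300 BPM); at the
# decimated envelope rate of 1000 Hz, bin i maps to i * 1000 / l_x Hz.
```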
2.2. Summary Autocovariance Function

The conception of this method was suggested by work done in the multipitch detection field [11]. The bandwise pulse-train-like signals at the output of the sound onset detector are convolved with a 100 ms odd-length Hanning window. Then, the autocovariance function \Gamma_k(\tau) of the subband signals is computed, as given by Eq. (5):

    \Gamma_k(\tau) = \sum_t [x_k(t+\tau) - \bar{x}_k][x_k(t) - \bar{x}_k]    (5)

The bandwise autocovariance functions are then added together to form the summary autocovariance function, SACVF(\tau), obtained by Eq. (6):

    \mathrm{SACVF}(\tau) = \sum_{k=0}^{7} \Gamma_k(\tau)                     (6)

As in the case of the spectral methods, we are only interested in detecting the periodicities of the subband signals x_k(t) that correspond to the most salient peaks in SACVF(\tau). In addition, we are only concerned with finding a beat rate between 50 and 300 BPM; thus the time lag \tau varies only within the range of 200 to 1250 ms. The result of our SACVF implementation is shown in Fig. 5, where the aforesaid signal has a dominant periodicity, depicted by the most salient peak, at a lag of approximately 950 ms, which corresponds to a fundamental frequency of about 1.05 Hz.

[Figure 5. Summary autocovariance function of a music signal: normalized magnitude versus time lag (ms), 200 to 1250 ms.]

Finally, to ensure a better tempo extraction, the three most salient peaks are detected and a relation of multiplicity between them is searched for via a simple numerical algorithm. For instance, in Fig. 5 the first, second and fourth peaks are the most prominent and bear a strong relationship: the second peak is located at a lag practically twice that of the first, and the fourth at a lag nearly four times that of the first. This allows a more robust decision. If no relation of multiplicity is found among the detected peaks, the most salient one is taken as the tempo.
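The summary autocovariance method could be sketched as follows, assuming equal-length decimated onset signals at 1000 Hz (a 16 kHz input decimated by 16). For brevity, this sketch takes a plain argmax over the lag range instead of the multiplicity check among the three most salient peaks described above.

```python
import numpy as np

def sacvf_tempo(onset_bands, fs_env=1000):
    """Summary autocovariance tempo estimate, Eqs. (5) and (6): smooth
    each bandwise onset train with a 100 ms odd-length Hanning window,
    sum the bandwise autocovariances, and pick the most salient lag in
    the 200 to 1250 ms range."""
    n_win = int(0.1 * fs_env) | 1                             # odd-length 100 ms window
    sacvf = None
    for w_k in onset_bands:                                   # bands must share one length
        x = np.convolve(w_k, np.hanning(n_win), mode='same')
        x = x - x.mean()                                      # centring of Eq. (5)
        gamma = np.correlate(x, x, mode='full')[len(x) - 1:]  # Gamma_k(tau), tau >= 0
        sacvf = gamma if sacvf is None else sacvf + gamma     # Eq. (6)
    lo, hi = int(0.2 * fs_env), int(1.25 * fs_env)            # 200 to 1250 ms lags
    tau = lo + int(np.argmax(sacvf[lo:hi]))                   # most salient peak
    return 60.0 * fs_env / tau                                # tempo in BPM
```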
2.3. Bank of resonators

An alternative method dedicated to the determination of the tempo is given in [8] and [5]. In order to estimate the onset periodicity (pulse) in each subband, a so-called bank of resonators is used. These resonators are simply oversampled versions of first-order autoregressive filters. For instance, the k-th filter has a z-transfer function of the form:

    H(z) = \frac{\beta_k}{1 - \alpha_k z^{-T_k}}                             (7)

The principle of the method is the following: the impulse response of this filter is zero except at time indices that are multiples of the oversampling factor T_k. When the filter receives a periodic pulse train of period T_k, the output samples accumulate and the output level increases.

The factors T_k are set as integer numbers of samples in order to cover the whole range of analysed tempi, from T_1 to T_M. According to the principle described above, the \beta_k and \alpha_k coefficients are kept constant across all filters so that the non-zero outputs can be compared. For a T_k-periodic pulse train, the transient time of the k-th filter is thus of the order of

    \tau_k = -3 T_k / \ln \alpha

In our implementation, \alpha is set to ensure that the maximum transient time \tau_M is far less than the length of the analysis window. The tempo is determined through the following main steps:

1. computation of the responses y_k(t) of each filter to the centered amplitude envelopes,
2. estimation of the mean power \sigma_k^2 of y_k,
3. extraction of the time indices t_i, i = 1, ..., N_k, such that y_k(t_i) > 3\sigma_k, and computation of the mean power P_k = (1/N_k) \sum_{i=1}^{N_k} y_k(t_i)^2,
4. extraction of the tempo corresponding to the factor T_u, with u = \arg\max_k P_k.

3. SIMULATION RESULTS

3.1. Sound database

The database used for evaluation consists of 55 short segments of musical signals, each 10 seconds long. The excerpts were chosen to represent different styles: classical music (23% of the database), rock or modern pop music (33%), traditional songs (12%), Latin/Cuban music (12%) and jazz (20%). All signals are sampled at 16 kHz. This sound database has been manually annotated by skilled musicians. The procedure for manually estimating the tempo is the following:

- the musician listens to a musical segment using headphones (sometimes several times in a row, to become accustomed to the tempo),
- while listening, he or she taps the tempo,
- the tapping signal is recorded and the tempo is automatically extracted from it,
- all tempi are finally checked manually; due to the impulsive nature of the tapping signals, no errors were found after the automatic extraction.

3.2. Results

This section gives the results of several algorithms on the database described in the previous section. Despite the limited size of our database, this test gives a good indication of the performance of each algorithm. The estimate provided by an algorithm is labelled as correct when it differs by less than 5% from the reference tempo, with doubling or halving not counted as errors. An estimated tempo T_e is thus labelled correct if:

    0.95 \alpha T < T_e < 1.05 \alpha T                                      (8)

where T is the valid (manually measured) tempo and \alpha = 1, with \alpha = 0.5 or \alpha = 2 also accepted since halving and doubling are not counted as errors.

Table 1 gives the results obtained for six algorithms: the original Scheirer algorithm [8], a modified version of it using a new resonator bank (see section 2.3), the approach introduced by Paulus [7], the autocovariance method, the spectral sum and the spectral product methods.

    Method                 Correct estimation (%)
    Scheirer               76
    Scheirer (modified)    85
    Autocovariance         87
    Paulus                 74
    Spectral sum           87
    Spectral product       89

Table 1. Performance obtained for several tempo tracking algorithms.

While there are no significant differences between the four best methods, the approaches introduced in this paper outperform the classical approach of Scheirer and the recently introduced method of Paulus. It is nevertheless important to note that the database used is small, and complementary tests on an extended database will be needed to confirm these initial results.

4. CONCLUSION

This paper has proposed several new algorithms for tempo tracking and has compared them to a number of methods described in the literature. Although the dataset used for evaluation is rather limited, the methods introduced prove to be very accurate. Future work will include an evaluation of these algorithms on a larger dataset, the extension of our best algorithm to real-time tempo tracking, and its adaptation to small analysis windows. Finally, retrieval experiments will be conducted based solely on the rhythmic pattern.

5. REFERENCES

[1] S. E. Dixon. A beat tracking system for audio signals. Austrian Research Institute for Artificial Intelligence, Vienna, pages 311-320, Apr. 2000.
[2] M. Goto and Y. Muraoka. An audio-based real-time beat tracking system and its applications. In Proceedings of the International Computer Music Conference, 1998.
[3] A. Klapuri. Automatic transcription of music. Master's thesis, Department of Information Technology, Tampere University of Technology, 1998.
[4] A. Klapuri. Sound onset detection by applying psychoacoustic knowledge. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pages 3089-3092, 1999.
[5] E. W. Large and J. F. Kolen. Resonance and the perception of musical meter. Connection Science, Vol. 6:177-208, 1994.
[6] J. Laroche. Estimating tempo, swing and beat locations in audio recordings. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 131-135, 2001.
[7] J. Paulus and A. Klapuri. Measuring the similarity of rhythmic patterns. In 3rd International Conference on Music Information Retrieval (ISMIR), 2002.
[8] E. D. Scheirer. Tempo and beat analysis of acoustic musical signals. J. Acoust. Soc. Am., Vol. 103(1):588-601, Jan. 1998.
[9] A. Schloss. On the automatic transcription of percussive music, 1985.
[10] J. Seppänen. Tatum grid analysis of musical signals. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, pages 21-24, Oct. 2001.
[11] T. Tolonen and M. Karjalainen. A computationally efficient multipitch analysis model. IEEE Trans. Speech and Audio Processing, Vol. 8(6):708-716, Nov. 2000.
[12] G. Tzanetakis and P. Cook. Musical genre classification of audio signals. IEEE Trans. Speech and Audio Processing, Vol. 10(5):293-301, Jul. 2002.