Tempo and beat estimation of musical signals

Miguel Alonso, Bertrand David, Gaël Richard
ENST-GET, Département TSI, 46, rue Barrault, Paris 75634 cedex 13, France
{malonso,bedavid,grichard}

ABSTRACT

Tempo estimation is fundamental in automatic music processing and in many multimedia applications. This paper presents an automatic tempo tracking system that processes audio recordings and determines the beats per minute and the temporal beat locations. The concept of spectral energy flux is defined and leads to an efficient note onset detector. The algorithm involves three stages: a front-end analysis that efficiently extracts onsets, a periodicity detection block and the temporal estimation of beat locations. The performance of the proposed method is evaluated using a large database of 489 excerpts from several musical genres. The global recognition rate is 89.7 %. Results are discussed and compared to other tempo estimation systems.

Keywords: beat, tempo, onset detection.

1. INTRODUCTION

It is very difficult to understand western music without perceiving beats, since a beat is a fundamental unit of the temporal structure of music [4]. For this reason, automatic beat tracking, or tempo tracking, is an essential task for many applications such as musical analysis, automatic rhythm alignment of multiple musical instruments, cut and paste operations in audio editing, and beat-driven special effects. Although it might appear simple at first, tempo tracking has proved to be a difficult task when dealing with a broad variety of musical genres, as shown by the large number of publications on this subject in recent years [2, 5, 6, 8, 9, 10, 12].

Earlier tempo tracking approaches focused on MIDI or other symbolic formats, where note onsets are already available to the estimation algorithm. More recent approaches deal directly with ordinary CD audio recordings.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. © 2004 Universitat Pompeu Fabra.

The system that we present in this paper falls into this category.

For musical genres with a straightforward rhythm such as rap, rock, reggae and others where a strong percussive strike drives the rhythm, current beat trackers achieve high performance, as pointed out in [5, 9, 12]. However, the robustness of beat tracking systems is often much less guaranteed when dealing with classical music, because of the weakness of the techniques employed in attack detection and the tempo variations inherent to that kind of music.

In the present article, we describe an algorithm to estimate the tempo of a piece of music (in beats per minute, or bpm) and to identify the temporal locations at which the beats occur. Like most of the systems available in the literature, this algorithm relies on a classical scheme: a front-end processor extracts the onset locations from a time-frequency or subband analysis of the signal, traditionally using a filter bank [1, 7, 10, 12] or the discrete Fourier transform [3, 5, 6, 8, 9]. Then, a periodicity estimation algorithm finds the rate at which these events occur. A large variety of methods has been used for this purpose, for example, a bank of oscillators which resonate at integer multiples of their characteristic frequency [6, 9, 12], pitch detection methods [1, 10], histograms of the inter-onset intervals [2, 13], and probabilistic approaches such as a Gaussian mixture model to express the likelihood of the onset locations [8].

In this paper, following Laroche's approach [9], we define a quantity called the spectral energy flux as the derivative of the signal frequency content with respect to time.
Although this principle has been previously used [3, 6, 8, 9], a significant improvement has been obtained by using an optimal filter to approximate this derivative. We exploit this approach to obtain a high-performance onset detector and integrate it into a tempo tracking algorithm. We demonstrate the usefulness of this approach by validating the proposed system using a large manually annotated database that contains excerpts from rock, latin, pop, soul, classical, rap/hip-hop and others.

The paper is organized as follows: Section 2 provides a detailed description of the three main stages that compose the system. In Section 3, test results are provided and compared to other methods. The system parameters used during the validation procedure are provided as well as comments about the issues of the algorithm. Finally, Section 4 summarizes the achievements of the presented algorithm and discusses possible directions for further improvements.

Figure 1. Architecture of the beat tracking algorithm. Blocks, from input to output: audio signal; onset detection (spectral analysis (STFT), spectral energy flux); detection function; periodicity estimation; beat location; synchronised beat signal.

2. DESCRIPTION OF THE ALGORITHM

In this paper, it is assumed that the tempo of the audio signal is constant over the duration of the analysis window and that it may evolve slowly from one window to the next. In addition, we suppose that the tempo lies between 60 and 200 BPM, without loss of generality since any other value can be mapped into this range. The proposed algorithm is composed of three major steps (see Figure 1):

• onset detection: it consists in computing a detection function based on the spectral energy flux of the input audio signal;
• periodicity estimation: the periodicity of the detection function is estimated using pitch detection techniques;
• beat location estimation: the position of the corresponding beats is obtained from the cross-correlation between the detection function and an artificial pulse-train.

2.1. Onset detection

The aim of onset detection is to extract a detection function that indicates the location of the most salient features of the audio signal, such as note changes, harmonic changes and percussive events. These events are particularly important in the beat perception process.

Note onsets are easily masked in the overall signal energy by continuous tones of higher amplitude [9], while they are more likely to be detected after separating them into frequency channels. We propose to follow a frequency-domain approach [3, 5, 6, 8, 9], as it proves to outperform time-domain methods based on direct processing of the temporal waveform as a whole.

2.1.1. Spectral analysis and spectral energy flux

The input audio signal is analyzed using a decimated version of the short-time Fourier transform (STFT), i.e., short signal segments are extracted at regular time intervals, multiplied by an analysis window and transformed into the frequency domain by means of a Fourier transform. This leads to

$$\tilde{X}(f, m) = \sum_{n=0}^{N-1} w(n)\, x(n + mM)\, e^{-j 2 \pi f n} \tag{1}$$

where $x(n)$ denotes the audio signal, $w(n)$ the finite analysis window of size $N$ in samples, $M$ the hop size in samples, $m$ the frame index and $f$ the frequency.

Motivated by the work of Laroche [9], we define the spectral energy flux $E(f, k)$ as an approximation to the derivative of the signal frequency content with respect to time:

$$E(f, k) = \sum_{m} h(m - k)\, G(f, m) \tag{2}$$

where $h(m)$ approximates a differentiator filter with

$$H(e^{j 2 \pi f}) \simeq j 2 \pi f \tag{3}$$

and the transformation

$$G(f, m) = \mathcal{F}\{ |\tilde{X}(f, m)| \} \tag{4}$$

is obtained via a two-step process: a low-pass filtering of $|\tilde{X}(f, m)|$ to perform energy integration in a way similar to that in the auditory system, emphasizing the most recent inputs but masking rapid modulations [14], and a non-linear compression.
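Eq. (1) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `stft_magnitude`, the Hann analysis window, the 1024-sample window size and the 256-sample hop are all assumptions, since the text does not fix these values here.

```python
import numpy as np


def stft_magnitude(x, n_fft=1024, hop=256):
    """Magnitude of the decimated STFT of Eq. (1), shape (n_bins, n_frames).

    n_fft plays the role of N and hop the role of M; both values and the
    Hann window are illustrative assumptions.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    # Frame m holds the samples x(n + m*M) for n = 0 .. N-1.
    frames = np.stack([x[m * hop:m * hop + n_fft] for m in range(n_frames)])
    # A real FFT keeps the non-negative frequency channels 0 .. N/2.
    return np.abs(np.fft.rfft(frames * window, axis=1)).T
```

Each column of the result is $|\tilde{X}(f, m)|$ for one frame $m$, which is the input to the transformation of Eq. (4).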
In [9], for example, Laroche proposes $h(m)$ as a first-order differentiator filter ($h = [1; -1]$), applies no low-pass filtering, and uses the non-linear compression function $G(f, m) = \operatorname{arcsinh}(|\tilde{X}(f, m)|)$. In [6], Klapuri uses the same first-order differentiator filter but, for the transformation, performs the low-pass filtering after applying a logarithmic compression function.

In the present work we propose $h(m)$ to be an FIR differentiator filter. Such a filter is designed by a Remez optimisation procedure, which leads to the best approximation to Eq. (3) in the minimax sense [11]. Compared to the first-order difference used in [6, 8, 9], this new approach greatly improves the extraction of musically meaningful features such as percussive attacks and chord changes. In addition, $G(f, m)$ is obtained via low-pass filtering with the second half of a Hanning window followed by a logarithmic compression function, as suggested by Klapuri [7], since the logarithmic difference function gives the amount of change in a signal in relation to its absolute level. This is a psycho-acoustically relevant measure, since the perceived signal amplitude is relative to its level, the same amount of increase being more prominent in a quiet signal [7].

During system development, several orders for the differentiator filter $h(m)$ were tested. We found that an order-8 filter offered the best tradeoff between complexity and efficiency.
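The filter design and the flux computation of Eq. (2) could be sketched as follows with SciPy. This is an illustration under stated assumptions rather than the authors' code: `scipy.signal.remez` stands in for the Remez procedure of [11]; the upper band edge of 0.4 cycles per frame, the smoothing length and the `log(1 + x)` form of the compression are assumptions; and the function names are hypothetical. The taps are rescaled on a unit-slope ramp so that neither the gain nor the sign depends on the design routine's conventions.

```python
import numpy as np
from scipy.signal import lfilter, remez


def design_differentiator(order=8, band_edge=0.4):
    """Minimax FIR differentiator with order + 1 taps (band_edge is assumed)."""
    taps = remez(order + 1, [0.0, band_edge], [1.0], type="differentiator")
    # Convolving the ramp x[n] = n with zero-sum taps gives the constant
    # -sum(k * taps[k]); dividing by it makes a unit slope map to +1.
    ramp_gain = -np.dot(np.arange(taps.size), taps)
    return taps / ramp_gain


def spectral_energy_flux(mag, taps, smooth_len=8):
    """Flux E(f, k) from STFT magnitudes of shape (n_bins, n_frames)."""
    # Decaying (second) half of a Hann window as a causal low-pass filter.
    lowpass = np.hanning(2 * smooth_len + 1)[smooth_len:]
    lowpass = lowpass / lowpass.sum()
    smoothed = lfilter(lowpass, 1.0, mag, axis=1)
    # Logarithmic compression; the +1 offset (an assumption) avoids log(0).
    compressed = np.log1p(smoothed)
    # Causal convolution along time: the output lags by order / 2 frames.
    return lfilter(taps, 1.0, compressed, axis=1)
```

With the default order of 8, an increase in a channel's magnitude shows up as positive flux about four frames later, which matters only if the flux has to be aligned with the original time axis.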
In practice, the algorithm uses an $N$-point FFT to evaluate the STFT; the frequency channels 1 to $N/2$ of the signal's time-frequency representation are then filtered using $h(m)$ to obtain the spectral energy flux. Then, all the positive contributions of these channels are summed to produce a temporal waveform $v(k)$ that exhibits sharp maxima at transients and note onsets, i.e., those instants where the energy flux is large.

Beats tend to occur at note onsets, so we must first distinguish the "true beat" peaks from the spurious ones in $v(k)$ to obtain a proper detection function $p(k)$. In addition, we work under the assumption that these unwanted peaks are much smaller in amplitude than the note attack peaks. Thus, a peak-picking algorithm that selects peaks above a dynamic threshold calculated with the help of a median filter is a simple and efficient solution to this problem. The median filter is a nonlinear technique that computes the pointwise median inside a window of length $2i + 1$ formed by a subset of $v(k)$; the median threshold curve is thus given by the expression

$$\theta(k) = C \cdot \operatorname{median}(g_k) \tag{5}$$

where $g_k = \{ v_{k-i}, \ldots, v_k, \ldots, v_{k+i} \}$ and $C$ is a predefined scaling factor that artificially raises the threshold curve slightly above the steady-state level of the signal. To ensure accurate detection, the length of the median filter must be longer than the average width of the peaks of the detection function. In practice, we set the median filter length to 200 ms. Then, we obtain the signal $\hat{p}(k) = v(k) - \theta(k)$, which is half-wave rectified to produce the detection function $p(k)$:

$$p(k) = \begin{cases} \hat{p}(k) & \text{if } \hat{p}(k) > 0 \\ 0 & \text{otherwise} \end{cases} \tag{6}$$

In our tests, the onset detector described above has proved to be a robust scheme that provides good results for a wide range of musical instruments and attacks at a relatively low computational cost.
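The thresholding of Eqs. (5) and (6) could look like the sketch below. It is a hedged illustration: the function name `detection_function`, the half-width of 5 frames and the scaling factor of 1.5 are assumptions (the text fixes only the 200 ms filter length, whose size in frames depends on the hop), and `scipy.ndimage.median_filter` with edge replication is one possible way to compute the sliding median.

```python
import numpy as np
from scipy.ndimage import median_filter


def detection_function(flux, half_width=5, scale=1.5):
    """Turn a flux matrix of shape (n_bins, n_frames) into p(k).

    half_width is i in Eq. (5) and scale is C; both defaults are assumptions.
    """
    # v(k): sum of the positive flux contributions over the frequency channels.
    v = np.maximum(flux, 0.0).sum(axis=0)
    # theta(k) = C * median over a sliding window of 2*i + 1 frames (Eq. 5).
    theta = scale * median_filter(v, size=2 * half_width + 1, mode="nearest")
    # Half-wave rectification of v(k) - theta(k) (Eq. 6).
    return np.maximum(v - theta, 0.0)
```

Isolated spikes that rise well above the local median survive the rectification, while the steady-state level of $v(k)$ is mapped to zero.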
As an illustration, Figure 2-a shows the time waveform of a piano recording containing seven attacks. These attacks can be easily observed in the signal's spectrogram in Figure 2-b. Figure 2-c can be interpreted physically as the rate at which the frequency-content energy of the audio signal varies at a given time instant, i.e., the spectral energy flux. In this example, seven vertical stripes represent the reinforcement of the energy variation, clearly indicating the location of the attacks (the position of the spectrogram edges). When all the positive energy variations are summed in the frequency domain and thresholded, we obtain the detection function $p(k)$ shown in Figure 2-d. An example of an instrument with smooth attacks, a violin, is shown in Figure 3. Large energy variations in the frequency content of the audio signal can still be observed as vertical stripes in Figure 3-c. After summing the positive contributions, six of the seven attacks are properly detected, as shown by the corresponding largest peaks in Figure 3-d.

Figure 2. From top to bottom: time waveform of a piano signal, its spectrogram, its spectral energy flux and the detection function $p(k)$. Axes: (a) amplitude, (b) frequency (kHz), (c) frequency (kHz), (d) magnitude, all against time (s).

2.2. Periodicity estimation

The detection function $p(k)$ at the output of the onset detection stage can be seen as a quasi-periodic and noisy pulse-train that exhibits large peaks at note attacks. The next step is to estimate the tempo of the audio signal, i.e., the periodicity of the note onset pulses. Two methods from traditional pitch determination techniques are employed: the spectral product and the autocorrelation function.
These techniques have already been used for this purpose in [1].

2.2.1. Spectral product

The spectral product principle assumes that the power spectrum of the signal is formed from strong harmonics located at integer multiples of the signal's fundamental frequency. To find this frequency, the power spectrum is compressed by integer factors $m$, then the obtained spectra are multiplied, leading to a reinforced fundamental frequency. For a normalized frequency, this is given by:

$$S(e^{j 2 \pi f}) = \prod_{m=1}^{M} |P(e^{j 2 \pi m f})| \quad \text{for } f < \frac{1}{2M} \tag{7}$$

where $P(e^{j 2 \pi f})$ is the discrete Fourier transform of $p(k)$. Then, the estimated tempo $T$ is easily obtained by picking out the frequency index corresponding to the largest peak of $S(e^{j 2 \pi f})$. The tempo is constrained to fall in the range $60 < T < 200$ BPM.

Figure 3. From top to bottom: time waveform of a violin signal, its spectrogram, its spectral energy flux and the detection function $p(k)$. Axes: (a) amplitude, (b) frequency (kHz), (c) frequency (kHz), (d) magnitude, all against time (s).

2.2.2. Autocorrelation function

This is a classical method in periodicity estimation. The non-normalized deterministic autocorrelation function of $p(k)$ is calculated as follows:

$$r(\tau) = \sum_{k} p(k + \tau)\, p(k) \tag{8}$$

Again, we suppose that $60 < T < 200$ BPM. Hence, during the calculation of the autocorrelation, only the values of $r(\tau)$ corresponding to lags in the range of 300 ms to 1 s are calculated. To find the estimated tempo $T$, the lags of the three largest peaks of $r(\tau)$ are analyzed and a multiplicity relationship between them is sought. If no relation is found, the lag of the largest peak is taken as the beat period.

2.3. Beat location

To find the beat location, we use a method based on the comb filter idea that resembles previous work carried out in [6, 9, 12]. We create an artificial pulse-train $q(t)$ of tempo $T$, previously calculated as explained in Section 2.2, and cross-correlate it with $p(k)$. This operation has a low computational cost, since the correlation is evaluated only at the indices corresponding to the maxima of $p(k)$. Then, we call $t_0$ the time index where this cross-correlation is maximal and we consider it as the starting location of the beat. For the second and successive beats in the analysis window, a beat period $\mathcal{T}$ is added to the previous beat location, i.e., $t_i = \lfloor t_{i-1} + \mathcal{T} \rfloor$, and a corresponding peak in $p(k)$ is searched for within the area $t_i \pm \Delta$. If no peak is found, the beat is placed in its expected position $t_i$. When the last beat of the window occurs, its location is stored in order to ensure continuity with the first beat of the new analysis window. If the tempo of the new analysis window differs by more than 10 % from the previous tempo, a new beat phase is estimated. The peaks are then searched for using the new beat period without referencing the previous beat phase.

Genre          Pieces   Percentage
classical         137       28.0 %
jazz               79       16.2 %
latin              37        7.6 %
pop                40        8.2 %
rock               44        9.0 %
reggae             30        6.1 %
soul               24        4.9 %
rap, hip-hop       20        4.1 %
techno             23        4.7 %
other              55       11.2 %
total             489      100 %

Table 1. Genre distribution of the test database.

3. PERFORMANCE ANALYSIS

3.1. Database, annotation and evaluation protocol

The proposed algorithm was evaluated using a corpus of 489 musical excerpts taken from commercial CD recordings. These pieces were selected to cover as many characteristics as possible: various tempi in the 50 to 200 BPM range, a wide variety of instruments, dynamic range, studio/live recordings, old/recent recordings, with/without vocals, male/female vocals and with/without percussion. They were also selected to represent a wide diversity of musical genres, as shown in Table 1.

From each of the selected recordings, an excerpt of 20 seconds having a relatively constant tempo was extracted and converted to a monophonic signal sampled at 16 kHz. The procedure for manually estimating the tempo of each musical piece is the following:

• the musician listens to a musical excerpt using headphones (if required, several times in a row to become accustomed to the tempo),
• while listening, he/she taps the tempo,
• the tapping signal is recorded and the tempo is extracted from it.

As pointed out by Goto in [4], the beat is a perceptual concept that people feel in music, so it is generally difficult to define the "correct beat" in an objective way. People have a tendency to tap at different metric levels. For