A Robust Audio Fingerprint Algorithm

A ROBUST AUDIO FINGERPRINT EXTRACTION ALGORITHM

Jérôme Lebossé
France Télécom R&D, 32 rue des coutures, 14000 Caen, France
jerome.lebosse@orange-ft.com

Luc Brun
GREYC UMR 6072, ENSICAEN, 6 Boulevard du Maréchal Juin, 14050 Caen, France
luc.brun@greyc.ensicaen.fr

Jean Claude Pailles
France Télécom R&D, 32 rue des coutures, 14000 Caen, France
jeanclaude.pailles@orange-ft.com

ABSTRACT

An audio fingerprint is a small digest of an audio file computed from its main perceptual properties. Like human fingerprints, audio fingerprints make it possible to identify an audio file among a set of candidates but do not allow any other characteristic of the file to be retrieved. Applications of audio fingerprints include audio monitoring on broadcast channels, filtering on peer-to-peer networks, metadata restoration in large audio libraries and the protection of authors' copyrights within a Digital Rights Management (DRM) system. We propose in this paper a new fingerprint extraction algorithm which combines a segmentation method with a new fingerprint construction scheme. The proposed method is robust against compression and time-shifting alterations of the audio files.

KEY WORDS

audio fingerprint, segmentation, indexation.

1 Introduction

Audio fingerprinting [5, 7, 1] aims at defining a small signature (the fingerprint) of a content based on its perceptual properties. Audio fingerprints share several properties with their human counterparts. Firstly, an audio fingerprint makes it possible to identify an audio file from a small amount of data. Secondly, as with human fingerprints, no property of an audio file can be readily derived from its fingerprint. Applications of fingerprints include audio monitoring on broadcast channels, filtering on peer-to-peer networks, metadata checking or restoration in large audio libraries, and digital rights management.

In order to correctly identify an audio file from its perceptual properties, fingerprint methods have to be robust against alterations (compression, cuts, ...).
Moreover, a fingerprint system should be able to recover a file from a short sample in a short time interval. The computational cost of a fingerprint system should thus be low. In addition, the fingerprint should be composed of elementary keys (called subfingerprints) based on small parts of the signal. The subfingerprints are then computed either continuously along the signal or at a sufficient rate to characterize a file from a short sample.

A fingerprint system consists of two components: a method which extracts the fingerprint and a search method which matches fingerprints. In this study, we focus on the first part of the system: the fingerprint extraction. We first present alternative approaches in Section 2. Then, we describe our method in Section 3. The robustness of the proposed fingerprint is evaluated in Section 4 through two experiments.

2 State of the art

As mentioned in Section 1, a fingerprint is usually made of a sequence of consecutive keys in order to identify any part of the signal. The first step of a fingerprint algorithm thus consists in extracting from the signal a sequence of small intervals and associating to each interval a value (called the subfingerprint) which characterises it.

A usual method [7, 5] consists in decomposing the signal into a sequence of overlapping intervals, called frames, of a few milliseconds. Enframing the audio stream makes it possible to treat each frame as a relatively stationary sound. The frames are then weighted by a Hamming window to minimize the discontinuities at the beginning and at the end of each frame. An overlap factor between frames, which depends on the frame size, is used to reduce shifting effects.

Several methods combine the enframing decomposition with a subfingerprint design based on the FFT of the signal. However, a cut of the signal or the concatenation of a silence at its beginning is roughly equivalent to a translation.
While the energy of the whole signal is preserved by such a transformation, the energy computed on each interval may change drastically [8]. Moreover, the use of overlapping windows only reduces the influence of such cuts (Section 4.4).

Another way to segment an audio waveform is to find particular positions in the audio signal, called onsets [6, 3]. Typical onset detection schemes decompose the signal into frames and associate some perceptual quantities to each frame. The onsets are then detected from the signal encoding the distance between the feature vectors of adjacent frames. The main drawback of onset methods within the audio identification framework is that the number of onsets detected in a short time interval (say 1 s) is unpredictable and usually too low to provide an efficient characterization of the signal. A fingerprint algorithm based on an onset approach may therefore delay the identification for an unpredictable period until it has collected a sufficient number of subfingerprints. This unpredictability is a major drawback for fingerprinting systems, which should identify a file in a short time interval. Enframing is thus generally used to decompose the signal into small intervals in audio identification systems.

Once the signal is divided into intervals, discriminant features have to be associated to each interval. The sequence of features computed along the waveform defines the audio fingerprint. Kurth [7] proposes a subfingerprint design based on the global energy of each interval. However, this method does not capture enough information from the spectrum to provide a reliable fingerprint indexation scheme. Additional information about each interval may be obtained by considering the FFT of each interval and the energy of each of its frequency bins [2]. Different heuristics have been proposed based on this basic idea.
For example, Johnson and Woodland associate to each frame the mel-frequency PLP cepstral coefficients of its spectrum together with the derivative of these coefficients. Burges et al. [1] use a particular decomposition of the spectrum called the Modulated Complex Lapped Transform (MCLT).

The method of Kalker and Haitsma [5] follows the above approach and uses a decomposition of the spectrum of each frame into bands using a logarithmic spacing. The authors decompose the signal into frames of 0.37 s with an overlap factor of 31/32. The subfingerprint of each frame is then defined as a 32-bit number computed from the decomposition of the spectrum. The sequence of bits of each frame is defined from the sign of the energy differences computed both between two consecutive bands of a same frame and between two consecutive frames. More precisely, let us define $EB(n,m)$ as the energy of the $m$th band within the $n$th frame and $\Delta EB(n,m) = EB(n,m) - EB(n,m+1)$ as the difference of the energies of two successive bands within a same frame. The value of the $m$th bit of the $n$th frame, $F(n,m)$, is then defined as:

$$F(n,m) = \begin{cases} 1 & \text{if } \Delta EB(n,m) - \Delta EB(n-1,m) \geq 0 \\ 0 & \text{if } \Delta EB(n,m) - \Delta EB(n-1,m) < 0 \end{cases}$$

3 Robust Identification

As mentioned in Section 1, enframing methods ensure that a sufficient number of frames is selected within an input segment. However, the selection of a sequence of contiguous frames is sensitive to random cropping or shifting operations which may be performed on the signal (Sections 2 and 4). This drawback is attenuated but not completely overcome by the use of overlapping frames. On the other hand, segmentation methods are less sensitive to cropping or shifting operations but do not ensure that a sufficient number of intervals will be selected in a given time interval.

3.1 Audio segmentation

The basic idea of our method is to combine the respective advantages of enframing and segmentation methods by selecting a small time interval within a larger one.
The small interval allows the detection of characteristic parts of the signal whereas the larger interval ensures a minimum selection rate of intervals. This process can be decomposed into three steps (Figure 1):

• In the first step, an interval called the Observation Interval ($I_o$) is set at the beginning of the waveform. The length of this interval is typically equal to a few hundredths of a second.

• The waveform inside $I_o$ is analysed in the second step. In this step, we divide the interval $I_o$ into shorter overlapping sub-intervals of a few milliseconds, called Energy Intervals ($I_e$). The energy of each $I_e$ interval is computed by taking the mean of the sample amplitudes within the interval. The interval $I_e$ with maximal energy ($I_{emax}$) within $I_o$ is then selected.

• In the third step, a last interval, called the Features Interval ($I_f$), is defined centered on $I_{emax}$.

Finally, a feature extraction algorithm is applied on $I_f$ to compute a subfingerprint (Section 3.2).

Figure 1. Audio segmentation: the interval $I_{emax}$ corresponds to the interval of greatest energy within $I_o$. The interval $I_f$ is centered around $I_{emax}$.

Given a selected $I_f$ interval, the beginning of the next $I_o$ interval is set to the end of $I_e$ (Fig. 1). The distance between the centers of two $I_f$ intervals thus lies between $I_e$ and $I_o - I_e$. This method provides a higher robustness against time shifting than the basic strategy which consists in selecting a sequence of consecutive $I_o$ intervals. Indeed, if a strong peak is present within a signal, this peak will be selected as $I_{emax}$ both in the original and in the shifted signal. Since $I_f$ is centered around $I_{emax}$, the next $I_o$ interval will start at the same position both within the original signal and its shifted version.
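
The segmentation steps described above can be illustrated by a short Python sketch. It is only a sketch under assumptions that the paper does not state: the energy of an energy interval is taken as the mean absolute amplitude of its samples, the overlapping energy intervals are spaced by a hypothetical `hop` parameter (one sample by default), ties are resolved in favour of the earliest interval, and the next observation interval starts at the end of the selected maximal-energy interval. The function name `segment` and its parameters are illustrative.

```python
def segment(samples, io_len, ie_len, if_len, hop=1):
    """Return the (start, end) sample indices of the selected I_f intervals.

    All lengths are expressed in samples and ie_len must not exceed io_len.
    `hop` (step between overlapping I_e windows) is an assumed parameter.
    The returned bounds may extend past the signal edges; the caller is
    expected to pad or clip them.
    """
    intervals = []
    io_start = 0
    while io_start + io_len <= len(samples):
        # Step 2: scan the overlapping I_e windows inside the current I_o.
        best_start, best_energy = io_start, -1.0
        for ie_start in range(io_start, io_start + io_len - ie_len + 1, hop):
            window = samples[ie_start:ie_start + ie_len]
            energy = sum(abs(x) for x in window) / ie_len  # assumed energy measure
            if energy > best_energy:  # strict test keeps the earliest maximum
                best_start, best_energy = ie_start, energy
        # Step 3: centre I_f on the selected I_e window.
        centre = best_start + ie_len // 2
        if_start = centre - if_len // 2
        intervals.append((if_start, if_start + if_len))
        # The next I_o starts at the end of the selected I_e window.
        io_start = best_start + ie_len
    return intervals
```

On a signal made of isolated peaks, calling `segment` on the signal and on a copy prefixed with a few samples of silence should return the same intervals up to the shift, which is the synchronisation property argued for in this section.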
This strategy thus synchronises the two fingerprints on significant peaks of the signal. Moreover, using the basic strategy based on consecutive $I_o$ intervals, a maximum $I_e$ interval located at the transition between two $I_o$ intervals would not be detected. Finally, our strategy makes it possible to detect, and select as $I_f$, several $I_e$ intervals located in a same $I_o$ with near-maximal energies. Using the basic strategy, only one of them would be detected, while the degradation of the signal may exchange the selection of two $I_e$ intervals whose energies are near the maximum. This strategy thus reinforces the robustness of our method against compression.

3.2 Subfingerprint design

Our method to compute a subfingerprint on each $I_f$ interval is based on the same scheme as that of Haitsma and Kalker [5] (Section 2). Like these authors, we thus use a decomposition of the spectrum of $I_f$ into a sequence of bands with a logarithmic spacing. However, as shown by our experiments (Section 4), strong compression rates may significantly alter the robustness of the subfingerprint extraction algorithm. Indeed, the corruption of the signal by noise, compression or cutting operations drastically reduces the number of $I_f$ intervals with the same subfingerprint. Haitsma and Kalker attenuate this problem by using the Hamming distance between the sequences of subfingerprints of two audio files [5]. However, the corruption of subfingerprints by noise and alterations corrupts the Hamming distances and reduces the amount of information that an indexation algorithm may deduce from such distances. The robustness of the subfingerprint extraction algorithm may be improved using the two following remarks:

1. The use of two successive frames to design a subfingerprint involves the corruption of two subfingerprints if an error occurs in the measure of their common frame.

2. The comparison of the energies of two successive bands of a spectrum within a same frame is sensitive to the errors which may corrupt a single band.

We solve the first source of errors by using only one frame for each subfingerprint computation. The second source of errors is connected to the corruption of one band of the $I_f$ spectrum. Using the same notations as in Section 2, the alteration of the measure of the energy of a single band $EB(n,m)$ alters the values of $\Delta EB(n,m-1)$ and $\Delta EB(n,m)$. This alteration of the band energies may be considered as the presence of a random noise on the signal $(EB(n,m))_{m \in \{1,\dots,M\}}$, where $M$ represents the index of the last energy band. If we suppose that the noise is uncorrelated between the different samples of the signal $(EB(n,m))_{m \in \{1,\dots,M\}}$, the influence of noise may be reduced by using a function of $m$ defined as a sum of some $EB(n,m)$. We thus define the mean energy $S(n,m)$ of a band $m$ within a frame $n$ as the mean of all the band energies from 0 to $m$:

$$S(n,m) = \frac{1}{m} \sum_{j=0}^{m} EB(n,j)$$

We then replace $EB(n,m)$ by $S(n,m)$ in the computation of the band energy differences and define the $m$th bit of the subfingerprint associated to the frame $n$, $F(n,m)$, as follows:

$$F(n,m) = \begin{cases} 1 & \text{if } S(n,m) - S(n,m-1) \geq 0 \\ 0 & \text{otherwise} \end{cases}$$

Since $m\,S(n,m) = (m-1)\,S(n,m-1) + EB(n,m)$, one can easily show that $S(n,m) - S(n,m-1) = \frac{1}{m}\left(EB(n,m) - S(n,m-1)\right)$. As $\frac{1}{m} > 0$, the above formula may thus be simplified as follows:

$$F(n,m) = \begin{cases} 1 & \text{if } EB(n,m) - S(n,m-1) \geq 0 \\ 0 & \text{otherwise} \end{cases}$$

The subfingerprint of each frame $n$ is defined by the concatenation of the $M$ bits $F(n,m)$ with $m \in \{1,\dots,M\}$. The parameter $M$ is fixed to 32 in our experiments (Section 4). The audio fingerprint is then defined as the concatenation of its sequence of subfingerprints.
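
The simplified bit rule of Section 3.2 can be illustrated by the following Python sketch. It rests on assumptions that go beyond the text: the band energies of one interval are given as a list indexed from 0 to M (so M + 1 bands yield M bits), and the running mean of the preceding bands is taken as a plain arithmetic mean of bands 0 to m - 1, which avoids the undefined case m = 0. The names `subfingerprint_bits` and `pack_bits` are illustrative, and the computation of the log-spaced band energies themselves is not shown.

```python
def subfingerprint_bits(band_energies):
    """Return the M bits F(n, 1..M) for one interval.

    band_energies holds EB(n, 0..M), i.e. M + 1 values (assumed indexing).
    Bit m is 1 when EB(n, m) is at least the mean of bands 0..m-1.
    """
    if len(band_energies) < 2:
        raise ValueError("at least two band energies are required")
    bits = []
    running_sum = band_energies[0]
    for m in range(1, len(band_energies)):
        mean_previous = running_sum / m  # mean of the m bands 0..m-1
        bits.append(1 if band_energies[m] - mean_previous >= 0 else 0)
        running_sum += band_energies[m]
    return bits


def pack_bits(bits):
    """Concatenate the bits into one integer, first bit most significant."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


# Usage: five band energies give a 4-bit subfingerprint.
example_bits = subfingerprint_bits([4.0, 2.0, 3.0, 5.0, 1.0])
example_value = pack_bits(example_bits)  # bits [0, 1, 1, 0] -> 6
```

With M = 32 as in the experiments, 33 band energies would produce a value that fits in the 4 bytes mentioned in Section 4.1.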
3.3 Resistance to attacks

If our method is used within the DRM framework, one possible attack would be to take an audio file whose fingerprint belongs to the database and to modify it outside the $I_f$ intervals used to compute the fingerprint. The DRM application would thus be unable to distinguish this new signal from the older one. Such an attack is possible only if the $I_f$ intervals constitute a small part of the whole signal. However, in the experiments presented below (Section 4) the size of the $I_f$ intervals has been fixed to 80 ms and the mean number of $I_f$ intervals per second computed on our database is equal to 21.9. The total length of the intervals used to compute subfingerprints within 1 second is thus equal to 80 ms × 21.9 ≈ 1.75 s. The $I_f$ intervals thus cover a large part of the signal with many overlaps. A modification of the signal outside the $I_f$ intervals would thus change very little and lead to an altered version close to the original.

One alternative attack consists in inserting random peaks within the signal in order to destroy its fingerprint. However, when fingerprint methods are used within a DRM framework, an audio file not identified within the database has no associated rights (including the right of reading). Such a manipulation would thus be meaningless.

4 Experiments

Our database contains 357 songs of approximately 4 minutes each (around 5300 values per song). All songs were subjected to MP3 encoding/decoding at various rates and shifted by adding silence of various lengths at the beginning of each song. The intervals $I_o$, $I_f$ and $I_e$ have been respectively set to 100 ms, 80 ms and 1 ms for these experiments.

4.1 Size of the fingerprints

The sizes of the intervals $I_e$ and $I_o$ being respectively equal to 1 and 100 milliseconds, the minimum and maximum detection rates of $I_f$ intervals during one second are respectively equal to 10 and 1000 (Section 3.1).
These lower and upper bounds of the detection rate represent extreme values which have not been reached within our test database. Indeed, the minimal and maximal detection rates measured on our database are respectively equal to 18 and 34. The mean detection rate on the whole database is equal to 21.9 with a standard deviation of 3.5.

Since each subfingerprint is stored on 4 bytes and 21.9 subfingerprints are computed per second, the size required by our method to store one minute of signal is equal to 21.9 × 4 × 60 = 5.2 kilobytes per minute. On the other hand, the method of Kalker and Haitsma uses intervals of 370 ms with an overlap of 31/32. Each new frame thus adds 11.56 ms to the signal covered by the fingerprint. Since each subfingerprint is stored on 4 bytes, the size required by a fingerprint corresponding to one minute of signal is equal to 4 × 60 / (11.56 × 10^{-3}) = 20.5 kilobytes. The size of the fingerprint used by our method is thus approximately 4 times lower than that of Kalker and Haitsma. These results have been confirmed by measuring on our database the mean size required to store both types of fingerprints.

4.2 Measurement of the performances

Let us denote by $T_i$ the set of intervals used to compute the subfingerprints of a signal $s_i$. Using the method of Kalker and Haitsma [5], this set consists of the sliding windows used by this algorithm. Using our method, this set consists of the $I_f$ intervals defined by the segmentation step.

Some additional quantities may be usefully defined to measure the performances of our method. Given an audio file $s_i$, let us denote by $SP_i \subset T_i$ the set of intervals located at a same position within $s_i$ and a degraded version of $s_i$. We consider that two intervals have a same position if the distance between the centers of the two intervals is less than 0.25 ms.
When the degraded version of $s_i$ is shifted, the position of the interval within the shifted version is translated by the shift before computing the distance. Moreover, let us additionally consider the set $SV_i \subset SP_i$ of intervals having a same location and a same subfingerprint value within $s_i$ and its degraded version.

Given a specific degradation, several quantities may be defined in order to measure the performances of our algorithm:

Segmentation rate: This quantity represents the mean proportion of $I_f$ intervals located at a same position within the original signal and its degraded version. This quantity is thus a measure of the performances of our segmentation algorithm. It is formally defined by:

$$SR = \frac{1}{N} \sum_{i=1}^{N} \frac{|SP_i|}{|T_i|} \qquad (1)$$

where $|\cdot|$ denotes the cardinality of a set and $N$ the number of audio files of the database.

Recognition rate: The robustness of our method used to compute subfingerprints is measured for each $s_i$ by the ratio between the cardinalities of $SV_i$ and $SP_i$. We thus measure the ratio of intervals whose subfingerprint value remains unchanged by a degradation. The mean value of this ratio over the whole database defines the recognition rate and is formally defined by:

$$RR = \frac{1}{N} \sum_{i=1}^{N} \frac{|SV_i|}{|SP_i|} \qquad (2)$$

Total recognition rate: The recognition rate defined above measures the robustness of our subfingerprints independently of the segmentation step. A global measure of both steps may be achieved by computing for each signal $s_i$ the ratio between $|SV_i|$ and $|T_i|$. This measure may be understood as the product, for each $s_i$, of $\frac{|SP_i|}{|T_i|}$ and $\frac{|SV_i|}{|SP_i|}$, respectively used to define the segmentation and the recognition rates. The mean value of this ratio over the database defines the total recognition rate, formally defined as:

$$TRR = \frac{1}{N} \sum_{i=1}^{N} \frac{|SV_i|}{|T_i|} \qquad (3)$$

We also measured the performance of the algorithm of Kalker and Haitsma [5].
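
As an illustration of equations (1) to (3), the three rates can be computed as in the Python sketch below. The data layout is an assumption made for this example: each interval is represented by a pair (centre position in ms, subfingerprint value), an original interval is matched to the first degraded interval found within the 0.25 ms tolerance after compensating the shift, and every file is assumed to have at least one interval and one positional match so that no ratio has a zero denominator. The names `match_counts` and `rates` are illustrative.

```python
def match_counts(original, degraded, shift_ms=0.0, tol_ms=0.25):
    """Return (|T_i|, |SP_i|, |SV_i|) for one file.

    original, degraded: lists of (centre_ms, subfingerprint) pairs (assumed layout).
    shift_ms: known shift of the degraded version, removed before comparing.
    """
    same_position = same_value = 0
    for centre, value in original:
        for d_centre, d_value in degraded:
            if abs((d_centre - shift_ms) - centre) < tol_ms:
                same_position += 1
                if d_value == value:
                    same_value += 1
                break  # count each original interval at most once
    return len(original), same_position, same_value


def rates(per_file_counts):
    """Return (SR, RR, TRR) from a list of (|T_i|, |SP_i|, |SV_i|) tuples."""
    n = len(per_file_counts)
    sr = sum(sp / t for t, sp, sv in per_file_counts) / n
    rr = sum(sv / sp for t, sp, sv in per_file_counts) / n
    trr = sum(sv / t for t, sp, sv in per_file_counts) / n
    return sr, rr, trr
```

For two hypothetical files with counts (100, 80, 60) and (200, 100, 90), `rates` returns approximately SR = 0.65, RR = 0.825 and TRR = 0.525. These numbers are made up for the example and are not results of the paper.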
The segmentation rate and the recognition rate are meaningless for this method since it does not perform a segmentation step. We thus only use the total recognition rate to measure the performances of this algorithm. Further experiments using the bit error rate between files may be found in [?].

Note that the quantities defined in this section measure the robustness of our fingerprints against degradation. These quantities do not readily measure the efficiency of an indexing scheme, which does not constitute the core of this paper (Section 5). Indeed, the sets $SP_i$ and $SV_i$ are usually not known when a request on the fingerprint database is performed using the fingerprint of an unknown signal.

4.3 Influence of compression

All the audio files of our database are encoded at 705 Kbps. The compression rate may thus be measured using