A Low Latency Implementation of a Non Uniform Partitioned Overlap and Save Algorithm for Real Time Applications

A low latency implementation of a non-uniform partitioned convolution algorithm for room acoustic simulation

Andrea Primavera · Stefania Cecchi · Laura Romoli · Paolo Peretti · Francesco Piazza

Signal, Image and Video Processing, SIViP (2014) 8:985–994
DOI 10.1007/s11760-012-0387-0
ORIGINAL PAPER

Received: 21 December 2011 / Revised: 24 September 2012 / Accepted: 25 September 2012 / Published online: 11 October 2012
© Springer-Verlag London 2012

Abstract  Finite impulse response convolution is one of the most widely used operations in the digital signal processing field for filtering operations. Techniques with low computational demands are essential for calculating convolutions with low input/output latency in real scenarios, considering that the real-time requirements are strictly related to the impulse response length. In this context, a complete overview of the state of the art relative to the algorithms for fast computation of convolution is described here.
Then, a novel perceptual approach for reducing the computational cost of fast convolution algorithms is presented. It is based on the pre-processing of a selected impulse response, and it further reduces the number of complex multiplications by considering the energy decay relief and the absolute threshold of hearing as psychoacoustic constraints. Several results are reported in terms of computational cost and perceived audio quality in order to prove the effectiveness of the proposed approach, including comparisons with the existing state-of-the-art techniques.

Keywords  Absolute threshold of hearing · Energy decay relief · Fast convolution · Non-uniform partitioning · Room acoustic

A. Primavera · S. Cecchi (✉) · L. Romoli · P. Peretti · F. Piazza
Department of Information Engineering, Università Politecnica delle Marche, Via Brecce Bianche, 60131 Ancona, Italy
e-mail: s.cecchi@univpm.it
A. Primavera, e-mail: a.primavera@univpm.it
L. Romoli, e-mail: l.romoli@univpm.it
P. Peretti, e-mail: p.peretti@univpm.it
F. Piazza, e-mail:

1 Introduction

Finite impulse response (FIR) convolution is one of the most widely used operations in the digital signal processing field. It has an important role in many different real-time audio applications (e.g., loudspeaker equalization, active noise control, digital audio effects) where low computational cost is required in order to respect the real-time constraint.
In this context, digital artificial reverberation is the application that most clearly exposes the limits of real-time FIR filtering. Convolution with long impulse responses (IRs) is performed in order to simulate large environments, and low input/output (I/O) latencies are usually required.

Several techniques based on the discrete Fourier transform (DFT) have been proposed in the literature with the aim of calculating convolution with low computational requirements [4,9,12,15]. Moreover, several efforts have been made in order to improve the performance of these fast convolution algorithms, taking into consideration the hardware architecture of the final system [1,2,5,17] and the optimization of some parameters [3,10,18]. Considering all these aspects, a low latency implementation of a non-uniform partitioned overlap and save algorithm for real-time applications has been proposed by us in [14]: it is focused on an optimal partitioning of the IR, introducing a psychoacoustic optimization and an optimized procedure for a multi-threaded implementation.

In this context, starting from the previous approach [14], a novel algorithm, based on a perceptually motivated pre-processing of the IR, is introduced in order to further reduce the computational cost required to perform the fast convolution operation.

The paper is organized as follows. Section 2 describes the state of the art relative to the techniques introduced for convolution. Then, the proposed approach is presented in Sect. 3, considering first the adopted fast convolution algorithm (Sect. 3.1) and then the psychoacoustic pre-processing applied to the selected IR (Sect. 3.2). The effectiveness of the approach in terms of required workload and perceived audio quality is discussed in Sect. 4. Finally, conclusions and future work are drawn in Sect. 5.

2 State of the art

Convolution is a mathematical operation widely used in the field of digital audio signal processing, especially for filtering operations, even in real-time scenarios. Therefore, techniques for calculating convolution with low computational requirements have been deeply investigated. In the literature, two main classes can be identified, that is, time domain and FFT-based fast convolution techniques. In the following, a brief description of these methods is reported, showing the computational cost required for each sample, that is, the number of arithmetic operations normalized by the number of processed samples.

Regarding the time domain approach, the discrete-time convolution is described by the following equation:

$$y[n] = x[n] * h[n] = \sum_{m=0}^{N-1} x[n-m]\, h[m], \qquad (1)$$

where $N$ is the length of the impulse response $h$. Given an input signal $x$ of length $Z$ samples, the output signal has a length of $N + Z - 1$. The computational cost (i.e., the number of arithmetic operations) required for each sample is equal to $(N-1)$ additions and $N$ multiplications. Therefore, from a practical point of view, the time domain approach becomes too expensive and is not feasible when the length of the IR increases (high values of $N$) because of the high number of multiplications and additions to be performed.

To cope with this problem, different approaches have been presented in the literature over the years, based on fast frequency domain transforms and the computation of the discrete convolution within the transform domain [18]. In 1966, Stockham presented the first fast convolution algorithm: it is based on the assumption that a linear convolution can be implemented efficiently as a simple multiplication of discrete Fourier spectra in the DFT domain [15].
More specifically, taking into consideration the circular convolution

$$y[n] = x[n] \circledast_N h[n] = \sum_{m=0}^{N-1} x[(n-m)_N]\, h[m] \qquad (2)$$

where $(\cdot)_N$ represents the modulo $N$ operation, and the DFT property

$$x[n] \circledast_N h[n] \leftrightarrow X[k]\, H[k], \qquad (3)$$

where $X[k]$ and $H[k]$ are the DFT transforms of the sequences $x[n]$ and $h[n]$, it follows that the convolution can be computed as a complex multiplication of the discrete Fourier spectra. An inverse discrete Fourier transform (IDFT) has to be applied in order to obtain the sequence $y[n]$ in the time domain. In particular, the output of the circular convolution is equal to that of the linear convolution when a proper value $L \ge Z + N - 1$ is chosen for the number of points used for the DFT/IDFT computation. Considering that an $L$-point radix-2 FFT can be computed with $kL \log L$ arithmetic operations, where $k$ is a scaling factor that depends on the FFT algorithm used, and that each complex-valued multiplication requires six arithmetic operations, the computational cost is equal to

$$\frac{2kL \log L + 6(L/2 + 1)}{Z + N - 1},$$

exploiting the symmetry property of the DFT for real-valued input data and assuming that the DFT bins of the IR can be calculated in advance. Thus, the frequency domain approach is more convenient than the time domain approach due to the possibility of exploiting the FFT efficiency. Unfortunately, this approach is not feasible in many real-time applications. In fact, the filtering can be performed only when all the data are available, but the input signal typically has an indefinite length: this leads to an impractical computation of the DFT and to a high delay in processing [12].

In order to cope with this unreasonable delay, two different techniques, that is, overlap and save (OLS) and overlap and add (OLA), were proposed in the literature for block convolution, based on processing the input signal segmented into sections of length $K$ [12].
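
The equivalence between circular and linear convolution for $L \ge Z + N - 1$ can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes NumPy, a power-of-two FFT size, and a hypothetical helper name `fft_convolve`.

```python
import numpy as np

def fft_convolve(x, h):
    """Linear convolution through one zero-padded FFT (illustrative sketch)."""
    Z, N = len(x), len(h)
    # Assumption: pick the smallest power of two L with L >= Z + N - 1,
    # so that the circular convolution equals the linear one.
    L = 1 << (Z + N - 2).bit_length()
    X = np.fft.rfft(x, L)   # real-input FFT: only L/2 + 1 bins are computed
    H = np.fft.rfft(h, L)   # in practice this spectrum can be precomputed
    y = np.fft.irfft(X * H, L)
    return y[:Z + N - 1]    # discard the zero-padding tail
```

The whole input must be available before the transform can be taken, which is why this form is unsuitable for streaming input.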
The overlap and save method [12] performs the computation of the output signal through $L$-point DFT/IDFT, where $L \ge K + N - 1$. Considering $h$, an IR of length $N$, and $x$, the input sequence of length $Z$ ($Z \gg N$), it is possible to represent $x$ as segments of length $K + N - 1$ with $N - 1$ overlapped samples between adjacent segments. After computing the circular convolution of each section with $h$, the first $N - 1$ points must be discarded and the remaining block is used to obtain $y[n]$. Figure 1 shows the block diagram of the overlap and save algorithm. The efficiency of OLS can be maximized by setting the input buffer length equal to the IR length (50% overlap). Typically, $L = 2K$ equal to a power of two is considered [1], although this may no longer be the best choice, as discussed in [18]. The workload required by this algorithm is reported in Table 1.

Block convolution can be performed through an alternative method, that is, the overlap and add method. It performs the computation of the output signal by overlapping and adding the filtered sections [12]. Considering an IR $h$ of length $N$ samples and an input sequence $x$ of length $Z$ samples ($Z \gg N$), it is possible to represent $x$ as the sum of shifted segments of length $K$ samples. Taking into account that each shifted segment has only $K$ nonzero points and $h$ has length $N$ samples, the linear convolution has length $K + N - 1$ and can be computed through an $L$-point DFT (with $L \ge K + N - 1$). All these filtered sections overlap by $N - 1$ samples and must be added to obtain $y[n]$. The workload required by this algorithm is analogous to that of the overlap and save algorithm, except for the additional computation needed to add the $N - 1$ overlapping samples of the filtered sections. Typically, $L$ is a power of two for exploiting FFT efficiency [1,18].
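
The overlap and save procedure described above can be sketched as follows. This is an illustrative version under stated assumptions (NumPy, a power-of-two FFT size, offline processing of a finite input); the function name `overlap_save` and the zero-padding of the input tail are choices made here, not details taken from the paper.

```python
import numpy as np

def overlap_save(x, h, K):
    """Overlap-and-save block convolution with block length K (sketch)."""
    N, Z = len(h), len(x)
    # Assumption: smallest power of two L with L >= K + N - 1.
    L = 1 << (K + N - 2).bit_length()
    H = np.fft.rfft(h, L)                       # IR spectrum, computed once
    out_len = Z + N - 1
    n_blocks = -(-out_len // K)                 # ceiling division
    # N - 1 leading zeros act as the history of the first segment;
    # trailing zeros let the last blocks flush the filter tail.
    padded = np.concatenate([np.zeros(N - 1), x, np.zeros(n_blocks * K - Z)])
    y = np.empty(n_blocks * K)
    for b in range(n_blocks):
        seg = padded[b * K : b * K + K + N - 1]  # overlaps previous by N - 1
        circ = np.fft.irfft(np.fft.rfft(seg, L) * H, L)
        y[b * K : (b + 1) * K] = circ[N - 1 : N - 1 + K]  # drop first N - 1
    return y[:out_len]
```

Only the first $N - 1$ output points of each segment are affected by circular wrap-around, so the retained $K$ points match the linear convolution.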

However, these approaches are not efficient for obtaining low I/O latency convolutions in the case of long IRs, since a high overlap factor is required. Thus, methods based on the partitioning of the IR have to be introduced in order to provide both efficiency and low latency. In particular, the partitioned convolution applied to OLS and OLA algorithms leads to partitioned OLS (POLS) and partitioned OLA (POLA) algorithms. In the following, for the sake of brevity, only the POLS algorithm will be discussed, but similar considerations apply to the POLA algorithm. In [9], Kulp discussed the POLS technique: first, the FIR filter is partitioned into $P$ sub-filters $h_p$ of equal size $K$ with $p = 0, 1, \ldots, P-1$; then, each sub-filter is treated as a separate IR and an OLS is performed for each of these $P$ partitions; finally, the results obtained from the frequency multiplications with each of these sub-filters are added together in order to get the desired convolution. In this way, it is possible to choose the FFT length and thereby to set the I/O latency. From a theoretical point of view, the computational cost required to perform the POLS is higher than in the OLS due to the extra operations required for partitioning (Table 1), but there are many non-obvious advantages of partitioned convolution, for example, the fact that the computational load is moved from FFT computations to multiplications and additions [1,18]. The block diagram of this technique is reported in Fig. 2.

Fig. 1  Block diagram of the overlap and save algorithm assuming $L = N + K - 1$ as reported in [4]

In 1995, Gardner proposed an innovative approach, called non-uniform partitioned overlap and save, with the aim of solving the main drawbacks introduced by the uniform partitioning [4]. Indeed, this approach was conceived for implementing real-time convolution without I/O latency and with low computational requirements (Fig. 3).
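
The uniform POLS scheme described above can be illustrated with a short sketch. It assumes NumPy and $L = 2K$ (50% overlap), and it stores the spectra of past input frames in a delay line so that only one forward and one inverse FFT are needed per block; this arrangement and the name `partitioned_ols` are assumptions made for illustration, not details confirmed by the paper.

```python
import numpy as np

def partitioned_ols(x, h, K):
    """Uniform partitioned overlap-and-save with FFT size L = 2K (sketch)."""
    L = 2 * K
    P = -(-len(h) // K)                            # number of sub-filters
    h_pad = np.concatenate([h, np.zeros(P * K - len(h))])
    # Spectrum of each sub-filter h_p, zero-padded from K to L points.
    H = np.fft.rfft(h_pad.reshape(P, K), L, axis=1)
    out_len = len(x) + len(h) - 1
    n_blocks = -(-out_len // K)
    x_pad = np.concatenate([x, np.zeros(n_blocks * K - len(x))])
    fdl = np.zeros((P, K + 1), dtype=complex)      # past input spectra
    prev = np.zeros(K)
    y = np.empty(n_blocks * K)
    for b in range(n_blocks):
        cur = x_pad[b * K : (b + 1) * K]
        fdl = np.roll(fdl, 1, axis=0)              # age every stored spectrum
        fdl[0] = np.fft.rfft(np.concatenate([prev, cur]))
        acc = (fdl * H).sum(axis=0)                # P products, P - 1 sums
        y[b * K : (b + 1) * K] = np.fft.irfft(acc, L)[K:]  # keep last K
        prev = cur
    return y[:out_len]
```

The I/O latency is set by the block length `K` rather than by the IR length, at the price of `P` spectral multiplications per block.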
In more detail, the FIR filter is partitioned into $S$ different sub-filters $g_n$ of increasing length. According to the length of each sub-filter, the convolution is computed through either OLS or POLS algorithms: the higher the framesize dimension (i.e., the audio streaming block length), the lower the computational cost. The length $M_{n-1}$ of each sub-filter $g_n$ depends on the framesize dimension $K_n$ used for the computation of the convolution relative to the next section. In formulae:

$$\begin{cases} M_n \ge K_n & n = 0 \\ \sum_{i=0}^{n-1} M_i \ge K_n & n = 1, \ldots, S-2 \\ M_n = N - \sum_{i=0}^{S-2} M_i & n = S-1 \end{cases} \qquad (4)$$

where $N$ is the IR length and $S$ is the number of sub-filters. The value of $K_0$ is properly set according to the I/O constraints.

Starting from the aforementioned techniques, during the last decade, several efforts have been made in order to further improve the performance of the fast convolution algorithms, considering ad hoc solutions for the specified hardware architecture and the optimization of some parameters.

Table 1  Computational cost required for each sample in terms of real-valued arithmetic operations [18]

| Algorithm | Arithmetic operations |
| Time domain | $N + N - 1$ |
| FFT-based fast convolution: unpartitioned block convolution (OLS) | $\dfrac{2kL \log L + 6(L/2 + 1)}{K}$ |
| FFT-based fast convolution: partitioned block convolution (POLS) | $\dfrac{2kL \log L + 6P(L/2 + 1) + 2(P-1)(L/2 + 1)}{K}$ |

Fig. 2  Block diagram of the uniform partitioned overlap and save algorithm assuming $L = 2K$ (50% overlap)

Fig. 3  Block diagram of the non-uniform partitioned overlap and save algorithm

Taking into consideration the optimization for a particular architecture, a GPU-based implementation has been described in [17], while a DSP-based implementation of uniform partitioned overlap and save techniques was proposed in [1].
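
The per-sample costs of Table 1 can be compared numerically. The sketch below only evaluates the three expressions; the logarithm base (2) and the FFT scaling factor `k = 1.0` are assumptions, since the text leaves both unspecified, and the function names and the example IR length are hypothetical.

```python
import math

def cost_time_domain(N):
    """N multiplications plus N - 1 additions per output sample."""
    return N + (N - 1)

def cost_ols(K, L, k=1.0):
    """Unpartitioned OLS: two FFTs and L/2 + 1 complex products per block."""
    return (2 * k * L * math.log2(L) + 6 * (L / 2 + 1)) / K

def cost_pols(K, L, P, k=1.0):
    """Uniform POLS: P complex products and P - 1 complex sums per bin."""
    bins = L / 2 + 1
    return (2 * k * L * math.log2(L) + 6 * P * bins + 2 * (P - 1) * bins) / K

# Example setting (assumed): IR of N = 65536 samples.
N = 65536
ols = cost_ols(K=N, L=2 * N)                  # block as long as the IR
pols = cost_pols(K=128, L=256, P=N // 128)    # 128-sample blocks
```

With these assumed values, the unpartitioned OLS is the cheapest per sample but needs a block as long as the IR, whereas POLS costs more per sample yet works with 128-sample blocks; both remain far below the time domain cost.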
For conventional operating systems, time distributed and multi-threaded techniques for implementing real-time partitioned convolution algorithms are described in [2]. In contrast, a new approach that performs a time distributed FFT for efficient low latency convolutions, useful for non-multi-threaded contexts, was presented in [5]. Regarding the optimization of parameters, a new technique able to find an optimal filter partition for efficient long convolution with low I/O delay was proposed in [3]. For a specified I/O delay and filter length, the algorithm finds the non-uniform filter partition that minimizes the computational cost of the convolution by exploiting the Viterbi algorithm. Following the same idea, another approach capable of determining the best FFT size in the case of uniform partitioned convolution was presented in [18]. Further improvements based on the human ear sensitivity have been introduced in order to reduce the computational cost required to perform the convolution operation. In particular, since real IRs decay faster at high frequencies than at low frequencies, the number of complex multiplications to be performed can be lowered by taking into consideration only the spectral components with significant energy content, as suggested in [10].

Considering all these aspects (i.e., parameter choice and hardware-based optimization) and starting from previous results [14], a low latency implementation of a non-uniform partitioned overlap and save algorithm for real-time applications is proposed here. In the following, a detailed description will be presented, focusing on an optimal partitioning of the IR, a multi-threaded implementation, and a novel psychoacoustic optimization based on the pre-processing of the IR.

3 Proposed algorithm

The proposed approach aims to compute the convolution with low I/O latency and low computational cost. It is based on the non-uniform partitioning of the impulse response, considering an optimized implementation and psychoacoustic criteria.
In the first case, an optimal partitioning and a multi-threaded implementation have been developed as discussed in [14]. Regarding the psychoacoustic criteria, novel contributions are introduced in this paper focusing on the human ear sensitivity: more specifically, a psychoacoustic-based pre-processing of the IR is presented in order to lower the computational requirement. In the following, these two aspects will be pointed out.

3.1 Fast convolution algorithm

The approach proposed in [14] for fast convolution is based on the non-uniform partitioning of the IR into $S$ sections $g_n$ of different lengths $M_n$: a uniform POLS is then applied on each section (Fig. 4). Low I/O latency is ensured by the first POLS, characterized by a small block size $K_1$ (e.g., $K_1 = 64$ or $K_1 = 128$ samples), while a larger block size $K_n$ can be used for the other POLSs. Some expedients were discussed in [14] for improving the performance.

First, considering that the NUPOLS is composed of different POLSs, an automatic parallelization of the operations can be obtained using a multi-threaded implementation, that is,