Description

A Low Latency Implementation of a Non Uniform Partitioned Overlap and Save Algorithm for Real Time Applications

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/236839141
A Low Latency Implementation of a Non-Uniform Partitioned Convolution Algorithm for Room Acoustic Simulation
Article in Signal Image and Video Processing · July 2012
DOI: 10.1007/s11760-012-0387-0
SIViP (2014) 8:985–994
DOI 10.1007/s11760-012-0387-0
ORIGINAL PAPER
A low latency implementation of a non-uniform partitioned convolution algorithm for room acoustic simulation
Andrea Primavera · Stefania Cecchi · Laura Romoli · Paolo Peretti · Francesco Piazza
Received: 21 December 2011 / Revised: 24 September 2012 / Accepted: 25 September 2012 / Published online: 11 October 2012
© Springer-Verlag London 2012
Abstract
Finite impulse response convolution is one of the most widely used operations in the digital signal processing field for filtering operations. Computationally inexpensive techniques are essential for calculating convolutions with low input/output latency in real scenarios, considering that the real-time requirements are strictly related to the impulse response length. In this context, a complete overview of the state of the art of algorithms for the fast computation of convolution is given. Then, a novel perceptual approach for reducing the computational cost of fast convolution algorithms is presented. It is based on the pre-processing of a selected impulse response, and it further reduces the number of complex multiplications by taking the energy decay relief and the absolute threshold of hearing into account as psychoacoustic constraints. Several results are reported in terms of computational cost and perceived audio quality in order to prove the effectiveness of the proposed approach, including comparisons with existing state-of-the-art techniques.
Keywords
Absolute threshold of hearing · Energy decay relief · Fast convolution · Non-uniform partitioning · Room acoustic
A. Primavera · S. Cecchi (✉) · L. Romoli · P. Peretti · F. Piazza
Department of Information Engineering, Università Politecnica delle Marche, Via Brecce Bianche, 60131 Ancona, Italy
e-mail: s.cecchi@univpm.it
A. Primavera, e-mail: a.primavera@univpm.it
L. Romoli, e-mail: l.romoli@univpm.it
P. Peretti, e-mail: p.peretti@univpm.it
F. Piazza, e-mail: f.piazza@univpm.it
1 Introduction
Finite impulse response (FIR) convolution is one of the most widely used operations in the digital signal processing field. It has an important role in many different real-time audio applications (e.g., loudspeaker equalization, active noise control, digital audio effects) where low computational cost is required in order to respect the real-time constraint. In this context, digital artificial reverberation is the application that most clearly exposes the limits of real-time FIR filtering. Convolution with long impulse responses (IRs) can be performed in order to simulate large environments, and thus, low input/output (I/O) latencies are usually required.

Several techniques have been proposed in the literature with the aim of calculating convolution with low computational requirements based on the discrete Fourier transform (DFT) [4,9,12,15]. Moreover, several efforts have been made in order to improve the performance of these fast convolution algorithms, taking into consideration the hardware architecture of the final system [1,2,5,17] and the optimization of some parameters [3,10,18]. Considering all these aspects, a low latency implementation of a non-uniform partitioned overlap and save algorithm for real-time applications has been proposed by us in [14]: it is focused on an optimal partitioning of the IR, introducing a psychoacoustic optimization and an optimized procedure for a multi-threaded implementation.

In this context, starting from the previous approach [14], a novel algorithm, based on a perceptually motivated pre-processing of the IR, is introduced in order to further reduce the computational cost required to perform the fast convolution operation.

The paper is organized as follows. Section 2 describes the state of the art of the techniques introduced for convolution. Then, the proposed approach is presented in Sect. 3, considering first the adopted fast convolution algorithm (Sect. 3.1) and then the psychoacoustic pre-processing applied to the selected IR (Sect. 3.2). The effectiveness of the approach in terms of required workload and perceived audio quality is discussed in Sect. 4. Finally, conclusions and future work are drawn in Sect. 5.
2 State of the art
Convolution is a mathematical operation widely used in the field of digital audio signal processing, especially for filtering operations, even in real-time scenarios. Therefore, techniques for calculating convolution with low computational requirements have been deeply investigated. In the literature, two main classes can be identified, that is, time domain and FFT-based fast convolution techniques. In the following, a brief description of these methods is reported, showing the computational cost required for each sample, that is, the number of arithmetic operations normalized by the number of processed samples.

Regarding the time domain approach, the discrete-time convolution is described by the following equation:

\[ y[n] = x[n] * h[n] = \sum_{m=0}^{N-1} x[n-m]\, h[m], \tag{1} \]

where $N$ is the length of the impulse response $h$. Given an input signal $x$ of length $Z$ samples, the output signal has a length of $N + Z - 1$. The computational cost (i.e., the number of arithmetic operations) required for each sample is equal to $(N - 1)$ additions and $N$ multiplications. Therefore, from a practical point of view, the time domain approach becomes too expensive and is not feasible when the length of the IR increases (high values of $N$), because of the high number of multiplications and additions to be performed.

To cope with this problem, over the years different approaches have been presented in the literature based on fast frequency domain transforms and the computation of the discrete convolution within the transform domain [18].
In 1966, Stockham presented the first fast convolution algorithm: it is based on the assumption that a linear convolution can be implemented efficiently as a simple multiplication of discrete Fourier spectra in the DFT domain [15]. More specifically, taking into consideration the circular convolution

\[ y[n] = x[n] \circledast_N h[n] = \sum_{m=0}^{N-1} x[(n-m)_N]\, h[m], \tag{2} \]

where $(\cdot)_N$ represents the modulo $N$ operation, and the DFT property

\[ x[n] \circledast_N h[n] \leftrightarrow X[k]\, H[k], \tag{3} \]

where $X[k]$ and $H[k]$ are the DFTs of the sequences $x[n]$ and $h[n]$, it results that the convolution can be computed as a complex multiplication of the discrete Fourier spectra. An inverse discrete Fourier transform (IDFT) has to be applied in order to obtain the sequence $y[n]$ in the time domain. In particular, the output of the circular convolution is equal to that of the linear convolution when a proper value $L \geq Z + N - 1$ is chosen for the number of points used for the DFT/IDFT computation. Considering that an $L$-point radix-2 FFT can be computed with $kL \log L$ arithmetic operations, where $k$ is a scaling factor that depends on the FFT algorithm used, and that each complex-valued multiplication requires six arithmetic operations, the computational cost is equal to $\left(2kL \log L + 6(L/2 + 1)\right)/(Z + N - 1)$, exploiting the symmetry property of the DFT for real-valued input data and assuming that the DFT bins of the IR can be calculated in advance. Thus, the frequency domain approach is more convenient than the time domain approach because it can exploit the efficiency of the FFT. Unfortunately, this approach is not feasible in many real-time applications. In fact, the filtering can be performed only when all the data are available, but the input signal typically has an indefinite length: this leads to an impractical computation of the DFT and to a high delay in processing [12].
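As an illustration of Eqs. (1) to (3), the following minimal sketch compares a direct time domain convolution with the FFT-based computation. It is not taken from the paper: the use of Python with NumPy, the function names `direct_convolution` and `fft_convolution`, and the choice of the next power of two for $L$ are assumptions made only for this example.

```python
import numpy as np

def direct_convolution(x, h):
    """Time domain convolution, Eq. (1): y[n] = sum_m x[n-m] h[m]."""
    Z, N = len(x), len(h)
    y = np.zeros(Z + N - 1)
    for m in range(N):
        # Tap h[m] contributes a copy of x delayed by m samples.
        y[m:m + Z] += h[m] * x
    return y

def fft_convolution(x, h):
    """Linear convolution as a product of spectra, Eqs. (2) and (3)."""
    Z, N = len(x), len(h)
    # Choose L >= Z + N - 1 so that the circular convolution equals the
    # linear one; a power of two is an assumption of this sketch.
    L = 1
    while L < Z + N - 1:
        L *= 2
    X = np.fft.rfft(x, L)   # real-input FFT exploits the DFT symmetry
    H = np.fft.rfft(h, L)   # could be computed in advance for a fixed IR
    y = np.fft.irfft(X * H, L)
    return y[:Z + N - 1]

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)   # input signal, Z = 1000
h = rng.standard_normal(64)     # impulse response, N = 64
y_direct = direct_convolution(x, h)
y_fast = fft_convolution(x, h)
max_error = np.max(np.abs(y_fast - y_direct))
```

Note that `fft_convolution` needs the whole input before it can start, which is exactly the latency problem described above.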
In order to cope with this unreasonable delay, two different techniques, that is, overlap and save (OLS) and overlap and add (OLA), were proposed in the literature for block convolution based on processing the input signal segmented into sections of length $K$ [12]. The overlap and save method [12] performs the computation of the output signal through $L$-point DFT/IDFT, where $L \geq K + N - 1$. Considering $h$, an IR of length $N$, and $x$, the input sequence of length $Z$ ($Z \gg N$), it is possible to represent $x$ as segments of length $K + N - 1$ with $N - 1$ overlapped samples between adjacent segments. After computing the circular convolution of each section with $h$, the first $N - 1$ points must be discarded and the remaining block is used to obtain $y[n]$. Figure 1 shows the block diagram of the overlap and save algorithm. The efficiency of OLS can be maximized by setting the input buffer length equal to the IR length (50% overlap). Typically, $L = 2K$ equal to a power of two is considered [1], even if nowadays this may no longer be the best solution, as discussed in [18]. The workload required by this algorithm is reported in Table 1.

Block convolution can be performed through an alternative method, that is, the overlap and add method. It performs the computation of the output signal by overlapping and adding the filtered sections [12]. Considering an IR $h$ of length $N$ samples and an input sequence $x$ of length $Z$ samples ($Z \gg N$), it is possible to represent $x$ as the sum of shifted segments of length $K$ samples. Taking into account that each shifted segment has only $K$ nonzero points and $h$ has length $N$ samples, the linear convolution has length $K + N - 1$ and can be computed through $L$-point DFTs (with $L \geq K + N - 1$). All these filtered sections overlap by $N - 1$ samples and must be added to obtain $y[n]$. Regarding the workload required by this algorithm, it is analogous to that of the overlap and save algorithm, except for the further additions needed for the $N - 1$ overlapping samples of the filtered sections. Typically, $L$ is a power of two for exploiting FFT efficiency [1,18].
However, these approaches are not efficient for obtaining low I/O latency convolutions in the case of long IRs, since a high overlap factor is required. Thus, methods based on the partitioning of the IR have to be introduced in order to provide both efficiency and low latency. In particular, the partitioned convolution applied to OLS and OLA algorithms leads to the partitioned OLS (POLS) and partitioned OLA (POLA) algorithms. In the following, for the sake of brevity, only the POLS algorithm will be discussed, but similar considerations can be made for the POLA algorithm. In [9], Kulp discussed the POLS technique: firstly, the FIR filter is partitioned into $P$ sub-filters $h_p$ of equal size $K$, with $p = 0, 1, \ldots, P - 1$; then, each sub-filter is treated as a separate IR and an OLS is performed for each of these $P$ partitions; finally, the results obtained from the frequency multiplications with each of these sub-filters are added together in order to get the desired convolution. In this way, it is possible to choose the FFT length and thereby to set the I/O latency. Obviously, from a theoretical point of view, the computational cost required to perform the POLS is higher than that of the OLS due to the extra operations required for partitioning (Table 1), but there are many non-obvious advantages of partitioned convolution, for example, the fact that the computational load is moved from FFT computations to multiplications and additions [1,18]. The block diagram of this technique is reported in Fig. 2.

Fig. 1 Block diagram of the overlap and save algorithm assuming $L = N + K - 1$, as reported in [4]

In 1995, Gardner proposed an innovative approach, called non-uniform partitioned overlap and save, with the aim of solving the main drawbacks introduced by the uniform partitioning [4]. Indeed, this approach was conceived for implementing real-time convolution without I/O latency and with low computational requirements (Fig. 3). In more detail, the FIR filter is partitioned into $S$ different sub-filters $g_n$ of increasing length. According to the length of each sub-filter, the convolution is computed through either OLS or POLS algorithms: the higher the frame size (i.e., the audio streaming block length), the lower the computational cost. The length $M_{n-1}$ of each sub-filter $g_n$ depends on the frame size $K_n$ used for the computation of the convolution relative to the next section; in formulae:

\[ \begin{cases} M_n \geq K_n, & n = 0 \\ \sum_{i=0}^{n-1} M_i \geq K_n, & n = 1, \ldots, S - 2 \\ M_n = N - \sum_{i=0}^{S-2} M_i, & n = S - 1 \end{cases} \tag{4} \]

where $N$ is the IR length and $S$ is the number of sub-filters. The value of $K_0$ is properly set according to the I/O constraints.

Starting from the aforementioned techniques, during the last decade, several efforts have been made in order to further improve the performance of fast convolution algorithms, considering ad hoc solutions for the specified hardware architecture and the optimization of some parameters.
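The uniform partitioned scheme (POLS) described above can be sketched as follows, assuming $L = 2K$. The code is only an illustration: Python with NumPy, the function name `partitioned_overlap_save`, and the buffer of past input spectra (called `fdl` here, for frequency-domain delay line) are assumptions of this example and are not taken from the paper or from [9].

```python
import numpy as np

def partitioned_overlap_save(x, h, K):
    """Uniform partitioned overlap and save with block size K and L = 2K."""
    N = len(h)
    P = -(-N // K)                        # number of sub-filters (ceiling)
    L = 2 * K
    h_padded = np.concatenate([h, np.zeros(P * K - N)])
    # Spectrum of each sub-filter h_p, zero-padded to L points.
    H = np.array([np.fft.rfft(h_padded[p * K:(p + 1) * K], L)
                  for p in range(P)])
    n_out = len(x) + N - 1
    n_blocks = -(-n_out // K)
    x_padded = np.concatenate([x, np.zeros(n_blocks * K - len(x))])
    fdl = np.zeros((P, K + 1), dtype=complex)   # past input spectra
    previous = np.zeros(K)
    y = np.empty(n_blocks * K)
    for b in range(n_blocks):
        current = x_padded[b * K:(b + 1) * K]
        # One forward FFT per block, over the last 2K input samples.
        spectrum = np.fft.rfft(np.concatenate([previous, current]))
        fdl = np.roll(fdl, 1, axis=0)     # row p holds the block from p steps ago
        fdl[0] = spectrum
        # Multiply each delayed spectrum by its sub-filter and accumulate.
        accumulated = np.sum(fdl * H, axis=0)
        # One inverse FFT per block; the first K points are discarded.
        y[b * K:(b + 1) * K] = np.fft.irfft(accumulated, L)[K:]
        previous = current
    return y[:n_out]

rng = np.random.default_rng(2)
x = rng.standard_normal(300)
h = rng.standard_normal(100)
y_pols = partitioned_overlap_save(x, h, 32)
pols_ok = np.allclose(y_pols, np.convolve(x, h))
```

Only one FFT and one IFFT of length $2K$ are computed per block regardless of $P$, which illustrates how the load shifts from FFT computations to multiplications and additions.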
Table 1 Computational cost required for each sample in terms of real-valued arithmetic operations [18]

Algorithm                                   Arithmetic operations
Time domain                                 $N + N - 1$
FFT-based fast convolution
  Unpartitioned block convolution (OLS)     $\left(2kL \log L + 6(L/2 + 1)\right) / K$
  Partitioned block convolution (POLS)      $\left(2kL \log L + 6P(L/2 + 1) + 2(P - 1)(L/2 + 1)\right) / K$
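The per-sample costs of Table 1 can be evaluated numerically with a short sketch such as the one below. The table does not state the base of the logarithm or a value for the scaling factor $k$, so the base-2 logarithm and the default $k = 1$ are assumptions of this example, as are the function names and the sample values of $N$ and $K$.

```python
import math

def cost_time_domain(N):
    """N multiplications plus N - 1 additions per output sample."""
    return N + (N - 1)

def cost_ols(L, K, k=1.0):
    """Unpartitioned block convolution (OLS) row of Table 1."""
    return (2 * k * L * math.log2(L) + 6 * (L / 2 + 1)) / K

def cost_pols(L, K, P, k=1.0):
    """Partitioned block convolution (POLS) row of Table 1."""
    bins = L / 2 + 1
    return (2 * k * L * math.log2(L) + 6 * P * bins
            + 2 * (P - 1) * bins) / K

# Hypothetical example: an IR of 65536 taps.
N = 65536
K = 128
costs = {
    "time domain": cost_time_domain(N),
    "OLS with K = N, L = 2N": cost_ols(2 * N, N),
    "POLS with K = 128, L = 256": cost_pols(2 * K, K, N // K),
}
```

With $P = 1$ the POLS expression reduces to the OLS one, which is a quick consistency check on the two rows.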
Fig. 2 Block diagram of the uniform partitioned overlap and save algorithm assuming $L = 2K$ (50% overlap)

Fig. 3 Block diagram of the non-uniform partitioned overlap and save algorithm
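For the non-uniform scheme, the constraints of Eq. (4) on the sub-filter lengths $M_n$ and frame sizes $K_n$ can be checked for a candidate partition with a few lines of code. The sketch below follows one reading of Eq. (4); the function name `satisfies_eq4` and the example partition are hypothetical and are not taken from [4].

```python
def satisfies_eq4(M, K, N):
    """Check a candidate partition against the three cases of Eq. (4).

    M: sub-filter lengths M_0 .. M_{S-1}
    K: frame sizes K_0 .. K_{S-1}
    N: total impulse response length
    """
    S = len(M)
    if S == 0 or len(K) != S:
        return False
    # Case n = 0: the first sub-filter is at least one frame long.
    if M[0] < K[0]:
        return False
    # Cases n = 1 .. S-2: the sub-filters before section n must cover
    # at least the frame size K_n used by that section.
    for n in range(1, S - 1):
        if sum(M[:n]) < K[n]:
            return False
    # Case n = S-1: the last sub-filter takes the rest of the IR.
    return M[S - 1] == N - sum(M[:S - 1])

# Hypothetical partition of a 1024-tap IR into S = 4 sub-filters.
example_M = [64, 64, 128, 768]
example_K = [64, 64, 128, 256]
example_ok = satisfies_eq4(example_M, example_K, 1024)
```

A check of this kind only validates a partition; finding the partition with minimum cost is a separate optimization problem.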
Taking into consideration the optimization for a particular architecture, a GPU-based implementation has been described in [17], while a DSP-based implementation of uniform partitioned overlap and save techniques was proposed in [1]. Considering conventional operating systems, time-distributed and multi-threaded techniques for implementing real-time partitioned convolution algorithms are described in [2]. In contrast, a new approach for performing a time-distributed FFT for efficient low latency convolutions, useful for non-multi-threaded contexts, was presented in [5]. Regarding parameter optimization, a new technique able to find an optimal filter partition for efficient long convolution with low I/O delay was proposed in [3]. For a specified I/O delay and filter length, the algorithm finds the non-uniform filter partition that minimizes the computational cost of the convolution by exploiting the Viterbi algorithm. Following the same idea, another approach capable of determining the best FFT size in the case of uniform partitioned convolution was presented in [18]. Further improvements based on the sensitivity of the human ear have been introduced in order to reduce the computational cost required to perform the convolution operation. In particular, since real IRs decay faster at high frequencies than at low frequencies, the number of complex multiplications to be performed can be lowered by taking into consideration only the spectral components with significant energy content, as suggested in [10].

Considering all these aspects (i.e., parameter choice and hardware-based optimization) and starting from previous results [14], a low latency implementation of a non-uniform partitioned overlap and save algorithm for real-time applications is proposed here. In the following, a detailed description will be presented, focusing on an optimal partitioning of the IR, a multi-threaded implementation, and a novel psychoacoustic optimization based on the pre-processing of the IR.
3 Proposed algorithm
The proposed approach aims to compute the convolution with low I/O latency and low computational cost. It is based on the non-uniform partitioning of the impulse response, considering an optimized implementation and psychoacoustic criteria. In the first case, an optimal partitioning and a multi-threaded implementation have been developed as discussed in [14]. Regarding the psychoacoustic criteria, novel contributions are introduced in this paper focusing on the sensitivity of the human ear: more specifically, a psychoacoustic-based pre-processing of the IR is presented in order to lower the computational requirement. In the following, these two aspects will be pointed out.

3.1 Fast convolution algorithm

The approach proposed in [14] for fast convolution is based on the non-uniform partitioning of the IR into $S$ sections $g_n$ of different lengths $M_n$: a uniform POLS is then applied on each section (Fig. 4). Low I/O latency is ensured by the first POLS, characterized by a small block size $K_1$ (e.g., $K_1 = 64$ or $K_1 = 128$ samples), while a larger block size $K_n$ can be used for the other POLSs. Some expedients were discussed in [14] for improving the performance.

First, considering that the NUPOLS is composed of different POLSs, an automatic parallelization of the operations can be obtained using a multi-threaded implementation, that is,
