arXiv:1501.04860v1 [cs.IT] 17 Jan 2015
Reference Matlab code is available at
sites.google.com/site/shahriarshahabuddin/matlab simulator
A Customized Lattice Reduction Multiprocessor for MIMO Detection
Shahriar Shahabuddin, Janne Janhunen, Zaheer Khan, and Markku Juntti
Centre for Wireless Communications, University of Oulu, Finland
Email: firstname.lastname@ee.oulu.fi
Amanullah Ghazi
Department of Computer Science and Engineering, University of Oulu, Finland
Email: firstname.lastname@ee.oulu.fi
Abstract
Lattice reduction (LR) is a preprocessing technique for multiple-input multiple-output (MIMO) symbol detection that improves bit error rate (BER) performance. In this paper, we propose a customized homogeneous multiprocessor for LR. The processor cores are based on the transport triggered architecture (TTA). We propose some modifications of the popular LR algorithm, Lenstra-Lenstra-Lovász (LLL), for high throughput. The TTA cores are programmed in a high-level language. Each TTA core consists of several special function units to accelerate the program code. The multiprocessor takes 187 cycles to reduce a single matrix for LR. The architecture is synthesized on 90 nm technology and takes 405 kgates at 210 MHz.
I. INTRODUCTION
Multiple-input multiple-output (MIMO) is a key technology for using the available radio spectrum efficiently. The basic idea of MIMO is to send multiple independent data streams from multiple antennas in the same frequency band. These independent streams need to be separated at the receiver by a MIMO detector to identify the transmitted symbols. Maximum likelihood (ML) detection is the optimal solution to the MIMO detection problem; it compares the received symbol with every possible symbol in the constellation. However, the ML algorithm is too complex for practical real-time implementations. Linear detection is popular for practical implementations. Linear MIMO detection algorithms are less complex, but suffer from degraded bit error rate (BER) performance.

Lattice reduction (LR) is a preprocessing technique that can be used with linear detection to significantly improve the BER performance and reduce the gap between traditional linear detectors and optimal ML. LR transforms the MIMO channel matrix into a near-orthogonal matrix and thus helps to achieve better BER performance. The most widely used LR algorithm is the Lenstra-Lenstra-Lovász (LLL) algorithm, named after its inventors [1]. The LLL algorithm poses many challenges due to its non-deterministic execution time and high computational complexity. We propose a modified LLL (MLLL) algorithm that is based on the original LLL algorithm in the complex domain. We use a fixed structure for the LLL based on [2]. Instead of the Lovász condition, the less complex Siegel condition is applied [3]. An early termination technique is used as proposed in [4]. We demonstrate by Matlab simulation that the BER performance loss of the hardware-friendly MLLL algorithm is negligible.

There are several hardware accelerators proposed in [4], [5], [6], [7] for different LR algorithms. Fixed hardware implementations provide a high data rate and consume less silicon area than customized application-specific processors (ASIPs). The drawback of a fixed hardware implementation is that it operates only on a fixed set of parameters due to the hardwired control path, and it is not possible to modify the control path in the future. An ASIP customized for a small set of algorithms is an attractive solution in terms of cost, silicon area and throughput. Most importantly, an ASIP reduces the design risk with an instruction memory that can be used to load new programs or control instructions. The control instructions can be easily obtained by a retargetable compiler for that particular customized architecture.

A customized very long instruction word (VLIW) processor is implemented in [8] for LR. We take a different approach and design a customized multiprocessor based on the transport triggered architecture (TTA) paradigm. TTA is a processor design philosophy where the programmer can control the internal data transports between different function units of the processor. TTA exploits instruction level parallelism (ILP) by processing several instructions in a single clock cycle. The TTA-based codesign environment (TCE) tool is used in this work to design the TTA processor cores. TCE enables the designer to write an application in a high-level language and design the target processor in a graphical user interface at the same time. A turbo decoder and a MIMO detector designed using TCE can be found in [9] and [10]. In this work, every core of the proposed multiprocessor is programmed in the C language to shorten the time to market. The multiprocessor takes 187 cycles and achieves a maximum clock frequency of 210 MHz on 90 nm technology. To our knowledge, this is the first TTA-based customized architecture for LR.

II. SYSTEM MODEL
A. Conventional MIMO Detection
Consider a MIMO system with $M_T$ transmit antennas, which send data over the channel, and $N_R$ receive antennas, which receive the transmitted bits from the channel. The modulation scheme used here is quadrature amplitude modulation (QAM), with the assumption $N_R \geq M_T$. The received signal $\mathbf{y}$ can be represented as

$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}, \tag{1}$$

where $\mathbf{y} \in \mathbb{C}^{N_R}$ is the received signal vector, $\mathbf{x} \in \mathbb{C}^{M_T}$ is the transmit symbol vector, $\mathbf{H} \in \mathbb{C}^{N_R \times M_T}$ is the channel matrix and $\mathbf{n} \in \mathbb{C}^{N_R}$ is the circularly symmetric complex white Gaussian noise vector with zero mean and variance $\sigma^2$.

In the receiver, the linear zero forcing (ZF) detector applies the pseudo-inverse of the channel matrix to the received vector to estimate the transmitted symbol vector, which can be expressed as

$$\tilde{\mathbf{x}} = (\mathbf{H}^H\mathbf{H})^{-1}\mathbf{H}^H\mathbf{y} = \mathbf{H}^{\dagger}\mathbf{y}, \tag{2}$$

where $\mathbf{H}$ is the channel matrix and $(\cdot)^{\dagger}$ denotes the pseudo-inverse. Typically, the channel matrix $\mathbf{H}$ is QR decomposed into two parts as $\mathbf{H} = \mathbf{Q}\mathbf{R}$. Here $\mathbf{Q} \in \mathbb{C}^{N_R \times N_R}$ denotes a unitary matrix and $\mathbf{R} \in \mathbb{C}^{N_R \times N_R}$ denotes an upper triangular matrix.
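As a rough illustration of the ZF detector in (2), the following pure-Python sketch applies the pseudo-inverse to a noise-free received vector for a 2x2 link. The channel values, the symbols and the helper names (`matmul`, `hermitian`, `inv2`, `zf_detect`) are assumptions made for this example only and do not come from the reference Matlab code.

```python
# Minimal zero-forcing (ZF) sketch for a 2x2 MIMO link, pure Python.
# Channel values and symbols are illustrative assumptions.

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def hermitian(A):
    """Conjugate transpose of a matrix."""
    return [[A[r][c].conjugate() for r in range(len(A))]
            for c in range(len(A[0]))]

def inv2(A):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def zf_detect(H, y):
    """Return (H^H H)^{-1} H^H y as a flat list (two transmit streams)."""
    Hh = hermitian(H)
    pinv = matmul(inv2(matmul(Hh, H)), Hh)
    return [row[0] for row in matmul(pinv, [[v] for v in y])]

H = [[1 + 1j, 0.5 - 0.2j],
     [0.3 + 0.1j, 1 - 0.5j]]                 # assumed channel matrix
x = [1 + 1j, -1 + 1j]                         # assumed transmit symbols
y = [sum(h * s for h, s in zip(row, x)) for row in H]  # noise-free y = Hx
x_hat = zf_detect(H, y)                       # recovers x up to rounding
```

Without noise the estimate matches the transmitted vector up to floating-point rounding; with noise, the pseudo-inverse amplifies it when the columns of the channel matrix are strongly correlated, which is the motivation for LR.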
B. Lattice Reduction
A lattice is a periodic arrangement of discrete points. A lattice can be characterized in terms of a set of basis vectors, where any point of the lattice can be represented by a superposition of integer multiples of the basis vectors. For simplicity, we call the set $\mathbf{B} = (\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_n)$ the basis of the lattice. A complex-valued lattice in the $n$-dimensional complex space $\mathbb{C}^n$ can be defined as

$$L = \{\boldsymbol{\upsilon} \mid \boldsymbol{\upsilon} = \mathbf{B}\boldsymbol{\omega}\}, \tag{3}$$

where $\mathbf{B}$ is the basis of the lattice and $\boldsymbol{\omega} = [\omega_1, \omega_2, \ldots, \omega_n]$. Note that in (3), $\boldsymbol{\upsilon}$, $\boldsymbol{\omega}$ and the matrix $\mathbf{B}$ can be replaced with $\mathbf{y}$, $\mathbf{x}$ and $\mathbf{H}$, respectively, to obtain $L = \{\mathbf{y} \mid \mathbf{y} = \mathbf{H}\mathbf{x}\}$. In this case, the vector space $L$ is the set of all possible undisturbed received signal points. There are many bases that can span the space $L$, and the aim of the LR algorithm is to find the least correlated basis with the shortest basis vectors [11].
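The statement that many bases span the same lattice can be checked on a small example. The sketch below uses a real-valued 2x2 basis for readability (the paper works with complex lattices); the matrices `B` and `T` and the helper names are assumptions chosen for illustration.

```python
from itertools import product

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def matvec(B, w):
    """Lattice point B*w for an integer coefficient vector w."""
    return tuple(sum(B[r][c] * w[c] for c in range(len(w)))
                 for r in range(len(B)))

def points(basis, radius, span=8):
    """Lattice points whose coordinates all lie in [-radius, radius]."""
    pts = set()
    for w in product(range(-span, span + 1), repeat=2):
        p = matvec(basis, w)
        if all(abs(c) <= radius for c in p):
            pts.add(p)
    return pts

B = [[1, 2], [0, 1]]    # columns (1,0) and (2,1): a correlated basis
T = [[1, -2], [0, 1]]   # unimodular: integer entries, determinant 1
B_red = matmul(B, T)    # columns (1,0) and (0,1): an orthogonal basis

same = points(B, 2) == points(B_red, 2)   # both bases give the same points
```

Both bases generate the same point set inside the window, but the columns of `B_red` are shorter and orthogonal, which is the property a linear detector benefits from.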
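C. LR-based MIMO Detection
LR finds an improved basis for the lattice induced by the channel. The original basis and the reduced basis are related by a unimodular matrix, $\mathbf{T}$. Therefore, LR-aided detection finds the received symbol in the new reduced basis and afterwards transfers the signal back to the original lattice. The new channel matrix after LR can be expressed as $\tilde{\mathbf{H}} = \mathbf{H}\mathbf{T}$, and the transmitted signal is treated as multiplied by $\mathbf{T}^{-1}$, that is, $\mathbf{z} = \mathbf{T}^{-1}\mathbf{x}$ for the reduced basis. The received signal $\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}$ can be expressed as

$$\mathbf{y} = \mathbf{H}\mathbf{T}\mathbf{T}^{-1}\mathbf{x} + \mathbf{n} = \tilde{\mathbf{H}}\mathbf{z} + \mathbf{n}. \tag{4}$$

LR-aided detection operates on $\tilde{\mathbf{H}}$ and $\mathbf{z}$ instead of $\mathbf{H}$ and $\mathbf{x}$. The LR-aided ZF detector can be expressed as

$$\tilde{\mathbf{z}} = (\tilde{\mathbf{H}}^H\tilde{\mathbf{H}})^{-1}\tilde{\mathbf{H}}^H\mathbf{y} = \tilde{\mathbf{H}}^{\dagger}\mathbf{y}, \tag{5}$$

where $\tilde{\mathbf{z}}$ is the estimate of $\mathbf{z}$ in the reduced basis. The LR algorithm is applied on the QR decomposed $\mathbf{H}$ to obtain the modified $\tilde{\mathbf{Q}}$ and $\tilde{\mathbf{R}}$. Afterwards, the lattice-reduced channel matrix can be obtained as $\tilde{\mathbf{H}} = \tilde{\mathbf{Q}}\tilde{\mathbf{R}}$.

III. LATTICE REDUCTION ALGORITHM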
The LLL algorithm is widely used to compute a suitable unimodular matrix $\mathbf{T}$ and to obtain a reduced lattice basis. LLL was originally proposed for real-valued LR [1]. However, the channel matrix is naturally complex valued, and therefore the complex version of LLL (CLLL) is used to reduce the complexity.

The CLLL algorithm suffers from an irregular data flow, which eventually leads to higher latency. Therefore, a fixed-complexity LLL (fcLLL) algorithm is proposed in [2]. The fcLLL alters the signal flow of the CLLL to follow a deterministic structure. It is possible to use the less complex Siegel condition instead of the more complex Lovász condition [3]. It is also very important to use an early termination mechanism to meet the strict requirements. Applying all these modifications, we propose a modified LLL (MLLL) algorithm for LR with less complexity and negligible BER performance loss. The MLLL implemented in this paper is summarized in Algorithm 1.
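For comparison, the two swap tests are commonly written as follows. These are the standard forms from the LLL literature, expressed in the notation of Algorithm 1 rather than quoted from this paper, and the parameter $\delta$ usually takes a different value in each.

```latex
\begin{align*}
\text{Lov\'asz (swap if):}\quad
  & \delta\,\bigl|\tilde{R}(k-1,k-1)\bigr|^{2}
    > \bigl|\tilde{R}(k,k)\bigr|^{2} + \bigl|\tilde{R}(k-1,k)\bigr|^{2} \\
\text{Siegel (swap if):}\quad
  & \delta\,\bigl|\tilde{R}(k-1,k-1)\bigr|^{2}
    > \bigl|\tilde{R}(k,k)\bigr|^{2}
\end{align*}
```

The Siegel test drops the off-diagonal term, saving one squared-magnitude computation and one addition per test.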
Algorithm 1 Modified CLLL Algorithm (MLLL)
INPUT: $\mathbf{Q} \in \mathbb{C}^{N_R \times N_R}$, $\mathbf{R} \in \mathbb{C}^{N_R \times N_R}$, $\delta$
1: Initialization $\tilde{\mathbf{Q}} := \mathbf{Q}$, $\tilde{\mathbf{R}} := \mathbf{R}$, $\mathbf{T} := \mathbf{I}_{M_T}$
2: $k := 2$
3: while $k \leq$ iterations
4:     for $l = k-1$ to $1$ step $-1$
5:         $\mu = \lceil \tilde{R}(l,k) / \tilde{R}(l,l) \rfloor$
6:         if $\mu \neq 0$
7:             $\tilde{R}(1:l,k) := \tilde{R}(1:l,k) - \mu \tilde{R}(1:l,l)$
8:             $T(:,k) := T(:,k) - \mu T(:,l)$
9:         end
10:     end
11:     if $\delta\,|\tilde{R}(k-1,k-1)|^2 > |\tilde{R}(k,k)|^2$
12:         Swap columns $k-1$ and $k$ in $\tilde{\mathbf{R}}$ and $\mathbf{T}$
13:         $\Theta = \begin{bmatrix} \alpha & \beta \\ -\beta & \alpha \end{bmatrix}$ with $\alpha = \dfrac{\tilde{R}(k-1,k-1)}{\|\tilde{R}(k-1:k,k-1)\|}$ and $\beta = \dfrac{\tilde{R}(k,k-1)}{\|\tilde{R}(k-1:k,k-1)\|}$
14:         $\tilde{R}(k-1:k,k-1:k) := \Theta\,\tilde{R}(k-1:k,k-1:k)$
15:         $\tilde{Q}(:,k-1:k) := \tilde{Q}(:,k-1:k)\,\Theta^T$
16:         $k := \max\{k-1, 2\}$
17:     else
18:         $k := k + 1$
19:     end
20: end
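The steps of Algorithm 1 can be sketched in pure Python as below. This is an illustrative floating-point sketch, not the authors' C or Matlab code, and it rests on several assumptions that are not transcribed from the algorithm: the loop is bounded by both the column count and a pass counter `max_passes`; the rotation uses conjugated entries and the Hermitian transpose so that $\Theta$ is unitary for complex $\alpha$; and $\Theta$ is applied to every column from $k-1$ onwards so that $\tilde{\mathbf{Q}}\tilde{\mathbf{R}} = \mathbf{H}\mathbf{T}$ keeps holding. The function name `mlll` and the test matrix are hypothetical.

```python
import math

def mlll(Q, R, delta=0.75, max_passes=5):
    """Sketch of Algorithm 1 on list-of-lists matrices (0-based indices)."""
    n = len(R)
    Q = [list(map(complex, row)) for row in Q]
    R = [list(map(complex, row)) for row in R]
    T = [[complex(r == c) for c in range(n)] for r in range(n)]
    k, passes = 1, 0
    while k < n and passes < max_passes:      # assumed loop bound
        passes += 1
        # Size reduction of column k against columns k-1, ..., 0.
        for l in range(k - 1, -1, -1):
            q = R[l][k] / R[l][l]
            mu = complex(round(q.real), round(q.imag))
            if mu != 0:
                for r in range(l + 1):
                    R[r][k] -= mu * R[r][l]
                for r in range(n):
                    T[r][k] -= mu * T[r][l]
        # Siegel test on the two neighbouring diagonal entries.
        if delta * abs(R[k - 1][k - 1]) ** 2 > abs(R[k][k]) ** 2:
            for M in (R, T):                  # swap columns k-1 and k
                for row in M:
                    row[k - 1], row[k] = row[k], row[k - 1]
            a, b = R[k - 1][k - 1], R[k][k - 1]
            norm = math.hypot(abs(a), abs(b))
            alpha, beta = a / norm, b / norm
            ac, bc = alpha.conjugate(), beta.conjugate()
            # Theta = [[conj(alpha), conj(beta)], [-beta, alpha]] zeroes R[k][k-1].
            for c in range(k - 1, n):
                u, v = R[k - 1][c], R[k][c]
                R[k - 1][c] = ac * u + bc * v
                R[k][c] = -beta * u + alpha * v
            # Q(:, k-1:k) := Q(:, k-1:k) * Theta^H keeps the product Q*R unchanged.
            for row in Q:
                u, v = row[k - 1], row[k]
                row[k - 1] = u * alpha + v * beta
                row[k] = -u * bc + v * ac
            k = max(k - 1, 1)
        else:
            k += 1
    return Q, R, T

# Assumed test input: H is already upper triangular, so Q = I and R = H.
H = [[2.0, 1.1 + 0.6j, 0.4 - 1.3j],
     [0.0, 0.5, 0.7 + 0.2j],
     [0.0, 0.0, 1.0]]
I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
Qr, Rr, T = mlll(I3, H)
```

After the call, `Rr` is still upper triangular, `T` has Gaussian-integer entries, and the product of `Qr` and `Rr` equals `H` times `T` up to rounding, which is the relation used by LR-aided detection.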
The BER performance of the traditional ZF, original CLLL-aided ZF, MLLL-aided ZF and the optimal ML detectors is simulated for various signal-to-noise ratio (SNR) values in a Matlab simulator. An additive white Gaussian noise (AWGN) channel is used with 16-QAM modulation, and the BER is averaged over 10 000 Monte Carlo trials. Fig. 1 shows the MLLL algorithm with 5 iterations. It can be seen that the performance loss is negligible compared to the original CLLL algorithm.
[Plot: bit error rate (BER), from 10^-4 to 10^0, versus average SNR per receive antenna [dB], from 0 to 40 dB. Curves: ZF, CLLL, MLLL (5 iterations), ML.]
Fig. 1. BER performance of the MLLL algorithm.
IV. TTA MULTIPROCESSOR FOR MLLL
A. Special Function Units
Six special function units (SFUs) are designed and written in VHSIC hardware description language (VHDL) to accelerate each iteration of the MLLL algorithm. One SFU is designed to support the complex multiplication (CMUL) operation. Data level parallelism (DLP) is applied in the design by packing the 16-bit real part and the 16-bit imaginary part into a 32-bit complex variable. CMUL therefore uses four 16-bit multipliers, a single 16-bit adder and a 16-bit subtractor to support the complex multiplication.

Two SFUs for $\mu$ calculation and size reduction are designed according to [8]. These SFUs are single-cycle and multiplierless. It is observed from the Matlab simulations that the value of $\mu$ has a range of $[-4, 4]$, and thus the dynamic range of the SFUs is set accordingly. Another simple SFU is designed to compute the Siegel criterion. Instead of multiplying the input by $0.75$, the SIEGEL SFU calculates the value with a combination of two shifters and one adder.
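The behaviour of the CMUL and SIEGEL units can be modelled in a few lines. The sketch below is a software model only, under assumptions the text does not state: the real part sits in the upper half-word, the fixed-point format has `FRAC = 8` fractional bits, and results wrap around instead of saturating. All function names are hypothetical.

```python
FRAC = 8  # assumed number of fractional bits (not given in the text)

def to_s16(v):
    """Wrap an integer to signed 16-bit two's complement."""
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v

def pack(re, im):
    """Pack two signed 16-bit values into one 32-bit word (real part high, assumed)."""
    return ((re & 0xFFFF) << 16) | (im & 0xFFFF)

def unpack(word):
    """Split a 32-bit word into signed 16-bit real and imaginary parts."""
    return to_s16(word >> 16), to_s16(word)

def cmul(wa, wb):
    """Complex multiply: four multiplications, one subtraction, one addition."""
    ar, ai = unpack(wa)
    br, bi = unpack(wb)
    re = (ar * br - ai * bi) >> FRAC
    im = (ar * bi + ai * br) >> FRAC
    return pack(to_s16(re), to_s16(im))

def siegel_scale(v):
    """0.75 * v computed as (v >> 1) + (v >> 2): two shifts and one addition."""
    return (v >> 1) + (v >> 2)

# (1.5 + 2j) * (-0.5 + 1j) = -2.75 + 0.5j, scaled by 2**FRAC
prod = unpack(cmul(pack(384, 512), pack(-128, 256)))   # (-704, 128)
scaled = siegel_scale(1024)                            # 768
```

Because each shift truncates, `siegel_scale` can be slightly below the exact product for inputs that are not multiples of four; whether the hardware rounds differently is not specified in the text.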
Fig. 2. 4-cycle CORDIC architecture.
The most complex SFU in the datapath of a single TTA core is the CORDIC SFU. A master-slave CORDIC is considered in this work [4]. The master-slave CORDIC is a combination of two CORDIC blocks, in vectoring mode and rotation mode respectively. By setting the inputs of the rotation-mode CORDIC to 1 and 0, it is possible to calculate the cosine and sine values directly. Therefore, the angle calculation done in a conventional CORDIC block is not needed here. In every stage, the signum values are calculated and the rotation-mode block adds or subtracts accordingly. As we need a 16-bit CORDIC, there are two straightforward options to design it. An iterative CORDIC uses registers and iterates 16 times over a 1-stage datapath; however, it takes 16 clock cycles to compute the output. For a processor-based implementation, a 16-cycle SFU is costly, as there will be 15 NOP operations in the assembly code. Alternatively, it is possible to fully unroll the CORDIC block without any registers, but then the critical path of the CORDIC block becomes too long. We find a compromise between the two approaches and design a 4-stage CORDIC datapath that is reused four times to create a 4-cycle master-slave CORDIC. The block diagram of the master-slave CORDIC is presented in Fig. 2, where the ellipse contains a single stage of the datapath. An ARRANGE SFU is designed to rearrange the 32-bit variables.
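The master-slave idea can be illustrated with a short floating-point model. This is a behavioural sketch under assumptions, not the VHDL design: it uses floats instead of 16-bit fixed point, assumes a real input vector with a positive x component, and pre-scales the slave input by the CORDIC gain. The names `master_slave_cordic`, `STAGES_PER_PASS` and `PASSES` are hypothetical.

```python
import math

STAGES_PER_PASS = 4   # micro-rotations in the 4-stage datapath
PASSES = 4            # the datapath is reused four times (4 cycles)

# Accumulated CORDIC gain over all 16 micro-rotations.
GAIN = 1.0
for i in range(STAGES_PER_PASS * PASSES):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))

def master_slave_cordic(x, y):
    """Return (magnitude, cos, sin) for a vector (x, y) with x > 0."""
    cx, cy = 1.0 / GAIN, 0.0            # slave starts at (1, 0), gain-compensated
    for p in range(PASSES):             # one pass over the datapath per clock cycle
        for s in range(STAGES_PER_PASS):
            i = p * STAGES_PER_PASS + s
            t = 2.0 ** -i
            d = -1.0 if y >= 0 else 1.0             # signum decided by the master
            x, y = x - d * y * t, y + d * x * t     # master: rotate towards the x-axis
            cx, cy = cx + d * cy * t, cy - d * cx * t   # slave: opposite rotation
    return x / GAIN, cx, cy

mag, c, s = master_slave_cordic(3.0, 4.0)   # close to 5, 0.6 and 0.8
```

The slave never sees an explicit angle: it only reuses the sign decisions of the master, which is why the angle accumulation of a conventional CORDIC can be omitted.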
B. High-Level Architecture of the Multiprocessor
A 32-bit fixed-point TTA processor is designed to support a single iteration of the MLLL algorithm, and five of these TTA cores are connected in a pipelined fashion to compute the LR matrix. Part of a single TTA processor core is illustrated inside the dotted block of Fig. 3. For readability, the whole processor is not shown. The blocks in the upper part of the core represent the function units and register files of the processor. The black horizontal straight lines represent the buses of the processor. The vertical rectangular blocks represent the sockets.
Fig. 3. The multiprocessor architecture.
Each core includes the load/store unit (LSU), arithmetic logic unit (ALU), global control unit (GCU), register files, several conventional function units and SFUs. The $\mathbf{Q}$, $\mathbf{R}$ and $\mathbf{T}$ matrices are read from three separate first-in first-out (FIFO) memory buffers by using function units called STREAM. The STREAM units can read one input sample per clock cycle. Three STREAM units are used to read the inputs simultaneously, and three STREAM units are used to write the outputs to the memory buffers.

Ten register files are used to save the intermediate results. A single Boolean register file is included in the processor design. When the registers are not enough, the processor is able to access the data memory through the LSU to store data temporarily. The SFUs can be called by macros to accelerate the program code.