A customized lattice reduction multiprocessor for MIMO detection

arXiv:1501.04860v1 [cs.IT] 17 Jan 2015

Reference Matlab code is available at sites.google.com/site/shahriarshahabuddin/matlab simulator

Shahriar Shahabuddin, Janne Janhunen, Zaheer Khan, and Markku Juntti
Centre for Wireless Communications, University of Oulu, Finland
Email: firstname.lastname@ee.oulu.fi

Amanullah Ghazi
Department of Computer Science and Engineering, University of Oulu, Finland
Email: firstname.lastname@ee.oulu.fi

Abstract: Lattice reduction (LR) is a preprocessing technique for multiple-input multiple-output (MIMO) symbol detection that achieves better bit error-rate (BER) performance. In this paper, we propose a customized homogeneous multiprocessor for LR. The processor cores are based on the transport triggered architecture (TTA). We propose some modifications to the popular LR algorithm, Lenstra-Lenstra-Lovász (LLL), for high throughput. The TTA cores are programmed in a high-level language. Each TTA core consists of several special function units to accelerate the program code. The multiprocessor takes 187 cycles to reduce a single matrix for LR. The architecture is synthesized on 90 nm technology and takes 405 kgates at 210 MHz.

I. INTRODUCTION

Multiple-input multiple-output (MIMO) is a key technology for utilizing the available radio spectrum efficiently. The basic idea of MIMO is to send multiple independent data streams from multiple antennas in the same frequency band. These independent streams need to be separated at the receiver by a MIMO detector to identify the symbols that were transmitted. Maximum likelihood (ML) detection is the optimal solution to the MIMO detection problem; it compares the incoming symbol with every possible symbol in the constellation. However, the ML algorithm is too complex for practical real-time implementations. Linear detection is popular for practical implementations.
The linear MIMO detection algorithms are less complex, but suffer from a degraded bit error-rate (BER) performance.

Lattice reduction (LR) is a preprocessing technique that can be used with linear detection to significantly improve the BER performance and reduce the gap between the traditional linear detectors and optimal ML. LR transforms the MIMO channel matrix into a nearly orthogonal matrix and thus helps achieve a better BER performance. The most widely used LR algorithm is the Lenstra-Lenstra-Lovász (LLL) algorithm, named after its inventors [1]. The LLL algorithm poses many challenges due to its nondeterministic execution time and high computational complexity. We propose a modified LLL (MLLL) algorithm that is based on the original LLL algorithm in the complex domain. We use a fixed structure for the LLL based on [2]. Instead of the Lovász condition, the less complex Siegel condition is applied [3]. An early termination technique is used as proposed in [4]. We demonstrate by Matlab simulation that the BER performance loss of the hardware-friendly MLLL algorithm is negligible.

Several hardware accelerators are proposed in [4], [5], [6], [7] for different LR algorithms. The fixed hardware implementations provide a high data rate and consume less silicon area than customized application-specific processors (ASIPs). The drawback of a fixed hardware implementation is that it operates only on a fixed set of parameters due to the hardwired control path, and the control path cannot be modified in the future. An ASIP customized for a small set of algorithms is an attractive solution in terms of cost, silicon area and throughput. Most importantly, an ASIP reduces the design risk, because its instruction memory can be used to load new programs or control instructions.
The control instructions can be easily obtained with a retargetable compiler for that particular customized architecture.

A customized very long instruction word (VLIW) processor is implemented in [8] for LR. We take a different approach and design a customized multiprocessor based on the transport triggered architecture (TTA) paradigm. TTA is a processor design philosophy in which the programmer can control the internal data transports between the different function units of the processor. TTA exploits instruction level parallelism (ILP) by processing several instructions in a single clock cycle. The TTA-based codesign environment (TCE) tool is used in this work to design the TTA processor cores. TCE enables the designer to write an application in a high-level language and design the target processor in a graphical user interface at the same time. A turbo decoder and a MIMO detector designed using TCE can be found in [9] and [10]. In this work, every core of the proposed multiprocessor is programmed in the C language to shorten the time-to-market. The multiprocessor takes 187 cycles to reduce a single matrix and achieves a maximum clock frequency of 210 MHz on 90 nm technology. To our knowledge, this is the first TTA-based customized architecture for LR.

II. SYSTEM MODEL

A. Conventional MIMO Detection

Consider a MIMO system consisting of $M_T$ transmit antennas, which send data over the channel, and $N_R$ receive antennas, which receive the transmitted data from the channel. The modulation scheme used here is quadrature amplitude modulation (QAM), with the assumption $N_R \geq M_T$.
The received signal $\mathbf{y}$ can be represented as

$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}, \tag{1}$$

where $\mathbf{y} \in \mathbb{C}^{N_R}$ is the received signal vector, $\mathbf{x} \in \mathbb{C}^{M_T}$ is the transmit symbol vector, $\mathbf{H} \in \mathbb{C}^{N_R \times M_T}$ is the channel matrix and $\mathbf{n} \in \mathbb{C}^{N_R}$ is the circularly symmetric complex white Gaussian noise vector with zero mean and variance $\sigma^2$.

In the receiver, the linear zero forcing (ZF) detector applies the pseudoinverse of the channel matrix to the received signal to estimate the transmitted symbol vector, which can be expressed as

$$\tilde{\mathbf{x}} = (\mathbf{H}^H\mathbf{H})^{-1}\mathbf{H}^H\mathbf{y} = \mathbf{H}^{\dagger}\mathbf{y}, \tag{2}$$

where $\mathbf{H}$ is the channel matrix and $(\cdot)^{\dagger}$ denotes the pseudoinverse. Typically, the channel matrix $\mathbf{H}$ is QR decomposed into two parts as $\mathbf{H} = \mathbf{Q}\mathbf{R}$. Here $\mathbf{Q} \in \mathbb{C}^{N_R \times N_R}$ denotes a unitary matrix and $\mathbf{R} \in \mathbb{C}^{N_R \times N_R}$ denotes an upper triangular matrix.

B. Lattice Reduction

A lattice is a periodic arrangement of discrete points. A lattice can be characterized in terms of a set of basis vectors, where any point of the lattice can be represented by a superposition of integer multiples of the basis vectors. For simplicity, we call the set $\mathbf{B} = (b_1, b_2, \ldots, b_n)$ the basis of the lattice.

A complex valued lattice in the $n$-dimensional complex space $\mathbb{C}^n$ can be defined as

$$L = \{\boldsymbol{\upsilon} \mid \boldsymbol{\upsilon} = \mathbf{B}\boldsymbol{\omega}\}, \tag{3}$$

where $\mathbf{B}$ is the basis of the lattice and $\boldsymbol{\omega} = [\omega_1, \omega_2, \ldots, \omega_n]$. Note that in (3), $\boldsymbol{\upsilon}$, $\boldsymbol{\omega}$ and the matrix $\mathbf{B}$ can be replaced with $\mathbf{y}$, $\mathbf{x}$ and $\mathbf{H}$ respectively to obtain $L = \{\mathbf{y} \mid \mathbf{y} = \mathbf{H}\mathbf{x}\}$. In this case, the vector space $L$ is the set of all possible undisturbed received signal points. Many bases can span the space $L$, and the aim of the LR algorithm is to find the least correlated basis with the shortest basis vectors [11].

C. LR-based MIMO Detection

LR finds an improved basis for the lattice induced by the channel. The original basis and the reduced basis are related by a unimodular matrix $\mathbf{T}$.
Therefore, the LR aided detection finds the received symbol in the new reduced basis and afterwards transfers the signal to the original lattice. The new channel matrix after the LR can be expressed as $\tilde{\mathbf{H}} = \mathbf{H}\mathbf{T}$, and the transmitted signal is treated as multiplied by $\mathbf{T}^{-1}$, that is, $\mathbf{z} = \mathbf{T}^{-1}\mathbf{x}$ for the reduced basis. The received signal $\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}$ can be expressed as

$$\mathbf{y} = \mathbf{H}\mathbf{T}\mathbf{T}^{-1}\mathbf{x} + \mathbf{n} = \tilde{\mathbf{H}}\mathbf{z} + \mathbf{n}. \tag{4}$$

The LR aided detection operates on $\tilde{\mathbf{H}}$ and $\mathbf{z}$ instead of $\mathbf{H}$ and $\mathbf{x}$. The LR aided ZF detector can be expressed as

$$\tilde{\mathbf{z}} = (\tilde{\mathbf{H}}^H\tilde{\mathbf{H}})^{-1}\tilde{\mathbf{H}}^H\mathbf{y} = \tilde{\mathbf{H}}^{\dagger}\mathbf{y}, \tag{5}$$

and the estimate is transferred back to the original lattice as $\tilde{\mathbf{x}} = \mathbf{T}\tilde{\mathbf{z}}$. The LR algorithm is applied to the QR decomposed $\mathbf{H}$ to obtain the modified $\tilde{\mathbf{Q}}$ and $\tilde{\mathbf{R}}$. Afterwards, the lattice reduced channel matrix can be obtained as $\tilde{\mathbf{H}} = \tilde{\mathbf{Q}}\tilde{\mathbf{R}}$.

III. LATTICE REDUCTION ALGORITHM

The LLL algorithm is widely used to compute a suitable unimodular matrix $\mathbf{T}$ and to obtain a reduced lattice basis. LLL was originally proposed for real valued LR [1]. However, the channel matrix is naturally complex valued, and therefore a complex version of LLL (CLLL) is used to reduce the complexity.

The CLLL algorithm suffers from an irregular dataflow, which eventually leads to higher latency. Therefore, a fixed-complexity LLL (fcLLL) algorithm is proposed in [2]. The fcLLL alters the signal flow of the CLLL to follow a deterministic structure. It is possible to use the less complex Siegel condition instead of the more complex Lovász condition [3]. It is also very important to use an early termination mechanism to meet the strict requirements. Applying all these modifications, we propose a modified-LLL (MLLL) algorithm for LR with less complexity and negligible BER performance loss. The MLLL implemented in this paper is summarized in Algorithm 1.
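The ingredients named above (a bounded number of loop passes, the Siegel swap test and a Givens rotation after each swap) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name `mlll`, the parameter `max_iter` and the 0-based indexing are assumptions made for illustration. The sketch also assumes a unitary rotation with conjugated entries, applied to all trailing columns of the triangular factor, so that the product of the returned factors equals the original matrix times the unimodular matrix.

```python
def mlll(Q, R, delta=0.75, max_iter=5):
    """LLL-style reduction of H = Q R using the Siegel swap test.

    Q and R are square lists of lists of complex numbers, R upper
    triangular with a nonzero diagonal. Returns (Q, R, T) such that
    Q R equals H T, where T has Gaussian-integer entries.
    """
    n = len(R)
    Q = [row[:] for row in Q]
    R = [row[:] for row in R]
    T = [[complex(i == j) for j in range(n)] for i in range(n)]
    k, passes = 1, 0  # k is a 0-based column index (k = 2 in 1-based terms)
    while k < n and passes < max_iter:  # assumed form of early termination
        passes += 1
        # Size reduction of column k against columns k-1, ..., 0.
        for l in range(k - 1, -1, -1):
            q = R[l][k] / R[l][l]
            mu = complex(round(q.real), round(q.imag))
            if mu != 0:
                for i in range(l + 1):
                    R[i][k] -= mu * R[i][l]
                for i in range(n):
                    T[i][k] -= mu * T[i][l]
        # Siegel condition: swap when the diagonal drops too quickly.
        if delta * abs(R[k - 1][k - 1]) ** 2 > abs(R[k][k]) ** 2:
            for M in (R, T):
                for row in M:
                    row[k - 1], row[k] = row[k], row[k - 1]
            a, b = R[k - 1][k - 1], R[k][k - 1]
            norm = (abs(a) ** 2 + abs(b) ** 2) ** 0.5
            alpha, beta = a / norm, b / norm
            # Theta = [[conj(alpha), conj(beta)], [-beta, alpha]] is unitary
            # and restores the upper triangular form of R.
            for j in range(k - 1, n):
                u, v = R[k - 1][j], R[k][j]
                R[k - 1][j] = alpha.conjugate() * u + beta.conjugate() * v
                R[k][j] = -beta * u + alpha * v
            # Q := Q * Theta^H keeps the product Q R unchanged.
            for i in range(n):
                u, v = Q[i][k - 1], Q[i][k]
                Q[i][k - 1] = u * alpha + v * beta
                Q[i][k] = -u * beta.conjugate() + v * alpha.conjugate()
            k = max(k - 1, 1)
        else:
            k += 1
    return Q, R, T
```

Calling `mlll` with an identity `Q` and an upper triangular `R` reduces `R` directly. The default `delta=0.75` matches the factor that the Siegel unit described later in the paper computes, and `max_iter=5` mirrors the five iterations used in the simulations.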
Algorithm 1 Modified CLLL Algorithm (MLLL)
INPUT: $\mathbf{Q} \in \mathbb{C}^{N_R \times N_R}$, $\mathbf{R} \in \mathbb{C}^{N_R \times N_R}$, $\delta$
Initialization: $\tilde{\mathbf{Q}} := \mathbf{Q}$, $\tilde{\mathbf{R}} := \mathbf{R}$, $\mathbf{T} := \mathbf{I}_{M_T}$
$k := 2$
while $k \leq$ iterations
    for $l = k-1$ to $1$ step $-1$
        $\mu = \lceil \tilde{\mathbf{R}}(l,k) / \tilde{\mathbf{R}}(l,l) \rfloor$
        if $\mu \neq 0$
            $\tilde{\mathbf{R}}(1:l,k) := \tilde{\mathbf{R}}(1:l,k) - \mu\tilde{\mathbf{R}}(1:l,l)$
            $\mathbf{T}(:,k) := \mathbf{T}(:,k) - \mu\mathbf{T}(:,l)$
        end
    end
    if $\delta\,|\tilde{\mathbf{R}}(k-1,k-1)|^2 > |\tilde{\mathbf{R}}(k,k)|^2$
        Swap columns $k-1$ and $k$ in $\tilde{\mathbf{R}}$ and $\mathbf{T}$
        $\Theta = \begin{bmatrix} \alpha & \beta \\ -\beta & \alpha \end{bmatrix}$ with $\alpha = \dfrac{\tilde{\mathbf{R}}(k-1,k-1)}{\|\tilde{\mathbf{R}}(k-1:k,k-1)\|}$ and $\beta = \dfrac{\tilde{\mathbf{R}}(k,k-1)}{\|\tilde{\mathbf{R}}(k-1:k,k-1)\|}$
        $\tilde{\mathbf{R}}(k-1:k,k-1:k) := \Theta\tilde{\mathbf{R}}(k-1:k,k-1:k)$
        $\tilde{\mathbf{Q}}(:,k-1:k) := \tilde{\mathbf{Q}}(:,k-1:k)\Theta^T$
        $k := \max\{k-1, 2\}$
    else
        $k := k+1$
    end
end

The BER performance of the traditional ZF, the original CLLL aided ZF, the MLLL aided ZF and the optimal ML is simulated for various signal-to-noise ratios (SNR) in a Matlab simulator. An additive white Gaussian noise (AWGN) channel is used with 16-QAM modulation, and the BER is averaged over 10 000 Monte-Carlo trials. Fig. 1 shows the results for the MLLL algorithm with 5 iterations. It can be seen that the performance loss is negligible compared to the original CLLL algorithm.

Fig. 1. BER performance of the MLLL algorithm. [Plot: bit error rate (BER) versus average SNR per receive antenna in dB, with curves for ZF, CLLL, MLLL (5 iterations) and ML.]

IV. TTA MULTIPROCESSOR FOR MLLL

A. Special Function Units

Six special function units (SFU) are designed and written in VHSIC hardware description language (VHDL) to accelerate each iteration of the MLLL algorithm. One special function unit is designed to support the complex multiplication (CMUL) operation.
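A complex multiplication expands into four real multiplications, one subtraction and one addition. The Python sketch below models such a unit on packed 32-bit words with 16-bit real and imaginary parts. The word layout (real part in the upper half), the wrap-around on overflow and the optional `frac_bits` scaling are assumptions made for illustration, since the text does not specify them, and the function names are hypothetical.

```python
def pack(re, im):
    """Pack two signed 16-bit integers into one 32-bit word (real part high)."""
    return ((re & 0xFFFF) << 16) | (im & 0xFFFF)


def to_signed16(value):
    """Interpret the low 16 bits of value as a two's-complement integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def unpack(word):
    """Split a packed 32-bit word into its (real, imaginary) parts."""
    return to_signed16(word >> 16), to_signed16(word)


def cmul(a, b, frac_bits=0):
    """Multiply two packed complex words: four multiplies, one subtract, one add."""
    ar, ai = unpack(a)
    br, bi = unpack(b)
    re = (ar * br - ai * bi) >> frac_bits  # real part uses the subtractor
    im = (ar * bi + ai * br) >> frac_bits  # imaginary part uses the adder
    return pack(re, im)  # results wrap to 16 bits; no saturation is modelled
```

For example, `unpack(cmul(pack(3, -2), pack(1, 4)))` evaluates the product of 3 - 2j and 1 + 4j. Packing both parts into one word lets a single 32-bit transport carry a full complex operand.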
Data level parallelism (DLP) is applied in the design by packing the 16-bit real part and the 16-bit imaginary part into a single 32-bit complex variable. Therefore, CMUL uses four 16-bit multipliers, a single 16-bit adder and a single 16-bit subtractor to support the complex multiplication.

Two SFUs for $\mu$ calculation and size reduction are designed according to [8]. These SFUs are single-cycle and multiplier-less. It is observed from the Matlab simulations that the value of $\mu$ lies in the range $[-4, 4]$, and thus the dynamic range of the SFUs is set accordingly. Another simple SFU is designed to compute the Siegel criterion. Instead of multiplying the input by 0.75, the SIEGEL SFU calculates the value with a combination of two shifters and one adder.

Fig. 2. 4-cycle CORDIC architecture.

The most complex SFU in the datapath of a single TTA core is the CORDIC SFU. A master-slave CORDIC is considered in this work [4]. The master-slave CORDIC is a combination of two CORDIC blocks, in vectoring mode and rotation mode respectively. By setting the inputs of the rotation-mode CORDIC to 1 and 0, it is possible to calculate the cosine and sine values directly. Therefore, the angle calculation done in a conventional CORDIC block is not needed here. In every stage, the signum values are computed and the rotation mode adds or subtracts accordingly. As we need a 16-bit CORDIC, there are two options to design it. One option is an iterative CORDIC that uses registers and iterates 16 times over a 1-stage datapath. However, it takes 16 clock cycles to compute the output. For a processor-based implementation, a 16-cycle SFU is cumbersome, as there will be 15 NOP operations in the assembly code. The other option is to fully unroll the CORDIC block without any registers. Then the critical path of the CORDIC block becomes too long. We find a compromise between the two approaches and design a 4-stage CORDIC datapath that can be reused four times to create a 4-cycle master-slave CORDIC.
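Two of the units described above can be modelled in a few lines of Python. The sketch is a floating-point behavioural model, not the fixed-point VHDL design. The function names, the pre-scaling of the slave input by the CORDIC gain and the restriction to inputs with a positive first coordinate are assumptions made for illustration.

```python
import math

STAGES_PER_CYCLE = 4  # stages in the shared datapath
CYCLES = 4            # passes over that datapath
N_ITER = STAGES_PER_CYCLE * CYCLES
# Accumulated CORDIC gain after N_ITER micro-rotations.
GAIN = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(N_ITER))


def siegel_three_quarters(x):
    """Compute 0.75 * x for an integer x with two shifts and one addition."""
    return (x >> 1) + (x >> 2)


def master_slave_cordic(x, y):
    """Return (magnitude, cosine, sine) of the vector (x, y), assuming x > 0.

    The master works in vectoring mode and drives y towards zero. The slave
    works in rotation mode, starts from (1, 0) and reuses the sign chosen by
    the master in every stage, so no angle is ever computed explicitly.
    """
    c, s = 1.0 / GAIN, 0.0  # slave input (1, 0), pre-scaled by the gain
    for cycle in range(CYCLES):
        for stage in range(STAGES_PER_CYCLE):
            i = cycle * STAGES_PER_CYCLE + stage
            t = 2.0 ** -i
            d = 1.0 if y >= 0 else -1.0  # signum decided by the master
            # Master: rotate the input vector towards the positive x axis.
            x, y = x + d * y * t, y - d * x * t
            # Slave: apply the opposite rotation to the unit vector.
            c, s = c - d * s * t, s + d * c * t
    return x / GAIN, c, s
```

Calling `master_slave_cordic(3.0, 4.0)` returns a magnitude close to 5 and cosine and sine values close to 0.6 and 0.8. The nested loops mirror the reuse of a 4-stage datapath over four cycles, and the shift-and-add in `siegel_three_quarters` truncates when the input is not a multiple of four.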
The block diagram of the master-slave CORDIC is presented in Fig. 2, where the ellipse contains a single stage of the datapath. An ARRANGE SFU is designed to rearrange the 32-bit variables.

B. High Level Architecture of the Multiprocessor

A 32-bit fixed point TTA processor is designed to support a single iteration of the MLLL algorithm, and five of these TTA cores are connected in a pipelined fashion to compute the LR matrix. Part of a single TTA processor core is illustrated inside the dotted block of Fig. 3. For readability, the whole processor is not shown. The blocks in the upper part of the core represent the function units and register files of the processor. The black horizontal straight lines represent the buses of the processor. The vertical rectangular blocks represent the sockets.

Fig. 3. The multiprocessor architecture.

Each core includes the load/store unit (LSU), arithmetic logic unit (ALU), global control unit (GCU), register files, several conventional function units and SFUs. The $\mathbf{Q}$, $\mathbf{R}$ and $\mathbf{T}$ matrices are read from three separate first-in-first-out (FIFO) memory buffers by using function units called STREAM. The STREAM units can read one input sample per clock cycle. Three STREAM units are used to read the inputs simultaneously. Three STREAM units are used to write the outputs to the memory buffers.

Ten register files are used to save the intermediate results. A single Boolean register file is included in the processor design. When the registers are not enough, the processor can access the data memory through the LSU to store data temporarily. The SFUs can be called by macros to accelerate the program code.