A SLIDING-WINDOW KERNEL RLS ALGORITHM AND ITS APPLICATION TO NONLINEAR CHANNEL IDENTIFICATION

Steven Van Vaerenbergh, Javier Vía and Ignacio Santamaría
Dept. of Communications Engineering, University of Cantabria, Spain
E-mail: {steven,jvia,nacho}@gtas.dicom.unican.es
This work was supported by MEC (Ministerio de Educación y Ciencia) under grant TEC2004-06451-C05-02.

ABSTRACT

In this paper we propose a new kernel-based version of the recursive least-squares (RLS) algorithm for fast adaptive nonlinear filtering. Unlike other previous approaches, we combine a sliding-window approach (to fix the dimensions of the kernel matrix) with conventional L2-norm regularization (to improve generalization). The proposed kernel RLS algorithm is applied to a nonlinear channel identification problem (specifically, a linear filter followed by a memoryless nonlinearity), which typically appears in satellite communications or digital magnetic recording systems. We show that the proposed algorithm is able to operate in a time-varying environment and tracks abrupt changes in either the linear filter or the nonlinearity.

1. INTRODUCTION

In recent years a number of kernel methods, including support vector machines [1], kernel principal component analysis [2], kernel Fisher discriminant analysis [3] and kernel canonical correlation analysis [4, 5], have been proposed and successfully applied to classification and nonlinear regression problems. In their original forms, most of these algorithms cannot be used to operate online, since a number of difficulties are introduced by the kernel methods, such as the time and memory complexities (because of the growing kernel matrix) and the need to avoid overfitting of the problem.

Recently a kernel RLS algorithm was proposed that dealt with both difficulties [6]: by applying a sparsification procedure the kernel matrix size was limited and the order of the problem was reduced. In this paper we present a different approach, applying a sliding-window approach and conventional regularization. This way the size of the kernel matrix can be fixed rather than limited, allowing the algorithm to operate online in time-varying environments.

The basic idea of kernel methods is to transform the data x_i from the input space to a high-dimensional feature space of vectors Φ(x_i), where the inner products can be calculated using a positive definite kernel function satisfying Mercer's condition [1]: κ(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩. This simple and elegant idea, also known as the "kernel trick", allows inner products in the feature space to be computed without making direct reference to the feature vectors.

A common nonlinear kernel is the Gaussian kernel

\kappa(x, y) = \exp\left( -\frac{\| x - y \|^2}{2\sigma^2} \right).   (1)

This kernel will be used to calculate the elements of a kernel matrix. In the sliding-window approach, updating this kernel matrix means first removing its first row and first column and then adding a new row and column at the end, based on new observations. Calculation of its inverse is needed to update the algorithm's solution. To this end, two matrix inversion formulas are derived in Appendix A. One of them has already been used in kernel methods [7], but in order to fix the dimensions of the kernel matrix we also introduce a complementary formula.

The rest of this paper is organized as follows: in Section 2 a kernel transformation is introduced into linear least-squares regression theory. A detailed description of the proposed algorithm is given in Section 3, and in Section 4 it is applied to a nonlinear channel identification problem. Finally, Section 5 summarizes the main conclusions of this work.
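As a concrete illustration of the Gaussian kernel (1) and of the kernel matrix built from it, a minimal NumPy sketch is given below; the function names and the default kernel width σ = 1 are our own choices and not part of the paper.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel of Eq. (1): exp(-||x - y||^2 / (2 sigma^2))."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-(d @ d) / (2.0 * sigma ** 2))

def kernel_matrix(X, sigma=1.0):
    """Kernel matrix with entries K(i, j) = kappa(X_i, X_j), X_i being the rows of X."""
    N = X.shape[0]
    K = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            K[i, j] = gaussian_kernel(X[i], X[j], sigma)
    return K
```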
2. LEAST-SQUARES REGRESSION

2.1. Linear Methods

The least-squares (LS) criterion [8] is a widely used method in signal processing. Given a vector y ∈ R^{N×1} and a data matrix X ∈ R^{N×M} of observations, it consists in seeking the optimal vector h ∈ R^{M×1} that solves

J = \min_{h} \| y - X h \|^2.   (2)

It should be clear that the solution h can be represented in the basis defined by the rows of X. Hence it can also be written as h = X^T a, making it a linear combination of the input patterns (this is sometimes denoted as the "dual representation").

For many problems, however, not all data are known in advance and the solution has to be recalculated as new observations become available. An online algorithm is then needed, which in the case of linear problems is given by the well-known recursive least-squares (RLS) algorithm [8].

2.2. Kernel Methods

The linear LS methods can be extended to nonlinear versions by transforming the data into a feature space. Using the transformed vector h̃ ∈ R^{M'×1} and the transformed data matrix X̃ ∈ R^{N×M'}, the LS problem (2) can be written in feature space as

J' = \min_{\tilde{h}} \| y - \tilde{X} \tilde{h} \|^2.   (3)

The transformed solution h̃ can now also be represented in the basis defined by the rows of the (transformed) data matrix X̃, namely as

\tilde{h} = \tilde{X}^T \alpha.   (4)

Moreover, introducing the kernel matrix K = X̃ X̃^T, the LS problem in feature space (3) can be rewritten as

J' = \min_{\alpha} \| y - K \alpha \|^2,   (5)

in which the solution α is an N×1 vector. The advantage of writing the nonlinear LS problem in the dual notation is that, thanks to the "kernel trick", we only need to compute K, which is done as

K(i, j) = \kappa(X_i, X_j),   (6)

where X_i and X_j are the i-th and j-th rows of X. As a consequence, the computational complexity of operating in this high-dimensional space is not necessarily larger than that of working in the original low-dimensional space.

2.3. Measures Against Overfitting

For most useful kernel functions, the dimension of the feature space, M', will be much higher than the number of available data points N (for instance, in case the Gaussian kernel is used the feature space has dimension M' = ∞). In these cases, Eq. (5) could have an infinite number of solutions, representing an overfit problem.

Various techniques to handle this overfitting have been presented. One possible method is to reduce the order of the feature space [6, 4, 5]. A second method, used here, is to regularize the solution. More specifically, the norm of the solution h̃ is penalized to obtain the following problem:

J' = \min_{\tilde{h}} \| y - \tilde{X} \tilde{h} \|^2 + c \| \tilde{h} \|^2   (7)
   = \min_{\alpha} \| y - K \alpha \|^2 + c \, \alpha^T K \alpha,   (8)

whose solution is given by

\alpha = K_{reg}^{-1} y   (9)

with K_{reg} = K + c I, where c is a regularization constant and I is the identity matrix.
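As a batch illustration of Eqs. (7)-(9), the following sketch reuses the gaussian_kernel and kernel_matrix helpers from the code sketch in Section 1; the dual-form prediction helper follows from the expansion (4) and is our own addition, not part of the paper.

```python
import numpy as np

def kernel_ls_fit(X, y, c=0.01, sigma=1.0):
    """Regularized kernel LS, Eq. (9): alpha = (K + c I)^{-1} y."""
    K_reg = kernel_matrix(X, sigma) + c * np.eye(X.shape[0])
    return np.linalg.solve(K_reg, y)

def kernel_ls_predict(alpha, X_train, x_new, sigma=1.0):
    """Dual-form prediction: sum_i alpha_i * kappa(X_i, x_new)."""
    k = np.array([gaussian_kernel(xi, x_new, sigma) for xi in X_train])
    return float(k @ alpha)
```

Solving the regularized system directly costs O(N^3) per solution, which is exactly what the recursive updates of Section 3 avoid.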
3. THE ONLINE ALGORITHM

In various situations it is preferred to have an online, i.e. recursive, version instead of a batch algorithm. In particular, if the data points y are the result of a time-varying process, an online algorithm able to track these time variations can be designed. In any case, the key feature of an online algorithm is that the number of computations required per new sample must not increase as the number of samples increases.

3.1. A Sliding-Window Approach

The presented algorithm is a regularized kernel version of the RLS algorithm. An online prediction setup assumes we are given a stream of input-output pairs {(x_1, y_1), (x_2, y_2), ...}. The sliding-window approach consists in only taking the last N pairs of this stream into account. For window n, the observation vector y_n = [y_n, y_{n-1}, ..., y_{n-N+1}]^T and observation matrix X_n = [x_n, x_{n-1}, ..., x_{n-N+1}]^T are formed, and the corresponding regularized kernel matrix K_n = X̃_n X̃_n^T + c I can be calculated.

Note that it is necessary to limit N, the number of data vectors x_n for which the kernel matrix is calculated. Contrary to standard linear RLS, for which the correlation matrices have fixed sizes depending on the (fixed) dimension M of the input vectors, the size of the kernel matrix in an online scenario depends on the number of observations N.

In [6], a kernel RLS algorithm is designed that limits the matrix sizes by means of a sparsification procedure, which maps the samples to a (limited) dictionary. It allows both to reduce the order of the feature space (which prevents overfitting) and to keep the complexity of the algorithm bounded. In our approach these two measures are obtained by two different mechanisms. On one hand, the regularization against overfitting is done by penalizing the solutions, as in (9). On the other hand, the complexity of the algorithm is reduced by considering only the observations in a window with fixed length. The advantage of the latter approach is that it is able to track time variations without any extra computational burden.

3.2. Updating the Inverse of the Kernel Matrix

The calculation of the updated solution α_n requires the calculation of the N×N inverse matrix K_n^{-1} for each window. This is costly both computationally and memory-wise (requiring O(N^3) operations). Therefore an update algorithm is developed that can compute K_n^{-1} solely from knowledge of the data of the current window {X_n, y_n} and the previous K_{n-1}^{-1}. The updated solution α_n can then be calculated in a straightforward way using Eq. (9).

Given the regularized kernel matrix K_{n-1}, the new regularized kernel matrix K_n can be constructed by removing the first row and column of K_{n-1}, the result of which is referred to as K̂_{n-1}, and adding kernels of the new data as the last row and column:

K_n = \begin{bmatrix} \hat{K}_{n-1} & k_{n-1}(x_n) \\ k_{n-1}(x_n)^T & k_{nn} + c \end{bmatrix},   (10)

where k_{n-1}(x_n) = [κ(x_{n-N+1}, x_n), ..., κ(x_{n-1}, x_n)]^T and k_{nn} = κ(x_n, x_n).

Calculating the inverse kernel matrix K_n^{-1} is done in two steps, using the two inversion formulas derived in Appendices A.1 and A.2. Note that these formulas do not calculate the inverse matrices explicitly, but rather derive them from known matrices, maintaining an overall time and memory complexity of O(N^2) for the algorithm. First, given K_{n-1} and K_{n-1}^{-1}, the inverse of the (N-1)×(N-1) matrix K̂_{n-1} is calculated according to Eq. (12). Then K_n^{-1} can be calculated by applying the matrix inversion formula of Eq. (11), based on the knowledge of K̂_{n-1}^{-1} and K_n. The complete procedure is summarized in Algorithm 1.

Algorithm 1. Summary of the proposed adaptive algorithm.
  Initialize K_0 as (1 + c) I and K_0^{-1} as I / (1 + c)
  for n = 1, 2, ... do
    Obtain K̂_{n-1} out of K_{n-1}
    Calculate K̂_{n-1}^{-1} according to Eq. (12)
    Obtain K_n according to Eq. (10)
    Calculate K_n^{-1} according to Eq. (11)
    Obtain the updated solution α_n = K_n^{-1} y_n
  end for
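A minimal NumPy sketch of Algorithm 1 is given below, with the two inverse updates of Appendix A implemented as helper functions. The variable names, the zero-filled initial window, and the one-step-ahead prediction step are our own choices and not part of the paper.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel of Eq. (1)."""
    d = x - y
    return np.exp(-(d @ d) / (2.0 * sigma ** 2))

def inv_remove_first(K_inv):
    """Eq. (12): inverse of K without its first row and column,
    from K^{-1} partitioned as [[e, f^T], [f, G]]."""
    e, f, G = K_inv[0, 0], K_inv[1:, :1], K_inv[1:, 1:]
    return G - (f @ f.T) / e

def inv_add_row_col(A_inv, b, d):
    """Eq. (11): inverse of K = [[A, b], [b^T, d]] from A^{-1}, for symmetric A."""
    g = 1.0 / (d - float(b.T @ A_inv @ b))
    Ab = A_inv @ b
    return np.block([[A_inv + g * (Ab @ Ab.T), -g * Ab],
                     [-g * Ab.T, np.array([[g]])]])

def sliding_window_kernel_rls(samples, N=75, c=0.01, sigma=1.0):
    """Algorithm 1: sliding-window kernel RLS on a sequence of (x, y) pairs.
    Returns the one-step-ahead predictions made before each update."""
    L = np.asarray(samples[0][0]).size
    X = np.zeros((N, L))                      # window of the last N input vectors
    yw = np.zeros((N, 1))                     # window of the last N outputs
    K_inv = np.eye(N) / (1.0 + c)             # initialization K_0^{-1} = I / (1 + c)
    alpha = K_inv @ yw                        # current solution, Eq. (9)
    preds = []
    for x, y in samples:
        x = np.asarray(x, dtype=float)
        # predict the new output with the previous solution (dual form)
        k_full = np.array([gaussian_kernel(xi, x, sigma) for xi in X])
        preds.append(float(k_full @ alpha))
        # new row/column of the regularized kernel matrix, Eq. (10)
        b = np.array([[gaussian_kernel(xi, x, sigma)] for xi in X[1:]])
        d = gaussian_kernel(x, x, sigma) + c
        # two-step inverse update: Eq. (12) followed by Eq. (11)
        K_inv = inv_add_row_col(inv_remove_first(K_inv), b, d)
        # slide the window and recompute the solution
        X = np.vstack([X[1:], x])
        yw = np.vstack([yw[1:], [[y]]])
        alpha = K_inv @ yw
    return np.array(preds)
```

Each iteration only involves products with N×N matrices and length-N vectors, in line with the O(N^2) complexity discussed in Section 3.2.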
4. EXAMPLE PROBLEM: IDENTIFICATION OF A NONLINEAR WIENER SYSTEM

In this section we consider the identification problem for a nonlinear Wiener system, and compare the performance of the proposed kernel RLS algorithm to the standard approach using a multilayer perceptron (MLP). Since kernel methods provide a natural nonlinear extension of linear regression methods, the proposed system is expected to perform well compared to the MLP [7].

4.1. Experimental Setup

The nonlinear Wiener system is a well-known and simple nonlinear system which consists of the series connection of a linear filter and a memoryless nonlinearity (see Fig. 1). Such a nonlinear channel can be encountered in digital satellite communications [9] and in digital magnetic recording [10]. Traditionally, the problem of blind nonlinear equalization or identification has been tackled by considering nonlinear structures such as MLPs [11], recurrent neural networks [12], or piecewise linear networks [13].

Here we consider a supervised identification problem, in which moreover the linear channel coefficients are changed abruptly at a given time instant, in order to compare the tracking capabilities of both algorithms: during the first part of the simulation the linear channel is

H_1(z) = 1 + 0.0668 z^{-1} - 0.4764 z^{-2} + 0.8070 z^{-3},

and after receiving 500 symbols it is changed into

H_2(z) = 1 - 0.4326 z^{-1} - 0.6656 z^{-2} + 0.7153 z^{-3}.

A binary signal is sent through this channel and then the nonlinear function y = tanh(x) is applied to it, where x is the linear channel output. Finally, white Gaussian noise is added to match an SNR of 20 dB. The Wiener system is then treated as a black box of which only the input and output are known.

[Fig. 1. A nonlinear Wiener system: the input s[n] passes through the linear filter H(z), producing x[n]; the memoryless nonlinearity f(·) gives y[n]; additive noise v[n] yields the observed output z[n].]
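The experimental setup of Section 4.1 can be sketched as follows. This is our own minimal construction: the noise level is set so that the noise power is 20 dB below the power of the nonlinearity output, which is one common reading of the stated SNR, and the channel length L = 4 is taken from the number of coefficients of H_1(z) and H_2(z).

```python
import numpy as np

def simulate_wiener(n_symbols=1500, switch_at=500, snr_db=20.0, seed=0):
    """Binary input through the time-varying Wiener system of Fig. 1:
    linear channel H1(z), switched to H2(z) after `switch_at` symbols,
    followed by tanh(.) and additive white Gaussian noise."""
    rng = np.random.default_rng(seed)
    h1 = np.array([1.0, 0.0668, -0.4764, 0.8070])    # H1(z) coefficients
    h2 = np.array([1.0, -0.4326, -0.6656, 0.7153])   # H2(z) coefficients
    s = rng.choice([-1.0, 1.0], size=n_symbols)      # binary input signal
    x = np.empty(n_symbols)
    for n in range(n_symbols):
        h = h1 if n < switch_at else h2
        taps = np.array([s[n - k] if n >= k else 0.0 for k in range(h.size)])
        x[n] = h @ taps                               # linear channel output
    y = np.tanh(x)                                    # memoryless nonlinearity
    noise_var = np.mean(y ** 2) / (10.0 ** (snr_db / 10.0))
    z = y + rng.normal(scale=np.sqrt(noise_var), size=n_symbols)
    return s, z

# time-embedded input vectors s(n) = [s(n-L+1), ..., s(n)] of length L = 4
s, z = simulate_wiener()
L = 4
samples = [(np.array([s[n - L + 1 + i] if n - L + 1 + i >= 0 else 0.0
                      for i in range(L)]), z[n]) for n in range(len(s))]
```

The resulting (input, output) pairs can be fed to the sliding_window_kernel_rls sketch of Section 3 with N = 75 or N = 150, mirroring the settings reported in Fig. 2.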
4.2. Simulation Results

System identification was first performed by an MLP with 8 neurons in its hidden layer and then using the sliding-window kernel RLS algorithm, for two different window sizes N. For both methods we applied time-embedding techniques in which the length L of the linear channel was known. More specifically, the MLP used was a time-delay MLP with L inputs, and the input vectors for the kernel RLS algorithm were time-delayed vectors of length L, s(n) = [s(n-L+1), ..., s(n)]. In each iteration, system identification was performed by estimating the output sample corresponding to the next input sample, and comparing it to the actual output. The mean square error (MSE) for both approaches is shown in Fig. 2. Most noticeable is the fast convergence of the kernel RLS algorithm: its convergence time is of the order of the window length.

[Fig. 2. MSE of the identification of the nonlinear Wiener system of Fig. 1, for the standard method using an MLP and for the window-based kernel RLS algorithm with window lengths N = 150 (thick curve) and N = 75 (thin curve); horizontal axis: iterations, vertical axis: MSE in dB. A change in the filter coefficients of the nonlinear Wiener system was introduced after 500 iterations. The results were averaged over 250 Monte-Carlo simulations.]

Further note that the structure of the nonlinear system has not been exploited while performing identification. Obviously, the presented kernel RLS method can be extended and used as the basis of a more complex algorithm that better models the known system structure. For instance, the solution to the nonlinear Wiener identification problem could be found as the solution to two coupled LS problems, where the first one applies a linear kernel to the input data and the second one applies a nonlinear kernel to the output data. Also, the implications of not knowing the correct linear channel length remain to be studied. We will consider this and other extensions as future research lines.

5. CONCLUSIONS

A kernel-based version of the RLS algorithm was presented. Its main features are the introduction of regularization against overfitting (by penalizing the solutions) and the combination of a sliding-window approach with efficient matrix inversion formulas to keep the complexity of the problem bounded. Thanks to the use of a sliding window, the algorithm is able to provide tracking in a time-varying environment.

First results of this algorithm are promising, and suggest it can be extended to deal with the nonlinear extensions of most problems that are classically solved by linear RLS. Future research lines also include its direct application to online kernel canonical correlation analysis (kernel CCA).

6. REFERENCES

[1] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag New York, Inc., New York, USA, 1995.
[2] B. Schölkopf, A. J. Smola, and K.-R. Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, no. 5, pp. 1299-1319, 1998.
[3] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller, "Fisher discriminant analysis with kernels," in Proc. NNSP'99, Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas, Eds., Jan. 1999, pp. 41-48, IEEE.
[4] F. R. Bach and M. I. Jordan, "Kernel independent component analysis," Journal of Machine Learning Research, vol. 3, pp. 1-48, 2003.
[5] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, "Canonical correlation analysis: An overview with application to learning methods," Technical Report CSD-TR-03-02, Royal Holloway University of London, 2003.
[6] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Transactions on Signal Processing, vol. 52, no. 8, Aug. 2004.
[7] B. Schölkopf and A. J. Smola, Learning with Kernels, The MIT Press, Cambridge, MA, 2002.
[8] A. H. Sayed, Fundamentals of Adaptive Filtering, Wiley, New York, USA, 2003.
[9] G. Kechriotis, E. Zervas, and E. S. Manolakos, "Using recurrent neural networks for adaptive communication channel equalization," IEEE Trans. on Neural Networks, vol. 5, pp. 267-278, Mar. 1994.
[10] N. P. Sands and J. M. Cioffi, "Nonlinear channel models for digital magnetic recording," IEEE Trans. Magn., vol. 29, pp. 3996-3998, Nov. 1993.
[11] D. Erdogmus, D. Rende, J. C. Principe, and T. F. Wong, "Nonlinear channel equalization using multilayer perceptrons with information-theoretic criterion," in Proc. IEEE Workshop on Neural Networks and Signal Processing XI, Falmouth, MA, Sept. 2001, pp. 401-451.
[12] T. Adali and X. Liu, "Canonical piecewise linear network for nonlinear filtering and its application to blind equalization," Signal Process., vol. 61, no. 2, pp. 145-155, Sept. 1997.
[13] P. W. Holland and R. E. Welsch, "Robust regression using iterative reweighted least squares," Commun. Statist. Theory Methods, vol. A6, no. 9, pp. 813-827, 1977.
A. MATRIX INVERSION FORMULAS

A.1. Adding a row and a column

To a given non-singular matrix A a row and a column are added as shown below, resulting in matrix K. The inverse matrix K^{-1} can then be expressed in terms of the known elements and A^{-1} as follows:

K = \begin{bmatrix} A & b \\ b^T & d \end{bmatrix}, \quad K^{-1} = \begin{bmatrix} E & f \\ f^T & g \end{bmatrix}
\;\Rightarrow\;
\begin{cases} A E + b f^T = I \\ A f + b g = 0 \\ b^T f + d g = 1 \end{cases}
\;\Rightarrow\;
K^{-1} = \begin{bmatrix} A^{-1}(I + b b^T A^{-1} g) & -A^{-1} b g \\ -(A^{-1} b)^T g & g \end{bmatrix}   (11)

with g = (d - b^T A^{-1} b)^{-1}.

A.2. Removing the first row and column

From a given non-singular matrix K a row and a column are removed as shown below, resulting in matrix D. The inverse matrix D^{-1} can then easily be expressed in terms of the known elements of K^{-1} as follows:

K = \begin{bmatrix} a & b^T \\ b & D \end{bmatrix}, \quad K^{-1} = \begin{bmatrix} e & f^T \\ f & G \end{bmatrix}
\;\Rightarrow\;
\begin{cases} b e + D f = 0 \\ b f^T + D G = I \end{cases}
\;\Rightarrow\;
D^{-1} = G - f f^T / e.   (12)
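As a quick numerical sanity check of Eqs. (11) and (12), the following sketch compares both formulas against direct inversion on a random symmetric positive definite matrix; the test matrix and its size are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 6
M = rng.normal(size=(N, N))
K = M @ M.T + N * np.eye(N)          # symmetric positive definite test matrix
K_inv = np.linalg.inv(K)

# Eq. (12): inverse of K with its first row and column removed
e, f, G = K_inv[0, 0], K_inv[1:, :1], K_inv[1:, 1:]
D_inv = G - (f @ f.T) / e
assert np.allclose(D_inv, np.linalg.inv(K[1:, 1:]))

# Eq. (11): rebuild K^{-1} from A = K[:-1, :-1], b = K[:-1, -1:], d = K[-1, -1]
A_inv = np.linalg.inv(K[:-1, :-1])
b, d = K[:-1, -1:], K[-1, -1]
g = 1.0 / (d - float(b.T @ A_inv @ b))
Ab = A_inv @ b
K_inv_rebuilt = np.block([[A_inv + g * (Ab @ Ab.T), -g * Ab],
                          [-g * Ab.T, np.array([[g]])]])
assert np.allclose(K_inv_rebuilt, K_inv)
```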