1 A dynamical-system analysis of the optimum s-gradient algorithm

L. Pronzato, H.P. Wynn, and A. Zhigljavsky

Author manuscript, published in "Optimal Design and Related Areas in Optimization and Statistics", Luc Pronzato and Anatoly Zhigljavsky (Eds.) (2009), pp. 39-80. DOI: 10.1007/978-0-387-79936-0_3

Summary. We study the asymptotic behaviour of Forsythe's s-optimum gradient algorithm for the minimization of a quadratic function in $\mathbb{R}^d$ using a renormalization that converts the algorithm into iterations applied to a probability measure. Bounds on the performance of the algorithm (rate of convergence) are obtained through optimum design theory, and the limiting behaviour of the algorithm for $s = 2$ is investigated in detail. Algorithms that switch periodically between $s = 1$ and $s = 2$ are shown to converge much faster than when $s$ is fixed at 2.

1.1 Introduction

The asymptotic behavior of the steepest-descent algorithm (that is, the optimum 1-gradient method) for the minimization of a quadratic function in $\mathbb{R}^d$ is well known, see Akaike (1959); Nocedal et al. (1998, 2002) and Chap. 7 of (Pronzato et al., 2000). Any vector $y$ of norm one with only two nonzero components is a fixed point for two iterations of the algorithm after a suitable renormalization. The main result is that, in the renormalized space, one typically observes convergence to a two-point limit set which lies in the space spanned by the eigenvectors corresponding to the smallest and largest eigenvalues of the matrix $A$ of the quadratic function. The proof for bounded quadratic operators in Hilbert space is similar to the proof for $\mathbb{R}^d$, although more technical, see Pronzato et al. (2001, 2006). In both cases, the method consists of converting the renormalized algorithm into iterations applied to a measure $\nu_k$ supported on the spectrum of $A$. The additional technicalities arise from the fact that in the Hilbert-space case the measure may be continuous. For $s = 1$, the well-known inequality of Kantorovich gives a bound on the rate of convergence of the algorithm, see Kantorovich and Akilov (1982) and (Luenberger, 1973, p. 151). However, the actual asymptotic rate of convergence, although satisfying the Kantorovich bound, depends on the starting point and is difficult to predict; a lower bound can be obtained (Pronzato et al., 2001, 2006) from considerations on the stability of the fixed points of the attractor.

The situation is much more complicated for the optimum s-gradient algorithm with $s \geq 2$, and the paper extends the results presented in (Forsythe, 1968) in several directions. First, two different sequences are shown to be monotonically increasing along the trajectory followed by the algorithm (after a suitable renormalization), and a link with optimum design theory is established for the construction of upper bounds for these sequences. Second, the case $s = 2$ is investigated in detail and a precise characterization of the limiting behavior of the renormalized algorithm is given. Finally, we show how switching periodically between the algorithms with respectively $s = 1$ and $s = 2$ drastically improves the rate of convergence. The resulting algorithm is shown to have superlinear convergence in $\mathbb{R}^3$, and we give some explanations for the fast convergence observed in simulations in $\mathbb{R}^d$ with $d$ large: by switching periodically between algorithms one destroys the stability of the limiting behavior obtained when $s$ is fixed (which is always associated with slow convergence).
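To make the behaviour recalled above more tangible, here is a minimal numerical sketch (ours, not part of the chapter; all variable names are assumptions) of the steepest-descent iteration on a quadratic with diagonal $A$: it compares the observed one-step reduction of $f$ with the Kantorovich bound $\big((M-m)/(M+m)\big)^2$ and checks that the renormalized gradient typically concentrates its mass on the smallest and largest eigenvalues.

```python
import numpy as np

# Sketch (not from the chapter): steepest descent on f(x) = 0.5 x^T A x with A diagonal,
# illustrating the Kantorovich bound and the concentration of the renormalized gradient
# on the extreme eigenvalues.
rng = np.random.default_rng(0)
lam = np.sort(rng.uniform(1.0, 10.0, size=6))    # eigenvalues of A (m = lam[0], M = lam[-1])
A = np.diag(lam)
m, M = lam[0], lam[-1]
kantorovich = ((M - m) / (M + m)) ** 2           # bound on f(x_{k+1}) / f(x_k)

f = lambda x: 0.5 * x @ A @ x
x = rng.standard_normal(6)
for k in range(50):
    g = A @ x                                    # gradient of f at x
    gamma = (g @ g) / (g @ A @ g)                # optimum step length
    x_new = x - gamma * g
    ratio = f(x_new) / f(x)                      # observed one-step rate
    x = x_new

z = A @ x / np.linalg.norm(A @ x)                # renormalized gradient
print("Kantorovich bound:", kantorovich, " observed ratio:", ratio)
print("mass on the two extreme eigenvalues:", z[0] ** 2 + z[-1] ** 2)  # typically close to 1
```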
The chapter is organized as follows. Sect. 1.2 presents the optimum s-gradient algorithm for the minimization of a quadratic function in $\mathbb{R}^d$, first in the original space and then, after a suitable renormalization, as a transformation applied to a probability measure. Rates of convergence are defined in the same section. The asymptotic behavior of the optimum s-gradient algorithm in $\mathbb{R}^d$ is considered in Sect. 1.3, where some of the properties established in (Forsythe, 1968) are recalled. The analysis for the case $s = 2$ is detailed in Sect. 1.4. Switching strategies that periodically alternate between $s = 1$ and $s = 2$ are considered in Sect. 1.5.

1.2 The optimum s-gradient algorithm for the minimization of a quadratic function

Let $A$ be a real bounded self-adjoint (symmetric) operator in a real Hilbert space $\mathcal{H}$ with inner product $(x,y)$ and norm given by $\|x\| = (x,x)^{1/2}$. We shall assume that $A$ is positive and bounded below; its spectral boundaries will be denoted by $m$ and $M$:
$$ m = \inf_{\|x\|=1} (Ax,x) \,, \qquad M = \sup_{\|x\|=1} (Ax,x) \,, $$
with $0 < m < M < \infty$. The function $f_0$ to be minimized with respect to $t \in \mathcal{H}$ is the quadratic form
$$ f_0(t) = \frac{1}{2}(At,t) - (t,y) $$
for some $y \in \mathcal{H}$, the minimum of which is located at $t^* = A^{-1} y$. By a translation of the origin, which corresponds to the definition of $x = t - t^*$ as the variable of interest, the minimization of $f_0$ becomes equivalent to that of $f$ defined by
$$ f(x) = \frac{1}{2}(Ax,x) \,, \qquad (1.1) $$
which is minimum at $x^* = 0$. The directional derivative of $f$ at $x$ in the direction $u$ is
$$ \nabla_u f(x) = (Ax,u) \,. $$
The direction of steepest descent at $x$ is $-g$, with $g = g(x) = Ax$ the gradient of $f$ at $x$. The minimum of $f$ along the line $L_1(x) = \{ x + \gamma A x \,,\ \gamma \in \mathbb{R} \}$ is obtained for the optimum step-length
$$ \gamma^* = - \frac{(g,g)}{(Ag,g)} \,, $$
which corresponds to the usual steepest-descent algorithm. One iteration of the steepest-descent algorithm, or optimum 1-gradient method, is thus
$$ x_{k+1} = x_k - \frac{(g_k,g_k)}{(Ag_k,g_k)} \, g_k \,, \qquad (1.2) $$
with $g_k = A x_k$ and $x_0$ some initial element in $\mathcal{H}$. For any integer $s \geq 1$, define the $s$-dimensional plane of steepest descent by
$$ L_s(x) = \Big\{ x + \sum_{i=1}^{s} \gamma_i A^i x \,,\ \gamma_i \in \mathbb{R} \ \text{for all } i \Big\} \,. $$
In the optimum $s$-gradient method, $x_{k+1}$ is chosen as the point in $L_s(x_k)$ that minimizes $f$. When $\mathcal{H} = \mathbb{R}^d$, $A$ is a $d \times d$ symmetric positive-definite matrix with minimum and maximum eigenvalues respectively $m$ and $M$, and $x_{k+1}$ is uniquely defined provided that the $d$ eigenvalues of $A$ are all distinct. Also, in that case $L_d(x_k) = \mathbb{R}^d$ and only the case $s \leq d$ is of interest. We shall give special attention to the case $s = 2$.
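Before introducing the renormalized updating rules, the following sketch (ours, under the assumption $\mathcal{H} = \mathbb{R}^d$; function and variable names are not from the chapter) implements one step of the optimum $s$-gradient method by direct minimization of $f$ over $L_s(x_k)$: writing $V = [A x_k, A^2 x_k, \ldots, A^s x_k]$, the optimal coefficients solve the normal equations $V^\top A V \, \gamma = -V^\top g_k$.

```python
import numpy as np

def optimum_s_gradient_step(A, x, s):
    """One step of the optimum s-gradient method (sketch, names ours):
    minimize f(x + sum_i gamma_i A^i x) over gamma by solving the
    normal equations V^T A V gamma = -V^T g, with V = [A x, ..., A^s x]."""
    g = A @ x                                   # gradient of f(x) = 0.5 x^T A x
    V = np.column_stack([np.linalg.matrix_power(A, i) @ x for i in range(1, s + 1)])
    gamma = np.linalg.solve(V.T @ A @ V, -V.T @ g)
    return x + V @ gamma

# toy usage on a small SPD matrix with distinct eigenvalues
rng = np.random.default_rng(1)
A = np.diag(np.linspace(1.0, 10.0, 5))
x = rng.standard_normal(5)
f = lambda x: 0.5 * x @ A @ x
for k in range(10):
    x = optimum_s_gradient_step(A, x, s=2)
print("f after 10 optimum 2-gradient steps:", f(x))
```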
1.2.1 Updating rules

Similarly to (Pronzato et al., 2001, 2006) and Chap. 2 of this volume, consider the renormalized gradient
$$ z(x) = \frac{g(x)}{(g(x),g(x))^{1/2}} \,, $$
so that $(z(x),z(x)) = 1$, and denote $z_k = z(x_k)$ for all $k$. Also define
$$ \mu^k_j = (A^j z_k, z_k) \,, \quad j \in \mathbb{Z} \,, \qquad (1.3) $$
so that $\mu^k_0 = 1$ for any $k$ and the optimum step-length of the optimum 1-gradient at step $k$ is $-1/\mu^k_1$, see (1.2). The optimum choice of the $s$ coefficients $\gamma_i$ in the optimum $s$-gradient can be obtained by direct minimization of $f$ over $L_s(x_k)$. A simpler construction follows from the observation that $g_{k+1}$, and thus $z_{k+1}$, must be orthogonal to $L_s(x_k)$, and thus to $z_k, A z_k, \ldots, A^{s-1} z_k$. The vector of optimum step-lengths at step $k$, $\vec{\gamma}^k = (\gamma^k_1, \ldots, \gamma^k_s)^\top$, is thus the solution of the following system of $s$ linear equations:
$$ M^k_{s,1} \, \vec{\gamma}^k = -(1, \mu^k_1, \ldots, \mu^k_{s-1})^\top \,, \qquad (1.4) $$
where $M^k_{s,1}$ is the $s \times s$ (symmetric) matrix with element $(i,j)$ given by $\{M^k_{s,1}\}_{i,j} = \mu^k_{i+j-1}$.

The following remark will be important later on, when we shall compare the rates of convergence of different algorithms.

Remark 1. One may notice that one step of the optimum $s$-gradient method starting from some $x$ in $\mathcal{H}$ corresponds to $s$ successive steps of the conjugate-gradient algorithm starting from the same $x$, see (Luenberger, 1973, p. 179).

The next remark shows the connection with optimum design of experiments, which will be further considered in Sect. 1.2.3 (see also Pronzato et al. (2005), where the connection is developed around the case of the steepest-descent algorithm).

Remark 2. Consider Least-Squares (LS) estimation in a regression model
$$ \sum_{i=1}^{s} \gamma_i \, t^i = -1 + \varepsilon \,, $$
with $(\varepsilon)$ a sequence of i.i.d. errors with zero mean. Assume that the design points $t$ are generated according to a probability (design) measure $\xi$. Then the LS estimator of the parameters $\gamma_i$, $i = 1, \ldots, s$, is
$$ \hat{\gamma} = - \left[ \int (t, t^2, \ldots, t^s)^\top (t, t^2, \ldots, t^s) \, \xi(\mathrm{d}t) \right]^{-1} \int (t, t^2, \ldots, t^s)^\top \xi(\mathrm{d}t) $$
and coincides with $\vec{\gamma}^k$ when $\xi$ is such that
$$ \int t^{j+1} \, \xi(\mathrm{d}t) = \mu^k_j \,, \quad j = 0, 1, 2, \ldots $$
The information matrix $M(\xi)$ for this LS estimation problem then coincides with $M^k_{s,1}$.
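As a small illustration of the moment formulation (a sketch under our own naming, not the chapter's code), the following Python function computes the moments $\mu^k_j = (A^j z_k, z_k)$, assembles the Hankel matrix $M^k_{s,1}$ with entries $\mu^k_{i+j-1}$, and solves (1.4) for $\vec{\gamma}^k$; for $s = 1$ it recovers the steepest-descent step length $-1/\mu^k_1$.

```python
import numpy as np

def step_lengths_from_moments(A, x, s):
    """Sketch (names ours): solve the s x s system (1.4),
    M^k_{s,1} gamma = -(1, mu_1, ..., mu_{s-1})^T,
    where mu_j = (A^j z, z) and {M^k_{s,1}}_{ij} = mu_{i+j-1}."""
    g = A @ x
    z = g / np.linalg.norm(g)                                 # renormalized gradient z_k
    mu = [z @ np.linalg.matrix_power(A, j) @ z for j in range(2 * s)]   # mu_0, ..., mu_{2s-1}
    M = np.array([[mu[i + j + 1] for j in range(s)] for i in range(s)])  # Hankel matrix M^k_{s,1}
    rhs = -np.array(mu[:s])                                   # -(1, mu_1, ..., mu_{s-1})
    return np.linalg.solve(M, rhs)

# for s = 1 this gives -1/mu_1, the steepest-descent step length
A = np.diag([1.0, 3.0, 9.0])
x = np.array([1.0, 1.0, 1.0])
print(step_lengths_from_moments(A, x, s=1), step_lengths_from_moments(A, x, s=2))
```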
Using (1.4), one iteration of the optimum $s$-gradient method thus gives
$$ x_{k+1} = Q^k_s(A) \, x_k \,, \qquad g_{k+1} = Q^k_s(A) \, g_k \,, \qquad (1.5) $$
where $Q^k_s(t)$ is the polynomial $Q^k_s(t) = 1 + \sum_{i=1}^{s} \gamma^k_i \, t^i$, with the $\gamma^k_i$ solutions of (1.4). Note that the use of any other polynomial $P(t)$ of degree $s$ or less, such that $P(0) = 1$, yields a larger value for $f(x_{k+1})$. Using (1.4), we obtain
$$ Q^k_s(t) = 1 - (1, \mu^k_1, \ldots, \mu^k_{s-1}) \, [M^k_{s,1}]^{-1} \, (t, \ldots, t^s)^\top $$
and direct calculations give
$$ Q^k_s(t) = \frac{ \begin{vmatrix} 1 & \mu^k_1 & \cdots & \mu^k_{s-1} & 1 \\ \mu^k_1 & \mu^k_2 & \cdots & \mu^k_s & t \\ \vdots & \vdots & & \vdots & \vdots \\ \mu^k_s & \mu^k_{s+1} & \cdots & \mu^k_{2s-1} & t^s \end{vmatrix} }{ |M^k_{s,1}| } \,, \qquad (1.6) $$
where, for any square matrix $M$, $|M|$ denotes its determinant. The derivation of the updating rule for the normalized gradient $z_k$ relies on the computation of the inner product $(g_{k+1}, g_{k+1})$. From the orthogonality of $g_{k+1}$ to $g_k, A g_k, \ldots, A^{s-1} g_k$ we get
$$ (g_{k+1}, g_{k+1}) = (g_{k+1}, \gamma^k_s A^s g_k) = \gamma^k_s \, (Q^k_s(A) A^s g_k, g_k) = \frac{ \begin{vmatrix} 1 & \mu^k_1 & \cdots & \mu^k_{s-1} & \mu^k_s \\ \mu^k_1 & \mu^k_2 & \cdots & \mu^k_s & \mu^k_{s+1} \\ \vdots & \vdots & & \vdots & \vdots \\ \mu^k_s & \mu^k_{s+1} & \cdots & \mu^k_{2s-1} & \mu^k_{2s} \end{vmatrix} }{ |M^k_{s,1}| } \, \gamma^k_s \, (g_k, g_k) \,, \qquad (1.7) $$
where $\gamma^k_s$, the coefficient of $t^s$ in $Q^k_s(t)$, is given by
$$ \gamma^k_s = \frac{ \begin{vmatrix} 1 & \mu^k_1 & \cdots & \mu^k_{s-1} \\ \mu^k_1 & \mu^k_2 & \cdots & \mu^k_s \\ \vdots & \vdots & & \vdots \\ \mu^k_{s-1} & \mu^k_s & \cdots & \mu^k_{2s-2} \end{vmatrix} }{ |M^k_{s,1}| } \,. \qquad (1.8) $$

1.2.2 The optimum s-gradient algorithm as a sequence of transformations of a probability measure

When $\mathcal{H} = \mathbb{R}^d$, we can assume that $A$ is already diagonalized, with eigenvalues $0 < m = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_d = M$, and consider $[z_k]_i^2$, with $[z_k]_i$ the $i$-th component of $z_k$, as a mass on the eigenvalue $\lambda_i$ (note that $\sum_{i=1}^{d} [z_k]_i^2 = \mu^k_0 = 1$). Define the discrete probability measure $\nu_k$ supported on $(\lambda_1, \ldots, \lambda_d)$ by $\nu_k(\lambda_i) = [z_k]_i^2$, so that its $j$-th moment is $\mu^k_j$, $j \in \mathbb{Z}$, see (1.3).

Remark 3. When two eigenvalues $\lambda_i$ and $\lambda_j$ of $A$ are equal, their masses $[z_k]_i^2$ and $[z_k]_j^2$ can be added, since the updating rule is the same for the two components $[z_k]_i$ and $[z_k]_j$. Concerning the analysis of the rate of convergence of …