A General Framework for Structured Sparsity via Proximal Optimization

Luca Baldassarre (UCL, London, UK), Jean Morales (UCL, London, UK), Andreas Argyriou (TTI-Chicago, USA, argyriou@ttic.edu), Massimiliano Pontil (UCL, London, UK)

Appearing in Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS) 2012, La Palma, Canary Islands. Volume XX of JMLR: W&CP XX. Copyright 2012 by the authors.

Abstract

We study a generalized framework for structured sparsity. It extends the well-known methods of Lasso and Group Lasso by incorporating additional constraints on the variables as part of a convex optimization problem. This framework provides a straightforward way of favouring prescribed sparsity patterns, such as orderings, contiguous regions and overlapping groups, among others. Available optimization methods are limited to specific constraint sets and tend not to scale well with sample size and dimensionality. We propose a first-order proximal method, which builds upon results on fixed points and successive approximations. The algorithm can be applied to a general class of conic and norm constraint sets and relies on a proximity operator subproblem which can be computed numerically. Experiments on different regression problems demonstrate state-of-the-art statistical performance, which improves over Lasso, Group Lasso and StructOMP. They also demonstrate the efficiency of the optimization algorithm and its scalability with the size of the problem.

1 Introduction

We study the problem of learning a sparse linear regression model. The goal is to estimate a parameter vector $\beta^* \in \mathbb{R}^n$ from a vector of measurements $y \in \mathbb{R}^m$, obtained from the model $y = X\beta^* + \xi$, where $X$ is an $m \times n$ matrix, which may be fixed or randomly chosen, and $\xi \in \mathbb{R}^m$ is a vector resulting from the presence of noise. We are interested in sparse estimation under additional conditions on the sparsity pattern of $\beta^*$. In other words, not only do
we expect that $\beta^*$ is sparse but also that it exhibits structured sparsity, namely, certain configurations of its nonzero components are preferred to others. This problem arises in several applications, such as regression, image denoising, background subtraction, etc.; see [9, 11] for a discussion.

In this paper, we build upon the structured sparsity framework recently proposed by [12, 13]. It is a regularization method, formulated as a convex, non-smooth optimization problem over a vector of auxiliary parameters. This approach provides a constructive way to favor certain sparsity patterns of the regression vector $\beta$. Specifically, this formulation involves a penalty function given by the formula

$$\Omega(\beta \,|\, \Lambda) = \inf\left\{ \frac{1}{2} \sum_{i=1}^n \left( \frac{\beta_i^2}{\lambda_i} + \lambda_i \right) : \lambda \in \Lambda \right\}.$$

This function can be interpreted as an extension of a well-known variational form for the $\ell_1$ norm. The convex constraint set $\Lambda$ provides a means to incorporate prior knowledge on the magnitude of the components of the regression vector. As we explain in Section 2, the sparsity pattern of $\beta$ is contained in that of the auxiliary vector $\lambda$ at the optimum. Hence, if the set $\Lambda$ allows only for certain sparsity patterns of $\lambda$, the same property will be "transferred" to the regression vector $\beta$.

The first contribution of this paper is the introduction of a tractable class of regularizers of the above form which extend the examples described in [12, 13]. Specifically, we study in detail the cases in which the set $\Lambda$ is defined by norm or conic constraints, combined with a linear map. As we shall see, these cases include formulations which can be used for learning graph sparsity and hierarchical sparsity, in the terminology of [9]. That is, the sparsity pattern of the vector $\beta^*$ may consist of a few contiguous regions in one or more dimensions, or may be embedded in a tree structure.
This sparsity problem may arise in several applications, ranging from functional magnetic resonance imaging [7, 25], to scene recognition in vision [8], to multi-task learning [1, 18] and to bioinformatics [20], to mention but a few.

A main limitation of the technique described in [12, 13] is that in many cases of interest the penalty function cannot be easily computed. This makes it difficult to solve the associated regularization problem. For example, [12, 13] proposes to use block coordinate descent, but this method is feasible only for limited choices of the set $\Lambda$. The second contribution of this paper is an efficient accelerated proximal point method to solve regularized least squares with the penalty function $\Omega(\cdot \,|\, \Lambda)$ in the general case of the sets $\Lambda$ described above. The method combines a fast fixed point iterative scheme, which is inspired by recent work by [14], with an accelerated first order method equivalent to FISTA [4]. We present a numerical study of the efficiency of the proposed method and a statistical comparison of the proposed penalty functions with the greedy method of [9], the Lasso and the Group Lasso.

Recently, there has been significant research interest in structured sparsity and the literature on this subject is growing fast; see, for example, [1, 9, 10, 11, 26] and references therein for an indicative list of papers. In this work, we mainly focus on convex penalty methods and compare them to greedy methods [3, 9]. The latter provide a natural extension of techniques proposed in the signal processing community and, as argued in [9], allow for a significant performance improvement over more generic sparsity models such as the Lasso or the Group Lasso [26]. The former methods have until recently focused mainly on extending the Group Lasso, by considering the possibility that the groups overlap according to certain hierarchical structures [11, 27].
Very recently, general choices of convex penalty functions have been proposed [2, 12, 13]. In this paper we build upon [12, 13], providing both new instances of the penalty function and improved optimization algorithms.

The paper is organized as follows. In Section 2, we set our notation, review the method of [12, 13] and recall some basic facts from convex analysis. In Section 3, we provide some general insights on the method and introduce two new classes of sets $\Lambda$. In Section 4, we present our technique for computing the proximity operator of the penalty function and the resulting accelerated proximal method. In Section 5, we assess the efficiency and the statistical performance of the method via some numerical experiments.

2 Background

In this section, we introduce our notation, review the learning method which we study in this paper and recall some basic facts from convex analysis.

We let $\mathbb{R}_+$ and $\mathbb{R}_{++}$ be the nonnegative and positive real line, respectively. For every $\beta \in \mathbb{R}^n$ we define $|\beta| \in \mathbb{R}^n_+$ to be the vector $|\beta| = (|\beta_i|)_{i=1}^n$. For every $p \geq 1$, we define the $\ell_p$ norm of $\beta$ as $\|\beta\|_p = \left(\sum_{i=1}^n |\beta_i|^p\right)^{1/p}$. We denote by $\langle \cdot, \cdot \rangle$ the standard inner product in $\mathbb{R}^n$, that is, if $x, t \in \mathbb{R}^n$, then $\langle x, t \rangle = \sum_{i=1}^n x_i t_i$.
If $C \subseteq \mathbb{R}^n$, we denote by $\delta_C : \mathbb{R}^n \to \mathbb{R}$ the indicator function of the set $C$, that is, $\delta_C(x) = 0$ if $x \in C$ and $\delta_C(x) = +\infty$ otherwise. Finally, if $\varphi : \mathbb{R}^n \to \mathbb{R}$ is a convex function, we use $\partial \varphi(x)$ to denote the subdifferential of $\varphi$ at $x \in \mathbb{R}^n$ [21].

We now review the structured sparsity approach of [12, 13]. Given an $m \times n$ input data matrix $X$ and an output vector $y \in \mathbb{R}^m$, obtained from the linear regression model $y = X\beta^* + \xi$ discussed earlier, they consider the optimization problem

$$\inf\left\{ \frac{1}{2} \|X\beta - y\|_2^2 + \rho\, \Gamma(\beta, \lambda) : \beta \in \mathbb{R}^n, \lambda \in \Lambda \right\} \qquad (2.1)$$

where $\rho$ is a positive parameter, $\Lambda$ is a prescribed convex subset of the positive orthant $\mathbb{R}^n_{++}$ and the function $\Gamma : \mathbb{R}^n \times \mathbb{R}^n_{++} \to \mathbb{R}$ is given by the formula

$$\Gamma(\beta, \lambda) = \frac{1}{2} \sum_{i=1}^n \left( \frac{\beta_i^2}{\lambda_i} + \lambda_i \right).$$

Note that the infimum over $\lambda$ in general is not attained; however, the infimum over $\beta$ is always attained. Since the auxiliary vector $\lambda$ appears only in the second term and our goal is to estimate $\beta^*$, we may also directly consider the regularization problem

$$\min\left\{ \frac{1}{2} \|X\beta - y\|_2^2 + \rho\, \Omega(\beta) : \beta \in \mathbb{R}^n \right\}, \qquad (2.2)$$

where the penalty function takes the form $\Omega(\beta) = \inf\{\Gamma(\beta, \lambda) : \lambda \in \Lambda\}$. This problem is still convex because the function $\Gamma$ is jointly convex [5]. Also, note that the function $\Omega$ is independent of the signs of the components of $\beta$.

For a generic convex set $\Lambda$, since the penalty function $\Omega$ is not easily computable, one needs to deal directly with problem (2.1). To this end, we recall here the definition of the proximity operator [15].

Definition 2.1. Let $\varphi$ be a real-valued convex function on $\mathbb{R}^d$. The proximity operator of $\varphi$ is defined, for every $t \in \mathbb{R}^d$, by

$$\mathrm{prox}_\varphi(t) := \mathrm{argmin}\left\{ \frac{1}{2} \|z - t\|_2^2 + \varphi(z) : z \in \mathbb{R}^d \right\}.$$

The proximity operator is well-defined, because the above minimum exists and is unique.
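To make Definition 2.1 concrete (an illustration of ours, not part of the paper), consider $\varphi(z) = \rho\|z\|_1$: its proximity operator is the familiar soft-thresholding operator, and the closed form can be checked against the defining minimization by a brute-force grid search. NumPy is assumed; the function names are our own.

```python
import numpy as np

def soft_threshold(t, rho):
    """prox of rho * ||.||_1: componentwise shrinkage of t toward zero by rho."""
    return np.sign(t) * np.maximum(np.abs(t) - rho, 0.0)

def prox_bruteforce(t, rho):
    """Approximate the prox by minimizing 0.5*(z - t_i)^2 + rho*|z| on a fine grid."""
    grid = np.linspace(-5.0, 5.0, 200001)
    return np.array([grid[np.argmin(0.5 * (grid - ti) ** 2 + rho * np.abs(grid))]
                     for ti in t])

t = np.array([2.5, -0.3, 0.0, -1.7])
rho = 0.5
closed = soft_threshold(t, rho)  # componentwise: [2.0, 0.0, 0.0, -1.2]
assert np.allclose(closed, prox_bruteforce(t, rho), atol=1e-4)
```

Because the minimization in Definition 2.1 separates across components for this $\varphi$, the scalar grid search per component suffices as a check.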
3 The set $\Lambda$

In this section, we first provide some general insights on how the set $\Lambda$ can favour certain sparsity patterns on $\beta$ and, secondly, we introduce two new classes of sets $\Lambda$ that can be used in many relevant applications.

We begin by noting that, for every $\lambda \in \mathbb{R}^n_{++}$, the quadratic function $\Gamma(\cdot, \lambda)$ provides an upper bound to the $\ell_1$ norm, namely it holds that $\Omega(\beta) \geq \|\beta\|_1$, and the inequality is tight if and only if $|\beta| \in \Lambda$. This fact is an immediate consequence of the arithmetic-geometric mean inequality. In particular, we see that if we choose $\Lambda = \mathbb{R}^n_{++}$, the method (2.2) reduces to the Lasso.¹ The above observation suggests a heuristic interpretation of the method (2.2): among all vectors $\beta$ which have a fixed value of the $\ell_1$ norm, the penalty function $\Omega$ will encourage those for which $|\beta| \in \Lambda$. Moreover, when $|\beta| \in \Lambda$ the function $\Omega$ reduces to the $\ell_1$ norm and, so, the solution of problem (2.2) is expected to be sparse. The penalty function therefore will encourage certain desired sparsity patterns.

The last point can be better understood by looking at problem (2.1). For every solution $(\hat\beta, \hat\lambda)$, the sparsity pattern of $\hat\beta$ is contained in the sparsity pattern of $\hat\lambda$, that is, the indices associated with nonzero components of $\hat\beta$ are a subset of those of $\hat\lambda$. Indeed, if $\hat\lambda_i = 0$ it must hold that $\hat\beta_i = 0$ as well, since the objective would diverge otherwise (because of the ratio $\beta_i^2 / \lambda_i$). Therefore, if the set $\Lambda$ favors certain sparse solutions $\hat\lambda$, the same sparsity pattern will be reflected on $\hat\beta$. Moreover, the regularization term $\sum_i \lambda_i$ favors sparse vectors $\lambda$. For example, a constraint of the form $\lambda_1 \geq \cdots \geq \lambda_n$ favors consecutive zeros at the end of $\lambda$ and non-zeros everywhere else. This will lead to zeros at the end of $\beta$ as well.
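The arithmetic-geometric mean bound above is easy to verify numerically. The sketch below (our illustration, not code from the paper; NumPy assumed) checks that $\Gamma(\beta, \lambda) \geq \|\beta\|_1$ for an arbitrary positive $\lambda$, and that equality holds at $\lambda = |\beta|$:

```python
import numpy as np

def gamma(beta, lam):
    """Gamma(beta, lambda) = 0.5 * sum_i (beta_i^2 / lambda_i + lambda_i)."""
    return 0.5 * np.sum(beta ** 2 / lam + lam)

rng = np.random.default_rng(0)
beta = rng.standard_normal(10)               # nonzero with probability one
lam = rng.uniform(0.1, 2.0, size=10)         # an arbitrary positive lambda

l1 = np.sum(np.abs(beta))
assert gamma(beta, lam) >= l1 - 1e-12        # Gamma upper-bounds the l1 norm
assert abs(gamma(beta, np.abs(beta)) - l1) < 1e-12  # tight at lambda = |beta|
```

Since $\Gamma(\beta, \cdot)$ is minimized over the whole positive orthant at $\lambda_i = |\beta_i|$, this also illustrates why the unconstrained choice $\Lambda = \mathbb{R}^n_{++}$ recovers the Lasso.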
Thus, in many cases like this, it is easy to incorporate a convex constraint on $\lambda$, whereas it may not be possible to do the same directly on $\beta$.

In this paper, we consider sets $\Lambda$ of the form

$$\Lambda = \{\lambda \in \mathbb{R}^n_{++} : A\lambda \in S\}$$

where $S$ is a convex set and $A$ is a $k \times n$ matrix. Two main choices of interest are when $S$ is a convex cone or the unit ball of a norm. We shall refer to the corresponding set $\Lambda$ as a conic constraint or norm constraint set, respectively. We next discuss two specific examples, which highlight the flexibility of our approach and help us understand the sparsity patterns favoured by each choice.

Within the conic constraint sets, we may choose $S = \mathbb{R}^k_{++}$, so that $\Lambda = \{\lambda \in \mathbb{R}^n_{++} : A\lambda \geq 0\}$. For example, in [12, 13], they considered the set $\Lambda = \{\lambda \in \mathbb{R}^n_{++} : \lambda_1 \geq \cdots \geq \lambda_n\}$ and derived an explicit formula for the corresponding regularizer $\Omega(\cdot \,|\, \Lambda)$. This set can be used to encourage hierarchical sparsity. Note that, for a generic matrix $A$, the penalty function cannot be computed explicitly. In the next section, we present an optimization method that overcomes this difficulty.

Within the norm constraint sets, we may choose $S$ to be the $\ell_1$ unit ball and $A$ the edge map of a graph $G$ with edge set $E$, so that

$$\Lambda = \left\{\lambda \in \mathbb{R}^n_{++} : \sum_{(i,j) \in E} |\lambda_i - \lambda_j| \leq 1\right\}.$$

This set can be used to encourage sparsity patterns consisting of few connected regions/subgraphs of the graph $G$. For example, if $G$ is a 1D grid we have that $\Lambda = \{\lambda \in \mathbb{R}^n_{++} : \sum_{i=1}^{n-1} |\lambda_{i+1} - \lambda_i| \leq 1\}$, so the corresponding penalty will favour vectors which are constant within few connected regions.

¹ More generally, method (2.2) includes the Group Lasso method; see [12, 13].

4 Optimization Method

In this section, we discuss how to solve problem (2.1) using an accelerated first-order method that scales linearly with respect to the problem size, as we later show in the experiments.
This method relies on the computation of the proximity operator of the function $\Gamma$, restricted to $\mathbb{R}^n \times \Lambda$. Since the exact computation of the proximity operator is possible only in simple special cases of sets $\Lambda$, we present here an efficient fixed-point algorithm for computing the proximity operator that can be applied to a wide variety of constraints. Finally, we discuss an accelerated proximal method that leverages our algorithm.

4.1 Computation of the Proximity Operator

According to Definition 2.1, the proximity operator of $\Gamma$ at $(\alpha, \mu) \in \mathbb{R}^n \times \mathbb{R}^n$ is the solution of the problem

$$\min\left\{ \frac{1}{2} \|(\beta, \lambda) - (\alpha, \mu)\|_2^2 + \rho\, \Gamma(\beta, \lambda) : \beta \in \mathbb{R}^n, \lambda \in \Lambda \right\}. \qquad (4.1)$$

For fixed $\lambda$, a direct computation yields that the objective function in (4.1) attains its minimum at

$$\beta_i(\lambda) = \frac{\alpha_i \lambda_i}{\lambda_i + \rho}.$$

Using this equation we obtain the simplified problem

$$\min\left\{ \frac{1}{2} \|\lambda - \mu\|^2 + \frac{\rho}{2} \sum_{i=1}^n \left( \frac{\alpha_i^2}{\lambda_i + \rho} + \lambda_i \right) : \lambda \in \Lambda \right\}. \qquad (4.2)$$

This problem can still be interpreted as a proximity map computation. We discuss how to solve it under our general assumption $\Lambda = \{\lambda : \lambda \in \mathbb{R}^n_{++}, A\lambda \in S\}$. Moreover, we assume that the projection on the set $S$ can be easily computed. To this end, we define the $(n+k) \times n$ matrix

$$B = \begin{pmatrix} I \\ A \end{pmatrix}$$

and the function $\varphi(s, t) = \varphi_1(s) + \varphi_2(t)$, $(s, t) \in \mathbb{R}^n \times \mathbb{R}^k$, where

$$\varphi_1(s) = \frac{\rho}{2} \sum_{i=1}^n \left( \frac{\alpha_i^2}{s_i + \rho} + s_i + \delta_{\mathbb{R}_{++}}(s_i) \right)$$

and $\varphi_2(t) = \delta_S(t)$. Note that the solution of problem (4.2) is the same as the proximity map of the linearly composite function $\varphi \circ B$ at $\mu$, which solves the problem

$$\min\left\{ \frac{1}{2} \|\lambda - \mu\|^2 + \varphi(B\lambda) : \lambda \in \mathbb{R}^n \right\}.$$

At first sight this problem seems difficult to solve. However, it turns out that if the proximity map of the function $\varphi$ has a simple form, the following theorem, adapted from [14, Theorem 3.1], can be used to accomplish this task. For ease of notation we set $d = n + k$.

Theorem 4.1.
Let $\varphi$ be a convex function on $\mathbb{R}^d$, $B$ a $d \times n$ matrix, $\mu \in \mathbb{R}^n$, $c > 0$, and define the mapping $H : \mathbb{R}^d \to \mathbb{R}^d$ at $v \in \mathbb{R}^d$ as

$$H(v) = (I - \mathrm{prox}_{\varphi/c})\left((I - cBB^\top)v + B\mu\right).$$

Then, for any fixed point $\hat v$ of $H$, it holds that

$$\mathrm{prox}_{\varphi \circ B}(\mu) = \mu - cB^\top \hat v.$$

The Picard iterates $\{v^s : s \in \mathbb{N}\} \subseteq \mathbb{R}^d$, starting at $v^0 \in \mathbb{R}^d$, are defined by the recursive equation $v^s = H(v^{s-1})$. Since the operator $I - \mathrm{prox}_\varphi$ is nonexpansive² (see e.g. [6]), the map $H$ is nonexpansive if $c \in \left(0, \frac{2}{\|B\|^2}\right)$. Because of this, the Picard iterates are not guaranteed to converge to a fixed point of $H$. However, a simple modification with an averaging scheme can be used to compute the fixed point.

Theorem 4.2. [19] Let $H : \mathbb{R}^d \to \mathbb{R}^d$ be a nonexpansive mapping which has at least one fixed point and let $H_\kappa := \kappa I + (1 - \kappa) H$. Then, for every $\kappa \in (0, 1)$, the Picard iterates of $H_\kappa$ converge to a fixed point of $H$.

The required proximity operator of $\varphi$ is directly given, for every $(s, t) \in \mathbb{R}^n \times \mathbb{R}^k$, by

$$\mathrm{prox}_\varphi(s, t) = \left(\mathrm{prox}_{\varphi_1}(s), \mathrm{prox}_{\varphi_2}(t)\right).$$

Both $\mathrm{prox}_{\varphi_1}$ and $\mathrm{prox}_{\varphi_2}$ can be easily computed. The latter requires computing the projection on the set $S$. The former requires, for each component of the vector $s \in \mathbb{R}^n$, the solution of a cubic equation, as stated in the next lemma.

Lemma 4.1. For every $\mu, \alpha \in \mathbb{R}$ and $r, \rho > 0$, the function $h : \mathbb{R}_+ \to \mathbb{R}$ defined at $s$ as $h(s) := (s - \mu)^2 + r\left(\frac{\alpha^2}{s + \rho} + s\right)$ has a unique minimum on its domain, which is attained at $(x_0 - \rho)_+$, where $x_0$ is the largest real root of the polynomial

$$2x^3 + (r - 2(\mu + \rho))x^2 - r\alpha^2.$$

Proof. Setting the derivative of $h$ equal to zero and making the change of variable $x = s + \rho$ yields the polynomial stated in the lemma. Let $x_0$ be the largest root of this polynomial. Since the function $h$ is strictly convex on its domain and grows at infinity, its minimum can be attained only at one point, which is $x_0 - \rho$ if $x_0 > \rho$, and zero otherwise.
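Lemma 4.1 translates directly into a componentwise routine for $\mathrm{prox}_{\varphi_1}$. The following sketch (ours, not the authors' code; it uses NumPy's `roots` to solve the cubic, though any real-root finder would do) implements the lemma and checks the minimizer against a grid search over $h$:

```python
import numpy as np

def prox_h(mu, alpha, r, rho):
    """Minimizer of h(s) = (s - mu)^2 + r*(alpha^2/(s + rho) + s) over s >= 0.
    Per Lemma 4.1: (x0 - rho)_+, where x0 is the largest real root of
    2 x^3 + (r - 2(mu + rho)) x^2 - r alpha^2."""
    coeffs = [2.0, r - 2.0 * (mu + rho), 0.0, -r * alpha ** 2]
    roots = np.roots(coeffs)
    x0 = max(z.real for z in roots if abs(z.imag) < 1e-8)
    return max(x0 - rho, 0.0)

def h(s, mu, alpha, r, rho):
    return (s - mu) ** 2 + r * (alpha ** 2 / (s + rho) + s)

mu, alpha, r, rho = 1.3, 0.7, 0.5, 0.2
s_star = prox_h(mu, alpha, r, rho)
grid = np.linspace(0.0, 10.0, 100001)
# The closed-form minimizer is at least as good as the best grid point.
assert s_star >= 0.0
assert h(s_star, mu, alpha, r, rho) <= h(grid, mu, alpha, r, rho).min() + 1e-9
```

Note that for $\alpha \neq 0$ the polynomial is negative at $x = 0$ and tends to $+\infty$, so its largest real root is positive, matching the lemma.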
4.2 Accelerated Proximal Method

Theorem 4.1 motivates a proximal numerical approach (Algorithm 1 below) to solving problem (2.1) and, in turn, problem (2.2). Let $E(\beta) = \frac{1}{2}\|X\beta - y\|_2^2$ and assume an upper bound $L$ of $\|X^\top X\|$ is known.³ Proximal first-order methods (see [4, 6, 17, 23] and references therein) can be used for nonsmooth optimization, where the objective consists of the sum of a smooth term and a non-smooth term, in our case $E$ and $\Gamma + \delta_\Lambda$, respectively. The idea is to replace $E$ with its linear approximation around a point $w_t$ specific to iteration $t$. This leads to the computation of a proximity operator, and specifically in our case to

$$u_t := (\beta_t, \lambda_t) \leftarrow \mathrm{argmin}\left\{ \frac{L}{2} \left\|(\beta, \lambda) - \left(w_t - \frac{1}{L}\nabla E(w_t)\right)\right\|_2^2 + \rho\, \Gamma(\beta, \lambda) : \beta \in \mathbb{R}^n, \lambda \in \Lambda \right\}.$$

Subsequently, the point $w_t$ is updated, based on the current and previous estimates of the solution $u_t, u_{t-1}, \ldots$, and the process repeats.

Algorithm 1: Proximal structured sparsity algorithm (NEPIO).
  $u_1, w_1 \leftarrow$ arbitrary feasible values
  for $t = 1, 2, \ldots$ do
    compute a fixed point $\hat v^{(t)}$ of $H_t$ by Picard-Opial
    $u_{t+1} \leftarrow w_t - \frac{1}{L}\nabla E(w_t) - \frac{c}{L} B^\top \hat v^{(t)}$
    $w_{t+1} \leftarrow \pi_{t+1} u_{t+1} - (\pi_{t+1} - 1) u_t$
  end for

The simplest (and a commonly used) update rule is $w_t = u_t$. By contrast, accelerated proximal methods proposed by [17] use a carefully chosen $w$ update with two levels of memory, $u_t, u_{t-1}$. If the proximity map can be exactly computed, such schemes exhibit a fast quadratic decay in terms of the iteration count, that is, the distance of the objective from the minimal value is $O\!\left(\frac{1}{T^2}\right)$ after $T$ iterations. In the case that the proximity operator is computed numerically, it has been shown only very recently [22, 24] that, under some circumstances, the accelerated method still converges with the rate $O\!\left(\frac{1}{T^2}\right)$.

² A mapping $T : \mathbb{R}^d \to \mathbb{R}^d$ is said to be nonexpansive if $\|T(v) - T(v')\|_2 \leq \|v - v'\|_2$ for every $v, v' \in \mathbb{R}^d$.
The main advantages of accelerated methods are their low cost per iteration and their scalability to large problem sizes. Moreover, in applications where a thresholding operator is involved, as in Lemma 4.1, the zeros in the solution are exact, which may be desirable.

For our purposes, we use a version of accelerated methods influenced by [23] (described in Algorithm 1). According to Nesterov, the optimal update is

$$w_{t+1} \leftarrow u_{t+1} + \theta_{t+1}\left(\frac{1}{\theta_t} - 1\right)(u_{t+1} - u_t)$$

where the sequence $\theta_t$ is defined by $\theta_1 = 1$ and the recursive equation

$$\frac{1 - \theta_{t+1}}{\theta_{t+1}^2} = \frac{1}{\theta_t^2}.$$

We have adapted [23, Algorithm 2] (equivalent to FISTA [4]) by computing the proximity operator of $\frac{\varphi}{L} \circ B$ using the Picard-Opial process described in Section 4.1. We rephrased the algorithm using the sequence $\pi_t := 1 - \theta_t + \sqrt{1 - \theta_t} = 1 - \theta_t + \frac{\theta_t}{\theta_{t-1}}$ for numerical stability. At each iteration, the map $H_t$ is defined by

$$H_t(v) := (I - \mathrm{prox}_{\varphi/c})\left(\left(I - \frac{c}{L} B B^\top\right)v - \frac{1}{L} B\left(\nabla E(w_t) - L w_t\right)\right).$$

We also apply an Opial averaging, so that the update at stage $s$ of the proximity computation is $v^{s+1} = \kappa v^s + (1 - \kappa) H_t(v^s)$. By Theorem 4.1, the fixed point process combined with the assignment of $u$ is equivalent to

$$u_{t+1} \leftarrow \mathrm{prox}_{\frac{\varphi}{L} \circ B}\left(w_t - \frac{1}{L} \nabla E(w_t)\right).$$

The reason for resorting to Picard-Opial is that exact computation of the proximity operator (4.2) is possible only in simple special cases of the set $\Lambda$. By contrast, our approach can be applied to a wide variety of constraints. Moreover, we are not aware of another proximal method for solving problems (2.1) or (2.2), and alternatives like interior point methods do not scale well with problem size. In the next section, we will demonstrate empirically the scalability of Algorithm 1, as well as the efficiency of both the proximity map computation and the overall method.

³ For variants of such algorithms which adaptively learn $L$, see the above references.
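The Picard-Opial machinery can be sanity-checked on a toy instance where the composite proximity operator is known exactly. In the sketch below (our construction, not from the paper) $\varphi = \|\cdot\|_1$ and $B$ is orthogonal, so $\mathrm{prox}_{\varphi \circ B}(\mu) = B^\top \mathrm{prox}_\varphi(B\mu)$ in closed form; the averaged fixed-point iteration of Theorems 4.1 and 4.2 must reproduce this value via $\mu - cB^\top \hat v$:

```python
import numpy as np

def soft(u, tau):
    """prox of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

rng = np.random.default_rng(1)
n = 8
B, _ = np.linalg.qr(rng.standard_normal((n, n)))  # orthogonal B, so ||B||^2 = 1
mu = 3.0 * rng.standard_normal(n)
c, kappa = 0.5, 0.2                               # c in (0, 2/||B||^2); paper uses kappa = 0.2

# H(v) = (I - prox_{phi/c})((I - c B B^T) v + B mu), with phi = ||.||_1,
# so prox_{phi/c}(u) = soft(u, 1/c).
def H(v):
    z = (v - c * B @ (B.T @ v)) + B @ mu
    return z - soft(z, 1.0 / c)

v = np.zeros(n)
for _ in range(500):                              # Opial-averaged Picard iterates
    v = kappa * v + (1.0 - kappa) * H(v)

p_fixed_point = mu - c * B.T @ v                  # Theorem 4.1
p_closed_form = B.T @ soft(B @ mu, 1.0)           # exact prox for orthogonal B
assert np.allclose(p_fixed_point, p_closed_form, atol=1e-8)
```

With an orthogonal $B$ the averaged iteration is a componentwise contraction, so a few hundred iterations converge to machine precision; for a general $B$ (such as the $B = (I; A)$ used above), only the nonexpansiveness bound on $c$ and the Opial averaging are available, which is exactly why Theorem 4.2 is needed.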
5 Numerical Simulations

In this section, we present experiments with method (2.1). The main goal of the experiments is to study both the computational and the statistical estimation properties of this method. One important aim of the experiments is to demonstrate that the method is statistically competitive or superior to state-of-the-art methods while being computationally efficient. The methods employed are the "Lasso", "StructOMP" [9], "GL1", the Group Lasso variant presented in [10], and "GL2", a Group Lasso with overlapping groups. For both Group Lasso methods, we used as groups all sets of 4 contiguous variables (1D) or the sets of all neighbours of each variable (2D). Moreover, we used method (2.1) with the following choices for the constraint set $\Lambda$:

- $\Lambda = \{\lambda : \|A\lambda\|_1 \leq \alpha\}$, where $A$ is the edge map of a 1D or 2D grid; we refer to the corresponding method as "Grid-C".
- $\Lambda = \{\lambda : A\lambda \geq 0\}$, where $A$ is the edge map of a tree graph; we refer to the corresponding method as "Tree-C".

We solved the optimization problem (2.1) either with the toolbox CVX or with the proximal method presented in Section 4. When using the proximal method, we found that setting the parameter $\kappa$ from Opial's theorem to $0.2$ gave the best results, even though in [14] they show that convergence of the fixed-point iterations is guaranteed also for $\kappa = 0$. Our main stopping criterion is based on the decrease in the objective value of (2.1), which must be less than $10^{-8}$. For the computation of the proximity operator, we stopped the iterations of the Picard-Opial method when the relative difference between two consecutive iterates is smaller than $10^{-2}$. We study the effect of varying this tolerance in the experiments below. We used the square loss and computed the Lipschitz constant $L$ using singular value decomposition (if this were not possible, a Frobenius estimate could be used). Finally, the implementation ran on an 8GB memory quad-core Intel machine.
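To make the Grid-C constraint concrete (our own sketch, not the authors' code), the edge map $A$ of a 1D grid is just the first-difference matrix, so $\|A\lambda\|_1$ is the total variation of $\lambda$: a piecewise-constant vector spends the budget $\alpha$ only at its jumps.

```python
import numpy as np

def edge_map_1d(n):
    """Edge map of the 1D grid with n nodes: the (n-1) x n first-difference
    matrix, so (A @ lam)[i] = lam[i+1] - lam[i]."""
    A = np.zeros((n - 1, n))
    idx = np.arange(n - 1)
    A[idx, idx] = -1.0
    A[idx, idx + 1] = 1.0
    return A

n = 12
A = edge_map_1d(n)

# A piecewise-constant lambda with two jumps: ||A lam||_1 equals the sum of
# the jump magnitudes, 0.5 + 0.4.
lam = np.concatenate([np.full(4, 0.1), np.full(5, 0.6), np.full(3, 0.2)])
tv = np.abs(A @ lam).sum()
print(round(tv, 6))  # 0.9, so lam is feasible for Grid-C with alpha = 1
```

The 2D variant used in the experiments would stack one such difference matrix per grid row and column; the tree edge map for Tree-C is built analogously, with one row per parent-child edge.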
5.1 Efficiency experiments

First, we investigated the computational properties of the proximal method (NEPIO). Our aim in these experiments was to show that our algorithm has a time complexity that scales linearly with the number of variables, while the relative number of training examples is kept constant. We considered both the Grid and the Tree constraints and compared our algorithm to the toolbox CVX, which is an interior-point method solver. As is commonly known, interior-point methods are very fast, but do not scale well with the problem size. We also compared to the non-accelerated version of our algorithm, similar to ISTA [4, 6]. ISTA has been shown [6] to converge in the presence of very general, but summable, errors in the computation of the proximity operator. In the case of the Tree constraint, we further compared with an adapted version of the alternating algorithm (AA) of [12, 13]. For each problem size, we repeated the experiments 10 times and we report the average computation time in Figure 1 for Grid-C and Tree-C. All methods achieve objective values that are within 1% of each other, apart from ISTA, which sometimes did not converge after $10^5$ iterations. The proposed method scales linearly with the problem size, making it suitable for large scale experiments.

In order to empirically assess the importance of the Picard-Opial tolerance for converging to a good solution, we considered a problem with 100 variables for both the Grid and the Tree constraints and repeated the experiments 100 times with different samplings of the training examples. For each constraint, we evaluated the average distance from the solution obtained by our method with different values of the Picard-Opial tolerance to the solution obtained by CVX. We did not observe any improvement in the solution by decreasing a fixed tolerance from $10^{-2}$ to $10^{-8}$, or by setting the tolerance to decrease as $1/T^\alpha$ with $\alpha$ equal to $1$, $1.5$ or $2$, as suggested by very recent results [22, 24].
However, decreasing the tolerance remarkably increased the computational overhead, from an average of 5 s for a fixed tolerance of $10^{-2}$ to 80 s for $1/T^2$ in the case of the Grid constraint. Finally, we considered the 2D Grid-C case and observed