A Risk Ratio Comparison of $l_0$ and $l_1$ Penalized Regression

Dongyu Lin (dongyu@wharton.upenn.edu)
Dean P. Foster (dean@foster.net)
Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA

Lyle H. Ungar (ungar@cis.upenn.edu)
Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA

Editor:

Abstract

In the past decade, there has been an explosion of interest in using $l_1$-regularization in place of $l_0$-regularization for feature selection. We present theoretical results showing that while $l_1$-penalized linear regression never outperforms $l_0$-regularization by more than a constant factor, in some cases using an $l_1$ penalty is infinitely worse than using an $l_0$ penalty. We also compare algorithms for solving these two problems and show that although solutions can be found efficiently for the $l_1$ problem, the "optimal" $l_1$ solutions are often inferior to $l_0$ solutions found using classic greedy stepwise regression. Furthermore, we show that solutions obtained by solving the convex $l_1$ problem can be improved by selecting the best of the $l_1$ models (for different regularization penalties) using an $l_0$ criterion.

Keywords: Variable Selection, Regularization, Stepwise Regression

[Author note: add citation to our workshop paper, Lin et al. (2008).]

1 Introduction

In the past decade, a rich literature has developed around $l_1$-regularization for linear regression, including the lasso (Tibshirani, 1996), LARS (Efron et al., 2004), the fused lasso (Tibshirani et al., 2005), the grouped lasso (Yuan and Lin, 2006), the relaxed lasso (Meinshausen, 2007), and the elastic net (Zou and Hastie, 2005). These methods, like the $l_0$-penalized regression methods which preceded them (Akaike, 1973; Schwarz, 1978; Foster and George, 1994), address variable selection problems in which there is a large set of potential features, only a few of which are likely to be helpful. This type of sparsity is common in machine learning tasks, such as predicting disease based on thousands of genes, or predicting the topic of a document based on the occurrences of hundreds of thousands of words.

$l_1$-regularization is popular because, unlike the $l_0$ regularization historically used for feature selection in regression problems, the $l_1$ penalty gives rise to a convex problem that can be solved efficiently using convex optimization methods. $l_1$ methods have given reasonable results on a number of data sets, but there has been no careful analysis of how they perform when compared to $l_0$ methods. This paper provides a formal analysis of the two methods and shows that $l_1$ can give arbitrarily worse models. We offer some intuition as to why this is the case ($l_1$ shrinks coefficients too much and does not zero out enough of them) and suggest how to use an $l_0$ penalty with $l_1$ optimization.

We consider the classic normal linear model

  $y = X\beta + \varepsilon$,

with $n$ observations $y = (y_1, \ldots, y_n)'$ and $p$ features $x_1, \ldots, x_p$, $p \gg n$, where $X = (x_1, \ldots, x_p)$ is an $n \times p$ design matrix of features, $\beta = (\beta_1, \ldots, \beta_p)'$ is the vector of coefficients, and the error $\varepsilon \sim N(0, \sigma^2 I_n)$. We assume that only a subset of $\{x_j\}_{j=1}^p$ has nonzero coefficients.
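As a concrete illustration of this setup, the following Python sketch simulates data from such a sparse model. The dimensions n = 100 and p = 1000, the number k = 5 of nonzero coefficients, and their value 3.0 are arbitrary choices for illustration, not values used in the paper.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem sizes: p >> n, with only k truly nonzero coefficients.
n, p, k, sigma = 100, 1000, 5, 1.0

X = rng.standard_normal((n, p))        # n x p design matrix of features
beta = np.zeros(p)
beta[:k] = 3.0                         # only a small subset of coefficients is nonzero
eps = sigma * rng.standard_normal(n)   # epsilon ~ N(0, sigma^2 I_n)
y = X @ beta + eps                     # classic normal linear model y = X beta + epsilon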
The traditional statistical approach to this problem, namely the $l_0$ regularization problem, finds an estimator that minimizes the $l_0$-penalized sum of squared errors

  $\operatorname{argmin}_{\beta} \; \|y - X\beta\|^2 + \lambda_0 \|\beta\|_{l_0}$,   (1)

where $\|\beta\|_{l_0} = \sum_{i=1}^p I_{\{\beta_i \neq 0\}}$ counts the number of nonzero coefficients. However, this problem is NP-hard (Natarajan, 1995). A tractable alternative relaxes the $l_0$ penalty to the $l_1$ norm $\|\beta\|_{l_1} = \sum_{i=1}^p |\beta_i|$ and seeks

  $\operatorname{argmin}_{\beta} \; \|y - X\beta\|^2 + \lambda_1 \|\beta\|_{l_1}$;   (2)

this is known as the $l_1$-regularization problem (Tibshirani, 1996). Because of convexity, the exact computation of (2) is, even in the worst case, far more efficient (Efron et al., 2004; Candes and Tao, 2007).

We assess our models using the predictive risk function

  $R(\beta, \hat\beta) = E_\beta \|\hat{y} - E(y \mid X)\|_2^2 = E_\beta \|X\hat\beta - X\beta\|_2^2$.   (3)

We are interested in the ratios of the risks of the estimates provided by these two criteria. Unlike the raw prediction error, the predictive risk measures the true prediction error with the irreducible noise variance removed; smaller risk implies better expected performance for future prediction. Much of the recent literature focuses on selection consistency, where the critical question is whether the true variables can be identified. In real applications, however, multicollinearity is prevalent, and highly correlated predictors are hard to separate into "true" and "false" ones. Here we focus on prediction accuracy and advocate the concept of predictive risk.

[Author note: explain risk vs. consistency and the relation of risk to out-of-sample error; what else is this called by other people? See www-stat.wharton.upenn.edu/~stine/research/select.predRisk.pdf; maybe also Barbieri and Berger (2004).]
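To make criteria (1) and (2) and the predictive risk (3) concrete, here is a minimal sketch that continues the simulation above (it reuses X, y, beta, and n from that block). It evaluates the two penalized objectives for a given coefficient vector and fits the convex $l_1$ problem with scikit-learn's Lasso; note that scikit-learn scales the squared error by 1/(2n), so its alpha corresponds roughly to lambda_1/(2n), and the value lambda_1 = 2 below is an arbitrary illustration rather than a recommendation from the paper.

import numpy as np
from sklearn.linear_model import Lasso

def l0_penalty(b):
    return np.count_nonzero(b)            # ||b||_{l0}: number of nonzero coefficients

def l1_penalty(b):
    return np.sum(np.abs(b))              # ||b||_{l1}: sum of absolute coefficients

def penalized_sse(X, y, b, lam, penalty):
    # ||y - X b||^2 + lambda * penalty(b), as in objectives (1) and (2).
    return np.sum((y - X @ b) ** 2) + lam * penalty(b)

# The l0 problem (1) would require a combinatorial search over subsets (NP-hard).
# The l1 problem (2) is convex; scikit-learn's Lasso minimizes
# (1/(2n)) ||y - X b||^2 + alpha ||b||_1, so alpha is roughly lambda_1 / (2n).
lam1 = 2.0
beta_l1 = Lasso(alpha=lam1 / (2 * n), fit_intercept=False, max_iter=50_000).fit(X, y).coef_

print("l1 objective (2) at the lasso solution:", penalized_sse(X, y, beta_l1, lam1, l1_penalty))
print("number of nonzero lasso coefficients:", l0_penalty(beta_l1))
# One draw of the quantity inside the predictive risk (3); averaging such draws over
# repeated simulations of y would approximate R(beta, beta_hat).
print("||X beta_hat - X beta||^2 for this draw:", np.sum((X @ beta_l1 - X @ beta) ** 2))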
Our first result in this paper, given below as Theorems 1 and 2, is that $l_0$ estimates provide more accurate predictions than $l_1$ estimates do, in the sense of minimax risk ratios, as illustrated in Figure 1:

- $\inf_{\gamma_0} \sup_{\beta} R(\beta, \hat\beta_{l_0}(\gamma_0)) / R(\beta, \hat\beta_{l_1}(\gamma_1))$ is bounded by a small constant; furthermore, it is close to one for most values of $\gamma_1$, especially the large values used in sparse systems.

- $\inf_{\gamma_1} \sup_{\beta} R(\beta, \hat\beta_{l_1}(\gamma_1)) / R(\beta, \hat\beta_{l_0}(\gamma_0))$ tends to infinity quadratically; in an extremely sparse system, the $l_1$ estimate may perform arbitrarily badly.

- $R(\beta, \hat\beta_{l_1}(\gamma_1))$ is more likely to be larger than $R(\beta, \hat\beta_{l_0}(\gamma_0))$.

[Figure 1 appears here: three panels, titled "Log(Risk Ratio) Range", "Sup(l0Risk/l1Risk)", and "Sup(l1Risk/l0Risk)", plotted against $\gamma_1$, $\gamma_1$, and $\gamma_0$ respectively, each showing the calibrated-at-zero and optimal curves.]

Figure 1: Left: The gray area shows the feasible region for the risk ratios; the log risk ratio is above zero when $l_0$ produces a better fit. The graph shows that most of the time $l_0$ is better. The actual estimators being compared are those that have the same risk at $\beta = 0$, i.e., $R(0, \hat\beta_{l_0}(\gamma_0)) = R(0, \hat\beta_{l_1}(\gamma_1))$. Middle: This graph traces out the bottom envelope of the left-hand graph (but takes the reciprocal risk ratio and no longer uses the logarithmic scale). The dashed blue line displays $\sup_\beta R(\beta, \hat\beta_{l_0}(\gamma_0)) / R(\beta, \hat\beta_{l_1}(\gamma_1))$ for $\gamma_0$ calibrated to have the same risk at zero as $\gamma_1$. This maximum ratio tends to 1 when $\gamma_1 \to 0$ (the saturated case) or $\gamma_1 \to \infty$ (the sparse case). With an optimal choice of $\gamma_0$, $\inf_{\gamma_0} \sup_\beta R(\beta, \hat\beta_{l_0}(\gamma_0)) / R(\beta, \hat\beta_{l_1}(\gamma_1))$ (solid red line) behaves similarly. Specifically, the supremum over $\gamma_1$ is bounded by 1.8. Right: This graph traces out the upper envelopes of the left-hand graph on a normal scale. When $\gamma_0 \to \infty$, $\sup_\beta R(\beta, \hat\beta_{l_1}(\gamma_1)) / R(\beta, \hat\beta_{l_0}(\gamma_0))$ tends to $\infty$, both for $\gamma_1$ calibrated at $\beta = 0$ and for $\gamma_1$ chosen to minimize the maximum risk ratio.

A detailed discussion of the risk ratios is presented in Section 3, along with a discussion of other advantages of $l_0$ regularization. Our other results include showing that applying the $l_0$ criterion along an $l_1$ subset search path can find the best-performing model (Section 4), and showing, by running stepwise regression and the lasso on a reduced NP-hard example, that stepwise regression gives better solutions (Section 5).

We compare the $l_0$ and $l_1$ penalties under three assumptions about the structure of the feature matrix $X$: independence, incoherence (near independence), and the case in which the $l_0$ problem is NP-hard. For independence, we find: ... For near independence, we find that $l_1$-penalized regression followed by $l_0$ (explain) beats $l_1$ alone, and for the NP-hard case, we find that if one could do the search, then the risk ratio could be arbitrarily bad for $l_1$ relative to $l_0$.

2 Background on Risk Ratio

[Author note: what is it, why is it good, and where has it been used? Risk vs. consistency; relation to out-of-sample error.]

3 Risk Ratio Results

3.1 $l_0$ solutions give more accurate predictions

Suppose that $\hat\beta$ is an estimator of $\beta$. Recall that the predictive risk of $\hat\beta$ is defined as

  $R(\beta, \hat\beta) = E_\beta \|\hat{y} - E(y \mid X)\|_2^2 = E_\beta \|X\hat\beta - X\beta\|_2^2$.

In this section we further consider the case in which $X$ is orthogonal (wavelets, Fourier transforms, and PCA, for example, all give orthogonal designs). The $l_0$ problem (1) can then be solved by simply picking those predictors whose least squares estimates satisfy $|\hat\beta_i| > \gamma$, where the choice of $\gamma$ depends on the penalty $\lambda_0$ in (1). It was shown by Donoho and Johnstone (1994) and Foster and George (1994) that $\lambda_0 = 2\sigma^2 \log p$ is optimal in the sense that it asymptotically minimizes the maximum predictive risk inflation due to selection.

Let

  $\hat\beta_{l_0}(\gamma_0) = \bigl( \hat\beta_1 I_{\{|\hat\beta_1| > \gamma_0\}}, \ldots, \hat\beta_p I_{\{|\hat\beta_p| > \gamma_0\}} \bigr)'$   (4)

be the $l_0$ estimator that solves (1), and let the $l_1$ solution to (2) be

  $\hat\beta_{l_1}(\gamma_1) = \bigl( \operatorname{sign}(\hat\beta_1)(|\hat\beta_1| - \gamma_1)_+, \ldots, \operatorname{sign}(\hat\beta_p)(|\hat\beta_p| - \gamma_1)_+ \bigr)'$,   (5)

where the $\hat\beta_i$ are the least squares estimates.

We are interested in the ratios of the risks of these two estimates,

  $R(\beta, \hat\beta_{l_0}(\gamma_0)) / R(\beta, \hat\beta_{l_1}(\gamma_1))$  and  $R(\beta, \hat\beta_{l_1}(\gamma_1)) / R(\beta, \hat\beta_{l_0}(\gamma_0))$,

that is, we want to know how much the risk is inflated when the other criterion is used. The smaller the risk ratio, the less risky (and hence better) the numerator estimate is compared to the denominator estimate. Specifically, a risk ratio less than one implies that the numerator estimate is better than the denominator estimate.
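In the orthogonal case, both estimators act coordinate-wise on the least squares estimates, so their risks can be explored by simulating a single coordinate $\hat\beta \sim N(\beta, \sigma^2)$. The sketch below approximates the two per-coordinate risks and their ratio by Monte Carlo; the thresholds gamma_0 = 3 and gamma_1 = 2, the value beta = 1.5, and sigma = 1 are arbitrary illustrative choices, not values from the paper.

import numpy as np

rng = np.random.default_rng(1)

def hard_threshold(b_ls, gamma0):
    # l0-style estimator (4): keep a least squares coefficient only if |b_ls| > gamma0.
    return b_ls * (np.abs(b_ls) > gamma0)

def soft_threshold(b_ls, gamma1):
    # l1-style estimator (5): shrink every least squares coefficient toward zero by gamma1.
    return np.sign(b_ls) * np.maximum(np.abs(b_ls) - gamma1, 0.0)

def coordinate_risk(estimator, beta, gamma, sigma=1.0, n_mc=200_000):
    # Monte Carlo approximation of the per-coordinate risk
    # E_beta[(estimator(beta_hat) - beta)^2] with beta_hat ~ N(beta, sigma^2).
    b_ls = beta + sigma * rng.standard_normal(n_mc)
    return np.mean((estimator(b_ls, gamma) - beta) ** 2)

gamma0, gamma1, beta = 3.0, 2.0, 1.5    # illustrative values only
r0 = coordinate_risk(hard_threshold, beta, gamma0)
r1 = coordinate_risk(soft_threshold, beta, gamma1)
print("R(l0) =", r0, " R(l1) =", r1, " ratio R(l0)/R(l1) =", r0 / r1)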
Formally, we have the following theorems, whose proofs are given in the last section.

Theorem 1. There exists a constant $C_1$ such that for any $\gamma_0 \geq 0$,

  $\inf_{\gamma_1} \sup_{\beta} \dfrac{R(\beta, \hat\beta_{l_1}(\gamma_1))}{R(\beta, \hat\beta_{l_0}(\gamma_0))} \geq C_1 + \gamma_0$.   (6)

That is, given $\gamma_0$, for any $\gamma_1$ there exist $\beta$'s for which the ratio becomes extremely large. Contrast this with the protection provided by $l_0$:

Theorem 2. There exists a constant $C_2 > 0$ such that for any $\gamma_1 \geq 0$,

  $\inf_{\gamma_0} \sup_{\beta} \dfrac{R(\beta, \hat\beta_{l_0}(\gamma_0))}{R(\beta, \hat\beta_{l_1}(\gamma_1))} \leq 1 + C_2 \gamma_1^{-1}$.   (7)

These theorems can certainly be strengthened, as the bounds shown in Figure 1 demonstrate, but at the cost of complicating the proofs. We conjecture that there exist constants $r > 1$ and $C_3, C_4, C_5 > 0$ such that

  $\inf_{\gamma_1} \sup_{\beta} \dfrac{R(\beta, \hat\beta_{l_1}(\gamma_1))}{R(\beta, \hat\beta_{l_0}(\gamma_0))} \geq 1 + C_3 \gamma_0^r$,   (8)

  $\inf_{\gamma_0} \sup_{\beta} \dfrac{R(\beta, \hat\beta_{l_0}(\gamma_0))}{R(\beta, \hat\beta_{l_1}(\gamma_1))} \leq 1 + C_4 \gamma_1 e^{-C_5 \gamma_1}$.   (9)

These results suggest that for any $\gamma_1$ chosen by the algorithm, we can always adapt $\gamma_0$ so that $\hat\beta_{l_0}(\gamma_0)$ outperforms $\hat\beta_{l_1}(\gamma_1)$ most of the time and loses only a little for some $\beta$'s; but for any chosen $\gamma_0$, no $\gamma_1$ performs reasonably well on all $\beta$'s.

Because the risk functions are additive under the orthogonality assumption (see appendix equations (14) and (15)), we focus on the behavior of the estimators one coordinate of $\beta$ at a time. The risk functions are also symmetric in $\beta$, so only the case $\beta \geq 0$ is displayed.

Figure 2 illustrates that, given $\gamma_1$, we can pick a $\gamma_0$ such that the risk ratio is below 1 for most $\beta$ except around $(\gamma_0 + \gamma_1)/2$, and even there the ratio does not exceed one by more than a small factor.

[Figure 2 appears here: the risk ratio $l_0$-risk/$l_1$-risk plotted against $\beta/(\gamma_1 + \gamma_0)$ for $\gamma_1 = 2, 4, 6, 10$.]

Figure 2: For each $\gamma_1$, we let $\gamma_0 = \gamma_1 + 4\log(\gamma_1)/\gamma_1$. This choice of $\gamma_0$ makes the risk ratios small at $\beta \approx 0$ and $\beta \geq \gamma_0$, inflated only around $\beta/(\gamma_0 + \gamma_1) = 1/2$, and even there only slightly, especially when $\gamma_1$ is large enough.

The intuition as to why $l_0$ fares better than $l_1$ in the risk ratio results is that $l_1$ must make a "devil's choice" between shrinking the coefficients too much and putting in too many spurious features. $l_0$-penalized regression avoids this problem. This section explains this in more detail.
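A rough numerical check of the calibration used in Figure 2 can reuse the thresholding and Monte Carlo helpers from the previous sketch: fix a gamma_1, set gamma_0 as in the figure, and trace the risk ratio over a grid of beta values. The particular gamma_1 and grid points below are arbitrary illustrations.

# Reuses hard_threshold, soft_threshold, and coordinate_risk from the sketch above.
gamma1 = 4.0                                     # one of the values shown in Figure 2
gamma0 = gamma1 + 4 * np.log(gamma1) / gamma1    # calibration used in Figure 2
for frac in [0.1, 0.3, 0.5, 0.7, 1.0, 1.5]:
    b = frac * (gamma0 + gamma1)
    ratio = (coordinate_risk(hard_threshold, b, gamma0)
             / coordinate_risk(soft_threshold, b, gamma1))
    print(f"beta/(gamma0 + gamma1) = {frac:.2f}:  l0-risk / l1-risk = {ratio:.3f}")
# The printed ratios should sit near or below one, with only a mild bump around
# beta/(gamma0 + gamma1) = 1/2, which is the pattern shown in Figure 2.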