A General Framework for Structured Sparsity via Proximal Optimization
Luca Baldassarre, UCL, London, UK, l.baldassarre@cs.ucl.ac.uk
Jean Morales, UCL, London, UK, j.morales@cs.ucl.ac.uk
Andreas Argyriou, TTI Chicago, USA, argyriou@ttic.edu
Massimiliano Pontil, UCL, London, UK, m.pontil@cs.ucl.ac.uk

Appearing in Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS) 2012, La Palma, Canary Islands. Volume XX of JMLR: W&CP XX. Copyright 2012 by the authors.
Abstract
We study a generalized framework for structured sparsity. It extends the well-known methods of Lasso and Group Lasso by incorporating additional constraints on the variables as part of a convex optimization problem. This framework provides a straightforward way of favouring prescribed sparsity patterns, such as orderings, contiguous regions and overlapping groups, among others. Available optimization methods are limited to specific constraint sets and tend not to scale well with sample size and dimensionality. We propose a first order proximal method, which builds upon results on fixed points and successive approximations. The algorithm can be applied to a general class of conic and norm constraint sets and relies on a proximity operator subproblem which can be computed numerically. Experiments on different regression problems demonstrate state-of-the-art statistical performance, which improves over Lasso, Group Lasso and StructOMP. They also demonstrate the efficiency of the optimization algorithm and its scalability with the size of the problem.
1 Introduction
We study the problem of learning a sparse linear regression model. The goal is to estimate a parameter vector $\beta^* \in \mathbb{R}^n$ from a vector of measurements $y \in \mathbb{R}^m$, obtained from the model $y = X\beta^* + \xi$, where $X$ is an $m \times n$ matrix, which may be fixed or randomly chosen, and $\xi \in \mathbb{R}^m$ is a vector resulting from the presence of noise. We are interested in sparse estimation under additional conditions on the sparsity pattern of $\beta^*$. In other words, not only do
we expect that $\beta^*$ is sparse but also that it exhibits structured sparsity, namely certain configurations of its nonzero components are preferred to others. This problem arises in several applications, such as regression, image denoising, background subtraction etc. – see [9, 11] for a discussion.

In this paper, we build upon the structured sparsity framework recently proposed by [12, 13]. It is a regularization method, formulated as a convex, nonsmooth optimization problem over a vector of auxiliary parameters. This approach provides a constructive way to favor certain sparsity patterns of the regression vector $\beta$. Specifically, this formulation involves a penalty function given by the formula

$$\Omega(\beta \mid \Lambda) = \inf\left\{ \frac{1}{2}\sum_{i=1}^{n}\left(\frac{\beta_i^2}{\lambda_i} + \lambda_i\right) : \lambda \in \Lambda \right\}.$$

This function can be interpreted as an extension of a well-known variational form for the $\ell_1$ norm. The convex constraint set $\Lambda$ provides a means to incorporate prior knowledge on the magnitude of the components of the regression vector. As we explain in Section 2, the sparsity pattern of $\beta$ is contained in that of the auxiliary vector $\lambda$ at the optimum. Hence, if the set $\Lambda$ allows only for certain sparsity patterns of $\lambda$, the same property will be “transferred” to the regression vector $\beta$.

The first contribution of this paper is the introduction of a tractable class of regularizers of the above form which extend the examples described in [12, 13]. Specifically, we study in detail the cases in which the set $\Lambda$ is defined by norm or conic constraints, combined with a linear map. As we shall see, these cases include formulations which can be used for learning graph sparsity and hierarchical sparsity, in the terminology of [9]. That is, the sparsity pattern of the vector $\beta^*$ may consist of a few contiguous regions in one or more dimensions, or may be embedded in a tree structure. This sparsity problem may arise in several applications, ranging from functional magnetic resonance imaging [7, 25], to scene recognition in vision [8], to multitask learning [1, 18] and to bioinformatics [20] – to mention but a few.

A main limitation of the technique described in [12, 13] is that in many cases of interest the penalty function can
not be easily computed. This makes it difficult to solve the associated regularization problem. For example, [12, 13] proposes to use block coordinate descent, but this method is feasible only for limited choices of the set $\Lambda$. The second contribution of this paper is an efficient accelerated proximal point method to solve regularized least squares with the penalty function $\Omega(\cdot \mid \Lambda)$ in the general case of the set $\Lambda$ described above. The method combines a fast fixed point iterative scheme, which is inspired by recent work by [14], with an accelerated first order method equivalent to FISTA [4]. We present a numerical study of the efficiency of the proposed method and a statistical comparison of the proposed penalty functions with the greedy method of [9], the Lasso and the Group Lasso.

Recently, there has been significant research interest in structured sparsity and the literature on this subject is growing fast; see for example [1, 9, 10, 11, 26] and references therein for an indicative list of papers. In this work, we mainly focus on convex penalty methods and compare them to greedy methods [3, 9]. The latter provide a natural extension of techniques proposed in the signal processing community and, as argued in [9], allow for a significant performance improvement over more generic sparsity models such as the Lasso or the Group Lasso [26]. The former methods have until recently focused mainly on extending the Group Lasso, by considering the possibility that the groups overlap according to certain hierarchical structures [11, 27]. Very recently, general choices of convex penalty functions have been proposed [2, 12, 13]. In this paper we build upon [12, 13], providing both new instances of the penalty function and improved optimization algorithms.

The paper is organized as follows. In Section 2, we set our notation, review the method of [12, 13] and recall some basic facts from convex analysis. In Section 3, we provide some general insights on the method and introduce two new classes of sets $\Lambda$. In Section 4, we present our technique for computing the proximity operator of the penalty function and the resulting accelerated proximal method. In Section 5, we assess the efficiency and the statistical performance of the method via some numerical experiments.
2 Background
In this section, we introduce our notation, review the learning method which we study in this paper and recall some basic facts from convex analysis.

We let $\mathbb{R}_+$ and $\mathbb{R}_{++}$ be the nonnegative and positive real line, respectively. For every $\beta \in \mathbb{R}^n$ we define $|\beta| \in \mathbb{R}^n_+$ to be the vector $|\beta| = (|\beta_i|)_{i=1}^{n}$. For every $p \geq 1$, we define the $\ell_p$ norm of $\beta$ as $\|\beta\|_p = \left(\sum_{i=1}^{n} |\beta_i|^p\right)^{\frac{1}{p}}$. We denote by $\langle \cdot, \cdot \rangle$ the standard inner product in $\mathbb{R}^n$, that is, if $x, t \in \mathbb{R}^n$, then $\langle x, t \rangle = \sum_{i=1}^{n} x_i t_i$. If $C \subseteq \mathbb{R}^n$, we denote by $\delta_C : \mathbb{R}^n \to \mathbb{R}$ the indicator function of the set $C$, that is, $\delta_C(x) = 0$ if $x \in C$ and $\delta_C(x) = +\infty$ otherwise. Finally, if $\varphi : \mathbb{R}^n \to \mathbb{R}$ is a convex function, we use $\partial\varphi(x)$ to denote the subdifferential of $\varphi$ at $x \in \mathbb{R}^n$ [21].

We now review the structured sparsity approach of [12, 13]. Given an $m \times n$ input data matrix $X$ and an output vector $y \in \mathbb{R}^m$, obtained from the linear regression model $y = X\beta^* + \xi$ discussed earlier, they consider the optimization problem

$$\inf\left\{ \frac{1}{2}\|X\beta - y\|_2^2 + \rho\,\Gamma(\beta, \lambda) : \beta \in \mathbb{R}^n, \lambda \in \Lambda \right\} \qquad (2.1)$$

where $\rho$ is a positive parameter, $\Lambda$ is a prescribed convex subset of the positive orthant $\mathbb{R}^n_{++}$ and the function $\Gamma : \mathbb{R}^n \times \mathbb{R}^n_{++} \to \mathbb{R}$ is given by the formula

$$\Gamma(\beta, \lambda) = \frac{1}{2}\sum_{i=1}^{n}\left(\frac{\beta_i^2}{\lambda_i} + \lambda_i\right).$$

Note that the infimum over $\lambda$ in general is not attained, however the infimum over $\beta$ is always attained. Since the auxiliary vector $\lambda$ appears only in the second term and our goal is to estimate $\beta^*$, we may also directly consider the regularization problem

$$\min\left\{ \frac{1}{2}\|X\beta - y\|_2^2 + \rho\,\Omega(\beta) : \beta \in \mathbb{R}^n \right\}, \qquad (2.2)$$

where the penalty function takes the form

$$\Omega(\beta) = \inf\{\Gamma(\beta, \lambda) : \lambda \in \Lambda\}.$$

This problem is still convex because the function $\Gamma$ is jointly convex [5]. Also, note that the function $\Omega$ is independent of the signs of the components of $\beta$.

For a generic convex set $\Lambda$, since the penalty function $\Omega$ is not easily computable, one needs to deal directly with problem (2.1). To this end, we recall here the definition of the proximity operator [15].

Definition 2.1. Let $\varphi$ be a real-valued convex function on $\mathbb{R}^d$. The proximity operator of $\varphi$ is defined, for every $t \in \mathbb{R}^d$, by

$$\mathrm{prox}_\varphi(t) := \mathrm{argmin}\left\{ \frac{1}{2}\|z - t\|_2^2 + \varphi(z) : z \in \mathbb{R}^d \right\}.$$

The proximity operator is well-defined, because the above minimum exists and is unique.
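For a concrete instance of this definition, take $\varphi(z) = \rho\|z\|_1$: the minimization decouples across components and the proximity operator reduces to soft thresholding. The following sketch is ours (not part of the paper) and assumes NumPy:

    import numpy as np

    def prox_l1(t, rho):
        """Proximity operator of z -> rho * ||z||_1 at the point t:
        argmin_z 0.5 * ||z - t||_2^2 + rho * ||z||_1, computed
        componentwise by soft thresholding."""
        return np.sign(t) * np.maximum(np.abs(t) - rho, 0.0)

    print(prox_l1(np.array([3.0, -0.5, 1.2, 0.1]), rho=1.0))  # gives [2., -0., 0.2, 0.]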
3 The set Λ
In this section, we first provide some general insights on how the set $\Lambda$ can favour certain sparsity patterns on $\beta$ and, secondly, we introduce two new classes of sets $\Lambda$ that can be used in many relevant applications.

We begin by noting that, for every $\lambda \in \mathbb{R}^n_{++}$, the quadratic function $\Gamma(\cdot, \lambda)$ provides an upper bound to the $\ell_1$ norm,
namely it holds that $\Omega(\beta) \geq \|\beta\|_1$ and the inequality is tight if and only if $|\beta| \in \Lambda$. This fact is an immediate consequence of the arithmetic-geometric inequality.
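To spell out this step (our addition, for completeness), the arithmetic-geometric mean inequality applied componentwise gives

$$\frac{1}{2}\left(\frac{\beta_i^2}{\lambda_i} + \lambda_i\right) \;\geq\; \sqrt{\frac{\beta_i^2}{\lambda_i}\,\lambda_i} \;=\; |\beta_i|, \qquad \text{with equality if and only if } \lambda_i = |\beta_i|.$$

Summing over $i$ yields $\Gamma(\beta, \lambda) \geq \|\beta\|_1$ for every $\lambda \in \mathbb{R}^n_{++}$, hence $\Omega(\beta) \geq \|\beta\|_1$, with equality precisely when the componentwise minimizer $\lambda = |\beta|$ belongs to $\Lambda$.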
In particular, we see that if we choose $\Lambda = \mathbb{R}^n_{++}$, the method (2.2) reduces to the Lasso¹. The above observation suggests a heuristic interpretation of the method (2.2): among all vectors $\beta$ which have a fixed value of the $\ell_1$ norm, the penalty function $\Omega$ will encourage those for which $|\beta| \in \Lambda$. Moreover, when $|\beta| \in \Lambda$ the function $\Omega$ reduces to the $\ell_1$ norm and, so, the solution of problem (2.2) is expected to be sparse. The penalty function therefore will encourage certain desired sparsity patterns.

The last point can be better understood by looking at problem (2.1). For every solution $(\hat{\beta}, \hat{\lambda})$, the sparsity pattern of $\hat{\beta}$ is contained in the sparsity pattern of $\hat{\lambda}$, that is, the indices associated with nonzero components of $\hat{\beta}$ are a subset of those of $\hat{\lambda}$. Indeed, if $\hat{\lambda}_i = 0$ it must hold that $\hat{\beta}_i = 0$ as well, since the objective would diverge otherwise (because of the ratio $\beta_i^2/\lambda_i$). Therefore, if the set $\Lambda$ favors certain sparse solutions $\hat{\lambda}$, the same sparsity pattern will be reflected on $\hat{\beta}$. Moreover, the regularization term $\sum_i \lambda_i$ favors sparse vectors $\lambda$. For example, a constraint of the form $\lambda_1 \geq \cdots \geq \lambda_n$ favors consecutive zeros at the end of $\lambda$ and nonzeros everywhere else. This will lead to zeros at the end of $\beta$ as well. Thus, in many cases like this, it is easy to incorporate a convex constraint on $\lambda$, whereas it may not be possible to do the same directly on $\beta$.

¹ More generally, method (2.2) includes the Group Lasso method, see [12, 13].
In this paper, we consider sets $\Lambda$ of the form

$$\Lambda = \{\lambda \in \mathbb{R}^n_{++} : A\lambda \in S\}$$

where $S$ is a convex set and $A$ is a $k \times n$ matrix. Two main choices of interest are when $S$ is a convex cone or the unit ball of a norm. We shall refer to the corresponding set $\Lambda$ as a conic constraint or norm constraint set, respectively. We next discuss two specific examples, which highlight the flexibility of our approach and help us understand the sparsity patterns favoured by each choice.

Within the conic constraint sets, we may choose $S = \mathbb{R}^k_{++}$, so that $\Lambda = \{\lambda \in \mathbb{R}^n_{++} : A\lambda \geq 0\}$. For example, in [12, 13], they considered the set $\Lambda = \{\lambda \in \mathbb{R}^n_{++} : \lambda_1 \geq \cdots \geq \lambda_n\}$ and derived an explicit formula of the corresponding regularizer $\Omega(\cdot \mid \Lambda)$. This set can be used to encourage hierarchical sparsity. Note that, for a generic matrix $A$, the penalty function cannot be computed explicitly. In the next section, we present an optimization method that overcomes this difficulty.

Within the norm constraint sets, we may choose $S$ to be the $\ell_1$ unit ball and $A$ the edge map of a graph $G$ with edge set $E$, so that

$$\Lambda = \left\{\lambda \in \mathbb{R}^n_{++} : \sum_{(i,j) \in E} |\lambda_i - \lambda_j| \leq 1\right\}.$$

This set can be used to encourage sparsity patterns consisting of few connected regions/subgraphs of the graph $G$. For example, if $G$ is a 1D grid we have that $\Lambda = \left\{\lambda \in \mathbb{R}^n_{++} : \sum_{i=1}^{n-1} |\lambda_{i+1} - \lambda_i| \leq 1\right\}$, so the corresponding penalty will favour vectors which are constant within few connected regions.
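As a concrete illustration of this norm-constraint construction (ours, not from the paper; NumPy is assumed and the helper names are hypothetical), the sketch below builds the edge map $A$ of a 1D grid and tests whether a candidate $\lambda$ lies in the corresponding set $\Lambda$, i.e. whether its total variation is at most one:

    import numpy as np

    def grid_1d_edge_map(n):
        """Edge map (incidence matrix) A of the 1D grid graph: one row per
        edge (i, i+1), so (A @ lam)[i] = lam[i+1] - lam[i]."""
        A = np.zeros((n - 1, n))
        idx = np.arange(n - 1)
        A[idx, idx] = -1.0
        A[idx, idx + 1] = 1.0
        return A

    def in_grid_constraint_set(lam, A, radius=1.0):
        """Membership test for Lambda = {lam > 0 : ||A lam||_1 <= radius}."""
        return bool(np.all(lam > 0) and np.sum(np.abs(A @ lam)) <= radius)

    A = grid_1d_edge_map(6)
    lam_pc = np.array([0.2, 0.2, 0.2, 0.7, 0.7, 0.7])   # piecewise constant: small total variation
    lam_osc = np.array([0.1, 0.9, 0.1, 0.9, 0.1, 0.9])  # oscillating: large total variation
    print(in_grid_constraint_set(lam_pc, A))   # True
    print(in_grid_constraint_set(lam_osc, A))  # False

Vectors $\lambda$ that are constant over a few contiguous regions have small total variation and hence remain feasible, which is exactly the behaviour the penalty is designed to favour.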
4 Optimization Method
In this section, we discuss how to solve problem (2.1) using an accelerated first-order method that scales linearly with respect to the problem size, as we later show in the experiments. This method relies on the computation of the proximity operator of the function $\Gamma$, restricted to $\mathbb{R}^n \times \Lambda$. Since the exact computation of the proximity operator is possible only in simple special cases of sets $\Lambda$, we present here an efficient fixed-point algorithm for computing the proximity operator that can be applied to a wide variety of constraints. Finally, we discuss an accelerated proximal method that leverages our algorithm.
4.1 Computation of the Proximity Operator
According to Definition 2.1, the proximal operator of $\Gamma$ at $(\alpha, \mu) \in \mathbb{R}^n \times \mathbb{R}^n$ is the solution of the problem

$$\min\left\{ \frac{1}{2}\|(\beta, \lambda) - (\alpha, \mu)\|_2^2 + \rho\,\Gamma(\beta, \lambda) : \beta \in \mathbb{R}^n, \lambda \in \Lambda \right\}. \qquad (4.1)$$

For fixed $\lambda$, a direct computation yields that the objective function in (4.1) attains its minimum at

$$\beta_i(\lambda) = \frac{\alpha_i \lambda_i}{\lambda_i + \rho}.$$
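For completeness, the computation behind this formula (our addition): minimizing the $i$-th term of the objective in (4.1) over $\beta_i$ gives

$$\frac{\partial}{\partial \beta_i}\left[\frac{1}{2}(\beta_i - \alpha_i)^2 + \frac{\rho}{2}\,\frac{\beta_i^2}{\lambda_i}\right] = (\beta_i - \alpha_i) + \rho\,\frac{\beta_i}{\lambda_i} = 0 \quad\Longrightarrow\quad \beta_i(\lambda) = \frac{\alpha_i \lambda_i}{\lambda_i + \rho},$$

and substituting this value back into the objective produces the $\frac{\rho}{2}\,\frac{\alpha_i^2}{\lambda_i + \rho}$ terms appearing in (4.2) below.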
Using this equation we obtain the simplified problem

$$\min\left\{ \frac{1}{2}\|\lambda - \mu\|^2 + \frac{\rho}{2}\sum_{i=1}^{n}\left(\frac{\alpha_i^2}{\lambda_i + \rho} + \lambda_i\right) : \lambda \in \Lambda \right\}. \qquad (4.2)$$

This problem can still be interpreted as a proximity map computation. We discuss how to solve it under our general assumption $\Lambda = \{\lambda : \lambda \in \mathbb{R}^n_{++},\ A\lambda \in S\}$. Moreover, we assume that the projection on the set $S$ can be easily computed. To this end, we define the $(n + k) \times n$ matrix $B = \begin{pmatrix} I \\ A \end{pmatrix}$ and the function $\varphi(s, t) = \varphi_1(s) + \varphi_2(t)$, $(s, t) \in \mathbb{R}^n \times \mathbb{R}^k$, where

$$\varphi_1(s) = \frac{\rho}{2}\sum_{i=1}^{n}\left(\frac{\alpha_i^2}{s_i + \rho} + s_i + \delta_{\mathbb{R}_{++}}(s_i)\right),$$

and $\varphi_2(t) = \delta_S(t)$. Note that the solution of problem (4.2) is the same as the proximity map of the linearly composite function $\varphi \circ B$ at $\mu$, which solves the problem

$$\min\left\{ \frac{1}{2}\|\lambda - \mu\|^2 + \varphi(B\lambda) : \lambda \in \mathbb{R}^n \right\}.$$
At first sight this problem seems difficult to solve. However, it turns out that if the proximity map of the function $\varphi$ has a simple form, the following theorem, adapted from [14, Theorem 3.1], can be used to accomplish this task. For ease of notation we set $d = n + k$.

Theorem 4.1. Let $\varphi$ be a convex function on $\mathbb{R}^d$, $B$ a $d \times n$ matrix, $\mu \in \mathbb{R}^n$, $c > 0$, and define the mapping $H : \mathbb{R}^d \to \mathbb{R}^d$ at $v \in \mathbb{R}^d$ as

$$H(v) = (I - \mathrm{prox}_{\varphi/c})\left((I - cBB^{\top})v + B\mu\right).$$

Then, for any fixed point $\hat{v}$ of $H$, it holds that

$$\mathrm{prox}_{\varphi \circ B}(\mu) = \mu - cB^{\top}\hat{v}.$$

The Picard iterates $\{v^s : s \in \mathbb{N}\} \subseteq \mathbb{R}^d$, starting at $v^0 \in \mathbb{R}^d$, are defined by the recursive equation $v^s = H(v^{s-1})$. Since the operator $I - \mathrm{prox}_{\varphi}$ is nonexpansive² (see e.g. [6]), the map $H$ is nonexpansive if $c \in \left(0, \frac{2}{\|B\|^2}\right]$. Because of this, the Picard iterates are not guaranteed to converge to a fixed point of $H$. However, a simple modification with an averaging scheme can be used to compute the fixed point.

² A mapping $T : \mathbb{R}^d \to \mathbb{R}^d$ is said to be nonexpansive if $\|T(v) - T(v')\|_2 \leq \|v - v'\|_2$, for every $v, v' \in \mathbb{R}^d$.

Theorem 4.2. [19] Let $H : \mathbb{R}^d \to \mathbb{R}^d$ be a nonexpansive mapping which has at least one fixed point and let $H_\kappa := \kappa I + (1 - \kappa)H$. Then, for every $\kappa \in (0, 1)$, the Picard iterates of $H_\kappa$ converge to a fixed point of $H$.
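To make the scheme concrete, here is a minimal sketch (ours, not from the paper) of the Picard-Opial iteration of Theorems 4.1 and 4.2; it assumes NumPy, reads the operator in Theorem 4.1 as $\mathrm{prox}_{\varphi/c}$, and takes prox_phi_over_c as a user-supplied oracle (all names are hypothetical):

    import numpy as np

    def prox_composite(mu, B, prox_phi_over_c, c, kappa=0.2, tol=1e-8, max_iter=10000):
        """Compute prox_{phi o B}(mu) via the Picard-Opial fixed point scheme.
        prox_phi_over_c(x) must return prox_{phi/c}(x); c should satisfy
        0 < c <= 2 / ||B||^2 so that the map H is nonexpansive."""
        v = np.zeros(B.shape[0])
        Bmu = B @ mu
        for _ in range(max_iter):
            w = v - c * (B @ (B.T @ v)) + Bmu        # (I - c B B^T) v + B mu
            Hv = w - prox_phi_over_c(w)              # (I - prox_{phi/c})(...)
            v_new = kappa * v + (1.0 - kappa) * Hv   # Opial averaging (Theorem 4.2)
            if np.linalg.norm(v_new - v) <= tol * (1.0 + np.linalg.norm(v)):
                v = v_new
                break
            v = v_new
        return mu - c * (B.T @ v)                    # Theorem 4.1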
The required proximity operator of $\varphi$ is directly given, for every $(s, t) \in \mathbb{R}^n \times \mathbb{R}^k$, by

$$\mathrm{prox}_\varphi(s, t) = \left(\mathrm{prox}_{\varphi_1}(s),\ \mathrm{prox}_{\varphi_2}(t)\right).$$

Both $\mathrm{prox}_{\varphi_1}$ and $\mathrm{prox}_{\varphi_2}$ can be easily computed. The latter requires computing the projection on the set $S$. The former requires, for each component of the vector $s \in \mathbb{R}^n$, the solution of a cubic equation, as stated in the next lemma.
Lemma 4.1. For every $\mu, \alpha \in \mathbb{R}$ and $r, \rho > 0$, the function $h : \mathbb{R}_+ \to \mathbb{R}$ defined at $s$ as

$$h(s) := (s - \mu)^2 + r\left(\frac{\alpha^2}{s + \rho} + s\right)$$

has a unique minimum on its domain, which is attained at $(x_0 - \rho)_+$, where $x_0$ is the largest real root of the polynomial $2x^3 + (r - 2(\mu + \rho))x^2 - r\alpha^2$.

Proof. Setting the derivative of $h$ equal to zero and making the change of variable $x = s + \rho$ yields the polynomial stated in the lemma. Let $x_0$ be the largest root of this polynomial. Since the function $h$ is strictly convex on its domain and grows at infinity, its minimum can be attained only at one point, which is $x_0 - \rho$, if $x_0 > \rho$, and zero otherwise.
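A sketch of the resulting componentwise computation (ours, not from the paper; it assumes NumPy, uses numpy.roots for the cubic of Lemma 4.1, and includes a brute-force check purely for illustration):

    import numpy as np

    def prox_phi1_component(mu_i, alpha_i, r, rho):
        """Minimize h(s) = (s - mu_i)^2 + r * (alpha_i^2 / (s + rho) + s) over s >= 0
        (Lemma 4.1): take the largest real root x0 of
        2 x^3 + (r - 2 (mu_i + rho)) x^2 - r alpha_i^2 and return (x0 - rho)_+."""
        coeffs = [2.0, r - 2.0 * (mu_i + rho), 0.0, -r * alpha_i ** 2]
        roots = np.roots(coeffs)
        x0 = roots[np.abs(roots.imag) < 1e-10].real.max()
        return max(x0 - rho, 0.0)

    # Quick sanity check against a brute-force grid search.
    mu_i, alpha_i, r, rho = 0.3, 1.5, 0.8, 0.5
    s_grid = np.linspace(0.0, 10.0, 200001)
    h = (s_grid - mu_i) ** 2 + r * (alpha_i ** 2 / (s_grid + rho) + s_grid)
    print(prox_phi1_component(mu_i, alpha_i, r, rho), s_grid[h.argmin()])

Applying this map to each component of $s$, together with the projection onto $S$ for $\mathrm{prox}_{\varphi_2}$, supplies the prox oracle needed by the fixed-point scheme above.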
4.2 Accelerated Proximal Method
Theorem 4.1 motivates a proximal numerical approach (Algorithm 1 below) to solving problem (2.1) and, in turn, problem (2.2). Let $E(\beta) = \frac{1}{2}\|X\beta - y\|_2^2$ and assume that an upper bound $L$ of $\|X^{\top}X\|$ is known.³ Proximal first-order methods – see [4, 6, 17, 23] and references therein – can be used for nonsmooth optimization, where the objective consists of the sum of a smooth term and a nonsmooth term, in our case $E$ and $\Gamma + \delta_\Lambda$, respectively. The idea is to replace $E$ with its linear approximation around a point $w_t$ specific to iteration $t$. This leads to the computation of a proximity operator, and specifically in our case to

$$u_t := (\beta_t, \lambda_t) \leftarrow \mathrm{argmin}\left\{ \frac{L}{2}\left\|(\beta, \lambda) - \left(w_t - \frac{1}{L}\nabla E(w_t)\right)\right\|_2^2 + \rho\,\Gamma(\beta, \lambda) : \beta \in \mathbb{R}^n, \lambda \in \Lambda \right\}.$$

Subsequently, the point $w_t$ is updated, based on the current and previous estimates of the solution $u_t, u_{t-1}, \ldots$, and the process repeats.

³ For variants of such algorithms which adaptively learn $L$, see the above references.
Algorithm 1 Proximal structured sparsity algorithm (NEPIO).

    $u_1, w_1 \leftarrow$ arbitrary feasible values
    for $t = 1, 2, \ldots$ do
        Compute a fixed point $\hat{v}^{(t)}$ of $H_t$ by Picard-Opial
        $u_{t+1} \leftarrow w_t - \frac{1}{L}\nabla E(w_t) - \frac{c}{L}B^{\top}\hat{v}^{(t)}$
        $w_{t+1} \leftarrow \pi_{t+1}\,u_{t+1} - (\pi_{t+1} - 1)\,u_t$
    end for
The simplest (and a commonly used) update rule is $w_t = u_t$. By contrast, accelerated proximal methods proposed by [17] use a carefully chosen $w$ update with two levels of memory, $u_t, u_{t-1}$. If the proximity map can be exactly computed, such schemes exhibit a fast quadratic decay in terms of the iteration count, that is, the distance of the objective from the minimal value is $O\left(\frac{1}{T^2}\right)$ after $T$ iterations. In the case that the proximal operator is computed numerically, it has been shown only very recently [22, 24] that, under some circumstances, the accelerated method still converges with the rate $O\left(\frac{1}{T^2}\right)$. The main advantages of accelerated methods are their low cost per iteration and their scalability to large problem sizes. Moreover, in applications where a thresholding operator is involved – as in Lemma 4.1 – the zeros in the solution are exact, which may be desirable.

For our purposes, we use a version of accelerated methods influenced by [23] (described in Algorithm 1). According to Nesterov, the optimal update is

$$w_{t+1} \leftarrow u_{t+1} + \theta_{t+1}\left(\frac{1}{\theta_t} - 1\right)(u_{t+1} - u_t),$$

where the sequence $\theta_t$ is defined by $\theta_1 = 1$ and the recursive equation

$$\frac{1 - \theta_{t+1}}{\theta_{t+1}^2} = \frac{1}{\theta_t^2}.$$

We have adapted [23, Algorithm 2] (equivalent to FISTA [4]) by computing the proximity operator of $\frac{\varphi}{L} \circ B$ using the Picard-Opial process described in Section 4.1. We rephrased the algorithm using the sequence $\pi_t := 1 - \theta_t + \sqrt{1 - \theta_t} = 1 - \theta_t + \frac{\theta_t}{\theta_{t-1}}$ for numerical stability. At each
iteration, the map $H_t$ is defined by

$$H_t(v) := (I - \mathrm{prox}_{\varphi/c})\left(\left(I - \frac{c}{L}BB^{\top}\right)v - \frac{1}{L}B\left(\nabla E(w_t) - Lw_t\right)\right).$$

We also apply an Opial averaging, so that the update at stage $s$ of the proximity computation is $v^{s+1} = \kappa v^s + (1 - \kappa)H_t(v^s)$. By Theorem 4.1, the fixed point process combined with the assignment of $u$ is equivalent to

$$u_{t+1} \leftarrow \mathrm{prox}_{\frac{\varphi}{L} \circ B}\left(w_t - \frac{1}{L}\nabla E(w_t)\right).$$

The reason for resorting to Picard-Opial is that exact computation of the proximity operator (4.2) is possible only in simple special cases for the set $\Lambda$. By contrast, our approach can be applied to a wide variety of constraints. Moreover, we are not aware of another proximal method for solving problems (2.1) or (2.2), and alternatives like interior point methods do not scale well with problem size. In the next section, we will demonstrate empirically the scalability of Algorithm 1, as well as the efficiency of both the proximity map computation and the overall method.
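To summarize how the pieces fit together, the following schematic sketch (ours, with hypothetical helper names) shows the outer accelerated loop in the spirit of Algorithm 1: the smooth term contributes a gradient step, the nonsmooth term is handled by an approximate proximity operator such as the Picard-Opial routine sketched earlier, and the momentum follows the stated recursion for $\theta_t$:

    import numpy as np

    def accelerated_proximal(grad_E, prox_step, x0, L, max_iter=500, tol=1e-8):
        """Schematic FISTA-style loop in the spirit of Algorithm 1 (NEPIO).
        grad_E(x)    : gradient of the smooth term E
        prox_step(z) : (approximate) proximity operator of the nonsmooth term,
                       scaled by 1/L, e.g. computed by Picard-Opial
        L            : upper bound on the Lipschitz constant of grad_E."""
        u = x0.copy()
        w = x0.copy()
        theta = 1.0
        for _ in range(max_iter):
            u_new = prox_step(w - grad_E(w) / L)          # forward-backward step
            # theta_{t+1} solves (1 - theta_{t+1}) / theta_{t+1}^2 = 1 / theta_t^2
            theta_new = (np.sqrt(theta ** 4 + 4.0 * theta ** 2) - theta ** 2) / 2.0
            # Nesterov update: w_{t+1} = u_{t+1} + theta_{t+1} (1/theta_t - 1)(u_{t+1} - u_t)
            w = u_new + theta_new * (1.0 / theta - 1.0) * (u_new - u)
            if np.linalg.norm(u_new - u) <= tol * (1.0 + np.linalg.norm(u)):
                u = u_new
                break
            u, theta = u_new, theta_new
        return u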
5 Numerical Simulations
In this section, we present experiments with method (2.1). The main goal of the experiments is to study both the computational and the statistical estimation properties of this method. One important aim of the experiments is to demonstrate that the method is statistically competitive or superior to state-of-the-art methods while being computationally efficient. The methods employed are the “Lasso”, “StructOMP” [9], “GL1”, the Group Lasso variant presented in [10], and “GL2”, a Group Lasso with overlapping groups. For both Group Lasso methods, we used as groups all sets of 4 contiguous variables (1D) or the sets of all neighbours of each variable (2D). Moreover, we used method (2.1) with the following choices for the constraint set $\Lambda$:

• $\Lambda = \{\lambda : \|A\lambda\|_1 \leq \alpha\}$, where $A$ is the edge map of a 1D or 2D grid – we refer to the corresponding method as “GridC”.

• $\Lambda = \{\lambda : A\lambda \geq 0\}$, where $A$ is the edge map of a tree graph – we refer to the corresponding method as “TreeC”.

We solved the optimization problem (2.1) either with the toolbox CVX⁴
or with the proximal method presented in Section 4. When using the proximal method, we found that setting the parameter $\kappa$ from Opial's Theorem to $0.2$ gave the best results, even though in [14] they show that convergence of the fixed-point iterations is guaranteed also for $\kappa = 0$. Our main stopping criterion is based on the decrease in the objective value of (2.1), which must be less than $10^{-8}$. For the computation of the proximity operator, we stopped the iterations of the Picard-Opial method when the relative difference between two consecutive iterates is smaller than $10^{-2}$. We studied the effect of varying this tolerance in the next experiments. We used the square loss and computed the Lipschitz constant $L$ using singular value decomposition (if this were not possible, a Frobenius estimate could be used). Finally, the implementation ran on an 8GB memory quad-core Intel machine.

⁴ http://cvxr.com/cvx/
5.1 Efficiency experiments
First, we investigated the computational properties of the proximal method (NEPIO). Our aim in these experiments was to show that our algorithm has a time complexity that scales linearly with the number of variables, while the relative number of training examples is kept constant. We considered both the Grid and the Tree constraints and compared our algorithm to the toolbox CVX, which is an interior-point method solver. As is commonly known, interior-point methods are very fast, but do not scale well with the problem size. We also compared to the non-accelerated version of our algorithm, similar to ISTA [4, 6]. ISTA has been shown [6] to converge in the presence of very general, but summable, errors in the computation of the proximity operator. In the case of the Tree constraint, we further compared with an adapted version of the alternating algorithm (AA) of [12, 13]. For each problem size, we repeated the experiments 10 times and we report the average computation time in Figure 1 for GridC and TreeC. All methods achieve objective values that are within 1% of each other, apart from ISTA, which sometimes did not converge after $10^5$ iterations. The proposed method scales linearly with the problem size, making it suitable for large-scale experiments.

In order to empirically assess the importance of the Picard-Opial tolerance for converging to a good solution, we considered a problem with 100 variables for both the Grid and the Tree constraints and repeated the experiments 100 times with different sampling of training examples. For each constraint, we evaluated the average distance from the solution obtained by our method with different values of the Picard-Opial tolerance to the solution obtained by CVX. We did not observe any improvement in the solution by decreasing a fixed tolerance from $10^{-2}$ to $10^{-8}$ or by setting the tolerance to decrease as $1/T^{\alpha}$ with $\alpha$ equal to $1$, $1.5$ or $2$, as suggested by very recent results [22, 24]. However, decreasing the tolerance remarkably increased the computational overhead, from an average of 5 s for a fixed tolerance of $10^{-2}$ to 80 s for $1/T^2$ in the case of the Grid constraint.

Finally, we considered the 2D GridC case and observed