This post gives a performance estimation program (PEP) motivation for gradient descent algorithms with momentum and a version of Nesterov’s Fast Gradient Method.
These proofs are not new (see (d’Aspremont, Scieur, and Taylor 2021, Section 4.4.1)), but I hope the presentation clarifies where such proofs come from.
Our goal is to solve $\min_{x\in\mathbb{R}^d} f(x)$ given only that $f$ is convex and 1-smooth and that $f$ has a minimizer $x_\star$. All the ideas extend to the general setting of an $L$-smooth convex function, with slightly more cumbersome notation.
Recall that a convex function $f:\mathbb{R}^d\to\mathbb{R}$ is said to be 1-smooth if it is differentiable and, for all $x,y\in\mathbb{R}^d$, $\|\nabla f(x)-\nabla f(y)\|\le\|x-y\|$. Equivalently, this holds if and only if, for all $x,y\in\mathbb{R}^d$,
$$f(y)\le f(x)+\langle \nabla f(x),\, y-x\rangle+\frac{1}{2}\|x-y\|^2.$$
We will attempt to minimize $f$ using a fixed-step first-order method (FSFOM). Formally, this is any algorithm of the form
$$\begin{aligned}
x_0 &\text{ is given as part of the input}\\
x_1 &= x_0 - H_{1,0}\nabla f(x_0)\\
x_2 &= x_0 - H_{2,0}\nabla f(x_0) - H_{2,1}\nabla f(x_1)\\
&\ \,\vdots\\
x_n &= x_0 - \sum_{i=0}^{n-1} H_{n,i}\nabla f(x_i)
\end{aligned}$$
parameterized by a matrix $H\in\mathbb{R}^{[0,n]\times[0,n]}$ that has support below the diagonal, i.e., $H_{i,j}=0$ for all $j\ge i$. The matrix $H$ fixes in advance the stepsizes to be taken at each iteration. $H$ being supported below the diagonal corresponds to the natural requirement that $x_i$ should depend only on the gradients $\nabla f(x_0),\dots,\nabla f(x_{i-1})$.
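To make the definition concrete, here is a minimal sketch (my own, not from the post) of running an FSFOM given a strictly lower-triangular stepsize matrix `H`; the quadratic test function and all variable names are illustrative assumptions.

```python
import numpy as np

def run_fsfom(H, grad, x0):
    """Run the fixed-step first-order method defined by the strictly
    lower-triangular stepsize matrix H (shape (n+1, n+1)).
    Returns the iterates x_0, ..., x_n."""
    n = H.shape[0] - 1
    xs, grads = [x0], []
    for i in range(1, n + 1):
        grads.append(grad(xs[-1]))                     # g_{i-1} = grad f(x_{i-1})
        # x_i = x_0 - sum_{j < i} H[i, j] * g_j
        xs.append(x0 - sum(H[i, j] * grads[j] for j in range(i)))
    return xs

# Example: f(x) = 0.5 x^T A x with eigenvalues of A in [0, 1], so f is convex and 1-smooth.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
A = Q @ np.diag(rng.uniform(0, 1, 5)) @ Q.T
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

n = 4
H = np.tril(np.ones((n + 1, n + 1)), k=-1)   # all-ones below the diagonal = unit-step gradient descent
xs = run_fsfom(H, grad, x0=rng.standard_normal(5))
print([float(f(x)) for x in xs])             # function values along the run
```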
Let $I=\{0,1,\dots,n,\star\}$ and introduce the shorthand $f_i = f(x_i)$ and $g_i = \nabla f(x_i)$ for all $i\in I$ (so $f_\star = f(x_\star)$ and $g_\star = \nabla f(x_\star) = 0$).
Now, let us consider the set of first-order data $\{(f_i, g_i, x_i)\}_{i\in I}$. The fact that this data comes from a 1-smooth convex function imposes constraints on what this data may look like.
Lemma 1 ((Taylor, Hendrickx, and Glineur 2017, Corollary 1)). The first-order data $\{(f_i, g_i, x_i)\}_{i\in I}$ can be realized by a 1-smooth convex function (in short, is interpolable) if and only if, for all $i,j\in I$,
$$Q_{i,j} := f_j - f_i + \langle g_j,\, x_i - x_j\rangle + \frac{1}{2}\|g_i - g_j\|^2 \le 0.$$
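As a quick sanity check (my own illustration, not part of the lemma), the inequality $Q_{i,j}\le 0$ can be verified numerically for first-order data sampled from a concrete 1-smooth convex function, e.g. a quadratic whose Hessian has eigenvalues in $[0,1]$:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = Q @ np.diag(rng.uniform(0, 1, d)) @ Q.T   # 0 <= A <= I, so f is convex and 1-smooth
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

# Sample first-order data at a few points and check every Q_{i,j} <= 0.
xs = [rng.standard_normal(d) for _ in range(5)]
data = [(f(x), grad(x), x) for x in xs]
worst = max(
    fj - fi + gj @ (xi - xj) + 0.5 * np.dot(gi - gj, gi - gj)
    for (fi, gi, xi) in data
    for (fj, gj, xj) in data
)
print(worst)   # nonpositive, up to numerical precision
```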
Note that the quantity $Q_{i,j}$ is linear in the formal variables
$$f := \begin{pmatrix} f_0\\ f_1\\ \vdots\\ f_n\\ f_\star \end{pmatrix},\qquad
G := \begin{pmatrix}
\|g_0\|^2 & \langle g_0, g_1\rangle & \dots & \langle g_0, g_n\rangle & \langle g_0, x_0 - x_\star\rangle\\
\langle g_1, g_0\rangle & \|g_1\|^2 & \dots & \langle g_1, g_n\rangle & \langle g_1, x_0 - x_\star\rangle\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
\langle g_n, g_0\rangle & \langle g_n, g_1\rangle & \dots & \|g_n\|^2 & \langle g_n, x_0 - x_\star\rangle\\
\langle x_0 - x_\star, g_0\rangle & \langle x_0 - x_\star, g_1\rangle & \dots & \langle x_0 - x_\star, g_n\rangle & \|x_0 - x_\star\|^2
\end{pmatrix}.$$
In other words, $G$ is the Gram matrix of the vectors $g_0,\dots,g_n,\,x_0 - x_\star$.
Example 1. As an example, we can write $Q_{3,1}$ as
$$\begin{aligned}
Q_{3,1} &= f_1 - f_3 + \langle g_1,\, x_3 - x_1\rangle + \frac{1}{2}\|g_1 - g_3\|^2\\
&= f_1 - f_3 + \langle g_1,\, (H_{1,0} - H_{3,0})g_0 - H_{3,1}g_1 - H_{3,2}g_2\rangle + \frac{1}{2}\|g_1 - g_3\|^2\\
&= \begin{pmatrix} 0\\ 1\\ 0\\ -1\\ 0\\ \vdots \end{pmatrix}^\intercal f
+ \frac{1}{2}\left\langle
\begin{pmatrix}
0 & H_{1,0} - H_{3,0} & 0 & 0 & \dots\\
H_{1,0} - H_{3,0} & 1 - 2H_{3,1} & -H_{3,2} & -1 & \dots\\
0 & -H_{3,2} & 0 & 0 & \dots\\
0 & -1 & 0 & 1 & \dots\\
\vdots & \vdots & \vdots & \vdots & \ddots
\end{pmatrix},\, G\right\rangle.
\end{aligned}$$
We learn that the set of interpolable $(f, G)$ pairs can be written as
$$D := \left\{ (f, G) \in \mathbb{R}^I \times \mathbb{S}^I :\ Q_{i,j} \le 0\ \ \forall i,j\in I,\quad G \succeq 0,\quad \operatorname{rank}(G) \le d \right\}.$$
Here, each $Q_{i,j}$ is a linear function in $(f, G)$. We will make a high-dimensional assumption, $d \ge n + 2$, so that we can ignore the rank constraint. Then, $D$ becomes the feasible domain of a semidefinite program (SDP)!
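To illustrate, here is a sketch (my own, with an assumed normalization $\|x_0 - x_\star\|^2 \le 1$) of how one might set up this PEP SDP in CVXPY for a given stepsize matrix $H$, maximizing $f_n - f_\star$ over interpolable data:

```python
import cvxpy as cp
import numpy as np

def pep_worst_case(H):
    """Worst-case f(x_n) - f_star over 1-smooth convex f with ||x_0 - x_star||^2 <= 1,
    for the FSFOM given by the strictly lower-triangular matrix H of shape (n+1, n+1)."""
    n = H.shape[0] - 1
    N = n + 2                       # Gram basis: g_0, ..., g_n, x_0 - x_star
    e = np.eye(N)
    g = [e[i] for i in range(n + 1)] + [np.zeros(N)]                 # g_star = 0
    x = [e[N - 1] - sum(H[i, j] * e[j] for j in range(i)) for i in range(n + 1)]
    x.append(np.zeros(N))                                            # x_star - x_star = 0

    F = cp.Variable(N)              # (f_0, ..., f_n, f_star)
    G = cp.Variable((N, N), PSD=True)

    constraints = [e[N - 1] @ G @ e[N - 1] <= 1]                     # ||x_0 - x_star||^2 <= 1
    for i in range(N):
        for j in range(N):
            if i != j:
                # Q_{i,j} = f_j - f_i + <g_j, x_i - x_j> + 0.5 ||g_i - g_j||^2 <= 0
                constraints.append(
                    F[j] - F[i]
                    + g[j] @ G @ (x[i] - x[j])
                    + 0.5 * (g[i] - g[j]) @ G @ (g[i] - g[j])
                    <= 0
                )
    return cp.Problem(cp.Maximize(F[n] - F[N - 1]), constraints).solve()

# Unit-step gradient descent for n = 1, 2, 3 steps.
for n in range(1, 4):
    H = np.tril(np.ones((n + 1, n + 1)), k=-1)
    print(n, pep_worst_case(H))
```

For unit-step gradient descent, the returned values should be close to $\frac{1}{4n+2}$, the known tight worst case (Drori and Teboulle 2014).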
Convergence guarantees and the shadow iterate
Let’s consider perhaps the simplest FSFOM, corresponding to the matrix $H = (0)$ and $n = 0$. In this “algorithm”, we simply return the starting iterate $x_0$. From the basic definition of 1-smoothness (applied at $x_\star$, where $g_\star = 0$), we know that $f_0 - f_\star - \frac{1}{2}\|x_0 - x_\star\|^2 \le 0$. Let’s think about what this means in the PEP framework.
In the language of PEP, $f_0 - f_\star - \frac{1}{2}\|x_0 - x_\star\|^2$ is a linear expression in $(f, G)$ that is nonpositive on the cone $D$. By strong duality (which holds under very minor conditions that I will sweep under the rug here; see (Taylor, Hendrickx, and Glineur 2017, Theorem 6)), we can certify this inequality using a nonnegative combination of the $Q_{i,j}$ and an additional PSD slack term. Concretely, there must exist $\lambda \in \mathbb{R}_+^{I\times I}$ and a PSD matrix $S \in \mathbb{S}_+^I$ so that
$$f_0 - f_\star - \frac{1}{2}\|x_0 - x_\star\|^2 = \sum_{i,j\in I} \lambda_{i,j} Q_{i,j} - \frac{1}{2}\langle S, G\rangle$$
as a formal expression in the variables $f$ and $G$.
For our choice of FSFOM, there is a unique such certificate:
$$f_0 - f_\star - \frac{1}{2}\|x_0 - x_\star\|^2 = Q_{\star,0} - \frac{1}{2}\left\langle \begin{pmatrix} 1 & -1\\ -1 & 1 \end{pmatrix},\, G\right\rangle.$$
Rearranging this and dropping the term $Q_{\star,0} \le 0$, we get the more precise bound:
$$f_0 - f_\star \le \frac{1}{2}\|x_0 - x_\star\|^2 - \frac{1}{2}\|x_0 - g_0 - x_\star\|^2.$$
Let us define $z_1 := x_0 - g_0$. Then,
$$f_0 - f_\star \le \frac{1}{2}\|x_0 - x_\star\|^2 - \frac{1}{2}\|z_1 - x_\star\|^2.$$
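The certificate is easy to check symbolically. The following snippet (my own sanity check) verifies the identity above, writing everything in terms of the Gram entries of $(g_0,\, x_0 - x_\star)$:

```python
import sympy as sp

f0, fstar = sp.symbols('f0 fstar')
g00, g01, g11 = sp.symbols('g00 g01 g11')   # <g0, g0>, <g0, x0 - xstar>, ||x0 - xstar||^2

# Q_{star,0} = f_0 - f_star + <g_0, x_star - x_0> + 1/2 ||g_star - g_0||^2, with g_star = 0
Q_star0 = f0 - fstar - g01 + sp.Rational(1, 2) * g00

lhs = f0 - fstar - sp.Rational(1, 2) * g11
# <S, G> with S = [[1, -1], [-1, 1]] equals g00 - 2*g01 + g11
rhs = Q_star0 - sp.Rational(1, 2) * (g00 - 2 * g01 + g11)

print(sp.simplify(lhs - rhs))   # 0, so the identity holds formally
```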
What does this tell us? One of two things must be true: either $x_0$ outperforms the initial bound on $f_0 - f_\star$, or $z_1$ is close to $x_\star$.
We refer to $z_1$ as a shadow iterate for this FSFOM. We emphasize that $z_1$ lives in the affine space $x_0 + \operatorname{span}(\{g_0\})$ and can be constructed from the first-order information $(f_0, x_0, g_0)$.
Perhaps surprisingly, a similar story holds for any FSFOM. The
following derivation is implicit in (Grimmer, Shu, and Wang
2024, Proposition 4).
Lemma 2. Suppose we have an FSFOM parameterized by $H \in \mathbb{R}^{[0,n]\times[0,n]}$ and suppose we can give a guarantee (here, assume either that strong duality holds or that the value of $\theta_n^2$ is the largest that can be certified via the PEP SDP) of the form
$$\theta_n^2 (f_n - f_\star) \le \frac{1}{2}\|x_0 - x_\star\|^2.$$
Then, there exist weights $v_0, \dots, v_n$ so that $z_{n+1} := x_0 - \sum_{i=0}^n v_i g_i$ satisfies
$$\theta_n^2 (f_n - f_\star) \le \frac{1}{2}\|x_0 - x_\star\|^2 - \frac{1}{2}\|z_{n+1} - x_\star\|^2.$$
Proof. Again, in the language of PEP, this is a linear expression in $(f, G)$ that is nonpositive on the cone $D$ and can be certified via a nonnegative $\lambda \in \mathbb{R}_+^{I\times I}$ and a PSD $S \in \mathbb{S}_+^I$:
$$\theta_n^2 (f_n - f_\star) - \frac{1}{2}\|x_0 - x_\star\|^2 = \sum_{i,j\in I} \lambda_{i,j} Q_{i,j} - \frac{1}{2}\langle S, G\rangle.$$
By inspection, the coefficient on $\|x_0 - x_\star\|^2$, i.e., on $G_{\star,\star}$, is zero in each of the $Q_{i,j}$ expressions. Thus, for the above identity to hold, it must be that $S_{\star,\star} = 1$. By the Schur complement lemma,
$$S \succeq \begin{pmatrix} S_{0,\star}\\ S_{1,\star}\\ \vdots\\ S_{n,\star}\\ 1 \end{pmatrix} \begin{pmatrix} S_{0,\star}\\ S_{1,\star}\\ \vdots\\ S_{n,\star}\\ 1 \end{pmatrix}^\intercal.$$
Thus, we may set $v_i = -S_{i,\star}$, so that $z_{n+1} = x_0 - \sum_{i=0}^n v_i g_i = x_0 + \sum_{i=0}^n S_{i,\star}\, g_i$; the claimed bound then follows by evaluating the identity on interpolable data, where each $Q_{i,j} \le 0$ and $\langle S, G\rangle \ge \|z_{n+1} - x_\star\|^2$. This completes the proof. ◻
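As a small numerical illustration of the Schur-complement step (my own sanity check, not part of the proof): for any PSD matrix $S$ with $S_{\star,\star} = 1$, subtracting the outer product of its last column leaves a PSD matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
for _ in range(1000):
    B = rng.standard_normal((5, 5))
    S = B @ B.T                    # a random PSD matrix
    S /= S[-1, -1]                 # normalize so that S[star, star] = 1
    s = S[:, -1]                   # last column, whose final entry is 1
    residual = S - np.outer(s, s)  # the Schur complement argument says this is PSD
    assert np.linalg.eigvalsh(residual).min() >= -1e-9
print("S - s s^T was PSD in all trials")
```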
Again, either $x_n$ outperforms the initial estimate
$$f_n - f_\star \le \frac{1}{2\theta_n^2}\|x_0 - x_\star\|^2,$$
or $z_{n+1}$ is close to $x_\star$. We refer to $z_{n+1}$ as the shadow iterate for the FSFOM $H$. Note that $z_{n+1}$ lives in the affine space $x_0 + \operatorname{span}(\{g_0, \dots, g_n\})$ and can be constructed upon learning $\{(f_i, g_i, x_i)\}_{i\in[0,n]}$.
Deriving a momentum method
Guessing the form of the method
Suppose we have computed $x_{n-1}$, $g_{n-1}$, and $z_n$ satisfying the following inductive hypothesis:
$$Q_H := \theta_{n-1}^2 (f_{n-1} - f_\star) - \left( \frac{1}{2}\|x_0 - x_\star\|^2 - \frac{1}{2}\|z_n - x_\star\|^2 \right) \le 0.$$
We must now decide how to choose our next point $x_n \in x_0 + \operatorname{span}(\{g_0, \dots, g_{n-1}\})$. Two natural choices in this affine space come to mind. First, we could take a gradient descent step $x_n = x_{n-1} - g_{n-1}$; this would be a good next step assuming that $f_{n-1} - f_\star$ is already small. Alternatively, we could take $x_n$ to be the shadow iterate $z_n$; this would be a good next step assuming that $f_{n-1} - f_\star$ is large (so that $\|z_n - x_\star\|$ is small).
From here, it is a small jump to guess the form
$$x_n = \alpha\,(x_{n-1} - g_{n-1}) + (1 - \alpha)\, z_n,$$
where $\alpha \in (0,1)$ is chosen to hedge between the two cases.
Definition of α
We will choose $\alpha$ to greedily minimize the worst-case value of $f_n - f_\star$ with respect to $\|x_0 - x_\star\|^2$. Here, the worst case is naturally taken over the pairs
$$f = \begin{pmatrix} f_{n-1}\\ f_n\\ f_\star \end{pmatrix},\qquad
G = \begin{pmatrix}
\|g_{n-1}\|^2 & \langle g_{n-1}, x_{n-1} - z_n\rangle & \langle g_{n-1}, g_n\rangle & \langle g_{n-1}, z_n - x_\star\rangle\\
\cdot & \|x_{n-1} - z_n\|^2 & \langle x_{n-1} - z_n, g_n\rangle & \langle x_{n-1} - z_n, z_n - x_\star\rangle\\
\cdot & \cdot & \|g_n\|^2 & \langle g_n, z_n - x_\star\rangle\\
\cdot & \cdot & \cdot & \|z_n - x_\star\|^2
\end{pmatrix}$$
(where entries below the diagonal in $G$ are defined by symmetry) lying in
$$D_H := \left\{ (f, G) :\ Q_H \le 0,\quad Q_{i,j} \le 0\ \ \forall i,j \in \{n-1, n, \star\},\quad G \succeq 0 \right\}.$$
Again, $Q_H$ and each of the $Q_{i,j}$ are linear functions in $(f, G)$.
Our goal is to pick $\alpha$ and $\theta_n^2$ to solve
$$\max_{\alpha,\ \theta_n^2} \left\{ \theta_n^2 :\ \theta_n^2(f_n - f_\star) - \frac{1}{2}\|x_0 - x_\star\|^2 \le 0\ \ \forall (f,G)\in D_H,\quad \alpha\in(0,1) \right\}$$
$$= \max_{\alpha,\ \theta_n^2,\ v,\ \lambda,\ S} \left\{ \theta_n^2 :\ \theta_n^2(f_n - f_\star) - \frac{1}{2}\|x_0 - x_\star\|^2 = \sum_{i,j\in\{n-1,n,\star\}} \lambda_{i,j} Q_{i,j} + v\, Q_H - \frac{1}{2}\langle S, G\rangle,\quad \alpha\in(0,1),\ \lambda\ge 0,\ v\ge 0,\ S\succeq 0 \right\}.$$
Here, $\lambda$ is a nonnegative matrix indexed by $\{n-1, n, \star\}^2$ and $S\in\mathbb{S}_+^4$.
Writing out each of the $Q_{i,j}$, matching the coefficient on $\|x_0 - x_\star\|^2$ (which forces $v = 1$), and equating the remaining coefficients on $f$ and $G$, we arrive at the following:
$$\max_{\alpha,\ \theta_n^2,\ \lambda} \left\{ \theta_n^2 :\ \alpha\in(0,1),\ \lambda\ge 0,\ \ \begin{pmatrix} -\lambda_{n-1,n} + \lambda_{n,n-1} - \lambda_{n-1,\star} + \lambda_{\star,n-1} + \theta_{n-1}^2\\ \lambda_{n-1,n} - \lambda_{n,n-1} - \lambda_{n,\star} + \lambda_{\star,n}\\ \lambda_{n-1,\star} - \lambda_{\star,n-1} + \lambda_{n,\star} - \lambda_{\star,n} - \theta_{n-1}^2 \end{pmatrix} = \begin{pmatrix} 0\\ \theta_n^2\\ -\theta_n^2 \end{pmatrix},\ \ M\succeq 0 \right\}$$
where $M$ is defined as
$$M = \begin{pmatrix}
\lambda_{n-1,n} + \lambda_{n,n-1}(1-2\alpha) + \lambda_{n-1,\star} + \lambda_{\star,n-1} & \lambda_{n,n-1}(\alpha - 1) - \lambda_{\star,n-1} & \lambda_{n-1,n}(\alpha-1) - \lambda_{n,n-1} + \lambda_{\star,n}\alpha & -\lambda_{\star,n-1}\\
\cdot & 0 & \lambda_{n-1,n}(1-\alpha) - \lambda_{\star,n}\alpha & 0\\
\cdot & \cdot & \lambda_{n-1,n} + \lambda_{n,n-1} + \lambda_{n,\star} + \lambda_{\star,n} & -\lambda_{\star,n}\\
\cdot & \cdot & \cdot & 1
\end{pmatrix}$$
and entries below the diagonal are defined by symmetry.
This problem looks far uglier than it really is. In fact, we can reduce it to a one-dimensional optimization problem and solve it explicitly without too much effort (see Appendix A). The punchline will be that the optimal solution sets
$$\alpha = \frac{2\theta_{n-1}^2}{1 + 2\theta_{n-1}^2 + \sqrt{1 + 4\theta_{n-1}^2}}$$
with optimal value
$$\theta_n = \frac{1 + \sqrt{1 + 4\theta_{n-1}^2}}{2}.$$
Furthermore, by Lemma 2, we will necessarily have a shadow iterate $z_{n+1}$ so that
$$\theta_n^2 (f_n - f_\star) \le \frac{1}{2}\|x_0 - x_\star\|^2 - \frac{1}{2}\|z_{n+1} - x_\star\|^2,$$
and we can continue the induction. By inspecting the slack term $M$ that we will find analytically in Appendix A, we will learn that we can set $z_{n+1} = z_n - (\theta_n^2 - \theta_{n-1}^2)\, g_n$. Since $\theta_n \ge \theta_{n-1} + \frac{1}{2}$ and $\theta_0 = 1$, induction gives $\theta_n \ge \frac{n+2}{2}$, so this is an accelerated gradient method with an $O(n^{-2})$ convergence rate: $f_n - f_\star \le \frac{2\|x_0 - x_\star\|^2}{(n+2)^2}$.
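Here is a small numerical check (my own, not from the references) that the closed-form choice above is feasible for the semidefinite program of the previous section. Plugging in $\lambda_{n-1,n} = \theta_{n-1}^2$, $\lambda_{\star,n} = \theta_n$, and all other multipliers zero (which is what Appendix A produces at the optimum), the linear constraints hold and the slack matrix $M$ is PSD:

```python
import numpy as np

def check_certificate(theta_prev_sq, tol=1e-12):
    t = (1 + np.sqrt(1 + 4 * theta_prev_sq)) / 2        # theta_n
    theta_sq = theta_prev_sq + t                         # theta_n^2 = theta_{n-1}^2 + theta_n
    lam_prev_n = theta_prev_sq                           # lambda_{n-1, n}
    lam_star_n = t                                       # lambda_{star, n}
    alpha = lam_prev_n / (lam_prev_n + lam_star_n)       # equals the closed-form alpha

    # Coefficient-matching (linear equality) constraints; all should vanish.
    eq = np.array([
        -lam_prev_n + theta_prev_sq,
        lam_prev_n + lam_star_n - theta_sq,
        -lam_star_n - theta_prev_sq + theta_sq,
    ])
    assert np.all(np.abs(eq) < tol)

    # Slack matrix M (upper triangle shown, then symmetrized), zero multipliers dropped.
    M = np.array([
        [lam_prev_n, 0.0, lam_prev_n * (alpha - 1) + lam_star_n * alpha, 0.0],
        [0.0,        0.0, lam_prev_n * (1 - alpha) - lam_star_n * alpha, 0.0],
        [0.0,        0.0, lam_prev_n + lam_star_n,                      -lam_star_n],
        [0.0,        0.0, 0.0,                                           1.0],
    ])
    M = np.triu(M) + np.triu(M, 1).T
    assert np.linalg.eigvalsh(M).min() > -tol
    return theta_sq

theta_sq = 1.0        # theta_0 = 1
for n in range(1, 6):
    theta_sq = check_certificate(theta_sq)
    print(n, theta_sq)
```

The bottom-right $2\times 2$ block of $M$ is the rank-one matrix $(\theta_n, -1)^\intercal(\theta_n, -1)$, which is exactly where the update $z_{n+1} = z_n - \theta_n g_n = z_n - (\theta_n^2 - \theta_{n-1}^2) g_n$ comes from.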
We summarize the above derivation as the following algorithm:
Fast Gradient Method
Initialize $\theta_0 = 1$, $z_1 = x_0 - g_0$.
For $t = 1, \dots, T$:
    Set $\alpha = \dfrac{2\theta_{t-1}^2}{1 + 2\theta_{t-1}^2 + \sqrt{1 + 4\theta_{t-1}^2}}$ and $\theta_t = \dfrac{1 + \sqrt{1 + 4\theta_{t-1}^2}}{2}$.
    Set $x_t = \alpha\,(x_{t-1} - g_{t-1}) + (1 - \alpha)\, z_t$.
    Set $z_{t+1} = z_t - (\theta_t^2 - \theta_{t-1}^2)\, g_t$.
The $z$ update can (and should) be moved to the beginning of the next iteration so that we only need to compute one gradient in each iteration. However, this form emphasizes that we maintain a valid inductive hypothesis by the end of each iteration.
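For concreteness, here is a direct Python transcription of the method (my own implementation sketch), together with a check of the guarantee $\theta_T^2 (f(x_T) - f_\star) \le \frac{1}{2}\|x_0 - x_\star\|^2$ on a random 1-smooth quadratic; the test problem (for which $f_\star = 0$) is an illustrative assumption.

```python
import numpy as np

def fast_gradient_method(grad, x0, T):
    """Run the fast gradient method above for T iterations; return x_T and theta_T^2."""
    theta_prev_sq = 1.0                      # theta_0^2
    g = grad(x0)
    x, z = x0, x0 - g                        # z_1 = x_0 - g_0
    for t in range(1, T + 1):
        theta = (1 + np.sqrt(1 + 4 * theta_prev_sq)) / 2
        alpha = 2 * theta_prev_sq / (1 + 2 * theta_prev_sq + np.sqrt(1 + 4 * theta_prev_sq))
        x = alpha * (x - g) + (1 - alpha) * z
        g = grad(x)                          # one gradient evaluation per iteration
        z = z - (theta**2 - theta_prev_sq) * g
        theta_prev_sq = theta**2
    return x, theta_prev_sq

# Random 1-smooth quadratic: f(x) = 0.5 (x - x_star)^T A (x - x_star) with 0 <= A <= I.
rng = np.random.default_rng(3)
d = 50
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = Q @ np.diag(rng.uniform(0, 1, d)) @ Q.T
x_star = rng.standard_normal(d)
f = lambda x: 0.5 * (x - x_star) @ A @ (x - x_star)
grad = lambda x: A @ (x - x_star)

x0 = x_star + rng.standard_normal(d)
xT, thetaT_sq = fast_gradient_method(grad, x0, T=100)
print(thetaT_sq * f(xT), 0.5 * np.dot(x0 - x_star, x0 - x_star))   # left value <= right value
```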
Alexandre d’Aspremont, Damien Scieur, and Adrien Taylor. 2021. “Acceleration Methods.” Foundations and Trends in Optimization 5 (1–2): 1–245.
Yoel Drori and Marc Teboulle. 2014. “Performance of First-Order Methods for Smooth Convex Minimization: A Novel Approach.” Mathematical Programming 145 (1): 451–82.
Benjamin Grimmer, Kevin Shu, and Alex L. Wang. 2024. “Composing Optimized Stepsize Schedules for Gradient Descent.” arXiv preprint arXiv:2410.16249.
Adrien B. Taylor, Julien M. Hendrickx, and François Glineur. 2017. “Smooth Strongly Convex Interpolation and Exact Worst-Case Performance of First-Order Methods.” Mathematical Programming 161: 307–45.
Appendix A: Solving for α
Our goal is to solve:
$$\max_{\alpha,\ \theta_n^2,\ \lambda} \left\{ \theta_n^2 :\ \alpha\in(0,1),\ \lambda\ge 0,\ \ \begin{pmatrix} -\lambda_{n-1,n} + \lambda_{n,n-1} - \lambda_{n-1,\star} + \lambda_{\star,n-1} + \theta_{n-1}^2\\ \lambda_{n-1,n} - \lambda_{n,n-1} - \lambda_{n,\star} + \lambda_{\star,n}\\ \lambda_{n-1,\star} - \lambda_{\star,n-1} + \lambda_{n,\star} - \lambda_{\star,n} - \theta_{n-1}^2 \end{pmatrix} = \begin{pmatrix} 0\\ \theta_n^2\\ -\theta_n^2 \end{pmatrix},\ \ M\succeq 0 \right\}.$$
Suppose we have any feasible solution to this problem. We will construct a new feasible solution with the same objective value but with a specific structure. First, the fact that $M_{2,2} = 0$ implies that $M_{1,2} = 0$ and $M_{2,3} = 0$, i.e.,
$$\lambda_{\star,n-1} = \lambda_{n,n-1} = 0 \qquad\text{and}\qquad \alpha = \frac{\lambda_{n-1,n}}{\lambda_{n-1,n} + \lambda_{\star,n}}.$$
Thus,
$$M = \begin{pmatrix}
\lambda_{n-1,n} + \lambda_{n-1,\star} & 0 & 0 & 0\\
0 & 0 & 0 & 0\\
0 & 0 & \lambda_{n-1,n} + \lambda_{n,\star} + \lambda_{\star,n} & -\lambda_{\star,n}\\
0 & 0 & -\lambda_{\star,n} & 1
\end{pmatrix}.$$
We will modify our feasible solution by increasing both $\lambda_{n,\star}$ and $\lambda_{\star,n}$ simultaneously, by the same amount. Note that the value of the objective function does not change and the linear equality constraints remain satisfied. On the other hand, by increasing these values enough, we can force the bottom-right $2\times 2$ block of $M$ to be rank-one. We deduce there exists an optimal solution with
$$\lambda_{\star,n-1} = \lambda_{n,n-1} = 0, \qquad \lambda_{n-1,n} + \lambda_{n,\star} + \lambda_{\star,n} = \lambda_{\star,n}^2.$$
Additionally, as long as this constraint is satisfied, we can find a setting of $\alpha$ so that $M \succeq 0$.
We may thus simplify the original problem to the problem below. Here, we have enforced the constraints listed above and the linear equality constraints from the original problem. We additionally parameterize $\lambda_{n-1,n} = s$ and $\lambda_{\star,n} = t$. (Concretely, the equality constraints give $\lambda_{n-1,\star} = \theta_{n-1}^2 - s \ge 0$ and $\theta_n^2 = s - \lambda_{n,\star} + t$, while the rank-one relation gives $\lambda_{n,\star} = t^2 - s - t \ge 0$; together these yield the objective and constraints below.)
$$\max_{s,t} \left\{ 2s + 2t - t^2 :\ s, t > 0,\quad \theta_{n-1}^2 \ge s,\quad t \ge \frac{1 + \sqrt{1 + 4s}}{2} \right\}.$$
Note that for any $s > 0$, we have the constraint $t \ge \frac{1 + \sqrt{1 + 4s}}{2} > 1$, and the objective is decreasing in $t$ for $t > 1$. Thus, the optimal choice of $t$ must set $t = \frac{1 + \sqrt{1 + 4s}}{2}$. Finally,
$$\max_{s} \left\{ \frac{1 + 2s + \sqrt{1 + 4s}}{2} :\ 0 < s \le \theta_{n-1}^2 \right\}.$$
The objective is increasing in $s$, so the optimal solution is $s = \theta_{n-1}^2$. We can then construct the optimal solution to the original problem by substituting backwards to get
$$\alpha = \frac{2\theta_{n-1}^2}{1 + 2\theta_{n-1}^2 + \sqrt{1 + 4\theta_{n-1}^2}}.$$
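As a final sanity check (mine, not from the references), a brute-force grid search over the reduced two-variable problem agrees with the closed-form value of $\theta_n^2$:

```python
import numpy as np

def closed_form(theta_prev_sq):
    # theta_n^2 = (1 + 2*theta_{n-1}^2 + sqrt(1 + 4*theta_{n-1}^2)) / 2
    return (1 + 2 * theta_prev_sq + np.sqrt(1 + 4 * theta_prev_sq)) / 2

def grid_search(theta_prev_sq, grid=1000):
    s = np.linspace(1e-6, theta_prev_sq, grid)
    t = np.linspace(1e-6, 5 + theta_prev_sq, grid)
    S, T = np.meshgrid(s, t)
    obj = 2 * S + 2 * T - T**2
    feasible = T >= (1 + np.sqrt(1 + 4 * S)) / 2
    return obj[feasible].max()

for theta_prev_sq in [1.0, 2.618, 7.0]:
    print(closed_form(theta_prev_sq), grid_search(theta_prev_sq))
```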