What is momentum? – A PEP view

This post gives a performance estimation program (PEP) motivation for gradient descent algorithms with momentum and a version of Nesterov’s Fast Gradient Method. These proofs are not new (see (d’Aspremont, Scieur, and Taylor 2021, Section 4.4.1)), but I hope the presentation clarifies where such proofs come from.


PEP background

This section summarizes some background material on the Performance Estimation Programming (PEP) methodology initiated by (Drori and Teboulle 2014) and (Taylor, Hendrickx, and Glineur 2017).

Our goal is to find a minimizer of minxRdf(x)\begin{aligned} \min_{x\in{\mathbb{R}}^d} f(x) \end{aligned} given only that ff is convex and 11-smooth and that ff has a minimizer xx_\star. All the ideas extend to the general setting of an LL-smooth convex function, with slightly more cumbersome notation.

Recall, a convex function f:RdRf:{\mathbb{R}}^d\to{\mathbb{R}} is said to be 11-smooth if it is differentiable and for all x,yRdx,y\in{\mathbb{R}}^d, f(x)f(y)xy.\begin{aligned} \left\lVert \nabla f(x) - \nabla f(y) \right\rVert \leq \left\lVert x - y \right\rVert. \end{aligned} Equivalently, this holds if and only if for all x,yRdx,y\in{\mathbb{R}}^d, f(y)f(x)+f(x),yx+12xy2.\begin{aligned} f(y) \leq f(x) + \left\langle \nabla f(x), y-x \right\rangle+\frac{1}{2}\left\lVert x-y \right\rVert^2. \end{aligned}
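As a quick sanity check on these two characterizations (a small sketch of my own, assuming numpy is available; the quadratic test function is not from the post), we can verify both inequalities numerically for a 1-smooth convex quadratic:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# A 1-smooth convex quadratic: f(x) = 0.5 x^T A x with 0 <= A <= I.
B = rng.standard_normal((d, d))
A = B @ B.T
A /= np.linalg.norm(A, 2)            # rescale so the largest eigenvalue is 1

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

for _ in range(1000):
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    # Gradient-Lipschitz form: ||grad f(x) - grad f(y)|| <= ||x - y||.
    assert np.linalg.norm(grad(x) - grad(y)) <= np.linalg.norm(x - y) + 1e-12
    # Equivalent upper-bound form (the "descent lemma").
    assert f(y) <= f(x) + grad(x) @ (y - x) + 0.5 * np.linalg.norm(x - y) ** 2 + 1e-12
print("both characterizations of 1-smoothness hold on the sampled points")
```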

We will attempt to minimize ff using a fixed-step first-order method (FSFOM). Formally, this is any algorithm of the form x0is given as part of the inputx1=x0H1,0f(x0)x2=x0H2,0f(x0)H2,1f(x1)xn=x0i=0n1Hn,if(xi)\begin{aligned} &x_0 \text{\qquad is given as part of the input}\\ &x_1 = x_0 - H_{1,0} \nabla f(x_0)\\ &x_2 = x_0 - H_{2,0} \nabla f(x_0) - H_{2,1} \nabla f(x_1)\\ &\vdots\\ &x_n = x_0 - \sum_{i=0}^{n-1} H_{n,i} \nabla f(x_i) \end{aligned} parameterized by a matrix HR[0,n]×[0,n]H\in{\mathbb{R}}^{[0,n]\times[0,n]} that has support below the diagonal, i.e., Hi,j=0H_{i,j} = 0 for all jij\geq i. The matrix HH fixes in advance the stepsizes to be taken at each iteration. HH being supported below the diagonal corresponds to the natural requirement that xix_i should only depend on the gradients f(x0),,f(xi1)\nabla f(x_0),\dots,\nabla f(x_{i-1}).
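Any such H can be run mechanically: x_k is always x_0 minus an H-weighted combination of the gradients seen so far. Here is a small sketch (the helper run_fsfom and the test problem are my own, not from the post); gradient descent with unit steps corresponds to H_{k,i} = 1 for all i < k.

```python
import numpy as np

def run_fsfom(H, grad, x0):
    """Run the FSFOM x_k = x_0 - sum_{i < k} H[k, i] * grad(x_i).

    H is an (n+1) x (n+1) array with H[k, i] = 0 for i >= k.
    Returns the iterates x_0, ..., x_n.
    """
    n = H.shape[0] - 1
    xs, gs = [x0], []
    for k in range(1, n + 1):
        gs.append(grad(xs[-1]))                       # gradient at the latest iterate
        xs.append(x0 - sum(H[k, i] * gs[i] for i in range(k)))
    return xs

# Gradient descent with unit steps: x_k = x_{k-1} - grad(x_{k-1}).
n, d = 10, 3
H = np.tril(np.ones((n + 1, n + 1)), k=-1)            # strictly lower-triangular ones

A = np.diag([1.0, 0.5, 0.1])                          # a 1-smooth convex quadratic
xs = run_fsfom(H, lambda x: A @ x, np.ones(d))
print("final iterate:", xs[-1])
```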

Let I={0,1,,n,}\mathcal{I} = \left\{0,1,\dots,n,\star\right\} and introduce the shorthand fi=f(xi)f_i = f(x_i) and gi=f(xi)g_i = \nabla f(x_i) for all iIi\in\mathcal{I}.

Now, let us consider the set of first-order data {(fi,gi,xi)}iI\left\{(f_i, g_i, x_i)\right\}_{i\in\mathcal{I}}. The fact that this data comes from a 11-smooth convex function imposes constraints on what this data may look like.

Lemma 1 ((Taylor, Hendrickx, and Glineur 2017, Corollary 1)). The first-order data {(fi,gi,xi)}iI\left\{(f_i, g_i, x_i)\right\}_{i\in\mathcal{I}} can be realized by a 11-smooth convex function (in short, is interpolable) if and only if for all i,jIi,j\in\mathcal{I} Qi,jfjfi+gj,xixj+12gigj20.\begin{aligned} Q_{i,j}\coloneqq f_j - f_i + \left\langle g_j, x_i - x_j \right\rangle + \frac{1}{2}\left\lVert g_i - g_j \right\rVert^2 \leq 0. \end{aligned}
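To illustrate the easy ("only if") direction of Lemma 1, here is a small numerical check (a sketch; numpy and the quadratic test function are my own choices) that data sampled from a 1-smooth convex function satisfies every inequality Q_{i,j} \leq 0:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4

B = rng.standard_normal((d, d))
A = B @ B.T
A /= np.linalg.norm(A, 2)                 # f(x) = 0.5 x^T A x is convex and 1-smooth
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

# First-order data at a few points; the last point plays the role of x_star = 0.
xs = [rng.standard_normal(d) for _ in range(5)] + [np.zeros(d)]
fs = [f(x) for x in xs]
gs = [grad(x) for x in xs]

def Q(i, j):
    # Q_{i,j} = f_j - f_i + <g_j, x_i - x_j> + 0.5 * ||g_i - g_j||^2
    return fs[j] - fs[i] + gs[j] @ (xs[i] - xs[j]) + 0.5 * np.linalg.norm(gs[i] - gs[j]) ** 2

assert all(Q(i, j) <= 1e-12 for i in range(len(xs)) for j in range(len(xs)))
print("all interpolation inequalities Q_{i,j} <= 0 hold")
```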

Note that the quantity Qi,jQ_{i,j} is linear in the formal variables f(f0f1fnf)G(g02g0,g1g0,gng0,x0xg1,g0g12g1,gng1,x0xgn,g0gn,g1gn2gn,x0xx0x,g0x0x,g1x0x,gnx0x2).\begin{aligned} f\coloneqq \begin{pmatrix} f_0\\ f_1\\ \vdots\\ f_n\\ f_\star \end{pmatrix} \qquad G\coloneqq \begin{pmatrix} \left\lVert g_0 \right\rVert^2 & \left\langle g_0, g_1 \right\rangle & \dots & \left\langle g_0, g_n \right\rangle & \left\langle g_0, x_0 - x_\star \right\rangle\\ \left\langle g_1, g_0 \right\rangle & \left\lVert g_1 \right\rVert^2 & \dots & \left\langle g_1, g_n \right\rangle & \left\langle g_1, x_0 - x_\star \right\rangle\\ \vdots & \vdots & \ddots & \vdots & \vdots\\ \left\langle g_n, g_0 \right\rangle & \left\langle g_n,g_1 \right\rangle & \dots & \left\lVert g_n \right\rVert^2 & \left\langle g_n, x_0 - x_\star \right\rangle\\ \left\langle x_0 - x_\star, g_0 \right\rangle & \left\langle x_0 - x_\star, g_1 \right\rangle & \dots & \left\langle x_0 - x_\star, g_n \right\rangle & \left\lVert x_0 - x_\star \right\rVert^2 \end{pmatrix}. \end{aligned}

Example 1. As an example, we can write Q3,1Q_{3,1} as Q3,1=f1f3+g1,x3x1+12g1g32=f1f3+g1,(H1,0H3,0)g0H3,1g1H3,2g2+12g1g32=(01010)f+12(0H1,0H3,000H1,0H3,012H3,1H3,210H3,2000101),G\begin{aligned} Q_{3,1} &= f_1 - f_3 + \left\langle g_1, x_3 - x_1 \right\rangle + \frac{1}{2}\left\lVert g_1 - g_3 \right\rVert^2\\ &= f_1 - f_3 + \left\langle g_1,\left(H_{1,0}-H_{3,0}\right)g_0 - H_{3,1}g_1 - H_{3,2} g_2 \right\rangle + \frac{1}{2}\left\lVert g_1 - g_3 \right\rVert^2\\ &= \begin{pmatrix} 0\\ 1\\ 0\\ -1\\ 0\\ \vdots \end{pmatrix}^\intercal f + \frac{1}{2}\left\langle \begin{pmatrix} 0 & H_{1,0} - H_{3,0} & 0 & 0 & \dots\\ H_{1,0} - H_{3,0} & 1-2H_{3,1} & -H_{3,2} & -1 & \dots\\ 0 & -H_{3,2} & 0 & 0 & \dots\\ 0 & -1 & 0 & 1 & \dots\\ \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix},G \right\rangle \end{aligned}
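This bookkeeping is easy to automate. The sketch below (the helper Q_coefficients and the choice of basis are mine) computes, for a given H, the vector a and symmetric matrix M with Q_{i,j} = a^\intercal f + \frac{1}{2}\left\langle M, G \right\rangle in the basis (g_0,\dots,g_n,x_0-x_\star), and reproduces the coefficients of Example 1:

```python
import numpy as np

def Q_coefficients(i, j, H, n):
    """Return (a, M) with Q_{i,j} = a^T f + 0.5 * <M, G>.

    Indices 0..n are the iterates and index n+1 stands for x_star.
    The Gram basis is (g_0, ..., g_n, x_0 - x_star); f = (f_0, ..., f_n, f_star).
    """
    m = n + 2
    u = np.zeros((m, m))                  # u[k] = coefficients of g_k (g_star = 0)
    u[:n + 1, :n + 1] = np.eye(n + 1)
    p = np.zeros((m, m))                  # p[k] = coefficients of x_k - x_star
    for k in range(n + 1):
        p[k, -1] = 1.0                    # the x_0 - x_star component
        p[k, :k] = -H[k, :k]              # minus the gradient steps taken so far

    a = np.zeros(m)
    a[j] += 1.0                           # + f_j
    a[i] -= 1.0                           # - f_i
    d = p[i] - p[j]                       # coefficients of x_i - x_j
    M = np.outer(u[j], d) + np.outer(d, u[j])        # <g_j, x_i - x_j>
    w = u[i] - u[j]
    M += np.outer(w, w)                              # 0.5 * ||g_i - g_j||^2
    return a, M

# Reproduce Example 1: Q_{3,1} for an H with n = 3 (arbitrary test entries).
n = 3
H = np.zeros((n + 1, n + 1))
H[1, 0], H[3, 0], H[3, 1], H[3, 2] = 1.0, 0.3, 0.5, 0.7
a, M = Q_coefficients(3, 1, H, n)
print(a)   # (0, 1, 0, -1, 0)
print(M)   # matches the displayed matrix; the omitted entries are zero
```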

We learn that the set of interpolable (f,G)(f,G) pairs can be written as D{(f,G)RI×SI:Qi,j0,i,jIG0rank(G)d}.\begin{aligned} {\mathcal D}\coloneqq \left\{(f,G)\in{\mathbb{R}}^{\mathcal{I}}\times {\mathbb{S}}^{\mathcal{I}}:\, \begin{array}{l} Q_{i,j} \leq 0,\qquad\forall i,j\in\mathcal{I}\\ G\succeq 0\\ \operatorname{rank}(G) \leq d \end{array}\right\}. \end{aligned} Here, each Qi,jQ_{i,j} is a linear function in (f,G)(f,G). We will make a high-dimensional assumption, dn+2d \geq n+2, so that we can ignore the rank constraint. Then, D{\mathcal D} becomes the feasible domain of a semidefinite program (SDP)!
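Under this assumption, the worst case of any FSFOM can be computed numerically by maximizing f_n - f_\star over {\mathcal D} with the normalization \left\lVert x_0-x_\star \right\rVert^2 \leq 1. Below is a rough sketch of this SDP (my own formulation, assuming cvxpy with an SDP-capable solver such as SCS is installed), specialized to gradient descent with unit steps:

```python
import cvxpy as cp
import numpy as np

n = 5                                         # number of gradient steps
H = np.tril(np.ones((n + 1, n + 1)), -1)      # gradient descent with unit steps

m = n + 2                                     # Gram basis: g_0, ..., g_n, x_0 - x_star
star = n + 1
G = cp.Variable((m, m), PSD=True)             # Gram matrix variable
f = cp.Variable(m)                            # (f_0, ..., f_n, f_star)

# Coefficient vectors: u[k] represents g_k (g_star = 0), p[k] represents x_k - x_star.
u = np.zeros((m, m))
u[:n + 1, :n + 1] = np.eye(n + 1)
p = np.zeros((m, m))
for k in range(n + 1):
    p[k, star] = 1.0
    p[k, :k] = -H[k, :k]

constraints = [f[star] == 0, p[0] @ G @ p[0] <= 1]   # normalize ||x_0 - x_star||^2 <= 1
for i in range(m):
    for j in range(m):
        if i != j:
            # interpolation inequality Q_{i,j} <= 0
            constraints.append(
                f[j] - f[i] + u[j] @ G @ (p[i] - p[j])
                + 0.5 * (u[i] - u[j]) @ G @ (u[i] - u[j]) <= 0
            )

prob = cp.Problem(cp.Maximize(f[n] - f[star]), constraints)
prob.solve()
# The tight worst case for unit-step gradient descent is 1/(4n+2) (Drori and Teboulle 2014).
print("worst-case f_n - f_star:", prob.value, " vs ", 1 / (4 * n + 2))
```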


Convergence guarantees and the shadow iterate

Let’s consider perhaps the simplest FSFOM, corresponding to the matrix H = \begin{pmatrix} 0 \end{pmatrix} and n = 0. In this “algorithm”, we will simply return the starting iterate x_0. From the basic definition of 1-smoothness, we know that \begin{aligned} f_0 - f_\star - \frac{1}{2}\left\lVert x_0 - x_\star \right\rVert^2 \leq 0. \end{aligned} Let’s think about what this means in the PEP framework. In the language of PEP, f_0 - f_\star - \frac{1}{2}\left\lVert x_0 - x_\star \right\rVert^2 is a linear expression in (f,G) that is nonpositive on the cone {\mathcal D}. By strong duality (which holds under very minor conditions that I will sweep under the rug here; see (Taylor, Hendrickx, and Glineur 2017, Theorem 6)), we can certify this inequality using a nonnegative combination of the Q_{i,j} and an additional PSD slack term. Concretely, there must exist \lambda \in{\mathbb{R}}^{\mathcal{I}\times\mathcal{I}}_+ and a PSD matrix S\in{\mathbb{S}}^{\mathcal{I}}_+ so that \begin{aligned} f_0 - f_\star - \frac{1}{2}\left\lVert x_0 - x_\star \right\rVert^2 &= \sum_{i,j\in\mathcal{I}}\lambda_{i,j}Q_{i,j} - \frac{1}{2}\left\langle S,G \right\rangle \end{aligned} as a formal expression in the variables f and G.

For our choice of FSFOM, there is a unique such certificate: f0f12x0x2=Q,012(1111),G.\begin{aligned} f_0 - f_\star - \frac{1}{2}\left\lVert x_0 - x_\star \right\rVert^2 &= Q_{\star,0} - \frac{1}{2}\left\langle \begin{pmatrix} 1 & -1\\ -1 & 1 \end{pmatrix},G \right\rangle. \end{aligned} Rearranging this and dropping the term Q,00Q_{\star,0}\leq 0, we get the more precise bound: f0f12x0x212x0g0x2.\begin{aligned} f_0 - f_\star \leq \frac{1}{2}\left\lVert x_0 - x_\star \right\rVert^2 - \frac{1}{2}\left\lVert x_0 - g_0 - x_\star \right\rVert^2. \end{aligned} Let us define z1x0g0z_1 \coloneqq x_0 - g_0. Then, f0f12x0x212z1x2.\begin{aligned} f_0 - f_\star \leq \frac{1}{2}\left\lVert x_0 - x_\star \right\rVert^2 - \frac{1}{2}\left\lVert z_1 - x_\star \right\rVert^2. \end{aligned}

What does this tell us? One of two things must be true. Either:

  • x0x_0 outperforms the initial bound on f0ff_0 -f_\star, or

  • z1z_1 is close to xx_\star.

We refer to z1z_1 as a shadow iterate for this FSFOM. We emphasize that z1z_1 lives in the affine space x0+span({g0})x_0+\operatorname{span}(\left\{g_0\right\}) and can be constructed from the first-order information (f0,x0,g0)(f_0,x_0,g_0).
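Since the certificate above is a formal identity in (f,G), both it and the resulting bound can be checked numerically on data from any 1-smooth convex function. A small sketch (the quadratic test function is mine):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6
B = rng.standard_normal((d, d))
A = B @ B.T / np.linalg.norm(B @ B.T, 2)          # 1-smooth convex quadratic
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x0, xstar = rng.standard_normal(d), np.zeros(d)   # x_star = 0 minimizes f
f0, fstar, g0 = f(x0), f(xstar), grad(x0)

# The certificate identity: LHS = Q_{star,0} - 0.5 * <S, G> with S = [[1, -1], [-1, 1]].
lhs = f0 - fstar - 0.5 * np.linalg.norm(x0 - xstar) ** 2
Q_star_0 = f0 - fstar + g0 @ (xstar - x0) + 0.5 * np.linalg.norm(g0) ** 2
slack = 0.5 * np.linalg.norm(g0 - (x0 - xstar)) ** 2
assert np.isclose(lhs, Q_star_0 - slack)

# The resulting bound in terms of the shadow iterate z_1 = x_0 - g_0.
z1 = x0 - g0
assert f0 - fstar <= (0.5 * np.linalg.norm(x0 - xstar) ** 2
                      - 0.5 * np.linalg.norm(z1 - xstar) ** 2 + 1e-12)
print("certificate identity and shadow-iterate bound verified")
```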

Perhaps surprisingly, a similar story holds for any FSFOM. The following derivation is implicit in (Grimmer, Shu, and Wang 2024, Proposition 4).

Lemma 2. Suppose we have a FSFOM parameterized by H\in{\mathbb{R}}^{[0,n]\times[0,n]} and suppose we can give a guarantee (here, assume either that strong duality holds or that the value of \theta_n^2 is the largest that can be certified via the PEP SDP) of the form \begin{aligned} \theta_n^2(f_n - f_\star) \leq \frac{1}{2}\left\lVert x_0 - x_\star \right\rVert^2. \end{aligned} Then, there exist weights v_0,\dots,v_n so that z_{n+1}\coloneqq x_0- \sum_{i=0}^n v_ig_i satisfies \begin{aligned} \theta_n^2 (f_n-f_\star)\leq \frac{1}{2}\left\lVert x_0-x_\star \right\rVert^2 - \frac{1}{2}\left\lVert z_{n+1}-x_\star \right\rVert^2. \end{aligned}

Proof. Again, in the language of PEP, this is a linear expression in (f,G) that is nonpositive on the cone {\mathcal D} and can be certified via a nonnegative \lambda\in{\mathbb{R}}^{\mathcal{I} \times\mathcal{I}}_+ and a PSD S\in{\mathbb{S}}^{\mathcal{I}}_+: \begin{aligned} \theta_n^2(f_n - f_\star) - \frac{1}{2}\left\lVert x_0 - x_\star \right\rVert^2 &= \sum_{i,j\in\mathcal{I}}\lambda_{i,j}Q_{i,j} - \frac{1}{2}\left\langle S,G \right\rangle. \end{aligned} By inspection, the coefficient on \left\lVert x_0 -x_\star \right\rVert^2, i.e., on G_{\star,\star}, is zero in each of the Q_{i,j} expressions. Thus, for the above identity to hold, it must be that S_{\star,\star} = 1. By the Schur complement lemma, \begin{aligned} S\succeq \begin{pmatrix} S_{0,\star}\\ S_{1,\star}\\ \vdots\\ S_{n,\star}\\ 1 \end{pmatrix}\begin{pmatrix} S_{0,\star}\\ S_{1,\star}\\ \vdots\\ S_{n,\star}\\ 1 \end{pmatrix}^\intercal. \end{aligned}

Thus, we may set v_i = -S_{i,\star}, so that z_{n+1} = x_0 - \sum_{i=0}^n v_ig_i = x_0 + \sum_{i=0}^n S_{i,\star}g_i. Writing s for the last column of S, we have \left\langle S,G \right\rangle \geq \left\langle ss^\intercal,G \right\rangle = \left\lVert z_{n+1}-x_\star \right\rVert^2, so dropping the nonpositive terms \lambda_{i,j}Q_{i,j} completes the proof. ◻
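The only nontrivial ingredient here is the rank-one lower bound on S. A quick numerical illustration of that step (a sketch; the random construction is mine): any PSD matrix whose bottom-right entry equals 1 dominates the outer product of its last column with itself.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 5

C = rng.standard_normal((m, m))
S = C @ C.T                          # a random PSD matrix
S /= S[-1, -1]                       # rescale so that S[star, star] = 1

s = S[:, -1]                         # last column: (S_{0,star}, ..., S_{n,star}, 1)
# Schur complement: S - s s^T is PSD (and its last row and column are zero).
eigs = np.linalg.eigvalsh(S - np.outer(s, s))
print("smallest eigenvalue of S - s s^T:", eigs.min())
assert eigs.min() >= -1e-10
```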

Again, either xnx_n outperforms the initial estimate fnf12θn2x0x2f_n - f_\star\leq \frac{1}{2\theta_n^2}\left\lVert x_0-x_\star \right\rVert^2 or zn+1z_{n+1} is close to xx_\star. We refer to zn+1z_{n+1} as the shadow iterate for the FSFOM HH. Note that zn+1z_{n+1} lives in the affine space x0+span({g0,,gn})x_0 + \operatorname{span}\left(\left\{g_0, \dots,g_n\right\}\right) and can be constructed upon learning {(fi,gi,xi)}i[0,n]\left\{(f_i,g_i,x_i)\right\}_{i\in[0,n]}.


Deriving a momentum method

Guessing the form of the method

Suppose we have computed xn1x_{n-1}, gn1g_{n-1}, and znz_{n} satisfying the following inductive hypothesis: QHθn12(fn1f)(12x0x212znx2)0.\begin{aligned} Q_{H}\coloneqq \theta_{n-1}^2(f_{n-1} - f_\star) - \left(\frac{1}{2}\left\lVert x_0 -x_\star \right\rVert^2 - \frac{1}{2}\left\lVert z_{n} - x_\star \right\rVert^2\right) \leq 0. \end{aligned}

We must now decide how to choose our next point xnx0+span({g0,,gn1})x_n \in x_0 + \operatorname{span}(\left\{g_0,\dots,g_{n-1}\right\}). Two natural choices in this affine space come to mind: First, we could take a gradient descent step xn=xn1gn1x_{n} = x_{n-1} - g_{n-1}. This would be a good next step assuming that fn1ff_{n-1} - f_\star is already small. Alternatively, we could take xnx_{n} to be the shadow iterate znz_{n}. This would be a good next step assuming that fn1ff_{n-1} - f_\star is large (so that znx\left\lVert z_{n}- x_\star \right\rVert is small).

From here, it is a small jump to guess the form xn=α(xn1gn1)+(1α)zn,x_n = \alpha(x_{n-1} - g_{n-1}) + (1-\alpha)z_n, where α(0,1)\alpha\in(0,1) is chosen to hedge between the two cases.
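Rearranging this guess (simple algebra, my own restatement) makes the connection to momentum explicit: \begin{aligned} x_n = x_{n-1} - \alpha g_{n-1} + (1-\alpha)\left(z_n - x_{n-1}\right), \end{aligned} i.e., a gradient step from x_{n-1} plus a correction toward the shadow iterate. Since z_n - x_{n-1} lies in \operatorname{span}(\left\{g_0,\dots,g_{n-1}\right\}), this correction is an aggregate of the past gradients.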

Definition of α\alpha

We will choose α\alpha to greedily minimize the worst-case value of fnff_n - f_\star with respect to x0x2\left\lVert x_0-x_\star \right\rVert^2. Here, the worst-case is naturally taken over the pairs f=(fn1fnf),G=(gn12gn1,xn1zngn1,gngn1,znxxn1zn2xn1zn,gnxn1zn,znxgn2gn,znxznx2)\begin{aligned} f = \begin{pmatrix} f_{n-1}\\ f_{n}\\ f_\star \end{pmatrix},\qquad G = \begin{pmatrix} \left\lVert g_{n-1} \right\rVert^2& \left\langle g_{n-1}, x_{n-1}- z_n \right\rangle& \left\langle g_{n-1},g_n \right\rangle & \left\langle g_{n-1},z_n - x_\star \right\rangle\\ \cdot & \left\lVert x_{n-1} - z_n \right\rVert^2& \left\langle x_{n-1} - z_n, g_n \right\rangle & \left\langle x_{n-1} - z_n,z_n - x_\star \right\rangle\\ \cdot & \cdot & \left\lVert g_n \right\rVert^2 & \left\langle g_n,z_n - x_\star \right\rangle\\ \cdot&\cdot & \cdot & \left\lVert z_n - x_\star \right\rVert^2 \end{pmatrix} \end{aligned} (where entries below the diagonal in GG are defined by symmetry) satisfying DH{(f,G):QH0Qi,j0,i,j{n1,n,}G0}.\begin{aligned} {\mathcal D}_H\coloneqq \left\{(f,G):\, \begin{array}{l} Q_H\leq 0\\ Q_{i,j}\leq 0,\, \forall i,j\in\left\{n-1,n,\star\right\}\\ G\succeq 0 \end{array}\right\}. \end{aligned} Again, QHQ_H and each of the Qi,jQ_{i,j} are linear functions in (f,G)(f,G).

Our goal is to pick \alpha and \theta_n^2 to solve \begin{aligned} &\max_{\alpha,\theta^2_n}\left\{\theta^2_n:\, \begin{array}{l} \theta^2_n(f_{n}-f_\star) - \frac{1}{2}\left\lVert x_0-x_\star \right\rVert^2 \leq 0,\,\forall (f,G)\in {\mathcal D}_H\\ \alpha\in(0,1) \end{array}\right\}\\ &\qquad = \max_{\alpha,\theta^2_n,v,\lambda, S}\left\{\theta^2_n:\, \begin{array}{l} \theta^2_n(f_{n}-f_\star) - \frac{1}{2}\left\lVert x_0-x_\star \right\rVert^2 = \sum_{i,j\in\left\{n-1,n,\star\right\}}\lambda_{i,j}Q_{i,j} + vQ_H - \frac{1}{2}\left\langle S,G \right\rangle\\ \alpha\in(0,1),\, \lambda\geq 0,\, v\geq 0,\, S\succeq 0 \end{array}\right\}. \end{aligned} Here, \lambda is a nonnegative matrix indexed by \left\{n-1,n,\star\right\}^2 and S\in{\mathbb{S}}^{4}_+.

Writing out each of the Qi,jQ_{i,j} and equating the coefficients on ff and GG, we arrive at the following maxα,θn2,λ{θn2:α(0,1),λ0(λn1,n+λn,n1λn1,+λ,n1+θn12λn1,nλn,n1λn,+λ,nλn1,λ,n1+λn,λ,nθn12)=(0θn2θn2)M0}\begin{aligned} \max_{\alpha,\theta^2_n,\lambda}\left\{\theta^2_n:\, \begin{array}{l} \alpha\in(0,1), \lambda\geq 0\\ \begin{pmatrix} -\lambda_{n-1,n} + \lambda_{n,n-1} - \lambda_{n-1,\star} + \lambda_{\star,n-1} + \theta^2_{n-1}\\ \lambda_{n-1,n} - \lambda_{n,n-1} - \lambda_{n,\star} + \lambda_{\star,n}\\ \lambda_{n-1,\star} - \lambda_{\star,n-1} + \lambda_{n,\star} - \lambda_{\star,n} - \theta^2_{n-1} \end{pmatrix} = \begin{pmatrix} 0\\ \theta_n^2\\ -\theta_n^2 \end{pmatrix}\\ {\mathcal M}\succeq 0 \end{array}\right\} \end{aligned} where M{\mathcal M} is defined as (λn1,n+λn,n1(12α)+λn1,+λ,n1λn,n1(α1)λ,n1λn1,n(α1)λn,n1+λ,nαλ,n10λn1,n(1α)λ,nα0λn1,n+λn,n1+λn,+λ,nλ,n1)\begin{aligned} \bgroup \left( \begin{smallmatrix} \lambda_{n-1,n} + \lambda_{n,n-1}(1-2\alpha) + \lambda_{n-1,\star} + \lambda_{\star,n-1} & \lambda_{n,n-1}(\alpha-1) - \lambda_{\star,n-1} & \lambda_{n-1,n}(\alpha-1) - \lambda_{n,n-1} + \lambda_{\star,n} \alpha& - \lambda_{\star,n-1}\\ \cdot& 0 & \lambda_{n-1,n}(1-\alpha) - \lambda_{\star,n}\alpha& 0\\ \cdot & \cdot & \lambda_{n-1,n} + \lambda_{n,n-1} + \lambda_{n,\star} + \lambda_{\star,n} & -\lambda_{\star,n}\\ \cdot & \cdot & \cdot & 1 \end{smallmatrix} \right) \egroup \end{aligned} and entries below the diagonal are defined by symmetry.

This problem looks far uglier than it really is. In fact, we can reduce it to a one-dimensional optimization problem and solve it explicitly without too much effort (see Appendix A). The punchline will be that the optimal solution sets \begin{aligned} \alpha = \frac{2\theta^2_{n-1}}{1+2\theta^2_{n-1}+\sqrt{1+4\theta^2_{n-1}}} \end{aligned} with optimal value \begin{aligned} \theta_{n} = \frac{1 + \sqrt{1+4\theta^2_{n-1}}}{2}. \end{aligned} Furthermore, by Lemma 2, we will necessarily have a shadow iterate z_{n+1} so that \begin{aligned} \theta_{n}^2(f_n-f_\star) \leq \frac{1}{2}\left\lVert x_0-x_\star \right\rVert^2 - \frac{1}{2}\left\lVert z_{n+1}-x_\star \right\rVert^2 \end{aligned} and we can continue the induction. By inspecting the slack term {\mathcal M} that we will find analytically in Appendix A, we will learn that we can set z_{n+1}=z_n - (\theta_n^2-\theta_{n-1}^2)g_n. This gives an accelerated gradient method with an O(n^{-2}) convergence rate. We summarize the above derivation as the following algorithm:

Fast Gradient Method

  • Initialize θ0=1\theta_0 = 1, z1=x0g0z_1 = x_0 - g_0
  • For t=1,,Tt = 1,\dots,T
    • Set α=2θt121+2θt12+1+4θt12\alpha = \frac{2\theta_{t-1}^2}{1 + 2\theta_{t-1}^2 +\sqrt{1+4\theta_{t-1}^2}} and θt=1+1+4θt122\theta_t = \frac{1 + \sqrt{1 + 4\theta_{t-1}^2}}{2}
    • Set xt=α(xt1gt1)+(1α)ztx_{t} = \alpha(x_{t-1} - g_{t-1}) + (1-\alpha)z_{t}
    • Set zt+1=zt(θt2θt12)gtz_{t+1} = z_t - (\theta_t^2 - \theta_{t-1}^2) g_t

The zz update can/should be moved to the beginning of the next iteration so that we only need to compute one gradient in each iteration. However, this form emphasizes that we maintain a valid inductive hypothesis by the end of each iteration.
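To make the method concrete, here is a minimal numpy sketch (my own code, not from the post; the 1-smooth quadratic test problem is an assumption) that runs the iteration above and checks the certified bound f_t - f_\star \leq \left\lVert x_0-x_\star \right\rVert^2/(2\theta_t^2) along the way:

```python
import numpy as np

def fast_gradient_method(grad, x0, T):
    """Run the method above for a 1-smooth convex f; returns iterates and theta_t^2 values."""
    theta_prev_sq = 1.0                           # theta_0 = 1
    g = grad(x0)                                  # g_0
    x, z = x0.copy(), x0 - g                      # x_0 and z_1 = x_0 - g_0
    xs, theta_sqs = [x.copy()], [theta_prev_sq]
    for _ in range(T):
        root = np.sqrt(1.0 + 4.0 * theta_prev_sq)
        alpha = 2.0 * theta_prev_sq / (1.0 + 2.0 * theta_prev_sq + root)
        theta_sq = ((1.0 + root) / 2.0) ** 2
        x = alpha * (x - g) + (1.0 - alpha) * z   # x_t from x_{t-1}, g_{t-1}, z_t
        g = grad(x)                               # the single new gradient this iteration
        z = z - (theta_sq - theta_prev_sq) * g    # z_{t+1}
        theta_prev_sq = theta_sq
        xs.append(x.copy())
        theta_sqs.append(theta_sq)
    return xs, theta_sqs

# Test problem: f(x) = 0.5 x^T A x with eigenvalues in [0, 1], so x_star = 0 and f_star = 0.
rng = np.random.default_rng(0)
d = 50
A = np.diag(rng.uniform(0.0, 1.0, d))
f = lambda x: 0.5 * x @ A @ x

x0 = rng.standard_normal(d)
xs, theta_sqs = fast_gradient_method(lambda x: A @ x, x0, T=200)

# Certified O(1/t^2) rate: f_t - f_star <= ||x_0 - x_star||^2 / (2 * theta_t^2).
R2 = np.linalg.norm(x0) ** 2
assert all(f(x) <= R2 / (2 * th_sq) + 1e-9 for x, th_sq in zip(xs, theta_sqs))
print("final gap:", f(xs[-1]), " certified bound:", R2 / (2 * theta_sqs[-1]))
```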


Alexandre d’Aspremont, Damien Scieur, and Adrien Taylor. 2021. “Acceleration methods.” Foundations and Trends in Optimization 5(1-2): 1-245.

Yoel Drori and Marc Teboulle. 2014. “Performance of First-Order Methods for Smooth Convex Minimization: A Novel Approach.” Mathematical Programming 145 (1): 451–82.

Benjamin Grimmer, Kevin Shu, and Alex L Wang. 2024. “Composing Optimized Stepsize Schedules for Gradient Descent.” arXiv Preprint arXiv:2410.16249.

Adrien B Taylor, Julien M Hendrickx, and François Glineur. 2017. “Smooth Strongly Convex Interpolation and Exact Worst-Case Performance of First-Order Methods.” Mathematical Programming 161: 307–45.


Appendix A: Solving for α\alpha

Our goal is to solve: maxα,θn2,λ{θn2:α(0,1),λ0(λn1,n+λn,n1λn1,+λ,n1+θn12λn1,nλn,n1λn,+λ,nλn1,λ,n1+λn,λ,nθn12)=(0θn2θn2)M0}.\begin{aligned} \max_{\alpha,\theta^2_n,\lambda}\left\{\theta^2_n:\, \begin{array}{l} \alpha\in(0,1), \lambda\geq 0\\ \begin{pmatrix} -\lambda_{n-1,n} + \lambda_{n,n-1} - \lambda_{n-1,\star} + \lambda_{\star,n-1} + \theta^2_{n-1}\\ \lambda_{n-1,n} - \lambda_{n,n-1} - \lambda_{n,\star} + \lambda_{\star,n}\\ \lambda_{n-1,\star} - \lambda_{\star,n-1} + \lambda_{n,\star} - \lambda_{\star,n} - \theta^2_{n-1} \end{pmatrix} = \begin{pmatrix} 0\\ \theta_n^2\\ -\theta_n^2 \end{pmatrix}\\ {\mathcal M}\succeq 0 \end{array}\right\}. \end{aligned}

Suppose we have any feasible solution to this problem. We will construct a new feasible solution with the same objective value but with a specific structure. First, since {\mathcal M}\succeq 0 and {\mathcal M}_{2,2}=0, the second row and column of {\mathcal M} must vanish, i.e., {\mathcal M}_{1,2}=0 and {\mathcal M}_{2,3}=0. Using \alpha\in(0,1) and \lambda\geq 0, this gives \lambda_{\star,n-1}=\lambda_{n,n-1}=0 and \alpha = \frac{\lambda_{n-1,n}}{\lambda_{n-1,n}+\lambda_{\star,n}}. Thus, \begin{aligned} {\mathcal M}=\begin{pmatrix} \lambda_{n-1,n} + \lambda_{n-1,\star} & 0 & 0 & 0\\ \cdot& 0 & 0 & 0\\ \cdot & \cdot & \lambda_{n-1,n} + \lambda_{n,\star} + \lambda_{\star,n} & -\lambda_{\star,n}\\ \cdot & \cdot & \cdot & 1 \end{pmatrix}. \end{aligned} We will modify our feasible solution by increasing \lambda_{n,\star} and \lambda_{\star,n} simultaneously by the same amount. The value of the objective function does not change and the linear equality constraints remain satisfied. On the other hand, by increasing these values enough, we can force the bottom-right 2\times 2 block of {\mathcal M} to be rank-one. We deduce there exists an optimal solution with \begin{gathered} \lambda_{\star,n-1}=\lambda_{n,n-1}=0\\ \lambda_{n-1,n}+\lambda_{n,\star}+\lambda_{\star,n}=\lambda_{\star,n}^2. \end{gathered} Additionally, as long as this constraint is satisfied, we can find a setting of \alpha so that {\mathcal M}\succeq 0.

We may thus simplify the original problem to the problem below. Here, we have enforced the constraints listed above and the linear equality constraints from the original problem. We additionally parameterize \lambda_{n-1,n}= s and \lambda_{\star,n}=t: \begin{aligned} \max_{s,t}\left\{2s+2t-t^2:\, \begin{array}{l} s,t>0\\ \theta^2_{n-1}\geq s\\ t \geq \frac{1+\sqrt{1+4s}}{2} \end{array}\right\}. \end{aligned} Note that for any s>0, the constraint t \geq \frac{1+\sqrt{1+4s}}{2}>1 is in force, and the objective 2s+2t-t^2 is decreasing in t for t\geq 1. Thus, the optimal choice of t must set t = \frac{1+\sqrt{1+4s}}{2}. Substituting this value of t, the problem becomes \begin{aligned} \max_{s}\left\{\frac{1+2s+ \sqrt{1+4s}}{2}:\, 0<s\leq \theta^2_{n-1}\right\}. \end{aligned} The objective is increasing in s, so the optimal solution is s=\theta^2_{n-1}. We can then construct the optimal solution to the original problem by substituting in backwards to get \begin{aligned} \alpha = \frac{2\theta^2_{n-1}}{1+2\theta^2_{n-1}+\sqrt{1+4\theta^2_{n-1}}}. \end{aligned}
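As a quick numerical sanity check (a script of mine, not part of the post; the variable names s and t follow the parameterization above), we can plug this closed-form solution back in and confirm that the linear equality constraints hold, that the slack matrix {\mathcal M} is PSD, and that the optimal value matches the \theta recursion used by the Fast Gradient Method:

```python
import numpy as np

theta_prev_sq = 3.7                                   # an arbitrary positive theta_{n-1}^2

s = theta_prev_sq                                     # lambda_{n-1,n}
t = (1 + np.sqrt(1 + 4 * s)) / 2                      # lambda_{star,n}
lam_ns = t ** 2 - s - t                               # lambda_{n,star}, from the rank-one constraint
lam_prev_s = theta_prev_sq - s                        # lambda_{n-1,star} (zero at the optimum s)
lam_s_prev = lam_n_prev = 0.0                         # lambda_{star,n-1} = lambda_{n,n-1} = 0
alpha = s / (s + t)
theta_sq = 2 * s + 2 * t - t ** 2                     # claimed optimal theta_n^2

# Linear equality constraints (coefficients of f_{n-1}, f_n, f_star).
lhs = np.array([
    -s + lam_n_prev - lam_prev_s + lam_s_prev + theta_prev_sq,
    s - lam_n_prev - lam_ns + t,
    lam_prev_s - lam_s_prev + lam_ns - t - theta_prev_sq,
])
assert np.allclose(lhs, [0.0, theta_sq, -theta_sq])

# The slack matrix M (upper triangle as displayed earlier; symmetrized below).
M = np.array([
    [s + lam_n_prev * (1 - 2 * alpha) + lam_prev_s + lam_s_prev,
     lam_n_prev * (alpha - 1) - lam_s_prev,
     s * (alpha - 1) - lam_n_prev + t * alpha,
     -lam_s_prev],
    [0, 0, s * (1 - alpha) - t * alpha, 0],
    [0, 0, s + lam_n_prev + lam_ns + t, -t],
    [0, 0, 0, 1.0],
])
M = np.triu(M) + np.triu(M, 1).T                      # fill in the symmetric part
assert np.linalg.eigvalsh(M).min() >= -1e-10

# The optimal value matches theta_n = (1 + sqrt(1 + 4 theta_{n-1}^2)) / 2.
assert np.isclose(theta_sq, ((1 + np.sqrt(1 + 4 * theta_prev_sq)) / 2) ** 2)
print("closed-form certificate verified; theta_n^2 =", theta_sq)
```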

Last updated Mar 05, 2025