This post gives a performance estimation program (PEP) motivation for gradient descent algorithms with momentum and a version of Nesterov’s Fast Gradient Method.
These proofs are not new (see (d’Aspremont, Scieur, and Taylor 2021, Section 4.4.1)), but I hope the presentation clarifies where such proofs come from.
Our goal is to solve $\min_{x\in\mathbb{R}^d} f(x)$ given only that $f$ is convex and 1-smooth and that $f$ has a minimizer $x_\star$. All the ideas extend to the general setting of an $L$-smooth convex function, with slightly more cumbersome notation.
Recall that a convex function $f:\mathbb{R}^d\to\mathbb{R}$ is said to be 1-smooth if it is differentiable and, for all $x,y\in\mathbb{R}^d$, $\|\nabla f(x)-\nabla f(y)\|\le\|x-y\|$. Equivalently, this holds if and only if, for all $x,y\in\mathbb{R}^d$,
$$f(y)\le f(x)+\langle \nabla f(x),\, y-x\rangle+\frac{1}{2}\|x-y\|^2.$$
We will attempt to minimize $f$ using a fixed-step first-order method (FSFOM). Formally, this is any algorithm of the form
$$\begin{aligned}
x_0 &\text{ is given as part of the input}\\
x_1 &= x_0 - H_{1,0}\nabla f(x_0)\\
x_2 &= x_0 - H_{2,0}\nabla f(x_0) - H_{2,1}\nabla f(x_1)\\
&\ \,\vdots\\
x_n &= x_0 - \sum_{i=0}^{n-1} H_{n,i}\nabla f(x_i)
\end{aligned}$$
parameterized by a matrix $H\in\mathbb{R}^{[0,n]\times[0,n]}$ that has support below the diagonal, i.e., $H_{i,j}=0$ for all $j\ge i$. The matrix $H$ fixes in advance the stepsizes to be taken at each iteration. $H$ being supported below the diagonal corresponds to the natural requirement that $x_i$ should depend only on the gradients $\nabla f(x_0),\dots,\nabla f(x_{i-1})$.
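To make the definition concrete, here is a minimal sketch (my own, not from the post) of running an FSFOM given a strictly lower-triangular stepsize matrix `H`; the quadratic test function and all variable names are illustrative assumptions.

```python
import numpy as np

def run_fsfom(H, grad, x0):
    """Run the fixed-step first-order method defined by the strictly
    lower-triangular stepsize matrix H (shape (n+1, n+1)).
    Returns the iterates x_0, ..., x_n."""
    n = H.shape[0] - 1
    xs, grads = [x0], []
    for i in range(1, n + 1):
        grads.append(grad(xs[-1]))                     # g_{i-1} = grad f(x_{i-1})
        # x_i = x_0 - sum_{j < i} H[i, j] * g_j
        xs.append(x0 - sum(H[i, j] * grads[j] for j in range(i)))
    return xs

# Example: f(x) = 0.5 x^T A x with eigenvalues of A in [0, 1], so f is convex and 1-smooth.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
A = Q @ np.diag(rng.uniform(0, 1, 5)) @ Q.T
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

n = 4
H = np.tril(np.ones((n + 1, n + 1)), k=-1)   # all-ones below the diagonal = unit-step gradient descent
xs = run_fsfom(H, grad, x0=rng.standard_normal(5))
print([float(f(x)) for x in xs])             # function values along the run
```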
Let $I=\{0,1,\dots,n,\star\}$ and introduce the shorthand $f_i = f(x_i)$ and $g_i = \nabla f(x_i)$ for all $i\in I$ (so $f_\star = f(x_\star)$ and $g_\star = \nabla f(x_\star) = 0$).
Now, let us consider the set of first-order data $\{(f_i, g_i, x_i)\}_{i\in I}$. The fact that this data comes from a 1-smooth convex function imposes constraints on what this data may look like.
Lemma 1 ((Taylor, Hendrickx, and Glineur 2017, Corollary 1)). The first-order data $\{(f_i, g_i, x_i)\}_{i\in I}$ can be realized by a 1-smooth convex function (in short, is interpolable) if and only if, for all $i,j\in I$,
$$Q_{i,j} := f_j - f_i + \langle g_j,\, x_i - x_j\rangle + \frac{1}{2}\|g_i - g_j\|^2 \le 0.$$
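As a quick sanity check (my own illustration, not part of the lemma), the inequality $Q_{i,j}\le 0$ can be verified numerically for first-order data sampled from a concrete 1-smooth convex function, e.g. a quadratic whose Hessian has eigenvalues in $[0,1]$:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = Q @ np.diag(rng.uniform(0, 1, d)) @ Q.T   # 0 <= A <= I, so f is convex and 1-smooth
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

# Sample first-order data at a few points and check every Q_{i,j} <= 0.
xs = [rng.standard_normal(d) for _ in range(5)]
data = [(f(x), grad(x), x) for x in xs]
worst = max(
    fj - fi + gj @ (xi - xj) + 0.5 * np.dot(gi - gj, gi - gj)
    for (fi, gi, xi) in data
    for (fj, gj, xj) in data
)
print(worst)   # nonpositive, up to numerical precision
```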
Note that the quantity $Q_{i,j}$ is linear in the formal variables
$$f := \begin{pmatrix} f_0\\ f_1\\ \vdots\\ f_n\\ f_\star \end{pmatrix},\qquad
G := \begin{pmatrix}
\|g_0\|^2 & \langle g_0, g_1\rangle & \dots & \langle g_0, g_n\rangle & \langle g_0, x_0 - x_\star\rangle\\
\langle g_1, g_0\rangle & \|g_1\|^2 & \dots & \langle g_1, g_n\rangle & \langle g_1, x_0 - x_\star\rangle\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
\langle g_n, g_0\rangle & \langle g_n, g_1\rangle & \dots & \|g_n\|^2 & \langle g_n, x_0 - x_\star\rangle\\
\langle x_0 - x_\star, g_0\rangle & \langle x_0 - x_\star, g_1\rangle & \dots & \langle x_0 - x_\star, g_n\rangle & \|x_0 - x_\star\|^2
\end{pmatrix}.$$
In other words, $G$ is the Gram matrix of the vectors $g_0,\dots,g_n,\,x_0 - x_\star$.
Example 1. As an example, we can write $Q_{3,1}$ as
$$\begin{aligned}
Q_{3,1} &= f_1 - f_3 + \langle g_1,\, x_3 - x_1\rangle + \frac{1}{2}\|g_1 - g_3\|^2\\
&= f_1 - f_3 + \langle g_1,\, (H_{1,0} - H_{3,0})g_0 - H_{3,1}g_1 - H_{3,2}g_2\rangle + \frac{1}{2}\|g_1 - g_3\|^2\\
&= \begin{pmatrix} 0\\ 1\\ 0\\ -1\\ 0\\ \vdots \end{pmatrix}^\intercal f
+ \frac{1}{2}\left\langle
\begin{pmatrix}
0 & H_{1,0} - H_{3,0} & 0 & 0 & \dots\\
H_{1,0} - H_{3,0} & 1 - 2H_{3,1} & -H_{3,2} & -1 & \dots\\
0 & -H_{3,2} & 0 & 0 & \dots\\
0 & -1 & 0 & 1 & \dots\\
\vdots & \vdots & \vdots & \vdots & \ddots
\end{pmatrix},\, G\right\rangle.
\end{aligned}$$
We learn that the set of interpolable $(f, G)$ pairs can be written as
$$D := \left\{ (f, G) \in \mathbb{R}^I \times \mathbb{S}^I :\ Q_{i,j} \le 0\ \ \forall i,j\in I,\quad G \succeq 0,\quad \operatorname{rank}(G) \le d \right\}.$$
Here, each $Q_{i,j}$ is a linear function in $(f, G)$. We will make a high-dimensional assumption, $d \ge n + 2$, so that we can ignore the rank constraint. Then, $D$ becomes the feasible domain of a semidefinite program (SDP)!
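To illustrate, here is a sketch (my own, with an assumed normalization $\|x_0 - x_\star\|^2 \le 1$) of how one might set up this PEP SDP in CVXPY for a given stepsize matrix $H$, maximizing $f_n - f_\star$ over interpolable data:

```python
import cvxpy as cp
import numpy as np

def pep_worst_case(H):
    """Worst-case f(x_n) - f_star over 1-smooth convex f with ||x_0 - x_star||^2 <= 1,
    for the FSFOM given by the strictly lower-triangular matrix H of shape (n+1, n+1)."""
    n = H.shape[0] - 1
    N = n + 2                       # Gram basis: g_0, ..., g_n, x_0 - x_star
    e = np.eye(N)
    g = [e[i] for i in range(n + 1)] + [np.zeros(N)]                 # g_star = 0
    x = [e[N - 1] - sum(H[i, j] * e[j] for j in range(i)) for i in range(n + 1)]
    x.append(np.zeros(N))                                            # x_star - x_star = 0

    F = cp.Variable(N)              # (f_0, ..., f_n, f_star)
    G = cp.Variable((N, N), PSD=True)

    constraints = [e[N - 1] @ G @ e[N - 1] <= 1]                     # ||x_0 - x_star||^2 <= 1
    for i in range(N):
        for j in range(N):
            if i != j:
                # Q_{i,j} = f_j - f_i + <g_j, x_i - x_j> + 0.5 ||g_i - g_j||^2 <= 0
                constraints.append(
                    F[j] - F[i]
                    + g[j] @ G @ (x[i] - x[j])
                    + 0.5 * (g[i] - g[j]) @ G @ (g[i] - g[j])
                    <= 0
                )
    return cp.Problem(cp.Maximize(F[n] - F[N - 1]), constraints).solve()

# Unit-step gradient descent for n = 1, 2, 3 steps.
for n in range(1, 4):
    H = np.tril(np.ones((n + 1, n + 1)), k=-1)
    print(n, pep_worst_case(H))
```

For unit-step gradient descent, the returned values should be close to $\frac{1}{4n+2}$, the known tight worst case (Drori and Teboulle 2014).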
Convergence guarantees and the shadow iterate
Let’s consider perhaps the simplest FSFOM, corresponding to the matrix $H = (0)$ and $n = 0$. In this “algorithm”, we simply return the starting iterate $x_0$. From the basic definition of 1-smoothness (applied at $x_\star$, where $g_\star = 0$), we know that $f_0 - f_\star - \frac{1}{2}\|x_0 - x_\star\|^2 \le 0$. Let’s think about what this means in the PEP framework.
In the language of PEP, $f_0 - f_\star - \frac{1}{2}\|x_0 - x_\star\|^2$ is a linear expression in $(f, G)$ that is nonpositive on the cone $D$. By strong duality (which holds under very minor conditions that I will sweep under the rug here; see (Taylor, Hendrickx, and Glineur 2017, Theorem 6)), we can certify this inequality using a nonnegative combination of the $Q_{i,j}$ and an additional PSD slack term. Concretely, there must exist $\lambda \in \mathbb{R}_+^{I\times I}$ and a PSD matrix $S \in \mathbb{S}_+^I$ so that
$$f_0 - f_\star - \frac{1}{2}\|x_0 - x_\star\|^2 = \sum_{i,j\in I} \lambda_{i,j} Q_{i,j} - \frac{1}{2}\langle S, G\rangle$$
as a formal expression in the variables $f$ and $G$.
For our choice of FSFOM, there is a unique such certificate:
$$f_0 - f_\star - \frac{1}{2}\|x_0 - x_\star\|^2 = Q_{\star,0} - \frac{1}{2}\left\langle \begin{pmatrix} 1 & -1\\ -1 & 1 \end{pmatrix},\, G\right\rangle.$$
Rearranging this and dropping the term $Q_{\star,0} \le 0$, we get the more precise bound:
$$f_0 - f_\star \le \frac{1}{2}\|x_0 - x_\star\|^2 - \frac{1}{2}\|x_0 - g_0 - x_\star\|^2.$$
Let us define $z_1 := x_0 - g_0$. Then,
$$f_0 - f_\star \le \frac{1}{2}\|x_0 - x_\star\|^2 - \frac{1}{2}\|z_1 - x_\star\|^2.$$
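The certificate is easy to check symbolically. The following snippet (my own sanity check) verifies the identity above, writing everything in terms of the Gram entries of $(g_0,\, x_0 - x_\star)$:

```python
import sympy as sp

f0, fstar = sp.symbols('f0 fstar')
g00, g01, g11 = sp.symbols('g00 g01 g11')   # <g0, g0>, <g0, x0 - xstar>, ||x0 - xstar||^2

# Q_{star,0} = f_0 - f_star + <g_0, x_star - x_0> + 1/2 ||g_star - g_0||^2, with g_star = 0
Q_star0 = f0 - fstar - g01 + sp.Rational(1, 2) * g00

lhs = f0 - fstar - sp.Rational(1, 2) * g11
# <S, G> with S = [[1, -1], [-1, 1]] equals g00 - 2*g01 + g11
rhs = Q_star0 - sp.Rational(1, 2) * (g00 - 2 * g01 + g11)

print(sp.simplify(lhs - rhs))   # 0, so the identity holds formally
```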
What does this tell us? One of two things must be true: either $x_0$ outperforms the initial bound on $f_0 - f_\star$, or $z_1$ is close to $x_\star$.
We refer to $z_1$ as a shadow iterate for this FSFOM. We emphasize that $z_1$ lives in the affine space $x_0 + \operatorname{span}(\{g_0\})$ and can be constructed from the first-order information $(f_0, x_0, g_0)$.
Perhaps surprisingly, a similar story holds for any FSFOM. The
following derivation is implicit in (Grimmer, Shu, and Wang
2024, Proposition 4).
Lemma 2. Suppose we have an FSFOM parameterized by $H \in \mathbb{R}^{[0,n]\times[0,n]}$ and suppose we can give a guarantee (here, assume either that strong duality holds or that the value of $\theta_n^2$ is the largest that can be certified via the PEP SDP) of the form
$$\theta_n^2 (f_n - f_\star) \le \frac{1}{2}\|x_0 - x_\star\|^2.$$
Then, there exist weights $v_0, \dots, v_n$ so that $z_{n+1} := x_0 - \sum_{i=0}^n v_i g_i$ satisfies
$$\theta_n^2 (f_n - f_\star) \le \frac{1}{2}\|x_0 - x_\star\|^2 - \frac{1}{2}\|z_{n+1} - x_\star\|^2.$$
Proof. Again, in the language of PEP, this is a linear expression in $(f, G)$ that is nonpositive on the cone $D$ and can be certified via a nonnegative $\lambda \in \mathbb{R}_+^{I\times I}$ and a PSD $S \in \mathbb{S}_+^I$:
$$\theta_n^2 (f_n - f_\star) - \frac{1}{2}\|x_0 - x_\star\|^2 = \sum_{i,j\in I} \lambda_{i,j} Q_{i,j} - \frac{1}{2}\langle S, G\rangle.$$
By inspection, the coefficient on $\|x_0 - x_\star\|^2$, i.e., on $G_{\star,\star}$, is zero in each of the $Q_{i,j}$ expressions. Thus, for the above identity to hold, it must be that $S_{\star,\star} = 1$. By the Schur complement lemma,
$$S \succeq \begin{pmatrix} S_{0,\star}\\ S_{1,\star}\\ \vdots\\ S_{n,\star}\\ 1 \end{pmatrix} \begin{pmatrix} S_{0,\star}\\ S_{1,\star}\\ \vdots\\ S_{n,\star}\\ 1 \end{pmatrix}^\intercal.$$
Thus, we may set $v_i = -S_{i,\star}$, so that $z_{n+1} = x_0 - \sum_{i=0}^n v_i g_i = x_0 + \sum_{i=0}^n S_{i,\star}\, g_i$; the claimed bound then follows by evaluating the identity on interpolable data, where each $Q_{i,j} \le 0$ and $\langle S, G\rangle \ge \|z_{n+1} - x_\star\|^2$. This completes the proof. ◻
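As a small numerical illustration of the Schur-complement step (my own sanity check, not part of the proof): for any PSD matrix $S$ with $S_{\star,\star} = 1$, subtracting the outer product of its last column leaves a PSD matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
for _ in range(1000):
    B = rng.standard_normal((5, 5))
    S = B @ B.T                    # a random PSD matrix
    S /= S[-1, -1]                 # normalize so that S[star, star] = 1
    s = S[:, -1]                   # last column, whose final entry is 1
    residual = S - np.outer(s, s)  # the Schur complement argument says this is PSD
    assert np.linalg.eigvalsh(residual).min() >= -1e-9
print("S - s s^T was PSD in all trials")
```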
Again, either $x_n$ outperforms the initial estimate
$$f_n - f_\star \le \frac{1}{2\theta_n^2}\|x_0 - x_\star\|^2,$$
or $z_{n+1}$ is close to $x_\star$. We refer to $z_{n+1}$ as the shadow iterate for the FSFOM $H$. Note that $z_{n+1}$ lives in the affine space $x_0 + \operatorname{span}(\{g_0, \dots, g_n\})$ and can be constructed upon learning $\{(f_i, g_i, x_i)\}_{i\in[0,n]}$.
Deriving a momentum method
Guessing the form of the method
Suppose we have computed $x_{n-1}$, $g_{n-1}$, and $z_n$ satisfying the following inductive hypothesis:
$$Q_H := \theta_{n-1}^2 (f_{n-1} - f_\star) - \left( \frac{1}{2}\|x_0 - x_\star\|^2 - \frac{1}{2}\|z_n - x_\star\|^2 \right) \le 0.$$
We must now decide how to choose our next point $x_n \in x_0 + \operatorname{span}(\{g_0, \dots, g_{n-1}\})$. Two natural choices in this affine space come to mind. First, we could take a gradient descent step $x_n = x_{n-1} - g_{n-1}$; this would be a good next step assuming that $f_{n-1} - f_\star$ is already small. Alternatively, we could take $x_n$ to be the shadow iterate $z_n$; this would be a good next step assuming that $f_{n-1} - f_\star$ is large (so that $\|z_n - x_\star\|$ is small).
From here, it is a small jump to guess the form
$$x_n = \alpha\,(x_{n-1} - g_{n-1}) + (1 - \alpha)\, z_n,$$
where $\alpha \in (0,1)$ is chosen to hedge between the two cases.
Definition of α
We will choose $\alpha$ to greedily minimize the worst-case value of $f_n - f_\star$ with respect to $\|x_0 - x_\star\|^2$. Here, the worst case is naturally taken over the pairs
$$f = \begin{pmatrix} f_{n-1}\\ f_n\\ f_\star \end{pmatrix},\qquad
G = \begin{pmatrix}
\|g_{n-1}\|^2 & \langle g_{n-1}, x_{n-1} - z_n\rangle & \langle g_{n-1}, g_n\rangle & \langle g_{n-1}, z_n - x_\star\rangle\\
\cdot & \|x_{n-1} - z_n\|^2 & \langle x_{n-1} - z_n, g_n\rangle & \langle x_{n-1} - z_n, z_n - x_\star\rangle\\
\cdot & \cdot & \|g_n\|^2 & \langle g_n, z_n - x_\star\rangle\\
\cdot & \cdot & \cdot & \|z_n - x_\star\|^2
\end{pmatrix}$$
(where entries below the diagonal in $G$ are defined by symmetry) lying in
$$D_H := \left\{ (f, G) :\ Q_H \le 0,\quad Q_{i,j} \le 0\ \ \forall i,j \in \{n-1, n, \star\},\quad G \succeq 0 \right\}.$$
Again, $Q_H$ and each of the $Q_{i,j}$ are linear functions in $(f, G)$.
Our goal is to pick $\alpha$ and $\theta_n^2$ to solve
$$\max_{\alpha,\ \theta_n^2} \left\{ \theta_n^2 :\ \theta_n^2(f_n - f_\star) - \frac{1}{2}\|x_0 - x_\star\|^2 \le 0\ \ \forall (f,G)\in D_H,\quad \alpha\in(0,1) \right\}$$
$$= \max_{\alpha,\ \theta_n^2,\ v,\ \lambda,\ S} \left\{ \theta_n^2 :\ \theta_n^2(f_n - f_\star) - \frac{1}{2}\|x_0 - x_\star\|^2 = \sum_{i,j\in\{n-1,n,\star\}} \lambda_{i,j} Q_{i,j} + v\, Q_H - \frac{1}{2}\langle S, G\rangle,\quad \alpha\in(0,1),\ \lambda\ge 0,\ v\ge 0,\ S\succeq 0 \right\}.$$
Here, $\lambda$ is a nonnegative matrix indexed by $\{n-1, n, \star\}^2$ and $S\in\mathbb{S}_+^4$.
Writing out each of the $Q_{i,j}$, matching the coefficient on $\|x_0 - x_\star\|^2$ (which forces $v = 1$), and equating the remaining coefficients on $f$ and $G$, we arrive at the following:
$$\max_{\alpha,\ \theta_n^2,\ \lambda} \left\{ \theta_n^2 :\ \alpha\in(0,1),\ \lambda\ge 0,\ \ \begin{pmatrix} -\lambda_{n-1,n} + \lambda_{n,n-1} - \lambda_{n-1,\star} + \lambda_{\star,n-1} + \theta_{n-1}^2\\ \lambda_{n-1,n} - \lambda_{n,n-1} - \lambda_{n,\star} + \lambda_{\star,n}\\ \lambda_{n-1,\star} - \lambda_{\star,n-1} + \lambda_{n,\star} - \lambda_{\star,n} - \theta_{n-1}^2 \end{pmatrix} = \begin{pmatrix} 0\\ \theta_n^2\\ -\theta_n^2 \end{pmatrix},\ \ M\succeq 0 \right\}$$
where $M$ is defined as
$$M = \begin{pmatrix}
\lambda_{n-1,n} + \lambda_{n,n-1}(1-2\alpha) + \lambda_{n-1,\star} + \lambda_{\star,n-1} & \lambda_{n,n-1}(\alpha - 1) - \lambda_{\star,n-1} & \lambda_{n-1,n}(\alpha-1) - \lambda_{n,n-1} + \lambda_{\star,n}\alpha & -\lambda_{\star,n-1}\\
\cdot & 0 & \lambda_{n-1,n}(1-\alpha) - \lambda_{\star,n}\alpha & 0\\
\cdot & \cdot & \lambda_{n-1,n} + \lambda_{n,n-1} + \lambda_{n,\star} + \lambda_{\star,n} & -\lambda_{\star,n}\\
\cdot & \cdot & \cdot & 1
\end{pmatrix}$$
and entries below the diagonal are defined by symmetry.
This problem looks far uglier than it really is. In fact, we can reduce it to a one-dimensional optimization problem and solve it explicitly without too much effort (see Appendix A). The punchline will be that the optimal solution sets
$$\alpha = \frac{2\theta_{n-1}^2}{1 + 2\theta_{n-1}^2 + \sqrt{1 + 4\theta_{n-1}^2}}$$
with optimal value
$$\theta_n = \frac{1 + \sqrt{1 + 4\theta_{n-1}^2}}{2}.$$
Furthermore, by Lemma 2, we will necessarily have a shadow iterate $z_{n+1}$ so that
$$\theta_n^2 (f_n - f_\star) \le \frac{1}{2}\|x_0 - x_\star\|^2 - \frac{1}{2}\|z_{n+1} - x_\star\|^2,$$
and we can continue the induction. By inspecting the slack term $M$ that we will find analytically in Appendix A, we will learn that we can set $z_{n+1} = z_n - (\theta_n^2 - \theta_{n-1}^2)\, g_n$. Since $\theta_n \ge \theta_{n-1} + \frac{1}{2}$ and $\theta_0 = 1$, induction gives $\theta_n \ge \frac{n+2}{2}$, so this is an accelerated gradient method with an $O(n^{-2})$ convergence rate: $f_n - f_\star \le \frac{2\|x_0 - x_\star\|^2}{(n+2)^2}$.
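Here is a small numerical check (my own, not from the references) that the closed-form choice above is feasible for the semidefinite program of the previous section. Plugging in $\lambda_{n-1,n} = \theta_{n-1}^2$, $\lambda_{\star,n} = \theta_n$, and all other multipliers zero (which is what Appendix A produces at the optimum), the linear constraints hold and the slack matrix $M$ is PSD:

```python
import numpy as np

def check_certificate(theta_prev_sq, tol=1e-12):
    t = (1 + np.sqrt(1 + 4 * theta_prev_sq)) / 2        # theta_n
    theta_sq = theta_prev_sq + t                         # theta_n^2 = theta_{n-1}^2 + theta_n
    lam_prev_n = theta_prev_sq                           # lambda_{n-1, n}
    lam_star_n = t                                       # lambda_{star, n}
    alpha = lam_prev_n / (lam_prev_n + lam_star_n)       # equals the closed-form alpha

    # Coefficient-matching (linear equality) constraints; all should vanish.
    eq = np.array([
        -lam_prev_n + theta_prev_sq,
        lam_prev_n + lam_star_n - theta_sq,
        -lam_star_n - theta_prev_sq + theta_sq,
    ])
    assert np.all(np.abs(eq) < tol)

    # Slack matrix M (upper triangle shown, then symmetrized), zero multipliers dropped.
    M = np.array([
        [lam_prev_n, 0.0, lam_prev_n * (alpha - 1) + lam_star_n * alpha, 0.0],
        [0.0,        0.0, lam_prev_n * (1 - alpha) - lam_star_n * alpha, 0.0],
        [0.0,        0.0, lam_prev_n + lam_star_n,                      -lam_star_n],
        [0.0,        0.0, 0.0,                                           1.0],
    ])
    M = np.triu(M) + np.triu(M, 1).T
    assert np.linalg.eigvalsh(M).min() > -tol
    return theta_sq

theta_sq = 1.0        # theta_0 = 1
for n in range(1, 6):
    theta_sq = check_certificate(theta_sq)
    print(n, theta_sq)
```

The bottom-right $2\times 2$ block of $M$ is the rank-one matrix $(\theta_n, -1)^\intercal(\theta_n, -1)$, which is exactly where the update $z_{n+1} = z_n - \theta_n g_n = z_n - (\theta_n^2 - \theta_{n-1}^2) g_n$ comes from.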
We summarize the above derivation as the following algorithm:
Fast Gradient Method
Initialize $\theta_0 = 1$, $z_1 = x_0 - g_0$.
For $t = 1, \dots, T$:
    Set $\alpha = \dfrac{2\theta_{t-1}^2}{1 + 2\theta_{t-1}^2 + \sqrt{1 + 4\theta_{t-1}^2}}$ and $\theta_t = \dfrac{1 + \sqrt{1 + 4\theta_{t-1}^2}}{2}$.
    Set $x_t = \alpha\,(x_{t-1} - g_{t-1}) + (1 - \alpha)\, z_t$.
    Set $z_{t+1} = z_t - (\theta_t^2 - \theta_{t-1}^2)\, g_t$.
The $z$ update can (and should) be moved to the beginning of the next iteration so that we only need to compute one gradient in each iteration. However, this form emphasizes that we maintain a valid inductive hypothesis by the end of each iteration.
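For concreteness, here is a direct Python transcription of the method (my own implementation sketch), together with a check of the guarantee $\theta_T^2 (f(x_T) - f_\star) \le \frac{1}{2}\|x_0 - x_\star\|^2$ on a random 1-smooth quadratic; the test problem (for which $f_\star = 0$) is an illustrative assumption.

```python
import numpy as np

def fast_gradient_method(grad, x0, T):
    """Run the fast gradient method above for T iterations; return x_T and theta_T^2."""
    theta_prev_sq = 1.0                      # theta_0^2
    g = grad(x0)
    x, z = x0, x0 - g                        # z_1 = x_0 - g_0
    for t in range(1, T + 1):
        theta = (1 + np.sqrt(1 + 4 * theta_prev_sq)) / 2
        alpha = 2 * theta_prev_sq / (1 + 2 * theta_prev_sq + np.sqrt(1 + 4 * theta_prev_sq))
        x = alpha * (x - g) + (1 - alpha) * z
        g = grad(x)                          # one gradient evaluation per iteration
        z = z - (theta**2 - theta_prev_sq) * g
        theta_prev_sq = theta**2
    return x, theta_prev_sq

# Random 1-smooth quadratic: f(x) = 0.5 (x - x_star)^T A (x - x_star) with 0 <= A <= I.
rng = np.random.default_rng(3)
d = 50
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = Q @ np.diag(rng.uniform(0, 1, d)) @ Q.T
x_star = rng.standard_normal(d)
f = lambda x: 0.5 * (x - x_star) @ A @ (x - x_star)
grad = lambda x: A @ (x - x_star)

x0 = x_star + rng.standard_normal(d)
xT, thetaT_sq = fast_gradient_method(grad, x0, T=100)
print(thetaT_sq * f(xT), 0.5 * np.dot(x0 - x_star, x0 - x_star))   # left value <= right value
```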
Alexandre d’Aspremont, Damien Scieur, and Adrien Taylor. 2021. “Acceleration Methods.” Foundations and Trends in Optimization 5 (1–2): 1–245.
Yoel Drori and Marc Teboulle. 2014. “Performance of First-Order Methods for Smooth Convex Minimization: A Novel Approach.” Mathematical Programming 145 (1): 451–82.
Benjamin Grimmer, Kevin Shu, and Alex L. Wang. 2024. “Composing Optimized Stepsize Schedules for Gradient Descent.” arXiv preprint arXiv:2410.16249.
Adrien B. Taylor, Julien M. Hendrickx, and François Glineur. 2017. “Smooth Strongly Convex Interpolation and Exact Worst-Case Performance of First-Order Methods.” Mathematical Programming 161: 307–45.
Appendix A: Solving for α
Our goal is to solve:
$$\max_{\alpha,\ \theta_n^2,\ \lambda} \left\{ \theta_n^2 :\ \alpha\in(0,1),\ \lambda\ge 0,\ \ \begin{pmatrix} -\lambda_{n-1,n} + \lambda_{n,n-1} - \lambda_{n-1,\star} + \lambda_{\star,n-1} + \theta_{n-1}^2\\ \lambda_{n-1,n} - \lambda_{n,n-1} - \lambda_{n,\star} + \lambda_{\star,n}\\ \lambda_{n-1,\star} - \lambda_{\star,n-1} + \lambda_{n,\star} - \lambda_{\star,n} - \theta_{n-1}^2 \end{pmatrix} = \begin{pmatrix} 0\\ \theta_n^2\\ -\theta_n^2 \end{pmatrix},\ \ M\succeq 0 \right\}.$$
Suppose we have any feasible solution to this problem. We will construct a new feasible solution with the same objective value but with a specific structure. First, the fact that $M_{2,2} = 0$ implies that $M_{1,2} = 0$ and $M_{2,3} = 0$, i.e.,
$$\lambda_{\star,n-1} = \lambda_{n,n-1} = 0 \qquad\text{and}\qquad \alpha = \frac{\lambda_{n-1,n}}{\lambda_{n-1,n} + \lambda_{\star,n}}.$$
Thus,
$$M = \begin{pmatrix}
\lambda_{n-1,n} + \lambda_{n-1,\star} & 0 & 0 & 0\\
0 & 0 & 0 & 0\\
0 & 0 & \lambda_{n-1,n} + \lambda_{n,\star} + \lambda_{\star,n} & -\lambda_{\star,n}\\
0 & 0 & -\lambda_{\star,n} & 1
\end{pmatrix}.$$
We will modify our feasible solution by increasing both $\lambda_{n,\star}$ and $\lambda_{\star,n}$ simultaneously, by the same amount. Note that the value of the objective function does not change and the linear equality constraints remain satisfied. On the other hand, by increasing these values enough, we can force the bottom-right $2\times 2$ block of $M$ to be rank-one. We deduce there exists an optimal solution with
$$\lambda_{\star,n-1} = \lambda_{n,n-1} = 0, \qquad \lambda_{n-1,n} + \lambda_{n,\star} + \lambda_{\star,n} = \lambda_{\star,n}^2.$$
Additionally, as long as this constraint is satisfied, we can find a setting of $\alpha$ so that $M \succeq 0$.
We may thus simplify the original problem to the problem below. Here, we have enforced the constraints listed above and the linear equality constraints from the original problem. We additionally parameterize $\lambda_{n-1,n} = s$ and $\lambda_{\star,n} = t$. (Concretely, the equality constraints give $\lambda_{n-1,\star} = \theta_{n-1}^2 - s \ge 0$ and $\theta_n^2 = s - \lambda_{n,\star} + t$, while the rank-one relation gives $\lambda_{n,\star} = t^2 - s - t \ge 0$; together these yield the objective and constraints below.)
$$\max_{s,t} \left\{ 2s + 2t - t^2 :\ s, t > 0,\quad \theta_{n-1}^2 \ge s,\quad t \ge \frac{1 + \sqrt{1 + 4s}}{2} \right\}.$$
Note that for any $s > 0$, we have the constraint $t \ge \frac{1 + \sqrt{1 + 4s}}{2} > 1$, and the objective is decreasing in $t$ for $t > 1$. Thus, the optimal choice of $t$ must set $t = \frac{1 + \sqrt{1 + 4s}}{2}$. Finally,
$$\max_{s} \left\{ \frac{1 + 2s + \sqrt{1 + 4s}}{2} :\ 0 < s \le \theta_{n-1}^2 \right\}.$$
The objective is increasing in $s$, so the optimal solution is $s = \theta_{n-1}^2$. We can then construct the optimal solution to the original problem by substituting backwards to get
$$\alpha = \frac{2\theta_{n-1}^2}{1 + 2\theta_{n-1}^2 + \sqrt{1 + 4\theta_{n-1}^2}}.$$
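As a final sanity check (mine, not from the references), a brute-force grid search over the reduced two-variable problem agrees with the closed-form value of $\theta_n^2$:

```python
import numpy as np

def closed_form(theta_prev_sq):
    # theta_n^2 = (1 + 2*theta_{n-1}^2 + sqrt(1 + 4*theta_{n-1}^2)) / 2
    return (1 + 2 * theta_prev_sq + np.sqrt(1 + 4 * theta_prev_sq)) / 2

def grid_search(theta_prev_sq, grid=1000):
    s = np.linspace(1e-6, theta_prev_sq, grid)
    t = np.linspace(1e-6, 5 + theta_prev_sq, grid)
    S, T = np.meshgrid(s, t)
    obj = 2 * S + 2 * T - T**2
    feasible = T >= (1 + np.sqrt(1 + 4 * S)) / 2
    return obj[feasible].max()

for theta_prev_sq in [1.0, 2.618, 7.0]:
    print(closed_form(theta_prev_sq), grid_search(theta_prev_sq))
```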