People who're familiar with Newton's method might be surprised at the convergence rate. 1/k^2 is slower than the textbook rate for Newton's method with line search, which is 2^{-2^k}, basically implying convergence in a constant number of steps. The rate in this paper seems to be no faster than plain gradient descent on a strongly convex function! So why go through the trouble of the much more complicated update?
Because the textbook proof of the fast convergence of Newton's method make additional assumptions on the objective function, for example that it is strongly convex, or it is self concordant. This paper only assumes Lipschitz continuous Hessians.
The idea of dampening the Hessian is old (it's sometimes called "damped newton method", or "trust region newton method", or "levenberg-marquardt", though the latter two refer to more specific ideas). This paper offers a view to how much dampening to apply.
There are two kinds of convergence proofs - the ones that get you a 1/k or 1/k^2 rate, and the ones that get you "linear" or "superlinear" convergence (which is optimization jargon for exponential/superexponential decay of the error). The latter require way too strong of assumptions to be that useful in practice, and we generally think of them as just "local" convergence theorems - i.e. the assumptions they make start to be true when you get close enough to the minimum, and the fast convergence kicks in at that point. Which is nice, but getting close enough in the first place is often 99% of the problem!
This is very interesting. It looks like the innovation is to scale the identity matrix in the L-M step formula by the square root of the norm of the gradient. Very simple and elegant, and trivial to implement. Lots of people are going to be shaking their heads wondering why they didn't think of this first.
The motivation for the idea is very nice and easy to understand. The author completely explains it before the middle of the second page. I encourage you to look at it if you know enough vector calc to make sense of the notation. Essentially, he takes a well-known fast-converging method whose steps are hard to compute and shows how these steps can be approximated cheaply while still retaining much of the convergence rate benefit.
This may have implications for many large-scale optimization problems as L-M regularized Newton steps are a sort of natural halfway house between gradient descent and the full Newton's method, both theoretically and practically.
The problem with simple innovations in old methods, is that there is very rarely anything new under the sun, odds are good its already been done. The paper does not evaluate on any of the big standard benchmark either. The paper spends a surprising/suspicious amount of time on very basic things. "The divergence issues of Newton’s method with line search seems to be a relatively unknown fact." No its well known!
It looks interesting, but some things, like the L-M lambda beeing constant? No, you adapt it to the data increasing and decreasing it in accordance with the feedback from the step error predictions. Similar, but not identically to what is proposed here. Id need to read it carefully and it doesnt look bad, but think I'll just wait for the reviewers.
Personally I wouldn't have known divergence of Newton's method with line search off the top of my head. What I think everyone in this field would know is that Newton's method with constant step size diverges. I'm actually curious now if Newton's method with backtracking line search from a constant step size diverges. Exact line search can go a little nuts, that doesn't surprise me, but divergence with backtracking from a constant step size seems less likely. (I guess it depends on what "constant" means here...)
Seems like it could be awfully nice to have a smart theoretically backed initial guess of the dynamic lambda. The author actually uses a feedback strategy to set H in conjunction with his gradient scaling (see Algorithm 2 / AdaN), and it's possible that the scaling allows H to be basically correct more often and not have to be adjusted so much? Which is potentially really helpful since Newton steps aren't cheap and you don't want to waste them on guess-and-check.
But it's not obvious it's better than other strategies and we do need benchmarks to prove it out, yes.
The author is from INRIA, which does a lot of advanced optimization research, so I'd say it's a serious paper prima facie.
A very similar technique is popular for rootfinding (since the 80s, convergence analysis in this 1998 paper), where scaling by the norm of the gradient (eq. 1.5) enables q-quadratic convergence.
> It looks like the innovation is to scale the identity matrix in the L-M step formula by the square root of the norm of the gradient.
If I have f(x), then I define
g(x) = 100*f(x)
h(x) = .01*f(x)
won't I get different rates of convergence with g(x) and h(x)?
Normally, the scale factor of a function cancels out when multiplying the inverse Hessian times the gradient, but the square root changes that here. This seems undesirable.
Maybe it can be hand-waved away with their constant H, which looks like a tuning parameter.
Good question. Clearly the square root has to be there for a reason, but I agree it makes the scheme not scale-invariant. And the original L-M is also not scale-invariant so there must be some reason for this.
There may be stability issues with perfect scale invariance. You don't want to drop your identity matrix term when the gradient gets too small; that would defeat the purpose of the L-M regularization.
Regularization in general tends to make things less symmetric and invariant. It's kind of a necessary evil.
And you're right, the H plays an important role in keeping the scheme relatively efficient. It is typically adjusted dynamically in L-M schemes (see AdaN and AdaN+ algorithms in the paper).
Previously I had thought of Levenberg Marquardt as it's own little island, separate from the collection of line search and trust region algorithms. I guess maybe I should think of algorithms using regularized Hessians as a third category now. (Although I guess the quasi-Newton algorithms that maintain positive definiteness are regularized in a sense, so it's kind of a fuzzy categorization)
I'm curious if this square root trick performs better than a well written trust region algorithm.
> We present a Newton-type method that converges fast from any initialization and for arbitrary convex objectives with Lipschitz Hessians. We achieve this by merging the ideas of cubic regularization with a certain adaptive Levenberg--Marquardt penalty.
I feel insignificant just by reading the first phrase.
It's funny how many complex ideas are "just" simpler ideas mapped to the right sequence of steps. The jargon and naming conventions can obscure the idea from outsiders, but allows for brief and accurate communication. If they'd explained the idea Barney-style, they'd have needed twice as many pages.
Thanks for the breakdown, it makes the paper more accessible.
The problem would be the resource would be too sloppy to be useful. For example, the phrase "Lipschitz: a space that isn’t too stretchy" doesn't tell you about the precise definition of Lipschitz, which is absolutely essential, and there are tons of names in math that would all be sloppified to "a space that isn’t too stretchy".
Not knowing precisely which definition you need for an application or which one you need for a proof to hold is essential for progress.
Math doesn't make up "jargon" for no reason. The names are to denote to professionals the least needed elements to make something true.
>> You don’t even have to build your own solvers. The commercial/open source ones are typically good enough.
But as a maintainer of solvespace (Open Source CAD software with geometric/algebraic constraint system) I am left wondering if this can be applied to our constraint solver, which I didn't write. It solves systems of algebraic constraints using some form of Newtons method, partial derivatives, jacobian... All stuff I'm familiar with, but not in great detail. Figuring out how best (and weather) to apply this might be a major digression ;-)
After following the course, you should be able to grok the solver code in /src/system.cpp. That looks fairly standard.
The interesting part (to me, as that is not my specialty) lies in the translation of the 3D constraints (including rotation, etc.) into a single objective function for solving Newton-style.
>> The interesting part (to me, as that is not my specialty) lies in the translation of the 3D constraints (including rotation, etc.) into a single objective function for solving Newton-style.
That's funny - I like the geometry stuff and have a good grasp of how to create useful constraint equations. I just don't know too much about the code for solving systems of those equations ;-)
There are probably many ways to formulate the same geometric constraint. For example, what parameterization of rotations do you use? I think the choice has implications for how easy it is to solve the equations.
>> For example, what parameterization of rotations do you use?
For orientation we use quaternions. For other things it's an axis-angle representation. For the equal angle constraint I'm not sure how that's implemented. There was a recent addition of length-ratio between lines and arcs. That implementation was surprising to me.
These lectures look fantastic — thank you for making and sharing them! I've been looking for a good foundational treatment of optimization to digest over the holidays.
> I feel insignificant just by reading the first phrase.
I often wonder about the intentions of those who post of these types of comments. Being charitable one might suppose the parent is offering a backward sort of compliment to the authors. Something like "Great job, beyond my capabilities, I'm glad someone is working on these things".
There are a lot of less charitable formulations; why should other readers care about parent commenters insecurity? Does parent commenter not appreciate the work that goes into studying these topics? Was parent commenter told they were "smart" to often as a child and never learned to exert themselves academically? Etc.
Although this post is mostly in jest, I really am curious what prompts these comments, comments of this type are quite reliable.
> There are a lot of less charitable formulations; why should other readers care about parent commenters insecurity? Does parent commenter not appreciate the work that goes into studying these topics? Was parent commenter told they were "smart" to often as a child and never learned to exert themselves academically? Etc.
You're missing the obvious interpretation -- jargonization of the sciences keeps people out and keeps the masses stupid. Math is particularly guilty of this -- more charitable fields use jargon that is at least semi-self-explanatory, words and phrases that, while jargon, do make sense if you deconstruct them. In Math, people are so arrogant that they discover something and immediately name it after themselves, so if you haven't had a topology class, you won't know that X person's name maps to Y concept. Good jargon can be pieced together with pure reason alone without knowing the names of the people who invented what. This isn't true across the board of course -- "gradient descent" is a particularly well-named piece of jargon, for example.
People generally don’t name it after themselves; other people name it after them later.
Not that many people wouldn’t like for the thing they introduce to eventually be named after them. But, I think it would generally be considered presumptuous to directly name it after oneself.
I remember one story where one mathematician upon hearing something referred to as a [some term named after them], asked what the term referred to. (I want to say it was Hilbert asking what a “Hilbert space” was, but I’m not sure.)
The story is told by Saunders Mac Lane, quoted in the book Mathematical Apocrypha Redux:
> J. von Neumann in 1927 introduced the axiomatic description of a Hilbert space, and used it in his work on quantum mechanics. There is a story of the time he came to Göttingen in 1929 to lecture on these ideas. The lecture started "A Hilbert space is a linear vector space over the complex numbers, complete in the convergence defined by an inner product (a product <a,b> of two vectors a, b) and separable". At the end of the lecture, David Hilbert (by custom sitting in the first row of the Mathematische Gesellschaft), who was then evidently thinking about his definition and not about the axiomatic description, is said to have asked, "Dr. von Neumann, ich möchte gern wissen, was ist dann eigentlich ein Hilbertscher Raum?" [Freely translated this is "Dr. von Neuman, I would very much like to know, what after all is a Hilbert space?"]
math at least has __nowhere__ near as much jargon as something like biology or the life sciences.
I think a larger problem in reading math might be polysemy. Symbols have multiple meanings, so context has to be used to infer what meaning is intended by authors.
If it makes you feel any better, I knew nothing about the field of optimization (convex or otherwise) 3 months ago. But after going through an optimization course in my CS master's program (MCSO at UT Austin) this semester, I can parse this part of the abstract fine, minus the Levenberg--Marquardt, which I had to look up. That is with not much of a prior mathematics background. You could probably learn enough in a semester of part-time study to grok this paper, as long as you're comfortable with differential calculus.
Longitudinal Analysis of an Amygdalocentric Structural Network using Mixed Effects Multivariate Distance Matrix Regression with Multiple Fractionated Polynomials
A working title for a manuscript I am shopping out to the various neuroscience journals right now. I feel like I'm finally the esoteric person I've always wanted to be.
For those interested, James Renegar also had some great stuff on Newton's method ("A polynomial-time algorithm, based on Newton's method, for linear programming" 1988). It seems there are a lot of great algorithmic opportunities in specializing Newton's method to a given problem domain.
Hey, I’m Konstantin, the author of the paper. It’s a big surprise to see an active discussion here, it was very interesting for me to read the comments. I agree with the people saying that Newton with line search often works as it is, the point of my work is to obtain a theory that can be used later to obtain extensions to harder settings or modifications with more aggressive iterations. Only the time will tell if the method itself is useful in practice. I feel like most questions have been already answered here by the community, but I’m happy to answer any remaining questions.
I also presented this work online and the recording is available on youtube, I hope you’d enjoy it:
https://youtu.be/0lXvUMfafBY
(Disclaimer: Not an expert, and only gave the paper a very cursory read).
Newton's method attempts to find zeroes of a function (i.e. find x where F(x) = 0). This is equivalent to finding a (possible) minimum of a function f(x) where f'(x) = F(x). Note: The wiki article describes the process of finding zeroes for F(x), whereas the article searches minima of f(x), in terms of my notation here.
At a basic level, Newton's method iteratively improves a guess x_k via the following update step:
x_(k+1) = x_k - f(x_k) / f'(x_k)
When close to the correct solution, convergence is quadratic: You "double the number of correct digits" at every step, so to speak. But if you're far away, you might be diverging (going further away from the solution). Progress can also be slow under other conditions (multiple solutions nearby).
The paper, essentially, suggest the following update step instead:
for some H (that needs to be chosen/estimated). The extra term can be thought of as some form of regularization. This supposedly avoids divergence (where f is non-decreasing and does not grow too fast), and is also consistently fast.
Note that I've simplified to the one-dimensional case here. There are further concerns in higher dimensions (e.g. number of matrix inversions) that this method handles well, according to the authors.
> This is equivalent to finding a (possible) minimum of a function f(x) where f'(x) = F(x).
I'd like to clarify that the "equivalence" might be misleading. Just because you have a vector-valued function, that does not imply it has an antiderivative you can minimize, so merely having that scalar antiderivative actually gives the problem more structure and thus intrinsically makes it "easier" in that sense.
I think a concrete example of what you're saying is that a rotation vector field (e.g. F(x,y) = [y, -x]) is not the gradient of any potential function.
And I think you're saying it's intrinsically easier to find where a vector field vanishes if it has a potential function. You can just follow the vector field and you'll get to a zero. But if you follow a rotating vector field, it will take you in circles.
But given that you're doing Newton's method, I don't think one problem is easier than the other, right?
Is your conjecture that Newton's method would always work at least as well as gradient descent when the vector field has nonzero rotation? I don't see why that should be true, especially given that's not the case when there's zero rotation. You could easily come up with examples where gradient descent would work better despite the rotation, and I expect you could also find lots of rotation cases where Newton's method would get completely thrown off (whereas GD wouldn't).
(Also, don't conflate the problem with the solution. My statements were about optimization vs. root finding, not about Newton vs. GD. Even if there is a solution that works for multiple problems, that doesn't mean the problems are equally difficult. If one problem admits solutions that the other doesn't, then it's still easier in that sense. If a problem can be reduced to another, then it's also easier in that sense. Just like how you can use a cannon to kill a fly, but that doesn't mean destroying a building is as easy as killing a fly...)
> Is your conjecture that Newton's method would always work at least as well as gradient descent when the vector field has nonzero rotation?
> You could easily come up with examples where gradient descent would work better despite the rotation, and I expect you could also find lots of rotation cases where Newton's method would get completely thrown off (whereas GD wouldn't).
I don't think I understand the meaning of the questions. It isn't gradient descent unless you have a potential function. You cannot do gradient descent on a vector field that is not a gradient. You can solve an ODE.
You can, of course, turn a root-finding problem (F(x) = 0) into a minimization problem by defining a potential like g(x) = (F(x))^2. You can do gradient descent on g. So in that sense, both of these problems reduce to each other, though there are better methods for solving F(x) = 0 than minimizing g.
I still don't understand this sense of optimization being easier than root finding.
What I do understand is some sense in which there are more vector fields than there are gradients, since only some vector fields are gradients of potential functions. My comment was intended to point that out, since I thought that is maybe what you meant.
I think I explained it poorly. I used "gradient descent" colloquially there, I just meant following the 'flow' in your vector field like you might in actual gradient descent. (Yes in this case it'd be Euler's method, and you can add whatever bells and whistles you would add for GD like to choose step sizes and whatnot.) I'm saying you might still be able to find the zero of your function by doing that. That neither implies that Newton's method would have performed better on your problem (it may perform better, or it may not), nor does it imply that your vector field must have been a gradient of anything. Right?
Whereas you're saying "given that you're doing Newton's method, I don't think one problem is easier than the other, right?" as if Newton is somehow guaranteed to give equally good if not better results than e.g. Euler when your vector field isn't a gradient... which I don't believe it is? Certainly there are cases when that happens but it's not implied, right?
Haven't read the paper yet, but Boyd and Vandenberghe's book, Convex Optimization, is available for free on the authors' website and covers all the foundational material.
Update: I've skimmed the paper and here's the gist as I understand it. The paper alleges that Newton's method can have convergence issues, even when using line search. Their approach resolves these convergence issues. It's 6am where I am which isn't great for processing this stuff, but I write a lot of custom Convex Optimization solvers, and I've never had convergence issues with Newton's method with the default line search parameters suggested by the book mentioned above, even with some bad condition numbers that throttle off-the-shelf solvers (which is why I write my own).
The paper talks about speed of convergence, which initially made me think this was a first order method, not requiring calculating the Hessian (matrix of 2nd order derivatives). For large systems this is really slow and 1st order methods speed this up by working only with the first derivatives. But that's not what this paper is: they still use the Hessian but add regularization, which presumably just improves the condition number. Reminds me of proximal algorithms but haven't explored that connection.
Update on the update: the paper addresses situations where the self-concordance assumption does not hold. TBH I didn't pay much attention in class when we went over self-concordancy and convergence, but the contribution of this paper seems to be extending convergence guarantees even to non-self-concordant systems.
Newton's method is an algorithm you can use to minimise/maximise/find the root of a differentiable function very quickly if you start with a reasonable guess. There are many variants on Newton's method, e.g. BFGS that does not require the hessian (/2nd order derivative) of the function. A pitfall of Newton's method is that it does not find the global minimum per se, it might just converge to a local minimum if this local minimum is "closer" to your initial guess. Some variants of Newton's method allow you to avoid this pitfall, but they are slow in practice.
This seems to be another variant that claims to find the global minimum quicker than other existing methods. I am not experienced enough to verify this claim. I am also not sure how this new algorithm performs in practice; in theory, ellipsoid methods are a lot more efficient than simplex methods for optimising convex functions, while in practice simplex methods are usually an order of magnitude faster. So take this result with a grain of salt.
To clarify, "global" in this context refers to global convergence (as in, it won't diverge no matter the starting point), not global optima. They assume a convex function, i.e. only a single minimum. The paper is unrelated to avoiding local minima.
There is no way of ensuring that you have found a global minimum if the function is not convex, or if you don’t make equivalent assumptions, in the general case (if the domain of the function is continuous, and/or infinite).
If you don’t make assumptions, you would need to consider every single point of the domain, and there is an infinite number of them. The job gets easier the more assumptions you make about the continuous, differentiable, and monotonous character of the function.
Take an initial guess at where f(x) = 0; call it x1. Then you generate what you hope is a better guess x2 by finding where the line through (x1, f(x1)) with slope f'(x1) hits zero. Lather, rinse, repeat.
Things can get ugly--say, if the derivative is zero at your guess--and if your initial guess is bad, or f is wonky (say it has several zeroes near one another) it may take a long time. But as people have said, if you have a good guess, convergence is quadratic, doubling the number of good bits each time, so for floating point square root, peek at the exponent and take half that (multiply by sqrt(2) if it's odd). log2(#mantissa bits) iterations and you're as good as it gets.
Apparently this new method avoids the nasty cases and is still fast. Cool stuff.
Would you like someone to explain newton's method (Newton Raphson iteration), or the linked article?
(I could have a go at explaining newton's method, but I don't understand the arxiv article. In simple terms newton raphson iteration is an iterative method of finding the "zeros" of a function. You do it by evaluating the derivative of a function, and iteratively following that gradient until you find the zero point.)
The problem with Newton's method is that it is provably convergent but only if you are sufficiently close to the solution. That basically means that the initial condition that you chose for solving the problem is very important. What this paper seems to suggest is that their method is globally convergent and they produce the algortithm to do it, although there is some tuning that needs to be done. So in theory, with their method we will always get to a solution, no matter the initial condition we chose.
Unless I'm mistaken (very possible), the existing cubic Newton method they discuss already has this convergence guarantee, but introduces a lot of extra expensive work at each step. The specific contribution of the paper is finding clever regularization tricks to keep this guarantee while sidestepping the extra hassle the cubic method needs
Their abstract says that their global convergence claim is for convex objectives. So structurally, they're only talking about functions for which the whole domain is "sufficiently close to the solution" i.e. where every tangent is a global bound.
Actually, can you expand on that? The paper makes some broad statements about how existing methods are unstable and need to have a good starting point etc. This is not my area, but in the past when I've used (quasi)newton methods, it "just worked". What are the kinds of strictly convex functions that cause this bad behavior? For "strictly convex" in my limited, low-dimensional imagination I just picture a bowl-shape. They mention failure modes, but I'd love a clear (ideally visual) example of when it goes wrong. Robust global quadratic convergence seems to be the main value proposition here, so understanding the non-global fragile non-convergence of other methods seems important.
I believe if you select as a starting point, a point where the derivative of the function is 0, then 'classic' newton raphson will not converge (since you select the "crossing point" of this tangent line for your next iteration, and that is undefined when you have 0 gradient.)
e.g. y = X^2 - 1
choose 'starting position' X = 0.
... but then your starting point is already at the optimum. You're done in zero steps!
I think for a "strictly convex" function, the (sub)derivative will _only_ be zero at the global minimum.
> but then your starting point is already at the optimum
Newton raphson is used to find where f(x) = 0, not where the derivative f'(x) = 0.
For the function given above f(x) = 0 when x = 1 and x = -1. Not when the derivative is at 0 (when x = 0).
You'd be done in 0 steps to find the minimum of the function, or the point of 0 derivative, but that's not what you're looking for with newton raphson.
Newton's method is _commonly_ used for function minimization by looking for the zero of the derivative, and this is entirely what the article is about. In sec 1.1 at the top of page 2, they say very clearly "In this work, we are interested in solving the unconstrained minimization problem ...". Even in the first para, they say "Newton’s method is a cornerstone of convex optimization", meaning optimization/minimization of a convex function.
The article says "Unfortunately, Newton’s method is unstable: it works only for strongly convex problems and may diverge exponentially when initialized not very close to the optimum." And I was asking for an example or description of such a case.
You'll also note there are several comments in this discussion that mention the hessian/2nd order derivatives, and these all show up because this is being used in the context of optimization by looking for f'(x) = 0.
Other people have given you mathematical explanations (it finds values of x such that f(x) = 0). If you want an intuitive geometrical understanding (of the one dimensional version, i.e. x and f(x) represent numbers not vectors) try playing with this demo I made: https://www.desmos.com/calculator/kogwanro0d. It shows the way the method follows tangent lines to get each iteration. You can move the orange start point, adjust the function with the b and c sliders, or change the f(x) function formula to another function entirely.
Theoretically yes, but: typically with neural networks, and esp. deep networks with millions or billions of parameters, 2nd order methods which require calculating and factoring the matrix of 2nd derivatives are prohibitively expensive. 1st order methods such as gradient descent skip this, albeit with a tradeoff in convergence. The method proposed in this paper is impractical for large systems such as computer vision or language models.
As a note, 2nd order methods do not require calculating the entire Hessian nor do they require factorizing the matrix. A trust-region method using Steihaug-Toint (truncated) CG or a Newton-Krylov method simply require Hessian-vector products, which are not particularly costly to compute. In fact, a forward-mode automatic differentiation (AD) method on an existing gradient calculation, which could also be done using AD, will calculate the Hessian-vector product at a cost of twice the gradient calculation. Though, AD is not required and this can be done by hand. Even if only a single Krylov iteration is done, there are many benefits to using this methodology to help handle the scaling of the problem. These algorithms are described fully in a book such as Numerical Optimization by Nocedal and Wright.
The current paper does not directly affect most ML algorithms because it speaks directly to convex optimization. Most ML models are highly nonlinear, so globalization of the optimization algorithm is required. Here, globalization means something like a line-search or a trust-region to ensure convergence to a local minima. Convex problems using an appropriate algorithm and under the correct assumptions do not necessarily require the use of a globalization technique and will converge directly to a local min, which is now global due to convexity. Normally, this discussion is done in the context of interior point methods for linear cone problems and helps explain why globalization is not done for these algorithms. An example of this can be found in Convex Optimization by Boyd and Vandenberghe in section 11.5.3, but it's a well researched topic. The algorithm in this paper discusses a technique where they also don't require a typical globalization technique.
Anyway, mostly I'm commenting to dispel the misconception that 2nd order method can not be applied to large scale problems. I've applied them successfully to problems with hundreds of millions of variables and they work just fine. Largely, their success is tied to how complicated the spectrum of the Hessian is, but that's a longer discussion.
It has to do with whether the eigenvalues of the Hessian are well clustered or not. Generally, to get fast convergence near an optimal solution, the Newton system needs to be solved well enough. What well enough is can be difficult to determine and practically it's not calculated. However, if the spectrum of the Hessian is well clustered, then it doesn't take that many Krylov iterations to get an accurate solution. To be clear, how Krylov methods perform is an area of study that sits independent of optimization algorithms, but those results are applicable here. In fact, it's a slightly easier situation than normal because the Hessian is symmetric, so the spectrum is real and more exotic topics like pseudospectra don't need to be considered.
1. A neural network with ReLU activations (the most common one) effectively has a piecewise linear optimisation surface, so the 2nd derivative is 0 almost everywhere. No curvature, no Hessian.
2. 2nd order methods make an assumption that the optimisation surface is (close to) convex, which is very much not the case with neural networks.
3. Even if the above were not true, to measure the Hessian you would have to iterate through the entire training set to measure it before you could make take an optimisation step.
4. Representing the entire Hessian for a network with N parameters needs O(N²) space, where N is on the order of 10⁹, so you couldn't use a method that needs the whole Hessian. There are so-called Quasi-Newton methods that need O(N) space though, like L-BFGS.
Neural networks need completely different optimisation methods, and there is no practically useful application of any of the Newton or Quasi-Newton methods for their optimisation.
> Neural networks need completely different optimisation methods, and there is no practically useful application of any of the Newton or Quasi-Newton methods for their optimisation.
I don't think this is quite fair. There are several variations of 2nd order methods, notably KFAC and Shampoo, that seem to quite effective for large-scale neural network training, e.g., see the intro of this paper for an overview: https://openreview.net/forum?id=-t9LPHRYKmi
I checked the paper. The 2nd order method actually achieves 10 times worse minimum (it's much worse at minimisation), and the reason the results are better is because the network overfits less (Figure 3). The reviewers should have caught this!
One of the authors is Donald Goldfarb, who is the G in BFGS, so maybe they are onto something. But I'm always suspicious if the tests shown in a paper are fair.
Aside from the other note about 2nd order derivatives being a blocker for most of the models we work with these days, and the fact that DL encounters non-convex loss functions, the other big difference is that we're used to training with SGD which only needs unbiased estimates of a gradient based on samples / batches. This family of approaches requires _actual_ derivatives, i.e. from looking at the whole dataset. If you have estimates of the gradient only, I think basically none of the convergence arguments in this hold. This is basically also why we don't already use LBFGS (which doesn't hold onto the full hessian).
Is it a groundbreaking better way to approximate continuous functions, or is it fundamentally impractical, like new algorithms with a good asymptotic time often are? I'm curious to know, but my mathematical illiteracy prevents me from reading the paper.
Newton's method is not for function approximation, but for optimization ("find the minimum").
Newton's method already has "good asymptotic time" (e.g. if you use it to compute a square root, you double the number of correct digits with each step!) - but only under certain conditions (e.g. "not too far from the solution"). This paper shows how to modify it so this sort of benefit is obtained under a wider range of conditions ("global").
There are other related methods with a similar goal, but some require more expensive computations. The method from the paper appears to fill a sweet spot in high-dimensional optimization by virtue of requiring "few" matrix inversions (or linear solves) while having very good convergence properties.
When you are dealing with the kind of problem that this sort of method is applicable to, it appears to me that you might benefit from looking into using this method. But it's not a revolutionary improvement from what I can tell - there are many related methods, and chances are you could have used something similar for similar benefit already.
And it's likely that someone already used more or less this method in practice. As the paper notes:
> Our goal is twofold. On the one hand, we are interested in designing methods that are useful for
applications and can be used without any change as black-box tools. On the other hand, we hope
that our theory will serve as the basis for further study of globally-convergent second-order and
quasi-Newton methods with superior rates. Although many of the ideas that we discuss in this
paper are not new and have been observed by practitioners to help solve challenging optimization
problems, our analysis, however, is the first of its kind. We hope that our theory will lead to
appearance of new methods that are motivated by the theoretical insights of our work.
This is an approach for solving convex optimization problems. An optimization problem is when you want to find the best option, or the fastest, or the most profitable. If you're asking questions with superlatives (most, best, fastest), you may be working on an optimization problem.
Many optimization problems are NP Hard and cannot be solved in a reasonable amount of time. Convex optimization problems are a special subclass which can typically be solved efficiently (meaning in polynomial time).
This paper provides a new approach to solving Convex Optimization problems for systems where other methods struggle.
I would say "interesting" more than "groundbreaking", in the sense that you will not solve problems that you were not previously solving.
The benchmarks provided show nice convergence. The interesting part is that the method will not fail, which is very good. The benchmarks seem to crush other methods but keep in mind that the plots do not show "time required" but "number of iterations required" and that with Newton's method you are doing much much more work per iteration: computing the Hessian instead of the gradient and solving a system of linear equations on every iteration.
From abstract: "Our method is the first variant of Newton’s method that has both cheap iterations and provably fast global convergence". It seems interesting.
"Cheap iterations" led me to expect a 1st order method, not requiring computing the matrix of 2nd derivatives, which is the bottleneck in large systems. But this does indeed rely on the Hessian, so I'm not sure what makes the iterations "cheap".
I am not an expert on convex, unconstrained optimization but I frequently use these methods. I'd wait until it's peer-reviewed and published in a good journal (SIAM) before I get too excited about this one.
This is interesting, but it’s a pity that the only experimental results do not show fast asymptotic convergence of the main and simpler algorithm proposed (Algorithm 1, the one stated in the abstract) but only for its line search variants (Algorithms 2 and 4 in the paper).
I was not aware that this was a problem in need of solving. aren't computers so fast that this is already solved form a practical standpoint. Newton's method (as well as extensions of it to vector analysis) is ancient math. I was not even aware that this is an area of ongoing interest.
> aren't computers so fast that this is already solved form a practical standpoint.
Not in the least. Much of the scientific computation out there is to solve these types of problems. Even if you sped up computers ten-fold you wouldn't be satisfied.
When I was in grad school, the computational teams would run calculations where it would take a few days to get a single data point (they were trying to plot a curve). And they had problems in their head where it would take over a month to get a data point - but of course they weren't pursuing those problems as it was impractical.
Things like computing the derivative - let alone the Hessian - can be quite expensive for some problems.
Non-linear optimization is very popular in many fields (non-linear = real world, optimization = find me the least cost/most profit/lowest energy). Nearly all scientific problems can be reformulated as an optimization problem. Training neural networks is a (stochastic) non-linear optimization problem.
Computers are quite fast at doing this in one variable, but typically you're in at least 10s, if not 10,000s of variables, which makes it much much harder.
In dense math, consider n=10,000. Typically a single step involves at least one system solve. Using some matrix factorization approach you're doing O(n*3) operations, assume the constant is ~1 and you have 1 TFlop of computation per step. Modern compute capability of a consumer device is about 1 TFlop/s, give or take. If you're doing (say) 300 steps to reach a solution, you're now at the 5 minute mark for a single moderately sized problem.
Admittedly things like sparse math and iterative solution approaches exist, but so do much bigger problems. And that's before we get to non-convex problems which will often use a convex solver to repeatedly solve subproblems.
I am unsure if this precise problem needed a better solution, but even if not, this is not how math research works. Finding better solutions to random / useless problems can lead to a better solution to a problem which needs solving. The relevance of the problem is not a consideration.
Because the textbook proof of the fast convergence of Newton's method make additional assumptions on the objective function, for example that it is strongly convex, or it is self concordant. This paper only assumes Lipschitz continuous Hessians.
The idea of dampening the Hessian is old (it's sometimes called "damped newton method", or "trust region newton method", or "levenberg-marquardt", though the latter two refer to more specific ideas). This paper offers a view to how much dampening to apply.