# Improving Nelder-Mead Optimization with Genetic Algorithm and Simulated Annealing Concepts

Sepp on 5 Jun 2016
Edited: Walter Roberson on 10 Jun 2016
Hello
I'm using the Nelder-Mead simplex algorithm for hyperparameter optimization. It works quite well, but now I would like to develop it further. I have also tried genetic algorithms and simulated annealing, and I would like to incorporate techniques from these algorithms into Nelder-Mead. My goal is to overcome the problem of local minima and/or to get faster convergence.
Does somebody have an idea how to incorporate concepts from genetic algorithms or simulated annealing into Nelder-Mead? Perhaps the simplex gradient could also be used in some way. I'm grateful for every idea.

John D'Errico on 6 Jun 2016
Edited: John D'Errico on 6 Jun 2016
Long ago (when I first learned about polytope methods), I recall being told by those who were knowledgeable about Nelder-Mead that chasing the gradient tended to be a bad idea: it tends to create simplexes that are long and thin, and long thin simplexes will never be robust. So these methods tend to be better at running away from a bad place than they are at running towards a good place. At least they will be more robust, more stable.
As for the various stochastic algorithms, I don't see how you would plan on combining them with a scheme like Nelder-Mead. At least that is true of genetic algorithms, which have a very different underlying algorithm.
I suppose you might try something vaguely stochastic, where the new point is chosen by some random methodology. The issue is you need to avoid creating degenerate simplexes.
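A minimal sketch of such a stochastic replacement step, written in Python for illustration (the 2-D setting, thresholds, and retry limit are all made up for this example): draw a random candidate near the centroid of the surviving vertices and reject it whenever the resulting simplex would be (nearly) degenerate, measured here by its area.

```python
import random

def simplex_area(simplex):
    """Area of a 2-D simplex (triangle) via the cross-product determinant."""
    (x0, y0), (x1, y1), (x2, y2) = simplex
    return 0.5 * abs((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0))

def random_replacement(simplex, worst_index, scale=1.0, min_area=1e-6, max_tries=100):
    """Replace the worst vertex with a random point near the centroid of the
    remaining vertices, rejecting candidates that make the simplex degenerate."""
    others = [v for i, v in enumerate(simplex) if i != worst_index]
    cx = sum(x for x, _ in others) / len(others)
    cy = sum(y for _, y in others) / len(others)
    for _ in range(max_tries):
        candidate = (cx + random.gauss(0, scale), cy + random.gauss(0, scale))
        trial = list(simplex)
        trial[worst_index] = candidate
        if simplex_area(trial) > min_area:   # reject (near-)degenerate simplexes
            return trial
    raise RuntimeError("could not find a non-degenerate replacement")
```

The same idea extends to n dimensions by replacing the area test with a volume (determinant) test.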
If your goal is faster convergence, I don't have serious hopes. There are several problems.
1. Nelder-Mead suffers from the curse of dimensionality. By this I mean you need to consider how well a simplex samples a higher-dimensional space. Nelder-Mead does reasonably well in low numbers of dimensions, so 2 or 3 dimensions are no problem. But see how well a simplex samples a hypercube: a square (the 2-d hypercube) can be dissected into 2 simplexes, but a simple dissection of an n-dimensional hypercube uses factorial(n) simplexes. (There are other ways to do it, but they all scale somewhat similarly, i.e., even faster than a pure exponential.) That means an (n+1)-vertex simplex is a poor way to sample R^n for purposes of optimization.
2. Nelder-Mead has no accelerated convergence properties near the solution, whereas variations on Newton's method converge faster than linearly once they are near the solution.
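The factorial scaling in point 1 is easy to check numerically; a quick Python comparison of the n! dissection count quoted above against plain exponential growth 2^n:

```python
from math import factorial

def simplexes_per_hypercube(n):
    """Simplexes in a simple dissection of the n-dimensional hypercube:
    n! of them, which eventually outgrows any fixed exponential."""
    return factorial(n)

for n in (2, 3, 5, 8):
    print(n, simplexes_per_hypercube(n), 2 ** n)
```

By n = 8 the dissection already needs 40320 simplexes against 256 for 2^n, which is why a single simplex samples high-dimensional space so poorly.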
The point of all this is if you want faster convergence, then use a better method. There are better methods to be found, although every method will have its flaws.
Could you come up with some compromise scheme? Again, in reasonably low numbers of dimensions, where Nelder-Mead will not be that poor, I suppose you might try to come up with some hybrid scheme. For example, I recall fzero is a hybrid method that tries to have fast convergence when things are going well, but is able to avoid problems by switching methods when it sees things go bad.
However, in low numbers of dimensions, a good scheme is just a multi-start method. Generate many samples that roughly cover the search space. A simple lattice or a Monte Carlo sampling is a good start. Then take the best points from those samples, and set off a rapidly convergent scheme (like Newton's method, which will be fast near a solution) from each candidate point. Apply clustering to the solutions if you wish.
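The multi-start recipe just described can be sketched in a few lines. This is a language-neutral Python illustration with a made-up 1-D multimodal loss; plain gradient descent stands in for the fast local method (Newton's method would slot into the same place):

```python
from math import sin

def f(x):
    """A hypothetical 1-D multimodal loss, standing in for the real objective."""
    return sin(3 * x) + 0.1 * x * x

def grad(x, h=1e-6):
    """Central-difference gradient (the real method could use exact derivatives)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def local_descent(x, step=0.01, iters=2000):
    """Simple gradient-descent refiner; any fast local optimizer fits here."""
    for _ in range(iters):
        x -= step * grad(x)
    return x

def multi_start(lo, hi, n_samples=50, n_best=5):
    # 1. roughly cover the search space (a simple lattice here; Monte Carlo also works)
    xs = [lo + (hi - lo) * i / (n_samples - 1) for i in range(n_samples)]
    # 2. keep the best candidate points
    best = sorted(xs, key=f)[:n_best]
    # 3. refine each candidate with the fast local method, return the overall winner
    refined = [local_descent(x) for x in best]
    return min(refined, key=f)
```

Clustering the refined solutions (as suggested above) would be a final step when several starts converge to the same basin.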
The above is a rather standard methodology, although I've not done much reading about global optimization methods for quite some years. You might be best off just using the Global Optimization Toolbox. That is always a better approach than reinventing the wheel. At the very least, you might want to do some research on global optimization methods to learn the current state of the art.
Sepp on 7 Jun 2016
Edited: Sepp on 7 Jun 2016
Thank you, John, for your detailed explanation; I appreciate it. Your idea of creating a new point at random seems interesting. Of course, one could check for a degenerate simplex, but the problem is that when sampling a new point at random there is no longer a need for a simplex, because the Nelder-Mead steps like reflection, expansion, etc. are no longer performed to find this new point.
Your second idea of a hybrid scheme is also interesting. One could use Nelder-Mead for fast convergence when things are going well and genetic algorithms when things are going badly. Do you have any idea how to detect when things are going well or badly? This would be the main issue.
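One common heuristic for the "going well or badly" question, sketched here in Python with illustrative names and thresholds (not something proposed in this thread): declare stagnation when the best value has stopped improving over a sliding window, and switch methods at that point.

```python
from collections import deque

class StagnationDetector:
    """Flags stagnation when the best objective value improves by less than
    rel_tol over the last `window` iterations. Window size and tolerance are
    illustrative defaults, not recommendations."""

    def __init__(self, window=10, rel_tol=1e-3):
        self.history = deque(maxlen=window)
        self.rel_tol = rel_tol

    def update(self, best_value):
        """Record the latest best value; return True when progress has stalled."""
        self.history.append(best_value)
        if len(self.history) < self.history.maxlen:
            return False          # not enough data yet
        old, new = self.history[0], self.history[-1]
        improvement = old - new   # positive when minimization is making progress
        return improvement <= self.rel_tol * max(abs(old), 1.0)
```

The outer loop would run Nelder-Mead while `update()` returns False and hand the current best points to the genetic algorithm (as its initial population, say) once it returns True.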
It would also be interesting to create the initial simplex based on the loss function (and not just at random or based on the initial guess). The problem is that I can evaluate my loss function but I don't know what it looks like (because I'm using the cross-validated loss). Do you have any ideas or hints on how to create the initial simplex based on such a loss function?
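One simple possibility, sketched here under the assumption that extra loss evaluations are affordable: evaluate the (possibly noisy, e.g. cross-validated) loss at a batch of random points in the search box and keep the n+1 best as the initial simplex. All names and defaults below are illustrative, and the sketch does not guard against the chosen vertices being nearly collinear.

```python
import random

def initial_simplex_from_loss(loss, lower, upper, n_candidates=50, seed=None):
    """Build an initial (n+1)-vertex simplex by sampling the box [lower, upper]
    uniformly, evaluating the loss at each sample, and keeping the n+1 best
    points as vertices."""
    rng = random.Random(seed)
    n = len(lower)
    candidates = [
        [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
        for _ in range(n_candidates)
    ]
    candidates.sort(key=loss)     # ascending: best (lowest loss) first
    return candidates[: n + 1]
```

A degeneracy check (e.g. on the determinant of the edge vectors) could be added before accepting the simplex, at the cost of more samples.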

Walter Roberson on 6 Jun 2016
simulannealbnd allows you to set a HybridFcn to be called periodically, and you can have Nelder-Mead be that hybrid method. Unfortunately, simulannealbnd() does not update its idea of where it should search based upon the results of the hybrid function: it only updates its knowledge of the global minimum, so you cannot use the simplex to accelerate the search. I opened a case about this and recommended that there be an option to allow updating the current x, so it is in the list "for consideration" some day. In the meantime I made a private modification to my copy of the simulannealbnd routines, and it has worked out relatively well for the functions I happened to test it with.
Meanwhile, over in patternsearch, if you turn CompletePoll off and set the polling order to 'Success', then you get some of the effect of a simplex search. (Note that your objective function can only be evaluated in vectorized or parallel mode if you have CompletePoll turned on, so if your function works fairly well in vectorized mode then CompletePoll might be somewhat more efficient.) Alan Weiss has posted about patternsearch a number of times and recommends it as a primary tool, but it is easy to construct surfaces whose minima patternsearch cannot find unless you ask it to use one of the MAD* point-selection algorithms (which are typically very inefficient).
##### 3 Comments
Walter Roberson on 9 Jun 2016
The simulannealbnd routines internally keep track of two items: (1) the location (and function value) of the best point they have encountered so far; and (2) the location (and function value) of the "current" point. Whenever a new point is to be generated, the "current" point is the starting location. If the function value at the newly generated point is less than the best so far, then the new point is always accepted, the global best is updated, and the new point becomes the current point. If instead the function value at the newly generated point is not better than the global best but is better than the value at the "current" point, then the newly generated point is accepted and becomes the "current" point, and its function value is recorded as the current function value. If instead the function value at the newly generated point is not better than the value at the "current" point, then a random decision is made about whether to accept the new point as the new "current" point; the details of the probability calculation depend on which acceptance function you configured.
Once the above has been done, the hybrid section is invoked. The code decides whether it is the proper iteration to run the user-specified hybrid function, depending on the user-specified hybrid-function period; if not, the code just returns. If it is the proper iteration, then the specified hybrid function is invoked on the best point found so far, returning a point and a function value. If that function value is less than the best value ever encountered, then the returned point is assigned as the new global best and the associated function value is updated. However, the "current" point is not changed at all, no matter what the outcome of the hybrid function.
Because the code does not use the result of the hybrid function to update the "current" point, you cannot use the hybrid function to "pull" the random investigation into a different area. Since you could always call the hybrid function on the best point returned from running the annealing without a hybrid function (or specify 'end' as the time to run the hybrid function), the point of running the hybrid function periodically has to be the hope that one of the local minima hides within it a minimum, reachable by the hybrid function, that will be good enough to be the best value over all of the searches. (It does not have to be the absolute best value, because you can run a finer-detailed hybrid search afterwards: you just have to reach the "catch basin" of the global minimum.) This implies that you should either run your hybrid function "as far as practical", expecting it to pick out the global minimum, or run your hybrid function to a tolerance fine enough that you are satisfied that, if you reach it, the global minimum is in that catch basin.
For example, I was investigating a function the other day where it turned out that there was a local minimum at around a residue of 1E-11, so to be sure of finding the global basin I would have had to run to a residue of less than that. I was surprised: in the rest of my investigations of the function, any area with a residue of less than 1E-5 was a smooth catch basin for a strong local minimum, so I would have expected to only have to run the hybrid function to a tolerance of 1E-5 to find the correct basin.
The code modification I made in my private copy of simulannealbnd was that, when a hybrid function value is accepted as the new global best, I also set the "current" point to that location. With that change, it becomes meaningful to run the hybrid function periodically with a limited number of iterations to "guide" the search: "Hey, there is a relative valley over here, take a closer look." When your search is in the wrong area, this can make a huge difference in efficiency and in whether you get even close to the minimum.
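The bookkeeping described in the last few comments, including the "also update the current point" modification, can be sketched as follows. This is an illustrative Python re-implementation with made-up schedules and parameter names, not MATLAB's actual simulannealbnd internals:

```python
import math
import random

def anneal_with_hybrid(f, x0, hybrid, n_iters=200, period=20,
                       temp0=1.0, step=0.5, seed=None):
    """Minimal simulated-annealing sketch with the two tracked items described
    above (a 'best' point and a 'current' point), Metropolis-style acceptance,
    and a periodic hybrid local search whose result, when it improves on the
    best, also replaces the current point -- the modification described in
    the comment above."""
    rng = random.Random(seed)
    current, f_current = x0, f(x0)
    best, f_best = current, f_current

    for k in range(1, n_iters + 1):
        temp = temp0 / (1 + k)                     # simple cooling schedule
        trial = current + rng.gauss(0, step)       # propose from the *current* point
        f_trial = f(trial)
        if f_trial < f_current:                    # downhill: always accept
            current, f_current = trial, f_trial
            if f_trial < f_best:
                best, f_best = trial, f_trial      # and update the global best
        elif rng.random() < math.exp(-(f_trial - f_current) / temp):
            current, f_current = trial, f_trial    # uphill: accept with some probability

        if k % period == 0:                        # periodic hybrid step
            x_h = hybrid(best)
            if f(x_h) < f_best:
                best, f_best = x_h, f(x_h)
                current, f_current = best, f_best  # the key modification
    return best, f_best
```

Dropping the line marked "the key modification" reproduces the stock behavior, where the hybrid result only ever updates the best point and never pulls the random walk into the new area.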
Sepp on 9 Jun 2016
Edited: Walter Roberson on 10 Jun 2016
Thank you very much for your answer. Now it is clear. By the way, I have another question regarding simulannealbnd: http://ch.mathworks.com/matlabcentral/answers/288824-initial-temperature-for-simulated-annealing