From the introduction of a manuscript that I recently submitted for review
The practice of Machine Learning research can be characterized as the effective semiprincipled reduction of learning problems to problems for which robust and efficient solution techniques exist – ideally ones with provable bounds on their use of time and space. In a recent paper Bennett and Parrado-Hern´andez (2006) describe the synergistic relationship between the fields of machine learning (ML) and mathematical programming (MP). They remark:
“Optimization lies at the heart of machine learning. Most machine learning problems reduce to optimization problems. Consider the machine learning analyst in action solving a problem for some set of data. The modeler formulates the problem by selecting an appropriate family of models and massages the data into a format amenable to modeling. Then the model is typically trained by solving a core optimization problem that optimizes the variables or parameters of the model with respect to the selected loss function and possibly some regularization function. In the process of model selection and validation, the core optimization problem may be solved many times. The research area of mathematical programming theory intersects with machine learning through these
core optimization problems” (Bennett and Parrado-Hern´andez, 2006).
Later Bennett and Parrado-Hern´andez imply that when the targets of ML reductions have been optimization problems, they have for the most part been the convex optimization problems within the MP pantheon.
“Convexity plays a key role in mathematical programming. Convex programs minimize convex optimization functions subject to convex constraints ensuring that every local minimum is always a global minimum. In general, convex problems are much more tractable algorithmically and theoretically. The complexity of nonconvex problems can grow enormously. General nonconvex programs are NP-hard.” (Bennett and Parrado-Hern´andez, 2006).
The close relationship between ML and MP arguably exists because MP provides ML with a set of crisp, well-defined problems along with algorithmic solvers that come with guarantees on their use of time and space. To state this using metaphors from software engineering, the well-defined convex optimization problems are interfaces that MP publishes, and the provably efficient and robust algorithmic solvers of MP implement these interfaces.
Let us differentiate, in this paper, between optimization and adaptation. We define optimization as the procurement of one or more points of optimal or close-to-optimal value, and adaptation as the generation of points of increasing value over time. Given this definition, to say that the target problems of Machine Learning reductions are optimization problems is to fudge the truth somewhat. While the Mathematical Programming community indeed seems to be almost completely concerned with the procurement of optimal or close to optimal points, ML researchers aren’t interested in optimization per se but in the means by which it is achieved in most MP algorithms, i.e. adaptation. In fact optimization is often prevented in machine learning algorithms – using a “technique” named early-stopping – to prevent overfitting. In other words, robust, efficient adaptation is the modus operandi of most convex optimization algorithms, and for the most part, it is this modus operandi that makes these algorithms interesting to Machine Learning researchers.
The interface-problems published by the MP community give ML researchers useful targets to hit; if a ML researcher works out a semi-principled reduction of a class of learning problems to one of MP’s interface-problems, there are off-the-shelf algorithms within MP which allow her to quickly test whether her reduction is effective. Because of the emphasis that the ML community place on guarantees of robustness and efficiency. when the targets of ML reductions have been optimization problems, they have for the most part, been restricted to being convex optimization problems within MP. These problems are rather simple as adaptation problems go – every local optimum is also a global optimum, or stated differently, there are no local optima. Rather heroic feats of ingenuity are therefore necessary in order to obtain effective semi-principled reductions
of hard problems to these simple optimization problems. The difficulty of obtaining such
reductions is currently a fundamental limitation on the pace of progress within ML.
The SGA (Mitchell, 1996) is an adaptation algorithm which mimics natural sexual evolution. It has been directly applied to a large number of hard real-world problems and has often succeeded in generating solutions of remarkably high-quality. To be sure, some amount of thought is required to “massage” these problems into a form which allows the SGA to operate successfully on them (e.g. choices must be made about the fitness function used and the way solutions are encoded as bitstrings), but unlike the case in machine learning this massaging is largely ad-hoc, an outcome more of trial and error than principled reasoning. The resulting problems are almost certainly hard ones (non-convex), with objective functions that are riddled with local optima. It is a testament to the adaptive power of the SGA that it nevertheless often produces solutions of remarkably high quality. Given these successes one might expect a great deal of interest in SGAs from the machine learning community. That this is not the case speaks to an unfortunate shortcoming of GA research. There is no dearth of one-off problems that SGAs have adequately solved. However GA researchers have yet to publish a single class of problems such that a) SGAs are likely to perform robust, efficient adaptation when applied to problems in this class, and b) the class is likely to be useful as the target of ML reductions. For the sake of brevity we will loosely define such a class of problems as an SGA-Easy/ML-Useful class. We believe that when such problem classes are found the ML community will begin to take a greater interest in GA research. The future relationship between the GA and ML communities might then be similar to the one that currently exists between MP and ML. As mentioned above SGAs commonly adapt high-quality solutions to problems which are almost certainly contain large numbers of local optima. It is reasonable therefore to suspect that there exists an SGA-Easy/ML-Useful class of hard non-convex problems and that the identification of this class will significantly ease the burden of obtaining novel ML reductions. We believe that the identification of such a problem class will go hand in hand with the discovery of a theory which can give a satisfying explanation of the adaptive capacity of the SGA. Such a theory does not currently exist.