When Will Evolution Outperform Local Search?

Recombinative evolution will outperform local search on fitness functions that have a property I call deep contingency. To develop and support this claim I will:

  1. Present a non-hierarchical, whitebox fitness function possessing the property in question and empirically show that a genetic algorithm handily outperforms simulated annealing.
  2. Explain why simulated annealing, and indeed, local search of any kind, is the wrong way to approach the problem presented.
  3. Argue that the presented problem is a tough one, and the non-trivial computation performed by the genetic algorithm is just what’s needed.

This is going to be a lengthy post, and you are welcome to jump to the punch. First, I want to explain why it has taken till now to procure a whitebox optimization problem on which a genetic algorithm outperforms simulated annealing. The short story is that we did not, till recently, have good theories about the workings of recombinative evolution.

Here is the long story:

Logical Positivism vs the Scientific Method in Genetic Algorithmics

The genetic algorithm owes its form to biomimicry, not derivation from first principles. So, unlike the workings of conventional optimization algorithms, which are typically apparent from the underlying mathematical derivations, the workings of the genetic algorithm require elucidation. Attempts to explain how genetic algorithms work can be divided in two: those based to a lesser or greater extent on the scientific method, and those that reject the scientific method in favor of logical positivism.

Positivism is the idea that the only true knowledge is knowledge that can be verified. Logical postivism is the idea that verification can be carried out through the application of formal logic to empirical observations. Logical Positivism originated in the first quarter of the twentieth century and was explored and popularized by the Vienna Circle, a group of philosophers, scientists, and mathematicians that met regularly from 1924 to 1936 at the University of Vienna. Buoyed by rapid advances in the axiomatization of mathematics, logical positivists believed that like mathematics, all of science could be derived within a formal axiomatic system where the list of axioms consist of 1) empirical observations, a.k.a. protocol sentences and 2) the axioms of mathematics (e.g. ZFC). The movement to recast science in this mold gained currency in the first quarter of the twentieth century but subsequently came to be regarded as untenable. Writing in 1971, Werner Heisenberg offered this reflection about positivism in general:

The positivists have a simple solution: the world must be divided into that which we can say clearly and the rest, which we had better pass over in silence. But can any one conceive of a more pointless philosophy, seeing that what we can say clearly amounts to next to nothing? If we omitted all that is unclear we would probably be left with completely uninteresting and trivial tautologies. (Heisenberg 1971)

The non-universality of empirical observation was a particularly vexing problem for logical positivism. Given the verificationist foundations of this movement, only empirical observations made at a particular point in time and space were admissible as protocol sentences. Generalization to other points in time and space was not allowed. This made it hard to prove much of value. Proponents of logical positivism tied themselves in knots trying to address this and other critiques. By the mid twentieth century, the logical positivist movement had largely crumbled and the  philosophy of science propounded by Karl Popper had gained traction.

While the universality of scientific theories makes logical verification (i.e. proof) difficult, if not impossible, that very universality, observed Popper (2007b, 2007a), makes disproof (i.e. falsification) a much easier path to pursue. All it takes is a single counterexample—that is, a single prediction not borne out—to disprove a theory as it currently stands. Of course a theory must lend itself to falsification in the first place, and Popper cautioned strongly against theories that employed what he called immunizing strategems to avoid falsification. Such theories (Freud’s theory of psychoanalysis for example)

…appeared to be able to explain practically everything that happened within the fields to which they referred. The study of any of them seemed to have the effect of an intellectual conversion or revelation, opening your eyes to a new truth hidden from those not yet initiated. Once your eyes were thus opened you saw confirming instances everywhere: the world was full of verifications of the theory. Whatever happened always confirmed it. Thus its truth appeared manifest and unbelievers were clearly people who did not want to see the manifest truth. (Popper, 2007a, p45)

Scientific theories, in contrast, are theories that take risks. By predicting weird phenomena (e.g. gravitational lensing in the case of Einstein’s theory of general relativity) they leave themselves open to refutation. “A theory which is not refutable by any conceivable event is non-scientific”, wrote Popper. “Irrefutability is not a virtue (as people often think) but a vice” (Popper, 2007a, p46).

Whether day-to-day science, with its tribalism and politicking, actually works as Popper described has been rightly disputed (Kuhn 2012, Latour 1987). However, it is hard to argue with the historical success of the scientific method, or with Popper’s simple and elegant use of contraposition to frame the normative practice of science:

\textrm{theory} \Rightarrow \textrm{prediction} \Longleftrightarrow \neg \textrm{prediction} \Rightarrow \neg \textrm{theory}

Here is Popper in his own words:

Like many other philosophers, I am at times inclined to classify philosophers as belonging to two main groups—those with whom I disagree, and those who agree with me. I might call them the verificationist or the justificationist philosophers of knowledge or of belief, and the falsificationists or falibilists or critical philosophers of conjectural knowledge …

The members of the first group—the verificationists or justificationists—hold, roughly speaking, that whatever cannot be supported by positive reasons is unworthy of being believed, or even of being taken into serious consideration.

On the other hand, the members of the second group—the falsificationists or falibilists—say, roughly speaking, that what cannot (at present) in principle be overthrown by criticism is (at present) unworthy of being seriously considered; while what can in principle be so overthrown and yet resists all our critical efforts to do so may quite possibly be false, but is at any rate not unworthy of being seriously considered and perhaps even of being believed—though only tentatively. (Popper 2007a)

Popper famously claimed to have “killed positivism”. Were he alive today, he would no doubt be intrigued by the resurrection of positivism in at least one corner of Computer Science. Consider the following passage in Vose’s book, The Simple Genetic Algorithm (Vose 1999):

My central purpose in writing this book is to provide an introduction to what is known about the theory of the Simple Genetic Algorithm. The rigor of mathematics is employed so as not to inadvertently repeat myths or recount folklore.

The decision to accept assertions only when proved is not without price. Few (certainly not I) have mastered enough mathematics to make its application trivial. Another cost is that one is forced to trade what is believed to be true for what can be verified. Consequently this account is silent on many questions of interest for lack of proof.

The “myths” and “folklore” mentioned above reference the Building Block Hypothesis, the idea that genetic algorithms work through the hierarchical composition of short chromosomal snippets that confer above average fitness. The inventor of the genetic algorithm, John Holland, hypothesized that this type of hierarchical composition was the reason for the genetic algorithm’s effectiveness on a range of combinatorial optimization problems. Unfortunately, the building block hypothesis and other hierarchical composition hypotheses (Watson 2006) have been propped up long after it has become clear that such hypotheses cannot be reconciled with a long list of theoretical and empirical anomalies.

The obstinacy of the hierarchical compositionalists when confronted with these anomalies resulted in an interesting transition within the theoretical evolutionary computation community. Hypotheses as a whole came to be regarded with wariness and there was a general hardening of resolve to, as Vose put it, “accept assertions only when proved.” For well over a decade now, foundational research on evolutionary computation has had a decidedly positivist bent. The logical positivism practiced by evolutionary computation theorists is purer version of the logical positivism espoused by the Vienna Circle in that the evolutionary computation theorists find no use for empirical observations, and rely exclusively on the axioms of mathematics. The body of work that goes by the name Runtime Analysis is but the latest in a line of positivist evolutionary computation literature stretching back to the 1990s.

Now perhaps this is just as well. Perhaps logical positivism was simply a movement in search of a niche, and perhaps the foundation of evolutionary computation is just that niche. Maybe. But frankly, there is no a priori reason why any of this should be so. I’ll grant that it is possible to verifiably explain the workings of a large number of computer algorithms—for example, the workings of the algorithms in Cormen et al’s classic, Introduction to Algorithms—however, these algorithms were constructed to have verifiable explanations. Explanatory verifiability, in other words, was a design consideration, not a coincidence. The same cannot be said for the genetic algorithm, which, as I said earlier, owes its form to biomimicry.

The fact of the matter is logical positivist approaches have not yielded answers to the kinds of questions that users of evolutionary algorithms care about. Rather, they have diverted research on the foundations of evolutionary computation toward the study of highly simplified—neutered, one might say—algorithms, for example, the (1+1)-EA, and trivial fitness functions, for example, onemax. In 1999, Vose wrote that his positivist account is “silent on many questions of interest for lack of proof.” Wikipedia’s embarrassing entry on Genetic Algorithms suggests that the situation today is no different.

One might ask why the scientific method—an approach that has acquitted itself splendidly in field after field—was so readily cast aside in favor of logical positivism? I think it has to do with evolutionary computation being a subdiscipline of Computer Science, and Computer Science, contrary to its name, not really being a science. By this I mean that unlike the foundations of say Physics, Chemistry, or Biology, the foundations of Computer Science are purely mathematical—hypotheses play no part. So, while Computer Science has seen engineering revolutions aplenty, it has witnessed nothing similar to the transition in Physics from a Newtonian universe to an Einsteinian one, or the move in Chemistry from the phlogiston theory of combustion to Lavoisier’s oxygen based theory, or any of the other paradigm shifts described in Kuhn’s Structure of Scientific Revolutions (2012). Mathematics, of course, figures prominently in the foundations of the natural sciences, however it is always in support or within the rubric of some hypothesis. Theoretical Physicists, Chemists, and Biologists trained informally, if not formally, in the ways of the scientific method know how to evaluate and work with scientific hypotheses. The same cannot be said of lay mathematicians and theoretical computer scientists.

The field of Genetic Algorithms, which was founded on a hypothesis, is quite possibly the first genuinely scientific discipline within Computer Science. Unfortunately, the field has been a poor standard bearer for the use of the scientific method within Computer Science. Whereas a healthy science moves towards increasing levels of vulnerability, a.k.a. falsifiability, hierarchical compositionalists, when confronted with various anomalies, armored up by moving toward unfalsifiability. For example, as criticism of the building block hypothesis mounted, Holland put forth the building block thesis (Holland 2000), an unfalsifiable theory that essentially contends that hierarchical composition is at play in a genetic algorithm because, well, hierarchical composition is at play almost everywhere. When push comes to shove, the refrain “What else might recombination be doing?” is a well worn fallback. In a mature field of science, such maneuvering would immediately be called out for violating the safeguards that are part of the scientific method. In the fledgeling scientific field of evolutionary computation, however, this critique was not leveled. Instead, logical positivism took root and “hypothesis” became a dirty word.

The Missing Royal Road Function

The way forward for foundational genetic algorithms research lies not in doubling down on logical positivism or hierarchical compositionalism, but in scientific rigor which—to reiterate one of the main points of the previous section—is more than mere mathematical rigor. The cornerstone of scientific rigor is the submission of a falsifiable hypothesis based on weak assumptions for review and refutation. Scientific rigor must be continually maintained through a sustained, community-wide effort to make predictions  and test them (e.g. the prediction of the Higgs Boson). Anomalies must be spotted and called out. Resolving them should be an urgent matter. Ignoring them should be unacceptable.

Like all search algorithms, genetic algorithms perform optimization by exploiting some structural property of the fitness function to which it is applied. Therefore, any scientific hypothesis about the workings of genetic algorithms must, at a minimum, be comprised of two parts:

  • Structural Assumptions: A specification of the structural properties of fitness functions that get exploited by genetic algorithms
  • Exploitation Story: A description of how these structural properties are efficiently exploited

Hierarchical compositionalism assumes the existence of multitudes of modules (low order schemata) that individually confer above average fitness. Also assumed is the existence of hierarchical synergy amongst these modules; i.e. the existence of second level modules that confer a higher fitness on average than the combination of the fitness increments conferred by their constituent first level modules; likewise, the existence of third level modules that confer a higher fitness on average than the combination of the fitness increments conferred by their constituent second level modules; and so on. These are the structural assumptions of hierarchical composition hypotheses. Here is the exploitation story: a genetic algorithm preferentially selects individuals with first level modules, then efficiently mixes the first level modules to create individuals with second level modules, then efficiently mixes those individuals to create individuals with third level modules, and so on.

In previous work I have criticized hierarchical compositionalism on the grounds that its structural assumptions are much too strong. (Remember, we are talking about fitness functions arising in practice through the ad hoc representational choices of genetic algorithm users.) Here I want to raise a different problem confronting hierarchical compositionalist: their inability to procure a royal road function in support of hierarchical compositionalism.

The idea behind royal road functions is simple: Given a hypothesis about the workings of genetic algorithms, it should be possible to procure a fitness function such that the following criteria are satisfied:

  1. The structural assumptions of the hypothesis are met by the fitness function.
  2. The behavior of a genetic algorithm when applied to the fitness function clearly matches the behavior described in the exploitation story of the hypothesis.
  3. The genetic algorithm is observed to be more successful at optimization on the fitness function than naive alternate approaches—hill climbing, for example.

A fitness function that meets these criteria is called a royal road function. Trying to find such a function accords well with the scientific method outlined by Popper. Essentially, a hypothesis about the workings of genetic algorithms predicts the existence of one or more royal road functions. Finding such functions validates the prediction and lends credibility to the hypothesis.

A formal review of the literature on royal road functions should at a minimum cover Mitchell et al’s original work on Royal Road functions (Mitchell 1998), Holland’s hyperplane defined functions (Holland 2000), and Watson’s HIFF functions (Watson 2006). Such a review is beyond the scope of this blog post. What I will say here is that hierarchical compositionalists have tried and failed to find a fitness function that satisfies the criteria listed above.


Contingent Parities Functions

I will now present a function that satisfies the royal road criteria. The hypothesis supported by this function has nothing to do with hierarchical compositionalism. Instead the function lends credibility to a different hypothesis about the workings of genetic algorithms called the generative fixation hypothesis (Burjorjee 2009).

A Contingent Parities Function of order k and height h is pseudo-boolean function that takes a bitstring of length k*h as input and returns an integer in the interval [0,h]. The pseudocode for this function appears below:

contingent parities function algo

Given a chromosome of length k*h as input, the fitness of the chromosome, i.e. the output of the contingent parities function, is determined by the parity of the loci in a series of h sets of loci, each of cardinality k. Each set of loci in this series is called a step. Now for the crucial part: For all i in {2, …, h}, observe that the composition of step i is not the same for all chromosomes, but is contingent upon the hash of a trace string comprised of the loci participating in steps 1 to i-1 and the values of the loci comprising these steps.

A chromosome is said to have misstepped at level i if  the values of the loci in the ith step have even parity. Otherwise the chromosome is said to have eustepped at level i. The maximum number of missteps a chromosome can have is h. The difference between h and the number of missteps is the fitness of the chromosome (see line 17).

Time for an example. Given some contingent parities function of order 2 and height 8, the table below shows a sorted list of final trace strings and fitness values of thirty chromosomes of length 16 generated uniformly at random.

trace_and_fitness

To generate such traces yourself clone the github repository that accompanies this article, and execute the following python code:

import royalroads as rr
import numpy.random as npr

# generate a contingent parities function of order 2 and height 8
cpf = rr.generateContingentParitiesFunction(order=2, height=8)

# generate 30 chromosomes of length 16 uniformly at random
chromosomes = npr.rand(30,16) < 0.5

# evaluate the chromosomes
cpf(chromosomes, verbose=True)

The challenge, given query access to a contingent parities function, is to find a chromosome that maximizes fitness. There are 2(k-1)*h such chromosomes. Before proceeding further, you should convince yourself a) that this is not a simple problem and b) that local search is not a great way to approach it. Observe, in particular, that the correction of lower level missteps can turn higher level eusteps into missteps; however, the reverse is not true. That is, the correction of higher level missteps cannot turn lower level eusteps into missteps.  Thus, the higher the fitness of a chromosome, the less likely it is that local search will successfully correct any low level missteps. In other words the fitness structure of the problem is such that local search will tend to “freeze in” lower level missteps as search proceeds.

This behavior can be observed in the following figure which shows the missteps of a chromosome with maximal fitness (in blue) every 500 epochs (called a period) when simulated annealing was run on a contingent parities function of order 2 and height 100. The maximal fitness achieved in each period is shown in red. The probability of accepting a solution with a fitness of one less than that of the current solution was 0.6 at the beginning of the run and 0.001 by the end of the run.

SA_missteps_and_fitnessThe plot above can be regenerated as follows:

import royalroads as rr
rr.SA(rngSeed=1, numPeriods=1000)

The horizontal lines in the plot are missteps that have been frozen in. All in all there are 15 such missteps, resulting in a fitness of 85 by the end of the run.

As the correction of higher level missteps cannot turn lower level eusteps into missteps, a methodical way to optimize a contingent parities function is to get lower steps right before getting higher steps right. This is precisely what a genetic algorithm does when applied to the function. A simple genetic algorithm with uniform crossover, fitness proportional selection, sigma truncation (with coefficient 0.5), a per bit mutation probability of 0.004, and a population size 500 was run on the same contingent parities function for 1600 generations. The figure below shows the maximal fitness and the missteps of a chromosome with maximal fitness in each generation.

GA_noclamping_missteps_and_fitness
This plot can be regenerated as follows:

import royalroads as rr
rr.GA(rngSeed=1, numGenerations=1600)

The figure shows that the simple genetic algorithm methodically got lower level steps right before it got higher level steps right. That said, diminishing returns soon set in, and there came a point beyond which there was no steady increase in the number of eusteps in the chromosome with maximal fitness. Simple genetic algorithms have been criticized for lacking a “killer instinct”. The results obtained here validate this criticism. The genetic algorithm’s ability to methodically correct missteps peters out even though from a Detection Theory perspective, it is actually easier to correct higher steps than it is to correct lower steps.

I have previously accounted for the absence of a “killer instinct” and have provided a simple fix called clamping (Burjorjee 2009, 2013). I reran the simple genetic algorithm, this time with clamping turned on (a flag frequency threshold of 0.01, an unflag frequency threshold of 0.1 and a waiting period of 60 generations was used). The results appear in the figure below.

GA_missteps_and_fitness

This plot can be regenerated as follows:

import royalroads as rr
rr.GA(rngSeed=1, numGenerations=1000, useClamping=True)

The figure shows that the genetic algorithm once again eliminated lower level missteps before eliminating higher level missteps. Instead of diminishing returns, however, the algorithm yielded accelerating returns as the run proceeded, which is in line with the decision theoretic observation that higher steps are easier to get right than lower steps. By generation 600, the genetic algorithm had found a chromosome with the maximum fitness of 100.

A single generation of the genetic algorithm incurs the same number of fitness evaluations as one period of simulated annealing, allowing us compare the performance of the two algorithms. The figure below shows the mean maximal fitness value per generation/period over forty runs of simulated annealing (in purple) and recombinative evolution (in green). The error bands show one standard deviation above and below the mean.

ga_vs_sa

This plot was generated as follows:

import royalroads as rr
rr.compareAlgorithms(numRuns=40)

The mean maximal fitness in the final generation/period of simulated annealing and the genetic algorithm are given below.

ga_vs_sa_final_gen_stats

The superior performance of recombinative evolution can be attributed to the methodical way it solves the problem.

Deep Hierarchy vs. Deep Contingency

The general results presented in the previous section are predicted by the generative fixation theory  of adaptation (Burjorjee 2009, 2013) and the hypomixability theory of recombination. They are not, to the best of my knowledge, predicted by any verificationist theory currently in existence, and they are certainly not predicted by any theory having to do with hierarchical composition. In fact, the provable absence of hierarchical modularity in contingent parity functions, makes these results a good counterexample to hierarchical composition based hypotheses.

With the attention currently being devoted to all things “deep”, the hierarchical composition that purported occurs in recombinative evolutionary systems recently acquired a new name: deep optimization. Just as deep learning composes low level feature detectors into high level feature detectors, so does recombinative evolution, the story goes, compose low level modules into high level modules.

On the one hand I find this new name problematic—dangerous even—because the it promotes hierarchical compositionalism under a catchy new name while papering over the evidence against this class of hypotheses (the efficacy of uniform crossover, for starters). On the other hand, I do like the name because I believe it can help focus attention on the difference between generative fixation and hierarchical composition. While both hypotheses assume fitness functions with “depth”, very different assumptions are made about form this depth takes. Hierarchical composition based hypotheses, of course, assume a deep hierarchy of modules. The generative fixation hypothesis on the other hand makes a much weaker assumption—that of deep contingency: higher level modules owe their existence (or non-existence as the case may be) to the fixation of lower level genes. Unlike hierarchical depth, the existence of a given higher level module is not pre-established.

Conclusion

In this post I have explained how the the promising field of evolutionary computation has been ill-served by a vampirical theory on the one hand (hierarchical composition) and logical positivism on the other. I have argued that the way forward lies in rigorous adherence to the scientific method, a defining characteristic of which is the making and testing of weird predictions. Accordingly I have introduced contingent parities functions, a class of non-trivial, whitebox fitness functions whose structural properties are such that the genetic algorithm can be expected to outperform local search algorithms. On at least one such function, I have demonstrated that a simple genetic algorithm, when augmented by a simple mechanism called clamping, handily outperforms simulated annealing. These results and arguments lend support to the generative fixation theory of adaptation and the hypomixability theory of sex.

The importance of shoring up the foundations of evolutionary computation cannot be overstated. At stake is an understanding of the computational efficiencies at play in recombinative evolution, a final understanding of the computational purpose of sex, and more broadly, an understanding of the place and utility of the scientific method in Computer Science. Then, of course, there are the engineering ramifications of all of the above.

To learn more about generative fixation and the role of recombination in evolutionary systems check out my papers on hyperclimbing and hypomixability. I encourage you to use the theories presented therein to make predictions of your own and to publicize your attempts to validate them—successful and otherwise.

References

Burjorjee, K. M. (2009). Generative fixation: a unified explanation for the adaptive capacity of simple recombinative genetic algorithms. Brandeis University.

Burjorjee, K. M. (2013). Explaining optimization in genetic algorithms with uniform crossover. In Proceedings of the twelfth workshop on Foundations of genetic algorithms XII (pp. 37-50). ACM.

Heisenberg, W. (1971). Positivism, Metaphysics and Religion. In Ruth Nanda Nanshen. Werner Heisenberg – Physics and Beyond – Encounters and Conversations. World Perspectives 42. Translator: Arnold J. Pomerans. New York: Harper and Row. p. 213.

Holland, J. H. (2000). Building blocks, cohort genetic algorithms, and hyperplane-defined functions. Evolutionary computation, 8(4), 373-391.

Kuhn, T. S. (2012). The structure of scientific revolutions. University of Chicago press.

Latour, B. (1987). Science in action: How to follow scientists and engineers through society. Harvard university press.

Mitchell, M. (1998). An introduction to genetic algorithms. MIT press.

Nagle, E. and Newman, J.R. (2001).  Godel’s Proof, New York University Press.

 Popper, K. (2007a). Conjectures and Refutations. Routledge.

 Popper, K. (2007b). The Logic Of Scientific Discovery. Routledge.

Skiena, S. (2010). The Algorithm Design Manual (2nd ed.). Springer Science+Business Media

Vose, M. (1999). The Simple Genetic Algorithm: Foundations and Theory. Cambridge, MA: MIT Press

Watson, R. A. (2006). Compositional evolution: the impact of sex, symbiosis and modularity on the gradualist framework of evolution. MIT Press.

 

When Will Evolution Outperform Local Search?

Hypomixability Theory

TL;DR: Computer scientists can say why the fast fourier transform is fast and why quicksort is quick. Yet when it comes to explaining what’s efficient about the computation that yielded ourselves, we currently draw a blank. Hypomixability theory provides the beginnings of an answer.

Run a population based recombinative evolutionary algorithm on a real world fitness function; until adaptation tapers off, you will find that the fitness of an offspring is, on average, lower than the mean fitness of its parents. You will find, in other words, that the average fitness of the population goes down after mixing (recombination) and up after selection. Selection, of course, makes sense—it increases average fitness. But why mixing? What sense does it make to recombine the genes of individuals with above average fitness, especially on epistatic landscapes where recombination is virtually guaranteed to lower average fitness? What computational purpose does mixing serve?

Hypomixability Theory explains why mixing makes computational sense and illuminates how mixing induced reductions in average fitness (the hypo in hypomixability) between parent and offspring populations can be indicative of room for adaptation. The theory is very parsimonious in its assumptions. It does not, for example, assume the presence of building blocks in the fitness landscape. Hypomixability Theory also comes with a prediction and proof of efficient computational learning in a recombinative evolutionary algorithm, making it the first, and currently only, computationally credible explanation for the “endless forms most beautiful and wonderful” that comprise the the sexually reproducing part of the biosphere. Our remarkable species included.

Paper: Efficient Hypomixability Elimination in Recombinative Evolutionary Systems

Hypomixability Theory

Explaining Sex, Adaptation, Speciation, and the Emergence of Genetic Modularity

To appear in the proceedings of the Foundations of Genetic Algorithms Conference, 2015

Hypomixability Elimination in Evolutionary Systems

Abstract: Hypomixability Elimination is an intriguing form of computation thought to underlie general-purpose, non-local, noise-tolerant adaptation in recombinative evolutionary systems. We demonstrate that hypomixability elimination in recombinative evolutionary systems can be efficient by using it to obtain optimal bounds on the time and queries required to solve a subclass (k=7, \eta=1/5) of a familiar computational learning problem: PAC-learning parities with noisy membership queries; where k is the number of relevant attributes and \eta is the oracle’s noise rate. Specifically, we show that a simple genetic algorithm with uniform crossover (free recombination) that treats the noisy membership query oracle as a fitness function can be rigged to PAC-learn the relevant variables in O(\log (n/\delta)) queries and O(n \log (n/\delta)) time, where n is the total number of attributes and \delta is the probability of error. To the best of our knowledge, this is the first time optimally efficient computation has been shown to occur in an evolutionary algorithm on a non-trivial problem.

The optimality result and indeed the implicit implementation of hypomixability elimination by a simple genetic algorithm depends crucially on recombination. This dependence yields a fresh, unified explanation for sex, adaptation, speciation, and the emergence of modularity in evolutionary systems. Compared to other explanations, Hypomixability Theory is exceedingly parsimonious. For example, it does not assume deleterious mutation, a changing fitness landscape, or the existence of building blocks.

Explaining Sex, Adaptation, Speciation, and the Emergence of Genetic Modularity

Efficiently Evo-learning Juntas With Membership Queries

Recently, I showed that a simple genetic algorithm with uniform crossover can solve a noisy learning 7-parity problem efficiently—in O(n \textrm{ polylog}(n, \frac{1}{\epsilon})) time and O(\textrm{polylog}(n, \frac{1}{\epsilon})) membership queries. Learning parities is a subclass of a more general problem—learning juntas—that is of considerable interest to computational learning theorists because it neatly abstracts the general problem of learning in the presence of large amounts of irrelevant information. This is a very common problem in practice. A good example comes, coincidentally, from the field of genetic epidemiology—learning the genetic loci participating in the penetrance of a particular disease. (More about this “coincidence” in a bit)

The implicit concurrency hypothesis posits that the efficient learning of relevant variables in the presence of large numbers of irrelevant variables is precisely what gives evolution its power; so, understandably, when I read about the learning juntas problem a few weeks ago (on this blog), I sensed the possibility of a connection. Formally, a k-junta is a function f over n boolean variables, only j\,\,(\leq k\leq n) of which matter (i.e. are relevant) in the determination of the output of f. The j relevant variables are said to be the juntas of f, and the function f is completely characterized by its juntas and by a hidden boolean function h over j inputs. The output of f is just the output of h on the values of the j relevant variables of f (the values of the irrelevant variables are simply ignored). The problem of identifying the relevant variables is called the learning juntas problem.

Observe how this problem abstracts the genetic epidemiology problem mentioned above. The junta are the genetic loci participating in the presence/absence of the disease. The hidden function specifies which configurations of alleles at these loci cause the disease, and which ones do not. The problem of learning the relevant loci is the problem of learning juntas. Also note that the learning juntas problem generalizes the learning parities problem straightforwardly. Whereas the hidden function in the case of learning parities must be the parity function, the hidden function in the case of learning juntas can be any boolean function over the juntas.

Here is an evolution based algorithm, written in Python, that learns juntas given a membership query oracle. And here is a tutorial on its usage. For any value of k, the theory of implicit concurrency suggests that the k-junta for which it is hardest for a genetic algorithm to learn one or more relevant attributes (Mossel et al observed in Learning Juntas that to crack the problem, asymptotically speaking, one only needs to find a single relevant attribute) is the k-junta with a hidden parity function over k variables. This insight and my previous work on learning parities suggests that the evolutionary algorithm presented can learn the relevant attributes of k-juntas with small k in O(n\textrm{ polylog}(n, \frac{1}{\epsilon})) time and O(\textrm{polylog}(n, \frac{1}{\epsilon})) queries.

Implications for the Neo-Darwinian Synthesis

It seems like more than a coincidence that

  1. The Learning Juntas problem abstracts the genetic epidemiology problem mentioned earlier
  2. Evolutionary computation can efficiently solve the Learning Juntas problem for small numbers of juntas.

One is hard pressed not to hypothesize that evolution takes the form it takes precisely because this form allows it to solve a problem that is similar to the genetic epidemiology problem in all significant ways, save two :

  1. The problem is to learn the genetic loci participating in fitness/hotness not disease penetrance (often the same loci, but by no means always)
  2. Unlike computational geneticists, evolution has access to an active oracle.

Given the straightforwardness of this idea it may seem odd that it has not been forthcoming from Evolutionary Biology itself; that is until one realizes that a deep-rooted practice in Population Genetics actively steers theorists away. A vestige from the days before computers, when pen, paper, and mathematics was all one had, the conflation of the unit of inheritance with the unit of selection casts a long shadow. More about this in Sections 2.4.1 and 5.1 of my dissertation.

What now?

Despite the routine use of evolutionary algorithms in science and engineering, computer scientists have been unable to give a straight answer to a simple question: “Is there anything computationally efficient about evolution?” This question can now be answered in the affirmative. Recombinative evolutionary algorithms are capable of efficiently solving a broad, non-trivial problem: Learning k-Juntas with membership queries when k is small.

While fascinating from a biological perspective, this answer is not entirely satisfying to the engineer in me because there are conventional algorithms that solve the problem under consideration as efficiently; for example, the binary search algorithm due to Blum and Langley described in Section 5 of the paper by Mossel et al. A more involved algorithm due to Feldman solves the problem almost as efficiently, and does so non-adaptively and in the presence of random persistent classification error.

Have modern ML techniques eclipsed evolution at the very thing it is good at?

I have a hunch that the answer is “no”. It would be fantastic to find generalizations of the learning juntas with membership queries problem that are more difficult for conventional algorithms, but remain efficiently solvable by evolution.

Relevant Computational Learning Theory papers
R.J. Lipton
[web] The Learning Juntas Problem
(An excellent introduction to the Learning Juntas problem)

E. Mossel, R. O’donnell, and R. Servedio
[pdf] Learning Juntas,
(Section 3.1 is especially relevant if you want to reverse engineering the posted Python algorithm)

A. Blum, L. Hellerstein, and N. Littlestone
[pdf] Learning in the Presence of Finitely or Infitely Many Irrelevant Attributes

V. Feldman
[pdf] Attribute-Efficient and Non-adaptive Learning of Parities and DNF Expressions

Relevant papers on Implicit Concurrency
K.M. Burjorjee
[pdf] The Fundamental Learning Problem that Genetic Algorithms with Uniform Crossover Solve Efficiently and Repeatedly as Evolution Proceeds

K.M. Burjorjee
[pdf] Explaining Optimization in Genetic Algorithms with Uniform Crossover

Relevant papers on genetic epidemiology
J.H. Moore, L.W. Hahn, M.D. Ritchie, T.A. Thornton, and B.C. White
[web] Routine Discovery of Complex Genetic Models using Genetic Algorithms

J.H. Moore
[web] The ubiquitous nature of epistasis in determining susceptibility to common human diseases

Last Updated: September 18, 2013

Efficiently Evo-learning Juntas With Membership Queries

Theoretical Bonafides For Implicit Concurrency

A commonly asked question since the early days of evolutionary computation is “What are Genetic Algorithms good for?” What, in other words, is the thing that genetic algorithms (GAs) do well? The prevailing answer for almost two decades was “Hierarchical Building Block Assembly”. This answer, which goes by the name of the building block hypothesis, generated tremendous excitement in its day and helped jumpstart the field of genetic algorithmics. However it fell into disrepute during the ’90s because of a lack of theoretical support and (more damningly) anomalies in the empirical record.

Since then, The Question (“What are GAs good for?”) has gone unanswered—or, to be precise, remains poorly answered—The Building Block Hypothesis continues on as a placeholder in textbooks. More recently, The Question is increasingly going unasked by evolutionary computation theorists, in large part because EC theorists are now wary of entertaining claims about genetic algorithms that cannot be formally deduced, and because Genetic algorithms used in practice (ones with finite but non-unitary recombinative populations) have proven frustratingly resistant to conventional analytic approaches.

The following paper squares off with Das Question, and for genetic algorithms with uniform crossover (UGAs), ventures an answer: Implicit Concurrent Multivariate Effect Evaluation—implicit concurrency for short—is what UGAs are good for. Implicit concurrency is compared with implicit parallelism (the “engine of optimization” according to the building block hypothesis), and the former is revealed to be more powerful.

Empirical evidence for implicit concurrency can be found here and here. And additional empirical evidence is easily obtained for one-off problem instances—just run a genetic algorithm on the appropriately chosen instance. The paper below takes a different tack. It establishes theoretical support (across an infinite set of problem instances) for the claim that implicit concurrency is a form of efficient computational learning. It achieves this end by demonstrating that implicit concurrency can be used to obtain low bounds on the time and query complexity of a UGA based algorithm that solves a constrained version of a well recognized problem from the computational learning literature: Learning parities with a noisy membership query oracle.

The difficulty of formally analyzing genetic algorithms is circumvented with an empirico-symmetry-analytic proof comprised of

  1. A formal part (no knowledge of genetic algorithms required)
  2. An accessible symmetry argument
  3. A simple statistical hypothesis testing based rejection of a null hypothesis with very high significance p<10^{-100}

The paper is here:
The Fundamental Learning Problem that GAs with UX Solve Efficiently and Repeatedly as Evolution Proceeds

The code used to conduct the experiment is here:
https://github.com/kburjorj/speedyGApy/blob/master/soda14.py

Theoretical Bonafides For Implicit Concurrency

Implicit Concurrency in Genetic Algorithms

Genetic Algorithmics remains a niche area within Artificial Intelligence largely because the field has not identified a computation of some kind that genetic algorithms perform efficiently. The skepticism with which GAs are regarded is understandable. If, after decades of research and thousands of published articles, we cannot say what genetic algorithms do efficiently, then it is hardly surprising that the field would be held at arm’s length by the wider AI community.

His Highness is Naked. Time for some Real Clothes

Interestingly, the thing genetic algorithms do efficiently—concurrent multivariate effect evaluation (implicit concurrency for short)—is not difficult to describe or demonstrate. It’s been under our noses ever since schemata and schema partitions were defined as concepts, but for various reasons, which I won’t go into here, it’s escaped identification and dissemination.

Implicit concurrency (not to be confused with implicit parallelism) is a marvelous phenomenon that should be more widely appreciated than it currently is. This blog post provides a quick introduction. For a more in depth treatment check out  my FOGA 2013 paper Explaining Optimization in Genetic Algorithms with Uniform Crossover (slides here). And if you really want to have at it, my dissertation is here.

Schema Partitions and Effects

Let’s begin with a quick primer on schemata and schema partitions. Let S=\{0,1\}^\ell be a search space consisting of binary strings of length \ell. Let \mathcal I be some set of indices between 1 and \ell, i.e.  \mathcal I\subseteq\{1,\ldots,\ell\}. Then  \mathcal I represents a partition of S into 2^{|\mathcal I|} subsets called schemata (singular schema) as in the following example: suppose \ell=4, and \mathcal I=\{1,3\}, then \mathcal I partitions S into four schemata:

0*0* = \{0000, 0001, 0100, 0101\}

0*1* = \{0010, 0011, 0110, 0111\}

1*0* = \{1000, 1001, 1100, 1101\}

1*1* = \{1010, 1011, 1110, 1111\}

where the symbol * stands for ‘wildcard’. Partitions of this type are called schema partitions. As we’ve already seen, schemata can be expressed using templates, for example,  0*1*. The same goes for schema partitions. For example \#*\#* denotes the schema partition represented by the index set \mathcal I. Here the symbol \# stands for ‘defined bit’. The order of a schema partition is simply the cardinality of the index set that defines the partition (in our running example, it is |\mathcal I| = 2). Clearly, schema partitions of lower order are coarser than schema partitions of higher order.

Let us define the effect of a schema partition to be the variance of the average fitness values of the constituent schemata under sampling from the uniform distribution over each schema. So for example, the effect of the schema partition \#*\#*=\{0*0*\,,\, 0*1*\,,\, 1*0*\,,\, 1*1*\} is

\frac{1}{4}\sum\limits_{i=0}^1\sum\limits_{j=0}^1(F(i*j*)-F(****))^2

where the operator F gives the average fitness of a schema under sampling from the uniform distribution.

You’re now well poised to understand implicit concurrency. Before we get to a description, a brief detour to provide some motivation: We’re going to do a thought experiment in which we examine how effects change with the coarseness of schema partitions. Let [\mathcal I] denote the schema partition represented by some index set \mathcal I. Consider a search space S=\{0,1\}^\ell with \ell=10^6, and let \mathcal I =\{1,\ldots,10^6\}. Then [\mathcal I] is the finest possible partition of S; one where each schema in the partition has just one point. Consider what happens to the effect of [\mathcal I] as we start removing elements from \mathcal I. It should be relatively easy to see that the effect of [\mathcal I] decreases monotonically. Why? Because we’re averaging over points that used to be in separate partitions. Don’t proceed further until you convince yourself that coarsening a partition tends to decrease its effect.

Finally, observe that the number of schema partitions of order o is {\ell \choose o}. So for \ell = 10^6,  the number of schema partitions of order 2,3,4 and 5 are on the order of 10^{11}, 10^{17}, 10^{23}, and 10^{28} respectively. The take away from our thought experiment is this: while a search space may have vast numbers of coarse schema partitions, most of them will have negligible effects (due to averaging). In other words, while coarse schema partitions are numerous, ones with non-negligible effects are rare.

So what exactly does a genetic algorithm do efficiently? Using experiments and symmetry arguments I’ve demonstrated that a genetic algorithm with uniform crossover can concurrently sift through vast numbers of coarse schema partitions and identify partitions with non-negligible  effects. In other words, a genetic algorithm with uniform crossover can implicitly perform multitudes of effect/no-effect multifactor analyses and can efficiently identify interacting loci with non-negligible effects.

Let’s Play a Game

It’s actually quite easy (not to mention, cool and fun) to visualize a genetic algorithm as it identifies such loci. Let’s play a game. Consider a stochastic function that takes bitstrings of length 200 as input and returns an output that depends on the values of the bits of at just four indices. These four indices are fixed; they can be any one of the {\ell \choose 4} combinations of four indices between between 1 and 200. Given some bitstring, if the parity of the bits at these indices is 1 (i.e. if the sum of the four bits is odd) then the stochastic function returns a value drawn from the magenta distribution (see below). Otherwise, it returns a value drawn from the black distribution. The four indices are said to be pivotal. All other indices are said to be non-pivotal.

parityDistribs

As per the discussion in the first part of this post, the set of pivotal indices is the dual of a schema partition of order 4. Of all the schema partitions of order 4 or less, only this partition has a non-zero effect. All other schema partitions of order 4 or less have no effect. (Verify for yourself that this is true) In other words, in the world of effect evaluation, parity is a Needle in a Haystack (NIAH) problem. The kind of problem that seems closed to all approaches save brute force.

Now for the rules of the game: Say I give you query access to the stochastic function just described, but I do not tell you what four indices are pivotal. You are free to query the function with any bitstring 200 bits long as many times as you want. Your job is to recover the pivotal indices I picked, i.e. to identify the only schema partition of order 4 or less with a non-negligible effect.

Take a moment to think about how you would do it? What is the time and query complexity of your method?

What Not Breaking a Sweat Looks Like

The animation below shows what happens when a genetic algorithm with uniform crossover is applied to the stochastic function just described. Each dot displays the proportion of 1’s in the population at a locus. Note that it’s trivial to just “read off” the proportion of 0s at each locus. The four pivotal loci are marked by red dots. Of course, the genetic algorithm does not “know” that these loci are special. It only has query access to the stochastic function.

As the animation shows, after 500 generations you can simply “read off” the four loci I picked by examining the proportion of 1s to 0s in the population at each locus. Congratulations! You’ve just seen implicit concurrency in action. The chromosome size in this case is 200, so there are {200 \choose 4} possible combinations of four loci. From all of these possibilities, the genetic algorithm managed to identify the correct one within five hundred generations.

Let’s put implicit concurrency through it’s paces. I’m going to tack on an additional 800 non-pivotal loci while leaving the indices of the four pivotal loci unchanged. Check out what happens:

[Note: more dots in the animation below does not mean a bigger population or more queries. More dots just means more loci under consideration. The population size and total number of queries remain the same]

So despite a quintupling in the number of bits, entailing an increase in the number of coarse schema partitions of order 4 to {1000 \choose 4}, the genetic algorithm solves the problem with no increase in the number of queries. Not bad. (Of course we’re talking about a single run of a stochastic process. And yes, it’s a representative run. See chapter 3 of my dissertation to get a sense for the shape of the underlying process)

Let’s take it up another notch, and increase the length of the bitstrings to 10,000. So now we’re looking at  {10000 \choose 4} \sim 10^{18} combinations of four loci. That’s on the order of a million trillion combinations. This time round, let’s also change the locations of the 4 pivotal loci. Will the genetic algorithm find them in 500 generations or less?

How’s that for not breaking a sweat? Don’t be deceived by the ease with which the genetic algorithm finds the answer. This is not an easy problem.

Intrigued? [If you’ve read this far, you should be] To run the experiments yourself download speedyGApy, and run it with

python speedyGA.py --fitnessFunction seap --bitstringLength <go-wild!>

noting that the increase in “wall clock” time between generations as you increase bitstringLength is due to an increase in the running time of everything (including query evaluation). The number of queries (i.e. number of fitness evaluations), however, stays the same. To learn how a genetic algorithm parlays implicit concurrency into a general-purpose global search heuristic called hyperclimbing, read my FOGA 2013 paper Explaining Optimization in Genetic Algorithms with Uniform Crossover (slides here).

[A request: Please help me get the word out about Implicit Concurrency. Tweet/retweet, blog/reblog, like, email, and mention on mailing lists. As always, if you have questions/comments, holler]

Implicit Concurrency in Genetic Algorithms

FOGA 2013 Slides

Optimization by Genetic Algorithms with uniform crossover is one of the deep mysteries of Evolutionary Computation. At first glance, the efficacy of uniform crossover, an extremely disruptive form of variation, makes no sense. Yet, genetic algorithms with uniform crossover often outperform genetic algorithms with variation operators that induce tight linkage between genetic loci.

At the Foundations of Genetic Algorithms Conference XII (January 16-20, Adelaide), I proposed an explanation for the efficacy of uniform crossover. The hypothesis presented posits that genetic algorithms with uniform crossover implement an intuitive, general purpose, global optimization heuristic called Hyperclimbing extraordinarily efficiently. The final version of the paper is available here. A generalization of the Hyperclimbing Hypothesis that explains optimization by genetic algorithms with less disruptive forms of crossover appears in my dissertation.

The presentation contains several animations, such as the one below, that serve as proof of concept for the Hyperclimbing Hypothesis. To view the animations, click on the “Watch On YouTube” links in the slides above or just download the presentation (93 Mb) and view it in Adobe Acrobat Reader. Note: the pdf reader in Chrome will not display the animations.

Download SpeedyGApy to run the experiments yourself (easy, and highly recommended).

FOGA 2013 Slides