Hypomixability Theory

TL;DR: Computer scientists can say why the fast fourier transform is fast and why quicksort is quick. Yet when it comes to explaining what’s efficient about the computation that yielded ourselves, we currently draw a blank. Hypomixability theory provides the basis for an answer.

Run a population based recombinative evolutionary algorithm on a real world fitness function; until adaptation tapers off, you will find that the fitness of an offspring is, on average, lower than the mean fitness of its parents. You will find, in other words, that the average fitness of the population goes down after mixing (recombination) and up after selection. Selection, of course, makes sense—it increases average fitness. But why mixing? What sense does it make to recombine the genes of individuals with above average fitness, especially on epistatic landscapes where recombination is virtually guaranteed to lower average fitness? What computational purpose does mixing serve?

Hypomixability Theory explains why mixing makes computational sense and illuminates how mixing induced reductions in average fitness (the hypo in hypomixability) between parent and offspring populations can be indicative of room for adaptation. The theory is very parsimonious in its assumptions. It does not, for example, assume the presence of building blocks in the fitness landscape. Hypomixability Theory also comes with a prediction and proof of efficient computational learning in a recombinative evolutionary algorithm, making it the first, and currently only, computationally credible explanation for the “endless forms most beautiful and wonderful” that comprise the the sexually reproducing part of the biosphere. Our remarkable species included.

Paper: Efficient Hypomixability Elimination in Recombinative Evolutionary Systems

Hypomixability Theory

Explaining Sex, Adaptation, Speciation, and the Emergence of Genetic Modularity

To appear in the proceedings of the Foundations of Genetic Algorithms Conference, 2015

Hypomixability Elimination in Evolutionary Systems

Abstract: Hypomixability Elimination is an intriguing form of computation thought to underlie general-purpose, non-local, noise-tolerant adaptation in recombinative evolutionary systems. We demonstrate that hypomixability elimination in recombinative evolutionary systems can be efficient by using it to obtain optimal bounds on the time and queries required to solve a subclass (k=7, \eta=1/5) of a familiar computational learning problem: PAC-learning parities with noisy membership queries; where k is the number of relevant attributes and \eta is the oracle’s noise rate. Specifically, we show that a simple genetic algorithm with uniform crossover (free recombination) that treats the noisy membership query oracle as a fitness function can be rigged to PAC-learn the relevant variables in O(\log (n/\delta)) queries and O(n \log (n/\delta)) time, where n is the total number of attributes and \delta is the probability of error. To the best of our knowledge, this is the first time optimally efficient computation has been shown to occur in an evolutionary algorithm on a non-trivial problem.

The optimality result and indeed the implicit implementation of hypomixability elimination by a simple genetic algorithm depends crucially on recombination. This dependence yields a fresh, unified explanation for sex, adaptation, speciation, and the emergence of modularity in evolutionary systems. Compared to other explanations, Hypomixability Theory is exceedingly parsimonious. For example, it does not assume deleterious mutation, a changing fitness landscape, or the existence of building blocks.

Explaining Sex, Adaptation, Speciation, and the Emergence of Genetic Modularity

Parallel Effect Search in Genetic Algorithms

The thing genetic algorithms do efficiently—Parallel Effect Search —is not difficult to describe or demonstrate. It’s been under our noses since schemata and schema partitions were defined as concepts, but for various reasons, which I won’t get into here, it’s not widely appreciated. This blog post provides a quick introduction. For a more in depth treatment check out  my FOGA 2013 paper Explaining Optimization in Genetic Algorithms with Uniform Crossover (slides here).

Schema Partitions and Effects

Let’s begin with a quick primer on schemata and schema partitions. Let S=\{0,1\}^\ell be a search space consisting of binary strings of length \ell. Let \mathcal I be some set of indices between 1 and \ell, i.e.  \mathcal I\subseteq\{1,\ldots,\ell\}. Then  \mathcal I represents a partition of S into 2^{|\mathcal I|} subsets called schemata (singular schema) as in the following example: suppose \ell=4, and \mathcal I=\{1,3\}, then \mathcal I partitions S into four schemata:

0*0* = \{0000, 0001, 0100, 0101\}

0*1* = \{0010, 0011, 0110, 0111\}

1*0* = \{1000, 1001, 1100, 1101\}

1*1* = \{1010, 1011, 1110, 1111\}

where the symbol * stands for ‘wildcard’. Partitions of this type are called schema partitions. As we’ve already seen, schemata can be expressed using templates, for example,  0*1*. The same goes for schema partitions. For example \#*\#* denotes the schema partition represented by the index set \mathcal I. Here the symbol \# stands for ‘defined bit’. The order of a schema partition is simply the cardinality of the index set that defines the partition (in our running example, it is |\mathcal I| = 2). Clearly, schema partitions of lower order are coarser than schema partitions of higher order.

Let us define the effect of a schema partition to be the variance of the average fitness values of the constituent schemata under sampling from the uniform distribution over each schema. So for example, the effect of the schema partition \#*\#*=\{0*0*\,,\, 0*1*\,,\, 1*0*\,,\, 1*1*\} is

\frac{1}{4}\sum\limits_{i=0}^1\sum\limits_{j=0}^1(F(i*j*)-F(****))^2

where the operator F gives the average fitness of a schema under sampling from the uniform distribution.

You’re now well poised to understand Parallel Effect Search. Before we get to a description, a brief detour to provide some motivation: We’re going to do a thought experiment in which we examine how effects change with the coarseness of schema partitions. Let [\mathcal I] denote the schema partition represented by some index set \mathcal I. Consider a search space S=\{0,1\}^\ell with \ell=10^6, and let \mathcal I =\{1,\ldots,10^6\}. Then [\mathcal I] is the finest possible partition of S; one where each schema in the partition has just one point. Consider what happens to the effect of [\mathcal I] as we start removing elements from \mathcal I. It should be relatively easy to see that the effect of [\mathcal I] decreases monotonically. Why? Because we’re averaging over points that used to be in separate partitions. Don’t proceed further until you convince yourself that coarsening a partition tends to decrease its effect.

Finally, observe that the number of schema partitions of order o is {\ell \choose o}. So for \ell = 10^6,  the number of schema partitions of order 2,3,4 and 5 are on the order of 10^{11}, 10^{17}, 10^{23}, and 10^{28} respectively. The take away from our thought experiment is this: while a search space may have vast numbers of coarse schema partitions, most of them will have negligible effects (due to averaging). In other words, while coarse schema partitions are numerous, ones with non-negligible effects are rare.

So what exactly does a genetic algorithm do efficiently? Using experiments and symmetry arguments I’ve demonstrated that a genetic algorithm with uniform crossover can sift through vast numbers of coarse schema partitions in parallel and identify partitions with non-negligible  effects. In other words, a genetic algorithm with uniform crossover can implicitly perform multitudes of effect/no-effect multifactor analyses and can efficiently identify interacting loci with non-negligible effects.

Let’s Play a Game

It’s actually quite easy to visualize a genetic algorithm as it identifies such loci. Let’s play a game. Consider a stochastic function that takes bitstrings of length 200 as input and returns an output that depends on the values of the bits of at just four indices. These four indices are fixed; they can be any one of the {\ell \choose 4} combinations of four indices between between 1 and 200. Given some bitstring, if the parity of the bits at these indices is 1 (i.e. if the sum of the four bits is odd) then the stochastic function returns a value drawn from the magenta distribution (see below). Otherwise, it returns a value drawn from the black distribution. The four indices are said to be pivotal. All other indices are said to be non-pivotal.

parityDistribs

As per the discussion in the first part of this post, the set of pivotal indices is the dual of a schema partition of order 4. Of all the schema partitions of order 4 or less, only this partition has a non-zero effect. All other schema partitions of order 4 or less have no effect. (Verify for yourself that this is true) In other words parity seems like a Needle in a Haystack (NIAH) problem—a problem, in other words, that requires brute force.

Now for the rules of the game: Say I give you query access to the stochastic function just described, but I do not tell you what four indices are pivotal. You are free to query the function with any bitstring 200 bits long as many times as you want. Your job is to recover the pivotal indices I picked, i.e. to identify the only schema partition of order 4 or less with a non-negligible effect.

Take a moment to think about how you would do it? What is the time and query complexity of your method?

What “Not Breaking a Sweat” Looks Like

The animation below shows what happens when a genetic algorithm with uniform crossover is applied to the stochastic function just described. Each dot displays the proportion of 1’s in the population at a locus. Note that it’s trivial to just “read off” the proportion of 0s at each locus. The four pivotal loci are marked by red dots. Of course, the genetic algorithm does not “know” that these loci are special. It only has query access to the stochastic function.

As the animation shows, after 500 generations you can simply “read off” the four loci I picked by examining the proportion of 1s to 0s in the population at each locus. You’ve just seen Parallel Effect Search in action. The chromosome size in this case is 200, so there are {200 \choose 4} possible combinations of four loci. From all of these possibilities, the genetic algorithm managed to identify the correct one within five hundred generations.

Let’s put the genetic algorithm through it’s paces. I’m going to tack on an additional 800 non-pivotal loci while leaving the indices of the four pivotal loci unchanged. Check out what happens:

[Note: more dots in the animation below does not mean a bigger population or more queries. More dots just means more loci under consideration. The population size and total number of queries remain the same]

So despite a quintupling in the number of bits, entailing an increase in the number of coarse schema partitions of order 4 to {1000 \choose 4}, the genetic algorithm solves the problem with no increase in the number of queries. Not bad. (Of course we’re talking about a single run of a stochastic process. And yes, it’s a representative run. See chapter 3 of my dissertation to get a sense for the shape of the underlying process)

Let’s take it up another notch, and increase the length of the bitstrings to 10,000. So now we’re looking at  {10000 \choose 4} \sim 10^{18} combinations of four loci. That’s on the order of a million trillion combinations. This time round, let’s also change the locations of the 4 pivotal loci. Will the genetic algorithm find them in 500 generations or less?

How’s that for not breaking a sweat? Don’t be deceived by the ease with which the genetic algorithm finds the answer. This is not an easy problem.

To run the experiments yourself download speedyGApy, and run it with

python speedyGA.py --fitnessFunction seap --bitstringLength <big number>

noting that the increase in “wall clock” time between generations as you increase bitstringLength is due to an increase in the running time of everything (including query evaluation). The number of queries (i.e. number of fitness evaluations), however, stays the same. To learn how a genetic algorithm parlays parallel effect search into a general-purpose global search heuristic called hyperclimbing, read my FOGA 2013 paper Explaining Optimization in Genetic Algorithms with Uniform Crossover (slides here).

Parallel Effect Search in Genetic Algorithms

FOGA 2013 Slides

Optimization by Genetic Algorithms with uniform crossover is one of the deep mysteries of Evolutionary Computation. At first glance, the efficacy of uniform crossover, an extremely disruptive form of variation, makes no sense. Yet, genetic algorithms with uniform crossover often outperform genetic algorithms with variation operators that induce tight linkage between genetic loci.

At the Foundations of Genetic Algorithms Conference XII (January 16-20, Adelaide), I proposed an explanation for the efficacy of uniform crossover. The hypothesis presented posits that genetic algorithms with uniform crossover implement an intuitive, general purpose, global optimization heuristic called Hyperclimbing extraordinarily efficiently. The final version of the paper is available here. A generalization of the Hyperclimbing Hypothesis that explains optimization by genetic algorithms with less disruptive forms of crossover appears in my dissertation.

The presentation contains several animations, such as the one below, that serve as proof of concept for the Hyperclimbing Hypothesis. To view the animations, click on the “Watch On YouTube” links in the slides above or just download the presentation (93 Mb) and view it in Adobe Acrobat Reader. Note: the pdf reader in Chrome will not display the animations.

Download SpeedyGApy to run the experiments yourself (easy, and highly recommended).

FOGA 2013 Slides

Generative Fixation

One of the mainstays of human engineering is the idea of hierarchical assembly—the assembly of useful low level modules (a.k.a. building blocks) into useful modules at higher levels. Artifacts ranging from nuclear submarines to enterprise software are all constructed in this fashion. The building block hypothesis holds that genetic algorithms also construct solutions using hierarchical assembly, and that the basic building blocks used by these algorithms are short chromosomal snippets that confer above average fitness. Tantalizing as this hypothesis may be, it is based on strong assumptions about the distribution of fitness over a search space. I’ve criticized these assumptions in my dissertation and have proposed a different hypothesis based on weaker assumptions. This alternate hypothesis is grounded in a different metaphor—generative fixation.

Though not as ubiquitously recognizable as hierarchical assembly, generative fixation underlies progress in many areas. Like in the video industry for instance, which only really took off once the VHS/Betamax war had run its course. The “fixation” of VHS within this industry in essence generated new opportunities for advancement. For example, it permitted the development of the video rental business, which brought films from the back rooms of studio houses into living rooms everywhere. Had the tussle between VHS and Betamax continued, many of these opportunities might not have presented themselves. The economics of supporting two formats simultaneously would have been economically crippling for the fledgling industry.

VHS v. Betamax was a case where two contestants competed for dominance along a single dimension—the format accepted by video players. Consider a scenario consisting of multiple dimensions and multiple contestants in each dimension. Suppose, moreover, that no contestant in any dimension is outright superior. That is, the superioriority/non-superiority of the contestants in each dimension is dependent on the state of the contests in the other dimensions. This scenario describes the most commonly arising situation in natural evolution. Here, the “dimensions” are genetic loci, and the “contestants” are alleles. (Statistically, the scenario just described is more likely than one where one or more locus has an allele that is superior regardless of the frequencies of the alleles at other loci.)

The danger in such cases is that progress will stall because no contestant will come to dominate its dimension. The generative fixation hypothesis holds that in a system undergoing recombinative evolutionary dynamics, progress will continue. Although no locus has an allele that is outright superior, a small number of  alleles belonging to different dimensions that play nice together (i.e. confer above average fitness as a group) will come to dominate their respective dimensions. In doing so, they will set the stage for the next multi-dimensional contest over the remaining dimensions. And so on.

I’ve demonstrated the genetic algorithm’s ability to scaleably identify and fix synergistic sets of unlinked non-competing alleles in a recent manuscript and in my dissertation.

Generative Fixation

More Debate, Please!

“… there are many issues in computing that inspire differing
opinions. We would be better off highlighting the differences
rather than pretending they do not exist”
–Moshe Y. Vardi

In an article entitled “More Debate, Please!”, in the January, 2010 issue of Communications of the ACM, Moshe Y. Vardi, editor-in-chief of Communications, writes:

‘Vigorous debate, I believe, exposes all sides of an issue—their strengths and weaknesses. It helps us reach more knowledgeable conclusions. To quote Benjamin Franklin: “When Truth and Error have fair play, the former is always an overmatch for the latter.”’[1]

Vardi goes on to say that as he solicited ideas for the 2008 relaunch of Communications, he was frequently told to keep controversial topics front and center. “Let blood spill over the pages of Communications,” a member of a focus group colorfully urged [1].

When attempting to publish my doctoral research in evolutionary computation journals, I found the sentiments expressed by Vardi to be in short supply. The reviewers seemed much more invested in not rocking the boat than in fostering a climate in which prevailing assumptions can be challenged, and alternate ideas expressed transparently. They seemed, in short, to be inured to the poverty of the field’s foundations, and, for the most part, had little tolerance for someone with a bone to pick with the status quo. “Fall in line, or have your work be rejected,” was the overarching message.

One way this unfortunate state of affairs may be addressed is through the institution of a forum like the Point/Counterpoint section introduced to Communications by Vardi in 2008—a forum where the various controversies that mark our field are periodically featured, and the different sides of each controversy given, as Benjamin Franklin put it, “fair play”. There are several contentious topics in EC. Tapped correctly, many of  these topics can be powerful vehicles for learning—not just about the workings of evolutionary algorithms, but, also, about the workings of a vibrant intellectual community. Right now, instead of vigorous, open, ongoing debates in the EC literature, uneasy truces prevail. The community, by and large, steps around the the really big points of contention. Researchers talk past each other to niche audiences. And, if my experience is anything to go by, new lines of criticism, and new modes of analysis are hastily dismissed.

In the absence of a written record of ongoing controversies, new entrants to the field will not have access to the various positions involved. Pressed for time, and confronting the reality of “publish or perish”, most will fall back on the opinions and practices of their advisors. It doesn’t take much to see that in an environment like this, opportunities for learning and advancement will frequently be missed.

A forum for open, ongoing, collegial debate would  bring awareness, and transparency to the controversies in our field. It would also (one hopes) inculcate a more welcoming attitude toward alternate approaches, conclusions, and critiques.

Two topics for debate:

EC Theory and First Hitting Time:  Is it problematic that so much contemporary theoretical  work in EC focuses on “first hitting time”, i.e., the number of fitness evaluations required to find a global optimum? Do we look at first hitting time only because there currently isn’t a well developed, and generally accepted theoretical framework for examining adaptation (the generation of fitter points over time)? If so, isn’t the study of first hitting time a lot  like the proverbial search for one’s house keys under the light of a street lamp just because it happens to be dark in one’s house?

The Building Block Hypothesis: Can the building block hypothesis be reconciled with the widely reported utility of uniform crossover? If yes, how? If no, can we—more to the point, should we—be comfortable with this knowledge given the considerable influence of the building block hypothesis on contemporary evolutionary computation research?

What other topics have been under-addressed in the evolutionary computation literature? Leave a comment with your opinion, or a link to your own blog post.

[1] Moshe Y. Vardi. More debate please!, In Communications of the ACM 53(1):5, 2010

J
More Debate, Please!

Hyperclimbing and Decimation

In recent years, probabilistic inference algorithms such as survey propagation and belief propagation have been shown to be remarkably effective at tackling large, random instances of SAT, and other combinatorial optimization problems that lie beyond the reach of previous approaches. These inference algorithms belong to a class of techniques called decimation strategies. Decimation strategies monotonically reduce the size of a problem instance by iteratively fixing partial solutions (partial variable assignments in the case of SAT).

The generative fixation hypothesis essentially states that genetic algorithms work by efficiently implementing a decimation strategy called hyperclimbing.

Hyperclimbing and Decimation

Hyperclimbing, Genetic Algorithms, and Machine Learning

I’ve identified a promising stochastic search heuristic, called hyperclimbing, for large-scale optimization over massive attribute product spaces (e.g., the set of all binary strings of some length N, where N is very large) with rugged fitness functions. Hyperclimbing works by progressively limiting sampling to a series of nested subsets with increasing expected fitness. At any given step, this heuristic sifts through vast numbers of coarse partitions of the subset it “inhabits”, and identifies ones that partition this set into subsets whose expected fitness values are significantly variegated. Because hyperclimbing is sensitive, not to the local features of a search space, but to certain more global statistics, it is not susceptible to the kinds of issues that waylay local search heuristics.

The chief barrier to the wide and enthusiastic use of hyperclimbing is that it seems to scale very poorly with the number of attributes. If one heeds the seemingly high cost of applying hyperclimbing to large search spaces, this heuristic quickly looses its shine. A key conclusion of my doctoral work is that this seemingly high cost is illusory. I have uncovered evidence that strongly suggests that genetic algorithms can implement hyperclimbing extraordinarily efficiently.

As readers of this blog probably know, genetic algorithms are search algorithms that mimic natural evolution. These algorithms have been used in a wide range of engineering and scientific fields to quickly procure useful solutions to poorly understood (i.e. black-box) optimization problems. Unfortunately, despite the routine use of genetic algorithms for over three decades, their adaptive capacity has not been adequately accounted for. Given the evidence that genetic algorithms can implement efficient hyperclimbing, I’ve proposed a new explanation for the adaptive capacity of these algorithms. This new account—the generative fixation hypothesis—promises to spark significant advances in the fields of genetic algorithmics and discrete optimization.

The discovery that hyperclimbing is efficiently implementable also promises to have a non-negligible impact on the ecology of machine learning research. Optimization and machine learning are, after all, intimately related. Overlooking a few exceptions, the practice of machine learning research, can be characterized as the effective reduction of difficult learning problems to optimization problems for which efficient algorithms exist. In other words, the machine learning problems that can effectively be tackled are in large part those that can in practice be reduced to optimization problems that can be tackled efficiently. Currently, this largely limits the class of tractable machine learning problems to the class of learning problems that can in practice be reduced to convex optimization problems [1] . The identification of general-purpose non-convex optimization heuristics with efficient implementations (e.g. hyperclimbing), thus, has the potential to significantly extend the reach of machine learning.

For a description of hyperclimbing, and evidence that genetic algorithms can implement this heuristic efficiently, please see my dissertation

[1]  Kristin P. Bennett and Emilio Parrado-Hernandez. The interplay of optimization and machine  learning research. Journal of Machine Learning Research, 7:1265–1281, 2006.

Hyperclimbing, Genetic Algorithms, and Machine Learning