Wednesday, April 11, 2012

Pattern Matching


A friend gave me a desk calendar with daily puzzles. Usually the numerical ones are easy, but last Friday the one above came up, and I couldn't immediately solve it. If you want to try to solve it yourself, you may not want to read any further until you're done with it.

I assumed the 'game' in the puzzle was to use the four numbers in the corners to mathematically derive the one in the middle, although there was the possibility that there was a non-numerical 'trick' answer that relied on the way numerals are spelled or something. I stuck with it for a while, trying simple calculations, but couldn't find a suitable pattern that worked for all three examples. Then I remembered reading about Eureqa, from Cornell Creative Machines Lab--a genetic pattern finder that should be able to chew this problem up in no time. It's also free.

I labeled the input values x1, x2, x3, and x4 for top left, top right, bottom left, and bottom right, respectively, and the center (output) number became y. It's easy to enter this in the program like a spreadsheet:


The y column was originally 4, 5, 3, which are the numbers in the puzzle. The 8, 9, 3 that appear in the image are explained below.

I skipped the "Prepare Data" tab because I didn't need to smooth data or fill in missing values, etc. Things started to get interesting with the "Prepare Data" tab. The program guessed my target expression correctly. And here I made a mistake. You get to decide what sorts of operators the solution is allowed to use. This is a meta-problem, where you have to think like the test designer. Is it likely that the solution uses the hyperbolic tangent function? Probably not. So I picked the operations one learns about in grade school: addition, subtraction, multiplication, and division. Part of the screen is shown below.


Notice that I left "Constant" and "Integer Constant" unchecked, and hence unavailable for the program to use in a solution. I reasoned that the puzzle designer would not have included an arbitrary number, as in y=(x1+x2-x3-x4+15). It seemed inelegant and made the problem space much bigger to include even integer constants. But this was a case of premature optimization, as we will see. I chose the "Minimize the absolute error" metric of success and started the search.

This was my first time using the interface, but it only took a moment to figure out what I was looking at. The "fitness" is the error function, which is optimized at zero, meaning no errors in predicting y. The formula that it found immediately was y=x4. If you look at the puzzle again, you'll notice that the lower right number is the same as the center one. This is the simplest pattern that matches, and the automatic searcher has a bias for low-complexity (i.e. small) formulas. I was sure the problem designer intended all four input numbers to be used, so this couldn't be it.

So I tried obfuscation to trick the solver into using all four inputs. I did this by transforming the problem by creating a function z, and asking the solver to find a formula f so that f(x1,x2,x3,x4) = y + z(x1,x2,x3,x4). The idea was that this would rule out the simple y = x4 solution, and force the solver to look for answers that  used all four inputs. Then I could subtract off the z part and have my solution. For example, one function I tried was z = 2x1 - x2 - x3*x4.The input screenshot above shows one of these attempts, with the y column transformed in this way.

This yielded results. After using two different z functions, I got two new solutions. Here they are:
f(x1,x2,x3,x4) = x2+x3-x3*x3
f(x1,x2,x3,x4) = -2x1+x2+2x3x4-x3*x3+x4
These both work, and you can check them by plugging in the puzzle inputs to see that the calculations equal the number in the middle. But the first one above doesn't use x4, and although the second one does, it seems too complex for a casual puzzle.

At this point I looked at the "official" answer. Here it is:
f(x1,x2,x3,x4) = (10*x1+x2)/(10*x3+x4)
As you can see, there are integer constants in the solution, which I had ruled out when I set up the problem. Although the formula looks opaque as I have written it, for a human it's natural to read two digits across as a single number, so that the 9 and 6 on the top of the first example becomes 96, which is formulized as 9*10 + 6. The bottom is 24. I had tried that trick myself, but not noticed that 96/24 = 4.

I was naturally annoyed at myself for having ruled out the possibility of finding the official solution by unchecking "Integer Constant," so tried again with constants enabled. This time I got
f(x1,x2,x3,x4) = 24-2*x1-x2-x4-x3*x4
This was after 6931 generations. It still isn't what I'm looking for. My transformation was making sure all the inputs were being used, but it was a crude hack that ended up making the solution more complex than it needed to be.

What I really needed was a way to modify either the target expression or the error metric in order so that an optimal solution has exactly one occurrence of each of x1 through x4. I couldn't figure out how to do that with the interface, and it may not be possible with this version of the software. So I tried another approach.

The puzzle is flawed in that it allows the y = x4 solution, so I created some more examples for the solver to chew on: 22/11 = 2 and 26/12 = 3. With the original puzzle values plus these two extras, the solver began to converge to solutions with smaller and smaller errors:


Unfortunately, integer-based problems like this just aren't amenable to gradient-based approaches. Unlike horseshoes and hand grenades, close isn't good enough. After seven minutes, the formulas were all variations on high powers of the inputs, overfitting the limited data to find a solution.

I tried 'seeding' the search with y=x1/x3 This is the same as y = (10*x1)/(10*x3), which is pretty close to the solution. After more than 20,000 generations that resulted in a perfect solution, but not the one I wanted:
y = x4 + -1/x4 + (x2 + x3 - x3*x3)/(x4*x4)
It did recognize the 'correct' solution when I fed that directly in. This is pictured below (third one down).


The 'fit' is the error, which is zero for the right answer. It immediately lost the integer constants, however, and went off looking for lower-complexity solutions, which do exist (see the ones listed earlier).

To be fair, Eureqa isn't really designed for solving this sort of problem. Finding patterns with real world data usually means accounting for noise and missing values, and looking for simple approximate relationships that might tell you something about underlying relationships. In this case, y = x4 is still a pretty good approximation for the examples I put in. It occurred to me that if this were real data, the solution I'd be looking for is y = x1/x3, which would normally be a reasonable approximation to the perfect solution. When I added a few more rows of example data, the solver found this immediately. So it works as intended, which is not necessarily for the purpose of solving arbitrary puzzles humans design for each other.

Update: There are a few comments about the above on /r/machinelearning, which you can find here.