Thoughts on the ARC Benchmark

Notes on the ARC benchmark, my experience at the competition, and how it can become a long-term benchmark in AI - Twitter Thread

The ARC Benchmark

Around the end of 2019, François Chollet released the Abstraction and Reasoning Corpus (ARC) benchmark, along with the paper On the Measure of Intelligence. And in the paper, he made a powerful claim: that deep learning cannot solve the ARC benchmark.

On the surface, the tasks in the benchmark are simple. You get a few example images for a task that applies some simple logic to the image. And then you’re asked to generate images for one or more test images that follow the same logic. Pretty much like a question format used in common IQ tests. Here are some examples:

Example ARC Tasks
Two tasks from the train set. My descriptions of the tasks: Left ‘Make the structure symmetrical, add the added part red’, Right: ‘Fill the smallest box with magenta, the next with orange, and the largest with sky blue.’

The paper makes the claim that Deep Learning is “conceptually similar to a locality-sensitive hashtable”, and cannot generalize beyond the kind of data it is trained on. This was pretty new to me, I was a noob in the ML space, but it really piqued my interest. The paper describes in detail the kinds of generalizations, and how to measure them; and why deep learning cannot generalize to these higher forms of generalization (such as what will be required to solve ARC). I highly recommend going through the paper, in this blog I’ll only focus on the ARC benchmark.

What concepts does the benchmark use to generate the tasks? - The goal is to encode priors that are common to humans, that’s no simple goal, and is vaguely understood. To quote the paper “What is the exact list of knowledge priors that humans are born with? This is the question that the developmental science theory of Core Knowledge seeks to answer.” These are the broad categories of priors that are used Objectness and elementary physics - A red circle on a black background, and a line at the edge, we imagine a sphere hitting a wall.

Agentness and goal-directedness - A blue dot in an orange maze, our immediate intuition is the blue dot is an agent.

Natural numbers and elementary arithmetic - Counting, adding, subtracting, we naturally do it without being taught.

Elementary geometry and topology - Rotating a square by 45 degrees makes a diamond, we’re easily able to visualize this.

Why can’t deep learning solve ARC?

Chollet’s argument is that deep learning is just pattern matching on low dimensional data that is its training dataset. But wait, aren’t images “high dimensional”, and you can’t really interpolate high dimensional data? Yann LeCun certainly says so. So who is right? Well, they’re kind of talking about different things. Yes, images are high dimensional, if you consider the individual pixels as independent variables. But for a dataset that represents a concept, e.g “natural images”, pixels are not independent. Its pretty intuitive to understand that pixel values near each other on the same object will be close in value. But an extended idea on this is that there exists a low dimensional manifold in the high dimensional euclidian space that encompasses the entire dataset. This is known as the manifold hypothesis, here’s my favorite video that explains the manifold hypothesis. Note that this doesn’t make the manifold hypothesis necessarily correct, for an extended discussion check out this video from the Machine Learning Street Talk podcast. This debate is long ongoing, here’s a twitter query to put you down a rabbit hole. 😉

The point is, to solve the tasks on the ARC Benchmark, deep learning cannot leverage pattern matching. The tasks are hand designed and independent, not generated by some underlying code.

In a later interview, Chollet shares that coming up with these tasks is hard and takes a lot of time. The benchmark is limited to 300 tasks total. 100 for “training”, 100 for “testing”, and 100 secret tasks for evaluation. There’s no need follow the split as such, though the training tasks are somewhat easier.

Challenge Design

The competition was designed as many others on Kaggle. You get a dataset to train on, and you’re expected to submit code that works on the evaluation dataset, without ever seeing the evaluation data. Obviously, there’s a time limit to run your code. Of course, ARC is not a standard machine learning benchmark. There are only 200 tasks, the only thing common between them is the human like priors described in the above section.

How does one approach such a competition? - ARC is pretty different from most other competitions. While IQ test benchmarks are not uncommon, most formulate it as a classification problem. ARC requires you to generate the exact answer, which in turn means you need not only need a fuzzy understanding of concepts (something one might hope Deep learning can do, though manifold hypothesis suggests it can’t); and also have a module that produces exact results. For the latter, an AI that generates programs is a good approach. So you would need to combine fuzzy heuristics which does the “abstraction and reasoning”, with a program generation process. This makes the challenge exceedingly difficult in my opinion, but that’s what makes it fun!

Here’s Chollet’s recommendation on how to work on the competition.

One major flaw

The challenge had one major flaw in my opinion. There was no private leaderboard. All the 100 evaluation tasks were on the public leaderboard. Organizers claimed that they’re confident no one can beat even 20 of the 100 tasks (Spoiler: the top solution beat 21). A lot of great engineering ideas were implemented, and the shared solutions are a goldmine of ideas.

Nevertheless, not having a private leaderboard meant overfitting to the public leaderboard. It becomes a guessing game of what the tasks could be, and incorporate code to run that idea. Even if you don’t explicitly probe the tasks, say you add some code and it doesn’t move the scores, you’re likely to remove it (remember there’s the time limit to run your code). This defeats the purpose of generalization, in the sense that no new major solutions were found, just a lot of engineering to incorporate priors and do fast program search.

Ideas shared by participants

As the competition progressed, participants shared a wide variety of ideas they were looking into. Decision trees, genetic algorithms, vision models, sequence to sequence models, and cellular automata to name a few. None of them could be used out of the box per se, the priors that the tasks used had to be hand-coded somehow.

One thing was common, participants often created their own domain-specific language (DSL) for the tasks, based on the priors. For example, one had to find “objects” in the images, possibly downscale/upscale them, rotate them, etc. See the below figure for an example task. All of these were specific to the ARC benchmark, and everyone set out to make high quality implementations of these priors.

Humans would likely think of this as a two ‘step’ task: 1. Cut out the object 2. Repeat it horizontally, one time.

Even to my newbie’s brain, there seemed to be some critical problems with all these approaches. First, the DSLs were catered for the ARC dataset, not really generalized, they reminded me of feature engineering for a specific task/dataset. In that sense, a 3 month competition is too short for such a benchmark, where you need to design your DSL, try different algorithms to use it, and make it efficient to meet the compute limits, while each step depends on the others.

Second, it wasn’t clear that using a DSL was even the right approach. Yes, Chollet had suggested it so everyone was trying to make one (myself included), but if we’re trying to generate programs that solve for general tasks, would a DSL really be the right approach? When humans approach the problems, the priors are used, but often we come up with some one-off operation that solves the tasks combined with common functions in the priors (see example below). I suppose one could incorporate these one-off operations into the DSL. But keep adding too many and you’d probably have been better off using a general language rather than DSL.

My description of the steps: 1. Select blue objects 2. Add orange squares to the NESW squares of the blue objects 3. Select red objects 4. Add yellow squares to the diagonal squares of the red objects. To me steps 2 and 4 feel really ‘one-off’, not quite something I’d include in a DSL. Sure we can break it down further to fit some other general DSL, but nothing super simple.

Top solutions

TL;DR - Everyone had a bunch of hand-crafted DSLs, and optimized code for fast search.

The amount of effort the top participants put is truly awe-inspiring! The only thing that slightly disappointed me was the number of DSLs that have so much in common, but were designed by everyone independently, and with custom fast software. In a sense, the competition strongly incentivized this due to the compute limit.

Some solutions used hand crafted heuristics to filter their search process. Some hand coded extra tasks to improve their DSL evaluation.

The top DSLs had these common themes:

  • Cut parts from input by color/shape/location/etc

  • Recolor parts

  • Check symmetries

  • Moving and replicating objects

And many others that I’m leaving out for brevity.

Interestingly, nearly everyone converged to the same “human like operations”. It begs the question, is that the correct approach to solving ARC, and Abstraction and Reasoning as a problem in general?

First place solution - Worth the read if you’re interested.

Collection of all top solutions

Interestingly, and perhaps obvious in hindsight, GPT-3, which came out later that year, completely failed the ARC tasks.

Ideas I tried

DSL and Tree search - Like most others, I also tried to create a DSL that catered to the ARC tasks. I wouldn’t say it was anything out of the ordinary, I had organized it in a way that functions operate on scalars, images, list of scalars, and list of images. For example, one function would take an image and split it into multiple images based on color, another would find the area of each object in the image, and so on. I performed simple Depth First Search on this DSL, discarding nodes based on some heuristics like absence of different colors in the outputs. You can check the full code here. The code solved 20 training tasks, 8 test tasks, and only 1 evaluation task.😂

Here’s and example DSL function:

@register("Image", "ImageList")
def split_by_color(img, crop=True):
    """ Input an image, splits image list of images by different colors"""
    if img.max() == 0:
        return None
    cols = np.unique(img[img > 0])
    outs = []
    for c in cols:
        colimg = np.zeros_like(img)
        colimg[img == c] = img[img == c]
    if crop:
        return _composite_imagelist_to_imagelist(outs, _crop_nonzero)
        return outs

Yes writing in C++ would make the code run much faster, however, the development time would be much higher and at this point, I was just trying to get an understanding of what kind of DSL is needed. As you can see the top solution, he truly put an insane amount of engineering effort into the competition, hats off.

Training a classifier on “concepts”: At this point I’m thinking, okay so I can’t write such a huge DSL right now, and Chollet claims deep learning can’t do anything here. But its still the most powerful method I know of, so how can I try and leverage it, if at all. Can I prune my tree search with it somehow? What could I train a DL model on ARC for?

If DL can do anything well, I thought, its probably going to be some vaguely specified concepts where coming up with rules is not straightforward. So I came up with a list of concepts which I annotated the train tasks for. These are quite similar to the priors mentioned in the paper but somewhat more specific. The idea was to train a classifier on these tasks, which can then help to prune the DFS.

  • Background Color

  • Object Splitting

  • Rotational Symmetry

  • Translational Symmetry

  • Counting

  • Drawing Lines

  • Movement

  • Multi Symbol

  • Color Structure

  • Size Change

  • Repeating

  • Swap Colors

  • Being Inside of

At first, I didn’t expect the model to perform too well, but with some augmentations (color swapping and rotations), surprisingly the model actually got a good accuracy on most of the concepts. Of course, this was just on multiple folds being split and the dataset is tiny, so I can’t say with confidence how good it actually was. The predictions on the test set seemed reasonable.

Sadly, this was the middle of the pandemic, right near the beginning, and I wasn’t able to test this idea further and had to stop working on the competition midway due to factors beyond my control.

Distributed ARC

The main problem with the ARC to stay as a long term benchmark seems to be that the tasks are limited. It’s easy to overfit to the evaluation tasks and call the benchmark solved. Even if you don’t see the evaluation tasks, a 100 is not that many. But as Chollet had pointed out, its difficult for one person to come up with these tasks by hand. So how can we make ARC a long term benchmark? We use the power of crowd sourcing! I had this vague idea towards the end of 2020 but never tried to concretely formulate it. Then in 2021, Chollet discussed his idea of crowdsourcing at a high level in his interview at Machine Learning Street Talk. Not much has happened about it though, or none that I know of.

I’ve tried to outline my thoughts on what the benchmark can grow into using crowdsourcing. I call it Distributed ARC.

Here’s my wishlist for a crowdsourced ARC benchmark

  1. The collection of tasks should be large (> 10,000). To avoid overfitting.

  2. Anyone should be able to submit new tasks, versioning can be done each year to improve datasets.

  3. Tasks should be accompanied by code to solve the task (but not necessarily made public).

  4. Optionally, priors used should be annotated, for some future direction of incorporating deep learning into possible solutions.

Managing such an evolving dataset of course needs a strong support structure, and possibly some initial incentive for people to create new tasks. But I do have confidence in human creativity and curiosity to create a much more diverse benchmark than what is currently available.

Resources and references