Google FunSearch Shows a Path for LLMs to Improve Math Problem Solving Skills
Automated code iteration leads to novel ideas
Google DeepMind has released a new research paper showing how large language models (LLMs) can create new knowledge. Up to this point, LLMs have created false information in the form of confabulations (i.e., hallucinations) or synthesized known information. They have not been able to create new information or solutions that extend the corpus of human knowledge.
DeepMind researchers recently overcame that limitation by using an LLM to create new approaches to solving mathematical problems. A blog post by DeepMind elaborated on the research and technical approach, which involved iterative code writing and an “evaluator” that assesses the LLM’s responses, selects the best output, and feeds it back for further improvement.
FunSearch works by pairing a pre-trained LLM, whose goal is to provide creative solutions in the form of computer code, with an automated “evaluator”, which guards against hallucinations and incorrect ideas. By iterating back-and-forth between these two components, initial solutions “evolve” into new knowledge. The system searches for “functions” written in computer code; hence the name FunSearch.
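That back-and-forth loop can be sketched in heavily simplified form. In the toy version below, the `propose` function is a stand-in for the LLM (in FunSearch it is a code-completion model); the real system also maintains a population of programs rather than a single best candidate. Everything here, including the trivial scoring task, is illustrative only:

```python
import random

def evaluate(program_src):
    """Evaluator: run a candidate program and score it.
    Returns None (rejection) on any error, which is how
    hallucinated or broken code gets filtered out.
    Toy task: maximize f(x) = -(x - 3)**2 at the x that solve() returns."""
    try:
        namespace = {}
        exec(program_src, namespace)      # run the candidate code
        x = namespace["solve"]()          # candidate must define solve()
        return -(x - 3) ** 2              # higher is better
    except Exception:
        return None                       # evaluator rejects bad programs

def propose(best_src, rng):
    """Stand-in for the LLM: crudely mutate the best program seen so far.
    In FunSearch this step is a prompt to a code model, not a fixed rule."""
    current = int(best_src.split("return ")[1])
    delta = rng.choice([-1, 1])
    return f"def solve():\n    return {current + delta}"

def funsearch_loop(iterations=50, seed=0):
    """Iterate: propose a program, evaluate it, keep it if it improves."""
    rng = random.Random(seed)
    best_src = "def solve():\n    return 0"
    best_score = evaluate(best_src)
    for _ in range(iterations):
        candidate = propose(best_src, rng)
        score = evaluate(candidate)
        if score is not None and score > best_score:
            best_src, best_score = candidate, score
    return best_src, best_score
```

The key design point survives the simplification: the language model only needs to be creative, because the evaluator, not the model, decides what counts as correct.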
The Capset Problem
DeepMind addressed two mathematical problems in its FunSearch evaluation. First, it discovered new solutions to a problem that lies beyond brute-force computation.
We first address the cap set problem, an open challenge, which has vexed mathematicians in multiple research areas for decades…The problem consists of finding the largest set of points (called a cap set) in a high-dimensional grid, where no three points lie on a line…Brute-force computing approaches to this problem don’t work – the number of possibilities to consider quickly becomes greater than the number of atoms in the universe.
FunSearch generated solutions – in the form of programs – that in some settings discovered the largest cap sets ever found.
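To make the problem concrete: working over the grid $\{0,1,2\}^n$, three distinct points are collinear exactly when their coordinate-wise sum is 0 mod 3, so a cap set is a set with no such triple. The sketch below checks that property and runs a naive greedy construction for illustration; FunSearch does not build sets greedily like this, but instead evolves the priority function that decides which points to add:

```python
from itertools import combinations, product

def is_cap_set(points, n):
    """Check that no three distinct points in {0,1,2}^n are collinear.
    Over F_3, distinct a, b, c lie on a line exactly when
    (a + b + c) % 3 == 0 in every coordinate."""
    for a, b, c in combinations(points, 3):
        if all((a[i] + b[i] + c[i]) % 3 == 0 for i in range(n)):
            return False
    return True

def greedy_cap_set(n):
    """Naive illustration: scan points in lexicographic order and add
    each one that keeps the set a cap set. This exhaustive scan is
    exactly the kind of approach that stops scaling as n grows."""
    points = []
    for p in product(range(3), repeat=n):
        if is_cap_set(points + [p], n):
            points.append(p)
    return points
```

Even in dimension 2 (the familiar 3×3 grid) the largest cap set has only 4 points, and the search space grows as $3^n$ points, which is why brute force quickly becomes hopeless.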
The Bin Packing Problem
The second problem is related to optimization. Existing algorithms solve it reasonably well, but FunSearch found better answers to this complex mathematical challenge.
The “bin packing” problem looks at how to pack items of different sizes into the smallest number of bins. It sits at the core of many real-world problems, from loading containers with items to allocating compute jobs in data centers to minimize costs.
The online bin-packing problem is typically addressed using algorithmic rules-of-thumb (heuristics) based on human experience. But finding a set of rules for each specific situation, with differing sizes, timing, or capacity, can be challenging. Despite being very different from the cap set problem, setting up FunSearch for this problem was easy. FunSearch delivered an automatically tailored program (adapting to the specifics of the data) that outperformed established heuristics, using fewer bins to pack the same number of items.
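A classic example of such a rule-of-thumb is the first-fit heuristic, sketched below. FunSearch does not use this exact code; rather, it evolves the scoring rule that decides which bin an arriving item should go into, and first-fit is the kind of human-designed baseline it was measured against:

```python
def first_fit(items, capacity):
    """Online first-fit heuristic: place each arriving item into the
    first open bin with enough room, opening a new bin if none fits."""
    bins = []                     # each bin is a list of item sizes
    for item in items:
        for b in bins:
            if sum(b) + item <= capacity:
                b.append(item)    # reuse the first bin that fits
                break
        else:
            bins.append([item])   # no bin fits: open a new one
    return bins
```

Because items arrive one at a time and cannot be repacked, a small change in the placement rule can change how many bins are ultimately needed, which is the gap FunSearch’s evolved heuristics exploit.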
The Significance
Technology Review contrasted FunSearch with earlier DeepMind developments, which also made significant advances in math but faced meaningful limitations.
AlphaTensor found a way to speed up a calculation at the heart of many different kinds of code, beating a 50-year record. Then AlphaDev found ways to make key algorithms used trillions of times a day run faster.
Yet those tools did not use large language models. Built on top of DeepMind’s game-playing AI AlphaZero, both solved math problems by treating them as if they were puzzles in Go or chess. The trouble is that they are stuck in their lanes, says Bernardino Romera-Paredes, a researcher at the company who worked on both AlphaTensor and FunSearch: “AlphaTensor is great at matrix multiplication, but basically nothing else.”
FunSearch takes a different tack. It combines a large language model called Codey, a version of Google’s PaLM 2 that is fine-tuned on computer code, with other systems that reject incorrect or nonsensical answers and plug good ones back in.
…
A key advantage that FunSearch has over AlphaTensor is that it can, in theory, be used to find solutions to a wide range of problems. That’s because it produces code—a recipe for generating the solution, rather than the solution itself. Different code will solve different problems.
Renato Vincente, associate professor of applied mathematics at the University of São Paulo and a Data and AI partner at WillowTree, told Synthedia, “This is expected. Mathematics is a language and not an empirical science. It is a language that is consistent. If you train a model with mathematical knowledge or programming, you expect it to solve problems.”
In addition, the programming-centric strategy will be a strength for addressing some problems but will not likely lead to new knowledge outside the computational domain. The immediate applications appear to be related to problems that today rely on heuristics or require impractical computing resources to solve.
The question of FunSearch’s significance will ultimately hinge on the value it delivers. If it answers hard math problems, that will be an important contribution, even if it is an expected result. If it develops meaningful new knowledge in science and math, then it may be even more significant.
FunSearch shows early promise in the former, but its impact in the latter is less certain. That said, LLMs optimized to serve as coding copilots may turn out to be less impactful than LLMs optimized to generate new software that is better than what already exists.