The International Mathematical Olympiad (IMO) is probably the most prestigious mathematics competition for preuniversity students. Every year students from all over the world compete for its coveted bronze, silver and gold medals (112 countries took part in 2023). In a new twist, AI programs could soon be competing with them, too.
A team led by Trieu H. Trinh of Google DeepMind and New York University unveiled a new AI program called AlphaGeometry in the journal Nature on January 17. The researchers reported that the program solved 25 out of 30 geometry problems from past IMOs, a success rate comparable to that of the competition's gold medalists. The AI also found a more general solution to a problem from the 2004 IMO that had previously escaped the attention of experts.
“Mathematical Olympiads are the most reputed theorem-proving competitions in the world,” wrote Trinh and his colleagues in their recent publication. Over two days, each competing student has to solve a total of six problems from different mathematical domains. Some of the problems are so hard that even experts cannot solve them. The solutions are usually short and elegant, but finding them demands a great deal of creativity. That makes the problems particularly interesting for AI research, which aims to develop systems with creative abilities. Until now, even large language models (LLMs) such as OpenAI’s GPT-4 have failed at such tasks.
One of the reasons why AI programs have so far not met with success is a lack of suitable training data. LLMs such as GPT-4 are trained with tens of gigabytes of text files, which correspond to around 20 million letter-size pages filled with words. A theorem-proving program, by contrast, needs proofs written in a formal language that computers can check—the language Lean is one example—and translating a human-written proof into such a language requires a great deal of work. This is particularly difficult in the field of geometry, where proofs are usually even harder to formalize so that a solution can be computed.
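To give a sense of what such a formal language looks like, here is a tiny Lean 4 example. It is for illustration only (the theorem name add_commutes is made up for this sketch, and the statement is far simpler than any olympiad geometry problem): even a one-line fact about whole numbers must be written in exact, machine-checkable syntax before a computer can verify its proof.

```lean
-- Illustration only: commutativity of addition on the natural numbers,
-- stated and proved in Lean 4 by appealing to the library lemma Nat.add_comm.
-- Encoding an olympiad geometry proof in this style takes vastly more work.
theorem add_commutes (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```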
Although there are formal languages developed specifically for geometry, they make little use of methods from other areas of mathematics. So if a proof requires an intermediate step from another domain—working with complex numbers, which involve square roots of negative values, for example—these specialized geometry languages cannot be used.
This is why Trinh and his colleagues created a dataset that doesn’t require the translation of human-generated proofs into a formal language. To do this, they first had an algorithm generate a set of geometric “premises,” or starting points: for example, a triangle with its altitudes drawn in and additional points marked along its sides. The researchers then used a deductive algorithm to infer further properties of the triangle, such as which angles were equal and which lines were perpendicular to one another. Such programs have been available for several decades: they use predefined geometric and algebraic rules to make statements about geometric objects.
By combining the premises with the derived properties, the researchers were able to create a suitable training dataset for their AI, consisting of theorems and corresponding proofs. For example, a problem could involve proving a certain characteristic of a triangle—say, that two angles were equal. The corresponding solution would then consist of the steps the deductive algorithm took to reach that conclusion. In this way, Trinh and his colleagues generated a dataset of more than 100 million problems and corresponding proofs.
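The pipeline can be pictured with a short sketch. The Python code below is a toy, hypothetical illustration of the idea, not DeepMind's implementation: random premises about a handful of lines are fed to a small rule-based deduction engine, and every fact the engine derives becomes a "theorem" whose "proof" is simply the recorded chain of rule applications that produced it. The relations ("parallel", "perp") and the three rules are stand-ins for the far richer geometric rules used in practice.

```python
# Toy sketch: generate (theorem, proof) training examples from random premises.
import random

LINES = ["a", "b", "c", "d", "e"]

def sample_premises(k=4, seed=None):
    """Randomly assert k parallel/perpendicular relations between named lines."""
    rng = random.Random(seed)
    premises = set()
    while len(premises) < k:
        rel = rng.choice(["parallel", "perp"])
        x, y = rng.sample(LINES, 2)
        premises.add((rel, x, y))
    return premises

def deduce(facts):
    """Forward-chain a few toy rules to closure; map each fact to (rule, parents)."""
    proofs = {f: ("premise", ()) for f in facts}
    changed = True
    while changed:
        changed = False
        known = list(proofs)
        for (rel, x, y) in known:
            candidates = [((rel, y, x), rel + "_symmetry", ((rel, x, y),))]
            for (rel2, y2, z) in known:
                if y2 != y or z == x:
                    continue
                if rel == "parallel" and rel2 == "parallel":
                    candidates.append((("parallel", x, z), "parallel_transitivity",
                                       ((rel, x, y), (rel2, y2, z))))
                if rel == "perp" and rel2 == "parallel":
                    candidates.append((("perp", x, z), "perp_of_parallel",
                                       ((rel, x, y), (rel2, y2, z))))
            for fact, rule, parents in candidates:
                if fact not in proofs:
                    proofs[fact] = (rule, parents)
                    changed = True
    return proofs

def trace(fact, proofs):
    """Unwind a derived fact back to the premises: this chain is the 'proof'."""
    rule, parents = proofs[fact]
    steps = [step for parent in parents for step in trace(parent, proofs)]
    steps.append((fact, rule))
    return steps

if __name__ == "__main__":
    premises = sample_premises(seed=0)
    proofs = deduce(premises)
    derived = [f for f in proofs if proofs[f][0] != "premise"]
    if not derived:
        raise SystemExit("No new facts derived; try another seed.")
    theorem = derived[-1]  # any derived fact can serve as a training "theorem"
    print("Premises:", premises)
    print("Theorem: ", theorem)
    for fact, rule in trace(theorem, proofs):
        print("   ", rule, "=>", fact)
```

Scaled up to full geometric constructions and hundreds of deduction rules, this kind of machinery can churn out training examples without any human-written proofs.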
Such methods are not sufficient to cope with theorems at the level of an IMO, however. The problems encountered there usually require more than the ability to make simple deductions. “To solve Olympiad-level problems..., the key missing piece is generating new proof terms,” Trinh and his team wrote in their paper. For example, to prove something about a triangle, it is sometimes necessary to introduce new points and lines that were not mentioned in the problem. It is precisely this task of introducing the auxiliary objects (new points and lines) needed to crack a proof that an LLM similar to GPT-4 is well suited to take on.
Large language models generate text by estimating, one word at a time, which word is most likely to follow the words that came before. Because of the enormous amounts of data with which they are trained, AI programs such as ChatGPT can often respond to questions in a meaningful way and even conduct humanlike dialogue. Trinh and his team used their specially created database to train AlphaGeometry on theorems and proofs in a similar way. But the LLM does not learn the deductive steps of a solution; that work is still done by specialized deductive algorithms. The AI model instead concentrates on proposing points, lines and other useful auxiliary objects.
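The "one word at a time" prediction at the heart of such models can be illustrated with a deliberately tiny sketch. The context and the probabilities below are invented for this example; a real LLM computes such probabilities with a neural network over a vocabulary of tens of thousands of tokens.

```python
# Toy illustration of next-word prediction (the probabilities are invented).
import random

# Hypothetical probabilities for the word that follows "the two angles are ...".
next_word_probs = {"equal": 0.6, "congruent": 0.2, "supplementary": 0.15, "blue": 0.05}

def sample_next_word(probs):
    """Draw one continuation at random, weighted by the given probabilities."""
    words, weights = zip(*probs.items())
    return random.choices(words, weights=weights, k=1)[0]

print(sample_next_word(next_word_probs))  # usually prints "equal"
```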
When AlphaGeometry is given a problem, the deductive algorithm first derives various properties from what is described. If the statement to be proven is not among them, the LLM proposes auxiliary objects. Drawing on its training, it can decide, for example, to add a fourth point X to a triangle ABC so that ABCX forms a parallelogram. This gives the deductive algorithm new information from which it can derive further properties of the geometric object. This back-and-forth is repeated until the AI and the deductive program reach the desired conclusion. “The method sounds plausible and in some ways similar to the training of participants in the International Mathematical Olympiad,” says Fields Medalist Peter Scholze, who won gold at the IMO three times.
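The alternation between the two components can be sketched roughly in code. The outline below is hypothetical, not DeepMind's implementation: the function solve() takes stand-in functions for the deduction engine and for the model that proposes constructions, and it loops between the two until the goal statement appears among the known facts. All names and the toy demo at the bottom are invented for illustration.

```python
# Hypothetical outline of AlphaGeometry-style alternation between a symbolic
# deduction engine and a model that proposes auxiliary constructions.
from typing import Callable, List, Set, Tuple

Fact = Tuple[str, ...]      # e.g. ("parallel", "AB", "XC")
Construction = str          # e.g. "add point X so that ABCX is a parallelogram"

def solve(premises: Set[Fact],
          goal: Fact,
          deduce: Callable[[Set[Fact]], Set[Fact]],
          propose_construction: Callable[[Set[Fact], Fact], Tuple[Construction, Set[Fact]]],
          max_constructions: int = 10) -> List[str]:
    """Alternate between symbolic deduction and model-suggested constructions."""
    facts = set(premises)
    log: List[str] = []
    for _ in range(max_constructions + 1):
        facts |= deduce(facts)              # let the deduction engine run to closure
        if goal in facts:
            log.append(f"goal reached: {goal}")
            return log
        # Stuck: ask the (stand-in) language model for an auxiliary object.
        construction, new_facts = propose_construction(facts, goal)
        log.append(f"construction: {construction}")
        facts |= new_facts
    raise RuntimeError("no proof found within the construction budget")

if __name__ == "__main__":
    # Toy stand-ins: the "engine" knows a single rule and the "model" always
    # suggests the same point, just to show the control flow.
    def toy_deduce(facts):
        derived = set()
        if ("parallelogram", "A", "B", "C", "X") in facts:
            derived.add(("parallel", "AB", "XC"))
        return derived

    def toy_model(facts, goal):
        return ("add point X so that ABCX is a parallelogram",
                {("parallelogram", "A", "B", "C", "X")})

    steps = solve(premises={("triangle", "A", "B", "C")},
                  goal=("parallel", "AB", "XC"),
                  deduce=toy_deduce,
                  propose_construction=toy_model)
    print("\n".join(steps))
```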
To test AlphaGeometry, the scientists selected 30 geometry problems that have appeared in the IMO since 2000. Wu’s algorithm, the standard method previously used to solve such problems, managed to solve only 10 of them correctly, and GPT-4 failed on all of them; AlphaGeometry solved 25. According to the researchers, the AI thereby outperformed most IMO participants, who solved an average of 15.2 of the 30 problems. Gold medal winners, on the other hand, solved an average of 25.9 of them correctly. Humans who compete in the IMO do not just solve geometry problems, however: the competition also covers other areas of mathematics such as algebra, number theory and combinatorics.
When the researchers looked through the AI-generated proofs, they noticed that the program had not used all the information provided in a problem from 2004. Because its argument did not rely on every given condition, AlphaGeometry had in effect proved a more general theorem than the one posed that year. It was also apparent that complicated problems (those on which IMO participants performed poorly) generally required longer proofs from the AI. The machine thus seems to struggle with the same challenges that humans do.
AlphaGeometry cannot yet take part in the International Mathematical Olympiad itself because geometry problems usually make up only about a third of the competition. Trinh and his colleagues have emphasized, however, that their approach could also be applied to other mathematical subdisciplines such as combinatorics. Who knows, maybe in a few years’ time a nonhuman participant will take part in the IMO for the first time and maybe even win gold.