Artificial intelligence models show massive gaps on traditional human intelligence tests

Artificial intelligence programs designed to process and generate text show remarkably high verbal reasoning abilities, but they struggle with visual and numerical puzzles. New research evaluating a variety of commercial and open-source models on traditional intelligence quotient tests revealed wide gaps in performance depending on the format of the questions. The findings were published in Computers in Human Behavior: Artificial Humans.

Large language models are computer algorithms trained on immense amounts of text data scraped from the internet. They calculate the statistical probability of which word should follow the text that came before it. Because they are designed essentially as highly advanced text-prediction engines, scientists debate whether these programs actually understand what they are saying or are simply mimicking human language patterns.

Standard benchmarks like the Massive Multitask Language Understanding exam test how well an artificial intelligence system can remember specialized academic facts. While scoring high on a legal or medical exam is impressive, it only proves that the program can recall information it has already seen in its training data. These tests do not directly measure the machine’s ability to engage in generalized, abstract reasoning.

To bridge this gap, scientists look toward cognitive tests designed for humans. Intelligence quotient tests evaluate what psychologists call fluid intelligence. Fluid intelligence is the capacity to think logically and solve problems in novel situations, independent of acquired knowledge. Sections featuring spatial rotation prompts or word analogies present unfamiliar scenarios, requiring the test-taker to deduce the underlying rules of the puzzle without relying on memorized trivia.

Lead researcher Sherif Abdelkarim, a computer scientist at the University of California Irvine, organized a study to see how artificial intelligence programs handle these fluid intelligence tests. He authored the study alongside David Lu, Dora-Luz Flores, Susanne Jaeggi, and Pierre Baldi. The team wanted to measure whether advanced models possess general reasoning skills independent of specific academic knowledge.

The researchers selected 18 different large language models to provide a comprehensive look at the modern software landscape. They tested proprietary systems developed by large tech companies as well as open-source models created by the broader research community. By comparing models of varying sizes, the team hoped to track how cognitive limits change as the software grows larger.

The assessment relied on a self-scoring intelligence quotient suite first published in 1996. The test encompasses 14 distinct categories covering three modes of thinking. The verbal sections ask the test-taker to identify synonyms or complete complex analogies. The numerical sections require the participant to solve arithmetic equations or identify numbers missing from a sequence based on unstated mathematical rules. The visual sections ask the participant to analyze geometric shapes, imagine those shapes rotating in space, and predict the next image in a matrix pattern.

Administering an exam designed for humans to a computer program presents distinct logistical challenges. Because language models generate responses based on probabilities, they can give a completely different answer to the identical prompt if it is asked twice. The researchers adjusted the internal parameters of the models, changing a setting known as temperature to zero. This setting minimizes the randomness of the program, forcing it to provide its most likely answer every time.
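
As a rough illustration of what the temperature setting does, the Python sketch below turns a set of made-up scores for three candidate answers into probabilities. It is not the researchers' code: the function name, the scores, and the choice to treat a temperature of zero as "always pick the top-scoring answer" are assumptions made for this example, and real systems may handle the setting differently.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        # Assumption: temperature zero means greedy choice of the top score.
        top = max(range(len(scores)), key=lambda i: scores[i])
        return [1.0 if i == top else 0.0 for i in range(len(scores))]
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate answers: A, B, C
scores = [2.0, 1.5, 0.5]
warm = softmax_with_temperature(scores, 1.0)  # some chance of B or C
cold = softmax_with_temperature(scores, 0.0)  # all probability on A
print(warm, cold)
```

At a temperature of 1.0, the second and third answers keep a real chance of being sampled, so repeated runs can differ. At zero, the same top answer is returned every time, which is the repeatability the researchers were after.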

When analyzing the results, researchers noted that model size dictated performance. In software development, model size refers to the number of mathematical parameters the system uses to connect different concepts and process information. More parameters usually mean a more capable system.

The smallest language models, containing roughly seven billion parameters, achieved scores equivalent to a human intelligence quotient range of 89 to 110. The largest and most advanced programs achieved scores simulating a range of 111 to 131. In human testing protocols, a score of 100 sits exactly at the population average.

Despite the high intelligence estimates for the large models, the researchers noticed intense variations across different subject areas. The algorithms exhibited an overwhelming bias toward verbal tasks. For example, OpenAI’s GPT-4 answered 79 percent of the verbal questions correctly but only managed an accuracy rate of 53 percent on the numerical questions. This divide makes intuitive sense, as the models are predominantly trained with language data rather than numerical logic systems.

The division expanded further when comparing text comprehension to visual comprehension. The top-tier models achieved an estimated intelligence quotient of roughly 125 on text-based questions but hovered around an estimated score of 103 for visual questions. Several visual reasoning sections stumped the programs entirely. In sections requiring the program to count specific shapes hidden inside a larger, overlapping geometric pattern, every single model registered a zero percent success rate.

These programs also demonstrated a persistent inability to answer abstract numerical puzzles. Even the most advanced commercial models performed terribly on missing-number tasks, which ask the test-taker to find the hidden mathematical relationship in a sequence of numbers and then fill in a blank space. No model achieved higher than 20 percent accuracy in this section. The researchers note that these programs lack external memory capabilities, meaning they struggle to hold intermediate results in a temporary mental space while carrying out multi-step arithmetic.

The researchers additionally evaluated the specialized personality settings offered by Microsoft’s Bing Chat interface. This interface allows users to dictate whether the chat agent acts in a creative, precise, or balanced manner. These three modes use the exact same underlying software architecture, but they are guided by hidden instructions that alter their behavior.

The creative mode achieved the highest marks, generating an estimated intelligence quotient up to 132. It performed exceptionally well on analogies and tasks requiring innovative, flexible thinking. The precise mode scored slightly lower overall but excelled at strict logical reasoning sequences. The balanced mode performed the worst of the three. The results suggest that attempting to combine instructions for precision and creativity actually hinders the program’s ability to reason effectively, leading to subpar responses.

To see if performance could be improved beyond these base scores, the team designed a multi-agent system. In this setup, one artificial intelligence generates an initial answer, a second criticizes that answer, and a third uses that criticism to suggest a revision. The first program then tries to answer the original question again using the new advice. This mimics the human peer-review process.
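
The answer, critique, revise, and retry loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's implementation: each agent is modeled as a plain function from a prompt to a reply, and the function names, prompt wording, and toy agents are all hypothetical stand-ins for real model calls.

```python
from typing import Callable, Tuple

# Assumption: an agent is any function that maps a prompt to a reply.
Agent = Callable[[str], str]

def answer_with_review(question: str, solver: Agent,
                       critic: Agent, adviser: Agent) -> Tuple[str, str]:
    """Run one review round and return (first_answer, second_answer)."""
    first = solver(question)
    critique = critic(
        f"Question: {question}\nAnswer: {first}\nCriticize this answer.")
    advice = adviser(
        f"Question: {question}\nAnswer: {first}\n"
        f"Critique: {critique}\nSuggest a revision.")
    # The original solver tries again with the advice in hand.
    second = solver(f"Question: {question}\nAdvice: {advice}\nAnswer again.")
    return first, second

# Toy stand-ins for real models (hypothetical behavior, for illustration only)
def toy_solver(prompt: str) -> str:
    return "B" if "Advice:" in prompt else "A"

def toy_critic(prompt: str) -> str:
    return "The answer ignores the rotation rule."

def toy_adviser(prompt: str) -> str:
    return "Reconsider option B."

first, second = answer_with_review(
    "Which shape comes next?", toy_solver, toy_critic, toy_adviser)
print(first, second)
```

Because the solver, critic, and adviser are passed in separately, the same loop covers the pairings the researchers compared, such as a small solver with a large critic or the reverse.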

The composition of this synthetic team completely altered the final test scores. When the researchers assigned a small model to answer the questions and a massive, highly capable model to act as the critic, the small model improved its score on its second attempt. The large critic accurately guided the smaller algorithm toward the right logic.

Conversely, when a large model originally answered the questions and a small model acted as the critic, the large model’s performance decreased on the second attempt. The flawed criticism generated by the small program caused the massive model to doubt its own initially correct answers. Taking the largest models and letting them act as their own critics provided almost no extra benefit, suggesting the top-tier systems might have hit a temporary ceiling in their reasoning capabilities.

The study does have certain limitations regarding how intelligence is defined and measured. The tests used in this assessment were originally designed to gauge the cognitive abilities of human beings. They might not accurately capture the unique internal workings of an artificial intelligence system, which can ingest millions of text documents in seconds but lacks any physical interaction with the real world. Many psychologists already debate the validity of intelligence tests for measuring human capability, which makes them imperfect tools for measuring synthetic minds.

Future research will likely involve administering current clinical diagnostic assessments used by psychologists in professional medical environments. The researchers also hope to run larger trials focusing solely on images, as visual reasoning remains a massive obstacle for the current generation of generative artificial intelligence software.

The study, “Evaluating the Intelligence of large language models: A comparative study using verbal and visual IQ tests,” was authored by Sherif Abdelkarim, David Lu, Dora-Luz Flores, Susanne Jaeggi, and Pierre Baldi.
