
AI models struggle with expert-level global history knowledge

by Eric W. Dolan
January 22, 2025

Researchers recently evaluated the ability of advanced artificial intelligence (AI) models to answer questions about global history using a benchmark derived from the Seshat Global History Databank. The study, presented at the Neural Information Processing Systems conference in Vancouver, revealed that the best-performing model, GPT-4 Turbo, achieved a score of 46% on a multiple-choice test, a marked improvement over random guessing but far from expert comprehension. The findings highlight significant limitations in current AI tools’ ability to process and understand historical knowledge, particularly outside well-documented regions like North America and Western Europe.

The study was motivated by a desire to explore whether AI tools could aid historical and archaeological research. History and archaeology often involve analyzing vast amounts of complex and unevenly distributed data, making these fields particularly challenging for researchers.

Advances in AI, particularly in large language models (LLMs), have demonstrated their utility in fields like law and data labeling, raising the question of whether these tools could similarly assist historians by processing and synthesizing historical knowledge. Researchers hoped that AI could augment human efforts, providing insights that might otherwise be missed or speeding up labor-intensive tasks like data organization.

Peter Turchin, a project leader at the Complexity Science Hub, and his collaborators developed the Seshat Global History Databank, a comprehensive repository of historical knowledge. They recognized the need for a systematic evaluation of AI’s understanding of history. The researchers hoped the study would not only reveal the strengths and weaknesses of current AI but also guide future efforts to refine these tools for academic use.

The Seshat Global History Databank includes 36,000 data points about 600 historical societies, covering all major world regions and spanning 10,000 years of history. Data points are drawn from over 2,700 scholarly sources and coded by expert historians and graduate research assistants. The dataset is unique in its systematic approach to recording both well-supported evidence and inferred conclusions.

To evaluate AI performance, the researchers converted the dataset into multiple-choice questions that asked whether a historical variable (e.g., the presence of writing or a specific governance structure) was “present,” “absent,” “inferred present,” or “inferred absent” during a given society’s time frame. Seven AI models were tested, including GPT-3.5, GPT-4 Turbo, Llama, and Gemini. Models were provided with examples to help them understand the task and were instructed to act as expert historians in their responses.
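As a rough illustration of that conversion, the sketch below turns one coded record into a four-option question. It is not the researchers' code: the `Record` fields, the question wording, and the "Example Polity" entry are all hypothetical placeholders, and only the four answer options come from the description above.

```python
from dataclasses import dataclass

# The four answer options described in the article.
OPTIONS = ["present", "absent", "inferred present", "inferred absent"]


@dataclass
class Record:
    """One coded data point (field names are assumptions, not Seshat's schema)."""
    society: str
    period: str
    variable: str
    value: str  # must be one of OPTIONS


def to_question(record):
    """Return (prompt_text, correct_letter) for a single record."""
    letters = "ABCD"
    lines = [
        f"Was {record.variable} present in {record.society} during {record.period}?"
    ]
    lines += [f"{letter}. {option}" for letter, option in zip(letters, OPTIONS)]
    answer = letters[OPTIONS.index(record.value)]
    return "\n".join(lines), answer


# Hypothetical placeholder record, not real Seshat data.
example = Record("Example Polity", "1000-800 BCE", "writing", "inferred present")
prompt, answer = to_question(example)
print(prompt)
print("Correct option:", answer)
```

In the study itself, the models also received worked examples and an instruction to answer as an expert historian. Those parts are omitted here because the article does not give their wording.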

The researchers assessed the models using a balanced accuracy metric, which accounts for the uneven distribution of answers across the dataset. Random guessing would result in a score of 25%, while perfect accuracy would yield 100%. The models were also tested on their ability to distinguish between “evidenced” and “inferred” facts, a critical skill for historical analysis.
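To see why a balanced metric matters when some answers are far more common than others, consider the minimal sketch below. It assumes the standard definition of balanced accuracy (the average of per-class recall), and the label counts are made up for illustration rather than taken from the study.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Average of per-class recall over the classes that appear in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Hypothetical, deliberately uneven answer distribution (not from the study).
labels = (
    ["present"] * 6
    + ["absent"] * 2
    + ["inferred present"]
    + ["inferred absent"]
)

# A "model" that always answers "present".
always_present = ["present"] * len(labels)

plain = sum(t == p for t, p in zip(labels, always_present)) / len(labels)
balanced = balanced_accuracy(labels, always_present)

print(f"plain accuracy:    {plain:.2f}")     # 0.60
print(f"balanced accuracy: {balanced:.2f}")  # 0.25
```

On this toy data, always answering "present" looks respectable under plain accuracy but scores 25% under balanced accuracy, the same as the chance baseline on a four-option test.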

“We wanted to set a benchmark for assessing the ability of these LLMs to handle expert-level history knowledge,” explained first author Jakob Hauser, a resident scientist at the Complexity Science Hub. “The Seshat Databank allows us to go beyond ‘general knowledge’ questions. A key component of our benchmark is that we not only test whether these LLMs can identify correct facts, but also explicitly ask whether a fact can be proven or inferred from indirect evidence.”

GPT-4 Turbo outperformed the other models, achieving a balanced accuracy of 46% on the four-choice test. While this score exceeded the 25% expected from random guessing, it still fell well short of expert-level performance. In a simplified two-choice format (“present” versus “absent”), GPT-4 Turbo performed better, with an accuracy of 63.2%. These results suggest that while the models can identify straightforward facts, they struggle with more nuanced historical questions.

“One surprising finding, which emerged from this study, was just how bad these models were. This result shows that artificial ‘intelligence’ is quite domain-specific. LLMs do well in some contexts, but very poorly, compared to humans, in others,” Turchin remarked.

The study also revealed patterns in the models’ performance across regions, time periods, and types of historical data. Models generally performed better on earlier historical periods (e.g., before 3000 BCE) and struggled with more recent data, likely due to the increasing complexity of societies and historical records over time. Regionally, performance was highest for societies in the Americas and lowest for Sub-Saharan Africa and Oceania, highlighting potential biases in the models’ training data.

“LLMs, such as ChatGPT, have been enormously successful in some fields—for example, they have largely succeeded by replacing paralegals. But when it comes to making judgments about the characteristics of past societies, especially those located outside North America and Western Europe, their ability to do so is much more limited,” explained Turchin, who leads the Complexity Science Hub’s research group on social complexity and collapse.

Interestingly, the models exhibited relative consistency across different types of historical data, such as military organization, religious practices, and legal systems. However, performance varied significantly between models. GPT-4 Turbo consistently outperformed others in most categories, while smaller models like Llama-3.1-8B struggled to achieve comparable results.

The researchers acknowledged several limitations in their study. The Seshat Databank, while comprehensive, reflects the biases of its sources, which are predominantly in English and focused on well-documented societies. This linguistic and regional bias likely influenced the models’ performance. Additionally, the study only tested a limited number of AI models, leaving room for future evaluations of newer or more specialized tools.

The study also highlighted challenges in interpreting historical data. Unlike fields with clear-cut answers, history often involves ambiguity and debate, making it difficult to design objective benchmarks for AI evaluation. Furthermore, the models’ underperformance in regions like Sub-Saharan Africa underscores the need for more diverse training data that accurately represents global history.

Looking ahead, the researchers plan to expand the Seshat dataset to include more data from underrepresented regions and to incorporate additional types of historical questions. They also aim to test newer AI models to assess whether advancements in AI technology can address the limitations identified in this study.

“The main takeaway from this study is that LLMs, while impressive, still lack the depth of understanding required for advanced history. They’re great for basic facts, but when it comes to more nuanced, PhD-level historical inquiry, they’re not yet up to the task,” said Maria del Rio-Chanona, the study’s corresponding author and an assistant professor at University College London.

The paper, “Large Language Models’ Expert-level Global History Knowledge Benchmark (HiST-LLM),” was authored by Jakob Hauser, Daniel Kondor, Jenny Reddish, Majid Benam, Enrico Cioni, Federica Villa, James S. Bennett, Daniel Hoyer, Pieter François, Peter Turchin, and R. Maria del Rio-Chanona.
