
AI models struggle with expert-level global history knowledge

by Eric W. Dolan
January 22, 2025
in Artificial Intelligence

Researchers recently evaluated the ability of advanced artificial intelligence (AI) models to answer questions about global history using a benchmark derived from the Seshat Global History Databank. The study, presented at the Neural Information Processing Systems conference in Vancouver, revealed that the best-performing model, GPT-4 Turbo, achieved a score of 46% on a multiple-choice test, a marked improvement over random guessing but far from expert comprehension. The findings highlight significant limitations in current AI tools’ ability to process and understand historical knowledge, particularly outside well-documented regions like North America and Western Europe.

The motivation for the study stemmed from a desire to explore the potential of AI tools in aiding historical and archaeological research. History and archaeology often involve analyzing vast amounts of complex and unevenly distributed data, making these fields particularly challenging for researchers.

Advances in AI, particularly in large language models (LLMs), have demonstrated their utility in fields like law and data labeling, raising the question of whether these tools could similarly assist historians by processing and synthesizing historical knowledge. Researchers hoped that AI could augment human efforts, providing insights that might otherwise be missed or speeding up labor-intensive tasks like data organization.

Peter Turchin, a project leader at the Complexity Science Hub, and his collaborators developed the Seshat Global History Databank, a comprehensive repository of historical knowledge. They recognized the need for a systematic evaluation of AI’s understanding of history. The researchers hoped the study would not only reveal the strengths and weaknesses of current AI but also guide future efforts to refine these tools for academic use.

The Seshat Global History Databank includes 36,000 data points about 600 historical societies, covering all major world regions and spanning 10,000 years of history. Data points are drawn from over 2,700 scholarly sources and coded by expert historians and graduate research assistants. The dataset is unique in its systematic approach to recording both well-supported evidence and inferred conclusions.

To evaluate AI performance, the researchers converted the dataset into multiple-choice questions that asked whether a historical variable (e.g., the presence of writing or a specific governance structure) was “present,” “absent,” “inferred present,” or “inferred absent” during a given society’s time frame. Seven AI models were tested, including GPT-3.5, GPT-4 Turbo, Llama, and Gemini. Models were provided with examples to help them understand the task and were instructed to act as expert historians in their responses.
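
For a concrete sense of the task format, the sketch below shows how one such question might be posed to a model through the OpenAI Python SDK. The example society, variable, and prompt wording are hypothetical illustrations rather than the researchers' actual prompts; only the four answer labels come from the study.

```python
# A minimal sketch (not the authors' code) of turning one historical
# variable into a four-choice question and sending it to a model.
# The society, time range, and prompt wording below are invented for
# illustration; the four answer labels are the ones used in the study.
from openai import OpenAI

ANSWER_OPTIONS = ["present", "absent", "inferred present", "inferred absent"]

def build_question(society: str, time_range: str, variable: str) -> str:
    """Format one historical variable as a multiple-choice question."""
    options = "\n".join(f"{i + 1}. {opt}" for i, opt in enumerate(ANSWER_OPTIONS))
    return (
        f"For the society '{society}' ({time_range}), was '{variable}' "
        f"present, absent, inferred present, or inferred absent?\n"
        f"{options}\nAnswer with the number of the correct option."
    )

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        # The study instructed models to respond as expert historians.
        {"role": "system", "content": "You are an expert historian."},
        {"role": "user", "content": build_question(
            "Old Kingdom Egypt", "2650-2150 BCE", "presence of writing")},
    ],
)
print(response.choices[0].message.content)
```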

The researchers assessed the models using a balanced accuracy metric, which accounts for the uneven distribution of answers across the dataset. Random guessing would result in a score of 25%, while perfect accuracy would yield 100%. The models were also tested on their ability to distinguish between “evidenced” and “inferred” facts, a critical skill for historical analysis.
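
Balanced accuracy is the unweighted mean of per-class recall, so rare answer labels count as much as common ones. The short sketch below illustrates that standard definition on invented toy data; it is not the study's evaluation code.

```python
# A minimal sketch of balanced accuracy: the unweighted mean of
# per-class recall. With four classes, a uniformly random guesser
# scores 0.25 in expectation; a perfect model scores 1.0.
from collections import defaultdict

def balanced_accuracy(y_true: list[str], y_pred: list[str]) -> float:
    correct = defaultdict(int)  # correct predictions per true class
    total = defaultdict(int)    # occurrences of each true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Toy example with imbalanced labels: plain accuracy would reward
# always guessing the majority label, but balanced accuracy does not.
y_true = ["present"] * 8 + ["absent", "inferred present"]
y_pred = ["present"] * 10
print(balanced_accuracy(y_true, y_pred))  # 0.333...: mean of 1.0, 0.0, 0.0
```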

“We wanted to set a benchmark for assessing the ability of these LLMs to handle expert-level history knowledge,” explained first author Jakob Hauser, a resident scientist at the Complexity Science Hub. “The Seshat Databank allows us to go beyond ‘general knowledge’ questions. A key component of our benchmark is that we not only test whether these LLMs can identify correct facts, but also explicitly ask whether a fact can be proven or inferred from indirect evidence.”

GPT-4 Turbo outperformed the other models, achieving a balanced accuracy of 43.8% on the four-choice test. While this score exceeded random guessing, it still fell well short of expert-level performance. In a simplified two-choice format (“present” versus “absent”), GPT-4 Turbo performed better, with an accuracy of 63.2%. These results suggest that while the models can identify straightforward facts, they struggle with more nuanced historical questions.

“One surprising finding, which emerged from this study, was just how bad these models were. This result shows that artificial ‘intelligence’ is quite domain-specific. LLMs do well in some contexts, but very poorly, compared to humans, in others,” Turchin remarked.

The study also revealed patterns in the models’ performance across regions, time periods, and types of historical data. Models generally performed better on earlier historical periods (e.g., before 3000 BCE) and struggled with more recent data, likely due to the increasing complexity of societies and historical records over time. Regionally, performance was highest for societies in the Americas and lowest for Sub-Saharan Africa and Oceania, highlighting potential biases in the models’ training data.

“LLMs, such as ChatGPT, have been enormously successful in some fields—for example, they have largely succeeded by replacing paralegals. But when it comes to making judgments about the characteristics of past societies, especially those located outside North America and Western Europe, their ability to do so is much more limited,” explained Turchin, who leads the Complexity Science Hub’s research group on social complexity and collapse.

Interestingly, the models exhibited relative consistency across different types of historical data, such as military organization, religious practices, and legal systems. However, performance varied significantly between models. GPT-4 Turbo consistently outperformed others in most categories, while smaller models like Llama-3.1-8B struggled to achieve comparable results.

The researchers acknowledged several limitations in their study. The Seshat Databank, while comprehensive, reflects the biases of its sources, which are predominantly in English and focused on well-documented societies. This linguistic and regional bias likely influenced the models’ performance. Additionally, the study only tested a limited number of AI models, leaving room for future evaluations of newer or more specialized tools.

The study also highlighted challenges in interpreting historical data. Unlike fields with clear-cut answers, history often involves ambiguity and debate, making it difficult to design objective benchmarks for AI evaluation. Furthermore, the models’ underperformance in regions like Sub-Saharan Africa underscores the need for more diverse training data that accurately represents global history.

Looking ahead, the researchers plan to expand the Seshat dataset to include more data from underrepresented regions and to incorporate additional types of historical questions. They also aim to test newer AI models to assess whether advancements in AI technology can address the limitations identified in this study.

“The main takeaway from this study is that LLMs, while impressive, still lack the depth of understanding required for advanced history. They’re great for basic facts, but when it comes to more nuanced, PhD-level historical inquiry, they’re not yet up to the task,” said Maria del Rio-Chanona, the study’s corresponding author and an assistant professor at University College London.

The paper, “Large Language Models’ Expert-level Global History Knowledge Benchmark (HiST-LLM),” was authored by Jakob Hauser, Daniel Kondor, Jenny Reddish, Majid Benam, Enrico Cioni, Federica Villa, James S. Bennett, Daniel Hoyer, Pieter François, Peter Turchin, and R. Maria del Rio-Chanona.
