
AI models struggle with expert-level global history knowledge

by Eric W. Dolan
January 22, 2025
in Artificial Intelligence

Researchers recently evaluated the ability of advanced artificial intelligence (AI) models to answer questions about global history using a benchmark derived from the Seshat Global History Databank. The study, presented at the Neural Information Processing Systems conference in Vancouver, revealed that the best-performing model, GPT-4 Turbo, achieved a score of 46% on a multiple-choice test, a marked improvement over random guessing but far from expert comprehension. The findings highlight significant limitations in current AI tools’ ability to process and understand historical knowledge, particularly outside well-documented regions like North America and Western Europe.

The study was motivated by a desire to explore the potential of AI tools in aiding historical and archaeological research. History and archaeology often involve analyzing vast amounts of complex and unevenly distributed data, making these fields particularly challenging for researchers.

Advances in AI, particularly in large language models (LLMs), have demonstrated their utility in fields like law and data labeling, raising the question of whether these tools could similarly assist historians by processing and synthesizing historical knowledge. Researchers hoped that AI could augment human efforts, providing insights that might otherwise be missed or speeding up labor-intensive tasks like data organization.

Peter Turchin, a project leader at the Complexity Science Hub, and his collaborators developed the Seshat Global History Databank, a comprehensive repository of historical knowledge. They recognized the need for a systematic evaluation of AI’s understanding of history. The researchers hoped the study would not only reveal the strengths and weaknesses of current AI but also guide future efforts to refine these tools for academic use.

The Seshat Global History Databank includes 36,000 data points about 600 historical societies, covering all major world regions and spanning 10,000 years of history. Data points are drawn from over 2,700 scholarly sources and coded by expert historians and graduate research assistants. The dataset is unique in its systematic approach to recording both well-supported evidence and inferred conclusions.

To evaluate AI performance, the researchers converted the dataset into multiple-choice questions that asked whether a historical variable (e.g., the presence of writing or a specific governance structure) was “present,” “absent,” “inferred present,” or “inferred absent” during a given society’s time frame. Seven AI models were tested, including GPT-3.5, GPT-4 Turbo, Llama, and Gemini. Models were provided with examples to help them understand the task and were instructed to act as expert historians in their responses.

The researchers assessed the models using a balanced accuracy metric, which accounts for the uneven distribution of answers across the dataset. Random guessing would result in a score of 25%, while perfect accuracy would yield 100%. The models were also tested on their ability to distinguish between “evidenced” and “inferred” facts, a critical skill for historical analysis.
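The balanced accuracy metric described above can be illustrated with a short sketch. This is not the authors' evaluation code; the label names and helper function below are hypothetical, but the metric itself is standard: the mean of per-class recall, so that each of the four answer categories counts equally no matter how often it appears in the data. That is why pure chance yields 25% with four choices, even on a skewed dataset.

```python
from collections import defaultdict

# The four answer categories used in the Seshat-derived benchmark.
LABELS = ["present", "absent", "inferred present", "inferred absent"]

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall: each label contributes equally,
    regardless of how frequently it occurs in the dataset."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[lbl] / total[lbl] for lbl in LABELS if total[lbl] > 0]
    return sum(recalls) / len(recalls)

# A skewed toy dataset: "present" dominates. A model that always
# answers "present" gets 90% raw accuracy but only 25% balanced
# accuracy, because it has zero recall on the other three labels.
y_true = (["present"] * 90 + ["absent"] * 4
          + ["inferred present"] * 3 + ["inferred absent"] * 3)
y_pred = ["present"] * 100
print(balanced_accuracy(y_true, y_pred))  # 0.25
```

This is why the paper reports balanced rather than raw accuracy: with unevenly distributed answers, raw accuracy rewards models that simply favor the most common label.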

“We wanted to set a benchmark for assessing the ability of these LLMs to handle expert-level history knowledge,” explained first author Jakob Hauser, a resident scientist at the Complexity Science Hub. “The Seshat Databank allows us to go beyond ‘general knowledge’ questions. A key component of our benchmark is that we not only test whether these LLMs can identify correct facts, but also explicitly ask whether a fact can be proven or inferred from indirect evidence.”


GPT-4 Turbo outperformed the other models, achieving a balanced accuracy of 43.8% on the four-choice test. While this score exceeded random guessing, it still fell well short of expert-level performance. In a simplified two-choice format (“present” versus “absent”), GPT-4 Turbo performed better, with an accuracy of 63.2%. These results suggest that while the models can identify straightforward facts, they struggle with more nuanced historical questions.

“One surprising finding, which emerged from this study, was just how bad these models were. This result shows that artificial ‘intelligence’ is quite domain-specific. LLMs do well in some contexts, but very poorly, compared to humans, in others,” Turchin remarked.

The study also revealed patterns in the models’ performance across regions, time periods, and types of historical data. Models generally performed better on earlier historical periods (e.g., before 3000 BCE) and struggled with more recent data, likely due to the increasing complexity of societies and historical records over time. Regionally, performance was highest for societies in the Americas and lowest for Sub-Saharan Africa and Oceania, highlighting potential biases in the models’ training data.

“LLMs, such as ChatGPT, have been enormously successful in some fields—for example, they have largely succeeded by replacing paralegals. But when it comes to making judgments about the characteristics of past societies, especially those located outside North America and Western Europe, their ability to do so is much more limited,” explained Turchin, who leads the Complexity Science Hub’s research group on social complexity and collapse.

Interestingly, the models exhibited relative consistency across different types of historical data, such as military organization, religious practices, and legal systems. However, performance varied significantly between models. GPT-4 Turbo consistently outperformed others in most categories, while smaller models like Llama-3.1-8B struggled to achieve comparable results.

The researchers acknowledged several limitations in their study. The Seshat Databank, while comprehensive, reflects the biases of its sources, which are predominantly in English and focused on well-documented societies. This linguistic and regional bias likely influenced the models’ performance. Additionally, the study only tested a limited number of AI models, leaving room for future evaluations of newer or more specialized tools.

The study also highlighted challenges in interpreting historical data. Unlike fields with clear-cut answers, history often involves ambiguity and debate, making it difficult to design objective benchmarks for AI evaluation. Furthermore, the models’ underperformance in regions like Sub-Saharan Africa underscores the need for more diverse training data that accurately represents global history.

Looking ahead, the researchers plan to expand the Seshat dataset to include more data from underrepresented regions and to incorporate additional types of historical questions. They also aim to test newer AI models to assess whether advancements in AI technology can address the limitations identified in this study.

“The main takeaway from this study is that LLMs, while impressive, still lack the depth of understanding required for advanced history. They’re great for basic facts, but when it comes to more nuanced, PhD-level historical inquiry, they’re not yet up to the task,” said Maria del Rio-Chanona, the study’s corresponding author and an assistant professor at University College London.

The paper, “Large Language Models’ Expert-level Global History Knowledge Benchmark (HiST-LLM),” was authored by Jakob Hauser, Daniel Kondor, Jenny Reddish, Majid Benam, Enrico Cioni, Federica Villa, James S. Bennett, Daniel Hoyer, Pieter François, Peter Turchin, and R. Maria del Rio-Chanona.


(c) PsyPost Media Inc
