
AI models struggle with expert-level global history knowledge

by Eric W. Dolan
January 22, 2025
in Artificial Intelligence
(Photo credit: DALL·E)

Researchers recently evaluated the ability of advanced artificial intelligence (AI) models to answer questions about global history using a benchmark derived from the Seshat Global History Databank. The study, presented at the Neural Information Processing Systems conference in Vancouver, revealed that the best-performing model, GPT-4 Turbo, achieved a score of 46% on a multiple-choice test, a marked improvement over random guessing but far from expert comprehension. The findings highlight significant limitations in current AI tools’ ability to process and understand historical knowledge, particularly outside well-documented regions like North America and Western Europe.

The motivation for the study stemmed from a desire to explore the potential of AI tools in aiding historical and archaeological research. Both fields involve analyzing vast amounts of complex and unevenly distributed data, which makes them particularly challenging for researchers to synthesize.

Advances in AI, particularly in large language models (LLMs), have demonstrated their utility in fields like law and data labeling, raising the question of whether these tools could similarly assist historians by processing and synthesizing historical knowledge. Researchers hoped that AI could augment human efforts, providing insights that might otherwise be missed or speeding up labor-intensive tasks like data organization.

Peter Turchin, a project leader at the Complexity Science Hub, and his collaborators developed the Seshat Global History Databank, a comprehensive repository of historical knowledge. They recognized the need for a systematic evaluation of AI’s understanding of history. The researchers hoped the study would not only reveal the strengths and weaknesses of current AI but also guide future efforts to refine these tools for academic use.

The Seshat Global History Databank includes 36,000 data points about 600 historical societies, covering all major world regions and spanning 10,000 years of history. Data points are drawn from over 2,700 scholarly sources and coded by expert historians and graduate research assistants. The dataset is unique in its systematic approach to recording both well-supported evidence and inferred conclusions.

To evaluate AI performance, the researchers converted the dataset into multiple-choice questions that asked whether a historical variable (e.g., the presence of writing or a specific governance structure) was “present,” “absent,” “inferred present,” or “inferred absent” during a given society’s time frame. Seven AI models were tested, including GPT-3.5, GPT-4 Turbo, Llama, and Gemini. Models were provided with examples to help them understand the task and were instructed to act as expert historians in their responses.
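To make the setup concrete, here is a minimal sketch of how a Seshat-style record might be turned into one of these four-choice questions. The field names, the example society, and the exact wording are illustrative assumptions, not the authors' actual prompt.

```python
# Illustrative sketch only: the prompt wording and fields are assumptions,
# not the wording used in the HiST-LLM benchmark itself.
CHOICES = ["present", "absent", "inferred present", "inferred absent"]

def build_question(society: str, start_year: int, end_year: int, variable: str) -> str:
    """Assemble a four-choice question about one historical variable."""
    options = "\n".join(f"{letter}. {choice}" for letter, choice in zip("ABCD", CHOICES))
    return (
        "You are an expert historian.\n"
        f"For the society '{society}' ({start_year} to {end_year}), "
        f"was the variable '{variable}' present, absent, inferred present, "
        "or inferred absent?\n"
        f"{options}\n"
        "Answer with a single letter."
    )

# Hypothetical example record (negative years stand in for BCE):
print(build_question("Old Kingdom Egypt", -2650, -2150, "written records"))
```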

The researchers assessed the models using a balanced accuracy metric, which accounts for the uneven distribution of answers across the dataset. Random guessing would result in a score of 25%, while perfect accuracy would yield 100%. The models were also tested on their ability to distinguish between “evidenced” and “inferred” facts, a critical skill for historical analysis.
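For readers unfamiliar with the metric, the sketch below shows why balanced accuracy, rather than raw accuracy, is the fairer yardstick here: it averages recall per answer class, so a model that always guesses the most common answer still scores only 25% on a four-class task. The label distribution in the example is invented for illustration.

```python
# Minimal sketch of balanced accuracy: the mean of per-class recall.
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    correct = defaultdict(int)  # per-class count of correct predictions
    total = defaultdict(int)    # per-class count of true labels
    for truth, guess in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += (guess == truth)
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Invented, deliberately skewed label distribution:
y_true = (["present"] * 70 + ["absent"] * 20
          + ["inferred present"] * 7 + ["inferred absent"] * 3)
y_pred = ["present"] * 100  # always guess the majority class

print(balanced_accuracy(y_true, y_pred))  # 0.25, despite 70% raw accuracy
```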

“We wanted to set a benchmark for assessing the ability of these LLMs to handle expert-level history knowledge,” explained first author Jakob Hauser, a resident scientist at the Complexity Science Hub. “The Seshat Databank allows us to go beyond ‘general knowledge’ questions. A key component of our benchmark is that we not only test whether these LLMs can identify correct facts, but also explicitly ask whether a fact can be proven or inferred from indirect evidence.”

GPT-4 Turbo outperformed the other models, achieving a balanced accuracy of 43.8% on the four-choice test. While this score exceeded random guessing, it still fell well short of expert-level performance. In a simplified two-choice format (“present” versus “absent”), GPT-4 Turbo performed better, with an accuracy of 63.2%. These results suggest that while the models can identify straightforward facts, they struggle with more nuanced historical questions.

“One surprising finding, which emerged from this study, was just how bad these models were. This result shows that artificial ‘intelligence’ is quite domain-specific. LLMs do well in some contexts, but very poorly, compared to humans, in others,” Turchin remarked.

The study also revealed patterns in the models’ performance across regions, time periods, and types of historical data. Models generally performed better on earlier historical periods (e.g., before 3000 BCE) and struggled with more recent data, likely due to the increasing complexity of societies and historical records over time. Regionally, performance was highest for societies in the Americas and lowest for Sub-Saharan Africa and Oceania, highlighting potential biases in the models’ training data.

“LLMs, such as ChatGPT, have been enormously successful in some fields—for example, they have largely succeeded by replacing paralegals. But when it comes to making judgments about the characteristics of past societies, especially those located outside North America and Western Europe, their ability to do so is much more limited,” explained Turchin, who leads the Complexity Science Hub’s research group on social complexity and collapse.

Interestingly, the models exhibited relative consistency across different types of historical data, such as military organization, religious practices, and legal systems. However, performance varied significantly between models. GPT-4 Turbo consistently outperformed others in most categories, while smaller models like Llama-3.1-8B struggled to achieve comparable results.

The researchers acknowledged several limitations in their study. The Seshat Databank, while comprehensive, reflects the biases of its sources, which are predominantly in English and focused on well-documented societies. This linguistic and regional bias likely influenced the models’ performance. Additionally, the study only tested a limited number of AI models, leaving room for future evaluations of newer or more specialized tools.

The study also highlighted challenges in interpreting historical data. Unlike fields with clear-cut answers, history often involves ambiguity and debate, making it difficult to design objective benchmarks for AI evaluation. Furthermore, the models’ underperformance in regions like Sub-Saharan Africa underscores the need for more diverse training data that accurately represents global history.

Looking ahead, the researchers plan to expand the Seshat dataset to include more data from underrepresented regions and to incorporate additional types of historical questions. They also aim to test newer AI models to assess whether advancements in AI technology can address the limitations identified in this study.

“The main takeaway from this study is that LLMs, while impressive, still lack the depth of understanding required for advanced history. They’re great for basic facts, but when it comes to more nuanced, PhD-level historical inquiry, they’re not yet up to the task,” said Maria del Rio-Chanona, the study’s corresponding author and an assistant professor at University College London.

The paper, “Large Language Models’ Expert-level Global History Knowledge Benchmark (HiST-LLM),” was authored by Jakob Hauser, Daniel Kondor, Jenny Reddish, Majid Benam, Enrico Cioni, Federica Villa, James S. Bennett, Daniel Hoyer, Pieter François, Peter Turchin, and R. Maria del Rio-Chanona.
