PsyPost

Artificial intelligence struggles to consistently evaluate scientific facts

by Karina Petrova
March 17, 2026
in Artificial Intelligence

Generative artificial intelligence programs can write fluently, but they still struggle to accurately and consistently evaluate basic scientific statements. A recent study shows that when an artificial intelligence is asked the exact same question multiple times, it often gives completely different answers. These results, published in the Rutgers Business Review, highlight the limits of current automated reasoning and the ongoing need for human oversight.

Generative artificial intelligence is a type of technology trained on massive databases of text to produce human-like writing. Millions of people now use these applications daily for tasks ranging from marketing to software development. The software writes with an authoritative tone that often sounds correct even when it is entirely wrong. Some high-profile consulting firms have even faced public embarrassment after relying on automated reports that included fabricated data.

Despite these known flaws, many businesses have partnered with technology vendors to incorporate these tools into their daily operations. Professionals frequently rely on automated software to analyze data, answer customer queries, and summarize research. The researchers wanted to know if the logical abilities of these programs actually matched their impressive vocabularies. They designed a test to see if the technology could reliably evaluate rigorous business concepts.

Mesut Cicek, an associate professor in the Department of Marketing and International Business at Washington State University, led the investigation. His co-authors included Sevincgul Ulu of Southern Illinois University, Can Uslay of Rutgers University, and Kate Karniouchina of Northeastern University. The team designed an experiment to test the software’s ability to interpret academic literature.

The researchers collected 719 scientific hypotheses from nine open-access business journals published since 2021. A hypothesis is a formal, testable prediction about how two or more things interact in the real world. For example, a statement might predict that a specific type of advertising increases consumer spending.

The team presented these statements to ChatGPT, a highly popular automated text generator. The program was asked to determine whether each statement was ultimately proven true or false by the actual research data. To test the stability of the program, the researchers submitted the exact same prompt ten separate times for each statement.

The entire experiment was run twice to track technological progress over time. The first test occurred in mid-2024 using an older version of the software. The researchers repeated the entire process in mid-2025 with an updated version of the application.

The results revealed a modest improvement in overall correctness, but the raw numbers were highly misleading. The software chose the correct answer 76.5 percent of the time in 2024 and 80 percent of the time in 2025. Because the questions only had two possible answers, a completely blind guess would be right half the time.

Once the researchers mathematically adjusted the scores to account for random guessing, the true performance dropped substantially. The effective accuracy rate hovered around a mere 60 percent. The software essentially earned a barely passing grade when it came to anticipating actual scientific findings.
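The article does not spell out the adjustment the researchers used. As a rough sketch, one standard correction for guessing rescales raw accuracy so that chance-level performance maps to zero and a perfect score maps to one. Applied to the reported raw figures with a 50 percent chance rate, it gives roughly 53 percent for 2024 and 60 percent for 2025, which is in line with the figure quoted above. The function name and the choice of formula are assumptions made for illustration, not the study's documented method.

```python
def chance_corrected_accuracy(raw_accuracy: float, chance: float = 0.5) -> float:
    """Rescale accuracy so chance-level guessing maps to 0 and a perfect score to 1.

    Assumption: this is the common correction-for-guessing formula,
    not necessarily the exact adjustment used in the study.
    """
    if not 0.0 <= chance < 1.0:
        raise ValueError("chance must be in [0, 1)")
    return (raw_accuracy - chance) / (1.0 - chance)


# Raw accuracy figures reported in the article; chance = 0.5 for true/false questions.
for year, raw in [(2024, 0.765), (2025, 0.80)]:
    adjusted = chance_corrected_accuracy(raw)
    print(f"{year}: raw {raw:.1%}, chance-corrected {adjusted:.1%}")
```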

The program performed exceptionally poorly when evaluating ideas that the original researchers had found to be false. The software correctly identified these unsupported statements only 16.4 percent of the time in 2025. The program displayed a strong bias toward agreeing with whatever statement it was fed, acting as a compliant assistant rather than an objective analyst. This tendency to blindly confirm existing ideas creates an echo chamber that can mislead decision-makers.

Consistency proved to be an even bigger problem for the automated system. When asked the same question ten times in a row, the software frequently contradicted itself. Sometimes the program would flip back and forth between true and false on consecutive attempts.

“We’re not just talking about accuracy, we’re talking about inconsistency, because if you ask the same question again and again, you come up with different answers,” Cicek said. In 2025, the program provided identical answers across all ten attempts for only 73 percent of the statements. For the remaining 27 percent, the answers changed at least once, which on a true-or-false question means at least one of the ten was wrong.

The lack of a stable response pattern makes the software highly unreliable for individual searches. Users who ask a question once might get a completely different answer if they simply refresh the page. “There were several cases where there were five true, five false,” Cicek said.
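The consistency measure described here can be sketched in a few lines: for each statement, collect the ten true-or-false answers, then count how many statements received the same answer every time and how many split evenly. The data below is made up purely to illustrate the calculation (four hypothetical statements labeled H1 to H4); it is not the study's data, and the function name is an assumption.

```python
def consistency_report(runs_by_statement: dict[str, list[bool]]) -> dict[str, float]:
    """Summarize how stable repeated true/false answers are across statements."""
    total = len(runs_by_statement)
    # A statement is "unanimous" when every repeated answer is the same.
    unanimous = sum(1 for runs in runs_by_statement.values() if len(set(runs)) == 1)
    # An "even split" is the worst case: exactly half true, half false.
    even_split = sum(
        1 for runs in runs_by_statement.values() if 2 * runs.count(True) == len(runs)
    )
    return {
        "unanimous_share": unanimous / total,
        "even_split_share": even_split / total,
    }


# Hypothetical answers for four statements, ten runs each (illustration only).
toy_runs = {
    "H1": [True] * 10,                # same answer every time
    "H2": [True] * 5 + [False] * 5,   # five true, five false
    "H3": [True] * 9 + [False],       # one contradiction
    "H4": [False] * 10,               # same answer every time
}

report = consistency_report(toy_runs)
print(report)
```

In this toy set, half of the statements get a unanimous answer and one in four is an even split. The study's 73 percent figure corresponds to the first of these two shares.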

The researchers also categorized the test questions by their logical difficulty. The software did best with direct cause-and-effect relationships, where one event leads straight to another. It struggled the most with conditional statements, in which a relationship holds only when some other variable changes.

These outcomes suggest that the program relies on recognizing common word patterns rather than actually understanding the concepts. It can mimic the structure of a logical argument without grasping the underlying meaning or context. The system possesses a high degree of linguistic fluency, but it lacks genuine theoretical flexibility. When faced with complex scenarios, the technology fails to adapt its reasoning.

The software remains bound by pattern recognition rather than true comprehension. “They just memorize, and they can give you some insight, but they don’t understand what they’re talking about,” Cicek said. The apparent improvements over the past year seem to stem from better text processing rather than deeper cognitive abilities.

For managers and analysts, these limitations carry substantial risks. The findings reveal that automated systems are currently too shallow to handle high-stakes decision-making on their own. As the text generated by these programs becomes smoother, users might easily miss hidden conceptual flaws.

The researchers advise professionals to use artificial intelligence for speed rather than substitution. A marketing team might use a text generator to brainstorm ideas or summarize long reports quickly. However, human experts must step in to verify whether the logic aligns with actual market evidence.

Professionals should also verify automated insights through repetition. Asking the same question multiple times can help expose underlying bias or instability in the software. Any conclusions generated by artificial intelligence should be treated as diagnostic clues rather than absolute facts.
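One way to put the repetition advice into practice is to ask the same question several times and tally the answers before trusting any one of them. The sketch below assumes a hypothetical `ask` function that sends a question to whatever tool is in use and returns its answer as text. Here it is replaced by a scripted stand-in so the example runs on its own, and no real chatbot API is called.

```python
from collections import Counter
from typing import Callable, Iterable


def repeat_and_tally(ask: Callable[[str], str], question: str, n: int = 10) -> dict:
    """Ask the same question n times and report how much the answers agree."""
    answers = [ask(question) for _ in range(n)]
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return {
        "counts": dict(counts),
        "majority": top_answer,
        "agreement": top_count / n,   # share of runs giving the most common answer
        "stable": len(counts) == 1,   # True only if every run agreed
    }


def make_scripted_model(script: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a real tool: replays a fixed list of answers."""
    replies = iter(script)

    def ask(_question: str) -> str:
        return next(replies)

    return ask


# Scripted answers for illustration only: seven "true" and three "false".
ask = make_scripted_model(
    ["true", "true", "false", "true", "true", "true", "false", "true", "false", "true"]
)
result = repeat_and_tally(ask, "Does this type of advertising increase consumer spending?")
print(result)
```

An agreement share well below 1.0, as in this scripted case, is a signal to have a human check the claim rather than accept the majority answer.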

The authors advocate for building organizational literacy regarding automated tools. Employees need to understand exactly where these programs excel and where they fail. Organizations should train their staff to audit the reasoning behind automated answers, rather than just trusting the numerical output.

The ultimate goal is to create a hybrid system that pairs human intelligence with automated speed. In this arrangement, software handles structural analysis while humans preserve interpretive judgment. This balanced approach ensures that technology supports human understanding rather than replacing it.

The authors noted a few minor limitations to their experiment. The study assumed that every published, peer-reviewed finding was entirely true or false, which leaves out some nuance in real-world science. Sometimes a scientific finding has mixed results that do not easily fit into a strict binary category.

The team also limited their consistency test to ten repetitions per question using a single software platform. Future investigations should involve a higher number of repetitions to confirm these patterns. Researchers should also test a wider variety of artificial intelligence programs to see if the flaws are universal.

Despite these limitations, the research suggests that users must remain vigilant. Human judgment remains a necessary check on these increasingly common digital systems. “Always be skeptical,” Cicek said. “I’m not against AI. I’m using it. But you need to be very careful.”

The study, “Unstable Intelligence: GenAI Struggles with Accuracy and Consistency,” was authored by Mesut Cicek, Sevincgul Ulu, Can Uslay, and Kate Karniouchina.
