PsyPost

Artificial intelligence struggles to consistently evaluate scientific facts

by Karina Petrova
March 17, 2026

Generative artificial intelligence programs can write fluently, but they still struggle to accurately and consistently evaluate basic scientific statements. A recent study shows that when an artificial intelligence is asked the exact same question multiple times, it often gives completely different answers. These results, published in the Rutgers Business Review, highlight the limits of current automated reasoning and the ongoing need for human oversight.

Generative artificial intelligence is a type of technology trained on massive databases of text to produce human-like writing. Millions of people now use these applications daily for tasks ranging from marketing to software development. The software writes with an authoritative tone that often sounds correct even when it is entirely wrong. Some high-profile consulting firms have even faced public embarrassment after relying on automated reports that included fabricated data.

Despite these known flaws, many businesses have partnered with technology vendors to incorporate these tools into their daily operations. Professionals frequently rely on automated software to analyze data, answer customer queries, and summarize research. The researchers wanted to know if the logical abilities of these programs actually matched their impressive vocabularies. They designed a test to see if the technology could reliably evaluate rigorous business concepts.

Mesut Cicek, an associate professor in the Department of Marketing and International Business at Washington State University, led the investigation. His co-authors included Sevincgul Ulu of Southern Illinois University, Can Uslay of Rutgers University, and Kate Karniouchina of Northeastern University. The team designed an experiment to test the software’s ability to interpret academic literature.

The researchers collected 719 scientific hypotheses from nine open-access business journals published since 2021. A hypothesis is a formal, testable prediction about how two or more things interact in the real world. For example, a statement might predict that a specific type of advertising increases consumer spending.

The team presented these statements to ChatGPT, a highly popular automated text generator. The program was asked to determine whether each statement was ultimately proven true or false by the actual research data. To test the stability of the program, the researchers submitted the exact same prompt ten separate times for each statement.

The entire experiment was run twice to track technological progress over time. The first test occurred in mid-2024 using an older version of the software. The researchers repeated the entire process in mid-2025 with an updated version of the application.

The results revealed a modest improvement in overall correctness, but the raw numbers were highly misleading. The software chose the correct answer 76.5 percent of the time in 2024 and 80 percent of the time in 2025. Because the questions only had two possible answers, a completely blind guess would be right half the time.

Once the researchers mathematically adjusted the scores to account for random guessing, the true performance dropped substantially. The effective accuracy rate hovered around a mere 60 percent. The software essentially earned a barely passing grade when it came to anticipating actual scientific findings.
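The article does not say which adjustment the researchers applied. One common correction for guessing, assumed here purely for illustration, rescales raw accuracy so that chance performance maps to zero and a perfect score maps to one. With two answer options, chance is 0.5. The function name below is hypothetical and is not taken from the study.

```python
def chance_corrected_accuracy(raw_accuracy: float, n_options: int = 2) -> float:
    """Rescale accuracy so that chance = 0 and perfect = 1.

    Assumed formula (a standard correction for guessing), not
    necessarily the one used in the study.
    """
    chance = 1.0 / n_options
    return (raw_accuracy - chance) / (1.0 - chance)


# Raw accuracy figures reported in the article
raw_scores = {"2024": 0.765, "2025": 0.80}

for year, raw in raw_scores.items():
    adjusted = chance_corrected_accuracy(raw)
    print(f"{year}: raw {raw:.1%}, adjusted {adjusted:.0%}")
```

Under this assumed formula, the 80 percent raw score from 2025 works out to about 60 percent, which is consistent with the effective rate described above.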

The program performed exceptionally poorly when evaluating ideas that the original researchers had found to be false. The software correctly identified these unsupported statements only 16.4 percent of the time in 2025. The program displayed a strong bias toward agreeing with whatever statement it was fed, acting as a compliant assistant rather than an objective analyst. This tendency to blindly confirm existing ideas creates an echo chamber that can mislead decision-makers.

Consistency proved to be an even bigger problem for the automated system. When asked the same question ten times in a row, the software frequently contradicted itself. Sometimes the program would flip back and forth between true and false on consecutive attempts.

“We’re not just talking about accuracy, we’re talking about inconsistency, because if you ask the same question again and again, you come up with different answers,” Cicek said. In 2025, the program provided identical answers across all ten attempts for only 73 percent of the statements. For more than a quarter of the questions, the software gave at least one wrong answer during the ten trials.

The lack of a stable response pattern makes the software highly unreliable for individual searches. Users who ask a question once might get a completely different answer if they simply refresh the page. “There were several cases where there were five true, five false,” Cicek said.
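The kind of consistency check described here can be sketched in a few lines. The example below is a minimal illustration, not the researchers' code: the function names and the sample data are made up, and each statement is assumed to have a list of true/false answers collected from repeated runs of the same prompt.

```python
from collections import Counter


def agreement(answers):
    """Return (most common answer, share of runs that gave it) for one statement."""
    label, count = Counter(answers).most_common(1)[0]
    return label, count / len(answers)


def unanimous_share(runs_by_statement):
    """Share of statements whose repeated runs all produced the same answer."""
    unanimous = sum(
        1 for runs in runs_by_statement.values() if len(set(runs)) == 1
    )
    return unanimous / len(runs_by_statement)


# Hypothetical data: ten repeated answers per statement
runs = {
    "H1": [True] * 10,                # fully consistent
    "H2": [True] * 5 + [False] * 5,   # a five-to-five split
    "H3": [False] * 9 + [True],       # one contradictory run
    "H4": [True] * 10,                # fully consistent
}

for name, answers in runs.items():
    label, share = agreement(answers)
    print(f"{name}: most common answer {label}, agreement {share:.0%}")

print(f"Unanimous across all runs: {unanimous_share(runs):.0%}")
```

An agreement share close to 50 percent on a true-or-false question signals that a single response carries little information, which is the situation the five-to-five splits illustrate.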

The researchers also categorized the test questions by their logical difficulty. The software did best with direct cause-and-effect relationships, where one event leads straight to another. It struggled the most with conditional statements, in which a relationship holds only under certain conditions.

These outcomes suggest that the program relies on recognizing common word patterns rather than actually understanding the concepts. It can mimic the structure of a logical argument without grasping the underlying meaning or context. The system possesses a high degree of linguistic fluency, but it lacks genuine theoretical flexibility. When faced with complex scenarios, the technology fails to adapt its reasoning.

The software remains bound by pattern recognition rather than true comprehension. “They just memorize, and they can give you some insight, but they don’t understand what they’re talking about,” Cicek said. The apparent improvements over the past year seem to stem from better text processing rather than deeper cognitive abilities.

For managers and analysts, these limitations carry substantial risks. The findings reveal that automated systems are currently too shallow to handle high-stakes decision-making on their own. As the text generated by these programs becomes smoother, users might easily miss hidden conceptual flaws.

The researchers advise professionals to use artificial intelligence for speed rather than substitution. A marketing team might use a text generator to brainstorm ideas or summarize long reports quickly. However, human experts must step in to verify whether the logic aligns with actual market evidence.

Professionals should also verify automated insights through repetition. Asking the same question multiple times can help expose underlying bias or instability in the software. Any conclusions generated by artificial intelligence should be treated as diagnostic clues rather than absolute facts.

The authors advocate for building organizational literacy regarding automated tools. Employees need to understand exactly where these programs excel and where they fail. Organizations should train their staff to audit the reasoning behind automated answers, rather than just trusting the numerical output.

The ultimate goal is to create a hybrid system that pairs human intelligence with automated speed. In this arrangement, software handles structural analysis while humans preserve interpretive judgment. This balanced approach ensures that technology supports human understanding rather than replacing it.

The authors noted a few minor limitations to their experiment. The study assumed that every published, peer-reviewed finding was entirely true or false, which leaves out some nuance in real-world science. Sometimes a scientific finding has mixed results that do not easily fit into a strict binary category.

The team also limited their consistency test to ten repetitions per question using a single software platform. Future investigations should involve a higher number of repetitions to confirm these patterns. Researchers should also test a wider variety of artificial intelligence programs to see if the flaws are universal.

Despite these limitations, the research suggests that users must remain vigilant. Human judgment remains a necessary check on these increasingly common digital systems. “Always be skeptical,” Cicek said. “I’m not against AI. I’m using it. But you need to be very careful.”

The study, “Unstable Intelligence: GenAI Struggles with Accuracy and Consistency,” was authored by Mesut Cicek, Sevincgul Ulu, Can Uslay, and Kate Karniouchina.

(c) PsyPost Media Inc
