Why AI Confidence Doesn't Track AI Accuracy in 2026 Research

2026-05-11

Editorial illustration of two adjacent dials labelled confidence and accuracy, with the confidence dial pinned to the top and the accuracy dial wavering in the middle, suggesting the two needles have come apart.

A peer-reviewed 2026 study finds that an AI assistant reports near-ceiling confidence even when its accuracy slips, while human confidence still tracks whether the answer is right. For builders shipping AI features, that gap changes what is worth surfacing to users.

TLDR

A peer-reviewed paper out of Frontiers in Artificial Intelligence in late March compared 87 people and 87 runs of GPT-4 on the same questions. The model was more accurate. It was also confidently wrong in a way humans were not, with confidence pinned near the ceiling regardless of whether the answer was right. The signal a builder reaches for when surfacing AI confidence is not the signal users need.

I shipped a small feature this week that asked the model to label its outputs with a confidence score. Looked clean on the design review. Then I read a paper from late March that made me pause. When a person answers a question, their confidence tracks whether they got it right. When the same question goes to a chat model, the confidence number barely moves. The thing builders keep wanting to surface to users is not the thing we think it is.

What the research shows

The paper is from Shun Yoshizawa and colleagues, published March 27 in Frontiers in Artificial Intelligence. They ran the same general-knowledge questions past 87 people and 87 separate runs of GPT-4. After every answer, both the people and the model reported how sure they were that the answer was right.

The model outscored the humans across every condition, 78 to 88 percent against 52 to 73 percent. The interesting part was confidence. People scaled their confidence to the difficulty of the question, sliding from about 72 percent on the easier items down to 56 percent on the open-ended ones. The model said it was 95 to 97 percent sure on the multiple-choice questions and still 93 percent sure on the open-ended ones, where its accuracy actually dropped.

"Human confidence closely tracked variations in accuracy, while GPT-4 was not. In humans, Type-2 AUROC was 0.76 ± 0.14 for the All condition while GPT-4 was 0.63 ± 0.20."

Frontiers in Artificial Intelligence, March 2026

The AUROC number is shorthand for how well a confidence rating sorts right answers from wrong ones: 0.5 is no better than chance, and 1.0 is perfect sorting. The slope of how confidence moves when accuracy moves was almost ten times steeper in the people than in the model. In practice, the model’s confidence sat near the ceiling whether or not the answer was right.
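
To make the AUROC idea concrete, here is a minimal sketch of the standard pairwise definition: the probability that a randomly chosen correct answer received higher confidence than a randomly chosen wrong one, with ties counted as half. The function name and the toy data are assumptions for illustration; this is not the paper's analysis code, and the numbers below are not from the study.

```python
def type2_auroc(confidences, correct):
    """Pairwise AUROC of confidence as a predictor of correctness.

    confidences: list of numbers (any scale, higher = more sure)
    correct:     list of booleans, same length
    Returns a value in [0, 1]; 0.5 means confidence carries no
    information about whether the answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    right = [c for c, ok in zip(confidences, correct) if ok]
    wrong = [c for c, ok in zip(confidences, correct) if not ok]
    if not right or not wrong:
        raise ValueError("need at least one correct and one incorrect answer")

    wins = 0.0
    for r in right:
        for w in wrong:
            if r > w:
                wins += 1.0   # correct answer was rated more confidently
            elif r == w:
                wins += 0.5   # a tie gives no sorting information
    return wins / (len(right) * len(wrong))


# Toy data (hypothetical): confidence that moves with correctness.
tracking = type2_auroc(
    [0.9, 0.8, 0.5, 0.6, 0.3],
    [True, True, True, False, False],
)

# Toy data (hypothetical): confidence pinned at one value.
pinned = type2_auroc(
    [0.95, 0.95, 0.95, 0.95, 0.95],
    [True, True, False, True, False],
)

print(f"tracking confidence: {tracking:.2f}")
print(f"pinned confidence:   {pinned:.2f}")
```

The pinned case comes out at exactly 0.5 however many answers are right, which is the point: a confidence score that never moves cannot sort right from wrong, no matter how accurate the underlying answers are.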

Key insight

Confidence and accuracy are two different signals in current chat models. Designing a product as if they were one signal is the calibration mistake.

A second paper, from Daniela Fernandes, Robin Welsch, Thomas Kosch and colleagues in Computers in Human Behavior late last year, ran the same logic on the human side. With 246 people doing logical-reasoning tasks, AI assistance lifted task scores by about three points and lifted self-rated performance by four. The usual pattern, in which low performers overestimate themselves the most, quietly disappeared. The twist: participants with more AI literacy were more overconfident, not better calibrated. So the model drifts confident, and the human drifts confident with it. That is the same direction as the self-efficacy erosion research we covered last week, and a close cousin of why the adoption percentage on the CTO slide is not the productivity number on the engineering side.

What it doesn’t tell us yet

The Yoshizawa paper tested one model on one task family, in a domain where the model has few ways to verify its own answers. Other models, especially the more recent reasoning ones, may calibrate differently, and the confidence number a model reports is partly a function of the prompt rather than of its internal probability. The Fernandes work is an online lab task, not a year of field deployment. Both findings are bounded. What is striking is that two careful papers, one on the model side and one on the human side, are landing in the same place.

One thing to notice in your work today

If the product you are shipping surfaces a confidence score from a model, look at the last five places that number actually changed a user’s behavior. Was it the number doing the work, the fluent prose around it, or both together? The research suggests neither one is doing the calibration work users need. The right move is probably not a prettier confidence dial. It is a separate signal that distinguishes “the model is right and knows it” from “the model is right and got lucky.”
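
One way to start that audit, sketched under assumptions: suppose the product logs each reported confidence alongside a later human check of whether the output was correct. The log format, the function names, and the sample entries below are all hypothetical. The sketch groups entries into confidence bins and compares the average reported confidence in each bin with how often the outputs in it were actually right.

```python
def calibration_table(records, n_bins=5):
    """Group (confidence, was_correct) pairs into equal-width bins.

    records: iterable of (confidence in [0, 1], bool)
    Returns one row per non-empty bin:
    (bin_low, bin_high, count, mean_confidence, accuracy)
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in records:
        if not 0.0 <= conf <= 1.0:
            raise ValueError(f"confidence out of range: {conf}")
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[idx].append((conf, ok))

    rows = []
    for i, items in enumerate(bins):
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        rows.append((i / n_bins, (i + 1) / n_bins, len(items), mean_conf, accuracy))
    return rows


def expected_calibration_error(records, n_bins=5):
    """Count-weighted average gap between stated confidence and accuracy."""
    records = list(records)
    if not records:
        raise ValueError("no records to audit")
    rows = calibration_table(records, n_bins)
    return sum(n * abs(conf - acc) for _, _, n, conf, acc in rows) / len(records)


# Hypothetical audit log: (reported confidence, verified correct?)
audit_log = [
    (0.95, True),
    (0.95, False),
    (0.93, True),
    (0.97, False),
    (0.95, True),
]

for low, high, n, conf, acc in calibration_table(audit_log):
    print(f"{low:.1f}-{high:.1f}: n={n}, stated={conf:.2f}, actual={acc:.2f}")
print(f"calibration gap: {expected_calibration_error(audit_log):.2f}")
```

If every entry lands in the top bin and the stated confidence sits well above the measured accuracy, the score is not giving users anything to act on. A table like this measures calibration only; it does not by itself separate a model that knows it is right from one that got lucky, and it depends on someone having actually verified the outcomes.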

The same noticing applies to a builder’s own work this afternoon. When the next fluent AI output is about to be accepted, especially close to home turf where the shape of your workday is shifting toward judging more than doing, pause for one beat before pasting. Did the verification actually happen, or did fluent read like correct? The skill the work is quietly asking for right now is not faster acceptance. It is reading fluent prose without feeling that fluent meant true.

Sources

Metacognition of ChatGPT in confidence judgements - Frontiers in Artificial Intelligence, 2026-03-27
AI makes you smarter but none the wiser: The disconnect between performance and metacognition - Computers in Human Behavior, 2025-10-27
Metacognitive sensitivity: The key to calibrating trust and optimal decision making with AI - PNAS Nexus, 2025-04-24
