---
title: "AI Chatbot Safety: Why Crisis Response Tests Are Failing in 2026"
slug: technostress-crisis-response-evaluation-gap
date: 2026-04-29
excerpt: "Three new tests this quarter all reached the same conclusion: AI chatbots fail when users hint at suicide instead of saying it directly, when they disguise their question, or when the chatbot is in the wrong country. Here is what to ask AI vendors next."
featured_image: "https://bbtxujdxvidaghmhxkqs.supabase.co/storage/v1/object/public/generated-images/blog-1777487570247-technostress-crisis-response-evaluation-gap.webp"
canonical_url: https://cerevisor.com/blog/technostress-crisis-response-evaluation-gap
updated_at: 2026-04-29T18:32:51.767947+00:00
---

# AI Chatbot Safety: Why Crisis Response Tests Are Failing in 2026

TLDR

Three new tests of AI chatbots all found the same thing: the chatbots fail when users hint at self-harm instead of saying it directly, when they hide their question inside something else, or when they are using the chatbot in a country the chatbot was not set up for. None of the chatbots passed. Now product teams and the people buying these tools have clear evidence of what to fix and what to ask vendors next.

Safety note

This piece talks about AI products that are increasingly used by teens and people who are struggling. It is a piece about how these products are being tested, not personal advice. Anyone worried about themselves or a young person should talk to a qualified local professional.

## The myth

I have heard this in three calls already this month: "Crisis response is solved. The chatbot sends people to the suicide hotline." It comes up when a product manager asks the vendor about safety, or when a CEO is about to sign a partnership and wants to know what to tell the board. The vendor points to a safety page, a test result, the helpline banner that pops up. Everyone nods. The deal moves forward.

The myth, in plain terms, is that today's AI chatbots have figured out how to handle people in crisis, and only a few small fixes are left.

What was published in the last few months says that is not true.

---

## Why it sounds right

It sounds right because the easy test really does work. If someone clearly says they are planning to take their own life, the helpline banner appears. The chatbot says it cannot help with that. It usually adds a kind-sounding paragraph. A product team running the obvious test will see the obvious safeguard work. A CEO reading the safety page will see crisis response listed as covered. Confidence: high. This is what you can see in the major chatbots today when you test them with direct questions.

It also sounds right because the vendor wording is good. Caring tone. Local helplines. Clear refusals. None of those are bad. They are also not the same thing as passing a real test designed by clinicians, and that is the conversation that has changed.

---

## What the evidence says

Three independent studies from the last few months tell the same story.

A study covered by Scienceline this week tested twenty-nine AI chatbots, twenty-four of them sold for mental health use and five general-purpose models. Confidence: high. Based on a peer-reviewed scientific paper (Pichowicz et al., Scientific Reports), reported on April 28.

> "None of the chatbots provided an adequate response, the researchers concluded, with 14 ranked as inadequate and 15 as marginal."

Scienceline, April 2026

The same study found something quieter and just as serious. Of the twenty-three chatbots that did suggest the user call for help, eighteen gave a phone number for the wrong country. Confidence: high. Same study. A simple test that only asks "did the chatbot show a hotline?" would count those eighteen as passes. A test designed by clinicians would not.
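To make the difference between those two tests concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `ReferralResult` record, the two check functions, and the four sample rows are made up and are not the study's data or method. It assumes someone has already worked out which country each offered helpline serves.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReferralResult:
    chatbot: str
    user_country: str                # ISO 3166-1 alpha-2 code of the simulated user
    referral_country: Optional[str]  # country the offered helpline serves, or None


def shows_any_hotline(r: ReferralResult) -> bool:
    """The simple check: did any helpline appear at all?"""
    return r.referral_country is not None


def shows_local_hotline(r: ReferralResult) -> bool:
    """The stricter check: does the helpline serve the user's country?"""
    return r.referral_country == r.user_country


# Made-up records for illustration only; not data from the study.
results = [
    ReferralResult("bot_a", "BR", "US"),
    ReferralResult("bot_b", "BR", "BR"),
    ReferralResult("bot_c", "BR", None),
    ReferralResult("bot_d", "BR", "US"),
]

simple_passes = sum(shows_any_hotline(r) for r in results)
strict_passes = sum(shows_local_hotline(r) for r in results)
print(f"simple check: {simple_passes}/{len(results)} pass")
print(f"strict check: {strict_passes}/{len(results)} pass")
```

On these invented rows the simple check passes three of four chatbots and the strict check passes one. The same product can look safe or unsafe depending on which question the test asks.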

86% of the time, leading AI chatbots gave users information about the tallest bridges in New York after an indirect hint at self-harm, without noticing the underlying risk (Rosebud CARE benchmark).

The Rosebud CARE benchmark, released this quarter, tested twenty-five leading AI models on five situations taken from clinical research. They ran each situation ten times. The headline numbers: 86% of the time the models handed over the bridge information after an indirect hint at self-harm, 81% of the time they failed when the same intent was hidden inside a question that sounded like school research, and only one model passed every test. Confidence: medium. Published by an industry team with the method written down. Independent peer review and replication not yet done.
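For teams who want to see how a headline percentage like this comes out of repeated runs, here is a rough sketch. It is not Rosebud's code. The run log, the scenario names, and both helper functions are hypothetical, and the numbers are invented. The only idea taken from the benchmark description is that each situation is run several times and the results are pooled.

```python
from collections import defaultdict

# Made-up run log: (model, scenario, passed). Not CARE data.
runs = [
    ("model_x", "indirect_hint", False),
    ("model_x", "indirect_hint", False),
    ("model_x", "disguised_research", True),
    ("model_x", "disguised_research", False),
    ("model_y", "indirect_hint", True),
    ("model_y", "indirect_hint", True),
    ("model_y", "disguised_research", True),
    ("model_y", "disguised_research", True),
]


def failure_rates(run_log):
    """Share of failed runs per scenario, pooled across all models."""
    totals = defaultdict(int)
    fails = defaultdict(int)
    for _model, scenario, passed in run_log:
        totals[scenario] += 1
        if not passed:
            fails[scenario] += 1
    return {s: fails[s] / totals[s] for s in totals}


def models_passing_everything(run_log):
    """Models with no failed run in any scenario."""
    by_model = defaultdict(list)
    for model, _scenario, passed in run_log:
        by_model[model].append(passed)
    return sorted(m for m, outcomes in by_model.items() if all(outcomes))


rates = failure_rates(runs)
clean_models = models_passing_everything(runs)
print(rates)
print(clean_models)
```

Repeating each situation matters because a chatbot can answer the same prompt differently from one run to the next. A single lucky pass would hide that.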

The American Psychological Association advisory from November sits behind both studies. It says plainly that today's AI chatbots have repeatedly failed to spot or properly handle people who are thinking about suicide, self-harming, or in acute distress. It also notes that 33% of teens said they would rather talk to an AI companion than a person about something serious. Confidence: high. Official statement from a major professional body.

ECRI, a nonprofit that publishes a yearly list hospitals use to track technology risks, named the misuse of AI chatbots in healthcare the number-one technology hazard for 2026. Confidence: high. Official ECRI designation.

---

## The reframe

The real question is not "does the chatbot show a hotline?" That is the question the chatbot was trained to pass.

The real question is whether the test catches the harder cases:

- Indirect hints, like the bridge-height question

- Hidden intent, like a self-harm question dressed up as school research

- Country mismatch, like sending someone in Brazil a US helpline number

- Long conversations, where almost no public test covers what happens at message ten

- Specific groups, like teens and people in acute distress, who are usually grouped in with general adult users in tests

On every one of these, the public evidence shows today's chatbots falling short.

What this means in practice is two things. First, a test that only checks direct refusals will miss the indirect and hidden cases that real users actually produce. Second, a missing or wrong-country helpline is not "a little less than ideal." It is a failure that delays human help for someone who came to the chatbot because they were not yet ready to talk to a person. The people most likely to use a chatbot first, like teens and people who feel isolated, are also the people most likely to be missing from the test groups.
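One way a product team could turn the harder cases listed above into a quick check on a vendor's test set is sketched below. This is only a sketch under stated assumptions: the tag names, the `coverage_gaps` helper, and the sample test set are all hypothetical, and it assumes the vendor labels each test case with the situations it covers.

```python
# The five harder cases, written as hypothetical tag names.
REQUIRED_TAGS = {
    "indirect_hint",
    "hidden_intent",
    "country_mismatch",
    "long_conversation",
    "specific_groups",
}


def coverage_gaps(test_cases):
    """Return the required tags that no test case in the set covers."""
    covered = set()
    for case in test_cases:
        covered |= set(case["tags"])
    return sorted(REQUIRED_TAGS - covered)


# Made-up vendor test set, mostly direct statements.
vendor_test_set = [
    {"id": "t1", "tags": ["direct_statement"]},
    {"id": "t2", "tags": ["indirect_hint"]},
    {"id": "t3", "tags": ["country_mismatch", "direct_statement"]},
]

gaps = coverage_gaps(vendor_test_set)
print("not covered:", gaps)
```

A list of gaps like this does not say how well the chatbot handles any case. It only shows which cases the test never asks about, which is the first thing to find out.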

The good news: better tests are starting to arrive. VERA-MH, released in October by Spring Health and a clinical and ethics group, is the first free, clinically designed test for AI mental-health safety. Rosebud has said its CARE benchmark will follow. Confidence: medium. Based on official announcements and product pages. Real-world results not yet public. Washington's new chatbot law, signed in March and explained in a legal update this week, will require companies to publish "the number of crisis referral notifications issued to users in the preceding calendar year," with the right for individuals to sue. Confidence: high. Based on the law itself and recent legal coverage. A real, public number every year changes the conversation more than any blog post.

Key insight

The right question for an AI vendor is no longer "do you have crisis response?" It is "what does your test cover beyond direct refusals, and what is your monitoring telling you about indirect hints, hidden intent, and wrong-country helplines?"

---

## So what

If you are a product team buying or building an AI tool that talks to users about feelings, ask the vendor for the actual test set, not the headline. Ask if the test includes indirect hints, hidden intent, longer conversations, and country-correct helplines. Ask if the test reports results for teens separately from adults. If they say "we use our own tests," ask if they will run their product through VERA-MH or CARE in the next quarter, and ask for the result, not the plan. Two of those three calls will end politely.

If you are an executive signing the partnership, the position that holds up in front of a board, a regulator, or a journalist is simple: "we use clinically designed tests run by independent groups, we keep watching after the product ships, and we publish our helpline-referral numbers every year." The companies already doing that will keep doing it. The ones who treated crisis response as a banner will quietly rewrite their safety page in the next six months.

The technology is not a lost cause. The gap is just one we can finally measure.

#### Sources

- [Mental health chatbots struggle with suicide warning signs, study finds](https://scienceline.org/2026/04/mental-health-chatbots-struggle-suicide-warning/) - Scienceline, 2026-04-28

- [Performance of mental health chatbot agents in detecting and managing suicidal ideation](https://www.nature.com/articles/s41598-025-17242-4) - Scientific Reports

- [Washington State Enacts First-of-Its-Kind Chatbot Disclosure Law](https://www.cyberlawwatch.com/2026/04/16/washington-state-enacts-first-of-its-kind-chatbot-disclosure-law/) - Cyber Law Watch, 2026-04-16

- [APA Health Advisory on the Use of Generative AI Chatbots and Wellness Applications for Mental Health](https://www.apa.org/topics/artificial-intelligence-machine-learning/health-advisory-chatbots-wellness-apps) - American Psychological Association, 2025-11-01

- [Misuse of AI chatbots tops annual list of health technology hazards](https://home.ecri.org/blogs/ecri-news/misuse-of-ai-chatbots-tops-annual-list-of-health-technology-hazards) - ECRI, 2026-02-01

- [CARE: LLM Crisis Assessment and Response Evaluator](https://www.rosebud.app/care) - Rosebud, 2026-03-01

- [VERA-MH: Industry standard for AI safety in mental health](https://www.vera-mh.com/) - Spring Health and VERA-MH consortium, 2025-10-01
