From Fictional Lawsuits to Zombie Attacks — A Guide to AI Hallucinations

October 28, 2024 | 18 min read | mimo.ooo

In 2023, a New York lawyer made headlines after filing a court brief filled with… invented cases. The source was a popular chatbot that produced nonsense with total confidence. That was everyday reality in the GPT-3 and GPT-4 era: models impressed with their eloquence but could "make up" facts, quotes, and even entire documents. In critical fields like law or medicine, such hallucinations were a serious risk.

Today, the data tells a different story. On the LongFact-Concepts benchmark, GPT-5 in "thinking mode" hallucinates in fewer than 1% of cases, while earlier models erred up to a dozen times as often. In difficult medical scenarios on HealthBench, the advantage is just as clear: GPT-5 logs only 1.6% hallucinations, compared with 12.9% for OpenAI o3 and 15.8% for GPT-4o. This is not a cosmetic improvement but a qualitative leap that changes the conversation about what AI can be in practice.

And yet — even at such a low error rate — hallucinations haven't disappeared completely. To understand why, we have to go back to the source of the problem, see what it looks like in practice, and how we measure it.

What AI hallucinations are

A hallucination in humans is a sensory illusion — we see, hear, or feel something that isn't there. In AI, the term is a metaphor. An AI model has no eyes or ears, but it can "produce" content that looks real even when it's entirely fabricated. Scientifically, this is described as generating information that has no grounding in reality or in the model's source data.

It can be a small thing — a wrong date or a non-existent quote — or something at scale: an entire fake biography, a fictional legal precedent, or an untrue medical report. And most importantly: AI delivers it with the same confidence it uses for factual information.

Experts distinguish four main types of AI hallucinations. Factual hallucinations involve inventing untrue information (e.g., attributing a NASA discovery to a telescope that never made it). Logical hallucinations come from faulty reasoning — for example, drawing contradictory conclusions from correct premises. Perceptual hallucinations occur in computer vision systems when AI "sees" an object where none exists. Multimodal hallucinations affect models combining formats (e.g., text and image) and involve adding elements that don't exist in either — like an AI describing an empty beach while mentioning deck chairs and umbrellas that weren't there.

The common denominator? Confidence. A hallucinating AI won't flag that it might be wrong — on the contrary, it can build a narrative so coherent that the user sees no reason to question it.

Where they come from

The source of hallucinations lies in the very construction of language models. LLMs like GPT-5 don't store facts in databases and don't "understand" the world. Their job is to predict the most likely next word in a sentence based on billions of examples seen during training. When a model can't find an unambiguous answer in its "experience," it won't say "I don't know" — it will try to guess in a way that sounds coherent and credible.

The problem grows when training data contains errors, inaccuracies, or biased content. The model can then reproduce these distortions — and even build new "facts" on top of them. Another factor is pressure toward creativity: the more settings increase output diversity (for example via a higher temperature), the greater the risk the model generates information detached from reality.
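The effect of temperature can be made concrete with a small sketch. The candidate tokens and their scores below are made up for illustration, and `softmax_with_temperature` is a hypothetical helper rather than any vendor's API. It only shows how dividing the scores by a higher temperature moves probability toward unlikely continuations.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens (illustrative numbers only).
candidates = ["Paris", "Lyon", "Atlantis"]
logits = [4.0, 2.0, 0.5]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, {c: round(p, 3) for c, p in zip(candidates, probs)})
```

At a low temperature nearly all probability sits on the top-scoring token. At a higher temperature the implausible "Atlantis" becomes a realistic pick, which is the mechanism behind the diversity-versus-accuracy trade-off described above.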

Computer vision models work similarly, except they analyze image patches instead of words. If a fragment resembles a pattern the system associates with a specific object, it will "see" it — even if in reality it's just a random shadow or artifact. In autonomous vehicles, this can lead to situations where a car brakes for a shadow on the road, interpreting it as an obstacle.

The paradox is that the same traits that make AI so impressive — the ability to produce fluent, rich narratives or generate detail-filled images — are also the traits that increase hallucination risk. Creativity and error go hand in hand, and the boundary between a creative answer and a falsehood can be extraordinarily thin.

The history of the problem

Although AI hallucinations are now a near-daily media topic, their roots stretch back decades. In 1966, Joseph Weizenbaum created ELIZA — a simple program simulating a psychotherapist. It didn't understand words in a human sense; it applied phrase-matching rules to sound like it was having a meaningful conversation. Even so, some users attributed almost human intelligence to it, reacting emotionally to responses generated by rigid scripts.

Over the following decades, expert systems emerged that could answer in specific domains — from medical diagnosis to weather forecasting. When they made mistakes, those were logical errors or gaps in knowledge bases. There was no element of "creating" facts out of nothing, because these systems operated within hard rules.

The breakthrough came with large language models trained on billions of internet examples. In 2020, GPT-3 impressed with its fluency, but it quickly became clear that it also produced entirely fabricated information. GPT-4, released in 2023, was more accurate, yet it still sometimes fabricated sources, mixed up facts, or produced false numeric data.

The issue gained public attention through several high-profile incidents. The most famous is the New York lawyer case: in May 2023, a lawyer filed a court submission based on precedents invented by ChatGPT. That same year, Google Bard made a factual error during a demo, attributing to the James Webb Space Telescope a discovery made years earlier. In medicine, researchers noted cases where AI added symptoms to a patient history that the patient never had.

Each of these moments fueled the debate about whether generative AI can be trusted at all. Only in 2025 did charts show a real shift — GPT-5 reduced hallucination rates to low single digits, reaching results that just a year earlier seemed unattainable.

Hallucinations in practice — industry case studies

Medicine

In one U.S. hospital, an AI system was tested to support doctors in analyzing drug interactions. During a consultation for a patient with cardiac issues, the model generated a warning about an allegedly dangerous reaction between two common medications. Sources? They didn't exist. The AI combined fragments of research about entirely different substances and created a new, fictional interaction. The physician, trusting the tool, changed the therapy to a less effective one, delaying the patient's improvement. The original treatment was reinstated only days later, once the error was discovered.

Transport

Tesla drivers have reported phantom braking for years — sudden braking on an empty road. Analyses indicated that in some cases the culprit was the vision recognition system interpreting a shadow on asphalt as an obstacle. This is a classic perceptual hallucination: the model "saw" an object that wasn't there and reacted as if it existed. At highway speeds, such behavior can cause a pileup if vehicles behind cannot brake in time.

Law

In May 2023, lawyer Steven Schwartz filed a court brief containing legal precedents that had never existed. The chatbot he used generated them from scratch — with case names, dates, and ruling descriptions. As a result, Schwartz was fined and publicly embarrassed. The case echoed widely across the legal community, and some law firms introduced internal bans on using generative AI without additional verification.

Finance

An experimental investing bot published a report claiming a publicly traded company had gone bankrupt, citing a "leaked document." The information was false, but it drove the stock price down by well over ten percent within hours. Investors lost millions before the news was corrected. This shows how an AI hallucination can influence financial markets at a speed traditional misinformation can't match.

Public administration

In one American city, a chatbot on the municipal website announced that a parade was canceled due to… a "zombie attack." In this case it ended in memes and laughter — but it's easy to imagine how a similar hallucination could cause panic in a real crisis. What's more, for several hours the information circulated on social media as "confirmed" by the city's official channel, showing how AI-generated falsehood can quickly gain the appearance of authority.

How to measure hallucinations — data and benchmarks

Assessing whether AI is "making things up" requires more than casually reading its answer. That's why researchers developed test sets — benchmarks — that measure hallucination frequency and scale in a repeatable way.

One is LongFact, a set of prompts whose answers require many connected facts. Each fragment of the answer is checked against reliable sources. If the AI provides information that cannot be confirmed, it counts as a hallucination. This is where GPT-5 in "thinking mode" scored below 1% errors on LongFact-Concepts, while earlier models recorded rates up to a dozen times higher.

In medicine, a key benchmark is HealthBench — a set of clinical questions and scenarios evaluated by physicians. It measures both factual correctness and answer safety. In this test, GPT-5 recorded 1.6% hallucinations — the best result among OpenAI models. For comparison, model o3 scored 12.9%, and GPT-4o scored 15.8%.
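To put the HealthBench figures in proportion, the short calculation below works out how many times higher the older models' rates are and the relative reduction GPT-5 achieves. It uses only the three percentages quoted above; the helper names are illustrative.

```python
# Hallucination rates on HealthBench as quoted in the text (percent).
rates = {"GPT-5": 1.6, "OpenAI o3": 12.9, "GPT-4o": 15.8}

def times_higher(old, new):
    """How many times larger the old rate is than the new one."""
    return old / new

def relative_reduction(old, new):
    """Fraction by which the rate fell when moving from old to new."""
    return (old - new) / old

gpt5 = rates["GPT-5"]
for name in ("OpenAI o3", "GPT-4o"):
    old = rates[name]
    print(f"{name}: {times_higher(old, gpt5):.1f}x GPT-5's rate, "
          f"{relative_reduction(old, gpt5):.1%} relative reduction")
# OpenAI o3: 8.1x GPT-5's rate, 87.6% relative reduction
# GPT-4o: 9.9x GPT-5's rate, 89.9% relative reduction
```

In other words, the quoted numbers correspond to roughly an eight- to ten-fold drop on this benchmark.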

Another approach is FActScore, which splits an AI answer into individual atomic claims and automatically checks each one against a trusted knowledge source such as Wikipedia. This makes it possible to assess what percentage of a response is grounded in reality.
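The claim-level idea can be sketched in a few lines. This is a simplified illustration, not the actual FActScore or LongFact code: the sentence splitter is naive, and the "knowledge base" is a hypothetical set of pre-verified statements standing in for a real retrieval and verification step.

```python
def split_into_claims(answer):
    """Naive stand-in for claim extraction: one sentence = one claim."""
    return [s.strip() for s in answer.split(".") if s.strip()]

def hallucination_rate(answer, knowledge):
    """Share of claims not found in the trusted knowledge set (0.0 to 1.0)."""
    claims = split_into_claims(answer)
    if not claims:
        raise ValueError("answer contains no claims")
    unsupported = [c for c in claims if c.lower() not in knowledge]
    return len(unsupported) / len(claims)

# Hypothetical trusted statements (lower-cased); both appear in this article.
knowledge = {
    "eliza was created in 1966",
    "gpt-3 was released in 2020",
}

answer = (
    "ELIZA was created in 1966. "
    "GPT-3 was released in 2020. "
    "ELIZA passed the bar exam."
)

rate = hallucination_rate(answer, knowledge)
print(f"{rate:.0%} of claims are unsupported")
```

Real benchmarks replace the exact-match lookup with retrieval plus a model or human judge, but the final score is still the same kind of ratio: unsupported claims divided by all claims.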

These numbers and methods let us compare models objectively. They also show that reducing hallucinations is possible — but requires both better architecture and training grounded in fact verification.

How to fight it

Reducing AI hallucinations is a combination of better technology, procedures, and conscious usage. The most common method today is RAG (Retrieval-Augmented Generation) — before producing an answer, the model searches reliable sources (knowledge bases, the internet) and weaves retrieved information into the response. In OpenAI internal tests, this technique reduced hallucinations by about two thirds. In practice, that means a medical chatbot integrated with an up-to-date clinical research database will far less often provide outdated or false information.
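A minimal sketch of the retrieve-then-generate flow is shown below. It rests on loud assumptions: the retriever ranks documents by simple word overlap (production systems typically use embeddings), the documents are invented examples, and the final call to a language model is left out entirely, so the code only builds the grounded prompt.

```python
import re

def tokenize(text):
    """Lower-case the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, k=2):
    """Toy retriever: rank documents by word overlap with the query."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return [d for d in ranked[:k] if q & tokenize(d)]

def build_prompt(query, passages):
    """Combine retrieved passages and the question into one grounded prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the sources below. "
        "If they do not contain the answer, say you do not know.\n"
        f"Sources:\n{context}\n"
        f"Question: {query}"
    )

# Hypothetical knowledge base for a municipal chatbot.
documents = [
    "The parade starts at 10:00 on Main Street.",
    "The city library is closed on Mondays.",
    "Parking near Main Street is free during the parade.",
]

query = "When does the parade start?"
print(build_prompt(query, retrieve(query, documents)))
```

The instruction to answer only from the listed sources, and to admit when they are silent, is what gives the model an alternative to guessing.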

A second approach is post-hoc verification — a response produced by one model is sent to another model whose job is to detect potential errors and flag fragments needing correction. Commercial versions can even automatically rewrite problematic sentences before the user sees them.
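The generate-then-verify pattern can be illustrated with a deliberately simple checker. Here the "second model" is replaced by a hypothetical rule, `numbers_supported`, which flags any sentence containing a number that does not appear in the source text. A real verifier would be another model or a fact-checking service; only the pipeline shape is the point.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")

def numbers_supported(sentence, source):
    """Toy verifier: every number in the sentence must occur in the source."""
    source_numbers = set(NUMBER.findall(source))
    return all(n in source_numbers for n in NUMBER.findall(sentence))

def review(sentences, source, verifier=numbers_supported):
    """Run each drafted sentence through the verifier and collect failures."""
    flagged = [s for s in sentences if not verifier(s, source)]
    return {"approved": not flagged, "flagged": flagged}

# Source text and a draft answer; the second sentence has a wrong figure.
source = "GPT-5 recorded 1.6% hallucinations on HealthBench, while o3 scored 12.9%."
draft = [
    "GPT-5 recorded 1.6% on HealthBench.",
    "o3 scored 21.9%.",
]

result = review(draft, source)
print(result)
```

Flagged sentences can then be rewritten, removed, or passed to a human before the user sees the answer.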

There are also hybrid models combining generative AI with classic databases and symbolic logic. For example, in Google Gemini, part of the information comes from Google's large Knowledge Graph, reducing the risk of inventing non-existent facts.

In critical applications, the "human in the loop" principle is increasingly standard. In medicine, every AI recommendation must be approved by a physician; in finance, by an analyst; and in transport, a driver or operator must be able to take over control.
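In software, this principle usually takes the form of a gate that decides which outputs may go out unreviewed. The sketch below is one possible shape, under stated assumptions: the domain list mirrors the examples in the text, while the confidence score and the 0.9 threshold are hypothetical parameters chosen for illustration.

```python
from dataclasses import dataclass

# Domains named in the text as requiring human approval.
HIGH_RISK_DOMAINS = {"medicine", "finance", "transport"}

@dataclass
class Decision:
    action: str  # "auto_release" or "human_review"
    reason: str

def route(domain, confidence, threshold=0.9):
    """Decide whether an AI output may be released without human sign-off."""
    if domain in HIGH_RISK_DOMAINS:
        return Decision("human_review", f"{domain} is a high-risk domain")
    if confidence < threshold:
        return Decision("human_review", "confidence below threshold")
    return Decision("auto_release", "low-risk domain and confident output")

print(route("medicine", 0.99))
print(route("marketing", 0.95))
print(route("marketing", 0.50))
```

Note that in high-risk domains the gate ignores the confidence score altogether: a hallucinating model is confident by definition, so confidence alone cannot be the release criterion.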

Finally, there's user education. Knowing AI can be wrong — and being able to verify outputs — is a simple but extremely effective tool against hallucinations. In everyday work, the "trust but verify" rule helps: copying an answer into a search engine is often enough to separate fact from fiction.

The future and regulation

The rapid drop in hallucination rates in GPT-5 shows technology can reach a point where the problem stops dominating the generative-AI conversation. In an optimistic scenario, in a few years models will be able to explicitly signal uncertainty and refuse to answer when the risk of falsehood is high. Autopilots will stop reacting to road shadows, medical chatbots won't add non-existent symptoms, and legal tools will limit themselves to citing verified precedents.

In a realistic scenario, hallucinations won't disappear entirely, but will become rare and clearly labeled. In critical industries, the "human in the loop" model will become widely adopted, and systems will undergo certification similar to what medical devices or aviation software undergo today.

A pessimistic scenario assumes that several serious failures — for example, a wrong therapeutic recommendation leading to tragedy, or misinformation destabilizing markets — will trigger a sharp public and regulatory backlash. In response, restrictions could emerge limiting AI use in some domains, slowing innovation.

Regulation will play a key role. In the European Union, the AI Act entered into force in August 2024 and is being applied in stages; it sets separate requirements for high-risk systems, including documenting data sources, reporting safety testing, and ensuring audit mechanisms. In the United States, there is debate over federal quality and accountability standards. Increasingly, there are also calls for global agreements, so that fact-verification standards remain consistent regardless of country.

Conscious use of AI

The story of the New York lawyer who fully trusted a hallucinating chatbot is now almost an anecdote. But its moral still holds: even the best model — like GPT-5, which makes an error in under one percent of answers in tests — can still be wrong.

That's why conscious use of AI tools is key. In practice, this means treating them like brilliant but sometimes distracted assistants: you can assign them information gathering, data analysis, or drafting — but final decisions, especially in high-risk areas, should belong to a human.

It's also about habits. Checking sources, verifying numbers, asking follow-up questions — these simple steps are the best protection even against the rarest hallucinations. Technological progress is impressive, and the gap between GPT-5 and earlier models is huge, but awareness of AI's limitations remains the foundation of safe use.

---

Sources

  • OpenAI – GPT-5 Developer Release: LongFact-Concepts and HealthBench test results, comparative data vs o3 and GPT-4o. https://openai.com/research/gpt-5
  • OpenAI – System Card GPT-4o and o3: methodology documentation for factual accuracy tests. https://openai.com/research/o3, https://openai.com/research/gpt-4o-system-card
  • LongFact Benchmark: dataset description and hallucination measurement methodology. https://github.com/google-research/longfact
  • HealthBench Benchmark: methodology and results for language models in medicine. https://github.com/openai/health-bench
  • FActScore: automated fact verification. https://github.com/allenai/factscore
  • Reuters: the Steven Schwartz lawyer case. https://www.reuters.com/legal/chatgpt-cited-fake-cases-2023-05-27/
  • The Verge: Google Bard's JWST demo error. https://www.theverge.com/2023/2/8/google-bard-demo-error
  • NHTSA Investigation PE22-002: Tesla phantom braking. https://www.nhtsa.gov/investigations/PE22-002
  • AI Act: EU regulation draft. https://artificialintelligenceact.eu/
  • US Senate: U.S. regulatory debate. https://www.commerce.senate.gov/2023/5/oversight-of-ai
---

This is an authorized translation of the original article.

Document prepared by mimo.ooo.

Key takeaways

  • GPT-5 with thinking mode hallucinates <1% (LongFact) vs double-digit % in earlier models
  • 4 types of hallucinations: factual, logical, perceptual, multimodal
  • Lawyer Schwartz (2023): ChatGPT-invented precedents made it into court
  • RAG (Retrieval-Augmented Generation) reduces hallucinations by ~66%
  • The AI Act introduces certification and auditing for high-risk systems

TL;DR

AI hallucinations are the generation of information with no grounding in reality. GPT-5 with thinking mode reaches <1% errors in LongFact and 1.6% in HealthBench, a qualitative leap versus GPT-4o (15.8%). We distinguish four types: factual (made-up facts), logical (faulty inferences), perceptual (Tesla phantom braking), and multimodal. High-profile cases include: NYC lawyer Schwartz citing non-existent precedents, Google Bard misattributing a JWST discovery, and a bot announcing a 'zombie attack.' Defenses include: RAG (~66% fewer hallucinations), post-hoc verification, and human-in-the-loop. Regulation: the EU AI Act introduces certification for high-risk systems. Rule: trust, but verify.
