How to Fact-Check a Whole Paragraph with an AI Agent (Python + OpenRouter)

To fact-check a paragraph with an AI agent, you split the job into two steps: extract every distinct claim, then verify each claim separately. A single agent that tries to check a whole paragraph at once loses track of the individual claims and exhausts its search budget before it reaches a verdict. Two specialized steps, an Extractor and a Researcher, produce a clean per-claim verdict map instead.

This is the second phase of Trinetra, a fact-checking agent I'm building in public. Phase 1 verified one claim at a time. This post adds paragraph mode, and along the way it surfaces a lesson I did not expect: the choice of model mattered more than any line of code I wrote.

The full project is on GitHub: github.com/TheRavi/trinetra. If you want the single-claim foundation first, start with the Phase 1 post on building the agent from scratch.

Quick answer

To fact-check a paragraph with an AI agent:

1. Extract the claims. One model call reads the paragraph and returns a structured list of distinct factual claims, each classified by type.
2. Decompose compound claims. "Founded in 2015 by Sam Altman and Elon Musk" is really three separate claims with different evidence. Split them.
3. Verify each claim separately. Run your single-claim fact-checking loop once per claim.
4. Assemble a Claims Atlas. Merge the per-claim verdicts into one structured output you can read or save.
5. Pin a strong model. The reasoning quality of the verdict depends almost entirely on the model. A random free model gave me a confidently wrong verdict; a stronger one fixed it with no code change.

Why a paragraph is harder than a claim

Phase 1 of Trinetra verified a single claim. You hand it "Python was created by Guido van Rossum," it searches, and it gives you a verdict. That works because the agent has exactly one thing to research.

The trouble starts the moment you paste real writing. Real writing comes in paragraphs, and a paragraph is a tangle of claims, some checkable, some not. Here is the test paragraph I kept coming back to:

OpenAI was founded in 2015 by Sam Altman and Elon Musk. GPT-4 has 1.7 trillion parameters and was trained on 25,000 H100 GPUs. AI will replace 50% of jobs by 2030.

Count the claims. There are at least five in there: a founding year, a list of founders, a parameter count, a training-hardware claim, and a prediction. They have wildly different evidence profiles. The founding year is settled fact. The parameter count is a rumor OpenAI never confirmed. The jobs claim is a prediction you cannot verify against anything that exists today.

When I fed this paragraph to the Phase 1 agent, it broke. It tried to research everything at once, burned through its search budget, and crashed without producing a verdict. That failure is what Phase 2 exists to fix.

The two-agent design

The fix is to stop asking one agent to do two jobs. Paragraph mode uses two specialized roles.

The Extractor reads the paragraph and returns a structured list of claims. It does not search the web. It does not form verdicts. Its only job is to pull apart the paragraph into atomic, checkable units and label each one. One model call, structured output, done.

The Researcher is the Phase 1 agent, completely unchanged. It takes one isolated claim, searches, reasons, and produces a verdict. Phase 2 just calls it once per claim.

        paragraph
            │
            ▼
   ┌────────────────┐
   │  Extractor     │   one model call, no tools,
   │                │   returns classified claims
   └────────┬───────┘
            │
            ▼
    for each claim:
            │
            ▼
   ┌────────────────┐
   │  Researcher    │   the Phase 1 loop, unchanged:
   │  (agent loop)  │   search, read, conclude
   └────────┬───────┘
            │
            ▼
      Claims Atlas

The discipline here matters. The Researcher does not change at all between phases. That is the whole point of building in phases: each new phase wraps the previous one instead of rewriting it. If Phase 2 had required me to tear up the Phase 1 agent, that would have been a sign the Phase 1 design was wrong.

Step 1: Classify claims, do not just list them

The Extractor does more than split sentences. It classifies each claim into one of four types, because not every claim is checkable the same way.

verifiable_fact: has a checkable answer. Dates, names, numbers, events.
contested_opinion: framed as fact but really opinion. "Most experts agree X."
prediction: about the future. No current evidence can verify it.
pure_opinion: subjective preference. "Vanilla is the best flavor."

Here is the Extractor, using OpenRouter through the OpenAI SDK with structured output:

from pydantic import BaseModel, Field
from typing import Literal
 
ClaimType = Literal["verifiable_fact", "contested_opinion", "prediction", "pure_opinion"]
 
class ExtractedClaim(BaseModel):
    id: str = Field(description="Stable id like 'c1', 'c2'")
    text: str = Field(description="The claim as a clean self-contained sentence")
    type: ClaimType
 
class ExtractionResult(BaseModel):
    claims: list[ExtractedClaim]
 
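def extract_claims(paragraph: str) -> list[ExtractedClaim]:
    """Run one Extractor call and return the validated, classified claims."""
    # client, EXTRACTOR_MODEL, and EXTRACTOR_PROMPT are module-level names (not shown here).
    response = client.chat.completions.create(
        model=EXTRACTOR_MODEL,
        messages=[
            {"role": "system", "content": EXTRACTOR_PROMPT},
            {"role": "user", "content": paragraph},
        ],
        temperature=0.1,  # low temperature for more repeatable extraction
        response_format={"type": "json_object"},  # ask for a JSON object, not prose
    )
    # An empty reply becomes "{}", which fails schema validation instead of passing silently.
    raw = response.choices[0].message.content or "{}"
    return ExtractionResult.model_validate_json(raw).claims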


The classification earns its keep later. A prediction like "AI will replace 50% of jobs by 2030" cannot be verified, so when the Researcher checks it, the honest verdict is "unverifiable." The type label and the verdict end up agreeing, which is a small built-in sanity check.
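That sanity check can be made explicit. Here is a minimal sketch, assuming the atlas entries have the shape built in Step 3 (an id, a type, and a verdict dict with a "verdict" key). The helper name and the UNCHECKABLE_TYPES set are hypothetical, not part of Trinetra.

```python
# Hypothetical helper, not part of Trinetra: flag claims whose type says
# "cannot be checked" but whose verdict is something other than "unverifiable".
UNCHECKABLE_TYPES = {"prediction"}  # assumption: only predictions are treated as uncheckable


def type_verdict_mismatches(atlas_claims: list[dict]) -> list[str]:
    """Return the ids of claims where the type label and the verdict disagree."""
    mismatches = []
    for item in atlas_claims:
        verdict = item["verdict"]["verdict"]
        if item["type"] in UNCHECKABLE_TYPES and verdict != "unverifiable":
            mismatches.append(item["id"])
    return mismatches


sample = [
    {"id": "c1", "type": "verifiable_fact", "verdict": {"verdict": "supported"}},
    {"id": "c2", "type": "prediction", "verdict": {"verdict": "unverifiable"}},
    {"id": "c3", "type": "prediction", "verdict": {"verdict": "supported"}},
]
print(type_verdict_mismatches(sample))  # ['c3']
```

A mismatch does not say which side is wrong. It only marks a claim worth a second look.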

Step 2: Decompose compound claims

This is the most important rule in the Extractor's prompt. A single sentence can hide several claims:

OpenAI was founded in 2015 by Sam Altman and Elon Musk.

That is not one claim. It is at least three:

OpenAI was founded in 2015.
OpenAI was founded by Sam Altman.
OpenAI was founded by Elon Musk.

Each one needs its own evidence. The year is settled. Sam Altman and Elon Musk were both genuinely co-founders, but the original sentence also implies they were the only founders, which is false (Greg Brockman, Ilya Sutskever, and others were co-founders too). If you check the compound sentence as one unit, you get a muddy verdict. Split it, and each piece gets a clean answer.

The prompt teaches this with an explicit example:

Decomposition rule:
Multi-part claims MUST be broken into separate claims.
   Input:  "OpenAI was founded in 2015 by Sam Altman and Elon Musk."
   Output: TWO claims:
           c1: "OpenAI was founded in 2015."
           c2: "OpenAI was founded by Sam Altman and Elon Musk."

In practice the Extractor sometimes keeps "Sam Altman and Elon Musk" together as one claim and sometimes splits them. Both are reasonable. What matters is that the founding year gets separated from the founding people, because those need different searches.
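To make the target concrete, this is roughly what the Extractor's JSON looks like for that sentence when it follows the rule. The reply below is hand-written for illustration, not a captured model response, and the ALLOWED_TYPES name is my own. It uses only the standard library so it runs without Pydantic.

```python
import json

# The four type labels from Step 1.
ALLOWED_TYPES = {"verifiable_fact", "contested_opinion", "prediction", "pure_opinion"}

# Hand-written illustration of an Extractor reply, not a captured model response.
raw = """
{"claims": [
  {"id": "c1", "text": "OpenAI was founded in 2015.", "type": "verifiable_fact"},
  {"id": "c2", "text": "OpenAI was founded by Sam Altman and Elon Musk.", "type": "verifiable_fact"}
]}
"""

claims = json.loads(raw)["claims"]
for claim in claims:
    # Every claim should carry one of the four labels.
    assert claim["type"] in ALLOWED_TYPES
    print(claim["id"], claim["type"], claim["text"])
```

The year and the founders land in separate entries, so each one can be sent to the Researcher with its own search.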

Step 3: Verify each claim and assemble the atlas

With the claims extracted, the orchestrator loops through them and calls the unchanged Phase 1 Researcher on each one:

def fact_check_paragraph(paragraph: str) -> dict:
    extracted = extract_claims(paragraph)
 
    atlas_claims = []
    for claim in extracted:
        try:
            verdict = fact_check_with_retry(claim.text)
        except Exception as e:
            verdict = {
                "claim": claim.text,
                "verdict": "unverifiable",
                "confidence": "low",
                "reasoning": f"Could not verify. Error: {e}",
                "evidence": [],
            }
        atlas_claims.append({
            "id": claim.id,
            "text": claim.text,
            "type": claim.type,
            "verdict": verdict,
        })
 
    return {"input_text": paragraph, "claims": atlas_claims, ...}


Note the try/except around each claim. If one claim fails, it degrades to a "could not verify" placeholder instead of crashing the whole paragraph. That single defensive choice is the difference between losing all your work and losing one claim.

The result is the Claims Atlas: every claim, its type, its verdict, its confidence, and its cited evidence, rendered as colored panels in the terminal.
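Because the atlas is a plain dict, it is easy to post-process. A small sketch of a one-line summary, assuming the atlas shape returned by fact_check_paragraph above; summarize_atlas is a hypothetical helper, not something in the repo.

```python
from collections import Counter


def summarize_atlas(atlas: dict) -> str:
    """Count verdict labels across all claims and return a one-line summary."""
    counts = Counter(item["verdict"]["verdict"] for item in atlas["claims"])
    # Sort by label so the summary is stable from run to run.
    parts = [f"{n} {label}" for label, n in sorted(counts.items())]
    return f"{len(atlas['claims'])} claims: " + ", ".join(parts)


# Illustrative atlas with made-up verdicts, trimmed to the fields the summary needs.
atlas = {
    "input_text": "...",
    "claims": [
        {"id": "c1", "type": "verifiable_fact", "verdict": {"verdict": "supported"}},
        {"id": "c2", "type": "verifiable_fact", "verdict": {"verdict": "supported"}},
        {"id": "c3", "type": "prediction", "verdict": {"verdict": "unverifiable"}},
    ],
}
print(summarize_atlas(atlas))  # 3 claims: 2 supported, 1 unverifiable
```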

The lesson I did not expect: the model is the product

Here is where Phase 2 taught me something that no tutorial had.

I first ran the paragraph through OpenRouter's random free-model router. It picks a free model at random for each request. Four claims came back, and one of them was confidently, specifically wrong.

The claim: "OpenAI was founded by Sam Altman and Elon Musk." The verdict from the random model: contradicted, high confidence. Its reasoning claimed Sam Altman "joined the company later" and "became CEO in 2019," implying he was not a founder.

That is factually false. Sam Altman was a co-founder of OpenAI in 2015, named in the original founding announcement. He became CEO later, but he was a founder from the start. The model conflated "became CEO later" with "joined later" and reasoned its way into a wrong answer, stated with high confidence and backed by real sources. That last part is what makes this failure mode dangerous: it looks authoritative.

Then I changed one thing. Not the code. Not the prompt. I pinned a specific, stronger model instead of the random router:

TRINETRA_RESEARCHER_MODEL=openai/gpt-oss-120b:free
TRINETRA_EXTRACTOR_MODEL=openai/gpt-oss-20b:free

Same paragraph, same code, same searches. The new verdict on that claim: supported, high confidence, with reasoning that correctly noted they were "part of a larger founding team" and the claim is "accurate, even if not exhaustive."

A confidently wrong verdict became a correct, nuanced one. The only variable I changed was the model.

This is the lesson: in an agent, the architecture is the easy part. The reasoning quality of your verdicts lives almost entirely in the model. A weak model will find the right sources and then reason incorrectly about them. A strong model reads the same sources and gets it right. If your agent is producing bad output, the model is the first thing to check, not the last.

Why pin a model instead of using the random router?

OpenRouter's free router is convenient because it auto-selects a free model that supports the features your request needs. For casual use it is fine. For a fact-checker it is the wrong choice, for one reason: reproducibility.

If the model changes on every request, you cannot trust a comparison between two runs, you cannot reproduce a result you liked, and you cannot tell whether a change you made helped or whether you just got a luckier model. A fact-checker needs to be consistent. Pinning a specific model makes every run comparable.

I landed on openai/gpt-oss-120b:free for the Researcher and openai/gpt-oss-20b:free for the Extractor. The Researcher does the hard work (multi-step tool calling plus reasoning over evidence), so it gets the stronger model. The Extractor only needs clean structured output, so the smaller, faster model is plenty.
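One way to wire this up is to read each model name from the environment and fall back to a pinned default. This is a sketch rather than the code in the repo: the environment variable names and model ids come from this post, while pinned_model and DEFAULT_MODELS are names I made up for the example.

```python
import os

# Defaults mirror the models named in this post.
DEFAULT_MODELS = {
    "TRINETRA_RESEARCHER_MODEL": "openai/gpt-oss-120b:free",
    "TRINETRA_EXTRACTOR_MODEL": "openai/gpt-oss-20b:free",
}


def pinned_model(env_var: str, env=None) -> str:
    """Return the model id from the environment, or the pinned default if unset."""
    env = os.environ if env is None else env  # injectable for testing
    return env.get(env_var) or DEFAULT_MODELS[env_var]


# With nothing set, both roles resolve to the same model on every run.
print(pinned_model("TRINETRA_RESEARCHER_MODEL", env={}))
print(pinned_model("TRINETRA_EXTRACTOR_MODEL", env={}))
```

Swapping models for an experiment is then a change to .env, and every run in between uses exactly one known model per role.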

A brief note on the stack: Phase 1 used Gemini directly. Phase 2 moved to OpenRouter, which is OpenAI-API-compatible and gives access to dozens of models from one key. That switch is what made the model-comparison experiment a one-line change instead of a rewrite.

Real gotchas from building paragraph mode

1. Weak models return empty or malformed verdicts

The most common failure with weaker models was not a wrong answer but no answer at all. After a few rounds of tool calls, the model would return an empty message, and the JSON parser would throw a cryptic error. The fix was a defensive parser that detects empty responses, ignores markdown fences, and extracts a JSON object even when the model wraps it in prose:

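import json

def _parse_verdict(text: str) -> dict:
    """Parse a verdict dict from a model reply, tolerating fences and prose."""
    text = (text or "").strip()
    if not text:
        raise ValueError("Model returned an empty response instead of a verdict.")
    # Take the span from the first "{" to the last "}". That skips markdown
    # code fences and any prose wrapped around the JSON object.
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        try:
            return json.loads(text[start:end + 1])
        except json.JSONDecodeError:
            pass  # fall through to the explicit error below
    raise ValueError(f"Model did not return valid JSON: {text[:200]!r}")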


2. Free models hang, so you need a timeout

A large free model under load can queue your request and never respond. Without a timeout, your program hangs forever. The fix is one parameter on the client:

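import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # OpenRouter's OpenAI-compatible endpoint
    api_key=os.environ["OPENROUTER_API_KEY"],
    timeout=90.0,   # seconds before a hung request is abandoned
    max_retries=2,  # SDK-level retries before the error is raised
)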


Now a hung request fails after 90 seconds, the retry wrapper catches it, and you get a clear error instead of a frozen terminal.
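The retry wrapper itself is not shown in this post, so here is a minimal sketch of the idea with exponential backoff. It is not the project's fact_check_with_retry: with_retry and its parameters are hypothetical, and it retries on every exception to stay short, where a real wrapper would retry only transient errors.

```python
import time


def with_retry(fn, *args, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args); on failure wait base_delay * 2**i seconds, then try again."""
    last_error = None
    for i in range(attempts):
        try:
            return fn(*args)
        except Exception as e:  # simplification: a real wrapper would filter error types
            last_error = e
            if i < attempts - 1:
                sleep(base_delay * 2 ** i)
    raise last_error


# Demo: a stand-in Researcher that fails twice, then succeeds.
calls = {"n": 0}


def flaky(claim):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated hung request")
    return {"claim": claim, "verdict": "supported"}


delays = []  # record the waits instead of actually sleeping
result = with_retry(flaky, "Python was created by Guido van Rossum.", sleep=delays.append)
print(result["verdict"], delays)  # supported [1.0, 2.0]
```

Passing sleep as a parameter keeps the demo instant. In real use you would leave the default time.sleep in place.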

3. One bad claim should not kill the paragraph

Early on, a single failed claim would crash the whole run and throw away every verdict already gathered. Wrapping each claim's fact-check in a try/except, and degrading failures to a low-confidence placeholder, means a five-claim paragraph still produces four good verdicts even if one claim fails.

4. Pick a model with real tool calling

Not every model on OpenRouter handles tool calling well, even when the listing says it does. I wasted time on an experimental model that produced empty verdicts on any claim that needed more than one search. The models that reliably handle agent tool-calling on the free tier right now are the OpenAI gpt-oss family, GLM-4.5-Air, and Gemma 4. Test tool use on your exact model before depending on it.

Frequently asked questions

How do you extract claims from text with an LLM?

Send the text to a model with a system prompt that defines what a factual claim is and asks for structured JSON output. Use a schema (Pydantic plus the model's JSON mode) so the output is machine-readable. The key instruction is to decompose compound sentences into separate atomic claims, because each part may need different evidence.

Why split fact-checking into an Extractor and a Researcher?

Because they are different jobs. Extraction is a single structured-output call with no tools. Research is a multi-step agent loop with web search. Keeping them separate means each can use the model best suited to it, and the Researcher stays identical to the single-claim version, so you are not rewriting working code.

What is the best free model on OpenRouter for an AI agent?

For agent work that needs tool calling, the OpenAI gpt-oss models (gpt-oss-120b and gpt-oss-20b) are strong free options, along with GLM-4.5-Air and Gemma 4. The larger gpt-oss-120b gives noticeably better reasoning for the verdict step. Avoid the random free router for anything where reproducibility matters.

Does the choice of model really change the answer that much?

Yes. In this project, the same claim with the same code and the same searches got a confidently wrong verdict from a random weak model and a correct verdict from a stronger one. The architecture determines what the agent does; the model determines how well it reasons about what it finds.

How do you handle long-running or failing LLM calls in an agent?

Three layers: a request timeout so hung calls fail fast, a retry wrapper with exponential backoff for transient errors (rate limits, 5xx, timeouts, empty responses), and per-item error handling so one failure degrades gracefully instead of crashing the whole batch.

Can this fact-check a full article, not just a paragraph?

The same design scales to longer text, but each claim is a separate set of model and search calls, so a long article means many calls and real cost or rate-limit pressure. For longer inputs you would add parallelism and caching, which are planned for later phases.
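Neither piece exists in Trinetra yet, but the standard library is enough to sketch the shape. Everything here is an assumption for illustration: check_claim is a stand-in for the Researcher, and check_all is a hypothetical orchestrator.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


@lru_cache(maxsize=256)
def check_claim(text: str) -> str:
    """Stand-in for the Researcher; a real version would run the agent loop."""
    return f"checked: {text}"


def check_all(claims: list[str], max_workers: int = 4) -> list[str]:
    """Check claims concurrently; pool.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(check_claim, claims))


results = check_all([
    "OpenAI was founded in 2015.",
    "GPT-4 has 1.7 trillion parameters.",
])
print(results)
```

Threads suit this workload because each claim spends most of its time waiting on network calls. More workers also means more simultaneous requests, so max_workers has to stay within whatever rate limit the model provider enforces.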

What is next

Paragraph mode works, but watching the same claim flip between right and wrong across model choices made one thing obvious: I need a way to measure quality, not just eyeball it. So the next phase is likely an evaluation harness, a set of claims with known-correct verdicts that I can run the agent against to catch regressions automatically. After that: reading full source pages instead of search snippets, actively hunting for counter-evidence, and a multi-agent design with a Critic that double-checks the Researcher's reasoning.

Each phase wraps the last one instead of rewriting it. That is the whole approach, and it is why paragraph mode took an afternoon instead of a rebuild.

Working code

The complete project is on GitHub:

github.com/TheRavi/trinetra

Clone it, add your OpenRouter and Tavily keys to .env, and run:

python trinetra.py --paragraph "OpenAI was founded in 2015 by Sam Altman and Elon Musk. GPT-4 has 1.7 trillion parameters and was trained on 25,000 H100 GPUs."


Watch it extract the claims, check each one, and print a Claims Atlas with cited sources. Then click the sources. The whole point is that you can.
