← Back to blog
PythonAI AgentsGeminiTavilyLLMTool UseAI Engineering

Build an AI Agent in Python from Scratch with Gemini and Tavily (2026)

RK

Ravi Kumar

May 25, 2026 · 17 min read

Build an AI Agent in Python from Scratch with Gemini and Tavily (2026)

An AI agent is a small program that lets an LLM decide what to do next. It loops between a model call, a tool call, and a result, until the model says it's done. The whole pattern fits in about forty lines of Python, and you really don't need LangChain or LangGraph to write one.

This post walks through building Trinetra, a little fact-checking agent I've been building in public. You hand it a single factual claim, it searches the web through Tavily, and it hands back a structured JSON verdict with citations. It's the first post in a series, so I'm keeping the scope deliberately narrow: one claim in, one verdict out. Later posts handle full paragraphs, real page reading, counter-evidence, and a multi-agent split.

Working code is on GitHub: github.com/TheRavi/trinetra.

Quick answer

To build a fact-checking AI agent from scratch in Python:

  1. Pick a model with function calling. Gemini 2.5 Flash is generous on the free tier and handles tool use cleanly.
  2. Pick a search API the agent can call. Tavily was built for LLM agents, and you don't need a credit card for the free tier.
  3. Declare one tool: a tavily_search function, plus the declaration the model reads to know how to call it.
  4. Write the loop. Model call, check for tool calls, run them, feed the results back, repeat until the model decides it's done.
  5. Make it return JSON. Bake the output schema into the system prompt so the verdict is machine-readable.

That's the whole agent. Everything beyond this is prompt engineering and handling failure gracefully.

What is an AI agent?

An AI agent is a program that lets a language model take actions and see the results of those actions, in a loop, until the model decides it's done.

A normal LLM call is one-shot. You send a prompt, you get a reply. An agent wraps that call in a loop and adds two things: tools the model can ask to use, and a way to feed each tool's output back into the next model call. So the model isn't just generating text anymore. It's choosing, at each step, between "give a final answer" and "call a tool and try again."

That choice is the whole field. Multi-agent systems, planning agents, ReAct agents, computer-use agents: they're all the same shape, just with more tools, more agents, or more structure stitched around the loop.

Why build it from scratch instead of using LangChain?

LangChain and LangGraph are genuinely useful in production. For learning, though, they hide the one thing you're actually trying to understand.

If you wrap your first agent in AgentExecutor before you've ever written the raw loop, you'll end up with a working app and zero feel for what's happening inside it. And agents break a lot during development. When yours does, you'll be debugging the framework's abstractions instead of your own logic, which is a much worse place to be.

Build it raw first. It's about a weekend of work and you'll permanently understand the pattern. After that, picking up LangGraph takes an afternoon, because you'll recognize what it's abstracting.

Why I built Trinetra

Fact-checking is a crowded space. VerifactAI, Originality.ai, and a dozen others already exist as commercial products. Open-source research projects like Loki and OpenFactCheck have been around for a while. So why build another one?

Two reasons, and both of them shaped the design.

Most fact-checkers give you a verdict. Almost none show their work. When a tool says "verified," you just have to trust it. But honestly, the verdict matters less than the research that produced it. Trinetra's output is a structured evidence map: every claim, every source, every quote, with you as the final judge.

And I wanted to actually understand how agents work. Most tutorials hand you a framework and call it a day. I wanted to build the loop myself, so when I do reach for a framework later, I'll know what it's abstracting.

The name comes from Trinetra (Sanskrit: त्रिनेत्र), Shiva's third eye, the one that sees past surface appearance. For a tool whose whole job is to look past the surface of a claim and check what the evidence actually says, the name fit. It's also genuinely available in the AI agent namespace, which is rarer than you'd think.

Prerequisites

That's it. Two API keys and Python.

Step 1: Install the dependencies

mkdir trinetra && cd trinetra
python -m venv venv
source venv/bin/activate          # Windows: venv\Scripts\activate
pip install google-genai tavily-python python-dotenv
bash

Three packages, each doing one thing. google-genai talks to Gemini, tavily-python wraps the search API, and python-dotenv reads your API keys from a .env file.

Create .env:

GEMINI_API_KEY=your-key-here
TAVILY_API_KEY=your-key-here

Step 2: Declare the search tool

The agent needs one tool: web search. The Tavily SDK handles the HTTP, so the function stays short.

from tavily import TavilyClient
 
tavily = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])
 
def tavily_search(query: str, max_results: int = 5) -> list[dict]:
    response = tavily.search(query=query, max_results=max_results, search_depth="basic")
    return [
        {
            "title": r.get("title", ""),
            "url": r.get("url", ""),
            "snippet": r.get("content", ""),
            "score": r.get("score", 0.0),
        }
        for r in response.get("results", [])
    ]
python

The agent doesn't call this function directly. It calls it through Gemini's function-calling interface, which means the model needs a declaration telling it what the tool does and what arguments it takes.

from google.genai import types
 
search_tool = types.Tool(
    function_declarations=[
        types.FunctionDeclaration(
            name="tavily_search",
            description=(
                "Search the web for information about a claim. "
                "Returns results with title, URL, snippet, and a relevance score. "
                "Use this to gather evidence before forming a verdict."
            ),
            parameters=types.Schema(
                type="OBJECT",
                properties={
                    "query": types.Schema(
                        type="STRING",
                        description="Short, specific search query (3-6 words).",
                    ),
                },
                required=["query"],
            ),
        )
    ]
)
python

This declaration is what Gemini reads when it's deciding whether to call the tool. The clearer the description, the better the model's tool-use decisions get. If your tool is being called with bad queries, the fix is almost always in this description, not in the model.

Step 3: Write the system prompt

Honestly, the system prompt is doing most of the engineering work in this whole project. It sets the agent's identity, its scope, the output schema, and the hard rules.

SYSTEM_PROMPT = """You are Trinetra, the third eye for fact-checking.
Your job is to verify a single factual claim using web search.
 
SCOPE - IMPORTANT:
You verify ONE factual claim at a time. If the input contains multiple
distinct claims, treat only the FIRST one and ignore the rest.
 
Process:
1. Use the tavily_search tool to gather evidence. Search multiple times if
   needed. Vary queries and look for opposing views.
2. Pay attention to relevance scores. Low-score results may be off-topic.
3. If a search returns zero results, rephrase and try a different query.
4. Once you have enough evidence, STOP calling tools and produce your verdict.
 
Your final answer MUST be a single JSON object (no prose, no markdown fences):
 
{
  "claim": "<the claim, verbatim>",
  "verdict": "supported" | "contradicted" | "contested" | "unverifiable",
  "confidence": "high" | "medium" | "low",
  "reasoning": "<2-3 sentence explanation of how the evidence weighs>",
  "evidence": [
    {
      "source_url": "<url returned by the search tool>",
      "source_title": "<title returned by the search tool>",
      "supports": true | false,
      "quote_or_summary": "<what this source actually says about the claim>"
    }
  ]
}
 
Verdict rules:
- "supported":    multiple credible sources confirm and none contradict.
- "contradicted": credible sources clearly refute it.
- "contested":    credible sources disagree; include both sides in evidence.
- "unverifiable": predictions, opinions, or claims with no public evidence.
 
Hard rules:
- Never invent sources or URLs. Only cite results returned by tavily_search.
- Be honest about uncertainty in the confidence field.
"""
python

Two things in here are doing serious work.

The four-way verdict (supported, contradicted, contested, unverifiable) is what makes Trinetra useful. Most fact-checkers only have true and false, which forces them to be confidently wrong on rumors, predictions, and contested claims. Just having an unverifiable bucket catches a huge chunk of real-world claims that pure true/false systems fumble.

The "never invent sources" line is in there because, without it, LLMs will absolutely hallucinate plausible-looking URLs when they can't find real ones. It's the single most important guardrail in the project.

Step 4: Write the agent loop

This is the part that matters. Read it slowly.

def fact_check(claim: str, max_iterations: int = 6) -> dict:
    contents = [
        types.Content(role="user", parts=[types.Part(text=f"Claim to check: {claim}")])
    ]
 
    for step in range(max_iterations):
        response = gemini.models.generate_content(
            model="gemini-2.5-flash",
            contents=contents,
            config=types.GenerateContentConfig(
                system_instruction=SYSTEM_PROMPT,
                tools=[search_tool],
                temperature=0.2,
            ),
        )
 
        candidate = response.candidates[0]
        parts = candidate.content.parts or []
        contents.append(candidate.content)
 
        tool_calls = [p.function_call for p in parts if p.function_call]
 
        if tool_calls:
            tool_responses = []
            for call in tool_calls:
                query = call.args.get("query", "")
                results = tavily_search(query)
                payload = (
                    {"results": results}
                    if results
                    else {"results": [], "hint": "No results. Try different keywords."}
                )
                tool_responses.append(
                    types.Part.from_function_response(name=call.name, response=payload)
                )
            contents.append(types.Content(role="user", parts=tool_responses))
            continue
 
        text = "".join(p.text for p in parts if p.text)
        return _parse_verdict(text)
 
    return _force_final_verdict(contents)
python

The whole conversation state lives in a single list called contents. Every step is the same shape:

  1. Call the model with the current conversation and the tool definition.
  2. Look at what came back. Did the model ask to use a tool, or did it produce a final answer?
  3. If it called a tool: run the tool, append the result to contents, loop.
  4. If no tool call: parse the JSON verdict and return it.

That's the agent. Every more advanced pattern you've heard of (planner-executor, ReAct, multi-agent) is some variation of this loop, with more tools, more agents, or some extra structure between the steps. Frameworks like LangGraph generalize this. The shape doesn't change.

Step 5: Force a wrap-up when the agent runs out of time

My first version of this loop crashed when it hit max_iterations. Six searches' worth of evidence, gone, just to raise an error. So I changed it: when the agent runs out of steps, it gets one final no-tools call that forces it to produce a verdict with whatever it's got.

def _force_final_verdict(contents: list) -> dict:
    contents.append(types.Content(
        role="user",
        parts=[types.Part(text=(
            "You have reached the search limit. Produce your final JSON verdict "
            "now using only the evidence you have gathered. Set confidence to "
            "'low' if the evidence is thin. Do not call any more tools."
        ))],
    ))
    final = gemini.models.generate_content(
        model="gemini-2.5-flash",
        contents=contents,
        config=types.GenerateContentConfig(
            system_instruction=SYSTEM_PROMPT,
            temperature=0.2,
            # no tools, force a text answer
        ),
    )
    text = "".join(p.text for p in final.candidates[0].content.parts if p.text)
    return _parse_verdict(text)
python

The code's a little uglier now. The agent's behavior is dramatically more honest. Worth the trade, every time.

Run it

python trinetra.py "GPT-4 has 1.7 trillion parameters."
bash

Sample output:

👁  Trinetra is examining: 'GPT-4 has 1.7 trillion parameters.'

--- Step 1 ---
  🔍 search: GPT-4 parameter count official
     → 5 results
--- Step 2 ---
  🔍 search: OpenAI GPT-4 technical report architecture
     → 5 results
--- Step 3 ---
  ✅ verdict produced
============================================================
{
  "claim": "GPT-4 has 1.7 trillion parameters.",
  "verdict": "unverifiable",
  "confidence": "medium",
  "reasoning": "The 1.7 trillion figure is widely cited in tech media but every source traces back to unofficial leaks, most notably from George Hotz. OpenAI's own GPT-4 technical report explicitly withholds details about model size and architecture. Without a primary source, the specific number cannot be verified.",
  "evidence": [
    {
      "source_url": "https://...",
      "source_title": "Tech outlet citing leaked specs",
      "supports": true,
      "quote_or_summary": "Reports GPT-4 has approximately 1.76 trillion parameters across a mixture-of-experts architecture."
    },
    {
      "source_url": "https://...",
      "source_title": "OpenAI GPT-4 Technical Report",
      "supports": false,
      "quote_or_summary": "OpenAI explicitly withholds details about model size, architecture, and training data."
    }
  ]
}

This is exactly the behavior I was aiming for. The claim is everywhere on the internet. A naive checker would confidently call it true. A different naive checker would confidently call it false because there's no primary source. Both would be wrong. The honest verdict is unverifiable: the claim comes from leaks, no official source confirms it, and the agent shows you the evidence so you can make the call yourself.

This is why the four-way verdict matters so much. Most fact-checkers can't produce this kind of output because their schema only has true and false.

Real gotchas from building this

The naive version mostly worked on the first try. The version that's actually trustworthy needed four specific fixes.

1. The system prompt is most of the engineering

The system prompt above is around fifty lines. During development, almost every behavior change I made was a prompt edit, not a code change. If your agent isn't doing what you want, the answer is almost always in the prompt, not the framework.

A specific example: my early prompt didn't include the "if the input has multiple claims, only verify the first one" rule. When I pasted a paragraph, the agent tried to verify all the claims at once and ran out of iterations. Three sentences in the prompt fixed it permanently.

2. Telling the agent why a tool failed makes it adapt

Tavily occasionally returns zero results, usually for vague queries. My first version of the loop just returned {"results": []} and the agent would basically shrug and conclude with what it had.

The fix is three lines:

payload = (
    {"results": results}
    if results
    else {"results": [], "hint": "No results. Try different keywords."}
)
python

That hint field is the agent's only feedback about why the tool returned nothing. Without it, the model has no signal that the query was the problem. With it, the agent reliably rephrases and tries again. Agents are only as smart as the feedback you give them about their own actions.

3. Confidence scores are not calibrated

This is a Gemini-and-every-other-LLM thing, not a Trinetra-specific thing. When the model emits "confidence": "high", it's picking a plausible-sounding label, not computing an actual probability. If you build product logic on top of these scores ("auto-approve if confidence == 'high'"), you're building on sand.

In Trinetra, the confidence field is there for humans to read. The verdict is what carries real information. Don't confuse the two.

4. The model invents URLs without the hard rule

Drop the "never invent sources" line from the system prompt and Gemini will, occasionally, cite a URL that doesn't exist. The URL will look real. It'll follow plausible naming conventions. And it'll be completely fabricated.

The fix is in the prompt:

Hard rules:
- Never invent sources or URLs. Only cite results returned by tavily_search.

This is the single most important guardrail in the project. It's also why the agent's output is auditable: every URL it cites came from a real search result, which means you can click through and check.

Frequently asked questions

What is an AI agent in simple terms?

An AI agent is a program that lets a language model take actions and see the results of those actions in a loop. The model decides whether to call a tool (like a web search) or just produce a final answer. If it calls a tool, the program runs the tool and feeds the result back to the model. The loop keeps going until the model is done.

Can I build an AI agent without LangChain?

Yes, and honestly it's the best way to learn how agents actually work. The core agent loop is about forty lines of Python. LangChain and LangGraph are useful in production, but they hide the loop you're trying to understand. Build it raw first, and pick up a framework later if you feel real pain.

What's the difference between Gemini function calling and tool use in other models?

Function calling and tool use refer to the same pattern across providers. Gemini, OpenAI, Anthropic, and others all support it: you declare a tool with a schema, the model decides when to call it, and your code executes the call and returns the result. The APIs look different in shape (different SDK classes, different field names), but the mental model is identical.

Why Tavily instead of Brave Search or Serper?

Tavily was designed specifically for LLM agents: results come with relevance scores, deduplication, and optional pre-cleaned content. The free tier is 1,000 credits per month with no credit card. Brave's free tier asks for a card. Serper's free tier is a one-time 2,500 queries. For a learning project, Tavily has the lowest friction.

How much does it cost to run this agent?

On the free tiers, nothing. Gemini 2.5 Flash has a generous free tier on Google AI Studio. Tavily gives you 1,000 credits per month for free. A typical Trinetra run uses 2 to 4 search calls, so the free tier comfortably covers 250+ runs per month at no cost.

What's the difference between an agent and a chatbot?

A chatbot generates text in response to a prompt. An agent can take actions: it can call tools, observe their results, and decide what to do next. The agent loop is what turns a model into something that can do real work, not just talk about it.

What to build next

Phase 1 verifies one claim. That's a deliberately narrow scope, and it's also the wrong scope for how people actually use fact-checking tools. Real writing comes in paragraphs.

Here's the roadmap from here:

  • Phase 2: Claim extraction. Paste a paragraph, get an atlas of every distinct factual claim, with Phase 1's verdict applied to each. This is where Trinetra starts to feel actually useful.
  • Phase 3: Full-page evidence reading. Right now the agent trusts search snippets. Phase 3 will fetch and read the actual source pages.
  • Phase 4: Active counter-evidence search. Teach the agent to hunt for disagreement, not just confirmation.
  • Phase 5: Evals. A labeled test set of claims, so I can measure regressions as the system grows.
  • Phase 6: Multi-agent. Split into an Extractor, parallel Researchers (one per claim), and a Critic that reviews findings.

Each phase introduces one new agent concept on top of the previous one. By the end, the architecture will look much more like a real agent system, built from scratch, every piece readable.

Working code

The complete project is on GitHub:

github.com/TheRavi/trinetra

Clone it, drop your two API keys into .env, and run any claim you can think of:

python trinetra.py "Python was created by Guido van Rossum."
python trinetra.py "You only use 10% of your brain."
python trinetra.py "AI will replace half of all jobs by 2030."
bash

Click the cited URLs. The whole point is auditability. The agent shouldn't be a black box.

Further reading

Related posts on this blog:


Building something with Trinetra, or hit a failure mode this post missed? Open an issue on the GitHub repo or reach out on LinkedIn. I'm building Trinetra in public, one phase at a time, and real-world reports always help me make the next phase better.