RAG Basics in Spring AI: Retrieval, the Assembly Trap, and When Retrieval Lies
Retrieval-augmented generation (RAG) lets a model answer from your documents instead of only from its training data: you embed your documents into a vector store, retrieve the chunks most similar to a question, and feed those chunks to the model as grounding. Spring AI ships the embedding, vector-store, and retrieval pieces to do this in Java. This recipe builds a working RAG pipeline over a set of incident reports, and it's honest about the two places it fights back: assembling the right dependencies, and trusting what retrieval returns.
This is the fifth and final recipe in the Spring AI Recipes series. Recipe 03 gave the model tools to query a database. This one gives it the ability to read unstructured documents and answer from them.
Quick Answer
A minimal RAG setup in Spring AI has three moving parts:
- Ingestion: read documents, split them into chunks, embed the chunks, store them in a vector store. Done once, at startup in this recipe.
- Retrieval: at query time, embed the question and find the most similar chunks.
- Generation: feed the retrieved chunks to the model as context and let it answer.
The one-call version uses QuestionAnswerAdvisor, which handles retrieval and prompt-stuffing for you:
chatClient.prompt()
    .user(question)
    .advisors(QuestionAnswerAdvisor.builder(vectorStore).build())
    .call()
    .content();

The thing nobody warns you about: RAG needs four Spring AI dependencies, not one, and they fail in three different ways if you miss them. More on that below, because it's the real lesson of building this.
Working code: github.com/TheRavi/spring-ai-recipes.
What this recipe builds
A support-engineer assistant with a memory. It ingests three past incident write-ups on startup (an analytics export timeout, a billing export failure, and an SSO redirect loop), embeds them into an in-memory vector store, and answers questions by retrieving the relevant incident and grounding its answer in it.
Ask it a clear question and it works exactly as you'd hope:
curl -X POST http://localhost:8080/ask \
  -H 'Content-Type: application/json' \
  -d '{"question": "Why did the analytics CSV export return an empty file, and how was it fixed?"}'

It retrieves INCIDENT-2041 and answers from it: the export query had no pagination, exceeded the 30-second gateway timeout for date ranges over 90 days, returned a 200 with an empty body, and was fixed in v2.4.0 with keyset pagination, a raised timeout, and a proper 504. Correct, grounded, and it cites the incident number.
That's the happy path. The rest of this post is about everything that isn't.
The assembly trap: RAG needs four dependencies
Every previous recipe in this series used a single dependency, spring-ai-starter-model-google-genai, for chat. RAG needs four, and Spring AI splits them into fine-grained artifacts that the chat starter does not pull in transitively:
<!-- chat: gives you a ChatModel (you already had this) -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-model-google-genai</artifactId>
</dependency>
<!-- embeddings: gives you an EmbeddingModel to turn text into vectors -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-model-google-genai-embedding</artifactId>
</dependency>
<!-- vector store: gives you VectorStore and SimpleVectorStore -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-vector-store</artifactId>
</dependency>
<!-- advisors: gives you QuestionAnswerAdvisor for one-call RAG -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-advisors-vector-store</artifactId>
</dependency>

I'll be honest about how I learned this: by hitting each missing dependency in sequence. The build failed three separate times before it ran, each with a different and initially confusing error.
- Missing spring-ai-vector-store: compile error, import org.springframework.ai.vectorstore cannot be resolved.
- Missing spring-ai-advisors-vector-store: a different compile error, import org.springframework.ai.chat.client.advisor.vectorstore cannot be resolved.
- Missing the embedding starter: the code compiles, then fails at startup because there's no EmbeddingModel bean to build the vector store.
This is not a complaint about Spring AI. The fine-grained module split is the right design: you pull in only what you use, and a chat-only app doesn't carry vector-store code it never touches. But it means the first RAG build almost always fails two or three times on missing dependencies before it boots, and no tutorial that pastes a finished pom prepares you for that. If your first RAG attempt won't compile or won't start, you are almost certainly missing one of these four.
The embedding auth gotcha: chat works, embeddings don't
This one cost me the most time, because it fails in a way that looks like it shouldn't. With all four dependencies in place, the app still died at startup:
java.lang.IllegalArgumentException: Google GenAI project-id must be set!
The confusing part: the chat model authenticates fine with a plain Gemini API key. The embedding model, using the same key from the same provider, does not. It falls through to the Vertex AI authentication path, which requires a Google Cloud project ID, and refuses to start without one.
The fix is to give the embedding side its own API key property, separate from chat:
spring:
  ai:
    google:
      genai:
        api-key: ${GEMINI_API_KEY}   # chat
        chat:
          options:
            model: gemini-2.5-flash
        embedding:
          api-key: ${GEMINI_API_KEY}   # embeddings need their own key property
          text:
            options:
              model: gemini-embedding-001

Same key, but it has to be set in both places. The chat auto-config and the embedding auto-config read different properties, and the embedding one defaults to the Vertex path if it doesn't find its own key. Worth knowing before you lose half an hour to it.
How the pipeline fits together
Once the dependencies and auth are sorted, the actual RAG code is small.
Ingestion runs once at startup. It reads the incident files, splits them with TokenTextSplitter, and adds them to the vector store, which embeds them automatically:
var splitter = new TokenTextSplitter();
for (Resource file : files) {
    var reader = new TextReader(file);
    reader.getCustomMetadata().put("source", file.getFilename());
    allChunks.addAll(splitter.apply(reader.get()));
}
vectorStore.add(allChunks);

The vector store is the in-memory SimpleVectorStore, built from the auto-configured embedding model:
@Bean
public VectorStore vectorStore(EmbeddingModel embeddingModel) {
    return SimpleVectorStore.builder(embeddingModel).build();
}

Spring AI's own docs are explicit that SimpleVectorStore is for testing and development, not production. It holds everything in memory and doesn't persist across restarts. For production you'd swap in pgvector, Redis, or another store, and because your code talks to the VectorStore interface, that swap is a dependency and config change, not a code change.
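To make "in-memory vector store" concrete, here is a toy sketch of what any such store has to do: keep vectors in a list, score each one against the query with cosine similarity, and return the top K. It is an illustration only, not Spring AI's SimpleVectorStore implementation, and the names ToyVectorStore, Entry, and Scored are made up for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy in-memory store: illustrates the idea only, not Spring AI's SimpleVectorStore.
class ToyVectorStore {
    record Entry(String id, double[] vector) {}
    record Scored(String id, double score) {}

    private final List<Entry> entries = new ArrayList<>();

    void add(String id, double[] vector) {
        entries.add(new Entry(id, vector));
    }

    // Cosine similarity: dot product divided by the product of the vector lengths.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Score every stored vector against the query and keep the topK highest.
    List<Scored> search(double[] query, int topK) {
        return entries.stream()
                .map(e -> new Scored(e.id(), cosine(query, e.vector())))
                .sorted(Comparator.comparingDouble(Scored::score).reversed())
                .limit(topK)
                .toList();
    }
}
```

Everything lives in one list on the heap, which is why a store like this loses its contents on restart and scans every entry on every query.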
Retrieval and generation are the one QuestionAnswerAdvisor call shown in the Quick Answer. It embeds the question, runs the similarity search, stuffs the results into the prompt, and calls the model.
When retrieval lies: the finding
Here's the part that matters, and the reason I built the corpus the way I did. Two of the three incidents are about an export failing: one in analytics (a query timeout) and one in billing (connection-pool exhaustion). They share vocabulary. That's deliberate, because it's where similarity search gets interesting.
I added a debug endpoint that runs retrieval only, no model, so you can see exactly what comes back and with what score. Then I asked it a deliberately vague question: "why did the export fail."
curl "http://localhost:8080/retrieve?question=why%20did%20the%20export%20fail&topK=3"

The result:
billing-export score 0.6176
analytics-export score 0.6088
sso-redirect score 0.5345
Look at the top two. The billing incident ranked first, ahead of the analytics one, and the gap between them is less than 0.01. For a vague query, similarity search treated the two export incidents as very nearly equal, and put billing on top.
That's the finding. If I had built a naive RAG system that retrieves only the single best match (topK=1) and answered from it, this question would have been answered with the billing incident: a connection-pool problem, which is the wrong root cause for someone actually asking about the analytics export. The model would have sounded completely confident while citing the wrong incident. Retrieval didn't error. It returned a plausible, wrong-for-the-intent document, and ranked it first.
This is the core thing to understand about RAG: retrieval is similarity, not understanding. When two documents are similar to a vague query, the vector store has no way to know which one the user actually meant. The fix is not in the model. It's in retrieval design: better chunking, metadata filters, a similarity threshold, or retrieving more than one result and letting the model sort it out.
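One cheap guard that follows from this is to check whether the top two scores are nearly tied before trusting the ranking. The sketch below is plain Java and independent of Spring AI; the AmbiguityCheck class, the Hit record, and the 0.02 margin are assumptions for illustration, and the right margin depends on your embedding model and corpus.

```java
import java.util.List;

class AmbiguityCheck {
    record Hit(String source, double score) {}

    // True when the top two hits are closer than `margin`, meaning the ranking
    // alone is not a reliable signal of which document the user meant.
    static boolean isAmbiguous(List<Hit> rankedHits, double margin) {
        if (rankedHits.size() < 2) {
            return false;
        }
        return rankedHits.get(0).score() - rankedHits.get(1).score() < margin;
    }

    public static void main(String[] args) {
        // Scores from the vague "why did the export fail" query above.
        var hits = List.of(
                new Hit("billing-export", 0.6176),
                new Hit("analytics-export", 0.6088),
                new Hit("sso-redirect", 0.5345));
        System.out.println(isAmbiguous(hits, 0.02));
    }
}
```

When the check fires, reasonable responses include passing both documents to the model, as the default setup does, or asking the user which export they mean.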
What the model did with the ambiguity
That last option turned out to be what saved the default behavior. Because the advisor retrieves several chunks by default, not just the top one, both export incidents landed in the model's context. So when I asked the full RAG question "Why did the export fail?", the model didn't pick wrong. It noticed there were two:
There are two documented reasons why an export might fail. (1) Analytics CSV export: the query for large date ranges had no pagination and exceeded the 30-second gateway timeout (INCIDENT-2041). (2) Billing invoice export: the export held a database connection for the full PDF render and exhausted the connection pool under load (INCIDENT-2088).
The model compensated for retrieval's ambiguity by surfacing both and labeling each with its incident number. Retrieval couldn't tell the two apart, so the model handed both back and let the reader decide. That's a better outcome than confidently answering with one, and it happened because the default retrieval pulled multiple chunks. Drop to topK=1 and you lose this safety net.
The honest no-answer case
One more test, because the worst RAG failure is confident fabrication. I asked something the corpus has no answer for:
curl -X POST http://localhost:8080/ask \
  -H 'Content-Type: application/json' \
  -d '{"question": "Why is the search feature slow?"}'

The answer:
The provided incident documents do not contain information about why the search feature is slow.
It declined, instead of stitching together a plausible-sounding answer from the weakly-matched chunks it did retrieve. That's partly the system prompt (it's instructed to say so plainly when the documents don't cover the question) and partly the model being well-behaved. But do not assume this happens for free. Without that instruction, and with a weaker model, the same retrieval can produce a confident answer built on irrelevant context. If you take one habit from this recipe, make it this: tell the model explicitly to refuse when the context doesn't answer the question, and test that it actually does.
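As a rough sketch of what "tell the model explicitly to refuse" can look like, the helper below assembles a grounded prompt with the refusal rule stated up front. The wording of REFUSAL_RULE and the GroundedPrompt class are hypothetical; the repository's actual system prompt may read differently, and in the real pipeline the advisor does the prompt assembly for you.

```java
import java.util.List;

class GroundedPrompt {
    // Hypothetical wording; the repository's actual system prompt may differ.
    static final String REFUSAL_RULE =
            "Answer only from the incident documents below. "
            + "If they do not contain the answer, say so plainly and do not guess.";

    // Puts the rule first, then the retrieved chunks, then the user's question.
    static String build(List<String> chunks, String question) {
        String context = chunks.isEmpty()
                ? "(no documents retrieved)"
                : String.join("\n---\n", chunks);
        return REFUSAL_RULE + "\n\nDocuments:\n" + context + "\n\nQuestion: " + question;
    }
}
```

The instruction is only half of the habit. The other half is a test that sends an out-of-corpus question and checks that the reply is a refusal.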
What can go wrong with RAG
1. Missing one of the four dependencies
Covered above. Two fail at compile time with different import errors, one fails at startup with a missing EmbeddingModel. If your build won't compile or boot, check that all four are present.
2. The embedding auth path
Also covered above. Set spring.ai.google.genai.embedding.api-key explicitly, or the embedding side falls through to Vertex and demands a project-id.
3. Retrieval returns the wrong-but-similar document
The finding of this recipe. When documents share vocabulary, similarity search can rank the wrong one first, and the scores can be nearly tied. Use the retrieval-only path to inspect what comes back before trusting an answer. Consider metadata filters and thresholds.
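A metadata filter narrows the candidates before similarity gets a vote. The sketch below shows the idea in plain Java as a post-filter over ranked hits; a real vector store would normally apply the filter inside the search itself. The MetadataFilter class, the Hit record, and the "product" metadata key are assumptions for illustration (the ingestion code in this recipe only sets "source").

```java
import java.util.List;
import java.util.Map;

class MetadataFilter {
    record Hit(String text, double score, Map<String, String> metadata) {}

    // Keeps only hits whose metadata has the given key/value, preserving rank order.
    static List<Hit> where(List<Hit> hits, String key, String value) {
        return hits.stream()
                .filter(h -> value.equals(h.metadata().get(key)))
                .toList();
    }
}
```

If the caller already knows the question is about analytics, filtering on that before ranking removes the billing incident from contention regardless of how close the scores are.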
4. No similarity threshold by default
Without a threshold, the store always returns your top-K even when the best match is barely relevant. For a query with no good answer, you still get chunks back, and a less disciplined setup will answer from them. Set a threshold in SearchRequest to make retrieval return nothing when nothing is close enough.
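The effect of a threshold is easy to see in isolation. This plain-Java sketch drops every hit below a cutoff, so a query with no good match can come back empty; in the real pipeline the threshold is set on the search request and applied by the store. The ThresholdFilter class, the Hit record, and the cutoff values are illustrative assumptions.

```java
import java.util.List;

class ThresholdFilter {
    record Hit(String source, double score) {}

    // Keeps hits scoring at or above the threshold; may return an empty list.
    static List<Hit> atLeast(List<Hit> hits, double threshold) {
        return hits.stream()
                .filter(h -> h.score() >= threshold)
                .toList();
    }
}
```

Against the three scores reported earlier (0.6176, 0.6088, 0.5345), a cutoff of 0.60 keeps the two export incidents and drops the SSO one, while 0.70 returns nothing. Pick the value by inspecting real scores from the retrieval-only path, not by guessing.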
5. Chunk size quietly controls quality
TokenTextSplitter defaults are fine for short documents like these incident reports. For longer documents, chunks that are too large dilute relevance and chunks that are too small lose context. There's no universally correct size; it depends on your corpus, and it's worth tuning deliberately rather than accepting the default blindly.
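To get a feel for the size trade-off, it helps to chunk the same text at a few sizes and read the results. The sketch below is a stand-in that splits on whitespace-separated words with an optional overlap; it is not how TokenTextSplitter works internally (that one counts tokens), and the WordChunker name and its parameters are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WordChunker {
    // Splits text into chunks of `size` words; consecutive chunks share `overlap` words.
    static List<String> chunk(String text, int size, int overlap) {
        if (size <= 0 || overlap < 0 || overlap >= size) {
            throw new IllegalArgumentException("need size > 0 and 0 <= overlap < size");
        }
        String[] words = text.trim().split("\\s+");
        List<String> chunks = new ArrayList<>();
        if (words.length == 1 && words[0].isEmpty()) {
            return chunks; // blank input produces no chunks
        }
        int step = size - overlap;
        for (int start = 0; start < words.length; start += step) {
            int end = Math.min(start + size, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
            if (end == words.length) {
                break; // the last chunk reached the end of the text
            }
        }
        return chunks;
    }
}
```

Overlap is one common way to keep a sentence that straddles a boundary from losing its context in both chunks, at the cost of storing and embedding some text twice.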
Frequently asked questions
What dependencies do I need for RAG in Spring AI?
For this Gemini setup, four: the chat starter (spring-ai-starter-model-google-genai), the embedding starter (spring-ai-starter-model-google-genai-embedding), the vector store API (spring-ai-vector-store), and the advisors (spring-ai-advisors-vector-store). The chat starter does not pull the others in transitively.
Why does my Spring AI app say "Google GenAI project-id must be set"?
The embedding auto-config needs its own spring.ai.google.genai.embedding.api-key property. Without it, it falls through to the Vertex AI auth path, which requires a Google Cloud project ID. Set the embedding api-key to the same key your chat model uses.
What is QuestionAnswerAdvisor?
A Spring AI advisor that handles the retrieval half of RAG for you. It embeds the user's question, runs a similarity search against the vector store, and injects the retrieved chunks into the prompt before the request goes to the model, all inside one .call(). It lives in the spring-ai-advisors-vector-store artifact.
Is SimpleVectorStore okay for production?
No. Spring AI's docs are explicit that it's for testing and development. It's in-memory and not persistent. For production, use pgvector, Redis, or another supported store. Your code talks to the VectorStore interface, so swapping is a config and dependency change, not a code change.
How do I stop RAG from making things up when the answer isn't in my documents?
Two things. Instruct the model in the system prompt to say plainly when the context doesn't contain the answer, and test that it does. Optionally, set a similarity threshold so retrieval returns nothing when no chunk is close enough, rather than always returning a top-K that the model might answer from.
The series, in five recipes
This is the last recipe, so here's the whole arc in one place. Five capabilities, each with a real finding from actually running the code:
- 01, hello-gemini: the first call, and the 429 you hit when a model name is deprecated.
- 02, structured output: typed Java records from the model, and the confidence scores that turned out to be decorative.
- 03, tool calling: letting the model query a database, and the schema-mismatch where it asked for a component that didn't exist.
- 04, provider portability: the same code on Gemini and OpenRouter, agreeing on everything that mattered.
- 05, RAG basics: this one, where retrieval ranked the wrong export incident first by a hair.
The throughline: Spring AI is a genuinely capable way to build AI features in Java, and the interesting part was never the happy-path API. It was the gotchas, the leaks, and the places the abstraction meets reality. That's where the engineering actually lives.
Working code
The complete project is on GitHub:
github.com/TheRavi/spring-ai-recipes (the 05-rag-basics/ folder)
Clone it, copy .env.example to .env, add your Gemini key, and run. The /retrieve endpoint is the one worth playing with: try your own ambiguous queries and watch the scores.
Further reading
- Spring AI RAG reference
- Spring AI vector store reference
- Google GenAI text embeddings reference
- Get a free Gemini API key
Related posts on this blog:
- Spring AI Tool Calling: Let Gemini Query Your Database in Java. Recipe 03, the structured-data counterpart to this recipe's unstructured retrieval.
- Spring AI Provider Portability: Running One App on Gemini and OpenRouter. Recipe 04, on keeping your provider swappable.
Built RAG in Spring AI and hit a gotcha this post didn't cover? Open an issue on the repo or reach out on LinkedIn. The retrieval-ranking problem especially has endless variations, and I'd like to collect the ones that show up on real corpora.