
Improving RAG Retrieval with Cohere Rerank on SAP AI Launchpad


If you’ve worked with Retrieval Augmented Generation (RAG) systems before, you’ve probably seen this happen.

You ask a question.

Your vector database retrieves a bunch of “relevant” chunks.

A few of them are useful. Some are loosely related. And occasionally, one of the top results makes you wonder if the embedding model was having a bad day.

This is one of the biggest challenges with RAG applications. Even if the LLM itself is powerful, the final response is still heavily dependent on the quality of the retrieved context. Poor retrieval leads to poor answers.

Most modern RAG pipelines rely on vector similarity search for retrieval, and this is fine because embeddings are fast, scalable, and generally good at capturing semantic meaning. But vector similarity alone does not always guarantee that the most relevant result ends up at the top.

This is where reranking comes in, because often it is not your LLM that is the problem.

Instead of directly sending retrieved chunks to the LLM, a reranker takes the initially retrieved results and reorders them based on how relevant they are to the query. In a lot of cases, this can significantly improve the quality of the final response without changing your base LLM itself.

SAP AI Launchpad now supports the Cohere Rerank API, making it easier to integrate reranking into enterprise AI workflows running on SAP AI Core. This allows developers to improve retrieval quality in their RAG pipelines while continuing to use their existing retrieval and generation setup.

In this blog, we’ll take a look at:

  • what reranking is
  • why it matters in RAG applications
  • the trade-offs involved
  • how to deploy the Cohere Rerank model on SAP AI Launchpad
  • and how to use the deployed endpoint in a real application flow

Let’s get started.

 

In a typical RAG pipeline, a user query is converted into an embedding and matched against embeddings stored in a vector database. The database then returns the top matching chunks based on vector similarity.

This works well because embeddings are fast and scalable, making them ideal for searching across large datasets like SAP Help documentation.
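As a rough illustration of this first stage, here is a minimal sketch of cosine-similarity retrieval. The helper names and inputs are hypothetical; in a real pipeline the vectors would come from your embedding model and vector database.

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_top_k(query_vec, chunk_vecs, chunks, k=25):
    # Score every stored chunk against the query embedding,
    # then return the k highest-scoring chunks, best first
    scored = sorted(
        zip((cosine_similarity(query_vec, v) for v in chunk_vecs), chunks),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return scored[:k]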

But vector similarity alone does not always guarantee that the most useful result appears at the top.

This is where reranking helps.

Instead of directly sending retrieved chunks to the LLM, a reranker takes the initial search results and scores them again based on relevance to the query. The chunks are then reordered before being passed to the LLM.

The flow usually looks something like this:

user query → vector search returns the top-k candidate chunks → reranker re-scores and reorders them → top-n chunks → LLM

The reranker does not replace vector search. Vector search is still responsible for quickly narrowing down the search space, while the reranker improves the quality of the final retrieved context.
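Put together, the two stages compose roughly as follows. This is only a sketch: retrieve_top_k is the helper from above, while embed, rerank, build_prompt, and llm are hypothetical stand-ins for your embedding model, the reranker endpoint shown later in this post, and your prompt/LLM layer.

def answer_with_rag(query, chunk_vecs, chunks, k=25, n=3):
    # Stage 1: fast, broad retrieval from the vector store (recall)
    candidates = retrieve_top_k(embed(query), chunk_vecs, chunks, k=k)

    # Stage 2: rerank only the small candidate set (precision)
    top_chunks = rerank(query, [chunk for _, chunk in candidates], top_n=n)

    # Only the reranked top-n chunks end up in the prompt
    return llm(build_prompt(query, top_chunks))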

 

To understand why reranking improves retrieval quality, it helps to look at how it differs from traditional vector search.

In a vector search pipeline, both the user query and the stored chunks are converted into embeddings independently. Similarity between them is then calculated using metrics like cosine similarity.

This works well for finding semantically related content quickly, which is why vector databases are widely used in RAG systems.

However, semantic similarity does not always mean the retrieved chunk directly answers the user’s question.

For example, in a RAG application built on SAP Help documentation, a query like:

“How do I configure OAuth for SAP AI Core?”

might retrieve chunks related to authentication, destinations, API access, and OAuth setup. All of these are related to the query, but not all of them are equally useful.

Reranking approaches this differently.

Instead of comparing embeddings independently, the reranker evaluates the query and each retrieved chunk together before assigning a relevance score. This allows it to better understand query intent and improve the ordering of the retrieved results.
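To build intuition for joint scoring, here is a sketch using an open-source cross-encoder from the sentence-transformers library. This is purely illustrative; it is neither the Cohere model nor the SAP deployment.

from sentence_transformers import CrossEncoder

# A small public cross-encoder, used here only to illustrate joint scoring
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do I configure OAuth for SAP AI Core?"
chunks = [
    "Authentication concepts in SAP BTP and identity management.",
    "Steps to configure OAuth for SAP AI Core using service keys and client credentials.",
    "Destination configuration setup in SAP BTP.",
]

# Each (query, chunk) pair is scored together, not as two independent embeddings
scores = model.predict([(query, chunk) for chunk in chunks])
ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)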

The following diagram gives a simplified comparison between traditional vector similarity search and reranking:

[Diagram: traditional vector similarity search vs. reranking]

In large documentation systems like SAP Help, improving retrieval quality can significantly improve:

  • answer accuracy
  • contextual relevance
  • response consistency
  • token efficiency by reducing unnecessary context

Another important advantage is that reranking improves retrieval quality without requiring changes to the base LLM itself. In many cases, improving retrieval can have a larger impact than switching to a larger or more expensive model.

Of course, reranking also comes with trade-offs like additional latency and inference cost, which we’ll look at next.

If reranking improves retrieval quality so much, the obvious question is:
why not use it for every document directly?

The answer is scale.

Rerankers are significantly more expensive and computationally heavier than vector similarity search. Running a reranker across thousands or millions of chunks for every query would introduce too much latency and cost for most applications.

This is why reranking is typically used as a second-stage retrieval step.

The vector database first narrows down the search space by retrieving the top candidate chunks quickly using embeddings. The reranker then evaluates only those retrieved results in more detail and improves their ordering before they are sent to the LLM.

Even with this smaller candidate set, reranking still introduces:

  • an additional model call
  • increased latency
  • additional inference cost

Because of this, reranking is usually most useful in scenarios where retrieval quality matters more than absolute response speed, such as large enterprise documentation systems or knowledge-heavy RAG applications.

It is also important to remember that reranking is not a replacement for good retrieval practices. Proper chunking, embeddings, and retrieval strategies still matter. Reranking works best as a precision improvement layer on top of an already solid retrieval pipeline.

 

To use the Cohere Rerank API, the first step is to create a deployment on SAP AI Launchpad.

Navigate to the ML Operations section in SAP AI Launchpad and create a new configuration for the Cohere rerank model.

Steps to create configuration:

  1. Configuration details

[Screenshot: configuration details]

Note: Any valid configuration name can be used.

  2. Input Parameters

[Screenshot: input parameters]

Note: Input artifacts are not required for this model, so you can proceed directly to reviewing and creating the configuration.

Once the configuration is created, create a deployment using that configuration.

[Screenshot: creating the deployment]

Wait for the deployment to reach the Running state. The deployment URL can then be found in the details view of the selected deployment.

[Screenshot: deployment details showing the deployment URL]
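The deployment URL can also be fetched programmatically via the AI Core REST API. A minimal sketch, assuming ai_api_url, deployment_id, auth_token, and resource_group are already set; the endpoint and field names follow the AI Core API specification, so verify them against your service version:

import requests

# GET the deployment details from the AI Core API (assumed endpoint and fields)
details = requests.get(
    f"{ai_api_url}/v2/lm/deployments/{deployment_id}",
    headers={
        "AI-Resource-Group": resource_group,
        "Authorization": f"Bearer {auth_token}",
    },
).json()

print(details["status"])         # e.g. "RUNNING"
print(details["deploymentUrl"])  # base URL for rerank requests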

At this point, the reranker is ready to be integrated into a RAG pipeline.

The next step is to send a request to the deployment endpoint and use it to rerank retrieved chunks before passing them to the LLM.

 

Once the deployment is active, the deployment URL can be used to send reranking requests from your application.

In a typical RAG pipeline, reranking is usually not performed on the entire document set directly. Doing that would be far too expensive and computationally heavy for most applications.

Instead, the vector database is first used to retrieve a larger candidate set using embeddings and cosine similarity.

For example, if a traditional keyword-based search might return 5 directly matching chunks, a vector search might retrieve the top 20–25 semantically related chunks instead. This improves recall and reduces the chances of missing useful context.

The reranker is then used as a second-stage retrieval step to improve precision.

Instead of sending all 20–25 retrieved chunks directly to the LLM, the application sends them to the reranker, which evaluates their relevance to the user query and returns only the most useful results.
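Calling the deployment requires a bearer token, obtained through the standard OAuth2 client-credentials flow against the authentication URL from your AI Core service key. A minimal sketch; the url, clientid, and clientsecret fields below come from a typical BTP service key, so verify them against yours:

import requests

# Values taken from the AI Core service key (assumed field names)
auth_url = "<service-key-url>"  # e.g. the "url" field of the service key
client_id = "<clientid>"
client_secret = "<clientsecret>"

token_response = requests.post(
    f"{auth_url}/oauth/token",
    data={"grant_type": "client_credentials"},
    auth=(client_id, client_secret),
)
token_response.raise_for_status()
auth_token = token_response.json()["access_token"]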

A sample reranking request using Python could look something like this:

import requests

deployment_url = "<your-deployment-url>/rerank"  # deployment URL from SAP AI Launchpad, with /rerank appended
auth_token = "<auth-token>"  # bearer token for BTP, obtained using your AI Core client credentials
resource_group = "<resource-group>"  # your AI Core resource group

query = "How do I configure OAuth for SAP AI Core?"

# Sample documents retrieved through first-stage cosine-similarity search
documents = [
    "Authentication concepts in SAP BTP and identity management.",
    "Steps to configure OAuth for SAP AI Core using service keys and client credentials.",
    "API access and token lifecycle management in SAP AI Core.",
    "Destination configuration setup in SAP BTP.",
    "Identity provider configuration and trust setup."
]

payload = {
    "model": "cohere-reranker-35",
    "top_n": 3,
    "query": query,
    "documents": documents
}

headers = {
    "AI-Resource-Group": resource_group,
    "Content-Type": "application/json",
    "Authorization": f"Bearer {auth_token}"
}

response = requests.post(
    deployment_url,
    json=payload,
    headers=headers
)

response.raise_for_status()

reranked_results = response.json()

print(reranked_results)

The reranked results can then be used to construct the final prompt sent to the LLM.
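Assuming the response follows the usual Cohere Rerank shape, with a results list containing index and relevance_score fields (check this against the documentation for your deployment), the prompt construction could look like this:

# Map reranked indices back to the original documents
# (assumes the standard Cohere Rerank response shape)
top_chunks = [documents[item["index"]] for item in reranked_results["results"]]

context = "\n\n".join(top_chunks)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}"
)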

This two-stage retrieval approach works particularly well for large documentation systems where multiple chunks may appear semantically similar while still serving very different purposes.


 

Reranking is one of those additions to a RAG pipeline that seems small at first, but can have a surprisingly large impact on retrieval quality.

In a lot of cases, the problem is not that the LLM is bad. The problem is that the right context never made it into the prompt in the first place.

By combining vector search with reranking, it becomes possible to balance both recall and precision in retrieval pipelines. The vector database retrieves broadly relevant chunks quickly, while the reranker helps identify which of those chunks are actually useful.

And when working with large documentation systems like SAP Help, that additional filtering layer can make a very noticeable difference.

With Cohere Rerank now available in SAP AI Launchpad, integrating reranking into enterprise RAG workflows becomes much more straightforward.

Sometimes, better answers do not come from a bigger model.
They come from better retrieval.

Happy coding! 🙂

 
