August 31, 2025

10 min read

🎬 Episode 2 : 🔍 Building on RAG: Exploring BM25 and Semantic Search

Faeze abdoli

AI engineer

Retrieval-Augmented Generation (RAG) depends on effective search. BM25 offers fast, keyword-based precision, while Semantic Search uses embeddings to capture meaning and context. Hybrid Search blends both approaches, combining exact matches with semantic understanding to deliver more accurate and reliable retrieval for AI systems.

Welcome back to our RAG series! In Episode 1 [Link], we introduced the concept of Retrieval-Augmented Generation (RAG) and discussed how retrieving relevant information from external sources improves AI responses. Building on that foundation, Episode 2 looks at two pivotal components of information retrieval: BM25 and Semantic Search. Understanding these methods is crucial for refining the retrieval process within RAG systems.

BM25: A traditional, keyword-based ranking function that evaluates the relevance of documents based on the frequency of query terms. It's known for its efficiency and effectiveness in scenarios where exact term matching is paramount.
Semantic Search: A more advanced approach that leverages embeddings and vector databases to capture the meaning behind words, enabling the retrieval of documents that are contextually relevant, even if they don't contain the exact query terms.

In the following sections, we'll explore each of these methods in detail, examining their mechanisms, strengths, and how they can be integrated into RAG systems to enhance performance and reliability.

🔍 BM25 (Best Match 25): the workhorse of lexical retrieval

BM25 is a ranking function used by search engines (e.g., Lucene/Elasticsearch) to score how well each document matches a query using words, not meanings. It balances three ideas: (1) rare terms are more informative, (2) repeating a term helps but with diminishing returns, and (3) longer documents shouldn’t win just because they’re long.

How BM25 scores a match (intuition, not scary math)

For each query term, BM25 computes a score and sums them:

IDF (inverse document frequency): rarer terms get bigger weight; very common terms contribute little.
Term frequency saturation: the more often a term appears in a document, the higher the score, but with diminishing returns controlled by k1 (typically about 1.2 to 2.0).
Length normalization: longer-than-average docs are penalized via parameter b (0–1; often around 0.75).
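To make the three ingredients concrete, here is a minimal from-scratch sketch of BM25 scoring in plain Python. It assumes whitespace tokenization and a non-negative IDF of the form log(1 + (N - n + 0.5) / (n + 0.5)). The function name bm25_scores and the tiny corpus are illustrative only. Real libraries add proper tokenization, stopword handling, and sparse-matrix optimizations.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in set(query.lower().split()):
            if term not in tf:
                continue
            # IDF: rarer terms get a larger weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: b controls how strongly long docs are penalized.
            norm = 1 - b + b * len(toks) / avg_len
            # Saturating term frequency: k1 controls how quickly repeats stop helping.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * norm)
        scores.append(score)
    return scores

docs = [
    "the cat sat on the mat",
    "the dog chased the cat around the garden all day",
    "birds sing in the morning",
]
scores = bm25_scores("cat mat", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f} | {doc}")
```

The first document matches both query terms and is short, so it scores highest. The second matches only the more common term in a longer document. The third shares no terms with the query and scores zero.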

Tuning tips:

Increase k1 → give more credit to repeated terms.
Increase b → normalize more aggressively against long documents.
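The effect of both knobs can be seen by isolating the term-frequency part of the formula. The sketch below uses the standard BM25 component tf * (k1 + 1) / (tf + k1 * (1 - b + b * len / avg_len)). The helper name tf_component and the sample values are illustrative.

```python
def tf_component(tf, k1, b, doc_len, avg_len):
    """Term-frequency part of BM25 for a single term in a single document."""
    norm = 1 - b + b * doc_len / avg_len
    return tf * (k1 + 1) / (tf + k1 * norm)

# Saturation: how much extra credit do 10 occurrences earn over 1?
# The document has average length, so length normalization is neutral here.
for k1 in (0.5, 1.2, 2.0):
    gain = tf_component(10, k1, 0.75, 100, 100) / tf_component(1, k1, 0.75, 100, 100)
    print(f"k1={k1}: 10 occurrences are worth {gain:.2f}x one occurrence")

# Length normalization: same term count, but the document is 3x the average length.
for b in (0.0, 0.75, 1.0):
    print(f"b={b}: long-doc component = {tf_component(2, 1.2, b, 300, 100):.3f}")
```

A larger k1 lets repeated terms keep adding to the score for longer, while a larger b shrinks the contribution of terms found in longer-than-average documents.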

When BM25 shines

Exact phrases, IDs, codes, product names, error messages, legal citations.
You want fast, dependable retrieval without embeddings.
As the sparse half of hybrid retrieval (paired with a vector retriever in RAG).

⚡ Minimal BM25 in Python with `bm25s`

bm25s is a modern, fast, low-dependency BM25 library. Its authors report throughput competitive with Elasticsearch on a single node, while the library itself is implemented in Python on top of NumPy and SciPy. It supports popular BM25 variants (Lucene/ATIRE/BM25+/BM25L) and integrates with Hugging Face for saving/loading indices.

# pip install bm25s
import bm25s

# A tiny corpus
corpus = [
    "A cat is a small domesticated carnivorous mammal.",
    "Dogs are loyal companions that love to play.",
    "Birds can fly long distances and sing beautifully.",
    "Fish live in water and do not purr.",
]

# 1) Build the index (tokenize + index)
tokens = bm25s.tokenize(corpus)           # basic tokenizer; you can customize
retriever = bm25s.BM25(corpus=corpus, method="lucene")  # choose a variant
retriever.index(tokens)

# 2) Query
query = "Do fish purr like a cat?"
q_tokens = bm25s.tokenize(query)

docs, scores = retriever.retrieve(q_tokens, k=3)  # top-3 results

for rank, (doc, score) in enumerate(zip(docs[0], scores[0]), start=1):
    print(f"{rank}. score={score:.2f} | {doc}")

What you’ll observe:

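The “fish” sentence should rank first because it matches two query terms exactly (“fish” and “purr”).
The “cat” sentence should come next: it matches only “cat”, since the word “purr” does not appear in it.
No embeddings are used, just words and BM25 scoring, so a document with no overlapping terms gets no credit even if it is topically related.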

Notes: You can switch variants with method="bm25+" / "bm25l" / "atire" and (in BM25-style systems) tune k1 and b to control saturation and length normalization.

🧩 Using BM25 inside LlamaIndex

LlamaIndex exposes a BM25Retriever (powered by BM25S under the hood) with convenient persistence, metadata filtering, and hybrid-combo examples.

# pip install llama-index llama-index-retrievers-bm25 PyStemmer
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
from llama_index.retrievers.bm25 import BM25Retriever
import Stemmer

# 1) Tiny document set
docs = [
    Document(text="BM25 ranks documents using word matches and IDF."),
    Document(text="Vector search uses embeddings to capture meaning."),
    Document(text="Elasticsearch uses BM25 for lexical scoring by default."),
]

# 2) Chunk into nodes
splitter = SentenceSplitter(chunk_size=256)
nodes = splitter.get_nodes_from_documents(docs)

# 3) Create a BM25 retriever (with stemming & stopwords for English)
bm25 = BM25Retriever.from_defaults(
    nodes=nodes,
    similarity_top_k=3,
    stemmer=Stemmer.Stemmer("english"),
    language="english",
)

# 4) Retrieve
results = bm25.retrieve("How does Elasticsearch score text?")
for i, node in enumerate(results, 1):
    print(f"{i}. score={node.score:.3f} | {node.text}")

This is production-friendly: you can persist the index to disk, use a docstore, apply metadata filters, or combine it with a vector store for hybrid retrieval.

Takeaways

BM25 is fast, proven, and robust for keyword-heavy queries.
Two knobs matter most: k1 (term saturation) and b (length normalization).
Libraries like bm25s make BM25 easy and fast in Python; frameworks like LlamaIndex let you plug it into a full RAG stack (and fuse with dense retrieval later).

🧠 Semantic Search: When Meaning Matters More Than Words

Semantic search is all about finding what you mean, not just what you say. It uses embeddings—compressed representations of language—to surface results based on context, intent, and similarity in meaning. This is what powers smarter AI systems and RAG pipelines.

What Is Semantic Search?

Semantic search interprets the meaning behind queries. Unlike keyword search, which matches literal words, semantic search relies on vector similarity. It can therefore tell "chocolate milk" apart from "milk chocolate" (the same words, but the word order changes the meaning) and recognize that "football" might mean “soccer” or “American football” depending on context.

How Does It Work?

Embedding Generation: Both your documents and search queries are transformed into embeddings, high-dimensional numeric vectors that capture meaning.
Vector Storage in Databases: These embeddings are stored in specialized vector databases, optimized to find nearest vectors efficiently, even in large datasets.
Similarity Search (k-NN): When a query comes in, it's embedded and compared against stored vectors using similarity metrics (like cosine similarity). The system returns the closest matches, those conceptually most similar to your query.
Efficient Indexing Techniques: To speed up searches over massive embedding sets, vector databases use approximate nearest neighbor strategies like HNSW (Hierarchical Navigable Small World graphs).
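The similarity-search step can be sketched in a few lines of plain Python. This is an exact, brute-force k-NN over made-up 3-dimensional vectors. The vectors, document labels, and function names are hypothetical stand-ins for real model embeddings, which typically have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn(query_vec, doc_vecs, k=2):
    """Exact (brute-force) k-NN: score every document, keep the top k."""
    scored = [(cosine_similarity(query_vec, v), doc_id) for doc_id, v in doc_vecs.items()]
    return sorted(scored, reverse=True)[:k]

# Hypothetical toy "embeddings", not produced by a real model.
doc_vecs = {
    "electric cars": [0.9, 0.1, 0.0],
    "smoothie recipes": [0.0, 0.2, 0.9],
    "self-driving Tesla": [0.8, 0.3, 0.1],
}
query_vec = [0.85, 0.2, 0.05]  # stands in for an embedded query about EVs

top = knn(query_vec, doc_vecs, k=2)
for score, doc_id in top:
    print(f"{score:.3f} | {doc_id}")
```

Brute force compares the query against every stored vector, which is fine for small collections. Approximate indexes such as HNSW trade a little accuracy for much faster lookups at scale.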

Why It Matters

Understands Context & Synonyms: It can connect semantically related concepts even if they don’t share words. A search for "electric cars" can surface “EV,” “Tesla,” or “hybrid vehicle.”
Delivers Intent-Rich Results: Semantic search interprets deeper meaning, delivering results based on intent rather than keyword matches alone.
Handles Natural Language Queries Better: You can type "stylish couch for small living room," and it matches on the meaning of the whole query, not just the individual keywords.
Crucial for RAG: Embeddings and vector search enable RAG systems to fetch documents with relevant meaning, not just literal overlap. Better-grounded context can reduce hallucinations.

Quick Python Example: Semantic Search with Embeddings + FAISS

# pip install sentence-transformers faiss-cpu

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# 1. Sample corpus and embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')
docs = [
    "Electric cars are the future of transportation.",
    "Healthy smoothie recipes for breakfast.",
    "Python programming tips and tricks.",
    "Tesla unveils new self-driving car model.",
]

# 2. Compute embeddings
doc_embeddings = model.encode(docs, convert_to_numpy=True)

# 3. Build FAISS index (vector DB)
dimension = doc_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(doc_embeddings)

# 4. Query and retrieve
query = "latest developments in autonomous electric vehicles"
query_vector = model.encode([query], convert_to_numpy=True)
distances, indices = index.search(query_vector, k=2)

# IndexFlatL2 returns squared L2 distances: smaller means more similar
for dist, idx in zip(distances[0], indices[0]):
    print(f"Distance: {dist:.4f} | {docs[idx]}")

What to expect:

The query about autonomous electric vehicles should bring up the Tesla/self-driving car doc—even though the wording doesn’t exactly match.
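One detail worth keeping in mind: IndexFlatL2 returns squared L2 distances, so smaller values mean closer matches. If the embeddings are normalized to unit length, ranking by L2 distance gives the same order as ranking by cosine similarity, because ||a - b||² = 2 - 2·cos(a, b). The plain-Python sketch below checks that identity on made-up vectors. The helper names and values are illustrative.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    # For unit vectors the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

query = normalize([0.85, 0.2, 0.05])
docs = [normalize(v) for v in ([0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.5, 0.5, 0.5])]

for d in docs:
    print(f"squared L2 = {squared_l2(query, d):.4f} | 2 - 2*cos = {2 - 2 * cosine(query, d):.4f}")

# Both orderings put the same documents first.
by_l2 = sorted(range(len(docs)), key=lambda i: squared_l2(query, docs[i]))
by_cos = sorted(range(len(docs)), key=lambda i: -cosine(query, docs[i]))
print(by_l2, by_cos)
```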

Summary: Semantic Search in a Nutshell

Component	Function
Embeddings	Represent meaning of text numerically
Vector Database	Efficient storage & retrieval of embeddings
Similarity Search	Finds conceptually closest matches (k-NN)
Indexing (e.g., HNSW)	Speeds up search in high-dimensional spaces

Semantic search bridges the gap between human intent and machine understanding—making RAG systems more accurate and intuitive.

🧩 Hybrid Search: Merging Precision with Context

Hybrid search is a technique that combines keyword (lexical) search with semantic (vector) search to leverage the strengths of both approaches. This fusion enhances search relevance by considering both exact term matching and semantic context, providing more accurate and comprehensive search results.

🔍 What Is Hybrid Search?

Hybrid search operates by running both keyword and vector searches in parallel and then merging their results. The combination can be achieved through various fusion methods, such as:

Ranked Fusion: Merges results based on their ranks from both search types.
Relative Score Fusion: Combines results by normalizing and merging their scores.

This approach allows systems to benefit from the exactness of keyword matches and the contextual understanding of semantic searches, making it versatile and robust for a wide range of search use cases.
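One widely used rank-based method is Reciprocal Rank Fusion (RRF), shown here as a minimal sketch and not as the exact algorithm of any particular engine. Each document receives 1 / (k + rank) from every list it appears in. The constant k = 60 is a conventional default, and the document IDs are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); higher ranks contribute more.
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c"]      # hypothetical keyword results
vector_ranking = ["doc_b", "doc_d", "doc_a"]    # hypothetical semantic results

fused_ranking = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused_ranking)
```

Because RRF looks only at ranks, it sidesteps the problem that BM25 scores and vector similarities live on different scales.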

⚙️ How Does Hybrid Search Work?

Keyword Search: Utilizes traditional keyword-based methods like BM25 to find documents that match the query terms exactly.
Semantic Search: Employs vector embeddings to understand the meaning behind the query and retrieve documents with similar semantic content.
Fusion: Combines the results from both searches using a fusion algorithm to produce a final ranking of documents.

This process ensures that the search system considers both the literal terms in the query and the underlying meaning, leading to more relevant and accurate results.
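The three steps can be sketched end to end with a score-based fusion: normalize each retriever's scores to the 0-1 range, then take a weighted sum. This is a simplified illustration under stated assumptions. The raw scores are made up, and the function names and the alpha parameter are hypothetical, not taken from a specific library.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the 0-1 range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def weighted_score_fusion(keyword_scores, vector_scores, alpha=0.5):
    """alpha weights the vector side; (1 - alpha) weights the keyword side."""
    kw = min_max_normalize(keyword_scores)
    vec = min_max_normalize(vector_scores)
    fused = {
        doc_id: (1 - alpha) * kw.get(doc_id, 0.0) + alpha * vec.get(doc_id, 0.0)
        for doc_id in set(kw) | set(vec)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Hypothetical raw scores: BM25 scores are unbounded, cosine similarities are not.
keyword_scores = {"doc_a": 7.2, "doc_b": 3.1, "doc_c": 0.4}
vector_scores = {"doc_b": 0.91, "doc_c": 0.88, "doc_d": 0.35}

fused = weighted_score_fusion(keyword_scores, vector_scores, alpha=0.5)
for doc_id, score in fused:
    print(f"{score:.3f} | {doc_id}")
```

Raising alpha favors semantic matches, and lowering it favors exact keyword matches. Documents found by only one retriever simply receive zero from the other side.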

🧪 Example: Implementing Hybrid Search in OpenSearch

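OpenSearch 2.11 introduced hybrid search capabilities, allowing users to combine keyword and semantic searches effectively. In OpenSearch, the hybrid query lists the sub-queries, while score normalization and combination are configured separately in a search pipeline. Here's an example of how to implement a hybrid search query (the pipeline name hybrid-pipeline is illustrative, and the index and field names are placeholders):

PUT /_search/pipeline/hybrid-pipeline
{
  "description": "Normalize and combine keyword and vector scores",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": { "technique": "min_max" },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": { "weights": [0.5, 0.5] }
        }
      }
    }
  ]
}

POST /my-index/_search?search_pipeline=hybrid-pipeline
{
  "query": {
    "hybrid": {
      "queries": [
        {
          "match": {
            "title": "AI advancements"
          }
        },
        {
          "knn": {
            "content_vector": {
              "vector": [0.12, 0.34, 0.56, ...],
              "k": 5
            }
          }
        }
      ]
    }
  }
}

In this example:

The match query searches for documents whose title matches the terms in "AI advancements".
The knn query searches for documents whose content_vector (a field mapped as knn_vector) is close to the provided query vector.
The search pipeline's normalization-processor rescales each sub-query's scores with min_max and combines them with a weighted arithmetic_mean, here with equal weights for the keyword and vector searches. This setup lets OpenSearch rank highly the documents that match the keywords, are semantically close to the query, or both, enhancing the overall search experience.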

💡 Why Use Hybrid Search?

Enhanced Relevance: Combines the precision of keyword search with the contextual understanding of semantic search.
Flexibility: Allows fine-tuning of the contribution of each search type through fusion parameters.
Improved User Experience: Delivers more accurate and comprehensive search results, meeting diverse user needs.

🔍 The Evolution of Search: From Precision to Understanding

In the realm of Retrieval-Augmented Generation (RAG), the effectiveness of AI systems hinges on their ability to retrieve and process information accurately. The journey of search methodologies reflects this progression:

Keyword Search: The bedrock of traditional search engines, keyword search offers precision by matching exact terms. It's akin to a meticulous librarian who retrieves information based solely on specific queries.
Semantic Search: Advancing beyond mere keyword matching, semantic search understands the context and meaning behind queries. This approach is like conversing with an expert who comprehends the nuances of your questions.
Hybrid Search: Integrating the strengths of both keyword and semantic search, hybrid search combines precision with contextual understanding. It's as if you have a team of specialists—each excelling in different areas—collaborating to provide the most accurate and relevant information.

💡 Stay tuned! In the next post, we’ll dive deeper into RAG pipelines and discuss how to evaluate chunk retrieval—showing how BM25, semantic search, and hybrid methods work together to make AI systems smarter and more reliable.
