Hybrid Retrieval with BM42

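Last updated: September 24, 2024

In this notebook, we will see how to build a hybrid retrieval pipeline that combines BM42, a new method for sparse embedding retrieval, with dense embedding retrieval.

We will use the Qdrant Document Store and Fastembed Embedders.

⚠️ Recent evaluations have raised questions about the effectiveness of BM42. Future developments may address these concerns. Keep this in mind when reviewing the content.

Why BM42?

Qdrant introduced BM42, an algorithm designed to replace BM25 in hybrid RAG pipelines (dense + sparse retrieval).

They found that, although BM25 has been relevant for a long time, it has some limitations in common RAG scenarios.

Let's first look at BM25 and SPLADE to understand the motivation and inspiration behind BM42.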

BM25

\begin{equation}
\text{score}(D,Q) = \sum_{i=1}^{N} \text{IDF}(q_i) \times \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}
\end{equation}

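BM25 is an evolution of TF-IDF and has two parts:

- Inverse Document Frequency = term importance within the collection
- a component incorporating Term Frequency = term importance within the document

In the formula, f(q_i, D) is the frequency of query term q_i in document D, |D| is the length of the document, avgdl is the average document length in the collection, and k_1 and b are tuning parameters.

The Qdrant developers observed that the Term Frequency component relies on document statistics, which are only meaningful for longer texts. This is not the case in common RAG pipelines, where documents are usually short.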
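To make the two parts of the formula concrete, here is a minimal pure-Python sketch of BM25 scoring. It is only an illustration, not the implementation used by Qdrant or Haystack: the function name `bm25_scores`, the toy corpus, the whitespace tokenization, the default values `k1=1.2` and `b=0.75`, and the specific IDF variant are all assumptions made for this example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with BM25."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)  # term frequencies within this document
        score = 0.0
        for q in query:
            if q not in tf:
                continue
            # IDF part: importance of the term within the collection
            # (a common BM25-style variant, assumed here).
            idf = math.log(1 + (n_docs - df[q] + 0.5) / (df[q] + 0.5))
            # TF part: importance of the term within the document,
            # normalized by document length relative to the average.
            denom = tf[q] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[q] * (k1 + 1) / denom
        scores.append(score)
    return scores


# Toy corpus (hypothetical), tokenized by whitespace.
docs = [
    "orcas eat fish and seals".split(),
    "capybaras eat grass".split(),
    "dolphins are social mammals".split(),
]
scores = bm25_scores("eat fish".split(), docs)
print(scores)
```

With documents this short, every term frequency is 1, so the TF part contributes little beyond length normalization. This is the limitation that motivates BM42.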

SPLADE

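Another interesting approach is SPLADE, which uses a BERT-based model to create a bag-of-words representation of the text. While it generally performs better than BM25, it has some drawbacks:

- tokenization issues with out-of-vocabulary words
- adaptation to new domains requires fine-tuning
- computationally heavy

To use SPLADE with Haystack, see this notebook.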

BM42

\begin{equation} \text{score}(D,Q) = \sum_{i=1}^{N} \text{IDF}(q_i) \times \text{Attention}(\text{CLS}, q_i) \end{equation}

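Taking inspiration from SPLADE, the Qdrant team developed BM42 to improve on BM25.

IDF works well, so they kept it.

But how can the importance of a term within a document be quantified?

The attention matrix of Transformer models comes to the rescue: we can use the attention row for the [CLS] token!

To work around the tokenization issues, BM42 merges subwords and sums their attention weights.

In their implementation, the Qdrant team used the all-MiniLM-L6-v2 model, but the technique can work with any Transformer, with no fine-tuning required.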
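The subword merging and the scoring formula can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the Fastembed or Qdrant implementation: the tokens, attention weights and IDF values below are made up, the `##` prefix follows the WordPiece convention, and the helper names `merge_subwords` and `bm42_score` are hypothetical.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def merge_subwords(tokens, weights):
    """Merge WordPiece continuation pieces ("##...") into whole words,
    summing the [CLS] attention weight of every piece."""
    merged = []  # list of [word, weight] pairs, in document order
    for token, weight in zip(tokens, weights):
        if token in SPECIAL_TOKENS:
            continue
        if token.startswith("##") and merged:
            merged[-1][0] += token[2:]  # glue the piece onto the previous word
            merged[-1][1] += weight     # and accumulate its attention
        else:
            merged.append([token, weight])
    # A word may occur several times in a document: sum its weights.
    totals = {}
    for word, weight in merged:
        totals[word] = totals.get(word, 0.0) + weight
    return totals


def bm42_score(query_words, doc_weights, idf):
    """score(D, Q) = sum over query terms of IDF(q_i) * Attention(CLS, q_i)."""
    return sum(idf.get(q, 0.0) * doc_weights.get(q, 0.0) for q in query_words)


# Hypothetical tokenization and [CLS] attention row for one short document.
tokens = ["[CLS]", "cap", "##y", "##bara", "eats", "grass", "[SEP]"]
weights = [0.10, 0.20, 0.05, 0.15, 0.10, 0.30, 0.10]
doc_weights = merge_subwords(tokens, weights)

# Hypothetical IDF values; in practice they come from collection statistics.
idf = {"capybara": 1.5, "grass": 0.8}
score = bm42_score(["capybara", "grass"], doc_weights, idf)
print(doc_weights, score)
```

In a real pipeline, the attention weights would be read from the Transformer and the IDF would be computed from the collection; the sketch only shows how the two are combined.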

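⚠️ Recent evaluations have raised questions about the effectiveness of BM42. Future developments may address these concerns. Keep this in mind when reviewing the content.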

Install dependencies

!pip install -U fastembed-haystack qdrant-haystack wikipedia transformers

Hybrid Retrieval

Indexing

Create a Qdrant Document Store

from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

document_store = QdrantDocumentStore(
    ":memory:",
    recreate_index=True,
    embedding_dim=384,
    return_embedding=True,
    use_sparse_embeddings=True,  # must be True, otherwise the collection schema won't allow storing sparse vectors
    sparse_idf=True,  # required for BM42: allows streaming updates of the sparse embeddings while keeping the IDF calculation up to date
)
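Why does IDF live in the Document Store rather than in the embedder? IDF depends on statistics of the whole collection, so it changes whenever documents are added. The toy class below sketches this idea only; it is not how Qdrant implements it. The class name `StreamingIdf` and the BM25-style IDF formula are assumptions made for the example.

```python
import math
from collections import Counter


class StreamingIdf:
    """Keeps document frequencies up to date as documents arrive."""

    def __init__(self):
        self.n_docs = 0
        self.doc_freq = Counter()

    def add(self, tokens):
        """Register one tokenized document."""
        self.n_docs += 1
        self.doc_freq.update(set(tokens))  # count each term once per document

    def idf(self, term):
        """BM25-style IDF computed from the current statistics (assumed formula)."""
        n = self.doc_freq[term]
        return math.log(1 + (self.n_docs - n + 0.5) / (n + 0.5))


stats = StreamingIdf()
stats.add(["orca", "fish"])
idf_before = stats.idf("fish")

# After more documents without "fish" arrive, the term becomes rarer
# relative to the collection, so its IDF grows.
stats.add(["capybara", "grass"])
stats.add(["walrus", "ice"])
idf_after = stats.idf("fish")
print(idf_before, idf_after)
```

Because the stored sparse vectors hold only the per-document weights, the IDF factor can be applied at query time from the current statistics, and no document needs to be re-embedded when the collection grows.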

Download Wikipedia pages and create raw documents

We download a few Wikipedia pages about animals and create Haystack documents from them.

import wikipedia
from haystack.dataclasses import Document

nice_animals = ["Capybara", "Dolphin", "Orca", "Walrus"]

raw_docs = []
for title in nice_animals:
    # auto_suggest=False looks up the page with exactly this title
    page = wikipedia.page(title=title, auto_suggest=False)
    doc = Document(content=page.content, meta={"title": page.title, "url": page.url})
    raw_docs.append(doc)

Indexing pipeline

Our indexing pipeline includes both a sparse document embedder (based on BM42) and a dense document embedder.

from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy
from haystack import Pipeline
from haystack_integrations.components.embedders.fastembed import FastembedSparseDocumentEmbedder, FastembedDocumentEmbedder

hybrid_indexing = Pipeline()
hybrid_indexing.add_component("cleaner", DocumentCleaner())
hybrid_indexing.add_component("splitter", DocumentSplitter(split_by='sentence', split_length=4))
hybrid_indexing.add_component("sparse_doc_embedder", FastembedSparseDocumentEmbedder(model="Qdrant/bm42-all-minilm-l6-v2-attentions", meta_fields_to_embed=["title"]))
hybrid_indexing.add_component("dense_doc_embedder", FastembedDocumentEmbedder(model="BAAI/bge-small-en-v1.5", meta_fields_to_embed=["title"]))
hybrid_indexing.add_component("writer", DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE))

hybrid_indexing.connect("cleaner", "splitter")
hybrid_indexing.connect("splitter", "sparse_doc_embedder")
hybrid_indexing.connect("sparse_doc_embedder", "dense_doc_embedder")
hybrid_indexing.connect("dense_doc_embedder", "writer")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7fb6bc33a2f0>
🚅 Components
  - cleaner: DocumentCleaner
  - splitter: DocumentSplitter
  - sparse_doc_embedder: FastembedSparseDocumentEmbedder
  - dense_doc_embedder: FastembedDocumentEmbedder
  - writer: DocumentWriter
🛤️ Connections
  - cleaner.documents -> splitter.documents (List[Document])
  - splitter.documents -> sparse_doc_embedder.documents (List[Document])
  - sparse_doc_embedder.documents -> dense_doc_embedder.documents (List[Document])
  - dense_doc_embedder.documents -> writer.documents (List[Document])

Let's index our documents!

⚠️ If you are running this notebook on Google Colab, note that Colab provides only 2 CPU cores, so generating embeddings with Fastembed may be slower than on a standard machine.

hybrid_indexing.run({"documents":raw_docs})

Calculating sparse embeddings: 100%|██████████| 340/340 [00:27<00:00, 12.52it/s]
Calculating embeddings: 100%|██████████| 340/340 [01:23<00:00,  4.07it/s]
400it [00:00, 1179.66it/s]

{'writer': {'documents_written': 340}}

document_store.count_documents()

Retrieval

Retrieval pipeline

As already mentioned, BM42 is designed to perform best in hybrid retrieval (and hybrid RAG) pipelines.

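The retrieval pipeline uses three components:

- FastembedSparseTextEmbedder: transforms the query into a sparse embedding
- FastembedTextEmbedder: transforms the query into a dense embedding
- QdrantHybridRetriever: looks for relevant documents, based on the similarity of the embeddings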
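As a rough picture of what "similarity of the embeddings" means for the two kinds of vectors, the sketch below compares a sparse query and document with a dot product and a dense pair with cosine similarity. The vectors are made up, the helper names are hypothetical, and cosine is assumed as the dense metric; the actual metric depends on how the Document Store is configured.

```python
import math


def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {index: value} dicts.
    Only indices present in both vectors contribute."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(value * b[index] for index, value in a.items() if index in b)


def cosine(u, v):
    """Cosine similarity of two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Hypothetical sparse vectors: indices stand for terms, values for weights.
query_sparse = {17: 1.0, 42: 1.0}
doc_sparse = {42: 0.3, 99: 0.5}
sparse_score = sparse_dot(query_sparse, doc_sparse)

# Hypothetical two-dimensional dense vectors (real ones here have 384 dimensions).
dense_score = cosine([1.0, 0.0], [1.0, 1.0])
print(sparse_score, dense_score)
```

The two scores live on different scales, which is one reason a hybrid retriever combines the two result lists by rank rather than by raw score.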

The Qdrant Hybrid Retriever compares dense and sparse query and document embeddings, and retrieves the most relevant documents using Reciprocal Rank Fusion.
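Reciprocal Rank Fusion ignores the raw scores and looks only at the position of each document in the dense and sparse result lists. A minimal sketch, assuming the textbook formulation in which a document at 1-based rank r contributes 1 / (k + r): the function name and the document ids are hypothetical, and this is not Qdrant's code. The constant k is commonly set to 60 in the literature; the scores printed later in this notebook (0.5, 0.333, 0.25) would be consistent with a much smaller constant, but that is an inference from the output, not something the source states.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids.
    Each list contributes 1 / (k + rank) per document, with rank starting at 1."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical result lists from the dense and the sparse search.
dense_ranking = ["a", "b", "c"]
sparse_ranking = ["c", "a", "d"]

# k=1 is chosen here only to get easy-to-read numbers (assumption).
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking], k=1)
for doc_id, score in fused:
    print(doc_id, round(score, 4))
```

Documents that appear near the top of both lists accumulate two contributions and rise above documents that only one of the searches found.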

If you want to customize the fusion behavior further, see the Hybrid Retrieval Pipelines tutorial.

from haystack_integrations.components.retrievers.qdrant import QdrantHybridRetriever
from haystack_integrations.components.embedders.fastembed import FastembedTextEmbedder, FastembedSparseTextEmbedder


hybrid_query = Pipeline()
hybrid_query.add_component("sparse_text_embedder", FastembedSparseTextEmbedder(model="Qdrant/bm42-all-minilm-l6-v2-attentions"))
hybrid_query.add_component("dense_text_embedder", FastembedTextEmbedder(model="BAAI/bge-small-en-v1.5", prefix="Represent this sentence for searching relevant passages: "))
hybrid_query.add_component("retriever", QdrantHybridRetriever(document_store=document_store, top_k=5))

hybrid_query.connect("sparse_text_embedder.sparse_embedding", "retriever.query_sparse_embedding")
hybrid_query.connect("dense_text_embedder.embedding", "retriever.query_embedding")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7fb6bc33ae30>
🚅 Components
  - sparse_text_embedder: FastembedSparseTextEmbedder
  - dense_text_embedder: FastembedTextEmbedder
  - retriever: QdrantHybridRetriever
🛤️ Connections
  - sparse_text_embedder.sparse_embedding -> retriever.query_sparse_embedding (SparseEmbedding)
  - dense_text_embedder.embedding -> retriever.query_embedding (List[float])

Try the retrieval pipeline

question = "Who eats fish?"

results = hybrid_query.run(
    {"dense_text_embedder": {"text": question},
     "sparse_text_embedder": {"text": question}}
)

Calculating sparse embeddings: 100%|██████████| 1/1 [00:00<00:00, 82.10it/s]
Calculating embeddings: 100%|██████████| 1/1 [00:00<00:00,  7.75it/s]

import rich

for d in results['retriever']['documents']:
  rich.print(f"\nid: {d.id}\n{d.meta['title']}\n{d.content}\nscore: {d.score}\n---")

id: 370071638e221257cf77702716695626d9b1b4dfe4212b4a10e255434bfeb08b
Orca
 Some populations in the Norwegian and Greenland sea specialize in herring and follow that fish's autumnal 
migration to the Norwegian coast. Salmon account for 96% of northeast Pacific residents' diet, including 65% of 
large, fatty Chinook. Chum salmon are also eaten, but smaller sockeye and pink salmon are not a significant food 
item. Depletion of specific prey species in an area is, therefore, cause for concern for local populations, despite
the high diversity of prey.
score: 0.5
---

id: 1ed8f49561630f10202b55c8c7619a32cd9f6a11675cbb56c64a578826e488ef
Orca
; Ellis, Graeme M. (2006). "Selective foraging by fish-eating killer whales Orcinus orca in British Columbia". 
Marine Ecology Progress Series.
score: 0.5
---

id: a9bb77dac4747c4fba48a7464038c9da206d7e3663d837f2c95f6d882de8111e
Orca
 On average, an orca eats 227 kilograms (500 lb) each day. While salmon are usually hunted by an individual whale 
or a small group, herring are often caught using carousel feeding: the orcas force the herring into a tight ball by
releasing bursts of bubbles or flashing their white undersides. They then slap the ball with their tail flukes, 
stunning or killing up to 15 fish at a time, then eating them one by one. Carousel feeding has been documented only
in the Norwegian orca population, as well as some oceanic dolphin species.
score: 0.41666666666666663
---

id: 33fdef8b4f33f4c5ce00cbbc9e3cb3605b778131854436d4bb7e54f5adaf79ae
Dolphin
 === Consumption === ==== Cuisine ==== In some parts of the world, such as Taiji, Japan and the Faroe Islands, 
dolphins are traditionally considered as food, and are killed in harpoon or drive hunts.
Dolphin meat is consumed in a small number of countries worldwide, which include Japan and Peru (where it is 
referred to as chancho marino, or "sea pork"). While Japan may be the best-known and most controversial example, 
only a very small minority of the population has ever sampled it.
Dolphin meat is dense and such a dark shade of red as to appear black.
score: 0.3333333333333333
---

id: 6b643c8aa3d47fc198063f8bbc98828bd1d2368d22c95b6b97c36beb60b7fbd0
Orca
" Although large variation in the ecological distinctiveness of different orca groups complicate simple 
differentiation into types, research off the west coast of North America has identified fish-eating "residents", 
mammal-eating "transients" and "offshores". Other populations have not been as well studied, although specialized 
fish and mammal eating orcas have been distinguished elsewhere. Mammal-eating orcas in different regions were long 
thought likely to be closely related, but genetic testing has refuted this hypothesis. A 2024 study supported the 
elevation of Eastern North American resident and transient orcas as distinct species, O.
score: 0.3333333333333333
---

question = "capybara social behavior"

results = hybrid_query.run(
    {"dense_text_embedder": {"text": question},
     "sparse_text_embedder": {"text": question}}
)

Calculating sparse embeddings: 100%|██████████| 1/1 [00:00<00:00, 71.98it/s]
Calculating embeddings: 100%|██████████| 1/1 [00:00<00:00,  8.90it/s]

import rich

for d in results['retriever']['documents']:
  rich.print(f"\nid: {d.id}\n{d.meta['title']}\n{d.content}\nscore: {d.score}\n---")

id: d35c090ebfdad52eb882915b0ee2a9578c751a243ecef3a2c941ef0713a7c9aa
Capybara
 The capybara inhabits savannas and dense forests, and lives near bodies of water. It is a highly social species 
and can be found in groups as large as 100 individuals, but usually live in groups of 10–20 individuals. The 
capybara is hunted for its meat and hide and also for grease from its thick fatty skin. == Etymology ==
Its common name is derived from Tupi ka'apiûara, a complex agglutination of kaá (leaf) + píi (slender) + ú (eat) + 
ara (a suffix for agent nouns), meaning "one who eats slender leaves", or "grass-eater".
score: 0.7
---

id: e1b0dcc9a1d01481052af5964616f438073f201ffaa0605282a7ddaf90fcafaf
Capybara
 Males establish social bonds, dominance, or general group consensus. They can make dog-like barks when threatened 
or when females are herding young.
Capybaras have two types of scent glands: a morrillo, located on the snout, and anal glands. Both sexes have these 
glands, but males have much larger morrillos and use their anal glands more frequently.
score: 0.6666666666666666
---

id: fd11addea30e8ae2f1d60274beae4d42646b075eb0579bef1c2899cde1e1bb2b
Capybara
1.31.0.1.
score: 0.5
---

id: 1600c15a21aa722965ef2cc4fab4e622474fc8d1ff9e0c555c955e78b038ee2d
Capybara
 In addition, a female alerts males she is in estrus by whistling through her nose. During mating, the female has 
the advantage and mating choice. Capybaras mate only in water, and if a female does not want to mate with a certain
male, she either submerges or leaves the water. Dominant males are highly protective of the females, but they 
usually cannot prevent some of the subordinates from copulating.
score: 0.25
---

id: 994f31c23e46c16744558b3a499cff0c446da33661a74bb2ddaede9e26e64e11
Capybara
40 ft) in length, stand 50 to 62 cm (20 to 24 in) tall at the withers, and typically weigh 35 to 66 kg (77 to 146 
lb), with an average in the Venezuelan llanos of 48.9 kg (108 lb). Females are slightly heavier than males. The top
recorded weights are 91 kg (201 lb) for a wild female from Brazil and 73.
score: 0.25
---

📚 Resources

(Notebook by Stefano Fiorucci)