Improve retrieval by embedding meaningful metadata


Last updated: September 19, 2024

In this notebook, I run some experiments on embedding meaningful metadata to improve Document retrieval.

%%capture
! pip install wikipedia haystack-ai sentence_transformers rich

import rich

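Load data from Wikipedia

We use the `wikipedia` Python library to download the Wikipedia pages of a few bands.

Each page is then converted into a Haystack Document, with the page title and URL stored as metadata.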

some_bands="""The Beatles
Rolling stones
Dire Straits
The Cure
The Smiths""".split("\n")

import wikipedia
from haystack.dataclasses import Document

raw_docs=[]

for title in some_bands:
    page = wikipedia.page(title=title, auto_suggest=False)
    doc = Document(content=page.content, meta={"title": page.title, "url":page.url})
    raw_docs.append(doc)

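🔧 Set up the experiment

Utility functions to create Pipelines

The indexing Pipeline transforms the Documents and stores them (together with their vectors) in a Document Store. The retrieval Pipeline takes a query as input and performs vector search.

I wrote some utility functions to create the different indexing and retrieval Pipelines.

Specifically, I want to compare the standard approach (embed the text only) with the metadata embedding strategy (embed the text + meaningful metadata).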
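To make the difference between the two strategies concrete, here is a minimal sketch of what "embedding metadata" amounts to: the selected metadata values are prepended to the chunk text before that text is passed to the embedding model. The helper name `build_text_to_embed`, the `"\n"` separator, the sample chunk, and the example URL are all assumptions for illustration; this is not the embedder's actual implementation.

```python
# Illustrative sketch only: the helper name, separator, and sample data are assumptions.
def build_text_to_embed(content, meta, meta_fields_to_embed, separator="\n"):
    # Keep only the requested fields that are present and non-empty
    meta_values = [str(meta[field]) for field in meta_fields_to_embed if meta.get(field)]
    # Metadata values come first, followed by the chunk text
    return separator.join(meta_values + [content])

# A made-up two-sentence chunk: note that it never mentions the band's name
chunk = "The band announced a new tour. Dates were published later that year."
meta = {"title": "The Cure", "url": "https://example.org/the-cure"}

print(repr(build_text_to_embed(chunk, meta, [])))         # standard approach: text only
print(repr(build_text_to_embed(chunk, meta, ["title"])))  # text + title metadata
```

Since the Documents are split into very short chunks, many chunks lose the name of the band they refer to. Prepending the title restores that context in the text that gets embedded.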

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.utils import ComponentDevice

def create_indexing_pipeline(document_store, metadata_fields_to_embed):

  indexing = Pipeline()
  indexing.add_component("cleaner", DocumentCleaner())
  indexing.add_component("splitter", DocumentSplitter(split_by='sentence', split_length=2))

  # in the following component, the `meta_fields_to_embed` parameter specifies which metadata fields are embedded together with the text
  indexing.add_component("doc_embedder", SentenceTransformersDocumentEmbedder(model="thenlper/gte-large",
                                                                              device=ComponentDevice.from_str("cuda:0"),
                                                                              meta_fields_to_embed=metadata_fields_to_embed)
  )
  indexing.add_component("writer", DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE))

  indexing.connect("cleaner", "splitter")
  indexing.connect("splitter", "doc_embedder")
  indexing.connect("doc_embedder", "writer")

  return indexing

def create_retrieval_pipeline(document_store):

  retrieval = Pipeline()
  retrieval.add_component("text_embedder", SentenceTransformersTextEmbedder(model="thenlper/gte-large",
                                                                            device=ComponentDevice.from_str("cuda:0")))
  retrieval.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store, scale_score=False, top_k=3))

  retrieval.connect("text_embedder", "retriever")

  return retrieval
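For intuition about the vector search step, the following is a rough, dependency-free sketch of ranking stored chunks by cosine similarity to a query vector and keeping the top `k`. The function names and the toy 2-dimensional vectors are hypothetical; the real retriever works on the model's high-dimensional embeddings and its internals may differ.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_by_cosine(query_vec, doc_vecs, k=3):
    # Score every stored vector against the query, then keep the k best
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy vectors (assumed for illustration only)
toy_vectors = {"chunk_a": [1.0, 0.0], "chunk_b": [0.6, 0.8], "chunk_c": [0.0, 1.0]}
ranking = top_k_by_cosine([1.0, 0.0], toy_vectors, k=2)
print(ranking)
```

With `scale_score=False`, the scores attached to the retrieved Documents are left as raw similarity values rather than being rescaled.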

Create the Pipelines

Let's define 2 Document Stores, so we can compare the two approaches.

document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
document_store_w_embedded_metadata = InMemoryDocumentStore(embedding_similarity_function="cosine")

Now, I create the 2 indexing Pipelines and run them.

indexing_pipe_std = create_indexing_pipeline(document_store=document_store, metadata_fields_to_embed=[])

# here we specify the fields to embed
# we select the field `title`, containing the name of the band
indexing_pipe_w_embedded_metadata = create_indexing_pipeline(document_store=document_store_w_embedded_metadata, metadata_fields_to_embed=["title"])

indexing_pipe_std.run({"cleaner":{"documents":raw_docs}})
indexing_pipe_w_embedded_metadata.run({"cleaner":{"documents":raw_docs}})

print(len(document_store.filter_documents()))
print(len(document_store_w_embedded_metadata.filter_documents()))

Create the 2 retrieval Pipelines.

retrieval_pipe_std = create_retrieval_pipeline(document_store=document_store)

retrieval_pipe_w_embedded_metadata = create_retrieval_pipeline(document_store=document_store_w_embedded_metadata)

🧪 Run the experiment!

# standard approach (no metadata embedding)

res=retrieval_pipe_std.run({"text_embedder":{"text":"have the beatles ever been to bangor?"}})
for doc in res['retriever']['documents']:
  rich.print(doc)
  rich.print(doc.content+"\n")

❌ The retrieved Documents do not seem relevant

# embedding meaningful metadata

res=retrieval_pipe_w_embedded_metadata.run({"text_embedder":{"text":"have the beatles ever been to bangor?"}})
for doc in res['retriever']['documents']:
  rich.print(doc)
  rich.print(doc.content+"\n")

✅ The first Document is relevant

# standard approach (no metadata embedding)

res=retrieval_pipe_std.run({"text_embedder":{"text":"What announcements did the band The Cure make in 2022?"}})
for doc in res['retriever']['documents']:
  rich.print(doc)
  rich.print(doc.content)

❌ The retrieved Documents do not seem relevant

# embedding meaningful metadata

res=retrieval_pipe_w_embedded_metadata.run({"text_embedder":{"text":"What announcements did the band The Cure make in 2022?"}})
for doc in res['retriever']['documents']:
  rich.print(doc)
  rich.print(doc.content)

✅ Some of the Documents are relevant
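The comparison cells above all repeat the same loop. As a possible convenience, a small helper could return just the title and score of each retrieved Document, which makes the two Pipelines easier to compare side by side. The name `top_titles` is hypothetical; the sketch assumes the Pipeline has components named `text_embedder` and `retriever`, as in this notebook, and that each Document carries a `title` in its metadata.

```python
def top_titles(pipeline, query):
    """Run a retrieval pipeline and return (title, score) for each retrieved Document."""
    res = pipeline.run({"text_embedder": {"text": query}})
    return [(doc.meta.get("title"), doc.score) for doc in res["retriever"]["documents"]]

# Example usage inside this notebook (requires the Pipelines created earlier):
# query = "What announcements did the band The Cure make in 2022?"
# print(top_titles(retrieval_pipe_std, query))
# print(top_titles(retrieval_pipe_w_embedded_metadata, query))
```

If the metadata embedding helps, the titles returned by the second Pipeline should match the band named in the query more often than those returned by the first.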

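⚠️ Notes of caution

This technique is not a silver bullet.
It works well when the embedded metadata are meaningful and distinctive.
I would say that the embedded metadata should also be meaningful from the perspective of the embedding model. For example, I don't expect embedding numbers to work well.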