Improving Retrieval with Auto-Merging and Hierarchical Document Retrieval
Last updated: March 20, 2025
This notebook demonstrates how to use two Haystack components: the AutoMergingRetriever and the HierarchicalDocumentSplitter.
Setup
!pip install haystack-ai
Let's get a dataset to index and explore
- We will use a dataset containing 2,225 news articles from the paper "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering" by D. Greene and P. Cunningham, Proc. ICML 2006.
- The original dataset is available at http://mlg.ucd.ie/datasets/bbc.html, but we will use a processed CSV version available here: https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv
!wget https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv
--2024-09-06 09:41:04-- https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5080260 (4.8M) [text/plain]
Saving to: ‘bbc-news-data.csv’
bbc-news-data.csv 100%[===================>] 4.84M --.-KB/s in 0.09s
2024-09-06 09:41:05 (56.4 MB/s) - ‘bbc-news-data.csv’ saved [5080260/5080260]
Let's convert the raw data into Haystack Documents
import csv
from typing import List

from haystack import Document

def read_documents() -> List[Document]:
    with open("bbc-news-data.csv", "r") as file:
        reader = csv.reader(file, delimiter="\t")
        next(reader, None)  # skip the headers
        documents = []
        for row in reader:
            # columns: category, filename, title, content (filename is unused)
            category = row[0].strip()
            title = row[2].strip()
            text = row[3].strip()
            documents.append(Document(content=text, meta={"category": category, "title": title}))
    return documents
docs = read_documents()
docs[0:5]
[Document(id=8b0eec9b4039d3c21eed119c9cbf1022a172f6b96661a391c76ee9a00b388334, content: 'Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to...', meta: {'category': 'business', 'title': 'Ad sales boost Time Warner profit'}),
Document(id=0b20edb280b3c492d81751d97aa67f008759b242f2596d56c6816bacb5ea0c08, content: 'The dollar has hit its highest level against the euro in almost three months after the Federal Reser...', meta: {'category': 'business', 'title': 'Dollar gains on Greenspan speech'}),
Document(id=9465b0a3c9e81843db56beb8cb3183b14810e8fc7b3195bd37718296f3a13e31, content: 'The owners of embattled Russian oil giant Yukos are to ask the buyer of its former production unit t...', meta: {'category': 'business', 'title': 'Yukos unit buyer faces loan claim'}),
Document(id=151d64ed92b61b1b9e58c52a90e7ab4be964c0e47aaf1a233dfb93110986d9cd, content: 'British Airways has blamed high fuel prices for a 40% drop in profits. Reporting its results for th...', meta: {'category': 'business', 'title': "High fuel prices hit BA's profits"}),
Document(id=4355d611f770b814f9e7d33959ad9d16b69048650ed0eaf24f1bce3e8ab5bf4c, content: 'Shares in UK drinks and food firm Allied Domecq have risen on speculation that it could be the targe...', meta: {'category': 'business', 'title': 'Pernod takeover talk lifts Domecq'})]
We can see that the Documents were created successfully.
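As a quick sanity check (not part of the original walkthrough), the number of Documents should match the 2,225 articles in the dataset:
len(docs)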
Document Splitting and Indexing
Now we split each document into smaller documents, creating a hierarchical document structure that links each child document to its parent document.
We also create two document stores: one for the leaf documents and another for the parent documents.
from typing import Tuple

from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.preprocessors import HierarchicalDocumentSplitter

def indexing(documents: List[Document]) -> Tuple[InMemoryDocumentStore, InMemoryDocumentStore]:
    splitter = HierarchicalDocumentSplitter(block_sizes={10, 3}, split_overlap=0, split_by="word")
    docs = splitter.run(documents)

    # Store the leaf documents in one document store
    leaf_documents = [doc for doc in docs["documents"] if doc.meta["__level"] == 1]
    leaf_doc_store = InMemoryDocumentStore()
    leaf_doc_store.write_documents(leaf_documents, policy=DuplicatePolicy.OVERWRITE)

    # Store the parent documents in another document store
    parent_documents = [doc for doc in docs["documents"] if doc.meta["__level"] == 0]
    parent_doc_store = InMemoryDocumentStore()
    parent_doc_store.write_documents(parent_documents, policy=DuplicatePolicy.OVERWRITE)

    return leaf_doc_store, parent_doc_store
leaf_doc_store, parent_doc_store = indexing(docs)
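To see what indexing produced, we can peek into the two stores. This is a small inspection sketch that isn't part of the original walkthrough; it assumes the "__parent_id" meta key, which the HierarchicalDocumentSplitter uses to link a chunk to its parent:
# Number of chunks written to each store
print(leaf_doc_store.count_documents())
print(parent_doc_store.count_documents())

# Each leaf chunk keeps a pointer to its parent in its metadata; the
# AutoMergingRetriever follows this link when deciding whether to merge.
sample_leaf = leaf_doc_store.filter_documents()[0]
print(sample_leaf.meta["__level"], sample_leaf.meta["__parent_id"])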
Retrieving Documents with Auto-Merging
We are now ready to query the document stores using the AutoMergingRetriever. Let's build a pipeline that runs a BM25Retriever over the user query and feeds its results into the AutoMergingRetriever, which then decides, based on the retrieved documents and the hierarchical structure, whether to return the leaf documents or their parent documents.
from haystack import Pipeline
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.retrievers import AutoMergingRetriever

def querying_pipeline(leaf_doc_store: InMemoryDocumentStore, parent_doc_store: InMemoryDocumentStore, threshold: float = 0.6):
    pipeline = Pipeline()
    bm25_retriever = InMemoryBM25Retriever(document_store=leaf_doc_store)
    auto_merge_retriever = AutoMergingRetriever(parent_doc_store, threshold=threshold)
    pipeline.add_component(instance=bm25_retriever, name="BM25Retriever")
    pipeline.add_component(instance=auto_merge_retriever, name="AutoMergingRetriever")
    pipeline.connect("BM25Retriever.documents", "AutoMergingRetriever.documents")
    return pipeline
Let's create this pipeline, setting the threshold of the AutoMergingRetriever to 0.6.
pipeline = querying_pipeline(leaf_doc_store, parent_doc_store, threshold=0.6)
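The threshold controls how eagerly leaves are replaced by their parent: if the fraction of a parent's leaf documents present in the retrieved results exceeds the threshold, the parent is returned instead (our reading of the component's behavior). As a sketch, a more conservative setting would look like this:
# With threshold=0.9, a parent replaces its leaves only when at least 90% of
# them match the query, so merging happens far less often than with 0.6.
conservative_pipeline = querying_pipeline(leaf_doc_store, parent_doc_store, threshold=0.9)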
Now let's query the document stores for articles related to cybersecurity. We also use the pipeline parameter include_outputs_from to get the output of the BM25Retriever component as well.
result = pipeline.run(data={'query': 'phishing attacks spoof websites spam e-mails spyware'}, include_outputs_from={'BM25Retriever'})
len(result['AutoMergingRetriever']['documents'])
10
len(result['BM25Retriever']['documents'])
10
retrieved_doc_titles_bm25 = sorted([d.meta['title'] for d in result['BM25Retriever']['documents']])
retrieved_doc_titles_bm25
['Bad e-mail habits sustains spam',
'Cyber criminals step up the pace',
'Cyber criminals step up the pace',
'More women turn to net security',
'Rich pickings for hi-tech thieves',
'Screensaver tackles spam websites',
'Security scares spark browser fix',
'Solutions to net security fears',
'Solutions to net security fears',
'Spam e-mails tempt net shoppers']
retrieved_doc_titles_automerging = sorted([d.meta['title'] for d in result['AutoMergingRetriever']['documents']])
retrieved_doc_titles_automerging
['Bad e-mail habits sustains spam',
'Cyber criminals step up the pace',
'Cyber criminals step up the pace',
'More women turn to net security',
'Rich pickings for hi-tech thieves',
'Screensaver tackles spam websites',
'Security scares spark browser fix',
'Solutions to net security fears',
'Solutions to net security fears',
'Spam e-mails tempt net shoppers']
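The titles in the two result sets are identical, but titles alone don't show whether any merging actually happened. A quick check, sketched here and not part of the original walkthrough, is to compare the document IDs of the two outputs:
bm25_ids = {d.id for d in result['BM25Retriever']['documents']}
merged_ids = {d.id for d in result['AutoMergingRetriever']['documents']}

# Any ID that appears only in the AutoMergingRetriever output belongs to a
# parent document that replaced one or more matching leaf documents.
print(merged_ids - bm25_ids)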
