Improving Retrieval with Auto-Merging and Hierarchical Document Retrieval
Last updated: March 20, 2025
This notebook demonstrates how to use two Haystack components: the AutoMergingRetriever and the HierarchicalDocumentSplitter.
Setup
!pip install haystack-ai
Let's get a dataset to index and explore
- We will use a dataset containing 2,225 news articles from the paper "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering" by D. Greene and P. Cunningham, Proc. ICML 2006.
- The original dataset is available at http://mlg.ucd.ie/datasets/bbc.html, but we will use a processed CSV version available here: https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv
!wget https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv
--2024-09-06 09:41:04-- https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5080260 (4.8M) [text/plain]
Saving to: ‘bbc-news-data.csv’
bbc-news-data.csv 100%[===================>] 4.84M --.-KB/s in 0.09s
2024-09-06 09:41:05 (56.4 MB/s) - ‘bbc-news-data.csv’ saved [5080260/5080260]
Let's convert the raw data into Haystack Documents
import csv
from typing import List

from haystack import Document

def read_documents() -> List[Document]:
    with open("bbc-news-data.csv", "r") as file:
        reader = csv.reader(file, delimiter="\t")
        next(reader, None)  # skip the headers
        documents = []
        for row in reader:
            # columns: category, filename, title, content (filename is unused)
            category = row[0].strip()
            title = row[2].strip()
            text = row[3].strip()
            documents.append(Document(content=text, meta={"category": category, "title": title}))
    return documents
docs = read_documents()
docs[0:5]
[Document(id=8b0eec9b4039d3c21eed119c9cbf1022a172f6b96661a391c76ee9a00b388334, content: 'Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to...', meta: {'category': 'business', 'title': 'Ad sales boost Time Warner profit'}),
Document(id=0b20edb280b3c492d81751d97aa67f008759b242f2596d56c6816bacb5ea0c08, content: 'The dollar has hit its highest level against the euro in almost three months after the Federal Reser...', meta: {'category': 'business', 'title': 'Dollar gains on Greenspan speech'}),
Document(id=9465b0a3c9e81843db56beb8cb3183b14810e8fc7b3195bd37718296f3a13e31, content: 'The owners of embattled Russian oil giant Yukos are to ask the buyer of its former production unit t...', meta: {'category': 'business', 'title': 'Yukos unit buyer faces loan claim'}),
Document(id=151d64ed92b61b1b9e58c52a90e7ab4be964c0e47aaf1a233dfb93110986d9cd, content: 'British Airways has blamed high fuel prices for a 40% drop in profits. Reporting its results for th...', meta: {'category': 'business', 'title': "High fuel prices hit BA's profits"}),
Document(id=4355d611f770b814f9e7d33959ad9d16b69048650ed0eaf24f1bce3e8ab5bf4c, content: 'Shares in UK drinks and food firm Allied Domecq have risen on speculation that it could be the targe...', meta: {'category': 'business', 'title': 'Pernod takeover talk lifts Domecq'})]
We can see that the Documents were created successfully.
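As a quick sanity check (not part of the original walkthrough), the number of Documents should match the 2,225 articles in the dataset:
len(docs)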
Document Splitting and Indexing
Now we split each document into smaller documents, creating a hierarchical document structure that links each child document to its parent document.
We also create two document stores: one for the leaf documents and another for the parent documents.
from typing import Tuple

from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.preprocessors import HierarchicalDocumentSplitter

def indexing(documents: List[Document]) -> Tuple[InMemoryDocumentStore, InMemoryDocumentStore]:
    splitter = HierarchicalDocumentSplitter(block_sizes={10, 3}, split_overlap=0, split_by="word")
    docs = splitter.run(documents)

    # Store the leaf documents in one document store
    leaf_documents = [doc for doc in docs["documents"] if doc.meta["__level"] == 1]
    leaf_doc_store = InMemoryDocumentStore()
    leaf_doc_store.write_documents(leaf_documents, policy=DuplicatePolicy.OVERWRITE)

    # Store the parent documents in another document store
    parent_documents = [doc for doc in docs["documents"] if doc.meta["__level"] == 0]
    parent_doc_store = InMemoryDocumentStore()
    parent_doc_store.write_documents(parent_documents, policy=DuplicatePolicy.OVERWRITE)

    return leaf_doc_store, parent_doc_store
leaf_doc_store, parent_doc_store = indexing(docs)
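To see what indexing produced, we can peek into the two stores. This is a small inspection sketch that isn't part of the original walkthrough; it assumes the "__parent_id" meta key, which the HierarchicalDocumentSplitter uses to link a chunk to its parent:
# Number of chunks written to each store
print(leaf_doc_store.count_documents())
print(parent_doc_store.count_documents())

# Each leaf chunk keeps a pointer to its parent in its metadata; the
# AutoMergingRetriever follows this link when deciding whether to merge.
sample_leaf = leaf_doc_store.filter_documents()[0]
print(sample_leaf.meta["__level"], sample_leaf.meta["__parent_id"])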
Retrieving Documents with Auto-Merging
We are now ready to query the document stores using the AutoMergingRetriever. Let's build a pipeline that runs a BM25Retriever over the user query and feeds its results into the AutoMergingRetriever, which then decides, based on the retrieved documents and the hierarchical structure, whether to return the leaf documents or their parent documents.
from haystack import Pipeline
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.retrievers import AutoMergingRetriever

def querying_pipeline(leaf_doc_store: InMemoryDocumentStore, parent_doc_store: InMemoryDocumentStore, threshold: float = 0.6):
    pipeline = Pipeline()
    bm25_retriever = InMemoryBM25Retriever(document_store=leaf_doc_store)
    auto_merge_retriever = AutoMergingRetriever(parent_doc_store, threshold=threshold)
    pipeline.add_component(instance=bm25_retriever, name="BM25Retriever")
    pipeline.add_component(instance=auto_merge_retriever, name="AutoMergingRetriever")
    pipeline.connect("BM25Retriever.documents", "AutoMergingRetriever.documents")
    return pipeline
Let's create this pipeline, setting the threshold of the AutoMergingRetriever to 0.6.
pipeline = querying_pipeline(leaf_doc_store, parent_doc_store, threshold=0.6)
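The threshold controls how eagerly leaves are replaced by their parent: if the fraction of a parent's leaf documents present in the retrieved results exceeds the threshold, the parent is returned instead (our reading of the component's behavior). As a sketch, a more conservative setting would look like this:
# With threshold=0.9, a parent replaces its leaves only when at least 90% of
# them match the query, so merging happens far less often than with 0.6.
conservative_pipeline = querying_pipeline(leaf_doc_store, parent_doc_store, threshold=0.9)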
Now let's query the document stores for articles related to cybersecurity. We also use the pipeline parameter include_outputs_from to get the output of the BM25Retriever component as well.
result = pipeline.run(data={'query': 'phishing attacks spoof websites spam e-mails spyware'}, include_outputs_from={'BM25Retriever'})
len(result['AutoMergingRetriever']['documents'])
10
len(result['BM25Retriever']['documents'])
10
retrieved_doc_titles_bm25 = sorted([d.meta['title'] for d in result['BM25Retriever']['documents']])
retrieved_doc_titles_bm25
['Bad e-mail habits sustains spam',
'Cyber criminals step up the pace',
'Cyber criminals step up the pace',
'More women turn to net security',
'Rich pickings for hi-tech thieves',
'Screensaver tackles spam websites',
'Security scares spark browser fix',
'Solutions to net security fears',
'Solutions to net security fears',
'Spam e-mails tempt net shoppers']
retrieved_doc_titles_automerging = sorted([d.meta['title'] for d in result['AutoMergingRetriever']['documents']])
retrieved_doc_titles_automerging
['Bad e-mail habits sustains spam',
'Cyber criminals step up the pace',
'Cyber criminals step up the pace',
'More women turn to net security',
'Rich pickings for hi-tech thieves',
'Screensaver tackles spam websites',
'Security scares spark browser fix',
'Solutions to net security fears',
'Solutions to net security fears',
'Spam e-mails tempt net shoppers']
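The titles in the two result sets are identical, but titles alone don't show whether any merging actually happened. A quick check, sketched here and not part of the original walkthrough, is to compare the document IDs of the two outputs:
bm25_ids = {d.id for d in result['BM25Retriever']['documents']}
merged_ids = {d.id for d in result['AutoMergingRetriever']['documents']}

# Any ID that appears only in the AutoMergingRetriever output belongs to a
# parent document that replaced one or more matching leaf documents.
print(merged_ids - bm25_ids)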
