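Tutorial: Retrieving a Context Window Around a Sentence

Last updated: June 10, 2025

Level: Beginner
Time to complete: 10 minutes
Components used: SentenceWindowRetriever, DocumentSplitter, InMemoryDocumentStore, InMemoryBM25Retriever
Goal: After completing this tutorial, you will understand sentence window retrieval and how to apply it to document retrieval.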

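Overview

Sentence window retrieval is a simple and effective way to retrieve more context when a user query matches a document. It rests on the idea that the most relevant sentences are likely to be close to each other. The technique selects a window of sentences around the sentence that matched the query and returns the whole window instead of only the matching sentence. It is particularly useful when the query is a question, or a phrase that needs more context to be understood.

Sentence window retrieval can be implemented in a Pipeline with the SentenceWindowRetriever component.

The component takes a document_store and a window_size as input. The document_store contains the documents we want to query, and window_size determines how many sentences to return on each side of the matching sentence, so a full window contains 2 * window_size + 1 sentences. Although we speak of "sentences" because that is the term the technique itself uses, SentenceWindowRetriever actually works with any splitting unit offered by the DocumentSplitter class, for example word, sentence, or page.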

SentenceWindowRetriever(document_store=doc_store, window_size=2)
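
To make the window arithmetic concrete, here is a minimal sketch in plain Python that does not use Haystack. The `sentence_window` helper and the placeholder sentences are hypothetical, and the sketch assumes that a window is simply cut short when the match sits near the start or end of the document.

```python
from typing import List


def sentence_window(sentences: List[str], match_index: int, window_size: int) -> List[str]:
    """Return the matched sentence plus up to `window_size` neighbours on each side."""
    start = max(0, match_index - window_size)  # clamp at the start of the document
    end = min(len(sentences), match_index + window_size + 1)  # clamp at the end
    return sentences[start:end]


sentences = [f"Sentence {i}." for i in range(9)]

# A match in the middle yields the full 2 * window_size + 1 = 5 sentences.
middle = sentence_window(sentences, match_index=4, window_size=2)
print(len(middle), middle)

# A match on the first sentence has no left neighbours, so fewer sentences come back.
edge = sentence_window(sentences, match_index=0, window_size=2)
print(len(edge), edge)
```

With window_size=2, a match in the middle of the text returns five sentences, while a match on the very first sentence returns only three under the clamping assumption above.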

Preparing the Colab Environment

Enable the GPU runtime

Installing Haystack

To start, install the latest release of Haystack with pip:

%%bash

pip install --upgrade pip
pip install haystack-ai nltk

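Getting Started with Sentence Window Retrieval

Let's start with a simple example that uses SentenceWindowRetriever on its own; later we will see how to use it in a pipeline. We first create a document and split it into sentences with the DocumentSplitter class.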

from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

splitter = DocumentSplitter(split_length=1, split_overlap=0, split_by="period")

text = (
    "Paul fell asleep to dream of an Arrakeen cavern, silent people all around  him moving in the dim light "
    "of glowglobes. It was solemn there and like a cathedral as he listened to a faint sound—the "
    "drip-drip-drip of water. Even while he remained in the dream, Paul knew he would remember it upon "
    "awakening. He always remembered the dreams that were predictions. The dream faded. Paul awoke to feel "
    "himself in the warmth of his bed—thinking thinking. This world of Castle Caladan, without play or "
    "companions his own age,  perhaps did not deserve sadness in farewell. Dr Yueh, his teacher, had "
    "hinted  that the faufreluches class system was not rigidly guarded on Arrakis. The planet sheltered "
    "people who lived at the desert edge without caid or bashar to command them: will-o’-the-sand people "
    "called Fremen, marked down on no  census of the Imperial Regate."
)

doc = Document(content=text)
docs = splitter.run([doc])

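This produces 9 sentences, each represented as a Haystack Document object. We can then write these documents to a DocumentStore and use SentenceWindowRetriever to retrieve a window of sentences around a matching sentence.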

from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy

doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs["documents"], policy=DuplicatePolicy.OVERWRITE)

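Now we use SentenceWindowRetriever to retrieve a window of sentences around a given sentence. Note that, at run time, SentenceWindowRetriever takes as input a Document that exists in the document store, and relies on that document's metadata to retrieve the surrounding sentences. An important consequence is that SentenceWindowRetriever has to be used together with another Retriever, such as InMemoryBM25Retriever, that handles the initial user query and returns the matching documents.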
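
The reliance on metadata can be sketched in plain Python. The sketch assumes that each chunk carries a `source_id` (its parent document) and a `split_id` (its position within that parent), which is what DocumentSplitter is understood to record. The `neighbours` helper and the dictionary layout are hypothetical stand-ins, not Haystack code.

```python
from typing import Dict, List

# Hypothetical stand-ins for split documents: plain dicts instead of Haystack Document objects.
chunks: List[Dict] = [
    {"content": f"Sentence {i} of doc A.", "meta": {"source_id": "A", "split_id": i}} for i in range(5)
] + [
    {"content": f"Sentence {i} of doc B.", "meta": {"source_id": "B", "split_id": i}} for i in range(3)
]


def neighbours(store: List[Dict], matched: Dict, window_size: int) -> List[Dict]:
    """Select chunks from the same parent whose split_id lies within the window."""
    source_id = matched["meta"]["source_id"]
    split_id = matched["meta"]["split_id"]
    low, high = split_id - window_size, split_id + window_size
    selected = [
        c for c in store
        if c["meta"]["source_id"] == source_id and low <= c["meta"]["split_id"] <= high
    ]
    # Return the window in reading order.
    return sorted(selected, key=lambda c: c["meta"]["split_id"])


matched = chunks[4]  # the last sentence of doc A
window = neighbours(chunks, matched, window_size=2)
print([c["content"] for c in window])
```

Because the lookup filters on the parent identifier, chunks from doc B never leak into a window built around a sentence from doc A, even though they sit next to it in the store.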

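We pass the Document containing the sentence "The dream faded." to SentenceWindowRetriever and retrieve a window of 2 sentences on either side of it. Note that we wrap it in a list, because the run method expects a list of documents.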

from haystack.components.retrievers import SentenceWindowRetriever

retriever = SentenceWindowRetriever(document_store=doc_store, window_size=2)
result = retriever.run(retrieved_documents=[docs["documents"][4]])

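The result is a dictionary with two keys:

context_windows: a list of strings containing the context windows around the matching sentences.
context_documents: a list of Document objects containing the retrieved documents together with their surrounding context documents. The documents are sorted by the split_idx_start meta field.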

result["context_windows"]

result["context_documents"]
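
When exploring this output, a small formatting helper can be handy. The sketch below is only an illustration: `fake_result` is a hypothetical stand-in with the two keys described above, built from `SimpleNamespace` objects rather than real Haystack Documents, and `describe` is not part of any library.

```python
from types import SimpleNamespace

# Hypothetical stand-in with the same two keys as the retriever output described above.
fake_result = {
    "context_windows": ["First. Second. Third."],
    "context_documents": [
        SimpleNamespace(content="First.", meta={"split_idx_start": 0}),
        SimpleNamespace(content=" Second.", meta={"split_idx_start": 6}),
        SimpleNamespace(content=" Third.", meta={"split_idx_start": 14}),
    ],
}


def describe(result: dict) -> list:
    """Build one line per context window, then one line per context document."""
    lines = [f"window: {window}" for window in result["context_windows"]]
    for doc in result["context_documents"]:
        # split_idx_start is the character offset of the chunk in its parent text.
        lines.append(f"{doc.meta['split_idx_start']:>4}: {doc.content.strip()}")
    return lines


summary = describe(fake_result)
print("\n".join(summary))
```

The helper only reads `content` and `meta`, so under that assumption it should also work on the real `result` dictionary.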

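Creating a Keyword Retrieval Pipeline with Sentence Window Retrieval

Let's see how this component works in practice. We will use the BBC news dataset to show how SentenceWindowRetriever handles a dataset containing several news articles.

Reading the Dataset

The original dataset is available at http://mlg.ucd.ie/datasets/bbc.html, but it has already been preprocessed and stored in a single CSV file available here: https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv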

from typing import List
import csv
from haystack import Document


def read_documents(file: str) -> List[Document]:
    # Use a separate name for the handle so the `file` argument is not shadowed
    with open(file, "r") as csv_file:
        reader = csv.reader(csv_file, delimiter="\t")
        next(reader, None)  # skip the headers
        documents = []
        for row in reader:
            category = row[0].strip()
            title = row[2].strip()
            text = row[3].strip()
            documents.append(Document(content=text, meta={"category": category, "title": title}))

    return documents
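
To see what the reader above expects, here is a small sketch that parses an in-memory sample instead of the downloaded file. The two-row sample is invented for illustration, and the column names (in particular `filename` for the unused second column) are an assumption inferred from the indices 0, 2, and 3 used in `read_documents`.

```python
import csv
import io

# Hypothetical sample mimicking the assumed tab-delimited layout of the BBC file.
sample = (
    "category\tfilename\ttitle\tcontent\n"
    "tech\t001.txt\tExample title\tExample body text.\n"
)

reader = csv.reader(io.StringIO(sample), delimiter="\t")
header = next(reader)  # the first row holds the column names
rows = [{"category": r[0].strip(), "title": r[2].strip(), "text": r[3].strip()} for r in reader]

print(header)
print(rows)
```

Although the file has a .csv extension, the fields are separated by tabs, which is why the delimiter is set explicitly.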

from pathlib import Path
import requests

doc = requests.get("https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv")

datafolder = Path("data")
datafolder.mkdir(exist_ok=True)
with open(datafolder / "bbc-news-data.csv", "wb") as f:
    for chunk in doc.iter_content(512):
        f.write(chunk)

docs = read_documents("data/bbc-news-data.csv")
len(docs)

Indexing the Documents

We will now apply the DocumentSplitter to split the documents into sentences and write them to an InMemoryDocumentStore.

from haystack import Document, Pipeline
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy

doc_store = InMemoryDocumentStore()

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("splitter", DocumentSplitter(split_length=1, split_overlap=0, split_by="sentence"))
indexing_pipeline.add_component("writer", DocumentWriter(document_store=doc_store, policy=DuplicatePolicy.OVERWRITE))

indexing_pipeline.connect("splitter", "writer")

indexing_pipeline.run({"documents": docs})

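Building a Sentence Window Retrieval Pipeline

Now let's build a pipeline that retrieves documents using InMemoryBM25Retriever (keyword retrieval) and SentenceWindowRetriever. Here, we set the window_size of the SentenceWindowRetriever to 2.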

from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.retrievers import SentenceWindowRetriever

sentence_window_pipeline = Pipeline()

sentence_window_pipeline.add_component("bm25_retriever", InMemoryBM25Retriever(document_store=doc_store))
sentence_window_pipeline.add_component("sentence_window__retriever", SentenceWindowRetriever(doc_store, window_size=2))

sentence_window_pipeline.connect("bm25_retriever.documents", "sentence_window__retriever.retrieved_documents")
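
Conceptually, this connection chains two stages: a first stage that picks the matching sentence, and a second stage that expands it into a window. The sketch below mirrors that flow in plain Python. It is not how Haystack implements either component: the first stage here is a toy word-overlap scorer rather than BM25, and all function names and the small corpus are hypothetical.

```python
from typing import Callable, List


def overlap_retriever(query: str, sentences: List[str]) -> int:
    """Toy first stage: index of the sentence sharing the most words with the query (not BM25)."""
    terms = set(query.lower().split())
    scores = [len(terms & set(s.lower().strip(".").split())) for s in sentences]
    return scores.index(max(scores))


def retrieve_with_window(
    query: str, sentences: List[str], first_stage: Callable[[str, List[str]], int], window_size: int
) -> str:
    """Second stage: expand the first-stage hit into a window of neighbouring sentences."""
    hit = first_stage(query, sentences)
    start = max(0, hit - window_size)
    end = min(len(sentences), hit + window_size + 1)
    return " ".join(sentences[start:end])


corpus = [
    "The bank sent a notice to customers.",
    "It warned about a rise in fraud.",
    "Phishing attacks trick users into revealing passwords.",
    "Victims often receive convincing emails.",
    "Experts advise checking the sender address.",
]

window_text = retrieve_with_window("phishing attacks", corpus, overlap_retriever, window_size=1)
print(window_text)
```

Because the second stage only needs the position of the hit, the first stage can be swapped for any other scoring function with the same signature.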

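Putting It All Together

Let's see what happens when we retrieve documents relevant to "phishing attacks", returning only the highest-scoring document. We also include the output of InMemoryBM25Retriever so that we can compare the results with and without SentenceWindowRetriever.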

result = sentence_window_pipeline.run(
    data={"bm25_retriever": {"query": "phishing attacks", "top_k": 1}}, include_outputs_from={"bm25_retriever"}
)

Now let's inspect the results from InMemoryBM25Retriever and SentenceWindowRetriever. Since we split the documents by sentence, InMemoryBM25Retriever returns only the sentence that matches the query.

result["bm25_retriever"]["documents"]

SentenceWindowRetriever, on the other hand, returns a window of sentences around the matching sentence, giving us more context to understand it.

result["sentence_window__retriever"]["context_windows"]

We can also access the context window as a list of Document objects:

result["sentence_window__retriever"]["context_documents"]

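Wrapping Up

We have seen how SentenceWindowRetriever works and how to use it to retrieve a window of sentences around a matching document, which gives us more context for understanding it. An important point is that SentenceWindowRetriever does not handle the query directly; it relies on the output of another Retriever that handles the initial user query. This allows SentenceWindowRetriever to be combined with any other retriever in the pipeline, such as InMemoryBM25Retriever.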
