Tutorial: Filtering Documents with Metadata

Last updated: August 25, 2025

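Level: Beginner
Time to complete: 5 minutes
Components used: InMemoryDocumentStore, InMemoryBM25Retriever
Prerequisites: None
Goal: Filter documents in a document store based on the provided metadata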

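Overview

📚 Related documentation: Metadata Filtering

Although new retrieval techniques are great, sometimes you simply know that you want to search a specific group of documents in your document store. This could be anything from documents related to a specific user to documents published after a certain date. Metadata filtering is very useful in these situations. In this tutorial, we will create a few simple documents containing information about Haystack, where the metadata records which version of Haystack the information relates to. We will then apply metadata filtering to make sure we answer the question based only on information about Haystack 2.0.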

Preparing the Colab Environment

Installing Haystack

Install Haystack with pip:

%%bash

pip install haystack-ai

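Preparing Documents

First, let's prepare some documents. Below, we manually create 3 simple documents with meta. We then write these documents to an InMemoryDocumentStore, but you can use any of the available document stores instead, such as OpenSearch, Chroma, or Pinecone. (Note that not all document stores offer an in-memory option, and some may require additional setup.)

⭐️ For more information on how to write documents into different document stores, you can follow the tutorial on indexing different file types.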

from datetime import datetime

from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever

documents = [
    Document(
        content="Use pip to install a basic version of Haystack's latest release: pip install farm-haystack. All the core Haystack components live in the haystack repo. But there's also the haystack-extras repo which contains components that are not as widely used, and you need to install them separately.",
        meta={"version": 1.15, "date": datetime(2023, 3, 30)},
    ),
    Document(
        content="Use pip to install a basic version of Haystack's latest release: pip install farm-haystack[inference]. All the core Haystack components live in the haystack repo. But there's also the haystack-extras repo which contains components that are not as widely used, and you need to install them separately.",
        meta={"version": 1.22, "date": datetime(2023, 11, 7)},
    ),
    Document(
        content="Use pip to install only the Haystack 2.0 code: pip install haystack-ai. The haystack-ai package is built on the main branch which is an unstable beta version, but it's useful if you want to try the new features as soon as they are merged.",
        meta={"version": 2.0, "date": datetime(2023, 12, 4)},
    ),
]
document_store = InMemoryDocumentStore(bm25_algorithm="BM25Plus")
document_store.write_documents(documents=documents)

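Building a Document Search Pipeline

As an example, below we build a simple document search pipeline that has only a retriever. You can also extend this pipeline to do more, such as generating answers to questions.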

from haystack import Pipeline

pipeline = Pipeline()
pipeline.add_component(instance=InMemoryBM25Retriever(document_store=document_store), name="retriever")

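Doing Metadata Filtering

Finally, ask a question while restricting the documents to those with "version" > 1.21.

To see which comparison operators you can use on metadata, including logical operators such as NOT and AND, see the Metadata Filtering documentation.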

query = "Haystack installation"
pipeline.run(data={"retriever": {"query": query, "filters": {"field": "meta.version", "operator": ">", "value": 1.21}}})
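
To make the meaning of a single filter condition concrete, here is a minimal pure-Python sketch of how such a condition could be evaluated against a document's metadata. It is not Haystack's implementation: `COMPARATORS` and `matches_condition` are hypothetical names introduced for illustration, the sketch only handles top-level metadata keys, and treating a missing field as a non-match is an assumption.

```python
import operator

# Hypothetical operator table for this sketch; Haystack's internals may differ.
COMPARATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def matches_condition(meta: dict, condition: dict) -> bool:
    """Return True if the metadata satisfies one comparison condition."""
    # "meta.version" -> "version"; only top-level meta keys are supported here.
    key = condition["field"].removeprefix("meta.")
    if key not in meta:
        return False  # assumption: a missing field never matches
    compare = COMPARATORS[condition["operator"]]
    return compare(meta[key], condition["value"])


# Version values taken from the three documents created earlier.
metas = [{"version": 1.15}, {"version": 1.22}, {"version": 2.0}]
condition = {"field": "meta.version", "operator": ">", "value": 1.21}

kept = [m["version"] for m in metas if matches_condition(m, condition)]
print(kept)  # [1.22, 2.0]
```

Under these assumptions, the condition keeps the 1.22 and 2.0 documents and drops the 1.15 one.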

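As a final step, let's see how to add logical operators to the filters. This time, we require the retrieved documents to have version > 1.21, and we also require their date to be later than November 7, 2023.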

query = "Haystack installation"
pipeline.run(
    data={
        "retriever": {
            "query": query,
            "filters": {
                "operator": "AND",
                "conditions": [
                    {"field": "meta.version", "operator": ">", "value": 1.21},
                    {"field": "meta.date", "operator": ">", "value": datetime(2023, 11, 7)},
                ],
            },
        }
    }
)
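
The nested filter above can be read as a small expression tree: a logical node that combines comparison leaves. The following is a simplified pure-Python sketch of that idea, not Haystack's own code. The `matches` function and `COMPARATORS` table are hypothetical, only AND and OR are handled (NOT is left out for brevity), and a missing field is assumed never to match.

```python
import operator
from datetime import datetime

# Hypothetical operator table for this sketch.
COMPARATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def matches(meta: dict, filters: dict) -> bool:
    """Recursively evaluate a filter dictionary against one document's metadata."""
    if "conditions" in filters:
        # Logical node: combine the results of the nested conditions.
        results = (matches(meta, condition) for condition in filters["conditions"])
        if filters["operator"] == "AND":
            return all(results)
        if filters["operator"] == "OR":
            return any(results)
        raise ValueError(f"Unsupported logical operator: {filters['operator']}")
    # Comparison leaf: look up the field (top-level meta keys only) and compare.
    key = filters["field"].removeprefix("meta.")
    if key not in meta:
        return False  # assumption: a missing field never matches
    return COMPARATORS[filters["operator"]](meta[key], filters["value"])


# Metadata copied from the three documents created earlier.
metas = [
    {"version": 1.15, "date": datetime(2023, 3, 30)},
    {"version": 1.22, "date": datetime(2023, 11, 7)},
    {"version": 2.0, "date": datetime(2023, 12, 4)},
]
filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.version", "operator": ">", "value": 1.21},
        {"field": "meta.date", "operator": ">", "value": datetime(2023, 11, 7)},
    ],
}

kept = [m["version"] for m in metas if matches(m, filters)]
print(kept)  # [2.0]
```

Because the date comparison is strict, the 1.22 document, dated exactly November 7, 2023, is excluded in this sketch; changing the date operator to `>=` would let it through.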

What's Next

🎉 Congratulations! You've filtered retrieved documents using metadata!

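If you liked this tutorial, you may also enjoy:

Creating Your First QA Pipeline with Retrieval-Augmentation

Preprocessing Different File Types

To stay up to date on the latest Haystack developments, you can subscribe to our newsletter. Thanks for reading!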