Advanced RAG: Automated Structured Metadata Enrichment


_Last updated: April 11, 2025_

By Tuana Celik (LI, Twitter)

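This is part of the **Advanced Use Cases** series:

1️⃣ Extract Metadata from Queries to Improve Retrieval: cookbook & full article

2️⃣ Query Expansion: cookbook & full article

3️⃣ Query Decomposition: cookbook & full article

4️⃣ Automated Metadata Enrichment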

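In this example, you'll see how to use structured outputs (an option offered by some LLMs) together with a custom Haystack component to automate enriching documents with metadata.

You'll see how to define your own metadata fields, and the data type each field should have, as a Pydantic model. You'll end up with a custom MetadataEnricher that extracts the required fields and adds them to each document's meta.

In this example, we enrich documents with metadata about funding announcements.

Once documents carry our own metadata fields, we can use metadata filtering in the retrieval step of a RAG pipeline. We can even combine this with extracting metadata from queries to improve retrieval, so that we are precise about which documents we pass to the LLM as context.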
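To make the filtering idea concrete, here is a minimal plain-Python sketch. It assumes enriched documents are represented as dicts whose `meta` holds the extracted values (the values are taken from the pipeline output later in this example); the `matches` helper is hypothetical. The filter dictionary follows the field/operator/value shape used by Haystack 2.x document stores, but nothing here calls Haystack itself.

```python
import operator
from typing import Any, Dict

# Comparison operators supported by this sketch.
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def matches(meta: Dict[str, Any], filters: Dict[str, Any]) -> bool:
    """Return True if a document's metadata satisfies the filter (hypothetical helper)."""
    if "conditions" in filters:
        # Logical node: combine the results of the nested conditions.
        results = [matches(meta, condition) for condition in filters["conditions"]]
        if filters["operator"] == "AND":
            return all(results)
        if filters["operator"] == "OR":
            return any(results)
        raise ValueError(f"Unsupported logical operator: {filters['operator']}")
    # Leaf node: drop the "meta." prefix and compare against the stored value.
    key = filters["field"].split(".", 1)[-1]
    if key not in meta:
        return False
    return OPERATORS[filters["operator"]](meta[key], filters["value"])


documents = [
    {"meta": {"company": "Deepset", "year": 2023, "funding_value": 30000000, "funding_currency": "USD"}},
    {"meta": {"company": "Arize AI", "year": 2022, "funding_value": 38000000, "funding_currency": "USD"}},
]

filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.year", "operator": "==", "value": 2023},
        {"field": "meta.funding_value", "operator": ">=", "value": 10000000},
    ],
}

selected = [doc for doc in documents if matches(doc["meta"], filters)]
print([doc["meta"]["company"] for doc in selected])  # ['Deepset']
```

With a real document store, the same `filters` dictionary would be handed to the retriever or store instead of being evaluated by hand.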

📺 Code walkthrough

Install requirements

!pip install haystack-ai
!pip install trafilatura

from haystack import Document, component, Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.converters import HTMLToDocument
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.generators import OpenAIGenerator
from haystack.components.preprocessors import DocumentSplitter
from haystack.dataclasses import ChatMessage, StreamingChunk

from openai import Stream
from openai.types.chat import ChatCompletion, ChatCompletionChunk
from typing import List, Any, Dict, Optional, Callable, Union
from pydantic import BaseModel

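🧪 Experimental feature: adding structured output support to OpenAIGenerator

🚀 This is the same extension to OpenAIGenerator that is used in the Advanced RAG: Query Decomposition and Reasoning example.

Let's extend OpenAIGenerator so it can use OpenAI's structured output option. Below, we extend the class to call self.client.beta.chat.completions.parse whenever the user provides a response_format in generation_kwargs. This lets us give the generator a Pydantic model and ask it to respond with structured output that conforms to that model's schema.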

# Intentionally reuses the name OpenAIGenerator, so later cells pick up the extended class.
class OpenAIGenerator(OpenAIGenerator):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    @component.output_types(replies=List[str], meta=List[Dict[str, Any]], structured_reply=BaseModel)
    def run(
        self,
        prompt: str,
        streaming_callback: Optional[Callable[[StreamingChunk], None]] = None,
        generation_kwargs: Optional[Dict[str, Any]] = None,
    ):
        # Run-time kwargs take precedence over the ones given at init time.
        generation_kwargs = {**self.generation_kwargs, **(generation_kwargs or {})}
        if "response_format" not in generation_kwargs:
            # No schema requested: defer to the stock implementation.
            # Keyword arguments keep this call independent of the parent's parameter order.
            return super().run(prompt, streaming_callback=streaming_callback, generation_kwargs=generation_kwargs)

        message = ChatMessage.from_user(prompt)
        if self.system_prompt:
            messages = [ChatMessage.from_system(self.system_prompt), message]
        else:
            messages = [message]

        # Note: streaming_callback is not used on the structured output path.
        openai_formatted_messages = [message.to_openai_dict_format() for message in messages]
        completion: Union[Stream[ChatCompletionChunk], ChatCompletion] = self.client.beta.chat.completions.parse(
            model=self.model,
            messages=openai_formatted_messages,
            **generation_kwargs,
        )
        completions = [self._build_structured_message(completion, choice) for choice in completion.choices]
        for response in completions:
            self._check_finish_reason(response)

        return {
            # On this path each reply is the parsed object (or "" if nothing was parsed), not a plain string.
            "replies": [message.text for message in completions],
            "meta": [message.meta for message in completions],
            "structured_reply": completions[0].text,
        }

    def _build_structured_message(self, completion: Any, choice: Any) -> ChatMessage:
        chat_message = ChatMessage.from_assistant(choice.message.parsed or "")
        chat_message.meta.update(
            {
                "model": completion.model,
                "index": choice.index,
                "finish_reason": choice.finish_reason,
                "usage": dict(completion.usage),
            }
        )
        return chat_message
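
Which branch the extended `run` method takes depends only on the merged generation kwargs. The sketch below isolates that merge-and-dispatch step in plain Python. `choose_path` and the placeholder `Metadata` class are hypothetical stand-ins, and no OpenAI call is made.

```python
from typing import Any, Dict, Optional, Tuple


class Metadata:
    """Placeholder for a Pydantic model class (assumption for this sketch)."""


def choose_path(
    init_kwargs: Dict[str, Any], run_kwargs: Optional[Dict[str, Any]] = None
) -> Tuple[str, Dict[str, Any]]:
    """Merge kwargs the way run() does and report which branch would be taken."""
    # Later entries win, so run-time values override init-time values.
    merged = {**init_kwargs, **(run_kwargs or {})}
    path = "structured" if "response_format" in merged else "default"
    return path, merged


# response_format given at init time: every call takes the structured path.
print(choose_path({"response_format": Metadata})[0])

# response_format given only at run time: same result for that call.
print(choose_path({"temperature": 0.2}, {"response_format": Metadata})[0])

# No response_format anywhere: the parent implementation handles the call.
print(choose_path({"temperature": 0.2}, {"temperature": 0.7}))
```

Because the merge happens on every call, a `response_format` set when the generator is created (as the enricher below does) applies to all of its runs.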

import os
from getpass import getpass

if "OPENAI_API_KEY" not in os.environ:
  os.environ["OPENAI_API_KEY"] = getpass("OpenAI API Key:")


Custom `MetadataEnricher`

We create a custom Haystack component that accepts a metadata_model and a prompt. If no prompt is provided, it uses DEFAULT_PROMPT.

The component returns the documents, enriched with the requested metadata fields.

from typing import Type

DEFAULT_PROMPT = """
Given the contents of the documents, extract the requested metadata.
The requested metadata is {{ metadata_model }}
Document:
{{document}}
Metadata:
"""


@component
class MetadataEnricher:

    def __init__(self, metadata_model: Type[BaseModel], prompt: str = DEFAULT_PROMPT):
        # metadata_model is the Pydantic model class itself, not an instance.
        self.metadata_model = metadata_model
        self.metadata_prompt = prompt

        builder = PromptBuilder(self.metadata_prompt)
        llm = OpenAIGenerator(generation_kwargs={"response_format": metadata_model})
        self.pipeline = Pipeline()
        self.pipeline.add_component(name="builder", instance=builder)
        self.pipeline.add_component(name="llm", instance=llm)
        self.pipeline.connect("builder", "llm")

    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document]):
        documents_with_meta = []
        # One LLM call per document.
        for document in documents:
            result = self.pipeline.run({'builder': {'document': document.content, 'metadata_model': self.metadata_model}})
            metadata = result['llm']['structured_reply']
            # Merge the extracted fields into the existing meta (Pydantic v2; use .dict() on v1).
            document.meta.update(metadata.model_dump())
            documents_with_meta.append(document)
        return {"documents": documents_with_meta}
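
One detail of `run` worth seeing in isolation is `document.meta.update(...)`: the extracted fields are merged into the existing metadata, and any key that already exists is overwritten. The sketch below reproduces the loop with plain dicts. `enrich` and `fake_extract` are hypothetical; `fake_extract` stands in for the LLM call and returns fixed values.

```python
from typing import Any, Callable, Dict, List


def fake_extract(content: str) -> Dict[str, Any]:
    """Hypothetical stand-in for the LLM: always returns the same fields."""
    return {"company": "Deepset", "year": 2023}


def enrich(
    documents: List[Dict[str, Any]], extract: Callable[[str], Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Merge extracted fields into each document's meta, in place."""
    for document in documents:
        # dict.update keeps unrelated keys and overwrites keys that collide.
        document["meta"].update(extract(document["content"]))
    return documents


docs = [{"content": "...", "meta": {"content_type": "text/html", "year": None}}]
enriched = enrich(docs, fake_extract)
print(enriched[0]["meta"])
# {'content_type': 'text/html', 'year': 2023, 'company': 'Deepset'}
```

If a metadata model reuses a key that converters already set (for example `url`), the extracted value replaces the original one, so distinct field names are the safer choice.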

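Define metadata fields as a Pydantic model

To automate metadata enrichment, we need a way to describe the fields we want to extract and the type of each one.

Below, I define a Metadata model with 4 fields.

💡 Note: In some cases, it may make more sense to make each field optional or to provide default values.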

class Metadata(BaseModel):
    company: str
    year: int
    funding_value: int
    funding_currency: str
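
As a minimal sketch of the note about optional fields, the variant below makes every field optional with a default of None. It assumes Pydantic v2 (`model_dump`); the class name `OptionalMetadata` is hypothetical and is not used elsewhere in this example.

```python
from typing import Optional

from pydantic import BaseModel


class OptionalMetadata(BaseModel):
    # Every field may be missing; None marks "not found in the document".
    company: Optional[str] = None
    year: Optional[int] = None
    funding_value: Optional[int] = None
    funding_currency: Optional[str] = None


# A document that mentions the company and year but no funding amount.
partial = OptionalMetadata(company="Deepset", year=2023)

# exclude_none drops the missing fields, so they never reach document.meta.
print(partial.model_dump(exclude_none=True))
# {'company': 'Deepset', 'year': 2023}
```

This only shows the Pydantic side. How a given LLM provider treats optional fields in a response schema should be checked against its documentation.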

Next, we initialize a MetadataEnricher and pass Metadata as the metadata_model we want it to adhere to.

enricher = MetadataEnricher(metadata_model=Metadata)

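Build an automated metadata enrichment pipeline

Now that we have our enricher, we can use it in a pipeline. Below is an example pipeline that fetches the contents of some URLs (here, URLs of pages with funding announcements). The pipeline then adds the requested metadata fields to the meta field of each Document 👇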

pipeline = Pipeline()
pipeline.add_component("fetcher", LinkContentFetcher())
pipeline.add_component("converter", HTMLToDocument())
pipeline.add_component("enricher", enricher)


pipeline.connect("fetcher", "converter")
pipeline.connect("converter.documents", "enricher.documents")

pipeline.run({"fetcher": {"urls": ['https://techcrunch.com/2023/08/09/deepset-secures-30m-to-expand-its-llm-focused-mlops-offerings/',
                                   'https://www.prnewswire.com/news-releases/arize-ai-raises-38-million-series-b-to-scale-machine-learning-observability-platform-301620603.html']}})

{'enricher': {'documents': [Document(id=5844517120556b13f92430ea8af9837714ede1b351580c43c2ddce9b646cb6cb, content: 'Deepset, a platform for building enterprise apps powered by large language models akin to ChatGPT, t...', meta: {'content_type': 'text/html', 'url': 'https://techcrunch.com/2023/08/09/deepset-secures-30m-to-expand-its-llm-focused-mlops-offerings/', 'company': 'Deepset', 'year': 2023, 'funding_value': 30000000, 'funding_currency': 'USD'}),
   Document(id=8cdcb63a4e006b1cac902ebc2e012cd95156d188777e0d0c8bd407a92f4491c7, content: 'Arize AI Raises $38 Million Series B To Scale Machine Learning Observability Platform
   As companies t...', meta: {'content_type': 'text/html', 'url': 'https://www.prnewswire.com/news-releases/arize-ai-raises-38-million-series-b-to-scale-machine-learning-observability-platform-301620603.html', 'company': 'Arize AI', 'year': 2022, 'funding_value': 38000000, 'funding_currency': 'USD'})]}}

pipeline.show()

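Extra: metadata inheritance

This is an extra step to show how metadata that belongs to a document is inherited by its chunks when you use a component such as DocumentSplitter.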

pipeline.add_component("splitter", DocumentSplitter())

pipeline.connect("enricher", "splitter")

<haystack.core.pipeline.pipeline.Pipeline object at 0x77ff80aee8c0>
🚅 Components
  - fetcher: LinkContentFetcher
  - converter: HTMLToDocument
  - enricher: MetadataEnricher
  - splitter: DocumentSplitter
🛤️ Connections
  - fetcher.streams -> converter.sources (List[ByteStream])
  - converter.documents -> enricher.documents (List[Document])
  - enricher.documents -> splitter.documents (List[Document])

pipeline.run({"fetcher": {"urls": ['https://techcrunch.com/2023/08/09/deepset-secures-30m-to-expand-its-llm-focused-mlops-offerings/',
                                   'https://www.prnewswire.com/news-releases/arize-ai-raises-38-million-series-b-to-scale-machine-learning-observability-platform-301620603.html']}})

{'splitter': {'documents': [Document(id=9611aa2bdb658163d8f6964220052065936fcd036dd24743d1b34ce79d25bc5a, content: 'Deepset, a platform for building enterprise apps powered by large language models akin to ChatGPT, t...', meta: {'content_type': 'text/html', 'url': 'https://techcrunch.com/2023/08/09/deepset-secures-30m-to-expand-its-llm-focused-mlops-offerings/', 'company': 'Deepset', 'year': 2023, 'funding_value': 30000000, 'funding_currency': 'USD', 'source_id': '5844517120556b13f92430ea8af9837714ede1b351580c43c2ddce9b646cb6cb', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
   Document(id=6bffbcf9f1cd1a3940628d1450c9ba9a8c9a092136896d295b35af3175caffbf, content: 'unfortunate state of affairs is likely contributing to challenges around AI development within the e...', meta: {'content_type': 'text/html', 'url': 'https://techcrunch.com/2023/08/09/deepset-secures-30m-to-expand-its-llm-focused-mlops-offerings/', 'company': 'Deepset', 'year': 2023, 'funding_value': 30000000, 'funding_currency': 'USD', 'source_id': '5844517120556b13f92430ea8af9837714ede1b351580c43c2ddce9b646cb6cb', 'page_number': 1, 'split_id': 1, 'split_idx_start': 1256}),
   Document(id=da372f9bc2292f487f0aad372053a531744d9549a46f8ac197c216abaa4d99d0, content: 'to end users, and perform analyses of the LLMs’ accuracy while continuously monitoring their perform...', meta: {'content_type': 'text/html', 'url': 'https://techcrunch.com/2023/08/09/deepset-secures-30m-to-expand-its-llm-focused-mlops-offerings/', 'company': 'Deepset', 'year': 2023, 'funding_value': 30000000, 'funding_currency': 'USD', 'source_id': '5844517120556b13f92430ea8af9837714ede1b351580c43c2ddce9b646cb6cb', 'page_number': 1, 'split_id': 2, 'split_idx_start': 2609}),
   Document(id=f316cd275e8bc763de41d128dbdbd81e1baad2693b0102f6951c4f46aa8f6048, content: 'predicts that the sector for MLOps will reach $23.1 billion by 2031, up from around $1 billion in 20...', meta: {'content_type': 'text/html', 'url': 'https://techcrunch.com/2023/08/09/deepset-secures-30m-to-expand-its-llm-focused-mlops-offerings/', 'company': 'Deepset', 'year': 2023, 'funding_value': 30000000, 'funding_currency': 'USD', 'source_id': '5844517120556b13f92430ea8af9837714ede1b351580c43c2ddce9b646cb6cb', 'page_number': 1, 'split_id': 3, 'split_idx_start': 3997}),
   Document(id=c7ff4e0d7af8aaa16f3195cb1f9096bb1cf8e7d985190fa6746c278b1d8457e8, content: 'Arize AI Raises $38 Million Series B To Scale Machine Learning Observability Platform
   As companies t...', meta: {'content_type': 'text/html', 'url': 'https://www.prnewswire.com/news-releases/arize-ai-raises-38-million-series-b-to-scale-machine-learning-observability-platform-301620603.html', 'company': 'Arize AI', 'year': 2022, 'funding_value': 38000000, 'funding_currency': 'USD', 'source_id': '8cdcb63a4e006b1cac902ebc2e012cd95156d188777e0d0c8bd407a92f4491c7', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
   Document(id=646f4d43ee20fcbe35c97bd04e8fd6edd4ad5c9af63fe4a63267f5df1807254f, content: 'by humans.
   Launched in 2020, Arize's ML observability platform is already counted on by a growing li...', meta: {'content_type': 'text/html', 'url': 'https://www.prnewswire.com/news-releases/arize-ai-raises-38-million-series-b-to-scale-machine-learning-observability-platform-301620603.html', 'company': 'Arize AI', 'year': 2022, 'funding_value': 38000000, 'funding_currency': 'USD', 'source_id': '8cdcb63a4e006b1cac902ebc2e012cd95156d188777e0d0c8bd407a92f4491c7', 'page_number': 1, 'split_id': 1, 'split_idx_start': 1360}),
   Document(id=09382ae0ab9adbd7860199b0d86e8ca044787eda860dfb181e3b243c6584a427, content: 'what happened, and improve overall model performance," says Morgan Gerlak, Partner at TCV. "Like oth...', meta: {'content_type': 'text/html', 'url': 'https://www.prnewswire.com/news-releases/arize-ai-raises-38-million-series-b-to-scale-machine-learning-observability-platform-301620603.html', 'company': 'Arize AI', 'year': 2022, 'funding_value': 38000000, 'funding_currency': 'USD', 'source_id': '8cdcb63a4e006b1cac902ebc2e012cd95156d188777e0d0c8bd407a92f4491c7', 'page_number': 1, 'split_id': 2, 'split_idx_start': 2697}),
   Document(id=94497826791e38f15016a2360d7e3eea5e242770f51dd11be69fb21210e89c9a, content: 'you are going to be left behind," notes Brett Wilson, Co-Founder and General Partner at Swift Ventur...', meta: {'content_type': 'text/html', 'url': 'https://www.prnewswire.com/news-releases/arize-ai-raises-38-million-series-b-to-scale-machine-learning-observability-platform-301620603.html', 'company': 'Arize AI', 'year': 2022, 'funding_value': 38000000, 'funding_currency': 'USD', 'source_id': '8cdcb63a4e006b1cac902ebc2e012cd95156d188777e0d0c8bd407a92f4491c7', 'page_number': 1, 'split_id': 3, 'split_idx_start': 3967})]}}
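
The inheritance visible in the output above can be illustrated with a simplified plain-Python sketch. It is not DocumentSplitter's actual algorithm: `split_with_meta` is a hypothetical helper that splits on whitespace, and it only reproduces the `source_id` and `split_id` fields, leaving out `page_number` and `split_idx_start`.

```python
from typing import Any, Dict, List


def split_with_meta(doc: Dict[str, Any], words_per_chunk: int) -> List[Dict[str, Any]]:
    """Split a document into word chunks that each carry a copy of the parent's meta."""
    words = doc["content"].split()
    chunks = []
    for split_id, start in enumerate(range(0, len(words), words_per_chunk)):
        # Copy the parent's meta, then add the fields that identify this chunk.
        chunk_meta = {**doc["meta"], "source_id": doc["id"], "split_id": split_id}
        chunks.append(
            {"content": " ".join(words[start:start + words_per_chunk]), "meta": chunk_meta}
        )
    return chunks


parent = {
    "id": "parent-1",  # hypothetical id
    "content": "one two three four five",
    "meta": {"company": "Deepset", "year": 2023},
}

chunks = split_with_meta(parent, words_per_chunk=2)
for chunk in chunks:
    print(chunk["content"], chunk["meta"])
```

Each chunk gets its own copy of the metadata, so a filter on `company` or `year` works on chunks exactly as it does on whole documents.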