Tutorial: Generating Structured Output with Loop-Based Auto-Correction

Last Updated: June 10, 2025

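Level: Intermediate
Time to complete: 15 minutes
Prerequisites: You need an API key from an active OpenAI account, because this tutorial uses OpenAI's gpt-4o-mini model.
Components Used: ChatPromptBuilder, OpenAIChatGenerator, OutputValidator (custom component)
Goal: After completing this tutorial, you will have built a system that extracts unstructured data, puts it in a JSON schema, and automatically corrects errors in the JSON output generated by a large language model (LLM) so that it follows the specified structure.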

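Overview

This tutorial demonstrates how to use Haystack's advanced looping pipelines with LLMs for more dynamic and flexible data processing. You will learn how to extract structured data from unstructured data using an LLM, and how to validate the generated output against a predefined schema.

This tutorial uses gpt-4o-mini to convert unstructured passages into JSON outputs that follow a Pydantic schema. It uses a custom OutputValidator component to validate the JSON and loops back to the LLM to correct it when validation fails.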
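
The validate-and-retry idea can be sketched in plain Python before bringing in Haystack. This is a minimal illustration, not the tutorial's implementation: validate, generate_with_retries, and fake_llm are hypothetical names, the hand-written type check stands in for the Pydantic model, and the scripted fake_llm stands in for gpt-4o-mini.

```python
import json
from typing import Callable, Optional, Tuple

# Hypothetical stand-in for the Pydantic schema defined later in the tutorial.
REQUIRED_FIELDS = {"name": str, "country": str, "population": int}


def validate(text: str) -> Optional[str]:
    """Return None when `text` matches the expected shape, else an error message."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return f"not valid JSON: {exc}"
    if not isinstance(data, dict) or not isinstance(data.get("cities"), list):
        return "expected an object with a 'cities' list"
    for i, city in enumerate(data["cities"]):
        if not isinstance(city, dict):
            return f"cities[{i}]: expected an object"
        for field, expected in REQUIRED_FIELDS.items():
            if not isinstance(city.get(field), expected):
                return f"cities[{i}].{field}: expected {expected.__name__}"
    return None


# Signature of the (hypothetical) model call: passage, previous reply, previous error.
LLM = Callable[[str, Optional[str], Optional[str]], str]


def generate_with_retries(llm: LLM, passage: str, max_runs: int = 5) -> Tuple[str, int]:
    """Call `llm` until its reply validates, feeding back the last reply and error."""
    previous: Optional[str] = None
    error: Optional[str] = None
    for attempt in range(1, max_runs + 1):
        reply = llm(passage, previous, error)
        error = validate(reply)
        if error is None:
            return reply, attempt
        previous = reply
    raise RuntimeError(f"no valid reply after {max_runs} attempts: {error}")


def fake_llm(passage: str, previous: Optional[str], error: Optional[str]) -> str:
    """Scripted model: the first answer has a string population, the retry fixes it."""
    if error is None:
        return '{"cities": [{"name": "Berlin", "country": "Germany", "population": "3,850,809"}]}'
    return '{"cities": [{"name": "Berlin", "country": "Germany", "population": 3850809}]}'


reply, attempts = generate_with_retries(fake_llm, "Berlin is the capital of Germany.")
print(f"valid after {attempts} attempt(s): {reply}")
```

The cap on attempts plays the same role as max_runs_per_component in the pipeline: without it, a model that never produces valid output would loop forever.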

Preparing the Colab Environment

Enable the debug mode of logging

import logging

logging.basicConfig()
logging.getLogger("canals.pipeline.pipeline").setLevel(logging.DEBUG)

Installing Dependencies

Install Haystack and colorama with pip

%%bash

pip install haystack-ai
pip install colorama

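Defining a Schema to Parse the JSON Object

Define a simple JSON schema for the data you want to extract from a text passage. As the first step, define two Pydantic models, City and CitiesData, with suitable fields and types.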

from typing import List
from pydantic import BaseModel


class City(BaseModel):
    name: str
    country: str
    population: int


class CitiesData(BaseModel):
    cities: List[City]

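You can change these models according to the format you want to extract from the text.

Then, generate a JSON schema from the Pydantic models using schema_json(). You will later use this schema in the prompt to instruct the LLM.

To learn more about JSON schemas, visit Pydantic Schema.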

json_schema = CitiesData.schema_json(indent=2)

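Creating a Custom Component: OutputValidator

OutputValidator is a custom component that validates whether the JSON object the LLM generates complies with the provided Pydantic model. If it doesn't, OutputValidator returns an error message along with the incorrect JSON object so that it can be fixed in the next loop.

For more details about custom components, see Creating Custom Components.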

import json
import random
import pydantic
from pydantic import ValidationError
from typing import Optional, List
from colorama import Fore
from haystack import component
from haystack.dataclasses import ChatMessage


# Define the component input parameters
@component
class OutputValidator:
    def __init__(self, pydantic_model: pydantic.BaseModel):
        self.pydantic_model = pydantic_model
        self.iteration_counter = 0

    # Define the component output
    @component.output_types(valid_replies=List[str], invalid_replies=Optional[List[str]], error_message=Optional[str])
    def run(self, replies: List[ChatMessage]):

        self.iteration_counter += 1

        ## Try to parse the LLM's reply ##
        # If the LLM's reply is a valid object, return `"valid_replies"`
        try:
            output_dict = json.loads(replies[0].text)
            self.pydantic_model.parse_obj(output_dict)
            print(
                Fore.GREEN
                + f"OutputValidator at Iteration {self.iteration_counter}: Valid JSON from LLM - No need for looping: {replies[0]}"
            )
            return {"valid_replies": replies}

        # If the LLM's reply is corrupted or not valid, return "invalid_replies" and the "error_message" for LLM to try again
        except (ValueError, ValidationError) as e:
            print(
                Fore.RED
                + f"OutputValidator at Iteration {self.iteration_counter}: Invalid JSON from LLM - Let's try again.\n"
                f"Output from LLM:\n {replies[0]} \n"
                f"Error from OutputValidator: {e}"
            )
            return {"invalid_replies": replies, "error_message": str(e)}

Then, create an OutputValidator instance with the CitiesData model you created earlier.

output_validator = OutputValidator(pydantic_model=CitiesData)
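
The except clause in OutputValidator lists ValueError alongside ValidationError. The short sketch below illustrates why ValueError is enough to catch malformed JSON: json.JSONDecodeError is a subclass of it. The helper name classify_reply and the sample strings are hypothetical and only serve the illustration.

```python
import json


def classify_reply(text: str) -> str:
    """Report whether `text` parses as JSON, naming the exception if it does not."""
    try:
        json.loads(text)
    except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
        return f"invalid ({type(exc).__name__})"
    return "parsed"


# Made-up replies covering one good case and two common failure shapes.
samples = {
    "well-formed": '{"cities": []}',
    "truncated": '{"cities": [',
    "wrapped in prose": 'Here is the JSON: {"cities": []}',
}
for label, text in samples.items():
    print(f"{label}: {classify_reply(text)}")
```

Replies that parse as JSON but have the wrong shape, for example a city with no population field, get past json.loads and are caught by the Pydantic check instead.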

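Creating the Prompt

Write instructions for the LLM on converting a passage into JSON format. Make sure the instructions explain how to identify and correct errors when the JSON doesn't match the required schema. Once you have created the prompt, initialize ChatPromptBuilder to use it.

For information about Jinja2 templates and ChatPromptBuilder, see ChatPromptBuilder.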

from haystack.components.builders import ChatPromptBuilder


prompt_template = [
    ChatMessage.from_user(
        """
Create a JSON object from the information present in this passage: {{passage}}.
Only use information that is present in the passage. Follow this JSON schema, but only return the actual instances without any additional schema definition:
{{schema}}
Make sure your response is a dict and not a list.
{% if invalid_replies and error_message %}
  You already created the following output in a previous attempt: {{invalid_replies}}
  However, this doesn't comply with the format requirements from above and triggered this Python exception: {{error_message}}
  Correct the output and try again. Just return the corrected output without any extra explanations.
{% endif %}
"""
    )
]
prompt_builder = ChatPromptBuilder(template=prompt_template)
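
To see what the {% if invalid_replies and error_message %} block in the template does, the following plain-Python sketch mimics the same branching without Jinja2. It is an approximation under stated assumptions: render_prompt, the shortened wording, and the sample error text are hypothetical, and the real rendering is done by ChatPromptBuilder.

```python
from typing import Optional

# Shortened, hypothetical versions of the two parts of the tutorial's template.
BASE_INSTRUCTIONS = (
    "Create a JSON object from the information present in this passage: {passage}.\n"
    "Follow this JSON schema:\n{schema}\n"
)
CORRECTION = (
    "You already created the following output in a previous attempt: {invalid_replies}\n"
    "It triggered this Python exception: {error_message}\n"
    "Correct the output and try again.\n"
)


def render_prompt(
    passage: str,
    schema: str,
    invalid_replies: Optional[str] = None,
    error_message: Optional[str] = None,
) -> str:
    """Build the prompt; the correction block appears only when both values are set."""
    prompt = BASE_INSTRUCTIONS.format(passage=passage, schema=schema)
    if invalid_replies and error_message:
        prompt += CORRECTION.format(
            invalid_replies=invalid_replies, error_message=error_message
        )
    return prompt


# First run: nothing has failed yet, so only the base instructions are rendered.
first_prompt = render_prompt("Berlin is the capital of Germany.", '{"type": "object"}')

# Retry: the previous reply and a made-up error message are appended.
retry_prompt = render_prompt(
    "Berlin is the capital of Germany.",
    '{"type": "object"}',
    invalid_replies='{"cities": "Berlin"}',
    error_message="cities: expected a list",
)
print(first_prompt)
print(retry_prompt)
```

On the first pass neither variable is set, so the LLM sees only the base instructions; the correction text is added only after the validator has rejected a reply.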

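Initializing the ChatGenerator

OpenAIChatGenerator generates text using OpenAI's gpt-4o-mini model by default. Set the OPENAI_API_KEY environment variable and, if you want a different model, pass its name to the ChatGenerator.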

import os
from getpass import getpass

from haystack.components.generators.chat import OpenAIChatGenerator

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")
chat_generator = OpenAIChatGenerator()

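Building the Pipeline

Add all components to your pipeline and connect them. Add connections from output_validator back to prompt_builder to handle cases where the generated JSON doesn't comply with the JSON schema. Set max_runs_per_component to avoid infinite looping.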

from haystack import Pipeline

pipeline = Pipeline(max_runs_per_component=5)

# Add components to your pipeline
pipeline.add_component(instance=prompt_builder, name="prompt_builder")
pipeline.add_component(instance=chat_generator, name="llm")
pipeline.add_component(instance=output_validator, name="output_validator")

# Now, connect the components to each other
pipeline.connect("prompt_builder.prompt", "llm.messages")
pipeline.connect("llm.replies", "output_validator")
# If a component has more than one output or input, explicitly specify the connections:
pipeline.connect("output_validator.invalid_replies", "prompt_builder.invalid_replies")
pipeline.connect("output_validator.error_message", "prompt_builder.error_message")

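Visualizing the Pipeline

Draw the pipeline with the draw() method to confirm that the connections are correct. You can find the diagram in the Files section of this Colab.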

# pipeline.draw("auto-correct-pipeline.png")

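Testing the Pipeline

Run the pipeline with an example passage that you want to convert into JSON format and the json_schema you created for CitiesData. For the given example passage, the generated JSON object should look like this: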

{
  "cities": [
    {
      "name": "Berlin",
      "country": "Germany",
      "population": 3850809
    },
    {
      "name": "Paris",
      "country": "France",
      "population": 2161000
    },
    {
      "name": "Lisbon",
      "country": "Portugal",
      "population": 504718
    }
  ]
}

The output of the LLM should comply with json_schema. If the LLM doesn't generate a correct JSON object, the pipeline loops back and tries again.

passage = "Berlin is the capital of Germany. It has a population of 3,850,809. Paris, France's capital, has 2.161 million residents. Lisbon is the capital and the largest city of Portugal with the population of 504,718."
result = pipeline.run({"prompt_builder": {"passage": passage, "schema": json_schema}})

If you encounter the PipelineMaxLoops: Maximum loops count (5) exceeded for component 'prompt_builder'. error, consider increasing the maximum loop count or simply rerun the pipeline.

Printing the Correct JSON

If you didn't get any errors, you can now print the corrected JSON.

valid_reply = result["output_validator"]["valid_replies"][0].text
valid_json = json.loads(valid_reply)
print(valid_json)
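
Once the reply is parsed, valid_json is an ordinary Python dict, so the usual tools apply. The sketch below assumes the reply matches the expected output shown in the Testing the Pipeline section; the sample string is hard-coded so the snippet runs without calling the pipeline.

```python
import json

# Hard-coded copy of the expected output, standing in for the pipeline result.
valid_reply = """
{"cities": [
  {"name": "Berlin", "country": "Germany", "population": 3850809},
  {"name": "Paris", "country": "France", "population": 2161000},
  {"name": "Lisbon", "country": "Portugal", "population": 504718}
]}
"""
valid_json = json.loads(valid_reply)

# Pretty-print with indentation instead of the single-line dict repr.
print(json.dumps(valid_json, indent=2))

# Work with the extracted fields as regular Python values.
largest = max(valid_json["cities"], key=lambda city: city["population"])
total_population = sum(city["population"] for city in valid_json["cities"])
print(f"Largest city: {largest['name']}, total population: {total_population}")
```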

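What's next

🎉 Congratulations! You've built a system that generates structured JSON from unstructured text passages and auto-corrects it by using the looping functionality of Haystack pipelines.

To stay up to date on the latest Haystack developments, you can subscribe to our newsletter and join the Haystack Discord community.

Thanks for reading!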
