Haystack 2.10.0

⭐️ Highlights

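Improved `Pipeline.run()` logic

The new Pipeline.run() logic fixes common pipeline issues, including exceptions, incorrect component execution, missing intermediate outputs, and premature execution of lazy variadic components. Most pipelines should be unaffected, but if you use cyclic pipelines or pipelines with lazy variadic components, we recommend reviewing their execution carefully to confirm that their behavior has not changed. You can use this tool to compare the execution traces of a pipeline under the old and new logic.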

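`AsyncPipeline` for asynchronous execution

Alongside the new Pipeline.run logic, AsyncPipeline enables asynchronous execution, allowing pipeline components to run concurrently whenever possible. This yields significant speedups, especially for pipelines that process data in parallel branches, such as hybrid retrieval setups.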
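
To illustrate why independent branches benefit from concurrent execution, here is a minimal sketch in plain asyncio. It does not use Haystack: `fake_retriever` and its delays are hypothetical stand-ins for two I/O-bound retrieval branches.

```python
import asyncio
import time


async def fake_retriever(name: str, delay: float) -> str:
    # Hypothetical stand-in for an I/O-bound retrieval branch.
    await asyncio.sleep(delay)
    return f"{name} done"


async def run_branches() -> list:
    # Both branches are awaited together, so the total wait is roughly
    # that of the slowest branch rather than the sum of both.
    return await asyncio.gather(
        fake_retriever("bm25", 0.2),
        fake_retriever("embedding", 0.2),
    )


start = time.perf_counter()
results = asyncio.run(run_branches())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

Run sequentially, the two calls would take about the sum of their delays. The same reasoning applies to pipeline branches that do not depend on each other.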

AsyncPipeline vs Pipeline

Source code

Hybrid retrieval

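import asyncio

from haystack import AsyncPipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.joiners import DocumentJoiner
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever, InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Placeholders, not part of the original snippet: use your own indexed store and query
document_store = InMemoryDocumentStore()
query = "What is hybrid retrieval?"

hybrid_rag_retrieval = AsyncPipeline()
hybrid_rag_retrieval.add_component("text_embedder", SentenceTransformersTextEmbedder())
hybrid_rag_retrieval.add_component("embedding_retriever", InMemoryEmbeddingRetriever(document_store=document_store))
hybrid_rag_retrieval.add_component("bm25_retriever", InMemoryBM25Retriever(document_store=document_store))
# The joiner must be registered before it can be connected
hybrid_rag_retrieval.add_component("document_joiner", DocumentJoiner())

hybrid_rag_retrieval.connect("text_embedder", "embedding_retriever")
hybrid_rag_retrieval.connect("bm25_retriever", "document_joiner")
hybrid_rag_retrieval.connect("embedding_retriever", "document_joiner")

async def run_inner():
    # run_async is the awaitable entry point; run() is the blocking wrapper
    return await hybrid_rag_retrieval.run_async({
        "text_embedder": {"text": query},
        "bm25_retriever": {"query": query},
    })

results = asyncio.run(run_inner())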

Parallel translation pipeline

from haystack import AsyncPipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage

# Create prompt builders with templates at initialization
# (the template is passed as a list of ChatMessage objects)
spanish_prompt_builder = ChatPromptBuilder(template=[ChatMessage.from_user("Translate this message to Spanish: {{user_message}}")])
turkish_prompt_builder = ChatPromptBuilder(template=[ChatMessage.from_user("Translate this message to Turkish: {{user_message}}")])
thai_prompt_builder = ChatPromptBuilder(template=[ChatMessage.from_user("Translate this message to Thai: {{user_message}}")])

# Create LLM instances
spanish_llm = OpenAIChatGenerator()
turkish_llm = OpenAIChatGenerator()
thai_llm = OpenAIChatGenerator()

# Create and configure pipeline
pipe = AsyncPipeline()

# Add components
pipe.add_component("spanish_prompt_builder", spanish_prompt_builder)
pipe.add_component("turkish_prompt_builder", turkish_prompt_builder)
pipe.add_component("thai_prompt_builder", thai_prompt_builder)

pipe.add_component("spanish_llm", spanish_llm)
pipe.add_component("turkish_llm", turkish_llm)
pipe.add_component("thai_llm", thai_llm)

# Connect components
pipe.connect("spanish_prompt_builder.prompt", "spanish_llm.messages")
pipe.connect("turkish_prompt_builder.prompt", "turkish_llm.messages")
pipe.connect("thai_prompt_builder.prompt", "thai_llm.messages")

user_message = """
In computer programming, the async/await pattern is a syntactic feature of many programming languages that 
allows an asynchronous, non-blocking function to be structured in a way similar to an ordinary synchronous function. 
It is semantically related to the concept of a coroutine and is often implemented using similar techniques, 
and is primarily intended to provide opportunities for the program to execute other code while waiting 
for a long-running, asynchronous task to complete, usually represented by promises or similar data structures.
"""

# Run the pipeline with simplified input
res = pipe.run(data={"user_message": user_message})

# Print results
# OpenAIChatGenerator returns its messages under the "replies" key
print("Spanish translation:", res["spanish_llm"]["replies"][0].text)
print("Turkish translation:", res["turkish_llm"]["replies"][0].text)
print("Thai translation:", res["thai_llm"]["replies"][0].text)

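Tool calling support everywhere

Tool calling is now supported across all chat generators, making it easier than ever to port tools between platforms. Simply switch the chat generator and your tools keep working without any additional configuration. This update covers AzureOpenAIChatGenerator, HuggingFaceLocalChatGenerator, and all core integrations, including AnthropicChatGenerator, CohereChatGenerator, AmazonBedrockChatGenerator, and VertexAIGeminiChatGenerator. With this enhancement, tool use becomes a native capability across the ecosystem, enabling more advanced and interactive agentic applications.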

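Visualize your pipelines locally

Pipeline visualization is now more flexible: you can render pipeline graphs locally, without an internet connection and without sending data to an external service. By running a local Mermaid server with Docker, you can generate visual representations of your pipelines with the draw() or show() commands. Learn more in Visualizing Pipelines.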

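New components for smarter document processing

This release introduces new components that enhance document processing. CSVDocumentSplitter and CSVDocumentCleaner make handling CSV files more efficient. LLMMetadataExtractor uses an LLM to analyze documents and enrich them with relevant metadata, improving searchability and retrieval accuracy.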

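⬆️ Upgrade Notes

The DOCXToDocument converter now returns a Document object whose DOCX metadata is stored in the meta field as a dictionary under the key docx. Previously, the metadata was represented by a DOCXMetadata dataclass. This change does not affect reading from or writing to a Document Store.
The deprecated NLTKDocumentSplitter has been removed; its functionality is now supported by DocumentSplitter.
The deprecated FUNCTION role has been removed from the ChatRole enum. Use TOOL instead. The deprecated class method ChatMessage.from_function has been removed. Use ChatMessage.from_tool instead.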

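🚀 New Features

Added a new component, ListJoiner, which joins lists of values from different components into a single list.

Introduced the OpenAPIConnector component, which allows direct invocation of REST endpoints defined in an OpenAPI specification. It is designed for calling REST endpoints directly, without LLM-generated payloads; users must pass the run parameters explicitly. Example: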

from haystack.utils import Secret
from haystack.components.connectors.openapi import OpenAPIConnector

connector = OpenAPIConnector(
    openapi_spec="https://bit.ly/serperdev_openapi",
    credentials=Secret.from_env_var("SERPERDEV_API_KEY"),
)
response = connector.run(operation_id="search", parameters={"q": "Who was Nikola Tesla?"})

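Added a new component, LLMMetadataExtractor, which can be used in an indexing pipeline to extract metadata from documents based on a user-provided prompt and return the documents with the LLM output stored in their metadata fields.
Introduced the CSVDocumentCleaner component for cleaning CSV documents.
- Removes empty rows and columns while preserving the specified ignored rows and columns.
- Customizable number of rows and columns to ignore during processing.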
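
As a rough illustration of the cleaning behavior described above, the following sketch removes empty rows and columns using only the standard `csv` module. It is not the CSVDocumentCleaner implementation; `clean_csv` and the semantics chosen here for `ignore_rows` and `ignore_columns` are assumptions made for this example.

```python
import csv
import io


def clean_csv(text: str, ignore_rows: int = 0, ignore_columns: int = 0) -> str:
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return text
    width = max(len(r) for r in rows)
    rows = [r + [""] * (width - len(r)) for r in rows]  # pad ragged rows

    def is_empty(cells):
        return all(c.strip() == "" for c in cells)

    # Keep the first `ignore_rows` rows untouched; drop later rows whose
    # cells (outside the ignored columns) are all empty.
    kept = [
        r for i, r in enumerate(rows)
        if i < ignore_rows or not is_empty(r[ignore_columns:])
    ]
    # Keep the first `ignore_columns` columns; drop later columns whose
    # cells (outside the ignored rows) are all empty.
    keep_cols = [
        j for j in range(width)
        if j < ignore_columns or not is_empty([r[j] for r in kept[ignore_rows:]])
    ]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows([[r[j] for j in keep_cols] for r in kept])
    return out.getvalue()


raw = "a,,b\n,,\n1,,2\n"
cleaned = clean_csv(raw)
print(cleaned)
```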
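Introduced CSVDocumentSplitter, which splits CSV documents into structured sub-tables by recursively splitting on runs of empty rows and columns that exceed a specified threshold. This is particularly useful when converting Excel files, which often contain multiple tables in a single sheet.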
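
The row-wise half of that idea can be sketched with the standard library alone. This is an assumption-laden illustration rather than the component's code: `split_on_empty_rows` and its `threshold` argument are hypothetical names, and column-wise and recursive splitting are left out.

```python
import csv
import io


def split_on_empty_rows(text: str, threshold: int = 1) -> list:
    """Split CSV text into sub-tables wherever at least `threshold`
    consecutive empty rows occur. Shorter gaps are dropped and the
    current table simply continues."""
    tables, current, gap = [], [], 0
    for row in csv.reader(io.StringIO(text)):
        if all(cell.strip() == "" for cell in row):
            gap += 1
            continue
        if current and gap >= threshold:
            tables.append(current)
            current = []
        gap = 0
        current.append(row)
    if current:
        tables.append(current)
    return tables


# Hypothetical sheet holding two tables separated by two empty rows.
sheet = "name,qty\napple,3\n,\n,\ncity,pop\nOslo,700000\n"
tables = split_on_empty_rows(sheet, threshold=2)
print(len(tables))
```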

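⚡️ Enhancement Notes

Enhanced SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder to accept an additional parameter that is passed directly to the underlying SentenceTransformer.encode method, allowing more flexible embedding customization.
Added completion_start_time metadata to track time to first token (TTFT) in streaming responses from the Hugging Face API and OpenAI (Azure).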
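
For readers unfamiliar with TTFT, the sketch below measures it on a simulated stream. It is a standalone illustration: `fake_stream` and `consume_with_ttft` are hypothetical, and the sketch records an elapsed duration, whereas completion_start_time is metadata attached to the response.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def fake_stream() -> Iterator[str]:
    # Hypothetical token stream; the sleeps mimic network latency.
    time.sleep(0.05)
    yield "Hello"
    time.sleep(0.01)
    yield " world"


def consume_with_ttft(stream: Iterable[str]) -> Tuple[str, Optional[float]]:
    """Return the concatenated text and the seconds elapsed until the
    first chunk arrived (None if the stream was empty)."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start
        parts.append(chunk)
    return "".join(parts), ttft


text, ttft = consume_with_ttft(fake_stream())
print(text, f"TTFT={ttft:.3f}s")
```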
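Date filtering enhancements in MetadataRouter:
- Improved date parsing in the filter utilities by introducing _parse_date, which first tries datetime.fromisoformat(value) for backward compatibility and then falls back to dateutil.parser.parse() for broader ISO 8601 support.
- Resolved a common issue where comparing naive and timezone-aware datetimes raised a TypeError. Added _ensure_both_dates_naive_or_aware, which ensures that both datetimes are either naive or timezone-aware. If one lacks a timezone, it is assigned the timezone of the other for consistency.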
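
The timezone alignment described in the second point can be sketched as follows. This is a minimal reconstruction of the idea, not the library's helper: `align_timezones` is a hypothetical name.

```python
from datetime import datetime


def align_timezones(a: datetime, b: datetime):
    """If exactly one datetime is naive, give it the other's tzinfo so
    the two values can be compared."""
    if a.tzinfo is None and b.tzinfo is not None:
        a = a.replace(tzinfo=b.tzinfo)
    elif a.tzinfo is not None and b.tzinfo is None:
        b = b.replace(tzinfo=a.tzinfo)
    return a, b


naive = datetime.fromisoformat("2025-01-01T12:00:00")
aware = datetime.fromisoformat("2025-01-01T13:00:00+00:00")

# Comparing `naive < aware` directly raises TypeError; after alignment
# both values carry the same timezone and the comparison is valid.
left, right = align_timezones(naive, aware)
print(left < right)
```

Note that `replace(tzinfo=...)` attaches a timezone without shifting the clock time, which matches the "assign the other's timezone" behavior described above.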
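Pipeline.from_dict now raises an informative PipelineError when it receives an invalid type (for example, an empty string).
Added the jsonschema library as a core dependency. It is used in Tool and JsonSchemaValidator.
Added support for the streaming callback run parameter in the HF chat generators.
For CSVDocumentCleaner, added remove_empty_rows and remove_empty_columns to optionally remove rows and columns. Also added keep_id to optionally keep the original document ID.
Enhanced OpenAPIServiceConnector to support and be compatible with the new ChatMessage format.
Updated the Document's metadata after the Document is initialized in DocumentSplitter, as requested in issue #8741.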

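⚠️ Deprecation Notes

The ExtractedTableAnswer dataclass and the dataframe field in the Document dataclass are deprecated and will be removed in Haystack 2.11.0. See the GitHub discussion for motivation and details.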

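🐛 Bug Fixes

Fixed a bug that caused the pyright type checker to fail on all component objects.
Mermaid graphs of Haystack pipelines are now compressed to reduce the size of the base64-encoded payload and avoid HTTP 400 errors when the graph is too large.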
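
To see why compressing before base64 encoding helps, consider the sketch below. It uses `zlib` from the standard library on a made-up Mermaid graph; the exact payload format Haystack sends to the rendering service is not shown here and should not be inferred from this example.

```python
import base64
import zlib

# Hypothetical Mermaid source for a large pipeline graph.
graph = "graph TD;\n" + "".join(
    f'c{i}["component_{i}"] --> c{i + 1}["component_{i + 1}"];\n'
    for i in range(200)
)
data = graph.encode("utf-8")

plain = base64.urlsafe_b64encode(data)
compressed = base64.urlsafe_b64encode(zlib.compress(data, 9))

print(len(plain), len(compressed))

# The payload is recoverable by reversing both steps.
restored = zlib.decompress(base64.urlsafe_b64decode(compressed)).decode("utf-8")
```

Because graph definitions are highly repetitive, the compressed payload is much shorter, which keeps request URLs under the size limits that trigger HTTP 400 responses.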
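The DOCXToDocument component now skips comment blocks in DOCX files that previously caused errors.
Deserialization of callables now works for all fully qualified import paths.
Fixed error messages of the Document Classifier components that suggested using nonexistent components for text classification.
Fixed JSONConverter so that it properly skips converting JSON files that are not UTF-8 encoded.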
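Fixed the following Pipeline.run() issues:
- Acyclic pipelines with multiple lazy variadic components did not run all components
- Cyclic pipelines did not pass intermediate outputs to components outside the cycle
- Cyclic pipelines with two or more optional or greedy variadic edges showed unexpected execution behavior
- Cyclic pipelines with two cycles sharing an edge raised errors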
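Updated the PDFMinerToDocument conversion function to add double newlines between container_text elements so that the text can later be chunked by DocumentSplitter.
The Hugging Face API embedders now use the InferenceClient.feature_extraction method instead of InferenceClient.post to compute embeddings. This makes the implementation more robust and future-proof.
Improved tool call handling for OpenAIChatGenerator streaming responses: the logic now scans all chunks to correctly identify the first chunk containing a tool call, ensuring correct payload construction and preventing errors when tool call data is not confined to the initial chunk.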
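
The chunk-scanning idea in the last fix can be sketched as follows. The chunk dictionaries are a simplified, hypothetical shape chosen for this example (real provider chunks carry more fields), and `first_tool_call_index` and `assemble_tool_call` are illustrative names, not Haystack functions.

```python
import json
from typing import Dict, List, Optional

# Hypothetical stream: the tool call starts in the second chunk, not the first.
chunks: List[Dict] = [
    {"content": "", "tool_calls": None},
    {"content": "", "tool_calls": [{"name": "search", "arguments": '{"q": '}]},
    {"content": "", "tool_calls": [{"name": None, "arguments": '"Tesla"}'}]},
]


def first_tool_call_index(stream: List[Dict]) -> Optional[int]:
    # Scan every chunk instead of assuming the tool call starts in chunk 0.
    for i, chunk in enumerate(stream):
        if chunk.get("tool_calls"):
            return i
    return None


def assemble_tool_call(stream: List[Dict]) -> Optional[Dict]:
    start = first_tool_call_index(stream)
    if start is None:
        return None
    name = stream[start]["tool_calls"][0]["name"]
    # Argument fragments arrive across chunks and are joined before parsing.
    raw_args = "".join(
        c["tool_calls"][0]["arguments"]
        for c in stream[start:]
        if c.get("tool_calls")
    )
    return {"name": name, "arguments": json.loads(raw_args)}


call = assemble_tool_call(chunks)
print(call)
```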