Haystack 2.1.0

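Highlights

📊 New Evaluator Components

Haystack introduces new components for both model-based and statistical evaluation: AnswerExactMatchEvaluator, ContextRelevanceEvaluator, DocumentMAPEvaluator, DocumentMRREvaluator, DocumentRecallEvaluator, FaithfulnessEvaluator, LLMEvaluator, SASEvaluator

Here is an example of how to use DocumentMAPEvaluator to evaluate retrieved documents and calculate the mean average precision score: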

from haystack import Document
from haystack.components.evaluators import DocumentMAPEvaluator

evaluator = DocumentMAPEvaluator()
result = evaluator.run(
    ground_truth_documents=[
        [Document(content="France")],
        [Document(content="9th century"), Document(content="9th")],
    ],
    retrieved_documents=[
        [Document(content="France")],
        [Document(content="9th century"), Document(content="10th century"), Document(content="9th")],
    ],
)

result["individual_scores"]
>> [1.0, 0.8333333333333333]
result["score"]
>> 0.9166666666666666

To learn more about evaluating RAG pipelines with the model-based and statistical metrics available in Haystack, check out the tutorial: Evaluating RAG Pipelines.
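
To make the scores in the example above easier to follow, here is a minimal plain-Python sketch of mean average precision. It does not use Haystack; the function name average_precision and the choice to match documents by their content string are assumptions made for illustration, chosen because they reproduce the numbers shown above.

```python
from typing import List


def average_precision(relevant: List[str], retrieved: List[str]) -> float:
    """Average of precision@k over the ranks k at which a relevant document appears."""
    relevant_set = set(relevant)
    hits = 0
    precision_sum = 0.0
    for rank, content in enumerate(retrieved, start=1):
        if content in relevant_set:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Assumption: a query with no relevant hit scores 0.0.
    return precision_sum / hits if hits else 0.0


# Same data as the DocumentMAPEvaluator example, reduced to content strings.
ground_truth = [["France"], ["9th century", "9th"]]
retrieved = [["France"], ["9th century", "10th century", "9th"]]

individual_scores = [average_precision(gt, ret) for gt, ret in zip(ground_truth, retrieved)]
score = sum(individual_scores) / len(individual_scores)
print(individual_scores)
print(score)
```

For the second query, the relevant documents sit at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2, which is about 0.83. The overall score is the mean of the per-query values.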

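🕸️ Support for Sparse Embeddings

Haystack offers robust support for sparse embedding retrieval techniques, including SPLADE. Here is how to create a simple retrieval pipeline with sparse embeddings: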

from haystack import Pipeline
from haystack_integrations.components.retrievers.qdrant import QdrantSparseEmbeddingRetriever
from haystack_integrations.components.embedders.fastembed import FastembedSparseTextEmbedder

sparse_text_embedder = FastembedSparseTextEmbedder(model="prithvida/Splade_PP_en_v1")
# document_store is assumed to be an existing Qdrant document store with sparse embeddings enabled
sparse_retriever = QdrantSparseEmbeddingRetriever(document_store=document_store)

query_pipeline = Pipeline()
query_pipeline.add_component("sparse_text_embedder", sparse_text_embedder)
query_pipeline.add_component("sparse_retriever", sparse_retriever)

query_pipeline.connect("sparse_text_embedder.sparse_embedding", "sparse_retriever.query_sparse_embedding")

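Learn more about this topic in our documentation on Sparse Embedding Retrievers. Start building with our new cookbook: 🧑‍🍳 Sparse Embedding Retrieval using Qdrant and FastEmbed.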
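
As a rough illustration of what sparse embedding retrieval computes, the sketch below scores documents by the dot product of sparse vectors stored as parallel lists of indices and values. It is plain Python, not the Qdrant or FastEmbed implementation; the SparseVector alias, the sparse_dot helper and the toy data are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a sparse embedding: (indices, values) with equal lengths.
SparseVector = Tuple[List[int], List[float]]


def sparse_dot(query: SparseVector, document: SparseVector) -> float:
    """Dot product over the indices that both sparse vectors share."""
    doc_weights: Dict[int, float] = dict(zip(document[0], document[1]))
    return sum(value * doc_weights.get(index, 0.0) for index, value in zip(query[0], query[1]))


query: SparseVector = ([3, 17, 42], [0.5, 1.0, 2.0])
documents: Dict[str, SparseVector] = {
    "doc_a": ([3, 42], [1.0, 0.5]),    # shares indices 3 and 42 with the query
    "doc_b": ([17, 99], [0.25, 4.0]),  # shares only index 17
    "doc_c": ([7, 8], [1.0, 1.0]),     # shares no index
}

# Rank document names by descending similarity to the query.
ranked = sorted(documents, key=lambda name: sparse_dot(query, documents[name]), reverse=True)
print(ranked)
```

Only indices present in both vectors contribute to the score, which is why sparse vectors can be compared without materializing the full vocabulary-sized vector.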

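🧐 Inspect Component Outputs

As of 2.1.0, you can inspect the output of each component after running a pipeline. Pass the component names to pipeline.run through the include_outputs_from parameter: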

pipe.run(data, include_outputs_from={"prompt_builder", "llm", "retriever"})

The pipeline output should look like this:

{'llm': {'replies': ['The Rhodes Statue was described as being built with iron tie bars to which brass plates were fixed to form the skin. It stood on a 15-meter-high white marble pedestal near the Rhodes harbor entrance. The statue itself was about 70 cubits, or 32 meters, tall.'],
  'meta': [{'model': 'gpt-3.5-turbo-0125',
    ...
    'usage': {'completion_tokens': 57,
     'prompt_tokens': 446,
     'total_tokens': 503}}]},
 'retriever': {'documents': [Document(id=a3ee3a9a55b47ff651ae11dc56d84d2b6f8d931b795bd866c14eacfa56000965, content: 'Within it, too, are to be seen large masses of rock, by the weight of which the artist steadied it w...', meta: {'url': 'https://en.wikipedia.org/wiki/Colossus_of_Rhodes', '_split_id': 9}, score: 0.648961685430463),...]},
 'prompt_builder': {'prompt': "\nGiven the following information, answer the question.\n\nContext:\n\n    Within it, too, are to be seen large masses of rock, by the weight of which the artist steadied it while...
 ... levels during construction.\n\n\n\nQuestion: What does Rhodes Statue look like?\nAnswer:"}}
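
The effect of include_outputs_from can be pictured with a toy linear pipeline. This is only a sketch of the idea, not Haystack's Pipeline.run: the run_linear_pipeline function, the step functions and their outputs are hypothetical, and the toy assumes that by default only the last step's output is returned.

```python
from typing import Any, Callable, Dict, List, Optional, Set, Tuple

Step = Tuple[str, Callable[[Any], Any]]


def run_linear_pipeline(
    steps: List[Step], data: Any, include_outputs_from: Optional[Set[str]] = None
) -> Dict[str, Any]:
    """Run steps in order; return the last output plus any requested intermediate outputs."""
    include = include_outputs_from or set()
    outputs: Dict[str, Any] = {}
    current = data
    for position, (name, func) in enumerate(steps):
        current = func(current)
        is_last = position == len(steps) - 1
        # Intermediate outputs are kept only when the step was named explicitly.
        if is_last or name in include:
            outputs[name] = current
    return outputs


steps: List[Step] = [
    ("retriever", lambda query: {"documents": [f"doc about {query}"]}),
    ("prompt_builder", lambda out: {"prompt": f"Context: {out['documents'][0]}"}),
    ("llm", lambda out: {"replies": [out["prompt"].upper()]}),
]

default_result = run_linear_pipeline(steps, "Rhodes")
full_result = run_linear_pipeline(
    steps, "Rhodes", include_outputs_from={"retriever", "prompt_builder"}
)
print(sorted(default_result))
print(sorted(full_result))
```

In the toy, the default call returns only the llm entry, while the second call also exposes what the retriever and the prompt builder produced, which is the kind of information that helps when debugging a prompt.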

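🚀 New Features

Added several new evaluation components, including:
- AnswerExactMatchEvaluator
- ContextRelevanceEvaluator
- DocumentMAPEvaluator
- DocumentMRREvaluator
- DocumentRecallEvaluator
- FaithfulnessEvaluator
- LLMEvaluator
- SASEvaluator
Introduced a new SparseEmbedding class that can store a sparse vector representation of a document. It will be instrumental in supporting sparse embedding retrieval once sparse embedders and sparse embedding retrievers are introduced.
Added a SentenceTransformersDiversityRanker. The diversity ranker orders documents to maximize their overall diversity. The ranker uses sentence-transformer models to calculate semantic embeddings for each document and the query.
Introduced new HuggingFace API components, namely:
- HuggingFaceAPIChatGenerator, which will replace HuggingFaceTGIChatGenerator in the future.
- HuggingFaceAPIDocumentEmbedder, which will replace HuggingFaceTEIDocumentEmbedder in the future.
- HuggingFaceAPIGenerator, which will replace HuggingFaceTGIGenerator in the future.
- HuggingFaceAPITextEmbedder, which will replace HuggingFaceTEITextEmbedder in the future.
- These components support different Hugging Face APIs:
  - free Serverless Inference API
  - paid Inference Endpoints
  - self-hosted Text Generation Inference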
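
The diversity ranking mentioned in the features above can be sketched with a simple greedy strategy: start with the document closest to the query, then repeatedly pick the document that is least similar, on average, to those already selected. This is one plausible reading of "maximize overall diversity", written in plain Python with hand-made vectors in place of sentence-transformer embeddings; it is not presented as the component's exact algorithm, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def greedy_diversity_order(query: Sequence[float], docs: List[Sequence[float]]) -> List[int]:
    """Return document indices: most query-similar first, then least similar to the picks so far."""
    if not docs:
        return []
    remaining = list(range(len(docs)))
    first = max(remaining, key=lambda i: cosine(query, docs[i]))
    selected = [first]
    remaining.remove(first)
    while remaining:
        def avg_similarity(i: int) -> float:
            return sum(cosine(docs[i], docs[j]) for j in selected) / len(selected)

        next_index = min(remaining, key=avg_similarity)
        selected.append(next_index)
        remaining.remove(next_index)
    return selected


# Toy embeddings: documents 0 and 1 are near-duplicates, document 2 points elsewhere.
query_embedding = [1.0, 0.0]
doc_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
order = greedy_diversity_order(query_embedding, doc_embeddings)
print(order)
```

With these toy vectors the near-duplicate of the first pick is pushed to the end, so the top of the ranking covers more distinct content.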

⚡️ Enhancement Notes

Added compatibility with huggingface_hub>=0.22.0 for the HuggingFaceTGIGenerator and HuggingFaceTGIChatGenerator components.
Added truncate and normalize parameters to HuggingFaceTEITextEmbedder and HuggingFaceTEIDocumentEmbedder to allow truncation and normalization of embeddings.
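Added a trust_remote_code parameter to SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder to allow the use of custom models and scripts.
Added a streaming_callback parameter to HuggingFaceLocalGenerator, allowing users to handle streaming responses.
Added a ZeroShotTextRouter that uses an NLI model from HuggingFace to classify texts against a set of provided labels and routes each text according to the label it was assigned.
Added a dimensions parameter to the Azure OpenAI Embedders (AzureOpenAITextEmbedder and AzureOpenAIDocumentEmbedder) to fully support new embedding models such as text-embedding-3-small and text-embedding-3-large, as well as upcoming ones.
DocumentSplitter now adds a page_number field to the metadata of every output document to keep track of the page of the original document it belongs to.
Users can now customize text extraction from PDF files. This is particularly useful for PDFs with unusual layouts, such as multiple text columns. For instance, users can configure the object to retain the reading order.
Enhanced PromptBuilder to specify and enforce required variables in prompt templates.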
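
The idea of enforcing required template variables can be sketched as follows. This is not PromptBuilder itself: it uses string.Template from the standard library as a stand-in for the real templating engine, and the render_prompt helper and its arguments are hypothetical.

```python
from string import Template
from typing import Any, Iterable


def render_prompt(template: str, required_variables: Iterable[str], **variables: Any) -> str:
    """Render the template, failing early if a required variable was not provided."""
    missing = sorted(name for name in required_variables if name not in variables)
    if missing:
        raise ValueError(f"Missing required prompt variables: {', '.join(missing)}")
    # In this stand-in, optional placeholders without a value are left untouched.
    return Template(template).safe_substitute(**variables)


template = "Context: $documents\nQuestion: $question\nAnswer:"

prompt = render_prompt(
    template,
    ["question"],
    question="What does Rhodes Statue look like?",
    documents="The statue stood near the harbor entrance.",
)
print(prompt)

try:
    render_prompt(template, ["question"], documents="The statue stood near the harbor entrance.")
except ValueError as error:
    print(error)
```

Failing at render time keeps a prompt with a silently empty question from ever reaching the model.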
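Set the default value of max_new_tokens to 512 in the HuggingFace generators.
Enhanced AzureOCRDocumentConverter with advanced handling of tables and text. New features include extracting the context that precedes and follows a table, merging multiple column headers, and enabling a single-column page layout for text. This update improves the flexibility and accuracy of document conversion for complex layouts.
Enhanced DynamicChatPromptBuilder so that all user and system messages can be templated with the provided variables. This makes templating more versatile and dynamic, and chat prompt generation easier to tailor to the user's needs.
Improved HTML content extraction by trying multiple extractors in order of priority until one succeeds. HTMLToDocument has an additional try_others parameter (True by default) that determines whether subsequent extractors are used after a failure. This enhancement reduces extraction failures and makes content retrieval more reliable.
Enhanced FileTypeRouter with regular expression support for MIME types. This allows finer control over routing files by MIME type, covering both broad categories and specific MIME type patterns. It is particularly useful for applications that need sophisticated file classification and routing logic.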
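
Routing by MIME type pattern can be illustrated with the standard re module. The sketch below is not FileTypeRouter: the route_by_mime function, the "unclassified" bucket name and the sample files are assumptions, and the MIME types are given directly instead of being guessed from the file names.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def route_by_mime(files: Iterable[Tuple[str, str]], patterns: List[str]) -> Dict[str, List[str]]:
    """Group file names under the first pattern that fully matches their MIME type."""
    compiled = [(pattern, re.compile(pattern)) for pattern in patterns]
    routed: Dict[str, List[str]] = defaultdict(list)
    for name, mime_type in files:
        for pattern, regex in compiled:
            if regex.fullmatch(mime_type):
                routed[pattern].append(name)
                break
        else:
            # No pattern matched this MIME type.
            routed["unclassified"].append(name)
    return dict(routed)


files = [
    ("notes.txt", "text/plain"),
    ("page.html", "text/html"),
    ("paper.pdf", "application/pdf"),
    ("photo.png", "image/png"),
]

# One broad pattern for every text type, one exact MIME type for PDFs.
routed = route_by_mime(files, [r"text/.*", r"application/pdf"])
print(routed)
```

A single pattern such as text/.* covers a whole family of types, while an exact string still behaves like a plain MIME type match.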
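In Jupyter notebooks, the image of a Pipeline is no longer displayed automatically. The textual representation of the Pipeline is displayed instead. To display the Pipeline image, use the show method of the Pipeline object.
Added support for callbacks during pipeline deserialization. Currently a pre-init hook is supported, which can be used to inspect and modify the initialization parameters before a component's __init__ method is invoked.
pipeline.run() accepts a set of component names whose intermediate outputs are returned in the final pipeline output dictionary.
Refactored PyPDFToDocument to simplify support for custom PDF converters. PDF converters are classes that implement the PyPDFConverter protocol and have 3 methods: convert, to_dict and from_dict.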
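
The three-method converter shape described in the last note can be sketched with typing.Protocol. Everything here is a hypothetical stand-in: ConverterProtocol is not the real PyPDFConverter, and convert takes a list of page strings and returns a string, whereas a real converter would work with PDF reader and document objects.

```python
from typing import Any, Dict, List, Protocol, runtime_checkable


@runtime_checkable
class ConverterProtocol(Protocol):
    """Hypothetical stand-in for a converter protocol with three methods."""

    def convert(self, pages: List[str]) -> str: ...

    def to_dict(self) -> Dict[str, Any]: ...

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ConverterProtocol": ...


class JoinPagesConverter:
    """Custom converter that joins page texts with a configurable separator."""

    def __init__(self, separator: str = "\f") -> None:
        self.separator = separator

    def convert(self, pages: List[str]) -> str:
        return self.separator.join(pages)

    def to_dict(self) -> Dict[str, Any]:
        # Serialize only what is needed to rebuild the converter.
        return {"type": "JoinPagesConverter", "init_parameters": {"separator": self.separator}}

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "JoinPagesConverter":
        return cls(**data["init_parameters"])


converter = JoinPagesConverter(separator="\n\n")
restored = JoinPagesConverter.from_dict(converter.to_dict())
text = restored.convert(["Page one", "Page two"])
print(isinstance(restored, ConverterProtocol))
print(text)
```

Because the protocol is structural, a custom converter does not need to inherit from anything; it only has to provide the three methods, and the to_dict and from_dict pair lets it survive pipeline serialization.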

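⚠️ Deprecation Notes

Deprecated HuggingFaceTGIChatGenerator; it will be removed in Haystack 2.3.0. Use HuggingFaceAPIChatGenerator instead.
Deprecated HuggingFaceTEIDocumentEmbedder; it will be removed in Haystack 2.3.0. Use HuggingFaceAPIDocumentEmbedder instead.
Deprecated HuggingFaceTGIGenerator; it will be removed in Haystack 2.3.0. Use HuggingFaceAPIGenerator instead.
Deprecated HuggingFaceTEITextEmbedder; it will be removed in Haystack 2.3.0. Use HuggingFaceAPITextEmbedder instead.
Using the converter_name parameter in the PyPDFToDocument component is deprecated. It will be removed in version 2.3.0. Use the converter parameter instead.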

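🐛 Bug Fixes

Forward declared the AnalyzeResult type in AzureOCRDocumentConverter. AnalyzeResult is already imported in a lazy import block. The forward declaration avoids issues when azure-ai-formrecognizer>=3.2.0b2 is not installed.
Fixed a bug in MetaFieldRanker: when the weight parameter was set to 0 in the run method, the component incorrectly used the default value set in the __init__ method.
Fixed the Pipeline.run() logic so that components whose inputs all have default values run in the correct order.
Fixed a bug that could cause a Pipeline to get stuck in an infinite loop while running.
Fixed HuggingFaceTEITextEmbedder returning an embedding of the wrong shape when used with a Text-Embedding-Inference endpoint deployed using Docker.
Added the @component decorator to HuggingFaceTGIChatGenerator. The missing decorator made it impossible to use HuggingFaceTGIChatGenerator in a pipeline.
Updated the SearchApiWebSearch component to use the new search format and to let users specify the search engine through the engine parameter in search_params. The default search engine is Google, which makes it easier for users to tailor their web searches.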