RAG Evaluation with Prometheus 2

Prometheus 2 is a state-of-the-art open-source model trained specifically for evaluation. Learn how it works and how to use it to evaluate RAG pipelines.

June 17, 2024

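Evaluation plays an important role when building real-world applications based on language models, such as RAG. Recently, evaluating generated answers with powerful proprietary language models such as GPT-4 has become popular and correlates well with human judgment, but it comes with its own limitations and challenges.

Prometheus 2 is a newly released family of open-source models trained specifically to evaluate the output of other language models. In this post (and in the accompanying notebook), we will see how to use Prometheus and experiment with it to evaluate the generated responses of a RAG pipeline in Haystack.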

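Language Models as Evaluators

As language models (LMs) have shown strong general capabilities across a wide range of tasks, using other generative LMs to evaluate the answers these models produce has become a common and effective approach. Compared with statistics-based evaluation, this technique is convenient because it usually does not require ground-truth labels.

Proprietary models such as GPT-4 or Claude 3 Opus are often chosen as evaluators and have shown good correlation with human judgment. However, relying on closed models has several limitations:

- Data privacy: your data leaves your machine and is transmitted to the model provider.
- Transparency: the training data of these models is unknown.
- Controllability: because these models are accessed through an API, their behavior can change over time.
- Price: although prices keep falling, these large models are still expensive. In addition, evaluation usually involves several cycles of testing and refinement, which can significantly increase the overall cost.

On the other hand, using open models for evaluation is an active research area, but their practical use is often limited: they typically do not correlate well with human judgment and lack flexibility (see the Prometheus 2 paper for more details).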

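🔥 Prometheus 2: a strong open-source model for evaluation

Prometheus 2 is a new family of open-source models designed to bridge the gap between proprietary models and open LMs for evaluation.

The authors unified two different evaluation paradigms: direct assessment (evaluating the quality of a single model-generated answer against specific criteria) and pairwise ranking (choosing the better of two answers, usually produced by different models).

Specifically, for each variant, they started from a MistralAI base model, trained two different models on open-source datasets (one per task), and then merged their weights to create a robust evaluator language model.

The results are impressive:

- Two variants: 7B and 8x7B, fine-tuned from Mistral-7B and Mixtral-8x7B respectively.
- High correlation with human evaluations and with proprietary models.
- The models are highly flexible: they can perform both direct assessment and pairwise ranking, and they allow custom evaluation criteria.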

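Prompting Prometheus 2

The prompt template

Prometheus 2 models are generative language models trained to perform evaluation. To get the best results from them, we need to follow a precise yet customizable prompt structure. You can find the prompt templates in the paper and on GitHub.

Since we want to use Prometheus 2 to evaluate a single RAG system, we are mostly interested in the direct assessment prompt template, which evaluates the quality of an answer against specific criteria. The template below includes a reference answer; the linked resources also contain a version without one.

Let's take a look at it.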

You are a fair judge assistant tasked with providing clear, objective feedback 
based on specific criteria, ensuring each assessment reflects the absolute 
standards set for performance.

###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a 
reference answer that gets a score of 5, and a score rubric representing a 
evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly 
based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. 
You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for 
criteria) [RESULT] (an integer number between 1 and 5)"
4. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
{instruction}

###Response to evaluate:
{response}

###Reference Answer (Score 5):
{reference_answer}

###Score Rubrics:
{score_rubric}

###Feedback:

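In this prompt template, the only parts that need to be customized are those enclosed in curly braces.

We should provide:

- the instruction to evaluate, which may include an input (for example, the user question when evaluating a RAG pipeline)
- the LLM response to evaluate
- the reference answer: a perfect answer that would receive a score of 5 according to the score rubric
- a score rubric with scores from 1 to 5, describing precisely when a response deserves each score.

When given such a prompt, the model generates two outputs: detailed feedback and a score from 1 to 5.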
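To make the filling step concrete, here is a minimal sketch that fills the four placeholders with plain Python string formatting. The names `TEMPLATE_TAIL` and `build_prompt`, and the idea of passing the rubric as a dictionary, are assumptions made for illustration; they are not part of Prometheus 2 or Haystack. The fixed task description from the template is left out to keep the sketch short.

```python
from typing import Dict

# The customizable sections of the direct-assessment template.
# The fixed system message and task description are omitted for brevity.
TEMPLATE_TAIL = """###The instruction to evaluate:
{instruction}

###Response to evaluate:
{response}

###Reference Answer (Score 5):
{reference_answer}

###Score Rubrics:
{score_rubric}

###Feedback: """


def build_prompt(instruction: str, response: str, reference_answer: str,
                 rubric: Dict[int, str]) -> str:
    """Fill the placeholders; `rubric` maps each score (1-5) to its description."""
    if sorted(rubric) != [1, 2, 3, 4, 5]:
        raise ValueError("the rubric must describe every score from 1 to 5")
    # Render the rubric as one "Score N: ..." line per score.
    score_rubric = "\n".join(f"Score {s}: {rubric[s]}" for s in range(1, 6))
    return TEMPLATE_TAIL.format(
        instruction=instruction,
        response=response,
        reference_answer=reference_answer,
        score_rubric=score_rubric,
    )


# Illustrative inputs only; the rubric wording here is a placeholder.
prompt = build_prompt(
    instruction="Answer the user question: What is the capital of Italy?",
    response="Rome.",
    reference_answer="The capital of Italy is Rome.",
    rubric={
        1: "The answer is wrong or unrelated.",
        2: "The answer is related but mostly wrong.",
        3: "The answer is partly correct.",
        4: "The answer is correct but unclear or verbose.",
        5: "The answer is correct and clear.",
    },
)
print(prompt)
```

Checking that the rubric covers all five scores up front is a design choice of this sketch: a rubric with a missing level would still produce a syntactically valid prompt, so the mistake would otherwise go unnoticed.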

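An example

Suppose we want to evaluate the correctness of a generated answer. In this case we have a ground-truth answer, although this is not mandatory.

Question: "Who won the 2022 World Cup?"
Generated answer: "Argentina won the 2022 FIFA World Cup. France won the 2018 FIFA World Cup."
Reference answer: "Argentina won the 2022 FIFA World Cup."

In this case, our prompt might look like this: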

[...]

###The instruction to evaluate:
Your task is to evaluate the generated answer against the reference answer for 
the question: Who won the 2022 World Cup?

###Response to evaluate:
Argentina won the 2022 FIFA World Cup. France won the 2018 FIFA World Cup.

###Reference Answer (Score 5):
Argentina won the 2022 FIFA World Cup.

###Score Rubrics:
Score 1: The answer is not relevant to the question and does not align with the reference answer.
Score 2: The answer is relevant to the question but deviates significantly from the reference answer.
Score 3: The answer is relevant to the question and generally aligns with the reference answer but has errors or omissions.
Score 4: The answer is relevant to the question and closely matches the reference answer but is less concise or clear.
Score 5: The answer is highly relevant, fully accurate, and matches the reference answer in both content and clarity.

###Feedback:

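Calling the model with this prompt should return a result similar to the following:

Both the generated response and the reference answer identify Argentina as the winner of the 2022 FIFA World Cup. This shows that the generated answer is accurate and relevant to the user's query about who won the 2022 World Cup. However, the inclusion of an unrelated fact about France winning the 2018 World Cup does not improve the relevance or accuracy of the response, making it somewhat less concise than the reference answer. While the information about France is not incorrect, it does not pertain to the original query about the 2022 World Cup. So, although generally correct, the generated response contains an additional detail that deviates from the conciseness of the reference answer and could be considered slightly less effective. Therefore, based on the score rubric, which emphasizes relevance, conciseness and accuracy, the generated response is not as well aligned with the reference answer as it could be. So the score is 4. [RESULT] 4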
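Because the output follows the "feedback, then [RESULT], then score" format, it can be split programmatically. The following is a minimal sketch of such a parser; the function name `parse_evaluation` and the choice to return `None` for a missing or out-of-range score are assumptions for illustration, not something prescribed by Prometheus 2.

```python
import re
from typing import Optional, Tuple

# "[RESULT]" followed by a single digit between 1 and 5.
RESULT_PATTERN = re.compile(r"\[RESULT\]\s*([1-5])\b")


def parse_evaluation(output: str) -> Tuple[str, Optional[int]]:
    """Split raw model output into (feedback, score).

    The score is None when no valid "[RESULT] <1-5>" marker is found.
    """
    match = RESULT_PATTERN.search(output)
    if match is None:
        return output.strip(), None
    feedback = output[: match.start()].strip()
    # Drop the optional "Feedback:" prefix the model may echo back.
    if feedback.startswith("Feedback:"):
        feedback = feedback[len("Feedback:"):].strip()
    return feedback, int(match.group(1))


feedback, score = parse_evaluation(
    "The answer is correct but less concise. So the score is 4. [RESULT] 4"
)
print(score)
print(feedback)
```

Generative models do not always respect the requested format, so returning `None` instead of raising lets the caller decide whether to retry, skip, or flag the example.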

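Experimenting with Prometheus 2 in Haystack

Haystack is an orchestration framework for building and evaluating LLM-based applications. It ships with its own set of evaluators and integrates with several evaluation libraries. Haystack's functionality is easy to extend by creating custom components, so we can try to integrate Prometheus 2.

Here is the plan:

- build and run an indexing pipeline
- build and run a RAG pipeline to evaluate
- implement a PrometheusLLMEvaluator component
- create different PrometheusLLMEvaluators
- build and run an evaluation pipeline with the different PrometheusLLMEvaluators

In this blog post we summarize these steps; you can find the full implementation in the accompanying notebook.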

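Indexing pipeline

Before running a RAG pipeline, we need to index some data.

We will use a labeled PubMed dataset that contains questions, contexts, and answers. This lets us use the contexts as documents and provides the labeled data required by some of the evaluation metrics we will define.

For simplicity, we will use the InMemoryDocumentStore. Our indexing pipeline will include a DocumentEmbedder (embedding model: sentence-transformers/all-MiniLM-L6-v2) and a DocumentWriter.

For the full code to build and run the indexing pipeline, see the accompanying notebook.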

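RAG pipeline

Now that our data is ready, we can create a simple RAG pipeline.

Our RAG pipeline will include:

- InMemoryEmbeddingRetriever, to retrieve the documents relevant to the query (based on the same embedding model as before)
- PromptBuilder, to create the prompt dynamically
- HuggingFaceLocalGenerator with google/gemma-1.1-2b-it, to generate answers to the queries. This is a small model; later we will evaluate the quality of its generated responses against custom criteria.
- AnswerBuilder

Let's run our RAG pipeline with a set of questions and save the data needed for evaluation: the questions, the ground-truth answers, and the generated answers.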

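Implementing the PrometheusLLMEvaluator component

To perform evaluation, we create a custom Haystack evaluator component based on Prometheus 2.

This component lets you build a variety of evaluators.

You can find the implementation in the accompanying notebook. Here is a high-level overview of the component.

init parameters
- template: a curly-brace template that follows the Prometheus 2 prompt structure, with placeholders for the input data we want to pass at runtime (e.g. question, generated_answer, ground_truth_answer)
- inputs: a list of tuples in the format (input_name, input_type). These are the inputs the evaluator expects and uses for evaluation. They should match the placeholders defined in the template.
- generator: (hacky) allows passing different types of Haystack generators to use Prometheus 2 models, for example HuggingFaceLocalGenerator, LlamaCPPGenerator, and so on.
run method: for each example to evaluate, the inputs are validated, integrated into the prompt, and passed to the model. The model output is parsed to extract the score and the feedback. The method returns a dictionary containing an aggregated score, individual_scores, and feedbacks.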
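The flow of the run method can be sketched in plain Python. This is not the notebook's PrometheusLLMEvaluator and it does not use Haystack at all: `SketchEvaluator`, its `generate` callable, the literal `{{name}}` replacement, and the use of the mean as the aggregate are all assumptions made so that the example is self-contained.

```python
import re
from statistics import mean
from typing import Any, Callable, Dict, List, Optional, Tuple


class SketchEvaluator:
    """Illustrative stand-in, NOT the notebook's PrometheusLLMEvaluator."""

    def __init__(self, template: str, inputs: List[Tuple[str, Any]],
                 generate: Callable[[str], str]):
        self.template = template
        self.input_names = [name for name, _type in inputs]
        # Assumption: `generate` maps a prompt string to the model's raw text.
        self.generate = generate

    def run(self, **inputs: List[Any]) -> Dict[str, Any]:
        # 1. Validate: every declared input is present, all with equal length.
        missing = [n for n in self.input_names if n not in inputs]
        if missing:
            raise ValueError(f"missing inputs: {missing}")
        columns = [inputs[n] for n in self.input_names]
        if len({len(c) for c in columns}) > 1:
            raise ValueError("all inputs must contain the same number of examples")

        scores: List[Optional[int]] = []
        feedbacks: List[str] = []
        for values in zip(*columns):
            # 2. Fill the {{name}} placeholders for this example.
            prompt = self.template
            for name, value in zip(self.input_names, values):
                prompt = prompt.replace("{{" + name + "}}", str(value))
            # 3. Call the model and parse "<feedback> [RESULT] <score>".
            output = self.generate(prompt)
            match = re.search(r"\[RESULT\]\s*([1-5])", output)
            scores.append(int(match.group(1)) if match else None)
            feedbacks.append(output[: match.start()].strip() if match else output.strip())

        # 4. Aggregate, ignoring examples whose score could not be parsed.
        valid = [s for s in scores if s is not None]
        return {
            "score": mean(valid) if valid else None,
            "individual_scores": scores,
            "feedbacks": feedbacks,
        }


def fake_generate(prompt: str) -> str:
    # Canned reply so the sketch runs without loading a model.
    return "Relevant but slightly verbose. [RESULT] 4"


evaluator = SketchEvaluator(
    template="Question: {{query}}\nAnswer: {{generated_answer}}\n###Feedback:",
    inputs=[("query", List[str]), ("generated_answer", List[str])],
    generate=fake_generate,
)
result = evaluator.run(query=["q1", "q2"], generated_answer=["a1", "a2"])
print(result)
```

Passing the model call in as a callable mirrors the idea behind the generator parameter: the evaluation logic stays the same whichever backend produces the text.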

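Creating different evaluators

Let's see how to use the PrometheusLLMEvaluator.

We start by creating a Correctness evaluator, similar to the example above.

First, we initialize a generator to load the Prometheus 2 model; in particular, we are using the small variant (7B).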

from haystack.components.generators import HuggingFaceLocalGenerator

generator = HuggingFaceLocalGenerator(
    model="prometheus-eval/prometheus-7b-v2.0",
    task="text2text-generation",
    ...
)
generator.warm_up()

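In this example we use HuggingFaceLocalGenerator, which can run on the free GPU provided by Colab, but there are several other options depending on your environment: LlamaCPPGenerator (for resource-constrained environments, even without a GPU); TGI (via HuggingFaceAPIGenerator) and vLLM (for production environments with GPU resources).

Next, let's prepare the prompt template for the Correctness evaluator. Note that we are inserting placeholders for query, generated_answer, and reference_answer. These fields will be filled in dynamically from the RAG results and the ground-truth answers.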

correctness_prompt_template = """
...
###The instruction to evaluate:
Your task is to evaluate the generated answer against the reference answer for the question: {{query}}

###Response to evaluate:
generated answer: {{generated_answer}}

###Reference Answer (Score 5): {{reference_answer}}

###Score Rubrics:
Score 1: The answer is not relevant to the question and does not align with the reference answer.
Score 2: The answer is relevant to the question but deviates significantly from the reference answer.
Score 3: The answer is relevant to the question and generally aligns with the reference answer but has errors or omissions.
Score 4: The answer is relevant to the question and closely matches the reference answer but is less concise or clear.
Score 5: The answer is highly relevant, fully accurate, and matches the reference answer in both content and clarity.

###Feedback:"""

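Finally, let's initialize our evaluator, specifying which inputs it should expect at runtime (they should match the placeholders of the prompt template above).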

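from typing import List

# PrometheusLLMEvaluator is the custom component implemented in the accompanying notebook.
correctness_evaluator = PrometheusLLMEvaluator(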
    template=correctness_prompt_template,
    generator=generator,
    inputs=[
        ("query", List[str]),
        ("generated_answer", List[str]),
        ("reference_answer", List[str]),
    ],
)

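Similarly, we can create other evaluators:

- Response Relevance: evaluates how relevant the generated answer is to the user question.
- Logical Robustness: evaluates the logical organization and progression of the response.

These evaluators do not require ground-truth labels. For details on the prompt templates and the required inputs, see the accompanying notebook.

Evaluation pipeline

Now we can put the evaluators into a pipeline and run it to see how our small model performs.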

from haystack import Pipeline

eval_pipeline = Pipeline()
eval_pipeline.add_component("correctness_evaluator", correctness_evaluator)
eval_pipeline.add_component("response_relevance_evaluator", response_relevance_evaluator)
eval_pipeline.add_component("logical_robustness_evaluator", logical_robustness_evaluator)

eval_results = eval_pipeline.run(
    {
        "correctness_evaluator": {
            "query": questions,
            "generated_answer": rag_answers,
            "reference_answer": ground_truth_answers,
        },
        "response_relevance_evaluator": {
            "query": questions,
            "generated_answer": rag_answers,
        },
        "logical_robustness_evaluator": {
            "query": questions,
            "generated_answer": rag_answers,
        },
    }
)

After running the evaluation pipeline, we can also create a full evaluation report. Haystack provides an EvaluationRunResult, which we can use to display a score_report.

from haystack.evaluation.eval_run_result import EvaluationRunResult

inputs = {
    "question": questions,
    "answer": ground_truth_answers,
    "predicted_answer": rag_answers,
}

evaluation_result = EvaluationRunResult(run_name="pubmed_rag_pipeline", inputs=inputs, results=eval_results)
evaluation_result.score_report()

In our experiment (a small sample of 10 examples), we get the following results:

Evaluation	Score
correctness_evaluator	3.9
response_relevance_evaluator	4.3
logical_robustness_evaluator	3.5

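Gemma-1.1-2b-it appears to generate relevant answers, but the responses differ from the ground-truth answers and their logical organization is not optimal.

To inspect these results in more detail, we can convert evaluation_result to a Pandas DataFrame and look at the individual feedback from each evaluator for each example.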
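As a dependency-free alternative for a quick look, the per-example scores and feedback can also be flattened with plain Python. The sketch below assumes each evaluator's result is a dictionary with individual_scores and feedbacks, as described for the run method; the helper names `to_rows` and `lowest_scoring` are hypothetical, and the sample data is made up for illustration and does not come from the experiment.

```python
from typing import Any, Dict, List


def to_rows(questions: List[str],
            eval_results: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten {evaluator: {individual_scores, feedbacks}} into one row per example."""
    rows = []
    for evaluator_name, result in eval_results.items():
        triples = zip(questions, result["individual_scores"], result["feedbacks"])
        for question, score, feedback in triples:
            rows.append({
                "evaluator": evaluator_name,
                "question": question,
                "score": score,
                "feedback": feedback,
            })
    return rows


def lowest_scoring(rows: List[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Return the n rows with the lowest scores, skipping unparsed (None) scores."""
    scored = [r for r in rows if r["score"] is not None]
    return sorted(scored, key=lambda r: r["score"])[:n]


# Made-up placeholder data with the assumed structure.
sample_questions = ["question 1", "question 2"]
sample_results = {
    "correctness_evaluator": {
        "score": 3.5,
        "individual_scores": [5, 2],
        "feedbacks": ["Matches the reference.", "Contradicts the reference."],
    },
}

rows = to_rows(sample_questions, sample_results)
for row in lowest_scoring(rows, n=1):
    print(row["evaluator"], row["question"], row["score"], row["feedback"])
```

Sorting by score surfaces the weakest answers first, which is usually where reading the feedback is most informative.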

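Wrapping up

In this post, you learned about Prometheus 2: a new family of state-of-the-art open-source models for evaluation.

After introducing the models and how to prompt them, we put them to work in Haystack and created different evaluators to assess the quality of the responses generated by a RAG pipeline from several angles.

The results of our experiments are interesting and promising. However, before using these models in real-world applications, you should evaluate them on your specific use case. Moreover, in this fast-moving world, the day when general-purpose open-source models can be used effectively for evaluation may not be far off.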