RAG Evaluation with Prometheus 2


Last updated: September 19, 2024

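Evaluating the responses of language models and LLM-based applications usually relies on model-based metrics that do not require ground-truth labels. Large proprietary models such as GPT-4 and Claude 3 Opus are frequently used as evaluators and have shown good correlation with human evaluations.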

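However, relying on closed models raises several issues:

- Fairness: the training data of these models is unknown.
- Controllability: the behavior of these models can change unpredictably.
- Data privacy: sending data to an external provider may raise privacy concerns.
- Affordability: using these powerful models can be expensive.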

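Using open models for evaluation is an active research area, but their practical use is often limited: they typically correlate poorly with human judgments and lack flexibility.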

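🔥 Prometheus 2 is a new family of open-source models designed to address these gaps:

- Two variants, fine-tuned from Mistral-7B and Mixtral-8x7B
- Trained on open-source data
- High correlation with human evaluations and with proprietary models
- Highly flexible: capable of both direct assessment and pairwise ranking, with support for custom evaluation criteria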

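In this experimental notebook, we will use Prometheus 2 to evaluate the responses of a RAG pipeline.

First, we will build the RAG pipeline and collect some results. Then, we will write a custom Prometheus Evaluator component for Haystack. Finally, we will initialize three different evaluators and run them in an evaluation pipeline.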

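Create the RAG pipeline to evaluate

We want to use Prometheus 2 to evaluate RAG-generated answers, so we first need to build our RAG pipeline.

This part closely follows the "Evaluating RAG Pipelines" tutorial. Refer to that tutorial for more details.

If you prefer, you can simply read through this section without running it: we provide the generated data for the later evaluation steps.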

!pip install haystack-ai datasets sentence-transformers accelerate huggingface_hub bitsandbytes

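We will use a labeled PubMed dataset containing questions, contexts, and answers. This lets us use the contexts as documents and provides the labeled data required by some of the evaluation metrics we will define.

In this example, we use the first 100 rows.

First, let's fetch the dataset and extract all_documents, all_questions, and all_ground_truth_answers.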

from datasets import load_dataset
from haystack import Document

dataset = load_dataset("vblagoje/PubMedQA_instruction", split="train")
dataset = dataset.select(range(100))
all_documents = [Document(content=doc["context"]) for doc in dataset]
all_questions = [doc["instruction"] for doc in dataset]
all_ground_truth_answers = [doc["response"] for doc in dataset]

Indexing pipeline

Next, let's build a simple indexing pipeline and write the documents to the Document Store.

from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy

document_store = InMemoryDocumentStore()

document_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
document_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)

indexing = Pipeline()
indexing.add_component(instance=document_embedder, name="document_embedder")
indexing.add_component(instance=document_writer, name="document_writer")

indexing.connect("document_embedder.documents", "document_writer.documents")

indexing.run({"document_embedder": {"documents": all_documents}})

{'document_writer': {'documents_written': 100}}

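RAG pipeline

Now that our data is ready, we can create a simple RAG pipeline.

In this example, we use:

- InMemoryEmbeddingRetriever to retrieve the documents relevant to the query.
- HuggingFaceLocalGenerator with google/gemma-1.1-2b-it to generate answers to the queries. This is a small model; later we will evaluate the quality of its responses against custom criteria.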

import os
from getpass import getpass
from haystack import Pipeline
from haystack.components.builders import AnswerBuilder, PromptBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.generators import HuggingFaceLocalGenerator
from haystack.utils import ComponentDevice


# to access Gemma
# 1. you need a Hugging Face account
# 2. you have to accept Google conditions: https://hugging-face.cn/google/gemma-1.1-2b-it
# 3. copy your HF token (https://hugging-face.cn/settings/tokens) and paste it below
os.environ["HF_API_TOKEN"] = getpass("Your Hugging Face token")

generator = HuggingFaceLocalGenerator(
    "google/gemma-1.1-2b-it",
    huggingface_pipeline_kwargs={"device_map": "auto"},
    device=ComponentDevice.from_str("cuda:0"),
)

template = """
<bos><start_of_turn>user
You have to answer the following question based on the given context information only.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:<end_of_turn>
<start_of_turn>model"""

rag_pipeline = Pipeline()
rag_pipeline.add_component(
    "query_embedder",
    SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"),
)
rag_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store, top_k=3))
rag_pipeline.add_component("prompt_builder", PromptBuilder(template=template))
rag_pipeline.add_component("generator", generator)
rag_pipeline.add_component("answer_builder", AnswerBuilder())

rag_pipeline.connect("query_embedder", "retriever.query_embedding")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "generator")
rag_pipeline.connect("generator.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7b1b5c30bdc0>
🚅 Components
  - query_embedder: SentenceTransformersTextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: PromptBuilder
  - generator: HuggingFaceLocalGenerator
  - answer_builder: AnswerBuilder
🛤️ Connections
  - query_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - retriever.documents -> answer_builder.documents (List[Document])
  - prompt_builder.prompt -> generator.prompt (str)
  - generator.replies -> answer_builder.replies (List[str])

You can try out the RAG pipeline by asking a question:

question = "Do high levels of procalcitonin in the early phase after pediatric liver transplantation indicate poor postoperative outcome?"

response = rag_pipeline.run(
    {
        "query_embedder": {"text": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    }
)
print(response["answer_builder"]["answers"][0].data)

 **Yes.**

The study found that high levels of procalcitonin (PCT) on postoperative day (POD) 2 were associated with worse outcomes, including higher International Normalized Ratio values, primary graft non-function, longer stay in the pediatric intensive care unit, and mechanical ventilation.

response["answer_builder"]

{'answers': [GeneratedAnswer(data=' **Yes.**\n\nThe study found that high levels of procalcitonin (PCT) on postoperative day (POD) 2 were associated with worse outcomes, including higher International Normalized Ratio values, primary graft non-function, longer stay in the pediatric intensive care unit, and mechanical ventilation.', query='Do high levels of procalcitonin in the early phase after pediatric liver transplantation indicate poor postoperative outcome?', documents=[Document(id=9928bb3fd5bfd294a30717df6f590301c0f7c82f65fec5ff9ae7a00ac4956571, content: 'To date, no data is available about procalcitonin (PCT) levels and its relevance to morbidity and gr...', score: 0.7619273394960041), Document(id=2f1be411b8673646b72551e57af84872e39a788c3602c9b22af2ae901eda0da4, content: 'Intrahepatic cholestasis of pregnancy (ICP) is defined by pruritus, elevated total fasting serum bil...', score: 0.4159278001751194), Document(id=b112787486a85ff8086de3f2562d80497bc4cc76bc9d8cf9d3d5b3ee3b663975, content: 'Most hepatocellular carcinomas (HCCs) are associated with cirrhosis. Portal hypertension (PHT) and e...', score: 0.34273266043157447)], meta={})]}

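Run the RAG pipeline and save the results

Let's run the RAG pipeline on a set of questions and make sure we save the data needed for evaluation: the questions, the ground-truth answers, and the generated answers.

In this example, we use 10 random questions.
In the evaluation part we will not evaluate the retrieved context, so we do not save it either. You may still choose to take the context into account in your evaluation: as we will see later, evaluation with Prometheus is highly customizable.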

import random

questions, ground_truth_answers = zip(*random.sample(list(zip(all_questions, all_ground_truth_answers)), 10))

rag_answers = []

for question in list(questions):
    results = rag_pipeline.run(
        {
            "query_embedder": {"text": question},
            "prompt_builder": {"question": question},
            "answer_builder": {"query": question},
        }
    )

    rag_answers.append(results["answer_builder"]["answers"][0].data)

results = {
    "questions": questions,
    "ground_truth_answers": ground_truth_answers,
    "rag_answers": rag_answers,
}

import json

with open("gemma_2b_rag_results.json", "w") as fo:
    json.dump(results, fo)
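
One detail worth noting: `zip(*...)` returns tuples, and `json.dump` writes tuples as JSON arrays, so the data comes back as lists when the file is reloaded. The sketch below uses made-up placeholder strings rather than the PubMed data and shows the same paired sampling with a fixed seed. The seed and the helper name `sample_pairs` are assumptions added here for repeatability; the notebook samples without a seed.

```python
import json
import random

# Placeholder data standing in for all_questions / all_ground_truth_answers.
toy_questions = [f"question {i}" for i in range(20)]
toy_answers = [f"answer {i}" for i in range(20)]


def sample_pairs(questions, answers, k, seed=None):
    """Sample k (question, answer) pairs, keeping each pair aligned."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    pairs = rng.sample(list(zip(questions, answers)), k)
    sampled_questions, sampled_answers = zip(*pairs)
    return list(sampled_questions), list(sampled_answers)


qs, gts = sample_pairs(toy_questions, toy_answers, k=5, seed=42)

# Round-trip through JSON, as the notebook does with a file on disk.
restored = json.loads(json.dumps({"questions": qs, "ground_truth_answers": gts}))
print(restored["questions"])
```

Sampling the zipped pairs, rather than sampling questions and answers separately, is what keeps each question matched with its own ground-truth answer.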

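Evaluation with Prometheus 2

With the preparation done, we can use Prometheus 2 to evaluate the generated responses along several dimensions of interest.

The model expects a prompt like the one below and returns text containing feedback and a score.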

###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: (write a feedback for criteria) [RESULT] (an integer number between 1 and 5)\"
4. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
{orig_instruction}

###Response to evaluate:
{orig_response}

###Reference Answer (Score 5):
{orig_reference_answer}

###Score Rubrics:
[{orig_criteria}]
Score 1: {orig_score1_description}
Score 2: {orig_score2_description}
Score 3: {orig_score3_description}
Score 4: {orig_score4_description}
Score 5: {orig_score5_description}

###Feedback:
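
To make the placeholders concrete, here is a minimal sketch that fills the variable part of such a prompt with Python's `str.format`. The helper name `fill_prometheus_prompt`, the shortened skeleton (the task description is omitted), and the example values are illustrative assumptions; the notebook itself uses Haystack's `PromptBuilder` with Jinja-style `{{ }}` placeholders further down.

```python
from typing import Sequence

# Only the parts of the prompt that contain placeholders.
PROMPT_SKELETON = """###The instruction to evaluate:
{orig_instruction}

###Response to evaluate:
{orig_response}

###Reference Answer (Score 5):
{orig_reference_answer}

###Score Rubrics:
[{orig_criteria}]
Score 1: {orig_score1_description}
Score 2: {orig_score2_description}
Score 3: {orig_score3_description}
Score 4: {orig_score4_description}
Score 5: {orig_score5_description}

###Feedback:"""


def fill_prometheus_prompt(
    instruction: str,
    response: str,
    reference_answer: str,
    criteria: str,
    score_descriptions: Sequence[str],
) -> str:
    """Fill the skeleton; score_descriptions must list levels 1 to 5 in order."""
    if len(score_descriptions) != 5:
        raise ValueError("exactly five score descriptions are expected")
    score_fields = {
        f"orig_score{level}_description": text
        for level, text in enumerate(score_descriptions, start=1)
    }
    return PROMPT_SKELETON.format(
        orig_instruction=instruction,
        orig_response=response,
        orig_reference_answer=reference_answer,
        orig_criteria=criteria,
        **score_fields,
    )


# Made-up example values, for illustration only.
prompt = fill_prometheus_prompt(
    instruction="Is the sky blue on a clear day?",
    response="Yes, it usually looks blue.",
    reference_answer="Yes.",
    criteria="Is the response factually correct?",
    score_descriptions=[
        "Completely wrong.",
        "Mostly wrong.",
        "Partly correct.",
        "Mostly correct.",
        "Fully correct.",
    ],
)
print(prompt)
```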

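Create a Prometheus Evaluator component

To run the evaluation, we create a custom Haystack Evaluator component. Creating custom components in Haystack is straightforward, so a Prometheus Evaluator takes only a few lines of code.

Design choices

Our implementation is somewhat hacky and meant for experimentation, but a few of its design choices are worth explaining.

The component is inspired by our LLMEvaluator and extends it, with adjustments specific to Prometheus.
Init parameters:
- template: Prometheus is highly customizable, so we can easily create a variety of evaluators with different prompt templates.
- inputs: the inputs that the evaluator expects and evaluates. They should match those defined in the template.
- generator: (hacky) allows passing different types of Haystack generators to run the Prometheus model. Examples: HuggingFaceLocalGenerator, LlamaCppGenerator, and so on.
run method: for each example to evaluate, the inputs are inserted into the prompt and passed to the model; the model output is then parsed to extract the score and the feedback. The method returns a dictionary containing an aggregated score, individual_scores, and feedbacks.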

from typing import Any, Dict, List, Tuple, Type
from haystack import component
from haystack.components.evaluators import LLMEvaluator
from haystack.components.builders import PromptBuilder
from tqdm import tqdm
from numpy import mean as np_mean


ABS_SYSTEM_PROMPT = (
    "You are a fair judge assistant tasked with providing clear, objective feedback based on "
    "specific criteria, ensuring each assessment reflects the absolute standards set for performance."
)


@component
class PrometheusLLMEvaluator(LLMEvaluator):
    def __init__(
        self,
        generator,
        template: str,
        inputs: List[Tuple[str, Type[List]]],
        progress_bar: bool = True,
    ):
        outputs = ["feedback", "score"]
        self.validate_init_parameters(inputs, outputs, [])
        self.inputs = inputs
        self.outputs = outputs

        self._builder = PromptBuilder(template=template)
        self._generator = generator
        self.progress_bar = progress_bar

        component.set_input_types(self, **dict(inputs))

    def _parse_output(self, output):
        feedback, _, score_str = output.rpartition("[RESULT]")
        feedback = feedback.rpartition("###Feedback: [/INST]")[-1].strip()
        score_str = score_str.strip()

        score = None
        if score_str.isdigit() and score_str in ["1", "2", "3", "4", "5"]:
            score = int(score_str)
        return feedback, score

    @component.output_types(score=float, individual_scores=List[float], feedbacks=List[str])
    def run(self, **inputs) -> Dict[str, Any]:
        self.validate_input_parameters(dict(self.inputs), inputs)

        # inputs is a dictionary with keys being input names and values being a list of input values
        # We need to iterate through the lists in parallel for all keys of the dictionary
        input_names, values = inputs.keys(), list(zip(*inputs.values()))
        list_of_input_names_to_values = [dict(zip(input_names, v)) for v in values]

        individual_scores, feedbacks = [], []
        for input_names_to_values in tqdm(list_of_input_names_to_values, disable=not self.progress_bar):
            
            partial_prompt = self._builder.run(**input_names_to_values)["prompt"]
            prompt = f"[INST] {ABS_SYSTEM_PROMPT}\n{partial_prompt} [/INST]"
            
            output = self._generator.run(prompt=prompt)["replies"][0]

            feedback, individual_score = self._parse_output(output)
            if individual_score is not None:
                individual_scores.append(individual_score)
            feedbacks.append(feedback)
        score = np_mean(individual_scores)

        return {
            "score": score,
            "individual_scores": individual_scores,
            "feedbacks": feedbacks,
        }
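
The parsing step can be examined on its own. The following sketch reproduces the logic of `_parse_output` as a plain function and applies it to hand-written strings (not real model outputs), including a malformed one, to show when the score comes back as `None`. The `safe_mean` helper is an addition that is not part of the component: it returns `None` instead of `nan` when no score could be parsed.

```python
from statistics import mean
from typing import List, Optional, Tuple

VALID_SCORES = {"1", "2", "3", "4", "5"}


def parse_prometheus_output(output: str) -> Tuple[str, Optional[int]]:
    """Split a judge output into (feedback, score); score is None if unparsable."""
    feedback, _, score_str = output.rpartition("[RESULT]")
    # Drop the echoed prompt, if the generator returned it along with the reply.
    feedback = feedback.rpartition("###Feedback: [/INST]")[-1].strip()
    score_str = score_str.strip()
    score = int(score_str) if score_str in VALID_SCORES else None
    return feedback, score


def safe_mean(scores: List[Optional[int]]) -> Optional[float]:
    """Average the parsed scores, ignoring None; return None if nothing was parsed."""
    valid = [s for s in scores if s is not None]
    return mean(valid) if valid else None


# Hand-written examples, for illustration only.
well_formed = "[INST] ... ###Feedback: [/INST] The answer is clear. [RESULT] 4"
malformed = "The answer is clear. [RESULT] 4 out of 5"

print(parse_prometheus_output(well_formed))
print(parse_prometheus_output(malformed))
print(safe_mean([4, None, 2]))
```

Note that the component above skips unparsed scores when building `individual_scores` but always appends the feedback, so the two lists can end up with different lengths when an output is malformed.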

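Load the Prometheus 2 model

We will use prometheus-7b-v2.0, the smallest Prometheus 2 variant, which can run on a standard Colab notebook with 8-bit quantization.

Specifically, we use the model through HuggingFaceLocalGenerator, which is based on the Transformers library.

generation_kwargs simply replicate the parameters used in the prometheus-eval library. For real applications, it is worth experimenting to see whether other parameter combinations offer better evaluation quality and reproducibility.

As mentioned earlier, Haystack offers several other ways to run this open model:

- Resource-constrained environments: LlamaCppGenerator (thanks to the GGUF quantized format, it can run on CPU only; a commented-out example follows below).
- Production environments with GPU resources available: TGI (via HuggingFaceAPIGenerator) or vLLM.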

# if you have previously run the RAG pipeline, you will probably need to restart
# the kernel in order to free up GPU memory

from haystack.components.generators import HuggingFaceLocalGenerator

generator = HuggingFaceLocalGenerator(
    model="prometheus-eval/prometheus-7b-v2.0",
    task="text2text-generation",
    huggingface_pipeline_kwargs={
        "device_map": "auto",
        "model_kwargs": {"load_in_8bit": True},
    },
    generation_kwargs={
        "max_new_tokens": 512,
        "temperature": 1.0,
        "do_sample": True,
        "repetition_penalty": 1.03,
        "top_p": 0.9,
    },
)

generator.warm_up()

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.

The model 'MistralForCausalLM' is not supported for text2text-generation. Supported models are ['BartForConditionalGeneration', 'BigBirdPegasusForConditionalGeneration', 'BlenderbotForConditionalGeneration', 'BlenderbotSmallForConditionalGeneration', 'EncoderDecoderModel', 'FSMTForConditionalGeneration', 'GPTSanJapaneseForConditionalGeneration', 'LEDForConditionalGeneration', 'LongT5ForConditionalGeneration', 'M2M100ForConditionalGeneration', 'MarianMTModel', 'MBartForConditionalGeneration', 'MT5ForConditionalGeneration', 'MvpForConditionalGeneration', 'NllbMoeForConditionalGeneration', 'PegasusForConditionalGeneration', 'PegasusXForConditionalGeneration', 'PLBartForConditionalGeneration', 'ProphetNetForConditionalGeneration', 'SeamlessM4TForTextToText', 'SeamlessM4Tv2ForTextToText', 'SwitchTransformersForConditionalGeneration', 'T5ForConditionalGeneration', 'UMT5ForConditionalGeneration', 'XLMProphetNetForConditionalGeneration'].

# UNCOMMENT THE FOLLOWING LINES TO USE llama.cpp
# You can also choose a model with a different quantization: you will lose some quality in exchange for lower resource usage and faster inference

# ! pip install haystack-ai llama-cpp-haystack

# from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator
# from huggingface_hub import hf_hub_download

# prometheus_path = hf_hub_download(
#             repo_id="AlekseiPravdin/prometheus-7b-v2_0-gguf", filename="prometheus-7b-v2_0.q8_0.gguf", repo_type="model"
# )

# generator = LlamaCppGenerator(
#     model=prometheus_path,
#     n_ctx=8192,
#     n_batch=512,
#     generation_kwargs={"max_tokens": 512, "temperature": 1.0, "repeat_penalty": 1.03, "top_p": 0.9},  # llama.cpp samples by default, so no "do_sample" flag is passed
# )
# generator.warm_up()
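
Since the generation parameters above enable sampling (`do_sample=True`, `temperature=1.0`), the same example may receive different scores on different runs. One way to look at repeatability, sketched below under the assumption that you can afford to call the judge several times per example, is to collect repeated scores and summarize their median and spread. `judge` is a hypothetical callable standing in for an evaluator call; here it is replaced by a seeded random stub so that the snippet runs without a model.

```python
import random
from statistics import median
from typing import Callable, Dict, List


def score_stability(
    judge: Callable[[str], int], examples: List[str], n_runs: int = 5
) -> List[Dict]:
    """Score each example n_runs times and report the median and the max-min spread."""
    report = []
    for example in examples:
        scores = [judge(example) for _ in range(n_runs)]
        report.append(
            {
                "example": example,
                "scores": scores,
                "median": median(scores),
                "spread": max(scores) - min(scores),
            }
        )
    return report


# Stand-in for a sampled LLM judge: returns a noisy score between 3 and 5.
_rng = random.Random(0)


def stub_judge(example: str) -> int:
    return _rng.choice([3, 4, 4, 5])


report = score_stability(stub_judge, ["answer A", "answer B"], n_runs=5)
for row in report:
    print(row)
```

A large spread for many examples would suggest lowering the temperature or aggregating several runs; whether that improves agreement with human judgment is something to verify on your own data.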

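Initialize different Prometheus evaluators

We will define 3 prompt templates and the corresponding Prometheus evaluators:

- Correctness: evaluates the generated answer, considering both its relevance to the question and its similarity to the ground-truth answer.
- Response relevance: evaluates how relevant the generated answer is to the user question.
- Logical robustness: evaluates the logical organization and progression of the response.

As shown, customizing the prompt template makes it possible to create a wide variety of evaluators.

In general, the first part (the task description) should be left unchanged. The only aspect to adapt there, as the following examples show, is whether a reference answer is used.

⚠️ Although the names of these evaluators may resemble evaluation metrics used in Haystack or other libraries, it is important to understand that they are created specifically for Prometheus and produce scores between 1 and 5. They are not comparable with metrics that are conceptually similar but defined differently.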

correctness_prompt_template = """
###Task Description
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assesses the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (1 or 2 or 3 or 4 or 5)"
4. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
Your task is to evaluate the generated answer against the reference answer for the question: {{query}}

###Response to evaluate:
generated answer: {{generated_answer}}

###Reference Answer (Score 5): {{reference_answer}}

###Score Rubrics:
Score 1: The answer is not relevant to the question and does not align with the reference answer.
Score 2: The answer is relevant to the question but deviates significantly from the reference answer.
Score 3: The answer is relevant to the question and generally aligns with the reference answer but has errors or omissions.
Score 4: The answer is relevant to the question and closely matches the reference answer but is less concise or clear.
Score 5: The answer is highly relevant, fully accurate, and matches the reference answer in both content and clarity.

###Feedback:""".strip()

correctness_evaluator = PrometheusLLMEvaluator(
    template=correctness_prompt_template,
    generator=generator,
    inputs=[
        ("query", List[str]),
        ("generated_answer", List[str]),
        ("reference_answer", List[str]),
    ],
)



response_relevance_prompt_template = """
###Task Description
An instruction (might include an Input inside it), a response to evaluate, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (an integer number between 1 and 5)"
4. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
Your task is to evaluate whether the generated answer is relevant to the question: {{query}}

###Response to evaluate:
generated answer: {{generated_answer}}

###Score Rubrics:
Score 1: The generated answer is off-topic or irrelevant to the question asked.
Score 2: The generated answer includes some relevant information but often contains unrelated details.
Score 3: The generated answer is generally relevant to the question but occasionally includes extraneous or off-topic details.
Score 4: The generated answer is mostly relevant to the question, with minimal unrelated information.
Score 5: The generated answer is highly relevant to the question, addressing it directly and thoroughly without including unnecessary information.

###Feedback:""".strip()

response_relevance_evaluator = PrometheusLLMEvaluator(
    template=response_relevance_prompt_template,
    generator=generator,
    inputs=[("query", List[str]), ("generated_answer", List[str])],
)



logical_robustness_prompt_template = """
###Task Description
An instruction (might include an Input inside it), a response to evaluate, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (an integer number between 1 and 5)"
4. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
Your task is to evaluate how logically the generated answer for the question is organized, ensuring a clear progression of ideas and arguments that are easy to follow. question:{{query}}

###Response to evaluate:
generated answer: {{generated_answer}}

###Score Rubrics:
Score 1: Disorganized, lacks clear structure, and is difficult to follow.
Score 2: Some structure, but inconsistent and hard to follow due to abrupt transitions.
Score 3: Generally organized with minor flow issues and occasional unclear connections.
Score 4: Well-organized with clear and smooth transitions, easy to follow.
Score 5: Excellently organized with flawless logical flow and seamless transitions.

###Feedback:""".strip()

logical_robustness_evaluator = PrometheusLLMEvaluator(
    template=logical_robustness_prompt_template,
    generator=generator,
    inputs=[("query", List[str]), ("generated_answer", List[str])],
)
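
The three templates share the task description and differ only in the instruction, the rubric, and whether a reference answer is included. As an illustration, a small helper could assemble them. `build_prometheus_template` and its parameters are hypothetical names, not part of Haystack or prometheus-eval; the task description text is copied from the templates above, and the rubric in the usage example is made up.

```python
from typing import List


def build_prometheus_template(
    instruction: str, score_descriptions: List[str], use_reference: bool = False
) -> str:
    """Assemble a Jinja-style template with {{query}} and {{generated_answer}} slots."""
    if len(score_descriptions) != 5:
        raise ValueError("Prometheus rubrics use exactly five score levels")

    # The task description only changes in whether it mentions a reference answer.
    reference_clause = "a reference answer that gets a score of 5, " if use_reference else ""
    lines = [
        "###Task Description",
        "An instruction (might include an Input inside it), a response to evaluate, "
        + reference_clause
        + "and a score rubric representing a evaluation criteria are given.",
        "1. Write a detailed feedback that assess the quality of the response strictly "
        "based on the given score rubric, not evaluating in general.",
        "2. After writing a feedback, write a score that is an integer between 1 and 5. "
        "You should refer to the score rubric.",
        '3. The output format should look as follows: "Feedback: (write a feedback for '
        'criteria) [RESULT] (an integer number between 1 and 5)"',
        "4. Please do not generate any other opening, closing, and explanations.",
        "",
        "###The instruction to evaluate:",
        instruction + " {{query}}",
        "",
        "###Response to evaluate:",
        "generated answer: {{generated_answer}}",
        "",
    ]
    if use_reference:
        lines += ["###Reference Answer (Score 5): {{reference_answer}}", ""]
    lines.append("###Score Rubrics:")
    lines += [f"Score {level}: {text}" for level, text in enumerate(score_descriptions, start=1)]
    lines += ["", "###Feedback:"]
    return "\n".join(lines)


# Made-up rubric for a hypothetical "conciseness" evaluator.
conciseness_template = build_prometheus_template(
    instruction="Your task is to evaluate how concise the generated answer is for the question:",
    score_descriptions=[
        "Extremely verbose, mostly filler.",
        "Verbose, with substantial repetition.",
        "Acceptable length, with some redundancy.",
        "Concise, with minor redundancy.",
        "As short as possible without losing content.",
    ],
)
print(conciseness_template)
```

The resulting string could be passed as `template` to `PrometheusLLMEvaluator`, with `inputs` listing `query` and `generated_answer` (plus `reference_answer` when `use_reference=True`).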

Let's try the logical_robustness_evaluator:

query = [
    "Are group 2 innate lymphoid cells ( ILC2s ) increased in chronic rhinosinusitis with nasal polyps or eosinophilia?",
    "Does poor sleep predict symptoms of depression and disability retirement due to depression?",
]
generated_answer = [
    "As ILC2s are elevated in patients with CRSwNP, they may drive nasal polyp formation in CRS. ILC2s are also linked with high tissue and blood eosinophilia and have a potential role in the activation and survival of eosinophils during the Th2 immune response. The association of innate lymphoid cells in CRS provides insights into its pathogenesis.",
    "Lack of baseline diagnostic interviews; sleep quality based on self-report.",
]


res = logical_robustness_evaluator.run(query=query, generated_answer=generated_answer)

res

{'score': 3.0,
 'individual_scores': [5, 1],
 'feedbacks': ["The generated response is well-organized and presents a clear progression of ideas. It starts by establishing a link between ILC2s and CRSwNP, then describes the role of ILC2s in nasal polyps formation and eosinophilia. The response then draws a conclusion about the pathogenesis of CRS, which is a coherent and logical flow of information. Each sentence builds on the previous, ensuring that the reader is able to follow the argument without confusion. The response maintains a consistent structure and makes smooth transitions between the different points, making it easy to follow. The logical flow and seamless transitions indicate a high level of organization, which aligns well with the score rubric's criteria for a score of 5. Therefore, the response is of high quality in terms of logical organization.",
  'The response provided does not follow the logical structure expected as per the score rubric. There is a lack of clear organization and progression of ideas. The statement is abrupt and does not flow into a logical argument or question, making it difficult to follow the reasoning behind it. It fails to establish a connection between poor sleep, symptoms of depression, and disability retirement due to depression, which is the main focus of the question. The lack of a clear progression of ideas and arguments, and the absence of smooth transitions, makes it challenging to follow the response. Thus, the response fails to meet the criteria for a well-organized and logically flowing answer. Therefore, based on the score rubric, the response is disorganized and lacks a clear structure, making it difficult to follow. So the overall score is 1.']}

Not bad!

Evaluation pipeline

We can now add the evaluators to an evaluation pipeline and run it on our RAG results.

from haystack import Pipeline

eval_pipeline = Pipeline()
eval_pipeline.add_component("correctness_evaluator", correctness_evaluator)
eval_pipeline.add_component("response_relevance_evaluator", response_relevance_evaluator)
eval_pipeline.add_component("logical_robustness_evaluator", logical_robustness_evaluator)

Let's download the RAG results. If you have already run the RAG pipeline, you can skip the next cell.

# skip this cell if you have run the RAG pipeline before

!wget "https://raw.githubusercontent.com/deepset-ai/haystack-cookbook/main/data/prometheus2_evaluation/gemma_2b_rag_results.json"

import json

with open("gemma_2b_rag_results.json", "r") as fin:
    rag_results = json.load(fin)

questions = rag_results["questions"]
ground_truth_answers = rag_results["ground_truth_answers"]
rag_answers = rag_results["rag_answers"]

eval_results = eval_pipeline.run(
    {
        "correctness_evaluator": {
            "query": questions,
            "generated_answer": rag_answers,
            "reference_answer": ground_truth_answers,
        },
        "response_relevance_evaluator": {
            "query": questions,
            "generated_answer": rag_answers,
        },
        "logical_robustness_evaluator": {
            "query": questions,
            "generated_answer": rag_answers,
        },
    }
)
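
The pipeline returns one entry per evaluator, each holding the dictionary produced by that evaluator's `run` method (`score`, `individual_scores`, `feedbacks`). As a sketch of how such a structure could be inspected, the snippet below picks the lowest-scoring examples for each evaluator. The values in `toy_eval_results` are made up for illustration and are not results from this notebook; `lowest_scoring` is a hypothetical helper.

```python
from typing import Any, Dict, List, Tuple

# Made-up values in the shape returned by the evaluation pipeline.
toy_eval_results: Dict[str, Dict[str, Any]] = {
    "correctness_evaluator": {
        "score": 3.0,
        "individual_scores": [4, 2, 3],
        "feedbacks": ["feedback a", "feedback b", "feedback c"],
    },
    "response_relevance_evaluator": {
        "score": 4.0,
        "individual_scores": [5, 4, 3],
        "feedbacks": ["feedback d", "feedback e", "feedback f"],
    },
}


def lowest_scoring(
    eval_results: Dict[str, Dict[str, Any]], n: int = 1
) -> Dict[str, List[Tuple[int, int, str]]]:
    """For each evaluator, return the n lowest (index, score, feedback) triples."""
    worst = {}
    for name, result in eval_results.items():
        scores, feedbacks = result["individual_scores"], result["feedbacks"]
        if len(scores) != len(feedbacks):
            # The evaluator skips scores it cannot parse, so indices would not line up.
            raise ValueError(f"{name}: scores and feedbacks have different lengths")
        ranked = sorted(range(len(scores)), key=lambda i: scores[i])
        worst[name] = [(i, scores[i], feedbacks[i]) for i in ranked[:n]]
    return worst


worst = lowest_scoring(toy_eval_results)
for evaluator_name, rows in worst.items():
    print(evaluator_name, rows)
```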

Evaluation results

Once the evaluation pipeline has run, we can also build a full evaluation report. Haystack provides EvaluationRunResult, which we can use to display a score_report.

from haystack.evaluation.eval_run_result import EvaluationRunResult

inputs = {
    "question": questions,
    "answer": ground_truth_answers,
    "predicted_answer": rag_answers,
}

evaluation_result = EvaluationRunResult(run_name="pubmed_rag_pipeline", inputs=inputs, results=eval_results)
evaluation_result.score_report()

                              score
correctness_evaluator           3.9
response_relevance_evaluator    4.3
logical_robustness_evaluator    3.5

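Overall, on our small sample, Gemma-1.1-2b-it appears to generate relevant answers, but its responses differ from the ground-truth answers and their logical organization is not ideal.

Let's inspect the individual results in a dataframe.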

import pandas as pd

# do not truncate text
pd.set_option("display.max_colwidth", None)

results_df = evaluation_result.to_pandas()
results_df

                                                                                                                                             question  \
0                                                                    Is cDK1 and CDK2 activity a strong predictor of renal cell carcinoma recurrence?   
1           Does metabolic control analysis of the Trypanosoma cruzi peroxide detoxification pathway identify tryparedoxin as a suitable drug target?   
2                            Does promoter variant rs2301228 on the neural cell adhesion molecule 1 gene confer risk of schizophrenia in Han Chinese?   
3                             Does pancreatic polypeptide regulate glucagon release through PPYR1 receptors expressed in mouse and human alpha-cells?   
4                                   Does tetraploid complementation prove pluripotency of induced pluripotent stem cells derived from adipose tissue?   
5  Is osteoprotegerin associated with subclinical left ventricular systolic dysfunction in diabetic hypertensive patients : a speckle tracking study?   
6                                          Is cD30 expression a novel prognostic indicator in extranodal natural killer/T-cell lymphoma , nasal type?   
7                                                        Does mild cognitive dysfunction affect diabetes mellitus control in minority elderly adults?   
8                                                                  Do youth walking and biking rates vary by environments around 5 Louisiana schools?   
9                                        Are human enteroviruses the cause of neurological impairments in children at the Korle-Bu Teaching Hospital?   

                                                                                                                                                                                                                                                                                                                                                                            answer  \
0                                                                                                                                                                                                                                                                                               CDK1SA of tumors and the CDK2SA are both associated with recurrence and prognosis.   
1                                                                                              These quantitative kinetic and metabolic analyses pointed out to TXN as a convenient drug target due to its low catalytic efficiency, high control on the flux of peroxide detoxification and role as provider of reducing equivalents to the two main peroxidases in the parasite.   
2                                                                                                                                                                 Our results provide direct evidence for NCAM1 as a susceptibility gene for schizophrenia, which offers support to a neurodevelopmental model and neuronal connectivity hypothesis in the onset of schizophrenia.   
3                                                                                                                                                                                                       Glucose stimulates PP secretion and PP inhibits glucagon release in mouse pancreatic islets. PP receptors are present in alpha-cells of mouse and human pancreatic islets.   
4                                                                                                                                                                                          We also directed differentiation of iPS cells into chondrocytes, thus adipose-derived iPS cells can be used as models to study chondrogenic differentiation and cartilage regeneration.   
5                                                                                                                                                                                                                                                                           Plasma OPG values could predict subclinical LV systolic dysfunction in diabetic hypertensive patients.   
6                                 Our results showed that expression of CD30 was not related to response to treatment but was an independent prognostic factor for both OS and PFS in ENKTL, nasal type, which suggests a role for CD30 in the pathogenesis of this disease and may support the incorporation of anti-CD30-targeted therapy into the treatment paradigm for ENKTL.   
7  Cognitive dysfunction in the mild range did not seem to affect diabetes mellitus control parameters in this multiethnic cohort of older adults with diabetes mellitus, although it cannot be excluded that cognitive impairment was overcome through assistance from formal or informal caregivers. It is possible that more-severe cognitive dysfunction could affect control.   
8                                                                                                                                                                                                                                                       There is a role and a need for the SRTS program. The environment and other factors influence biking and walking to school.   
9                                                                                                                                           Non-polio enteroviruses and chikugunya virus were found not to be the etiologic agent responsible for the convulsion with neurologic morbidity observed in the Ghanaian children. Investigation for other viral agents is recommended.   

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         predicted_answer  \
0                                                                                                                                                                                                                                                                                                                                             **Yes, cDK1 and CDK2 activity is a strong predictor of renal cell carcinoma recurrence.**\n\nThe study found that tumors with low CDK2SA-CDK1SA ratio showed significantly better 5-year recurrence-free survival than those with high CDK2SA-CDK1SA ratio.   
1                                                                                                                                                                                                                                                                                                                                                                   **Yes.**\n\nMetabolic control analysis of the Trypanosoma cruzi peroxide detoxification pathway identified tryparedoxin as a suitable drug target, with 90-100% of the pathway flux controlled by TXN-TXNPx and TXN-GPxA redox pairs.   
2                                                                                                                                                                                                                                                                                                                                                                                                                                                                      **Yes.**\n\nThe study found that the promoter variant rs2301228 on the NCAM1 gene is associated with schizophrenia in Han Chinese.   
3                                                                                                                                                                                                                                                                                                                                                             **Yes.**\n\nThe context information indicates that PPYR1 receptors are expressed in the alpha-cell population of mouse and human islets, and that PP release increases upon glucose stimulation, leading to inhibition of glucagon release.   
4                                                                                                                                                                                                                                                                                                                                 **Yes.**\n\nThe study shows that tetraploid complementation successfully generated viable and fertile adult mice derived entirely from reprogrammed ASC, indicating that cell types other than fibroblasts can also be restored to the embryonic level of pluripotency.   
5   **Yes**, osteoprotegerin (OPG) is associated with subclinical left ventricular systolic dysfunction in diabetic hypertensive patients.\n\n**Results:**\n\n* Diabetic hypertensive patients had higher diastolic peak early/early diastolic tissue velocity and lower systolic tissue velocity, GLS, GLS rate systolic, and GLS rate early diastolic than nondiabetic hypertensive patients.\n* OPG was found to be an independent predictor of impaired GLS.\n* Receiver operating characteristic curve analysis revealed that OPG values of > 6.45 (pmol/L) identified the patients with GLS ≤ 18.5.   
6                                                                                                                                                                                                                                                                                                                        **Yes**, CD30 expression is a novel prognostic indicator in extranodal natural killer/T-cell lymphoma, nasal type.\n\nThe study found that CD30 positivity was associated with shorter 5-year OS and PFS in patients with extranodal natural killer/T-cell lymphoma, nasal type.   
7                                                                                                                                                                                                                                                                                                                                                                         The provided text does not contain any information regarding whether mild cognitive dysfunction affects diabetes mellitus control in minority elderly adults, so I am unable to answer this question from the provided context.   
8                                                                                                                                                                                                                                                                                                             The provided text indicates that there is a significant difference in the walking and biking rates between the 5 Louisiana schools. More students walked than biked to school, and the sites with the highest PEDS score had the highest percentage of students who walked/biked to school.   
9                                                                                                                                                                                                                                                                                                                                 The provided text suggests that enteroviruses were detected in cerebrospinal fluid (CSF) samples from children at the Korle-Bu Teaching Hospital, but further studies are needed to establish a causal relationship between enteroviruses and neurological impairments.   

   correctness_evaluator  response_relevance_evaluator  \
0                      4                             3   
1                      5                             5   
2                      3                             5   
3                      5                             4   
4                      4                             5   
5                      5                             5   
6                      5                             5   
7                      1                             1   
8                      3                             5   
9                      4                             5   

   logical_robustness_evaluator  
0                             2  
1                             4  
2                             1  
3                             5  
4                             5  
5                             5  
6                             4  
7                             1  
8                             4  
9                             4
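
The score table is easier to digest with a short summary. The sketch below is not part of the original notebook: it copies the ten displayed scores by hand into a plain dictionary, averages each evaluator, and flags rows where the evaluators disagree by three or more points. The helper names `summarize` and `disagreements` and the `min_spread` threshold are illustrative assumptions.

```python
from statistics import mean

# Scores copied by hand from the ten rows displayed above.
scores = {
    "correctness_evaluator": [4, 5, 3, 5, 4, 5, 5, 1, 3, 4],
    "response_relevance_evaluator": [3, 5, 5, 4, 5, 5, 5, 1, 5, 5],
    "logical_robustness_evaluator": [2, 4, 1, 5, 5, 5, 4, 1, 4, 4],
}


def summarize(scores):
    """Return the mean score per evaluator."""
    return {name: mean(values) for name, values in scores.items()}


def disagreements(scores, min_spread=3):
    """Return row indices where max and min score differ by at least min_spread."""
    rows = zip(*scores.values())
    return [i for i, row in enumerate(rows) if max(row) - min(row) >= min_spread]


averages = summarize(scores)
flagged = disagreements(scores)

for name, value in averages.items():
    print(f"{name}: {value:.1f}")
print("rows with large disagreement:", flagged)
```

For these ten rows, only row 2 is flagged: the answer is rated 5 for relevance but 1 for logical robustness, which makes it a good candidate for reading the feedback in detail.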

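Prometheus returns written feedback with every evaluation, so it is worth inspecting. Here is the feedback from the logical robustness evaluator: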

eval_results["logical_robustness_evaluator"]["feedbacks"]

['The generated answer, while accurate, does not exhibit a strong logical organization. It simply states the conclusion without a detailed explanation of the underlying data or the process that led to this conclusion. Furthermore, there are no transition phrases or linking sentences that would guide the reader from one point to the next, making it hard to follow the progression of ideas.\n\nDespite the absence of transition phrases or linking sentences, the answer maintains a certain degree of coherence, but this coherence could be greatly improved by providing more context or by elaborating on the reasons behind the observed relationship between cDK1 and CDK2 activity and renal cell carcinoma recurrence. For example, it could explain why a lower CDK2SA-CDK1SA ratio is associated with better survival outcomes.\n\nTherefore, although the response contains the necessary information, it lacks the clear progression of ideas and arguments that would make it easy to follow. In contrast, a response with excellent organization would include detailed explanations, smoothly transitioning from one point to the next, and a clear progression of ideas. The absence of these elements in the response means that it falls short of the expected standard of logical organization and flow. \n\nSo the overall score is 2.',
 "This response provides a concise answer to the question, effectively stating that metabolic control analysis identified tryparedoxin as a suitable drug target. It succinctly describes the pathway's regulation by the redox pairs TXN-TXNPx and TXN-GPxA, which demonstrates the clear flow of information and aligns well with the expected logical structure of the response.\n\nHowever, while this response is accurate and follows a logical progression, it lacks the detail found in more elaborate answers. For instance, it does not explicitly mention the percentage of pathway flux controlled by these redox pairs, which could have added more depth to the answer. Moreover, the explanation could be further refined to improve the clarity of the connections between the different components.\n\nDespite these minor drawbacks, the response maintains a well-organized structure and smooth transitions, making it easy to follow. The information is presented in a logical sequence, which helps to enhance the overall coherence of the answer.\n\nIn light of the criteria outlined in the score rubric, the response fulfills the expectations for a score of 4. It presents the information in a logical, coherent, and well-structured manner, although there is room for improvement in terms of detail and connection clarity.\n\nSo the overall score is 4.",
 'This response is disorganized and lacks clear structure. It does not provide any details or reasoning behind its claim. The transition from presenting the study to confirming the link between the promoter variant and schizophrenia is abrupt and lacks any logical flow. The reader is left without any explanation or understanding of how the study reached its conclusion, making it difficult to follow. This failure to elaborate or substantiate the claim results in a response that does not meet the required standards for logical organization. Thus, it can be concluded that this response falls short in fulfilling the criteria outlined in the score rubric.',
 "This response succinctly affirms the question, with a clear structure that logically follows from the contextual information provided. The connection between the increase in PP release and the subsequent inhibition of glucagon release is presented in a logical sequence that's easy to understand. There are no abrupt transitions or unclear connections in the response, ensuring a smooth flow from one point to another. This response effectively demonstrates a coherent and seamless logical progression of ideas. As per the scoring rubric, it shows that the answer is not only well-organized but also has clear and smooth transitions. Therefore, it adheres to the criteria of being easy to follow and exhibiting a flawless logical flow, hence it is awarded a score of 5.",
 'The response provided has shown an excellent logical flow, which aligns with the requirements of the score rubric. The answer directly addresses the question, presenting a clear and well-structured argument. It starts with an affirmation of the initial question, then elaborates on the process of tetraploid complementation, explaining the implications in terms of the pluripotency of the induced pluripotent stem cells. The transition from the premise to the conclusion is seamless, making it easy for the reader to follow the logic. There are no abrupt transitions or disorganized elements in the response, which further contributes to its overall clarity and coherence. So, the response fully meets the criteria of a score 5, as it is excellently organized with flawless logical flow and seamless transitions.',
 'The generated answer demonstrates an excellent logical flow and seamless transitions between the information provided, which aligns with the highest score of the rubric. It effectively establishes the connection between osteoprotegerin (OPG) and subclinical left ventricular systolic dysfunction in diabetic hypertensive patients. The response succinctly presents the results of the speckle tracking study and clearly defines how OPG is an independent predictor of impaired GLS. The conclusion drawn from the receiver operating characteristic curve analysis reinforces the connection between OPG levels and the identification of patients with GLS ≤ 18.5. The organization of the response is logical and clear, making it easy for readers to follow the line of reasoning from the introduction to the conclusion. Therefore, according to the score rubric, the response is well-structured and offers an in-depth and coherent understanding of the topic. So the overall score is 5.',
 'When evaluating the organization of the response, the primary concern is the clarity and smoothness of the progression of ideas. In this case, the answer is well-structured with a clear statement followed by supporting evidence from the study. The transition from stating the conclusion ("Yes, CD30 expression is a novel prognostic indicator") to presenting the study\'s findings is smooth and logical.\n\nHowever, there is room for improvement in terms of providing more context to the initial statement. By mentioning what the study found in relation to the 5-year OS and PFS, the answer could have provided a more thorough explanation that directly relates to the question. The connection between the initial statement and the supporting evidence is clear but could benefit from a more explicit explanation.\n\nDespite these minor areas for improvement, the response does a good job at presenting the argument in a coherent manner. Therefore, according to the score rubric, which emphasizes the clear progression of ideas and arguments, this response meets the requirements for a score of 4. The overall structure is sound, but a slightly more detailed presentation of the evidence would have elevated it to a perfect score.',
 "Upon reviewing the generated response, it is evident that there is a lack of content that directly addresses the posed question. The response fails to provide any argument or information related to the relationship between mild cognitive dysfunction and diabetes mellitus control in minority elderly adults. The text's structure is disorganized, as it merely states the inability to answer, without any attempt to explore the question or provide a logical flow of information. This makes it very difficult for the reader to follow or understand the content. Consequently, according to the score rubric, this response would be evaluated as having a score of 1, as it is disorganized, lacks clear structure, and is difficult to follow.",
 'This response presents a straightforward statement that walks through the central point in a linear fashion. The progression of ideas is logical and easy to follow, as it moves from indicating a difference in rates to specifying the relationship between these rates and the PEDS score. However, the response does not provide the depth of analysis that could have made the argument more robust. For example, it does not delve into why this significant difference might exist or consider any potential variables that could affect these rates. The logical flow and clarity of the response meet the requirements of a score of 4, but it falls short of achieving a score of 5 due to the absence of more detailed explanations or comparisons. Therefore, while the response is generally organized, it could benefit from further elaboration and a more comprehensive analysis of the data. So the overall score is 4.',
 "The generated response presents a clear and structured argument, aligning with the scoring rubric's criteria for a score of 4. The response successfully establishes the presence of enteroviruses in CSF samples, acknowledging the need for more research to definitively link these viruses to neurological impairments. The argument flows logically, from acknowledging the initial findings to suggesting the necessity of additional studies. This structure, along with the smooth transitions between ideas, facilitates easy comprehension, which is a critical aspect as per the score rubric. However, the response could be further enhanced by providing a bit more context or detail about the research process or the specific types of neurological impairments associated with the virus, which might elevate it to a score of 5. Nevertheless, the response does not present abrupt transitions, nor does it contain unclear connections, which are key factors negatively impacting the scoring."]