RAG Pipeline Evaluation Using RAGAS


Last updated: July 8, 2025

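Ragas is an open-source, model-based evaluation framework for assessing Retrieval-Augmented Generation (RAG) pipelines and LLM applications. It supports metrics such as correctness, tone, hallucination (faithfulness), and fluency, among others.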

For more information about the evaluators, supported metrics, and usage, see the Ragas documentation.

This notebook shows how to use the Ragas-Haystack integration to evaluate a RAG pipeline against several metrics.

Notebook by Anushree Bannadabhavi, Siddharth Sahu, Julian Risch

Prerequisites

Ragas relies on OpenAI models to compute some metrics, so we need an OpenAI API key.

import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")
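If the key is already exported in the environment (for example, on a shared runner), prompting for it again is unnecessary. The following is a minimal sketch and not part of the original notebook; `ensure_api_key` is a hypothetical helper name.

```python
import os
from getpass import getpass


def ensure_api_key(env_var: str = "OPENAI_API_KEY") -> None:
    """Prompt for the key only if it is not already set (hypothetical helper)."""
    if not os.environ.get(env_var):
        # getpass hides the typed value so the key does not end up in the cell output
        os.environ[env_var] = getpass(f"Enter value for {env_var}: ")
```

Calling `ensure_api_key()` in place of the direct assignment leaves an existing key untouched and only falls back to the interactive prompt when none is found.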

Install Dependencies

!pip install ragas-haystack

Import Required Libraries

from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.embedders import OpenAITextEmbedder, OpenAIDocumentEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage
from haystack.components.generators import OpenAIGenerator
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.builders import AnswerBuilder
from haystack_integrations.components.evaluators.ragas import RagasEvaluator

from ragas.llms import HaystackLLMWrapper
from ragas.metrics import AnswerRelevancy, ContextPrecision, Faithfulness

Create a Sample Dataset

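In this section, we create a sample dataset containing information about AI companies and their language models. During pipeline execution, the retriever draws its context from this dataset.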

dataset = [
    "OpenAI is one of the most recognized names in the large language model space, known for its GPT series of models. These models excel at generating human-like text and performing tasks like creative writing, answering questions, and summarizing content. GPT-4, their latest release, has set benchmarks in understanding context and delivering detailed responses.",
    "Anthropic is well-known for its Claude series of language models, designed with a strong focus on safety and ethical AI behavior. Claude is particularly praised for its ability to follow complex instructions and generate text that aligns closely with user intent.",
    "DeepMind, a division of Google, is recognized for its cutting-edge Gemini models, which are integrated into various Google products like Bard and Workspace tools. These models are renowned for their conversational abilities and their capacity to handle complex, multi-turn dialogues.",
    "Meta AI is best known for its LLaMA (Large Language Model Meta AI) series, which has been made open-source for researchers and developers. LLaMA models are praised for their ability to support innovation and experimentation due to their accessibility and strong performance.",
    "Meta AI with it's LLaMA models aims to democratize AI development by making high-quality models available for free, fostering collaboration across industries. Their open-source approach has been a game-changer for researchers without access to expensive resources.",
    "Microsoft’s Azure AI platform is famous for integrating OpenAI’s GPT models, enabling businesses to use these advanced models in a scalable and secure cloud environment. Azure AI powers applications like Copilot in Office 365, helping users draft emails, generate summaries, and more.",
    "Amazon’s Bedrock platform is recognized for providing access to various language models, including its own models and third-party ones like Anthropic’s Claude and AI21’s Jurassic. Bedrock is especially valued for its flexibility, allowing users to choose models based on their specific needs.",
    "Cohere is well-known for its language models tailored for business use, excelling in tasks like search, summarization, and customer support. Their models are recognized for being efficient, cost-effective, and easy to integrate into workflows.",
    "AI21 Labs is famous for its Jurassic series of language models, which are highly versatile and capable of handling tasks like content creation and code generation. The Jurassic models stand out for their natural language understanding and ability to generate detailed and coherent responses.",
    "In the rapidly advancing field of artificial intelligence, several companies have made significant contributions with their large language models. Notable players include OpenAI, known for its GPT Series (including GPT-4); Anthropic, which offers the Claude Series; Google DeepMind with its Gemini Models; Meta AI, recognized for its LLaMA Series; Microsoft Azure AI, which integrates OpenAI’s GPT Models; Amazon AWS (Bedrock), providing access to various models including Claude (Anthropic) and Jurassic (AI21 Labs); Cohere, which offers its own models tailored for business use; and AI21 Labs, known for its Jurassic Series. These companies are shaping the landscape of AI by providing powerful models with diverse capabilities.",
]

Initialize RAG Pipeline Components

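This section sets up the core components needed to build a Retrieval-Augmented Generation (RAG) pipeline: a Document Store to hold and manage the documents, Embedders to generate the embeddings used for similarity-based retrieval, and a Retriever to fetch the relevant documents. In addition, a Prompt Template structures the pipeline's input, and a Chat Generator produces the response.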

# Sets up an in-memory store to hold documents
document_store = InMemoryDocumentStore()
docs = [Document(content=doc) for doc in dataset]

# Embeds the documents using OpenAI's embedding models to enable similarity search.
document_embedder = OpenAIDocumentEmbedder(model="text-embedding-3-small")
text_embedder = OpenAITextEmbedder(model="text-embedding-3-small")

docs_with_embeddings = document_embedder.run(docs)
document_store.write_documents(docs_with_embeddings["documents"])

# Configures a retriever to fetch relevant documents based on embeddings
retriever = InMemoryEmbeddingRetriever(document_store, top_k=2)

# Defines a template for prompting the LLM with a user query and the retrieved documents
template = [
    ChatMessage.from_user(
        """
Given the following information, answer the question.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:
"""
    )
]

# Sets up an LLM-based generator to create responses
prompt_builder = ChatPromptBuilder(template=template, required_variables="*")
chat_generator = OpenAIChatGenerator(model="gpt-4o-mini")

Calculating embeddings: 1it [00:00,  1.67it/s]

Configure the RagasEvaluator Component

Pass in all the Ragas metrics you want to use for evaluation, and make sure you supply every input that each selected metric needs.

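For example:

AnswerRelevancy: requires the query and the response. It does not consider factual accuracy; instead, it assigns lower scores when the response is incomplete or contains redundant details.
ContextPrecision: requires the query, the retrieved documents, and the reference. It evaluates to what extent the retrieved documents contain only the information relevant to answering the query.
Faithfulness: requires the query, the retrieved documents, and the response. A response is considered faithful if every statement in it can be inferred from the retrieved documents.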

Include all the data each metric needs; otherwise the evaluation will not be accurate.
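One way to catch a missing input before running an evaluation is to encode the requirements listed above in a small lookup table. This is an illustrative sketch only: `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, and the field names assume the Ragas single-turn attribute names used later in this notebook (`user_input` for the query, `retrieved_contexts` for the retrieved documents).

```python
# Required inputs per metric, as described in the list above.
# Assumed mapping: query -> user_input, retrieved documents -> retrieved_contexts.
REQUIRED_FIELDS = {
    "AnswerRelevancy": {"user_input", "response"},
    "ContextPrecision": {"user_input", "retrieved_contexts", "reference"},
    "Faithfulness": {"user_input", "retrieved_contexts", "response"},
}


def missing_fields(sample: dict, metric_names: list[str]) -> dict[str, set[str]]:
    """Return, per metric, the required fields that are absent or empty."""
    report = {}
    for name in metric_names:
        missing = {field for field in REQUIRED_FIELDS[name] if not sample.get(field)}
        if missing:
            report[name] = missing
    return report


sample = {
    "user_input": "What makes Meta AI's LLaMA models stand out?",
    "response": "They are open-source and perform strongly.",
    "retrieved_contexts": ["Meta AI is best known for its LLaMA series ..."],
    # "reference" is intentionally left out to trigger a report entry
}
print(missing_fields(sample, list(REQUIRED_FIELDS)))
# → {'ContextPrecision': {'reference'}}
```

Here only ContextPrecision is reported, because it is the only one of the three metrics that needs a reference answer.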

llm = OpenAIGenerator(model="gpt-4o-mini")
evaluator_llm = HaystackLLMWrapper(llm)

ragas_evaluator = RagasEvaluator(
    ragas_metrics=[AnswerRelevancy(), ContextPrecision(), Faithfulness()],
    evaluator_llm=evaluator_llm,
)

Build and Connect the RAG Pipeline

Here we add the initialized components and connect them to form a Haystack RAG pipeline.

# Creating the Pipeline
rag_pipeline = Pipeline()

# Adding the components
rag_pipeline.add_component("text_embedder", text_embedder)
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("llm", chat_generator)
rag_pipeline.add_component("answer_builder", AnswerBuilder())
rag_pipeline.add_component("ragas_evaluator", ragas_evaluator)

# Connecting the components
rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
rag_pipeline.connect("retriever", "prompt_builder")
rag_pipeline.connect("prompt_builder.prompt", "llm.messages")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")
rag_pipeline.connect("retriever", "ragas_evaluator.documents")
rag_pipeline.connect("llm.replies", "ragas_evaluator.response")

<haystack.core.pipeline.pipeline.Pipeline object at 0x16a0d1790>
🚅 Components
  - text_embedder: OpenAITextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: ChatPromptBuilder
  - llm: OpenAIChatGenerator
  - answer_builder: AnswerBuilder
  - ragas_evaluator: RagasEvaluator
🛤️ Connections
  - text_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - retriever.documents -> answer_builder.documents (List[Document])
  - retriever.documents -> ragas_evaluator.documents (List[Document])
  - prompt_builder.prompt -> llm.messages (List[ChatMessage])
  - llm.replies -> answer_builder.replies (List[ChatMessage])
  - llm.replies -> ragas_evaluator.response (List[ChatMessage])

question = "What makes Meta AI’s LLaMA models stand out?"

reference = "Meta AI’s LLaMA models stand out for being open-source, supporting innovation and experimentation due to their accessibility and strong performance."


result = rag_pipeline.run(
    {
        "text_embedder": {"text": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
        "ragas_evaluator": {"query": question, "reference": reference},
        # Each metric expects a specific set of parameters as input. Refer to the
        # Ragas class' documentation for more details.
    }
)

print(result['answer_builder']['answers'][0].data, '\n')
print(result['ragas_evaluator']['result'])

Evaluating: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:18<00:00,  6.33s/it]


Meta AI’s LLaMA models stand out due to their open-source nature, allowing researchers and developers to access high-quality models for free. This accessibility fosters innovation and experimentation, making it easier for individuals and organizations to collaborate and advance AI development without the constraints of expensive resources. Their strong performance further enhances their appeal in the AI community. 

{'answer_relevancy': 0.9758, 'context_precision': 1.0000, 'faithfulness': 1.0000}
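Scores like these are often compared against a minimum acceptable value. The sketch below is not part of the original notebook: it works on a plain dictionary copied from the printed output above, the helper name `below_threshold` is hypothetical, and the default threshold of 0.8 is an arbitrary assumption that should be tuned per project.

```python
def below_threshold(scores: dict[str, float], threshold: float = 0.8) -> list[str]:
    """Return the names of metrics whose score falls below the threshold."""
    return sorted(name for name, value in scores.items() if value < threshold)


# Values copied from the printed result above
scores = {"answer_relevancy": 0.9758, "context_precision": 1.0, "faithfulness": 1.0}

print(below_threshold(scores))                  # → []
print(below_threshold(scores, threshold=0.99))  # → ['answer_relevancy']
```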

Standalone Evaluation of the RAG Pipeline

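This section explores an alternative way to evaluate a RAG pipeline without the RagasEvaluator component. The pipeline outputs are extracted manually and organized into a dataset for evaluation.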

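You can use any existing Haystack pipeline for this purpose. For demonstration, we create a simple RAG pipeline similar to the one described earlier, but without the RagasEvaluator component.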

Setting Up a Basic RAG Pipeline

We build a simple RAG pipeline like the one above, leaving out the RagasEvaluator component.

# Initialize components for RAG pipeline
document_store = InMemoryDocumentStore()
docs = [Document(content=doc) for doc in dataset]

document_embedder = OpenAIDocumentEmbedder(model="text-embedding-3-small")
text_embedder = OpenAITextEmbedder(model="text-embedding-3-small")

docs_with_embeddings = document_embedder.run(docs)
document_store.write_documents(docs_with_embeddings["documents"])

retriever = InMemoryEmbeddingRetriever(document_store, top_k=2)

template = [
    ChatMessage.from_user(
        """
Given the following information, answer the question.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:
"""
    )
]

prompt_builder = ChatPromptBuilder(template=template, required_variables="*")
chat_generator = OpenAIChatGenerator(model="gpt-4o-mini")

# Creating the Pipeline
rag_pipeline = Pipeline()

# Adding the components
rag_pipeline.add_component("text_embedder", text_embedder)
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("llm", chat_generator)
rag_pipeline.add_component("answer_builder", AnswerBuilder())

# Connecting the components
rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
rag_pipeline.connect("retriever", "prompt_builder")
rag_pipeline.connect("prompt_builder.prompt", "llm.messages")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")

Calculating embeddings: 1it [00:00,  3.14it/s]





<haystack.core.pipeline.pipeline.Pipeline object at 0x16a77bbd0>
🚅 Components
  - text_embedder: OpenAITextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: ChatPromptBuilder
  - llm: OpenAIChatGenerator
  - answer_builder: AnswerBuilder
🛤️ Connections
  - text_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - retriever.documents -> answer_builder.documents (List[Document])
  - prompt_builder.prompt -> llm.messages (List[ChatMessage])
  - llm.replies -> answer_builder.replies (List[ChatMessage])

Extracting Outputs for Evaluation

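After building the pipeline, we use it to generate the outputs we need, namely the retrieved documents and the responses. These outputs are then assembled into a dataset for evaluation.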

questions = [
    "Who are the major players in the large language model space?",
    "What is Microsoft’s Azure AI platform known for?",
    "What kind of models does Cohere provide?",
]

references = [
    "The major players include OpenAI (GPT Series), Anthropic (Claude Series), Google DeepMind (Gemini Models), Meta AI (LLaMA Series), Microsoft Azure AI (integrating GPT Models), Amazon AWS (Bedrock with Claude and Jurassic), Cohere (business-focused models), and AI21 Labs (Jurassic Series).",
    "Microsoft’s Azure AI platform is known for integrating OpenAI’s GPT models, enabling businesses to use these models in a scalable and secure cloud environment.",
    "Cohere provides language models tailored for business use, excelling in tasks like search, summarization, and customer support.",
]


evals_list = []

for eval_question, eval_reference in zip(questions, references):
    # Run the pipeline for the current question
    response = rag_pipeline.run(
        {
            "text_embedder": {"text": eval_question},
            "prompt_builder": {"question": eval_question},
            "answer_builder": {"query": eval_question},
        }
    )

    # The first (and only) answer produced by the AnswerBuilder
    answer = response["answer_builder"]["answers"][0]

    single_turn = {
        "user_input": eval_question,
        "reference": eval_reference,
        # Text of the generated answer
        "response": answer.data,
        # Contents of the Haystack documents retrieved during answer generation
        "retrieved_contexts": [doc.content for doc in answer.documents],
    }

    evals_list.append(single_turn)

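When building evals_list, the keys of each single_turn dictionary must match the attributes defined in Ragas's SingleTurnSample, which keeps the data compatible with the Ragas evaluation framework. As the code above shows, these fields are filled from the retrieved documents and the pipeline output.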
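A quick structural check can catch mismatched keys or types before the list is handed to Ragas. The sketch below is a hypothetical helper, not part of Ragas or Haystack. It only covers the four keys used in this notebook and assumes the three text fields are strings and `retrieved_contexts` is a list of strings.

```python
# Expected type for each key used in this notebook (assumed, see lead-in)
EXPECTED_TYPES = {
    "user_input": str,
    "response": str,
    "reference": str,
    "retrieved_contexts": list,
}


def validate_entry(entry: dict) -> list[str]:
    """Return a list of human-readable problems found in one evals_list entry."""
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in entry:
            problems.append(f"missing key: {key}")
        elif not isinstance(entry[key], expected):
            problems.append(f"{key} should be {expected.__name__}")
    contexts = entry.get("retrieved_contexts")
    if isinstance(contexts, list) and not all(isinstance(c, str) for c in contexts):
        problems.append("retrieved_contexts should contain only strings")
    return problems


good = {
    "user_input": "What kind of models does Cohere provide?",
    "response": "Cohere provides language models tailored for business use.",
    "reference": "Cohere provides language models tailored for business use.",
    "retrieved_contexts": ["Cohere is well-known for its language models ..."],
}
bad = {"user_input": "What kind of models does Cohere provide?", "retrieved_contexts": "a single string"}

print(validate_entry(good))  # → []
print(validate_entry(bad))
# → ['missing key: response', 'missing key: reference', 'retrieved_contexts should be list']
```

Running `validate_entry` over every item in `evals_list` before the conversion step gives a readable list of problems to fix first.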

Evaluating the Pipeline Using a Ragas EvaluationDataset

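The extracted dataset is converted into a Ragas EvaluationDataset so that Ragas can process it. We then initialize an LLM evaluator using HaystackLLMWrapper. Finally, we call Ragas's evaluate() function with our evaluation dataset, the three metrics, and the LLM evaluator.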

from ragas import evaluate
from ragas.dataset_schema import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_list(evals_list)

llm = OpenAIGenerator(model="gpt-4o-mini")
evaluator_llm = HaystackLLMWrapper(llm)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[AnswerRelevancy(), ContextPrecision(), Faithfulness()],
    llm=evaluator_llm,
)

print(result)
result.to_pandas()

Evaluating: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:23<00:00,  2.57s/it]


{'answer_relevancy': 0.9679, 'context_precision': 1.0000, 'faithfulness': 1.0000}

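	user_input	retrieved_contexts	response	reference	answer_relevancy	context_precision	faithfulness
0	Who are the major players in the large languag...	[In the rapidly advancing field of artificial ...	The major players in the large language model ...	The major players include OpenAI (GPT Series),...	0.999999	1.0	1.0
1	What is Microsoft’s Azure AI platform known for?	[Microsoft’s Azure AI platform is famous for i...	Microsoft’s Azure AI platform is known for int...	Microsoft’s Azure AI platform is known for int...	1.000000	1.0	1.0
2	What kind of models does Cohere provide?	[Cohere is well-known for its language models ...	Cohere provides language models tailored for b...	Cohere provides language models tailored for b...	0.903765	1.0	1.0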
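The aggregate scores printed before the table are consistent with a plain average of the per-row values. As a worked check (an observation based on the numbers shown here, not a statement about how Ragas computes its summary in every case), the three answer_relevancy values can be averaged by hand:

```python
from statistics import mean

# Per-row answer_relevancy values copied from the table above
answer_relevancy_per_row = [0.999999, 1.000000, 0.903765]

aggregate = mean(answer_relevancy_per_row)
print(round(aggregate, 4))  # → 0.9679
```

The result matches the answer_relevancy value of 0.9679 in the printed summary. The same reasoning gives 1.0 for context_precision and faithfulness, since every row scores 1.0.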
