Integration: Ragas

Use the Ragas evaluation framework to compute model-based metrics

Authors

Siddharth Sahu

GitHub Repo PyPI Package

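Overview
Installation
Usage
- 3.1 Evaluation with the integrated RagasEvaluator component
  - 3.1.1 Importing the required libraries and setting up the environment
  - 3.1.2 Getting the dataset
  - 3.1.3 Initializing the RAG pipeline components
  - 3.1.4 Configuring the RagasEvaluator component
  - 3.1.5 Building and connecting the RAG pipeline
  - 3.1.6 Running the pipeline
- 3.2 Standalone evaluation of a RAG pipeline
  - 3.2.1 Setting up a basic RAG pipeline
  - 3.2.2 Extracting outputs for evaluation
  - 3.2.3 Evaluating the pipeline with a Ragas EvaluationDataset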

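Overview

Ragas is an open-source, model-based evaluation framework that assesses LLM applications by quantifying their performance on aspects such as correctness, tone, hallucination, and fluency. More information is available on the documentation page.

This tutorial shows how to integrate Ragas with a retrieval-augmented generation (RAG) pipeline built with Haystack and how to evaluate that pipeline, both with and without the RagasEvaluator component.

Installation

Install the ragas-haystack integration package: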

pip install ragas-haystack

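Usage

Evaluation with the integrated RagasEvaluator component

This section covers running the evaluation inside a Haystack RAG pipeline with the RagasEvaluator component.

Importing the required libraries and setting up the environment

We start by importing the required libraries and configuring the environment.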

import os
from getpass import getpass

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")

from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.embedders import OpenAITextEmbedder, OpenAIDocumentEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage
from haystack.components.generators import OpenAIGenerator
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.builders import AnswerBuilder
from haystack_integrations.components.evaluators.ragas import RagasEvaluator

from ragas.llms import HaystackLLMWrapper
from ragas.metrics import AnswerRelevancy, ContextPrecision, Faithfulness

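Getting the dataset

In this section we create a sample dataset with information about AI companies and their language models. During pipeline execution, this dataset serves as the context from which relevant data is retrieved.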

dataset = [
    "OpenAI is one of the most recognized names in the large language model space, known for its GPT series of models. These models excel at generating human-like text and performing tasks like creative writing, answering questions, and summarizing content. GPT-4, their latest release, has set benchmarks in understanding context and delivering detailed responses.",
    "Anthropic is well-known for its Claude series of language models, designed with a strong focus on safety and ethical AI behavior. Claude is particularly praised for its ability to follow complex instructions and generate text that aligns closely with user intent.",
    "DeepMind, a division of Google, is recognized for its cutting-edge Gemini models, which are integrated into various Google products like Bard and Workspace tools. These models are renowned for their conversational abilities and their capacity to handle complex, multi-turn dialogues.",
    "Meta AI is best known for its LLaMA (Large Language Model Meta AI) series, which has been made open-source for researchers and developers. LLaMA models are praised for their ability to support innovation and experimentation due to their accessibility and strong performance.",
    "Meta AI with it's LLaMA models aims to democratize AI development by making high-quality models available for free, fostering collaboration across industries. Their open-source approach has been a game-changer for researchers without access to expensive resources.",
    "Microsoft’s Azure AI platform is famous for integrating OpenAI’s GPT models, enabling businesses to use these advanced models in a scalable and secure cloud environment. Azure AI powers applications like Copilot in Office 365, helping users draft emails, generate summaries, and more.",
    "Amazon’s Bedrock platform is recognized for providing access to various language models, including its own models and third-party ones like Anthropic’s Claude and AI21’s Jurassic. Bedrock is especially valued for its flexibility, allowing users to choose models based on their specific needs.",
    "Cohere is well-known for its language models tailored for business use, excelling in tasks like search, summarization, and customer support. Their models are recognized for being efficient, cost-effective, and easy to integrate into workflows.",
    "AI21 Labs is famous for its Jurassic series of language models, which are highly versatile and capable of handling tasks like content creation and code generation. The Jurassic models stand out for their natural language understanding and ability to generate detailed and coherent responses.",
    "In the rapidly advancing field of artificial intelligence, several companies have made significant contributions with their large language models. Notable players include OpenAI, known for its GPT Series (including GPT-4); Anthropic, which offers the Claude Series; Google DeepMind with its Gemini Models; Meta AI, recognized for its LLaMA Series; Microsoft Azure AI, which integrates OpenAI’s GPT Models; Amazon AWS (Bedrock), providing access to various models including Claude (Anthropic) and Jurassic (AI21 Labs); Cohere, which offers its own models tailored for business use; and AI21 Labs, known for its Jurassic Series. These companies are shaping the landscape of AI by providing powerful models with diverse capabilities.",
]

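Initializing the RAG pipeline components

This section sets up the key components needed to build a retrieval-augmented generation (RAG) pipeline: a Document Store that stores and manages the documents, Embedders that generate the embeddings used for similarity-based retrieval, and a Retriever that fetches the relevant documents. In addition, a prompt template structures the input to the pipeline, and a Chat Generator produces the response. Together these components form the backbone of the RAG pipeline and link document retrieval to response generation.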

# Sets up an in-memory store to hold documents
document_store = InMemoryDocumentStore()
docs = [Document(content=doc) for doc in dataset]

# Embeds the documents using OpenAI's embedding models to enable similarity search.
document_embedder = OpenAIDocumentEmbedder(model="text-embedding-3-small")
text_embedder = OpenAITextEmbedder(model="text-embedding-3-small")

docs_with_embeddings = document_embedder.run(docs)
document_store.write_documents(docs_with_embeddings["documents"])

# Configures a retriever to fetch relevant documents based on embeddings
retriever = InMemoryEmbeddingRetriever(document_store, top_k=2)

# Defines a template for prompting the LLM with a user query and the retrieved documents
template = [
    ChatMessage.from_user(
        """
Given the following information, answer the question.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:
"""
    )
]

# Sets up an LLM-based generator to create responses
prompt_builder = ChatPromptBuilder(template=template)
chat_generator = OpenAIChatGenerator(model="gpt-4o-mini")
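
The retrieval step configured above (top_k=2) can be pictured with a small, dependency-free sketch. It ranks toy vectors by cosine similarity and keeps the two best matches. The function names and the toy vectors are hypothetical, and cosine similarity is an assumption made for illustration; the similarity function and scoring actually used by the document store may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_embedding, doc_embeddings, top_k=2):
    """Return the indices of the top_k documents most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, emb), idx)
        for idx, emb in enumerate(doc_embeddings)
    ]
    # Highest similarity first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:top_k]]


# Toy 3-dimensional vectors stand in for real embeddings (hypothetical values).
toy_doc_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query_embedding = [0.8, 0.2, 0.0]

top_indices = retrieve_top_k(toy_query_embedding, toy_doc_embeddings, top_k=2)
print(top_indices)
```

Only the two highest-ranked documents reach the prompt, which is why the choice of top_k directly affects context-based metrics.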

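Configuring the RagasEvaluator component

Pass all the Ragas metrics you want to use for evaluation, and make sure that every piece of information required to compute each selected metric is provided.

For example:

- AnswerRelevancy: requires the query and the response.
- ContextPrecision: requires the query, the retrieved documents, and the reference.
- Faithfulness: requires the query, the retrieved documents, and the response.

Include all the relevant data for every metric; a metric that lacks one of its inputs cannot be evaluated accurately.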

llm = OpenAIGenerator(model="gpt-4o-mini")
evaluator_llm = HaystackLLMWrapper(llm)

ragas_evaluator = RagasEvaluator(
    ragas_metrics=[AnswerRelevancy(), ContextPrecision(), Faithfulness()],
    evaluator_llm=evaluator_llm,
)
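
The input requirements listed above can be written down as a small lookup table and checked before the pipeline runs. This is only a sketch in plain Python: REQUIRED_INPUTS and missing_inputs are hypothetical names, not part of Ragas or Haystack, and the table covers only the three metrics used in this tutorial.

```python
# Hypothetical lookup table built from the requirements listed in this section.
REQUIRED_INPUTS = {
    "AnswerRelevancy": {"query", "response"},
    "ContextPrecision": {"query", "documents", "reference"},
    "Faithfulness": {"query", "documents", "response"},
}


def missing_inputs(metric_names, available_inputs):
    """Map each metric name to the sorted list of inputs it still lacks."""
    available = set(available_inputs)
    report = {}
    for name in metric_names:
        missing = REQUIRED_INPUTS[name] - available
        if missing:
            report[name] = sorted(missing)
    return report


# Example: the reference was forgotten, so only ContextPrecision is affected.
report = missing_inputs(
    ["AnswerRelevancy", "ContextPrecision", "Faithfulness"],
    ["query", "documents", "response"],
)
print(report)
```

An empty report means every selected metric has the inputs it needs.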

Building and connecting the RAG pipeline

Here we add the initialized components to a Haystack pipeline and connect them to form the RAG pipeline.

# Creating the Pipeline
rag_pipeline = Pipeline()

# Adding the components
rag_pipeline.add_component("text_embedder", text_embedder)
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("llm", chat_generator)
rag_pipeline.add_component("answer_builder", AnswerBuilder())
rag_pipeline.add_component("ragas_evaluator", ragas_evaluator)

# Connecting the components
rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
rag_pipeline.connect("retriever", "prompt_builder")
rag_pipeline.connect("prompt_builder.prompt", "llm.messages")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")
rag_pipeline.connect("retriever", "ragas_evaluator.documents")
rag_pipeline.connect("llm.replies", "ragas_evaluator.response")

Running the pipeline

In this section we run the pipeline with a sample query and evaluate its performance with the configured RagasEvaluator.

question = "What makes Meta AI’s LLaMA models stand out?"

reference = "Meta AI’s LLaMA models stand out for being open-source, supporting innovation and experimentation due to their accessibility and strong performance."


result = rag_pipeline.run(
    {
        "text_embedder": {"text": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
        "ragas_evaluator": {"query": question, "reference": reference},
        # Each metric expects a specific set of parameters as input. Refer to the
        # Ragas class' documentation for more details.
    }
)

print(result['answer_builder']['answers'][0].data, '\n')
print(result['ragas_evaluator']['result'])

Output

Evaluating: 100%|██████████| 3/3 [00:14<00:00,  4.72s/it]

Meta AI's LLaMA models stand out due to their open-source nature, which allows researchers and developers easy access to high-quality language models without the need for expensive resources. This accessibility fosters innovation and experimentation, enabling collaboration across various industries. Moreover, the strong performance of the LLaMA models further enhances their appeal, making them valuable tools for advancing AI development. 

{'answer_relevancy': 0.9782, 'context_precision': 1.0000, 'faithfulness': 1.0000}
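
One way to act on scores like these is to compare them with minimum acceptable values. The sketch below assumes the scores have been copied by hand into a plain dictionary; it does not show how to read them from the evaluator's result object. The threshold values and the failing_metrics helper are hypothetical.

```python
def failing_metrics(scores, thresholds):
    """Return the metrics whose score is below its minimum threshold."""
    return {
        name: scores[name]
        for name, minimum in thresholds.items()
        if name in scores and scores[name] < minimum
    }


# Scores copied from the output above.
scores = {"answer_relevancy": 0.9782, "context_precision": 1.0, "faithfulness": 1.0}

# Hypothetical minimums; choose values that suit your application.
thresholds = {"answer_relevancy": 0.9, "context_precision": 0.8, "faithfulness": 0.95}

failures = failing_metrics(scores, thresholds)
print(failures or "all metrics meet their thresholds")
```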

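Standalone evaluation of a RAG pipeline

This section covers an alternative way to evaluate a RAG pipeline without the RagasEvaluator component. The outputs are extracted manually and organized for evaluation.

You can use any existing Haystack pipeline for this. For demonstration purposes, we build a simple RAG pipeline similar to the one described earlier, but without the RagasEvaluator component.

Setting up a basic RAG pipeline

We build a simple RAG pipeline in the same way as above, without the RagasEvaluator component.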

# Initialize components for RAG pipeline
document_store = InMemoryDocumentStore()
docs = [Document(content=doc) for doc in dataset]

document_embedder = OpenAIDocumentEmbedder(model="text-embedding-3-small")
text_embedder = OpenAITextEmbedder(model="text-embedding-3-small")

docs_with_embeddings = document_embedder.run(docs)
document_store.write_documents(docs_with_embeddings["documents"])

retriever = InMemoryEmbeddingRetriever(document_store, top_k=2)

template = [
    ChatMessage.from_user(
        """
Given the following information, answer the question.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:
"""
    )
]

prompt_builder = ChatPromptBuilder(template=template)
chat_generator = OpenAIChatGenerator(model="gpt-4o-mini")

# Creating the Pipeline
rag_pipeline = Pipeline()

# Adding the components
rag_pipeline.add_component("text_embedder", text_embedder)
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("llm", chat_generator)
rag_pipeline.add_component("answer_builder", AnswerBuilder())

# Connecting the components
rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
rag_pipeline.connect("retriever", "prompt_builder")
rag_pipeline.connect("prompt_builder.prompt", "llm.messages")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")

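Extracting outputs for evaluation

Once the pipeline is built, we use it to generate the outputs needed for evaluation, namely the retrieved documents and the responses. These outputs are then structured into a dataset for evaluation.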

questions = [
    "Who are the major players in the large language model space?",
    "What is Microsoft’s Azure AI platform known for?",
    "What kind of models does Cohere provide?",
]

references = [
    "The major players include OpenAI (GPT Series), Anthropic (Claude Series), Google DeepMind (Gemini Models), Meta AI (LLaMA Series), Microsoft Azure AI (integrating GPT Models), Amazon AWS (Bedrock with Claude and Jurassic), Cohere (business-focused models), and AI21 Labs (Jurassic Series).",
    "Microsoft’s Azure AI platform is known for integrating OpenAI’s GPT models, enabling businesses to use these models in a scalable and secure cloud environment.",
    "Cohere provides language models tailored for business use, excelling in tasks like search, summarization, and customer support.",
]


evals_list = []

for que_idx in range(len(questions)):

    single_turn = {}
    single_turn['user_input'] = questions[que_idx]
    single_turn['reference'] = references[que_idx]

    # Running the pipeline
    response = rag_pipeline.run(
        {
            "text_embedder": {"text": questions[que_idx]},
            "prompt_builder": {"question": questions[que_idx]},
            "answer_builder": {"query": questions[que_idx]},
        }
    )

    # the response of the pipeline
    single_turn['response'] = response["answer_builder"]["answers"][0].data

    haystack_documents = response["answer_builder"]["answers"][0].documents
    # Extract the context texts from the Haystack documents
    # retrieved during answer generation
    single_turn['retrieved_contexts'] = [doc.content for doc in haystack_documents]

    evals_list.append(single_turn)

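When building evals_list, the keys of each single_turn dictionary must match the attributes defined in Ragas's SingleTurnSample so that the data is compatible with the Ragas evaluation framework. As the snippet above shows, these fields are filled from the retrieved documents and the pipeline output.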
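
A quick structural check on each dictionary can catch a misspelled key or a wrongly typed value before the data reaches Ragas. The sketch below is plain Python and covers only the four fields used in this tutorial; EXPECTED_TYPES and validate_single_turn are hypothetical names, and the check is not a substitute for the validation Ragas itself performs.

```python
# The four fields filled in by the extraction loop, with the types used there.
EXPECTED_TYPES = {
    "user_input": str,
    "reference": str,
    "response": str,
    "retrieved_contexts": list,
}


def validate_single_turn(sample):
    """Return a list of problems found in one sample; an empty list means it looks fine."""
    problems = []
    for key, expected_type in EXPECTED_TYPES.items():
        if key not in sample:
            problems.append(f"missing key: {key}")
        elif not isinstance(sample[key], expected_type):
            problems.append(f"{key} should be {expected_type.__name__}")
    contexts = sample.get("retrieved_contexts")
    if isinstance(contexts, list) and not all(isinstance(c, str) for c in contexts):
        problems.append("retrieved_contexts should contain only strings")
    return problems


good_sample = {
    "user_input": "What kind of models does Cohere provide?",
    "reference": "Cohere provides language models tailored for business use.",
    "response": "Cohere provides business-focused language models.",
    "retrieved_contexts": ["Cohere is well-known for its language models."],
}
# "question" is used instead of "user_input", and the contexts are not a list.
bad_sample = {
    "question": "What kind of models does Cohere provide?",
    "reference": "Cohere provides language models tailored for business use.",
    "response": "Cohere provides business-focused language models.",
    "retrieved_contexts": "Cohere is well-known for its language models.",
}

good_problems = validate_single_turn(good_sample)
bad_problems = validate_single_turn(bad_sample)
print(good_problems)
print(bad_problems)
```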

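Evaluating the pipeline with a Ragas EvaluationDataset

The extracted data is converted into a Ragas EvaluationDataset, which is then evaluated with the same three metrics.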

from ragas import evaluate
from ragas.dataset_schema import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_list(evals_list)

llm = OpenAIGenerator(model="gpt-4o-mini")
evaluator_llm = HaystackLLMWrapper(llm)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[AnswerRelevancy(), ContextPrecision(), Faithfulness()],
    llm=evaluator_llm,
)

print(result)
result.to_pandas()

Output

Evaluating: 100%|██████████| 9/9 [00:21<00:00,  2.35s/it]

{'answer_relevancy': 0.9715, 'context_precision': 1.0000, 'faithfulness': 1.0000}

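user_input	retrieved_contexts	response	reference	answer_relevancy	context_precision	faithfulness
Who are the major players in the large language model …	[In the rapidly advancing field of artificial intelligence …	The major players in the large language model …	The major players include OpenAI (GPT Series), …	1.000000	1.0	1.0
What is Microsoft’s Azure AI platform known for?	[Microsoft’s Azure AI platform is famous for …	Microsoft’s Azure AI platform is known for integrating …	Microsoft’s Azure AI platform is known for integrating …	1.000000	1.0	1.0
What kind of models does Cohere provide?	[Cohere is well-known for its language models …	Cohere provides language models tailored for b…	Cohere provides language models tailored for b…	0.914599	1.0	1.0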
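
The aggregate answer_relevancy printed above is consistent with a simple mean of the per-sample values in the table, and the progress bar total of 9 is consistent with three samples scored on three metrics. The short check below reproduces both figures. It is an observation about this particular output, not a description of Ragas internals.

```python
from statistics import mean

# Per-sample answer_relevancy values copied from the table above.
answer_relevancy_scores = [1.000000, 1.000000, 0.914599]

# Mean of the per-sample scores, rounded to the four decimals shown in the summary.
aggregate = round(mean(answer_relevancy_scores), 4)

# Three questions, each scored on three metrics.
num_samples = len(answer_relevancy_scores)
num_metrics = 3
total_jobs = num_samples * num_metrics

print(aggregate, total_jobs)
```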