Tracing and Evaluating RAG with Arize Phoenix
Last updated: June 13, 2025
Phoenix is a tool for tracing and evaluating LLM applications. In this tutorial, we will trace and evaluate a Haystack RAG pipeline, using three different kinds of evaluations:
- Relevance: whether the retrieved documents are relevant to the question.
- Q&A correctness: whether the answer to the question is correct.
- Hallucination: whether the answer contains hallucinations.
ℹ️ This notebook requires an OpenAI API key.
!pip install -q openinference-instrumentation-haystack haystack-ai arize-phoenix
Set up API keys
from getpass import getpass
import os
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
    os.environ["OPENAI_API_KEY"] = openai_api_key
Launch Phoenix and enable Haystack tracing
If you don't have a Phoenix API key, you can get one for free at phoenix.arize.com. If you would rather run the app yourself, Arize Phoenix also offers a self-hosting option.
if os.getenv("PHOENIX_API_KEY") is None:
    os.environ["PHOENIX_API_KEY"] = getpass("Enter your Phoenix API Key")
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"api_key={os.environ['PHOENIX_API_KEY']}"
os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={os.environ['PHOENIX_API_KEY']}"
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
The command below connects Phoenix to your Haystack application and instruments the Haystack library. From this point on, any calls to Haystack pipelines will be traced and logged to the Phoenix UI.
from phoenix.otel import register
project_name = "Haystack RAG"
tracer_provider = register(project_name=project_name, auto_instrument=True)
Set up your Haystack application
For a step-by-step guide to creating a RAG pipeline with Haystack, follow the Creating Your First QA Pipeline with Retrieval-Augmentation tutorial.
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage, Document
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack import Pipeline
# Write documents to InMemoryDocumentStore
document_store = InMemoryDocumentStore()
document_store.write_documents(
    [
        Document(content="My name is Jean and I live in Paris."),
        Document(content="My name is Mark and I live in Berlin."),
        Document(content="My name is Giorgio and I live in Rome."),
    ]
)
# Basic RAG Pipeline
template = [
    ChatMessage.from_system(
        """
        Answer the questions based on the given context.
        Context:
        {% for document in documents %}
        {{ document.content }}
        {% endfor %}
        Question: {{ question }}
        Answer:
        """
    )
]
rag_pipe = Pipeline()
rag_pipe.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
rag_pipe.add_component("prompt_builder", ChatPromptBuilder(template=template, required_variables="*"))
rag_pipe.add_component("llm", OpenAIChatGenerator(model="gpt-4o-mini"))
rag_pipe.connect("retriever", "prompt_builder.documents")
rag_pipe.connect("prompt_builder.prompt", "llm.messages")
<haystack.core.pipeline.pipeline.Pipeline object at 0x7f5e1e4be390>
🚅 Components
- retriever: InMemoryBM25Retriever
- prompt_builder: ChatPromptBuilder
- llm: OpenAIChatGenerator
🛤️ Connections
- retriever.documents -> prompt_builder.documents (List[Document])
- prompt_builder.prompt -> llm.messages (List[ChatMessage])
Run the pipeline with a query. It will automatically create a trace in Phoenix.
# Ask a question
question = "Who lives in Paris?"
results = rag_pipe.run(
    {
        "retriever": {"query": question},
        "prompt_builder": {"question": question},
    }
)
print(results["llm"]["replies"][0].text)
Jean lives in Paris.
Evaluate retrieved documents
Now that we have traced the pipeline, let's start by evaluating the retrieved documents.
All evaluations in Phoenix follow the same general process:
- Query and download trace data from Phoenix
- Add evaluation labels to the trace data. This can be done with the Phoenix library, with Haystack evaluators, or with your own evaluators.
- Log the evaluation labels to Phoenix
- View the evaluations
We will use the get_retrieved_documents function to fetch the trace data for the retrieved documents.
import nest_asyncio
nest_asyncio.apply()
import phoenix as px
client = px.Client()
from phoenix.session.evaluation import get_retrieved_documents
retrieved_documents_df = get_retrieved_documents(px.Client(), project_name=project_name)
retrieved_documents_df.head()
context.span_id   document_position  context.trace_id                  input                                               reference                                document_score
40880a3ade3753c3  0                  53d4a3ef151e2dc3009fa6aff152dc86  {"query": "Who lives in Paris?", "filters": nu...  My name is Jean and I live in Paris.     1.293454
                  1                  53d4a3ef151e2dc3009fa6aff152dc86  {"query": "Who lives in Paris?", "filters": nu...  My name is Mark and I live in Berlin.    0.768010
                  2                  53d4a3ef151e2dc3009fa6aff152dc86  {"query": "Who lives in Paris?", "filters": nu...  My name is Giorgio and I live in Rome.   0.768010
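The dataframe is indexed by (context.span_id, document_position), one row per retrieved document. As a quick sanity check before running any LLM evaluator, you can inspect it with plain pandas. The sketch below builds a toy dataframe with the same shape (the span ID and scores are copied from the example output above, not fetched from Phoenix) and pulls out the top-scoring document per retrieval span:

```python
import pandas as pd

# Toy dataframe mimicking the MultiIndex structure returned by
# get_retrieved_documents; values are copied from the example output above.
index = pd.MultiIndex.from_tuples(
    [("40880a3ade3753c3", 0), ("40880a3ade3753c3", 1), ("40880a3ade3753c3", 2)],
    names=["context.span_id", "document_position"],
)
df = pd.DataFrame(
    {
        "reference": [
            "My name is Jean and I live in Paris.",
            "My name is Mark and I live in Berlin.",
            "My name is Giorgio and I live in Rome.",
        ],
        "document_score": [1.293454, 0.768010, 0.768010],
    },
    index=index,
)

# Highest-scoring retrieved document for each retrieval span
top_docs = (
    df.sort_values("document_score", ascending=False)
    .groupby(level="context.span_id")
    .head(1)
)
print(top_docs["reference"].iloc[0])  # -> My name is Jean and I live in Paris.
```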
Next, we will use Phoenix's RelevanceEvaluator to evaluate the relevance of the retrieved documents. This evaluator uses an LLM to determine whether a retrieved document contains the answer to the question.
from phoenix.evals import OpenAIModel, RelevanceEvaluator, run_evals
relevance_evaluator = RelevanceEvaluator(OpenAIModel(model="gpt-4o-mini"))
retrieved_documents_relevance_df = run_evals(
evaluators=[relevance_evaluator],
dataframe=retrieved_documents_df,
provide_explanation=True,
concurrency=20,
)[0]
retrieved_documents_relevance_df.head()
context.span_id   document_position  label      score  explanation
40880a3ade3753c3  0                  relevant   1      The question asks who lives in Paris. The refe...
                  1                  unrelated  0      The question asks about who lives in Paris, wh...
                  2                  unrelated  0      The question asks about who lives in Paris, wh...
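Each row now carries a relevance label and a 0/1 score, so you can also aggregate them locally, for example into a per-span precision (fraction of retrieved documents judged relevant). A sketch on made-up labels matching the output above:

```python
import pandas as pd

# Made-up relevance labels shaped like the run_evals output above
index = pd.MultiIndex.from_tuples(
    [("40880a3ade3753c3", 0), ("40880a3ade3753c3", 1), ("40880a3ade3753c3", 2)],
    names=["context.span_id", "document_position"],
)
evals = pd.DataFrame(
    {"label": ["relevant", "unrelated", "unrelated"], "score": [1, 0, 0]},
    index=index,
)

# Precision = fraction of retrieved documents judged relevant, per span
precision = evals.groupby(level="context.span_id")["score"].mean()
print(precision)  # 1 of 3 documents relevant -> 0.333...
```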
Finally, we log the evaluation labels to Phoenix.
from phoenix.trace import DocumentEvaluations, SpanEvaluations
px.Client().log_evaluations(
DocumentEvaluations(dataframe=retrieved_documents_relevance_df, eval_name="relevance"),
)
If you now click on the document retrieval span in Phoenix, you should see the evaluation labels.
Evaluate the response
Using the HallucinationEvaluator and QAEvaluator, we can score the generated responses for correctness and hallucination.
from phoenix.session.evaluation import get_qa_with_reference
qa_with_reference_df = get_qa_with_reference(px.Client(), project_name=project_name)
qa_with_reference_df
context.span_id   input                                               output                                              reference
a3e33d1e526e97bd  {"data": {"retriever": {"query": "Who lives in...  {"llm": {"replies": ["ChatMessage(_role=<ChatR...  My name is Jean and I live in Paris.\n\nMy nam...
from phoenix.evals import (
HallucinationEvaluator,
OpenAIModel,
QAEvaluator,
run_evals,
)
qa_evaluator = QAEvaluator(OpenAIModel(model="gpt-4-turbo-preview"))
hallucination_evaluator = HallucinationEvaluator(OpenAIModel(model="gpt-4-turbo-preview"))
qa_correctness_eval_df, hallucination_eval_df = run_evals(
evaluators=[qa_evaluator, hallucination_evaluator],
dataframe=qa_with_reference_df,
provide_explanation=True,
concurrency=20,
)
px.Client().log_evaluations(
SpanEvaluations(dataframe=qa_correctness_eval_df, eval_name="Q&A Correctness"),
SpanEvaluations(dataframe=hallucination_eval_df, eval_name="Hallucination"),
)
You should now see the Q&A correctness and hallucination evaluations in Phoenix.
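If you also want a quick local summary alongside the Phoenix UI, the two eval dataframes share the same span index and can be joined with pandas. A sketch on made-up labels shaped like the run_evals output (the span ID is copied from the example output above; the column suffixes are our own naming):

```python
import pandas as pd

# Made-up labels shaped like the two dataframes returned by run_evals above
idx = pd.Index(["a3e33d1e526e97bd"], name="context.span_id")
qa_df = pd.DataFrame({"label": ["correct"], "score": [1]}, index=idx)
hallucination_df = pd.DataFrame({"label": ["factual"], "score": [1]}, index=idx)

# One row per span with both verdicts side by side
summary = qa_df.join(hallucination_df, lsuffix="_qa", rsuffix="_hallucination")
print(summary)
```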
