Tracing and Evaluating RAG with Arize Phoenix
Last updated: June 13, 2025
Phoenix is a tool for tracing and evaluating LLM applications. In this tutorial, we will trace and evaluate a Haystack RAG pipeline, using three different kinds of evaluations:
- Relevance: whether the retrieved documents are relevant to the question.
- Q&A correctness: whether the answer to the question is correct.
- Hallucination: whether the answer contains hallucinations.
ℹ️ This notebook requires an OpenAI API key.
!pip install -q openinference-instrumentation-haystack haystack-ai arize-phoenix
Set up API keys
from getpass import getpass
import os
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
    os.environ["OPENAI_API_KEY"] = openai_api_key
Launch Phoenix and enable Haystack tracing
If you don't have a Phoenix API key, you can get one for free at phoenix.arize.com. If you would rather run the app yourself, Arize Phoenix also offers a self-hosting option.
if os.getenv("PHOENIX_API_KEY") is None:
    os.environ["PHOENIX_API_KEY"] = getpass("Enter your Phoenix API Key")
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"api_key={os.environ['PHOENIX_API_KEY']}"
os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={os.environ['PHOENIX_API_KEY']}"
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
The command below connects Phoenix to your Haystack application and instruments the Haystack library. From this point on, any calls to Haystack pipelines will be traced and logged to the Phoenix UI.
from phoenix.otel import register
project_name = "Haystack RAG"
tracer_provider = register(project_name=project_name, auto_instrument=True)
Set up your Haystack application
For a step-by-step guide to creating a RAG pipeline with Haystack, follow the Creating Your First QA Pipeline with Retrieval-Augmentation tutorial.
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage, Document
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack import Pipeline
# Write documents to InMemoryDocumentStore
document_store = InMemoryDocumentStore()
document_store.write_documents(
    [
        Document(content="My name is Jean and I live in Paris."),
        Document(content="My name is Mark and I live in Berlin."),
        Document(content="My name is Giorgio and I live in Rome."),
    ]
)
# Basic RAG Pipeline
template = [
    ChatMessage.from_system(
        """
        Answer the questions based on the given context.
        Context:
        {% for document in documents %}
        {{ document.content }}
        {% endfor %}
        Question: {{ question }}
        Answer:
        """
    )
]
rag_pipe = Pipeline()
rag_pipe.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
rag_pipe.add_component("prompt_builder", ChatPromptBuilder(template=template, required_variables="*"))
rag_pipe.add_component("llm", OpenAIChatGenerator(model="gpt-4o-mini"))
rag_pipe.connect("retriever", "prompt_builder.documents")
rag_pipe.connect("prompt_builder.prompt", "llm.messages")
<haystack.core.pipeline.pipeline.Pipeline object at 0x7f5e1e4be390>
🚅 Components
- retriever: InMemoryBM25Retriever
- prompt_builder: ChatPromptBuilder
- llm: OpenAIChatGenerator
🛤️ Connections
- retriever.documents -> prompt_builder.documents (List[Document])
- prompt_builder.prompt -> llm.messages (List[ChatMessage])
Run the pipeline with a query. It will automatically create a trace in Phoenix.
# Ask a question
question = "Who lives in Paris?"
results = rag_pipe.run(
    {
        "retriever": {"query": question},
        "prompt_builder": {"question": question},
    }
)
print(results["llm"]["replies"][0].text)
Jean lives in Paris.
Evaluate retrieved documents
Now that we have traced the pipeline, let's start by evaluating the retrieved documents.
All evaluations in Phoenix follow the same general process:
- Query and download trace data from Phoenix
- Add evaluation labels to the trace data. This can be done with the Phoenix library, with Haystack evaluators, or with your own evaluators.
- Log the evaluation labels to Phoenix
- View the evaluations
We will use the get_retrieved_documents function to fetch the trace data for the retrieved documents.
import nest_asyncio
nest_asyncio.apply()
import phoenix as px
client = px.Client()
from phoenix.session.evaluation import get_retrieved_documents
retrieved_documents_df = get_retrieved_documents(px.Client(), project_name=project_name)
retrieved_documents_df.head()
context.span_id   document_position  context.trace_id                  input                                               reference                                document_score
40880a3ade3753c3  0                  53d4a3ef151e2dc3009fa6aff152dc86  {"query": "Who lives in Paris?", "filters": nu...  My name is Jean and I live in Paris.     1.293454
                  1                  53d4a3ef151e2dc3009fa6aff152dc86  {"query": "Who lives in Paris?", "filters": nu...  My name is Mark and I live in Berlin.    0.768010
                  2                  53d4a3ef151e2dc3009fa6aff152dc86  {"query": "Who lives in Paris?", "filters": nu...  My name is Giorgio and I live in Rome.   0.768010
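The dataframe is indexed by (context.span_id, document_position), one row per retrieved document. As a quick sanity check before running any LLM evaluator, you can inspect it with plain pandas. The sketch below builds a toy dataframe with the same shape (the span ID and scores are copied from the example output above, not fetched from Phoenix) and pulls out the top-scoring document per retrieval span:

```python
import pandas as pd

# Toy dataframe mimicking the MultiIndex structure returned by
# get_retrieved_documents; values are copied from the example output above.
index = pd.MultiIndex.from_tuples(
    [("40880a3ade3753c3", 0), ("40880a3ade3753c3", 1), ("40880a3ade3753c3", 2)],
    names=["context.span_id", "document_position"],
)
df = pd.DataFrame(
    {
        "reference": [
            "My name is Jean and I live in Paris.",
            "My name is Mark and I live in Berlin.",
            "My name is Giorgio and I live in Rome.",
        ],
        "document_score": [1.293454, 0.768010, 0.768010],
    },
    index=index,
)

# Highest-scoring retrieved document for each retrieval span
top_docs = (
    df.sort_values("document_score", ascending=False)
    .groupby(level="context.span_id")
    .head(1)
)
print(top_docs["reference"].iloc[0])  # -> My name is Jean and I live in Paris.
```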
Next, we will use Phoenix's RelevanceEvaluator to evaluate the relevance of the retrieved documents. This evaluator uses an LLM to determine whether a retrieved document contains the answer to the question.
from phoenix.evals import OpenAIModel, RelevanceEvaluator, run_evals
relevance_evaluator = RelevanceEvaluator(OpenAIModel(model="gpt-4o-mini"))
retrieved_documents_relevance_df = run_evals(
evaluators=[relevance_evaluator],
dataframe=retrieved_documents_df,
provide_explanation=True,
concurrency=20,
)[0]
retrieved_documents_relevance_df.head()
context.span_id   document_position  label      score  explanation
40880a3ade3753c3  0                  relevant   1      The question asks who lives in Paris. The refe...
                  1                  unrelated  0      The question asks about who lives in Paris, wh...
                  2                  unrelated  0      The question asks about who lives in Paris, wh...
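Each row now carries a relevance label and a 0/1 score, so you can also aggregate them locally, for example into a per-span precision (fraction of retrieved documents judged relevant). A sketch on made-up labels matching the output above:

```python
import pandas as pd

# Made-up relevance labels shaped like the run_evals output above
index = pd.MultiIndex.from_tuples(
    [("40880a3ade3753c3", 0), ("40880a3ade3753c3", 1), ("40880a3ade3753c3", 2)],
    names=["context.span_id", "document_position"],
)
evals = pd.DataFrame(
    {"label": ["relevant", "unrelated", "unrelated"], "score": [1, 0, 0]},
    index=index,
)

# Precision = fraction of retrieved documents judged relevant, per span
precision = evals.groupby(level="context.span_id")["score"].mean()
print(precision)  # 1 of 3 documents relevant -> 0.333...
```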
Finally, we log the evaluation labels to Phoenix.
from phoenix.trace import DocumentEvaluations, SpanEvaluations
px.Client().log_evaluations(
DocumentEvaluations(dataframe=retrieved_documents_relevance_df, eval_name="relevance"),
)
If you now click on the document retrieval span in Phoenix, you should see the evaluation labels.
Evaluate the response
Using the HallucinationEvaluator and QAEvaluator, we can score the generated responses for correctness and hallucination.
from phoenix.session.evaluation import get_qa_with_reference
qa_with_reference_df = get_qa_with_reference(px.Client(), project_name=project_name)
qa_with_reference_df
context.span_id   input                                               output                                              reference
a3e33d1e526e97bd  {"data": {"retriever": {"query": "Who lives in...  {"llm": {"replies": ["ChatMessage(_role=<ChatR...  My name is Jean and I live in Paris.\n\nMy nam...
from phoenix.evals import (
HallucinationEvaluator,
OpenAIModel,
QAEvaluator,
run_evals,
)
qa_evaluator = QAEvaluator(OpenAIModel(model="gpt-4-turbo-preview"))
hallucination_evaluator = HallucinationEvaluator(OpenAIModel(model="gpt-4-turbo-preview"))
qa_correctness_eval_df, hallucination_eval_df = run_evals(
evaluators=[qa_evaluator, hallucination_evaluator],
dataframe=qa_with_reference_df,
provide_explanation=True,
concurrency=20,
)
px.Client().log_evaluations(
SpanEvaluations(dataframe=qa_correctness_eval_df, eval_name="Q&A Correctness"),
SpanEvaluations(dataframe=hallucination_eval_df, eval_name="Hallucination"),
)
You should now see the Q&A correctness and hallucination evaluations in Phoenix.
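If you also want a quick local summary alongside the Phoenix UI, the two eval dataframes share the same span index and can be joined with pandas. A sketch on made-up labels shaped like the run_evals output (the span ID is copied from the example output above; the column suffixes are our own naming):

```python
import pandas as pd

# Made-up labels shaped like the two dataframes returned by run_evals above
idx = pd.Index(["a3e33d1e526e97bd"], name="context.span_id")
qa_df = pd.DataFrame({"label": ["correct"], "score": [1]}, index=idx)
hallucination_df = pd.DataFrame({"label": ["factual"], "score": [1]}, index=idx)

# One row per span with both verdicts side by side
summary = qa_df.join(hallucination_df, lsuffix="_qa", rsuffix="_hallucination")
print(summary)
```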
