Question Answering with Amazon SageMaker, Chroma and Haystack


Last updated: April 17, 2025

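Amazon SageMaker is a comprehensive, fully managed machine learning service that lets data scientists and developers build, train, and deploy ML models efficiently. You can choose from a range of foundation models to find the one that best fits your use case.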

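In this notebook, we'll build a generative question answering application with Haystack, using the newly added Amazon SageMaker integration to generate answers and Chroma to store our documents efficiently. The demo uses several Wikipedia pages about NASA's Mars missions 🚀 to illustrate the step-by-step development of a QA application.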

Setting Up the Development Environment

Install dependencies

%%bash

pip install chroma-haystack amazon-sagemaker-haystack wikipedia typing_extensions

Deploy a model on SageMaker

To use models on Amazon SageMaker, you first need to deploy them. This example uses Falcon 7B Instruct BF16, so make sure this model is deployed in your account before continuing.

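For help, you can check out:

The Amazon SageMaker JumpStart documentation.
This notebook on how to deploy Falcon models programmatically from a notebook.
This blog post on how to deploy SageMaker models for Haystack 1.x.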
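
Before moving on, it can help to confirm that the endpoint is actually serving. The following is a minimal sketch, not part of the original notebook: `endpoint_is_ready` is a hypothetical helper name, and it assumes a client that behaves like the boto3 SageMaker client, whose `describe_endpoint` call reports an `EndpointStatus` field. The region in the usage comment is a placeholder.

```python
def endpoint_is_ready(client, endpoint_name):
    """Return True if the SageMaker endpoint reports the 'InService' status.

    `client` is expected to behave like boto3.client("sagemaker").
    """
    description = client.describe_endpoint(EndpointName=endpoint_name)
    return description.get("EndpointStatus") == "InService"


# Usage (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   client = boto3.client("sagemaker", region_name="us-east-1")  # your region
#   print(endpoint_is_ready(client, "jumpstart-dft-hf-llm-falcon-7b-instruct-bf16"))
```

If the endpoint does not exist, the underlying boto3 call raises an error instead of returning a status, so a missing deployment fails loudly rather than returning False.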

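API Keys

To use Amazon SageMaker, you need to set a few environment variables: AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, and usually a region via AWS_REGION. Once logged in to your account, you can find the keys in the "Security credentials" section of your IAM user. For a detailed guide, see the documentation on managing access keys for IAM users.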

import os
from getpass import getpass

os.environ["AWS_ACCESS_KEY_ID"] = getpass("aws_access_key_id: ")
os.environ["AWS_SECRET_ACCESS_KEY"] = getpass("aws_secret_access_key: ")
os.environ["AWS_REGION"] = input("aws_region_name: ")
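
Since an empty or mistyped value only surfaces later as an authentication error, a quick check right after setting the variables can save time. This is a small sketch under the assumption that the three variable names listed above are the ones you need; `missing_aws_vars` is a hypothetical helper, not part of Haystack or boto3.

```python
import os

REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION")


def missing_aws_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]


missing = missing_aws_vars()
if missing:
    print("Missing settings:", ", ".join(missing))
```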

Load data from Wikipedia

We'll use the Python library wikipedia to download the Wikipedia pages related to NASA's Mars rovers.

The pages are then converted into Haystack Documents.

import wikipedia
from haystack.dataclasses import Document

wiki_pages = [
    "Ingenuity_(helicopter)",
    "Perseverance_(rover)",
    "Curiosity_(rover)",
    "Opportunity_(rover)",
    "Spirit_(rover)",
    "Sojourner_(rover)"
]

raw_docs = []
for title in wiki_pages:
    # auto_suggest=False makes the library look up the exact title given
    page = wikipedia.page(title=title, auto_suggest=False)
    doc = Document(content=page.content, meta={"title": page.title, "url": page.url})
    raw_docs.append(doc)

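Building the Indexing Pipeline

Our indexing pipeline cleans the Wikipedia pages, splits them into chunks, and stores the chunks in a ChromaDocumentStore.

Let's build the pipeline below, which we will then run to index our documents into the document store.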

from pathlib import Path

from haystack import Pipeline
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy
from haystack_integrations.document_stores.chroma import ChromaDocumentStore

## Initialize ChromaDocumentStore
document_store = ChromaDocumentStore()

## Create pipeline components
cleaner = DocumentCleaner()
splitter = DocumentSplitter(split_by="sentence", split_length=10, split_overlap=2)
writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)

## Add components to the pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("cleaner", cleaner)
indexing_pipeline.add_component("splitter", splitter)
indexing_pipeline.add_component("writer", writer)

## Connect the components to each other
indexing_pipeline.connect("cleaner", "splitter")
indexing_pipeline.connect("splitter", "writer")

Run the pipeline with the documents you want to index (note that this step may take some time):

indexing_pipeline.run({"cleaner":{"documents":raw_docs}})
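
To build intuition for the `split_length=10, split_overlap=2` settings used by the splitter, here is a plain-Python sketch of a sliding window over sentences. It is an illustration only, not Haystack's implementation: `window_chunks` is a hypothetical helper, and the sentences are placeholders.

```python
def window_chunks(units, length=10, overlap=2):
    """Group `units` into windows of `length` items sharing `overlap` items."""
    if not 0 <= overlap < length:
        raise ValueError("overlap must be non-negative and smaller than length")
    step = length - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + length])
        if start + length >= len(units):
            break  # the last window already reaches the end
    return chunks


# Placeholder sentences standing in for a cleaned Wikipedia page
sentences = [f"Sentence {i}." for i in range(1, 26)]
for chunk in window_chunks(sentences):
    print(chunk[0], "...", chunk[-1], f"({len(chunk)} sentences)")
```

With 25 sentences this produces three windows, and each window after the first starts with the last two sentences of the previous one. The overlap keeps a fact that straddles a chunk boundary retrievable from at least one chunk.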

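Build the query pipeline

Let's create another pipeline to query our application. In this pipeline, we'll use ChromaQueryTextRetriever to retrieve relevant information from the ChromaDocumentStore, and the Falcon 7B Instruct BF16 model, through SagemakerGenerator, to generate answers.

Next, we'll use PromptBuilder to create a prompt for our task following the retrieval-augmented generation (RAG) approach. The prompt places the retrieved context ahead of the question so the model can ground its answer in it. Finally, we'll connect the three components to complete the pipeline.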

from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack_integrations.components.generators.amazon_sagemaker import SagemakerGenerator
from haystack_integrations.components.retrievers.chroma import ChromaQueryTextRetriever

# Create pipeline components
retriever = ChromaQueryTextRetriever(document_store=document_store, top_k=3)

# Initialize the SagemakerGenerator with a model deployed on Amazon SageMaker
# The `model` value must match your endpoint name; change it if yours differs.
model = 'jumpstart-dft-hf-llm-falcon-7b-instruct-bf16'
generator = SagemakerGenerator(model=model, generation_kwargs={"max_new_tokens":256})
template = """
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Answer based on the information above: {{question}}
"""
prompt_builder = PromptBuilder(template=template)

## Add components to the pipeline
rag_pipeline = Pipeline()
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("llm", generator)

## Connect the components to each other
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
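
To see roughly what the model receives, the sketch below approximates the rendered prompt in plain Python. It is not how PromptBuilder works internally (PromptBuilder renders the Jinja template above, and the exact whitespace will differ); `preview_prompt` is a hypothetical helper and the chunk texts are placeholders.

```python
def preview_prompt(contents, question):
    """Approximate the rendered RAG prompt: document texts, then the question."""
    context = "\n".join(text.strip() for text in contents)
    return f"{context}\n\nAnswer based on the information above: {question}"


example = preview_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],  # placeholder texts
    "When did Opportunity land?",
)
print(example)
```

With `top_k=3` on the retriever, the real prompt would contain three such chunks ahead of the question.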

Ask your questions and learn about NASA's Mars missions using a model hosted on Amazon SageMaker!

question = "When did Opportunity land?"
response = rag_pipeline.run({"retriever": {"query": question}, "prompt_builder": {"question": question}})

print(response["llm"]["replies"][0])

Opportunity landed on Mars on January 24, 2004.

question = "Is Ingenuity mission over?"
response = rag_pipeline.run({"retriever": {"query": question}, "prompt_builder": {"question": question}})

print(response["llm"]["replies"][0])

Yes, the Ingenuity mission is over. The helicopter made a total of 72 flights over a period of about 3 years until rotor damage sustained in January 2024 forced an end to the mission.

question = "What was the name of the first NASA rover to land on Mars?"
response = rag_pipeline.run({"retriever": {"query": question}, "prompt_builder": {"question": question}})

print(response["llm"]["replies"][0])

The first NASA rover to land on Mars was called Sojourner.
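
The three calls above repeat the same input dictionary, so a small wrapper can tidy things up. This is a sketch rather than part of the original notebook: `ask` is a hypothetical helper that assumes the component names used above ("retriever", "prompt_builder", "llm") and returns None when the generator produces no reply.

```python
def ask(pipeline, question):
    """Run the RAG pipeline for one question and return the first reply, or None."""
    result = pipeline.run(
        {"retriever": {"query": question}, "prompt_builder": {"question": question}}
    )
    replies = result.get("llm", {}).get("replies", [])
    return replies[0] if replies else None


# Usage with the pipeline built above:
#   for q in ["When did Opportunity land?", "Is Ingenuity mission over?"]:
#       print(q, "->", ask(rag_pipeline, q))
```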