PDF-Based Question Answering with Amazon Bedrock and Haystack


Last updated: July 8, 2025

Notebook by Bilge Yucel

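Amazon Bedrock is a fully managed service that makes high-performing foundation models from leading AI startups and Amazon available through a single API. You can choose from a range of foundation models to find the one best suited to your use case.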

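In this notebook, we'll build a customized generative question answering application for PDF files using Haystack's newly added Amazon Bedrock integration, with OpenSearch as the document store. The demo walks step by step through developing a QA application tailored to the Bedrock documentation, showing what Bedrock can do along the way 🚀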

Set Up the Development Environment

Install dependencies

%%bash

pip install -q opensearch-haystack amazon-bedrock-haystack pypdf

Download Files

For this application, we'll use the Amazon Bedrock user guide. Amazon Bedrock provides a PDF version of the guide. Let's download it!

!wget "https://docs.aws.amazon.com/pdfs/bedrock/latest/userguide/bedrock-ug.pdf"

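Note: As an alternative, you can uncomment the code below to download the PDF to /content/bedrock-documentation.pdf 👇🏼 If you do, pass that path to the indexing pipeline instead of /content/bedrock-ug.pdf.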

# import os

# import boto3
# from botocore import UNSIGNED
# from botocore.config import Config

# s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
# s3.download_file('core-engineering', 'public/blog-posts/bedrock-documentation.pdf', '/content/bedrock-documentation.pdf')

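Initialize an OpenSearch Instance on Colab

OpenSearch is a fully open source search and analytics engine and is compatible with Amazon OpenSearch Service, which is helpful if you'd like to deploy, operate, and scale your OpenSearch cluster later on.

Let's install OpenSearch and start an instance on Colab. For other installation options, check out the OpenSearch documentation.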

!wget https://artifacts.opensearch.org/releases/bundle/opensearch/2.11.1/opensearch-2.11.1-linux-x64.tar.gz
!tar -xvf opensearch-2.11.1-linux-x64.tar.gz
!chown -R daemon:daemon opensearch-2.11.1
# disabling security. Be mindful when you want to disable security in production systems
!sudo echo 'plugins.security.disabled: true' >> opensearch-2.11.1/config/opensearch.yml

%%bash --bg
cd opensearch-2.11.1 && sudo -u daemon -- ./bin/opensearch

OpenSearch needs about 30 seconds to fully start the server, so we wait before continuing.

import time

time.sleep(30)
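
A fixed 30-second sleep can be too short on a slow machine and needlessly long on a fast one. As an optional alternative, here is a minimal sketch that polls the server until it responds. It assumes OpenSearch is listening on http://localhost:9200 over plain HTTP (plausible here because the security plugin was disabled above, but treat the URL as an assumption). The helper names opensearch_is_up and wait_until_ready are made up for this illustration and are not part of Haystack or OpenSearch.

```python
import time
import urllib.error
import urllib.request


def opensearch_is_up(url="http://localhost:9200", timeout=2.0):
    """Return True if an HTTP GET to `url` answers with status 200.

    The default URL is an assumption about where the local instance listens.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, etc.: the server is not ready yet.
        return False


def wait_until_ready(check, timeout=120.0, interval=2.0):
    """Call `check()` every `interval` seconds until it returns True.

    Returns True once `check()` succeeds, or False if `timeout` seconds
    elapse first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In the notebook you could then call wait_until_ready(opensearch_is_up) in place of time.sleep(30) and stop early if it returns False.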

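API Keys

To use Amazon Bedrock, you need an aws_access_key_id and an aws_secret_access_key, and you must specify an aws_region_name. Once logged into your account, find these keys under the IAM user's "Security Credentials" section. For detailed guidance, refer to the documentation on managing access keys for IAM users.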

import os
from getpass import getpass

os.environ["AWS_ACCESS_KEY_ID"] = getpass("aws_access_key_id: ")
os.environ["AWS_SECRET_ACCESS_KEY"] = getpass("aws_secret_access_key: ")
os.environ["AWS_DEFAULT_REGION"] = input("aws_region_name: ")

aws_access_key_id: ··········
aws_secret_access_key: ··········
aws_region_name: us-east-1
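
Since a blank entry at any of these prompts would only surface later as an authentication error, it can help to check the three variables up front. The following is a small optional sketch; missing_aws_settings and REQUIRED_AWS_VARS are hypothetical names introduced for this example, and the check only verifies that values are present, not that the credentials are valid.

```python
import os

# The three environment variables set in the cell above.
REQUIRED_AWS_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
)


def missing_aws_settings(env=None):
    """Return the names of required AWS variables that are unset or blank.

    `env` defaults to os.environ; any mapping can be passed instead.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_AWS_VARS if not env.get(name, "").strip()]
```

An empty list from missing_aws_settings() means all three values were provided.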

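Building the Indexing Pipeline

Our indexing pipeline will convert the PDF file into a Haystack Document using PyPDFToDocument, preprocess it by cleaning it and splitting it into chunks, and then store the chunks in OpenSearchDocumentStore.

Let's build the pipeline below, then run it to index our file into the document store.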

from pathlib import Path

from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy
from haystack_integrations.document_stores.opensearch import OpenSearchDocumentStore

## Initialize the OpenSearchDocumentStore
document_store = OpenSearchDocumentStore()

## Create pipeline components
converter = PyPDFToDocument()
cleaner = DocumentCleaner()
splitter = DocumentSplitter(split_by="sentence", split_length=10, split_overlap=2)
writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)

## Add components to the pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("converter", converter)
indexing_pipeline.add_component("cleaner", cleaner)
indexing_pipeline.add_component("splitter", splitter)
indexing_pipeline.add_component("writer", writer)

## Connect the components to each other
indexing_pipeline.connect("converter", "cleaner")
indexing_pipeline.connect("cleaner", "splitter")
indexing_pipeline.connect("splitter", "writer")

Run the pipeline with the PDF. This may take around 4 minutes if your notebook is running on CPU.

indexing_pipeline.run({"converter": {"sources": [Path("/content/bedrock-ug.pdf")]}})

{'writer': {'documents_written': 1060}}
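
To make the splitter settings more concrete, here is a simplified, plain-Python illustration of what split_length=10 with split_overlap=2 means: each chunk holds up to 10 sentences, and consecutive chunks share 2 sentences. This is only a sketch of the sliding-window idea, not Haystack's DocumentSplitter implementation (which also handles sentence detection and may treat the final chunk differently); chunk_with_overlap is a hypothetical helper name.

```python
def chunk_with_overlap(units, length=10, overlap=2):
    """Group `units` into windows of `length` items sharing `overlap` items."""
    if length <= 0 or not 0 <= overlap < length:
        raise ValueError("need length > 0 and 0 <= overlap < length")
    step = length - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + length])
        if start + length >= len(units):
            break  # this window already reached the end of the input
    return chunks


# 25 placeholder "sentences" standing in for real text.
sentences = [f"s{i}" for i in range(25)]
chunks = chunk_with_overlap(sentences)

for chunk in chunks:
    print(chunk[0], "...", chunk[-1], f"({len(chunk)} sentences)")
```

The overlap keeps a little shared context between neighboring chunks, so a passage that straddles a chunk boundary is less likely to be cut off from the sentences it depends on.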

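Building the Query Pipeline

Let's create another pipeline to query our application. In this pipeline, we'll use OpenSearchBM25Retriever to retrieve relevant information from the OpenSearchDocumentStore and the Amazon Nova model amazon.nova-lite-v1:0 to generate answers with AmazonBedrockChatGenerator. To test a different model, change the value of bedrock_model.

Next, we'll use ChatPromptBuilder to create a prompt for our task following the retrieval-augmented generation (RAG) approach. The prompt helps the model generate answers grounded in the provided context. Finally, we'll connect these three components to complete the pipeline.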

from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage
from haystack import Pipeline
from haystack_integrations.components.generators.amazon_bedrock import AmazonBedrockChatGenerator
from haystack_integrations.components.retrievers.opensearch import OpenSearchBM25Retriever

## Create pipeline components
retriever = OpenSearchBM25Retriever(document_store=document_store, top_k=15)

## Initialize the AmazonBedrockChatGenerator with an Amazon Bedrock model
bedrock_model = 'amazon.nova-lite-v1:0'
generator = AmazonBedrockChatGenerator(model=bedrock_model)
template = """
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Please answer the question based on the given information from Amazon Bedrock documentation.

{{query}}
"""
prompt_builder = ChatPromptBuilder(template=[ChatMessage.from_user(template)], required_variables="*")

## Add components to the pipeline
rag_pipeline = Pipeline()
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("llm", generator)

## Connect the components to each other
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7d1aa6550150>
🚅 Components
  - retriever: OpenSearchBM25Retriever
  - prompt_builder: ChatPromptBuilder
  - llm: AmazonBedrockChatGenerator
🛤️ Connections
  - retriever.documents -> prompt_builder.documents (List[Document])
  - prompt_builder.prompt -> llm.messages (List[ChatMessage])
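
To get a feel for what the model receives, here is a rough plain-Python approximation of the prompt that the template produces: the retrieved document contents first, then the instruction, then the question. This is a sketch only. The real rendering is done by ChatPromptBuilder with Jinja, so whitespace and indentation will differ, and render_rag_prompt is a hypothetical helper, not a Haystack function. The two sample passages are invented placeholders.

```python
def render_rag_prompt(document_contents, query):
    """Approximate the RAG prompt: context passages, instruction, question."""
    context = "\n".join(document_contents)
    return (
        f"{context}\n\n"
        "Please answer the question based on the given information "
        "from Amazon Bedrock documentation.\n\n"
        f"{query}"
    )


# Placeholder passages standing in for retrieved document chunks.
sample_contents = [
    "Placeholder passage one.",
    "Placeholder passage two.",
]
prompt = render_rag_prompt(sample_contents, "What is Amazon Bedrock?")
print(prompt)
```

With top_k=15, the context section of the real prompt contains up to 15 retrieved chunks.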

Ask your question and learn about the Amazon Bedrock service using Amazon Bedrock models!

question = "What is Amazon Bedrock?"
response = rag_pipeline.run({"query": question})

print(response["llm"]["replies"][0].text)

Amazon Bedrock is a fully managed service that makes high-performing foundation models (FMs) from leading AI startups and Amazon available for use through a unified API. Key capabilities include:

- Easily experiment with and evaluate top foundation models for various use cases. Models are available from providers like AI21 Labs, Anthropic, Cohere, Meta, and Stability AI.

- Privately customize models with your own data using techniques like fine-tuning and retrieval augmented generation (RAG). 

- Build agents that execute tasks using enterprise systems and data sources.

- Serverless experience so you can get started quickly without managing infrastructure.

- Integrate customized models into applications using AWS tools.

So in summary, Amazon Bedrock provides easy access to top AI models that you can customize and integrate into apps to build intelligent solutions. It's a fully managed service focused on generative AI.

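Other Queries

You can also try these queries:

How can I set up Amazon Bedrock?
How can I fine-tune foundation models?
How should I form my prompts for Amazon Titan models?
How should I form my prompts for Claude models?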
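
If you'd like to run several queries in one go, a small loop is enough. The sketch below assumes the pipeline returns the same response shape used earlier in this notebook (response["llm"]["replies"][0].text). To keep the example runnable without AWS credentials, it uses a stand-in EchoPipeline class; both ask_all and EchoPipeline are hypothetical names for illustration and are not part of Haystack.

```python
from types import SimpleNamespace


def ask_all(pipeline, questions):
    """Run each question through `pipeline` and collect the first reply's text."""
    answers = {}
    for question in questions:
        response = pipeline.run({"query": question})
        answers[question] = response["llm"]["replies"][0].text
    return answers


class EchoPipeline:
    """Stand-in that mimics the response shape; it does not call any model."""

    def run(self, data):
        reply = SimpleNamespace(text=f"(stub answer to: {data['query']})")
        return {"llm": {"replies": [reply]}}


questions = [
    "How can I set up Amazon Bedrock?",
    "How can I fine-tune foundation models?",
    "How should I form my prompts for Amazon Titan models?",
    "How should I form my prompts for Claude models?",
]
answers = ask_all(EchoPipeline(), questions)
for question, answer in answers.items():
    print(question, "->", answer)
```

In the notebook, passing rag_pipeline instead of EchoPipeline() would send each question to the Bedrock model.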