RAG: Extracting and Using Website Content for Question Answering with the Apify-Haystack Integration

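Last updated: October 3, 2024

Author: Jiri Spilka (Apify)

In this tutorial, we will use the apify-haystack integration to call Website Content Crawler and scrape text content from the Haystack website. We will then use OpenAIDocumentEmbedder to compute text embeddings and InMemoryDocumentStore to store the documents in a temporary in-memory database. The last step is a retrieval-augmented generation pipeline that answers users' questions based on the scraped data.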

Install dependencies

!pip install apify-haystack haystack-ai

Set up the API keys

You need an Apify account and your APIFY_API_TOKEN.

You also need an OpenAI account and your OPENAI_API_KEY.

import os
from getpass import getpass

os.environ["APIFY_API_TOKEN"] = getpass("Enter YOUR APIFY_API_TOKEN")
os.environ["OPENAI_API_KEY"] = getpass("Enter YOUR OPENAI_API_KEY")

Enter YOUR APIFY_API_TOKEN··········
Enter YOUR OPENAI_API_KEY··········
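
Before starting a crawl, it can help to confirm that both keys actually made it into the environment. The following is a minimal sketch using only the standard library; `REQUIRED_KEYS` and `missing_keys` are hypothetical names introduced here for illustration and are not part of apify-haystack or Haystack.

```python
import os
from typing import Mapping

# The two variable names used in this tutorial.
REQUIRED_KEYS = ("APIFY_API_TOKEN", "OPENAI_API_KEY")


def missing_keys(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or empty (hypothetical helper)."""
    return [name for name in REQUIRED_KEYS if not env.get(name)]


missing = missing_keys()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("Both API keys are set.")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try without touching the real environment.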

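Scrape data from the Haystack documentation with Website Content Crawler

Now we call Website Content Crawler using the Haystack component ApifyDatasetFromActorCall. First, we need to define the parameters for Website Content Crawler, and then the data to be saved to the vector database.

A detailed description of actor_id and the input parameters (the run_input variable) can be found on the Website Content Crawler input page.

In this example, we define startUrls and limit the number of crawled pages to five.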

actor_id = "apify/website-content-crawler"
run_input = {
    "maxCrawlPages": 5,  # limit the number of pages to crawl
    "startUrls": [{"url": "https://haystack.com.cn/"}],
}
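
If you want to crawl several sites or change the page limit, the same input can be built programmatically. This is a sketch only: `build_run_input` is a hypothetical helper introduced here, and it uses just the two input keys shown above (`maxCrawlPages` and `startUrls`).

```python
from typing import Any


def build_run_input(start_urls: list[str], max_pages: int = 5) -> dict[str, Any]:
    """Build the Actor input dictionary (hypothetical helper, not part of apify-haystack)."""
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    return {
        "maxCrawlPages": max_pages,  # limit the number of pages to crawl
        "startUrls": [{"url": url} for url in start_urls],  # one entry per start URL
    }


run_input = build_run_input(["https://haystack.com.cn/"], max_pages=5)
print(run_input)
```

With the default arguments shown, the result is the same dictionary as the hand-written run_input above.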

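Next, we need to define a dataset mapping function, which requires knowing the output of Website Content Crawler. Typically, it is a JSON array of objects that looks like this (truncated for brevity):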

[
  {
    "url": "https://haystack.com.cn/",
    "text": "Haystack | Haystack - Multimodal - AI - Architect a next generation AI app around all modalities, not just text ..."
  },
  {
    "url": "https://haystack.com.cn/tutorials/24_building_chat_app",
    "text": "Building a Conversational Chat App ... "
  }
]

We will use dataset_mapping_function to convert this JSON into Haystack Document objects, as follows:

from haystack import Document

def dataset_mapping_function(dataset_item: dict) -> Document:
    return Document(content=dataset_item.get("text"), meta={"url": dataset_item.get("url")})
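
Crawled pages occasionally come back without usable text, so it may be worth checking items before they become documents. The sketch below runs without Haystack installed: it returns the keyword arguments for Document rather than a Document itself. `to_document_fields` is a hypothetical helper, and the second sample item is made up to show the empty-text case.

```python
from typing import Any, Optional


def to_document_fields(dataset_item: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return keyword arguments for Document, or None if the item has no usable text."""
    text = (dataset_item.get("text") or "").strip()
    if not text:
        return None
    return {"content": text, "meta": {"url": dataset_item.get("url")}}


sample_items = [
    {"url": "https://haystack.com.cn/", "text": "Haystack | Haystack ..."},
    {"url": "https://haystack.com.cn/tutorials", "text": "   "},  # hypothetical empty page
]

# Keep only the items that produced usable fields.
fields = [f for f in map(to_document_fields, sample_items) if f is not None]
print(fields)
```

Each surviving entry can then be turned into a document with `Document(**entry)`, which matches the `content` and `meta` arguments used by dataset_mapping_function above.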

And the definition of ApifyDatasetFromActorCall:

from apify_haystack import ApifyDatasetFromActorCall

apify_dataset_loader = ApifyDatasetFromActorCall(
    actor_id=actor_id,
    run_input=run_input,
    dataset_mapping_function=dataset_mapping_function,
)

Before actually running Website Content Crawler, we need to define the document embedder and the document store:

from haystack.components.embedders import OpenAIDocumentEmbedder
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
docs_embedder = OpenAIDocumentEmbedder()

After that, we can call Website Content Crawler and print the scraped data:

# Crawl the website and collect the scraped pages as documents
# Crawling takes some time (1-2 minutes); you can monitor progress at https://console.apify.com/actors/runs

docs = apify_dataset_loader.run()
print(docs)

{'documents': [Document(id=6c4d570874ff59ed4e06017694bee8a72d766d2ed55c6453fbc9ea91fd2e6bde, content: 'Haystack | Haystack Luma · Delightful Events Start HereAWS Summit Berlin 2023: Building Generative A...', meta: {'url': 'https://haystack.com.cn/'}), Document(id=d420692bf66efaa56ebea200a4a63597667bdc254841b99654239edf67737bcb, content: 'Tutorials & Walkthroughs | Haystack
Tutorials & Walkthroughs2.0
Whether you’re a beginner or an expe...', meta: {'url': 'https://haystack.com.cn/tutorials'}), Document(id=5a529a308d271ba76f66a060c0b706b73103406ac8a853c19f20e1594823efe8, content: 'Get Started | Haystack
Haystack is an open-source Python framework that helps developers build LLM-p...', meta: {'url': 'https://haystack.com.cn/overview/quick-start'}), Document(id=1d126a03ae50586729846d492e9e8aca802d7f281a72a8869ded08ebc5585a36, content: 'What is Haystack? | Haystack
Haystack is an open source framework for building production-ready LLM ...', meta: {'url': 'https://haystack.com.cn/overview/intro'}), Document(id=4324a62242590d4ecf9b080319607fa1251aa0822bbe2ce6b21047e783999703, content: 'Integrations | Haystack
The Haystack ecosystem integrates with many other technologies, such as vect...', meta: {'url': 'https://haystack.com.cn/integrations'})]}

Compute the embeddings and store them in the database:

embeddings = docs_embedder.run(docs.get("documents"))
document_store.write_documents(embeddings["documents"])

Calculating embeddings: 100%|██████████| 1/1 [00:00<00:00,  3.29it/s]

5

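Retrieval and LLM generation pipeline

Once the scraped data is in the database, we can set up a classic retrieval-augmented generation pipeline. See the RAG Haystack tutorial for details.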
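
To build some intuition for the retrieval step, here is a toy sketch of embedding-based ranking in plain Python. It is not Haystack's implementation, and the similarity function the retriever actually uses may differ. The three-dimensional vectors and the names `cosine_similarity` and `top_k` are invented for illustration; real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], docs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the names of the k documents most similar to the query embedding."""
    scored = [(cosine_similarity(query, emb), name) for name, emb in docs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Made-up embeddings standing in for three crawled pages.
toy_embeddings = {
    "intro": [0.9, 0.1, 0.0],
    "tutorials": [0.1, 0.9, 0.0],
    "integrations": [0.0, 0.2, 0.9],
}
query_embedding = [1.0, 0.2, 0.0]
print(top_k(query_embedding, toy_embeddings, k=2))  # ['intro', 'tutorials']
```

In the pipeline that follows, the text embedder produces the query embedding and the retriever performs the ranking against the stored document embeddings.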

from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.embedders import OpenAITextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

text_embedder = OpenAITextEmbedder()
retriever = InMemoryEmbeddingRetriever(document_store)
generator = OpenAIGenerator(model="gpt-4o-mini")

template = """
Given the following information, answer the question.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:
"""

prompt_builder = PromptBuilder(template=template)

# Add components to your pipeline
print("Initializing pipeline...")
pipe = Pipeline()
pipe.add_component("embedder", text_embedder)
pipe.add_component("retriever", retriever)
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("llm", generator)

# Now, connect the components to each other
pipe.connect("embedder.embedding", "retriever.query_embedding")
pipe.connect("retriever", "prompt_builder.documents")
pipe.connect("prompt_builder", "llm")

Initializing pipeline...

<haystack.core.pipeline.pipeline.Pipeline object at 0x7c02095efdc0>
🚅 Components
  - embedder: OpenAITextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: PromptBuilder
  - llm: OpenAIGenerator
🛤️ Connections
  - embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - prompt_builder.prompt -> llm.prompt (str)
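
To see roughly what the prompt builder hands to the LLM, the template can be approximated with plain string formatting. This is only a sketch: `render_prompt` is a hypothetical stand-in for the Jinja template above, its whitespace will not match the real rendering exactly, and the document texts are shortened placeholders.

```python
def render_prompt(documents: list[str], question: str) -> str:
    """Approximate the tutorial's prompt template with plain string formatting."""
    # One indented line per retrieved document, as in the template's for-loop.
    context = "\n".join(f"    {content}" for content in documents)
    return (
        "Given the following information, answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = render_prompt(
    ["Haystack is an open-source Python framework ...", "What is Haystack? | Haystack ..."],
    "What is haystack?",
)
print(prompt)
```

The retrieved documents fill the Context section and the user's question is appended at the end, so the model is steered toward answering from the scraped text.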

Now you can ask questions about Haystack and get answers grounded in the scraped content:

question = "What is haystack?"

response = pipe.run({"embedder": {"text": question}, "prompt_builder": {"question": question}})

print(f"question: {question}")
print(f"answer: {response['llm']['replies'][0]}")

question: What is haystack?
answer: Haystack is an open-source Python framework designed to help developers build LLM-powered custom applications. It is used for creating production-ready LLM applications, retrieval-augmented generative pipelines, and state-of-the-art search systems that work effectively over large document collections. Haystack offers comprehensive tooling for developing AI systems that use LLMs from platforms like Hugging Face, OpenAI, Cohere, Mistral, and more. It provides a modular and intuitive framework that allows users to quickly integrate the latest AI models, offering flexibility and ease of use. The framework includes components and pipelines that enable developers to build end-to-end AI projects without the need to understand the underlying models deeply. Haystack caters to LLM enthusiasts and beginners alike, providing a vibrant open-source community for collaboration and learning.
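
Note that the question is passed twice: once to the embedder, for retrieval, and once to the prompt builder, for the final prompt. When asking several questions, a small helper keeps the two in sync. `build_pipeline_inputs` is a hypothetical name, and the second question is an example made up for illustration.

```python
def build_pipeline_inputs(question: str) -> dict[str, dict[str, str]]:
    """Build the input dictionary in the shape passed to pipe.run() in this tutorial."""
    return {
        "embedder": {"text": question},  # embedded and used for retrieval
        "prompt_builder": {"question": question},  # inserted into the prompt template
    }


questions = ["What is haystack?", "Which integrations does Haystack offer?"]
all_inputs = [build_pipeline_inputs(q) for q in questions]
for inputs in all_inputs:
    print(inputs)
```

Each dictionary can then be passed to the pipeline defined above, for example `pipe.run(build_pipeline_inputs(question))`.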