Integration: Apify
Use the Apify-Haystack integration to extract data from the web and automate web tasks.
Overview
Apify is a web scraping and data extraction platform. It helps automate web tasks and extract content from e-commerce websites, social media (Facebook, Instagram, TikTok), search engines, online maps, and more. Apify provides more than two thousand ready-made cloud solutions called Actors.
Installation
Install the apify-haystack integration:

```bash
pip install apify-haystack
```
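To sanity-check the install, here is a minimal hedged snippet; it only imports the package (using the same component name as the examples below) and reports whether the token the examples expect is present in the environment:

```python
# Smoke test: the import succeeding means apify-haystack is installed.
import os

from apify_haystack import ApifyDatasetFromActorCall  # noqa: F401

# The examples below expect APIFY_API_TOKEN in the environment.
print("APIFY_API_TOKEN set:", bool(os.getenv("APIFY_API_TOKEN")))
```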
Usage
Once installed, you will have access to more than two thousand ready-made apps, called Actors, in the Apify Store. For example, you can:
- Load a dataset from Apify and convert it to Haystack Documents
- Extract data from Facebook/Instagram and save it in an InMemoryDocumentStore
- Crawl websites, scrape text content, and store it in an InMemoryDocumentStore
- Retrieval-Augmented Generation (RAG): extract text from a website and use it for question answering
The integration implements the following components (see these examples for how to use them):
- ApifyDatasetLoader: loads a dataset created by an Apify Actor
- ApifyDatasetFromActorCall: calls an Apify Actor, loads the dataset, and converts it to Haystack Documents
- ApifyDatasetFromTaskCall: calls an Apify task, loads the dataset, and converts it to Haystack Documents
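Only ApifyDatasetFromActorCall is shown in full on this page. As a minimal sketch of the other entry point, ApifyDatasetLoader wraps a dataset that already exists on the Apify platform; the dataset ID below is a placeholder, and the constructor arguments should be checked against the package:

```python
from haystack import Document

from apify_haystack import ApifyDatasetLoader


def mapping(dataset_item: dict) -> Document:
    # Same "text"/"url" fields as in the Website Content Crawler examples below
    return Document(content=dataset_item.get("text"), meta={"url": dataset_item.get("url")})


# "YOUR-DATASET-ID" is a placeholder for an existing dataset on the Apify platform;
# ApifyDatasetFromTaskCall is analogous, driven by a saved task instead of an Actor run.
loader = ApifyDatasetLoader(dataset_id="YOUR-DATASET-ID", dataset_mapping_function=mapping)
documents = loader.run().get("documents")
```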
You need an Apify account and an Apify API token to run these examples. You can sign up for a free account at Apify and get your Apify API token there.
In the examples below, specify your apify_api_token and run the script.
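Both scripts call load_dotenv(), so a convenient way to supply credentials is a .env file next to the script; a minimal sketch, where both values are placeholders:

```
APIFY_API_TOKEN=your-apify-api-token
OPENAI_API_KEY=your-openai-api-key
```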
ApifyDatasetFromActorCall on its own
Crawl a website with Apify's Website Content Crawler, scrape the text content, and convert it to Haystack Documents. You can browse other Actors in the Apify Store.
In the example below, the text content is extracted from https://haystack.com.cn/. You can control the number of crawled pages with the maxCrawlPages parameter. For a detailed overview of the parameters, see Website Content Crawler.
The script should produce the following output (truncated to a single Document):

```
Document(id=a617d376*****, content: 'Introduction to Haystack 2.x)
Haystack is an open-source framework fo...', meta: {'url': 'https://docs.haystack.com.cn/docs/intro'})
```
```python
from dotenv import load_dotenv
import os

from haystack import Document

from apify_haystack import ApifyDatasetFromActorCall

# Use APIFY_API_TOKEN from the .env file, or replace the placeholder below.
# setdefault avoids clobbering a token that load_dotenv already provided.
load_dotenv()
os.environ.setdefault("APIFY_API_TOKEN", "YOUR APIFY_API_TOKEN")

actor_id = "apify/website-content-crawler"
run_input = {
    "maxCrawlPages": 3,  # limit the number of pages to crawl
    "startUrls": [{"url": "https://haystack.com.cn/"}],
}


def dataset_mapping_function(dataset_item: dict) -> Document:
    """Convert an Apify dataset item to a Haystack Document

    Website Content Crawler returns a dataset with the following output fields:
    {
        "url": "https://haystack.com.cn",
        "text": "Haystack is an open-source framework for building production-ready LLM applications",
    }
    """
    return Document(content=dataset_item.get("text"), meta={"url": dataset_item.get("url")})


actor = ApifyDatasetFromActorCall(
    actor_id=actor_id,
    run_input=run_input,
    dataset_mapping_function=dataset_mapping_function,
)

print(f"Calling the Apify Actor {actor_id} ... crawling will take some time ...")
print("You can monitor the progress at: https://console.apify.com/actors/runs")

dataset = actor.run().get("documents")

print(f"Loaded {len(dataset)} documents from the Apify Actor {actor_id}:")
for d in dataset:
    print(d)
```
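Since dataset_mapping_function is an ordinary callable, it can carry any dataset field into Document.meta. A minimal sketch, assuming the crawler's items also expose a metadata.title field (an assumption; inspect the dataset items in the Apify console to see which fields your run actually produces):

```python
from haystack import Document


def mapping_with_title(dataset_item: dict) -> Document:
    # "metadata"/"title" are assumed fields; fall back to None if absent.
    metadata = dataset_item.get("metadata") or {}
    return Document(
        content=dataset_item.get("text"),
        meta={"url": dataset_item.get("url"), "title": metadata.get("title")},
    )
```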
ApifyDatasetFromActorCall in a RAG pipeline
Follow the 🧑‍🍳 Cookbook: Extract and use website content for question answering with Apify-Haystack integration for a complete runnable example.
Retrieval-Augmented Generation (RAG): extract a website's text content and use it for question answering. The extracted text is used to answer questions about the https://haystack.com.cn website.
Expected output:

```
question: "What is haystack?"
answer: Haystack is an open-source framework for building production-ready LLM applications
```
In addition to APIFY_API_TOKEN, you also need to specify OPENAI_API_KEY to run this example.
```python
import os
from getpass import getpass

from dotenv import load_dotenv
from haystack import Document, Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

from apify_haystack import ApifyDatasetFromActorCall

# Load APIFY_API_TOKEN and OPENAI_API_KEY from the .env file, or prompt for them
load_dotenv()
if "APIFY_API_TOKEN" not in os.environ:
    os.environ["APIFY_API_TOKEN"] = getpass("Enter YOUR APIFY_API_TOKEN")
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter YOUR OPENAI_API_KEY")

actor_id = "apify/website-content-crawler"
run_input = {
    "maxCrawlPages": 1,  # limit the number of pages to crawl
    "startUrls": [{"url": "https://haystack.com.cn/"}],
}


def dataset_mapping_function(dataset_item: dict) -> Document:
    """Convert an Apify dataset item to a Haystack Document

    Website Content Crawler returns a dataset with the following output fields:
    {
        "url": "https://haystack.com.cn",
        "text": "Haystack is an open-source framework for building production-ready LLM applications",
    }
    """
    return Document(content=dataset_item.get("text"), meta={"url": dataset_item.get("url")})


apify_dataset_loader = ApifyDatasetFromActorCall(
    actor_id=actor_id,
    run_input=run_input,
    dataset_mapping_function=dataset_mapping_function,
)

# Components
print("Initializing components...")
document_store = InMemoryDocumentStore()
docs_embedder = OpenAIDocumentEmbedder()
text_embedder = OpenAITextEmbedder()
retriever = InMemoryEmbeddingRetriever(document_store)
generator = OpenAIGenerator(model="gpt-3.5-turbo")

# Load documents from Apify
print("Crawling and indexing documents...")
print("You can visit https://console.apify.com/actors/runs to monitor the progress")
docs = apify_dataset_loader.run()
embeddings = docs_embedder.run(docs.get("documents"))
document_store.write_documents(embeddings["documents"])

template = """
Given the following information, answer the question.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{ question }}
Answer:
"""

prompt_builder = PromptBuilder(template=template)

# Add components to your pipeline
print("Initializing pipeline...")
pipe = Pipeline()
pipe.add_component("embedder", text_embedder)
pipe.add_component("retriever", retriever)
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("llm", generator)

# Now, connect the components to each other
pipe.connect("embedder.embedding", "retriever.query_embedding")
pipe.connect("retriever", "prompt_builder.documents")
pipe.connect("prompt_builder", "llm")

question = "What is haystack?"

print("Running pipeline ... ")
response = pipe.run({"embedder": {"text": question}, "prompt_builder": {"question": question}})

print(f"question: {question}")
print(f"answer: {response['llm']['replies'][0]}")

# Other questions
examples = [
    "Who created Haystack?",
    "Are there any upcoming events or community talks?",
]

for example in examples:
    response = pipe.run({"embedder": {"text": example}, "prompt_builder": {"question": example}})
    print(f"question: {example}")
    print(f"answer: {response['llm']['replies'][0]}")
```
License
apify-haystack is distributed under the terms of the Apache-2.0 license.
