AstraDB 🤝 Haystack Integration
Last updated: October 3, 2024
In this notebook, you'll learn how to use AstraDB as a data source for your Haystack pipelines.
Prerequisites
You'll need an OpenAI API key to follow along. (Haystack is model-agnostic, so you can use a different model if you prefer!)
You'll need the following variables to use the Haystack extension. The tutorial below shows you how to create an AstraDB database and how to save this information.
- API endpoint
- Token
- Astra keyspace
- Astra collection name
Follow the first step of this tutorial to create a free AstraDB database, and save your database ID, application token, keyspace, and database region.
Follow these steps to create a collection. Save your collection name.
Choose a number of dimensions that matches the embedding model you plan to use. In this example, we'll use the 384-dimensional model sentence-transformers/all-MiniLM-L6-v2.
Next, install the dependencies.
!pip install astra-haystack sentence-transformers
Here you'll enter your credentials and other configuration. In production code, sensitive credentials such as the application token should come from environment variables, so they aren't committed to source control.
from getpass import getpass
import os
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key:")
os.environ["ASTRA_DB_API_ENDPOINT"] = getpass("Enter your Astra API endpoint:")
os.environ["ASTRA_DB_APPLICATION_TOKEN"] = getpass("Enter your Astra application token (e.g. AstraCS:xxx):")
ASTRA_DB_COLLECTION_NAME = getpass("Enter your Astra collection name:")
Next, we'll build a Haystack pipeline that creates embeddings and adds them to the AstraDocumentStore.
import logging
from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy
from haystack_integrations.document_stores.astra import AstraDocumentStore
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)
embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"
# embedding_dim is the number of dimensions the embedding model supports
# (all-MiniLM-L6-v2 produces 384-dimensional vectors).
# AstraDocumentStore reads ASTRA_DB_API_ENDPOINT and
# ASTRA_DB_APPLICATION_TOKEN from the environment by default.
document_store = AstraDocumentStore(
astra_collection=ASTRA_DB_COLLECTION_NAME,
duplicates_policy=DuplicatePolicy.SKIP,
embedding_dim=384,
)
# Add Documents
documents = [
Document(content="There are over 7,000 languages spoken around the world today."),
Document(
content="Elephants have been observed to behave in a way that indicates"
" a high level of self-awareness, such as recognizing themselves in mirrors."
),
Document(
content="In certain parts of the world, like the Maldives, Puerto Rico, "
"and San Diego, you can witness the phenomenon of bioluminescent waves."
),
]
index_pipeline = Pipeline()
index_pipeline.add_component(
instance=SentenceTransformersDocumentEmbedder(model=embedding_model_name),
name="embedder",
)
index_pipeline.add_component(instance=DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP), name="writer")
index_pipeline.connect("embedder.documents", "writer.documents")
index_pipeline.run({"embedder": {"documents": documents}})
print(document_store.count_documents())
WARNING:astra_haystack.document_store:No documents written. Argument policy set to SKIP
3
Next, we'll create a RAG pipeline so we can query our documents.
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack_integrations.components.retrievers.astra import AstraEmbeddingRetriever
prompt_template = """
Given these documents, answer the question.
Documents:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}
Question: {{question}}
Answer:
"""
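PromptBuilder renders this template with Jinja2, substituting the retrieved documents and the question. Here's a quick illustration of what the rendered prompt looks like, using Jinja2 directly (the Doc dataclass is a hypothetical stand-in for Haystack's Document, so the example runs without any pipeline):

```python
from dataclasses import dataclass
from jinja2 import Template

# Doc stands in for Haystack's Document class for this illustration.
@dataclass
class Doc:
    content: str

template = Template(
    "Given these documents, answer the question.\n"
    "Documents:\n"
    "{% for doc in documents %}{{ doc.content }}\n{% endfor %}"
    "Question: {{ question }}\nAnswer:"
)
rendered = template.render(
    documents=[Doc("There are over 7,000 languages spoken around the world today.")],
    question="How many languages are there in the world today?",
)
print(rendered)
```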
rag_pipeline = Pipeline()
rag_pipeline.add_component(
instance=SentenceTransformersTextEmbedder(model=embedding_model_name),
name="embedder",
)
rag_pipeline.add_component(instance=AstraEmbeddingRetriever(document_store=document_store), name="retriever")
rag_pipeline.add_component(instance=PromptBuilder(template=prompt_template), name="prompt_builder")
rag_pipeline.add_component(instance=OpenAIGenerator(), name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
rag_pipeline.connect("embedder", "retriever")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")
# Draw the pipeline
rag_pipeline.draw("./rag_pipeline.png")
# Run the pipeline
question = "How many languages are there in the world today?"
result = rag_pipeline.run(
{
"embedder": {"text": question},
"retriever": {"top_k": 2},
"prompt_builder": {"question": question},
"answer_builder": {"query": question},
}
)
print(result)
The output should look similar to the following:
{'answer_builder': {'answers': [GeneratedAnswer(data='There are over 7,000 languages spoken around the world today.', query='How many languages are there in the world today?', documents=[Document(id=cfe93bc1c274908801e6670440bf2bbba54fad792770d57421f85ffa2a4fcc94, content: 'There are over 7,000 languages spoken around the world today.', score: 0.9267925, embedding: vector of size 384), Document(id=6f20658aeac3c102495b198401c1c0c2bd71d77b915820304d4fbc324b2f3cdb, content: 'Elephants have been observed to behave in a way that indicates a high level of self-awareness, such ...', score: 0.5357444, embedding: vector of size 384)], meta={'model': 'gpt-4o-mini-2024-07-18', 'index': 0, 'finish_reason': 'stop', 'usage': {'completion_tokens': 14, 'prompt_tokens': 83, 'total_tokens': 97}})]}}
Now you know how to use AstraDB as a data source for your Haystack pipelines. Thanks for reading! To learn more about Haystack, join our Discord or sign up for our monthly newsletter.
