Announcing the Astra DB Haystack Integration

Learn how to use the new Astra DB integration for Haystack in your RAG pipelines.

January 19, 2024

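The Haystack extension family is growing so fast that it's hard to keep up! Our latest addition is the Astra DB extension from DataStax. It's an open source package that helps you use Astra DB as a vector database for your Haystack pipelines.

Let's look at the benefits of Astra DB and how to use it with Haystack.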

Benefits of Astra DB

DataStax Astra DB is a serverless vector database built on Apache Cassandra. What makes Astra DB special?

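Interoperability with Cassandra's open source ecosystem and tooling.
Astra DB supports a variety of embedding models. A single Astra database instance can hold multiple collections with different vector sizes, which makes it easy to test different embedding models and find the best one for your use case.
It's serverless. What does that mean for a database? You don't have to manage individual instances or deal with tedious upgrades or scaling. All of that is handled for you behind the scenes.
Enterprise-grade scalability. Astra DB can be deployed on the major cloud providers (AWS, GCP, or Azure) and across multiple regions, depending on your needs.
At the time of writing, a free tier is available, so you can try it without a credit card.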

Create your Astra DB database

To make sure these instructions stay up to date, we'll point you to the Astra DB documentation for how to create a database.

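Create a free Astra DB database. Make a note of your credentials: you'll need your Astra API endpoint and Astra application token to use the Haystack extension.
Choose the number of dimensions that matches the embedding model you plan to use. In this example we'll use a 384-dimensional model, sentence-transformers/all-MiniLM-L6-v2.
Create a collection with the same number of dimensions as your embedding model. Save your collection name, since you'll need it too.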
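The dimension-matching requirement in the steps above is easy to break when you switch embedding models. As a minimal sketch, a check like the one below could run before anything is written to a collection. The helper name `check_embedding_dimension` is hypothetical and is not part of Haystack or the Astra integration; 384 is the dimension of sentence-transformers/all-MiniLM-L6-v2 mentioned above, and the vector here is a stand-in rather than a real embedding.

```python
from typing import Sequence


def check_embedding_dimension(embedding: Sequence[float], expected_dim: int) -> None:
    """Raise ValueError if the vector length does not match the collection dimension."""
    if len(embedding) != expected_dim:
        raise ValueError(
            f"Embedding has {len(embedding)} dimensions, "
            f"but the collection expects {expected_dim}."
        )


# Hypothetical stand-in for a vector produced by a 384-dimensional model.
fake_embedding = [0.0] * 384
check_embedding_dimension(fake_embedding, expected_dim=384)  # passes silently
```

Failing early with a readable message is usually easier to debug than a rejected write further down the pipeline.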

Getting started with the Astra DB Haystack integration

First, install the integration:

pip install astra-haystack sentence-transformers

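Remember earlier when I mentioned you'd need your credentials? Hopefully you saved them. If not, that's okay, you can go back to the Astra Portal to get them.

Note: if you're running this code in production, you should store these as environment variables to keep them secure.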

from getpass import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI key:")
os.environ["ASTRA_DB_API_ENDPOINT"] = getpass("Enter your Astra API Endpoint:")
os.environ["ASTRA_DB_APPLICATION_TOKEN"] = getpass("Enter your Astra application token (e.g. AstraCS:xxx):")
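Because the rest of the walkthrough relies on these three environment variables, it can help to fail early when one is missing. The following is a small sketch using only the standard library; `missing_env_vars` and `require_env_vars` are hypothetical helper names, not part of Haystack or the Astra integration.

```python
import os

# The variable names used in this post.
REQUIRED_VARS = ("OPENAI_API_KEY", "ASTRA_DB_API_ENDPOINT", "ASTRA_DB_APPLICATION_TOKEN")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


def require_env_vars(names=REQUIRED_VARS, environ=None):
    """Raise a clear error if any credential is missing."""
    missing = missing_env_vars(names, environ)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

Calling `require_env_vars()` right before creating the document store would surface a missing credential with a readable message.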

Using the Astra DocumentStore in an indexing pipeline

Next, we'll create a Haystack pipeline that creates embeddings from some documents and adds them to the AstraDocumentStore.

import logging

from haystack import Document, Pipeline

from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy
from haystack_integrations.document_stores.astra import AstraDocumentStore

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"

# Make sure ASTRA_DB_API_ENDPOINT and ASTRA_DB_APPLICATION_TOKEN environment variables are set before proceeding

# embedding_dimension is the number of dimensions the embedding model produces.
document_store = AstraDocumentStore(
    duplicates_policy=DuplicatePolicy.SKIP,
    embedding_dimension=384,
)


# Add Documents
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(
        content="Elephants have been observed to behave in a way that indicates"
        " a high level of self-awareness, such as recognizing themselves in mirrors."
    ),
    Document(
        content="In certain parts of the world, like the Maldives, Puerto Rico, "
        "and San Diego, you can witness the phenomenon of bioluminescent waves."
    ),
]
index_pipeline = Pipeline()
index_pipeline.add_component(
    instance=SentenceTransformersDocumentEmbedder(model=embedding_model_name),
    name="embedder",
)
index_pipeline.add_component(instance=DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP), name="writer")
index_pipeline.connect("embedder.documents", "writer.documents")

index_pipeline.run({"embedder": {"documents": documents}})

print(document_store.count_documents())

If all went well, there should be 3 documents. 🎉
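The pipeline above passes DuplicatePolicy.SKIP, so running it a second time should leave the count at 3 rather than adding copies. The block below is a conceptual, in-memory sketch of that skip behavior, not the Astra integration's implementation. It assumes document IDs are derived from a hash of the content; `TinyStore` and `content_id` are hypothetical names used only for illustration.

```python
import hashlib


def content_id(content: str) -> str:
    """Derive a stable ID from the text (assumed scheme, for illustration only)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


class TinyStore:
    """Hypothetical in-memory store that skips documents whose ID already exists."""

    def __init__(self):
        self._docs = {}

    def write_documents(self, contents):
        """Write new documents, skip known ones, and return how many were written."""
        written = 0
        for content in contents:
            doc_id = content_id(content)
            if doc_id in self._docs:
                continue  # SKIP: keep the existing document untouched
            self._docs[doc_id] = content
            written += 1
        return written

    def count_documents(self):
        """Return the number of stored documents."""
        return len(self._docs)


store = TinyStore()
texts = ["first fact", "second fact", "third fact"]
first_run = store.write_documents(texts)
second_run = store.write_documents(texts)  # same content, so nothing new is written
print(first_run, second_run, store.count_documents())  # prints: 3 0 3
```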

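Using the `AstraEmbeddingRetriever` in a Haystack RAG pipeline

In Haystack, every DocumentStore is tightly coupled with the Retriever that fetches documents from it. Astra DB is no exception. Here, we'll create a RAG pipeline in which the AstraEmbeddingRetriever fetches the documents relevant to your query.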

from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack_integrations.components.retrievers.astra import AstraEmbeddingRetriever

prompt_template = """
                Given these documents, answer the question.
                Documents:
                {% for doc in documents %}
                    {{ doc.content }}
                {% endfor %}
                Question: {{question}}
                Answer:
                """

rag_pipeline = Pipeline()
rag_pipeline.add_component(
    instance=SentenceTransformersTextEmbedder(model=embedding_model_name),
    name="embedder",
)
rag_pipeline.add_component(instance=AstraEmbeddingRetriever(document_store=document_store), name="retriever")
rag_pipeline.add_component(instance=PromptBuilder(template=prompt_template), name="prompt_builder")
rag_pipeline.add_component(instance=OpenAIGenerator(), name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
rag_pipeline.connect("embedder", "retriever")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")

# Run the pipeline
question = "How many languages are there in the world today?"
result = rag_pipeline.run(
    {
        "embedder": {"text": question},
        "retriever": {"top_k": 2},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    }
)

print(result)

The output should look like this:

{'answer_builder': {'answers': [GeneratedAnswer(data='There are over 7,000 languages spoken around the world today.', query='How many languages are there in the world today?', documents=[Document(id=cfe93bc1c274908801e6670440bf2bbba54fad792770d57421f85ffa2a4fcc94, content: 'There are over 7,000 languages spoken around the world today.', score: 0.9267925, embedding: vector of size 384), Document(id=6f20658aeac3c102495b198401c1c0c2bd71d77b915820304d4fbc324b2f3cdb, content: 'Elephants have been observed to behave in a way that indicates a high level of self-awareness, such ...', score: 0.5357444, embedding: vector of size 384)], meta={'model': 'gpt-3.5-turbo-0613', 'index': 0, 'finish_reason': 'stop', 'usage': {'completion_tokens': 14, 'prompt_tokens': 83, 'total_tokens': 97}})]}}
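Each retrieved document in this output carries a score, and top_k limits how many documents come back. As a rough sketch of embedding retrieval in general, the function below ranks stored vectors by cosine similarity to a query vector and keeps the top_k best. Treat it as an illustration under stated assumptions: Astra DB performs the vector search on the server and may compute or scale its scores differently, and `cosine_similarity` and `top_k_by_cosine` are hypothetical helper names. The three-dimensional vectors are toy values, not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_cosine(query, vectors, top_k=2):
    """Return (name, score) pairs for the top_k vectors most similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for document embeddings.
toy_vectors = {
    "languages": [1.0, 0.0, 0.0],
    "elephants": [0.6, 0.8, 0.0],
    "waves": [0.0, 0.0, 1.0],
}
ranked = top_k_by_cosine([1.0, 0.0, 0.0], toy_vectors, top_k=2)
print([name for name, _ in ranked])  # prints: ['languages', 'elephants']
```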

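Wrapping it up

If you've made it this far, you now know how to use Astra DB as a data source for your Haystack pipelines. To learn more about Haystack, join us on Discord or sign up for our monthly newsletter.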