Chroma Indexing and RAG Example
Last updated: July 8, 2025
Install dependencies
# Install the Chroma integration, Haystack will come as a dependency
!pip install -U chroma-haystack "huggingface_hub>=0.22.0"
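Indexing Pipeline: preprocess, split and index documents
In this section, we index documents into a Chroma DB collection by building a Haystack indexing pipeline. Here, we index documents from the VIM User Manual into the Haystack ChromaDocumentStore.
The example folder of the ChromaDocumentStore integration contains .txt files for these pages, so we build this indexing pipeline with the TextFileToDocument and DocumentWriter components.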
# Fetch data files from the GitHub repo
!curl -sL https://github.com/deepset-ai/haystack-core-integrations/tarball/main -o main.tar
# -p keeps mkdir from failing when the cell is re-run and main already exists
!mkdir -p main
!tar xf main.tar -C main --strip-components 1
!mv main/integrations/chroma/example/data .
import os
from pathlib import Path
from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.writers import DocumentWriter
from haystack_integrations.document_stores.chroma import ChromaDocumentStore
file_paths = ["data" / Path(name) for name in os.listdir("data")]
# Chroma is used in-memory so we use the same instances in the two pipelines below
document_store = ChromaDocumentStore()
indexing = Pipeline()
indexing.add_component("converter", TextFileToDocument())
indexing.add_component("writer", DocumentWriter(document_store))
indexing.connect("converter", "writer")
indexing.run({"converter": {"sources": file_paths}})
{'writer': {'documents_written': 36}}
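The listing above hands every entry in the data folder to the converter. As a minimal sketch, assuming you only want the .txt files and a stable order, the paths could be collected with pathlib instead. The helper name list_text_files is hypothetical and is not part of Haystack or Chroma.

```python
from pathlib import Path


def list_text_files(folder):
    """Return the .txt files directly inside `folder`, sorted by name.

    Hypothetical helper for illustration; not part of Haystack or Chroma.
    """
    # glob("*.txt") does not recurse; is_file() skips directories named *.txt
    return sorted(p for p in Path(folder).glob("*.txt") if p.is_file())
```

With this helper, file_paths = list_text_files("data") could be passed as the converter's sources in place of the os.listdir comprehension.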
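Query Pipeline: build retrieval-augmented generation (RAG) pipelines
Once the documents are in the ChromaDocumentStore, we can use the accompanying Chroma retrievers to build a query pipeline. The query pipeline below is a simple retrieval-augmented generation (RAG) pipeline that uses Chroma's query API.
You can switch both the indexing pipeline and the query pipeline here to embedding search by using one of the Haystack Embedders together with the ChromaEmbeddingRetriever.
In this example we are using:
- OpenAIChatGenerator with gpt-4o-mini. (You will need an OpenAI API key to use this model.) You can replace this with any of the other Generators.
- ChatPromptBuilder, which holds the prompt template. You can adjust this to a prompt of your choice.
- ChromaQueryTextRetriever, which takes the query text and retrieves the top_k most relevant documents from your Chroma collection.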
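To make "the top_k most relevant documents" concrete, here is a toy sketch of top-k retrieval in plain Python. It ranks documents by cosine similarity over word counts, a scoring scheme assumed purely for illustration; Chroma compares embedding vectors rather than word counts. The names bag_of_words, cosine and retrieve_top_k are hypothetical, and the sample documents are made up.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query, documents, top_k=3):
    """Return the `top_k` documents most similar to `query`, best first."""
    query_vec = bag_of_words(query)
    ranked = sorted(
        documents,
        key=lambda doc: cosine(query_vec, bag_of_words(doc)),
        reverse=True,
    )
    return ranked[:top_k]


# Made-up sample documents, not taken from the VIM manual files
docs = [
    "Write documentation for your plugin so users know how to use it.",
    "Use the :help command to open the manual.",
    "A plugin is a Vim script file that is loaded automatically.",
]
print(retrieve_top_k("Should I write documentation for my plugin?", docs, top_k=2))
```

The shape is the same as in the pipeline: a query goes in, every stored document is scored against it, and only the best top_k are passed on to the prompt.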
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")
Enter OpenAI API key: ········
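The cell above prompts for the key on every run. A small sketch, assuming you would rather reuse a key that is already exported in the environment, is to prompt only when the variable is missing. The helper ensure_api_key is hypothetical and not part of Haystack.

```python
import os
from getpass import getpass


def ensure_api_key(name="OPENAI_API_KEY", prompt=getpass):
    """Ask for the key only if the environment variable is missing or empty.

    Hypothetical helper; `prompt` is injectable so it can be replaced in tests.
    """
    if not os.environ.get(name):
        os.environ[name] = prompt(f"Enter {name}: ")
    return os.environ[name]
```

Calling ensure_api_key() would then stand in for the direct os.environ assignment.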
from haystack_integrations.components.retrievers.chroma import ChromaQueryTextRetriever
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses.chat_message import ChatMessage
prompt = """
Answer the query based on the provided context.
If the context does not contain the answer, say 'Answer not found'.
Context:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}
query: {{query}}
Answer:
"""
template = [ChatMessage.from_user(prompt)]
prompt_builder = ChatPromptBuilder(template=template)
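# Name the model explicitly so it matches the text instead of relying on the library default
llm = OpenAIChatGenerator(model="gpt-4o-mini")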
retriever = ChromaQueryTextRetriever(document_store)
querying = Pipeline()
querying.add_component("retriever", retriever)
querying.add_component("prompt_builder", prompt_builder)
querying.add_component("llm", llm)
querying.connect("retriever.documents", "prompt_builder.documents")
querying.connect("prompt_builder", "llm")
ChatPromptBuilder has 2 prompt variables, but `required_variables` is not set. By default, all prompt variables are treated as optional, which may lead to unintended behavior in multi-branch pipelines. To avoid unexpected execution, ensure that variables intended to be required are explicitly set in `required_variables`.
<haystack.core.pipeline.pipeline.Pipeline object at 0x308f29880>
🚅 Components
- retriever: ChromaQueryTextRetriever
- prompt_builder: ChatPromptBuilder
- llm: OpenAIChatGenerator
🛤️ Connections
- retriever.documents -> prompt_builder.documents (List[Document])
- prompt_builder.prompt -> llm.messages (List[ChatMessage])
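To see roughly what text the prompt builder hands to the LLM, the template above can be mimicked in plain Python. This is only an approximation: Haystack renders the template with Jinja, the exact whitespace will differ, and render_prompt is a hypothetical helper with placeholder document texts.

```python
def render_prompt(query, contents):
    """Approximate the rendered prompt for a query and a list of document texts."""
    # One retrieved document per line, mirroring the {% for doc in documents %} loop
    context = "\n".join(contents)
    return (
        "Answer the query based on the provided context.\n"
        "If the context does not contain the answer, say 'Answer not found'.\n"
        "Context:\n"
        f"{context}\n"
        f"query: {query}\n"
        "Answer:\n"
    )


# Placeholder texts standing in for the retrieved documents
print(render_prompt(
    "Should I write documentation for my plugin?",
    ["first retrieved text", "second retrieved text"],
))
```

Printing the approximate prompt this way can help when tuning the wording of the template before spending API calls on it.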
query = "Should I write documentation for my plugin?"
results = querying.run({"retriever": {"query": query, "top_k": 3}, "prompt_builder": {"query": query}})
print(results["llm"]["replies"][0].text)
Yes, it is a good idea to write documentation for your plugin. This helps users understand how to use it, especially when its behavior can be changed by the user.
