🪁 RAG pipelines with Haystack + Zephyr 7B Beta

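Last updated: March 10, 2025

Notebook by Stefano Fiorucci and Tuana Celik

We are going to build a nice Retrieval Augmented Generation pipeline for rock music, using the 🏗️ Haystack LLM orchestration framework and a good LLM: 💬 Zephyr 7B Beta (a fine-tuned version of Mistral 7B v0.1 that focuses on helpfulness and outperforms many larger models on the MT-Bench and AlpacaEval benchmarks).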

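Install dependencies

wikipedia is used to download data from Wikipedia
haystack-ai is the Haystack package
sentence_transformers is used for embeddings
transformers is used to run open-source LLMs
accelerate and bitsandbytes are required to use quantized versions of these models (with a smaller memory footprint)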

%%capture
! pip install wikipedia haystack-ai transformers accelerate bitsandbytes sentence_transformers

from IPython.display import Image
from pprint import pprint
import torch
import rich
import random

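Load data from Wikipedia

We are going to download the Wikipedia pages related to some rock bands, using the Python library wikipedia.

These pages are converted into Haystack Documents.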

favourite_bands="""Audioslave
Blink-182
Dire Straits
Evanescence
Green Day
Muse (band)
Nirvana (band)
Sum 41
The Cure
The Smiths""".split("\n")

import wikipedia
from haystack.dataclasses import Document

raw_docs=[]

for title in favourite_bands:
    page = wikipedia.page(title=title, auto_suggest=False)
    doc = Document(content=page.content, meta={"title": page.title, "url":page.url})
    raw_docs.append(doc)

Indexing pipeline

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy
from haystack.utils import ComponentDevice

We will save our final Documents in an InMemoryDocumentStore, a simple database that lives in memory.

document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")

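Our indexing pipeline transforms the original Documents and saves them in the Document Store.

It consists of several components:

DocumentCleaner: performs a basic cleaning of the Documents
DocumentSplitter: chunks each Document into smaller pieces (more appropriate for semantic search and RAG)
SentenceTransformersDocumentEmbedder:
- represents each Document as a vector (capturing its meaning).
- we choose a good but not too big model from the MTEB leaderboard.
- the metadata title is also embedded, because it contains relevant information (meta_fields_to_embed parameter).
- we use the GPU for this expensive operation (device parameter).
DocumentWriter: simply saves the Documents in the Document Store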
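
As a rough illustration of what the splitting step and the title embedding do, here is a minimal plain-Python sketch. It is not Haystack's implementation: split_into_chunks and text_to_embed are hypothetical helpers, the regex sentence splitter is deliberately naive, and joining title and content with a newline is an assumption made for illustration.

```python
import re


def split_into_chunks(text, split_length=2):
    """Group sentences into chunks of `split_length` sentences each."""
    # Naive sentence boundary: whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i:i + split_length])
        for i in range(0, len(sentences), split_length)
    ]


def text_to_embed(title, chunk, separator="\n"):
    """Prepend the title so the embedded text also names the band (assumed format)."""
    return separator.join([title, chunk])


text = (
    "Audioslave was an American rock supergroup. "
    "It was formed in Glendale, California, in 2001. "
    "The band had four members."
)
chunks = split_into_chunks(text, split_length=2)
for chunk in chunks:
    print(repr(text_to_embed("Audioslave", chunk)))
```

With three sentences and split_length=2, the sketch yields two chunks: the first holds two sentences and the second holds the remaining one. Without the title, the second chunk would not mention the band at all, which is why embedding the title can help retrieval.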

indexing = Pipeline()
indexing.add_component("cleaner", DocumentCleaner())
indexing.add_component("splitter", DocumentSplitter(split_by='sentence', split_length=2))
indexing.add_component("doc_embedder", SentenceTransformersDocumentEmbedder(model="thenlper/gte-large",
                                                                            device=ComponentDevice.from_str("cuda:0"), 
                                                                            meta_fields_to_embed=["title"]))
indexing.add_component("writer", DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE))

indexing.connect("cleaner", "splitter")
indexing.connect("splitter", "doc_embedder")
indexing.connect("doc_embedder", "writer")

Let's draw the indexing pipeline

indexing.draw("indexing.png")
Image(filename='indexing.png')

Finally, we run the indexing pipeline

indexing.run({"cleaner":{"documents":raw_docs}})

{'writer': {'documents_written': 1554}}

Let's check the total number of chunked Documents and inspect one of them

len(document_store.filter_documents())

document_store.filter_documents()[0].meta

{'title': 'Audioslave',
 'url': 'https://en.wikipedia.org/wiki/Audioslave',
 'source_id': 'e3deff3d39ef107e8b0d69415ea61644b73175086cfbeee03d5f5d6946619fcf'}

pprint(document_store.filter_documents()[0])
print(len(document_store.filter_documents()[0].embedding)) # embedding size

Document(id=3ca9785f81fb9fb0700f794b1fd2355626824599ecbce435e6f5e3babb05facc, content: 'Audioslave was an American rock supergroup formed in Glendale, California, in 2001. The four-piece b...', meta: {'title': 'Audioslave', 'url': 'https://en.wikipedia.org/wiki/Audioslave', 'source_id': 'e3deff3d39ef107e8b0d69415ea61644b73175086cfbeee03d5f5d6946619fcf'}, embedding: vector of size 1024)
1024

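RAG pipeline

`HuggingFaceLocalGenerator` with `zephyr-7b-beta`

To load and manage open-source LLMs in Haystack, we can use the HuggingFaceLocalGenerator.
The LLM we choose is Zephyr 7B Beta, a fine-tuned version of Mistral 7B v0.1 that focuses on helpfulness and outperforms many larger models on the MT-Bench and AlpacaEval benchmarks; the model was fine-tuned by the Hugging Face team.
Since we are using a free Colab instance (with limited resources), we load the model with 4-bit quantization (passing the appropriate huggingface_pipeline_kwargs to our Generator). For an introduction to quantization in Hugging Face Transformers, you can read this simple blog post.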
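
To see why 4-bit quantization matters on a small GPU, a back-of-the-envelope estimate helps. The sketch below only counts memory for the weights and assumes roughly 7 billion parameters; it ignores activations, the KV cache, and the small overhead of quantization constants, so real usage will be somewhat higher. The helper weight_memory_gb is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7e9  # assumption: roughly 7 billion parameters

for label, bits in [("16-bit", 16), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights shrink from about 14 GB to about 3.5 GB, which is the difference between not fitting and fitting on a typical free Colab GPU.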

from haystack.components.generators import HuggingFaceLocalGenerator

generator = HuggingFaceLocalGenerator("HuggingFaceH4/zephyr-7b-beta",
                                 huggingface_pipeline_kwargs={"device_map":"auto",
                                               "model_kwargs":{"load_in_4bit":True,
                                                "bnb_4bit_use_double_quant":True,
                                                "bnb_4bit_quant_type":"nf4",
                                                "bnb_4bit_compute_dtype":torch.bfloat16}},
                                 generation_kwargs={"max_new_tokens": 350})

Let's warm up the component and try the model...

generator.warm_up()

# quick check
rich.print(generator.run("Please write a rhyme about Italy."))

/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1473: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
  warnings.warn(

{
    'replies': [
        " <|assistant|>\n\nIn sunny Italy, the land so bright,\nWhere pasta's served with every sight,\nThe streets
are filled with laughter's light,\nAnd love is in the air, day and night.\n\nThe Colosseum stands, a testament,\nTo
history's might, a story told,\nThe Vatican's beauty, a grandament,\nA sight that leaves one's heart so 
bold.\n\nThe rolling hills, a painter's dream,\nThe Tuscan sun, a golden hue,\nThe Amalfi Coast, a scene so 
gleam,\nA place where love and beauty pursue.\n\nThe food, a symphony of flavors,\nA feast for senses, heart and 
soul,\nThe wine, a nectar, that enthralls,\nA journey, that makes one whole.\n\nIn Italy, the heart beats 
strong,\nA place where love and life are one,\nA land where joy and passion throng,\nA place where love has just 
begun."
    ]
}

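OK, nice!

`PromptBuilder`

It is a component that renders a prompt from a template string using the Jinja2 engine.

Let's set up our prompt builder, with a format like the following (appropriate for Zephyr):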

from haystack.components.builders import PromptBuilder

prompt_template = """<|system|>Using the information contained in the context, give a comprehensive answer to the question.
If the answer is contained in the context, also report the source URL.
If the answer cannot be deduced from the context, do not give an answer.</s>
<|user|>
Context:
  {% for doc in documents %}
  {{ doc.content }} URL:{{ doc.meta['url'] }}
  {% endfor %};
  Question: {{query}}
  </s>
<|assistant|>
"""
prompt_builder = PromptBuilder(template=prompt_template)
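
To make the shape of the rendered prompt concrete, here is a plain-Python approximation of what the template produces. It is only a sketch: build_prompt is a hypothetical helper, plain dicts stand in for Haystack Documents, the system message is shortened, and the whitespace will differ from the actual Jinja2 output.

```python
def build_prompt(documents, query):
    """Assemble a Zephyr-style prompt from retrieved documents and a question."""
    # One line per document: its text followed by its source URL.
    context = "\n".join(f"{doc['content']} URL:{doc['url']}" for doc in documents)
    return (
        "<|system|>Using the information contained in the context, "
        "give a comprehensive answer to the question.</s>\n"
        "<|user|>\n"
        f"Context:\n{context}\n"
        f"Question: {query}</s>\n"
        "<|assistant|>\n"
    )


docs = [  # hypothetical retrieved chunk, for illustration only
    {
        "content": "Audioslave was an American rock supergroup.",
        "url": "https://en.wikipedia.org/wiki/Audioslave",
    },
]
prompt = build_prompt(docs, "Who was Audioslave?")
print(prompt)
```

The prompt ends with the <|assistant|> marker so that the model continues from there with its answer.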

Let's create the RAG pipeline

from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

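Our RAG pipeline finds Documents relevant to the user query and passes them to the LLM to generate a grounded answer.

It consists of several components:

SentenceTransformersTextEmbedder: represents the query as a vector (capturing its meaning).
InMemoryEmbeddingRetriever: finds the (top 5) Documents that are most similar to the query vector
PromptBuilder
HuggingFaceLocalGenerator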
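
The retrieval step can be pictured as ranking the stored chunks by cosine similarity to the query embedding and keeping the best top_k. Below is a minimal sketch in plain Python, with made-up two-dimensional embeddings and hypothetical helper names (cosine, retrieve). The real components work on the 1024-dimensional vectors produced by gte-large, and this is not how InMemoryEmbeddingRetriever is implemented internally.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, store, top_k=5):
    """Return the `top_k` (score, text) pairs most similar to the query."""
    scored = ((cosine(query_embedding, emb), text) for text, emb in store)
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[0])


# Toy store: (text, embedding) pairs with invented vectors.
store = [
    ("chunk about guitars", [0.9, 0.1]),
    ("chunk about drums", [0.1, 0.9]),
    ("chunk about bass", [0.7, 0.3]),
]
top = retrieve([1.0, 0.0], store, top_k=2)
for score, text in top:
    print(f"{score:.3f}  {text}")
```

A larger top_k gives the LLM more context to draw on, at the cost of a longer prompt and a higher chance of including irrelevant chunks.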

rag = Pipeline()
rag.add_component("text_embedder", SentenceTransformersTextEmbedder(model="thenlper/gte-large",
                                                                    device=ComponentDevice.from_str("cuda:0")))
rag.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store, top_k=5))
rag.add_component("prompt_builder", prompt_builder)
rag.add_component("llm", generator)

rag.connect("text_embedder", "retriever")
rag.connect("retriever.documents", "prompt_builder.documents")
rag.connect("prompt_builder.prompt", "llm.prompt")

Visualize our pipeline!

rag.draw("rag.png")
Image(filename='rag.png')

We create a utility function that runs the RAG pipeline and nicely prints the answer.

def get_generative_answer(query):

  results = rag.run({
      "text_embedder": {"text": query},
      "prompt_builder": {"query": query}
    }
  )

  answer = results["llm"]["replies"][0]
  rich.print(answer)

Let's try our RAG pipeline...

get_generative_answer("What is the style of the Cure?")

get_generative_answer("Is the earth flat?")

Based on the provided context, the question "Is the earth flat?" is not related to the information provided. 
Therefore, there is no answer to this question.

More questions to try...

nice_questions_to_try="""What was the original name of Sum 41?
What was the title of Nirvana's breakthrough album released in 1991?
Green Day's "American Idiot" is a rock opera. What's the story it tells?
What is the most well-known album by Blink-182?
Audioslave was formed by members of two iconic bands. Can you name the bands and discuss the sound of Audioslave in comparison?
Evanescence's "Bring Me to Life" features a male vocalist. Who is he, and how does his voice complement Amy Lee's in the song?
Was Ozzy Osbourne part of Blink 182?
Dire Straits' "Sultans of Swing" is a classic rock track. How does Mark Knopfler's guitar work in this song stand out to you?
What is Sum 41's debut studio album called?
Which member of Muse is the lead vocalist and primary songwriter?
Who was the lead singer of Audioslave?
Who are the members of Green Day?
When was Nirvana's first studio album, "Bleach," released?
Were the Smiths an influential band?
What is the name of Evanescence's debut album?
Which band was Morrissey the lead singer of before he formed The Smiths?
What is the title of The Cure's most famous and successful album?
Dire Straits' hit song "Money for Nothing" features a guest vocal by a famous artist. Who is this artist?
Who played the song "Like a stone"?""".split('\n')

q=random.choice(nice_questions_to_try)
print(q)
get_generative_answer(q)

Who are the members of Green Day?



The members of Green Day, as stated in the context provided, are lead vocalist and guitarist Billie Joe Armstrong, 
bassist and backing vocalist Mike Dirnt, and drummer Tré Cool, who replaced John Kiffmeyer in 1990 before the 
recording of the band's second studio album, Kerplunk (1991). This information can be found on the following URLs: 
A.R., S.O., O.A., and the main Wikipedia page for Green Day, which is also provided in the context.

q=random.choice(nice_questions_to_try)
print(q)
get_generative_answer(q)

Was Ozzy Osbourne part of Blink 182?



Based on the context provided, Ozzy Osbourne was not part of Blink 182. The information provided only details the 
history and current lineup of the band, and there is no mention of Ozzy Osbourne being a member at any point in 
time.

Source URL: https://en.wikipedia.org/wiki/Blink-182

Question: Was Ozzy Osbourne part of Blink 182?

Answer: No, Ozzy Osbourne was not part of Blink 182.