RAG with Llama 3.1
Last updated: July 8, 2025

A simple Oscars-themed RAG example using the open Llama 3.1 model and the Haystack LLM framework.
Installation
! pip install haystack-ai "transformers>=4.43.1" sentence-transformers accelerate bitsandbytes
Authorization
- You need a Hugging Face account
- You need to accept Meta's terms here: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct and wait until access is granted
import getpass, os
os.environ["HF_API_TOKEN"] = getpass.getpass("Your Hugging Face token")
RAG with Llama-3.1-8B-Instruct (about the Oscars) 🏆🎬
! pip install wikipedia
Load data from Wikipedia
from IPython.display import Image
from pprint import pprint
import rich
import random
import wikipedia
from haystack.dataclasses import Document
title = "96th_Academy_Awards"
page = wikipedia.page(title=title, auto_suggest=False)
raw_docs = [Document(content=page.content, meta={"title": page.title, "url":page.url})]
Indexing pipeline
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack import Document
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder
from haystack.components.converters import TextFileToDocument
from haystack.components.writers import DocumentWriter
from haystack.components.preprocessors import DocumentSplitter
from haystack.utils import ComponentDevice
document_store = InMemoryDocumentStore()
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("splitter", DocumentSplitter(split_by="word", split_length=200))
indexing_pipeline.add_component(
    "embedder",
    SentenceTransformersDocumentEmbedder(
        model="Snowflake/snowflake-arctic-embed-l",  # good embedding model: https://huggingface.co/Snowflake/snowflake-arctic-embed-l
        device=ComponentDevice.from_str("cuda:0"),  # load the model on GPU
    ),
)
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))
# connect the components
indexing_pipeline.connect("splitter", "embedder")
indexing_pipeline.connect("embedder", "writer")
<haystack.core.pipeline.pipeline.Pipeline object at 0x7fcc409ea4d0>
🚅 Components
- splitter: DocumentSplitter
- embedder: SentenceTransformersDocumentEmbedder
- writer: DocumentWriter
🛤️ Connections
- splitter.documents -> embedder.documents (List[Document])
- embedder.documents -> writer.documents (List[Document])
indexing_pipeline.run({"splitter":{"documents":raw_docs}})
{'writer': {'documents_written': 12}}
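Under the hood, `DocumentSplitter` with `split_by="word"` chunks the text into fixed-size word windows before embedding. A minimal pure-Python sketch of that step (a hypothetical stand-in; the real component also handles overlap and propagates metadata):

```python
# Sketch of word-based splitting, mimicking split_by="word", split_length=200.
def split_by_word(text: str, split_length: int = 200) -> list[str]:
    words = text.split()
    # one chunk per window of split_length words
    return [
        " ".join(words[i:i + split_length])
        for i in range(0, len(words), split_length)
    ]

# toy example with a window of 2 words
chunks = split_by_word("one two three four five", split_length=2)
# chunks == ["one two", "three four", "five"]
```

With `split_length=200`, the Wikipedia page above yields the 12 chunks reported by the writer.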
RAG pipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage
template = [ChatMessage.from_user("""
Using the information contained in the context, give a comprehensive answer to the question.
If the answer cannot be deduced from the context, do not give an answer.
Context:
{% for doc in documents %}
{{ doc.content }} URL:{{ doc.meta['url'] }}
{% endfor %};
Question: {{query}}
""")]
prompt_builder = ChatPromptBuilder(template=template)
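At query time, `ChatPromptBuilder` renders the Jinja template above with the retrieved documents and the query. A rough pure-Python sketch of what that rendering produces (the plain-dict documents here are an assumption for illustration, not Haystack's `Document` API):

```python
# Sketch of the template rendering: loop over documents, append each
# content with its URL, then the question (mirrors the Jinja template).
def render_prompt(documents: list[dict], query: str) -> str:
    context = "".join(f"{d['content']} URL:{d['url']}\n" for d in documents)
    return (
        "Using the information contained in the context, give a comprehensive "
        "answer to the question.\n"
        "If the answer cannot be deduced from the context, do not give an answer.\n"
        f"Context:\n{context}"
        f"Question: {query}\n"
    )

docs = [{"content": "Oppenheimer won Best Picture.", "url": "https://en.wikipedia.org/wiki/96th_Academy_Awards"}]
print(render_prompt(docs, "Who won Best Picture?"))
```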
Here we use HuggingFaceLocalChatGenerator, loading the model with 4-bit quantization to fit in Colab.
import torch
from haystack.components.generators.chat import HuggingFaceLocalChatGenerator
generator = HuggingFaceLocalChatGenerator(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    huggingface_pipeline_kwargs={
        "device_map": "auto",
        "model_kwargs": {
            "load_in_4bit": True,
            "bnb_4bit_use_double_quant": True,
            "bnb_4bit_quant_type": "nf4",
            "bnb_4bit_compute_dtype": torch.bfloat16,
        },
    },
    generation_kwargs={"max_new_tokens": 500},
)
generator.warm_up()
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
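The `load_in_4bit`-style `model_kwargs` above trigger the deprecation warning shown: recent `transformers` releases prefer an explicit `BitsAndBytesConfig`. A configuration sketch with the same quantization settings (not executed here):

```python
import torch
from transformers import BitsAndBytesConfig

# Same 4-bit settings as above, expressed as an explicit quantization config
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# then pass it through model_kwargs instead of the individual flags:
# huggingface_pipeline_kwargs={"device_map": "auto",
#                              "model_kwargs": {"quantization_config": quantization_config}}
```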
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
query_pipeline = Pipeline()
query_pipeline.add_component(
    "text_embedder",
    SentenceTransformersTextEmbedder(
        model="Snowflake/snowflake-arctic-embed-l",  # good embedding model: https://huggingface.co/Snowflake/snowflake-arctic-embed-l
        device=ComponentDevice.from_str("cuda:0"),  # load the model on GPU
        prefix="Represent this sentence for searching relevant passages: ",  # as explained in the model card (https://huggingface.co/Snowflake/snowflake-arctic-embed-l#using-huggingface-transformers), queries should be prefixed
    ),
)
query_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store, top_k=5))
query_pipeline.add_component("prompt_builder", ChatPromptBuilder(template=template))
query_pipeline.add_component("generator", generator)
# connect the components
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
query_pipeline.connect("retriever.documents", "prompt_builder.documents")
query_pipeline.connect("prompt_builder", "generator")
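`InMemoryEmbeddingRetriever` ranks the stored documents by similarity between the query embedding and each document embedding, then keeps the `top_k` best. A toy sketch of that ranking step using cosine similarity (the function names are illustrative, not Haystack's internals):

```python
import math

# Cosine similarity between two dense vectors
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Return the indices of the top_k most similar document embeddings
def retrieve(query_emb: list[float], doc_embs: list[list[float]], top_k: int = 5) -> list[int]:
    scored = sorted(
        enumerate(doc_embs),
        key=lambda pair: cosine(query_emb, pair[1]),
        reverse=True,
    )
    return [idx for idx, _ in scored[:top_k]]

# toy 2-d embeddings: doc 1 points almost the same way as the query
print(retrieve([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]], top_k=2))  # [1, 0]
```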
Let's ask some questions!
def get_generative_answer(query):
    results = query_pipeline.run(
        {
            "text_embedder": {"text": query},
            "prompt_builder": {"query": query},
        }
    )
    answer = results["generator"]["replies"][0].text
    rich.print(answer)
get_generative_answer("Who won the Best Picture Award in 2024?")
**Oppenheimer** won the Best Picture Award in 2024.
get_generative_answer("What was the box office performance of the Best Picture nominees?")
The box office performance of the Best Picture nominees was as follows: * Nine of the ten films nominated for Best Picture had earned a combined gross of $1.09 billion at the American and Canadian box offices at the time of the nominations. * The highest-grossing film among the Best Picture nominees was Barbie with $636 million in domestic box office receipts. * The other nominees' box office performances were: + Oppenheimer: $326 million + Killers of the Flower Moon: $67 million + Poor Things: $20.4 million + The Holdovers: $18.7 million + Past Lives: $10.9 million + American Fiction: $7.9 million + Anatomy of a Fall: $3.9 million + The Zone of Interest: $1.6 million * The box office performance of Maestro was not available due to its distributor Netflix's policy of not releasing such figures.
get_generative_answer("What was the reception of the ceremony")
The reception of the ceremony was generally positive. Television critic Robert Lloyd of the Los Angeles Times commented that the show was "not the slog it often is" and praised Jimmy Kimmel's performance as host, stating that he was "a reliable, relatable presence liable to stir no controversy." Alison Herman of Variety noted that despite the lack of surprises among the winners, the show delivered "entertainment and emotion in spades, if not surprise." Daniel Fienberg of The Hollywood Reporter lauded Kimmel's hosting and the ceremony's entertainment, stating that it was "a maximalist, infectiously goofy singalong" that was the "ideal way to channel the feel-good energy of an Oscars where none of the bonhomie felt forced."
get_generative_answer("Can you name some of the films that got multiple nominations?")
According to the text, two films received multiple nominations: 1. Oppenheimer - 13 nominations 2. Poor Things - 11 nominations
# unrelated question: let's see how our RAG pipeline performs.
get_generative_answer("Audioslave was formed by members of two iconic bands. Can you name the bands and discuss the sound of Audioslave in comparison?")
There is no information in the provided context about the formation of Audioslave or its sound in comparison to other bands. The context only discusses the 96th Academy Awards and related events, and does not mention Audioslave at all.
This is a simple demo. There are many ways to improve the RAG pipeline, including better preprocessing of the input.
To use Llama 3 models with Haystack, you have other options as well:
- LlamaCppGenerator and OllamaGenerator: using the GGUF quantized format, these solutions are ideal for running LLMs on standard machines (even without a GPU).
- HuggingFaceAPIChatGenerator, which lets you query the Hugging Face API, a local TGI container, or a (paid) HF Inference Endpoint. TGI is a toolkit for efficiently deploying and serving LLMs in production.
- vLLM via OpenAIChatGenerator: a high-throughput, memory-efficient inference and serving engine for LLMs.
(Notebook by Stefano Fiorucci)
