Integration: Llama.cpp

Use Llama.cpp models in Haystack.

Authors

Ashwin Mathur

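GitHub Repo | PyPI Package

Introduction

Llama.cpp is a library written in C/C++ for efficient inference of large language models. It uses the quantized GGUF format, which significantly reduces memory requirements and speeds up inference. As a result, LLMs can run efficiently on standard machines, even without a GPU.

Installation

Install the llama-cpp-haystack package: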

pip install llama-cpp-haystack

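Using a different compute backend

By default, the installation builds llama.cpp for CPU on Linux and Windows and uses Metal on macOS. To use another compute backend:

First, follow the instructions on the llama.cpp installation page to install llama-cpp-python for your preferred compute backend.
Then, install llama-cpp-haystack using the command above.

For example, to use llama-cpp-haystack with the cuBLAS backend, run the following commands: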

export LLAMA_CUBLAS=1
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
pip install llama-cpp-haystack

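Downloading models

Llama.cpp requires a quantized binary file of the LLM in GGUF format.

GGUF versions of popular LLMs can be downloaded from HuggingFace.

For example, to download a GGUF version of OpenChat-3.5, find the desired GGUF file on HuggingFace and download it to your local machine: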

import os
import urllib.request

def download_file(file_link, filename):
    # Checks if the file already exists before downloading
    if not os.path.isfile(filename):
        urllib.request.urlretrieve(file_link, filename)
        print("Model file downloaded successfully.")
    else:
        print("Model file already exists.")

# Download GGUF model from HuggingFace
ggml_model_path = (
    "https://huggingface.co/TheBloke/openchat-3.5-1210-GGUF/resolve/main/openchat-3.5-1210.Q3_K_S.gguf"
)
filename = "openchat-3.5-1210.Q3_K_S.gguf"
download_file(ggml_model_path, filename)

You can also download the file directly from the command line using curl:

curl -L -O "https://huggingface.co/TheBloke/openchat-3.5-1210-GGUF/resolve/main/openchat-3.5-1210.Q3_K_S.gguf"
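
An interrupted download, or an HTML error page saved under the model's filename, only fails later when the model is loaded. As a quick sanity check, the sketch below tests whether a file starts with the four-byte GGUF magic. The helper name looks_like_gguf and the throwaway demo files are illustrative assumptions, not part of llama-cpp-haystack, and passing the check does not prove the file is complete.

```python
import os
import tempfile

GGUF_MAGIC = b"GGUF"  # a GGUF file begins with these four bytes


def looks_like_gguf(path):
    """Return True if the file at `path` starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC


# Demo on throwaway files so the sketch runs without a real model.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.gguf")
    bad = os.path.join(tmp, "bad.gguf")
    with open(good, "wb") as f:
        f.write(GGUF_MAGIC + b"\x03\x00\x00\x00")  # magic followed by dummy bytes
    with open(bad, "wb") as f:
        f.write(b"<html>Not Found</html>")  # e.g. a saved error page
    good_ok = looks_like_gguf(good)
    bad_ok = looks_like_gguf(bad)

print(good_ok, bad_ok)  # True False
```

In practice you would call looks_like_gguf(filename) right after the download step and fetch the file again if it returns False.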

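Usage

You can use the LlamaCppGenerator component to run Llama.cpp models.

Initialize LlamaCppGenerator with the path to the GGUF file, and specify the model and text generation parameters you need: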

from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    model_kwargs={"n_gpu_layers": -1},
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generator.warm_up()
prompt = "Who is the best American actor?"
result = generator.run(prompt)

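Passing additional model parameters

The model, n_ctx, and n_batch arguments are exposed for convenience and can be passed directly to the Generator as keyword arguments during initialization.

The model_kwargs parameter can be used to pass additional arguments when the model is initialized. If an argument is duplicated, the value in model_kwargs overrides the model, n_ctx, and n_batch initialization arguments.

For more information on the available model parameters, see Llama.cpp's LLM documentation.

For example, to offload the model to the GPU during initialization: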

from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    model_kwargs={"n_gpu_layers": -1}
)
generator.warm_up()
prompt = "Who is the best American actor?"
result = generator.run(prompt, generation_kwargs={"max_tokens": 128})
generated_text = result["replies"][0]
print(generated_text)
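
The override rule described above can be pictured as a dictionary merge in which model_kwargs is applied last. This is a minimal sketch of that precedence, not the integration's actual code: resolve_model_kwargs is a hypothetical helper, and it assumes the model path is forwarded to llama-cpp-python under the model_path key.

```python
def resolve_model_kwargs(model, n_ctx, n_batch, model_kwargs=None):
    """Combine the convenience arguments with model_kwargs.

    Keys present in model_kwargs win over the convenience arguments.
    """
    resolved = {"model_path": model, "n_ctx": n_ctx, "n_batch": n_batch}
    resolved.update(model_kwargs or {})  # duplicates are overridden here
    return resolved


resolved = resolve_model_kwargs(
    model="openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    model_kwargs={"n_ctx": 2048, "n_gpu_layers": -1},
)
print(resolved["n_ctx"])  # 2048
```

Under this reading, setting n_ctx in both places is harmless but potentially confusing, so it is simpler to set each parameter in one place only.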

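Passing text generation parameters

The generation_kwargs parameter can be used to pass additional generation arguments such as max_tokens, temperature, top_k, and top_p to the model during inference.

For more information on the available generation parameters, see Llama.cpp's Completion API documentation.

For example, to set max_tokens and temperature: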

from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generator.warm_up()
prompt = "Who is the best American actor?"
result = generator.run(prompt)

generation_kwargs can also be passed directly to the generator's run method:

from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
)
generator.warm_up()
prompt = "Who is the best American actor?"
result = generator.run(
    prompt,
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
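
When choosing max_tokens, keep in mind that in llama.cpp the context window (n_ctx) holds both the prompt and the generated tokens. The arithmetic below is a rough sketch of how much room the examples above leave for the prompt. prompt_token_budget is a hypothetical helper, not part of either library, and it assumes max_tokens is a positive integer.

```python
def prompt_token_budget(n_ctx, max_tokens):
    """Return how many prompt tokens fit if max_tokens are reserved for the reply."""
    if max_tokens >= n_ctx:
        raise ValueError("max_tokens must be smaller than n_ctx to leave room for a prompt")
    return n_ctx - max_tokens


budget = prompt_token_budget(n_ctx=512, max_tokens=128)
print(budget)  # 384
```

With n_ctx=512 and max_tokens=128, a prompt longer than about 384 tokens would leave less than the requested space for the reply, which is one reason the RAG example later in this page uses a larger n_ctx.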

Example: RAG pipeline

We use LlamaCppGenerator in a Retrieval-Augmented Generation (RAG) pipeline over the Simple Wikipedia dataset from HuggingFace, generating answers with the OpenChat-3.5 LLM.

Load the dataset:

# Install HuggingFace Datasets using "pip install datasets"
from datasets import load_dataset
from haystack import Document, Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack.components.retrievers import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Import LlamaCppGenerator
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

# Load first 100 rows of the Simple Wikipedia Dataset from HuggingFace
dataset = load_dataset("pszemraj/simple_wikipedia", split="validation[:100]")

docs = [
    Document(
        content=doc["text"],
        meta={
            "title": doc["title"],
            "url": doc["url"],
        },
    )
    for doc in dataset
]

Index the documents into an InMemoryDocumentStore using SentenceTransformersDocumentEmbedder and DocumentWriter:

doc_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")

# Indexing Pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=doc_embedder, name="DocEmbedder")
indexing_pipeline.add_component(instance=DocumentWriter(document_store=doc_store), name="DocWriter")
indexing_pipeline.connect("DocEmbedder", "DocWriter")

indexing_pipeline.run({"DocEmbedder": {"documents": docs}})
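
The document store above is configured with cosine similarity, which is what the retriever later uses to compare the query embedding with the stored document embeddings. The sketch below illustrates the idea in plain Python. It is not Haystack's implementation, and the three-dimensional vectors and document ids are made-up toy data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_embedding, doc_embeddings, k=3):
    """Return the ids of the k documents most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_embedding, emb), doc_id)
        for doc_id, emb in doc_embeddings.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy embeddings; real ones from all-MiniLM-L6-v2 have many more dimensions.
toy_store = {
    "joker": [0.9, 0.1, 0.0],
    "batman": [0.7, 0.7, 0.0],
    "paris": [0.0, 0.1, 0.9],
}
ranked = top_k([1.0, 0.0, 0.0], toy_store, k=2)
print(ranked)  # ['joker', 'batman']
```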

Create the Retrieval-Augmented Generation (RAG) pipeline and add the LlamaCppGenerator to it:

# Prompt Template for the https://huggingface.co/openchat/openchat-3.5-1210 LLM
prompt_template = """GPT4 Correct User: Answer the question using the provided context.
Question: {{question}}
Context:
{% for doc in documents %}
    {{ doc.content }}
{% endfor %}
<|end_of_turn|>
GPT4 Correct Assistant:
"""

rag_pipeline = Pipeline()

text_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")

# Load the LLM using LlamaCppGenerator
model_path = "openchat-3.5-1210.Q3_K_S.gguf"
generator = LlamaCppGenerator(model=model_path, n_ctx=4096, n_batch=128)

rag_pipeline.add_component(
    instance=text_embedder,
    name="text_embedder",
)
rag_pipeline.add_component(instance=InMemoryEmbeddingRetriever(document_store=doc_store, top_k=3), name="retriever")
rag_pipeline.add_component(instance=PromptBuilder(template=prompt_template), name="prompt_builder")
rag_pipeline.add_component(instance=generator, name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")

rag_pipeline.connect("text_embedder", "retriever")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")
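
To see what the LLM roughly receives, the sketch below rebuilds the prompt with plain string formatting. PromptBuilder renders the template with Jinja2, so the exact whitespace will differ. render_prompt and the sample context string are hypothetical and only approximate the rendered result.

```python
def render_prompt(question, documents):
    """Approximate the text produced by the prompt template for a list of document contents."""
    context = "\n".join(documents)
    return (
        "GPT4 Correct User: Answer the question using the provided context.\n"
        f"Question: {question}\n"
        "Context:\n"
        f"{context}\n"
        "<|end_of_turn|>\n"
        "GPT4 Correct Assistant:\n"
    )


# Made-up context purely for illustration.
prompt = render_prompt(
    "Which year did the Joker movie release?",
    ["Joker is a 2019 American psychological thriller movie."],
)
print(prompt)
```

The retrieved documents sit between the question and the <|end_of_turn|> marker, so top_k on the retriever directly controls how much of the context window the prompt consumes.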

Run the pipeline:

question = "Which year did the Joker movie release?"
result = rag_pipeline.run(
    {
        "text_embedder": {"text": question},
        "prompt_builder": {"question": question},
        "llm": {"generation_kwargs": {"max_tokens": 128, "temperature": 0.1}},
        "answer_builder": {"query": question},
    }
)

generated_answer = result["answer_builder"]["answers"][0]
print(generated_answer.data)
# The Joker movie was released on October 4, 2019.