
Introduction to Multimodal Text Generation


In this notebook, we introduce the features that enable multimodal text generation in Haystack.

  • We introduce the ImageContent dataclass, which represents the image content of a user ChatMessage.
  • We have developed several image converter components.
  • The OpenAIChatGenerator has been extended to support multimodal messages.
  • The ChatPromptBuilder has been refactored to support string templates, making multimodal use cases easier to handle.

In this notebook, we walk through all of these features, show an application that combines text retrieval with multimodal generation, and build a multimodal Agent.

Setting up the Development Environment

!pip install haystack-ai gdown nest_asyncio pillow pypdfium2 python-weather
import os
from getpass import getpass
from pprint import pp as print


if "OPENAI_API_KEY" not in os.environ:
  os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")
Enter OpenAI API key:··········

Introduction to ImageContent

ImageContent is a new dataclass that stores the image content of a user ChatMessage.

It has the following attributes:

  • base64_image: a base64 string representing the image.
  • mime_type: the optional MIME type of the image (e.g., "image/png", "image/jpeg").
  • detail: the optional detail level of the image (only supported by OpenAI). One of "auto", "high", or "low".
  • meta: optional metadata for the image.

Creating an ImageContent Object

Let's start by downloading an image from the web and creating an ImageContent object manually. Later, we will see more convenient ways to do this.

! wget "https://upload.wikimedia.org/wikipedia/commons/thumb/e/e1/Cattle_tyrant_%28Machetornis_rixosa%29_on_Capybara.jpg/960px-Cattle_tyrant_%28Machetornis_rixosa%29_on_Capybara.jpg?download" -O capybara.jpg
--2025-05-14 09:29:45--  https://upload.wikimedia.org/wikipedia/commons/thumb/e/e1/Cattle_tyrant_%28Machetornis_rixosa%29_on_Capybara.jpg/960px-Cattle_tyrant_%28Machetornis_rixosa%29_on_Capybara.jpg?download
Resolving upload.wikimedia.org (upload.wikimedia.org)... 198.35.26.112, 2620:0:863:ed1a::2:b
Connecting to upload.wikimedia.org (upload.wikimedia.org)|198.35.26.112|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 202119 (197K) [image/jpeg]
Saving to: ‘capybara.jpg’

capybara.jpg        100%[===================>] 197.38K  --.-KB/s    in 0.09s   

2025-05-14 09:29:45 (2.23 MB/s) - ‘capybara.jpg’ saved [202119/202119]
from haystack.dataclasses import ImageContent, ChatMessage
from haystack.components.generators.chat import OpenAIChatGenerator
import base64

with open("capybara.jpg", "rb") as fd:
  base64_image = base64.b64encode(fd.read()).decode("utf-8")

image_content = ImageContent(
    base64_image=base64_image,
    mime_type="image/jpeg",
    detail="low")
image_content
ImageContent(base64_image='/9j/4QBoRXhpZgAATU0AKgAAAAgABQEaAAUAAAABAAAASgEbAAUAAAABAAAAUgEoAAMAAAABAAIAAAE7AAIAAAAFAAAAWgITAAMA...', mime_type='image/jpeg', detail='low', meta={})
image_content.show()

Nice!

To generate text based on this image, we need to include it in a user message together with a prompt. Here is how.

user_message = ChatMessage.from_user(content_parts=["Describe the image in short.", image_content])
llm = OpenAIChatGenerator(model="gpt-4o-mini")
print(llm.run([user_message])["replies"][0].text)
('The image depicts a capybara, a large rodent, with a small bird standing on '
 'its head. The capybara has a brownish fur coat, while the bird has a yellow '
 'belly and a grayish-brown back. They are surrounded by grassy vegetation, '
 'creating a natural setting.')

Creating ImageContent Objects from URLs or File Paths

ImageContent has two handy class methods:

  • from_url: downloads an image file and wraps it in an ImageContent.
  • from_file_path: loads an image from disk and wraps it in an ImageContent.

With from_url, we can simplify the previous example. The mime_type is inferred automatically.

capybara_image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/e/e1/Cattle_tyrant_%28Machetornis_rixosa%29_on_Capybara.jpg/960px-Cattle_tyrant_%28Machetornis_rixosa%29_on_Capybara.jpg?download"

image_content = ImageContent.from_url(capybara_image_url, detail="low")
image_content
ImageContent(base64_image='/9j/4QBoRXhpZgAATU0AKgAAAAgABQEaAAUAAAABAAAASgEbAAUAAAABAAAAUgEoAAMAAAABAAIAAAE7AAIAAAAFAAAAWgITAAMA...', mime_type='image/jpeg', detail='low', meta={'content_type': 'image/jpeg', 'url': 'https://upload.wikimedia.org/wikipedia/commons/thumb/e/e1/Cattle_tyrant_%28Machetornis_rixosa%29_on_Capybara.jpg/960px-Cattle_tyrant_%28Machetornis_rixosa%29_on_Capybara.jpg?download'})

Since we downloaded the image file, we can also see from_file_path in action.

In this case, we also use the size parameter, which resizes the image to fit within the specified dimensions while preserving the aspect ratio. This reduces file size, memory usage, and processing time, which is useful when working with models that have resolution constraints or when transmitting images to remote services.

image_content = ImageContent.from_file_path("capybara.jpg", detail="low", size=(300, 300))
image_content
ImageContent(base64_image='/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAx...', mime_type='image/jpeg', detail='low', meta={'file_path': 'capybara.jpg'})
image_content.show()
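
To make the size-reduction benefit concrete, here is a quick illustrative check (assuming capybara.jpg is still on disk) comparing the base64 payload of the resized image against the full-size one:

full_size = ImageContent.from_file_path("capybara.jpg")
resized = ImageContent.from_file_path("capybara.jpg", size=(300, 300))
# the resized payload should be noticeably smaller than the original
print(f"{len(full_size.base64_image)} vs {len(resized.base64_image)} base64 characters")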

Image Converters for ImageContent

To perform image conversion in multimodal pipelines, we also introduce two image converters:

from haystack.components.converters.image import ImageFileToImageContent

converter = ImageFileToImageContent(detail="low", size=(300, 300))
result = converter.run(sources=["capybara.jpg"])
result["image_contents"][0]
ImageContent(base64_image='/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAx...', mime_type='image/jpeg', detail='low', meta={'file_path': 'capybara.jpg'})

Let's look at a more interesting example. We want our LLM to interpret a figure from an influential paper published by Google: Scaling Instruction-Finetuned Language Models.

! wget "https://arxiv.org/pdf/2210.11416.pdf" -O flan_paper.pdf
--2025-05-14 09:31:03--  https://arxiv.org/pdf/2210.11416.pdf
Resolving arxiv.org (arxiv.org)... 151.101.67.42, 151.101.3.42, 151.101.195.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.67.42|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://arxiv.org/pdf/2210.11416 [following]
--2025-05-14 09:31:04--  http://arxiv.org/pdf/2210.11416
Connecting to arxiv.org (arxiv.org)|151.101.67.42|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1557309 (1.5M) [application/pdf]
Saving to: ‘flan_paper.pdf’

flan_paper.pdf      100%[===================>]   1.48M  --.-KB/s    in 0.09s   

2025-05-14 09:31:04 (16.5 MB/s) - ‘flan_paper.pdf’ saved [1557309/1557309]
from haystack.components.converters.image import PDFToImageContent

pdf_converter = PDFToImageContent()
paper_page_image = pdf_converter.run(sources=["flan_paper.pdf"], page_range="9")["image_contents"][0]
paper_page_image
ImageContent(base64_image='/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAx...', mime_type='image/jpeg', detail=None, meta={'file_path': 'flan_paper.pdf', 'page_number': 9})
paper_page_image.show()
user_message = ChatMessage.from_user(content_parts=["What is the main takeaway of Figure 6? Be brief and accurate.", paper_page_image])

print(llm.run([user_message])["replies"][0].text)
('The main takeaway of Figure 6 is that Flan-PaLM demonstrates improved '
 'performance in zero-shot reasoning tasks when utilizing chain-of-thought '
 '(CoT) reasoning, as indicated by higher accuracy across different model '
 'sizes compared to PaLM without finetuning. This highlights the importance of '
 'instruction finetuning combined with CoT for enhancing reasoning '
 'capabilities in models.')

ChatPromptBuilder Extended with String Templates

While exploring multimodal use cases, it became clear that the existing ChatPromptBuilder had some limitations. In particular, we needed a way to pass structured objects (such as ImageContent) when building a ChatMessage, and to handle a variable number of such objects.

To address this, we are introducing support for string templates in the ChatPromptBuilder. The syntax is simple, as shown below.

template = """
{% message role="system" %}
You are a {{adjective}} assistant.
{% endmessage %}

{% message role="user" %}
Compare these images:
{% for img in image_contents %}
  {{ img | templatize_part }}
{% endfor %}
{% endmessage %}
"""

Note the | templatize_part Jinja2 filter: it marks a content part as a structured object rather than plain text, so it receives special handling.

from haystack.components.builders import ChatPromptBuilder

builder = ChatPromptBuilder(template, required_variables="*")

image_contents = [ImageContent.from_url("https://1000logos.net/wp-content/uploads/2017/02/Apple-Logosu.png", detail="low"),
                  ImageContent.from_url("https://upload.wikimedia.org/wikipedia/commons/2/26/Pink_Lady_Apple_%284107712628%29.jpg", detail="low")]

messages = builder.run(image_contents=image_contents, adjective="joking")["prompt"]
print(messages)
[ChatMessage(_role=<ChatRole.SYSTEM: 'system'>,
             _content=[TextContent(text='You are a joking assistant.')],
             _name=None,
             _meta={}),
 ChatMessage(_role=<ChatRole.USER: 'user'>,
             _content=[TextContent(text='Compare these images:'),
                       ImageContent(base64_image='iVBORw0KGgoAAAANSUhEUgAADwAAAAhwAgMAAADt0CPhAAAADFBMVEVHcEwAAADe3t58fHxUHjQgAAAAAXRSTlMAQObYZgAAIABJ...', mime_type='image/png', detail='low', meta={'content_type': 'image/png', 'url': 'https://1000logos.net/wp-content/uploads/2017/02/Apple-Logosu.png'}),
                       ImageContent(base64_image='/9j/4AAQSkZJRgABAQEA8ADwAAD/7SnQUGhvdG9zaG9wIDMuMAA4QklNA+0AAAAAABAA8AAAAAEAAQDwAAAAAQABOEJJTQQMAAAA...', mime_type='image/jpeg', detail='low', meta={'content_type': 'image/jpeg', 'url': 'https://upload.wikimedia.org/wikipedia/commons/2/26/Pink_Lady_Apple_%284107712628%29.jpg'})],
             _name=None,
             _meta={})]
print(llm.run(messages)["replies"][0].text)
("Sure! Let's dive into these fruity comparisons! \n"
 '\n'
 "1. **Apple Logo**: This is a stylized logo of an apple. It's simple, iconic, "
 "and represents a well-known tech company. It's all about design and branding "
 '– who knew a fruit could be so influential in the tech world?\n'
 '\n'
 "2. **Real Apple**: This is an actual apple, the kind you can bite into! It's "
 'delicious, nutritious, and makes a great snack or pie ingredient. Plus, it '
 "doesn't need charging!\n"
 '\n'
 'In short, one is a tech icon, and the other is a snackable delight. Both are '
 'essential in their own realms! 🍏🍎')

Text Retrieval and Multimodal Generation

Let's look at a more advanced example.

In this case, we have a collection of images taken from papers about language models.

Our goal is to build a system that can:

  1. Given a user's text question, retrieve the most relevant image from this collection.
  2. Pass this image, along with the original question, to an LLM to generate an answer.

We start by downloading the images.

import gdown

url = "https://drive.google.com/drive/folders/1KLMow1NPq6GIuoNfOmUbjUmAcwFNmsCc"

gdown.download_folder(url, quiet=True, output=".")
['./arxiv/direct_preference_optimization.png',
 './arxiv/large_language_diffusion_models.png',
 './arxiv/lora_vs_full_fine_tuning.png',
 './arxiv/magpie.png',
 './arxiv/online_ai_feedback.png',
 './arxiv/reverse_thinking_llms.png',
 './arxiv/scaling_laws_for_precision.png',
 './arxiv/spectrum.png',
 './arxiv/textgrad.png',
 './arxiv/tulu_3.png',
 './map.png']

We create an InMemoryDocumentStore and write one Document per image to it: the content is a textual description of the image; the image path is stored in meta.

The Document content here is minimal. You could consider more sophisticated approaches to creating representative content, such as performing OCR or using a vision language model (a sketch of the latter follows the next cell). We will explore this direction in the future.

import glob
from haystack.dataclasses import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

docs = []

for image_path in glob.glob("arxiv/*.png"):
    text = "image from '" + image_path.split("/")[-1].replace(".png", "").replace("_", " ") + "' paper"
    docs.append(Document(content=text, meta={"image_path": image_path}))

document_store = InMemoryDocumentStore()
document_store.write_documents(docs)
10
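
As a hedged sketch of the vision-language-model approach mentioned above, one could caption each image with the components already introduced and store the caption as the Document content. This is illustrative only; the rest of this notebook keeps the minimal filename-based text.

from haystack.components.converters.image import ImageFileToImageContent
from haystack.components.generators.chat import OpenAIChatGenerator

captioner = OpenAIChatGenerator(model="gpt-4o-mini")

def caption_document(image_path: str) -> Document:
    # convert the image, ask the vision LLM for a one-sentence caption,
    # and store the caption as the Document content
    image = ImageFileToImageContent(detail="low").run(sources=[image_path])["image_contents"][0]
    message = ChatMessage.from_user(content_parts=["Describe this figure in one sentence.", image])
    caption = captioner.run([message])["replies"][0].text
    return Document(content=caption, meta={"image_path": image_path})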

We perform text retrieval (using BM25) to fetch the most relevant Document. We then create an ImageContent object from the image file path. Finally, we pass the ImageContent to the LLM along with the user's question.

from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ImageContent, ChatMessage

retriever = InMemoryBM25Retriever(document_store=document_store)
llm = OpenAIChatGenerator(model="gpt-4o-mini")

def retrieve_and_generate(question):
  doc = retriever.run(query=question, top_k=1)["documents"][0]
  image_content = ImageContent.from_file_path(doc.meta["image_path"], detail="auto")
  image_content.show()

  message = ChatMessage.from_user(content_parts=[question, image_content])
  response = llm.run(messages=[message])["replies"][0].text
  print(response)
retrieve_and_generate("Describe the image of the Direct Preference Optimization paper")
('The image compares two methods in machine learning: Reinforcement Learning '
 'from Human Feedback (RLHF) and Direct Preference Optimization (DPO).\n'
 '\n'
 '### Left Side: RLHF\n'
 '- **Process**: \n'
 '  - Input example: "write me a poem about the history of jazz."\n'
 '  - Preference data shown as two different responses (y₁ and y₂).\n'
 '- **Components**:\n'
 '  - It includes a "reward model" that labels the quality of outputs and '
 'involves a reinforcement learning process.\n'
 '  - The goal is to derive an LM policy through sampling that improves over '
 'time.\n'
 '- **Key Terms**: "preference data," "maximum likelihood," "reinforcement '
 'learning."\n'
 '\n'
 '### Right Side: DPO\n'
 '- **Process**:\n'
 '  - Similar input as on the left.\n'
 '  - Preference data involves determining which response (y₁ or y₂) is '
 'preferred without a reward model.\n'
 '- **Components**:\n'
 '  - Focuses directly on optimizing preferences to produce a final language '
 'model.\n'
 '- **Key Terms**: "preference data," "maximum likelihood," "final LM."\n'
 '\n'
 '### General Visual Elements:\n'
 '- The image utilizes a clear color scheme to differentiate between systems, '
 'with RLHF having a pink background and DPO a light blue background.\n'
 '- Diagrams include nodes to represent network components and flows '
 'indicating processes.')

Here are some other example questions to try:

examples = [
    "Describe the image of the LoRA vs Full Fine-tuning paper",
    "Describe the image of the Online AI Feedback paper",
    "Describe the image of the Spectrum paper",
    "Describe the image of the Textgrad paper",
    "Describe the image of the Tulu 3 paper",
]
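
You can run them all through the function defined above:

for question in examples:
    retrieve_and_generate(question)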

Multimodal Agent

Let's combine multimodal messages with the Agent component.

We first create a weather Tool based on the python-weather library. This library is asynchronous, while the Tool abstraction expects a synchronous invocation method, so we make some adjustments.

To learn more about creating Agents, see the tutorial: Build a Tool-Calling Agent.

import asyncio
from typing import Annotated

from haystack.tools import tool
from haystack.components.agents import Agent
from haystack.components.generators.chat import OpenAIChatGenerator

from haystack.dataclasses import ChatMessage, ImageContent
import python_weather

# only needed in Jupyter notebooks where there is an event loop running
import nest_asyncio
nest_asyncio.apply()


@tool
def get_weather(location: Annotated[str, "The location to get the weather for"]) -> dict:
    """A function to get the weather for a given location"""
    async def _fetch_weather():
        async with python_weather.Client(unit=python_weather.METRIC) as client:
            weather = await client.get(location)
            return {
                "description": weather.description,
                "temperature": weather.temperature,
                "humidity": weather.humidity,
                "precipitation": weather.precipitation,
                "wind_speed": weather.wind_speed,
                "wind_direction": weather.wind_direction
            }

    return asyncio.run(_fetch_weather())

Let's test our Tool by invoking it with the required parameter.

get_weather.invoke(location="New York")
{'description': 'Heavy rain, fog',
 'temperature': 14,
 'humidity': 93,
 'precipitation': 0.0,
 'wind_speed': 24,
 'wind_direction': WindDirection.EAST_SOUTHEAST}

Now we can define an Agent, equip it with the weather Tool, and see whether it can work out the weather from a geographic map.

generator = OpenAIChatGenerator(model="gpt-4o-mini")
agent = Agent(chat_generator=generator, tools=[get_weather])
map_image = ImageContent.from_file_path("map.png")
map_image.show()
content_parts = ["What is the weather in the area of the map?", map_image]
messages = agent.run([ChatMessage.from_user(content_parts=content_parts)])["messages"]

print(messages[-1].text)
('The weather in Valencia, Spain is currently overcast with a temperature of '
 '21°C. The humidity is at 64%, and there is no precipitation. Winds are '
 'coming from the east-northeast at a speed of 25 km/h.')

What's Next?

We also support image capabilities for several LLM providers, including Amazon Bedrock, Google, Mistral, Ollama, and more.
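
For instance, a multimodal message like the ones above could be sent to a locally running vision model via the Ollama integration. A minimal sketch, assuming the ollama-haystack package is installed and a vision-capable model such as llava has been pulled locally:

# assumes: pip install ollama-haystack, and `ollama pull llava` was run beforehand
from haystack_integrations.components.generators.ollama import OllamaChatGenerator

local_llm = OllamaChatGenerator(model="llava")
image = ImageContent.from_file_path("capybara.jpg")
message = ChatMessage.from_user(content_parts=["Describe the image in short.", image])
print(local_llm.run([message])["replies"][0].text)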

To learn how to build more advanced multimodal pipelines (with different file formats and multimodal embedding models), check out the Creating Vision+Text RAG Pipelines tutorial.

(Notebook by Stefano Fiorucci)