Talk to YouTube Videos with Haystack Pipelines

Use Whisper to provide YouTube videos as context for retrieval-augmented generation

September 8, 2023

You can use this Colab notebook as a working example of the application described in this article.

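In this article, I'll show how to use a transcription model such as OpenAI's Whisper to build a retrieval-augmented generation (RAG) pipeline that lets us search video content effectively.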

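The example application answers questions based on the transcript extracted from a video. I'll use a video by Erika Cardenas, in which she talks about chunking and preprocessing documents for RAG pipelines. Once we're done, we'll be able to query a Haystack pipeline that responds based on the contents of the video.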

Transcribing and Storing the Video

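To get started, we first need to set up an indexing pipeline. In Haystack, these pipelines are designed to take files in various forms (.pdf, .txt, .md or, in our case, a YouTube link) and store them in a database. An indexing pipeline is also where we design and define how those files should be prepared. This usually involves a file conversion step, some preprocessing, and often embedding creation.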

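How we design the components and structure of this pipeline also matters for the other type of pipeline we'll create in the next section: the RAG pipeline, often also called the query pipeline or LLM pipeline. While the indexing pipeline defines how we prepare and store data, the LLM pipeline uses that stored data. A simple example of how the indexing pipeline affects the RAG pipeline: depending on the model we use, we may need to chunk our files into longer or shorter pieces.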
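
To make the chunk-length trade-off concrete, here is a minimal plain-Python sketch of word-based chunking with overlap. It is not Haystack's PreProcessor, only an illustration of how a split length changes the number and size of chunks; the function name `chunk_words`, its parameters, and the sample sentence are made up for this example.

```python
def chunk_words(text: str, split_length: int, split_overlap: int = 0) -> list[str]:
    """Split text into chunks of at most `split_length` words.

    Consecutive chunks share `split_overlap` words.
    """
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")

    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


# Made-up sample text standing in for a transcript
transcript = "chunking keeps each passage small enough for the model to read"
short_chunks = chunk_words(transcript, split_length=4, split_overlap=1)
long_chunks = chunk_words(transcript, split_length=8, split_overlap=1)
print(len(short_chunks), len(long_chunks))
```

Shorter chunks give the retriever more, smaller passages to choose from, while longer chunks keep more surrounding context in each passage.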

Reusability

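The idea behind Haystack pipelines is that, once created, they can be invoked whenever needed. This ensures that data is processed the same way every time. For indexing pipelines, this means we always have a way to keep the database behind our RAG pipeline up to date. In practical terms for this example application, when there's a new video we want to query, we reuse the same indexing pipeline and run the new video through it.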

Creating the Indexing Pipeline

In this example, we use Weaviate as our vector database for storage. However, Haystack provides a number of document stores to choose from.

First, we create our WeaviateDocumentStore:

import weaviate
from weaviate.embedded import EmbeddedOptions
from haystack.document_stores import WeaviateDocumentStore

# Start an embedded Weaviate instance, so no separate server is needed
client = weaviate.Client(
    embedded_options=EmbeddedOptions()
)

# Point the document store at the port of the embedded instance
document_store = WeaviateDocumentStore(port=6666)

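Next, we build the indexing pipeline. Our goal here is a pipeline that creates transcripts of YouTube videos, so WhisperTranscriber is our first component. It uses OpenAI's Whisper, an automatic speech recognition (ASR) system that transcribes audio into text. The component expects an audio file and returns the transcript as a Haystack Document, ready to be used in any Haystack pipeline.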

Our pipeline also includes preprocessing and embedding creation, because when we build the RAG pipeline we'll want to run semantic search over the indexed files.

from haystack.nodes import EmbeddingRetriever, PreProcessor
from haystack.nodes.audio import WhisperTranscriber
from haystack.pipelines import Pipeline

# PreProcessor with default settings splits and cleans the transcript
preprocessor = PreProcessor()
# The same retriever embeds documents at indexing time and queries at search time
embedder = EmbeddingRetriever(document_store=document_store,
                              embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1")
# 'OPENAI_API_KEY' is a placeholder: replace it with your own OpenAI API key
whisper = WhisperTranscriber(api_key='OPENAI_API_KEY')

# Audio file -> transcript -> preprocessed chunks -> embeddings -> document store
indexing_pipeline = Pipeline()
indexing_pipeline.add_node(component=whisper, name="Whisper", inputs=["File"])
indexing_pipeline.add_node(component=preprocessor, name="Preprocessor", inputs=["Whisper"])
indexing_pipeline.add_node(component=embedder, name="Embedder", inputs=["Preprocessor"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["Embedder"])

Next, we create a helper function that extracts the audio from a YouTube video, after which we can run the pipeline. For this, we install the pytube package 👇

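from pytube import YouTube

def youtube2audio(url: str) -> str:
    """Download the video's 160 kbps audio stream and return the local file path."""
    yt = YouTube(url)
    video = yt.streams.filter(abr='160kbps').last()
    # .last() returns None when no stream matches the filter
    if video is None:
        raise ValueError(f"No 160kbps audio stream found for {url}")
    return video.download()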

Now we can run our indexing pipeline with the URL of a YouTube video:

file_path = youtube2audio("https://www.youtube.com/watch?v=h5id4erwD4s")
indexing_pipeline.run(file_paths=[file_path])
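
Because the indexing pipeline is reusable, adding more videos later is just a matter of running it again. Below is a small sketch of a helper for that. The name `index_videos` is hypothetical, and it only assumes that the pipeline object exposes `run(file_paths=[...])`, as the indexing pipeline in this article does.

```python
from typing import Callable, Iterable


def index_videos(urls: Iterable[str], download_audio: Callable[[str], str], pipeline) -> list[str]:
    """Download the audio for each URL and run the files through an indexing pipeline.

    `download_audio` maps a URL to a local file path (for example youtube2audio).
    `pipeline` is any object with a `run(file_paths=[...])` method.
    """
    file_paths = [download_audio(url) for url in urls]
    if file_paths:  # skip the pipeline call when there is nothing to index
        pipeline.run(file_paths=file_paths)
    return file_paths


# Intended usage with the objects defined in this article:
# index_videos(["https://www.youtube.com/watch?v=h5id4erwD4s"], youtube2audio, indexing_pipeline)
```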

The Retrieval-Augmented Generation (RAG) Pipeline

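This is by far the most fun part. We now define our RAG pipeline, which determines how we query the videos. Although RAG pipelines are most often used for question answering, they can be designed for many other use cases. What the pipeline does in this case depends largely on the prompt you give the LLM. You can find a variety of prompts for different use cases in the PromptHub.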

The Prompt

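In this example, we use a common style of question-answering prompt, but you can of course change it to achieve whatever you want. For example, it could be interesting to change it to a prompt that asks for a summary. You could also make it more general. Here, we also tell the model that the transcripts are from Weaviate videos.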

You will be provided some transcripts from Weaviate YouTube videos.
Please answer the query based on what is said in the videos.
Video Transcripts: {join(documents)}
Query: {query}
Answer:

In Haystack, these prompts can be included in a pipeline using the PromptTemplate and PromptNode components.

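The PromptTemplate is where we define the prompt and the variables it expects (in our case, documents and query), while the PromptNode is the interface through which we interact with the LLM. In this example we use GPT-4 as our model of choice, but you can switch to other models via Hugging Face, SageMaker, Azure, and so on.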

from haystack.nodes import PromptNode, PromptTemplate, AnswerParser

# The template expects two variables: documents (from the retriever) and query
video_qa_prompt = PromptTemplate(prompt="You will be provided some transcripts from Weaviate YouTube videos. Please answer the query based on what is said in the videos.\n"
                                        "Video Transcripts: {join(documents)}\n"
                                        "Query: {query}\n"
                                        "Answer:", output_parser=AnswerParser())

# 'OPENAI_API_KEY' is a placeholder: replace it with your own OpenAI API key
prompt_node = PromptNode(model_name_or_path="gpt-4",
                         api_key='OPENAI_API_KEY',
                         default_prompt_template=video_qa_prompt)
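
To get a feel for what the LLM ends up receiving, here is a rough plain-Python illustration of the template being filled in. Haystack does this internally, so this is not its implementation: the helper `fill_prompt`, the space delimiter, and the two sample transcript chunks are assumptions made for the sketch.

```python
PROMPT = (
    "You will be provided some transcripts from Weaviate YouTube videos. "
    "Please answer the query based on what is said in the videos.\n"
    "Video Transcripts: {documents}\n"
    "Query: {query}\n"
    "Answer:"
)


def fill_prompt(documents: list[str], query: str, delimiter: str = " ") -> str:
    """Join the retrieved transcript chunks and substitute them into the prompt."""
    return PROMPT.format(documents=delimiter.join(documents), query=query)


# Made-up chunks standing in for what a retriever might return
filled = fill_prompt(
    ["Chunking splits long text into passages.", "Each passage must fit the context window."],
    "Why do we do chunking?",
)
print(filled)
```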

The Pipeline

Finally, we define our RAG pipeline. The important thing to note here is how the documents input is supplied to the prompt we're using.

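Haystack retrievers always return documents. Notice that the first component to receive the query below is the same EmbeddingRetriever we used in the indexing pipeline above. This lets us embed the query with the same model that was used to index the transcript. The embeddings of the query and of the indexed transcript are then used to retrieve the most relevant parts of the transcript. Because the retriever returns these as documents, we can use whatever it returns to fill in the documents parameter of the prompt.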

video_rag_pipeline = Pipeline()
video_rag_pipeline.add_node(component=embedder, name="Retriever", inputs=["Query"])
video_rag_pipeline.add_node(component=prompt_node, name="PromptNode", inputs=["Retriever"])
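
The retrieval step can be pictured with a toy example. The sketch below ranks chunks by the dot product between a query vector and each chunk vector. The three-dimensional vectors, the sample texts, and the `retrieve` helper are all made up, and dot-product scoring is an assumption here (suggested by the "dot" in the embedding model's name); the similarity actually used depends on how the document store is configured.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec: list[float], indexed: list[tuple[str, list[float]]], top_k: int = 2) -> list[str]:
    """Return the texts of the top_k chunks that score highest against the query."""
    ranked = sorted(indexed, key=lambda item: dot(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]


# Toy 3-dimensional vectors standing in for real embeddings (made-up values)
indexed_chunks = [
    ("Chunking keeps passages within the context window.", [0.9, 0.1, 0.0]),
    ("Weaviate is a vector database.", [0.1, 0.8, 0.2]),
    ("Overlap preserves context between chunks.", [0.7, 0.0, 0.3]),
]
query_embedding = [1.0, 0.0, 0.1]  # pretend embedding of "Why do we do chunking?"

top_chunks = retrieve(query_embedding, indexed_chunks, top_k=2)
print(top_chunks)
```

The two chunks about chunking score highest, so they are the ones that would be passed on to fill the documents variable of the prompt.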

We can now run the pipeline with a query. The response will be based on what Erika says in the example video we used 🤗

result = video_rag_pipeline.run("Why do we do chunking?")

The answer I got to this question was the following:

Chunking is done to ensure that the language model is receiving the most
relevant information and not going over the context window. It involves
splitting up the text once it hits a certain token limit, depending on
the model or the chunk size defined. This is especially useful in documents
where subsequent sentences or sections may not make sense without the
information from previous ones. Chunking can also help in providing extremely
relevant information when making queries that are specific to titles or
sections.
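
To pull the answer text out of the result programmatically, something like the sketch below should work. It assumes Haystack 1.x behaviour, where a PromptTemplate with an AnswerParser makes the pipeline return a list of answer objects under the "answers" key, each with an `answer` attribute; the helper name `answer_texts` is made up for this example.

```python
def answer_texts(result: dict) -> list[str]:
    """Collect the answer strings from a pipeline result dictionary.

    Assumes the answers live under result["answers"] and that each one
    exposes its text as an `.answer` attribute.
    """
    return [answer.answer for answer in result.get("answers", [])]


# Intended usage with the result from the RAG pipeline above:
# print(answer_texts(result)[0])
```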

Further Improvements

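In this example, we used a transcription model that converts audio to text but cannot distinguish between speakers. A follow-up step I'd like to try is using a model that supports speaker distinction. That would let me ask a question and learn from the model's response who in the video gave that answer.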
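
As a rough idea of what that could look like on the indexing side, the sketch below turns speaker-labelled segments into transcript text that carries the speaker names, so that the LLM could see who said what. Everything here is hypothetical: the (speaker, text) segment format, the speaker names, and the `label_segments` helper are not the output or API of any particular model.

```python
def label_segments(segments: list[tuple[str, str]]) -> str:
    """Merge consecutive segments from the same speaker and prefix each turn with the speaker."""
    turns: list[list[str]] = []  # each item is [speaker, text]
    for speaker, text in segments:
        if turns and turns[-1][0] == speaker:
            turns[-1][1] += " " + text  # same speaker keeps talking
        else:
            turns.append([speaker, text])  # a new speaker starts a new turn
    return "\n".join(f"{speaker}: {text}" for speaker, text in turns)


# Made-up segments standing in for the output of a speaker-aware transcription model
segments = [
    ("Speaker 1", "Why do we chunk documents?"),
    ("Speaker 2", "To fit the context window."),
    ("Speaker 2", "And to improve retrieval."),
]
labeled_transcript = label_segments(segments)
print(labeled_transcript)
```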

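Another point I'd like to make is that this pipeline was built for demonstration purposes: it uses a lightweight yet very effective sentence-transformers model for retrieval and default settings for preprocessing. There's certainly more that could be done to find out which embedding model works best for retrieval. And, taking inspiration from Erika's video, the chunking and preprocessing of the transcribed documents could be evaluated and improved.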

To learn more about the pipelines and components available to help you build custom LLM applications, check out the Haystack documentation.