Speaker Diarization with AssemblyAI
Last updated: September 24, 2024
📚 This cookbook comes with a full walkthrough article: "Level Up Your RAG Application with Speaker Diarization"
LLMs excel at text data, answering complex questions without any manual reading or searching. When working with audio or video, providing a transcript is key. A transcript captures the spoken content of an audio or video file, but in multi-speaker recordings it can miss non-verbal information and fails to convey the number of speakers or what each individual said. So, to get the most out of an LLM on such recordings, **speaker diarization** is essential!
In this example, we'll build a RAG application with speaker labels for audio files. The application will use Haystack and AssemblyAI's speaker diarization model.
📚 Useful resources
Install dependencies
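For context, speaker diarization turns raw audio into utterances tagged with speaker letters. A minimal sketch of that output shape in plain Python (no API call; the utterance text here is invented for illustration):

```python
# Sketch of the utterance structure a diarization model returns:
# each utterance pairs a speaker label with the text that speaker said.
utterances = [
    {"speaker": "A", "text": "Welcome everyone to the panel."},
    {"speaker": "B", "text": "Thanks for having me."},
    {"speaker": "A", "text": "Let's start with the first question."},
]

# Flatten into the "Speaker X: ..." lines an LLM prompt can use.
transcript = "\n".join(f"Speaker {u['speaker']}: {u['text']}" for u in utterances)
print(transcript)

# Count distinct speakers — exactly the information a plain transcript loses.
speakers = sorted({u["speaker"] for u in utterances})
print(f"{len(speakers)} speakers: {', '.join(speakers)}")
```

This is the structure that, once indexed with speaker metadata, lets the LLM answer "who said what" questions.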
%%bash
pip install haystack
pip install assemblyai-haystack
pip install "sentence-transformers>=3.0.0"
pip install "huggingface_hub>=0.23.0"
pip install --upgrade gdown
Download the audio files
We extracted the audio from YouTube videos and saved it in a Google Drive folder: https://drive.google.com/drive/folders/10zsFuHmj3oytYMyGrLdytpW-6JzT9T_W?usp=drive_link
Run the code below to download the audio files into this Colab notebook; they will appear under the "Files" tab in the left sidebar.
!gdown https://drive.google.com/drive/folders/10zsFuHmj3oytYMyGrLdytpW-6JzT9T_W -O "/content" --folder
Retrieving folder contents
Processing file 12654ySXSYc2rZnPgNxXZwWt2hH-kTNDZ Netflix_Q4_2023_Earnings_Interview.mp3
Processing file 1Zb15D_nrBzWlM3K8FuPOmyiCiYvsuJLD Panel_Discussion.mp3
Processing file 1FFKGEZAUSmJayZgGaAe1uFP9HUtOK5m- Working_From_Home_Debate.mp3
Retrieving folder contents completed
Building directory structure
Building directory structure completed
Downloading...
From: https://drive.google.com/uc?id=12654ySXSYc2rZnPgNxXZwWt2hH-kTNDZ
To: /content/Netflix_Q4_2023_Earnings_Interview.mp3
100% 39.1M/39.1M [00:00<00:00, 67.6MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Zb15D_nrBzWlM3K8FuPOmyiCiYvsuJLD
To: /content/Panel_Discussion.mp3
100% 21.8M/21.8M [00:00<00:00, 60.4MB/s]
Downloading...
From: https://drive.google.com/uc?id=1FFKGEZAUSmJayZgGaAe1uFP9HUtOK5m-
To: /content/Working_From_Home_Debate.mp3
100% 4.45M/4.45M [00:00<00:00, 34.8MB/s]
Download completed
Add your API keys
Enter your API keys from AssemblyAI and Hugging Face:
import os
from getpass import getpass
ASSEMBLYAI_API_KEY = getpass("Enter your ASSEMBLYAI_API_KEY: ")
os.environ["HF_API_TOKEN"] = getpass("HF_API_TOKEN: ")
Enter your ASSEMBLYAI_API_KEY: ··········
HF_API_TOKEN: ··········
Index speaker labels into your DocumentStore
Build a pipeline to generate speaker labels and index them, along with their embeddings, into a DocumentStore. For this pipeline you need:
- InMemoryDocumentStore: stores your documents without external dependencies or extra setup
- AssemblyAITranscriber: creates `speaker_labels` for the given audio file and converts them into Haystack Documents
- DocumentSplitter: splits your documents into smaller chunks
- SentenceTransformersDocumentEmbedder: creates an embedding for each document using a sentence-transformers model
- DocumentWriter: writes these documents into your document store
Note: the speaker information is saved in the `meta` of the Document objects.
from haystack.components.writers import DocumentWriter
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from assemblyai_haystack.transcriber import AssemblyAITranscriber
from haystack.document_stores.types import DuplicatePolicy
from haystack.utils import ComponentDevice
speaker_document_store = InMemoryDocumentStore()
transcriber = AssemblyAITranscriber(api_key=ASSEMBLYAI_API_KEY)
speaker_splitter = DocumentSplitter(
    split_by="sentence",
    split_length=10,
    split_overlap=1
)
speaker_embedder = SentenceTransformersDocumentEmbedder(device=ComponentDevice.from_str("cuda:0"))
speaker_writer = DocumentWriter(speaker_document_store, policy=DuplicatePolicy.SKIP)
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=transcriber, name="transcriber")
indexing_pipeline.add_component(instance=speaker_splitter, name="speaker_splitter")
indexing_pipeline.add_component(instance=speaker_embedder, name="speaker_embedder")
indexing_pipeline.add_component(instance=speaker_writer, name="speaker_writer")
indexing_pipeline.connect("transcriber.speaker_labels", "speaker_splitter")
indexing_pipeline.connect("speaker_splitter", "speaker_embedder")
indexing_pipeline.connect("speaker_embedder", "speaker_writer")
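The DocumentSplitter above chunks by sentence, 10 sentences per chunk with 1 sentence of overlap, so adjacent chunks share one sentence of context. A rough pure-Python sketch of that windowing (a simplification of Haystack's actual splitter, which handles sentence boundaries more carefully):

```python
def split_sentences(sentences, split_length=10, split_overlap=1):
    """Window a list of sentences into overlapping chunks of split_length,
    advancing by (split_length - split_overlap) each step."""
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + split_length]))
        if start + split_length >= len(sentences):
            break
    return chunks

# 25 toy sentences -> 3 chunks; each chunk repeats the last sentence
# of the previous one (the split_overlap=1 setting).
sents = [f"Sentence {i}." for i in range(25)]
chunks = split_sentences(sents)
print(len(chunks))
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from either side.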
Provide an audio_file_path and run your pipeline
audio_file_path = "/content/Panel_Discussion.mp3" #@param ["/content/Netflix_Q4_2023_Earnings_Interview.mp3", "/content/Working_From_Home_Debate.mp3", "/content/Panel_Discussion.mp3"]
indexing_pipeline.run(
{
"transcriber": {
"file_path": audio_file_path,
"summarization": None,
"speaker_labels": True
},
}
)
{'transcriber': {'transcription': [Document(id=427e56c68f0440dd8f51643ba52e2a2b60c739f4fc42ddab7207fb428da4492d, content: 'I want to start with you, Amy, because I know you, obviously at Shell have had AI as part of the wor...', meta: {'transcript_id': 'c053a806-6826-40ac-a6bc-95cab9b4cb8a', 'audio_url': 'https://cdn.assemblyai.com/upload/188cdd14-ff33-4468-81cb-e2c337674fc5'})]},
'speaker_writer': {'documents_written': 64}}
RAG pipeline with speaker labels
Build a RAG pipeline to generate answers to questions about the recording. Make sure the speaker information (available through the documents' metadata) is included in the prompt for the LLM so it can distinguish who said what. For this pipeline you need:
- SentenceTransformersTextEmbedder: creates an embedding for the user query using a sentence-transformers model
- InMemoryEmbeddingRetriever: retrieves the `top_k` documents relevant to the user query
- PromptBuilder: provides a RAG prompt template with instructions, to be filled with the retrieved documents and the user query
- HuggingFaceAPIGenerator: performs inference with models served through the free Hugging Face Serverless Inference API or Hugging Face TGI
The LLM in this example (mistralai/Mixtral-8x7B-Instruct-v0.1) is a gated model. Make sure you have access to it.
from haystack import Pipeline
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.utils import ComponentDevice
prompt = """
You will be provided with a transcription of a recording with each sentence or group of sentences attributed to a Speaker by the word "Speaker" followed by a letter representing the person uttering that sentence. Answer the given question based on the given context.
If you think that given transcription is not enough to answer the question, say so.
Transcription:
{% for doc in documents %}
{% if doc.meta["speaker"] %} Speaker {{doc.meta["speaker"]}}: {% endif %}{{doc.content}}
{% endfor %}
Question: {{ question }}
<|end_of_turn|>
Answer:
"""
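PromptBuilder fills this Jinja template at query time. A hand-rolled stand-in for that rendering (plain Python with hypothetical sample documents, mirroring the speaker metadata the indexing pipeline stores) shows what the transcription section of the prompt ends up looking like:

```python
# Hypothetical retrieved documents with speaker info in meta.
documents = [
    {"content": "We built everything in-house.", "meta": {"speaker": "B"}},
    {"content": "We prefer third-party platforms.", "meta": {"speaker": "C"}},
]

# Manual equivalent of the template's {% for doc in documents %} loop:
# prefix each chunk with "Speaker X: " only when the speaker is known.
lines = []
for doc in documents:
    speaker = doc["meta"].get("speaker")
    prefix = f"Speaker {speaker}: " if speaker else ""
    lines.append(prefix + doc["content"])
rendered = "\n".join(lines)
print(rendered)
```

Attributing each chunk to a speaker letter in the prompt is what lets the LLM answer per-speaker questions.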
retriever = InMemoryEmbeddingRetriever(speaker_document_store)
text_embedder = SentenceTransformersTextEmbedder(device=ComponentDevice.from_str("cuda:0"))
answer_generator = HuggingFaceAPIGenerator(
api_type="serverless_inference_api",
api_params={"model": "mistralai/Mixtral-8x7B-Instruct-v0.1"},
generation_kwargs={"max_new_tokens":500})
prompt_builder = PromptBuilder(template=prompt)
speaker_rag_pipe = Pipeline()
speaker_rag_pipe.add_component("text_embedder", text_embedder)
speaker_rag_pipe.add_component("retriever", retriever)
speaker_rag_pipe.add_component("prompt_builder", prompt_builder)
speaker_rag_pipe.add_component("llm", answer_generator)
speaker_rag_pipe.connect("text_embedder.embedding", "retriever.query_embedding")
speaker_rag_pipe.connect("retriever.documents", "prompt_builder.documents")
speaker_rag_pipe.connect("prompt_builder.prompt", "llm.prompt")
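Under the hood, InMemoryEmbeddingRetriever scores the stored document embeddings against the query embedding and keeps the best `top_k`. A toy sketch of that ranking with cosine similarity (tiny made-up vectors standing in for real sentence-transformers embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_emb, docs, top_k=2):
    """Return the top_k docs ranked by cosine similarity to the query."""
    ranked = sorted(docs, key=lambda d: cosine(query_emb, d["embedding"]), reverse=True)
    return ranked[:top_k]

docs = [
    {"content": "remote work debate", "embedding": [0.9, 0.1, 0.0]},
    {"content": "earnings interview", "embedding": [0.1, 0.9, 0.0]},
    {"content": "panel discussion",   "embedding": [0.7, 0.3, 0.1]},
]
top = retrieve([1.0, 0.0, 0.0], docs, top_k=2)
print([d["content"] for d in top])
```

The real retriever does the same thing over 384-plus-dimensional embeddings for all 64 indexed chunks.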
Test RAG with speaker labels
question = "What are each speakers' opinions on building in-house or using third parties?" # @param ["What are the two opposing opinions and how many people are on each side?", "What are each speakers' opinions on building in-house or using third parties?", "How many people are speaking in this recording?" ,"How many speakers and moderators are in this call?"]
result = speaker_rag_pipe.run({
"prompt_builder":{"question": question},
"text_embedder":{"text": question},
"retriever":{"top_k": 10}
})
result["llm"]["replies"][0]
" Speaker A is interested in understanding how companies decide between building in-house solutions or using third parties. Speaker B believes that the decision depends on whether the task is part of the company's core IP or not. They also mention that the build versus buy decision is too simplistic, as there are other options like partnering or using third-party platforms. Speaker C takes a mixed approach, using open source and partnering, and emphasizes the importance of embedding AI into the business. Speaker B thinks that AI is not magic and requires hard work, process, and change management, just like any other business process."
