Simple Keyword Extraction using OpenAIChatGenerator


Last updated: May 9, 2025

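This notebook demonstrates how to extract keywords and key phrases from text using Haystack's ChatPromptBuilder together with an LLM accessed through OpenAIChatGenerator. We will:

- Define a prompt that instructs the model to identify single-word and multi-word keywords.
- Capture the character offset of each keyword.
- Assign a relevance score (0-1).
- Parse the results as JSON and display them.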

Install packages and set up the OpenAI API key

!pip install haystack-ai

import os
from getpass import getpass

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")

Import the required libraries

import json


from haystack.dataclasses import ChatMessage
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator

Prepare the text

Collect the text you want to analyze.

text_to_analyze = "Artificial intelligence models like large language models are increasingly integrated into various sectors including healthcare, finance, education, and customer service. They can process natural language, generate text, translate languages, and extract meaningful insights from unstructured data. When performing key word extraction, these systems identify the most significant terms, phrases, or concepts that represent the core meaning of a document. Effective extraction must balance between technical terminology, domain-specific jargon, named entities, action verbs, and contextual relevance. The process typically involves tokenization, stopword removal, part-of-speech tagging, frequency analysis, and semantic relationship mapping to prioritize terms that most accurately capture the document's essential information and main topics."

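Build the prompt

We build a single-message template that instructs the model to extract keywords along with their positions and scores, and to return the output as a JSON object.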

messages = [
    ChatMessage.from_user(
        '''
You are a keyword extractor. Extract the most relevant keywords and phrases from the following text. For each keyword:
1. Find single and multi-word keywords that capture important concepts
2. Include the starting position (index) where each keyword appears in the text
3. Assign a relevance score between 0 and 1 for each keyword
4. Focus on nouns, noun phrases, and important terms

Text to analyze: {{text}}

Return the results as a JSON object in this exact format:
{
  "keywords": [
    {
      "keyword": "example term",
      "positions": [5],
      "score": 0.95
    },
    {
      "keyword": "another keyword",
      "positions": [20],
      "score": 0.85
    }
  ]
}

Important:
- Each keyword must have its EXACT character position in the text (counting from 0)
- Scores should reflect the relevance (0–1)
- Include both single words and meaningful phrases
- List results from highest to lowest score
'''
    )
]

builder = ChatPromptBuilder(template=messages, required_variables='*')
prompt = builder.run(text=text_to_analyze)

Initialize the generator and extract keywords

We use OpenAIChatGenerator (for example, with gpt-4o-mini) to send our prompt and request a JSON-formatted response.

# Initialize the chat-based generator
extractor = OpenAIChatGenerator(model="gpt-4o-mini")

# Run the generator with our formatted prompt
results = extractor.run(
    messages=prompt["prompt"],
    generation_kwargs={"response_format": {"type": "json_object"}}
)

# Extract the raw text reply
output_str = results["replies"][0].text

Parse and display the results

Finally, convert the returned JSON string into a Python object and iterate over the extracted keywords.

try:
    data = json.loads(output_str)
    for kw in data["keywords"]:
        print(f'Keyword: {kw["keyword"]}')
        print(f' Positions: {kw["positions"]}')
        print(f' Score: {kw["score"]}\n')
except json.JSONDecodeError:
    print("Failed to parse the output as JSON. Raw output:", output_str)
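The parsing step above assumes the reply always contains a well-formed "keywords" list. A minimal sketch of a more defensive parser follows, using only the standard library. The helper name `parse_keywords` and its validation rules are assumptions made for illustration and are not part of Haystack. It drops entries that do not match the format requested in the prompt and sorts the rest by score.

```python
import json


def parse_keywords(raw: str) -> list[dict]:
    """Parse a model reply and keep only well-formed keyword entries.

    Hypothetical helper: returns an empty list when the reply is not
    a JSON object with a "keywords" list.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []

    if not isinstance(data, dict):
        return []
    entries = data.get("keywords", [])
    if not isinstance(entries, list):
        return []

    valid = []
    for entry in entries:
        if not isinstance(entry, dict):
            continue
        keyword = entry.get("keyword")
        positions = entry.get("positions")
        score = entry.get("score")

        # The keyword must be a non-empty string
        if not isinstance(keyword, str) or not keyword.strip():
            continue
        # Positions must be a list of non-negative integers (bool excluded)
        if not isinstance(positions, list) or not all(
            isinstance(p, int) and not isinstance(p, bool) and p >= 0
            for p in positions
        ):
            continue
        # The score must be a number in the range [0, 1]
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            continue
        if not 0 <= score <= 1:
            continue

        valid.append(
            {"keyword": keyword, "positions": positions, "score": float(score)}
        )

    # Highest score first; sorted() is stable, so ties keep the model's order
    return sorted(valid, key=lambda kw: kw["score"], reverse=True)
```

In the notebook, this could be called as `parse_keywords(output_str)` in place of the bare `json.loads` call. Whether to drop malformed entries silently or raise an error is a design choice that depends on how the results are used downstream.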

Keyword: artificial intelligence
 Positions: [0]
 Score: 1.0

Keyword: large language models
 Positions: [18]
 Score: 0.95

Keyword: healthcare
 Positions: [63]
 Score: 0.9

Keyword: finance
 Positions: [72]
 Score: 0.9

Keyword: education
 Positions: [81]
 Score: 0.9

Keyword: customer service
 Positions: [91]
 Score: 0.9

Keyword: natural language
 Positions: [108]
 Score: 0.85

Keyword: unstructured data
 Positions: [162]
 Score: 0.85

Keyword: key word extraction
 Positions: [193]
 Score: 0.8

Keyword: significant terms
 Positions: [215]
 Score: 0.8

Keyword: technical terminology
 Positions: [290]
 Score: 0.75

Keyword: domain-specific jargon
 Positions: [311]
 Score: 0.75

Keyword: named entities
 Positions: [334]
 Score: 0.7

Keyword: action verbs
 Positions: [352]
 Score: 0.7

Keyword: contextual relevance
 Positions: [367]
 Score: 0.7

Keyword: tokenization
 Positions: [406]
 Score: 0.65

Keyword: stopword removal
 Positions: [420]
 Score: 0.65

Keyword: part-of-speech tagging
 Positions: [437]
 Score: 0.65

Keyword: frequency analysis
 Positions: [457]
 Score: 0.65

Keyword: semantic relationship mapping
 Positions: [476]
 Score: 0.65

Keyword: essential information
 Positions: [508]
 Score: 0.6

Keyword: main topics
 Positions: [529]
 Score: 0.6
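
The offsets in the output above are produced by the model, and LLMs are generally unreliable at counting characters. For instance, in text_to_analyze the phrase "large language models" begins at offset 36, not the reported 18. If exact positions matter, one option is to recompute them locally. The sketch below uses a hypothetical helper, `find_positions`, which is an assumption for illustration and not part of Haystack. It matches case-insensitively because the model may change a keyword's casing.

```python
import re


def find_positions(text: str, keyword: str) -> list[int]:
    """Return every start offset of `keyword` in `text`, ignoring case."""
    if not keyword:
        return []
    # re.escape keeps characters such as "-" or "." literal
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    # finditer yields non-overlapping matches from left to right
    return [match.start() for match in pattern.finditer(text)]


# A shortened stand-in for text_to_analyze
sample_text = (
    "Artificial intelligence models like large language models are widely used."
)
for keyword in ["artificial intelligence", "large language models", "models"]:
    print(keyword, find_positions(sample_text, keyword))
```

In the notebook, each entry's "positions" value could be replaced with `find_positions(text_to_analyze, kw["keyword"])`. An empty result then signals a keyword that does not appear verbatim in the text.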