🐦‍⬛ Information Extraction with Raven


Last updated: September 19, 2024

Notebook by Stefano Fiorucci

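In this experiment, we will use a Large Language Model to extract information from text data.

🎯 Goal: create an application that, given a URL and a specific structure provided by the user, extracts information from the source.

The "function calling" capabilities of OpenAI models unlocked this task: the user can describe a structure by defining a mock function with all its typed and specific parameters. The LLM prepares the data in this specific form and sends it back to the user.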
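As a minimal sketch of that idea (the function name `save_data` and its two fields are illustrative placeholders, not a fixed API), a typed Python stub can carry the whole structure, and the standard `inspect` module can read it back:

```python
import inspect
from typing import List


def save_data(about_animals: bool, habitat: List[str]):
    """
    Mock function: it is never executed, it only describes the wanted structure.

    Args:
    about_animals (bool): Is the article about animals?
    habitat (List[str]): List of places where the animal lives
    """


# The signature alone lists each field with its declared type
fields = {
    name: param.annotation
    for name, param in inspect.signature(save_data).parameters.items()
}
print(fields)
```

The model never runs this function. It only sees the definition and answers with a call whose arguments hold the extracted values.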

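A nice example of using OpenAI function calling for information extraction is this gist by Kyle McDonald.

What is changing now is that open models with function calling capabilities, such as NexusRaven, are emerging...

This is an improved version of an older experiment that used Gorilla Open Functions.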

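Stack

NexusRaven: an open-source and commercially viable function calling model that surpasses the state of the art in function calling capabilities.
Haystack: an open-source LLM orchestration framework that streamlines the development of your LLM applications.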

Install the dependencies

%%capture
! pip install haystack-ai "huggingface_hub>=0.22.0" trafilatura

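Load and try the model

We use the HuggingFaceAPIGenerator, which allows using models hosted on Hugging Face endpoints. In particular, we use a paid endpoint provided by Nexusflow to test the LLM.

Alternative inference options

Load the model in Colab using the HuggingFaceLocalGenerator. This is a bit impractical because the model is quite large (13B parameters) and, even with quantization, few GPU resources would be left for inference.
Local inference via TGI or vLLM: this is a good option if you have a GPU available.
Local inference via Ollama/llama.cpp: this suits machines with few resources and no GPU. Keep in mind that in this case a quantized GGUF version of the model is used, with lower quality than the original model.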

from haystack.components.generators import HuggingFaceAPIGenerator

generator = HuggingFaceAPIGenerator(
    api_type="inference_endpoints",
    api_params={"url": "http://38.142.9.20:10240"},
    stop_words=["<bot_end>"],
    generation_kwargs={"temperature":0.001,
                    "do_sample" : False,
                    "max_new_tokens" : 1000})

To understand how to prompt the model, take a look at the Prompting notebook. Later we will see how to better organize the prompt for our purpose.

prompt='''
Function:
def get_weather_data(coordinates):
    """
    Fetches weather data from the Open-Meteo API for the given latitude and longitude.

    Args:
    coordinates (tuple): The latitude of the location.

    Returns:
    float: The current temperature in the coordinates you've asked for
    """

Function:
def get_coordinates_from_city(city_name):
    """
    Fetches the latitude and longitude of a given city name using the Maps.co Geocoding API.

    Args:
    city_name (str): The name of the city.

    Returns:
    tuple: The latitude and longitude of the city.
    """

User Query: What's the weather like in Seattle right now?<human_end>

'''

print(generator.run(prompt=prompt))

{'replies': ["Call: get_weather_data(coordinates=get_coordinates_from_city(city_name='Seattle'))"], 'meta': [{'model': 'http://38.142.9.20:10240', 'index': 0, 'finish_reason': 'stop_sequence', 'usage': {'completion_tokens': 29, 'prompt_tokens': 188, 'total_tokens': 217}}]}

All good! ✅
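The reply is a nested call: the inner function must run first and feed its result to the outer one. The sketch below shows one possible way to execute such a reply without `eval`, by walking the syntax tree and allowing only known functions. The two functions here are stubs that return fixed, made-up values (assumptions for illustration); real implementations would query the APIs named in the prompt.

```python
import ast

calls = []  # records the order in which the functions run


def get_coordinates_from_city(city_name):
    # Stub: returns fixed, illustrative coordinates instead of calling a geocoding API
    calls.append("get_coordinates_from_city")
    return (47.6, -122.3)


def get_weather_data(coordinates):
    # Stub: returns a fixed, illustrative temperature instead of calling a weather API
    calls.append("get_weather_data")
    return 12.5


ALLOWED_FUNCTIONS = {
    "get_coordinates_from_city": get_coordinates_from_city,
    "get_weather_data": get_weather_data,
}


def run_call(node):
    """Evaluate a call node, resolving nested calls first; other nodes must be literals."""
    if isinstance(node, ast.Call):
        func = ALLOWED_FUNCTIONS[node.func.id]  # KeyError for any unknown function
        args = [run_call(arg) for arg in node.args]
        kwargs = {kw.arg: run_call(kw.value) for kw in node.keywords}
        return func(*args, **kwargs)
    return ast.literal_eval(node)


reply = "Call: get_weather_data(coordinates=get_coordinates_from_city(city_name='Seattle'))"
expression = ast.parse(reply.replace("Call:", "").strip(), mode="eval").body
result = run_call(expression)
print(calls, result)
```

Restricting execution to an explicit allow-list is a deliberate choice here: the reply is model-generated text, so it should not be trusted to call arbitrary code.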

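Prompt template and Prompt Builder

The prompt template to apply is model specific. In our case, we slightly customize the original prompt, which is available in the Prompting notebook.
In Haystack, the prompt template is rendered using the Prompt Builder component.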

from haystack.components.builders import PromptBuilder

prompt_template = '''
Function:
{{function}}
User Query: Save data from the provided text. START TEXT:{{docs[0].content|replace("\n"," ")|truncate(10000)}} END TEXT
<human_end>'''

prompt_builder = PromptBuilder(template=prompt_template)

# let's see if the Prompt Builder works properly

from haystack import Document
print(prompt_builder.run(docs=[Document(content="my fake document")], function="my fake function definition"))

{'prompt': '\nFunction:\nmy fake function definition\nUser Query: Save data from the provided text. START TEXT:my fake document END TEXT\n<human_end>'}

Nice ✅
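The two filters in the template flatten newlines and cap the document length, so that a long web page does not overflow the prompt. A rough plain-Python approximation is sketched below. `flatten_and_cap` is a hypothetical helper, and Jinja's real `truncate` filter has extra options (such as a length leeway) that are not reproduced here.

```python
def flatten_and_cap(text: str, max_chars: int = 10000, end: str = "...") -> str:
    """Replace newlines with spaces, then shorten the text to at most max_chars."""
    flat = text.replace("\n", " ")
    if len(flat) <= max_chars:
        return flat
    # Cut at the last space before the limit so that no word is split, then mark the cut
    shortened = flat[: max_chars - len(end)].rsplit(" ", 1)[0]
    return shortened + end


print(flatten_and_cap("first line\nsecond line"))
print(len(flatten_and_cap("word " * 5000)))
```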

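Other components

The following components are required for the Pipeline we are about to create. However, they are simple and need no customization or experimentation, so we can instantiate them directly during Pipeline creation.

LinkContentFetcher: fetches the contents of the URLs you give it and returns them in a list of content streams.
HTMLToDocument: converts HTML files to Documents.
DocumentCleaner: makes text documents more readable.

Define a custom component to parse and visualize the result

The output of the model generation is a function call string.

We are going to create a simple Haystack component to appropriately parse this string and create a nice HTML visualization.

For more information on creating custom components, see the docs.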

from haystack import component
from typing import List, Optional
import ast
import re

def val_to_color(val):
  """
  Helper function to return a color based on the type/value of a variable
  """
  if isinstance(val, list):
    return "#FFFEE0"
  if val is True:
    return "#90EE90"
  if val is False:
    return "#FFCCCB"
  return ""

@component
class FunctionCallParser:
  """
  A component that parses the function call string and creates a HTML visualization
  """
  @component.output_types(html_visualization=str)
  def run(self, replies:List[str]):

    print(replies)

    func_call_str = replies[0].replace("Call:", "").strip()

    # sometimes the model output contains wrong expressions like "'date=[...]" or "date'=..."
    # that can't be correctly parsed, so we remove these substrings
    func_call_str=func_call_str.replace("'=","=")
    func_call_str=re.sub(r"'([a-z]+)=", r"\g<1>=", func_call_str)

    func_call=ast.parse(func_call_str).body[0].value
    kwargs = {arg.arg: ast.literal_eval(arg.value) for arg in func_call.keywords}

    # Convert data to HTML format
    html_content = '<div style="border: 1px solid #ccc; padding: 10px; border-radius: 5px; background-color: #f9f9f9;">'
    for key, value in kwargs.items():
        html_content += f'<p><span style="font-family: Cursive; font-size: 30px;">{key}:</span>'
        html_content += f'&emsp;<span style="background-color:{val_to_color(value)}; font-family: Cursive; font-size: 20px;">{value}</span></p>'
    html_content += '</div>'

    return {"html_visualization": html_content}
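The parsing step can be tried in isolation, without running the model or Haystack. The sketch below applies the same cleanup and `ast` logic to a hand-written reply. `parse_call_kwargs` is a hypothetical helper name, and the sample reply is made up, including the stray quote before `habitat`.

```python
import ast
import re


def parse_call_kwargs(reply: str) -> dict:
    """Turn a reply like "Call: f(a=1, b=['x'])" into {'a': 1, 'b': ['x']}."""
    call_str = reply.replace("Call:", "").strip()
    # Remove stray quotes around argument names, e.g. "'habitat=" or "habitat'="
    call_str = call_str.replace("'=", "=")
    call_str = re.sub(r"'([a-z]+)=", r"\g<1>=", call_str)
    call = ast.parse(call_str).body[0].value
    # literal_eval only accepts literals, so the argument values are never executed
    return {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}


sample_reply = "\nCall: save_data(about_animals=True, 'habitat=['Andes'], diet=['fruit', 'insects'])"
parsed = parse_call_kwargs(sample_reply)
print(parsed)
```

Using `ast.literal_eval` on each argument, rather than evaluating the whole string, keeps the parser limited to plain values such as booleans, strings and lists.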

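Create an Information Extraction Pipeline

To combine the components in an appropriate and reproducible way, we resort to Haystack Pipelines. The syntax should be easily understood. You can find more information in the docs.

This Pipeline will extract information from the given URL, following the provided structure.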

from haystack import Pipeline
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.converters import HTMLToDocument
from haystack.components.preprocessors import DocumentCleaner

pipe = Pipeline()

pipe.add_component("fetcher", LinkContentFetcher())
pipe.add_component("converter", HTMLToDocument(extractor_type="DefaultExtractor"))
pipe.add_component("cleaner", DocumentCleaner())
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("generator", generator)
pipe.add_component("parser", FunctionCallParser())

pipe.connect("fetcher", "converter")
pipe.connect("converter", "cleaner")
pipe.connect("cleaner.documents", "prompt_builder.docs")
pipe.connect("prompt_builder", "generator")
pipe.connect("generator", "parser")

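Now we create an extract function that wraps the Pipeline and displays the result in HTML format. It accepts:

a function string, containing the structure definition of the information we want to extract
a url, to use as the data source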

from IPython.display import display, HTML

def extract(function:str, url:str) -> None:
  if not function:
    raise ValueError("function definition is needed")
  if not url:
    raise ValueError("URL is needed")

  data_for_pipeline = {"fetcher":{"urls":[url]},
                       "prompt_builder":{"function":function}}

  html_visualization = pipe.run(data=data_for_pipeline)['parser']['html_visualization']
  display(HTML(html_visualization))

🕹️ Try our application!

Let's first define the structure to extract.

We are going to parse some news articles about animals... 🦆🐻🦌

function = '''def save_data(about_animals: bool, about_ai: bool, habitat:List[string], predators:List[string], diet:List[string]):
    """
    Save data extracted from source text

    Args:
    about_animals (bool): Is the article about animals?
    about_ai (bool): Is the article about artificial intelligence?
    habitat (List[string]): List of places where the animal lives
    predators (List[string]): What are the animals that threaten them?
    diet (List[string]): What does the animal eat?
    """'''

Let's start with an article about capybaras

extract(function=function, url="https://www.rainforest-alliance.org/species/capybara/")

INFO:haystack.core.pipeline.pipeline:Warming up component generator...


["\nCall: save_data(about_animals=True, about_ai=False, habitat=['Panama', 'Colombia', 'Venezuela', 'Guyana', 'Peru', 'Brazil', 'Paraguay', 'Northeast Argentina', 'Uruguay'], predators=['jaguars', 'caimans', 'anacondas', 'ocelots', 'harpy eagles'], diet=['vegetation', 'grass', 'grains', 'melons', 'reeds', 'squashes'])"]

about_animals: True

about_ai: False

habitat: ['Panama', 'Colombia', 'Venezuela', 'Guyana', 'Peru', 'Brazil', 'Paraguay', 'Northeast Argentina', 'Uruguay']

predators: ['jaguars', 'caimans', 'anacondas', 'ocelots', 'harpy eagles']

diet: ['vegetation', 'grass', 'grains', 'melons', 'reeds', 'squashes']
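The structure passed to extract is plain text in a fixed layout, so it can also be generated rather than written by hand. The helper below is a hypothetical convenience (not part of Haystack or NexusRaven) that builds a definition in the same layout from a list of (name, type, description) tuples:

```python
from typing import List, Tuple


def build_function_definition(fields: List[Tuple[str, str, str]]) -> str:
    """Build a save_data definition string from (name, type, description) tuples."""
    signature = ", ".join(f"{name}: {type_}" for name, type_, _ in fields)
    lines = [
        f"def save_data({signature}):",
        '    """',
        "    Save data extracted from source text",
        "",
        "    Args:",
    ]
    lines += [f"    {name} ({type_}): {desc}" for name, type_, desc in fields]
    lines.append('    """')
    return "\n".join(lines)


definition = build_function_definition([
    ("about_animals", "bool", "Is the article about animals?"),
    ("habitat", "List[string]", "List of places where the animal lives"),
])
print(definition)
```

The resulting string can be passed as the function argument of extract, in place of a hand-written definition.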

Now let's try an article about the Andean cock-of-the-rock

extract(function=function, url="https://www.rainforest-alliance.org/species/cock-rock/")

INFO:haystack.core.pipeline.pipeline:Warming up component generator...


["\nCall: save_data(about_animals=True, about_ai=False, habitat=['Andes'], predators=['birds of prey', 'puma', 'jaguars', 'boa constrictors'], diet=['fruit', 'insects', 'small vertebrates'])"]

about_animals: True

about_ai: False

habitat: ['Andes']

predators: ['birds of prey', 'puma', 'jaguars', 'boa constrictors']

diet: ['fruit', 'insects', 'small vertebrates']

Now, the Yucatan deer!

extract(function=function, url="https://www.rainforest-alliance.org/species/yucatan-deer/")

INFO:haystack.core.pipeline.pipeline:Warming up component generator...


["\nCall: save_data(about_animals=True, about_ai=False, habitat=['forests'], predators=['cougar', 'jaguar'], diet=['grass', 'leaves', 'sprouts', 'lichens', 'mosses', 'tree bark', 'fruit'])"]

about_animals: True

about_ai: False

habitat: ['forests']

predators: ['cougar', 'jaguar']

diet: ['grass', 'leaves', 'sprouts', 'lichens', 'mosses', 'tree bark', 'fruit']

A completely different example, about AI...

function='''def save_data(people:List[string], companies:List[string], summary:string, topics:List[string], about_animals: bool, about_ai: bool):
    """
    Save data extracted from source text

    Args:
    people (List[string]): List of the mentioned people
    companies (List[string]): List of the mentioned companies.
    summary (string): Summarize briefly what happened in one sentence of max 15 words.
    topics (List[string]): what are the five most important topics?
    about_animals (bool): Is the article about animals?
    about_ai (bool): Is the article about artificial intelligence?
    """'''

extract(function=function, url="https://www.theverge.com/2023/11/22/23967223/sam-altman-returns-ceo-open-ai")

INFO:haystack.core.pipeline.pipeline:Warming up component generator...


["\nCall: save_data(people=['Sam Altman', 'Greg Brockman', 'Bret Taylor', 'Larry Summers', 'Adam D’Angelo', 'Ilya Sutskever', 'Emmett Shear'], companies=['OpenAI', 'Microsoft', 'Thrive Capital'], summary='Sam Altman will return as CEO of OpenAI, overcoming an attempted boardroom coup that sent the company into chaos over the past several days.', topics=['OpenAI', 'Artificial intelligence', 'Machine learning', 'Computer vision', 'Natural language processing'], about_animals=False, about_ai=True)"]

people: ['Sam Altman', 'Greg Brockman', 'Bret Taylor', 'Larry Summers', 'Adam D’Angelo', 'Ilya Sutskever', 'Emmett Shear']

companies: ['OpenAI', 'Microsoft', 'Thrive Capital']

summary: Sam Altman will return as CEO of OpenAI, overcoming an attempted boardroom coup that sent the company into chaos over the past several days.

topics: ['OpenAI', 'Artificial intelligence', 'Machine learning', 'Computer vision', 'Natural language processing']

about_animals: False

about_ai: True

extract(function=function, url="https://www.theguardian.com/business/2023/dec/30/sam-bankman-fried-will-not-face-second-trial-after-multibillion-dollar-crypto-conviction")

INFO:haystack.core.pipeline.pipeline:Warming up component generator...


["\nCall: save_data(people=['Sam Bankman-Fried'], companies=['FTX'], summary='Sam Bankman-Fried will not face second trial after multibillion-dollar crypto fraud conviction', topics=['crypto fraud', 'FTX', 'cryptocurrency exchange'], about_animals=False, about_ai=False)"]

people: ['Sam Bankman-Fried']

companies: ['FTX']

summary: Sam Bankman-Fried will not face second trial after multibillion-dollar crypto fraud conviction

topics: ['crypto fraud', 'FTX', 'cryptocurrency exchange']

about_animals: False

about_ai: False

extract(function=function, url="https://lite.cnn.com/2023/11/05/tech/nvidia-amd-ceos-taiwan-intl-hnk/index.html")

INFO:haystack.core.pipeline.pipeline:Warming up component generator...


["\nCall: save_data(people=['Michelle Toh', 'Wayne Chang', 'Jensen Huang', 'Lisa Su'], companies=['Nvidia', 'AMD'], summary='The Taiwanese American cousins going head-to-head in the global AI race', topics=['chip industry', 'global AI chip industry', 'Taiwanese descent', 'semiconductors', 'generative AI'], about_animals=False, about_ai=True)"]

people: ['Michelle Toh', 'Wayne Chang', 'Jensen Huang', 'Lisa Su']

companies: ['Nvidia', 'AMD']

summary: The Taiwanese American cousins going head-to-head in the global AI race

topics: ['chip industry', 'global AI chip industry', 'Taiwanese descent', 'semiconductors', 'generative AI']

about_animals: False

about_ai: True

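✨ Conclusions and caveats

In this use case, Nexus Raven seems to perform much better than Gorilla Open Functions (v0).
I also expect it to perform much better than generic models made to generate JSON by adding a grammar.
⚠️ When the content of the web page is cluttered with extraneous information such as advertisements and interruptions, the model struggles to extract the relevant information, which sometimes results in empty responses.
⚠️ As a statistical model, the LLM is highly sensitive to the prompt. For example, changing the order and description of the specified arguments can lead to different extraction results.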
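Because empty or malformed replies can occur, it may help to guard the parsing step. A minimal sketch, assuming a hypothetical helper `safe_parse` that returns an empty dict instead of raising:

```python
import ast


def safe_parse(replies: list) -> dict:
    """Parse the first reply into keyword arguments, or return {} if that is not possible."""
    if not replies or not replies[0].strip():
        return {}
    call_str = replies[0].replace("Call:", "").strip()
    try:
        call = ast.parse(call_str, mode="eval").body
        if not isinstance(call, ast.Call):
            return {}
        return {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    except (SyntaxError, ValueError):
        # SyntaxError: the reply is not valid Python; ValueError: a value is not a literal
        return {}


print(safe_parse([""]))
print(safe_parse(["Call: save_data(about_ai=True)"]))
```

An empty result can then be shown as "nothing extracted" rather than crashing the whole pipeline run.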

📚 References

Related to the experiment