🦍 Information Extraction with Gorilla


Last updated: September 19, 2024

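Notebook by Stefano Fiorucci

In this experiment, we will use a Large Language Model to extract information from textual data.

🎯 Goal: create an application that, given a text (or a URL) and a specific structure provided by the user, extracts information from the source.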

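The "function calling" capability of OpenAI models unlocked this task: the user describes a structure by defining a fake function with typed, specific parameters. The LLM prepares the data in that form and sends it back to the user.

A nice example of using OpenAI function calling for information extraction is this gist by Kyle McDonald.

What is changing now is that open models with function calling capabilities, such as Gorilla, are emerging...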

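Stack

Gorilla OpenFunctions: an open-source model that formulates executable API/function calls from natural language instructions and API/function definitions.
Haystack: an open-source LLM orchestration framework that streamlines the development of LLM applications.

Install the dependencies

accelerate and bitsandbytes are required to load the model in a quantized version that runs smoothly on Colab.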

%%capture
! pip install transformers accelerate bitsandbytes haystack-ai trafilatura

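Load and try the model

We use the HuggingFaceLocalGenerator, which loads a model hosted on Hugging Face locally. We also specify some quantization options so that the model runs within the limited resources provided by Colab. See the article about HuggingFaceLocalGenerator in Haystack.

Some notes

Although the model is available as a free deployment behind an OpenAI-compatible API, I chose not to use this option because I found the server rather unstable.
To load the model on Colab, I sharded it myself and published it on Hugging Face. To understand why a sharded version is needed, you can read this excellent article by Maarten Grootendorst.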
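As a rough illustration of why quantization helps on Colab, the sketch below estimates the memory needed for the model weights alone at different precisions. The 7-billion parameter count is an assumption made for illustration (it is not stated above), and activations and framework overhead are ignored.

```python
def weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory (in GiB) occupied by the weights alone."""
    return n_params * bytes_per_param / 1024**3

# ASSUMPTION: a model with roughly 7 billion parameters
n_params = 7e9

for label, nbytes in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{label}: ~{weight_memory_gib(n_params, nbytes):.1f} GiB")
```

Under this assumption, 8-bit loading roughly halves the weight footprint compared with float16, which is what makes a free Colab GPU a plausible target.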

import torch
from haystack.components.generators import HuggingFaceLocalGenerator

generator = HuggingFaceLocalGenerator("anakin87/gorilla-openfunctions-v0-sharded",
                                          huggingface_pipeline_kwargs={"device_map":"auto",
                                                                      "model_kwargs":{"load_in_8bit":True,
                                                                                    "torch_dtype":torch.float16}},
                                          generation_kwargs={"max_new_tokens":128,
                                                             "batch_size":16})
generator.warm_up()


To understand how to prompt the model, take a look at the GitHub README. Later we will see how to better organize the prompt for our purpose.

prompt="""USER: <<question>> Call me an Uber ride type \"Plus\" in Berkeley at zipcode 94704 in 10 minutes
<<function>> [    {
        "name": "Uber Carpool",
        "api_name": "uber.ride",
        "description": "Find suitable ride for customers given the location, type of ride, and the amount of time the customer is willing to wait as parameters",
        "parameters":  [{"name": "loc", "description": "location of the starting place of the uber ride"}, {"name":"type", "enum": ["plus", "comfort", "black"], "description": "types of uber ride user is ordering"}, {"name": "time", "description": "the amount of time in minutes the customer is willing to wait"}]
    }]

ASSISTANT: """

print(generator.run(prompt=prompt))

{'replies': [' uber.ride(loc="berkeley", type="plus", time=10)']}

All good! ✅
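The prompt used above has a fixed layout: a question, a JSON list of function definitions, and then the assistant turn. A minimal sketch of assembling it programmatically follows; `build_gorilla_prompt` is a hypothetical helper (not part of Gorilla or Haystack), and placing everything on one line before the assistant turn is an assumption.

```python
import json

def build_gorilla_prompt(question: str, functions: list) -> str:
    """Assemble a prompt in the USER / <<question>> / <<function>> / ASSISTANT layout."""
    # json.dumps serializes the function definitions the same way every time
    return f"USER: <<question>> {question} <<function>> {json.dumps(functions)}\n\nASSISTANT: "

# Hypothetical, shortened function definition used only for this illustration
functions = [{"name": "Uber Carpool", "api_name": "uber.ride"}]
print(build_gorilla_prompt("Call me an Uber ride in Berkeley", functions))
```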

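Prompt template and Prompt Builder

The prompt template to apply is model-specific. In our case, we slightly customize the original prompt, which is available on GitHub.
In Haystack, the prompt template is rendered using the Prompt Builder component.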

from haystack.components.builders import PromptBuilder

prompt_template = """USER: <<question>> Extract data from the following text. START TEXT. {{docs[0].content|truncate(10000)}} END TEXT. <<function>> {{function}}

ASSISTANT: """

prompt_builder = PromptBuilder(template=prompt_template)

# let's see if the Prompt Builder works properly

from haystack import Document
print(prompt_builder.run(docs=[Document(content="my fake document")], function="my fake function definition"))

{'prompt': 'USER: <<question>> Extract data from the following text. START TEXT. my fake document END TEXT. <<function>> my fake function definition\n\nASSISTANT: '}

Nice ✅

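Other components

The pipeline we are going to create needs the following components. They are simple and need no customization or experimentation, so we can instantiate them directly during pipeline creation.

LinkContentFetcher: fetches the contents of the URLs you give it and returns a list of content streams.
HTMLToDocument: converts HTML files to Documents.
DocumentJoiner: joins lists of Documents.
DocumentCleaner: makes text documents more readable.

Define a custom component to parse and visualize the result

The output of the model generation is a function call string.

We are going to create a simple Haystack component that parses this string and creates a nice HTML visualization.

For more information on creating custom components, see the docs.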

from haystack import component
from typing import List, Optional
import ast
import re

def val_to_color(val):
  """
  Helper function to return a color based on the type/value of a variable
  """
  if isinstance(val, list):
    return "#FFFEE0"
  if val is True:
    return "#90EE90"
  if val is False:
    return "#FFCCCB"
  return ""

@component
class FunctionCallParser:
  """
  A component that parses the function call string and creates a HTML visualization
  """
  @component.output_types(html_visualization=str)
  def run(self, replies:List[str]):

    func_call_str = replies[0].strip().replace('\n','')

    # sometimes the model output starts with "extract_data(type=text)..." or similar expressions
    # that can't be correctly parsed, so we remove this substring
    # raw string: "\)" is an invalid escape sequence in a regular string literal
    func_call_str=re.sub(r"type=[a-zA-Z]+\)?,?","",func_call_str)

    # sometimes the output is like this: "extract_data(..., properties={"date": "2022-01-01", ...})"
    if "properties={" in func_call_str:
      clean_json_str = func_call_str.split("properties=")[-1].strip(')')
      kwargs = ast.literal_eval(clean_json_str)
    else:
    # sometimes, it is a proper function call: "extract_data(date="2022-01-01", ...)"
      func_call=ast.parse(func_call_str).body[0].value
      kwargs = {arg.arg: ast.literal_eval(arg.value) for arg in func_call.keywords}

    # Convert data to HTML format
    html_content = '<div style="border: 1px solid #ccc; padding: 10px; border-radius: 5px; background-color: #f9f9f9;">'
    for key, value in kwargs.items():
        html_content += f'<p><span style="font-family: Cursive; font-size: 30px;">{key}:</span>'
        html_content += f'&emsp;<span style="background-color:{val_to_color(value)}; font-family: Cursive; font-size: 20px;">{value}</span></p>'
    html_content += '</div>'

    return {"html_visualization": html_content}
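To see the two parsing branches of the component in isolation, here is a minimal standalone sketch that needs only the standard library. `parse_reply` is a hypothetical helper that mirrors the logic above without the HTML part, and the second sample string is made up for illustration.

```python
import ast

def parse_reply(reply: str) -> dict:
    """Turn a function-call string into a dict of keyword arguments."""
    call_str = reply.strip().replace("\n", "")
    if "properties={" in call_str:
        # Branch 1: the arguments arrive as a dict literal after "properties="
        dict_str = call_str.split("properties=")[-1].strip(")")
        return ast.literal_eval(dict_str)
    # Branch 2: a proper call such as f(a=1, b="x"); read its keyword arguments
    call = ast.parse(call_str).body[0].value
    return {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}

# Reply produced earlier by the Uber example
print(parse_reply(' uber.ride(loc="berkeley", type="plus", time=10)'))
# Hypothetical reply in the "properties=" style
print(parse_reply('extract_data(properties={"date": "2022-01-01", "about_ai": True})'))
```

Using `ast.literal_eval` rather than `eval` means only Python literals are accepted, so a malformed or unexpected reply raises an error instead of executing code.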

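Create an Information Extraction Pipeline

To combine the components in an appropriate and reproducible way, we resort to Haystack Pipelines. The syntax should be easy to follow. You can find more information in the docs.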

from haystack import Pipeline
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.converters import HTMLToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.joiners import DocumentJoiner

pipe = Pipeline()

pipe.add_component("fetcher", LinkContentFetcher())
pipe.add_component("converter", HTMLToDocument(extractor_type="KeepEverythingExtractor"))
pipe.add_component("joiner", DocumentJoiner())
pipe.add_component("cleaner", DocumentCleaner())
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("generator", generator)
pipe.add_component("parser", FunctionCallParser())

pipe.connect("fetcher", "converter")
pipe.connect("converter", "joiner")
pipe.connect("joiner", "cleaner")
pipe.connect("cleaner.documents", "prompt_builder.docs")
pipe.connect("prompt_builder", "generator")
pipe.connect("generator", "parser")

Let's draw our Pipeline!

from IPython.display import Image

pipe.draw('pipe.png')
Image('pipe.png')

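Now we create an extract function that wraps the Pipeline. It accepts:

a function dict, containing the structure definition of the information we want to extract
a url or a text, to use as the data source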

from IPython.display import display, HTML

def extract(function:dict, url: Optional[str]=None, text: Optional[str]=None) -> None:
  if not function:
    raise ValueError("function definition is needed")
  if not url and not text:
    raise ValueError("URL or text are needed")
  if url and text:
    raise ValueError("you should specify either a URL or a text")

  urls = []
  documents = []
  if url:
    urls.append(url)
  if text:
    documents.append(Document(content=text))

  generation_kwargs={"min_new_tokens":50, # this encourages the model to extract at least some information
                     "max_new_tokens":1000,
                     "batch_size":16}

  data_for_pipeline = {"fetcher":{"urls":urls},
                       "joiner":{"documents":documents},
                       "prompt_builder":{"function":function},
                       "generator": {"generation_kwargs": generation_kwargs}}

  html_visualization = pipe.run(data=data_for_pipeline)['parser']['html_visualization']
  display(HTML(html_visualization))

🕹️ Try our application!

Let's first define the structure to extract.

We are going to parse some articles about animals... 🦆🐻🦌

function = {
    "name": "extract_data",
    "description": "Extract data from text",
    "parameters": {
        # "type": "object",  #  I found that the Gorilla model works better without this item
        "properties": {
            "about_animals": {
                "description": "Is the article about animals?",
                "type": "boolean",
            },
            "about_ai": {
                "description": "Is the article about artificial intelligence?",
                "type": "boolean",
            },
            "weight": {
                "description": "what is the weight of the animal in lbs?",
                "type": "integer",
            },
            "habitat": {
                "description": "List of places where the animal lives",
                "type": "array",
                "items": {"type": "string"},
            },
            "diet": {
                "description": "What does the animal eat?",
                "type": "array",
                "items": {"type": "string"},
            },
            "predators": {
                "description": "What are the animals that threaten them?",
                "type": "array",
                "items": {"type": "string"},
            },
        },
        "required": ["about_animals", "about_ai", "habitat", "diet", "predators"],
    },
}
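The "required" list in a definition like this one can also be used to check a parsed result. The sketch below is a hedged illustration: `missing_required` is a hypothetical helper, and the shortened definition and sample result are made up for the example.

```python
def missing_required(function: dict, extracted: dict) -> list:
    """Return the required parameter names that are absent from the extracted data."""
    required = function.get("parameters", {}).get("required", [])
    return [name for name in required if name not in extracted]

# Shortened, hypothetical definition with the same layout as the one above
sample_function = {
    "name": "extract_data",
    "parameters": {
        "properties": {
            "about_animals": {"type": "boolean"},
            "weight": {"type": "integer"},
            "diet": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["about_animals", "diet"],
    },
}

# "weight" is optional, so only the missing "diet" is reported
print(missing_required(sample_function, {"about_animals": True}))
```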

Let's start with an article about capybaras

extract(function=function, url="https://www.rainforest-alliance.org/species/capybara/")

This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (2048). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.

about_animals: True

about_ai: False

weight: 100

habitat: ['Panama', 'Colombia', 'Venezuela', 'Guyana', 'Peru', 'Brazil', 'Paraguay', 'Northeast Argentina', 'Uruguay']

diet: ['grass', 'aquatic plants', 'grains', 'melons', 'reeds', 'squashes']

predators: ['jaguars', 'caimans', 'anacondas', 'ocelots', 'harpy eagles']

Now let's try a text about the Andean cock-of-the-rock (from https://www.rainforest-alliance.org/species/cock-rock/)

article="""Search for:\nAbout\nKnowledge Hub\nFor Individuals\nFor Business\nThe Latest\nDonate\nWork With Us\nSearch for:\nWhat our seal means\nWhat We Are Doing\nOur Impacts\nOur Approach\nIssues\nForests & Biodiversity\nLivelihoods\nClimate\nHuman rights\nRegions\nAsia\nCentral America & Mexico\nEast Africa\nSouth America\nWest & Central Africa\nWhat You Can Do\nSupport Our Work\nEveryday Actions\nFind the Frog\nSchool Curricula\nKids Games & Activities\nShop to Support\nFor Business\nCertification\nMarketing Sustainability\nTailored Supply Chain Services\nFor Partners\nAbout\nHelp Center\nThe Latest\nEvents\nWhat We Are Doing\nOur Impacts\nSee the positive change our work is making around the\n\t\t\t\t\t\tworld.\nLEARN\n\t\t\t\t\t\t\tMORE\nOur Approach\nIssues\nForests & Biodiversity\nLivelihoods\nClimate\nHuman\n\t\t\t\t\t\tRights\nRegions\nAsia\nEast\n\t\t\t\t\t\tAfrica\nWest\n\t\t\t\t\t\t& Central Africa\nSouth\n\t\t\t\t\t\tAmerica\nMexico\n\t\t\t\t\t\t& Central America\nWhat You Can Do\nSupport Our Work\nThere are many ways you can protect rainforests, fight\n\t\t\t\t\t\tclimate change, and help people and wildlife\n\t\t\t\t\t\tthrive.\nEXPLORE YOUR\n\t\t\t\t\t\t\tGIVING OPTIONS\nEveryday\n\t\t\t\t\t\tActions\nFind the Frog\nSchool\n\t\t\t\t\t\tCurricula\nKids’ Games\n\t\t\t\t\t\t& Activities\nShop to Support\nFor Business\nTransform your\n\t\t\t\t\t\tbusiness practices\nWhat Our Seal Means\nThe Rainforest Alliance certification seal means that the\n\t\t\t\t\t\tproduct (or a specified ingredient) was produced by\n\t\t\t\t\t\tfarmers, foresters, and/or companies working together to\n\t\t\t\t\t\tcreate a world where people and nature thrive in\n\t\t\t\t\t\tharmony.\nLEARN\n\t\t\t\t\t\t\tMORE\nDonate\nSpecies Profile\nCock-of-the-Rock\nRupicola peruviana\nLast updated on November 29, 2006\nCock-of-the-rock photo by\xa0Panegyrics of Granovetter\nPhoto credit: Panegyrics of Granovetter\nShare this...\nFacebook\nTwitter\nLinkedin\nEmail\nAnatomy\nA beautiful 
orange crest adorns the head of the cock-of-the-rock and brilliant orange, black and white feathers cover its back and wings. As with most birds, the female coloring is subtler. Their strong claws and legs allow them to grip onto steep cliffs and rocks.\nDid you know?\nForests are home to 80 percent of Earth\'s terrestrial biodiversity! We\'re preserving habitats for endangered species, conserving wildlife corridors, and saving breeding grounds. Please join our alliance to keep forests standing:\n"*" indicates required fields\nEmail*\nGDPR Consent\nYes, I agree to receive occasional emails from the Rainforest Alliance.\nHabitat\nFound in the Andes from Venezuela to Bolivia, the cock-of-the-rock lives only in mountainous regions and builds its nests on the rocky surfaces of cliffs, large boulders and caves.\nDiet\nThe cock-of-the–rock’s diet consists mainly of fruit. Often, these colorful birds do not digest the seeds of their fruity meals. Instead the seeds pass through their digestive tracks and are eventually scattered along the ground, making these birds extremely important seed dispersers. In addition to fruit, cocks-of-the-rock eat insects and small vertebrates.\nThreats\nMany predators are attracted to the cock-of-the-rock’s beautiful plumage. These include birds of prey such as eagles and hawks, puma and jaguars and even boa constrictors. The loss of habitat, predominantly from forestland being converted to farmland, is a major threat to the survival of this brilliant bird.\nSources\nJukofsky, Diane. Encyclopedia of Rainforests. 
Connecticut: Oryx Press, 2002.\nEcology Info\n“Cock of the Rock,” Houston Zoo website, 2007.\nPhoto by Veronica Muñoz\nTags:\nEnvironmental Curriculum for Schools\nHelp Conserve Forests And Restore Balance To Our Planet\nMake your gift go further (and greener) with a monthly pledge\nDonate\nConservation Status Least Concern\nType\nBirds\nFound In\nSouth America\nFact Often, these colorful birds do not digest the seeds of their fruity meals. Instead the seeds pass through their digestive tracks and are eventually scattered along the ground, making these birds extremely important seed dispersers.\nYou Might Also Like...\nSpecies Profile\nCerulean Warbler\nSpecies Profile\nAfrican Grey Parrot\nSpecies Profile\nScarlet Macaw\nSpecies Profile\nGreat Curassow\nFor Business\nTransform your business practices\nFor Supporters\nHelp us rebalance the planet\nFor Researchers\nSee how we measure our impacts\nFor Educators\nUse our conservation curricula in your classroom\nThe Rainforest Alliance is a 501(c)(3) Nonprofit registered in the US under EIN: 13-3377893.\nIn 2022, 75% of our income supported sustainability programs. Learn More »\nFollow Us\nFacebook\nInstagram\nLinkedIn\nTikTok\nYouTube\nSubscribe\nSign up for business updates or general updates\nFAQ\nPress\nFinancials\nCertification Documents\nCareers\nContact Us\n© Copyright 1987 - 2023, Rainforest Alliance | Privacy Policy | Cookie Policy\nRainforest Match Active\nDouble your impact against deforestation.\nAct now!\nGive today\nX\nHabitats Matter.\nYour gift helps protect vital forest habitats for wildlife.\nGive today\nX\nPeople & Forests\nSupport nature’s guardians. Act now.\nGive Support\nX\n"""

extract(function=function, text=article)

about_animals: True

about_ai: False

habitat: ['South America']

diet: ['fruit']

predators: ['puma', 'jaguars', 'boa constrictors']

Now, the Yucatan deer!

extract(function=function, url="https://www.rainforest-alliance.org/species/yucatan-deer/")

about_animals: True

about_ai: False

habitat: ['Central America & Mexico', 'South America']

diet: ['grass', 'leaves', 'sprouts', 'lichens', 'tree bark', 'fruit']

predators: ['cougar', 'jaguar', 'ticks', 'horseflies', 'mosquitoes']

A completely different example, about AI...

function = {
    "name": "extract_data",
    "description": "Extract data from text",
    "parameters": {
        "properties": {
            "date": {"type": "string", "format": "date"},
            "about_animals": {
                "description": "Is the article about animals?",
                "type": "boolean",
            },
            "about_ai": {
                "description": "Is the article about artificial intelligence?",
                "type": "boolean",
            },
            "people": {
                "type": "array",
                "description": "Please list the mentioned people",
                "items": {"type": "string"},
            },
            "summary": {
                "type": "string",
                "description": "Summarize what happened in one sentence.",
            },
        },
        "required": ["date", "about_animals", "about_ai", "people", "summary"],
    },
}

extract(function=function, url="https://www.theverge.com/2023/11/22/23967223/sam-altman-returns-ceo-open-ai")

date: 2023-11-22

about_animals: False

about_ai: True

people: ['Sam Altman', 'Greg Brockman', 'Ilya Sutskever']

summary: Sam Altman returns as CEO of OpenAI

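⚠️ Caveats and 🔮 future directions

I found the model a bit unstable. Changing the function definition slightly alters the extraction results and can even lead to unparseable function calls.
It would be nice to try other similar models, such as NexusRaven.
Good open-source general-purpose models (not fine-tuned for function calling) should also be investigated (see Anyscale's announcement about function calling with Mistral-7B-Instruct-v0.1).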
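Regarding the first caveat, one way to cope with unparseable function calls is to catch the parsing errors and return a sentinel instead of failing. This is a minimal sketch under that assumption; `safe_parse_kwargs` is a hypothetical helper and is not part of the pipeline above.

```python
import ast
from typing import Optional

def safe_parse_kwargs(reply: str) -> Optional[dict]:
    """Return the keyword arguments of a function-call string, or None if it cannot be parsed."""
    try:
        call = ast.parse(reply.strip(), mode="eval").body
        if not isinstance(call, ast.Call):
            return None
        return {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    except (SyntaxError, ValueError):
        # SyntaxError: truncated or malformed call; ValueError: a non-literal argument
        return None

print(safe_parse_kwargs('extract_data(date="2023-11-22", about_ai=True)'))
print(safe_parse_kwargs('extract_data(date="2023-11-22", about_ai='))  # truncated reply
```

A caller could then retry the generation or report the failure when the result is None, rather than letting the exception stop the run.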

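📚 References

Related to the experiment

Haystack LLM framework
Information extraction using OpenAI function calling: Kyle McDonald's gist
Gorilla OpenFunctions: release blog post and GitHub repo

Other interesting resources on the topic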