Integration: INSTRUCTOR Embedders

A component for computing embeddings with INSTRUCTOR embedding models.

Authors

Ashwin Mathur

Varun Mathur

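GitHub Repository PyPI Package

This custom component for Haystack creates embeddings using INSTRUCTOR embedding models.

INSTRUCTOR is an instruction-finetuned text embedding model that can generate text embeddings tailored to any task (e.g., classification, retrieval, clustering, text evaluation) and domain (e.g., science, finance) simply by providing a task instruction, without any further finetuning. INSTRUCTOR achieves state-of-the-art results on 70 diverse embedding tasks (MTEB leaderboard). For more details, see the paper and the project page. Model checkpoints are available on HuggingFace.

Passing an instruction along with the text to be encoded lets INSTRUCTOR models produce domain-specific, task-aware embeddings.

Unified template for writing instructions

To create customized embeddings for specific sentences, write the instruction following this unified template: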

    Represent the 'domain' 'text_type' for 'task_objective':

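domain is optional and specifies the domain of the text, e.g., science, finance, medicine.
text_type is required and specifies the encoding unit, e.g., sentence, document, paragraph.
task_objective is optional and specifies the objective of the embedding, e.g., retrieve a document, classify the sentence.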
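
The template can also be filled in programmatically. The following is a minimal sketch in plain Python; build_instruction is a hypothetical helper written for illustration and is not part of instructor-embedders-haystack.

```python
from typing import Optional


def build_instruction(
    text_type: str,
    domain: Optional[str] = None,
    task_objective: Optional[str] = None,
) -> str:
    """Fill in "Represent the <domain> <text_type> for <task_objective>:".

    Hypothetical helper: only text_type is required, mirroring the template.
    """
    if not text_type.strip():
        raise ValueError("text_type is required")
    parts = ["Represent the"]
    if domain:
        parts.append(domain.strip())
    parts.append(text_type.strip())
    if task_objective:
        parts.extend(["for", task_objective.strip()])
    return " ".join(parts) + ":"


print(build_instruction("document", domain="Wikipedia", task_objective="retrieval"))
print(build_instruction("statement", domain="Financial"))
```

The resulting string can then be passed as the instruction argument of either embedder.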

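Examples

Document text - "Capitalism has been dominant in the Western world since the end of feudalism, but most feel that the term 'mixed economies' more precisely describes most contemporary economies, because they contain both privately owned and state-owned enterprises. In capitalism, prices determine the demand-supply scale. For example, higher demand for certain goods and services leads to higher prices, and lower demand for certain goods leads to lower prices."
Document embedding instruction - "Represent the Wikipedia document for retrieval:"

Query - "In a mixed economy, what is the key factor that determines whether a particular enterprise is privately owned or state-owned?"
Query embedding instruction - "Represent the Wikipedia question for retrieving supporting documents:"

Document text - "The Federal Reserve raised its benchmark interest rate on Wednesday. The fund rose less than 0.5% on Friday."
Document embedding instruction - "Represent the Financial statement:"

Query - "What are the effects of an interest rate hike?"
Query embedding instruction - "Represent the Financial question:"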
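
The examples show that documents and queries in the same domain get different instructions. As a rough illustration, the sketch below pairs each text with the instruction for its role. The [instruction, text] pair layout is the input format described in the INSTRUCTOR project's README, as far as we know; pair_with_instruction is a hypothetical helper, and the embedders in this integration take the instruction at initialization instead, so you do not need to build such pairs yourself.

```python
from typing import List

# Instructions for the finance examples: one per role.
DOC_INSTRUCTION = "Represent the Financial statement:"
QUERY_INSTRUCTION = "Represent the Financial question:"


def pair_with_instruction(instruction: str, texts: List[str]) -> List[List[str]]:
    """Return one [instruction, text] pair per input text (hypothetical helper)."""
    return [[instruction, text] for text in texts]


documents = ["The Federal Reserve raised its benchmark interest rate on Wednesday."]
queries = ["What are the effects of an interest rate hike?"]

doc_pairs = pair_with_instruction(DOC_INSTRUCTION, documents)
query_pairs = pair_with_instruction(QUERY_INSTRUCTION, queries)

print(doc_pairs[0])
print(query_pairs[0])
```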

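This integration contains

InstructorTextEmbedder, a component that embeds a string (for example, a query) into a vector.
InstructorDocumentEmbedder, a component that embeds a list of Haystack Documents. The embedding of each Document is stored in the Document's embedding field.

You can use these embedders as standalone components or in an indexing pipeline.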

Installation

To use this component, install the instructor-embedders-haystack package.

pip install instructor-embedders-haystack

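Usage

To initialize InstructorTextEmbedder or InstructorDocumentEmbedder, pass a local path or the name of a model on the Hugging Face Hub through the model parameter, e.g., 'hkunlp/instructor-base'.
Use the instruction parameter to pass the instruction string that is used when computing domain-specific embeddings.

Using the text embedder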

from haystack.utils.device import ComponentDevice
from haystack_integrations.components.embedders.instructor_embedders import InstructorTextEmbedder

# Example text from the Amazon Reviews Polarity Dataset (https://hugging-face.cn/datasets/amazon_polarity)
text = "It clearly says online this will work on a Mac OS system. The disk comes and it does not, only Windows. Do Not order this if you have a Mac!!"
instruction = (
    "Represent the Amazon comment for classifying the sentence as positive or negative"
)

text_embedder = InstructorTextEmbedder(
    model="hkunlp/instructor-base", instruction=instruction,
    device=ComponentDevice.from_str("cpu"),
)
text_embedder.warm_up()
result = text_embedder.run(text)
print(f"Embedding: {result['embedding']}")
print(f"Embedding Dimension: {len(result['embedding'])}")

Using the document embedder

from haystack.utils.device import ComponentDevice
from haystack.dataclasses import Document
from haystack_integrations.components.embedders.instructor_embedders import InstructorDocumentEmbedder


doc_embedding_instruction = "Represent the Medical Document for retrieval:"

doc_embedder = InstructorDocumentEmbedder(
    model="hkunlp/instructor-base",
    instruction=doc_embedding_instruction,
    batch_size=32,
    device=ComponentDevice.from_str("cpu"),
)

doc_embedder.warm_up()

# Text taken from PubMed QA Dataset (https://hugging-face.cn/datasets/pubmed_qa)
document_list = [
    Document(
        content="Oxidative stress generated within inflammatory joints can produce autoimmune phenomena and joint destruction. Radical species with oxidative activity, including reactive nitrogen species, represent mediators of inflammation and cartilage damage.",
        meta={
            "pubid": "25,445,628",
            "long_answer": "yes",
        },
    ),
    Document(
        content="Plasma levels of pancreatic polypeptide (PP) rise upon food intake. Although other pancreatic islet hormones, such as insulin and glucagon, have been extensively investigated, PP secretion and actions are still poorly understood.",
        meta={
            "pubid": "25,445,712",
            "long_answer": "yes",
        },
    ),
    Document(
        content="Disturbed sleep is associated with mood disorders. Both depression and insomnia may increase the risk of disability retirement. The longitudinal links among insomnia, depression and work incapacity are poorly known.",
        meta={
            "pubid": "25,451,441",
            "long_answer": "yes",
        },
    ),
]

result = doc_embedder.run(document_list)
print(f"Document Text: {result['documents'][0].content}")
print(f"Document Embedding: {result['documents'][0].embedding}")
print(f"Embedding Dimension: {len(result['documents'][0].embedding)}")
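
The batch_size argument controls how many documents are encoded per forward pass. The sketch below only illustrates what batching means, using a hypothetical iter_batches helper; it makes no claim about how InstructorDocumentEmbedder implements batching internally.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most batch_size items (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start : start + batch_size])


# 70 placeholder texts split with batch_size=32 give batches of 32, 32 and 6.
texts = [f"document {i}" for i in range(70)]
print([len(batch) for batch in iter_batches(texts, batch_size=32)])
```

A larger batch size usually means fewer passes through the model at the cost of more memory per pass.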

Using the embedders in a semantic search pipeline

# Import necessary modules and classes
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.dataclasses import Document
from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.utils.device import ComponentDevice
from datasets import load_dataset

# Import custom INSTRUCTOR Embedders
from haystack_integrations.components.embedders.instructor_embedders import InstructorDocumentEmbedder
from haystack_integrations.components.embedders.instructor_embedders import InstructorTextEmbedder

# Initialize an InMemoryDocumentStore, which will be used to store and retrieve documents
# It uses cosine similarity for document embeddings comparison
doc_store = InMemoryDocumentStore(embedding_similarity_function="cosine")

# Define an instruction for document embedding
doc_embedding_instruction = "Represent the News Article for retrieval:"
# Create an InstructorDocumentEmbedder instance with specified parameters
doc_embedder = InstructorDocumentEmbedder(
    model="hkunlp/instructor-base",
    instruction=doc_embedding_instruction,
    batch_size=32,
    device=ComponentDevice.from_str("cpu"),
)
# Warm up the embedder (loading the pre-trained model)
doc_embedder.warm_up()

# Create an indexing pipeline
indexing_pipeline = Pipeline()
# Add the document embedder component to the pipeline
indexing_pipeline.add_component(instance=doc_embedder, name="DocEmbedder")
# Add a DocumentWriter component to the pipeline that writes documents to the Document Store
indexing_pipeline.add_component(
    instance=DocumentWriter(document_store=doc_store), name="DocWriter"
)
# Connect the output of DocEmbedder to the input of DocWriter
indexing_pipeline.connect("DocEmbedder", "DocWriter")

# Load the 'XSum' dataset from HuggingFace (https://hugging-face.cn/datasets/xsum)
dataset = load_dataset("xsum", split="train")

# Create Document objects from the dataset and add them to the document store using the indexing pipeline
docs = [
    Document(
        content=doc["document"],
        meta={
            "summary": doc["summary"],
            "doc_id": doc["id"],
        },
    )
    for doc in dataset
]
indexing_pipeline.run({"DocEmbedder": {"documents": docs}})

# Print the first document and its embedding from the document store
print(doc_store.filter_documents()[0])
print(doc_store.filter_documents()[0].embedding)

# Define an instruction for query embedding
query_embedding_instruction = (
    "Represent the news question for retrieving supporting articles:"
)
# Create an InstructorTextEmbedder instance for query embedding
text_embedder = InstructorTextEmbedder(
    model="hkunlp/instructor-base",
    instruction=query_embedding_instruction,
    device=ComponentDevice.from_str("cpu"),
)
# Load the text embedding model
text_embedder.warm_up()

# Create a query pipeline
query_pipeline = Pipeline()
# Add the text embedder component to the pipeline
query_pipeline.add_component("TextEmbedder", text_embedder)
# Add an InMemoryEmbeddingRetriever component to the pipeline that retrieves documents from the doc_store
query_pipeline.add_component(
    "Retriever", InMemoryEmbeddingRetriever(document_store=doc_store)
)
# Connect the output of TextEmbedder to the input of Retriever
query_pipeline.connect("TextEmbedder", "Retriever")

# Run the query pipeline with a sample query text
results = query_pipeline.run(
    {
        "TextEmbedder": {
            "text": "What were the concerns expressed by Jeanette Tate regarding the response to the flooding in Newton Stewart?"
        }
    }
)

# Print information about retrieved documents
for doc in results["Retriever"]["documents"]:
    print(f"Text:\n{doc.content[:150]}...\n")
    print(f"Metadata: {doc.meta}")
    print(f"Score: {doc.score}")
    print("-" * 10 + "\n")
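
The document store above is configured with embedding_similarity_function="cosine". For intuition about what that measure computes, here is a minimal plain-Python sketch of cosine similarity between a query embedding and a document embedding. The vectors are toy values chosen for illustration, and this is not the document store's actual implementation.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional stand-ins for real embeddings (hypothetical values).
query_embedding = [1.0, 0.0, 1.0]
doc_embedding = [1.0, 1.0, 0.0]
print(round(cosine_similarity(query_embedding, doc_embedding), 3))
```

Values close to 1 indicate embeddings pointing in nearly the same direction, which is why the query and the documents should be embedded with the same model.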

License

instructor-embedders-haystack is distributed under the terms of the Apache-2.0 license.