构建LangChain应用程序的示例代码：51、如何使用 Chroma 实现多模态检索增强生成 (RAG)

Chroma 多模态 RAG

许多文档包含多种内容类型，包括文本和图像。

然而，大多数 RAG 应用中，图像中捕获的信息往往被忽略。

随着多模态 LLM 的出现，如 GPT-4V，值得考虑如何在 RAG 中利用图像：

选项 1：

使用多模态嵌入（例如 CLIP）来嵌入图像和文本
使用相似性搜索检索图像和文本
将原始图像和文本片段传递给多模态 LLM 进行答案合成

选项 2：

使用多模态 LLM（例如 GPT-4V、LLaVA 或 FUYU-8b）从图像生成文本摘要
嵌入并检索文本
将文本片段传递给 LLM 进行答案合成

选项 3：

使用多模态 LLM（例如 GPT-4V、LLaVA、或 FUYU-8b）直接从图像和文本中生成答案
使用引用原始图像嵌入和检索图像摘要
将原始图像和文本块传递到多模态LLM以进行答案合成

本文重点介绍选项1 。

我们将使用 Unstructed 来解析文档 (PDF) 中的图像、文本和表格。
我们将使用 Open Clip多模态嵌入。
我们将使用支持多模式的 Chroma。

Packages

对于 unstructured ，您的系统中还需要 poppler （安装说明）和 tesseract （安装说明）。

! pip install -U langchain openai chromadb langchain-experimental # (newest versions required for multi-modal)

#lock to 0.10.19 due to a persistent bug in more recent versions
! pip install "unstructured[all-docs]==0.10.19" pillow pydantic lxml pillow matplotlib tiktoken open_clip_torch torch

数据加载

分割 PDF 文本和图像

让我们看一个包含有趣图像的 pdf 示例。

1/ J Paul Getty 博物馆的艺术品：

这是一个包含 PDF 和已提取图像的 zip 文件。
https://www.getty.edu/publications/resources/virtuallibrary/0892360224.pdf

2/ 国会图书馆的著名照片：

https://www.loc.gov/lcm/pdf/LCM_2020_1112.pdf
我们将在下面使用它作为示例

我们可以使用下面的 Unstructed 中的 partition_pdf 来提取文本和图像。

要提供它来提取图像：

extract_images_in_pdf=True

如果使用此 zip 文件，那么您只需使用以下命令即可简单地处理文本：

extract_images_in_pdf=False

# Folder with pdf and extracted images
path = "/Users/rlm/Desktop/photos/"
# Extract images, tables, and chunk text
from unstructured.partition.pdf import partition_pdf

raw_pdf_elements = partition_pdf(
    filename=path + "photos.pdf",
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=path,
)
# Categorize text elements by type
tables = []
texts = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        tables.append(str(element))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        texts.append(str(element))

Multi-modal embeddings with our document(我们的文档的多模态嵌入)

我们将使用 OpenClip 多模态嵌入。

我们使用更大的模型以获得更好的性能（在 langchain_experimental.open_clip.py 中设置）。

model_name = "ViT-g-14"
checkpoint = "laion2b_s34b_b88k"

import os
import uuid

import chromadb
import numpy as np
from langchain_community.vectorstores import Chroma
from langchain_experimental.open_clip import OpenCLIPEmbeddings
from PIL import Image as _PILImage

# Create chroma
vectorstore = Chroma(
    collection_name="mm_rag_clip_photos", embedding_function=OpenCLIPEmbeddings()
)

# Get image URIs with .jpg extension only
image_uris = sorted(
    [
        os.path.join(path, image_name)
        for image_name in os.listdir(path)
        if image_name.endswith(".jpg")
    ]
)

# Add images
vectorstore.add_images(uris=image_uris)

# Add documents
vectorstore.add_texts(texts=texts)

# Make retriever
retriever = vectorstore.as_retriever()

RAG

vectorstore.add_images 将存储/检索图像作为 base64 编码字符串。

这些可以传递给 GPT-4V。

import base64
import io
from io import BytesIO

import numpy as np
from PIL import Image


def resize_base64_image(base64_string, size=(128, 128)):
    """
    Resize an image encoded as a Base64 string.

    Args:
    base64_string (str): Base64 string of the original image.
    size (tuple): Desired size of the image as (width, height).

    Returns:
    str: Base64 string of the resized image.
    """
    # Decode the Base64 string
    img_data = base64.b64decode(base64_string)
    img = Image.open(io.BytesIO(img_data))

    # Resize the image
    resized_img = img.resize(size, Image.LANCZOS)

    # Save the resized image to a bytes buffer
    buffered = io.BytesIO()
    resized_img.save(buffered, format=img.format)

    # Encode the resized image to Base64
    return base64.b64encode(buffered.getvalue()).decode("utf-8")


def is_base64(s):
    """Check if a string is Base64 encoded"""
    try:
        return base64.b64encode(base64.b64decode(s)) == s.encode()
    except Exception:
        return False


def split_image_text_types(docs):
    """Split numpy array images and texts"""
    images = []
    text = []
    for doc in docs:
        doc = doc.page_content  # Extract Document contents
        if is_base64(doc):
            # Resize image to avoid OAI server error
            images.append(
                resize_base64_image(doc, size=(250, 250))
            )  # base64 encoded str
        else:
            text.append(doc)
    return {"images": images, "texts": text}

目前，我们使用 RunnableLambda 格式化输入，同时向 ChatPromptTemplates 添加图像支持。

我们的可运行程序遵循经典的 RAG 流程 -

我们首先计算上下文（在本例中是“文本”和“图像”）和问题（这里只是一个 RunnablePassthrough）
然后我们将其传递到提示模板中，这是一个自定义函数，用于格式化 gpt-4-vision-preview 模型的消息。
最后我们将输出解析为字符串。

from operator import itemgetter

from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_openai import ChatOpenAI


def prompt_func(data_dict):
    # Joining the context texts into a single string
    formatted_texts = "\n".join(data_dict["context"]["texts"])
    messages = []

    # Adding image(s) to the messages if present
    if data_dict["context"]["images"]:
        image_message = {
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{data_dict['context']['images'][0]}"
            },
        }
        messages.append(image_message)

    # Adding the text message for analysis
    text_message = {
        "type": "text",
        "text": (
            "As an expert art critic and historian, your task is to analyze and interpret images, "
            "considering their historical and cultural significance. Alongside the images, you will be "
            "provided with related text to offer context. Both will be retrieved from a vectorstore based "
            "on user-input keywords. Please use your extensive knowledge and analytical skills to provide a "
            "comprehensive summary that includes:\n"
            "- A detailed description of the visual elements in the image.\n"
            "- The historical and cultural context of the image.\n"
            "- An interpretation of the image's symbolism and meaning.\n"
            "- Connections between the image and the related text.\n\n"
            f"User-provided keywords: {data_dict['question']}\n\n"
            "Text and / or tables:\n"
            f"{formatted_texts}"
        ),
    }
    messages.append(text_message)

    return [HumanMessage(content=messages)]


model = ChatOpenAI(temperature=0, model="gpt-4-vision-preview", max_tokens=1024)

# RAG pipeline
chain = (
    {
        "context": retriever | RunnableLambda(split_image_text_types),
        "question": RunnablePassthrough(),
    }
    | RunnableLambda(prompt_func)
    | model
    | StrOutputParser()
)

测试检索并运行 RAG

from IPython.display import HTML, display


def plt_img_base64(img_base64):
    # Create an HTML img tag with the base64 string as the source
    image_html = f'<img src="data:image/jpeg;base64,{img_base64}" />'

    # Display the image by rendering the HTML
    display(HTML(image_html))


docs = retriever.invoke("Woman with children", k=10)
for doc in docs:
    if is_base64(doc.page_content):
        plt_img_base64(doc.page_content)
    else:
        print(doc.page_content)

在这里插入图片描述

chain.invoke("Woman with children")

'Visual Elements:\nThe image is a black and white photograph depicting a woman with two children. The woman is positioned centrally and appears to be in her thirties. She has a look of concern or contemplation on her face, with her hand resting on her chin. Her gaze is directed away from the camera, suggesting introspection or worry. The children are turned away from the camera, with their heads leaning against the woman, seeking comfort or protection. The clothing of the subjects is simple and worn, indicating a lack of wealth. The background is out of focus, drawing attention to the expressions and posture of the subjects.\n\nHistorical and Cultural Context:\nThe photograph was taken by Dorothea Lange in March 1936 and is titled "Destitute pea pickers in California. Mother of seven children. Age thirty-two. Nipomo, California." It was taken during the Great Depression in the United States, a period of severe economic hardship. The woman in the photo, Florence Owens Thompson, was a Cherokee from Oklahoma. The image is part of the Farm Security Administration-Office of War Information Collection, which aimed to document and bring attention to the plight of impoverished farmers and workers during this era.\n\nInterpretation and Symbolism:\nThe photograph, often referred to as "Migrant Mother," has become an iconic symbol of the Great Depression. The woman\'s expression and posture convey a sense of worry and determination, reflecting the resilience and strength required to endure such difficult times. The children\'s reliance on their mother for comfort underscores the family\'s vulnerability and the burdens placed upon the woman. Despite the hardship conveyed, the image also suggests a sense of dignity and maternal protectiveness.\n\nThe text provided indicates that Florence Owens Thompson was a strong and leading figure within her community, which contrasts with the vulnerability shown in the photograph. This dichotomy highlights the complexity of Thompson\'s character and the circumstances of the time, where even the strongest individuals faced moments of hardship that could overshadow their usual demeanor.\n\nConnections Between Image and Text:\nThe text complements the image by providing personal insight into the subject\'s feelings about the photograph. It reveals that Thompson resented the photo because it did not reflect her strength and leadership qualities. This adds depth to our understanding of the image, as it suggests that the moment captured by Lange is not fully representative of Thompson\'s character. The photograph, while powerful, is a snapshot that may not encompass the entirety of the subject\'s identity and life experiences.\n\nThe final line of the text, "They\'re willing to have me entertain them during the day, but as soon as it starts getting dark, they all go off, and leave me!" could be interpreted as a metaphor for the transient sympathy of society towards the impoverished during the Great Depression. People may have shown interest or concern during the crisis, but ultimately, those suffering, like Thompson and her family, were left to face their struggles alone when the attention faded. This line underscores the isolation and abandonment felt by many during this period, which is poignantly captured in the photograph\'s portrayal of the mother and her children.'