How to Reduce Embedding Size and Increase RAG Retrieval Speed

Flexible text embeddings with Matryoshka Representation Learning (MRL).

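Introduction

Text embeddings are high-dimensional vector representations of single words or entire sentences.

These arrays of numbers capture rich information about the underlying text and can be used for many downstream tasks, such as semantic understanding, classification, clustering, information retrieval (as in RAG), and reranking.

Usually, the dimension d of an embedding vector is fixed. Embedding dimensions are often a power of two, ranging from 64 up to 4096.

With Matryoshka embeddings, you can change the dimension of your embeddings to fit your application. This can reduce storage space, save costs, and increase retrieval speed.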

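What Are Text Embeddings?

We start by defining a vocabulary that maps every possible input token to an integer value. The vocabulary contains not only the characters of the alphabet but also special characters, short words, and subwords: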

{
"a": 1,
"b": 2,
"c": 3,
...
"z": 26,
"the": 27,
" ": 28
}
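
As a rough illustration of how such a vocabulary is used, the sketch below tokenizes a string by greedy longest-match lookup. The contents of `toy_vocab` follow the example above, assuming the elided entries continue alphabetically. The `tokenize` helper is hypothetical and much simpler than the subword tokenizers real models use.

```python
import string

# Toy vocabulary mirroring the example: letters a-z -> 1..26, one word, and the space.
# Assumption: the elided entries continue alphabetically.
toy_vocab = {ch: i for i, ch in enumerate(string.ascii_lowercase, start=1)}
toy_vocab["the"] = 27
toy_vocab[" "] = 28


def tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Greedy longest-match tokenization (illustrative only)."""
    max_len = max(len(token) for token in vocab)
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first so "the" wins over "t", "h", "e".
        for length in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in vocab:
                ids.append(vocab[piece])
                pos += length
                break
        else:
            raise ValueError(f"no token for {text[pos]!r}")
    return ids


print(tokenize("the cat", toy_vocab))  # [27, 28, 3, 1, 20]
```

The word "the" is matched as a single token, while "cat" falls back to individual characters because it is not in the toy vocabulary.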

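After tokenization, we can feed the list of tokens into our encoder model. Having been trained on large amounts of data, the encoder converts each token into a high-dimensional numerical vector, its embedding.

For example, OpenAI's text-embedding-3-large model has an output embedding dimension d of 3072.

To obtain a single sentence embedding, we need to compress the information from the multiple token embeddings. One way to do this is to average all token embeddings.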
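
The averaging step can be sketched in a few lines of NumPy. The token embeddings below are made-up numbers with d = 4, and `mean_pool` is a hypothetical helper. Real models typically also mask out padding tokens before averaging, which this sketch leaves out.

```python
import numpy as np

# Hypothetical token embeddings for a 3-token sentence with d = 4.
token_embeddings = np.array(
    [
        [1.0, 0.0, 2.0, 4.0],
        [3.0, 2.0, 0.0, 0.0],
        [2.0, 4.0, 1.0, 2.0],
    ]
)


def mean_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Average over the token axis to get one vector per sentence."""
    return token_embeddings.mean(axis=0)


sentence_embedding = mean_pool(token_embeddings)
print(sentence_embedding)  # [2. 2. 1. 2.]
```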

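Matryoshka Embeddings

Matryoshka embeddings were introduced in the 2022 paper "Matryoshka Representation Learning" by researchers from the University of Washington, Google Research, and Harvard University [1].

Matryoshka embeddings are trained to encode information at different granularities within a single embedding vector.

For example, instead of simply training one full embedding vector of size d = 1024, MRL uses a list of dimensions matryoshka_dims = [1024, 512, 256, 128, 64] and optimizes the chosen loss function for all of them simultaneously [2].

The result is an embedding vector with nested information: the coarsest information is stored in the first dimensions, and increasingly fine details are stored in the later dimensions.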
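
To make "optimizing for several dimensions at once" more concrete, here is a minimal NumPy sketch of the general idea: a base loss is evaluated on each truncated prefix of the embeddings and the results are summed. The in-batch contrastive base loss, the random data, and all function names are assumptions chosen for illustration. This is not the training setup of any particular model, and it only computes the loss value without gradients.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def in_batch_contrastive_loss(
    anchors: np.ndarray, positives: np.ndarray, scale: float = 20.0
) -> float:
    """Cross-entropy over cosine similarities; row i's positive is column i."""
    logits = scale * l2_normalize(anchors) @ l2_normalize(positives).T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def matryoshka_loss(anchors, positives, matryoshka_dims, weights=None) -> float:
    """Weighted sum of the base loss evaluated on each embedding prefix."""
    weights = weights or [1.0] * len(matryoshka_dims)
    return sum(
        w * in_batch_contrastive_loss(anchors[:, :dim], positives[:, :dim])
        for dim, w in zip(matryoshka_dims, weights)
    )


rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 1024))
positives = anchors + 0.1 * rng.normal(size=(8, 1024))  # noisy copies act as positives

loss = matryoshka_loss(anchors, positives, matryoshka_dims=[1024, 512, 256, 128, 64])
print(f"combined loss over all prefixes: {loss:.4f}")
```

Because every prefix contributes to the total, training pushes the first 64 dimensions to be useful on their own, not only as part of the full vector.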

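*Like Matryoshka dolls, Matryoshka embeddings contain smaller embeddings nested inside them. Image by the author.*

In practice, this means we can truncate the embedding vector wherever we like without sacrificing much performance.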
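
Mechanically, truncation is just slicing followed by rescaling. The sketch below uses a random unit vector as a stand-in for a model output, so it only demonstrates the mechanics and says nothing about retrieval quality. `truncate_embedding` is a hypothetical helper name.

```python
import numpy as np


def truncate_embedding(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` values and rescale the result to unit length."""
    prefix = embedding[:dim]
    return prefix / np.linalg.norm(prefix)


rng = np.random.default_rng(42)
full = rng.normal(size=1024)
full /= np.linalg.norm(full)  # unit-length stand-in for a model output

small = truncate_embedding(full, dim=64)
print(small.shape)                             # (64,)
print(np.linalg.norm(full[:64]) < 1.0)         # True: a raw prefix is shorter than unit length
print(np.isclose(np.linalg.norm(small), 1.0))  # True after rescaling
```

Rescaling matters whenever the similarity measure assumes unit-length vectors, as the dot product does when it is used in place of cosine similarity.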

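Why Is This Important?

Suppose we want to store n text embedding vectors in a vector database. Each embedding has dimension d, and each value is typically a 32-bit floating-point number. We therefore need n * d * 4 bytes of storage.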
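
As a quick worked example of this formula, the snippet below compares the raw storage for a hypothetical corpus of one million embeddings at 768 and at 64 dimensions. The corpus size is an assumption, and index overhead is ignored.

```python
def embedding_storage_bytes(n: int, d: int, bytes_per_value: int = 4) -> int:
    """Raw storage for n float32 embeddings of dimension d (no index overhead)."""
    return n * d * bytes_per_value


n = 1_000_000  # hypothetical corpus size
for d in (768, 64):
    size_mb = embedding_storage_bytes(n, d) / 1e6
    print(f"d={d}: {size_mb:,.0f} MB")
# d=768: 3,072 MB
# d=64: 256 MB
```

Going from 768 to 64 dimensions shrinks the raw vectors by a factor of 12.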

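If we want to compute a similarity measure such as the dot product or cosine similarity (which is just the dot product of normalized vectors), the higher the dimension d, the more arithmetic operations we need to perform.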
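
The relationship between the two measures can be checked on a small example. The vectors below are made up, and `cosine_similarity` is a hypothetical helper. A dot product of two d-dimensional vectors takes d multiplications and d - 1 additions, which is why the cost grows with d.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product divided by the product of the vector lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

print(np.dot(a, b))             # 24.0
print(cosine_similarity(a, b))  # 0.96

# After scaling both vectors to unit length, the plain dot product gives the same value.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
print(np.isclose(np.dot(a_unit, b_unit), cosine_similarity(a, b)))  # True
```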

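With MRL, if we care about a small memory footprint and fast processing, and therefore lower costs, we might use only the first 64 dimensions. If we want the best downstream performance, we use all dimensions. Or we might choose something in between.

MRL thus lets LLM users trade embedding size (cost) against a small loss in downstream performance.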

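Using MRL with Nomic AI

Nomic's Matryoshka text embedding model nomic-embed-text-v1.5 was trained with matryoshka_dims = [768, 512, 256, 128, 64]. The model is publicly available on Hugging Face [3].

Another advantage of this encoder model is its support for task prefixes. It accepts the prefixes search_query, search_document, classification, and clustering to produce better embeddings for each specific downstream task.

Here is how nomic-embed-text-v1.5 performs on the Massive Text Embedding Benchmark (MTEB):

*Nomic's MRL text embedding model compared with several other models on the MTEB leaderboard. Image by the author.*

Let's implement the model in Python using PyTorch and the Sentence Transformers library: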

!pip install torch sentence_transformers einops

import torch
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"

model = SentenceTransformer(
    "nomic-ai/nomic-embed-text-v1.5",
    device=device,
    trust_remote_code=True,
    prompts={
        "search_query": "search_query: ",
        "search_document": "search_document: ",
        "classification": "classification: ",
        "clustering": "clustering: ",
    },
)


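def embed_sentences(
    model: SentenceTransformer,
    sentences: list[str],
    prompt_name: str,
    matryoshka_dim: int,
    device: str,
) -> torch.Tensor:
    """Embed sentences and truncate them to the first `matryoshka_dim` dimensions."""
    assert matryoshka_dim <= 768, "maximum dimension for nomic-embed-text-v1.5 is 768"
    # The task prefix registered under `prompt_name` is prepended to every sentence.
    embeddings = model.encode(
        sentences, prompt_name=prompt_name, device=device, convert_to_tensor=True
    )
    # Apply layer normalization across the full 768 dimensions before truncating.
    embeddings = torch.nn.functional.layer_norm(
        embeddings, normalized_shape=(embeddings.shape[1],)
    )
    # Keep only the leading dimensions, then rescale each vector to unit length.
    embeddings = embeddings[:, :matryoshka_dim]
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
    return embeddings.cpu()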

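The matryoshka_dim parameter truncates the 768-dimensional embedding vector; the truncated vector is then normalized again to unit length.

Now we can set the desired dimension and encode some Wikipedia texts and our question for a retrieval-augmented generation (RAG) use case: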

matryoshka_dim = 64

wikipedia_texts = [
    "The dog (Canis familiaris or Canis lupus familiaris) is a domesticated descendant of the wolf.",
    "Albert Einstein was born in Ulm in the Kingdom of Württemberg in the German Empire, on 14 March 1879.",
    "Einstein excelled at physics and mathematics from an early age, and soon acquired the mathematical expertise normally only found in a child several years his senior.",
    "Werner Karl Heisenberg was a German theoretical physicist, one of the main pioneers of the theory of quantum mechanics, and a principal scientist in the Nazi nuclear weapons program during World War II.",
    "Steven Paul Jobs (February 24, 1955 - October 5, 2011) was an American businessman, inventor, and investor best known for co-founding the technology giant Apple Inc.",
    "The cat (Felis catus), commonly referred to as the domestic cat or house cat, is the only domesticated species in the family Felidae.",
]

question = ["Where was Albert Einstein born?"]

question_embedding = embed_sentences(
    model,
    sentences=question,
    prompt_name="search_query",
    matryoshka_dim=matryoshka_dim,
    device=device,
)


document_embeddings = embed_sentences(
    model,
    sentences=wikipedia_texts,
    prompt_name="search_document",
    matryoshka_dim=matryoshka_dim,
    device=device,
)

print(f"document_embeddings.shape: {document_embeddings.shape}")
print(f"question_embedding.shape:  {question_embedding.shape}")
>> document_embeddings.shape: torch.Size([6, 64])
>> question_embedding.shape:  torch.Size([1, 64])

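We can visualize the first two dimensions of the Matryoshka text embeddings in a scatter plot. Note, however, that this embedding model was not explicitly optimized for a Matryoshka dimension of 2.

*Scatter plot of the Matryoshka embeddings for the Wikipedia texts and the question. Image by the author.*

Next, we store the document embeddings in a vector database. I am using Faiss, an open-source library from Meta Research for efficient similarity search and clustering of dense vectors [4].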

!pip install faiss-cpu

import faiss

index = faiss.IndexFlatIP(matryoshka_dim)
index.add(document_embeddings.numpy())  # Faiss expects float32 NumPy arrays

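This creates a vector database with IndexFlatIP, which performs an exact search for the inner product, that is, it uses the dot product as its similarity measure. Because our embeddings are normalized, the dot product and cosine similarity are identical.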
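
Conceptually, an exact inner-product search scores the query against every stored vector and keeps the best k. The NumPy sketch below illustrates that idea with made-up unit vectors. `exact_inner_product_search` is a hypothetical helper, not part of Faiss, and Faiss's own implementation is considerably more optimized.

```python
import numpy as np


def exact_inner_product_search(
    queries: np.ndarray, documents: np.ndarray, k: int
) -> tuple[np.ndarray, np.ndarray]:
    """Score every document against every query and return the top-k per query."""
    scores = queries @ documents.T              # shape: (n_queries, n_documents)
    top_k = np.argsort(-scores, axis=1)[:, :k]  # indices sorted by descending score
    return np.take_along_axis(scores, top_k, axis=1), top_k


# Hypothetical unit-length vectors standing in for real embeddings.
documents = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
queries = np.array([[0.8, 0.6]])

scores, indices = exact_inner_product_search(queries, documents, k=3)
print(indices)  # [[2 0 1]]
print(scores)
```

The cost of this brute-force scan grows with both the number of stored vectors and the dimension d, which is where smaller embeddings pay off.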

The index is now a vector database containing six text embeddings.

print(index.ntotal)
>> 6

Let's search for the embeddings most similar to our question and retrieve the top k results:

distances, indices = index.search(question_embedding.numpy(), k=6)
print(indices)
print(distances)
>> [[1 2 3 4 0 5]]
>> [[0.9633528  0.729192   0.63353264 0.62068397 0.512541   0.43155164]]

The most similar text in our database has index 1, with a similarity score of 0.96 (the maximum is 1.0).

# results with d=64
print(question)
print(wikipedia_texts[1])
>> ['Where was Albert Einstein born?']
>> 'Albert Einstein was born in Ulm in the Kingdom of Württemberg in the German Empire, on 14 March 1879.'

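I also reran the code with matryoshka_dim=768 and got similar results: the top two documents are the same, and only the documents at ranks 3 and 4 trade places. However, higher dimensions require more memory and more computation.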
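
One simple way to quantify how similar two rankings are is the overlap of their top-k results. The sketch below applies a hypothetical `top_k_overlap` helper to the two index lists reported in this section for d=64 and d=768. It is only an illustration on six documents, not a retrieval benchmark.

```python
def top_k_overlap(ranking_a: list[int], ranking_b: list[int], k: int) -> float:
    """Fraction of the top-k items that both rankings have in common."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k


ranking_d64 = [1, 2, 3, 4, 0, 5]   # search result reported for d=64
ranking_d768 = [1, 2, 4, 3, 0, 5]  # search result reported for d=768

for k in (1, 2, 3):
    print(k, top_k_overlap(ranking_d64, ranking_d768, k))
# 1 1.0
# 2 1.0
# 3 0.6666666666666666
```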

# results with d=768
print(indices)
print(distances)
>> [[1 2 4 3 0 5]]
>> [[0.92466116 0.645744   0.54405797 0.54004824 0.39331824 0.37972206]]

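MRL and Quantization

If we want to compress our embeddings even further, we can combine MRL with binary vector quantization. Binary quantization converts every value in the embedding vector that is greater than 0 to 1 and all other values to 0 [5].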

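*From a full-size embedding to a small binary embedding. Image by the author.*

With binary quantization, a d-dimensional embedding vector needs only d / 8 bytes of memory, a 32x reduction compared with the d * 4 bytes needed for float32 [4]. However, this reduction comes at the cost of performance.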
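
The thresholding and the resulting memory saving can be sketched with NumPy. The random embeddings and the helper names are assumptions for illustration. Comparing packed vectors by Hamming distance is one common choice for binary embeddings and is shown here only as an example.

```python
import numpy as np


def binary_quantize(embeddings: np.ndarray) -> np.ndarray:
    """Map values > 0 to 1 and all others to 0, then pack 8 bits into each byte."""
    bits = (embeddings > 0).astype(np.uint8)
    return np.packbits(bits, axis=-1)


def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing bits between two packed binary embeddings."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(2, 64)).astype(np.float32)  # hypothetical d=64 embeddings
packed = binary_quantize(embeddings)

print(embeddings[0].nbytes)  # 256 bytes as float32 (64 * 4)
print(packed[0].nbytes)      # 8 bytes after quantization (64 / 8)
print(hamming_distance(packed[0], packed[1]))
```

Combined with truncation to 64 dimensions, each vector shrinks from 3072 bytes (768 float32 values) to 8 bytes.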

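Conclusion

Embedding models trained with a Matryoshka loss are optimized for multiple embedding dimensions simultaneously.

With Matryoshka Representation Learning, LLM users can trade a small loss in performance for smaller text embeddings.

Smaller embeddings need less memory and less computation, which can save substantial costs in the long run. They are also faster to compare, which increases retrieval speed, for example in RAG applications.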

References:

[1] A. Kusupati et al. (2022), Matryoshka Representation Learning, arXiv:2205.13147
[2] MatryoshkaLoss: https://www.sbert.net/docs/package_reference/losses.html#matryoshkaloss (accessed: 04-05-2024)
[3] nomic-embed-text-v1.5 on Hugging Face: https://huggingface.co/nomic-ai/nomic-embed-text-v1.5 (accessed: 04-05-2024)
[4] Faiss Documentation: https://github.com/facebookresearch/faiss/wiki/Getting-started (accessed: 04-05-2024)
[5] A. Shakir, T. Aarsen, S. Lee (2024), Binary and Scalar Embedding Quantization for Significantly Faster & Cheaper Retrieval, Hugging Face Blog

Original author: Dr. Leon Eversberg
Translator: Qing
Graphic editor: 过儿
Proofreader: Jason
Original article: https://towardsdatascience.com/how-to-reduce-embedding-size-and-increase-rag-retrieval-speed-7f903d3cecf7

June 11, 2024 | Blog | Tags: AI, Machine Learning
