(Comments)

Original link: https://news.ycombinator.com/item?id=41544969

A user built a lightweight natural language processing (NLP) library that uses token embeddings averaged from large language models (LLMs), without requiring heavy deep learning infrastructure such as PyTorch. The model was created by extracting token embeddings, concatenating them, reducing their dimensionality, and training them with the sentence-transformers framework and datasets. Notable features include ranking, filtering, clustering, deduplication, and similarity computation, available both with standard cosine similarity and in a faster, more memory-efficient binarized variant. The library also aims to be fast, compact, and easy to install via pip. While it cannot match transformer models, it performs well on the Massive Text Embedding Benchmark (MTEB), particularly in performance relative to size compared with other word embedding models. The library currently supports Linux and macOS; Windows support is still in development.


Original Text
After working with LLMs for long enough, I found myself wanting a lightweight utility for doing various small tasks to prepare inputs, locate information and create evaluators. This library is two things: a very simple model and utilities that run inference with it (e.g. fuzzy deduplication). The target platform is CPU, and it’s intended to be light, fast and pip installable — a library that lowers the barrier to working with strings semantically. You don’t need to install pytorch to use it, or any deep learning runtimes.
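As a sketch of what a fuzzy-deduplication utility over such embeddings might boil down to (the function name, greedy strategy and threshold below are my own assumptions, not the library's implementation), a cosine-threshold pass in plain numpy could look like this:

```python
import numpy as np

def deduplicate(embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Greedy fuzzy deduplication sketch: keep an item unless its cosine similarity
    to something already kept reaches the threshold. Strategy and default threshold
    are illustrative assumptions."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vec in enumerate(normed):
        if not kept or float((normed[kept] @ vec).max()) < threshold:
            kept.append(i)
    return kept

# Three vectors where the second is a near-duplicate of the first.
vecs = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]], dtype=np.float32)
print(deduplicate(vecs))  # -> [0, 2]
```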

How can this be accomplished? The model is simply token embeddings that are average pooled. To create this model, I extracted token embedding (nn.Embedding) vectors from LLMs, concatenated them along the embedding dimension, added a learnable weight parameter, and projected them to a smaller dimension. Using the sentence transformers framework and datasets, I trained the pooled embeddings with multiple negatives ranking loss and matryoshka representation learning so they can be truncated. After training, the weights and projections are no longer needed, because there are no contextual calculations. I run inference over the entire token vocabulary and save the new token embeddings to be loaded into numpy.
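To make the inference path concrete, here is a minimal numpy sketch of the mechanism described above: a table lookup into a saved token-embedding matrix followed by average pooling. The random matrix and the made-up token ids are placeholders, not the released weights or a real tokenizer:

```python
import numpy as np

# Stand-in for the exported token-embedding matrix (vocab_size x dim);
# the real model would load trained weights from disk, not random values.
VOCAB_SIZE, DIM = 32_000, 64
token_embeddings = np.random.randn(VOCAB_SIZE, DIM).astype(np.float32)

def embed(token_ids: list[int]) -> np.ndarray:
    """Embed one tokenized string: look up token vectors and average-pool them.
    No contextual computation and no deep learning runtime is involved."""
    vecs = token_embeddings[np.asarray(token_ids)]  # (seq_len, dim) table lookup
    pooled = vecs.mean(axis=0)                      # average pooling
    return pooled / np.linalg.norm(pooled)          # normalize for cosine similarity

# Token ids here are made up; a real tokenizer (e.g. the source LLM's) would supply them.
print(embed([101, 2057, 2023, 102]).shape)  # (64,)
```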

While the results are not impressive compared to transformer models, they perform well on MTEB benchmarks compared to word embedding models (which they are most similar to), while being much smaller in size (the smallest model, with a 32k vocab and 64 dimensions, is only 4MB).

On the utility side, I’ve been adding some tools that I think will be useful. In addition to general embedding, there are algorithms for ranking, filtering, clustering, deduplication and similarity. Some of them have a Cython implementation, and I’m continuing to work on benchmarking and improving them as I have time. In addition to “standard” models that use cosine similarity for some algorithms, there are binarized models that use Hamming distance. This is a slightly faster similarity algorithm with significantly less memory per embedding (float32 -> 1 bit).
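As an illustration of the binarized variant, here is a small numpy sketch of sign-binarization and Hamming-distance scoring; the exact thresholding and packing layout the library uses are not specified in the post, so those details are assumptions:

```python
import numpy as np

def binarize(embeddings: np.ndarray) -> np.ndarray:
    """Sign-binarize float embeddings and pack 8 dimensions per byte
    (float32 -> 1 bit per dimension, a 32x memory reduction)."""
    bits = (embeddings > 0).astype(np.uint8)
    return np.packbits(bits, axis=-1)  # shape (..., dim // 8), dtype uint8

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Count differing bits between two packed binary embeddings (XOR + popcount)."""
    return int(np.unpackbits(a ^ b).sum())

rng = np.random.default_rng(0)
x, y = rng.standard_normal((2, 64)).astype(np.float32)
bx, by = binarize(x), binarize(y)
print(hamming_distance(bx, by))  # lower distance = more similar
```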

Hope you enjoy it, and find it useful. PS I haven’t figured out Windows builds yet, but Linux and Mac are supported.
