A header-only C vector database library

Original link: https://github.com/abdimoallim/vdb

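## vdb: A lightweight vector database

`vdb` is a single-header C library designed for efficiently storing and searching high-dimensional vector embeddings. It is header-only, has no dependencies (apart from optional pthreads for multithreading), and offers a simple API for creating, populating, searching, saving, and loading vector databases.

Key features include support for cosine, Euclidean, and dot-product distance metrics, plus optional thread-safe operation enabled with `#define VDB_MULTITHREADED`. Users can also plug in custom memory allocation by supplying their own `malloc`/`free`/`realloc`.

The library provides functions for adding, removing, and retrieving vectors, running k-nearest-neighbor searches, and persisting data to disk in a custom binary format. Python bindings are also provided. A basic example demonstrates database creation, vector insertion, search, and cleanup. It is licensed under Apache 2.0.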

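A new single-header C library for building vector databases has been shared on Hacker News. The library is available on GitHub and aims to stay simple. Commenters, however, pointed out key limitations: it is primarily an in-memory database, saving and loading are manual and not crash-safe, and it has no index, so search time grows linearly with the amount of data.

The discussion also touched on the appeal of single-file C implementations in modern development environments that lean heavily on complex dependencies and configuration (such as Kubernetes), where simplicity and ease of integration are highly valued. One commenter questioned the need to call it a "header-only" library, arguing that a single source file could achieve much the same result.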

Original text

A lightweight, header-only C library for storing and searching high-dimensional vector embeddings with optional multithreading support.

  • Header-only implementation (single file: vdb.h)
  • Multiple distance metrics (cosine, euclidean, dot product)
  • Optional thread-safe operations via #define VDB_MULTITHREADED
  • Save/load database to/from disk
  • Custom memory allocators support
  • No dependencies (except pthreads for multithreading)
  • Python bindings (refer to vdb.py)
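/* test.c */
#include "vdb.h"

int main(void) {
  /* 128-dimensional database using cosine distance */
  vdb_database *db = vdb_create(128, VDB_METRIC_COSINE);
  if (db == NULL) return 1;

  float embedding[128] = { /* ... */ };
  if (vdb_add_vector(db, embedding, "vec1", NULL) != VDB_OK) {
    vdb_destroy(db);
    return 1;
  }

  /* Find the 5 nearest neighbors of the query vector */
  float query[128] = { /* ... */ };
  vdb_result_set *results = vdb_search(db, query, 5);
  if (results != NULL) {
    vdb_free_result_set(results);
  }

  vdb_destroy(db);
  return 0;
}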

Include vdb.h and compile with either of the approaches below. pthreads is not available on every platform, which is why multithreading is behind a flag.

Single-threaded:

gcc -O2 test.c -o test -lm

Multi-threaded:

gcc -O2 -DVDB_MULTITHREADED test.c -o test -lpthread -lm

vdb_database *vdb_create(size_t dimensions, vdb_metric metric) Creates a new vector database.

void vdb_destroy(vdb_database *db) Frees all resources associated with the database.

size_t vdb_count(const vdb_database *db) Returns the number of vectors in the database.

size_t vdb_dimensions(const vdb_database *db) Returns the dimensionality of vectors.

vdb_error vdb_add_vector(vdb_database *db, const float *data, const char *id, void *metadata) Adds a vector to the database with optional ID and metadata.

vdb_error vdb_remove_vector(vdb_database *db, size_t index) Removes a vector at the specified index.

vdb_error vdb_get_vector(const vdb_database *db, size_t index, float **out_data, char **out_id, void **out_metadata) Retrieves a vector and its metadata.

vdb_result_set *vdb_search(const vdb_database *db, const float *query, size_t k) Performs k-nearest neighbor search. Returns NULL on error.
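
As the discussion summary notes, the library has no index, so a k-nearest-neighbor query has to look at every stored vector. The following is a minimal sketch of that kind of linear scan, not vdb's actual implementation. The names `example_hit`, `example_sq_dist`, and `example_knn` are hypothetical, the vectors are assumed to be stored row-major in one flat array, and squared Euclidean distance stands in for whichever metric the database was created with.

```c
#include <stddef.h>

/* Hypothetical result record for this sketch; not vdb's vdb_result_set. */
typedef struct {
  size_t index;
  float distance;
} example_hit;

/* Squared Euclidean distance; it orders neighbors the same way as
   the Euclidean distance and avoids a square root. */
float example_sq_dist(const float *a, const float *b, size_t dims) {
  float sum = 0.0f;
  for (size_t i = 0; i < dims; i++) {
    float d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

/* Scans all `count` vectors (row-major in `data`) and writes up to `k`
   nearest hits into `out`, closest first. Returns the number written. */
size_t example_knn(const float *data, size_t count, size_t dims,
                   const float *query, size_t k, example_hit *out) {
  size_t found = 0;
  if (k == 0) return 0;
  for (size_t i = 0; i < count; i++) {
    float d = example_sq_dist(data + i * dims, query, dims);
    /* Skip candidates that are no better than the current worst hit. */
    if (found == k && d >= out[k - 1].distance) continue;
    /* Take a free slot, or overwrite the worst hit when the list is full. */
    size_t pos = (found < k) ? found++ : k - 1;
    /* Shift larger entries right so `out` stays sorted ascending. */
    while (pos > 0 && out[pos - 1].distance > d) {
      out[pos] = out[pos - 1];
      pos--;
    }
    out[pos].index = i;
    out[pos].distance = d;
  }
  return found;
}
```

The cost of each query grows in proportion to the number of stored vectors, which matches the linear slowdown the commenters described.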

void vdb_free_result_set(vdb_result_set *result_set) Frees search results.

vdb_error vdb_save(const vdb_database *db, const char *filename) Saves the database to disk.

vdb_database *vdb_load(const char *filename) Loads a database from disk.

  • VDB_METRIC_COSINE - Cosine distance (1 - cosine similarity)
  • VDB_METRIC_EUCLIDEAN - Euclidean (L2) distance
  • VDB_METRIC_DOT_PRODUCT - Negative dot product
VDB_OK = 0
VDB_ERROR_NULL_POINTER = -1
VDB_ERROR_INVALID_DIMENSIONS = -2
VDB_ERROR_OUT_OF_MEMORY = -3
VDB_ERROR_NOT_FOUND = -4
VDB_ERROR_INVALID_INDEX = -5
VDB_ERROR_THREAD_FAILURE = -6
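
When logging failures it can help to turn these codes back into their names. The helper below is a hypothetical convenience function (`example_error_name` is not part of vdb.h); it takes a plain `int` and uses the numeric values listed above, so it compiles without the header.

```c
/* Maps a vdb_error value to its symbolic name for logging.
   Hypothetical helper; the values mirror the list in this README. */
const char *example_error_name(int code) {
  switch (code) {
    case 0:  return "VDB_OK";
    case -1: return "VDB_ERROR_NULL_POINTER";
    case -2: return "VDB_ERROR_INVALID_DIMENSIONS";
    case -3: return "VDB_ERROR_OUT_OF_MEMORY";
    case -4: return "VDB_ERROR_NOT_FOUND";
    case -5: return "VDB_ERROR_INVALID_INDEX";
    case -6: return "VDB_ERROR_THREAD_FAILURE";
    default: return "unknown vdb_error";
  }
}
```

A typical use would be printing `example_error_name(err)` whenever a call such as `vdb_add_vector` returns something other than `VDB_OK`.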

Define before including vdb.h:

#define VDB_MALLOC my_malloc
#define VDB_FREE my_free
#define VDB_REALLOC my_realloc
#include "vdb.h"
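
As an illustration of what `my_malloc`, `my_free`, and `my_realloc` might look like, here is a sketch of a counting allocator that wraps the standard functions and tracks how many blocks are still live. It assumes the macros expect the same signatures as `malloc`, `free`, and `realloc`, which this README implies but does not spell out. The counter and `example_live_allocations` are hypothetical additions, and the counter is not thread-safe.

```c
#include <stdlib.h>

/* Number of blocks allocated through the wrappers and not yet freed. */
static long example_live_allocs = 0;

void *my_malloc(size_t size) {
  void *p = malloc(size);
  if (p != NULL) example_live_allocs++;
  return p;
}

void my_free(void *p) {
  if (p != NULL) example_live_allocs--;
  free(p);
}

/* Assumes size > 0; realloc with size 0 is implementation-defined. */
void *my_realloc(void *p, size_t size) {
  void *q = realloc(p, size);
  /* realloc(NULL, n) behaves like malloc(n), so it creates a new block. */
  if (p == NULL && q != NULL) example_live_allocs++;
  return q;
}

/* Returns the current number of live blocks (0 means nothing leaked). */
long example_live_allocations(void) {
  return example_live_allocs;
}
```

With these definitions placed before the `#define` lines, checking that `example_live_allocations()` is zero after `vdb_destroy` would be one way to spot leaks in a test program.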

When compiled with VDB_MULTITHREADED, all operations are thread-safe using read-write locks:

  • Multiple threads can search simultaneously
  • Add/remove operations are exclusive
  • No external locking required

vdb uses a binary format with magic number 0x56444230 (the ASCII codes for "VDB0"):

  • Header: magic (4 bytes), dimensions, count, metric
  • Vectors: float array + ID length + ID string (for each vector)
  • Metadata is not persisted
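
To show how a header of this shape can be written and validated, here is a sketch that round-trips one through a `FILE *`. Only the 4-byte magic comes from this README. The field widths chosen for dimensions, count, and metric, the host byte order, and all `example_` names are assumptions, so files produced by this code should not be expected to be readable by `vdb_load`.

```c
#include <stdint.h>
#include <stdio.h>

#define EXAMPLE_MAGIC 0x56444230u /* value taken from the README */

/* Assumed layout: widths other than the 4-byte magic are guesses. */
typedef struct {
  uint32_t magic;
  uint64_t dimensions;
  uint64_t count;
  uint32_t metric;
} example_header;

/* Writes the fields one at a time so struct padding never reaches the file.
   Returns 0 on success, -1 on a short write. */
int example_write_header(FILE *f, const example_header *h) {
  if (fwrite(&h->magic, sizeof h->magic, 1, f) != 1) return -1;
  if (fwrite(&h->dimensions, sizeof h->dimensions, 1, f) != 1) return -1;
  if (fwrite(&h->count, sizeof h->count, 1, f) != 1) return -1;
  if (fwrite(&h->metric, sizeof h->metric, 1, f) != 1) return -1;
  return 0;
}

/* Reads the fields back in the same order and rejects a wrong magic.
   Returns 0 on success, -1 on a short read or magic mismatch. */
int example_read_header(FILE *f, example_header *h) {
  if (fread(&h->magic, sizeof h->magic, 1, f) != 1) return -1;
  if (h->magic != EXAMPLE_MAGIC) return -1;
  if (fread(&h->dimensions, sizeof h->dimensions, 1, f) != 1) return -1;
  if (fread(&h->count, sizeof h->count, 1, f) != 1) return -1;
  if (fread(&h->metric, sizeof h->metric, 1, f) != 1) return -1;
  return 0;
}
```

Checking the magic number first lets a loader reject unrelated files before it trusts the dimension and count fields.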

Apache v2.0 License
