Sub-Millisecond RAG on Apple Silicon. No Server. No API. One File

Original link: https://github.com/christopherkarani/Wax

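## Wax: Single-File AI Memory

Wax provides a complete retrieval-augmented generation (RAG) solution that replaces a complex stack with a single `.mv2s` file: no database, no Docker, no network calls. It is designed to add memory to AI applications quickly and privately.

**Key features:**

* **Simplicity:** All data (documents, embeddings, indexes) is stored in one portable, self-contained file.
* **Performance:** Sub-millisecond vector search (0.84 ms over 10K documents on a Metal GPU).
* **Durability:** A robust write-ahead log makes the file crash-safe and power-loss safe.
* **Determinism:** The same query is guaranteed to return consistent results.
* **Privacy:** Runs entirely on-device, with zero data transmission.

Wax supports several memory types: text, photos (with OCR and CLIP), and video (with transcripts). It uses hybrid search (BM25, vector, temporal) and tiered compression to improve results, and it applies deterministic token budgeting to prevent context-window overflow.

Built in Swift, Wax is well suited to offline-first applications, privacy-focused products, and research that needs reproducible retrieval. It runs on iOS/macOS and uses Apple Silicon for GPU acceleration. ([https://github.com/christopherkarani/Wax](https://github.com/christopherkarani/Wax))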

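## Wax: Sub-Millisecond RAG on Apple Silicon

Christopher Karani developed **Wax**, a new retrieval-augmented generation (RAG) solution for Apple Silicon that is designed to run fully offline, with no server, API, or cloud dependency. It is packaged as a single file, giving local knowledge access an SQLite-like simplicity.

Wax achieves **sub-millisecond vector search** over 10,000+ vectors by using Metal-accelerated search and optimized kernels. It stores everything (embeddings, indexes, metadata) in a crash-safe single-file (.mv2s) format, which makes the store easy to move and the results deterministic. Key capabilities include **multimodal support** (text, photos, and video, with OCR and keyframe indexing), **hybrid search** (combining BM25, vector, timeline, and structured memory), and **strict Swift concurrency** for thread safety. Performance benchmarks show Wax to be much faster than CPU-based and SQLite FTS5-based solutions. The developer is seeking feedback from developers who are building RAG into Swift apps, and plans to explore language bindings and recency-decay features in future updates. The project is available on [GitHub](https://github.com/christopherkarani/Wax).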

Original text

The SQLite for AI memory.
One file. Full RAG. Zero infrastructure.


import Wax

// Create a memory file
let brain = try await MemoryOrchestrator(
    at: URL(fileURLWithPath: "brain.mv2s")
)

// Remember something
try await brain.remember(
    "User prefers dark mode and gets headaches from bright screens",
    metadata: ["source": "onboarding"]
)

// Recall with RAG
let context = try await brain.recall(query: "user preferences")
// → "User prefers dark mode and gets headaches from bright screens"
//   + relevant context, ranked and token-budgeted

That's it. No Docker. No vector DB. No network calls.

You wanted to add memory to your AI app.

3 hours later you're still configuring Docker Compose for a vector database that crashes if you look at it wrong, sends your data to who-knows-where, and needs a DevOps team to keep running.

Wax replaces your entire RAG stack with a file format.

Traditional RAG Stack:                     Wax:
┌─────────────────┐                       ┌─────────────┐
│  Your App       │                       │  Your App   │
├─────────────────┤                       ├─────────────┤
│  ChromaDB       │                       │             │
│  PostgreSQL     │    vs.                │   brain.    │
│  Redis          │                       │    mv2s     │
│  Elasticsearch  │                       │             │
│  Docker         │                       │             │
└─────────────────┘                       └─────────────┘
    ~5 services                               1 file


⚡ Fast	0.84ms vector search @ 10K docs (Metal GPU)
🛡️ Durable	Kill -9 safe, power-loss safe, tested
🎯 Deterministic	Same query = same context, every time
📦 Portable	One `.mv2s` file — move it, back it up, ship it
🔒 Private	100% on-device. Zero network calls.

Apple Silicon (M1 Pro)

Vector Search Latency (10K × 384-dim)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Wax Metal (warm)     ████░░░░░░░░░░░░░░░░  0.84ms
Wax Metal (cold)     █████████████████░░░  9.2ms
Wax CPU              ███████████░░░░░░░░░  105ms
SQLite FTS5          ██████████████████░░  150ms
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Cold Open → First Query: 17ms
Hybrid Search @ 10K docs: 105ms

Core Benchmark Baselines (as of February 17, 2026)

These are reproducible XCTest benchmark baselines captured from the current Wax benchmark harness.

Ingest throughput (`testIngestHybridBatchedPerformance`)

Workload	Time	Throughput
smoke (200 docs)	`0.103s`	`~1941.7 docs/s`
standard (1000 docs)	`0.309s`	`~3236.2 docs/s`
stress (5000 docs)	`2.864s`	`~1745.8 docs/s`
10k	`7.756s`	`~1289.3 docs/s`
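
The throughput column is simply documents divided by wall-clock time. As a quick sanity check, this minimal sketch recomputes it from the table's own numbers; the `throughput` helper is illustrative and not part of Wax.

```swift
import Foundation

/// Throughput in documents per second for a batch that took `seconds`.
func throughput(docs: Int, seconds: Double) -> Double {
    Double(docs) / seconds
}

// (name, docs, seconds) rows copied from the ingest table.
let ingestRuns: [(name: String, docs: Int, seconds: Double)] = [
    ("smoke", 200, 0.103),
    ("standard", 1000, 0.309),
    ("stress", 5000, 2.864),
    ("10k", 10_000, 7.756),
]

for run in ingestRuns {
    let rate = throughput(docs: run.docs, seconds: run.seconds)
    print(run.name, String(format: "%.1f docs/s", rate))
}
```

Read this way, the table shows throughput peaking at the 1000-document workload and falling off as the batch grows, which is worth keeping in mind when choosing batch sizes.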

Workload	Time	Throughput
warm CPU smoke	`0.0015s`	`~666.7 ops/s`
warm CPU standard	`0.0033s`	`~303.0 ops/s`
warm CPU stress	`0.0072s`	`~138.9 ops/s`
10k CPU hybrid iteration	`0.103s`	`~9.7 ops/s`

Recall latency (`testMemoryOrchestratorRecallPerformance`)

Workload	Time
smoke	`0.103s`
standard	`0.101s`

The stress recall workload currently crashes the benchmark harness with signal 11 (SIGSEGV), so it is not reported here and is tracked as a known benchmark issue.

Mode	Time
fast mode	`0.102s`
dense cached	`0.102s`

For benchmark commands, profiling traces, and methodology, see:

Tasks/hot-path-specialization-investigation.md (in the Wax repository)

No, that's not a typo. GPU vector search really is sub-millisecond.

WAL Compaction and Storage Health (2026-02)

Wax now includes a WAL/storage health track focused on commit latency tails, long-run file growth, and recovery behavior:

No-op index compaction guards to avoid unnecessary index rewrites.
Single-pass WAL replay with guarded replay snapshot fast path.
Proactive WAL-pressure commits for targeted workloads (guarded rollout).
Scheduled rewriteLiveSet maintenance with dead-payload thresholds, validation, and rollback.

Repeated unchanged index compaction growth improved from +61,768,464 bytes over 8 runs (~7.72MB/run) to bounded drift (test-gated).
Commit latency improved in most matrix workloads in recent runs (examples: medium_hybrid p95 -13.9%, large_text_10k p95 -8.0%, sustained_write_text p95 -5.7%).
Reopen/recovery p95 is generally flat-to-improved across the matrix.
sustained_write_hybrid remains workload-sensitive, so proactive/scheduled maintenance stays guarded by default.

Proactive pressure commits are tuned for targeted workloads and validated with percentile guardrails.
Replay snapshot open-path optimization is additive and guarded.
Scheduled live-set rewrite is configurable and runs deferred from the flush() hot path.
Rewrite candidates are automatically validated and rolled back on verification failure.

Configure scheduled live-set rewrite

import Wax

var config = OrchestratorConfig.default
config.liveSetRewriteSchedule = LiveSetRewriteSchedule(
    enabled: true,
    checkEveryFlushes: 32,
    minDeadPayloadBytes: 64 * 1024 * 1024,
    minDeadPayloadFraction: 0.25,
    minimumCompactionGainBytes: 0,
    minimumIdleMs: 15_000,
    minIntervalMs: 5 * 60_000,
    verifyDeep: false
)
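
To make the thresholds above concrete, here is one plausible way such a schedule could gate a rewrite. This is a sketch under stated assumptions, not Wax's scheduler: `RewritePolicy`, `StoreStats`, and `shouldRewrite` are hypothetical names, and treating the gates as a simple conjunction is a guess.

```swift
/// Hypothetical mirror of some of the schedule fields (not Wax's real type).
struct RewritePolicy {
    var enabled = true
    var checkEveryFlushes = 32
    var minDeadPayloadBytes: UInt64 = 64 * 1024 * 1024
    var minDeadPayloadFraction = 0.25
    var minimumIdleMs: UInt64 = 15_000
    var minIntervalMs: UInt64 = 5 * 60_000
}

/// Hypothetical snapshot of store state at flush time.
struct StoreStats {
    var flushCount: Int
    var deadPayloadBytes: UInt64
    var totalPayloadBytes: UInt64
    var idleMs: UInt64
    var msSinceLastRewrite: UInt64
}

/// Returns true only when every gate passes (the conjunction is an assumption).
func shouldRewrite(_ s: StoreStats, policy p: RewritePolicy) -> Bool {
    guard p.enabled, p.checkEveryFlushes > 0, s.totalPayloadBytes > 0 else { return false }
    // Only evaluate the expensive checks every N flushes.
    guard s.flushCount % p.checkEveryFlushes == 0 else { return false }
    let deadFraction = Double(s.deadPayloadBytes) / Double(s.totalPayloadBytes)
    return s.deadPayloadBytes >= p.minDeadPayloadBytes
        && deadFraction >= p.minDeadPayloadFraction
        && s.idleMs >= p.minimumIdleMs
        && s.msSinceLastRewrite >= p.minIntervalMs
}

let sampleStats = StoreStats(
    flushCount: 64,
    deadPayloadBytes: 96 * 1024 * 1024,
    totalPayloadBytes: 256 * 1024 * 1024,
    idleMs: 20_000,
    msSinceLastRewrite: 600_000
)
print(shouldRewrite(sampleStats, policy: RewritePolicy()))
```

Requiring both an absolute byte floor and a dead-payload fraction keeps small stores from rewriting over a few stale kilobytes and large stores from rewriting when the dead share is negligible.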

Reproduce benchmark matrix

WAX_BENCHMARK_WAL_COMPACTION=1 \
WAX_BENCHMARK_WAL_OUTPUT=/tmp/wal-matrix.json \
swift test --filter WALCompactionBenchmarks.testWALCompactionWorkloadMatrix

WAX_BENCHMARK_WAL_GUARDRAILS=1 \
swift test --filter WALCompactionBenchmarks.testProactivePressureCommitGuardrails

WAX_BENCHMARK_WAL_REOPEN_GUARDRAILS=1 \
swift test --filter WALCompactionBenchmarks.testReplayStateSnapshotGuardrails

See Tasks/wal-compaction-investigation.md and Tasks/wal-compaction-baseline.json in the Wax repository for methodology and full baseline artifacts.

1. Add the Package

Add Wax to the dependencies in your Package.swift:

.package(url: "https://github.com/christopherkarani/Wax.git", from: "0.1.6")

2. Choose Your Memory Type

📝 Text Memory — Documents, notes, conversations

import Wax

let orchestrator = try await MemoryOrchestrator(at: storeURL)

// Ingest
try await orchestrator.remember(documentText, metadata: ["source": "report.pdf"])

// Recall
let context = try await orchestrator.recall(query: "key findings")
for item in context.items {
    print("[\(item.kind)] \(item.text)")
}

📸 Photo Memory — Photo library with OCR + CLIP embeddings

import Wax

let photoRAG = try await PhotoRAGOrchestrator(
    storeURL: storeURL,
    config: .default,
    embedder: MyCLIPEmbedder()  // Your CoreML model
)

// Index local photos (offline-only)
try await photoRAG.syncLibrary(scope: .fullLibrary)

// Search
let ctx = try await photoRAG.recall(.init(text: "Costco receipt"))

🎬 Video Memory — Video segments with transcripts

import Wax

let videoRAG = try await VideoRAGOrchestrator(
    storeURL: storeURL,
    config: .default,
    embedder: MyEmbedder(),
    transcriptProvider: MyTranscriber()
)

// Ingest
try await videoRAG.ingest(files: [videoFile])

// Search by content or transcript
let ctx = try await videoRAG.recall(.init(text: "project timeline discussion"))

Wax packs everything into a single .mv2s file:

✅ Your raw documents
✅ Embeddings (any dimension, any provider)
✅ BM25 full-text search index (FTS5)
✅ HNSW vector index (USearch)
✅ Write-ahead log for crash recovery
✅ Metadata & entity graph

The file format is:

Append-only — Fast writes, no fragmentation
Checksum-verified — Every byte validated
Dual-header — Atomic updates, never corrupt
Self-contained — No external dependencies

┌─────────────────────────────────────────┐
│  Header Page A (4KB)                    │
│  Header Page B (4KB) ← atomic switch    │
├─────────────────────────────────────────┤
│  WAL Ring Buffer                        │
│  (crash recovery log)                   │
├─────────────────────────────────────────┤
│  Document Payloads (compressed)         │
│  Embeddings                             │
├─────────────────────────────────────────┤
│  TOC (Table of Contents)                │
│  Footer + Checksum                      │
└─────────────────────────────────────────┘
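
The dual-header layout above implies a familiar recovery rule: on open, trust whichever header page passes its checksum and carries the newer commit. The sketch below illustrates that idea only. The `HeaderPage` fields, the generation counter, and the FNV-1a checksum are all assumptions standing in for the real `.mv2s` layout, which is not specified here.

```swift
/// Hypothetical header record; the real .mv2s header layout may differ.
struct HeaderPage {
    var generation: UInt64   // assumed monotonically increasing commit counter
    var tocOffset: UInt64    // assumed pointer to the table of contents
    var checksum: UInt32     // checksum over the two fields above

    /// FNV-1a (32-bit) over the fields, a stand-in for whatever checksum Wax uses.
    static func computeChecksum(generation: UInt64, tocOffset: UInt64) -> UInt32 {
        var hash: UInt32 = 2_166_136_261
        for value in [generation, tocOffset] {
            for shift in stride(from: 0, to: 64, by: 8) {
                hash ^= UInt32((value >> shift) & 0xFF)
                hash = hash &* 16_777_619
            }
        }
        return hash
    }

    init(generation: UInt64, tocOffset: UInt64) {
        self.generation = generation
        self.tocOffset = tocOffset
        self.checksum = Self.computeChecksum(generation: generation, tocOffset: tocOffset)
    }

    var isValid: Bool {
        checksum == Self.computeChecksum(generation: generation, tocOffset: tocOffset)
    }
}

/// Pick the valid page with the highest generation; nil if both pages are corrupt.
func activeHeader(a: HeaderPage, b: HeaderPage) -> HeaderPage? {
    [a, b].filter { $0.isValid }.max { $0.generation < $1.generation }
}

// Simulate a crash while page B was being written: its checksum no longer matches.
let committedPage = HeaderPage(generation: 7, tocOffset: 4096)
var tornPage = HeaderPage(generation: 8, tocOffset: 8192)
tornPage.tocOffset = 0
print(activeHeader(a: committedPage, b: tornPage)?.generation ?? 0)
```

Because a writer only ever overwrites the inactive page, a torn write can damage at most one header, and the reader falls back to the last fully committed state.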

Feature	Wax	Chroma	Core Data + FAISS	Pinecone
Single file	✅	❌	❌	❌
Works offline	✅	⚠️	✅	❌
Crash-safe	✅	❌	⚠️	N/A
GPU vector search	✅	❌	❌	❌
No server required	✅	✅	✅	❌
Swift-native	✅	❌	✅	❌
Deterministic RAG	✅	❌	❌	❌

Features That Actually Matter

🧠 Query-Adaptive Hybrid Search

Wax doesn't just do vector search. It runs multiple lanes in parallel (BM25, vector, temporal, structured evidence) and fuses results based on query type.

"When was my last dentist appointment?" → boosts temporal + structured
"Explain quantum computing" → boosts vector + BM25

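The lane boosting described above can be sketched as weighted reciprocal rank fusion. This is an illustration, not Wax's fusion code: the README does not say which fusion method or weights Wax uses, and the keyword-based query classifier, the weight values, and the names `Lane`, `laneWeights`, and `fuse` are all hypothetical.

```swift
enum Lane: CaseIterable, Hashable { case bm25, vector, temporal, structured }

/// Hypothetical per-query lane weights, chosen by a toy keyword check.
func laneWeights(for query: String) -> [Lane: Double] {
    let words = Set(query.lowercased().split { !$0.isLetter }.map(String.init))
    let isTemporal = !words.isDisjoint(with: ["when", "last", "yesterday", "ago"])
    return isTemporal
        ? [.bm25: 0.5, .vector: 0.5, .temporal: 2.0, .structured: 2.0]
        : [.bm25: 1.5, .vector: 1.5, .temporal: 0.5, .structured: 0.5]
}

/// Weighted reciprocal rank fusion: score(d) = sum over lanes of weight / (k + rank).
func fuse(_ rankings: [Lane: [String]], weights: [Lane: Double], k: Double = 60) -> [String] {
    var scores: [String: Double] = [:]
    // Iterate lanes in a fixed order so floating-point sums are reproducible.
    for lane in Lane.allCases {
        guard let docs = rankings[lane] else { continue }
        let weight = weights[lane] ?? 1.0
        for (index, doc) in docs.enumerated() {
            scores[doc, default: 0] += weight / (k + Double(index + 1))
        }
    }
    // Highest score first; ties broken by document ID for a stable order.
    return scores.sorted { ($0.value, $1.key) > ($1.value, $0.key) }.map { $0.key }
}

let laneResults: [Lane: [String]] = [
    .bm25: ["a", "b"],
    .vector: ["a", "c"],
    .temporal: ["d", "a"],
    .structured: ["d"],
]
print(fuse(laneResults, weights: laneWeights(for: "When was my last dentist appointment?")))
print(fuse(laneResults, weights: laneWeights(for: "Explain quantum computing")))
```

With the same lane results, the temporal query ranks document `d` first while the explanatory query ranks `a` first, which is the behavior the two example queries describe.
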
🎭 Tiered Memory Compression (Surrogates)

Not all context is equal. Wax generates hierarchical summaries:

full — Complete document (for deep dives)
gist — Key paragraphs (for balanced recall)
micro — One-liner (for quick context)

At query time, it picks the right tier based on query signals and remaining token budget.
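
A minimal sketch of the budget half of that decision: pick the richest surrogate that still fits. It ignores query signals entirely, and `Tier`, `Surrogates`, and `pickTier` are hypothetical names rather than Wax's API.

```swift
/// Tiers ordered from richest to most compact.
enum Tier: CaseIterable, Hashable { case full, gist, micro }

/// Hypothetical per-document surrogates with precomputed token counts.
struct Surrogates {
    var tokens: [Tier: Int]
}

/// Choose the richest tier that fits in the remaining budget, or nil if none fits.
func pickTier(for doc: Surrogates, remainingTokens: Int) -> Tier? {
    Tier.allCases.first { tier in
        guard let cost = doc.tokens[tier] else { return false }
        return cost <= remainingTokens
    }
}

let report = Surrogates(tokens: [.full: 1200, .gist: 300, .micro: 25])
for budget in [2000, 500, 30, 10] {
    print(budget, String(describing: pickTier(for: report, remainingTokens: budget)))
}
```

As the budget shrinks, the same document degrades from `full` to `gist` to `micro` instead of being dropped outright, and it is only skipped when even the one-liner does not fit.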

🎯 Deterministic Token Budgeting

Strict cl100k_base token counting. No "oops, context window exceeded." No non-deterministic truncation. Reproducible RAG you can test and benchmark.
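
Deterministic budgeting can be sketched as a greedy pack over a fully ordered candidate list. This is an assumption about the general technique, not Wax's implementation: `Candidate`, `countTokens`, and `pack` are hypothetical, and the whitespace word count is only a stand-in for real cl100k_base tokenization.

```swift
struct Candidate {
    let id: String
    let score: Double
    let text: String
}

/// Stand-in tokenizer: counts whitespace-separated words (not cl100k_base).
func countTokens(_ text: String) -> Int {
    text.split(whereSeparator: { $0.isWhitespace }).count
}

/// Greedy packing: highest score first, ties broken by id, whole items only.
func pack(_ candidates: [Candidate], budget: Int) -> [Candidate] {
    // A total order (score, then id) makes the result independent of input order.
    let ordered = candidates.sorted { ($0.score, $1.id) > ($1.score, $0.id) }
    var remaining = budget
    var chosen: [Candidate] = []
    for candidate in ordered {
        let cost = countTokens(candidate.text)
        if cost <= remaining {
            chosen.append(candidate)
            remaining -= cost
        }
    }
    return chosen
}

let pool = [
    Candidate(id: "c", score: 0.5, text: "A much longer note that would not fit the budget"),
    Candidate(id: "b", score: 0.9, text: "Headaches from brightness"),
    Candidate(id: "a", score: 0.9, text: "User prefers dark mode"),
    Candidate(id: "d", score: 0.4, text: "Onboarding note"),
]
print(pack(pool, budget: 8).map { $0.id })
```

Items are never truncated mid-text, and the explicit tie-break means two candidates with equal scores always land in the same order, so the packed context can be asserted on in tests.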

🤖 AI assistants that remember users across launches
📱 Offline-first apps with serious search requirements
🔒 Privacy-critical products where data never leaves the device
🧪 Research tooling that needs reproducible retrieval
🎮 Agent workflows that require durable state

Swift 6.2
iOS 26 / macOS 26
Apple Silicon (for Metal GPU features)

git clone https://github.com/christopherkarani/Wax.git
cd Wax
swift test

MiniLM CoreML tests are opt-in:

WAX_TEST_MINILM=1 swift test