Modern serverless search is just an accounting trick. There’s a hidden pool of nodes behind the API, and the final bill is split evenly among all clients. There’s always a standby warm node waiting for your request - you just don’t see it.
And you can’t get rid of it because scaling search engines is HARD (or at least search vendors want you to think so). You can’t just put one into a Lambda function. But what if you actually can?
As someone who has hated Elasticsearch since version 0.89 (but still uses it), I see three major blockers to running it in a truly serverless mode:
Container size: It’s around 700MB in version 9.x. The bigger the container, the slower the node startup, since it has to be pulled from somewhere.
Container startup time: For ES 9.x, the startup time alone is about 40 seconds. And the time-to-performance is much worse, since a cold JVM is painfully slow until it sees some traffic.
Index and state: Search engines like Elastic and Qdrant behave like databases, with each node hosting a fraction of the total cluster state. When a new node joins, the cluster needs to rebalance. What happens if you scale to zero? Better not to ask.
We’re going to take my pet-project-gone-big search engine, Nixiesearch, and squeeze it into an AWS Lambda:
It’s also JVM-based, since it uses Apache Lucene for all search-related internals (like OpenSearch, Elasticsearch and Solr). We’re going to build a native x86_64 binary with GraalVM native-image: this should reduce the Docker image size (no JVM!) and eliminate JVM warmup entirely.
Can we store an index outside the search engine? Can we achieve reasonable latency with AWS S3? What if we host the index on AWS EFS instead?
You may wonder why we even need these weird AWS Lambda tricks when we could just keep the status quo with warm standby nodes. No warm-up, no cold starts, no obscure limits - it’s the good old traditional approach.
Because doing weird stupid things is the way I learn. So the challenge is to have a proof-of-concept which:
Has minimal startup time, so scaling up and down won’t be an issue. With sub-second startup you can even scale to zero when there’s no traffic!
Has reasonably low search latency. Yes, modern search engines compete on 3 vs 5 milliseconds, but in practice you also have a chonky embedding model, which adds an extra 300-400ms of latency on top.
Java AOT compilation and remote index storage - sounds easy.
GraalVM native-image is an Ahead-Of-Time compiler: it takes a JVM application JAR file and builds an almost zero-dependency native binary that depends only on glibc. It sounds fancy in theory, but in practice not all applications can be statically compiled that easily.
If you (or your transitive dependencies) use reflection to do something dynamic like enumerating class fields or loading classes at runtime, then GraalVM requires you to create a reflection config file (reflect-config.json) listing all the nasty things you do.
Building such a file manually for all transitive dependencies is practically impossible. But in modern GraalVM versions you can attach a tracing agent to your application, which records all the nasty things happening throughout the entire codebase.
java -agentlib:native-image-agent -jar nixiesearch.jar standalone

The tracing agent needs to observe all execution paths in your application, and instead of sending all possible search requests manually, I just attached it to the test suite — and got a monumental reachability-metadata.json file that hopefully covers all reflection usage in transitive dependencies.
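For reference, the generated metadata is plain JSON describing what must remain reflectively accessible. In the older per-file format (reflect-config.json), a single hand-written entry looks roughly like this - the class name below is made up for illustration:

[
  {
    "name": "ai.nixiesearch.config.SomeConfig",
    "allDeclaredConstructors": true,
    "allDeclaredMethods": true,
    "allDeclaredFields": true
  }
]

The tracing agent simply generates thousands of entries like this one, covering every class your code (and your dependencies) touches reflectively.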
It’s time to build our first native binary!
It takes around two minutes on my 16-core AMD Zen 5 CPU to compile the binary, which is impressively slow. With an ubuntu:24.04 minimal base image we get 338MB — a nice reduction from 760MB.
But can we get rid of the 90MB base Ubuntu image and go with Alpine? Yes, but it requires building Nixiesearch with musl instead of glibc. Luckily, as the build is dockerized, you can replace ghcr.io/graalvm/native-image-community:25 with ghcr.io/graalvm/native-image-community:25-muslib and get a nice musl-based build environment.

native-image --libc=musl -jar nixiesearch.jar <options>

The musl-based binary ends up at the same 244MB, but now it can run natively on Alpine without the gcompat glibc layer. But can we go even further and build the app completely statically, without linking libc at all? Once you start trimming dependencies, it’s hard to stop.

native-image --static --libc=musl -jar nixiesearch.jar <options>

Now we’re at 248MB, but we no longer need a base system at all - which gives us the most perfect Docker image ever:
FROM scratch
COPY --from=builder /build/nixiesearch /nixiesearch
ENTRYPOINT ["/nixiesearch"]

We could go even further and enable the -Os option to optimize for size, but I’m afraid it might impact request processing performance.
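For completeness, the builder stage that the COPY --from=builder line above refers to could look roughly like this. This is a sketch, not the exact Dockerfile from this post: the JAR location and the assumption that native-image names the output binary after the JAR are mine.

FROM ghcr.io/graalvm/native-image-community:25-muslib AS builder
WORKDIR /build
COPY nixiesearch.jar .
# produces /build/nixiesearch, which the scratch stage above copies out
RUN native-image --static --libc=musl -jar nixiesearch.jar <options>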
GraalVM also reports that almost 30% of the binary is taken up by the AWS Java SDK with its massive dependency footprint, so the next step is switching to the raw S3 REST API instead of the nice SDK helpers.
I originally thought that the AWS Lambda runtime for Docker images was just a simple “docker pull” and “docker run” on each request, but it’s slightly more complex:
On initial code deployment, the container gets fetched from ECR, unpacked, and cached in every AWS availability zone where your Lambda is scheduled to run.
When the first request arrives, the container enters the Init stage: it gets executed on a minimal Firecracker VM, and the Lambda runtime waits until the started container polls the runtime API for the actual request to process. This stage is billed, so we need to be as fast as possible here.
Request stage: the container polls the runtime API for a request and produces the response. This is where the actual work happens, and this is the part you’re also billed for (a minimal sketch of this polling loop follows after the list).
And here comes the MAGICAL Freeze stage: after the Lambda API receives a response and sees that the app starts polling for the next request, the VM gets frozen. It’s still a VM, but with zero CPU and its RAM offloaded to disk. You pay zero for a container in the Freeze stage.
When a new request arrives, the container VM enters the Thaw stage: it gets unfrozen, processes the next request, and then gets frozen again.
When no requests arrive for a longer period of time (in practice 5-15 minutes), the Lambda container gets destroyed.
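To make the Init/Request split concrete, here is a minimal sketch of the polling loop a custom runtime runs against the Lambda Runtime API. The endpoints are the documented Runtime API ones; the handler itself is a stub, not Nixiesearch’s actual request handling.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RuntimeLoop {
    public static void main(String[] args) throws Exception {
        String api = System.getenv("AWS_LAMBDA_RUNTIME_API"); // host:port injected by Lambda
        HttpClient http = HttpClient.newHttpClient();
        while (true) {
            // Block until the next event arrives; Lambda freezes the VM while we wait here.
            HttpResponse<String> event = http.send(
                HttpRequest.newBuilder(
                    URI.create("http://" + api + "/2018-06-01/runtime/invocation/next")).GET().build(),
                HttpResponse.BodyHandlers.ofString());
            String requestId = event.headers()
                .firstValue("Lambda-Runtime-Aws-Request-Id").orElseThrow();

            String response = handle(event.body()); // real search handling would go here

            // Post the result back; then we loop around and can be frozen again.
            http.send(
                HttpRequest.newBuilder(
                    URI.create("http://" + api + "/2018-06-01/runtime/invocation/" + requestId + "/response"))
                    .POST(HttpRequest.BodyPublishers.ofString(response)).build(),
                HttpResponse.BodyHandlers.discarding());
        }
    }

    static String handle(String event) {
        return "{\"took\": 0, \"hits\": []}"; // stub response
    }
}

The interesting part is the blocking GET on /invocation/next: that is exactly the point where Lambda freezes the VM between requests.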
Cold request (full init):
Duration: 84.15 ms
Billed Duration: 535 ms
Memory Size: 3008 MB
Max Memory Used: 133 MB
Init Duration: 449.85 ms
Warm request (after freeze-thaw cycle):
Duration: 2.62 ms
Billed Duration: 3 ms
Memory Size: 3008 MB
Max Memory Used: 194 MB

That’s nice: we were able to spin up a cold container in only 449ms, and warm no-op requests take just 3ms!
But note that AWS Lambda compute is very limited:
RAM: 128MB default with up to 3008MB max. You can submit a support ticket to get 10GB RAM, but I was too lazy to argue with AWS support.
vCPU: 1 vCPU, and if you go beyond 1536MB of RAM, you get a second one. Not much.
Disk: up to 10GB of instance storage.
And last but not least, S3 read throughput depends on RAM size:
The S3 throughput numbers align well with the known S3 throughput limits for AWS EC2 instances. Given that we have at most 2 vCPUs, 100MB/s is the best you can expect - which is not great, considering that to run a search we need to access the index.
Nixiesearch was always built with S3 storage in mind. But like OpenSearch (and perhaps Elasticsearch Serverless), it uses S3 only for simple segment replication:
As Lambdas are ephemeral, we somehow need to deliver the index to the search engine:
We can directly wrap all Lucene index access into S3 GetObject calls. This might work, but HNSW vector search is an iterative graph traversal, which will ruin the latency: slow (and expensive, at ~500 S3 reads per request) search, but no init time. It does sound serverless, though! A minimal sketch of this approach follows after the list.
We can do good old segment replication from S3 to the Lambda’s ephemeral storage. Then, for a 2GB index and an expected 100MB/s of throughput, our init time is going to be 2GB / 0.1GB/s = 20 seconds. But after that the search speed is going to be perfect, with no extra costs.
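To make the first option concrete, the primitive underneath it is one ranged GetObject call per index read. A minimal sketch with the AWS SDK v2 could look like the snippet below; the bucket/key layout is made up, and in the real engine this would sit behind a Lucene Directory implementation rather than a standalone helper.

import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;

public class RangedS3Read {
    private final S3Client s3 = S3Client.create();

    // Read `length` bytes at `offset` from a Lucene segment file stored in S3:
    // one GetObject request with an HTTP Range header per index read.
    byte[] read(String bucket, String key, long offset, int length) {
        GetObjectRequest request = GetObjectRequest.builder()
                .bucket(bucket)
                .key(key)
                .range("bytes=" + offset + "-" + (offset + length - 1))
                .build();
        ResponseBytes<GetObjectResponse> bytes = s3.getObjectAsBytes(request);
        return bytes.asByteArray();
    }
}

Multiply a call like this by the ~500 reads a single HNSW query needs and you get both the latency and the S3 request bill discussed below.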
Napkin storage cost math for 1M requests:
Direct S3 search with no caching: 500 S3 reads per request * 1M requests * $0.0004 per 1000 reads = $200/month. Yes, running an ES cluster is more expensive, but not by much.
Segment replication: considering that 1M requests/month is around 0.5 RPS, your Lambda function is going to be always warm and there are no repeated inits - you fetch the index once and only refresh changed segments. The cost is then going to be around $0.
I don’t like the idea of an init taking half a minute - then we’re not much different from good old Elastic. But what if we host the index on NFS-style storage (e.g. AWS EFS) instead?
I took the FineWiki “simple” part with 300k documents, embedded it with the OpenAI text-embedding-3-small model, and deployed it on an AWS Lambda with EFS storage attached.
Nixiesearch can do ONNX embedding inference for sentence-transformers models locally, but considering the average embedding model size (1-2GB) and the amount of RAM/vCPU we have, it might not be a good idea here.
To have a fully functional tech demo, I vibe-coded (oh no) a simple web front-end on GitHub Pages: https://nixiesearch.github.io/lambda-demo-ui/
There’s a nice server+client-side latency breakdown if you want to see where the actual time is spent. And yes, a 1.5s first-request latency is somewhat slower than I initially expected.
In simple words, random reads from NFS-style storage are just slow:
As my test Lambda runs in AWS us-east-1 and I’m physically in the EU, latency can be improved by replicating the Lambda to more regions. Embedding latency is the AI toll we have to pay anyway. But why are the search and fetch stages so slow?
Because both HNSW search and document fetch are a bunch of iterative random reads, and with per-read AWS EFS latency at around 1ms, that’s what we get.
One of the reviewers of this post suggested baking the whole index directly into the Docker image as an alternative. Yes, you can no longer easily update the index in real time - you need to rebuild the Docker image from scratch every time it changes - but it may work in cases where you can tolerate some indexing lag. The results we got were even more surprising:
I thought the 1.5s request latency with AWS EFS was slow, but random reads across an index baked directly into the Docker image were even slower. Why? Because Lambdas don’t run Docker images as-is: they unpack them and cache them in an AZ-local S3-backed block cache:
In other words, baking the index into a Docker image is just another way of storing your index in an AZ-local S3 Express bucket (and mounting it with s3-fuse or something).
Realistically, 1.5s (or even 7s) per cold search might sound horrible, but things get fast pretty quickly as the cold data eventually lands in the filesystem cache:
The image above is for the Docker-bundled index, but for the EFS/NFS-attached one the picture is quite similar.
We get to a reasonable 120ms search latency and almost-instant field fetches around request #10. But that’s still far from the idealistic vision of true serverless search, where you don’t need an idling warm node sitting around to serve your request.
Folks like turbopuffer, topk and LanceDB advocate the idea that to run on top of S3 you need a different, non-HNSW data structure like IVF, which is more tolerant of high access latency.
Instead of navigating the HNSW graph and iteratively doing a ton of random reads, you can cluster documents together and only perform batch reads of the clusters lying near your query:
A much simpler search implementation without any iterative random-read patterns: just read the complete set of a cluster’s documents in a single S3 GetObject request.
Clusters can be updated in-place by just appending new documents.
The elephant in the room: IVF has much, much worse recall, especially for filtered search. So your search can be either fast or precise - you have to choose in advance.
Yes, I could just hack IVF support into Nixiesearch (as Lucene already supports flat indexes), but there’s a better way. S3 has almost unlimited concurrency: can we untangle the reads from being iterative into being batched and concurrent?
Traversing an HNSW graph for k-NN search is iterative:
You land on an entrypoint node, which has M connections to other neighbor nodes.
For each connection, you load its embedding (one S3 GetObject request) and compute a cosine distance.
After all M neighbor distances are evaluated, you jump to the next best node.
But you don’t need to be iterative while loading neighbor embeddings: Lucene’s HnswGraphSearcher is already quite close to being bent in the direction of concurrent, parallel embedding loads:
So my personal plan for the Christmas holidays is to add a custom Scorer implementation which schedules N parallel S3 GetObject requests to fetch N embeddings on each node visit (a rough sketch of the idea follows after the list):
HNSW graph usually has only ~3 layers, so you need to evaluate 1 entrypoint + 3 layers = 4 nodes, doing 4 batches of ~32-64 S3 requests.
Each batch of S3 GetObject requests takes ~15ms, so the expected baseline latency for the complete search stage is ~60ms.
To fetch N documents, you also need to prefetch N chunks of stored fields, which is also a perfectly concurrent operation.
A theoretical ~100ms baseline latency for HNSW running on top of S3 - sounds nice, huh?
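Here is a rough sketch of that batched-load idea, stripped of all Lucene plumbing: fire one async GetObject per neighbor, wait for the whole batch, then score. The class and method names, the byte layout, and the fact that offsets are passed in from outside are all assumptions for illustration, not Lucene’s actual Scorer API.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import software.amazon.awssdk.core.async.AsyncResponseTransformer;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

public class BatchedNeighborLoader {
    private final S3AsyncClient s3 = S3AsyncClient.create();

    // Load the embeddings of all neighbors of the current HNSW node in one concurrent batch:
    // one ranged GetObject per neighbor, all in flight at the same time.
    // `ranges` holds {offset, length} pairs pointing into the vector data file in S3.
    float[][] loadNeighborEmbeddings(String bucket, String key, List<long[]> ranges, int dim) {
        List<CompletableFuture<byte[]>> futures = ranges.stream()
                .map(r -> s3.getObject(
                                GetObjectRequest.builder()
                                        .bucket(bucket)
                                        .key(key)
                                        .range("bytes=" + r[0] + "-" + (r[0] + r[1] - 1))
                                        .build(),
                                AsyncResponseTransformer.toBytes())
                        .thenApply(resp -> resp.asByteArray()))
                .toList();

        // One ~15ms round-trip for the whole batch instead of ~15ms per neighbor.
        CompletableFuture.allOf(futures.toArray(CompletableFuture[]::new)).join();

        float[][] embeddings = new float[futures.size()][];
        for (int i = 0; i < futures.size(); i++) {
            embeddings[i] = decode(futures.get(i).join(), dim);
        }
        return embeddings;
    }

    // Assumes raw little-endian float32 vectors; the real on-disk format depends on the Lucene codec.
    private float[] decode(byte[] raw, int dim) {
        float[] out = new float[dim];
        ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(out);
        return out;
    }
}

Whether this can be wired cleanly into HnswGraphSearcher without forking Lucene is exactly the open question for the holidays.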
Usually at the end of an article a well-educated author puts a summary of what you might have learned while reading it, so here we are:
AWS Lambdas are not your friendly Docker containers: the storage system is completely different, and the runtime semantics, with constant freeze-thaw cycles, were what really surprised me.
Running HNSW search on top of network-attached storage is painfully slow right now - iterative random reads, you know. But there’s light at the end of the tunnel, and you don’t need to sacrifice recall for cheap and fast search.
If you’re brave (and stupid) enough (like me) to spend a weekend putting a search engine into a Lambda, you can do it.
If you haven’t yet tried Nixiesearch, you should: https://github.com/nixiesearch/nixiesearch - a recent 0.8.0 version was used for all the experiments in this post.