Modern serverless search is just an accounting trick. There’s a hidden pool of nodes behind the API, and the final bill is split evenly among all clients. There’s always a standby warm node waiting for your request - you just don’t see it.
And you can’t get rid of it because scaling search engines is HARD (or at least search vendors want you to think so). You can’t just put one into a Lambda function. But what if you actually can?
As someone who has hated Elasticsearch since version 0.89 (but still uses it), I see three major blockers to running it in a truly serverless mode:
Container size: It’s around 700MB in version 9.x. The bigger the container, the slower the node startup, since it has to be pulled from somewhere.
Container startup time: For ES 9.x, the startup time alone is about 40 seconds. And the time-to-performance is much worse, since a cold JVM is painfully slow until it sees some traffic.
Index and state: Search engines like Elastic and Qdrant behave like databases, with each node hosting a fraction of the total cluster state. When a new node joins, the cluster needs to rebalance. What happens if you scale to zero? Better not to ask.
We’re going to take my pet-project-gone-big search engine, Nixiesearch, and squeeze it into an AWS Lambda:
It’s also JVM-based, since it uses Apache Lucene for all search-related internals (like OpenSearch, Elasticsearch and Solr). We’re going to build a native x86_64 binary with GraalVM native-image: this should reduce the Docker image size (no JVM!) and eliminate JVM warmup entirely.
Can we store an index outside the search engine? Can we achieve reasonable latency with AWS S3? What if we host the index on AWS EFS instead?
You may wonder why we even need these weird AWS Lambda tricks when we could just keep the status quo with warm standby nodes. No warm-up, no cold starts, no obscure limits - it’s the good old traditional approach.
Because doing weird stupid things is the way I learn. So the challenge is to have a proof-of-concept which:
Has minimal startup time, so scaling up and down won’t be an issue. With sub-second startup you can even scale to zero when there’s no traffic!
Has reasonably low search latency. Yes, modern search engines compete on 3 vs 5 milliseconds, but in practice you also have a chonky embedding model, which adds an extra 300-400ms of latency on top.
Java AOT compilation and remote index storage - sounds easy.
GraalVM native-image is an Ahead-Of-Time compiler: it takes a JVM application JAR file and builds an almost zero-dependency native binary that depends only on glibc. It sounds fancy in theory, but in practice not all applications can be statically compiled that easily.
If you (or your transitive dependencies) use reflection to do something dynamic like enumerating class fields or loading classes at runtime, then GraalVM requires you to create a reflection config file (reflect-config.json) listing all the nasty things you do.
Building such a file manually for all transitive dependencies is practically impossible. But in modern GraalVM versions you can attach a tracing agent to your application, which records all the nasty things happening throughout the entire codebase.
java -agentlib:native-image-agent -jar nixiesearch.jar standalone

The tracing agent needs to observe all execution paths in your application, and instead of sending all possible search requests manually, I just attached it to the test suite — and got a monumental reachability-metadata.json file that hopefully covers all reflection usage in transitive dependencies.
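For reference, the generated metadata is plain JSON describing what must remain reflectively accessible. In the older per-file format (reflect-config.json), a single hand-written entry looks roughly like this - the class name below is made up for illustration:

[
  {
    "name": "ai.nixiesearch.config.SomeConfig",
    "allDeclaredConstructors": true,
    "allDeclaredMethods": true,
    "allDeclaredFields": true
  }
]

The tracing agent simply generates thousands of entries like this one, covering every class your code (and your dependencies) touches reflectively.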
It’s time to build our first native binary!
It takes around two minutes on my 16-core AMD Zen 5 CPU to compile the binary, which is impressively slow. With an ubuntu:24.04 minimal base image we get 338MB — a nice reduction from 760MB.
But can we get rid of the 90MB base Ubuntu image and go with Alpine? Yes, but it requires building Nixiesearch with musl instead of glibc. Luckily, as the build is dockerized, you can replace ghcr.io/graalvm/native-image-community:25 with ghcr.io/graalvm/native-image-community:25-muslib and get a nice musl-based build environment.

native-image --libc=musl -jar nixiesearch.jar <options>

The musl-based binary ends up at the same 244MB, but now it can run natively on Alpine without the gcompat glibc layer. But can we go even further and build the app completely statically, without linking libc at all? Once you start trimming dependencies, it’s hard to stop.

native-image --static --libc=musl -jar nixiesearch.jar <options>

Now we’re at 248MB, but we no longer need a base system at all - which gives us the most perfect Docker image ever:
FROM scratch
COPY --from=builder /build/nixiesearch /nixiesearch
ENTRYPOINT ["/nixiesearch"]

We could go even further and enable the -Os option to optimize for size, but I’m afraid it might impact request processing performance.
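For completeness, the builder stage that the COPY --from=builder line above refers to could look roughly like this. This is a sketch, not the exact Dockerfile from this post: the JAR location and the assumption that native-image names the output binary after the JAR are mine.

FROM ghcr.io/graalvm/native-image-community:25-muslib AS builder
WORKDIR /build
COPY nixiesearch.jar .
# produces /build/nixiesearch, which the scratch stage above copies out
RUN native-image --static --libc=musl -jar nixiesearch.jar <options>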
GraalVM also reports that almost 30% of the binary is taken up by the AWS Java SDK with its massive dependency footprint, so the next step is switching to the raw S3 REST API instead of the nice SDK helpers.
I originally thought that the AWS Lambda runtime for Docker images was just a simple “docker pull” and “docker run” on each request, but it’s slightly more complex:
On initial code deployment, the container gets fetched from ECR, unpacked, and cached in every AWS availability zone where your Lambda is scheduled to run.
When the first request arrives, the container enters the Init stage: it gets executed on a minimal Firecracker VM, and the Lambda runtime waits until the started container polls the runtime API for the actual request to process. This stage is billed, so we need to be as fast as possible here.
Request stage: the container polls the runtime API for a request and produces the response. This is where the actual work happens, and this is the part you’re also billed for (a minimal sketch of this polling loop follows after the list).
And here comes the MAGICAL Freeze stage: after the Lambda API receives a response and sees that the app starts polling for the next request, the VM gets frozen. It’s still a VM, but with zero CPU and its RAM offloaded to disk. You pay zero for a container in the Freeze stage.
When a new request arrives, the container VM enters the Thaw stage: it gets unfrozen, processes the next request, and then gets frozen again.
When no requests arrive for a longer period of time (in practice 5-15 minutes), the Lambda container gets destroyed.
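To make the Init/Request split concrete, here is a minimal sketch of the polling loop a custom runtime runs against the Lambda Runtime API. The endpoints are the documented Runtime API ones; the handler itself is a stub, not Nixiesearch’s actual request handling.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RuntimeLoop {
    public static void main(String[] args) throws Exception {
        String api = System.getenv("AWS_LAMBDA_RUNTIME_API"); // host:port injected by Lambda
        HttpClient http = HttpClient.newHttpClient();
        while (true) {
            // Block until the next event arrives; Lambda freezes the VM while we wait here.
            HttpResponse<String> event = http.send(
                HttpRequest.newBuilder(
                    URI.create("http://" + api + "/2018-06-01/runtime/invocation/next")).GET().build(),
                HttpResponse.BodyHandlers.ofString());
            String requestId = event.headers()
                .firstValue("Lambda-Runtime-Aws-Request-Id").orElseThrow();

            String response = handle(event.body()); // real search handling would go here

            // Post the result back; then we loop around and can be frozen again.
            http.send(
                HttpRequest.newBuilder(
                    URI.create("http://" + api + "/2018-06-01/runtime/invocation/" + requestId + "/response"))
                    .POST(HttpRequest.BodyPublishers.ofString(response)).build(),
                HttpResponse.BodyHandlers.discarding());
        }
    }

    static String handle(String event) {
        return "{\"took\": 0, \"hits\": []}"; // stub response
    }
}

The interesting part is the blocking GET on /invocation/next: that is exactly the point where Lambda freezes the VM between requests.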
Cold request (full init):
Duration: 84.15 ms
Billed Duration: 535 ms
Memory Size: 3008 MB
Max Memory Used: 133 MB
Init Duration: 449.85 ms
Warm request (after freeze-thaw cycle):
Duration: 2.62 ms
Billed Duration: 3 ms
Memory Size: 3008 MB
Max Memory Used: 194 MB

That’s nice: we were able to spin up a cold container in only 449ms, and warm no-op requests take just 3ms!
But note that AWS Lambda compute is very limited:
RAM: 128MB default with up to 3008MB max. You can submit a support ticket to get 10GB RAM, but I was too lazy to argue with AWS support.
vCPU: 1 vCPU, and if you go beyond 1536MB of RAM, you get a second one. Not much.
Disk: up to 10GB of instance storage.
And last but not least, S3 read throughput depends on RAM size:
The S3 throughput numbers align well with the known S3 throughput limits for AWS EC2 instances. Given that we have at most 2 vCPUs, 100MB/s is the best you can expect - which is not great, considering that to run a search we need to access the index.
Nixiesearch was always built with S3 storage in mind. But like OpenSearch (and perhaps Elasticsearch Serverless), it uses S3 only for simple segment replication:
As Lambdas are ephemeral, we somehow need to deliver the index to the search engine:
We can directly wrap all Lucene index access into S3 GetObject calls. This might work, but HNSW vector search is an iterative graph traversal, which will ruin the latency: slow (and expensive, at ~500 S3 reads per request) search, but no init time. It does sound serverless, though! A minimal sketch of this approach follows after the list.
We can do good old segment replication from S3 to the Lambda’s ephemeral storage. Then, for a 2GB index and an expected 100MB/s of throughput, our init time is going to be 2GB / 0.1GB/s = 20 seconds. But after that the search speed is going to be perfect, with no extra costs.
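To make the first option concrete, the primitive underneath it is one ranged GetObject call per index read. A minimal sketch with the AWS SDK v2 could look like the snippet below; the bucket/key layout is made up, and in the real engine this would sit behind a Lucene Directory implementation rather than a standalone helper.

import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;

public class RangedS3Read {
    private final S3Client s3 = S3Client.create();

    // Read `length` bytes at `offset` from a Lucene segment file stored in S3:
    // one GetObject request with an HTTP Range header per index read.
    byte[] read(String bucket, String key, long offset, int length) {
        GetObjectRequest request = GetObjectRequest.builder()
                .bucket(bucket)
                .key(key)
                .range("bytes=" + offset + "-" + (offset + length - 1))
                .build();
        ResponseBytes<GetObjectResponse> bytes = s3.getObjectAsBytes(request);
        return bytes.asByteArray();
    }
}

Multiply a call like this by the ~500 reads a single HNSW query needs and you get both the latency and the S3 request bill discussed below.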
Napkin storage cost math for 1M requests:
Direct S3 search with no caching: 500 S3 reads per request * 1M requests * $0.0004 per 1000 reads = $200/month. Yes, running an ES cluster is more expensive, but not by much.
Segment replication: considering that 1M requests/month is around 0.5 RPS, your Lambda function is going to be always warm and there are no repeated inits - you fetch the index once and only refresh changed segments. The cost is then going to be around $0.
I don’t like the idea of an init taking half a minute - then we’re not much different from good old Elastic. But what if we host the index on NFS-style storage (e.g. AWS EFS) instead?
I took the FineWiki “simple” part with 300k documents, embedded it with the OpenAI text-embedding-3-small model, and deployed it on an AWS Lambda with EFS storage attached.
Nixiesearch can do ONNX embedding inference for sentence-transformers models locally, but considering the average embedding model size (1-2GB) and the amount of RAM/vCPU we have, it might not be a good idea here.
To have a fully functional tech demo, I vibe-coded (oh no) a simple web front-end on GitHub Pages: https://nixiesearch.github.io/lambda-demo-ui/
There’s a nice server+client-side latency breakdown if you want to see where the actual time is spent. And yes, a 1.5s first-request latency is somewhat slower than I initially expected.
In simple words, random reads from NFS-style storage are just slow:
As my test Lambda runs in AWS us-east-1 and I’m physically in the EU, latency can be improved by replicating the Lambda to more regions. Embedding latency is the AI toll we have to pay anyway. But why are the search and fetch stages so slow?
Because both HNSW search and document fetch are a bunch of iterative random reads, and with per-read AWS EFS latency at around 1ms, that’s what we get.
One of the reviewers of this post suggested baking the whole index directly into the Docker image as an alternative. Yes, you can no longer easily update the index in real time - you need to rebuild the Docker image from scratch every time it changes - but it may work in cases where you can tolerate some indexing lag. The results we got were even more surprising:
I thought the 1.5s request latency with AWS EFS was slow, but random reads across an index baked directly into the Docker image were even slower. Why? Because Lambdas don’t run Docker images as-is: they unpack them and cache them in an AZ-local S3-backed block cache:
In other words, baking the index into a Docker image is just another way of storing your index in an AZ-local S3 Express bucket (and mounting it with s3-fuse or something).
Realistically, 1.5s (or even 7s) per cold search might sound horrible, but things get fast pretty quickly as the cold data eventually lands in the filesystem cache:
The image above is for the Docker-bundled index, but for the EFS/NFS-attached one the picture is quite similar.
We get to a reasonable 120ms search latency and almost-instant field fetches around request #10. But that’s still far from the idealistic vision of true serverless search, where you don’t need an idling warm node sitting around to serve your request.
Folks like turbopuffer, topk and LanceDB advocate the idea that to run on top of S3 you need a different, non-HNSW data structure like IVF, which is more tolerant of high access latency.
Instead of navigating the HNSW graph and iteratively doing a ton of random reads, you can cluster documents together and only perform batch reads of the clusters lying near your query:
A much simpler search implementation without any iterative random-read patterns: just read the complete set of a cluster’s documents in a single S3 GetObject request.
Clusters can be updated in-place by just appending new documents.
The elephant in the room: IVF has much, much worse recall, especially for filtered search. So your search can be either fast or precise - you have to choose in advance.
Yes, I could just hack IVF support into Nixiesearch (as Lucene already supports flat indexes), but there’s a better way. S3 has almost unlimited concurrency: can we untangle the reads from being iterative into being batched and concurrent?
Traversing an HNSW graph for k-NN search is iterative:
You land on an entrypoint node, which has M connections to other neighbor nodes.
For each connection, you load its embedding (one S3 GetObject request) and compute a cosine distance.
After all M neighbor distances are evaluated, you jump to the next best node.
But you don’t need to be iterative while loading neighbor embeddings: Lucene’s HnswGraphSearcher is already quite close to being bent in the direction of concurrent, parallel embedding loads:
So my personal plan for the Christmas holidays is to add a custom Scorer implementation which schedules N parallel S3 GetObject requests to fetch N embeddings on each node visit (a rough sketch of the idea follows after the list):
HNSW graph usually has only ~3 layers, so you need to evaluate 1 entrypoint + 3 layers = 4 nodes, doing 4 batches of ~32-64 S3 requests.
Each batch of S3 GetObject requests takes ~15ms, so the expected baseline latency for the complete search stage is ~60ms.
To fetch N documents, you also need to prefetch N chunks of stored fields, which is also a perfectly concurrent operation.
A theoretical ~100ms baseline latency for HNSW running on top of S3 - sounds nice, huh?
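Here is a rough sketch of that batched-load idea, stripped of all Lucene plumbing: fire one async GetObject per neighbor, wait for the whole batch, then score. The class and method names, the byte layout, and the fact that offsets are passed in from outside are all assumptions for illustration, not Lucene’s actual Scorer API.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import software.amazon.awssdk.core.async.AsyncResponseTransformer;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

public class BatchedNeighborLoader {
    private final S3AsyncClient s3 = S3AsyncClient.create();

    // Load the embeddings of all neighbors of the current HNSW node in one concurrent batch:
    // one ranged GetObject per neighbor, all in flight at the same time.
    // `ranges` holds {offset, length} pairs pointing into the vector data file in S3.
    float[][] loadNeighborEmbeddings(String bucket, String key, List<long[]> ranges, int dim) {
        List<CompletableFuture<byte[]>> futures = ranges.stream()
                .map(r -> s3.getObject(
                                GetObjectRequest.builder()
                                        .bucket(bucket)
                                        .key(key)
                                        .range("bytes=" + r[0] + "-" + (r[0] + r[1] - 1))
                                        .build(),
                                AsyncResponseTransformer.toBytes())
                        .thenApply(resp -> resp.asByteArray()))
                .toList();

        // One ~15ms round-trip for the whole batch instead of ~15ms per neighbor.
        CompletableFuture.allOf(futures.toArray(CompletableFuture[]::new)).join();

        float[][] embeddings = new float[futures.size()][];
        for (int i = 0; i < futures.size(); i++) {
            embeddings[i] = decode(futures.get(i).join(), dim);
        }
        return embeddings;
    }

    // Assumes raw little-endian float32 vectors; the real on-disk format depends on the Lucene codec.
    private float[] decode(byte[] raw, int dim) {
        float[] out = new float[dim];
        ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(out);
        return out;
    }
}

Whether this can be wired cleanly into HnswGraphSearcher without forking Lucene is exactly the open question for the holidays.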
Usually at the end of an article a well-educated author puts a summary of what you might have learned while reading it, so here we are:
AWS Lambdas are not your friendly Docker containers: the storage system is completely different, and the runtime semantics, with constant freeze-thaw cycles, were what really surprised me.
Running HNSW search on top of network-attached storage is painfully slow right now - iterative random reads, you know. But there’s light at the end of the tunnel, and you don’t need to sacrifice recall for cheap and fast search.
If you’re brave (and stupid) enough (like me) to spend a weekend putting a search engine into a Lambda, you can do it.
If you haven’t yet tried Nixiesearch, you should: https://github.com/nixiesearch/nixiesearch - a recent 0.8.0 version was used for all the experiments in this post.