Could you give an example of what you mean by a _fluent "py-data"-first way_? Do you mean a fluent API like `data.transform().filter()...`, that sort of thing?
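For what it's worth, a fluent API usually just means each method returns the object (or a new one), so calls chain left to right. A minimal sketch of the pattern the question alludes to, with a hypothetical `Data` wrapper:

```python
# Minimal fluent-API sketch: transform() and filter() each return a new
# Data, so calls chain like data.transform(...).filter(...).
class Data:
    def __init__(self, rows):
        self.rows = list(rows)

    def transform(self, fn):
        # Apply fn to every row, returning a new Data for further chaining.
        return Data(fn(r) for r in self.rows)

    def filter(self, pred):
        # Keep only rows matching pred, again returning a new Data.
        return Data(r for r in self.rows if pred(r))

result = Data([1, 2, 3, 4]).transform(lambda x: x * 10).filter(lambda x: x > 15).rows
print(result)  # [20, 30, 40]
```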
Unrelated to the core topic, I really enjoy the aesthetic of their website. Another similar one is Fixie.ai's (which, interestingly, is also one of their customers).
Have you tried it? pgvector performance falls off a cliff once the data can't be cached in RAM. Vector search isn't like "normal" workloads that follow a nice Pareto distribution.
I did say the index, not the embeddings themselves. The index is a more compact representation of your embeddings collection, and that's what you need in memory. One indexing approach is to calculate centroids of your embeddings.

You have multiple parameters to tweak that affect retrieval performance as well as the memory footprint of your indexes. Here's a rundown on that: https://tembo.io/blog/vector-indexes-in-pgvector
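To make the centroid idea concrete, here's a toy IVF-style sketch in numpy (not pgvector's actual implementation): cluster the embeddings, keep the small centroid table as the "index", and at query time scan only the partition whose centroid is nearest the query. The trade-off between the number of centroids and how many partitions you probe is exactly the kind of parameter tuning mentioned above.

```python
# Toy IVF-style index: k centroids partition n embeddings; a query only
# scans the partition of its nearest centroid. Illustrative, not pgvector.
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 10_000, 32, 16
emb = rng.normal(size=(n, d)).astype(np.float32)

def assign_to(cent):
    # Nearest-centroid assignment for every embedding.
    return np.argmin(((emb[:, None, :] - cent[None]) ** 2).sum(-1), axis=1)

# Crude k-means (a few Lloyd iterations) to compute the centroids.
centroids = emb[rng.choice(n, k, replace=False)]
for _ in range(5):
    assign = assign_to(centroids)
    for j in range(k):
        members = emb[assign == j]
        if len(members):
            centroids[j] = members.mean(axis=0)
assign = assign_to(centroids)  # final partition map stored with the index

# Query time: pick the nearest centroid, then scan only that partition.
q = rng.normal(size=d).astype(np.float32)
probe = np.argmin(((centroids - q) ** 2).sum(-1))
cand = np.where(assign == probe)[0]
if cand.size == 0:
    cand = np.arange(n)  # degenerate empty partition: fall back to full scan
best = cand[np.argmin(((emb[cand] - q) ** 2).sum(-1))]
print(best, len(cand))
```

The memory point from the comment shows up directly: the index is `k * d` floats plus an integer per row, versus `n * d` floats for the raw embeddings.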
Thanks for the insights. Precomputing is not really suitable for this, and the thing is, I'm mostly using it as a lookup table for key/value queries. I know DuckDB is mostly suited to aggregation, but the HTTP range-request support was too attractive to pass on.

I did some tests, querying `where col = 'x'`. If the database was a remote DuckDB-native db, it would issue a bunch of HTTP range requests, and a second identical call would not trigger any new requests. Also, querying for `col = 'foo'` and then `col = 'foob'` would yield fewer and fewer requests, as I assume it already has the necessary data on hand. Doing it on Parquet, with a single long-running DuckDB CLI instance, I get the same requests over and over again. One difference, though: I'd need to attach the DuckDB database under a schema name, but would query the Parquet file using the `select from 'http://.../x.parquet'` syntax. Maybe this causes it to be ephemeral for each query. I'll see if the attach syntax also works for Parquet.
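The behavior described above hinges on HTTP range requests: the engine asks the server for specific byte spans (a footer, a column chunk) instead of downloading the whole file. A stdlib-only sketch of that mechanism, with no DuckDB involved, using a tiny local server that honors the `Range` header:

```python
# Minimal demo of HTTP range requests, the mechanism DuckDB relies on to
# read remote files piecewise: the client fetches only the bytes it needs.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

DATA = bytes(range(256)) * 4  # stand-in for a remote Parquet file's bytes

class RangeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        rng = self.headers.get("Range")  # e.g. "Range: bytes=10-19"
        if rng and rng.startswith("bytes="):
            start, end = (int(x) for x in rng[6:].split("-"))
            chunk = DATA[start:end + 1]
            self.send_response(206)  # 206 Partial Content
            self.send_header("Content-Range", f"bytes {start}-{end}/{len(DATA)}")
        else:
            chunk = DATA
            self.send_response(200)
        self.send_header("Content-Length", str(len(chunk)))
        self.end_headers()
        self.wfile.write(chunk)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), RangeHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# Ask for bytes 10..19 only, instead of the full object.
req = Request(f"http://127.0.0.1:{port}/x.parquet",
              headers={"Range": "bytes=10-19"})
body = urlopen(req).read()
print(body == DATA[10:20], len(body))  # True 10
server.shutdown()
```

Whether repeated queries reuse already-fetched ranges then comes down to whether the client keeps a cache across queries, which matches the attached-database vs. per-query-URL difference observed in the comment.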
> Is there a good general purpose solution where I can store a large read-only database in S3 or something and do lookups directly on it?

I think this is pretty much what AWS Athena is.
My ideal is that turbopuffer ultimately works like a Polars dataframe, where all my ranking is expressed in my search API. I could lazily express some lexical or embedding similarity and boost with various attributes (say, recency or popularity) to get a first pass, all just with dataframe math. Then I'd compute features for a reranking model I run on my side (more dataframe math) and it would "just work": run all of this as some kind of query-execution DAG and stay out of my way.
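The two-pass "ranking as dataframe math" idea can be sketched in plain numpy; everything here (the boost weights, the linear "reranker") is a hypothetical stand-in for whatever model you'd actually run, not turbopuffer's or Polars' API:

```python
# First pass: vectorized score = embedding similarity + attribute boosts.
# Second pass: rerank the top candidates with a (stand-in) feature model.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1_000, 64
doc_emb = rng.normal(size=(n, d))
doc_emb /= np.linalg.norm(doc_emb, axis=1, keepdims=True)
recency = rng.random(n)     # pretend: normalized freshness per document
popularity = rng.random(n)  # pretend: normalized popularity per document

query = rng.normal(size=d)
query /= np.linalg.norm(query)

# First pass, all "dataframe math": cosine similarity plus boosts.
score = doc_emb @ query + 0.2 * recency + 0.1 * popularity
top = np.argsort(score)[::-1][:50]  # candidate set for reranking

# Second pass: build features for the candidates and apply a toy linear
# reranker (a learned model would slot in here).
features = np.stack([doc_emb[top] @ query, recency[top], popularity[top]], axis=1)
weights = np.array([1.0, 0.5, 0.3])
reranked = top[np.argsort(features @ weights)[::-1]]
print(reranked[:10])
```

The appeal of the DAG framing is that both passes are just array expressions, so an engine could fuse, reorder, or push them down without the caller caring.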