(comments)

Original link: https://news.ycombinator.com/item?id=38020109

To fine-tune a pretrained model such as text-embedding-ada-002, or to fine-tune an LLM, you generally need a larger GPU, depending on the model architecture and the amount of data to be processed. Larger datasets usually require more compute. As a rough estimate:

- text-embedding-ada-002 uses roughly 10 billion parameters and has been tested successfully on an Nvidia V100 with a batch size of 16k. In a practical setup, a GPU such as a Quadro RTX 5000, Quadro RTX 6000, or Tesla T4 is sufficient.

- Fine-tuning an LSTM- or Transformer-based LLM requires a lot of memory to hold the intermediate activations during inference or training. That means a card with more memory, for example a V100 PCIe passthrough or an A100 PCIe passthrough combined with a Quadro RTX 6000 or 5000.

Some references that may offer further insight:

- NLP processing performance benchmarks on the Intel Broadwell architecture (https://software.intel.com/content/www/us/en/attachments/white_paper-314433.pdf) give an overview of model sizes, parameter counts, and hardware accelerator requirements.

- NVIDIA's guide to optimizing DL pipelines for its hardware stack (https://developer.nvidia.com/optimize-dl-training-inference-gpus-nvidias-accelerated-compute-technology) offers recommendations for optimizing DL pipelines on Nvidia hardware.

Additionally, consider reading NLP: An Introduction to Natural Language Processing for practical advice on selecting machine learning algorithms that suit your specific task or set of tasks. Good luck!

Jina AI launches open-source 8k text embedding (jina.ai)
559 points by artex_xh 4 days ago | 201 comments

I'm always happy to see OSS contributions but I don't quite understand why this model is so remarkable. As the leaderboard suggests it's ranking lower than OpenAI embeddings, while 14 other contributions are even better than that. Many of which feature a comparable or lower dimensionality than 768.

The 8k context window is new, but isn't the 512 token limitation a soft limit anyway? I'm pretty sure I can stuff bigger documents into BGE for example.

Furthermore, I think that most (all?) benchmarks in the MTEB leaderboard deal with very small documents. So there is nothing here that validates how well this model does on larger documents. If anything, I'd pick a higher ranking model because I put little trust in one that only ranks 17th on small documents. Should I expect it to magically get better when the documents get larger?

Plus, you can expect that this model was designed to perform well on the datasets in MTEB while the OpenAI model probably wasn't.

Many have also stated that an 8k-context embedding will not be very useful in most situations.

When would anyone use this model?



Potentially useful for paragraph embedding, where... well, paragraphs can grow a lot. Not sure how this model fares in comparison to other embedding engines (yet), but I can definitely tell you mpnet models fare much better for paragraph embeddings than the leader in HF's leaderboard (being thenlper/gte-large at time of writing).

I can guess that Davinci and similar embeddings work better for code than MPNET, and it really matters what you are encoding, not only the context length: what features are actually being extracted by the embedding engine.



I have been trying to understand the hype as well. Happy to see all the work happening in this space still.

I was pretty curious about the context limit. I am not an expert in this area but I always thought the biggest problem was the length of your original text. So typically you might only encode a sentence or a selection of sentences. You could always stuff more in, but then you are potentially losing specificity, which I would think is a function of the dimensionality. This model is 768, are they saying I can stuff 8k tokens worth of text and can utilize it just as well as I have with other models on a per 1-3 sentence level?



Thinking about it some more as I read through more comments. I guess in the stated case of research papers it can make sense if your task is looking for the common themes and not specific details. If you are embedding a sentence or a paragraph you miss out on the connection between those sentences across the whole paper... or at least it's harder to manage that. By encoding a large number of pages from the paper (or the entire paper) you can hopefully do a better job of capturing the theme of that paper.

This also opens up another question though, how would that compare to using a LLM to summarize that paper and then just embed on top of that summary.



I would guess that the embedded summary is better, but for many tasks where you use embeddings (like document search), summarizing every document with an LLM is too expensive and slow.


I fail to imagine an 8k-token-length piece of text that has just one single semantic coordinate and is appropriate for embedding and vector search.

In my experience, any text is better embedded using a sliding window of a few dozen words - this is the approximate size of a semantic unit in a written document in English; although this will differ wildly for different texts and topics.
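
A minimal sketch of that sliding-window chunking, splitting on whitespace with some overlap (the function name, window size, and stride are illustrative, not from any library; a real pipeline would count tokens with the model's tokenizer):

    def sliding_windows(text, window=40, stride=20):
        # Naive whitespace split; each returned chunk would be embedded separately.
        words = text.split()
        chunks = []
        for start in range(0, max(len(words) - window, 0) + 1, stride):
            chunks.append(" ".join(words[start:start + window]))
        return chunks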



What are you using those embeddings for?

I can see a sliding window working for semantic search and RAG, but not so much for clustering or finding related documents.



Ah yes, clustering is indeed something that would benefit from large context, I agree.

However, even so, I would think about the documents themselves and figure out if it is even needed. Let's say we are talking about clustering court proceedings. I'd rather extract the abstracts from these documents, and embed and cluster those instead of the whole text.



> The 8k context window is new

Hasn’t Claude had this for many months (before they bumped to 100k)?

Edit: ah, you mean new for OSS maybe?



Claude is a large language model, which is a different thing from an embedding model.


Any large language model generates embedding representations at every layer of the model, and these can be trivially extracted. So, large language models are indeed embedding models.

This leaderboard doesn't compare these custom tailored embedding models vs the obvious thing of average pooling layered with any traditional LLM, which is easily implemented using sentence transformers.
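For reference, a rough sketch of that average-pooling baseline with Hugging Face transformers (the model name is just a placeholder; any model that exposes hidden states works the same way, and sentence-transformers wraps this exact pattern):

    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")    # placeholder model
    model = AutoModel.from_pretrained("bert-base-uncased")

    inputs = tok("An example sentence to embed.", return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state              # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)                # zero out padding positions
    embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)     # mean pooling -> (1, dim)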



Because 4K+ dimensional embeddings are functionally useless.


Aha, that’s what I missed, thanks!


This is great news!

It feels like open-source is closing the gap with "Open"AI which is really exciting, and the acceleration towards parity is faster than more advancements made on the closed source models. Maybe it's wishful thinking though?



Is it tho? It's not really open source if they don't give us the information regarding training datasets


It definitely is open source even if they don’t disclose all details behind the training


The very definition of what constitutes open source is being called into question in these kinds of discussions about AI. Without the training details and the weights being made fully open it’s hard to really call something truly open, even if it happens to meet some arbitrary definition of “open source”.

A good definition of “truly open” is whether the exact same results can be reproduced by someone with no extra information from only what has been made available. If that is not possible, because the reproduction methodology is closed (a common reason, like in this case) then what has been made available is not truly open.

We can sit here and technically argue whether or not the subject matter violated some arbitrary “open source” definition but it still doesn’t change the fact that it’s not truly open in spirit



To take another example, would you call a game that has its code and all assets (e.g. character sprites) freely available open source? Or would the process that was used to create the assets in the first place also be required to be considered open?

The parallel can be made with model weights being static assets delivered in their completed state.

(I favor the full process being released, especially for scientific reproducibility, but this is another point)



Imagine someone giving you an executable binary without the source code and calling it "open source".


I'm actually mostly in your camp here. But it's complicated with AI.

What if someone gave you a binary and the source code, but not a compiler? Maybe not even a language spec?

Or what if they gave you a binary and the source code and a fully documented language spec, and both of 'em all the way down to the compiler? BUT it only runs on special proprietary silicon? Or maybe even the silicon is fully documented, but producing that silicon is effectively out of reach to all but F100 companies?

It's turtles all the way down...



There is the binary (the model) and the source (the thing that allows you to recreate the model, the dataset and methodology). Compilers and how art is made quite simply doesn't factor in here, because nobody is talking about the compiler layer. Art isn't even close to what is present. Trying to make this more complicated than it is is playing into companies' hands by troubling the waters around what constitutes open source.


To be fair, OpenSource troubled the waters around what constitutes free software.

Free(dom Respecting) Software wasn’t just about the source code.

https://www.gnu.org/philosophy/open-source-misses-the-point....



You can pass in any command line arguments you like, so it must be open source


Notice you are creating your own arbitrary definition of 'truly open', which IMHO corresponds more with 'reproducible'.

We already have a definition of open source. I don't see any reason to change it.



Problem is, the literal/default definition of "open source" is meaningless/worthless in this context. It's the weights, training data and methodology that matter for those models - NOT the inference shell.

It's basically like giving people a binary program and calling it open source because the compiler and runtime used are open source.



The weights are the inference and result of training. I can give you all the training details and you might not be able to reproduce what I did (google does this all the time). As a dev, I’d much rather an open model over an open recipe without weights. We can all agree having both is the best case scenario but having openly licensed weights is for me the bare minimum of open source


The inference runtime software is open, the weights are an opaque binary. Publishing the training data, hyperparameters, process, etc - that would make the whole thing "open source".


The quake engine is still open source even though it doesn't come with the quake game assets, no?

It seems unreasonable to require the training data just to be called open source, given it has similar copyright challenges as game assets.

Of course, this wouldn't make the model reproducible. But that's different from open source.



Good example. And in fact you are calling the "engine" open source, not the whole Quake game. The "assets" in most "open source" AI models are not available.


Imagine if the Telegram client was open source but not the backend.

Imagine if Facebook open-sourced their front-end libraries like React but not the back-end.

Imagine if Twitter or Google didn’t publish its Algorithm for how they rank things to display to different people.

You don’t need to imagine. That’s exactly what’s happening! Would you call them open source because their front end is open source? Could you host your own back end on your choice of computers?

No. That’s why I even started https://qbix.com/platform



I completely agree with you (and the example you mention are singled out in the "antifeatures" list in F-Droid, to name an example)


It's a bit different - here most of the value lies in the weights.

A better analogy would be some graphics card drivers which ship a massive proprietary GPU firmware blob, and a small(ish) kernel shim to talk with said blob.



Well, perhaps we can consider this a kind of short-sightedness of Stallman. His point with the GPL and the free software movement, as I understand it, was to ensure the user could continue to use the software regardless of what the software author decided to do.

Sometimes though the software alone can be near useless without additional assets that aren't necessarily covered by the code license.

Like Quake, having the engine without the assets is useless if what you wanted was to play Quake the game. Neural nets are another prime example, as you mention. Simulators that rely on measured material property databases for usable results also fall into this category, and so on.

So perhaps what we need is new open source licenses that includes the assets needed for the user to be able to reasonably use the program as a whole.



Weights are like binaries. They are not code. It would make more sense to put it under a creative commons license


Well the other day on this very website there were some very opinionated voices stating that Open Source is “exclusively what OSI defines”. I am not on that camp, more like in yours. To me there’s open source and OSI-approved open source. But you will encounter people very set on that other opinion, which I found interesting.

Make no mistake, I am super grateful to OSI for their efforts and most of my code out there uses one of their licenses. I just think they are limited by the circumstances. Some things I consider open are not conforming to their licenses and, like here, some things that conform might not be really open.



The old Stallman definition used the phrase "preferred form for modification" rather than the more specific "source code". What do you need to effectively modify an AI model?


Usually the datasets, not the source code.


Then a lot of stuff is not open source. Have you tried reproducing random GitHub repos, especially in machine learning?


So if someone includes images in their project they need to tell you every brush stroke that led to the final image?

All sorts of intangibles end up in open source projects. This isn’t a science experiment that needs replication. They’re not trying to prove how they came up with the image/code/model.



Those "Brush Strokes" are effectively the source code. To be considered open source, yes source code needs to be provided along side the binaries (the "image").


It’s more like someone giving you an open source front end client, but not giving you a way to host your own backend.

Look into Affero GPL. Images are inert static assets. Here we are talking about the back end engine. The fact that neural networks and model weights are non-von-neumann architecture doesn’t negate the fact that they are executable code and not just static assets!



How do you define "source", then?

By this logic any freely downloadable executable software (a.k.a. freeware) is also open source, even though they don't disclose all details on how to build it.



Source would be the way the data is produced so that you can replicate it yourself and make changes.

If I hand you a beer for free that’s freeware. If I hand you the recipe and instructions to brew the beer that is open source.

We muddy the waters too much lately and call “free” to use things “open source”.



> If I hand you a beer for free that’s freeware. If I hand you the recipe and instructions to brew the beer that is open source.

Yeah, but what those "open source" models are is like you handing me a bottle of beer, plus the instructions to make the glass bottle. You're open-sourcing something, just not the part that matters. It's not "open source beer", it's "beer in an open-source bottle". In the same fashion, those models aren't open source - they're closed models inside a tiny open-source inference script.



Perhaps one more thing that is missing in context is that I'm also getting the right to alter that beer by adding anything I like to it and redistributing it, without knowing its true recipe.


Interesting as the literal source of the result is not open


People need to realize something…

The model weights in eg TensorFlow are the source code.

It is not a von-Neumann architecture but a gigabyte of model weights is the executable part, no less than a gigabyte of imperative code.

Now, the training of the model is akin to the process of writing the code. In classical imperative languages that code may be such spaghetti code that each part would be intertwined with 40 others, so you can’t just modify something easily.

So the fact that you can't modify the code is Freedom 2 or whatever. But at least you have Freedom 0 of hosting the model where you want and not getting charged an exorbitant amount for it or getting cut off, or having the model change out from under you via RLHF for political correctness or whatever.

OpenAI has not even met Freedom Zero of FSR or OSI’s definition. But others can.



That doesn't work for me.

The model weights aren't source code. They are the binary result of compiling that source code.

The source code is the combination of the training data and configuration of model architecture that runs against it.

The model architecture could be considered the compiler.

If you give me gcc and your C code I can compile the binary myself.

If you give me your training data and code that implements your model architecture, I can run those to compile the model weights myself.



No, you would need to spend “eye watering amounts of compute” to do it, similar to hiring a lot of developers to produce the code. The compiling of the code to an executable format is a tiny part of that cost.


I still think of millions of dollars of GPU spend crunching away for a month as a compiler.

A very slow, very expensive compiler - but it's still taking the source code (the training material and model architecture) and compiling that into a binary executable (the model).

Maybe it helps to think about this at a much smaller scale. There are plenty of interesting machine learning models which can be trained on a laptop in a few seconds (or a few minutes). That process feels very much like a compiler - takes less time to compile than a lot of large C++ projects.

Running on a GPU cluster for a month is the exact same process, just scaled up.

Huge projects like Microsoft Windows take hours to compile and that process often runs on expensive clusters, but it's still considered compilation.



Actually, the dirty secret is that a lot of human work (at below minimum wage) went into training and refining the AI models:

https://time.com/6247678/openai-chatgpt-kenya-workers/

And billion-dollar companies made their money off it:

https://www.forbes.com/sites/kenrickcai/2023/04/11/how-alexa...

That’s the dirty secret of why ChatGPT 4 is better. But they’ll tell you it has to do with chaining ChatGPT 3’s together, more fine tuning etc. They go to these poor countries and recruit people to work on training the AI.

Not to mention all the uncompensated work of humans around the world who put their content up on the Web.



Wishful thinking? Embeddings to me were never the interesting or bleeding edge thing at OpenAI. Maybe the various ada models at one point reigned supreme but there have been open-source models at the top of the leaderboard for a while and from a cost/performance perspective, often even the Bert models did a really fine job.


They compare it to OpenAI's ada model though, which is light-years away from ChatGPT.


Don't confuse the current Ada embedding model with the old Ada GPT3 model.

It turns out OpenAI have used the name "Ada" for several very different things, purely because they went through a phase of giving everything Ada/Babbage/Curie/DaVinci names because they liked the A/B/C/D thing to indicate which of their models were largest.



Does that not conflate two different things though? Embedding model != LLM Model ?


This is great to see. It looks like the size of the embedding vector is half the size of text-embedding-ada-002 (768 vs 1536) while providing competitive performance. This will save space in databases and make lookups somewhat faster.

For those unaware, if 512 tokens of context is sufficient for your use case, there are already many options that outperform text-embedding-ada-002 on common benchmarks:

https://huggingface.co/spaces/mteb/leaderboard



The 768D-sized embeddings compared to OpenAI's 1536D embeddings are actually a feature outside of index size.

In my experience, OpenAI's embeddings are overspecified and do very poorly with cosine similarity out of the box as they match syntax more than semantic meaning (which is important as that's the metric for RAG). Ideally you'd want cosine similarity in the range of [-1, 1] on a variety of data but in my experience the results are [0.6, 0.8].



Unless I'm missing something, it should be possible to map out in advance which dimensions represent syntactic aspects, and then downweigh or remove them for similarity comparisons. And that map should be a function of the model alone, i.e. fully reusable. Are there any efforts to map out the latent space of ada models like that?


You wrote „out of the box“, did you find a way to improve this?


You can do PCA or some other dimensionality reduction technique. That’ll reduce computation and improve signal/noise ratio when comparing vectors.
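
A rough sketch of that with scikit-learn, assuming you already have a matrix of embeddings (the 1536 and 256 dimensions are arbitrary example numbers, not recommendations):

    import numpy as np
    from sklearn.decomposition import PCA

    embeddings = np.random.randn(50_000, 1536).astype("float32")  # stand-in for real vectors
    pca = PCA(n_components=256)
    reduced = pca.fit_transform(embeddings)       # fit once on (a sample of) your corpus
    # Projecting new vectors later is just pca.transform(new_embeddings), i.e. a matmul.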


Unfortunately this is not feasible with a large amount of words due to the quadratic scaling. But thanks for the response!


Not sure what you mean by large amount of words. You can fit a PCA on millions of vectors relatively performantly, then inference from it is just a matmul.


Not true. You need a distance matrix (for classical PCA it's a covariance matrix), which scales quadratically with the number of points you want to compare. If you have 1 Mio. vectors, each creating a float entry in the matrix, you will end up with approx (10^6)^2 / 2 unique values, which is roughly 2000Gb of memory.


What is the use case for an 8k token embedding? My (somewhat limited) experience with long context models is they aren't great for RAG. I get the impression they are optimized for something else, like writing 8k+ tokens rather than synthesizing responses.

Isn't the normal way of using embedding to find relevant text snippets for a RAG prompt? Where is it better to have coarser retrieval?



> What is the use case for an 8k token embedding?

Calculating embeddings on larger documents than smaller-window embedding models.

> My (somewhat limited) experience with long context models is they aren't great for RAG.

The only reason they wouldn't be great for RAG is that they aren't great at using information in their context window, which is possible (ISTR that some models have a strong recency bias within the window, for instance) but I don't think is a general problem of long context models.

> Isn't the normal way of using embedding to find relevant text snippets for a RAG prompt?

I would say the usual use is for search and semantic similarity comparisons generally. RAG is itself an application of search, but it's not the only one.



I wonder how the performance fares when context size is increased. Intuitively it should be higher, but some quantized models I've tested showed noticeably worse performance.


Your KV cache size is linear with context size which might put you tight on memory. There is also increased cost of recalculating KV cache of context window when the window has to move but this is close to being solved with streaming LLMs.


BERT style encoder-only models, like the embedding model being discussed here, don't need a KV cache for inference. A KV cache is only needed for efficient inference with encoder-decoder and decoder-only (aka GPT) models.


You could get a facsimile of a summary for a full article or short story. Reducing an 8k token article to a summary using a completions model would cost far more. So if you need to search through collections of contracts, scientific papers, movie scripts, etc. for recommendations/clustering, then bigger input sizes can do that in one shot.

Think of it like skipping the square root step in Euclidean distance. Perfectly valid as long as you don’t want a distance so much as a way to compare distances. And doing so skips the most computationally expensive operation.
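
As a small illustration of that last point, ranking by squared Euclidean distance gives the same ordering as the true distance, so the sqrt can simply be dropped (the vectors here are random stand-ins for real embeddings):

    import numpy as np

    docs = np.random.randn(1000, 768)            # pretend these are document embeddings
    query = np.random.randn(768)                 # pretend this is the query embedding
    sq_dist = ((docs - query) ** 2).sum(axis=1)  # squared distance, no sqrt
    top10 = np.argsort(sq_dist)[:10]             # same ranking as with the sqrt applied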



I think I'm missing something: like, yeah, it's vector search for bigger text chunks. But arguably vector search with bigger text chunks is _definitively_ worse -- this isn't doing summarization, just turning about 25 pages of text to 1024 floats, which you then can use cosine similarity to measure the semantic similarity to other text

I'd much rather know what paragraph to look in than what 25 pages to look in



I imagine it's more useful for finding related articles and clustering things than for semantic search, which will work much better against smaller chunks - especially if you're implementing Retrieval Augmented Generation.


I think the point is: if you compress 25 pages of text into 1024 floats, you will lose a ton of information, regardless of what the use case is, so you're probably still better off with chunking.


> if you compress 25 pages of text into 1024 floats, you will lose a ton of information

Sure, but then if you do it one page at a time, or one paragraph at a time, you lose ton of meaning - after all, individual paragraphs aren't independent of each other. And meaning is kind of the whole point of the exercise.

Or put another way, squashing a ton of text loses you some high-frequency information, while chunking cuts off the low-frequency parts. Ideally you'd want to retain both.



I think the assumption that you lose a ton of (low-frequency) meaning by doing separate chunks is less likely to hold than losing high-frequency meaning by doing the whole document at once. As you say, doing both is probably a good strategy, and I think that's why we see a lot of "summarize this text" approaches.

I use a multi-pronged approach to this based on a special type of summarization. I chunk on sentences using punctuation until they are just over 512 characters, then I embed them. After embedding, I ask a foundation model to summarize (or ask a question about the chunk) and then generate keyterms for it. Those keyterms are stored along with the vector in the database. During search, I use the user's input to do a vector search for matching chunks, then pull their keyterms in. Using those keyterms, I do set operations to find related chunks. I then run a vector search against these to the top matches from the vector search to assemble new prompt text.

This strategy is based on the idea of a "back of the book index". It is entirely plausible to look for "outliers" in the keyterms and consider throwing those chunks with those keyterms in there to see if it nets us understanding of some "hidden" meaning in the document.

There is also a means to continue doing the "keyterm" extraction trick as the system is used. Keyterms from answer as well as user prompts may be added to the existing index over time, thus helping improve the ability to return low frequency information that may be initially hidden.
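
A minimal sketch of that first chunking step, packing sentences until a chunk passes ~512 characters (the sentence split is a naive regex; the rest of the described pipeline, like keyterm extraction, is omitted):

    import re

    def chunk_sentences(text, target=512):
        sentences = re.split(r"(?<=[.!?])\s+", text)
        chunks, current = [], ""
        for s in sentences:
            current = (current + " " + s).strip()
            if len(current) > target:        # chunk is "just over" the target
                chunks.append(current)
                current = ""
        if current:
            chunks.append(current)
        return chunks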



I've been getting great results for related documents by embedding entire blog posts, e.g. here: https://til.simonwillison.net/gis/pmtiles#related

I'm not sure how I would do that after chunking.



Did you compare with simple baselines like bag-of-words and word vectors?


My previous implementation used TF-IDF - I basically took all the words in the post and turned them into a giant "word OR word OR word OR word" search query and piped that through SQLite full-text search. https://til.simonwillison.net/sqlite/related-content

I jumped straight from that to OpenAI embeddings. The results were good enough that I didn't spend time investigating other approaches.



> Into a giant "word OR word OR word OR word"

Does that mean you'd return other docs if they share just one word?

The idea of tfidf is that it gives you a vector (maybe combined with pca or a random dimensionality reduction) that you can use just like an Ada embedding. But you still need vector search.



My goal for related articles was to first filter to every document that shared at least one word with the target - which is probably EVERY document in the set - but then rank them based on which ones share the MOST words, scoring words that are rare in the corpus more highly. BM25 does that for free.

Then I take the top ten by score and call those the "related articles".



That's not quite tfidf though. I agree you can get better results than that with Ada embeddings, but I would argue you can get even better results with embeddings from smaller chunks.


I guess technically it's bm25, since it's using the rank mechanism in SQLite FTS5: https://www.sqlite.org/fts5.html#sorting_by_auxiliary_functi...


Good point, I wonder how different it is to use a large context here vs having some other model summarize an 8k article into a small paragraph and using embedding from the paragraph instead where such a large context wouldn't be necessary.


Ever read the back of a book?


You mean the marketing blurb? Those tend to carry low information value, sometimes even negative - as in, if you didn't know anything else about the book, reading the blurb will make you even more wrong about it than you were. This is a common feature of marketing copy.


Isn't it up to 8k? So you can index your documents by paragraphs if you prefer?


you could do both


Is this what you mean by RAG? https://www.promptingguide.ai/techniques/rag?


I have an explanation of RAG in the context of embeddings here: https://simonwillison.net/2023/Oct/23/embeddings/#answering-...


You could just sum it up for us all rather than do a divert to your blog?

It's Retrieval Augmented Generation btw.

To quote:

> The key idea is this: a user asks a question. You search your private documents for content that appears relevant to the question, then paste excerpts of that content into the LLM (respecting its size limit, usually between 3,000 and 6,000 words) along with the original question.

> The LLM can then answer the question based on the additional content you provided.



> You could just sum it up for us all rather than do a divert to your blog?

Why? Have links gone out of fashion?

I even linked directly to the relevant section rather than linking to the top of the page.

The paper that coined the term used the hyphen, though I think I prefer it without: https://arxiv.org/abs/2005.11401



> Have links gone out of fashion?

Yes.

You wrote far more words than needed to answer the comment, I did it for you instead.



One of the reasons I write so much stuff is so I can provide links to things I've written to answer relevant questions.


And those of us with the sense to value your insight, and the attention-span to read more than tweet-sized content, thank you for it.


Thanks so much for your writings and for posting the link (and also for Datasette!). I've learned in the past few months from your blog.


Appreciate it. Your posts in general have been great - accessible to a large audience, quality links to follow up research and catchy analogies even when they don't fully hold true (llm as a calculator for words - which I admit I use with citation!). Keep going.


Thank you, nice blog.


Just to add that, we appreciate that very much.


I liked your link a lot.


"Links have gone out of fashion" is an odd thing to write on a Link Aggregator website.


You know you're responding to a programmer famous enough to have a Wikipedia page, right?

https://en.m.wikipedia.org/wiki/Simon_Willison



I don't pay the slightest fucking attention to who I'm responding to and take people on their merit/comment.

Should we all do the ad hominem thing? You are actually suggesting that?



Yes


One thing that is missing in comparison: OpenAI's model is multilingual.

And not only it supports and embeds a variety of languages, it also computes the same coordinates for the same semantics in different languages. I.e. if you embed "russia is a terrorist state" and "россия - страна-террорист", both of these embeddings will have almost the same coordinates.



I heard one of the developers on a regular Open Source DIY AI X/twitter space [1] & they are targeting two new models German/English and French/English for the next release

https://x.com/thursdai_pod



I don’t really know what that means but it seems useful


Just quantized the models for onnx usage in e.g. transformers.js and got 4x reduced file size:

- 28.5 MB jina-embeddings-v2-small-en (https://huggingface.co/do-me/jina-embeddings-v2-small-en)

- 109 MB jina-embeddings-v2-base-en (https://huggingface.co/do-me/jina-embeddings-v2-base-en)

However, I noticed that the base model performs quite poorly on small text chunks (a few words) while the small version seems to be unaffected. Might this be some kind of side effect of the way they deal with large contexts?

If you want to test, you can head over to SemanticFinder (https://do-me.github.io/SemanticFinder/), go to advanced settings, choose the Jina AI base model (at the very bottom) and run with "Find". You'll see that all other models perform just fine and find "food"-related chunks but the base version doesn't.



Why quantize something that is already very small (270mb)?


Just making up stuff here, but smaller models are great for serverless compute like functions, which would also benefit from lighter computation. Don't forget, some people are dealing with hundreds of millions of documents. Accelerating this by 4x may be worth a small performance hit.


I just shipped a new llm-embed-jina plugin for my LLM tool which provides access to these new Jina models: https://github.com/simonw/llm-embed-jina

Here's how to try it out.

First, install LLM. Use pip or pipx or brew:

    brew install llm
Next install the new plugin:

    llm install llm-embed-jina
You can confirm the new models are now available to LLM by running:

    llm embed-models
You should see a list that includes "jina-embeddings-v2-small-en" and "jina-embeddings-v2-base-en"

To embed a string using the small model, run this:

    llm embed -m jina-embeddings-v2-small-en -c 'Hello world'
That will output a JSON array of 512 floating point numbers (see my explainer here for what those are: https://simonwillison.net/2023/Oct/23/embeddings/#what-are-e...)

Embeddings are only really interesting if you store them and use them for comparisons.

Here's how to use the "llm embed-multi" command to create embeddings for the 30 most recent issues in my LLM GitHub repository:

    curl 'https://api.github.com/repos/simonw/llm/issues?state=all&filter=all' \
    | jq '[.[] | {id: .id, title: .title}]' \
    | llm embed-multi -m jina-embeddings-v2-small-en jina-llm-issues - \
    --store
This creates a collection called "jina-llm-issues" in a default SQLite database on your machine (the path to that can be found using "llm collections path").

To search for issues in that collection with titles most similar to the term "bug":

    llm similar jina-llm-issues -c 'bug'
Or for issues most similar to another existing issue by ID:

    llm similar jina-llm-issues 1922688957
Full documentation on what you can do with LLM and embeddings here: https://llm.datasette.io/en/stable/embeddings/index.html

Alternative recipe - this creates embeddings for every single README.md in the current directory and its subdirectories. Run this somewhere with a node_modules folder and you should get a whole lot of interesting stuff:

    llm embed-multi jina-readmes \
      -m jina-embeddings-v2-small-en \
      --files . '**/README.md' --store
Then search them like this:

    llm similar jina-readmes -c 'backup tools'




The only feedback I had from your embedding post was

    wish we could create the array of floating points without openai

Great timely turnaround time, good sir. Ht


Thank you so much for all the work you've put into llm!


Thanks, this is wonderfully simple to use. Just managed to package this up using docker and was able to use it without a lot of drama. Nice how simple this is to use.

I've dabbled a bit with elasticsearch dense vectors before and this model should work great for that. Basically, I just need to feed it a lot of content and add the vectors and vector search should work great.



FYI it seems that llm install llm-embed-jina is missing yaml dependency

  File "/opt/homebrew/Cellar/llm/0.11_1/libexec/lib/python3.12/site-packages/llm/default_plugins/openai_models.py", line 17, in 
    import yaml
ModuleNotFoundError: No module named 'yaml'


Thanks! I wonder if the Python 3.12 upgrade broke something.

The pyyaml package is correctly listed on the formula page though: https://formulae.brew.sh/formula/llm



Excellent! And you were just saying how risky it is to rely long-term on OpenAI text embeddings in your post on the topic. The timing for this open source option worked out nicely.


JFYI, this is what happens on my M1 Macbook:

$ brew install llm
$ llm
ModuleNotFoundError: No module named 'typing_extensions'

Not sure where to report it.



Whoa, that is a weird one. Do you know what version of Python you have from Homebrew?

It looks like that package is correctly listed in the formula: https://github.com/Homebrew/homebrew-core/blob/a0048881ba9a2...



    % python3 --version
    Python 3.11.6
    
    % which python3
    /opt/homebrew/bin/python3

    % brew info python-typing-extensions
    ==> python-typing-extensions: stable 4.8.0 (bottled)


Probably not this, but check with `which llm` what that's running. I had weird issues not matching the documentation but just had some other random python cli tool called llm I'd put in my home bin for and forgotten about it.


    % which llm
    /opt/homebrew/bin/llm


How well do LLMs like this work with a non-English language? Or are these open source models limited to English?


Quite a few of the top ranked models on this leaderboard are multilingual: https://huggingface.co/spaces/mteb/leaderboard

https://huggingface.co/BAAI/bge-large-en-v1.5 FlagEmbedding for example describes itself as covering Chinese and English.



Stability has a Japanese port which is getting lots of work https://twitter.com/StabilityAI_JP/status/171699857824440759...


This is not an embedding model though. Yes you can always extract some embeddings from somewhere, but for most LLMs those won't perform well for retrieval (which makes sense as it's not what the models are optimizing for)


This isn't an embedding model, but it is a group of people working in this general area in a language other than English. Maybe they'll get to an embedding model next?


That depends on whether the training data contained languages other than English.


Impressive work.

I wonder what would be the best way to use 8k embeddings. It’s a lot of information to keep in a vector, so things like “precision” of the embedding space and its ability to distinguish very similar large documents will be key.

Maybe it can be useful for coarse similarity matching, for example to detect plagiarism?



8K is the context length. Their vector dimension size is actually much smaller, which is great for a number of use cases, though maybe not the ones you are thinking about.


Yes that’s also how I understood it. Maybe it was ambiguously expressed, but I mean “8k tokens as input is a lot of information to encode”


When I go to this leaderboard: https://huggingface.co/spaces/mteb/leaderboard and click on the "Classification" tab, I see "jina-embeddings-v2-base-en" at number 12, with an average score of 73.45. The highest scoring model there is llmrails/ember-v1 with a 75.99 average score, but it only supports 512 tokens, so if you need 8K tokens to be embedded, I guess Jina is the best option. Do people need 8K tokens for embedding? Maybe not, but they might need more than 512 often enough. It could save a summary extraction step.


Small context window means you cannot embed the whole document, you are embedding just a part.

So, if there is some information at the bottom which is dependent on something which is at the top, your embedding could be entirely wrong.



Ada is one of the worst models (if not the worst) offered by OpenAI, though ...


You're thinking of the old "ada" GPT-3 model - the one that was a companion to "davinci" and "babbage".

I believe "text-embedding-ada-002" is entirely unrelated to those old GPT-3 models. It's a recent embedding model (released in December 2022 - https://openai.com/blog/new-and-improved-embedding-model ) which OpenAI claim is their best current best available embedding model.

I understand your confusion: OpenAI are notoriously bad at naming things!



Oh, thanks for clarifying!

Edit: looking at the press release, the improvement over old Ada is ... marginal? And Ada-01 is/was a poor performing model, tbh. I guess I'll have to run some tests, but at first sight it doesn't seem that wow-ey.



So just to be super clear, this is an embedding model. It generates no text. It’s not outputting words.

Maybe I am assuming incorrectly, but I think the poor performance you are referring to is the old Ada completion model, where the output is text. That was poor indeed.



This article is not kind to the old ada embeddings model:

https://medium.com/@nils_reimers/openai-gpt-3-text-embedding...

If the new ada model only has marginal improvements, it seems open source is way to go.



Just noticed that they (jina.ai) have offices both in Berlin and China. I am wondering how they will operate with the presence of chip export restrictions and other side effects of USA / China tensions.


It's weird to think there are entire companies built around providing access to a pre-computed vector space model.


Jina AI itself is also a great framework for exposing APIs from deep neural net models and deploying them to Kubernetes clusters, which I think is very promising, but they didn't get as much hype as I thought they deserved.


I wonder how much better this is, compared to taking the average (or some other aggregation) of embeddings with a smaller context length. Has anyone done a similar comparison?


The issue with averaging is that over large inputs, it drowns out small signal. For example, there is a chance that it completely loses a reference to something made only in a single sentence somewhere in a large document.
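
For concreteness, the averaging baseline being discussed looks roughly like this (the model name and the naive character chunking are placeholders; the point is the last two lines):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")       # placeholder small embedding model
    doc = " ".join(["Some long document text."] * 500)
    chunks = [doc[i:i + 2000] for i in range(0, len(doc), 2000)]
    chunk_vecs = model.encode(chunks)                     # (n_chunks, dim)
    doc_vec = np.mean(chunk_vecs, axis=0)                 # one vector for the whole document
    # A detail mentioned in only one chunk contributes just 1/n_chunks of its weight here.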


this is super cool! I wish there was an easy to understand and follow guide on how to make your own embedding, for llama2 for example. All I can find are various guides that already assume you know everything there is to training an embedding.

I just want to make an embedding between a conversation of me and my friend and simulate talking to them. Is this a hard thing to train to begin with?

If anyone knows or could help me with this, I would be very grateful!



I will butcher this so if any experts see this please don't flame me. I think you might be conflating ideas? You could definitely fine-tune existing embedding models or train your own from scratch but the goals of embeddings models are different than a LLM conversation. Embedding models are used for things like, classifying, search, image captioning...maybe at a high level anything where you have high dimensionality that you need to condense?

What you are asking for sounds like fine tuning an existing LLM... where the data will be tokenized but the outcomes are different? There are a lot of writeups on how people have done it. You should especially follow some of the work on Huggingface. To replicate talking to your friend, though, you will need a very large dataset to train off of I would think, and it's unclear to me if you can just fine-tune it or you would need to train a model from scratch. So a dataset with 10s of thousands of examples, and then you need to train it on a GPU.

https://www.anyscale.com/blog/fine-tuning-llama-2-a-comprehe...



Thank you for sending this. It's still quite puzzling to me if it's actually possible or not. Maybe what I want to train is a style? But then again, it should also remember other important things related to the friend..


Parent comment is on the right track. It sounds like you want to fine tune an llm to mimic the conversation style between you and your friend. Then you can use a general embedding model to implement RAG so that the application can "recall" pieces of your conversation.


You can't fine tune without using their library tied to their cloud? Did I misunderstand? Do you need to fine tune?


oh thank god I first read Jira...


You're not the only one... glad I misread that.


Pardon my ignorance in advance but could it be used to "chat" with PDFs and websites? I am looking for OpenAI alternatives as I am in learning phase


No. “Chatting with PDFs” is (mostly) taking a users chat message, retrieve relevant content via e.g embedding search, then feed that into an LLM with a prompt that’s something along the lines of “given this information, can you answer this question”.

This tool helps with embedding part.

I’ve built a bunch of ”chat with your PDFs” bots, do reach out if you have any questions me at brian.jp.
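
A minimal sketch of that retrieve-then-prompt flow, with random vectors standing in for real embeddings of the question and the PDF chunks (the prompt wording and the 768 dimension are just examples):

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    chunks = ["chunk one text ...", "chunk two text ...", "chunk three text ..."]
    chunk_vecs = [np.random.randn(768) for _ in chunks]   # stand-ins for embed(chunk)
    question = "What does the contract say about termination?"
    q_vec = np.random.randn(768)                          # stand-in for embed(question)

    ranked = sorted(zip(chunks, chunk_vecs), key=lambda cv: cosine(q_vec, cv[1]), reverse=True)
    context = "\n\n".join(c for c, _ in ranked[:2])
    prompt = f"Given this information:\n{context}\n\nAnswer this question: {question}"
    # 'prompt' is what gets sent to the LLM.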



Actually I wanna use langchain. OpenAI is not free. I wanted to test two use cases:

- chat with documents (pdf, doc, etc.)

- chat with a website. Like, if I integrate with an ecommerce site, I can ask questions about the website. What free options do I have, both cloud and local?



Check out my little side project for chatting with PDFs. You should be able to load most models including this one. https://github.com/clarkmcc/chitchat


This looks cool so can it be used to feed Website/Products data in CSV/JSON format and "chat" with it?


Pretty much! Right now it only supports md, pdf, txt, and html, but supporting additional formats is trivial: https://github.com/clarkmcc/chitchat/blob/main/src-tauri/src....


No, this is an embedding model, not a text completion model.


Using the Bing tab of the Microsoft Edge browser, you can chat with PDFs, and I think they use GPT4 or equivalent.


Is there something like oobabooga to easily run this in a click-and-run way? Where I can load up a model, a text, and ask it questions?


See my comment here: https://news.ycombinator.com/item?id=38020655 for a CLI tool that lets you do this.

Note that embedding models are a different kind of thing from a Large Language Model, so it's not the kind of model you can ask questions.

It's a model which can take text and turn it into an array of floating point numbers, which you can then use to implement things like semantic search and related documents.

More on that here: https://simonwillison.net/2023/Oct/23/embeddings/



The Hugging Face page for the model has a two line load-and-encode Python code demo: https://huggingface.co/jinaai/jina-embeddings-v2-base-en


iirc ooba has its own integrated vectordb called superbooga.

I bet you could hack this in.



Does anyone know what they are using for this comparison and ranking? And where does instruct-xl stand in the mix?


Oh duh, it’s right in the post and instructor-xl is number 9. And so many new participants now!


The ranking are here:

https://huggingface.co/spaces/mteb/leaderboard

It’s amazing how many new and better ones there are since I last looked a few months ago. Instructor-xl was number 1, now it is number 9, and its size is more than 10x the number 2 ranked!

Things move fast!



Is this a text encoder model, BERT style?


Does it match OpenAI on number of params?


No one knows, since OpenAI has not disclosed the number of parameters their embeddings model uses.


What does this even do?


See this story from yesterday: https://news.ycombinator.com/item?id=37985489


Great company name.


I'm gonna try to explain this because I thought the same thing, though you may enjoy it for another reason. Among Czech or other Slavic software people, "jiná AI" could be like "another AI" and, to me at least, brings to mind the "yet another {thing}" naming convention (yacc = "yet another compiler compiler", for example).


Their OpenAI benchmark is GPT3 (text-embedding-ada-002), not GPT4.


"text-embedding-ada-002" isn't GPT3, it's a different kind of model. Embedding models and Large Language Models aren't the same thing.


LLMs and embedding models are certainly different, but it's a useful benchmark to calibrate expectations. OpenAI released text-embedding-ada-002 a year ago, and they describe the ada model as[1] "the original GPT-3 base model [...] capable of very simple tasks, usually the fastest model in the GPT-3 series".

It's fair to expect GPT3-level results - not GPT 3.5 and certainly not open-source tiny GPT4 as some might think when they read "rivaling OpenAI".

[1] https://platform.openai.com/docs/models/whisper



No, you're confusing two things here.

"text-ada-001" is LLM in the GPT3 family, described as "Capable of very simple tasks, usually the fastest model in the GPT-3 series, and lowest cost"

"text-embedding-ada-002" is entirely different - that page describes it as "Our second generation embedding model, text-embedding-ada-002 is a designed to replace the previous 16 first-generation embedding models at a fraction of the cost."



OpenAI doesn't say directly what text-embedding-ada-002 is, but in the release blog post they show that performance is comparable to davinci/curie, which places it firmly in the universe of GPT3. I understand it's not a straight line comparison, but to me it's still a useful mental heuristic about what to expect.

[1] https://openai.com/blog/new-and-improved-embedding-model (see "Model improvements")



You mean this table here?

    text-embedding-ada-002       53.3
    text-search-davinci-*-001    52.8
    text-search-curie-*-001      50.9
    text-search-babbage-*-001    50.4
    text-search-ada-*-001        49.0
That's not comparing it to the davinci/curie/babbage GPT3 models, it's comparing to the "search-text-*" family.

Those were introduced in https://openai.com/blog/introducing-text-and-code-embeddings as the first public release of embeddings models from OpenAI.

> We’re releasing three families of embedding models, each tuned to perform well on different functionalities: text similarity, text search, and code search. The models take either text or code as input and return an embedding vector.

It's not at all clear to me if there's any relationship between those and the GPT3 davinci/curie/babbage/ada models.

My guess is that OpenAI's naming convention back then was "davinci is the best one, then curie, then babbage, then ada".



How interesting. I assumed that a consistent codename such as Ada/Davinci refers to the lineage/DNA of the OpenAI model from which a distinct product was created. But I can see how these codenames could be "just" a revision label of A/B/C/D (Ada/Babbage/Curie/Davinci), similar to "Pro/Max/Ultra". If true, a product named "M2 Ultra" could have nothing to do with another product called "Watch Ultra".


Wow I genuinely hadn't noticed the A/B/C/D thing!


Reading through that article, the specific Davinci/Curie models they seem to be referring to are called the following: 'text-search-davinci-001', 'text-search-curie-001', 'text-similarity-davinci-001' and 'text-similarity-curie-001'.

Are you sure these have anything to do with 'text-davinci-003' or 'text-curie-001'?

Will have to agree with everyone here that OpenAI is good at being extremely confusing. It seems like the logic might be something along the lines of the 'text-search' portion being the actual type of the model, while the 'curie-001' / '-' format is just a personalized way of expressing the version of that type of model. And the whole 'GPT' category used to be a sort family of models, but now they've just switched it to the actual name of the newer gargantuan LLMs. Then, because the 'GPT' models are now that different thing altogether these days, the newest 'text-embedding' model is just named 'ada-' because it's on that iteration of the 'text-embedding' type of model, adhering to the older principle of naming their models? Not sure, ha. Definitely feels like doing some detective work.



tl;dr OpenAI is bad at product naming.


When people talked about GPT-3 they always referred to davinci which is the largest model, not ada.


Anyone got links to examples of text embedding?


Easiest example is taking three words: Universe, University, College.

- University and Universe are similar alphabetically.

- University and College are similar in meaning.

Take embeddings for those three words and `University` will be near `College`, while `Universe` will be further away, because embeddings capture meaning:

    University  College ...................................... Universe

With old school search you'd need to handle the special case of treating University and College as similar, but embeddings already handle it.

With embeddings you can do math to find how similar two results are, based on how close their vectors are. The closer the embeddings, the closer the meaning.
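
A quick way to see this for yourself with any small embedding model (the model named here is just a convenient example from sentence-transformers):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")
    uni, college, universe = model.encode(["University", "College", "Universe"])
    print(util.cos_sim(uni, college))    # expected to be noticeably higher...
    print(util.cos_sim(uni, universe))   # ...than this one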



Another interesting point is that math can be performed on embedding vectors: emb("king") - emb("man") + emb("woman") = emb("queen").
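
A minimal way to try that kind of vector arithmetic is with classic static word vectors via gensim (the pretrained vector set named here is one of gensim's built-in downloads, fetched on first use):

    import gensim.downloader as api

    wv = api.load("glove-wiki-gigaword-100")   # pretrained word vectors, ~100 MB download
    print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
    # 'queen' is typically the top result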


That's a property of Word2Vec specifically due to how it's trained (a shallow network where most of the "logic" would be contained within the embeddings themselves). Using it for embeddings generated from LLMs or Embedding layers will not give as fun results; in practice the only thing you can do is average or cluster them.


> That's a property of Word2Vec specifically due to how it's trained (a shallow network where most of the "logic" would be contained within the embeddings themselves).

Is it though? I thought the LLM-based embeddings are even more fun for this, as you have many more interesting directions to move in. I.e. not just:

emb("king") - emb("man") + emb("woman") = emb("queen")

But also e.g.:

emb(text) + a*v(sad) + b*v(short) - c*v(positive) = emb(text')

Where a, b, c are some constants to tweak, and v(X) is a vector for quality X, which you can get by embedding a bunch of texts expressing the quality X and averaging them out (or doing some other dimensional reduction trickery).

I've suggested this on HN some time ago, but only been told that I'm confused and the idea is not even wrong. But then, there was this talk on some AI conference recently[0], where the speaker demonstrated exactly this kind of latent space translations of text in a language model.

--

[0] - https://www.youtube.com/watch?v=veShHxQYPzo&t=13980s - "The Hidden Life of Embeddings", by Linus Lee from Notion.



That talk used a novel embeddings model trained by the speaker which does exhibit this kind of property - but that was a new (extremely cool) thing, not something that other embeddings models can do.


Interesting video. When he says "we decode the embedding", does he essentially mean that he is searching a vector database or something else?


The model is an encoder-decoder, which encodes some text into a latent embedding, and can then decode it back into text. It’s a feature of the model itself.


OpenAI have a brief explainer with a bunch of example use cases here:

https://platform.openai.com/docs/guides/embeddings/what-are-...



Color me surprised! It looks like it's actually open source (Apache 2.0) and not the usual false advertising by some two-faced company or institution. Links here:

* https://huggingface.co/jinaai/jina-embeddings-v2-base-en

* https://huggingface.co/jinaai/jina-embeddings-v2-small-en



Some relevant stats from the link:

8192 token input sequence length

768 embedding dimensions

0.27GB model (with 0.07GB model also available)

Tokeniser: BertTokenizer [1], 30528 token vocab [2]

Is an 8K sequence length directly comparable to text-embedding-ada-002 if the vocabulary is much smaller? I seem to remember its tokeniser has a larger vocabulary.

[1] https://huggingface.co/jinaai/jina-embeddings-v2-base-en/blo...

[2] https://huggingface.co/jinaai/jina-embeddings-v2-base-en/blo...



> Is an 8K sequence length directly comparable to text-embedding-ada-002 if the vocabulary is much smaller? I seem to remember its tokeniser has a larger vocabulary.

Words that aren't in the vocabulary can still be represented by multiple tokens. Some models can input and output valid UTF-8 at the byte level (rather than needing a unique token for each codepoint). For example RWKV-World.
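
For example, with a plain BERT WordPiece tokenizer (the Jina tokenizer may split differently, but the mechanism is the same):

    from transformers import BertTokenizer

    tok = BertTokenizer.from_pretrained("bert-base-uncased")
    print(tok.tokenize("electroencephalography"))
    # prints several '##'-prefixed subword pieces for this single rare word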



A large vocabulary means less tokens are needed to represent the same information


Thanks.


*fewer

Less is used for qualitative data like “I love him less”. Whereas fewer is used for countable things like “I need fewer tokens.”



Username checks out.


Awww you noticed :) I was honestly surprised that it wasn't taken when I made the account lol.

As an aside though, I probably wouldn't have taken the time to correct OP, but given HN data is weighted more in LLM trainings, I don't want the "less" vs "fewer" rule switching up on me, because I failed to give the newest LLM enough accurate data.



A uniform distribution over 30528 tokens is just under 15 bits of information per token, whereas a vocabulary size of ~60000 would be just under 16 bits per token. In practice it's not uniform, but this shows that they're in the same ballpark.
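
The arithmetic behind that, as a one-liner sketch:

    import math
    print(math.log2(30528))   # ~14.9 bits per token
    print(math.log2(60000))   # ~15.9 bits per token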


Thanks, what size GPU would you need to fine-tune or run inference?





