That's totally clever and sounds really useful. And it's one of those ideas where you go "why didn't I think of that?" when stumbling over the materials, word2vec in this case.
For a lot of cases, word2vec/GloVe still work plenty well. They also run much faster and lighter during development -- the FSE library [1] does 0.5M sentences/sec on CPU, whereas the fastest sentence_transformers [2] do something like 20k sentences/sec on a V100 (!!).

As for the drawbacks: word embeddings are only good at similarity-search-style queries -- stuff like paraphrasing. They'll necessarily struggle with negation: since word embeddings are generally summed or averaged into a sentence embedding, a negation won't shift the sentence vector around the way it would in an LM embedding. Homonyms are an issue too, but that's massively overblown as a reason to use LM embeddings (at least for Latin/Germanic languages). Most people use LM embeddings because they've been told it's the best thing by other people, rather than benchmarking accuracy and performance for their use case.

1. https://github.com/oborchers/Fast_Sentence_Embeddings

2. https://www.sbert.net/docs/sentence_transformer/pretrained_m...
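To make the averaging point concrete, here is a toy Go sketch. The four-dimensional vectors are made up for illustration (real word2vec vectors are 300-dimensional); the "not" vector is deliberately small, mimicking how averaging dilutes a single negation word:

```go
package main

import (
	"fmt"
	"math"
)

// Toy 4-d "word vectors" standing in for real word2vec embeddings;
// the point is the arithmetic, not the values.
var vecs = map[string][]float64{
	"the":  {0.1, 0.0, 0.1, 0.0},
	"food": {0.9, 0.2, 0.1, 0.3},
	"was":  {0.1, 0.1, 0.0, 0.1},
	"not":  {0.0, 0.1, 0.1, 0.0}, // small vector: gets diluted by the mean
	"good": {0.2, 0.9, 0.1, 0.4},
}

// sentenceVec averages the word vectors -- the standard
// bag-of-embeddings trick described above.
func sentenceVec(words []string) []float64 {
	avg := make([]float64, 4)
	for _, w := range words {
		for i, x := range vecs[w] {
			avg[i] += x
		}
	}
	for i := range avg {
		avg[i] /= float64(len(words))
	}
	return avg
}

func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	pos := sentenceVec([]string{"the", "food", "was", "good"})
	neg := sentenceVec([]string{"the", "food", "was", "not", "good"})
	// Adding "not" barely moves the mean, so the negated sentence
	// stays nearly identical under cosine similarity.
	fmt.Printf("cosine(pos, neg) = %.3f\n", cosine(pos, neg))
}
```

An LM embedding, by contrast, re-encodes the whole sentence, so a negation can shift every output dimension.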
I wonder if it would be possible to easily add support for multiple CPUs? It seems to use at most 150% CPU, so on my workstation it could be (assuming high parallelism) about 10 times as fast (a rough sketch of one approach follows this comment).
Alas, the word2vec repository has reached its download quota.
So here are other sources I found for it: https://stackoverflow.com/a/43423646

I also found https://huggingface.co/fse/word2vec-google-news-300/tree/mai... but I'm unsure if that's the correct format for this tool. The first source, from Google Drive, seems to work, and there's little chance of it being malicious.
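On the parallelism question: the tool's actual internals aren't shown in this thread, but per-line scoring is embarrassingly parallel, so a standard Go worker pool would be one way to use all cores. In this sketch, `scoreLine` and the threshold are hypothetical placeholders for the real work:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"sync"
)

// scoreLine is a hypothetical stand-in for the real per-line work
// (tokenize, look up word vectors, compare against the query vector).
func scoreLine(line string) float64 {
	return float64(len(line)) // placeholder score
}

func main() {
	const threshold = 0.7 // illustrative, not the tool's actual default
	lines := make(chan string, 1024)
	var wg sync.WaitGroup

	// One worker per core.
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for l := range lines {
				if scoreLine(l) > threshold {
					fmt.Println(l) // note: output order is not preserved
				}
			}
		}()
	}

	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		lines <- sc.Text()
	}
	close(lines)
	wg.Wait()
}
```

A real implementation would need to re-order output (e.g., by tagging lines with their index) to keep grep-like behavior.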
The model on Google Drive is the official model from Google and will work.

Haven't tried the Hugging Face model, but it looks very different. Unlikely to work.
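For anyone comparing the two: the Google file uses word2vec's original C binary layout -- an ASCII `<vocab> <dims>` header line, then each word followed by `dims` little-endian float32s -- while the Hugging Face repo packages the vectors differently, which is presumably why it looks wrong for this tool. A minimal Go reader for the original format, as a sketch rather than this tool's actual loader:

```go
package main

import (
	"bufio"
	"encoding/binary"
	"fmt"
	"io"
	"math"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("GoogleNews-vectors-negative300.bin")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	r := bufio.NewReader(f)

	// Header is plain ASCII: "<vocab size> <dimensions>\n".
	var vocab, dims int
	if _, err := fmt.Fscanf(r, "%d %d\n", &vocab, &dims); err != nil {
		panic(err)
	}

	// Loading all ~3M vectors takes several GB of RAM; a real tool
	// might filter the vocabulary first.
	vecs := make(map[string][]float32, vocab)
	raw := make([]byte, 4*dims)
	for i := 0; i < vocab; i++ {
		// Each word is terminated by a single space; some dumps also
		// put a newline between records, so trim both.
		word, err := r.ReadString(' ')
		if err != nil {
			panic(err)
		}
		word = strings.TrimSpace(word)

		if _, err := io.ReadFull(r, raw); err != nil {
			panic(err)
		}
		v := make([]float32, dims)
		for j := range v {
			v[j] = math.Float32frombits(binary.LittleEndian.Uint32(raw[4*j:]))
		}
		vecs[word] = v
	}
	fmt.Printf("loaded %d vectors of dim %d\n", len(vecs), dims)
}
```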
I wanna meet the person who greps "die", "kick the bucket", and "buy the farm" lol.

Are models like Mistral there yet in terms of tokens per second to run a grep over millions of files?
Mistral has published large language models, not embedding models? sgrep uses Google's word2vec to generate embeddings of the corpus and perform similarity searches against them, given a user query.
No, I got that. I asked because wouldn't embeddings generated by fine-tuned transformer-based LLMs be more context-aware? I don't know much about the internals, so apologies if this was a dumb thing to say.
This tool seems really cool; I want to play with it for sure.

Just in general, semantic search across text seems like a much better UX for many applications. Wish it were more prevalent.
Really cool. Often I just want to fuzzy-search for a word, and this would be useful. Can it do filenames as well? Or do I need to pipe in something like `ls` first?
Just played around with it -- not very smart per se. To get the full power of semantic grep, an LLM still might be needed. How would you do that? Is RAG the only way?
I assumed you meant computer languages :-) If you mean human languages: yes, Google publishes word2vec embeddings in many different human languages. Not sure, though, how easy they are to download.

The SIMD situation in Go is still rather abysmal; it's likely easier to just use FFI (though FFI is very slow, so I guess you're stuck with ugly Go asm if you're using short vectors). As usual, there's a particular high-level language that does this very well and has a standard vector similarity function nowadays...
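For context, here's what the scalar baseline looks like in pure Go, plus the manual unrolling people typically resort to. This is a sketch, not any particular library's code:

```go
package vecmath

// dot is a plain scalar dot product. The gc compiler does not
// auto-vectorize loops like this, so it runs one multiply-add
// per iteration.
func dot(a, b []float32) float32 {
	var s float32
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// dot4 is manually unrolled with independent accumulators: still
// scalar instructions, but breaking the dependency chain keeps more
// of the FPU busy. Real SIMD needs Go asm or FFI, and for short
// vectors cgo's call overhead eats the gain, as noted above.
func dot4(a, b []float32) float32 {
	var s0, s1, s2, s3 float32
	n := len(a) &^ 3 // round down to a multiple of 4
	for i := 0; i < n; i += 4 {
		s0 += a[i] * b[i]
		s1 += a[i+1] * b[i+1]
		s2 += a[i+2] * b[i+2]
		s3 += a[i+3] * b[i+3]
	}
	s := s0 + s1 + s2 + s3
	for i := n; i < len(a); i++ { // scalar tail
		s += a[i] * b[i]
	}
	return s
}
```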