(comments)

Original link: https://news.ycombinator.com/item?id=39923404

The poster shares their experience building a project on top of local large language models (LLMs) and Ollama, which they describe as a "locally running search engine". They explain that it relies on models trained on function signatures and structured output, and that in the end it all comes down to string manipulation: strings in, strings out. They say they would like to write a more detailed explanation in the future and welcome others to follow along. They offer advice to computer science students just starting out, suggesting they focus on the fundamentals before diving into advanced topics. They mention models and tools such as starling-lm-beta (7B) and librechat, invite users to explore the linked projects, and recommend a certain level of quantization for better performance. They discuss using models such as Hermes-2-Pro-Mistral to produce structured text, stressing the importance of feedback and editing for improving generated responses, and suggest that beginners start with librechat and ollama. They acknowledge limitations, such as the lack of interconnection between Ollama and external data sources, and encourage readers to follow the subreddits on local LLMs for more information. Overall, they are excited about the potential of LLMs and their applications.


Original thread


Happy to answer any questions and open for suggestions :)

It's basically an LLM with access to a search engine and the ability to query a vector db.

The top n results from each search query (initiated by the LLM) are scraped, split into little chunks and saved to the vector db. The LLM can then query this vector db to get the relevant chunks. This obviously isn't as comprehensive as having a 128k-context LLM just summarize everything, but at least on local hardware it's a lot faster and way more resource friendly. The demo on GitHub runs on a normal consumer GPU (AMD RX 6700 XT) with 12 GB of VRAM.
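For anyone curious what that chunk-and-retrieve step boils down to, here is a rough sketch in Go (illustrative only, not the project's actual code; the helper names are made up, and the toy embed() stands in for a real embedding model):

```go
// Sketch of the chunk -> embed -> retrieve flow described above.
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// chunk splits text into pieces of roughly `size` words with `overlap` words of shared context.
func chunk(text string, size, overlap int) []string {
	words := strings.Fields(text)
	var chunks []string
	for start := 0; start < len(words); start += size - overlap {
		end := start + size
		if end > len(words) {
			end = len(words)
		}
		chunks = append(chunks, strings.Join(words[start:end], " "))
		if end == len(words) {
			break
		}
	}
	return chunks
}

// embed is a crude letter-frequency vector, a placeholder for a real embedding model.
func embed(text string) []float64 {
	v := make([]float64, 26)
	for _, r := range strings.ToLower(text) {
		if r >= 'a' && r <= 'z' {
			v[r-'a']++
		}
	}
	return v
}

// cosine similarity between two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	page := "Llamas typically weigh between 130 and 200 kilograms. They are domesticated camelids from South America."
	chunks := chunk(page, 8, 2)

	// Rank chunks against the query; the best ones get fed back to the LLM.
	query := embed("weight of a llama")
	sort.Slice(chunks, func(i, j int) bool {
		return cosine(embed(chunks[i]), query) > cosine(embed(chunks[j]), query)
	})
	fmt.Println("most relevant chunk:", chunks[0])
}
```

A real setup would persist the vectors in a proper vector db rather than ranking in memory, but the flow is the same.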



If you're open to it, it would be great if you could make a post explaining how you built this. Even if it's brief. Trying to learn more about this space and this looks pretty cool. And ofc, nice work!


Guys, I didn't think there would be this much interest in my project haha. I feel kinda bad for just posting it in this state haha. I would love to make a more detailed post on how it works in the future (keep an eye on the repo?)


Thanks! As a CS student interested in learning more about this space how do you recommend I get started? I'm pretty early in my education so I kind of want to learn how to drive the car for now and learn how the engine works more formally later, if you know what I mean.


Wonderful work!

Is it possible to make it only use a subset of the web (only sites that I trust and think are relevant to producing an accurate answer)? Are there ways to make it work offline on pre-installed websites (Wikipedia, some other wikis, and possibly news sites that are archived locally)? And how about other forms of documents (books and research papers as PDFs)?



I was most interested in the offline aspect of it, which I wouldn't know where to even start with if I were to fork.

How do you parse and efficiently store large, unstructured information for arbitrary, unstructured queries?



Your project looks very cool. I had on my ‘list’ to re-learn Typescript (I took a TS course about 5 years ago, but didn’t do anything with it) so I just cloned your repo so I can experiment with it.

EDIT: I just noticed that most of the code is Go. Still going to play with it!



This might be more of a searxng question, but doesn't it quickly run up against anti-bot measures? CAPTCHA challenges and Forbidden responses? I can see the manual has some support for dealing with CAPTCHA [1], but in practical terms, I would guess a tool like this can't be used extensively all day long.

I'm wondering if there's a search API that would make the backend seamless for something like this.

1. https://docs.searxng.org/admin/answer-captcha.html



As a last resort we could have AI working on top of a real web browser and solving CAPTCHAs as well; it should look like normal usage. I think these kinds of systems (LLM + RAG + web agent) will become widespread and the preferred way to interact with the web.

We can escape all ads and dark UI patterns by delegating this task to AI agents. We could have it collect our feeds, filter, rank and summarize them to our preferences, not theirs. I think every web browser, operating system and mobile device will come equipped with its own LLM agent.

The development of AI screen agents will probably get a big boost from training on millions of screen capture videos with commentary on YouTube. They will become a major point of competition on features. Not just browser, but also OS, device and even the chips inside are going to be tailored for AI agents running locally.



If content creators can't find anything that is uniquely human and cannot be made by AI, then maybe they are not creative enough for the job. The thing about generative AI is that it can take context, you can put a lot or very little guidance in it. The more you specify, the more you can mix your own unique sauce in the final result.

I personally use AI for text style changes, as a summarizer of ideas and as rubber duck, something to bounce ideas off of. It's good to get ideas flowing and sometimes can help you realize things you missed, or frame something better than you could.



I didn't run into a lot of timeouts while using it myself, but you would probably need another search source if you plan to host this service for multiple users at the same time.

There are projects like FlareSolverr which might be interesting



To scrape the websites, do you just blindly cut all of the HTML into fixed-size chunks, or is there some more sophisticated logic to extract the text of interest?

I'm wondering because most news websites now have a lot of polluting elements like popups; would those also go into the database?



If you look at the vector handler in his code, he is using the bluemonday sanitizer and doing some "replaceAll" calls.

So I think there may be some useless data in the vectors, but that may not be an issue since the answer is drawn from multiple sources (for simple questions at least)
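For illustration, a sanitize-then-cleanup step along those lines might look like this in Go (a sketch, not the project's actual code; note how the popup text survives tag stripping, which is exactly the "useless data" concern above):

```go
// Strip all HTML tags with bluemonday, then tidy whitespace before chunking.
package main

import (
	"fmt"
	"strings"

	"github.com/microcosm-cc/bluemonday"
)

func extractText(html string) string {
	text := bluemonday.StrictPolicy().Sanitize(html) // drop every tag, keep text nodes
	text = strings.ReplaceAll(text, "\n", " ")
	text = strings.ReplaceAll(text, "\t", " ")
	for strings.Contains(text, "  ") { // collapse runs of spaces left behind
		text = strings.ReplaceAll(text, "  ", " ")
	}
	return strings.TrimSpace(text)
}

func main() {
	page := `<div class="popup">Subscribe!</div><p>Llamas weigh 130 to 200 kg.</p>`
	fmt.Println(extractText(page)) // the popup text still leaks into the output
}
```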



Sure (if they are OpenAI-API compatible I can add them within minutes), otherwise I'm open to pull requests :)

Also, I don't own an Nvidia card or a Windows / macOS machine



This is awesome. I would love it if there were executable files for the parts where these dependencies are needed. That would make it way more accessible, rather than limited to those who know how to use the command line and resolve dependencies (yes, even Docker runs into that when fighting the local system).


In five years' time, by 2030, I foresee that a lot of inference will be happening on local machines, with models being downloaded on demand. Think a Docker registry of AI models, which is pretty much what Hugging Face already is.

This will all come from optimisations in model inference code and techniques, in hardware, and in the packaging of software like the above.

I don't see the billion-dollar valuations of lots of AI startups out there materialising into anything.



> I foresee that lots of inference would be happening on local machines with models being downloaded on demand

Why? It's much more efficient to have centralized special purpose hardware to run enormous models and then ship the comparatively small result over the internet.

By analogy, you don't have a search engine running on your phone right?



You currently can't have a search engine running locally on your phone. Google Search is possibly the single largest C++ program ever built. And never mind the storage needs...

But in a few years we might be able to have LLMs running on our phones that work just as well, if not better. Of course, as you mention, the LLMs running on large servers might still be much more powerful, but the local ones might be powerful enough.



No, a more appropriate analogy would be driving your own billion-dollar super-yacht vs driving your own car.

It will not happen any time soon. Consumer hardware can't even run GPT-4 locally, and won't be able to for a looong time. Each GPT-4 instance runs on 8 A100s. The cost of such a system is ~$81K, not even in the ballpark of what most consumers can afford.



"640kb will be enough for everyone." (Gates)

I think that the models will evolve and grow as more powerful compute/hardware comes out.

You may be able to run scaled-down versions of what is state of the art now, but by then the giant models will have grown in size and in required compute.

The 6-year-old models will feel like retro computing.

Somewhat like how you can play 6-year-old games on a powerful new PC, but by then the new huge games will no longer play well on your old machine.



Unfortunately training is insanely expensive. StabilityAI is struggling to stay alive, and Anthropic wants to spend $100 BILLION building a supercomputer just for training.

That's why I think these private companies will have the best AIs for many decades.



A while back you commented on my personal project Airdraw which I really appreciated. This looks awesome and you're well on your way to another banger project - looking forward to toying around with this :)


Whenever I see these projects I always find reading the prompts fascinating.

> Useful for searching through added files and websites. Search for keywords in the text not whole questions, avoid relative words like "yesterday" think about what could be in the text.

> The input to this tool will be run against a vector db. The top results will be returned as json.

Presumably each clarification is an attempt to fix a bug experienced by the developer, except the fix is in English not in Go.



This is cool. I haven't run it yet but it seems really promising. I'm thinking about how this could be super useful to hook into internal corporate search engines and then get answers from those.

Good to see more of these non-API-key products being built (connected to local LLMs)



Excellent work! Cool side projects like that will eventually help you get hired by a top startup or may even lead to building your own.

I can only encourage other makers to post their projects on HN and put them out into the world.



Impressive, I don't think I've seen a local model call upon specialised modules yet (although I can't keep up with everything going on).

I too use a local 7B OpenHermes and it's really good.



Thanks :). It's just a lot of prompting and string parsing. There are models like "Hermes-2-Pro-Mistral" (the one from the video) which are trained to work with function signatures and to output structured text. But in the end it's just strings in > strings out, haha. But it's fun (and sometimes frustrating) to use LLMs for flow control (conditions, loops...) inside your programs.


It's my go-to "structured text model" atm. Try "starling-lm-beta" (7B) for some very impressive chat capabilities. I honestly think that it outperforms GPT3 half the time.


Sorry to repeat the same question I just asked the other commenter in this thread, but could you link the model page and recommend a specific level of quantization for the models you've referenced? I'd love to play with these models and see what you're talking about.


Thank you — from that page, at the bottom, I was able to find this link to what I think are the quantized versions

https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-...

If you have the time, could you explain what you mean by "Q5 is minimum"? Did you determine that by trying the different models and finding this one is best, or did someone else do that evaluation, or is that just generally accepted knowledge? Sorry, I find this whole ecosystem quite confusing still, but I'm very new and that's not your problem.



Talking GGUF: usually, the higher you can afford to go wrt quantization (e.g. Q5 is better than Q4, etc.), the better. A Q6_K has minimal performance loss compared to Q8, so in most cases, if you can fit a Q6_K, it's recommended to just use that. TheBloke's READMEs [0] usually have a good table summarizing each quantization level.

If you're RAM constrained, you'll also have to make trade-offs about the context length. e.g. you could have 8 GB RAM and a Q5 quant with shorter context, vs Q3 with longer, etc.

[0]:https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF
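As a rough sense of what "fitting" means, weight size scales with the nominal bits per weight. A back-of-envelope sketch (my own rough numbers; it ignores GGUF overhead and the KV cache, which is what actually grows with context length):

```go
// Approximate weight sizes for a 7B model at different nominal quantization levels.
package main

import "fmt"

func main() {
	const params = 7e9 // 7B parameters
	quants := []struct {
		name string
		bits float64
	}{
		{"Q3", 3}, {"Q4", 4}, {"Q5", 5}, {"Q6", 6}, {"Q8", 8},
	}
	for _, q := range quants {
		gb := params * q.bits / 8 / 1e9
		fmt.Printf("%s: ~%.1f GB of weights\n", q.name, gb)
	}
	// Whatever RAM is left after the weights goes to the KV cache,
	// which is why a lower quant can buy you a longer context.
}
```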



Absolutely, a 7B will run comfortably on 16GB of RAM and most consumer-level hardware. Some of the 40B models run on 32GB, but I found it depends on the model (GGUF, fingers crossed, helps).

I originally ran this on an M1 with 32GB; now I run it on an Air M2 with 16GB (and a Mac mini M2 with 32GB), no problem.

I use llama.cpp with a SwiftUI interface (my own), all native, no scripts python/js/web.

7B is obviously less capable, but the instant response makes it worth exploring. It's very useful as a Google search replacement that is instantly more valuable, for general questions, than dealing with the hellscape of blog spam ruling Google atm.

Note: for my complex code queries at $dayjob where time is of the essence, I still use GPT-4 Plus, which is still unmatched imho, at least without running special hardware.



This is so cool! And the fact that you can use Ollama as the 'LLM backend' makes it sustainable. I didn't see how to switch models in the demo; that might be worth highlighting in the readme.


I have a 'feature request': can we manage which sites are used via categories in the frontend? For example, if I build a list of websites and put them under "coding", then I'd like those to be used to answer my programming questions. Meanwhile, I'd like to add an "art" category for museum homepages so that I can ask which year painting XYZ is from. And so on. The current implementation looks like the interoperability with searxng is more static... IDK if searxng has an API to switch those filters, or if they can already be managed through 'profiles'... that kind of thing.


Why call it "Perplexity clone" when this is much more than what Perplexity offers?

Btw, this is the first time I've heard about Perplexity, which, after 10 minutes of experimentation, looks like a worse clone of Phind.



This is really neat! I have questions:

The "needs tool usage" and "found the answer" blocks in your infra: how are these decisions made?

Looking at the demo, it takes a little time to return results. Of the search, vector storage and vector db retrieval, which step takes the most time?



Thanks :)

The LLM makes these decisions on its own. If it writes a message which contains a tool call (Action: Web search Action Input: weight of a llama), the matching function will be executed and the response returned to the LLM. It's basically chatting with the tool.

You can toggle the log viewer on the top right to get more detail on what it's doing and what is taking time. Timing depends on multiple things:

- the size of the top n articles (generating embeddings for them takes some time)
- the number of matching vector DB responses (reading them takes some time)
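For anyone wondering what "chatting with the tool" means mechanically, here is a rough sketch of that loop (it assumes the Action / Action Input format from the example above; the helper names are made up and this is not the project's actual parser):

```go
// Scan the LLM's reply for an Action / Action Input pair, run the matching
// function, and hand the result back to the LLM as an observation.
package main

import (
	"fmt"
	"strings"
)

type tool func(input string) string

var tools = map[string]tool{
	"Web search": func(q string) string { return "top results for: " + q },    // placeholder
	"Vector db":  func(q string) string { return "matching chunks for: " + q }, // placeholder
}

// parseAction extracts the tool name and its input from an LLM reply, if present.
func parseAction(reply string) (name, input string, ok bool) {
	for _, line := range strings.Split(reply, "\n") {
		if v, found := strings.CutPrefix(line, "Action Input:"); found {
			input = strings.TrimSpace(v)
		} else if v, found := strings.CutPrefix(line, "Action:"); found {
			name = strings.TrimSpace(v)
		}
	}
	return name, input, name != "" && input != ""
}

func main() {
	reply := "Action: Web search\nAction Input: weight of a llama"
	if name, input, ok := parseAction(reply); ok {
		if run, exists := tools[name]; exists {
			// The observation gets appended to the chat and sent back to the LLM,
			// which then decides whether it has found the answer or needs another tool.
			fmt.Println("Observation:", run(input))
		}
	}
}
```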



> Q: is chrome on ios powered by safari

> According to the sources provided, Chrome on iOS is not powered by Safari. Google's Chrome uses the Blink engine, while Safari uses the WebKit engine.

I find it amusing how when people show off their LLM projects their examples are always of it failing, and providing a bad answer.



Well, I don't intend to get money from people, so I guess showing real results isn't a "problem".

Besides, I think the following sentences aren't wrong? It's just a 7B model, give it some slack haha



Speaking of LLMs... here's my "dear lazyweb" to HN:

What would be the best self hosted option to build sort of a textual AI assistant into your app? Preferably something that I can train myself over time with domain knowledge.



Fine-tuning on your own knowledge probably isn't what you want to do; you probably want to do retrieval augmented generation instead. Basically a search engine over some local documents, and you put the results of the search into your prompt. The search engine uses the same vector space as your language model as its index, so the results should be highly relevant to whatever the prompt is.
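Concretely, the "put the results of the search into your prompt" step is just string assembly. A minimal illustrative sketch (the prompt wording and helper name are my own):

```go
// Paste retrieved chunks into a context block ahead of the user's question.
package main

import (
	"fmt"
	"strings"
)

func buildPrompt(question string, retrieved []string) string {
	var b strings.Builder
	b.WriteString("Answer the question using only the context below.\n\nContext:\n")
	for i, chunk := range retrieved {
		fmt.Fprintf(&b, "[%d] %s\n", i+1, chunk)
	}
	b.WriteString("\nQuestion: " + question + "\nAnswer:")
	return b.String()
}

func main() {
	chunks := []string{
		"Llamas typically weigh between 130 and 200 kg.",
		"Llamas are domesticated South American camelids.",
	}
	// This string is what actually gets sent to the local model.
	fmt.Println(buildPrompt("How much does a llama weigh?", chunks))
}
```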

I'd start with "librechat" and mistral, so far that's one of the best chat interfaces and has good support for self hosting. For the actual model runner, ollama seems to be the way to go.

I believe it's built on "langchain", so you can switch to that when it makes sense to. Once you've tested all your queries and setup with librechat, keep in mind that librechat is essentially a wrapper around "langchain".

I'd start by testing the workflow in librechat, and if librechat's API doesn't do what you want, well, I've always found FastAPI pleasant to work with.

---

Less for your use case and more in general: I've been assessing a lot of LLM interfaces lately, and the weird porn community has some really powerful and flexible ones. With SillyTavern you can set up multiple agents, have one agent program, another agent critique, and a third assess it for security concerns. This kind of feedback can help catch a lot of LLM mistakes. You can also go back and edit the LLM's response, which can really help: if you edit an LLM message to fix code or change variable names, it will tend to stick with those decisions. But those interfaces are still very much optimized for "role playing".

Recommend keeping an eye on https://www.reddit.com/r/LocalLLaMA/



Thanks, will check out librechat etc. It's interesting that fine-tuning is no longer the thing to do. I'm not clear on how one connects librechat to local data, but I'm sure I will be when I dive deeper into this.


I had a quick poke through the source and it seems like there's not much reason this couldn't run on macOS? It seems that Ollama is doing the inference and then there's a Go binary doing everything else? I might give it a go and see what happens!


It says it's a "locally running search engine", but I'm not sure how it finds the sites and pages to index in the first place?


Yeah, I guess that's misleading; I should probably change that. I was referring to the LLM part as locally running. Indexing is still done by the big guys and queried using searxng.


It would be good if the readme mentioned minimum hardware specs for reasonably decent performance. E.g. I have a ThinkPad X1 Extreme i7 with Max-Q graphics; any hope of running this on it without completely ruining the performance?


You could run the LLM using your CPU and normal (non-video) RAM, but that's a lot slower. There are people working on making it a lot faster though. The bottleneck is the transfer speed between the RAM sticks and the CPU.

Just taking a guess, but I wouldn't expect more than a couple of tokens (more or less like syllables) per second. Which is probably too slow, since it has to read a couple thousand of them per search result.

It's hard to provide minimum requirements, since there are so many edge cases.
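A back-of-envelope for why RAM bandwidth is the ceiling (rough assumed numbers, not measurements): each generated token needs roughly one full pass over the weights, so tokens per second is capped at bandwidth divided by model size.

```go
// Rough upper bound on CPU token generation speed, assuming the model is
// streamed from main memory once per token.
package main

import "fmt"

func main() {
	const (
		modelGB     = 4.0  // ~7B model at Q4 quantization (assumption)
		bandwidthGB = 51.2 // dual-channel DDR4-3200 in GB/s (assumption)
	)
	// Real throughput lands well below this ceiling, and the few thousand
	// tokens of each search result still have to be processed first.
	fmt.Printf("theoretical ceiling: ~%.0f tokens/s\n", bandwidthGB/modelGB)
}
```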



Awesome project! As a newbie myself in everything LLM, where should I start looking to create a project similar to yours? Which resources/projects are good to know about? Thank you for sharing!


I think the easiest entry point would be the python langchain project? It has a lot more documentation and working examples than the golang one I've used :)

If you could tell me more about your goals, I can probably provide a more narrow answer :)



Excellent work! I plan to use it with existing LLMs tbh, but great to see it working locally also! Thank you so much for sharing. I love the architecture.


Did you really make a perplexity clone if you didn’t spend more time promoting yourself on Twitter and LinkedIn than on the engineering?


I've been working on a small personal project similar to this and agree that replicating the overall experience provided by Perplexity.ai, or even improving it for personal use, isn't that challenging. (The concerns of scale or cost are less significant in personal projects. Perplexity doesn't do too much planning or query expansion, nor does it dig super deep into the sources afaik)

I must say, though, that they are doing a commendable job integrating sources like YouTube and Reddit. These platforms benefit from special preprocessing and indeed add value.



I assume the same; it feels like their product is just summarizing the top n results? I wouldn't need the whole vector db thing if local models (or hardware) were able to run with a context of this size.


That is probably exactly why they got funding. You can sell it as focusing on adding new features and leveraging the best available tools before reinventing the wheel.

They do train their own models now, but for about a year they just forwarded calls to models like GPT-3.5 Turbo. You still have the option to use models not trained by Perplexity.



Wait, are you directly comparing Perplexity and C.ai or Pi? Perplexity is a search engine, Pi is a chatbot, and C.ai is roleplay? Their value propositions are very different