Ask HN: How do I train a custom LLM/ChatGPT on my own documents in Dec 2023?

Original link: https://news.ycombinator.com/item?id=38759877

Langroid provides a number of RAG techniques, including lexical and semantic retrieval, re-ranking, and relevance extraction, which can significantly improve precision and recall. In addition, Langroid simplifies multi-agent setups through its conversational task loop, enabling separation of concerns, modularity, and easier state management. The example notebook provided demonstrates how to use Langroid to extract structured information from documents, using open-ended questions posed by the LLM to generate candidate passages, which are then parsed and matched against the desired structure. Overall, Langroid offers flexibility and scalability for managing complex tasks and structuring data while leveraging LLM capabilities.

As for implementing RAG for a specific language (Turkish, in this example), the work mainly involves using a RAG library that supports the language, training and fine-tuning the RAG model on a text corpus in that language, and handling language-specific nuances such as morphology and syntax. Many popular RAG libraries (such as PyTorch FairSeq and LLAP-NL) support multiple languages and allow easy integration into applications at varying levels of expertise. In addition, RAG libraries specialized for particular languages (such as TLR's Retriever-Generator and SURGE) can offer better performance and accuracy. Ultimately, choosing the right RAG implementation, or developing a custom one tailored to a specific language, requires careful consideration of factors such as training-data availability, application context, compute constraints, and cost.

Ask HN: How do I train a custom LLM/ChatGPT on my own documents in Dec 2023?
693 points by divan 1 day ago | 220 comments
There is a 5-month-old thread [1] on this, but it might already be outdated.

What is the best approach for feeding a custom set of documents to an LLM and getting non-hallucinating, decent results in Dec 2023?

UPD: The question is generally about how to "teach" an LLM to answer questions using your set of documents (not necessarily training your own, so approaches like RAG count)

[1] https://news.ycombinator.com/item?id=36832572


You don't train on documents. There are many startups claiming that but they are deliberately using a misleading term because they know that's what people are searching for.

You still do RAG. Llamaindex is still the best option that I know of. Most of the startups that have working products are likely using llamaindex. All of the ones that say they are training on documents are actually using RAG.

Test it out. If it really and truly doesn't work, search for a script that creates question and answer pairs automatically with gpt-4. Then try using that for qLoRA. I have never heard of anyone successfully using that for a private document knowledgebase though. Only for skills like math, reasoning, Python, etc. I think the issue is that you need a LOT of data, and it needs to repeat concepts or any facts you need it to learn many, many times in different supporting ways.

What absolutely does not work is trying to just feed a set of documents into fine tuning. I have personally proven that dozens of times because I had a client who was determined to do it. He has been misled.

What it will do is learn the patterns that are in those documents.



We just held a workshop about this a few weeks ago: https://red.ht/llmappdev We created a simple chatbot using local models with Ollama (llamacpp), LlamaIndex and streamlit. Have a look at the streamlit folder, it's super easy.

I used this simple example to teach about RAG, the importance of the system prompt and prompt injection. The notebook folder has a few more examples, local models can even do natural language SQL querying now.
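
For reference, a minimal sketch of that kind of local setup, assuming the LlamaIndex 0.9-era API and an Ollama server already running a model (exact import paths may have moved since):

  from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
  from llama_index.llms import Ollama

  # Point LlamaIndex at a local model served by Ollama and a local embedding model
  llm = Ollama(model="mistral")
  service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")

  # Load files from a folder, build a vector index, and ask a question
  documents = SimpleDirectoryReader("./docs").load_data()
  index = VectorStoreIndex.from_documents(documents, service_context=service_context)
  query_engine = index.as_query_engine()
  print(query_engine.query("What does the workshop say about prompt injection?"))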



Llamaindex has so much potential. Any benchmarks on performance compared to fine-tuning?


You probably don't need fine-tuning, at least if it's just new content (and no new instructions). It may even be detrimental, since LLMs are also good at forgetting: https://twitter.com/abacaj/status/1739015011748499772


looks very promising, do you plan to keep this single repo up to date as new things are released?


Good question, as you can see I haven't touched it for a month. I wanted to show what's possible then with open source and (open) local models and there's already so much new stuff out there.

I'll probably fix some things this week and then either update it or start from scratch. Guided generation, structured extraction, function calling and multi-modal are things I wanted to add and chainlit looks interesting.



What is RAG? That's hard to search for


This one seems like a good summary

Retrieval-Augmented Generation for Large Language Models: A Survey

https://arxiv.org/abs/2312.10997

The photos of this post are also good for a high level look

https://twitter.com/dotey/status/1738400607336120573/photo/2

From the various posts I have seen people claim that phi-2 is a good model to start off from.

If you just want to do embeddings, there are various tutorials to use pgvector for that.



Retrieval Augmented Generation - in brief, using some kind of search to find documents relevant to the user’s question (often vector DB search, which can search by “meaning”, but also other forms of more traditional search), then injecting those into the prompt to the LLM alongside the question, so it hopefully has facts to refer to (and its “generation” can be “augmented” by documents you’ve “retrieved”, I guess!)


So, as a contrived example, with RAG you make some queries, in some format, like “Who is Sauron?” And then start feeding in what books he’s mentioned in, paragraphs describing him from Tolkien books, things he has done.

Then you start making more specific queries? How old is he, how tall is he, etc.

And the game is you run a “questionnaire AI” that can look at a blob of text, and you ask it “what kind of questions might this paragraph answer”, and then turn around and feed those questions and text back into the system.

Is that really a 30,000-foot view of how this works?



The 3rd paragraph missed the mark but previous ones are in the right ballpark.

You take the user's question and either embed it directly or augment it for embedding (you can, for example, use an LLM to extract keywords from the question), query the vector db containing the data related to the question, and then feed it all to the LLM as: here is a question from the user and here is some data that might be related to it.



Essentially you take any decent model trained on factual information regurgitation, or well any decently well rounded model, a llama 2 variant or something.

Then you craft a prompt for the model along the lines of "you are a helpful assistant, you will provide an answer based on the provided information. If no information matches simply respond with 'I don't know that'".

Then, you take all of your documents and divide them into meaningful chunks, ie by paragraph or something. Then you take these chunks and create embeddings for them. An embedding model is another type of model (not an LLM) that generates vectors for strings of text, often based on how similar the words are in _meaning_. Ie if I generate embeddings for the phrase "I have a dog" it might (simplified) be a vector like [0.1,0.2,0.3,0.4]. This vector can be seen as representing a point in a multidimensional space. What an embedding model does with word meaning is something like this: if I want to search for "cat", that might embed as a vector [0.42]. Now, say we want to search for the query "which pets do I have": first we generate embeddings for this phrase, and the word "pet" might be embedded as [0.41] in the vector. Because it's based on trained meaning, the vectors for "pet" and for "dog" will be close together in our multidimensional space. We can choose how strict we want to be with this search (basically a limit to how close the vectors need to be together in space to count as a match).

Next step is to put this into a vector database, a db designed with vector search operations in mind. We store each chunk, the part of the file it's from and that chunks embedding vector in the database.

Then, when the LLM is queried, say "which pets do I have?", we first generate embeddings for the query, then we use the embedding vector to query our database for things that match close enough in space to be relevant but loose enough that we get "connected" words. This gives us a bunch of our chunks ranked by how close that chunks vector is to our query vector in the multidimensional space. We can then take the n highest ranked chunks, concatenate their original text and prepend this to our original LLM query. The LLM then digests this information and responds in natural language.

So the query sent to the LLM might be something like: "you are a helpful assistant, you will provide an answer based on the provided information. If no information matches simply respond with 'I don't know that'

Information:I have a dog,my dog likes steak,my dog's name is Fenrir

User query: which pets do I have?"

All under "information" is passed in from the chunked text returned from the vector db. And the response from that LLM query would ofc be something like "You have a dog, its name is Fenrir and it likes steak."
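
A compressed sketch of that whole flow, assuming sentence-transformers for the embeddings; the llm() call at the end is a placeholder for whatever model you use:

  import numpy as np
  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer("all-MiniLM-L6-v2")
  chunks = ["I have a dog", "my dog likes steak", "my dog's name is Fenrir"]
  chunk_vecs = model.encode(chunks, normalize_embeddings=True)

  def answer(question, top_n=3):
      q = model.encode([question], normalize_embeddings=True)[0]
      sims = chunk_vecs @ q                  # cosine similarity (vectors are normalized)
      best = np.argsort(sims)[::-1][:top_n]  # n closest chunks
      context = "\n".join(chunks[i] for i in best)
      prompt = (
          "You are a helpful assistant, you will provide an answer based on the "
          "provided information. If no information matches simply respond with "
          "'I don't know that'\n\n"
          f"Information: {context}\n\nUser query: {question}"
      )
      return llm(prompt)  # placeholder: call your LLM of choice here

A real setup swaps the in-memory numpy search for a vector database, but the shape of the final prompt is the same.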



Stupid Question: Eli5; Can/Does/Would it make sense to 'cache' (for lack of a better term) a 'memory' of having answered that question... so that if the question is asked again, it knows that it has answered it in the past, and can do better?

(Seems like this is what reinforcement training is, but I am just not sure? Everything seems to mush together when talking about gpts logic)



Off Topic;

It fascinates me how much variance there is in people's searching skills.

Some people think they are talking to a person when searching, e.g. 'what is the best way that i can {action}'. I think the number one trick is to forget grammar and other language niceties and just enter concepts, e.g. 'clean car best'.



I used to do this. Then when Google's search results started declining in quality, I often found it better to search by what the average user would probably write.


and what would an average user write?


Over the last couple of years, at least with Google, I've found that no strategy really seems to work all that well - Google just 'interprets' my request and assumes that I'm searching for a similar thing that has a lot more answers than what I was actually searching for, and shows me the results for that.


Some concepts seems to be permanently defined as a spelling error and will just be impossible to search for.


I found something very annoying while looking for technical data (a service manual for an ancient medical device, built around 2001).

The search term was the name of the device + something about the power source.

The results from the client network (my phone / a client computer): nothing related to the search for 4-5 pages.

The same search from work: the second result was what I was looking for.

So it seems there is a relation to your search history, but somehow connected with the related search history from the same IP/network.



Same experience. I'm generally getting better results on a client's (VPN) network; we are all googling for the same stuff, I guess.

It must be possible to create a fixed set of google searches and rate the location based on the results. So you could physically travel to a Starbucks 20 miles away to get the best results for 'best USB-C dongle reddit'.



That’s why they will love chatgpt


Retrieval-augmented generation, RAG + LLM will turn up more results.


Seems fairly easy to search for to me - top results are all relevant:

https://kagi.com/search?q=ml+rag

https://www.google.com/search?q=ml+rag



"Retrieval augmented generation". I found success from "rag llm tutorial" as a search input to better explain the process.


RAG: having an LLM spew search queries for you because your search-fu is worse than a chat bot's hallucinations.

or because you want to charge your client the "ai fee".

or because your indexing is so bad you hide it from your user and blame the llm assistant dept.



Ask chatgpt next time. "What is rag in context of AI?"


Or just using a traditional search engine and "rag" plus literally any ML/AI/LLM term will yield a half dozen results at the top with "Retrieval-augmented generation" in the page title.


Or people could just not use obscure acronyms when discussing specialised topics on an open forum?


Where do you draw the line though?


Or if GGP can't think of an AI-related term they can use HN search. Searching 'rag' shows the term on the first page of results:

https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...



Searching for "RAG" on Kagi and Google give some AI-related results fairly high up, including results that explain it and say what it stands for.


What percentage of people could you fool if you told them it was AI and replayed standard search results, but with the "karaoke-like" prompt that highlights each word (as if we're 2nd graders in Special Ed learning how to string more than 2 sentences together)?


Right? How does someone who browses this forum not know how to find knowledge online?


To sing the praises of Bedrock again, it does have continuous pre-training as well as RAG “knowledge bases”. The former is based on JSON fragments and the RAG stuff is PDFs and other document formats.

With regards to its efficacy, I haven’t gone to production with it yet but I was reasonably impressed.

I uploaded 100 legal case documents to Bedrock via Claude and could push it pretty hard asking about the various cases and for situations across the knowledge base.

It did feel like it broke down and got confused at a certain point of complexity of questioning, but I still think it’s already useful as a “copilot” or search engine and surely it will only improve over time.



I forgot about the continuous pre-training thing. How long did it take and how much did it cost on Bedrock?

I had tried to suggest continuous pre-training to my client, but it seemed expensive, and when I mentioned that he lost interest and just kept wanting me to do fine tuning.

Also to clarify, did you do the continuous pre-training or RAG? And did you compare the efficacy of one or the other or both?



I used the RAG knowledge bases for most of my testing described above.

I got a toy demo up and running with continuous pre-training but haven’t evaluated it unfortunately.



Oh Great! How did you evaluate the LLM responses? I'm cofounder of an evaluation and monitoring platform - Athina AI (www.athina.ai) You can use our monitoring dashboard and evals to check your LLM performance and iterate quickly.


LlamaIndex can't do chunk-level metadata, only document-level metadata, so you can't attach precise references to where the material the LLM synthesized answers from originated, e.g. HTML anchors. Just write your own RAG with Pinecone and OpenAI APIs directly.


LlamaIndex lets you attach metadata to Nodes which are basically chunks, although that fact is poorly documented! Will fix.
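
For anyone looking for it, roughly how that looks in the 0.9-era API (import paths may have changed since, and the metadata values here are made up for illustration):

  from llama_index import VectorStoreIndex
  from llama_index.schema import TextNode

  node = TextNode(
      text="Termination requires 30 days written notice.",
      metadata={"source": "contract.html", "anchor": "#section-2-1"},  # chunk-level reference
  )
  index = VectorStoreIndex(nodes=[node])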


Thanks! Even with better documentation, document importers don't extract node metadata, so one needs to write their own "text and metadata extractor" as well. It's then easier to skip LlamaIndex altogether, or just take inspiration from some of the re-ranking etc. you guys did.


Ouch, your client! I had one like this earlier this year. We were doing some audio processing for word matching; he had also been misled before coming to us, and he fully believed that this was going to be some form of super AI trained on his 5 audio recordings of him repeating the words over and over...

We did all we could to steer him toward a correct path of understanding. Sadly, we launched a working product, but he doesn't understand it and continues to misrepresent and mis-sell it.

After continuing to give him time and follow up with him (I tend to personally do this with Clients like this), I can tell he is starting to realize his lack of understanding...



Another question, which one is preferred, LlamaIndex or Langchain, for RAG? Thanks in advance for your insights.


You basically don't use langchain for anything besides 30 minute demos that you copied from someone else's github. It has a completely spaghettified API, is not performant, and forces you into excessive mental contortions to reason about otherwise simple tasks.

LlamaIndex is pretty good.



Yea discovered this with Langchain last week. Was great for a demo then started to push it harder and spent ages trawling Reddit, discord, GitHub trying to find solutions to issues only to discover what was supposed to be supported was deprecated. Got a massive headache for what should have been a simple change. Moved on now.


Yeah +1

We originally started out building features with LangChain (loading chains from YAML sounded good—it felt like it would be easy to get non-engineers to help with prompt development) but in practice it’s just way too complicated. Nice idea, but the execution feels lacking.

It also doesn’t help that LangChain is evolving so rapidly. When we first started using it a lot of code samples on the internet couldn’t be copy/pasted because of import paths changing, and at one point we had to bump by ~60 patch versions to get a bug fix, which was painful because it broke all kinds of stuff



Echoing others’ sentiments, I was frustrated with the bloat and obscurity of existing tools. This led me to start building Langroid with an agent-oriented paradigm 8 months ago https://github.com/langroid/langroid We have companies using it in production for various use-cases. They especially like our RAG and multi-agent orchestration. See my other comment for details.


what's the "groid"? isn't that a slur?


language android i imagine..


You got it


If you think that's bad, you're gonna hate Scunthorpe.


LlamaIndex is mainly focused on RAG. LangChain does a ton of other stuff too. I'd focus on LlamaIndex first.


Besides the other comments in this thread, I'd really recommend looking first at the (relatively new) "Managed index" in LlamaIndex: https://docs.llamaindex.ai/en/stable/community/integrations/... . These handle combining the retrieval with the generative side. I've seen a lot of users both get frustrated and get bad results by trying to write their own glue to string together various components of retrieval and generation, and these are much easier to get started with.


Haystack [1] is another good option. It's modular, doesn't get in your way and is particularly strong at retrieval. People like the documentation too.

Disclaimer: I work at deepset

[1] https://github.com/deepset-ai/haystack



What’s the benefit of llamaindex over just storing documents in chroma and using chroma to query? I’ve done the latter and am trying to understand if there’s a performance gain to the former?


Not much, actually. For lower volumes of documents, vector stores like Chroma or Weaviate provide inbuilt RAG.

Things get messy when the number and type of documents increase. Below are the reasons why you may need advanced RAG.

1. Intelligent data parsing
2. Chunking efficiently
3. Choice of embedding models
4. Query transformation
5. RAG technique
6. Prompt design
7. Feedback loop

Check out my blog on the 27 parameters, considerations and techniques one could follow to build a State-of-the-Art Chatbot.

https://www.lyzr.ai/27-parameters-techniques-considerations-...

So here is the quick guide.

For simpler use cases, inbuilt vector database RAG is sufficient. For more complex ones, LlamaIndex or Langchain options are suitable. For enterprise-grade production use cases, Lyzr's SOTA RAG architecture comes in handy.



Are there public examples of working products using RAG, compared with fine-tuning or training from scratch?


The OpenAI assistants API is an implementation of a RAG pipeline. It performs RAG both on any documents you upload and on any conversation you have with it that exceeds the context window.




Not public but internally I wrote a tool to help us respond to RFPs. You pass in a question from a new RFP and it outputs surprisingly great answers most of the time. Is writing 75%+ of our RFP responses now (naturally we review and adjust sometimes and as needed). And best of all it was very quickly hacked together and it’s actually useful. Copied questions/answers from all previous ones into a doc, and am using OpenAI embeddings api + FAISS vector db + GPT-4 to load the chunks + store the embeddings + process the resulting chunks.
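
The core of that kind of pipeline is surprisingly small. A hedged sketch using the OpenAI v1 Python SDK and FAISS; the Q&A pairs here are placeholders, not real RFP content:

  import faiss, numpy as np
  from openai import OpenAI

  client = OpenAI()
  qa_pairs = [("How do you handle data residency?", "All data is stored in-region..."),
              ("What is your uptime SLA?", "99.9%, measured monthly...")]  # from past RFPs

  def embed(texts):
      resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
      return np.array([d.embedding for d in resp.data], dtype="float32")

  # ada-002 vectors are 1536-dimensional and unit-length, so inner product = cosine
  index = faiss.IndexFlatIP(1536)
  index.add(embed([q for q, _ in qa_pairs]))

  def draft_answer(new_question, k=3):
      _, ids = index.search(embed([new_question]), k)
      context = "\n\n".join(f"Q: {qa_pairs[i][0]}\nA: {qa_pairs[i][1]}" for i in ids[0])
      chat = client.chat.completions.create(model="gpt-4", messages=[
          {"role": "system",
           "content": "Answer the new RFP question using the past answers below.\n" + context},
          {"role": "user", "content": new_question}])
      return chat.choices[0].message.content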


Amazon Q is (at least partially) a RAG implementation.


Another super easy option for RAG is AWS Bedrock Knowledge Base. It can ingest docs from S3. Just don’t use the OpenSearch serverless store, it’s $$$. You can use a low-end RDS with the pgvector extension instead.


Has anyone tried using an LLM for the retrieval stage? Instead of using vector embeddings, have a (small, fast) LLM literally scan the entire corpus in chunks extracting relevant sections?


That would still be very slow with any reasonably large corpus.


You don't just feed documents in, you need to build a dataset representative of how you want to interact with it. So likely using gpt-4 or something to create: a chunk of a document, a question that can be answered by that chunk and a good answer. (Or something)
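
Something like this, as a rough sketch (OpenAI v1 SDK; document_chunks, the prompt wording and the JSONL format are all placeholders, and you may need retries if the model returns non-JSON):

  import json
  from openai import OpenAI

  client = OpenAI()

  def make_qa_pair(chunk):
      resp = client.chat.completions.create(model="gpt-4", messages=[
          {"role": "system", "content": "Given a passage, write one question it answers "
                                        "and a concise answer. Reply as JSON with keys "
                                        "'question' and 'answer'."},
          {"role": "user", "content": chunk}])
      return json.loads(resp.choices[0].message.content)

  with open("train.jsonl", "w") as f:
      for chunk in document_chunks:  # placeholder: your pre-chunked documents
          pair = make_qa_pair(chunk)
          f.write(json.dumps({"prompt": pair["question"], "completion": pair["answer"]}) + "\n")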


Well said. The problem is, there are way too many alternatives. Any idea how llamaindex's ingestion engine compares to unstructured.io? (which is used in langchain)


I think they may be using the same thing.


RAG is a funny thing. It’s like going back to Watson for specifics but letting the LLM handle the generic stuff.


Why Llamaindex instead of Langchain?


> What absolutely does not work is trying to just feed a set of documents into fine tuning.

Not quite. It does work, albeit likely not optimal.

See https://github.com/bublint/ue5-llama-lora



I think the answer depends on how many documents you have. To think in terms of tokens (assuming 750-1000 tokens is a page), if you have a good estimate of number of pages you want to query on, you can decide on the approach. Three popular approaches:

1. RAG: Most popular and works really well on smaller datasets. It is limited by number of vectors/embeddings. A typical embedding could be of 1000 tokens in size. Llamaindex did a lot of engineering on this and their techniques work pretty well. The problem with large datasets is almost always that users don't like writing long prompts/queries so the answers are more generic.

2. Finetuning + RAG: You can finetune a model on the expected outputs. If your datasets have the knowledge which might already be on open internet (blogposts, articles, anything non proprietary), then finetuning would work really well in combination with RAG, especially for large datasets. It may not work if you are working on a proprietary knowledge hard to find on open internet.

3. Continual pretraining: for very large datasets, and when the knowledge is proprietary. I talked to a firm with 70GB worth of data. No way a RAG pipeline would give them results. They are struggling to get LLMs to work for them. This needs a model that is trained on their data and then instruction tuning on top of that. Most likely you won't need to do this.



> I talked to a firm with 70GB worth of data. No way a RAG pipeline would give them results. They are struggling to get LLMs to work for them.

Wow so RAG is basically a toy for demos and low effort MVPs. 70GB is tiny, it’d barely qualify as “big data” 20 years ago.

Is anyone trying more advanced stuff like knowledge graph augmented generation to try to expand on that?



KG in the intuitive 90's sense is fundamentally worse, but with llm-era rethinking, potentially useful

- kg DBs have the same retrieval scale problem, and often built on top of the same DBs

- traditional kg mining would still use LLMs, except instead of storing rich text ("in a clearly friendly and playful manner, the dog chased the cat") or less-rich high-dimensional chunk embeddings of the same, they get discretized down to much lossier rdf ontologies like (dog,chased,cat). There are uses, like building an entity index, but I wouldn't use for core RAG operations like answering questions accurately

- kg can be useful in the sense of, after chunking & embedding, adding *additional* linkages, such as adding summarization hierarchies and then linking related summaries back to source citation chunks. So runtimes can look up not just similar chunks, but also those that are a logical hop away, even if the embeddings are dissimilar, and via pre-summarization, incorporate a lot more insight. Though that's not a traditional kg, it highlights needing non-vector linkage tracking.

We are working on real-time large-scale projects like teaching LLMs to understand the news as it breaks as part of louie.ai, and if folks have projects like that, happy to chat as we figure out our q1+q2 cohorts. It's a fascinating time -- we had to basically scrap our pre-2023 stack here because of the significant advances, and been amazing being able to tackle much harder problems.



Some caution here. Not everything needs to go into a RAG pipeline (Eg: a database table would not necessarily need to be embedded, but its schema should be.). There would be a lot of repetitions, lots of junk and useless data, and numerical data and parsing through that would be a pain. Then comes how the users would behave. You need a longer string to get accurate results. Most non tech users would rather write shorter strings and expect technology to read their mind. (it's a human issue and not tech issue)

A simpler way here is just train the model unsupervised so all the knowledge is there in the model, and instruction tune it on the use-cases you want. Simpler from human effort perspective. Somewhat costly though the cost of storing that many vectors would be more than training the model itself. Everything else requires a lot of custom effort. Knowledge graph augmentation is probably the next step in the hype cycle, but it does not solve the fundamental human problem of writing fewer letters. (Training solves as changing 1-2 keywords do the trick if the generic string does not get the answer. See how Chatgpt changes answers if you tweak your prompt a bit). In a way RAG is an engineering solution to what is basically a data problem. It works for many cases, but when it does not, people will have to solve it via data science.

> Wow so RAG is basically a toy for demos and low effort MVPs

I would not say it's for demos or low-effort MVPs. Many companies won't have that amount of data. You can also segregate it by teams. Eg: customer support has one, sales has one, product has one. Then, a golden use case is parsing user docs. We created one for GST queries in India that works quite well [1]. It's a search engine, but it points to the right docs at the source when you ask about any clause. Useful for CAs only and addresses a very narrow use case (it's a market need as the notifications are published in PDF format and not indexed by Google).

[1]:https://clioapp.ai/gst-search



"Toy" is the wrong word to describe it but it seems like another order of magnitude or two increase in context size will solve all their problems.

On the other hand I've got a terabyte of text extracted from LibGen - let's say I can ignore the half that is fiction and I can dedupe the rest further by 80% - that's still 100gb. On top of that I've got 300gb of text extracted from court documents and that's just from California! I haven't even downloaded the federal dump yet, let alone the other 49 states. Even if I limited myself to just the US Code and Federal Code of Regulations, that's hundreds of millions of tokens of very dense text. Embedding based RAG has been pretty much useless in each of these cases but maybe I just suck at implementing the retrieval part.

What's the data size on the GST search engine? Do you have any examples of more complex queries?

The only thing that has even been remotely useful to tackling the kind of questions I want to ask of my data sources is having the LLM generate search queries, navigate the knowledge graph, and rank the relevance of retrieved snippets but that often takes dozens if not hundreds of LLM calls which is incredibly slow and expensive.



As a human would you read 100GB of data all at once?

Or would you read it bit by bit, taking notes and summarising as you went along. Then compiling your notes/summaries into a final report?

Because I don't see why we expect these models to be so superhuman when a 100K context would already be considered superhuman memory.

Imagine me regurgitating 100k tokens worth of dialogue at you and expecting you to take into account every thing I said. I know I couldn't do it, ha ha.



As a human would you do tens of billions of multiplies and additions per second? Store tens of thousands of books on something the size of a finger nail and recall them with perfect fidelity every time? Communicate with another human via optical signals using thousand mile long optical fiber across the entire Pacific ocean? Eat electricity instead of food? Project images from your eyes? Can you stick an audio cable in your butt to power speakers?

I'm talking about computers, not humans.



What are you trying to achieve with that dataset from LibGen? I kinda expect that GPT4 was trained on the data that is available on LibGen


I need it to be able to cite answers, explore surrounding context, and while it might have been trained on Libgen, it doesn't mean it "internalized" all the data, let alone enough to be useful.


Right tool for the right job.

With a large amount of data, a large amount of data can be "relevant" with a loose query.

I think in those situations it's fine, using a model with an extra large context and similarity etc filters quite tight.

Developing it to realise when there are too many results and to prompt the user to clarify or be more specific would help.

Companies that want to trawl data like this can just deal with it and pay for hardware that can run a model with >100k context.

If >all

It's totally doable, but not "out of the box".



Most use cases that actually require this much data are probably best solved by more traditional ML architectures (ie classification).

LLMs work best on use cases where the working context is the length of a short research paper (or less). Building with LLMs is mostly an exercise in application engineering: how to get them the most relevant context at the right time and how to narrow their scope to produce reliable outputs.

Fine tuning can help specialize the LLM model to perform better, but AFAIK, the training sets are relatively small (in big data terms)



I have a _small_ e-commerce company and we have >300GB. Most of that bulk is photos and videos though, but in an ideal world I’d like my AI assistant to find that stuff too: “I’m making a Boxing Day ad campaign. Can you show me the ads that we’ve made in previous years and all of the photos that we’ve taken of our new Reindeer and Elf designs?”


Photos and videos are very different from text. 300 GB of text is not comparable to 300 GB of photos.

You can do something using image embeddings to get what you want.



That can be done if we use ImageBind from Meta (it embeds text, image, video and audio in the same vector space). I would want to explore this if possible, just for a POC, if you are okay with it. Would you be interested?


AWS Bedrock is fairly easy. You can do it in 5 or 6 clicks.

You have to upload your documents to S3, create a “Knowledge Base” then sync your documents into a vector database like OpenSearch or PineCone. You are then good to go via their playground or the AWS API.

I made a video here describing the process, check around 14 minutes in:

https://ensembleanalytics.io/blog/introducing-bedrock-knowle...

Bedrock is a decent product I think. All of the models in one place (apart from the big dogs from OpenAI) and a common API across them.



Bedrock is cool, but I found it prohibitively expensive for hobbyists and small companies. At first glance, it would cost me something like $5,000 per month for a simple trained model.


I suspect most of the cost is in the OpenSearch vector database?

I agree it is high and it is not exactly serverless. You pay by the hour.

Bedrock itself is charged based on tokens exchanged.

I think Pinecone is a cheaper database for hobby and small business projects, though I haven't looked into it.



No, if you use a custom model (trained with your data), you'll pay around $20 per hour, minimum. That equates to ~$15,000 a month.

You can reduce a lot by committing to a 6-month contract, but it won't get cheaper than about ~$5,000/mo.

That's prohibitively expensive for small projects.

Fine tuning GPT 3.5 is much cheaper.



To be fair, Bedrock is more flexible than OpenAI's fine tuning. It's nice to have at least the option to pay big bucks for this. But it's big bucks nonetheless.


Don’t use OpenSearch, it’s way overpriced. Use a cheap-o RDS with pgvector. Everything else is charged like any other LLM, by token use.


This has nothing to do with OpenSearch. Using off the shelf models is cheap and pay as you go. Using custom models is what's expensive.


How does bedrock satisfy the non-hallucinating requirement?


It doesn't. You need to try and reduce hallucination as much as possible with your prompts, and then benchmark it.


Is there a limit? Could I create a knowledge base with 10,000 documents? 100k? 1M?


The documents are encoded as vectors and stored in a database, so I suspect it would be effectively unlimited. You would just pay for storage and compute.

AWS OpenSearch has fairly good integration so you could look up costs for that. It’s not the cheapest AWS service to run and not exactly serverless as you pay by the hour.



Even if you could, the problem is that these documents are first chunked into smaller pages, and then embeddings are created. When you ask a question, the algo searches for relevant chunks and passes them to the LLM's overall prompt. If there are too many chunks, or too many chunks with similar content, the search, coupled with the LLM's limited context window, means only 1-3 chunks get passed.

This isn't the same as the training data the LLM is trained on. As a result, it doesn't take advantage of the entire document set. So, if you have a billion documents, only 1-3 chunks will be picked for the final answer. When you know a question spans many many documents, the answer is never going to cover that.

You could make a recursive algo where you parse all the chunks, generate summaries of those and then pass them to the next chunk sequentially and so on. But you can imagine how expensive and slow that will be. It might still work for you, but this is a very lossy approach.
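
That recursive pass is essentially a map-reduce over chunks. A sketch with a placeholder llm() call (each pass costs one call per batch, which is where the expense and latency come from):

  def summarize_all(chunks, batch_size=10):
      # Map: summarize each batch of chunks independently
      summaries = [llm("Summarize the key facts:\n" + "\n".join(batch))
                   for batch in (chunks[i:i + batch_size]
                                 for i in range(0, len(chunks), batch_size))]
      # Reduce: keep collapsing summaries until one remains
      while len(summaries) > 1:
          summaries = [llm("Merge these summaries into one:\n" + "\n".join(summaries[i:i + batch_size]))
                       for i in range(0, len(summaries), batch_size)]
      return summaries[0]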



Easy: if > n chunks are returned for the query, respond with "too much information, ask the user to clarify".

If you have that much data in your vdb, and the user is querying a slice, it just won't do to have them ask something too generic, they need to be asked to be specific.

Ie "what was that thing from last year?" "Sorry, could you be more specific?" "Oh uh the thing to do with the financial reports" "Financial reports, yes, what about them?" "It was something to do with the way we generate them" "Ah yes, the generation of financial reports; last year in December your team leader said on Slack...etc"





I’m sorry, I don’t understand those limits. It uses a lot of unfamiliar terms like “batch inference” and “modality”. I just want a nice UI that I can give my hard-drive to and then ask it questions.


That’s probably unrealistic at this time


It's doable with Amazon Q, but it's in Preview phase now. https://aws.amazon.com/q/


Is Q the answer to this whole thread? People are talking about AWS Bedrock but that seems like something a startup would build upon and offer something like Q, eh?


I’m going to try it this week so maybe we have a follow up thread??


[flagged]



This attitude puzzles me. "If you wish to make an apple pie from scratch, you must first invent the universe" energy.

We are not always makers. Oftentimes we're consumers as well.

I don't want to read documentation and experiment with my phone, I just want it to work out of the box and do what I expect.

This is standard consumer behaviour and you're lying to yourself if you don't think you act like this with some things.



As others have said you want RAG.

The most feature complete implementation I've seen is h2ogpt[0] (not affiliated).

The code is kind of a mess (most of the logic is in an ~8000 line python file) but it supports ingestion of everything from YouTube videos to docx, pdf, etc - either offline or from the web interface. It uses langchain and a ton of additional open source libraries under the hood. It can run directly on Linux, via docker, or with one-click installers for Mac and Windows.

It has various model hosting implementations built in - transformers, exllama, llama.cpp as well as support for model serving frameworks like vLLM, HF TGI, etc or just OpenAI.

You can also define your preferred embedding model along with various other parameters but I've found the out of box defaults to be pretty sane and usable.

[0] - https://github.com/h2oai/h2ogpt



I just tried installing this on a fresh GCP instance with NVidia T4 GPU, and let's just say it was non-trivial. The CPU version running on my mac was mostly an OK install and worked pretty well.


We do just that at Flexor!

We've built what we call an Unstructured Data Transformation Layer. Think about it like an assembly line from raw text to tables in your data warehouse.

We don't use Llamaindex, we have our own (proprietary) piece of tech that does this. We can and have been outputting gold-standard tables on top of a lot of different types of context (legal docs, call transcripts, customer reviews, chat logs, emails, blog posts, social media posts, etc...) and looking to expand to more interesting domains soon.

If anyone wants to hear more hit me up at tom [at] flexor [dot] ai (does this still work or are scrapers smart enough nowadays to just grep for this too lol)



PrivateGPT is one of the better-known examples, but most people are not aware that GPT4 Assistants handle RAG natively now: https://platform.openai.com/docs/assistants/overview


Here's a (video) guide on fine-tuning Mistral 7B with QLoRA: https://www.harpercarroll.com/articles/ai/llm-finetune-own-d... / https://ghostarchive.org/varchive/kmkcNVvEz-k

Fine tuning does result in degradation of the overall model (https://twitter.com/xaiguydotagi/status/1737082280835703142) and so various RAG techniques may be desirable. As others have mentioned, LlamaIndex is a neat solution to build RAG pipelines: https://docs.llamaindex.ai/en/stable/optimizing/production_r...
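
For the QLoRA route, the setup portion usually looks roughly like this with Hugging Face transformers + peft (a sketch only; hyperparameters and target modules vary by model and dataset):

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
  from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

  base = "mistralai/Mistral-7B-v0.1"
  bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                           bnb_4bit_compute_dtype=torch.bfloat16)

  tokenizer = AutoTokenizer.from_pretrained(base)
  model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb, device_map="auto")
  model = prepare_model_for_kbit_training(model)  # 4-bit base stays frozen

  lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
                    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
  model = get_peft_model(model, lora)             # only the small adapter weights are trained

Training then runs with a normal transformers Trainer (or trl's SFTTrainer) over an instruction-style dataset, not raw documents.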



I strongly agree this is the direction the author is looking for. RAG is one approach, but if the query doesn't match the right documents, you're screwed. And often they use a different, much simpler, embedding model.

I think the harpercarroll link is a pretty good one, but it basically just feeds in the documents for completion, which isn't a good approach. The dataset needs to represent how you want to use it.

This one might also be helpful https://www.deeplearning.ai/short-courses/finetuning-large-l...

Honestly surprised how almost everyone is saying to use RAG (on its own). One strong benefit of RAG is that the data can change, but it has lots of failure modes.

People often use hybrid search (fuzzy or bm25 etc alongside embedding search) which I suppose is still RAG.

But fine-tuning models to be better at RAG is valuable as well, increasing accuracy.

https://ragntune.com/blog/Fine-tuning-an-LLM-to-be-good-at-R...

Ideally, I'd try both: fine-tune on the documents (create a question/answer dataset with gpt4) and also instruction-fine-tune it for RAG.



Run https://github.com/imartinez/privateGPT

Then

make ingest /path/to/folder/with/files

Then chat to the LLM.

Done.

Docs: https://docs.privategpt.dev/overview/welcome/quickstart



I've tried LocalGPT, PrivateGPT, and H2OGPT. Have you been satisfied with the responses you get from PrivateGPT? When I tried it, it seemed very shallow/cursory in its responses. I saw much more detailed and complete responses when trying H2OGPT.


The models released over the last two weeks are much much better than the defaults. Try changing settings.yaml to use https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGU... .


Others have said that RAG is only good up to a few 100s of files, eh? And since this is based on LlamaIndex, which is RAG-based, this would have the same limitation, eh?


I haven't personally tried this for anything serious yet, but to get the thread started:

Cheshire Cat [0] looks promising. It's a framework for building AI assistants by providing it with documents that it stores as "memories" that can be retrieved later. I'm not sure how well it works yet, but it has an active community on Discord and seems to be developing rapidly.

The main perk over the cloud options is that you can point it at any language model, including fully local—my local install pointed at my local Ollama running Mistral.

[0] https://github.com/cheshire-cat-ai/core



But that's not training. That's RAG. They seem to be using qdrant which I believe is a vector store.


They've updated the question to clarify that RAG counts, and as many have noted, properly "training" on a set of documents isn't really a thing.


You can pay for a ChatGPT account and upload your own documents. I didn't do this myself but my dad uploaded 6 years of sermon transcripts from our church. It sounds exactly like the pastor.


Did this in the summer via RAG. One thing we realised is that pure vector-embeddings retrieval doesn't work so well for docs with acronyms (which, let's face it, all businesses have). Created a hybrid solution using embeddings and BM25, which is a traditional ranking tool. This hybrid gave the best results.
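
A rough sketch of that kind of hybrid scoring, assuming rank_bm25 and sentence-transformers; the 50/50 weighting and the toy documents are arbitrary and worth tuning:

  import numpy as np
  from rank_bm25 import BM25Okapi
  from sentence_transformers import SentenceTransformer

  docs = ["SLA terms for the EMEA region", "Our ARR grew 40% YoY", "GDPR data handling policy"]
  bm25 = BM25Okapi([d.lower().split() for d in docs])
  encoder = SentenceTransformer("all-MiniLM-L6-v2")
  doc_vecs = encoder.encode(docs, normalize_embeddings=True)

  def hybrid_search(query, alpha=0.5, k=3):
      lexical = np.array(bm25.get_scores(query.lower().split()))  # good for acronyms / exact terms
      semantic = doc_vecs @ encoder.encode([query], normalize_embeddings=True)[0]
      # min-max normalize each score list so they are comparable, then blend
      norm = lambda s: (s - s.min()) / (s.max() - s.min() + 1e-9)
      combined = alpha * norm(lexical) + (1 - alpha) * norm(semantic)
      return [docs[i] for i in np.argsort(combined)[::-1][:k]]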


I was going to ask how you integrated BM25 but then I found this: https://docs.llamaindex.ai/en/stable/examples/retrievers/bm2...


NGL I think this one has passed the point on the tech maturity curve where it makes sense to roll your own. I played with MS Office's copilot builder the other day and it's amazing. Point it at a set of base URLs or uploaded files, public or behind authentication. In literal seconds you have a copilot that can be embedded anywhere, including messengers. I gave it the root of the Azure documentation, the root of the Red Hat documentation, and the root of the Ansible documentation, and it's excellent. It uses MS's open source LLM copilot framework, and you can swap out models for an open source one (instead of GPT) if you like.


Can you run an experiment for me?

Feed it a subreddit. Post results.



With OpenAI, you can first build Question & Answer pairs derived from your documents and use the OpenAI fine-tuning feature to build yourself a custom model. This method is more than just learning behavior in that facts do get recalled. I have written about it here, with a play demo use-case: https://ndurner.github.io/training-own-model-finetuning Note that I have yet to use this in a real world use-case, and I would love to hear feedback.

Other than OpenAI, there is the newly introduced „continued pre-training“ of Amazon Bedrock - but I haven’t tried it.

RAG: I think that's a fundamentally flawed concept, but RAGfluencers will disagree. ;-)



Could you expand on why RAG is flawed in your eyes? From the other answers it seems like it's the way to go.


My approach was not to train the model on the documents, as others mentioned.

I built a vector database from the documents, and I query the questions against it, which is very fast. This is the RAG (retrieval augmented generation) step others mentioned.

The results, which are literal extracts from the documents, but short ones, are given to the model, which produces an answer. This is the slow part.

I used many of Langchain's tools to manage the whole process.

You can try it on Supawiki, with one of the featured wikis. Then, if you are ok with a solution that hosts your documents for you, you can upload them and use our solution.



Since no one has mentioned it so far: I did just this recently with txtai in a few lines of code.

https://neuml.github.io/txtai/



If you’re looking for something that is hosted for you, at Notion we launched a feature for this a few weeks ago and it works quite well in my experience. RAG is one of the techniques used. https://www.notion.so/blog/introducing-q-and-a


Thank you! I have been reading these QLoRa posts in the hopes of training it on my notes stored in Notion, but then you do it for me! Nice product ;).


Slightly off topic but is there recommended advice on how to tune / train not for document retrieval but for consistent JSON output with specific enums?

i.e. given a text, always return a certain set of fields, and for some keys here is the possible set of enums, etc. One-shot prompting does work, but I'm curious how others approach this if you have training data on hand.



There are many interesting tools that achieve this, like Outlines[0] and jsonformer[1]. I haven't tried them myself but they look very promising.

[0]: https://github.com/outlines-dev/outlines [1]: https://github.com/1rgs/jsonformer



You want grammars to restrict the output; search for "gbnf grammar". Use that combined with a good prompt that includes an example. Also check out outlines.dev.


Ask the model nicely, check any json in the output against your schema, regenerate if it doesn’t match.

Crude, I know, but it’s compatible with every model. Which is useful if you want to compare the many different models out there.
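
That validate-and-retry loop can be a few lines with pydantic v2; the schema and the llm() call here are placeholders for your own fields and model:

  from enum import Enum
  from pydantic import BaseModel, ValidationError

  class Sentiment(str, Enum):
      positive = "positive"
      negative = "negative"
      neutral = "neutral"

  class Extraction(BaseModel):
      title: str
      sentiment: Sentiment  # only the allowed enum values will validate

  def extract(text, retries=3):
      prompt = f"Return JSON with keys 'title' and 'sentiment' (positive/negative/neutral) for:\n{text}"
      for _ in range(retries):
          raw = llm(prompt)  # placeholder: whichever model you are comparing
          try:
              return Extraction.model_validate_json(raw)
          except ValidationError:
              continue  # regenerate if the output doesn't match the schema
      raise RuntimeError("model never produced valid JSON")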



That is a fairly good strategy. Six years ago at Capital One, I experimented with generating synthetic JSON AWS CloudWatch files for testing (that would not contain any sensitive information). Way back then I used LSTM models and I simply had code to check output and only keep valid samples.

LLMs are so much better for this than LSTMs now.



For OpenAI, use their functions schema mechanism.

Aside from that, take a look at llama.cpp grammars.



Microsoft Guidance will do this.


Seems Microsoft spun them out and gave them independence. Not sure why, given it's the kind of IP that helps keep microsoft dominant.


Gpt4all is a local desktop app with a Python API that can be trained on your documents: https://gpt4all.io/


For a simpler task, like training a Mistral, Llama, etc. on your documents to act as a document completer, how would you proceed instead? Probably much easier. Thanks


Markov chains, or, if you need to get fancy, Hidden Markov models?


Is this a joke? I'd like to fine-tune an existing model, let's say mistral, on a new dataset using (existing tools),

I've seen that there are a lot of approaches, but none that has gained traction, there isn't a clear consensus...



Not a joke. Without more specifics, doesn't sound like LLMs are what you need/want.


Thanks for wasting my time. I ask about fine tuning llama or mixtral and he answers with a nonsense, telling me that I don't want what I want.


Here are ways to do it by simply adding files to an online interface. I mention them only because they are quite straightforward (and free) to set up.

- https://notebooklm.google/ (US or VPN): uses the "gemini pro" model.

- poe.com: You need to "create a new bot", disable "Make bot publicly accessible," and then "add a knowledge source." this offers many models, although the best ones require a subscription.



You can use embedchain[1] to connect various data sources and then get a RAG application running on your local and production very easily. Embedchain is an open source RAG framework and It follows a conventional but configurable approach.

The conventional approach is suitable for software engineers who may be less familiar with AI. The configurable approach is suitable for ML engineers who have sophisticated uses and want to configure chunking, indexing and retrieval strategies.

[1]: https://github.com/embedchain/embedchain



Yes, this is most intuitive and easy to use framework for building RAG applications. Do give it a try!


I'm a fan of Khoj. Been using it for months. https://github.com/khoj-ai/khoj


I need to give Khoj a try again. I tried it on my org inbox on a whim, couldn't get useful results, and promptly forgot about it.


Nice pun


I'd very much appreciate if someone could clarify what exactly is needed in terms of hardware and software to implement these suggestions.

Would a five years old laptop work? Do you need a beefy GPU? Are you using some prepackaged software?



Easiest is OpenAI assistants api. Use the playground and it’s a no code experience.


How do you upload the documents? Via the API or do you have to upload them beforehand through the UI?


You upload in the GUI. It's available on free accounts as well, as long as you use GPT 3.5. There's a dropdown. It's super-convenient.

I uploaded an AWS study guide, and I'm asking it for example questions for my testing. As far as I can see, there's no way to determine whether it's pulling from the guide or from GPT's data.



As mentioned above, I don't think you'd need to train your own model for this (or for most use cases of this, anyway). You'd use a RAG.

I've tried out working with custom documents in two different ways for different types of data:

* Once using LlamaIndex + Chroma[0] to transcribe and then conversationally query video contents (using GPT 3.5 or 4 as the backing LLM).

* Once using GPT Plus, uploading long-form PDFs of my own fiction books to the GPT's knowledge base. I use this to help me remember character names and timelines (not always accurate, so results need to be treated with caution) and help brainstorm story or tech ideas for my world.

Both work for what I'm using them for. I feel like option one is more customizable and easier to tweak for the types of results I would want, if I have concrete requirements about what kind of output I'm looking for. Option two has a lower barrier to entry and is just a little lower effort (no need to run your own app).

For the next iteration, I'd like to try out AWS Bedrock and compare the workflow and results.

[0] https://www.daily.co/blog/search-your-video-content-library-...



I have a related question. I have a fair idea of the LLM ecosystem (thanks to this very nice blog called Emerging Architectures for LLM Applications). The problem is, there are way too many options in each component (e.g., too many vector store implementations, ingestion engines, etc.). What is the easiest way to get started, primarily around RAG on my own pdf files? Also, what is the best/easiest option for hosting? That blog lists vercel, streamlit, streamship and modal. I know vercel at a high level and found it very good. I am not well versed with javascript/typescript though. I believe the best option for UI generation is to use one of their templates.


I'm curious about this as well but my data is mostly (95%) numerical metrics. Is there a "RAG" mechanism for numerical data instead of text? My use case is data analysis, insight discovery for example.


I'm not an expert, but take a look at the llamaindex structured data retrieval examples: https://docs.llamaindex.ai/en/stable/understanding/putting_i...

There are also examples that combine semantic retrieval and structured retrieval.



So far the recommendations are mostly hosted, so here's one local: https://github.com/weaviate/Verba

I'm very happy with its results, even though the system is still young and a little bit janky. You can use it with either the GPT API or your local models through LiteLLM. (I'm running ollama + dolphin-mixtral)



Thanks, I'd rather not feed the borgs.


I have usually seen people recommend to chunk by sentences or paragraphs or some fixed length of characters. IMO, all these are suggested because they are easy to write code for, but in reality, length of a meaningful chunk depends entirely on the data. The way we chunk an FAQ document vs a PRD is different.

Based on this assumption, I have a couple of questions:

1. Is chunking the most significant factor in RAG quality?

2. If there are no limitations, would humans that are experts in that dataset, be the best people to create chunks?



GPT-4 Turbo has a 128K (~300 pages) context window, which probably handles a lot of use cases which might have previously needed extra training/refinement.


The ChatGPT app says it has a context window of 4096 tokens (GPT-4). How do I get access to Turbo?


If you're asking ChatGPT about its own characteristics, you should not believe the responses. Models can't examine themselves, so unless the model was trained on specific information about itself, or unless the info is put into the System Prompt, then it cannot know the answer. A response indicating 4096 tokens would just be a hallucination, or a misunderstanding based on training data that saw references to what older versions of ChatGPT were trained on.

ChatGPT-4 is powered by GPT-4 Turbo at this point, so it has a context window that is much larger than 4096 tokens, whether it knows it or not. The ChatGPT application may limit the context size from reaching the full 128k to keep costs down, and it will be using some of the context window for its own purposes, but it's certainly able to maintain context across more than 4096 tokens of conversation.



The gpt-4-1106-preview (aka gpt-4-turbo) is a foundational model with a 128K context window that OpenAI makes available for API consumption both directly and via Azure OpenAI Services.

ChatGPT is a consumer facing service that wraps the GPT-4 foundational model but at some point will likely wrap gpt-4-turbo.

Signing up for OpenAI API access or Azure OpenAI Services will grant you access to this model (with some rate-limits in place given its a preview model).



Long-context models are still mostly a mirage, with the "lost in the middle" phenomenon rearing its ugly little head on actual production use-cases.


Not true.


Huh? Obviously true. You can not have tried it.




I've been looking for the answer to this to get a chat interface to my obsidian markdown notes (the whole vault, not just RAG over individual notes). Will be following these threads closely.


A bit unrelated, but one could open any binary file as text. With enough training data, could an llm just learn the format?


Try https://github.com/SecureAI-Tools/SecureAI-Tools -- it's an open-source application layer for Retrieval-Augmented Generation (RAG). It allows you to use any LLM -- you can use OpenAI APIs, or run models locally with Ollama.


We have to add LLMs and MMMs (multi modal models) into all standard Linux distributions. A service will index all local files creating embedding connectors, this will be used to augment user prompts, and voila we can search for anything with natural language.


> We have to add LLMs and MMMs (multi modal models) into all standard Linux distributions.

Why? This seems like a quick road to remote-execution-vuln-as-system-service.

> and voila we can search for anything with natural language.

What if your model keeps misunderstanding you?



Once models / inference engines are performant enough to run on consumer hardware without hogging all the resources of a machine -- sure! Embedding this into the core of a linux system sounds very interesting. A shell interaction could be had in nerd gobbledygook or plain human language.


Are there any open source front ends out there? I know of AnythingLLM, but I'm hoping to plug my own home-built RAG system into a nice front end.


What would be nice is some type of box or device I connect to a computer. I then give it full access to the system. It trains itself on all of the data on that computer. The device is now a portable LLM.


I have a question for which I haven't found a definitive answer yet: how can one effectively manage typos and out-of-vocabulary (OOV) words in RAG systems?

For instance, if I search for a specific product name but accidentally mistype it, the resulting encoded vector might not be similar to the vector for the correctly spelled product name?



AFAIK, modern tokenizers/vocabs don’t have OOV anymore because they use sub-word tokens if there is no entry for the given word. This also works for languages with composite words.

In your example, the sum of sub-word embeddings for the non-typo and typo versions should have a close cosine distance if the model is trained correctly.



Good embeddings are supposed to handle this natively. For example the embedding for API should be close to the embedding for Application Programming Interface. Purportedly.


It's super easy, an example could be found here https://technoclub.bearblog.dev/creating-a-simple-ai-chat-wi...


A go-to method is to ingest different chunksizes based on the document hierarchy & then use langchain with a bunch of retrievers depending on the doc type.

Then create an index about the metadata of each doc. So that you can ask the RAGbot what all it can answer about.

Another way to ensure it stays on-domain is to generate synthetic questions & check for similarity against user queries. There's a whole rabbit hole of query decomposition to avoid straying off topic as well.
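
The synthetic-question check can be as simple as a cosine-similarity threshold (a sentence-transformers sketch; the 0.6 cutoff and the example questions are guesses you would tune on your own data):

  from sentence_transformers import SentenceTransformer

  encoder = SentenceTransformer("all-MiniLM-L6-v2")
  synthetic_questions = ["What is our refund policy?", "How do I reset my password?"]  # generated offline
  question_vecs = encoder.encode(synthetic_questions, normalize_embeddings=True)

  def is_on_domain(user_query, threshold=0.6):
      q = encoder.encode([user_query], normalize_embeddings=True)[0]
      return float((question_vecs @ q).max()) >= threshold  # best match against any synthetic question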



There was something similar about Retrieval Augmented Generation (RAG) recently on HN: https://news.ycombinator.com/item?id=38491251

Early next year I’m preparing something similar for my team, so I’ll surely look into the useful links/recommendations posted by fellow HNers :-)



What is your use case? If you just want to find relevant information in your documents and you want to avoid hallucination, you might skip text generation altogether.

Instead you can extract text embeddings from your documents, put them in a vector DB, and then you have a super search. You convert your search query to an embedding, search the DB, and keep, say, the 10 closest matches.
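
A minimal sketch of that embeddings-only "super search", here using Chroma as the vector DB (any vector store would do; the collection name and documents are made up, and Chroma's default embedding function is used for brevity):

```python
# Embed documents into a vector store and retrieve nearest matches -- no generation step.
import chromadb

client = chromadb.Client()  # in-memory; use chromadb.PersistentClient(path=...) to keep data
collection = client.create_collection("my_docs")

collection.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "Invoices are issued on the first business day of each month.",
        "Refunds are processed within 14 days of a return request.",
    ],
)

# With a real corpus you would keep e.g. the 10 closest matches.
results = collection.query(query_texts=["when do refunds arrive"], n_results=2)
print(results["documents"][0])  # closest matching snippets
```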



Hey, GPT Researcher shows exactly how to do that with RAG. See here https://github.com/assafelovic/gpt-researcher


If you’re looking for an open source RAG solution, try our library:

https://www.github.com/jerpint/Buster



https://khoj.dev/

Tried it this summer, and it kinda worked!



Unstract - https://unstract.com/ They are a month away from launch (both open source and cloud). The team might be able to give you a quick demo on your specific requirements.




Show HN from two weeks ago mentioned this. https://news.ycombinator.com/item?id=38587052


What are you trying to do more specifically? You can use https://docalysis.com/ for most document RAG tasks.


How do you do RAG with embeddings NOT in English? I mean, there are a few thousand other languages.


Just use an embedding model of your choice that works with your language. I believe Ada from OpenAI is multilingual, though I don't know which languages it works well on; there are many embedding models out there, and Hugging Face is your friend in this search. The output is just a vector, so the rest of the system can basically stay the same. The only other thing that may need to change depending on the language is any text preprocessing you do, like word or sentence breaking for languages with compound words (German), agglutination (Turkish), etc.
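
For illustration, a small sketch with the OpenAI embeddings endpoint (text-embedding-ada-002); it assumes the openai>=1.0 Python client and OPENAI_API_KEY set in the environment, and the non-English inputs are just examples:

```python
# Non-English text goes through the embeddings endpoint exactly like English does;
# only the preprocessing (if any) is language-specific.
from openai import OpenAI

client = OpenAI()

texts = [
    "Wie kündige ich mein Abonnement?",     # German
    "Aboneliğimi nasıl iptal edebilirim?",  # Turkish
]
resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
vectors = [d.embedding for d in resp.data]  # plain float vectors, same as for English text
```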


Well, thanks. How do they deal with technical terms never seen during training?


Embeddings are not just English-based.


Train on your own documents, or analyze your own documents for answers? Very different things.

For the first (fine-tuning), follow "AI Jason" on YouTube. He has some great tutorials.

For the second (RAG or similar), fire up a cloud VM with GPUs or use Ollama locally, and read through the LlamaIndex docs on how to build a RAG pipeline.
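
For reference, the LlamaIndex quickstart for the second path looked roughly like this in late 2023 (imports have since moved to llama_index.core in newer releases). It assumes OPENAI_API_KEY is set and a ./data folder with your documents; local models via Ollama can be swapped in through LlamaIndex's LLM integrations.

```python
# Minimal LlamaIndex RAG pipeline (llama_index 0.9-era imports).
from llama_index import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()  # read and chunk the files in ./data
index = VectorStoreIndex.from_documents(documents)     # embed chunks into a vector index
query_engine = index.as_query_engine()
print(query_engine.query("What does the contract say about termination?"))
```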



Would you kindly elaborate a little on the difference between training on your own documents vs. analyzing documents for answers?


The word "training" implies creating a new model by fine-tuning an existing model on top of new documents.

As several other comments in this thread have already indicated: this is almost always the wrong direction. Which is confusing because it's the direction everyone always assumes they should go in at first.

The approach that does work is surprisingly simple: take the user's question, search for snippets of your documents that appear to be about that question, then paste all of those snippets into the prompt along with the user's question and see what answer you get.

This is known as RAG: Retrieval Augmented Generation. It's a very powerful approach.
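
A barebones sketch of that flow; search_snippets is a hypothetical placeholder for whatever retrieval you use, and the model name is just an example (assumes the openai>=1.0 client):

```python
# Retrieve snippets, paste them into the prompt, ask the model to answer from them.
from openai import OpenAI

client = OpenAI()

def search_snippets(question: str, k: int = 5) -> list[str]:
    """Hypothetical retrieval step: return the k snippets most related to the question."""
    raise NotImplementedError

def answer(question: str) -> str:
    context = "\n\n".join(search_snippets(question))
    prompt = (
        "Answer the question using only the context below. "
        "If the context is not enough, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```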



> take the user's question, search for snippets of your documents that appear to be about that question, then paste all of those snippets into the prompt along with the user's question and see what answer you get.

We use RAG at my job, but we don't do any preprocessing of the user's message, so the results are not always great for us.

Do any of you have experience using a small local model just for extracting keywords from messages, which you then use for the retrieval? And then feeding the search results and your prompt into OpenAI or whatever as normal.
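
One possible shape for that preprocessing step, using Ollama's local HTTP API (default port 11434); the model name and prompt wording are assumptions, not recommendations:

```python
# Ask a small local model for search keywords, then feed the retrieved snippets
# plus the original message to your main model as usual.
import json
import requests

def extract_keywords(message: str, model: str = "mistral") -> list[str]:
    prompt = (
        "Extract at most 5 search keywords from the user message below. "
        "Reply with a JSON array of strings and nothing else.\n\n"
        f"Message: {message}"
    )
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=60,
    )
    r.raise_for_status()
    return json.loads(r.json()["response"])  # may need fallback parsing if the model gets chatty
```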



I've been trying out an interesting embedding model that knows how to treat text either as a question or as a phrase about the world, and embeds a question such that it's likely to end up close to phrases that might answer it: https://til.simonwillison.net/llms/embed-paragraphs

Embedding and chunking large amounts of documents is expensive though, in both compute and storage.

The other trick I've been planning to explore is using an LLM to turn the user's question into a small number of normal FTS search queries and then run those to try and get context data.
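
A rough sketch of that idea against SQLite FTS5; the docs table, the model name, and the "one query per line" convention are all assumptions:

```python
# Have an LLM write a handful of full-text search queries, run them against an
# FTS5 table, and collect the hits as context.
import sqlite3
from openai import OpenAI

client = OpenAI()

def fts_queries(question: str, n: int = 3) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Write {n} short full-text search queries (one per line, "
                       f"keywords only, no punctuation) for answering: {question}",
        }],
    )
    return [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]

def retrieve(db: sqlite3.Connection, question: str, limit: int = 5) -> list[str]:
    hits: list[str] = []
    for q in fts_queries(question):
        rows = db.execute(
            "SELECT content FROM docs WHERE docs MATCH ? LIMIT ?", (q, limit)
        ).fetchall()  # assumes: CREATE VIRTUAL TABLE docs USING fts5(content)
        hits.extend(r[0] for r in rows)
    return hits
```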



> The other trick I've been planning to explore is using an LLM to turn the user's question into a small number of normal FTS search queries and then run those to try and get context data.

I have also been working on this. I still fail to see why this approach isn't the default, frankly. There's little benefit to vector databases.



https://docs.llamaindex.ai/en/stable/examples/retrievers/bm2...

Also, maybe try to include tags or categories when you index; then you can filter on those when doing the vector search. You might get a similar effect from BM25.

Also, LlamaIndex does RAG better than some other solutions.
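
If you want to try BM25 outside of LlamaIndex, a rough sketch with the rank_bm25 package (the corpus and the naive whitespace tokenization are just for illustration):

```python
# Keyword-based retrieval with BM25 over a toy corpus.
from rank_bm25 import BM25Okapi

corpus = [
    "Refunds are processed within 14 days.",
    "Invoices are issued monthly.",
    "Priority support is included in the enterprise plan.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])  # use a real tokenizer in practice

query = "how fast are refunds"
print(bm25.get_top_n(query.lower().split(), corpus, n=2))  # top-2 matching documents
```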



How do RAG implementations handle generic prompts vs. specific prompts? Meaning, there are prompts that could easily be answered by the base model itself and don't require RAG, but some prompts involve questions about something proprietary where RAG is actually useful.

So is the default to just run the RAG search index on every prompt, and if it returns nothing you get the plain answer from the base model, otherwise the augmented answer?



You cannot get a non-hallucinating AI in 2023.


It seems like there's not much point to LLM+RAG over a good local search engine.

At least, that's what I've rapidly been concluding.

I'm still holding out a bit of hope for there being more though.



Could you expand on the reasons and possibly relevant papers?


anything-llm looks pretty interesting and easy to use https://github.com/Mintplex-Labs/anything-llm


And what’s the correct answer in December 2023 if one wants to narrow down only to tools and services provided on Azure?


https://github.com/microsoft/semantic-kernel

Semantic Kernel is MS's response to LangChain and LlamaIndex - available for .NET, Python, and Java.

Using its memory support (with Azure Cognitive Search), it gives you a powerful RAG quickstart, which you can combine with Azure Document Intelligence to chunk your source documentation into memories that your foundational models can later use.

(Disclaimer: it has only very recently gone 1.0 and is still likely to undergo API changes, as the LLM domain itself keeps evolving rapidly. I've substantially forked the project for my own needs, but I hope that as it stabilises I can contribute PRs for some of my more advanced use cases.)



Thanks for the link and good luck with your contributions to this project


Is Llamaindex + hosted model on Azure OpenAI Services still the best option?


There's Azure AI Studio, which is kind of like AWS Bedrock. It's not bad, but for max control and versatility I'd start out rolling my own with, for example, LlamaIndex plus Azure-branded OpenAI, like you say.


Thanks. Is it possible to have persistent RAG in Azure AI Studio, though? I found only a preview feature for uploading files that are then available to the model, but when using that model through the API, the uploaded data is not available to it.

I likely misunderstood how RAG works with Azure AI Studio, so sorry in advance.



https://github.com/azure/aistudio-copilot-sample

Check it out, specifically steps 3 and 4. As with almost every Microsoft CLI tool and SDK, it's clunky... and you can tell everyone is rushing this AI shit out as fast as they can to stay in the game. But what you want should be doable.



Much thanks to you


Happy holidays, and good luck building!


Thanks, and same for you


How do I run a local LLM for RAG apps? The retrieval documents are Turkish, and I would like to analyze these documents with an LLM, but I don't have a Turkish local LLM. How can I solve this problem, short of fine-tuning and training?


If the LLM you use supports Turkish (I am pretty sure ChatGPT does), then the language doesn't matter. Augment the generation by retrieving Turkish documents/snippets.


Thank you, but I don't want to use GPT. I want to use a local LLM, and the local LLMs I've tried don't support Turkish at a good level.


Try the PrivateGPT GitHub repo.


Many services/platforms are careless or disingenuous when they claim they "train" on your documents, when they actually mean they do RAG.

An under-appreciated benefit of RAG is the ability to have the LLM cite sources for its answers (which are, in principle, automatically or manually verifiable). You lose this citation ability when you fine-tune on your documents.

In Langroid (the multi-agent framework from ex-CMU/UW-Madison researchers) https://github.com/langroid/langroid we've implemented a number of RAG techniques: our DocChatAgent uses a combination of lexical and semantic retrieval, reranking, and relevance extraction to improve precision and recall: https://github.com/langroid/langroid/blob/main/langroid/agen... All the code is laid out clearly so it can be tweaked. We have companies using Langroid in production (e.g. for customer support); they especially like the RAG and multi-agent features.

One of the interesting techniques in Langroid is a numbering trick for the relevance extraction stage (having the LLM extract verbatim the relevant parts of passages): instead of having the LLM "parrot" out the relevant portions, thus wasting time and tokens, we have it just spit out the relevant sentence numbers from a pre-annotated passage. We use a tool/function call for this, leveraging Langroid's task loop, which seamlessly handles tool calls as well as sub-task handoff.
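
To make the idea concrete (this is not Langroid's actual code, and Langroid uses a tool/function call rather than the plain prompt used here), a rough sketch of the numbering trick:

```python
# Annotate each sentence with an index, ask the LLM for the relevant indices only,
# and reassemble the extract locally instead of having the model re-type the text.
from openai import OpenAI

client = OpenAI()

def relevant_extract(sentences: list[str], query: str) -> str:
    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(sentences))
    resp = client.chat.completions.create(
        model="gpt-4",  # example model
        messages=[{
            "role": "user",
            "content": f"Query: {query}\n\nPassage sentences:\n{numbered}\n\n"
                       "Reply with only the numbers of the relevant sentences, "
                       "comma-separated (e.g. 0,2).",
        }],
    )
    raw = resp.choices[0].message.content.replace(" ", "")
    picked = {int(x) for x in raw.split(",") if x.isdigit()}  # a sketch; real code needs sturdier parsing
    return " ".join(s for i, s in enumerate(sentences) if i in picked)
```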

Many interesting RAG applications require more than simple question-answering (e.g. extracting structured info, matching a doc against requirements, etc.), and these scenarios benefit immensely from having multiple agents (so you get separation of concerns, modularity, and easier state management). Langroid simplifies this type of multi-agent setup with its conversational task loops, e.g. https://langroid.github.io/langroid/examples/agent-tree/

Colab quick start that builds up to a 2-agent system for extracting structured info from a document:

https://colab.research.google.com/github/langroid/langroid/b...

Among many other things, Langroid also has full support for the OpenAI Assistants API, so you could use the "built-in" RAG from that API. It is convenient but a black box, i.e. you don't know what retrieval algorithm it is using, how it fills the context, or how many tokens it consumes.



Thanks. I just read through your Colab examples notebook.





