Ask HN: How to train an LLM against a knowledge base
3 points by njsubedi 11 months ago | 3 comments

I tried dumping the inventory information and basic BOM content as a system message for the GPT-4o model in the OpenAI Playground, and asking questions like "we are making 1000 power banks this month, so which components should I pre-order so that we don't run out of them?". This works as expected, but it took me a while to realize that each query used more than 30K tokens! That's a quick way to lose money.
I am looking for a solution from someone who has trained or fine-tuned a decent LLM on custom data, and I'm pretty sure a lot of other small business owners are looking for something like this. Thank you!
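A quick sketch of why the naive approach burns tokens: every question re-sends the entire knowledge base as the system message. Here's roughly how you can measure that with OpenAI's tiktoken library (the filename is hypothetical):

    import tiktoken  # OpenAI's tokenizer library

    # Hypothetical: the inventory/BOM dump pasted in as a system message
    system_message = open("inventory_and_bom.txt").read()

    enc = tiktoken.encoding_for_model("gpt-4o")
    n_tokens = len(enc.encode(system_message))
    print(f"System message alone: {n_tokens} tokens per query")
    # At ~30K tokens, the whole knowledge base rides along with
    # every single question, which is why the costs add up so fast.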
- PaulHoule 11 months ago
I do most of my work so far with BERT models, but if I were trying to fine-tune a generative model I think I'd try a T5 model.
https://huggingface.co/docs/transformers/en/model_doc/t5
https://medium.com/nlplanet/a-full-guide-to-finetuning-t5-fo...
Specifically, you can show a T5 model input and output texts and it will try to learn the transformation between them. People tell me T5 models are relatively easy to train and that they perform well on many tasks.
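A minimal sketch of what that looks like with Hugging Face transformers (the training pair is made up; a real run would loop over a dataset, or just use the Trainer API):

    import torch
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # One hypothetical (input, output) pair
    src = "question: We are making 1000 power banks. What should we pre-order?"
    tgt = "Pre-order 18650 cells, BMS boards, USB-C connectors, and enclosures."

    inputs = tokenizer(src, return_tensors="pt")
    labels = tokenizer(tgt, return_tensors="pt").input_ids

    # Passing labels makes the model return the seq2seq cross-entropy loss
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()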
Note that another approach to your problem is RAG:
https://www.promptingguide.ai/techniques/rag
If you have some specific documentation on your topic, you can use embeddings to find text that is relevant to the query. In fact, this stacks well with fine-tuning, because you can train the model to give an answer given a question and a relevant document. T5 is good at that kind of task, which is basically summarization.
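A rough sketch of the retrieval step, using the sentence-transformers library (the documents and query here are made up):

    from sentence_transformers import SentenceTransformer, util

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    # Hypothetical knowledge base: one string per BOM/inventory entry
    docs = [
        "Power bank BOM: 4x 18650 cell, 1x BMS board, 1x USB-C port, 1x enclosure",
        "Current stock: 2500 18650 cells, 800 BMS boards, 1200 USB-C ports",
    ]
    doc_emb = embedder.encode(docs, convert_to_tensor=True)

    query = "Which components should I pre-order for 1000 power banks?"
    q_emb = embedder.encode(query, convert_to_tensor=True)

    # Retrieve the top-scoring entries by cosine similarity
    hits = util.semantic_search(q_emb, doc_emb, top_k=2)[0]
    context = "\n".join(docs[h["corpus_id"]] for h in hits)

    # Only the retrieved snippets go into the prompt, not the whole
    # knowledge base, so the per-query token count stays small
    prompt = f"question: {query} context: {context}"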
- njsubedi 11 months ago
Thank you for explaining this. I'm looking at the links and the models to see what works best for me.
- verdverm 11 months ago
Check out llama-index; they have a ton of great content on RAG (Retrieval-Augmented Generation), which is probably what you want to look at instead of training. It's much cheaper, and new documents don't require more training.
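A minimal sketch with llama-index (the import path assumes a recent version; older releases import from llama_index directly, and the folder of documents is hypothetical):

    from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

    # Index every document in a local folder
    documents = SimpleDirectoryReader("./inventory_docs").load_data()
    index = VectorStoreIndex.from_documents(documents)

    # Each query retrieves only the relevant chunks and sends those
    # to the LLM, instead of the whole knowledge base
    query_engine = index.as_query_engine()
    response = query_engine.query(
        "Which components should I pre-order for 1000 power banks?"
    )
    print(response)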