I'm using a mix of 7B and 13B models that have been fine-tuned using LoRA for sp...

spaghetti1535 · on Sept 28, 2023

We're looking into fine-tuning and using 7B and 13B models, and while we understand most of the mechanics, we are somewhat overwhelmed by the number of options available and are unsure where to start. Do you recommend any open source frameworks for fine-tuning and running models? Additionally, are you open to and available for consulting in this area?

TrueDuality · on Sept 28, 2023

I appreciate the offer but I'm a bit underwater with the amount that I have on my plate right now. We're using a custom solution in-house for all of our training and hosting and it can definitely be daunting to get that far.

I'm not sure how experienced you are in the field, but there are roughly two levels of fine-tuning. The first is full fine-tuning: you update all the weights of the model, which usually requires 2-3x the memory required for inference. This allows you to change and update the knowledge contained inside the model.
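As a rough illustration of that memory figure, here is a back-of-the-envelope sketch in Python. It assumes 16-bit weights (2 bytes per parameter), counts only the weights on the inference side (no KV cache or activations), and simply applies the 2-3x rule of thumb from the paragraph above. The function names are made up for this example, and real usage depends heavily on optimizer, precision, batch size, and sequence length.

```python
def weights_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def full_finetune_range_gb(num_params: float, bytes_per_param: int = 2,
                           low: float = 2.0, high: float = 3.0) -> tuple:
    """Apply a low/high multiplier (assumed: the 2-3x rule of thumb) to the weight memory."""
    base = weights_memory_gb(num_params, bytes_per_param)
    return base * low, base * high


if __name__ == "__main__":
    for name, n in [("7B", 7e9), ("13B", 13e9)]:
        base = weights_memory_gb(n)
        lo, hi = full_finetune_range_gb(n)
        print(f"{name}: ~{base:.0f} GB of weights, ~{lo:.0f}-{hi:.0f} GB for full fine-tuning")
```

Treat the output as an order-of-magnitude guide for sizing hardware, not as a measurement.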

If the model already understands the content of the task well enough and you want to change how it responds, such as to a specific output format, "personality" or "flavor" of output, or to have it already know the kind of task it's performing without including those details in the prompt, I would go with parameter-efficient fine-tuning (LoRA is one example).
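To give a sense of why this is "parameter efficient", here is a small sketch of the arithmetic behind LoRA, one common method of this kind. LoRA freezes a weight matrix of shape d_out x d_in and trains two small matrices of shapes d_out x r and r x d_in instead. The 4096 x 4096 layer size and rank 8 below are illustrative assumptions, not settings taken from this thread.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a LoRA adapter: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_out + d_in)


if __name__ == "__main__":
    d_out, d_in, rank = 4096, 4096, 8  # illustrative values only
    full = full_params(d_out, d_in)
    lora = lora_params(d_out, d_in, rank)
    print(f"full: {full:,} params, LoRA: {lora:,} params ({lora / full:.2%} of full)")
```

The trainable fraction per adapted layer shrinks as the layer gets larger or the rank gets smaller, which is what makes it practical to keep many task-specific adapters around one base model.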

If you're looking to do a one-off training run for a model, you might be able to get away with doing it in something like this: https://github.com/oobabooga/text-generation-webui It's a very easy-to-use project, but it really doesn't allow for the kind of metrics, analysis, or professional-grade hosting you'll want.

vLLM can help with the hosting and is really solid once you have the models fine-tuned. We tried that at first, but its core architecture simply wouldn't work for what we were trying to do, which is why we went fully in-house.

Once you get into a lot of fine-tuning, you're probably going to want to do it directly in PyTorch or the equivalent for your language of choice. A good resource for seeing how people do this is actually the open source models published on Hugging Face. Look for some LoRA models, or fine-tunes similar to what you'd like. A lot of people publish their training code and datasets on GitHub, which can be very useful references.

Right now I'd recommend Llama 2 as a base model for most general language-model-based tasks, as long as you don't cross their commercial use threshold (which is very, very generous).

Hope this helps!

spaghetti1535 · on Oct 1, 2023

I can imagine, sounds like you're doing interesting and challenging stuff, best of luck. And yes thank you! You confirm some thoughts I had and clarify others. I appreciate it

kirill5pol · on Sept 27, 2023

The last part is interesting! In what kind of use case would users prefer to have it slower?

TrueDuality · on Sept 27, 2023

It's not so much about preference but controlling our load and resource consumption right now. We're setting an easy threshold to meet consistently and the added delay allows us to imperceptibly handle things like crashes in Nvidia's drivers, live swapping of model and LoRA layers, etc.

(For clarification, the user preference in my original post is about interactive users preferring to see a stream of tokens coming in rather than waiting for the entire request to complete and having it show up all at once. The performance of that streaming sets the expectation for the response time of non-interactive requests.)
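One way to picture the idea is as a latency budget. The sketch below is only an illustration under assumed numbers, not the in-house system described in this thread: it takes the time a streamed response of a given length would need as the target for a non-interactive response, and treats whatever is left after the actual generation time as slack for things like a restart or an adapter swap. The function names and all values are hypothetical.

```python
def streaming_time_s(num_tokens: int, stream_tokens_per_s: float) -> float:
    """Time an interactive user would wait to see all tokens arrive as a stream."""
    return num_tokens / stream_tokens_per_s


def slack_s(num_tokens: int, stream_tokens_per_s: float,
            actual_generation_s: float) -> float:
    """Spare time if a non-interactive response is held to the streaming budget."""
    budget = streaming_time_s(num_tokens, stream_tokens_per_s)
    return max(0.0, budget - actual_generation_s)


if __name__ == "__main__":
    tokens, stream_rate, actual = 300, 20.0, 6.0  # hypothetical values
    print(f"budget: {streaming_time_s(tokens, stream_rate):.1f} s")
    print(f"slack for recovery work: {slack_s(tokens, stream_rate, actual):.1f} s")
```

With these made-up numbers the backend would have several seconds per request in which a retry could happen without the caller seeing anything slower than the streaming experience.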