I'm using a mix of 7B and 13B models that have been fine-tuned with LoRA for specific tasks, and _after fine-tuning_ they work fantastically on the tasks they were tuned for. In my experience they're kind of garbage without fine-tuning, but I haven't tested the base models directly on these tasks beyond the statistics at the beginning of the training run.
As for performance, I'm generally seeing 40-50 tokens/sec per model on a Tesla family Nvidia GPU, but I keep multiple models loaded and active at a time, so that estimate is probably a bit low for overall throughput. (Thanks to this question, I also just realized that our monitoring doesn't have any cumulative GPU token rate metrics, hahah.)
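A cumulative token rate metric like the one mentioned as missing could be sketched roughly as below. This is only an illustration: the `TokenRateMeter` class, its method names, and the model names are all hypothetical, and it assumes the serving code can report how many tokens each model generated.

```python
from collections import defaultdict
from typing import Dict


class TokenRateMeter:
    """Hypothetical helper: counts generated tokens per model and overall."""

    def __init__(self) -> None:
        self._tokens: Dict[str, int] = defaultdict(int)

    def record(self, model_name: str, n_tokens: int) -> None:
        # Call this whenever a model finishes generating a batch of tokens.
        self._tokens[model_name] += n_tokens

    def per_model_rate(self, model_name: str, elapsed_sec: float) -> float:
        # Tokens/sec for one model over the measurement window.
        return self._tokens.get(model_name, 0) / elapsed_sec

    def cumulative_rate(self, elapsed_sec: float) -> float:
        # Tokens/sec summed across every model sharing the GPU.
        return sum(self._tokens.values()) / elapsed_sec


meter = TokenRateMeter()
meter.record("model-a", 450)  # hypothetical model names and counts
meter.record("model-b", 400)
print(meter.per_model_rate("model-a", 10.0))
print(meter.cumulative_rate(10.0))
```

The point is just that the per-model number and the per-GPU number are different metrics, and the second one is the sum over everything loaded at once.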
An anecdote others may be interested in: I'm rate limiting the output from our streaming API to 8 tokens/sec to artificially smooth out front-end requests. Interactive users will wait and even prefer seeing the response stream in, and non-interactive users tend to base their performance expectations on what the streaming API does. It's kind of sneaky, but I'm also artificially slowing down those non-interactive API requests.
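A minimal sketch of that kind of output throttle, assuming the tokens arrive as a Python iterable. The function name `throttle_tokens` and its parameters are made up for illustration, and this is not the in-house implementation, just one simple way to cap a stream at a fixed rate.

```python
import time
from typing import Callable, Iterable, Iterator


def throttle_tokens(
    tokens: Iterable[str],
    max_tokens_per_sec: float = 8.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield tokens no faster than max_tokens_per_sec (hypothetical helper)."""
    interval = 1.0 / max_tokens_per_sec
    next_emit = clock()  # earliest time the next token may go out
    for token in tokens:
        now = clock()
        if now < next_emit:
            # Upstream is faster than the cap: wait out the difference.
            sleep(next_emit - now)
        yield token
        # Schedule from the later of planned and actual time, so a slow
        # upstream never earns a burst of catch-up tokens afterwards.
        next_emit = max(next_emit, now) + interval


if __name__ == "__main__":
    for tok in throttle_tokens(["Hello", ",", " world", "!"]):
        print(tok, end="", flush=True)
    print()
```

The `clock` and `sleep` parameters are only there so the pacing can be tested without actually waiting.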
We're looking into fine-tuning and using 7B and 13B models, and while we understand most of the mechanics, we are somewhat overwhelmed by the number of options available and are unsure where to start.
Do you recommend any open source frameworks for fine-tuning and running models?
Additionally, are you open to and available for consulting in this area?
I appreciate the offer but I'm a bit underwater with the amount that I have on my plate right now. We're using a custom solution in-house for all of our training and hosting and it can definitely be daunting to get that far.
I'm not sure how experienced you are in the field, but there are kind of two levels of fine-tuning. The first is full fine-tuning, where you update all the weights of the model (this usually requires 2-3x the memory required for inference). This allows you to change and update the knowledge contained inside the model.
The second is parameter-efficient fine-tuning (LoRA is one example). If the model already understands the content of the task well enough and you want to change how it responds, such as a specific output format, a "personality" or "flavor" of output, or having it already know the kind of task it's performing without including those details in the prompt, that's the one I would go with.
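To give a feel for why LoRA counts as "parameter-efficient", here is some rough arithmetic for a single weight matrix. LoRA freezes the original matrix and trains two small low-rank matrices instead. The 4096 x 4096 shape and rank 8 below are illustrative assumptions, not numbers from the setup described above.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    # Full fine-tuning updates every entry of the d_out x d_in weight matrix.
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    # LoRA keeps W frozen and trains B (d_out x rank) and A (rank x d_in).
    return rank * (d_out + d_in)


# Assumed, illustrative sizes: one square projection matrix and a small rank.
d_out, d_in, rank = 4096, 4096, 8

full = full_finetune_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(f"full fine-tune: {full:,} trainable parameters")
print(f"LoRA (r={rank}): {lora:,} trainable parameters")
print(f"LoRA trains {lora / full:.2%} of this matrix")
```

The count only covers one matrix. A real run applies adapters to a chosen set of layers, so the overall ratio depends on which layers are targeted.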
If you're looking to do a one-off train for a model, you might be able to get away with something like text-generation-webui (https://github.com/oobabooga/text-generation-webui). It's a very easy project to use, but it really doesn't allow for the kind of metrics, analysis, or professional-grade hosting you'll want.
vLLM can help with the hosting and is really solid once you have the models fine-tuned. We tried it at first, but its core architecture simply wouldn't work for what we were trying to do, which is why we went fully in-house.
Once you get into a lot of fine-tuning, you're probably going to want to do it directly in PyTorch or the equivalent for your language of choice. A good resource for seeing how people do this is actually the open source models published on Hugging Face. Look for some LoRA models, or fine-tunes similar to what you'd like. A lot of people publish their training code and datasets on GitHub, which can be very useful references.
Right now I'd recommend Llama 2 as a base model for most general language-model-based tasks, as long as you don't cross its commercial use threshold (which is very, very generous).
I can imagine. Sounds like you're doing interesting and challenging stuff, best of luck. And yes, thank you! You confirmed some thoughts I had and clarified others. I appreciate it.
It's not so much about preference as about controlling our load and resource consumption right now. We're setting an easy threshold to meet consistently, and the added delay allows us to imperceptibly handle things like crashes in Nvidia's drivers, live swapping of model and LoRA layers, etc.
(For clarification, the user preference in my original post is about interactive users preferring to see a stream of tokens coming in rather than waiting for the entire request to complete and having it show up all at once. The performance of that stream sets the expectation for the time of non-interactive responses.)
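The point above about the throttle leaving room to absorb driver crashes or model swaps can be made concrete with some rough arithmetic. This sketch assumes the server keeps generating at full speed and buffers the tokens it has not sent yet. That buffering is an assumption for illustration, not something confirmed about the actual setup, and the function name is hypothetical.

```python
def buffered_slack_seconds(
    generate_rate: float, emit_rate: float, elapsed_sec: float
) -> float:
    """Seconds of output already buffered ahead of the client.

    Assumes (hypothetically) that generation runs ahead of the throttle
    and unsent tokens are held in a buffer.
    """
    produced = generate_rate * elapsed_sec  # tokens generated so far
    emitted = emit_rate * elapsed_sec       # tokens sent to the client so far
    backlog = max(0.0, produced - emitted)  # tokens waiting in the buffer
    return backlog / emit_rate              # how long the buffer lasts


# Generating at 40 tokens/sec while emitting at 8 tokens/sec for 5 seconds.
print(buffered_slack_seconds(generate_rate=40.0, emit_rate=8.0, elapsed_sec=5.0))
```

Under those assumptions, a few seconds of generation builds up enough backlog to keep the stream flowing through a short interruption, ignoring the fact that a response eventually ends and stops adding to the buffer.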