great work! I'm curious about the latency measurements; did you notice how time-to-first-token affected when streaming responses?
I've also seen some reports that time-to-first-token latency with fine-tuned models is less consistent than vanilla 3.5-turbo; did you observe this?
Hey, great question! We actually redid the study using streaming and found that the fine-tuned model is still consistently faster than the base model.
Though we did notice that the first request tends to be slower, so there may be a warm-up period.
awesome, that makes sense, thank you! do you happen to remember the approximate time-to-first-token for the finetuned model? fwiw, I've been seeing ~0.5-1.5s for 3.5-turbo-4k (and a bit less for 3.5-turbo-16k)
similar, but it varies a lot! Also I think there's some noise introduced by network speed, so it might be hard to benchmark exactly.
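For anyone else trying to benchmark this: here's a minimal sketch of how one might measure time-to-first-token over a streaming response. The `fake_stream` generator below is a hypothetical stand-in for a real streaming iterator (e.g. the deltas from an OpenAI streaming chat completion); swap in the real stream to measure actual TTFT. To reduce the network/warm-up noise mentioned above, you'd want to run many trials and discard the first.

```python
import time
from typing import Iterable, Tuple

def measure_ttft(stream: Iterable[str]) -> Tuple[float, str]:
    """Return (seconds until the first chunk arrives, full concatenated text).

    `stream` is any iterator yielding text chunks. TTFT is measured as the
    wall-clock time from starting iteration to receiving the first chunk.
    """
    start = time.perf_counter()
    ttft = None
    chunks = []
    for chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token observed
        chunks.append(chunk)
    return (ttft if ttft is not None else float("inf")), "".join(chunks)

# Hypothetical stand-in for an API stream: first token after ~50 ms,
# then a few more tokens in quick succession.
def fake_stream():
    time.sleep(0.05)
    yield "Hello"
    for tok in [",", " world"]:
        time.sleep(0.005)
        yield tok

ttft, text = measure_ttft(fake_stream())
```

In practice you'd collect `ttft` over, say, 20+ requests per model, drop the first request (warm-up), and compare medians rather than single samples, since per-request variance can easily swamp the model difference.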
This is great stuff! I'm working on a similar project for custom tone (basically, duplicating my own email tone) and would welcome any more specific thoughts you may have on training data. I've been working in the Hugging Face ecosystem, but I'm thinking of doing a smallish ChatGPT experiment to see how it goes.