> I guess the next question though, would be: is the objective to build a model that understands all words, or conversational speech? <novice> It seems like transfer learning on a model trained on audiobooks and then conversations would still be a good path, right? </novice>

Understanding all words is not the problem. I don't know if it's universal, but a speech-to-text system is frequently two models: a voice model (usually called an acoustic model), which maps raw audio to phonemes, and a language model, which captures what the language looks like, i.e. which words exist and which sentences are likely. So if you want the STT system to understand novels, include novels in the training data for the language model. You can then combine it with a voice model suited to conversational speech, the user's accent, or background noise.
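
To make the split concrete, here is a rough Python sketch of how the two scores might be combined when choosing a transcription. Everything in it is made up for illustration: the hypotheses, the acoustic log-scores, the tiny corpora, and the helper names `unigram_lm` and `decode` are all hypothetical. It also simplifies by letting the voice model score whole candidate sentences rather than phonemes, and it uses a toy unigram language model. A weighted sum of log-scores is one common way to combine the models, not necessarily what any particular system does.

```python
import math

def unigram_lm(corpus):
    """Train a toy add-one-smoothed unigram language model (hypothetical helper)."""
    counts = {}
    total = 0
    for sentence in corpus:
        for word in sentence.lower().split():
            counts[word] = counts.get(word, 0) + 1
            total += 1
    vocab = len(counts) + 1  # one extra slot for unseen words

    def log_prob(sentence):
        # Sum of smoothed per-word log-probabilities.
        return sum(
            math.log((counts.get(w, 0) + 1) / (total + vocab))
            for w in sentence.lower().split()
        )

    return log_prob

def decode(acoustic_scores, lm_log_prob, lm_weight=1.0):
    """Pick the hypothesis with the best combined acoustic + language score."""
    return max(
        acoustic_scores,
        key=lambda s: acoustic_scores[s] + lm_weight * lm_log_prob(s),
    )

# Assumed output of a voice model: log-scores for two candidates that
# sound almost identical. The numbers are invented.
hypotheses = {
    "the knight rode forth": -4.0,
    "the night rode forth": -3.9,
}

# Two language models trained on different (tiny, invented) text.
novel_lm = unigram_lm(["the knight drew his sword", "a knight rode to the castle"])
chat_lm = unigram_lm(["see you tomorrow night", "last night was fun", "what a night"])

print(decode(hypotheses, novel_lm))  # the knight rode forth
print(decode(hypotheses, chat_lm))   # the night rode forth
```

The voice model's scores are the same in both calls; only the language model changes, and that alone flips the result. That is the sense in which you can cover novels by changing the language model's training text while keeping a voice model tuned to the speaker and recording conditions.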