Well, GPU failure modes (from what I've heard from ML infra people) are often subtle, e.g. incorrect multiplication results. So it's not as simple as the usual ‘treat resources like cattle, not pets’, because you don't know which cows have mad cow disease before committing to an expensive training run.