Transcribe does “Recognize Multiple Speakers” — it’s on the page that you linked to, in the list of features.

I would also recommend comparing it to Google’s and maybe MS/Azure’s services.

Aside from the fidelity of the transcription itself, and the accuracy of disambiguating the different speakers, I’m also not convinced that all of these services will give you an end timestamp (start timestamps are sometimes there, but not necessarily for every word or sentence).

You could try a multi-pronged approach: run the transcription to get text and speakers only (using the services mentioned above), and then use an aligner such as “gentle” [0] to find the start / end times.
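A rough sketch of the merge step, in case it helps. It assumes you already have the speaker turns as (speaker, text) pairs from the transcription service, and that the aligner returns a list of word entries with "case", "start", "end" and a character offset "startOffset" into the transcript you gave it. I believe that is roughly what gentle's JSON looks like, but check the real output first. The function names here are made up for illustration.

```python
from bisect import bisect_right


def build_transcript(segments):
    """Join speaker turns into one transcript and record the character
    offset at which each turn begins."""
    starts, parts, pos = [], [], 0
    for _speaker, text in segments:
        starts.append(pos)
        parts.append(text)
        pos += len(text) + 1  # +1 for the newline separator
    return "\n".join(parts), starts


def time_segments(segments, starts, aligned_words):
    """Attach start/end times to each speaker turn using aligned words.

    aligned_words is assumed (not verified) to be a list of dicts with
    "case", "start", "end" and "startOffset" keys.
    """
    timed = [
        {"speaker": s, "text": t, "start": None, "end": None}
        for s, t in segments
    ]
    for w in aligned_words:
        if w.get("case") != "success":
            continue  # word was not found in the audio, so it has no timing
        # Find the turn whose character range contains this word.
        seg = timed[bisect_right(starts, w["startOffset"]) - 1]
        if seg["start"] is None or w["start"] < seg["start"]:
            seg["start"] = w["start"]
        if seg["end"] is None or w["end"] > seg["end"]:
            seg["end"] = w["end"]
    return timed
```

You would send the string from build_transcript to the aligner together with the audio, then pass the returned word list to time_segments. Turns where no word could be aligned keep start and end as None, so the gaps stay visible.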

You will have some gaps and wrongly transcribed words, but it may be a start!

[0] https://github.com/lowerquality/gentle