The clarity of the voices on the recording is vital, since the software can only transcribe what it can hear. The volume should be even throughout, with no cuts or fades mid-sentence. This means using good microphones and recording devices, and setting them up well, with the microphone at a suitable distance from the speaker.
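One rough way to check whether the volume is even is to compare the loudness of short windows of the recording. The sketch below is only an illustration under stated assumptions: the samples are assumed to be already decoded into a list of numbers, the function names `window_rms` and `volume_spread_db` are made up for this example, and the synthetic tone stands in for a real recording.

```python
import math

def window_rms(samples, window_size):
    """Return the RMS level of each consecutive full window of samples."""
    levels = []
    for start in range(0, len(samples) - window_size + 1, window_size):
        window = samples[start:start + window_size]
        levels.append(math.sqrt(sum(s * s for s in window) / window_size))
    return levels

def volume_spread_db(samples, window_size, floor=1e-9):
    """Difference in dB between the loudest and the quietest window.

    Assumes the recording is at least one window long. The floor keeps
    fully silent windows from causing a division by zero.
    """
    levels = [max(level, floor) for level in window_rms(samples, window_size)]
    return 20 * math.log10(max(levels) / min(levels))

# Synthetic stand-in for a recording: one second of a steady tone,
# then one second of the same tone at a tenth of the amplitude (a "fade").
rate = 8000
steady = [math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
faded = [0.1 * s for s in steady]

even_spread = volume_spread_db(steady, window_size=800)
uneven_spread = volume_spread_db(steady + faded, window_size=800)
print(f"steady tone: {even_spread:.1f} dB spread")
print(f"with fade:   {uneven_spread:.1f} dB spread")
```

A large spread suggests that the level drops or jumps somewhere in the recording. What counts as too large depends on the material, and natural pauses between sentences will also show up as quiet windows, so treat the number as a hint and not as a verdict.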
If there are two or more speakers, try to keep them from talking at the same time. In interviews, the host should refrain from interrupting the guests. Otherwise, the software will find it hard to distinguish the voices and may return unintelligible results.
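If you already have rough timings for who spoke when, you can estimate how much of the recording contains overlapping speech. This is a minimal sketch, assuming the turns are available as hypothetical `(speaker, start, end)` tuples in seconds; the function name `overlap_seconds` and the sample data are invented for illustration.

```python
def overlap_seconds(turns):
    """Total time during which two or more speakers are talking at once."""
    events = []
    for _speaker, start, end in turns:
        events.append((start, 1))   # a speaker starts talking
        events.append((end, -1))    # a speaker stops talking
    # At equal times, process stops (-1) before starts (+1) so that
    # back-to-back turns are not counted as overlapping.
    events.sort(key=lambda event: (event[0], event[1]))

    active = 0        # number of people speaking right now
    overlap = 0.0
    previous = None   # time of the previous event
    for time, delta in events:
        if active >= 2:
            overlap += time - previous
        active += delta
        previous = time
    return overlap

# Hypothetical interview: the guest starts answering one second
# before the host has finished the question.
turns = [("host", 0.0, 5.0), ("guest", 4.0, 9.0), ("host", 9.0, 12.0)]
total = overlap_seconds(turns)
print(f"{total:.1f} seconds of overlapping speech")
```

The less overlap there is, the easier it should be for the software to tell the voices apart.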
The accent of the speakers is another consideration. If the speakers have heavy accents, the results might be less than ideal. They pronounce words quite differently from the standard pronunciation, which leads to mistakes on the part of the software.
Manual transcribers also have a difficult time in this case and may need several passes to understand what is being said. Local speakers who have the same accent can understand such speech better. Perhaps software will be able to adjust to various accents in the future.
Be careful when you add closed captions to videos. Automated transcription software makes the task fast and easy, but it only works well if the audio recording is clear, with light accents and minimal background noise.