Attendees at TechCrunch’s recent Disrupt conference in San Francisco not only saw and heard the speakers, but were also able to read a transcript of what was being said in real time, either on a screen in the room or on their phone or PC.
That’s because of a smartphone app called Otter.ai, from Los Altos-based AISense. The app, which runs on iOS, Android and the web, records audio and, as it records, transcribes the audio into text. Like all voice recognition systems, it’s not perfect. It sometimes misspells last names and types the wrong word.
But, as someone who’s used plenty of speech recognition software, I am impressed with how good it is, especially when transcribing a conversation or a presentation where the person isn’t going out of his or her way to speak slowly and deliberately as you often must when dictating to Siri, Google Assistant, Amazon Alexa or speech-dictation software.
You can see examples of how it works at Larrysworld.com/otter. There you’ll find an interview with AISense CEO Sam Liang with both the audio recording and the transcript. The transcript on the site is edited for clarity and to remove transcription errors, but there is a link to the raw transcript on the Otter.ai site. Below the transcript on that site is a play button that allows you to play the audio and follow along in the text, making it obvious when Otter is getting it right and when it makes mistakes. The mistakes it makes most often are failing to capitalize a proper noun and not knowing when to insert a period or comma.
I’ve also used Otter.ai to transcribe my daily CBS News Eye on Tech segments. I simply load in the recorded MP3 files and wait a few seconds for it to do the transcription. I then go in and edit out any mistakes. From using it, I’ve actually learned that there are some words that I don’t pronounce clearly. Humans can probably figure them out, but Otter types exactly what it hears. You can see and hear examples at larrysworld.com/eye-on-tech.
In that podcast and transcribed interview on Larrysworld, Liang described the app as “very different than Siri or Alexa and Google Home.” He said that “they handle a conversation between the human being and a robot. You can ask a short question like, what’s the weather tomorrow? And the robot will answer that question. However, Otter is doing something totally different. It listens to human-to-human conversations and transcribes the conversation in real time.”
Unlike previous articles where I’ve used a quote from one of my recorded interviews, I didn’t have to listen and type this quote. It’s from the transcript that Otter created when I loaded in the MP3 file of the interview. My podcasts are recorded using professional audio equipment, but you get surprisingly good results when speaking into a smartphone or even just taking out your smartphone to record conversations in a room, a car or a lecture hall. As a test, I ran the software while riding in a car with another person, and it did a good job of picking up and transcribing both of our voices.
Figuring out who’s speaking
The app tries to figure out when a new person starts to speak and separate all the voices. You can tag a sample with the name of a speaker, and it will analyze the rest of the conversation and apply that person’s name each time he or she speaks. Again, it’s not perfect, but it gets it right most of the time.
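For readers curious how tagging one sample could label a whole conversation, here is a minimal sketch in Python. It is not Otter's actual method, which isn't described here. It assumes a hypothetical earlier step has already grouped each stretch of speech by voice and given each group a number (the made-up `cluster_id` field). The `Segment` class and `propagate_tag` function are likewise invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """One stretch of transcribed speech (hypothetical structure)."""
    cluster_id: int                 # assumed: voice-group number from an earlier step
    text: str
    speaker: Optional[str] = None   # filled in once someone tags the voice


def propagate_tag(segments: List[Segment], tagged_index: int, name: str) -> List[Segment]:
    """Apply `name` to every segment that shares a voice group with the tagged one."""
    target = segments[tagged_index].cluster_id
    for seg in segments:
        if seg.cluster_id == target:
            seg.speaker = name
    return segments


segments = [
    Segment(0, "How is Otter different from Siri?"),
    Segment(1, "It listens to human-to-human conversations."),
    Segment(0, "And it works in real time?"),
]

# Tag the first segment as "Larry"; the third shares its voice group.
propagate_tag(segments, 0, "Larry")
labels = [seg.speaker for seg in segments]
```

In this toy example the first and third segments end up labeled with the tagged name, while the second stays unlabeled until someone tags that voice as well. A real system would also have to cope with voices that are grouped incorrectly, which is one reason the labels aren't always right.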
I can think of all sorts of applications for this technology. Journalists can use it to automatically transcribe interviews. Making the corrections is a lot easier than typing it from scratch. Students could use it to record and transcribe lectures and, perhaps, share them with a classmate. Legislative bodies, like city councils, could use it to provide citizens with a real-time transcript of meetings.
Antidote to distraction
At the Disrupt conference, I was watching the Otter.ai transcript in real time as I was listening to speakers and was impressed that it typed the words almost instantaneously as they were spoken. There were times I was distracted and wasn’t listening carefully, but I was able to quickly catch up by reviewing the transcript. I also used it to read sessions that I wasn’t able to attend. Of course, I could have listened to the audio of those sessions, but it’s a lot faster to read, or at least skim, a transcript.
If transcripts are posted on the web, they can be searched by Google and other search engines, which is usually not the case for audio files. So it’s a way for podcasters to make their work more discoverable.
Even though voice recognition has been around for decades, there is still a lot of work to be done to make machines as good as humans when it comes to understanding, acting on and transcribing voice. If you don’t believe me, ask Siri, Alexa, Microsoft Cortana or Google Assistant. They might actually answer you — assuming they understand what you’re saying.
Larry Magid is a tech journalist and internet safety activist.