Sometimes you get the subtitles (subs) but not the voicing (dubs).
So what are you supposed to do?
Translation is a careful art that can't be automated and requires the loving touch of a human hand.

I mean, who would want to listen to machine voices for an entire season?
Only a real sicko would want that.
We'll start by transcribing audio to text using Google Cloud's Speech-to-Text API.

Next, we'll translate that text with the Translate API.
AI-dubbed videos: Do they usually sound good?
Before you embark on this journey, you probably want to know what you have to look forward to.

What quality can we realistically expect to achieve from an ML-video-dubbing pipeline?
Here's one example dubbed automatically from English to Spanish (the subtitles are also automatically generated in English).
(Ignore the fact that the speaker sometimes speaks too fast; more on that later.)

Dubbing from non-English languages proved substantially more challenging.
If that quality sounds good enough for your project, great. But if not, read on!
To launch the code yourself, follow the README to configure your credentials and enable APIs.

In this post, I'll just walk through my findings at a high level.
But in programming, all hubris must be punished, and boy, was I punished.
But more on that in a bit.

To do this, I used Google Cloud's Speech-to-Text API.
As an example, I transcribed this video.
You can see the JSON returned by the API in this gist.
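If you're curious what that call looks like, here's a minimal sketch in Python. It assumes the audio has already been extracted from the video and uploaded to a Cloud Storage bucket (the gs:// path is hypothetical):

```python
from google.cloud import speech

client = speech.SpeechClient()

# Word-level timestamps are the key feature here: they're what lets us
# align subtitles and dubs back to the source video later.
config = speech.RecognitionConfig(
    language_code="en-US",
    enable_word_time_offsets=True,
    enable_automatic_punctuation=True,  # punctuation is predicted, not heard
)
audio = speech.RecognitionAudio(uri="gs://your-bucket/video-audio.wav")

# Long-running recognition handles audio longer than about a minute.
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=300)

for result in response.results:
    best = result.alternatives[0]
    print(best.transcript)
    for word in best.words:
        print(word.word, word.start_time, word.end_time)
```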

The output also lets us do a quick quality sanity check:
What I actually said:
Software Developers.
We're not known for our rockin' style, are we?
Today, I'll show you how I used ML to make me trendier, taking inspiration from influencers.
What the API thought I said:
Software developers.
We're not known for our Rock and style.
Are we or are we today?
I'll show you how I use ml to make new trendier taking inspiration from influencers.
Note that the punctuation is a little off.
At this point, we can use the API output to generate (non-translated) subtitles.
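Here's a hedged sketch of that step, assuming the word timings from the API response have been flattened into (word, start_seconds, end_seconds) tuples. Breaking every ten words is a placeholder; a real pipeline would break on pauses and punctuation instead:

```python
def to_srt_timestamp(seconds):
    # SRT timestamps look like 00:01:02,345
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    millis = int(round((seconds - int(seconds)) * 1000))
    return f"{hours:02}:{minutes:02}:{secs:02},{millis:03}"

def words_to_srt(words, words_per_caption=10):
    # words: list of (text, start_seconds, end_seconds) tuples
    entries = []
    for i in range(0, len(words), words_per_caption):
        chunk = words[i:i + words_per_caption]
        start, end = chunk[0][1], chunk[-1][2]
        text = " ".join(w[0] for w in chunk)
        entries.append(
            f"{len(entries) + 1}\n"
            f"{to_srt_timestamp(start)} --> {to_srt_timestamp(end)}\n"
            f"{text}\n"
        )
    return "\n".join(entries)
```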
This is where things start to get a little tricky.
The problem, though, is that translations aren't word-for-word.
A sentence translated from English to Japanese may have its word order jumbled.
But even this becomes complicated, because how do you delimit a single sentence?
In English, we can split text on punctuation marks, i.e. periods, question marks, and exclamation points.
Plus, in real-life speech, we often don't talk in complete sentences.
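To make that concrete, here's a toy splitter. Note that it leans entirely on the machine-predicted punctuation we just saw being unreliable, which is the whole problem:

```python
import re

def split_sentences(transcript):
    # Split after terminal punctuation followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", transcript) if s]

print(split_sentences("Software developers. We're not known for our rockin' style, are we?"))
# ['Software developers.', "We're not known for our rockin' style, are we?"]
```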
For example, if I translate the sentence "I'm feeling blue, but I like pink too," I'll get the translation:
Je me sens bleu, mais j'aime aussi le rose.
("I'm feeling sad, but I like pink too.")
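Mechanically, each chunk of text just goes through the Translation API. Here's a minimal sketch using the basic (v2) client, with the example sentence above:

```python
from google.cloud import translate_v2 as translate

client = translate.Client()
result = client.translate(
    "I'm feeling blue, but I like pink too.",
    source_language="en",
    target_language="fr",
    format_="text",  # avoid HTML-escaped apostrophes in the output
)
print(result["translatedText"])
```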
Translating chunk by chunk this way naturally led to some awkward translations (i.e. phrases split across chunks sometimes came out mangled).
And one last thing.
If you already know how you want certain words to be translated (i.e. you want a name or technical term handled consistently), you can take advantage of "glossaries" in the Translation API Advanced.
I wrote a blog post about that here.
It's called the Media Translation API, and it runs translation on audio directly (i.e. no transcribed text intermediary).
Text-to-Speech
Now for the fun bit: picking out computer voices!
If you read about my PDF-to-Audiobook converter, you know that I love me a funny-sounding computer voice.
To generate audio for dubbing, I used the Google Cloud Text-to-Speech API.
But what if a speaker talks fast? Then the dubs would be impossible to align to the source video.
Or, what if a translation is more verbose than the original wording, leading to the same problem?
To deal with this issue, I played around with the speakingRate parameter available in the Text-to-Speech API.
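Here's roughly what that looks like (speakingRate surfaces as speaking_rate in the Python client; the voice name and the rate of 1.2 are just placeholder values):

```python
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(text="Je me sens bleu, mais j'aime aussi le rose."),
    voice=texttospeech.VoiceSelectionParams(
        language_code="fr-FR",
        name="fr-FR-Wavenet-A",
    ),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3,
        speaking_rate=1.2,  # >1.0 speeds speech up to fit a shorter window
    ),
)

with open("dub.mp3", "wb") as out:
    out.write(response.audio_content)
```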
Sound a little complicated?
But that's a problem for V2.
Was it worth it?
You know the expression, "Play stupid games, win stupid prizes"?
She also likes solving her own life problems with AI, and talks about it on YouTube.