Dubformer: AI and the dubbing revolution

Anton Dvorkovich, CEO of AI dubbing startup Dubformer, explains how AI can revolutionise media localisation.

DTVE: Why is AI dubbing becoming more relevant to the media business? What is driving rising interest?

AD: There are two main reasons. First, over the last decade the media industry has gone through numerous changes: YouTube has won the hearts of millions and is impossible to ignore. Practically the same thing is happening with FAST – its audience share is increasing year by year.

That means media companies can’t afford to underestimate the potential of these channels in their revenue portfolio. They are looking for ways to earn more, faster, and are experimenting with new business models – and here AI technologies, especially AI dubbing, could lend a shoulder and help revolutionise the media industry.

In this new environment, dubbing, and localisation in general, need to be faster and less costly. They also need to be scalable.

Second, AI dubbing technology is maturing fast, with advances in machine learning, natural language processing, synthetic voices, speech-to-speech translation and so on. The technology is now ready to serve the demands of the media market.

DTVE: What are the main pain points that you address with your solution? What are the main challenges regarding localisation that you see across the industry today?

AD: One of the main pain points is the economic pressure the industry is facing in general. There is growing demand for video content. People are spending more and more time with their electronic devices and they want to be entertained by them.

However, there is a battle for their attention – people don’t just consume content through the TV, and this presents a challenge for any company.

They say that content is king. Those who offer unique content will benefit. But those who find a way to put it on rails, replenishing a unique base of content regularly and systematically, will win.

There are a few ways of doing this: first, produce content inside the company – in other words, open an internal production studio; second, sign an exclusive contract with a production studio; or, third, localise content the company already has into other languages and use it to test new markets via new channels such as YouTube.

One of the key pain points is the high cost of localising content. And when it comes to content localisation, as we already know, media companies face a lot of difficulties. Imagine you have a good English-language documentary that has proved successful in your local market. And you want to localise it into five more languages to show to new audiences. Or maybe 10 languages or 30.

That means you need to find 5-15 subcontractors depending on the languages you need.

First, it will be difficult to control all of them in terms of quality and delivery time: they have different internal processes, and human error is always a factor. Second, it will be difficult to deliver a consistent level of quality and standards across the dubbed content.

Moreover, these processes are difficult or even impossible to scale.

The good news is that although AI is very new, AI dubbing is already happening. Over five million minutes of content will be dubbed using AI over the next year. And AI dubbing helps reduce costs, helps with scalability and basically helps get the job done.

DTVE: Are audiences ready for AI-dubbed content?

AD: Definitely. The main concern has been about quality – how good is the dubbing? The reality is that it has already reached broadcast quality. Of course, the use of AI varies according to the type of content.

But our experience shows that if you take less emotional content, it could be a good strategic decision to use AI dubbing for it.

For instance, some types of content are already being dubbed very successfully using AI, including narrative or factual content, news, unscripted content and educational content. Other types, such as kids’ content, movies and series, are of course more challenging, and some of this is still pretty hard to do with the currently available technology.

But overall the audience is definitely ready. We previously created Y.browser, a tool that lets the general audience automatically dub videos from YouTube, Vimeo and other internet platforms into their own language. It was super-successful and has had huge use, with over 150 million minutes of content dubbed daily. And the watch-through rate is 80%, which suggests that the quality is good enough for the millions of users that watch this content daily.

DTVE: The human review element, or human-in-the-loop, isn’t something new for machine translation but I believe you have several innovative solutions related to this within your product. What makes them different?

AD: This is a great question because the human-in-the-loop, or human review, has been key. Although the technology is astonishingly good and developing really fast, the truth is that there is always room for mistakes. And where media companies are concerned, it’s incredibly important to reduce the number of mistakes: special terms and names wrongly pronounced, intonational pauses in the wrong place and so on.

Mistakes may be few, but they exist. So you need to build a complex system that combines human and artificial intelligence, with human listeners – professional translators and native speakers – able to fix mistakes.

Those mistakes can be mispronunciations of individual words, for example proper names – putting the stress in the wrong place or simply mispronouncing the word. We know that the same word can be pronounced in different ways in different contexts. You need humans not just to point out the mistake but to give input to the AI so it knows how to fix it.
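As a rough illustration of what feeding a reviewer's correction back into a system could look like, the sketch below applies a reviewer-maintained pronunciation lexicon to a script before it is sent to speech synthesis. It is not Dubformer's implementation: the lexicon format, the made-up entry and the function name `apply_overrides` are all assumptions for the example.

```python
import re

# Hypothetical reviewer-maintained lexicon: written form -> respelling that a
# text-to-speech engine is assumed to read correctly. The entry is invented.
PRONUNCIATION_OVERRIDES = {
    "Exampleton": "ex-AM-pul-ton",
}


def apply_overrides(text, overrides):
    """Replace whole-word, case-sensitive matches with the reviewer's respelling."""
    if not overrides:
        return text
    # Try longer entries first so a short entry never pre-empts a longer one.
    words = sorted(overrides, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in words) + r")\b")
    return pattern.sub(lambda m: overrides[m.group(1)], text)


script_line = "Welcome to Exampleton, home of the Exampletonian festival."
fixed_line = apply_overrides(script_line, PRONUNCIATION_OVERRIDES)
```

Because the match is whole-word, the correction is applied to "Exampleton" but leaves "Exampletonian" untouched, so a fix entered once by a proof listener does not spill over into other words.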

Harder problems to solve include getting the intonation of a sentence wrong by choosing the wrong word to stress – “Time flies like an arrow; fruit flies like a banana” – things like that. Sometimes dubbing gets this wrong and human review exists to fix that. We have created technology that makes fewer mistakes, and any mistakes that are made are fixable.

It’s easy to fix mistakes in text, but it’s hard to fix mistakes in an audio track. You can’t just re-write the segment. We have created innovative solutions that allow proof listeners to manipulate the audio track. Moreover, you need to lay out the translated speech to fit the speaker labelling and the length of the scene. That keeps the audio and video in sync, which gives the audience a better experience. This is something that hasn’t really been done in the industry before.
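To make the idea of fitting translated speech to the length of a scene concrete, here is a minimal sketch. It is an illustration under stated assumptions, not a description of Dubformer's system: the `Segment` fields, the `fit_segment` helper and the 1.2x speed-up limit are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str            # speaker label for the line
    start: float            # scene slot start, in seconds
    end: float              # scene slot end, in seconds
    speech_duration: float  # length of the synthesised translation at normal speed


MAX_RATE = 1.2  # assumed limit on speed-up before speech starts to sound rushed


def fit_segment(seg, max_rate=MAX_RATE):
    """Return (playback_rate, needs_review) for one dubbed segment."""
    slot = seg.end - seg.start
    if slot <= 0:
        raise ValueError("segment must have a positive duration")
    required = seg.speech_duration / slot
    # Never slow speech down below normal speed; cap the speed-up at max_rate.
    rate = min(max(required, 1.0), max_rate)
    # If even the capped rate cannot fit the slot, hand the line to a human.
    needs_review = required > max_rate
    return rate, needs_review


fits = fit_segment(Segment("narrator", 0.0, 5.0, 5.5))
too_long = fit_segment(Segment("narrator", 5.0, 9.0, 6.0))
```

In this sketch a line that runs slightly long is sped up a little, while one that cannot fit even at the capped rate is flagged, for example so that a translator can shorten the wording rather than have the voice rush.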

DTVE: Why do you think you have the potential to revolutionise the dubbing industry?

AD: The industry is ripe for disruption. There is high demand for dubbing services. Media companies want to expand to find new audiences and new business models. There is high demand for cheap and scalable AI dubbing.

Content is one of the big drivers of the media and entertainment industry and localisation is a very nice and easy way to create new content.

It comes from both ends: not only are we ready to revolutionise the industry, but the media industry itself is seeking appropriate evolutionary methods.

On the one hand, media companies have always been looking for ways to gain additional revenue, and one of those ways is to balance costs. For many years the structure of the industry remained the same. Nowadays digital channels are emerging and their share of users grows every year, so media companies can’t pretend they don’t exist; they need to set up experiments with them in order not to miss new opportunities.

Moreover, they need to deliver content faster to keep it up-to-date for the audience. So they need the content and the ability to localise it faster, without sacrificing quality.

When it comes to traditional dubbing, about 70% of localisation costs are in voice-over. AI dubbing allows you to reduce that considerably.

It’s always good to find the right moment. As we see it, that moment has come.

Then you have the state of the technology. It’s really amazing how text-to-speech technology and synthetic voices have developed over the last few years, particularly during the last year. There are many startups doing well. We have the technology to revolutionise or disrupt the industry. We at Dubformer have a very strong, very experienced team, with people both from machine learning and from the media industry who have been creating dubbing for years. We have over 10 years of experience in building these systems. We also have experience of related fields such as text generation, computer vision, machine translation and speech recognition. We created our AI dubbing solution for Y Browser, which has an audience of over 100 million.

It’s not just about creating cool machine learning models, it’s about making these models steerable so that mistakes are fixable. This is a very intricate area in which we also have lots of expertise.

People have also started turning away from thinking about AI as something that is harmful and weird and are thinking more about how to put it to use.

We’re a very young startup, but we’re moving fast. We’ve learned a lot, for example about mixing. We should have paid attention to this from the beginning. You can partly replace a sound engineer with what we call AI mixing, but there’s a long way to go from synthetic voices, even if they sound really natural and human, to a final audio track.

During the last year we worked very closely with media companies, understanding not only their pains but also all the nuances of broadcasting standards. For instance, if we take mixing, there are industry-accepted specifications for it. And we tailored our solution towards that.

We are already working with media companies a lot. And this autumn you’ll see some content dubbed by us on Amazon Prime and YouTube. Stay tuned!

DTVE: What is next for Dubformer?

AD: We’re very ambitious. Our goal is to disrupt the localisation market and help media companies work through these new challenges that they face, by getting them together to find solutions. We’re expanding our voice library. We currently have more than 1,000 voices, but we need even more. We’re enhancing control. We’re doing a lot of work in voice conversion and cloning and we’re extending our support for more languages, including Hindi.

We’re also doing a lot of long-term research and we are scaling up the human-in-the-loop platform that we’ve created internally to be able to serve lots of clients. Next year we expect to be producing one million minutes of professional content with AI.

We will be at IBC this week. Come along to stand 8.A72 and we will tell you more and give you a live news dubbing demo.