November 4, 2016

Is Adobe's Project VoCo the Photoshop for Audio?

Adobe Project VoCo replicates voices to add dialogue you forgot to record.

Over the past week, Adobe has shown off a slew of new technology at its annual San Diego mega-conference, Adobe MAX. In addition to highlighting upcoming releases, the convention serves as a training center for existing software and hosts dozens of panels over the course of the week.

One of the most unusual reveals Adobe has treated its attendees to is a project under development in collaboration with Princeton University. Adobe developer Zeyu Jin took the stage to introduce Project VoCo, a prototype he described as having the potential to do for audio what Photoshop does for photography.

Essentially, the software will allow you to add words to your audio recording that were never recorded. If one of your actors gives a reading that proves to be just a little off, you may now be able to tweak it, adding or replacing a word that doesn’t originally appear in the audio file.

It may sound like some sort of strange voodoo, but to prove Adobe's vision, Jin did a live demonstration of the software. He was able to add text to a recording in exactly the same voice by simply selecting a clip of speech, opening up an edit box, and typing in new text. In the words of one attendee, he "redubbed what the speaker had actually said."

To pull off this technical wizardry, the algorithm needs only around 20 minutes of recorded speech from the speaker. It analyzes the speech, breaks it down into phonemes, transcribes it, and creates the voice model.

The tech blog TechCrunch says the project isn't "based on traditional speech synthesis technology, but on what Adobe calls 'voice conversion.'" It goes on to report that "there’s almost no manual intervention necessary. You can always correct the auto-generated transcript to improve the synthesis, but there’s no need to set timestamps, for example. The algorithms can figure that out themselves."
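To make the phoneme step a little more concrete, here is a toy sketch in Python. It is emphatically not Adobe's algorithm, which reportedly relies on voice conversion and has not been published in this article; it only illustrates the general idea of indexing a time-aligned recording by phoneme and checking whether a new word could be covered by sounds the speaker has already made. The mini-lexicon, the ARPAbet-style phoneme symbols, the timestamps, and the function names `build_inventory` and `plan_word` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical mini-lexicon (word -> ARPAbet-style phonemes).
# A real system would use a full pronunciation dictionary.
LEXICON = {
    "take": ["T", "EY", "K"],
    "cake": ["K", "EY", "K"],
    "kate": ["K", "EY", "T"],
    "make": ["M", "EY", "K"],
}

# Hypothetical time alignment of a recording of the words "take cake":
# each entry is (phoneme, start_seconds, end_seconds).
ALIGNED = [
    ("T", 0.00, 0.08), ("EY", 0.08, 0.22), ("K", 0.22, 0.30),
    ("K", 1.00, 1.07), ("EY", 1.07, 1.20), ("K", 1.20, 1.29),
]


def build_inventory(aligned):
    """Group the recorded time spans by the phoneme they contain."""
    inventory = defaultdict(list)
    for phoneme, start, end in aligned:
        inventory[phoneme].append((start, end))
    return inventory


def plan_word(word, inventory):
    """Pick the first recorded span for each phoneme of `word`.

    Returns a list of (phoneme, (start, end)) pairs, or None if the
    recording never contains one of the phonemes the word needs.
    """
    plan = []
    for phoneme in LEXICON[word.lower()]:
        spans = inventory.get(phoneme)
        if not spans:
            return None  # this sound was never recorded
        plan.append((phoneme, spans[0]))
    return plan


inventory = build_inventory(ALIGNED)
print(plan_word("kate", inventory))  # every phoneme is available in the sample
print(plan_word("make", inventory))  # None: the sample has no "M" sound
```

The second lookup comes back empty because the sample never contains an "M" sound, which hints at why a tool along these lines would want a generous amount of source speech to work from.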

An official statement from Adobe released earlier today details the purpose of the prototype: "When recording voiceovers, dialog, and narration, people would often like to change or insert a word or a few words due to either a mistake they made or simply because they would like to change part of the narrative. We have developed a technology called Project VoCo in which you can simply type in the word or words that you would like to change or insert into the voiceover. The algorithm does the rest and makes it sound like the original speaker said those words."

Courtesy of Adobe.

To be sure, many of Adobe's prototypes have never seen the light of day. But given the popularity of podcasting, the perennial problems with capturing clean audio, and the monotony of ADR, Project VoCo looks like it could be a real game-changer for filmmakers. This one may indeed hit the market relatively soon.

VoCo does have some frightening, dystopian potential (you could replicate someone's voice for any number of nefarious deeds), but as avid podcasters and filmmakers ourselves, we're excited by what we see.


I think he meant to say Project VoCo; Project Felix is Adobe's stab at easy 3D. This would be incredibly helpful for the projects I work on! I'm excited to see what the final application looks like!

November 4, 2016 at 1:47PM, Edited November 4, 1:47PM


Wow. That's incredible. The sheer possibility of not needing to do ADR anymore...? I can only imagine the potential once the software evolves to adjust tonality...

November 4, 2016 at 1:49PM, Edited November 4, 1:49PM

Andy Zou
Filmmaker / Creative Director

They mentioned that they need 20 minutes of actual speech to make this possible.
But after some more research I guess it will be possible to cut that down :)

November 4, 2016 at 2:05PM

Robert Zinke
Blogger / DP / VFX Artist

I had to watch the video twice because the first time I was just too mesmerized by that fantastic unibrow to concentrate on anything else.

November 4, 2016 at 3:22PM, Edited November 4, 3:22PM

Matthew Macar

So if much more than 20 minutes is fed to the algorithm, will it be able to alter huge chunks of dialogue instead of the odd second?

November 4, 2016 at 6:49PM

Saied M.

I wonder that too. Its killer move seems to be adapting the tone, cadence, and volume of the word it's replacing so maybe not? We'll find out...

November 4, 2016 at 7:01PM

J Robbins

"That algorithm really nailed that performance," said nobody ever. The future of cinema, according to Adobe, is a bunch of infants putting square pegs into square holes and then patting themselves on the back.

November 4, 2016 at 8:01PM, Edited November 4, 8:01PM

Thomas Cassetta
Production Sound Mixer / Re-recording Mixer

The concerns about misuse of this tech are obvious, but so are the practical uses. I'm excited about this, as it could (potentially) solve a lot of post-production glitches. Like any tech, there will be times it doesn't hit the mark, but for the times that it does, it will be powerful.

November 5, 2016 at 3:57PM

David Patterson

Nice! This could really help in ADR situations. It also could be used for evil lol.

November 6, 2016 at 7:27PM

Dantly Wyatt
Musical Comedy & Content Creator.

This is going to affect voiceover people. They're going to lose money. Now when something is modified they don't need to be called back in to do additional dubs.

November 7, 2016 at 2:20AM


It just makes me tingle all over with Excitement: no humans needed! And there goes yet another segment of folks with their Jobs & what they love doing destroyed. How wonderful.
I guess Adobe is planning the Subscription model for Cylons only.

November 11, 2016 at 7:01PM