Adobe MAX 2016 held an unexpected surprise: Adobe VoCo. Think of it as the Photoshop of spoken audio.
Many in the VFX community have speculated about what we were seeing, but the truth is more interesting than you might expect.
In the Adobe MAX conference video clip below, it is not immediately obvious how significant this technology is unless you watch it to the end. Being able to edit a recorded audio clip via text is cool but not revolutionary… being able to make an audio clip of someone saying a sentence they never even remotely said is completely game changing. Remarkably, we all assumed that Adobe had added deep learning AI technology to the company’s audio team. But fxguide spoke to Zeyu Jin directly, and while he and the team at Adobe believe the future lies in deep neural networks, he explained that “the VoCo you saw at the Max conference is not based on deep learning”.
Near the end of the presentation Zeyu Jin reveals, in a brief off-script chat, that Adobe uses 20 minutes of training data. From this 20-minute sample, the system can not just edit existing speech but create whole new sentences, with the correct weighting and cadence needed to be believable. It was assumed by many, including here at fxguide, that this was 20 minutes of deep learning training data, but that is not how Adobe drives VoCo – which in one sense makes it even more remarkable.
Project VoCo allows you to edit and create speech as text, opening the door to vastly more realistic conversational assistive agents, better video game characters and whole new levels of interactive computer communication. Google’s and Apple’s embodied agents such as Siri already produce passable synthetic conversational dialogue, but if Adobe can commercialise this tech preview into a full product, the sky is the limit.
As VoCo allows you to change words in a voiceover simply by typing new words, it was a huge hit of the Adobe MAX 2016 Sneak Peeks. While this technology is not yet part of Creative Cloud, many such technology preview “sneaks” from previous years have later been incorporated into Adobe products.
Which raises the question: how are they doing this, if they are not using deep learning?
“It is not related to decision trees or traditional machine learning, but mostly to mathematical optimization and phonetic analysis,” Jin explained.
A couple of years ago Adobe showed a text tool that let you search a clip by text. Initially many assumed that this was more than it was; in reality the software took your written script and matched it to the audio for searching. At the time it was both exciting to see Adobe doing this work and a tad disappointing that they were not showing auto-transcribing software, as early reports had indicated. Jump to 2016 and Adobe is demoing that it can do to audio what it has done to editing still images. But this new VoCo technology is not related at all to the earlier Adobe text search-edit demo; it is completely new tech. “It is not built on any existing Adobe technology. It is originated from Princeton University where I am doing my Ph.D.,” he explains. “Unfortunately, it also does not imply improvement on auto transcription. We are actually relying on existing transcription approach to perform phoneme segmentation.”
“The core of this method is voice conversion: turning one voice into another.” The system uses micro clips but no pitch correction; “in fact, pitch correction is what makes other approaches inferior to our approach,” he comments.
Most state-of-the-art voice conversion methods re-synthesize voice from spectral representations, and this introduces muffled artefacts. Jin’s PhD research uses a system he calls CUTE: A Concatenative Method For Voice Conversion Using Exemplar-based Unit Selection. Or in loose terms: it stitches together pieces of the target voice using examples – it just does so very cleverly. It optimizes for three goals: matching the phoneme string you want, using long consecutive segments, and smooth transitions between those segments.
CUTE stands for:
- Concatenative synthesis
- Unit selection
- Triphone pre-selection
- Exemplar-based features
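To make the three-goal optimization concrete, here is a minimal, illustrative sketch of exemplar-based unit selection – not Adobe’s or Jin’s actual code, and the function names and the `join_weight` parameter are assumptions for the example. A Viterbi-style search picks one corpus frame per target frame, paying a target-match cost plus a splice cost, and letting consecutive corpus frames join for free so long contiguous segments naturally win:

```python
# Illustrative sketch only (hypothetical names): Viterbi-style unit selection
# over per-frame candidates, as one way to realise CUTE's three goals.

def select_units(target_feats, corpus_feats, candidates, join_weight=1.0):
    """target_feats : feature vectors of the desired utterance
    corpus_feats   : feature vectors of the example corpus
    candidates     : candidates[t] = corpus frame indices allowed at step t
                     (e.g. pre-selected by matching triphone context)
    Returns the chosen corpus frame index for every target frame."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    T = len(target_feats)
    # cost[t][u] = best total cost ending with corpus unit u at target frame t
    cost = [{u: dist(target_feats[0], corpus_feats[u]) for u in candidates[0]}]
    back = [dict()]
    for t in range(1, T):
        cost.append({})
        back.append({})
        for u in candidates[t]:
            tgt = dist(target_feats[t], corpus_feats[u])
            best_prev, best = None, float("inf")
            for p, c in cost[t - 1].items():
                # consecutive corpus frames join for free; otherwise pay a
                # spectral-mismatch penalty at the splice point
                join = 0.0 if p + 1 == u else join_weight * dist(corpus_feats[p], corpus_feats[u])
                if c + join < best:
                    best_prev, best = p, c + join
            cost[t][u] = best + tgt
            back[t][u] = best_prev
    # trace back the cheapest path
    u = min(cost[-1], key=cost[-1].get)
    path = [u]
    for t in range(T - 1, 0, -1):
        u = back[t][u]
        path.append(u)
    return path[::-1]
```

Given a corpus containing a contiguous run that matches the target, the zero-cost consecutive join makes the search prefer that run over equally good but scattered frames – the “long consecutive segments” goal falls out of the cost design.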
To have sufficient ‘units’ of audio to use flexibly, the approach defines these ‘units’ down to the frame level. To ensure smooth transitions, it computes features from the 20 minutes of example audio by concatenating spectral representations over multiple consecutive frames.
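A rough sketch of that feature construction, under the assumption of a fixed neighbour window (the function name and `width` parameter are invented for illustration): each frame is stacked with its neighbours so that distances between units reflect local spectral context, not a single isolated frame.

```python
import numpy as np

# Illustrative sketch (hypothetical name): build smoothness-aware features by
# concatenating each spectral frame with its +/- width neighbours.
def stack_context(frames, width=2):
    """frames: (T, D) array of per-frame spectral features.
    Returns a (T, (2*width+1)*D) array; edges are padded by repetition."""
    frames = np.asarray(frames)
    padded = np.concatenate([frames[:1].repeat(width, axis=0),
                             frames,
                             frames[-1:].repeat(width, axis=0)])
    return np.concatenate([padded[i:i + len(frames)]
                           for i in range(2 * width + 1)], axis=1)
```

With features like these, two units that sound alike frame-by-frame but sit in different spectral trajectories are no longer confused, which is what makes the concatenation cost in unit selection meaningful.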
To obtain the phoneme segmentation, the system first translates the transcripts into phoneme sequences and then applies forced alignment to align phonemes to the target speaker’s voice.
The original Princeton approach defined two types of exemplars, a target exemplar and a concatenation exemplar: the target exemplars control the patterns of rhythm and sound (the patterns of stress and intonation in the spoken words), while the concatenation exemplars enforce smoothness where segments join. Using phoneme information to pre-select candidate units, the method ensures the longest possible phonetically correct segments are used in concatenative synthesis.
Experiments demonstrate that the CUTE method produces better quality than previous voice conversion methods, with individuality comparable to real samples.
You can see Jin’s original paper here.