Enhancing Lip Sync Quality: A Journey Through AI and Machine Learning Models

At MakerX, our passion for experimentation is fundamental to our R&D studio culture. One of our more intriguing experiments involved creating a fully automated, cable-TV-like news service called Choose Your News. A significant challenge in this project was finding an effective way for the news anchor avatar to read scripts. Initially, we used Wav2Lip, an open-source ML model, for lip syncing, so our avatar looked like it was saying the words in the audio. Wav2Lip was released in 2020, and AI moves swiftly, so we knew it was not a perfect solution, but it was good enough for our experiment.

A few weeks ago, we revisited the landscape to see if any new models could enhance the quality of our lip-synced avatars. Given the rapid advancements in the machine learning ecosystem, several new models had emerged. One such model was LatentSync from ByteDance, which was partially built on Wav2Lip source code and is fully open-source.

We tested LatentSync on Replicate, a cloud-based ML model execution service. The new model offered significantly improved visual results, albeit with longer processing times and reduced audio quality. Despite this trade-off, we opted to update our news generation pipeline with LatentSync for the visual quality upgrade.

After integrating LatentSync, we sought ways to mitigate the downsides of this upgrade. By examining the LatentSync repository code, we identified the causes of both the audio quality loss and the increased run times. LatentSync's base model was trained on 16kHz mono audio, far lower fidelity than the 44.1kHz stereo typical of most source audio. Consequently, it down-sampled input audio to fit its expected format, leading to audio quality loss.
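To make the effect easy to hear, here is a minimal sketch of the equivalent format conversion, done with librosa. The file names are hypothetical, and this is an illustration of the down-sampling itself, not LatentSync's own preprocessing code.

```python
# Illustrative sketch: 44.1 kHz stereo narration -> 16 kHz mono,
# the format the LatentSync base model expects. File names are hypothetical.
import librosa
import soundfile as sf

# librosa collapses the channels to mono and resamples to 16 kHz on load.
audio_16k, sr = librosa.load("narration_44k_stereo.wav", sr=16000, mono=True)

# Writing it back out makes the quality loss easy to compare side by side.
sf.write("narration_16k_mono.wav", audio_16k, sr)
```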

LatentSync's longer inference times were attributable to the higher resolution of its lip-sync outputs compared to Wav2Lip: the larger model simply needs more compute. On top of that, both inference time and memory usage scale with the input video resolution.
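As a rough, back-of-the-envelope illustration (the resolutions here are hypothetical, not our actual footage), cropping to just the face region removes most of the pixels the model would otherwise process per frame:

```python
# Hypothetical numbers: a full news-anchor shot vs. a tight face crop.
full_frame = 1920 * 1080   # full-resolution frame
face_crop = 512 * 512      # cropped region around the face

reduction = 1 - face_crop / full_frame
print(f"The face crop is {face_crop / full_frame:.1%} of the frame, "
      f"i.e. {reduction:.0%} fewer pixels for the model to process.")
```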

We identified two potential solutions to address these issues:

  1. Crop the input video to focus solely on the face, reducing the input resolution. This speeds up model execution, and the resulting video can then be overlaid seamlessly onto the original, larger video.
  2. Retain the original audio track, bypassing the low-quality, resampled audio the model outputs (both workarounds are sketched below).
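Below is a hedged sketch of how both workarounds could be wired together with the ffmpeg CLI. The file names, crop coordinates, and the `run_latentsync()` placeholder are assumptions for illustration, not our production pipeline code.

```python
# Sketch of the two workarounds, driven by the ffmpeg CLI via subprocess.
# Paths, crop coordinates, and run_latentsync() are hypothetical placeholders.
import subprocess

def crop_face_region(src, dst, x, y, w, h):
    # 1a. Crop the input video down to the face region so LatentSync only
    #     processes the pixels it actually needs to modify.
    subprocess.run([
        "ffmpeg", "-y", "-i", src,
        "-filter:v", f"crop={w}:{h}:{x}:{y}",
        dst,
    ], check=True)

def overlay_face(original, synced_face, dst, x, y):
    # 1b. Composite the lip-synced face crop back over the original
    #     full-resolution video at the same coordinates; drop its audio.
    subprocess.run([
        "ffmpeg", "-y", "-i", original, "-i", synced_face,
        "-filter_complex", f"[0:v][1:v]overlay={x}:{y}[v]",
        "-map", "[v]", "-an",
        dst,
    ], check=True)

def attach_original_audio(video, original_audio, dst):
    # 2. Mux the untouched 44.1 kHz narration back in, discarding the
    #    model's resampled 16 kHz track entirely. The video stream is
    #    copied as-is; the audio is encoded to AAC for the MP4 container,
    #    with no 16 kHz resample involved.
    subprocess.run([
        "ffmpeg", "-y", "-i", video, "-i", original_audio,
        "-map", "0:v", "-map", "1:a",
        "-c:v", "copy", "-c:a", "aac",
        dst,
    ], check=True)

# Hypothetical end-to-end flow:
# crop_face_region("anchor.mp4", "face.mp4", x=640, y=120, w=512, h=512)
# run_latentsync("face.mp4", "narration.wav", "face_synced.mp4")  # placeholder
# overlay_face("anchor.mp4", "face_synced.mp4", "composited.mp4", x=640, y=120)
# attach_original_audio("composited.mp4", "narration.wav", "final.mp4")
```

Because the original narration never passes through the model at all in this arrangement, there is no lost audio quality to recover, and the model only ever sees the small face crop rather than the full frame.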

By implementing these changes, we successfully reduced run times and restored audio quality to the levels achieved before switching from Wav2Lip to LatentSync.

At MakerX, we pride ourselves on applying our expertise to projects and systems, continually pushing boundaries and embracing the joy of solving tough problems. This spirit of curiosity and innovation is at the heart of our R&D culture, driving us to explore new frontiers and achieve extraordinary outcomes.