Microsoft’s AI voice cloning technology is so good you won’t be able to use it


Microsoft’s research team has introduced VALL-E 2, a new AI speech synthesis system capable of generating voices with “human-level performance” from just a few seconds of audio, producing speech that is indistinguishable from the source.

“(VALL-E 2 is) the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech (TTS) synthesis, achieving human parity for the first time,” the research paper reads. The system builds on its predecessor, VALL-E, unveiled in early 2023. Neural codec language models represent speech as sequences of discrete codes produced by a neural audio codec, which a language model then predicts token by token.
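As a purely illustrative toy sketch, not the actual codec behind VALL-E 2 (which uses a learned neural codec with multiple residual codebooks), mapping each audio-frame embedding to its nearest codebook entry turns continuous audio into the kind of discrete token sequence a language model can work with. All names and sizes below are hypothetical:

```python
import numpy as np

def quantize_frames(frames, codebook):
    """Toy illustration: replace each continuous audio-frame embedding with
    the index of its nearest codebook vector, turning audio into a sequence
    of discrete "codec tokens" a language model can predict one by one."""
    tokens = []
    for frame in frames:
        distances = np.linalg.norm(codebook - frame, axis=1)  # nearest neighbour
        tokens.append(int(np.argmin(distances)))
    return tokens

# Hypothetical sizes: 75 frames (~1 second of audio), 128-dim embeddings,
# a single 1024-entry codebook (real codecs use several residual codebooks).
rng = np.random.default_rng(0)
frames = rng.normal(size=(75, 128))
codebook = rng.normal(size=(1024, 128))
print(quantize_frames(frames, codebook)[:10])  # first ten token ids
```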

What sets VALL-E 2 apart from other voice cloning techniques, the team said, is its “repetition-aware sampling” method, which adaptively switches between sampling strategies during decoding. These changes make the output more consistent and sidestep the decoding failures, such as getting stuck repeating the same sounds, that commonly trip up generative speech models.
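Microsoft hasn’t published a reference implementation with the paper, but as a rough sketch of the general idea, assuming hypothetical window and threshold values, a decoder could fall back from nucleus sampling to plain random sampling whenever the token it just drew already dominates the recent output:

```python
import numpy as np

def nucleus_sample(probs, top_p=0.9, rng=None):
    """Standard nucleus (top-p) sampling: sample from the smallest set of
    tokens whose cumulative probability exceeds top_p."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]            # tokens, most probable first
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    kept = order[:cutoff]
    return int(rng.choice(kept, p=probs[kept] / probs[kept].sum()))

def repetition_aware_sample(probs, history, window=10, threshold=0.5,
                            top_p=0.9, rng=None):
    """Sketch of repetition-aware sampling (window and threshold values are
    hypothetical): draw a token by nucleus sampling, and if that token
    already makes up too much of the recent decoding window, fall back to
    random sampling over the full distribution to break the loop."""
    rng = rng or np.random.default_rng()
    token = nucleus_sample(probs, top_p=top_p, rng=rng)
    recent = history[-window:]
    if recent and recent.count(token) / len(recent) > threshold:
        token = int(rng.choice(len(probs), p=probs))  # random fallback
    return token
```

Because a model like this generates codec tokens autoregressively, breaking the repetition loop at sampling time is what keeps long or repetitive sentences from degenerating into stutter, which is the kind of decoding failure the strategy is meant to address.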

“VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrasing,” the researchers wrote, noting that the technology could help generate speech for people who lose the ability to speak.

However, as impressive as it may be, the tool will not be available to the public.

“We currently have no plans to incorporate VALL-E 2 into a product or expand access to the public,” Microsoft said in its ethics statement, noting that such tools carry risks such as voice mimicry without consent and the use of convincing AI voices in scams and other criminal activities.

The research team stressed the need for a standard way to watermark AI-generated content, acknowledging that detecting such content with high accuracy remains a challenge.

“If the model is to be generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of his or her voice and a model for detecting synthesized speech,” they wrote.

That said, VALL-E 2’s results are highly accurate compared to other tools. In a series of tests conducted by the research team, VALL-E 2 outperformed human benchmarks in terms of robustness, naturalness, and similarity of generated speech.

Image: Microsoft

VALL-E 2 achieved these results with just three seconds of audio. However, the research team noted that “using 10-second voice samples yielded even better quality.”

Microsoft isn’t the only AI company that has demonstrated cutting-edge AI models without bringing them to market. Meta’s Voicebox and OpenAI’s Voice Engine are two impressive voice cloners that also face similar restrictions.

“There are many interesting use cases for generative speech models, but due to potential risks of misuse, we are not making the Voicebox model or code publicly available at this time,” a Meta AI spokesperson told Decrypt last year.

OpenAI, for its part, said it wants to address safety concerns before releasing its synthetic voice model.

“In keeping with our focus on AI safety and our voluntary commitments, we have decided to preview, but not widely release, this technology at this time,” OpenAI explained in an official blog post.

This call for ethical guidelines is spreading across the AI community, especially as regulators begin to raise concerns about the impact of generative AI on our daily lives.

Edited by Ryan Ozawa.


