Microsoft VibeVoice: The New Voice of AI
I haven't really been following the AI voice generation scene recently, but Microsoft's VibeVoice caught my eye. I've tried a number of models in the past and some of them are... well, let's just say they're still learning. But when I heard about Microsoft's new VibeVoice model, I was eager to try it out. It's designed specifically for long-form, multi-speaker conversations like podcasts and audiobooks, and the main problems it targets are speaker consistency and natural conversation flow in Text-To-Speech (TTS) systems.
The tech behind it is fascinating. It uses a new type of speech tokenizer that compresses audio really efficiently, and it's powered by a Large Language Model (LLM) that helps it understand the flow of dialogue. The result? Surprisingly natural-sounding conversations. And the best part? Microsoft initially released it as an open-source tool, which got the whole tech community buzzing. It wasn't long, though, before they had to pull the repo over some serious security implications. I first stumbled upon this YouTube video by the channel AI Search, which I highly recommend checking out for a very detailed analysis, and I was impressed within the first thirty seconds, where he showcases use cases like accents and simple voice cloning.
Some technical jargon
The released models are:
- VibeVoice 1.5B: 1.5 billion parameters, 64,000-token context length. Can generate up to ninety (90) minutes of audio. Expect to need around 8-10GB of GPU VRAM.
- VibeVoice Large (7B): 7 billion parameters, 32,000-token context length. Can generate up to forty-five (45) minutes of audio. Expect to need around 15-18GB+ of GPU VRAM.
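Not sure which one your GPU can handle? Here's a minimal sketch (using PyTorch, which AI-ready environments like Colab already ship with) to sanity-check your VRAM; the thresholds just mirror the rough figures above:

```python
import torch

def pick_vibevoice_model() -> str:
    """Suggest a VibeVoice variant based on available GPU VRAM.

    Thresholds follow the rough requirements listed above:
    ~8-10GB for the 1.5B model, ~15-18GB+ for the 7B model.
    """
    if not torch.cuda.is_available():
        return "No CUDA GPU found - use a hosted option like Google Colab."
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if vram_gb >= 18:
        return f"{vram_gb:.1f}GB VRAM: VibeVoice Large (7B) should fit."
    if vram_gb >= 10:
        return f"{vram_gb:.1f}GB VRAM: stick with VibeVoice 1.5B."
    return f"{vram_gb:.1f}GB VRAM: likely too little - use Colab's free tier."

print(pick_vibevoice_model())
```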
The model used in the experiments below is VibeVoice 1.5B, as it runs on Google Colab's free tier. Everybody has free access to that tier and can run the 1.5B model comfortably.
It does have its limitations, such as not supporting overlapping speech, which is extremely important for a real-sounding conversation. Other limitations include occasional random noises (as if it's opening a news show?), noticeable waits between sections of text, and poor performance on short inputs. Additionally, English input with Chinese output often sounds odd; it sometimes switches the voice completely, as if a Chinese person were speaking. Creepy.
Want to Try VibeVoice Yourself?
Even though the official repo is gone, the open-source community is incredible. They've already created forks and community-maintained versions to keep the project alive. Since the original models were released under an MIT license, it's all completely legal. This is great for those of us who want to experiment with it. The easiest way to get your hands on it and try it for yourself is by using a Google Colab notebook, which lets you run the model for free as long as you have a clip and a transcript to input.
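As for the transcript, VibeVoice takes a simple line-per-utterance script with numbered speaker labels - the same "Speaker 1:" convention you'll see in my read-along transcripts below. A minimal sketch of preparing one (the dialogue and filename here are made up for illustration):

```python
# A minimal sketch of the script format: one line per utterance,
# each prefixed with a numbered speaker label. Dialogue and filename
# are placeholders - swap in your own.
script = """\
Speaker 1: Welcome back to the show. Today we're trying out AI voices.
Speaker 2: Thanks for having me. Honestly, the progress this year is wild.
Speaker 1: Right? Let's get into it.
"""

with open("podcast_script.txt", "w", encoding="utf-8") as f:
    f.write(script)
```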
Feel free to check out my other post for easy-to-follow setup instructions - no coding knowledge required and, best of all, it's absolutely FREE!

(the UI once set up, forked from NeuralFalconYT's GitHub)
Potential Dangers and Why Microsoft Pulled the Repo
This is where things get a bit scary. Microsoft explicitly stated that VibeVoice is for research and development only and should not be used for commercial or real-world applications without further testing. They even put in place guardrails to prevent its use for things like voice impersonation and disinformation. But as soon as a tool this powerful is released to the public, there’s no way to control how people will use it. It's a classic case of the cat being out of the bag.
The primary concern, and what I believe led to the repo's removal, is the potential for creating deepfakes. With VibeVoice, it would be trivial to take a short audio clip of someone's voice and perfectly clone it to make them say anything you want. I can just imagine someone using it to scam a family member or spread false information. Microsoft's official statement mentioned that the tool was being used in ways "inconsistent with the stated intent," which is a polite way of saying people were doing exactly that.
Every audio file created with VibeVoice is supposed to have an audible disclaimer and a hidden digital watermark, but people in the community have already found ways to work around that. This is a serious issue that highlights the ethical tightrope these companies are walking when they release powerful AI models to the public. It’s a terrifying mix of amazing technology and the potential for serious misuse.
My Experiments: The Good, the Bad, and the Botched
Experiment #1: Taylor Swift - a comparison
For my first test, I took a 25-second clip of Taylor Swift at her Eras Tour movie premiere. I ran it through an AI vocal separator to remove the annoying music and then clipped out a section with little to no audible clapping. You can check the input reference file below.
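If you want to prep a reference clip the same way, a few lines of pydub will do it. This is just a sketch - the filename and timestamps are placeholders for wherever your clean section happens to be:

```python
from pydub import AudioSegment  # pip install pydub (needs ffmpeg installed)

# Load the vocals-only file from the AI vocal separator (hypothetical filename).
audio = AudioSegment.from_file("taylor_vocals_only.wav")

# pydub slices by milliseconds: keep a ~25-second window with no audible claps.
start_ms, end_ms = 12_000, 37_000  # adjust to your own clean section
clip = audio[start_ms:end_ms]

clip.export("reference_clip.wav", format="wav")
```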
Then I decided to find an AI TTS with explicit voice-cloning support to compare what VibeVoice generates against what's generally available. My AI TTS of choice was fish.audio. It already had a lot of voice samples available, but I chose to upload my own to maintain consistency across both models and give them the same input, leveling the playing field.
The way it works in fish.audio is that you can add emotion control tags such as (angry), (sad), etc. You can read more about that here. So I asked Gemini to generate a single-paragraph prompt to feed into both, and since fish.audio supports this functionality, I had Gemini spin up a specific version with emotion control tags woven into the paragraph.
VibeVoice does not have this option, so I just fed it the raw paragraph as-is.
Let's see how different the outputs were!
Input Clip:
INPUT TEXT - READ ALONG TRANSCRIPT
Speaker 1: (serious)Alright, David, let's consider the proposal for a moment. (conciliative)The initial data suggests a very straightforward path, and I know what you're probably thinking, 'can't we just implement the standard framework and be done with it?' (sincere)Well, the issue, and this is the truly crucial part, isn't the framework itself but rather its tertiary effects on our legacy systems, specifically modules one, four, and the big one, seven. (curious)So you might ask, 'then why is this even on the table?' (confident)The answer, quite simply, is because the alternative, while appearing safer on the surface, provides absolutely no path for future scalability. (serious)Therefore, we must decide not what is easy now, but what is necessary for later. (sighing)
For VibeVoice, the text in parentheses was omitted.
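Stripping the tags out is trivial, for the record. A quick sketch of the kind of clean-up I mean (it assumes tags are always short lowercase words in parentheses, so watch out for legitimate parentheticals):

```python
import re

fish_prompt = "Speaker 1: (serious)Alright, David, let's consider the proposal for a moment."

# Remove parenthetical emotion tags like (serious) or (sighing),
# leaving the raw text for VibeVoice, which has no tag support.
vibevoice_prompt = re.sub(r"\([a-z]+\)\s*", "", fish_prompt)

print(vibevoice_prompt)
# Speaker 1: Alright, David, let's consider the proposal for a moment.
```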
Output from fish.audio:
Output from VibeVoice:
(generated with vibevoice 1.5b)
I'd say fish.audio wasn't inherently bad, but even with the emotion tags added, it didn't really make a lot of sense, at least to me. Don't get me wrong, I think it does sound somewhat like Taylor Swift, but it personally feels... off. The emotion, the tonality, the context in the conversation? Not doing it for me.
VibeVoice, on the other hand, did a really good job, especially in sections where the emotional context of the conversation mattered but wasn't defined explicitly. I think that's what sold me on calling it the winner, for me personally.
Experiment #2: Project 'McDrive-Thru 2.0'

(While asking Gemini to generate the script for a funny Elon Musk speech about a Tesla and McDonald's collaboration, its image model randomly generated this without me asking, and I found it a bit funny. Nano Banana is truly insane.)
I can't stress enough how insane this output was.
I chose Elon Musk's voice for this one because he's known for using a lot of "ums" and "ahs" while speaking in public, a habit many of us share. While most public speakers train themselves out of it, he hasn't, which makes his speech patterns surprisingly complex to replicate, with all the characteristic pauses and hesitations.
The audio sample I used was a simple 25-second, audio-only clip of him speaking about Prime Minister Narendra Modi. I just clipped his part from a longer video because I was lazy and grabbed a quick excerpt.
Input Clip:
(check source if it doesn't play in your browser)
Output Clip:
(generated with vibevoice 1.5b)
INPUT TEXT - Read-Along Transcript
Speaker 1: So... yeah. You're probably wondering what I'm doing here. And frankly, so am I.
But when the opportunity arose... to revolutionize... consumption... with one of the most iconic brands on Earth... I couldn't resist.
For too long, the act of eating has been... analog. Slow. Inefficient, if you really think about it. And don't even get me started on the drive-thru experience. The human error! The miscommunications! We can put a rocket on Mars, but we can't get a consistent order of fries? Unacceptable!
So, today, Tesla and McDonald's are proud to announce... Project 'McDrive-Thru 2.0'. We're integrating neural-linguistic programming directly into the ordering process. Forget speaking. Forget even thinking about what you want. Your car, your Tesla, will sense your craving.
It will pre-order your Big Mac, your fries, your McFlurry – with the exact ratio of M&M's you prefer – before you even pull into the parking lot. But wait, there's more. We're introducing the 'Happy Meal Autopilot'. Your car will not only order, it will then drive itself to the nearest McDonald's, pick up your meal, and bring it back to you.
No more fumbling with bags, no more spilt coffee. Just pure, unadulterated, automated deliciousness. And for those who say, 'But Elon, what about the human touch?' I say, the human touch is now pure enjoyment. No effort, just pleasure. We're elevating the fast-food experience to an entirely new dimension of efficiency and... well, happiness. It's truly revolutionary.
Now, if you'll excuse me, I believe my car just ordered me a Filet-O-Fish.
Even with that simple sample, the result was impressive. And just to be clear, I gave the AI only the raw text of the speech. I didn't add any special instructions for tonality or any "(pause)" markers in the prompt.
The moment that truly surprised me was the line, "And for those who say, 'But Elon, what about the human touch?'" (at 1:00). The way the AI captured the natural shift in tone between his narration and the quoted question was mind-blowingly impressive.
It truly feels like black magic.
Final Thoughts
Microsoft's VibeVoice is a groundbreaking piece of technology that pushes the boundaries of what's possible with speech synthesis. It's an incredible tool for creators (up to 90 minutes of smooth-flowing conversational audio with up to 4 speakers), but it also shines a bright light on the ethical dilemmas of open-source AI.
I also tried it on a 20-second sample of my own voice, just me saying random words (yapping, basically), and the output was way too impressive, and... scary.
I'm confused about what Microsoft meant when it said it had guardrails in place, given that I was able to feed it every swear word I knew and have my voice-cloned self say them with accurate fluency. The rapid removal of the official repo shows just how quickly these dangers can become real-world problems.

I think u/adumdumonreddit puts it aptly, lol (src)
FAQ
Q: What is VibeVoice?
A: VibeVoice is a powerful, open-source AI text-to-speech model from Microsoft Research designed to generate expressive, long-form, and multi-speaker conversational audio, such as podcasts and audiobooks.
Q: Is it free?
A: Yes. It's free to use and was released as open source under an MIT license.
Q: Why was the VibeVoice GitHub repository removed?
A: The official repository was removed by Microsoft because the tool was being used in ways that were inconsistent with its intended research purpose, particularly in relation to the creation of deepfakes and potential misuse for disinformation and scams.
Q: What languages does VibeVoice support?
A: The model was trained on and officially supports English and Chinese. Using it for other languages may result in unintelligible or inaccurate output.
Q: Is it safe to use VibeVoice for commercial projects?
A: No. Microsoft explicitly states that VibeVoice is intended for research and development purposes only and is not recommended for commercial or real-world applications without further testing and development.
Q: How do I use it?
A: You can clone the HuggingFace repository from here. Alternatively, if you just want to try it out quickly without having a big beefy AI setup with an expensive GPU, you can read my tutorial here.
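For the local route, the huggingface_hub client handles the download. A sketch - note the repo id below is illustrative, since the official weights have been pulled and moved around, so substitute whichever community mirror is currently alive:

```python
from huggingface_hub import snapshot_download  # pip install huggingface_hub

# Illustrative repo id - the official weights were pulled, so point this
# at whichever community mirror currently hosts the model.
local_dir = snapshot_download(repo_id="microsoft/VibeVoice-1.5B")
print(f"Model files downloaded to: {local_dir}")
```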