Solega Co. Done For Your E-Commerce solutions.

Pushing the frontiers of audio generation

by Solega Team
October 31, 2024
in Artificial Intelligence
Reading Time: 7 mins read


Technologies

Published 30 October 2024

Authors: Zalán Borsos, Matt Sharifi and Marco Tagliasacchi

An illustration depicting speech patterns, iterative progress on dialogue generation, and a relaxed conversation between two voices.

Our pioneering speech generation technologies are helping people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.

Speech is central to human connection. It helps people around the world exchange information and ideas, express emotions and create mutual understanding. As our technology for generating natural, dynamic voices continues to improve, we’re unlocking richer, more engaging digital experiences.

Over the past few years, we’ve been pushing the frontiers of audio generation, developing models that can create high-quality, natural speech from a range of inputs, like text, tempo controls and particular voices. This technology powers single-speaker audio in many Google products and experiments, including Gemini Live, Project Astra, Journey Voices and YouTube’s auto dubbing, and is helping people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.

Working together with partners across Google, we recently helped develop two new features that can generate long-form, multi-speaker dialogue for making complex content more accessible:

  • NotebookLM Audio Overviews turns uploaded documents into engaging and lively dialogue. With one click, two AI hosts summarize user material, make connections between topics and banter back and forth.
  • Illuminate creates formal AI-generated discussions about research papers to help make knowledge more accessible and digestible.

Here, we provide an overview of our latest speech generation research underpinning all of these products and experimental tools.

Pioneering techniques for audio generation

For years, we’ve been investing in audio generation research and exploring new ways of generating more natural dialogue in our products and experimental tools. In our previous research on SoundStorm, we first demonstrated the ability to generate 30-second segments of natural dialogue between multiple speakers.

This extended our earlier work, SoundStream and AudioLM, which allowed us to apply many text-based language modeling techniques to the problem of audio generation.

SoundStream is a neural audio codec that efficiently compresses and decompresses an audio input, without compromising its quality. As part of the training process, SoundStream learns how to map audio to a range of acoustic tokens. These tokens capture all of the information needed to reconstruct the audio with high fidelity, including properties such as prosody and timbre.
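To make the idea of mapping audio to acoustic tokens more concrete, here is a toy sketch of residual quantization, the general technique the SoundStream paper uses in vector form. This is not the codec itself: it quantizes a single scalar rather than learned embeddings, and the function names, codebook sizes and scales are all made up for illustration.

```python
from typing import List

def make_codebook(scale: float, size: int = 9) -> List[float]:
    """Evenly spaced scalar codewords in [-scale, scale]; an odd size keeps 0.0 in the set."""
    step = 2 * scale / (size - 1)
    return [-scale + i * step for i in range(size)]

def encode(value: float, codebooks: List[List[float]]) -> List[int]:
    """Quantize `value` stage by stage; each stage codes what the previous ones missed."""
    tokens, residual = [], value
    for codebook in codebooks:
        index = min(range(len(codebook)), key=lambda i: abs(residual - codebook[i]))
        tokens.append(index)
        residual -= codebook[index]
    return tokens

def decode(tokens: List[int], codebooks: List[List[float]]) -> float:
    """Sum the chosen codewords; using fewer tokens gives a coarser reconstruction."""
    return sum(codebook[i] for i, codebook in zip(tokens, codebooks))

# Hypothetical stages: each one is 8x finer than the one before.
CODEBOOKS = [make_codebook(1.0), make_codebook(1 / 8), make_codebook(1 / 64)]

sample = 0.3
tokens = encode(sample, CODEBOOKS)
# Reconstruction error when decoding with the first 1, 2 and 3 tokens.
errors = [abs(sample - decode(tokens[:k], CODEBOOKS)) for k in range(1, len(tokens) + 1)]
```

In this toy setup the first token carries the coarse value and later tokens only refine it, so the reconstruction error never grows as more tokens are used.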

AudioLM treats audio generation as a language modeling task to produce the acoustic tokens of codecs like SoundStream. As a result, the AudioLM framework makes no assumptions about the type or makeup of the audio being generated, and can flexibly handle a variety of sounds without needing architectural adjustments, making it a good candidate for modeling multi-speaker dialogues.

Example of a multi-speaker dialogue generated by NotebookLM Audio Overview, based on a few potato-related documents.

Building upon this research, our latest speech generation technology can produce 2 minutes of dialogue, with improved naturalness, speaker consistency and acoustic quality, when given a script of dialogue and speaker turn markers. The model also performs this task in under 3 seconds on a single Tensor Processing Unit (TPU) v5e chip, in one inference pass. This means it generates audio over 40 times faster than real time.
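The "40 times faster than real time" figure follows directly from the two numbers quoted above. A minimal check, where the function name is ours and 3 seconds is treated as the upper bound on compute time:

```python
def real_time_factor(audio_seconds: float, compute_seconds: float) -> float:
    """Seconds of audio produced per second of compute."""
    return audio_seconds / compute_seconds

# 2 minutes of dialogue generated in (at most) 3 seconds.
rtf = real_time_factor(audio_seconds=2 * 60, compute_seconds=3.0)
```

Since the model finishes in under 3 seconds, 40 is a lower bound on the speed-up, which matches the "over 40 times" wording.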

Scaling our audio generation models

Scaling our single-speaker generation models to multi-speaker models then became a matter of data and model capacity. To help our latest speech generation model produce longer speech segments, we created an even more efficient speech codec for compressing audio into a sequence of tokens, at as little as 600 bits per second, without compromising the quality of its output.

The tokens produced by our codec have a hierarchical structure and are grouped by time frames. The first tokens within a group capture phonetic and prosodic information, while the last tokens encode fine acoustic details.

Even with our new speech codec, producing a 2-minute dialogue requires generating over 5000 tokens. To model these long sequences, we developed a specialized Transformer architecture that can efficiently handle hierarchies of information, matching the structure of our acoustic tokens.
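A back-of-the-envelope calculation helps put these numbers in perspective. The sketch below assumes, purely for illustration, that the 5000-token figure and the 600 bits per second figure describe the same 2-minute dialogue; the text does not state this, so the per-token values are rough estimates only.

```python
BITRATE_BPS = 600          # lowest bitrate quoted in the text
DIALOGUE_SECONDS = 2 * 60  # a 2-minute dialogue
MIN_TOKENS = 5000          # "over 5000 tokens", used here as a lower bound

total_bits = BITRATE_BPS * DIALOGUE_SECONDS        # size of the compressed dialogue in bits
total_bytes = total_bits / 8                       # the same size in bytes
tokens_per_second = MIN_TOKENS / DIALOGUE_SECONDS  # minimum token rate the model must sustain
bits_per_token = total_bits / MIN_TOKENS           # rough upper bound under the assumption above
```

Under these assumptions the whole dialogue compresses to 72,000 bits (9,000 bytes), and the model has to emit more than about 42 tokens for every second of audio.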

With this technique, we can efficiently generate acoustic tokens that correspond to the dialogue, within a single autoregressive inference pass. Once generated, these tokens can be decoded back into an audio waveform using our speech codec.
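The generation loop can be sketched in a few lines. The example below only illustrates autoregressive generation over tokens grouped by time frame: `toy_next_token` is a hypothetical stand-in for the trained Transformer, the vocabulary size and number of levels are invented, and the final decoding step through the speech codec is left out.

```python
from typing import Callable, List

Frame = List[int]  # one token per hierarchy level, coarse first

def generate_frames(next_token: Callable[[List[int], int], int],
                    num_frames: int, levels: int) -> List[Frame]:
    """Generate tokens one at a time; each token is conditioned on all earlier ones."""
    history: List[int] = []  # flat list of every token generated so far
    frames: List[Frame] = []
    for _ in range(num_frames):
        frame: Frame = []
        for level in range(levels):  # coarse tokens first, fine detail last
            token = next_token(history, level)
            frame.append(token)
            history.append(token)
        frames.append(frame)
    return frames

def toy_next_token(history: List[int], level: int) -> int:
    """Hypothetical stand-in for a trained model: a fixed function of the prefix."""
    vocab_size = 16
    return (sum(history) * 31 + 7 * level + len(history)) % vocab_size

frames = generate_frames(toy_next_token, num_frames=4, levels=3)
flat_tokens = [token for frame in frames for token in frame]
# A real system would now pass the tokens to the codec's decoder to get a waveform.
```

A real model would sample each token from a learned probability distribution rather than compute it deterministically, but the control flow, one pass that extends the sequence token by token, is the same.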

Animation showing how our speech generation model produces a stream of audio tokens autoregressively, which are decoded back to a waveform consisting of a two-speaker dialogue.

To teach our model how to generate realistic exchanges between multiple speakers, we pretrained it on hundreds of thousands of hours of speech data. Then we finetuned it on a much smaller dataset of dialogue with high acoustic quality and precise speaker annotations, consisting of unscripted conversations from a number of voice actors and realistic disfluencies (the “umm”s and “aah”s of real conversation). This step taught the model how to reliably switch between speakers during a generated dialogue and to output only studio-quality audio with realistic pauses, tone and timing.

In line with our AI Principles and our commitment to developing and deploying AI technologies responsibly, we’re incorporating our SynthID technology to watermark non-transient AI-generated audio content from these models, to help safeguard against the potential misuse of this technology.

New speech experiences ahead

We’re now focused on improving our model’s fluency and acoustic quality and on adding more fine-grained controls for features like prosody, while exploring how best to combine these advances with other modalities, such as video.

The potential applications for advanced speech generation are vast, especially when combined with our Gemini family of models. From enhancing learning experiences to making content more universally accessible, we’re excited to continue pushing the boundaries of what’s possible with voice-based technologies.

Acknowledgements

Authors of this work: Zalán Borsos, Matt Sharifi, Brian McWilliams, Yunpeng Li, Damien Vincent, Félix de Chaumont Quitry, Martin Sundermeyer, Eugene Kharitonov, Alex Tudor, Victor Ungureanu, Sertan Girgin, Jonas Rothfuss, Jake Walker and Marco Tagliasacchi.

We thank Leland Rechis, Ralph Leith, Paul Middleton, Poly Pata, Minh Truong and RJ Skerry-Ryan for their critical efforts on dialogue data.

We’re very grateful to our collaborators across Labs, Illuminate, Cloud, Speech and YouTube for their outstanding work bringing these models into products.

We also thank Françoise Beaufays, Krishna Bharat, Tom Hume, Simon Tokumine and James Zhao for their guidance on the project.



© 2024 Solega, LLC. All Rights Reserved | Solega.co
