ChatTTS: Revolutionary Conversational Text-to-Speech Model
Explore a cutting-edge text-to-speech model tailored for everyday dialogue, with advanced features in both English and Chinese
ChatTTS Features
Conversational TTS
ChatTTS is optimized for dialogue-based tasks, enabling natural and expressive speech synthesis. It supports multiple speakers, facilitating interactive conversations.
Fine-grained Control
The model can predict and control fine-grained prosodic features, including laughter, pauses, and interjections, providing a more natural speech output.
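As an illustration of this fine-grained control, ChatTTS accepts markup tokens embedded directly in the input text. The token names used below (such as `[laugh]` and `[uv_break]`) follow the project's README and may vary between releases; the pure helper here only demonstrates the markup shape, not the real inference API.

```python
# Hedged illustration: ChatTTS reads control tokens embedded in the input
# text. Token names like [laugh] and [uv_break] follow the project README
# and may differ by version; this helper only shows the markup shape.
def add_pause_tokens(text: str, token: str = "[uv_break]") -> str:
    """Insert a pause token between comma-delimited clauses."""
    parts = [p.strip() for p in text.split(",")]
    return f" {token} ".join(parts)

marked = add_pause_tokens("Well, that is quite funny [laugh]")
# → "Well [uv_break] that is quite funny [laugh]"
```

The marked-up string would then be passed to the model as ordinary input text, letting the user place pauses and laughter exactly where they are wanted.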
ChatTTS Specifications
Dataset and Model Details
The main model was trained on more than 100,000 hours of Chinese and English audio. An open-source version, pretrained on 40,000 hours of audio without supervised fine-tuning (SFT), is available on HuggingFace.
Roadmap
Future plans for ChatTTS include open-sourcing the 40k-hour base model and the spk_stats file, streaming audio generation, and multi-emotion control. A ChatTTS.cpp port is also planned, pointing toward broader compatibility and integration.
ChatTTS Usage Instructions
Installation and Setup
To get started with ChatTTS, clone the repository from GitHub and install the required packages. The project can be installed directly via pip or from conda, with optional extra packages for NVIDIA GPU users.
Basic and Advanced Usage
Basic usage involves importing the ChatTTS module, loading the model, and running inference to convert text to speech. Advanced usage offers finer control, including sampling a random speaker, adjusting the sampling temperature, and manual control at the sentence and word level.
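The basic and advanced flows described above can be sketched as follows. The class and method names (`Chat`, `load`, `infer`, `sample_random_speaker`, `InferCodeParams`) follow the project's README and may differ between releases, so treat this as an illustrative sketch rather than a definitive API reference.

```python
# Sketch of basic + advanced ChatTTS usage (names per the project README;
# they may change between versions). Assumes `pip install ChatTTS` has run.
def synthesize(texts, out_prefix="out"):
    import ChatTTS
    import torch
    import torchaudio

    chat = ChatTTS.Chat()
    chat.load(compile=False)  # compile=True trades startup time for faster inference

    # Advanced control: fix a random speaker and lower the sampling temperature
    # for more stable, less varied output.
    spk = chat.sample_random_speaker()
    params = ChatTTS.Chat.InferCodeParams(spk_emb=spk, temperature=0.3)

    wavs = chat.infer(texts, params_infer_code=params)  # list of mono waveforms
    for i, wav in enumerate(wavs):
        torchaudio.save(f"{out_prefix}_{i}.wav",
                        torch.from_numpy(wav).unsqueeze(0), 24000)
    return wavs

# synthesize(["Hello, welcome to ChatTTS!"])  # uncomment once the model is installed
```

Keeping the heavy imports inside the function means the snippet can be read (and the helper reused) without the model weights downloaded.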
ChatTTS Customer Service Details
Contact Information
For formal inquiries about the model and roadmap, users can contact the team at [email protected]. Additionally, there are multiple online chat options available, including QQ groups for Chinese users and a Discord server for global community interaction.
ChatTTS Common Issues and Problems
Model Stability and Performance
Some users may experience stability issues, such as the voice drifting toward another speaker mid-utterance or poor audio quality. These issues are common with autoregressive models and can be mitigated by generating multiple samples for the same text and selecting the best result.
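The "try multiple samples" mitigation can be sketched as a best-of-N loop. The scoring heuristic below is hypothetical (a simple duration-times-loudness check to penalize near-silent or truncated outputs); in practice one might score candidates with an ASR pass or listen manually.

```python
# Hedged sketch of the best-of-N mitigation: generate several candidates
# for the same text and keep the one a (hypothetical) score prefers.
import numpy as np

def best_of_n(infer, text, n=4, sample_rate=24000):
    def score(wav):
        # Hypothetical heuristic: penalize near-silent or truncated outputs.
        duration = len(wav) / sample_rate
        loudness = float(np.abs(wav).mean())
        return duration * loudness

    candidates = [infer(text) for _ in range(n)]
    return max(candidates, key=score)

# Usage with a stand-in `infer` for illustration; a real call would wrap
# the model's inference (e.g. one ChatTTS generation per candidate).
fake_infer = lambda t: np.random.default_rng().standard_normal(24000) * 0.1
wav = best_of_n(fake_infer, "hello")
```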
Control Over Emotions
Currently, the released model allows for control over laughter, breaks, and intonation. Future versions may include more emotional control capabilities.