Creating Custom Voices with Piper

This guide is based on the excellent Piper Training Guide with Screen Reader by ZachB100.

Introduction to Piper

Piper is a fast, local text-to-speech engine optimized for low-end hardware like the Raspberry Pi. It uses VITS, an end-to-end speech synthesis model, and exports trained voices to ONNX format for execution with the ONNX Runtime.

While initially developed for screen readers like NVDA, Piper's potential extends far beyond. Its performance is continually improving, making it suitable for various applications such as web browsing, email reading, and social media consumption.

Creating Your Dataset

For TTS, your dataset should include:

  • Audio files: mono .wav files at a 16 kHz or 22.05 kHz sample rate (16-bit resolution)
  • Text transcripts: Formatted according to LJSpeech conventions

The format for text transcripts looks like this:

audio1|This is the first sentence.
audio2|This is the second sentence.
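As a sketch, this pipe-delimited transcript file (often named metadata.csv in LJSpeech-style datasets) can be written and sanity-checked with a short script; the function names and filename here are illustrative, not part of Piper itself:

```python
from pathlib import Path

def write_metadata(entries, path):
    """Write (audio_id, sentence) pairs in LJSpeech pipe-delimited format."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        for audio_id, sentence in entries:
            f.write(f"{audio_id}|{sentence}\n")

def validate_metadata(path):
    """Return line numbers that are malformed (wrong field count or empty fields)."""
    problems = []
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for lineno, line in enumerate(lines, 1):
        parts = line.split("|")
        if len(parts) != 2 or not parts[0].strip() or not parts[1].strip():
            problems.append(lineno)
    return problems

entries = [("audio1", "This is the first sentence."),
           ("audio2", "This is the second sentence.")]
write_metadata(entries, "metadata.csv")
print(validate_metadata("metadata.csv"))  # [] means every line is well formed
```

Catching a stray pipe or an empty transcript line before uploading saves a failed preprocessing run later.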

For best results, ensure your recordings are clear, with minimal background noise. Studio-quality is ideal, but a quiet room and a decent microphone can suffice. Aim for at least five minutes of audio to start, though more data will generally yield better results.
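Before training, it is worth confirming that every recording actually matches the format above. A minimal check using only the Python standard library might look like this (the helper names and the "wavs" folder are assumptions for illustration):

```python
import wave
from pathlib import Path

ACCEPTED_RATES = (16000, 22050)  # sample rates described above

def check_wav(path):
    """Return a list of problems with one .wav file (empty list = OK)."""
    problems = []
    with wave.open(str(path), "rb") as w:
        if w.getnchannels() != 1:
            problems.append("not mono")
        if w.getsampwidth() != 2:  # 2 bytes per sample = 16-bit
            problems.append("not 16-bit")
        if w.getframerate() not in ACCEPTED_RATES:
            problems.append(f"unexpected rate {w.getframerate()}")
    return problems

def total_seconds(folder):
    """Sum the duration of all .wav files in a folder."""
    total = 0.0
    for p in Path(folder).glob("*.wav"):
        with wave.open(str(p), "rb") as w:
            total += w.getnframes() / w.getframerate()
    return total

# Demo: write one second of silence at 22.05 kHz and check it.
Path("wavs").mkdir(exist_ok=True)
with wave.open("wavs/audio1.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(22050)
    w.writeframes(b"\x00\x00" * 22050)

print(check_wav("wavs/audio1.wav"))  # [] means the file matches the spec
```

Running `total_seconds("wavs")` against your dataset folder tells you whether you have cleared the five-minute (300-second) minimum.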

Training Your Model

Piper models are trained using Google Colab, a cloud-based Jupyter notebook environment. This approach provides access to powerful GPUs necessary for efficient training. The process involves:

  1. Uploading your dataset to Google Drive
  2. Setting up the training environment in Colab
  3. Configuring and initiating the training process
  4. Exporting the trained model for use
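The steps above roughly correspond to the following command sequence from the `piper_train` module in the rhasspy/piper repository. Treat this as a sketch: the exact flags vary between versions, and the paths here are placeholders, so check the repository's TRAINING.md for your installation.

```shell
# 1. Preprocess the LJSpeech-style dataset (audio folder + metadata.csv).
python -m piper_train.preprocess \
  --language en-us \
  --input-dir /content/drive/MyDrive/dataset \
  --output-dir /content/training \
  --dataset-format ljspeech \
  --single-speaker \
  --sample-rate 22050

# 2. Train, optionally resuming from a pre-trained checkpoint.
python -m piper_train \
  --dataset-dir /content/training \
  --accelerator gpu --devices 1 \
  --batch-size 16 \
  --checkpoint-epochs 1 \
  --resume_from_checkpoint /content/pretrained.ckpt

# 3. Export the trained checkpoint to ONNX for use with Piper.
python -m piper_train.export_onnx \
  /content/training/lightning_logs/version_0/checkpoints/last.ckpt \
  /content/model.onnx
```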

Use the Colab notebooks provided in the original guide for training and exporting your model.

When configuring your model, consider setting the validation split to 0.05 (reserving 5% of your utterances for evaluation), adjusting this fraction to suit your dataset size. For English voices, choosing US English is recommended, as it gives the best results with the available pre-trained models.

Testing and Using Your Model

After training, you can test your model using the provided Colab notebooks. These allow you to synthesize speech from text input, giving you a chance to evaluate and refine your model.

Once satisfied with your model, you can use it with compatible software like the NVDA screen reader (with the appropriate add-on) or any other application that supports Piper voices.
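Outside of Colab and NVDA, a trained voice can also be tried locally with the piper command-line tool; the model path and output filename below are placeholders:

```shell
# Synthesize a test sentence with the exported voice.
echo 'This is a test of my custom voice.' | \
  piper --model /path/to/model.onnx --output_file test.wav

# Play the result with any audio player, e.g. aplay on Linux.
aplay test.wav
```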

To install a new voice in NVDA:

  1. Go to the NVDA settings
  2. Locate the Piper category
  3. Find the button to install voices from a local archive
  4. Choose the voice you want and press enter

Join the Piper Community

Creating custom voices with Piper is an exciting journey into the world of AI-driven speech synthesis. Whether you're looking to create voices for accessibility purposes, creative projects, or just out of curiosity, Piper offers a powerful and flexible platform.

We encourage you to experiment, share your experiences, and contribute to the growing Piper community. Your innovations could help shape the future of accessible and personalized text-to-speech technology!

For more detailed instructions and updates, please refer to the original guide.