DramaBox TTS Pinokio - Free Alternative to ElevenLabs

DramaBox TTS Pinokio is a free, open-source, 1-click local launcher for Resemble AI's DramaBox - a prompt-driven expressive text-to-speech model with voice cloning, built on LTX-2.3. A powerful self-hosted alternative to ElevenLabs.

Tags: Linux, macOS, Open Source, Pinokio, Self Hosted, Windows

Overview

DramaBox TTS Pinokio is a free, open-source 1-click Pinokio launcher for DramaBox, Resemble AI's prompt-driven expressive text-to-speech model with voice cloning, built on the LTX-2.3 3.3B audio-only diffusion transformer. Created by community developer PierrunoYT, the Pinokio launcher wraps the upstream resemble-ai/DramaBox repository with a one-click install, start, and browser-launch workflow - making it a practical free alternative to ElevenLabs for users who want expressive, directable AI voice generation without a subscription or cloud dependency. The tool is distributed under the LTX-2 Community License, runs entirely on local hardware (Windows, macOS with Apple Silicon, and Linux), and requires no API key or internet connection after the initial model download.

Key Features

  • Prompt-directed speech performance - The prompt controls speaker identity, emotion, delivery style, laughs, sighs, pauses, and scene transitions. Dialogue inside double quotes is spoken; stage directions outside the quotes are performed but never spoken aloud.
  • Voice cloning from a 10-second reference - Provide 10 or more seconds of reference audio and DramaBox applies that speaker's timbre to the generated performance. The reference is optional; without it, the model generates a voice to match the speaker description in the prompt.
  • Gradio web UI via Pinokio - The launcher installs all dependencies and starts a local Gradio interface accessible in the browser. No command-line experience is required for basic use.
  • Multi-platform hardware support - Runs on NVIDIA CUDA GPUs (~24 GB VRAM recommended) and on Apple Silicon via PyTorch MPS, with a CPU fallback. A Low VRAM mode using MMGP offloading is available for systems below the recommended VRAM threshold.
  • PerTh neural watermarking - By default, generated audio is watermarked with Resemble AI's imperceptible PerTh watermark, which survives MP3/AAC encoding and common audio edits with close to 100% detection accuracy.
  • Python, CLI, JavaScript, and REST API access - Once the server is running, the model is accessible via a Python inference server, a CLI script, the Gradio JavaScript client, or raw HTTP POST requests - suitable for integration into pipelines and applications.
  • LoRA fine-tuning support - Users can train custom LoRA adapters on top of DramaBox to add a specific speaker, language flavour, or delivery style, using JSONL, TSV, or other supported dataset formats.
  • 48 kHz stereo output - Audio is generated at 48 kHz stereo, matching or exceeding the output sample rate of most commercial TTS APIs.
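
The quoting rule in the first feature above (text inside double quotes is spoken, text outside is performed as direction) can be illustrated with a small helper. This is only a sketch for checking a prompt before generation, not part of DramaBox or the launcher: the function name `split_prompt` is hypothetical, and it assumes straight double quotes with no nesting or escaping.

```python
import re


def split_prompt(prompt: str) -> list[tuple[str, str]]:
    """Split a prompt into ("spoken", text) and ("direction", text) segments.

    Hypothetical helper: assumes straight double quotes, no nested or
    escaped quotes. It only mirrors the rule described in the listing.
    """
    segments = []
    # With a capturing group, re.split keeps the quoted text at odd indices.
    parts = re.split(r'"([^"]*)"', prompt)
    for index, part in enumerate(parts):
        if index % 2 == 1:
            kind, text = "spoken", part.strip()
        else:
            # Trim the comma that usually separates a direction from the quote.
            kind, text = "direction", part.strip().strip(",").strip()
        if text:
            segments.append((kind, text))
    return segments


if __name__ == "__main__":
    example = 'A woman speaks warmly, "Hello, how are you today?" She laughs, "Hahaha, it is so good to see you!"'
    for kind, text in split_prompt(example):
        print(f"{kind:>9}: {text}")
```

Listing the segments this way makes it easy to spot a line of dialogue that was accidentally left outside the quotes and would therefore not be spoken.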

How It Compares to ElevenLabs

Feature | DramaBox TTS Pinokio | ElevenLabs
Pricing | Free (self-hosted, no subscription) | Free tier limited to 10,000 characters/month; paid plans from $5 to $330+/month
Voice cloning | Yes - from 10+ seconds of reference audio | Yes - Instant Voice Cloning on free tier; Professional Voice Cloning on paid tiers
Expressive / directed performance | Yes - prompt-driven emotion, pacing, paralinguistic cues | Partial - emotion sliders and voice settings; less scriptable direction
Output quality | 48 kHz stereo; ~2.5 s per generation on H100 | Up to 44.1 kHz; near-instant via cloud API
Platform | Local / self-hosted (Windows, macOS, Linux) | Cloud-based web app and API (Windows, macOS, Linux, iOS, Android via browser)
Hardware requirement | ~24 GB VRAM (NVIDIA) or Apple Silicon; ~17 GB disk space | None - runs in the cloud
Privacy / data control | Fully local - no data sent externally | Audio processed on ElevenLabs servers
Usage limits | None - unlimited local generation | Character quotas per plan; overage charges apply
Watermarking | PerTh neural watermark on by default (can be disabled) | No built-in watermarking
API access | Local REST, Python, JavaScript, CLI | Cloud REST API (paid plans)
License | LTX-2 Community License (open source base) | Proprietary SaaS
LoRA / fine-tuning | Yes - train custom LoRAs on top of DramaBox | No - voice customisation limited to cloning and settings

Free Version Limitations

DramaBox TTS Pinokio is entirely free, with no paywalled features, character caps, or audible watermark (the PerTh watermark applied by default is imperceptible). However, users should be aware of the following practical constraints:

  • Hardware requirements are significant - The recommended configuration is an NVIDIA GPU with approximately 24 GB of VRAM. Users without this hardware will need to use the experimental Low VRAM mode (MMGP offloading), which is slower and requires more system RAM, or fall back to CPU inference, which is considerably slower.
  • Model download size - Approximately 17 GB of model weights are downloaded on first run (DiT transformer: 6.6 GB, audio components: 1.9 GB, Gemma text encoder: ~8 GB). A fast internet connection and sufficient disk space are required.
  • Generation speed is hardware-dependent - The ~2.5 seconds per generation figure applies to a warm H100 server. Consumer-grade GPUs will be slower. Cold inference (loading the Gemma model per request) takes approximately 30 seconds even on high-end hardware.
  • LTX-2 Community License restrictions - The LTX-2 Community License permits non-commercial and research use freely. Commercial use may require review of the licence terms before deployment in a production product.
  • No mobile or web-hosted interface - DramaBox TTS Pinokio runs locally only. There is no hosted web version included in this launcher (though Resemble AI provides a ZeroGPU demo space on Hugging Face separately).
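
The download figures quoted above can be sanity-checked before installing. The sketch below sums the three component sizes and compares free disk space against the ~17 GB total; the function names and the extra headroom value are assumptions chosen for illustration, not something the launcher provides.

```python
import shutil

# Component sizes in GB as quoted in the list above (the Gemma figure is approximate).
MODEL_COMPONENTS_GB = {
    "DiT transformer": 6.6,
    "audio components": 1.9,
    "Gemma text encoder": 8.0,
}


def total_download_gb(components: dict[str, float] = MODEL_COMPONENTS_GB) -> float:
    """Sum the component sizes, rounded to one decimal place."""
    return round(sum(components.values()), 1)


def has_room_for_models(path: str = ".", required_gb: float = 17.0,
                        headroom_gb: float = 3.0) -> bool:
    """Return True if the drive holding `path` has enough free space.

    `headroom_gb` is an assumed safety margin for the Python environment
    and generated audio; adjust it to taste.
    """
    free_gb = shutil.disk_usage(path).free / 1e9  # decimal gigabytes
    return free_gb >= required_gb + headroom_gb


if __name__ == "__main__":
    print(f"Listed components total about {total_download_gb()} GB")
    print("Enough free space here:", has_room_for_models("."))
```

The listed components add up to 16.5 GB, which is consistent with the rounded figure of approximately 17 GB.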

Who Is It Best For?

  • Content creators and indie filmmakers who need expressive, directed voice performances for narration, character dialogue, or short-form video - without paying per-character API fees.
  • Developers and AI researchers who want to integrate a locally-hosted, scriptable TTS engine into pipelines, applications, or experiments via the Python server, CLI, or REST API.
  • Privacy-conscious users who require that voice data and reference audio never leave their own machine - particularly relevant for users working with sensitive or proprietary voice samples.
  • Hobbyists and open-source enthusiasts who want to explore state-of-the-art expressive TTS and voice cloning on their own hardware without a cloud subscription.
  • ML practitioners who want to fine-tune a custom LoRA on top of DramaBox to add a specific speaker identity, accent, or delivery style to the base model.

Getting Started

  1. Install Pinokio on your Windows, macOS, or Linux machine.
  2. Open Pinokio and navigate to the Discover tab, or paste the repository URL https://github.com/PierrunoYT/DramaBox-TTS-Pinokio directly into Pinokio's install field.
  3. Click Install - Pinokio clones the repository and installs all Python dependencies automatically. Model weights (~17 GB) are downloaded from Hugging Face on first run.
  4. Click Start (or Start Low VRAM if your GPU has less than 24 GB VRAM) to launch the local inference server.
  5. Click Open Web UI to open the Gradio interface in your browser at http://127.0.0.1:7860.
  6. Write a prompt using the structure: <speaker description>, "<dialogue>" <stage direction> "<more dialogue>" - for example: A woman speaks warmly, "Hello, how are you today?" She laughs, "Hahaha, it is so good to see you!"
  7. Optionally upload a 10+ second WAV reference file to clone a specific voice timbre.
  8. Click Generate and download the resulting 48 kHz stereo WAV file.
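
Step 7 asks for a reference of 10 seconds or more. A quick way to check a file before uploading it is to read its length with Python's standard `wave` module, as sketched below. The helper names are hypothetical, and the `wave` module reads only uncompressed PCM WAV files, so other encodings would need a different tool.

```python
import wave

MIN_REFERENCE_SECONDS = 10.0  # minimum reference length stated in the listing


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(path: str,
                        min_seconds: float = MIN_REFERENCE_SECONDS) -> bool:
    """Return True if the file is at least `min_seconds` long."""
    return wav_duration_seconds(path) >= min_seconds


if __name__ == "__main__":
    import sys

    for wav_path in sys.argv[1:]:
        seconds = wav_duration_seconds(wav_path)
        verdict = "ok" if seconds >= MIN_REFERENCE_SECONDS else "too short"
        print(f"{wav_path}: {seconds:.1f} s ({verdict})")
```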

Full documentation and prompt writing tips are available at the official repository: https://github.com/PierrunoYT/DramaBox-TTS-Pinokio
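
For scripted use, the listing mentions raw HTTP POST requests against the running server. The sketch below only builds such a request and does not send it. The route `/gradio_api/call/generate` and the single-item `data` list are placeholders assumed for illustration, not the launcher's confirmed API; the real endpoint name and parameter order should be taken from the running app's API page or the repository documentation.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:7860"  # local web UI address from step 5


def build_generate_request(prompt: str,
                           base_url: str = BASE_URL,
                           route: str = "/gradio_api/call/generate") -> urllib.request.Request:
    """Build (but do not send) a JSON POST request carrying a prompt.

    ASSUMPTION: `route` and the payload shape are hypothetical placeholders.
    Check the actual endpoint exposed by the running server before use.
    """
    body = json.dumps({"data": [prompt]}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + route,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_generate_request('A man whispers, "Keep your voice down."')
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Once the route and payload are confirmed, the request could be sent with `urllib.request.urlopen(request)`. Keeping construction separate from sending makes the payload easy to inspect first.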

Other Free Alternatives to ElevenLabs

  • Voicebox - A free, open-source, local-first voice cloning studio with seven TTS engines (including Chatterbox by Resemble AI), a multi-voice timeline editor, 23 languages, and a REST API. Runs entirely on your own machine with no subscriptions.
  • Tortoise TTS - A high-fidelity neural TTS system that prioritises audio quality over generation speed. Supports voice cloning from short audio samples and produces exceptionally natural-sounding speech. Open source, self-hosted, runs on Windows, macOS, and Linux.
  • Bark by Suno - A transformer-based text-to-audio model capable of generating speech, music, sound effects, and non-verbal vocalisations. Supports multiple languages with emotional expression and realistic voice synthesis. Open source and self-hosted.
