Step-Audio-EditX Usage Tips and Installation Guide
What is Step-Audio-EditX
Step-Audio-EditX is an advanced open-source audio foundation model developed by the StepFun AI team, designed to let users “edit speech as easily as writing a prompt.” Even non-technical users can effortlessly generate highly natural and expressive speech—simply by uploading a short reference audio clip and writing clear text instructions describing how the AI should speak. The model will then clone the voice and precisely convey the desired emotion, accent, or vocal nuance.
🎯 Powerful Yet Easy-to-Use Features
- Zero-shot voice cloning: Upload any voice sample and synthesize new text in that voice—no training required.
- Iterative editing: Not satisfied? Refine your instructions repeatedly to gradually improve the output.
- Expressive control: Go beyond what is said and control how it is said: happy or angry, whispering or playful, with added breaths or laughter.
All of this is achieved by adding simple tags to your input text, making it intuitive and effective.
✍️ How to “Direct” AI Speech with Text Prompts
Step-Audio-EditX lets you insert special tags before a target sentence to control four dimensions: dialect, emotion, speaking style, and paralinguistic cues. Tags can be used individually or combined freely, and paralinguistic tags can also be placed inline at the point where the sound should occur.
Below is the complete list of officially supported tags:
1. Dialect
| Tag | Description |
|---|---|
| [Sichuanese] | Sichuan dialect (Chinese) |
| [Cantonese] | Cantonese |
Example:
[Sichuanese]The weather is really nice today!
2. Emotion
| Tag | Description |
|---|---|
| [Angry] | Angry |
| [Happy] | Happy |
| [Sad] | Sad |
| [Excited] | Excited |
| [Fearful] | Fearful |
| [Surprised] | Surprised |
| [Disgusted] | Disgusted |
Example:
[Happy]I got accepted!
3. Speaking Style
| Tag | Description |
|---|---|
| [Act_coy] | Coy / flirtatious tone |
| [Older] | Mature / elderly-sounding |
| [Child] | Childlike voice |
| [Whisper] | Whisper |
| [Serious] | Serious tone |
| [Generous] | Generous / hearty manner |
| [Exaggerated] | Exaggerated delivery |
Example:
[Whisper][Child]Is Mommy asleep?
4. Paralinguistic Cues
| Tag | Description |
|---|---|
| [Breathing] | Audible breathing |
| [Laughter] | Laughter |
| [Suprise-oh] | Surprised “oh!” |
| [Confirmation-en] | Confirming “uh-huh” (English-style) |
| [Uhm] | Hesitation “uhm” |
| [Suprise-ah] | Surprised “ah!” |
| [Suprise-wa] | Surprised “wa!” |
| [Sigh] | Sigh |
| [Question-ei] | Questioning “ei?” |
| [Dissatisfaction-hnn] | Dissatisfied “hnn” |
Example:
[Sigh][Sad]Ugh… I failed again.
With this structured, tag-based prompting system, Step-Audio-EditX makes expressive, controllable speech synthesis as simple as writing a single sentence—while delivering remarkable realism and flexibility.
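The tag rules above can be sketched as a small helper that assembles a tagged prompt and flags tags that are not in the tables. This is a minimal sketch, not part of Step-Audio-EditX: the names `SUPPORTED_TAGS`, `build_prompt`, and `find_unknown_tags` are hypothetical, and the only thing taken from this guide is the tag spelling.

```python
import re

# Tag names copied from the tables in this guide.
# Grouping them in a dict is a choice made for this sketch only.
SUPPORTED_TAGS = {
    "dialect": ["Sichuanese", "Cantonese"],
    "emotion": ["Angry", "Happy", "Sad", "Excited", "Fearful",
                "Surprised", "Disgusted"],
    "style": ["Act_coy", "Older", "Child", "Whisper", "Serious",
              "Generous", "Exaggerated"],
    "paralinguistic": ["Breathing", "Laughter", "Suprise-oh",
                       "Confirmation-en", "Uhm", "Suprise-ah",
                       "Suprise-wa", "Sigh", "Question-ei",
                       "Dissatisfaction-hnn"],
}
ALL_TAGS = {tag for group in SUPPORTED_TAGS.values() for tag in group}

# Matches anything written as [Name] with no nested brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def build_prompt(text, tags=()):
    """Prefix `text` with the given tags, e.g. [Whisper][Child]text."""
    unknown = [tag for tag in tags if tag not in ALL_TAGS]
    if unknown:
        raise ValueError(f"Unsupported tag(s): {unknown}")
    return "".join(f"[{tag}]" for tag in tags) + text


def find_unknown_tags(prompt):
    """Return bracketed names in `prompt` that are not in the tables."""
    return [name for name in TAG_PATTERN.findall(prompt)
            if name not in ALL_TAGS]
```

For example, `build_prompt("Is Mommy asleep?", ["Whisper", "Child"])` reproduces the speaking-style example. Because the paralinguistic tags are spelled `Suprise-` in the list above, `find_unknown_tags` would flag a prompt that uses `[Surprise-oh]`, which is an easy mistake to make when typing tags by hand.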
English Comprehensive Example (with Paralinguistic Tags)
[Happy]Oh my gosh, [Laughter]I can't believe we won! [Uhm] Should we celebrate tonight?
[Excited]Guess what? [Suprise-oh]They said yes! [Confirmation-en] Mmhmm, it’s really happening.
System Requirements
- 32 GB RAM recommended.
- 20 GB+ of free storage recommended.
- Windows 10/11: an NVIDIA GPU with 12 GB+ VRAM is required.
- Note: For NVIDIA GPUs, update to a recent driver.
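A quick pre-install check of the storage and VRAM figures above can be sketched in Python. This is an illustrative script, not something shipped with Step-Audio-EditX or LM Downloader: the function names and the 0.5 GB tolerance are assumptions, and it relies on the `nvidia-smi` tool that is installed with the NVIDIA driver. RAM is not checked because the Python standard library has no portable call for it.

```python
import shutil
import subprocess

MIN_FREE_DISK_GB = 20  # storage figure from the list above
MIN_VRAM_GB = 12       # VRAM figure from the list above


def free_disk_gb(path="."):
    """Free space, in GiB, on the drive that contains `path`."""
    return shutil.disk_usage(path).free / 1024**3


def parse_vram_mib(nvidia_smi_output):
    """Parse one integer (MiB) per GPU from nvidia-smi CSV output."""
    return [int(line) for line in nvidia_smi_output.splitlines()
            if line.strip().isdigit()]


def query_vram_mib():
    """Total VRAM per GPU in MiB, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True, timeout=30,
        )
    except (subprocess.SubprocessError, OSError):
        return []
    return parse_vram_mib(result.stdout)


def has_enough_vram(vram_mib, min_gb=MIN_VRAM_GB, tolerance_gb=0.5):
    """True if any GPU reaches min_gb, allowing a small tolerance
    because cards often report slightly less than their nominal size
    (the tolerance value is an assumption of this sketch)."""
    return any(mib / 1024 >= min_gb - tolerance_gb for mib in vram_mib)


if __name__ == "__main__":
    disk = free_disk_gb(".")
    print(f"Free disk: {disk:.1f} GB (need {MIN_FREE_DISK_GB}+):",
          "OK" if disk >= MIN_FREE_DISK_GB else "LOW")
    vram = query_vram_mib()
    print(f"GPU VRAM (MiB): {vram or 'no NVIDIA GPU detected'}:",
          "OK" if has_enough_vram(vram) else "INSUFFICIENT")
```

Run it from the drive where the app will be installed so that the disk check measures the right volume.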
Install Step-Audio-EditX
Open LM Downloader, then click "Local Apps" in the left menu. Step-Audio-EditX appears in the app list. Click the Step-Audio-EditX icon to open its introduction page, then click the Install button to open the install window. If Step-Audio-EditX is already installed, the installation is treated as an update and does not affect the models you have previously downloaded.
On the app details page, click the Run button on the right to open the runtime window. After a successful launch, your browser opens automatically.