Step-Audio-EditX Usage Tips and Installation Guide

What is Step-Audio-EditX

Step-Audio-EditX is an advanced open-source audio foundation model developed by the StepFun AI team, designed to let users “edit speech as easily as writing a prompt.” Even non-technical users can effortlessly generate highly natural and expressive speech—simply by uploading a short reference audio clip and writing clear text instructions describing how the AI should speak. The model will then clone the voice and precisely convey the desired emotion, accent, or vocal nuance.

🎯 Powerful Yet Easy-to-Use Features

  • Zero-shot voice cloning: Upload any voice sample and synthesize new text in that voice—no training required.
  • Iterative editing: Not satisfied? Refine your instructions repeatedly to gradually improve the output.
  • Expressive control: Control not only what is said but how it is said: happy or angry, whispered or playful, with added breaths or laughter.

All of this is achieved by adding simple tags to your input text, making it intuitive and effective.

✍️ How to “Direct” AI Speech with Text Prompts

Step-Audio-EditX lets you insert special tags before your target sentence (paralinguistic cues can also be placed inside the sentence) to control four key dimensions: dialect, emotion, speaking style, and paralinguistic cues. Tags can be used individually or combined freely.

Below is the complete list of officially supported tags:

1. Dialect

Tag             Description
[Sichuanese]    Sichuan dialect (Chinese)
[Cantonese]     Cantonese

Example: [Sichuanese]The weather is really nice today!

2. Emotion

Tag             Description
[Angry]         Angry
[Happy]         Happy
[Sad]           Sad
[Excited]       Excited
[Fearful]       Fearful
[Surprised]     Surprised
[Disgusted]     Disgusted

Example: [Happy]I got accepted!

3. Speaking Style

Tag             Description
[Act_coy]       Coy / flirtatious tone
[Older]         Mature / elderly-sounding
[Child]         Childlike voice
[Whisper]       Whisper
[Serious]       Serious tone
[Generous]      Generous / hearty manner
[Exaggerated]   Exaggerated delivery

Example: [Whisper][Child]Is Mommy asleep?

4. Paralinguistic Cues

Tag                     Description
[Breathing]             Audible breathing
[Laughter]              Laughter
[Suprise-oh]            Surprised “oh!”
[Confirmation-en]       Confirming “en” (like “uh-huh”)
[Uhm]                   Hesitation “uhm”
[Suprise-ah]            Surprised “ah!”
[Suprise-wa]            Surprised “wa!”
[Sigh]                  Sigh
[Question-ei]           Questioning “ei?”
[Dissatisfaction-hnn]   Dissatisfied “hnn”

Example: [Sigh][Sad]Ugh… I failed again.

With this structured, tag-based prompting system, Step-Audio-EditX makes expressive, controllable speech synthesis as simple as writing a single sentence—while delivering remarkable realism and flexibility.
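
The tag-prefix convention described above can be sketched as a small Python helper. This is only an illustration: `build_prompt` and the `KNOWN_TAGS` grouping are hypothetical names, not part of Step-Audio-EditX, and the tag spellings are copied from the lists in this guide (including the `Suprise-` spelling used there).

```python
# Tag vocabulary copied from the lists in this guide.
# The grouping and the helper below are illustrative, not an official API.
DIALECTS = {"Sichuanese", "Cantonese"}
EMOTIONS = {"Angry", "Happy", "Sad", "Excited", "Fearful", "Surprised", "Disgusted"}
STYLES = {"Act_coy", "Older", "Child", "Whisper", "Serious", "Generous", "Exaggerated"}
CUES = {"Breathing", "Laughter", "Suprise-oh", "Confirmation-en", "Uhm",
        "Suprise-ah", "Suprise-wa", "Sigh", "Question-ei", "Dissatisfaction-hnn"}
KNOWN_TAGS = DIALECTS | EMOTIONS | STYLES | CUES


def build_prompt(text, *tags):
    """Prefix `text` with the given tags, in the order supplied."""
    for tag in tags:
        if tag not in KNOWN_TAGS:
            raise ValueError(f"unknown tag: {tag!r}")
    return "".join(f"[{tag}]" for tag in tags) + text


print(build_prompt("Is Mommy asleep?", "Whisper", "Child"))
# prints: [Whisper][Child]Is Mommy asleep?
print(build_prompt("I got accepted!", "Happy"))
# prints: [Happy]I got accepted!
```

Rejecting unknown tags up front is a design choice of this sketch; it catches spelling slips before the text reaches the model.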

Comprehensive English Example (with Paralinguistic Tags)

[Happy]Oh my gosh, [Laughter]I can't believe we won! [Uhm] Should we celebrate tonight?

[Excited]Guess what? [Suprise-oh]They said yes! [Confirmation-en] Mmhmm, it’s really happening.
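
Because several tag names are easy to mistype (the guide spells them `Suprise-oh`, not `Surprise-oh`), it may help to check a hand-written prompt before submitting it. The following is a minimal sketch, assuming tags are case-sensitive and always written in square brackets; `find_unknown_tags` is a hypothetical helper, not part of the model's tooling.

```python
import re

# Tag names as listed in this guide (assumed here to be case-sensitive).
KNOWN_TAGS = {
    "Sichuanese", "Cantonese",
    "Angry", "Happy", "Sad", "Excited", "Fearful", "Surprised", "Disgusted",
    "Act_coy", "Older", "Child", "Whisper", "Serious", "Generous", "Exaggerated",
    "Breathing", "Laughter", "Suprise-oh", "Confirmation-en", "Uhm",
    "Suprise-ah", "Suprise-wa", "Sigh", "Question-ei", "Dissatisfaction-hnn",
}

# Matches anything written as [Something] with no nested brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_unknown_tags(prompt):
    """Return bracketed tags in `prompt` that are not in KNOWN_TAGS, in order."""
    return [tag for tag in TAG_PATTERN.findall(prompt) if tag not in KNOWN_TAGS]


ok = "[Happy]Oh my gosh, [Laughter]I can't believe we won! [Uhm] Should we celebrate tonight?"
typo = "[Excited]Guess what? [Surprise-oh]They said yes!"

print(find_unknown_tags(ok))    # prints: []
print(find_unknown_tags(typo))  # prints: ['Surprise-oh']
```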

System Requirements

  • 32 GB RAM and 20 GB+ of free storage recommended.
  • Windows 10/11 with an NVIDIA GPU that has 12 GB+ of VRAM (required).
  • Note: For NVIDIA GPUs, install a recent driver version.
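
The GPU and storage requirements above can be checked with a short script before installing. This is a rough sketch, assuming `nvidia-smi` (shipped with the NVIDIA driver) is on the PATH; the function names and the rounding rule are choices made for this example, and RAM is not checked because the Python standard library has no portable call for it.

```python
import shutil
import subprocess

MIN_VRAM_GB = 12        # "12GB+ VRAM" from the requirements above
MIN_FREE_DISK_GB = 20   # "20GB+ storage" from the requirements above


def parse_gpu_csv(output):
    """Parse `name, memory.total, driver_version` CSV rows into dicts."""
    gpus = []
    for line in output.strip().splitlines():
        # rsplit keeps any commas inside the GPU name intact.
        name, vram_mib, driver = [f.strip() for f in line.rsplit(",", 2)]
        gpus.append({"name": name, "vram_mib": int(vram_mib), "driver": driver})
    return gpus


def query_gpus():
    """Ask nvidia-smi for GPU details; return [] if it is missing or fails."""
    cmd = ["nvidia-smi", "--query-gpu=name,memory.total,driver_version",
           "--format=csv,noheader,nounits"]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return parse_gpu_csv(result.stdout)
    except (OSError, subprocess.CalledProcessError, ValueError):
        return []


def has_enough_vram(gpus):
    """True if any GPU reports roughly MIN_VRAM_GB or more.

    Rounding is used because some 12 GB cards report slightly under 12288 MiB.
    """
    return any(round(gpu["vram_mib"] / 1024) >= MIN_VRAM_GB for gpu in gpus)


gpus = query_gpus()
for gpu in gpus:
    print(f"{gpu['name']}: {gpu['vram_mib']} MiB VRAM, driver {gpu['driver']}")
print("GPU VRAM OK:", has_enough_vram(gpus))

free_gb = shutil.disk_usage(".").free / 1024**3
print(f"Free disk: {free_gb:.1f} GB, OK: {free_gb >= MIN_FREE_DISK_GB}")
```

Run it from the drive where the app will be installed so the free-space figure refers to the right disk.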

Install Step-Audio-EditX

Open LM Downloader, then click "Local Apps" in the left menu. Step-Audio-EditX appears in the app list. Click its icon to open the introduction page, then click the Install button to open the install window. If Step-Audio-EditX is already installed, installing again is treated as an update and does not affect the models you have already downloaded.

On the app details page, click the Run button on the right to open the runtime window. After a successful launch, your browser opens automatically.