<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Fish Speech on Chico's Tech Blog</title><link>https://realtime-ai.chat/tags/fish-speech/</link><description>Recent content in Fish Speech on Chico's Tech Blog</description><image><title>Chico's Tech Blog</title><url>https://github.com/chicogong.png</url><link>https://github.com/chicogong.png</link></image><generator>Hugo</generator><language>zh-cn</language><lastBuildDate>Fri, 16 Jan 2026 10:00:00 +0800</lastBuildDate><atom:link href="https://realtime-ai.chat/tags/fish-speech/index.xml" rel="self" type="application/rss+xml"/><item><title>TTS Model Fine-Tuning: Train a Speech Model on Your Own Voice</title><link>https://realtime-ai.chat/posts/tts-finetuning/</link><pubDate>Fri, 16 Jan 2026 10:00:00 +0800</pubDate><guid>https://realtime-ai.chat/posts/tts-finetuning/</guid><description>Hands-on TTS fine-tuning: train a model on your own voice with XTTS and Fish Speech, with the complete voice-cloning steps.</description><content:encoded><![CDATA[<h2 id="两个主流开源方案">Two Mainstream Open-Source Options</h2>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Strengths</th>
          <th>Audio needed</th>
          <th>VRAM</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>XTTS v2</td>
          <td>Multilingual, stable results</td>
          <td>2-20 minutes</td>
          <td>12GB+</td>
      </tr>
      <tr>
          <td>Fish Speech</td>
          <td>Strong Chinese quality, fast</td>
          <td>From 3-10 seconds (zero-shot)</td>
          <td>4GB+</td>
      </tr>
  </tbody>
</table>
<hr>
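<h2 id="方案一xtts微调">Option 1: Fine-Tuning XTTS</h2>
<h3 id="准备工作">Prerequisites</h3>
<p><strong>Hardware</strong>:</p>
<ul>
<li>GPU: at least 12GB of VRAM (16GB recommended)</li>
<li>RAM: at least 16GB</li>
</ul>
<p><strong>Data</strong>:</p>
<ul>
<li>At least 2-3 minutes of clean recordings</li>
<li>5-20 minutes is recommended for better results</li>
<li>WAV format, sample rate of 16 kHz or higher</li>
</ul>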
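<p>The data requirements above can be checked before any training starts. The following is a minimal sketch using only Python's standard <code>wave</code> module, which reads PCM WAV files. The helper names <code>inspect_wav</code> and <code>check_recordings</code> and the two thresholds are assumptions made for this illustration (120 seconds stands in for the "2-3 minutes" lower bound); they are not part of XTTS or of the fine-tuning web UI.</p>

```python
import wave

MIN_SAMPLE_RATE = 16_000   # from the requirement above: 16 kHz or higher
MIN_TOTAL_SECONDS = 120    # assumed floor for "at least 2-3 minutes"


def inspect_wav(path):
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnframes() / rate


def check_recordings(paths):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    total_seconds = 0.0
    for path in paths:
        rate, seconds = inspect_wav(path)
        total_seconds += seconds
        if rate < MIN_SAMPLE_RATE:
            problems.append(
                f"{path}: sample rate {rate} Hz is below {MIN_SAMPLE_RATE} Hz"
            )
    if total_seconds < MIN_TOTAL_SECONDS:
        problems.append(
            f"only {total_seconds:.1f}s of audio in total, "
            f"need at least {MIN_TOTAL_SECONDS}s"
        )
    return problems
```

<p>Calling <code>check_recordings</code> with the list of WAV paths in <code>dataset/audio/</code> returns an empty list when both checks pass. It says nothing about noise or clipping, so listening to the clips is still necessary.</p>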
<h3 id="安装">Installation</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">git clone https://github.com/daswer123/xtts-finetune-webui
</span></span><span class="line"><span class="cl"><span class="nb">cd</span> xtts-finetune-webui
</span></span><span class="line"><span class="cl">pip install -r requirements.txt
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="数据格式">Data Format</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">dataset/
</span></span><span class="line"><span class="cl">├── audio/
</span></span><span class="line"><span class="cl">│   ├── 001.wav
</span></span><span class="line"><span class="cl">│   ├── 002.wav
</span></span><span class="line"><span class="cl">│   └── ...
</span></span><span class="line"><span class="cl">└── metadata.csv
</span></span></code></pre></td></tr></table>
</div>
</div><p>metadata.csv is pipe-delimited, with a header row and one clip per line:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">audio_file|text|speaker_name
</span></span><span class="line"><span class="cl">audio/001.wav|今天天气真不错。|my_voice
</span></span><span class="line"><span class="cl">audio/002.wav|我们去公园散步吧。|my_voice
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="训练配置">Training Configuration</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="c"># Key parameters</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">batch_size</span><span class="p">:</span><span class="w"> </span><span class="m">2</span><span class="w">          </span><span class="c"># lower this if you run out of VRAM</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">epochs</span><span class="p">:</span><span class="w"> </span><span class="m">30</span><span class="w">             </span><span class="c"># typically 10-50; use more epochs when data is scarce</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">learning_rate</span><span class="p">:</span><span class="w"> </span><span class="m">5e-6</span><span class="w">    </span><span class="c"># keep it small; a large value overfits easily</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
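</div><h3 id="常见问题">Common Issues</h3>
<p><strong>Issue 1: the voice sounds strange after training</strong>
→ The model has overfit; reduce the number of epochs or add more data.</p>
<p><strong>Issue 2: the voice does not sound like you</strong>
→ There is too little data, or its quality is poor; check the recordings.</p>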
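<p>Data problems often start in the metadata rather than in the audio itself. As a rough sketch, the script below validates a pipe-delimited <code>metadata.csv</code> with the three columns described under the data format: it flags rows with the wrong number of fields, empty transcripts, and audio files that do not exist. The function name <code>validate_metadata</code> is an assumption for this illustration, not part of any tool mentioned here.</p>

```python
import csv
import os


def validate_metadata(metadata_path, dataset_dir):
    """Return a list of problems found in a pipe-delimited metadata.csv."""
    problems = []
    with open(metadata_path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE: transcripts may contain quote characters verbatim.
        reader = csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE)
        header = next(reader, None)
        if header != ["audio_file", "text", "speaker_name"]:
            problems.append(f"unexpected header: {header}")
        # Data rows start on line 2, right after the header.
        for line_no, row in enumerate(reader, start=2):
            if len(row) != 3:
                problems.append(
                    f"line {line_no}: expected 3 fields, got {len(row)}"
                )
                continue
            audio_file, text, _speaker = row
            if not text.strip():
                problems.append(f"line {line_no}: empty transcript")
            if not os.path.isfile(os.path.join(dataset_dir, audio_file)):
                problems.append(
                    f"line {line_no}: missing audio file {audio_file}"
                )
    return problems
```

<p>Run it as <code>validate_metadata("dataset/metadata.csv", "dataset")</code>; an empty list means the file is structurally sound. It cannot tell whether a transcript actually matches what was said in the clip.</p>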
<p><strong>Issue 3: out of VRAM</strong>
→ Reduce batch_size, or use gradient accumulation.</p>
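<p>Gradient accumulation trades time for memory: several small micro-batches are processed one after another, their gradients are summed, and the optimizer steps once. The framework-free sketch below illustrates the idea on a toy one-parameter model (<code>y = w * x</code> with a mean-squared-error loss); the function names and the sample data are made up for this example. In a real training loop the same effect usually comes from dividing the loss by the number of accumulation steps and stepping the optimizer only after the last micro-batch.</p>

```python
def mse_grad(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, micro_batch_size):
    """Same gradient, computed one micro-batch at a time.

    Each micro-batch gradient is scaled by 1/accum_steps before being
    added, so the sum matches the gradient of the full batch.
    """
    if len(batch) % micro_batch_size != 0:
        raise ValueError("batch size must be a multiple of micro_batch_size")
    accum_steps = len(batch) // micro_batch_size
    grad = 0.0
    for i in range(0, len(batch), micro_batch_size):
        micro_batch = batch[i:i + micro_batch_size]
        grad += mse_grad(w, micro_batch) / accum_steps
    return grad  # an optimizer would step once here, then reset the gradient


# Toy (x, y) pairs, invented for illustration only.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
full = mse_grad(0.5, data)
accumulated = accumulated_grad(0.5, data, micro_batch_size=2)
print(abs(full - accumulated) < 1e-9)
```

<p>The two gradients agree up to floating-point rounding, which is why a batch_size of 2 with 4 accumulation steps behaves like a batch of 8 while only holding 2 samples in memory at a time.</p>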
<hr>
<h2 id="方案二fish-speech微调">Option 2: Fine-Tuning Fish Speech</h2>
<p>Fish Speech handles Chinese well and needs less VRAM (4GB+, versus 12GB+ for XTTS).</p>
<h3 id="安装-1">Installation</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">git clone https://github.com/fishaudio/fish-speech
</span></span><span class="line"><span class="cl"><span class="nb">cd</span> fish-speech
</span></span><span class="line"><span class="cl">pip install -e .
</span></span></code></pre></td></tr></table>
</div>
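</div><h3 id="零样本克隆不用训练">Zero-Shot Cloning (No Training)</h3>
<p>Fish Speech can clone a voice directly from a 3-10 second reference clip. The snippet below is simplified, and the exact inference entry point depends on the Fish Speech version, so check the repository's documentation:</p>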
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">fish_speech</span> <span class="kn">import</span> <span class="n">FishSpeech</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">model</span> <span class="o">=</span> <span class="n">FishSpeech</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Generate speech from a reference clip</span>
</span></span><span class="line"><span class="cl"><span class="n">audio</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">generate</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span><span class="o">=</span><span class="s2">&#34;这是克隆后的声音&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">reference_audio</span><span class="o">=</span><span class="s2">&#34;reference.wav&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
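</div><h3 id="微调效果更好">Fine-Tuning (Better Results)</h3>
<p>For a closer match to the target voice, fine-tune the model. Script names and config paths vary between Fish Speech versions, so treat the commands below as an outline:</p>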
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Prepare the data</span>
</span></span><span class="line"><span class="cl">python tools/prepare_data.py --input-dir ./my_audio --output-dir ./dataset
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Start fine-tuning</span>
</span></span><span class="line"><span class="cl">python train.py --config configs/finetune.yaml --data-dir ./dataset
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="推理">Inference</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Use the fine-tuned model (model is loaded as in the zero-shot example)</span>
</span></span><span class="line"><span class="cl"><span class="n">audio</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">generate</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span><span class="o">=</span><span class="s2">&#34;现在声音更像了&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">voice_id</span><span class="o">=</span><span class="s2">&#34;my_custom_voice&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><hr>
<h2 id="对比选择">Which One to Choose</h2>
<table>
  <thead>
      <tr>
          <th>Scenario</th>
          <th>Recommendation</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Quick prototype</td>
          <td>Fish Speech (zero-shot)</td>
      </tr>
      <tr>
          <td>Chinese-language use cases</td>
          <td>Fish Speech</td>
      </tr>
      <tr>
          <td>Multilingual</td>
          <td>XTTS</td>
      </tr>
      <tr>
          <td>Highest quality</td>
          <td>XTTS fine-tuned on 20 minutes of data</td>
      </tr>
  </tbody>
</table>
<hr>
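<h2 id="训练技巧">Training Tips</h2>
<h3 id="1-不要贪多">1. Quality Over Quantity</h3>
<ul>
<li>10 minutes of high-quality audio beats 1 hour of audio with background noise</li>
</ul>
<h3 id="2-监控过拟合">2. Watch for Overfitting</h3>
<ul>
<li>If training loss keeps falling but the generated audio gets worse, stop training</li>
</ul>
<h3 id="3-多做对比">3. Compare Often</h3>
<ul>
<li>Save several checkpoints, then compare them and keep the best one</li>
</ul>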
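<p>Tips 2 and 3 can be combined into one small routine: score every saved checkpoint on held-out data, keep the best one, and stop once the score has not improved for a few evaluations. The sketch below assumes a lower-is-better score such as validation loss; the function name <code>pick_checkpoint</code>, the <code>patience</code> parameter, the file names, and the numbers are all invented for illustration.</p>

```python
def pick_checkpoint(history, patience=3):
    """Pick the best checkpoint and decide whether to stop training.

    history: list of (checkpoint_name, validation_score) in training
             order, where a lower score is better.
    Returns (best_checkpoint_name, should_stop).
    """
    if not history:
        raise ValueError("history is empty")
    best_index = min(range(len(history)), key=lambda i: history[i][1])
    # Number of evaluations that have passed without beating the best score.
    evals_since_best = len(history) - 1 - best_index
    return history[best_index][0], evals_since_best >= patience


# Made-up scores: improvement until epoch 15, then steady degradation.
history = [
    ("epoch_05.pth", 0.92),
    ("epoch_10.pth", 0.71),
    ("epoch_15.pth", 0.64),
    ("epoch_20.pth", 0.69),
    ("epoch_25.pth", 0.75),
    ("epoch_30.pth", 0.83),
]
best, stop = pick_checkpoint(history)
print(best, stop)  # prints: epoch_15.pth True
```

<p>A numeric score is only a proxy. Since the real criterion here is how the generated audio sounds, it is worth listening to samples from the top two or three checkpoints before settling on one.</p>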
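<h3 id="4-参考音频很重要">4. The Reference Audio Matters</h3>
<ul>
<li>The reference clip used at XTTS generation time strongly affects the output</li>
<li>Pick the clearest, most representative clip</li>
</ul>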
<hr>
<h2 id="部署">Deployment</h2>
<p>Once trained, the model can be served behind an HTTP API:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">fastapi</span> <span class="kn">import</span> <span class="n">FastAPI</span><span class="p">,</span> <span class="n">Response</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">fish_speech</span> <span class="kn">import</span> <span class="n">FishSpeech</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">app</span> <span class="o">=</span> <span class="n">FastAPI</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="n">model</span> <span class="o">=</span> <span class="n">FishSpeech</span><span class="p">(</span><span class="n">checkpoint</span><span class="o">=</span><span class="s2">&#34;my_model&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nd">@app.post</span><span class="p">(</span><span class="s2">&#34;/tts&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">generate_speech</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">audio</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">generate</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>  <span class="c1"># assumed to return WAV-encoded bytes</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">Response</span><span class="p">(</span><span class="n">content</span><span class="o">=</span><span class="n">audio</span><span class="p">,</span> <span class="n">media_type</span><span class="o">=</span><span class="s2">&#34;audio/wav&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><hr>
<p>Questions? Leave a comment.</p>
]]></content:encoded></item></channel></rss>