<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Deployment and Operations on Chico's Tech Blog</title><link>https://realtime-ai.chat/categories/%E9%83%A8%E7%BD%B2%E8%BF%90%E7%BB%B4/</link><description>Recent content in Deployment and Operations on Chico's Tech Blog</description><image><title>Chico's Tech Blog</title><url>https://github.com/chicogong.png</url><link>https://github.com/chicogong.png</link></image><generator>Hugo</generator><language>zh-cn</language><lastBuildDate>Tue, 23 Dec 2025 10:00:00 +0800</lastBuildDate><atom:link href="https://realtime-ai.chat/categories/%E9%83%A8%E7%BD%B2%E8%BF%90%E7%BB%B4/index.xml" rel="self" type="application/rss+xml"/><item><title>The Complete Guide to Running LLMs Locally: Ollama, vLLM, and LM Studio in Practice</title><link>https://realtime-ai.chat/posts/local-llm-deployment/</link><pubDate>Tue, 23 Dec 2025 10:00:00 +0800</pubDate><guid>https://realtime-ai.chat/posts/local-llm-deployment/</guid><description>A complete guide to running large language models locally: a hands-on comparison of Ollama, vLLM, and LM Studio, balancing privacy, performance, and cost.</description><content:encoded><![CDATA[<h2 id="为什么要本地部署">Why deploy locally?</h2>
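<p>In 2025, with cloud APIs everywhere, why would you still run a large language model locally?</p>
<h3 id="理由1隐私安全">Reason 1: Privacy and security</h3>
<p>With a cloud API, your code, documents, and chat logs are all sent to someone else's servers.</p>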
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">Sensitive scenarios:
</span></span><span class="line"><span class="cl">- Internal company code → send it to OpenAI?
</span></span><span class="line"><span class="cl">- Medical records → send them to the cloud?
</span></span><span class="line"><span class="cl">- Legal contracts → who guarantees they won't leak?
</span></span></code></pre></td></tr></table>
</div>
</div><p>Local deployment means your data never leaves your machine.</p>
<h3 id="理由2成本控制">Reason 2: Cost control</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Use case</th>
          <th style="text-align: left">Cloud API cost</th>
          <th style="text-align: left">Local deployment cost</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">10,000 calls per day</td>
          <td style="text-align: left">~$300/month</td>
          <td style="text-align: left">Electricity, ~$30/month</td>
      </tr>
      <tr>
          <td style="text-align: left">Long-term use of a 7B model</td>
          <td style="text-align: left">Ongoing fees</td>
          <td style="text-align: left">One-time hardware purchase</td>
      </tr>
      <tr>
          <td style="text-align: left">Team of 10 users</td>
          <td style="text-align: left">$200+/person/month</td>
          <td style="text-align: left">One shared server</td>
      </tr>
  </tbody>
</table>
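<p>To make the comparison concrete, here is a minimal break-even sketch in Python. The $300 and $30 monthly figures come from the first row of the table above; the $2,000 hardware price and the function name <code>break_even_months</code> are hypothetical placeholders, so substitute your own numbers.</p>

```python
def break_even_months(hardware_cost, cloud_monthly, local_monthly):
    """Months until a one-time hardware purchase beats a recurring cloud bill."""
    monthly_saving = cloud_monthly - local_monthly
    if not monthly_saving > 0:
        return None  # local deployment never pays off at these rates
    return hardware_cost / monthly_saving


# 300 and 30 are the table's monthly figures; 2000 is an assumed hardware price.
months = break_even_months(hardware_cost=2000, cloud_monthly=300, local_monthly=30)
print(f"Break-even after about {months:.1f} months")
```

<p>With these inputs the hardware pays for itself in under eight months. A cheaper cloud plan or more expensive hardware pushes that date out.</p>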
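<h3 id="理由3低延迟">Reason 3: Low latency</h3>
<p>Cloud API: 100-500 ms of network round trip per request.<br>
Local deployment: no network round trip, so latency comes down to inference time on your own hardware.</p>
<h3 id="理由4自由定制">Reason 4: Full control over customization</h3>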
<ul>
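<li>Want to fine-tune? Go ahead.</li>
<li>Want to change the prompt template? Edit it yourself.</li>
<li>Want to cap the output length? Set whatever limit you like.</li>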
</ul>
<hr>
<h2 id="硬件要求">Hardware requirements</h2>
<h3 id="最低配置跑7b模型">Minimum configuration (for 7B models)</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">CPU: 8 cores or more
</span></span><span class="line"><span class="cl">RAM: 16GB
</span></span><span class="line"><span class="cl">GPU: 8GB VRAM (e.g., RTX 3070)
</span></span><span class="line"><span class="cl">     or Apple M1/M2/M3 (unified memory)
</span></span><span class="line"><span class="cl">Storage: 50GB of free SSD space
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="推荐配置跑13b-70b模型">Recommended configuration (for 13B-70B models)</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">CPU: 12 cores or more
</span></span><span class="line"><span class="cl">RAM: 32GB+
</span></span><span class="line"><span class="cl">GPU: 24GB VRAM (e.g., RTX 4090; 70B models need roughly 40GB or more)
</span></span><span class="line"><span class="cl">     or Apple M2 Pro/Max/Ultra
</span></span><span class="line"><span class="cl">Storage: 200GB of free SSD space
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="显存-vs-模型大小速查表">VRAM vs. model size cheat sheet</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Model size</th>
          <th style="text-align: left">Minimum VRAM</th>
          <th style="text-align: left">Recommended VRAM</th>
          <th style="text-align: left">Representative models</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">3B</td>
          <td style="text-align: left">4GB</td>
          <td style="text-align: left">6GB</td>
          <td style="text-align: left">Phi-3 Mini</td>
      </tr>
      <tr>
          <td style="text-align: left">7B-8B</td>
          <td style="text-align: left">6GB</td>
          <td style="text-align: left">8GB</td>
          <td style="text-align: left">Llama 3.1 8B, Qwen2.5 7B</td>
      </tr>
      <tr>
          <td style="text-align: left">13B</td>
          <td style="text-align: left">10GB</td>
          <td style="text-align: left">16GB</td>
          <td style="text-align: left">Llama 2 13B</td>
      </tr>
      <tr>
          <td style="text-align: left">34B</td>
          <td style="text-align: left">20GB</td>
          <td style="text-align: left">24GB</td>
          <td style="text-align: left">CodeLlama 34B</td>
      </tr>
      <tr>
          <td style="text-align: left">70B</td>
          <td style="text-align: left">40GB</td>
          <td style="text-align: left">48GB</td>
          <td style="text-align: left">Llama 3.1 70B</td>
      </tr>
  </tbody>
</table>
<p><strong>Note</strong>: Q4/Q5 quantization can cut VRAM requirements by roughly half or more relative to 16-bit (FP16) weights.</p>
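<p>A rough way to sanity-check the table is to estimate weight memory as parameter count times bits per weight, plus some headroom for the KV cache and runtime buffers. The sketch below assumes a flat 20% overhead, which is an illustrative guess rather than a measured value; <code>estimate_vram_gb</code> is a hypothetical helper, and real usage varies with context length and the inference runtime.</p>

```python
def estimate_vram_gb(params_billion, bits_per_weight, overhead_fraction=0.2):
    """Back-of-the-envelope VRAM estimate in GB for a dense model."""
    # One billion parameters at 8 bits per weight is about 1 GB of weights.
    weight_gb = params_billion * bits_per_weight / 8
    # overhead_fraction is an assumed allowance for KV cache and buffers.
    return weight_gb * (1 + overhead_fraction)


for bits in (16, 8, 4):
    print(f"7B at {bits}-bit: ~{estimate_vram_gb(7, bits):.1f} GB")
```

<p>Treat the result as a floor, not a guarantee: long prompts grow the KV cache well beyond a fixed percentage, and Q4 formats usually store slightly more than 4 bits per weight.</p>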
<hr>
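<h2 id="方案一ollama推荐新手">Option 1: Ollama (recommended for beginners)</h2>
<p>Ollama is one of the simplest ways to run large models locally: a single command gets a model running.</p>
<h3 id="安装">Installation</h3>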
<p><strong>Linux:</strong></p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">curl -fsSL https://ollama.com/install.sh <span class="p">|</span> sh
</span></span></code></pre></td></tr></table>
</div>
</div><p><strong>macOS / Windows:</strong>
Download the installer: https://ollama.com/download</p>
<h3 id="基础使用">Basic usage</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Download and run Llama 3.1 8B</span>
</span></span><span class="line"><span class="cl">ollama run llama3.1
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Download and run Qwen 2.5 7B (stronger in Chinese)</span>
</span></span><span class="line"><span class="cl">ollama run qwen2.5
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Download and run DeepSeek Coder V2 (code-focused)</span>
</span></span><span class="line"><span class="cl">ollama run deepseek-coder-v2
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># List downloaded models</span>
</span></span><span class="line"><span class="cl">ollama list
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Delete a model</span>
</span></span><span class="line"><span class="cl">ollama rm llama3.1
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="作为api服务使用">Using Ollama as an API service</h3>
<p>By default, Ollama serves its API at <code>http://localhost:11434</code>.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">requests</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">post</span><span class="p">(</span><span class="s1">&#39;http://localhost:11434/api/generate&#39;</span><span class="p">,</span> <span class="n">json</span><span class="o">=</span><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;model&#39;</span><span class="p">:</span> <span class="s1">&#39;qwen2.5&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;prompt&#39;</span><span class="p">:</span> <span class="s1">&#39;Write a quicksort in Python&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;stream&#39;</span><span class="p">:</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl"><span class="p">})</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">response</span><span class="o">.</span><span class="n">json</span><span class="p">()[</span><span class="s1">&#39;response&#39;</span><span class="p">])</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p><strong>OpenAI-compatible API:</strong></p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">openai</span> <span class="kn">import</span> <span class="n">OpenAI</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">client</span> <span class="o">=</span> <span class="n">OpenAI</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">base_url</span><span class="o">=</span><span class="s1">&#39;http://localhost:11434/v1&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">api_key</span><span class="o">=</span><span class="s1">&#39;ollama&#39;</span>  <span class="c1"># any value works</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">chat</span><span class="o">.</span><span class="n">completions</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">model</span><span class="o">=</span><span class="s1">&#39;qwen2.5&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
</span></span><span class="line"><span class="cl">        <span class="p">{</span><span class="s1">&#39;role&#39;</span><span class="p">:</span> <span class="s1">&#39;user&#39;</span><span class="p">,</span> <span class="s1">&#39;content&#39;</span><span class="p">:</span> <span class="s1">&#39;Hello, please introduce yourself&#39;</span><span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">response</span><span class="o">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">message</span><span class="o">.</span><span class="n">content</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="自定义模型modelfile">Custom models (Modelfile)</h3>
<p>Create a <code>Modelfile</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-dockerfile" data-lang="dockerfile"><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="s">qwen2.5</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="c"># Set the system prompt</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span>SYSTEM <span class="s2">&#34;&#34;&#34;</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span>You are a professional Python development assistant.<span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span>Keep answers concise and comment your code.<span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="s2">&#34;&#34;&#34;</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="c"># Set parameters</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span>PARAMETER temperature 0.7<span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span>PARAMETER num_ctx <span class="m">4096</span><span class="err">
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>Build and run:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">ollama create my-python-helper -f Modelfile
</span></span><span class="line"><span class="cl">ollama run my-python-helper
</span></span></code></pre></td></tr></table>
</div>
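</div><h3 id="ollama-优缺点">Ollama pros and cons</h3>
<p><strong>Pros</strong>:</p>
<ul>
<li>Simple installation with a single command</li>
<li>Large model library with one-command downloads</li>
<li>Solid automatic memory management</li>
<li>Active community</li>
</ul>
<p><strong>Cons</strong>:</p>
<ul>
<li>Not the fastest inference engine</li>
<li>Fewer advanced features</li>
<li>No native tensor parallelism across multiple GPUs</li>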
</ul>
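<p>The API example earlier in this section sets <code>stream</code> to <code>False</code>. With streaming enabled, Ollama's <code>/api/generate</code> endpoint returns newline-delimited JSON chunks, and generation settings such as the output cap can be passed per request under <code>options</code>. The sketch below shows one way to handle both. The helper names <code>build_payload</code> and <code>join_stream</code> are hypothetical, and the sample chunks are canned stand-ins so the code runs without a server.</p>

```python
import json


def build_payload(prompt, model="qwen2.5", max_tokens=128, stream=True):
    """Request body for POST /api/generate with a per-request output cap."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": stream,
        "options": {"num_predict": max_tokens},  # cap on generated tokens
    }


def join_stream(lines):
    """Concatenate the 'response' fragments from newline-delimited JSON chunks."""
    parts = []
    for line in lines:
        if not line:
            continue  # skip blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final chunk is flagged with done=true
    return "".join(parts)


# Canned chunks shaped like a streamed reply (assumed sample data).
sample = [
    '{"response": "Hello", "done": false}',
    '{"response": ", world", "done": false}',
    '{"response": "", "done": true}',
]
print(join_stream(sample))
```

<p>Against a live server, the same function can consume <code>requests.post(url, json=build_payload(...), stream=True).iter_lines()</code>, since <code>json.loads</code> accepts the bytes that <code>iter_lines()</code> yields.</p>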
<hr>
<h2 id="方案二vllm推荐生产环境">Option 2: vLLM (recommended for production)</h2>
<p>vLLM is a high-throughput inference engine that originated at UC Berkeley.</p>
<h3 id="安装-1">Installation</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">pip install vllm
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="启动服务">Starting the server</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Start an OpenAI-compatible API server</span>
</span></span><span class="line"><span class="cl">python -m vllm.entrypoints.openai.api_server <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    --model Qwen/Qwen2.5-7B-Instruct <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    --port <span class="m">8000</span> <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    --tensor-parallel-size <span class="m">1</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="使用-api">Using the API</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">openai</span> <span class="kn">import</span> <span class="n">OpenAI</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">client</span> <span class="o">=</span> <span class="n">OpenAI</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">base_url</span><span class="o">=</span><span class="s1">&#39;http://localhost:8000/v1&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">api_key</span><span class="o">=</span><span class="s1">&#39;vllm&#39;</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">chat</span><span class="o">.</span><span class="n">completions</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">model</span><span class="o">=</span><span class="s1">&#39;Qwen/Qwen2.5-7B-Instruct&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
</span></span><span class="line"><span class="cl">        <span class="p">{</span><span class="s1">&#39;role&#39;</span><span class="p">:</span> <span class="s1">&#39;user&#39;</span><span class="p">,</span> <span class="s1">&#39;content&#39;</span><span class="p">:</span> <span class="s1">&#39;Explain what RAG is&#39;</span><span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">response</span><span class="o">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">message</span><span class="o">.</span><span class="n">content</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="高级特性">Advanced features</h3>
<p><strong>1. Multi-GPU parallelism</strong></p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Use 2 GPUs</span>
</span></span><span class="line"><span class="cl">python -m vllm.entrypoints.openai.api_server <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    --model meta-llama/Llama-3.1-70B-Instruct <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    --tensor-parallel-size <span class="m">2</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p><strong>2. Loading quantized models</strong></p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># AWQ quantization</span>
</span></span><span class="line"><span class="cl">python -m vllm.entrypoints.openai.api_server <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    --model TheBloke/Llama-2-7B-Chat-AWQ <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    --quantization awq
</span></span></code></pre></td></tr></table>
</div>
</div><p><strong>3. Batch processing</strong></p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">vllm</span> <span class="kn">import</span> <span class="n">LLM</span><span class="p">,</span> <span class="n">SamplingParams</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">llm</span> <span class="o">=</span> <span class="n">LLM</span><span class="p">(</span><span class="n">model</span><span class="o">=</span><span class="s2">&#34;Qwen/Qwen2.5-7B-Instruct&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">sampling_params</span> <span class="o">=</span> <span class="n">SamplingParams</span><span class="p">(</span><span class="n">temperature</span><span class="o">=</span><span class="mf">0.8</span><span class="p">,</span> <span class="n">max_tokens</span><span class="o">=</span><span class="mi">256</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Process several prompts in one batch</span>
</span></span><span class="line"><span class="cl"><span class="n">prompts</span> <span class="o">=</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;What is machine learning?&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;What is the difference between Python and Java?&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;How do I learn programming?&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">outputs</span> <span class="o">=</span> <span class="n">llm</span><span class="o">.</span><span class="n">generate</span><span class="p">(</span><span class="n">prompts</span><span class="p">,</span> <span class="n">sampling_params</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">output</span> <span class="ow">in</span> <span class="n">outputs</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="n">output</span><span class="o">.</span><span class="n">outputs</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">text</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
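</div><h3 id="vllm-优缺点">vLLM pros and cons</h3>
<p><strong>Pros</strong>:</p>
<ul>
<li>Very fast inference, thanks to PagedAttention</li>
<li>High throughput, well suited to production</li>
<li>Multi-GPU parallelism</li>
<li>Highly efficient memory use</li>
</ul>
<p><strong>Cons</strong>:</p>
<ul>
<li>More involved installation and configuration</li>
<li>Typically needs an NVIDIA GPU with a CUDA environment</li>
<li>No GPU acceleration on macOS</li>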
</ul>
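<p>The PagedAttention item in the pros list refers to managing the KV cache in fixed-size blocks that are handed out on demand, instead of reserving one large contiguous region per request. The toy allocator below illustrates only that bookkeeping idea. It is a simplified sketch, not vLLM's implementation: the class <code>PagedKVCache</code>, its methods, and the block size of 16 are all assumptions made for illustration.</p>

```python
class PagedKVCache:
    """Toy block allocator: each sequence gets fixed-size blocks on demand."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.lengths = {}       # sequence id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve room for one more token, taking a new block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:  # no block yet, or last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(20):
    cache.append_token("request-a")  # 20 tokens occupy 2 blocks
for _ in range(5):
    cache.append_token("request-b")  # 5 tokens occupy 1 block
print(len(cache.free_blocks))        # 5 blocks still free
cache.release("request-a")
print(len(cache.free_blocks))        # 7 blocks free after release
```

<p>Because a request only ever wastes part of its last block, many more concurrent requests fit in the same memory than if each reserved space for its maximum possible length up front.</p>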
<hr>
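<h2 id="方案三lm-studio推荐小白">Option 3: LM Studio (recommended for non-technical users)</h2>
<p>LM Studio is a GUI tool for running large models locally, suited to users who would rather not touch the command line.</p>
<h3 id="安装-2">Installation</h3>
<p>Download: https://lmstudio.ai/</p>
<p>Supports Windows, macOS, and Linux.</p>
<h3 id="使用方法">How to use it</h3>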
<ol>
<li>
<p><strong>Download a model</strong>:</p>
<ul>
<li>Open LM Studio</li>
<li>Search for the model you want (e.g., &ldquo;qwen2.5&rdquo;)</li>
<li>Click Download</li>
</ul>
</li>
<li>
<p><strong>Chat</strong>:</p>
<ul>
<li>Select a downloaded model</li>
<li>Talk to it directly in the chat interface</li>
</ul>
</li>
<li>
<p><strong>Start the local server</strong>:</p>
<ul>
<li>Click &ldquo;Local Server&rdquo;</li>
<li>Once it is running, the API is available at <code>localhost:1234</code></li>
</ul>
</li>
</ol>
<h3 id="api-调用">Calling the API</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">openai</span> <span class="kn">import</span> <span class="n">OpenAI</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">client</span> <span class="o">=</span> <span class="n">OpenAI</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">base_url</span><span class="o">=</span><span class="s1">&#39;http://localhost:1234/v1&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">api_key</span><span class="o">=</span><span class="s1">&#39;lm-studio&#39;</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">chat</span><span class="o">.</span><span class="n">completions</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">model</span><span class="o">=</span><span class="s1">&#39;local-model&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
</span></span><span class="line"><span class="cl">        <span class="p">{</span><span class="s1">&#39;role&#39;</span><span class="p">:</span> <span class="s1">&#39;user&#39;</span><span class="p">,</span> <span class="s1">&#39;content&#39;</span><span class="p">:</span> <span class="s1">&#39;Hello&#39;</span><span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">response</span><span class="o">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">message</span><span class="o">.</span><span class="n">content</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
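</div><h3 id="lm-studio-优缺点">LM Studio pros and cons</h3>
<p><strong>Pros</strong>:</p>
<ul>
<li>Graphical interface that is easy to use</li>
<li>Convenient model management</li>
<li>Supports a range of quantized models in GGUF format</li>
<li>Cross-platform</li>
</ul>
<p><strong>Cons</strong>:</p>
<ul>
<li>Lower performance than vLLM</li>
<li>Limited advanced features</li>
<li>Not suited to production deployment</li>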
</ul>
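<p>All three tools expose an OpenAI-compatible endpoint, so switching between them is mostly a matter of changing the base URL. The sketch below collects the ports and placeholder API keys used in the examples above into one lookup. The names <code>BACKENDS</code> and <code>client_kwargs</code> are hypothetical, and the ports are the defaults shown in this guide; adjust them if your setup differs.</p>

```python
# Base URLs and placeholder keys taken from the three examples in this guide.
BACKENDS = {
    "ollama": {"base_url": "http://localhost:11434/v1", "api_key": "ollama"},
    "vllm": {"base_url": "http://localhost:8000/v1", "api_key": "vllm"},
    "lmstudio": {"base_url": "http://localhost:1234/v1", "api_key": "lm-studio"},
}


def client_kwargs(backend):
    """Return keyword arguments for an OpenAI-style client for the given backend."""
    try:
        return dict(BACKENDS[backend])  # copy so callers cannot mutate the table
    except KeyError:
        raise ValueError(f"unknown backend: {backend!r}") from None


print(client_kwargs("vllm")["base_url"])
```

<p>With the <code>openai</code> package installed, <code>OpenAI(**client_kwargs("ollama"))</code> would then build a client for whichever backend is running, leaving the rest of the calling code unchanged.</p>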
<hr>
<h2 id="推荐模型">Recommended models</h2>
<h3 id="通用对话">General chat</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Model</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Highlights</th>
          <th style="text-align: left">Download command (Ollama)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Qwen2.5</td>
          <td style="text-align: left">7B</td>
          <td style="text-align: left">Strong Chinese-language performance</td>
          <td style="text-align: left"><code>ollama run qwen2.5</code></td>
      </tr>
      <tr>
          <td style="text-align: left">Llama 3.1</td>
          <td style="text-align: left">8B</td>
          <td style="text-align: left">Well balanced overall</td>
          <td style="text-align: left"><code>ollama run llama3.1</code></td>
      </tr>
      <tr>
          <td style="text-align: left">Mistral</td>
          <td style="text-align: left">7B</td>
          <td style="text-align: left">From France's Mistral AI; strong reasoning</td>
          <td style="text-align: left"><code>ollama run mistral</code></td>
      </tr>
  </tbody>
</table>
<h3 id="代码生成">Code generation</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Model</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Highlights</th>
          <th style="text-align: left">Download command (Ollama)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">DeepSeek Coder V2</td>
          <td style="text-align: left">16B</td>
          <td style="text-align: left">Specialized for code</td>
          <td style="text-align: left"><code>ollama run deepseek-coder-v2</code></td>
      </tr>
      <tr>
          <td style="text-align: left">CodeLlama</td>
          <td style="text-align: left">7B-34B</td>
          <td style="text-align: left">From Meta</td>
          <td style="text-align: left"><code>ollama run codellama</code></td>
      </tr>
      <tr>
          <td style="text-align: left">Qwen2.5-Coder</td>
          <td style="text-align: left">7B</td>
          <td style="text-align: left">Code plus Chinese</td>
          <td style="text-align: left"><code>ollama run qwen2.5-coder</code></td>
      </tr>
  </tbody>
</table>
<h3 id="长文本处理">Long-Context Processing</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Model</th>
          <th style="text-align: left">Context</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Qwen2.5</td>
          <td style="text-align: left">128K</td>
          <td style="text-align: left">Very long context</td>
      </tr>
      <tr>
          <td style="text-align: left">Yi-34B-200K</td>
          <td style="text-align: left">200K</td>
          <td style="text-align: left">Chinese-developed long-context model</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="性能优化技巧">Performance Tuning Tips</h2>
<h3 id="1-使用量化模型">1. Use Quantized Models</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Q4 quantization (fast, slight accuracy loss)</span>
</span></span><span class="line"><span class="cl">ollama run qwen2.5:7b-instruct-q4_K_M
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Q5 quantization (balanced choice)</span>
</span></span><span class="line"><span class="cl">ollama run qwen2.5:7b-instruct-q5_K_M
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Q8 quantization (higher accuracy, more VRAM)</span>
</span></span><span class="line"><span class="cl">ollama run qwen2.5:7b-instruct-q8_0
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="2-调整上下文长度">2. Adjust the Context Length</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Ollama: inside the session, run /set parameter num_ctx 8192 (ollama run has no --num-ctx flag)</span>
</span></span><span class="line"><span class="cl">ollama run qwen2.5
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># vLLM</span>
</span></span><span class="line"><span class="cl">python -m vllm.entrypoints.openai.api_server <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    --model Qwen/Qwen2.5-7B-Instruct <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    --max-model-len <span class="m">8192</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="3-启用-flash-attention">3. Enable Flash Attention</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># vLLM selects a FlashAttention backend automatically on supported GPUs</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Install the standalone flash-attn package only if your environment needs it</span>
</span></span><span class="line"><span class="cl">pip install flash-attn
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="4-使用-gpu-offloading">4. Use GPU Offloading</h3>
<p>When VRAM is insufficient:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Ollama splits layers between GPU and CPU automatically</span>
</span></span><span class="line"><span class="cl"><span class="c1"># To override, inside the session run e.g. /set parameter num_gpu 40 (number of layers placed on the GPU)</span>
</span></span><span class="line"><span class="cl">ollama run llama3.1:70b
</span></span></code></pre></td></tr></table>
</div>
</div><hr>
<h2 id="常见问题">FAQ</h2>
<h3 id="q-显存不够怎么办">Q: What if I run out of VRAM?</h3>
<p>A:</p>
<ol>
<li>Use a smaller model (7B instead of 13B)</li>
<li>Use a quantized build (Q4/Q5)</li>
<li>Reduce the context length</li>
<li>Fall back to CPU inference (much slower)</li>
</ol>
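<p>To see how the first three options trade off, here is a rough back-of-the-envelope sketch in Python. The function names, the nominal bits per weight, and the architecture numbers (28 layers, 4 KV heads, head dimension 128) are illustrative assumptions for a 7B-class model, not measured values; real usage is higher once runtime overhead is included.</p>

```python
# Rough memory estimate for a local model: weights plus KV cache.
# All numbers are approximations; real usage adds runtime overhead.

GIB = 1024 ** 3


def weight_gib(params_billion, bits_per_weight):
    """Approximate size of the model weights in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / GIB


def kv_cache_gib(context_len, layers, kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV-cache size in GiB for one sequence.

    The factor 2 accounts for storing both keys and values.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * context_len / GIB


if __name__ == "__main__":
    # Assumed nominal bits per weight for each format (illustrative).
    formats = {"FP16": 16, "Q8": 8, "Q5": 5, "Q4": 4}
    for name, bits in formats.items():
        print(f"7B {name}: ~{weight_gib(7, bits):.1f} GiB of weights")

    # Assumed architecture for a 7B-class model with grouped-query attention.
    for ctx in (8192, 32768):
        size = kv_cache_gib(ctx, layers=28, kv_heads=4, head_dim=128)
        print(f"context {ctx}: ~{size:.2f} GiB of KV cache")
```

<p>Under these assumptions, a 7B model needs about 13 GiB for FP16 weights and roughly 3.3 GiB at 4 bits per weight, and cutting the context from 32K to 8K shrinks the KV cache by a factor of four. Actual quantized files tend to be somewhat larger than the nominal bit width suggests, because some tensors are kept at higher precision.</p>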
<h3 id="q-mac-能跑吗">Q: Can it run on a Mac?</h3>
<p>A: Yes. Apple Silicon (M1/M2/M3 and later) runs local models well.</p>
<ul>
<li>Ollama: native support</li>
<li>LM Studio: native support</li>
<li>vLLM: no Mac GPU support</li>
</ul>
<h3 id="q-生成速度太慢">Q: What if generation is too slow?</h3>
<p>A:</p>
<ol>
<li>Check that inference is running on the GPU rather than the CPU</li>
<li>Use a quantized model</li>
<li>Reduce the output length</li>
<li>Move to vLLM</li>
</ol>
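<p>Before changing anything, it helps to measure. With Ollama, <code>ollama ps</code> shows whether a loaded model is running on the GPU or the CPU, and the final JSON of a non-streaming <code>/api/generate</code> call reports <code>eval_count</code> and <code>eval_duration</code> (in nanoseconds). Below is a minimal Python sketch that turns those two fields into tokens per second; the helper name and the sample values are hypothetical.</p>

```python
# Compute generation speed from the timing fields of an Ollama response.


def tokens_per_second(response):
    """Return generated tokens per second, or None if timing data is missing.

    Expects the eval_count (generated tokens) and eval_duration
    (nanoseconds) fields from a non-streaming /api/generate response.
    """
    count = response.get("eval_count")
    duration_ns = response.get("eval_duration")
    if not count or not duration_ns:
        return None
    return count / (duration_ns / 1e9)


if __name__ == "__main__":
    # Hypothetical response values, for illustration only.
    sample = {"eval_count": 256, "eval_duration": 8_000_000_000}
    print(f"{tokens_per_second(sample):.1f} tokens/s")
```

<p>Running the same prompt before and after each change (GPU versus CPU, a different quantization, a shorter output) gives a like-for-like comparison.</p>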
<h3 id="q-如何让多人同时使用">Q: How can several people use it at once?</h3>
<p>A:</p>
<ol>
<li>Deploy on a server</li>
<li>Use vLLM (built for high concurrency)</li>
<li>Put Nginx in front as a load balancer</li>
</ol>
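<p>A quick way to check how a server behaves with several users is to send requests in parallel. The sketch below is one possible approach using only the Python standard library; <code>run_concurrently</code> and <code>fake_send</code> are hypothetical names, and <code>fake_send</code> merely sleeps in place of a real HTTP call.</p>

```python
# Fan several prompts out concurrently to check multi-user behaviour.
from concurrent.futures import ThreadPoolExecutor
import time


def run_concurrently(prompts, send, max_workers=8):
    """Call send(prompt) for each prompt in parallel; keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))


def fake_send(prompt):
    """Stand-in for a real HTTP call; sleeps to mimic server latency."""
    time.sleep(0.1)
    return f"echo: {prompt}"


if __name__ == "__main__":
    prompts = [f"question {i}" for i in range(8)]
    start = time.perf_counter()
    replies = run_concurrently(prompts, fake_send)
    elapsed = time.perf_counter() - start
    print(f"{len(replies)} replies in {elapsed:.2f}s")
```

<p>To test a real deployment, replace <code>fake_send</code> with a function that posts to your server's OpenAI-compatible chat endpoint, then compare the total time against running the same prompts one by one.</p>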
<hr>
<h2 id="总结">Summary</h2>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Option</th>
          <th style="text-align: left">Best for</th>
          <th style="text-align: left">Difficulty</th>
          <th style="text-align: left">Performance</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Ollama</strong></td>
          <td style="text-align: left">Developers, personal use</td>
          <td style="text-align: left">⭐</td>
          <td style="text-align: left">⭐⭐⭐</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>vLLM</strong></td>
          <td style="text-align: left">Teams, production environments</td>
          <td style="text-align: left">⭐⭐⭐</td>
          <td style="text-align: left">⭐⭐⭐⭐⭐</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>LM Studio</strong></td>
          <td style="text-align: left">Beginners, first-time experimenters</td>
          <td style="text-align: left">⭐</td>
          <td style="text-align: left">⭐⭐</td>
      </tr>
  </tbody>
</table>
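<p><strong>My recommendations</strong>:</p>
<ul>
<li>Just getting started → try LM Studio first</li>
<li>Day-to-day development → Ollama is enough</li>
<li>Production deployment → go all in on vLLM</li>
</ul>
<p>The era of local LLMs is here. Embrace it!</p>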
]]></content:encoded></item></channel></rss>