<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>大模型 on Chico's Tech Blog</title><link>https://realtime-ai.chat/tags/%E5%A4%A7%E6%A8%A1%E5%9E%8B/</link><description>Recent content in 大模型 on Chico's Tech Blog</description><image><title>Chico's Tech Blog</title><url>https://github.com/chicogong.png</url><link>https://github.com/chicogong.png</link></image><generator>Hugo</generator><language>zh-cn</language><lastBuildDate>Mon, 18 May 2026 10:00:00 +0800</lastBuildDate><atom:link href="https://realtime-ai.chat/tags/%E5%A4%A7%E6%A8%A1%E5%9E%8B/index.xml" rel="self" type="application/rss+xml"/><item><title>Choosing an LLM in 2026: Don't Ask "Which Is Strongest", Ask "Which Is Good Enough"</title><link>https://realtime-ai.chat/posts/llm-selection-2026/</link><pubDate>Mon, 18 May 2026 10:00:00 +0800</pubDate><guid>https://realtime-ai.chat/posts/llm-selection-2026/</guid><description>In 2026, LLM selection should not be driven by benchmark rankings. This post offers a scenario-based decision framework covering capability tiers, inference cost, latency, context length, closed vs. open source, and private deployment, with a decision flowchart.</description><content:encoded><![CDATA[<p>Last year one of our internal projects used Claude Opus for intent classification: one user utterance in, one of three labels out. Two weeks after launch someone looked at the bill and froze. A 14B open-source model running on our own GPUs would have been only a few points less accurate on this task, at a cost several dozen times lower.</p>
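<p>This is the most common selection mistake of 2026: <strong>treating &quot;which model is strongest&quot; as &quot;which model should I use&quot;.</strong></p>
<p>These are two entirely different questions. Leaderboards such as GPQA, SWE-bench, and ARC-AGI-2 tell you where the ceiling is, and most of your production requests sit nowhere near it. Classification, summarization, a one-off structured extraction: for jobs like these, a flagship model is an anti-aircraft gun aimed at a mosquito. Selection is not about picking the strongest model; it is about giving <strong>each class of task</strong> the cheapest model that is &quot;just good enough&quot;.</p>
<p>This post does not rank models. It gives you a decision framework broken down by scenario.</p>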
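<h2 id="先认清2026-年的模型是分梯队的">First, accept it: models in 2026 come in tiers</h2>
<p>As of May 2026 the frontier looks roughly like this. Memorizing exact version numbers is pointless, since they bump every two or three months; remember the <strong>tiers</strong> instead:</p>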
<table>
  <thead>
      <tr>
          <th>Tier</th>
          <th>Representative models (2026.05)</th>
          <th>Typical API price (input / output, USD per million tokens)</th>
          <th>Best used for</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Flagship</td>
          <td>GPT-5.5、Claude Opus 4.7、Gemini 3.1 Pro</td>
          <td>around $5 / $25</td>
          <td>Complex reasoning, agent orchestration, hard coding tasks</td>
      </tr>
      <tr>
          <td>Workhorse</td>
          <td>Claude Sonnet 4.6、Gemini 3 Flash、DeepSeek V4-Pro</td>
          <td>around $1–3 / $3–15</td>
          <td>The vast majority of production tasks</td>
      </tr>
      <tr>
          <td>Fast and cheap</td>
          <td>Claude Haiku 4.5、Gemini 3 Flash-Lite、DeepSeek V4-Flash</td>
          <td>around $0.1–1 / $0.3–5</td>
          <td>Classification, extraction, routing, simple question answering</td>
      </tr>
  </tbody>
</table>
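<p>The table hides a key fact: <strong>between flagship and &quot;fast and cheap&quot;, output prices differ by tens of times or more.</strong> DeepSeek V4-Flash output costs about $0.28 per million tokens and GPT-5.5 costs $30, a gap of more than 100x. That gap is not a rounding error; it directly determines whether your product can scale.</p>
<p>Meanwhile, the <strong>capability</strong> gap between tiers has been shrinking over the past two years. In 2024 flagship and workhorse models felt like different species; in 2026, on many concrete tasks, a workhorse model trails the flagship by only a few percentage points, and sometimes you cannot measure the difference at all. Capability is converging while prices stay far apart, which is the fundamental reason &quot;selecting by tier&quot; saves money.</p>
<p>Hence the first principle: <strong>start from the workhorse tier by default, and move up only when it demonstrably cannot cope.</strong> Do not go the other way and trim down from the flagship, because then you never learn whether the tier below was already sufficient.</p>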
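<h2 id="维度一能力够不够要按任务类型问">Dimension 1: ask whether capability is sufficient per &quot;task type&quot;</h2>
<p>&quot;Good enough&quot; is not a vague feeling; it can be decomposed. Sort your tasks into roughly three classes:</p>
<p><strong>Deterministic tasks</strong>: classification, entity extraction, format conversion, sensitive-word filtering. These have reference answers, so correctness is measurable. The conclusion is direct: <strong>use the fast-and-cheap tier, or even a small open-source model.</strong> A flagship has no advantage here; its extra &quot;IQ&quot; has nowhere to go on a three-way classification question. The costly misstep I opened with belongs to this class.</p>
<p><strong>Generation and rewriting tasks</strong>: copywriting, summarization, customer-service scripts, translation. There is no single right answer, but quality matters. The workhorse tier is the sweet spot. One thing worth noting: the Claude family reads noticeably more natural in medium and long-form writing, and can sustain outputs of well over 100K tokens without falling apart; if the core of your product is &quot;writing like a human&quot;, that difference is worth the extra money.</p>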
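<p><strong>Reasoning and agent tasks</strong>: multi-step coding, tool calling, long-horizon planning, &quot;figure out a way to get it done&quot;. This is the one place in 2026 that <strong>truly needs a flagship</strong>. An agent runs twenty or thirty consecutive steps, small errors accumulate at each step, and one wrong judgment in the middle wastes everything after it. Here the flagship's few extra points, amplified over the whole chain, are the difference between &quot;completes&quot; and &quot;does not complete&quot;. As an illustration, assuming independent steps, a 99% per-step success rate over 25 steps gives roughly 78% end to end, while 97% gives roughly 47%. The GPT-5.5 and Claude Opus 4.7 class is expensive for a reason, provided your task really is an agent and not a one-shot question dressed up as one.</p>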
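<p>A practical recommendation: <strong>do not make one model carry every task.</strong> The mature approach is to route by task: a cheap model handles triage and simple work, and only the hard cases are handed to the flagship. This costs far less than &quot;flagship everywhere&quot; and is more reliable than &quot;cheap everywhere&quot;.</p>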
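<p>The task-based routing described above can be sketched as a simple lookup. This is a minimal illustration, not a production router: the task labels, the tier names, and the <code>pick_tier</code> helper are all hypothetical, and a real system would likely classify the incoming request with a cheap model first.</p>

```python
# Hypothetical task labels mapped to the three tiers discussed in this post.
TIER_BY_TASK = {
    "classification": "fast",
    "extraction": "fast",
    "routing": "fast",
    "summarization": "workhorse",
    "copywriting": "workhorse",
    "translation": "workhorse",
    "agent": "flagship",
    "multi_step_code": "flagship",
}


def pick_tier(task_type, default="workhorse"):
    """Return the cheapest tier assumed adequate for a task type.

    Unknown task types fall back to the workhorse tier, mirroring the
    "start from the workhorse tier" principle.
    """
    return TIER_BY_TASK.get(task_type, default)


if __name__ == "__main__":
    for task in ("classification", "summarization", "agent", "unknown"):
        print(f"{task}: {pick_tier(task)}")
```

<p>Keeping the mapping in data rather than in branching code makes it easy to move a task up one tier when evaluation shows the cheaper tier is not holding up.</p>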
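<h2 id="维度二成本不是单价是单价--调用量--输出长度">Dimension 2: cost is not unit price, it is &quot;unit price × call volume × output length&quot;</h2>
<p>Many people glance only at the &quot;price per million tokens&quot; on the pricing page. That is not enough. The real bill is the product of three numbers:</p>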
<ul>
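<li><strong>Unit price</strong>: especially the <strong>output</strong> price, which is typically 3 to 5 times the input price; agent-style tasks also have a high share of output tokens.</li>
<li><strong>Call volume</strong>: one thousand calls a day versus ten million is a difference of four orders of magnitude.</li>
<li><strong>Average output length</strong>: letting the model &quot;think before answering&quot; (reasoning) improves quality, but the chain of thought is itself billed as tokens.</li>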
</ul>
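<p>Multiply the three and you often reach a counterintuitive conclusion. Take a customer-service bot with tens of thousands of daily active users: most conversations are &quot;track my parcel&quot; or &quot;change my address&quot;, and genuinely complex inquiries may be only 5%. Running the flagship throughout means paying flagship prices on the 95% of simple traffic for the sake of the 5% that is complex. Move that 95% to the workhorse or fast-and-cheap tier and the monthly cost can drop by 70 to 80 percent, with users noticing nothing.</p>
<p>Two nearly free cost levers are often forgotten; be sure to use both:</p>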
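<p>To make the three-way product concrete, here is a small worked calculation. All numbers are assumptions chosen for illustration (100,000 calls per day, 500 input and 200 output tokens per call, flagship at $5 / $25 and workhorse at $1 / $5 per million tokens); they are not measurements from a real system.</p>

```python
def monthly_cost(calls_per_day, in_tokens, out_tokens, in_price, out_price, days=30):
    """Dollars per month; prices are USD per million tokens."""
    per_call = (in_tokens * in_price + out_tokens * out_price) / 1_000_000
    return calls_per_day * days * per_call


# Hypothetical workload: 100,000 calls/day, 500 input and 200 output tokens each.
CALLS = 100_000
COMPLEX_CALLS = CALLS * 5 // 100      # the 5% that need the flagship
SIMPLE_CALLS = CALLS - COMPLEX_CALLS  # the 95% of routine traffic

FLAGSHIP = {"in_price": 5.0, "out_price": 25.0}   # assumed prices
WORKHORSE = {"in_price": 1.0, "out_price": 5.0}   # assumed prices

all_flagship = monthly_cost(CALLS, 500, 200, **FLAGSHIP)
routed = (monthly_cost(COMPLEX_CALLS, 500, 200, **FLAGSHIP)
          + monthly_cost(SIMPLE_CALLS, 500, 200, **WORKHORSE))
saving = 1 - routed / all_flagship

print(f"all flagship: ${all_flagship:,.0f}/month")
print(f"routed:       ${routed:,.0f}/month")
print(f"saving:       {saving:.0%}")
```

<p>Under these assumed prices the routed setup comes out roughly 76% cheaper, which is in the 70 to 80 percent range mentioned above; with your own traffic mix and price sheet the figure will differ.</p>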
<ul>
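<li><strong>Prompt caching</strong>: cache the fixed prefix (system prompt, long documents, few-shot examples); on a cache hit that portion of the input is about 90% cheaper. The payoff is large for multi-turn chat, RAG, and batch jobs that share a template.</li>
<li><strong>Batch processing</strong>: send tasks that do not need a real-time response through the batch API, which is commonly half price. Offline labeling, nightly reports, and content moderation have no reason not to use it.</li>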
</ul>
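<p>Remember: <strong>the money these two levers save is often more than you would save by switching to a &quot;cheaper model&quot;,</strong> because what they remove is structural waste.</p>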
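<p>The effect of the two discounts can be estimated with a few lines of arithmetic. This sketch takes the &quot;about 90% cheaper on a cache hit&quot; and &quot;commonly half price&quot; figures from the text as defaults; the function names are hypothetical, and it ignores any surcharge a provider may apply when writing to the cache.</p>

```python
def effective_input_price(base_price, cached_fraction, cache_discount=0.90):
    """Blend cached and uncached input tokens into one per-million price.

    cached_fraction is the share of input tokens served from the cache (0..1).
    """
    cached = cached_fraction * base_price * (1 - cache_discount)
    uncached = (1 - cached_fraction) * base_price
    return cached + uncached


def batch_price(price, batch_discount=0.50):
    """Price after the assumed batch-API discount."""
    return price * (1 - batch_discount)


if __name__ == "__main__":
    # Assumed example: $3 per million input tokens, 80% of the prompt is a cached prefix.
    print(f"with caching: ${effective_input_price(3.0, 0.8):.2f} per million input tokens")
    print(f"with batch:   ${batch_price(3.0):.2f} per million input tokens")
```

<p>The larger the fixed prefix relative to the variable part of the prompt, the closer the effective input price gets to the fully discounted rate.</p>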
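<h2 id="维度三延迟上下文被场景一票否决的硬约束">Dimension 3: latency and context, hard constraints that can veto a scenario outright</h2>
<p>Some dimensions do not take part in the &quot;value for money&quot; trade-off. They are <strong>thresholds</strong>: a model that fails one is out, no matter how strong or cheap it is.</p>
<p><strong>Latency.</strong> If you are building real-time voice conversation, the budget from the user finishing a sentence to the AI speaking is only a few hundred milliseconds (I broke this down in the <a href="../voice-technology/voice-latency-budget/">previous post</a>). In that scenario you watch <strong>time to first token (TTFT)</strong>, not how clever the model is. A flagship that is half a beat slow loses on experience to a fast workhorse model. Conversely, for offline batch work latency is not a consideration at all, and any premium paid for &quot;fast&quot; is pure waste.</p>
<p><strong>Context length.</strong> Long context is no longer scarce in 2026: Gemini 3.1 Pro and DeepSeek V4 both offer 1M-token windows, and Llama 4 has brought 10M to the open-source world. But <strong>having a window is not the same as using it well</strong>. Dump 500K tokens in at once and the model's attention to the middle sections drops noticeably; the effect known as &quot;lost in the middle&quot; has not gone away. So long context is a binary eligibility question: if a single task really needs a whole book or a large codebase in the prompt, a 1M window is a hard requirement; if your inputs are only a few thousand tokens, agonizing over whose window is larger is pointless, and <strong>the effort belongs in RAG retrieval quality, not in the model's window size.</strong></p>
<p>The test is simple: <strong>first ask &quot;can this scenario tolerate X&quot;; if not, strike out that batch of models, then compare value for money among the survivors.</strong> Do not mix hard constraints and soft preferences into one score.</p>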
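<p>The &quot;filter first, then compare&quot; rule can be sketched as follows. The candidate list, its latency and context figures, and the <code>shortlist</code> helper are made up for illustration; the point is only that hard constraints remove candidates before any cost ranking happens.</p>

```python
# Hypothetical candidates; the figures are placeholders, not benchmarks.
CANDIDATES = [
    {"name": "model-a", "ttft_ms": 900, "context": 1_000_000, "cost": 30.0},
    {"name": "model-b", "ttft_ms": 300, "context": 200_000, "cost": 5.0},
    {"name": "model-c", "ttft_ms": 150, "context": 128_000, "cost": 0.5},
]


def shortlist(candidates, max_ttft_ms=None, min_context=None):
    """Drop models that violate a hard constraint, then sort survivors by cost."""
    survivors = []
    for model in candidates:
        if max_ttft_ms is not None and model["ttft_ms"] > max_ttft_ms:
            continue  # too slow to first token: vetoed regardless of quality
        if min_context is not None and min_context > model["context"]:
            continue  # window too small for the task: vetoed
        survivors.append(model)
    return sorted(survivors, key=lambda model: model["cost"])


if __name__ == "__main__":
    for model in shortlist(CANDIDATES, max_ttft_ms=400):
        print(model["name"], model["cost"])
```

<p>Passing <code>None</code> for a constraint means the scenario does not care about it, so that dimension removes nothing.</p>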
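<h2 id="维度四闭源还是开源2026-年这道题变简单了">Dimension 4: closed or open source, a question that got simpler in 2026</h2>
<p>Two years ago this was a hard choice, because open-source models really were a step behind. In 2026 that has changed.</p>
<p>DeepSeek V4-Pro reaches just over 80% on SWE-bench Verified, within a fraction of a point of the top closed models, under an MIT license. Qwen 3.5 / 3.6 and Llama 4 are also closing in on the frontier in their respective areas. <strong>The capability gap between open and closed models is now measured in a few points on individual benchmarks, no longer in &quot;a generation behind&quot;.</strong> At the same time, mainstream open-source models now ship with official quantized builds (Q4/Q5/Q8) at release, which lowers the deployment barrier considerably.</p>
<p>So the deciding criterion has shifted from &quot;which is stronger&quot; to something else:</p>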
<ul>
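<li>Choose a <strong>closed API</strong> when what you want is peace of mind: no GPUs to touch, no scaling to manage, the newest and strongest models, and someone accountable when things break. Most zero-to-one products should take this path; your energy belongs in the product, not in operating an inference cluster.</li>
<li>Choose <strong>open source</strong> when at least one of three reasons applies: your volume is large enough that the marginal cost of self-hosting undercuts the closed API, you need deep fine-tuning to build domain knowledge into the model, or your data cannot leave the premises (covered in the next section).</li>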
</ul>
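<p>One more point that is easy to overlook: <strong>open source is an insurance policy.</strong> With a closed API you are tied to the vendor's pricing, rate limits, and deprecation schedule; when a version is retired, you migrate overnight. Keeping part of your load on open-source models you control hedges that vendor risk.</p>
<h2 id="维度五要不要私有化部署这题先于选模型">Dimension 5: whether to deploy privately, a question that comes before model choice</h2>
<p>If your data is medical records, bank statements, unpublished financials, or core source code, <strong>this dimension overrides every conclusion above.</strong> It is not a value-for-money dimension; it is a legal and trust red line.</p>
<p>To assess the need for private deployment, ask three questions:</p>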
<ol>
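<li><strong>Can the data leave your network?</strong> For regulated healthcare, finance, and government workloads the answer is often &quot;no&quot;.</li>
<li><strong>Does compliance require a closed audit loop?</strong> Most obligations of the EU AI Act are scheduled to apply from August 2026, and high-risk systems must be traceable, explainable, and subject to human oversight. These are hard to demonstrate from behind a black-box API.</li>
<li><strong>Is there a hard data-sovereignty constraint?</strong> Some industries and regions require inference to run entirely within the country and on self-owned infrastructure.</li>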
</ol>
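<p>If even one answer points to &quot;we must control it ourselves&quot;, you can <strong>only choose open-source models that can be deployed privately</strong>, such as Qwen, Llama, or DeepSeek: download the weights and run them in your own VPC or data center. At that point &quot;GPT-5.5 is stronger&quot; is a true but useless statement, because it is not in your candidate set.</p>
<p>A caution: private deployment is not a synonym for &quot;saving money&quot;. Once you count GPU purchase or rental, operations, scaling, and security hardening, <strong>it is often more expensive than an API.</strong> The reasons to choose it are control and compliance, not cost. If you have no hard compliance constraint and not enough volume to justify an inference cluster, yet self-host because it &quot;feels safer&quot;, you are most likely digging a hole for yourself.</p>
<h2 id="把这套框架连起来">Putting the framework together</h2>
<p>Selection is not picking first place off a leaderboard; it is walking your scenario through a series of gates:</p>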
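<p>Whether self-hosting is cheaper than an API comes down to a break-even volume. The sketch below is a rough estimate under stated assumptions: a fixed monthly cost for GPUs and operations, a per-million-token API price, and a small marginal self-hosting cost. The function name and every number in the example are hypothetical.</p>

```python
def breakeven_million_tokens(fixed_monthly_cost, api_price_per_m, self_host_price_per_m=0.0):
    """Monthly volume (in millions of tokens) above which self-hosting is cheaper.

    Returns None when the API is never more expensive per token, in which
    case self-hosting cannot pay back its fixed cost.
    """
    margin = api_price_per_m - self_host_price_per_m
    if not margin > 0:
        return None
    return fixed_monthly_cost / margin


if __name__ == "__main__":
    # Assumed: $6,000/month for GPUs and operations, a blended API price of
    # $3 per million tokens, and $0.20 per million tokens of marginal cost.
    volume = breakeven_million_tokens(6000.0, 3.0, 0.2)
    print(f"break-even at about {volume:,.0f} million tokens per month")
```

<p>If your realistic monthly volume sits well under the break-even figure, cost is not a reason to self-host, and only control or compliance can justify it.</p>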
<pre class="mermaid">flowchart TD
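  A[A concrete task] --> B{Can the data leave your network?}
  B -- No / strict compliance --> P[Private deployment<br/>Open-source models: Qwen / Llama / DeepSeek]
  B -- Yes --> C{Task type?}
  C -- Deterministic<br/>classification · extraction · routing --> D[Fast-and-cheap tier<br/>Haiku / Flash-Lite / small open-source model]
  C -- Generation and rewriting<br/>copy · summaries · translation --> E[Workhorse tier<br/>Sonnet / Flash / DeepSeek V4-Pro]
  C -- Reasoning and agents<br/>multi-step · tool use · planning --> F[Flagship tier<br/>GPT-5.5 / Opus 4.7 / Gemini 3.1 Pro]
  D --> G{Hard latency or context constraints?}
  E --> G
  F --> G
  G -- Yes --> H[Re-filter among models<br/>that satisfy the constraints]
  G -- No --> I[Decide by scale<br/>closed API or self-hosted open source]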
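</pre><p>Note a few details. <strong>The first gate is data compliance, not capability</strong>: compliance is a veto, so it goes first, sparing you a long value-for-money comparison that ends with discovering the model cannot be used at all. Task type determines the <strong>tier</strong>, not the specific model: model versions change every quarter, while the tier logic is far more stable. Latency and context are <strong>filters</strong>, not scoring items: they only strike out what does not qualify. Closed versus open source comes last, and that step is <strong>driven mainly by call volume</strong>: use an API at low volume, and consider migrating once volume is large enough that self-hosting is cheaper.</p>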
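<p>The gates in the flowchart can also be written as a small function, which makes the ordering explicit. This is a sketch of the decision order only; the argument names, the three task labels, and the returned strings are illustrative choices rather than anything prescribed by the framework.</p>

```python
TIER_BY_TASK = {
    "deterministic": "fast-and-cheap tier",
    "generation": "workhorse tier",
    "agent": "flagship tier",
}


def recommend(data_can_leave, task_type, has_hard_constraints=False, high_volume=False):
    """Walk one task through the gates in the same order as the flowchart."""
    # Gate 1: compliance is a veto and is checked before anything else.
    if not data_can_leave:
        return "private deployment: open-weight model (Qwen / Llama / DeepSeek)"

    # Gate 2: the task type picks a tier, not a specific model.
    tier = TIER_BY_TASK[task_type]

    # Gate 3: latency and context act as filters, not as scores.
    if has_hard_constraints:
        return tier + ", re-filtered to models that meet the latency/context constraints"

    # Gate 4: hosting is decided last, mainly by call volume.
    hosting = "self-hosted open-weight model" if high_volume else "closed API"
    return tier + ", " + hosting


if __name__ == "__main__":
    print(recommend(True, "deterministic"))
    print(recommend(True, "agent", high_volume=True))
    print(recommend(False, "generation"))
```

<p>Because compliance is checked first, a task whose data cannot leave the network never reaches the tier comparison at all.</p>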
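<h2 id="最后">Closing</h2>
<p>What has been scarce in the 2026 LLM landscape was never good models; it is &quot;knowing what you actually need&quot;.</p>
<p>Leaderboards are updated daily and rankings are argued over daily, but your customer-service bot may need nothing more than a stable, cheap, fast-enough workhorse model; your coding agent is the one that genuinely benefits from the flagship's extra few points; and your compliance system is not even part of the public leaderboard conversation.</p>
<p>Set aside &quot;which is strongest&quot; and replace it with a list of concrete questions: What type of task is this? How much latency can it tolerate? How many calls per day? Can the data leave the premises? Once those are answered, the choice is largely clear.</p>
<p>Nine-tenths of the work in model selection is understanding your requirements; one-tenth is looking at models. Do not reverse the order.</p>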
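]]></content:encoded></item><item><title>The Complete Guide to Local LLM Deployment: Ollama + vLLM + LM Studio in Practice</title><link>https://realtime-ai.chat/posts/local-llm-deployment/</link><pubDate>Tue, 23 Dec 2025 10:00:00 +0800</pubDate><guid>https://realtime-ai.chat/posts/local-llm-deployment/</guid><description>A complete guide to deploying LLMs locally: a hands-on comparison of Ollama, vLLM, and LM Studio, balancing privacy, performance, and cost.</description><content:encoded><![CDATA[<h2 id="为什么要本地部署">Why deploy locally?</h2>
<p>In 2025, with cloud APIs everywhere, why would anyone still deploy an LLM locally?</p>
<h3 id="理由1隐私安全">Reason 1: privacy and security</h3>
<p>Your code, documents, and chat history are all sent to the cloud.</p>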
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">Sensitive scenarios:
</span></span><span class="line"><span class="cl">- Internal company code → send it to OpenAI?
</span></span><span class="line"><span class="cl">- Medical records → send them to the cloud?
</span></span><span class="line"><span class="cl">- Legal contract text → who guarantees it will not leak?
</span></span></code></pre></td></tr></table>
</div>
</div><p>Local deployment means your data never leaves your machine.</p>
<h3 id="理由2成本控制">Reason 2: cost control</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Usage scenario</th>
          <th style="text-align: left">Cloud API cost</th>
          <th style="text-align: left">Local deployment cost</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">10,000 calls per day</td>
          <td style="text-align: left">~$300/month</td>
          <td style="text-align: left">Electricity ~$30/month</td>
      </tr>
      <tr>
          <td style="text-align: left">Long-term use of a 7B model</td>
          <td style="text-align: left">Ongoing fees</td>
          <td style="text-align: left">One-time hardware investment</td>
      </tr>
      <tr>
          <td style="text-align: left">Team of 10 users</td>
          <td style="text-align: left">$200+/user/month</td>
          <td style="text-align: left">One shared server</td>
      </tr>
  </tbody>
</table>
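<p>The first row of the table implies a payback period for the hardware. A minimal sketch, using the table's ~$300 and ~$30 monthly figures and an assumed hardware price of $2,700 (a hypothetical number, not from the table):</p>

```python
def payback_months(hardware_cost, cloud_monthly, local_monthly):
    """Months until the one-time hardware cost is recovered by monthly savings.

    Returns None if running locally does not save money each month.
    """
    monthly_saving = cloud_monthly - local_monthly
    if not monthly_saving > 0:
        return None
    return hardware_cost / monthly_saving


if __name__ == "__main__":
    # $300/month cloud vs. $30/month electricity, as in the table above;
    # the $2,700 hardware price is an assumption for illustration.
    months = payback_months(2700, 300, 30)
    print(f"hardware pays for itself after about {months:.0f} months")
```

<p>The estimate leaves out maintenance time and hardware depreciation, both of which lengthen the real payback period.</p>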
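<h3 id="理由3低延迟">Reason 3: low latency</h3>
<p>Cloud API: 100-500 ms of network round trip on top of inference time;
local deployment: no network round trip, so latency depends only on local inference speed.</p>
<h3 id="理由4自由定制">Reason 4: freedom to customize</h3>
<ul>
<li>Want to fine-tune? Go ahead.</li>
<li>Want to change the prompt template? Edit it yourself.</li>
<li>Want to cap the output length? Entirely up to you.</li>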
</ul>
<hr>
<h2 id="硬件要求">Hardware requirements</h2>
<h3 id="最低配置跑7b模型">Minimum configuration (for 7B models)</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">CPU: 8 cores or more
</span></span><span class="line"><span class="cl">RAM: 16 GB
</span></span><span class="line"><span class="cl">GPU: 8 GB VRAM (e.g. RTX 3070)
</span></span><span class="line"><span class="cl">     or Apple M1/M2/M3 (unified memory)
</span></span><span class="line"><span class="cl">Storage: 50 GB of free SSD space
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="推荐配置跑13b-70b模型">Recommended configuration (for 13B-70B models)</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">CPU: 12 cores or more
</span></span><span class="line"><span class="cl">RAM: 32 GB+
</span></span><span class="line"><span class="cl">GPU: 24 GB VRAM (e.g. RTX 4090); 70B-class models need more, see the table below
</span></span><span class="line"><span class="cl">     or Apple M2 Pro/Max/Ultra
</span></span><span class="line"><span class="cl">Storage: 200 GB of free SSD space
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="显存-vs-模型大小速查表">VRAM vs. model size cheat sheet</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Model size</th>
          <th style="text-align: left">Minimum VRAM</th>
          <th style="text-align: left">Recommended VRAM</th>
          <th style="text-align: left">Representative models</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">3B</td>
          <td style="text-align: left">4GB</td>
          <td style="text-align: left">6GB</td>
          <td style="text-align: left">Phi-3 Mini</td>
      </tr>
      <tr>
          <td style="text-align: left">7B-8B</td>
          <td style="text-align: left">6GB</td>
          <td style="text-align: left">8GB</td>
          <td style="text-align: left">Llama 3.1 8B, Qwen2.5 7B</td>
      </tr>
      <tr>
          <td style="text-align: left">13B</td>
          <td style="text-align: left">10GB</td>
          <td style="text-align: left">16GB</td>
          <td style="text-align: left">Llama 2 13B</td>
      </tr>
      <tr>
          <td style="text-align: left">34B</td>
          <td style="text-align: left">20GB</td>
          <td style="text-align: left">24GB</td>
          <td style="text-align: left">CodeLlama 34B</td>
      </tr>
      <tr>
          <td style="text-align: left">70B</td>
          <td style="text-align: left">40GB</td>
          <td style="text-align: left">48GB</td>
          <td style="text-align: left">Llama 3.1 70B</td>
      </tr>
  </tbody>
</table>
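<p><strong>Note</strong>: quantization (Q4/Q5) can cut VRAM requirements by roughly 50% compared with 8-bit weights, and by more compared with FP16.</p>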
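<p>A back-of-the-envelope way to sanity-check the table is to compute weight memory from parameter count and bits per weight. The sketch below is a rough estimate only: the 20% overhead for KV cache and activations is an assumption, and real quantization formats use slightly more bits per weight than their nominal value.</p>

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return params_billion * bits_per_weight / 8


def estimate_vram_gb(params_billion, bits_per_weight, overhead=1.2):
    """Weights plus an assumed 20% allowance for KV cache and activations."""
    return weight_memory_gb(params_billion, bits_per_weight) * overhead


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"7B at {bits}-bit: about {estimate_vram_gb(7, bits):.1f} GB")
```

<p>By this estimate a 7B model needs about 14 GB for FP16 weights alone, so the 6 to 8 GB figures in the table presumably assume quantized weights; long contexts push the real requirement higher than the fixed overhead used here.</p>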
<hr>
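<h2 id="方案一ollama推荐新手">Option 1: Ollama (recommended for newcomers)</h2>
<p>Ollama is currently one of the simplest ways to run an LLM locally; a single command gets a model running.</p>
<h3 id="安装">Installation</h3>
<p><strong>macOS / Linux:</strong></p>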
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">curl -fsSL https://ollama.com/install.sh <span class="p">|</span> sh
</span></span></code></pre></td></tr></table>
</div>
</div><p><strong>Windows:</strong>
Download the installer: https://ollama.com/download</p>
<h3 id="基础使用">Basic usage</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Download and run Llama 3.1 8B</span>
</span></span><span class="line"><span class="cl">ollama run llama3.1
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Download and run Qwen 2.5 7B (stronger in Chinese)</span>
</span></span><span class="line"><span class="cl">ollama run qwen2.5
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Download and run DeepSeek Coder V2 (code-focused)</span>
</span></span><span class="line"><span class="cl">ollama run deepseek-coder-v2
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># List downloaded models</span>
</span></span><span class="line"><span class="cl">ollama list
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Delete a model</span>
</span></span><span class="line"><span class="cl">ollama rm llama3.1
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="作为api服务使用">Using it as an API service</h3>
<p>By default Ollama serves its API at <code>http://localhost:11434</code>.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">requests</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">post</span><span class="p">(</span><span class="s1">&#39;http://localhost:11434/api/generate&#39;</span><span class="p">,</span> <span class="n">json</span><span class="o">=</span><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;model&#39;</span><span class="p">:</span> <span class="s1">&#39;qwen2.5&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;prompt&#39;</span><span class="p">:</span> <span class="s1">&#39;Write a quicksort in Python&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;stream&#39;</span><span class="p">:</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl"><span class="p">})</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">response</span><span class="o">.</span><span class="n">json</span><span class="p">()[</span><span class="s1">&#39;response&#39;</span><span class="p">])</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p><strong>OpenAI-compatible API format:</strong></p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">openai</span> <span class="kn">import</span> <span class="n">OpenAI</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">client</span> <span class="o">=</span> <span class="n">OpenAI</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">base_url</span><span class="o">=</span><span class="s1">&#39;http://localhost:11434/v1&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">api_key</span><span class="o">=</span><span class="s1">&#39;ollama&#39;</span>  <span class="c1"># any value works</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">chat</span><span class="o">.</span><span class="n">completions</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">model</span><span class="o">=</span><span class="s1">&#39;qwen2.5&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
</span></span><span class="line"><span class="cl">        <span class="p">{</span><span class="s1">&#39;role&#39;</span><span class="p">:</span> <span class="s1">&#39;user&#39;</span><span class="p">,</span> <span class="s1">&#39;content&#39;</span><span class="p">:</span> <span class="s1">&#39;Hello, please introduce yourself&#39;</span><span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">response</span><span class="o">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">message</span><span class="o">.</span><span class="n">content</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="自定义模型modelfile">Custom models (Modelfile)</h3>
<p>Create a <code>Modelfile</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-dockerfile" data-lang="dockerfile"><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="s">qwen2.5</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="c"># Set the system prompt</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span>SYSTEM <span class="s2">&#34;&#34;&#34;</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span>You are a professional Python development assistant.<span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span>Keep answers concise and comment the code.<span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="s2">&#34;&#34;&#34;</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="c"># Set parameters</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span>PARAMETER temperature 0.7<span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span>PARAMETER num_ctx <span class="m">4096</span><span class="err">
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>Build and run:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">ollama create my-python-helper -f Modelfile
</span></span><span class="line"><span class="cl">ollama run my-python-helper
</span></span></code></pre></td></tr></table>
</div>
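</div><h3 id="ollama-优缺点">Ollama pros and cons</h3>
<p><strong>Pros</strong>:</p>
<ul>
<li>Simple installation, one command</li>
<li>Rich model library with one-command downloads</li>
<li>Good memory management</li>
<li>Active community</li>
</ul>
<p><strong>Cons</strong>:</p>
<ul>
<li>Not the fastest inference</li>
<li>Fewer advanced features</li>
<li>Limited multi-GPU support (no native tensor parallelism)</li>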
</ul>
<hr>
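<h2 id="方案二vllm推荐生产环境">Option 2: vLLM (recommended for production)</h2>
<p>vLLM is a high-throughput inference and serving engine that originated at UC Berkeley, and is among the fastest options for self-hosted serving.</p>
<h3 id="安装-1">Installation</h3>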
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">pip install vllm
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="启动服务">Starting the server</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Start an OpenAI-compatible API server</span>
</span></span><span class="line"><span class="cl">python -m vllm.entrypoints.openai.api_server <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    --model Qwen/Qwen2.5-7B-Instruct <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    --port <span class="m">8000</span> <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    --tensor-parallel-size <span class="m">1</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="使用-api">Using the API</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">openai</span> <span class="kn">import</span> <span class="n">OpenAI</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">client</span> <span class="o">=</span> <span class="n">OpenAI</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">base_url</span><span class="o">=</span><span class="s1">&#39;http://localhost:8000/v1&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">api_key</span><span class="o">=</span><span class="s1">&#39;vllm&#39;</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">chat</span><span class="o">.</span><span class="n">completions</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">model</span><span class="o">=</span><span class="s1">&#39;Qwen/Qwen2.5-7B-Instruct&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
</span></span><span class="line"><span class="cl">        <span class="p">{</span><span class="s1">&#39;role&#39;</span><span class="p">:</span> <span class="s1">&#39;user&#39;</span><span class="p">,</span> <span class="s1">&#39;content&#39;</span><span class="p">:</span> <span class="s1">&#39;Explain what RAG is&#39;</span><span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">response</span><span class="o">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">message</span><span class="o">.</span><span class="n">content</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="高级特性">Advanced features</h3>
<p><strong>1. Multi-GPU parallelism</strong></p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Use 2 GPUs</span>
</span></span><span class="line"><span class="cl">python -m vllm.entrypoints.openai.api_server <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    --model meta-llama/Llama-3.1-70B-Instruct <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    --tensor-parallel-size <span class="m">2</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p><strong>2. Loading quantized models</strong></p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># AWQ quantization</span>
</span></span><span class="line"><span class="cl">python -m vllm.entrypoints.openai.api_server <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    --model TheBloke/Llama-2-7B-Chat-AWQ <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    --quantization awq
</span></span></code></pre></td></tr></table>
</div>
</div><p><strong>3. Batched inference</strong></p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">vllm</span> <span class="kn">import</span> <span class="n">LLM</span><span class="p">,</span> <span class="n">SamplingParams</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">llm</span> <span class="o">=</span> <span class="n">LLM</span><span class="p">(</span><span class="n">model</span><span class="o">=</span><span class="s2">&#34;Qwen/Qwen2.5-7B-Instruct&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">sampling_params</span> <span class="o">=</span> <span class="n">SamplingParams</span><span class="p">(</span><span class="n">temperature</span><span class="o">=</span><span class="mf">0.8</span><span class="p">,</span> <span class="n">max_tokens</span><span class="o">=</span><span class="mi">256</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Submit several prompts in one call; vLLM batches them internally</span>
</span></span><span class="line"><span class="cl"><span class="n">prompts</span> <span class="o">=</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;什么是机器学习？&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Python和Java的区别？&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;如何学习编程？&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">outputs</span> <span class="o">=</span> <span class="n">llm</span><span class="o">.</span><span class="n">generate</span><span class="p">(</span><span class="n">prompts</span><span class="p">,</span> <span class="n">sampling_params</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">output</span> <span class="ow">in</span> <span class="n">outputs</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="n">output</span><span class="o">.</span><span class="n">outputs</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">text</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
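</div><h3 id="vllm-优缺点">vLLM pros and cons</h3>
<p><strong>Pros</strong>:</p>
<ul>
<li>Among the fastest inference engines, thanks to PagedAttention</li>
<li>High throughput, well suited to production</li>
<li>Multi-GPU parallelism</li>
<li>Very memory-efficient</li>
</ul>
<p><strong>Cons</strong>:</p>
<ul>
<li>Installation and configuration are more involved</li>
<li>Requires a CUDA environment</li>
<li>No GPU acceleration on macOS</li>
</ul>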
<hr>
<h2 id="方案三lm-studio推荐小白">Option 3: LM Studio (recommended for beginners)</h2>
<p>LM Studio is a local LLM tool with a GUI, suited to users who would rather not touch the command line.</p>
<h3 id="安装-2">Installation</h3>
<p>Download: https://lmstudio.ai/</p>
<p>Supports Windows, macOS, and Linux.</p>
<h3 id="使用方法">Usage</h3>
<ol>
<li>
<p><strong>Download a model</strong>:</p>
<ul>
<li>Open LM Studio</li>
<li>Search for the model you want (for example &ldquo;qwen2.5&rdquo;)</li>
<li>Click Download</li>
</ul>
</li>
<li>
<p><strong>Chat</strong>:</p>
<ul>
<li>Select a downloaded model</li>
<li>Talk to it directly in the chat window</li>
</ul>
</li>
<li>
<p><strong>Start the local server</strong>:</p>
<ul>
<li>Click &ldquo;Local Server&rdquo;</li>
<li>Once started, the API is available at <code>localhost:1234</code></li>
</ul>
</li>
</ol>
<h3 id="api-调用">Calling the API</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">openai</span> <span class="kn">import</span> <span class="n">OpenAI</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">client</span> <span class="o">=</span> <span class="n">OpenAI</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">base_url</span><span class="o">=</span><span class="s1">&#39;http://localhost:1234/v1&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">api_key</span><span class="o">=</span><span class="s1">&#39;lm-studio&#39;</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">chat</span><span class="o">.</span><span class="n">completions</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">model</span><span class="o">=</span><span class="s1">&#39;local-model&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
</span></span><span class="line"><span class="cl">        <span class="p">{</span><span class="s1">&#39;role&#39;</span><span class="p">:</span> <span class="s1">&#39;user&#39;</span><span class="p">,</span> <span class="s1">&#39;content&#39;</span><span class="p">:</span> <span class="s1">&#39;你好&#39;</span><span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">response</span><span class="o">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">message</span><span class="o">.</span><span class="n">content</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
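</div><h3 id="lm-studio-优缺点">LM Studio pros and cons</h3>
<p><strong>Pros</strong>:</p>
<ul>
<li>Graphical interface, easy to operate</li>
<li>Convenient model management</li>
<li>Supports a range of quantized models in GGUF format</li>
<li>Cross-platform</li>
</ul>
<p><strong>Cons</strong>:</p>
<ul>
<li>Lower performance than vLLM</li>
<li>Limited advanced features</li>
<li>Not suited to production deployment</li>
</ul>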
<hr>
<h2 id="推荐模型">Recommended models</h2>
<h3 id="通用对话">General chat</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Model</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Strengths</th>
          <th style="text-align: left">Download command (Ollama)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Qwen2.5</td>
          <td style="text-align: left">7B</td>
          <td style="text-align: left">Strongest on Chinese</td>
          <td style="text-align: left"><code>ollama run qwen2.5</code></td>
      </tr>
      <tr>
          <td style="text-align: left">Llama 3.1</td>
          <td style="text-align: left">8B</td>
          <td style="text-align: left">Well-balanced all-rounder</td>
          <td style="text-align: left"><code>ollama run llama3.1</code></td>
      </tr>
      <tr>
          <td style="text-align: left">Mistral</td>
          <td style="text-align: left">7B</td>
          <td style="text-align: left">European-built, strong reasoning</td>
          <td style="text-align: left"><code>ollama run mistral</code></td>
      </tr>
  </tbody>
</table>
<h3 id="代码生成">Code generation</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Model</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Strengths</th>
          <th style="text-align: left">Download command (Ollama)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">DeepSeek Coder V2</td>
          <td style="text-align: left">16B</td>
          <td style="text-align: left">Code specialist</td>
          <td style="text-align: left"><code>ollama run deepseek-coder-v2</code></td>
      </tr>
      <tr>
          <td style="text-align: left">CodeLlama</td>
          <td style="text-align: left">7B-34B</td>
          <td style="text-align: left">From Meta</td>
          <td style="text-align: left"><code>ollama run codellama</code></td>
      </tr>
      <tr>
          <td style="text-align: left">Qwen2.5-Coder</td>
          <td style="text-align: left">7B</td>
          <td style="text-align: left">Code plus Chinese</td>
          <td style="text-align: left"><code>ollama run qwen2.5-coder</code></td>
      </tr>
  </tbody>
</table>
<h3 id="长文本处理">Long-text processing</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Model</th>
          <th style="text-align: left">Context window</th>
          <th style="text-align: left">Strengths</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Qwen2.5</td>
          <td style="text-align: left">128K</td>
          <td style="text-align: left">Very long context</td>
      </tr>
      <tr>
          <td style="text-align: left">Yi-34B-200K</td>
          <td style="text-align: left">200K</td>
          <td style="text-align: left">Leading Chinese-developed long-context model</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="性能优化技巧">Performance tuning tips</h2>
<h3 id="1-使用量化模型">1. Use quantized models</h3>
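<p>As a rough guide to what the quantization levels in this section trade off, weight memory scales with parameter count times bits per weight. The sketch below assumes nominal widths of 4, 5, 8 and 16 bits; <code>NOMINAL_BITS</code> and <code>weight_gib</code> are illustrative names, and real GGUF files mix precisions and come out somewhat larger.</p>

```python
# Rough weight-memory estimate: parameters x bits per weight / 8.
# The nominal bit widths are simplifying assumptions, not exact GGUF sizes.
NOMINAL_BITS = {"q4": 4, "q5": 5, "q8": 8, "fp16": 16}


def weight_gib(params_billion: float, level: str) -> float:
    """Approximate size of the model weights in GiB."""
    total_bytes = params_billion * 1e9 * NOMINAL_BITS[level] / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    for level in NOMINAL_BITS:
        print(f"7B at {level}: ~{weight_gib(7, level):.1f} GiB")
```

<p>The estimate covers weights only; the KV cache and runtime buffers need additional memory on top.</p>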
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Q4 quantization (fast, slight loss of precision)</span>
</span></span><span class="line"><span class="cl">ollama run qwen2.5:7b-instruct-q4_K_M
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Q5 quantization (a balanced choice)</span>
</span></span><span class="line"><span class="cl">ollama run qwen2.5:7b-instruct-q5_K_M
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Q8 quantization (higher precision, more VRAM)</span>
</span></span><span class="line"><span class="cl">ollama run qwen2.5:7b-instruct-q8_0
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="2-调整上下文长度">2. Adjust the context length</h3>
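<p>Context length matters for memory because the KV cache grows linearly with the number of tokens. A minimal sketch of that arithmetic, assuming a hypothetical 7B-class model with 28 layers, 4 KV heads and a head dimension of 128, stored in fp16; <code>kv_cache_gib</code> is an illustrative helper, and the real values should come from the model's own <code>config.json</code>.</p>

```python
def kv_cache_gib(tokens: int, layers: int, kv_heads: int, head_dim: int,
                 bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size for one sequence, in GiB."""
    # Factor 2: one key vector and one value vector per layer per token
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token_bytes / 2**30


if __name__ == "__main__":
    # Assumed model shape, for illustration only
    shape = {"layers": 28, "kv_heads": 4, "head_dim": 128}
    for tokens in (8192, 32768, 131072):
        print(f"{tokens} tokens: ~{kv_cache_gib(tokens, **shape):.2f} GiB")
```

<p>Under these assumptions, capping the context at 8192 tokens keeps the per-sequence cache well under 1 GiB, which is why a smaller context is a cheap first step when memory is tight.</p>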
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Ollama: ollama run has no context-length flag; at the prompt, type /set parameter num_ctx 8192</span>
</span></span><span class="line"><span class="cl">ollama run qwen2.5
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># vLLM</span>
</span></span><span class="line"><span class="cl">python -m vllm.entrypoints.openai.api_server <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    --model Qwen/Qwen2.5-7B-Instruct <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    --max-model-len <span class="m">8192</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="3-启用-flash-attention">3. Enable Flash Attention</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># vLLM selects its attention backend automatically</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Recent vLLM releases bundle FlashAttention; install flash-attn separately only if yours does not</span>
</span></span><span class="line"><span class="cl">pip install flash-attn
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="4-使用-gpu-offloading">4. Use GPU offloading</h3>
<p>When the model does not fit in VRAM:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Ollama splits layers between GPU and CPU automatically</span>
</span></span><span class="line"><span class="cl"><span class="c1"># To override, type /set parameter num_gpu 20 at the prompt (layers kept on the GPU; 20 is only an example)</span>
</span></span><span class="line"><span class="cl">ollama run llama3.1:70b
</span></span></code></pre></td></tr></table>
</div>
</div><hr>
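<h2 id="常见问题">FAQ</h2>
<h3 id="q-显存不够怎么办">Q: What if I run out of VRAM?</h3>
<p>A:</p>
<ol>
<li>Use a smaller model (7B instead of 13B)</li>
<li>Use a quantized version (Q4/Q5)</li>
<li>Reduce the context length</li>
<li>Fall back to CPU inference (much slower)</li>
</ol>
<h3 id="q-mac-能跑吗">Q: Can it run on a Mac?</h3>
<p>A: Yes. Apple Silicon (M1/M2/M3) runs these models well.</p>
<ul>
<li>Ollama: native support</li>
<li>LM Studio: native support</li>
<li>vLLM: no Mac GPU support</li>
</ul>
<h3 id="q-生成速度太慢">Q: Generation is too slow?</h3>
<p>A:</p>
<ol>
<li>Check that inference is running on the GPU rather than the CPU</li>
<li>Use a quantized model</li>
<li>Reduce the generation length</li>
<li>Move up to vLLM</li>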
</ol>
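<p>Before changing anything from the list above, it helps to measure the current speed. A minimal sketch, assuming the numbers come from an Ollama <code>/api/generate</code> response, which reports <code>eval_count</code> (generated tokens) and <code>eval_duration</code> (nanoseconds); <code>tokens_per_second</code> is an illustrative helper and the sample values are made up.</p>

```python
def tokens_per_second(stats: dict) -> float:
    """Generation speed from eval_count and eval_duration."""
    # eval_duration is reported in nanoseconds, so scale back to seconds
    return stats["eval_count"] / stats["eval_duration"] * 1e9


if __name__ == "__main__":
    # Hypothetical numbers, not a real measurement
    sample = {"eval_count": 256, "eval_duration": 8_000_000_000}
    print(f"{tokens_per_second(sample):.1f} tokens/s")
```

<p>Comparing this figure before and after each change shows which step actually helped.</p>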
<h3 id="q-如何让多人同时使用">Q: How do I serve several users at once?</h3>
<p>A:</p>
<ol>
<li>Deploy on a server</li>
<li>Use vLLM, which handles high concurrency</li>
<li>Put Nginx in front as a load balancer</li>
</ol>
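<p>To make the load-balancing step concrete, here is what a round-robin policy does, sketched in Python rather than as Nginx configuration. The backend addresses and the <code>round_robin</code> helper are hypothetical; in practice Nginx applies this policy for you.</p>

```python
from itertools import cycle

# Hypothetical backend addresses, e.g. two vLLM servers
BACKENDS = ["http://10.0.0.1:8000/v1", "http://10.0.0.2:8000/v1"]


def round_robin(backends):
    """Return an endless iterator that visits the backends in order."""
    return cycle(backends)


if __name__ == "__main__":
    picker = round_robin(BACKENDS)
    # Each incoming request goes to the next backend in turn
    for request_id in range(4):
        print(request_id, next(picker))
```

<p>Requests alternate between the two servers, so each one handles roughly half of the traffic.</p>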
<hr>
<h2 id="总结">Summary</h2>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Option</th>
          <th style="text-align: left">Best for</th>
          <th style="text-align: left">Difficulty</th>
          <th style="text-align: left">Performance</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Ollama</strong></td>
          <td style="text-align: left">Developers, personal use</td>
          <td style="text-align: left">⭐</td>
          <td style="text-align: left">⭐⭐⭐</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>vLLM</strong></td>
          <td style="text-align: left">Teams, production</td>
          <td style="text-align: left">⭐⭐⭐</td>
          <td style="text-align: left">⭐⭐⭐⭐⭐</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>LM Studio</strong></td>
          <td style="text-align: left">Beginners, early experimentation</td>
          <td style="text-align: left">⭐</td>
          <td style="text-align: left">⭐⭐</td>
      </tr>
  </tbody>
</table>
<p><strong>My advice</strong>:</p>
<ul>
<li>Just starting out → try LM Studio first</li>
<li>Day-to-day development → Ollama is enough</li>
<li>Production deployment → go straight to vLLM</li>
</ul>
<p>The era of local LLMs has arrived. Embrace it!</p>
]]></content:encoded></item></channel></rss>