<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Gemini on Chico's Tech Blog</title><link>https://realtime-ai.chat/tags/gemini/</link><description>Recent content in Gemini on Chico's Tech Blog</description><image><title>Chico's Tech Blog</title><url>https://github.com/chicogong.png</url><link>https://github.com/chicogong.png</link></image><generator>Hugo</generator><language>zh-cn</language><lastBuildDate>Mon, 18 May 2026 10:00:00 +0800</lastBuildDate><atom:link href="https://realtime-ai.chat/tags/gemini/index.xml" rel="self" type="application/rss+xml"/><item><title>2026 大模型选型:别问「哪个最强」,问「哪个够用」</title><link>https://realtime-ai.chat/posts/llm-selection-2026/</link><pubDate>Mon, 18 May 2026 10:00:00 +0800</pubDate><guid>https://realtime-ai.chat/posts/llm-selection-2026/</guid><description>2026 年大模型选型不该看跑分排名。这篇给一套按场景选型的决策框架:能力梯队、推理成本、延迟、上下文、闭源开源、私有化部署,附决策流程图。</description><content:encoded><![CDATA[<p>去年我们一个内部项目,用 Claude Opus 跑一个意图分类:输入一句用户的话,输出三个标签之一。上线两周,有人去看账单,愣住了——这个分类任务,一个 14B 的开源模型在自己的卡上跑,效果差不了几个点,成本是它的几十分之一。</p>
<p>这就是 2026 年选型最常见的错误:<strong>把&quot;哪个模型最强&quot;当成了&quot;我该用哪个模型&quot;。</strong></p>
<p>这两个问题根本不是一回事。GPQA、SWE-bench、ARC-AGI-2 这些榜单告诉你的是天花板,而你大部分的线上请求,离天花板远着呢。一个分类、一段摘要、一次格式化抽取——这些活儿,旗舰模型是高射炮打蚊子。选型不是选最强,是给<strong>每一类任务</strong>配一个&quot;刚好够用、且最便宜&quot;的模型。</p>
<p>这篇不排名。给你一套按场景拆的决策框架。</p>
<h2 id="先认清2026-年的模型是分梯队的">先认清:2026 年的模型是分梯队的</h2>
<p>2026 年 5 月,前沿模型大概是这么个格局——记住具体版本号意义不大,它们每两三个月就跳一次,记住<strong>梯队</strong>就行:</p>
<table>
  <thead>
      <tr>
          <th>梯队</th>
          <th>代表模型(2026.05)</th>
          <th>典型 API 价格(输入/输出,每百万 token)</th>
          <th>该干什么</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>旗舰</td>
          <td>GPT-5.5、Claude Opus 4.7、Gemini 3.1 Pro</td>
          <td>$5 / $25 量级</td>
          <td>复杂推理、Agent 编排、难代码</td>
      </tr>
      <tr>
          <td>主力</td>
          <td>Claude Sonnet 4.6、Gemini 3 Flash、DeepSeek V4-Pro</td>
          <td>$1–3 / $3–15 量级</td>
          <td>绝大多数生产任务</td>
      </tr>
      <tr>
          <td>快而省</td>
          <td>Claude Haiku 4.5、Gemini 3 Flash-Lite、DeepSeek V4-Flash</td>
          <td>$0.1–1 / $0.3–5 量级</td>
          <td>分类、抽取、路由、简单问答</td>
      </tr>
  </tbody>
</table>
<p>这张表里藏着一个关键事实:<strong>旗舰和&quot;快而省&quot;之间,输出价格差了几十倍。</strong> DeepSeek V4-Flash 的输出大约 $0.28,GPT-5.5 是 $30——一百多倍。这个差距不是边角料,它会直接决定你的产品能不能规模化。</p>
<p>而梯队之间的<strong>能力</strong>差距,这两年反而在缩小。2024 年你能明显感觉到旗舰和主力不是一个物种;2026 年,在很多具体任务上,主力模型只比旗舰差几个百分点,有时候你压根测不出来。能力在收敛,价格还拉得很开——这就是&quot;按梯队选型&quot;能省钱的根本原因。</p>
<p>所以第一条原则:<strong>默认从主力梯队起步,只在它确实顶不住时才往上抬。</strong> 不要反过来,从旗舰往下砍——那样你永远不知道下面那一档是不是早就够了。</p>
<h2 id="维度一能力够不够要按任务类型问">维度一:能力够不够,要按&quot;任务类型&quot;问</h2>
<p>&ldquo;够用&quot;不是一个模糊的感觉,它可以拆。把你的任务大致归到三类:</p>
<p><strong>确定性任务</strong>——分类、实体抽取、格式转换、敏感词过滤。这类任务有标准答案,对错可量化。结论很直接:<strong>用快而省梯队,甚至小开源模型。</strong> 旗舰在这里没有任何优势,它多出来的&quot;智商&quot;在一个三选一的分类题上无处发挥。我前面说的那个翻车案例,就是这一类。</p>
<p><strong>生成与改写任务</strong>——写文案、做摘要、客服话术、翻译。这类没有唯一答案,但对&quot;质量&quot;敏感。主力梯队是甜区。值得一提:Claude 系列在中长文写作上的语感明显更自然,一次能稳定输出十几万 token 不塌;如果你的产品核心就是&quot;写得像人&rdquo;,这个差异值得你多花那点钱。</p>
<p><strong>推理与 Agent 任务</strong>——多步代码、需要调工具、长链路规划、&ldquo;自己想办法完成&rdquo;。这是 2026 年唯一<strong>真的需要旗舰</strong>的地方。一个 Agent 要连续做二三十步,每一步的小错误会累积,中间某一步判断失误,后面全废。这种场景下,旗舰多出来的几个点,放大到整条链路就是&quot;能跑通&quot;和&quot;跑不通&quot;的区别。GPT-5.5、Claude Opus 4.7 这一档,贵有贵的道理——但前提是,你的任务真的是 Agent,而不是被包装成 Agent 的一次性问答。</p>
<p>一个实操建议:<strong>别用一个模型扛所有任务。</strong> 成熟的做法是按任务路由——一个便宜模型做分流和简单活儿,难的才转交旗舰。这比&quot;全程旗舰&quot;省一大笔,也比&quot;全程便宜&quot;靠谱。</p>
<h2 id="维度二成本不是单价是单价--调用量--输出长度">维度二:成本不是单价,是「单价 × 调用量 × 输出长度」</h2>
<p>很多人看 API 价格,只瞄一眼那个&quot;每百万 token 多少钱&quot;。这是不够的。真正的账是三个数相乘:</p>
<ul>
<li><strong>单价</strong>——尤其是<strong>输出</strong>单价,通常是输入的 3 到 5 倍,而且 Agent 类任务输出占比高。</li>
<li><strong>调用量</strong>——一天一千次还是一千万次,差四个数量级。</li>
<li><strong>平均输出长度</strong>——让模型&quot;先想再答&quot;(reasoning)能提质量,但思考链本身也是要付费的 token。</li>
</ul>
<p>把这三个乘起来,你常会得到一个反直觉的结论。举个例子:一个日活几万的客服机器人,绝大多数对话是&quot;查物流&quot;&ldquo;改地址&quot;这种,真正复杂的咨询可能只占 5%。如果你全程用旗舰,等于为了那 5% 的复杂场景,给 95% 的简单场景也付了旗舰价。把 95% 切到主力或快省梯队,月成本可能直接砍掉七八成,用户一点感知都没有。</p>
<p>两个几乎免费、却经常被忘掉的省钱手段,务必用上:</p>
<ul>
<li><strong>Prompt Caching(提示缓存)</strong>——固定不变的前缀(system prompt、长文档、few-shot 例子)缓存住,命中后这部分输入便宜约 90%。多轮对话、RAG、批量同模板任务,收益巨大。</li>
<li><strong>Batch(批处理)</strong>——不要求实时返回的任务,走批处理接口,普遍五折。离线打标、夜间报表、内容审核这类活儿,没理由不用。</li>
</ul>
<p>记住:<strong>选型省下的钱,常常比换一个&quot;更便宜的模型&quot;省得还多。</strong> 因为它省的是结构性的浪费。</p>
<h2 id="维度三延迟上下文被场景一票否决的硬约束">维度三:延迟、上下文——被场景一票否决的硬约束</h2>
<p>有些维度不参与&quot;性价比&quot;的权衡,它们是<strong>门槛</strong>:不过线,这个模型直接出局,多强多便宜都没用。</p>
<p><strong>延迟。</strong> 如果你做的是实时语音对话,用户说完到 AI 出声的预算只有几百毫秒(这个我在<a href="../voice-technology/voice-latency-budget/">上一篇</a>里专门拆过)。这种场景,你要盯的是<strong>首 token 延迟(TTFT)</strong>,不是模型聪不聪明。一个慢半拍的旗舰,体验上输给一个快的主力模型。反过来,如果是离线批处理,延迟根本不在你的考虑范围里——这时候为&quot;快&quot;付的溢价就是纯浪费。</p>
<p><strong>上下文长度。</strong> 2026 年长上下文已经不稀缺:Gemini 3.1 Pro 和 DeepSeek V4 都是 1M token 窗口,Llama 4 甚至把 10M 带进了开源世界。但<strong>有窗口不等于会用</strong>。把 50 万 token 一股脑塞进去,模型对中间段落的注意力会明显下降——业内说的 &ldquo;lost in the middle&rdquo; 没有消失。所以长上下文是个二元的资格题:你的单次任务真需要塞进一整本书、一个大代码库,那 1M 窗口是硬指标;如果你的输入本来就几千 token,纠结谁的窗口更大毫无意义,<strong>该花力气的是 RAG 的检索质量,而不是模型的窗口数字。</strong></p>
<p>判断方法很简单:<strong>先问&quot;这个场景能不能容忍 X&rdquo;,不能就直接划掉一批模型,再在活下来的里面比性价比。</strong> 别把硬约束和软偏好混在一起算。</p>
<h2 id="维度四闭源还是开源2026-年这道题变简单了">维度四:闭源还是开源,2026 年这道题变简单了</h2>
<p>两年前这是个艰难抉择,因为开源模型确实差一截。2026 年不一样了。</p>
<p>DeepSeek V4-Pro 在 SWE-bench Verified 上能摸到 80% 出头,和顶级闭源模型只差零点几个点,而且是 MIT 许可证。Qwen 3.5 / 3.6、Llama 4 也都在各自的领域逼近前沿。<strong>开源和闭源的能力差距,现在是用单个 benchmark 上的几个点来衡量,不再是&quot;差一代&quot;。</strong> 同时,主流开源模型现在发布即附带官方量化版本(Q4/Q5/Q8),部署门槛大幅下降。</p>
<p>所以这道题的判据,从&quot;谁更强&quot;变成了别的:</p>
<ul>
<li>选<strong>闭源 API</strong>:你要的是省心。不碰 GPU、不管扩缩容、要最新最强、出了事有人兜底。绝大多数从 0 到 1 的产品,该走这条路——你的精力应该花在产品上,不是运维推理集群。</li>
<li>选<strong>开源</strong>:你有三个理由之一——量足够大(自己跑的边际成本能把闭源 API 打下去)、需要深度微调(让模型长出领域知识)、或者数据不能出门(下一节细说)。</li>
</ul>
<p>还有个容易被忽视的点:<strong>开源是一份保险。</strong> 用闭源 API,你绑定了对方的定价、限流和模型下线节奏——它说某个版本退役,你就得连夜迁移。把一部分负载放在能自己掌控的开源模型上,是对供应商风险的对冲。</p>
<h2 id="维度五要不要私有化部署这题先于选模型">维度五:要不要私有化部署——这题先于选模型</h2>
<p>如果你的数据是病历、银行流水、未公开的财报、核心代码——<strong>这一条会推翻上面所有结论。</strong> 它不是一个性价比维度,它是法律和信任的红线。</p>
<p>判断私有化部署需求,问三个问题:</p>
<ol>
<li><strong>数据能不能离开你的网络?</strong> 受监管的医疗、金融、政务,答案常常是&quot;不能&quot;。</li>
<li><strong>合规要求审计闭环吗?</strong> 欧盟 AI 法案 2026 年 8 月全面生效,高风险系统要求可追溯、可解释、有人类监督。这些在一个黑盒 API 后面很难自证。</li>
<li><strong>数据主权有没有硬约束?</strong> 某些行业、某些地区,要求推理全程在境内、在自有设施内完成。</li>
</ol>
<p>只要有一个答案指向&quot;必须自己掌控&quot;,那就<strong>只能选能私有化的开源模型</strong>——Qwen、Llama、DeepSeek 这一类,把权重下载下来,跑在自己的 VPC 或机房里。这时候&quot;GPT-5.5 更强&quot;是一句正确的废话,因为它根本不在你的候选集里。</p>
<p>要提醒的是,私有化不是&quot;省钱&quot;的同义词。算上 GPU 采购或租赁、运维、扩缩容、安全加固,<strong>很多时候它比 API 更贵。</strong> 选它的理由是控制权和合规,不是成本。如果你既没有合规硬约束、量也撑不起一个推理集群,却因为&quot;感觉更安全&quot;去自建,那大概率是给自己挖坑。</p>
<h2 id="把这套框架连起来">把这套框架连起来</h2>
<p>选型不是从一张排行榜里挑第一名,而是带着你的场景,依次过几道闸门:</p>
<pre class="mermaid">flowchart TD
  A[一个具体任务] --> B{数据能否出本网络?}
  B -- 不能/强合规 --> P[私有化部署<br/>开源模型: Qwen / Llama / DeepSeek]
  B -- 可以 --> C{任务类型?}
  C -- 确定性<br/>分类·抽取·路由 --> D[快而省梯队<br/>Haiku / Flash-Lite / 小开源模型]
  C -- 生成改写<br/>文案·摘要·翻译 --> E[主力梯队<br/>Sonnet / Flash / DeepSeek V4-Pro]
  C -- 推理 Agent<br/>多步·调工具·规划 --> F[旗舰梯队<br/>GPT-5.5 / Opus 4.7 / Gemini 3.1 Pro]
  D --> G{有延迟或上下文硬约束?}
  E --> G
  F --> G
  G -- 有 --> H[在满足约束的模型里<br/>重新筛一遍]
  G -- 没有 --> I[按规模决定<br/>闭源 API or 自建开源]
</pre><p>注意几个细节。<strong>第一道闸是数据合规,不是能力</strong>——合规一票否决,放在最前面,免得你比了半天性价比最后发现这个模型根本不能用。任务类型决定的是<strong>梯队</strong>,不是具体型号——型号每季度都变,梯队的逻辑稳定得多。延迟和上下文是<strong>筛选器</strong>,不是打分项——它们只负责把不合格的划掉。最后才轮到闭源还是开源,而这一步<strong>主要由调用量决定</strong>:量小走 API,量大到自建更划算时,再考虑迁。</p>
<h2 id="最后">最后</h2>
<p>2026 年大模型这块,缺的从来不是好模型,缺的是&quot;清楚自己要什么&quot;。</p>
<p>榜单天天有人更新,排名天天有人吵,但你的客服机器人需要的可能只是一个稳定、便宜、够快的主力模型;你的代码 Agent 才真的吃旗舰那几个点的智商;你的合规系统压根不在公开榜单的讨论范围里。</p>
<p>把&quot;哪个最强&quot;这个问题放下。换成一串具体的问题:这个任务是什么类型?能容忍多少延迟?一天调用多少次?数据能不能出门?——这几个问题答完,该选哪个,基本也就清楚了。</p>
<p>选型的功夫,九成在想清楚需求,一成在看模型。顺序别搞反。</p>
]]></content:encoded></item><item><title>多模态AI：当机器学会「看图说话」</title><link>https://realtime-ai.chat/posts/multimodal-ai-breakthrough/</link><pubDate>Fri, 12 Dec 2025 10:00:00 +0800</pubDate><guid>https://realtime-ai.chat/posts/multimodal-ai-breakthrough/</guid><description>多模态 AI 最新进展:GPT-4V、Gemini、CLIP 等视觉语言模型如何让机器「看图说话」,理解图像并给出建议。</description><content:encoded><![CDATA[<h2 id="开场一个神奇的对话">开场：一个神奇的对话</h2>
<p><strong>2025年某天，你和AI的对话</strong>：</p>
<blockquote>
<p>你：[上传一张冰箱照片]<br>
你：&ldquo;帮我看看能做什么菜&rdquo;</p>
<p>AI：&ldquo;我看到你冰箱里有：鸡蛋、西红柿、青椒、米饭&hellip;<br>
推荐做番茄炒蛋盖饭！步骤如下&hellip;&rdquo;</p>
<p>你：&ldquo;等等，我不吃辣&rdquo;</p>
<p>AI：&ldquo;好的，那把青椒换成黄瓜，做黄瓜炒蛋&hellip;&rdquo;</p></blockquote>
<p><strong>这不是科幻，这是2025年的现实。</strong></p>
<p>AI不仅能&quot;看懂&quot;你的冰箱，还能理解上下文、给出建议、甚至根据你的偏好调整方案。</p>
<p><strong>这就是多模态AI的魔力。</strong></p>
<hr>
<h2 id="第一章什么是多模态ai">第一章：什么是多模态AI？</h2>
<h3 id="11-从单一感官到全感官">1.1 从「单一感官」到「全感官」</h3>
<p><strong>传统AI（单模态）</strong>：</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># 只能处理文字</span>
</span></span><span class="line"><span class="cl"><span class="n">text_ai</span> <span class="o">=</span> <span class="n">GPT3</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">text_ai</span><span class="o">.</span><span class="n">chat</span><span class="p">(</span><span class="s2">&#34;今天天气怎么样？&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># ✅ 能回答</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">text_ai</span><span class="o">.</span><span class="n">chat</span><span class="p">(</span><span class="s2">&#34;[图片: 窗外风景]&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># ❌ 看不懂图片</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p><strong>多模态AI</strong>：</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># 能处理文字、图片、音频、视频</span>
</span></span><span class="line"><span class="cl"><span class="n">multimodal_ai</span> <span class="o">=</span> <span class="n">GPT4V</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 文字 ✅</span>
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">multimodal_ai</span><span class="o">.</span><span class="n">chat</span><span class="p">(</span><span class="s2">&#34;今天天气怎么样？&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 图片 ✅</span>
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">multimodal_ai</span><span class="o">.</span><span class="n">chat</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span><span class="o">=</span><span class="s2">&#34;这是什么？&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">image</span><span class="o">=</span><span class="s2">&#34;photo.jpg&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 音频 ✅</span>
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">multimodal_ai</span><span class="o">.</span><span class="n">chat</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span><span class="o">=</span><span class="s2">&#34;这段音乐是什么风格？&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">audio</span><span class="o">=</span><span class="s2">&#34;music.mp3&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 视频 ✅</span>
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">multimodal_ai</span><span class="o">.</span><span class="n">chat</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span><span class="o">=</span><span class="s2">&#34;视频里的人在做什么？&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">video</span><span class="o">=</span><span class="s2">&#34;video.mp4&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="12-多模态的模态是什么">1.2 多模态的「模态」是什么？</h3>
<p><strong>模态（Modality）</strong> = 信息的表现形式</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Modality</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;AI能理解的信息类型&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="n">types</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;文本&#34;</span><span class="p">:</span> <span class="s2">&#34;Text&#34;</span><span class="p">,</span>           <span class="c1"># 文字、代码</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;图像&#34;</span><span class="p">:</span> <span class="s2">&#34;Image&#34;</span><span class="p">,</span>          <span class="c1"># 照片、图表、截图</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;音频&#34;</span><span class="p">:</span> <span class="s2">&#34;Audio&#34;</span><span class="p">,</span>          <span class="c1"># 语音、音乐、声音</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;视频&#34;</span><span class="p">:</span> <span class="s2">&#34;Video&#34;</span><span class="p">,</span>          <span class="c1"># 动态画面</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;3D&#34;</span><span class="p">:</span> <span class="s2">&#34;3D Model&#34;</span><span class="p">,</span>         <span class="c1"># 三维模型</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;传感器&#34;</span><span class="p">:</span> <span class="s2">&#34;Sensor Data&#34;</span>   <span class="c1"># 温度、压力等</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p><strong>多模态AI = 能同时理解和处理多种模态的AI</strong></p>
<hr>
<h2 id="第二章多模态ai的超能力">第二章：多模态AI的「超能力」</h2>
<h3 id="21-超能力一跨模态理解">2.1 超能力一：跨模态理解</h3>
<p><strong>例子：图生文（Image-to-Text）</strong></p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">openai</span> <span class="kn">import</span> <span class="n">OpenAI</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">client</span> <span class="o">=</span> <span class="n">OpenAI</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 上传图片，AI生成描述</span>
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">chat</span><span class="o">.</span><span class="n">completions</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">model</span><span class="o">=</span><span class="s2">&#34;gpt-4-vision-preview&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
</span></span><span class="line"><span class="cl">        <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">                <span class="p">{</span><span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;text&#34;</span><span class="p">,</span> <span class="s2">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34;详细描述这张图片&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">                <span class="p">{</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;image_url&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;image_url&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">                        <span class="s2">&#34;url&#34;</span><span class="p">:</span> <span class="s2">&#34;https://example.com/photo.jpg&#34;</span>
</span></span><span class="line"><span class="cl">                    <span class="p">}</span>
</span></span><span class="line"><span class="cl">                <span class="p">}</span>
</span></span><span class="line"><span class="cl">            <span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">response</span><span class="o">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">message</span><span class="o">.</span><span class="n">content</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># 输出: &#34;这是一张在海边拍摄的日落照片。天空呈现出橙红色的渐变，</span>
</span></span><span class="line"><span class="cl"><span class="c1">#        海面波光粼粼，远处有一艘帆船...&#34;</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p><strong>真实案例</strong>：</p>
<table>
  <thead>
      <tr>
          <th>输入图片</th>
          <th>AI描述</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>🍕 披萨照片</td>
          <td>&ldquo;一份意式玛格丽特披萨，上面有新鲜罗勒叶、马苏里拉奶酪和番茄酱&hellip;&rdquo;</td>
      </tr>
      <tr>
          <td>📊 数据图表</td>
          <td>&ldquo;这是一个柱状图，显示2020-2025年的销售趋势，2025年达到峰值&hellip;&rdquo;</td>
      </tr>
      <tr>
          <td>🐱 猫咪照片</td>
          <td>&ldquo;一只橘色的短毛猫，正趴在窗台上晒太阳，表情慵懒&hellip;&rdquo;</td>
      </tr>
  </tbody>
</table>
<h3 id="22-超能力二跨模态生成">2.2 超能力二：跨模态生成</h3>
<p><strong>例子：文生图（Text-to-Image）</strong></p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># DALL-E 3 / Midjourney / Stable Diffusion</span>
</span></span><span class="line"><span class="cl"><span class="n">prompt</span> <span class="o">=</span> <span class="s2">&#34;一只穿着宇航服的猫在月球上弹吉他，赛博朋克风格，8K高清&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">image</span> <span class="o">=</span> <span class="n">generate_image</span><span class="p">(</span><span class="n">prompt</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># 生成符合描述的图片</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p><strong>更多跨模态生成</strong>：</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">CrossModalGeneration</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;跨模态生成能力&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="n">capabilities</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;文 → 图&#34;</span><span class="p">:</span> <span class="s2">&#34;DALL-E, Midjourney, Stable Diffusion&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;文 → 音&#34;</span><span class="p">:</span> <span class="s2">&#34;MusicGen, AudioLDM&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;文 → 视频&#34;</span><span class="p">:</span> <span class="s2">&#34;Sora, Runway Gen-2&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;图 → 文&#34;</span><span class="p">:</span> <span class="s2">&#34;GPT-4V, Claude 3.5&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;音 → 文&#34;</span><span class="p">:</span> <span class="s2">&#34;Whisper, Qwen-Audio&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;视频 → 文&#34;</span><span class="p">:</span> <span class="s2">&#34;Gemini 2.0, GPT-4V&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="23-超能力三多模态推理">2.3 超能力三：多模态推理</h3>
<p><strong>例子：看图做数学题</strong></p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># 上传一张手写数学题的照片</span>
</span></span><span class="line"><span class="cl"><span class="n">image</span> <span class="o">=</span> <span class="s2">&#34;math_problem.jpg&#34;</span>  <span class="c1"># 图片内容: &#34;解方程 2x + 5 = 13&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">gpt4v</span><span class="o">.</span><span class="n">chat</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span><span class="o">=</span><span class="s2">&#34;解这道题，并给出详细步骤&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">image</span><span class="o">=</span><span class="n">image</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">response</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># 输出:</span>
</span></span><span class="line"><span class="cl"><span class="c1"># &#34;这是一个一元一次方程：</span>
</span></span><span class="line"><span class="cl"><span class="c1">#  步骤1: 2x + 5 = 13</span>
</span></span><span class="line"><span class="cl"><span class="c1">#  步骤2: 2x = 13 - 5</span>
</span></span><span class="line"><span class="cl"><span class="c1">#  步骤3: 2x = 8</span>
</span></span><span class="line"><span class="cl"><span class="c1">#  步骤4: x = 4</span>
</span></span><span class="line"><span class="cl"><span class="c1">#  答案: x = 4&#34;</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p><strong>更复杂的推理</strong>：</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># 场景：医疗诊断</span>
</span></span><span class="line"><span class="cl"><span class="n">inputs</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;X光片&#34;</span><span class="p">:</span> <span class="s2">&#34;chest_xray.jpg&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;病历&#34;</span><span class="p">:</span> <span class="s2">&#34;患者男性，65岁，咳嗽两周...&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;血液检测&#34;</span><span class="p">:</span> <span class="s2">&#34;blood_test.pdf&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">diagnosis</span> <span class="o">=</span> <span class="n">multimodal_ai</span><span class="o">.</span><span class="n">analyze</span><span class="p">(</span><span class="n">inputs</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># 输出: &#34;根据X光片显示的肺部阴影、病史和血液指标，</span>
</span></span><span class="line"><span class="cl"><span class="c1">#        建议进一步做CT检查排除肺部感染...&#34;</span>
</span></span></code></pre></td></tr></table>
</div>
</div><hr>
<h2 id="第三章2025年的多模态ai明星">第三章：2025年的多模态AI明星</h2>
<h3 id="31-gpt-4vopenai">3.1 GPT-4V（OpenAI）</h3>
<p><strong>特点</strong>：视觉理解能力最强</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span><span class="lnt">27
</span><span class="lnt">28
</span><span class="lnt">29
</span><span class="lnt">30
</span><span class="lnt">31
</span><span class="lnt">32
</span><span class="lnt">33
</span><span class="lnt">34
</span><span class="lnt">35
</span><span class="lnt">36
</span><span class="lnt">37
</span><span class="lnt">38
</span><span class="lnt">39
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># 实战：分析商品评论的配图</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">openai</span> <span class="kn">import</span> <span class="n">OpenAI</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">client</span> <span class="o">=</span> <span class="n">OpenAI</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">analyze_product_review</span><span class="p">(</span><span class="n">image_url</span><span class="p">,</span> <span class="n">review_text</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;分析带图片的商品评论&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="n">response</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">chat</span><span class="o">.</span><span class="n">completions</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">model</span><span class="o">=</span><span class="s2">&#34;gpt-4-vision-preview&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
</span></span><span class="line"><span class="cl">            <span class="p">{</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">                    <span class="p">{</span>
</span></span><span class="line"><span class="cl">                        <span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;text&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                        <span class="s2">&#34;text&#34;</span><span class="p">:</span> <span class="sa">f</span><span class="s2">&#34;用户评论：</span><span class="si">{</span><span class="n">review_text</span><span class="si">}</span><span class="se">\n</span><span class="s2">请结合图片分析这个评论是否真实可信&#34;</span>
</span></span><span class="line"><span class="cl">                    <span class="p">},</span>
</span></span><span class="line"><span class="cl">                    <span class="p">{</span>
</span></span><span class="line"><span class="cl">                        <span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;image_url&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                        <span class="s2">&#34;image_url&#34;</span><span class="p">:</span> <span class="p">{</span><span class="s2">&#34;url&#34;</span><span class="p">:</span> <span class="n">image_url</span><span class="p">}</span>
</span></span><span class="line"><span class="cl">                    <span class="p">}</span>
</span></span><span class="line"><span class="cl">                <span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="p">}</span>
</span></span><span class="line"><span class="cl">        <span class="p">],</span>
</span></span><span class="line"><span class="cl">        <span class="n">max_tokens</span><span class="o">=</span><span class="mi">500</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">response</span><span class="o">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">message</span><span class="o">.</span><span class="n">content</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 使用示例</span>
</span></span><span class="line"><span class="cl"><span class="n">review</span> <span class="o">=</span> <span class="s2">&#34;这个键盘手感超好，RGB灯效炫酷！&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">image</span> <span class="o">=</span> <span class="s2">&#34;https://example.com/keyboard.jpg&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">analysis</span> <span class="o">=</span> <span class="n">analyze_product_review</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="n">review</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">analysis</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># 输出: &#34;图片显示的确实是一款机械键盘，RGB背光清晰可见，</span>
</span></span><span class="line"><span class="cl"><span class="c1">#        与评论描述一致。从键帽磨损程度看，应该是新品。</span>
</span></span><span class="line"><span class="cl"><span class="c1">#        评论可信度：高&#34;</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p><strong>应用场景</strong>：</p>
<ul>
<li>📸 图片内容审核</li>
<li>🛒 电商商品分析</li>
<li>📄 文档OCR + 理解</li>
<li>🎨 艺术作品鉴赏</li>
</ul>
<h3 id="32-gemini-20google">3.2 Gemini 2.0（Google）</h3>
<p><strong>特点</strong>：原生多模态，支持超长视频</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">google.generativeai</span> <span class="k">as</span> <span class="nn">genai</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">genai</span><span class="o">.</span><span class="n">configure</span><span class="p">(</span><span class="n">api_key</span><span class="o">=</span><span class="s2">&#34;YOUR_API_KEY&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Gemini的杀手锏：理解长视频</span>
</span></span><span class="line"><span class="cl"><span class="n">model</span> <span class="o">=</span> <span class="n">genai</span><span class="o">.</span><span class="n">GenerativeModel</span><span class="p">(</span><span class="s1">&#39;gemini-2.0-flash&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 上传一个1小时的会议录像</span>
</span></span><span class="line"><span class="cl"><span class="n">video_file</span> <span class="o">=</span> <span class="n">genai</span><span class="o">.</span><span class="n">upload_file</span><span class="p">(</span><span class="n">path</span><span class="o">=</span><span class="s2">&#34;meeting.mp4&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 让AI总结会议内容</span>
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">generate_content</span><span class="p">([</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;请总结这次会议的关键决策和行动项&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">video_file</span>
</span></span><span class="line"><span class="cl"><span class="p">])</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">response</span><span class="o">.</span><span class="n">text</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># 输出: &#34;会议主要讨论了Q4产品路线图：</span>
</span></span><span class="line"><span class="cl"><span class="c1">#        1. 决定推迟Feature A的发布至明年Q1</span>
</span></span><span class="line"><span class="cl"><span class="c1">#        2. 增加移动端开发资源</span>
</span></span><span class="line"><span class="cl"><span class="c1">#        3. 行动项：@张三 本周完成技术方案</span>
</span></span><span class="line"><span class="cl"><span class="c1">#        ...&#34;</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p><strong>Gemini的优势</strong>：</p>
<table>
  <thead>
      <tr>
          <th>能力</th>
          <th>说明</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>长上下文</td>
          <td>支持100万token（约750小时音频）</td>
      </tr>
      <tr>
          <td>原生多模态</td>
          <td>不是&quot;拼接&quot;，而是从底层设计</td>
      </tr>
      <tr>
          <td>实时交互</td>
          <td>支持语音对话</td>
      </tr>
      <tr>
          <td>多语言</td>
          <td>支持100+种语言</td>
      </tr>
  </tbody>
</table>
<h3 id="33-claude-35anthropic">3.3 Claude 3.5（Anthropic）</h3>
<p><strong>特点</strong>：最强的视觉推理能力</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span><span class="lnt">27
</span><span class="lnt">28
</span><span class="lnt">29
</span><span class="lnt">30
</span><span class="lnt">31
</span><span class="lnt">32
</span><span class="lnt">33
</span><span class="lnt">34
</span><span class="lnt">35
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">anthropic</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">client</span> <span class="o">=</span> <span class="n">anthropic</span><span class="o">.</span><span class="n">Anthropic</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Claude擅长复杂的视觉推理</span>
</span></span><span class="line"><span class="cl"><span class="n">message</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">messages</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">model</span><span class="o">=</span><span class="s2">&#34;claude-3-5-sonnet-20241022&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">max_tokens</span><span class="o">=</span><span class="mi">1024</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
</span></span><span class="line"><span class="cl">        <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">                <span class="p">{</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;image&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;source&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">                        <span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;base64&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                        <span class="s2">&#34;media_type&#34;</span><span class="p">:</span> <span class="s2">&#34;image/jpeg&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                        <span class="s2">&#34;data&#34;</span><span class="p">:</span> <span class="n">base64_image</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                    <span class="p">},</span>
</span></span><span class="line"><span class="cl">                <span class="p">},</span>
</span></span><span class="line"><span class="cl">                <span class="p">{</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;text&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34;这个电路图有什么问题？&#34;</span>
</span></span><span class="line"><span class="cl">                <span class="p">}</span>
</span></span><span class="line"><span class="cl">            <span class="p">],</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">],</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">message</span><span class="o">.</span><span class="n">content</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">text</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># 输出: &#34;电路图中存在以下问题：</span>
</span></span><span class="line"><span class="cl"><span class="c1">#        1. R2电阻的阻值标注错误（应该是10kΩ而不是1kΩ）</span>
</span></span><span class="line"><span class="cl"><span class="c1">#        2. C1电容的极性接反了</span>
</span></span><span class="line"><span class="cl"><span class="c1">#        3. 缺少保护二极管</span>
</span></span><span class="line"><span class="cl"><span class="c1">#        建议修改...&#34;</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p><strong>Claude的杀手锏</strong>：</p>
<ul>
<li>🧠 <strong>深度推理</strong>：能理解复杂的图表、代码截图</li>
<li>📊 <strong>数据分析</strong>：从图表中提取数据并分析</li>
<li>🔍 <strong>细节捕捉</strong>：能发现图片中的细微错误</li>
</ul>
<h3 id="34-qwen-vl阿里">3.4 Qwen-VL（阿里）</h3>
<p><strong>特点</strong>：开源、中文友好</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">AutoModelForCausalLM</span><span class="p">,</span> <span class="n">AutoTokenizer</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 加载Qwen-VL模型</span>
</span></span><span class="line"><span class="cl"><span class="n">model</span> <span class="o">=</span> <span class="n">AutoModelForCausalLM</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Qwen/Qwen-VL-Chat&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">device_map</span><span class="o">=</span><span class="s2">&#34;auto&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">trust_remote_code</span><span class="o">=</span><span class="kc">True</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Qwen/Qwen-VL-Chat&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">trust_remote_code</span><span class="o">=</span><span class="kc">True</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 中文图片问答</span>
</span></span><span class="line"><span class="cl"><span class="n">query</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">from_list_format</span><span class="p">([</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span><span class="s1">&#39;image&#39;</span><span class="p">:</span> <span class="s1">&#39;https://example.com/image.jpg&#39;</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span><span class="s1">&#39;text&#39;</span><span class="p">:</span> <span class="s1">&#39;图片里的人在做什么？&#39;</span><span class="p">},</span>
</span></span><span class="line"><span class="cl"><span class="p">])</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">response</span><span class="p">,</span> <span class="n">history</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">chat</span><span class="p">(</span><span class="n">tokenizer</span><span class="p">,</span> <span class="n">query</span><span class="o">=</span><span class="n">query</span><span class="p">,</span> <span class="n">history</span><span class="o">=</span><span class="kc">None</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">response</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># 输出: &#34;图片中有两个人在打羽毛球，背景是室内体育馆&#34;</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p><strong>Qwen-VL的优势</strong>：</p>
<ul>
<li>✅ 完全开源（可本地部署）</li>
<li>✅ 中文理解优秀</li>
<li>✅ 支持细粒度定位（能标注图片中的具体位置）</li>
</ul>
<hr>
<h2 id="第四章多模态ai的黑科技应用">第四章：多模态AI的「黑科技」应用</h2>
<h3 id="41-应用一智能购物助手">4.1 应用一：智能购物助手</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span><span class="lnt">27
</span><span class="lnt">28
</span><span class="lnt">29
</span><span class="lnt">30
</span><span class="lnt">31
</span><span class="lnt">32
</span><span class="lnt">33
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">SmartShoppingAssistant</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;拍照即可搜索商品&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">vision_model</span> <span class="o">=</span> <span class="n">GPT4V</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">search_engine</span> <span class="o">=</span> <span class="n">TaobaoAPI</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">find_product</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">image</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;通过图片找商品&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># Step 1: AI识别商品</span>
</span></span><span class="line"><span class="cl">        <span class="n">description</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">vision_model</span><span class="o">.</span><span class="n">describe</span><span class="p">(</span><span class="n">image</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># &#34;这是一双白色的Nike Air Force 1运动鞋，鞋码约为42&#34;</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># Step 2: 提取关键信息</span>
</span></span><span class="line"><span class="cl">        <span class="n">keywords</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">vision_model</span><span class="o">.</span><span class="n">extract_keywords</span><span class="p">(</span><span class="n">description</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># [&#34;Nike&#34;, &#34;Air Force 1&#34;, &#34;白色&#34;, &#34;42码&#34;]</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># Step 3: 搜索商品</span>
</span></span><span class="line"><span class="cl">        <span class="n">products</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">search_engine</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">keywords</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># Step 4: 匹配相似度</span>
</span></span><span class="line"><span class="cl">        <span class="n">best_match</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">vision_model</span><span class="o">.</span><span class="n">find_most_similar</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">image</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="p">[</span><span class="n">p</span><span class="o">.</span><span class="n">image</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">products</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">best_match</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 使用</span>
</span></span><span class="line"><span class="cl"><span class="n">assistant</span> <span class="o">=</span> <span class="n">SmartShoppingAssistant</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="n">result</span> <span class="o">=</span> <span class="n">assistant</span><span class="o">.</span><span class="n">find_product</span><span class="p">(</span><span class="s2">&#34;shoe_photo.jpg&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;找到商品：</span><span class="si">{</span><span class="n">result</span><span class="o">.</span><span class="n">name</span><span class="si">}</span><span class="s2">，价格：¥</span><span class="si">{</span><span class="n">result</span><span class="o">.</span><span class="n">price</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p><strong>真实案例</strong>：</p>
<ul>
<li>📱 <strong>Google Lens</strong>：拍照搜索任何东西</li>
<li>🛍️ <strong>淘宝拍立淘</strong>：拍照找同款</li>
<li>👗 <strong>小红书识图</strong>：找穿搭灵感</li>
</ul>
<h3 id="42-应用二ai医生助手">4.2 应用二：AI医生助手</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span><span class="lnt">27
</span><span class="lnt">28
</span><span class="lnt">29
</span><span class="lnt">30
</span><span class="lnt">31
</span><span class="lnt">32
</span><span class="lnt">33
</span><span class="lnt">34
</span><span class="lnt">35
</span><span class="lnt">36
</span><span class="lnt">37
</span><span class="lnt">38
</span><span class="lnt">39
</span><span class="lnt">40
</span><span class="lnt">41
</span><span class="lnt">42
</span><span class="lnt">43
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">MedicalAIAssistant</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;辅助医生诊断&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">analyze_xray</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">xray_image</span><span class="p">,</span> <span class="n">patient_info</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;分析X光片&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># 多模态输入</span>
</span></span><span class="line"><span class="cl">        <span class="n">inputs</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="s2">&#34;image&#34;</span><span class="p">:</span> <span class="n">xray_image</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="s2">&#34;text&#34;</span><span class="p">:</span> <span class="sa">f</span><span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="cl"><span class="s2">                患者信息：
</span></span></span><span class="line"><span class="cl"><span class="s2">                - 年龄：</span><span class="si">{</span><span class="n">patient_info</span><span class="p">[</span><span class="s1">&#39;age&#39;</span><span class="p">]</span><span class="si">}</span><span class="s2">
</span></span></span><span class="line"><span class="cl"><span class="s2">                - 性别：</span><span class="si">{</span><span class="n">patient_info</span><span class="p">[</span><span class="s1">&#39;gender&#39;</span><span class="p">]</span><span class="si">}</span><span class="s2">
</span></span></span><span class="line"><span class="cl"><span class="s2">                - 症状：</span><span class="si">{</span><span class="n">patient_info</span><span class="p">[</span><span class="s1">&#39;symptoms&#39;</span><span class="p">]</span><span class="si">}</span><span class="s2">
</span></span></span><span class="line"><span class="cl"><span class="s2">                - 病史：</span><span class="si">{</span><span class="n">patient_info</span><span class="p">[</span><span class="s1">&#39;history&#39;</span><span class="p">]</span><span class="si">}</span><span class="s2">
</span></span></span><span class="line"><span class="cl"><span class="s2">            &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># AI分析</span>
</span></span><span class="line"><span class="cl">        <span class="n">analysis</span> <span class="o">=</span> <span class="n">multimodal_ai</span><span class="o">.</span><span class="n">analyze</span><span class="p">(</span><span class="n">inputs</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="s2">&#34;findings&#34;</span><span class="p">:</span> <span class="n">analysis</span><span class="o">.</span><span class="n">findings</span><span class="p">,</span>      <span class="c1"># 发现的异常</span>
</span></span><span class="line"><span class="cl">            <span class="s2">&#34;diagnosis&#34;</span><span class="p">:</span> <span class="n">analysis</span><span class="o">.</span><span class="n">diagnosis</span><span class="p">,</span>    <span class="c1"># 初步诊断</span>
</span></span><span class="line"><span class="cl">            <span class="s2">&#34;confidence&#34;</span><span class="p">:</span> <span class="n">analysis</span><span class="o">.</span><span class="n">confidence</span><span class="p">,</span>  <span class="c1"># 置信度</span>
</span></span><span class="line"><span class="cl">            <span class="s2">&#34;recommendations&#34;</span><span class="p">:</span> <span class="n">analysis</span><span class="o">.</span><span class="n">recommendations</span>  <span class="c1"># 建议</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 使用示例</span>
</span></span><span class="line"><span class="cl"><span class="n">patient</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;age&#34;</span><span class="p">:</span> <span class="mi">45</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;gender&#34;</span><span class="p">:</span> <span class="s2">&#34;男&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;symptoms&#34;</span><span class="p">:</span> <span class="s2">&#34;胸痛、咳嗽&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;history&#34;</span><span class="p">:</span> <span class="s2">&#34;吸烟20年&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">result</span> <span class="o">=</span> <span class="n">assistant</span><span class="o">.</span><span class="n">analyze_xray</span><span class="p">(</span><span class="s2">&#34;chest_xray.jpg&#34;</span><span class="p">,</span> <span class="n">patient</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;发现：</span><span class="si">{</span><span class="n">result</span><span class="p">[</span><span class="s1">&#39;findings&#39;</span><span class="p">]</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;建议：</span><span class="si">{</span><span class="n">result</span><span class="p">[</span><span class="s1">&#39;recommendations&#39;</span><span class="p">]</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># 输出:</span>
</span></span><span class="line"><span class="cl"><span class="c1"># 发现：左肺下叶可见片状阴影</span>
</span></span><span class="line"><span class="cl"><span class="c1"># 建议：建议进行CT检查以进一步确认，排除肺部感染或肿瘤</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p><strong>注意</strong>：AI只是辅助工具，最终诊断必须由专业医生做出！</p>
<h3 id="43-应用三智能监控">4.3 应用三：智能监控</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span><span class="lnt">27
</span><span class="lnt">28
</span><span class="lnt">29
</span><span class="lnt">30
</span><span class="lnt">31
</span><span class="lnt">32
</span><span class="lnt">33
</span><span class="lnt">34
</span><span class="lnt">35
</span><span class="lnt">36
</span><span class="lnt">37
</span><span class="lnt">38
</span><span class="lnt">39
</span><span class="lnt">40
</span><span class="lnt">41
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">SmartSecuritySystem</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;智能安防系统&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">video_model</span> <span class="o">=</span> <span class="n">Gemini2</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">alert_system</span> <span class="o">=</span> <span class="n">AlertSystem</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">async</span> <span class="k">def</span> <span class="nf">monitor_camera</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">camera_stream</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;实时监控摄像头&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="k">while</span> <span class="kc">True</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="c1"># 获取视频帧</span>
</span></span><span class="line"><span class="cl">            <span class="n">frame</span> <span class="o">=</span> <span class="k">await</span> <span class="n">camera_stream</span><span class="o">.</span><span class="n">get_frame</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">            
</span></span><span class="line"><span class="cl">            <span class="c1"># AI分析</span>
</span></span><span class="line"><span class="cl">            <span class="n">analysis</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">video_model</span><span class="o">.</span><span class="n">analyze</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">                <span class="n">frame</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="n">prompt</span><span class="o">=</span><span class="s2">&#34;检测是否有异常行为：打架、摔倒、闯入等&#34;</span>
</span></span><span class="line"><span class="cl">            <span class="p">)</span>
</span></span><span class="line"><span class="cl">            
</span></span><span class="line"><span class="cl">            <span class="c1"># 发现异常</span>
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="n">analysis</span><span class="o">.</span><span class="n">has_anomaly</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="c1"># 生成详细报告</span>
</span></span><span class="line"><span class="cl">                <span class="n">report</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">video_model</span><span class="o">.</span><span class="n">generate_report</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">                    <span class="n">frame</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                    <span class="n">prompt</span><span class="o">=</span><span class="sa">f</span><span class="s2">&#34;详细描述发生了什么：</span><span class="si">{</span><span class="n">analysis</span><span class="o">.</span><span class="n">anomaly_type</span><span class="si">}</span><span class="s2">&#34;</span>
</span></span><span class="line"><span class="cl">                <span class="p">)</span>
</span></span><span class="line"><span class="cl">                
</span></span><span class="line"><span class="cl">                <span class="c1"># 发送警报</span>
</span></span><span class="line"><span class="cl">                <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">alert_system</span><span class="o">.</span><span class="n">send_alert</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">                    <span class="nb">type</span><span class="o">=</span><span class="n">analysis</span><span class="o">.</span><span class="n">anomaly_type</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                    <span class="n">description</span><span class="o">=</span><span class="n">report</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                    <span class="n">image</span><span class="o">=</span><span class="n">frame</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                    <span class="n">timestamp</span><span class="o">=</span><span class="n">datetime</span><span class="o">.</span><span class="n">now</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">                <span class="p">)</span>
</span></span><span class="line"><span class="cl">            
</span></span><span class="line"><span class="cl">            <span class="k">await</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>  <span class="c1"># 每秒分析一次</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 部署</span>
</span></span><span class="line"><span class="cl"><span class="n">system</span> <span class="o">=</span> <span class="n">SmartSecuritySystem</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="k">await</span> <span class="n">system</span><span class="o">.</span><span class="n">monitor_camera</span><span class="p">(</span><span class="n">camera</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p><strong>实际效果</strong>：</p>
<table>
  <thead>
      <tr>
          <th>传统监控</th>
          <th>AI监控</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>需要人工24小时盯着屏幕</td>
          <td>AI自动监控，只在异常时报警</td>
      </tr>
      <tr>
          <td>只能事后回看录像</td>
          <td>实时检测并预警</td>
      </tr>
      <tr>
          <td>无法理解复杂场景</td>
          <td>能识别&quot;打架&quot;&ldquo;摔倒&quot;等行为</td>
      </tr>
  </tbody>
</table>
<h3 id="44-应用四教育辅导">4.4 应用四：教育辅导</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span><span class="lnt">27
</span><span class="lnt">28
</span><span class="lnt">29
</span><span class="lnt">30
</span><span class="lnt">31
</span><span class="lnt">32
</span><span class="lnt">33
</span><span class="lnt">34
</span><span class="lnt">35
</span><span class="lnt">36
</span><span class="lnt">37
</span><span class="lnt">38
</span><span class="lnt">39
</span><span class="lnt">40
</span><span class="lnt">41
</span><span class="lnt">42
</span><span class="lnt">43
</span><span class="lnt">44
</span><span class="lnt">45
</span><span class="lnt">46
</span><span class="lnt">47
</span><span class="lnt">48
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">AITutor</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;AI家教&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">help_with_homework</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">homework_image</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;帮助解答作业&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># Step 1: OCR识别题目</span>
</span></span><span class="line"><span class="cl">        <span class="n">problem</span> <span class="o">=</span> <span class="n">vision_model</span><span class="o">.</span><span class="n">extract_text</span><span class="p">(</span><span class="n">homework_image</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># Step 2: 理解题目类型</span>
</span></span><span class="line"><span class="cl">        <span class="n">problem_type</span> <span class="o">=</span> <span class="n">vision_model</span><span class="o">.</span><span class="n">classify</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">homework_image</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">categories</span><span class="o">=</span><span class="p">[</span><span class="s2">&#34;数学&#34;</span><span class="p">,</span> <span class="s2">&#34;物理&#34;</span><span class="p">,</span> <span class="s2">&#34;化学&#34;</span><span class="p">,</span> <span class="s2">&#34;语文&#34;</span><span class="p">,</span> <span class="s2">&#34;英语&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># Step 3: 生成解答</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">problem_type</span> <span class="o">==</span> <span class="s2">&#34;数学&#34;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="c1"># 识别手写公式</span>
</span></span><span class="line"><span class="cl">            <span class="n">equation</span> <span class="o">=</span> <span class="n">vision_model</span><span class="o">.</span><span class="n">parse_math</span><span class="p">(</span><span class="n">homework_image</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            
</span></span><span class="line"><span class="cl">            <span class="c1"># 逐步求解</span>
</span></span><span class="line"><span class="cl">            <span class="n">solution</span> <span class="o">=</span> <span class="n">math_solver</span><span class="o">.</span><span class="n">solve_step_by_step</span><span class="p">(</span><span class="n">equation</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;problem&#34;</span><span class="p">:</span> <span class="n">equation</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;steps&#34;</span><span class="p">:</span> <span class="n">solution</span><span class="o">.</span><span class="n">steps</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;answer&#34;</span><span class="p">:</span> <span class="n">solution</span><span class="o">.</span><span class="n">answer</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;explanation&#34;</span><span class="p">:</span> <span class="n">solution</span><span class="o">.</span><span class="n">explanation</span>
</span></span><span class="line"><span class="cl">            <span class="p">}</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="k">elif</span> <span class="n">problem_type</span> <span class="o">==</span> <span class="s2">&#34;英语&#34;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="c1"># 识别作文</span>
</span></span><span class="line"><span class="cl">            <span class="n">essay</span> <span class="o">=</span> <span class="n">vision_model</span><span class="o">.</span><span class="n">extract_text</span><span class="p">(</span><span class="n">homework_image</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            
</span></span><span class="line"><span class="cl">            <span class="c1"># 批改作文</span>
</span></span><span class="line"><span class="cl">            <span class="n">feedback</span> <span class="o">=</span> <span class="n">english_tutor</span><span class="o">.</span><span class="n">grade_essay</span><span class="p">(</span><span class="n">essay</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;score&#34;</span><span class="p">:</span> <span class="n">feedback</span><span class="o">.</span><span class="n">score</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;grammar_errors&#34;</span><span class="p">:</span> <span class="n">feedback</span><span class="o">.</span><span class="n">grammar_errors</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;suggestions&#34;</span><span class="p">:</span> <span class="n">feedback</span><span class="o">.</span><span class="n">suggestions</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;corrected_version&#34;</span><span class="p">:</span> <span class="n">feedback</span><span class="o">.</span><span class="n">corrected_essay</span>
</span></span><span class="line"><span class="cl">            <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 使用</span>
</span></span><span class="line"><span class="cl"><span class="n">tutor</span> <span class="o">=</span> <span class="n">AITutor</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="n">result</span> <span class="o">=</span> <span class="n">tutor</span><span class="o">.</span><span class="n">help_with_homework</span><span class="p">(</span><span class="s2">&#34;homework.jpg&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p><strong>真实产品</strong>：</p>
<ul>
<li>📱 <strong>小猿搜题</strong>：拍照搜题</li>
<li>📝 <strong>作业帮</strong>：AI批改作业</li>
<li>🎓 <strong>Khan Academy</strong>：个性化辅导</li>
</ul>
<hr>
<h2 id="第五章多模态ai的技术原理简化版">第五章：多模态AI的技术原理（简化版）</h2>
<h3 id="51-核心架构">5.1 核心架构</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span><span class="lnt">27
</span><span class="lnt">28
</span><span class="lnt">29
</span><span class="lnt">30
</span><span class="lnt">31
</span><span class="lnt">32
</span><span class="lnt">33
</span><span class="lnt">34
</span><span class="lnt">35
</span><span class="lnt">36
</span><span class="lnt">37
</span><span class="lnt">38
</span><span class="lnt">39
</span><span class="lnt">40
</span><span class="lnt">41
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">MultimodalAI</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;多模态AI的基本架构&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># 各模态的编码器</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">text_encoder</span> <span class="o">=</span> <span class="n">TextEncoder</span><span class="p">()</span>      <span class="c1"># BERT, GPT</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">image_encoder</span> <span class="o">=</span> <span class="n">ImageEncoder</span><span class="p">()</span>    <span class="c1"># ViT, CLIP</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">audio_encoder</span> <span class="o">=</span> <span class="n">AudioEncoder</span><span class="p">()</span>    <span class="c1"># Whisper</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">video_encoder</span> <span class="o">=</span> <span class="n">VideoEncoder</span><span class="p">()</span>    <span class="c1"># VideoMAE</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># 融合层</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">fusion_layer</span> <span class="o">=</span> <span class="n">MultimodalFusion</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># 解码器</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">decoder</span> <span class="o">=</span> <span class="n">UnifiedDecoder</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">process</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">inputs</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;处理多模态输入&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># Step 1: 各模态编码</span>
</span></span><span class="line"><span class="cl">        <span class="n">embeddings</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="s2">&#34;text&#34;</span> <span class="ow">in</span> <span class="n">inputs</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">text_emb</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">text_encoder</span><span class="p">(</span><span class="n">inputs</span><span class="p">[</span><span class="s2">&#34;text&#34;</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">            <span class="n">embeddings</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">text_emb</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="s2">&#34;image&#34;</span> <span class="ow">in</span> <span class="n">inputs</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">image_emb</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">image_encoder</span><span class="p">(</span><span class="n">inputs</span><span class="p">[</span><span class="s2">&#34;image&#34;</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">            <span class="n">embeddings</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">image_emb</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="s2">&#34;audio&#34;</span> <span class="ow">in</span> <span class="n">inputs</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">audio_emb</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">audio_encoder</span><span class="p">(</span><span class="n">inputs</span><span class="p">[</span><span class="s2">&#34;audio&#34;</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">            <span class="n">embeddings</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">audio_emb</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># Step 2: 融合</span>
</span></span><span class="line"><span class="cl">        <span class="n">fused_embedding</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">fusion_layer</span><span class="p">(</span><span class="n">embeddings</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># Step 3: 解码生成输出</span>
</span></span><span class="line"><span class="cl">        <span class="n">output</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">decoder</span><span class="p">(</span><span class="n">fused_embedding</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">output</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="52-关键技术clip">5.2 关键技术：CLIP</h3>
<p><strong>CLIP = 连接图像和文字的桥梁</strong></p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span><span class="lnt">27
</span><span class="lnt">28
</span><span class="lnt">29
</span><span class="lnt">30
</span><span class="lnt">31
</span><span class="lnt">32
</span><span class="lnt">33
</span><span class="lnt">34
</span><span class="lnt">35
</span><span class="lnt">36
</span><span class="lnt">37
</span><span class="lnt">38
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># CLIP的训练方式</span>
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">CLIP</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">image_encoder</span> <span class="o">=</span> <span class="n">ViT</span><span class="p">()</span>  <span class="c1"># Vision Transformer</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">text_encoder</span> <span class="o">=</span> <span class="n">Transformer</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">train</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">image_text_pairs</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;对比学习&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="n">image</span><span class="p">,</span> <span class="n">text</span> <span class="ow">in</span> <span class="n">image_text_pairs</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="c1"># 编码</span>
</span></span><span class="line"><span class="cl">            <span class="n">image_emb</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">image_encoder</span><span class="p">(</span><span class="n">image</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="n">text_emb</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">text_encoder</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            
</span></span><span class="line"><span class="cl">            <span class="c1"># 目标：匹配的图文对相似度高，不匹配的相似度低</span>
</span></span><span class="line"><span class="cl">            <span class="n">similarity</span> <span class="o">=</span> <span class="n">cosine_similarity</span><span class="p">(</span><span class="n">image_emb</span><span class="p">,</span> <span class="n">text_emb</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            
</span></span><span class="line"><span class="cl">            <span class="c1"># 损失函数</span>
</span></span><span class="line"><span class="cl">            <span class="n">loss</span> <span class="o">=</span> <span class="n">contrastive_loss</span><span class="p">(</span><span class="n">similarity</span><span class="p">,</span> <span class="n">is_match</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            
</span></span><span class="line"><span class="cl">            <span class="c1"># 反向传播</span>
</span></span><span class="line"><span class="cl">            <span class="n">loss</span><span class="o">.</span><span class="n">backward</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 使用CLIP</span>
</span></span><span class="line"><span class="cl"><span class="n">clip</span> <span class="o">=</span> <span class="n">CLIP</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 图片搜索</span>
</span></span><span class="line"><span class="cl"><span class="n">image</span> <span class="o">=</span> <span class="n">load_image</span><span class="p">(</span><span class="s2">&#34;cat.jpg&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">texts</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;一只猫&#34;</span><span class="p">,</span> <span class="s2">&#34;一只狗&#34;</span><span class="p">,</span> <span class="s2">&#34;一辆车&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 计算相似度</span>
</span></span><span class="line"><span class="cl"><span class="n">similarities</span> <span class="o">=</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="n">clip</span><span class="o">.</span><span class="n">similarity</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">text</span> <span class="ow">in</span> <span class="n">texts</span>
</span></span><span class="line"><span class="cl"><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">best_match</span> <span class="o">=</span> <span class="n">texts</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">similarities</span><span class="p">)]</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">best_match</span><span class="p">)</span>  <span class="c1"># 输出: &#34;一只猫&#34;</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="53-训练数据规模">5.3 训练数据规模</h3>
<p><strong>多模态AI需要海量数据</strong>：</p>
<table>
  <thead>
      <tr>
          <th>模型</th>
          <th>训练数据规模</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CLIP</td>
          <td>4亿图文对</td>
      </tr>
      <tr>
          <td>GPT-4V</td>
          <td>未公开（估计万亿级token）</td>
      </tr>
      <tr>
          <td>Gemini 2.0</td>
          <td>未公开（包含YouTube全部视频）</td>
      </tr>
      <tr>
          <td>Qwen-VL</td>
          <td>15亿图文对</td>
      </tr>
  </tbody>
</table>
<p><strong>为什么需要这么多数据？</strong></p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># 多模态AI要学习的映射关系</span>
</span></span><span class="line"><span class="cl"><span class="n">mappings</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;图片中的猫&#34;</span> <span class="err">↔</span> <span class="s2">&#34;文字&#39;猫&#39;&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;笑脸表情&#34;</span> <span class="err">↔</span> <span class="s2">&#34;开心的情绪&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;红色&#34;</span> <span class="err">↔</span> <span class="s2">&#34;热情、危险、停止&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;钢琴声&#34;</span> <span class="err">↔</span> <span class="s2">&#34;优雅、古典&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># ... 数十亿种映射关系</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></td></tr></table>
</div>
</div><hr>
<h2 id="第六章多模态ai的挑战">第六章：多模态AI的挑战</h2>
<h3 id="61-挑战一幻觉hallucination">6.1 挑战一：幻觉（Hallucination）</h3>
<p><strong>问题</strong>：AI有时会&quot;看到&quot;不存在的东西</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># 真实案例</span>
</span></span><span class="line"><span class="cl"><span class="n">image</span> <span class="o">=</span> <span class="s2">&#34;empty_room.jpg&#34;</span>  <span class="c1"># 一个空房间的照片</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">ai</span><span class="o">.</span><span class="n">describe</span><span class="p">(</span><span class="n">image</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">response</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># 错误输出: &#34;房间里有一张桌子和两把椅子&#34;</span>
</span></span><span class="line"><span class="cl"><span class="c1"># （实际上房间是空的！）</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p><strong>原因</strong>：</p>
<ul>
<li>AI基于概率预测，会&quot;脑补&quot;常见物品</li>
<li>训练数据中的偏见</li>
</ul>
<p><strong>解决方案</strong>：</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># 使用置信度阈值</span>
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">ai</span><span class="o">.</span><span class="n">describe</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="n">min_confidence</span><span class="o">=</span><span class="mf">0.8</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 或者要求AI标注不确定的部分</span>
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">ai</span><span class="o">.</span><span class="n">describe</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">image</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">instruction</span><span class="o">=</span><span class="s2">&#34;如果不确定，请说&#39;不确定&#39;而不是猜测&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="62-挑战二计算成本">6.2 挑战二：计算成本</h3>
<p><strong>多模态AI非常&quot;烧钱&rdquo;</strong>：</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># 成本对比</span>
</span></span><span class="line"><span class="cl"><span class="n">costs</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;纯文本&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;GPT-4&#34;</span><span class="p">:</span> <span class="s2">&#34;$0.03 / 1K tokens&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;Claude&#34;</span><span class="p">:</span> <span class="s2">&#34;$0.015 / 1K tokens&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;多模态&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;GPT-4V&#34;</span><span class="p">:</span> <span class="s2">&#34;$0.01 / image + $0.03 / 1K tokens&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;Gemini Pro Vision&#34;</span><span class="p">:</span> <span class="s2">&#34;$0.0025 / image&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 处理1000张图片 + 对话</span>
</span></span><span class="line"><span class="cl"><span class="n">text_only_cost</span> <span class="o">=</span> <span class="mf">0.03</span> <span class="o">*</span> <span class="mi">10</span>  <span class="c1"># $0.30</span>
</span></span><span class="line"><span class="cl"><span class="n">multimodal_cost</span> <span class="o">=</span> <span class="mf">0.01</span> <span class="o">*</span> <span class="mi">1000</span> <span class="o">+</span> <span class="mf">0.03</span> <span class="o">*</span> <span class="mi">10</span>  <span class="c1"># $10.30</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;多模态成本是纯文本的 </span><span class="si">{</span><span class="n">multimodal_cost</span> <span class="o">/</span> <span class="n">text_only_cost</span><span class="si">:</span><span class="s2">.0f</span><span class="si">}</span><span class="s2"> 倍&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># 输出: 多模态成本是纯文本的 34 倍</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="63-挑战三隐私和安全">6.3 挑战三：隐私和安全</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># 风险场景</span>
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">PrivacyRisks</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">risks</span> <span class="o">=</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;人脸识别 → 隐私泄露&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;医疗图像 → 敏感信息&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;监控视频 → 滥用风险&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;深度伪造 → 虚假信息&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">]</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="c1"># 防护措施</span>
</span></span><span class="line"><span class="cl">    <span class="n">protections</span> <span class="o">=</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;数据脱敏&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;本地部署（不上传云端）&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;访问控制&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;水印技术&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">]</span>
</span></span></code></pre></td></tr></table>
</div>
</div><hr>
<h2 id="第七章未来展望">第七章：未来展望</h2>
<h3 id="71-2026年预测">7.1 2026年预测</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">future_capabilities</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;2026&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;实时多模态对话（像人类一样边看边聊）&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;3D场景理解（理解空间关系）&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;情感识别（从表情、语气判断情绪）&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;跨模态生成（说一句话，生成视频）&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">],</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;2027&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;具身智能（机器人 + 多模态AI）&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;全感官AI（视觉+听觉+触觉+嗅觉）&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;实时翻译（包括手语、表情）&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;AI导演（自动拍摄剪辑视频）&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="72-终极目标通用人工智能agi">7.2 终极目标：通用人工智能（AGI）</h3>
<p><strong>多模态是通向AGI的必经之路</strong></p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># 人类的智能 = 多模态</span>
</span></span><span class="line"><span class="cl"><span class="n">human_intelligence</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;视觉&#34;</span><span class="p">:</span> <span class="s2">&#34;看&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;听觉&#34;</span><span class="p">:</span> <span class="s2">&#34;听&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;触觉&#34;</span><span class="p">:</span> <span class="s2">&#34;摸&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;嗅觉&#34;</span><span class="p">:</span> <span class="s2">&#34;闻&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;味觉&#34;</span><span class="p">:</span> <span class="s2">&#34;尝&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;综合&#34;</span><span class="p">:</span> <span class="s2">&#34;理解世界&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># AI要达到人类水平，必须也是多模态的</span>
</span></span><span class="line"><span class="cl"><span class="n">agi</span> <span class="o">=</span> <span class="n">MultimodalAI</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">vision</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">audio</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">touch</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>  <span class="c1"># 未来</span>
</span></span><span class="line"><span class="cl">    <span class="n">smell</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>  <span class="c1"># 未来</span>
</span></span><span class="line"><span class="cl">    <span class="n">taste</span><span class="o">=</span><span class="kc">True</span>   <span class="c1"># 未来</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><hr>
<h2 id="结语感知的革命">结语：感知的革命</h2>
<p><strong>多模态AI不仅仅是技术进步，它改变了AI与世界的交互方式。</strong></p>
<h3 id="从读到看">从「读」到「看」</h3>
<ul>
<li><strong>以前</strong>：AI只能读文字（像盲人）</li>
<li><strong>现在</strong>：AI能看、能听、能理解（像正常人）</li>
</ul>
<h3 id="从工具到伙伴">从「工具」到「伙伴」</h3>
<ul>
<li><strong>以前</strong>：AI是搜索引擎（你问我答）</li>
<li><strong>现在</strong>：AI是助手（能主动观察、理解、建议）</li>
</ul>
<h3 id="开发者的新机会">开发者的新机会</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># 你可以做的事情</span>
</span></span><span class="line"><span class="cl"><span class="n">opportunities</span> <span class="o">=</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;开发多模态应用（医疗、教育、安防）&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;训练垂直领域的多模态模型&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;创建多模态数据集&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;研究新的融合算法&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;探索新的应用场景&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">]</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p><strong>多模态AI的时代才刚刚开始。</strong></p>
<p><strong>你准备好了吗？</strong></p>
<hr>
<p><strong>快速开始</strong>：</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># 1. 试用GPT-4V</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">openai</span> <span class="kn">import</span> <span class="n">OpenAI</span>
</span></span><span class="line"><span class="cl"><span class="n">client</span> <span class="o">=</span> <span class="n">OpenAI</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="c1"># 上传图片，开始对话</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 2. 试用Gemini</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">google.generativeai</span> <span class="k">as</span> <span class="nn">genai</span>
</span></span><span class="line"><span class="cl"><span class="n">genai</span><span class="o">.</span><span class="n">configure</span><span class="p">(</span><span class="n">api_key</span><span class="o">=</span><span class="s2">&#34;YOUR_KEY&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># 上传视频，让AI总结</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 3. 本地部署Qwen-VL</span>
</span></span><span class="line"><span class="cl"><span class="c1"># git clone https://github.com/QwenLM/Qwen-VL</span>
</span></span><span class="line"><span class="cl"><span class="c1"># 完全免费，可商用</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p><strong>相关资源</strong>：</p>
<ul>
<li><a href="https://platform.openai.com/docs/guides/vision">OpenAI Vision Guide</a></li>
<li><a href="https://ai.google.dev/">Google Gemini</a></li>
<li><a href="https://github.com/QwenLM/Qwen-VL">Qwen-VL GitHub</a></li>
<li><a href="https://arxiv.org/abs/2103.00020">CLIP Paper</a></li>
</ul>
]]></content:encoded></item></channel></rss>