Qwen-Omni

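Overview

Qwen-Omni models are currently accessed through /v1/chat/completions; typical models include qwen3-omni-flash and qwen-omni-turbo. Ling.AI does not rewrite Omni-specific fields such as modalities and audio; it passes them through to the upstream unchanged, in the OpenAI-compatible format.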

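Streaming requirement

Qwen-Omni requests must use streaming output. That is, stream=True (Python SDK) or stream: true (JSON body) must be set; otherwise the upstream will usually return an error directly.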

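Input restrictions

A single User Message may contain text plus data of only one other modality. In other words, text can be combined with one of image, audio, or video, but several non-text modalities cannot be mixed in the same user message.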
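The restriction above can be checked on the client side before a request is sent. The following is a minimal sketch, not part of Ling.AI or the OpenAI SDK: the names MODALITY_OF, non_text_modalities, and check_user_message are hypothetical, and treating video_url and video as the same modality is an assumption.

```python
from typing import Any

# Non-text part types mapped to the modality they represent.
# Assumption: "video_url" and "video" (frame list) count as one modality.
MODALITY_OF = {
    "image_url": "image",
    "input_audio": "audio",
    "video_url": "video",
    "video": "video",
}


def non_text_modalities(content: Any) -> set[str]:
    """Return the set of non-text modalities found in one message's content."""
    if isinstance(content, str):
        # A plain string is text only.
        return set()
    return {
        MODALITY_OF[part["type"]]
        for part in content
        if part.get("type") in MODALITY_OF
    }


def check_user_message(message: dict) -> None:
    """Raise ValueError if one user message mixes several non-text modalities."""
    found = non_text_modalities(message.get("content", ""))
    if len(found) > 1:
        raise ValueError(f"mixed modalities in one user message: {sorted(found)}")
```

Calling check_user_message on each user message before sending turns a likely upstream error into a local one that names the offending modalities.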

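Basic usage

For the most basic scenarios, text output or text plus speech output, the modalities and audio parameters control which modalities are returned.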

Python

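from openai import OpenAI

# Point the OpenAI SDK at the Ling.AI OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.vip.lingapi.ai/v1",
    api_key="sk-xxxxxxxx"
)

completion = client.chat.completions.create(
    model="qwen3-omni-flash",
    messages=[
        {"role": "user", "content": "Who are you?"}
    ],
    modalities=["text", "audio"],  # request both text and audio output
    audio={"voice": "Cherry", "format": "wav"},
    stream=True,  # Qwen-Omni requires streaming
    stream_options={"include_usage": True}  # ask for usage in the stream
)

# Each chunk carries an incremental piece of the response.
for chunk in completion:
    print(chunk)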

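If only text output is needed, set modalities to ["text"]. When audio is returned, delta.audio.data in the response is usually a Base64-encoded audio fragment; the fragments can be concatenated and decoded after the stream ends.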
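The concatenate-then-decode step could look like the sketch below. It assumes the fragments, taken in arrival order, form one continuous Base64 string, as the text above suggests; join_audio_fragments is a hypothetical helper name, and how the decoded bytes are laid out (container or raw samples) depends on the upstream and is not covered here.

```python
import base64
from typing import Iterable, Optional


def join_audio_fragments(fragments: Iterable[Optional[str]]) -> bytes:
    """Concatenate Base64 text fragments in arrival order, then decode once.

    Assumption: the fragments are slices of a single Base64 string, so they
    are joined as text first instead of being decoded one by one.
    """
    encoded = "".join(fragment for fragment in fragments if fragment)
    return base64.b64decode(encoded)
```

While iterating the stream, each fragment would typically be read from the chunk's delta.audio data field when it is present, appended to a list, and passed to join_audio_fragments once the stream has finished.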

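Thinking mode

qwen3-omni-flash supports the enable_thinking parameter to turn thinking mode on or off; qwen-omni-turbo is not a thinking model. Note that Qwen3 Omni does not support audio output in thinking mode, so request only ["text"] output in that case.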

Python

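# Reuses the client created in the basic example.
completion = client.chat.completions.create(
    model="qwen3-omni-flash",
    messages=[{"role": "user", "content": "Who are you?"}],
    extra_body={"enable_thinking": True},  # non-standard field, sent via extra_body
    modalities=["text"],  # audio output is not supported in thinking mode
    stream=True
)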
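The streaming and thinking-mode rules can be gathered in one place so that an invalid combination is never sent. This is a minimal sketch under stated assumptions: build_omni_request is a hypothetical helper, and the default voice and format simply mirror the basic example.

```python
from typing import Any


def build_omni_request(
    model: str,
    messages: list,
    *,
    thinking: bool = False,
    want_audio: bool = False,
    voice: str = "Cherry",
) -> dict[str, Any]:
    """Build keyword arguments for client.chat.completions.create."""
    if thinking and want_audio:
        # Thinking mode and audio output cannot be combined.
        raise ValueError("audio output is not available in thinking mode")
    kwargs: dict[str, Any] = {
        "model": model,
        "messages": messages,
        "stream": True,  # streaming is always required
        "modalities": ["text", "audio"] if want_audio else ["text"],
    }
    if want_audio:
        kwargs["audio"] = {"voice": voice, "format": "wav"}
    if thinking:
        kwargs["extra_body"] = {"enable_thinking": True}
    return kwargs
```

A call would then read client.chat.completions.create(**build_omni_request("qwen3-omni-flash", messages, thinking=True)).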

Multimodal input

Qwen-Omni accepts text combined with one of image, audio, or video. Common input structures:

Image: image_url + text
Audio: input_audio + text
Video file: video_url + text
Video frame list: video + text

Python

# Audio input: one input_audio part plus one text part in the same user message.
completion = client.chat.completions.create(
    model="qwen3-omni-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": "https://example.com/hello.wav",
                        "format": "wav"
                    }
                },
                {
                    "type": "text",
                    "text": "What is being said in this audio?"
                }
            ]
        }
    ],
    modalities=["text"],
    stream=True
)

Python

# Video file input: one video_url part plus one text part.
completion = client.chat.completions.create(
    model="qwen3-omni-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://example.com/demo.mp4"
                    }
                },
                {
                    "type": "text",
                    "text": "What is the content of this video?"
                }
            ]
        }
    ],
    modalities=["text"],
    stream=True
)
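The two remaining structures from the list, image and video frame list, can be built the same way. The sketch below uses hypothetical helper names; the image_url part follows the standard OpenAI content-part shape, while the shape of the video part (a list of frame image URLs under the video key) is an assumption that should be checked against the upstream documentation.

```python
from typing import Iterable


def text_part(text: str) -> dict:
    """Text content part."""
    return {"type": "text", "text": text}


def image_part(url: str) -> dict:
    """Image content part in the OpenAI-compatible image_url shape."""
    return {"type": "image_url", "image_url": {"url": url}}


def video_frames_part(frame_urls: Iterable[str]) -> dict:
    """Video given as a list of frame images.

    Assumption: frames are passed as a list of URLs under the "video" key.
    """
    return {"type": "video", "video": list(frame_urls)}


def user_message(media_part: dict, question: str) -> dict:
    """One user message: exactly one non-text part followed by the text."""
    return {"role": "user", "content": [media_part, text_part(question)]}
```

For example, user_message(image_part("https://example.com/cat.png"), "What is in this picture?") yields a message that can be placed directly in the messages list; the URL here is a placeholder.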

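Multi-turn conversations

In multi-turn conversations, Assistant Messages should keep only text content. User Messages in different turns may each carry a different modality, but every single user message must still follow the "text + one other modality" restriction.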
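One way to follow this advice is to record only the accumulated reply text when a turn is added to the history, discarding any audio that was streamed back. The helper name append_turn below is hypothetical; this is a sketch, not a prescribed pattern.

```python
def append_turn(history: list, user_message: dict, assistant_text: str) -> list:
    """Return a new history with one more user/assistant exchange.

    The assistant entry holds text only, even if audio was also returned.
    """
    return history + [
        user_message,
        {"role": "assistant", "content": assistant_text},
    ]
```

Each later request then sends the returned history plus the next user message, which may carry a different modality than earlier turns.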

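Billing notes

Text, image, audio, and video in Qwen-Omni are usually billed by their respective Token counts.
If audio is returned, the audio output Tokens are included in the usage statistics, for example as completion_audio_tokens.
Video file input may incur both visual and audio costs; the upstream model's billing rules are authoritative.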
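When stream_options={"include_usage": True} is set, the usage statistics arrive inside the stream rather than in a single response body. The sketch below keeps the last non-empty usage object seen; final_usage is a hypothetical helper, and the demo chunks are stand-ins built with SimpleNamespace, not real API output. Which fields the usage object exposes depends on the upstream.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Optional


def final_usage(chunks: Iterable[Any]) -> Optional[Any]:
    """Return the last non-None usage attribute seen while consuming a stream."""
    usage = None
    for chunk in chunks:
        chunk_usage = getattr(chunk, "usage", None)
        if chunk_usage is not None:
            usage = chunk_usage
    return usage


# Stand-in chunks for illustration only: two content chunks without usage,
# followed by a closing chunk that carries a usage object.
demo_usage = SimpleNamespace(note="placeholder usage object")
demo_chunks = [
    SimpleNamespace(usage=None),
    SimpleNamespace(usage=None),
    SimpleNamespace(usage=demo_usage),
]
assert final_usage(demo_chunks) is demo_usage
```

In a real loop the same pass over the stream would also collect text and audio, since a stream can be iterated only once.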
