Gemini Chat Completions Batch Test (BatchTest)¶
📝 Overview¶
This document describes the capabilities and parameter behavior of the Gemini model family that can be batch-verified against the OpenAI Chat Completions compatible endpoint (POST /v1/chat/completions), including: thinking / reasoning, streaming SSE, function calling, response_format, long context, and common generation parameters. Default gateway: https://api-cs-al.naci-tech.com/v1.
1. Project Layout¶
The Gemini batch test only needs the following directory and files (output/ is generated automatically after a run):
ks_gemini/
├── requirements.txt
├── test_models.py # batch test entry point
└── output/ # created after a run; contains test_results.json
2. Install Dependencies¶
Follow these steps in order:
Step 1: Change into the ks_gemini directory (the one containing requirements.txt and test_models.py).
Step 2: Create and activate a virtual environment (optional but recommended):
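For example, on macOS/Linux (a sketch; on Windows use `.venv\Scripts\activate` instead):

```shell
python3 -m venv .venv
source .venv/bin/activate
```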
Step 3: Install the dependencies in the current directory (requirements.txt lives here):
The contents of requirements.txt:
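Based on the imports in test_models.py (httpx and python-dotenv), a minimal requirements.txt looks like this (versions unpinned; pin them as needed):

```
httpx
python-dotenv
```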
If requirements.txt is missing from this directory, install the packages directly:
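Assuming only the two third-party imports in test_models.py, the direct install is:

```shell
pip install httpx python-dotenv
```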
3. Configure Environment Variables¶
Step 1: Create a .env file in the ks_gemini directory or the project root (or export the variable in your current shell).
Step 2: Write or set the API key:
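The key name must match what test_models.py reads (API_DEMO_API_KEY); the value below is a placeholder:

```
API_DEMO_API_KEY=your_api_key_here
```

Alternatively, export it in the current shell: `export API_DEMO_API_KEY=your_api_key_here`.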
The script loads .env from the current or parent directory automatically via python-dotenv.
4. Run the Tests¶
Step 1: Change into the Gemini test directory:
Step 2: Run the tests (choose one):
- Run all scenarios (no arguments):
- Run only selected scenarios (pass one or more scenario names or aliases):
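Both invocations, using aliases from the scenario table later in this document:

```shell
python test_models.py                 # all scenarios
python test_models.py thinking fc     # only Thinking and Function Calling
python test_models.py ctx mt mct      # long context + truncation scenarios
```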
Step 3: Review the results: the console prints a model × scenario PASS/FAIL table, and the full results are written to ks_gemini/output/test_results.json.
Models Covered¶
- gemini-2.5-flash-lite
- gemini-2.5-flash
- gemini-2.5-pro
- gemini-3-flash-preview
- gemini-3-pro-preview

Optional: gemini-3-pro-image-preview (add it to the model list if you want to test it).
📦 Output¶
- Console: a model × scenario PASS/FAIL table plus summary information
- Result file: full results are written to
ks_gemini/output/test_results.json
🧪 Scenarios and Aliases¶
| Scenario | Description | Aliases |
|---|---|---|
| Thinking | thinking / reasoning output | thinking, think |
| Function Calling | tool calling (streaming) | fc, function |
| Tool Choice | tool_choice behavior comparison | tc, tool |
| JSON Object | response_format: json_object | so, json |
| JSON Schema | response_format: json_schema | js, schema |
| 200k Context | long-context stress test | ctx, 200k |
| max_tokens | max_tokens truncation behavior | mt |
| max_completion_tokens | max_completion_tokens truncation behavior | mct |
| Gen Params | stop / streaming usage | gp, params |
🧾 Request Payload Examples¶
The following are request bodies for POST https://api-cs-al.naci-tech.com/v1/chat/completions (each request must also carry a model field). Streaming responses must be parsed as SSE on the client, concatenating content and thinking/reasoning_content chunk by chunk and merging tool_calls by index.
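The client-side merging described above can be sketched as follows; the SSE lines here are synthetic stand-ins for a real streamed response:

```python
import json

# Synthetic SSE lines, standing in for a real streamed response body.
sse_lines = [
    'data: {"choices":[{"delta":{"reasoning_content":"thinking..."}}]}',
    'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    'data: {"choices":[{"delta":{"tool_calls":[{"index":0,"id":"call_1","function":{"name":"get_current_weather","arguments":"{\\"loc"}}]}}]}',
    'data: {"choices":[{"delta":{"tool_calls":[{"index":0,"function":{"arguments":"ation\\":\\"Beijing\\"}"}}]}}]}',
    'data: [DONE]',
]

content, thinking, tool_calls = "", "", {}
for line in sse_lines:
    if not line.startswith("data: "):
        continue
    data = line[len("data: "):]
    if data == "[DONE]":
        break
    delta = json.loads(data)["choices"][0].get("delta", {})
    content += delta.get("content") or ""
    # Some gateways emit "thinking", others "reasoning_content"; merge both.
    thinking += (delta.get("thinking") or "") + (delta.get("reasoning_content") or "")
    for tc in delta.get("tool_calls", []):
        # Arguments arrive in fragments; accumulate them per tool-call index.
        slot = tool_calls.setdefault(tc["index"], {"id": tc.get("id"), "function": {"name": "", "arguments": ""}})
        slot["function"]["name"] += tc["function"].get("name", "")
        slot["function"]["arguments"] += tc["function"].get("arguments", "")

print(content)                                 # Hello
print(tool_calls[0]["function"]["arguments"])  # {"location":"Beijing"}
```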
Thinking (reasoning content)¶
{
  "messages": [{ "role": "user", "content": "你是谁" }],
  "enable_thinking": true,
  "reasoning_effort": "low",
  "stream": false
}
Function Calling (tool calls, streaming)¶
{
  "thinking": { "type": "enabled", "budget_tokens": 4096 },
  "top_p": 0.95,
  "stream_options": { "include_thinking": true },
  "stream": true,
  "messages": [
    { "role": "user", "content": "北京天气怎么样,以及北京是几点钟?" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_time",
        "description": "当你想知道现在的时间时非常有用。",
        "parameters": { "type": "object", "properties": {} }
      }
    },
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "当你想查询指定城市的天气时非常有用。",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "城市或县区,比如北京市、杭州市、余杭区等。"
            }
          },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}
Tool Choice (required vs auto)¶
Send two requests and compare:
- tool_choice: "required"
- tool_choice: "auto"

Payload for the required variant (simplified tool definition):
{
  "temperature": 0.9,
  "stream": true,
  "messages": [{ "role": "user", "content": "杭州在哪个国家" }],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "parameters": {
          "type": "object",
          "properties": { "location": { "type": "string" } },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "required"
}
The auto variant only changes tool_choice to "auto".
JSON Object (structured output)¶
{
  "messages": [
    {
      "role": "user",
      "content": "generate a short json to describe the first digital computer, ENIAC, in the world."
    }
  ],
  "response_format": { "type": "json_object" }
}
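A quick client-side validity check mirrors what the batch script does: parse the returned content with json.loads. The content below is a stand-in for a real reply:

```python
import json

content = '{"name": "ENIAC", "year": 1945, "type": "digital computer"}'  # stand-in reply
try:
    data = json.loads(content)
    print("valid JSON object:", sorted(data))
except json.JSONDecodeError:
    print("model returned a non-JSON string")
```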
JSON Schema (structured output)¶
{
  "max_tokens": 1024,
  "messages": [
    {
      "role": "user",
      "content": "Extract the key information from this email: John Smith (john@example.com) is interested in our Enterprise plan and wants to schedule a demo for next Tuesday at 2pm."
    }
  ],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "email_extraction",
      "schema": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "email": { "type": "string" },
          "plan_interest": { "type": "string" },
          "demo_requested": { "type": "boolean" }
        },
        "required": ["name", "email", "plan_interest", "demo_requested"],
        "additionalProperties": false
      }
    }
  }
}
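For the schema case, the batch script additionally verifies that every required key is present. A sketch, with a stand-in reply:

```python
import json

required = ["name", "email", "plan_interest", "demo_requested"]
content = '{"name": "John Smith", "email": "john@example.com", "plan_interest": "Enterprise", "demo_requested": true}'  # stand-in reply

data = json.loads(content)
missing = [k for k in required if k not in data]
print("schema respected" if not missing else f"missing fields: {missing}")
```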
200k Context (long context)¶
Use a very long messages[0].content to exercise long-context handling:
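The appendix script builds this prompt by repeating a short passage about 7,140 times, yielding a few hundred thousand CJK characters:

```python
# Mirrors the prompt construction in test_models.py (see the appendix).
base_text = "这是测试上下文的内容。我们正在测试 Gemini 的上下文能力。"
large_content = (base_text + "\n") * 7140 + "\n请问上述文本主要在测试什么能力?"
payload = {
    "messages": [{"role": "user", "content": large_content}],
    "stream": False,
}
print(len(large_content))  # a few hundred thousand characters
```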
max_tokens / max_completion_tokens (truncation)¶
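The appendix script exercises both parameters the same way: it asks for a 100-character story but allows only 10 output tokens, so the response is typically truncated with finish_reason: "length". The payloads mirror that test:

```json
{
  "messages": [{ "role": "user", "content": "讲一个100字的故事" }],
  "max_tokens": 10
}
```

The max_completion_tokens variant only swaps the parameter name:

```json
{
  "messages": [{ "role": "user", "content": "讲一个100字的故事" }],
  "max_completion_tokens": 10
}
```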
Gen Params (stop + streaming usage)¶
{
  "messages": [{ "role": "user", "content": "北京天气怎么样" }],
  "top_p": 0.9,
  "stop": ["北京"],
  "stream": true,
  "stream_options": { "include_usage": true }
}
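After merging the stream, this scenario checks two things: the final usage object is present, and the stop sequence never appears in the content. A sketch with a stand-in merged response:

```python
merged = {  # stand-in for a merged streamed response
    "choices": [{"message": {"content": "今天天气晴朗。"}, "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 12, "completion_tokens": 8, "total_tokens": 20},
}

usage = merged.get("usage")
content = merged["choices"][0]["message"]["content"]
usage_ok = bool(usage and "total_tokens" in usage)
stop_ok = "北京" not in content  # the stop sequence must not leak into the output
print(usage_ok, stop_ok)
```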
Appendix: test_models.py Source¶
The complete test_models.py follows; copy it into ks_gemini/ and use it as-is.
import os
import json
import sys
from typing import List, Dict, Any
import httpx
from dotenv import load_dotenv
load_dotenv()
API_BASE_URL = "https://api-cs-al.naci-tech.com/v1"
API_KEY = os.getenv("API_DEMO_API_KEY")
MODELS = [
"gemini-2.5-flash-lite",
"gemini-2.5-flash",
"gemini-2.5-pro",
"gemini-3-flash-preview",
"gemini-3-pro-preview",
]
class GeminiTester:
    def __init__(self, base_url: str, api_key: str):
        self.base_url = base_url
        self.api_key = api_key
        self.client = httpx.Client(base_url=base_url, headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        }, timeout=120.0)
    def run_test(self, model: str, test_name: str, payload: Dict[str, Any]) -> Dict[str, Any]:
        payload["model"] = model
        try:
            if payload.get("stream"):
                full_content = ""
                full_thinking = ""
                tool_calls_chunks = {}
                finish_reason = None
                usage = None
                with self.client.stream("POST", "/chat/completions", json=payload) as response:
                    if response.status_code != 200:
                        return {"success": False, "error": f"Status {response.status_code}: {response.read().decode()}"}
                    for line in response.iter_lines():
                        if not line.startswith("data: "):
                            continue
                        data_str = line[6:]
                        if data_str == "[DONE]":
                            break
                        try:
                            chunk = json.loads(data_str)
                            if "usage" in chunk and chunk["usage"]:
                                usage = chunk["usage"]
                            if not chunk.get("choices"):
                                continue
                            delta = chunk["choices"][0].get("delta", {})
                            if "content" in delta and delta["content"]:
                                full_content += delta["content"]
                            if "thinking" in delta and delta["thinking"]:
                                full_thinking += delta["thinking"]
                            if "reasoning_content" in delta and delta["reasoning_content"]:
                                full_thinking += delta["reasoning_content"]
                            if "tool_calls" in delta:
                                for tc in delta["tool_calls"]:
                                    idx = tc["index"]
                                    if idx not in tool_calls_chunks:
                                        tool_calls_chunks[idx] = {"id": tc.get("id"), "type": "function", "function": {"name": "", "arguments": ""}}
                                    f = tc["function"]
                                    if f.get("name"):
                                        tool_calls_chunks[idx]["function"]["name"] += f["name"]
                                    if f.get("arguments"):
                                        tool_calls_chunks[idx]["function"]["arguments"] += f["arguments"]
                            if chunk["choices"][0].get("finish_reason"):
                                finish_reason = chunk["choices"][0]["finish_reason"]
                        except Exception as e:
                            print(f"Failed to parse stream chunk: {e}")
                            continue
                final_tool_calls = [v for k, v in sorted(tool_calls_chunks.items())]
                simulated_response = {
                    "choices": [{"message": {"role": "assistant", "content": full_content, "thinking": full_thinking, "tool_calls": final_tool_calls if final_tool_calls else None}, "finish_reason": finish_reason}],
                    "usage": usage
                }
                return {"success": True, "response": simulated_response}
            else:
                response = self.client.post("/chat/completions", json=payload)
                if response.status_code == 200:
                    return {"success": True, "response": response.json()}
                return {"success": False, "error": f"Status {response.status_code}: {response.text}"}
        except Exception as e:
            return {"success": False, "error": str(e)}
    def test_thinking(self, model: str):
        payload = {"messages": [{"role": "user", "content": "你是谁"}], "enable_thinking": True, "reasoning_effort": "low", "stream": False}
        res = self.run_test(model, "Thinking", payload)
        if res["success"]:
            content = res["response"]["choices"][0]["message"]
            res["info"] = "Found thinking content" if ("thinking" in content or "reasoning_content" in content) else "No thinking content found"
        return res

    def test_function_calling(self, model: str):
        payload = {
            "thinking": {"type": "enabled", "budget_tokens": 4096}, "top_p": 0.95, "stream_options": {"include_thinking": True}, "stream": True,
            "messages": [{"role": "user", "content": "北京天气怎么样,以及北京是几点钟?"}],
            "tools": [
                {"type": "function", "function": {"name": "get_current_time", "description": "当你想知道现在的时间时非常有用。", "parameters": {"type": "object", "properties": {}}}},
                {"type": "function", "function": {"name": "get_current_weather", "description": "当你想查询指定城市的天气时非常有用。", "parameters": {"type": "object", "properties": {"location": {"type": "string", "description": "城市或县区,比如北京市、杭州市、余杭区等。"}}, "required": ["location"]}}}
            ],
            "tool_choice": "auto"
        }
        return self.run_test(model, "Function Calling", payload)
    def test_tool_choice(self, model: str):
        payload_base = {"temperature": 0.9, "stream": True, "messages": [{"role": "user", "content": "杭州在哪个国家"}], "tools": [{"type": "function", "function": {"name": "get_current_weather", "parameters": {"type": "object", "properties": {"location": {"type": "string"}}, "required": ["location"]}}}]}
        payload_req = payload_base.copy()
        payload_req["tool_choice"] = "required"
        res_req = self.run_test(model, "Tool Choice Required", payload_req)
        payload_auto = payload_base.copy()
        payload_auto["tool_choice"] = "auto"
        res_auto = self.run_test(model, "Tool Choice Auto", payload_auto)
        final_res = {"success": True, "info": ""}
        if not res_req["success"] or not res_auto["success"]:
            final_res["success"] = False
            final_res["error"] = f"Req Error: {res_req.get('error')} | Auto Error: {res_auto.get('error')}"
            return final_res
        tc_req = res_req["response"]["choices"][0]["message"].get("tool_calls")
        tc_auto = res_auto["response"]["choices"][0]["message"].get("tool_calls")
        if tc_req and not tc_auto:
            final_res["info"] = "PASS: Both 'required' and 'auto' respected"
        else:
            final_res["success"] = False
            details = []
            if not tc_req: details.append("'required' failed")
            if tc_auto: details.append("'auto' failed")
            final_res["info"] = "FAIL: " + " & ".join(details)
        final_res["response"] = {"required": res_req["response"], "auto": res_auto["response"]}
        return final_res
    def test_structured_output(self, model: str):
        payload = {"messages": [{"role": "user", "content": "generate a short json to describe the first digital computer, ENIAC, in the world."}], "response_format": {"type": "json_object"}}
        res = self.run_test(model, "JSON Object", payload)
        if res["success"]:
            try:
                json.loads(res["response"]["choices"][0]["message"].get("content", ""))
                res["info"] = "Valid JSON Object returned"
            except (json.JSONDecodeError, TypeError):
                res["success"] = False
                res["info"] = "Invalid JSON string"
        return res

    def test_json_schema(self, model: str):
        payload = {
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": "Extract the key information from this email: John Smith (john@example.com) is interested in our Enterprise plan and wants to schedule a demo for next Tuesday at 2pm."}],
            "response_format": {"type": "json_schema", "json_schema": {"name": "email_extraction", "schema": {"type": "object", "properties": {"name": {"type": "string"}, "email": {"type": "string"}, "plan_interest": {"type": "string"}, "demo_requested": {"type": "boolean"}}, "required": ["name", "email", "plan_interest", "demo_requested"], "additionalProperties": False}}}
        }
        res = self.run_test(model, "JSON Schema", payload)
        if res["success"]:
            content = res["response"]["choices"][0]["message"].get("content", "")
            try:
                data = json.loads(content)
                res["info"] = "PASS: Schema respected" if all(k in data for k in ["name", "email", "plan_interest", "demo_requested"]) else "FAIL: Missing required fields"
                if "FAIL" in res["info"]:
                    res["success"] = False
            except (json.JSONDecodeError, TypeError):
                res["success"] = False
                res["info"] = "FAIL: Invalid JSON"
        return res

    def test_context_200k(self, model: str):
        base_text = "这是测试上下文的内容。我们正在测试 Gemini 的上下文能力。"
        large_content = (base_text + "\n") * 7140 + "\n请问上述文本主要在测试什么能力?"
        payload = {"messages": [{"role": "user", "content": large_content}], "stream": False}
        return self.run_test(model, "200k Context", payload)
    def test_max_tokens(self, model: str):
        payload = {"messages": [{"role": "user", "content": "讲一个100字的故事"}], "max_tokens": 10}
        res = self.run_test(model, "max_tokens", payload)
        if res["success"]:
            res["info"] = f"Finish reason: {res['response']['choices'][0].get('finish_reason')}"
        return res

    def test_max_completion_tokens(self, model: str):
        payload = {"messages": [{"role": "user", "content": "讲一个100字的故事"}], "max_completion_tokens": 10}
        res = self.run_test(model, "max_completion_tokens", payload)
        if res["success"]:
            res["info"] = f"Finish reason: {res['response']['choices'][0].get('finish_reason')}"
        return res

    def test_generation_params(self, model: str):
        payload = {"messages": [{"role": "user", "content": "北京天气怎么样"}], "top_p": 0.9, "stop": ["北京"], "stream": True, "stream_options": {"include_usage": True}}
        res = self.run_test(model, "Gen Params", payload)
        if res["success"]:
            usage = res["response"].get("usage")
            content = res["response"]["choices"][0]["message"].get("content", "")
            if usage and "total_tokens" in usage and "北京" not in content:
                res["info"] = f"PASS: Usage found ({usage.get('total_tokens')} tokens), Stop respected"
            elif not usage or "total_tokens" not in usage:
                res["success"] = False
                res["info"] = "FAIL: No usage info found"
            else:
                res["info"] = "Usage found, but Stop sequence failed"
        return res
def main():
    if not API_KEY:
        print("Error: please set the API_DEMO_API_KEY environment variable")
        return
    tester = GeminiTester(API_BASE_URL, API_KEY)
    scenarios = {
        "Thinking": {"func": tester.test_thinking, "aliases": ["thinking", "think"]},
        "Function Calling": {"func": tester.test_function_calling, "aliases": ["fc", "function"]},
        "Tool Choice": {"func": tester.test_tool_choice, "aliases": ["tc", "tool"]},
        "JSON Object": {"func": tester.test_structured_output, "aliases": ["so", "json"]},
        "JSON Schema": {"func": tester.test_json_schema, "aliases": ["js", "schema"]},
        "200k Context": {"func": tester.test_context_200k, "aliases": ["ctx", "200k"]},
        "max_tokens": {"func": tester.test_max_tokens, "aliases": ["mt"]},
        "max_completion_tokens": {"func": tester.test_max_completion_tokens, "aliases": ["mct"]},
        "Gen Params": {"func": tester.test_generation_params, "aliases": ["gp", "params"]},
    }
    args = sys.argv[1:]
    selected_scenarios = {}
    if not args:
        selected_scenarios = {k: v["func"] for k, v in scenarios.items()}
    else:
        for arg in args:
            arg_lower = arg.lower()
            for name, config in scenarios.items():
                if arg_lower == name.lower() or arg_lower in config["aliases"]:
                    selected_scenarios[name] = config["func"]
                    break
    if not selected_scenarios:
        print(f"No scenario matched arguments: {args}")
        return
    results = {}
    print(f"Testing Gemini models, API Base: {API_BASE_URL}")
    for model in MODELS:
        print(f"\n[{model}] testing...")
        model_results = {}
        for name, func in selected_scenarios.items():
            model_results[name] = func(model)
        results[model] = model_results
    print("\n" + "=" * 120)
    print(f"| {'Model':<30} | {'Scenario':<25} | {'Status':<10} | {'Details'} |")
    print("|" + "-" * 32 + "|" + "-" * 27 + "|" + "-" * 12 + "|" + "-" * 45 + "|")
    for model, m_results in results.items():
        for scenario, res in m_results.items():
            status = "✅ PASS" if res.get("success") else "❌ FAIL"
            display_info = res.get("info", "")
            if not res.get("success"):
                display_info = res.get("error") or res.get("info") or "Unknown error"
            if len(display_info) > 42:
                display_info = display_info[:42] + "..."
            print(f"| {model:<30} | {scenario:<25} | {status:<10} | {display_info:<45} |")
    print("=" * 120)
    output_dir = "output"
    os.makedirs(output_dir, exist_ok=True)
    output_path = os.path.join(output_dir, "test_results.json")
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)
    print(f"\nResults saved to {output_path}")


if __name__ == "__main__":
    main()