Codex's retry backoff design is absurd: after a 403, the more it retries, the more it looks hung

coco · March 30, 2026, 16:18

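While using Codex CLI against a somewhat unstable upstream recently, I ran into a problem that is funny but also very real.

Codex currently hard-codes exponential backoff for stream reconnects. The first few attempts look fine, but the later ones balloon dramatically:

Attempt 1: about 0.2 seconds
Attempt 5: about 3.2 seconds
Attempt 10: already more than a minute
Beyond that, the wait grows to ten-plus minutes and then tens of minutes per attempt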
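
The figures above are consistent with a delay that starts at about 200 ms and doubles on every attempt, which is how the upstream issue describes the loop. A minimal sketch of that arithmetic, assuming pure doubling with no jitter and no cap (the function name is hypothetical, not Codex's actual code):

```python
def exponential_delay_s(attempt: int, base_s: float = 0.2, factor: float = 2.0) -> float:
    """Delay before the given 1-based retry attempt, assuming pure doubling."""
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    return base_s * factor ** (attempt - 1)


# Print the wait before a few representative attempts.
for attempt in (1, 5, 10, 13, 15):
    delay = exponential_delay_s(attempt)
    print(f"attempt {attempt:>2}: {delay:8.1f} s  ({delay / 60:5.1f} min)")
```

Under these assumptions, attempt 10 waits 102.4 seconds and attempt 13 waits roughly 13.7 minutes, which matches the "low teens" observation in the issue.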

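The problem is that this design assumes "the longer the failure lasts, the slower you should retry." Many real upstreams do not behave that way:

The gateway has an occasional hiccup
Backend routing is unstable
Some OpenAI-compatible relays briefly return 403 because quota state has not refreshed yet
In practice, a few more tries and the upstream recovers quickly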

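In other words, what is actually needed is:

Users decide the retry frequency themselves
At minimum, fixed-interval retries are allowed, for example once every 500 ms
Rather than being held hostage by a hard-coded exponential backoff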

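What makes it worse is that Codex gives users stream_max_retries but no control over the retry interval or the backoff strategy. The result:
You can raise the count to 100, but from about the 10th retry onward each wait becomes unreasonably long, which defeats the whole "a few more tries and it goes through" scenario.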
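
To put rough numbers on that mismatch, the sketch below totals the time spent waiting across the first N retries, comparing doubling from 200 ms with a fixed 500 ms interval. Both helpers are illustrative only, not Codex APIs, and they ignore jitter, caps, and the time each request itself takes.

```python
def total_wait_exponential_s(retries: int, base_s: float = 0.2, factor: float = 2.0) -> float:
    """Sum of the delays before retries 1..N when each delay doubles."""
    return sum(base_s * factor ** i for i in range(retries))


def total_wait_fixed_s(retries: int, delay_s: float = 0.5) -> float:
    """Sum of the delays before retries 1..N at a constant interval."""
    return retries * delay_s


# Compare cumulative waiting time for a few retry counts.
for n in (5, 10, 15):
    print(n, round(total_wait_exponential_s(n), 1), total_wait_fixed_s(n))
```

With these assumptions, ten retries cost about 204.6 seconds of waiting under doubling versus 5 seconds at a fixed 500 ms, so raising stream_max_retries alone mostly buys longer waits rather than more attempts.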

I have already filed this upstream:

github.com/openai/codex

Make stream reconnect delay/backoff configurable in config.toml

Opened 09:12AM - 29 Mar 26 UTC

constansino

enhancement CLI custom-model

Hard-coded exponential backoff for retryable stream reconnects makes some provider setups unusable.

Today Codex exposes `stream_max_retries`, but not the reconnect delay strategy. The outer reconnect loop uses a fixed exponential backoff starting around 200ms and doubling on each retry, so by the time a session reaches the low teens it is already waiting many minutes between attempts.

That is a poor fit for providers / gateways that frequently fail with retryable transient errors but often recover after a few quick retries. One concrete case is OpenAI-compatible upstreams that may briefly return `403 Forbidden` with an "insufficient balance / quota" message and then succeed again shortly afterward once the upstream gateway refreshes state or routes to a healthy backend.

In that setup, users can already raise `stream_max_retries`, but they cannot express the retry cadence they actually need. By retry 10+ the built-in exponential backoff dominates and the CLI can end up waiting far longer than the upstream outage itself.

Proposed behavior:

1. Keep the current default behavior for existing users.
2. Add provider-scoped TOML settings so users can choose the reconnect delay behavior explicitly.
3. Support at least:
   - a configurable base delay in milliseconds
   - a configurable backoff mode (`exponential` or `fixed`)

Example desired config:

```toml
[model_providers.custom]
stream_max_retries = 100
stream_retry_delay_ms = 500
stream_retry_backoff = "fixed"
```

That would allow the common "retry every 500ms" workflow without requiring a local patch.

I have a PR ready that implements exactly this shape while preserving the current default behavior.

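To me, the real problem is not just badly chosen parameters. The design is too presumptuous and leaves users no freedom to choose.

If config.toml explicitly supported something like the following, that would at least hand the choice back to the user: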

[model_providers.custom]
stream_max_retries = 100
stream_retry_delay_ms = 500
stream_retry_backoff = "fixed"
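
As a rough illustration of how settings shaped like these could drive a reconnect loop, here is a minimal sketch. The field names mirror the proposed TOML keys, which are a proposal and not existing Codex settings. `RetryPolicy` and `connect_with_retries` are hypothetical names. Treating only a 403 status as retryable is a simplification for the example, and the sleep function is injected so the simulation does not actually wait.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RetryPolicy:
    # Mirrors the proposed (not yet existing) TOML keys.
    stream_max_retries: int = 100
    stream_retry_delay_ms: int = 500
    stream_retry_backoff: str = "fixed"  # "fixed" or "exponential"

    def delay_s(self, attempt: int) -> float:
        """Delay before the given 1-based retry attempt."""
        base = self.stream_retry_delay_ms / 1000.0
        if self.stream_retry_backoff == "fixed":
            return base
        if self.stream_retry_backoff == "exponential":
            return base * 2 ** (attempt - 1)
        raise ValueError(f"unknown backoff mode: {self.stream_retry_backoff!r}")


def connect_with_retries(connect: Callable[[], int], policy: RetryPolicy,
                         sleep: Callable[[float], None]) -> int:
    """Call connect() until it stops returning 403 or retries run out.

    Returns the number of retries used; raises RuntimeError when exhausted.
    """
    status = connect()
    retries = 0
    while status == 403:
        if retries >= policy.stream_max_retries:
            raise RuntimeError("retries exhausted")
        retries += 1
        sleep(policy.delay_s(retries))
        status = connect()
    return retries


# Simulated upstream: three transient 403 responses, then success.
responses = iter([403, 403, 403, 200])
waits: List[float] = []
used = connect_with_retries(lambda: next(responses), RetryPolicy(), waits.append)
print(used, waits)  # prints: 3 [0.5, 0.5, 0.5]
```

With the fixed mode, the simulated outage is ridden out after three 0.5-second waits. Switching `stream_retry_backoff` to `"exponential"` keeps the doubling behavior for anyone who prefers it.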

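This need is actually very common:
"This upstream acts up a lot, but a few quick retries in a row usually bring it back, so please don't automatically stretch my retries out to once every few dozen minutes."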
