Two tests in ratelimit_service_401_test.go encoded the buggy behavior
itself:
- OAuth401InvalidatorError asserted updateCredentialsCalls == 1
- OAuth401UsesCredentialsUpdater asserted updateCredentialsCalls == 1
  and lastCredentials["expires_at"] non-empty
Both assertions exercised the exact write-back this PR removes. Update
them to reflect the new contract and guard against regression:
- OAuth401InvalidatorError: assert updateCredentialsCalls == 0
- OAuth401UsesCredentialsUpdater is renamed to
  OAuth401DoesNotOverwriteCredentials with reversed assertions, so it
  now serves as a regression test ensuring the 401 handler never writes
  credentials back from the request-start snapshot.
The 401 handler in RateLimitService.HandleUpstreamError set
account.Credentials["expires_at"] = time.Now() and then persisted the
full credentials map via persistAccountCredentials, which routes through
accountRepository.UpdateCredentials -> ent SetCredentials and replaces
the entire JSONB column.
The account passed to the handler is the request-start snapshot taken
by the gateway at SelectAccount time. When another worker has just
rotated refresh_token via oauth_refresh_api.RefreshIfNeeded, the
snapshot still holds the old refresh_token; writing the full snapshot
back rolls refresh_token in the DB back to the stale value.
The next refresh cycle then calls the upstream with the stale token,
receives invalid_grant, and tryRecoverFromRefreshRace re-reads the DB
only to find currentRT == usedRT (because the 401 handler just poisoned
the DB), returns false, and the account is incorrectly disabled.
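The race described above can be replayed with a small in-memory sketch.
The credentialStore type and its methods are hypothetical stand-ins for
the repository layer, not the real accountRepository API; only the
ordering of snapshot, rotation, and full write-back is taken from the
description.

```go
package main

import "fmt"

// credentialStore is a hypothetical stand-in for the accounts table.
// The real code path goes through accountRepository.UpdateCredentials.
type credentialStore struct {
	row map[string]string
}

// snapshot copies the current row, as the gateway does at SelectAccount time.
func (s *credentialStore) snapshot() map[string]string {
	out := make(map[string]string, len(s.row))
	for k, v := range s.row {
		out[k] = v
	}
	return out
}

// replaceAll overwrites the whole row, mirroring a full JSONB column replace.
func (s *credentialStore) replaceAll(creds map[string]string) {
	s.row = make(map[string]string, len(creds))
	for k, v := range creds {
		s.row[k] = v
	}
}

// simulateStaleWriteBack replays the race and returns the refresh_token
// that ends up in the store.
func simulateStaleWriteBack() string {
	store := &credentialStore{row: map[string]string{"refresh_token": "rt-old"}}

	// Request starts: the 401 handler will later receive this snapshot.
	snap := store.snapshot()

	// Another worker rotates the refresh token in the meantime.
	store.replaceAll(map[string]string{"refresh_token": "rt-new"})

	// Old 401 handler behavior: stamp expires_at, then persist the full snapshot.
	snap["expires_at"] = "now"
	store.replaceAll(snap)

	return store.row["refresh_token"]
}

func main() {
	// The rotated token has been overwritten by the snapshot's stale value.
	fmt.Println(simulateStaleWriteBack())
}
```

Because the write replaces the whole row rather than a single key, the
stale refresh_token travels along with the intended expires_at change.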
Drop the credentials write. InvalidateToken + SetTempUnschedulable is
sufficient: the account is held out of scheduling during the cooldown,
and after the cooldown the next request goes through token_provider's
NeedsRefresh check, which routes through the locked, DB-re-reading
RefreshIfNeeded path.
The "force background refresh by setting expires_at = now" semantic is
intentionally dropped. token_refresh_service will naturally pick the
account up when the real expires_at enters the refresh window, and if
the real expires_at has already passed by the time the account becomes
schedulable again, token_provider's NeedsRefresh returns true and
RefreshIfNeeded fires synchronously on the next request.
Follow-up to #2816 (already merged): the same long-context pricing
exemption that affected cache_read also applies to all three
cache_creation price fields (standard, 5m ephemeral, 1h ephemeral).
computeCacheCreationCost reads these prices directly from pricing and
never sees the LongContextInputMultiplier that computeTokenBreakdown
applies to inputPrice / outputPrice / cacheReadPrice.
For GPT-5.4 / 5.5 above the 272k threshold, this causes the cache_write
portion of long sessions to be billed at roughly half what it should
be (default multiplier 2.0). Cache writes are conceptually input-side
operations and should share the same long-context treatment as input /
cache_read.
This patch threads an explicit multiplier into computeCacheCreationCost
so the function can be unit-tested in isolation and matches the existing
pattern used for cache_read. computeTokenBreakdown captures the long
context decision once and passes LongContextInputMultiplier when it
applies, 1.0 otherwise.
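A minimal sketch of the explicit-multiplier shape, assuming per-token
prices and integer token counts. The struct and function names here are
illustrative only and do not reproduce the real computeCacheCreationCost
signature; the prices are made-up values chosen so the arithmetic is
exact.

```go
package main

import "fmt"

// cacheCreationPrices holds hypothetical per-token prices for the three
// cache_creation fields (standard, 5m ephemeral, 1h ephemeral).
type cacheCreationPrices struct {
	Standard, Ephemeral5m, Ephemeral1h float64
}

// cacheCreationTokens holds the matching token counts.
type cacheCreationTokens struct {
	Standard, Ephemeral5m, Ephemeral1h int
}

// cacheCreationCost prices all three fields and scales the sum by the
// multiplier passed in by the caller.
func cacheCreationCost(p cacheCreationPrices, t cacheCreationTokens, multiplier float64) float64 {
	base := float64(t.Standard)*p.Standard +
		float64(t.Ephemeral5m)*p.Ephemeral5m +
		float64(t.Ephemeral1h)*p.Ephemeral1h
	return base * multiplier
}

// longContextMultiplier returns the configured multiplier when long-context
// pricing applies and 1.0 otherwise, so the decision is made once by the caller.
func longContextMultiplier(applies bool, configured float64) float64 {
	if applies {
		return configured
	}
	return 1.0
}

func main() {
	prices := cacheCreationPrices{Standard: 0.5, Ephemeral5m: 0.25, Ephemeral1h: 1.0}
	tokens := cacheCreationTokens{Standard: 4, Ephemeral5m: 8, Ephemeral1h: 2}

	fmt.Println(cacheCreationCost(prices, tokens, longContextMultiplier(false, 2.0)))
	fmt.Println(cacheCreationCost(prices, tokens, longContextMultiplier(true, 2.0)))
}
```

Passing the multiplier as a plain argument keeps the cost function free
of threshold logic, which is what makes it testable in isolation.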
Adds three regression tests mirroring the #2816 cache_read tests:
- positive: long-context triggered -> cache_creation scaled by 2.0x
- negative: below threshold -> cache_creation stays at base price
- breakdown: 5m + 1h ephemeral prices both scaled when applicable
Refs #2816
Co-authored-by: Cursor <cursoragent@cursor.com>
The antigravity upstream-passthrough path (account.Type == AccountTypeUpstream
forwarding to a Claude-format upstream) drains the SSE stream via
streamUpstreamResponse + extractSSEUsage. The extractor only reads top-level
event["usage"], which matches Anthropic's message_delta but misses
message_start where usage is nested under event.message.usage.
As a result, every streaming /v1/messages request through this path drops
the input-side fields (input_tokens, cache_read_input_tokens, cache_creation_*)
and writes a usage_logs row with input_tokens=0 + output_tokens>0. The user
in #2332 observed 2,728 such rows attributed to claude-opus-4-6 / haiku-4-5
streaming requests; their billing on output is correct but the input-side
accounting is missing. (Their "duplicate write from message_delta" hypothesis
isn't borne out by the code — RecordUsage is invoked once per request and
writeUsageLogBestEffort dedupes by request_id; what they're seeing is
single records produced by this buggy extractor.)
Branch on event.type so message_start reads from event.message.usage and
other events keep using event.usage, matching how parseSSEUsagePassthrough
already handles both shapes for the Anthropic OAuth / API-key / Bedrock paths.
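The branching can be sketched as follows. usageFromEvent is a
hypothetical helper written for illustration, not the real
extractSSEUsage; it assumes the SSE data payload has already been split
out of the stream and only shows where the usage object is read from for
each event type.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// usageFromEvent returns the usage object carried by one SSE data payload,
// or nil when the payload is not valid JSON or carries no usage.
func usageFromEvent(data []byte) map[string]any {
	var event map[string]any
	if err := json.Unmarshal(data, &event); err != nil {
		return nil
	}
	if event["type"] == "message_start" {
		// message_start nests usage under event.message.usage.
		msg, ok := event["message"].(map[string]any)
		if !ok {
			return nil
		}
		usage, _ := msg["usage"].(map[string]any)
		return usage
	}
	// Other events (e.g. message_delta) keep usage at the top level.
	usage, _ := event["usage"].(map[string]any)
	return usage
}

func main() {
	start := []byte(`{"type":"message_start","message":{"usage":{"input_tokens":12}}}`)
	delta := []byte(`{"type":"message_delta","usage":{"output_tokens":34}}`)

	fmt.Println(usageFromEvent(start))
	fmt.Println(usageFromEvent(delta))
}
```

A caller would merge the two results so that input-side fields from
message_start and output-side fields from message_delta land in the same
usage record.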
Adds two extractSSEUsage table cases plus a TestExtractSSEUsage_StreamingSequence
that drives the message_start → message_delta sequence end-to-end; both fail
on main and pass with this change.
Fixes #2332
Co-authored-by: Cursor <cursoragent@cursor.com>
When session long-context pricing is triggered in computeTokenBreakdown
(e.g. GPT-5.4 / GPT-5.5 above the 272k token threshold), the multiplier
was only being applied to InputPricePerToken and OutputPricePerToken.
The cache_read price was left at its base value, so CacheReadCost was
silently undercharged whenever a long-context session also had cache
hits — which is essentially every long Codex / Claude Code session.
Concretely, for gpt-5.4 with 300k cache_read tokens, the bug billed the
cache portion at the base price instead of the base price times
LongContextInputMultiplier (e.g. 0.075 instead of 0.150 in the
regression test).
Cache reads are conceptually input-side replays, so they should scale
with LongContextInputMultiplier, matching the treatment of
InputPricePerToken.
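The arithmetic behind the 0.075 versus 0.150 figures can be reproduced
with a short sketch. The per-token price below is an assumption inferred
from those two numbers and the 300k token count, not a value read from
the pricing table, and cacheReadCost is a hypothetical helper rather
than the real computeTokenBreakdown.

```go
package main

import "fmt"

// cacheReadCost prices cache_read tokens, applying the long-context
// multiplier only when the scaling flag is set.
func cacheReadCost(tokens int, pricePerToken float64, scale bool, multiplier float64) float64 {
	if !scale {
		multiplier = 1.0
	}
	return float64(tokens) * pricePerToken * multiplier
}

func main() {
	// Assumed price: 0.25 per million tokens, inferred from 0.075 / 300k.
	const price = 0.25 / 1_000_000

	// Before the fix: cache_read was never scaled.
	fmt.Printf("%.3f\n", cacheReadCost(300_000, price, false, 2.0))
	// After the fix: cache_read shares the long-context multiplier.
	fmt.Printf("%.3f\n", cacheReadCost(300_000, price, true, 2.0))
}
```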
Adds two regression tests:
- positive: long-context triggered -> cache_read scaled by 2.0x
- negative: below threshold -> cache_read stays at base price
Fixes #2293
Co-authored-by: Cursor <cursoragent@cursor.com>
Pool mode currently retries the same account for a fixed set of
upstream HTTP statuses: 401, 403, 429. Some upstream pool deployments
also need same-account retry for transient provider/proxy statuses
such as 502, 503, 520, 529, but hard-coding more statuses changes
behavior for everyone.
Add a per-account credentials option `pool_mode_retry_status_codes`
that lets admins choose which upstream HTTP status codes trigger
same-account retry in pool mode:
- Unset (default): preserve the current 401/403/429 default
- Explicit list: override the defaults with the configured codes
- Codes normalized to the 100-599 range, deduplicated, sorted
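
The three rules above can be sketched as follows. This is an
illustration under stated assumptions, not the real
Account.IsPoolModeRetryableStatus: it assumes "normalized" means
out-of-range codes are dropped rather than clamped, and it treats a nil
slice as "unset" while a non-nil slice (even an empty one) overrides the
defaults. All names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// defaultRetryStatuses mirrors the 401/403/429 default named above.
var defaultRetryStatuses = []int{401, 403, 429}

// normalizeRetryStatusCodes drops codes outside 100-599 (assumed behavior),
// removes duplicates, and sorts the result in ascending order.
func normalizeRetryStatusCodes(codes []int) []int {
	seen := make(map[int]struct{}, len(codes))
	out := make([]int, 0, len(codes))
	for _, c := range codes {
		if c < 100 || c > 599 {
			continue
		}
		if _, dup := seen[c]; dup {
			continue
		}
		seen[c] = struct{}{}
		out = append(out, c)
	}
	sort.Ints(out)
	return out
}

// isRetryableStatus uses the defaults when configured is nil and the
// normalized explicit list otherwise.
func isRetryableStatus(configured []int, status int) bool {
	codes := defaultRetryStatuses
	if configured != nil {
		codes = normalizeRetryStatusCodes(configured)
	}
	for _, c := range codes {
		if c == status {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(normalizeRetryStatusCodes([]int{503, 42, 502, 503, 700}))
	fmt.Println(isRetryableStatus(nil, 429))
	fmt.Println(isRetryableStatus([]int{502, 503}, 429))
}
```

Note that an explicit list replaces the defaults rather than extending
them, so an admin who wants 401/403/429 plus 502 has to list all four.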
The standalone `isPoolModeRetryableStatus` helper is kept as the
default-only fallback. All 15 gateway call sites switch to the new
`Account.IsPoolModeRetryableStatus` method so behavior is preserved
for accounts that do not configure the new field.
Frontend admin UI gains a "Retry Status Codes" comma-separated input
under the pool-mode section in both Create/Edit account modals
(en + zh i18n).
Fixes #2731
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Errcheck flagged three unchecked strings.Builder.WriteString calls and
gofmt rejected an over-aligned trailing comment in the route table.
Rewrite writeResponsesFailedSSE with json.Marshal on typed structs
instead of Builder+strconv.Quote. Same wire format, but:
- no unchecked Write returns to silence
- strict JSON escaping (strconv.Quote emits \a and \v which are not
valid JSON; Marshal handles all runes correctly)
- omitempty model field via struct tag instead of conditional Builder
- consistent with the json.Marshal style used elsewhere in handler/
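
The escaping difference is easy to demonstrate with the standard
library alone. The errorPayload struct below is a hypothetical shape
used only to show the omitempty tag; it is not the real payload written
by writeResponsesFailedSSE.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// errorPayload is an illustrative struct; Model is omitted when empty.
type errorPayload struct {
	Message string `json:"message"`
	Model   string `json:"model,omitempty"`
}

// quoteIsValidJSON reports whether strconv.Quote(s) parses as a JSON value.
func quoteIsValidJSON(s string) bool {
	return json.Valid([]byte(strconv.Quote(s)))
}

// marshalPayload encodes the payload with encoding/json.
func marshalPayload(p errorPayload) (string, error) {
	b, err := json.Marshal(p)
	return string(b), err
}

func main() {
	msg := "bell:\a"

	// strconv.Quote uses Go escape syntax, so \a survives as-is.
	fmt.Println(strconv.Quote(msg), quoteIsValidJSON(msg))

	// json.Marshal emits a \u escape that any JSON parser accepts.
	out, err := marshalPayload(errorPayload{Message: msg})
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(out, json.Valid([]byte(out)))
}
```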
Collapse trailing comment whitespace in stream_error_event_test.go to
satisfy gofmt.
All 30+ subtests in the package still pass.
Case B: when a slot wait flushes SSE ping comments first (Writer.Written
becomes true), the previous ensureForwardErrorResponse short-circuited
on `c.Writer.Written()` and returned false without notifying the client.
Subsequent upstream errors (http2 timeout, stream INTERNAL_ERROR, etc.)
produced silent EOF; Codex CLI reported "stream closed before
response.completed" just like the user-slot timeout case.
Remove the Written() early return; coerce streamStarted to true when
Writer has already been written to, and let handleStreamingAwareError
walk the existing logic — which now (thanks to the previous commits)
emits a protocol-compliant response.failed for /responses paths and the
legacy `event: error` for others.
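The control flow after this change can be sketched roughly as below.
fakeWriter and forwardError are hypothetical stand-ins for the gin
writer and for ensureForwardErrorResponse / handleStreamingAwareError,
and the frame strings are placeholders rather than the real wire format;
only the coercion of streamStarted and the path-dependent choice of
terminal frame come from the description above.

```go
package main

import "fmt"

// fakeWriter records written frames; Written() reports whether anything
// has been sent yet, like the Written() method referenced above.
type fakeWriter struct {
	frames []string
}

func (w *fakeWriter) Write(frame string) { w.frames = append(w.frames, frame) }
func (w *fakeWriter) Written() bool      { return len(w.frames) > 0 }

// forwardError always notifies the client. If bytes were already written
// (e.g. SSE ping comments), the stream is treated as started and a
// terminal SSE frame is appended instead of returning silently.
func forwardError(w *fakeWriter, streamStarted, responsesPath bool) {
	if w.Written() {
		streamStarted = true
	}
	switch {
	case !streamStarted:
		w.Write("plain JSON error body") // placeholder for the non-streaming response
	case responsesPath:
		w.Write("event: response.failed") // placeholder frame for /responses paths
	default:
		w.Write("event: error") // placeholder for the legacy frame
	}
}

func main() {
	w := &fakeWriter{}
	w.Write(": ping") // a slot wait flushed a ping comment first
	forwardError(w, false, true)
	fmt.Println(w.frames)
}
```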
Update tests that previously asserted "do not override written response":
the new contract is to *append* an SSE terminal frame so the client sees
a clean close instead of EOF. recoverResponsesPanic inherits this fix.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>