When the Gemini->Anthropic streaming bridge for the /v1/messages endpoint
received a functionCall part followed by a text part, the text branch in
handleStreamingResponse opened a new text content block without closing the
already-open tool_use block. The tool block's content_block_stop was only
emitted at end-of-stream, after the text block's content_block_start, so the
Anthropic SSE stream contained overlapping/unterminated content blocks. Clients
that assemble messages by block index (e.g. Claude Code) can drop the tool
input or mis-parse the response.
The functionCall branch already closes an open text block before opening a tool
block, and the chat-completions sibling closes the tool block in its text branch
via closeOpenTool(). This applies the same symmetric handling to the messages
variant: close any open tool_use block (resetting openToolIndex/openToolName/
seenToolJSON) before starting text.
Adds a regression test that replays a tool->text Gemini stream and asserts the
Anthropic content-block lifecycle never overlaps.
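The lifecycle rule above can be sketched as a small state machine. This is an illustration only, not the bridge's actual code: the `event` and `bridge` types and the `malformed` checker are hypothetical stand-ins for handleStreamingResponse and the regression test's assertion.

```go
package main

// event is a simplified stand-in for an Anthropic SSE content-block event.
type event struct {
	Type  string // "content_block_start" or "content_block_stop"
	Index int
	Kind  string // "text" or "tool_use"; empty on stop events
}

// bridge tracks the single block that may be open at any time.
type bridge struct {
	next     int    // index assigned to the next block
	openKind string // "" when no block is open
	openIdx  int
	out      []event
}

// closeOpen emits content_block_stop for the open block, if any.
func (b *bridge) closeOpen() {
	if b.openKind == "" {
		return
	}
	b.out = append(b.out, event{Type: "content_block_stop", Index: b.openIdx})
	b.openKind = ""
}

// open closes whatever is open first, then starts a new block.
func (b *bridge) open(kind string) {
	b.closeOpen()
	b.openIdx = b.next
	b.next++
	b.openKind = kind
	b.out = append(b.out, event{Type: "content_block_start", Index: b.openIdx, Kind: kind})
}

// onText continues an open text block, or closes an open tool_use block
// before starting text (the symmetric handling described above).
func (b *bridge) onText() {
	if b.openKind != "text" {
		b.open("text")
	}
}

// onFunctionCall starts a tool_use block, closing any open block first.
func (b *bridge) onFunctionCall() { b.open("tool_use") }

// finish closes the last open block at end-of-stream.
func (b *bridge) finish() { b.closeOpen() }

// malformed reports whether a block starts while another is still open,
// or whether a block is left unterminated at the end.
func malformed(events []event) bool {
	open := false
	for _, e := range events {
		switch e.Type {
		case "content_block_start":
			if open {
				return true
			}
			open = true
		case "content_block_stop":
			open = false
		}
	}
	return open
}
```

Replaying functionCall, then text, through this sketch yields start(0), stop(0), start(1), stop(1), which is the non-overlapping order the regression test is described as asserting.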
isOpenAIWSTokenEvent classified response.completed / response.done as
token events. When upstream finishes a request without ever emitting
a recognizable delta (e.g. cached completions or models that skip
incremental output), firstTokenMs was then filled at the terminal
event's timestamp, so the first-token latency metric effectively
reported total request duration.
Terminal events are already handled separately by
isOpenAIWSTerminalEvent. Treating them as token events makes the two
classifiers overlap, which violates the implicit invariant that the
token-event and terminal-event sets are disjoint.
The metric only affects ForwardResult.FirstTokenMs (logging and
observability) — billing and routing are unchanged.
Add regression tests for both directions:
* TestIsOpenAIWSTokenEvent_TerminalEventsExcluded covers each
  classification branch.
* TestIsOpenAIWSTokenEvent_DisjointWithTerminal asserts the
  disjoint-set invariant for every known terminal event.
Both new tests fail when the old `return eventType == "response.completed"
|| eventType == "response.done"` is restored.
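A minimal sketch of the disjointness invariant, under stated assumptions: the terminal list holds only the two event names mentioned above (the real list may be longer), and treating any `.delta` suffix as a token event is a guess for illustration, not the actual classifier logic. The `latency` type is hypothetical.

```go
package main

import "strings"

// terminalEvents holds only the names mentioned in this message;
// the real classifier may know more.
var terminalEvents = []string{"response.completed", "response.done"}

// isTerminalEvent reports whether eventType ends the response.
func isTerminalEvent(eventType string) bool {
	for _, t := range terminalEvents {
		if eventType == t {
			return true
		}
	}
	return false
}

// isTokenEvent never returns true for a terminal event, which keeps the
// two sets disjoint. The ".delta" suffix rule is an assumption.
func isTokenEvent(eventType string) bool {
	if isTerminalEvent(eventType) {
		return false
	}
	return strings.HasSuffix(eventType, ".delta")
}

// latency records the elapsed time of the first token event only.
type latency struct {
	firstTokenMs int64
	seen         bool
}

// observe fills firstTokenMs once, and only for token events.
func (l *latency) observe(eventType string, elapsedMs int64) {
	if !l.seen && isTokenEvent(eventType) {
		l.firstTokenMs = elapsedMs
		l.seen = true
	}
}
```

With this shape, a stream that ends without any delta leaves the first-token value unset instead of recording the total request duration.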
Fixes #2651
Co-authored-by: Cursor <cursoragent@cursor.com>
Two tests in ratelimit_service_401_test.go were encoding the bug behavior
itself:
- OAuth401InvalidatorError asserted updateCredentialsCalls == 1
- OAuth401UsesCredentialsUpdater asserted updateCredentialsCalls == 1
  and lastCredentials["expires_at"] non-empty
Both assertions exercised the exact write-back this PR removes. Update
them to reflect the new contract and guard against regression:
- OAuth401InvalidatorError: assert updateCredentialsCalls == 0
- OAuth401UsesCredentialsUpdater is renamed to
  OAuth401DoesNotOverwriteCredentials with reversed assertions, so it
  now serves as a regression test ensuring the 401 handler never writes
  credentials back from the request-start snapshot.
The 401 handler in RateLimitService.HandleUpstreamError set
account.Credentials["expires_at"] = time.Now() and then persisted the
full credentials map via persistAccountCredentials, which routes through
accountRepository.UpdateCredentials -> ent SetCredentials and replaces
the entire JSONB column.
The account passed to the handler is the request-start snapshot taken
by the gateway at SelectAccount time. When another worker has just
rotated refresh_token via oauth_refresh_api.RefreshIfNeeded, the
snapshot still holds the old refresh_token; writing the full snapshot
back rolls refresh_token in the DB back to the stale value.
The next refresh cycle then calls the upstream with the stale token,
receives invalid_grant, and tryRecoverFromRefreshRace re-reads the DB
only to find currentRT == usedRT (because the 401 handler just poisoned
the DB), returns false, and the account is incorrectly disabled.
Drop the credentials write. InvalidateToken + SetTempUnschedulable is
sufficient: the account is held out of scheduling during the cooldown,
and after the cooldown the next request goes through token_provider's
NeedsRefresh check, which routes through the locked, DB-re-reading
RefreshIfNeeded path.
The "force background refresh by setting expires_at = now" semantic is
intentionally dropped. token_refresh_service will naturally pick the
account up when the real expires_at enters the refresh window, and if
the real expires_at has already passed by the time the account becomes
schedulable again, token_provider's NeedsRefresh returns true and
RefreshIfNeeded fires synchronously on the next request.
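The rollback can be reproduced with an in-memory stand-in for the credentials column. Everything here is hypothetical (`store`, `creds`, `oldHandle401`, `newHandle401`); it only models a full-map replace from a stale snapshot versus leaving credentials alone.

```go
package main

// creds is a stand-in for the credentials JSONB map.
type creds map[string]string

// store is an in-memory stand-in for the account row.
type store struct {
	row               creds
	tempUnschedulable bool
}

// clone copies a credentials map so snapshots do not alias the row.
func clone(c creds) creds {
	out := make(creds, len(c))
	for k, v := range c {
		out[k] = v
	}
	return out
}

// read returns a snapshot of the current row.
func (s *store) read() creds { return clone(s.row) }

// replaceAll overwrites the whole map, like a full-column update.
func (s *store) replaceAll(c creds) { s.row = clone(c) }

// oldHandle401 models the removed behaviour: stamp expires_at on the
// request-start snapshot and write the full snapshot back.
func oldHandle401(s *store, snapshot creds, now string) {
	snapshot["expires_at"] = now
	s.replaceAll(snapshot)
}

// newHandle401 models the new behaviour: hold the account out of
// scheduling and deliberately ignore the snapshot's credentials.
func newHandle401(s *store, _ creds) {
	s.tempUnschedulable = true
}
```

If another worker rotates refresh_token between the snapshot and the 401, `oldHandle401` leaves the stale token in the store, while `newHandle401` leaves the rotated one in place.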
Follow-up to #2816 (already merged): the same long-context pricing
exemption that affected cache_read also applies to all three
cache_creation price fields (standard, 5m ephemeral, 1h ephemeral).
computeCacheCreationCost reads these prices directly from pricing and
never sees the LongContextInputMultiplier that computeTokenBreakdown
applies to inputPrice / outputPrice / cacheReadPrice.
For GPT-5.4 / 5.5 above the 272k threshold, this causes the cache_write
portion of long sessions to be billed at roughly half what it should
be (default multiplier 2.0). Cache writes are conceptually input-side
operations and should share the same long-context treatment as input /
cache_read.
This patch threads an explicit multiplier into computeCacheCreationCost
so the function can be unit-tested in isolation and matches the existing
pattern used for cache_read. computeTokenBreakdown captures the long
context decision once and passes LongContextInputMultiplier when it
applies, 1.0 otherwise.
Adds three regression tests mirroring the #2816 cache_read tests:
- positive: long-context triggered -> cache_creation scaled by 2.0x
- negative: below threshold -> cache_creation stays at base price
- breakdown: 5m + 1h ephemeral prices both scaled when applicable
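A rough sketch of threading the multiplier through, assuming hypothetical type and field names (`cachePricing`, `cacheCreationTokens`) and a hypothetical `longContextMultiplier` helper; only the idea of passing an explicit multiplier, with 1.0 as the no-op value, comes from the description above.

```go
package main

// cachePricing holds per-token cache_creation prices (hypothetical names).
type cachePricing struct {
	Standard    float64
	Ephemeral5m float64
	Ephemeral1h float64
}

// cacheCreationTokens holds the token counts for each price field.
type cacheCreationTokens struct {
	Standard    int
	Ephemeral5m int
	Ephemeral1h int
}

// computeCacheCreationCost scales all three price fields by multiplier.
// Callers pass 1.0 when long-context pricing does not apply.
func computeCacheCreationCost(tok cacheCreationTokens, p cachePricing, multiplier float64) float64 {
	base := float64(tok.Standard)*p.Standard +
		float64(tok.Ephemeral5m)*p.Ephemeral5m +
		float64(tok.Ephemeral1h)*p.Ephemeral1h
	return base * multiplier
}

// longContextMultiplier makes the long-context decision once: the
// configured multiplier above the threshold, 1.0 otherwise.
func longContextMultiplier(inputTokens, threshold int, configured float64) float64 {
	if threshold > 0 && inputTokens > threshold && configured > 0 {
		return configured
	}
	return 1.0
}
```

Keeping the multiplier as a plain argument is what lets the cost function be tested without constructing a full long-context session.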
Refs #2816
Co-authored-by: Cursor <cursoragent@cursor.com>
The antigravity upstream-passthrough path (account.Type == AccountTypeUpstream
forwarding to a Claude-format upstream) drains the SSE stream via
streamUpstreamResponse + extractSSEUsage. The extractor only reads top-level
event["usage"], which matches Anthropic's message_delta but misses
message_start where usage is nested under event.message.usage.
As a result, every streaming /v1/messages request through this path drops
the input-side fields (input_tokens, cache_read_input_tokens, cache_creation_*)
and writes a usage_logs row with input_tokens=0 + output_tokens>0. The user
in #2332 observed 2,728 such rows attributed to claude-opus-4-6 / haiku-4-5
streaming requests; their billing on output is correct but the input-side
accounting is missing. (Their "duplicate write from message_delta" hypothesis
isn't borne out by the code — RecordUsage is invoked once per request and
writeUsageLogBestEffort dedupes by request_id; what they're seeing is
single records produced by this buggy extractor.)
Branch on event.type so message_start reads from event.message.usage and
other events keep using event.usage, matching how parseSSEUsagePassthrough
already handles both shapes for the Anthropic OAuth / API-key / Bedrock paths.
Adds two extractSSEUsage table cases plus a TestExtractSSEUsage_StreamingSequence
that drives the message_start → message_delta sequence end-to-end; both fail
on main and pass with this change.
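The two usage shapes can be sketched as below. This is not the actual extractSSEUsage; the struct names, `usageFromEvent`, and the merge policy (later non-zero fields win) are assumptions made for illustration.

```go
package main

import "encoding/json"

// usage covers a subset of the Anthropic usage fields.
type usage struct {
	InputTokens          int `json:"input_tokens"`
	OutputTokens         int `json:"output_tokens"`
	CacheReadInputTokens int `json:"cache_read_input_tokens"`
}

// messageEnvelope is the nested object carried by message_start.
type messageEnvelope struct {
	Usage *usage `json:"usage"`
}

// sseEvent models both shapes: top-level usage and message.usage.
type sseEvent struct {
	Type    string           `json:"type"`
	Usage   *usage           `json:"usage"`
	Message *messageEnvelope `json:"message"`
}

// usageFromEvent branches on the event type: message_start reads the
// nested message.usage, every other event reads top-level usage.
func usageFromEvent(data []byte) (*usage, bool) {
	var ev sseEvent
	if err := json.Unmarshal(data, &ev); err != nil {
		return nil, false
	}
	if ev.Type == "message_start" {
		if ev.Message == nil || ev.Message.Usage == nil {
			return nil, false
		}
		return ev.Message.Usage, true
	}
	if ev.Usage == nil {
		return nil, false
	}
	return ev.Usage, true
}

// mergeUsage keeps non-zero fields from later events (assumed policy),
// so message_delta does not zero out the input side.
func mergeUsage(total, u *usage) {
	if u.InputTokens > 0 {
		total.InputTokens = u.InputTokens
	}
	if u.OutputTokens > 0 {
		total.OutputTokens = u.OutputTokens
	}
	if u.CacheReadInputTokens > 0 {
		total.CacheReadInputTokens = u.CacheReadInputTokens
	}
}
```

Feeding a message_start followed by a message_delta through this sketch keeps the input-side counts from the first event and the final output count from the second.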
Fixes #2332
Co-authored-by: Cursor <cursoragent@cursor.com>
When session long-context pricing is triggered in computeTokenBreakdown
(e.g. GPT-5.4 / GPT-5.5 above the 272k token threshold), the multiplier
was only being applied to InputPricePerToken and OutputPricePerToken.
The cache_read price was left at its base value, so CacheReadCost was
silently undercharged whenever a long-context session also had cache
hits — which is essentially every long Codex / Claude Code session.
Concretely, for gpt-5.4 with 300k cache_read tokens, the cache portion
was billed at the base price instead of the base price times
LongContextInputMultiplier (e.g. 0.075 instead of 0.150 in the
regression test).
Cache reads are conceptually input-side replays, so they should scale
with LongContextInputMultiplier, matching the treatment of
InputPricePerToken.
Adds two regression tests:
- positive: long-context triggered -> cache_read scaled by 2.0x
- negative: below threshold -> cache_read stays at base price
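The arithmetic can be checked with a small sketch. The per-token cache_read price of 2.5e-7 is not stated anywhere above; it is back-derived from the 0.075 figure for 300k tokens and is an assumption, as are the `prices` and `breakdown` types and the `computeBreakdown` signature.

```go
package main

// prices holds per-token prices (hypothetical struct).
type prices struct {
	Input     float64
	CacheRead float64
}

// breakdown holds the input-side costs only, for brevity.
type breakdown struct {
	InputCost     float64
	CacheReadCost float64
}

// computeBreakdown applies the long-context input multiplier to both the
// input price and the cache_read price when longContext is true.
func computeBreakdown(inputTokens, cacheReadTokens int, p prices, longContext bool, inputMultiplier float64) breakdown {
	if longContext {
		p.Input *= inputMultiplier
		p.CacheRead *= inputMultiplier // previously left at the base price
	}
	return breakdown{
		InputCost:     float64(inputTokens) * p.Input,
		CacheReadCost: float64(cacheReadTokens) * p.CacheRead,
	}
}
```

With 300,000 cache_read tokens at the assumed 2.5e-7 per token, the base cost is about 0.075 and the long-context cost with a 2.0 multiplier is about 0.150.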
Fixes #2293
Co-authored-by: Cursor <cursoragent@cursor.com>
Pool mode currently retries the same account for a fixed set of
upstream HTTP statuses: 401, 403, 429. Some upstream pool deployments
also need same-account retry for transient provider/proxy statuses
such as 502, 503, 520, 529, but hard-coding more statuses changes
behavior for everyone.
Add a per-account credentials option `pool_mode_retry_status_codes`
that lets admins choose which upstream HTTP status codes trigger
same-account retry in pool mode:
- Unset (default): preserve the current 401/403/429 default
- Explicit list: override the defaults with the configured codes
- Codes normalized to the 100-599 range, deduplicated, sorted
The standalone `isPoolModeRetryableStatus` helper is kept as the
default-only fallback. All 15 gateway call sites switch to the new
`Account.IsPoolModeRetryableStatus` method so behavior is preserved
for accounts that do not configure the new field.
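A sketch of the lookup, with two assumptions flagged: "normalized to the 100-599 range" is read here as dropping out-of-range codes, and an explicit list that is empty after normalization falls back to the defaults. The function names are illustrative, not the actual `Account.IsPoolModeRetryableStatus` implementation.

```go
package main

import "sort"

// defaultPoolModeRetryStatuses is the default set named above.
var defaultPoolModeRetryStatuses = []int{401, 403, 429}

// normalizeRetryStatusCodes drops codes outside 100-599 (assumed
// meaning of "normalized"), removes duplicates, and sorts the rest.
func normalizeRetryStatusCodes(codes []int) []int {
	seen := make(map[int]bool, len(codes))
	out := make([]int, 0, len(codes))
	for _, c := range codes {
		if c < 100 || c > 599 || seen[c] {
			continue
		}
		seen[c] = true
		out = append(out, c)
	}
	sort.Ints(out)
	return out
}

// isRetryable uses the configured list when it yields at least one
// valid code, otherwise the defaults (fallback on empty is assumed).
func isRetryable(configured []int, status int) bool {
	codes := normalizeRetryStatusCodes(configured)
	if len(codes) == 0 {
		codes = defaultPoolModeRetryStatuses
	}
	for _, c := range codes {
		if c == status {
			return true
		}
	}
	return false
}
```

An account configured with only 502 would then retry on 502 but no longer on 401, since an explicit list overrides the defaults rather than extending them.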
Frontend admin UI gains a "Retry Status Codes" comma-separated input
under the pool-mode section in both Create/Edit account modals
(en + zh i18n).
Fixes #2731
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>