When the Gemini->Anthropic streaming bridge for the /v1/messages endpoint
receives a functionCall part followed by a text part, the text branch in
handleStreamingResponse opened a new text content block without closing the
already-open tool_use block. The tool block's content_block_stop was only
emitted at end-of-stream, after the text block's content_block_start, so the
Anthropic SSE stream contained overlapping/unterminated content blocks. Clients
that assemble messages by block index (e.g. Claude Code) can drop the tool
input or mis-parse the response.
The functionCall branch already closes an open text block before opening a tool
block, and the chat-completions sibling closes the tool block in its text branch
via closeOpenTool(). This applies the same symmetric handling to the messages
variant: close any open tool_use block (resetting openToolIndex/openToolName/
seenToolJSON) before starting text.
Adds a regression test that replays a tool->text Gemini stream and asserts the
Anthropic content-block lifecycle never overlaps.
- TopK initial filter now drops quota-paused accounts: fold the quota check
into isAccountRequestCompatible so session-hash, TopK pool, and per-candidate
rechecks all skip paused accounts. Previously the candidate pool was built
without the quota check, so paused accounts could fill TopK and leave the
scheduler returning "no available accounts" even with healthy ones available.
- Add per-account explicit disable flags auto_pause_5h_disabled /
auto_pause_7d_disabled with toggles in EditAccountModal. Without these,
leaving the account threshold blank silently falls back to the global default,
so admins could not exempt a single account once a global default existed.
Disable is per-window: an account can opt out of 5h auto-pause while still
honoring 7d. Schedule snapshot whitelist includes the new fields, i18n EN/ZH
updated, threshold-hint text revised to explain "blank = global default".
- Move quota auto-pause settings off the request hot path: replace the per-repo
TTL+singleflight sync DB read with a per-SettingService stale-while-revalidate
in-memory snapshot. Get is non-blocking (atomic.Pointer load + async refresh
on staleness); writes via UpdateOpsAdvancedSettings push directly into the
cache through an injected sink; wire warms the cache at startup. Adds Warm
(sync) for tests/init and SetOpenAIQuotaAutoPauseSettings (sink target).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- gate previous_response_id sticky path with quota auto-pause check at both
the snapshot and DB-recheck stages (previously bypassed, #1)
- skip pausing when the usage window already reset to avoid a stale stuck-pause;
carry codex_*_reset_at / reset_after_seconds / codex_usage_updated_at through
the scheduler snapshot whitelist (#2)
- remove the incomplete limit mode; percentage threshold only (#3)
- add global default 5h/7d threshold inputs to the Ops settings dialog with
validation and en/zh i18n (#4)
- downgrade account_auto_paused_by_quota log from Info to Debug; it fires
per-candidate on the scheduling hot path (#5)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
isOpenAIWSTokenEvent classified response.completed / response.done as
token events. When upstream finishes a request without ever emitting
a recognizable delta (e.g. cached completions or models that skip
incremental output), firstTokenMs was then filled at the terminal
event's timestamp, so the first-token latency metric effectively
reported total request duration.
Terminal events are already handled separately by
isOpenAIWSTerminalEvent. Treating them as token events makes the two
classifiers overlap, which violates the implicit invariant that the
token-event and terminal-event sets are disjoint.
The metric only affects ForwardResult.FirstTokenMs (logging and
observability) — billing and routing are unchanged.
Add regression tests for both directions:
* TestIsOpenAIWSTokenEvent_TerminalEventsExcluded covers each
classification branch.
* TestIsOpenAIWSTokenEvent_DisjointWithTerminal asserts the
disjoint-set invariant for every known terminal event.
Both new tests fail when the old `return eventType == "response.completed"
|| eventType == "response.done"` is restored.
Fixes#2651
Co-authored-by: Cursor <cursoragent@cursor.com>
Conflicts resolved (preserving fork customizations):
- config.go: keep NodeTLSProxy + add upstream OpenAIHTTP2
- gateway_service.go: NewGatewayService now takes both rpmTokenBucketSvc
(local) and userPlatformQuotaRepo (upstream)
- wire_gen.go: wire both new args into the call site
- http_upstream.go: drop redundant settings re-assignment; keep proxy
URL log redaction
- http_upstream_test.go: adopt upstream's explicit-0-disables semantics;
keep 600s default constant in nil-cfg fallback test
- user_handler_test.go / gateway_record_usage_test.go: pick up new
userPlatformQuotaRepo nil parameter
Also updated test stubs (windsurf_google_login_test.go,
windsurf_tier_access_service_test.go, gateway_models_test.go) for new
SetModelRateLimit variadic signature and the extra NewGatewayService arg.
Upstream highlights: OpenAI embeddings gateway, user x platform USD
quota, content-moderation risk thresholds, OAuth 401 credentials
no-overwrite fix, HTTP/2 OpenAI upstream config, pool retry status code
configurability, long-context cache pricing multipliers.
Two tests in ratelimit_service_401_test.go were encoding the bug behavior
itself:
- OAuth401InvalidatorError asserted updateCredentialsCalls == 1
- OAuth401UsesCredentialsUpdater asserted updateCredentialsCalls == 1
and lastCredentials["expires_at"] non-empty
Both assertions exercised the exact write-back this PR removes. Update
them to reflect the new contract and guard against regression:
- OAuth401InvalidatorError: assert updateCredentialsCalls == 0
- OAuth401UsesCredentialsUpdater is renamed to
OAuth401DoesNotOverwriteCredentials with reversed assertions, so it
now serves as a regression test ensuring the 401 handler never writes
credentials back from the request-start snapshot.
The 401 handler in RateLimitService.HandleUpstreamError set
account.Credentials["expires_at"] = time.Now() and then persisted the
full credentials map via persistAccountCredentials, which routes through
accountRepository.UpdateCredentials -> ent SetCredentials and replaces
the entire JSONB column.
The account passed to the handler is the request-start snapshot taken
by the gateway at SelectAccount time. When another worker has just
rotated refresh_token via oauth_refresh_api.RefreshIfNeeded, the
snapshot still holds the old refresh_token; writing the full snapshot
back rolls refresh_token in the DB back to the stale value.
The next refresh cycle then calls the upstream with the stale token,
receives invalid_grant, and tryRecoverFromRefreshRace re-reads the DB
only to find currentRT == usedRT (because the 401 handler just poisoned
the DB), returns false, and the account is incorrectly disabled.
Drop the credentials write. InvalidateToken + SetTempUnschedulable is
sufficient: the account is held out of scheduling during the cooldown,
and after the cooldown the next request goes through token_provider's
NeedsRefresh check, which routes through the locked, DB-re-reading
RefreshIfNeeded path.
The "force background refresh by setting expires_at = now" semantic is
intentionally dropped. token_refresh_service will naturally pick the
account up when the real expires_at enters the refresh window, and if
the real expires_at has already passed by the time the account becomes
schedulable again, token_provider's NeedsRefresh returns true and
RefreshIfNeeded fires synchronously on the next request.
Anthropic count_tokens rejects generation-only fields such as temperature, top_p, top_k, stream, and stop sequences. Passing the original messages payload through unchanged can turn otherwise valid requests into upstream 400 errors.
Sanitize only the count_tokens upstream payload after the gateway's existing request normalization, preserving fields that existing compatibility paths rely on while removing parameters the count_tokens endpoint does not accept.
Fixes#2764
Co-authored-by: Cursor <cursoragent@cursor.com>
Anthropic Messages reports input_tokens excluding cache_read/cache_creation, but
OpenAI Responses input_tokens is the total including cached tokens. The reverse
converter passed Anthropic's input_tokens straight through, so client-facing
prompt_tokens/input_tokens were short by the cached count and cache_creation
was dropped entirely.
Fix the non-stream path and the streaming state machine to add cache_read +
cache_creation back into input_tokens, and track CacheCreationInputTokens on
the streaming state. Six downstream paths benefit (Anthropic->Responses,
Anthropic->ChatCompletions, Gemini->ChatCompletions, each sync + stream).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Follow-up to #2816 (already merged): the same long-context pricing
exemption that affected cache_read also applies to all three
cache_creation price fields (standard, 5m ephemeral, 1h ephemeral).
computeCacheCreationCost reads these prices directly from pricing and
never sees the LongContextInputMultiplier that computeTokenBreakdown
applies to inputPrice / outputPrice / cacheReadPrice.
For GPT-5.4 / 5.5 above the 272k threshold, this causes the cache_write
portion of long sessions to be billed at roughly half what it should
be (default multiplier 2.0). Cache writes are conceptually input-side
operations and should share the same long-context treatment as input /
cache_read.
This patch threads an explicit multiplier into computeCacheCreationCost
so the function can be unit-tested in isolation and matches the existing
pattern used for cache_read. computeTokenBreakdown captures the long
context decision once and passes LongContextInputMultiplier when it
applies, 1.0 otherwise.
Adds three regression tests mirroring the #2816 cache_read tests:
- positive: long-context triggered -> cache_creation scaled by 2.0x
- negative: below threshold -> cache_creation stays at base price
- breakdown: 5m + 1h ephemeral prices both scaled when applicable
Refs #2816
Co-authored-by: Cursor <cursoragent@cursor.com>