Adds GET /api/v1/admin/ops/ws/requests — a fan-out WebSocket that pushes
per-request metadata (method, path, model, account_id, status, latency_ms)
to all connected admin clients the moment each gateway dispatch completes.
- service/request_event_bus.go: lock-free pub/sub with non-blocking drop
when per-subscriber buffer (64 slots) is full; nil-safe Publish
- service/request_event_bus_test.go: 6 tests (basic, fanout, drop, nil, close)
- GatewayHandler: records reqStartTime at entry; defer emits RequestEvent on
every return; sets status success/error/rate_limited in both Gemini and
Anthropic dispatch paths
- OpsHandler: accepts *RequestEventBus; wires it to RequestStreamWSHandler
- ops_ws_requests_handler.go: subscribes to bus, pushes JSON per event,
reuses existing upgrader/conn-limit/ping-pong infrastructure
- Route: ws.GET("/requests", ...) alongside existing /ws/qps
- wire_gen.go: requestEventBus shared between OpsHandler and GatewayHandler
Add HealthService with Liveness (no-op) and Readiness (DB+Redis ping
with per-component timeout) checks. Expose three endpoints:
- /healthz : new liveness endpoint, zero-dependency, always 200
- /ready : new readiness endpoint, returns 503 with details on dep
failure; suitable for K8s readinessProbe and load balancers
- /health : preserved for backward compatibility, equivalent to
/healthz
Switch primary docker-compose healthcheck to /ready so the container
is only marked healthy once DB+Redis are reachable. Standalone/dev/
local compose files keep /health to avoid disrupting existing setups.
Tests: unit tests cover liveness, readiness with both deps healthy,
each dep failing independently, and per-component timeout enforcement.
## Problem
When a proxy is unreachable, token refresh retries up to 4 times with
30s timeout each, causing requests to hang for ~2 minutes before
failing with a generic 502 error. The failed account is not marked,
so subsequent requests keep hitting it.
## Changes
### Proxy connection fast-fail
- Set TCP dial timeout to 5s and TLS handshake timeout to 5s on
antigravity client, so proxy connectivity issues fail within 5s
instead of 30s
- Reduce overall HTTP client timeout from 30s to 10s
- Export `IsConnectionError` for service-layer use
- Detect proxy connection errors in `RefreshToken` and return
immediately with "proxy unavailable" error (no retries)
### Token refresh temp-unschedulable
- Add 8s context timeout for token refresh on request path
- Mark account as temp-unschedulable for 10min when refresh fails
(both background `TokenRefreshService` and request-path
`GetAccessToken`)
- Sync temp-unschedulable state to Redis cache for immediate
scheduler effect
- Inject `TempUnschedCache` into `AntigravityTokenProvider`
### Account failover
- Return `UpstreamFailoverError` on `GetAccessToken` failure in
`Forward`/`ForwardGemini` to trigger handler-level account switch
instead of returning 502 directly
### Proxy probe alignment
- Apply same 5s dial/TLS timeout to shared `httpclient` pool
- Reduce proxy probe timeout from 30s to 10s
Introduce OAuthRefreshAPI as the single entry point for all OAuth token
refresh operations, eliminating the race condition where background
refresh and inline refresh could simultaneously use the same
refresh_token (fixes#1035).
Key changes:
- Add OAuthRefreshExecutor interface extending TokenRefresher with CacheKey
- Add OAuthRefreshAPI.RefreshIfNeeded with lock → DB re-read → double-check flow
- Add ProviderRefreshPolicy / BackgroundRefreshPolicy strategy types
- Simplify all 4 TokenProviders to delegate to OAuthRefreshAPI
- Rewrite TokenRefreshService.refreshWithRetry to use unified API path
- Add MergeCredentials and BuildClaudeAccountCredentials helpers
- Add 40 unit tests covering all new and modified code paths
When the service restarts, concurrency slots from the old process
remain in Redis, causing phantom occupancy. On startup, scan all
concurrency sorted sets and remove members with non-current process
prefix, then clear orphaned wait queue counters.
Uses Go-side SCAN to discover keys (compatible with Redis client
prefix hooks in tests), then passes them to a Lua script for
atomic member-level cleanup.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- add default subscriptions to admin settings
- auto-assign subscriptions on register and admin user creation
- add validation/tests and align settings UI with subscription selector patterns
- 新增并接通 Sora 专用 OAuth 接口与 ST/RT 换取能力
- 完成前端 Sora 授权、RT/ST 手动导入与账号创建流程
- 强化 Sora token 恢复、转发日志与网关路由隔离行为
- 补充后端服务层与路由层相关测试覆盖
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implements announcements end-to-end (admin CRUD + read status, user list + mark read) with OR-of-AND targeting. Also breaks the ent<->service import cycle by moving schema-facing constants/targeting into a new domain package.