19 Commits

Author SHA1 Message Date
wucm667
c9caadb378 fix(account): address second-round review on quota auto-pause
- TopK initial filter now drops quota-paused accounts: fold the quota check
  into isAccountRequestCompatible so session-hash, TopK pool, and per-candidate
  rechecks all skip paused accounts. Previously the candidate pool was built
  without the quota check, so paused accounts could fill TopK and leave the
  scheduler returning "no available accounts" even with healthy ones available.
- Add per-account explicit disable flags auto_pause_5h_disabled /
  auto_pause_7d_disabled with toggles in EditAccountModal. Without these,
  leaving the account threshold blank silently falls back to the global default,
  so admins could not exempt a single account once a global default existed.
  Disable is per-window: an account can opt out of 5h auto-pause while still
  honoring 7d. Schedule snapshot whitelist includes the new fields, i18n EN/ZH
  updated, threshold-hint text revised to explain "blank = global default".
- Move quota auto-pause settings off the request hot path: replace the per-repo
  TTL+singleflight sync DB read with a per-SettingService stale-while-revalidate
  in-memory snapshot. Get is non-blocking (atomic.Pointer load + async refresh
  on staleness); writes via UpdateOpsAdvancedSettings push directly into the
  cache through an injected sink; wire warms the cache at startup. Adds Warm
  (sync) for tests/init and SetOpenAIQuotaAutoPauseSettings (sink target).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 14:32:45 +08:00
wucm667
ead471d64b feat(account): 支持按 5h/7d 用量阈值自动暂停账号调度 2026-05-29 10:47:47 +08:00
erio
d218b6c2aa refactor(ops-cleanup): 拆分 executor + table-driven + 提取常量 + 补测试
代码审查反馈:

1. 文件行数超标:ops_cleanup_service.go 594→413 行。
   拆 opsCleanupPlan / deleteOldRowsByID / truncateOpsTable / isMissingRelationError
   + counts struct 到 ops_cleanup_executor.go (164 行)。

2. runCleanupOnce 89 行→30 行(table-driven):
   用 []opsCleanupTarget 循环替代三组重复的 opsCleanupPlan → runOne → assign。

3. 魔法值提取常量:
   opsCleanupDefaultSchedule / opsCleanupBatchSize / opsCleanupCronStopTimeout /
   opsCleanupRunTimeout / opsCleanupHeartbeatTimeout。
   ops_settings.go 中 "0 2 * * *" 也统一引用 opsCleanupDefaultSchedule。

4. 补 5 个缺失测试:
   - Reload 未 Start 时 no-op
   - Reload 已 Stop 后 no-op
   - cleanupReloader==nil 时 Update 不 panic
   - Start 重复调用幂等
   - refreshEffectiveBeforeRun 正确更新 snapshot
2026-05-04 13:35:26 +08:00
erio
c4598aa9b6 fix(ops-cleanup): 让 UI 数据保留策略真正生效
UI 上 admin 改的数据保留策略(cron + retention 天数)此前只写入 settings 表的
ops_advanced_settings.data_retention,但 OpsCleanupService 启动时只读
cfg.Ops.Cleanup(config.yaml / 环境变量),从未读取 settings 表,导致 UI 配置
完全不生效——cron 实际仍按默认 0 2 * * * 每日跑、retention 30 天。

改动:
- OpsCleanupService 增加 settingRepo 依赖,新增 effective 配置 + Reload 方法。
  Start/Reload 时从 settings.ops_advanced_settings.data_retention 覆盖
  cfg.Ops.Cleanup(Enabled、Schedule、*RetentionDays),无 settings 时整体
  fallback 到 cfg。runScheduled 顶部刷新一次 effective,让 retention 改动当次
  即生效(schedule/enabled 改动需要 Reload 才换 cron)。
- 用 mu + started/stopped 替换 startOnce/stopOnce 以支持 Reload 重建 cron。
- OpsService 增加 CleanupReloader 接口与 SetCleanupReloader setter;
  UpdateOpsAdvancedSettings 写入后调用 Reload。
- wire 通过 setter 注入 cleanup hook,避免构造期循环依赖。
- 新增单测覆盖 overlay 五种情形 + Update 触发 Reload。
2026-05-04 12:43:15 +08:00
erio
4b6954f9f0 feat(ops): allow retention days = 0 to wipe table on each scheduled cleanup
Background / 背景

The ops cleanup task currently rejects retention days < 1 in both validate
and normalize, so operators who want minimal-history setups (e.g. high
churn deployments that prefer near-realtime cleanup) cannot express that
intent through the UI. The only options are 1+ days, which keeps at least
24h of history regardless of cron frequency.

ops 清理任务目前在 validate 和 normalize 两处都拒绝小于 1 的保留天数,
让希望尽量不留历史的运维场景(高吞吐部署 + 想用近实时清理)无法通过 UI
表达。最低只能配 1,等于不管 cron 多频繁,至少都会保留 24 小时的历史。

Purpose / 目的

Let admins set retention days to 0, meaning "every scheduled cleanup
run wipes the corresponding table(s) entirely". Combined with a more
frequent cron (e.g. `0 * * * *`) this yields effectively rolling cleanup.

允许管理员把保留天数设为 0,语义为"每次定时清理时把对应表全部清空"。
搭配更频繁的 cron(比如每小时整点)即可获得近似滚动清理的效果。

Changes / 改动内容

Backend

- service/ops_settings.go: validate accepts [0, 365]; normalize only
  refills default 30 when value is < 0 (negative is treated as legacy
  bad data, 0 is honoured)
- service/ops_cleanup_service.go: introduce `opsCleanupPlan(now, days)`
  returning `(cutoff, truncate, ok)`. days==0 returns truncate=true and
  short-circuits to a new `truncateOpsTable` helper that uses
  `TRUNCATE TABLE` (O(1), no WAL, no VACUUM pressure). days>0 keeps
  the existing batched DELETE path unchanged. Empty tables skip
  TRUNCATE to avoid the ACCESS EXCLUSIVE lock entirely
- Extract `isMissingRelationError` helper to dedupe the "table not
  yet created" tolerance shared by both delete and truncate paths
- Add unit tests for `opsCleanupPlan` (three branches) and
  `isMissingRelationError`

后端

- service/ops_settings.go: validate 接受 [0, 365];normalize 仅在 < 0
  时回填默认 30(负数视为脏数据,0 被尊重)
- service/ops_cleanup_service.go: 抽 `opsCleanupPlan(now, days)` 返回
  `(cutoff, truncate, ok)`。days==0 → truncate=true,走新增
  `truncateOpsTable`(TRUNCATE TABLE,O(1),无 WAL、无 VACUUM 压力);
  days>0 仍走原批量 DELETE 路径,行为完全不变。空表跳过 TRUNCATE,
  避免无意义的 ACCESS EXCLUSIVE 锁
- 抽 `isMissingRelationError` helper 复用 delete / truncate 两处的
  "表不存在"宽容判断
- 补 `opsCleanupPlan` 三分支 + `isMissingRelationError` 单元测试

Frontend

- OpsSettingsDialog.vue: validation accepts [0, 365]; input min=0
- i18n (zh/en): hint mentions "0 = wipe all on every cleanup",
  validation message updated to 0-365 range

前端

- OpsSettingsDialog.vue: 校验放宽到 [0, 365],input min 改 0
- i18n(zh/en):hint 补"0 = 每次清理时清空所有",错误提示改 0-365

Trade-offs / 取舍

- TRUNCATE requires ACCESS EXCLUSIVE lock briefly, but ops tables only
  have the cleanup task as a writer, so the lock is invisible to other
  workloads
- Empty-table guard avoids the lock when there is nothing to clean
- Negative values are still treated as legacy bad data and replaced
  with default 30 to preserve compatibility
2026-04-29 15:01:02 +08:00
erio
cfe72159d0 feat(ops): add ignore insufficient balance errors toggle and extract error constants
- Add 5th error filter switch IgnoreInsufficientBalanceErrors to suppress
  upstream insufficient balance / insufficient_quota errors from ops log
- Extract hardcoded error strings into package-level constants for
  shouldSkipOpsErrorLog, normalizeOpsErrorType, classifyOpsPhase, and
  classifyOpsIsBusinessLimited
- Define ErrNoAvailableAccounts sentinel error and replace all
  errors.New("no available accounts") call sites
- Update tests to use require.ErrorIs with the sentinel error
2026-03-15 17:26:18 +08:00
Peter
29b0e4a8a5 feat(ops): allow hiding alert events 2026-03-13 17:18:04 +08:00
Peter
af9c4a7dd0 feat(ops): make openai token stats optional 2026-03-13 04:11:58 +08:00
alfadb
832b0185c7 style: fix gofmt formatting in ops_settings.go
Remove extra space before inline comment to pass golangci-lint gofmt check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 18:00:49 +08:00
alfadb
b1719b26d1 fix(ops): 默认忽略 count_tokens 404 错误
将 IgnoreCountTokensErrors 默认值从 false 改为 true。

count_tokens 返回 404 是预期业务行为(上游不支持 endpoint,
客户端应 fallback 到本地 tokenizer 估算),不应被视为错误。

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 16:50:13 +08:00
IanShaw027
967e25878f refactor(ops): 重构ops核心服务层代码 2026-01-14 12:40:12 +08:00
IanShaw027
182683814b refactor(ops): 移除duration相关告警指标,简化监控配置
主要改动:
- 移除 p95_latency_ms 和 p99_latency_ms 告警指标类型
- 移除配置中的 latency_p99_ms_max 阈值设置
- 简化健康分数计算(移除latency权重,重新归一化SLA和错误率)
- 移除duration相关的诊断规则和阈值检查
- 统一术语:延迟 → 请求时长
- 保留duration数据展示,但不再用于告警判断
- 聚焦TTFT作为主要的响应速度告警指标

影响范围:
- Backend: handler, service, models, tests
- Frontend: API types, i18n, components
2026-01-14 10:52:56 +08:00
IanShaw027
2d45e61a9b style(ops): 修复代码格式问题以通过 golangci-lint 2026-01-12 17:18:49 +08:00
IanShaw027
345a965fa3 feat(ops): 添加 count_tokens 错误过滤功能
功能特性:
- 自动识别并标记 count_tokens 请求的错误
- 支持配置是否在统计中忽略 count_tokens 错误
- 错误数据完整保留,仅在统计时动态过滤

技术实现:
- ops_error_logger.go: 自动标记 count_tokens 请求
- ops_repo.go: INSERT 语句添加 is_count_tokens 字段
- ops_repo_dashboard.go: buildErrorWhere 核心过滤函数
- ops_repo_preagg.go: 预聚合统计中添加过滤
- ops_repo_trends.go: 趋势统计查询添加过滤(2 处)
- ops_settings_models.go: 添加 ignore_count_tokens_errors 配置
- ops_settings.go: 配置验证和默认值设置
- ops_port.go: 错误日志模型添加 IsCountTokens 字段

业务价值:
- count_tokens 是探测性请求,其错误不影响真实业务 SLA
- 用户可根据需求灵活控制是否计入统计
- 提升错误率、告警等运维指标的准确性

影响范围:
- Dashboard 概览统计
- 错误趋势图表
- 告警规则评估
- 预聚合指标(hourly/daily)
- 健康分数计算
2026-01-12 17:06:12 +08:00
IanShaw027
e0cccf6ed2 fix(ops): 修复Go代码格式问题 2026-01-12 14:36:32 +08:00
IanShaw027
7536dbfee5 feat(ops): 后端添加指标阈值管理API
- 新增GetMetricThresholds和UpdateMetricThresholds接口
- 支持配置SLA、延迟P99、TTFT P99、请求错误率、上游错误率阈值
- 添加参数验证逻辑
- 提供默认阈值配置
2026-01-12 11:42:56 +08:00
IanShaw027
c48795a948 fix(ci): 修复最后一批CI错误
- 修复 ops_repo_trends.go 中剩余3处 Rows.Close 未检查错误
- 修复 ops_settings.go, ops_settings_models.go, ops_trends.go 的格式化问题
2026-01-12 00:02:19 +08:00
IanShaw027
988b4d0254 feat(ops): 添加高级设置API支持
- 新增OpsAdvancedSettings数据模型
- 支持数据保留策略配置(错误日志、分钟级指标、小时级指标)
- 支持数据聚合开关配置
- 添加GET/PUT /admin/ops/advanced-settings接口
- 添加配置校验和默认值处理

相关文件:
- backend/internal/service/ops_settings_models.go
- backend/internal/service/ops_settings.go
- backend/internal/handler/admin/ops_settings_handler.go
- backend/internal/server/routes/admin.go
- backend/internal/service/domain_constants.go
2026-01-11 19:51:18 +08:00
IanShaw027
5baa8b5673 feat(service): 实现运维监控业务逻辑层
- 新增 ops 主服务(ops_service.go)和端口定义(ops_port.go)
- 实现账号可用性检查服务(ops_account_availability.go)
- 实现数据聚合服务(ops_aggregation_service.go)
- 实现告警评估服务(ops_alert_evaluator_service.go)
- 实现告警管理服务(ops_alerts.go)
- 实现数据清理服务(ops_cleanup_service.go)
- 实现并发控制服务(ops_concurrency.go)
- 实现仪表板服务(ops_dashboard.go)
- 实现错误处理服务(ops_errors.go)
- 实现直方图服务(ops_histograms.go)
- 实现指标采集服务(ops_metrics_collector.go)
- 实现查询模式服务(ops_query_mode.go)
- 实现实时监控服务(ops_realtime.go)
- 实现请求详情服务(ops_request_details.go)
- 实现重试机制服务(ops_retry.go)
- 实现配置管理服务(ops_settings.go)
- 实现趋势分析服务(ops_trends.go)
- 实现窗口统计服务(ops_window_stats.go)
- 添加 ops 相关领域常量
- 注册 service 依赖注入
2026-01-09 20:53:44 +08:00