13 Commits

Author SHA1 Message Date
wucm667
8b7a822706 fix(account): address review on OpenAI quota auto-pause
- gate previous_response_id sticky path with quota auto-pause check at both
  the snapshot and DB-recheck stages (previously bypassed, #1)
- skip pausing when the usage window already reset to avoid a stale stuck-pause;
  carry codex_*_reset_at / reset_after_seconds / codex_usage_updated_at through
  the scheduler snapshot whitelist (#2)
- remove the incomplete limit mode; percentage threshold only (#3)
- add global default 5h/7d threshold inputs to the Ops settings dialog with
  validation and en/zh i18n (#4)
- downgrade account_auto_paused_by_quota log from Info to Debug; it fires
  per-candidate on the scheduling hot path (#5)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 12:20:30 +08:00
erio
4b6954f9f0 feat(ops): allow retention days = 0 to wipe table on each scheduled cleanup
Background / 背景

The ops cleanup task currently rejects retention days < 1 in both validate
and normalize, so operators who want minimal-history setups (e.g. high
churn deployments that prefer near-realtime cleanup) cannot express that
intent through the UI. The only options are 1+ days, which keeps at least
24h of history regardless of cron frequency.

ops 清理任务目前在 validate 和 normalize 两处都拒绝小于 1 的保留天数,
让希望尽量不留历史的运维场景(高吞吐部署 + 想用近实时清理)无法通过 UI
表达。最低只能配 1,等于不管 cron 多频繁,至少都会保留 24 小时的历史。

Purpose / 目的

Let admins set retention days to 0, meaning "every scheduled cleanup
run wipes the corresponding table(s) entirely". Combined with a more
frequent cron (e.g. `0 * * * *`) this yields effectively rolling cleanup.

允许管理员把保留天数设为 0,语义为"每次定时清理时把对应表全部清空"。
搭配更频繁的 cron(比如每小时整点)即可获得近似滚动清理的效果。

Changes / 改动内容

Backend

- service/ops_settings.go: validate accepts [0, 365]; normalize only
  refills default 30 when value is < 0 (negative is treated as legacy
  bad data, 0 is honoured)
- service/ops_cleanup_service.go: introduce `opsCleanupPlan(now, days)`
  returning `(cutoff, truncate, ok)`. days==0 returns truncate=true and
  short-circuits to a new `truncateOpsTable` helper that uses
  `TRUNCATE TABLE` (O(1), no WAL, no VACUUM pressure). days>0 keeps
  the existing batched DELETE path unchanged. Empty tables skip
  TRUNCATE to avoid the ACCESS EXCLUSIVE lock entirely
- Extract `isMissingRelationError` helper to dedupe the "table not
  yet created" tolerance shared by both delete and truncate paths
- Add unit tests for `opsCleanupPlan` (three branches) and
  `isMissingRelationError`

后端

- service/ops_settings.go: validate 接受 [0, 365];normalize 仅在 < 0
  时回填默认 30(负数视为脏数据,0 被尊重)
- service/ops_cleanup_service.go: 抽 `opsCleanupPlan(now, days)` 返回
  `(cutoff, truncate, ok)`。days==0 → truncate=true,走新增
  `truncateOpsTable`(TRUNCATE TABLE,O(1),无 WAL、无 VACUUM 压力);
  days>0 仍走原批量 DELETE 路径,行为完全不变。空表跳过 TRUNCATE,
  避免无意义的 ACCESS EXCLUSIVE 锁
- 抽 `isMissingRelationError` helper 复用 delete / truncate 两处的
  "表不存在"宽容判断
- 补 `opsCleanupPlan` 三分支 + `isMissingRelationError` 单元测试

Frontend

- OpsSettingsDialog.vue: validation accepts [0, 365]; input min=0
- i18n (zh/en): hint mentions "0 = wipe all on every cleanup",
  validation message updated to 0-365 range

前端

- OpsSettingsDialog.vue: 校验放宽到 [0, 365],input min 改 0
- i18n(zh/en):hint 补"0 = 每次清理时清空所有",错误提示改 0-365

Trade-offs / 取舍

- TRUNCATE requires ACCESS EXCLUSIVE lock briefly, but ops tables only
  have the cleanup task as a writer, so the lock is invisible to other
  workloads
- Empty-table guard avoids the lock when there is nothing to clean
- Negative values are still treated as legacy bad data and replaced
  with default 30 to preserve compatibility
2026-04-29 15:01:02 +08:00
erio
cfe72159d0 feat(ops): add ignore insufficient balance errors toggle and extract error constants
- Add 5th error filter switch IgnoreInsufficientBalanceErrors to suppress
  upstream insufficient balance / insufficient_quota errors from ops log
- Extract hardcoded error strings into package-level constants for
  shouldSkipOpsErrorLog, normalizeOpsErrorType, classifyOpsPhase, and
  classifyOpsIsBusinessLimited
- Define ErrNoAvailableAccounts sentinel error and replace all
  errors.New("no available accounts") call sites
- Update tests to use require.ErrorIs with the sentinel error
2026-03-15 17:26:18 +08:00
shaw
6da5fa01b9 fix(frontend): 修复运维设置对话框保存按钮始终禁用的问题
后端默认 alert.enabled=true 但 recipients 为空,前端验证将其视为
错误并阻断保存按钮。移除该阻断性验证,改为保存时自动禁用无收件人
的邮件通知配置。
2026-03-14 20:39:29 +08:00
Peter
29b0e4a8a5 feat(ops): allow hiding alert events 2026-03-13 17:18:04 +08:00
Peter
af9c4a7dd0 feat(ops): make openai token stats optional 2026-03-13 04:11:58 +08:00
Zero Clover
ad1cdba338
feat(ops): 支持过滤无效 API Key 错误,不写入错误日志
新增 IgnoreInvalidApiKeyErrors 开关,启用后 INVALID_API_KEY 和
API_KEY_REQUIRED 错误将被完全跳过,不写入 Ops 错误日志。
这些错误由用户错误配置导致,与服务质量无关。
2026-02-02 20:16:17 +08:00
IanShaw027
55e469c7fe fix(ops): 优化错误日志过滤和查询逻辑
后端改动:
- 添加 resolved 参数默认值处理(向后兼容,默认显示未解决错误)
- 新增 status_codes_other 查询参数支持
- 移除 service 层的高级设置过滤逻辑,简化错误日志查询流程

前端改动:
- 完善错误日志相关组件的国际化支持
- 优化 Ops 监控面板和设置对话框的用户体验
2026-01-14 16:26:33 +08:00
IanShaw027
5013290486 feat(frontend): 优化ops监控UI组件 2026-01-14 12:41:24 +08:00
IanShaw027
182683814b refactor(ops): 移除duration相关告警指标,简化监控配置
主要改动:
- 移除 p95_latency_ms 和 p99_latency_ms 告警指标类型
- 移除配置中的 latency_p99_ms_max 阈值设置
- 简化健康分数计算(移除latency权重,重新归一化SLA和错误率)
- 移除duration相关的诊断规则和阈值检查
- 统一术语:延迟 → 请求时长
- 保留duration数据展示,但不再用于告警判断
- 聚焦TTFT作为主要的响应速度告警指标

影响范围:
- Backend: handler, service, models, tests
- Frontend: API types, i18n, components
2026-01-14 10:52:56 +08:00
IanShaw027
b98fb013ae feat(ops): 添加自动刷新配置功能
功能特性:
- 支持配置启用/禁用自动刷新
- 可配置刷新间隔(15秒/30秒/60秒)
- 实时倒计时显示,用户可见下次刷新时间
- 手动刷新自动重置倒计时
- 页面卸载时自动清理定时器

用户体验:
- 默认禁用,用户可根据需求开启
- 与现有 OpsConcurrencyCard 5秒刷新保持一致
- 倒计时带旋转动画,视觉反馈清晰
- 配置修改后立即生效,无需刷新页面

技术实现:
- ops.ts: 添加 auto_refresh_enabled 和 auto_refresh_interval_seconds 配置
- OpsSettingsDialog.vue: 添加自动刷新配置界面
- OpsDashboard.vue: 实现主刷新逻辑和双定时器设计
- OpsDashboardHeader.vue: 倒计时显示组件

配置说明:
- auto_refresh_enabled: 是否启用(默认 false)
- auto_refresh_interval_seconds: 刷新间隔(默认 30 秒,范围 15-300 秒)
2026-01-12 17:07:07 +08:00
IanShaw027
d0b91a40d4 feat(ops): 添加指标阈值配置UI
- 在OpsSettingsDialog中添加指标阈值配置表单
- 在OpsRuntimeSettingsCard中添加阈值配置区域
- 添加阈值验证逻辑
- 更新国际化文本
2026-01-12 11:43:54 +08:00
IanShaw027
f541636840 feat(ops): 优化警报规则和设置的成功提示信息
- 添加警报规则保存成功提示:"警报规则保存成功"
- 添加警报规则删除成功提示:"警报规则删除成功"
- 添加运维监控设置保存成功提示:"运维监控设置保存成功"
- 替换通用的"操作成功"提示为具体的业务提示
- 失败时显示后端返回的详细错误信息

相关文件:
- frontend/src/i18n/locales/zh.ts
- frontend/src/views/admin/ops/components/OpsAlertRulesCard.vue
- frontend/src/views/admin/ops/components/OpsSettingsDialog.vue
2026-01-11 19:50:43 +08:00