docs: create roadmap (6 phases)

2026-03-21 17:00:12 +08:00 · 2026-03-21 17:00:12 +08:00 · 44b5f390aa
commit 44b5f390aa
parent 5e9102148a
3 changed files with 729 additions and 0 deletions
--- a/.planning/ROADMAP.md
+++ b/.planning/ROADMAP.md
@@ -0,0 +1,134 @@
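+# Roadmap: JobData crawler interaction refactor
+
+**Project:** JobData crawler interaction refactor
+**Core Value:** Keyword-driven crawlers fetch job postings, ingest them reliably into ClickHouse, and scheduled tasks collect and sync company information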
+**Created:** 2026-03-21
+**Granularity:** Standard (5-8 phases)
+
+---
+
+## Phases
+
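+- [ ] **Phase 1: Shared core package** - Extract crawler_core/ as an installable shared package; unify base classes and infrastructure
+- [ ] **Phase 2: Boss 直聘 rewrite** - Rewrite the Boss 直聘 crawler client on top of crawler_core
+- [ ] **Phase 3: 前程无忧 & 智联招聘 rewrite** - Rewrite the 前程无忧 and 智联招聘 crawler clients on top of crawler_core
+- [ ] **Phase 4: Backend & external script integration** - Backend facade bridge + external script migration + retirement of the old framework
+- [ ] **Phase 5: Data pipeline optimization** - Ingest deduplication, company cleaning flow optimization, writing company job postings
+- [ ] **Phase 6: Quality & frontend** - Test coverage for data parsing; improved frontend monitoring and data cleaning pages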
+
+---
+
+## Phase Details
+
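+### Phase 1: Shared core package
+**Goal:** The installable shared package crawler_core/ is in place, so development for all three platforms can build on it
+**Depends on:** Nothing (first phase)
+**Requirements:** ARCH-01, ARCH-02, QUAL-01, QUAL-04, QUAL-05
+**Success Criteria** (what must be TRUE):
+  1. `pip install -e ./crawler_core` installs successfully and `from crawler_core import BaseFetcher` imports cleanly
+  2. The signing algorithms for all three platforms have been moved into crawler_core/ and pass unit tests (pytest green)
+  3. The BaseFetcher/BaseSearcher base classes are fully defined and expose template methods for subclasses to implement
+  4. The loguru log format is unified, and the tenacity retry decorator works out of the box
+  5. The old crawlers (jobs_spider/) still run normally and are untouched (isolated behind a feature flag)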
+**Plans:** TBD
+
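+### Phase 2: Boss 直聘 rewrite
+**Goal:** The Boss 直聘 crawler runs entirely on crawler_core, and the old implementation can be safely retired
+**Depends on:** Phase 1
+**Requirements:** ARCH-03, QUAL-03
+**Success Criteria** (what must be TRUE):
+  1. The Boss 直聘 crawler inherits from BaseFetcher/BaseSearcher and contains no inline signing or HTTP boilerplate
+  2. mock/respx tests for the Boss HTTP layer pass, covering both successful and error responses
+  3. One real keyword crawl with the new Boss client returns job data successfully (manual verification)
+  4. Anti-crawling countermeasures are preserved: random 10-20s delays, proxy rotation, and TLS fingerprint spoofing all work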
+**Plans:** TBD
+
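+### Phase 3: 前程无忧 & 智联招聘 rewrite
+**Goal:** The 前程无忧 and 智联招聘 crawlers run entirely on crawler_core, so all three platforms share the new base classes
+**Depends on:** Phase 1
+**Requirements:** ARCH-04, ARCH-05
+**Success Criteria** (what must be TRUE):
+  1. The 前程无忧 crawler inherits from BaseFetcher/BaseSearcher and contains no inline signing or HTTP boilerplate
+  2. The 智联招聘 crawler inherits from BaseFetcher/BaseSearcher and contains no inline signing or HTTP boilerplate
+  3. Each of the two platforms completes one real keyword crawl that returns job data successfully (manual verification)
+  4. The new clients for all three platforms share one code structure, so a new platform can be implemented quickly from the template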
+**Plans:** TBD
+
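+### Phase 4: Backend & external script integration
+**Goal:** Both the backend facade and the external scripts use crawler_core, and the old jobs_spider/ framework is retired
+**Depends on:** Phase 2, Phase 3
+**Requirements:** ARCH-06, ARCH-07, ARCH-08
+**Success Criteria** (what must be TRUE):
+  1. The backend facade in app/services/crawler/ calls the new clients for all three platforms through asyncio.to_thread(), so no synchronous call blocks the event loop
+  2. The spiderJobs/ external scripts import crawler_core instead of inline code, with equivalent behavior
+  3. The jobs_spider/ directory is marked deprecated or deleted, and no production traffic reaches the old code
+  4. Both scheduled and manually triggered crawls run normally through the new path (verifiable in backend logs)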
+**Plans:** TBD
+
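+### Phase 5: Data pipeline optimization
+**Goal:** Ingest deduplication is reliable, the company cleaning flow runs smoothly, and company job postings are written to ClickHouse in full
+**Depends on:** Phase 4
+**Requirements:** DATA-01, DATA-02, DATA-03, DATA-04
+**Success Criteria** (what must be TRUE):
+  1. When the same job ID is pushed repeatedly within a 30-day window, ClickHouse keeps only one record (verifiable by re-ingesting the same data)
+  2. External scripts and the backend ingest through the same IngestService, and ingest logs show a consistent source field
+  3. Scheduled company cleaning task: extract from ClickHouse → crawl company details → write to MySQL, traceable in logs end to end
+  4. Company job postings (company_jobs) are written to ClickHouse successfully, verifiable from the frontend or by query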
+**Plans:** TBD
+
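+### Phase 6: Quality & frontend
+**Goal:** Data parsing has adequate test coverage, and the frontend monitoring and cleaning pages are useful for day-to-day operations
+**Depends on:** Phase 5
+**Requirements:** QUAL-02, QUAL-06, QUAL-07
+**Success Criteria** (what must be TRUE):
+  1. The data parsing functions for all three platforms have unit tests covering both complete and missing-field inputs (pytest green)
+  2. The dedup logic has unit tests covering duplicate IDs and time-window boundary cases
+  3. The frontend crawler monitoring page shows each platform's latest crawl time, volume trend, and error status
+  4. The frontend data cleaning page supports viewing the list of companies awaiting cleaning, triggering a cleaning run, and viewing results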
+**Plans:** TBD
+
+---
+
+## Progress Table
+
+| Phase | Plans Complete | Status | Completed |
+|-------|----------------|--------|-----------|
+| 1. Shared core package | 0/? | Not started | - |
+| 2. Boss 直聘 rewrite | 0/? | Not started | - |
+| 3. 前程无忧 & 智联招聘 rewrite | 0/? | Not started | - |
+| 4. Backend & external script integration | 0/? | Not started | - |
+| 5. Data pipeline optimization | 0/? | Not started | - |
+| 6. Quality & frontend | 0/? | Not started | - |
+
+---
+
+## Coverage Map
+
+| Requirement | Phase | Description |
+|-------------|-------|-------------|
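+| ARCH-01 | Phase 1 | Extract crawler_core/ as an installable package |
+| ARCH-02 | Phase 1 | Unify the BaseFetcher/BaseSearcher base classes |
+| ARCH-03 | Phase 2 | Rewrite the Boss 直聘 crawler on crawler_core |
+| ARCH-04 | Phase 3 | Rewrite the 前程无忧 crawler on crawler_core |
+| ARCH-05 | Phase 3 | Rewrite the 智联招聘 crawler on crawler_core |
+| ARCH-06 | Phase 4 | Backend facade bridges via asyncio.to_thread() |
+| ARCH-07 | Phase 4 | External scripts in spiderJobs/ use crawler_core |
+| ARCH-08 | Phase 4 | Retire the old jobs_spider/ framework |
+| DATA-01 | Phase 5 | Ingest dedup by unique ID + 30-day time window |
+| DATA-02 | Phase 5 | External scripts and backend share IngestService |
+| DATA-03 | Phase 5 | Optimize the scheduled company-information cleaning flow |
+| DATA-04 | Phase 5 | Write company job postings to ClickHouse |
+| QUAL-01 | Phase 1 | Unit tests for the core signing algorithms |
+| QUAL-02 | Phase 6 | Unit tests for data parsing and dedup logic |
+| QUAL-03 | Phase 2 | mock/respx tests for the HTTP request layer |
+| QUAL-04 | Phase 1 | Unified structured log format |
+| QUAL-05 | Phase 1 | Error retry mechanism (tenacity) |
+| QUAL-06 | Phase 6 | Improve the frontend crawler monitoring page |
+| QUAL-07 | Phase 6 | Improve the frontend data cleaning management page |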
+
+**Coverage: 19/19 requirements mapped ✓**
+
+---
+
+*Roadmap created: 2026-03-21*
+*Last updated: 2026-03-21 after initial creation*
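Reviewer note on DATA-01 in the roadmap above: the rule "same job ID within a 30-day window keeps one record" can be illustrated with a small in-memory filter. This is only a sketch under stated assumptions, not the project's IngestService: the function name `dedup_within_window`, the `seen` dictionary, and the `(job_id, pushed_at)` row shape are all hypothetical, and a real implementation would presumably check ClickHouse rather than a Python dict.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=30)  # window length taken from DATA-01


def dedup_within_window(rows, seen=None, window=WINDOW):
    """Keep a row only if its job_id was not accepted less than `window` ago.

    `rows` is an iterable of (job_id, pushed_at) tuples in time order.
    `seen` maps job_id -> timestamp of the last accepted row (hypothetical state).
    """
    seen = {} if seen is None else seen
    kept = []
    for job_id, pushed_at in rows:
        last = seen.get(job_id)
        if last is not None and pushed_at - last < window:
            continue  # duplicate inside the window: drop it
        seen[job_id] = pushed_at
        kept.append((job_id, pushed_at))
    return kept
```

In this sketch a push arriving exactly 30 days after the last accepted one is treated as outside the window and kept. Whether the boundary should be inclusive is the kind of case the Phase 6 time-window boundary tests would need to pin down.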
--- a/.planning/STATE.md
+++ b/.planning/STATE.md
@@ -0,0 +1,110 @@
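+# STATE: JobData crawler interaction refactor
+
+**Last updated:** 2026-03-21
+**Milestone:** Crawler interaction refactor v1
+
+---
+
+## Project Reference
+
+**Core value:** Keyword-driven crawlers fetch job postings, ingest them reliably into ClickHouse, and scheduled tasks collect and sync company information
+
+**Current focus:** Phase 1: extract the crawler_core/ shared core package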
+
+---
+
+## Current Position
+
+| Field | Value |
+|-------|-------|
+| Phase | 1 |
+| Phase name | Shared core package |
+| Plan | None (not started) |
+| Status | Not started |
+
+**Progress:**
+
+```
+Phase 1 ░░░░░░░░░░░░░░░░░░░░  0%
+Phase 2 ░░░░░░░░░░░░░░░░░░░░  0%
+Phase 3 ░░░░░░░░░░░░░░░░░░░░  0%
+Phase 4 ░░░░░░░░░░░░░░░░░░░░  0%
+Phase 5 ░░░░░░░░░░░░░░░░░░░░  0%
+Phase 6 ░░░░░░░░░░░░░░░░░░░░  0%
+─────────────────────────────
+Overall  0/6 phases complete
+```
+
+---
+
+## Performance Metrics
+
+| Metric | Value |
+|--------|-------|
+| Plans completed | 0 |
+| Plans total (estimated) | TBD |
+| Phases completed | 0/6 |
+| Requirements mapped | 19/19 |
+
+---
+
+## Accumulated Context
+
+### Key Decisions
+
+| Decision | Rationale | Status |
+|----------|-----------|--------|
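+| Extract crawler_core/ first, then change platform code | Keeps the signing code from being copy-pasted again | Active |
+| Run old and new code in parallel behind a feature flag | Live crawlers are not interrupted during the refactor | Active |
+| jobs_spider/ is eventually retired | Migrate to spiderJobs/ + crawler_core | Active |
+| Synchronous core + asyncio.to_thread() bridge | requests_go has no async support; staying synchronous preserves the TLS fingerprint | Active |
+| 30-day time-window dedup | Avoids full-table scans in ClickHouse | Active |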
+
+### Architecture Notes
+
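+- `crawler_core/` lives in the project root and is installable locally with `pip install -e ./crawler_core`
+- The core stays synchronous (requests_go); the backend calls it through asyncio.to_thread()
+- SmartIPManager must be a singleton inside FastAPI so that session state is not lost
+- ClickHouse table schemas stay unchanged; MySQL model changes go through Aerich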
+
+### Known Risks
+
+| Risk | Mitigation |
+|------|-----------|
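+| Refactor interrupts live crawlers | Run old and new code in parallel behind a feature flag; keep the old path until Phase 4 is complete |
+| Anti-crawling session state is lost | Make SmartIPManager a singleton; verify it closely in Phase 4 |
+| No test baseline | Phase 1 starts with tests for the pure signing functions; Phase 2 adds HTTP mock tests alongside |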
+
+### Open TODOs
+
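+- [ ] Confirm the Python package name for crawler_core/ (suggested: `crawler_core`, to avoid clashing with system packages)
+- [ ] Decide how the feature flag is implemented (environment variable vs config file)
+- [ ] Phase 3 can run in parallel with Phase 2 (the platforms are independent of each other); decide based on progress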
+
+### Blockers
+
+None currently.
+
+---
+
+## Session Continuity
+
+**To resume this project:**
+
+1. Check current phase in this file
+2. Read `.planning/ROADMAP.md` for phase goals and success criteria
+3. Read `.planning/REQUIREMENTS.md` for requirement coverage
+4. Run `git status` to see uncommitted work
+5. Start with `/gsd:plan-phase {current_phase}` if no active plan
+
+**Key files:**
+- `.planning/ROADMAP.md` — Phase structure and success criteria
+- `.planning/REQUIREMENTS.md` — All v1 requirements with traceability
+- `.planning/PROJECT.md` — Project context and constraints
+- `crawler_core/` — (to be created) shared package
+- `spiderJobs/` — external scripts using crawler_core
+- `app/services/crawler/` — backend facade layer
+
+---
+
+*State initialized: 2026-03-21*
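Reviewer note on the "synchronous core + asyncio.to_thread() bridge" decision recorded in STATE.md above: the pattern can be sketched with the standard library alone. `fetch_jobs_sync` below is a hypothetical stand-in for a blocking crawler call (the real clients are said to use requests_go, which is not imported here), and the returned dictionary shape is invented for illustration.

```python
import asyncio
import time


def fetch_jobs_sync(keyword: str) -> dict:
    """Hypothetical stand-in for a blocking crawler call."""
    time.sleep(0.05)  # simulates blocking network I/O
    return {"keyword": keyword, "jobs": []}


async def fetch_jobs(keyword: str) -> dict:
    # Run the blocking call in a worker thread so the event loop stays responsive.
    return await asyncio.to_thread(fetch_jobs_sync, keyword)


async def main() -> list:
    # gather() returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(fetch_jobs(k) for k in ("python", "go", "rust")))


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Because each call runs in the default thread pool, shared state such as a proxy or session manager is touched from several threads at once. That is presumably part of why the notes above ask for SmartIPManager to be a singleton and to be verified closely in Phase 4.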
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -0,0 +1,485 @@
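+# JobData - Recruitment data collection and analytics platform
+
+## Changelog
+
+| Version | Date | Notes |
+|------|------|------|
+| Initial | 2026-03-20 | First generated architecture document, covering all four core modules |
+
+---
+
+## Project vision
+
+JobData is a full-stack data collection and analytics platform for the recruitment market. It automatically crawls job and company data from three major recruitment platforms (Boss 直聘, 前程无忧, 智联招聘), stores it in the ClickHouse columnar database, and exposes data browsing, targeted cleaning, and statistical analysis through a FastAPI backend and a Vue3 frontend. The ECS elastic instance management module starts and stops crawler nodes in bulk on Alibaba Cloud on demand.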
+
+---
+
+## Architecture overview
+
+```
+[Platforms]    [Crawler layer]            [Backend API]       [Databases]
+Boss直聘  ──►  jobs_spider/boss/     ──►                      ┌── MySQL
+前程无忧  ──►  jobs_spider/qcwy/     ──►  app (FastAPI)  ──►  └── ClickHouse
+智联招聘  ──►  jobs_spider/zhilian/  ──►
+                                                              ▲
+                                                              │
+                                          web (Vue3)  ────────┘
+                                          (frontend pages)
+
+ecs_full_pipeline.py  ──►  Alibaba Cloud ECS  ──►  batch-launch crawler nodes
+```
+
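+**Core technology stack**
+
+| Layer | Technology |
+|------|------|
+| Backend | Python 3.13, FastAPI 0.111, Tortoise-ORM 0.23, APScheduler |
+| Database (business) | MySQL (users/permissions/audit/keywords/tokens) |
+| Database (collected data) | ClickHouse (raw job/company JSON data + analytics views) |
+| Crawlers | requests / httpx / Playwright (Python scripts) |
+| Frontend | Vue 3.3, Vite 4, Naive UI, Pinia, ECharts |
+| Infrastructure | Alibaba Cloud ECS (pay-as-you-go spot instances), APScheduler scheduled tasks |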
+
+---
+
+## Module structure diagram
+
+```mermaid
+graph TD
+    ROOT["(root) JobData"] --> APP["app - FastAPI backend"]
+    ROOT --> WEB["web - Vue3 frontend"]
+    ROOT --> SPIDER["jobs_spider - platform crawlers"]
+    ROOT --> ECS["ecs_full_pipeline.py - ECS batch deployment"]
+
+    APP --> APP_API["app/api - routing layer"]
+    APP --> APP_SVC["app/services - business logic"]
+    APP --> APP_CORE["app/core - framework core"]
+    APP --> APP_MODELS["app/models - ORM models"]
+    APP --> APP_REPO["app/repositories - data repositories"]
+
+    SPIDER --> SP_BOSS["jobs_spider/boss"]
+    SPIDER --> SP_QCWY["jobs_spider/qcwy"]
+    SPIDER --> SP_ZL["jobs_spider/zhilian"]
+
+    click APP "./app/CLAUDE.md" "View the app module docs"
+    click WEB "./web/CLAUDE.md" "View the web module docs"
+    click SPIDER "./jobs_spider/CLAUDE.md" "View the jobs_spider module docs"
+```
+
+---
+
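+## Module index
+
+| Module path | Language | Responsibility |
+|----------|------|----------|
+| `app/` | Python | FastAPI backend: REST API, permission management, scheduled tasks, data ingest and analytics |
+| `web/` | Vue3/JS | Frontend admin UI: data display, keyword management, proxy management, data cleaning operations |
+| `jobs_spider/` | Python | Crawler scripts for the three platforms; they run independently and push results to the backend over HTTP |
+| `ecs_full_pipeline.py` | Python | End-to-end script for batch creating/destroying Alibaba Cloud ECS instances and dispatching commands |
+| `reclean_qcwy_jobs.py` | Python | Standalone script for re-cleaning 前程无忧 data |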
+
+---
+
+## Running and development
+
+### Starting the backend
+
+```bash
+# Install dependencies (pipenv)
+pipenv install
+
+# Development mode (default port 9999, 20 workers)
+python run.py
+
+# Override via environment variables
+APP_HOST=0.0.0.0 APP_PORT=9999 UVICORN_WORKERS=4 python run.py
+```
+
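+**Key environment variables**
+
+| Variable | Default | Description |
+|------|--------|------|
+| `APP_HOST` | `0.0.0.0` | Listen address |
+| `APP_PORT` | `9999` | Listen port |
+| `UVICORN_WORKERS` | `20` | Number of workers |
+| `CLICKHOUSE_HOST` | `121.4.126.241` | ClickHouse address (change to your actual address) |
+| `CLICKHOUSE_USER` / `CLICKHOUSE_PASS` | see config.py | ClickHouse credentials |
+| `SMTP_HOST` / `SMTP_USER` / `SMTP_PASS` | see config.py | Email alert settings |
+| `REPORT_ENDPOINT` | empty | Webhook endpoint for reporting statistics |
+| `RUN_MIGRATIONS_ON_STARTUP` | `True` | Whether to run migrations automatically at startup |
+| `INITIALIZE_SEED_DATA_ON_STARTUP` | `True` | Whether to initialize seed data at startup |
+
+> Security warning: `SECRET_KEY`, the database connection strings, and the SMTP password in `config.py` are all hardcoded defaults; production deployments must override them via environment variables.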
+
+### Starting the frontend
+
+```bash
+cd web
+pnpm install
+pnpm dev        # development mode, default http://localhost:5173
+pnpm build      # build output goes to web/dist
+```
+
+### ECS batch crawler deployment
+
+```bash
+# Requires Alibaba Cloud credentials (environment variables or ~/.alibabacloud/credentials)
+python ecs_full_pipeline.py
+```
+
+---
+
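+## Scheduled tasks
+
+APScheduler registers the following tasks at application startup (`app/core/scheduler.py`):
+
+| Task ID | Frequency | Responsibility |
+|---------|------|------|
+| `stats_job` | Every 6 hours | Count the totals of each ClickHouse table and report them by email/Webhook |
+| `ecs_full_pipeline` | Every 6 hours | Call `ecs_full_pipeline.py` to refresh crawler nodes in bulk |
+| `ip_alert_job` | Every 10 minutes | Check for IP reporting anomalies and raise alerts |
+| `company_cleaning_job` | Every 5 minutes | Automatically clean pending company data (collect 50 + process 30) |
+| `daily_cleanup_job` | Daily at 00:05 | Clean up historical task run records |
+
+Every task is guarded by a distributed file lock (or an optional Redis lock) so that it runs only once across multiple workers.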
+
+---
+
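+## Testing strategy
+
+- The codebase currently has **no automated test files** (gap: both unit tests and integration tests are missing).
+- Recommended additions:
+  1. Unit tests for the service layer in `app/services/` (using `pytest` + `anyio`)
+  2. API integration tests for `app/api/v1/` (using `httpx.AsyncClient`)
+  3. Unit tests for the data parsing functions in `jobs_spider/`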
+
+---
+
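+## Coding standards
+
+- Python: lint with `ruff` (already in the Pipfile), format with `black`, sort imports with `isort`.
+- Frontend: ESLint (`@zclzone` + `@unocss` rule sets), formatted with `prettier`.
+- Types: the backend enforces input validation through `pydantic` schemas; the frontend is mostly JS (strict TS is not enabled).
+- Logging: the backend uses `loguru` throughout, emitting structured fields through `logger.info(...)` calls.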
+
+---
+
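+## AI usage guidelines
+
+- When changing crawler logic, pay close attention to the anti-crawling countermeasures: `SmartIPManager` and `IPAnomalyDetector` are implemented in `jobs_spider/boss/boos_api.py`, and the random delay is at least 10 seconds.
+- After adding an API route, register it in `app/api/v1/__init__.py` and run `api_controller.refresh_api()` to update the permission table.
+- ClickHouse schema changes are maintained in `app/core/clickhouse_init.py` and **do not go through Aerich migrations**.
+- MySQL model changes go through Aerich (`aerich migrate && aerich upgrade`).
+- New frontend pages must register their menu both in `web/src/views/{module}/route.js` and in the backend `init_menus()`.
+- `config.py` hardcodes real MySQL/ClickHouse connection strings and SMTP credentials; **before committing, make sure no sensitive information is leaked**.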
+
+<!-- GSD:project-start source:PROJECT.md -->
+## Project
+
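+**JobData crawler interaction refactor**
+
+A refactor of the crawler interaction layer of the recruitment data collection platform. It rewrites the crawler code for three platforms (Boss 直聘, 前程无忧, 智联招聘), unifies the base classes and shared core logic, and improves ingest deduplication, the company information cleaning flow, and the frontend monitoring pages.
+
+**Core Value:** Keyword-driven crawlers fetch job postings, ingest them reliably into ClickHouse, and scheduled tasks collect and sync company information.
+
+### Constraints
+
+- **Compatibility**: existing crawler data collection must not be interrupted during the refactor
+- **Shared core**: external scripts and the backend must use the same signing/request/parsing code to avoid duplication
+- **Database**: ClickHouse table schemas stay unchanged; MySQL model changes go through Aerich migrations
+- **Anti-crawling**: existing countermeasures (IP rotation, random delays, proxy management) must be preserved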
+<!-- GSD:project-end -->
+
+<!-- GSD:stack-start source:codebase/STACK.md -->
+## Technology Stack
+
+## Languages
+- Python 3.13 - Backend API, crawlers, ECS pipeline scripts
+- JavaScript (ES Modules) - Vue3 frontend (no strict TypeScript; TS installed but JS used in `.vue` and `.js` files)
+- TypeScript 5.1.6 - Type definitions referenced in frontend toolchain only
+## Runtime
+- Python 3.13 (required: `>=3.13` per `Pipfile`; Dockerfile uses `python:3.11-slim-bullseye` for container build)
+- Node.js 18 (Dockerfile `node:18-alpine` for frontend build stage)
+- Python: `pipenv` (development), `pip` / `requirements.txt` (Docker production)
+- Frontend: `pnpm` (lockfile: `web/pnpm-lock.yaml` present)
+- Lockfiles: `Pipfile.lock` (Python), `web/pnpm-lock.yaml` (frontend), `uv.lock` (uv-compatible)
+## Frameworks
+- FastAPI 0.111.0 - REST API framework (`app/__init__.py` factory via `create_app()`)
+- Starlette 0.37.2 - ASGI underpinning (middleware, static files)
+- Uvicorn 0.34.0 - ASGI server (20 workers default, configurable via `UVICORN_WORKERS`)
+- Uvloop 0.21.0 - High-performance event loop (non-Windows only)
+- Tortoise-ORM 0.23.0 - Async ORM for MySQL (`app/models/`)
+- Aerich 0.8.1 - Database migrations for Tortoise-ORM (`migrations/` directory)
+- aiomysql - Async MySQL driver (Tortoise backend)
+- clickhouse-connect 0.7.19 - ClickHouse async client (`app/core/clickhouse.py`)
+- asyncpg - Async PostgreSQL driver (listed in Pipfile but not actively used)
+- Pydantic 2.10.5 - Request/response schemas (`app/schemas/`)
+- pydantic-settings 2.7.1 - Settings management via env vars (`app/settings/config.py`)
+- orjson 3.10.14 - Fast JSON serialization
+- ujson 5.10.0 - Alternative JSON library
+- PyJWT 2.10.1 - JWT token generation/verification (HS256, 7-day expiry)
+- passlib 1.7.4 - Password hashing
+- argon2-cffi 23.1.0 - Argon2 password hasher
+- APScheduler - Async job scheduler (`app/core/scheduler.py`, 6 registered cron tasks)
+- httpx 0.28.1 - Async HTTP client (backend service-to-service calls)
+- requests - Sync HTTP client (spider scripts in `jobs_spider/`)
+- loguru 0.7.3 - Structured logging (unified throughout backend and spiders)
+- Vue 3.3.4 - SPA framework (`web/src/main.js`)
+- Vue Router 4.2.4 - Client-side routing (`web/src/router/index.js`)
+- Pinia 2.1.6 - State management (`web/src/store/`)
+- Naive UI 2.34.4 - Component library (admin UI)
+- ECharts 6.0.0 - Data visualization charts (`web/src/views/analytics/`)
+- axios 1.4.0 - HTTP client (`web/src/utils/http/`)
+- vue-i18n 9 - Internationalization
+- Vite 4.4.6 - Frontend bundler (`web/vite.config.js`)
+- UnoCSS 66.5.10 - Atomic CSS engine
+- unplugin-auto-import 20.3.0 - Auto-imports for Vue APIs
+- unplugin-vue-components 30.0.0 - Auto-imports for components
+- @iconify/vue + @iconify/json - Icon library
+- playwright 1.57.0 - Browser automation (anti-detection in crawlers)
+- PyExecJS 1.5.1 - Execute JavaScript signing algorithms from Python (`jobs_spider/boss/`)
+- PySocks - SOCKS proxy support for spider requests
+- tenacity - Retry logic in crawlers
+- pandas + openpyxl - Data processing and Excel export
+- alibabacloud_ecs20140526 - Alibaba Cloud ECS SDK (`ecs_full_pipeline.py`)
+- alibabacloud_credentials - AliCloud credential management
+- ruff 0.9.1 - Python linter (`pyproject.toml` config: line-length 120, ignores F403/F405)
+- black 24.10.0 - Python formatter (line-length 120, target py310/py311)
+- isort 5.13.2 - Import sorter
+- ESLint 8.46.0 - Frontend linter (`@zclzone` + `@unocss` rule sets)
+- prettier - Frontend formatter
+## Key Dependencies
+- `clickhouse-connect==0.7.19` - All analytics and job data storage; loss means no data read/write
+- `tortoise-orm==0.23.0` - All business data (users, roles, keywords, tokens); paired with `aerich` for migrations
+- `fastapi==0.111.0` - API layer; version-pinned for stability
+- `APScheduler` - 6 scheduled tasks including ECS pipeline and IP alerting
+- `alibabacloud_ecs20140526` - ECS node management; required for crawler scaling
+- `uvicorn==0.34.0` + `uvloop==0.21.0` - Production ASGI server stack
+- `pydantic==2.10.5` - All input validation; v2 API used throughout
+- `loguru==0.7.3` - Unified logging across all modules
+- `redis` - Optional distributed lock backend (`app/core/locks.py`; falls back to file locks if Redis unavailable)
+## Configuration
+- All settings in `app/settings/config.py` via `pydantic-settings.BaseSettings`
+- Environment variables override defaults at startup
+- No `.env` file detected (not committed); variables set at OS/container level
+- Key variables: `APP_HOST`, `APP_PORT`, `UVICORN_WORKERS`, `CLICKHOUSE_HOST`, `CLICKHOUSE_USER`, `CLICKHOUSE_PASS`, `SECRET_KEY`, `SMTP_HOST`, `SMTP_USER`, `SMTP_PASS`, `REDIS_HOST`, `REPORT_ENDPOINT`
+- **Security warning**: `config.py` contains hardcoded default values for MySQL, ClickHouse, and SMTP credentials; must be overridden in production
+- `pyproject.toml` - Python project metadata, black/ruff tool config
+- `Pipfile` / `Pipfile.lock` - Development dependency management
+- `requirements.txt` - Production pip install (used in Dockerfile)
+- `web/vite.config.js` - Vite build config with proxy support via `VITE_USE_PROXY` env var
+- `Dockerfile` - Multi-stage build: Node 18 (frontend) + Python 3.11 (backend) + nginx
+## Platform Requirements
+- Python 3.13+ (Pipfile requirement)
+- Node.js 18+ with pnpm
+- MySQL server (Tortoise-ORM connection)
+- ClickHouse server (analytics data)
+- Optional: Redis (distributed locking upgrade)
+- Docker-based deployment (see `Dockerfile`, `deploy/entrypoint.sh`, `deploy/web.conf`)
+- Nginx serves frontend static files and proxies API requests
+- Alibaba Cloud ECS (cn-shanghai-b zone) for crawler nodes
+- Ports: 80 (nginx), 9999 (FastAPI backend direct)
+<!-- GSD:stack-end -->
+
+<!-- GSD:conventions-start source:CONVENTIONS.md -->
+## Conventions
+
+## Naming Patterns
+- Snake_case for all Python files: `company_storage.py`, `company_cleaner.py`, `clickhouse_repo.py`
+- Private/internal modules prefixed with underscore: `_base.py`, `_boss_api.py`, `_boss_client.py`, `_boss_sign.py`, `_http_client.py`
+- Platform-named service files: `boss.py`, `qcwy.py`, `zhilian.py` under `app/services/crawler/`
+- Router files named after domain: `keyword.py`, `analytics.py`, `cleaning.py`
+- PascalCase throughout: `CleaningService`, `KeywordController`, `ClickHouseBaseRepo`, `JobAnalyticsRepo`
+- Services: `{Domain}Service` — `BossService`, `QcwyService`, `ZhilianService`, `IngestService`, `AnalyticsService`
+- Controllers: `{Domain}Controller` — `KeywordController`
+- Repos: `{Domain}Repo` or `{Domain}BaseRepo` — `ClickHouseBaseRepo`, `JobAnalyticsRepo`
+- Models (Tortoise ORM): `{Platform}{Entity}` — `BossKeyword`, `QcwyCompany`, `ZhilianCompany`
+- Schemas (Pydantic): `{Entity}Base`, `{Entity}Create`, `{Entity}Update`, `{Entity}Out` — see `app/schemas/keyword.py`
+- Snake_case for all functions and methods: `get_available`, `report_page_progress`, `store_batch`, `build_insert_row`
+- Private helpers prefixed with underscore: `_apply_proxy`, `_ensure_boss_token_loaded`, `_pick_first`, `_nested_get`, `_clean_text`, `_model_for_source`
+- Async dependency factories follow pattern `get_{service/controller}()`: `get_ingest_service`, `get_analytics_service`, `get_keyword_controller`
+- Snake_case: `data_list`, `platform_type`, `check_duplicate`, `page_size`
+- Module-level constants: UPPER_SNAKE_CASE — `COMPANY_SOURCES`, `QUEUE_TERMINAL_STATUSES`
+- Class-level constants: UPPER_SNAKE_CASE prefixed `_` — `_TOKEN_REFRESH_INTERVAL = 3600`
+- Enums use PascalCase class name, UPPER_SNAKE_CASE values: `PlatformType.BOSS`, `ChannelType.MINI`, `DataType.JOB`
+- Enum values are lowercase strings matching URL slugs: `"boss"`, `"mini"`, `"job"` — see `app/schemas/ingest.py`
+- Enums inherit from `(str, Enum)` enabling direct string comparison
+## Code Style
+- Tool: `black` v24.10.0
+- Line length: 120 characters (set in `pyproject.toml` `[tool.black]` and `[tool.ruff]`)
+- Target Python versions: 3.10, 3.11 (black), 3.13 (Pipfile)
+- Tool: `ruff` v0.9.1 (configured in `pyproject.toml`)
+- Ignored rules: `F403` (star imports), `F405` (may be undefined from star import)
+- Star imports from internal modules are allowed (used in `app/models/__init__.py`, `app/services/ingest/__init__.py`)
+- Tool: `isort` v5.13.2
+- No explicit isort config found; follows default ordering
+## Import Organization
+- None; all imports use full `app.` prefix paths
+- `from app.log import logger` is the canonical loguru import path
+- Used only in `__init__.py` re-export files: `from .admin import *` in `app/models/__init__.py`
+- `# noqa: F401, F403` comments suppress lint warnings for intentional star imports
+## Error Handling
+- Services return `Dict[str, Any]` result objects with `"success"`, `"code"`, `"message"` fields instead of raising exceptions to callers
+- Controllers return dict with `"code": 200/400/404` and `"message"` for all outcomes
+- API route handlers do NOT use try/except — they rely on services returning structured results
+- Service methods wrap low-level calls in `try/except Exception as e` and log then return `False` or error dict
+## Logging
+- `logger.info(f"...")` for normal operation events
+- `logger.warning(f"...")` for non-fatal recoverable issues (e.g., token not found, API soft failures)
+- `logger.error(f"...")` for caught exceptions and operation failures
+- F-string interpolation used consistently for message formatting
+- No structured fields (no `logger.bind()` usage observed)
+## API Response Format
+## Comments
+- Docstrings on public methods describing purpose, not implementation: `"""获取可用关键词，优先返回断点续爬和失败重试的关键词"""`
+- Inline comments for priority logic and algorithm steps: `# 优先级 1: 断点续爬 (partial)`
+- Module-level docstrings for context: `"""Boss直聘 Service — 基于新算法文件的封装"""`
+- `# noqa` comments for intentional lint suppressions
+- Not applicable (Python backend)
+- Docstrings are brief single-line or short multi-line Chinese descriptions
+## Function Design
+- Keyword arguments with defaults preferred for optional params
+- Pydantic schemas used for HTTP request bodies (never raw dicts from router params)
+- `Optional[str]` with `= None` default for optional parameters
+- Services return `Dict[str, Any]` with consistent keys (`code`, `message`, `data`)
+- Private helpers return `Optional[T]` or primitive types
+- Async functions return awaitable results (no mixing of sync/async)
+## Module Design
+- `app/models/__init__.py` uses `from .{module} import *` to flatten model imports
+- Router modules export a single named router variable: `router = APIRouter(...)` or `{domain}_router = APIRouter(...)`
+- Service classes are imported directly by name
+- `app/models/__init__.py` — re-exports all model classes
+- `app/services/ingest/__init__.py` — re-exports `IngestService` and config registrations
+- `app/api/v1/__init__.py` — aggregates all routers into `v1_router`
+- FastAPI `Depends()` used for service/controller instantiation in route handlers
+- Dependency factory functions named `get_{service}()` and defined in the same file as the router
+- Shared auth dependencies: `DependAuth`, `DependPermission` in `app/core/dependency.py`
+## Tortoise ORM Model Conventions
+## Pydantic Schema Conventions
+- All schemas inherit from `pydantic.BaseModel`
+- All fields use `Field(...)` with `description=` for documentation
+- Enums inherit from `(str, Enum)` for JSON serialization compatibility
+- Output schemas include `class Config: from_attributes = True` to support ORM mode
+- Validation patterns use `Field(..., pattern="^(boss|qcwy|zhilian)$")` for enum-like string fields
+<!-- GSD:conventions-end -->
+
+<!-- GSD:architecture-start source:ARCHITECTURE.md -->
+## Architecture
+
+## Pattern Overview
+- FastAPI async backend with dual-database persistence (MySQL for RBAC/metadata, ClickHouse for job/company analytics data)
+- External crawlers (`jobs_spider/`, `spiderJobs/`) run as independent processes and push data to the backend API via HTTP POST — no shared code path at runtime
+- RBAC permission model enforced at router-level via FastAPI `Depends`
+- APScheduler drives all recurring automation (cleaning, ECS orchestration, alerting) within the main process
+- Registry pattern (`app/services/ingest/registry.py`) maps (platform, channel, data_type) tuples to ClickHouse table configs + dedup rules
+## Layers
+- Purpose: Parse HTTP request, validate via Pydantic schema, delegate to service or controller, return JSON envelope
+- Location: `app/api/v1/`
+- Contains: One sub-package per domain (e.g., `job/`, `cleaning/`, `analytics/`, `keyword/`)
+- Depends on: Services, Controllers, `app/core/dependency.py` for auth
+- Used by: FastAPI application via `app/api/v1/__init__.py` → `app/__init__.py`
+- Purpose: Business logic, orchestration across repos/models, side-effect coordination
+- Location: `app/services/`
+- Contains: `IngestService`, `AnalyticsService`, `CleaningService`, `CompanyCleaner`, `CompanyJobsSyncService`, platform crawler service wrappers (`crawler/boss.py`, `crawler/qcwy.py`, `crawler/zhilian.py`)
+- Depends on: Repositories, ORM models, `clickhouse_manager`, external HTTP clients
+- Used by: Routers, APScheduler jobs
+- Purpose: Thin CRUD orchestration for MySQL-backed entities (User, Role, API, Menu, Dept, etc.)
+- Location: `app/controllers/`
+- Contains: `user_controller`, `api_controller`, `role_controller`, `menu_controller`, `dept_controller`, `keyword_controller`, `proxy_controller`, `token_controller`, `cleaning_controller`, `company_controller`
+- Depends on: Tortoise-ORM models
+- Used by: Routers
+- Purpose: Encapsulate raw ClickHouse query execution behind typed methods
+- Location: `app/repositories/`
+- Contains: `ClickHouseBaseRepo` (generic query/insert), `JobAnalyticsRepo` (trend, distribution, count queries against `job_analytics` view)
+- Depends on: `clickhouse_connect.driver.AsyncClient`
+- Used by: `AnalyticsService`
+- Purpose: ORM schema definitions for MySQL, migrated via Aerich
+- Location: `app/models/`
+- Contains: `admin.py` (User, Role, Api, Menu, Dept, AuditLog), `token.py` (BossToken), `cleaning.py` (CleaningTask), `metrics.py` (ScheduledTaskRun, StatsTotal, IpUploadStats), `company.py` (CompanyCleaningQueue), `keyword.py` (BossKeyword, QcwyKeyword, ZhilianKeyword)
+- Depends on: `tortoise-orm`
+- Used by: Controllers, Services, Scheduler jobs
+- Purpose: App wiring, connection management, middleware, cross-cutting utilities
+- Location: `app/core/`
+- Contains: `init_app.py` (lifespan bootstrap), `scheduler.py` (APScheduler jobs), `clickhouse.py` (ClickHouseManager singleton), `clickhouse_init.py` (ClickHouse DDL), `dependency.py` (AuthControl, PermissionControl), `locks.py` (DistributedLock), `middlewares.py` (BackGroundTaskMiddleware, HttpAuditLogMiddleware, IpTrackingMiddleware), `exceptions.py`, `crud.py`, `ip_tracking.py`
+- Depends on: Everything below
+- Used by: `app/__init__.py`, all layers above
+- Purpose: Platform-agnostic data ingestion pipeline with registry-driven routing, dedup, and remote push
+- Location: `app/services/ingest/`
+- Contains: `service.py` (IngestService), `registry.py` (PlatformConfig registry), `dedup.py` (batch dedup filter), `remote_push.py` (async HTTP push to secondary target), `configs/` (per-platform PlatformConfig registrations)
+- Depends on: `clickhouse_connect`, registry configs
+- Used by: `app/api/v1/job/job.py` router
+- Purpose: Wrap platform HTTP clients so backend scheduler can call crawlers directly (company cleaning, company search)
+- Location: `app/services/crawler/`
+- Contains: `_base.py` (BaseFetcher, BaseSearcher, ApiResult), `_http_client.py`, `_boss_client.py`, `_boss_api.py`, `_boss_sign.py`, `_zhilian_client.py`, `_zhilian_api.py`, `_zhilian_sign.py`, `_job51_client.py`, `_job51_api.py`, `_job51_sign.py`, `boss.py`, `qcwy.py`, `zhilian.py`
+- Depends on: `requests` (sync), platform sign/auth modules
+- Used by: `CompanyCleaner`, `CompanyController`
+- Purpose: Standalone crawlers running on ECS nodes; push data to backend via `POST /api/v1/universal/data/batch-store-async`
+- Location: `jobs_spider/boss/boos_api.py`, `jobs_spider/qcwy/qcwy.py`, `jobs_spider/zhilian/zhilian_single.py`, `spiderJobs/`
+- Depends on: `requests`, `loguru`, `PyExecJS`; no shared Python imports with `app/`
+- Used by: Scheduled via `ecs_full_pipeline.py` or manual launch
+## Data Flow
+- MySQL (via Tortoise-ORM): User sessions, RBAC entities, keyword crawl state, tokens, task run records
+- ClickHouse: All raw job/company JSON data + analytics views; no ORM — raw SQL via `clickhouse_connect`
+- Context variable `CTX_USER_ID` (Python `contextvars.ContextVar`) carries authenticated user ID within request scope (`app/core/ctx.py`)
+- File-based startup lock (`.startup_lock/` dir) prevents multi-worker seed data collisions
+## Key Abstractions
+- Purpose: Maps `(platform, channel, data_type)` to target ClickHouse table and dedup field extractors
+- Examples: `app/services/ingest/registry.py`, `app/services/ingest/configs/`
+- Pattern: Immutable `@dataclass(frozen=True)`, registered at module import time via `register()`
+- Purpose: Prevents concurrent execution of scheduled jobs across multiple Uvicorn workers
+- Examples: `app/core/locks.py`
+- Pattern: Redis SET NX with TTL; falls back to filesystem directory creation with TTL metadata
+- Purpose: Singleton lazy-init connection wrapper around `clickhouse_connect.AsyncClient`
+- Examples: `app/core/clickhouse.py`
+- Pattern: Global module-level instance `clickhouse_manager`, injected into services at call time via `await clickhouse_manager.get_client()`
+- Purpose: Abstract base for platform-specific API clients within the in-process crawler wrappers
+- Examples: `app/services/crawler/_base.py`
+- Pattern: Template method — subclasses implement `_build_params()`, base handles HTTP invocation and `ApiResult` parsing
+- Purpose: Common ORM base with `BigIntField pk`, `to_dict()` async serializer with M2M support
+- Examples: `app/models/base.py`
+- Pattern: Mixed-in with `TimestampMixin` (created_at, updated_at auto fields)
+## Entry Points
+- Location: `run.py`
+- Triggers: `python run.py` — reads `APP_HOST`, `APP_PORT`, `UVICORN_WORKERS` env vars; calls `uvicorn.run("app:app")`
+- Responsibilities: Configure uvicorn logging format, start ASGI server
+- Location: `app/__init__.py` → `create_app()`
+- Triggers: Uvicorn import of `app:app`
+- Responsibilities: Register middlewares, exception handlers, routers, static files; define lifespan
+- Location: `app/__init__.py` (lifespan context manager) + `app/core/init_app.py` (`init_data()`)
+- Triggers: FastAPI startup event
+- Responsibilities: Tortoise-ORM init + schema generation → Aerich migrations → seed data (superuser, menus, APIs, roles) → ClickHouse table init → APScheduler start
+- Location: `app/api/v1/__init__.py`
+- Triggers: Imported by `app/api/__init__.py` → `app/__init__.py`
+- Responsibilities: Mount all domain routers under `/api/v1/` with appropriate `DependPermission` guards
+- Location: `ecs_full_pipeline.py`
+- Triggers: Manual execution or scheduled via `scheduler.py` (`ecs_full_pipeline_job`) every 6 hours
+- Responsibilities: Create Alibaba Cloud ECS spot instances, push spider scripts, execute crawlers, terminate instances
+## Error Handling
+- Tortoise `DoesNotExist` → 404 via `DoesNotExistHandle` (registered in `app/core/init_app.py`)
+- Pydantic `RequestValidationError` → 422 with field-level detail via `RequestValidationHandle`
+- `IntegrityError` (DB constraint) → 400 via `IntegrityHandle`
+- Custom `HTTPException` (from `app/core/exceptions.py`) → structured JSON error via `HttpExcHandle`
+- Service layer exceptions logged with `loguru` then re-raised or returned as `{"success": False, "error": "..."}`
+- Scheduled jobs catch all exceptions, record failure via `_record_task_run()` to `ScheduledTaskRun` table, then continue
+## Cross-Cutting Concerns
+<!-- GSD:architecture-end -->
+
+<!-- GSD:workflow-start source:GSD defaults -->
+## GSD Workflow Enforcement
+
+Before using Edit, Write, or other file-changing tools, start work through a GSD command so planning artifacts and execution context stay in sync.
+
+Use these entry points:
+- `/gsd:quick` for small fixes, doc updates, and ad-hoc tasks
+- `/gsd:debug` for investigation and bug fixing
+- `/gsd:execute-phase` for planned phase work
+
+Do not make direct repo edits outside a GSD workflow unless the user explicitly asks to bypass it.
+<!-- GSD:workflow-end -->
+
+<!-- GSD:profile-start -->
+## Developer Profile
+
+> Profile not yet configured. Run `/gsd:profile-user` to generate your developer profile.
+> This section is managed by `generate-claude-profile` -- do not edit manually.
+<!-- GSD:profile-end -->
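Reviewer note on the registry pattern described in the Architecture section of CLAUDE.md above (a frozen dataclass registered at import time, keyed by `(platform, channel, data_type)`): a minimal sketch of that shape follows. Only the `PlatformConfig` name, the `register()` call, and the key tuple come from the document; the field names `table` and `dedup_field`, the `lookup` helper, and the example table name are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Key = Tuple[str, str, str]  # (platform, channel, data_type)


@dataclass(frozen=True)
class PlatformConfig:
    platform: str
    channel: str
    data_type: str
    table: str        # target ClickHouse table (illustrative name)
    dedup_field: str  # field used as the unique ID (illustrative)


_REGISTRY: Dict[Key, PlatformConfig] = {}


def register(config: PlatformConfig) -> None:
    """Add a config to the registry, rejecting duplicate keys."""
    key = (config.platform, config.channel, config.data_type)
    if key in _REGISTRY:
        raise ValueError(f"duplicate registration for {key}")
    _REGISTRY[key] = config


def lookup(platform: str, channel: str, data_type: str) -> PlatformConfig:
    """Return the config for a key; raises KeyError if none is registered."""
    return _REGISTRY[(platform, channel, data_type)]


# Registration happens at import time, as the architecture notes describe.
register(PlatformConfig("boss", "mini", "job", table="boss_mini_job", dedup_field="job_id"))
```

Freezing the dataclass means a registered config cannot be mutated after import, so every caller that looks up the same key sees the same routing and dedup rule.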