docs(phase-3): complete execution — 2/2 plans, 98 tests passing

- ARCH-04: job51 migrated to crawler_core (no old deps)
- ARCH-05: zhilian migrated to crawler_core (no old deps)
- 34 new mock tests (17 job51 + 17 zhilian)
- Added _parse_zhilian_response custom parser for zhilian API format
- Fixed POST Searcher _request() overrides for job51/zhilian
- Full regression: 98 passed in 0.12s
2026-03-21 19:19:17 +08:00 · 2026-03-21 19:19:17 +08:00 · 00a727519f
commit 00a727519f
parent 8c2c2d29d7
13 changed files with 1637 additions and 14 deletions
--- a/.planning/REQUIREMENTS.md
+++ b/.planning/REQUIREMENTS.md
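@@ -10,8 +10,8 @@
 - [x] **ARCH-01**: Extract `crawler_core/` as a standalone installable package containing the core signing, HTTP client, and response-parsing logic
 - [x] **ARCH-02**: Unify the BaseFetcher/BaseSearcher base classes; all three platforms implement the template method pattern
 - [x] **ARCH-03**: Rewrite the Boss Zhipin crawler client on top of crawler_core
-- [ ] **ARCH-04**: Rewrite the 51job crawler client on top of crawler_core
-- [ ] **ARCH-05**: Rewrite the Zhilian Zhaopin crawler client on top of crawler_core
+- [x] **ARCH-04**: Rewrite the 51job crawler client on top of crawler_core
+- [x] **ARCH-05**: Rewrite the Zhilian Zhaopin crawler client on top of crawler_core
 - [ ] **ARCH-06**: The backend app/services/crawler/ facade layer bridges the synchronous core via asyncio.to_thread()
 - [ ] **ARCH-07**: The external scripts in spiderJobs/ use crawler_core instead of inline code
 - [ ] **ARCH-08**: Retire the legacy jobs_spider/ framework and migrate to spiderJobs/ + crawler_core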
@@ -58,8 +58,8 @@
 | ARCH-01 | TBD | Complete |
 | ARCH-02 | TBD | Complete |
 | ARCH-03 | TBD | Complete |
-| ARCH-04 | TBD | Pending |
-| ARCH-05 | TBD | Pending |
+| ARCH-04 | TBD | Complete |
+| ARCH-05 | TBD | Complete |
 | ARCH-06 | TBD | Pending |
 | ARCH-07 | TBD | Pending |
 | ARCH-08 | TBD | Pending |
--- a/.planning/ROADMAP.md
+++ b/.planning/ROADMAP.md
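@@ -11,7 +11,7 @@

 - [ ] **Phase 1: Shared core package** - Extract the installable shared package crawler_core/; unify base classes and infrastructure
 - [x] **Phase 2: Boss Zhipin rewrite** - Rewrite the Boss Zhipin crawler client on top of crawler_core (completed 2026-03-21)
-- [ ] **Phase 3: 51job & Zhilian rewrite** - Rewrite the 51job and Zhilian Zhaopin crawler clients on top of crawler_core
+- [x] **Phase 3: 51job & Zhilian rewrite** - Rewrite the 51job and Zhilian Zhaopin crawler clients on top of crawler_core (completed 2026-03-21)
 - [ ] **Phase 4: Backend & external script integration** - Backend facade bridge + external script migration + retirement of the legacy framework
 - [ ] **Phase 5: Data pipeline optimization** - Ingest dedup, company-cleaning flow optimization, writing company job postings
 - [ ] **Phase 6: Quality & frontend** - Test coverage for data parsing, improvements to the frontend monitoring and data-cleaning pages
@@ -55,7 +55,7 @@ Plans:
  2. The Zhilian Zhaopin crawler inherits from BaseFetcher/BaseSearcher and contains no inline signing or HTTP boilerplate
  3. Each of the two platforms runs one real keyword crawl and returns job data successfully (manual verification)
  4. The new clients for all three platforms share a consistent code structure, so a new platform can be implemented quickly from the template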
-**Plans:** TBD
+**Plans:** 2/2 plans complete

 ### Phase 4: Backend & external script integration
 **Goal:** Both the backend facade and the external scripts use crawler_core; the legacy jobs_spider/ framework has served its purpose and is taken offline
@@ -98,7 +98,7 @@ Plans:
 |-------|----------------|--------|-----------|
 | 1. Shared core package | 2/2 | Complete | 2026-03-21 |
 | 2. Boss Zhipin rewrite | 2/2 | Complete    | 2026-03-21 |
-| 3. 51job & Zhilian rewrite | 0/? | Not started | - |
+| 3. 51job & Zhilian rewrite | 2/2 | Complete    | 2026-03-21 |
 | 4. Backend & external script integration | 0/? | Not started | - |
 | 5. Data pipeline optimization | 0/? | Not started | - |
 | 6. Quality & frontend | 0/? | Not started | - |
--- a/.planning/STATE.md
+++ b/.planning/STATE.md
@@ -4,12 +4,12 @@ milestone: v1.0
 milestone_name: milestone
 status: unknown
 stopped_at: Completed 01-shared-core Plan 02 (sign algorithms + unit tests)
-last_updated: "2026-03-21T11:04:42.115Z"
+last_updated: "2026-03-21T11:19:04.395Z"
 progress:
  total_phases: 6
-  completed_phases: 2
-  total_plans: 4
-  completed_plans: 4
+  completed_phases: 3
+  total_plans: 6
+  completed_plans: 6
 ---

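 # STATE: JobData crawler interaction refactor
@@ -24,13 +24,13 @@ progress:

 **Core value:** Keyword-driven crawlers fetch job data, store it reliably in ClickHouse, and sync company information on a schedule

-**Current focus:** Phase 2 - Boss Zhipin rewrite
+**Current focus:** Phase 3 - 51job & Zhilian rewrite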

 ---

 ## Current Position

-Phase: 3
+Phase: 4
 Plan: Not started

 ## Performance Metrics
--- a/.planning/codebase/ARCHITECTURE.md
+++ b/.planning/codebase/ARCHITECTURE.md
@@ -0,0 +1,201 @@
+# Architecture
+
+**Analysis Date:** 2026-03-21
+
+## Pattern Overview
+
+**Overall:** Layered monolith with embedded pipeline architecture
+
+**Key Characteristics:**
+- FastAPI async backend with dual-database persistence (MySQL for RBAC/metadata, ClickHouse for job/company analytics data)
+- External crawlers (`jobs_spider/`, `spiderJobs/`) run as independent processes and push data to the backend API via HTTP POST — no shared code path at runtime
+- RBAC permission model enforced at router-level via FastAPI `Depends`
+- APScheduler drives all recurring automation (cleaning, ECS orchestration, alerting) within the main process
+- Registry pattern (`app/services/ingest/registry.py`) maps (platform, channel, data_type) tuples to ClickHouse table configs + dedup rules
+
+## Layers
+
+**Router Layer:**
+- Purpose: Parse HTTP request, validate via Pydantic schema, delegate to service or controller, return JSON envelope
+- Location: `app/api/v1/`
+- Contains: One sub-package per domain (e.g., `job/`, `cleaning/`, `analytics/`, `keyword/`)
+- Depends on: Services, Controllers, `app/core/dependency.py` for auth
+- Used by: FastAPI application via `app/api/v1/__init__.py` → `app/__init__.py`
+
+**Service Layer:**
+- Purpose: Business logic, orchestration across repos/models, side-effect coordination
+- Location: `app/services/`
+- Contains: `IngestService`, `AnalyticsService`, `CleaningService`, `CompanyCleaner`, `CompanyJobsSyncService`, platform crawler service wrappers (`crawler/boss.py`, `crawler/qcwy.py`, `crawler/zhilian.py`)
+- Depends on: Repositories, ORM models, `clickhouse_manager`, external HTTP clients
+- Used by: Routers, APScheduler jobs
+
+**Controller Layer (RBAC domain operations):**
+- Purpose: Thin CRUD orchestration for MySQL-backed entities (User, Role, API, Menu, Dept, etc.)
+- Location: `app/controllers/`
+- Contains: `user_controller`, `api_controller`, `role_controller`, `menu_controller`, `dept_controller`, `keyword_controller`, `proxy_controller`, `token_controller`, `cleaning_controller`, `company_controller`
+- Depends on: Tortoise-ORM models
+- Used by: Routers
+
+**Repository Layer:**
+- Purpose: Encapsulate raw ClickHouse query execution behind typed methods
+- Location: `app/repositories/`
+- Contains: `ClickHouseBaseRepo` (generic query/insert), `JobAnalyticsRepo` (trend, distribution, count queries against `job_analytics` view)
+- Depends on: `clickhouse_connect.driver.AsyncClient`
+- Used by: `AnalyticsService`
+
+**Model Layer:**
+- Purpose: ORM schema definitions for MySQL, migrated via Aerich
+- Location: `app/models/`
+- Contains: `admin.py` (User, Role, Api, Menu, Dept, AuditLog), `token.py` (BossToken), `cleaning.py` (CleaningTask), `metrics.py` (ScheduledTaskRun, StatsTotal, IpUploadStats), `company.py` (CompanyCleaningQueue), `keyword.py` (BossKeyword, QcwyKeyword, ZhilianKeyword)
+- Depends on: `tortoise-orm`
+- Used by: Controllers, Services, Scheduler jobs
+
+**Core Infrastructure Layer:**
+- Purpose: App wiring, connection management, middleware, cross-cutting utilities
+- Location: `app/core/`
+- Contains: `init_app.py` (lifespan bootstrap), `scheduler.py` (APScheduler jobs), `clickhouse.py` (ClickHouseManager singleton), `clickhouse_init.py` (ClickHouse DDL), `dependency.py` (AuthControl, PermissionControl), `locks.py` (DistributedLock), `middlewares.py` (BackGroundTaskMiddleware, HttpAuditLogMiddleware, IpTrackingMiddleware), `exceptions.py`, `crud.py`, `ip_tracking.py`
+- Depends on: Everything below
+- Used by: `app/__init__.py`, all layers above
+
+**Ingest Sub-system:**
+- Purpose: Platform-agnostic data ingestion pipeline with registry-driven routing, dedup, and remote push
+- Location: `app/services/ingest/`
+- Contains: `service.py` (IngestService), `registry.py` (PlatformConfig registry), `dedup.py` (batch dedup filter), `remote_push.py` (async HTTP push to secondary target), `configs/` (per-platform PlatformConfig registrations)
+- Depends on: `clickhouse_connect`, registry configs
+- Used by: `app/api/v1/job/job.py` router
+
+**Spider Crawler Service Wrappers (in-process):**
+- Purpose: Wrap platform HTTP clients so backend scheduler can call crawlers directly (company cleaning, company search)
+- Location: `app/services/crawler/`
+- Contains: `_base.py` (BaseFetcher, BaseSearcher, ApiResult), `_http_client.py`, `_boss_client.py`, `_boss_api.py`, `_boss_sign.py`, `_zhilian_client.py`, `_zhilian_api.py`, `_zhilian_sign.py`, `_job51_client.py`, `_job51_api.py`, `_job51_sign.py`, `boss.py`, `qcwy.py`, `zhilian.py`
+- Depends on: `requests` (sync), platform sign/auth modules
+- Used by: `CompanyCleaner`, `CompanyController`
+
+**External Spider Processes:**
+- Purpose: Standalone crawlers running on ECS nodes; push data to backend via `POST /api/v1/universal/data/batch-store-async`
+- Location: `jobs_spider/boss/boos_api.py`, `jobs_spider/qcwy/qcwy.py`, `jobs_spider/zhilian/zhilian_single.py`, `spiderJobs/`
+- Depends on: `requests`, `loguru`, `PyExecJS`; no shared Python imports with `app/`
+- Used by: Scheduled via `ecs_full_pipeline.py` or manual launch
+
+## Data Flow
+
+**Crawler → Storage Flow:**
+
+1. External spider (`jobs_spider/boss/boos_api.py`) fetches keyword from `GET /api/v1/keyword/available`
+2. Spider calls Boss/QCWY/Zhilian platform API with rotating proxy/token
+3. Spider POSTs raw JSON batch to `POST /api/v1/universal/data/batch-store-async`
+4. Router (`app/api/v1/job/job.py`) delegates to `IngestService.store_batch()`
+5. `IngestService` looks up `PlatformConfig` from registry (`app/services/ingest/registry.py`)
+6. Dedup filter queries ClickHouse for existing records (`app/services/ingest/dedup.py`)
+7. New rows inserted to ClickHouse table (e.g., `job_data.boss_job`) via `clickhouse_connect`
+8. Optional async push to secondary remote endpoint via `remote_push.py`
+
+**Analytics Query Flow:**
+
+1. Frontend calls `GET /api/v1/analytics/volume-trend?interval=day&from_dt=...`
+2. Router (`app/api/v1/analytics.py`) creates `AnalyticsService` with `clickhouse_manager` client
+3. `AnalyticsService` delegates to `JobAnalyticsRepo.get_volume_trend()`
+4. `JobAnalyticsRepo` queries the `job_data.job_analytics` VIEW (UNION ALL across three platform job tables)
+5. Returns time-bucketed counts grouped by source
+
+**RBAC Authentication Flow:**
+
+1. Client sends `POST /api/v1/base/access_token` with credentials
+2. `AuthControl.is_authed()` verifies JWT (HS256, 7-day expiry) from `token` header
+3. `PermissionControl.has_permission()` loads user's roles → role's API list
+4. Compares `(method, path)` tuple against allowed API set
+5. Superusers bypass permission check
+
+**Company Cleaning Flow (Scheduled):**
+
+1. `company_cleaning_job()` in `app/core/scheduler.py` fires every 5 minutes
+2. `CompanyCleaner.collect_pending_companies(limit=50)` queries ClickHouse job tables for company IDs not yet in `pending_company` table
+3. `CompanyCleaner.process_pending_companies(limit=30)` fetches company details from platform crawlers (BossService, QcwyService, ZhilianService)
+4. Cleaned company data stored via `company_storage` to ClickHouse `boss_company` / `qcwy_company` / `zhilian_company`
+5. Run protected by `DistributedLock` (file-based fallback, optional Redis)
+
+**State Management:**
+- MySQL (via Tortoise-ORM): User sessions, RBAC entities, keyword crawl state, tokens, task run records
+- ClickHouse: All raw job/company JSON data + analytics views; no ORM — raw SQL via `clickhouse_connect`
+- Context variable `CTX_USER_ID` (Python `contextvars.ContextVar`) carries authenticated user ID within request scope (`app/core/ctx.py`)
+- File-based startup lock (`.startup_lock/` dir) prevents multi-worker seed data collisions
+
+## Key Abstractions
+
+**PlatformConfig (Ingest Registry):**
+- Purpose: Maps `(platform, channel, data_type)` to target ClickHouse table and dedup field extractors
+- Examples: `app/services/ingest/registry.py`, `app/services/ingest/configs/`
+- Pattern: Immutable `@dataclass(frozen=True)`, registered at module import time via `register()`
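Reviewer sketch (not part of the commit): the registry pattern described above could look roughly like this. The field names, the `lookup()` helper, and the `"app"` channel value are assumptions made for illustration; only `PlatformConfig`, `register()`, the `(platform, channel, data_type)` key, and the `job_data.boss_job` table come from the document.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

RegistryKey = Tuple[str, str, str]  # (platform, channel, data_type)


@dataclass(frozen=True)
class PlatformConfig:
    """Immutable routing entry; field names here are hypothetical."""
    platform: str
    channel: str
    data_type: str
    table: str  # target ClickHouse table
    dedup_key: Callable[[dict], str]  # extracts the dedup field from a raw record


_REGISTRY: Dict[RegistryKey, PlatformConfig] = {}


def register(config: PlatformConfig) -> PlatformConfig:
    """Add a config to the registry, rejecting duplicate keys."""
    key = (config.platform, config.channel, config.data_type)
    if key in _REGISTRY:
        raise ValueError(f"duplicate registration for {key}")
    _REGISTRY[key] = config
    return config


def lookup(platform: str, channel: str, data_type: str) -> PlatformConfig:
    """Resolve the config for an incoming batch (hypothetical helper)."""
    return _REGISTRY[(platform, channel, data_type)]


# Runs at module import time, which is how the document says configs are registered.
register(
    PlatformConfig(
        platform="boss",
        channel="app",  # assumed channel name
        data_type="job",
        table="job_data.boss_job",
        dedup_key=lambda record: str(record["job_id"]),
    )
)
```

Because the dataclass is frozen, a config cannot be mutated after registration, so every ingest request sees the same routing rules.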
+
+**DistributedLock:**
+- Purpose: Prevents concurrent execution of scheduled jobs across multiple Uvicorn workers
+- Examples: `app/core/locks.py`
+- Pattern: Redis SET NX with TTL; falls back to filesystem directory creation with TTL metadata
+
+**ClickHouseManager:**
+- Purpose: Singleton lazy-init connection wrapper around `clickhouse_connect.AsyncClient`
+- Examples: `app/core/clickhouse.py`
+- Pattern: Global module-level instance `clickhouse_manager`, injected into services at call time via `await clickhouse_manager.get_client()`
+
+**BaseFetcher / BaseSearcher:**
+- Purpose: Abstract base for platform-specific API clients within the in-process crawler wrappers
+- Examples: `app/services/crawler/_base.py`
+- Pattern: Template method — subclasses implement `_build_params()`, base handles HTTP invocation and `ApiResult` parsing
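Reviewer sketch (not part of the commit): a minimal version of the template method split described above, assuming a `fetch()` entry point and an injected `transport` callable that stands in for the real HTTP client. The `ApiResult` fields and `DemoFetcher` are hypothetical; only `BaseFetcher`, `ApiResult`, and `_build_params()` are named in the document.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class ApiResult:
    """Uniform result envelope; the field layout is an assumption."""
    ok: bool
    data: Optional[Dict[str, Any]] = None
    error: Optional[str] = None


class BaseFetcher(ABC):
    def __init__(self, transport: Callable[[Dict[str, Any]], Dict[str, Any]]):
        # transport is a stand-in for the shared HTTP client
        self._transport = transport

    @abstractmethod
    def _build_params(self, **kwargs: Any) -> Dict[str, Any]:
        """Platform-specific step: turn caller arguments into request params."""

    def fetch(self, **kwargs: Any) -> ApiResult:
        """Fixed skeleton: build params, invoke transport, wrap the outcome."""
        params = self._build_params(**kwargs)
        try:
            payload = self._transport(params)
        except Exception as exc:  # the base class owns error handling
            return ApiResult(ok=False, error=str(exc))
        return ApiResult(ok=True, data=payload)


class DemoFetcher(BaseFetcher):
    """Hypothetical platform client that only supplies the variable step."""

    def _build_params(self, **kwargs: Any) -> Dict[str, Any]:
        return {"keyword": kwargs["keyword"], "page": kwargs.get("page", 1)}
```

A subclass never touches the HTTP call or the result wrapping, which is what keeps the three platform clients structurally alike.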
+
+**BaseModel (Tortoise):**
+- Purpose: Common ORM base with `BigIntField pk`, `to_dict()` async serializer with M2M support
+- Examples: `app/models/base.py`
+- Pattern: Mixed-in with `TimestampMixin` (created_at, updated_at auto fields)
+
+## Entry Points
+
+**Application Entry:**
+- Location: `run.py`
+- Triggers: `python run.py` — reads `APP_HOST`, `APP_PORT`, `UVICORN_WORKERS` env vars; calls `uvicorn.run("app:app")`
+- Responsibilities: Configure uvicorn logging format, start ASGI server
+
+**FastAPI App Factory:**
+- Location: `app/__init__.py` → `create_app()`
+- Triggers: Uvicorn import of `app:app`
+- Responsibilities: Register middlewares, exception handlers, routers, static files; define lifespan
+
+**Lifespan Bootstrap:**
+- Location: `app/__init__.py` (lifespan context manager) + `app/core/init_app.py` (`init_data()`)
+- Triggers: FastAPI startup event
+- Responsibilities: Tortoise-ORM init + schema generation → Aerich migrations → seed data (superuser, menus, APIs, roles) → ClickHouse table init → APScheduler start
+
+**Router Aggregation:**
+- Location: `app/api/v1/__init__.py`
+- Triggers: Imported by `app/api/__init__.py` → `app/__init__.py`
+- Responsibilities: Mount all domain routers under `/api/v1/` with appropriate `DependPermission` guards
+
+**ECS Pipeline:**
+- Location: `ecs_full_pipeline.py`
+- Triggers: Manual execution or scheduled via `scheduler.py` (`ecs_full_pipeline_job`) every 6 hours
+- Responsibilities: Create Alibaba Cloud ECS spot instances, push spider scripts, execute crawlers, terminate instances
+
+## Error Handling
+
+**Strategy:** Exception handler registration at FastAPI app level; structured error propagation up through layers
+
+**Patterns:**
+- Tortoise `DoesNotExist` → 404 via `DoesNotExistHandle` (registered in `app/core/init_app.py`)
+- Pydantic `RequestValidationError` → 422 with field-level detail via `RequestValidationHandle`
+- `IntegrityError` (DB constraint) → 400 via `IntegrityHandle`
+- Custom `HTTPException` (from `app/core/exceptions.py`) → structured JSON error via `HttpExcHandle`
+- Service layer exceptions logged with `loguru` then re-raised or returned as `{"success": False, "error": "..."}`
+- Scheduled jobs catch all exceptions, record failure via `_record_task_run()` to `ScheduledTaskRun` table, then continue
+
+## Cross-Cutting Concerns
+
+**Logging:** `loguru` — imported from `app/log/__init__.py`; structured `logger.info/error/warning` calls throughout; external spider logs to rotating daily files in `jobs_spider/*/logs/`
+
+**Validation:** Pydantic v2 schemas in `app/schemas/` validate all inbound HTTP request bodies; ClickHouse inserts use registry-defined column extractors for type safety
+
+**Authentication:** JWT HS256 tokens issued by `/api/v1/base/access_token`; verified per-request by `AuthControl.is_authed()` injected via `DependAuth` / `DependPermission`; dev bypass (`token: dev`) available when `APP_ENV=development`
+
+**Audit Logging:** `HttpAuditLogMiddleware` records all non-excluded mutating requests (method, path, user, request args, response body, timing) to MySQL `auditlog` table
+
+---
+
+*Architecture analysis: 2026-03-21*
--- a/.planning/codebase/CONCERNS.md
+++ b/.planning/codebase/CONCERNS.md
@@ -0,0 +1,255 @@
+# Codebase Concerns
+
+**Analysis Date:** 2026-03-21
+
+---
+
+## Security Considerations
+
+### CRITICAL: Hardcoded Credentials in Source Code
+
+**Area:** Configuration
+- Risk: Real database passwords, SMTP credentials, and IP addresses committed to version control.
+- Files: `app/settings/config.py`
+- Exposed values (lines 27–52):
+  - MySQL root password (`jobdata123`) with public IP `121.4.126.241:3306`
+  - ClickHouse user/pass defaults (`data_user` / `data_pass`) with same public IP
+  - QQ SMTP app password (`jkncedjzkummbbdi`) and recipient email addresses
+- Current mitigation: `pydantic-settings` allows environment variable override, but defaults are live credentials that will be exposed if the repo is shared or CI logs are leaked.
+- Recommendations: Remove all real default values; replace with `None` or clearly fake placeholders; enforce required-field validation at startup; rotate exposed credentials immediately.
+
+### CRITICAL: Unauthenticated Data Ingest Endpoints
+
+**Area:** API Authentication
+- Risk: Any actor with network access can push arbitrary data directly into ClickHouse.
+- Files: `app/api/v1/__init__.py` (lines 32–34), `app/api/v1/job/job.py`
+- The `/ingest`, `/job`, and `/universal` router groups are mounted **without** any `DependPermission` or `DependAuth` dependency. The data store, batch-store, and batch-store-async endpoints accept unauthenticated POST requests.
+- Trigger: Send a POST to `/api/v1/universal/data/batch-store-async` with any payload. No token required.
+- Workaround: The CLAUDE.md acknowledges these as "internal call" endpoints, but they are publicly reachable with no network-level isolation noted.
+- Recommendations: Add at minimum IP allowlist middleware or a static shared secret header check for ingest routes; or mount them behind `DependAuth`.
+
+### HIGH: Dev Token Bypass in Production Auth
+
+**Area:** Authentication
+- Risk: If `APP_ENV=development`, any request with `token: dev` header is granted the first user's full privileges.
+- Files: `app/core/dependency.py` (lines 28–30)
+- The check `os.getenv("APP_ENV", "production") == "development"` defaults to `"production"`, but there is no enforcement that this variable is set correctly in production deployments. A misconfigured environment would expose the bypass.
+- Recommendations: Remove the dev bypass entirely; use test fixtures instead; add a startup assertion that the bypass is disabled in non-dev environments.
+
+### HIGH: SQL Injection via f-string Queries
+
+**Area:** ClickHouse Queries
+- Risk: User-controlled values inserted directly into query strings allow injection.
+- Files:
+  - `reclean_qcwy_jobs.py` (line 34): `f"SELECT json_data FROM job_data.qcwy_job WHERE job_id = '{job_id}' LIMIT 1"` — `job_id` comes from URL parsing of user-provided input.
+  - `app/repositories/clickhouse_repo.py` (lines 101–108): `group_by_column` parameter is interpolated directly into the SELECT and GROUP BY clauses without validation.
+  - `app/services/ingest/service.py` (lines 109, 112): `table` value is derived from config (lower risk) but still concatenated into f-strings.
+  - `app/core/scheduler.py` (lines 57, 63): `table` names from a fixed list, but pattern is risky.
+- Current mitigation: ClickHouse parameterized queries are used in `dedup.py`, but inconsistently applied elsewhere.
+- Recommendations: Use parameterized queries (`parameters={}`) for all user-influenced values; validate `group_by_column` against an allowlist; never interpolate user input into SQL strings.
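Reviewer sketch (not part of the commit): the two recommendations above, an allowlist for identifiers and bound parameters for values, might be combined as below. The column set and function names are hypothetical, and the `{job_id:String}` placeholder assumes the server-side binding form that `clickhouse_connect` accepts alongside `parameters=`.

```python
from typing import Any, Dict, Tuple

# Hypothetical allowlist; the real set of groupable columns is not given in the document.
ALLOWED_GROUP_COLUMNS = frozenset({"source", "city", "channel"})


def build_group_count_query(group_by_column: str) -> str:
    """Identifiers cannot be bound as parameters, so validate against an allowlist."""
    if group_by_column not in ALLOWED_GROUP_COLUMNS:
        raise ValueError(f"unsupported group_by_column: {group_by_column!r}")
    return (
        f"SELECT {group_by_column}, count() AS cnt "
        f"FROM job_data.job_analytics GROUP BY {group_by_column}"
    )


def build_job_lookup(job_id: str) -> Tuple[str, Dict[str, Any]]:
    """Values travel separately from the SQL text instead of being interpolated."""
    sql = (
        "SELECT json_data FROM job_data.qcwy_job "
        "WHERE job_id = {job_id:String} LIMIT 1"
    )
    return sql, {"job_id": job_id}
```

The returned pair would be passed on as `client.query(sql, parameters=params)`, so a hostile `job_id` never becomes part of the statement text.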
+
+### MEDIUM: CORS Misconfiguration
+
+**Area:** CORS Policy
+- Risk: `CORS_ALLOW_CREDENTIALS = False` while `CORS_ALLOW_HEADERS = ["*"]` is set. If credentials are later enabled, wildcard headers become a security liability.
+- Files: `app/settings/config.py` (lines 13–16)
+- Recommendations: Restrict `CORS_ORIGINS` to known production domains; avoid wildcard headers.
+
+### MEDIUM: Hardcoded Public IP in Spider Default
+
+**Area:** Spider Configuration
+- Risk: Hardcoded IP `124.222.106.226:9999` as the default `API_BASE_URL` in all spider scripts and `spiderJobs/` runner. If that IP changes hands or is exposed, all crawled data goes to an unintended recipient.
+- Files:
+  - `jobs_spider/boss/boos_api.py` (line 17)
+  - `jobs_spider/qcwy/qcwy.py` (line 28)
+  - `jobs_spider/zhilian/zhilian_single.py` (line 41)
+  - `spiderJobs/runner/api_client.py` (line 31)
+- Recommendations: Remove hardcoded IP defaults; require `API_BASE_URL` to be explicitly set via environment variable with no default.
+
+---
+
+## Tech Debt
+
+### Duplicate Spider Implementations
+
+**Area:** Code Organization
+- Issue: Two parallel spider hierarchies exist for the same three platforms.
+  - `jobs_spider/boss/`, `jobs_spider/qcwy/`, `jobs_spider/zhilian/` — original scripts
+  - `spiderJobs/platforms/boss/`, `spiderJobs/platforms/job51/`, `spiderJobs/platforms/zhilian/` — refactored version
+  - `app/services/crawler/` — yet another crawler abstraction inside the app
+- Files: `jobs_spider/`, `spiderJobs/`, `app/services/crawler/`
+- Impact: Bug fixes and anti-detection improvements must be applied in multiple places; the relationship between these directories is not documented.
+- Fix approach: Decide on one canonical implementation; deprecate and delete the others; update CLAUDE.md to clarify which is the active spider.
+
+### God File: `jobs_spider/boss/boos_api.py`
+
+**Area:** File Size / Complexity
+- Issue: Single file containing 2281 lines covering API client, IP strategy, anomaly detection, session management, main loop, and push logic.
+- Files: `jobs_spider/boss/boos_api.py`
+- Impact: Extremely difficult to test or modify any single responsibility; cognitive load is very high.
+- Fix approach: Extract `SmartIPManager`, `IPAnomalyDetector`, `IPStrategyConfig` into separate modules; separate the main crawl loop from the API client class.
+
+### `CleaningService` and `CompanyCleaner` Code Duplication
+
+**Area:** Service Layer
+- Issue: Both classes replicate identical token-loading logic (`_ensure_boss_token_loaded`), proxy-setting logic (`_apply_proxy`), and service instantiation (BossService, QcwyService, ZhilianService).
+- Files:
+  - `app/services/cleaning.py` (lines 37–48, 31–34)
+  - `app/services/company_cleaner.py` (lines 60–74, 54–58)
+- Impact: Token refresh logic is maintained in two places; a bug fix in one will not propagate to the other.
+- Fix approach: Extract a shared `CrawlerContext` or `CrawlerMixin` class providing token management and proxy configuration.
+
+### ClickHouse Schema Changes Require Manual DDL
+
+**Area:** Database Migrations
+- Issue: ClickHouse table definitions in `app/core/clickhouse_init.py` use `CREATE TABLE IF NOT EXISTS`, meaning schema changes (new columns, index changes) are never automatically applied to existing tables. There is no migration tooling equivalent to Aerich for ClickHouse.
+- Files: `app/core/clickhouse_init.py`
+- Impact: If a column is added to the DDL, existing production tables silently retain the old schema; queries referencing new columns will fail at runtime.
+- Fix approach: Implement `ALTER TABLE ... ADD COLUMN IF NOT EXISTS` statements for schema evolution; or adopt a ClickHouse migration tool (e.g., `ch-migrations`).
+
+### Startup File Lock Not Robust to Crashes
+
+**Area:** Multi-Worker Initialization
+- Issue: `init_data()` uses `os.mkdir(".startup_lock")` as a bare file lock with no TTL. If a worker crashes mid-initialization, the lock directory is never cleaned up, and subsequent restarts will skip all migrations and seed data.
+- Files: `app/core/init_app.py` (lines 322–349)
+- Impact: Silent data-initialization failure on restart after crash; can result in missing admin user or broken menu state.
+- Fix approach: Replace with the existing `DistributedLock` class from `app/core/locks.py` which has TTL support; or at minimum add a time-based stale-lock cleanup.
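Reviewer sketch (not part of the commit): the "time-based stale-lock cleanup" fallback mentioned above could be approximated like this. The function names and the 300-second TTL are assumptions, and this version still carries the delete-then-recreate race that the `DistributedLock` section describes, so it is a stopgap rather than a replacement for a proper lock.

```python
import os
import shutil
import time


def acquire_startup_lock(path: str, ttl_seconds: float = 300.0) -> bool:
    """Try to take a directory lock, clearing it once if it is older than the TTL."""
    for _ in range(2):  # at most one retry after removing a stale lock
        try:
            os.mkdir(path)  # directory creation is atomic: exactly one caller succeeds
            return True
        except FileExistsError:
            try:
                age = time.time() - os.path.getmtime(path)
            except FileNotFoundError:
                continue  # holder released the lock in between; retry
            if age <= ttl_seconds:
                return False  # lock is fresh, another worker is initializing
            shutil.rmtree(path, ignore_errors=True)  # stale lock left by a crash
    return False


def release_startup_lock(path: str) -> None:
    """Remove the lock directory; safe to call even if it is already gone."""
    shutil.rmtree(path, ignore_errors=True)
```

With a TTL in place, a worker that crashes mid-initialization no longer blocks every later restart; the next start after the TTL expires reclaims the lock.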
+
+---
+
+## Performance Bottlenecks
+
+### ClickHouse Singleton Connection Per-Worker
+
+**Area:** Database Connections
+- Problem: `clickhouse_manager` in `app/core/clickhouse.py` is a module-level singleton that lazily creates one connection. With 20 uvicorn workers (`UVICORN_WORKERS=20`), each worker maintains its own connection, and the connection pool inside each is limited to `maxsize=5`. Under concurrent load, requests block waiting for the pool.
+- Files: `app/core/clickhouse.py` (lines 31–58)
+- Cause: A new `PoolManager` is created on every `get_clickhouse_client()` call but the singleton pattern means `get_client()` only calls this once per worker — pool configuration cannot be tuned per endpoint.
+- Improvement path: Use a connection pool library or connection factory pattern; expose pool size as a configurable setting; consider `clickhouse-pool` for proper async pooling.
+
+### Dedup Query Scans Entire Table
+
+**Area:** ClickHouse Deduplication
+- Problem: `batch_dedup_filter` in `app/services/ingest/dedup.py` issues `SELECT key_col FROM table WHERE key_col IN (...)` with no time-bound filter. For large tables this is a full-table scan on each ingest batch.
+- Files: `app/services/ingest/dedup.py` (lines 51, 65)
+- Cause: No `PREWHERE created_at > now() - INTERVAL N DAY` constraint added.
+- Improvement path: Add a configurable recency window to dedup queries (e.g., 30-day lookback matching the existing `days_back` logic in `company_cleaner.py`).
+
+### Synchronous Crawler Calls Blocking Async Workers
+
+**Area:** Concurrency
+- Problem: All crawler service methods (BossService, QcwyService, ZhilianService) use synchronous `requests` internally, wrapped with `asyncio.to_thread()`. Under high concurrency, this can exhaust the thread pool.
+- Files: `app/services/cleaning.py` (lines 186, 190, 194, etc.), `app/services/company_cleaner.py` (lines 318, 321, 325)
+- Cause: The underlying crawler libraries use `requests` (sync), not `httpx` (async).
+- Improvement path: Migrate crawlers to `httpx.AsyncClient` for true async I/O, or bound concurrency by wrapping the `asyncio.to_thread()` calls in an `asyncio.Semaphore`.
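Reviewer sketch (not part of the commit): one way to keep synchronous crawler calls from exhausting the thread pool is to gate `asyncio.to_thread()` behind an `asyncio.Semaphore`. The helper name, the limit of 4, and `slow_fetch` (a stand-in for a blocking `requests`-based call) are assumptions.

```python
import asyncio
import time
from typing import Any, Callable, List


async def run_blocking(sem: asyncio.Semaphore, func: Callable[..., Any], *args: Any) -> Any:
    """Run a blocking callable in a worker thread, at most `sem` slots at a time."""
    async with sem:
        return await asyncio.to_thread(func, *args)


def slow_fetch(company_id: str) -> str:
    """Stand-in for a synchronous crawler call."""
    time.sleep(0.01)
    return f"detail:{company_id}"


async def main() -> List[str]:
    sem = asyncio.Semaphore(4)  # assumed cap; tune against the real thread pool size
    tasks = [run_blocking(sem, slow_fetch, str(i)) for i in range(10)]
    return await asyncio.gather(*tasks)
```

Callers beyond the cap wait on the semaphore inside the event loop instead of queuing up threads, so the loop stays responsive while the crawlers are slow.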
+
+---
+
+## Fragile Areas
+
+### `group_by_column` API Parameter Lacks Allowlist Validation
+
+**Area:** Analytics API
+- Files: `app/repositories/clickhouse_repo.py` (lines 86–115), `app/api/v1/analytics.py`
+- Why fragile: `group_by_column` is passed directly into the ClickHouse query without validation against allowed column names. An invalid column name causes a runtime ClickHouse error that propagates as a 500 to the client; a malicious value enables injection.
+- Safe modification: Add an `ALLOWED_GROUP_COLUMNS = frozenset({"source", "city", "channel", ...})` check before query construction.
+- Test coverage: No tests for analytics query building.
+
+### `DistributedLock` File Lock Has Race Window
+
+**Area:** Distributed Locking
+- Files: `app/core/locks.py` (lines 64–92)
+- Why fragile: The stale-lock cleanup path (lines 81–87) involves `shutil.rmtree` followed by `mkdir` — a TOCTOU race where two workers can both detect staleness, both delete the directory, and both attempt to re-create it. One wins, one raises `FileExistsError` and falls through to return `False`, but the winner's lock is not guaranteed to be the intended holder.
+- Safe modification: Use `os.rename` from a temp directory as an atomic lock-acquire primitive.
+- Test coverage: None.
+
+### `CleaningService` Returns Mixed Types (`bool | Dict`)
+
+**Area:** Type Consistency
+- Files: `app/services/cleaning.py` (lines 158–165, 167–202, 204–239, 241–268)
+- Why fragile: Multiple `clean_*` methods return `Union[bool, Dict[str, Any]]`. Callers must defensively check `isinstance(result, bool)` before accessing dict keys (see `process_single_item` at lines 128–135). This makes the API fragile — any new call site that forgets the `bool` check will raise `AttributeError`.
+- Safe modification: Define a `CleanResult` dataclass or TypedDict; always return it.
+- Test coverage: None.
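Reviewer sketch (not part of the commit): the suggested `CleanResult` type could replace the `bool | Dict` return roughly as follows. The field names and the `clean_company` example are hypothetical; only the `CleanResult` name comes from the document.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class CleanResult:
    """Single return type for clean_* methods; the field layout is an assumption."""
    success: bool
    data: Dict[str, Any] = field(default_factory=dict)
    error: Optional[str] = None


def clean_company(raw: Optional[Dict[str, Any]]) -> CleanResult:
    """Hypothetical clean_* method that always returns a CleanResult."""
    if not raw:
        return CleanResult(success=False, error="empty response")
    return CleanResult(success=True, data={"name": str(raw.get("name", "")).strip()})
```

Call sites then branch on `result.success` and read `result.data` without an `isinstance` check, since the failure case carries an empty dict rather than a bare `False`.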
+
+### `company_cleaning_job` Assumes 5-Minute Completion
+
+**Area:** Scheduler
+- Files: `app/core/scheduler.py` (lines 171–202)
+- Why fragile: The job comment explicitly notes it processes 50 collect + 30 process operations and "should complete in 2–3 minutes" to fit within the 5-minute schedule window. If the external crawlers slow down (rate-limited, network issues), the job overruns its window, causing the next scheduled run to be blocked by the distributed lock, leading to queue backlog.
+- Safe modification: Add timeout guards around `collect_pending_companies` and `process_pending_companies`; or dynamically reduce batch sizes based on elapsed time.
+
+---
+
+## Missing Critical Features
+
+### No Rate Limiting on Any API Endpoint
+
+**Problem:** FastAPI has no rate-limiting middleware configured. The data ingest endpoints (which have no auth) and the cleaning trigger endpoints can be called at arbitrary rates.
+- Blocks: Protection against accidental or malicious flooding of ClickHouse with writes.
+- Recommendation: Add `slowapi` or custom middleware with per-IP rate limiting on ingest and crawl-triggering endpoints.
+
+### No Input Size Limit on Batch Ingest
+
+**Problem:** `IngestBatchRequest.data_list` has no `max_items` constraint. A single request with tens of thousands of job records will be processed synchronously (or queued as a background task) with no guard on memory consumption.
+- Files: `app/schemas/ingest.py`, `app/api/v1/job/job.py`
+- Recommendation: Add `Field(max_length=1000)` or equivalent to `data_list`; add body size limit in uvicorn/nginx.
+
+---
+
+## Test Coverage Gaps
+
+### Service Layer: Virtually Untested
+
+- What's not tested: `CleaningService`, `CompanyCleaner`, `IngestService`, `AnalyticsService`, `DistributedLock`, `ClickHouseManager`
+- Files: `app/services/cleaning.py`, `app/services/company_cleaner.py`, `app/services/ingest/service.py`, `app/services/analytics_service.py`, `app/core/locks.py`, `app/core/clickhouse.py`
+- Risk: Core data routing, deduplication, and cleaning logic can regress silently. The existing tests only cover a handful of pure-function helpers (`extract_company_fields`, `normalize_company_id`, `_extract_*_jobs`).
+- Priority: **High**
+
+### API Layer: No Integration Tests
+
+- What's not tested: All FastAPI route handlers, auth flow, permission enforcement, error responses.
+- Files: `app/api/v1/` (all modules)
+- Risk: Auth bypass regressions, schema changes breaking clients, and incorrect HTTP status codes go undetected.
+- Priority: **High**
+
+### Spider Logic: No Unit Tests
+
+- What's not tested: `IPAnomalyDetector.detect()`, `SmartIPManager` state transitions, response parsing in `BossZhipinAPI`, `QcwyApi`, `ZhilianApi`.
+- Files: `jobs_spider/boss/boos_api.py`, `jobs_spider/qcwy/qcwy.py`, `jobs_spider/zhilian/zhilian_single.py`
+- Risk: Anti-detection logic breaks silently when platforms change their response format.
+- Priority: **Medium**
+
+### ClickHouse Schema: No Schema Evolution Tests
+
+- What's not tested: Behavior of `clickhouse_init.py` when tables already exist with old schemas; column migration paths.
+- Files: `app/core/clickhouse_init.py`
+- Risk: Production schema drift goes unnoticed until a query fails at runtime.
+- Priority: **Medium**
+
+---
+
+## Known Issues / Observed Bugs
+
+### Silent Error Swallowing in Spider Code
+
+- Symptoms: Data silently not collected; no error surfaced to caller.
+- Files: `jobs_spider/qcwy/qcwy.py` (lines 178–179, 191–192, 1047–1048, 1146–1147), `jobs_spider/qcwy/search_company_jobs.py` (lines 240–241, 512–513, 538–539, 601–602), `jobs_spider/boss/boos_api.py` (lines 400, 407–408, 508–509, 518–519, 1779–1780), `jobs_spider/zhilian/company_spider.py` (many `except: pass` blocks)
+- Trigger: Network errors, parsing errors, or unexpected response shapes in spider code are caught by bare `except Exception: pass` blocks.
+- Workaround: Check logs for INFO-level skips; monitor push success counts.
+
+### `_send_email` Uses Blocking SMTP in Async Context
+
+- Symptoms: Scheduler event loop blocked for up to several seconds during email send.
+- Files: `app/core/scheduler.py` (lines 308–336)
+- Trigger: Any `stats_job` or `ip_alert_job` execution that sends email calls synchronous `smtplib.SMTP` from the async event loop without wrapping in `asyncio.to_thread`.
+- Workaround: None documented. Email delivery may be delayed, and event loop latency spikes during scheduled runs.
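+
+The `asyncio.to_thread` wrapping mentioned in the trigger could look roughly like the sketch below. The function names and parameters are illustrative assumptions, not the actual `_send_email` signature in `app/core/scheduler.py`.
+
+```python
+import asyncio
+import smtplib
+from email.message import EmailMessage
+from typing import Any, Callable
+
+
+def send_email_blocking(host: str, port: int, user: str, password: str, msg: EmailMessage) -> None:
+    """Synchronous SMTP send with STARTTLS; blocks the calling thread."""
+    with smtplib.SMTP(host, port, timeout=10) as smtp:
+        smtp.starttls()
+        smtp.login(user, password)
+        smtp.send_message(msg)
+
+
+async def send_email(sender: Callable[..., None], *args: Any) -> None:
+    """Run the blocking sender in a worker thread so the event loop stays free."""
+    await asyncio.to_thread(sender, *args)
+```
+
+A scheduler job would then call `await send_email(send_email_blocking, host, port, user, password, msg)`; taking the sender as a parameter also lets tests substitute a fake.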
+
+### `cleanup_old_records` Deletes All Done/Failed Records Without Archive
+
+- Symptoms: Historical processing data is permanently deleted daily; no audit trail of what was cleaned.
+- Files: `app/services/company_cleaner.py` (lines 329–331), `app/core/scheduler.py` (line 220)
+- Trigger: `daily_cleanup_job` calls `CompanyCleaningQueue.filter(status__in=["done", "failed"]).delete()` — a hard delete with no archiving or retention window.
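+
+One possible mitigation is a retention window, sketched here on plain dicts so it runs without a database. The 30-day default and the `updated_at` field on queue rows are assumptions, not facts about `CompanyCleaningQueue`.
+
+```python
+from datetime import datetime, timedelta, timezone
+
+TERMINAL_STATUSES = ("done", "failed")
+
+
+def split_by_retention(rows: list[dict], now: datetime, retention_days: int = 30):
+    """Partition rows into (to_delete, to_keep).
+
+    Only terminal-status rows older than the retention window are deleted.
+    """
+    cutoff = now - timedelta(days=retention_days)
+    to_delete = [
+        r for r in rows
+        if r["status"] in TERMINAL_STATUSES and r["updated_at"] < cutoff
+    ]
+    to_keep = [r for r in rows if r not in to_delete]
+    return to_delete, to_keep
+```
+
+With Tortoise ORM the same idea would be an extra `updated_at__lt=cutoff` condition on the existing filter, assuming the model carries that timestamp.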
+
+---
+
+*Concerns audit: 2026-03-21*
--- a/.planning/codebase/CONVENTIONS.md
+++ b/.planning/codebase/CONVENTIONS.md
@ -0,0 +1,212 @@
+# Coding Conventions
+
+**Analysis Date:** 2026-03-21
+
+## Naming Patterns
+
+**Files:**
+- Snake_case for all Python files: `company_storage.py`, `company_cleaner.py`, `clickhouse_repo.py`
+- Private/internal modules prefixed with underscore: `_base.py`, `_boss_api.py`, `_boss_client.py`, `_boss_sign.py`, `_http_client.py`
+- Platform-named service files: `boss.py`, `qcwy.py`, `zhilian.py` under `app/services/crawler/`
+- Router files named after domain: `keyword.py`, `analytics.py`, `cleaning.py`
+
+**Classes:**
+- PascalCase throughout: `CleaningService`, `KeywordController`, `ClickHouseBaseRepo`, `JobAnalyticsRepo`
+- Services: `{Domain}Service` — `BossService`, `QcwyService`, `ZhilianService`, `IngestService`, `AnalyticsService`
+- Controllers: `{Domain}Controller` — `KeywordController`
+- Repos: `{Domain}Repo` or `{Domain}BaseRepo` — `ClickHouseBaseRepo`, `JobAnalyticsRepo`
+- Models (Tortoise ORM): `{Platform}{Entity}` — `BossKeyword`, `QcwyCompany`, `ZhilianCompany`
+- Schemas (Pydantic): `{Entity}Base`, `{Entity}Create`, `{Entity}Update`, `{Entity}Out` — see `app/schemas/keyword.py`
+
+**Functions and Methods:**
+- Snake_case for all functions and methods: `get_available`, `report_page_progress`, `store_batch`, `build_insert_row`
+- Private helpers prefixed with underscore: `_apply_proxy`, `_ensure_boss_token_loaded`, `_pick_first`, `_nested_get`, `_clean_text`, `_model_for_source`
+- Async dependency factories follow pattern `get_{service/controller}()`: `get_ingest_service`, `get_analytics_service`, `get_keyword_controller`
+
+**Variables:**
+- Snake_case: `data_list`, `platform_type`, `check_duplicate`, `page_size`
+- Module-level constants: UPPER_SNAKE_CASE — `COMPANY_SOURCES`, `QUEUE_TERMINAL_STATUSES`
+- Class-level constants: UPPER_SNAKE_CASE prefixed `_` — `_TOKEN_REFRESH_INTERVAL = 3600`
+
+**Types and Enums:**
+- Enums use PascalCase class name, UPPER_SNAKE_CASE values: `PlatformType.BOSS`, `ChannelType.MINI`, `DataType.JOB`
+- Enum values are lowercase strings matching URL slugs: `"boss"`, `"mini"`, `"job"` — see `app/schemas/ingest.py`
+- Enums inherit from `(str, Enum)` enabling direct string comparison
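+
+A short illustration of the `(str, Enum)` pattern. Only `PlatformType.BOSS` and its value `"boss"` come from the notes above; the other two members are assumed from the platform names used elsewhere, and `is_boss` is a hypothetical helper.
+
+```python
+from enum import Enum
+
+
+class PlatformType(str, Enum):
+    BOSS = "boss"
+    QCWY = "qcwy"        # assumed member
+    ZHILIAN = "zhilian"  # assumed member
+
+
+def is_boss(platform: str) -> bool:
+    """Works for both raw strings and enum members because members are str."""
+    return platform == PlatformType.BOSS
+```
+
+Because each member is itself a `str`, a raw path parameter such as `"boss"` compares equal to `PlatformType.BOSS` without an explicit `.value` lookup.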
+
+## Code Style
+
+**Formatting:**
+- Tool: `black` v24.10.0
+- Line length: 120 characters (set in `pyproject.toml` `[tool.black]` and `[tool.ruff]`)
+- Target Python versions: 3.10, 3.11 (black), 3.13 (Pipfile)
+
+**Linting:**
+- Tool: `ruff` v0.9.1 (configured in `pyproject.toml`)
+- Ignored rules: `F403` (star import used; unable to detect undefined names), `F405` (name may be undefined, or defined from star imports)
+- Star imports from internal modules are allowed (used in `app/models/__init__.py`, `app/services/ingest/__init__.py`)
+
+**Import Sorting:**
+- Tool: `isort` v5.13.2
+- No explicit isort config found; follows default ordering
+
+## Import Organization
+
+**Order:**
+1. Standard library (`from __future__`, `os`, `re`, `typing`, `datetime`, `json`)
+2. Third-party (`fastapi`, `pydantic`, `tortoise`, `loguru`, `clickhouse_connect`)
+3. Internal app imports (`from app.core.`, `from app.models.`, `from app.services.`, `from app.schemas.`)
+
+**Example from `app/api/v1/analytics.py`:**
+```python
+from typing import Optional
+from datetime import datetime, date, timezone
+from zoneinfo import ZoneInfo
+
+from fastapi import APIRouter, Depends, Query
+from app.core.clickhouse import clickhouse_manager
+from app.services.analytics_service import AnalyticsService
+from app.schemas.analytics import JobStatisticsResponse
+```
+
+**Path Aliases:**
+- None; all imports use full `app.` prefix paths
+- `from app.log import logger` is the canonical loguru import path
+
+**Star Imports:**
+- Used only in `__init__.py` re-export files: `from .admin import *` in `app/models/__init__.py`
+- `# noqa: F401, F403` comments suppress lint warnings for intentional star imports
+
+## Error Handling
+
+**Patterns:**
+- Services return `Dict[str, Any]` result objects with `"success"`, `"code"`, `"message"` fields instead of raising exceptions to callers
+- Controllers return dict with `"code": 200/400/404` and `"message"` for all outcomes
+- API route handlers do NOT use try/except — they rely on services returning structured results
+- Service methods wrap low-level calls in `try/except Exception as e` and log then return `False` or error dict
+
+**Service-level error handling example** (`app/services/cleaning.py`):
+```python
+except Exception as e:
+    logger.error(f"Error processing item {target}: {e}")
+    return {
+        "success": False,
+        "target": target,
+        "error": str(e),
+        "storage_status": "error",
+        "remote_sent": False
+    }
+```
+
+**Repository-level:** `ClickHouseBaseRepo` does not swallow exceptions; they propagate to the service layer.
+
+**Auth exceptions:** `app/core/dependency.py` raises `HTTPException(status_code=401/403)` directly — the standard FastAPI pattern for auth failures.
+
+## Logging
+
+**Framework:** `loguru` v0.7.3
+
+**Import:** `from app.log import logger` (centralized re-export) or `from loguru import logger` (direct)
+
+**Patterns:**
+- `logger.info(f"...")` for normal operation events
+- `logger.warning(f"...")` for non-fatal recoverable issues (e.g., token not found, API soft failures)
+- `logger.error(f"...")` for caught exceptions and operation failures
+- F-string interpolation used consistently for message formatting
+- No structured fields (no `logger.bind()` usage observed)
+
+**Example:**
+```python
+logger.info(f"获取招聘详情: {job_id}")  # "Fetching job detail: {job_id}"
+logger.warning(f"Boss get_job_detail failed: {result.error}")
+logger.error(f"批量插入失败: {e}")  # "Batch insert failed: {e}"
+```
+
+## API Response Format
+
+**Two response styles coexist:**
+
+**Style 1 — Direct dict return** (most routes in new modules like `app/api/v1/job/job.py`, `app/api/v1/analytics.py`):
+```python
+return {"code": 200, "data": result, "message": "ok"}
+```
+
+**Style 2 — JSONResponse subclasses** (older RBAC routes, defined in `app/schemas/base.py`):
+```python
+Success(code=200, msg="OK", data=data)
+Fail(code=400, msg="error message")
+SuccessExtra(code=200, data=data, total=100, page=1, page_size=20)
+```
+
+**Paginated responses** include: `code`, `data` (list), `total`, `page`, `page_size`
+
+## Comments
+
+**When to Comment:**
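+- Docstrings on public methods describing purpose, not implementation: `"""获取可用关键词，优先返回断点续爬和失败重试的关键词"""` ("Get available keywords, returning resume-from-checkpoint and failed-retry keywords first")
+- Inline comments for priority logic and algorithm steps: `# 优先级 1: 断点续爬 (partial)` ("Priority 1: resume from checkpoint (partial)")
+- Module-level docstrings for context: `"""Boss直聘 Service — 基于新算法文件的封装"""` ("Boss Zhipin service: a wrapper around the new algorithm files")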
+- `# noqa` comments for intentional lint suppressions
+
+**JSDoc/TSDoc:**
+- Not applicable (Python backend)
+- Docstrings are brief single-line or short multi-line Chinese descriptions
+
+## Function Design
+
+**Size:** Functions tend to be 10-50 lines; service methods like `process_single_item` in `app/services/cleaning.py` grow to ~70 lines due to multi-platform dispatch
+
+**Parameters:**
+- Keyword arguments with defaults preferred for optional params
+- Pydantic schemas used for HTTP request bodies (never raw dicts from router params)
+- `Optional[str]` with `= None` default for optional parameters
+
+**Return Values:**
+- Services return `Dict[str, Any]` with consistent keys (`code`, `message`, `data`)
+- Private helpers return `Optional[T]` or primitive types
+- Async functions return awaitable results (no mixing of sync/async)
+
+## Module Design
+
+**Exports:**
+- `app/models/__init__.py` uses `from .{module} import *` to flatten model imports
+- Router modules export a single named router variable: `router = APIRouter(...)` or `{domain}_router = APIRouter(...)`
+- Service classes are imported directly by name
+
+**Barrel Files (`__init__.py`):**
+- `app/models/__init__.py` — re-exports all model classes
+- `app/services/ingest/__init__.py` — re-exports `IngestService` and config registrations
+- `app/api/v1/__init__.py` — aggregates all routers into `v1_router`
+
+**Dependency Injection:**
+- FastAPI `Depends()` used for service/controller instantiation in route handlers
+- Dependency factory functions named `get_{service}()` and defined in the same file as the router
+- Shared auth dependencies: `DependAuth`, `DependPermission` in `app/core/dependency.py`
+
+## Tortoise ORM Model Conventions
+
+**Base class:** All models inherit from `app/models/base.py:BaseModel` (which extends `tortoise.models.Model` with `id = BigIntField(pk=True)`)
+
+**Timestamp mixin:** `TimestampMixin` adds `created_at` (auto_now_add) and `updated_at` (auto_now) — applied via multiple inheritance
+
+**Abstract base models:** Platform variants use abstract base + concrete subclasses:
+```python
+class BaseKeyword(Model):      # abstract = True in Meta
+    ...
+class BossKeyword(BaseKeyword):
+    class Meta:
+        table = "boss_keyword"
+```
+
+**Field descriptions:** All fields include `description=` parameter for documentation
+
+## Pydantic Schema Conventions
+
+- All schemas inherit from `pydantic.BaseModel`
+- All fields use `Field(...)` with `description=` for documentation
+- Enums inherit from `(str, Enum)` for JSON serialization compatibility
+- Output schemas include `class Config: from_attributes = True` to support ORM mode
+- Validation patterns use `Field(..., pattern="^(boss|qcwy|zhilian)$")` for enum-like string fields
+
+---
+
+*Convention analysis: 2026-03-21*
--- a/.planning/codebase/INTEGRATIONS.md
+++ b/.planning/codebase/INTEGRATIONS.md
@ -0,0 +1,165 @@
+# External Integrations
+
+**Analysis Date:** 2026-03-21
+
+## APIs & External Services
+
+**Recruitment Platforms (scraped, not official APIs):**
+- Boss Zhipin (Boss直聘) - Job and company data via WeChat mini-program endpoints
+  - Client: `requests` (sync) in `jobs_spider/boss/boos_api.py`
+  - Endpoints scraped: `/wapi/zpgeek/miniapp/search/joblist.json`, `/wapi/batch/batchRunV3`, `/wapi/zpgeek/miniapp/brand/detail.json`
+  - Auth: MPT Token (stored in MySQL `boss_token` table, managed via `app/api/v1/token/`)
+  - Anti-detection: `SmartIPManager`, `IPAnomalyDetector`, random 10-20s delays, session rebuilding on ban
+  - JS signing: `PyExecJS` executes JS sign algorithms (`jobs_spider/boss/boos_api.py`)
+
+- 51job / Qiancheng Wuyou (前程无忧) - Job data via platform API
+  - Client: `httpx` + `requests` in `jobs_spider/qcwy/qcwy.py`
+  - Auth: HMAC signature computed per-request
+  - Proxy: Hardcoded proxy URL present in `jobs_spider/qcwy/qcwy.py` (line 31) - security risk
+
+- Zhilian Zhaopin (智联招聘) - Job and company data
+  - Client: `requests` in `jobs_spider/zhilian/zhilian_single.py`
+  - Sign client: `app/services/crawler/_zhilian_sign.py`
+
+**Qixin.com (企信) - Company data enrichment (disabled):**
+- Endpoint: `http://external-data.qixin.com/extend/extend_data_push`
+- Auth: MD5 HMAC with timestamp + salt (salt hardcoded in `app/services/ingest/remote_push.py` line 48)
+- Status: Push code commented out; function `push_to_remote()` currently only logs
+
+**Webhook / Report Endpoint:**
+- Outgoing: Configurable via `REPORT_ENDPOINT` env var (default: none)
+- Used by: `app/core/scheduler.py` `_post_with_retry()` to POST stats JSON every 6 hours
+- Timeout: `REPORT_TIMEOUT` (default 10s), retries: `REPORT_MAX_RETRIES` (default 3)
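+
+The retry behaviour described for `_post_with_retry()` might be sketched as follows. This is not the real implementation: the linear backoff, the injectable `post` callable, and the reading of `REPORT_MAX_RETRIES` as total attempts are all assumptions.
+
+```python
+import time
+from typing import Callable
+
+
+def post_with_retry(
+    post: Callable[[], int],
+    max_retries: int = 3,
+    backoff_seconds: float = 1.0,
+    sleep: Callable[[float], None] = time.sleep,
+) -> bool:
+    """Call `post` until it returns a 2xx status or attempts run out."""
+    for attempt in range(1, max_retries + 1):
+        try:
+            ok = 200 <= post() < 300
+        except OSError:
+            ok = False  # connection errors and timeouts count as a failed attempt
+        if ok:
+            return True
+        if attempt < max_retries:
+            sleep(backoff_seconds * attempt)  # linear backoff between attempts
+    return False
+```
+
+Passing `sleep` as a parameter keeps the function testable without real delays.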
+
+## Data Storage
+
+**Databases:**
+
+MySQL (Primary business database):
+- Connection: `mysql://root:jobdata123@121.4.126.241:3306/job_data` (default hardcoded in `app/settings/config.py`; override via `TORTOISE_ORM` env or individual components)
+- Client: Tortoise-ORM 0.23.0 with aiomysql driver
+- Migrations: Aerich (`migrations/` directory, runs on startup if `RUN_MIGRATIONS_ON_STARTUP=True`)
+- Tables: `user`, `role`, `api`, `menu`, `dept`, `auditlog`, `boss_token`, `cleaning_*`, `scheduled_task_run`, `stats_total`, `ip_upload_stats`
+- Models: `app/models/admin.py`, `app/models/token.py`, `app/models/cleaning.py`, `app/models/metrics.py`, `app/models/keyword.py`, `app/models/company.py`
+
+ClickHouse (Analytics / raw job data):
+- Connection: host `121.4.126.241`, port `8123`, DB `job_data` (configured in `app/settings/config.py`)
+- Env vars: `CLICKHOUSE_HOST`, `CLICKHOUSE_PORT`, `CLICKHOUSE_USER`, `CLICKHOUSE_PASS`, `CLICKHOUSE_DB`
+- Client: `clickhouse-connect==0.7.19` async client (`app/core/clickhouse.py`)
+- Connection pool: `urllib3.PoolManager` with `num_pools=2`, `maxsize=5` per worker
+- Tables managed via DDL in `app/core/clickhouse_init.py` (NOT Aerich migrations)
+- Tables: `boss_job`, `boss_company`, `qcwy_job`, `qcwy_company`, `zhilian_job`, `zhilian_company` (MergeTree engine), `pending_company` (ReplacingMergeTree), `job_analytics` (VIEW)
+- Queries: Repository pattern in `app/repositories/clickhouse_repo.py`
+
+**File Storage:**
+- Local filesystem only (no S3 / object storage detected)
+- Static files served from `static/` directory mounted via FastAPI (`app/__init__.py`)
+- Spider logs written to `logs/` directory relative to spider working dir
+- ECS pipeline log: `ecs_full_pipeline.log` (root project directory)
+
+**Caching:**
+- Redis (optional): Used only for distributed locking in `app/core/locks.py`
+  - Env vars: `REDIS_HOST`, `REDIS_PORT` (default 6379), `REDIS_DB` (default 0), `REDIS_PASS`
+  - Falls back to filesystem-based TTL locks when Redis unavailable
+  - Lock files stored in system temp directory: `{tempdir}/jobdata_lock_{name}/`
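+
+A filesystem TTL lock of the kind described above can be built on the atomicity of `os.mkdir`. The sketch below is an illustration under that assumption, not the code in `app/core/locks.py`; the function names are hypothetical and only the `jobdata_lock_{name}` directory naming is taken from the notes.
+
+```python
+import os
+import tempfile
+import time
+from typing import Optional
+
+
+def _lock_dir(name: str, base_dir: Optional[str] = None) -> str:
+    return os.path.join(base_dir or tempfile.gettempdir(), f"jobdata_lock_{name}")
+
+
+def try_acquire(name: str, ttl_seconds: float, base_dir: Optional[str] = None) -> bool:
+    """Return True if the caller now holds the lock, False if it is held and fresh."""
+    path = _lock_dir(name, base_dir)
+    try:
+        os.mkdir(path)  # atomic: raises if the directory already exists
+        return True
+    except FileExistsError:
+        age = time.time() - os.path.getmtime(path)
+        if age <= ttl_seconds:
+            return False
+        os.utime(path, None)  # stale lock: take it over by refreshing its mtime
+        return True
+
+
+def release(name: str, base_dir: Optional[str] = None) -> None:
+    os.rmdir(_lock_dir(name, base_dir))
+```
+
+The stale-lock takeover is best-effort: two processes that observe an expired lock at the same moment can both proceed, which is one reason Redis is the preferred backend.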
+
+## Authentication & Identity
+
+**Auth Provider:** Custom JWT implementation (no third-party auth provider)
+- Implementation: `app/core/dependency.py` (`DependPermission` dependency)
+- Algorithm: HS256
+- Expiry: 7 days (60 * 24 * 7 minutes, `app/settings/config.py`)
+- Secret: `SECRET_KEY` (default `"CHANGE_ME_DEV_ONLY"` - must override in production)
+- Storage (frontend): JWT stored in `localStorage`, injected via axios request interceptor (`web/src/utils/http/interceptors.js`)
+- RBAC: Route-level permission check via `DependPermission`; roles link to APIs (many-to-many in MySQL)
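+
+To make the HS256 and expiry settings concrete, here is a stdlib-only sketch of signing and verifying such a token. The project itself uses PyJWT, so this is an illustration of the token format rather than the code in `app/core/dependency.py`; the function names and the `user_id` claim are assumptions.
+
+```python
+import base64
+import hashlib
+import hmac
+import json
+import time
+from typing import Optional
+
+ACCESS_TOKEN_EXPIRE_MINUTES = 60 * 24 * 7  # 7 days, as stated above
+
+
+def _b64url(data: bytes) -> str:
+    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")
+
+
+def _signature(signing_input: str, secret: str) -> str:
+    digest = hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()
+    return _b64url(digest)
+
+
+def sign_hs256(payload: dict, secret: str) -> str:
+    header = {"alg": "HS256", "typ": "JWT"}
+    signing_input = ".".join(
+        _b64url(json.dumps(part, separators=(",", ":")).encode()) for part in (header, payload)
+    )
+    return f"{signing_input}.{_signature(signing_input, secret)}"
+
+
+def verify_hs256(token: str, secret: str, now: Optional[float] = None) -> Optional[dict]:
+    """Return the payload if the signature matches and `exp` is in the future."""
+    try:
+        signing_input, sig = token.rsplit(".", 1)
+        payload_b64 = signing_input.split(".")[1]
+    except (ValueError, IndexError):
+        return None
+    if not hmac.compare_digest(sig.encode(), _signature(signing_input, secret).encode()):
+        return None
+    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
+    payload = json.loads(base64.urlsafe_b64decode(padded))
+    if payload.get("exp", 0) < (time.time() if now is None else now):
+        return None
+    return payload
+```
+
+A login handler would set `exp` to the current time plus `ACCESS_TOKEN_EXPIRE_MINUTES * 60` seconds; leaving `SECRET_KEY` at its default would let anyone forge tokens with this same procedure.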
+
+**Boss Token Management:**
+- Boss MPT tokens stored in `boss_token` MySQL table (`app/models/token.py`)
+- Managed via `app/api/v1/token/` CRUD endpoints
+- Spiders fetch active tokens from `/api/v1/token/tokens` and mark invalid ones
+
+## Monitoring & Observability
+
+**Error Tracking:** None (no Sentry or equivalent detected)
+
+**Logs:**
+- Backend: `loguru` logging throughout all modules (f-string messages; no `logger.bind()` structured fields)
+- Spider: `loguru` with daily rotation (`logs/log_{YYYY-MM-DD}.log`, 30-day retention)
+- ECS pipeline: File log appended to `ecs_full_pipeline.log`
+- No centralized log aggregation (no ELK, CloudWatch, etc.)
+
+**Alerting:**
+- Email alerts via SMTP on: stats job completion, IP anomaly detection
+- SMTP provider: QQ Mail (`smtp.qq.com:587`) with STARTTLS
+- Env vars: `SMTP_HOST`, `SMTP_PORT`, `SMTP_USER`, `SMTP_PASS`, `SMTP_FROM`, `SMTP_TO`
+- Default recipient hardcoded in `app/settings/config.py` line 47
+- IP anomaly alerts triggered every 10 minutes by `ip_alert_job` in `app/core/scheduler.py`
+- Stats reports sent every 6 hours by `stats_job` in `app/core/scheduler.py`
+
+**Metrics:**
+- IP upload tracking: `IpTrackingMiddleware` (`app/core/ip_tracking.py`) records per-source, per-IP upload counts to `ip_upload_stats` MySQL table
+- Task run history: `ScheduledTaskRun` model (`app/models/metrics.py`) records all scheduler task outcomes
+- Stats totals: `StatsTotal` model records ClickHouse row counts per table per run
+
+## CI/CD & Deployment
+
+**Hosting:**
+- Production: Alibaba Cloud ECS (manually provisioned + Docker container)
+- Crawler nodes: Alibaba Cloud ECS spot instances (cn-shanghai-b, `ecs.t5-lc1m1.small`, Ubuntu 24.04)
+- ECS node config: `ecs_full_pipeline.py` (20 spot instances, `SpotAsPriceGo` strategy, auto-release)
+- Vswitch: `vsw-uf6twb2lrd02fi7q2i605`, Security Group: `sg-uf60380ctrux2uusucad`
+
+**CI Pipeline:** None detected (no GitHub Actions, Jenkins, or equivalent)
+
+**Container:**
+- Multi-stage `Dockerfile` in project root
+- Stage 1: `node:18-alpine` builds Vue3 frontend
+- Stage 2: `python:3.11-slim-bullseye` installs Python deps + nginx
+- Entrypoint: `deploy/entrypoint.sh`
+- Nginx config: `deploy/web.conf`
+- Exposed port: 80
+
+## Environment Configuration
+
+**Required env vars (backend):**
+- `SECRET_KEY` - JWT signing secret (must override `"CHANGE_ME_DEV_ONLY"`)
+- `CLICKHOUSE_HOST` - ClickHouse server address
+- `CLICKHOUSE_USER` / `CLICKHOUSE_PASS` - ClickHouse credentials
+- MySQL connection (embedded in `TORTOISE_ORM` dict or override connection string)
+
+**Optional env vars:**
+- `APP_HOST` / `APP_PORT` / `UVICORN_WORKERS` - Server binding
+- `REDIS_HOST` / `REDIS_PORT` / `REDIS_PASS` - Distributed lock backend
+- `SMTP_HOST` / `SMTP_USER` / `SMTP_PASS` / `SMTP_FROM` / `SMTP_TO` - Email alerting
+- `REPORT_ENDPOINT` - Webhook for stats reporting
+- `RUN_MIGRATIONS_ON_STARTUP` / `INITIALIZE_SEED_DATA_ON_STARTUP` - Startup behavior flags
+
+**Spider env vars:**
+- `API_BASE_URL` - Backend URL for data push (default: `http://124.222.106.226:9999`)
+- `SLEEP_MIN_SECONDS` / `SLEEP_MAX_SECONDS` - Request rate limiting
+- `PROXY_USERNAME` / `PROXY_PASSWORD` / `PROXY_TUNNEL` / `PROXY_SCHEME` - Proxy pool
+- `MAX_PAGES` / `PAGE_SIZE` - Crawl depth control
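+
+A sketch of how a spider might read these variables. The `API_BASE_URL` default is the one listed above; the 10 and 20 second sleep defaults are assumptions borrowed from the Boss delay range noted earlier, and both function names are hypothetical.
+
+```python
+import os
+import random
+from typing import Mapping
+
+
+def load_spider_settings(env: Mapping[str, str] = os.environ) -> dict:
+    """Read spider settings from the environment, falling back to defaults."""
+    return {
+        "api_base_url": env.get("API_BASE_URL", "http://124.222.106.226:9999"),
+        "sleep_min": float(env.get("SLEEP_MIN_SECONDS", "10")),  # assumed default
+        "sleep_max": float(env.get("SLEEP_MAX_SECONDS", "20")),  # assumed default
+    }
+
+
+def next_delay(settings: dict) -> float:
+    """Pick a random pause between requests within the configured bounds."""
+    return random.uniform(settings["sleep_min"], settings["sleep_max"])
+```
+
+Accepting the environment as a mapping makes the loader easy to exercise with a plain dict.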
+
+**Alibaba Cloud (ECS pipeline):**
+- Uses `alibabacloud_credentials` default credential chain (env vars `ALIBABA_CLOUD_ACCESS_KEY_ID` / `ALIBABA_CLOUD_ACCESS_KEY_SECRET` or `~/.alibabacloud/credentials`)
+
+**Secrets location:** All secrets are environment variables; no secrets manager. `config.py` contains hardcoded defaults that are insecure for production use.
+
+## Webhooks & Callbacks
+
+**Incoming:**
+- `/api/v1/universal/data/batch-store-async` - Spider data ingestion endpoint (called by all three crawlers)
+  - Header auth: `token: dev` (or `API_TOKEN` env var)
+  - Body: `{"data_list": [...], "data_type": "job"|"company", "platform": "boss"|"qcwy"|"zhilian"}`
+- `/api/v1/keyword/available` - Spiders poll this to get next keyword to crawl
+- `/api/v1/token/tokens` - Spiders fetch active Boss MPT tokens
+- `/api/v1/pipeline` - Manual trigger for ECS full pipeline
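+
+For the `batch-store-async` endpoint above, a crawler-side request could be assembled as in this sketch. The header name, default token, and body keys follow the notes; the POST method, the JSON content type, and the function name are assumptions. The request is only built here, not sent.
+
+```python
+import json
+import urllib.request
+
+
+def build_ingest_request(
+    base_url: str,
+    data_list: list,
+    data_type: str,
+    platform: str,
+    token: str = "dev",
+) -> urllib.request.Request:
+    """Build (but do not send) a batch ingest request for the backend."""
+    body = json.dumps(
+        {"data_list": data_list, "data_type": data_type, "platform": platform}
+    ).encode("utf-8")
+    return urllib.request.Request(
+        url=f"{base_url.rstrip('/')}/api/v1/universal/data/batch-store-async",
+        data=body,
+        headers={"Content-Type": "application/json", "token": token},
+        method="POST",
+    )
+```
+
+Sending it would be `urllib.request.urlopen(req, timeout=...)`; the real spiders use `requests` or `httpx` instead.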
+
+**Outgoing:**
+- `REPORT_ENDPOINT` (configurable) - Stats payload POST every 6 hours (`app/core/scheduler.py`)
+- `http://external-data.qixin.com/...` - Company data push to Qixin (currently disabled/commented out in `app/services/ingest/remote_push.py`)
+- SMTP email notifications on scheduler events
+
+---
+
+*Integration audit: 2026-03-21*
--- a/.planning/codebase/STACK.md
+++ b/.planning/codebase/STACK.md
@ -0,0 +1,143 @@
+# Technology Stack
+
+**Analysis Date:** 2026-03-21
+
+## Languages
+
+**Primary:**
+- Python 3.13 - Backend API, crawlers, ECS pipeline scripts
+- JavaScript (ES Modules) - Vue3 frontend (no strict TypeScript; TS installed but JS used in `.vue` and `.js` files)
+
+**Secondary:**
+- TypeScript 5.1.6 - Type definitions referenced in frontend toolchain only
+
+## Runtime
+
+**Environment:**
+- Python 3.13 (required: `>=3.13` per `Pipfile`; Dockerfile uses `python:3.11-slim-bullseye` for container build)
+- Node.js 18 (Dockerfile `node:18-alpine` for frontend build stage)
+
+**Package Manager:**
+- Python: `pipenv` (development), `pip` / `requirements.txt` (Docker production)
+- Frontend: `pnpm` (lockfile: `web/pnpm-lock.yaml` present)
+- Lockfiles: `Pipfile.lock` (Python), `web/pnpm-lock.yaml` (frontend), `uv.lock` (uv-compatible)
+
+## Frameworks
+
+**Core Backend:**
+- FastAPI 0.111.0 - REST API framework (`app/__init__.py` factory via `create_app()`)
+- Starlette 0.37.2 - ASGI underpinning (middleware, static files)
+- Uvicorn 0.34.0 - ASGI server (20 workers default, configurable via `UVICORN_WORKERS`)
+- Uvloop 0.21.0 - High-performance event loop (non-Windows only)
+
+**ORM / Database:**
+- Tortoise-ORM 0.23.0 - Async ORM for MySQL (`app/models/`)
+- Aerich 0.8.1 - Database migrations for Tortoise-ORM (`migrations/` directory)
+- aiomysql - Async MySQL driver (Tortoise backend)
+- clickhouse-connect 0.7.19 - ClickHouse async client (`app/core/clickhouse.py`)
+- asyncpg - Async PostgreSQL driver (listed in Pipfile but not actively used)
+
+**Validation / Serialization:**
+- Pydantic 2.10.5 - Request/response schemas (`app/schemas/`)
+- pydantic-settings 2.7.1 - Settings management via env vars (`app/settings/config.py`)
+- orjson 3.10.14 - Fast JSON serialization
+- ujson 5.10.0 - Alternative JSON library
+
+**Authentication / Security:**
+- PyJWT 2.10.1 - JWT token generation/verification (HS256, 7-day expiry)
+- passlib 1.7.4 - Password hashing
+- argon2-cffi 23.1.0 - Argon2 password hasher
+
+**Scheduling:**
+- APScheduler - Async job scheduler (`app/core/scheduler.py`, 6 registered cron tasks)
+
+**HTTP Clients:**
+- httpx 0.28.1 - Async HTTP client (backend service-to-service calls)
+- requests - Sync HTTP client (spider scripts in `jobs_spider/`)
+
+**Logging:**
+- loguru 0.7.3 - Logging (unified throughout backend and spiders)
+
+**Core Frontend:**
+- Vue 3.3.4 - SPA framework (`web/src/main.js`)
+- Vue Router 4.2.4 - Client-side routing (`web/src/router/index.js`)
+- Pinia 2.1.6 - State management (`web/src/store/`)
+- Naive UI 2.34.4 - Component library (admin UI)
+- ECharts 6.0.0 - Data visualization charts (`web/src/views/analytics/`)
+- axios 1.4.0 - HTTP client (`web/src/utils/http/`)
+- vue-i18n 9 - Internationalization
+
+**Build / Dev Tools:**
+- Vite 4.4.6 - Frontend bundler (`web/vite.config.js`)
+- UnoCSS 66.5.10 - Atomic CSS engine
+- unplugin-auto-import 20.3.0 - Auto-imports for Vue APIs
+- unplugin-vue-components 30.0.0 - Auto-imports for components
+- @iconify/vue + @iconify/json - Icon library
+
+**Spider-specific:**
+- playwright 1.57.0 - Browser automation (anti-detection in crawlers)
+- PyExecJS 1.5.1 - Execute JavaScript signing algorithms from Python (`jobs_spider/boss/`)
+- PySocks - SOCKS proxy support for spider requests
+- tenacity - Retry logic in crawlers
+- pandas + openpyxl - Data processing and Excel export
+
+**Cloud:**
+- alibabacloud_ecs20140526 - Alibaba Cloud ECS SDK (`ecs_full_pipeline.py`)
+- alibabacloud_credentials - AliCloud credential management
+
+**Code Quality:**
+- ruff 0.9.1 - Python linter (`pyproject.toml` config: line-length 120, ignores F403/F405)
+- black 24.10.0 - Python formatter (line-length 120, target py310/py311)
+- isort 5.13.2 - Import sorter
+- ESLint 8.46.0 - Frontend linter (`@zclzone` + `@unocss` rule sets)
+- prettier - Frontend formatter
+
+## Key Dependencies
+
+**Critical:**
+- `clickhouse-connect==0.7.19` - All analytics and job data storage; loss means no data read/write
+- `tortoise-orm==0.23.0` - All business data (users, roles, keywords, tokens); paired with `aerich` for migrations
+- `fastapi==0.111.0` - API layer; version-pinned for stability
+- `APScheduler` - 6 scheduled tasks including ECS pipeline and IP alerting
+- `alibabacloud_ecs20140526` - ECS node management; required for crawler scaling
+
+**Infrastructure:**
+- `uvicorn==0.34.0` + `uvloop==0.21.0` - Production ASGI server stack
+- `pydantic==2.10.5` - All input validation; v2 API used throughout
+- `loguru==0.7.3` - Unified logging across all modules
+- `redis` - Optional distributed lock backend (`app/core/locks.py`; falls back to file locks if Redis unavailable)
+
+## Configuration
+
+**Environment:**
+- All settings in `app/settings/config.py` via `pydantic-settings.BaseSettings`
+- Environment variables override defaults at startup
+- No `.env` file detected (not committed); variables set at OS/container level
+- Key variables: `APP_HOST`, `APP_PORT`, `UVICORN_WORKERS`, `CLICKHOUSE_HOST`, `CLICKHOUSE_USER`, `CLICKHOUSE_PASS`, `SECRET_KEY`, `SMTP_HOST`, `SMTP_USER`, `SMTP_PASS`, `REDIS_HOST`, `REPORT_ENDPOINT`
+- **Security warning**: `config.py` contains hardcoded default values for MySQL, ClickHouse, and SMTP credentials; must be overridden in production
+
+**Build:**
+- `pyproject.toml` - Python project metadata, black/ruff tool config
+- `Pipfile` / `Pipfile.lock` - Development dependency management
+- `requirements.txt` - Production pip install (used in Dockerfile)
+- `web/vite.config.js` - Vite build config with proxy support via `VITE_USE_PROXY` env var
+- `Dockerfile` - Multi-stage build: Node 18 (frontend) + Python 3.11 (backend) + nginx
+
+## Platform Requirements
+
+**Development:**
+- Python 3.13+ (Pipfile requirement)
+- Node.js 18+ with pnpm
+- MySQL server (Tortoise-ORM connection)
+- ClickHouse server (analytics data)
+- Optional: Redis (distributed locking upgrade)
+
+**Production:**
+- Docker-based deployment (see `Dockerfile`, `deploy/entrypoint.sh`, `deploy/web.conf`)
+- Nginx serves frontend static files and proxies API requests
+- Alibaba Cloud ECS (cn-shanghai-b zone) for crawler nodes
+- Ports: 80 (nginx), 9999 (FastAPI backend direct)
+
+---
+
+*Stack analysis: 2026-03-21*
--- a/.planning/codebase/STRUCTURE.md
+++ b/.planning/codebase/STRUCTURE.md
@ -0,0 +1,350 @@
+# Codebase Structure
+
+**Analysis Date:** 2026-03-21
+
+## Directory Layout
+
+```
+JobData/                          # Project root
+├── app/                          # FastAPI backend (Python)
+│   ├── __init__.py               # App factory create_app()
+│   ├── api/
+│   │   └── v1/                   # All REST API routes (/api/v1/...)
+│   │       ├── __init__.py       # Router aggregation + permission guards
+│   │       ├── analytics.py      # Data analytics endpoints
+│   │       ├── stats.py          # Table count stats endpoints
+│   │       ├── pipeline.py       # ECS pipeline trigger endpoint
+│   │       ├── base/             # Login, user-info, menu-tree (no auth)
+│   │       ├── job/              # Data ingest endpoints (batch/single/async)
+│   │       ├── cleaning/         # Targeted cleaning endpoints
+│   │       ├── company/          # Company search endpoints
+│   │       ├── keyword/          # Keyword CRUD endpoints
+│   │       ├── token/            # Boss token management endpoints
+│   │       ├── proxy/            # Proxy IP management endpoints
+│   │       ├── users/            # User CRUD
+│   │       ├── roles/            # Role + permission assignment
+│   │       ├── menus/            # Menu tree CRUD
+│   │       ├── apis/             # API registration management
+│   │       ├── depts/            # Department tree CRUD
+│   │       └── auditlog/         # Audit log query
+│   ├── controllers/              # CRUD orchestration for MySQL entities
+│   │   ├── api.py                # ApiController (refresh_api(), scan routes)
+│   │   ├── user.py               # UserController
+│   │   ├── role.py               # RoleController
+│   │   ├── menu.py               # MenuController
+│   │   ├── dept.py               # DeptController
+│   │   ├── keyword.py            # KeywordController
+│   │   ├── proxy.py              # ProxyController
+│   │   ├── token.py              # TokenController
+│   │   ├── cleaning.py           # CleaningController
+│   │   └── company.py            # CompanyController (search via crawler services)
+│   ├── core/                     # Framework infrastructure
+│   │   ├── init_app.py           # lifespan: migrations, seed, ClickHouse init
+│   │   ├── scheduler.py          # APScheduler job definitions + registration
+│   │   ├── clickhouse.py         # ClickHouseManager singleton
+│   │   ├── clickhouse_init.py    # ClickHouse DDL (tables + analytics view)
+│   │   ├── dependency.py         # AuthControl, PermissionControl, DependPermission
+│   │   ├── locks.py              # DistributedLock (Redis + file fallback)
+│   │   ├── middlewares.py        # BackGroundTaskMiddleware, HttpAuditLogMiddleware
+│   │   ├── ip_tracking.py        # IpTrackingMiddleware
+│   │   ├── exceptions.py         # Custom exception classes + handlers
+│   │   ├── crud.py               # Generic ListParams, pagination helpers
+│   │   ├── ctx.py                # CTX_USER_ID ContextVar
+│   │   ├── bgtask.py             # BgTasks helper for BackGroundTaskMiddleware
+│   │   ├── proxy_rule.py         # Proxy selection rules
+│   │   └── algorithms/
+│   │       └── antispider.py     # Anti-spider signature helpers
+│   ├── models/                   # Tortoise-ORM models (MySQL)
+│   │   ├── base.py               # BaseModel, TimestampMixin, UUIDModel
+│   │   ├── admin.py              # User, Role, Api, Menu, Dept, AuditLog
+│   │   ├── token.py              # BossToken
+│   │   ├── cleaning.py           # CleaningTask / CleaningRecord
+│   │   ├── metrics.py            # ScheduledTaskRun, StatsTotal, IpUploadStats
+│   │   ├── company.py            # CompanyCleaningQueue
+│   │   ├── keyword.py            # BossKeyword, QcwyKeyword, ZhilianKeyword
+│   │   └── enums.py              # MethodType enum
+│   ├── repositories/             # ClickHouse query layer
+│   │   └── clickhouse_repo.py    # ClickHouseBaseRepo, JobAnalyticsRepo
+│   ├── schemas/                  # Pydantic request/response schemas
+│   │   ├── ingest.py             # IngestBatchRequest, PlatformType, DataType enums
+│   │   ├── analytics.py          # Analytics query params + response schemas
+│   │   ├── keyword.py            # Keyword schemas
+│   │   ├── token.py              # Token schemas
+│   │   ├── base.py               # Common response envelopes
+│   │   ├── login.py              # Login request/response
+│   │   ├── users.py              # User schemas
+│   │   ├── roles.py              # Role schemas
+│   │   ├── menus.py              # Menu schemas (MenuType enum)
+│   │   ├── apis.py               # API schemas
+│   │   ├── depts.py              # Dept schemas
+│   │   ├── cleaning.py           # Cleaning schemas
+│   │   └── proxy.py              # Proxy schemas
+│   ├── services/                 # Business logic
+│   │   ├── analytics_service.py  # AnalyticsService → JobAnalyticsRepo
+│   │   ├── cleaning.py           # CleaningService (targeted cleaning)
+│   │   ├── company_cleaner.py    # CompanyCleaner (automated company pipeline)
+│   │   ├── company_jobs_sync.py  # CompanyJobsSyncService
+│   │   ├── company_storage.py    # company_storage (ClickHouse company insert)
+│   │   ├── ingest/               # Ingest pipeline subsystem
+│   │   │   ├── service.py        # IngestService.store_batch/store_single/query_data
+│   │   │   ├── registry.py       # PlatformConfig registry + get_config()
+│   │   │   ├── dedup.py          # batch_dedup_filter() against ClickHouse
+│   │   │   ├── remote_push.py    # Async HTTP push to secondary target
+│   │   │   └── configs/          # Per-platform PlatformConfig registrations
+│   │   └── crawler/              # In-process platform HTTP clients
+│   │       ├── _base.py          # BaseFetcher, BaseSearcher, ApiResult
+│   │       ├── _http_client.py   # HTTPClient wrapper (sync requests)
+│   │       ├── _boss_client.py   # Boss HTTP client
+│   │       ├── _boss_api.py      # Boss API methods
+│   │       ├── _boss_sign.py     # Boss request signing
+│   │       ├── _zhilian_client.py
+│   │       ├── _zhilian_api.py
+│   │       ├── _zhilian_sign.py
+│   │       ├── _job51_client.py
+│   │       ├── _job51_api.py
+│   │       ├── _job51_sign.py
+│   │       ├── boss.py           # BossService (high-level facade)
+│   │       ├── qcwy.py           # QcwyService (high-level facade)
+│   │       └── zhilian.py        # ZhilianService (high-level facade)
+│   ├── settings/
+│   │   └── config.py             # Settings (pydantic-settings BaseSettings)
+│   ├── log/
+│   │   └── __init__.py           # loguru logger export
+│   └── utils/                    # Shared utilities
+├── web/                          # Vue 3 frontend
+│   ├── src/
+│   │   ├── main.js               # App entry, Pinia + Router setup
+│   │   ├── App.vue               # Root component
+│   │   ├── api/                  # Axios API call modules
+│   │   │   ├── index.js          # System management APIs
+│   │   │   ├── analytics.js      # Analytics APIs
+│   │   │   ├── keyword.js        # Keyword APIs
+│   │   │   ├── proxy.js          # Proxy APIs
+│   │   │   └── token.js          # Token APIs
+│   │   ├── components/           # Shared components (CrudTable, CrudModal, QueryBar)
+│   │   ├── layout/               # App shell (sidebar, topbar, tabs)
+│   │   ├── router/               # Vue Router config + dynamic route loading
+│   │   │   ├── index.js          # createRouter(), addDynamicRoutes()
+│   │   │   ├── guard/            # Navigation guards (auth, loading, title)
+│   │   │   └── routes/           # Static route definitions
+│   │   ├── store/                # Pinia stores
+│   │   │   └── modules/          # user, permission, app, tags
+│   │   ├── utils/
+│   │   │   ├── auth/             # JWT token read/write
+│   │   │   ├── http/             # Axios instance + interceptors
+│   │   │   └── storage/          # localStorage wrapper
+│   │   └── views/                # Page-level Vue components
+│   │       ├── login/            # Login page
+│   │       ├── analytics/        # ECharts dashboard (trend + distribution)
+│   │       ├── recruitment/      # Job data browser (boss/, qcwy/, zhilian/)
+│   │       ├── cleaning/         # Cleaning operation + monitor pages
+│   │       ├── keyword/          # Keyword management
+│   │       ├── profile/          # User profile
+│   │       └── system/           # RBAC admin (user, role, menu, api, dept, auditlog, proxy, token)
+│   ├── vite.config.js            # Vite build config (plugins, proxy)
+│   └── package.json              # Frontend dependencies
+├── jobs_spider/                  # Standalone crawlers (external process)
+│   ├── boss/
+│   │   ├── boos_api.py           # BossZhipinAPI core + SmartIPManager + main (~2246 lines)
+│   │   ├── company_spider.py     # Boss company batch spider
+│   │   └── enumerate_combos.py   # City × keyword combo enumeration
+│   ├── qcwy/
+│   │   ├── qcwy.py               # QCWY API wrapper
+│   │   ├── search_company_jobs.py
+│   │   └── run_company_search.py
+│   └── zhilian/
+│       ├── zhilian_single.py     # Zhilian single-job spider
+│       └── company_spider.py     # Zhilian company spider
+├── spiderJobs/                   # Second-generation spider framework (structured)
+│   ├── core/
+│   │   ├── base.py               # Shared base classes
+│   │   └── http_client.py        # HTTP client abstraction
+│   ├── platforms/                # Platform-specific implementations
+│   │   ├── boss/
+│   │   ├── job51/
+│   │   └── zhilian/
+│   └── runner/                   # Execution runners
+│       ├── api_client.py         # Backend API push client
+│       ├── loop.py               # Job crawl loop runner
+│       └── company_loop.py       # Company crawl loop runner
+├── migrations/
+│   └── models/                   # Aerich migration files (auto-generated)
+├── static/                       # Served at /static/ (built frontend or assets)
+├── deploy/                       # Deployment configs and sample images
+├── docs/                         # Project documentation
+├── ecs_full_pipeline.py          # Alibaba Cloud ECS batch provisioning script
+├── create_ecs_instances.py       # ECS instance creation utilities
+├── reclean_qcwy_jobs.py          # Standalone QCWY re-cleaning script
+├── run.py                        # Uvicorn launch entry point
+├── Pipfile / Pipfile.lock        # Python dependency management (pipenv)
+├── pyproject.toml                # ruff + black + isort tool config
+└── Dockerfile                    # Container build spec
+```
+
+## Directory Purposes
+
+**`app/api/v1/`:**
+- Purpose: One sub-package per API domain; each contains `__init__.py` exporting its `APIRouter`
+- Contains: Route handler functions, dependency injection wiring, Pydantic schema imports
+- Key files: `app/api/v1/__init__.py` (aggregation), `app/api/v1/job/job.py` (ingest routes), `app/api/v1/analytics.py`
+
+**`app/controllers/`:**
+- Purpose: Business-operation classes for MySQL-backed entities; encapsulate Tortoise-ORM queries
+- Contains: One controller per domain entity; typically expose `create`, `update`, `delete`, `list`, `get` methods
+- Key files: `app/controllers/api.py` (scan and register FastAPI routes → DB)
+
+**`app/services/ingest/`:**
+- Purpose: Self-contained data pipeline: registry lookup → dedup check → ClickHouse insert → remote push
+- Contains: `service.py` (main orchestrator), `registry.py` (config store), `dedup.py`, `remote_push.py`, `configs/` (registrations per platform)
+- Key files: `app/services/ingest/service.py`, `app/services/ingest/registry.py`
+
+**`app/services/crawler/`:**
+- Purpose: In-process HTTP clients for platform APIs, used by backend scheduler (company cleaning) and company search controller
+- Contains: Underscore-prefixed private modules (`_boss_client.py`, `_boss_sign.py`, etc.) and public service facades (`boss.py`, `qcwy.py`, `zhilian.py`)
+- Key files: `app/services/crawler/_base.py` (shared base classes)
+
+**`app/core/`:**
+- Purpose: Framework plumbing — startup, middleware chain, auth, distributed locking, ClickHouse connection
+- Key files: `app/core/dependency.py` (auth guards), `app/core/scheduler.py` (all cron jobs), `app/core/locks.py`
+
+**`app/models/`:**
+- Purpose: All Tortoise-ORM model definitions for MySQL, migrated via Aerich (ClickHouse tables are defined separately in `app/core/clickhouse_init.py`)
+- Key files: `app/models/admin.py` (RBAC), `app/models/keyword.py` (crawl state tracking)
+
+**`app/repositories/`:**
+- Purpose: Typed ClickHouse query encapsulation following the Repository pattern
+- Key files: `app/repositories/clickhouse_repo.py`
+
+**`jobs_spider/`:**
+- Purpose: Standalone scripts that run on ECS nodes; communicate with backend only via HTTP
+- Key files: `jobs_spider/boss/boos_api.py` (2246-line core with SmartIPManager, BossZhipinAPI)
+
+**`spiderJobs/`:**
+- Purpose: Newer modular spider framework; parallel architecture to `jobs_spider/` with cleaner separation
+- Key files: `spiderJobs/runner/api_client.py`, `spiderJobs/runner/loop.py`
+
+**`web/src/views/`:**
+- Purpose: One directory per route group; each view owns its local route definition in `route.js`
+- Key files: `web/src/views/analytics/index.vue` (ECharts dashboard), `web/src/views/cleaning/monitor.vue`
+
+## Key File Locations
+
+**Entry Points:**
+- `run.py`: Uvicorn startup; reads env vars `APP_HOST`, `APP_PORT`, `UVICORN_WORKERS`
+- `app/__init__.py`: FastAPI app factory `create_app()` and lifespan context manager
+- `web/src/main.js`: Vue 3 app entry; mounts Pinia, Router, directives
+
+**Configuration:**
+- `app/settings/config.py`: All backend config (`Settings` extends `pydantic_settings.BaseSettings`); override via env vars
+- `web/vite.config.js`: Vite build and dev server config
+- `pyproject.toml`: ruff, black, isort tool configuration
+
+**Core Logic:**
+- `app/core/init_app.py`: Startup sequence (migrations, seed data, ClickHouse init)
+- `app/core/scheduler.py`: All APScheduler job functions and `start_scheduler()`
+- `app/core/clickhouse_init.py`: All ClickHouse `CREATE TABLE IF NOT EXISTS` DDL statements
+- `app/services/ingest/registry.py`: `PlatformConfig` dataclass + global registry dict
+- `app/services/ingest/service.py`: `IngestService.store_batch()` — main data ingestion path
+- `app/services/company_cleaner.py`: `CompanyCleaner` — automated company data pipeline
+
+**ClickHouse Schema:**
+- `app/core/clickhouse_init.py`: Defines all 6 MergeTree tables and 1 analytics VIEW DDL; apply via `ClickHouseInitializer.initialize_all_tables()`
+
+**MySQL Migrations:**
+- `migrations/models/`: Auto-generated by `aerich migrate`; applied via `aerich upgrade` or automatically on startup
+
+**Testing:**
+- `tests/test_company_jobs_sync.py`
+- `tests/test_company_storage.py`
+
+## Naming Conventions
+
+**Files (Python):**
+- Modules: `snake_case.py` (e.g., `company_cleaner.py`, `clickhouse_init.py`)
+- Private crawler modules: leading underscore prefix (e.g., `_boss_client.py`, `_zhilian_sign.py`)
+- Spider scripts in `jobs_spider/`: descriptive names without prefix (e.g., `boos_api.py`, `zhilian_single.py`)
+
+**Files (Vue/JS):**
+- Pages: `index.vue` for primary view, `monitor.vue` / descriptive name for secondary views
+- API modules: `snake_case.js` matching domain (e.g., `analytics.js`, `keyword.js`)
+- Router guard files: Grouped in `router/guard/` directory
+
+**Classes (Python):**
+- Services: `PascalCaseService` (e.g., `IngestService`, `AnalyticsService`, `CompanyCleaner`)
+- Controllers: `PascalCaseController` with module-level singleton instance in snake_case (e.g., `user_controller = UserController()`)
+- Repos: `PascalCaseRepo` (e.g., `ClickHouseBaseRepo`, `JobAnalyticsRepo`)
+- Models: `PascalCase` matching table concept (e.g., `BossToken`, `CompanyCleaningQueue`)
+
+**Directories:**
+- Domain groupings: lowercase (e.g., `cleaning/`, `recruitment/`, `keyword/`)
+- Platform names used consistently: `boss`, `qcwy`, `zhilian` (lowercase) across all layers
+
+## Where to Add New Code
+
+**New API Endpoint:**
+1. Create or update router file in `app/api/v1/{domain}/`
+2. Register the router in `app/api/v1/__init__.py` with `v1_router.include_router(...)`
+3. Run `api_controller.refresh_api()` on next app startup to sync new routes to the `api` table
+4. Add corresponding Pydantic schemas in `app/schemas/{domain}.py`
+
+**New Platform Data Type (Ingest):**
+1. Define `PlatformConfig` with dedup fields in `app/services/ingest/configs/`
+2. Call `register(config)` at module import in `app/services/ingest/configs/__init__.py`
+3. ClickHouse table DDL goes in `app/core/clickhouse_init.py` under `_TABLE_DDLS`
+
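The registration steps above can be sketched as a small registry module. This is only an illustration: the field names (`platform`, `data_type`, `table`, `dedup_fields`), the two-argument `get_config()` signature, and the table name are assumptions, not the actual contents of `app/services/ingest/registry.py`.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class PlatformConfig:
    # All field names below are illustrative assumptions.
    platform: str
    data_type: str
    table: str
    dedup_fields: Tuple[str, ...] = ()


# Global registry keyed by (platform, data_type).
_REGISTRY: Dict[Tuple[str, str], PlatformConfig] = {}


def register(config: PlatformConfig) -> None:
    """Store a config; reject duplicates so import-time mistakes surface early."""
    key = (config.platform, config.data_type)
    if key in _REGISTRY:
        raise ValueError(f"duplicate config for {key}")
    _REGISTRY[key] = config


def get_config(platform: str, data_type: str) -> PlatformConfig:
    """Look up a previously registered config."""
    try:
        return _REGISTRY[(platform, data_type)]
    except KeyError:
        raise KeyError(f"no config registered for {platform}/{data_type}") from None


# Registration happens once, at module import (hypothetical table name).
register(PlatformConfig("boss", "job", "boss_jobs", ("job_id",)))
```

Registering at import time keeps each platform's config next to its definition, while the ingest service only needs `get_config()` to resolve the target table and dedup fields.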
+**New MySQL Model:**
+1. Add model class to appropriate file in `app/models/` (or create new file)
+2. Register the model module in `app/settings/config.py` under `TORTOISE_ORM.apps.models.models`
+3. Run `aerich migrate && aerich upgrade` (or rely on `RUN_MIGRATIONS_ON_STARTUP=True`)
+
+**New Scheduled Job:**
+1. Define async job function in `app/core/scheduler.py`
+2. Wrap execution body with `DistributedLock` context from `app/core/locks.py`
+3. Add `scheduler.add_job(...)` call inside `start_scheduler()`
+4. Record run result via `_record_task_run()` to `ScheduledTaskRun` table
+
+**New Frontend Page:**
+1. Create view component at `web/src/views/{module}/index.vue`
+2. Add `route.js` in the same directory with route meta
+3. Register the route in `web/src/router/routes/` static routes or rely on backend menu system
+4. Add corresponding backend menu entry in `app/core/init_app.py` → `init_menus()` (for seed data)
+
+**New In-Process Platform Crawler:**
+1. Add private client modules `_platform_client.py`, `_platform_api.py`, `_platform_sign.py` under `app/services/crawler/`
+2. Create public facade `platform.py` extending or wrapping `BaseFetcher` / `BaseSearcher` from `app/services/crawler/_base.py`
+3. Instantiate in `CompanyCleaner.__init__()` if needed for company pipeline
+
+**Shared Utilities:**
+- Backend Python utilities: `app/utils/`
+- Frontend JS utilities: `web/src/utils/`
+
+## Special Directories
+
+**`.startup_lock/`:**
+- Purpose: File-based mutex directory used to prevent multi-worker race on startup seed data
+- Generated: Yes (created and deleted at startup)
+- Committed: No (in `.gitignore`)
+
+**`migrations/`:**
+- Purpose: Aerich migration history for MySQL schema
+- Generated: Yes (by `aerich migrate`)
+- Committed: Yes
+
+**`static/`:**
+- Purpose: Served at `/static/` path; can hold built frontend assets or other static files
+- Generated: Partially (built frontend output can be placed here)
+- Committed: Yes (currently contains static assets)
+
+**`.planning/`:**
+- Purpose: GSD planning documents (codebase analysis, phase plans)
+- Generated: Yes (by GSD tooling)
+- Committed: Yes
+
+**`jobs_spider/*/logs/`:**
+- Purpose: Rotating daily log files for standalone spider processes
+- Generated: Yes (by loguru in spider scripts)
+- Committed: No
+
+---
+
+*Structure analysis: 2026-03-21*
--- a/.planning/codebase/TESTING.md
+++ b/.planning/codebase/TESTING.md
@ -0,0 +1,236 @@
+# Testing Patterns
+
+**Analysis Date:** 2026-03-21
+
+## Test Framework
+
+**Runner:**
+- `unittest` (Python standard library) — the only test framework currently in use
+- No pytest, no pytest-anyio, no httpx.AsyncClient test harness configured
+- No test-specific dependencies in `Pipfile` `[dev-packages]` (section is empty)
+
+**Assertion Library:**
+- `unittest.TestCase` assertion methods: `assertEqual`, `assertIsNone`, etc.
+
+**Run Commands:**
+```bash
+python -m unittest tests/test_company_storage.py     # Run a specific test file
+python -m unittest tests/test_company_jobs_sync.py   # Run a specific test file
+python -m unittest discover tests/                   # Discover and run all tests
+```
+
+**Coverage:**
+- No coverage tool configured (no `.coveragerc`, no `pytest-cov`, no coverage requirement enforced)
+
+## Test File Organization
+
+**Location:**
+- Separate `tests/` directory at project root: `/Users/win/2025/AICoding/JobData/tests/`
+- NOT co-located with source modules
+
+**Naming:**
+- Pattern: `test_{module_name}.py`
+- Examples: `test_company_storage.py`, `test_company_jobs_sync.py`
+
+**Structure:**
+```
+tests/
+├── test_company_jobs_sync.py   # Tests for app/services/company_jobs_sync.py
+└── test_company_storage.py     # Tests for app/services/company_storage.py
+```
+
+## Test Structure
+
+**Suite Organization:**
+```python
+import unittest
+
+from app.services.company_storage import extract_company_fields, normalize_company_id
+
+
+class CompanyStorageTests(unittest.TestCase):
+    def test_normalize_qcwy_company_id(self):
+        self.assertEqual(normalize_company_id("qcwy", "co123"), "123")
+        self.assertEqual(normalize_company_id("qcwy", "123"), "123")
+        self.assertEqual(normalize_company_id("boss", "co123"), "co123")
+
+    def test_extract_boss_fields(self):
+        payload = { ... }
+        result = extract_company_fields("boss", payload, "boss-1")
+        self.assertEqual(result["source_company_id"], "boss-1")
+
+
+if __name__ == "__main__":
+    unittest.main()
+```
+
+**Patterns:**
+- One `TestCase` class per test file
+- Test methods named `test_{what_is_being_tested}_{condition}` (e.g., `test_extract_boss_fields`, `test_normalize_qcwy_company_id`)
+- No setUp/tearDown methods — tests are stateless and self-contained
+- No fixtures or factories; inline dicts serve as test data
+- Tests verify pure functions only (no database, no HTTP, no async)
+
+## Mocking
+
+**Framework:** None currently used (no `unittest.mock`, no `pytest-mock`)
+
+**Current approach:** Tests only cover pure, synchronous functions that have no external dependencies. All existing tests call functions that:
+- Accept plain dicts as input
+- Return plain dicts or strings as output
+- Have no database, network, or filesystem side effects
+
+**What to Mock (when tests are added):**
+- `clickhouse_connect.driver.AsyncClient` for repository tests
+- `BossToken.filter(...)` ORM queries for service tests that check token loading
+- `httpx` / `requests` for crawler service tests
+- `asyncio.to_thread(...)` calls wrapping synchronous crawlers
+
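As a rough illustration of the mocking approach listed above, the sketch below replaces an injected HTTP client with `unittest.mock.Mock`. The helper `fetch_company_name` and its URL shape are hypothetical and exist only for this example; real crawler services will differ.

```python
import unittest
from unittest import mock


def fetch_company_name(http_client, company_id):
    """Hypothetical crawler helper: calls an injected HTTP client and parses JSON."""
    response = http_client.get(f"/company/{company_id}")
    if response.status_code != 200:
        return None
    return response.json().get("name")


class FetchCompanyNameTests(unittest.TestCase):
    def test_returns_name_on_200(self):
        http_client = mock.Mock()
        http_client.get.return_value = mock.Mock(
            status_code=200, json=mock.Mock(return_value={"name": "Acme"})
        )
        self.assertEqual(fetch_company_name(http_client, "c1"), "Acme")
        # The mock also records how it was called.
        http_client.get.assert_called_once_with("/company/c1")

    def test_returns_none_on_error(self):
        http_client = mock.Mock()
        http_client.get.return_value = mock.Mock(status_code=500)
        self.assertIsNone(fetch_company_name(http_client, "c1"))
```

Injecting the client keeps the test free of network access; when a module creates its own client, `mock.patch` on the import path would be the usual alternative.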
+**What NOT to Mock:**
+- Pure data transformation functions (`extract_company_fields`, `normalize_company_id`, `_pick_first`, `_clean_text`)
+- Enum definitions and schema validation logic
+- URL pattern matching logic in `CleaningService`
+
+## Fixtures and Factories
+
+**Test Data:**
+- Inline dicts constructed within each test method — no shared fixtures
+- Platform-specific payloads mirror real API response shapes:
+
+```python
+# Boss payload structure (from test_company_storage.py)
+payload = {
+    "zpData": {
+        "brandComInfoVO": {
+            "encryptBrandId": "boss-1",
+            "brandName": "Boss公司",
+            "industryName": "互联网",
+            "stageName": "B轮",
+        },
+        "companyFullInfoVO": {
+            "name": "Boss公司",
+            "typeName": "民营",
+            "cityName": "上海",
+        },
+    }
+}
+
+# Qcwy payload structure
+payload = {
+    "coinfo": {
+        "coid": "123",
+        "coname": "前程公司",
+        "cotype": "民营",
+        "cosize": "500-999人",
+    },
+    "financingStage": {"name": "未融资"},
+}
+
+# Zhilian payload structure
+payload = {
+    "data": {
+        "companyBase": {
+            "companyNumber": "zl-1",
+            "companyName": "智联公司",
+            "companyTypeName": "上市公司",
+        }
+    }
+}
+```
+
+**Location:** Inline within test methods; no shared fixture file or factory module.
+
+## Coverage
+
+**Requirements:** None enforced — no coverage target, no CI gates
+
+**Current coverage estimate:**
+- `app/services/company_storage.py` — partially covered (extract functions and normalize_company_id)
+- `app/services/company_jobs_sync.py` — partially covered (_extract_boss_jobs, _extract_qcwy_jobs, _extract_zhilian_jobs)
+- All other modules — ZERO test coverage
+
+**View Coverage (manual):**
+```bash
+pip install coverage
+coverage run -m unittest discover tests/
+coverage report
+coverage html  # generates htmlcov/index.html
+```
+
+## Test Types
+
+**Unit Tests:**
+- The only type present
+- Scope: pure synchronous functions with no external dependencies
+- Files: `tests/test_company_storage.py`, `tests/test_company_jobs_sync.py`
+
+**Integration Tests:**
+- Not present
+- Needed for: API endpoint testing, using either FastAPI's synchronous `TestClient` or `httpx.AsyncClient` driven by an async test runner (`pytest-asyncio` or `anyio`)
+- Critical gaps: `app/api/v1/job/job.py`, `app/api/v1/analytics.py`, `app/api/v1/keyword/keyword.py`
+
+**E2E Tests:**
+- Not present
+- No E2E framework configured
+
+## Critical Gaps
+
+**Untested service logic (highest priority):**
+
+1. `app/services/cleaning.py:CleaningService.clean_target_auto()` — URL platform detection logic (zhipin.com vs 51job.com vs zhaopin.com)
+2. `app/services/cleaning.py:CleaningService.process_single_item()` — multi-branch dispatch based on `clean_type` and `platform`
+3. `app/services/ingest/service.py:IngestService.store_batch()` — deduplication and batch insert flow
+4. `app/services/ingest/dedup.py` — deduplication filter logic
+5. `app/controllers/keyword.py:KeywordController.get_available()` — priority scheduling logic (partial > failed > fresh)
+6. `app/repositories/clickhouse_repo.py:JobAnalyticsRepo` — query building and result mapping
+
+**Untested URL parsing (medium priority):**
+
+7. `app/services/cleaning.py:_process_boss_url()` — regex extraction of job/company ID from Boss URLs
+8. `app/services/cleaning.py:_process_qcwy_url()` — regex extraction for qcwy URLs
+9. `app/services/cleaning.py:_process_zhilian_url()` — regex extraction for zhilian URLs
+
+**Untested API endpoints:**
+
+10. All routes in `app/api/v1/` — no integration tests exist for any HTTP endpoints
+
+## Recommended Test Setup (when adding tests)
+
+**For async tests, add to `Pipfile [dev-packages]`:**
+```toml
+[dev-packages]
+pytest = "*"
+pytest-asyncio = "*"
+anyio = {extras = ["trio"]}
+httpx = "*"  # already in packages, for AsyncClient
+```
+
+**Async test pattern (recommended):**
+```python
+import pytest
+from httpx import ASGITransport, AsyncClient
+from app import app  # FastAPI app instance
+
+@pytest.mark.asyncio
+async def test_ingest_endpoint():
+    # httpx 0.28 removed the `app=` shortcut, so route requests through ASGITransport
+    transport = ASGITransport(app=app)
+    async with AsyncClient(transport=transport, base_url="http://test") as client:
+        response = await client.post("/api/v1/ingest/data/store", json={...})
+    assert response.status_code == 200
+```
+
+**Synchronous unit test pattern (existing):**
+```python
+import unittest
+from app.services.company_storage import normalize_company_id
+
+class MyTests(unittest.TestCase):
+    def test_normalize(self):
+        self.assertEqual(normalize_company_id("qcwy", "co123"), "123")
+
+if __name__ == "__main__":
+    unittest.main()
+```
+
+---
+
+*Testing analysis: 2026-03-21*
--- a/.planning/config.json
+++ b/.planning/config.json
@ -22,7 +22,8 @@
    "node_repair_budget": 2,
    "ui_phase": true,
    "ui_safety_gate": true,
-    "text_mode": false
+    "text_mode": false,
+    "_auto_chain_active": false
  },
  "hooks": {
    "context_warnings": true
--- a/.planning/phases/03-qcwy-zhilian/03-01-SUMMARY.md
+++ b/.planning/phases/03-qcwy-zhilian/03-01-SUMMARY.md
@ -0,0 +1,27 @@
+# Plan 03-01 Summary: Migrate the 前程无忧 (job51) layer to crawler_core
+
+**Status:** Complete
+**Tasks:** 6/6
+**Commit:** 8c2c2d2
+
+## What was built
+
+Migrated the 4 files under `spiderJobs/platforms/job51/` to crawler_core:
+
+- **client.py**: HTTPClient now comes from `crawler_core.http_client`; Job51Sign now comes from `crawler_core.qcwy.sign`
+- **api.py**:
+  - Imports changed to `crawler_core.base.Result/BaseFetcher/BaseSearcher`
+  - Replaced every `ApiResult` with `Result` (11 occurrences)
+  - Changed 2 occurrences of `self._http.get(` to `self.http_client.get(` (the overridden fetch in GetJobDetail/GetCompanyInfo)
+  - Added a POST `_request()` override to the POST Searchers (SearchRecommendJobs/SearchCompanyJobs)
+- **main.py**: BaseFetcher/BaseSearcher now come from `crawler_core.base`
+- **sign.py**: backward-compatibility stub that re-exports `crawler_core.qcwy.sign.Job51Sign`
+- **tests/job51/test_job51_client.py**: 17 mock tests covering all API classes
+
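The `_request()` override mentioned above follows the template-method pattern: the base class fixes the build/request/parse sequence and a POST searcher swaps only the transport step. The sketch below uses simplified stand-ins; `FakeHTTPClient`, the `search()` entry point, and the URL are assumptions and do not reproduce the real `crawler_core.base` classes.

```python
from abc import ABC, abstractmethod


class FakeHTTPClient:
    """Records calls instead of doing network I/O (stand-in for an HTTP client)."""

    def __init__(self):
        self.calls = []

    def get(self, url, params=None):
        self.calls.append(("GET", url, params))
        return {"items": []}

    def post(self, url, json=None):
        self.calls.append(("POST", url, json))
        return {"items": []}


class BaseSearcher(ABC):
    """Simplified stand-in for a template-method base class."""

    url = ""

    def __init__(self, http_client):
        self.http_client = http_client

    def search(self, **kwargs):
        # Fixed sequence: build params, send the request, parse the result.
        params = self._build_params(**kwargs)
        raw = self._request(params)
        return self._parse(raw)

    def _build_params(self, **kwargs):
        return dict(kwargs)

    def _request(self, params):
        # Default transport: GET with query parameters.
        return self.http_client.get(self.url, params=params)

    @abstractmethod
    def _parse(self, raw):
        ...


class SearchRecommendJobs(BaseSearcher):
    url = "/hypothetical/recommend"  # placeholder, not the real endpoint

    def _request(self, params):
        # POST endpoints override only the transport step.
        return self.http_client.post(self.url, json=params)

    def _parse(self, raw):
        return raw["items"]
```

With a recording client like this, a mock test can assert that the searcher issued a POST without touching the network.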
+## Verification
+
+```
+pytest tests/job51/ -v → 17 passed ✅
+```
+
+## Self-Check: PASSED
--- a/.planning/phases/03-qcwy-zhilian/03-02-SUMMARY.md
+++ b/.planning/phases/03-qcwy-zhilian/03-02-SUMMARY.md
@ -0,0 +1,33 @@
+# Plan 03-02 Summary: Migrate the 智联招聘 (zhilian) layer to crawler_core
+
+**Status:** Complete
+**Tasks:** 6/6
+**Commit:** 8c2c2d2
+
+## What was built
+
+Migrated the 4 files under `spiderJobs/platforms/zhilian/` to crawler_core:
+
+- **client.py**: HTTPClient now comes from `crawler_core.http_client`; ZhilianSign now comes from `crawler_core.zhilian.sign`
+- **api.py**:
+  - Imports changed to `crawler_core.base.BaseFetcher/BaseSearcher/Result/parse_response`
+  - Added `_parse_zhilian_response` (treats HTTP 200 as success, to fit Zhilian responses, which carry no `statusCode`)
+  - Changed 1 occurrence of `self._http.get(` to `self.http_client.get(` (SearchCompanyPositions)
+  - Added a `_parse()` method to every Fetcher/Searcher class (an abstract method required by crawler_core)
+  - Added a POST `_request()` override to the POST Searcher (SearchPositions)
+- **main.py**: BaseFetcher/BaseSearcher now come from `crawler_core.base`
+- **sign.py**: backward-compatibility stub that re-exports `crawler_core.zhilian.sign.ZhilianSign`
+- **tests/zhilian/test_zhilian_client.py**: 17 mock tests
+
+## Key Decisions
+
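+- Added `_parse_zhilian_response` instead of using `parse_response` (the latter checks `statusCode==200`, but the Zhilian API does not return that field)
+- `SearchCompanyPositions._build_params()` calls the signer via `self._client.signer.sign_params()`, so it is unaffected by the migration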
+
+## Verification
+
+```
+pytest tests/zhilian/ -v → 17 passed ✅
+```
+
+## Self-Check: PASSED