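diff --git a/.planning/REQUIREMENTS.md b/.planning/REQUIREMENTS.md index cece65d..cf598a9 100644 --- a/.planning/REQUIREMENTS.md +++ b/.planning/REQUIREMENTS.md @@ -10,8 +10,8 @@ - [x] **ARCH-01**: Extract `crawler_core/` as a standalone installable package containing the core signing, HTTP client, and response-parsing logic - [x] **ARCH-02**: Unify the BaseFetcher/BaseSearcher base classes; all three platforms implement the template method pattern - [x] **ARCH-03**: Rewrite the Boss Zhipin crawler client on top of crawler_core -- [ ] **ARCH-04**: Rewrite the 51job crawler client on top of crawler_core -- [ ] **ARCH-05**: Rewrite the Zhilian Zhaopin crawler client on top of crawler_core +- [x] **ARCH-04**: Rewrite the 51job crawler client on top of crawler_core +- [x] **ARCH-05**: Rewrite the Zhilian Zhaopin crawler client on top of crawler_core - [ ] **ARCH-06**: The backend app/services/crawler/ facade layer uses asyncio.to_thread() to bridge the synchronous core - [ ] **ARCH-07**: External scripts in spiderJobs/ use crawler_core instead of inline code - [ ] **ARCH-08**: Retire the legacy jobs_spider/ framework and migrate to spiderJobs/ + crawler_core @@ -58,8 +58,8 @@ | ARCH-01 | TBD | Complete | | ARCH-02 | TBD | Complete | | ARCH-03 | TBD | Complete | -| ARCH-04 | TBD | Pending | -| ARCH-05 | TBD | Pending | +| ARCH-04 | TBD | Complete | +| ARCH-05 | TBD | Complete | | ARCH-06 | TBD | Pending | | ARCH-07 | TBD | Pending | | ARCH-08 | TBD | Pending | diff --git a/.planning/ROADMAP.md b/.planning/ROADMAP.md index 20c6410..b6005d3 100644 --- a/.planning/ROADMAP.md +++ b/.planning/ROADMAP.md @@ -11,7 +11,7 @@ - [ ] **Phase 1: Shared core package** - Extract the installable shared package crawler_core/; unify base classes and infrastructure - [x] **Phase 2: Boss Zhipin rewrite** - Rewrite the Boss Zhipin crawler client on top of crawler_core (completed 2026-03-21) -- [ ] **Phase 3: 51job & Zhilian rewrite** - Rewrite the 51job and Zhilian Zhaopin crawler clients on top of crawler_core +- [x] **Phase 3: 51job & Zhilian rewrite** - Rewrite the 51job and Zhilian Zhaopin crawler clients on top of crawler_core (completed 2026-03-21) - [ ] **Phase 4: Backend & external script integration** - Backend facade bridging + external script migration + legacy framework retirement - [ ] **Phase 5: Data pipeline optimization** - Ingest deduplication, company-cleaning workflow optimization, writing company job postings - [ ] **Phase 6: Quality & frontend** - Test coverage for data parsing; frontend monitoring and data-cleaning page improvements @@ -55,7 +55,7 @@ Plans: 2. The Zhilian Zhaopin crawler inherits from BaseFetcher/BaseSearcher and contains no inline signing or HTTP boilerplate code 3. Each of the two platforms runs one real keyword crawl and job data is returned successfully (manual verification) 4. 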
The new clients for all three platforms share a consistent code structure, so a new platform can be implemented quickly from the template -**Plans:** TBD +**Plans:** 2/2 plans complete ### Phase 4: Backend & external script integration **Goal:** Both the backend facade and the external scripts use crawler_core; the legacy framework jobs_spider/ has served its purpose and is taken offline @@ -98,7 +98,7 @@ Plans: |-------|----------------|--------|-----------| | 1. Shared core package | 2/2 | Complete | 2026-03-21 | | 2. Boss Zhipin rewrite | 2/2 | Complete | 2026-03-21 | -| 3. 51job & Zhilian rewrite | 0/? | Not started | - | +| 3. 51job & Zhilian rewrite | 2/2 | Complete | 2026-03-21 | | 4. Backend & external script integration | 0/? | Not started | - | | 5. Data pipeline optimization | 0/? | Not started | - | | 6. Quality & frontend | 0/? | Not started | - | diff --git a/.planning/STATE.md b/.planning/STATE.md index 764ad4f..6726c8c 100644 --- a/.planning/STATE.md +++ b/.planning/STATE.md @@ -4,12 +4,12 @@ milestone: v1.0 milestone_name: milestone status: unknown stopped_at: Completed 01-shared-core Plan 02 (sign algorithms + unit tests) -last_updated: "2026-03-21T11:04:42.115Z" +last_updated: "2026-03-21T11:19:04.395Z" progress: total_phases: 6 - completed_phases: 2 - total_plans: 4 - completed_plans: 4 + completed_phases: 3 + total_plans: 6 + completed_plans: 6 --- # STATE: JobData crawler interaction refactor @@ -24,13 +24,13 @@ progress: **Core value:** Keyword-driven crawlers fetch job data, store it reliably in ClickHouse, and complete company information collection and sync on a schedule -**Current focus:** Phase 2 - Boss Zhipin rewrite +**Current focus:** Phase 3 - 51job & Zhilian rewrite --- ## Current Position -Phase: 3 +Phase: 4 Plan: Not started ## Performance Metrics diff --git a/.planning/codebase/ARCHITECTURE.md b/.planning/codebase/ARCHITECTURE.md new file mode 100644 index 0000000..51cc33d --- /dev/null +++ b/.planning/codebase/ARCHITECTURE.md @@ -0,0 +1,201 @@ +# Architecture + +**Analysis Date:** 2026-03-21 + +## Pattern Overview + +**Overall:** Layered monolith with embedded pipeline architecture + +**Key Characteristics:** +- FastAPI async backend with dual-database persistence (MySQL for RBAC/metadata, ClickHouse for job/company analytics data) +- External crawlers (`jobs_spider/`, `spiderJobs/`) run as independent processes and push data to the backend API via HTTP POST, with no shared code path at 
runtime +- RBAC permission model enforced at router-level via FastAPI `Depends` +- APScheduler drives all recurring automation (cleaning, ECS orchestration, alerting) within the main process +- Registry pattern (`app/services/ingest/registry.py`) maps (platform, channel, data_type) tuples to ClickHouse table configs + dedup rules + +## Layers + +**Router Layer:** +- Purpose: Parse HTTP request, validate via Pydantic schema, delegate to service or controller, return JSON envelope +- Location: `app/api/v1/` +- Contains: One sub-package per domain (e.g., `job/`, `cleaning/`, `analytics/`, `keyword/`) +- Depends on: Services, Controllers, `app/core/dependency.py` for auth +- Used by: FastAPI application via `app/api/v1/__init__.py` → `app/__init__.py` + +**Service Layer:** +- Purpose: Business logic, orchestration across repos/models, side-effect coordination +- Location: `app/services/` +- Contains: `IngestService`, `AnalyticsService`, `CleaningService`, `CompanyCleaner`, `CompanyJobsSyncService`, platform crawler service wrappers (`crawler/boss.py`, `crawler/qcwy.py`, `crawler/zhilian.py`) +- Depends on: Repositories, ORM models, `clickhouse_manager`, external HTTP clients +- Used by: Routers, APScheduler jobs + +**Controller Layer (RBAC domain operations):** +- Purpose: Thin CRUD orchestration for MySQL-backed entities (User, Role, API, Menu, Dept, etc.) 
+- Location: `app/controllers/` +- Contains: `user_controller`, `api_controller`, `role_controller`, `menu_controller`, `dept_controller`, `keyword_controller`, `proxy_controller`, `token_controller`, `cleaning_controller`, `company_controller` +- Depends on: Tortoise-ORM models +- Used by: Routers + +**Repository Layer:** +- Purpose: Encapsulate raw ClickHouse query execution behind typed methods +- Location: `app/repositories/` +- Contains: `ClickHouseBaseRepo` (generic query/insert), `JobAnalyticsRepo` (trend, distribution, count queries against `job_analytics` view) +- Depends on: `clickhouse_connect.driver.AsyncClient` +- Used by: `AnalyticsService` + +**Model Layer:** +- Purpose: ORM schema definitions for MySQL, migrated via Aerich +- Location: `app/models/` +- Contains: `admin.py` (User, Role, Api, Menu, Dept, AuditLog), `token.py` (BossToken), `cleaning.py` (CleaningTask), `metrics.py` (ScheduledTaskRun, StatsTotal, IpUploadStats), `company.py` (CompanyCleaningQueue), `keyword.py` (BossKeyword, QcwyKeyword, ZhilianKeyword) +- Depends on: `tortoise-orm` +- Used by: Controllers, Services, Scheduler jobs + +**Core Infrastructure Layer:** +- Purpose: App wiring, connection management, middleware, cross-cutting utilities +- Location: `app/core/` +- Contains: `init_app.py` (lifespan bootstrap), `scheduler.py` (APScheduler jobs), `clickhouse.py` (ClickHouseManager singleton), `clickhouse_init.py` (ClickHouse DDL), `dependency.py` (AuthControl, PermissionControl), `locks.py` (DistributedLock), `middlewares.py` (BackGroundTaskMiddleware, HttpAuditLogMiddleware, IpTrackingMiddleware), `exceptions.py`, `crud.py`, `ip_tracking.py` +- Depends on: Everything below +- Used by: `app/__init__.py`, all layers above + +**Ingest Sub-system:** +- Purpose: Platform-agnostic data ingestion pipeline with registry-driven routing, dedup, and remote push +- Location: `app/services/ingest/` +- Contains: `service.py` (IngestService), `registry.py` (PlatformConfig registry), `dedup.py` 
(batch dedup filter), `remote_push.py` (async HTTP push to secondary target), `configs/` (per-platform PlatformConfig registrations) +- Depends on: `clickhouse_connect`, registry configs +- Used by: `app/api/v1/job/job.py` router + +**Spider Crawler Service Wrappers (in-process):** +- Purpose: Wrap platform HTTP clients so backend scheduler can call crawlers directly (company cleaning, company search) +- Location: `app/services/crawler/` +- Contains: `_base.py` (BaseFetcher, BaseSearcher, ApiResult), `_http_client.py`, `_boss_client.py`, `_boss_api.py`, `_boss_sign.py`, `_zhilian_client.py`, `_zhilian_api.py`, `_zhilian_sign.py`, `_job51_client.py`, `_job51_api.py`, `_job51_sign.py`, `boss.py`, `qcwy.py`, `zhilian.py` +- Depends on: `requests` (sync), platform sign/auth modules +- Used by: `CompanyCleaner`, `CompanyController` + +**External Spider Processes:** +- Purpose: Standalone crawlers running on ECS nodes; push data to backend via `POST /api/v1/universal/data/batch-store-async` +- Location: `jobs_spider/boss/boos_api.py`, `jobs_spider/qcwy/qcwy.py`, `jobs_spider/zhilian/zhilian_single.py`, `spiderJobs/` +- Depends on: `requests`, `loguru`, `PyExecJS`; no shared Python imports with `app/` +- Used by: Scheduled via `ecs_full_pipeline.py` or manual launch + +## Data Flow + +**Crawler → Storage Flow:** + +1. External spider (`jobs_spider/boss/boos_api.py`) fetches keyword from `GET /api/v1/keyword/available` +2. Spider calls Boss/QCWY/Zhilian platform API with rotating proxy/token +3. Spider POSTs raw JSON batch to `POST /api/v1/universal/data/batch-store-async` +4. Router (`app/api/v1/job/job.py`) delegates to `IngestService.store_batch()` +5. `IngestService` looks up `PlatformConfig` from registry (`app/services/ingest/registry.py`) +6. Dedup filter queries ClickHouse for existing records (`app/services/ingest/dedup.py`) +7. New rows inserted to ClickHouse table (e.g., `job_data.boss_job`) via `clickhouse_connect` +8. 
Optional async push to secondary remote endpoint via `remote_push.py` + +**Analytics Query Flow:** + +1. Frontend calls `GET /api/v1/analytics/volume-trend?interval=day&from_dt=...` +2. Router (`app/api/v1/analytics.py`) creates `AnalyticsService` with `clickhouse_manager` client +3. `AnalyticsService` delegates to `JobAnalyticsRepo.get_volume_trend()` +4. `JobAnalyticsRepo` queries the `job_data.job_analytics` VIEW (UNION ALL across three platform job tables) +5. Returns time-bucketed counts grouped by source + +**RBAC Authentication Flow:** + +1. Client sends `POST /api/v1/base/access_token` with credentials +2. `AuthControl.is_authed()` verifies JWT (HS256, 7-day expiry) from `token` header +3. `PermissionControl.has_permission()` loads user's roles → role's API list +4. Compares `(method, path)` tuple against allowed API set +5. Superusers bypass permission check + +**Company Cleaning Flow (Scheduled):** + +1. `company_cleaning_job()` in `app/core/scheduler.py` fires every 5 minutes +2. `CompanyCleaner.collect_pending_companies(limit=50)` queries ClickHouse job tables for company IDs not yet in `pending_company` table +3. `CompanyCleaner.process_pending_companies(limit=30)` fetches company details from platform crawlers (BossService, QcwyService, ZhilianService) +4. Cleaned company data stored via `company_storage` to ClickHouse `boss_company` / `qcwy_company` / `zhilian_company` +5. 
Run protected by `DistributedLock` (file-based fallback, optional Redis) + +**State Management:** +- MySQL (via Tortoise-ORM): User sessions, RBAC entities, keyword crawl state, tokens, task run records +- ClickHouse: All raw job/company JSON data + analytics views; no ORM — raw SQL via `clickhouse_connect` +- Context variable `CTX_USER_ID` (Python `contextvars.ContextVar`) carries authenticated user ID within request scope (`app/core/ctx.py`) +- File-based startup lock (`.startup_lock/` dir) prevents multi-worker seed data collisions + +## Key Abstractions + +**PlatformConfig (Ingest Registry):** +- Purpose: Maps `(platform, channel, data_type)` to target ClickHouse table and dedup field extractors +- Examples: `app/services/ingest/registry.py`, `app/services/ingest/configs/` +- Pattern: Immutable `@dataclass(frozen=True)`, registered at module import time via `register()` + +**DistributedLock:** +- Purpose: Prevents concurrent execution of scheduled jobs across multiple Uvicorn workers +- Examples: `app/core/locks.py` +- Pattern: Redis SET NX with TTL; falls back to filesystem directory creation with TTL metadata + +**ClickHouseManager:** +- Purpose: Singleton lazy-init connection wrapper around `clickhouse_connect.AsyncClient` +- Examples: `app/core/clickhouse.py` +- Pattern: Global module-level instance `clickhouse_manager`, injected into services at call time via `await clickhouse_manager.get_client()` + +**BaseFetcher / BaseSearcher:** +- Purpose: Abstract base for platform-specific API clients within the in-process crawler wrappers +- Examples: `app/services/crawler/_base.py` +- Pattern: Template method — subclasses implement `_build_params()`, base handles HTTP invocation and `ApiResult` parsing + +**BaseModel (Tortoise):** +- Purpose: Common ORM base with `BigIntField pk`, `to_dict()` async serializer with M2M support +- Examples: `app/models/base.py` +- Pattern: Mixed-in with `TimestampMixin` (created_at, updated_at auto fields) + +## Entry Points + 
+**Application Entry:** +- Location: `run.py` +- Triggers: `python run.py` — reads `APP_HOST`, `APP_PORT`, `UVICORN_WORKERS` env vars; calls `uvicorn.run("app:app")` +- Responsibilities: Configure uvicorn logging format, start ASGI server + +**FastAPI App Factory:** +- Location: `app/__init__.py` → `create_app()` +- Triggers: Uvicorn import of `app:app` +- Responsibilities: Register middlewares, exception handlers, routers, static files; define lifespan + +**Lifespan Bootstrap:** +- Location: `app/__init__.py` (lifespan context manager) + `app/core/init_app.py` (`init_data()`) +- Triggers: FastAPI startup event +- Responsibilities: Tortoise-ORM init + schema generation → Aerich migrations → seed data (superuser, menus, APIs, roles) → ClickHouse table init → APScheduler start + +**Router Aggregation:** +- Location: `app/api/v1/__init__.py` +- Triggers: Imported by `app/api/__init__.py` → `app/__init__.py` +- Responsibilities: Mount all domain routers under `/api/v1/` with appropriate `DependPermission` guards + +**ECS Pipeline:** +- Location: `ecs_full_pipeline.py` +- Triggers: Manual execution or scheduled via `scheduler.py` (`ecs_full_pipeline_job`) every 6 hours +- Responsibilities: Create Alibaba Cloud ECS spot instances, push spider scripts, execute crawlers, terminate instances + +## Error Handling + +**Strategy:** Exception handler registration at FastAPI app level; structured error propagation up through layers + +**Patterns:** +- Tortoise `DoesNotExist` → 404 via `DoesNotExistHandle` (registered in `app/core/init_app.py`) +- Pydantic `RequestValidationError` → 422 with field-level detail via `RequestValidationHandle` +- `IntegrityError` (DB constraint) → 400 via `IntegrityHandle` +- Custom `HTTPException` (from `app/core/exceptions.py`) → structured JSON error via `HttpExcHandle` +- Service layer exceptions logged with `loguru` then re-raised or returned as `{"success": False, "error": "..."}` +- Scheduled jobs catch all exceptions, record failure via 
`_record_task_run()` to `ScheduledTaskRun` table, then continue + +## Cross-Cutting Concerns + +**Logging:** `loguru` — imported from `app/log/__init__.py`; structured `logger.info/error/warning` calls throughout; external spider logs to rotating daily files in `jobs_spider/*/logs/` + +**Validation:** Pydantic v2 schemas in `app/schemas/` validate all inbound HTTP request bodies; ClickHouse inserts use registry-defined column extractors for type safety + +**Authentication:** JWT HS256 tokens issued by `/api/v1/base/access_token`; verified per-request by `AuthControl.is_authed()` injected via `DependAuth` / `DependPermission`; dev bypass (`token: dev`) available when `APP_ENV=development` + +**Audit Logging:** `HttpAuditLogMiddleware` records all non-excluded mutating requests (method, path, user, request args, response body, timing) to MySQL `auditlog` table + +--- + +*Architecture analysis: 2026-03-21* diff --git a/.planning/codebase/CONCERNS.md b/.planning/codebase/CONCERNS.md new file mode 100644 index 0000000..e7ab745 --- /dev/null +++ b/.planning/codebase/CONCERNS.md @@ -0,0 +1,255 @@ +# Codebase Concerns + +**Analysis Date:** 2026-03-21 + +--- + +## Security Considerations + +### CRITICAL: Hardcoded Credentials in Source Code + +**Area:** Configuration +- Risk: Real database passwords, SMTP credentials, and IP addresses committed to version control. +- Files: `app/settings/config.py` +- Exposed values (lines 27–52): + - MySQL root password (`jobdata123`) with public IP `121.4.126.241:3306` + - ClickHouse user/pass defaults (`data_user` / `data_pass`) with same public IP + - QQ SMTP app password (`jkncedjzkummbbdi`) and recipient email addresses +- Current mitigation: `pydantic-settings` allows environment variable override, but defaults are live credentials that will be exposed if the repo is shared or CI logs are leaked. 
+- Recommendations: Remove all real default values; replace with `None` or clearly fake placeholders; enforce required-field validation at startup; rotate exposed credentials immediately. + +### CRITICAL: Unauthenticated Data Ingest Endpoints + +**Area:** API Authentication +- Risk: Any actor with network access can push arbitrary data directly into ClickHouse. +- Files: `app/api/v1/__init__.py` (lines 32–34), `app/api/v1/job/job.py` +- The `/ingest`, `/job`, and `/universal` router groups are mounted **without** any `DependPermission` or `DependAuth` dependency. The data store, batch-store, and batch-store-async endpoints accept unauthenticated POST requests. +- Trigger: Send a POST to `/api/v1/universal/data/batch-store-async` with any payload. No token required. +- Workaround: The CLAUDE.md acknowledges these as "internal call" endpoints, but they are publicly reachable with no network-level isolation noted. +- Recommendations: Add at minimum IP allowlist middleware or a static shared secret header check for ingest routes; or mount them behind `DependAuth`. + +### HIGH: Dev Token Bypass in Production Auth + +**Area:** Authentication +- Risk: If `APP_ENV=development`, any request with `token: dev` header is granted the first user's full privileges. +- Files: `app/core/dependency.py` (lines 28–30) +- The check `os.getenv("APP_ENV", "production") == "development"` defaults to `"production"`, but there is no enforcement that this variable is set correctly in production deployments. A misconfigured environment would expose the bypass. +- Recommendations: Remove the dev bypass entirely; use test fixtures instead; add a startup assertion that the bypass is disabled in non-dev environments. + +### HIGH: SQL Injection via f-string Queries + +**Area:** ClickHouse Queries +- Risk: User-controlled values inserted directly into query strings allow injection. 
+- Files: + - `reclean_qcwy_jobs.py` (line 34): `f"SELECT json_data FROM job_data.qcwy_job WHERE job_id = '{job_id}' LIMIT 1"` — `job_id` comes from URL parsing of user-provided input. + - `app/repositories/clickhouse_repo.py` (lines 101–108): `group_by_column` parameter is interpolated directly into the SELECT and GROUP BY clauses without validation. + - `app/services/ingest/service.py` (lines 109, 112): `table` value is derived from config (lower risk) but still concatenated into f-strings. + - `app/core/scheduler.py` (lines 57, 63): `table` names from a fixed list, but pattern is risky. +- Current mitigation: ClickHouse parameterized queries are used in `dedup.py`, but inconsistently applied elsewhere. +- Recommendations: Use parameterized queries (`parameters={}`) for all user-influenced values; validate `group_by_column` against an allowlist; never interpolate user input into SQL strings. + +### MEDIUM: CORS Misconfiguration + +**Area:** CORS Policy +- Risk: `CORS_ALLOW_CREDENTIALS = False` while `CORS_ALLOW_HEADERS = ["*"]` is set. If credentials are later enabled, wildcard headers become a security liability. +- Files: `app/settings/config.py` (lines 13–16) +- Recommendations: Restrict `CORS_ORIGINS` to known production domains; avoid wildcard headers. + +### MEDIUM: Hardcoded Public IP in Spider Default + +**Area:** Spider Configuration +- Risk: Hardcoded IP `124.222.106.226:9999` as the default `API_BASE_URL` in all spider scripts and `spiderJobs/` runner. If that IP changes hands or is exposed, all crawled data goes to an unintended recipient. +- Files: + - `jobs_spider/boss/boos_api.py` (line 17) + - `jobs_spider/qcwy/qcwy.py` (line 28) + - `jobs_spider/zhilian/zhilian_single.py` (line 41) + - `spiderJobs/runner/api_client.py` (line 31) +- Recommendations: Remove hardcoded IP defaults; require `API_BASE_URL` to be explicitly set via environment variable with no default. 
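The SQL injection recommendations above (an allowlist for `group_by_column`, bound parameters for values) could be sketched roughly as follows. This is a minimal illustration, not the repository's code: the function name `build_group_count_query` and the exact column set are assumptions, and the `{name:Type}` placeholder assumes the ClickHouse server-side parameter syntax that `clickhouse_connect` accepts through its `parameters=` argument.

```python
from typing import Any

# Hypothetical allowlist; the real column set must come from the
# `job_analytics` view definition, which this document does not list in full.
ALLOWED_GROUP_COLUMNS = frozenset({"source", "city", "channel"})


def build_group_count_query(group_by_column: str, source: str) -> tuple[str, dict[str, Any]]:
    """Return (sql, parameters) with the identifier validated and the value bound."""
    if group_by_column not in ALLOWED_GROUP_COLUMNS:
        raise ValueError(f"unsupported group_by_column: {group_by_column!r}")
    # Identifiers cannot be bound as parameters, so only allowlisted names
    # are interpolated; the user-supplied value stays a placeholder.
    sql = (
        f"SELECT {group_by_column}, count() AS cnt "
        "FROM job_data.job_analytics "
        "WHERE source = {source:String} "
        f"GROUP BY {group_by_column}"
    )
    return sql, {"source": source}
```

The returned pair would then be handed to the client as the query text and its `parameters` dict, so the value never becomes part of the SQL string.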
+ +--- + +## Tech Debt + +### Duplicate Spider Implementations + +**Area:** Code Organization +- Issue: Two parallel spider hierarchies exist for the same three platforms. + - `jobs_spider/boss/`, `jobs_spider/qcwy/`, `jobs_spider/zhilian/` — original scripts + - `spiderJobs/platforms/boss/`, `spiderJobs/platforms/job51/`, `spiderJobs/platforms/zhilian/` — refactored version + - `app/services/crawler/` — yet another crawler abstraction inside the app +- Files: `jobs_spider/`, `spiderJobs/`, `app/services/crawler/` +- Impact: Bug fixes and anti-detection improvements must be applied in multiple places; the relationship between these directories is not documented. +- Fix approach: Decide on one canonical implementation; deprecate and delete the others; update CLAUDE.md to clarify which is the active spider. + +### God File: `jobs_spider/boss/boos_api.py` + +**Area:** File Size / Complexity +- Issue: Single file containing 2281 lines covering API client, IP strategy, anomaly detection, session management, main loop, and push logic. +- Files: `jobs_spider/boss/boos_api.py` +- Impact: Extremely difficult to test or modify any single responsibility; cognitive load is very high. +- Fix approach: Extract `SmartIPManager`, `IPAnomalyDetector`, `IPStrategyConfig` into separate modules; separate the main crawl loop from the API client class. + +### `CleaningService` and `CompanyCleaner` Code Duplication + +**Area:** Service Layer +- Issue: Both classes replicate identical token-loading logic (`_ensure_boss_token_loaded`), proxy-setting logic (`_apply_proxy`), and service instantiation (BossService, QcwyService, ZhilianService). +- Files: + - `app/services/cleaning.py` (lines 37–48, 31–34) + - `app/services/company_cleaner.py` (lines 60–74, 54–58) +- Impact: Token refresh logic is maintained in two places; a bug fix in one will not propagate to the other. 
+- Fix approach: Extract a shared `CrawlerContext` or `CrawlerMixin` class providing token management and proxy configuration. + +### ClickHouse Schema Changes Require Manual DDL + +**Area:** Database Migrations +- Issue: ClickHouse table definitions in `app/core/clickhouse_init.py` use `CREATE TABLE IF NOT EXISTS`, meaning schema changes (new columns, index changes) are never automatically applied to existing tables. There is no migration tooling equivalent to Aerich for ClickHouse. +- Files: `app/core/clickhouse_init.py` +- Impact: If a column is added to the DDL, existing production tables silently retain the old schema; queries referencing new columns will fail at runtime. +- Fix approach: Implement `ALTER TABLE ... ADD COLUMN IF NOT EXISTS` statements for schema evolution; or adopt a ClickHouse migration tool (e.g., `ch-migrations`). + +### Startup File Lock Not Robust to Crashes + +**Area:** Multi-Worker Initialization +- Issue: `init_data()` uses `os.mkdir(".startup_lock")` as a bare file lock with no TTL. If a worker crashes mid-initialization, the lock directory is never cleaned up, and subsequent restarts will skip all migrations and seed data. +- Files: `app/core/init_app.py` (lines 322–349) +- Impact: Silent data-initialization failure on restart after crash; can result in missing admin user or broken menu state. +- Fix approach: Replace with the existing `DistributedLock` class from `app/core/locks.py` which has TTL support; or at minimum add a time-based stale-lock cleanup. + +--- + +## Performance Bottlenecks + +### ClickHouse Singleton Connection Per-Worker + +**Area:** Database Connections +- Problem: `clickhouse_manager` in `app/core/clickhouse.py` is a module-level singleton that lazily creates one connection. With 20 uvicorn workers (`UVICORN_WORKERS=20`), each worker maintains its own connection, and the connection pool inside each is limited to `maxsize=5`. Under concurrent load, requests block waiting for the pool. 
+- Files: `app/core/clickhouse.py` (lines 31–58) +- Cause: A new `PoolManager` is created on every `get_clickhouse_client()` call, but the singleton pattern means `get_client()` only calls this once per worker, so pool configuration cannot be tuned per endpoint. +- Improvement path: Use a connection pool library or connection factory pattern; expose pool size as a configurable setting; consider `clickhouse-pool` for proper async pooling. + +### Dedup Query Scans Entire Table + +**Area:** ClickHouse Deduplication +- Problem: `batch_dedup_filter` in `app/services/ingest/dedup.py` issues `SELECT key_col FROM table WHERE key_col IN (...)` with no time-bound filter. For large tables this is a full-table scan on each ingest batch. +- Files: `app/services/ingest/dedup.py` (lines 51, 65) +- Cause: No `PREWHERE created_at > now() - INTERVAL N DAY` constraint added. +- Improvement path: Add a configurable recency window to dedup queries (e.g., 30-day lookback matching the existing `days_back` logic in `company_cleaner.py`). + +### Synchronous Crawler Calls Blocking Async Workers + +**Area:** Concurrency +- Problem: All crawler service methods (BossService, QcwyService, ZhilianService) use synchronous `requests` internally, wrapped with `asyncio.to_thread()`. Under high concurrency, this can exhaust the thread pool. +- Files: `app/services/cleaning.py` (lines 186, 190, 194, etc.), `app/services/company_cleaner.py` (lines 318, 321, 325) +- Cause: The underlying crawler libraries use `requests` (sync), not `httpx` (async). +- Improvement path: Migrate crawlers to `httpx.AsyncClient` for true async I/O, or cap the number of concurrent `asyncio.to_thread()` calls with an `asyncio.Semaphore`. 
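One way to bound the thread-pool pressure described in the last item is to guard each `asyncio.to_thread()` call with a semaphore. The sketch below is only an illustration under assumptions: `blocking_fetch` stands in for a synchronous `requests`-based crawler call, and `MAX_CONCURRENT_CRAWLS` is a hypothetical setting, not a value from the codebase.

```python
import asyncio
import time

# Hypothetical cap; tune it to the worker's thread-pool size.
MAX_CONCURRENT_CRAWLS = 4


def blocking_fetch(company_id: str) -> dict:
    """Stand-in for a synchronous `requests`-based crawler call."""
    time.sleep(0.01)
    return {"company_id": company_id, "ok": True}


async def fetch_limited(semaphore: asyncio.Semaphore, company_id: str) -> dict:
    # At most MAX_CONCURRENT_CRAWLS calls occupy worker threads at once;
    # the rest wait on the semaphore without blocking the event loop.
    async with semaphore:
        return await asyncio.to_thread(blocking_fetch, company_id)


async def fetch_all(company_ids: list[str]) -> list[dict]:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_CRAWLS)
    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(fetch_limited(semaphore, cid) for cid in company_ids))
```

This keeps the synchronous crawlers unchanged; migrating to `httpx.AsyncClient` would remove the need for threads altogether.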
+ +--- + +## Fragile Areas + +### `group_by_column` API Parameter Lacks Allowlist Validation + +**Area:** Analytics API +- Files: `app/repositories/clickhouse_repo.py` (lines 86–115), `app/api/v1/analytics.py` +- Why fragile: `group_by_column` is passed directly into the ClickHouse query without validation against allowed column names. An invalid column name causes a runtime ClickHouse error that propagates as a 500 to the client; a malicious value enables injection. +- Safe modification: Add an `ALLOWED_GROUP_COLUMNS = frozenset({"source", "city", "channel", ...})` check before query construction. +- Test coverage: No tests for analytics query building. + +### `DistributedLock` File Lock Has Race Window + +**Area:** Distributed Locking +- Files: `app/core/locks.py` (lines 64–92) +- Why fragile: The stale-lock cleanup path (lines 81–87) involves `shutil.rmtree` followed by `mkdir`, a TOCTOU race where two workers can both detect staleness, both delete the directory, and both attempt to re-create it. One wins, one raises `FileExistsError` and falls through to return `False`, but the winner's lock is not guaranteed to be the intended holder. +- Safe modification: Use `os.rename` from a temp directory as an atomic lock-acquire primitive. +- Test coverage: None. + +### `CleaningService` Returns Mixed Types (`bool | Dict`) + +**Area:** Type Consistency +- Files: `app/services/cleaning.py` (lines 158–165, 167–202, 204–239, 241–268) +- Why fragile: Multiple `clean_*` methods return `Union[bool, Dict[str, Any]]`. Callers must defensively check `isinstance(result, bool)` before accessing dict keys (see `process_single_item` at lines 128–135). This makes the API fragile: any new call site that forgets the `bool` check will raise `TypeError` (when subscripting) or `AttributeError` (when calling `.get()`). +- Safe modification: Define a `CleanResult` dataclass or TypedDict; always return it. +- Test coverage: None. 
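The `CleanResult` suggestion above might look something like the following. The field names (`success`, `data`, `error`) and the `clean_item` function are hypothetical, chosen only to show a single return shape replacing `Union[bool, Dict[str, Any]]`; the real `clean_*` methods are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class CleanResult:
    """Hypothetical uniform return type for the `clean_*` methods."""

    success: bool
    data: dict[str, Any] = field(default_factory=dict)
    error: Optional[str] = None


def clean_item(raw: dict[str, Any]) -> CleanResult:
    """Illustrative stand-in: always returns a CleanResult, never a bare bool."""
    if "company_id" not in raw:
        return CleanResult(success=False, error="missing company_id")
    return CleanResult(success=True, data={"company_id": str(raw["company_id"])})
```

With one return type, callers can branch on `result.success` and read `result.data` without an `isinstance` check.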
+ +### `company_cleaning_job` Assumes 5-Minute Completion + +**Area:** Scheduler +- Files: `app/core/scheduler.py` (lines 171–202) +- Why fragile: The job comment explicitly notes it processes 50 collect + 30 process operations and "should complete in 2–3 minutes" to fit within the 5-minute schedule window. If the external crawlers slow down (rate-limited, network issues), the job overruns its window, causing the next scheduled run to be blocked by the distributed lock, leading to queue backlog. +- Safe modification: Add timeout guards around `collect_pending_companies` and `process_pending_companies`; or dynamically reduce batch sizes based on elapsed time. + +--- + +## Missing Critical Features + +### No Rate Limiting on Any API Endpoint + +**Problem:** FastAPI has no rate-limiting middleware configured. The data ingest endpoints (which have no auth) and the cleaning trigger endpoints can be called at arbitrary rates. +- Blocks: Protection against accidental or malicious flooding of ClickHouse with writes. +- Recommendation: Add `slowapi` or custom middleware with per-IP rate limiting on ingest and crawl-triggering endpoints. + +### No Input Size Limit on Batch Ingest + +**Problem:** `IngestBatchRequest.data_list` has no `max_items` constraint. A single request with tens of thousands of job records will be processed synchronously (or queued as a background task) with no guard on memory consumption. +- Files: `app/schemas/ingest.py`, `app/api/v1/job/job.py` +- Recommendation: Add `Field(max_length=1000)` or equivalent to `data_list`; add body size limit in uvicorn/nginx. 
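For the rate-limiting gap above, the audit suggests `slowapi` or custom middleware. The sketch below shows only the per-IP counting logic that a custom middleware could call; it is a plain sliding-window limiter written for illustration, the class name `SlidingWindowLimiter` is an assumption, and it keeps state in process memory, so each Uvicorn worker would count separately.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_seconds` for each client key."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client key -> timestamps of accepted requests

    def allow(self, client_ip, now=None):
        now = time.monotonic() if now is None else now
        hits = self._hits[client_ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

A middleware would call `allow()` with the request's client address and return HTTP 429 when it yields `False`; a shared store would be needed for limits that span workers.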
+ +--- + +## Test Coverage Gaps + +### Service Layer: Virtually Untested + +- What's not tested: `CleaningService`, `CompanyCleaner`, `IngestService`, `AnalyticsService`, `DistributedLock`, `ClickHouseManager` +- Files: `app/services/cleaning.py`, `app/services/company_cleaner.py`, `app/services/ingest/service.py`, `app/services/analytics_service.py`, `app/core/locks.py`, `app/core/clickhouse.py` +- Risk: Core data routing, deduplication, and cleaning logic can regress silently. The existing tests only cover two pure-function helpers (`extract_company_fields`, `normalize_company_id`, `_extract_*_jobs`). +- Priority: **High** + +### API Layer: No Integration Tests + +- What's not tested: All FastAPI route handlers, auth flow, permission enforcement, error responses. +- Files: `app/api/v1/` (all modules) +- Risk: Auth bypass regressions, schema changes breaking clients, and incorrect HTTP status codes go undetected. +- Priority: **High** + +### Spider Logic: No Unit Tests + +- What's not tested: `IPAnomalyDetector.detect()`, `SmartIPManager` state transitions, response parsing in `BossZhipinAPI`, `QcwyApi`, `ZhilianApi`. +- Files: `jobs_spider/boss/boos_api.py`, `jobs_spider/qcwy/qcwy.py`, `jobs_spider/zhilian/zhilian_single.py` +- Risk: Anti-detection logic breaks silently when platforms change their response format. +- Priority: **Medium** + +### ClickHouse Schema: No Schema Evolution Tests + +- What's not tested: Behavior of `clickhouse_init.py` when tables already exist with old schemas; column migration paths. +- Files: `app/core/clickhouse_init.py` +- Risk: Production schema drift goes unnoticed until a query fails at runtime. +- Priority: **Medium** + +--- + +## Known Issues / Observed Bugs + +### Silent Error Swallowing in Spider Code + +- Symptoms: Data silently not collected; no error surfaced to caller. 
+- Files: `jobs_spider/qcwy/qcwy.py` (lines 178–179, 191–192, 1047–1048, 1146–1147), `jobs_spider/qcwy/search_company_jobs.py` (lines 240–241, 512–513, 538–539, 601–602), `jobs_spider/boss/boos_api.py` (lines 400, 407–408, 508–509, 518–519, 1779–1780), `jobs_spider/zhilian/company_spider.py` (many `except: pass` blocks) +- Trigger: Network errors, parsing errors, or unexpected response shapes in spider code are caught by bare `except Exception: pass` blocks. +- Workaround: Check logs for INFO-level skips; monitor push success counts. + +### `_send_email` Uses Blocking SMTP in Async Context + +- Symptoms: Scheduler event loop blocked for up to several seconds during email send. +- Files: `app/core/scheduler.py` (lines 308–336) +- Trigger: Any `stats_job` or `ip_alert_job` execution that sends email calls synchronous `smtplib.SMTP` from the async event loop without wrapping in `asyncio.to_thread`. +- Workaround: Email delivery may be delayed; event loop latency spikes during scheduled runs. + +### `cleanup_old_records` Deletes All Done/Failed Records Without Archive + +- Symptoms: Historical processing data is permanently deleted daily; no audit trail of what was cleaned. +- Files: `app/services/company_cleaner.py` (lines 329–331), `app/core/scheduler.py` (line 220) +- Trigger: `daily_cleanup_job` calls `CompanyCleaningQueue.filter(status__in=["done", "failed"]).delete()` — a hard delete with no archiving or retention window. 
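As a rough illustration of the alternative to the bare `except Exception: pass` blocks noted under "Silent Error Swallowing in Spider Code", the sketch below catches only the expected parsing errors, logs them, and counts them. The names `parse_job_list` and `safe_parse` and the response shape are hypothetical, and the standard `logging` module is used here in place of the project's `loguru` to keep the example dependency-free.

```python
import logging

logger = logging.getLogger("spider")


def parse_job_list(response: dict) -> list:
    """Hypothetical parser: fails on an unexpected response shape."""
    return response["data"]["jobs"]


def safe_parse(response: dict, stats: dict) -> list:
    try:
        return parse_job_list(response)
    except (KeyError, TypeError) as exc:
        # Record and log the failure instead of passing silently,
        # so skipped data shows up in counters and logs.
        stats["parse_errors"] = stats.get("parse_errors", 0) + 1
        logger.warning("job list parse failed: %r", exc)
        return []
```

Narrowing the exception types also lets genuinely unexpected errors propagate instead of being hidden.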
+ +--- + +*Concerns audit: 2026-03-21* diff --git a/.planning/codebase/CONVENTIONS.md b/.planning/codebase/CONVENTIONS.md new file mode 100644 index 0000000..7f4250f --- /dev/null +++ b/.planning/codebase/CONVENTIONS.md @@ -0,0 +1,212 @@ +# Coding Conventions + +**Analysis Date:** 2026-03-21 + +## Naming Patterns + +**Files:** +- Snake_case for all Python files: `company_storage.py`, `company_cleaner.py`, `clickhouse_repo.py` +- Private/internal modules prefixed with underscore: `_base.py`, `_boss_api.py`, `_boss_client.py`, `_boss_sign.py`, `_http_client.py` +- Platform-named service files: `boss.py`, `qcwy.py`, `zhilian.py` under `app/services/crawler/` +- Router files named after domain: `keyword.py`, `analytics.py`, `cleaning.py` + +**Classes:** +- PascalCase throughout: `CleaningService`, `KeywordController`, `ClickHouseBaseRepo`, `JobAnalyticsRepo` +- Services: `{Domain}Service` — `BossService`, `QcwyService`, `ZhilianService`, `IngestService`, `AnalyticsService` +- Controllers: `{Domain}Controller` — `KeywordController` +- Repos: `{Domain}Repo` or `{Domain}BaseRepo` — `ClickHouseBaseRepo`, `JobAnalyticsRepo` +- Models (Tortoise ORM): `{Platform}{Entity}` — `BossKeyword`, `QcwyCompany`, `ZhilianCompany` +- Schemas (Pydantic): `{Entity}Base`, `{Entity}Create`, `{Entity}Update`, `{Entity}Out` — see `app/schemas/keyword.py` + +**Functions and Methods:** +- Snake_case for all functions and methods: `get_available`, `report_page_progress`, `store_batch`, `build_insert_row` +- Private helpers prefixed with underscore: `_apply_proxy`, `_ensure_boss_token_loaded`, `_pick_first`, `_nested_get`, `_clean_text`, `_model_for_source` +- Async dependency factories follow pattern `get_{service/controller}()`: `get_ingest_service`, `get_analytics_service`, `get_keyword_controller` + +**Variables:** +- Snake_case: `data_list`, `platform_type`, `check_duplicate`, `page_size` +- Module-level constants: UPPER_SNAKE_CASE — `COMPANY_SOURCES`, `QUEUE_TERMINAL_STATUSES` +- 
Class-level constants: UPPER_SNAKE_CASE prefixed `_` — `_TOKEN_REFRESH_INTERVAL = 3600` + +**Types and Enums:** +- Enums use PascalCase class name, UPPER_SNAKE_CASE values: `PlatformType.BOSS`, `ChannelType.MINI`, `DataType.JOB` +- Enum values are lowercase strings matching URL slugs: `"boss"`, `"mini"`, `"job"` — see `app/schemas/ingest.py` +- Enums inherit from `(str, Enum)` enabling direct string comparison + +## Code Style + +**Formatting:** +- Tool: `black` v24.10.0 +- Line length: 120 characters (set in `pyproject.toml` `[tool.black]` and `[tool.ruff]`) +- Target Python versions: 3.10, 3.11 (black), 3.13 (Pipfile) + +**Linting:** +- Tool: `ruff` v0.9.1 (configured in `pyproject.toml`) +- Ignored rules: `F403` (star imports), `F405` (may be undefined from star import) +- Star imports from internal modules are allowed (used in `app/models/__init__.py`, `app/services/ingest/__init__.py`) + +**Import Sorting:** +- Tool: `isort` v5.13.2 +- No explicit isort config found; follows default ordering + +## Import Organization + +**Order:** +1. Standard library (`from __future__`, `os`, `re`, `typing`, `datetime`, `json`) +2. Third-party (`fastapi`, `pydantic`, `tortoise`, `loguru`, `clickhouse_connect`) +3. 
Internal app imports (`from app.core.`, `from app.models.`, `from app.services.`, `from app.schemas.`) + +**Example from `app/api/v1/analytics.py`:** +```python +from typing import Optional +from datetime import datetime, date, timezone +from zoneinfo import ZoneInfo + +from fastapi import APIRouter, Depends, Query +from app.core.clickhouse import clickhouse_manager +from app.services.analytics_service import AnalyticsService +from app.schemas.analytics import JobStatisticsResponse +``` + +**Path Aliases:** +- None; all imports use full `app.` prefix paths +- `from app.log import logger` is the canonical loguru import path + +**Star Imports:** +- Used only in `__init__.py` re-export files: `from .admin import *` in `app/models/__init__.py` +- `# noqa: F401, F403` comments suppress lint warnings for intentional star imports + +## Error Handling + +**Patterns:** +- Services return `Dict[str, Any]` result objects with `"success"`, `"code"`, `"message"` fields instead of raising exceptions to callers +- Controllers return dict with `"code": 200/400/404` and `"message"` for all outcomes +- API route handlers do NOT use try/except — they rely on services returning structured results +- Service methods wrap low-level calls in `try/except Exception as e` and log then return `False` or error dict + +**Service-level error handling example** (`app/services/cleaning.py`): +```python +except Exception as e: + logger.error(f"Error processing item {target}: {e}") + return { + "success": False, + "target": target, + "error": str(e), + "storage_status": "error", + "remote_sent": False + } +``` + +**Repository-level:** `ClickHouseBaseRepo` does not swallow exceptions; they propagate to the service layer. + +**Auth exceptions:** `app/core/dependency.py` raises `HTTPException(status_code=401/403)` directly — the standard FastAPI pattern for auth failures. 
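
The service-level convention described above (catch, log, return a structured result instead of raising) can be sketched as follows. This is an illustration only: `process_item` and `handler` are hypothetical names, and the standard-library `logging` module is swapped in for `loguru` so the snippet runs without third-party packages.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger(__name__)


def process_item(target: str, handler: Callable[[str], Any]) -> Dict[str, Any]:
    """Run handler(target) and always return a structured result dict."""
    try:
        data = handler(target)
    except Exception as e:
        # Log and convert the failure into a result object for the caller.
        logger.error("Error processing item %s: %s", target, e)
        return {"success": False, "code": 500, "message": str(e), "data": None}
    return {"success": True, "code": 200, "message": "ok", "data": data}
```

A route handler written against this shape can return the dict directly, which is why the convention above expects no try/except in the route layer.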
+ +## Logging + +**Framework:** `loguru` v0.7.3 + +**Import:** `from app.log import logger` (centralized re-export) or `from loguru import logger` (direct) + +**Patterns:** +- `logger.info(f"...")` for normal operation events +- `logger.warning(f"...")` for non-fatal recoverable issues (e.g., token not found, API soft failures) +- `logger.error(f"...")` for caught exceptions and operation failures +- F-string interpolation used consistently for message formatting +- No structured fields (no `logger.bind()` usage observed) + +**Example:** +```python +logger.info(f"获取招聘详情: {job_id}") +logger.warning(f"Boss get_job_detail failed: {result.error}") +logger.error(f"批量插入失败: {e}") +``` + +## API Response Format + +**Two response styles coexist:** + +**Style 1 — Direct dict return** (most routes in new modules like `app/api/v1/job/job.py`, `app/api/v1/analytics.py`): +```python +return {"code": 200, "data": result, "message": "ok"} +``` + +**Style 2 — JSONResponse subclasses** (older RBAC routes, defined in `app/schemas/base.py`): +```python +Success(code=200, msg="OK", data=data) +Fail(code=400, msg="error message") +SuccessExtra(code=200, data=data, total=100, page=1, page_size=20) +``` + +**Paginated responses** include: `code`, `data` (list), `total`, `page`, `page_size` + +## Comments + +**When to Comment:** +- Docstrings on public methods describing purpose, not implementation: `"""获取可用关键词,优先返回断点续爬和失败重试的关键词"""` +- Inline comments for priority logic and algorithm steps: `# 优先级 1: 断点续爬 (partial)` +- Module-level docstrings for context: `"""Boss直聘 Service — 基于新算法文件的封装"""` +- `# noqa` comments for intentional lint suppressions + +**JSDoc/TSDoc:** +- Not applicable (Python backend) +- Docstrings are brief single-line or short multi-line Chinese descriptions + +## Function Design + +**Size:** Functions tend to be 10-50 lines; service methods like `process_single_item` in `app/services/cleaning.py` grow to ~70 lines due to multi-platform dispatch + +**Parameters:** +- 
Keyword arguments with defaults preferred for optional params +- Pydantic schemas used for HTTP request bodies (never raw dicts from router params) +- `Optional[str]` with `= None` default for optional parameters + +**Return Values:** +- Services return `Dict[str, Any]` with consistent keys (`code`, `message`, `data`) +- Private helpers return `Optional[T]` or primitive types +- Async functions return awaitable results (no mixing of sync/async) + +## Module Design + +**Exports:** +- `app/models/__init__.py` uses `from .{module} import *` to flatten model imports +- Router modules export a single named router variable: `router = APIRouter(...)` or `{domain}_router = APIRouter(...)` +- Service classes are imported directly by name + +**Barrel Files (`__init__.py`):** +- `app/models/__init__.py` — re-exports all model classes +- `app/services/ingest/__init__.py` — re-exports `IngestService` and config registrations +- `app/api/v1/__init__.py` — aggregates all routers into `v1_router` + +**Dependency Injection:** +- FastAPI `Depends()` used for service/controller instantiation in route handlers +- Dependency factory functions named `get_{service}()` and defined in the same file as the router +- Shared auth dependencies: `DependAuth`, `DependPermission` in `app/core/dependency.py` + +## Tortoise ORM Model Conventions + +**Base class:** All models inherit from `app/models/base.py:BaseModel` (which extends `tortoise.models.Model` with `id = BigIntField(pk=True)`) + +**Timestamp mixin:** `TimestampMixin` adds `created_at` (auto_now_add) and `updated_at` (auto_now) — applied via multiple inheritance + +**Abstract base models:** Platform variants use abstract base + concrete subclasses: +```python +class BaseKeyword(Model): # abstract = True in Meta + ... 
+class BossKeyword(BaseKeyword): + class Meta: + table = "boss_keyword" +``` + +**Field descriptions:** All fields include `description=` parameter for documentation + +## Pydantic Schema Conventions + +- All schemas inherit from `pydantic.BaseModel` +- All fields use `Field(...)` with `description=` for documentation +- Enums inherit from `(str, Enum)` for JSON serialization compatibility +- Output schemas include `class Config: from_attributes = True` to support ORM mode +- Validation patterns use `Field(..., pattern="^(boss|qcwy|zhilian)$")` for enum-like string fields + +--- + +*Convention analysis: 2026-03-21* diff --git a/.planning/codebase/INTEGRATIONS.md b/.planning/codebase/INTEGRATIONS.md new file mode 100644 index 0000000..232a9b8 --- /dev/null +++ b/.planning/codebase/INTEGRATIONS.md @@ -0,0 +1,165 @@ +# External Integrations + +**Analysis Date:** 2026-03-21 + +## APIs & External Services + +**Recruitment Platforms (scraped, not official APIs):** +- Boss Zhipin (Boss直聘) - Job and company data via WeChat mini-program endpoints + - Client: `requests` (sync) in `jobs_spider/boss/boos_api.py` + - Endpoints scraped: `/wapi/zpgeek/miniapp/search/joblist.json`, `/wapi/batch/batchRunV3`, `/wapi/zpgeek/miniapp/brand/detail.json` + - Auth: MPT Token (stored in MySQL `boss_token` table, managed via `app/api/v1/token/`) + - Anti-detection: `SmartIPManager`, `IPAnomalyDetector`, random 10-20s delays, session rebuilding on ban + - JS signing: `PyExecJS` executes JS sign algorithms (`jobs_spider/boss/boos_api.py`) + +- 51job / Qiancheng Wuyou (前程无忧) - Job data via platform API + - Client: `httpx` + `requests` in `jobs_spider/qcwy/qcwy.py` + - Auth: HMAC signature computed per-request + - Proxy: Hardcoded proxy URL present in `jobs_spider/qcwy/qcwy.py` (line 31) - security risk + +- Zhilian Zhaopin (智联招聘) - Job and company data + - Client: `requests` in `jobs_spider/zhilian/zhilian_single.py` + - Sign client: `app/services/crawler/_zhilian_sign.py` + +**Qixin.com (企信) 
- Company data enrichment (disabled):** +- Endpoint: `http://external-data.qixin.com/extend/extend_data_push` +- Auth: MD5 HMAC with timestamp + salt (salt hardcoded in `app/services/ingest/remote_push.py` line 48) +- Status: Push code commented out; function `push_to_remote()` currently only logs + +**Webhook / Report Endpoint:** +- Outgoing: Configurable via `REPORT_ENDPOINT` env var (default: none) +- Used by: `app/core/scheduler.py` `_post_with_retry()` to POST stats JSON every 6 hours +- Timeout: `REPORT_TIMEOUT` (default 10s), retries: `REPORT_MAX_RETRIES` (default 3) + +## Data Storage + +**Databases:** + +MySQL (Primary business database): +- Connection: `mysql://root:jobdata123@121.4.126.241:3306/job_data` (default hardcoded in `app/settings/config.py`; override via `TORTOISE_ORM` env or individual components) +- Client: Tortoise-ORM 0.23.0 with aiomysql driver +- Migrations: Aerich (`migrations/` directory, runs on startup if `RUN_MIGRATIONS_ON_STARTUP=True`) +- Tables: `user`, `role`, `api`, `menu`, `dept`, `auditlog`, `boss_token`, `cleaning_*`, `scheduled_task_run`, `stats_total`, `ip_upload_stats` +- Models: `app/models/admin.py`, `app/models/token.py`, `app/models/cleaning.py`, `app/models/metrics.py`, `app/models/keyword.py`, `app/models/company.py` + +ClickHouse (Analytics / raw job data): +- Connection: host `121.4.126.241`, port `8123`, DB `job_data` (configured in `app/settings/config.py`) +- Env vars: `CLICKHOUSE_HOST`, `CLICKHOUSE_PORT`, `CLICKHOUSE_USER`, `CLICKHOUSE_PASS`, `CLICKHOUSE_DB` +- Client: `clickhouse-connect==0.7.19` async client (`app/core/clickhouse.py`) +- Connection pool: `urllib3.PoolManager` with `num_pools=2`, `maxsize=5` per worker +- Tables managed via DDL in `app/core/clickhouse_init.py` (NOT Aerich migrations) +- Tables: `boss_job`, `boss_company`, `qcwy_job`, `qcwy_company`, `zhilian_job`, `zhilian_company` (MergeTree engine), `pending_company` (ReplacingMergeTree), `job_analytics` (VIEW) +- Queries: Repository pattern 
in `app/repositories/clickhouse_repo.py` + +**File Storage:** +- Local filesystem only (no S3 / object storage detected) +- Static files served from `static/` directory mounted via FastAPI (`app/__init__.py`) +- Spider logs written to `logs/` directory relative to spider working dir +- ECS pipeline log: `ecs_full_pipeline.log` (root project directory) + +**Caching:** +- Redis (optional): Used only for distributed locking in `app/core/locks.py` + - Env vars: `REDIS_HOST`, `REDIS_PORT` (default 6379), `REDIS_DB` (default 0), `REDIS_PASS` + - Falls back to filesystem-based TTL locks when Redis unavailable + - Lock files stored in system temp directory: `{tempdir}/jobdata_lock_{name}/` + +## Authentication & Identity + +**Auth Provider:** Custom JWT implementation (no third-party auth provider) +- Implementation: `app/core/dependency.py` (`DependPermission` dependency) +- Algorithm: HS256 +- Expiry: 7 days (60 * 24 * 7 minutes, `app/settings/config.py`) +- Secret: `SECRET_KEY` (default `"CHANGE_ME_DEV_ONLY"` - must override in production) +- Storage (frontend): JWT stored in `localStorage`, injected via axios request interceptor (`web/src/utils/http/interceptors.js`) +- RBAC: Route-level permission check via `DependPermission`; roles link to APIs (many-to-many in MySQL) + +**Boss Token Management:** +- Boss MPT tokens stored in `boss_token` MySQL table (`app/models/token.py`) +- Managed via `app/api/v1/token/` CRUD endpoints +- Spiders fetch active tokens from `/api/v1/token/tokens` and mark invalid ones + +## Monitoring & Observability + +**Error Tracking:** None (no Sentry or equivalent detected) + +**Logs:** +- Backend: `loguru` structured logging throughout all modules +- Spider: `loguru` with daily rotation (`logs/log_{YYYY-MM-DD}.log`, 30-day retention) +- ECS pipeline: File log appended to `ecs_full_pipeline.log` +- No centralized log aggregation (no ELK, CloudWatch, etc.) 
+ +**Alerting:** +- Email alerts via SMTP on: stats job completion, IP anomaly detection +- SMTP provider: QQ Mail (`smtp.qq.com:587`) with STARTTLS +- Env vars: `SMTP_HOST`, `SMTP_PORT`, `SMTP_USER`, `SMTP_PASS`, `SMTP_FROM`, `SMTP_TO` +- Default recipient hardcoded in `app/settings/config.py` line 47 +- IP anomaly alerts triggered every 10 minutes by `ip_alert_job` in `app/core/scheduler.py` +- Stats reports sent every 6 hours by `stats_job` in `app/core/scheduler.py` + +**Metrics:** +- IP upload tracking: `IpTrackingMiddleware` (`app/core/ip_tracking.py`) records per-source, per-IP upload counts to `ip_upload_stats` MySQL table +- Task run history: `ScheduledTaskRun` model (`app/models/metrics.py`) records all scheduler task outcomes +- Stats totals: `StatsTotal` model records ClickHouse row counts per table per run + +## CI/CD & Deployment + +**Hosting:** +- Production: Alibaba Cloud ECS (manually provisioned + Docker container) +- Crawler nodes: Alibaba Cloud ECS spot instances (cn-shanghai-b, `ecs.t5-lc1m1.small`, Ubuntu 24.04) +- ECS node config: `ecs_full_pipeline.py` (20 spot instances, `SpotAsPriceGo` strategy, auto-release) +- Vswitch: `vsw-uf6twb2lrd02fi7q2i605`, Security Group: `sg-uf60380ctrux2uusucad` + +**CI Pipeline:** None detected (no GitHub Actions, Jenkins, or equivalent) + +**Container:** +- Multi-stage `Dockerfile` in project root +- Stage 1: `node:18-alpine` builds Vue3 frontend +- Stage 2: `python:3.11-slim-bullseye` installs Python deps + nginx +- Entrypoint: `deploy/entrypoint.sh` +- Nginx config: `deploy/web.conf` +- Exposed port: 80 + +## Environment Configuration + +**Required env vars (backend):** +- `SECRET_KEY` - JWT signing secret (must override `"CHANGE_ME_DEV_ONLY"`) +- `CLICKHOUSE_HOST` - ClickHouse server address +- `CLICKHOUSE_USER` / `CLICKHOUSE_PASS` - ClickHouse credentials +- MySQL connection (embedded in `TORTOISE_ORM` dict or override connection string) + +**Optional env vars:** +- `APP_HOST` / `APP_PORT` / 
`UVICORN_WORKERS` - Server binding +- `REDIS_HOST` / `REDIS_PORT` / `REDIS_PASS` - Distributed lock backend +- `SMTP_HOST` / `SMTP_USER` / `SMTP_PASS` / `SMTP_FROM` / `SMTP_TO` - Email alerting +- `REPORT_ENDPOINT` - Webhook for stats reporting +- `RUN_MIGRATIONS_ON_STARTUP` / `INITIALIZE_SEED_DATA_ON_STARTUP` - Startup behavior flags + +**Spider env vars:** +- `API_BASE_URL` - Backend URL for data push (default: `http://124.222.106.226:9999`) +- `SLEEP_MIN_SECONDS` / `SLEEP_MAX_SECONDS` - Request rate limiting +- `PROXY_USERNAME` / `PROXY_PASSWORD` / `PROXY_TUNNEL` / `PROXY_SCHEME` - Proxy pool +- `MAX_PAGES` / `PAGE_SIZE` - Crawl depth control + +**Alibaba Cloud (ECS pipeline):** +- Uses `alibabacloud_credentials` default credential chain (env vars `ALIBABA_CLOUD_ACCESS_KEY_ID` / `ALIBABA_CLOUD_ACCESS_KEY_SECRET` or `~/.alibabacloud/credentials`) + +**Secrets location:** All secrets are environment variables; no secrets manager. `config.py` contains hardcoded defaults that are insecure for production use. 
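
One way to guard against the insecure defaults noted above is to fail fast at startup when a secret still holds its placeholder value. A minimal sketch, assuming only the `SECRET_KEY` name and the `"CHANGE_ME_DEV_ONLY"` placeholder from this audit; `load_secret_key` is a hypothetical helper, and plain `os.environ` is used here in place of the project's `pydantic-settings` class:

```python
import os
from typing import Mapping, Optional

INSECURE_DEFAULT = "CHANGE_ME_DEV_ONLY"  # placeholder default named in the audit


def load_secret_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return SECRET_KEY, rejecting a missing, empty, or placeholder value."""
    source = os.environ if env is None else env
    value = source.get("SECRET_KEY", INSECURE_DEFAULT)
    if not value or value == INSECURE_DEFAULT:
        raise RuntimeError("SECRET_KEY must be set to a non-default value")
    return value
```

Passing an explicit mapping keeps the check testable without mutating the process environment.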
+ +## Webhooks & Callbacks + +**Incoming:** +- `/api/v1/universal/data/batch-store-async` - Spider data ingestion endpoint (called by all three crawlers) + - Header auth: `token: dev` (or `API_TOKEN` env var) + - Body: `{"data_list": [...], "data_type": "job"|"company", "platform": "boss"|"qcwy"|"zhilian"}` +- `/api/v1/keyword/available` - Spiders poll this to get next keyword to crawl +- `/api/v1/token/tokens` - Spiders fetch active Boss MPT tokens +- `/api/v1/pipeline` - Manual trigger for ECS full pipeline + +**Outgoing:** +- `REPORT_ENDPOINT` (configurable) - Stats payload POST every 6 hours (`app/core/scheduler.py`) +- `http://external-data.qixin.com/...` - Company data push to Qixin (currently disabled/commented out in `app/services/ingest/remote_push.py`) +- SMTP email notifications on scheduler events + +--- + +*Integration audit: 2026-03-21* diff --git a/.planning/codebase/STACK.md b/.planning/codebase/STACK.md new file mode 100644 index 0000000..3830a81 --- /dev/null +++ b/.planning/codebase/STACK.md @@ -0,0 +1,143 @@ +# Technology Stack + +**Analysis Date:** 2026-03-21 + +## Languages + +**Primary:** +- Python 3.13 - Backend API, crawlers, ECS pipeline scripts +- JavaScript (ES Modules) - Vue3 frontend (no strict TypeScript; TS installed but JS used in `.vue` and `.js` files) + +**Secondary:** +- TypeScript 5.1.6 - Type definitions referenced in frontend toolchain only + +## Runtime + +**Environment:** +- Python 3.13 (required: `>=3.13` per `Pipfile`; Dockerfile uses `python:3.11-slim-bullseye` for container build) +- Node.js 18 (Dockerfile `node:18-alpine` for frontend build stage) + +**Package Manager:** +- Python: `pipenv` (development), `pip` / `requirements.txt` (Docker production) +- Frontend: `pnpm` (lockfile: `web/pnpm-lock.yaml` present) +- Lockfiles: `Pipfile.lock` (Python), `web/pnpm-lock.yaml` (frontend), `uv.lock` (uv-compatible) + +## Frameworks + +**Core Backend:** +- FastAPI 0.111.0 - REST API framework (`app/__init__.py` factory via 
`create_app()`) +- Starlette 0.37.2 - ASGI underpinning (middleware, static files) +- Uvicorn 0.34.0 - ASGI server (20 workers default, configurable via `UVICORN_WORKERS`) +- Uvloop 0.21.0 - High-performance event loop (non-Windows only) + +**ORM / Database:** +- Tortoise-ORM 0.23.0 - Async ORM for MySQL (`app/models/`) +- Aerich 0.8.1 - Database migrations for Tortoise-ORM (`migrations/` directory) +- aiomysql - Async MySQL driver (Tortoise backend) +- clickhouse-connect 0.7.19 - ClickHouse async client (`app/core/clickhouse.py`) +- asyncpg - Async PostgreSQL driver (listed in Pipfile but not actively used) + +**Validation / Serialization:** +- Pydantic 2.10.5 - Request/response schemas (`app/schemas/`) +- pydantic-settings 2.7.1 - Settings management via env vars (`app/settings/config.py`) +- orjson 3.10.14 - Fast JSON serialization +- ujson 5.10.0 - Alternative JSON library + +**Authentication / Security:** +- PyJWT 2.10.1 - JWT token generation/verification (HS256, 7-day expiry) +- passlib 1.7.4 - Password hashing +- argon2-cffi 23.1.0 - Argon2 password hasher + +**Scheduling:** +- APScheduler - Async job scheduler (`app/core/scheduler.py`, 6 registered cron tasks) + +**HTTP Clients:** +- httpx 0.28.1 - Async HTTP client (backend service-to-service calls) +- requests - Sync HTTP client (spider scripts in `jobs_spider/`) + +**Logging:** +- loguru 0.7.3 - Structured logging (unified throughout backend and spiders) + +**Core Frontend:** +- Vue 3.3.4 - SPA framework (`web/src/main.js`) +- Vue Router 4.2.4 - Client-side routing (`web/src/router/index.js`) +- Pinia 2.1.6 - State management (`web/src/store/`) +- Naive UI 2.34.4 - Component library (admin UI) +- ECharts 6.0.0 - Data visualization charts (`web/src/views/analytics/`) +- axios 1.4.0 - HTTP client (`web/src/utils/http/`) +- vue-i18n 9 - Internationalization + +**Build / Dev Tools:** +- Vite 4.4.6 - Frontend bundler (`web/vite.config.js`) +- UnoCSS 66.5.10 - Atomic CSS engine +- unplugin-auto-import 20.3.0 
- Auto-imports for Vue APIs +- unplugin-vue-components 30.0.0 - Auto-imports for components +- @iconify/vue + @iconify/json - Icon library + +**Spider-specific:** +- playwright 1.57.0 - Browser automation (anti-detection in crawlers) +- PyExecJS 1.5.1 - Execute JavaScript signing algorithms from Python (`jobs_spider/boss/`) +- PySocks - SOCKS proxy support for spider requests +- tenacity - Retry logic in crawlers +- pandas + openpyxl - Data processing and Excel export + +**Cloud:** +- alibabacloud_ecs20140526 - Alibaba Cloud ECS SDK (`ecs_full_pipeline.py`) +- alibabacloud_credentials - AliCloud credential management + +**Code Quality:** +- ruff 0.9.1 - Python linter (`pyproject.toml` config: line-length 120, ignores F403/F405) +- black 24.10.0 - Python formatter (line-length 120, target py310/py311) +- isort 5.13.2 - Import sorter +- ESLint 8.46.0 - Frontend linter (`@zclzone` + `@unocss` rule sets) +- prettier - Frontend formatter + +## Key Dependencies + +**Critical:** +- `clickhouse-connect==0.7.19` - All analytics and job data storage; loss means no data read/write +- `tortoise-orm==0.23.0` - All business data (users, roles, keywords, tokens); paired with `aerich` for migrations +- `fastapi==0.111.0` - API layer; version-pinned for stability +- `APScheduler` - 6 scheduled tasks including ECS pipeline and IP alerting +- `alibabacloud_ecs20140526` - ECS node management; required for crawler scaling + +**Infrastructure:** +- `uvicorn==0.34.0` + `uvloop==0.21.0` - Production ASGI server stack +- `pydantic==2.10.5` - All input validation; v2 API used throughout +- `loguru==0.7.3` - Unified logging across all modules +- `redis` - Optional distributed lock backend (`app/core/locks.py`; falls back to file locks if Redis unavailable) + +## Configuration + +**Environment:** +- All settings in `app/settings/config.py` via `pydantic-settings.BaseSettings` +- Environment variables override defaults at startup +- No `.env` file detected (not committed); variables set at 
OS/container level +- Key variables: `APP_HOST`, `APP_PORT`, `UVICORN_WORKERS`, `CLICKHOUSE_HOST`, `CLICKHOUSE_USER`, `CLICKHOUSE_PASS`, `SECRET_KEY`, `SMTP_HOST`, `SMTP_USER`, `SMTP_PASS`, `REDIS_HOST`, `REPORT_ENDPOINT` +- **Security warning**: `config.py` contains hardcoded default values for MySQL, ClickHouse, and SMTP credentials; must be overridden in production + +**Build:** +- `pyproject.toml` - Python project metadata, black/ruff tool config +- `Pipfile` / `Pipfile.lock` - Development dependency management +- `requirements.txt` - Production pip install (used in Dockerfile) +- `web/vite.config.js` - Vite build config with proxy support via `VITE_USE_PROXY` env var +- `Dockerfile` - Multi-stage build: Node 18 (frontend) + Python 3.11 (backend) + nginx + +## Platform Requirements + +**Development:** +- Python 3.13+ (Pipfile requirement) +- Node.js 18+ with pnpm +- MySQL server (Tortoise-ORM connection) +- ClickHouse server (analytics data) +- Optional: Redis (distributed locking upgrade) + +**Production:** +- Docker-based deployment (see `Dockerfile`, `deploy/entrypoint.sh`, `deploy/web.conf`) +- Nginx serves frontend static files and proxies API requests +- Alibaba Cloud ECS (cn-shanghai-b zone) for crawler nodes +- Ports: 80 (nginx), 9999 (FastAPI backend direct) + +--- + +*Stack analysis: 2026-03-21* diff --git a/.planning/codebase/STRUCTURE.md b/.planning/codebase/STRUCTURE.md new file mode 100644 index 0000000..f10578e --- /dev/null +++ b/.planning/codebase/STRUCTURE.md @@ -0,0 +1,350 @@ +# Codebase Structure + +**Analysis Date:** 2026-03-21 + +## Directory Layout + +``` +JobData/ # Project root +├── app/ # FastAPI backend (Python) +│ ├── __init__.py # App factory create_app() +│ ├── api/ +│ │ └── v1/ # All REST API routes (/api/v1/...) 
+│ │ ├── __init__.py # Router aggregation + permission guards +│ │ ├── analytics.py # Data analytics endpoints +│ │ ├── stats.py # Table count stats endpoints +│ │ ├── pipeline.py # ECS pipeline trigger endpoint +│ │ ├── base/ # Login, user-info, menu-tree (no auth) +│ │ ├── job/ # Data ingest endpoints (batch/single/async) +│ │ ├── cleaning/ # Targeted cleaning endpoints +│ │ ├── company/ # Company search endpoints +│ │ ├── keyword/ # Keyword CRUD endpoints +│ │ ├── token/ # Boss token management endpoints +│ │ ├── proxy/ # Proxy IP management endpoints +│ │ ├── users/ # User CRUD +│ │ ├── roles/ # Role + permission assignment +│ │ ├── menus/ # Menu tree CRUD +│ │ ├── apis/ # API registration management +│ │ ├── depts/ # Department tree CRUD +│ │ └── auditlog/ # Audit log query +│ ├── controllers/ # CRUD orchestration for MySQL entities +│ │ ├── api.py # ApiController (refresh_api(), scan routes) +│ │ ├── user.py # UserController +│ │ ├── role.py # RoleController +│ │ ├── menu.py # MenuController +│ │ ├── dept.py # DeptController +│ │ ├── keyword.py # KeywordController +│ │ ├── proxy.py # ProxyController +│ │ ├── token.py # TokenController +│ │ ├── cleaning.py # CleaningController +│ │ └── company.py # CompanyController (search via crawler services) +│ ├── core/ # Framework infrastructure +│ │ ├── init_app.py # lifespan: migrations, seed, ClickHouse init +│ │ ├── scheduler.py # APScheduler job definitions + registration +│ │ ├── clickhouse.py # ClickHouseManager singleton +│ │ ├── clickhouse_init.py # ClickHouse DDL (tables + analytics view) +│ │ ├── dependency.py # AuthControl, PermissionControl, DependPermission +│ │ ├── locks.py # DistributedLock (Redis + file fallback) +│ │ ├── middlewares.py # BackGroundTaskMiddleware, HttpAuditLogMiddleware +│ │ ├── ip_tracking.py # IpTrackingMiddleware +│ │ ├── exceptions.py # Custom exception classes + handlers +│ │ ├── crud.py # Generic ListParams, pagination helpers +│ │ ├── ctx.py # CTX_USER_ID ContextVar +│ │ ├── 
bgtask.py # BgTasks helper for BackGroundTaskMiddleware +│ │ ├── proxy_rule.py # Proxy selection rules +│ │ └── algorithms/ +│ │ └── antispider.py # Anti-spider signature helpers +│ ├── models/ # Tortoise-ORM models (MySQL) +│ │ ├── base.py # BaseModel, TimestampMixin, UUIDModel +│ │ ├── admin.py # User, Role, Api, Menu, Dept, AuditLog +│ │ ├── token.py # BossToken +│ │ ├── cleaning.py # CleaningTask / CleaningRecord +│ │ ├── metrics.py # ScheduledTaskRun, StatsTotal, IpUploadStats +│ │ ├── company.py # CompanyCleaningQueue +│ │ ├── keyword.py # BossKeyword, QcwyKeyword, ZhilianKeyword +│ │ └── enums.py # MethodType enum +│ ├── repositories/ # ClickHouse query layer +│ │ └── clickhouse_repo.py # ClickHouseBaseRepo, JobAnalyticsRepo +│ ├── schemas/ # Pydantic request/response schemas +│ │ ├── ingest.py # IngestBatchRequest, PlatformType, DataType enums +│ │ ├── analytics.py # Analytics query params + response schemas +│ │ ├── keyword.py # Keyword schemas +│ │ ├── token.py # Token schemas +│ │ ├── base.py # Common response envelopes +│ │ ├── login.py # Login request/response +│ │ ├── users.py # User schemas +│ │ ├── roles.py # Role schemas +│ │ ├── menus.py # Menu schemas (MenuType enum) +│ │ ├── apis.py # API schemas +│ │ ├── depts.py # Dept schemas +│ │ ├── cleaning.py # Cleaning schemas +│ │ └── proxy.py # Proxy schemas +│ ├── services/ # Business logic +│ │ ├── analytics_service.py # AnalyticsService → JobAnalyticsRepo +│ │ ├── cleaning.py # CleaningService (targeted cleaning) +│ │ ├── company_cleaner.py # CompanyCleaner (automated company pipeline) +│ │ ├── company_jobs_sync.py # CompanyJobsSyncService +│ │ ├── company_storage.py # company_storage (ClickHouse company insert) +│ │ ├── ingest/ # Ingest pipeline subsystem +│ │ │ ├── service.py # IngestService.store_batch/store_single/query_data +│ │ │ ├── registry.py # PlatformConfig registry + get_config() +│ │ │ ├── dedup.py # batch_dedup_filter() against ClickHouse +│ │ │ ├── remote_push.py # Async HTTP push to 
secondary target +│ │ │ └── configs/ # Per-platform PlatformConfig registrations +│ │ └── crawler/ # In-process platform HTTP clients +│ │ ├── _base.py # BaseFetcher, BaseSearcher, ApiResult +│ │ ├── _http_client.py # HTTPClient wrapper (sync requests) +│ │ ├── _boss_client.py # Boss HTTP client +│ │ ├── _boss_api.py # Boss API methods +│ │ ├── _boss_sign.py # Boss request signing +│ │ ├── _zhilian_client.py +│ │ ├── _zhilian_api.py +│ │ ├── _zhilian_sign.py +│ │ ├── _job51_client.py +│ │ ├── _job51_api.py +│ │ ├── _job51_sign.py +│ │ ├── boss.py # BossService (high-level facade) +│ │ ├── qcwy.py # QcwyService (high-level facade) +│ │ └── zhilian.py # ZhilianService (high-level facade) +│ ├── settings/ +│ │ └── config.py # Settings (pydantic-settings BaseSettings) +│ ├── log/ +│ │ └── __init__.py # loguru logger export +│ └── utils/ # Shared utilities +├── web/ # Vue 3 frontend +│ ├── src/ +│ │ ├── main.js # App entry, Pinia + Router setup +│ │ ├── App.vue # Root component +│ │ ├── api/ # Axios API call modules +│ │ │ ├── index.js # System management APIs +│ │ │ ├── analytics.js # Analytics APIs +│ │ │ ├── keyword.js # Keyword APIs +│ │ │ ├── proxy.js # Proxy APIs +│ │ │ └── token.js # Token APIs +│ │ ├── components/ # Shared components (CrudTable, CrudModal, QueryBar) +│ │ ├── layout/ # App shell (sidebar, topbar, tabs) +│ │ ├── router/ # Vue Router config + dynamic route loading +│ │ │ ├── index.js # createRouter(), addDynamicRoutes() +│ │ │ ├── guard/ # Navigation guards (auth, loading, title) +│ │ │ └── routes/ # Static route definitions +│ │ ├── store/ # Pinia stores +│ │ │ └── modules/ # user, permission, app, tags +│ │ ├── utils/ +│ │ │ ├── auth/ # JWT token read/write +│ │ │ ├── http/ # Axios instance + interceptors +│ │ │ └── storage/ # localStorage wrapper +│ │ └── views/ # Page-level Vue components +│ │ ├── login/ # Login page +│ │ ├── analytics/ # ECharts dashboard (trend + distribution) +│ │ ├── recruitment/ # Job data browser (boss/, qcwy/, zhilian/) 
+│ │ ├── cleaning/ # Cleaning operation + monitor pages +│ │ ├── keyword/ # Keyword management +│ │ ├── profile/ # User profile +│ │ └── system/ # RBAC admin (user, role, menu, api, dept, auditlog, proxy, token) +│ ├── vite.config.js # Vite build config (plugins, proxy) +│ └── package.json # Frontend dependencies +├── jobs_spider/ # Standalone crawlers (external process) +│ ├── boss/ +│ │ ├── boos_api.py # BossZhipinAPI core + SmartIPManager + main (~2246 lines) +│ │ ├── company_spider.py # Boss company batch spider +│ │ └── enumerate_combos.py # City × keyword combo enumeration +│ ├── qcwy/ +│ │ ├── qcwy.py # QCWY API wrapper +│ │ ├── search_company_jobs.py +│ │ └── run_company_search.py +│ └── zhilian/ +│ ├── zhilian_single.py # Zhilian single-job spider +│ └── company_spider.py # Zhilian company spider +├── spiderJobs/ # Second-generation spider framework (structured) +│ ├── core/ +│ │ ├── base.py # Shared base classes +│ │ └── http_client.py # HTTP client abstraction +│ ├── platforms/ # Platform-specific implementations +│ │ ├── boss/ +│ │ ├── job51/ +│ │ └── zhilian/ +│ └── runner/ # Execution runners +│ ├── api_client.py # Backend API push client +│ ├── loop.py # Job crawl loop runner +│ └── company_loop.py # Company crawl loop runner +├── migrations/ +│ └── models/ # Aerich migration files (auto-generated) +├── static/ # Served at /static/ (built frontend or assets) +├── deploy/ # Deployment configs and sample images +├── docs/ # Project documentation +├── ecs_full_pipeline.py # Alibaba Cloud ECS batch provisioning script +├── create_ecs_instances.py # ECS instance creation utilities +├── reclean_qcwy_jobs.py # Standalone QCWY re-cleaning script +├── run.py # Uvicorn launch entry point +├── Pipfile / Pipfile.lock # Python dependency management (pipenv) +├── pyproject.toml # ruff + black + isort tool config +└── Dockerfile # Container build spec +``` + +## Directory Purposes + +**`app/api/v1/`:** +- Purpose: One sub-package per API domain; each contains 
`__init__.py` exporting its `APIRouter` +- Contains: Route handler functions, dependency injection wiring, Pydantic schema imports +- Key files: `app/api/v1/__init__.py` (aggregation), `app/api/v1/job/job.py` (ingest routes), `app/api/v1/analytics.py` + +**`app/controllers/`:** +- Purpose: Business-operation classes for MySQL-backed entities; encapsulate Tortoise-ORM queries +- Contains: One controller per domain entity; typically expose `create`, `update`, `delete`, `list`, `get` methods +- Key files: `app/controllers/api.py` (scan and register FastAPI routes → DB) + +**`app/services/ingest/`:** +- Purpose: Self-contained data pipeline: registry lookup → dedup check → ClickHouse insert → remote push +- Contains: `service.py` (main orchestrator), `registry.py` (config store), `dedup.py`, `remote_push.py`, `configs/` (registrations per platform) +- Key files: `app/services/ingest/service.py`, `app/services/ingest/registry.py` + +**`app/services/crawler/`:** +- Purpose: In-process HTTP clients for platform APIs, used by backend scheduler (company cleaning) and company search controller +- Contains: Underscore-prefixed private modules (`_boss_client.py`, `_boss_sign.py`, etc.) 
and public service facades (`boss.py`, `qcwy.py`, `zhilian.py`) +- Key files: `app/services/crawler/_base.py` (shared base classes) + +**`app/core/`:** +- Purpose: Framework plumbing — startup, middleware chain, auth, distributed locking, ClickHouse connection +- Key files: `app/core/dependency.py` (auth guards), `app/core/scheduler.py` (all cron jobs), `app/core/locks.py` + +**`app/models/`:** +- Purpose: All Tortoise-ORM model definitions; migrated via Aerich (not ClickHouse) +- Key files: `app/models/admin.py` (RBAC), `app/models/keyword.py` (crawl state tracking) + +**`app/repositories/`:** +- Purpose: Typed ClickHouse query encapsulation following the Repository pattern +- Key files: `app/repositories/clickhouse_repo.py` + +**`jobs_spider/`:** +- Purpose: Standalone scripts that run on ECS nodes; communicate with backend only via HTTP +- Key files: `jobs_spider/boss/boos_api.py` (2246-line core with SmartIPManager, BossZhipinAPI) + +**`spiderJobs/`:** +- Purpose: Newer modular spider framework; parallel architecture to `jobs_spider/` with cleaner separation +- Key files: `spiderJobs/runner/api_client.py`, `spiderJobs/runner/loop.py` + +**`web/src/views/`:** +- Purpose: One directory per route group; each view owns its local route definition in `route.js` +- Key files: `web/src/views/analytics/index.vue` (ECharts dashboard), `web/src/views/cleaning/monitor.vue` + +## Key File Locations + +**Entry Points:** +- `run.py`: Uvicorn startup; reads env vars `APP_HOST`, `APP_PORT`, `UVICORN_WORKERS` +- `app/__init__.py`: FastAPI app factory `create_app()` and lifespan context manager +- `web/src/main.js`: Vue 3 app entry; mounts Pinia, Router, directives + +**Configuration:** +- `app/settings/config.py`: All backend config (`Settings` extends `pydantic_settings.BaseSettings`); override via env vars +- `web/vite.config.js`: Vite build and dev server config +- `pyproject.toml`: ruff, black, isort tool configuration + +**Core Logic:** +- `app/core/init_app.py`: Startup 
sequence (migrations, seed data, ClickHouse init) +- `app/core/scheduler.py`: All APScheduler job functions and `start_scheduler()` +- `app/core/clickhouse_init.py`: All ClickHouse `CREATE TABLE IF NOT EXISTS` DDL statements +- `app/services/ingest/registry.py`: `PlatformConfig` dataclass + global registry dict +- `app/services/ingest/service.py`: `IngestService.store_batch()` — main data ingestion path +- `app/services/company_cleaner.py`: `CompanyCleaner` — automated company data pipeline + +**ClickHouse Schema:** +- `app/core/clickhouse_init.py`: Defines all 6 MergeTree tables and 1 analytics VIEW DDL; apply via `ClickHouseInitializer.initialize_all_tables()` + +**MySQL Migrations:** +- `migrations/models/`: Auto-generated by `aerich migrate`; applied via `aerich upgrade` or automatically on startup + +**Testing:** +- `tests/test_company_jobs_sync.py` +- `tests/test_company_storage.py` + +## Naming Conventions + +**Files (Python):** +- Modules: `snake_case.py` (e.g., `company_cleaner.py`, `clickhouse_init.py`) +- Private crawler modules: leading underscore prefix (e.g., `_boss_client.py`, `_zhilian_sign.py`) +- Spider scripts in `jobs_spider/`: descriptive names without prefix (e.g., `boos_api.py`, `zhilian_single.py`) + +**Files (Vue/JS):** +- Pages: `index.vue` for primary view, `monitor.vue` / descriptive name for secondary views +- API modules: `snake_case.js` matching domain (e.g., `analytics.js`, `keyword.js`) +- Router guard files: Grouped in `router/guard/` directory + +**Classes (Python):** +- Services: `PascalCaseService` (e.g., `IngestService`, `AnalyticsService`, `CompanyCleaner`) +- Controllers: `PascalCaseController` with module-level singleton instance in snake_case (e.g., `user_controller = UserController()`) +- Repos: `PascalCaseRepo` (e.g., `ClickHouseBaseRepo`, `JobAnalyticsRepo`) +- Models: `PascalCase` matching table concept (e.g., `BossToken`, `CompanyCleaningQueue`) + +**Directories:** +- Domain groupings: lowercase (e.g., `cleaning/`, 
`recruitment/`, `keyword/`) +- Platform names used consistently: `boss`, `qcwy`, `zhilian` (lowercase) across all layers + +## Where to Add New Code + +**New API Endpoint:** +1. Create or update router file in `app/api/v1/{domain}/` +2. Register the router in `app/api/v1/__init__.py` with `v1_router.include_router(...)` +3. Run `api_controller.refresh_api()` on next app startup to sync new routes to the `api` table +4. Add corresponding Pydantic schemas in `app/schemas/{domain}.py` + +**New Platform Data Type (Ingest):** +1. Define `PlatformConfig` with dedup fields in `app/services/ingest/configs/` +2. Call `register(config)` at module import in `app/services/ingest/configs/__init__.py` +3. ClickHouse table DDL goes in `app/core/clickhouse_init.py` under `_TABLE_DDLS` + +**New MySQL Model:** +1. Add model class to appropriate file in `app/models/` (or create new file) +2. Register the model module in `app/settings/config.py` under `TORTOISE_ORM.apps.models.models` +3. Run `aerich migrate && aerich upgrade` (or rely on `RUN_MIGRATIONS_ON_STARTUP=True`) + +**New Scheduled Job:** +1. Define async job function in `app/core/scheduler.py` +2. Wrap execution body with `DistributedLock` context from `app/core/locks.py` +3. Add `scheduler.add_job(...)` call inside `start_scheduler()` +4. Record run result via `_record_task_run()` to `ScheduledTaskRun` table + +**New Frontend Page:** +1. Create view component at `web/src/views/{module}/index.vue` +2. Add `route.js` in the same directory with route meta +3. Register the route in `web/src/router/routes/` static routes or rely on backend menu system +4. Add corresponding backend menu entry in `app/core/init_app.py` → `init_menus()` (for seed data) + +**New In-Process Platform Crawler:** +1. Add private client modules `_platform_client.py`, `_platform_api.py`, `_platform_sign.py` under `app/services/crawler/` +2. 
Create public facade `platform.py` extending or wrapping `BaseFetcher` / `BaseSearcher` from `app/services/crawler/_base.py` +3. Instantiate in `CompanyCleaner.__init__()` if needed for company pipeline + +**Shared Utilities:** +- Backend Python utilities: `app/utils/` +- Frontend JS utilities: `web/src/utils/` + +## Special Directories + +**`.startup_lock/`:** +- Purpose: File-based mutex directory used to prevent multi-worker race on startup seed data +- Generated: Yes (created and deleted at startup) +- Committed: No (in `.gitignore`) + +**`migrations/`:** +- Purpose: Aerich migration history for MySQL schema +- Generated: Yes (by `aerich migrate`) +- Committed: Yes + +**`static/`:** +- Purpose: Served at `/static/` path; can hold built frontend assets or other static files +- Generated: Partially (built frontend output can be placed here) +- Committed: Yes (currently contains static assets) + +**`.planning/`:** +- Purpose: GSD planning documents (codebase analysis, phase plans) +- Generated: Yes (by GSD tooling) +- Committed: Yes + +**`jobs_spider/*/logs/`:** +- Purpose: Rotating daily log files for standalone spider processes +- Generated: Yes (by loguru in spider scripts) +- Committed: No + +--- + +*Structure analysis: 2026-03-21* diff --git a/.planning/codebase/TESTING.md b/.planning/codebase/TESTING.md new file mode 100644 index 0000000..90d4910 --- /dev/null +++ b/.planning/codebase/TESTING.md @@ -0,0 +1,236 @@ +# Testing Patterns + +**Analysis Date:** 2026-03-21 + +## Test Framework + +**Runner:** +- `unittest` (Python standard library) — the only test framework currently in use +- No pytest, no pytest-anyio, no httpx.AsyncClient test harness configured +- No test-specific dependencies in `Pipfile` `[dev-packages]` (section is empty) + +**Assertion Library:** +- `unittest.TestCase` assertion methods: `assertEqual`, `assertIsNone`, etc. 
+ +**Run Commands:** +```bash +python -m unittest tests/test_company_storage.py # Run a specific test file +python -m unittest tests/test_company_jobs_sync.py # Run a specific test file +python -m unittest discover tests/ # Discover and run all tests +``` + +**Coverage:** +- No coverage tool configured (no `.coveragerc`, no `pytest-cov`, no coverage requirement enforced) + +## Test File Organization + +**Location:** +- Separate `tests/` directory at project root: `/Users/win/2025/AICoding/JobData/tests/` +- NOT co-located with source modules + +**Naming:** +- Pattern: `test_{module_name}.py` +- Examples: `test_company_storage.py`, `test_company_jobs_sync.py` + +**Structure:** +``` +tests/ +├── test_company_jobs_sync.py # Tests for app/services/company_jobs_sync.py +└── test_company_storage.py # Tests for app/services/company_storage.py +``` + +## Test Structure + +**Suite Organization:** +```python +import unittest + +from app.services.company_storage import extract_company_fields, normalize_company_id + + +class CompanyStorageTests(unittest.TestCase): + def test_normalize_qcwy_company_id(self): + self.assertEqual(normalize_company_id("qcwy", "co123"), "123") + self.assertEqual(normalize_company_id("qcwy", "123"), "123") + self.assertEqual(normalize_company_id("boss", "co123"), "co123") + + def test_extract_boss_fields(self): + payload = { ... 
} + result = extract_company_fields("boss", payload, "boss-1") + self.assertEqual(result["source_company_id"], "boss-1") + + +if __name__ == "__main__": + unittest.main() +``` + +**Patterns:** +- One `TestCase` class per test file +- Test methods named `test_{what_is_being_tested}_{condition}` (e.g., `test_extract_boss_fields`, `test_normalize_qcwy_company_id`) +- No setUp/tearDown methods — tests are stateless and self-contained +- No fixtures or factories; inline dicts serve as test data +- Tests verify pure functions only (no database, no HTTP, no async) + +## Mocking + +**Framework:** None currently used (no `unittest.mock`, no `pytest-mock`) + +**Current approach:** Tests only cover pure, synchronous functions that have no external dependencies. All existing tests call functions that: +- Accept plain dicts as input +- Return plain dicts or strings as output +- Have no database, network, or filesystem side effects + +**What to Mock (when tests are added):** +- `clickhouse_connect.driver.AsyncClient` for repository tests +- `BossToken.filter(...)` ORM queries for service tests that check token loading +- `httpx` / `requests` for crawler service tests +- `asyncio.to_thread(...)` calls wrapping synchronous crawlers + +**What NOT to Mock:** +- Pure data transformation functions (`extract_company_fields`, `normalize_company_id`, `_pick_first`, `_clean_text`) +- Enum definitions and schema validation logic +- URL pattern matching logic in `CleaningService` + +## Fixtures and Factories + +**Test Data:** +- Inline dicts constructed within each test method — no shared fixtures +- Platform-specific payloads mirror real API response shapes: + +```python +# Boss payload structure (from test_company_storage.py) +payload = { + "zpData": { + "brandComInfoVO": { + "encryptBrandId": "boss-1", + "brandName": "Boss公司", + "industryName": "互联网", + "stageName": "B轮", + }, + "companyFullInfoVO": { + "name": "Boss公司", + "typeName": "民营", + "cityName": "上海", + }, + } +} + +# Qcwy 
payload structure +payload = { + "coinfo": { + "coid": "123", + "coname": "前程公司", + "cotype": "民营", + "cosize": "500-999人", + }, + "financingStage": {"name": "未融资"}, +} + +# Zhilian payload structure +payload = { + "data": { + "companyBase": { + "companyNumber": "zl-1", + "companyName": "智联公司", + "companyTypeName": "上市公司", + } + } +} +``` + +**Location:** Inline within test methods; no shared fixture file or factory module. + +## Coverage + +**Requirements:** None enforced — no coverage target, no CI gates + +**Current coverage estimation:** +- `app/services/company_storage.py` — partially covered (extract functions and normalize_company_id) +- `app/services/company_jobs_sync.py` — partially covered (_extract_boss_jobs, _extract_qcwy_jobs, _extract_zhilian_jobs) +- All other modules — ZERO test coverage + +**View Coverage (manual):** +```bash +pip install coverage +coverage run -m unittest discover tests/ +coverage report +coverage html # generates htmlcov/index.html +``` + +## Test Types + +**Unit Tests:** +- The only type present +- Scope: pure synchronous functions with no external dependencies +- Files: `tests/test_company_storage.py`, `tests/test_company_jobs_sync.py` + +**Integration Tests:** +- Not present +- Needed for: API endpoint testing using `httpx.AsyncClient` with `TestClient` or `anyio` +- Critical gaps: `app/api/v1/job/job.py`, `app/api/v1/analytics.py`, `app/api/v1/keyword/keyword.py` + +**E2E Tests:** +- Not present +- No E2E framework configured + +## Critical Gaps + +**Untested service logic (highest priority):** + +1. `app/services/cleaning.py:CleaningService.clean_target_auto()` — URL platform detection logic (zhipin.com vs 51job.com vs zhaopin.com) +2. `app/services/cleaning.py:CleaningService.process_single_item()` — multi-branch dispatch based on `clean_type` and `platform` +3. `app/services/ingest/service.py:IngestService.store_batch()` — deduplication and batch insert flow +4. 
`app/services/ingest/dedup.py` — deduplication filter logic +5. `app/controllers/keyword.py:KeywordController.get_available()` — priority scheduling logic (partial > failed > fresh) +6. `app/repositories/clickhouse_repo.py:JobAnalyticsRepo` — query building and result mapping + +**Untested URL parsing (medium priority):** + +7. `app/services/cleaning.py:_process_boss_url()` — regex extraction of job/company ID from Boss URLs +8. `app/services/cleaning.py:_process_qcwy_url()` — regex extraction for qcwy URLs +9. `app/services/cleaning.py:_process_zhilian_url()` — regex extraction for zhilian URLs + +**Untested API endpoints:** + +10. All routes in `app/api/v1/` — no integration tests exist for any HTTP endpoints + +## Recommended Test Setup (when adding tests) + +**For async tests, add to `Pipfile [dev-packages]`:** +```toml +[dev-packages] +pytest = "*" +pytest-asyncio = "*" +anyio = {extras = ["trio"]} +httpx = "*" # already in packages, for AsyncClient +``` + +**Async test pattern (recommended):** +```python +import pytest +from httpx import ASGITransport, AsyncClient +from app import app # FastAPI app instance + +@pytest.mark.asyncio +async def test_ingest_endpoint(): + async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client: + response = await client.post("/api/v1/ingest/data/store", json={...}) + assert response.status_code == 200 +``` + +**Synchronous unit test pattern (existing):** +```python +import unittest +from app.services.company_storage import normalize_company_id + +class MyTests(unittest.TestCase): + def test_normalize(self): + self.assertEqual(normalize_company_id("qcwy", "co123"), "123") + +if __name__ == "__main__": + unittest.main() +``` + +--- + +*Testing analysis: 2026-03-21* diff --git a/.planning/config.json b/.planning/config.json index 992b667..656e73c 100644 --- a/.planning/config.json +++ b/.planning/config.json @@ -22,7 +22,8 @@ "node_repair_budget": 2, "ui_phase": true, "ui_safety_gate": true, - "text_mode": false + "text_mode": false, +
"_auto_chain_active": false }, "hooks": { "context_warnings": true diff --git a/.planning/phases/03-qcwy-zhilian/03-01-SUMMARY.md b/.planning/phases/03-qcwy-zhilian/03-01-SUMMARY.md new file mode 100644 index 0000000..b4586ae --- /dev/null +++ b/.planning/phases/03-qcwy-zhilian/03-01-SUMMARY.md @@ -0,0 +1,27 @@ +# Plan 03-01 Summary: Migrate the 前程无忧 (job51) layer to crawler_core + +**Status:** Complete +**Tasks:** 6/6 +**Commit:** 8c2c2d2 + +## What was built + +Migrated the 4 files under `spiderJobs/platforms/job51/` to crawler_core: + +- **client.py**: HTTPClient now comes from `crawler_core.http_client`, and Job51Sign from `crawler_core.qcwy.sign` +- **api.py**: + - imports changed to `crawler_core.base.Result/BaseFetcher/BaseSearcher` + - `ApiResult` replaced with `Result` throughout (11 occurrences) + - 2 occurrences of `self._http.get(` → `self.http_client.get(` (the overridden fetch in GetJobDetail/GetCompanyInfo) + - added a `_request()` POST override for the POST Searchers (SearchRecommendJobs/SearchCompanyJobs) +- **main.py**: BaseFetcher/BaseSearcher now come from `crawler_core.base` +- **sign.py**: backward-compatibility stub that re-exports `crawler_core.qcwy.sign.Job51Sign` +- **tests/job51/test_job51_client.py**: 17 mock tests covering all API classes + +## Verification + +``` +pytest tests/job51/ -v → 17 passed ✅ +``` + +## Self-Check: PASSED diff --git a/.planning/phases/03-qcwy-zhilian/03-02-SUMMARY.md b/.planning/phases/03-qcwy-zhilian/03-02-SUMMARY.md new file mode 100644 index 0000000..290ee9b --- /dev/null +++ b/.planning/phases/03-qcwy-zhilian/03-02-SUMMARY.md @@ -0,0 +1,33 @@ +# Plan 03-02 Summary: Migrate the 智联招聘 (zhilian) layer to crawler_core + +**Status:** Complete +**Tasks:** 6/6 +**Commit:** 8c2c2d2 + +## What was built + +Migrated the 4 files under `spiderJobs/platforms/zhilian/` to crawler_core: + +- **client.py**: HTTPClient now comes from `crawler_core.http_client`, and ZhilianSign from `crawler_core.zhilian.sign` +- **api.py**: + - imports changed to `crawler_core.base.BaseFetcher/BaseSearcher/Result/parse_response` + - added `_parse_zhilian_response` (HTTP 200 alone counts as success, to fit the Zhilian response format, which has no statusCode) + - 1 occurrence of `self._http.get(` → `self.http_client.get(` (SearchCompanyPositions) + - every Fetcher/Searcher class now defines the
`_parse()` method (an abstract method required by crawler_core) + - added a `_request()` POST override for the POST Searcher (SearchPositions) +- **main.py**: BaseFetcher/BaseSearcher now come from `crawler_core.base` +- **sign.py**: backward-compatibility stub that re-exports `crawler_core.zhilian.sign.ZhilianSign` +- **tests/zhilian/test_zhilian_client.py**: 17 mock tests + +## Key Decisions + +- Added `_parse_zhilian_response` instead of using `parse_response` (the latter checks `statusCode==200`, but the Zhilian API does not return that field) +- `SearchCompanyPositions._build_params()` calls the signer via `self._client.signer.sign_params()`, so it is unaffected by the migration + +## Verification + +``` +pytest tests/zhilian/ -v → 17 passed ✅ +``` + +## Self-Check: PASSED
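The parsing choice recorded under Key Decisions in Plan 03-02 can be illustrated with a small sketch. Everything in it is an assumption made for illustration: `SketchResult` stands in for `crawler_core.base.Result`, whose real fields are not shown in this diff, and `generic_parse_sketch` / `zhilian_parse_sketch` are hypothetical names and signatures, not the actual `parse_response` or `_parse_zhilian_response`.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SketchResult:
    """Hypothetical stand-in for crawler_core.base.Result (fields assumed)."""
    success: bool
    data: Optional[Any] = None
    error: Optional[str] = None


def generic_parse_sketch(http_status: int, body: dict) -> SketchResult:
    """Assumed generic behaviour: the body must carry statusCode == 200."""
    if http_status == 200 and body.get("statusCode") == 200:
        return SketchResult(True, data=body.get("data"))
    return SketchResult(False, error=f"unexpected statusCode: {body.get('statusCode')!r}")


def zhilian_parse_sketch(http_status: int, body: dict) -> SketchResult:
    """Assumed Zhilian variant: HTTP 200 alone counts as success."""
    if http_status == 200:
        # Fall back to the whole body when there is no "data" key.
        return SketchResult(True, data=body.get("data", body))
    return SketchResult(False, error=f"HTTP {http_status}")


# A Zhilian-shaped payload: there is no statusCode field in the body.
zhilian_body = {"data": {"companyBase": {"companyNumber": "zl-1"}}}
generic = generic_parse_sketch(200, zhilian_body)
zhilian = zhilian_parse_sketch(200, zhilian_body)
```

Under these assumptions the generic check rejects a valid Zhilian body because `statusCode` is absent, while the Zhilian variant accepts it on the HTTP status alone, which appears to be the motivation for the separate parser.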