docs(phase-3): complete execution — 2/2 plans, 98 tests passing

- ARCH-04: job51 migrated to crawler_core (no old deps)
- ARCH-05: zhilian migrated to crawler_core (no old deps)
- 34 new mock tests (17 job51 + 17 zhilian)
- Added _parse_zhilian_response custom parser for zhilian API format
- Fixed POST Searcher _request() overrides for job51/zhilian
- Full regression: 98 passed in 0.12s
win 2026-03-21 19:19:17 +08:00
parent 8c2c2d29d7
commit 00a727519f
13 changed files with 1637 additions and 14 deletions


@ -10,8 +10,8 @@
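- [x] **ARCH-01**: Extract `crawler_core/` as a standalone installable package containing the core signing, HTTP client, and response-parsing logic
- [x] **ARCH-02**: Unify the BaseFetcher/BaseSearcher base classes; all three platforms implement the template method pattern
- [x] **ARCH-03**: Rewrite the Boss Zhipin crawler client on top of crawler_core
- [ ] **ARCH-04**: Rewrite the 51job crawler client on top of crawler_core
- [ ] **ARCH-05**: Rewrite the Zhilian Zhaopin crawler client on top of crawler_core
- [x] **ARCH-04**: Rewrite the 51job crawler client on top of crawler_core
- [x] **ARCH-05**: Rewrite the Zhilian Zhaopin crawler client on top of crawler_core
- [ ] **ARCH-06**: The backend app/services/crawler/ facade layer uses asyncio.to_thread() to bridge the synchronous core
- [ ] **ARCH-07**: The external spiderJobs/ scripts use crawler_core in place of inlined code
- [ ] **ARCH-08**: Deprecate the legacy jobs_spider/ framework and migrate to spiderJobs/ + crawler_core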
@ -58,8 +58,8 @@
| ARCH-01 | TBD | Complete |
| ARCH-02 | TBD | Complete |
| ARCH-03 | TBD | Complete |
| ARCH-04 | TBD | Pending |
| ARCH-05 | TBD | Pending |
| ARCH-04 | TBD | Complete |
| ARCH-05 | TBD | Complete |
| ARCH-06 | TBD | Pending |
| ARCH-07 | TBD | Pending |
| ARCH-08 | TBD | Pending |


@ -11,7 +11,7 @@
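- [ ] **Phase 1: Shared core package** - Extract the installable shared package crawler_core/, unifying base classes and infrastructure
- [x] **Phase 2: Boss Zhipin rewrite** - Rewrite the Boss Zhipin crawler client on top of crawler_core (completed 2026-03-21)
- [ ] **Phase 3: 51job & Zhilian rewrite** - Rewrite the 51job and Zhilian Zhaopin crawler clients on top of crawler_core
- [x] **Phase 3: 51job & Zhilian rewrite** - Rewrite the 51job and Zhilian Zhaopin crawler clients on top of crawler_core (completed 2026-03-21)
- [ ] **Phase 4: Backend & external script integration** - Backend facade bridge + external script migration + deprecation of the legacy framework
- [ ] **Phase 5: Data pipeline optimization** - Ingest deduplication, company-cleaning workflow optimization, writing company recruitment data
- [ ] **Phase 6: Quality & frontend** - Test coverage for data parsing, improvements to the frontend monitoring and data-cleaning pages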
@ -55,7 +55,7 @@ Plans:
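2. The Zhilian Zhaopin crawler inherits from BaseFetcher/BaseSearcher and contains no inlined signing or HTTP boilerplate code
3. Each of the two platforms runs one real keyword crawl and successfully returns job data (manual verification)
4. The new clients for all three platforms share a consistent code structure, so a new platform can be implemented quickly by following the template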
**Plans:** TBD
**Plans:** 2/2 plans complete
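### Phase 4: Backend & external script integration
**Goal:** Both the backend facade and the external scripts use crawler_core; the legacy jobs_spider/ framework has served its purpose and is retired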
@ -98,7 +98,7 @@ Plans:
|-------|----------------|--------|-----------|
| 1. Shared core package | 2/2 | Complete | 2026-03-21 |
| 2. Boss Zhipin rewrite | 2/2 | Complete | 2026-03-21 |
| 3. 51job & Zhilian rewrite | 0/? | Not started | - |
| 3. 51job & Zhilian rewrite | 2/2 | Complete | 2026-03-21 |
| 4. Backend & external script integration | 0/? | Not started | - |
| 5. Data pipeline optimization | 0/? | Not started | - |
| 6. Quality & frontend | 0/? | Not started | - |


@ -4,12 +4,12 @@ milestone: v1.0
milestone_name: milestone
status: unknown
stopped_at: Completed 01-shared-core Plan 02 (sign algorithms + unit tests)
last_updated: "2026-03-21T11:04:42.115Z"
last_updated: "2026-03-21T11:19:04.395Z"
progress:
total_phases: 6
completed_phases: 2
total_plans: 4
completed_plans: 4
completed_phases: 3
total_plans: 6
completed_plans: 6
---
# STATE: JobData crawler interaction refactor
@ -24,13 +24,13 @@ progress:
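**Core value:** Keyword-driven crawlers fetch job data and store it reliably in ClickHouse, with scheduled collection and synchronization of company information
**Current focus:** Phase 2: Boss Zhipin rewrite
**Current focus:** Phase 3: 51job & Zhilian rewrite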
---
## Current Position
Phase: 3
Phase: 4
Plan: Not started
## Performance Metrics


@ -0,0 +1,201 @@
# Architecture
**Analysis Date:** 2026-03-21
## Pattern Overview
**Overall:** Layered monolith with embedded pipeline architecture
**Key Characteristics:**
- FastAPI async backend with dual-database persistence (MySQL for RBAC/metadata, ClickHouse for job/company analytics data)
- External crawlers (`jobs_spider/`, `spiderJobs/`) run as independent processes and push data to the backend API via HTTP POST — no shared code path at runtime
- RBAC permission model enforced at router-level via FastAPI `Depends`
- APScheduler drives all recurring automation (cleaning, ECS orchestration, alerting) within the main process
- Registry pattern (`app/services/ingest/registry.py`) maps (platform, channel, data_type) tuples to ClickHouse table configs + dedup rules
## Layers
**Router Layer:**
- Purpose: Parse HTTP request, validate via Pydantic schema, delegate to service or controller, return JSON envelope
- Location: `app/api/v1/`
- Contains: One sub-package per domain (e.g., `job/`, `cleaning/`, `analytics/`, `keyword/`)
- Depends on: Services, Controllers, `app/core/dependency.py` for auth
- Used by: FastAPI application via `app/api/v1/__init__.py` → `app/__init__.py`
**Service Layer:**
- Purpose: Business logic, orchestration across repos/models, side-effect coordination
- Location: `app/services/`
- Contains: `IngestService`, `AnalyticsService`, `CleaningService`, `CompanyCleaner`, `CompanyJobsSyncService`, platform crawler service wrappers (`crawler/boss.py`, `crawler/qcwy.py`, `crawler/zhilian.py`)
- Depends on: Repositories, ORM models, `clickhouse_manager`, external HTTP clients
- Used by: Routers, APScheduler jobs
**Controller Layer (RBAC domain operations):**
- Purpose: Thin CRUD orchestration for MySQL-backed entities (User, Role, API, Menu, Dept, etc.)
- Location: `app/controllers/`
- Contains: `user_controller`, `api_controller`, `role_controller`, `menu_controller`, `dept_controller`, `keyword_controller`, `proxy_controller`, `token_controller`, `cleaning_controller`, `company_controller`
- Depends on: Tortoise-ORM models
- Used by: Routers
**Repository Layer:**
- Purpose: Encapsulate raw ClickHouse query execution behind typed methods
- Location: `app/repositories/`
- Contains: `ClickHouseBaseRepo` (generic query/insert), `JobAnalyticsRepo` (trend, distribution, count queries against `job_analytics` view)
- Depends on: `clickhouse_connect.driver.AsyncClient`
- Used by: `AnalyticsService`
**Model Layer:**
- Purpose: ORM schema definitions for MySQL, migrated via Aerich
- Location: `app/models/`
- Contains: `admin.py` (User, Role, Api, Menu, Dept, AuditLog), `token.py` (BossToken), `cleaning.py` (CleaningTask), `metrics.py` (ScheduledTaskRun, StatsTotal, IpUploadStats), `company.py` (CompanyCleaningQueue), `keyword.py` (BossKeyword, QcwyKeyword, ZhilianKeyword)
- Depends on: `tortoise-orm`
- Used by: Controllers, Services, Scheduler jobs
**Core Infrastructure Layer:**
- Purpose: App wiring, connection management, middleware, cross-cutting utilities
- Location: `app/core/`
- Contains: `init_app.py` (lifespan bootstrap), `scheduler.py` (APScheduler jobs), `clickhouse.py` (ClickHouseManager singleton), `clickhouse_init.py` (ClickHouse DDL), `dependency.py` (AuthControl, PermissionControl), `locks.py` (DistributedLock), `middlewares.py` (BackGroundTaskMiddleware, HttpAuditLogMiddleware, IpTrackingMiddleware), `exceptions.py`, `crud.py`, `ip_tracking.py`
- Depends on: Everything below
- Used by: `app/__init__.py`, all layers above
**Ingest Sub-system:**
- Purpose: Platform-agnostic data ingestion pipeline with registry-driven routing, dedup, and remote push
- Location: `app/services/ingest/`
- Contains: `service.py` (IngestService), `registry.py` (PlatformConfig registry), `dedup.py` (batch dedup filter), `remote_push.py` (async HTTP push to secondary target), `configs/` (per-platform PlatformConfig registrations)
- Depends on: `clickhouse_connect`, registry configs
- Used by: `app/api/v1/job/job.py` router
**Spider Crawler Service Wrappers (in-process):**
- Purpose: Wrap platform HTTP clients so backend scheduler can call crawlers directly (company cleaning, company search)
- Location: `app/services/crawler/`
- Contains: `_base.py` (BaseFetcher, BaseSearcher, ApiResult), `_http_client.py`, `_boss_client.py`, `_boss_api.py`, `_boss_sign.py`, `_zhilian_client.py`, `_zhilian_api.py`, `_zhilian_sign.py`, `_job51_client.py`, `_job51_api.py`, `_job51_sign.py`, `boss.py`, `qcwy.py`, `zhilian.py`
- Depends on: `requests` (sync), platform sign/auth modules
- Used by: `CompanyCleaner`, `CompanyController`
**External Spider Processes:**
- Purpose: Standalone crawlers running on ECS nodes; push data to backend via `POST /api/v1/universal/data/batch-store-async`
- Location: `jobs_spider/boss/boos_api.py`, `jobs_spider/qcwy/qcwy.py`, `jobs_spider/zhilian/zhilian_single.py`, `spiderJobs/`
- Depends on: `requests`, `loguru`, `PyExecJS`; no shared Python imports with `app/`
- Used by: Scheduled via `ecs_full_pipeline.py` or manual launch
## Data Flow
**Crawler → Storage Flow:**
1. External spider (`jobs_spider/boss/boos_api.py`) fetches keyword from `GET /api/v1/keyword/available`
2. Spider calls Boss/QCWY/Zhilian platform API with rotating proxy/token
3. Spider POSTs raw JSON batch to `POST /api/v1/universal/data/batch-store-async`
4. Router (`app/api/v1/job/job.py`) delegates to `IngestService.store_batch()`
5. `IngestService` looks up `PlatformConfig` from registry (`app/services/ingest/registry.py`)
6. Dedup filter queries ClickHouse for existing records (`app/services/ingest/dedup.py`)
7. New rows inserted to ClickHouse table (e.g., `job_data.boss_job`) via `clickhouse_connect`
8. Optional async push to secondary remote endpoint via `remote_push.py`
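Steps 6 and 7 of the flow above can be sketched as a pure filtering function. This is an illustration only, not the code in `app/services/ingest/dedup.py`: the function name `filter_new_rows`, the `job_id` key field, and the choice to drop rows that lack a key are all assumptions made for the example.

```python
from typing import Any, Dict, Iterable, List


def filter_new_rows(
    rows: Iterable[Dict[str, Any]],
    existing_keys: Iterable[str],
    key_field: str = "job_id",  # assumed dedup key; the real extractor comes from the registry
) -> List[Dict[str, Any]]:
    """Keep only rows whose key is neither stored already nor repeated in the batch."""
    seen = set(existing_keys)
    fresh: List[Dict[str, Any]] = []
    for row in rows:
        key = row.get(key_field)
        if key is None or key in seen:
            # Assumption: rows without a key are dropped rather than inserted blindly.
            continue
        seen.add(key)
        fresh.append(row)
    return fresh
```

In this sketch `existing_keys` stands for the result of the ClickHouse lookup, so the function itself stays free of I/O and is easy to unit-test.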
**Analytics Query Flow:**
1. Frontend calls `GET /api/v1/analytics/volume-trend?interval=day&from_dt=...`
2. Router (`app/api/v1/analytics.py`) creates `AnalyticsService` with `clickhouse_manager` client
3. `AnalyticsService` delegates to `JobAnalyticsRepo.get_volume_trend()`
4. `JobAnalyticsRepo` queries the `job_data.job_analytics` VIEW (UNION ALL across three platform job tables)
5. Returns time-bucketed counts grouped by source
**RBAC Authentication Flow:**
1. Client sends `POST /api/v1/base/access_token` with credentials
2. `AuthControl.is_authed()` verifies JWT (HS256, 7-day expiry) from `token` header
3. `PermissionControl.has_permission()` loads user's roles → role's API list
4. Compares `(method, path)` tuple against allowed API set
5. Superusers bypass permission check
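The permission decision in steps 3 to 5 can be sketched as follows. This is a minimal illustration and not the project's `PermissionControl`; the `User` dataclass and the `has_permission` function shown here are hypothetical stand-ins for the ORM-backed objects.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple


@dataclass
class User:
    """Hypothetical stand-in for the ORM user with its resolved API grants."""

    is_superuser: bool = False
    # Each entry is a (method, path) tuple collected from the user's roles.
    allowed_apis: Set[Tuple[str, str]] = field(default_factory=set)


def has_permission(user: User, method: str, path: str) -> bool:
    """Return True when the user may call the given endpoint."""
    if user.is_superuser:
        # Superusers bypass the per-API check entirely.
        return True
    return (method.upper(), path) in user.allowed_apis
```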
**Company Cleaning Flow (Scheduled):**
1. `company_cleaning_job()` in `app/core/scheduler.py` fires every 5 minutes
2. `CompanyCleaner.collect_pending_companies(limit=50)` queries ClickHouse job tables for company IDs not yet in `pending_company` table
3. `CompanyCleaner.process_pending_companies(limit=30)` fetches company details from platform crawlers (BossService, QcwyService, ZhilianService)
4. Cleaned company data stored via `company_storage` to ClickHouse `boss_company` / `qcwy_company` / `zhilian_company`
5. Run protected by `DistributedLock` (file-based fallback, optional Redis)
**State Management:**
- MySQL (via Tortoise-ORM): User sessions, RBAC entities, keyword crawl state, tokens, task run records
- ClickHouse: All raw job/company JSON data + analytics views; no ORM — raw SQL via `clickhouse_connect`
- Context variable `CTX_USER_ID` (Python `contextvars.ContextVar`) carries authenticated user ID within request scope (`app/core/ctx.py`)
- File-based startup lock (`.startup_lock/` dir) prevents multi-worker seed data collisions
## Key Abstractions
**PlatformConfig (Ingest Registry):**
- Purpose: Maps `(platform, channel, data_type)` to target ClickHouse table and dedup field extractors
- Examples: `app/services/ingest/registry.py`, `app/services/ingest/configs/`
- Pattern: Immutable `@dataclass(frozen=True)`, registered at module import time via `register()`
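A registry of this shape might look like the sketch below. Only the frozen dataclass, the `register()` name, and the `(platform, channel, data_type)` key come from the description above; the field names `table` and `dedup_field`, the `lookup()` helper, and the example registration are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class PlatformConfig:
    platform: str
    channel: str
    data_type: str
    table: str        # assumed field: target ClickHouse table
    dedup_field: str  # assumed field: key used for deduplication


_REGISTRY: Dict[Tuple[str, str, str], PlatformConfig] = {}


def register(config: PlatformConfig) -> None:
    """Register a config under its (platform, channel, data_type) key."""
    key = (config.platform, config.channel, config.data_type)
    if key in _REGISTRY:
        raise ValueError(f"duplicate registration for {key}")
    _REGISTRY[key] = config


def lookup(platform: str, channel: str, data_type: str) -> PlatformConfig:
    """Hypothetical helper: resolve a config or raise KeyError."""
    return _REGISTRY[(platform, channel, data_type)]


# Runs at import time, mirroring the "registered at module import" pattern.
register(PlatformConfig("boss", "mini", "job", "job_data.boss_job", "job_id"))
```

Freezing the dataclass means a config cannot be mutated after registration, so routing decisions stay stable for the lifetime of the process.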
**DistributedLock:**
- Purpose: Prevents concurrent execution of scheduled jobs across multiple Uvicorn workers
- Examples: `app/core/locks.py`
- Pattern: Redis SET NX with TTL; falls back to filesystem directory creation with TTL metadata
**ClickHouseManager:**
- Purpose: Singleton lazy-init connection wrapper around `clickhouse_connect.AsyncClient`
- Examples: `app/core/clickhouse.py`
- Pattern: Global module-level instance `clickhouse_manager`, injected into services at call time via `await clickhouse_manager.get_client()`
**BaseFetcher / BaseSearcher:**
- Purpose: Abstract base for platform-specific API clients within the in-process crawler wrappers
- Examples: `app/services/crawler/_base.py`
- Pattern: Template method — subclasses implement `_build_params()`, base handles HTTP invocation and `ApiResult` parsing
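The template method pattern described here can be sketched as below. The class names `BaseFetcherSketch` and `DemoFetcher`, the injected `transport` callable, and the `ApiResult` fields are assumptions for illustration; the real base class in `_base.py` may differ in every detail except the `_build_params()` hook.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class ApiResult:
    success: bool
    data: Any = None
    error: Optional[str] = None


class BaseFetcherSketch(ABC):
    """Base class owns the request flow; subclasses only supply parameters."""

    def __init__(self, transport: Callable[[Dict[str, Any]], Any]) -> None:
        # The transport is injected so the sketch runs without a network.
        self._transport = transport

    @abstractmethod
    def _build_params(self, item_id: str) -> Dict[str, Any]:
        """Platform-specific request parameters."""

    def fetch(self, item_id: str) -> ApiResult:
        params = self._build_params(item_id)
        try:
            payload = self._transport(params)
        except Exception as exc:  # turn transport failures into a result object
            return ApiResult(False, error=str(exc))
        return ApiResult(True, data=payload)


class DemoFetcher(BaseFetcherSketch):
    def _build_params(self, item_id: str) -> Dict[str, Any]:
        return {"id": item_id}
```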
**BaseModel (Tortoise):**
- Purpose: Common ORM base with `BigIntField pk`, `to_dict()` async serializer with M2M support
- Examples: `app/models/base.py`
- Pattern: Mixed-in with `TimestampMixin` (created_at, updated_at auto fields)
## Entry Points
**Application Entry:**
- Location: `run.py`
- Triggers: `python run.py` — reads `APP_HOST`, `APP_PORT`, `UVICORN_WORKERS` env vars; calls `uvicorn.run("app:app")`
- Responsibilities: Configure uvicorn logging format, start ASGI server
**FastAPI App Factory:**
- Location: `app/__init__.py` → `create_app()`
- Triggers: Uvicorn import of `app:app`
- Responsibilities: Register middlewares, exception handlers, routers, static files; define lifespan
**Lifespan Bootstrap:**
- Location: `app/__init__.py` (lifespan context manager) + `app/core/init_app.py` (`init_data()`)
- Triggers: FastAPI startup event
- Responsibilities: Tortoise-ORM init + schema generation → Aerich migrations → seed data (superuser, menus, APIs, roles) → ClickHouse table init → APScheduler start
**Router Aggregation:**
- Location: `app/api/v1/__init__.py`
- Triggers: Imported by `app/api/__init__.py` → `app/__init__.py`
- Responsibilities: Mount all domain routers under `/api/v1/` with appropriate `DependPermission` guards
**ECS Pipeline:**
- Location: `ecs_full_pipeline.py`
- Triggers: Manual execution or scheduled via `scheduler.py` (`ecs_full_pipeline_job`) every 6 hours
- Responsibilities: Create Alibaba Cloud ECS spot instances, push spider scripts, execute crawlers, terminate instances
## Error Handling
**Strategy:** Exception handler registration at FastAPI app level; structured error propagation up through layers
**Patterns:**
- Tortoise `DoesNotExist` → 404 via `DoesNotExistHandle` (registered in `app/core/init_app.py`)
- Pydantic `RequestValidationError` → 422 with field-level detail via `RequestValidationHandle`
- `IntegrityError` (DB constraint) → 400 via `IntegrityHandle`
- Custom `HTTPException` (from `app/core/exceptions.py`) → structured JSON error via `HttpExcHandle`
- Service layer exceptions logged with `loguru` then re-raised or returned as `{"success": False, "error": "..."}`
- Scheduled jobs catch all exceptions, record failure via `_record_task_run()` to `ScheduledTaskRun` table, then continue
## Cross-Cutting Concerns
**Logging:** `loguru` — imported from `app/log/__init__.py`; structured `logger.info/error/warning` calls throughout; external spider logs to rotating daily files in `jobs_spider/*/logs/`
**Validation:** Pydantic v2 schemas in `app/schemas/` validate all inbound HTTP request bodies; ClickHouse inserts use registry-defined column extractors for type safety
**Authentication:** JWT HS256 tokens issued by `/api/v1/base/access_token`; verified per-request by `AuthControl.is_authed()` injected via `DependAuth` / `DependPermission`; dev bypass (`token: dev`) available when `APP_ENV=development`
**Audit Logging:** `HttpAuditLogMiddleware` records all non-excluded mutating requests (method, path, user, request args, response body, timing) to MySQL `auditlog` table
---
*Architecture analysis: 2026-03-21*


@ -0,0 +1,255 @@
# Codebase Concerns
**Analysis Date:** 2026-03-21
---
## Security Considerations
### CRITICAL: Hardcoded Credentials in Source Code
**Area:** Configuration
- Risk: Real database passwords, SMTP credentials, and IP addresses committed to version control.
- Files: `app/settings/config.py`
- Exposed values (lines 27-52):
- MySQL root password (`jobdata123`) with public IP `121.4.126.241:3306`
- ClickHouse user/pass defaults (`data_user` / `data_pass`) with same public IP
- QQ SMTP app password (`jkncedjzkummbbdi`) and recipient email addresses
- Current mitigation: `pydantic-settings` allows environment variable override, but defaults are live credentials that will be exposed if the repo is shared or CI logs are leaked.
- Recommendations: Remove all real default values; replace with `None` or clearly fake placeholders; enforce required-field validation at startup; rotate exposed credentials immediately.
### CRITICAL: Unauthenticated Data Ingest Endpoints
**Area:** API Authentication
- Risk: Any actor with network access can push arbitrary data directly into ClickHouse.
- Files: `app/api/v1/__init__.py` (lines 32-34), `app/api/v1/job/job.py`
- The `/ingest`, `/job`, and `/universal` router groups are mounted **without** any `DependPermission` or `DependAuth` dependency. The data store, batch-store, and batch-store-async endpoints accept unauthenticated POST requests.
- Trigger: Send a POST to `/api/v1/universal/data/batch-store-async` with any payload. No token required.
- Workaround: The CLAUDE.md acknowledges these as "internal call" endpoints, but they are publicly reachable with no network-level isolation noted.
- Recommendations: Add at minimum IP allowlist middleware or a static shared secret header check for ingest routes; or mount them behind `DependAuth`.
### HIGH: Dev Token Bypass in Production Auth
**Area:** Authentication
- Risk: If `APP_ENV=development`, any request with `token: dev` header is granted the first user's full privileges.
- Files: `app/core/dependency.py` (lines 28-30)
- The check `os.getenv("APP_ENV", "production") == "development"` defaults to `"production"`, but there is no enforcement that this variable is set correctly in production deployments. A misconfigured environment would expose the bypass.
- Recommendations: Remove the dev bypass entirely; use test fixtures instead; add a startup assertion that the bypass is disabled in non-dev environments.
### HIGH: SQL Injection via f-string Queries
**Area:** ClickHouse Queries
- Risk: User-controlled values inserted directly into query strings allow injection.
- Files:
- `reclean_qcwy_jobs.py` (line 34): `f"SELECT json_data FROM job_data.qcwy_job WHERE job_id = '{job_id}' LIMIT 1"`; `job_id` comes from URL parsing of user-provided input.
- `app/repositories/clickhouse_repo.py` (lines 101-108): `group_by_column` parameter is interpolated directly into the SELECT and GROUP BY clauses without validation.
- `app/services/ingest/service.py` (lines 109, 112): `table` value is derived from config (lower risk) but still concatenated into f-strings.
- `app/core/scheduler.py` (lines 57, 63): `table` names from a fixed list, but pattern is risky.
- Current mitigation: ClickHouse parameterized queries are used in `dedup.py`, but inconsistently applied elsewhere.
- Recommendations: Use parameterized queries (`parameters={}`) for all user-influenced values; validate `group_by_column` against an allowlist; never interpolate user input into SQL strings.
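The two recommendations (an allowlist for identifiers, bound parameters for values) can be combined as in this sketch. The function name `build_group_count_query` and the exact column set are assumptions, and the `{source:String}` placeholder follows ClickHouse's named query-parameter form; check it against the `clickhouse_connect` version in use before relying on it.

```python
from typing import Any, Dict, Tuple

# Assumed allowlist; extend it with the columns the analytics view really exposes.
ALLOWED_GROUP_COLUMNS = frozenset({"source", "city", "channel"})


def build_group_count_query(group_by_column: str, source: str) -> Tuple[str, Dict[str, Any]]:
    """Return (sql, parameters) with the identifier validated and the value bound."""
    if group_by_column not in ALLOWED_GROUP_COLUMNS:
        raise ValueError(f"unsupported group_by_column: {group_by_column!r}")
    # Identifiers cannot be bound as parameters, so only an allowlisted name
    # is ever interpolated. The value placeholder is a plain (non-f) string.
    sql = (
        f"SELECT {group_by_column}, count() AS cnt "
        "FROM job_data.job_analytics "
        "WHERE source = {source:String} "
        f"GROUP BY {group_by_column}"
    )
    return sql, {"source": source}
```

The returned pair would then be handed to the client as the query text and its `parameters` mapping, so user input never becomes part of the SQL string.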
### MEDIUM: CORS Misconfiguration
**Area:** CORS Policy
- Risk: `CORS_ALLOW_CREDENTIALS = False` while `CORS_ALLOW_HEADERS = ["*"]` is set. If credentials are later enabled, wildcard headers become a security liability.
- Files: `app/settings/config.py` (lines 13-16)
- Recommendations: Restrict `CORS_ORIGINS` to known production domains; avoid wildcard headers.
### MEDIUM: Hardcoded Public IP in Spider Default
**Area:** Spider Configuration
- Risk: Hardcoded IP `124.222.106.226:9999` as the default `API_BASE_URL` in all spider scripts and `spiderJobs/` runner. If that IP changes hands or is exposed, all crawled data goes to an unintended recipient.
- Files:
- `jobs_spider/boss/boos_api.py` (line 17)
- `jobs_spider/qcwy/qcwy.py` (line 28)
- `jobs_spider/zhilian/zhilian_single.py` (line 41)
- `spiderJobs/runner/api_client.py` (line 31)
- Recommendations: Remove hardcoded IP defaults; require `API_BASE_URL` to be explicitly set via environment variable with no default.
---
## Tech Debt
### Duplicate Spider Implementations
**Area:** Code Organization
- Issue: Two parallel spider hierarchies exist for the same three platforms.
- `jobs_spider/boss/`, `jobs_spider/qcwy/`, `jobs_spider/zhilian/` — original scripts
- `spiderJobs/platforms/boss/`, `spiderJobs/platforms/job51/`, `spiderJobs/platforms/zhilian/` — refactored version
- `app/services/crawler/` — yet another crawler abstraction inside the app
- Files: `jobs_spider/`, `spiderJobs/`, `app/services/crawler/`
- Impact: Bug fixes and anti-detection improvements must be applied in multiple places; the relationship between these directories is not documented.
- Fix approach: Decide on one canonical implementation; deprecate and delete the others; update CLAUDE.md to clarify which is the active spider.
### God File: `jobs_spider/boss/boos_api.py`
**Area:** File Size / Complexity
- Issue: Single file containing 2281 lines covering API client, IP strategy, anomaly detection, session management, main loop, and push logic.
- Files: `jobs_spider/boss/boos_api.py`
- Impact: Extremely difficult to test or modify any single responsibility; cognitive load is very high.
- Fix approach: Extract `SmartIPManager`, `IPAnomalyDetector`, `IPStrategyConfig` into separate modules; separate the main crawl loop from the API client class.
### `CleaningService` and `CompanyCleaner` Code Duplication
**Area:** Service Layer
- Issue: Both classes replicate identical token-loading logic (`_ensure_boss_token_loaded`), proxy-setting logic (`_apply_proxy`), and service instantiation (BossService, QcwyService, ZhilianService).
- Files:
- `app/services/cleaning.py` (lines 37-48, 31-34)
- `app/services/company_cleaner.py` (lines 60-74, 54-58)
- Impact: Token refresh logic is maintained in two places; a bug fix in one will not propagate to the other.
- Fix approach: Extract a shared `CrawlerContext` or `CrawlerMixin` class providing token management and proxy configuration.
### ClickHouse Schema Changes Require Manual DDL
**Area:** Database Migrations
- Issue: ClickHouse table definitions in `app/core/clickhouse_init.py` use `CREATE TABLE IF NOT EXISTS`, meaning schema changes (new columns, index changes) are never automatically applied to existing tables. There is no migration tooling equivalent to Aerich for ClickHouse.
- Files: `app/core/clickhouse_init.py`
- Impact: If a column is added to the DDL, existing production tables silently retain the old schema; queries referencing new columns will fail at runtime.
- Fix approach: Implement `ALTER TABLE ... ADD COLUMN IF NOT EXISTS` statements for schema evolution; or adopt a ClickHouse migration tool (e.g., `ch-migrations`).
### Startup File Lock Not Robust to Crashes
**Area:** Multi-Worker Initialization
- Issue: `init_data()` uses `os.mkdir(".startup_lock")` as a bare file lock with no TTL. If a worker crashes mid-initialization, the lock directory is never cleaned up, and subsequent restarts will skip all migrations and seed data.
- Files: `app/core/init_app.py` (lines 322-349)
- Impact: Silent data-initialization failure on restart after crash; can result in missing admin user or broken menu state.
- Fix approach: Replace with the existing `DistributedLock` class from `app/core/locks.py` which has TTL support; or at minimum add a time-based stale-lock cleanup.
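The "time-based stale-lock cleanup" fallback could look like the sketch below. The function names and the use of the directory's modification time as the lock timestamp are assumptions. Note that this simple version still has a small window between the stale check and the re-create, so it reduces the crash problem but is not a substitute for a lock with proper TTL support.

```python
import os
import time


def acquire_dir_lock(path: str, ttl_seconds: float) -> bool:
    """Try to take a directory lock, reclaiming it if it is older than the TTL."""
    try:
        os.mkdir(path)  # atomic: fails if the directory already exists
        return True
    except FileExistsError:
        pass
    try:
        age = time.time() - os.path.getmtime(path)
    except FileNotFoundError:
        return False  # released in between; the caller may simply retry
    if age <= ttl_seconds:
        return False  # lock is held and still fresh
    try:
        # Stale lock left behind by a crashed worker: remove and re-acquire.
        os.rmdir(path)
        os.mkdir(path)
        return True
    except OSError:
        return False  # another worker reclaimed it first


def release_dir_lock(path: str) -> None:
    """Release the lock; a missing directory is not an error."""
    try:
        os.rmdir(path)
    except FileNotFoundError:
        pass
```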
---
## Performance Bottlenecks
### ClickHouse Singleton Connection Per-Worker
**Area:** Database Connections
- Problem: `clickhouse_manager` in `app/core/clickhouse.py` is a module-level singleton that lazily creates one connection. With 20 uvicorn workers (`UVICORN_WORKERS=20`), each worker maintains its own connection, and the connection pool inside each is limited to `maxsize=5`. Under concurrent load, requests block waiting for the pool.
- Files: `app/core/clickhouse.py` (lines 31-58)
- Cause: A new `PoolManager` is created on every `get_clickhouse_client()` call but the singleton pattern means `get_client()` only calls this once per worker — pool configuration cannot be tuned per endpoint.
- Improvement path: Use a connection pool library or connection factory pattern; expose pool size as a configurable setting; consider `clickhouse-pool` for proper async pooling.
### Dedup Query Scans Entire Table
**Area:** ClickHouse Deduplication
- Problem: `batch_dedup_filter` in `app/services/ingest/dedup.py` issues `SELECT key_col FROM table WHERE key_col IN (...)` with no time-bound filter. For large tables this is a full-table scan on each ingest batch.
- Files: `app/services/ingest/dedup.py` (lines 51, 65)
- Cause: No `PREWHERE created_at > now() - INTERVAL N DAY` constraint added.
- Improvement path: Add a configurable recency window to dedup queries (e.g., 30-day lookback matching the existing `days_back` logic in `company_cleaner.py`).
### Synchronous Crawler Calls Blocking Async Workers
**Area:** Concurrency
- Problem: All crawler service methods (BossService, QcwyService, ZhilianService) use synchronous `requests` internally, wrapped with `asyncio.to_thread()`. Under high concurrency, this can exhaust the thread pool.
- Files: `app/services/cleaning.py` (lines 186, 190, 194, etc.), `app/services/company_cleaner.py` (lines 318, 321, 325)
- Cause: The underlying crawler libraries use `requests` (sync), not `httpx` (async).
- Improvement path: Migrate crawlers to `httpx.AsyncClient` for true async I/O, or cap thread pool size via `asyncio.to_thread` semaphore.
---
## Fragile Areas
### `group_by_column` API Parameter Lacks Allowlist Validation
**Area:** Analytics API
- Files: `app/repositories/clickhouse_repo.py` (lines 86-115), `app/api/v1/analytics.py`
- Why fragile: `group_by_column` is passed directly into the ClickHouse query without validation against allowed column names. An invalid column name causes a runtime ClickHouse error that propagates as a 500 to the client; a malicious value enables injection.
- Safe modification: Add an `ALLOWED_GROUP_COLUMNS = frozenset({"source", "city", "channel", ...})` check before query construction.
- Test coverage: No tests for analytics query building.
### `DistributedLock` File Lock Has Race Window
**Area:** Distributed Locking
- Files: `app/core/locks.py` (lines 64-92)
- Why fragile: The stale-lock cleanup path (lines 81-87) involves `shutil.rmtree` followed by `mkdir`, a TOCTOU race where two workers can both detect staleness, both delete the directory, and both attempt to re-create it. One wins, one raises `FileExistsError` and falls through to return `False`, but the winner's lock is not guaranteed to be the intended holder.
- Safe modification: Use `os.rename` from a temp directory as an atomic lock-acquire primitive.
- Test coverage: None.
### `CleaningService` Returns Mixed Types (`bool | Dict`)
**Area:** Type Consistency
- Files: `app/services/cleaning.py` (lines 158-165, 167-202, 204-239, 241-268)
- Why fragile: Multiple `clean_*` methods return `Union[bool, Dict[str, Any]]`. Callers must defensively check `isinstance(result, bool)` before accessing dict keys (see `process_single_item` at lines 128-135). This makes the API fragile: any new call site that forgets the `bool` check will raise `AttributeError`.
- Safe modification: Define a `CleanResult` dataclass or TypedDict; always return it.
- Test coverage: None.
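A single result type along the lines suggested above might look like this sketch. The field names and the `normalize_clean_result` adapter are assumptions; the adapter only shows how existing `bool | Dict` return values could be migrated incrementally.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Union


@dataclass
class CleanResult:
    success: bool
    data: Dict[str, Any] = field(default_factory=dict)
    error: Optional[str] = None


def normalize_clean_result(raw: Union[bool, Dict[str, Any]]) -> CleanResult:
    """Wrap a legacy bool-or-dict return value in one uniform type."""
    if isinstance(raw, bool):
        return CleanResult(success=raw)
    # Assumption: legacy dicts may carry "success" and "error" keys.
    return CleanResult(
        success=bool(raw.get("success", True)),
        data=raw,
        error=raw.get("error"),
    )
```

Call sites can then read `result.success` and `result.data` without an `isinstance` check.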
### `company_cleaning_job` Assumes 5-Minute Completion
**Area:** Scheduler
- Files: `app/core/scheduler.py` (lines 171-202)
- Why fragile: The job comment explicitly notes it processes 50 collect + 30 process operations and "should complete in 2-3 minutes" to fit within the 5-minute schedule window. If the external crawlers slow down (rate-limited, network issues), the job overruns its window, causing the next scheduled run to be blocked by the distributed lock, leading to queue backlog.
- Safe modification: Add timeout guards around `collect_pending_companies` and `process_pending_companies`; or dynamically reduce batch sizes based on elapsed time.
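A timeout guard of the kind suggested here can be sketched with `asyncio.wait_for`. The helper name `run_with_timeout` and the choice to return `None` on timeout are assumptions for the example.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def run_with_timeout(
    step: Callable[[], Awaitable[T]], timeout_seconds: float
) -> Optional[T]:
    """Run one job step, giving up (and returning None) after the time budget."""
    try:
        return await asyncio.wait_for(step(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # The step was cancelled; the caller can record the overrun and move on.
        return None
```

One caveat: if the step ultimately waits on `asyncio.to_thread`, cancelling it stops the waiting but does not interrupt the worker thread, so the underlying HTTP call should carry its own request timeout as well.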
---
## Missing Critical Features
### No Rate Limiting on Any API Endpoint
**Problem:** FastAPI has no rate-limiting middleware configured. The data ingest endpoints (which have no auth) and the cleaning trigger endpoints can be called at arbitrary rates.
- Blocks: Protection against accidental or malicious flooding of ClickHouse with writes.
- Recommendation: Add `slowapi` or custom middleware with per-IP rate limiting on ingest and crawl-triggering endpoints.
### No Input Size Limit on Batch Ingest
**Problem:** `IngestBatchRequest.data_list` has no `max_items` constraint. A single request with tens of thousands of job records will be processed synchronously (or queued as a background task) with no guard on memory consumption.
- Files: `app/schemas/ingest.py`, `app/api/v1/job/job.py`
- Recommendation: Add `Field(max_length=1000)` or equivalent to `data_list`; add body size limit in uvicorn/nginx.
---
## Test Coverage Gaps
### Service Layer: Virtually Untested
- What's not tested: `CleaningService`, `CompanyCleaner`, `IngestService`, `AnalyticsService`, `DistributedLock`, `ClickHouseManager`
- Files: `app/services/cleaning.py`, `app/services/company_cleaner.py`, `app/services/ingest/service.py`, `app/services/analytics_service.py`, `app/core/locks.py`, `app/core/clickhouse.py`
- Risk: Core data routing, deduplication, and cleaning logic can regress silently. The existing tests only cover two pure-function helpers (`extract_company_fields`, `normalize_company_id`, `_extract_*_jobs`).
- Priority: **High**
### API Layer: No Integration Tests
- What's not tested: All FastAPI route handlers, auth flow, permission enforcement, error responses.
- Files: `app/api/v1/` (all modules)
- Risk: Auth bypass regressions, schema changes breaking clients, and incorrect HTTP status codes go undetected.
- Priority: **High**
### Spider Logic: No Unit Tests
- What's not tested: `IPAnomalyDetector.detect()`, `SmartIPManager` state transitions, response parsing in `BossZhipinAPI`, `QcwyApi`, `ZhilianApi`.
- Files: `jobs_spider/boss/boos_api.py`, `jobs_spider/qcwy/qcwy.py`, `jobs_spider/zhilian/zhilian_single.py`
- Risk: Anti-detection logic breaks silently when platforms change their response format.
- Priority: **Medium**
### ClickHouse Schema: No Schema Evolution Tests
- What's not tested: Behavior of `clickhouse_init.py` when tables already exist with old schemas; column migration paths.
- Files: `app/core/clickhouse_init.py`
- Risk: Production schema drift goes unnoticed until a query fails at runtime.
- Priority: **Medium**
---
## Known Issues / Observed Bugs
### Silent Error Swallowing in Spider Code
- Symptoms: Data silently not collected; no error surfaced to caller.
- Files: `jobs_spider/qcwy/qcwy.py` (lines 178-179, 191-192, 1047-1048, 1146-1147), `jobs_spider/qcwy/search_company_jobs.py` (lines 240-241, 512-513, 538-539, 601-602), `jobs_spider/boss/boos_api.py` (lines 400, 407-408, 508-509, 518-519, 1779-1780), `jobs_spider/zhilian/company_spider.py` (many `except: pass` blocks)
- Trigger: Network errors, parsing errors, or unexpected response shapes in spider code are caught by bare `except Exception: pass` blocks.
- Workaround: Check logs for INFO-level skips; monitor push success counts.
### `_send_email` Uses Blocking SMTP in Async Context
- Symptoms: Scheduler event loop blocked for up to several seconds during email send.
- Files: `app/core/scheduler.py` (lines 308-336)
- Trigger: Any `stats_job` or `ip_alert_job` execution that sends email calls synchronous `smtplib.SMTP` from the async event loop without wrapping in `asyncio.to_thread`.
- Workaround: Email delivery may be delayed; event loop latency spikes during scheduled runs.
### `cleanup_old_records` Deletes All Done/Failed Records Without Archive
- Symptoms: Historical processing data is permanently deleted daily; no audit trail of what was cleaned.
- Files: `app/services/company_cleaner.py` (lines 329-331), `app/core/scheduler.py` (line 220)
- Trigger: `daily_cleanup_job` calls `CompanyCleaningQueue.filter(status__in=["done", "failed"]).delete()` — a hard delete with no archiving or retention window.
---
*Concerns audit: 2026-03-21*


@ -0,0 +1,212 @@
# Coding Conventions
**Analysis Date:** 2026-03-21
## Naming Patterns
**Files:**
- Snake_case for all Python files: `company_storage.py`, `company_cleaner.py`, `clickhouse_repo.py`
- Private/internal modules prefixed with underscore: `_base.py`, `_boss_api.py`, `_boss_client.py`, `_boss_sign.py`, `_http_client.py`
- Platform-named service files: `boss.py`, `qcwy.py`, `zhilian.py` under `app/services/crawler/`
- Router files named after domain: `keyword.py`, `analytics.py`, `cleaning.py`
**Classes:**
- PascalCase throughout: `CleaningService`, `KeywordController`, `ClickHouseBaseRepo`, `JobAnalyticsRepo`
- Services: `{Domain}Service`, e.g. `BossService`, `QcwyService`, `ZhilianService`, `IngestService`, `AnalyticsService`
- Controllers: `{Domain}Controller`, e.g. `KeywordController`
- Repos: `{Domain}Repo` or `{Domain}BaseRepo`, e.g. `ClickHouseBaseRepo`, `JobAnalyticsRepo`
- Models (Tortoise ORM): `{Platform}{Entity}`, e.g. `BossKeyword`, `QcwyCompany`, `ZhilianCompany`
- Schemas (Pydantic): `{Entity}Base`, `{Entity}Create`, `{Entity}Update`, `{Entity}Out` — see `app/schemas/keyword.py`
**Functions and Methods:**
- Snake_case for all functions and methods: `get_available`, `report_page_progress`, `store_batch`, `build_insert_row`
- Private helpers prefixed with underscore: `_apply_proxy`, `_ensure_boss_token_loaded`, `_pick_first`, `_nested_get`, `_clean_text`, `_model_for_source`
- Async dependency factories follow pattern `get_{service/controller}()`: `get_ingest_service`, `get_analytics_service`, `get_keyword_controller`
**Variables:**
- Snake_case: `data_list`, `platform_type`, `check_duplicate`, `page_size`
- Module-level constants: UPPER_SNAKE_CASE — `COMPANY_SOURCES`, `QUEUE_TERMINAL_STATUSES`
- Class-level constants: UPPER_SNAKE_CASE prefixed with `_`, e.g. `_TOKEN_REFRESH_INTERVAL = 3600`
**Types and Enums:**
- Enums use PascalCase class name, UPPER_SNAKE_CASE values: `PlatformType.BOSS`, `ChannelType.MINI`, `DataType.JOB`
- Enum values are lowercase strings matching URL slugs: `"boss"`, `"mini"`, `"job"` — see `app/schemas/ingest.py`
- Enums inherit from `(str, Enum)` enabling direct string comparison
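The effect of the `(str, Enum)` base can be seen in a short sketch. `PlatformType.BOSS` and its `"boss"` value come from the conventions above; the `QCWY` and `ZHILIAN` members are assumed here from the platform names used elsewhere and may not match `app/schemas/ingest.py` exactly.

```python
from enum import Enum


class PlatformType(str, Enum):
    """Members are also str instances, so they compare equal to their values."""

    BOSS = "boss"
    QCWY = "qcwy"        # assumed member name
    ZHILIAN = "zhilian"  # assumed member name


def table_for(platform: str) -> str:
    """Accepts either a raw URL slug or an enum member."""
    return f"{PlatformType(platform).value}_job"
```

Because each member is a real `str`, a path parameter such as `"boss"` can be compared against `PlatformType.BOSS` directly, and `PlatformType("boss")` looks the member up by value.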
## Code Style
**Formatting:**
- Tool: `black` v24.10.0
- Line length: 120 characters (set in `pyproject.toml` `[tool.black]` and `[tool.ruff]`)
- Target Python versions: 3.10, 3.11 (black), 3.13 (Pipfile)
**Linting:**
- Tool: `ruff` v0.9.1 (configured in `pyproject.toml`)
- Ignored rules: `F403` (star imports), `F405` (may be undefined from star import)
- Star imports from internal modules are allowed (used in `app/models/__init__.py`, `app/services/ingest/__init__.py`)
**Import Sorting:**
- Tool: `isort` v5.13.2
- No explicit isort config found; follows default ordering
## Import Organization
**Order:**
1. Standard library (`from __future__`, `os`, `re`, `typing`, `datetime`, `json`)
2. Third-party (`fastapi`, `pydantic`, `tortoise`, `loguru`, `clickhouse_connect`)
3. Internal app imports (`from app.core.`, `from app.models.`, `from app.services.`, `from app.schemas.`)
**Example from `app/api/v1/analytics.py`:**
```python
from typing import Optional
from datetime import datetime, date, timezone
from zoneinfo import ZoneInfo
from fastapi import APIRouter, Depends, Query
from app.core.clickhouse import clickhouse_manager
from app.services.analytics_service import AnalyticsService
from app.schemas.analytics import JobStatisticsResponse
```
**Path Aliases:**
- None; all imports use full `app.` prefix paths
- `from app.log import logger` is the canonical loguru import path
**Star Imports:**
- Used only in `__init__.py` re-export files: `from .admin import *` in `app/models/__init__.py`
- `# noqa: F401, F403` comments suppress lint warnings for intentional star imports
## Error Handling
**Patterns:**
- Services return `Dict[str, Any]` result objects with `"success"`, `"code"`, `"message"` fields instead of raising exceptions to callers
- Controllers return dict with `"code": 200/400/404` and `"message"` for all outcomes
- API route handlers do NOT use try/except — they rely on services returning structured results
- Service methods wrap low-level calls in `try/except Exception as e`, log the error, and return `False` or an error dict
**Service-level error handling example** (`app/services/cleaning.py`):
```python
except Exception as e:
    logger.error(f"Error processing item {target}: {e}")
    return {
        "success": False,
        "target": target,
        "error": str(e),
        "storage_status": "error",
        "remote_sent": False
    }
```
**Repository-level:** `ClickHouseBaseRepo` does not swallow exceptions; they propagate to the service layer.
**Auth exceptions:** `app/core/dependency.py` raises `HTTPException(status_code=401/403)` directly — the standard FastAPI pattern for auth failures.
## Logging
**Framework:** `loguru` v0.7.3
**Import:** `from app.log import logger` (centralized re-export) or `from loguru import logger` (direct)
**Patterns:**
- `logger.info(f"...")` for normal operation events
- `logger.warning(f"...")` for non-fatal recoverable issues (e.g., token not found, API soft failures)
- `logger.error(f"...")` for caught exceptions and operation failures
- F-string interpolation used consistently for message formatting
- No structured fields (no `logger.bind()` usage observed)
**Example:**
```python
logger.info(f"获取招聘详情: {job_id}")
logger.warning(f"Boss get_job_detail failed: {result.error}")
logger.error(f"批量插入失败: {e}")
```
## API Response Format
**Two response styles coexist:**
**Style 1 — Direct dict return** (most routes in new modules like `app/api/v1/job/job.py`, `app/api/v1/analytics.py`):
```python
return {"code": 200, "data": result, "message": "ok"}
```
**Style 2 — JSONResponse subclasses** (older RBAC routes, defined in `app/schemas/base.py`):
```python
Success(code=200, msg="OK", data=data)
Fail(code=400, msg="error message")
SuccessExtra(code=200, data=data, total=100, page=1, page_size=20)
```
**Paginated responses** include: `code`, `data` (list), `total`, `page`, `page_size`
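A paginated envelope with these fields could be assembled as in the sketch below. The `paginate` helper is hypothetical, and in-memory slicing stands in for what would normally be a `LIMIT`/`OFFSET` query against the database.

```python
from typing import Any


def paginate(items: list[Any], page: int, page_size: int) -> dict[str, Any]:
    """Build the paginated response envelope for a 1-indexed page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    start = (page - 1) * page_size
    return {
        "code": 200,
        "data": items[start:start + page_size],
        "total": len(items),  # total row count, not the size of this page
        "page": page,
        "page_size": page_size,
    }
```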
## Comments
**When to Comment:**
- Docstrings on public methods describing purpose, not implementation: `"""获取可用关键词,优先返回断点续爬和失败重试的关键词"""`
- Inline comments for priority logic and algorithm steps: `# 优先级 1: 断点续爬 (partial)`
- Module-level docstrings for context: `"""Boss直聘 Service — 基于新算法文件的封装"""`
- `# noqa` comments for intentional lint suppressions
**JSDoc/TSDoc:**
- Not applicable (Python backend)
- Docstrings are brief single-line or short multi-line Chinese descriptions
## Function Design
**Size:** Functions tend to be 10-50 lines; service methods like `process_single_item` in `app/services/cleaning.py` grow to ~70 lines due to multi-platform dispatch
**Parameters:**
- Keyword arguments with defaults preferred for optional params
- Pydantic schemas used for HTTP request bodies (never raw dicts from router params)
- `Optional[str]` with `= None` default for optional parameters
**Return Values:**
- Services return `Dict[str, Any]` with consistent keys (`code`, `message`, `data`)
- Private helpers return `Optional[T]` or primitive types
- Async functions return awaitable results (no mixing of sync/async)
## Module Design
**Exports:**
- `app/models/__init__.py` uses `from .{module} import *` to flatten model imports
- Router modules export a single named router variable: `router = APIRouter(...)` or `{domain}_router = APIRouter(...)`
- Service classes are imported directly by name
**Barrel Files (`__init__.py`):**
- `app/models/__init__.py` — re-exports all model classes
- `app/services/ingest/__init__.py` — re-exports `IngestService` and config registrations
- `app/api/v1/__init__.py` — aggregates all routers into `v1_router`
**Dependency Injection:**
- FastAPI `Depends()` used for service/controller instantiation in route handlers
- Dependency factory functions named `get_{service}()` and defined in the same file as the router
- Shared auth dependencies: `DependAuth`, `DependPermission` in `app/core/dependency.py`
## Tortoise ORM Model Conventions
**Base class:** All models inherit from `app/models/base.py:BaseModel` (which extends `tortoise.models.Model` with `id = BigIntField(pk=True)`)
**Timestamp mixin:** `TimestampMixin` adds `created_at` (auto_now_add) and `updated_at` (auto_now) — applied via multiple inheritance
**Abstract base models:** Platform variants use abstract base + concrete subclasses:
```python
class BaseKeyword(Model):  # abstract = True in Meta
    ...


class BossKeyword(BaseKeyword):
    class Meta:
        table = "boss_keyword"
```
**Field descriptions:** All fields include `description=` parameter for documentation
## Pydantic Schema Conventions
- All schemas inherit from `pydantic.BaseModel`
- All fields use `Field(...)` with `description=` for documentation
- Enums inherit from `(str, Enum)` for JSON serialization compatibility
- Output schemas include `class Config: from_attributes = True` to support ORM mode
- Validation patterns use `Field(..., pattern="^(boss|qcwy|zhilian)$")` for enum-like string fields
---
*Convention analysis: 2026-03-21*


@ -0,0 +1,165 @@
# External Integrations
**Analysis Date:** 2026-03-21
## APIs & External Services
**Recruitment Platforms (scraped, not official APIs):**
- Boss Zhipin (Boss直聘) - Job and company data via WeChat mini-program endpoints
- Client: `requests` (sync) in `jobs_spider/boss/boos_api.py`
- Endpoints scraped: `/wapi/zpgeek/miniapp/search/joblist.json`, `/wapi/batch/batchRunV3`, `/wapi/zpgeek/miniapp/brand/detail.json`
- Auth: MPT Token (stored in MySQL `boss_token` table, managed via `app/api/v1/token/`)
- Anti-detection: `SmartIPManager`, `IPAnomalyDetector`, random 10-20s delays, session rebuilding on ban
- JS signing: `PyExecJS` executes JS sign algorithms (`jobs_spider/boss/boos_api.py`)
- 51job / Qiancheng Wuyou (前程无忧) - Job data via platform API
- Client: `httpx` + `requests` in `jobs_spider/qcwy/qcwy.py`
- Auth: HMAC signature computed per-request
- Proxy: Hardcoded proxy URL present in `jobs_spider/qcwy/qcwy.py` (line 31) - security risk
- Zhilian Zhaopin (智联招聘) - Job and company data
- Client: `requests` in `jobs_spider/zhilian/zhilian_single.py`
- Sign client: `app/services/crawler/_zhilian_sign.py`
**Qixin.com (企信) - Company data enrichment (disabled):**
- Endpoint: `http://external-data.qixin.com/extend/extend_data_push`
- Auth: MD5 HMAC with timestamp + salt (salt hardcoded in `app/services/ingest/remote_push.py` line 48)
- Status: Push code commented out; function `push_to_remote()` currently only logs
**Webhook / Report Endpoint:**
- Outgoing: Configurable via `REPORT_ENDPOINT` env var (default: none)
- Used by: `app/core/scheduler.py` `_post_with_retry()` to POST stats JSON every 6 hours
- Timeout: `REPORT_TIMEOUT` (default 10s), retries: `REPORT_MAX_RETRIES` (default 3)
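The timeout and retry behaviour described for the report endpoint can be sketched with the standard library alone. This is not the body of `_post_with_retry`; the exponential backoff, the reading of `REPORT_MAX_RETRIES` as a total attempt count, and the injectable `send`/`sleep` parameters are assumptions made so the example is testable offline.

```python
import json
import logging
import time
import urllib.request

logger = logging.getLogger("report")


def post_json(url: str, payload: dict, timeout: float) -> int:
    """POST payload as JSON and return the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def post_with_retry(url, payload, timeout=10.0, max_attempts=3, backoff=1.0,
                    send=post_json, sleep=time.sleep) -> bool:
    """Try up to max_attempts times, doubling the wait between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            status = send(url, payload, timeout)
            if 200 <= status < 300:
                return True
            logger.warning("attempt %d: unexpected status %d", attempt, status)
        except OSError as exc:
            # URLError, HTTPError and socket timeouts all derive from OSError.
            logger.warning("attempt %d failed: %s", attempt, exc)
        if attempt < max_attempts:
            sleep(backoff * 2 ** (attempt - 1))
    return False
```

Inside the async scheduler a blocking helper like this would need to run in a worker thread, or be replaced by an async HTTP client.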
## Data Storage
**Databases:**
MySQL (Primary business database):
- Connection: `mysql://root:jobdata123@121.4.126.241:3306/job_data` (default hardcoded in `app/settings/config.py`; override via `TORTOISE_ORM` env or individual components)
- Client: Tortoise-ORM 0.23.0 with aiomysql driver
- Migrations: Aerich (`migrations/` directory, runs on startup if `RUN_MIGRATIONS_ON_STARTUP=True`)
- Tables: `user`, `role`, `api`, `menu`, `dept`, `auditlog`, `boss_token`, `cleaning_*`, `scheduled_task_run`, `stats_total`, `ip_upload_stats`
- Models: `app/models/admin.py`, `app/models/token.py`, `app/models/cleaning.py`, `app/models/metrics.py`, `app/models/keyword.py`, `app/models/company.py`
ClickHouse (Analytics / raw job data):
- Connection: host `121.4.126.241`, port `8123`, DB `job_data` (configured in `app/settings/config.py`)
- Env vars: `CLICKHOUSE_HOST`, `CLICKHOUSE_PORT`, `CLICKHOUSE_USER`, `CLICKHOUSE_PASS`, `CLICKHOUSE_DB`
- Client: `clickhouse-connect==0.7.19` async client (`app/core/clickhouse.py`)
- Connection pool: `urllib3.PoolManager` with `num_pools=2`, `maxsize=5` per worker
- Tables managed via DDL in `app/core/clickhouse_init.py` (NOT Aerich migrations)
- Tables: `boss_job`, `boss_company`, `qcwy_job`, `qcwy_company`, `zhilian_job`, `zhilian_company` (MergeTree engine), `pending_company` (ReplacingMergeTree), `job_analytics` (VIEW)
- Queries: Repository pattern in `app/repositories/clickhouse_repo.py`
**File Storage:**
- Local filesystem only (no S3 / object storage detected)
- Static files served from `static/` directory mounted via FastAPI (`app/__init__.py`)
- Spider logs written to `logs/` directory relative to spider working dir
- ECS pipeline log: `ecs_full_pipeline.log` (root project directory)
**Caching:**
- Redis (optional): Used only for distributed locking in `app/core/locks.py`
- Env vars: `REDIS_HOST`, `REDIS_PORT` (default 6379), `REDIS_DB` (default 0), `REDIS_PASS`
- Falls back to filesystem-based TTL locks when Redis unavailable
- Lock files stored in system temp directory: `{tempdir}/jobdata_lock_{name}/`
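A filesystem TTL lock of the kind described above can be built on the atomicity of `os.mkdir`. The sketch below is an assumption about how such a fallback might work, not the contents of `app/core/locks.py`; only the `{tempdir}/jobdata_lock_{name}` path layout is taken from this document, and the stale-lock takeover is not fully race-free when two processes expire the same lock at once.

```python
import os
import shutil
import tempfile
import time


def lock_dir(name: str) -> str:
    """Directory whose existence represents the held lock."""
    return os.path.join(tempfile.gettempdir(), f"jobdata_lock_{name}")


def acquire(name: str, ttl: float, now=time.time) -> bool:
    """Take the lock, or take it over if the current holder exceeded ttl seconds."""
    path = lock_dir(name)
    for _ in range(2):  # the second pass runs only after a stale lock was removed
        try:
            os.mkdir(path)  # atomic: raises if the directory already exists
            return True
        except FileExistsError:
            try:
                expired = now() - os.path.getmtime(path) > ttl
            except FileNotFoundError:
                continue  # the holder released it in between; try again
            if not expired:
                return False
            shutil.rmtree(path, ignore_errors=True)
    return False


def release(name: str) -> None:
    """Drop the lock; harmless if it is not held."""
    shutil.rmtree(lock_dir(name), ignore_errors=True)
```

The TTL guards against a crashed worker holding the lock forever, which matters when no Redis key expiry is available to do that job.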
## Authentication & Identity
**Auth Provider:** Custom JWT implementation (no third-party auth provider)
- Implementation: `app/core/dependency.py` (`DependPermission` dependency)
- Algorithm: HS256
- Expiry: 7 days (60 * 24 * 7 minutes, `app/settings/config.py`)
- Secret: `SECRET_KEY` (default `"CHANGE_ME_DEV_ONLY"` - must override in production)
- Storage (frontend): JWT stored in `localStorage`, injected via axios request interceptor (`web/src/utils/http/interceptors.js`)
- RBAC: Route-level permission check via `DependPermission`; roles link to APIs (many-to-many in MySQL)
**Boss Token Management:**
- Boss MPT tokens stored in `boss_token` MySQL table (`app/models/token.py`)
- Managed via `app/api/v1/token/` CRUD endpoints
- Spiders fetch active tokens from `/api/v1/token/tokens` and mark invalid ones
## Monitoring & Observability
**Error Tracking:** None (no Sentry or equivalent detected)
**Logs:**
- Backend: `loguru` structured logging throughout all modules
- Spider: `loguru` with daily rotation (`logs/log_{YYYY-MM-DD}.log`, 30-day retention)
- ECS pipeline: File log appended to `ecs_full_pipeline.log`
- No centralized log aggregation (no ELK, CloudWatch, etc.)
**Alerting:**
- Email alerts via SMTP on: stats job completion, IP anomaly detection
- SMTP provider: QQ Mail (`smtp.qq.com:587`) with STARTTLS
- Env vars: `SMTP_HOST`, `SMTP_PORT`, `SMTP_USER`, `SMTP_PASS`, `SMTP_FROM`, `SMTP_TO`
- Default recipient hardcoded in `app/settings/config.py` line 47
- IP anomaly alerts triggered every 10 minutes by `ip_alert_job` in `app/core/scheduler.py`
- Stats reports sent every 6 hours by `stats_job` in `app/core/scheduler.py`
**Metrics:**
- IP upload tracking: `IpTrackingMiddleware` (`app/core/ip_tracking.py`) records per-source, per-IP upload counts to `ip_upload_stats` MySQL table
- Task run history: `ScheduledTaskRun` model (`app/models/metrics.py`) records all scheduler task outcomes
- Stats totals: `StatsTotal` model records ClickHouse row counts per table per run
## CI/CD & Deployment
**Hosting:**
- Production: Alibaba Cloud ECS (manually provisioned + Docker container)
- Crawler nodes: Alibaba Cloud ECS spot instances (cn-shanghai-b, `ecs.t5-lc1m1.small`, Ubuntu 24.04)
- ECS node config: `ecs_full_pipeline.py` (20 spot instances, `SpotAsPriceGo` strategy, auto-release)
- Vswitch: `vsw-uf6twb2lrd02fi7q2i605`, Security Group: `sg-uf60380ctrux2uusucad`
**CI Pipeline:** None detected (no GitHub Actions, Jenkins, or equivalent)
**Container:**
- Multi-stage `Dockerfile` in project root
- Stage 1: `node:18-alpine` builds Vue3 frontend
- Stage 2: `python:3.11-slim-bullseye` installs Python deps + nginx
- Entrypoint: `deploy/entrypoint.sh`
- Nginx config: `deploy/web.conf`
- Exposed port: 80
## Environment Configuration
**Required env vars (backend):**
- `SECRET_KEY` - JWT signing secret (must override `"CHANGE_ME_DEV_ONLY"`)
- `CLICKHOUSE_HOST` - ClickHouse server address
- `CLICKHOUSE_USER` / `CLICKHOUSE_PASS` - ClickHouse credentials
- MySQL connection (embedded in `TORTOISE_ORM` dict or override connection string)
**Optional env vars:**
- `APP_HOST` / `APP_PORT` / `UVICORN_WORKERS` - Server binding
- `REDIS_HOST` / `REDIS_PORT` / `REDIS_PASS` - Distributed lock backend
- `SMTP_HOST` / `SMTP_USER` / `SMTP_PASS` / `SMTP_FROM` / `SMTP_TO` - Email alerting
- `REPORT_ENDPOINT` - Webhook for stats reporting
- `RUN_MIGRATIONS_ON_STARTUP` / `INITIALIZE_SEED_DATA_ON_STARTUP` - Startup behavior flags
**Spider env vars:**
- `API_BASE_URL` - Backend URL for data push (default: `http://124.222.106.226:9999`)
- `SLEEP_MIN_SECONDS` / `SLEEP_MAX_SECONDS` - Request rate limiting
- `PROXY_USERNAME` / `PROXY_PASSWORD` / `PROXY_TUNNEL` / `PROXY_SCHEME` - Proxy pool
- `MAX_PAGES` / `PAGE_SIZE` - Crawl depth control
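Reading these spider variables in one place might look like the sketch below. Only the `API_BASE_URL` default is documented above; the sleep defaults mirror the 10-20s delay mentioned for the Boss spider, and the `MAX_PAGES`/`PAGE_SIZE` defaults are placeholders chosen for illustration. `SpiderConfig` and `load_config` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class SpiderConfig:
    api_base_url: str
    sleep_min: float
    sleep_max: float
    max_pages: int
    page_size: int


def load_config(env: Mapping[str, str] = os.environ) -> SpiderConfig:
    """Read spider settings from the environment, falling back to defaults."""
    cfg = SpiderConfig(
        api_base_url=env.get("API_BASE_URL", "http://124.222.106.226:9999").rstrip("/"),
        sleep_min=float(env.get("SLEEP_MIN_SECONDS", "10")),  # assumed default
        sleep_max=float(env.get("SLEEP_MAX_SECONDS", "20")),  # assumed default
        max_pages=int(env.get("MAX_PAGES", "10")),            # placeholder default
        page_size=int(env.get("PAGE_SIZE", "20")),            # placeholder default
    )
    if cfg.sleep_min > cfg.sleep_max:
        raise ValueError("SLEEP_MIN_SECONDS must not exceed SLEEP_MAX_SECONDS")
    return cfg
```

Passing the mapping explicitly keeps the parsing testable without mutating the process environment.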
**Alibaba Cloud (ECS pipeline):**
- Uses `alibabacloud_credentials` default credential chain (env vars `ALIBABA_CLOUD_ACCESS_KEY_ID` / `ALIBABA_CLOUD_ACCESS_KEY_SECRET` or `~/.alibabacloud/credentials`)
**Secrets location:** All secrets are environment variables; no secrets manager. `config.py` contains hardcoded defaults that are insecure for production use.
## Webhooks & Callbacks
**Incoming:**
- `/api/v1/universal/data/batch-store-async` - Spider data ingestion endpoint (called by all three crawlers)
- Header auth: `token: dev` (or `API_TOKEN` env var)
- Body: `{"data_list": [...], "data_type": "job"|"company", "platform": "boss"|"qcwy"|"zhilian"}`
- `/api/v1/keyword/available` - Spiders poll this to get next keyword to crawl
- `/api/v1/token/tokens` - Spiders fetch active Boss MPT tokens
- `/api/v1/pipeline` - Manual trigger for ECS full pipeline
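A spider-side call to the batch ingestion endpoint listed above could be prepared as in this sketch. The path, the `token` header, and the body keys come from this section; the POST method and the `build_ingest_request` helper are assumptions, and the request is only constructed here, not sent.

```python
import json
import urllib.request

ALLOWED_PLATFORMS = {"boss", "qcwy", "zhilian"}
ALLOWED_DATA_TYPES = {"job", "company"}
INGEST_PATH = "/api/v1/universal/data/batch-store-async"


def build_ingest_request(base_url: str, data_list: list, data_type: str,
                         platform: str, token: str = "dev") -> urllib.request.Request:
    """Validate the arguments and build the HTTP request for a batch push."""
    if platform not in ALLOWED_PLATFORMS:
        raise ValueError(f"unknown platform: {platform!r}")
    if data_type not in ALLOWED_DATA_TYPES:
        raise ValueError(f"unknown data_type: {data_type!r}")
    body = {"data_list": data_list, "data_type": data_type, "platform": platform}
    return urllib.request.Request(
        base_url.rstrip("/") + INGEST_PATH,
        # ensure_ascii=False keeps Chinese job titles readable in the payload.
        data=json.dumps(body, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json", "token": token},
        method="POST",
    )
```

Sending it would be a matter of passing the returned object to `urllib.request.urlopen` with a timeout.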
**Outgoing:**
- `REPORT_ENDPOINT` (configurable) - Stats payload POST every 6 hours (`app/core/scheduler.py`)
- `http://external-data.qixin.com/...` - Company data push to Qixin (currently disabled/commented out in `app/services/ingest/remote_push.py`)
- SMTP email notifications on scheduler events
---
*Integration audit: 2026-03-21*

.planning/codebase/STACK.md Normal file

@ -0,0 +1,143 @@
# Technology Stack
**Analysis Date:** 2026-03-21
## Languages
**Primary:**
- Python 3.13 - Backend API, crawlers, ECS pipeline scripts
- JavaScript (ES Modules) - Vue3 frontend (no strict TypeScript; TS installed but JS used in `.vue` and `.js` files)
**Secondary:**
- TypeScript 5.1.6 - Type definitions referenced in frontend toolchain only
## Runtime
**Environment:**
- Python 3.13 (required: `>=3.13` per `Pipfile`; Dockerfile uses `python:3.11-slim-bullseye` for container build)
- Node.js 18 (Dockerfile `node:18-alpine` for frontend build stage)
**Package Manager:**
- Python: `pipenv` (development), `pip` / `requirements.txt` (Docker production)
- Frontend: `pnpm` (lockfile: `web/pnpm-lock.yaml` present)
- Lockfiles: `Pipfile.lock` (Python), `web/pnpm-lock.yaml` (frontend), `uv.lock` (uv-compatible)
## Frameworks
**Core Backend:**
- FastAPI 0.111.0 - REST API framework (`app/__init__.py` factory via `create_app()`)
- Starlette 0.37.2 - ASGI underpinning (middleware, static files)
- Uvicorn 0.34.0 - ASGI server (20 workers default, configurable via `UVICORN_WORKERS`)
- Uvloop 0.21.0 - High-performance event loop (non-Windows only)
**ORM / Database:**
- Tortoise-ORM 0.23.0 - Async ORM for MySQL (`app/models/`)
- Aerich 0.8.1 - Database migrations for Tortoise-ORM (`migrations/` directory)
- aiomysql - Async MySQL driver (Tortoise backend)
- clickhouse-connect 0.7.19 - ClickHouse async client (`app/core/clickhouse.py`)
- asyncpg - Async PostgreSQL driver (listed in Pipfile but not actively used)
**Validation / Serialization:**
- Pydantic 2.10.5 - Request/response schemas (`app/schemas/`)
- pydantic-settings 2.7.1 - Settings management via env vars (`app/settings/config.py`)
- orjson 3.10.14 - Fast JSON serialization
- ujson 5.10.0 - Alternative JSON library
**Authentication / Security:**
- PyJWT 2.10.1 - JWT token generation/verification (HS256, 7-day expiry)
- passlib 1.7.4 - Password hashing
- argon2-cffi 23.1.0 - Argon2 password hasher
**Scheduling:**
- APScheduler - Async job scheduler (`app/core/scheduler.py`, 6 registered cron tasks)
**HTTP Clients:**
- httpx 0.28.1 - Async HTTP client (backend service-to-service calls)
- requests - Sync HTTP client (spider scripts in `jobs_spider/`)
**Logging:**
- loguru 0.7.3 - Structured logging (unified throughout backend and spiders)
**Core Frontend:**
- Vue 3.3.4 - SPA framework (`web/src/main.js`)
- Vue Router 4.2.4 - Client-side routing (`web/src/router/index.js`)
- Pinia 2.1.6 - State management (`web/src/store/`)
- Naive UI 2.34.4 - Component library (admin UI)
- ECharts 6.0.0 - Data visualization charts (`web/src/views/analytics/`)
- axios 1.4.0 - HTTP client (`web/src/utils/http/`)
- vue-i18n 9 - Internationalization
**Build / Dev Tools:**
- Vite 4.4.6 - Frontend bundler (`web/vite.config.js`)
- UnoCSS 66.5.10 - Atomic CSS engine
- unplugin-auto-import 20.3.0 - Auto-imports for Vue APIs
- unplugin-vue-components 30.0.0 - Auto-imports for components
- @iconify/vue + @iconify/json - Icon library
**Spider-specific:**
- playwright 1.57.0 - Browser automation (anti-detection in crawlers)
- PyExecJS 1.5.1 - Execute JavaScript signing algorithms from Python (`jobs_spider/boss/`)
- PySocks - SOCKS proxy support for spider requests
- tenacity - Retry logic in crawlers
- pandas + openpyxl - Data processing and Excel export
**Cloud:**
- alibabacloud_ecs20140526 - Alibaba Cloud ECS SDK (`ecs_full_pipeline.py`)
- alibabacloud_credentials - AliCloud credential management
**Code Quality:**
- ruff 0.9.1 - Python linter (`pyproject.toml` config: line-length 120, ignores F403/F405)
- black 24.10.0 - Python formatter (line-length 120, target py310/py311)
- isort 5.13.2 - Import sorter
- ESLint 8.46.0 - Frontend linter (`@zclzone` + `@unocss` rule sets)
- prettier - Frontend formatter
## Key Dependencies
**Critical:**
- `clickhouse-connect==0.7.19` - All analytics and job data storage; loss means no data read/write
- `tortoise-orm==0.23.0` - All business data (users, roles, keywords, tokens); paired with `aerich` for migrations
- `fastapi==0.111.0` - API layer; version-pinned for stability
- `APScheduler` - 6 scheduled tasks including ECS pipeline and IP alerting
- `alibabacloud_ecs20140526` - ECS node management; required for crawler scaling
**Infrastructure:**
- `uvicorn==0.34.0` + `uvloop==0.21.0` - Production ASGI server stack
- `pydantic==2.10.5` - All input validation; v2 API used throughout
- `loguru==0.7.3` - Unified logging across all modules
- `redis` - Optional distributed lock backend (`app/core/locks.py`; falls back to file locks if Redis unavailable)
## Configuration
**Environment:**
- All settings in `app/settings/config.py` via `pydantic-settings.BaseSettings`
- Environment variables override defaults at startup
- No `.env` file detected (not committed); variables set at OS/container level
- Key variables: `APP_HOST`, `APP_PORT`, `UVICORN_WORKERS`, `CLICKHOUSE_HOST`, `CLICKHOUSE_USER`, `CLICKHOUSE_PASS`, `SECRET_KEY`, `SMTP_HOST`, `SMTP_USER`, `SMTP_PASS`, `REDIS_HOST`, `REPORT_ENDPOINT`
- **Security warning**: `config.py` contains hardcoded default values for MySQL, ClickHouse, and SMTP credentials; must be overridden in production
**Build:**
- `pyproject.toml` - Python project metadata, black/ruff tool config
- `Pipfile` / `Pipfile.lock` - Development dependency management
- `requirements.txt` - Production pip install (used in Dockerfile)
- `web/vite.config.js` - Vite build config with proxy support via `VITE_USE_PROXY` env var
- `Dockerfile` - Multi-stage build: Node 18 (frontend) + Python 3.11 (backend) + nginx
## Platform Requirements
**Development:**
- Python 3.13+ (Pipfile requirement)
- Node.js 18+ with pnpm
- MySQL server (Tortoise-ORM connection)
- ClickHouse server (analytics data)
- Optional: Redis (distributed locking upgrade)
**Production:**
- Docker-based deployment (see `Dockerfile`, `deploy/entrypoint.sh`, `deploy/web.conf`)
- Nginx serves frontend static files and proxies API requests
- Alibaba Cloud ECS (cn-shanghai-b zone) for crawler nodes
- Ports: 80 (nginx), 9999 (FastAPI backend direct)
---
*Stack analysis: 2026-03-21*


@ -0,0 +1,350 @@
# Codebase Structure
**Analysis Date:** 2026-03-21
## Directory Layout
```
JobData/ # Project root
├── app/ # FastAPI backend (Python)
│ ├── __init__.py # App factory create_app()
│ ├── api/
│ │ └── v1/ # All REST API routes (/api/v1/...)
│ │ ├── __init__.py # Router aggregation + permission guards
│ │ ├── analytics.py # Data analytics endpoints
│ │ ├── stats.py # Table count stats endpoints
│ │ ├── pipeline.py # ECS pipeline trigger endpoint
│ │ ├── base/ # Login, user-info, menu-tree (no auth)
│ │ ├── job/ # Data ingest endpoints (batch/single/async)
│ │ ├── cleaning/ # Targeted cleaning endpoints
│ │ ├── company/ # Company search endpoints
│ │ ├── keyword/ # Keyword CRUD endpoints
│ │ ├── token/ # Boss token management endpoints
│ │ ├── proxy/ # Proxy IP management endpoints
│ │ ├── users/ # User CRUD
│ │ ├── roles/ # Role + permission assignment
│ │ ├── menus/ # Menu tree CRUD
│ │ ├── apis/ # API registration management
│ │ ├── depts/ # Department tree CRUD
│ │ └── auditlog/ # Audit log query
│ ├── controllers/ # CRUD orchestration for MySQL entities
│ │ ├── api.py # ApiController (refresh_api(), scan routes)
│ │ ├── user.py # UserController
│ │ ├── role.py # RoleController
│ │ ├── menu.py # MenuController
│ │ ├── dept.py # DeptController
│ │ ├── keyword.py # KeywordController
│ │ ├── proxy.py # ProxyController
│ │ ├── token.py # TokenController
│ │ ├── cleaning.py # CleaningController
│ │ └── company.py # CompanyController (search via crawler services)
│ ├── core/ # Framework infrastructure
│ │ ├── init_app.py # lifespan: migrations, seed, ClickHouse init
│ │ ├── scheduler.py # APScheduler job definitions + registration
│ │ ├── clickhouse.py # ClickHouseManager singleton
│ │ ├── clickhouse_init.py # ClickHouse DDL (tables + analytics view)
│ │ ├── dependency.py # AuthControl, PermissionControl, DependPermission
│ │ ├── locks.py # DistributedLock (Redis + file fallback)
│ │ ├── middlewares.py # BackGroundTaskMiddleware, HttpAuditLogMiddleware
│ │ ├── ip_tracking.py # IpTrackingMiddleware
│ │ ├── exceptions.py # Custom exception classes + handlers
│ │ ├── crud.py # Generic ListParams, pagination helpers
│ │ ├── ctx.py # CTX_USER_ID ContextVar
│ │ ├── bgtask.py # BgTasks helper for BackGroundTaskMiddleware
│ │ ├── proxy_rule.py # Proxy selection rules
│ │ └── algorithms/
│ │ └── antispider.py # Anti-spider signature helpers
│ ├── models/ # Tortoise-ORM models (MySQL)
│ │ ├── base.py # BaseModel, TimestampMixin, UUIDModel
│ │ ├── admin.py # User, Role, Api, Menu, Dept, AuditLog
│ │ ├── token.py # BossToken
│ │ ├── cleaning.py # CleaningTask / CleaningRecord
│ │ ├── metrics.py # ScheduledTaskRun, StatsTotal, IpUploadStats
│ │ ├── company.py # CompanyCleaningQueue
│ │ ├── keyword.py # BossKeyword, QcwyKeyword, ZhilianKeyword
│ │ └── enums.py # MethodType enum
│ ├── repositories/ # ClickHouse query layer
│ │ └── clickhouse_repo.py # ClickHouseBaseRepo, JobAnalyticsRepo
│ ├── schemas/ # Pydantic request/response schemas
│ │ ├── ingest.py # IngestBatchRequest, PlatformType, DataType enums
│ │ ├── analytics.py # Analytics query params + response schemas
│ │ ├── keyword.py # Keyword schemas
│ │ ├── token.py # Token schemas
│ │ ├── base.py # Common response envelopes
│ │ ├── login.py # Login request/response
│ │ ├── users.py # User schemas
│ │ ├── roles.py # Role schemas
│ │ ├── menus.py # Menu schemas (MenuType enum)
│ │ ├── apis.py # API schemas
│ │ ├── depts.py # Dept schemas
│ │ ├── cleaning.py # Cleaning schemas
│ │ └── proxy.py # Proxy schemas
│ ├── services/ # Business logic
│ │ ├── analytics_service.py # AnalyticsService → JobAnalyticsRepo
│ │ ├── cleaning.py # CleaningService (targeted cleaning)
│ │ ├── company_cleaner.py # CompanyCleaner (automated company pipeline)
│ │ ├── company_jobs_sync.py # CompanyJobsSyncService
│ │ ├── company_storage.py # company_storage (ClickHouse company insert)
│ │ ├── ingest/ # Ingest pipeline subsystem
│ │ │ ├── service.py # IngestService.store_batch/store_single/query_data
│ │ │ ├── registry.py # PlatformConfig registry + get_config()
│ │ │ ├── dedup.py # batch_dedup_filter() against ClickHouse
│ │ │ ├── remote_push.py # Async HTTP push to secondary target
│ │ │ └── configs/ # Per-platform PlatformConfig registrations
│ │ └── crawler/ # In-process platform HTTP clients
│ │ ├── _base.py # BaseFetcher, BaseSearcher, ApiResult
│ │ ├── _http_client.py # HTTPClient wrapper (sync requests)
│ │ ├── _boss_client.py # Boss HTTP client
│ │ ├── _boss_api.py # Boss API methods
│ │ ├── _boss_sign.py # Boss request signing
│ │ ├── _zhilian_client.py
│ │ ├── _zhilian_api.py
│ │ ├── _zhilian_sign.py
│ │ ├── _job51_client.py
│ │ ├── _job51_api.py
│ │ ├── _job51_sign.py
│ │ ├── boss.py # BossService (high-level facade)
│ │ ├── qcwy.py # QcwyService (high-level facade)
│ │ └── zhilian.py # ZhilianService (high-level facade)
│ ├── settings/
│ │ └── config.py # Settings (pydantic-settings BaseSettings)
│ ├── log/
│ │ └── __init__.py # loguru logger export
│ └── utils/ # Shared utilities
├── web/ # Vue 3 frontend
│ ├── src/
│ │ ├── main.js # App entry, Pinia + Router setup
│ │ ├── App.vue # Root component
│ │ ├── api/ # Axios API call modules
│ │ │ ├── index.js # System management APIs
│ │ │ ├── analytics.js # Analytics APIs
│ │ │ ├── keyword.js # Keyword APIs
│ │ │ ├── proxy.js # Proxy APIs
│ │ │ └── token.js # Token APIs
│ │ ├── components/ # Shared components (CrudTable, CrudModal, QueryBar)
│ │ ├── layout/ # App shell (sidebar, topbar, tabs)
│ │ ├── router/ # Vue Router config + dynamic route loading
│ │ │ ├── index.js # createRouter(), addDynamicRoutes()
│ │ │ ├── guard/ # Navigation guards (auth, loading, title)
│ │ │ └── routes/ # Static route definitions
│ │ ├── store/ # Pinia stores
│ │ │ └── modules/ # user, permission, app, tags
│ │ ├── utils/
│ │ │ ├── auth/ # JWT token read/write
│ │ │ ├── http/ # Axios instance + interceptors
│ │ │ └── storage/ # localStorage wrapper
│ │ └── views/ # Page-level Vue components
│ │ ├── login/ # Login page
│ │ ├── analytics/ # ECharts dashboard (trend + distribution)
│ │ ├── recruitment/ # Job data browser (boss/, qcwy/, zhilian/)
│ │ ├── cleaning/ # Cleaning operation + monitor pages
│ │ ├── keyword/ # Keyword management
│ │ ├── profile/ # User profile
│ │ └── system/ # RBAC admin (user, role, menu, api, dept, auditlog, proxy, token)
│ ├── vite.config.js # Vite build config (plugins, proxy)
│ └── package.json # Frontend dependencies
├── jobs_spider/ # Standalone crawlers (external process)
│ ├── boss/
│ │ ├── boos_api.py # BossZhipinAPI core + SmartIPManager + main (~2246 lines)
│ │ ├── company_spider.py # Boss company batch spider
│ │ └── enumerate_combos.py # City × keyword combo enumeration
│ ├── qcwy/
│ │ ├── qcwy.py # QCWY API wrapper
│ │ ├── search_company_jobs.py
│ │ └── run_company_search.py
│ └── zhilian/
│ ├── zhilian_single.py # Zhilian single-job spider
│ └── company_spider.py # Zhilian company spider
├── spiderJobs/ # Second-generation spider framework (structured)
│ ├── core/
│ │ ├── base.py # Shared base classes
│ │ └── http_client.py # HTTP client abstraction
│ ├── platforms/ # Platform-specific implementations
│ │ ├── boss/
│ │ ├── job51/
│ │ └── zhilian/
│ └── runner/ # Execution runners
│ ├── api_client.py # Backend API push client
│ ├── loop.py # Job crawl loop runner
│ └── company_loop.py # Company crawl loop runner
├── migrations/
│ └── models/ # Aerich migration files (auto-generated)
├── static/ # Served at /static/ (built frontend or assets)
├── deploy/ # Deployment configs and sample images
├── docs/ # Project documentation
├── ecs_full_pipeline.py # Alibaba Cloud ECS batch provisioning script
├── create_ecs_instances.py # ECS instance creation utilities
├── reclean_qcwy_jobs.py # Standalone QCWY re-cleaning script
├── run.py # Uvicorn launch entry point
├── Pipfile / Pipfile.lock # Python dependency management (pipenv)
├── pyproject.toml # ruff + black tool config
└── Dockerfile # Container build spec
```
## Directory Purposes
**`app/api/v1/`:**
- Purpose: One sub-package per API domain; each contains `__init__.py` exporting its `APIRouter`
- Contains: Route handler functions, dependency injection wiring, Pydantic schema imports
- Key files: `app/api/v1/__init__.py` (aggregation), `app/api/v1/job/job.py` (ingest routes), `app/api/v1/analytics.py`
**`app/controllers/`:**
- Purpose: Business-operation classes for MySQL-backed entities; encapsulate Tortoise-ORM queries
- Contains: One controller per domain entity; typically expose `create`, `update`, `delete`, `list`, `get` methods
- Key files: `app/controllers/api.py` (scan and register FastAPI routes → DB)
**`app/services/ingest/`:**
- Purpose: Self-contained data pipeline: registry lookup → dedup check → ClickHouse insert → remote push
- Contains: `service.py` (main orchestrator), `registry.py` (config store), `dedup.py`, `remote_push.py`, `configs/` (registrations per platform)
- Key files: `app/services/ingest/service.py`, `app/services/ingest/registry.py`
**`app/services/crawler/`:**
- Purpose: In-process HTTP clients for platform APIs, used by backend scheduler (company cleaning) and company search controller
- Contains: Underscore-prefixed private modules (`_boss_client.py`, `_boss_sign.py`, etc.) and public service facades (`boss.py`, `qcwy.py`, `zhilian.py`)
- Key files: `app/services/crawler/_base.py` (shared base classes)
**`app/core/`:**
- Purpose: Framework plumbing — startup, middleware chain, auth, distributed locking, ClickHouse connection
- Key files: `app/core/dependency.py` (auth guards), `app/core/scheduler.py` (all cron jobs), `app/core/locks.py`
**`app/models/`:**
- Purpose: All Tortoise-ORM model definitions; migrated via Aerich (not ClickHouse)
- Key files: `app/models/admin.py` (RBAC), `app/models/keyword.py` (crawl state tracking)
**`app/repositories/`:**
- Purpose: Typed ClickHouse query encapsulation following the Repository pattern
- Key files: `app/repositories/clickhouse_repo.py`
**`jobs_spider/`:**
- Purpose: Standalone scripts that run on ECS nodes; communicate with backend only via HTTP
- Key files: `jobs_spider/boss/boos_api.py` (2246-line core with SmartIPManager, BossZhipinAPI)
**`spiderJobs/`:**
- Purpose: Newer modular spider framework; parallel architecture to `jobs_spider/` with cleaner separation
- Key files: `spiderJobs/runner/api_client.py`, `spiderJobs/runner/loop.py`
**`web/src/views/`:**
- Purpose: One directory per route group; each view owns its local route definition in `route.js`
- Key files: `web/src/views/analytics/index.vue` (ECharts dashboard), `web/src/views/cleaning/monitor.vue`
## Key File Locations
**Entry Points:**
- `run.py`: Uvicorn startup; reads env vars `APP_HOST`, `APP_PORT`, `UVICORN_WORKERS`
- `app/__init__.py`: FastAPI app factory `create_app()` and lifespan context manager
- `web/src/main.js`: Vue 3 app entry; mounts Pinia, Router, directives
**Configuration:**
- `app/settings/config.py`: All backend config (`Settings` extends `pydantic_settings.BaseSettings`); override via env vars
- `web/vite.config.js`: Vite build and dev server config
- `pyproject.toml`: ruff, black, isort tool configuration
**Core Logic:**
- `app/core/init_app.py`: Startup sequence (migrations, seed data, ClickHouse init)
- `app/core/scheduler.py`: All APScheduler job functions and `start_scheduler()`
- `app/core/clickhouse_init.py`: All ClickHouse `CREATE TABLE IF NOT EXISTS` DDL statements
- `app/services/ingest/registry.py`: `PlatformConfig` dataclass + global registry dict
- `app/services/ingest/service.py`: `IngestService.store_batch()` — main data ingestion path
- `app/services/company_cleaner.py`: `CompanyCleaner` — automated company data pipeline
**ClickHouse Schema:**
- `app/core/clickhouse_init.py`: Defines all 6 MergeTree tables and 1 analytics VIEW DDL; apply via `ClickHouseInitializer.initialize_all_tables()`
**MySQL Migrations:**
- `migrations/models/`: Auto-generated by `aerich migrate`; applied via `aerich upgrade` or automatically on startup
**Testing:**
- `tests/test_company_jobs_sync.py`
- `tests/test_company_storage.py`
## Naming Conventions
**Files (Python):**
- Modules: `snake_case.py` (e.g., `company_cleaner.py`, `clickhouse_init.py`)
- Private crawler modules: leading underscore prefix (e.g., `_boss_client.py`, `_zhilian_sign.py`)
- Spider scripts in `jobs_spider/`: descriptive names without prefix (e.g., `boos_api.py`, `zhilian_single.py`)
**Files (Vue/JS):**
- Pages: `index.vue` for primary view, `monitor.vue` / descriptive name for secondary views
- API modules: `snake_case.js` matching domain (e.g., `analytics.js`, `keyword.js`)
- Router guard files: Grouped in `router/guard/` directory
**Classes (Python):**
- Services: `PascalCaseService` (e.g., `IngestService`, `AnalyticsService`, `CompanyCleaner`)
- Controllers: `PascalCaseController` with module-level singleton instance in snake_case (e.g., `user_controller = UserController()`)
- Repos: `PascalCaseRepo` (e.g., `ClickHouseBaseRepo`, `JobAnalyticsRepo`)
- Models: `PascalCase` matching table concept (e.g., `BossToken`, `CompanyCleaningQueue`)
**Directories:**
- Domain groupings: lowercase (e.g., `cleaning/`, `recruitment/`, `keyword/`)
- Platform names used consistently: `boss`, `qcwy`, `zhilian` (lowercase) across all layers
## Where to Add New Code
**New API Endpoint:**
1. Create or update router file in `app/api/v1/{domain}/`
2. Register the router in `app/api/v1/__init__.py` with `v1_router.include_router(...)`
3. Run `api_controller.refresh_api()` on next app startup to sync new routes to the `api` table
4. Add corresponding Pydantic schemas in `app/schemas/{domain}.py`
**New Platform Data Type (Ingest):**
1. Define `PlatformConfig` with dedup fields in `app/services/ingest/configs/`
2. Call `register(config)` at module import in `app/services/ingest/configs/__init__.py`
3. ClickHouse table DDL goes in `app/core/clickhouse_init.py` under `_TABLE_DDLS`
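The registration steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: only the `PlatformConfig` dataclass, the global registry dict, and `register(config)` are named in this document. The field names (`platform`, `data_type`, `table`, `dedup_fields`), the registry key shape, the `get_config` helper, and the `boss_job` table name are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class PlatformConfig:
    # All field names are hypothetical; check registry.py for the real ones.
    platform: str
    data_type: str
    table: str
    dedup_fields: Tuple[str, ...]


# Global registry keyed by (platform, data_type); the key shape is an assumption.
_REGISTRY: Dict[Tuple[str, str], PlatformConfig] = {}


def register(config: PlatformConfig) -> None:
    """Add a config to the registry, rejecting duplicate registrations."""
    key = (config.platform, config.data_type)
    if key in _REGISTRY:
        raise ValueError(f"duplicate registration: {key}")
    _REGISTRY[key] = config


def get_config(platform: str, data_type: str) -> PlatformConfig:
    """Look up a config; raises KeyError for unknown combinations."""
    return _REGISTRY[(platform, data_type)]


# Registering at import time, as step 2 describes (values are made up).
register(PlatformConfig("boss", "job", "boss_job", ("job_id",)))
```

Rejecting duplicates at registration time is a design choice of this sketch; it surfaces a copy-pasted config at import rather than at the first ingest call.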
**New MySQL Model:**
1. Add model class to appropriate file in `app/models/` (or create new file)
2. Register the model module in `app/settings/config.py` under `TORTOISE_ORM.apps.models.models`
3. Run `aerich migrate && aerich upgrade` (or rely on `RUN_MIGRATIONS_ON_STARTUP=True`)
**New Scheduled Job:**
1. Define async job function in `app/core/scheduler.py`
2. Wrap execution body with `DistributedLock` context from `app/core/locks.py`
3. Add `scheduler.add_job(...)` call inside `start_scheduler()`
4. Record run result via `_record_task_run()` to `ScheduledTaskRun` table
**New Frontend Page:**
1. Create view component at `web/src/views/{module}/index.vue`
2. Add `route.js` in the same directory with route meta
3. Register the route in `web/src/router/routes/` static routes or rely on backend menu system
4. Add corresponding backend menu entry in `app/core/init_app.py` → `init_menus()` (for seed data)
**New In-Process Platform Crawler:**
1. Add private client modules `_platform_client.py`, `_platform_api.py`, `_platform_sign.py` under `app/services/crawler/`
2. Create public facade `platform.py` extending or wrapping `BaseFetcher` / `BaseSearcher` from `app/services/crawler/_base.py`
3. Instantiate in `CompanyCleaner.__init__()` if needed for company pipeline
**Shared Utilities:**
- Backend Python utilities: `app/utils/`
- Frontend JS utilities: `web/src/utils/`
## Special Directories
**`.startup_lock/`:**
- Purpose: File-based mutex directory used to prevent multi-worker race on startup seed data
- Generated: Yes (created and deleted at startup)
- Committed: No (in `.gitignore`)
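One common way to build a directory-based startup mutex is to rely on `os.mkdir` being atomic. The sketch below shows that technique only; this document does not say how the project implements `.startup_lock/`, and `startup_lock` is a hypothetical helper.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def startup_lock(path: str):
    """Yield True for the one caller that creates `path`, False for the rest."""
    try:
        os.mkdir(path)  # atomic: raises if the directory already exists
        acquired = True
    except FileExistsError:
        acquired = False
    try:
        yield acquired
    finally:
        if acquired:
            os.rmdir(path)  # release the lock once seeding is done


# Demonstration in a temporary directory: the second attempt loses the race.
with tempfile.TemporaryDirectory() as tmp:
    lock_path = os.path.join(tmp, ".startup_lock")
    with startup_lock(lock_path) as first:
        with startup_lock(lock_path) as second:
            results = (first, second)
    lock_removed = not os.path.exists(lock_path)
```

A worker that gets `False` would skip seeding instead of waiting, which matches the "created and deleted at startup" lifecycle noted above.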
**`migrations/`:**
- Purpose: Aerich migration history for MySQL schema
- Generated: Yes (by `aerich migrate`)
- Committed: Yes
**`static/`:**
- Purpose: Served at `/static/` path; can hold built frontend assets or other static files
- Generated: Partially (built frontend output can be placed here)
- Committed: Yes (currently contains static assets)
**`.planning/`:**
- Purpose: GSD planning documents (codebase analysis, phase plans)
- Generated: Yes (by GSD tooling)
- Committed: Yes
**`jobs_spider/*/logs/`:**
- Purpose: Rotating daily log files for standalone spider processes
- Generated: Yes (by loguru in spider scripts)
- Committed: No
---
*Structure analysis: 2026-03-21*


@ -0,0 +1,236 @@
# Testing Patterns
**Analysis Date:** 2026-03-21
## Test Framework
**Runner:**
- `unittest` (Python standard library) — the only test framework currently in use
- No pytest, no pytest-anyio, no httpx.AsyncClient test harness configured
- No test-specific dependencies in `Pipfile` `[dev-packages]` (section is empty)
**Assertion Library:**
- `unittest.TestCase` assertion methods: `assertEqual`, `assertIsNone`, etc.
**Run Commands:**
```bash
python -m unittest tests/test_company_storage.py # Run a specific test file
python -m unittest tests/test_company_jobs_sync.py # Run a specific test file
python -m unittest discover tests/ # Discover and run all tests
```
**Coverage:**
- No coverage tool configured (no `.coveragerc`, no `pytest-cov`, no coverage requirement enforced)
## Test File Organization
**Location:**
- Separate `tests/` directory at project root: `/Users/win/2025/AICoding/JobData/tests/`
- NOT co-located with source modules
**Naming:**
- Pattern: `test_{module_name}.py`
- Examples: `test_company_storage.py`, `test_company_jobs_sync.py`
**Structure:**
```
tests/
├── test_company_jobs_sync.py # Tests for app/services/company_jobs_sync.py
└── test_company_storage.py # Tests for app/services/company_storage.py
```
## Test Structure
**Suite Organization:**
```python
import unittest

from app.services.company_storage import extract_company_fields, normalize_company_id


class CompanyStorageTests(unittest.TestCase):
    def test_normalize_qcwy_company_id(self):
        # The "co" prefix is stripped for qcwy only; other platforms keep the ID as-is.
        self.assertEqual(normalize_company_id("qcwy", "co123"), "123")
        self.assertEqual(normalize_company_id("qcwy", "123"), "123")
        self.assertEqual(normalize_company_id("boss", "co123"), "co123")

    def test_extract_boss_fields(self):
        payload = { ... }
        result = extract_company_fields("boss", payload, "boss-1")
        self.assertEqual(result["source_company_id"], "boss-1")


if __name__ == "__main__":
    unittest.main()
```
**Patterns:**
- One `TestCase` class per test file
- Test methods named `test_{what_is_being_tested}_{condition}` (e.g., `test_extract_boss_fields`, `test_normalize_qcwy_company_id`)
- No setUp/tearDown methods — tests are stateless and self-contained
- No fixtures or factories; inline dicts serve as test data
- Tests verify pure functions only (no database, no HTTP, no async)
## Mocking
**Framework:** None currently used (no `unittest.mock`, no `pytest-mock`)
**Current approach:** Tests only cover pure, synchronous functions that have no external dependencies. All existing tests call functions that:
- Accept plain dicts as input
- Return plain dicts or strings as output
- Have no database, network, or filesystem side effects
**What to Mock (when tests are added):**
- `clickhouse_connect.driver.AsyncClient` for repository tests
- `BossToken.filter(...)` ORM queries for service tests that check token loading
- `httpx` / `requests` for crawler service tests
- `asyncio.to_thread(...)` calls wrapping synchronous crawlers
**What NOT to Mock:**
- Pure data transformation functions (`extract_company_fields`, `normalize_company_id`, `_pick_first`, `_clean_text`)
- Enum definitions and schema validation logic
- URL pattern matching logic in `CleaningService`
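As a hedged illustration of the mocking guidance above, the sketch below replaces an HTTP client with `unittest.mock.MagicMock` and drives a synchronous call through `asyncio.to_thread`. `fetch_company`, `fetch_company_async`, and the `/company/{id}` path are hypothetical and do not come from the codebase.

```python
import asyncio
import unittest
from unittest.mock import MagicMock


def fetch_company(client, company_id: str) -> dict:
    """Hypothetical sync crawler call; `client` stands in for an httpx/requests session."""
    response = client.get(f"/company/{company_id}")
    response.raise_for_status()
    return response.json()


async def fetch_company_async(client, company_id: str) -> dict:
    """Bridge the sync call into async code via asyncio.to_thread."""
    return await asyncio.to_thread(fetch_company, client, company_id)


class FetchCompanyTests(unittest.TestCase):
    def test_returns_parsed_json(self):
        # The mock replaces the network boundary; no real request is sent.
        client = MagicMock()
        client.get.return_value.json.return_value = {"name": "Boss公司"}

        result = asyncio.run(fetch_company_async(client, "boss-1"))

        self.assertEqual(result, {"name": "Boss公司"})
        client.get.assert_called_once_with("/company/boss-1")
```

Passing the client in as an argument keeps the test free of `patch(...)` calls; code that builds its own client internally would need `unittest.mock.patch` at the import site instead.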
## Fixtures and Factories
**Test Data:**
- Inline dicts constructed within each test method — no shared fixtures
- Platform-specific payloads mirror real API response shapes:
```python
# Boss payload structure (from test_company_storage.py)
payload = {
    "zpData": {
        "brandComInfoVO": {
            "encryptBrandId": "boss-1",
            "brandName": "Boss公司",
            "industryName": "互联网",
            "stageName": "B轮",
        },
        "companyFullInfoVO": {
            "name": "Boss公司",
            "typeName": "民营",
            "cityName": "上海",
        },
    }
}
# Qcwy payload structure
payload = {
    "coinfo": {
        "coid": "123",
        "coname": "前程公司",
        "cotype": "民营",
        "cosize": "500-999人",
    },
    "financingStage": {"name": "未融资"},
}
# Zhilian payload structure
payload = {
    "data": {
        "companyBase": {
            "companyNumber": "zl-1",
            "companyName": "智联公司",
            "companyTypeName": "上市公司",
        }
    }
}
```
**Location:** Inline within test methods; no shared fixture file or factory module.
## Coverage
**Requirements:** None enforced — no coverage target, no CI gates
**Current coverage estimation:**
- `app/services/company_storage.py` — partially covered (extract functions and normalize_company_id)
- `app/services/company_jobs_sync.py` — partially covered (_extract_boss_jobs, _extract_qcwy_jobs, _extract_zhilian_jobs)
- All other modules — ZERO test coverage
**View Coverage (manual):**
```bash
pip install coverage
coverage run -m unittest discover tests/
coverage report
coverage html # generates htmlcov/index.html
```
## Test Types
**Unit Tests:**
- The only type present
- Scope: pure synchronous functions with no external dependencies
- Files: `tests/test_company_storage.py`, `tests/test_company_jobs_sync.py`
**Integration Tests:**
- Not present
- Needed for: API endpoint testing, e.g. `httpx.AsyncClient` against the ASGI app under an async test runner (`pytest-asyncio` or `anyio`), or FastAPI's synchronous `TestClient`
- Critical gaps: `app/api/v1/job/job.py`, `app/api/v1/analytics.py`, `app/api/v1/keyword/keyword.py`
**E2E Tests:**
- Not present
- No E2E framework configured
## Critical Gaps
**Untested service logic (highest priority):**
1. `app/services/cleaning.py:CleaningService.clean_target_auto()` — URL platform detection logic (zhipin.com vs 51job.com vs zhaopin.com)
2. `app/services/cleaning.py:CleaningService.process_single_item()` — multi-branch dispatch based on `clean_type` and `platform`
3. `app/services/ingest/service.py:IngestService.store_batch()` — deduplication and batch insert flow
4. `app/services/ingest/dedup.py` — deduplication filter logic
5. `app/controllers/keyword.py:KeywordController.get_available()` — priority scheduling logic (partial > failed > fresh)
6. `app/repositories/clickhouse_repo.py:JobAnalyticsRepo` — query building and result mapping
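To make the first gap concrete, a table-driven test for URL platform detection could look like the sketch below. `detect_platform` is a hypothetical stand-in, not the logic inside `CleaningService.clean_target_auto()`. Only the three domains and the platform names come from this document; the URL paths are placeholders.

```python
import unittest
from typing import Optional
from urllib.parse import urlparse

# Domain -> platform name, using the domains listed in the gap above.
_PLATFORM_DOMAINS = {
    "zhipin.com": "boss",
    "51job.com": "qcwy",
    "zhaopin.com": "zhilian",
}


def detect_platform(url: str) -> Optional[str]:
    """Return the platform for a URL, matching on the host rather than substrings."""
    host = urlparse(url).hostname or ""
    for domain, platform in _PLATFORM_DOMAINS.items():
        if host == domain or host.endswith("." + domain):
            return platform
    return None


class DetectPlatformTests(unittest.TestCase):
    def test_known_and_unknown_hosts(self):
        cases = [
            ("https://www.zhipin.com/some/path", "boss"),
            ("https://jobs.51job.com/some/path", "qcwy"),
            ("https://www.zhaopin.com/some/path", "zhilian"),
            ("https://example.com/zhipin.com", None),  # domain only in the path
            ("https://notzhipin.com/", None),          # look-alike host
        ]
        for url, expected in cases:
            with self.subTest(url=url):
                self.assertEqual(detect_platform(url), expected)
```

The two negative cases are the ones a plain substring check would get wrong, so they are worth keeping even if the real implementation differs.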
**Untested URL parsing (medium priority):**
7. `app/services/cleaning.py:_process_boss_url()` — regex extraction of job/company ID from Boss URLs
8. `app/services/cleaning.py:_process_qcwy_url()` — regex extraction for qcwy URLs
9. `app/services/cleaning.py:_process_zhilian_url()` — regex extraction for zhilian URLs
**Untested API endpoints:**
10. All routes in `app/api/v1/` — no integration tests exist for any HTTP endpoints
## Recommended Test Setup (when adding tests)
**For async tests, add to `Pipfile [dev-packages]`:**
```toml
[dev-packages]
pytest = "*"
pytest-asyncio = "*"
anyio = {extras = ["trio"]}
httpx = "*" # already in packages, for AsyncClient
```
**Async test pattern (recommended):**
```python
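import pytest
from httpx import ASGITransport, AsyncClient

from app import app  # FastAPI app instance


@pytest.mark.asyncio
async def test_ingest_endpoint():
    # httpx 0.28 removed AsyncClient(app=...); route requests through ASGITransport instead.
    transport = ASGITransport(app=app)
    async with AsyncClient(transport=transport, base_url="http://test") as client:
        # {...} is a placeholder for a real request payload.
        response = await client.post("/api/v1/ingest/data/store", json={...})
        assert response.status_code == 200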
```
**Synchronous unit test pattern (existing):**
```python
import unittest
from app.services.company_storage import normalize_company_id
class MyTests(unittest.TestCase):
def test_normalize(self):
self.assertEqual(normalize_company_id("qcwy", "co123"), "123")
if __name__ == "__main__":
unittest.main()
```
---
*Testing analysis: 2026-03-21*


@ -22,7 +22,8 @@
"node_repair_budget": 2,
"ui_phase": true,
"ui_safety_gate": true,
"text_mode": false
"text_mode": false,
"_auto_chain_active": false
},
"hooks": {
"context_warnings": true


@ -0,0 +1,27 @@
# Plan 03-01 Summary: Migrate the 前程无忧 (job51) layer to crawler_core
**Status:** Complete
**Tasks:** 6/6
**Commit:** 8c2c2d2
## What was built
Migrated the 4 files under `spiderJobs/platforms/job51/` to crawler_core:
- **client.py**: HTTPClient now comes from `crawler_core.http_client`; Job51Sign now comes from `crawler_core.qcwy.sign`
- **api.py**:
  - imports changed to `crawler_core.base.Result/BaseFetcher/BaseSearcher`
  - `ApiResult` replaced with `Result` throughout (11 occurrences)
  - 2 occurrences of `self._http.get(` changed to `self.http_client.get(` (in the overridden fetch of GetJobDetail/GetCompanyInfo)
  - added a POST `_request()` override to the POST Searchers (SearchRecommendJobs/SearchCompanyJobs)
- **main.py**: BaseFetcher/BaseSearcher now come from `crawler_core.base`
- **sign.py**: backward-compatibility stub that re-exports `crawler_core.qcwy.sign.Job51Sign`
- **tests/job51/test_job51_client.py**: 17 mock tests covering all API classes
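The POST `_request()` override follows the template-method pattern: the base class fixes the build, request, parse sequence and a subclass swaps only the transport step. The sketch below illustrates that idea with stand-ins. `FakeHTTPClient`, this `BaseSearcher`, its `search()` method, and the `/search/recommend` path are assumptions, not the actual `crawler_core.base` API.

```python
from typing import Any, Dict, List, Tuple


class FakeHTTPClient:
    """Records calls instead of touching the network."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, str, Dict[str, Any]]] = []

    def get(self, url: str, params: Dict[str, Any]) -> Dict[str, Any]:
        self.calls.append(("GET", url, params))
        return {"ok": True}

    def post(self, url: str, json: Dict[str, Any]) -> Dict[str, Any]:
        self.calls.append(("POST", url, json))
        return {"ok": True}


class BaseSearcher:
    """Hypothetical stand-in for crawler_core.base.BaseSearcher."""

    url = ""

    def __init__(self, http_client: FakeHTTPClient) -> None:
        self.http_client = http_client

    def search(self, **kwargs: Any) -> Dict[str, Any]:
        # Template method: fixed skeleton, overridable steps.
        params = self._build_params(**kwargs)
        raw = self._request(params)
        return self._parse(raw)

    def _build_params(self, **kwargs: Any) -> Dict[str, Any]:
        return dict(kwargs)

    def _request(self, params: Dict[str, Any]) -> Dict[str, Any]:
        return self.http_client.get(self.url, params=params)  # default: GET

    def _parse(self, raw: Dict[str, Any]) -> Dict[str, Any]:
        return raw


class SearchRecommendJobs(BaseSearcher):
    url = "/search/recommend"  # hypothetical path

    def _request(self, params: Dict[str, Any]) -> Dict[str, Any]:
        # POST endpoints override only the transport step.
        return self.http_client.post(self.url, json=params)


client = FakeHTTPClient()
SearchRecommendJobs(client).search(keyword="python")
```

A recording fake like this is also roughly how a mock test can assert that a searcher issues a POST rather than a GET.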
## Verification
```
pytest tests/job51/ -v → 17 passed ✅
```
## Self-Check: PASSED


@ -0,0 +1,33 @@
# Plan 03-02 Summary: Migrate the 智联招聘 (zhilian) layer to crawler_core
**Status:** Complete
**Tasks:** 6/6
**Commit:** 8c2c2d2
## What was built
Migrated the 4 files under `spiderJobs/platforms/zhilian/` to crawler_core:
- **client.py**: HTTPClient now comes from `crawler_core.http_client`; ZhilianSign now comes from `crawler_core.zhilian.sign`
- **api.py**:
  - imports changed to `crawler_core.base.BaseFetcher/BaseSearcher/Result/parse_response`
  - added `_parse_zhilian_response` (HTTP 200 counts as success, to fit 智联 responses, which carry no statusCode)
  - 1 occurrence of `self._http.get(` changed to `self.http_client.get(` (SearchCompanyPositions)
  - added a `_parse()` method to every Fetcher/Searcher class (an abstract method required by crawler_core)
  - added a POST `_request()` override to the POST Searcher (SearchPositions)
- **main.py**: BaseFetcher/BaseSearcher now come from `crawler_core.base`
- **sign.py**: backward-compatibility stub that re-exports `crawler_core.zhilian.sign.ZhilianSign`
- **tests/zhilian/test_zhilian_client.py**: 17 mock tests
## Key Decisions
- Added `_parse_zhilian_response` instead of using `parse_response` (the latter checks `statusCode==200`, but the 智联 API does not return that field)
- `SearchCompanyPositions._build_params()` calls the signer via `self._client.signer.sign_params()`, so it is unaffected by the migration
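The first decision can be illustrated with two simplified parsers. These are sketches under stated assumptions: the `Result` fields and both function signatures are invented here. Only the behaviours are taken from the note above (the generic parser checks `statusCode == 200`; the zhilian parser treats HTTP 200 as success).

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Result:
    """Hypothetical stand-in for crawler_core.base.Result."""
    success: bool
    data: Optional[Any] = None
    error: Optional[str] = None


def parse_response(http_status: int, body: Dict[str, Any]) -> Result:
    """Generic parser: requires a statusCode field equal to 200 in the body."""
    if http_status == 200 and body.get("statusCode") == 200:
        return Result(True, data=body.get("data"))
    return Result(False, error=f"http={http_status} statusCode={body.get('statusCode')}")


def parse_zhilian_response(http_status: int, body: Dict[str, Any]) -> Result:
    """Zhilian-style parser: HTTP 200 alone counts as success."""
    if http_status == 200:
        return Result(True, data=body.get("data"))
    return Result(False, error=f"http={http_status}")


# A zhilian-shaped body with no statusCode field.
zhilian_body = {"data": {"companyBase": {"companyNumber": "zl-1"}}}
generic = parse_response(200, zhilian_body)
custom = parse_zhilian_response(200, zhilian_body)
```

Here `generic.success` is false even though the request succeeded, which is the mismatch the custom parser avoids.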
## Verification
```
pytest tests/zhilian/ -v → 17 passed ✅
```
## Self-Check: PASSED