# Codebase Structure
**Analysis Date:** 2026-03-21
## Directory Layout
```
JobData/  # Project root
├── app/  # FastAPI backend (Python)
│   ├── __init__.py  # App factory create_app()
│   ├── api/
│   │   └── v1/  # All REST API routes (/api/v1/...)
│   │       ├── __init__.py  # Router aggregation + permission guards
│   │       ├── analytics.py  # Data analytics endpoints
│   │       ├── stats.py  # Table count stats endpoints
│   │       ├── pipeline.py  # ECS pipeline trigger endpoint
│   │       ├── base/  # Login, user-info, menu-tree (no auth)
│   │       ├── job/  # Data ingest endpoints (batch/single/async)
│   │       ├── cleaning/  # Targeted cleaning endpoints
│   │       ├── company/  # Company search endpoints
│   │       ├── keyword/  # Keyword CRUD endpoints
│   │       ├── token/  # Boss token management endpoints
│   │       ├── proxy/  # Proxy IP management endpoints
│   │       ├── users/  # User CRUD
│   │       ├── roles/  # Role + permission assignment
│   │       ├── menus/  # Menu tree CRUD
│   │       ├── apis/  # API registration management
│   │       ├── depts/  # Department tree CRUD
│   │       └── auditlog/  # Audit log query
│   ├── controllers/  # CRUD orchestration for MySQL entities
│   │   ├── api.py  # ApiController (refresh_api(), scan routes)
│   │   ├── user.py  # UserController
│   │   ├── role.py  # RoleController
│   │   ├── menu.py  # MenuController
│   │   ├── dept.py  # DeptController
│   │   ├── keyword.py  # KeywordController
│   │   ├── proxy.py  # ProxyController
│   │   ├── token.py  # TokenController
│   │   ├── cleaning.py  # CleaningController
│   │   └── company.py  # CompanyController (search via crawler services)
│   ├── core/  # Framework infrastructure
│   │   ├── init_app.py  # lifespan: migrations, seed, ClickHouse init
│   │   ├── scheduler.py  # APScheduler job definitions + registration
│   │   ├── clickhouse.py  # ClickHouseManager singleton
│   │   ├── clickhouse_init.py  # ClickHouse DDL (tables + analytics view)
│   │   ├── dependency.py  # AuthControl, PermissionControl, DependPermission
│   │   ├── locks.py  # DistributedLock (Redis + file fallback)
│   │   ├── middlewares.py  # BackGroundTaskMiddleware, HttpAuditLogMiddleware
│   │   ├── ip_tracking.py  # IpTrackingMiddleware
│   │   ├── exceptions.py  # Custom exception classes + handlers
│   │   ├── crud.py  # Generic ListParams, pagination helpers
│   │   ├── ctx.py  # CTX_USER_ID ContextVar
│   │   ├── bgtask.py  # BgTasks helper for BackGroundTaskMiddleware
│   │   ├── proxy_rule.py  # Proxy selection rules
│   │   └── algorithms/
│   │       └── antispider.py  # Anti-spider signature helpers
│   ├── models/  # Tortoise-ORM models (MySQL)
│   │   ├── base.py  # BaseModel, TimestampMixin, UUIDModel
│   │   ├── admin.py  # User, Role, Api, Menu, Dept, AuditLog
│   │   ├── token.py  # BossToken
│   │   ├── cleaning.py  # CleaningTask / CleaningRecord
│   │   ├── metrics.py  # ScheduledTaskRun, StatsTotal, IpUploadStats
│   │   ├── company.py  # CompanyCleaningQueue
│   │   ├── keyword.py  # BossKeyword, QcwyKeyword, ZhilianKeyword
│   │   └── enums.py  # MethodType enum
│   ├── repositories/  # ClickHouse query layer
│   │   └── clickhouse_repo.py  # ClickHouseBaseRepo, JobAnalyticsRepo
│   ├── schemas/  # Pydantic request/response schemas
│   │   ├── ingest.py  # IngestBatchRequest, PlatformType, DataType enums
│   │   ├── analytics.py  # Analytics query params + response schemas
│   │   ├── keyword.py  # Keyword schemas
│   │   ├── token.py  # Token schemas
│   │   ├── base.py  # Common response envelopes
│   │   ├── login.py  # Login request/response
│   │   ├── users.py  # User schemas
│   │   ├── roles.py  # Role schemas
│   │   ├── menus.py  # Menu schemas (MenuType enum)
│   │   ├── apis.py  # API schemas
│   │   ├── depts.py  # Dept schemas
│   │   ├── cleaning.py  # Cleaning schemas
│   │   └── proxy.py  # Proxy schemas
│   ├── services/  # Business logic
│   │   ├── analytics_service.py  # AnalyticsService → JobAnalyticsRepo
│   │   ├── cleaning.py  # CleaningService (targeted cleaning)
│   │   ├── company_cleaner.py  # CompanyCleaner (automated company pipeline)
│   │   ├── company_jobs_sync.py  # CompanyJobsSyncService
│   │   ├── company_storage.py  # company_storage (ClickHouse company insert)
│   │   ├── ingest/  # Ingest pipeline subsystem
│   │   │   ├── service.py  # IngestService.store_batch/store_single/query_data
│   │   │   ├── registry.py  # PlatformConfig registry + get_config()
│   │   │   ├── dedup.py  # batch_dedup_filter() against ClickHouse
│   │   │   ├── remote_push.py  # Async HTTP push to secondary target
│   │   │   └── configs/  # Per-platform PlatformConfig registrations
│   │   └── crawler/  # In-process platform HTTP clients
│   │       ├── _base.py  # BaseFetcher, BaseSearcher, ApiResult
│   │       ├── _http_client.py  # HTTPClient wrapper (sync requests)
│   │       ├── _boss_client.py  # Boss HTTP client
│   │       ├── _boss_api.py  # Boss API methods
│   │       ├── _boss_sign.py  # Boss request signing
│   │       ├── _zhilian_client.py
│   │       ├── _zhilian_api.py
│   │       ├── _zhilian_sign.py
│   │       ├── _job51_client.py
│   │       ├── _job51_api.py
│   │       ├── _job51_sign.py
│   │       ├── boss.py  # BossService (high-level facade)
│   │       ├── qcwy.py  # QcwyService (high-level facade)
│   │       └── zhilian.py  # ZhilianService (high-level facade)
│   ├── settings/
│   │   └── config.py  # Settings (pydantic-settings BaseSettings)
│   ├── log/
│   │   └── __init__.py  # loguru logger export
│   └── utils/  # Shared utilities
├── web/  # Vue 3 frontend
│   ├── src/
│   │   ├── main.js  # App entry, Pinia + Router setup
│   │   ├── App.vue  # Root component
│   │   ├── api/  # Axios API call modules
│   │   │   ├── index.js  # System management APIs
│   │   │   ├── analytics.js  # Analytics APIs
│   │   │   ├── keyword.js  # Keyword APIs
│   │   │   ├── proxy.js  # Proxy APIs
│   │   │   └── token.js  # Token APIs
│   │   ├── components/  # Shared components (CrudTable, CrudModal, QueryBar)
│   │   ├── layout/  # App shell (sidebar, topbar, tabs)
│   │   ├── router/  # Vue Router config + dynamic route loading
│   │   │   ├── index.js  # createRouter(), addDynamicRoutes()
│   │   │   ├── guard/  # Navigation guards (auth, loading, title)
│   │   │   └── routes/  # Static route definitions
│   │   ├── store/  # Pinia stores
│   │   │   └── modules/  # user, permission, app, tags
│   │   ├── utils/
│   │   │   ├── auth/  # JWT token read/write
│   │   │   ├── http/  # Axios instance + interceptors
│   │   │   └── storage/  # localStorage wrapper
│   │   └── views/  # Page-level Vue components
│   │       ├── login/  # Login page
│   │       ├── analytics/  # ECharts dashboard (trend + distribution)
│   │       ├── recruitment/  # Job data browser (boss/, qcwy/, zhilian/)
│   │       ├── cleaning/  # Cleaning operation + monitor pages
│   │       ├── keyword/  # Keyword management
│   │       ├── profile/  # User profile
│   │       └── system/  # RBAC admin (user, role, menu, api, dept, auditlog, proxy, token)
│   ├── vite.config.js  # Vite build config (plugins, proxy)
│   └── package.json  # Frontend dependencies
├── jobs_spider/  # Standalone crawlers (external process)
│   ├── boss/
│   │   ├── boos_api.py  # BossZhipinAPI core + SmartIPManager + main (~2246 lines)
│   │   ├── company_spider.py  # Boss company batch spider
│   │   └── enumerate_combos.py  # City × keyword combo enumeration
│   ├── qcwy/
│   │   ├── qcwy.py  # QCWY API wrapper
│   │   ├── search_company_jobs.py
│   │   └── run_company_search.py
│   └── zhilian/
│       ├── zhilian_single.py  # Zhilian single-job spider
│       └── company_spider.py  # Zhilian company spider
├── spiderJobs/  # Second-generation spider framework (structured)
│   ├── core/
│   │   ├── base.py  # Shared base classes
│   │   └── http_client.py  # HTTP client abstraction
│   ├── platforms/  # Platform-specific implementations
│   │   ├── boss/
│   │   ├── job51/
│   │   └── zhilian/
│   └── runner/  # Execution runners
│       ├── api_client.py  # Backend API push client
│       ├── loop.py  # Job crawl loop runner
│       └── company_loop.py  # Company crawl loop runner
├── migrations/
│   └── models/  # Aerich migration files (auto-generated)
├── static/  # Served at /static/ (built frontend or assets)
├── deploy/  # Deployment configs and sample images
├── docs/  # Project documentation
├── ecs_full_pipeline.py  # Alibaba Cloud ECS batch provisioning script
├── create_ecs_instances.py  # ECS instance creation utilities
├── reclean_qcwy_jobs.py  # Standalone QCWY re-cleaning script
├── run.py  # Uvicorn launch entry point
├── Pipfile / Pipfile.lock  # Python dependency management (pipenv)
├── pyproject.toml  # ruff + black + isort tool config
└── Dockerfile  # Container build spec
```
## Directory Purposes
**`app/api/v1/`:**
- Purpose: One sub-package per API domain; each contains `__init__.py` exporting its `APIRouter`
- Contains: Route handler functions, dependency injection wiring, Pydantic schema imports
- Key files: `app/api/v1/__init__.py` (aggregation), `app/api/v1/job/job.py` (ingest routes), `app/api/v1/analytics.py`
**`app/controllers/`:**
- Purpose: Business-operation classes for MySQL-backed entities; encapsulate Tortoise-ORM queries
- Contains: One controller per domain entity; typically expose `create`, `update`, `delete`, `list`, `get` methods
- Key files: `app/controllers/api.py` (scan and register FastAPI routes → DB)
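
The controller convention can be sketched roughly as follows. This is a minimal in-memory illustration, not the project's code: `InMemoryController` and its dict-backed storage are hypothetical stand-ins for the Tortoise-ORM queries (which are presumably async), and only the method names `create`, `update`, `delete`, `list`, `get` and the module-level singleton style are taken from this document.

```python
from itertools import count
from typing import Any, Dict, List, Optional


class InMemoryController:
    """Hypothetical stand-in for a controller; real ones wrap Tortoise-ORM queries."""

    def __init__(self) -> None:
        self._rows: Dict[int, Dict[str, Any]] = {}
        self._ids = count(1)

    def create(self, data: Dict[str, Any]) -> Dict[str, Any]:
        row = {"id": next(self._ids), **data}
        self._rows[row["id"]] = row
        return row

    def get(self, row_id: int) -> Optional[Dict[str, Any]]:
        return self._rows.get(row_id)

    def update(self, row_id: int, data: Dict[str, Any]) -> Dict[str, Any]:
        # Raises KeyError if the row does not exist.
        self._rows[row_id].update(data)
        return self._rows[row_id]

    def delete(self, row_id: int) -> None:
        self._rows.pop(row_id, None)

    def list(self, page: int = 1, page_size: int = 10) -> List[Dict[str, Any]]:
        rows = sorted(self._rows.values(), key=lambda r: r["id"])
        start = (page - 1) * page_size
        return rows[start:start + page_size]


class UserController(InMemoryController):
    """One controller per domain entity."""


# Module-level singleton in snake_case, imported by route handlers.
user_controller = UserController()
```

Route handlers would then import `user_controller` rather than instantiating the class per request.
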
**`app/services/ingest/`:**
- Purpose: Self-contained data pipeline: registry lookup → dedup check → ClickHouse insert → remote push
- Contains: `service.py` (main orchestrator), `registry.py` (config store), `dedup.py`, `remote_push.py`, `configs/` (registrations per platform)
- Key files: `app/services/ingest/service.py`, `app/services/ingest/registry.py`
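
The four pipeline stages can be sketched as below. This is an illustration under stated assumptions: the `DEDUP_FIELDS` mapping, the function signatures, and the in-memory `table` / `seen` containers are hypothetical stand-ins for the real registry, ClickHouse, and the remote target; only the stage order and the name `batch_dedup_filter` come from this document.

```python
from typing import Any, Callable, Dict, List, Set, Tuple

Row = Dict[str, Any]
Key = Tuple[Any, ...]

# Hypothetical registry: (platform, data_type) -> fields that identify a duplicate.
DEDUP_FIELDS: Dict[Tuple[str, str], Tuple[str, ...]] = {
    ("boss", "job"): ("job_id",),
}


def batch_dedup_filter(rows: List[Row], fields: Tuple[str, ...], seen: Set[Key]) -> List[Row]:
    """Keep rows whose key is neither stored already nor repeated within the batch."""
    fresh: List[Row] = []
    batch_keys: Set[Key] = set()
    for row in rows:
        key = tuple(row.get(f) for f in fields)
        if key in seen or key in batch_keys:
            continue
        batch_keys.add(key)
        fresh.append(row)
    return fresh


def store_batch(
    platform: str,
    data_type: str,
    rows: List[Row],
    table: List[Row],
    seen: Set[Key],
    push: Callable[[List[Row]], None],
) -> int:
    fields = DEDUP_FIELDS[(platform, data_type)]    # 1. registry lookup
    fresh = batch_dedup_filter(rows, fields, seen)  # 2. dedup check
    table.extend(fresh)                             # 3. insert (list stands in for ClickHouse)
    seen.update(tuple(r.get(f) for f in fields) for r in fresh)
    if fresh:
        push(fresh)                                 # 4. remote push
    return len(fresh)
```

Returning the number of rows actually stored lets a caller report how many records in a batch were new versus duplicates.
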
**`app/services/crawler/`:**
- Purpose: In-process HTTP clients for platform APIs, used by the backend scheduler (company cleaning) and the company search controller
- Contains: Underscore-prefixed private modules (`_boss_client.py`, `_boss_sign.py`, etc.) and public service facades (`boss.py`, `qcwy.py`, `zhilian.py`)
- Key files: `app/services/crawler/_base.py` (shared base classes)
**`app/core/`:**
- Purpose: Framework plumbing — startup, middleware chain, auth, distributed locking, ClickHouse connection
- Key files: `app/core/dependency.py` (auth guards), `app/core/scheduler.py` (all cron jobs), `app/core/locks.py`
**`app/models/`:**
- Purpose: All Tortoise-ORM model definitions for MySQL; schema changes are migrated via Aerich (ClickHouse tables are not managed here, see `app/core/clickhouse_init.py`)
- Key files: `app/models/admin.py` (RBAC), `app/models/keyword.py` (crawl state tracking)
**`app/repositories/`:**
- Purpose: Typed ClickHouse query encapsulation following the Repository pattern
- Key files: `app/repositories/clickhouse_repo.py`
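
A rough sketch of the Repository pattern as it might apply here. The class names match this document, but everything else is assumed for illustration: the `QueryClient` protocol, the `_fetch` helper, the `daily_counts` method, and the table and column names in the SQL are hypothetical.

```python
from typing import Any, Dict, List, Protocol, Sequence


class QueryClient(Protocol):
    """Anything that can run a parameterised query and return rows as tuples."""

    def query(self, sql: str, params: Dict[str, Any]) -> Sequence[tuple]: ...


class ClickHouseBaseRepo:
    """Holds the client and a low-level fetch helper (sketch only)."""

    def __init__(self, client: QueryClient) -> None:
        self._client = client

    def _fetch(self, sql: str, params: Dict[str, Any], columns: List[str]) -> List[Dict[str, Any]]:
        rows = self._client.query(sql, params)
        return [dict(zip(columns, row)) for row in rows]


class JobAnalyticsRepo(ClickHouseBaseRepo):
    def daily_counts(self, platform: str, days: int) -> List[Dict[str, Any]]:
        # Table and column names are hypothetical.
        sql = (
            "SELECT toDate(created_at) AS day, count() AS total "
            "FROM jobs WHERE platform = %(platform)s "
            "AND created_at >= today() - %(days)s "
            "GROUP BY day ORDER BY day"
        )
        return self._fetch(sql, {"platform": platform, "days": days}, ["day", "total"])
```

Because the repo depends only on the small `QueryClient` protocol, a service such as `AnalyticsService` can be tested against a fake client without a running ClickHouse instance.
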
**`jobs_spider/`:**
- Purpose: Standalone scripts that run on ECS nodes; communicate with backend only via HTTP
- Key files: `jobs_spider/boss/boos_api.py` (2246-line core with SmartIPManager, BossZhipinAPI)
**`spiderJobs/`:**
- Purpose: Newer modular spider framework; runs in parallel to `jobs_spider/` but separates shared core (`core/`), platform implementations (`platforms/`), and execution runners (`runner/`)
- Key files: `spiderJobs/runner/api_client.py`, `spiderJobs/runner/loop.py`
**`web/src/views/`:**
- Purpose: One directory per route group; each view owns its local route definition in `route.js`
- Key files: `web/src/views/analytics/index.vue` (ECharts dashboard), `web/src/views/cleaning/monitor.vue`
## Key File Locations
**Entry Points:**
- `run.py`: Uvicorn startup; reads env vars `APP_HOST`, `APP_PORT`, `UVICORN_WORKERS`
- `app/__init__.py`: FastAPI app factory `create_app()` and lifespan context manager
- `web/src/main.js`: Vue 3 app entry; mounts Pinia, Router, directives
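
As an illustration of the `run.py` entry point, the environment variables could be read along these lines. The helper name `load_launch_options` and all default values are assumptions; only the variable names `APP_HOST`, `APP_PORT`, and `UVICORN_WORKERS` come from this document.

```python
import os
from typing import Dict, Mapping, Union


def load_launch_options(env: Mapping[str, str] = os.environ) -> Dict[str, Union[str, int]]:
    """Read launch settings from the environment; the defaults here are assumed."""
    return {
        "host": env.get("APP_HOST", "0.0.0.0"),
        "port": int(env.get("APP_PORT", "8000")),
        "workers": int(env.get("UVICORN_WORKERS", "1")),
    }
```

The resulting keys line up with the `host`, `port`, and `workers` parameters of `uvicorn.run(...)`, so the dict could be forwarded with `**options`.
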
**Configuration:**
- `app/settings/config.py`: All backend config (`Settings` extends `pydantic_settings.BaseSettings`); override via env vars
- `web/vite.config.js`: Vite build and dev server config
- `pyproject.toml`: ruff, black, isort tool configuration
**Core Logic:**
- `app/core/init_app.py`: Startup sequence (migrations, seed data, ClickHouse init)
- `app/core/scheduler.py`: All APScheduler job functions and `start_scheduler()`
- `app/core/clickhouse_init.py`: All ClickHouse `CREATE TABLE IF NOT EXISTS` DDL statements
- `app/services/ingest/registry.py`: `PlatformConfig` dataclass + global registry dict
- `app/services/ingest/service.py`: `IngestService.store_batch()` — main data ingestion path
- `app/services/company_cleaner.py`: `CompanyCleaner` — automated company data pipeline
**ClickHouse Schema:**
- `app/core/clickhouse_init.py`: Defines all 6 MergeTree tables and 1 analytics VIEW DDL; apply via `ClickHouseInitializer.initialize_all_tables()`
**MySQL Migrations:**
- `migrations/models/`: Auto-generated by `aerich migrate`; applied via `aerich upgrade` or automatically on startup
**Testing:**
- `tests/test_company_jobs_sync.py`
- `tests/test_company_storage.py`
## Naming Conventions
**Files (Python):**
- Modules: `snake_case.py` (e.g., `company_cleaner.py`, `clickhouse_init.py`)
- Private crawler modules: leading underscore prefix (e.g., `_boss_client.py`, `_zhilian_sign.py`)
- Spider scripts in `jobs_spider/`: descriptive names without prefix (e.g., `boos_api.py`, `zhilian_single.py`)
**Files (Vue/JS):**
- Pages: `index.vue` for the primary view; a descriptive name (e.g., `monitor.vue`) for secondary views
- API modules: `snake_case.js` matching domain (e.g., `analytics.js`, `keyword.js`)
- Router guard files: Grouped in `router/guard/` directory
**Classes (Python):**
- Services: `PascalCaseService` (e.g., `IngestService`, `AnalyticsService`); a few pipeline classes omit the suffix (e.g., `CompanyCleaner`)
- Controllers: `PascalCaseController` with module-level singleton instance in snake_case (e.g., `user_controller = UserController()`)
- Repos: `PascalCaseRepo` (e.g., `ClickHouseBaseRepo`, `JobAnalyticsRepo`)
- Models: `PascalCase` matching table concept (e.g., `BossToken`, `CompanyCleaningQueue`)
**Directories:**
- Domain groupings: lowercase (e.g., `cleaning/`, `recruitment/`, `keyword/`)
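- Platform names: `boss`, `qcwy`, `zhilian` (lowercase) across most layers; note that the same platform appears as `qcwy` in facades, models, and `jobs_spider/`, but as `job51` in the private crawler modules (e.g., `_job51_client.py`) and in `spiderJobs/platforms/job51/`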
## Where to Add New Code
**New API Endpoint:**
1. Create or update router file in `app/api/v1/{domain}/`
2. Register the router in `app/api/v1/__init__.py` with `v1_router.include_router(...)`
3. Run `api_controller.refresh_api()` on next app startup to sync new routes to the `api` table
4. Add corresponding Pydantic schemas in `app/schemas/{domain}.py`
**New Platform Data Type (Ingest):**
1. Define `PlatformConfig` with dedup fields in `app/services/ingest/configs/`
2. Call `register(config)` at module import in `app/services/ingest/configs/__init__.py`
3. ClickHouse table DDL goes in `app/core/clickhouse_init.py` under `_TABLE_DDLS`
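
The registration steps above might look roughly like this. The field names on `PlatformConfig`, the `_REGISTRY` dict, the duplicate check, and the `liepin` platform are all hypothetical; only `PlatformConfig`, `register()`, and `get_config()` are named in this document.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class PlatformConfig:
    # Field names are illustrative; the real dataclass may differ.
    platform: str
    data_type: str
    table: str
    dedup_fields: Tuple[str, ...]


_REGISTRY: Dict[Tuple[str, str], PlatformConfig] = {}


def register(config: PlatformConfig) -> PlatformConfig:
    """Add a config to the global registry, rejecting duplicates."""
    key = (config.platform, config.data_type)
    if key in _REGISTRY:
        raise ValueError(f"duplicate registration for {key}")
    _REGISTRY[key] = config
    return config


def get_config(platform: str, data_type: str) -> PlatformConfig:
    try:
        return _REGISTRY[(platform, data_type)]
    except KeyError:
        raise KeyError(f"no PlatformConfig registered for {platform}/{data_type}") from None


# Runs once at import time, mirroring step 2 (hypothetical new platform).
register(PlatformConfig("liepin", "job", "liepin_job", ("job_id",)))
```

Registering at import time means a config becomes available as soon as the `configs` package is imported, with no explicit wiring in the ingest service.
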
**New MySQL Model:**
1. Add model class to appropriate file in `app/models/` (or create new file)
2. Register the model module in `app/settings/config.py` under `TORTOISE_ORM.apps.models.models`
3. Run `aerich migrate && aerich upgrade` (or rely on `RUN_MIGRATIONS_ON_STARTUP=True`)
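
For step 2, the `TORTOISE_ORM` setting usually has the shape shown below. The connection URL and the exact module list are placeholders, and `app.models.report` is a hypothetical new module; the key path `apps.models.models` is the one named in step 2.

```python
# Sketch of the TORTOISE_ORM setting; connection URL and module list are placeholders.
TORTOISE_ORM = {
    "connections": {
        "default": "mysql://user:password@localhost:3306/jobdata",
    },
    "apps": {
        "models": {
            "models": [
                "app.models.admin",
                "app.models.keyword",
                "app.models.report",  # hypothetical new model module
                "aerich.models",      # lets Aerich track migration state
            ],
            "default_connection": "default",
        },
    },
}
```

A model module that is missing from this list is not discovered by Tortoise-ORM, so `aerich migrate` would generate no migration for it.
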
**New Scheduled Job:**
1. Define async job function in `app/core/scheduler.py`
2. Wrap execution body with `DistributedLock` context from `app/core/locks.py`
3. Add `scheduler.add_job(...)` call inside `start_scheduler()`
4. Record run result via `_record_task_run()` to `ScheduledTaskRun` table
**New Frontend Page:**
1. Create view component at `web/src/views/{module}/index.vue`
2. Add `route.js` in the same directory with route meta
3. Register the route in `web/src/router/routes/` static routes or rely on backend menu system
4. Add corresponding backend menu entry in `init_menus()` in `app/core/init_app.py` (for seed data)
**New In-Process Platform Crawler:**
1. Add private client modules `_platform_client.py`, `_platform_api.py`, `_platform_sign.py` under `app/services/crawler/`
2. Create public facade `platform.py` extending or wrapping `BaseFetcher` / `BaseSearcher` from `app/services/crawler/_base.py`
3. Instantiate in `CompanyCleaner.__init__()` if needed for company pipeline
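
A possible shape for step 2, shown only as a sketch. The names `ApiResult` and `BaseSearcher` appear in this document, but their fields and methods here (`ok`, `data`, `error`, `_request`, `search`) are assumptions, and `ExamplePlatformService` is a hypothetical facade that returns canned data instead of calling real client and signing modules.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ApiResult:
    ok: bool
    data: List[Dict[str, Any]] = field(default_factory=list)
    error: str = ""


class BaseSearcher(ABC):
    """Template: subclasses supply the request; the base normalises the outcome."""

    @abstractmethod
    def _request(self, keyword: str, page: int) -> Dict[str, Any]:
        ...

    def search(self, keyword: str, page: int = 1) -> ApiResult:
        try:
            payload = self._request(keyword, page)
        except Exception as exc:  # e.g. network or signing failure
            return ApiResult(ok=False, error=str(exc))
        return ApiResult(ok=True, data=list(payload.get("items", [])))


class ExamplePlatformService(BaseSearcher):
    """Facade for a hypothetical platform; a real one would use its private modules."""

    def _request(self, keyword: str, page: int) -> Dict[str, Any]:
        return {"items": [{"company": f"{keyword}-{page}"}]}
```

Keeping error handling in the base class means each platform facade only has to implement the request itself.
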
**Shared Utilities:**
- Backend Python utilities: `app/utils/`
- Frontend JS utilities: `web/src/utils/`
## Special Directories
**`.startup_lock/`:**
- Purpose: File-based mutex directory used to prevent multi-worker race on startup seed data
- Generated: Yes (created and deleted at startup)
- Committed: No (in `.gitignore`)
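
A directory can serve as a mutex because `os.mkdir` either creates it or fails atomically. The sketch below shows the general technique; it is not the project's startup code, and the `startup_lock` helper is hypothetical.

```python
import os
import shutil
from contextlib import contextmanager


@contextmanager
def startup_lock(path: str = ".startup_lock"):
    """Yield True for the single process that created the directory, else False."""
    try:
        os.mkdir(path)  # atomic: raises if the directory already exists
        acquired = True
    except FileExistsError:
        acquired = False
    try:
        yield acquired
    finally:
        if acquired:
            shutil.rmtree(path, ignore_errors=True)
```

A worker that receives `False` would skip the seed step. One caveat of this technique is that a crash while holding the lock leaves the directory behind, so stale directories may need manual cleanup.
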
**`migrations/`:**
- Purpose: Aerich migration history for MySQL schema
- Generated: Yes (by `aerich migrate`)
- Committed: Yes
**`static/`:**
- Purpose: Served at `/static/` path; can hold built frontend assets or other static files
- Generated: Partially (built frontend output can be placed here)
- Committed: Yes (currently contains static assets)
**`.planning/`:**
- Purpose: GSD planning documents (codebase analysis, phase plans)
- Generated: Yes (by GSD tooling)
- Committed: Yes
**`jobs_spider/*/logs/`:**
- Purpose: Rotating daily log files for standalone spider processes
- Generated: Yes (by loguru in spider scripts)
- Committed: No
---
*Structure analysis: 2026-03-21*