# Codebase Structure

**Analysis Date:** 2026-03-21

## Directory Layout

```
JobData/                              # Project root
├── app/                              # FastAPI backend (Python)
│   ├── __init__.py                   # App factory create_app()
│   ├── api/
│   │   └── v1/                       # All REST API routes (/api/v1/...)
│   │       ├── __init__.py           # Router aggregation + permission guards
│   │       ├── analytics.py          # Data analytics endpoints
│   │       ├── stats.py              # Table count stats endpoints
│   │       ├── pipeline.py           # ECS pipeline trigger endpoint
│   │       ├── base/                 # Login, user-info, menu-tree (no auth)
│   │       ├── job/                  # Data ingest endpoints (batch/single/async)
│   │       ├── cleaning/             # Targeted cleaning endpoints
│   │       ├── company/              # Company search endpoints
│   │       ├── keyword/              # Keyword CRUD endpoints
│   │       ├── token/                # Boss token management endpoints
│   │       ├── proxy/                # Proxy IP management endpoints
│   │       ├── users/                # User CRUD
│   │       ├── roles/                # Role + permission assignment
│   │       ├── menus/                # Menu tree CRUD
│   │       ├── apis/                 # API registration management
│   │       ├── depts/                # Department tree CRUD
│   │       └── auditlog/             # Audit log query
│   ├── controllers/                  # CRUD orchestration for MySQL entities
│   │   ├── api.py                    # ApiController (refresh_api(), scan routes)
│   │   ├── user.py                   # UserController
│   │   ├── role.py                   # RoleController
│   │   ├── menu.py                   # MenuController
│   │   ├── dept.py                   # DeptController
│   │   ├── keyword.py                # KeywordController
│   │   ├── proxy.py                  # ProxyController
│   │   ├── token.py                  # TokenController
│   │   ├── cleaning.py               # CleaningController
│   │   └── company.py                # CompanyController (search via crawler services)
│   ├── core/                         # Framework infrastructure
│   │   ├── init_app.py               # lifespan: migrations, seed, ClickHouse init
│   │   ├── scheduler.py              # APScheduler job definitions + registration
│   │   ├── clickhouse.py             # ClickHouseManager singleton
│   │   ├── clickhouse_init.py        # ClickHouse DDL (tables + analytics view)
│   │   ├── dependency.py             # AuthControl, PermissionControl, DependPermission
│   │   ├── locks.py                  # DistributedLock (Redis + file fallback)
│   │   ├── middlewares.py            # BackGroundTaskMiddleware, HttpAuditLogMiddleware
│   │   ├── ip_tracking.py            # IpTrackingMiddleware
│   │   ├── exceptions.py             # Custom exception classes + handlers
│   │   ├── crud.py                   # Generic ListParams, pagination helpers
│   │   ├── ctx.py                    # CTX_USER_ID ContextVar
│   │   ├── bgtask.py                 # BgTasks helper for BackGroundTaskMiddleware
│   │   ├── proxy_rule.py             # Proxy selection rules
│   │   └── algorithms/
│   │       └── antispider.py         # Anti-spider signature helpers
│   ├── models/                       # Tortoise-ORM models (MySQL)
│   │   ├── base.py                   # BaseModel, TimestampMixin, UUIDModel
│   │   ├── admin.py                  # User, Role, Api, Menu, Dept, AuditLog
│   │   ├── token.py                  # BossToken
│   │   ├── cleaning.py               # CleaningTask / CleaningRecord
│   │   ├── metrics.py                # ScheduledTaskRun, StatsTotal, IpUploadStats
│   │   ├── company.py                # CompanyCleaningQueue
│   │   ├── keyword.py                # BossKeyword, QcwyKeyword, ZhilianKeyword
│   │   └── enums.py                  # MethodType enum
│   ├── repositories/                 # ClickHouse query layer
│   │   └── clickhouse_repo.py        # ClickHouseBaseRepo, JobAnalyticsRepo
│   ├── schemas/                      # Pydantic request/response schemas
│   │   ├── ingest.py                 # IngestBatchRequest, PlatformType, DataType enums
│   │   ├── analytics.py              # Analytics query params + response schemas
│   │   ├── keyword.py                # Keyword schemas
│   │   ├── token.py                  # Token schemas
│   │   ├── base.py                   # Common response envelopes
│   │   ├── login.py                  # Login request/response
│   │   ├── users.py                  # User schemas
│   │   ├── roles.py                  # Role schemas
│   │   ├── menus.py                  # Menu schemas (MenuType enum)
│   │   ├── apis.py                   # API schemas
│   │   ├── depts.py                  # Dept schemas
│   │   ├── cleaning.py               # Cleaning schemas
│   │   └── proxy.py                  # Proxy schemas
│   ├── services/                     # Business logic
│   │   ├── analytics_service.py      # AnalyticsService → JobAnalyticsRepo
│   │   ├── cleaning.py               # CleaningService (targeted cleaning)
│   │   ├── company_cleaner.py        # CompanyCleaner (automated company pipeline)
│   │   ├── company_jobs_sync.py      # CompanyJobsSyncService
│   │   ├── company_storage.py        # company_storage (ClickHouse company insert)
│   │   ├── ingest/                   # Ingest pipeline subsystem
│   │   │   ├── service.py            # IngestService.store_batch/store_single/query_data
│   │   │   ├── registry.py           # PlatformConfig registry + get_config()
│   │   │   ├── dedup.py              # batch_dedup_filter() against ClickHouse
│   │   │   ├── remote_push.py        # Async HTTP push to secondary target
│   │   │   └── configs/              # Per-platform PlatformConfig registrations
│   │   └── crawler/                  # In-process platform HTTP clients
│   │       ├── _base.py              # BaseFetcher, BaseSearcher, ApiResult
│   │       ├── _http_client.py       # HTTPClient wrapper (sync requests)
│   │       ├── _boss_client.py       # Boss HTTP client
│   │       ├── _boss_api.py          # Boss API methods
│   │       ├── _boss_sign.py         # Boss request signing
│   │       ├── _zhilian_client.py
│   │       ├── _zhilian_api.py
│   │       ├── _zhilian_sign.py
│   │       ├── _job51_client.py
│   │       ├── _job51_api.py
│   │       ├── _job51_sign.py
│   │       ├── boss.py               # BossService (high-level facade)
│   │       ├── qcwy.py               # QcwyService (high-level facade)
│   │       └── zhilian.py            # ZhilianService (high-level facade)
│   ├── settings/
│   │   └── config.py                 # Settings (pydantic-settings BaseSettings)
│   ├── log/
│   │   └── __init__.py               # loguru logger export
│   └── utils/                        # Shared utilities
├── web/                              # Vue 3 frontend
│   ├── src/
│   │   ├── main.js                   # App entry, Pinia + Router setup
│   │   ├── App.vue                   # Root component
│   │   ├── api/                      # Axios API call modules
│   │   │   ├── index.js              # System management APIs
│   │   │   ├── analytics.js          # Analytics APIs
│   │   │   ├── keyword.js            # Keyword APIs
│   │   │   ├── proxy.js              # Proxy APIs
│   │   │   └── token.js              # Token APIs
│   │   ├── components/               # Shared components (CrudTable, CrudModal, QueryBar)
│   │   ├── layout/                   # App shell (sidebar, topbar, tabs)
│   │   ├── router/                   # Vue Router config + dynamic route loading
│   │   │   ├── index.js              # createRouter(), addDynamicRoutes()
│   │   │   ├── guard/                # Navigation guards (auth, loading, title)
│   │   │   └── routes/               # Static route definitions
│   │   ├── store/                    # Pinia stores
│   │   │   └── modules/              # user, permission, app, tags
│   │   ├── utils/
│   │   │   ├── auth/                 # JWT token read/write
│   │   │   ├── http/                 # Axios instance + interceptors
│   │   │   └── storage/              # localStorage wrapper
│   │   └── views/                    # Page-level Vue components
│   │       ├── login/                # Login page
│   │       ├── analytics/            # ECharts dashboard (trend + distribution)
│   │       ├── recruitment/          # Job data browser (boss/, qcwy/, zhilian/)
│   │       ├── cleaning/             # Cleaning operation + monitor pages
│   │       ├── keyword/              # Keyword management
│   │       ├── profile/              # User profile
│   │       └── system/               # RBAC admin (user, role, menu, api, dept, auditlog, proxy, token)
│   ├── vite.config.js                # Vite build config (plugins, proxy)
│   └── package.json                  # Frontend dependencies
├── jobs_spider/                      # Standalone crawlers (external process)
│   ├── boss/
│   │   ├── boos_api.py               # BossZhipinAPI core + SmartIPManager + main (~2246 lines)
│   │   ├── company_spider.py         # Boss company batch spider
│   │   └── enumerate_combos.py       # City × keyword combo enumeration
│   ├── qcwy/
│   │   ├── qcwy.py                   # QCWY API wrapper
│   │   ├── search_company_jobs.py
│   │   └── run_company_search.py
│   └── zhilian/
│       ├── zhilian_single.py         # Zhilian single-job spider
│       └── company_spider.py         # Zhilian company spider
├── spiderJobs/                       # Second-generation spider framework (structured)
│   ├── core/
│   │   ├── base.py                   # Shared base classes
│   │   └── http_client.py            # HTTP client abstraction
│   ├── platforms/                    # Platform-specific implementations
│   │   ├── boss/
│   │   ├── job51/
│   │   └── zhilian/
│   └── runner/                       # Execution runners
│       ├── api_client.py             # Backend API push client
│       ├── loop.py                   # Job crawl loop runner
│       └── company_loop.py           # Company crawl loop runner
├── migrations/
│   └── models/                       # Aerich migration files (auto-generated)
├── static/                           # Served at /static/ (built frontend or assets)
├── deploy/                           # Deployment configs and sample images
├── docs/                             # Project documentation
├── ecs_full_pipeline.py              # Alibaba Cloud ECS batch provisioning script
├── create_ecs_instances.py           # ECS instance creation utilities
├── reclean_qcwy_jobs.py              # Standalone QCWY re-cleaning script
├── run.py                            # Uvicorn launch entry point
├── Pipfile / Pipfile.lock            # Python dependency management (pipenv)
├── pyproject.toml                    # ruff + black + isort tool config
└── Dockerfile                        # Container build spec
```

## Directory Purposes

**`app/api/v1/`:**
- Purpose: One sub-package per API domain; each contains `__init__.py` exporting its `APIRouter`
- Contains: Route handler functions, dependency injection wiring, Pydantic schema imports
- Key files: `app/api/v1/__init__.py` (aggregation), `app/api/v1/job/job.py` (ingest routes), `app/api/v1/analytics.py`

**`app/controllers/`:**
- Purpose: Business-operation classes for MySQL-backed entities; encapsulate Tortoise-ORM queries
- Contains: One controller per domain entity; typically expose `create`, `update`, `delete`, `list`, `get` methods
- Key files: `app/controllers/api.py` (scan and register FastAPI routes → DB)

**`app/services/ingest/`:**
- Purpose: Self-contained data pipeline: registry lookup → dedup check → ClickHouse insert → remote push
- Contains: `service.py` (main orchestrator), `registry.py` (config store), `dedup.py`, `remote_push.py`, `configs/` (registrations per platform)
- Key files: `app/services/ingest/service.py`, `app/services/ingest/registry.py`

**`app/services/crawler/`:**
- Purpose: In-process HTTP clients for platform APIs, used by backend scheduler (company cleaning) and company search controller
- Contains: Underscore-prefixed private modules (`_boss_client.py`, `_boss_sign.py`, etc.) and public service facades (`boss.py`, `qcwy.py`, `zhilian.py`)
- Key files: `app/services/crawler/_base.py` (shared base classes)

**`app/core/`:**
- Purpose: Framework plumbing: startup, middleware chain, auth, distributed locking, ClickHouse connection
- Key files: `app/core/dependency.py` (auth guards), `app/core/scheduler.py` (all cron jobs), `app/core/locks.py`

**`app/models/`:**
- Purpose: All Tortoise-ORM model definitions; migrated via Aerich (not ClickHouse)
- Key files: `app/models/admin.py` (RBAC), `app/models/keyword.py` (crawl state tracking)

**`app/repositories/`:**
- Purpose: Typed ClickHouse query encapsulation following the Repository pattern
- Key files: `app/repositories/clickhouse_repo.py`

**`jobs_spider/`:**
- Purpose: Standalone scripts that run on ECS nodes; communicate with backend only via HTTP
- Key files: `jobs_spider/boss/boos_api.py` (2246-line core with SmartIPManager, BossZhipinAPI)

**`spiderJobs/`:**
- Purpose: Newer modular spider framework; parallel architecture to `jobs_spider/` with cleaner separation
- Key files: `spiderJobs/runner/api_client.py`, `spiderJobs/runner/loop.py`

**`web/src/views/`:**
- Purpose: One directory per route group; each view owns its local route definition in `route.js`
- Key files: `web/src/views/analytics/index.vue` (ECharts dashboard), `web/src/views/cleaning/monitor.vue`

## Key File Locations

**Entry Points:**
- `run.py`: Uvicorn startup; reads env vars `APP_HOST`, `APP_PORT`, `UVICORN_WORKERS`
- `app/__init__.py`: FastAPI app factory `create_app()` and lifespan context manager
- `web/src/main.js`: Vue 3 app entry; mounts Pinia, Router, directives

**Configuration:**
- `app/settings/config.py`: All backend config (`Settings` extends `pydantic_settings.BaseSettings`); override via env vars
- `web/vite.config.js`: Vite build and dev server config
- `pyproject.toml`: ruff, black, isort tool configuration

**Core Logic:**
- `app/core/init_app.py`: Startup sequence (migrations, seed data, ClickHouse init)
- `app/core/scheduler.py`: All APScheduler job functions and `start_scheduler()`
- `app/core/clickhouse_init.py`: All ClickHouse `CREATE TABLE IF NOT EXISTS` DDL statements
- `app/services/ingest/registry.py`: `PlatformConfig` dataclass + global registry dict
- `app/services/ingest/service.py`: `IngestService.store_batch()`, the main data ingestion path
- `app/services/company_cleaner.py`: `CompanyCleaner`, the automated company data pipeline

**ClickHouse Schema:**
- `app/core/clickhouse_init.py`: Defines all 6 MergeTree tables and 1 analytics VIEW DDL; apply via `ClickHouseInitializer.initialize_all_tables()`

**MySQL Migrations:**
- `migrations/models/`: Auto-generated by `aerich migrate`; applied via `aerich upgrade` or automatically on startup

**Testing:**
- `tests/test_company_jobs_sync.py`
- `tests/test_company_storage.py`

## Naming Conventions

**Files (Python):**
- Modules: `snake_case.py` (e.g., `company_cleaner.py`, `clickhouse_init.py`)
- Private crawler modules: leading underscore prefix (e.g., `_boss_client.py`, `_zhilian_sign.py`)
- Spider scripts in `jobs_spider/`: descriptive names without prefix (e.g., `boos_api.py`, `zhilian_single.py`)

**Files (Vue/JS):**
- Pages: `index.vue` for primary view, `monitor.vue` / descriptive name for secondary views
- API modules: `snake_case.js` matching domain (e.g., `analytics.js`, `keyword.js`)
- Router guard files: Grouped in `router/guard/` directory

**Classes (Python):**
- Services: `PascalCaseService` (e.g., `IngestService`, `AnalyticsService`, `CompanyCleaner`)
- Controllers: `PascalCaseController` with module-level singleton instance in snake_case (e.g., `user_controller = UserController()`)
- Repos: `PascalCaseRepo` (e.g., `ClickHouseBaseRepo`, `JobAnalyticsRepo`)
- Models: `PascalCase` matching table concept (e.g., `BossToken`, `CompanyCleaningQueue`)

**Directories:**
- Domain groupings: lowercase (e.g., `cleaning/`, `recruitment/`, `keyword/`)
- Platform names used consistently: `boss`, `qcwy`, `zhilian` (lowercase) across all layers

## Where to Add New Code

**New API Endpoint:**
1. Create or update router file in `app/api/v1/{domain}/`
2. Register the router in `app/api/v1/__init__.py` with `v1_router.include_router(...)`
3. Run `api_controller.refresh_api()` on next app startup to sync new routes to the `api` table
4. Add corresponding Pydantic schemas in `app/schemas/{domain}.py`

**New Platform Data Type (Ingest):**
1. Define `PlatformConfig` with dedup fields in `app/services/ingest/configs/`
2. Call `register(config)` at module import in `app/services/ingest/configs/__init__.py`
3. ClickHouse table DDL goes in `app/core/clickhouse_init.py` under `_TABLE_DDLS`

**New MySQL Model:**
1. Add model class to appropriate file in `app/models/` (or create new file)
2. Register the model module in `app/settings/config.py` under `TORTOISE_ORM.apps.models.models`
3. Run `aerich migrate && aerich upgrade` (or rely on `RUN_MIGRATIONS_ON_STARTUP=True`)

**New Scheduled Job:**
1. Define async job function in `app/core/scheduler.py`
2. Wrap execution body with `DistributedLock` context from `app/core/locks.py`
3. Add `scheduler.add_job(...)` call inside `start_scheduler()`
4. Record run result via `_record_task_run()` to `ScheduledTaskRun` table

**New Frontend Page:**
1. Create view component at `web/src/views/{module}/index.vue`
2. Add `route.js` in the same directory with route meta
3. Register the route in `web/src/router/routes/` static routes or rely on backend menu system
4. Add corresponding backend menu entry in `app/core/init_app.py` → `init_menus()` (for seed data)

**New In-Process Platform Crawler:**
1. Add private client modules `_platform_client.py`, `_platform_api.py`, `_platform_sign.py` under `app/services/crawler/`
2. Create public facade `platform.py` extending or wrapping `BaseFetcher` / `BaseSearcher` from `app/services/crawler/_base.py`
3. Instantiate in `CompanyCleaner.__init__()` if needed for company pipeline

**Shared Utilities:**
- Backend Python utilities: `app/utils/`
- Frontend JS utilities: `web/src/utils/`

## Special Directories

**`.startup_lock/`:**
- Purpose: File-based mutex directory used to prevent multi-worker race on startup seed data
- Generated: Yes (created and deleted at startup)
- Committed: No (in `.gitignore`)

**`migrations/`:**
- Purpose: Aerich migration history for MySQL schema
- Generated: Yes (by `aerich migrate`)
- Committed: Yes

**`static/`:**
- Purpose: Served at `/static/` path; can hold built frontend assets or other static files
- Generated: Partially (built frontend output can be placed here)
- Committed: Yes (currently contains static assets)

**`.planning/`:**
- Purpose: GSD planning documents (codebase analysis, phase plans)
- Generated: Yes (by GSD tooling)
- Committed: Yes

**`jobs_spider/*/logs/`:**
- Purpose: Rotating daily log files for standalone spider processes
- Generated: Yes (by loguru in spider scripts)
- Committed: No

---

*Structure analysis: 2026-03-21*
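
**Illustrative sketch (ingest flow):**

The ingest flow described for `app/services/ingest/` (registry lookup → dedup check → ClickHouse insert → remote push) and the "New Platform Data Type" steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's implementation: the fields on `PlatformConfig`, the `(platform, data_type)` registry key, every function signature, and the table name `boss_job` are hypothetical. An in-memory set and dict stand in for ClickHouse, and the remote push step is left out.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class PlatformConfig:
    """Hypothetical shape; the real dataclass lives in app/services/ingest/registry.py."""
    platform: str                  # e.g. "boss", "qcwy", "zhilian"
    data_type: str                 # e.g. "job" (assumed value)
    table: str                     # target table name (assumed field)
    dedup_fields: tuple[str, ...]  # fields that together identify a duplicate row


# Global registry dict; keying by (platform, data_type) is an assumption.
_REGISTRY: dict[tuple[str, str], PlatformConfig] = {}


def register(config: PlatformConfig) -> None:
    """Add a config to the registry, replacing any earlier entry with the same key."""
    _REGISTRY[(config.platform, config.data_type)] = config


def get_config(platform: str, data_type: str) -> PlatformConfig:
    """Look up a registered config or raise ValueError if none exists."""
    try:
        return _REGISTRY[(platform, data_type)]
    except KeyError:
        raise ValueError(f"no PlatformConfig for {platform}/{data_type}") from None


def batch_dedup_filter(
    rows: list[dict[str, Any]], config: PlatformConfig, seen: set
) -> list[dict[str, Any]]:
    """Keep rows whose dedup key is new; `seen` stands in for a ClickHouse lookup."""
    fresh = []
    for row in rows:
        key = tuple(row.get(field) for field in config.dedup_fields)
        if key in seen:
            continue
        seen.add(key)
        fresh.append(row)
    return fresh


def store_batch(
    platform: str, data_type: str, rows: list[dict[str, Any]], seen: set, sink: dict
) -> int:
    """Registry lookup, dedup, then append to `sink` (a stand-in for the insert)."""
    config = get_config(platform, data_type)
    fresh = batch_dedup_filter(rows, config, seen)
    sink.setdefault(config.table, []).extend(fresh)
    return len(fresh)


# Registering at import time mirrors step 2 of "New Platform Data Type (Ingest)".
register(PlatformConfig("boss", "job", "boss_job", ("job_id",)))

seen_keys: set = set()
fake_clickhouse: dict = {}
rows = [{"job_id": "a1"}, {"job_id": "a1"}, {"job_id": "b2"}]
print(store_batch("boss", "job", rows, seen_keys, fake_clickhouse))  # 2
```

Keeping the dedup fields on the config rather than in the service means a new platform only needs a new registration, which matches the three-step procedure listed above.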