# Codebase Structure

**Analysis Date:** 2026-03-21

## Directory Layout

```
JobData/                                # Project root
├── app/                                # FastAPI backend (Python)
│   ├── __init__.py                     # App factory create_app()
│   ├── api/
│   │   └── v1/                         # All REST API routes (/api/v1/...)
│   │       ├── __init__.py             # Router aggregation + permission guards
│   │       ├── analytics.py            # Data analytics endpoints
│   │       ├── stats.py                # Table count stats endpoints
│   │       ├── pipeline.py             # ECS pipeline trigger endpoint
│   │       ├── base/                   # Login, user-info, menu-tree (no auth)
│   │       ├── job/                    # Data ingest endpoints (batch/single/async)
│   │       ├── cleaning/               # Targeted cleaning endpoints
│   │       ├── company/                # Company search endpoints
│   │       ├── keyword/                # Keyword CRUD endpoints
│   │       ├── token/                  # Boss token management endpoints
│   │       ├── proxy/                  # Proxy IP management endpoints
│   │       ├── users/                  # User CRUD
│   │       ├── roles/                  # Role + permission assignment
│   │       ├── menus/                  # Menu tree CRUD
│   │       ├── apis/                   # API registration management
│   │       ├── depts/                  # Department tree CRUD
│   │       └── auditlog/               # Audit log query
│   ├── controllers/                    # CRUD orchestration for MySQL entities
│   │   ├── api.py                      # ApiController (refresh_api(), scan routes)
│   │   ├── user.py                     # UserController
│   │   ├── role.py                     # RoleController
│   │   ├── menu.py                     # MenuController
│   │   ├── dept.py                     # DeptController
│   │   ├── keyword.py                  # KeywordController
│   │   ├── proxy.py                    # ProxyController
│   │   ├── token.py                    # TokenController
│   │   ├── cleaning.py                 # CleaningController
│   │   └── company.py                  # CompanyController (search via crawler services)
│   ├── core/                           # Framework infrastructure
│   │   ├── init_app.py                 # lifespan: migrations, seed, ClickHouse init
│   │   ├── scheduler.py                # APScheduler job definitions + registration
│   │   ├── clickhouse.py               # ClickHouseManager singleton
│   │   ├── clickhouse_init.py          # ClickHouse DDL (tables + analytics view)
│   │   ├── dependency.py               # AuthControl, PermissionControl, DependPermission
│   │   ├── locks.py                    # DistributedLock (Redis + file fallback)
│   │   ├── middlewares.py              # BackGroundTaskMiddleware, HttpAuditLogMiddleware
│   │   ├── ip_tracking.py              # IpTrackingMiddleware
│   │   ├── exceptions.py               # Custom exception classes + handlers
│   │   ├── crud.py                     # Generic ListParams, pagination helpers
│   │   ├── ctx.py                      # CTX_USER_ID ContextVar
│   │   ├── bgtask.py                   # BgTasks helper for BackGroundTaskMiddleware
│   │   ├── proxy_rule.py               # Proxy selection rules
│   │   └── algorithms/
│   │       └── antispider.py           # Anti-spider signature helpers
│   ├── models/                         # Tortoise-ORM models (MySQL)
│   │   ├── base.py                     # BaseModel, TimestampMixin, UUIDModel
│   │   ├── admin.py                    # User, Role, Api, Menu, Dept, AuditLog
│   │   ├── token.py                    # BossToken
│   │   ├── cleaning.py                 # CleaningTask / CleaningRecord
│   │   ├── metrics.py                  # ScheduledTaskRun, StatsTotal, IpUploadStats
│   │   ├── company.py                  # CompanyCleaningQueue
│   │   ├── keyword.py                  # BossKeyword, QcwyKeyword, ZhilianKeyword
│   │   └── enums.py                    # MethodType enum
│   ├── repositories/                   # ClickHouse query layer
│   │   └── clickhouse_repo.py          # ClickHouseBaseRepo, JobAnalyticsRepo
│   ├── schemas/                        # Pydantic request/response schemas
│   │   ├── ingest.py                   # IngestBatchRequest, PlatformType, DataType enums
│   │   ├── analytics.py                # Analytics query params + response schemas
│   │   ├── keyword.py                  # Keyword schemas
│   │   ├── token.py                    # Token schemas
│   │   ├── base.py                     # Common response envelopes
│   │   ├── login.py                    # Login request/response
│   │   ├── users.py                    # User schemas
│   │   ├── roles.py                    # Role schemas
│   │   ├── menus.py                    # Menu schemas (MenuType enum)
│   │   ├── apis.py                     # API schemas
│   │   ├── depts.py                    # Dept schemas
│   │   ├── cleaning.py                 # Cleaning schemas
│   │   └── proxy.py                    # Proxy schemas
│   ├── services/                       # Business logic
│   │   ├── analytics_service.py        # AnalyticsService → JobAnalyticsRepo
│   │   ├── cleaning.py                 # CleaningService (targeted cleaning)
│   │   ├── company_cleaner.py          # CompanyCleaner (automated company pipeline)
│   │   ├── company_jobs_sync.py        # CompanyJobsSyncService
│   │   ├── company_storage.py          # company_storage (ClickHouse company insert)
│   │   ├── ingest/                     # Ingest pipeline subsystem
│   │   │   ├── service.py              # IngestService.store_batch/store_single/query_data
│   │   │   ├── registry.py             # PlatformConfig registry + get_config()
│   │   │   ├── dedup.py                # batch_dedup_filter() against ClickHouse
│   │   │   ├── remote_push.py          # Async HTTP push to secondary target
│   │   │   └── configs/                # Per-platform PlatformConfig registrations
│   │   └── crawler/                    # In-process platform HTTP clients
│   │       ├── _base.py                # BaseFetcher, BaseSearcher, ApiResult
│   │       ├── _http_client.py         # HTTPClient wrapper (sync requests)
│   │       ├── _boss_client.py         # Boss HTTP client
│   │       ├── _boss_api.py            # Boss API methods
│   │       ├── _boss_sign.py           # Boss request signing
│   │       ├── _zhilian_client.py
│   │       ├── _zhilian_api.py
│   │       ├── _zhilian_sign.py
│   │       ├── _job51_client.py
│   │       ├── _job51_api.py
│   │       ├── _job51_sign.py
│   │       ├── boss.py                 # BossService (high-level facade)
│   │       ├── qcwy.py                 # QcwyService (high-level facade)
│   │       └── zhilian.py              # ZhilianService (high-level facade)
│   ├── settings/
│   │   └── config.py                   # Settings (pydantic-settings BaseSettings)
│   ├── log/
│   │   └── __init__.py                 # loguru logger export
│   └── utils/                          # Shared utilities
├── web/                                # Vue 3 frontend
│   ├── src/
│   │   ├── main.js                     # App entry, Pinia + Router setup
│   │   ├── App.vue                     # Root component
│   │   ├── api/                        # Axios API call modules
│   │   │   ├── index.js                # System management APIs
│   │   │   ├── analytics.js            # Analytics APIs
│   │   │   ├── keyword.js              # Keyword APIs
│   │   │   ├── proxy.js                # Proxy APIs
│   │   │   └── token.js                # Token APIs
│   │   ├── components/                 # Shared components (CrudTable, CrudModal, QueryBar)
│   │   ├── layout/                     # App shell (sidebar, topbar, tabs)
│   │   ├── router/                     # Vue Router config + dynamic route loading
│   │   │   ├── index.js                # createRouter(), addDynamicRoutes()
│   │   │   ├── guard/                  # Navigation guards (auth, loading, title)
│   │   │   └── routes/                 # Static route definitions
│   │   ├── store/                      # Pinia stores
│   │   │   └── modules/                # user, permission, app, tags
│   │   ├── utils/
│   │   │   ├── auth/                   # JWT token read/write
│   │   │   ├── http/                   # Axios instance + interceptors
│   │   │   └── storage/                # localStorage wrapper
│   │   └── views/                      # Page-level Vue components
│   │       ├── login/                  # Login page
│   │       ├── analytics/              # ECharts dashboard (trend + distribution)
│   │       ├── recruitment/            # Job data browser (boss/, qcwy/, zhilian/)
│   │       ├── cleaning/               # Cleaning operation + monitor pages
│   │       ├── keyword/                # Keyword management
│   │       ├── profile/                # User profile
│   │       └── system/                 # RBAC admin (user, role, menu, api, dept, auditlog, proxy, token)
│   ├── vite.config.js                  # Vite build config (plugins, proxy)
│   └── package.json                    # Frontend dependencies
├── jobs_spider/                        # Standalone crawlers (external process)
│   ├── boss/
│   │   ├── boos_api.py                 # BossZhipinAPI core + SmartIPManager + main (~2246 lines)
│   │   ├── company_spider.py           # Boss company batch spider
│   │   └── enumerate_combos.py         # City × keyword combo enumeration
│   ├── qcwy/
│   │   ├── qcwy.py                     # QCWY API wrapper
│   │   ├── search_company_jobs.py
│   │   └── run_company_search.py
│   └── zhilian/
│       ├── zhilian_single.py           # Zhilian single-job spider
│       └── company_spider.py           # Zhilian company spider
├── spiderJobs/                         # Second-generation spider framework (structured)
│   ├── core/
│   │   ├── base.py                     # Shared base classes
│   │   └── http_client.py              # HTTP client abstraction
│   ├── platforms/                      # Platform-specific implementations
│   │   ├── boss/
│   │   ├── job51/
│   │   └── zhilian/
│   └── runner/                         # Execution runners
│       ├── api_client.py               # Backend API push client
│       ├── loop.py                     # Job crawl loop runner
│       └── company_loop.py             # Company crawl loop runner
├── migrations/
│   └── models/                         # Aerich migration files (auto-generated)
├── static/                             # Served at /static/ (built frontend or assets)
├── deploy/                             # Deployment configs and sample images
├── docs/                               # Project documentation
├── ecs_full_pipeline.py                # Alibaba Cloud ECS batch provisioning script
├── create_ecs_instances.py             # ECS instance creation utilities
├── reclean_qcwy_jobs.py                # Standalone QCWY re-cleaning script
├── run.py                              # Uvicorn launch entry point
├── Pipfile / Pipfile.lock              # Python dependency management (pipenv)
├── pyproject.toml                      # ruff + black + isort tool config
└── Dockerfile                          # Container build spec
```

## Directory Purposes

**`app/api/v1/`:**
- Purpose: One sub-package per API domain; each contains `__init__.py` exporting its `APIRouter`
- Contains: Route handler functions, dependency injection wiring, Pydantic schema imports
- Key files: `app/api/v1/__init__.py` (aggregation), `app/api/v1/job/job.py` (ingest routes), `app/api/v1/analytics.py`

**`app/controllers/`:**
- Purpose: Business-operation classes for MySQL-backed entities; encapsulate Tortoise-ORM queries
- Contains: One controller per domain entity; typically expose `create`, `update`, `delete`, `list`, `get` methods
- Key files: `app/controllers/api.py` (scan and register FastAPI routes → DB)

**`app/services/ingest/`:**
- Purpose: Self-contained data pipeline: registry lookup → dedup check → ClickHouse insert → remote push
- Contains: `service.py` (main orchestrator), `registry.py` (config store), `dedup.py`, `remote_push.py`, `configs/` (registrations per platform)
- Key files: `app/services/ingest/service.py`, `app/services/ingest/registry.py`

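
The four ingest stages named above can be pictured with a minimal, in-memory sketch. Everything in it is hypothetical and for illustration only: `TABLES`, `DEDUP_FIELDS`, `store_batch_sketch`, the `boss_job` table name and the `job_id` field are assumptions, a Python `set` stands in for the ClickHouse dedup query, and plain lists stand in for the insert and the remote push. It is not the project's `IngestService`.

```python
# Hypothetical stand-ins for the registry: one table name and one tuple of
# dedup fields per (platform, data_type) pair.
TABLES = {("boss", "job"): "boss_job"}
DEDUP_FIELDS = {("boss", "job"): ("job_id",)}


def store_batch_sketch(platform, data_type, rows, seen, inserted, pushed):
    """Run the four stages in order and return the number of new rows."""
    key = (platform, data_type)
    table, fields = TABLES[key], DEDUP_FIELDS[key]   # 1. registry lookup

    fresh = []
    for row in rows:                                 # 2. dedup check
        fingerprint = tuple(row[f] for f in fields)
        if fingerprint not in seen:
            seen.add(fingerprint)
            fresh.append(row)

    inserted.extend(fresh)                           # 3. stand-in for the insert
    if fresh:
        pushed.append((table, len(fresh)))           # 4. stand-in for the remote push
    return len(fresh)


seen, inserted, pushed = set(), [], []
first = store_batch_sketch(
    "boss", "job",
    [{"job_id": "a"}, {"job_id": "a"}, {"job_id": "b"}],
    seen, inserted, pushed,
)
second = store_batch_sketch("boss", "job", [{"job_id": "a"}], seen, inserted, pushed)
```

In this sketch a batch that contains only already-seen rows inserts nothing and triggers no push, which is the property the dedup stage exists to provide.
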
**`app/services/crawler/`:**
- Purpose: In-process HTTP clients for platform APIs, used by the backend scheduler (company cleaning) and the company search controller
- Contains: Underscore-prefixed private modules (`_boss_client.py`, `_boss_sign.py`, etc.) and public service facades (`boss.py`, `qcwy.py`, `zhilian.py`)
- Key files: `app/services/crawler/_base.py` (shared base classes)

**`app/core/`:**
- Purpose: Framework plumbing: startup, middleware chain, auth, distributed locking, ClickHouse connection
- Key files: `app/core/dependency.py` (auth guards), `app/core/scheduler.py` (all cron jobs), `app/core/locks.py`

**`app/models/`:**
- Purpose: All Tortoise-ORM model definitions for MySQL, migrated via Aerich; ClickHouse tables are not modeled here (their DDL lives in `app/core/clickhouse_init.py`)
- Key files: `app/models/admin.py` (RBAC), `app/models/keyword.py` (crawl state tracking)

**`app/repositories/`:**
- Purpose: Typed ClickHouse query encapsulation following the Repository pattern
- Key files: `app/repositories/clickhouse_repo.py`

**`jobs_spider/`:**
- Purpose: Standalone scripts that run on ECS nodes; communicate with the backend only via HTTP
- Key files: `jobs_spider/boss/boos_api.py` (2246-line core with SmartIPManager, BossZhipinAPI)

**`spiderJobs/`:**
- Purpose: Newer modular spider framework; runs in parallel to `jobs_spider/` with cleaner separation of concerns
- Key files: `spiderJobs/runner/api_client.py`, `spiderJobs/runner/loop.py`

**`web/src/views/`:**
- Purpose: One directory per route group; each view owns its local route definition in `route.js`
- Key files: `web/src/views/analytics/index.vue` (ECharts dashboard), `web/src/views/cleaning/monitor.vue`

## Key File Locations

**Entry Points:**
- `run.py`: Uvicorn startup; reads env vars `APP_HOST`, `APP_PORT`, `UVICORN_WORKERS`
- `app/__init__.py`: FastAPI app factory `create_app()` and lifespan context manager
- `web/src/main.js`: Vue 3 app entry; mounts Pinia, Router, directives

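
As a rough illustration of how a launcher like `run.py` might consume those three environment variables, here is a minimal sketch. The function name `read_launch_settings` and the default values (`0.0.0.0`, `8000`, `1`) are assumptions; the real script may use different defaults or parsing.

```python
import os
from typing import Mapping


def read_launch_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect host, port and worker count, falling back to assumed defaults."""
    return {
        "host": env.get("APP_HOST", "0.0.0.0"),          # assumed default
        "port": int(env.get("APP_PORT", "8000")),        # assumed default
        "workers": int(env.get("UVICORN_WORKERS", "1")),  # assumed default
    }


# Passing an explicit mapping keeps the function easy to exercise in isolation.
settings = read_launch_settings({"APP_PORT": "9000"})
# A launcher would then hand these values to uvicorn.run(...) as its
# host, port and workers arguments.
```
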
**Configuration:**
- `app/settings/config.py`: All backend config (`Settings` extends `pydantic_settings.BaseSettings`); override via env vars
- `web/vite.config.js`: Vite build and dev server config
- `pyproject.toml`: ruff, black, isort tool configuration

**Core Logic:**
- `app/core/init_app.py`: Startup sequence (migrations, seed data, ClickHouse init)
- `app/core/scheduler.py`: All APScheduler job functions and `start_scheduler()`
- `app/core/clickhouse_init.py`: All ClickHouse `CREATE TABLE IF NOT EXISTS` DDL statements
- `app/services/ingest/registry.py`: `PlatformConfig` dataclass + global registry dict
- `app/services/ingest/service.py`: `IngestService.store_batch()`, the main data ingestion path
- `app/services/company_cleaner.py`: `CompanyCleaner`, the automated company data pipeline

**ClickHouse Schema:**
- `app/core/clickhouse_init.py`: Defines all 6 MergeTree tables and 1 analytics VIEW DDL; apply via `ClickHouseInitializer.initialize_all_tables()`

**MySQL Migrations:**
- `migrations/models/`: Auto-generated by `aerich migrate`; applied via `aerich upgrade` or automatically on startup

**Testing:**
- `tests/test_company_jobs_sync.py`
- `tests/test_company_storage.py`

## Naming Conventions

**Files (Python):**
- Modules: `snake_case.py` (e.g., `company_cleaner.py`, `clickhouse_init.py`)
- Private crawler modules: leading underscore prefix (e.g., `_boss_client.py`, `_zhilian_sign.py`)
- Spider scripts in `jobs_spider/`: descriptive names without prefix (e.g., `boos_api.py`, `zhilian_single.py`)

**Files (Vue/JS):**
- Pages: `index.vue` for the primary view; `monitor.vue` or another descriptive name for secondary views
- API modules: `snake_case.js` matching domain (e.g., `analytics.js`, `keyword.js`)
- Router guard files: grouped in the `router/guard/` directory

**Classes (Python):**
- Services: usually `PascalCaseService` (e.g., `IngestService`, `AnalyticsService`); some omit the suffix (e.g., `CompanyCleaner`)
- Controllers: `PascalCaseController` with module-level singleton instance in snake_case (e.g., `user_controller = UserController()`)
- Repos: `PascalCaseRepo` (e.g., `ClickHouseBaseRepo`, `JobAnalyticsRepo`)
- Models: `PascalCase` matching table concept (e.g., `BossToken`, `CompanyCleaningQueue`)

**Directories:**
- Domain groupings: lowercase (e.g., `cleaning/`, `recruitment/`, `keyword/`)
- Platform names are lowercase across all layers: `boss`, `qcwy`, `zhilian`. The QCWY (51job) platform appears as `job51` in the private crawler modules (e.g., `_job51_client.py`) and in `spiderJobs/platforms/job51/`

## Where to Add New Code

**New API Endpoint:**
1. Create or update the router file in `app/api/v1/{domain}/`
2. Register the router in `app/api/v1/__init__.py` with `v1_router.include_router(...)`
3. Run `api_controller.refresh_api()` on the next app startup to sync the new routes to the `api` table
4. Add the corresponding Pydantic schemas in `app/schemas/{domain}.py`

**New Platform Data Type (Ingest):**
1. Define `PlatformConfig` with dedup fields in `app/services/ingest/configs/`
2. Call `register(config)` at module import in `app/services/ingest/configs/__init__.py`
3. ClickHouse table DDL goes in `app/core/clickhouse_init.py` under `_TABLE_DDLS`

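
A register-at-import pattern like the one in steps 1 and 2 could look roughly as follows. The field names on `PlatformConfig` (`platform`, `data_type`, `table`, `dedup_fields`), the duplicate-registration check and the `zhilian_company` / `company_id` values are assumptions made for this sketch; the real dataclass in `registry.py` may carry different fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlatformConfig:
    # Field names are illustrative assumptions, not the project's definition.
    platform: str
    data_type: str
    table: str
    dedup_fields: tuple


_REGISTRY: dict = {}


def register(config: PlatformConfig) -> None:
    """Store a config under its (platform, data_type) key, rejecting duplicates."""
    key = (config.platform, config.data_type)
    if key in _REGISTRY:
        raise ValueError(f"config already registered for {key}")
    _REGISTRY[key] = config


def get_config(platform: str, data_type: str) -> PlatformConfig:
    """Look up a config; raises KeyError when the pair was never registered."""
    return _REGISTRY[(platform, data_type)]


# Executed at module import time, mirroring step 2 above.
register(PlatformConfig("zhilian", "company", "zhilian_company", ("company_id",)))
```

Rejecting a second registration for the same key is a design choice of this sketch: it surfaces copy-paste mistakes at import time rather than letting one config silently replace another.
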
**New MySQL Model:**
1. Add model class to appropriate file in `app/models/` (or create new file)
2. Register the model module in `app/settings/config.py` under `TORTOISE_ORM.apps.models.models`
3. Run `aerich migrate && aerich upgrade` (or rely on `RUN_MIGRATIONS_ON_STARTUP=True`)

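
For step 2, the `TORTOISE_ORM.apps.models.models` path refers to the list of model modules in a Tortoise-ORM config dict. The fragment below is a sketch of that shape only: the connection URL is a placeholder, `app.models.widget` is a hypothetical new module, and the project's actual dict (probably an attribute of `Settings`) will list more modules.

```python
TORTOISE_ORM = {
    "connections": {
        # Placeholder credentials; the real value comes from settings/env vars.
        "default": "mysql://user:password@localhost:3306/jobdata",
    },
    "apps": {
        "models": {
            "models": [
                "app.models.admin",
                "app.models.widget",  # hypothetical newly added model module
                "aerich.models",      # needed so Aerich can track migrations
            ],
            "default_connection": "default",
        },
    },
}
```
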
**New Scheduled Job:**
1. Define async job function in `app/core/scheduler.py`
2. Wrap execution body with `DistributedLock` context from `app/core/locks.py`
3. Add `scheduler.add_job(...)` call inside `start_scheduler()`
4. Record run result via `_record_task_run()` to `ScheduledTaskRun` table

**New Frontend Page:**
1. Create view component at `web/src/views/{module}/index.vue`
2. Add `route.js` in the same directory with route meta
3. Register the route as a static route in `web/src/router/routes/`, or rely on the backend menu system to load it dynamically
4. Add corresponding backend menu entry in `app/core/init_app.py` → `init_menus()` (for seed data)

**New In-Process Platform Crawler:**
1. Add private client modules `_{platform}_client.py`, `_{platform}_api.py`, `_{platform}_sign.py` under `app/services/crawler/`
2. Create public facade `{platform}.py` extending or wrapping `BaseFetcher` / `BaseSearcher` from `app/services/crawler/_base.py`
3. Instantiate in `CompanyCleaner.__init__()` if needed for company pipeline

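
The facade-over-base-class arrangement from step 2 might look like the sketch below. `ApiResultSketch`, `BaseSearcherSketch`, `ExamplePlatformService`, the `_request()` hook and the `items` response key are hypothetical stand-ins; the actual `BaseSearcher` and `ApiResult` in `_base.py` may define different methods and fields.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ApiResultSketch:
    ok: bool
    data: Any = None
    error: Optional[str] = None


class BaseSearcherSketch:
    """Shared search flow; subclasses only supply the transport call."""

    def _request(self, payload: dict) -> dict:
        raise NotImplementedError

    def search(self, keyword: str, page: int = 1) -> ApiResultSketch:
        try:
            raw = self._request({"keyword": keyword, "page": page})
        except Exception as exc:
            # Convert transport failures into a result object for callers.
            return ApiResultSketch(ok=False, error=str(exc))
        return ApiResultSketch(ok=True, data=raw.get("items", []))


class ExamplePlatformService(BaseSearcherSketch):
    """Public facade; the injected transport stands in for the private client."""

    def __init__(self, transport: Callable[[dict], dict]):
        self._transport = transport

    def _request(self, payload: dict) -> dict:
        return self._transport(payload)


def fake_transport(payload: dict) -> dict:
    return {"items": [f"{payload['keyword']}-{payload['page']}"]}


result = ExamplePlatformService(fake_transport).search("python", page=2)
```

Injecting the transport keeps the facade testable without network access, which is one way to support mock-based tests for each platform.
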
**Shared Utilities:**
- Backend Python utilities: `app/utils/`
- Frontend JS utilities: `web/src/utils/`

## Special Directories

**`.startup_lock/`:**
- Purpose: File-based mutex directory that prevents multiple workers from racing to seed data at startup
- Generated: Yes (created and deleted at startup)
- Committed: No (in `.gitignore`)

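
One common way to build a directory-based mutex of this kind relies on `os.mkdir` being atomic: only one process can create the directory, and the others see `FileExistsError`. The sketch below illustrates that idea; the `startup_lock` helper is hypothetical and is not taken from the project's startup code, which may add timeouts or stale-lock handling.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def startup_lock(path):
    """Yield True for the single caller that manages to create the directory."""
    try:
        os.mkdir(path)  # atomic: fails if the directory already exists
        acquired = True
    except FileExistsError:
        acquired = False
    try:
        yield acquired
    finally:
        if acquired:
            os.rmdir(path)  # release the lock for later startups


# Demonstration in a temporary location: the nested attempt loses the race.
lock_dir = os.path.join(tempfile.mkdtemp(), ".startup_lock")
with startup_lock(lock_dir) as first:
    with startup_lock(lock_dir) as second:
        outcome = (first, second)
```

A worker that receives `False` would skip seeding and continue starting up, on the assumption that the lock holder is doing the work.
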
**`migrations/`:**
- Purpose: Aerich migration history for MySQL schema
- Generated: Yes (by `aerich migrate`)
- Committed: Yes

**`static/`:**
- Purpose: Served at `/static/` path; can hold built frontend assets or other static files
- Generated: Partially (built frontend output can be placed here)
- Committed: Yes (currently contains static assets)

**`.planning/`:**
- Purpose: GSD planning documents (codebase analysis, phase plans)
- Generated: Yes (by GSD tooling)
- Committed: Yes

**`jobs_spider/*/logs/`:**
- Purpose: Rotating daily log files for standalone spider processes
- Generated: Yes (by loguru in spider scripts)
- Committed: No

---

*Structure analysis: 2026-03-21*