# External Integrations
**Analysis Date:** 2026-03-21
## APIs & External Services
**Recruitment Platforms (scraped, not official APIs):**
- Boss Zhipin (Boss直聘) - Job and company data via WeChat mini-program endpoints
  - Client: `requests` (sync) in `jobs_spider/boss/boos_api.py`
  - Endpoints scraped: `/wapi/zpgeek/miniapp/search/joblist.json`, `/wapi/batch/batchRunV3`, `/wapi/zpgeek/miniapp/brand/detail.json`
  - Auth: MPT Token (stored in MySQL `boss_token` table, managed via `app/api/v1/token/`)
  - Anti-detection: `SmartIPManager`, `IPAnomalyDetector`, random 10-20s delays, session rebuilding on ban
  - JS signing: `PyExecJS` executes JS sign algorithms (`jobs_spider/boss/boos_api.py`)
- 51job / Qiancheng Wuyou (前程无忧) - Job data via platform API
  - Client: `httpx` + `requests` in `jobs_spider/qcwy/qcwy.py`
  - Auth: HMAC signature computed per-request
  - Proxy: Hardcoded proxy URL present in `jobs_spider/qcwy/qcwy.py` (line 31) - security risk
- Zhilian Zhaopin (智联招聘) - Job and company data
  - Client: `requests` in `jobs_spider/zhilian/zhilian_single.py`
  - Sign client: `app/services/crawler/_zhilian_sign.py`
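
The random 10-20s delay listed under Boss anti-detection can be sketched as below. This is a minimal illustration, not the code in `jobs_spider/boss/boos_api.py`; the function name `polite_delay` and the injectable `sleep` parameter are hypothetical and exist only to make the sketch testable.

```python
import random
import time


def polite_delay(min_s=10.0, max_s=20.0, sleep=time.sleep):
    """Sleep for a uniformly random duration in [min_s, max_s] and return it.

    The 10-20 second window comes from the audit notes above; everything
    else (name, signature) is a hypothetical sketch.
    """
    if min_s > max_s:
        raise ValueError("min_s must not exceed max_s")
    delay = random.uniform(min_s, max_s)
    sleep(delay)  # injected so callers and tests can avoid real waiting
    return delay
```

A spider loop would presumably call `polite_delay()` between requests; passing `sleep=lambda s: None` disables the wait in tests.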
**Qixin.com (企信) - Company data enrichment (disabled):**
- Endpoint: `http://external-data.qixin.com/extend/extend_data_push`
- Auth: MD5 HMAC with timestamp + salt (salt hardcoded in `app/services/ingest/remote_push.py` line 48)
- Status: Push code commented out; function `push_to_remote()` currently only logs
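
Because the salt is hardcoded in `app/services/ingest/remote_push.py`, one possible remediation is to read it from the environment before signing. The sketch below is an assumption throughout: the audit only says "MD5 HMAC with timestamp + salt", so the message layout, the `QIXIN_SALT` variable name, and both function names are hypothetical and may not match the real scheme.

```python
import hashlib
import hmac
import os


def load_salt(env=None):
    """Read the signing salt from the environment (QIXIN_SALT is a hypothetical name)."""
    env = os.environ if env is None else env
    salt = env.get("QIXIN_SALT")
    if not salt:
        raise RuntimeError("QIXIN_SALT is not set")
    return salt


def sign_push(timestamp: int, salt: str) -> str:
    """HMAC-MD5 over the timestamp, keyed by the salt.

    Assumed layout only; the actual Qixin signature may combine the
    fields differently.
    """
    return hmac.new(
        salt.encode("utf-8"), str(timestamp).encode("ascii"), hashlib.md5
    ).hexdigest()
```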
**Webhook / Report Endpoint:**
- Outgoing: Configurable via `REPORT_ENDPOINT` env var (default: none)
- Used by: `app/core/scheduler.py` `_post_with_retry()` to POST stats JSON every 6 hours
- Timeout: `REPORT_TIMEOUT` (default 10s), retries: `REPORT_MAX_RETRIES` (default 3)
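
The reporting behaviour above (optional endpoint, timeout, bounded retries) can be sketched with the standard library. This is not the body of `_post_with_retry()` in `app/core/scheduler.py`; the function name, the exponential backoff, the injectable `send` and `sleep` parameters, and the reading of `REPORT_MAX_RETRIES` as the total number of attempts are all assumptions.

```python
import json
import os
import time
import urllib.request


def _send_json(url, payload, timeout):
    """POST the payload as JSON; urllib raises an OSError subclass on failure."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout):
        pass


def post_with_retry(payload, endpoint=None, timeout=None, max_retries=None,
                    send=_send_json, sleep=time.sleep):
    """Return True once a POST succeeds, False if disabled or all attempts fail."""
    endpoint = endpoint or os.getenv("REPORT_ENDPOINT")
    if not endpoint:
        return False  # no endpoint configured, so reporting is a no-op
    if timeout is None:
        timeout = float(os.getenv("REPORT_TIMEOUT", "10"))
    if max_retries is None:
        max_retries = int(os.getenv("REPORT_MAX_RETRIES", "3"))
    for attempt in range(1, max_retries + 1):
        try:
            send(endpoint, payload, timeout)
            return True
        except OSError:
            if attempt < max_retries:
                sleep(2 ** (attempt - 1))  # assumed backoff: 1s, 2s, 4s, ...
    return False
```

Catching `OSError` covers `urllib.error.URLError`, HTTP error statuses raised by `urlopen`, and socket timeouts, which keeps the retry loop short at the cost of treating all of them alike.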
## Data Storage
**Databases:**
MySQL (Primary business database):
- Connection: `mysql://root:jobdata123@121.4.126.241:3306/job_data` (default hardcoded in `app/settings/config.py`; override via `TORTOISE_ORM` env or individual components)
- Client: Tortoise-ORM 0.23.0 with aiomysql driver
- Migrations: Aerich (`migrations/` directory, runs on startup if `RUN_MIGRATIONS_ON_STARTUP=True`)
- Tables: `user`, `role`, `api`, `menu`, `dept`, `auditlog`, `boss_token`, `cleaning_*`, `scheduled_task_run`, `stats_total`, `ip_upload_stats`
- Models: `app/models/admin.py`, `app/models/token.py`, `app/models/cleaning.py`, `app/models/metrics.py`, `app/models/keyword.py`, `app/models/company.py`
ClickHouse (Analytics / raw job data):
- Connection: host `121.4.126.241`, port `8123`, DB `job_data` (configured in `app/settings/config.py`)
- Env vars: `CLICKHOUSE_HOST`, `CLICKHOUSE_PORT`, `CLICKHOUSE_USER`, `CLICKHOUSE_PASS`, `CLICKHOUSE_DB`
- Client: `clickhouse-connect==0.7.19` async client (`app/core/clickhouse.py`)
- Connection pool: `urllib3.PoolManager` with `num_pools=2`, `maxsize=5` per worker
- Tables managed via DDL in `app/core/clickhouse_init.py` (NOT Aerich migrations)
- Tables: `boss_job`, `boss_company`, `qcwy_job`, `qcwy_company`, `zhilian_job`, `zhilian_company` (MergeTree engine), `pending_company` (ReplacingMergeTree), `job_analytics` (VIEW)
- Queries: Repository pattern in `app/repositories/clickhouse_repo.py`
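
To illustrate how the ClickHouse env vars listed above map onto connection settings, here is a minimal sketch. The `ClickHouseSettings` dataclass and `load_clickhouse_settings` function are hypothetical names, not the contents of `app/settings/config.py`; the port `8123` and database `job_data` fallbacks are the values documented above, and treating host, user, and password as required follows the env var list later in this document.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ClickHouseSettings:
    host: str
    port: int
    user: str
    password: str
    database: str


def load_clickhouse_settings(env=None):
    """Build settings from env vars, failing fast when required ones are missing."""
    env = os.environ if env is None else env
    required = ("CLICKHOUSE_HOST", "CLICKHOUSE_USER", "CLICKHOUSE_PASS")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing ClickHouse env vars: " + ", ".join(missing))
    return ClickHouseSettings(
        host=env["CLICKHOUSE_HOST"],
        port=int(env.get("CLICKHOUSE_PORT", "8123")),   # documented port
        user=env["CLICKHOUSE_USER"],
        password=env["CLICKHOUSE_PASS"],
        database=env.get("CLICKHOUSE_DB", "job_data"),  # documented DB name
    )
```

Requiring the host explicitly, rather than falling back to a hardcoded address, is a design choice of this sketch and differs from the hardcoded defaults the audit reports.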
**File Storage:**
- Local filesystem only (no S3 / object storage detected)
- Static files served from `static/` directory mounted via FastAPI (`app/__init__.py`)
- Spider logs written to `logs/` directory relative to spider working dir
- ECS pipeline log: `ecs_full_pipeline.log` (root project directory)
**Caching:**
- Redis (optional): Used only for distributed locking in `app/core/locks.py`
  - Env vars: `REDIS_HOST`, `REDIS_PORT` (default 6379), `REDIS_DB` (default 0), `REDIS_PASS`
  - Falls back to filesystem-based TTL locks when Redis unavailable
  - Lock files stored in system temp directory: `{tempdir}/jobdata_lock_{name}/`
## Authentication & Identity
**Auth Provider:** Custom JWT implementation (no third-party auth provider)
- Implementation: `app/core/dependency.py` (`DependPermission` dependency)
- Algorithm: HS256
- Expiry: 7 days (60 * 24 * 7 minutes, `app/settings/config.py`)
- Secret: `SECRET_KEY` (default `"CHANGE_ME_DEV_ONLY"` - must override in production)
- Storage (frontend): JWT stored in `localStorage`, injected via axios request interceptor (`web/src/utils/http/interceptors.js`)
- RBAC: Route-level permission check via `DependPermission`; roles link to APIs (many-to-many in MySQL)
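
For readers unfamiliar with HS256, the token scheme above amounts to an HMAC-SHA256 signature over a base64url-encoded header and payload. The sketch below uses only the standard library to show that mechanism with the documented 7-day expiry; it is not the code in `app/core/dependency.py`, which may well use a JWT library, and all function and constant names here are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time

TOKEN_TTL_SECONDS = 60 * 24 * 7 * 60  # the documented 60 * 24 * 7 minutes


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: str) -> str:
    digest = hmac.new(secret.encode(), signing_input.encode("ascii"),
                      hashlib.sha256).digest()
    return _b64url(digest)


def encode_token(claims: dict, secret: str, now=None) -> str:
    issued = int(time.time()) if now is None else now
    payload = {**claims, "exp": issued + TOKEN_TTL_SECONDS}
    parts = ({"alg": "HS256", "typ": "JWT"}, payload)
    signing_input = ".".join(
        _b64url(json.dumps(p, separators=(",", ":")).encode()) for p in parts
    )
    return f"{signing_input}.{_sign(signing_input, secret)}"


def decode_token(token: str, secret: str, now=None) -> dict:
    current = int(time.time()) if now is None else now
    header_b64, payload_b64, signature = token.split(".")  # ValueError if malformed
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = _sign(f"{header_b64}.{payload_b64}", secret)
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        raise ValueError("bad signature")
    payload = json.loads(_b64url_decode(payload_b64))
    if payload["exp"] <= current:
        raise ValueError("token expired")
    return payload
```

Anyone who knows `SECRET_KEY` can mint valid tokens, which is why the development default must be overridden in production.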
**Boss Token Management:**
- Boss MPT tokens stored in `boss_token` MySQL table (`app/models/token.py`)
- Managed via `app/api/v1/token/` CRUD endpoints
- Spiders fetch active tokens from `/api/v1/token/tokens` and mark invalid ones
## Monitoring & Observability
**Error Tracking:** None (no Sentry or equivalent detected)
**Logs:**
- Backend: `loguru` structured logging throughout all modules
- Spider: `loguru` with daily rotation (`logs/log_{YYYY-MM-DD}.log`, 30-day retention)
- ECS pipeline: File log appended to `ecs_full_pipeline.log`
- No centralized log aggregation (no ELK, CloudWatch, etc.)
**Alerting:**
- Email alerts via SMTP on: stats job completion, IP anomaly detection
- SMTP provider: QQ Mail (`smtp.qq.com:587`) with STARTTLS
- Env vars: `SMTP_HOST`, `SMTP_PORT`, `SMTP_USER`, `SMTP_PASS`, `SMTP_FROM`, `SMTP_TO`
- Default recipient hardcoded in `app/settings/config.py` line 47
- IP anomaly alerts triggered every 10 minutes by `ip_alert_job` in `app/core/scheduler.py`
- Stats reports sent every 6 hours by `stats_job` in `app/core/scheduler.py`
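
An SMTP alert with STARTTLS on port 587, as described above, can be sketched with `smtplib` and `email.message`. The function names are hypothetical and this is not the project's alerting code; falling back to `smtp.qq.com` and `587` when `SMTP_HOST` and `SMTP_PORT` are unset, and to `SMTP_USER` when `SMTP_FROM` is unset, are assumptions of the sketch.

```python
import os
import smtplib
from email.message import EmailMessage


def build_alert(subject, body, env=None):
    """Assemble the alert message from SMTP_* settings without sending it."""
    env = os.environ if env is None else env
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = env.get("SMTP_FROM") or env["SMTP_USER"]
    msg["To"] = env["SMTP_TO"]
    msg.set_content(body)
    return msg


def send_alert(msg, env=None):
    """Send over SMTP, upgrading the connection with STARTTLS before login."""
    env = os.environ if env is None else env
    host = env.get("SMTP_HOST", "smtp.qq.com")  # assumed fallback
    port = int(env.get("SMTP_PORT", "587"))     # assumed fallback
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        smtp.starttls()
        smtp.login(env["SMTP_USER"], env["SMTP_PASS"])
        smtp.send_message(msg)
```

Splitting message construction from delivery lets the scheduler jobs be tested without a mail server.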
**Metrics:**
- IP upload tracking: `IpTrackingMiddleware` (`app/core/ip_tracking.py`) records per-source, per-IP upload counts to `ip_upload_stats` MySQL table
- Task run history: `ScheduledTaskRun` model (`app/models/metrics.py`) records all scheduler task outcomes
- Stats totals: `StatsTotal` model records ClickHouse row counts per table per run
## CI/CD & Deployment
**Hosting:**
- Production: Alibaba Cloud ECS (manually provisioned + Docker container)
- Crawler nodes: Alibaba Cloud ECS spot instances (cn-shanghai-b, `ecs.t5-lc1m1.small`, Ubuntu 24.04)
- ECS node config: `ecs_full_pipeline.py` (20 spot instances, `SpotAsPriceGo` strategy, auto-release)
- Vswitch: `vsw-uf6twb2lrd02fi7q2i605`, Security Group: `sg-uf60380ctrux2uusucad`
**CI Pipeline:** None detected (no GitHub Actions, Jenkins, or equivalent)
**Container:**
- Multi-stage `Dockerfile` in project root
  - Stage 1: `node:18-alpine` builds Vue3 frontend
  - Stage 2: `python:3.11-slim-bullseye` installs Python deps + nginx
- Entrypoint: `deploy/entrypoint.sh`
- Nginx config: `deploy/web.conf`
- Exposed port: 80
## Environment Configuration
**Required env vars (backend):**
- `SECRET_KEY` - JWT signing secret (must override `"CHANGE_ME_DEV_ONLY"`)
- `CLICKHOUSE_HOST` - ClickHouse server address
- `CLICKHOUSE_USER` / `CLICKHOUSE_PASS` - ClickHouse credentials
- MySQL connection - set via the `TORTOISE_ORM` env var or its individual components (overrides the hardcoded default in `app/settings/config.py`)
**Optional env vars:**
- `APP_HOST` / `APP_PORT` / `UVICORN_WORKERS` - Server binding
- `REDIS_HOST` / `REDIS_PORT` / `REDIS_PASS` - Distributed lock backend
- `SMTP_HOST` / `SMTP_USER` / `SMTP_PASS` / `SMTP_FROM` / `SMTP_TO` - Email alerting
- `REPORT_ENDPOINT` - Webhook for stats reporting
- `RUN_MIGRATIONS_ON_STARTUP` / `INITIALIZE_SEED_DATA_ON_STARTUP` - Startup behavior flags
**Spider env vars:**
- `API_BASE_URL` - Backend URL for data push (default: `http://124.222.106.226:9999`)
- `SLEEP_MIN_SECONDS` / `SLEEP_MAX_SECONDS` - Request rate limiting
- `PROXY_USERNAME` / `PROXY_PASSWORD` / `PROXY_TUNNEL` / `PROXY_SCHEME` - Proxy pool
- `MAX_PAGES` / `PAGE_SIZE` - Crawl depth control
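
The spider env vars above could be collected into one settings object, as in the sketch below. The dataclass and function names are hypothetical. Only the `API_BASE_URL` default is documented; the other values have no documented defaults here, so the sketch returns `None` for them when unset rather than guessing.

```python
import os
from dataclasses import dataclass
from typing import Optional

DEFAULT_API_BASE_URL = "http://124.222.106.226:9999"  # documented default


@dataclass(frozen=True)
class SpiderSettings:
    api_base_url: str
    sleep_min_seconds: Optional[float]
    sleep_max_seconds: Optional[float]
    max_pages: Optional[int]
    page_size: Optional[int]


def _optional(env, name, cast):
    """Return cast(value) when the variable is set and non-empty, else None."""
    raw = env.get(name)
    return cast(raw) if raw not in (None, "") else None


def load_spider_settings(env=None):
    env = os.environ if env is None else env
    settings = SpiderSettings(
        api_base_url=env.get("API_BASE_URL", DEFAULT_API_BASE_URL).rstrip("/"),
        sleep_min_seconds=_optional(env, "SLEEP_MIN_SECONDS", float),
        sleep_max_seconds=_optional(env, "SLEEP_MAX_SECONDS", float),
        max_pages=_optional(env, "MAX_PAGES", int),
        page_size=_optional(env, "PAGE_SIZE", int),
    )
    lo, hi = settings.sleep_min_seconds, settings.sleep_max_seconds
    if lo is not None and hi is not None and lo > hi:
        raise ValueError("SLEEP_MIN_SECONDS must not exceed SLEEP_MAX_SECONDS")
    return settings
```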
**Alibaba Cloud (ECS pipeline):**
- Uses `alibabacloud_credentials` default credential chain (env vars `ALIBABA_CLOUD_ACCESS_KEY_ID` / `ALIBABA_CLOUD_ACCESS_KEY_SECRET` or `~/.alibabacloud/credentials`)
**Secrets location:** All secrets are environment variables; no secrets manager. `config.py` contains hardcoded defaults that are insecure for production use.
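
One way to act on the insecure defaults noted above is a startup check that refuses known development values. The sketch below is a suggestion, not existing project code; the function name is hypothetical, and it checks only the two defaults this document names explicitly (`SECRET_KEY` of `"CHANGE_ME_DEV_ONLY"` and the ingestion token `dev`).

```python
import os

# Development defaults named in this audit.
INSECURE_DEFAULTS = {
    "SECRET_KEY": "CHANGE_ME_DEV_ONLY",
    "API_TOKEN": "dev",
}


def find_insecure_settings(env=None):
    """Return a list of human-readable problems; empty means nothing was flagged."""
    env = os.environ if env is None else env
    problems = []
    for name, dev_value in INSECURE_DEFAULTS.items():
        # An unset variable falls back to the default, so it is flagged too.
        if env.get(name, dev_value) == dev_value:
            problems.append(f"{name} is unset or still the development default")
    return problems
```

A production entrypoint could call `find_insecure_settings()` and exit with an error if the list is non-empty.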
## Webhooks & Callbacks
**Incoming:**
- `/api/v1/universal/data/batch-store-async` - Spider data ingestion endpoint (called by all three crawlers)
  - Header auth: `token: dev` (or `API_TOKEN` env var)
  - Body: `{"data_list": [...], "data_type": "job"|"company", "platform": "boss"|"qcwy"|"zhilian"}`
- `/api/v1/keyword/available` - Spiders poll this to get next keyword to crawl
- `/api/v1/token/tokens` - Spiders fetch active Boss MPT tokens
- `/api/v1/pipeline` - Manual trigger for ECS full pipeline
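
A spider-side call to the ingestion endpoint can be sketched by building the request from the documented header and body shape. The function name is hypothetical, and the HTTP method (POST) and the JSON content type are assumptions, since the audit lists only the path, the `token` header, and the body fields.

```python
import json
import urllib.request

INGEST_PATH = "/api/v1/universal/data/batch-store-async"
DATA_TYPES = {"job", "company"}
PLATFORMS = {"boss", "qcwy", "zhilian"}


def build_batch_store_request(base_url, data_list, data_type, platform, token="dev"):
    """Build (but do not send) the ingestion request for a batch of records."""
    if data_type not in DATA_TYPES:
        raise ValueError(f"unknown data_type: {data_type!r}")
    if platform not in PLATFORMS:
        raise ValueError(f"unknown platform: {platform!r}")
    body = json.dumps(
        {"data_list": data_list, "data_type": data_type, "platform": platform}
    ).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + INGEST_PATH,
        data=body,
        headers={"token": token, "Content-Type": "application/json"},
        method="POST",  # assumed; the audit does not state the method
    )
```

The returned object can be passed to `urllib.request.urlopen(request, timeout=...)`; validating `data_type` and `platform` locally avoids a round trip for payloads the backend would presumably reject.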
**Outgoing:**
- `REPORT_ENDPOINT` (configurable) - Stats payload POST every 6 hours (`app/core/scheduler.py`)
- `http://external-data.qixin.com/...` - Company data push to Qixin (currently disabled/commented out in `app/services/ingest/remote_push.py`)
- SMTP email notifications on scheduler events
---
*Integration audit: 2026-03-21*