# External Integrations

**Analysis Date:** 2026-03-21

## APIs & External Services

**Recruitment Platforms (scraped, not official APIs):**
- Boss Zhipin (Boss直聘) - Job and company data via WeChat mini-program endpoints
  - Client: `requests` (sync) in `jobs_spider/boss/boos_api.py`
  - Endpoints scraped: `/wapi/zpgeek/miniapp/search/joblist.json`, `/wapi/batch/batchRunV3`, `/wapi/zpgeek/miniapp/brand/detail.json`
  - Auth: MPT Token (stored in MySQL `boss_token` table, managed via `app/api/v1/token/`)
  - Anti-detection: `SmartIPManager`, `IPAnomalyDetector`, random 10-20s delays, session rebuilding on ban
  - JS signing: `PyExecJS` executes JS sign algorithms (`jobs_spider/boss/boos_api.py`)
- 51job / Qiancheng Wuyou (前程无忧) - Job data via platform API
  - Client: `httpx` + `requests` in `jobs_spider/qcwy/qcwy.py`
  - Auth: HMAC signature computed per-request
  - Proxy: Hardcoded proxy URL present in `jobs_spider/qcwy/qcwy.py` (line 31) - security risk
- Zhilian Zhaopin (智联招聘) - Job and company data
  - Client: `requests` in `jobs_spider/zhilian/zhilian_single.py`
  - Sign client: `app/services/crawler/_zhilian_sign.py`

**Qixin.com (企信) - Company data enrichment (disabled):**
- Endpoint: `http://external-data.qixin.com/extend/extend_data_push`
- Auth: MD5 HMAC with timestamp + salt (salt hardcoded in `app/services/ingest/remote_push.py` line 48)
- Status: Push code commented out; function `push_to_remote()` currently only logs

**Webhook / Report Endpoint:**
- Outgoing: Configurable via `REPORT_ENDPOINT` env var (default: none)
- Used by: `app/core/scheduler.py` `_post_with_retry()` to POST stats JSON every 6 hours
- Timeout: `REPORT_TIMEOUT` (default 10s), retries: `REPORT_MAX_RETRIES` (default 3)

## Data Storage

**Databases:**

MySQL (Primary business database):
- Connection: `mysql://root:jobdata123@121.4.126.241:3306/job_data` (default hardcoded in `app/settings/config.py`; override via `TORTOISE_ORM` env or individual components)
- Client: Tortoise-ORM 0.23.0 with aiomysql driver
- Migrations: Aerich (`migrations/` directory, runs on startup if `RUN_MIGRATIONS_ON_STARTUP=True`)
- Tables: `user`, `role`, `api`, `menu`, `dept`, `auditlog`, `boss_token`, `cleaning_*`, `scheduled_task_run`, `stats_total`, `ip_upload_stats`
- Models: `app/models/admin.py`, `app/models/token.py`, `app/models/cleaning.py`, `app/models/metrics.py`, `app/models/keyword.py`, `app/models/company.py`

ClickHouse (Analytics / raw job data):
- Connection: host `121.4.126.241`, port `8123`, DB `job_data` (configured in `app/settings/config.py`)
- Env vars: `CLICKHOUSE_HOST`, `CLICKHOUSE_PORT`, `CLICKHOUSE_USER`, `CLICKHOUSE_PASS`, `CLICKHOUSE_DB`
- Client: `clickhouse-connect==0.7.19` async client (`app/core/clickhouse.py`)
- Connection pool: `urllib3.PoolManager` with `num_pools=2`, `maxsize=5` per worker
- Tables managed via DDL in `app/core/clickhouse_init.py` (NOT Aerich migrations)
- Tables: `boss_job`, `boss_company`, `qcwy_job`, `qcwy_company`, `zhilian_job`, `zhilian_company` (MergeTree engine), `pending_company` (ReplacingMergeTree), `job_analytics` (VIEW)
- Queries: Repository pattern in `app/repositories/clickhouse_repo.py`

**File Storage:**
- Local filesystem only (no S3 / object storage detected)
- Static files served from `static/` directory mounted via FastAPI (`app/__init__.py`)
- Spider logs written to `logs/` directory relative to spider working dir
- ECS pipeline log: `ecs_full_pipeline.log` (root project directory)

**Caching:**
- Redis (optional): Used only for distributed locking in `app/core/locks.py`
  - Env vars: `REDIS_HOST`, `REDIS_PORT` (default 6379), `REDIS_DB` (default 0), `REDIS_PASS`
  - Falls back to filesystem-based TTL locks when Redis unavailable
  - Lock files stored in system temp directory: `{tempdir}/jobdata_lock_{name}/`

## Authentication & Identity

**Auth Provider:** Custom JWT implementation (no third-party auth provider)
- Implementation: `app/core/dependency.py` (`DependPermission` dependency)
- Algorithm: HS256
- Expiry: 7 days (60 * 24 * 7 minutes, `app/settings/config.py`)
- Secret: `SECRET_KEY` (default `"CHANGE_ME_DEV_ONLY"` - must override in production)
- Storage (frontend): JWT stored in `localStorage`, injected via axios request interceptor (`web/src/utils/http/interceptors.js`)
- RBAC: Route-level permission check via `DependPermission`; roles link to APIs (many-to-many in MySQL)

**Boss Token Management:**
- Boss MPT tokens stored in `boss_token` MySQL table (`app/models/token.py`)
- Managed via `app/api/v1/token/` CRUD endpoints
- Spiders fetch active tokens from `/api/v1/token/tokens` and mark invalid ones

## Monitoring & Observability

**Error Tracking:** None (no Sentry or equivalent detected)

**Logs:**
- Backend: `loguru` structured logging throughout all modules
- Spider: `loguru` with daily rotation (`logs/log_{YYYY-MM-DD}.log`, 30-day retention)
- ECS pipeline: File log appended to `ecs_full_pipeline.log`
- No centralized log aggregation (no ELK, CloudWatch, etc.)

**Alerting:**
- Email alerts via SMTP on: stats job completion, IP anomaly detection
- SMTP provider: QQ Mail (`smtp.qq.com:587`) with STARTTLS
- Env vars: `SMTP_HOST`, `SMTP_PORT`, `SMTP_USER`, `SMTP_PASS`, `SMTP_FROM`, `SMTP_TO`
- Default recipient hardcoded in `app/settings/config.py` line 47
- IP anomaly alerts triggered every 10 minutes by `ip_alert_job` in `app/core/scheduler.py`
- Stats reports sent every 6 hours by `stats_job` in `app/core/scheduler.py`

**Metrics:**
- IP upload tracking: `IpTrackingMiddleware` (`app/core/ip_tracking.py`) records per-source, per-IP upload counts to `ip_upload_stats` MySQL table
- Task run history: `ScheduledTaskRun` model (`app/models/metrics.py`) records all scheduler task outcomes
- Stats totals: `StatsTotal` model records ClickHouse row counts per table per run

## CI/CD & Deployment

**Hosting:**
- Production: Alibaba Cloud ECS (manually provisioned + Docker container)
- Crawler nodes: Alibaba Cloud ECS spot instances (cn-shanghai-b, `ecs.t5-lc1m1.small`, Ubuntu 24.04)
- ECS node config: `ecs_full_pipeline.py` (20 spot instances, `SpotAsPriceGo` strategy, auto-release)
- Vswitch: `vsw-uf6twb2lrd02fi7q2i605`, Security Group: `sg-uf60380ctrux2uusucad`

**CI Pipeline:** None detected (no GitHub Actions, Jenkins, or equivalent)

**Container:**
- Multi-stage `Dockerfile` in project root
- Stage 1: `node:18-alpine` builds Vue3 frontend
- Stage 2: `python:3.11-slim-bullseye` installs Python deps + nginx
- Entrypoint: `deploy/entrypoint.sh`
- Nginx config: `deploy/web.conf`
- Exposed port: 80

## Environment Configuration

**Required env vars (backend):**
- `SECRET_KEY` - JWT signing secret (must override `"CHANGE_ME_DEV_ONLY"`)
- `CLICKHOUSE_HOST` - ClickHouse server address
- `CLICKHOUSE_USER` / `CLICKHOUSE_PASS` - ClickHouse credentials
- MySQL connection (embedded in `TORTOISE_ORM` dict or override connection string)

**Optional env vars:**
- `APP_HOST` / `APP_PORT` / `UVICORN_WORKERS` - Server binding
- `REDIS_HOST` / `REDIS_PORT` / `REDIS_PASS` - Distributed lock backend
- `SMTP_HOST` / `SMTP_USER` / `SMTP_PASS` / `SMTP_FROM` / `SMTP_TO` - Email alerting
- `REPORT_ENDPOINT` - Webhook for stats reporting
- `RUN_MIGRATIONS_ON_STARTUP` / `INITIALIZE_SEED_DATA_ON_STARTUP` - Startup behavior flags

**Spider env vars:**
- `API_BASE_URL` - Backend URL for data push (default: `http://124.222.106.226:9999`)
- `SLEEP_MIN_SECONDS` / `SLEEP_MAX_SECONDS` - Request rate limiting
- `PROXY_USERNAME` / `PROXY_PASSWORD` / `PROXY_TUNNEL` / `PROXY_SCHEME` - Proxy pool
- `MAX_PAGES` / `PAGE_SIZE` - Crawl depth control

**Alibaba Cloud (ECS pipeline):**
- Uses `alibabacloud_credentials` default credential chain (env vars `ALIBABA_CLOUD_ACCESS_KEY_ID` / `ALIBABA_CLOUD_ACCESS_KEY_SECRET` or `~/.alibabacloud/credentials`)

**Secrets location:** All secrets are environment variables; no secrets manager. `config.py` contains hardcoded defaults that are insecure for production use.

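**Illustrative startup check (not project code):** Because the required backend variables fall back to insecure defaults, a fail-fast check at startup may be useful. The sketch below is a minimal illustration using only the standard library. `BackendSettings` and `load_backend_settings` are hypothetical names and do not describe what `app/settings/config.py` actually contains. Only the variable names, the `"CHANGE_ME_DEV_ONLY"` default, and the Redis port default of 6379 come from this audit. The MySQL connection is left out because it is supplied through the `TORTOISE_ORM` dict.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Documented development default for SECRET_KEY; must never reach production.
INSECURE_SECRET_DEFAULT = "CHANGE_ME_DEV_ONLY"

# Required backend variables as listed in this audit (MySQL is omitted here
# because it is configured through the TORTOISE_ORM dict).
REQUIRED_VARS = ("SECRET_KEY", "CLICKHOUSE_HOST", "CLICKHOUSE_USER", "CLICKHOUSE_PASS")


@dataclass(frozen=True)
class BackendSettings:
    """Hypothetical settings container, not the project's own settings class."""

    secret_key: str
    clickhouse_host: str
    clickhouse_user: str
    clickhouse_pass: str
    redis_host: Optional[str]  # None means Redis is not configured
    redis_port: int


def load_backend_settings(env: Mapping[str, str] = os.environ) -> BackendSettings:
    """Read settings from env and fail fast on missing or insecure values."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing required env vars: {', '.join(missing)}")
    if env["SECRET_KEY"] == INSECURE_SECRET_DEFAULT:
        raise RuntimeError("SECRET_KEY still has the insecure development default")
    return BackendSettings(
        secret_key=env["SECRET_KEY"],
        clickhouse_host=env["CLICKHOUSE_HOST"],
        clickhouse_user=env["CLICKHOUSE_USER"],
        clickhouse_pass=env["CLICKHOUSE_PASS"],
        redis_host=env.get("REDIS_HOST") or None,
        redis_port=int(env.get("REDIS_PORT", "6379")),
    )
```

Passing the environment as a mapping keeps the check testable without touching the real process environment. A `redis_host` of `None` corresponds to the case where the lock backend would fall back to filesystem locks.
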
## Webhooks & Callbacks

**Incoming:**
- `/api/v1/universal/data/batch-store-async` - Spider data ingestion endpoint (called by all three crawlers)
  - Header auth: `token: dev` (or `API_TOKEN` env var)
  - Body: `{"data_list": [...], "data_type": "job"|"company", "platform": "boss"|"qcwy"|"zhilian"}`
- `/api/v1/keyword/available` - Spiders poll this to get next keyword to crawl
- `/api/v1/token/tokens` - Spiders fetch active Boss MPT tokens
- `/api/v1/pipeline` - Manual trigger for ECS full pipeline

**Outgoing:**
- `REPORT_ENDPOINT` (configurable) - Stats payload POST every 6 hours (`app/core/scheduler.py`)
- `http://external-data.qixin.com/...` - Company data push to Qixin (currently disabled/commented out in `app/services/ingest/remote_push.py`)
- SMTP email notifications on scheduler events

---

*Integration audit: 2026-03-21*