# Architecture Patterns: Shared Crawler Core

**Domain:** Recruiting data crawler (dual-deployment shared core)
**Researched:** 2026-03-21
**Overall confidence:** HIGH (based entirely on direct codebase inspection)

---

## Problem Statement

The codebase currently keeps the same code in two places, copied verbatim:

| Source file | Copy location |
|------|---------------|
| `spiderJobs/core/http_client.py` | `app/services/crawler/_http_client.py` |
| `spiderJobs/core/base.py` | `app/services/crawler/_base.py` |
| `spiderJobs/platforms/boss/sign.py` | `app/services/crawler/_boss_sign.py` |
| `spiderJobs/platforms/boss/client.py` | `app/services/crawler/_boss_client.py` |
| `spiderJobs/platforms/boss/api.py` | `app/services/crawler/_boss_api.py` |

The docstring in `app/services/crawler/_http_client.py` says so outright (translated from Chinese): "Copied from spiderJobs/core/http_client.py. Do not import spiderJobs directly, to avoid a cross-module dependency."

That comment reveals the source of the problem: the team correctly identified that two separate deployment units cannot cross-import, but solved it by copy-paste rather than by extraction.

There are also two competing external spider frameworks: `jobs_spider/` (older, monolithic, one large file per platform) and `spiderJobs/` (newer, modular, using `runner/loop.py` + `runner/api_client.py` as a proper execution harness). The refactor must unify these.

---

## Recommended Architecture

### Core Design Decision: Installable Shared Package

Extract the shared crawler code into a proper Python package at `crawler_core/` (project root level). Both deployment contexts import from it. No copying. No circular imports.

```
crawler_core/                  # Shared library: zero app/* or spiderJobs/* imports
├── __init__.py
├── http_client.py             # HTTPClient (requests-go, TLS fingerprint, proxy rotation)
├── base.py                    # ApiResult, BaseFetcher, BaseSearcher, parse_response()
├── boss/
│   ├── __init__.py
│   ├── sign.py                # BossSign (Traceid generation algorithm)
│   ├── client.py              # BossClient(HTTPClient), header injection
│   └── api.py                 # SearchRecJobs, GetBrandDetail, etc.
├── qcwy/
│   ├── __init__.py
│   ├── sign.py
│   ├── client.py
│   └── api.py
└── zhilian/
    ├── __init__.py
    ├── sign.py
    ├── client.py
    └── api.py
```

**Why a root-level package and not inside `app/` or `spiderJobs/`:**

- `app/` is the FastAPI backend. Importing it from external scripts would pull in Tortoise-ORM, APScheduler, ClickHouse, and every other backend dependency, making external scripts impossibly heavy.
- `spiderJobs/` is a deployment directory, not a library.
- A root-level `crawler_core/` has no such baggage. It imports only `requests-go` and stdlib. Both `app/` and `spiderJobs/` import from it.
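
The "zero `app/*` or `spiderJobs/*` imports" rule can be enforced mechanically rather than by convention. The following is a minimal sketch, assuming a hypothetical helper `find_forbidden_imports` (it does not exist in the codebase) that could back a unit test or CI check. It parses source text with the standard `ast` module and reports any import whose top-level package is forbidden:

```python
import ast

# Top-level packages that crawler_core must never import (from the rule above).
FORBIDDEN_ROOTS = ("app", "spiderJobs")


def find_forbidden_imports(source: str, forbidden=FORBIDDEN_ROOTS) -> list[str]:
    """Return the forbidden module names imported by a piece of Python source."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 means an absolute import; relative imports stay in-package.
            names = [node.module]
        else:
            continue
        # Compare only the first dotted component, so "app.models" matches "app"
        # but "application" does not.
        hits.extend(n for n in names if n.split(".")[0] in forbidden)
    return hits


clean = "import json\nfrom crawler_core.base import ApiResult\n"
dirty = "from app.models import BossToken\nimport spiderJobs.core.base\n"

print(find_forbidden_imports(clean))  # []
print(find_forbidden_imports(dirty))  # ['app.models', 'spiderJobs.core.base']
```

A real check would presumably loop over every `*.py` file under `crawler_core/` and fail the build if any file returns a non-empty list.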

**Constraints satisfied:**

- No circular imports
- No shared runtime state (`crawler_core` is stateless pure logic)
- Installable as a local package (editable install via `pip install -e .` or `pipenv install -e .`) for ECS deployment
- Tests can import `crawler_core` directly without mocking a FastAPI app

---

## Component Boundaries

### What Talks to What

```
┌─────────────────────────────────────────────────────────────┐
│                        crawler_core/                        │
│  HTTPClient · BaseFetcher · BaseSearcher · ApiResult        │
│  boss/ · qcwy/ · zhilian/ (sign + client + api)             │
│  Imports: requests-go, stdlib only                          │
└────────────────┬────────────────────────┬───────────────────┘
                 │                        │
    ┌────────────▼──────────┐   ┌─────────▼───────────────────┐
    │  app/services/        │   │  spiderJobs/                │
    │  crawler/             │   │  platforms/ + runner/       │
    │                       │   │                             │
    │  BossService facade   │   │  run_crawl_loop()           │
    │  QcwyService facade   │   │  RunnerAPIClient            │
    │  ZhilianService facade│   │  (HTTP push to backend)     │
    │                       │   │                             │
    │  Called by:           │   │  Called by:                 │
    │    CompanyCleaner     │   │    ECS node scripts         │
    │    CompanyController  │   │    jobs_spider/ (deprecated)│
    └────────────┬──────────┘   └─────────────────────────────┘
                 │
    ┌────────────▼──────────────────────────────────────────┐
    │  app/ (FastAPI backend, async boundary)               │
    │  CompanyCleaner (asyncio, calls sync crawler via      │
    │                  asyncio.to_thread())                 │
    │  IngestService (async ClickHouse writes)              │
    │  APScheduler (triggers CompanyCleaner async jobs)     │
    └───────────────────────────────────────────────────────┘
```

### Component Responsibilities

| Component | Responsibility | What It Must NOT Do |
|-----------|---------------|---------------------|
| `crawler_core/` | Signing, HTTP requests, response parsing (per platform) | Import `app.*`, use async, maintain global state |
| `app/services/crawler/` facades | Wrap `crawler_core` with logging, token loading, proxy management | Contain signing or HTTP logic |
| `spiderJobs/runner/` | Crawl loop orchestration, keyword polling, data upload | Import `app.*` |
| `app/services/ingest/` | Registry-driven ClickHouse insertion, dedup | Know anything about platform HTTP clients |
| `app/core/scheduler.py` | Schedule jobs, manage distributed lock | Run synchronous I/O on the event loop directly |

---

## Abstract Base Class Hierarchy

```python
# crawler_core/base.py
# Postponed annotation evaluation matters here: ApiResult has a field named
# `list`, which would otherwise shadow the builtin when `list[dict]` is evaluated.
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable, Optional

from crawler_core.http_client import HTTPClient


@dataclass
class ApiResult:
    success: bool
    status_code: int
    data: Any = None
    list: list[dict] = field(default_factory=list)
    count: int = 0
    is_end_page: bool = True
    error: Optional[str] = None


class BaseFetcher:
    """Single-resource GET (company detail, job detail)."""

    ENDPOINT: str = ""

    def __init__(self, http_client: HTTPClient): ...
    def _build_params(self) -> dict: raise NotImplementedError
    def _parse(self, http_code: int, raw: Any) -> ApiResult: ...
    def fetch(self) -> ApiResult: ...  # sync, wraps _http.get()


class BaseSearcher:
    """Paginated search (job list, brand job list)."""

    ENDPOINT: str = ""

    def __init__(self, page_size: int, http_client: HTTPClient): ...
    def _build_params(self, page_index: int) -> dict: raise NotImplementedError
    def _request(self, params: dict) -> tuple[int, Any]: ...
    def _parse(self, http_code: int, raw: Any) -> ApiResult: ...
    def search(self, page_index: int = 1) -> ApiResult: ...  # sync
    def load_all(self, max_pages: int, on_page: Callable | None) -> list[dict]: ...
```

Platform clients subclass `HTTPClient` from `crawler_core/http_client.py`:

```python
# crawler_core/boss/client.py
class BossClient(HTTPClient):
    """Injects mpt/wt2/Traceid headers on every request."""

    def __init__(self, signer: BossSign, **kwargs): ...  # remaining constructor arguments elided
    def post(self, path, body, headers=None) -> tuple[int, Any]: ...  # sync
    def get(self, path, params=None, headers=None) -> tuple[int, Any]: ...  # sync
```

Platform API classes subclass `BaseFetcher` / `BaseSearcher`:

```python
# crawler_core/boss/api.py
class SearchRecJobs(BaseSearcher):
    ENDPOINT = "/wapi/zpgeek/miniapp/homepage/recjoblist.json"

    def _build_params(self, page_index: int) -> dict: ...
    def _parse(self, http_code: int, raw: Any) -> ApiResult: ...  # Boss-specific parsing


class GetBrandDetail(BaseFetcher):
    ENDPOINT = "/wapi/zpgeek/miniapp/brand/get.json"

    def _build_params(self) -> dict: ...
```

**The hierarchy is deliberately flat.** Three layers: `HTTPClient` → platform client (e.g. `BossClient`, by inheritance) → platform API class (a `BaseFetcher` / `BaseSearcher` subclass that is handed the client). No further abstraction.

---

## Sync vs Async: The Critical Design Point

### Why the Core Must Stay Synchronous

`crawler_core/` uses `requests-go` (a sync library that wraps a TLS fingerprinting mechanism). There is no async version of `requests-go`. The platform APIs require TLS fingerprint spoofing that `httpx` does not support without significant additional work. Keeping the core synchronous is the correct choice given this constraint.

### How the Backend (async) Calls Synchronous Crawlers

`CompanyCleaner` runs inside the FastAPI async event loop (called by APScheduler). It must not block the event loop with synchronous HTTP calls. The correct pattern is `asyncio.to_thread()`:

```python
# app/services/company_cleaner.py
import asyncio
from typing import Optional


class CompanyCleaner:
    async def _fetch_company_detail(self, source: str, company_id: str) -> Optional[dict]:
        """Run the synchronous crawler call off the event loop thread."""
        if source == "boss":
            return await asyncio.to_thread(
                self.boss_service.get_company_detail_by_id, company_id
            )
        elif source == "qcwy":
            return await asyncio.to_thread(
                self.qcwy_service.get_company_detail_by_id, company_id
            )
        elif source == "zhilian":
            return await asyncio.to_thread(
                self.zhilian_service.get_company_detail_by_id, company_id
            )
        return None  # unknown source
```

`asyncio.to_thread()` (Python 3.9+) runs the synchronous function in the event loop's default thread pool and yields back to the event loop while it executes. This is the standard Python pattern for sync I/O in an async context.

**Current state:** The existing `CompanyCleaner.process_pending_companies()` likely calls sync methods directly on the event loop. If so, that is a bug to fix during refactoring, and the fix is to wrap each crawler call in `asyncio.to_thread()`.
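
The effect of the `asyncio.to_thread()` boundary can be observed in isolation. The following is a minimal sketch under stated assumptions: `fetch_company_detail` is a hypothetical stand-in that sleeps instead of making an HTTP request, and `heartbeat` is a hypothetical coroutine that represents everything else the event loop has to keep serving.

```python
import asyncio
import time


def fetch_company_detail(company_id: str) -> dict:
    """Stand-in for a blocking crawler call (hypothetical; sleeps instead of doing HTTP)."""
    time.sleep(0.2)
    return {"company_id": company_id}


async def heartbeat(ticks: list, stop: asyncio.Event) -> None:
    """Records a tick every 50 ms; it only makes progress if the event loop stays free."""
    while not stop.is_set():
        ticks.append(time.monotonic())
        await asyncio.sleep(0.05)


async def main() -> tuple[list, int]:
    ticks: list = []
    stop = asyncio.Event()
    beat = asyncio.create_task(heartbeat(ticks, stop))

    # Each blocking call runs in the default thread pool; the loop keeps scheduling.
    details = await asyncio.gather(
        *(asyncio.to_thread(fetch_company_detail, cid) for cid in ("c1", "c2", "c3"))
    )

    stop.set()
    await beat
    return details, len(ticks)


details, tick_count = asyncio.run(main())
print(details)
print(f"heartbeat ticks while fetching: {tick_count}")
```

If `fetch_company_detail` were called directly inside `main()` instead of through `asyncio.to_thread()`, the heartbeat would not tick at all until the three calls had finished one after another.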

### External Scripts: Pure Synchronous, No Async

`spiderJobs/runner/loop.py` (`run_crawl_loop`) and any ECS scripts are pure synchronous `while True` loops. They import from `crawler_core/` directly. No async. This is correct and intentional.

```
External Script (sync)              Backend (async)
──────────────────────              ─────────────────────────────
while True:                         async def process_company():
    kw = api.fetch_keyword()            detail = await asyncio.to_thread(
    searcher = BossSearcher(...)            boss_service.get_company_detail
    result = searcher.search()          )
    api.upload_data(result.list)        await company_storage.save(detail)
```

---

## Data Flow Patterns

### Flow 1: External Spider → Backend Storage

```
ECS Script (spiderJobs/runner/loop.py)
    │
    ├─ GET /api/v1/keyword/available
    │      → KeywordController marks keyword as "crawling"
    │
    ├─ crawler_core/boss/api.py SearchRecJobs.search(page)
    │      → returns ApiResult.list (raw job dicts)
    │
    ├─ POST /api/v1/universal/data/batch-store-async
    │      body: {platform, channel, data_type, data_list}
    │
    └─ Backend: IngestService.store_batch()
           → registry lookup → dedup → ClickHouse insert
```

The external script knows nothing about ClickHouse. It only pushes raw dicts over HTTP. This boundary is correct and must stay.

### Flow 2: In-Process Company Cleaning (Scheduled)

```
APScheduler → company_cleaning_job() [async]
    │
    ├─ CompanyCleaner.collect_pending_companies() [async]
    │      → ClickHouse: query job tables for company IDs
    │      → MySQL: check which company IDs already processed
    │      → MySQL: insert new companies into CompanyCleaningQueue
    │
    └─ CompanyCleaner.process_pending_companies() [async]
           ├─ MySQL: fetch N companies from queue
           ├─ asyncio.to_thread(boss_service.get_company_detail_by_id)
           │      → crawler_core/boss/api.py GetBrandDetail.fetch() [sync]
           ├─ company_storage.save_company() [async, ClickHouse]
           └─ MySQL: update company status to "done"
```

The flow crosses the sync/async boundary exactly once: at `asyncio.to_thread()`. Everything above (ClickHouse queries, MySQL operations) stays async.
Everything below (HTTP to Boss API) stays sync.

### Flow 3: Company Jobs Sync (new, post-refactor)

After a company is cleaned, the system should backfill that company's current job listings:

```
CompanyJobsSyncService.sync_company_jobs(company_id, source) [async]
    ├─ asyncio.to_thread(boss_service.get_company_jobs_by_id, company_id)
    │      → crawler_core/boss/api.py SearchBrandJobs.search() [sync]
    └─ IngestService.store_batch(platform="boss", data_type="company_job")
           → ClickHouse insert into boss_company_job table (new table needed)
```

---

## Package Structure: Suggested Build Order

Build `crawler_core/` first, because everything else depends on it.

### Phase 1: Extract `crawler_core/`

1. Create `crawler_core/__init__.py`, `crawler_core/http_client.py`, `crawler_core/base.py`
   - These are identical to the current `spiderJobs/core/` files with one import path change
   - Zero functional change; pure extraction

2. Create `crawler_core/boss/`, `crawler_core/qcwy/`, `crawler_core/zhilian/`
   - Extract from `spiderJobs/platforms/boss/`, `qcwy/`, `zhilian/`
   - Update internal imports to `from crawler_core.boss.sign import BossSign`, etc.

3. Add `crawler_core` to `Pipfile` as an editable local dependency:
   ```toml
   [packages]
   crawler_core = {path = ".", editable = true}
   ```
   (Or configure `pyproject.toml` with `[tool.setuptools.packages.find]` including `crawler_core`.) Either way, `path = "."` only works if the project root carries packaging metadata (`pyproject.toml` or `setup.py`) that includes `crawler_core`.

**Verification gate:** `spiderJobs/` and `app/services/crawler/` can both import from `crawler_core` without errors. `spiderJobs/core/` and the extracted platform modules under `spiderJobs/platforms/` (`sign.py`, `client.py`, `api.py`) can then be deleted; entry points such as `spiderJobs/platforms/boss/main.py` stay (see Phase 3).

### Phase 2: Rewire `app/services/crawler/` Facades

The service facades (`boss.py`, `qcwy.py`, `zhilian.py`) stay in `app/services/crawler/` because they belong to the backend.
Their job is to:

- Load tokens from MySQL (`BossToken`)
- Load proxies from `ProxyController`
- Instantiate `crawler_core` clients with those credentials
- Expose typed async-friendly methods (`get_company_detail_by_id`)
- Wrap each sync call in `asyncio.to_thread()`

The private underscore modules (`_boss_client.py`, `_boss_sign.py`, etc.) are deleted. Facades import directly from `crawler_core`.

```python
# app/services/crawler/boss.py (after refactor)
from crawler_core.boss.client import BossClient, create_client
from crawler_core.boss.sign import BossSign
from crawler_core.boss.api import GetBrandDetail, SearchBrandJobs
```

**Verification gate:** `CompanyCleaner` test passes with mocked `crawler_core` calls.

### Phase 3: Rewire `spiderJobs/` External Scripts

`spiderJobs/runner/` currently imports `from spiderJobs.core.base import ...`; change this to `from crawler_core.base import ...`. Delete `spiderJobs/core/` and the extracted `sign.py`, `client.py`, and `api.py` modules under `spiderJobs/platforms/`.

What remains under `spiderJobs/platforms/` is the per-platform entry point. `spiderJobs/platforms/boss/main.py` becomes a thin entry point:

```python
from crawler_core.boss.client import create_client
from crawler_core.boss.api import SearchRecJobs
from spiderJobs.runner.loop import run_crawl_loop
```

`jobs_spider/` (the older framework with monolithic files) is deprecated in favor of `spiderJobs/` + `crawler_core/`.

---

## Anti-Patterns to Avoid

### Anti-Pattern 1: Async Crawler Core

**What:** Making `crawler_core/` async because the backend is async.

**Why bad:** `requests-go` is sync. The TLS fingerprint mechanism is sync. Wrapping everything in `asyncio.run()` or creating fake async wrappers adds complexity with no benefit. The `asyncio.to_thread()` boundary at the facade layer is the correct and idiomatic solution.

**Instead:** Keep `crawler_core/` synchronous. Use `asyncio.to_thread()` in the backend facades.

### Anti-Pattern 2: Putting `RunnerAPIClient` in `crawler_core/`

**What:** Moving the HTTP push client (backend API caller) into the shared core.

**Why bad:** `RunnerAPIClient` is deployment-specific: it knows about `/api/v1/keyword/available`, `/api/v1/universal/data/batch-store-async`, and the backend's URL scheme. That is runner orchestration logic, not platform API logic.

**Instead:** `RunnerAPIClient` stays in `spiderJobs/runner/api_client.py`. It is not shared.

### Anti-Pattern 3: Importing `app.*` from External Scripts

**What:** Having ECS scripts import from `app.settings.config`, `app.models`, or `app.core.*`.

**Why bad:** That would require Tortoise-ORM initialization, database connections, and all backend startup logic to run on ECS nodes. This would fail.

**Instead:** External scripts get all config from environment variables. They communicate with the backend only via HTTP.

### Anti-Pattern 4: Blocking the Event Loop with Sync Crawler Calls

**What:** Calling `boss_service.get_company_detail_by_id()` directly in an `async def` without `asyncio.to_thread()`.

**Why bad:** Each HTTP call takes 2-10 seconds. With 30 companies to process, that blocks the event loop for up to 5 minutes. Other requests, heartbeats, and scheduler jobs starve.

**Instead:** Always wrap sync crawler calls in `asyncio.to_thread()` inside async context.

### Anti-Pattern 5: Monolithic Platform Files

**What:** Keeping the existing `jobs_spider/boss/boos_api.py` (2246 lines) pattern in the refactored code.

**Why bad:** Impossible to test individual components. Signing, HTTP, and business logic are entangled.

**Instead:** Three files per platform: `sign.py` (pure functions, no I/O), `client.py` (HTTP wrapper, no business logic), `api.py` (endpoint classes).
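
The three-file split in Anti-Pattern 5 is what makes each layer testable alone. The following is a compressed single-file sketch of that layering, with loud caveats: `make_signature`, `SignedClient`, `GetCompany`, the header names, and the endpoint are all hypothetical illustrations, and the hash below is NOT the real `BossSign` algorithm.

```python
import hashlib
from typing import Callable


# --- sign.py: pure function, no I/O ------------------------------------------
def make_signature(path: str, timestamp_ms: int, secret: str) -> str:
    """Hypothetical signing scheme (not the real BossSign algorithm)."""
    payload = f"{path}|{timestamp_ms}|{secret}".encode()
    return hashlib.sha256(payload).hexdigest()[:16]


# --- client.py: HTTP wrapper, no business logic -------------------------------
Transport = Callable[[str, dict, dict], tuple[int, dict]]


class SignedClient:
    """Adds signature headers, then delegates to an injected transport."""

    def __init__(self, transport: Transport, secret: str, clock: Callable[[], int]):
        self._transport, self._secret, self._clock = transport, secret, clock

    def get(self, path: str, params: dict) -> tuple[int, dict]:
        ts = self._clock()
        headers = {"X-Ts": str(ts), "X-Sign": make_signature(path, ts, self._secret)}
        return self._transport(path, params, headers)


# --- api.py: endpoint class, no signing or HTTP details -----------------------
class GetCompany:
    ENDPOINT = "/company/get"  # hypothetical endpoint

    def __init__(self, client: SignedClient, company_id: str):
        self._client, self._company_id = client, company_id

    def fetch(self) -> dict:
        status, body = self._client.get(self.ENDPOINT, {"id": self._company_id})
        return body if status == 200 else {}


# The transport below is a fake, so exercising all three layers needs no network.
seen = {}


def fake_transport(path, params, headers):
    seen.update(path=path, params=params, headers=headers)
    return 200, {"name": "Acme"}


client = SignedClient(fake_transport, secret="s3cret", clock=lambda: 1_700_000_000_000)
result = GetCompany(client, "42").fetch()
print(result)  # {'name': 'Acme'}
print(seen["headers"]["X-Sign"] == make_signature("/company/get", 1_700_000_000_000, "s3cret"))  # True
```

Because the signer takes its timestamp as an argument and the client takes its transport and clock as constructor parameters, every piece can be asserted on with fixed inputs, which is not possible when signing, HTTP, and parsing share one 2246-line file.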

---

## Scalability Considerations

| Concern | Current | After Refactor |
|---------|---------|---------------|
| Adding a 4th platform (e.g., job51) | Copy files into both `spiderJobs/` and `app/services/crawler/` | Add `crawler_core/job51/` once; both consumers get it |
| Testing signing algorithm | Must run full crawler | Test `crawler_core/boss/sign.py` in isolation (pure functions, no I/O) |
| Updating User-Agent headers | Two files to update | One file: `crawler_core/boss/client.py` |
| Running multiple concurrent companies | Single-threaded | `asyncio.to_thread()` uses the event loop's default thread pool; default size is `min(32, os.cpu_count() + 4)` workers (Python 3.8+) |
| ECS deployment of crawlers | Must deploy `spiderJobs/` + copy of core | Deploy `crawler_core/` + `spiderJobs/`; single source of truth |

---

## Suggested Build Order for Roadmap Phases

Dependencies flow in one direction:

```
Phase 1: crawler_core/ extraction
    └─ Produces: installable shared package, 3-platform coverage

Phase 2: app/services/crawler/ facade rewire
    └─ Depends on: Phase 1 (imports crawler_core)
    └─ Produces: asyncio.to_thread() boundary, deleted underscore copies

Phase 3: spiderJobs/ rewire + jobs_spider/ deprecation
    └─ Depends on: Phase 1 (imports crawler_core)
    └─ Produces: single external runner codebase

Phase 4: Company cleaning flow optimization
    └─ Depends on: Phase 2 (async-safe facades)
    └─ Produces: CompanyJobsSyncService using new facades

Phase 5: Unit tests
    └─ Depends on: Phase 1 (crawler_core is testable in isolation)
    └─ crawler_core tests: no mocking needed for sign.py (pure functions)
    └─ facade tests: mock crawler_core, test async wrapping
```

Phases 2 and 3 can run in parallel (both depend only on Phase 1, not on each other).
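
The parallelism claim can be checked against the declared dependencies. The following is a small sketch using the standard library's `graphlib` (Python 3.9+); `DEPENDS_ON` and `parallel_batches` are hypothetical names, and the mapping simply transcribes the "Depends on" lines of the build order above.

```python
from graphlib import TopologicalSorter

# Phase -> phases it depends on, transcribed from the build order above.
DEPENDS_ON = {
    1: set(),
    2: {1},
    3: {1},
    4: {2},
    5: {1},
}


def parallel_batches(depends_on: dict) -> list:
    """Group phases into waves; every phase in a wave has all its dependencies done."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are circular
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


print(parallel_batches(DEPENDS_ON))  # [[1], [2, 3, 5], [4]]
```

Phase 5 lands in the second wave only because its declared dependency is Phase 1. Its facade tests presumably also need the Phase 2 facades to exist, so in practice only the `crawler_core` tests can start that early.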

---

## Sources

- Direct inspection: `/Users/win/2025/AICoding/JobData/app/services/crawler/_http_client.py` (docstring confirms copy from `spiderJobs/core/http_client.py`)
- Direct inspection: `/Users/win/2025/AICoding/JobData/spiderJobs/core/base.py` vs `app/services/crawler/_base.py`: byte-identical logic
- Direct inspection: `/Users/win/2025/AICoding/JobData/app/services/company_cleaner.py`: async consumer of sync crawlers
- Direct inspection: `/Users/win/2025/AICoding/JobData/spiderJobs/runner/loop.py`: sync-only external runner
- Direct inspection: `/Users/win/2025/AICoding/JobData/.planning/codebase/ARCHITECTURE.md`: existing architecture analysis

---

*Architecture analysis: 2026-03-21*