docs: add research findings (stack, features, architecture, pitfalls, summary)

win 2026-03-21 16:36:37 +08:00
parent 030da1ce53
commit f3005ef525
5 changed files with 1413 additions and 0 deletions

@ -0,0 +1,423 @@
# Architecture Patterns: Shared Crawler Core
**Domain:** Recruiting data crawler — dual-deployment shared core
**Researched:** 2026-03-21
**Overall confidence:** HIGH (based entirely on direct codebase inspection)
---
## Problem Statement
The codebase currently has the same code in two places, copied verbatim:
| File | Copy location |
|------|---------------|
| `spiderJobs/core/http_client.py` | `app/services/crawler/_http_client.py` |
| `spiderJobs/core/base.py` | `app/services/crawler/_base.py` |
| `spiderJobs/platforms/boss/sign.py` | `app/services/crawler/_boss_sign.py` |
| `spiderJobs/platforms/boss/client.py` | `app/services/crawler/_boss_client.py` |
| `spiderJobs/platforms/boss/api.py` | `app/services/crawler/_boss_api.py` |
The docstring in `app/services/crawler/_http_client.py` literally says: "复制自 spiderJobs/core/http_client.py — 不要直接 import spiderJobs避免跨模块依赖" ("copied from spiderJobs/core/http_client.py; do not import spiderJobs directly, to avoid a cross-module dependency"). The comment reveals the source of the problem: the team correctly identified that two separate deployment units cannot cross-import, but solved it by copy-paste rather than extraction.
There are also two competing external spider frameworks: `jobs_spider/` (older, monolithic, single large file per platform) and `spiderJobs/` (newer, modular, uses `runner/loop.py` + `runner/api_client.py` as a proper execution harness). The refactor must unify these.
---
## Recommended Architecture
### Core Design Decision: Installable Shared Package
Extract shared crawler code into a proper Python package at `crawler_core/` (project root level). Both deployment contexts import from it. No copying. No circular imports.
```
crawler_core/                  # Shared library: zero app/* or spiderJobs/* imports
├── __init__.py
├── http_client.py             # HTTPClient (requests-go, TLS fingerprint, proxy rotation)
├── base.py                    # ApiResult, BaseFetcher, BaseSearcher, parse_response()
├── boss/
│   ├── __init__.py
│   ├── sign.py                # BossSign (Traceid generation algorithm)
│   ├── client.py              # BossClient(HTTPClient), header injection
│   └── api.py                 # SearchRecJobs, GetBrandDetail, etc.
├── qcwy/
│   ├── __init__.py
│   ├── sign.py
│   ├── client.py
│   └── api.py
└── zhilian/
    ├── __init__.py
    ├── sign.py
    ├── client.py
    └── api.py
```
**Why a root-level package and not inside `app/` or `spiderJobs/`:**
- `app/` is the FastAPI backend. Importing it from external scripts would pull in Tortoise-ORM, APScheduler, ClickHouse, and every other backend dependency — making external scripts impossibly heavy.
- `spiderJobs/` is a deployment directory, not a library.
- A root-level `crawler_core/` has no such baggage. It imports only `requests-go` and stdlib. Both `app/` and `spiderJobs/` import from it.
**Constraints satisfied:**
- No circular imports
- No shared runtime state (crawler_core is stateless pure logic)
- Installable as local package (editable install via `pip install -e .` or `pipenv install -e .`) for ECS deployment
- Tests can import `crawler_core` directly without mocking a FastAPI app
---
## Component Boundaries
### What Talks to What
```
┌─────────────────────────────────────────────────────────────┐
│                        crawler_core/                        │
│  HTTPClient · BaseFetcher · BaseSearcher · ApiResult        │
│  boss/ · qcwy/ · zhilian/  (sign + client + api)            │
│  Imports: requests-go, stdlib only                          │
└────────────────┬────────────────────────┬───────────────────┘
                 │                        │
    ┌────────────▼──────────┐   ┌─────────▼─────────────────┐
    │  app/services/        │   │  spiderJobs/              │
    │  crawler/             │   │  platforms/ + runner/     │
    │                       │   │                           │
    │  BossService facade   │   │  run_crawl_loop()         │
    │  QcwyService facade   │   │  RunnerAPIClient          │
    │  ZhilianService facade│   │  (HTTP push to backend)   │
    │                       │   │                           │
    │  Called by:           │   │  Called by:               │
    │  CompanyCleaner       │   │  ECS node scripts         │
    │  CompanyController    │   │  jobs_spider/ (deprecated)│
    └────────────┬──────────┘   └───────────────────────────┘
    ┌────────────▼──────────────────────────────────────────┐
    │  app/ (FastAPI backend, async boundary)               │
    │  CompanyCleaner (asyncio, calls sync crawler via      │
    │                  asyncio.to_thread())                 │
    │  IngestService (async ClickHouse writes)              │
    │  APScheduler (triggers CompanyCleaner async jobs)     │
    └───────────────────────────────────────────────────────┘
```
### Component Responsibilities
| Component | Responsibility | What It Must NOT Do |
|-----------|---------------|---------------------|
| `crawler_core/` | Signing, HTTP requests, response parsing — per platform | Import `app.*`, use async, maintain global state |
| `app/services/crawler/` facades | Wrap `crawler_core` with logging, token loading, proxy mgmt | Contain signing or HTTP logic |
| `spiderJobs/runner/` | Crawl loop orchestration, keyword polling, data upload | Import `app.*` |
| `app/services/ingest/` | Registry-driven ClickHouse insertion, dedup | Know anything about platform HTTP clients |
| `app/core/scheduler.py` | Schedule jobs, manage distributed lock | Run synchronous I/O on the event loop directly |
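
The "must not import" column above can be enforced mechanically rather than by review. The following is a minimal sketch, assuming a hypothetical helper named `find_forbidden_imports` (it does not exist in the codebase) that parses a module's source with the standard-library `ast` module and reports any import rooted in a forbidden package.

```python
import ast

# Hypothetical rule set: top-level packages that crawler_core must never import.
FORBIDDEN_ROOTS = {"app", "spiderJobs"}


def find_forbidden_imports(source: str, forbidden=FORBIDDEN_ROOTS) -> list[str]:
    """Return the dotted names of forbidden modules imported by `source`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 means an absolute import; relative imports stay inside the package
            names = [node.module]
        else:
            continue
        hits.extend(name for name in names if name.split(".")[0] in forbidden)
    return hits
```

In a test suite, one option is to read every `*.py` file under `crawler_core/` and assert that the function returns an empty list for each. The same check with `{"app"}` as the forbidden set would cover `spiderJobs/runner/`.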
---
## Abstract Base Class Hierarchy
```python
# crawler_core/base.py
from __future__ import annotations  # defer annotations so `list: list[dict]` does not see the field

from dataclasses import dataclass, field
from typing import Any, Callable, Optional

from crawler_core.http_client import HTTPClient


@dataclass
class ApiResult:
    success: bool
    status_code: int
    data: Any = None
    list: list[dict] = field(default_factory=list)  # field name shadows the builtin inside the class
    count: int = 0
    is_end_page: bool = True
    error: Optional[str] = None


class BaseFetcher:
    """Single-resource GET (company detail, job detail)."""
    ENDPOINT: str = ""

    def __init__(self, http_client: HTTPClient): ...
    def _build_params(self) -> dict: raise NotImplementedError
    def _parse(self, http_code: int, raw: Any) -> ApiResult: ...
    def fetch(self) -> ApiResult: ...  # sync; wraps _http.get()


class BaseSearcher:
    """Paginated search (job list, brand job list)."""
    ENDPOINT: str = ""

    def __init__(self, page_size: int, http_client: HTTPClient): ...
    def _build_params(self, page_index: int) -> dict: raise NotImplementedError
    def _request(self, params: dict) -> tuple[int, Any]: ...
    def _parse(self, http_code: int, raw: Any) -> ApiResult: ...
    def search(self, page_index: int = 1) -> ApiResult: ...  # sync
    def load_all(self, max_pages: int, on_page: Callable | None) -> list[dict]: ...
```
Platform clients subclass `HTTPClient` from `crawler_core/http_client.py`:
```python
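# crawler_core/boss/client.py
from typing import Any

from crawler_core.boss.sign import BossSign
from crawler_core.http_client import HTTPClient


class BossClient(HTTPClient):
    """Injects mpt/wt2/Traceid headers on every request."""
    def __init__(self, signer: BossSign, **http_kwargs): ...  # remaining options elided in the source
    def post(self, path, body, headers=None) -> tuple[int, Any]: ...  # sync
    def get(self, path, params=None, headers=None) -> tuple[int, Any]: ...  # sync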
```
Platform API classes subclass `BaseFetcher` / `BaseSearcher`:
```python
# crawler_core/boss/api.py
from typing import Any

from crawler_core.base import ApiResult, BaseFetcher, BaseSearcher


class SearchRecJobs(BaseSearcher):
    ENDPOINT = "/wapi/zpgeek/miniapp/homepage/recjoblist.json"
    def _build_params(self, page_index: int) -> dict: ...
    def _parse(self, http_code: int, raw: Any) -> ApiResult: ...  # Boss-specific parsing


class GetBrandDetail(BaseFetcher):
    ENDPOINT = "/wapi/zpgeek/miniapp/brand/get.json"
    def _build_params(self) -> dict: ...
```
**The hierarchy is deliberately flat.** Three levels: `HTTPClient` → `PlatformClient` → `PlatformAPIClass`. No further abstraction.
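
To make the division of labour concrete, here is a stripped-down, runnable stand-in for the template-method flow: the base class owns `fetch()`, the subclass supplies only `_build_params()`. Everything except the endpoint string is an assumption made for illustration: `FakeClient`, `MiniFetcher`, `GetBrandDetailSketch`, the `brandId` parameter name, and the default parsing rule are not taken from the codebase.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ApiResult:  # reduced to the fields this sketch needs
    success: bool
    status_code: int
    data: Any = None
    error: Optional[str] = None


class FakeClient:
    """Stand-in for HTTPClient/BossClient: records calls, returns a canned response."""

    def __init__(self, status: int, payload: Any):
        self.status, self.payload, self.calls = status, payload, []

    def get(self, path, params=None, headers=None):
        self.calls.append((path, params))
        return self.status, self.payload


class MiniFetcher:
    """Base class owns the request flow; subclasses fill in the hooks."""

    ENDPOINT = ""

    def __init__(self, http_client):
        self._http = http_client

    def _build_params(self) -> dict:
        raise NotImplementedError

    def _parse(self, http_code: int, raw: Any) -> ApiResult:
        ok = http_code == 200 and isinstance(raw, dict)
        return ApiResult(ok, http_code, data=raw if ok else None,
                         error=None if ok else f"HTTP {http_code}")

    def fetch(self) -> ApiResult:
        code, raw = self._http.get(self.ENDPOINT, params=self._build_params())
        return self._parse(code, raw)


class GetBrandDetailSketch(MiniFetcher):
    ENDPOINT = "/wapi/zpgeek/miniapp/brand/get.json"  # endpoint from the text above

    def __init__(self, http_client, brand_id: str):
        super().__init__(http_client)
        self.brand_id = brand_id

    def _build_params(self) -> dict:
        return {"brandId": self.brand_id}  # parameter name is a guess
```

Because the client is injected through the constructor, a test can pass `FakeClient(200, {...})` and assert on `client.calls` without any network access.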
---
## Sync vs Async: The Critical Design Point
### Why the Core Must Stay Synchronous
`crawler_core/` uses `requests-go` (a sync library that wraps a TLS fingerprinting mechanism). There is no async version of `requests-go`. The platform APIs require TLS fingerprint spoofing that `httpx` does not support without significant additional work. Keeping the core synchronous is the correct choice given the constraint.
### How the Backend (async) Calls Synchronous Crawlers
`CompanyCleaner` runs inside the FastAPI async event loop (called by APScheduler). It must not block the event loop with synchronous HTTP calls. The correct pattern is `asyncio.to_thread()`:
```python
# app/services/company_cleaner.py
import asyncio
from typing import Optional


class CompanyCleaner:
    async def _fetch_company_detail(self, source: str, company_id: str) -> Optional[dict]:
        """Run synchronous crawler call off the event loop thread."""
        if source == "boss":
            return await asyncio.to_thread(
                self.boss_service.get_company_detail_by_id, company_id
            )
        elif source == "qcwy":
            return await asyncio.to_thread(
                self.qcwy_service.get_company_detail_by_id, company_id
            )
        elif source == "zhilian":
            return await asyncio.to_thread(
                self.zhilian_service.get_company_detail_by_id, company_id
            )
        return None  # unknown source
```
`asyncio.to_thread()` runs the synchronous function in a thread pool, yielding back to the event loop while it executes. This is the standard Python pattern for sync I/O in an async context.
**Current state:** The existing `CompanyCleaner.process_pending_companies()` likely calls sync methods directly on the event loop — this is a bug to fix during refactoring. The fix is wrapping each crawler call with `asyncio.to_thread()`.
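
The difference between the two call styles is easy to observe. The sketch below uses a fake blocking fetch (`slow_sync_fetch`, a stand-in for a facade method such as `get_company_detail_by_id`) and a heartbeat task that counts how often the event loop gets a turn while the fetch runs. All names here are illustrative.

```python
import asyncio
import time


def slow_sync_fetch(company_id: str, delay: float = 0.2) -> dict:
    """Stand-in for a blocking crawler call."""
    time.sleep(delay)  # blocks whichever thread runs it
    return {"company_id": company_id}


async def heartbeat(ticks: list, interval: float = 0.05) -> None:
    """Append a timestamp every `interval` seconds; stalls if the loop is blocked."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def demo(blocking: bool) -> int:
    """Return how many heartbeat ticks were recorded around one fetch."""
    ticks: list = []
    beat = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0)  # let the heartbeat record its first tick
    if blocking:
        slow_sync_fetch("c1")  # anti-pattern: runs on the event loop thread
    else:
        await asyncio.to_thread(slow_sync_fetch, "c1")  # runs on a worker thread
    beat.cancel()
    return len(ticks)
```

`asyncio.run(demo(blocking=True))` records only the initial tick, because nothing else can run during the fetch. With `blocking=False` the heartbeat keeps ticking for the whole duration of the call.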
### External Scripts: Pure Synchronous, No Async
`spiderJobs/runner/loop.py` (`run_crawl_loop`) and any ECS scripts are pure synchronous `while True` loops. They import from `crawler_core/` directly. No async. This is correct and intentional.
```
External Script (sync)              Backend (async)
──────────────────────              ─────────────────────────────
while True:                         async def process_company():
    kw = api.fetch_keyword()            detail = await asyncio.to_thread(
    searcher = BossSearcher(...)            boss_service.get_company_detail
    result = searcher.search()          )
    api.upload_data(result.list)        await company_storage.save(detail)
```
---
## Data Flow Patterns
### Flow 1: External Spider → Backend Storage
```
ECS Script (spiderJobs/runner/loop.py)
  ├─ GET /api/v1/keyword/available
  │    → KeywordController marks keyword as "crawling"
  ├─ crawler_core/boss/api.py SearchRecJobs.search(page)
  │    → returns ApiResult.list (raw job dicts)
  ├─ POST /api/v1/universal/data/batch-store-async
  │    body: {platform, channel, data_type, data_list}
  └─ Backend: IngestService.store_batch()
       → registry lookup → dedup → ClickHouse insert
```
The external script knows nothing about ClickHouse. It only pushes raw dicts over HTTP. This boundary is correct and must stay.
### Flow 2: In-Process Company Cleaning (Scheduled)
```
APScheduler → company_cleaning_job() [async]
  ├─ CompanyCleaner.collect_pending_companies() [async]
  │    → ClickHouse: query job tables for company IDs
  │    → MySQL: check which company IDs already processed
  │    → MySQL: insert new companies into CompanyCleaningQueue
  └─ CompanyCleaner.process_pending_companies() [async]
       ├─ MySQL: fetch N companies from queue
       ├─ asyncio.to_thread(boss_service.get_company_detail_by_id)
       │    → crawler_core/boss/api.py GetBrandDetail.fetch() [sync]
       ├─ company_storage.save_company() [async, ClickHouse]
       └─ MySQL: update company status to "done"
```
The flow crosses the sync/async boundary exactly once: at `asyncio.to_thread()`. Everything above (ClickHouse queries, MySQL operations) stays async. Everything below (HTTP to Boss API) stays sync.
### Flow 3: Company Jobs Sync (new, post-refactor)
After a company is cleaned, the system should backfill that company's current job listings:
```
CompanyJobsSyncService.sync_company_jobs(company_id, source) [async]
  ├─ asyncio.to_thread(boss_service.get_company_jobs_by_id, company_id)
  │    → crawler_core/boss/api.py SearchBrandJobs.search() [sync]
  └─ IngestService.store_batch(platform="boss", data_type="company_job")
       → ClickHouse insert into boss_company_job table (new table needed)
```
---
## Package Structure: Suggested Build Order
Build `crawler_core/` first — everything else depends on it.
### Phase 1: Extract `crawler_core/`
1. Create `crawler_core/__init__.py`, `crawler_core/http_client.py`, `crawler_core/base.py`
- These are identical to the current `spiderJobs/core/` files with one import path change
- Zero functional change; pure extraction
2. Create `crawler_core/boss/`, `crawler_core/qcwy/`, `crawler_core/zhilian/`
- Extract from `spiderJobs/platforms/boss/`, qcwy/, zhilian/
- Update internal imports to `from crawler_core.boss.sign import BossSign`, etc.
3. Add `crawler_core` to `Pipfile` as an editable local dependency:
```toml
[packages]
crawler_core = {path = ".", editable = true}
```
(Or configure `pyproject.toml` with `[tool.setuptools.packages.find]` including `crawler_core`)
**Verification gate:** `spiderJobs/` and `app/services/crawler/` both import from `crawler_core` without errors. `spiderJobs/core/` and `spiderJobs/platforms/` can be deleted.
### Phase 2: Rewire `app/services/crawler/` Facades
The service facades (`boss.py`, `qcwy.py`, `zhilian.py`) stay in `app/services/crawler/` — they belong to the backend. Their job is to:
- Load tokens from MySQL (`BossToken`)
- Load proxies from `ProxyController`
- Instantiate `crawler_core` clients with those credentials
- Expose typed async-friendly methods (`get_company_detail_by_id`)
- Wrap `asyncio.to_thread()` for each sync call
The private underscore modules (`_boss_client.py`, `_boss_sign.py`, etc.) are deleted. Facades import directly from `crawler_core`.
```python
# app/services/crawler/boss.py (after refactor)
from crawler_core.boss.client import BossClient, create_client
from crawler_core.boss.sign import BossSign
from crawler_core.boss.api import GetBrandDetail, SearchBrandJobs
```
**Verification gate:** `CompanyCleaner` test passes with mocked `crawler_core` calls.
### Phase 3: Rewire `spiderJobs/` External Scripts
`spiderJobs/runner/` imports `from spiderJobs.core.base import ...` → change to `from crawler_core.base import ...`. Delete `spiderJobs/core/` and `spiderJobs/platforms/`.
`spiderJobs/platforms/boss/main.py` becomes a thin entry point:
```python
from crawler_core.boss.client import create_client
from crawler_core.boss.api import SearchRecJobs
from spiderJobs.runner.loop import run_crawl_loop
```
`jobs_spider/` (the older framework with monolithic files) is deprecated in favor of `spiderJobs/` + `crawler_core/`.
---
## Anti-Patterns to Avoid
### Anti-Pattern 1: Async Crawler Core
**What:** Making `crawler_core/` async because the backend is async.
**Why bad:** `requests-go` is sync. The TLS fingerprint mechanism is sync. Wrapping everything in `asyncio.run()` or creating fake async wrappers adds complexity with no benefit. The `asyncio.to_thread()` boundary at the facade layer is the correct and idiomatic solution.
**Instead:** Keep `crawler_core/` synchronous. Use `asyncio.to_thread()` in the backend facades.
### Anti-Pattern 2: Putting `RunnerAPIClient` in `crawler_core/`
**What:** Moving the HTTP push client (backend API caller) into the shared core.
**Why bad:** `RunnerAPIClient` is deployment-specific: it knows about `/api/v1/keyword/available`, `/api/v1/universal/data/batch-store-async`, and the backend's URL scheme. That is runner orchestration logic, not platform API logic.
**Instead:** `RunnerAPIClient` stays in `spiderJobs/runner/api_client.py`. It is not shared.
### Anti-Pattern 3: Importing `app.*` from External Scripts
**What:** Having ECS scripts import from `app.settings.config`, `app.models`, or `app.core.*`.
**Why bad:** That would require Tortoise-ORM initialization, database connections, and all backend startup logic to run on ECS nodes. This would fail.
**Instead:** External scripts get all config from environment variables. They communicate with the backend only via HTTP.
### Anti-Pattern 4: Blocking the Event Loop with Sync Crawler Calls
**What:** Calling `boss_service.get_company_detail_by_id()` directly in an `async def` without `asyncio.to_thread()`.
**Why bad:** Each HTTP call takes 2-10 seconds. With 30 companies to process, that blocks the event loop for up to 5 minutes. Other requests, heartbeats, and scheduler jobs starve.
**Instead:** Always wrap sync crawler calls in `asyncio.to_thread()` inside async context.
### Anti-Pattern 5: Monolithic Platform Files
**What:** Keeping the existing `jobs_spider/boss/boos_api.py` (2246 lines) pattern in the refactored code.
**Why bad:** Impossible to test individual components. Signing, HTTP, and business logic are entangled.
**Instead:** Three files per platform: `sign.py` (pure functions, no I/O), `client.py` (HTTP wrapper, no business logic), `api.py` (endpoint classes).
---
## Scalability Considerations
| Concern | Current | After Refactor |
|---------|---------|---------------|
| Adding a 4th platform | Copy files into both `spiderJobs/` and `app/services/crawler/` | Add `crawler_core/<platform>/` once; both consumers get it |
| Testing signing algorithm | Must run full crawler | Test `crawler_core/boss/sign.py` in isolation — pure functions, no I/O |
| Updating User-Agent headers | Two files to update | One file: `crawler_core/boss/client.py` |
| Running multiple concurrent companies | Single-threaded | `asyncio.to_thread()` uses the loop's default thread pool; default size is `min(32, os.cpu_count() + 4)` workers |
| ECS deployment of crawlers | Must deploy `spiderJobs/` + copy of core | Deploy `crawler_core/` + `spiderJobs/`; single source of truth |
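
The thread pool caps how many blocking calls can run at once, but it is shared by everything that uses the default executor. If the number of simultaneous requests against one platform should be limited independently of the pool size, a semaphore in front of `asyncio.to_thread()` is one way to do it. This is a sketch under that assumption; `fetch_many`, `max_concurrent`, and `ConcurrencyProbe` are hypothetical names.

```python
import asyncio
import threading
import time


async def fetch_many(sync_fetch, company_ids, max_concurrent: int = 4) -> list:
    """Run `sync_fetch(company_id)` in worker threads, at most `max_concurrent` at once."""
    gate = asyncio.Semaphore(max_concurrent)

    async def one(company_id):
        async with gate:
            return await asyncio.to_thread(sync_fetch, company_id)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(one(cid) for cid in company_ids))


class ConcurrencyProbe:
    """Test helper: a fake blocking fetch that records peak parallelism."""

    def __init__(self, delay: float = 0.02):
        self.delay, self.active, self.peak = delay, 0, 0
        self._lock = threading.Lock()

    def __call__(self, company_id):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(self.delay)  # simulate the blocking HTTP call
        with self._lock:
            self.active -= 1
        return {"company_id": company_id}
```

Running `fetch_many(ConcurrencyProbe(), ids, max_concurrent=3)` over ten IDs leaves `probe.peak` at 3 or below, whatever the executor size is.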
---
## Suggested Build Order for Roadmap Phases
Dependencies flow in one direction:
```
Phase 1: crawler_core/ extraction
  └─ Produces: installable shared package, 3-platform coverage

Phase 2: app/services/crawler/ facade rewire
  └─ Depends on: Phase 1 (imports crawler_core)
  └─ Produces: asyncio.to_thread() boundary, deleted underscore copies

Phase 3: spiderJobs/ rewire + jobs_spider/ deprecation
  └─ Depends on: Phase 1 (imports crawler_core)
  └─ Produces: single external runner codebase

Phase 4: Company cleaning flow optimization
  └─ Depends on: Phase 2 (async-safe facades)
  └─ Produces: CompanyJobsSyncService using new facades

Phase 5: Unit tests
  └─ Depends on: Phase 1 (crawler_core is testable in isolation)
  └─ crawler_core tests: no mocking needed for sign.py (pure functions)
  └─ facade tests: mock crawler_core, test async wrapping
```
Phases 2 and 3 can run in parallel (both depend only on Phase 1, not each other).
---
## Sources
- Direct inspection: `/Users/win/2025/AICoding/JobData/app/services/crawler/_http_client.py` (docstring confirms copy from `spiderJobs/core/http_client.py`)
- Direct inspection: `/Users/win/2025/AICoding/JobData/spiderJobs/core/base.py` vs `app/services/crawler/_base.py` — byte-identical logic
- Direct inspection: `/Users/win/2025/AICoding/JobData/app/services/company_cleaner.py` — async consumer of sync crawlers
- Direct inspection: `/Users/win/2025/AICoding/JobData/spiderJobs/runner/loop.py` — sync-only external runner
- Direct inspection: `/Users/win/2025/AICoding/JobData/.planning/codebase/ARCHITECTURE.md` — existing architecture analysis
---
*Architecture analysis: 2026-03-21*

@ -0,0 +1,411 @@
# Feature Landscape
**Domain:** Multi-platform job data crawler framework (Boss直聘 / 前程无忧 / 智联招聘)
**Researched:** 2026-03-21
**Context:** Refactoring existing crawler system — not a greenfield build. Features are evaluated against what exists and what the refactor must deliver.
---
## Current State Baseline
Before categorizing features, this is what the existing system already has (do not re-build):
| Capability | Implementation | Location |
|------------|---------------|----------|
| Keyword-driven crawl dispatch | `/api/v1/keyword/available` pull model | `app/api/v1/keyword/` |
| Batch data ingestion pipeline | `IngestService` + `PlatformConfig` registry | `app/services/ingest/` |
| ClickHouse dedup on insert | `batch_dedup_filter()` — 1 or 2 column key | `app/services/ingest/dedup.py` |
| Random delay anti-detection | `sleep_random_between()`, 10-20s enforced | `jobs_spider/boss/boos_api.py` |
| Proxy rotation + tunnel proxy | `SmartIPManager`, `HTTPClient` | `app/core/algorithms/antispider.py`, `_http_client.py` |
| IP anomaly detection | `IPAnomalyDetector` — 403/429/407, slow response, code=35 | `_http_client.py` |
| TLS fingerprint spoofing | `requests_go` with `TLS_CHROME_LATEST`, random JA3 | `_http_client.py` |
| Structured logging | `loguru` rotating daily files, 30-day retention | all modules |
| Scheduled job framework | APScheduler, distributed file/Redis lock | `app/core/scheduler.py`, `locks.py` |
| IP upload tracking | `IpTrackingMiddleware` → `IpUploadStats` table | `app/core/ip_tracking.py` |
| Company cleaning pipeline | `CompanyCleaner` collect→process→store, every 5 min | `app/services/company_cleaner.py` |
| Frontend monitoring (partial) | `monitor.vue` — pending queue, today processed, completion rate | `web/src/views/cleaning/monitor.vue` |
| Unified crawl loop (partial) | `spiderJobs/runner/loop.py` — page progress reporting | `spiderJobs/runner/` |
| BaseFetcher / BaseSearcher | Template method base classes | `app/services/crawler/_base.py` |
**Critical gap:** Two parallel crawler frameworks exist (`jobs_spider/` + `spiderJobs/`) with divergent code. The base classes in `app/services/crawler/` are partially implemented but not used by `jobs_spider/`. Dedup logic is duplicated in both frameworks.
---
## Table Stakes
Features the system cannot function correctly without. Missing or broken = data pipeline breaks.
### 1. Unified Platform Base Classes
**Why required:** Without a single `BasePlatformClient` that `jobs_spider/`, `app/services/crawler/`, AND `spiderJobs/` all import from, every bug fix or anti-detection update must be applied in 3 places — guaranteed divergence.
**What this means concretely:**
- Single `_base.py` in a shared package (not copied between directories)
- `BaseFetcher`, `BaseSearcher`, `ApiResult`, `parse_response` defined once
- Each platform implements `_build_params()`, `_parse()` — nothing else
- Sign/auth modules (`_boss_sign.py`, `_zhilian_sign.py`, etc.) imported from the shared package
**Complexity:** Medium — requires restructuring import paths, not rewriting logic.
| Sub-feature | Status | Notes |
|-------------|--------|-------|
| Single base class file | Partial — `app/services/crawler/_base.py` exists but is a copy, not the source of truth | Must become canonical |
| Shared sign modules | Partial — `_boss_sign.py` etc. exist in `app/services/crawler/` | `jobs_spider/` uses its own copies |
| `spiderJobs/` consolidation | Missing — `spiderJobs/core/base.py` is yet another copy | Must be eliminated or unified |
---
### 2. Keyword-Driven Dispatch with Crawl State Tracking
**Why required:** Without atomic keyword assignment with state tracking, multiple ECS nodes will crawl the same keyword simultaneously, producing massive duplicates and wasting API quota.
**What this means concretely:**
- `GET /api/v1/keyword/available?reserve=True` atomically marks a keyword as in-progress
- Keyword state machine: `pending → in_progress → completed / failed`
- `last_completed_page` persisted so interrupted crawls can resume from breakpoint
- Crawlers report `status=completed|failed` + page count on finish
- The `spiderJobs/runner/loop.py` already implements `report_crawl_complete()` and `report_page_progress()` — this is the model to follow
**Complexity:** Low — pattern already exists in `spiderJobs/runner/`, needs to be the standard.
| Sub-feature | Status | Notes |
|-------------|--------|-------|
| Atomic keyword assignment | Exists in `jobs_spider/boss/boos_api.py` | Not in qcwy/zhilian |
| Crawl completion reporting | Exists in `spiderJobs/runner/loop.py` | Not in `jobs_spider/` |
| Breakpoint resume (last_completed_page) | Exists in `spiderJobs/runner/loop.py` | Not in `jobs_spider/` |
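
The keyword state machine described above can be written down as a transition table. This is a single-process sketch for illustration only: the enum and function names are hypothetical, the two edges back to `pending` (re-queue after a cycle, retry after failure) are assumptions, and a real multi-node deployment would need the reservation to be atomic at the database level rather than in Python.

```python
from enum import Enum
from typing import Optional


class KeywordState(str, Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"


# Allowed transitions. The edges back to PENDING are assumed, not confirmed.
TRANSITIONS = {
    KeywordState.PENDING: {KeywordState.IN_PROGRESS},
    KeywordState.IN_PROGRESS: {KeywordState.COMPLETED, KeywordState.FAILED},
    KeywordState.COMPLETED: {KeywordState.PENDING},
    KeywordState.FAILED: {KeywordState.PENDING},
}


def advance(current: KeywordState, target: KeywordState) -> KeywordState:
    """Return `target` if the transition is legal, else raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def reserve(keywords: list) -> Optional[dict]:
    """Pick the first pending keyword and mark it in_progress."""
    for kw in keywords:
        if kw["state"] == KeywordState.PENDING:
            kw["state"] = advance(kw["state"], KeywordState.IN_PROGRESS)
            return kw
    return None
```

Two consecutive `reserve()` calls never return the same keyword, which is the property the `reserve=True` query parameter is meant to guarantee across ECS nodes.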
---
### 3. Per-Page Batch Upload (Not End-of-Keyword Batch)
**Why required:** If a crawler accumulates all pages before pushing, a crash at page 9 of 10 loses everything. Uploading per-page means each page's data is safe before moving to the next.
**What this means concretely:**
- After each page fetch: `POST /api/v1/universal/data/batch-store-async` with that page's data
- Do not accumulate across pages in memory
- Report page progress to backend (`report_page_progress()`)
- `spiderJobs/runner/loop.py` does this correctly — `jobs_spider/boss/boos_api.py` accumulates and pushes at the end
**Complexity:** Low — structural change to push loop.
| Sub-feature | Status | Notes |
|-------------|--------|-------|
| Per-page upload | Missing in `jobs_spider/boss/` | `spiderJobs/runner/` has it |
| Per-page progress reporting | Missing in `jobs_spider/` | `spiderJobs/runner/` has it |
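
The per-page pattern reduces to a small loop: fetch one page, upload it, checkpoint, repeat. The sketch below is not the code in `spiderJobs/runner/loop.py`; `crawl_keyword` and its three callables are hypothetical stand-ins for the searcher, the upload call, and `report_page_progress()`.

```python
from typing import Callable


def crawl_keyword(
    fetch_page: Callable[[int], list],
    upload: Callable[[list], None],
    report_progress: Callable[[int], None],
    start_page: int = 1,
    max_pages: int = 10,
) -> int:
    """Fetch, upload and checkpoint one page at a time; return pages completed."""
    done = 0
    for page in range(start_page, max_pages + 1):
        rows = fetch_page(page)
        if not rows:            # empty page: treat as end of results
            break
        upload(rows)            # this page is stored before the next fetch starts
        report_progress(page)   # backend can now record last_completed_page = page
        done += 1
    return done
```

If `fetch_page` raises on page 3, pages 1 and 2 have already been uploaded and reported. A later call with `start_page=3` picks up from there without re-sending earlier pages.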
---
### 4. Dedup at Ingestion (ClickHouse Query Before Insert)
**Why required:** Without dedup, repeated crawls of the same keyword accumulate duplicate job records, corrupting analytics counts. Engine-level dedup in ClickHouse (for example `ReplacingMergeTree`) is applied only eventually, during background merges, so it is not sufficient for real-time correctness.
**What this means concretely:**
- `batch_dedup_filter()` queries ClickHouse for existing keys before inserting
- Dedup key per platform: Boss = `job_id`; QCWY = `(job_id, update_date_time)`; Zhilian = `(number, first_publish_time)`
- Dedup happens in `IngestService`, not in the crawler — crawler just pushes raw data
- All three platforms must route through the same `IngestService` (currently QCWY has separate logic)
**Complexity:** Low — `IngestService` + `PlatformConfig` registry already works; just need all platforms to use it.
| Sub-feature | Status | Notes |
|-------------|--------|-------|
| `batch_dedup_filter()` | Exists and works | Supports 1 or 2 key columns |
| Boss dedup via registry | Exists | Working |
| QCWY dedup via registry | Partial | Separate code path; needs unification |
| Zhilian dedup via registry | Partial | Same issue |
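
The one-or-two-column key idea can be illustrated without ClickHouse. In the sketch below, `existing_keys` stands in for the result of the lookup query, and `dedup_filter_sketch` is a hypothetical name: the real `batch_dedup_filter()` signature is not shown in the source and may differ. Dropping repeats within the same batch is also an assumption.

```python
from typing import Iterable, Sequence


def dedup_filter_sketch(
    rows: Iterable[dict],
    key_columns: Sequence[str],
    existing_keys: set,
) -> list[dict]:
    """Keep rows whose key is neither already stored nor repeated within the batch."""
    seen = set(existing_keys)  # copy so the caller's set is not mutated
    fresh = []
    for row in rows:
        # Keys are always tuples, so 1-column and 2-column platforms share one code path.
        key = tuple(row[col] for col in key_columns)
        if key in seen:
            continue
        seen.add(key)
        fresh.append(row)
    return fresh
```

For Boss the call would use `key_columns=["job_id"]`. For QCWY it would use `["job_id", "update_date_time"]`, so the same job with a newer update time counts as a new row.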
---
### 5. Retry with Exponential Backoff
**Why required:** Platform APIs are unstable — rate limits, transient 5xx errors, proxy failures. Without retry, a single flaky response aborts an entire crawl session.
**What this means concretely:**
- Retry on: network timeout, `requests.exceptions.RequestException`, HTTP 5xx, platform business errors that are transient (not bans)
- Do NOT retry on: HTTP 403/429 IP ban signals (switch proxy instead), HTTP 401 token expired (refresh token instead)
- Max retries: 3 attempts with exponential backoff (2s, 4s, 8s, plus jitter)
- After max retries: log error, report `status=failed` to backend, continue to next keyword (do not crash)
**Current state:** No structured retry in either `jobs_spider/` or `app/services/crawler/`. Failures cause immediate abort.
**Complexity:** Medium — needs a retry decorator/wrapper in `_http_client.py`.
| Sub-feature | Status | Notes |
|-------------|--------|-------|
| Retry on transient errors | Missing | Critical gap |
| Exponential backoff | Missing | |
| Distinguish retryable vs ban | Missing | IPAnomalyDetector detects bans but retry is not wired |
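
A retry wrapper along these lines would cover the three rows above. It is a sketch, not the planned implementation: `BanSignal` and `TransientError` are hypothetical exception types that the HTTP layer would have to raise, and the `sleep` parameter exists only so tests can run without waiting.

```python
import random
import time


class BanSignal(Exception):
    """Hypothetical marker for 403/429-style responses: switch proxy, do not retry."""


class TransientError(Exception):
    """Hypothetical marker for timeouts and 5xx responses: safe to retry."""


def backoff_delays(attempts: int = 3, base: float = 2.0, jitter: float = 0.5):
    """Yield base * 2**i plus up to `jitter` seconds of random noise (about 2s, 4s, 8s)."""
    for i in range(attempts):
        yield base * (2 ** i) + random.uniform(0, jitter)


def call_with_retry(func, attempts: int = 3, base: float = 2.0, sleep=time.sleep):
    """Call func(); retry on TransientError up to `attempts` times, let BanSignal propagate."""
    delays = backoff_delays(attempts, base)
    while True:
        try:
            return func()
        except TransientError:
            delay = next(delays, None)
            if delay is None:  # retries exhausted: surface the last error
                raise
            sleep(delay)
```

`BanSignal` is deliberately not caught, so the caller can switch proxy immediately instead of waiting out a backoff that cannot help.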
---
### 6. IP/Proxy Management with Ban Detection
**Why required:** Chinese recruitment platforms aggressively detect and block scraper IPs. Without automated ban detection and proxy switching, crawlers die silently and collect no data.
**What this means concretely:**
- `IPAnomalyDetector` checks HTTP 403/429/407, slow response (>5s), business code=35, message containing "IP地址存在异常" ("IP address is abnormal")
- On ban: switch proxy immediately (not after retry), rebuild `requests.Session`, log ban event
- `SmartIPManager` handles: tunnel proxy → proxy pool round-robin → local direct connection with cooldown
- Proxy health state is in-memory per crawler process (no shared state needed — each ECS node manages its own pool)
**Current state:** EXISTS and works in `jobs_spider/boss/boos_api.py`. NOT present in qcwy or zhilian crawlers. Must be shared via base class.
**Complexity:** Low — logic exists, needs to be in shared module not duplicated.
| Sub-feature | Status | Notes |
|-------------|--------|-------|
| `IPAnomalyDetector` | Exists in boss + antispider.py | Duplicated — must consolidate |
| `SmartIPManager` | Exists in boss | Missing from qcwy/zhilian |
| Session rebuild on ban | Exists in boss | Missing from qcwy/zhilian |
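
The four ban signals listed above fit in one pure function, which is what makes them easy to share across platforms. The sketch below is not `IPAnomalyDetector` itself: the function name and the `code` / `message` response keys are assumptions.

```python
BAN_STATUS_CODES = {403, 407, 429}
BAN_BUSINESS_CODE = 35
BAN_MESSAGE_FRAGMENT = "IP地址存在异常"  # "IP address is abnormal"
SLOW_RESPONSE_SECONDS = 5.0


def looks_like_ip_ban(status_code: int, elapsed: float, body) -> bool:
    """Apply the four signals listed above to a single response."""
    if status_code in BAN_STATUS_CODES or elapsed > SLOW_RESPONSE_SECONDS:
        return True
    if isinstance(body, dict):
        # "code" and "message" are assumed field names for the business payload.
        if body.get("code") == BAN_BUSINESS_CODE:
            return True
        if BAN_MESSAGE_FRAGMENT in str(body.get("message", "")):
            return True
    return False
```

A client would call this after every response and, on `True`, switch proxy and rebuild its session before the next request.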
---
### 7. Token Lifecycle Management (Boss直聘)
**Why required:** Boss直聘 uses MPT tokens (WeChat mini-program login tokens) that expire. Without automatic refresh, crawlers stop collecting data silently.
**What this means concretely:**
- `GET /api/v1/token/tokens` returns available active Boss tokens
- On token-expired response (business code indicating auth failure): mark token as inactive, fetch next available token
- If no tokens available: log error, wait, retry — do not crash
- Token management is Boss-specific; QCWY and Zhilian use cookie/session auth
**Current state:** Exists in `jobs_spider/boss/boos_api.py`. Needs to be in the shared Boss client module.
**Complexity:** Low — already implemented, just needs to stay in shared code.
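
The rotation behaviour described above (use a token until it expires, then move to the next, then wait for a refill) can be sketched as a small in-memory pool. `TokenPoolSketch` is a hypothetical class. Fetching tokens from `/api/v1/token/tokens` and marking them inactive on the backend are left to the caller.

```python
from collections import deque


class TokenPoolSketch:
    """Rotates through active tokens; expired tokens are dropped from the pool."""

    def __init__(self, tokens):
        self._active = deque(tokens)

    def current(self):
        """Return the token in use, or None when the pool is empty."""
        return self._active[0] if self._active else None

    def mark_expired(self):
        """Drop the current token and return the next one (or None)."""
        if self._active:
            self._active.popleft()
        return self.current()

    def refill(self, tokens):
        """Add freshly fetched tokens, skipping any already in the pool."""
        for tok in tokens:
            if tok not in self._active:
                self._active.append(tok)
```

When `mark_expired()` returns `None`, the crawl loop would log the condition, sleep, call `refill()` with a fresh token list, and try again instead of exiting.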
---
### 8. Structured Error Logging per Crawl Session
**Why required:** Without structured logs, diagnosing why data volume dropped on a given day is impossible. Operators cannot tell if the issue is IP bans, token expiry, keyword exhaustion, or platform API changes.
**What this means concretely:**
- `loguru` rotating daily log files per platform, 30-day retention (already exists)
- Each log entry must include: platform, keyword (city+job), page, error_type, proxy_used, http_status
- Log at: session start, each page success/failure, proxy switch, token switch, crawl completion
- Errors forwarded to backend `ScheduledTaskRun` table for dashboard visibility
**Current state:** Logging exists but is inconsistent — boss uses loguru, qcwy/zhilian use `print()`. Must standardize.
**Complexity:** Low — configuration change + replacing `print()` with `logger`.
---
## Differentiators
Features that make the system operationally robust and reduce manual intervention. Not strictly required for data collection, but essential for unattended production operation.
### D1. Crawl Health Dashboard (Frontend)
**Value:** Operators can see at a glance: which platforms are crawling, how many jobs were ingested today, which IPs are active/banned, whether company cleaning is keeping up with job volume.
**What this means concretely:**
- Expand existing `monitor.vue` to cover job crawling (not just company cleaning)
- Per-platform: jobs ingested today (from `IpUploadStats`), keywords completed/pending/failed, active crawler IPs
- Alert indicators: zero ingestion in last 30 min, all keywords exhausted, token pool empty
- Auto-refresh every 30s (already exists for company cleaning panel)
- Backend API: `/api/v1/analytics/crawler-health` aggregating from `IpUploadStats` + `keyword` tables
**Complexity:** Medium — requires new backend analytics endpoint + frontend component.
---
### D2. Platform-Specific Anti-Detection Profiles
**Value:** Each platform has different detection mechanisms. A unified anti-detection config per platform reduces manual tuning when bans increase.
**What this means concretely:**
- Per-platform config dataclass: `min_delay`, `max_delay`, `max_pages_per_keyword`, `user_agent_rotation`, `cookie_fields_to_preserve`
- Boss: must preserve `__zp_sseed__`, `__zp_sname__`, `__zp_sts__` cookies (anti-bot challenge)
- QCWY: standard cookie session, no special handling
- Zhilian: custom header signing (already in `_zhilian_sign.py`)
- Config loaded from environment variables, not hardcoded
**Current state:** Delays are configurable via env vars. Cookie handling and platform-specific behavior is hardcoded in each crawler.
**Complexity:** Low-Medium — config dataclass refactor.
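
One possible shape for the per-platform config dataclass is shown below. The field names follow the list above. The environment variable naming scheme (`<PLATFORM>_MIN_DELAY` and so on) and the numeric defaults are placeholders chosen for the example. The Boss cookie names are the ones listed in this section.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class AntiDetectionProfile:
    min_delay: float
    max_delay: float
    max_pages_per_keyword: int
    user_agent_rotation: bool = True
    cookie_fields_to_preserve: tuple = ()

    @classmethod
    def from_env(cls, platform: str, defaults: "AntiDetectionProfile", env=os.environ):
        """Override numeric defaults from <PLATFORM>_MIN_DELAY etc. (hypothetical names)."""
        prefix = platform.upper() + "_"
        profile = cls(
            min_delay=float(env.get(prefix + "MIN_DELAY", defaults.min_delay)),
            max_delay=float(env.get(prefix + "MAX_DELAY", defaults.max_delay)),
            max_pages_per_keyword=int(
                env.get(prefix + "MAX_PAGES", defaults.max_pages_per_keyword)
            ),
            user_agent_rotation=defaults.user_agent_rotation,
            cookie_fields_to_preserve=defaults.cookie_fields_to_preserve,
        )
        if profile.min_delay > profile.max_delay:
            raise ValueError(f"{platform}: min_delay exceeds max_delay")
        return profile


# Numeric values are placeholders; the cookie names come from the text above.
BOSS_DEFAULTS = AntiDetectionProfile(
    min_delay=10.0,
    max_delay=20.0,
    max_pages_per_keyword=10,
    cookie_fields_to_preserve=("__zp_sseed__", "__zp_sname__", "__zp_sts__"),
)
```

Passing a plain dict as `env` keeps the loader testable. In production the default `os.environ` applies, so tuning a delay on an ECS node needs no code change.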
---
### D3. Company Inline Crawl During Job Search
**Value:** Crawling company details in the same session as job search reduces total requests, shares the same authenticated session/proxy, and keeps company data fresh relative to job postings.
**What this means concretely:**
- After each page of job results: extract unique company IDs from jobs, fetch company details for new ones
- Session-level `seen_companies` set prevents re-fetching same company in same run
- Falls back gracefully if company fetch fails — job data is not blocked on company data
- Already implemented in `spiderJobs/runner/loop.py` as `_crawl_companies_from_jobs()` — this is the reference
**Current state:** Implemented in `spiderJobs/runner/`. Missing from `jobs_spider/`.
**Complexity:** Low — pattern exists, needs to be the standard in unified loop.
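
The session-level `seen_companies` idea amounts to a set that survives across pages, plus a fetch loop that tolerates failures. This sketch is not `_crawl_companies_from_jobs()`. The function names, the `company_id` field, and the `on_error` callback are assumptions.

```python
def new_company_ids(jobs, seen: set, id_field: str = "company_id") -> list:
    """Return company IDs on this page not yet seen in the run, preserving order."""
    fresh = []
    for job in jobs:
        cid = job.get(id_field)
        if cid and cid not in seen:
            seen.add(cid)
            fresh.append(cid)
    return fresh


def crawl_companies_for_page(jobs, seen: set, fetch_company, on_error=None) -> list:
    """Fetch details for unseen companies; a failed fetch is skipped, not fatal."""
    details = []
    for cid in new_company_ids(jobs, seen):
        try:
            details.append(fetch_company(cid))
        except Exception as exc:  # job data must not be blocked on company data
            if on_error:
                on_error(cid, exc)
    return details
```

In this version a company whose fetch failed stays in `seen`, so it is not retried within the same run. Whether that is the desired behaviour is a design choice. The scheduled company cleaning pipeline would still pick it up later.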
---
### D4. Keyword Progress Persistence and Resume
**Value:** ECS instances can be terminated mid-crawl (spot instance preemption). Without persisted progress, the entire keyword must restart. With `last_completed_page`, the crawler resumes from where it stopped.
**What this means concretely:**
- Backend stores `last_completed_page` per keyword (already in keyword model)
- Crawler reads `start_page = last_completed_page + 1` from keyword response
- On each successful page: `report_page_progress(keyword_id, page)` updates backend
- On spot instance termination: next invocation continues from last checkpoint
**Current state:** `spiderJobs/runner/loop.py` fully implements this. `jobs_spider/` does not.
**Complexity:** Low — `spiderJobs/runner/` is the reference implementation.
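The resume flow above can be sketched as follows. `report_page_progress(keyword_id, page)` is named in the text; the `fetch_page` and `upload_page` callables and the shape of the keyword dict are assumptions for illustration, not the actual `run_crawl_loop()` signature.

```python
from typing import Callable


def crawl_keyword(
    keyword: dict,
    fetch_page: Callable[[str, int], list],
    upload_page: Callable[[int, int, list], None],
    report_page_progress: Callable[[int, int], None],
    max_pages: int,
) -> int:
    """Crawl one keyword from its checkpoint; return the last completed page."""
    last_done = keyword.get("last_completed_page", 0)
    for page in range(last_done + 1, max_pages + 1):
        jobs = fetch_page(keyword["keyword"], page)
        if not jobs:  # empty page: nothing left for this keyword
            break
        upload_page(keyword["id"], page, jobs)
        # Checkpoint only after the upload succeeded, so a termination between
        # the two calls re-crawls one page instead of skipping it.
        report_page_progress(keyword["id"], page)
        last_done = page
    return last_done
```

Checkpointing after the upload means a preempted instance may upload one page twice, which the ingestion-side dedup is expected to absorb.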
---
### D5. Scheduler Task Run History + Alerting
**Value:** When company cleaning fails silently or a scheduled task stops running (due to a lock deadlock), operators need visibility without watching logs.
**What this means concretely:**
- `ScheduledTaskRun` table already records task runs with success/failure status
- Email alert when: task fails N consecutive times, or hasn't run in 2x expected interval
- `ip_alert_job` (every 10 min) already does IP anomaly alerting via SMTP — extend this pattern
- Frontend: scheduler health section in monitor dashboard
**Current state:** `ScheduledTaskRun` model + `_record_task_run()` exists. Email alerting exists for IP anomalies. Not wired for task-level failures.
**Complexity:** Medium — extend existing alerting pattern.
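One way to express the two alert conditions is a pure function over recent run records. This is a sketch only: the `(finished_at, succeeded)` tuple shape is an assumed projection of `ScheduledTaskRun`, and the default of three consecutive failures is an illustrative value for N.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence, Tuple


def task_alert_reason(
    runs: Sequence[Tuple[datetime, bool]],
    expected_interval: timedelta,
    now: datetime,
    max_consecutive_failures: int = 3,  # assumed value for N
) -> Optional[str]:
    """Return an alert reason, or None. `runs` is ordered newest first."""
    # Condition 1: the task has not run within 2x its expected interval.
    if not runs or now - runs[0][0] > 2 * expected_interval:
        return "stale: no run within 2x the expected interval"
    # Condition 2: the most recent N runs all failed.
    failures = 0
    for _, succeeded in runs:
        if succeeded:
            break
        failures += 1
    if failures >= max_consecutive_failures:
        return f"failing: {failures} consecutive failed runs"
    return None
```

Keeping the decision free of I/O lets the existing SMTP alerting job call it and decide separately how to send the message.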
---
### D6. Unified Crawl Result Reporting API
**Value:** Crawlers that report completion status enable automatic keyword queue management — completed keywords can be reset for the next crawl cycle, failed ones flagged for retry.
**What this means concretely:**
- `POST /api/v1/keyword/crawl-complete` with `{keyword_id, status, pages_completed, jobs_found, error_message}`
- Backend updates keyword state: `completed` → reset to `pending` after 24h; `failed` → increment failure count
- Already partially designed in `spiderJobs/runner/api_client.py`
**Current state:** `report_crawl_complete()` exists in `spiderJobs/runner/api_client.py`. Not wired to keyword state machine in backend.
**Complexity:** Medium — backend endpoint + keyword state machine.
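The keyword state transitions listed above can be sketched as a small in-memory state machine. The `KeywordState` class and function names are hypothetical; only the `completed`, `failed`, and `pending` statuses, the failure counter, and the 24h reset come from the text. Whether a successful crawl should also clear the failure count is left open here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

RESET_AFTER = timedelta(hours=24)


@dataclass
class KeywordState:
    status: str = "pending"
    failure_count: int = 0
    completed_at: Optional[datetime] = None


def apply_crawl_report(state: KeywordState, status: str, now: datetime) -> None:
    """Apply the `status` field of a crawl-complete report."""
    if status == "completed":
        state.status = "completed"
        state.completed_at = now
    elif status == "failed":
        state.status = "failed"
        state.failure_count += 1
    else:
        raise ValueError(f"unknown crawl status: {status!r}")


def reset_if_due(state: KeywordState, now: datetime) -> bool:
    """Move a completed keyword back to pending once 24h have passed."""
    if (
        state.status == "completed"
        and state.completed_at is not None
        and now - state.completed_at >= RESET_AFTER
    ):
        state.status = "pending"
        state.completed_at = None
        return True
    return False
```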
---
## Anti-Features
Things to deliberately NOT build. These appear useful but create more problems than they solve in this context.
### AF1. Shared Crawl State via Redis/Database
**What it looks like:** Storing proxy health, ban status, session cookies in Redis so multiple ECS nodes share state.
**Why avoid:** Adds Redis as a hard dependency for crawlers. ECS spot instances are ephemeral — shared state becomes stale or corrupted when nodes die mid-operation. Each crawler process managing its own in-memory state is simpler, more reliable, and eliminates inter-node coordination. The existing `SmartIPManager` in-memory approach is correct.
**What to do instead:** Each ECS node gets its own proxy pool slice. Node-level state stays in-process. Only keyword assignment (which is already serialized via the backend API) needs cross-node coordination.
---
### AF2. Playwright / Browser Automation as Default
**What it looks like:** Using headless Chrome via Playwright for all platform requests to bypass detection.
**Why avoid:** Playwright adds 3-5x resource overhead per request. Boss直聘's WeChat mini-program API and QCWY/Zhilian's mobile APIs are accessible via direct HTTP with correct headers and TLS fingerprints (`requests_go` + `TLS_CHROME_LATEST`). Browser automation is only justified when the platform serves JS-rendered content with no underlying API, which is not the case for any of the three target platforms. Playwright is already listed as a dependency but should be a last-resort escape hatch, not the default.
**What to do instead:** Keep `requests_go` with `TLS_CHROME_LATEST` and random JA3 as the standard HTTP client. Use Playwright only for specific cookie acquisition flows if needed.
---
### AF3. Real-time Data Processing / Stream Pipeline
**What it looks like:** Adding Kafka/Flink/Redis Streams to process job data as it arrives.
**Why avoid:** The current volume (thousands of jobs per day) does not justify stream infrastructure. ClickHouse handles batch inserts efficiently. The 5-minute company cleaning scheduler + async ingest pipeline covers all real-time requirements. Adding a streaming layer multiplies operational complexity with no data quality benefit at this scale.
**What to do instead:** Keep the batch-insert-with-dedup pattern. Optimize batch sizes if needed (current single-batch-per-page is correct).
---
### AF4. ML-Based Anti-Detection (Behavioral Fingerprinting)
**What it looks like:** Training ML models to randomize mouse movement patterns, request timing distributions, or user session behavior to mimic human browsing.
**Why avoid:** The three target platforms are mobile/mini-program APIs, not web browser interfaces. Behavioral fingerprinting only matters for full browser sessions. The current approach (random delays of 10-20s, TLS fingerprint spoofing, IP rotation) addresses the actual detection vectors. ML-based solutions add training/maintenance cost with no benefit against the actual detection mechanisms used.
**What to do instead:** Tune the existing `IPStrategyConfig` parameters (delay range, failure thresholds) per platform via environment variables.
---
### AF5. Crawler-Side Data Normalization / Schema Transformation
**What it looks like:** Having crawlers parse, clean, and normalize job data fields (salary parsing, city name normalization, skill extraction) before pushing to the backend.
**Why avoid:** Normalization logic changes frequently as platforms update their response schemas. If normalization breaks, raw data is already lost — it was never stored. Storing raw JSON and normalizing downstream (in ClickHouse views or a separate cleaning service) preserves the original data for re-processing. The existing `CleaningService` covers this correctly.
**What to do instead:** Crawlers push raw JSON. Backend's `IngestService` stores raw JSON in `json_data` column. ClickHouse views and `CleaningService` handle field extraction and normalization.
---
### AF6. Separate Crawler Orchestration Database
**What it looks like:** A dedicated database (separate from MySQL/ClickHouse) to manage crawler state, job queues, worker registrations.
**Why avoid:** The existing keyword table in MySQL + `IpUploadStats` + `ScheduledTaskRun` already provide sufficient crawler state management. Adding a third database (e.g., MongoDB for crawler state) fragments operational concerns and adds another system to maintain. The backend API is the correct coordination point for ECS crawler nodes.
**What to do instead:** Extend the keyword model with the fields needed for state tracking (`last_completed_page`, `status`, `failure_count`). The existing pattern is sufficient.
---
## Feature Dependencies
```
Unified Base Classes (1) → IP/Proxy Management (6) [proxy logic in shared _http_client.py]
Unified Base Classes (1) → Retry with Backoff (5) [retry wrapper in shared _http_client.py]
Unified Base Classes (1) → Token Lifecycle (7) [token client in shared boss module]
Keyword Dispatch (2) → Per-Page Upload (3) [keyword_id needed for page progress reporting]
Keyword Dispatch (2) → Keyword Progress Resume (D4) [last_completed_page field]
Per-Page Upload (3) → Dedup at Ingestion (4) [smaller batches = faster dedup queries]
IP/Proxy Management (6) → Crawl Health Dashboard (D1) [IpUploadStats feeds dashboard]
Keyword Dispatch (2) → Unified Crawl Reporting API (D6) [keyword state machine]
```
---
## MVP Recommendation for This Refactor
The refactor's highest-leverage work, in priority order:
**Phase 1 — Foundation (unblock everything else):**
1. Unified base classes — canonicalize `_base.py`, `_http_client.py`, sign modules into a single importable package
2. `SmartIPManager` + `IPAnomalyDetector` moved to shared module, used by all three platforms
3. Standardized logging (`loguru`) in all crawlers — replace `print()` calls
**Phase 2 — Data reliability:**
4. Per-page upload loop (following `spiderJobs/runner/loop.py` pattern) for all three platforms
5. Retry with backoff in `_http_client.py`
6. All platforms route through `IngestService` + `PlatformConfig` registry for dedup
**Phase 3 — Operational visibility:**
7. Crawl completion reporting API + keyword state machine
8. Crawler health dashboard extension (job ingestion metrics, keyword status)
**Defer:**
- Platform anti-detection profiles (D2): low urgency, current env-var approach is adequate
- Scheduler task alerting (D5): existing SMTP infra works; wire task failures as follow-up
- Company inline crawl (D3): correct pattern exists in `spiderJobs/runner/`; include if refactoring that module
---
## Sources
- Codebase analysis: `.planning/codebase/ARCHITECTURE.md` (2026-03-21)
- `app/services/crawler/_base.py` — BaseFetcher/BaseSearcher template method pattern
- `app/services/crawler/_http_client.py` — HTTPClient, proxy priority logic, TLS fingerprint
- `app/core/algorithms/antispider.py` — IPStrategyConfig, IPAnomalyDetector, SmartIPManager
- `app/services/ingest/dedup.py` — batch_dedup_filter, 1/2 column key support
- `app/services/ingest/service.py` — IngestService.store_batch() pipeline
- `app/services/ingest/registry.py` — PlatformConfig registry, DedupFieldSpec
- `spiderJobs/runner/loop.py` — run_crawl_loop(), per-page upload, breakpoint resume, company inline crawl
- `jobs_spider/boss/boos_api.py` (lines 1100) — SmartIPManager usage, sleep_random_between(), token management
- `app/core/ip_tracking.py` — IpTrackingMiddleware, IpUploadStats
- `app/core/locks.py` — DistributedLock (Redis + file fallback)
- `.planning/PROJECT.md` — refactor scope, constraints, key decisions


@ -0,0 +1,252 @@
# Domain Pitfalls: Recruitment Data Crawler Refactoring
**Domain:** Chinese recruitment platform crawler refactoring (Boss直聘 / 前程无忧 / 智联招聘)
**Researched:** 2026-03-21
**Confidence:** HIGH — all findings are grounded in direct codebase evidence from this repository
---
## Critical Pitfalls
Mistakes that cause rewrites, data loss, or crawler shutdown.
---
### Pitfall 1: Breaking the Running Crawler While Refactoring
**What goes wrong:** You refactor `jobs_spider/boss/boos_api.py` or `app/services/crawler/` in-place. The new base class has a subtly different interface (parameter order, return type, exception handling). The old code path that was working for production data collection silently stops collecting, or starts pushing malformed data into ClickHouse.
**Why it happens:** This codebase already has three parallel spider implementations (`jobs_spider/`, `spiderJobs/`, `app/services/crawler/`) that evolved independently. The temptation during refactoring is to "just migrate" the live code directly. But the live scripts (`jobs_spider/boss/boos_api.py`) run as standalone ECS processes with a persistent loop — there is no deployment gate. Any change to the shared module takes effect on the next ECS restart or local re-run, with no rollback.
**Consequences:**
- Zero data ingested for the duration of the broken crawler run (hours or days on ECS)
- No test coverage currently means no automated detection of breakage
- ClickHouse dedup logic will not surface "nothing came in" as an error
**Prevention:**
- Always keep the old code path alive and parallel until the new one has ingested at least one full keyword sweep successfully
- Use the feature-flag pattern: add a `USE_NEW_CRAWLER=0/1` env variable; default to old code; validate new code side-by-side before switching default
- Write a data-volume canary check: if a keyword cycle produces 0 records when it historically produces 50+, alert immediately rather than silently succeeding
**Detection (warning signs):**
- `push_job()` returns success but ClickHouse row count for today is lower than yesterday's baseline
- Crawler log shows "抓取完成" ("crawl complete") with 0 records for keywords that reliably returned results before
- `batch_dedup_filter` ignored count is unexpectedly 100% (all records already existed, possibly because empty batches are silently accepted)
**Phase:** Address in the very first refactoring phase, before touching any shared path. Establish the parallel-run validation pattern before any code changes.
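A minimal sketch of the feature flag and the data-volume canary from the prevention list. `USE_NEW_CRAWLER` and the "0 records against a 50+ baseline" rule come from the text; the function names and the way the baseline is obtained are assumptions.

```python
import os
from typing import Mapping, Optional


def use_new_crawler(env: Optional[Mapping[str, str]] = None) -> bool:
    """Default to the old code path unless USE_NEW_CRAWLER=1 is set."""
    env = os.environ if env is None else env
    return env.get("USE_NEW_CRAWLER", "0") == "1"


def volume_canary(
    keyword: str,
    records_found: int,
    historical_baseline: int,
    min_baseline: int = 50,
) -> Optional[str]:
    """Return an alert message when a reliable keyword yields nothing."""
    if historical_baseline >= min_baseline and records_found == 0:
        return (
            f"canary: keyword {keyword!r} produced 0 records "
            f"(historical baseline {historical_baseline})"
        )
    return None
```

The canary only fires for keywords with enough history, which avoids alerting on keywords that are legitimately sparse.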
---
### Pitfall 2: Copying Sign/Algorithm Code Instead of Sharing It
**What goes wrong:** The sign algorithm (`_compute_checksum`, `_generate_uuid`, `BossSign`) is already duplicated verbatim between `spiderJobs/platforms/boss/sign.py` and `app/services/crawler/_boss_sign.py`. The comment in `_boss_sign.py` says "复制自 spiderJobs/platforms/boss/sign.py" ("copied from spiderJobs/platforms/boss/sign.py"). This copy-paste pattern will be repeated for the new unified base: if the shared core lives in `app/`, the `jobs_spider/` standalone scripts cannot import it without adding the app to `sys.path` at runtime, which is fragile. The solution chosen in the current code, "copy the file with a comment", is not sharing; it is duplication with extra steps.
**Why it happens:** Python's import system makes cross-directory sharing feel difficult when one consumer is an app package and the other is a standalone script. The easy path is copy-paste, which maintains independence but recreates the bug-propagation problem.
**Consequences:**
- A future Boss anti-bot update requires changing the sign algorithm. The fix must be applied in 2+ places. One place is missed. One spider silently generates invalid traceid headers and gets 403s — but since there is no test, this is only discovered when data stops appearing.
- Same problem will recur for HTTP client, IP manager, and data parser
**Prevention:**
- Extract shared crawler core into a proper installable package (`crawler_core/`) or a well-defined subdirectory with a single source of truth
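- The `jobs_spider/` scripts should import from the shared package as the one authoritative source, ideally through an editable install (`pip install -e .`); a `sys.path.insert(0, repo_root)` shim also works but is the fragile runtime path manipulation this pitfall warns about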
- Alternatively: keep `app/services/crawler/` as the canonical location, and have `jobs_spider/` scripts import directly from `app.services.crawler.*` after setting `PYTHONPATH`
- Do NOT use the "复制自 X" ("copied from X") comment pattern: it documents a failure rather than solving it
**Detection:**
- Run `diff spiderJobs/platforms/boss/sign.py app/services/crawler/_boss_sign.py` — if they differ, a divergence has already occurred
- Any comment containing "复制自" or "copy from" is a warning sign of duplicated code
**Phase:** Must be the foundational decision in Phase 1 before any platform-specific work. The sharing mechanism must be established first; all subsequent platform rewrites depend on it.
---
### Pitfall 3: Anti-Bot State Is Instance-Local, Not Process-Global
**What goes wrong:** The refactored `SmartIPManager`, token cache, and session objects are created fresh per-request when embedded in the FastAPI async service (`CleaningService`, `CompanyCleaner`). Each call to `CleaningService` instantiates a new `BossService`, which creates a new `BossSign` and a new `HTTPClient` with a new session. The anti-bot state (IP failure counters, cooldown timers, session cookies) that Boss直聘 relies on to not trigger rate-limiting is discarded after each request.
**Why it happens:** The current `CleaningService.__init__` creates `self.boss_service = BossService()` per instance. `CleaningService` itself is instantiated per API request. This works for the standalone script (one long-lived process, one `SmartIPManager`) but the FastAPI service context has a completely different lifecycle.
**Consequences:**
- The standalone scripts maintain session state correctly (SmartIPManager accumulates failure counts, applies cooldowns)
- The embedded service makes "cold" requests every time, resetting the TLS fingerprint session and appearing as new clients — which is exactly what anti-bot systems profile
- The `__zp_sseed__` / `__zp_sname__` / `__zp_sts__` cookie state managed in `boos_api.py` (anti-bot response at code=37) exists only in the single-process script
**Prevention:**
- The shared core (IP manager, session, token cache) must be process-level singletons when used inside FastAPI, not per-request instances
- Use FastAPI's `lifespan` to initialize a single shared crawler context; inject it via dependency
- For the standalone scripts, the existing per-process instance model is correct and should not change
**Detection:**
- Boss returns `code=35` (IP banned) or `code=37` (anti-bot challenge) on the first request from the FastAPI service context
- High 403/429 rate from embedded cleaning service, while standalone scripts on the same IP work fine
**Phase:** Phase addressing the FastAPI embedded crawler service. Do not assume the standalone script's session management model transfers to async service context.
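A sketch of the process-level singleton using only the standard library. `CrawlerContext` is a hypothetical container for the long-lived state. In the real service the context manager would be passed as `FastAPI(lifespan=lifespan)` and the accessor used through `Depends`.

```python
from contextlib import asynccontextmanager


class CrawlerContext:
    """Hypothetical holder for anti-bot state that must outlive a request."""

    def __init__(self) -> None:
        self.ip_failures = {}  # proxy -> consecutive failure count
        self.cookies = {}      # e.g. __zp_sseed__ / __zp_sname__ / __zp_sts__
        self.closed = False

    def close(self) -> None:
        self.closed = True


@asynccontextmanager
async def lifespan(app):
    # Created once per process at startup, torn down once at shutdown.
    app.state.crawler = CrawlerContext()
    try:
        yield
    finally:
        app.state.crawler.close()


def get_crawler(request) -> CrawlerContext:
    # In FastAPI, annotate `request` as `fastapi.Request` so it is injected
    # rather than parsed as a query parameter.
    return request.app.state.crawler
```

Every request handler then receives the same `CrawlerContext`, so failure counters and cookies accumulate the way they do in the standalone scripts.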
---
### Pitfall 4: ClickHouse Dedup Scans the Full Table
**What goes wrong:** `batch_dedup_filter` in `app/services/ingest/dedup.py` issues `SELECT job_id FROM job_data.boss_job WHERE job_id IN (...)` with no time-bound filter. For `boss_job` after months of operation, this scans every row. The MergeTree `ORDER BY created_at` means `job_id` is not in the primary key, so the scan reads the entire column. At scale (millions of rows), this dedup check becomes the bottleneck and eventually times out, which causes the ingest service to fall back to `check_duplicate=False` and silently insert duplicates.
**Why it happens:** The current schema uses `ORDER BY created_at` for all tables (confirmed in `clickhouse_init.py`). Dedup columns (`job_id`, `number`, etc.) are secondary columns with no index. This was fine when tables were small.
**Consequences:**
- Ingest latency spikes as tables grow, eventually timing out
- Silent duplicate insertion when dedup times out and the batch is inserted anyway
- The `CONCERNS.md` already flags this — but the fix (adding `PREWHERE created_at > now() - INTERVAL N DAY`) cannot be done as a view change; it requires changing the dedup query logic
**Prevention:**
- Add a configurable `days_back` parameter (defaulting to 30) to all dedup queries: `WHERE job_id IN (...) AND created_at >= now() - INTERVAL 30 DAY`
- This matches the existing `company_cleaner.py` `days_back=90` pattern already in the codebase
- During refactoring, add this as a mandatory parameter to `batch_dedup_filter` — not an optional afterthought
- Consider adding a secondary index (`INDEX idx_job_id job_id TYPE bloom_filter GRANULARITY 4`) in the DDL for tables where dedup is critical
**Detection:**
- ClickHouse query log shows `dedup_filter` queries scanning >1M rows with no `created_at` filter
- Ingest endpoint latency increases week-over-week as table grows
- `duplicate` count in ingest result drops to 0 even though the same crawler ran yesterday (dedup failing silently)
**Phase:** Phase 1 (dedup infrastructure). This is a data correctness issue, not a performance optimization. Fix before the refactored crawlers start pushing higher volumes.
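A sketch of a time-bounded dedup query builder with `days_back` as a required argument. It is not the current `batch_dedup_filter` implementation; the `%(name)s` placeholder style assumes a driver that does client-side parameter substitution, and the allowlist contents are illustrative.

```python
from typing import Dict, List, Tuple

# Table -> dedup column. Identifiers cannot be bound as parameters, so they
# are validated against an allowlist instead. Extend per platform.
DEDUP_COLUMNS: Dict[str, str] = {"job_data.boss_job": "job_id"}


def build_dedup_query(table: str, ids: List[str], days_back: int) -> Tuple[str, dict]:
    """Return (sql, params) for a dedup lookup bounded by created_at."""
    if table not in DEDUP_COLUMNS:
        raise ValueError(f"unknown dedup table: {table}")
    if days_back <= 0:
        raise ValueError("days_back must be a positive number of days")
    column = DEDUP_COLUMNS[table]
    sql = (
        f"SELECT {column} FROM {table} "
        f"WHERE {column} IN %(ids)s "
        "AND created_at >= now() - INTERVAL %(days_back)s DAY"
    )
    return sql, {"ids": tuple(ids), "days_back": days_back}
```

The trade-off is explicit: a record older than the window is treated as new if it reappears, so the window should be at least as long as a posting's typical lifetime.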
---
## Moderate Pitfalls
### Pitfall 5: Sync `requests` Blocking the Async Event Loop
**What goes wrong:** The refactored crawler services use `requests` (sync) wrapped in `asyncio.to_thread()`. Under concurrent cleaning requests or batch company processing, the thread pool fills up. The `CleaningService` spawns 50 company cleanings in one scheduler tick, and each calls `to_thread(boss_service.get_company_detail_by_id)`. `asyncio.to_thread()` runs on the event loop's default executor, a `ThreadPoolExecutor` capped at `min(32, os.cpu_count() + 4)` threads unless configured otherwise, so 20+ concurrent cleaning tasks can saturate the pool on a small instance, and every other `to_thread` call queues behind them.
**Why it happens:** The crawler libraries (`requests`, `requests_go`) are synchronous. Wrapping them with `to_thread` is the correct pattern, but the thread pool limit is invisible until load hits it.
**Prevention:**
- Use `asyncio.Semaphore` to cap concurrency of `to_thread` calls within the cleaning scheduler — cap at `(thread_pool_size // 3)` to leave headroom for other async operations
- Alternatively, migrate to `httpx.AsyncClient` for the embedded service context (not the standalone scripts, which can stay synchronous)
- Set `UVICORN_WORKERS` and thread pool size explicitly; document the relationship
**Detection:**
- Event loop lag exceeds 100ms during scheduled cleaning runs
- `asyncio.to_thread` queue depth grows without bound during batch operations
**Phase:** Phase addressing `CompanyCleaner` and `CleaningService` concurrency. Low urgency until company processing batch size increases.
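The semaphore cap from the prevention list can be sketched as a small helper. `run_bounded` is a hypothetical name; `func` stands in for a synchronous call such as `boss_service.get_company_detail_by_id`.

```python
import asyncio
from typing import Any, Callable, Iterable, List


async def run_bounded(
    func: Callable[[Any], Any],
    items: Iterable[Any],
    max_concurrency: int,
) -> List[Any]:
    """Run a sync function over items in threads, at most N at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(item: Any) -> Any:
        # The semaphore is acquired before a worker thread is requested, so
        # waiting tasks park on the event loop, not in the executor queue.
        async with semaphore:
            return await asyncio.to_thread(func, item)

    # return_exceptions=True: one failing item does not cancel the batch.
    return await asyncio.gather(*(one(i) for i in items), return_exceptions=True)
```

Results come back in input order, with exceptions returned in place so the caller can log them per item.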
---
### Pitfall 6: ReplacingMergeTree Dedup Is Eventual, Not Immediate
**What goes wrong:** `pending_company` uses `ENGINE = ReplacingMergeTree(version)`. The assumption is that inserting the same `(source, company_id)` twice will result in one row. This is wrong at query time. ClickHouse only deduplicates during background merges, which happen asynchronously. Between insertion and the next merge, both rows exist and will appear in `SELECT`. The `CompanyCleaner.collect_pending_companies()` query that reads `pending_company` may return duplicate company IDs, causing duplicate processing and potentially duplicate entries in MySQL.
**Why it happens:** ReplacingMergeTree is commonly misunderstood as providing immediate deduplication. It does not. The documentation is clear, but the confusion is pervasive.
**Prevention:**
- All queries against `pending_company` must use `SELECT ... FROM pending_company FINAL` to force merge semantics at query time
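- Or aggregate explicitly, e.g. `GROUP BY source, company_id` with `argMax(<column>, version)` for each selected column (a `WHERE created_at = max(created_at)` filter is not valid SQL, because aggregates cannot appear in `WHERE`). `FINAL` is simpler and correct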
- The insert logic should also guard against inserting the same `(source, company_id)` in the same batch (pre-filter in application code)
**Detection:**
- `collect_pending_companies()` returns the same `company_id` twice in one batch
- MySQL company table gets duplicate entries from two `collect` runs in quick succession
**Phase:** Phase addressing company data pipeline. Must be fixed before increasing `collect_pending_companies` batch size.
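The application-side pre-filter mentioned above could be as small as the sketch below. It assumes each row is a dict carrying `source`, `company_id`, and `version`; the query constant only illustrates the `FINAL` form and is not taken from `collect_pending_companies()`.

```python
from typing import Dict, List, Tuple

# Read side: FINAL applies merge semantics at query time.
PENDING_QUERY = "SELECT source, company_id FROM pending_company FINAL"


def dedupe_batch(rows: List[dict]) -> List[dict]:
    """Keep one row per (source, company_id), preferring the highest version."""
    latest: Dict[Tuple[str, str], dict] = {}
    for row in rows:
        key = (row["source"], row["company_id"])
        if key not in latest or row["version"] > latest[key]["version"]:
            latest[key] = row
    return list(latest.values())
```

Running this before each insert removes same-batch duplicates, while `FINAL` covers duplicates that arrive in separate inserts.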
---
### Pitfall 7: The Three-Framework Problem Causes Refactoring Scope Creep
**What goes wrong:** The project currently has `jobs_spider/`, `spiderJobs/`, and `app/services/crawler/` — three partially overlapping crawler implementations. During refactoring, there is a strong temptation to "also fix" the `spiderJobs/` framework, or to make the new base class compatible with all three. This expands the scope from "refactor three platforms" to "unify three frameworks," which is a completely different project.
**Why it happens:** `spiderJobs/` contains a cleaner architecture (`core/`, `platforms/`, `runner/`) that looks like the intended target. It is tempting to treat it as the destination rather than additional tech debt.
**Prevention:**
- Make a single explicit decision at the start: `spiderJobs/` is deleted or kept. Do not leave it in a half-migrated state.
- If deleted: do it in the first commit of the refactoring phase. An empty `spiderJobs/` with a `DEPRECATED.md` is fine. A half-migrated `spiderJobs/` that is 60% replaced is not.
- The refactoring target is the new `_base.py`, `_boss_api.py`, `_boss_client.py`, `_boss_sign.py` pattern already started in `app/services/crawler/` — finish that pattern, don't restart with `spiderJobs/`
**Detection:**
- A PR diff that touches `spiderJobs/` files during "platform rewrite" work is scope creep
- Story points/time estimate doubles from what was planned
**Phase:** Decision must be made and documented in Phase 1 before any platform work begins.
---
### Pitfall 8: Testing Crawlers by Hitting Real APIs
**What goes wrong:** Unit tests for crawler code make real HTTP requests to Boss直聘, 前程无忧, or 智联招聘. This causes: (a) tests fail in CI with no network access; (b) real request volume from CI triggers rate limiting or bans the test account's IP; (c) test results are non-deterministic (platform API changes break tests unexpectedly); (d) tests cannot run at all without valid tokens.
**Why it happens:** There are currently zero crawler tests. When tests are first written, the path of least resistance is to write integration tests against real endpoints. This "just works" locally but creates all of the above problems.
**Prevention:**
- All unit tests for crawler logic must use `unittest.mock.patch` or `pytest-mock` to mock the HTTP layer
- Test the layers independently: sign algorithm (pure function, no HTTP needed), response parser (provide fixture JSON, no HTTP needed), IP manager state machine (mock requests.Session)
- Maintain a `tests/fixtures/` directory with saved real API response JSON (anonymized if needed) — test parsers against these fixtures, not live responses
- A single integration test suite (marked `@pytest.mark.integration`) can hit real APIs, but it should be opt-in and never run in CI
**Detection:**
- Test suite requires `API_BASE_URL`, `BOSS_TOKEN`, or network access to pass
- Tests fail with `ConnectionRefusedError` or `TimeoutError` in CI
**Phase:** Phase establishing the test infrastructure. Write mocked tests first — before implementing the refactored code — following TDD discipline.
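A sketch of a mocked crawler test. `parse_job_list`, `search_jobs`, the `/search` path, and the response shape are all hypothetical; the only thing borrowed from the codebase description is that the HTTP client's `get` returns a `(status, body)` tuple.

```python
from unittest.mock import Mock


def parse_job_list(payload: dict) -> list:
    """Hypothetical parser: pull job ids out of a saved response body."""
    return [item["job_id"] for item in payload.get("data", {}).get("jobs", [])]


def search_jobs(http_client, keyword: str, page: int) -> list:
    """Hypothetical searcher built on an HTTPClient-style get()."""
    status, body = http_client.get("/search", params={"keyword": keyword, "page": page})
    if status != 200:
        return []
    return parse_job_list(body)


def test_search_jobs_uses_mocked_http():
    client = Mock()
    # In a real suite this body would be loaded from tests/fixtures/*.json.
    client.get.return_value = (200, {"data": {"jobs": [{"job_id": "a1"}, {"job_id": "b2"}]}})
    assert search_jobs(client, "python", 1) == ["a1", "b2"]
    client.get.assert_called_once_with("/search", params={"keyword": "python", "page": 1})


def test_search_jobs_returns_empty_on_ban():
    client = Mock()
    client.get.return_value = (403, {})
    assert search_jobs(client, "python", 1) == []
```

Both tests run without network access or tokens, and the parser can be tested separately against fixture files.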
---
## Minor Pitfalls
### Pitfall 9: Log File Creation in CWD Breaks When CWD Changes
**What goes wrong:** Both `jobs_spider/boss/boos_api.py` and `jobs_spider/qcwy/qcwy.py` contain `os.makedirs("logs", exist_ok=True)` and `logger.add("logs/log_{time:YYYY-MM-DD}.log", ...)`. These create a `logs/` directory relative to the current working directory when the script is launched. If ECS scripts are run from a different directory (e.g., `/root` instead of `/app`), logs appear in an unexpected location, making debugging harder.
**Prevention:** Use `pathlib.Path(__file__).parent / "logs"` for script-relative log paths.
**Phase:** Minor cleanup, address during standalone script refactoring.
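A short sketch of the script-relative path. The helper name `log_dir_for` is hypothetical.

```python
from pathlib import Path


def log_dir_for(script_file: str) -> Path:
    """Return (and create) a logs/ directory next to the given script."""
    log_dir = Path(script_file).resolve().parent / "logs"
    log_dir.mkdir(parents=True, exist_ok=True)
    return log_dir
```

A script would call `log_dir_for(__file__)` once at startup and pass `log_dir / "log_{time:YYYY-MM-DD}.log"` to `logger.add(...)`, so the location no longer depends on the launch directory.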
---
### Pitfall 10: Hardcoded Proxy URL Leaks Credentials
**What goes wrong:** `jobs_spider/qcwy/qcwy.py` line 31 assigns `PROXY_URL` a literal proxy URL with the username and password embedded (`http://<user>:<password>@<host>:<port>`, redacted here): a real proxy credential hardcoded in source and committed to the repo.
**Prevention:** Remove immediately. Require `PROXY_URL` via environment variable with no default. Rotate the credential.
**Phase:** Pre-refactoring security fix. Do not include in any commit that goes to a shared remote.
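A minimal sketch of requiring the proxy URL from the environment with no default. The function name is hypothetical.

```python
import os
from typing import Mapping, Optional


def require_proxy_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Read PROXY_URL from the environment; fail fast when it is missing."""
    env = os.environ if env is None else env
    url = env.get("PROXY_URL", "").strip()
    if not url:
        raise RuntimeError("PROXY_URL is not set; refusing to start without it")
    return url
```

Failing at startup makes a missing variable obvious immediately, instead of letting the crawler run unproxied.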
---
### Pitfall 11: ClickHouse Schema Cannot Be Evolved Without Manual DDL
**What goes wrong:** All six data tables use `CREATE TABLE IF NOT EXISTS`. Adding a new column to the DDL (e.g., adding `province` to `boss_job`) does nothing on a production database where the table already exists. The refactored crawler may produce records with a new field that the ClickHouse schema does not have, causing insert failures or silent field drops.
**Prevention:** For each schema change, write an explicit `ALTER TABLE ... ADD COLUMN IF NOT EXISTS` statement in `clickhouse_init.py`'s upgrade path. Run this on startup using a version-tracking approach. Do not rely on `CREATE TABLE IF NOT EXISTS` for evolution.
**Phase:** Any phase that changes what fields the crawler stores. Medium urgency — only becomes critical when schema changes are needed.
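A sketch of a version-tracked upgrade path. The `province` column on `boss_job` is the example from the text; its `String` type, the migration numbering, and how the current version is stored are assumptions. `execute` stands in for whatever callable runs a statement against ClickHouse.

```python
from typing import Callable, List, Sequence, Tuple

# (version, statement). Statements are idempotent, so re-running is safe.
MIGRATIONS: List[Tuple[int, str]] = [
    (1, "ALTER TABLE job_data.boss_job ADD COLUMN IF NOT EXISTS province String DEFAULT ''"),
]


def pending_migrations(
    current_version: int,
    migrations: Sequence[Tuple[int, str]] = MIGRATIONS,
) -> List[Tuple[int, str]]:
    """Return migrations newer than current_version, in version order."""
    return [(v, sql) for v, sql in sorted(migrations) if v > current_version]


def apply_migrations(
    execute: Callable[[str], object],
    current_version: int,
    migrations: Sequence[Tuple[int, str]] = MIGRATIONS,
) -> int:
    """Run pending statements in order; return the new schema version."""
    for version, sql in pending_migrations(current_version, migrations):
        execute(sql)
        current_version = version
    return current_version
```

The caller persists the returned version after a successful run, so startup only executes statements that have not been applied yet.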
---
## Phase-Specific Warnings
| Phase Topic | Likely Pitfall | Mitigation |
|-------------|----------------|------------|
| Establish shared crawler core | Pitfall 2 (copy-paste instead of share) | Decide the import mechanism first; prove it works with one sign file before touching anything else |
| Rewrite Boss直聘 client | Pitfall 1 (break running crawler) | Keep old `boos_api.py` running; use env flag to switch; validate side-by-side |
| Rewrite Boss直聘 client | Pitfall 3 (anti-bot state lost) | Test the new client as a standalone script with a real keyword before embedding in FastAPI |
| ClickHouse dedup refactoring | Pitfall 4 (full table scan) | Add `days_back` parameter as non-optional; add integration test with time bounds |
| Company pipeline optimization | Pitfall 6 (ReplacingMergeTree eventual) | Add `FINAL` to all `pending_company` SELECT queries before increasing batch size |
| Write tests | Pitfall 8 (real API in tests) | Mock at the `HTTPClient.get/post` level, not at the platform API level |
| Delete duplicate framework | Pitfall 7 (scope creep) | Delete `spiderJobs/` explicitly in first PR; do not migrate it |
---
## Sources
All findings are grounded in direct codebase inspection of this repository:
- `.planning/codebase/CONCERNS.md`: security and tech debt audit
- `app/services/ingest/dedup.py`: dedup implementation
- `app/services/ingest/service.py`: ingest service
- `app/core/clickhouse_init.py`: schema DDL (MergeTree ORDER BY created_at, not dedup columns)
- `app/services/crawler/_boss_sign.py` vs `spiderJobs/platforms/boss/sign.py`: confirmed verbatim duplication
- `app/services/crawler/_base.py`: existing base class
- `app/services/crawler/_http_client.py`: HTTP layer
- `app/services/cleaning.py`: async service wrapping sync crawlers
- `jobs_spider/qcwy/qcwy.py`: hardcoded proxy credential, sleep defaults differ from Boss
- `jobs_spider/boss/boos_api.py`: 2281-line god file, SmartIPManager
- `.planning/PROJECT.md`: refactoring constraints and goals

.planning/research/STACK.md Normal file

@ -0,0 +1,282 @@
# Technology Stack: Recruitment Data Crawler Refactoring
**Project:** JobData crawler interaction layer refactoring
**Researched:** 2026-03-21
**Confidence:** HIGH (existing code examined directly; external libraries verified against PyPI and official docs)
---
## Summary Recommendation
Retain `requests_go` as the HTTP transport for crawlers: it already works, is actively maintained (latest release Dec 2025), and its TLS fingerprint impersonation is critical for Boss. Adopt `httpx.AsyncClient` inside the FastAPI process only, where the event loop already runs. Treat `curl_cffi` as the long-term strategic replacement if `requests_go` shows maintenance gaps; it is the better-documented, more widely adopted alternative with the same fingerprint capabilities. Use `tenacity` for retry logic, `pytest` + `respx` + `pytest-anyio` for testing, and a hybrid deduplication strategy: an application-side pre-check with a 30-day recency window, with ClickHouse ReplacingMergeTree engine-level eventual dedup as a backstop.
---
## Recommended Stack
### HTTP Client Layer
#### For External Crawler Scripts (`jobs_spider/`, `spiderJobs/`)
| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `requests_go` | 1.0.9 (Dec 2025) | Primary HTTP client with Chrome TLS impersonation | Already in use in `spiderJobs/core/http_client.py` and `app/services/crawler/_http_client.py`. Latest release Dec 2025 confirms active maintenance. Provides `TLS_CHROME_LATEST` with `random_ja3=True`, which is the correct countermeasure for Chrome 110+ JA3 randomization. API is identical to `requests`, so the existing `HTTPClient` wrapper works without changes. |
| `curl_cffi` | 0.7.x+ (Dec 2025) | Strategic future replacement | `curl_cffi` has broader community adoption, better documentation, and equivalent or superior fingerprint support (including HTTP/2 frame fingerprinting). If `requests_go` becomes unmaintained, migrate to `curl_cffi`. Both share a `requests`-compatible API. Do NOT switch mid-refactor; validate on a single platform first. |
**Do NOT use** plain `requests` for crawler HTTP — its default TLS stack (Python `ssl` module + OpenSSL) produces a fingerprint that platforms trivially detect. The existing `requests` imports in `jobs_spider/boss/boos_api.py` and `jobs_spider/qcwy/qcwy.py` are the root cause of the detection exposure in those legacy scripts.
**Do NOT use** `aiohttp` for crawlers — it offers no TLS fingerprint control and its API is more complex than `requests`/`httpx` for this use case.
#### For Backend Service Calls Inside FastAPI (`app/services/crawler/`)
| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `httpx` | 0.28.1 (already in Pipfile) | Async HTTP from within FastAPI event loop | `app/services/crawler/` runs inside the FastAPI async event loop. `requests_go` is synchronous; calling it from async code via `asyncio.to_thread()` is identified in CONCERNS.md as a thread-pool exhaustion risk under high concurrency. `httpx.AsyncClient` eliminates this. The existing `HTTPClient` wrapper interface (`get`, `post` returning `tuple[int, Any]`) can be preserved with an async version. |
The two contexts (external scripts vs. embedded in FastAPI) have different constraints and should use different transports. The shared `BaseFetcher`/`BaseSearcher` base classes can define the interface; each deployment context plugs in the appropriate client.
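As a rough sketch of how one interface can serve both contexts (not the project's actual code), the fetcher below depends only on a small transport protocol. `AsyncTransport`, `FakeAsyncTransport`, `AsyncFetcher`, and the `/job/detail` path are hypothetical names chosen for illustration; only the `get` returning `tuple[int, Any]` shape comes from the existing `HTTPClient` wrapper.

```python
import asyncio
from typing import Any, Optional, Protocol


class AsyncTransport(Protocol):
    """Hypothetical async counterpart of the HTTPClient interface."""

    async def get(self, url: str, params: Optional[dict] = None) -> tuple:
        ...


class FakeAsyncTransport:
    """Stand-in for an httpx.AsyncClient-backed client; returns canned data."""

    async def get(self, url: str, params: Optional[dict] = None) -> tuple:
        return 200, {"url": url, "params": params or {}}


class AsyncFetcher:
    """Depends only on the protocol, so any conforming transport plugs in."""

    def __init__(self, transport: AsyncTransport) -> None:
        self._transport = transport

    async def fetch(self, job_id: str) -> Any:
        status, body = await self._transport.get("/job/detail", {"id": job_id})
        if status != 200:
            raise RuntimeError(f"unexpected status {status}")
        return body


# Run the fetcher against the fake transport; no network is involved.
result = asyncio.run(AsyncFetcher(FakeAsyncTransport()).fetch("abc123"))
```

A synchronous fetcher for the external scripts would mirror this with plain `def` methods, which is why keeping the `(status, body)` return shape identical on both sides matters.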
#### TLS Fingerprint Strategy
The `HTTPClient` class in `spiderJobs/core/http_client.py` (and its copy in `app/services/crawler/_http_client.py`) correctly:
- Sets `TLS_CHROME_LATEST` as the TLS config
- Enables `random_ja3=True` to rotate JA3 hashes per Chrome 110+ behavior
- Recreates sessions for tunnel proxies to force new TCP connections
Keep this pattern. Do not remove random JA3: platforms apply JA3 allowlist/denylist heuristics, and randomization within the Chrome profile mimics real Chrome behavior.
---
### Retry Logic
| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `tenacity` | already in Pipfile (Spider-specific) | Retry with exponential backoff + jitter | Already listed as a spider dependency. Supports async/sync equally. Provides `stop_after_attempt`, `wait_exponential`, `wait_random_exponential`, `retry_if_exception_type`. Replace all bare `except: pass` retry patterns in `jobs_spider/` with structured tenacity decorators. |
**Tenacity configuration pattern for crawlers:**
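```python
import httpx
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_fixed, wait_random_exponential

@retry(
    stop=stop_after_attempt(3),
    # 10s fixed floor plus exponential jitter capped at 20s: each wait falls in 10-30s
    wait=wait_fixed(10) + wait_random_exponential(multiplier=1, max=20),
    # httpx exceptions shown; substitute the transport's own timeout/connection errors
    retry=retry_if_exception_type((httpx.TimeoutException, httpx.NetworkError)),
    reraise=True,
)
def fetch_with_retry(client: httpx.Client, url: str) -> httpx.Response:
    """Issue one GET; tenacity re-runs it on timeout or network errors."""
    return client.get(url)
```
The fixed 10-second floor maps to the existing `SLEEP_MIN_SECONDS` constraint. Passing `min=10` to `wait_random_exponential` alone would not add jitter during the first few attempts, because the exponential ceiling stays below 10 seconds until the fifth attempt. Note that tenacity only waits between failed attempts, so it replaces the manual retry sleeps but not the pacing between successful requests: the crawl loop still needs the `sleep_random_between()` delay to preserve the anti-detection delay requirement.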
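Pacing between successful requests is a separate concern from retry backoff. A minimal sketch of such a pacing helper follows, assuming a 10 to 20 second window; `polite_sleep` and `SLEEP_MAX_SECONDS` are hypothetical and do not mirror the signature of the existing `sleep_random_between()`.

```python
import random
import time
from typing import Callable

SLEEP_MIN_SECONDS = 10.0  # floor named in the text
SLEEP_MAX_SECONDS = 20.0  # assumed upper bound for this sketch


def polite_sleep(
    min_s: float = SLEEP_MIN_SECONDS,
    max_s: float = SLEEP_MAX_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Sleep for a random duration in [min_s, max_s] and return it."""
    if min_s > max_s:
        raise ValueError("min_s must not exceed max_s")
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay


# Inject a recorder instead of time.sleep so the example runs instantly.
recorded = []
delay = polite_sleep(sleep=recorded.append)
```

Injecting the `sleep` callable keeps the helper testable without waiting in real time.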
---
### Base Class Architecture
**Recommendation: Keep the existing two-layer pattern from `spiderJobs/` but make it the single source of truth.**
The `spiderJobs/` package has the best-designed structure:
- `core/base.py`: `ApiResult`, `BaseFetcher`, `BaseSearcher` (platform-agnostic)
- `core/http_client.py`: `HTTPClient` (transport-agnostic)
- `platforms/{boss,job51,zhilian}/{sign,client,api}.py`: per-platform implementations
The `app/services/crawler/` directory is a copy of this (confirmed by the "复制自" ("copied from") comments). The duplication is the problem, not the pattern.
**Resolution:** Make `spiderJobs/` the authoritative shared package. Install it as an editable package (`pip install -e ./spiderJobs`) in both the app environment and the standalone spider environment. Remove the copies in `app/services/crawler/`. This gives one codebase, one set of tests, and eliminates the current copy-paste maintenance burden.
If editable install is not viable for the Docker deployment, use a `uv` workspace or `pyproject.toml` at the repo root referencing `spiderJobs` as a local path dependency.
```toml
# pyproject.toml (repo root, new file)
[tool.uv.workspace]
members = ["spiderJobs", "app"]
```
The base class pattern itself (abstract `_build_params`, concrete `fetch`/`search`) is correct and should be preserved as-is.
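For readers unfamiliar with the pattern, here is a simplified sketch of the abstract `_build_params` plus concrete `fetch` split. The `ApiResult` fields, the `Transport` callable, `DemoFetcher`, and `fake_transport` are assumptions made for illustration and are not copied from `core/base.py`.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical transport signature: (url, params) -> (status_code, body)
Transport = Callable[[str, dict], tuple]


@dataclass
class ApiResult:
    """Simplified stand-in for the real ApiResult; fields are assumptions."""
    ok: bool
    data: Any = None
    error: str = ""


class BaseFetcher(ABC):
    url = ""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport

    @abstractmethod
    def _build_params(self, item_id: str) -> dict:
        """Platform-specific request parameters."""

    def fetch(self, item_id: str) -> ApiResult:
        # Shared flow: build params, call the transport, wrap the outcome.
        status, body = self._transport(self.url, self._build_params(item_id))
        if status != 200:
            return ApiResult(ok=False, error=f"HTTP {status}")
        return ApiResult(ok=True, data=body)


class DemoFetcher(BaseFetcher):
    url = "/demo/detail"

    def _build_params(self, item_id: str) -> dict:
        return {"id": item_id}


def fake_transport(url: str, params: dict) -> tuple:
    return 200, {"url": url, **params}


result = DemoFetcher(fake_transport).fetch("42")
```

Each platform overrides only `_build_params` (and, where needed, parsing), so error handling and result wrapping stay in one place.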
---
### Anti-Detection Libraries and Approaches
| Concern | Approach | Library |
|---------|----------|---------|
| TLS fingerprint | Chrome JA3 impersonation with randomization | `requests_go` (`TLS_CHROME_LATEST`, `random_ja3=True`) |
| HTTP/2 fingerprint | Handled by `requests_go`/`curl_cffi` | Same library |
| IP blocking | Proxy pool rotation + tunnel proxy | Custom `HTTPClient` (already implemented) |
| Session fingerprint | Recreate session on IP switch | `_new_session()` pattern (already in `HTTPClient`) |
| Timing detection | Minimum 10s random delay between requests | `tenacity` `wait_random_exponential(min=10, max=30)` |
| Token expiry | Mark token invalid, fetch next from backend | Existing `SmartIPManager` pattern — keep |
| Anti-bot challenge (code=35/37) | Session rebuild + cookie refresh | Keep existing logic from `BossZhipinAPI` |
**Do NOT add** Playwright to the base crawler loop. Playwright adds 200-500ms startup overhead per page and is only justified for JavaScript-rendered anti-bot challenges. Reserve it for edge-case recovery only, not the main fetch path.
---
### Data Deduplication at Insert Time
**Strategy: Application-side pre-check with a 30-day recency window + engine-level backstop.**
The current `batch_dedup_filter` in `app/services/ingest/dedup.py` does a `SELECT key_col FROM table WHERE key_col IN (...)` with no time bound. CONCERNS.md identifies this as a full-table scan on every ingest batch — a real performance problem as data grows.
**Recommended approach:**
```python
# Add PREWHERE recency filter to all dedup queries
query = (
f"SELECT {key_col} FROM {table} "
f"PREWHERE created_at >= now() - INTERVAL 30 DAY " # <-- add this
f"WHERE {key_col} IN {{keys:Array(String)}}"
)
```
Use `PREWHERE` (not `WHERE`) — ClickHouse evaluates `PREWHERE` before reading full column data, making it significantly cheaper on MergeTree tables. A 30-day window matches the assumption in `company_cleaner.py`'s `days_back` logic and covers realistic re-crawl cycles.
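The pre-check can be sketched as two small pure functions: one builds the time-bounded query, the other filters the batch against the keys that came back. This is an illustration under stated assumptions, not the contents of `dedup.py`; `build_dedup_query`, `filter_new_rows`, and the `jobs` table name are hypothetical, and identifier validation is added here because table and column names cannot be bound as query parameters.

```python
import re

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_dedup_query(table: str, key_col: str, days: int = 30) -> str:
    """Build a time-bounded existence query for a batch of keys.

    The {keys:Array(String)} placeholder is left for the ClickHouse client
    to bind; only validated identifiers are interpolated into the SQL.
    """
    for name in (table, key_col):
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    if days <= 0:
        raise ValueError("days must be positive")
    return (
        f"SELECT {key_col} FROM {table} "
        f"PREWHERE created_at >= now() - INTERVAL {int(days)} DAY "
        f"WHERE {key_col} IN {{keys:Array(String)}}"
    )


def filter_new_rows(rows: list, key_col: str, existing: set) -> list:
    """Drop rows whose key already exists, plus duplicates within the batch."""
    seen = set(existing)
    fresh = []
    for row in rows:
        key = row[key_col]
        if key not in seen:
            seen.add(key)
            fresh.append(row)
    return fresh


query = build_dedup_query("jobs", "job_id")
rows = [{"job_id": "a"}, {"job_id": "b"}, {"job_id": "b"}, {"job_id": "c"}]
fresh = filter_new_rows(rows, "job_id", existing={"a"})
```

Filtering duplicates inside the batch as well as against the table covers the case where the same job appears twice in one crawl page.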
**For ClickHouse table engine:** The existing `MergeTree` tables for job data use a composite order key (`job_id`, `created_at` or similar). This is correct. Plain `MergeTree` never deduplicates on its own, and the `FINAL` modifier only collapses rows on engines such as `ReplacingMergeTree`, so the backstop for any duplicates that slip through is a query-time `GROUP BY` (or `LIMIT 1 BY job_id`), which the `ORDER BY` key keeps cheap. Do NOT switch to `ReplacingMergeTree` for job tables during this refactor: it adds background merge overhead and provides only eventual (not immediate) deduplication, while the application-side pre-check gives immediate guarantees at lower cost.
The existing `pending_company` table already uses `ReplacingMergeTree(updated_at)` with `(source, company_id)` dedup — this is correct and should stay.
**Summary:** No library change needed for deduplication. Fix the missing `PREWHERE` time-bound filter in `dedup.py`. Use `PREWHERE` over `WHERE` for all time-range filters in ClickHouse queries.
---
### Testing Stack
| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `pytest` | latest stable (8.x) | Test runner | Standard; already used in `tests/` (unittest style, but pytest-compatible) |
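| `pytest-anyio` | 0.0.0+ (or `anyio[trio]`) | Async test support | Project uses `anyio==4.8.0` already in Pipfile. The `@pytest.mark.anyio` marker is provided by the pytest plugin bundled with `anyio` itself; the `pytest-anyio` distribution on PyPI appears to be only a placeholder that pulls in `anyio`, so installing it is optional. Use `@pytest.mark.anyio` on async tests. |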
| `respx` | 0.21.x | Mock `httpx` requests | Intercepts `httpx.AsyncClient` calls without hitting the network. Use in all tests that exercise `app/services/crawler/` (the async versions). Pattern-matches on URL, method, headers. |
| `unittest.mock` | stdlib | Mock `requests_go` calls in external scripts | For tests of `jobs_spider/` and `spiderJobs/` which use `requests_go` (synchronous), standard `unittest.mock.patch` on `requests_go.Session.get/post` is sufficient. No extra library needed. |
| `pytest-cov` | 4.x | Coverage measurement | Enforce 80% coverage minimum on `spiderJobs/core/`, `spiderJobs/platforms/*/sign.py`, and all pure-function parsers. |
**Testing patterns for crawlers:**
1. **Sign algorithm tests** — Pure functions, no HTTP, no mocks needed. `BossSign.generate_traceid()`, `ZhilianSign.sign_headers()`, `_compute_checksum()` are all deterministic enough to test with snapshot assertions. Test these first — they are the highest-value, lowest-cost tests.
2. **API parser tests** — Feed recorded real response JSON fixtures into `BaseFetcher._parse()` / platform-specific `_parse()` overrides. Verify `ApiResult` fields. Use `pytest.fixture` to load JSON from `tests/fixtures/`. These tests do not need HTTP at all.
3. **HTTPClient / BaseFetcher integration tests** — Use `unittest.mock.patch` on `requests_go.Session` to inject fixed `(status_code, json)` return values. Verify the client calls the right URL, sends correct headers, and handles errors.
4. **Ingest/dedup tests** — The `batch_dedup_filter` function takes an `AsyncClient` — use `respx` to mock ClickHouse query responses. Test: (a) single-column dedup filters correctly, (b) two-column dedup, (c) time-bound filter is present in generated query.
5. **Integration/smoke tests** — One test per platform that replays a recorded full HTTP exchange (using `respx` route replay or `responses` library) and verifies the full pipeline from `BaseFetcher.fetch()` through `build_insert_row()`. Not against a real server.
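Pattern 2 above (parser tests against recorded fixtures) might look like the sketch below. The response shape, `ParsedJob`, and `parse_job_list` are hypothetical; the real parsers live in the platform-specific `_parse()` overrides, and the fixture would normally be loaded from a file under `tests/fixtures/` rather than inlined.

```python
import json
from dataclasses import dataclass


@dataclass
class ParsedJob:
    job_id: str
    title: str


def parse_job_list(payload: dict) -> list:
    """Hypothetical parser: pulls jobs out of a recorded response body."""
    if payload.get("code") != 0:
        return []  # non-zero codes (e.g. an anti-bot challenge) yield no rows
    return [
        ParsedJob(job_id=str(item["id"]), title=item["title"])
        for item in payload.get("data", {}).get("list", [])
    ]


# Inlined here for brevity; a real suite would read tests/fixtures/*.json.
FIXTURE = json.loads(
    '{"code": 0, "data": {"list": [{"id": 1, "title": "Backend Engineer"}]}}'
)


def test_parse_job_list_success():
    assert parse_job_list(FIXTURE) == [ParsedJob(job_id="1", title="Backend Engineer")]


def test_parse_job_list_error_code():
    assert parse_job_list({"code": 37}) == []
```

Because nothing here touches HTTP, these tests run in milliseconds and stay stable when the transport layer changes.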
**Do NOT use** `pytest-asyncio` alongside `pytest-anyio` — the two event loop managers conflict. The project already has `anyio` as a hard dependency (FastAPI uses it); `pytest-anyio` is the natural extension.
**Existing test pattern:** Current tests in `tests/` use `unittest.TestCase`. These should be migrated to plain `pytest` style (no class inheritance) for simplicity, but this is not blocking.
---
### Code Organization for Shared Core
**Problem:** Three parallel implementations of the same thing exist:
- `jobs_spider/boss/boos_api.py` — 2281-line monolith (old)
- `spiderJobs/platforms/boss/` — split correctly into sign/client/api (new)
- `app/services/crawler/` — copy of `spiderJobs/` with import paths changed
**Recommended layout after refactor:**
```
spiderJobs/ # Canonical shared package — install as editable dep
core/
base.py # ApiResult, BaseFetcher, BaseSearcher
http_client.py # HTTPClient (requests_go transport)
http_client_async.py # AsyncHTTPClient (httpx.AsyncClient transport) <-- NEW
platforms/
boss/
sign.py # BossSign (pure, no HTTP)
client.py # BossClient(HTTPClient)
client_async.py # BossAsyncClient(AsyncHTTPClient) <-- NEW for app embedding
api.py # Boss search/fetch using BossClient
job51/
sign.py
client.py
api.py
zhilian/
sign.py
client.py
api.py
runner/
loop.py # Main crawl loop (keyword fetch → crawl → push)
api_client.py # Push to backend API
jobs_spider/ # Entry-point scripts only, import from spiderJobs
boss/main.py
qcwy/main.py
zhilian/main.py
app/services/crawler/ # App-embedded adapters, import from spiderJobs
boss.py # BossService: thin wrapper using BossAsyncClient
qcwy.py
zhilian.py
```
The `*_async.py` variant of each client wraps `httpx.AsyncClient` with the same interface. Sign classes are transport-agnostic and shared.
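To illustrate why a sign class can be shared across both client variants, here is a sketch of a signer that is pure computation with no I/O. The HMAC scheme, header names, and `DemoSign` are invented for the example and are not any platform's real algorithm.

```python
import hashlib
import hmac
from urllib.parse import urlencode


class DemoSign:
    """Hypothetical signer: pure functions only, no HTTP and no I/O."""

    def __init__(self, secret: bytes) -> None:
        self._secret = secret

    def sign_headers(self, path: str, params: dict, timestamp: int) -> dict:
        # Canonical form: path + sorted query string + timestamp.
        canonical = f"{path}?{urlencode(sorted(params.items()))}&ts={timestamp}"
        digest = hmac.new(self._secret, canonical.encode(), hashlib.sha256).hexdigest()
        return {"X-Timestamp": str(timestamp), "X-Signature": digest}


signer = DemoSign(b"demo-secret")
# A sync client and an async client would call the same method; parameter
# order does not matter because the canonical form sorts the keys.
sync_headers = signer.sign_headers("/search", {"q": "python", "page": 1}, 1700000000)
async_headers = signer.sign_headers("/search", {"page": 1, "q": "python"}, 1700000000)
```

Passing the timestamp in as an argument, rather than reading the clock inside the signer, is what makes snapshot-style tests of sign functions practical.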
---
## Alternatives Considered
| Category | Recommended | Alternative | Why Not |
|----------|-------------|-------------|---------|
| TLS-aware HTTP client | `requests_go` (crawlers) | `curl_cffi` | Both equally capable. `requests_go` already integrated; switching mid-refactor adds risk. `curl_cffi` is the better long-term bet but not worth the disruption now. |
| Async HTTP (in-app) | `httpx.AsyncClient` | `aiohttp` | `httpx` already in Pipfile at 0.28.1. `aiohttp` API is more complex, no advantage here. |
| Async HTTP (in-app) | `httpx.AsyncClient` | `requests_go` async | `requests_go`'s async mode is less documented; `httpx` is the FastAPI-idiomatic choice and has `respx` for testing. |
| Retry | `tenacity` | Manual sleep loops | Tenacity already in Pipfile; structured retries with jitter are safer and testable. |
| Dedup strategy | App-side + PREWHERE time window | ReplacingMergeTree only | ReplacingMergeTree is eventual (not immediate); duplicates appear in queries until merge runs. App-side gives immediate guarantees at the cost of one extra SELECT per batch — acceptable. |
| Dedup strategy | App-side + PREWHERE time window | Full-table scan (current) | Current approach is a correctness/performance regression as table grows past millions of rows. |
| Test mocking | `respx` (httpx) + `unittest.mock` (requests_go) | `responses`, `pytest-mock` | `respx` is the purpose-built `httpx` mock; `unittest.mock` is stdlib. Minimum new dependencies. |
| Async test runner | `pytest-anyio` | `pytest-asyncio` | Project already depends on `anyio` 4.8.0. Mixing `pytest-asyncio` creates event loop conflicts. |
| Shared code structure | Editable install (`pip install -e`) | File copies (current) | Current copy-paste approach (confirmed by "复制自" comments) guarantees drift. Editable install or uv workspace gives one truth. |
---
## Versions and Installation
```bash
# No new major libraries needed for the refactor.
# Changes to existing dependencies:
# Runtime dependencies (Pipfile [packages]):
pipenv install "requests_go==1.0.9"   # Pin to latest verified version
pipenv install tenacity               # Already in Pipfile as spider-specific
# Dev dependencies (Pipfile [dev-packages]):
pipenv install --dev "pytest>=8.0" pytest-anyio "respx>=0.21" "pytest-cov>=4.0"
# Install spiderJobs as editable (development):
pip install -e ./spiderJobs
# Or with uv workspace (production-safe):
# Add pyproject.toml at repo root (see the Base Class Architecture section above)
```
---
## Key Constraints Respected
- **No new HTTP library for the main path**: `requests_go` stays for external scripts. Zero regression risk.
- **Async for in-app crawlers**: `httpx` is already in Pipfile. The `asyncio.to_thread()` workaround flagged in CONCERNS.md is eliminated.
- **10-second minimum delay preserved**: the `tenacity` wait configuration keeps a 10-second floor on every retry wait.
- **Dedup at insert time**: Application-side `batch_dedup_filter` with time-bounded PREWHERE. No schema changes.
- **Shared signing/parsing code**: Single canonical `spiderJobs/` package, installed into both environments.
- **ClickHouse table structure unchanged**: No DDL changes required for any recommendation here.
---
## Sources
- `requests_go` PyPI: latest release 1.0.9 (December 2025) — [pypi.org](https://pypi.org)
- `curl_cffi` PyPI: latest release Dec 2025, Python 3.10+ required — [pypi.org](https://pypi.org)
- TLS/JA3 fingerprinting landscape 2025: [scrapfly.io](https://scrapfly.io), [capsolver.com](https://capsolver.com)
- `httpx` async performance (7x over sync `requests` in concurrent scenarios): [proxy-cheap.com](https://proxy-cheap.com), [plainenglish.io](https://plainenglish.io)
- `respx` for `httpx` mocking: [rednafi.com](https://rednafi.com), [keploy.io](https://keploy.io)
- `pytest-anyio` vs `pytest-asyncio` tradeoffs: [FastAPI docs](https://tiangolo.com), [medium.com](https://medium.com)
- ClickHouse `ReplacingMergeTree` vs application-side dedup: [clickhouse.com](https://clickhouse.com), [chistadata.com](https://chistadata.com), [glassflow.dev](https://glassflow.dev)
- `tenacity` async retry patterns: [towardsai.net](https://towardsai.net), [leapcell.io](https://leapcell.io)
- uv workspaces for Python monorepos: [medium.com](https://medium.com)
---
*Stack research: 2026-03-21*

@ -0,0 +1,45 @@
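# Research Summary: Recruiting Data Crawler Refactor
**Synthesized:** 2026-03-21
## Core Technical Decisions
**Keep:** `requests_go` (TLS fingerprint impersonation, the core anti-detection measure), `httpx.AsyncClient` (async calls from the backend), the ClickHouse + MySQL dual-database architecture
**Add:** `tenacity` (retries), `pytest` + `respx` (testing)
**Avoid:** plain `requests` (exposes its TLS fingerprint), `aiohttp` (no fingerprint control)
## Key Architecture Approach
**Extract an installable shared package, `crawler_core/`**, placed at the project root; both the external scripts and the backend import from it. This resolves the current anti-pattern of copy-paste plus an explanatory comment.
**Sync/async boundary:** Core logic stays synchronous (`requests_go`); the backend bridges to it via `asyncio.to_thread()`.
**Deprecate `jobs_spider/`:** Keep `spiderJobs/` (the newer, better-structured framework) + `crawler_core/`
## Table Stakes Features
- Unified base classes (BaseFetcher/BaseSearcher)
- Insert-time dedup by unique ID (30-day window + MergeTree backstop)
- Anti-detection delay (10-20s random), proxy rotation, TLS fingerprinting
- Resumable pagination (checkpointing) + progress reporting
- Keyword fetching and distribution
- Structured logging + error retries
## Top 5 Risks
1. **Disrupting live crawlers during the refactor**: run old and new code in parallel behind a feature flag
2. **Copy-pasted signing code**: the shared package must be extracted before platform code is changed
3. **Loss of anti-detection session state**: SmartIPManager needs to be a singleton inside FastAPI
4. **ClickHouse dedup full-table scan**: add a 30-day time window limit
5. **No test coverage**: signing functions are pure functions, so write their tests first
## Suggested Phase Order
1. Extract the `crawler_core/` shared package (eliminate code duplication)
2. Rewrite the crawler clients for the three platforms (on top of the unified base classes)
3. Reconnect the backend facade + external scripts
4. Optimize the dedup and company-cleaning flows
5. Test coverage + frontend monitoring improvements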
---
*Research synthesized: 2026-03-21*