
# Technology Stack: Recruitment Data Crawler Refactor
**Project:** JobData crawler interaction layer refactor
**Researched:** 2026-03-21
**Confidence:** HIGH (existing code examined directly; external libraries verified against PyPI and official docs)
---
## Summary Recommendation
Retain `requests_go` as the HTTP transport for the external crawler scripts: it already works, its Dec 2025 release indicates active maintenance, and TLS fingerprint impersonation is critical for Boss. Adopt `httpx.AsyncClient` only inside the FastAPI process, where an event loop already runs. Treat `curl_cffi` as the designated long-term replacement if `requests_go` shows maintenance gaps; it is the better-documented, more widely adopted alternative with comparable fingerprint capabilities. Use `tenacity` for retry logic and `pytest` + `respx` + the `anyio` pytest plugin for testing. For deduplication, use an application-side pre-check bounded to a 30-day recency window. Plain `MergeTree` performs no deduplication of its own, so any duplicates that slip through are cleaned at query time, while `ReplacingMergeTree` stays where it is already in use.
---
## Recommended Stack
### HTTP Client Layer
#### For External Crawler Scripts (`jobs_spider/`, `spiderJobs/`)
| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `requests_go` | 1.0.9 (Dec 2025) | Primary HTTP client with Chrome TLS impersonation | Already in use in `spiderJobs/core/http_client.py` and `app/services/crawler/_http_client.py`. The Dec 2025 release indicates active maintenance. Provides `TLS_CHROME_LATEST` with `random_ja3=True`, which mirrors the TLS extension-order randomization that Chrome has shipped since version 110, so the client's JA3 varies the way a real Chrome's does. The API mirrors `requests`, so the existing `HTTPClient` wrapper works without changes. |
| `curl_cffi` | 0.7.x+ (Dec 2025) | Strategic future replacement | `curl_cffi` has broader community adoption, better documentation, and equivalent or superior fingerprint support (including HTTP/2 frame fingerprinting). If `requests_go` becomes unmaintained, migrate to `curl_cffi`. Both share a `requests`-compatible API. Do NOT switch mid-refactor; validate on a single platform first. |
**Do NOT use** plain `requests` for crawler HTTP — its default TLS stack (Python `ssl` module + OpenSSL) produces a fingerprint that platforms trivially detect. The existing `requests` imports in `jobs_spider/boss/boos_api.py` and `jobs_spider/qcwy/qcwy.py` are the root cause of the detection exposure in those legacy scripts.
**Do NOT use** `aiohttp` for crawlers — it offers no TLS fingerprint control and its API is more complex than `requests`/`httpx` for this use case.
#### For Backend Service Calls Inside FastAPI (`app/services/crawler/`)
| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `httpx` | 0.28.1 (already in Pipfile) | Async HTTP from within FastAPI event loop | `app/services/crawler/` runs inside the FastAPI async event loop. `requests_go` is synchronous; calling it from async code via `asyncio.to_thread()` is identified in CONCERNS.md as a thread-pool exhaustion risk under high concurrency. `httpx.AsyncClient` eliminates this. The existing `HTTPClient` wrapper interface (`get`, `post` returning `tuple[int, Any]`) can be preserved with an async version. |
The two contexts (external scripts vs. embedded in FastAPI) have different constraints and should use different transports. The shared `BaseFetcher`/`BaseSearcher` base classes can define the interface; each deployment context plugs in the appropriate client.
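
One way to picture the split is a pair of structural interfaces that share the `(status_code, body)` return shape, one blocking and one awaitable. The sketch below is only an illustration under assumptions: the names `SyncTransport`, `AsyncTransport`, `FakeSyncTransport`, `FakeAsyncTransport`, `fetch_sync` and `fetch_async` are hypothetical and do not exist in the codebase, and the fake transports stand in for the real `requests_go` and `httpx.AsyncClient` wrappers.

```python
import asyncio
from typing import Any, Protocol


class SyncTransport(Protocol):
    """Contract a blocking client wrapper would satisfy (external scripts)."""

    def get(self, url: str, **kwargs: Any) -> tuple[int, Any]: ...


class AsyncTransport(Protocol):
    """Same contract, awaitable, for code running inside the FastAPI event loop."""

    async def get(self, url: str, **kwargs: Any) -> tuple[int, Any]: ...


class FakeSyncTransport:
    """Hypothetical stand-in; a real one would delegate to the blocking HTTP library."""

    def get(self, url: str, **kwargs: Any) -> tuple[int, Any]:
        return 200, {"url": url, "mode": "sync"}


class FakeAsyncTransport:
    """Hypothetical stand-in; a real one would await an async HTTP client."""

    async def get(self, url: str, **kwargs: Any) -> tuple[int, Any]:
        await asyncio.sleep(0)  # yield to the loop, as a real network call would
        return 200, {"url": url, "mode": "async"}


def fetch_sync(transport: SyncTransport, url: str) -> Any:
    """Blocking caller: unpacks the shared (status, body) shape."""
    status, body = transport.get(url)
    if status != 200:
        raise RuntimeError(f"unexpected status {status}")
    return body


async def fetch_async(transport: AsyncTransport, url: str) -> Any:
    """Async caller: identical logic, only the call is awaited."""
    status, body = await transport.get(url)
    if status != 200:
        raise RuntimeError(f"unexpected status {status}")
    return body
```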
#### TLS Fingerprint Strategy
The `HTTPClient` class in `spiderJobs/core/http_client.py` (and its copy in `app/services/crawler/_http_client.py`) correctly:
- Sets `TLS_CHROME_LATEST` as the TLS config
- Enables `random_ja3=True` to rotate JA3 hashes per Chrome 110+ behavior
- Recreates sessions for tunnel proxies to force new TCP connections
Keep this pattern. Do not remove random JA3: platforms apply JA3 allowlist/denylist heuristics, and randomization within the Chrome profile mimics real Chrome behavior.
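
To see why reordering extensions rotates the hash, recall that a JA3 fingerprint is the MD5 of five comma-separated ClientHello fields (TLS version, ciphers, extensions, curves, point formats), with each list dash-joined. The sketch below uses illustrative numbers rather than a captured Chrome handshake, and the helper name `ja3_hash` is hypothetical. It shuffles only the extension order and shows that the set of extensions is unchanged while the hash differs.

```python
import hashlib
import random


def ja3_hash(version: int, ciphers: list[int], extensions: list[int],
             curves: list[int], point_formats: list[int]) -> str:
    """MD5 over 'version,ciphers,extensions,curves,point_formats' (lists dash-joined)."""
    fields = [
        str(version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return hashlib.md5(",".join(fields).encode("ascii")).hexdigest()


# Illustrative values only, not a real browser fingerprint.
VERSION = 771
CIPHERS = [4865, 4866, 4867]
EXTENSIONS = [0, 23, 65281, 10, 11, 35, 16, 5, 13]
CURVES = [29, 23, 24]
POINT_FORMATS = [0]

baseline = ja3_hash(VERSION, CIPHERS, EXTENSIONS, CURVES, POINT_FORMATS)

# Permute the extension order, as a randomizing client would.
shuffled = EXTENSIONS[:]
rng = random.Random(7)
while shuffled == EXTENSIONS:  # make sure the order really changed
    rng.shuffle(shuffled)

permuted = ja3_hash(VERSION, CIPHERS, shuffled, CURVES, POINT_FORMATS)
```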
---
### Retry Logic
| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `tenacity` | already in Pipfile (Spider-specific) | Retry with exponential backoff + jitter | Already listed as a spider dependency. Supports async/sync equally. Provides `stop_after_attempt`, `wait_exponential`, `wait_random_exponential`, `retry_if_exception_type`. Replace all bare `except: pass` retry patterns in `jobs_spider/` with structured tenacity decorators. |
**Tenacity configuration pattern for crawlers:**
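```python
import httpx
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_fixed, wait_random


@retry(
    stop=stop_after_attempt(3),
    # 10 s floor plus up to 20 s of jitter, so every retry waits 10-30 s
    wait=wait_fixed(10) + wait_random(0, 20),
    retry=retry_if_exception_type((httpx.TimeoutException, httpx.NetworkError)),
    reraise=True,
)
def fetch_with_retry(client: httpx.Client, url: str) -> httpx.Response:
    # Illustrative signature; adapt it to the real fetch function.
    return client.get(url)
```
The 10-second floor maps to the existing `SLEEP_MIN_SECONDS` constraint. `wait_random_exponential(multiplier=1, min=10, max=30)` is avoided here because its exponential upper bound (1 s, 2 s, 4 s) stays below 10 s for all three attempts, which leaves no room for jitter above the floor. Note that a `tenacity` wait only runs between failed attempts: pacing between successful requests still needs the `sleep_random_between()` delay (or an equivalent) to preserve the anti-detection requirement. The exception types must match the transport in use; the `httpx` types shown apply where `httpx` is the client, and the `requests_go` path needs that library's equivalents.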
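
When an exponential backoff is paired with a 10 s floor, it helps to check where the exponential term sits relative to that floor. The sketch below assumes the common formula `multiplier * exp_base ** (attempt - 1)` capped at a maximum; the helper name `exponential_upper_bound` is hypothetical and is not part of any library. With `multiplier=1`, the bound for the first three attempts is 1 s, 2 s and 4 s, so the floor dominates and the exponent adds no spread until later attempts.

```python
def exponential_upper_bound(attempt: int, multiplier: float = 1.0,
                            exp_base: float = 2.0, cap: float = 30.0) -> float:
    """Exponential term for a 1-based attempt number, capped, before any floor is applied."""
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    return min(multiplier * exp_base ** (attempt - 1), cap)


# Bounds for attempts 1..6 with the defaults above.
bounds = [exponential_upper_bound(n) for n in range(1, 7)]

# Attempts whose exponential term is still below a 10 s floor.
FLOOR_SECONDS = 10.0
below_floor = [n for n, b in enumerate(bounds, start=1) if b < FLOOR_SECONDS]
```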
---
### Base Class Architecture
**Recommendation: Keep the existing two-layer pattern from `spiderJobs/` but make it the single source of truth.**
The `spiderJobs/` package has the best-designed structure:
- `core/base.py`: `ApiResult`, `BaseFetcher`, `BaseSearcher` (platform-agnostic)
- `core/http_client.py`: `HTTPClient` (transport-agnostic)
- `platforms/{boss,job51,zhilian}/{sign,client,api}.py`: per-platform implementations
The `app/services/crawler/` directory is a copy of this (confirmed by the "复制自", i.e. "copied from", comments). The duplication is the problem, not the pattern.
**Resolution:** Make `spiderJobs/` the authoritative shared package. Install it as an editable package (`pip install -e ./spiderJobs`) in both the app environment and the standalone spider environment. Remove the copies in `app/services/crawler/`. This gives one codebase, one set of tests, and eliminates the current copy-paste maintenance burden.
If editable install is not viable for the Docker deployment, use a `uv` workspace or `pyproject.toml` at the repo root referencing `spiderJobs` as a local path dependency.
```toml
# pyproject.toml (repo root, new file)
[tool.uv.workspace]
members = ["spiderJobs", "app"]
```
The base class pattern itself (abstract `_build_params`, concrete `fetch`/`search`) is correct and should be preserved as-is.
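
The shape of that pattern can be sketched as a template method: the base class owns the request flow, and each platform supplies only its parameters and, optionally, its parser. The classes below are simplified stand-ins written for illustration. The real `ApiResult` and `BaseFetcher` in `spiderJobs/core/base.py` may have different fields and signatures, and `DemoFetcher` and `fake_get` are hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ApiResult:
    """Simplified stand-in; the project's real ApiResult may carry other fields."""
    ok: bool
    status: int
    data: Any = None


class BaseFetcher(ABC):
    """Template method: subclasses build params, the base class runs the request."""

    url: str = ""

    def __init__(self, http_get: Callable[..., tuple[int, Any]]) -> None:
        # http_get is any callable returning (status_code, parsed_body)
        self._http_get = http_get

    @abstractmethod
    def _build_params(self, job_id: str) -> dict[str, Any]:
        """Platform-specific query parameters."""

    def _parse(self, body: Any) -> Any:
        """Default parser; platforms may override it."""
        return body

    def fetch(self, job_id: str) -> ApiResult:
        status, body = self._http_get(self.url, params=self._build_params(job_id))
        if status != 200:
            return ApiResult(ok=False, status=status)
        return ApiResult(ok=True, status=status, data=self._parse(body))


class DemoFetcher(BaseFetcher):
    """Hypothetical platform implementation, for illustration only."""

    url = "https://example.invalid/job/detail"

    def _build_params(self, job_id: str) -> dict[str, Any]:
        return {"id": job_id}


def fake_get(url: str, params: dict[str, Any]) -> tuple[int, Any]:
    """Fake transport that echoes the params instead of touching the network."""
    return 200, {"echo": params}


result = DemoFetcher(fake_get).fetch("abc123")
```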
---
### Anti-Detection Libraries and Approaches
| Concern | Approach | Library |
|---------|----------|---------|
| TLS fingerprint | Chrome JA3 impersonation with randomization | `requests_go` (`TLS_CHROME_LATEST`, `random_ja3=True`) |
| HTTP/2 fingerprint | Handled by `requests_go`/`curl_cffi` | Same library |
| IP blocking | Proxy pool rotation + tunnel proxy | Custom `HTTPClient` (already implemented) |
| Session fingerprint | Recreate session on IP switch | `_new_session()` pattern (already in `HTTPClient`) |
| Timing detection | Minimum 10s random delay between requests | Pacing sleep with a 10 s floor between requests; `tenacity` wait with the same floor between retries |
| Token expiry | Mark token invalid, fetch next from backend | Existing `SmartIPManager` pattern — keep |
| Anti-bot challenge (code=35/37) | Session rebuild + cookie refresh | Keep existing logic from `BossZhipinAPI` |
**Do NOT add** Playwright to the base crawler loop. Playwright adds 200-500ms startup overhead per page and is only justified for JavaScript-rendered anti-bot challenges. Reserve it for edge-case recovery only, not the main fetch path.
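
For the timing row in the table above, the delay between ordinary requests can be kept separate from retry waits. The sketch below is a minimal pacer, assuming a 10-30 s window. The class name `RequestPacer` is hypothetical, and the clock, sleep function and random source are injectable so the behavior can be tested without real sleeping. The demo at the end drives it with a fake clock.

```python
import random
import time
from typing import Callable, Optional


class RequestPacer:
    """Keeps a randomized gap in [min_gap, max_gap] seconds between calls to wait()."""

    def __init__(self, min_gap: float = 10.0, max_gap: float = 30.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep,
                 rng: Optional[random.Random] = None) -> None:
        if not 0 <= min_gap <= max_gap:
            raise ValueError("expected 0 <= min_gap <= max_gap")
        self._min_gap = min_gap
        self._max_gap = max_gap
        self._clock = clock
        self._sleep = sleep
        self._rng = rng or random.Random()
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Sleep out the remainder of a random gap; return the seconds slept."""
        slept = 0.0
        if self._last is not None:
            target_gap = self._rng.uniform(self._min_gap, self._max_gap)
            elapsed = self._clock() - self._last
            slept = max(0.0, target_gap - elapsed)
            if slept:
                self._sleep(slept)
        self._last = self._clock()
        return slept


# Demo with a fake clock so nothing really sleeps.
fake_now = [100.0]
sleeps: list[float] = []


def fake_sleep(seconds: float) -> None:
    sleeps.append(seconds)
    fake_now[0] += seconds


pacer = RequestPacer(clock=lambda: fake_now[0], sleep=fake_sleep, rng=random.Random(1))
first = pacer.wait()    # no previous request, so no delay
fake_now[0] += 2.0      # pretend the request itself took 2 s
second = pacer.wait()   # sleeps for the rest of a 10-30 s gap
```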
---
### Data Deduplication at Insert Time
**Strategy: Application-side pre-check with a 30-day recency window + engine-level backstop.**
The current `batch_dedup_filter` in `app/services/ingest/dedup.py` does a `SELECT key_col FROM table WHERE key_col IN (...)` with no time bound. CONCERNS.md identifies this as a full-table scan on every ingest batch — a real performance problem as data grows.
**Recommended approach:**
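```python
# Add PREWHERE recency filter to all dedup queries
def build_dedup_query(table: str, key_col: str) -> str:
    """Dedup lookup bounded to the last 30 days; `keys` is bound as a query parameter."""
    # table and key_col must come from trusted code, never from user input
    return (
        f"SELECT {key_col} FROM {table} "
        f"PREWHERE created_at >= now() - INTERVAL 30 DAY "  # <-- add this
        f"WHERE {key_col} IN {{keys:Array(String)}}"
    )
```
Use `PREWHERE` for the time bound: ClickHouse evaluates `PREWHERE` before reading full column data, which makes the filter cheaper on MergeTree tables. ClickHouse can also move eligible `WHERE` conditions to `PREWHERE` automatically (`optimize_move_to_prewhere`, on by default), so the explicit form mainly makes the intent visible; the time bound prunes the most data when `created_at` is part of the table's partition or sorting key. A 30-day window matches the assumption in `company_cleaner.py`'s `days_back` logic and covers realistic re-crawl cycles.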
**For ClickHouse table engine:** The existing `MergeTree` tables for job data use a composite order key (`job_id`, `created_at` or similar). This is correct. Plain `MergeTree` does not deduplicate, and the `FINAL` modifier only takes effect on the merging engine variants such as `ReplacingMergeTree`, but sorting by the dedup key keeps a query-time cleanup (`GROUP BY` or `LIMIT 1 BY` on the key) cheap if duplicates slip through. Do NOT switch to `ReplacingMergeTree` for job tables during this refactor: it adds background merge overhead and provides only eventual (not immediate) deduplication, while the application-side pre-check filters duplicates before insert at lower cost.
The existing `pending_company` table already uses `ReplacingMergeTree(updated_at)` with `(source, company_id)` dedup — this is correct and should stay.
**Summary:** No library change needed for deduplication. Fix the missing `PREWHERE` time-bound filter in `dedup.py`. Use `PREWHERE` over `WHERE` for all time-range filters in ClickHouse queries.
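
The application-side half of this strategy can be sketched without any database access: split the candidate keys into bounded `IN` lists, then drop rows whose key came back from the recency-bounded lookup, along with duplicates inside the batch itself. The helper names `chunked` and `filter_new_rows` are hypothetical and are not the project's `batch_dedup_filter`; the `already_stored` set stands in for the query result.

```python
from typing import Iterable, Iterator, Sequence


def chunked(keys: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Split keys into IN-list sized batches so a single query stays bounded."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


def filter_new_rows(rows: Iterable[dict], existing: set[tuple[str, ...]],
                    key_cols: Sequence[str]) -> list[dict]:
    """Drop rows whose key already exists, plus duplicates inside the batch."""
    seen = set(existing)
    fresh = []
    for row in rows:
        key = tuple(str(row[col]) for col in key_cols)
        if key in seen:
            continue
        seen.add(key)
        fresh.append(row)
    return fresh


batch = [
    {"source": "boss", "job_id": "1"},
    {"source": "boss", "job_id": "2"},
    {"source": "boss", "job_id": "2"},      # duplicate within the batch
    {"source": "zhilian", "job_id": "1"},
]
already_stored = {("boss", "1")}            # pretend result of the bounded SELECT
new_rows = filter_new_rows(batch, already_stored, ("source", "job_id"))
key_batches = [list(part) for part in chunked(["a", "b", "c"], 2)]
```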
---
### Testing Stack
| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `pytest` | latest stable (8.x) | Test runner | Standard; already used in `tests/` (unittest style, but pytest-compatible) |
| `anyio` pytest plugin | bundled with `anyio` (or `anyio[trio]`) | Async test support | The project already pins `anyio==4.8.0` in the Pipfile, and the pytest plugin ships inside the `anyio` package itself, so no separate `pytest-anyio` install is needed. Use `@pytest.mark.anyio` on async tests. |
| `respx` | 0.21.x or later (confirm the release supports `httpx` 0.28.1) | Mock `httpx` requests | Intercepts `httpx.AsyncClient` calls without hitting the network. Use in all tests that exercise `app/services/crawler/` (the async versions). Pattern-matches on URL, method, headers. |
| `unittest.mock` | stdlib | Mock `requests_go` calls in external scripts | For tests of `jobs_spider/` and `spiderJobs/` which use `requests_go` (synchronous), standard `unittest.mock.patch` on `requests_go.Session.get/post` is sufficient. No extra library needed. |
| `pytest-cov` | 4.x | Coverage measurement | Enforce 80% coverage minimum on `spiderJobs/core/`, `spiderJobs/platforms/*/sign.py`, and all pure-function parsers. |
**Testing patterns for crawlers:**
1. **Sign algorithm tests** — Pure functions, no HTTP, no mocks needed. `BossSign.generate_traceid()`, `ZhilianSign.sign_headers()`, `_compute_checksum()` are all deterministic enough to test with snapshot assertions. Test these first — they are the highest-value, lowest-cost tests.
2. **API parser tests** — Feed recorded real response JSON fixtures into `BaseFetcher._parse()` / platform-specific `_parse()` overrides. Verify `ApiResult` fields. Use `pytest.fixture` to load JSON from `tests/fixtures/`. These tests do not need HTTP at all.
3. **HTTPClient / BaseFetcher integration tests** — Use `unittest.mock.patch` on `requests_go.Session` to inject fixed `(status_code, json)` return values. Verify the client calls the right URL, sends correct headers, and handles errors.
4. **Ingest/dedup tests** — The `batch_dedup_filter` function takes an `AsyncClient` — use `respx` to mock ClickHouse query responses. Test: (a) single-column dedup filters correctly, (b) two-column dedup, (c) time-bound filter is present in generated query.
5. **Integration/smoke tests** — One test per platform that replays a recorded full HTTP exchange (using `respx` route replay or `responses` library) and verifies the full pipeline from `BaseFetcher.fetch()` through `build_insert_row()`. Not against a real server.
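
Pattern 1 can be sketched as follows. The signer here is a hypothetical HMAC-SHA256 function written only to make the test shape concrete; it is not the algorithm behind `BossSign` or `ZhilianSign`. The key idea is that time and secrets are passed in as arguments, so the output is reproducible. A true snapshot test would additionally pin the exact digest recorded from a known-good run.

```python
import hashlib
import hmac


def sign_headers(path: str, timestamp_ms: int, secret: bytes) -> dict[str, str]:
    """Hypothetical signer: HMAC-SHA256 over 'path|timestamp'."""
    message = f"{path}|{timestamp_ms}".encode("utf-8")
    digest = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return {"x-timestamp": str(timestamp_ms), "x-sign": digest}


def test_sign_headers_is_deterministic():
    first = sign_headers("/job/detail", 1_700_000_000_000, b"demo-secret")
    second = sign_headers("/job/detail", 1_700_000_000_000, b"demo-secret")
    assert first == second
    assert first["x-timestamp"] == "1700000000000"
    assert len(first["x-sign"]) == 64  # SHA-256 hex digest length


def test_sign_headers_changes_with_inputs():
    base = sign_headers("/job/detail", 1_700_000_000_000, b"demo-secret")
    other_path = sign_headers("/job/list", 1_700_000_000_000, b"demo-secret")
    other_time = sign_headers("/job/detail", 1_700_000_000_001, b"demo-secret")
    assert other_path["x-sign"] != base["x-sign"]
    assert other_time["x-sign"] != base["x-sign"]
```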
**Do NOT use** `pytest-asyncio` alongside the `anyio` pytest plugin: the two event loop managers can conflict. The project already has `anyio` as a hard dependency (FastAPI uses it), and its bundled pytest plugin is the natural choice.
**Existing test pattern:** Current tests in `tests/` use `unittest.TestCase`. These should be migrated to plain `pytest` style (no class inheritance) for simplicity, but this is not blocking.
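
A plain `pytest`-style version of pattern 3 might look like the sketch below. Because `requests_go` itself is not imported here, `DemoSession`, `DemoResponse` and `DemoHTTPClient` are hypothetical stand-ins for the blocking session and the `HTTPClient` wrapper. In a real test the patch target would be the session class the wrapper actually uses.

```python
from typing import Any
from unittest.mock import patch


class DemoSession:
    """Hypothetical stand-in for the blocking HTTP session used by the crawlers."""

    def get(self, url: str, **kwargs: Any) -> Any:
        raise RuntimeError("network access is not allowed in tests")


class DemoResponse:
    """Minimal response object exposing status_code and json()."""

    def __init__(self, status_code: int, payload: Any) -> None:
        self.status_code = status_code
        self._payload = payload

    def json(self) -> Any:
        return self._payload


class DemoHTTPClient:
    """Minimal wrapper returning (status_code, parsed_json)."""

    def __init__(self) -> None:
        self._session = DemoSession()

    def get(self, url: str, headers: dict[str, str]) -> tuple[int, Any]:
        response = self._session.get(url, headers=headers, timeout=15)
        return response.status_code, response.json()


def test_get_sends_headers_and_returns_tuple():
    fake = DemoResponse(200, {"code": 0, "jobs": []})
    # Patch the method on the class so every session instance is intercepted.
    with patch.object(DemoSession, "get", return_value=fake) as mocked:
        status, body = DemoHTTPClient().get(
            "https://example.invalid/search", {"x-sign": "abc"}
        )
    assert (status, body) == (200, {"code": 0, "jobs": []})
    mocked.assert_called_once_with(
        "https://example.invalid/search", headers={"x-sign": "abc"}, timeout=15
    )
```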
---
### Code Organization for Shared Core
**Problem:** Three parallel implementations of the same thing exist:
- `jobs_spider/boss/boos_api.py` — 2281-line monolith (old)
- `spiderJobs/platforms/boss/` — split correctly into sign/client/api (new)
- `app/services/crawler/` — copy of `spiderJobs/` with import paths changed
**Recommended layout after refactor:**
```
spiderJobs/ # Canonical shared package — install as editable dep
core/
base.py # ApiResult, BaseFetcher, BaseSearcher
http_client.py # HTTPClient (requests_go transport)
http_client_async.py # AsyncHTTPClient (httpx.AsyncClient transport) <-- NEW
platforms/
boss/
sign.py # BossSign (pure, no HTTP)
client.py # BossClient(HTTPClient)
client_async.py # BossAsyncClient(AsyncHTTPClient) <-- NEW for app embedding
api.py # Boss search/fetch using BossClient
job51/
sign.py
client.py
api.py
zhilian/
sign.py
client.py
api.py
runner/
loop.py # Main crawl loop (keyword fetch → crawl → push)
api_client.py # Push to backend API
jobs_spider/ # Entry-point scripts only, import from spiderJobs
boss/main.py
qcwy/main.py
zhilian/main.py
app/services/crawler/ # App-embedded adapters, import from spiderJobs
boss.py # BossService: thin wrapper using BossAsyncClient
qcwy.py
zhilian.py
```
The `*_async.py` variant of each client wraps `httpx.AsyncClient` with the same interface. Sign classes are transport-agnostic and shared.
---
## Alternatives Considered
| Category | Recommended | Alternative | Why Not |
|----------|-------------|-------------|---------|
| TLS-aware HTTP client | `requests_go` (crawlers) | `curl_cffi` | Both equally capable. `requests_go` already integrated; switching mid-refactor adds risk. `curl_cffi` is the better long-term bet but not worth the disruption now. |
| Async HTTP (in-app) | `httpx.AsyncClient` | `aiohttp` | `httpx` already in Pipfile at 0.28.1. `aiohttp` API is more complex, no advantage here. |
| Async HTTP (in-app) | `httpx.AsyncClient` | `requests_go` async | `requests_go`'s async mode is less documented; `httpx` is the FastAPI-idiomatic choice and has `respx` for testing. |
| Retry | `tenacity` | Manual sleep loops | Tenacity already in Pipfile; structured retries with jitter are safer and testable. |
| Dedup strategy | App-side + PREWHERE time window | ReplacingMergeTree only | ReplacingMergeTree is eventual (not immediate); duplicates appear in queries until merge runs. App-side gives immediate guarantees at the cost of one extra SELECT per batch — acceptable. |
| Dedup strategy | App-side + PREWHERE time window | Full-table scan (current) | Current approach is a performance liability that worsens as the table grows past millions of rows. |
| Test mocking | `respx` (httpx) + `unittest.mock` (requests_go) | `responses`, `pytest-mock` | `respx` is the purpose-built `httpx` mock; `unittest.mock` is stdlib. Minimum new dependencies. |
| Async test runner | `anyio` pytest plugin | `pytest-asyncio` | Project already depends on `anyio` 4.8.0, which bundles the plugin. Mixing in `pytest-asyncio` risks event loop conflicts. |
| Shared code structure | Editable install (`pip install -e`) | File copies (current) | Current copy-paste approach (confirmed by the "复制自", i.e. "copied from", comments) guarantees drift. Editable install or uv workspace gives one truth. |
---
## Versions and Installation
```toml
# Pipfile: no new major libraries are needed for the refactor.
[packages]
requests_go = "==1.0.9"   # pin to the latest verified version
tenacity = "*"            # already in Pipfile as spider-specific

[dev-packages]
pytest = ">=8.0"
respx = ">=0.21"
pytest-cov = ">=4.0"
# anyio's bundled pytest plugin provides @pytest.mark.anyio; no extra package
```
```bash
# Install spiderJobs as editable (development):
pip install -e ./spiderJobs
# Or with uv workspace (production-safe):
# add pyproject.toml at repo root (see Base Class Architecture section above)
```
---
## Key Constraints Respected
- **No new HTTP library for the main path** — `requests_go` stays for external scripts. Zero regression risk.
- **Async for in-app crawlers** — `httpx` is already in Pipfile. The `asyncio.to_thread()` workaround in CONCERNS.md is eliminated.
- **10-second minimum delay preserved**: retry waits are configured with a 10 s floor, and any pacing between successful requests must keep the same floor.
- **Dedup at insert time** — Application-side `batch_dedup_filter` with time-bounded PREWHERE. No schema changes.
- **Shared signing/parsing code** — Single canonical `spiderJobs/` package, installed into both environments.
- **ClickHouse table structure unchanged** — No DDL changes required for any recommendation here.
---
## Sources
- `requests_go` PyPI: latest release 1.0.9 (December 2025) — [pypi.org](https://pypi.org)
- `curl_cffi` PyPI: latest release Dec 2025, Python 3.10+ required — [pypi.org](https://pypi.org)
- TLS/JA3 fingerprinting landscape 2025: [scrapfly.io](https://scrapfly.io), [capsolver.com](https://capsolver.com)
- `httpx` async performance (7x over sync `requests` in concurrent scenarios): [proxy-cheap.com](https://proxy-cheap.com), [plainenglish.io](https://plainenglish.io)
- `respx` for `httpx` mocking: [rednafi.com](https://rednafi.com), [keploy.io](https://keploy.io)
- `pytest-anyio` vs `pytest-asyncio` tradeoffs: [FastAPI docs](https://tiangolo.com), [medium.com](https://medium.com)
- ClickHouse `ReplacingMergeTree` vs application-side dedup: [clickhouse.com](https://clickhouse.com), [chistadata.com](https://chistadata.com), [glassflow.dev](https://glassflow.dev)
- `tenacity` async retry patterns: [towardsai.net](https://towardsai.net), [leapcell.io](https://leapcell.io)
- uv workspaces for Python monorepos: [medium.com](https://medium.com)
---
*Stack research: 2026-03-21*