# Technology Stack: Recruitment Data Crawler Refactor

**Project:** JobData crawler interaction layer refactor

**Researched:** 2026-03-21

**Confidence:** HIGH (existing code examined directly; external libraries verified against PyPI and official docs)

---

## Summary Recommendation

Retain `requests_go` as the HTTP transport for crawlers (already working, actively maintained through Dec 2025; TLS fingerprint impersonation is critical for Boss). Adopt `httpx.AsyncClient` inside the FastAPI process only, where the event loop already runs. Treat `curl_cffi` as the long-term strategic replacement if `requests_go` shows maintenance gaps: it is the better-documented, more widely adopted alternative with the same fingerprint capabilities. Use `tenacity` for retry logic, `pytest` + `respx` + `pytest-anyio` for testing, and a hybrid deduplication strategy (application-side pre-check with a 30-day recency window + ClickHouse MergeTree engine-level eventual dedup as backstop).

---

## Recommended Stack

### HTTP Client Layer

#### For External Crawler Scripts (`jobs_spider/`, `spiderJobs/`)

| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `requests_go` | 1.0.9 (Dec 2025) | Primary HTTP client with Chrome TLS impersonation | Already in use in `spiderJobs/core/http_client.py` and `app/services/crawler/_http_client.py`. Latest release Dec 2025 confirms active maintenance. Provides `TLS_CHROME_LATEST` with `random_ja3=True`, which is the correct countermeasure for Chrome 110+ JA3 randomization. API is identical to `requests`, so the existing `HTTPClient` wrapper works without changes. |
| `curl_cffi` | 0.7.x+ (Dec 2025) | Strategic future replacement | `curl_cffi` has broader community adoption, better documentation, and equivalent or superior fingerprint support (including HTTP/2 frame fingerprinting). If `requests_go` becomes unmaintained, migrate to `curl_cffi`. Both share a `requests`-compatible API. Do NOT switch mid-refactor; validate on a single platform first. |

**Do NOT use** plain `requests` for crawler HTTP: its default TLS stack (Python `ssl` module + OpenSSL) produces a fingerprint that platforms trivially detect. The existing `requests` imports in `jobs_spider/boss/boos_api.py` and `jobs_spider/qcwy/qcwy.py` are the root cause of the detection exposure in those legacy scripts.

**Do NOT use** `aiohttp` for crawlers: it offers no TLS fingerprint control, and its API is more complex than `requests`/`httpx` for this use case.

#### For Backend Service Calls Inside FastAPI (`app/services/crawler/`)

| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `httpx` | 0.28.1 (already in Pipfile) | Async HTTP from within FastAPI event loop | `app/services/crawler/` runs inside the FastAPI async event loop. `requests_go` is synchronous; calling it from async code via `asyncio.to_thread()` is identified in CONCERNS.md as a thread-pool exhaustion risk under high concurrency. `httpx.AsyncClient` eliminates this. The existing `HTTPClient` wrapper interface (`get`, `post` returning `tuple[int, Any]`) can be preserved with an async version. |

The two contexts (external scripts vs. embedded in FastAPI) have different constraints and should use different transports. The shared `BaseFetcher`/`BaseSearcher` base classes can define the interface; each deployment context plugs in the appropriate client.

#### TLS Fingerprint Strategy

The `HTTPClient` class in `spiderJobs/core/http_client.py` (and its copy in `app/services/crawler/_http_client.py`) correctly:

- Sets `TLS_CHROME_LATEST` as the TLS config
- Enables `random_ja3=True` to rotate JA3 hashes per Chrome 110+ behavior
- Recreates sessions for tunnel proxies to force new TCP connections

Keep this pattern. Do not remove random JA3: platforms apply JA3 allowlist/denylist heuristics, and randomization within the Chrome profile mimics real Chrome behavior.
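
The session-recreation behavior in the last bullet above can be sketched roughly as follows. This is a minimal illustration, not the project's `HTTPClient`: `RotatingSessionClient`, `SessionLike`, and the `session_factory` parameter are hypothetical names, and the factory stands in for whatever builds a configured `requests_go` session.

```python
from typing import Any, Callable, Optional, Protocol


class SessionLike(Protocol):
    """Minimal session surface this sketch relies on (hypothetical)."""

    def get(self, url: str, **kwargs: Any) -> Any: ...

    def close(self) -> None: ...


class RotatingSessionClient:
    """Hypothetical wrapper: a fresh session per request in tunnel mode."""

    def __init__(
        self,
        session_factory: Callable[[], SessionLike],
        tunnel_proxy: bool,
    ) -> None:
        self._factory = session_factory
        self._tunnel_proxy = tunnel_proxy
        self._session: Optional[SessionLike] = None

    def _new_session(self) -> SessionLike:
        # Close the previous session so its pooled connection is not reused.
        if self._session is not None:
            self._session.close()
        self._session = self._factory()
        return self._session

    def get(self, url: str, **kwargs: Any) -> Any:
        # Tunnel mode: start from a fresh session to force a new TCP connection.
        # Otherwise reuse the existing session (created lazily on first use).
        if self._tunnel_proxy or self._session is None:
            session = self._new_session()
        else:
            session = self._session
        return session.get(url, **kwargs)
```

Passing the session factory in, rather than constructing sessions inside the class, keeps the TLS configuration in one place and makes the rotation logic testable with a fake session.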

---

### Retry Logic

| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `tenacity` | already in Pipfile (Spider-specific) | Retry with exponential backoff + jitter | Already listed as a spider dependency. Supports async/sync equally. Provides `stop_after_attempt`, `wait_exponential`, `wait_random_exponential`, `retry_if_exception_type`. Replace all bare `except: pass` retry patterns in `jobs_spider/` with structured tenacity decorators. |

**Tenacity configuration pattern for crawlers:**

```python
import httpx
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_random_exponential,
)


@retry(
    stop=stop_after_attempt(3),
    wait=wait_random_exponential(multiplier=1, min=10, max=30),  # preserves 10s+ minimum
    retry=retry_if_exception_type((httpx.TimeoutException, httpx.NetworkError)),
    reraise=True,
)
def fetch_with_retry(client: httpx.Client, url: str) -> httpx.Response:
    # Illustrative signature; the real fetch function takes whatever the crawler needs.
    return client.get(url)
```

The `min=10` maps to the existing `SLEEP_MIN_SECONDS` constraint. On the retry path this replaces the current `sleep_random_between()` manual sleep pattern while preserving the anti-detection delay requirement. Tenacity waits only between failed attempts, so pacing between successful requests still needs an explicit delay in the crawl loop.

---

### Base Class Architecture

**Recommendation: Keep the existing two-layer pattern from `spiderJobs/` but make it the single source of truth.**

The `spiderJobs/` package has the best-designed structure:

- `core/base.py`: `ApiResult`, `BaseFetcher`, `BaseSearcher` (platform-agnostic)
- `core/http_client.py`: `HTTPClient` (transport-agnostic)
- `platforms/{boss,job51,zhilian}/{sign,client,api}.py`: per-platform implementations

The `app/services/crawler/` directory is a copy of this (confirmed by the "复制自" ("copied from") comments). The duplication is the problem, not the pattern.

**Resolution:** Make `spiderJobs/` the authoritative shared package. Install it as an editable package (`pip install -e ./spiderJobs`) in both the app environment and the standalone spider environment. Remove the copies in `app/services/crawler/`. This gives one codebase and one set of tests, and eliminates the current copy-paste maintenance burden.

If editable install is not viable for the Docker deployment, use a `uv` workspace or a `pyproject.toml` at the repo root referencing `spiderJobs` as a local path dependency.

```toml
# pyproject.toml (repo root, new file)
[tool.uv.workspace]
members = ["spiderJobs", "app"]
```

The base class pattern itself (abstract `_build_params`, concrete `fetch`/`search`) is correct and should be preserved as-is.

---

### Anti-Detection Libraries and Approaches

| Concern | Approach | Library |
|---------|----------|---------|
| TLS fingerprint | Chrome JA3 impersonation with randomization | `requests_go` (`TLS_CHROME_LATEST`, `random_ja3=True`) |
| HTTP/2 fingerprint | Handled by `requests_go`/`curl_cffi` | Same library |
| IP blocking | Proxy pool rotation + tunnel proxy | Custom `HTTPClient` (already implemented) |
| Session fingerprint | Recreate session on IP switch | `_new_session()` pattern (already in `HTTPClient`) |
| Timing detection | Minimum 10s random delay between requests | `tenacity` `wait_random_exponential(min=10, max=30)` |
| Token expiry | Mark token invalid, fetch next from backend | Existing `SmartIPManager` pattern (keep) |
| Anti-bot challenge (code=35/37) | Session rebuild + cookie refresh | Keep existing logic from `BossZhipinAPI` |

**Do NOT add** Playwright to the base crawler loop. Playwright adds 200-500ms startup overhead per page and is only justified for JavaScript-rendered anti-bot challenges. Reserve it for edge-case recovery, not the main fetch path.

---

### Data Deduplication at Insert Time

**Strategy: Application-side pre-check with a 30-day recency window + engine-level backstop.**

The current `batch_dedup_filter` in `app/services/ingest/dedup.py` does a `SELECT key_col FROM table WHERE key_col IN (...)` with no time bound. CONCERNS.md identifies this as a full-table scan on every ingest batch, a real performance problem as data grows.

**Recommended approach:**

```python
# Add PREWHERE recency filter to all dedup queries
query = (
    f"SELECT {key_col} FROM {table} "
    f"PREWHERE created_at >= now() - INTERVAL 30 DAY "  # <-- add this
    f"WHERE {key_col} IN {{keys:Array(String)}}"
)
```

Use `PREWHERE` (not `WHERE`): ClickHouse evaluates `PREWHERE` before reading full column data, making it significantly cheaper on MergeTree tables. A 30-day window matches the assumption in `company_cleaner.py`'s `days_back` logic and covers realistic re-crawl cycles.

**For ClickHouse table engine:** The existing `MergeTree` tables for job data use a composite order key (`job_id`, `created_at` or similar). This is correct: `MergeTree` does not auto-deduplicate, but the `ORDER BY` key keeps rows for the same `job_id` together, so a `GROUP BY` or `LIMIT 1 BY` query can clean duplicates at read time if needed (`FINAL` only collapses rows on engines such as `ReplacingMergeTree`). Do NOT switch to `ReplacingMergeTree` for job tables during this refactor: it adds background merge overhead and provides only eventual (not immediate) deduplication, while the application-side pre-check gives immediate guarantees at lower cost.

The existing `pending_company` table already uses `ReplacingMergeTree(updated_at)` with `(source, company_id)` dedup. This is correct and should stay.

**Summary:** No library change is needed for deduplication. Fix the missing `PREWHERE` time-bound filter in `dedup.py`. Use `PREWHERE` over `WHERE` for all time-range filters in ClickHouse queries.

---

### Testing Stack

| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `pytest` | latest stable (8.x) | Test runner | Standard; already used in `tests/` (unittest style, but pytest-compatible) |
| `pytest-anyio` | 0.0.0+ (or `anyio[trio]`) | Async test support | Project uses `anyio==4.8.0` already in Pipfile. `pytest-anyio` integrates directly with the `anyio` backend already present. Use `@pytest.mark.anyio` on async tests. |
| `respx` | 0.21.x | Mock `httpx` requests | Intercepts `httpx.AsyncClient` calls without hitting the network. Use in all tests that exercise `app/services/crawler/` (the async versions). Pattern-matches on URL, method, headers. |
| `unittest.mock` | stdlib | Mock `requests_go` calls in external scripts | For tests of `jobs_spider/` and `spiderJobs/`, which use `requests_go` (synchronous), standard `unittest.mock.patch` on `requests_go.Session.get/post` is sufficient. No extra library needed. |
| `pytest-cov` | 4.x | Coverage measurement | Enforce 80% coverage minimum on `spiderJobs/core/`, `spiderJobs/platforms/*/sign.py`, and all pure-function parsers. |

**Testing patterns for crawlers:**

1. **Sign algorithm tests.** Pure functions, no HTTP, no mocks needed. `BossSign.generate_traceid()`, `ZhilianSign.sign_headers()`, and `_compute_checksum()` are all deterministic enough to test with snapshot assertions. Test these first: they are the highest-value, lowest-cost tests.
2. **API parser tests.** Feed recorded real response JSON fixtures into `BaseFetcher._parse()` / platform-specific `_parse()` overrides. Verify `ApiResult` fields. Use `pytest.fixture` to load JSON from `tests/fixtures/`. These tests do not need HTTP at all.
3. **HTTPClient / BaseFetcher integration tests.** Use `unittest.mock.patch` on `requests_go.Session` to inject fixed `(status_code, json)` return values. Verify the client calls the right URL, sends correct headers, and handles errors.
4. **Ingest/dedup tests.** The `batch_dedup_filter` function takes an `AsyncClient`, so use `respx` to mock ClickHouse query responses. Test: (a) single-column dedup filters correctly, (b) two-column dedup, (c) time-bound filter is present in the generated query.
5. **Integration/smoke tests.** One test per platform that replays a recorded full HTTP exchange (using `respx` route replay or the `responses` library) and verifies the full pipeline from `BaseFetcher.fetch()` through `build_insert_row()`. These do not run against a real server.

**Do NOT use** `pytest-asyncio` alongside `pytest-anyio`: the two event loop managers conflict. The project already has `anyio` as a hard dependency (FastAPI uses it); `pytest-anyio` is the natural extension.

**Existing test pattern:** Current tests in `tests/` use `unittest.TestCase`. These should be migrated to plain `pytest` style (no class inheritance) for simplicity, but this is not blocking.

---

### Code Organization for Shared Core

**Problem:** Three parallel implementations of the same thing exist:

- `jobs_spider/boss/boos_api.py`: 2281-line monolith (old)
- `spiderJobs/platforms/boss/`: split correctly into sign/client/api (new)
- `app/services/crawler/`: copy of `spiderJobs/` with import paths changed

**Recommended layout after refactor:**

```
spiderJobs/                       # Canonical shared package, install as editable dep
  core/
    base.py                       # ApiResult, BaseFetcher, BaseSearcher
    http_client.py                # HTTPClient (requests_go transport)
    http_client_async.py          # AsyncHTTPClient (httpx.AsyncClient transport)  <-- NEW
  platforms/
    boss/
      sign.py                     # BossSign (pure, no HTTP)
      client.py                   # BossClient(HTTPClient)
      client_async.py             # BossAsyncClient(AsyncHTTPClient)  <-- NEW for app embedding
      api.py                      # Boss search/fetch using BossClient
    job51/
      sign.py
      client.py
      api.py
    zhilian/
      sign.py
      client.py
      api.py
  runner/
    loop.py                       # Main crawl loop (keyword fetch → crawl → push)
    api_client.py                 # Push to backend API

jobs_spider/                      # Entry-point scripts only, import from spiderJobs
  boss/main.py
  qcwy/main.py
  zhilian/main.py

app/services/crawler/             # App-embedded adapters, import from spiderJobs
  boss.py                         # BossService: thin wrapper using BossAsyncClient
  qcwy.py
  zhilian.py
```

The `*_async.py` variant of each client wraps `httpx.AsyncClient` with the same interface. Sign classes are transport-agnostic and shared.
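
The split between a shared sign class and two transport-specific clients can be sketched as below. This is only an illustration under stated assumptions: `DemoSign`, `DemoClient`, `DemoAsyncClient`, and the `send` callables are hypothetical stand-ins (the real clients would wrap `requests_go` and `httpx.AsyncClient`), and the SHA-256 header is a placeholder, not any platform's signing algorithm.

```python
import hashlib
from typing import Any, Awaitable, Callable, Dict, Tuple

# Hypothetical transport callables standing in for requests_go / httpx.
Headers = Dict[str, str]
SyncSend = Callable[[str, Headers], Tuple[int, Any]]
AsyncSend = Callable[[str, Headers], Awaitable[Tuple[int, Any]]]


class DemoSign:
    """Stand-in for a platform sign class: pure, no HTTP, no transport."""

    @staticmethod
    def sign_headers(url: str) -> Headers:
        # Placeholder signature; real platforms use their own algorithms.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
        return {"x-demo-sign": digest}


class DemoClient:
    """Synchronous client: same interface, blocking transport."""

    def __init__(self, send: SyncSend) -> None:
        self._send = send

    def get(self, url: str) -> Tuple[int, Any]:
        return self._send(url, DemoSign.sign_headers(url))


class DemoAsyncClient:
    """Async client: same interface and signing, awaitable transport."""

    def __init__(self, send: AsyncSend) -> None:
        self._send = send

    async def get(self, url: str) -> Tuple[int, Any]:
        return await self._send(url, DemoSign.sign_headers(url))
```

Because both clients return the same `(status_code, body)` tuple and delegate signing to one class, a parser or fetcher written against one can be exercised against the other with only the transport swapped.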

---

## Alternatives Considered

| Category | Recommended | Alternative | Why Not |
|----------|-------------|-------------|---------|
| TLS-aware HTTP client | `requests_go` (crawlers) | `curl_cffi` | Both equally capable. `requests_go` already integrated; switching mid-refactor adds risk. `curl_cffi` is the better long-term bet but not worth the disruption now. |
| Async HTTP (in-app) | `httpx.AsyncClient` | `aiohttp` | `httpx` already in Pipfile at 0.28.1. `aiohttp` API is more complex, no advantage here. |
| Async HTTP (in-app) | `httpx.AsyncClient` | `requests_go` async | `requests_go`'s async mode is less documented; `httpx` is the FastAPI-idiomatic choice and has `respx` for testing. |
| Retry | `tenacity` | Manual sleep loops | Tenacity already in Pipfile; structured retries with jitter are safer and testable. |
| Dedup strategy | App-side + PREWHERE time window | ReplacingMergeTree only | ReplacingMergeTree is eventual (not immediate); duplicates appear in queries until merge runs. App-side gives immediate guarantees at the cost of one extra SELECT per batch, which is acceptable. |
| Dedup strategy | App-side + PREWHERE time window | Full-table scan (current) | Current approach is a correctness/performance regression as table grows past millions of rows. |
| Test mocking | `respx` (httpx) + `unittest.mock` (requests_go) | `responses`, `pytest-mock` | `respx` is the purpose-built `httpx` mock; `unittest.mock` is stdlib. Minimum new dependencies. |
| Async test runner | `pytest-anyio` | `pytest-asyncio` | Project already depends on `anyio` 4.8.0. Mixing `pytest-asyncio` creates event loop conflicts. |
| Shared code structure | Editable install (`pip install -e`) | File copies (current) | Current copy-paste approach (confirmed by "复制自" ("copied from") comments) guarantees drift. Editable install or uv workspace gives one source of truth. |

---

## Versions and Installation

```bash
# No new major libraries needed for the refactor.
# Changes to existing dependencies:

# Add to Pipfile [packages]:
requests_go = "==1.0.9"   # Pin to latest verified version
tenacity = "*"            # Already in Pipfile as spider-specific
pytest = ">=8.0"          # Dev dependency
pytest-anyio = "*"        # Dev dependency
respx = ">=0.21"          # Dev dependency
pytest-cov = ">=4.0"      # Dev dependency

# Install spiderJobs as editable (development):
pip install -e ./spiderJobs

# Or with uv workspace (production-safe):
# Add pyproject.toml at repo root (see Base Class Architecture above)
```

---

## Key Constraints Respected

- **No new HTTP library for the main path:** `requests_go` stays for external scripts. Zero regression risk.
- **Async for in-app crawlers:** `httpx` is already in Pipfile. The `asyncio.to_thread()` workaround in CONCERNS.md is eliminated.
- **10-second minimum delay preserved:** `tenacity` `wait_random_exponential(min=10)` keeps the 10s floor between retry attempts.
- **Dedup at insert time:** Application-side `batch_dedup_filter` with time-bounded PREWHERE. No schema changes.
- **Shared signing/parsing code:** Single canonical `spiderJobs/` package, installed into both environments.
- **ClickHouse table structure unchanged:** No DDL changes required for any recommendation here.
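
The time-bounded dedup constraint listed above could look roughly like the following sketch. It is not the code in `dedup.py`: `build_dedup_query` and `filter_new_rows` are hypothetical helper names, the `created_at` column is taken from the query shown earlier in this document, and the identifier check is an added assumption because table and column names are interpolated into the SQL string.

```python
import re
from typing import Any, Dict, Iterable, List

# Plain SQL identifiers only; this sketch does not handle "db.table" names.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_dedup_query(table: str, key_col: str, days: int = 30) -> str:
    """Build a time-bounded lookup for keys that already exist."""
    for name in (table, key_col):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    if days <= 0:
        raise ValueError("days must be positive")
    # The keys themselves travel as a bound query parameter, not as SQL text.
    return (
        f"SELECT {key_col} FROM {table} "
        f"PREWHERE created_at >= now() - INTERVAL {int(days)} DAY "
        f"WHERE {key_col} IN {{keys:Array(String)}}"
    )


def filter_new_rows(
    rows: Iterable[Dict[str, Any]],
    key_col: str,
    existing_keys: Iterable[str],
) -> List[Dict[str, Any]]:
    """Drop rows whose key already exists, or repeats within the batch."""
    seen = set(existing_keys)
    fresh: List[Dict[str, Any]] = []
    for row in rows:
        key = str(row[key_col])
        if key in seen:
            continue
        seen.add(key)  # also removes duplicates inside the same batch
        fresh.append(row)
    return fresh
```

Keeping query construction and row filtering as pure functions means the "time-bound filter is present" check from the testing patterns can be asserted without a ClickHouse connection.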

---

## Sources

- `requests_go` PyPI: latest release 1.0.9 (December 2025), [pypi.org](https://pypi.org)
- `curl_cffi` PyPI: latest release Dec 2025, Python 3.10+ required, [pypi.org](https://pypi.org)
- TLS/JA3 fingerprinting landscape 2025: [scrapfly.io](https://scrapfly.io), [capsolver.com](https://capsolver.com)
- `httpx` async performance (7x over sync `requests` in concurrent scenarios): [proxy-cheap.com](https://proxy-cheap.com), [plainenglish.io](https://plainenglish.io)
- `respx` for `httpx` mocking: [rednafi.com](https://rednafi.com), [keploy.io](https://keploy.io)
- `pytest-anyio` vs `pytest-asyncio` tradeoffs: [FastAPI docs](https://tiangolo.com), [medium.com](https://medium.com)
- ClickHouse `ReplacingMergeTree` vs application-side dedup: [clickhouse.com](https://clickhouse.com), [chistadata.com](https://chistadata.com), [glassflow.dev](https://glassflow.dev)
- `tenacity` async retry patterns: [towardsai.net](https://towardsai.net), [leapcell.io](https://leapcell.io)
- uv workspaces for Python monorepos: [medium.com](https://medium.com)

---

*Stack research: 2026-03-21*