| gsd_state_version | milestone | milestone_name | status | stopped_at | last_updated | progress |
|---|---|---|---|---|---|---|
| 1.0 | v1.0 | milestone | unknown | Phase 6 UI-SPEC approved | 2026-03-21T14:56:38.784Z | |

STATE: JobData Crawler Interaction Refactor
Last updated: 2026-03-21T10:23Z Milestone: Crawler Interaction Refactor v1 Stopped At: Phase 6 UI-SPEC approved
Project Reference
Core value: Keyword-driven crawlers fetch job postings, store them reliably in ClickHouse, and sync company information on a schedule
Current focus: Phase 6: Quality & Frontend
Current Position
Phase: 6 Plan: Not started
Performance Metrics
| Metric | Value |
|---|---|
| Plans completed | 2 |
| Plans total (estimated) | TBD |
| Phases completed | 0/6 |
| Requirements mapped | 19/19 |

| Plan | Duration | Tasks | Files |
|---|---|---|---|
| Phase 01-shared-core P01 | 15min | 3 tasks | 8 files |
| Phase 01-shared-core P02 | 15min | 2 tasks | 7 files |

Accumulated Context
Key Decisions
| Decision | Rationale | Status |
|---|---|---|
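| Extract crawler_core/ before changing platform code | Prevents the signing code from being copy-pasted again | Active |
| Run old and new code in parallel behind a feature flag | Keeps production crawlers running during the refactor | Active |
| Eventually retire jobs_spider/ | Migrate to spiderJobs/ + crawler_core | Active |
| Synchronous core + asyncio.to_thread() bridge | requests_go does not support async; keeping it preserves the TLS fingerprint | Active |
| Deduplicate within a 30-day time window | Avoids full-table scans in ClickHouse | Active |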
| pyproject.toml uses where=['..'] | pip install -e ./crawler_core resolves from repo root | Decided (01-01) |
| tenacity min=10s wait | STACK.md mandatory 10s anti-detection delay preserved | Decided (01-01) |
| stdlib logging in crawler_core | No loguru dependency; loguru bridge deferred to Phase 5 | Decided (01-01) |
| Result[T] replaces ApiResult | Typed generic eliminates untyped Any in return values | Decided (01-01) |
| Sign algorithms ported verbatim | Behavioral parity with originals; no regression risk | Decided (01-02) |
| No `tests/crawler_core/__init__.py` | Would shadow crawler_core package causing import failures | Decided (01-02) |
| conftest.py at project root | Reliable sys.path setup for pytest crawler_core imports | Decided (01-02) |
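
The "Result[T] replaces ApiResult" decision can be illustrated with a minimal sketch of a typed generic result wrapper. The field names (`ok`, `value`, `error`), the `success`/`failure`/`unwrap` helpers, and the `parse_job_count` example are all assumptions made for illustration; the actual `Result[T]` in crawler_core may be shaped differently.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Result(Generic[T]):
    """Hypothetical typed result wrapper; not the real crawler_core class."""

    ok: bool
    value: Optional[T] = None
    error: Optional[str] = None

    @classmethod
    def success(cls, value: T) -> "Result[T]":
        return cls(ok=True, value=value)

    @classmethod
    def failure(cls, error: str) -> "Result[T]":
        return cls(ok=False, error=error)

    def unwrap(self) -> T:
        # Raise instead of silently handing back None on failure.
        if not self.ok:
            raise ValueError(self.error or "unknown error")
        return self.value


def parse_job_count(payload: dict) -> Result[int]:
    """Hypothetical parser: the return type tells callers they get an int."""
    raw = payload.get("total")
    if not isinstance(raw, int):
        return Result.failure(f"unexpected total: {raw!r}")
    return Result.success(raw)
```

With a wrapper like this, a type checker can flag a caller that treats `parse_job_count(...)` as a string, which an untyped `Any` payload would not allow.
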
Architecture Notes
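- crawler_core/ lives at the project root and can be installed locally with `pip install -e ./crawler_core`
- The core stays synchronous (requests_go); the backend calls it through asyncio.to_thread()
- SmartIPManager must be a singleton inside FastAPI to avoid losing session state
- ClickHouse table schemas stay unchanged; MySQL model changes go through Aerich
- Signing-algorithm tests: 41 pure-function unit tests with no network dependency; run them with `pytest tests/crawler_core/`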
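
The sync-core-plus-bridge and singleton notes above can be sketched together. This is a rough illustration only: `fetch_sync` is a stand-in for a blocking requests_go call, the `SmartIPManager` body and its `instance()` accessor are invented for the example, and the real classes will differ. Only `asyncio.to_thread()` (Python 3.9+) is a real API here.

```python
import asyncio
import threading
import time


class SmartIPManager:
    """Stand-in for the real manager; only the singleton pattern matters here."""

    _instance = None
    _lock = threading.Lock()

    def __init__(self) -> None:
        self.session_state: dict = {}

    @classmethod
    def instance(cls) -> "SmartIPManager":
        # The lock keeps two worker threads from creating separate managers.
        with cls._lock:
            if cls._instance is None:
                cls._instance = cls()
            return cls._instance


def fetch_sync(keyword: str) -> dict:
    """Hypothetical blocking fetch standing in for a requests_go request."""
    manager = SmartIPManager.instance()
    manager.session_state["last_keyword"] = keyword
    time.sleep(0.01)  # simulate blocking network I/O
    return {"keyword": keyword, "jobs": []}


async def fetch_async(keyword: str) -> dict:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(fetch_sync, keyword)


if __name__ == "__main__":
    print(asyncio.run(fetch_async("python")))
```

Because every worker thread reaches the same `SmartIPManager.instance()`, session state written during one request is still visible to the next one.
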
Known Risks
| Risk | Mitigation |
|---|---|
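| Breaking production crawlers during the refactor | Run old and new code in parallel behind a feature flag; keep the old path until Phase 4 is complete |
| Loss of anti-crawling session state | Make SmartIPManager a singleton; verify this closely in Phase 4 |
| No test baseline | Phase 1 writes pure-function signing tests first; Phase 2 adds HTTP mock tests alongside |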
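
As an illustration of the "pure-function signing tests" mitigation, here is a sketch of what such a test can look like. The `sign_params` function is a made-up stand-in (MD5 over sorted key=value pairs plus a secret), not one of the ported platform algorithms; the point is only that the tests need no network access or fixtures.

```python
import hashlib


def sign_params(params: dict, secret: str) -> str:
    """Hypothetical signer; the real ported algorithms are platform-specific."""
    canonical = "&".join(f"{key}={params[key]}" for key in sorted(params))
    return hashlib.md5(f"{canonical}{secret}".encode("utf-8")).hexdigest()


def test_sign_is_deterministic_and_order_independent():
    a = sign_params({"keyword": "python", "page": 1}, "s3cret")
    b = sign_params({"page": 1, "keyword": "python"}, "s3cret")
    assert a == b
    assert len(a) == 32  # MD5 hex digest length


def test_sign_changes_with_input():
    assert sign_params({"page": 1}, "s3cret") != sign_params({"page": 2}, "s3cret")
```

Tests of this shape pin the current behavior of a signer before it is moved, so a regression introduced during the port shows up as a failing assertion rather than as rejected requests in production.
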
Open TODOs
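- Confirm the Python package name for crawler_core/ (suggested: `crawler_core`, to avoid clashing with system packages)
- Confirm how the feature flag is implemented (environment variable vs. config file)
- Phase 3 can run in parallel with Phase 2 (the two platforms are independent); decide based on progress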
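
For the open feature-flag question, the environment-variable option might look like the sketch below. The variable name `USE_CRAWLER_CORE` and the `legacy_search` / `core_search` functions are hypothetical placeholders; nothing here reflects a decision already made in the project.

```python
import os
from typing import Mapping, Optional

FLAG_ENV_VAR = "USE_CRAWLER_CORE"  # hypothetical flag name
TRUTHY = {"1", "true", "yes", "on"}


def use_new_crawler(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the flag is set to a truthy value."""
    env = os.environ if env is None else env
    return env.get(FLAG_ENV_VAR, "").strip().lower() in TRUTHY


def legacy_search(keyword: str) -> str:
    """Placeholder for the old jobs_spider/ code path."""
    return f"legacy:{keyword}"


def core_search(keyword: str) -> str:
    """Placeholder for the new crawler_core code path."""
    return f"core:{keyword}"


def search(keyword: str, env: Optional[Mapping[str, str]] = None) -> str:
    # Default to the legacy path so an unset flag never changes behavior.
    impl = core_search if use_new_crawler(env) else legacy_search
    return impl(keyword)
```

Passing `env` explicitly keeps the switch testable without mutating `os.environ`. A config-file variant could reuse `search` unchanged and swap only `use_new_crawler`.
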
Blockers
None currently.
Session Continuity
To resume this project:
- Check current phase in this file
- Read `.planning/ROADMAP.md` for phase goals and success criteria
- Read `.planning/REQUIREMENTS.md` for requirement coverage
- Run `git status` to see uncommitted work
- Start with `/gsd:plan-phase {current_phase}` if no active plan
Key files:
- `.planning/ROADMAP.md`: Phase structure and success criteria
- `.planning/REQUIREMENTS.md`: All v1 requirements with traceability
- `.planning/PROJECT.md`: Project context and constraints
- `crawler_core/` (CREATED): shared package with HTTPClient, BaseFetcher, BaseSearcher, Result[T], plus sign algorithm subpackages
- `tests/crawler_core/` (CREATED): 41 pure-function sign algorithm unit tests
- `spiderJobs/`: external scripts using crawler_core
- `app/services/crawler/`: backend facade layer
State initialized: 2026-03-21