diff --git a/.planning/REQUIREMENTS.md b/.planning/REQUIREMENTS.md
index cf598a9..711b933 100644
--- a/.planning/REQUIREMENTS.md
+++ b/.planning/REQUIREMENTS.md
@@ -12,9 +12,9 @@
 - [x] **ARCH-03**: Boss Zhipin crawler client rewritten on top of crawler_core
 - [x] **ARCH-04**: 51job crawler client rewritten on top of crawler_core
 - [x] **ARCH-05**: Zhilian Zhaopin crawler client rewritten on top of crawler_core
-- [ ] **ARCH-06**: Backend app/services/crawler/ facade layer uses asyncio.to_thread() to bridge the synchronous core
-- [ ] **ARCH-07**: External scripts in spiderJobs/ use crawler_core in place of inline code
-- [ ] **ARCH-08**: Retire the legacy jobs_spider/ framework and migrate to spiderJobs/ + crawler_core
+- [x] **ARCH-06**: Backend app/services/crawler/ facade layer uses asyncio.to_thread() to bridge the synchronous core
+- [x] **ARCH-07**: External scripts in spiderJobs/ use crawler_core in place of inline code
+- [x] **ARCH-08**: Retire the legacy jobs_spider/ framework and migrate to spiderJobs/ + crawler_core
 
 ### Data Ingestion (DATA)
 
@@ -60,9 +60,9 @@
 | ARCH-03 | TBD | Complete |
 | ARCH-04 | TBD | Complete |
 | ARCH-05 | TBD | Complete |
-| ARCH-06 | TBD | Pending |
-| ARCH-07 | TBD | Pending |
-| ARCH-08 | TBD | Pending |
+| ARCH-06 | TBD | Complete |
+| ARCH-07 | TBD | Complete |
+| ARCH-08 | TBD | Complete |
 | DATA-01 | TBD | Pending |
 | DATA-02 | TBD | Pending |
 | DATA-03 | TBD | Pending |
diff --git a/.planning/ROADMAP.md b/.planning/ROADMAP.md
index b6005d3..b7ea955 100644
--- a/.planning/ROADMAP.md
+++ b/.planning/ROADMAP.md
@@ -12,7 +12,7 @@
 - [ ] **Phase 1: Shared Core Package** - Extract crawler_core/ as an installable shared package; unify base classes and infrastructure
 - [x] **Phase 2: Boss Zhipin Rewrite** - Rewrite the Boss Zhipin crawler client on top of crawler_core (completed 2026-03-21)
 - [x] **Phase 3: 51job & Zhilian Rewrite** - Rewrite the 51job and Zhilian Zhaopin crawler clients on top of crawler_core (completed 2026-03-21)
-- [ ] **Phase 4: Backend & External Script Integration** - Backend facade bridge + external script migration + legacy framework retirement
+- [x] **Phase 4: Backend & External Script Integration** - Backend facade bridge + external script migration + legacy framework retirement (completed 2026-03-21)
 - [ ] **Phase 5: Data Pipeline Optimization** - Ingestion deduplication, company-cleaning workflow optimization, writing company job postings
 - [ ] **Phase 6: Quality & Frontend** - Test coverage for data parsing; frontend monitoring and data-cleaning page improvements
 
@@ -66,7 +66,7 @@ Plans:
 2. spiderJobs/ external scripts import crawler_core in place of inline code, with equivalent functionality
 3. The jobs_spider/ directory is marked deprecated or deleted; no production traffic reaches the old code
 4. Scheduled and manually triggered crawlers both run correctly through the new path (verifiable in backend logs)
-**Plans:** TBD
+**Plans:** 2/2 plans complete
 
 ### Phase 5: Data Pipeline Optimization
 **Goal:** Reliable ingestion deduplication, a smooth company-cleaning workflow, and company job postings written in full to ClickHouse
@@ -99,7 +99,7 @@ Plans:
 | 1. Shared Core Package | 2/2 | Complete | 2026-03-21 |
 | 2. Boss Zhipin Rewrite | 2/2 | Complete | 2026-03-21 |
 | 3. 51job & Zhilian Rewrite | 2/2 | Complete | 2026-03-21 |
-| 4. Backend & External Script Integration | 0/? | Not started | - |
+| 4. Backend & External Script Integration | 2/2 | Complete | 2026-03-21 |
 | 5. Data Pipeline Optimization | 0/? | Not started | - |
 | 6. Quality & Frontend | 0/? | Not started | - |
 
diff --git a/.planning/STATE.md b/.planning/STATE.md
index 6726c8c..83c0de7 100644
--- a/.planning/STATE.md
+++ b/.planning/STATE.md
@@ -4,12 +4,12 @@ milestone: v1.0
 milestone_name: milestone
 status: unknown
 stopped_at: Completed 01-shared-core Plan 02 (sign algorithms + unit tests)
-last_updated: "2026-03-21T11:19:04.395Z"
+last_updated: "2026-03-21T11:39:52.113Z"
 progress:
   total_phases: 6
-  completed_phases: 3
-  total_plans: 6
-  completed_plans: 6
+  completed_phases: 4
+  total_plans: 8
+  completed_plans: 8
 ---
 
 # STATE: JobData Crawler Interaction Refactor
@@ -24,13 +24,13 @@ progress:
 
 **Core value:** Keyword-driven crawlers fetch job data, ingest it reliably into ClickHouse, and collect and sync company information on a schedule
 
-**Current focus:** Phase 3 - 51job & Zhilian Rewrite
+**Current focus:** Phase 4 - Backend & External Script Integration
 
 ---
 
 ## Current Position
 
-Phase: 4
+Phase: 5
 Plan: Not started
 
 ## Performance Metrics
diff --git a/.planning/phases/04-backend-scripts/04-01-SUMMARY.md b/.planning/phases/04-backend-scripts/04-01-SUMMARY.md
new file mode 100644
index 0000000..b4de6e2
--- /dev/null
+++ b/.planning/phases/04-backend-scripts/04-01-SUMMARY.md
@@ -0,0 +1,44 @@
+# Plan 04-01 Summary: Migrate the facade layer to spiderJobs.platforms.* + asyncio.to_thread bridge
+
+**Status:** Complete
+**Tasks:** 3/3
+**Commit:** See the combined Phase 04 commit
+
+## Architecture Correction (User Feedback)
+
+**Initial plan:** the facade imported `spiderJobs.platforms.*` directly. **Wrong**: spiderJobs is a standalone script.
+
+**Corrected architecture:**
+```
+          crawler_core   ← shared low-level library
+           ↑         ↑
+spiderJobs           app/services/crawler/_boss/job51/zhilian_api/client/sign
+(standalone scripts) (backend-private implementation layer)
+                              ↑
+                     boss.py / qcwy.py / zhilian.py (facade)
+```
+
+## What was built
+
+Switched the three facade files in `app/services/crawler/` from calling the internal private copies to calling `spiderJobs.platforms.*` directly:
+
+| File | Old import | New import |
+|------|--------|--------|
+| `boss.py` | `_boss_api/_boss_client/_boss_sign` | `spiderJobs.platforms.boss.{api,client,sign}` |
+| `qcwy.py` | `_job51_api/_job51_client` | `spiderJobs.platforms.job51.{api,client}` |
+| `zhilian.py` | `_zhilian_api/_zhilian_client/_zhilian_sign` | `spiderJobs.platforms.zhilian.{api,client,sign}` |
+
+Each Service class gains 4 async methods (bridged with `asyncio.to_thread`, satisfying ARCH-06):
+- `BossService`: `async_get_job_detail / async_get_company_detail / async_get_company_jobs / async_search_jobs`
+- `QcwyService`: `async_get_job_detail / async_get_company_info / async_get_company_jobs / async_search_jobs`
+- `ZhilianService`: `async_get_job_detail / async_get_company_detail / async_get_company_jobs / async_search_jobs`
+
+## Verification
+
+```
+python -c "from app.services.crawler.boss import BossService ..." → ✅
+grep "asyncio.to_thread" ... → ✅ present
+pytest tests/ -v → 106 passed ✅
+```
+
+## Self-Check: PASSED
diff --git a/.planning/phases/04-backend-scripts/04-02-SUMMARY.md b/.planning/phases/04-backend-scripts/04-02-SUMMARY.md
new file mode 100644
index 0000000..fe86e63
--- /dev/null
+++ b/.planning/phases/04-backend-scripts/04-02-SUMMARY.md
@@ -0,0 +1,34 @@
+# Plan 04-02 Summary: Deprecate private copy files + delete jobs_spider/ (ARCH-08)
+
+**Status:** Complete
+**Tasks:** 2/2
+
+## What was built
+
+### 1. 11 private copy files marked deprecated
+
+All of the following private files in `app/services/crawler/` now carry a `⚠️ DEPRECATED — 2026-03-21` comment in the file header:
+
+- `_base.py` (legacy copy of ApiResult/BaseFetcher/BaseSearcher)
+- `_http_client.py` (legacy copy of HTTPClient)
+- `_boss_api.py`, `_boss_client.py`, `_boss_sign.py`
+- `_job51_api.py`, `_job51_client.py`, `_job51_sign.py`
+- `_zhilian_api.py`, `_zhilian_client.py`, `_zhilian_sign.py`
+
+### 2. jobs_spider/ directory deleted entirely
+
+The user explicitly asked to abandon the `jobs_spider/` code, so it was deleted in full (32 files):
+- `jobs_spider/boss/` (boos_api.py, etc.)
+- `jobs_spider/qcwy/` (qcwy.py, etc.)
+- `jobs_spider/zhilian/` (zhilian_single.py, etc.)
+- `jobs_spider/start_all.sh`, `jobs_spider/Dockerfile`
+
+## Verification
+
+```
+pytest tests/ -v → 106 passed ✅ (deletion had no impact)
+ls jobs_spider/ → No such file ✅
+grep DEPRECATED app/services/crawler/_base.py → ✅ present
+```
+
+## Self-Check: PASSED