feat(doc-parser): add document parsing system architecture doc and Doubao preprocessing support

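## Added

- Document parsing system architecture doc (docs/doc-parser-architecture.md)
  - Complete description of the three-layer architecture
  - Detailed walkthrough of the 8 field extraction modes
  - Pros and cons analysis and optimization suggestions
- Doubao preprocessing fast path
  - New preprocessed/ directory support
  - Automatic detection of document source
  - Improved parsing hints for MD files
- Hybrid parsing approach
  - Doubao preprocessing for a small number of documents
  - Direct MCP parsing for batches of documents
  - Document list grouped by source

## Updated

- README.md: add document parsing tool description
- docs/to-parse/README.md: add Doubao preprocessing guide and comparison table

## Removed

- scripts/doc-parser/QUICKSTART.md (content consolidated elsewhere)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>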

hookehuyr
Commit 060e92c6e08eefcd133dcfff20e3a2e172fe9509 060e92c6 1 parent 4c556e6f
Showing 5 changed files with 145 additions and 218 deletions
README.md
docs/doc-parser-architecture.md
docs/to-parse/README.md
scripts/doc-parser/QUICKSTART.md
scripts/doc-parser/parse-docs.js
--- a/README.md
+++ b/README.md
@@ -4,6 +4,7 @@
 
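 ## 📚 Project Documentation
 
+ - **[Document Parsing System Architecture](docs/doc-parser-architecture.md)** - Tool for automatically generating proposal configuration
 - **[Lessons Learned](docs/lessons-learned.md)** - Taro project development experience, best practices, and common pitfalls
 - **[CLAUDE.md](CLAUDE.md)** - Project development guide (for use by Claude Code)
 - **[Documentation Index](docs/README.md)** - Project documentation index and usage suggestions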
@@ -55,7 +56,7 @@ pnpm lint
 
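 ### Recent Highlights
 
- - **Multi-product document parsing** - Automatically detects and splits documents containing multiple insurance products
+ - **Document parsing system** - Automatically generates proposal configuration from PDF/DOCX (supports splitting multi-product documents)
 - **Schema-driven proposals** - Template fields for savings, life, and critical illness products are configuration-driven
 - **Standardized Git workflow** - Uses standard-version + Conventional Commits
 - **Improved authentication** - Automatic refresh on 401, login permission checks, TabBar red-dot badge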
@@ -270,7 +271,44 @@ export default {
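 - ✅ All parameters have `@param` descriptions
 - ✅ Return values have `@returns` descriptions
 
- ## 🔧 Optional Features
+ ## 🔧 Development Tools
+ 
+ ### Document Parsing Tool
+ 
+ Automatically extracts configuration from insurance product documents (PDF/DOCX) and generates proposal templates:
+ 
+ ```bash
+ # Parse all pending documents
+ pnpm parse:docs
+ 
+ # Parse a specific file
+ pnpm parse:docs -- --file=产品说明书.pdf
+ 
+ # List pending documents
+ pnpm parse:docs -- --list
+ 
+ # Apply a configuration that has passed review
+ pnpm parse:docs -- --apply=计划书模版4
+ 
+ # Preview changes (without modifying anything)
+ pnpm parse:docs -- --apply=计划书模版4 --dry-run
+ 
+ # Show configuration status
+ pnpm parse:docs -- --status
+ ```
+ 
+ **Core capabilities**:
+ - 📄 Supports PDF, DOCX, TXT, and MD formats
+ - 🔄 Automatically detects and splits multi-product documents
+ - 🤖 Smart field extraction (8 core fields)
+ - ✅ Manual review workflow
+ - 💾 Automatic backup and rollback
+ 
+ **Detailed documentation**: [Document Parsing System Architecture](docs/doc-parser-architecture.md)
+ 
+ ---
+ 
+ ### Optional Feature Components
 
 The following features can be kept or removed depending on project needs: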
 
@@ -281,11 +319,24 @@ export default {
 
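 ## ✅ Optimization Suggestions
 
- - Connect the document parsing script to a real AI parsing service to replace the mock configuration
- - Add script output to parse:docs for one-step validation of configuration validity
+ ### Document Parsing System
+ 
+ | Priority | Item | Description |
+ |----------|------|-------------|
+ | 🔴 P0 | Enable AI service | Configure `AI_SERVICE_TYPE` to improve parsing accuracy on complex documents |
+ | 🟡 P1 | Improve .doc support | Convert using antiword or LibreOffice |
+ | 🟡 P1 | Add automated tests | Add test cases to parse-docs.test.js |
+ | 🟢 P2 | Add OCR capability | Support parsing scanned documents (Tesseract.js) |
+ 
+ ### Project Overall
+ 
+ 1. Keep the API integration log and the page-to-module mapping up to date
+ 2. Document more error scenarios for the document preview and video playback pages
+ 3. Keep page entry points in sync with permission policies, so an entry is never shown when its permissions do not match
 
 ## 📚 Related Documentation
 
+ - **[Document Parsing System Architecture](docs/doc-parser-architecture.md)** - Detailed guide to the proposal configuration automation tool
 - **[Lessons Learned](docs/lessons-learned.md)** - Taro project development experience, best practices, and common pitfalls
 - **[CLAUDE.md](CLAUDE.md)** - Project development guide (for use by Claude Code)
 - **[Pending Document Parsing Notes](docs/to-parse/README.md)** - Document parsing samples and script usage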
--- a/docs/doc-parser-architecture.md 0 → 100644
+++ b/docs/doc-parser-architecture.md 0 → 100644
--- a/docs/to-parse/README.md
+++ b/docs/to-parse/README.md
--- a/scripts/doc-parser/QUICKSTART.md deleted 100644 → 0
+++ b/scripts/doc-parser/QUICKSTART.md deleted 100644 → 0
- # OpenAPI to API Documentation Generator - Quick Start
- 
- ## 🎯 One-Minute Quick Start
- 
- ### 1️⃣ Create the OpenAPI Document
- 
- Create the module and endpoint documents under the `docs/api-specs/` directory:
- 
- ```bash
- # Create a new module
- mkdir -p docs/api-specs/product
- 
- # Create the endpoint document
- touch docs/api-specs/product/getList.md
- ```
- 
- ### 2️⃣ Write the OpenAPI Specification
- 
- Edit `getList.md`:
- 
- ```markdown
- # Get Product List
- 
- ## OpenAPI Specification
- 
- \```yaml
- openapi: 3.0.1
- info:
-   title: ''
-   version: 1.0.0
- paths:
-   /srv/:
-     get:
-       summary: Get product list
-       tags:
-         - Product
-       parameters:
-         - name: a
-           in: query
-           example: product_list
-         - name: f
-           in: query
-           example: behalo
-       responses:
-         '200':
-           description: Success
- \```
- ```
- 
- ### 3️⃣ Generate the API Files
- 
- ```bash
- pnpm api:generate
- ```
- 
- ### 4️⃣ Use the Generated API
- 
- ```javascript
- import { getListAPI } from '@/api/product';
- 
- const result = await getListAPI({ page: 1, pageSize: 10 });
- ```
- 
- ## ✅ Verify the Result
- 
- Run the test script to verify the generated files:
- 
- ```bash
- node scripts/test-generate.js
- ```
- 
- ## 📂 File Structure
- 
- ```
- manulife-weapp/
- ├── docs/
- │   ├── api-specs/            # Source directory for API spec documents
- │   │   └── user/             # Module directory
- │   │       └── getUserInfo.md
- │   ├── OPENAPI_TO_API_GUIDE.md  # Detailed usage guide
- │   └── API_USAGE_EXAMPLES.md    # API usage examples
- ├── scripts/
- │   ├── generateApiFromOpenAPI.js  # Core generator script
- │   └── test-generate.js           # Test script
- ├── src/
- │   └── api/                  # Directory for generated API files
- │       ├── user.js           # Auto-generated
- │       ├── wx/
- │       └── index.js
- └── package.json              # Contains the api:generate command
- ```
- 
- ## 🔄 Workflow
- 
- ```mermaid
- graph LR
-     A[Write OpenAPI document] --> B[Run pnpm api:generate]
-     B --> C[Generate API files]
-     C --> D[Use in the project]
-     D --> E[Endpoint needs changes]
-     E --> A
- ```
- 
- ## 🎨 Common Scenarios
- 
- ### Scenario 1: Generate Multiple Endpoints in Bulk
- 
- ```bash
- docs/api-specs/
- ├── user/
- │   ├── getUserInfo.md
- │   ├── updateProfile.md
- │   └── changePassword.md
- └── order/
-     ├── getList.md
-     └── getDetail.md
- ```
- 
- Running `pnpm api:generate` produces:
- 
- ```
- src/api/
- ├── user.js       # Contains 3 endpoints
- └── order.js      # Contains 2 endpoints
- ```
- 
- ### Scenario 2: Update an Existing Endpoint
- 
- 1. Edit `docs/api-specs/user/getUserInfo.md`
- 2. Run `pnpm api:generate`
- 3. `src/api/user.js` is updated automatically
- 
- ### Scenario 3: Add a New Module
- 
- 1. Create `docs/api-specs/payment/`
- 2. Add endpoint documents
- 3. Run the generate command
- 4. `src/api/payment.js` is generated automatically
- 
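- ## ⚙️ Configuration and Customization
- 
- ### Change the Output Directory
- 
- Edit `scripts/generateApiFromOpenAPI.js`:
- 
- ```javascript
- const outputDir = path.resolve(__dirname, '../src/api');
- // Change to the directory you want
- const outputDir = path.resolve(__dirname, '../src/apis');
- ```
- 
- ### Change the Naming Rules
- 
- Edit the `toCamelCase()` or `toPascalCase()` function.
- 
- ### Change the Generation Template
- 
- Edit the `generateApiFileContent()` function.
- 
- ## 🐛 Debugging Tips
- 
- ### Enable Verbose Logging
- 
- Add more console.log calls to the script:
- 
- ```javascript
- console.log('Parsed API info:', JSON.stringify(apiInfo, null, 2));
- ```
- 
- ### Test a Single Module in Isolation
- 
- Modify the module filtering logic in the script.
- 
- ### Inspect the Generated Intermediate Data
- 
- Add debug output to inspect the YAML parsing result.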
- 
- ## 📞 Getting Help
- 
- - Detailed guide: [OpenAPI to API Documentation Generator Guide](./OPENAPI_TO_API_GUIDE.md)
- - Usage examples: [API Usage Examples](./API_USAGE_EXAMPLES.md)
- - Project architecture: [CLAUDE.md](../CLAUDE.md)
- 
- ## 🎉 Get Started
- 
- You are all set! Go ahead and create your first OpenAPI document.
- 
- ```bash
- # 1. Create the module directory
- mkdir -p docs/api-specs/your-module
- 
- # 2. Create the endpoint document (see docs/api-specs/user/getUserInfo.md)
- 
- # 3. Generate the API
- pnpm api:generate
- 
- # 4. View the generated file
- cat src/api/your-module.js
- 
- # 5. Start using it
- ```
- 
- Happy coding! 🚀
--- a/scripts/doc-parser/parse-docs.js
+++ b/scripts/doc-parser/parse-docs.js
@@ -42,6 +42,8 @@ import { splitByProducts, findProductTitles, generateSplitReport } from './produ
 // ========== Configuration ==========
 
 const DOCS_DIR = path.resolve(process.cwd(), 'docs/to-parse')
+ const DOCS_PREPROCESSED_DIR = path.resolve(process.cwd(), 'docs/to-parse/preprocessed')
+ const DOCS_RAW_DIR = path.resolve(process.cwd(), 'docs/to-parse/raw')
 const DOCS_ARCHIVE_DIR = path.resolve(process.cwd(), 'docs/to-parse/archived')
 const CONFIG_FILE = path.resolve(process.cwd(), 'src/config/plan-templates.js')
 const BACKUP_DIR = path.resolve(process.cwd(), 'docs/parsed-backup')
@@ -49,6 +51,29 @@ const BACKUP_DIR = path.resolve(process.cwd(), 'docs/parsed-backup')
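 // Supported document formats
 const SUPPORTED_EXTENSIONS = ['.pdf', '.doc', '.docx', '.txt', '.md']
 
+ /**
+  * Detect the document source
+  *
+  * @description Determines whether a document is a preprocessed MD file or an original document
+  * @param {string} filePath - Document path
+  * @returns {{source: string, type: string}} Source information
+  */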
+ function detectDocumentSource(filePath) {
+   if (filePath.includes('preprocessed')) {
+     return { source: 'preprocessed', type: 'markdown' }
+   }
+   if (filePath.includes('raw')) {
+     return { source: 'raw', type: 'original' }
+   }
+   // Infer from the file extension
+   const ext = path.extname(filePath).toLowerCase()
+   if (ext === '.md') {
+     // MD files may have been preprocessed
+     return { source: 'likely-preprocessed', type: 'markdown' }
+   }
+   return { source: 'unknown', type: 'original' }
+ }
+ 
 const ajv = new Ajv({ allErrors: true, strict: false })
 const parseConfigSchema = {
     type: 'object',
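
Review note on `detectDocumentSource` above: `filePath.includes('preprocessed')` and `filePath.includes('raw')` are substring checks, so a file such as `docs/to-parse/drawing.pdf` would be classified as `raw`. A minimal sketch of a variant that compares whole directory segments instead follows. The name `detectSourceBySegment` is hypothetical and not part of this commit, and the sketch uses `require` so that it runs standalone, whereas parse-docs.js itself uses ES module imports.

```javascript
const path = require('node:path')

// Hypothetical variant of detectDocumentSource (not in the commit):
// matches whole directory names rather than substrings of the path.
function detectSourceBySegment(filePath) {
  const segments = path.normalize(filePath).split(path.sep)
  const dirs = segments.slice(0, -1) // ignore the file name itself

  if (dirs.includes('preprocessed')) {
    return { source: 'preprocessed', type: 'markdown' }
  }
  if (dirs.includes('raw')) {
    return { source: 'raw', type: 'original' }
  }
  // Same extension-based fallback as the original function
  const ext = path.extname(filePath).toLowerCase()
  if (ext === '.md') {
    return { source: 'likely-preprocessed', type: 'markdown' }
  }
  return { source: 'unknown', type: 'original' }
}
```

A directory named `raw` higher up in an absolute path would still match here; comparing against `DOCS_RAW_DIR` and `DOCS_PREPROCESSED_DIR` directly would be stricter, if that matters for this project.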
@@ -214,23 +239,45 @@ function writeFile(filePath, content) {
 
 /**
  * Get all pending documents
+  *
+  * @description Scans multiple directories for pending documents and sorts them by priority
+  * @returns {Array<{name: string, fullPath: string, ext: string, size: number, source: string}>} Document list
  */
 function getDocsToParse() {
-   if (!fs.existsSync(DOCS_DIR)) {
-     console.log('📂 Documents folder does not exist:', DOCS_DIR)
-     return []
+   const docs = []
+   const directories = [
+     { path: DOCS_DIR, source: 'root' },
+     { path: DOCS_PREPROCESSED_DIR, source: 'preprocessed' },
+     { path: DOCS_RAW_DIR, source: 'raw' }
+   ]
+ 
+   for (const dir of directories) {
+     if (!fs.existsSync(dir.path)) {
+       continue
     }
 
-   const files = fs.readdirSync(DOCS_DIR)
-   return files
+     const files = fs.readdirSync(dir.path)
+     const dirDocs = files
       .filter(file => SUPPORTED_EXTENSIONS.includes(path.extname(file).toLowerCase()))
       .filter(file => file !== 'README.md')
       .map(file => ({
         name: file,
-       fullPath: path.join(DOCS_DIR, file),
+         fullPath: path.join(dir.path, file),
         ext: path.extname(file).toLowerCase(),
-       size: fs.statSync(path.join(DOCS_DIR, file)).size
+         size: fs.statSync(path.join(dir.path, file)).size,
+         source: dir.source
       }))
+ 
+     docs.push(...dirDocs)
+   }
+ 
+   // Process preprocessed MD files first, then original documents
+   docs.sort((a, b) => {
+     const priorityOrder = { preprocessed: 1, root: 2, raw: 3 }
+     return priorityOrder[a.source] - priorityOrder[b.source]
+   })
+ 
+   return docs
 }
 
 /**
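
Review note on the sort in `getDocsToParse` above: the comparator works for the three sources defined in this commit, but if a document ever carried a `source` that is missing from `priorityOrder`, the subtraction would yield `NaN` and the ordering would become unreliable. A minimal sketch with a fallback rank and a name tie-break follows. `SOURCE_PRIORITY` and `sortDocsByPriority` are hypothetical names, not part of the commit.

```javascript
// Same priorities as the commit: preprocessed first, then root, then raw.
const SOURCE_PRIORITY = { preprocessed: 1, root: 2, raw: 3 }

// Hypothetical helper: returns a sorted copy instead of sorting in place.
function sortDocsByPriority(docs) {
  // Unknown sources rank last instead of producing NaN in the comparator
  const rank = source => SOURCE_PRIORITY[source] ?? Number.MAX_SAFE_INTEGER
  return [...docs].sort(
    (a, b) => rank(a.source) - rank(b.source) || a.name.localeCompare(b.name)
  )
}
```

The name tie-break makes the listing deterministic regardless of the order in which `fs.readdirSync` returns entries.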
@@ -367,10 +414,15 @@ function formatSize(size) {
  */
 async function parseDocumentWithMarkitdown(docPath) {
   const ext = path.extname(docPath).toLowerCase()
+   const sourceInfo = detectDocumentSource(docPath)
 
   // MD and TXT files are read directly; markitdown is not needed
   if (ext === '.md' || ext === '.txt') {
+     if (sourceInfo.source === 'preprocessed' || sourceInfo.source === 'likely-preprocessed') {
+       console.log(`⚡ Preprocessed MD file, skipping markitdown: ${path.basename(docPath)}`)
+     } else {
       console.log(`📄 Reading text file directly: ${path.basename(docPath)}`)
+     }
     return buildExtractResult(docPath, fs.readFileSync(docPath, 'utf-8'), [])
   }
 
@@ -707,8 +759,17 @@ function inferCurrency(content) {
  */
 async function parseSingleFile(filePath) {
   const fileName = path.basename(filePath)
+   const sourceInfo = detectDocumentSource(filePath)
+   const sourceLabel = {
+     preprocessed: '⚡ Preprocessed document',
+     raw: '📄 Original document',
+     root: '📂 Root directory document',
+     'likely-preprocessed': '⚡ MD document',
+     unknown: '📄 Document'
+   }[sourceInfo.source] || '📄 Document'
+ 
   console.log("\n" + "=".repeat(60))
-   console.log("📄 Processing file: " + fileName)
+   console.log(`📄 ${sourceLabel}: ${fileName}`)
   console.log("=".repeat(60))
 
   // Parse the document (may return a single config or an array of configs)
@@ -1799,15 +1860,33 @@ async function main() {
     applyAuditFile(auditFileName, applyOptions)
   } else if (listMode) {
     // List mode
-     const docs = getDocsToParse()
     console.log("\n📋 Pending documents:")
     if (docs.length === 0) {
       console.log('  (no documents)')
     } else {
-       docs.forEach((doc, index) => {
-         console.log(" " + (index + 1) + ". " + doc.name + " (" + formatSize(doc.size) + ")")
+       // Group by source for display
+       const grouped = {
+         preprocessed: docs.filter(d => d.source === 'preprocessed'),
+         root: docs.filter(d => d.source === 'root'),
+         raw: docs.filter(d => d.source === 'raw')
+       }
+ 
+       for (const [source, sourceDocs] of Object.entries(grouped)) {
+         if (sourceDocs.length === 0) continue
+ 
+         const sourceLabel = {
+           preprocessed: '⚡ Preprocessed (preprocessed/)',
+           root: '📂 Root directory (docs/to-parse/)',
+           raw: '📄 Original documents (raw/)'
+         }[source]
+ 
+         console.log(`\n${sourceLabel}`)
+         sourceDocs.forEach((doc, index) => {
+           const sourceTag = doc.ext === '.md' ? ' [MD]' : ''
+           console.log(`  ${index + 1}. ${doc.name}${sourceTag} (${formatSize(doc.size)})`)
         })
       }
+     }
   } else if (fileMode) {
     // Single-file mode
     const fileName = fileMode.split('=')[1]
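
Review note on the last context line above: `fileMode.split('=')[1]` keeps only the text between the first and second `=`, so a file name that itself contains `=` would be truncated. A minimal sketch of an alternative that keeps everything after the first `=` follows. `getFlagValue` is a hypothetical helper, not part of this commit.

```javascript
// Hypothetical helper: returns everything after the first "=" in a
// "--flag=value" argument, or undefined when the argument has no "=".
function getFlagValue(arg) {
  const eq = arg.indexOf('=')
  return eq === -1 ? undefined : arg.slice(eq + 1)
}

// Comparison with the split-based approach used in the diff
const arg = '--file=report=v2.pdf'
const viaSplit = arg.split('=')[1] // stops at the second "="
const viaHelper = getFlagValue(arg) // keeps the full file name
```

Whether this matters depends on the file names that actually appear in docs/to-parse/.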