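feat(doc-parser): add doc-parser system architecture documentation and Doubao preprocessing support

## Added
- Doc-parser system architecture document (docs/doc-parser-architecture.md)
  - Full description of the three-layer architecture
  - Detailed walkthrough of the 8 field-extraction modes
  - Pros/cons analysis and optimization suggestions
- Doubao preprocessing fast path
  - New preprocessed/ directory support
  - Automatic detection of the document source
  - Improved parsing messages for MD files
- Hybrid parsing approach
  - Small numbers of documents: preprocess with Doubao
  - Batches of documents: parse directly via MCP
  - Document list is displayed grouped by source

## Updated
- README.md: add a description of the doc-parser tool
- docs/to-parse/README.md: add the Doubao preprocessing guide and a comparison table

## Removed
- scripts/doc-parser/QUICKSTART.md (content has been consolidated)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>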

hookehuyr
Commit 060e92c6e08eefcd133dcfff20e3a2e172fe9509 (060e92c6), 1 parent 4c556e6f
Showing 5 changed files with 145 additions and 218 deletions
README.md
docs/doc-parser-architecture.md
docs/to-parse/README.md
scripts/doc-parser/QUICKSTART.md
scripts/doc-parser/parse-docs.js
--- a/README.md
+++ b/README.md
@@ -4,6 +4,7 @@
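 ## 📚 Project Documentation
+- **[Doc-Parser System Architecture](docs/doc-parser-architecture.md)** - Tool for automatically generating proposal configurations
 - **[Lessons Learned](docs/lessons-learned.md)** - Taro project development experience, best practices, and common pitfalls
 - **[CLAUDE.md](CLAUDE.md)** - Project development guide (for use by Claude Code)
 - **[Documentation Index](docs/README.md)** - Project documentation index and usage tips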
@@ -55,7 +56,7 @@ pnpm lint
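 ### Recent Highlights
-- **Multi-product document parsing** - Automatically detects and splits documents that contain multiple insurance products
+- **Doc-parser system** - Automatically generates proposal configurations from PDF/DOCX (supports splitting multi-product documents)
 - **Schema-driven proposals** - Field configuration for savings/life/critical-illness templates
 - **Standardized Git workflow** - Uses standard-version + Conventional Commits
 - **Improved authentication** - Automatic refresh on 401, login permission checks, TabBar red-dot badge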
@@ -270,7 +271,44 @@ export default {
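 - ✅ Every parameter has an `@param` description
 - ✅ The return value has an `@returns` description
-## 🔧 Optional Features
+## 🔧 Development Tools
+
+### Doc-Parser Tool
+
+Automatically extracts configuration from insurance product documents (PDF/DOCX) and generates proposal templates: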
+
+```bash
+# Parse all pending documents
+pnpm parse:docs
+
+# Parse a specific file
+pnpm parse:docs -- --file=产品说明书.pdf
+
+# List pending documents
+pnpm parse:docs -- --list
+
+# Apply a configuration that has passed review
+pnpm parse:docs -- --apply=计划书模版4
+
+# Preview changes (without modifying anything)
+pnpm parse:docs -- --apply=计划书模版4 --dry-run
+
+# Show configuration status
+pnpm parse:docs -- --status
+```
+
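+**Core capabilities**:
+- 📄 Supports PDF, DOCX, TXT, and MD formats
+- 🔄 Automatically detects and splits multi-product documents
+- 🤖 Smart field extraction (8 core fields)
+- ✅ Manual review workflow
+- 💾 Automatic backup and rollback
+
+**Detailed documentation**: [Doc-Parser System Architecture](docs/doc-parser-architecture.md)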
+
+---
+
+### Optional Feature Components
 The following features can be used or removed depending on project needs:
@@ -281,11 +319,24 @@ export default {
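 ## ✅ Optimization Suggestions
-- Consider connecting the doc-parser script to a real AI parsing service to replace the mock configuration
+### Doc-Parser System
-- Consider adding script output to parse:docs that validates configuration in one step
+
+| Priority | Item | Description |
+|--------|--------|------|
+| 🔴 P0 | Enable the AI service | Configure `AI_SERVICE_TYPE` to improve parsing accuracy on complex documents |
+| 🟡 P1 | Complete .doc support | Convert using antiword or LibreOffice |
+| 🟡 P1 | Add automated tests | Add test cases in parse-docs.test.js |
+| 🟢 P2 | Add OCR capability | Support parsing scanned documents (Tesseract.js) |
+
+### Project-Wide
+
+1. Keep the API integration log and its mapping to page modules up to date
+2. Document more error scenarios for the document preview and video playback pages
+3. Keep page entry points in sync with permission policies, so an entry point is never shown when permissions disagree
 ## 📚 Related Documentation
+- **[Doc-Parser System Architecture](docs/doc-parser-architecture.md)** - Detailed guide to the proposal configuration automation tool
 - **[Lessons Learned](docs/lessons-learned.md)** - Taro project development experience, best practices, and common pitfalls
 - **[CLAUDE.md](CLAUDE.md)** - Project development guide (for use by Claude Code)
 - **[Pending Doc-Parser Documents](docs/to-parse/README.md)** - Doc-parser samples and script usage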
--- a/docs/doc-parser-architecture.md 0 → 100644
+++ b/docs/doc-parser-architecture.md 0 → 100644
--- a/docs/to-parse/README.md
+++ b/docs/to-parse/README.md
--- a/scripts/doc-parser/QUICKSTART.md deleted 100644 → 0
+++ b/scripts/doc-parser/QUICKSTART.md deleted 100644 → 0
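-# OpenAPI-to-API Document Generator - Quick Start
-
-## 🎯 Get Started in One Minute
-
-### 1️⃣ Create an OpenAPI document
-
-Create the module and endpoint documents under the `docs/api-specs/` directory:
-
-```bash
-# Create a new module
-mkdir -p docs/api-specs/product
-
-# Create the endpoint document
-touch docs/api-specs/product/getList.md
-```
-
-### 2️⃣ Write the OpenAPI specification
-
-Edit `getList.md`:
-
-```markdown
-# Get product list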
-
-## OpenAPI Specification
-
-\```yaml
-openapi: 3.0.1
-info:
-  title: ''
-  version: 1.0.0
-paths:
-  /srv/:
-    get:
-      summary: Get product list
-      tags:
-        - Product
-      parameters:
-        - name: a
-          in: query
-          example: product_list
-        - name: f
-          in: query
-          example: behalo
-      responses:
-        '200':
-          description: Success
-\```
-```
-
-### 3️⃣ Generate the API files
-
-```bash
-pnpm api:generate
-```
-
-### 4️⃣ Use the generated API
-
-```javascript
-import { getListAPI } from '@/api/product';
-
-const result = await getListAPI({ page: 1, pageSize: 10 });
-```
-
-## ✅ Verify the Result
-
-Run the test script to verify the generated files:
-
-```bash
-node scripts/test-generate.js
-```
-
-## 📂 File Structure
-
-```
-manulife-weapp/
-├── docs/
-│   ├── api-specs/            # Source directory for API spec documents
-│   │   └── user/             # Module directory
-│   │       └── getUserInfo.md
-│   ├── OPENAPI_TO_API_GUIDE.md  # Detailed usage guide
-│   └── API_USAGE_EXAMPLES.md    # API usage examples
-├── scripts/
-│   ├── generateApiFromOpenAPI.js  # Core generator script
-│   └── test-generate.js           # Test script
-├── src/
-│   └── api/                  # Directory for generated API files
-│       ├── user.js           # Auto-generated
-│       ├── wx/
-│       └── index.js
-└── package.json              # Contains the api:generate command
-```
-
-## 🔄 Workflow
-
-```mermaid
-graph LR
-    A[Write OpenAPI document] --> B[Run pnpm api:generate]
-    B --> C[Generate API files]
-    C --> D[Use in the project]
-    D --> E[Endpoint needs changes]
-    E --> A
-```
-
-## 🎨 Common Scenarios
-
-### Scenario 1: Generate multiple endpoints in bulk
-
-```bash
-docs/api-specs/
-├── user/
-│   ├── getUserInfo.md
-│   ├── updateProfile.md
-│   └── changePassword.md
-└── order/
-    ├── getList.md
-    └── getDetail.md
-```
-
-Running `pnpm api:generate` produces:
-
-```
-src/api/
-├── user.js       # Contains 3 endpoints
-└── order.js      # Contains 2 endpoints
-```
-
-### Scenario 2: Update an existing endpoint
-
-1. Edit `docs/api-specs/user/getUserInfo.md`
-2. Run `pnpm api:generate`
-3. `src/api/user.js` is updated automatically
-
-### Scenario 3: Add a new module
-
-1. Create `docs/api-specs/payment/`
-2. Add the endpoint documents
-3. Run the generate command
-4. `src/api/payment.js` is generated automatically
-
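-## ⚙️ Configuration and Customization
-
-### Change the output directory
-
-Edit `scripts/generateApiFromOpenAPI.js`:
-
-```javascript
-const outputDir = path.resolve(__dirname, '../src/api');
-// Change to the directory you want
-const outputDir = path.resolve(__dirname, '../src/apis');
-```
-
-### Change the naming rules
-
-Edit the `toCamelCase()` or `toPascalCase()` function.
-
-### Change the generation template
-
-Edit the `generateApiFileContent()` function.
-
-## 🐛 Debugging Tips
-
-### Enable verbose logging
-
-Add more console.log calls to the script:
-
-```javascript
-console.log('Parsed API info:', JSON.stringify(apiInfo, null, 2));
-```
-
-### Test a single module in isolation
-
-Modify the module filtering logic in the script.
-
-### Inspect the generated intermediate data
-
-Add debug output to inspect the YAML parsing result.
-
-## 📞 Getting Help
-
-- Detailed guide: [OpenAPI-to-API Document Generator Guide](./OPENAPI_TO_API_GUIDE.md)
-- Usage examples: [API Usage Examples](./API_USAGE_EXAMPLES.md)
-- Project architecture: [CLAUDE.md](../CLAUDE.md)
-
-## 🎉 Get Started
-
-Now you're ready! Go ahead and create your first OpenAPI document.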
-
-```bash
-# 1. Create the module directory
-mkdir -p docs/api-specs/your-module
-
-# 2. Create the endpoint document (see docs/api-specs/user/getUserInfo.md)
-
-# 3. Generate the API
-pnpm api:generate
-
-# 4. View the generated file
-cat src/api/your-module.js
-
-# 5. Start using it
-```
-
-Happy coding! 🚀
--- a/scripts/doc-parser/parse-docs.js
+++ b/scripts/doc-parser/parse-docs.js
@@ -42,6 +42,8 @@ import { splitByProducts, findProductTitles, generateSplitReport } from './produ
 // ========== Configuration ==========
 const DOCS_DIR = path.resolve(process.cwd(), 'docs/to-parse')
+const DOCS_PREPROCESSED_DIR = path.resolve(process.cwd(), 'docs/to-parse/preprocessed')
+const DOCS_RAW_DIR = path.resolve(process.cwd(), 'docs/to-parse/raw')
 const DOCS_ARCHIVE_DIR = path.resolve(process.cwd(), 'docs/to-parse/archived')
 const CONFIG_FILE = path.resolve(process.cwd(), 'src/config/plan-templates.js')
 const BACKUP_DIR = path.resolve(process.cwd(), 'docs/parsed-backup')
@@ -49,6 +51,29 @@ const BACKUP_DIR = path.resolve(process.cwd(), 'docs/parsed-backup')
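 // Supported document formats
 const SUPPORTED_EXTENSIONS = ['.pdf', '.doc', '.docx', '.txt', '.md']
+/**
+ * Detect the document source
+ *
+ * @description Determine whether a document is a preprocessed MD file or an original document
+ * @param {string} filePath - Document path
+ * @returns {{source: string, type: string}} Source information
+ */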
+function detectDocumentSource(filePath) {
+  if (filePath.includes('preprocessed')) {
+    return { source: 'preprocessed', type: 'markdown' }
+  }
+  if (filePath.includes('raw')) {
+    return { source: 'raw', type: 'original' }
+  }
+  // Infer from the file extension
+  const ext = path.extname(filePath).toLowerCase()
+  if (ext === '.md') {
+    // MD files may have been preprocessed
+    return { source: 'likely-preprocessed', type: 'markdown' }
+  }
+  return { source: 'unknown', type: 'original' }
+}
+
 const ajv = new Ajv({ allErrors: true, strict: false })
 const parseConfigSchema = {
     type: 'object',
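Review note on `detectDocumentSource`: the `includes` checks match anywhere in the path, so a file such as `docs/to-parse/drawing.pdf` would be classified as `raw` because `drawing` contains the substring `raw`. A minimal sketch of a segment-based alternative follows. The name `detectSourceBySegment` is hypothetical and not part of this commit; it assumes the `preprocessed` and `raw` directory names from the configuration constants above.

```javascript
const path = require('node:path')

// Hypothetical helper (not part of the commit): classify a document by the
// directory segments of its path instead of by substring matching.
function detectSourceBySegment(filePath) {
  const segments = path.normalize(filePath).split(path.sep)
  const dirs = segments.slice(0, -1) // ignore the file name itself

  if (dirs.includes('preprocessed')) {
    return { source: 'preprocessed', type: 'markdown' }
  }
  if (dirs.includes('raw')) {
    return { source: 'raw', type: 'original' }
  }
  // Same extension fallback as the function in the diff
  const ext = path.extname(filePath).toLowerCase()
  if (ext === '.md') {
    return { source: 'likely-preprocessed', type: 'markdown' }
  }
  return { source: 'unknown', type: 'original' }
}

console.log(detectSourceBySegment('docs/to-parse/drawing.pdf').source)
console.log(detectSourceBySegment('docs/to-parse/raw/a.pdf').source)
```

A parent directory named exactly `raw` or `preprocessed` elsewhere in an absolute path would still match, so comparing against `DOCS_RAW_DIR` and `DOCS_PREPROCESSED_DIR` directly may be stricter still.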
@@ -214,23 +239,45 @@ function writeFile(filePath, content) {
 /**
  * Get all pending documents
+ *
+ * @description Scan multiple directories for pending documents and sort them by priority
+ * @returns {Array<{name: string, fullPath: string, ext: string, size: number, source: string}>} Document list
  */
 function getDocsToParse() {
-  if (!fs.existsSync(DOCS_DIR)) {
+  const docs = []
-    console.log('📂 Document folder does not exist:', DOCS_DIR)
+  const directories = [
-    return []
+    { path: DOCS_DIR, source: 'root' },
+    { path: DOCS_PREPROCESSED_DIR, source: 'preprocessed' },
+    { path: DOCS_RAW_DIR, source: 'raw' }
+  ]
+
+  for (const dir of directories) {
+    if (!fs.existsSync(dir.path)) {
+      continue
     }
-  const files = fs.readdirSync(DOCS_DIR)
+    const files = fs.readdirSync(dir.path)
-  return files
+    const dirDocs = files
       .filter(file => SUPPORTED_EXTENSIONS.includes(path.extname(file).toLowerCase()))
       .filter(file => file !== 'README.md')
       .map(file => ({
         name: file,
-      fullPath: path.join(DOCS_DIR, file),
+        fullPath: path.join(dir.path, file),
         ext: path.extname(file).toLowerCase(),
-      size: fs.statSync(path.join(DOCS_DIR, file)).size
+        size: fs.statSync(path.join(dir.path, file)).size,
+        source: dir.source
       }))
+
+    docs.push(...dirDocs)
+  }
+
+  // Process preprocessed MD files first, then original documents
+  docs.sort((a, b) => {
+    const priorityOrder = { preprocessed: 1, root: 2, raw: 3 }
+    return priorityOrder[a.source] - priorityOrder[b.source]
+  })
+
+  return docs
 }
 /**
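Review note on the sort in `getDocsToParse`: the three sources used in this commit all have an entry in `priorityOrder`, so the comparator works as written. If another source were added later without a priority, the subtraction would yield `NaN`. A hedged sketch with an explicit fallback follows; `sortBySourcePriority` and `PRIORITY` are illustrative names, not part of the commit.

```javascript
// Same priorities as in the diff: lower number is processed first.
const PRIORITY = { preprocessed: 1, root: 2, raw: 3 }

// Hypothetical helper: returns a sorted copy and sends sources without a
// priority entry to the end instead of producing NaN in the comparator.
function sortBySourcePriority(docs) {
  const rank = (doc) => PRIORITY[doc.source] ?? Number.MAX_SAFE_INTEGER
  return [...docs].sort((a, b) => rank(a) - rank(b))
}

const sample = [
  { name: 'a.pdf', source: 'raw' },
  { name: 'b.md', source: 'preprocessed' },
  { name: 'c.docx', source: 'root' },
  { name: 'd.md', source: 'preprocessed' }
]
console.log(sortBySourcePriority(sample).map((doc) => doc.name))
```

Because `Array.prototype.sort` is stable in current Node.js versions, documents from the same source keep the order in which the directory listing returned them.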
@@ -367,10 +414,15 @@ function formatSize(size) {
  */
 async function parseDocumentWithMarkitdown(docPath) {
   const ext = path.extname(docPath).toLowerCase()
+  const sourceInfo = detectDocumentSource(docPath)
   // MD and TXT files are read directly; markitdown is not needed
   if (ext === '.md' || ext === '.txt') {
+    if (sourceInfo.source === 'preprocessed' || sourceInfo.source === 'likely-preprocessed') {
+      console.log(`⚡ Preprocessed MD file, skipping markitdown: ${path.basename(docPath)}`)
+    } else {
       console.log(`📄 Reading text file directly: ${path.basename(docPath)}`)
+    }
     return buildExtractResult(docPath, fs.readFileSync(docPath, 'utf-8'), [])
   }
@@ -707,8 +759,17 @@ function inferCurrency(content) {
  */
 async function parseSingleFile(filePath) {
   const fileName = path.basename(filePath)
+  const sourceInfo = detectDocumentSource(filePath)
+  const sourceLabel = {
+    preprocessed: '⚡ Preprocessed document',
+    raw: '📄 Original document',
+    root: '📂 Root-directory document',
+    'likely-preprocessed': '⚡ MD document',
+    unknown: '📄 Document'
+  }[sourceInfo.source] || '📄 Document'
+
   console.log("\n" + "=".repeat(60))
-  console.log("📄 Processing file: " + fileName)
+  console.log(`📄 ${sourceLabel}: ${fileName}`)
   console.log("=".repeat(60))
   // Parse the document (may return a single config or a configs array)
@@ -1799,15 +1860,33 @@ async function main() {
     applyAuditFile(auditFileName, applyOptions)
   } else if (listMode) {
     // List mode
-    const docs = getDocsToParse()
     console.log("\n📋 Pending documents:")
     if (docs.length === 0) {
       console.log('  (no documents)')
     } else {
-      docs.forEach((doc, index) => {
+      // Display grouped by source
-        console.log(" " + (index + 1) + ". " + doc.name + " (" + formatSize(doc.size) + ")")
+      const grouped = {
+        preprocessed: docs.filter(d => d.source === 'preprocessed'),
+        root: docs.filter(d => d.source === 'root'),
+        raw: docs.filter(d => d.source === 'raw')
+      }
+
+      for (const [source, sourceDocs] of Object.entries(grouped)) {
+        if (sourceDocs.length === 0) continue
+
+        const sourceLabel = {
+          preprocessed: '⚡ Preprocessed (preprocessed/)',
+          root: '📂 Root directory (docs/to-parse/)',
+          raw: '📄 Original documents (raw/)'
+        }[source]
+
+        console.log(`\n${sourceLabel}`)
+        sourceDocs.forEach((doc, index) => {
+          const sourceTag = doc.ext === '.md' ? ' [MD]' : ''
+          console.log(`  ${index + 1}. ${doc.name}${sourceTag} (${formatSize(doc.size)})`)
         })
       }
+    }
   } else if (fileMode) {
     // Single-file mode
     const fileName = fileMode.split('=')[1]
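Review note on `fileMode.split('=')[1]`: splitting on `=` keeps only the text between the first and second `=`, so a file name that itself contains `=` would be truncated. A minimal sketch that slices off the flag prefix instead follows; `getFlagValue` is a hypothetical helper, not part of the commit, and assumes flags of the form `--name=value` as used in the README commands.

```javascript
// Hypothetical helper: return everything after `--name=` for the first
// matching argument, or undefined when the flag is absent.
function getFlagValue(argv, name) {
  const prefix = `--${name}=`
  const arg = argv.find((candidate) => candidate.startsWith(prefix))
  return arg === undefined ? undefined : arg.slice(prefix.length)
}

const argv = ['--file=rates=2024.pdf', '--dry-run']
console.log(getFlagValue(argv, 'file'))   // value keeps the inner "="
console.log(argv[0].split('=')[1])        // split-based value stops at it
```

In the script this would be called with `process.argv.slice(2)`; whether such file names occur in practice depends on the documents placed in `docs/to-parse/`.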