# InsightFlow Phase 7 - Multimodal Support API Documentation

## Overview

The Phase 7 multimodal support module adds video and image processing to InsightFlow. It supports:

1. **Video processing**: extracting audio and keyframes, and running OCR on the frames
2. **Image processing**: recognizing the content of whiteboards, PPT slides, handwritten notes, and similar images
3. **Multimodal entity linking**: cross-modal entity alignment and knowledge fusion

## New API Endpoints

### Video Processing

#### Upload Video

```
POST /api/v1/projects/{project_id}/upload-video
```

**Parameters:**

- `file` (required): the video file
- `extract_interval` (optional): keyframe extraction interval in seconds; defaults to 5

**Response:**

```json
{
  "video_id": "abc123",
  "project_id": "proj456",
  "filename": "meeting.mp4",
  "status": "completed",
  "audio_extracted": true,
  "frame_count": 24,
  "ocr_text_preview": "Meeting content preview...",
  "message": "Video processed successfully"
}
```

#### List Project Videos

```
GET /api/v1/projects/{project_id}/videos
```

**Response:**

```json
[
  {
    "id": "abc123",
    "filename": "meeting.mp4",
    "duration": 120.5,
    "fps": 30.0,
    "resolution": {"width": 1920, "height": 1080},
    "ocr_preview": "Meeting content...",
    "status": "completed",
    "created_at": "2024-01-15T10:30:00"
  }
]
```

#### Get Video Keyframes

```
GET /api/v1/videos/{video_id}/frames
```

**Response:**

```json
[
  {
    "id": "frame001",
    "frame_number": 1,
    "timestamp": 0.0,
    "image_url": "/tmp/frames/video123/frame_000001_0.00.jpg",
    "ocr_text": "Page 1 content...",
    "entities": [{"name": "Project Alpha", "type": "PROJECT"}]
  }
]
```

### Image Processing

#### Upload Image

```
POST /api/v1/projects/{project_id}/upload-image
```

**Parameters:**

- `file` (required): the image file
- `detect_type` (optional): whether to detect the image type automatically; defaults to `true`

**Response:**

```json
{
  "image_id": "img789",
  "project_id": "proj456",
  "filename": "whiteboard.jpg",
  "image_type": "whiteboard",
  "ocr_text_preview": "Whiteboard content...",
  "description": "This is a whiteboard image. Content summary: ...",
  "entity_count": 5,
  "status": "completed"
}
```

#### Batch Upload Images

```
POST /api/v1/projects/{project_id}/upload-images-batch
```

**Parameters:**

- `files` (required): multiple image files

**Response:**

```json
{
  "project_id": "proj456",
  "total_count": 3,
  "success_count": 3,
  "failed_count": 0,
  "results": [
    {
      "image_id": "img001",
      "status": "success",
      "image_type": "ppt",
      "entity_count": 4
    }
  ]
}
```

#### List Project Images

```
GET /api/v1/projects/{project_id}/images
```

### Multimodal Entity Linking

#### Cross-Modal Entity Alignment

```
POST /api/v1/projects/{project_id}/multimodal/align
```

**Parameters:**

- `threshold` (optional): similarity threshold; defaults to 0.85

**Response:**

```json
{
  "project_id": "proj456",
  "aligned_count": 5,
  "links": [
    {
      "link_id": "link001",
      "source_entity_id": "ent001",
      "target_entity_id": "ent002",
      "source_modality": "video",
      "target_modality": "document",
      "link_type": "same_as",
      "confidence": 0.95,
      "evidence": "Cross-modal alignment: exact"
    }
  ],
  "message": "Successfully aligned 5 cross-modal entity pairs"
}
```

#### Get Multimodal Statistics

```
GET /api/v1/projects/{project_id}/multimodal/stats
```

**Response:**

```json
{
  "project_id": "proj456",
  "video_count": 3,
  "image_count": 10,
  "multimodal_entity_count": 25,
  "cross_modal_links": 8,
  "modality_distribution": {
    "audio": 15,
    "video": 8,
    "image": 12,
    "document": 20
  }
}
```

#### Get Multimodal Mentions of an Entity

```
GET /api/v1/entities/{entity_id}/multimodal-mentions
```

**Response:**

```json
[
  {
    "id": "mention001",
    "entity_id": "ent001",
    "entity_name": "Project Alpha",
    "modality": "video",
    "source_id": "video123",
    "source_type": "video_frame",
    "text_snippet": "Project Alpha progress",
    "confidence": 1.0,
    "created_at": "2024-01-15T10:30:00"
  }
]
```

#### Suggest Multimodal Entity Merges

```
GET /api/v1/projects/{project_id}/multimodal/suggest-merges
```

**Response:**

```json
{
  "project_id": "proj456",
  "suggestion_count": 3,
  "suggestions": [
    {
      "entity1": {"id": "ent001", "name": "K8s", "type": "TECH"},
      "entity2": {"id": "ent002", "name": "Kubernetes", "type": "TECH"},
      "similarity": 0.95,
      "match_type": "alias_match",
      "suggested_action": "merge"
    }
  ]
}
```

## Database Schema

### `videos` table

Stores video file information.

- `id`: video ID
- `project_id`: ID of the owning project
- `filename`: file name
- `duration`: video duration in seconds
- `fps`: frame rate
- `resolution`: resolution (JSON)
- `audio_transcript_id`: ID of the associated audio transcript
- `full_ocr_text`: OCR text from all frames, concatenated
- `extracted_entities`: extracted entities (JSON)
- `extracted_relations`: extracted relations (JSON)
- `status`: processing status

### `video_frames` table

Stores keyframe information.

- `id`: frame ID
- `video_id`: ID of the owning video
- `frame_number`: frame sequence number
- `timestamp`: timestamp in seconds
- `image_url`: image URL or file path
- `ocr_text`: OCR-recognized text
- `extracted_entities`: entities extracted from this frame

### `images` table
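Stores image file information.

- `id`: image ID
- `project_id`: ID of the owning project
- `filename`: file name
- `ocr_text`: OCR-recognized text
- `description`: image description
- `extracted_entities`: extracted entities
- `extracted_relations`: extracted relations
- `status`: processing status

### `multimodal_mentions` table

Stores mentions of entities across modalities.

- `id`: mention ID
- `project_id`: ID of the owning project
- `entity_id`: entity ID
- `modality`: modality type (audio/video/image/document)
- `source_id`: source ID
- `source_type`: source type
- `text_snippet`: text snippet
- `confidence`: confidence score

### `multimodal_entity_links` table

Stores cross-modal entity links.

- `id`: link ID
- `entity_id`: entity ID
- `linked_entity_id`: ID of the linked entity
- `link_type`: link type (same_as/related_to/part_of)
- `confidence`: confidence score
- `evidence`: evidence for the link
- `modalities`: list of modalities involved

## Installing Dependencies

```bash
pip install ffmpeg-python pillow opencv-python pytesseract
```

`ffmpeg-python` is only a Python wrapper: the `ffmpeg` executable itself must also be installed and on `PATH`.

Note: the OCR features also require the Tesseract OCR engine (`tesseract-ocr-chi-sim` adds Simplified Chinese recognition):

- Ubuntu/Debian: `sudo apt-get install tesseract-ocr tesseract-ocr-chi-sim`
- macOS: `brew install tesseract tesseract-lang`
- Windows: download the installer from https://github.com/UB-Mannheim/tesseract/wiki

## Environment Variables

```bash
# Optional: custom temporary directory
export INSIGHTFLOW_TEMP_DIR=/path/to/temp

# Optional: Tesseract executable path (Windows). Quote the value because it
# contains a space and backslashes.
export TESSERACT_CMD='C:\Program Files\Tesseract-OCR\tesseract.exe'
```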
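A service could resolve the two environment variables above with something like the following minimal sketch. It assumes the temporary directory falls back to the system default when `INSIGHTFLOW_TEMP_DIR` is unset. The helper names `resolve_temp_dir` and `configure_tesseract` are hypothetical. `pytesseract.pytesseract.tesseract_cmd` is the module attribute pytesseract reads to locate the Tesseract executable.

```python
import os
import tempfile
from typing import Optional


def resolve_temp_dir() -> str:
    """Return INSIGHTFLOW_TEMP_DIR if set, otherwise the system temp directory (hypothetical helper)."""
    path = os.environ.get("INSIGHTFLOW_TEMP_DIR") or tempfile.gettempdir()
    os.makedirs(path, exist_ok=True)  # create the directory on first use
    return path


def configure_tesseract() -> Optional[str]:
    """Point pytesseract at TESSERACT_CMD when that variable is set (hypothetical helper).

    Returns the configured path, or None when the variable is unset and
    pytesseract should find `tesseract` on PATH instead.
    """
    cmd = os.environ.get("TESSERACT_CMD")
    if not cmd:
        return None
    try:
        import pytesseract  # optional dependency from the install step
    except ImportError:
        return cmd  # pytesseract is not installed, so there is nothing to configure
    pytesseract.pytesseract.tesseract_cmd = cmd
    return cmd
```

The sketch leaves `TESSERACT_CMD` optional on purpose. On Linux and macOS the package managers put `tesseract` on `PATH`, so only Windows installs usually need the override.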
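The upload-video endpoint's `extract_interval` parameter, together with the sample keyframe path `frame_000001_0.00.jpg` (frame number 1 at timestamp 0.0), suggests a fixed-interval sampling schedule. The following minimal sketch computes such a schedule under several assumptions:

- Frames are taken at 0, `extract_interval`, 2 * `extract_interval`, and so on, for as long as the timestamp stays below the video duration.
- Frame numbers are 1-based.
- File names follow the `frame_<6-digit number>_<timestamp>.jpg` pattern seen in the sample.

`keyframe_schedule` and `KeyframeSlot` are hypothetical names, and the sketch omits actual frame capture (for example with OpenCV).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KeyframeSlot:
    frame_number: int  # 1-based, like frame_number in the video_frames table
    timestamp: float   # seconds from the start of the video
    filename: str      # e.g. frame_000001_0.00.jpg


def keyframe_schedule(duration: float, extract_interval: float = 5.0) -> List[KeyframeSlot]:
    """Plan one keyframe every `extract_interval` seconds (hypothetical helper)."""
    if extract_interval <= 0:
        raise ValueError("extract_interval must be positive")
    slots = []
    n = 0
    while True:
        # Multiply instead of accumulating to avoid floating-point drift.
        t = n * extract_interval
        if t >= duration:
            break
        number = n + 1
        slots.append(KeyframeSlot(number, t, f"frame_{number:06d}_{t:.2f}.jpg"))
        n += 1
    return slots
```

With the default 5-second interval, a 12-second clip yields three keyframes, at 0, 5, and 10 seconds. Each timestamp is computed as `n * extract_interval` rather than by repeatedly adding the interval, so rounding error does not accumulate over long videos.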
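Two details imply that candidate entity pairs are scored and kept only when they clear a cutoff: the alignment endpoint's `threshold` parameter (default 0.85), and the match labels in the responses (`exact` in `evidence`, `alias_match` in merge suggestions). The scoring method itself is not documented here. The following is one plausible sketch under these assumptions:

- Only entities of the same type from different modalities are compared.
- Case-insensitive exact matches score 1.0.
- A hand-maintained alias table scores 0.95, mirroring the K8s/Kubernetes example.
- Everything else falls back to `difflib.SequenceMatcher` string similarity.

The alias table, the 0.95 alias score, and the names `score` and `align_entities` are all hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

# Hypothetical alias table: normalized alias -> normalized canonical name.
ALIASES: Dict[str, str] = {"k8s": "kubernetes"}


@dataclass
class Entity:
    id: str
    name: str
    type: str
    modality: str  # audio / video / image / document


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace."""
    return " ".join(name.lower().split())


def score(a: str, b: str) -> Tuple[float, str]:
    """Return (similarity, match_type) for two entity names."""
    na, nb = normalize(a), normalize(b)
    if na == nb:
        return 1.0, "exact"
    if ALIASES.get(na, na) == ALIASES.get(nb, nb):
        return 0.95, "alias_match"  # assumed fixed score for alias hits
    return SequenceMatcher(None, na, nb).ratio(), "fuzzy"


def align_entities(entities: List[Entity], threshold: float = 0.85) -> List[dict]:
    """Link same-type entities from different modalities whose score >= threshold."""
    links = []
    for i, e1 in enumerate(entities):
        for e2 in entities[i + 1:]:
            if e1.modality == e2.modality or e1.type != e2.type:
                continue  # only cross-modal pairs of the same type
            similarity, match_type = score(e1.name, e2.name)
            if similarity >= threshold:
                links.append({
                    "source_entity_id": e1.id,
                    "target_entity_id": e2.id,
                    "source_modality": e1.modality,
                    "target_modality": e2.modality,
                    "link_type": "same_as",
                    "confidence": round(similarity, 2),
                    "evidence": f"Cross-modal alignment: {match_type}",
                })
    return links
```

In this sketch, "K8s" and "Kubernetes" are linked through the alias table at 0.95. "Project Alpha" and "Project Beta" score only 0.72 with `SequenceMatcher`, so they are dropped at the default threshold. They would be linked only if the caller lowered `threshold` below that value.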