
# InsightFlow Phase 7 - Multimodal Support API Reference

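## Overview

The Phase 7 multimodal support module adds video and image processing to InsightFlow. It provides:

1. **Video processing**: extracts the audio track and keyframes, and runs OCR on the frames
2. **Image processing**: recognizes content such as whiteboards, slides (PPT), and handwritten notes
3. **Multimodal entity linking**: aligns entities across modalities and fuses the resulting knowledge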

## New API Endpoints

### Video Processing

#### Upload a Video
```
POST /api/v1/projects/{project_id}/upload-video
```

**Parameters:**
- `file` (required): the video file
- `extract_interval` (optional): keyframe extraction interval in seconds; defaults to 5

**Response:**
```json
{
  "video_id": "abc123",
  "project_id": "proj456",
  "filename": "meeting.mp4",
  "status": "completed",
  "audio_extracted": true,
  "frame_count": 24,
  "ocr_text_preview": "会议内容预览...",
  "message": "Video processed successfully"
}
```
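The endpoint accepts a file upload, so a client has to send `multipart/form-data`. The sketch below builds such a request with only the Python standard library, without sending it. `BASE_URL` and `build_upload_video_request` are hypothetical names, and sending `extract_interval` as a form field (rather than a query parameter) is an assumption, since this document does not say where the parameter goes.

```python
import mimetypes
import urllib.request
import uuid

# Hypothetical server address; replace with your InsightFlow deployment.
BASE_URL = "http://localhost:8000"


def build_upload_video_request(project_id: str, filename: str, data: bytes,
                               extract_interval: float = 5.0) -> urllib.request.Request:
    """Build (but do not send) a multipart POST to the upload-video endpoint."""
    boundary = uuid.uuid4().hex
    ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    body = b"".join([
        # Assumption: extract_interval travels as a form field next to the file.
        (f"--{boundary}\r\n"
         'Content-Disposition: form-data; name="extract_interval"\r\n\r\n'
         f"{extract_interval}\r\n").encode(),
        # The video itself, under the documented field name "file".
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         f"Content-Type: {ctype}\r\n\r\n").encode(),
        data,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/projects/{project_id}/upload-video",
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


req = build_upload_video_request("proj456", "meeting.mp4", b"<video bytes>", extract_interval=2.0)
# To send it and decode the JSON response:
#   import json
#   with urllib.request.urlopen(req) as resp:
#       result = json.load(resp)
```

In practice a library such as `requests` handles the multipart encoding for you; the hand-built body here only makes the wire format visible.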

#### List Project Videos
```
GET /api/v1/projects/{project_id}/videos
```

**Response:**
```json
[
  {
    "id": "abc123",
    "filename": "meeting.mp4",
    "duration": 120.5,
    "fps": 30.0,
    "resolution": {"width": 1920, "height": 1080},
    "ocr_preview": "会议内容...",
    "status": "completed",
    "created_at": "2024-01-15T10:30:00"
  }
]
```
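A client may want this list as typed objects rather than raw dicts. The sketch below maps each item of the sample response onto a hypothetical `VideoSummary` dataclass; field names follow the JSON above, and `created_at` is parsed as a naive `datetime` because the sample carries no time zone.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class VideoSummary:
    id: str
    filename: str
    duration: float      # seconds
    fps: float
    width: int
    height: int
    ocr_preview: str
    status: str
    created_at: datetime  # naive: the sample timestamp has no offset


def parse_video_list(payload: str) -> list[VideoSummary]:
    """Parse the JSON array returned by GET /projects/{project_id}/videos."""
    return [
        VideoSummary(
            id=v["id"],
            filename=v["filename"],
            duration=float(v["duration"]),
            fps=float(v["fps"]),
            width=int(v["resolution"]["width"]),
            height=int(v["resolution"]["height"]),
            ocr_preview=v.get("ocr_preview", ""),
            status=v["status"],
            created_at=datetime.fromisoformat(v["created_at"]),
        )
        for v in json.loads(payload)
    ]
```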

#### Get Video Keyframes
```
GET /api/v1/videos/{video_id}/frames
```

**Response:**
```json
[
  {
    "id": "frame001",
    "frame_number": 1,
    "timestamp": 0.0,
    "image_url": "/tmp/frames/video123/frame_000001_0.00.jpg",
    "ocr_text": "第一页内容...",
    "entities": [{"name": "Project Alpha", "type": "PROJECT"}]
  }
]
```
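The sample `image_url` appears to encode the frame number (zero-padded to six digits) and the timestamp (two decimals) in the file name. Assuming that naming convention holds, which this document does not guarantee, the helpers below parse and rebuild such names; `parse_frame_name` and `frame_name` are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Pattern inferred from the single sample path above; it is an assumption.
FRAME_NAME = re.compile(r"^frame_(\d{6})_(\d+\.\d{2})\.jpg$")


def parse_frame_name(image_url: str) -> tuple[int, float]:
    """Return (frame_number, timestamp_seconds) encoded in a frame file name."""
    name = PurePosixPath(image_url).name
    match = FRAME_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized frame file name: {name}")
    return int(match.group(1)), float(match.group(2))


def frame_name(frame_number: int, timestamp: float) -> str:
    """Build a file name in the same inferred format."""
    return f"frame_{frame_number:06d}_{timestamp:.2f}.jpg"
```

Note that `image_url` in the sample is a server-local filesystem path rather than an HTTP URL, so a remote client probably cannot fetch it directly.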

### Image Processing

#### Upload an Image
```
POST /api/v1/projects/{project_id}/upload-image
```

**Parameters:**
- `file` (required): the image file
- `detect_type` (optional): whether to detect the image type automatically; defaults to `true`

**Response:**
```json
{
  "image_id": "img789",
  "project_id": "proj456",
  "filename": "whiteboard.jpg",
  "image_type": "whiteboard",
  "ocr_text_preview": "白板内容...",
  "description": "这是一张白板图片。内容摘要：...",
  "entity_count": 5,
  "status": "completed"
}
```

#### Batch Upload Images
```
POST /api/v1/projects/{project_id}/upload-images-batch
```

**Parameters:**
- `files` (required): one or more image files

**Response:**
```json
{
  "project_id": "proj456",
  "total_count": 3,
  "success_count": 3,
  "failed_count": 0,
  "results": [
    {
      "image_id": "img001",
      "status": "success",
      "image_type": "ppt",
      "entity_count": 4
    }
  ]
}
```
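For a batch upload, the usual multipart convention is to repeat the same field name once per file. The standard-library sketch below builds (without sending) such a request, one `files` part per image; `BASE_URL` and `build_batch_image_request` are hypothetical, and the repeated-field encoding is assumed to be what the server expects for `files`.

```python
import mimetypes
import urllib.request
import uuid

# Hypothetical server address; replace with your InsightFlow deployment.
BASE_URL = "http://localhost:8000"


def build_batch_image_request(project_id: str,
                              files: list[tuple[str, bytes]]) -> urllib.request.Request:
    """Build a multipart POST with one "files" part per (filename, bytes) pair."""
    boundary = uuid.uuid4().hex
    body = bytearray()
    for filename, data in files:
        ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
        body += (f"--{boundary}\r\n"
                 f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'
                 f"Content-Type: {ctype}\r\n\r\n").encode()
        body += data + b"\r\n"
    body += f"--{boundary}--\r\n".encode()  # closing boundary
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/projects/{project_id}/upload-images-batch",
        data=bytes(body),
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


req = build_batch_image_request("proj456", [("slide1.png", b"..."), ("whiteboard.jpg", b"...")])
```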

#### List Project Images
```
GET /api/v1/projects/{project_id}/images
```

### Multimodal Entity Linking

#### Align Entities Across Modalities
```
POST /api/v1/projects/{project_id}/multimodal/align
```

**Parameters:**
- `threshold` (optional): similarity threshold; defaults to 0.85

**Response:**
```json
{
  "project_id": "proj456",
  "aligned_count": 5,
  "links": [
    {
      "link_id": "link001",
      "source_entity_id": "ent001",
      "target_entity_id": "ent002",
      "source_modality": "video",
      "target_modality": "document",
      "link_type": "same_as",
      "confidence": 0.95,
      "evidence": "Cross-modal alignment: exact"
    }
  ],
  "message": "Successfully aligned 5 cross-modal entity pairs"
}
```
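This document does not describe the service's actual matching algorithm. To illustrate what `threshold` controls, the sketch below scores name pairs with `difflib.SequenceMatcher` (a swapped-in stand-in metric) and keeps only pairs from different modalities that score at or above the threshold. The entity dict shape and `align_entities` are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]; a stand-in metric."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def align_entities(entities: list[dict], threshold: float = 0.85) -> list[dict]:
    """Link entity pairs from different modalities whose names score >= threshold."""
    links = []
    for a, b in combinations(entities, 2):
        if a["modality"] == b["modality"]:
            continue  # only cross-modal pairs are candidates
        score = similarity(a["name"], b["name"])
        if score >= threshold:
            links.append({
                "source_entity_id": a["id"],
                "target_entity_id": b["id"],
                "source_modality": a["modality"],
                "target_modality": b["modality"],
                "link_type": "same_as",
                "confidence": round(score, 2),
                "evidence": f"Cross-modal alignment: {'exact' if score == 1.0 else 'fuzzy'}",
            })
    return links
```

Raising `threshold` trades recall for precision: fewer pairs are linked, but those that remain are closer matches.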

#### Get Multimodal Statistics
```
GET /api/v1/projects/{project_id}/multimodal/stats
```

**Response:**
```json
{
  "project_id": "proj456",
  "video_count": 3,
  "image_count": 10,
  "multimodal_entity_count": 25,
  "cross_modal_links": 8,
  "modality_distribution": {
    "audio": 15,
    "video": 8,
    "image": 12,
    "document": 20
  }
}
```
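This document does not say exactly what each `modality_distribution` count measures. In the sample the counts sum to 55, not to `multimodal_entity_count` (25), so they are probably not per-entity totals. The hypothetical helper below turns the counts into relative shares, whatever the counts turn out to measure.

```python
def modality_shares(stats: dict) -> dict[str, float]:
    """Return each modality's fraction of the modality_distribution total."""
    dist = stats.get("modality_distribution", {})
    total = sum(dist.values())
    if total == 0:
        return {modality: 0.0 for modality in dist}
    return {modality: count / total for modality, count in dist.items()}


stats = {"modality_distribution": {"audio": 15, "video": 8, "image": 12, "document": 20}}
shares = modality_shares(stats)
```

With the sample values, `document` accounts for 20 of 55, or about 36.4%.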

#### Get an Entity's Multimodal Mentions
```
GET /api/v1/entities/{entity_id}/multimodal-mentions
```

**Response:**
```json
[
  {
    "id": "mention001",
    "entity_id": "ent001",
    "entity_name": "Project Alpha",
    "modality": "video",
    "source_id": "video123",
    "source_type": "video_frame",
    "text_snippet": "Project Alpha 进度",
    "confidence": 1.0,
    "created_at": "2024-01-15T10:30:00"
  }
]
```
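Each mention records the modality it came from, so a common client-side step is to group an entity's snippets by modality. The hypothetical helper below does that over the response array and can drop low-confidence mentions; the `min_confidence` cutoff is an illustrative parameter, not part of the API.

```python
from collections import defaultdict


def mentions_by_modality(mentions: list[dict],
                         min_confidence: float = 0.0) -> dict[str, list[str]]:
    """Group text snippets by modality, skipping mentions below min_confidence."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for mention in mentions:
        if mention.get("confidence", 0.0) >= min_confidence:
            grouped[mention["modality"]].append(mention["text_snippet"])
    return dict(grouped)
```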

#### Suggest Multimodal Entity Merges
```
GET /api/v1/projects/{project_id}/multimodal/suggest-merges
```

**Response:**
```json
{
  "project_id": "proj456",
  "suggestion_count": 3,
  "suggestions": [
    {
      "entity1": {"id": "ent001", "name": "K8s", "type": "TECH"},
      "entity2": {"id": "ent002", "name": "Kubernetes", "type": "TECH"},
      "similarity": 0.95,
      "match_type": "alias_match",
      "suggested_action": "merge"
    }
  ]
}
```
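The sample suggestion pairs `K8s` with `Kubernetes` via an `alias_match`. One way such a match can arise is an alias table that maps short forms to canonical names. The sketch below shows that idea with a hypothetical `ALIASES` table, only comparing entities of the same type. It does not model the `similarity` score, and the `exact_match` label is hypothetical.

```python
from itertools import combinations

# Hypothetical alias table: lower-cased alias -> canonical lower-cased name.
ALIASES = {"k8s": "kubernetes", "js": "javascript"}


def canonical(name: str) -> str:
    key = name.strip().casefold()
    return ALIASES.get(key, key)


def suggest_merges(entities: list[dict]) -> list[dict]:
    """Pair same-type entities whose names are equal or resolve to the same alias."""
    suggestions = []
    for a, b in combinations(entities, 2):
        if a["type"] != b["type"]:
            continue  # never suggest merging across entity types
        if a["name"].strip().casefold() == b["name"].strip().casefold():
            match_type = "exact_match"  # hypothetical label
        elif canonical(a["name"]) == canonical(b["name"]):
            match_type = "alias_match"
        else:
            continue
        suggestions.append({"entity1": a, "entity2": b,
                            "match_type": match_type, "suggested_action": "merge"})
    return suggestions


entities = [
    {"id": "ent001", "name": "K8s", "type": "TECH"},
    {"id": "ent002", "name": "Kubernetes", "type": "TECH"},
    {"id": "ent003", "name": "Kubernetes", "type": "PROJECT"},
]
```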

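## Database Schema

### videos table
Stores video file metadata.
- `id`: video ID
- `project_id`: ID of the owning project
- `filename`: file name
- `duration`: video duration (seconds)
- `fps`: frame rate
- `resolution`: resolution (JSON)
- `audio_transcript_id`: ID of the associated audio transcript
- `full_ocr_text`: OCR text of all frames, concatenated
- `extracted_entities`: extracted entities (JSON)
- `extracted_relations`: extracted relations (JSON)
- `status`: processing status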

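### video_frames table
Stores video keyframe information.
- `id`: frame ID
- `video_id`: ID of the parent video
- `frame_number`: frame sequence number
- `timestamp`: timestamp (seconds)
- `image_url`: image URL or file path
- `ocr_text`: OCR-recognized text
- `extracted_entities`: entities extracted from this frame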

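### images table
Stores image file metadata.
- `id`: image ID
- `project_id`: ID of the owning project
- `filename`: file name
- `ocr_text`: OCR-recognized text
- `description`: image description
- `extracted_entities`: extracted entities
- `extracted_relations`: extracted relations
- `status`: processing status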
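The three media tables above can be sketched as SQLite DDL. Column types, primary keys, and the `video_frames.video_id` foreign key are assumptions (the lists above give only column names), and JSON columns are stored as `TEXT`. The schema is created in an in-memory database here only to show it is valid.

```python
import sqlite3

# Types, keys, and NOT NULL constraints are assumptions; names follow the lists above.
MEDIA_SCHEMA = """
CREATE TABLE IF NOT EXISTS videos (
    id                  TEXT PRIMARY KEY,
    project_id          TEXT NOT NULL,
    filename            TEXT NOT NULL,
    duration            REAL,   -- seconds
    fps                 REAL,
    resolution          TEXT,   -- JSON, e.g. {"width": 1920, "height": 1080}
    audio_transcript_id TEXT,
    full_ocr_text       TEXT,
    extracted_entities  TEXT,   -- JSON
    extracted_relations TEXT,   -- JSON
    status              TEXT
);
CREATE TABLE IF NOT EXISTS video_frames (
    id                 TEXT PRIMARY KEY,
    video_id           TEXT NOT NULL REFERENCES videos(id),
    frame_number       INTEGER,
    timestamp          REAL,    -- seconds
    image_url          TEXT,
    ocr_text           TEXT,
    extracted_entities TEXT     -- JSON
);
CREATE TABLE IF NOT EXISTS images (
    id                  TEXT PRIMARY KEY,
    project_id          TEXT NOT NULL,
    filename            TEXT NOT NULL,
    ocr_text            TEXT,
    description         TEXT,
    extracted_entities  TEXT,   -- JSON
    extracted_relations TEXT,   -- JSON
    status              TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(MEDIA_SCHEMA)
```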

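### multimodal_mentions table
Stores mentions of entities across modalities.
- `id`: mention ID
- `project_id`: ID of the owning project
- `entity_id`: entity ID
- `modality`: modality type (`audio`/`video`/`image`/`document`)
- `source_id`: source ID
- `source_type`: source type
- `text_snippet`: text snippet
- `confidence`: confidence score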

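### multimodal_entity_links table
Stores cross-modal entity links.
- `id`: link ID
- `entity_id`: entity ID
- `linked_entity_id`: ID of the linked entity
- `link_type`: link type (`same_as`/`related_to`/`part_of`)
- `confidence`: confidence score
- `evidence`: supporting evidence for the link
- `modalities`: list of modalities involved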
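The two linking tables can likewise be sketched as SQLite DDL. Column types are assumptions, and enforcing the documented `modality` and `link_type` values with `CHECK` constraints is an illustrative choice, not something this document says the backend does. `modalities` is stored as a JSON array in a `TEXT` column.

```python
import json
import sqlite3

# Types and CHECK constraints are assumptions; allowed values come from the lists above.
MULTIMODAL_SCHEMA = """
CREATE TABLE IF NOT EXISTS multimodal_mentions (
    id           TEXT PRIMARY KEY,
    project_id   TEXT NOT NULL,
    entity_id    TEXT NOT NULL,
    modality     TEXT NOT NULL CHECK (modality IN ('audio', 'video', 'image', 'document')),
    source_id    TEXT,
    source_type  TEXT,          -- e.g. 'video_frame'
    text_snippet TEXT,
    confidence   REAL
);
CREATE TABLE IF NOT EXISTS multimodal_entity_links (
    id               TEXT PRIMARY KEY,
    entity_id        TEXT NOT NULL,
    linked_entity_id TEXT NOT NULL,
    link_type        TEXT NOT NULL CHECK (link_type IN ('same_as', 'related_to', 'part_of')),
    confidence       REAL,
    evidence         TEXT,
    modalities       TEXT       -- JSON array, e.g. ["video", "document"]
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(MULTIMODAL_SCHEMA)
conn.execute(
    "INSERT INTO multimodal_entity_links"
    " (id, entity_id, linked_entity_id, link_type, confidence, evidence, modalities)"
    " VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("link001", "ent001", "ent002", "same_as", 0.95,
     "Cross-modal alignment: exact", json.dumps(["video", "document"])),
)
```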

## Installing Dependencies

```bash
pip install ffmpeg-python pillow opencv-python pytesseract
```

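Note: these pip packages are Python bindings only. `ffmpeg-python` calls the `ffmpeg` executable, which must be installed separately and available on `PATH`, and the OCR features require the Tesseract OCR engine: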
- Ubuntu/Debian: `sudo apt-get install tesseract-ocr tesseract-ocr-chi-sim`
- macOS: `brew install tesseract tesseract-lang`
- Windows: download the installer from https://github.com/UB-Mannheim/tesseract/wiki
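Because the pip packages and the native executables are installed separately, a startup preflight check can catch a missing piece before the first upload fails. The sketch below uses only the standard library: `importlib.util.find_spec` checks whether each Python module is importable (without importing it), and `shutil.which` checks whether `ffmpeg` and `tesseract` are on `PATH`. The function name is hypothetical, and on Windows Tesseract may be installed off `PATH` and configured through `TESSERACT_CMD` instead.

```python
import importlib.util
import shutil

# pip package name -> module name it installs
PYTHON_MODULES = {
    "ffmpeg-python": "ffmpeg",
    "pillow": "PIL",
    "opencv-python": "cv2",
    "pytesseract": "pytesseract",
}
BINARIES = ["ffmpeg", "tesseract"]


def check_dependencies() -> dict[str, bool]:
    """Report which Python modules are importable and which executables are on PATH."""
    status = {pkg: importlib.util.find_spec(mod) is not None
              for pkg, mod in PYTHON_MODULES.items()}
    status.update({f"{name} (binary)": shutil.which(name) is not None for name in BINARIES})
    return status


for dependency, ok in check_dependencies().items():
    print(f"{dependency}: {'ok' if ok else 'MISSING'}")
```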

## Environment Variables

```bash
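# Optional: custom temporary directory
export INSIGHTFLOW_TEMP_DIR=/path/to/temp

# Optional: path to the Tesseract executable (Windows). Quote it, since the path contains a space.
# PowerShell equivalent: $env:TESSERACT_CMD = 'C:\Program Files\Tesseract-OCR\tesseract.exe'
export TESSERACT_CMD='C:\Program Files\Tesseract-OCR\tesseract.exe'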
```