PlanOpticon processes a video through eight sequential steps. Each one checkpoints its output, so failures don’t mean starting over. Here’s the full pipeline and why each step exists.

1. Frame Extraction

OpenCV samples the video at a configurable rate. Change detection keeps frames where something visually interesting happened. Periodic capture grabs frames at fixed intervals for slow-moving content. Face detection discards webcam-only frames. A 90-minute video typically produces 100-150 useful frames from thousands of candidates.
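
Here's a minimal sketch of the sampling-plus-change-detection loop in OpenCV. The function name, sampling rate, and threshold are illustrative assumptions, not PlanOpticon's actual code:

```python
import cv2

def extract_frames(video_path, samples_per_sec=1.0, diff_threshold=10.0):
    """Sample frames at a fixed rate; keep those that differ from the last kept frame."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, int(fps / samples_per_sec))   # periodic capture interval
    kept, prev_gray, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            # Change detection: mean absolute pixel difference vs. the last kept frame.
            if prev_gray is None or cv2.absdiff(gray, prev_gray).mean() > diff_threshold:
                kept.append((idx / fps, frame))  # (timestamp in seconds, image)
                prev_gray = gray
        idx += 1
    cap.release()
    return kept
```

The face-detection filter would run as an extra pass over the kept frames (OpenCV's Haar cascades are one common choice) to drop webcam-only shots.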

2. Audio Extraction

FFmpeg extracts the audio track to WAV. Simple, fast, and the foundation for transcription.
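
The whole step is roughly one FFmpeg invocation. A sketch, assuming 16 kHz mono PCM as the transcription-friendly target:

```python
import subprocess

def extract_audio(video_path: str, wav_path: str) -> None:
    # -vn drops the video stream; 16 kHz mono 16-bit PCM is what Whisper expects.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-vn", "-ac", "1", "-ar", "16000", "-c:a", "pcm_s16le", wav_path],
        check=True,
    )
```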

3. Transcription

Whisper (API or local) transcribes the audio with timestamps, and speaker diarization assigns an ID to each voice. The output is structured JSON with segments, each tagged with start time, end time, speaker ID, and text. The step also generates SRT subtitles and a plain-text version.
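
Using the local model, the core of this step can be sketched like so. The output shape follows the description above; the `speaker` field is a placeholder, since diarization is a separate pass from Whisper's own output:

```python
import json
import whisper  # open-source package; the API path would use an HTTP client instead

def transcribe(wav_path: str, out_json: str) -> None:
    model = whisper.load_model("base")   # model size is configurable
    result = model.transcribe(wav_path)
    segments = [
        {
            "start": seg["start"],
            "end": seg["end"],
            "speaker": None,   # filled in by the diarization pass (assumption)
            "text": seg["text"].strip(),
        }
        for seg in result["segments"]
    ]
    with open(out_json, "w") as f:
        json.dump({"segments": segments}, f, indent=2)
```

The SRT and plain-text versions are straightforward projections of the same segment list.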

4. Diagram Analysis

Every extracted frame gets classified by a vision model: is this a diagram, a slide, a screenshot, or noise? High-confidence diagrams get a full analysis — the model extracts elements, relationships, visible text, and generates Mermaid code to recreate the diagram programmatically. Medium-confidence frames are saved as screen captures with captions.
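
A sketch of the classification call, assuming an OpenAI-style vision endpoint; the model name and prompt wording are placeholders:

```python
import base64
from openai import OpenAI

client = OpenAI()

CLASSIFY_PROMPT = (
    "Classify this video frame as exactly one of: diagram, slide, screenshot, noise. "
    'Reply as JSON: {"label": ..., "confidence": 0.0-1.0}.'
)

def classify_frame(png_path: str) -> str:
    with open(png_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; swap in whatever vision model you use
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": CLASSIFY_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```

High-confidence diagrams then go through a second, heavier prompt that asks for elements, relationships, visible text, and Mermaid source.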

5. Content Analysis

The transcript and visual context are analyzed together. The LLM identifies key points (what matters), action items (who needs to do what by when), and cross-references between spoken and visual content. If someone says “as you can see in the diagram,” the system links that statement to the actual diagram frame.
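
One plausible shape for this step's output; the field names are assumptions based on the description above, not the real schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ActionItem:
    owner: str          # who
    task: str           # needs to do what
    due: Optional[str]  # by when, if stated

@dataclass
class CrossReference:
    transcript_start: float  # when the speaker referenced a visual, in seconds
    frame_timestamp: float   # the diagram frame the reference resolves to

@dataclass
class ContentAnalysis:
    key_points: list[str] = field(default_factory=list)
    action_items: list[ActionItem] = field(default_factory=list)
    cross_references: list[CrossReference] = field(default_factory=list)
```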

6. Knowledge Graph

Transcript segments are batched (10 at a time) and sent to an LLM for entity and relationship extraction in a single combined prompt. Entities are typed (person, concept, technology, organization, time) and relationships are labeled (uses, manages, depends_on). In batch mode, the knowledge graphs from individual videos merge into one graph.
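
In code, the batching and cross-video merge might look like this. The prompt text and the name-based deduplication are assumptions; real entity resolution is harder than a lowercase match:

```python
def batches(segments, size=10):
    for i in range(0, len(segments), size):
        yield segments[i:i + size]

KG_PROMPT = """From these transcript segments, extract:
- entities, typed as person | concept | technology | organization | time
- relationships, labeled uses | manages | depends_on
Return JSON: {{"entities": [...], "relationships": [...]}}

Segments:
{segments}
"""

def merge_graphs(graphs):
    entities, rels = {}, set()
    for g in graphs:
        for e in g["entities"]:
            entities[e["name"].lower()] = e  # naive dedupe by normalized name
        for r in g["relationships"]:
            rels.add((r["source"], r["label"], r["target"]))
    return {
        "entities": list(entities.values()),
        "relationships": [
            {"source": s, "label": l, "target": t} for (s, l, t) in rels
        ],
    }
```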

7. Report Generation

Everything gets assembled into a structured report — Markdown for developers, HTML for browsers, PDF for stakeholders. Diagrams render inline with their Mermaid source. The transcript links to timestamps. Action items are grouped by assignee.
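
A sketch of the Markdown path, assuming the analysis and diagram shapes from steps 4 and 5; the HTML and PDF renderers would consume the same data:

```python
def render_markdown(title, analysis, diagrams):
    lines = [f"# {title}", "", "## Key Points"]
    lines += [f"- {point}" for point in analysis["key_points"]]
    # Group action items by assignee, as the report does.
    by_owner = {}
    for item in analysis["action_items"]:
        by_owner.setdefault(item["owner"], []).append(item)
    lines += ["", "## Action Items"]
    for owner, items in sorted(by_owner.items()):
        lines.append(f"### {owner}")
        lines += [f"- {i['task']} (due: {i['due'] or 'unspecified'})" for i in items]
    lines += ["", "## Diagrams"]
    for d in diagrams:
        lines += [d["caption"], "", "```mermaid", d["mermaid"], "```", ""]
    return "\n".join(lines)
```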

8. Manifest

A JSON manifest ties everything together: paths to all artifacts, metadata, statistics, and the full structured output. This is the machine-readable index that makes PlanOpticon output programmable — you can build downstream tools on top of the manifest without parsing reports.
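
For example, a downstream tool can read the manifest directly. The keys here are hypothetical, for illustration only; check a real manifest for the actual schema:

```python
import json

def summarize(manifest_path):
    with open(manifest_path) as f:
        m = json.load(f)
    # Hypothetical keys, illustrating the idea of a machine-readable index.
    print("Transcript:", m["artifacts"]["transcript_json"])
    print("Report:", m["artifacts"]["report_markdown"])
    print("Frames kept:", m["statistics"]["frames_kept"])
    print("Action items:", len(m["analysis"]["action_items"]))
```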

Why eight steps and not one

Each step is independently valuable and independently cacheable. Need just a transcript? Stop at step 3. Want diagrams without the knowledge graph? Skip step 6. Crash during report generation? Re-run and it picks up at step 7.
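
The resume logic is simple to sketch: each step leaves a marker file, and the runner skips anything already done. The names and file layout below are illustrative, not PlanOpticon's actual internals:

```python
import json
from pathlib import Path

def run_pipeline(workdir, steps, stop_after=None):
    """steps is an ordered list of (name, fn); each fn writes its own artifacts."""
    out = Path(workdir)
    out.mkdir(parents=True, exist_ok=True)
    for name, fn in steps:
        marker = out / f"{name}.done.json"
        if marker.exists():
            continue                      # checkpoint hit: skip completed work
        result = fn(out)                  # must return JSON-serializable metadata
        marker.write_text(json.dumps({"step": name, "result": result}))
        if name == stop_after:            # e.g. stop_after="transcription"
            break
```

Re-running after a crash walks the same list and fast-forwards past every existing marker.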

The pipeline is the product. Everything else is just making each step better.
