Overview
Captionly is a self-hosted editing toolkit, a single-user alternative to stitching together a pile of paid web tools. It started as an auto-caption app and grew into five workflows under one roof: caption a video, clip viral moments from a long one, remove an image background, upscale an image, and synthesize speech.
The guiding principle is local-first. The browser runs the editor and ffmpeg.wasm, a small Python sidecar runs the AI models on your machine, and nothing is uploaded or tracked. Only the clip tool may reach off-device, and only if you point it at a cloud LLM instead of a local one.
The Toolkit
Five tools share one picker, one design language, and one local sidecar. Adding a tool is just adding a tile.
Auto-caption a video
WhisperX transcribes with word-level timing (roughly ±20 ms). An in-browser editor lets you reposition the overlay, edit text, and preview karaoke-style word highlighting; ffmpeg then burns the captions in.
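One way to drive karaoke-style highlighting from word-level timings is a binary search over word start times. This is a minimal sketch, not Captionly's actual code: the `Word` shape and `active_word_index` are hypothetical names, and the real editor runs in the browser, so Python is used here only to show the lookup logic.

```python
from __future__ import annotations

from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical shape for one aligned word; times are in seconds."""
    text: str
    start: float
    end: float


def active_word_index(words: list[Word], t: float) -> int | None:
    """Return the index of the word being spoken at time t, or None in a gap."""
    starts = [w.start for w in words]
    i = bisect_right(starts, t) - 1  # last word whose start is <= t
    if i >= 0 and t < words[i].end:
        return i
    return None


# Illustrative timings, not real transcription output.
words = [Word("local", 0.00, 0.42), Word("first", 0.50, 0.91)]
highlighted = active_word_index(words, 0.60)
```

Returning `None` between words lets the renderer leave every word unhighlighted during pauses instead of holding the previous one.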
Clip viral moments
Drop in a long video: WhisperX transcribes it and an LLM picks the short segments worth posting. Native ffmpeg then stream-copies each segment out without re-encoding.
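A stream copy of one selected segment could be expressed as an ffmpeg argument list like the one below. This is a sketch under assumptions: the function name, file names, and timestamps are hypothetical, and the source does not say how Captionly actually invokes ffmpeg. Only standard ffmpeg options (`-ss`, `-i`, `-t`, `-c copy`, `-y`) are used.

```python
from __future__ import annotations


def stream_copy_args(src: str, start: float, end: float, dst: str) -> list[str]:
    """Build an ffmpeg argv that copies the span start..end out of src without re-encoding."""
    if end <= start:
        raise ValueError("end must be greater than start")
    return [
        "ffmpeg",
        "-y",                         # overwrite dst if it already exists
        "-ss", f"{start:.3f}",        # seek on the input, before decoding
        "-i", src,
        "-t", f"{end - start:.3f}",   # clip duration in seconds
        "-c", "copy",                 # copy audio and video streams as-is
        dst,
    ]


# Hypothetical segment; pass the list to subprocess.run(args, check=True) to execute it.
args = stream_copy_args("talk.mp4", 12.5, 42.0, "clip-01.mp4")
```

Because the streams are copied rather than decoded, the cut can only begin on a keyframe, so a clip may start slightly earlier than the requested timestamp. That is the usual trade-off for skipping the re-encode.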
Remove background
BiRefNet matting on Apple Silicon (MPS) returns a clean transparent PNG from any image.
Upscale an image
Real-ESRGAN x4plus upscales an image 2× or 4× using tiled inference, streaming per-tile progress to the UI.
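Tiled inference splits the image into overlapping boxes so each one fits in memory, and the tile count gives a natural progress measure. The sketch below shows only the tiling arithmetic; the tile size, overlap, and the name `tile_boxes` are assumptions for illustration, not values taken from Captionly.

```python
from __future__ import annotations

from typing import Iterator

Box = tuple[int, int, int, int]  # (left, top, right, bottom)


def tile_boxes(width: int, height: int, tile: int = 256, overlap: int = 16) -> Iterator[Box]:
    """Yield overlapping boxes that together cover a width x height image."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    step = tile - overlap  # distance between neighbouring tile origins
    for top in range(0, max(height - overlap, 1), step):
        for left in range(0, max(width - overlap, 1), step):
            # Clamp the last row and column to the image edge.
            yield left, top, min(left + tile, width), min(top + tile, height)


boxes = list(tile_boxes(600, 400))
progress = []
for done, box in enumerate(boxes, start=1):
    # A real pipeline would run the model on this tile here.
    progress.append(done / len(boxes))  # fraction to report to the UI
```

The overlap gives neighbouring tiles shared pixels, which is what lets a stitching step hide seams at tile borders.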
Voice synthesis
edge-tts plus RVC voice conversion, proxied to a separate local Speek Docker container for custom character voices.
In-browser export
ffmpeg.wasm renders the same caption frames the editor previews, so the preview and the exported video are pixel-identical by construction.
Under the Hood
Local Python sidecar
A FastAPI app bound to 127.0.0.1 mounts one router per model, loads weights lazily, and exposes a single /health endpoint as the source of truth.
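Lazy weight loading paired with a health report can be sketched with the standard library alone. Everything named here (`LazyModel`, `health`, the payload shape, the fake loader) is hypothetical; in the real sidecar a FastAPI route would return something like this payload, but the source does not specify its fields.

```python
from __future__ import annotations

import threading
from typing import Any, Callable


class LazyModel:
    """Defer an expensive loader until the first call to get()."""

    def __init__(self, name: str, loader: Callable[[], Any]) -> None:
        self.name = name
        self._loader = loader
        self._model: Any = None
        self._loaded = False
        self._lock = threading.Lock()

    @property
    def loaded(self) -> bool:
        return self._loaded

    def get(self) -> Any:
        with self._lock:  # prevents a double load under concurrent requests
            if not self._loaded:
                self._model = self._loader()
                self._loaded = True
        return self._model


def health(models: list[LazyModel]) -> dict[str, Any]:
    """Assumed payload for a /health route: which models are in memory."""
    return {"status": "ok", "models": {m.name: m.loaded for m in models}}


load_calls: list[str] = []


def load_fake_weights() -> str:
    """Stand-in for a real model loader."""
    load_calls.append("matting")
    return "weights"


matting = LazyModel("matting", load_fake_weights)
before = health([matting])  # nothing loaded yet
matting.get()
matting.get()               # second call reuses the cached model
after = health([matting])
```

Startup stays fast because no weights are read until a tool is first used, and the health payload lets the UI distinguish "sidecar is up" from "model is warm".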
Local-first & private
No accounts, no cloud, no telemetry. The whole security model is "localhost only". The one exception is an optional cloud LLM for clipping, which you opt into yourself.
Cross-origin isolated
ffmpeg.wasm needs SharedArrayBuffer, so every route ships COOP/COEP headers to enable cross-origin isolation.
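A quick way to confirm a route is set up for cross-origin isolation is to inspect its response headers. The helper below is a hypothetical check, not part of Captionly; it assumes the standard combination of `Cross-Origin-Opener-Policy: same-origin` with a `Cross-Origin-Embedder-Policy` of `require-corp` or `credentialless`, and it ignores optional reporting parameters.

```python
from __future__ import annotations

COOP = "cross-origin-opener-policy"
COEP = "cross-origin-embedder-policy"


def is_cross_origin_isolated(headers: dict[str, str]) -> bool:
    """Return True if the headers carry a COOP/COEP pair that enables isolation."""
    # Header names are case-insensitive, so normalize before comparing.
    h = {k.lower(): v.strip().lower() for k, v in headers.items()}
    return h.get(COOP) == "same-origin" and h.get(COEP) in {"require-corp", "credentialless"}


# Example header set a route might send (illustrative values).
ok = is_cross_origin_isolated({
    "Cross-Origin-Opener-Policy": "same-origin",
    "Cross-Origin-Embedder-Policy": "require-corp",
})
```

In the browser, the equivalent runtime signal is `self.crossOriginIsolated`, which must be true before `SharedArrayBuffer` is usable.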
One-command dev
A single concurrently script boots the web app, the Python sidecar, and the Speek voice container together.
Tech Stack
| Layer | Technology |
|---|---|
| Framework | Next.js 15 (App Router) |
| Frontend | React 19, TypeScript, Tailwind CSS 3 |
| State | Zustand |
| Video | ffmpeg.wasm (client-side) |
| Transcription | WhisperX (faster-whisper + wav2vec2 alignment) |
| Image models | BiRefNet (matting) · Real-ESRGAN x4plus (upscale) |
| Voice | edge-tts + RVC (Speek Docker container) |
| AI (clip) | Anthropic Claude SDK or any OpenAI-compatible endpoint (e.g. Ollama) |
| Sidecar | Python FastAPI (127.0.0.1) |
| UI | Radix UI, lucide-react, Motion |
| Repository | Public on GitHub |
Status
Captionly is a personal project that runs locally; by design, there is no hosted demo. The full source is public on GitHub: clone it, set up the Python sidecar, and run the dev script. Reach out if you'd like a walkthrough.