Captionly | Toprak Demirel

Overview

Captionly is a self-hosted editing toolkit, a single-user alternative to stitching together a pile of paid web tools. It started as an auto-caption app and grew into five workflows under one roof: caption a video, clip viral moments from a long one, remove an image background, upscale an image, and synthesize speech.

The guiding principle is local-first. The browser runs the editor and ffmpeg.wasm, a small Python sidecar runs the AI models on your machine, and nothing is uploaded or tracked. Only the clip tool may reach off-device, and only if you point it at a cloud LLM instead of a local one.

The Toolkit

Five tools share one picker, one design language, and one local sidecar. Adding a tool is just adding a tile.

Core

Auto-caption a video

WhisperX transcribes with word-level timing (roughly ±20 ms). An in-browser editor lets you reposition the overlay, edit text, and get karaoke-style word highlighting; ffmpeg then burns the captions in.
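
Karaoke-style highlighting comes down to finding which word is active at the current playback time. The sketch below shows one way to do that lookup, assuming word timings arrive as (start, end, text) entries in seconds, sorted by start time. The `Word` type and `active_word` helper are hypothetical names for illustration, not Captionly's actual code, and the real editor runs in the browser rather than in Python.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional, Sequence


class Word(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    text: str


def active_word(words: Sequence[Word], t: float) -> Optional[int]:
    """Return the index of the word being spoken at time t, or None in a gap."""
    starts = [w.start for w in words]
    # Index of the last word whose start time is <= t.
    i = bisect_right(starts, t) - 1
    if i >= 0 and t < words[i].end:
        return i
    return None


# Illustrative timings only, not real WhisperX output.
words = [Word(0.00, 0.32, "Hello"), Word(0.40, 0.85, "world")]
print(active_word(words, 0.50))  # falls inside "world"
print(active_word(words, 0.35))  # falls in the gap between words
```

A binary search keeps the lookup cheap enough to run on every animation frame, even for long transcripts.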

Clip viral moments

Drop a long video; WhisperX transcribes it and an LLM picks the short segments worth posting. Native ffmpeg then stream-copies them out with no re-encode.
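
A stream-copy cut can be sketched as a small helper that builds the ffmpeg argument list. This is a minimal illustration under stated assumptions: the `stream_copy_args` name, the file names, and the exact flag layout are hypothetical, and the project's real invocation may differ. It uses only standard ffmpeg options (`-ss`, `-t`, `-i`, `-c copy`).

```python
from typing import List


def stream_copy_args(src: str, start: float, end: float, dst: str) -> List[str]:
    """Build an ffmpeg argv that cuts [start, end] without re-encoding."""
    if end <= start:
        raise ValueError("end must be greater than start")
    return [
        "ffmpeg",
        "-ss", f"{start:.3f}",        # seek on the input, before decoding
        "-t", f"{end - start:.3f}",   # how many seconds to read from the input
        "-i", src,
        "-c", "copy",                 # copy audio and video streams as-is
        dst,
    ]


args = stream_copy_args("in.mp4", 12.5, 30.0, "clip.mp4")
print(" ".join(args))
```

The list can be passed to `subprocess.run(args, check=True)`. Because nothing is re-encoded, the cut can only begin on a keyframe, so the clip typically starts slightly before the requested time.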

Remove background

BiRefNet matting on Apple Silicon (MPS) returns a clean transparent PNG from any image.

Upscale an image

Real-ESRGAN x4plus with tiled inference, 2× or 4×, streaming per-tile progress to the UI.
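
Tiled inference splits the image into rectangles that are upscaled one at a time, which bounds memory use and gives a natural unit for progress reporting. The sketch below only computes the tile grid. The tile size, the padding, and the `tile_boxes` name are assumptions for illustration, not values taken from Captionly; overlapping padding is a common way to hide seams between tiles.

```python
from typing import Iterator, Tuple

Box = Tuple[int, int, int, int]  # left, top, right, bottom


def tile_boxes(width: int, height: int, tile: int = 256, pad: int = 16) -> Iterator[Box]:
    """Yield padded tile rectangles that together cover a width x height image."""
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            yield (
                max(left - pad, 0),
                max(top - pad, 0),
                min(left + tile + pad, width),
                min(top + tile + pad, height),
            )


boxes = list(tile_boxes(600, 300))
for done, box in enumerate(boxes, start=1):
    # A real pipeline would upscale the crop here, then report progress to the UI.
    print(f"tile {done}/{len(boxes)}: {box}")
```

Each yielded box is clamped to the image bounds, so edge tiles are simply smaller than interior ones.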

Voice synthesis

edge-tts plus RVC voice conversion, proxied to a separate local Speek Docker container for custom character voices.

In-browser export

ffmpeg.wasm renders the same caption frames the editor previews, so the preview and the exported video are pixel-identical by construction.

Under the Hood

Local Python sidecar

A FastAPI app bound to 127.0.0.1 mounts one router per model, loads weights lazily, and exposes a single /health endpoint as the source of truth.
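
The lazy-loading and health-reporting idea can be sketched without any web framework. The `LazyModels` class, the model names, and the payload shape below are hypothetical, chosen for illustration; the real sidecar wires this kind of logic into FastAPI routers, and its actual /health response may look different.

```python
from typing import Any, Callable, Dict


class LazyModels:
    """Registry that loads each model on first use and reports load state."""

    def __init__(self) -> None:
        self._loaders: Dict[str, Callable[[], Any]] = {}
        self._loaded: Dict[str, Any] = {}

    def register(self, name: str, loader: Callable[[], Any]) -> None:
        self._loaders[name] = loader

    def get(self, name: str) -> Any:
        # Load weights only on the first request for this model.
        if name not in self._loaded:
            self._loaded[name] = self._loaders[name]()
        return self._loaded[name]

    def health(self) -> Dict[str, Any]:
        # One possible payload for a /health route (assumed shape).
        return {
            "status": "ok",
            "models": {name: name in self._loaded for name in self._loaders},
        }


models = LazyModels()
models.register("whisperx", lambda: "whisperx-weights")  # stand-in loader
models.register("birefnet", lambda: "birefnet-weights")  # stand-in loader
models.get("whisperx")  # first use triggers the load
print(models.health())
```

Loading on first use keeps startup fast and avoids holding weights in memory for tools that are never opened in a session.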

Local-first & private

No accounts, no cloud, no telemetry. The whole security model is "localhost only". The one exception is an optional cloud LLM for clipping, which you choose.

Cross-origin isolated

ffmpeg.wasm needs SharedArrayBuffer, so every route ships COOP/COEP headers to enable cross-origin isolation.
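
Browsers treat a page as cross-origin isolated when its Cross-Origin-Opener-Policy is same-origin and its Cross-Origin-Embedder-Policy is require-corp or credentialless. The sketch below is a small smoke-test helper that checks a response's headers against that rule. The `is_cross_origin_isolated` name is hypothetical, and the specific values Captionly ships are an assumption; in the app itself the headers would be set in the Next.js configuration, not in Python.

```python
from typing import Mapping


def is_cross_origin_isolated(headers: Mapping[str, str]) -> bool:
    """Check whether response headers opt a page into cross-origin isolation."""
    # HTTP header names are case-insensitive, so normalize them first.
    h = {k.lower(): v.strip() for k, v in headers.items()}
    coop = h.get("cross-origin-opener-policy")
    coep = h.get("cross-origin-embedder-policy")
    return coop == "same-origin" and coep in {"require-corp", "credentialless"}


# Assumed header values, not copied from the project.
ISOLATION_HEADERS = {
    "Cross-Origin-Opener-Policy": "same-origin",
    "Cross-Origin-Embedder-Policy": "require-corp",
}
print(is_cross_origin_isolated(ISOLATION_HEADERS))
print(is_cross_origin_isolated({"Cross-Origin-Opener-Policy": "same-origin"}))
```

A check like this can run against each route in a test, since a single route missing either header would leave SharedArrayBuffer unavailable on that page.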

One-command dev

A single concurrently script boots the web app, the Python sidecar, and the Speek voice container together.

Tech Stack

Framework	Next.js 15 (App Router)
Frontend	React 19, TypeScript, Tailwind CSS 3
State	Zustand
Video	ffmpeg.wasm (client-side)
Transcription	WhisperX (faster-whisper + wav2vec2 alignment)
Image models	BiRefNet (matting) · Real-ESRGAN x4plus (upscale)
Voice	edge-tts + RVC (Speek Docker container)
AI (clip)	Anthropic Claude SDK or any OpenAI-compatible endpoint (e.g. Ollama)
Sidecar	Python FastAPI (127.0.0.1)
UI	Radix UI, lucide-react, Motion
Repository	Public on GitHub

Status

Captionly is a personal project that runs locally; by design, there is no hosted demo. The full source is public on GitHub: clone it, set up the Python sidecar, and run the dev script. Reach out if you'd like a walkthrough.