Whisper transcribes
Audio extracted at 16 kHz mono, sent to Groq's Whisper Large v3 endpoint. Returns word-level timestamps, per-word confidence, and detected language.
Most screen recorders cap captions at 5-10 European languages. El Ojo Studio uses Whisper Large v3 — the same model OpenAI ships with ChatGPT — plus Llama 3.3 70B for correction. RTL for Urdu, Arabic, Hebrew. CJK and Devanagari fonts. Roman transliteration if your audience reads English but speaks Hindi.
Hit “Generate transcript” in Studio and three things happen in sequence:
Audio extracted at 16 kHz mono, sent to Groq's Whisper Large v3 endpoint. Returns word-level timestamps, per-word confidence, and detected language.
Llama 3.3 70B reads each segment and fixes punctuation, proper nouns, recognition errors. Especially helpful for Urdu, Hindi, Pashto where Whisper sometimes mis-segments. Per-segment, deterministic.
Pick a style (Karaoke / Hormozi / MrBeast / Minimal). Studio renders the overlay with the right font and direction. Auto-emoji on key words. Filters out words you've deleted in the transcript editor.
Whisper Large v3 supports 99 languages out of the box. El Ojo adds language hints for the ones most users ask about:
Captions render with direction: rtl and a font stack tuned for Naskh / Nastaliq scripts. Word order stays correct — no “reversed letters” bug.
Automatic font fallback to "Noto Sans CJK" so glyphs render correctly. Per-character timing for Karaoke style.
Devanagari and other Indic font stacks. Word-level timing with diacritic preservation. Optional Roman transliteration for English-script audiences.
Toggle on “Transliterate to Roman” and Llama converts the Devanagari / Arabic-script transcript to Roman. Useful for diaspora audiences and accessibility.
Whisper Large v3 is most accurate on European languages. Captions ship with the right accented characters and punctuation style.
Auto-detect handles the rest. From Vietnamese to Swahili to Welsh — if Whisper supports it, El Ojo captions it.
Four built-in styles. Pick once, reuse forever. Each style scales with the player — embedded player, fullscreen, or the editor preview all render at the right size.
Each word fills in as it's spoken — reads like a karaoke prompter. Best for music or fast-paced talks.
Black bold text with a yellow highlight on the current word. Popular on TikTok and Instagram Reels. Aggressive attention-grabber.
Large white text, centered, drop shadow. The classic YouTube creator look. Reads even at thumbnail size.
Small white text at the bottom. Doesn't compete with your content. Best for technical or formal recordings.
Whisper Large v3 via Groq. Word-level timestamps via verbose_json. Languages auto-detected; you can also hint the language for higher accuracy on Urdu, Hindi, Pashto.
After Whisper returns the transcript, Llama 3.3 70B reads each segment and corrects spelling, punctuation, proper nouns, and obvious recognition errors. Runs only if you opt in. Deterministic per segment.
If your audio is in Urdu, Hindi, or another Indic language, El Ojo can output the transcript in Roman / Latin script — what users call "Hinglish" for Hindi and "Roman Urdu" for Urdu. Useful for audiences who speak the language but don't read the native script.
Four: Karaoke (word-by-word fill), Hormozi (bold black-and-yellow highlight), MrBeast (huge centered text), and Minimal (clean understated). Toggle in the editor.
Yes when you render with the burn-in flag. Otherwise they live as overlay metadata you can toggle on/off in the editor preview.
Arabic, Urdu, Hebrew, Farsi/Persian. Captions render right-to-left with the correct font stack. CJK (Chinese, Japanese, Korean) and Devanagari (Hindi, Marathi) also use the appropriate font automatically.
Free tier of Studio includes auto-captions on your first 10 videos. Pro removes the cap.