caption#
Word-timed captions with kinetic animation styles. The schema's flagship "AI captions" surface — pass an array of timestamped words (e.g., direct from Whisper) and a style name, and the runtime renders them with sub-word timing. Inherits common fields.
interface CaptionElement extends BaseElement { type: 'caption'; words: CaptionWord[]; style?: 'tiktok_bounce' | 'fade_reveal' | 'kinetic_typewriter' | 'word_pop'; max_length?: number | 'auto'; // windowing — see below // Text-like styling (same semantics as `text`) font_family?: string; font_size?: number | string; font_weight?: number | string; font_style?: 'normal' | 'italic'; fill_color?: string; stroke_color?: string; stroke_width?: number; text_align?: 'left' | 'center' | 'right'; line_height?: number; letter_spacing?: number; background_color?: string; background_border_radius?: number; shadow_color?: string; shadow_x?: number; shadow_y?: number; shadow_blur?: number; // Caption-specific highlight_color?: string; highlight_background_color?: string; }
Words#
interface CaptionWord { text: string; start: number; end: number; }
| Field | Type | Description |
|---|---|---|
text | string | The word as it should render. |
start | number | When the word becomes active, in seconds relative to the caption element's time. |
end | number | When the word stops being active. |
Words don't have to be space-separated tokens — passing { text: "—" } between sentences renders a visible em-dash; punctuation can ride on the preceding word.
Windowing (max_length)#
A whole transcript on one caption would render as one unreadable block. max_length shows only part of it at a time — only the chunk active at the current moment displays (inside the box, wrapped):
| Value | Meaning |
|---|---|
number | Max letters per chunk — a chunk grows word-by-word until the next word would exceed this many characters (Creatomate-style). |
'auto' | Chunk by a few words (also breaking on pauses) — the sensible default for speech. |
| absent | No windowing — the whole transcript shows at once. |
Windowing is a display rule: it doesn't change word start/end, and the kinetic style still animates within the active chunk. Transcribing in the editor sets max_length: 'auto' by default.
Style#
| Field | Type | Default | Description |
|---|---|---|---|
style | 'tiktok_bounce' | 'fade_reveal' | 'kinetic_typewriter' | 'word_pop' | 'fade_reveal' | Per-word animation behavior. |
tiktok_bounce#
The active word scales up ~18% and gets the highlight color; previous words stay visible at the base fill color. Designed for vertical social video.
fade_reveal#
Words fade in as they activate and stay visible. Calmer than tiktok_bounce. Default for documentary / explainer pacing.
kinetic_typewriter#
Words pop in one at a time, no overlap. Each word stays until the next activates, then is replaced. Use this when you want one word on screen at a time.
word_pop#
The active word scales briefly (~120ms) on activation, then settles. Previous words stay at the base style. Subtler than tiktok_bounce.
Highlight colors#
| Field | Type | Default | Description |
|---|---|---|---|
highlight_color | string (hex) | fill_color | Color applied to the currently-active word. Defaults to the base fill_color (no color change). |
highlight_background_color | string (hex) | — | Background plate color applied behind the currently-active word. Pair with background_border_radius for a chip effect. |
Pair highlight_color with a contrast fill_color to make the active word "pop" — for example, white words with a yellow highlight.
Typography#
Same fields as text. See that page for font_family, font_size, font_weight, font_style, text_align, line_height, letter_spacing, plate, and shadow.
Example: Whisper-style captions#
{ "type": "caption", "x": 960, "y": 1600, "track": 4, "time": 1, "duration": 8, "style": "tiktok_bounce", "font_family": "Inter", "font_size": 64, "font_weight": 800, "fill_color": "#ffffff", "highlight_color": "#facc15", "stroke_color": "#000000", "stroke_width": 4, "words": [ { "text": "this", "start": 0, "end": 0.35 }, { "text": "is", "start": 0.35, "end": 0.55 }, { "text": "how", "start": 0.55, "end": 0.80 }, { "text": "agents","start": 0.80, "end": 1.30 }, { "text": "make", "start": 1.30, "end": 1.60 }, { "text": "video", "start": 1.60, "end": 2.20 } ] }
Notes#
- Whisper integration — Whisper's word-level timestamps map directly onto the
wordsarray. The MCP server's authoring guide includes a one-shot recipe. - Caption duration vs word duration — the element's
duration(from common fields) bounds when the caption is on screen. Words withstart/endoutside the element's[0, duration]are clipped. - Animation interaction — caption styles handle per-word animation. Top-level
animationson the caption element fire on the whole element (e.g., afade-inat start applies before the per-word style takes over).