`caption`#

Word-timed captions with kinetic animation styles. This is the schema's flagship "AI captions" surface: pass an array of timestamped words (for example, straight from Whisper) and a style name, and the runtime animates them with word-level timing. Inherits common fields.

interface CaptionElement extends BaseElement {
  type: 'caption';
  words: CaptionWord[];
  style?: 'tiktok_bounce' | 'fade_reveal' | 'kinetic_typewriter' | 'word_pop';
  max_length?: number | 'auto';   // windowing — see below

  // Text-like styling (same semantics as `text`)
  font_family?: string;
  font_size?: number | string;
  font_weight?: number | string;
  font_style?: 'normal' | 'italic';
  fill_color?: string;
  stroke_color?: string;
  stroke_width?: number;
  text_align?: 'left' | 'center' | 'right';
  line_height?: number;
  letter_spacing?: number;
  background_color?: string;
  background_border_radius?: number;
  shadow_color?: string;
  shadow_x?: number;
  shadow_y?: number;
  shadow_blur?: number;

  // Caption-specific
  highlight_color?: string;
  highlight_background_color?: string;
}

Words#

interface CaptionWord {
  text: string;
  start: number;
  end: number;
}

| Field | Type | Description |
| --- | --- | --- |
| `text` | `string` | The word as it should render. |
| `start` | `number` | When the word becomes active, in seconds relative to the caption element's `time`. |
| `end` | `number` | When the word stops being active, in seconds relative to the caption element's `time`. |

Entries in `words` don't have to be space-separated tokens. Passing `{ text: "—" }` (with its own `start` and `end`) between sentences renders a visible em-dash, and punctuation can ride on the preceding word.
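As a rough sketch of the relative timing described above, the helper below converts transcriber output into `CaptionWord` entries. The `WhisperWord` shape (a `word` field that often carries a leading space, plus absolute `start` / `end` in seconds) and the `toCaptionWords` name are assumptions for illustration, not part of the schema. The sketch also assumes the audio starts at composition time 0.

```typescript
// Assumed shape of one word in a transcriber's word-level output.
interface WhisperWord {
  word: string;  // often carries a leading space, e.g. " video"
  start: number; // seconds from the start of the audio
  end: number;
}

interface CaptionWord {
  text: string;
  start: number;
  end: number;
}

// Hypothetical helper: shifts absolute timestamps so they are relative
// to the caption element's `time`, and drops empty or zero-length entries.
function toCaptionWords(whisperWords: WhisperWord[], captionTime: number): CaptionWord[] {
  return whisperWords
    .map((w) => ({
      text: w.word.trim(),
      start: Math.max(0, w.start - captionTime),
      end: Math.max(0, w.end - captionTime),
    }))
    .filter((w) => w.text.length > 0 && w.end > w.start);
}
```

Words that finish before the caption's `time` collapse to zero length after the shift and are filtered out, so the resulting array only contains words the caption can actually show.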

Windowing (`max_length`)#

A whole transcript on one caption would render as one unreadable block. `max_length` splits the transcript into chunks and displays only the chunk that is active at the current moment, wrapped inside the element's box:

| Value | Meaning |
| --- | --- |
| `number` | Maximum characters per chunk. A chunk grows word by word until the next word would exceed this many characters (Creatomate-style). |
| `'auto'` | Chunks of a few words each, also breaking on pauses. The recommended setting for speech. |
| absent | No windowing. The whole transcript shows at once. |

Windowing is a display rule: it doesn't change any word's `start` or `end`, and the kinetic style still animates within the active chunk. Transcribing in the editor sets `max_length: 'auto'` by default.
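The numeric rule can be sketched as a greedy pass over the words. This is a minimal illustration, not the runtime's implementation: the function names are hypothetical, and it assumes words are joined by a single space that counts toward the limit, that an over-long word gets a chunk of its own, and that a chunk stays active until the next chunk's first word starts.

```typescript
interface CaptionWord {
  text: string;
  start: number;
  end: number;
}

// Greedy chunking: keep adding words until the next one would push the
// chunk past maxLength characters (assumed to include joining spaces).
function chunkByLength(words: CaptionWord[], maxLength: number): CaptionWord[][] {
  const chunks: CaptionWord[][] = [];
  let current: CaptionWord[] = [];
  let length = 0;
  for (const word of words) {
    const added = current.length === 0 ? word.text.length : length + 1 + word.text.length;
    if (current.length > 0 && added > maxLength) {
      chunks.push(current); // close the chunk; this word starts the next one
      current = [word];
      length = word.text.length;
    } else {
      current.push(word);
      length = added;
    }
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}

// Picks the latest chunk whose first word has started by time t
// (seconds, relative to the caption's `time`). Assumes chunks are in order.
function activeChunk(chunks: CaptionWord[][], t: number): CaptionWord[] | undefined {
  let active: CaptionWord[] | undefined;
  for (const chunk of chunks) {
    if (chunk[0].start <= t) active = chunk;
  }
  return active;
}
```

The chunks hold the original word objects untouched, which mirrors the point that windowing only decides what is displayed and never rewrites timings.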

Style#

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `style` | `'tiktok_bounce' \| 'fade_reveal' \| 'kinetic_typewriter' \| 'word_pop'` | `'fade_reveal'` | Per-word animation behavior. |

`tiktok_bounce`#

The active word scales up ~18% and takes `highlight_color`; previous words stay visible at the base `fill_color`. Designed for vertical social video.

`fade_reveal`#

Words fade in as they activate and stay visible. Calmer than `tiktok_bounce`. This is the default style, suited to documentary and explainer pacing.

`kinetic_typewriter`#

Words pop in one at a time, no overlap. Each word stays until the next activates, then is replaced. Use this when you want one word on screen at a time.
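The replacement rule can be read as "show the most recently activated word". A minimal sketch under that reading follows; `typewriterWordAt` is a hypothetical name, and the code assumes `words` is sorted by `start`.

```typescript
interface CaptionWord {
  text: string;
  start: number;
  end: number;
}

// Returns the word on screen at time t (seconds, relative to the caption's
// `time`): the latest word whose start has passed, or undefined before the
// first word activates. A word stays until the next one replaces it.
function typewriterWordAt(words: CaptionWord[], t: number): CaptionWord | undefined {
  let shown: CaptionWord | undefined;
  for (const word of words) {
    if (word.start > t) break; // sorted input: later words have not started
    shown = word;
  }
  return shown;
}
```

Under this reading a pause between two words leaves the earlier word on screen rather than blanking the caption.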

`word_pop`#

The active word scales briefly (~120ms) on activation, then settles. Previous words stay at the base style. Subtler than `tiktok_bounce`.

Highlight colors#

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `highlight_color` | `string` (hex) | `fill_color` | Color applied to the currently active word. Defaults to the base `fill_color` (no color change). |
| `highlight_background_color` | `string` (hex) | — | Background plate color applied behind the currently active word. Pair with `background_border_radius` for a chip effect. |

Pair `highlight_color` with a contrasting `fill_color` to make the active word "pop", for example white words with a yellow highlight.
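For the chip effect, the styling fields might be combined as in the fragment below. The `ChipStyleFields` type is a local, hypothetical subset of `CaptionElement` written only for this sketch, and the colors and radius are illustrative values rather than recommendations from the schema.

```typescript
// Hypothetical subset of CaptionElement: only the fields relevant to the chip look.
interface ChipStyleFields {
  style: "tiktok_bounce" | "fade_reveal" | "kinetic_typewriter" | "word_pop";
  fill_color: string;
  highlight_color: string;
  highlight_background_color: string;
  background_border_radius: number;
}

const chipStyle: ChipStyleFields = {
  style: "word_pop",
  fill_color: "#ffffff",                 // base color for inactive words
  highlight_color: "#111827",            // dark text on the active word
  highlight_background_color: "#facc15", // yellow plate behind the active word
  background_border_radius: 12,          // rounds the plate into a chip
};
```

Spreading these fields into a caption element alongside `type` and `words` keeps the base text white while the active word sits on a rounded yellow plate.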

Typography#

Same fields as `text`. See that page for `font_family`, `font_size`, `font_weight`, `font_style`, `text_align`, `line_height`, `letter_spacing`, the background plate (`background_color`, `background_border_radius`), and the shadow fields.

Example: Whisper-style captions#

{
  "type": "caption",
  "x": 960, "y": 1600,
  "track": 4,
  "time": 1,
  "duration": 8,
  "style": "tiktok_bounce",
  "font_family": "Inter",
  "font_size": 64,
  "font_weight": 800,
  "fill_color": "#ffffff",
  "highlight_color": "#facc15",
  "stroke_color": "#000000",
  "stroke_width": 4,
  "words": [
    { "text": "this",  "start": 0,    "end": 0.35 },
    { "text": "is",    "start": 0.35, "end": 0.55 },
    { "text": "how",   "start": 0.55, "end": 0.80 },
    { "text": "agents","start": 0.80, "end": 1.30 },
    { "text": "make",  "start": 1.30, "end": 1.60 },
    { "text": "video", "start": 1.60, "end": 2.20 }
  ]
}

Notes#

Whisper integration: Whisper's word-level timestamps map directly onto the `words` array. The MCP server's authoring guide includes a one-shot recipe.
Caption duration vs. word duration: the element's `duration` (from common fields) bounds when the caption is on screen. Words whose `start` / `end` fall outside the element's `[0, duration]` range are clipped.
Animation interaction: caption styles handle per-word animation. Top-level animations on the caption element apply to the whole element (e.g., a fade-in at the start applies before the per-word style takes over).
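A pre-render check can surface words that would be clipped. The helper below is a hypothetical sketch (the name `wordsOutsideDuration` is not part of the schema); it only reports offending words and makes no claim about how the runtime itself clips them.

```typescript
interface CaptionWord {
  text: string;
  start: number;
  end: number;
}

// Lists words whose timing extends outside the element's [0, duration] range.
// Times are in seconds, relative to the caption element's `time`.
function wordsOutsideDuration(words: CaptionWord[], duration: number): CaptionWord[] {
  return words.filter((w) => w.start < 0 || w.end > duration);
}
```

An empty result means every word fits inside the element's on-screen window; otherwise either extend `duration` or trim the flagged words.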
