TL;DR

Reanimate can automatically synchronize animations to your voice if you have a transcript and an audio recording. This works with the help of Gentle. Accuracy is not perfect but it is pretty close, and it is by far the easiest way of adding narration to an animation. API documentation.

Motivation

Narration adds a new dimension to animations. When done well, the narration and animation will synergize and form a single, coherent experience. When done poorly, the experience will be disjointed and jarring.

Synchronizing audio and video can be time consuming and difficult to get exactly right even when you put in a lot of effort. It is typically done in one of three ways:

Manually time and render your entire animation, then record narration over it. This is relatively quick but it is incredibly difficult to match audio and visual cues.
Split the animation into small fragments and use video editing software to align the fragments with the narration. This approach usually works well but takes a lot of effort.
Write your animation with timings tied directly to a transcript.

The third approach is by far the least time-consuming and has built-in support in reanimate. The rest of this article goes over the details.

Forced aligners

So, how can we tell at that time a word appears in an audio recording? If you've ever seen YouTube's automatic captions then you'll know that voice recognition is far from perfect, often hilariously so. Fortunately, the problem of interpreting an audio recording becomes much simpler if we already have a transcript. Given a transcript, there is a class of programs, called forced aligners, that can tell us exactly when each word is being spoken with nearly 100% accuracy. Gentle is one of the most widely used forced aligners and they describe what they do as follows:

"Forced aligners are computer programs that take media files and their transcripts and return extremely precise timing information for each word (and phoneme) in the media."

Timings from transcript

Transcripts can be read in reanimate directly from non-IO code. This works because the transcript is not allowed to change during execution. The core of the API looks like this:

data Transcript
data TWord
transcriptText  :: Transcript -> Text
transcriptWords :: Transcript -> [TWord]

loadTranscript :: FilePath -> Transcript

Running loadTranscript "my_transcript.txt" will first look for a JSON file named my_transcript.json containing all the timing information. If no such JSON file could be found, reanimate will look for my_transcript.mp3/m4a/flac and use Gentle to generate the timing information. The timing information is saved as a JSON file that subsequent calls will use directly.

The video below illustrates how accurate the automatically generated timing information can be:

Toggle source code.

#!/usr/bin/env stack
-- stack runghc --package reanimate
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE ApplicativeDo #-}
module Main where

import           Control.Monad
import qualified Data.Text                     as T
import           Reanimate
import           Reanimate.Voice
import           Reanimate.Builtin.Documentation
import           Graphics.SvgTree                         ( ElementRef(..) )

transcript :: Transcript
transcript = loadTranscript "voice_transcript.txt"

main :: IO ()
main = reanimate $ scene $ do
  newSpriteSVG_ $ mkBackgroundPixel rtfdBackgroundColor
  waitOn $ forM_ (splitTranscript transcript) $ \(svg, tword) -> fork $ do
    let render v = centerUsing (latex $ transcriptText transcript) $ masked
          (wordKey tword)
          v
          svg
          (withFillColor "grey" $ mkRect 1 1)
          (withFillColor "black" $ mkRect 1 1)
    highlighted <- simpleVar render 0
    wait (wordStart tword)
    let dur = wordEnd tword - wordStart tword
    tweenVar highlighted dur $ \v -> fromToS v 1
  wait 2
 where
  wordKey tword =
    T.unpack (wordReference tword) ++ show (wordStartOffset tword)

{-# INLINE masked #-}
masked :: String -> Double -> SVG -> SVG -> SVG -> SVG
masked key t maskSVG srcSVG dstSVG = mkGroup
  [ mkClipPath label $ removeGroups maskSVG
  , withClipPathRef (Ref label)
    $ translate (x - w / 2 + w * t) y (scaleToSize w screenHeight dstSVG)
  , withClipPathRef (Ref label)
    $ translate (x + w / 2 + w * t) y (scaleToSize w screenHeight srcSVG)
  ]
 where
  label         = "word-mask-" ++ key
  (x, y, w, _h) = boundingBox maskSVG

Faking transcripts

Scripts and animations are often developed and revised in parallel. It would be too much work to make audio recordings of drafts but a rough idea of the timing information is paramount when developing the visuals. To this end, reanimate can fake timing information and pretend to "read" the script at roughly 120 words per minute.

Faking data is done automatically by loadTranscript if no audio file can be found. It's also possible to directly parse a text as a transcript:

fakeTranscript :: Text -> Transcript

Toggle source code.

#!/usr/bin/env stack
-- stack runghc --package reanimate
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE ApplicativeDo #-}
module Main where

import           Control.Monad
import qualified Data.Text                     as T
import           Reanimate
import           Reanimate.Voice
import           Reanimate.Builtin.Documentation
import           Graphics.SvgTree                         ( ElementRef(..) )

transcript :: Transcript
transcript =
  fakeTranscript
    "There is no audio\n\n\
    \for this transcript....\n\n\n\
    \Timings are fake,\n\n\
    \which is quite useful\n\n\
    \during development"

main :: IO ()
main = reanimate $ scene $ do
  newSpriteSVG_ $ mkBackgroundPixel rtfdBackgroundColor
  waitOn $ forM_ (splitTranscript transcript) $ \(svg, tword) -> do
    highlighted <- newVar 0
    void $ newSprite $ do
      v <- unVar highlighted
      pure $ centerUsing (latex $ transcriptText transcript) $ masked
        (wordKey tword)
        v
        svg
        (withFillColor "grey" $ mkRect 1 1)
        (withFillColor "black" $ mkRect 1 1)
    fork $ do
      wait (wordStart tword)
      let dur = wordEnd tword - wordStart tword
      tweenVar highlighted dur $ \v -> fromToS v 1
  wait 2
 where
  wordKey tword =
    T.unpack (wordReference tword) ++ show (wordStartOffset tword)

{-# INLINE masked #-}
masked :: String -> Double -> SVG -> SVG -> SVG -> SVG
masked key t maskSVG srcSVG dstSVG = mkGroup
  [ mkClipPath label $ removeGroups maskSVG
  , withClipPathRef (Ref label)
    $ translate (x - w / 2 + w * t) y (scaleToSize w screenHeight dstSVG)
  , withClipPathRef (Ref label)
    $ translate (x + w / 2 + w * t) y (scaleToSize w screenHeight srcSVG)
  ]
 where
  label         = "word-mask-" ++ key
  (x, y, w, _h) = boundingBox maskSVG

Setting up triggers

Now that we've covered the basics, let's look at the API for querying timings and setting up triggers in more detail:

transcriptWords :: Transcript -> [TWord]

data TWord = TWord
  { wordAligned     :: Text
  , wordCase        :: Text
  , wordStart       :: Double -- ^ Start of pronunciation in seconds
  , wordStartOffset :: Int    -- ^ Character index of word in transcript
  , wordEnd         :: Double -- ^ End of pronunciation in seconds
  , wordEndOffset   :: Int    -- ^ Last character index of word in transcript
  , wordPhones      :: [Phone]
  , wordReference   :: Text   -- ^ The word being pronounced.
  } deriving (Show)

Looking at the TWord data structure, the key fields are wordStart, wordEnd, and wordReference. The difference between wordStart and wordEnd gives the duration of a word, and the video below illustrates how graphical elements can respond to the reading speed:

Toggle source code.

#!/usr/bin/env stack
-- stack runghc --package reanimate
{-# LANGUAGE ApplicativeDo     #-}
{-# LANGUAGE OverloadedStrings #-}
module Main where

import           Control.Lens
import           Control.Monad
import qualified Data.Text                       as T
import           Graphics.SvgTree                (ElementRef (..))
import           Reanimate
import           Reanimate.Builtin.Documentation
import           Reanimate.Voice

transcript :: Transcript
transcript = loadTranscript "voice_triggers.txt"

transformer :: SVG -> SVG
transformer =
  translate (-4) 0 . centerUsing (latex $ transcriptText transcript)

main :: IO ()
main = reanimate $ scene $ do
  newSpriteSVG_ $ mkBackgroundPixel rtfdBackgroundColor
  waitOn $ forM_ (splitTranscript transcript) $ \(svg, tword) -> do
    let render v = transformer $ masked (wordKey tword)
                                        v
                                        svg
                                        (withFillColor "grey" $ mkRect 1 1)
                                        (withFillColor "black" $ mkRect 1 1)
    highlighted <- simpleVar render 0
    let dur = wordEnd tword - wordStart tword
    fork $ do
      wait (wordStart tword)
      tweenVar highlighted dur $ \v -> fromToS v 1
    fork $ do
      wait (wordStart tword)
      case wordReference tword of
        "one"   -> highlight dur $ latex "1"
        "two"   -> highlight dur $ latex "2"
        "three" -> highlight dur $ latex "3"
        "red"   -> highlight dur $ withFillColor "red" $ mkCircle 1
        "green" -> highlight dur $ withFillColor "green" $ mkCircle 1
        "blue"  -> highlight dur $ withFillColor "blue" $ mkCircle 1
        _       -> return ()
  wait 2
 where
  wordKey tword =
    T.unpack (wordReference tword) ++ show (wordStartOffset tword)
  highlight dur img =
    play
      $ animate
          (\t ->
            translate (screenWidth / 4) 0 $ scale t $ scaleToHeight 4 $ center
              img
          )
      & signalA (bellS 2)
      & setDuration dur

{-# INLINE masked #-}
masked :: String -> Double -> SVG -> SVG -> SVG -> SVG
masked key t maskSVG srcSVG dstSVG = mkGroup
  [ mkClipPath label $ removeGroups maskSVG
  , withClipPathRef (Ref label)
    $ translate (x - w / 2 + w * t) y (scaleToSize w screenHeight dstSVG)
  , withClipPathRef (Ref label)
    $ translate (x + w / 2 + w * t) y (scaleToSize w screenHeight srcSVG)
  ]
 where
  label         = "word-mask-" ++ key
  (x, y, w, _h) = boundingBox maskSVG

Words can be looked up in a transcript and ambiguities can be resolved by inserting section markers and using them as keys.

-- | Locate the first word that occurs after all the given keys.
--   An error is thrown if no such word exists. An error is thrown
--   if the keys do not exist in the transcript.
findWord  :: Transcript -> [Text] -> Text -> TWord

-- | Locate all words that occur after all the given keys.
--   May return an empty list. An error is thrown
--   if the keys do not exist in the transcript.
findWords :: Transcript -> [Text] -> Text -> [TWord]

Below is a transcript with two section markers, 'middle' and 'final'. These markers make it easy to tell the different between the word 'circle' in the first paragraph and the word 'circle' in the second paragraph. In the corresponding video, the words trigger different events depending on which paragraph (or section) they're in. Section markers are only used as references and are not included in the audio recording.

Everything in this animation is timed by my voice.
Every flash, every circle, every square.

[middle]
I decide when everything begins and ends.
Time has run out for the square
and now the circle

[final]
Let's spawn three more circles
and destroy them in a

flash flash flash

Toggle source code.

#!/usr/bin/env stack
-- stack runghc --package reanimate
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE ApplicativeDo #-}
module Main where

import           Control.Monad
import qualified Data.Text                     as T
import           Reanimate
import           Reanimate.Voice
import           Reanimate.Builtin.Documentation
import           Geom2D.CubicBezier                       ( QuadBezier(..)
                                                          , evalBezier
                                                          , Point(..)
                                                          )
import           Graphics.SvgTree                         ( ElementRef(..) )

transcript :: Transcript
transcript = loadTranscript "voice_advanced.txt"

main :: IO ()
main = reanimate $ scene $ do
  bg <- newSpriteSVG $ mkBackgroundPixel rtfdBackgroundColor
  spriteZ bg (-100)
  newSpriteSVG_ $ mkGroup
    [withStrokeColor "black" $ mkLine (-screenWidth, 0) (screenWidth, 0)]

  centerTxt <- textHandler

  flashEffect
  circleEffect
  squareEffect
  finalEffect

  waitOn $ forM_ (transcriptWords transcript) $ \tword -> fork $ do
    wait (wordStart tword)
    writeVar centerTxt $ wordReference tword

  wait 2

wordDuration :: TWord -> Double
wordDuration tword = wordEnd tword - wordStart tword

--
finalEffect :: Scene s ()
finalEffect = fork $ do
  let begin  = findWord transcript ["final"] "circles"
      ends   = findWords transcript ["final"] "flash"
      path   = QuadBezier (Point 6 (-radius)) (Point 0 6) (Point (-6) (-radius))
      radius = 0.3
  wait (wordStart begin)
  ss <- fork $ replicateM 3 (circleSprite radius path <* wait 0.2)
  mapM_ (flip spriteZ (-1)) ss
  forM_ (zip ss ends) $ \(s, end) -> fork $ do
    spriteMap s flipXAxis
    wait (wordStart end - wordStart begin)
    destroySprite s

-- square effect
squareEffect :: Scene s ()
squareEffect = fork $ do
  let
    begin = findWord transcript [] "square"
    end   = findWord transcript ["middle"] "square"
    path =
      QuadBezier (Point 6 (-size / 2)) (Point 0 6) (Point (-6) (-size / 2))
    size = 1
  wait (wordStart begin)
  s <- squareSprite size path
  spriteMap s (rotate 180)
  spriteZ s (-1)
  wait (wordStart end + wordDuration end / 2 - wordStart begin)
  destroySprite s

-- circle effect
circleEffect :: Scene s ()
circleEffect = fork $ do
  let begin  = findWord transcript [] "circle"
      end    = findWord transcript ["middle"] "circle"
      path   = QuadBezier (Point 6 (-radius)) (Point 0 6) (Point (-6) (-radius))
      radius = 0.3
  wait (wordStart begin)
  s <- circleSprite radius path
  spriteZ s (-1)
  wait (wordStart end + wordDuration end / 2 - wordStart begin)
  destroySprite s

-- flash effect
flashEffect :: Scene s ()
flashEffect = forM_ (findWords transcript [] "flash") $ \flashWord -> fork $ do
  wait (wordStart flashWord)
  flash <- newSpriteSVG $ mkBackground "black"
  spriteTween flash (wordDuration flashWord)
    $ \t -> withGroupOpacity (fromToS 0 0.7 $ (powerS 2 . reverseS) t)
  wait (wordDuration flashWord)
  destroySprite flash

--------------------------------------------------------------------------
-- Helpers and sprites

textHandler :: Scene s (Var s T.Text)
textHandler = simpleVar render T.empty
 where
  render txt =
    let txtSvg      = translate 0 (-0.25) $ centerX $ latex txt
        activeWidth = svgWidth txtSvg + 0.5
    in  mkGroup
          [ withStrokeWidth 0 $ withFillColorPixel rtfdBackgroundColor $ mkRect
            activeWidth
            1
          , txtSvg
          , withStrokeColor "black"
            $ mkLine (activeWidth / 2, 0.5) (activeWidth / 2, -0.5)
          , withStrokeColor "black"
            $ mkLine (-activeWidth / 2, 0.5) (-activeWidth / 2, -0.5)
          ]

circleSprite :: Double -> QuadBezier Double -> Scene s (Sprite s)
circleSprite radius path = newSprite $ do
  t <- spriteT
  d <- spriteDuration
  pure
    $ let Point x y = evalBezier path (t / d)
      in  mkGroup
            [ mkClipPath "circle-mask"
            $ removeGroups
            $ translate 0 (screenHeight / 2)
            $ withFillColorPixel rtfdBackgroundColor
            $ mkRect screenWidth screenHeight
            , withClipPathRef (Ref "circle-mask") $ translate x y $ mkCircle
              radius
            ]

squareSprite :: Double -> QuadBezier Double -> Scene s (Sprite s)
squareSprite size path = newSprite $ do
  t <- spriteT
  d <- spriteDuration
  pure
    $ let Point x y = evalBezier path (t / d)
      in  mkGroup
            [ mkClipPath "square-mask"
            $ removeGroups
            $ translate 0 (screenHeight / 2)
            $ withFillColorPixel rtfdBackgroundColor
            $ mkRect screenWidth screenHeight
            , withClipPathRef (Ref "square-mask")
            $ translate x y
            $ rotate (t / d * 360)
            $ withFillOpacity 0
            $ withStrokeColor "black"
            $ mkRect size size
            ]