TL;DR
Reanimate can automatically synchronize animations to your voice if you have a transcript and an audio recording. This works with the help of Gentle. Accuracy is not perfect but it is pretty close, and it is by far the easiest way of adding narration to an animation. API documentation.
Motivation
Narration adds a new dimension to animations. When done well, the narration and animation will synergize and form a single, coherent experience. When done poorly, the experience will be disjointed and jarring.
Synchronizing audio and video can be time consuming and difficult to get exactly right even when you put in a lot of effort. It is typically done in one of three ways:
- Manually time and render your entire animation, then record narration over it. This is relatively quick but it is incredibly difficult to match audio and visual cues.
- Split the animation into small fragments and use video editing software to align the fragments with the narration. This approach usually works well but takes a lot of effort.
- Write your animation with timings tied directly to a transcript.
The third approach is by far the least time-consuming and has built-in support in reanimate. The rest of this article goes over the details.
Forced aligners
So, how can we tell at that time a word appears in an audio recording? If you've ever seen YouTube's automatic captions then you'll know that voice recognition is far from perfect, often hilariously so. Fortunately, the problem of interpreting an audio recording becomes much simpler if we already have a transcript. Given a transcript, there is a class of programs, called forced aligners, that can tell us exactly when each word is being spoken with nearly 100% accuracy. Gentle is one of the most widely used forced aligners and they describe what they do as follows:
"Forced aligners are computer programs that take media files and their transcripts and return extremely precise timing information for each word (and phoneme) in the media."
Timings from transcript
Transcripts can be read in reanimate directly from non-IO code. This works because the transcript is not allowed to change during execution. The core of the API looks like this:
data Transcript
data TWord
transcriptText :: Transcript -> Text
transcriptWords :: Transcript -> [TWord]
loadTranscript :: FilePath -> Transcript
Running loadTranscript "my_transcript.txt"
will first look for a JSON file named my_transcript.json
containing all the timing information. If no such JSON file could be found, reanimate will look for my_transcript.mp3/m4a/flac
and use Gentle to generate the timing information. The timing information is saved as a JSON file that subsequent calls will use directly.
The video below illustrates how accurate the automatically generated timing information can be:
Toggle source code.
#!/usr/bin/env stack
-- stack runghc --package reanimate
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE ApplicativeDo #-}
module Main where
import Control.Monad
import qualified Data.Text as T
import Reanimate
import Reanimate.Voice
import Reanimate.Builtin.Documentation
import Graphics.SvgTree ( ElementRef(..) )
transcript :: Transcript
transcript = loadTranscript "voice_transcript.txt"
main :: IO ()
main = reanimate $ scene $ do
newSpriteSVG_ $ mkBackgroundPixel rtfdBackgroundColor
waitOn $ forM_ (splitTranscript transcript) $ \(svg, tword) -> fork $ do
let render v = centerUsing (latex $ transcriptText transcript) $ masked
(wordKey tword)
v
svg
(withFillColor "grey" $ mkRect 1 1)
(withFillColor "black" $ mkRect 1 1)
highlighted <- simpleVar render 0
wait (wordStart tword)
let dur = wordEnd tword - wordStart tword
tweenVar highlighted dur $ \v -> fromToS v 1
wait 2
where
wordKey tword =
T.unpack (wordReference tword) ++ show (wordStartOffset tword)
{-# INLINE masked #-}
masked :: String -> Double -> SVG -> SVG -> SVG -> SVG
masked key t maskSVG srcSVG dstSVG = mkGroup
[ mkClipPath label $ removeGroups maskSVG
, withClipPathRef (Ref label)
$ translate (x - w / 2 + w * t) y (scaleToSize w screenHeight dstSVG)
, withClipPathRef (Ref label)
$ translate (x + w / 2 + w * t) y (scaleToSize w screenHeight srcSVG)
]
where
label = "word-mask-" ++ key
(x, y, w, _h) = boundingBox maskSVG
Faking transcripts
Scripts and animations are often developed and revised in parallel. It would be too much work to make audio recordings of drafts but a rough idea of the timing information is paramount when developing the visuals. To this end, reanimate can fake timing information and pretend to "read" the script at roughly 120 words per minute.
Faking data is done automatically by loadTranscript
if no audio file can be found. It's also possible to directly parse a text as a transcript:
fakeTranscript :: Text -> Transcript
Toggle source code.
#!/usr/bin/env stack
-- stack runghc --package reanimate
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE ApplicativeDo #-}
module Main where
import Control.Monad
import qualified Data.Text as T
import Reanimate
import Reanimate.Voice
import Reanimate.Builtin.Documentation
import Graphics.SvgTree ( ElementRef(..) )
transcript :: Transcript
transcript =
fakeTranscript
"There is no audio\n\n\
\for this transcript....\n\n\n\
\Timings are fake,\n\n\
\which is quite useful\n\n\
\during development"
main :: IO ()
main = reanimate $ scene $ do
newSpriteSVG_ $ mkBackgroundPixel rtfdBackgroundColor
waitOn $ forM_ (splitTranscript transcript) $ \(svg, tword) -> do
highlighted <- newVar 0
void $ newSprite $ do
v <- unVar highlighted
pure $ centerUsing (latex $ transcriptText transcript) $ masked
(wordKey tword)
v
svg
(withFillColor "grey" $ mkRect 1 1)
(withFillColor "black" $ mkRect 1 1)
fork $ do
wait (wordStart tword)
let dur = wordEnd tword - wordStart tword
tweenVar highlighted dur $ \v -> fromToS v 1
wait 2
where
wordKey tword =
T.unpack (wordReference tword) ++ show (wordStartOffset tword)
{-# INLINE masked #-}
masked :: String -> Double -> SVG -> SVG -> SVG -> SVG
masked key t maskSVG srcSVG dstSVG = mkGroup
[ mkClipPath label $ removeGroups maskSVG
, withClipPathRef (Ref label)
$ translate (x - w / 2 + w * t) y (scaleToSize w screenHeight dstSVG)
, withClipPathRef (Ref label)
$ translate (x + w / 2 + w * t) y (scaleToSize w screenHeight srcSVG)
]
where
label = "word-mask-" ++ key
(x, y, w, _h) = boundingBox maskSVG
Setting up triggers
Now that we've covered the basics, let's look at the API for querying timings and setting up triggers in more detail:
transcriptWords :: Transcript -> [TWord]
data TWord = TWord
{ wordAligned :: Text
, wordCase :: Text
, wordStart :: Double -- ^ Start of pronunciation in seconds
, wordStartOffset :: Int -- ^ Character index of word in transcript
, wordEnd :: Double -- ^ End of pronunciation in seconds
, wordEndOffset :: Int -- ^ Last character index of word in transcript
, wordPhones :: [Phone]
, wordReference :: Text -- ^ The word being pronounced.
} deriving (Show)
Looking at the TWord
data structure, the key fields are wordStart
, wordEnd
, and wordReference
. The difference between wordStart
and wordEnd
gives the duration of a word, and the video below illustrates how graphical elements can respond to the reading speed:
Toggle source code.
#!/usr/bin/env stack
-- stack runghc --package reanimate
{-# LANGUAGE ApplicativeDo #-}
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Control.Lens
import Control.Monad
import qualified Data.Text as T
import Graphics.SvgTree (ElementRef (..))
import Reanimate
import Reanimate.Builtin.Documentation
import Reanimate.Voice
transcript :: Transcript
transcript = loadTranscript "voice_triggers.txt"
transformer :: SVG -> SVG
transformer =
translate (-4) 0 . centerUsing (latex $ transcriptText transcript)
main :: IO ()
main = reanimate $ scene $ do
newSpriteSVG_ $ mkBackgroundPixel rtfdBackgroundColor
waitOn $ forM_ (splitTranscript transcript) $ \(svg, tword) -> do
let render v = transformer $ masked (wordKey tword)
v
svg
(withFillColor "grey" $ mkRect 1 1)
(withFillColor "black" $ mkRect 1 1)
highlighted <- simpleVar render 0
let dur = wordEnd tword - wordStart tword
fork $ do
wait (wordStart tword)
tweenVar highlighted dur $ \v -> fromToS v 1
fork $ do
wait (wordStart tword)
case wordReference tword of
"one" -> highlight dur $ latex "1"
"two" -> highlight dur $ latex "2"
"three" -> highlight dur $ latex "3"
"red" -> highlight dur $ withFillColor "red" $ mkCircle 1
"green" -> highlight dur $ withFillColor "green" $ mkCircle 1
"blue" -> highlight dur $ withFillColor "blue" $ mkCircle 1
_ -> return ()
wait 2
where
wordKey tword =
T.unpack (wordReference tword) ++ show (wordStartOffset tword)
highlight dur img =
play
$ animate
(\t ->
translate (screenWidth / 4) 0 $ scale t $ scaleToHeight 4 $ center
img
)
& signalA (bellS 2)
& setDuration dur
{-# INLINE masked #-}
masked :: String -> Double -> SVG -> SVG -> SVG -> SVG
masked key t maskSVG srcSVG dstSVG = mkGroup
[ mkClipPath label $ removeGroups maskSVG
, withClipPathRef (Ref label)
$ translate (x - w / 2 + w * t) y (scaleToSize w screenHeight dstSVG)
, withClipPathRef (Ref label)
$ translate (x + w / 2 + w * t) y (scaleToSize w screenHeight srcSVG)
]
where
label = "word-mask-" ++ key
(x, y, w, _h) = boundingBox maskSVG
Words can be looked up in a transcript and ambiguities can be resolved by inserting section markers and using them as keys.
-- | Locate the first word that occurs after all the given keys.
-- An error is thrown if no such word exists. An error is thrown
-- if the keys do not exist in the transcript.
findWord :: Transcript -> [Text] -> Text -> TWord
-- | Locate all words that occur after all the given keys.
-- May return an empty list. An error is thrown
-- if the keys do not exist in the transcript.
findWords :: Transcript -> [Text] -> Text -> [TWord]
Below is a transcript with two section markers, 'middle' and 'final'. These markers make it easy to tell the different between the word 'circle' in the first paragraph and the word 'circle' in the second paragraph. In the corresponding video, the words trigger different events depending on which paragraph (or section) they're in. Section markers are only used as references and are not included in the audio recording.
Everything in this animation is timed by my voice.
Every flash, every circle, every square.
[middle]
I decide when everything begins and ends.
Time has run out for the square
and now the circle
[final]
Let's spawn three more circles
and destroy them in a
flash flash flash
Toggle source code.
#!/usr/bin/env stack
-- stack runghc --package reanimate
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE ApplicativeDo #-}
module Main where
import Control.Monad
import qualified Data.Text as T
import Reanimate
import Reanimate.Voice
import Reanimate.Builtin.Documentation
import Geom2D.CubicBezier ( QuadBezier(..)
, evalBezier
, Point(..)
)
import Graphics.SvgTree ( ElementRef(..) )
transcript :: Transcript
transcript = loadTranscript "voice_advanced.txt"
main :: IO ()
main = reanimate $ scene $ do
bg <- newSpriteSVG $ mkBackgroundPixel rtfdBackgroundColor
spriteZ bg (-100)
newSpriteSVG_ $ mkGroup
[withStrokeColor "black" $ mkLine (-screenWidth, 0) (screenWidth, 0)]
centerTxt <- textHandler
flashEffect
circleEffect
squareEffect
finalEffect
waitOn $ forM_ (transcriptWords transcript) $ \tword -> fork $ do
wait (wordStart tword)
writeVar centerTxt $ wordReference tword
wait 2
wordDuration :: TWord -> Double
wordDuration tword = wordEnd tword - wordStart tword
--
finalEffect :: Scene s ()
finalEffect = fork $ do
let begin = findWord transcript ["final"] "circles"
ends = findWords transcript ["final"] "flash"
path = QuadBezier (Point 6 (-radius)) (Point 0 6) (Point (-6) (-radius))
radius = 0.3
wait (wordStart begin)
ss <- fork $ replicateM 3 (circleSprite radius path <* wait 0.2)
mapM_ (flip spriteZ (-1)) ss
forM_ (zip ss ends) $ \(s, end) -> fork $ do
spriteMap s flipXAxis
wait (wordStart end - wordStart begin)
destroySprite s
-- square effect
squareEffect :: Scene s ()
squareEffect = fork $ do
let
begin = findWord transcript [] "square"
end = findWord transcript ["middle"] "square"
path =
QuadBezier (Point 6 (-size / 2)) (Point 0 6) (Point (-6) (-size / 2))
size = 1
wait (wordStart begin)
s <- squareSprite size path
spriteMap s (rotate 180)
spriteZ s (-1)
wait (wordStart end + wordDuration end / 2 - wordStart begin)
destroySprite s
-- circle effect
circleEffect :: Scene s ()
circleEffect = fork $ do
let begin = findWord transcript [] "circle"
end = findWord transcript ["middle"] "circle"
path = QuadBezier (Point 6 (-radius)) (Point 0 6) (Point (-6) (-radius))
radius = 0.3
wait (wordStart begin)
s <- circleSprite radius path
spriteZ s (-1)
wait (wordStart end + wordDuration end / 2 - wordStart begin)
destroySprite s
-- flash effect
flashEffect :: Scene s ()
flashEffect = forM_ (findWords transcript [] "flash") $ \flashWord -> fork $ do
wait (wordStart flashWord)
flash <- newSpriteSVG $ mkBackground "black"
spriteTween flash (wordDuration flashWord)
$ \t -> withGroupOpacity (fromToS 0 0.7 $ (powerS 2 . reverseS) t)
wait (wordDuration flashWord)
destroySprite flash
--------------------------------------------------------------------------
-- Helpers and sprites
textHandler :: Scene s (Var s T.Text)
textHandler = simpleVar render T.empty
where
render txt =
let txtSvg = translate 0 (-0.25) $ centerX $ latex txt
activeWidth = svgWidth txtSvg + 0.5
in mkGroup
[ withStrokeWidth 0 $ withFillColorPixel rtfdBackgroundColor $ mkRect
activeWidth
1
, txtSvg
, withStrokeColor "black"
$ mkLine (activeWidth / 2, 0.5) (activeWidth / 2, -0.5)
, withStrokeColor "black"
$ mkLine (-activeWidth / 2, 0.5) (-activeWidth / 2, -0.5)
]
circleSprite :: Double -> QuadBezier Double -> Scene s (Sprite s)
circleSprite radius path = newSprite $ do
t <- spriteT
d <- spriteDuration
pure
$ let Point x y = evalBezier path (t / d)
in mkGroup
[ mkClipPath "circle-mask"
$ removeGroups
$ translate 0 (screenHeight / 2)
$ withFillColorPixel rtfdBackgroundColor
$ mkRect screenWidth screenHeight
, withClipPathRef (Ref "circle-mask") $ translate x y $ mkCircle
radius
]
squareSprite :: Double -> QuadBezier Double -> Scene s (Sprite s)
squareSprite size path = newSprite $ do
t <- spriteT
d <- spriteDuration
pure
$ let Point x y = evalBezier path (t / d)
in mkGroup
[ mkClipPath "square-mask"
$ removeGroups
$ translate 0 (screenHeight / 2)
$ withFillColorPixel rtfdBackgroundColor
$ mkRect screenWidth screenHeight
, withClipPathRef (Ref "square-mask")
$ translate x y
$ rotate (t / d * 360)
$ withFillOpacity 0
$ withStrokeColor "black"
$ mkRect size size
]