Navy Boards

Developing Narrative
Experiences for
Amazon Alexa

© 2020

Wrapping Up Dynamic Audio

I’m writing this post while wrapping up everything else, so it’ll likely be shorter than intended. Probably a blessing for you!

The code associated with the work done in this leg of the project is in the bounty-hunter-example repository. In as few words as possible, this stretch of the project was about:

  • assembling dynamic audio from multiple possible choices for each part of a sentence
  • assembling a full piece of audio for an Alexa response including multiple levels of background atmosphere, dialogue and sound effects
  • experimenting with generating these full soundscapes at runtime
  • incorporating FMOD to generate dynamic audio soundscapes

That’s… that actually sounds like a lot in hindsight. I didn’t do as much actual audio mixing as I’d like, in the sense of sitting with REAPER and figuring out what sounds good, but that part of the process was the self-indulgent part. A lot of fun when I got to it, but not the central point. Let’s talk about the rest of it.

FMOD Dynamic Soundscapes

This was a huge win. Thanks to invaluable consults with Michael Thieler and Maize Wallin, I was able to understand how to create randomized soundscapes that fit my needs in FMOD Studio and then use the Profiler to record individual audio tracks for later use.

For the Bounty Hunter example, I used this to create two ambient tracks. The first plays when the ship has taken critical damage, and incorporates a warning klaxon, a repeated line of warning dialogue and intermittent sparking SFX. The second plays when the ship is undergoing evasive maneuvers, and layers intermittent metal creaks for hull strain with maneuvering jets (both created using individual Scatterer instruments).

Assembling Dynamic Audio From Multiple Possible Choices

As a continuation of the work in Mixing For Variation, multiple takes were recorded for some components of dialogue: for example, indicating assent, or signalling an error before providing more details (have a look at the Bounty Hunter Dynamic Audio Asset List for some examples of this). To illustrate one here, a line from the second-in-command confirming that mass driver cannons were being fired could choose from:

  • “Aye Cap’n”
  • “Confirmed”
  • “Right away”

And then from:

  • “Firing Mass Drivers”
  • “Cannons lighting ‘em up”
  • “Mass Drivers hot”

That gives 9 possible combinations (3 × 3) for a response to that order. This gave a real sense of life to the interaction compared to a single response. One thing that became clear very quickly, though, is that non-verbals (breaths, sighs, groans, ticks, clicks and the like) need to be used very sparingly, or split out as their own components and randomised with care, as these stood out when repeated far more glaringly than any dialogue.
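To make the arithmetic concrete, here's a minimal sketch of picking one take per part of the sentence. The lines are transcribed from the lists above; the function names are illustrative and not taken from the repository, where the parts are audio files rather than strings.

```python
import itertools
import random

# Takes transcribed from the post; stand-ins for the recorded audio files.
ACKNOWLEDGEMENTS = ["Aye Cap'n", "Confirmed", "Right away"]
ACTIONS = ["Firing Mass Drivers", "Cannons lighting 'em up", "Mass Drivers hot"]


def all_responses():
    """Return every acknowledgement paired with every action line."""
    return list(itertools.product(ACKNOWLEDGEMENTS, ACTIONS))


def pick_response(rng=random):
    """Choose one take for each part of the sentence independently."""
    return rng.choice(ACKNOWLEDGEMENTS), rng.choice(ACTIONS)


if __name__ == "__main__":
    print(len(all_responses()))  # 3 choices x 3 choices
    print(" ... ".join(pick_response()))
```

Adding a third slot with three more takes would multiply the total again, which is why even a handful of recordings per part goes a long way.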

Assembling A Full Piece of Audio Including Atmosphere

This is where I’d really hoped that pydub would do heavy lifting, and it didn’t disappoint. After getting ffmpeg running on an EC2 instance, it was possible to use pydub to glue together individual audio files and overlay dialogue on ambience. To add icing to the cake, I used pydub to provide a short fade-in and fade-out on each generated file, to be less jarring to the listener.
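As a rough illustration of that glue-overlay-fade sequence, here's a sketch using pydub's `AudioSegment`. The function names, padding and fade durations are assumptions for the example rather than the values used in the skill, and it needs pydub plus ffmpeg available to actually export MP3s.

```python
def ambience_length_ms(dialogue_ms, lead_in_ms=500, tail_ms=500):
    """Length of ambience bed needed: the dialogue plus padding either side."""
    return lead_in_ms + dialogue_ms + tail_ms


def bake_response(dialogue_paths, ambience_path, out_path,
                  lead_in_ms=500, tail_ms=500, fade_ms=250):
    """Join dialogue parts, lay them over ambience, fade, and export as MP3."""
    # Imported here so the helper above stays usable without pydub installed.
    from pydub import AudioSegment

    # Glue the individual dialogue files together end to end.
    dialogue = AudioSegment.empty()
    for path in dialogue_paths:
        dialogue += AudioSegment.from_file(path)

    # Slice the ambience (in milliseconds) down to the length we need.
    # If the ambience file is shorter than that, the slice is simply shorter
    # and overlay() will cut the dialogue off at the end of the bed.
    ambience = AudioSegment.from_file(ambience_path)
    bed = ambience[:ambience_length_ms(len(dialogue), lead_in_ms, tail_ms)]

    # Overlay the dialogue after a short lead-in, then soften both ends.
    mixed = bed.overlay(dialogue, position=lead_in_ms)
    mixed = mixed.fade_in(fade_ms).fade_out(fade_ms)

    mixed.export(out_path, format="mp3")
    return out_path
```

A call would look something like `bake_response(["aye.wav", "firing.wav"], "evasive.wav", "out.mp3")`, with those file names being placeholders.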

Experimenting With Generating Soundscapes At Runtime

This is where the real work came in. Alexa can fetch audio from any secure URL that is using a valid SSL certificate. After adding a custom ‘A’ record for a domain I had hanging around to point at a specific AWS EC2 instance configured with an elastic IP, I used certbot (a Let’s Encrypt tool for generating free production SSL certificates) to configure an NGINX HTTP server automatically with the necessary SSL configuration.

This took some troubleshooting. SSL configuration is always a bear when it’s not working, no matter the technology involved. I made heavy use of the Device Log in the Alexa Developer Console alongside the nginx -T test command to verify the NGINX server’s SSL configuration and get compatible ciphers between the two pieces of software.

The NGINX server was configured to pass traffic on to a Flask application with a single handler for all URL requests. The audio-manager Python class within the Bounty Hunter Alexa skill would build up a URL containing a series of indicators about the current state in the path, and end each request with an MP3 file suffix so Alexa recognised the requested URL as a valid MP3 file request.

(I experimented with passing along all dialogue and ambience files required as part of a request, but the URL very quickly became too long.)
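The shape of that exchange can be sketched with the standard library alone. Everything here is hypothetical: the host name, the state indicators and the function names are made up for illustration, and the real audio-manager class and Flask handler may encode state quite differently.

```python
from urllib.parse import quote, unquote, urlsplit

BASE_URL = "https://audio.example.com"  # hypothetical host, not the real one


def build_audio_url(state_parts, base_url=BASE_URL):
    """Skill side: encode state indicators as path segments ending in .mp3."""
    path = "/".join(quote(str(part), safe="") for part in state_parts)
    return f"{base_url}/{path}.mp3"


def parse_audio_path(url_or_path):
    """Server side: recover the state indicators from a requested path."""
    path = urlsplit(url_or_path).path
    if not path.endswith(".mp3"):
        raise ValueError("expected a path ending in .mp3")
    trimmed = path[: -len(".mp3")].strip("/")
    return [unquote(segment) for segment in trimmed.split("/") if segment]


if __name__ == "__main__":
    url = build_audio_url(["combat", "critical_damage", "fire_mass_drivers"])
    print(url)
    print(parse_audio_path(url))
```

Keeping the path down to a few short indicators, and letting the server work out which files they imply, is what avoids the overly long URLs mentioned above.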

The audio_util class within the Flask application was then responsible for taking those input parameters and determining the necessary files to bake a response file. While it didn’t feel ideal to have to create case-by-case considerations rather than finding a way to define everything via configuration, there are a few things here worth pointing out:

  • a good chunk of the smarts of assembling audio was located in JSON configuration in the audio_util class.
  • where options exist for a piece of dialogue, an individual choice is made by wildcard-matching against a filename pattern.
  • each combination of dialogue parts would be cached as a temporary filename identified by an MD5 hash of the string of all filenames used. So a unique combination would be created once, then reused.
  • the final filename incorporating ambience and dialogue would also be identified by an MD5 hash of the dialogue file and ambience files used. A random section of each ambience file would be used when generating a result.
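The wildcard choice, hash-based cache names and random ambience section from the list above can be sketched roughly as follows. The file names, the `|` separator in the hash key and the function names are all assumptions for illustration; the actual audio_util class may differ.

```python
import fnmatch
import hashlib
import random


def choose_take(pattern, available_files, rng=random):
    """Pick one file at random from those matching a wildcard pattern."""
    matches = sorted(fnmatch.filter(available_files, pattern))
    if not matches:
        raise FileNotFoundError(f"no audio file matches {pattern!r}")
    return rng.choice(matches)


def cache_filename(filenames, suffix=".mp3"):
    """Name a baked file after the MD5 hash of the ordered source filenames."""
    key = "|".join(filenames).encode("utf-8")
    return hashlib.md5(key).hexdigest() + suffix


def random_window(total_ms, window_ms, rng=random):
    """Return (start, end) in ms for a random section of an ambience file."""
    if window_ms >= total_ms:
        return 0, total_ms  # ambience is too short, so use all of it
    start = rng.randrange(total_ms - window_ms + 1)
    return start, start + window_ms


if __name__ == "__main__":
    files = ["assent_01.wav", "assent_02.wav", "fire_01.wav", "fire_02.wav"]
    parts = [choose_take("assent_*.wav", files), choose_take("fire_*.wav", files)]
    print(parts, cache_filename(parts))
    print(random_window(total_ms=60_000, window_ms=4_000))
```

Because the hash depends only on which files were chosen and in what order, the same combination always maps to the same cached file, so a lookup on disk is enough to decide whether anything needs baking.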

The thing that I’d revisit here is the facility to create multiple choices for the same combination of dialogue component parts. Otherwise you’d end up with the same situation as non-verbals - hearing a maneuvering jet (or some other sound effect) fire at the same point in a sentence would stand out over multiple play sessions.
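One way this could work, purely as a sketch of the idea rather than anything in the repository, is to fold a variant number into the hash key so that each combination of files can have several differently mixed renders on disk. The variant count and function names are assumptions.

```python
import hashlib
import random


def variant_filename(filenames, variant, suffix=".mp3"):
    """Hash the source filenames plus a variant number into a cache name."""
    key = "|".join(list(filenames) + [f"variant={variant}"]).encode("utf-8")
    return hashlib.md5(key).hexdigest() + suffix


def pick_variant(filenames, variant_count=4, rng=random):
    """Choose one of the pre-allocated variants for this combination."""
    return variant_filename(filenames, rng.randrange(variant_count))


if __name__ == "__main__":
    sources = ["assent_01.wav", "fire_02.wav", "evasive_ambience.wav"]
    print(pick_variant(sources))
```

Each variant would be baked with a different random section of ambience, so a maneuvering jet no longer lands on the same word every time the line plays.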

The idea behind rendering this audio at runtime was to introduce the capacity for more randomness and flexibility: I’m not aware of a way to use pydub’s export method to write to S3, and being able to create new audio at runtime seemed desirable. Looking back, I’m glad I broke the back of this problem, but I’m not convinced it’s the way forward. It adds moving parts to the runtime configuration, AWS costs for a running EC2 instance, and a delay when complex audio is generated for the first time. I suspect I’d move to using Flask as part of the deployment process to create files (including a number of variants) up front, and deploy a large number of files to S3 with obscure MD5 hash filenames instead.