October 31, 2016

Your Detailed Guide to Spatial Audio for 360 Video—Using Mostly Free Tools

headphones
The creators of one of YouTube’s first 360 videos with ambisonic audio share their process, from recording to post.

[Editor's Note: No Film School asked the 'Love is All Around' team to detail their experiences creating ambisonic audio when YouTube started to support the technology.]

Just in time for Halloween, we released Love is All Around: A Horror Story, a 360º video in YouTube's #Room301 horror series. When YouTube asked us to create a video about our worst fear, we chose rejection. Instead of setting up the video cinematically, director Michael Morgenstern staged it as a play, with different parts of the set representing different locations. YouTube's pre-constructed set with lighting board enabled us to pre-program lighting cues, just like in a theater.
 
Directing the viewer's attention is key, and we were excited to take advantage of YouTube's recent support for ambisonic audio to be one of the first videos on the platform with the technology. Spatial audio allows you to spatially locate sounds to direct the viewer's attention left or right. It's also supported in Milk VR and other platforms.
 
The final output for YouTube is first-order ambisonic audio, which encodes four spatially located sound sources separated at 90º angles to each other. Higher-order ambisonics, which may be supported in the future, encode 9, 16, or more audio sources.
 

(The video will be available on the Samsung VR library soon. Until then, watch it on Gear VR by downloading this file and side-loading it onto your device.)

Capturing audio

We captured location audio using six omni pattern lavalier microphones placed on actors, three boom microphones placed underneath the camera array positioned at left/right/center orientation, and a single h2n microphone mounted atop the Jump system.
 
An Ambisonic B-format recording mode is supported in the 2.0 firmware of the h2n, but we failed to enable the correct recording mode and were left with only two channels of usable room ambiance. If you’re working with a similar setup, it is crucial to double check that you are in SPATIAL mode (completely separate from four-channel QUAD). This article on the Zoom company page provides more information on what is currently the cheapest means of recording B-format location ambiance, though limited to the horizontal axis only.
 
We tried to invest most of our audio focus on set into capturing high quality usable LAV takes (with as little proximity effect or clothing interference as possible). Because all diegetic sources are now spatially defined, traditional means of filling dialogue tracks with room tone noise does not translate well to the 360 soundstage. Any unwanted noise picked up on a dialogue microphone provides the listener with false cues and generates further complications with reverb appearing and disappearing suddenly in discreet spatial locations. For this reason, the boom microphones exist purely as fallback (for ADR guide) and to capture the occasional usable production effect.

Every plugin and third party component utilized in this technical demonstration is both publicly available and free for noncommercial use.

Post-production

For the post production process we opted to perform the initial XML/OMF session imports and the majority of audio editing tasks in ProTools. This workflow allowed for the mix design to be mentally separated from base leveling, noise reduction and trimming. For everything beyond standard editing procedure, Reaper and a slew of incredible open source VST plugins enabled the ambisonic design to take place.

It should be positively noted that every plugin and third party component utilized in this technical demonstration is both publicly available and free for noncommercial use as of this writing. We experimented with many different tools and retail demos during the preparation phase. The modules documented below were the clear winners for their utility value and application specifically to YouTube 360 format.
 

Free tools & resources

  • ambiX: a set of ambisonic tools by Matthias Kronlachner

The plugin suite at the core of this whole tech stack currently acts as the only officially documented workflow for YouTube 360 audio. Perhaps the most unexpectedly useful of all of the plugins in this suite is the "ambiX converter." This champion utility allows for the successful integration of nearly every other ambisonic plugin which can be found today. All you have to do is define an input and output channel order and normalization spec.

Conversion parameters for ATK -> ambiX
Conversion parameters for ATK -> ambiX

  • mcfx: a set of multi-channel filter tools and convolution reverb by Matthias Kronlachner.

This toolkit pairs perfectly with the ambiX utilities in providing bread and butter audio capabilities (EQ/Filter/Delay) which work in a vast selection of multi-channel formats all throughout first and higher order ambisonics. Additionally, it includes a convolution reverb engine which is easily integrated into YouTube format (ACN Channel Order, SN3D normalization) by using the ambiX converter.

mcfx_filter4 (bread and butter filter/eq which works nicely with most ambisonic tools)
mcfx_filter4 (bread and butter filter/eq which works nicely with most ambisonic tools)

This Google KB article documents the core setup process in Reaper pretty well. It includes a number of example files and images to get you started thinking down a particular path. In the end, the Ambisonic Toolkit (ATK) was chosen for almost all the essential panning and spatialization of diegetic source material while ambiX tools were only leveraged for monitoring to YouTube's specification (and binaural impulse response) as well as performing vital channel conversion and filtering.

  • The Ambisonic Toolkit: an extremely powerful and unique set of Ambisonic tools designed by Trond Lossius (and my personal favorite of the bunch)
Without a doubt, this toolset has the most impressive selection of ambisonic functions from both a technical and creative standpoint. It was easily adapted to YouTube specs using the aforementioned ambiX converter, and provided the most realistic outcome for encoding non-spatial source material to ambisonic. Perhaps the key takeaway discovery from this whole exercise was introducing the concept of diffusion and transformation of mono source material. Trond demonstrates it best in his tutorial video.

The Planewave encoder is made with the most common way of encoding mono sources. With first order ambisonics this gives us as clear of a direction as you can possibly get and the sound here is really like, if you had a sound source that was infinitely far away. For all practical reasons that means it is several meters away from you and it's coming towards you. The sound wave passing by you is looking like planewaves, much the same way as a wave would look as it comes in towards the shore... The sound source is as clearly pinpointed as you can possibly get.

If, however, we want our source audio to be less clearly defined in terms of location (for surround mixers, think divergence) other approaches must be taken. In the real world, sound almost never arrives at our ears in this "infinitely directional" manner, replicated by the baseline ambisonic mono encoders.

We can mimic the natural spread and dispersion of mono sources using a combination of the Encode Diffuser and Transform FocusPressPushZoom plugins. In this model, the "Degree of transformation" parameter becomes your panning divergence in circumstances where sources should be partially directional (such as medium/close perspectives relative to the viewer).

Leveraging Lossius' Diffusion and FocusPress algorithms to increase spatial harmonic content from mono dialogue source

  • WigWare: set of Ambisonic utilities designed by Bruce Wiggins.
WigWare is the only free suite that contains a non-IR based Ambisonic Reverb capable of generating decent small-to-medium-sized room reflections. It was found to be exceedingly difficult to locate B-format impulse responses of rooms that were not either grandiose in scale or downright peculiar. WigWare came to the rescue here with two different simple reverb modules to compose with.

AmbiFreeverb2 in the WigWare tools converted for use in YouTube spatial audio format
AmbiFreeverb2 in the WigWare tools converted for use in YouTube spatial audio format

  • Other Mentions:
    • Syncing 360 Video with Reaper DAW
      SpookFM developed a clever hack for syncing the timeline and binaural perspective with Reaper by integrating the GoPro player. This proved to be the easiest means of monitoring video in a 2D square viewport. Jump Inspector as documented by Google can also be synced to your Reaper timeline and head position, providing a real-time monitor/playback method with a head mounted

    • Sites containing free B-format impulse responses:
      Open AIR Library
      isophonics

Pain Points & Cautions

Here are a couple more things to note as you embark upon this process:

  1. YouTube's processing of 360 video can take a lot of time on its own, but spatial audio takes even longer. Do not immediately assume your ambisonic content isn't working if you can't hear the binaural transform taking place right away. In our tests, it took about 5-6 hours after YouTube's initial processing was completed for the spatial audio to appear correctly.

  2. Keep in mind how the non-diagetic audio track might be utilized in YouTube spatial audio RFC. There is no clear example of how to conduct an mp4 container in this manner to date. It would be very useful to have a totally separate stereo audio track free from the binaural processing for unaffected stereo musical recordings and the potential creative design application.

In the future, ambisonic can almost be imagined as a new process performed during pre-mixing in which the editor records "location entities." 

Muxing your audio and video

Ambisonic audio is more complex than a simple stereo track, so pairing it with video presents challenges. Your output should be a four-track audio stream, and the order can't be switched. There are several ways we found to mux the audio:

  1. In Reaper. This allows you to be sure the tracks are exported correctly, but has the disadvantage that video can't be changed without sending a full resolution version of the video to the sound mixer.
  2. Muxing with ffmpeg. The command-line ffmpeg allows you to combine the video stream from one file with the audio stream from another. You can send the finished audio to the editor, who can mux as many versions as they want.
    Installing ffmpeg on a mac is easy. An example command to simply mux two streams is:
    ffmpeg -i {video file} -i {audio file} -map 0:0 -map 1:0 -codec:v copy -codec:a copy -movflags faststart {output file}
    And an example command to recompress as h.264 with aac audio:
    ffmpeg -i {video file} -i {audio file} -map 0:0 -map 1:0 -codec:v libx264 -preset veryslow -profile:v baseline -level 5.2 -pix_fmt yuv420p -qp 17 -codec:a libfdk_aac -b:a 256k -movflags faststart {output file}
  3. You can also configure Premiere to work with ambisonic audio. We got this working but didn't experiment much with it. After exporting a file, be sure to use YouTube's Spatial Media Metadata Injector to add metadata which identifies the video as a VR video with spatial audio. Note that YouTube and Milk VR both support .mov and .mp4 files, though .mp4 files will be more compatible across devices (Gear VR, for example, does not play local .mov files.) The mp4 container does not support PCM audio.

The potential of ambisonic sound design

Overall, Love is All Around was a fantastic learning experience and a bit of an awakening to the creative potentials of ambisonic sound design. For mixes of larger scale and complexity, the notion of recording real time 360 pan automation to the character's movements onscreen takes shape. It can almost be imagined as a new process performed during pre-mixing in which the editor records "location entities." Dialogue and effects could then be bussed through to these various locales in space which always share a common origin (e.g. named actors moving onscreen). Meanwhile, we’ll all keep experimenting in this emerging space.

Your Comment

1 Comment

Is there a known solution to stream https urls of videos with muxed ambisonic audio from a Unity3D project (mobiles, VR platforms etc...)

December 27, 2016 at 4:07AM

0
Reply