Ph.D. Public Defense

Computational Framework for the Analysis of Spatial Audio

Steven Crawford

Supervised by Professor Mark Bocko

Monday, June 1, 2020
10 a.m.

This thesis explores the extent to which the coherence properties of binaural signals can be used to compute cues that aid in predicting the perceived location and apparent auditory source width of sound events. To this end, a computational, modular framework for spatial audio analysis is developed and applied as a prediction tool to synthetically rendered spatial sound fields. The primary focus is to develop a binaural fusion model and, more generally, a model of spatial hearing with immediate practical application to objective spatial sound-localization predictions for arbitrary multi-channel and/or headphone-based spatial audio synthesis schemes. The binaural model and overall framework may serve as useful tools for the analysis and design of spatial audio experiences for virtual and augmented reality systems. The framework may also be employed as a convenient alternative to the direct use of human participants in listening experiments, and as such may serve as a tool in the development of spatial audio rendering systems. Using a computational binaural auditory model as its front end, the framework is composed of a series of modular signal processing blocks designed to simulate the peripheral and central stages of the human auditory system. The present implementation of the binaural fusion model builds upon the Meddis hair cell model for the peripheral stages and coincidence detection of delay-line activity patterns (the Jeffress model) for the central processing stage. However, the overall framework is intended to be modular and not model-specific. Additionally, signals are processed using a gammatone filter bank to produce an interaural coherence function: a short-time windowed cross-correlation of the binaural signals in each gammatone filter band.
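As a rough illustration of the short-time windowed cross-correlation described above, the following Python sketch computes a normalized interaural cross-correlation for a single frequency band. It is a minimal sketch under stated assumptions, not the thesis implementation: the gammatone filter bank and hair cell stages are omitted (the inputs are assumed to be already band-filtered left and right ear signals), and the function name, Hann window, 20 ms frame, 10 ms hop, and 1 ms lag range are all illustrative choices that do not come from the abstract.

```python
import numpy as np

def interaural_coherence(left, right, fs, max_lag_s=1e-3, win_s=0.02, hop_s=0.01):
    """Short-time normalized cross-correlation of one band's ear signals.

    Hypothetical helper: returns (lags_in_seconds, frames), where
    frames[i, k] is the correlation of frame i at lag k. A positive lag
    means the right-ear signal is delayed relative to the left-ear signal.
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    max_lag = int(round(max_lag_s * fs))
    win = int(round(win_s * fs))
    hop = int(round(hop_s * fs))
    lags = np.arange(-max_lag, max_lag + 1)
    window = np.hanning(win)  # assumed window shape
    # Keep every shifted segment inside the signal.
    starts = range(max_lag, len(left) - win - max_lag + 1, hop)
    frames = np.zeros((len(starts), len(lags)))
    for i, s in enumerate(starts):
        l_seg = left[s:s + win] * window
        l_energy = np.sum(l_seg ** 2)
        for k, lag in enumerate(lags):
            r_seg = right[s + lag:s + lag + win] * window
            denom = np.sqrt(l_energy * np.sum(r_seg ** 2))
            frames[i, k] = np.dot(l_seg, r_seg) / denom if denom > 0 else 0.0
    return lags / fs, frames

# Usage with synthetic data: the right ear receives the left-ear noise
# delayed by 10 samples.
fs = 16000
rng = np.random.default_rng(0)
left = rng.standard_normal(4000)
right = np.roll(left, 10)
lags, frames = interaural_coherence(left, right, fs)
mean_coherence = frames.mean(axis=0)  # average over time frames
print("lag of peak coherence (s):", lags[np.argmax(mean_coherence)])
```

Stacking such per-band functions across all filter bands would give one possible form of the binaural activity pattern; how the thesis combines bands and frames is not specified here.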
These interaural coherence functions collectively produce a binaural activity representation, called a ‘correlogram’, from which the perceived auditory image location and its spatial extent may be inferred. A dictionary of basis correlograms corresponding to measured sound source locations in three-dimensional space is generated and then used to compare the rendering precision and accuracy of virtual auditory images produced using first-order and higher-order Ambisonics (FOA and HOA). A form of regularized regression called an elastic net is used to infer the spatial and psychoacoustic properties of the virtual acoustic source. The final output of the framework is a predictive metric representing the perceived location and ‘width’ of an acoustic source in 3D space.
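The dictionary-based inference step can be sketched as follows. This is a minimal illustration under stated assumptions, not the method as implemented in the thesis: each dictionary column stands in for a flattened basis correlogram at a hypothetical azimuth, the observed correlogram is synthesized as a mixture of two columns, and the elastic net is solved with a simple coordinate-descent routine written here for self-containment. The function name, penalty values, and azimuth grid are all assumptions.

```python
import numpy as np

def elastic_net(X, y, alpha=0.01, l1_ratio=0.5, n_iter=200):
    """Coordinate-descent elastic net (hypothetical helper).

    Minimizes (1 / (2 n)) * ||y - X w||^2
              + alpha * l1_ratio * ||w||_1
              + 0.5 * alpha * (1 - l1_ratio) * ||w||^2.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    w = np.zeros(p)
    residual = y.copy()                 # equals y - X @ w while w is zero
    col_sq = (X ** 2).sum(axis=0) / n
    l1 = alpha * l1_ratio
    l2 = alpha * (1.0 - l1_ratio)
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            residual += X[:, j] * w[j]  # remove column j's contribution
            rho = X[:, j] @ residual / n
            # Soft-threshold (L1 part), then shrink (L2 part).
            w[j] = np.sign(rho) * max(abs(rho) - l1, 0.0) / (col_sq[j] + l2)
            residual -= X[:, j] * w[j]  # add the updated contribution back
    return w

# Synthetic stand-in for a dictionary of flattened basis correlograms,
# one column per candidate azimuth (values are placeholders, not data).
rng = np.random.default_rng(0)
azimuths_deg = np.arange(-90, 90, 15)              # 12 hypothetical directions
dictionary = rng.standard_normal((200, len(azimuths_deg)))
observed = 0.7 * dictionary[:, 3] + 0.3 * dictionary[:, 4]
observed += 0.01 * rng.standard_normal(200)

weights = elastic_net(dictionary, observed)
print("dominant azimuth (deg):", azimuths_deg[np.argmax(weights)])
print("active azimuths (deg):", azimuths_deg[np.abs(weights) > 0.05])
```

In this toy setting the largest weight points to the dominant direction, and the spread of nonzero weights over neighboring directions gives a crude indication of source extent; how the thesis maps regression weights to its location and width metric is not stated in the abstract.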