Department of Electrical and Computer Engineering Ph.D. Public Defense

Sound Search by Vocal Imitation

Yichi Zhang

Supervised by Professor Zhiyao Duan

Monday, December 9, 2019
9 a.m.

Computer Studies Building, Room 426

Traditional search through collections of audio recordings compares a text-based query to the text metadata associated with each audio file and does not address the actual content of the audio. Text descriptions cannot describe all aspects of the audio content in detail. Query by vocal imitation (QBV) is a kind of query by example that lets users imitate the content of the audio they seek, providing an alternative to traditional text search. In this thesis, I propose a web-based system for sound search by vocal imitation as a novel form of human-computer interaction. It takes a vocal imitation from the user as a query and searches a sound database for recordings similar to the imitation.

First, in Chapter 1, I introduce the motivation for this research, the concept of sound search by vocal imitation, its application scenarios, and existing challenges. In Chapter 2, I discuss the research background on vocal imitation, query by example, and audio-related interaction. Various sound search algorithms are then proposed in the following two chapters. Specifically, in Chapter 3, search algorithms based on automatic feature learning with a Stacked Auto-Encoder (SAE) are proposed for sound search by vocal imitation in both supervised and unsupervised manners. However, in these algorithms the feature extraction and distance calculation modules are isolated, so the learned features are not guaranteed to be optimal for distance calculation. Hence, in Chapter 4, I further propose search algorithms based on end-to-end Siamese-style neural networks to solve this issue. To gain more insight into how such networks work, I visualize and sonify the input patterns that maximize the activation of certain neurons in each layer, using the activation maximization approach and the Griffin-Lim algorithm. In Chapter 5, I design Vroom!, a search engine for sounds that takes vocal imitations as queries, and describe its frontend and backend implementations. A comprehensive subjective study on Amazon Mechanical Turk is conducted to evaluate the performance of the vocal-imitation-based search engine and compare it with TextSearch, a text-based sound search engine used as the baseline. An experimental framework that wraps around Vroom! and TextSearch is built to conduct this user study. Analyses are performed and conclusions are drawn based on the user ratings and behavioral data collected from the subjects. Finally, I conclude this thesis in Chapter 6, where limitations of this work, possible solutions, and future work are also discussed.
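
As a rough illustration of the retrieval step described above (an imitation is compared against every recording in a database, and the most similar recordings are returned), the following minimal Python sketch ranks sounds by cosine similarity between feature vectors. It is not the method of the thesis: the feature vectors, sound names, and the functions cosine_similarity and rank_sounds are hypothetical, and cosine similarity merely stands in for the learned features and distance measures that the thesis proposes.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_sounds(query_features, database):
    """Return sound names sorted from most to least similar to the query."""
    scored = [
        (cosine_similarity(query_features, features), name)
        for name, features in database.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]


# Hypothetical toy features; a real system would extract them from audio.
database = {
    "dog_bark": [0.9, 0.1, 0.0],
    "car_engine": [0.1, 0.8, 0.3],
    "door_slam": [0.0, 0.2, 0.9],
}
imitation = [0.2, 0.9, 0.2]  # assumed features of the user's vocal imitation

ranking = rank_sounds(imitation, database)
print(ranking)
```

In this toy setup the imitation's features lie closest to those of "car_engine", so that recording is ranked first. An end-to-end approach such as the one outlined for Chapter 4 would instead learn the features and the similarity measure jointly.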