Electrical and Computer Engineering Ph.D. Public Defense


Multi-modal Analysis for Music Performances

Bochen Li

Supervised by Professor Zhiyao Duan

Tuesday, August 11, 2020
1 p.m.
Zoom Meeting TBA

When we talk about music, we tend to view it as an art of sound. However, it is much broader than that. We watch music performances, read musical scores, and memorize the lyrics of songs. The visual aspect of a musical performance helps performers express their ideas and engage the audience, while the symbolic aspect of music connects composers with performers across continents and centuries. A successful intelligent system for music understanding should model all of these modalities and the relations among them. This is exactly the objective of my research: multi-modal analysis of music performances. The topic lies at the core of artificial intelligence: it bridges computer audition and computer vision, and connects symbolic processing with signal processing. It will enable novel multimedia information retrieval applications and music interactions for experts and novices alike.

Fundamentally, there are two problems to be addressed: 1) to coordinate the multiple modalities; 2) to leverage this coordination to achieve novel music analyses that are impossible when each modality is analyzed alone.

The first problem, identifying coordination, can be addressed in two ways: as temporal alignment of audio and the music score, which are represented in different time units (seconds vs. beats), and as spatial source association in videos of ensemble performances, which identifies the correspondence between the players in the visual scene and the audio/score tracks. For temporal alignment, I focus on real-time audio-to-score alignment for piano performance with the sustained effect, one of the most challenging cases. For source association, I address the problem in chamber ensembles (one player per track) for common Western instruments, including strings, woodwinds, and brass.

For the second problem, I conduct research to show that coordinating the multiple modalities of a music performance benefits traditional music information retrieval (MIR) tasks and opens up emerging research topics. I design multi-modal systems that leverage visual information to help estimate and stream pitches in polyphonic music from ensemble performances, and to help analyze performance expressiveness (e.g., vibrato characteristics). This concept is implemented for string ensembles, where it has proven successful. I also propose to address the source separation problem for singing performances, improving vocal separation quality by incorporating the visual modality, e.g., the mouth movements of the singer. Last but not least, I propose new topics in expressive visual performance generation, such as generating expressive body movements of pianists from music scores.