Ph.D. Public Defense

Structured Analysis and Generation in Music, Audio, and Beyond

Yujia Yan

Supervised by Zhiyao Duan

Tuesday, August 26, 2025
3 p.m.–4 p.m.

426 Computer Studies Building

Music is fundamentally organized sound, yet its hierarchical, temporal, and relational structure remains challenging for data-driven models. This dissertation explores structure-aware frameworks for the analysis and generation of music that embed explicit structured modeling at key levels of abstraction.

The first part reframes automatic music transcription as a structured interval-prediction task. Notes and pedal events are modeled directly as sets of non-overlapping time intervals within a neural semi-Markov conditional random field (semi-CRF) framework. By scoring holistic intervals rather than individual frames, this end-to-end approach removes multi-stage heuristics and achieves state-of-the-art results on piano transcription benchmarks.
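To make the interval-prediction idea concrete, the following is a minimal sketch of decoding a best-scoring set of non-overlapping intervals by dynamic programming. It is not the dissertation's implementation: the function name `decode_intervals`, the dictionary of precomputed interval scores, and the simplification that uncovered frames score zero are all assumptions made for illustration. In a neural semi-CRF the scores would come from a network.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[int, int]  # closed frame range [start, end]


def decode_intervals(
    num_frames: int, scores: Dict[Interval, float]
) -> Tuple[float, List[Interval]]:
    """Pick the highest-scoring set of non-overlapping intervals.

    Hypothetical helper: `scores` maps candidate intervals to real-valued
    scores; frames left uncovered contribute zero (an assumption).
    """
    # Group candidate intervals by their end frame.
    by_end: Dict[int, List[Tuple[int, float]]] = {}
    for (start, end), value in scores.items():
        by_end.setdefault(end, []).append((start, value))

    # best[t] is the best total score using only frames 0..t-1.
    best = [0.0] * (num_frames + 1)
    back: List[Optional[Interval]] = [None] * (num_frames + 1)

    for t in range(1, num_frames + 1):
        # Option 1: leave frame t-1 uncovered.
        best[t] = best[t - 1]
        back[t] = None
        # Option 2: close an interval that ends at frame t-1.
        for start, value in by_end.get(t - 1, []):
            candidate = best[start] + value
            if candidate > best[t]:
                best[t] = candidate
                back[t] = (start, t - 1)

    # Follow back-pointers to recover the chosen intervals.
    chosen: List[Interval] = []
    t = num_frames
    while t > 0:
        if back[t] is None:
            t -= 1
        else:
            start, end = back[t]
            chosen.append((start, end))
            t = start
    chosen.reverse()
    return best[num_frames], chosen


example_scores = {(0, 2): 1.5, (1, 3): 2.0, (3, 5): 1.0, (4, 5): 0.75, (0, 0): -0.5}
total, intervals = decode_intervals(6, example_scores)
print(total, intervals)
```

Because each candidate is scored as a whole interval, the decoder never has to stitch frame-level decisions together afterward, which is the kind of post-processing heuristic the interval formulation is meant to avoid.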

The second part addresses symbolic music generation through two frameworks. The first is a part-invariant model capable of generating or harmonizing music scores with an arbitrary number of parts using a single network. The second targets Western staff notation and represents music as a grid of part-wise measures, each processed by a structured encoder-decoder. This representation supports both autoregressive and non-autoregressive generation paradigms, including conditional generation for tasks such as score inpainting.

The third part proposes a foundational tool for explicit rate control in variational latent models using Gaussian latents, laying the groundwork for future unsupervised discovery of meaningful structured latent representations for audio. We introduce the Slashed Normal, a simple posterior parameterization for Gaussian latents in variational inference. By tying the squared ℓ2-norm of each posterior parameter vector (the “KL amplitude”) directly to its Kullback-Leibler divergence, it enables fine-grained, flexible, and precise control over information rates. Although motivated by structured music analysis, the Slashed Normal’s simplicity and interpretability make it broadly applicable across domains.
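For context, the "rate" being controlled is the Kullback-Leibler divergence between a Gaussian posterior and its prior. Assuming a diagonal Gaussian posterior with mean \mu and standard deviation \sigma in d dimensions and a standard normal prior (an assumption; the abstract does not state the prior), the standard closed form is:

```latex
\[
D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\middle\|\, \mathcal{N}(0, I)\right)
  = \frac{1}{2} \sum_{i=1}^{d} \left( \mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1 \right)
\]
```

This is the textbook expression in the usual mean and variance parameters, not the Slashed Normal parameterization itself. The abstract describes the latter only as tying this divergence to the squared ℓ2-norm of the posterior parameter vector, so that the rate can be set directly rather than read off after the fact.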

Together, the dissertation demonstrates that embedding structured modeling leads to more efficient, robust, and generalizable models for music intelligence.