Automatic interpretation of structural analyses

Smith, J. B. L., and E. Chew. 2017. Automatic interpretation of music structure analyses: A validated technique for post-hoc estimation of the rationale for an annotation. Proceedings of the International Society for Music Information Retrieval Conference. Suzhou, China. 435–41.


Structural descriptions are usually single-dimensional, or perhaps hierarchical. Here is a two-level analysis of “Hello, Goodbye” by The Beatles:

This annotation tells us that sections A and B are different—but what makes them different? Do listeners think B is defined by a harmonic or melodic progression, or by a timbre? What was the listener’s rationale when they decided on this interpretation?

Collecting this information from listeners is onerous, and the introspection required is difficult. Instead, we aim to automatically interpret existing annotations by comparing them to the audio.

If successful, we could visualize structure to see which musical attributes characterize each section, like so:

"Hello, Goodbye" analysis

In the above figure, the x-axis is time (in seconds), and each row plots the importance of a given feature to the label of each section. Cells get brighter when a feature is:

  1. homogeneous throughout that section;
  2. similar in other sections with the same label;
  3. different in other sections with different labels.


Finding appropriate data is not trivial! To validate the algorithm, we need structural annotations paired with listener rationales.

We obtained the data in a music perception study: we composed stimuli with intended forms, each suited to intended rationales:

We also confirmed that listeners perceived these structures with the same rationales:

We have a large number of stimuli, in three styles, with either 3 parts (AAB vs. ABB) or 4 parts (AABB vs. ABAB vs. ABBA).


We compute self-similarity matrices (SSMs) from several audio features, each of which is assumed to correlate with a relevant musical attribute.
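As a rough illustration of what an SSM is, the sketch below compares every frame of a feature sequence with every other frame. The cosine measure, the function names, and the toy three-dimensional frames are assumptions made for this example; the text above does not say which similarity measure or features were actually used.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def self_similarity_matrix(frames):
    """Return an N x N matrix comparing every frame with every other frame."""
    return [[cosine_similarity(u, v) for v in frames] for u in frames]

# Toy frames (hypothetical data): two "A"-like frames, then two "B"-like frames.
frames = [
    [1.0, 0.0, 0.2],
    [0.9, 0.1, 0.2],
    [0.0, 1.0, 0.3],
    [0.1, 0.9, 0.3],
]
ssm = self_similarity_matrix(frames)
```

In this toy case, frames within the same group score close to 1 and frames from different groups score close to 0, which is the block pattern a structural annotation would be compared against.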

We generate masked SSM segments, each revealing the relationship of a segment to the rest of the piece.
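One plausible reading of this masking step is sketched below: for each annotated segment, keep the rows and columns of the SSM that belong to that segment and set everything else to zero. The function name `mask_segment`, the toy matrix, and the frame-index boundaries are hypothetical, and the exact mask shape used in the paper may differ.

```python
def mask_segment(ssm, start, end):
    """Keep only the rows and columns in [start, end); zero everything else."""
    n = len(ssm)
    return [
        [ssm[i][j] if (start <= i < end or start <= j < end) else 0.0
         for j in range(n)]
        for i in range(n)
    ]

# Toy 4 x 4 SSM (hypothetical values) with two segments of two frames each.
ssm = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
boundaries = [0, 2, 4]  # assumed segment boundaries, in frames
masked = [mask_segment(ssm, s, e) for s, e in zip(boundaries, boundaries[1:])]
```

Each masked matrix retains how one segment relates to itself and to every other part of the piece, and nothing else.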

Then, a quadratic program (QP) estimates coefficients to recreate the ground truth SSM from the masked segments. E.g.:

The example above has structure:

  • ABBA justified by timbre
  • AABB justified by rhythm
  • ABAB justified by harmony

The QP reconstructs the ABAB interpretation (represented in the top left square) using only bass chroma.
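The reconstruction idea can be sketched as a small non-negative least-squares problem: find weights x_k >= 0 so that the weighted sum of component matrices is as close as possible to the target. This is an assumption about the form of the QP, since the objective and constraints are not spelled out here, and the projected gradient descent below is a stand-in for a real QP solver. The flattened 4 x 4 toy patterns and all names are hypothetical.

```python
def fit_nonnegative_weights(components, target, iterations=500):
    """Minimise ||target - sum_k x_k * components[k]||^2 subject to x_k >= 0.

    Uses projected gradient descent; components and target are flat lists.
    """
    k = len(components)
    # Conservative step size: 1 / (2 * sum of squared entries) never overshoots.
    step = 1.0 / (2.0 * sum(v * v for comp in components for v in comp))
    x = [0.0] * k
    for _ in range(iterations):
        residual = [
            sum(x[c] * components[c][i] for c in range(k)) - t
            for i, t in enumerate(target)
        ]
        grad = [2.0 * sum(r * v for r, v in zip(residual, comp))
                for comp in components]
        # Gradient step followed by projection onto the non-negative orthant.
        x = [max(0.0, xc - step * g) for xc, g in zip(x, grad)]
    return x

def pattern_ssm(labels):
    """Flattened binary SSM: 1 where two positions share a label, else 0."""
    return [1.0 if p == q else 0.0 for p in labels for q in labels]

harmony = pattern_ssm("ABAB")  # toy feature whose SSM follows ABAB
timbre = pattern_ssm("ABBA")   # toy feature whose SSM follows ABBA
target = pattern_ssm("ABAB")   # annotation to be explained

weights = fit_nonnegative_weights([harmony, timbre], target)
```

In this toy setup the weight on the harmony component converges to 1 and the weight on the timbre component to 0, mirroring the way the ABAB interpretation above is attributed to a single feature.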

The QP approach has clear limitations:

  • If two musical attributes explain a section equally, the QP might only point to one. Instead, we can measure correlation.
  • Sequences that are repeated but non-homogeneous may be overlooked in a point-wise SSM comparison. Instead, we can use segment-indexed SSMs, or apply additional stripe masking.
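The two alternatives can be combined in a short sketch: average a frame-level SSM over each pair of segments to get a segment-indexed SSM, then score each feature by its Pearson correlation with the annotation. The averaging scheme, the toy 0.9/0.1 similarity values, and all names are assumptions for illustration, not details taken from the paper.

```python
import math

def segment_indexed_ssm(ssm, boundaries):
    """Average a frame-level SSM over every pair of segments."""
    spans = list(zip(boundaries, boundaries[1:]))
    return [
        [sum(ssm[i][j] for i in range(a0, a1) for j in range(b0, b1))
         / ((a1 - a0) * (b1 - b0))
         for (b0, b1) in spans]
        for (a0, a1) in spans
    ]

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def toy_ssm(frame_labels):
    """Hypothetical frame-level SSM: 0.9 for matching frames, 0.1 otherwise."""
    return [[0.9 if p == q else 0.1 for q in frame_labels] for p in frame_labels]

def flatten(matrix):
    return [v for row in matrix for v in row]

boundaries = [0, 2, 4, 6, 8]          # four segments of two frames each
annotation = ["A", "B", "A", "B"]     # the labels to be explained
truth = [[1.0 if p == q else 0.0 for q in annotation] for p in annotation]

harmony = segment_indexed_ssm(toy_ssm("AABBAABB"), boundaries)
timbre = segment_indexed_ssm(toy_ssm("AABBBBAA"), boundaries)

r_harmony = pearson(flatten(harmony), flatten(truth))
r_timbre = pearson(flatten(timbre), flatten(truth))
```

Unlike a set of QP coefficients, each feature gets its own score here, so two features that explain the annotation equally well would both receive high correlations.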


The suggested improvements all had a positive impact: the best algorithm used the stripe-masked SSMs, indexing by segment, and correlation instead of the QP output.

But accuracy varied among musical styles and features, as these confusion plots show:


We can use the validated approach to analyze SALAMI annotations:

Analysis of "We Are The Champions" by Queen
  • A: Harmonies stable, orchestration builds up → harmonies in a and b are unique across the piece.
  • B: Complex chord sequence, stable timbre → timbre cannot explain individuated subsections.

Some analyses have prime markers. Whether we consider primed sections to be similar or different changes the interpretation.

Analysis of “Another One Bites The Dust”, by Queen:
  • if d=d′: Stable, stripped-down harmony throughout.
  • if d≠d′: Sections feature odd, varying sound effects.

Previous work

This project follows up on work presented at two earlier workshops:

Smith, J. B. L., and E. Chew. 12 August 2016. “Validating a technique for post-hoc estimation of a listener’s focus in music structure analysis.” Oral presentation at CogMIR (Cognitively Based Music Informatics Research), satellite event of the International Society for Music Information Retrieval Conference, New York, NY.
Smith, J. B. L., and E. Chew. 13 October 2015. “Validating an Optimisation Technique for Estimating the Focus of a Listener.” Poster presentation at Mathemusical Conversations, Singapore.