Unsupervised semantic structure discovery for audio

Automatic deduction of semantic event sequences from multimedia requires awareness of context, which in turn requires processing sequences of audiovisual scenes. Most non-speech audio databases, however, are not labeled at the sub-file level, and obtaining acoustic or semantic annotations for sub-file sound segments is likely to be expensive. In our work, we introduce a novel latent hierarchical structure that attempts to leverage weakly labeled or unlabeled data, processing the observed acoustics to infer semantic import at various levels. Higher layers of the model's hierarchy represent increasingly high-level semantics.

From HLT@INESC-ID