Ensemble width estimation in HRTF-convolved binaural music recordings using an auditory model and a gradient-boosted decision trees regressor

Paweł Antoniuk, Sławomir K. Zieliński, Hyunkook Lee

Research output: Contribution to journalArticlepeer-review

Abstract

Binaural audio recordings become increasingly popular in multimedia repositories, posing new challenges in indexing, searching, and retrieval of such excerpts in terms of their spatial audio scene characteristics. This paper presents a new method for the automatic estimation of one of the most important spatial attributes of binaural recordings of music, namely “ensemble width.” The method has been developed using a repository of 23,040 binaural excerpts synthesized by convolving 192 multi-track music recordings with 30 sets of head-related transfer functions (HRTF). The synthesized excerpts represented various spatial distributions of music sound sources along a frontal semicircle in the horizontal plane. A binaural auditory model was exploited to derive the standard binaural cues from the synthesized excerpts, yielding a dataset representing interaural level and time differences, complemented by interaural cross-correlation coefficients. Subsequently, a regression method, based on gradient-boosted decision trees, was applied to the formerly calculated dataset to estimate ensemble width values. According to the obtained results, the mean absolute error of the ensemble width estimation averaged across experimental conditions amounts to 6.63° (SD 0.12°). The accuracy of the method is the highest for the recordings with ensembles narrower than 30°, yielding the mean absolute error ranging between 0.8° and 10.2°. The performance of the proposed algorithm is relatively uniform regardless of the horizontal position of an ensemble. However, its accuracy deteriorates for wider ensembles, with the error reaching 25.2° for the music ensembles spanning 90°. The developed method exhibits satisfactory generalization properties when evaluated both under music-independent and HRTF-independent conditions. The proposed method outperforms the technique based on “spatiograms” recently introduced in the literature.

Original languageEnglish
Article number53
Number of pages26
JournalEurasip Journal on Audio, Speech, and Music Processing
Volume2024
Issue number1
Early online date10 Oct 2024
DOIs
Publication statusPublished - 1 Dec 2024

Cite this