Single-Channel Audio Source Separation Using Deep Neural Network Ensembles

Emad M Grais, Gerard Roma, Andrew JR Simpson, Mark D Plumbley

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

18 Citations (Scopus)

Abstract

Deep neural networks (DNNs) are often used to tackle the single channel source separation (SCSS) problem by predicting time-frequency masks. The predicted masks are then used to separate the sources from the mixed signal. Different types of masks produce separated sources with different levels of distortion and interference. Some types of masks produce separated sources with low distortion, while other masks produce low interference between the separated sources. In this paper a combination of different DNNs’ predictions (masks) is used for SCSS to achieve better quality of the separated sources than using each DNN individually. We train four different DNNs by minimizing four different cost functions to predict four different masks. The first and second DNNs are trained to approximate reference binary and soft masks. The third DNN is trained to predict a mask from the reference sources directly. The last DNN is trained similarly to the third DNN but with an additional discriminative constraint to maximize the differences between the estimated sources. Our experimental results show that combining the predictions of different DNNs achieves separated sources with better quality than using each DNN individually.
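The abstract describes two ideas: separating sources by applying a time-frequency mask to the mixture spectrogram (using reference binary or soft masks), and combining the masks predicted by several DNNs into an ensemble. A minimal NumPy sketch of both ideas follows; the function names, toy magnitude spectrograms, and the simple mask-averaging rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def soft_mask(source_mag, mix_mag, eps=1e-8):
    # Reference soft (ratio) mask: source magnitude over mixture magnitude.
    return source_mag / (mix_mag + eps)

def binary_mask(s1_mag, s2_mag):
    # Reference binary mask: 1 where source 1 dominates, 0 elsewhere.
    return (s1_mag >= s2_mag).astype(float)

def separate(mix_spec, mask):
    # Apply a time-frequency mask to the mixture to estimate one source.
    return mask * mix_spec

def combine_masks(masks):
    # Toy ensemble rule: average the masks predicted by several DNNs.
    return np.mean(np.stack(masks), axis=0)

# Toy magnitude spectrograms (frequency bins x time frames).
rng = np.random.default_rng(0)
s1 = rng.random((4, 3))
s2 = rng.random((4, 3))
mix = s1 + s2

m_ref = soft_mask(s1, mix)            # reference soft mask for source 1
bm = binary_mask(s1, s2)              # reference binary mask for source 1
m_a = np.clip(m_ref + 0.05, 0.0, 1.0) # stand-ins for two DNNs' predictions
m_b = np.clip(m_ref - 0.05, 0.0, 1.0)
m_ens = combine_masks([m_a, m_b])     # combined (ensemble) mask
est = separate(mix, m_ens)            # estimated source spectrogram
```

In the paper the four masks come from DNNs trained with different cost functions; here two perturbed copies of the reference mask stand in for DNN outputs purely to show the combination step.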
Language: English
Title of host publication: Audio Engineering Society Convention 140
Publisher: Audio Engineering Society
Publication status: Published - 26 May 2016
Externally published: Yes
Event: 140th Audio Engineering Society Convention 2016 - Palais des Congrès, Paris, France
Duration: 4 Jun 2016 – 7 Jun 2016
Conference number: 140
http://www.aes.org/events/140/ (Link to Conference Website)

Conference

Conference: 140th Audio Engineering Society Convention 2016
Country: France
City: Paris
Period: 4/06/16 – 7/06/16

Fingerprint

Source separation
Masks
Deep neural networks
Cost functions

Cite this

Grais, E. M., Roma, G., Simpson, A. JR., & Plumbley, M. D. (2016). Single-Channel Audio Source Separation Using Deep Neural Network Ensembles. In Audio Engineering Society Convention 140 [9494] Audio Engineering Society.
Grais, Emad M ; Roma, Gerard ; Simpson, Andrew JR ; Plumbley, Mark D. / Single-Channel Audio Source Separation Using Deep Neural Network Ensembles. Audio Engineering Society Convention 140. Audio Engineering Society, 2016.
@inproceedings{4f4308a05097402fa906636e42a072bd,
title = "Single-Channel Audio Source Separation Using Deep Neural Network Ensembles",
abstract = "Deep neural networks (DNNs) are often used to tackle the single channel source separation (SCSS) problem by predicting time-frequency masks. The predicted masks are then used to separate the sources from the mixed signal. Different types of masks produce separated sources with different levels of distortion and interference. Some types of masks produce separated sources with low distortion, while other masks produce low interference between the separated sources. In this paper a combination of different DNNs’ predictions (masks) is used for SCSS to achieve better quality of the separated sources than using each DNN individually. We train four different DNNs by minimizing four different cost functions to predict four different masks. The first and second DNNs are trained to approximate reference binary and soft masks. The third DNN is trained to predict a mask from the reference sources directly. The last DNN is trained similarly to the third DNN but with an additional discriminative constraint to maximize the differences between the estimated sources. Our experimental results show that combining the predictions of different DNNs achieves separated sources with better quality than using each DNN individually.",
author = "Grais, {Emad M} and Gerard Roma and Simpson, {Andrew JR} and Plumbley, {Mark D}",
year = "2016",
month = "5",
day = "26",
language = "English",
booktitle = "Audio Engineering Society Convention 140",
publisher = "Audio Engineering Society",
address = "United States",
}

Grais, EM, Roma, G, Simpson, AJR & Plumbley, MD 2016, Single-Channel Audio Source Separation Using Deep Neural Network Ensembles. in Audio Engineering Society Convention 140., 9494, Audio Engineering Society, 140th Audio Engineering Society Convention 2016, Paris, France, 4/06/16.

Single-Channel Audio Source Separation Using Deep Neural Network Ensembles. / Grais, Emad M; Roma, Gerard; Simpson, Andrew JR; Plumbley, Mark D.

Audio Engineering Society Convention 140. Audio Engineering Society, 2016. 9494.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

TY - GEN

T1 - Single-Channel Audio Source Separation Using Deep Neural Network Ensembles

AU - Grais, Emad M

AU - Roma, Gerard

AU - Simpson, Andrew JR

AU - Plumbley, Mark D

PY - 2016/5/26

Y1 - 2016/5/26

N2 - Deep neural networks (DNNs) are often used to tackle the single channel source separation (SCSS) problem by predicting time-frequency masks. The predicted masks are then used to separate the sources from the mixed signal. Different types of masks produce separated sources with different levels of distortion and interference. Some types of masks produce separated sources with low distortion, while other masks produce low interference between the separated sources. In this paper a combination of different DNNs’ predictions (masks) is used for SCSS to achieve better quality of the separated sources than using each DNN individually. We train four different DNNs by minimizing four different cost functions to predict four different masks. The first and second DNNs are trained to approximate reference binary and soft masks. The third DNN is trained to predict a mask from the reference sources directly. The last DNN is trained similarly to the third DNN but with an additional discriminative constraint to maximize the differences between the estimated sources. Our experimental results show that combining the predictions of different DNNs achieves separated sources with better quality than using each DNN individually.

AB - Deep neural networks (DNNs) are often used to tackle the single channel source separation (SCSS) problem by predicting time-frequency masks. The predicted masks are then used to separate the sources from the mixed signal. Different types of masks produce separated sources with different levels of distortion and interference. Some types of masks produce separated sources with low distortion, while other masks produce low interference between the separated sources. In this paper a combination of different DNNs’ predictions (masks) is used for SCSS to achieve better quality of the separated sources than using each DNN individually. We train four different DNNs by minimizing four different cost functions to predict four different masks. The first and second DNNs are trained to approximate reference binary and soft masks. The third DNN is trained to predict a mask from the reference sources directly. The last DNN is trained similarly to the third DNN but with an additional discriminative constraint to maximize the differences between the estimated sources. Our experimental results show that combining the predictions of different DNNs achieves separated sources with better quality than using each DNN individually.

M3 - Conference contribution

BT - Audio Engineering Society Convention 140

PB - Audio Engineering Society

ER -

Grais EM, Roma G, Simpson AJR, Plumbley MD. Single-Channel Audio Source Separation Using Deep Neural Network Ensembles. In Audio Engineering Society Convention 140. Audio Engineering Society. 2016. 9494