Evaluation Dataset

Voicebank-DEMAND

Evaluation Metrics

PESQ, STOI

Adaptation Method

We adopt an adaptation method similar to that in speech separation tasks, where a three-layer Bi-LSTM (SepRNN) is applied to the output features of the upstream model to generate a predicted spectral mask of the clean signal.
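Training the SepRNN itself requires a deep-learning framework, but the masking objective it optimizes can be illustrated without one. The sketch below computes a magnitude spectrogram and an ideal ratio mask, the kind of target a spectral-mask predictor is typically trained to approximate; the framing parameters and the toy signals are illustrative assumptions, not the benchmark's actual settings.

```python
import numpy as np

def stft_mag(x, n_fft=512, hop=128):
    """Magnitude spectrogram via a simple framed FFT with a Hann window."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

def ideal_ratio_mask(clean_mag, noisy_mag, eps=1e-8):
    """A common training target for mask predictors: |S| / |X|, clipped to [0, 1]."""
    return np.clip(clean_mag / (noisy_mag + eps), 0.0, 1.0)

# Toy example: a clean tone plus white noise (placeholders for real audio).
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.3 * rng.standard_normal(len(t))

S, X = stft_mag(clean), stft_mag(noisy)
mask = ideal_ratio_mask(S, X)   # what the mask predictor learns to output
enhanced = mask * X             # masked (denoised) magnitude spectrogram
```

Applying the predicted mask to the noisy magnitude spectrogram yields the enhanced spectrum, which is then evaluated with PESQ and STOI after resynthesis.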

Data description

Voicebank-DEMAND is a synthetic dataset created by mixing clean speech with noise. The clean speech is taken from the Voice Bank corpus, and the noise comes from the Diverse Environments Multichannel Acoustic Noise Database (DEMAND). The training set contains 28 speakers mixed at 4 signal-to-noise ratios (SNRs) (15, 10, 5, and 0 dB), and the test set contains 2 speakers mixed at 4 SNRs (17.5, 12.5, 7.5, and 2.5 dB). The training set contains 11,572 utterances (9.4 h) and the test set contains 824 utterances (0.6 h). Utterance lengths range from 1.1 s to 15.1 s, with an average of 2.9 s.
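Mixing at a fixed SNR amounts to scaling the noise so the clean-to-noise power ratio hits the target before adding the two signals. The sketch below shows this; the signals here are random placeholders, not actual Voice Bank or DEMAND audio, and the function name is ours.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean/noise power ratio equals `snr_db`, then mix."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    target_p_noise = p_clean / (10 ** (snr_db / 10.0))
    return clean + noise * np.sqrt(target_p_noise / p_noise)

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # placeholder for a Voice Bank utterance
noise = rng.standard_normal(16000)   # placeholder for a DEMAND recording
mixed = mix_at_snr(clean, noise, snr_db=5.0)

# Verify the achieved SNR of the mixture.
achieved = 10 * np.log10(np.mean(clean ** 2) / np.mean((mixed - clean) ** 2))
```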

Dataset structure

Amount of source data

Training set: 8.8 h; validation set: 0.6 h; test set: 0.6 h

Data detail

The training set, validation set, and test set each contain three files: spk2utt, utt2spk, and wav.scp. Each line has the following format (since speaker labels are not needed for this task, every utterance is treated as its own speaker, so wav_id appears in both columns of spk2utt and utt2spk):

  • wav.scp: wav_id wav_path
  • spk2utt: wav_id wav_id
  • utt2spk: wav_id wav_id

Sample of source dataset

wav.scp:
p226_001 /home/datasets/noisy-vctk-16k/noisy_trainset_28spk_wav_16k/p226_001.wav

spk2utt:
p226_001 p226_001

utt2spk:
p226_001 p226_001
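These three Kaldi-style files can be generated from a directory of wav files in a few lines. A minimal sketch, mirroring the samples above in which each wav_id doubles as its own speaker id (the function name and paths are illustrative assumptions):

```python
from pathlib import Path

def write_kaldi_lists(wav_dir, out_dir):
    """Write wav.scp, spk2utt, and utt2spk for every *.wav under wav_dir.

    Each wav_id is used as both utterance id and speaker id, matching
    the sample entries above.
    """
    wav_dir, out_dir = Path(wav_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    ids = sorted(p.stem for p in wav_dir.glob("*.wav"))
    with open(out_dir / "wav.scp", "w") as f:
        for i in ids:
            f.write(f"{i} {wav_dir / (i + '.wav')}\n")
    for name in ("spk2utt", "utt2spk"):
        with open(out_dir / name, "w") as f:
            for i in ids:
                f.write(f"{i} {i}\n")
```

Pointing `wav_dir` at, e.g., `noisy_trainset_28spk_wav_16k` would reproduce entries like the `p226_001` sample shown above.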

Citation information

@inproceedings{ValentiniBotinhao2017NoisySD,
  title={Noisy speech database for training speech enhancement algorithms and TTS models},
  author={Cassia Valentini-Botinhao},
  year={2017},
  url={https://api.semanticscholar.org/CorpusID:64530884}
}

Licensing information

Licensed under CC BY 4.0.