Please read downstream/README.md for the general command pattern, and read upstream/example/README.md for registering a new pretrained model (upstream).
In this document we detail the commands for reproducing the papers SUPERB: Speech processing Universal PERformance Benchmark and SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities. If you use the tasks here for your research, please consider citing the following papers:
@inproceedings{yang21c_interspeech,
author={Shu-wen Yang and Po-Han Chi and Yung-Sung Chuang and Cheng-I Jeff Lai and Kushal Lakhotia and Yist Y. Lin and Andy T. Liu and Jiatong Shi and Xuankai Chang and Guan-Ting Lin and Tzu-Hsien Huang and Wei-Cheng Tseng and Ko-tik Lee and Da-Rong Liu and Zili Huang and Shuyan Dong and Shang-Wen Li and Shinji Watanabe and Abdelrahman Mohamed and Hung-yi Lee},
title={{SUPERB: Speech Processing Universal PERformance Benchmark}},
year=2021,
booktitle={Proc. Interspeech 2021},
pages={1194--1198},
doi={10.21437/Interspeech.2021-1775}
}
@article{superb_sg,
title={SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities},
author={Hsiang-Sheng Tsai and Heng-Jui Chang and Wen-Chin Huang and Zili Huang and Kushal Lakhotia and Shu-wen Yang and Shuyan Dong and Andy T. Liu and Cheng-I Lai and Jiatong Shi and Xuankai Chang and Phil Hall and Hsuan-Jui Chen and Shang-Wen Li and Shinji Watanabe and Abdel-rahman Mohamed and Hung-yi Lee},
journal={ArXiv},
year={2022},
volume={abs/2203.06849}
}
Besides the tasks presented in the papers, we are also extending the coverage to more speech tasks. In the SUPERB Challenge at the AAAI workshop The 2nd Self-supervised Learning for Audio and Speech Processing, more tasks are introduced into the benchmark framework, and the setup detailed here serves as the public-set in the challenge. We list all tasks below:
ID | Task Name | Category | Paper | Challenge public-set |
---|---|---|---|---|
PR | Phoneme Recognition | Content | V | V |
ASR | Automatic Speech Recognition | Content | V | V |
KS | Keyword Spotting | Content | V | |
QbE | Query-by-Example | Content | V | V |
SID | Speaker Identification | Speaker | V | V |
ASV | Automatic Speaker Verification | Speaker | V | V |
SD | Speaker Diarization | Speaker | V | V |
ER | Emotion Recognition | Paralinguistics | V | V |
IC | Spoken Intent Classification | Semantics | V | |
SF | Spoken Slot Filling | Semantics | V | |
ST | Speech Translation | Semantics | V | V |
SE | Speech Enhancement | Generation | V | V |
SS | Source Separation | Generation | V | V |
VC | Voice Conversion | Generation | V | |
This document contains the following materials:
To reproduce the results in the SUPERB paper, you can follow the commands below and change only the learning rate (config.optimizer.lr in the config file) with the override option.
# The default lr for ASR is 1.0e-4
python3 run_downstream.py -m train -u wav2vec2 -d asr -n ExpName \
-o config.optimizer.lr=1.0e-5
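To sweep several learning rates, a minimal sketch (the grid values and experiment names below are only illustrative):
# Sweep a few learning rates for ASR with wav2vec2
for lr in 1.0e-3 1.0e-4 1.0e-5;
do
    python3 run_downstream.py -m train -u wav2vec2 -d asr -n ExpName_lr${lr} \
        -o config.optimizer.lr=${lr}
done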
If fully converged training takes too long, you can also consider distributed training to avoid gradient accumulation.
Specified by the command -d ctc
Download LibriSpeech and unzip. You only need train-clean-100, dev-clean, and test-clean.
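A minimal download sketch, assuming you fetch the official OpenSLR mirrors into a corpora directory of your own choosing:
# Download the three required LibriSpeech subsets from OpenSLR and unpack them
CORPORA_DIR="the root directory of all your datasets"
for subset in train-clean-100 dev-clean test-clean;
do
    wget https://www.openslr.org/resources/12/${subset}.tar.gz
    tar zxf ${subset}.tar.gz -C $CORPORA_DIR   # unpacks into $CORPORA_DIR/LibriSpeech/${subset}
done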
Check the prepared file structure
LibriSpeech/
├── train-clean-100/
├── dev-clean/
└── test-clean/
Change the path in downstream/ctc/libriphone.yaml
downstream_expert:
corpus:
path: "root directory of LibriSpeech"
python3 run_downstream.py -n ExpName -m train -u fbank -d ctc -c downstream/ctc/libriphone.yaml
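Alternatively, instead of editing the yaml you can pass the corpus path with the override option (a sketch; the key path follows the yaml structure shown above):
python3 run_downstream.py -n ExpName -m train -u fbank -d ctc -c downstream/ctc/libriphone.yaml \
    -o "config.downstream_expert.corpus.path='/CORPORA_DIR/LibriSpeech'"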
python3 run_downstream.py -m evaluate -e result/downstream/ExpName/dev-best.ckpt
Specified by the command -d asr
Download LibriSpeech and unzip. You only need train-clean-100, dev-clean, and test-clean (the same subsets as for PR above).
Check the prepared file structure
LibriSpeech/
├── train-clean-100/
├── dev-clean/
└── test-clean/
Change the path in downstream/asr/config.yaml
downstream_expert:
datarc:
libri_root: "root directory of LibriSpeech"
Prepare the lengths for utterances in LibriSpeech's train-clean-100, dev-clean and test-clean:
# Official LibriSpeech is in .flac format
python3 preprocess/generate_len_for_bucket.py -i "root directory of LibriSpeech" -o data/librispeech -a .flac --n_jobs 12
python3 run_downstream.py -n ExpName -m train -u fbank -d asr
python3 run_downstream.py -m evaluate -t "test-clean" -e result/downstream/ExpName/dev-clean-best.ckpt
Installing all the dependencies correctly can be quite complicated. Note that decoding is not required for SSL representations to perform well on ASR, and you can also skip the LM-decoded ASR results when submitting to the leaderboard.
Install KenLM
Install flashlight python bindings
Download LibriSpeech official 4-gram LM
Download character-based lexicon
Make sure your fairseq version contains this commit cb8469
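For the 4-gram LM listed above, a minimal download sketch (the OpenSLR URL points to the official LibriSpeech LM release; KenLM and the flashlight bindings should be installed by following their own official instructions):
# Official LibriSpeech 4-gram LM (OpenSLR resource 11)
wget https://www.openslr.org/resources/11/4-gram.arpa.gz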
python3 run_downstream.py -m evaluate -t "test-clean" -e result/downstream/ExpName/dev-clean-best.ckpt \
-o "\
config.downstream_expert.datarc.decoder_args.decoder_type='kenlm',, \
config.downstream_expert.datarc.decoder_args.kenlm_model='/path/to/4-gram.arpa.gz',, \
config.downstream_expert.datarc.decoder_args.lexicon='/path/to/librispeech_lexicon.lst' \
"
Specified by the command -d speech_commands
Download data
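The archives can be fetched from the official release, e.g. (URLs to the best of our knowledge; verify against the Speech Commands release page):
wget http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz
wget http://download.tensorflow.org/data/speech_commands_test_set_v0.01.tar.gz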
Download and unpack Speech Commands
mkdir -p /CORPORA_DIR/speech_commands_v0.01
tar zxf speech_commands_v0.01.tar.gz -C /CORPORA_DIR/speech_commands_v0.01
Download and unpack Speech Commands test set
mkdir -p /CORPORA_DIR/speech_commands_test_set_v0.01
tar zxf speech_commands_test_set_v0.01.tar.gz -C /CORPORA_DIR/speech_commands_test_set_v0.01
Change the following paths in downstream/speech_commands/config.yaml to yours:
downstream_expert:
datarc:
speech_commands_root: "/CORPORA_DIR/speech_commands_v0.01/"
speech_commands_test_root: "/CORPORA_DIR/speech_commands_test_set_v0.01/"
python3 run_downstream.py -n ExpName -m train -u fbank -d speech_commands
python3 run_downstream.py -m evaluate -e result/downstream/ExpName/dev-best.ckpt
The implementation is directly compatible with Speech Commands v2. You can enable it by simply changing the train/test dataset paths; all other steps stay the same.
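For example, assuming you unpacked the v0.02 archives into analogous directories (the v0.02 directory names here are assumptions), a sketch using the override option:
python3 run_downstream.py -n ExpName -m train -u fbank -d speech_commands \
    -o "config.downstream_expert.datarc.speech_commands_root='/CORPORA_DIR/speech_commands_v0.02/',, \
        config.downstream_expert.datarc.speech_commands_test_root='/CORPORA_DIR/speech_commands_test_set_v0.02/'"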
Specified by the command -d quesst14_dtw
This task does not require training. We extract representations and run dynamic time warping (DTW) on them.
Download QUESST14
export CORPORA_DIR="the root directory of all your datasets"
wget https://speech.fit.vutbr.cz/files/quesst14Database.tgz
tar zxf quesst14Database.tgz -C $CORPORA_DIR
Change the path in downstream/quesst14/config.yaml
downstream_expert:
datarc:
dataset_root: "CORPORA_DIR/quesst14Database"
In SUPERB, we run DTW on all the hidden states layer by layer, choose the best layer according to the dev set, and report its score on the test set. A specific layer can be selected with the -l option, indexed from 0. The following takes the last layer as an example.
# The default dist_fn if not specified is "cosine_exp"
# as it yields the best result for almost all upstream
# Supported dist_fn: cosine, cityblock, euclidean, cosine_exp
layer=-1;
dist_fn=cosine;
# dev
python3 run_downstream.py -m evaluate -t "dev" -u fbank -l ${layer} \
-d quesst14_dtw -n ExpName_${layer}_dev \
-o config.downstream_expert.dtwrc.dist_method=$dist_fn
# test
python3 run_downstream.py -m evaluate -t "test" -u fbank -l ${layer} \
-d quesst14_dtw -n ExpName_${layer}_test \
-o config.downstream_expert.dtwrc.dist_method=$dist_fn
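To benchmark every layer as described above, a simple sweep sketch over the dev set (the number of layers depends on your upstream and is an assumption here):
# Sweep all layers on the dev set, then pick the best one before scoring the test set
num_layers=13   # assumption: adjust to the number of hidden states of your upstream
for layer in $(seq 0 $((num_layers - 1)));
do
    python3 run_downstream.py -m evaluate -t "dev" -u fbank -l ${layer} \
        -d quesst14_dtw -n ExpName_${layer}_dev \
        -o config.downstream_expert.dtwrc.dist_method=$dist_fn
done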
export S3PRL_DIR=/YOUR/S3PRL/PATH
cd $CORPORA_DIR/quesst14Database/scoring
# dev
./score-TWV-Cnxe.sh $S3PRL_DIR/result/downstream/ExpName_${layer}_dev \
groundtruth_quesst14_dev -10
# test
./score-TWV-Cnxe.sh $S3PRL_DIR/result/downstream/ExpName_${layer}_test \
groundtruth_quesst14_eval -10
After you benchmark all the layers of an upstream, say you find the 6th layer is the best for QbE according to the dev set. Then use ExpName_6_test as the submission expdir for submit.py.
Specified by the command -d fluent_commands
Download and unzip data: Fluent Speech Commands
wget http://140.112.21.28:9000/fluent.tar.gz
Check the prepared file structure
fluent_speech_commands_dataset
├── wavs
│ └── speakers
├── data
│ └── [*.csv]
├── readme.md
└── Fluent Speech Commands Public License.pdf
Change the following paths under downstream/fluent_commands/config.yaml to your own:
downstream_expert:
datarc:
file_path: "root directory of fluent_speech_commands_dataset"
python3 run_downstream.py -n ExpName -m train -u fbank -d fluent_commands
python3 run_downstream.py -m evaluate -e result/downstream/ExpName/dev-best.ckpt
Optional: Preprocess Audio SNIPS from the official version.
# Official Audio SNIPS is in mp3 format, we will convert them to wav
# We need mp3 support on sox package (originally not supported)
# First ensure you have the sox installed
# Then install the mp3 support
# apt-get
apt-get install libsox-fmt-mp3
# or yum install
yum install soxr sox-plugins-freeworld -y
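# Quick sanity check (a sketch, assuming sox lists its supported formats in the help text):
# mp3 should now appear among sox's supported audio file formats
sox --help 2>&1 | grep -io "mp3" && echo "mp3 support OK"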
# after installing the mp3 support
CORPORA_DIR="the root directory of all your datasets"
./preprocess/snips_prepare_data.sh $CORPORA_DIR
Download the preprocessed Audio SNIPS and unzip
Change the paths in downstream/ctc/snips.yaml
downstream_expert:
corpus:
path: "CORPORA_DIR/SNIPS"
text:
slots_file: "CORPORA_DIR/SNIPS/slots.txt"
python3 run_downstream.py -n ExpName -m train -u fbank -d ctc -c downstream/ctc/snips.yaml
python3 run_downstream.py -m evaluate -e result/downstream/ExpName/dev-best.ckpt
Download the dataset from VoxCeleb1 and unzip it.
voxceleb1_root="/CORPORA_DIR/VoxCeleb1/"
mkdir -p $voxceleb1_root/dev
mkdir -p $voxceleb1_root/test
# prepare dev
cd $voxceleb1_root/dev/
wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partaa
wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partab
wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partac
wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partad
cat vox1_dev* > vox1_dev_wav.zip
unzip vox1_dev_wav.zip
# prepare test
cd $voxceleb1_root/test/
wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_test_wav.zip
unzip vox1_test_wav.zip
Check the prepared file structure
Voxceleb1/
├── dev/
│ └── wav/
│       └── Speaker ID folders
└── test/
└── wav/
        └── Speaker ID folders
Change the path in downstream/voxceleb1/config.yaml
downstream_expert:
datarc:
file_path: "root directory of VoxCeleb1"
python3 run_downstream.py -n ExpName -m train -u fbank -d voxceleb1
python3 run_downstream.py -m evaluate -e result/downstream/ExpName/dev-best.ckpt
Follow steps 1 and 2 in SID above.
Change the path in downstream/sv_voxceleb1/config.yaml
downstream_expert:
datarc:
file_path: "root directory of VoxCeleb1"
python3 run_downstream.py -n ExpName -m train -u fbank -d sv_voxceleb1
If you already know a specific checkpoint to test, say states-20000.ckpt, you can test it with:
python3 run_downstream.py -m evaluate -e result/downstream/ExpName/states-20000.ckpt
However, there is no official validation set under the VoxCeleb1 setting, so we save checkpoints every 20000 updates and report the best EER. Evaluating checkpoints takes a long time, so we don't test them along with training on a single GPU. Instead, we save all checkpoints and test them in parallel on another GPU. The following command tests all the saved checkpoints:
Note: checkpoints that have already been evaluated will be skipped.
voxceleb1="root directory of VoxCeleb1"
./downstream/sv_voxceleb1/test_expdir.sh result/downstream/ExpName $voxceleb1
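A possible two-GPU workflow sketch (the background training, the loop and the sleep interval are assumptions, not part of the official recipe):
# GPU 0: keep training and saving checkpoints; GPU 1: repeatedly test whatever has been saved
CUDA_VISIBLE_DEVICES=0 python3 run_downstream.py -n ExpName -m train -u fbank -d sv_voxceleb1 &
while true;
do
    CUDA_VISIBLE_DEVICES=1 ./downstream/sv_voxceleb1/test_expdir.sh result/downstream/ExpName $voxceleb1
    sleep 3600   # already-evaluated checkpoints are skipped, so re-running is cheap
done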
We prepare the frame-wise training labels on the fly and convert the frame-wise predictions into RTTM files annotated in seconds. The inferred RTTM is then scored against the ground-truth RTTM by dscore. You can choose the frame_shift (stride) of the training label for the upstream representation. This only affects the training materials and does not affect the ground-truth RTTM, which is fixed in Libri2Mix during data preparation.
Simulate Libri2Mix Data for Diarization
S3PRL_DIR="root directory of your cloned s3prl"
CORPORA_DIR"root directory of all your datasets, which hopefully contains LibriSpeech (not necessary)"
git clone https://github.com/s3prl/LibriMix.git
cd LibriMix
bash generate_librimix_sd.sh $CORPORA_DIR
python3 scripts/prepare_diarization.py \
--target_dir $S3PRL_DIR/downstream/diarization/data \
--source_dir $CORPORA_DIR/Libri2Mix/wav16k/max/metadata
Train with the label in the same frame_shift as the upstream representation (recommended):
python3 run_downstream.py -n ExpName -m train -u fbank -d diarization
Train with the label in a specific frame_shift (e.g. 160):
python3 run_downstream.py -n ExpName -m train -u fbank -d diarization \
-o config.downstream_expert.datarc.frame_shift=160
The upstream representation will be upsampled (duplicated) or downsampled (taking 1 of every N frames) to match the sequence length of your assigned label. This can be useful when the representation has a very small frame_shift and hence a very long sequence, which leads to long training time.
The frame_shift used for the training label is saved in the checkpoint, and the same frame_shift will be used to convert the frame-wise predictions into RTTM files annotated in seconds.
python3 run_downstream.py -m evaluate -e result/downstream/ExpName/best-states-dev.ckpt
Clone dscore
git clone https://github.com/ftshijt/dscore
Change the path in downstream/diarization/score.sh
dscore_dir="root directory of your cloned dscore"
Run scoring
./downstream/diarization/score.sh result/downstream/ExpName downstream/diarization/data/test
The scoring results include a DER column. Report the number at the bottom: the bottom-most row always has the lowest DER, which is the number we report.
Re-check the scoring results: Running the above scoring script takes time. If you want to re-check the scored results, use
./downstream/diarization/report.sh result/downstream/ExpName
Download the dataset and unzip it. You will need to fill out a form on the official IEMOCAP website to get the dataset.
Change the path in downstream/emotion/config.yaml
downstream_expert:
datarc:
root: "root directory of IEMOCAP"
IEMOCAP provides 5 splits of data: Section1, Section2, Section3, Section4 and Section5. Conventionally, each split is selected in turn as the test set while the model is trained on the other 4 splits. That is, 5 rounds of training and testing are required, and the 5 test scores are averaged to report the final number. You can change the test_fold option in the config file to control which split is reserved as the test set.
# test_fold can be: fold1, fold2, fold3, fold4, fold5
python3 run_downstream.py -n ExpName -m train -u fbank -d emotion -c downstream/emotion/config.yaml -o "config.downstream_expert.datarc.test_fold='fold1'"
python3 run_downstream.py -m evaluate -e result/downstream/ExpName/dev-best.ckpt
for test_fold in fold1 fold2 fold3 fold4 fold5;
do
# The default config is "downstream/emotion/config.yaml"
python3 run_downstream.py -n ExpName_$test_fold -m train -u fbank -d emotion -o "config.downstream_expert.datarc.test_fold='$test_fold'"
python3 run_downstream.py -m evaluate -e result/downstream/ExpName_$test_fold/dev-best.ckpt
done
Simulate Libri2Mix data for source separation. For source separation, we only need the 16kHz, min condition. (Source separation usually uses the 8kHz min condition, but due to the constraint of the pre-trained models we use 16kHz.)
# download the script and simulate Libri2Mix dataset
git clone https://github.com/s3prl/LibriMix.git
cd LibriMix
./generate_librimix_ss.sh storage_dir
# prepare train, dev and test data in Kaldi format
python downstream/separation_stft/scripts/LibriMix/data_prepare.py \
--part train-100 storage_dir/Libri2Mix downstream/separation_stft/datasets/Libri2Mix
python downstream/separation_stft/scripts/LibriMix/data_prepare.py \
--part dev storage_dir/Libri2Mix downstream/separation_stft/datasets/Libri2Mix
python downstream/separation_stft/scripts/LibriMix/data_prepare.py \
--part test storage_dir/Libri2Mix downstream/separation_stft/datasets/Libri2Mix
# subsample dev set from 3000 utts to 1000 utts (for faster validation)
python downstream/separation_stft/scripts/LibriMix/subsample.py \
downstream/separation_stft/datasets/Libri2Mix/wav16k/min/dev \
downstream/separation_stft/datasets/Libri2Mix/wav16k/min/dev_1000
cd $YOUR_S3PRL_ROOT/s3prl/
Train with STFT magnitude as the upstream. The default stride is 20ms, and you can adjust that in upstream/log_stft/stft_mag.yaml
python3 run_downstream.py -m train \
-d separation_stft \
-c downstream/separation_stft/configs/cfg.yaml \
-u stft_mag \
-g 'upstream/log_stft/stft_mag.yaml' \
-n ExpName
Train with wav2vec2 as the upstream.
python3 run_downstream.py --mode train \
-d separation_stft \
-c downstream/separation_stft/configs/cfg.yaml \
-u wav2vec2 \
-n ExpName
python3 run_downstream.py -m evaluate \
-e result/downstream/ExpName/best-states-dev.ckpt
The model is expected to output SI-SDRi on the test set.
We use the Voicebank-DEMAND dataset for speech enhancement and follow the data preparation in SpeechBrain:
# Download the Voicebank-DEMAND dataset and convert it to 16kHz,
# following the data preparation script in the SpeechBrain toolkit
# (https://github.com/speechbrain/speechbrain/blob/develop/recipes/Voicebank/voicebank_prepare.py)
from voicebank_prepare import download_vctk

data_dir = "/path/to/data_dir"  # target directory for the prepared dataset
download_vctk(data_dir)
However, the above pipeline may take a long time to download the original dataset. Hence, we also provide an already-preprocessed archive:
wget http://140.112.21.28:9000/noisy-vctk-16k.zip
unzip noisy-vctk-16k.zip
Check the unzipped voicebank directory structure
data_dir/
├── clean_testset_wav_16k/
├── clean_trainset_28spk_wav_16k/
├── noisy_testset_wav_16k/
├── noisy_trainset_28spk_wav_16k/
├── testset_txt/
└── trainset_28spk_txt/
Prepare kaldi-style scp files
# prepare train, dev and test data in Kaldi format
python downstream/enhancement_stft/scripts/Voicebank/data_prepare.py \
data_dir downstream/enhancement_stft/datasets/voicebank --part train
python downstream/enhancement_stft/scripts/Voicebank/data_prepare.py \
data_dir downstream/enhancement_stft/datasets/voicebank --part dev
python downstream/enhancement_stft/scripts/Voicebank/data_prepare.py \
data_dir downstream/enhancement_stft/datasets/voicebank --part test
Train with hubert as the upstream.
python3 run_downstream.py -m train \
-c downstream/enhancement_stft/configs/cfg_voicebank.yaml \
-d enhancement_stft \
-u hubert \
-n ExpName
python3 run_downstream.py -m evaluate \
-e result/downstream/ExpName/best-states-dev.ckpt
The model is expected to output PESQ, STOI, CoVL and SI-SDRi on the test set.
The following instructions are only a minimal description for benchmarking. A complete guide to the task, dataset, implementation and usage can be found in the README. We evaluate the VC capability by training 4 target-speaker models: given any source-speaker utterance, each single-speaker model converts it to its specific target speaker. This setting is known as any-to-one VC. The 4 target speakers are TEF1, TEF2, TEM1 and TEM2. The quality of each target-speaker model is evaluated with MCD (lower is better), and one should average the MCD over the four speakers.
Download the VCC2020 dataset and the pretrained vocoder.
cd downstream/a2o-vc-vcc2020
cd data
./data_download.sh vcc2020/
cd ../
# Download the pretrained PWGs.
./vocoder_download.sh ./
Specify a target speaker for training from: TEF1, TEF2, TEM1, TEM2
python run_downstream.py -m train -n EXPNAME -u wav2vec -d a2o-vc-vcc2020 \
-o config.downstream_expert.trgspk=TEF1
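To cover all four target speakers, a simple loop sketch (experiment naming is arbitrary):
for trgspk in TEF1 TEF2 TEM1 TEM2;
do
    python run_downstream.py -m train -n EXPNAME_${trgspk} -u wav2vec -d a2o-vc-vcc2020 \
        -o config.downstream_expert.trgspk=${trgspk}
done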
Waveform generation and evaluation (using wav2vec for example) for a specific checkpoint.
./downstream/a2o-vc-vcc2020/decode.sh ./downstream/a2o-vc-vcc2020/pwg_task1 result/downstream/EXPNAME/<step> TEF1
The following instructions are only a minimal description for benchmarking. A complete guide to the task, dataset, implementation and usage can be found in the README.
Prepare the CoVoST En->De dataset. Download the Common Voice audio clips and transcripts (English, Common Voice Corpus 4).
Change the path in downstream/speech_translation/prepare_data/prepare_covo.sh
covo_root="root directory of covost"
src_lang=en
tgt_lang=de
cd downstream/speech_translation/prepare_data/
bash prepare_covo.sh
python run_downstream.py -m train -n ExpName -u fbank -d speech_translation
python run_downstream.py -m evaluate -e result/downstream/ExpName/dev-best.ckpt
The model will report case-sensitive detokenized BLEU.
Read README.
After finishing the testing of each task, the prediction files for leaderboard submission will be located under the expdir. You can use submit.py to easily organize them into a zip file, which can then be submitted to our leaderboard. We currently support submissions for the following tasks: PR, ASR, KS, QbE, SID, ASV, SD, IC, SF, ER, SE, SS, ST.
If you find superbbenchmark.org temporarily down, please try 140.112.21.28 as an alternative; they share the same backend. We will restore the official domain as soon as possible.
Please use a master-branch commit newer than 852db2e. Note that our SUPERB codebase is backward-compatible, so you don't need to re-train any model after upgrading; you only need the newer version to correctly produce the prediction files for submission.
output_dir="submission"
python3 submit/submit.py \
--output_dir $output_dir \
--pr pr_expdir \
--sid sid_expdir \
--ks ks_expdir \
--ic ic_expdir \
--er_fold1 er_fold1_expdir \
--er_fold2 er_fold2_expdir \
--er_fold3 er_fold3_expdir \
--er_fold4 er_fold4_expdir \
--er_fold5 er_fold5_expdir \
--asr_no_lm asr_expdir \
--asr_with_lm asr_expdir \
--qbe qbe_expdir \
--sf sf_expdir \
--sv sv_expdir \
--sd sd_expdir \
--se se_expdir \
--ss ss_expdir \
--st st_expdir
After executing, you can submit submission/predict.zip to the leaderboard.
We also prepare example expdirs for you to diagnose problems if the submission fails. After unzipping them you will see the following structure:
expdirs/
asr_expdir/
er_fold1_expdir/
er_fold2_expdir/
er_fold3_expdir/
er_fold4_expdir/
er_fold5_expdir/
ic_expdir/
ks_expdir/
pr_expdir/
qbe_expdir/
...
Each expdir contains the minimal submission-related files, which should also appear in your own expdir after you run the testing. Here is an example script showing how to use the above example expdirs to prepare a submittable zip file.
cd s3prl/s3prl/submit
./demo_submit.sh examples
After executing, you will see:
s3prl/s3prl/submit/examples/
expdirs/
expdirs.zip
predict/
predict.zip
The predict.zip is the one for you to submit.
You don't need to prepare all the expdirs for a submission; you can zip only a subset of expdirs. After your submission, the leaderboard will only show the results of your submitted tasks. E.g.
python3 submit/submit.py \
--output_dir submission \
--pr pr_expdir
The above command will produce a predict.zip which will only show the PR score after being submitted to the leaderboard.
Emotion Recognition (er) uses 5-fold cross-validation: 5 training runs and 5 testing runs, so 5 expdirs in total.
The expdirs for asr_no_lm and asr_with_lm are typically the same: the same ASR downstream model is trained and simply decoded in different ways, so the same expdir used for training is reused for both tests. The default testing produces predictions for asr_no_lm; decoding with KenLM produces predictions for asr_with_lm. See the ASR section for more information.