Please read downstream/README.md for the general command pattern, and read upstream/example/README.md for registering a new pretrained model (upstream).
In this document we detail the commands for reproducing the papers SUPERB: Speech processing Universal PERformance Benchmark and SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities. If you use the tasks here for your research, please consider citing the following papers:
@inproceedings{yang21c_interspeech,
author={Shu-wen Yang and Po-Han Chi and Yung-Sung Chuang and Cheng-I Jeff Lai and Kushal Lakhotia and Yist Y. Lin and Andy T. Liu and Jiatong Shi and Xuankai Chang and Guan-Ting Lin and Tzu-Hsien Huang and Wei-Cheng Tseng and Ko-tik Lee and Da-Rong Liu and Zili Huang and Shuyan Dong and Shang-Wen Li and Shinji Watanabe and Abdelrahman Mohamed and Hung-yi Lee},
title={{SUPERB: Speech Processing Universal PERformance Benchmark}},
year=2021,
booktitle={Proc. Interspeech 2021},
pages={1194--1198},
doi={10.21437/Interspeech.2021-1775}
}
@article{superb_sg,
title={SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities},
author={Hsiang-Sheng Tsai and Heng-Jui Chang and Wen-Chin Huang and Zili Huang and Kushal Lakhotia and Shu-wen Yang and Shuyan Dong and Andy T. Liu and Cheng-I Lai and Jiatong Shi and Xuankai Chang and Phil Hall and Hsuan-Jui Chen and Shang-Wen Li and Shinji Watanabe and Abdel-rahman Mohamed and Hung-yi Lee},
journal={ArXiv},
year={2022},
volume={abs/2203.06849}
}
Besides the tasks presented in the papers, we are also extending the coverage to more speech tasks. In the SUPERB Challenge at the AAAI workshop The 2nd Self-supervised Learning for Audio and Speech Processing, more tasks are introduced into the benchmark framework, and the setup detailed here serves as the public-set in the challenge. We list all tasks below:
ID | Task Name | Category | Paper | Challenge public-set |
---|---|---|---|---|
PR | Phoneme Recognition | Content | V | V |
ASR | Automatic Speech Recognition | Content | V | V |
KS | Keyword Spotting | Content | V | |
QbE | Query-by-Example | Content | V | V |
SID | Speaker Identification | Speaker | V | V |
ASV | Automatic Speaker Verification | Speaker | V | V |
SD | Speaker Diarization | Speaker | V | V |
ER | Emotion Recognition | Paralinguistics | V | V |
IC | Spoken Intent Classification | Semantics | V | |
SF | Spoken Slot Filling | Semantics | V | |
ST | Speech Translation | Semantics | V | V |
SE | Speech Enhancement | Generation | V | V |
SS | Source Separation | Generation | V | V |
VC | Voice Conversion | Generation | V | |
This document contains the following materials:
To reproduce the results in the SUPERB paper, you can follow the commands below and change only the learning rate (config.optimizer.lr in the config file) with the override option.
# The default lr for ASR is 1.0e-4
python3 run_downstream.py -m train -u wav2vec2 -d asr -n ExpName \
-o config.optimizer.lr=1.0e-5
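To sweep several learning rates, a minimal sketch (the grid values and experiment names below are only illustrative):
# Sweep a few learning rates for ASR with wav2vec2
for lr in 1.0e-3 1.0e-4 1.0e-5;
do
    python3 run_downstream.py -m train -u wav2vec2 -d asr -n ExpName_lr${lr} \
        -o config.optimizer.lr=${lr}
done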
If fully converged training takes too long, you can also consider distributed training to avoid gradient accumulation.
Specified by the command -d ctc
Download LibriSpeech and unzip. You only need train-clean-100, dev-clean, and test-clean.
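A minimal download sketch, assuming you fetch the official OpenSLR mirrors into a corpora directory of your own choosing:
# Download the three required LibriSpeech subsets from OpenSLR and unpack them
CORPORA_DIR="the root directory of all your datasets"
for subset in train-clean-100 dev-clean test-clean;
do
    wget https://www.openslr.org/resources/12/${subset}.tar.gz
    tar zxf ${subset}.tar.gz -C $CORPORA_DIR   # unpacks into $CORPORA_DIR/LibriSpeech/${subset}
done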
Check the prepared file structure
LibriSpeech/
├── train-clean-100/
├── dev-clean/
└── test-clean/
Change the path in downstream/ctc/libriphone.yaml
downstream_expert:
corpus:
path: "root directory of LibriSpeech"
python3 run_downstream.py -n ExpName -m train -u fbank -d ctc -c downstream/ctc/libriphone.yaml
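Alternatively, instead of editing the yaml you can pass the corpus path with the override option (a sketch; the key path follows the yaml structure shown above):
python3 run_downstream.py -n ExpName -m train -u fbank -d ctc -c downstream/ctc/libriphone.yaml \
    -o "config.downstream_expert.corpus.path='/CORPORA_DIR/LibriSpeech'"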
python3 run_downstream.py -m evaluate -e result/downstream/ExpName/dev-best.ckpt
Specified by the command -d asr
Download LibriSpeech and unzip. You only need train-clean-100, dev-clean, and test-clean (the same subsets as for PR above).
Check the prepared file structure
LibriSpeech/
├── train-clean-100/
├── dev-clean/
└── test-clean/
Change the path in downstream/asr/config.yaml
downstream_expert:
datarc:
libri_root: "root directory of LibriSpeech"
Prepare the lengths for utterances in LibriSpeech's train-clean-100, dev-clean and test-clean:
# Official LibriSpeech is in .flac format
python3 preprocess/generate_len_for_bucket.py -i "root directory of LibriSpeech" -o data/librispeech -a .flac --n_jobs 12
python3 run_downstream.py -n ExpName -m train -u fbank -d asr
python3 run_downstream.py -m evaluate -t "test-clean" -e result/downstream/ExpName/dev-clean-best.ckpt
Installing all the dependencies correctly can be quite complicated. Note that decoding is not required for SSL representations to perform well on ASR, and you can also skip the LM-decoded ASR results when submitting to the leaderboard.
Install KenLM
Install flashlight python bindings
Download LibriSpeech official 4-gram LM
Download character-based lexicon
Make sure your fairseq version contains this commit cb8469
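For the 4-gram LM listed above, a minimal download sketch (the OpenSLR URL points to the official LibriSpeech LM release; KenLM and the flashlight bindings should be installed by following their own official instructions):
# Official LibriSpeech 4-gram LM (OpenSLR resource 11)
wget https://www.openslr.org/resources/11/4-gram.arpa.gz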
python3 run_downstream.py -m evaluate -t "test-clean" -e result/downstream/ExpName/dev-clean-best.ckpt \
-o "\
config.downstream_expert.datarc.decoder_args.decoder_type='kenlm',, \
config.downstream_expert.datarc.decoder_args.kenlm_model='/path/to/4-gram.arpa.gz',, \
config.downstream_expert.datarc.decoder_args.lexicon='/path/to/librispeech_lexicon.lst' \
"
Specified by the command -d speech_commands
Download data
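The archives can be fetched from the official release, e.g. (URLs to the best of our knowledge; verify against the Speech Commands release page):
wget http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz
wget http://download.tensorflow.org/data/speech_commands_test_set_v0.01.tar.gz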
Download and unpack Speech Commands
mkdir -p /CORPORA_DIR/speech_commands_v0.01
tar zxf speech_commands_v0.01.tar.gz -C /CORPORA_DIR/speech_commands_v0.01
Download and unpack Speech Commands test set
mkdir -p /CORPORA_DIR/speech_commands_test_set_v0.01
tar zxf speech_commands_test_set_v0.01.tar.gz -C /CORPORA_DIR/speech_commands_test_set_v0.01
Change the following paths in downstream/speech_commands/config.yaml to yours:
downstream_expert:
datarc:
speech_commands_root: "/CORPORA_DIR/speech_commands_v0.01/"
speech_commands_test_root: "/CORPORA_DIR/speech_commands_test_set_v0.01/"
python3 run_downstream.py -n ExpName -m train -u fbank -d speech_commands
python3 run_downstream.py -m evaluate -e result/downstream/ExpName/dev-best.ckpt
The implementation is directly compatible with Speech Commands v2. You can enable it by simply changing the train/test dataset paths; all other steps stay the same.
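For example, assuming you unpacked the v0.02 archives into analogous directories (the v0.02 directory names here are assumptions), a sketch using the override option:
python3 run_downstream.py -n ExpName -m train -u fbank -d speech_commands \
    -o "config.downstream_expert.datarc.speech_commands_root='/CORPORA_DIR/speech_commands_v0.02/',, \
        config.downstream_expert.datarc.speech_commands_test_root='/CORPORA_DIR/speech_commands_test_set_v0.02/'"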
Specified by the command -d quesst14_dtw
This task does not require training. We extract representations and run dynamic time warping (DTW) on them.
Download QUESST14
export CORPORA_DIR="the root directory of all your datasets"
wget https://speech.fit.vutbr.cz/files/quesst14Database.tgz
tar zxf quesst14Database.tgz -C $CORPORA_DIR
Change the path in downstream/quesst14/config.yaml
downstream_expert:
datarc:
dataset_root: "CORPORA_DIR/quesst14Database"
In SUPERB, we run DTW on all the hidden states layer by layer, choose the best layer according to the dev set, and report its score on the test set. A specific layer can be selected with the -l option, indexed from 0. The following takes the last layer as an example.
# The default dist_fn if not specified is "cosine_exp"
# as it yields the best result for almost all upstream
# Supported dist_fn: cosine, cityblock, euclidean, cosine_exp
layer=-1;
dist_fn=cosine;
# dev
python3 run_downstream.py -m evaluate -t "dev" -u fbank -l ${layer} \
-d quesst14_dtw -n ExpName_${layer}_dev \
-o config.downstream_expert.dtwrc.dist_method=$dist_fn
# test
python3 run_downstream.py -m evaluate -t "test" -u fbank -l ${layer} \
-d quesst14_dtw -n ExpName_${layer}_test \
-o config.downstream_expert.dtwrc.dist_method=$dist_fn
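To benchmark every layer as described above, a simple sweep sketch over the dev set (the number of layers depends on your upstream and is an assumption here):
# Sweep all layers on the dev set, then pick the best one before scoring the test set
num_layers=13   # assumption: adjust to the number of hidden states of your upstream
for layer in $(seq 0 $((num_layers - 1)));
do
    python3 run_downstream.py -m evaluate -t "dev" -u fbank -l ${layer} \
        -d quesst14_dtw -n ExpName_${layer}_dev \
        -o config.downstream_expert.dtwrc.dist_method=$dist_fn
done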
export S3PRL_DIR=/YOUR/S3PRL/PATH
cd $CORPORA_DIR/quesst14Database/scoring
# dev
./score-TWV-Cnxe.sh $S3PRL_DIR/result/downstream/ExpName_${layer}_dev \
groundtruth_quesst14_dev -10
# test
./score-TWV-Cnxe.sh $S3PRL_DIR/result/downstream/ExpName_${layer}_test \
groundtruth_quesst14_eval -10
After you benchmark all the layers of an upstream, say you find the 6th layer is the best for QbE according to the dev set. Then use ExpName_6_test as the submission expdir for submit.py.
Specified by the command -d fluent_commands
Download and unzip data: Fluent Speech Commands
wget http://140.112.21.28:9000/fluent.tar.gz
Check the prepared file structure
fluent_speech_commands_dataset
├── wavs
│ └── speakers
├── data
│ └── [*.csv]
├── readme.md
└── Fluent Speech Commands Public License.pdf
Change the following paths under downstream/fluent_commands/config.yaml to your own:
downstream_expert:
datarc:
file_path: "root directory of fluent_speech_commands_dataset"
python3 run_downstream.py -n ExpName -m train -u fbank -d fluent_commands
python3 run_downstream.py -m evaluate -e result/downstream/ExpName/dev-best.ckpt
Optional: Preprocess Audio SNIPS from the official version.
# Official Audio SNIPS is in mp3 format, we will convert them to wav
# We need mp3 support on sox package (originally not supported)
# First ensure you have the sox installed
# Then install the mp3 support
# apt-get
apt-get install libsox-fmt-mp3
# or yum install
yum install soxr sox-plugins-freeworld -y
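# Quick sanity check (a sketch, assuming sox lists its supported formats in the help text):
# mp3 should now appear among sox's supported audio file formats
sox --help 2>&1 | grep -io "mp3" && echo "mp3 support OK"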
# after installing the mp3 support
CORPORA_DIR="the root directory of all your datasets"
./preprocess/snips_prepare_data.sh $CORPORA_DIR
Download the preprocessed Audio SNIPS and unzip
Change the paths in downstream/ctc/snips.yaml
downstream_expert:
corpus:
path: "CORPORA_DIR/SNIPS"
text:
slots_file: "CORPORA_DIR/SNIPS/slots.txt"
python3 run_downstream.py -n ExpName -m train -u fbank -d ctc -c downstream/ctc/snips.yaml
python3 run_downstream.py -m evaluate -e result/downstream/ExpName/dev-best.ckpt
Download the dataset from VoxCeleb1 and unzip it.
voxceleb1_root="/CORPORA_DIR/VoxCeleb1/"
mkdir -p $voxceleb1_root/dev
mkdir -p $voxceleb1_root/test
# prepare dev
cd $voxceleb1_root/dev/
wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partaa
wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partab
wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partac
wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partad
cat vox1_dev* > vox1_dev_wav.zip
unzip vox1_dev_wav.zip
# prepare test
cd $voxceleb1_root/test/
wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_test_wav.zip
unzip vox1_test_wav.zip
Check the prepared file structure
Voxceleb1/
├── dev/
│ └── wav/
│       └── Speaker ID folders
└── test/
└── wav/
        └── Speaker ID folders
Change the path in downstream/voxceleb1/config.yaml
downstream_expert:
datarc:
file_path: "root directory of VoxCeleb1"
python3 run_downstream.py -n ExpName -m train -u fbank -d voxceleb1
python3 run_downstream.py -m evaluate -e result/downstream/ExpName/dev-best.ckpt
Follow steps 1 and 2 in SID above.
Change the path in downstream/sv_voxceleb1/config.yaml
downstream_expert:
datarc:
file_path: "root directory of VoxCeleb1"
python3 run_downstream.py -n ExpName -m train -u fbank -d sv_voxceleb1
If you already know a specific checkpoint to test, say states-20000.ckpt, you can test it with:
python3 run_downstream.py -m evaluate -e result/downstream/ExpName/states-20000.ckpt
However, there is no official validation set under the VoxCeleb1 setting, so we save checkpoints every 20000 updates and report the best EER. Evaluating checkpoints takes a long time, so we don't test them along with training on a single GPU. Instead, we save all checkpoints and test them in parallel on another GPU. The following command tests all the saved checkpoints:
Note: checkpoints that have already been evaluated will be skipped.
voxceleb1="root directory of VoxCeleb1"
./downstream/sv_voxceleb1/test_expdir.sh result/downstream/ExpName $voxceleb1
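A possible two-GPU workflow sketch (the background training, the loop and the sleep interval are assumptions, not part of the official recipe):
# GPU 0: keep training and saving checkpoints; GPU 1: repeatedly test whatever has been saved
CUDA_VISIBLE_DEVICES=0 python3 run_downstream.py -n ExpName -m train -u fbank -d sv_voxceleb1 &
while true;
do
    CUDA_VISIBLE_DEVICES=1 ./downstream/sv_voxceleb1/test_expdir.sh result/downstream/ExpName $voxceleb1
    sleep 3600   # already-evaluated checkpoints are skipped, so re-running is cheap
done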
We prepare the frame-wise training labels on the fly and convert the frame-wise predictions into RTTM files annotated in seconds. The inferred RTTM is then scored against the ground-truth RTTM by dscore. You can choose the frame_shift (stride) of the training label for the upstream representation. This only affects the training materials and does not affect the ground-truth RTTM, which is fixed in Libri2Mix during data preparation.
Simulate Libri2Mix Data for Diarization
S3PRL_DIR="root directory of your cloned s3prl"
CORPORA_DIR"root directory of all your datasets, which hopefully contains LibriSpeech (not necessary)"
git clone https://github.com/s3prl/LibriMix.git
cd LibriMix
bash generate_librimix_sd.sh $CORPORA_DIR
python3 scripts/prepare_diarization.py \
--target_dir $S3PRL_DIR/downstream/diarization/data \
--source_dir $CORPORA_DIR/Libri2Mix/wav16k/max/metadata
Train with the label in the same frame_shift as the upstream representation (recommended):
python3 run_downstream.py -n ExpName -m train -u fbank -d diarization
Train with the label in a specific frame_shift (e.g. 160):
python3 run_downstream.py -n ExpName -m train -u fbank -d diarization \
-o config.downstream_expert.datarc.frame_shift=160
The upstream representation will be upsampled (duplicated) or downsampled (taking 1 of every N frames) to match the sequence length of your assigned label. This can be useful when the representation has a very small frame_shift and hence a very long sequence, which leads to long training time.
The frame_shift used for the training label is saved in the checkpoint, and the same frame_shift will be used to convert the frame-wise predictions into RTTM files annotated in seconds.
python3 run_downstream.py -m evaluate -e result/downstream/ExpName/best-states-dev.ckpt
Clone dscore
git clone https://github.com/ftshijt/dscore
Change the path in downstream/diarization/score.sh
dscore_dir="root directory of your cloned dscore"
Run scoring
./downstream/diarization/score.sh result/downstream/ExpName downstream/diarization/data/test
The scoring results include a DER column. Report the number at the bottom: the bottom-most row always has the lowest DER, which is the number we report.
Re-check the scoring results: Running the above scoring script takes time. If you want to re-check the scored results, use
./downstream/diarization/report.sh result/downstream/ExpName
Download the dataset and unzip it. You will need to fill out a form on the official IEMOCAP website to get the dataset.
Change the path in downstream/emotion/config.yaml
downstream_expert:
datarc:
root: "root directory of IEMOCAP"
IEMOCAP provides 5 splits of data: Section1, Section2, Section3, Section4 and Section5. Conventionally, each split is selected in turn as the test set while the model is trained on the other 4 splits. That is, 5 rounds of training and testing are required, and the 5 test scores are averaged to report the final number. You can change the test_fold option in the config file to control which split is reserved as the test set.
# test_fold can be: fold1, fold2, fold3, fold4, fold5
python3 run_downstream.py -n ExpName -m train -u fbank -d emotion -c downstream/emotion/config.yaml -o "config.downstream_expert.datarc.test_fold='fold1'"
python3 run_downstream.py -m evaluate -e result/downstream/ExpName/dev-best.ckpt
for test_fold in fold1 fold2 fold3 fold4 fold5;
do
# The default config is "downstream/emotion/config.yaml"
python3 run_downstream.py -n ExpName_$test_fold -m train -u fbank -d emotion -o "config.downstream_expert.datarc.test_fold='$test_fold'"
python3 run_downstream.py -m evaluate -e result/downstream/ExpName_$test_fold/dev-best.ckpt
done
Simulate Libri2Mix data for source separation. For source separation, we only need the 16kHz, min condition. (Source separation usually uses the 8kHz min condition, but due to the constraint of the pre-trained models we use 16kHz.)
# download the script and simulate Libri2Mix dataset
git clone https://github.com/s3prl/LibriMix.git
cd LibriMix
./generate_librimix_ss.sh storage_dir
# prepare train, dev and test data in Kaldi format
python downstream/separation_stft/scripts/LibriMix/data_prepare.py \
--part train-100 storage_dir/Libri2Mix downstream/separation_stft/datasets/Libri2Mix
python downstream/separation_stft/scripts/LibriMix/data_prepare.py \
--part dev storage_dir/Libri2Mix downstream/separation_stft/datasets/Libri2Mix
python downstream/separation_stft/scripts/LibriMix/data_prepare.py \
--part test storage_dir/Libri2Mix downstream/separation_stft/datasets/Libri2Mix
# subsample dev set from 3000 utts to 1000 utts (for faster validation)
python downstream/separation_stft/scripts/LibriMix/subsample.py \
downstream/separation_stft/datasets/Libri2Mix/wav16k/min/dev \
downstream/separation_stft/datasets/Libri2Mix/wav16k/min/dev_1000
cd $YOUR_S3PRL_ROOT/s3prl/
Train with STFT magnitude as the upstream. The default stride is 20ms, and you can adjust that in upstream/log_stft/stft_mag.yaml
python3 run_downstream.py -m train \
-d separation_stft \
-c downstream/separation_stft/configs/cfg.yaml \
-u stft_mag \
-g 'upstream/log_stft/stft_mag.yaml' \
-n ExpName
Train with wav2vec2 as the upstream.
python3 run_downstream.py --mode train \
-d separation_stft \
-c downstream/separation_stft/configs/cfg.yaml \
-u wav2vec2 \
-n ExpName
python3 run_downstream.py -m evaluate \
-e result/downstream/ExpName/best-states-dev.ckpt
The model is expected to output SI-SDRi on the test set.
We use the Voicebank-DEMAND dataset for speech enhancement and follow the data preparation in SpeechBrain:
# Download the Voicebank-DEMAND dataset and convert it to 16kHz,
# following the data preparation script in the SpeechBrain toolkit
# (https://github.com/speechbrain/speechbrain/blob/develop/recipes/Voicebank/voicebank_prepare.py)
from voicebank_prepare import download_vctk

data_dir = "/path/to/data_dir"  # target directory for the prepared dataset
download_vctk(data_dir)
However, the above pipeline may take a long time to download the original dataset. Hence, we also provide an already-preprocessed archive:
wget http://140.112.21.28:9000/noisy-vctk-16k.zip
unzip noisy-vctk-16k.zip
Check the unzipped voicebank directory structure
data_dir/
├── clean_testset_wav_16k/
├── clean_trainset_28spk_wav_16k/
├── noisy_testset_wav_16k/
├── noisy_trainset_28spk_wav_16k/
├── testset_txt/
└── trainset_28spk_txt/
Prepare kaldi-style scp files
# prepare train, dev and test data in Kaldi format
python downstream/enhancement_stft/scripts/Voicebank/data_prepare.py \
data_dir downstream/enhancement_stft/datasets/voicebank --part train
python downstream/enhancement_stft/scripts/Voicebank/data_prepare.py \
data_dir downstream/enhancement_stft/datasets/voicebank --part dev
python downstream/enhancement_stft/scripts/Voicebank/data_prepare.py \
data_dir downstream/enhancement_stft/datasets/voicebank --part test
Train with hubert as the upstream.
python3 run_downstream.py -m train \
-c downstream/enhancement_stft/configs/cfg_voicebank.yaml \
-d enhancement_stft \
-u hubert \
-n ExpName
python3 run_downstream.py -m evaluate \
-e result/downstream/ExpName/best-states-dev.ckpt
The model is expected to output PESQ, STOI, CoVL and SI-SDRi on the test set.
The following instructions are only a minimal description for benchmarking. A complete guide to the task, dataset, implementation and usage can be found in the README. We evaluate the VC capability by training 4 target-speaker models: given any source-speaker utterance, each single-speaker model converts it to its specific target speaker. This setting is known as any-to-one VC. The 4 target speakers are TEF1, TEF2, TEM1 and TEM2. The quality of each target-speaker model is evaluated with MCD (lower is better), and one should average the MCD over the four speakers.
Download the VCC2020 dataset and the pretrained vocoder.
cd downstream/a2o-vc-vcc2020
cd data
./data_download.sh vcc2020/
cd ../
# Download the pretrained PWGs.
./vocoder_download.sh ./
Specify a target speaker for training from: TEF1, TEF2, TEM1, TEM2
python run_downstream.py -m train -n EXPNAME -u wav2vec -d a2o-vc-vcc2020 \
-o config.downstream_expert.trgspk=TEF1
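To cover all four target speakers, a simple loop sketch (experiment naming is arbitrary):
for trgspk in TEF1 TEF2 TEM1 TEM2;
do
    python run_downstream.py -m train -n EXPNAME_${trgspk} -u wav2vec -d a2o-vc-vcc2020 \
        -o config.downstream_expert.trgspk=${trgspk}
done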
Waveform generation and evaluation (using wav2vec for example) for a specific checkpoint.
./downstream/a2o-vc-vcc2020/decode.sh ./downstream/a2o-vc-vcc2020/pwg_task1 result/downstream/EXPNAME/<step> TEF1
The following instructions are only a minimal description for benchmarking. A complete guide to the task, dataset, implementation and usage can be found in the README.
Prepare the CoVoST En->De dataset. Download the Common Voice audio clips and transcripts (English, Common Voice Corpus 4).
Change the path in downstream/speech_translation/prepare_data/prepare_covo.sh
covo_root="root directory of covost"
src_lang=en
tgt_lang=de
cd downstream/speech_translation/prepare_data/
bash prepare_covo.sh
python run_downstream.py -m train -n ExpName -u fbank -d speech_translation
python run_downstream.py -m evaluate -e result/downstream/ExpName/dev-best.ckpt
The model will report case-sensitive detokenized BLEU.
Read README.
After finishing the testing of each task, the prediction files for leaderboard submission will be located under the expdir. You can use submit.py to easily organize them into a zip file, which can then be submitted to our leaderboard. We currently support submissions for the following tasks: PR, ASR, KS, QbE, SID, ASV, SD, IC, SF, ER, SE, SS, ST.
If you find superbbenchmark.org temporarily down, please try 140.112.21.28 as an alternative; they share the same backend. We will restore the official domain as soon as possible.
Please use a master-branch commit newer than 852db2e. Note that our SUPERB codebase is backward-compatible, so you don't need to re-train any model after upgrading; you only need the newer version to correctly produce the prediction files for submission.
output_dir="submission"
python3 submit/submit.py \
--output_dir $output_dir \
--pr pr_expdir \
--sid sid_expdir \
--ks ks_expdir \
--ic ic_expdir \
--er_fold1 er_fold1_expdir \
--er_fold2 er_fold2_expdir \
--er_fold3 er_fold3_expdir \
--er_fold4 er_fold4_expdir \
--er_fold5 er_fold5_expdir \
--asr_no_lm asr_expdir \
--asr_with_lm asr_expdir \
--qbe qbe_expdir \
--sf sf_expdir \
--sv sv_expdir \
--sd sd_expdir \
--se se_expdir \
--ss ss_expdir \
--st st_expdir
After executing, you can submit submission/predict.zip to the leaderboard.
We also prepare example expdirs for you to diagnose problems if the submission fails. After unzipping them you will see the following structure:
expdirs/
asr_expdir/
er_fold1_expdir/
er_fold2_expdir/
er_fold3_expdir/
er_fold4_expdir/
er_fold5_expdir/
ic_expdir/
ks_expdir/
pr_expdir/
qbe_expdir/
...
Each expdir contains the minimal submission-related files, which should also appear in your own expdir after you run the testing. Here is an example script showing how to use the above example expdirs to prepare a submittable zip file.
cd s3prl/s3prl/submit
./demo_submit.sh examples
After executing, you will see:
s3prl/s3prl/submit/examples/
expdirs/
expdirs.zip
predict/
predict.zip
The predict.zip is the one for you to submit.
You don't need to prepare all the expdirs for a submission; you can zip only a subset of expdirs. After your submission, the leaderboard will only show the results of your submitted tasks. E.g.
python3 submit/submit.py \
--output_dir submission \
--pr pr_expdir
The above command will produce a predict.zip which will only show the PR score after being submitted to the leaderboard.
Emotion Recognition (er) uses 5-fold cross-validation: 5 training runs and 5 testing runs, so 5 expdirs in total.
The expdirs for asr_no_lm and asr_with_lm are typically the same: the same ASR downstream model is trained and simply decoded in different ways, so the same expdir used for training is reused for both tests. The default testing produces predictions for asr_no_lm; decoding with KenLM produces predictions for asr_with_lm. See the ASR section for more information.