
SUPERB Benchmark & Challenge

Prerequisite

Please read downstream/README.md for the general command pattern, and read upstream/example/README.md for registering a new pretrained model (upstream).

Introduction

In this document we detail the commands for reproducing the papers SUPERB: Speech processing Universal PERformance Benchmark and SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities. If you use the tasks here for your research, please consider citing the following papers:

@inproceedings{yang21c_interspeech,
  author={Shu-wen Yang and Po-Han Chi and Yung-Sung Chuang and Cheng-I Jeff Lai and Kushal Lakhotia and Yist Y. Lin and Andy T. Liu and Jiatong Shi and Xuankai Chang and Guan-Ting Lin and Tzu-Hsien Huang and Wei-Cheng Tseng and Ko-tik Lee and Da-Rong Liu and Zili Huang and Shuyan Dong and Shang-Wen Li and Shinji Watanabe and Abdelrahman Mohamed and Hung-yi Lee},
  title={{SUPERB: Speech Processing Universal PERformance Benchmark}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={1194--1198},
  doi={10.21437/Interspeech.2021-1775}
}
@article{superb_sg,
  title={SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities},
  author={Hsiang-Sheng Tsai and Heng-Jui Chang and Wen-Chin Huang and Zili Huang and Kushal Lakhotia and Shu-wen Yang and Shuyan Dong and Andy T. Liu and Cheng-I Lai and Jiatong Shi and Xuankai Chang and Phil Hall and Hsuan-Jui Chen and Shang-Wen Li and Shinji Watanabe and Abdel-rahman Mohamed and Hung-yi Lee},
  journal={ArXiv},
  year={2022},
  volume={abs/2203.06849}
}

Besides the tasks presented in the papers, we are also extending the coverage to more speech tasks. In the SUPERB Challenge at the AAAI workshop The 2nd Self-supervised Learning for Audio and Speech Processing, more tasks are introduced into the benchmark framework, and the setup detailed here serves as the public set of the challenge. We list all tasks below:

| ID | Task Name | Category | Paper | Challenge public-set |
| --- | --- | --- | --- | --- |
| PR | Phoneme Recognition | Content | V | V |
| ASR | Automatic Speech Recognition | Content | V | V |
| KS | Keyword Spotting | Content | V | |
| QbE | Query-by-Example | Content | V | V |
| SID | Speaker Identification | Speaker | V | V |
| ASV | Automatic Speaker Verification | Speaker | V | V |
| SD | Speaker Diarization | Speaker | V | V |
| ER | Emotion Recognition | Paralinguistics | V | V |
| IC | Spoken Intent Classification | Semantics | V | |
| SF | Spoken Slot Filling | Semantics | V | |
| ST | Speech Translation | Semantics | V | V |
| SE | Speech Enhancement | Generation | V | V |
| SS | Source Separation | Generation | V | V |
| VC | Voice Conversion | Generation | V | |

This document contains the following materials:

The command for each task

  • Data preparation
  • Training
  • Testing / Scoring

The training artifacts of each task

  • Tensorboard logs
  • Trained downstream weights (the best on dev set)

Leaderboard submission helper

  • Ready for the tasks presented in the paper
  • Will be ready for the challenge on Sep 30, 2021
    • New tasks submission
    • Overall metrics

Task-specific usages

To reproduce the results in the SUPERB paper, you can follow the commands below, changing only the learning rate (config.optimizer.lr in the config file) with the override option.

# The default lr for ASR is 1.0e-4
python3 run_downstream.py -m train -u wav2vec2 -d asr -n ExpName \
    -o config.optimizer.lr=1.0e-5

If training takes too long to fully converge, you can also consider using distributed training instead of gradient accumulation, as sketched below.
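
A minimal sketch of distributed training with the standard PyTorch launcher; the exact flags and the officially supported workflow are documented in downstream/README.md, and the gradient-accumulation key below is an assumption that depends on your task config.

# Sketch: train on 2 GPUs with the standard PyTorch launcher instead of accumulating gradients
# Check downstream/README.md for the officially supported distributed-training commands
python3 -m torch.distributed.launch --nproc_per_node 2 run_downstream.py \
    -m train -u wav2vec2 -d asr -n ExpName \
    -o config.runner.gradient_accumulate_steps=1  # assumes your config exposes this key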

PR: Phoneme Recognition

Specified by the command -d ctc

Prepare data

  1. Download LibriSpeech and unzip. Only need train-clean-100, dev-clean, and test-clean.

  2. Check the prepared file structure

    LibriSpeech/
    ├── train-clean-100/
    ├── dev-clean/
    └── test-clean/
  3. Change the path in downstream/ctc/libriphone.yaml (or override it from the command line, as sketched after this list)

    downstream_expert:
        corpus:
            path: "root directory of LibriSpeech"

Training

python3 run_downstream.py -n ExpName -m train -u fbank -d ctc -c downstream/ctc/libriphone.yaml

Testing

python3 run_downstream.py -m evaluate -e result/downstream/ExpName/dev-best.ckpt

ASR: Automatic Speech Recognition

Specified by the command -d asr

Prepare data

  1. Download LibriSpeech and unzip. Only need train-clean-100, dev-clean, and test-clean.

  2. Check the prepared file structure

    LibriSpeech/
    ├── train-clean-100/
    ├── dev-clean/
    └── test-clean/
  3. Change the path in downstream/asr/config.yaml

    downstream_expert:
        datarc:
            libri_root: "root directory of LibriSpeech"
  4. Prepare the lengths for utterances in LibriSpeech's train-clean-100, dev-clean and test-clean:

    # Official LibriSpeech is in .flac format
    python3 preprocess/generate_len_for_bucket.py -i "root directory of LibriSpeech" -o data/librispeech -a .flac --n_jobs 12

Training

python3 run_downstream.py -n ExpName -m train -u fbank -d asr

Testing without LM

python3 run_downstream.py -m evaluate -t "test-clean" -e result/downstream/ExpName/dev-clean-best.ckpt

Testing with KenLM + LibriSpeech official 4-gram LM

Installing all the dependencies correctly can be quite complicated. Note that LM decoding is not required for SSL representations to perform well on ASR, and you can skip the LM-decoded ASR results when submitting to the leaderboard.

I. Prepare Decoding Environment
  1. Install KenLM

    • Please follow the official installation instructions of KenLM instead of the one documented in flashlight or wav2letter, due to some known issues.
  2. Install flashlight python bindings

    • Only the python bindings are required, not the entire flashlight toolkit
  3. Download LibriSpeech official 4-gram LM

  4. Download character-based lexicon

  5. Make sure your fairseq version contains this commit cb8469

II. Test
python3 run_downstream.py -m evaluate -t "test-clean" -e result/downstream/dev-best.ckpt \
    -o "\
        config.downstream_expert.datarc.decoder_args.decoder_type='kenlm',, \
        config.downstream_expert.datarc.decoder_args.kenlm_model='/path/to/4-gram.arpa.gz',, \
        config.downstream_expert.datarc.decoder_args.lexicon='/path/to/librispeech_lexicon.lst' \
       "

KS: Keyword Spotting

Specified by the command -d speech_commands

Prepare data

  1. Download data

  2. Download and unpack Speech Commands

    mkdir -p /CORPORA_DIR/speech_commands_v0.01
    tar zxf speech_commands_v0.01.tar.gz -C /CORPORA_DIR/speech_commands_v0.01
  3. Download and unpack Speech Commands test set

    mkdir -p /CORPORA_DIR/speech_commands_test_set_v0.01
    tar zxf speech_commands_test_set_v0.01.tar.gz -C /CORPORA_DIR/speech_commands_test_set_v0.01
  4. Change the following path in downstream/speech_commands/config.yaml to yours

    downstream_expert:
        datarc:
            speech_commands_root: "/CORPORA_DIR/speech_commands_v0.01/"
            speech_commands_test_root: "/CORPORA_DIR/speech_commands_test_set_v0.01/"

Training

python3 run_downstream.py -n ExpName -m train -u fbank -d speech_commands

Testing

python3 run_downstream.py -m evaluate -e result/downstream/ExpName/dev-best.ckpt

Compatible with Speech Commands v2

The implementation is directly compatible with Speech Commands v2. You can enable this by simply changing the train/test dataset paths; all other steps stay the same, as sketched below.
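
For example, a minimal sketch that points the same downstream at v2 data via the override option; the v0.02 directory names are assumptions for illustration:

# Sketch: reuse the same config with Speech Commands v2 data
python3 run_downstream.py -n ExpName_v2 -m train -u fbank -d speech_commands \
    -o "config.downstream_expert.datarc.speech_commands_root='/CORPORA_DIR/speech_commands_v0.02/',, \
        config.downstream_expert.datarc.speech_commands_test_root='/CORPORA_DIR/speech_commands_test_set_v0.02/'"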

QbE: Query-by-Example Spoken Term Detection

Specified by the command -d quesst14_dtw. This task does not require training. We extract representations and run dynamic time warping (DTW) on them.

Prepare data

  1. Download QUESST14

    export CORPORA_DIR="the root directory of all your datasets"    
    wget https://speech.fit.vutbr.cz/files/quesst14Database.tgz
    tar zxf quesst14Database.tgz -C $CORPORA_DIR
  2. Change the path in downstream/quesst14/config.yaml

    downstream_expert:
        datarc:
            dataset_root: "CORPORA_DIR/quesst14Database"

Dynamic Time Warping (DTW)

In SUPERB, we run DTW on all the hidden states layer by layer, choose the best layer according to the dev set, and report its score on the test set. A specific layer can be selected with the -l option, indexed from 0. The following takes the last layer as an example.

# The default dist_fn if not specified is "cosine_exp"
# as it yields the best result for almost all upstream
# Supported dist_fn: cosine, cityblock, euclidean, cosine_exp

layer=-1;
dist_fn=cosine;

# dev
python3 run_downstream.py -m evaluate -t "dev" -u hubert -l ${layer} \
    -d quesst14_dtw -n ExpName_${layer}_dev \
    -o config.downstream_expert.dtwrc.dist_method=$dist_fn

# test
python3 run_downstream.py -m evaluate -t "test" -u hubert -l ${layer} \
    -d quesst14_dtw -n ExpName_${layer}_test \
    -o config.downstream_expert.dtwrc.dist_method=$dist_fn
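
To benchmark every layer on the dev set, a minimal bash loop can be used as sketched below; the layer range (0 to 12 here, suitable for a Base-sized upstream) is an assumption and depends on the upstream.

# Sketch: sweep all layers on the dev set and pick the best one afterwards
dist_fn=cosine;
for layer in $(seq 0 12); do
    python3 run_downstream.py -m evaluate -t "dev" -u hubert -l ${layer} \
        -d quesst14_dtw -n ExpName_${layer}_dev \
        -o config.downstream_expert.dtwrc.dist_method=$dist_fn
done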

Scoring

export S3PRL_DIR=/YOUR/S3PRL/PATH
cd $CORPORA_DIR/quesst14Database/scoring

# dev
./score-TWV-Cnxe.sh $S3PRL_DIR/result/downstream/ExpName_${layer}_dev \
    groundtruth_quesst14_dev -10

# test
./score-TWV-Cnxe.sh $S3PRL_DIR/result/downstream/ExpName_${layer}_test \
    groundtruth_quesst14_eval -10

Submit

After you benchmark all the layers of an upstream, say you find the 6th layer is the best for QbE according to the dev set. Then use ExpName_6_test as the submission expdir for submit.py, for example:
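
A minimal sketch, reusing the submit.py flags described in the Leaderboard submission section below:

# Sketch: submit only the QbE result, assuming layer 6 was the best on dev
python3 submit/submit.py --output_dir submission --qbe result/downstream/ExpName_6_test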

IC: Intent Classification

Specified by the command -d fluent_commands

Prepare data

  1. Download and unzip data: Fluent Speech Commands

  2. Check the prepared file structure

    fluent_speech_commands_dataset
    ├── wavs
    │   └── speakers
    ├── data
    │   └── [*.csv]
    ├── readme.md
    └── Fluent Speech Commands Public License.pdf
  3. Change the following paths under downstream/fluent_commands/config.yaml to your own:

    downstream_expert:
        datarc:
            file_path: "root directory of fluent_speech_commands_dataset"

Training

python3 run_downstream.py -n ExpName -m train -u fbank -d fluent_commands

Testing

python3 run_downstream.py -m evaluate -e result/downstream/ExpName/dev-best.ckpt

SF: End-to-end Slot Filling

Prepare data

  1. Optional: Preprocess Audio SNIPS from the official version.

    # Official Audio SNIPS is in mp3 format; we will convert it to wav
    # We need mp3 support in the sox package (not included by default)
    # First, ensure sox is installed
    # Then install the mp3 support
    
    # apt-get
    apt-get install libsox-fmt-mp3
    
    # or yum install
    yum install soxr sox-plugins-freeworld -y
    
    # after installing the mp3 support
    CORPORA_DIR="the root directory of all your datasets"
    ./preprocess/snips_prepare_data.sh $CORPORA_DIR
  2. Download the preprocessed Audio SNIPS and unzip

  3. Change the paths in downstream/ctc/snips.yaml

    downstream_expert:
        corpus:
            path: "CORPORA_DIR/SNIPS"
        text:
            slots_file: "CORPORA_DIR/SNIPS/slots.txt"

Training

python3 run_downstream.py -n ExpName -m train -u fbank -d ctc -c downstream/ctc/snips.yaml

Testing

python3 run_downstream.py -m evaluate -e result/downstream/ExpName/dev-best.ckpt

SID: Speaker Identification

Prepare data

  1. Download dataset from Voxceleb1 and unzip them.

    voxceleb1_root="/CORPORA_DIR/VoxCeleb1/"
    mkdir -p $voxceleb1_root/dev
    mkdir -p $voxceleb1_root/test
    
    # prepare dev
    cd $voxceleb1_root/dev/
    wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partaa
    wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partab
    wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partac
    wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partad
    cat vox1_dev* > vox1_dev_wav.zip
    unzip vox1_dev_wav.zip
    
    # prepare test
    cd $voxceleb1_root/test/
    wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_test_wav.zip
    unzip vox1_test_wav.zip
  2. Check prepared file structure

    Voxceleb1/
    ├── dev/
    │   └── wav/
    │       └──Speaker id folders
    └── test/
        └── wav/
            └──Speaker id folders
  3. Change the path in downstream/voxceleb1/config.yaml

    downstream_expert:
        datarc:
            file_path: "root directory of VoxCeleb1"    

Training

python3 run_downstream.py -n ExpName -m train -u fbank -d voxceleb1

Testing

python3 run_downstream.py -m evaluate -e result/downstream/ExpName/dev-best.ckpt

ASV: Automatic Speaker Verification

Prepare data

  1. Follow the step 1 and 2 in SID

  2. Change the path in downstream/sv_voxceleb1/config.yaml

    downstream_expert:
        datarc:
            file_path: "root directory of VoxCeleb1"    

Training

python3 run_downstream.py -n ExpName -m train -u fbank -d sv_voxceleb1

Testing

If you already know a specific checkpoint to test, say states-20000.ckpt, you can test it with:

python3 run_downstream.py -m evaluate -e result/downstream/ExpName/states-20000.ckpt

However, since there is no official validation set under the VoxCeleb1 setting, we save a checkpoint every 20000 updates and report the best EER among them. Evaluating checkpoints takes a long time, so we don't test them alongside training on a single GPU. Instead, we save all checkpoints and test them in parallel on another GPU. The following command will:

  1. Run a for-loop to find newly saved checkpoints in expdir
  2. Evaluate it if any is found and log the testing result
  3. Prepare the best prediction file according to already tested checkpoints

Note: checkpoints that have already been evaluated will be skipped.

voxceleb1="root directory of VoxCeleb1"
./downstream/sv_voxceleb1/test_expdir.sh result/downstream/ExpName $voxceleb1
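
For example, if training occupies the first GPU, the evaluation loop can run concurrently on a second one; this sketch only adds the standard CUDA_VISIBLE_DEVICES variable to the command above.

# Sketch: evaluate newly saved checkpoints on GPU 1 while training keeps running on GPU 0
CUDA_VISIBLE_DEVICES=1 ./downstream/sv_voxceleb1/test_expdir.sh result/downstream/ExpName $voxceleb1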

SD: Speaker Diarization

We prepare the frame-wise training labels on the fly and convert the frame-wise predictions into RTTM files annotated in seconds. The inferred RTTM files are then scored against the ground-truth RTTM files with dscore. You can choose the frame_shift (stride) of the training labels for the upstream representation. This only affects the training materials and does not affect the ground-truth RTTM, which is fixed in Libri2Mix during data preparation.

Prepare data

Simulate Libri2Mix Data for Diarization

S3PRL_DIR="root directory of your cloned s3prl"
CORPORA_DIR"root directory of all your datasets, which hopefully contains LibriSpeech (not necessary)"

git clone https://github.com/s3prl/LibriMix.git
cd LibriMix
bash generate_librimix_sd.sh $CORPORA_DIR
python3 scripts/prepare_diarization.py \
    --target_dir $S3PRL_DIR/downstream/diarization/data \
    --source_dir $CORPORA_DIR/Libri2Mix/wav16k/max/metadata

Training

Train with labels at the same frame_shift as the upstream representation (recommended):

python3 run_downstream.py -n ExpName -m train -u fbank -d diarization

Train with the label in a specific frame_shift (e.g. 160):

python3 run_downstream.py -n ExpName -m train -u fbank -d diarization \
    -o config.downstream_expert.datarc.frame_shift=160

The upstream representation will be upsampled (by duplication) or downsampled (by taking 1 frame per N) to match the sequence length of your assigned labels. This can be useful when the representation has a very small frame_shift and hence a very long sequence, which leads to long training times.

Testing

The frame_shift for the training label is already saved in the checkpoint, and the same frame_shift will be used to convert the frame-wise prediction into RTTM files annotated in seconds.

I. Inference predictions (for submission and for scoring locally)
python3 run_downstream.py -m evaluate -e result/downstream/ExpName/best-states-dev.ckpt
II. Scoring (not required for submission)
  1. Clone dscore

    git clone https://github.com/ftshijt/dscore
  2. Change the path in downstream/diarization/score.sh

    dscore_dir="root directory of your cloned dscore"
  3. Run scoring

    ./downstream/diarization/score.sh result/downstream/ExpName downstream/diarization/data/test
  4. Check the scoring results

    One should report the number in the bottom row of the DER column; the bottom row always has the lowest DER, which is the number we report.

  5. Re-check the scoring results: Running the above scoring script takes time. If you want to re-check the scored results, use

    ./downstream/diarization/report.sh result/downstream/ExpName

ER: Emotion Recognition

Prepare data

  1. Download the dataset and unzip it. You will need to fill out a form on the official IEMOCAP website to get the dataset.

  2. Change the path in downstream/emotion/config.yaml

    downstream_expert:
        datarc:
            root: "root directory of IEMOCAP"

Training

IEMOCAP provides 5 splits of data: Session1, Session2, Session3, Session4 and Session5. Conventionally, each split is selected as the test set in turn while the model is trained on the other 4 splits. That is, 5 rounds of training and testing are required, and the 5 test scores are averaged to report the final number. We can change the test_fold option in the config file to control which split is reserved as the test set.

# test_fold can be: fold1, fold2, fold3, fold4, fold5
python3 run_downstream.py -n ExpName -m train -u fbank -d emotion -c downstream/emotion/config.yaml -o "config.downstream_expert.datarc.test_fold='fold1'"

Testing

python3 run_downstream.py -m evaluate -e result/downstream/ExpName/dev-best.ckpt

Cross validation

for test_fold in fold1 fold2 fold3 fold4 fold5;
do
    # The default config is "downstream/emotion/config.yaml"
    python3 run_downstream.py -n ExpName_$test_fold -m train -u fbank -d emotion -o "config.downstream_expert.datarc.test_fold='$test_fold'"
    python3 run_downstream.py -m evaluate -e result/downstream/ExpName_$test_fold/dev-best.ckpt
done

SS: Source Separation

Prepare data

Simulate Libri2Mix data for source separation. For source separation, we only need the 16kHz min condition. (Source separation is usually done with the 8kHz min condition, but due to the constraint of the pre-trained models we use 16kHz.)

# download the script and simulate Libri2Mix dataset
git clone https://github.com/s3prl/LibriMix.git
cd LibriMix 
./generate_librimix_ss.sh storage_dir

# prepare train, dev and test data in Kaldi format
python downstream/separation_stft/scripts/LibriMix/data_prepare.py \
--part train-100 storage_dir/Libri2Mix downstream/separation_stft/datasets/Libri2Mix

python downstream/separation_stft/scripts/LibriMix/data_prepare.py \
--part dev storage_dir/Libri2Mix downstream/separation_stft/datasets/Libri2Mix

python downstream/separation_stft/scripts/LibriMix/data_prepare.py \
--part test storage_dir/Libri2Mix downstream/separation_stft/datasets/Libri2Mix

# subsample dev set from 3000 utts to 1000 utts (for faster validation)
python downstream/separation_stft/scripts/LibriMix/subsample.py \
downstream/separation_stft/datasets/Libri2Mix/wav16k/min/dev \
downstream/separation_stft/datasets/Libri2Mix/wav16k/min/dev_1000

cd $YOUR_S3PRL_ROOT/s3prl/

Training

Train with STFT magnitude as the upstream. The default stride is 20ms, and you can adjust that in upstream/log_stft/stft_mag.yaml

python3 run_downstream.py -m train \
        -d separation_stft \
        -c downstream/separation_stft/configs/cfg.yaml \
        -u stft_mag \
        -g 'upstream/log_stft/stft_mag.yaml' \
        -n ExpName

Train with wav2vec2 as the upstream.

python3 run_downstream.py --mode train \
        -d separation_stft \
        -c downstream/separation_stft/configs/cfg.yaml \
        -u wav2vec2 \
        -n ExpName

Testing

python3 run_downstream.py -m evaluate \
        -e result/downstream/ExpName/best-states-dev.ckpt

The model is expected to output si-sdri on the test set.

SE: Speech Enhancement

Prepare data

  1. We use the Voicebank-DEMAND dataset for speech enhancement. We follow the data preparation in SpeechBrain:

    # Download the Voicebank-DEMAND dataset and convert it to 16kHz
    # We follow the data preparation script in the SpeechBrain toolkit (https://github.com/speechbrain/speechbrain/blob/develop/recipes/Voicebank/voicebank_prepare.py)
    from voicebank_prepare import download_vctk
    download_vctk(data_dir)

    However, the above pipeline might take too much time to download the original dataset. Hence, we also provide the already preprocessed archive:

    wget http://140.112.21.28:9000/noisy-vctk-16k.zip
    unzip noisy-vctk-16k.zip
  2. Check the unzipped voicebank directory structure

    data_dir/
    ├── clean_testset_wav_16k/
    ├── clean_trainset_28spk_wav_16k/
    ├── noisy_testset_wav_16k/
    ├── noisy_trainset_28spk_wav_16k/
    ├── testset_txt/
    └── trainset_28spk_txt/
  3. Prepare kaldi-style scp files

    # prepare train, dev and test data in Kaldi format
    python downstream/enhancement_stft/scripts/Voicebank/data_prepare.py \
        data_dir downstream/enhancement_stft/datasets/voicebank --part train
    python downstream/enhancement_stft/scripts/Voicebank/data_prepare.py \
        data_dir downstream/enhancement_stft/datasets/voicebank --part dev
    python downstream/enhancement_stft/scripts/Voicebank/data_prepare.py \
        data_dir downstream/enhancement_stft/datasets/voicebank --part test

Training

Train with hubert as the upstream.

python3 run_downstream.py -m train \
       -c downstream/enhancement_stft/configs/cfg_voicebank.yaml \
       -d enhancement_stft \
       -u hubert \
       -n ExpName

Testing

python3 run_downstream.py -m evaluate \
       -e result/downstream/ExpName/best-states-dev.ckpt

The model is expected to output pesq, stoi, covl and si-sdri on the test set.

VC: Voice conversion

The following instructions are only a minimal description for benchmarking. A complete guide to the task, dataset, implementation and usage can be found in the README. We evaluate the VC capability by training 4 target-speaker models: given any source-speaker utterance, each single-speaker model converts it to one specific target speaker. This setting is known as any-to-one VC. The 4 target speakers are TEF1, TEF2, TEM1 and TEM2. The quality of each target-speaker model is evaluated with MCD (lower is better), and one should average the MCD over the four speakers.

Prepare data

Download the VCC2020 dataset and the pretrained vocoder.

cd downstream/a2o-vc-vcc2020
cd data
./data_download.sh vcc2020/
cd ../

# Download the pretrained PWGs.
./vocoder_download.sh ./

Training

Specify a target speaker for training from: TEF1, TEF2, TEM1, TEM2

python run_downstream.py -m train -n EXPNAME -u wav2vec -d a2o-vc-vcc2020 \
    -o config.downstream_expert.trgspk=TEF1
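
To cover all four target speakers, you can loop over them in the same way as the ER cross-validation above; this is a sketch, and the per-speaker MCDs still need to be averaged afterwards.

# Sketch: train one any-to-one model per target speaker
for trgspk in TEF1 TEF2 TEM1 TEM2; do
    python run_downstream.py -m train -n EXPNAME_${trgspk} -u wav2vec -d a2o-vc-vcc2020 \
        -o config.downstream_expert.trgspk=${trgspk}
done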

Testing

Waveform generation and evaluation (using wav2vec for example) for a specific checkpoint.

./downstream/a2o-vc-vcc2020/decode.sh ./downstream/a2o-vc-vcc2020/pwg_task1 result/downstream/EXPNAME/<step> TEF1

ST: Speech Translation

The following instructions are only a minimal description for benchmarking. A complete guide to the task, dataset, implementation and usage can be found in the README.

Prepare data

Prepare the CoVoST En->De dataset.

  1. Download Common Voice audio clips and transcripts (english) (Common Voice Corpus 4).

  2. Change the path in downstream/speech_translation/prepare_data/prepare_covo.sh

covo_root="root directory of covost"
src_lang=en
tgt_lang=de
  1. Run the following script
cd downstream/speech_translation/prepare_data/
bash prepare_covo.sh

Training

python run_downstream.py -m train -n ExpName -u fbank -d speech_translation

Testing

python run_downstream.py -m evaluate -e result/downstream/ExpName/dev-best.ckpt

The model will report case-sensitive detokenized BLEU.

OOD-ASR: Out-of-domain Automatic Speech Recognition Tasks

Read README.

Leaderboard submission

After finishing the Testing of each task, the prediction files for leaderboard submission will be located under the expdir. You can use submit.py to easily organize them into a zip file which can later be submitted to our leaderboard. We now support submissions for the following tasks: PR, ASR, KS, QbE, SID, ASV, SD, IC, SF, ER, SE, SS, ST.

If you find superbbenchmark.org is down temporarily, please try to use 140.112.21.28 as an alternative. They share the same backend. We will make the official domain work as soon as possible.

Please use the master branch newer than 852db2e. Note that our SUPERB codebase is backward-compatible, so you don't need to re-train any model after upgrading to this newer version; you only need it to correctly produce the prediction files for submission.
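
A minimal sketch of updating an existing clone before producing the prediction files:

# Sketch: bring the local master branch up to date
cd $S3PRL_DIR  # assumed to be your cloned s3prl root
git checkout master
git pull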

output_dir="submission"

python3 submit/submit.py \
    --output_dir $output_dir \
    --pr pr_expdir \
    --sid sid_expdir \
    --ks ks_expdir \
    --ic ic_expdir \
    --er_fold1 er_fold1_expdir \
    --er_fold2 er_fold2_expdir \
    --er_fold3 er_fold3_expdir \
    --er_fold4 er_fold4_expdir \
    --er_fold5 er_fold5_expdir \
    --asr_no_lm asr_expdir \
    --asr_with_lm asr_expdir \
    --qbe qbe_expdir \
    --sf sf_expdir \
    --sv sv_expdir \
    --sd sd_expdir \
    --se se_expdir \
    --ss ss_expdir \
    --st st_expdir

After executing, you can submit submission/predict.zip to the leaderboard.

We also prepare the example-expdirs for you to diagnose if the submission fails. After unzipping you will see the following structure:

expdirs/
    asr_expdir/
    er_fold1_expdir/
    er_fold2_expdir/
    er_fold3_expdir/
    er_fold4_expdir/
    er_fold5_expdir/
    ic_expdir/
    ks_expdir/
    pr_expdir/
    qbe_expdir/
    ...

Each expdir contains the minimal submission-related files, which should also appear in your own expdir after you run the testing. Here is an example script showing how to use the above example expdirs to prepare a submittable zip file.

cd s3prl/s3prl/submit
./demo_submit.sh examples

After executing, you will see:

s3prl/s3prl/submit/examples/
    expdirs/
    expdirs.zip
    predict/
    predict.zip

The predict.zip is the one for you to submit.

Note1

You don't need to prepare all the expdirs for the submission. You can zip only a subset of expdirs; after your submission, the leaderboard will only show the results of your submitted tasks. E.g.:

python3 submit/submit.py \
    --output_dir submission \
    --pr pr_expdir

The above command will produce a predict.zip; after it is submitted, the leaderboard will only show the PR score.

Note2

Emotion Recognition (er) uses 5-fold cross validation: 5 training runs and 5 testing runs, so 5 expdirs in total.

Note3

The expdirs for asr_no_lm and asr_with_lm are typically the same: the same ASR downstream model is trained once and just decoded in different ways, so the expdir assigned for training is reused when testing. The default testing produces predictions for asr_no_lm; decoding with KenLM produces predictions for asr_with_lm. See the ASR section above for more information.
