01-COSMIC-signature-identification.Rmd · 王诗翔/sigminer-doc

# (PART) Common Workflow {-}

# COSMIC Signature Identification {#sbssig}

In this chapter, we introduce how to identify COSMIC signatures from records of variant calling data. The COSMIC signatures include three types of signatures: SBS, DBS and ID (short for INDEL).

The signature identification procedure is divided into 3 steps:

1. Read mutation data.
2. Tally components: for SBS, this means classifying SBS records into 96 components (the most common case) and generating a sample matrix.
3. Extract signatures: estimate the signature number and identify signatures.

## Read Data

> Make sure you have run `library(sigminer)` before running the following code.

The input data should be in [VCF](https://www.ebi.ac.uk/training-beta/online/courses/human-genetic-variation-introduction/variant-identification-and-analysis/understanding-vcf-format/) or [MAF](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) format.

- For VCF, the input can only be VCF file paths.
- For MAF, the input can be either a MAF file or a `data.frame`.

MAF format is the standard way to represent small-scale variants in Sigminer. There is a popular R/Bioconductor package [**maftools**](https://github.com/PoisonAlien/maftools) [@mayakonda2018maftools] for analyzing MAF data. It provides an R class **MAF** to represent MAF format data.

### VCF as input

If you use VCF files as input, you can use `read_vcf()` to read multiple VCF files as a `MAF` object.

```{r}
vcfs
```

> Here we save this cohort so that users can also run the examples without installing the package TCGAmutations.

```{r}
brca
```

> This classification is based on the six substitution subtypes: C>A, C>G, C>T, T>A, T>C, and T>G (all substitutions are referred to by the pyrimidine of the mutated Watson-Crick base pair). Further, each of the substitutions is examined by incorporating information on the bases immediately 5' and 3' to each mutated base, generating 96 possible mutation types (6 types of substitution x 4 types of 5' base x 4 types of 3' base).
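The count of 96 can be checked with a few lines of base R. The chunk below is only an illustrative sketch and does not use **sigminer**; the variable names (`subs`, `bases`, `grid`, `components`) are made up for this example, and each component is labelled as 5' base, substitution in square brackets, then 3' base.

```{r}
# Six substitution subtypes, referred to by the pyrimidine of the mutated base pair
subs <- c("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")
bases <- c("A", "C", "G", "T")

# All combinations of 5' base, substitution subtype and 3' base
grid <- expand.grid(left = bases, sub = subs, right = bases, stringsAsFactors = FALSE)

# Label each combination, e.g. "C[T>G]A"
components <- sprintf("%s[%s]%s", grid$left, grid$sub, grid$right)

length(components) # 4 x 6 x 4 combinations
head(components)
```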
```{r echo=FALSE, fig.cap="The illustration of 96 components, fig source: https://en.wikipedia.org/wiki/Mutational_signatures"}
knitr::include_graphics("fig/MutationTypes_v3.jpg")
```

We tally components in each sample, and generate a sample-by-component matrix.

```{r}
mt_tally
```

> Here set `useSyn = TRUE` to include all variant records in the MAF object when generating the sample matrix.

```{r}
mt_tally$nmf_matrix[1:5, 1:5]
```

We use the notation `left[ref>mut]right` to mark each component, e.g. `C[T>G]A` means a base T with 5' adjacent base C and 3' adjacent base A is mutated to base G.

### Other Situations

Above we showed the most common SBS classification. Other situations are also supported by **sigminer**, including other classifications for SBS records and other mutation types (DBS and ID). All situations about SBS, DBS and ID signatures are well documented in the [wiki of SigProfilerMatrixGenerator package](https://osf.io/s93d5/wiki/home/).

#### Other SBS classifications

After calling `sig_tally()`, the most used matrix is stored in `nmf_matrix`, and all matrices generated by **sigminer** are stored in `all_matrices`.

```{r}
str(mt_tally$all_matrices, max.level = 1)
```

If you add the strand classification, all matrices that **sigminer** can generate will be returned.

```{r}
mt_tally2
```

> The program will stop if there are no records to analyze.

Let's see ID records.

```{r}
mt_tally_ID
```

> The `pConstant` option is set to avoid errors raised by the **NMF** package.

We can show the signature number survey for different measures with `show_sig_number_survey2()`.

```{r}
## You can also select the measures to show
## by 'what' option
show_sig_number_survey2(mt_est$survey)
```

> For the details of all the measures above, please read @gaujoux2010flexible and the [vignette](https://cran.r-project.org/web/packages/NMF/vignettes/) of R package **NMF**.

The measures describe either the stability of the solution (`cophenetic`) or how well the original matrix can be reconstructed (`rss`).

Typically, the **cophenetic** measure is used for determining the signature number.
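To give a rough feeling for what these two measures capture, here is a small base R sketch on toy data. It is not how **sigminer** or **NMF** computes them internally. The helper names `cophenetic_cor()` and `rss()` are hypothetical, and the sketch assumes that the cophenetic correlation is taken between the distances `1 - consensus` and the cophenetic distances of an average-linkage clustering.

```{r}
# Hypothetical helper: cophenetic correlation of a consensus matrix
# (entries in [0, 1]; 1 means two samples always cluster together)
cophenetic_cor <- function(consensus) {
  d <- as.dist(1 - consensus)         # co-clustering frequency -> distance
  hc <- hclust(d, method = "average") # average-linkage clustering (assumption)
  cor(as.vector(d), as.vector(cophenetic(hc)))
}

# Hypothetical helper: residual sum of squares of a factorization V ~ W %*% H
rss <- function(V, W, H) sum((V - W %*% H)^2)

# Toy consensus matrix: samples 1-2 and 3-4 always co-cluster
stable <- matrix(c(
  1, 1, 0, 0,
  1, 1, 0, 0,
  0, 0, 1, 1,
  0, 0, 1, 1
), nrow = 4, byrow = TRUE)
cophenetic_cor(stable)

# Toy factorization: V equals W %*% H except for one perturbed entry
W <- matrix(c(1, 0, 0, 1, 1, 1), nrow = 3, byrow = TRUE)
H <- matrix(c(2, 0, 1, 0, 3, 1), nrow = 2, byrow = TRUE)
V <- W %*% H
V[1, 1] <- V[1, 1] + 1
rss(V, W, H)
```

A perfectly stable clustering gives a cophenetic correlation of 1, and a perfect reconstruction gives an RSS of 0, so a higher cophenetic value and a lower RSS are both desirable.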
We can easily generate an elbow plot with the function `show_sig_number_survey()`.

```{r}
show_sig_number_survey(mt_est$survey, right_y = NULL)
```

> The most common approach is to use the cophenetic correlation coefficient. Brunet et al. suggested choosing the smallest value of r for which this coefficient starts decreasing. [@gaujoux2010flexible]

The cophenetic value (ranging from 0 to 1) indicates the robustness of consensus matrix clustering. In this situation, 3 is good. However, we can see that the cophenetic values are all >=0.9 from 2 to 5. So a more suitable way is to consider both stability and reconstruction error at the same time, which can be easily done with `show_sig_number_survey()`.

```{r}
show_sig_number_survey(mt_est$survey)
```

> This function is very flexible: you can assign any measure to the left or right axis. However, the default setting is the most recommended way.

We can see that the RSS reaches its minimum at signature number 5, and when the signature number goes from 5 to 6, the RSS increases. So we should not choose a signature number larger than 5 here, because 6 is overfitting.

**NOTE**: There is no gold standard for determining the signature number. Sometimes, you should consider multiple measures. Remember, the most important thing is that **you should have a good biological explanation for each signature**. The best solution for a study may not be the mathematically best solution.

### Method 1: Extract Signatures

After selecting a suitable signature number, you can now extract signatures. In general, 30 to 50 NMF runs give a robust result.

Here we extract 5 signatures.

```{r, eval=FALSE}
mt_sig