Therefore, as experiments that include biological replication become more common, statistical frameworks to account for multiple sources of biological variability will be critical, as recently described by Lhnemann et al. On the other hand, subject had the smallest FPR (0.03) compared to wilcox and mixed (0.26 and 0.08, respectively) and had a higher PPV (0.38 compared to 0.10 and 0.23). As a gold standard, results from bulk RNA-seq comparing CD66+ and CD66- basal cells (bulk). Importantly, although these results specifically target differences in small airway secretory cells and are not directly comparable with other transcriptome studies, previous bulk RNA-seq (Bartlett et al., 2016) and microarray (Stoltz et al., 2010) studies have suggested few gene expression differences in airway epithelial tissues between CF and non-CF pigs; true differential gene expression between genotypes at birth is therefore likely to be small, as detected by the subject method. Rows correspond to different proportions of differentially expressed genes, pDE and columns correspond to different SDs of (natural) log fold change, . We performed DS analysis using the same seven methods as Section 3.1. Results for analysis of CF and non-CF pig small airway secretory cells. RNA-seqR "Seurat" FindMarkers() FindMarkers() Volcano plotMA plot In Supplementary Figure S14(ef), we quantify the ability of each method to correctly identify markers of T cells and macrophages from a database of known cell type markers (Franzen et al., 2019). In practice, we have omitted comparisons of gene expression in rare cell types because the gene expression profiles had high variation, and the reliability of the comparisons was questionable. I have scoured the web but I still cannot figure out how to do this. Figure 5d shows ROC and PR curves for the three scRNA-seq methods using the bulk RNA-seq as a gold standard. Comparisons of characteristics of the simulated and real data are shown in Supplementary Figures S1S6. The following equations are identical: . Red and blue dots represent genes with a log 2 FC (fold . The wilcox, MAST and Monocle methods had intermediate performance in these nine settings. The resulting matrix contains counts of each genefor each subject and can be analyzed using software for bulk RNA-seq data. Overall, the subject and mixed methods had the highest concordance between permutation and method P-values. Seurat has four tests for differential expression which can be set with the test.use parameter: ROC test ("roc"), t-test ("t"), LRT test based on zero-inflated data ("bimod", default), LRT test based on tobit-censoring models ("tobit") The ROC test returns the 'classification power' for any individual marker (ranging from 0 . Alternatively, batch correction methods have been proposed to remove inter-individual differences prior to DS analysis, however, this increases type I error rates and disturbs the rank-order of results as explained in Zimmerman et al. Figure 4b shows the top 50 genes for each method, defined by the smallest 50 adjusted P-values. Supplementary Figure S9 contains computation times for each method and simulation setting for the 100 simulated datasets. Step 3: Create a basic volcano plot. For each subject, the number of cells and numbers of UMIs per cell were matched to the pig data. This will mean, however, that FindMarkers() takes longer to complete. In a scRNA-seq study of human tracheal epithelial cells from healthy subjects and subjects with idiopathic pulmonary fibrosis (IPF), the authors found that the basal cell population contained specialized subtypes (Carraro et al., 2020). Infinite p-values are set defined value of the highest -log(p) + 100. Until computationally efficient methods exist to fit hierarchical models incorporating all sources of biological variation inherent to scRNA-seq, we believe that pseudobulk methods are useful tools for obtaining time-efficient DS results with well-controlled FDR. Nine simulation settings were considered. ## [118] sctransform_0.3.5 parallel_4.2.0 grid_4.2.0 Then, we consider the top g genes for each method, which are the g genes with the smallest adjusted P-values, and find what percentage of these top genes are known markers. make sure label exists on your cells in the metadata corresponding to treatment (before- and after-), You will be returned a gene list of pvalues + logFc + other statistics. < 10e-20) with a different symbol at the top of the graph. ## [70] ggridges_0.5.4 evaluate_0.20 stringr_1.5.0 In addition to simulated data, we analysed an animal model dataset containing large and small airway epithelia from CF and non-CF pigs (Rogers et al., 2008). We identified cell types, and our DS analyses focused on comparing expression profiles between large and small airways and CF and non-CF pigs. According to this criterion, the subject method had the best performance, and the degree to which subject outperformed the other methods improved with larger values of the signal-to-noise ratio parameter . Gene counts were simulated from the model in Section 2.1. Next, I'm looking to visualize this using a volcano plot using the EnhancedVolcano package: Here, we introduce a mathematical framework for modeling different sources of biological variation introduced in scRNA-seq data, and we provide a mathematical justification for the use of pseudobulk methods for DS analysis. Figure 6(e and f) shows ROC and PR curves for the three scRNA-seq methods using the bulk RNA-seq as a gold standard. Finally, we discuss potential shortcomings and future work. This is the model used in DESeq2 (Love et al., 2014). Data for the analysis of human trachea were obtained from GEO accessions GSE143705 (bulk RNA-seq) and GSE143706 (scRNA-seq). In stage ii, we assume that we have not measured cell-level covariates, so that variation in expression between cells of the same type occurs only through the dispersion parameter ij2. It is important to emphasize that the aggregation of counts occurs within cell types or cell states, so that the advantages of single-cell sequencing are retained. ## [5] ssHippo.SeuratData_3.1.4 pbmcsca.SeuratData_3.0.0 The general process for detecting genes then would be: Repeat for all cell clusters/types of interest, depending on your research questions. ## [16] cluster_2.1.3 ROCR_1.0-11 limma_3.54.1 In your DoHeatmap () call, you do not provide features so the function does not know which genes/features to use for the heatmap. 6e), subject and mixed have the same area under the ROC curve (0.82) while the wilcox method has slightly smaller area (0.78). A volcano plot is a type of scatterplot that shows statistical significance (P value) versus magnitude of change (fold change). These analyses provide guidance on strengths and weaknesses of different methods in practice. In contrast, single-cell experiments contain an additional source of biological variation between cells. The subject method had the shortest average computation times, typically <1 min. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (, https://doi.org/10.1093/bioinformatics/btab337, https://www.bioconductor.org/packages/release/bioc/html/aggregateBioVar.html, https://creativecommons.org/licenses/by/4.0/, Receive exclusive offers and updates from Oxford Academic, Academic Pulmonary Sleep Medicine Physician Opportunity in Scenic Central Pennsylvania, MEDICAL MICROBIOLOGY AND CLINICAL LABORATORY MEDICINE PHYSICIAN, CLINICAL CHEMISTRY LABORATORY MEDICINE PHYSICIAN. In a scRNA-seq experiment with multiple subjects, we assume that the observed data consist of gene counts for G genes drawn from multiple cells among n subjects. FindMarkers: Finds markers (differentially expressed genes) for identified clusters. The volcano plot for the subject method shows three genes with adjusted P-value <0.05 (-log 10 (FDR) > 1.3), whereas the other six methods detected a much larger number of genes. For the AT2 cells (Fig. For example, consider a hypothetical gene having heterogeneous expression in CF pigs, where cells were either low expressors or high expressors versus homogeneous expression in non-CF pigs, where cells were moderate expressors. Marker detection methods were found to have unacceptable FDR due to pseudoreplication bias, in which cells from the same individual are correlated but treated as independent replicates, and pseudobulk methods were found to be too conservative, in the sense that too many differentially expressed genes were undiscovered. a, Volcano plot of RNA-seq data from bulk hippocampal tissue from 8- to 9-month-old P301S transgenic and non-transgenic mice (Wald test). The marker genes list can be a list or a dictionary. We compared the performances of subject, wilcox and mixed for DS analysis of the scRNA-seq from healthy and IPF subjects within AT2 and AM cells using bulk RNA-seq of purified AT2 and AM cell type fractions as a gold standard, similar to the method used in Section 3.5. Single-cell RNA-sequencing (scRNA-seq) enables analysis of the effects of different conditions or perturbations on specific cell types or cellular states. Theorem 1: The expected value of Kij is ij=sjqij. This research was supported in part through computational resources provided by The University of Iowa, Iowa City, Iowa. This study found that generally pseudobulk methods and mixed models had better statistical characteristics than marker detection methods, in terms of detecting differentially expressed genes with well-controlled false discovery rates (FDRs), and pseudobulk methods had fast computation times. The difference between these formulas is in the mean calculation. Default is 0.25. Step 5: Export and save it. #' @param output_dir The relative directory that will be used to save results. dotplot visualization does not work for scaled or corrected matrices in which cero counts had been replaced by other values. First, a random proportion of genes, pDE, were flagged as differentially expressed. In practice, often only one cutoff value for the adjusted P-value will be chosen to detect genes. Consider a purified cell type (PCT) study design, in which many cells from a cell type of interest could be isolated and profiled using bulk RNA-seq. Third, we examine properties of DS testing in practice, comparing cells versus subjects as units of analysis in a simulation study and using available scRNA-seq data from humans and pigs. ## [34] zoo_1.8-11 glue_1.6.2 polyclip_1.10-4 For higher numbers of differentially expressed genes (pDE > 0.01), the subject method had lower NPV values when = 0.5 and similar or higher NPV values when > 0.5. In the second stage, the observed data for each gene, measured as a count, is assumed to follow a Poisson distribution with mean equal to the product of a size factor, such as sequencing depth, and gene expression generated in the first stage. Further, subject has the highest AUPR (0.21) followed by mixed (0.14) and wilcox (0.08). The main idea of the theorem is that if gene counts are summed across cells and the number of cells grows large for each subject, the influence of cell-level variation on the summed counts is negligible. First, we identified the AT2 and AM cells via clustering (Fig. With this data you can now make a volcano plot; Repeat for all cell clusters/types of interest, depending on your research questions. Multiple methods and bioinformatic tools exist for initial scRNA-seq data processing, including normalization, dimensionality reduction, visualization, cell type identification, lineage relationships and differential gene expression (DGE) analysis (Chen et al., 2019; Hwang et al., 2018; Luecken and Theis, 2019; Vieth et al., 2019; Zaragosi et al., 2020). In summary, here we (i) suggested a modeling framework for scRNA-seq data from multiple biological sources, (ii) showed how failing to account for biological variation could inflate the FDR of DS analysis and (iii) provided a formal justification for the validity of pseudobulking to allow DS analysis to be performed on scRNA-seq data using software designed for DS analysis of bulk RNA-seq data (Crowell et al., 2020; Lun et al., 2016; McCarthy et al., 2017). With this data you can now make a volcano plot. ## [121] tidyr_1.3.0 rmarkdown_2.21 Rtsne_0.16 Applying themes to plots. Well demonstrate visualization techniques in Seurat using our previously computed Seurat object from the 2,700 PBMC tutorial. ## [1] systemfonts_1.0.4 plyr_1.8.8 igraph_1.4.1 10e-20) with a different symbol at the top of the graph. ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C (b) CD66+ basal cells were identified via detection of CEACAM5 or CEACAM6. ## [85] mime_0.12 formatR_1.14 compiler_4.2.0 "poisson" : Likelihood ratio test assuming an . These methods provide interpretable results that generalize to a population of research subjects, account for important sources of biological and technical variability and provide adequate FDR control. As increases, the width of the distribution of effect sizes increases, so that the signal-to-noise ratio for differentially expressed genes is larger. In (a), vertical axes are negative log10-transformed adjusted P-values, and horizontal axes are log2-transformed fold changes. For example, lets pretend that DCs had merged with monocytes in the clustering, but we wanted to see what was unique about them based on their position in the tSNE plot. "t" : Student's t-test. In bulk RNA-seq studies, gene counts are often assumed to follow a negative binomial distribution (Hardcastle and Kelly, 2010; Leng et al., 2013; Love et al., 2014; Robinson et al., 2010). # Calculate feature-specific contrast levels based on quantiles of non-zero expression. Further, applying computational methods that account for all sources of variation will be necessary to gain better insights into biological systems, operating at the granular level of cells all the way up to the level of populations of subjects. (a) t-SNE plot shows AT2 cells (red) and AM (green) from single-cell RNA-seq profiling of human lung from healthy subjects and subjects with IPF. Figure 3a shows the area under the PR curve (AUPR) for each method and simulation setting. These were the values used in the original paper for this dataset. Supplementary Figure S12a shows volcano plots for the results of the seven DS methods described. For each method, we compared the permutation P-values to the P-values directly computed by each method, which we define as the method P-values. So, If I change the assay to "RNA", how we can trust that the DEGs are not due . Supplementary Figure S14 shows the results of marker detection for T cells and macrophages. Step-by-step guide to create your volcano plot. Further, the cell-level variance and subject-level variance parameters were matched to the pig data. Crowell et al. ## loaded via a namespace (and not attached): ## [1] systemfonts_1.0.4 plyr_1.8.8 igraph_1.4.1, ## [4] lazyeval_0.2.2 sp_1.6-0 splines_4.2.0, ## [7] crosstalk_1.2.0 listenv_0.9.0 scattermore_0.8, ## [10] digest_0.6.31 htmltools_0.5.5 fansi_1.0.4, ## [13] magrittr_2.0.3 memoise_2.0.1 tensor_1.5, ## [16] cluster_2.1.3 ROCR_1.0-11 limma_3.54.1, ## [19] globals_0.16.2 matrixStats_0.63.0 pkgdown_2.0.7, ## [22] spatstat.sparse_3.0-1 colorspace_2.1-0 rappdirs_0.3.3, ## [25] ggrepel_0.9.3 textshaping_0.3.6 xfun_0.38, ## [28] dplyr_1.1.1 crayon_1.5.2 jsonlite_1.8.4, ## [31] progressr_0.13.0 spatstat.data_3.0-1 survival_3.3-1, ## [34] zoo_1.8-11 glue_1.6.2 polyclip_1.10-4, ## [37] gtable_0.3.3 leiden_0.4.3 future.apply_1.10.0, ## [40] abind_1.4-5 scales_1.2.1 spatstat.random_3.1-4, ## [43] miniUI_0.1.1.1 Rcpp_1.0.10 viridisLite_0.4.1, ## [46] xtable_1.8-4 reticulate_1.28 ggmin_0.0.0.9000, ## [49] htmlwidgets_1.6.2 httr_1.4.5 RColorBrewer_1.1-3, ## [52] ellipsis_0.3.2 ica_1.0-3 farver_2.1.1, ## [55] pkgconfig_2.0.3 sass_0.4.5 uwot_0.1.14, ## [58] deldir_1.0-6 utf8_1.2.3 tidyselect_1.2.0, ## [61] labeling_0.4.2 rlang_1.1.0 reshape2_1.4.4, ## [64] later_1.3.0 munsell_0.5.0 tools_4.2.0, ## [67] cachem_1.0.7 cli_3.6.1 generics_0.1.3, ## [70] ggridges_0.5.4 evaluate_0.20 stringr_1.5.0, ## [73] fastmap_1.1.1 yaml_2.3.7 ragg_1.2.5, ## [76] goftest_1.2-3 knitr_1.42 fs_1.6.1, ## [79] fitdistrplus_1.1-8 purrr_1.0.1 RANN_2.6.1, ## [82] pbapply_1.7-0 future_1.32.0 nlme_3.1-157, ## [85] mime_0.12 formatR_1.14 compiler_4.2.0, ## [88] plotly_4.10.1 png_0.1-8 spatstat.utils_3.0-2, ## [91] tibble_3.2.1 bslib_0.4.2 stringi_1.7.12, ## [94] highr_0.10 desc_1.4.2 lattice_0.20-45, ## [97] Matrix_1.5-3 vctrs_0.6.1 pillar_1.9.0, ## [100] lifecycle_1.0.3 spatstat.geom_3.1-0 lmtest_0.9-40, ## [103] jquerylib_0.1.4 RcppAnnoy_0.0.20 data.table_1.14.8, ## [106] cowplot_1.1.1 irlba_2.3.5.1 httpuv_1.6.9, ## [109] R6_2.5.1 promises_1.2.0.1 KernSmooth_2.23-20, ## [112] gridExtra_2.3 parallelly_1.35.0 codetools_0.2-18, ## [115] MASS_7.3-56 rprojroot_2.0.3 withr_2.5.0, ## [118] sctransform_0.3.5 parallel_4.2.0 grid_4.2.0, ## [121] tidyr_1.3.0 rmarkdown_2.21 Rtsne_0.16, ## [124] spatstat.explore_3.1-0 shiny_1.7.4, Fast integration using reciprocal PCA (RPCA), Integrating scRNA-seq and scATAC-seq data, Demultiplexing with hashtag oligos (HTOs), Interoperability between single-cell object formats. Subject-level gene expression scores were computed as the average counts per million for all cells from each subject. In each panel, PR curves are plotted for each of seven DS analysis methods: subject (red), wilcox (blue), NB (green), MAST (purple), DESeq2 (orange), Monocle (gold) and mixed (brown). The value of pDE describes the relative number of differentially expressed genes in a simulated dataset, and the value of controls the signal-to-noise ratio. (c) Volcano plots show results of three methods (subject, wilcox and mixed) used to identify CD66+ and CD66- basal cell marker genes. ## EnhancedVolcano (Blighe, Rana, and Lewis 2018) will attempt to fit as many labels in the plot window as possible, thus avoiding 'clogging' up the . ## [28] dplyr_1.1.1 crayon_1.5.2 jsonlite_1.8.4 PR curves for DS analysis methods. If a gene was not differentially expressed, the value of i2 was set to 0. Overall, the volcano plots for subject and mixed look similar with a higher number of genes upregulated in the IPF group, while the wilcox method exhibits a much different shape with more genes highly downregulated in the IPF group. Under normal circumstances, the DS analysis should remain valid because the pseudobulk method accounts for this imbalance via different size factors for each subject. 6b). ## [46] xtable_1.8-4 reticulate_1.28 ggmin_0.0.0.9000 ## For each subject, gene counts are summed for all cells. As scRNA-seq costs have decreased, collecting data from more than one biological replicate has become more feasible, but careful modeling of different layers of biological variation remains challenging for many users. For each method, the computed P-values for all genes were adjusted to control the FDR using the BenjaminiHochberg procedure (Benjamini and Hochberg, 1995). Rows correspond to different proportions of differentially expressed genes, pDE and columns correspond to different SDs of (natural) log fold change, . Pseudobulking has been tested in real scRNA-seq studies (Kang et al., 2018) and benchmarked extensively via simulation (Crowell et al., 2020). Four of the methods were applications of the FindMarkers function in the R package Seurat (Butler et al., 2018; . The Author(s) 2021. S14f), wilcox produces better ranked gene lists of known markers than both subject and wilcox and again, the mixed method has the worst performance. Our analysis of CF and non-CF pigs showed that the subject method better controlled the FPR of DS analysis when the expected rate of true positives is small; here, using the same animal model, we compare large and small airway ciliated cells which are expected to vary largely. For each setting, 100 datasets were simulated, and we compared seven different DS methods. # search for positive markers monocyte.de.markers <- FindMarkers (pbmc, ident.1 = "CD14+ Mono", ident.2 = NULL, only.pos = TRUE) head (monocyte.de.markers) ## [103] jquerylib_0.1.4 RcppAnnoy_0.0.20 data.table_1.14.8 RNA-Seq Data Heatmap: Is it necessary to do a log2 . ## [1] stats graphics grDevices utils datasets methods base ## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 As a gold standard, results from bulk RNA-seq of isolated AT2 cells and AM comparing IPF and healthy lungs (bulk). d Volcano plots showing DE between T cells from random groups of unstimulated controls drawn . We are deprecating this functionality in favor of the patchwork system. ## [124] spatstat.explore_3.1-0 shiny_1.7.4. Next, we applied our approach for marker detection and DS analysis to published human datasets. The volcano plot for the subject method shows three genes with adjusted P-value <0.05 (log10(FDR) > 1.3), whereas the other six methods detected a much larger number of genes. data("pbmc_small") # Find markers for cluster 2 markers <- FindMarkers(object = pbmc_small, ident.1 = 2) head(x = markers) # Take all cells in cluster 2, and find markers that separate cells in the 'g1' group (metadata # variable 'group') markers <- FindMarkers(pbmc_small, ident.1 = "g1", group.by = 'groups', subset.ident = "2") head(x = markers) # Pass 'clustertree' or an object of class .

Why Was George Whitefield So Popular Document C, Articles F