The LRGASP project: benchmarking of long-read transcriptomics methods for transcriptome identification and quantification.
Ana Conesa, on behalf of the LRGASP consortium
Institute for Integrative Systems Biology, Spanish National Research Council, Paterna, Spain.
LRGASP Consortium: Francisco J. Pardo-Palacios, Dingjie Wang, Fairlie Reese, Mark Diekhans, Sílvia Carbonell-Sala, Brian Williams, Jane E. Loveland, Maite De María, Matthew S. Adams, Gabriela Balderrama-Gutierrez, Amit K. Behera, Jose M. Gonzalez, Toby Hunt, Julien Lagarde, Haoran Li, Cindy E. Liang, Andrey D. Prjibelski, Leon Sheynkman, David Moraga-Amador, If Barnes, Andrew Berry, Muhammed Hasan Çelik, Natàlia Garcia-Reyero, Stefan Goetz, Liudmyla Kondratova, Jorge Martinez-Tomas, Carlos Menor, Jonathan M. Mudge, Alejandro Paniagua, Marie-Marthe Suner, Hazuki Takahashi, Alison D. Tang, Ingrid Ashley Youngworth, Piero Carninci, Nancy Denslow, Roderic Guigó, Margaret E. Hunter, Hagen U. Tilgner, Barbara J. Wold, Christopher Vollmers, Adam Frankish, Kin Fai Au, Gloria M. Sheynkman, Ali Mortazavi, Ana Conesa, Angela N. Brooks.
Long-read, single-molecule sequencing methods, as offered by Pacific Biosciences and Oxford Nanopore, have created new opportunities for transcriptome analysis because they enable the detection of full-length transcripts and the identification of alternative isoforms. Novel algorithms have been developed to analyze these data. Using these resources, studies in different species have shown that the repertoire of alternative transcripts expressed in any given tissue is larger than previously anticipated. However, long-read methods also have their limitations. Library preparations do not always yield full-length molecules, and single-molecule technologies have higher error rates than Illumina sequencing. As these approaches are increasingly used to define transcriptome catalogues, there is an urgent need for benchmarking to establish the accuracy of both the technologies and the bioinformatics tools used for transcript detection and quantification.
The LRGASP project is an international initiative to carry out this benchmarking. We generated long-read data in three species and on different biological samples (including spike-ins and simulated data), using four different library preparation methods, and sequenced them on the PacBio and Nanopore platforms. The data were made available to the bioinformatics community for the prediction of transcript models, with and without the use of a reference annotation, and for quantification. A total of 12 bioinformatics labs submitted nearly 200 transcriptome predictions for evaluation. The quality of the predicted transcript models was evaluated using the SQANTI framework, and performance metrics were obtained on spike-ins, simulated data, and a set of 50 loci manually curated by GENCODE. A selection of novel transcript predictions was experimentally validated.
Our results indicate that the predicted transcriptome varies widely across methodologies, both experimental methods and analysis algorithms. In particular, the detection of novel transcripts appears to be especially challenging. Moreover, our data reveal that the library preparation protocol, and not only the sequencing technology, is critical for data quality, with read length and read quality, rather than read quantity, mattering most. Our results also show that long-read-based quantification is possible when a sufficient number of reads is available, but that accurate identification of the transcriptome without a reference annotation is difficult. Finally, our results confirm the complexity of the expressed transcriptome and suggest that novel strategies are needed to describe the dynamics of gene expression. LRGASP is the largest effort to date to benchmark long-read sequencing methods for transcriptome analysis.
Ana Conesa is Research Professor at the Institute for Integrative Systems Biology (CSIC) in Valencia (Spain) and Courtesy Professor at the University of Florida. She is a member of the Spanish Royal Academy of Engineering, Fellow of the International Society for Computational Biology, honorary member of the Spanish Society of Bioinformatics and Computational Biology, and member of the Board of Directors of the International Society for Computational Biology.
Ana Conesa’s lab is interested in understanding functional aspects of gene expression at the genome-wide level and across different organisms. Her group has developed statistical methods and software tools for transcriptomics analysis, and she has pioneered the development of methods for multi-omics integration and long-read transcriptomics. A strong drive in her research is helping the genomics community bridge the gap between data and knowledge by creating bioinformatics tools that everybody can use. Her group's popular software tools include Blast2GO, PaintOmics, maSigPro, NOISeq, Qualimap, SQANTI, and tappAS. She has led multiple EU projects to develop methods for transcriptome analysis, more recently with a focus on the use of long-read sequencing to characterize transcriptome complexity. She is also co-founder of Biobam Bioinformatics, a start-up that provides bioinformatics tools for biologists. She has published 156 research papers that have received more than 34,000 citations, and she has an h-index of 58.