Batch Analysis of RNA-Seq Data with Docker in AWS Cloud

Yılmaz A.

ICAMLS 2018 International Conference on Applied Mathematics, Modeling and Life Science Problems, İstanbul, Turkey, 3-5 October 2018, pp.87

  • Publication Type: Conference Paper / Summary Text
  • City: İstanbul
  • Country: Turkey
  • Page Numbers: pp.87
  • Yıldız Technical University Affiliated: Yes


As the cost of sequencing decreases, the number of next-generation sequencing
studies is increasing at a rapid pace. The staggering amount of sequencing data
accumulated over the years is kept in publicly available databases such as the
Sequence Read Archive (SRA) and the European Nucleotide Archive (ENA). This
enormous volume of sequencing data provides an opportunity for data mining in
gene expression and genome variant studies.
However, such a mining task requires not only extensive computational resources
but also the orchestration of analysis steps at a large scale. The latter challenge
arises because the analysis of sequencing data comprises multiple steps, each carried
out by different software. If the overall goal can be summarized as “setting up
multiple computers and distributing the workload and processes”, achieving this
manually is clearly impractical. With the help of various tools and technologies,
however, setting up such an environment is much easier than before. The “setting up
computers” part is handled by containerization technology, in which Docker is the
leading platform. The “multiple computers” part is handled by cloud services, where
CPU, RAM and hard disk space can be rented for an hourly fee. In this talk, Amazon
AWS EC2 will be demonstrated. Finally, the “distributing workload and processes”
part can be handled by bioinformatics pipeline frameworks. In this talk, the
Nextflow [1] framework, which is able to use containers and run in the cloud, will
be demonstrated.
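The division of labour described above (a container image per analysis step, several machines, and a scheduler that spreads samples across them) can be illustrated with a minimal Python sketch. It is not the pipeline used in the talk, where Nextflow plays the scheduling role. The image name `example/rnaseq:latest`, the entry point `analyze-sample`, the mount point `/data` and the sample identifiers are all hypothetical placeholders, and the commands are only built as strings, not executed.

```python
import shlex
from typing import Dict, List


def assign_round_robin(accessions: List[str], n_workers: int) -> Dict[int, List[str]]:
    """Spread sample identifiers over n_workers, one at a time in turn."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    batches: Dict[int, List[str]] = {w: [] for w in range(n_workers)}
    for i, accession in enumerate(accessions):
        batches[i % n_workers].append(accession)
    return batches


def docker_command(image: str, accession: str, host_dir: str) -> str:
    """Build (but do not run) a `docker run` command line for one sample."""
    argv = [
        "docker", "run", "--rm",       # remove the container when it exits
        "-v", f"{host_dir}:/data",     # share a host directory with the container
        image,                         # hypothetical image name
        "analyze-sample", accession,   # hypothetical entry point inside the image
    ]
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    # Placeholder identifiers, not real SRA/ENA accessions.
    samples = ["SAMPLE_A", "SAMPLE_B", "SAMPLE_C", "SAMPLE_D", "SAMPLE_E"]
    plan = assign_round_robin(samples, n_workers=2)
    for worker, batch in plan.items():
        for accession in batch:
            command = docker_command("example/rnaseq:latest", accession, "/mnt/work")
            print(f"worker {worker}: {command}")
```

Round-robin assignment is the simplest possible policy and ignores differences in sample size; a pipeline framework would additionally track dependencies between steps, retry failed tasks and provision the cloud instances.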
Containerization not only eases the pain of software installation and configuration
but also supports reproducible research [2]. Combining containerization with
cloud computing allows rapid and affordable bioinformatic analysis at scale [3].
Containerization also allows integrative analysis by mixing and matching tools from
different fields of bioinformatics. This talk will briefly introduce the
aforementioned tools and technologies and then present an example batch analysis in
which human RNA-Seq data from multiple sequencing projects were used to obtain
preliminary results related to trans-splicing and non-aligned reads.

[1] Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E., and
Notredame, C. (2017). Nextflow enables reproducible computational workflows.
Nature Biotechnology, 35(4), 316.
[2] Boettiger, C. (2015). An introduction to Docker for reproducible research. ACM
SIGOPS Operating Systems Review, 49(1), 71-79.
[3] Langmead, B., and Nellore, A. (2018). Cloud computing for genomic data analysis
and collaboration. Nature Reviews Genetics, 19(4), 208.