The Blooming Era of Genome Informatics: State-of-the-Art and Future Challenges

Ka-Chun Wong; Ommega Internationals

Editorial

Publish Date: 2015-10-26

Journal of Bioinformatics, Proteomics and Imaging Analysis

Article information

Affiliation

Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong

Corresponding Author

Ka-Chun Wong, Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong. Tel: (852)34428618; E-mail: kc.w@cityu.edu.hk

Citation

Wong, K.C. The Blooming Era of Genome Informatics: State-of-the-Art and Future Challenges (2016) Bioinfo Proteom Img Anal 1(2): 49-50.

Introduction

Since the beginning of the 21st century, we have witnessed tremendous improvement and growth in parallel sequencing technologies; for instance, next-generation sequencing (NGS) technologies are well known for their cost-effectiveness and high-throughput capability^[1]. Different variants of NGS technologies have been developed: DNA Sequencing (DNA-Seq) sequences and maps genomes; Chromatin ImmunoPrecipitation with Sequencing (ChIP-Seq) determines protein-DNA binding locations on the genome; RNA Sequencing (RNA-Seq) quantifies the RNAs transcribed within cells in a high-throughput manner; RNA ImmunoPrecipitation with Sequencing (RIP-Seq) is similar to ChIP-Seq but identifies protein-RNA binding locations; High-throughput Chromosome Conformation Capture (Hi-C) captures the three-dimensional structure of the genome; Deoxyribonuclease (DNase) hypersensitive sites sequencing (DNase-Seq) estimates genome-wide open chromatin regions based on DNA accessibility; and bisulfite sequencing probes DNA methylation on the genome. Although their objectives differ, their novelty is the same: taking advantage of NGS to scale existing wet-lab genomic techniques to a genome-wide level, which results in big data challenges.

NGS technology is designed for high-throughput sequencing. Thanks to its parallel nature, a single run can now produce millions of sequencing reads within a few days or even hours. From the computational perspective, the data are massive and should be measured in gigabytes (GB) or even terabytes (TB). Some of the existing general statistical methods can no longer handle data at such scales^[2]. Instead, we have to develop bioinformatics methods that remain accurate while scaling with the exponentially growing data; for instance, Wong et al. have developed a scalable bioinformatics method called SignalSpider (https://www.cs.toronto.edu/~wkc/SignalSpider/), which can analyse multiple ChIP-Seq signal profiles simultaneously in linear time complexity^[3]. SignalSpider has been demonstrated to segment the reference human genome (hg19) into different regulatory regions accurately. In particular, it is very effective in capturing the combinatorial relationships among multiple DNA-binding proteins probed by ChIP-Seq. Following SignalSpider, Signal Ranker and Full Signal Ranker have been developed to handle regression and classification tasks on multiple ChIP-Seq profiles^[4]. Signal Ranker and Full Signal Ranker have been shown to be more accurate than traditional machine learning methods such as Gaussian Mixture Regression. It is worth noting that, although those methods were developed in the context of ChIP-Seq, they can easily be adapted to other NGS technologies such as RNA-Seq.
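To make the idea of linear-time processing of multiple signal profiles concrete, the following is a minimal sketch only. It is not SignalSpider's algorithm, which is not described here. It assumes each ChIP-Seq profile is available as a stream of (chromosome, read start) pairs, and the names `bin_read_counts`, `combine_profiles`, and the 200 bp `BIN_SIZE` are hypothetical choices made for illustration. Each read is visited exactly once, so the running time grows linearly with the number of reads.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical bin width in base pairs; an assumption for illustration only.
BIN_SIZE = 200

Bin = Tuple[str, int]  # (chromosome, bin index)


def bin_read_counts(reads: Iterable[Tuple[str, int]],
                    bin_size: int = BIN_SIZE) -> Dict[Bin, int]:
    """Count reads per (chromosome, bin index) in a single pass over the reads."""
    counts: Dict[Bin, int] = defaultdict(int)
    for chrom, start in reads:
        counts[(chrom, start // bin_size)] += 1
    return dict(counts)


def combine_profiles(profiles: Dict[str, Dict[Bin, int]]) -> Dict[Bin, List[int]]:
    """Align several binned profiles into one count vector per occupied bin.

    The vector entries follow the alphabetical order of the profile names.
    """
    names = sorted(profiles)
    occupied = set()
    for profile in profiles.values():
        occupied.update(profile)
    return {b: [profiles[n].get(b, 0) for n in names] for b in occupied}


if __name__ == "__main__":
    # Toy reads for two hypothetical DNA-binding proteins.
    factor_a = [("chr1", 10), ("chr1", 150), ("chr1", 450)]
    factor_b = [("chr1", 420), ("chr2", 5)]
    combined = combine_profiles({
        "factorA": bin_read_counts(factor_a),
        "factorB": bin_read_counts(factor_b),
    })
    for genomic_bin, vector in sorted(combined.items()):
        print(genomic_bin, vector)
```

The per-bin count vectors are the kind of input on which a downstream method could look for combinatorial patterns across proteins; that modelling step is deliberately left out of this sketch.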

On the other hand, it cannot be ignored that the maturing sequencing technologies have expanded not only the amount of data we can analyse, but also the number of high-impact applications we can build. The growing data can provide new insights into human disease studies; for instance, taking advantage of the available proteomes, Wong and Zhang have developed a deleterious residue change prediction method called SNPdryad^[5]. It is significant in the sense that SNPdryad has demonstrated a competitive performance edge over well-established methods such as PolyPhen2 and SIFT on PolyPhen2’s own datasets (humdiv and humvar). In addition, SNPdryad has been run on the complete human proteome, generating prediction scores for all possible residue changes on all known human proteins. Another interesting direction for NGS technologies is the development of personalized medicine solutions for long-term healthcare. Before the 21st century, limited by the available data, medical studies were carried out without paying special attention to individual genetic information; for instance, drugs were designed for groups of people without any personal customization. This situation is changing because we can now sequence a human genome for less than USD 1,000^[6]. Such genome information will enable us to treat each patient individually, so that personalized medicine solutions can be developed.

In the future, NGS technologies will be applied to further areas, as illustrated by the ongoing GTEx project, leading to high-resolution tissue-specific genotype-phenotype studies of the human body. In addition, third-generation technologies such as Single-Molecule Sequencing in Real Time (SMRT) are being developed^[7]. SMRT is very cost-effective and fast, and it requires less sample. It has many competitive features (e.g. long sequencing reads) that can improve on existing sequencing quality, and thus our understanding of the genome. Nonetheless, there is no free lunch. The evolving sequencing technologies will impose a serious big data challenge because it has been estimated that we will have millions of human genomes by 2025^[8]. We will need not only capable data analysts but also sufficient computing resources and efficient methods to handle the exponentially growing genome data in the coming years.
