An Integrated Approach for de novo Gene Prediction, Assembly and Biosynthetic Gene Cluster Discovery of Metagenomic Sequencing Data


Student Name: Sirisha Thippabhotla
Defense Date:
Location: Eaton Hall, Room 1
Chair: Cuncong Zhong

Prasad Kulkarni

Fengjun Li

Zijun Yao

Liang Xu

Abstract:

Metagenomics is the study of the genomic content of microbial communities. Metagenomic functional analysis aims to quantify protein families and reconstruct metabolic pathways from the metagenome. It plays a central role in understanding the interaction between a microbial community and its host or environment. De novo functional analysis, which allows the discovery of novel protein families, remains challenging for high-complexity communities. There are currently three main approaches for recovering novel genes or proteins: de novo nucleotide assembly, gene calling, and peptide assembly. Unfortunately, the informational dependencies among these approaches have been overlooked, and they have been formulated as independent problems.

In this work, we propose a novel de novo analysis pipeline that leverages these informational dependencies to improve functional analysis of metagenomic data. Specifically, the pipeline will contain four novel modules: an assembly graph module, a graph-based gene calling module, a peptide assembly module, and a biosynthetic gene cluster (BGC) discovery module. The assembly graph module will be computationally and memory efficient, and will be based on a combination of de Bruijn and string graphs. Assembly graphs contain important sequencing information, which can be further exploited to improve functional annotation. De novo gene calling enables us to predict novel genes and protein sequences that have not been previously characterized. We hypothesize that de novo gene calling can benefit from assembly graph structures, as they contain important start/stop codon information that provides stronger ORF signals. The assembly graph framework will be designed for both nucleotide and protein sequences, so the protein sequences resulting from gene calling can be further assembled into longer protein contigs using the same framework. For the novel BGC module, the gene members of a BGC will be marked in the assembly graph; finding a BGC can then be achieved by identifying a path connecting its gene members in the graph. Experimental results have shown that our proposed pipeline improves gene calling sensitivity on unassembled reads by 10-15% over state-of-the-art methods, at a high specificity (>90%). The pipeline further allows for more sensitive and accurate peptide assembly, recovering more reference proteins and delivering more hypothetical protein sequences.
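The graph-based ideas in the abstract can be illustrated with a toy sketch. It is not the pipeline's implementation: it builds a plain de Bruijn graph only (the proposed module combines de Bruijn and string graphs), and it uses a breadth-first search as a stand-in for "identifying a path connecting gene members". The function names `build_de_bruijn`, `find_path`, and `spell`, as well as the example reads, are hypothetical and chosen for illustration.

```python
from collections import defaultdict, deque


def build_de_bruijn(reads, k):
    """Toy de Bruijn graph: nodes are (k-1)-mers, and each k-mer seen in a
    read adds an edge from its prefix (k-1)-mer to its suffix (k-1)-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def find_path(graph, source, target):
    """Breadth-first search from source to target.
    Returns a shortest list of nodes, or None if target is unreachable."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        # sorted() only makes the traversal order deterministic
        for successor in sorted(graph.get(node, ())):
            if successor not in parent:
                parent[successor] = node
                queue.append(successor)
    return None


def spell(path):
    """Reconstruct the sequence spelled by a path of overlapping (k-1)-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Hypothetical overlapping reads; in the BGC setting, the source and target
# nodes would be nodes marked as belonging to gene members of a cluster.
reads = ["ATGGCGT", "GCGTGCA", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
path = find_path(graph, "ATG", "AAT")
print(spell(path) if path else "no connecting path")
```

In this sketch, no single read contains both marked nodes, yet the graph still connects them. That is the property the abstract relies on when it proposes recovering a BGC as a path through the assembly graph rather than from individual reads or contigs.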

Degree: PhD Comprehensive Defense (CS)
Degree Type: PhD Comprehensive Defense
Degree Field: Computer Science