I'm a Research Fellow in Biomedical Informatics at Harvard University, working in Dr. Isaac Kohane's lab. My research focuses on combining genome-scale sequencing data with clinical health record data to derive insights into the molecular mechanisms underlying rare human diseases.
Previously, as a Computer Science Ph.D. student in Prof. Mona Singh's lab at Princeton University, I focused on detecting and interpreting perturbations to protein interactions and cellular networks within and across organisms. As an undergraduate at Tufts University, I worked with Prof. Lenore Cowen on improving methods for protein structural alignments.
I enjoy traveling, tulips, and pumpkin muffins.
Commonalities across computational workflows for uncovering explanatory variants in undiagnosed cases
Genetics in Medicine
Purpose: Genomic sequencing has become an increasingly powerful and relevant tool to be leveraged for the discovery of genetic aberrations underlying rare, Mendelian conditions. Although the computational tools incorporated into diagnostic workflows for this task are continually evolving and improving, we nevertheless sought to investigate commonalities across sequencing processing workflows to reveal consensus and standard practice tools and highlight exploratory analyses where technical and theoretical method improvements would be most impactful.
Methods: We collected details regarding the computational approaches used by a genetic testing laboratory and 11 clinical research sites in the United States participating in the Undiagnosed Diseases Network via meetings with bioinformaticians, online survey forms, and analyses of internal protocols.
Results: We found that tools for processing genomic sequencing data can be grouped into four distinct categories. Whereas well-established practices exist for initial variant calling and quality control steps, there is substantial divergence across sites in later stages for variant prioritization and multimodal data integration, demonstrating a diversity of approaches for solving the most mysterious undiagnosed cases.
Conclusion: The largest differences across diagnostic workflows suggest that advances in structural variant detection, noncoding variant interpretation, and integration of additional biomedical data may be especially promising for solving chronically undiagnosed cases.
PertInInt: An integrative, analytical approach to rapidly uncover cancer driver genes with perturbed interactions and functionalities
A major challenge in cancer genomics is to identify genes with functional roles in cancer and uncover their mechanisms of action. We introduce an integrative framework that identifies cancer-relevant genes by pinpointing those whose interaction or other functional sites are enriched in somatic mutations across tumors. We derive analytical calculations that enable us to avoid time-prohibitive permutation-based significance tests, making it computationally feasible to simultaneously consider multiple measures of protein site functionality. Our accompanying software, PertInInt, combines knowledge about sites participating in interactions with DNA, RNA, peptides, ions, or small molecules with domain, evolutionary conservation, and gene-level mutation data. When applied to 10,037 tumor samples, PertInInt uncovers both known and newly predicted cancer genes, while additionally revealing what types of interactions or other functionalities are disrupted. PertInInt's analysis demonstrates that somatic mutations are frequently enriched in interaction sites and domains and implicates interaction perturbation as a pervasive cancer-driving event. Software available at http://github.com/Singh-Lab/PertInInt.
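As a rough illustration of why an analytical calculation can replace a permutation test, the sketch below scores how strongly observed mutations fall on highly weighted positions of a protein. It assumes a deliberately simple null model in which each mutation lands uniformly at random on any position; the function name, the inputs, and the null model are assumptions made for illustration and are not taken from the PertInInt software.

```python
import math


def enrichment_z_score(weights, mutation_positions):
    """Z-score for the summed weight of mutated positions.

    weights: per-position functionality weights (hypothetical input).
    mutation_positions: 0-based positions of observed mutations.
    Null model (an assumption): each mutation falls uniformly at random,
    independently, on one of the len(weights) positions.
    """
    length = len(weights)
    n_mutations = len(mutation_positions)
    if length == 0 or n_mutations == 0:
        raise ValueError("need at least one position and one mutation")

    # Mean and variance of the weight hit by a single random mutation.
    mean_w = sum(weights) / length
    mean_w_sq = sum(w * w for w in weights) / length
    var_per_mutation = mean_w_sq - mean_w ** 2
    if var_per_mutation <= 0:
        return 0.0  # all weights identical: no position is more informative

    # Independent mutations: expectation and variance both scale with n.
    observed = sum(weights[p] for p in mutation_positions)
    expected = n_mutations * mean_w
    return (observed - expected) / math.sqrt(n_mutations * var_per_mutation)
```

Because the expectation and variance are available in closed form, the score costs a single pass over the data rather than thousands of shuffles, which is what makes it practical to evaluate several weight tracks per protein.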
Systematic domain-based aggregation of protein structures highlights DNA-, RNA- and other ligand-binding positions
Nucleic Acids Research
Domains are fundamental subunits of proteins, and while they play major roles in facilitating protein–DNA, protein–RNA and other protein–ligand interactions, a systematic assessment of their various interaction modes is still lacking. A comprehensive resource identifying positions within domains that tend to interact with nucleic acids, small molecules and other ligands would expand our knowledge of domain functionality as well as aid in detecting ligand-binding sites within structurally uncharacterized proteins. Here, we introduce an approach to identify per-domain-position interaction 'frequencies' by aggregating protein co-complex structures by domain and ascertaining how often residues mapping to each domain position interact with ligands. We perform this domain-based analysis on ~91000 co-complex structures, and infer positions involved in binding DNA, RNA, peptides, ions or small molecules across 4128 domains, which we refer to collectively as the InteracDome. Cross-validation testing reveals that ligand-binding positions for 2152 domains are highly consistent and can be used to identify residues facilitating interactions in ~63-69% of human genes. Our resource of domain-inferred ligand-binding sites should be a great aid in understanding disease etiology: whereas these sites are enriched in Mendelian-associated and cancer somatic mutations, they are depleted in polymorphisms observed across healthy populations. The InteracDome is available at http://interacdome.princeton.edu.
doi: 10.1093/nar/gky1224
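The per-domain-position interaction 'frequencies' described in the abstract above can be sketched as a simple aggregation. This is a minimal, unweighted version: each structural instance of a domain is assumed to be given as a mapping from domain position to whether the residue at that position contacts a ligand. The function name and input layout are hypothetical, and the published resource may weight or filter structures differently.

```python
from collections import defaultdict


def binding_frequencies(instances):
    """Fraction of instances in which each domain position contacts a ligand.

    instances: iterable of dicts mapping domain position -> bool
    (True if the residue aligned to that position contacts the ligand).
    Positions missing from an instance are treated as unobserved there.
    """
    covered = defaultdict(int)  # instances in which the position is observed
    bound = defaultdict(int)    # instances in which it contacts the ligand
    for instance in instances:
        for position, contacts_ligand in instance.items():
            covered[position] += 1
            if contacts_ligand:
                bound[position] += 1
    return {pos: bound[pos] / covered[pos] for pos in sorted(covered)}


example = [
    {1: True, 2: False},
    {1: True, 2: True, 3: False},
    {1: False, 2: False},
]
frequencies = binding_frequencies(example)
```

In this toy input, position 1 contacts the ligand in two of three instances, position 2 in one of three, and position 3 in none of the single instance that covers it.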
Pervasive variation of transcription factor orthologs contributes to regulatory network divergence
Differences in transcriptional regulatory networks underlie much of the phenotypic variation observed across organisms. Changes to cis-regulatory elements are widely believed to be the predominant means by which regulatory networks evolve, yet examples of regulatory network divergence due to transcription factor (TF) variation have also been observed. To systematically ascertain the extent to which TFs contribute to regulatory divergence, we analyzed the evolution of the largest class of metazoan TFs, Cys2-His2 zinc finger (C2H2-ZF) TFs, across 12 Drosophila species spanning ~45 million years of evolution. Remarkably, we uncovered that a significant fraction of all C2H2-ZF 1-to-1 orthologs in flies exhibit variations that can affect their DNA-binding specificities. In addition to loss and recruitment of C2H2-ZF domains, we found diverging DNA-contacting residues in ~44% of domains shared between D. melanogaster and the other fly species. These diverging DNA-contacting residues, present in ~70% of the D. melanogaster C2H2-ZF genes in our analysis and corresponding to ~26% of all annotated D. melanogaster TFs, show evidence of functional constraint: they tend to be conserved across phylogenetic clades and evolve more slowly than other diverging residues. These same variations were rarely found as polymorphisms within a population of D. melanogaster flies, indicating their rapid fixation. The predicted specificities of these dynamic domains gradually change across phylogenetic distances, suggesting stepwise evolutionary trajectories for TF divergence. Further, whereas proteins with conserved C2H2-ZF domains are enriched in developmental functions, those with varying domains exhibit no functional enrichments. Our work suggests that a set of highly dynamic and largely unstudied TFs are a likely source of regulatory variation in Drosophila and other metazoans.
Formatt: Correcting protein multiple structural alignments by incorporating sequence alignment
Background: The quality of multiple protein structure alignments is usually computed and assessed based on geometric functions of the coordinates of the backbone atoms from the protein chains. These purely geometric methods do not directly utilize protein sequence similarity, and in fact, determining the proper way to incorporate sequence similarity measures into the construction and assessment of protein multiple structure alignments has proved surprisingly difficult.
Results: We present Formatt, a multiple structure alignment program based on the Matt purely geometric multiple structure alignment program, that also takes into account sequence similarity when constructing alignments. We show that Formatt outperforms Matt and other popular structure alignment programs on the popular HOMSTRAD benchmark. For the SABMark twilight zone benchmark set that captures more remote homology, Formatt and Matt outperform other programs; depending on the choice of embedded sequence aligner, Formatt produces either better sequence and structural alignments with a smaller core size than Matt, or similarly sized alignments with better sequence similarity, for a small cost in average RMSD.
Conclusions: Considering sequence information as well as purely geometric information seems to improve quality of multiple structure alignments, though defining what constitutes the best alignment when sequence and structural measures would suggest different alignments remains a difficult open question.
Formatt: Correcting protein multiple structural alignments by sequence peeking
Proceedings of the 2011 ACM Conference on Bioinformatics, Computational Biology and Biomedicine (BCB)
We present Formatt, a multiple structure alignment program based on the Matt purely geometric multiple structural alignment program, that also takes into account sequence similarity when constructing alignments. We show that Formatt is superior to Matt in alignment quality based on objective measures (most notably Staccato sequence and structure scores) while preserving the same advantages in core length and RMSD that Matt has as a flexible structure aligner, as compared to other multiple structure alignment programs on popular benchmark datasets. Applications include producing better training data for threading methods.
Evolving soft robotic locomotion in PhysX
Proceedings of the 2009 ACM Conference on Genetic and Evolutionary Computation (GECCO)
Given the complexity of the problem, genetic algorithms are one of the more promising methods of discovering control schemes for soft robotics. Since physically embodied evolution is time consuming and expensive, an outstanding challenge lies in developing fast and suitably realistic simulations in which to evolve soft robot gaits. We describe two parallel methods of using NVidia’s PhysX, a hardware-accelerated (GPGPU) physics engine, in order to evolve and optimize soft-bodied gaits. The first method involves the evolution of open-loop gaits using a reduced-order lumped parameter model. The second method involves harnessing PhysX’s soft-bodied material simulation capabilities. In each case we discuss the challenges and possibilities involved in using PhysX for evolutionary soft robotics.
For more information, please see my CV.
Last updated: 13-Aug-2019