Evolutionary analysis of large molecular sequence data is widely employed throughout modern biology, medicine, and pharmacology. Thanks to next-generation sequencing technologies, millions of whole-genome sequences are constantly produced at low cost. The lack of scalable computational methods and technologies, however, means that many big omics data sets are analyzed with simple and inappropriate models, or with approximations that may produce inaccurate inferences. The pressing need for effective computational methods that can accurately analyze big omics data under appropriately complex evolutionary models calls for new algorithmic ideas as well as solutions to long-standing computational challenges. The ever-accelerating pace at which genomic, transcriptomic, and proteomic data are produced makes it impractical to rerun all analyses every time new data arrive or existing data are refined. The need to rerun the same analysis over and over again with slightly different data is a fundamental reason why inaccurate and often unreliable heuristics are favored over statistically sound methods. This motivates research on so-called “online” computational biology algorithms capable of integrating new data as they appear.
In this talk I will present mathematical, statistical, and computational machinery we recently developed for evolutionary analysis of omics data, with a specific focus on enabling such online algorithms. This machinery includes methods from modern branches of computational geometry, data science, and computer science that enhance the statistical and computational performance of evolutionary approaches used in infectious disease and cancer research, ecology, and pharmacology.