eMST, a scalable and interpretable method for phylogenetic analysis of hundreds and thousands of SARS-CoV-2 genomes
Sergey Knyazev (0), Harman Singh (1), Varuni Sarwal (2), Ram Ayyala (2), Daniel Novikov (0), Roya Hosseini (0), Serghei Mangul (3), Alex Zelikovsky (0)
(0) Georgia State University
(1) Indian Institute of Technology Delhi
(2) University of California, Los Angeles
(3) University of Southern California
Find me on Wed Nov 25th, 1:30-2:50pm AEDT in Remo, table 70
Abstract
A novel coronavirus, known as SARS-CoV-2, was identified as the cause of an outbreak of pneumonia in Wuhan, China, in December 2019. Travel-associated cases of coronavirus disease 2019 (COVID-19) were reported outside of China as early as January 13, 2020, and the virus has subsequently spread to nearly all nations. Sequencing and phylogenetic analysis of viral genomes is essential for tracking the transmission of SARS-CoV-2.
Previous phylogenetic analyses of SARS-CoV-2 transmission in the United States do not scale to large datasets, provide limited information about network connectivity, and lack user-friendly visualization. We present a new network analysis method called eMST (epsilon Minimum Spanning Tree). The method builds a graph with genetic samples as nodes, connected by edges whose weights equal the Hamming distance between the corresponding sequences. Given a value of epsilon (e), the eMST is constructed as the union of all possible minimum spanning trees (MSTs) in which one edge of weight w may be replaced by another edge of weight less than w(1+e). The output of eMST is an edge list, which is visualized using Gephi.
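The construction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation; it rests on several assumptions: sequences are already aligned to equal length, the replaced edge of weight w lies on the MST path between the endpoints of the candidate edge (so that the swap still yields a spanning tree), and ties (edges belonging to some MST) are always kept. The function names `hamming`, `emst_edges`, and `to_edge_list_csv` are hypothetical, and the quadratic Prim's algorithm used here is chosen for brevity, not for scalability. The `Source,Target,Weight` header is an assumed layout for a CSV edge list to import into Gephi.

```python
import csv
import io
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b))


def emst_edges(seqs, eps):
    """Return eMST edges as (i, j, weight) tuples (illustrative sketch).

    An edge (i, j) of weight d is kept if it belongs to some MST, or if
    d < w * (1 + eps), where w is the heaviest edge on the MST path
    between i and j (the edge it could replace).
    """
    n = len(seqs)
    if n < 2:
        return []
    dist = [[hamming(seqs[i], seqs[j]) for j in range(n)] for i in range(n)]

    # Prim's algorithm, tracking the heaviest edge on each tree path.
    in_tree = [0]
    best = {v: (dist[0][v], 0) for v in range(1, n)}  # v -> (weight, parent)
    bottleneck = [[0] * n for _ in range(n)]
    while best:
        v = min(best, key=lambda k: best[k][0])
        w, parent = best.pop(v)
        for u in in_tree:
            heaviest = max(bottleneck[u][parent], w)
            bottleneck[u][v] = bottleneck[v][u] = heaviest
        in_tree.append(v)
        for x in best:
            if dist[v][x] < best[x][0]:
                best[x] = (dist[v][x], v)

    edges = []
    for i, j in combinations(range(n), 2):
        d, w = dist[i][j], bottleneck[i][j]
        # d <= w: edge is in some MST; d < w * (1 + eps): epsilon relaxation.
        if d <= w or d < w * (1 + eps):
            edges.append((i, j, d))
    return edges


def to_edge_list_csv(edges, names):
    """Serialize edges as CSV text with Source, Target, Weight columns."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for i, j, d in edges:
        writer.writerow([names[i], names[j], d])
    return buf.getvalue()


if __name__ == "__main__":
    toy = ["AAAA", "AAAT", "AATT", "TTTT"]  # toy data, not real genomes
    print(to_edge_list_csv(emst_edges(toy, eps=0.6), ["s0", "s1", "s2", "s3"]))
```

With e = 0 the sketch returns only edges that appear in at least one MST; increasing e admits progressively heavier alternative edges, which is what makes the resulting network denser than a single tree.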
We validate the results of our eMST analysis against those obtained from Nextstrain on the data from a previous study (Fauver, Joseph R., et al., Cell (2020)). We then extend our analysis to a larger number of strains, with emphasis on three states: California, New York, and Washington. In each case, the Nextstrain and eMST results agree, while eMST provides a better visualization with negligible running time. Finally, we plan to scale up our analysis to genomic datasets of more than 80k genomes to create a global network and demonstrate the scalability of our approach.
Phylogenetic clustering of SARS-CoV-2 genomes is an important first step in studying the coast-to-coast spread of SARS-CoV-2 during the early epidemic in the United States. eMST will be of broad interest to scientists engaged in such research: the method improves visualization, provides detailed information about network connectivity, and scales to large datasets, allowing scientists to draw inferences about the spread of SARS-CoV-2 in the United States.