In a global, collaborative setting in which multiple laboratories contribute to SARS-CoV-2 surveillance and sequencing, it is critical to integrate and standardise the results produced by different teams. This document describes the challenge of integrating raw genomic data from different sources and proposes recommendations for minimising biases and problems when analysing datasets that combine raw data from heterogeneous sources.
We provide a bioinformatics pipeline that analyses a heterogeneous dataset of publicly shared SARS-CoV-2 raw reads and generates a consensus sequence for each sample. We then perform a standard phylogenetic reconstruction on the alignment of these consensus genomes. The chosen pipeline performed well despite the highly heterogeneous nature of the selected raw sequence data. Detailed annotation of all steps, together with all scripts needed to reproduce the analysis, is available on GitHub.
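To make the notion of a per-sample consensus sequence concrete, the following is a minimal illustrative sketch of majority-rule consensus calling. It is not the pipeline's actual implementation: the function name `consensus_from_pileup`, the input layout (one string of observed bases per reference position) and the `min_depth` and `min_fraction` thresholds are all assumptions made for this example.

```python
from collections import Counter


def consensus_from_pileup(pileup, min_depth=10, min_fraction=0.5):
    """Call one consensus base per position (hypothetical helper).

    pileup: list of strings, one per reference position, each holding
            the bases observed in the reads covering that position.
    min_depth, min_fraction: illustrative thresholds, not values taken
            from the pipeline described in this document.
    """
    consensus = []
    for bases in pileup:
        # Count only unambiguous nucleotide calls.
        counts = Counter(b for b in bases.upper() if b in "ACGT")
        depth = sum(counts.values())
        if depth < min_depth:
            # Too few reads to support a call: mask the position.
            consensus.append("N")
            continue
        base, n = counts.most_common(1)[0]
        # Keep the majority base only if it exceeds the required fraction.
        consensus.append(base if n / depth > min_fraction else "N")
    return "".join(consensus)


if __name__ == "__main__":
    toy_pileup = ["AAAA", "AC", "GGGT", "CCTT"]
    print(consensus_from_pileup(toy_pileup, min_depth=3))
```

Masking low-depth or ambiguous positions with `N` rather than guessing a base is one way to keep samples sequenced at very different depths comparable in a downstream alignment.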
Funded by the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or European Health and Digital Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.