qc3C – reference-free quality assessment for Hi-C sequencing data.

Matthew DeMaere (0), Aaron Darling (0)
(0) ithree institute, UTS

Find me on Tues Nov 24th, 1:40-3pm AEDT in Remo, table 7

Abstract
Hi-C is a sample preparation method that enables high-throughput sequencing to capture genome-wide proximity interactions between DNA molecules. The technique has been successfully applied to challenging problems such as 3D structural analysis of chromatin, scaffolding of large genome assemblies and, more recently, the accurate resolution of metagenome-assembled genomes (MAGs).

Despite continued refinement, Hi-C library preparation remains a costly and complex laboratory protocol. QC options for Hi-C libraries are limited: current wet-lab protocols offer only a crude assay based on re-digestion of ligation junctions in the library. This approach does not provide a reliable estimate of the fraction of library templates containing a Hi-C junction, nor is it applicable to all Hi-C protocols. QC via sequencing is an alternative, but current tools require a reference genome to estimate quality metrics for the Hi-C library.

We propose a new, reference-free approach for Hi-C library quality assessment that requires only a small amount of sequencing data from a library. Our tool, qc3C, implements an algorithm that estimates the fraction of reads in the library that contain Hi-C junctions, along with other quality metrics. The algorithm builds upon the observation that proximity ligation events are likely to create k-mers that would not naturally occur in the sample. The algorithm uses an empirical cumulative distribution of k-mer depths to compute the probability that a read containing a Hi-C junction sequence was generated by the proximity ligation reaction. This in turn enables the total fraction of reads containing Hi-C junctions to be estimated. We characterise the accuracy of the new algorithm on simulated and real datasets and compare it to reference-based methods.
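The idea in the paragraph above can be illustrated with a minimal Python sketch. This is not the estimator implemented in qc3C: the function names, the use of a single k-mer centred on the junction, and the scoring rule 1 - F(d - 1) are all assumptions made for illustration. It only shows how an empirical cumulative distribution of k-mer depths could turn "this spanning k-mer is unusually rare" into a per-read score, and how those scores could be summed into a library-wide fraction.

```python
from bisect import bisect_right
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def empirical_cdf(depths):
    """Return F(x): the fraction of observed depths that are <= x."""
    ordered = sorted(depths)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def junction_probability(read, junction, k, counts, cdf):
    """Score one read; None if it has no junction or too little flank."""
    pos = read.find(junction)
    if pos < 0:
        return None
    # Centre a k-mer on the junction so that it spans the ligation point.
    start = pos + len(junction) // 2 - k // 2
    if start < 0 or start + k > len(read):
        return None
    depth = counts[read[start:start + k]]
    # Assumption (illustrative only): the score is the fraction of k-mers in
    # the library that are at least as deep as the spanning k-mer, so a rare
    # spanning k-mer scores close to 1 and a common one scores close to 0.
    return 1.0 - cdf(depth - 1)


def estimate_junction_fraction(reads, junction, k):
    """Sum the per-read scores and divide by the total number of reads."""
    counts = count_kmers(reads, k)
    cdf = empirical_cdf(counts.values())
    scores = (junction_probability(r, junction, k, counts, cdf) for r in reads)
    return sum(s for s in scores if s is not None) / len(reads)
```

For a library prepared with a DpnII-type enzyme, the junction argument would be "GATCGATC", and k should be larger than the junction so that the spanning k-mer includes flanking sequence from both ligated fragments. A real implementation would also need to handle sequencing errors, which generate rare k-mers of their own, and reverse-complement k-mers; the sketch ignores both.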

qc3C enables sequencing depth requirements to be estimated more precisely on a per-library and per-experiment basis, for chromosome conformation studies, for Hi-C scaffolding of assemblies, and for metagenomic Hi-C. Our qc3C software is an easy-to-use, open-source tool that integrates with the MultiQC framework. To our knowledge, qc3C is the first reference-free Hi-C quality assessment tool, enabling Hi-C to be more easily applied to non-model organisms and environmental samples.
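As a rough illustration of how such an estimate could feed into depth planning, the arithmetic below converts a target number of junction-containing reads into a total read count. The function name and the assumption that the junction-containing fraction stays constant as sequencing depth increases (no duplication or library-complexity saturation) are ours, not part of qc3C.

```python
import math


def reads_required(target_junction_reads, junction_fraction):
    """Total reads needed to expect a given number of junction-containing reads."""
    if not 0 < junction_fraction <= 1:
        raise ValueError("junction_fraction must be in the interval (0, 1]")
    # Assumption: the junction-containing fraction does not change with depth.
    return math.ceil(target_junction_reads / junction_fraction)
```

Under these assumptions, a library in which a quarter of reads contain a junction needs four times as many total reads as the desired number of junction-containing reads.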

qc3C is available from https://github.com/cerebis/qc3C