JAFFAL: Detecting fusion genes with long read transcriptome sequencing

Nadia Davidson0, Ying Chen1, Jonathan Göke1, Alicia Oshlack0
(0) Peter MacCallum Cancer Centre
(1) Genome Institute of Singapore (A*STAR)

Find me on Tues Nov 24th, 1:40-3pm AEDT in Remo, table 33

Abstract
Genomic rearrangements are common in the cancer landscape and have the potential to create novel oncogenes by fusing parts of two genes together. Massively parallel short read transcriptome sequencing has greatly expanded our knowledge of fusion genes, with approximately 16% of cancers across a range of tumour types shown to carry fusion gene events. These events are known to drive cancer, and novel drugs have been developed to specifically target a number of these driver fusions.

Short read sequencing requires RNA molecules to be reverse transcribed and fragmented. Therefore, long range information about the structure of the fusion transcript away from the breakpoint is lost. Long read sequencing technologies, such as those offered by Oxford Nanopore Technologies (ONT), allow the full length of individual mRNA molecules to be sequenced. This provides unprecedented opportunities to study splicing and RNA modifications, and to run rapid and remote diagnostics. However, the data generated have a very high error rate, and fusion finding algorithms designed for short reads do not work on them.

Here we present JAFFAL, a method to accurately identify fusion genes from long read transcriptomes. Our method is based on the JAFFA pipeline, which can call fusions from reads of any length provided they have a low error rate. To enable fusion finding in ONT transcriptomes, we used the error tolerant aligner minimap2, with realignment of breakpoints to overcome insertion and deletion type errors. We also substantially improved computational efficiency, processing approximately 10 million reads per hour using 8 threads. We validated JAFFAL using simulations and ONT data from cancer cell lines sequenced by our collaborators in the Singapore Nanopore-Expression Project (SG-NEx). We show that fusions can be detected in long read data with accuracy similar to that of short read data and that, in addition, their splicing structure can be ascertained. Finally, by comparing ONT transcriptome sequencing protocols, we show that numerous chimeric molecules are generated during cDNA library preparation that are absent when RNA is sequenced directly.
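
As a rough illustration of the idea of calling fusions from error tolerant alignments, the sketch below flags reads whose alignments hit two different genes at adjacent positions along the read. It is not JAFFAL's implementation. It assumes alignments are supplied in minimap2's PAF format (tab separated, with query name, query length, query start, query end, strand and target name as the first six columns) and that reads were aligned to transcript sequences. The transcript-to-gene map `gene_of`, the function names and the `max_overlap` and `max_gap` thresholds are hypothetical choices made for this example only.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    read: str      # query (read) name, PAF column 1
    q_start: int   # alignment start on the read, PAF column 3
    q_end: int     # alignment end on the read, PAF column 4
    target: str    # target (transcript) name, PAF column 6


def parse_paf(lines):
    """Extract read name, read coordinates and target name from PAF lines."""
    hits = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:  # PAF has 12 mandatory columns; skip anything shorter
            continue
        hits.append(Hit(fields[0], int(fields[2]), int(fields[3]), fields[5]))
    return hits


def candidate_fusions(hits, gene_of, max_overlap=15, max_gap=15):
    """Return {read: (gene_5prime_on_read, gene_3prime_on_read, junction_position)}.

    A read is a candidate when two alignments that are consecutive along the
    read map to different genes and meet within a small gap or overlap.
    The thresholds are illustrative assumptions, not JAFFAL defaults.
    """
    by_read = defaultdict(list)
    for hit in hits:
        by_read[hit.read].append(hit)

    candidates = {}
    for read, read_hits in by_read.items():
        read_hits.sort(key=lambda h: h.q_start)
        for left, right in zip(read_hits, read_hits[1:]):
            gene_left = gene_of.get(left.target)
            gene_right = gene_of.get(right.target)
            if gene_left is None or gene_right is None or gene_left == gene_right:
                continue
            # Negative values mean the two alignments overlap on the read,
            # positive values mean unaligned bases lie between them.
            junction = right.q_start - left.q_end
            if -max_overlap <= junction <= max_gap:
                candidates[read] = (gene_left, gene_right, left.q_end)
                break
    return candidates
```

Calling `candidate_fusions(parse_paf(open_paf_file), gene_of)` on real data would only give a first-pass list of reads. The breakpoint realignment described above, and any filtering of chimeric molecules introduced during library preparation, are outside the scope of this sketch.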