Using single-cell cytometry to illustrate the generalisable unbiased evaluation of clustering algorithms using Pareto fronts

Givanna Putri, Irena Koprinska, Thomas Ashhurst, Nicholas King, Mark Read
University of Sydney

Find me on Wed Nov 25th, 1:30-2:50pm AEDT in Remo, table 128

Abstract
Clustering is widely used in biological fields such as microbial ecology, genomics, and cytometry to partition cells on the basis of similarity. Many automated gating algorithms now exist to cluster cytometry and single-cell sequencing data into discrete populations. Comparative algorithm evaluations on benchmark datasets rely either on a single performance metric or on a few metrics considered independently of one another. However, single metrics are biased: each emphasises a different aspect of clustering performance, so they rank clustering solutions differently. This undermines the translatability of results to other, non-benchmark datasets, and underlies the lack of consensus regarding optimal clustering algorithms in the field. We propose the Pareto fronts framework as an integrative evaluation protocol, wherein individually biased metrics are instead leveraged as complementary perspectives. Algorithms are judged superior when they provide the best trade-off between the multiple metrics considered simultaneously. This yields a more comprehensive view of clustering performance. Moreover, by broadly and systematically sampling algorithm parameter values using the Latin hypercube sampling method, our protocol discounts (un)fortunate parameter value selections as confounding factors. It also reveals how meticulously each algorithm must be tuned to obtain good results, which is vital knowledge for users with novel data. We exemplify the protocol by conducting a comparative study between two clustering algorithms using four common performance metrics applied across four cytometry benchmark datasets. To our knowledge, this is the first time Pareto fronts have been used to evaluate the performance of clustering algorithms in any application domain.
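
The two building blocks named in the abstract, Latin hypercube sampling of parameter values and Pareto-front selection over several metrics, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (latin_hypercube, dominates, pareto_front) are hypothetical, every metric is assumed to be "higher is better", and the score vectors in the example are made up purely for illustration.

```python
import random


def latin_hypercube(n_samples, bounds, seed=0):
    """Draw n_samples parameter settings (illustrative helper).

    Each parameter range in `bounds` is cut into n_samples equal strata,
    and every stratum is sampled exactly once per parameter.
    """
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        # One random point inside each stratum, then shuffle the strata order
        # so that strata are paired randomly across parameters.
        strata = [(i + rng.random()) / n_samples for i in range(n_samples)]
        rng.shuffle(strata)
        columns.append([low + u * (high - low) for u in strata])
    return [tuple(col[i] for col in columns) for i in range(n_samples)]


def dominates(a, b):
    """True if score vector a is at least as good as b on every metric
    and strictly better on at least one (assumes higher is better)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(scores):
    """Return the indices of score vectors that no other vector dominates."""
    return [
        i for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


# Made-up example: four clustering solutions scored on two metrics.
example_scores = [(0.9, 0.5), (0.6, 0.8), (0.5, 0.4), (0.9, 0.4)]
front = pareto_front(example_scores)  # solutions 0 and 1 are non-dominated

# Five hypothetical settings of two parameters with ranges [0, 1] and [10, 20].
settings = latin_hypercube(5, [(0.0, 1.0), (10.0, 20.0)])
```

In this sketch, solutions 2 and 3 are dropped because solution 0 matches or beats them on both metrics, while solutions 0 and 1 each win on a different metric and therefore both stay on the front. In a full evaluation, each sampled parameter setting would be run through a clustering algorithm and scored on every metric before the front is computed.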