Population assignment from cancer genome profiling data

Huang Q and Baudis M. (2018)

bioRxiv, 2018-07-14. doi:10.1101/368647

Abstract: For a variety of human malignancies, incidence, treatment efficacy and overall prognosis vary considerably between populations and ethnic groups. Disentangling the effects related to particular population backgrounds can help both in understanding cancer biology and in tailoring therapeutic interventions. Because self-reported or inferred patient data can be incomplete or misleading due to migration and genomic admixture, a data-driven ancestry estimation should be preferred. While tools to map and utilize ancestry information from healthy individuals have been introduced, a population assignment based on genotyping data from somatic variation profiling of cancer samples is still missing. We analyzed sequencing-based variation data from the 1000 Genomes Project, comprising 2504 individuals from 5 continental groups. This reference was then used to extract population-biased SNPs present on genotyping array platforms of varying resolutions. We found that, despite widespread and extensive somatic mutations in cancer profiling data, more than 90% of cancer samples can be correctly mapped to one of the population groups when compared to their paired unmutated normals. Pre-filtering samples for admixed individuals increased the accuracy to 96%. This work provides a data-driven approach to estimate the population background from cancer genome profiling data. This proof-of-concept study will facilitate efforts to understand the interplay between population- and ethnicity-related genetic background and the statistical and molecular differences in cancer entities with respect to possible hereditary contributions. A Docker version of the tool is provided as "baudisgroup/tum2pop" on DockerHub, and the tool is deposited as "baudisgroup/tum2pop-mapping" on GitHub.
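
The abstract does not spell out how SNPs are selected or how samples are assigned, so the following is only a minimal sketch of the general idea, not the tum2pop implementation. It assumes a simple allele-frequency spread filter for "population-biased" SNPs and a maximum-likelihood assignment under Hardy-Weinberg proportions. All function names, the threshold, the SNP identifiers and the toy frequencies are hypothetical; only the group labels (AFR, EUR, EAS) follow the 1000 Genomes continental group codes.

```python
import math

def select_biased_snps(freqs, min_spread=0.5):
    """Keep SNPs whose alternate-allele frequency differs between
    reference populations by at least min_spread (assumed criterion)."""
    return [snp for snp, by_pop in freqs.items()
            if max(by_pop.values()) - min(by_pop.values()) >= min_spread]

def genotype_log_likelihood(genotype, p, eps=0.01):
    """Log-probability of carrying 0, 1 or 2 alternate alleles, assuming
    Hardy-Weinberg proportions at allele frequency p."""
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    probs = ((1 - p) ** 2, 2 * p * (1 - p), p ** 2)
    return math.log(probs[genotype])

def assign_population(genotypes, freqs, snps):
    """Return the population with the highest summed log-likelihood.
    SNPs missing from the sample are skipped."""
    scores = {pop: 0.0 for pop in next(iter(freqs.values()))}
    for snp in snps:
        if snp not in genotypes:
            continue
        for pop in scores:
            scores[pop] += genotype_log_likelihood(genotypes[snp], freqs[snp][pop])
    return max(scores, key=scores.get)

def concordance(tumor_calls, normal_calls):
    """Fraction of tumor samples assigned to the same group as their paired normal."""
    matches = sum(t == n for t, n in zip(tumor_calls, normal_calls))
    return matches / len(tumor_calls)

# Toy reference frequencies (invented for illustration only).
freqs = {
    "snp1": {"AFR": 0.9, "EUR": 0.1, "EAS": 0.2},
    "snp2": {"AFR": 0.1, "EUR": 0.8, "EAS": 0.1},
    "snp3": {"AFR": 0.5, "EUR": 0.5, "EAS": 0.55},  # uninformative, filtered out
    "snp4": {"AFR": 0.1, "EUR": 0.2, "EAS": 0.9},
}
selected = select_biased_snps(freqs)

# Paired samples: the tumor has one altered genotype and one missing SNP.
normal = {"snp1": 0, "snp2": 2, "snp3": 1, "snp4": 0}
tumor = {"snp1": 0, "snp2": 1, "snp3": 1}

normal_call = assign_population(normal, freqs, selected)
tumor_call = assign_population(tumor, freqs, selected)
```

In this toy case the tumor sample is still assigned to the same group as its paired normal even though one genotype changed and one SNP is missing, which is the kind of robustness the reported tumor-versus-normal comparison measures. `concordance` computes that agreement rate over a list of paired calls.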