Scripties UMCG - Rijksuniversiteit Groningen

Predicting the primary origin of Cancer of Unknown Primary by using mRNA expression profiles and machine learning

(2017) Bakker, J.A.

Cancer of unknown primary (CUP) is metastasized cancer with an undetectable primary tumor.
It accounts for 3-5% of all patients with cancer and it is the fourth most frequent cause of
death by cancer in the western world. The outlook for patients with CUP is poor
because current cancer treatments depend on the primary site of origin. This study
attempts to build a computer model to predict the origin of CUP. To train and evaluate this model,
8068 mRNA expression profiles of relevant tissues with known origins are collected from the
Gene Expression Omnibus (GEO) and ArrayExpress (AE). Two data representations are compared:
the 5000 probes ranked highest by median absolute deviation, and the mixing matrix
resulting from an independent component analysis. Detected batch effects
are removed. To obtain the most accurate predictive model, conventional
machine learning algorithms are compared with stacking and neural networks. The effect of including
healthy tissues in the training data is determined, and the performance of the models is explored.
The final model, a neural network trained on the top 5000 probes, achieved an accuracy of 0.968
on unseen evaluation data. This model is applied to a dataset of 90 CUP samples to predict their
primary origin. Because these predictions cannot be evaluated directly, they are compared with silver
labels and with known proportions obtained from the literature. Despite the development of
an accurate model, the results of this comparison remain questionable. A clinical study should
be performed to see if the predictive model improves the survival and treatment response of
CUP patients.
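
The probe selection step mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the dictionary layout and the toy values are hypothetical, and it only assumes that probes are ranked by the median absolute deviation (MAD) of their expression values across samples, with the top k kept (k = 5000 in the thesis, k = 2 here).

```python
from statistics import median


def median_absolute_deviation(values):
    """Median of the absolute deviations from the median."""
    centre = median(values)
    return median([abs(v - centre) for v in values])


def select_top_probes(expression, k):
    """Return the k probe ids with the largest MAD, highest first.

    `expression` maps a probe id to its expression values across samples
    (hypothetical layout, chosen only for this illustration).
    """
    scores = {
        probe: median_absolute_deviation(values)
        for probe, values in expression.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data: three probes measured in four samples (made-up numbers).
toy_expression = {
    "probe_a": [1, 1, 1, 1],   # constant across samples, MAD = 0
    "probe_b": [1, 5, 9, 13],  # strongly varying
    "probe_c": [2, 3, 4, 5],   # mildly varying
}

top_probes = select_top_probes(toy_expression, k=2)
print(top_probes)
```

A probe with constant expression scores zero and is dropped first, which is the rationale for using a variability measure such as the MAD to reduce the number of input features before training a classifier.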
