Since the breakthrough discovery in 2006 that living cells can be “reprogrammed” (subject to specific biochemical stimuli) from one cell type to another, cell reprogramming protocols have been optimized to meet the need for essential cell types, which are now routinely engineered “on demand” for regenerative medicine, disease modeling and patient-specific drug testing. In parallel with the rise of cell reprogramming protocols, technological advances have allowed researchers to routinely quantify various cell properties in a high-throughput manner. This has created great opportunities for computational scientists to integrate and analyze large collections of high-dimensional and heterogeneous biological datasets in the quest for more cost-effective ways to quantify the success of cell reprogramming, and ultimately to understand which cell properties (features in these datasets) make one cell type different from or similar to other cell types.


This project lies at the intersection of Data Science and Computational Biology. It aims to provide standalone, data-driven computational tools for scoring cell identity that will be i) easy to use by typical biomedical community users with minimal or no computational background, and ii) integrated into open-source development projects such as R/Bioconductor, so that they are accessible for further development and integration into custom bioinformatics data analysis workflows.


i) A data-driven methodology for computationally scoring the cell identity of engineered cells. It will be based on the integration of cell measurements collected with different technologies at different molecular levels (DNA, RNA, protein) across different experimental conditions (cell types and reprogramming protocols) within the framework of unsupervised and/or supervised statistical learning. Challenges: high-dimensional but sparse data for some cell types; evaluation of the method, given that no ground truth exists for the type of the engineered cells; the choice of distance measures for quantifying similarity between two cell types.

ii) Method implementation preferably in R (or Python);

iii) Web application for interactive benchmarking (visualization) of the method on publicly available datasets, or user-uploaded datasets;

iv) Comparative analysis with existing state-of-the-art methods in the field that will be the base for a scientific publication. We will aim to publish the work produced within the course of the project, and you are welcome to contribute as a co-author in the writing of the manuscript.
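To make the idea in item i) concrete, here is a minimal Python sketch of one possible distance-based identity score. It is an illustration only, not the project's method: the function names (`pearson`, `centroid`, `identity_scores`, `closest_type`), the toy expression values, and the choice of correlation distance to a per-type mean profile are all assumptions made for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length feature vectors."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (sx * sy)


def centroid(profiles):
    """Feature-wise mean of several profiles of the same cell type."""
    return [sum(col) / len(col) for col in zip(*profiles)]


def identity_scores(query, references):
    """Correlation distance (1 - r) from the query profile to each
    reference cell type's centroid; smaller means more similar."""
    return {
        cell_type: 1.0 - pearson(query, centroid(profiles))
        for cell_type, profiles in references.items()
    }


def closest_type(query, references):
    """Reference cell type with the smallest distance to the query."""
    scores = identity_scores(query, references)
    return min(scores, key=scores.get)


# Toy data (invented for illustration): four features per profile,
# two replicate profiles per reference cell type.
references = {
    "fibroblast": [[9.0, 8.5, 1.0, 0.5], [8.0, 9.0, 1.5, 1.0]],
    "neuron": [[1.0, 0.5, 9.0, 8.0], [0.5, 1.5, 8.5, 9.5]],
}
engineered = [2.0, 1.0, 7.5, 8.0]  # hypothetical reprogrammed cell

scores = identity_scores(engineered, references)
best = closest_type(engineered, references)
```

Correlation distance is only one candidate here. With real, sparse, multi-level data, the choice of distance and of reference summary is exactly the open question listed under the challenges above.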

Duration and Type

  • As a student project, preferably 1 semester, but it can also be done over 2 semesters
  • The summer scholarship runs for 12 weeks between S2 2020 and S1 2021
  • Honours project, other postgraduate project (MSc, MProfStuds, …)


  • Basic maths, statistics, and machine learning skills
  • Basic Bash scripting and good programming skills in R (or Python)
  • Experience with Bioconductor, or familiarity with bioinformatics data analysis and biological datasets, will be advantageous but is not mandatory; we will help you in this respect.

Supervisor and contact

  • Katerina Taskova
  • Send CV and transcript by mail
  • Applications for the summer scholarships via Faculty of Science