Genomics and Epigenomics

ChIP-seq phantom peaks

Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is a widely used technique for mapping where DNA-binding proteins attach to the genome. Large projects and databases such as ENCODE and modENCODE have used this method to identify binding sites for hundreds of proteins across different species. With all the data collected, it’s clear that some parts of the genome have an unusually high frequency of protein-DNA interactions. These areas, known as high-occupancy target (HOT) regions, have been found in multiple species.




ChIP-seq phantom peaks, or high-occupancy target (HOT) regions as we call them, are parts of the genome with an unusually large number of transcription factor binding sites. These regions show up in various species and are thought to be biologically important because of the high concentration of transcription factor binding. They also overlap with housekeeping gene promoters, and the associated genes are consistently expressed across many cell types. Despite these interesting features, HOT regions are mainly defined using ChIP-seq experiments and don’t show the typical motifs of the transcription factors believed to bind there.
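To make "high occupancy" concrete, here is a minimal sketch of one way such regions could be flagged from ChIP-seq peak calls: split the genome into fixed-size bins, count how many distinct factors have a peak in each bin, and keep the bins above a cutoff. The bin size, the cutoff, the function names and the toy peaks are all illustrative assumptions, not the definition used in the studies described here.

```python
from collections import defaultdict


def occupancy_per_bin(peaks, bin_size=1000):
    """Map each (chrom, bin_index) to the set of factors with a peak in that bin.

    peaks: iterable of (chrom, start, end, factor) with half-open coordinates.
    """
    bins = defaultdict(set)
    for chrom, start, end, factor in peaks:
        # A peak contributes to every bin it overlaps.
        for b in range(start // bin_size, (end - 1) // bin_size + 1):
            bins[(chrom, b)].add(factor)
    return bins


def hot_bins(peaks, bin_size=1000, min_factors=3):
    """Return the bins bound by at least `min_factors` distinct factors (assumed cutoff)."""
    occupancy = occupancy_per_bin(peaks, bin_size)
    return sorted(key for key, factors in occupancy.items() if len(factors) >= min_factors)


# Toy peak calls for three hypothetical factors.
PEAKS = [
    ("chr1", 100, 300, "TF_A"),
    ("chr1", 250, 400, "TF_B"),
    ("chr1", 150, 200, "TF_C"),
    ("chr1", 900, 1100, "TF_C"),
    ("chr2", 100, 200, "TF_A"),
]

hot = hot_bins(PEAKS, bin_size=1000, min_factors=3)
```

In practice the cutoff would more likely be a percentile of the occupancy distribution than a fixed count, and the bins would be compared against a background of all peak-containing regions.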

For us, the plausible explanations for motifless binding are a combination of: 1) interactions between transcription factors (TFs), where only a handful of them actually bind DNA; 2) weak binding sites, where TFs bind non-canonical motifs with low affinity; and 3) regions with a high affinity for chromatin immunoprecipitation, called ‘hyper-ChIPable’ regions.

After observing common low-level sequence features of HOT regions across species, we investigated whether technical biases in ChIP-seq could at least partially explain false-positive signals in HOT regions. In 14 of 22 publicly available ChIP-seq experiments in which the gene encoding the target protein was knocked out, enrichment is still observed, even though the ChIPped protein shouldn’t be present in the analysed sample. This false-positive signal is highest in HOT regions.





The observed ChIP signal arises from a combination of different signal sources. The signal in a ChIP experiment originates from an antibody binding to the intended target protein (blue) and from nonspecific antibody binding, either to non-target proteins (orange) or directly to polynucleotide structures such as R-loops (red). The error (orange + red) is not proportional to the signal from the targeted protein; rather, it depends on the sequence properties, antibody properties and expression characteristics of individual genomic regions. The combination of different noise profiles results in a subset of ChIP-seq peaks being false positives.





For more details check out our:




DNA methylation biomarkers derived from cell-free DNA


I have a strong interest in DNA methylation analysis of cell-free DNA, which I have applied in two projects focused on acute coronary syndrome (using blood cfDNA) and neuroblastoma (using urine cfDNA and solid tumor samples).

Cell-free DNA signatures are quickly becoming the target of choice for non-invasive screening, diagnosis, treatment and monitoring of human tumors. DNA methylation changes occur early in tumorigenesis and are widespread, making cfDNA methylation an attractive cancer biomarker.

Mutations, methylation, DNA integrity, microsatellite alterations and viral DNA can be detected in cell-free DNA (cfDNA) in blood. Tumour-related cfDNA, which circulates in the blood of cancer patients, is released by tumour cells in different forms and at different levels. Figure copied from Cell-free nucleic acids as biomarkers in cancer patients.

We applied a computational approach that resolves the relative fractions of diverse cell subsets in ccfDNA:


Figure adapted from Newman et al., Nature Methods 2015.


DNA methylation biomarkers in acute coronary syndrome (cfDNA in blood)

We proposed circulating cell-free DNA (ccfDNA) as an additional marker for acute coronary syndrome (ACS), since damaged tissues can release DNA into the bloodstream. We used ccfDNA methylation profiles to differentiate between ACS types and provided computational tools for repeating similar analyses in other diseases. We leveraged the cell-type specificity of DNA methylation to deconvolute the ccfDNA cell types of origin and to find methylation-based biomarkers that stratify patients.
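The idea behind cell-type deconvolution can be sketched as a small constrained regression: the methylation observed in a ccfDNA sample is modelled as a non-negative mixture of reference cell-type profiles. The sketch below uses plain non-negative least squares solved by projected gradient descent, which is a simpler stand-in and not necessarily the method used in this project. The reference matrix, cell-type names and fractions are made-up values for illustration only.

```python
def deconvolve(reference, mixture, n_iter=2000):
    """Estimate non-negative cell-type fractions by projected gradient descent.

    reference: rows are CpG markers, columns are cell types (methylation in [0, 1]).
    mixture:   methylation of the same markers observed in the ccfDNA sample.
    """
    n_markers, n_types = len(reference), len(reference[0])
    # The trace of R^T R bounds its largest eigenvalue, so 1/trace is a safe step size.
    step = 1.0 / sum(v * v for row in reference for v in row)
    fractions = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        residual = [
            sum(r * f for r, f in zip(row, fractions)) - m
            for row, m in zip(reference, mixture)
        ]
        gradient = [
            sum(reference[i][j] * residual[i] for i in range(n_markers))
            for j in range(n_types)
        ]
        # Gradient step, then projection onto the non-negative orthant.
        fractions = [max(0.0, f - step * g) for f, g in zip(fractions, gradient)]
    total = sum(fractions) or 1.0
    return [f / total for f in fractions]  # rescale so the fractions sum to 1


# Hypothetical reference profiles: 5 marker CpGs x 3 cell types.
CELL_TYPES = ["neutrophil", "lymphocyte", "cardiomyocyte"]
REFERENCE = [
    [0.9, 0.1, 0.1],
    [0.1, 0.9, 0.1],
    [0.1, 0.1, 0.9],
    [0.8, 0.2, 0.5],
    [0.2, 0.7, 0.3],
]
TRUE_FRACTIONS = [0.6, 0.3, 0.1]
# Simulate a noise-free mixture from the known fractions.
MIXTURE = [sum(r * f for r, f in zip(row, TRUE_FRACTIONS)) for row in REFERENCE]

estimated = deconvolve(REFERENCE, MIXTURE)
```

On real data the mixture is noisy and the reference is incomplete, so the estimates are approximate and the choice of marker CpGs matters more than the solver.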

We identified hundreds of methylation markers associated with ACS types and validated them in an independent cohort. Many such markers were associated with genes involved in cardiovascular conditions and inflammation. ccfDNA methylation showed promise as a non-invasive diagnostic for acute coronary events. These methods are not limited to acute events, and may be used for chronic cardiovascular diseases as well.

For more details check out our:

DNA methylation landscape of neuroblastoma (primary tissues and urine cfDNA)

DNA methylation plays diverse roles in cancer, altering normal gene regulation and contributing to disease states. Neuroblastoma, an early childhood cancer, arises from aberrant differentiation of neural crest tissues within the sympathetic nervous system. Its clinical spectrum ranges from spontaneously regressing tumors to highly aggressive forms. Known genetic alterations only partially account for this variability, suggesting an epigenetic role in neuroblastoma pathogenesis.

To investigate the regulatory function of DNA methylation at single-nucleotide resolution and on a genome-wide scale in primary neuroblastomas, we collaborated with Prof. Dr. med. Johannes H. Schulte at Charité Hospital in Berlin. We analyzed a cohort of samples, incorporating whole-genome bisulfite sequencing (Bisulfite-seq) data alongside matching RNA-seq data.

Our analysis confirmed previously identified methylation-based clustering patterns, distinguishing high-risk from low-risk tumors and MYCN-amplified from non-MYCN-amplified tumors, as well as MYCN-driven DNA methylation deregulation at regulatory elements in MYCN-amplified cases. Additionally, we employed an integrative approach, combining Bisulfite-seq, RNA-seq, publicly available ChIP-seq data for tumor-specific H3K27ac marks, and known DNA motifs.

This approach revealed that specific transcription factor networks, deregulated in high-risk neuroblastomas, can be modeled based solely on DNA methylation data. Our findings suggest that epigenetic mechanisms contribute to neuroblastoma’s regulatory dysfunctions and may serve as valuable targets for further research.

In this study, we aimed to gain a deeper understanding of the DNA methylation landscape in MYCN-amplified and non-MYCN-amplified high-risk neuroblastomas at both single-nucleotide and genome-wide levels, focusing on associations between DNA methylation aberrations and neuroblastoma-specific transcription factor regulatory networks.

We confirmed our findings in solid tissues as described above by analyzing cfDNA derived from urine in collaboration with the AG Deubzer lab at Charité Hospital in Berlin.



Hierarchical clustering of the DNA methylation percentage at CpGs differentially methylated among 24 neuroblastoma patients. Each column represents a patient and each row represents a CpG. The percentage of DNA methylation is normalized to the [0,1] range. Each patient is annotated with risk group, sequencing batch, age (in days), and estimated gender.



(A) Regulatory PPI network based on ISMARA motif activity results for MNA DMRs (in red) and HR_nMNA DMRs (in blue). TFs common to both networks are depicted in green. (B) Top 20 Gene Ontology terms (rows) from Metascape (Zhou et al. 2019), based on TFs from the MNA and HR_nMNA networks (columns).


Target nomination


Visualization and statistical analysis of biomarkers in patients with limited treatment options (as part of clinical trials)

We developed a comprehensive set of visualizations, including oncoprints, to highlight key biomarkers in patients with high medical need. These visualizations offer a clear, intuitive view of the underlying genomic alterations, facilitating the identification of potential therapeutic targets. The analysis focuses on a high medical need population: patients whose diseases have limited treatment options or poor outcomes, as identified from external clinical trial databases. The visual representation may categorize patients based on disease severity, unmet medical need, or current lack of effective therapies.

We performed a comprehensive analysis of target nomination in the context of high medical need populations, using survival analysis and other statistical methods to illustrate the clinical significance of the nominated targets (genes).
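As an illustration of the survival-analysis step, here is a minimal sketch of the Kaplan-Meier estimator, a standard way to compare survival between patient groups (for example, carriers and non-carriers of an alteration in a nominated gene). That Kaplan-Meier curves are what was used here is an assumption, and the follow-up times and event flags below are made-up toy values.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each time with at least one event.

    times:  follow-up time per patient.
    events: 1 if the event (e.g. death) was observed, 0 if the patient was censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            # Multiply by the conditional probability of surviving past time t.
            survival *= 1.0 - n_events / n_at_risk
            curve.append((t, survival))
        # Both events and censored patients leave the risk set after time t.
        n_at_risk -= n_leaving
    return curve


# Toy cohort: follow-up in months, 1 = event observed, 0 = censored.
TIMES = [2, 3, 3, 5, 8]
EVENTS = [1, 1, 0, 1, 0]

curve = kaplan_meier(TIMES, EVENTS)
```

A group comparison would fit one such curve per biomarker group and test the difference, for instance with a log-rank test.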


Example of visualisation and survival analysis of selected biomarkers.



Identifying novel anticancer targets using machine learning and AI methods

To prioritize molecules for further study, we utilized Positive and Unlabeled (PU) learning. This method is designed for cases where we know which molecules are likely to be of high interest (positive samples) but lack explicit negative examples.

PU learning is a machine learning framework that operates with only positive and unlabeled data, assuming that the unlabeled set may contain both positive and negative examples. This approach is particularly useful in fields like medical diagnosis and target identification, where data often naturally arise in this format. For instance, medical records typically document only the conditions a patient has been diagnosed with, not those they do not have. This absence of diagnosis does not imply the absence of disease, as many conditions, such as diabetes, often go undiagnosed (Claesen et al. 2015b).


The goal of PU learning is the same as general binary classification: train a classifier that can distinguish between positive and negative examples based on the attributes. However, during the learning phase, only some of the positive examples in the training data are labeled and none of the negative examples are, as illustrated below:
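A minimal sketch of this setting, assuming the label-frequency correction of Elkan and Noto (2008), one common PU approach and not necessarily the one used in this project. A classifier is first trained to separate labeled from unlabeled examples; its scores are then rescaled by the average score it gives to known positives. The scores below are hypothetical classifier outputs, not real results.

```python
def elkan_noto_adjust(scores, is_labeled):
    """Rescale P(labeled | x) into an estimate of P(positive | x).

    scores:     output of any classifier trained to separate labeled from
                unlabeled examples, one value in [0, 1] per example.
    is_labeled: True for known positives, False for unlabeled examples.
    Assumes positives are labeled completely at random.
    """
    labeled_scores = [s for s, flag in zip(scores, is_labeled) if flag]
    # c estimates the label frequency: the probability that a true positive is labeled.
    c = sum(labeled_scores) / len(labeled_scores)
    adjusted = [min(1.0, s / c) for s in scores]
    return c, adjusted


# Hypothetical scores: three labeled positives followed by three unlabeled molecules.
SCORES = [0.5, 0.75, 0.25, 0.25, 0.125, 0.5]
IS_LABELED = [True, True, True, False, False, False]

label_frequency, positive_probability = elkan_noto_adjust(SCORES, IS_LABELED)

# Rank the unlabeled molecules by their estimated probability of being positive.
ranked_unlabeled = sorted(
    (i for i, flag in enumerate(IS_LABELED) if not flag),
    key=lambda i: positive_probability[i],
    reverse=True,
)
```

The label frequency is better estimated on held-out labeled positives than on the training examples, which the sketch skips for brevity.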


We applied PU classifiers to molecules and their features of interest, such as gene expression, gene mutations and available therapies, using:

Additionally, we applied an unsupervised approach based on autoencoders (deep learning) to find the most predictive features of the group of molecules of interest.