Research

Bias interpretation in genomics

We used machine learning models - elastic net regression and principal component analysis (PCA)- to investigate genomic regions called 'HOT regions' which appear to attract unusually high numbers of proteins and are likely technical artifacts of chromatin immunoprecipitation followed by sequencing (ChIP-seq) experiments.

While factors like antibody quality and chromatin interactions are known to affect ChIP-seq reliability, our study revealed that GC- and CpG-rich sequences, DNA methylation, and RNA:DNA hybrids (R-loops) also contribute to these artifacts across species. This work shows how machine learning can uncover hidden biases in genomic data and improve experimental interpretation.

HOT regions figure
Figure: Unexpected ChIP-seq signals appear in HOT regions even without the target protein (KO ChIP-seq). The barplot shows how often these regions are detected as bound - HOT regions correspond to the top 1% of genomic regions with the highest protein binding signals (99th percentile).

Publication: Wreczycka K et al, Nucleic Acids Research, 2019

Liquid biopsy epigenetics in disease

DNA methylation biomarkers in acute coronary syndrome (blood-derived cfDNA)

We explored circulating cell-free DNA (cfDNA) methylation as a non-invasive biomarker for acute coronary syndrome (ACS), based on the principle that damaged tissues release DNA into the bloodstream.

Using cfDNA methylation profiles, we differentiated ACS subtypes and identified cell type-specific DNA methylation markers to trace the origin of cfDNA. Hundreds of methylation markers linked to cardiovascular conditions and inflammation were identified and validated in an independent cohort, highlighting the potential of cfDNA methylation for ACS diagnosis.

HOT regions figure
Figure: PCA of 254 differentially methylated regions linked to ACS severity using linear models.

Publication: Rafael R C Cuadrat et al, NAR Genomics and Bioinformatics, 2023

DNA methylation profiling in neuroblastoma (solid tissues and urine-derived cfDNA)

Neuroblastoma is a pediatric cancer ranging from mild to aggressive forms. While genetic changes explain some variability, we showed that DNA methylation plays a key role in its progression. In collaboration with Charité Hospital (Berlin), we analyzed primary tumor tissues and urine cfDNA using bisulfite-seq and RNA-seq, identifying methylation patterns distinguishing high- and low-risk tumors. We also linked MYCN-driven methylation changes to disrupted transcription factor networks, highlighting potential targets for therapies.

Neuroblastoma clustering

Figure: Methylation-based clustering of neuroblastoma patients using differentially methylated CpGs.

Open source software

genomation – a Bioconductor R package designed to simplify genomic feature and interval analysis. It includes functions for reading BED/GFF files as GRanges, summarizing features over regions, creating enrichment plots or heatmaps, and annotating regions with exons, introns, or promoters.

https://github.com/BIMSBbioinfo/genomation, developed in the team of Dr. Altuna Akalin at Bioinformatics and Omics Data Science Platform at MDC BIMSB.

PiGx – a collection of genomics pipelines implemented using Snakemake, Python, and R. Each pipeline is easily configured with a sample sheet and a simple settings file. PiGx generates comprehensive, interactive HTML reports that summarize key findings from your samples.

https://github.com/BIMSBbioinfo/pigx, developed in the team of Dr. Altuna Akalin at Bioinformatics and Omics Data Science Platform at MDC BIMSB.

motifActivity – an R package for identifying key transcription factors (TFs) responsible for changes in gene expression or epigenetic marks across samples. It predicts TF activity profiles using input data from RNA-seq, BS-seq, ChIP-seq, ATAC-seq, and similar methods, combined with a set of DNA motifs.

https://github.com/katwre/motifActivity, developed in the team of Dr. Altuna Akalin at Bioinformatics and Omics Data Science Platform at MDC BIMSB.

3.1 Prioritization of therapeutic targets in clinical trials

Visualization and Statistical Analysis of Biomarkers

We developed interactive visualizations, including oncoprints, to highlight key biomarkers in patients with limited treatment options. These visual summaries help uncover genomic alterations and support identifying new therapeutic targets.

We focused on patients from clinical trial databases facing poor outcomes or lacking effective therapies. Our statistical analyses, including survival analysis, demonstrate the clinical relevance of nominated targets.

Figure: Example of biomarker visualization and survival analysis.

Machine learning for target identification

To prioritize therapeutic targets, we applied Positive and Unlabeled (PU) learning, ideal for cases where only confirmed targets are known. PU classifiers helped distinguish potential targets using gene expression, mutations, and therapy annotations.

Figure: PU learning principle (figure adapted from a blogpost).

Additionally, we used autoencoders (PyTorch) to uncover hidden patterns and prioritize key molecular features in an unsupervised way.

Figure: Autoencoder workflow and latent space visualization using t-SNE.

3.2 Web app feature development

I contributed to enhancing the IGV web application, an interactive tool for visual exploration of genomic data (source code). Built with JavaScript and Python, this tool allows visualization of both public and in-house datasets.

  • Enabled dynamic visualization of new in-house genomic datasets.
  • Added highlighting of genomic regions of interest (e.g., genetic variants).
  • Developed new display options for RefSeq and GENCODE annotations:
    • Collapse/expand all transcript isoforms.
    • Extend selected gene isoforms for detailed view.
    • Added controls to adjust track widths for optimal display.
  • Linked visualized tracks to their source databases.
  • Implemented command-line tool for automated snapshots of defined genes or regions.
Figure: Example IGV web app view showing genomic data tracks.

Side projects

Protein Folding in the HP Model - implementation of simulated annealing and replica exchange Monte Carlo algorithm for protein folding in the HP model in Python and NumPy. The HP model simplifies protein folding by using hydrophobic (H) and polar (P) amino acids on a square lattice. Metropolis–Hastings algorithm enables sampling protein configurations based on the Boltzmann distribution.

Figure: Lattice HP model showing global energy.

https://github.com/katwre/bioinformatics-projects/tree/master/Molecular_Dynamics


Genome Assembly Using de Bruijn Graph - implementation of de Bruijn graph-based genome assembly with Eulerian walk to reconstruct DNA sequences from k-mers. Includes short-read assembly principles based on publications by Compeau et al. (2011) and Pevzner et al. (2001)

Figure: De Bruijn graph.

https://github.com/katwre/bioinformatics-projects/tree/master/genome_assembly


Sudoku - a simple Sudoku game implemented in JavaScript and JQuery.

https://github.com/katwre/sudoku


Minesweeper - classic Minesweeper game implemented in Java using SWING and AWT libraries.

https://github.com/katwre/Minesweeper


Django-Based Web Services

Django-based server for Multiple Sequence Alignment (MSA) visualization - https://github.com/freesci/MSA-vis-project

Mobile application using Django, manifesto app, and localStorage - https://github.com/katwre/phone_application


Discover Your Career Match

Interactive tool that matches careers to users based on their personality profile (Big Five personality traits). Runs directly in the browser via Pyodide.

PCA plot of careers based on personality traits

Figure: PCA plot showing career matches based on personality profile.

https://github.com/katwre/Personalities