Portfolio
- Research
- Open Source Software
- Freelance
- Side Projects
- Survival analysis using gene expression & clinical data (Cox models)
- Deep learning for X-ray images and omics (CNNs, transfer learning, VAE, NMF, BERT, GNN)
- LLM-based assistant for bioinformatics queries
- Computational biology & algorithms
- Protein folding - Monte Carlo
- Genome assembly - de Bruijn graph, Eulerian walk
- Evolutionary tree estimation - Felsenstein & NNI
- Regulatory DNA discovery - MSA & binomial enrichment
- Games: Sudoku (JavaScript) and Minesweeper (Java)
- Django web services (Multiple Sequence Alignment visualization and mobile app)
- In-browser Python career-matching tool (Pyodide)
Research
Bias interpretation in genomics
We used machine learning models (elastic net regression and principal component analysis, PCA) to investigate genomic regions called 'HOT regions', which appear to attract unusually high numbers of proteins and are likely technical artifacts of chromatin immunoprecipitation followed by sequencing (ChIP-seq) experiments.
While factors like antibody quality and chromatin interactions are known to affect ChIP-seq reliability, our study revealed that GC- and CpG-rich sequences, DNA methylation, and RNA:DNA hybrids (R-loops) also contribute to these artifacts across species. This work shows how machine learning can uncover hidden biases in genomic data and improve experimental interpretation.
Publication: Wreczycka K et al, Nucleic Acids Research, 2019
Liquid biopsy epigenetics in disease
DNA methylation biomarkers in acute coronary syndrome (blood-derived cfDNA)
We explored circulating cell-free DNA (cfDNA) methylation as a non-invasive biomarker for acute coronary syndrome (ACS), based on the principle that damaged tissues release DNA into the bloodstream.
Using cfDNA methylation profiles, we differentiated ACS subtypes and identified cell type-specific DNA methylation markers to trace the origin of cfDNA. Hundreds of methylation markers linked to cardiovascular conditions and inflammation were identified and validated in an independent cohort, highlighting the potential of cfDNA methylation for ACS diagnosis.
Publication: Rafael R C Cuadrat et al, NAR Genomics and Bioinformatics, 2023
DNA methylation profiling in neuroblastoma (solid tissues and urine-derived cfDNA)
Neuroblastoma is a pediatric cancer ranging from mild to aggressive forms. While genetic changes explain some variability, we showed that DNA methylation plays a key role in its progression. In collaboration with Charité Hospital (Berlin), we analyzed primary tumor tissues and urine cfDNA using bisulfite-seq and RNA-seq, identifying methylation patterns distinguishing high- and low-risk tumors. We also linked MYCN-driven methylation changes to disrupted transcription factor networks, highlighting potential targets for therapies.
Figure: Methylation-based clustering of neuroblastoma patients using differentially methylated CpGs.
Open source software
genomation – a Bioconductor R package designed to simplify genomic feature and interval analysis. It includes functions for reading BED/GFF files as GRanges, summarizing features over regions, creating enrichment plots or heatmaps, and annotating regions with exons, introns, or promoters.
https://github.com/BIMSBbioinfo/genomation, developed in the team of Dr. Altuna Akalin at Bioinformatics and Omics Data Science Platform at MDC BIMSB.
PiGx – a collection of genomics pipelines implemented using Snakemake, Python, and R. Each pipeline is easily configured with a sample sheet and a simple settings file. PiGx generates comprehensive, interactive HTML reports that summarize key findings from your samples.
https://github.com/BIMSBbioinfo/pigx, developed in the team of Dr. Altuna Akalin at Bioinformatics and Omics Data Science Platform at MDC BIMSB.
motifActivity – an R package for identifying key transcription factors (TFs) responsible for changes in gene expression or epigenetic marks across samples. It predicts TF activity profiles using input data from RNA-seq, BS-seq, ChIP-seq, ATAC-seq, and similar methods, combined with a set of DNA motifs.
https://github.com/katwre/motifActivity, developed in the team of Dr. Altuna Akalin at Bioinformatics and Omics Data Science Platform at MDC BIMSB.
Freelance
Prioritization of therapeutic targets in clinical trials
Visualization and survival analysis of biomarkers
We developed interactive visualizations, including oncoprints, to highlight key biomarkers in patients with limited treatment options. These visual summaries help uncover genomic alterations and support identifying new therapeutic targets.
We focused on patients from clinical trial databases facing poor outcomes or lacking effective therapies. Our statistical analyses, including survival analysis, demonstrate the clinical relevance of nominated targets.
Figure: Example of biomarker visualization and survival analysis.
Machine learning/AI for target identification
To prioritize therapeutic targets, we applied Positive and Unlabeled (PU) learning, ideal for cases where only confirmed targets are known. PU classifiers helped distinguish potential targets using gene expression, mutations, and therapy annotations.
Figure: PU learning principle (figure adapted from a blogpost).
Additionally, we utilized autoencoders to uncover hidden patterns and prioritize key molecular features in an unsupervised way.
Figure: Schematic of a Variational Autoencoder (figure adapted from a blogpost).
Multi-omics and AI (Enformer) for an Alzheimer's disease biomarker
In this project, I investigated glial-to-neuron reprogramming through activation of a specific transcription factor (TF) in Alzheimer’s disease. I integrated multi-omics data - including RNA-seq, ATAC-seq, and H3K4me2/H3K27me3 ChIP-seq - to identify differentially expressed genes and gene pathways (GO, GSEA), their association with Alzheimer’s disease risk variants (GWAS), regulatory enhancers, DNA motifs and TF binding sites linked to neuronal differentiation. Large-scale analyses were executed via Nextflow pipelines on Kubernetes and AWS, ensuring scalable and reproducible processing of NGS datasets.
Using DeepMind’s Enformer model by Avsec et al., I mapped the regulatory landscape around the target TF to uncover and confirm DNA regions predicted to most strongly influence its expression in neurons and glia. Enformer predicts regulatory activity directly from DNA sequence, and gradient-based attribution applied to its predictions reveals the regions with the strongest impact on gene expression. I analyzed a ~400 kb region around the transcription factor, applying cell-type–specific masks and signal smoothing to identify candidate enhancers.
Web app feature development
I contributed to enhancing the IGV web application, an interactive tool for visual exploration of genomic data (source code). Built with JavaScript and Python, this tool allows visualization of both public and in-house datasets.
- Enabled dynamic visualization of new in-house genomic datasets.
- Added highlighting of genomic regions of interest (e.g., genetic variants).
- Developed new display options for RefSeq and GENCODE annotations:
- Collapse/expand all transcript isoforms.
- Extend selected gene isoforms for detailed view.
- Added controls to adjust track widths for optimal display.
- Linked visualized tracks to their source databases.
- Implemented command-line tool for automated snapshots of defined genes or regions.
Side projects
Survival analysis using gene expression & clinical data (Cox models)
I developed several survival models to predict the risk of mortality or relapse in newly diagnosed multiple myeloma patients, using baseline clinical and/or gene expression data. The workflow involved RNA-seq preprocessing, unsupervised exploratory analysis (PCA, clustering), and multiple survival modeling strategies (Cox regression, random survival forests, LASSO-based feature selection, and pathway-informed models), all evaluated using the C-index.
Figure: C-index comparison of multiple survival models.
Figure: Kaplan–Meier plot of the best performing model.
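As an illustration of the evaluation metric, here is a minimal pure-Python sketch of Harrell's concordance index. The function name and the simplified tie handling (pairs with tied times are skipped) are assumptions for this example and are not taken from the repository.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: fraction of comparable pairs ranked correctly.

    times  - observed follow-up times
    events - 1 if the event (death/relapse) was observed, 0 if censored
    risks  - predicted risk scores (higher = expected earlier event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the "earlier event" of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ordering of patients by risk.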
https://github.com/katwre/survival_analysis/tree/main/
Deep learning for X-ray images and omics (CNNs, transfer learning, VAE, NMF, BERT, GNN)
CNNs and transfer learning for image classification tasks based on chest X-rays
I applied convolutional neural networks (CNNs) to classify chest X-ray images using both 224×224 and 64×64 pixel inputs, aiming to explore whether lightweight models can retain sufficient diagnostic power for image-based classification tasks. In addition to training a baseline CNN from scratch, I employed transfer learning with pretrained convolutional backbones such as ResNet to evaluate whether pretrained models could further enhance classification performance on chest X-ray images.
Figure: Example X-ray images of a healthy individual (panel "Healthy") and a pneumonia patient (panel "Pneumonia").
https://github.com/katwre/ML-projects/blob/main/CNN_and_TransferLearning_Xray/
Autoencoder for scRNA-seq dimensionality reduction and data imputation
I developed a simple autoencoder with a custom loss function for imputing missing values in single-cell RNA-seq data. The approach was inspired by the method proposed by Badsha et al. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7144625/).
Figure: Imputed scRNA-seq.
Figure: Model's output and the true gene expression values. Non-imputed data (blue): where the model reconstructed known values. Imputed data (orange): where the model predicted missing (masked) values.
https://github.com/katwre/ML-projects/tree/main/autoencoder_scRNAseq/
Variational autoencoder (VAE) to mitigate batch effects in scRNA-seq using federated learning
This project explored a scVI model (variational autoencoder for single-cell data) in a federated setting using the Flower framework (Flower.ai) and the SecAgg+ secure aggregation protocol. For comparison, the same model was also trained in a centralized setting.
Figure: Baseline Gene Expression UMAP.
Figure: Centralized scVI Model.
Figure: Federated scVI Model.
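To make the federated setting concrete, the sketch below shows the plain weighted averaging of client parameters (often called federated averaging). This is an assumption about the aggregation step for illustration only: the project text does not state which strategy was used, and with secure aggregation the server would see only the combined result rather than individual client updates. The function name is hypothetical.

```python
def federated_average(client_params, client_sizes):
    """Average flat parameter vectors, weighting each client by its sample count."""
    total = sum(client_sizes)
    n_params = len(client_params[0])
    return [
        sum(params[i] * size for params, size in zip(client_params, client_sizes)) / total
        for i in range(n_params)
    ]


# Two hypothetical clients: the second holds three times as many cells.
global_params = federated_average([[1.0, 2.0], [3.0, 6.0]], [1, 3])
```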
https://github.com/katwre/ML-projects/tree/main/federated_learning_scRNA-seq/
VAE, BERT, semi-supervised NMF and lasso/ridge/elastic net for the cell type deconvolution
This project studies cell-free DNA (cfDNA) fragments that circulate in the blood. These fragments originate from many different cell types across the body. When tissues are damaged or diseased, they release more DNA than usual, altering the overall composition of cfDNA in the bloodstream. By identifying which cell types the DNA comes from, we can gain an early view of tissue health and disease signals. I applied several deconvolution methods to estimate cell type proportions from bulk DNA methylation data. Regression-based approaches (NNLS, Lasso, Ridge, Elastic Net) model methylation profiles as mixtures of reference cell types. In addition, I developed:
- A variational autoencoder (VAE) that reconstructs CpG profiles while jointly predicting cell type proportions.
- A semi-supervised NMF (ssNMF) that anchors factorization to known reference signatures.
- A lightweight transformer model, treating CpG regions as tokens with embeddings and self-attention to capture genomic dependencies.
Figure: Deconvolution of the DNA methylation signal from blood DNA sequenced using Bisulfite-seq.
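The regression-based idea can be sketched as non-negative least squares: find non-negative cell type weights whose reference mixture best matches the observed methylation. The sketch below uses projected gradient descent in pure Python as a stand-in for a library NNLS solver; the function name, the toy reference matrix, and the final normalization to proportions are assumptions for illustration.

```python
def nnls_projected_gradient(reference, mixture, iterations=2000):
    """Estimate cell type proportions from bulk methylation.

    reference - rows are CpGs, columns are reference cell types
    mixture   - observed methylation value per CpG
    """
    n_cpgs, n_types = len(reference), len(reference[0])
    # 1 / trace(R^T R) is a conservative step size (trace >= largest eigenvalue).
    step = 1.0 / sum(v * v for row in reference for v in row)
    weights = [1.0 / n_types] * n_types
    for _ in range(iterations):
        residual = [
            sum(r * w for r, w in zip(row, weights)) - m
            for row, m in zip(reference, mixture)
        ]
        gradient = [
            sum(reference[i][k] * residual[i] for i in range(n_cpgs))
            for k in range(n_types)
        ]
        # Gradient step followed by projection onto the non-negative orthant.
        weights = [max(0.0, w - step * g) for w, g in zip(weights, gradient)]
    total = sum(weights)
    return [w / total for w in weights] if total > 0 else weights
```

Lasso, Ridge, and Elastic Net variants differ from this only in the penalty added to the squared error.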
https://github.com/katwre/ML-projects/blob/main/VAE_NMF_Transformer_regression_cfDNA/
GNN for spatial transcriptomics
My aim was to demonstrate how GNNs can capture spatially coherent patterns in gene expression and to compare these learned embeddings to traditional PCA and k-means-based clustering.
Spatial transcriptomics captures gene expression while preserving tissue architecture, enabling the study of cellular organization and microenvironments. However, identifying coherent spatial domains (regions with similar expression patterns and spatial context) remains challenging. Graph Neural Networks (GNNs) suit this type of data because they can model both gene expression features and spatial neighborhood relationships. In this project, I implemented a mini Graph Autoencoder (GAE) from scratch in PyTorch to learn unsupervised spatial embeddings of tissue spots from a toy Visium H&E dataset provided by Squidpy: a 10x Genomics Visium H&E mouse brain section (~2,700 spots, ~33k genes).
Figure: Baseline PCA + KMeans.
Figure: GNN-based clustering.
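The way a GNN combines expression features with spatial neighborhoods can be illustrated without PyTorch. The sketch below builds a k-nearest-neighbor graph from spot coordinates and applies one round of mean aggregation, the basic message-passing step. It is not the Graph Autoencoder itself; the function names and the choice of a k-NN graph are assumptions for this example.

```python
import math


def knn_graph(coords, k):
    """For each spot, return the indices of its k nearest spots (Euclidean)."""
    neighbors = []
    for i, p in enumerate(coords):
        others = sorted(
            (math.dist(p, q), j) for j, q in enumerate(coords) if j != i
        )
        neighbors.append([j for _, j in others[:k]])
    return neighbors


def aggregate(features, neighbors):
    """Replace each spot's vector by the mean of itself and its neighbors."""
    smoothed = []
    for i, nbrs in enumerate(neighbors):
        group = [features[i]] + [features[j] for j in nbrs]
        smoothed.append([sum(col) / len(group) for col in zip(*group)])
    return smoothed
```

After aggregation, nearby spots share more similar vectors, which is why clustering on graph-based embeddings tends to give spatially smoother domains than clustering on expression alone.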
https://github.com/katwre/ML-projects/tree/main/GNN_spatialomics/
LLM-based assistant for bioinformatics queries
This project explored an AI-powered assistant that helps researchers ask questions about biology in plain English and automatically turns them into SPARQL queries against public databases:
- UniProt - proteins, sequences, and annotations
- OMA - orthologs and evolutionary relationships
- Bgee - gene expression across species
The assistant is powered by LLMs (Mistral, Llama via Groq, Ollama) combined with retrieval-augmented generation (RAG) using Qdrant and FastEmbed. You can interact with the assistant either in the terminal/CLI or through a simple chat web app (Chainlit web UI). Key goals:
- Allow researchers to query complex biological knowledge bases through a simple web interface.
- Validate and execute queries automatically.
- Provide results summarized in plain language.
Figure: A web UI for prompting an LLM of choice (Mistral, Llama via Groq, Ollama).
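For a sense of what the assistant has to produce, here is a hand-written example of the kind of SPARQL query a question like "list a few human proteins" might translate to, plus the request URL for the public UniProt endpoint. The query text and helper names are illustrative assumptions, not output from the assistant, and the sketch only builds the URL without sending a request.

```python
from urllib.parse import urlencode

UNIPROT_ENDPOINT = "https://sparql.uniprot.org/sparql"


def human_proteins_query(limit=5):
    """Return a SPARQL query selecting proteins annotated to Homo sapiens (taxon 9606)."""
    return (
        "PREFIX up: <http://purl.uniprot.org/core/>\n"
        "PREFIX taxon: <http://purl.uniprot.org/taxonomy/>\n"
        "SELECT ?protein WHERE {\n"
        "  ?protein a up:Protein ;\n"
        "           up:organism taxon:9606 .\n"
        f"}} LIMIT {int(limit)}"
    )


def request_url(endpoint, query):
    """Encode the query as the standard SPARQL protocol 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})
```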
https://github.com/katwre/ML-projects/tree/main/llm-biodata/
Computational biology & algorithms
Protein Folding in the HP Model (Replica Monte Carlo)
Implementation of simulated annealing and replica exchange Monte Carlo algorithms for protein folding in the Hydrophobic-Polar (HP) model, written in Python with NumPy. The HP model simplifies protein folding by placing hydrophobic (H) and polar (P) amino acids on a square lattice. The Metropolis–Hastings algorithm samples protein configurations according to the Boltzmann distribution.
Figure: HP model protein folding schematic in 2D lattice. Filled, black circles represent hydrophobic residues while unfilled circles represent polar residues. The conformation above yields an optimal energy score in the HP model of -2. The two hydrophobic contacts contributing to the score are between residues 4 and 13 and between residues 5 and 12 (Thachuk et al. 2007).
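The core quantities can be sketched in a few lines of plain Python: the HP energy of a lattice conformation, the Metropolis acceptance probability for a proposed move, and the swap probability between two replicas. Function names are illustrative and temperatures are in units where the Boltzmann constant is 1; this is a sketch under those assumptions rather than the repository code.

```python
import math


def hp_energy(sequence, coords):
    """-1 for every pair of H residues adjacent on the lattice but not in the chain."""
    position = {xy: i for i, xy in enumerate(coords)}
    if len(position) != len(coords):
        raise ValueError("conformation overlaps itself")
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look right and up only, so each lattice pair is counted once.
        for neighbor in ((x + 1, y), (x, y + 1)):
            j = position.get(neighbor)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return -contacts


def metropolis_probability(delta_e, temperature):
    """Probability of accepting a move that changes the energy by delta_e."""
    if delta_e <= 0:
        return 1.0
    return math.exp(-delta_e / temperature)


def swap_probability(e_i, t_i, e_j, t_j):
    """Probability of exchanging conformations between replicas i and j."""
    exponent = (1.0 / t_i - 1.0 / t_j) * (e_i - e_j)
    return 1.0 if exponent >= 0 else math.exp(exponent)
```

For example, the four-residue chain "HPPH" folded into a unit square has one H-H contact and therefore energy -1, while the fully extended chain has energy 0.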
https://github.com/katwre/bioinformatics-projects/tree/master/Molecular_Dynamics
Genome assembly (de Bruijn graph, Eulerian walk)
Implementation of de Bruijn graph-based genome assembly, using an Eulerian walk to reconstruct DNA sequences from k-mers. It follows the short-read assembly principles described by Compeau et al. (2011) and Pevzner et al. (2001).
Figure: a schematic example of creating a de Bruijn graph from a DNA sequence containing repeats (Compeau et al. 2011).
My focus was on modern short-read assembly algorithms, which construct a de Bruijn graph by representing all k-mer prefixes and suffixes as nodes and then drawing edges that represent the k-mers having a particular prefix and suffix. For example, the k-mer edge ATG has prefix AT and suffix TG. Finding an Eulerian cycle allows one to reconstruct the genome by forming an alignment in which each successive k-mer (from successive edges) is shifted by one position. This generates the same cyclic genome sequence without performing the computationally expensive task of finding a Hamiltonian cycle (as shown in the figure below).
Figure: Two strategies for genome assembly: from Hamiltonian cycles to Eulerian cycles (Pevzner et al. 2001). My focus was on the Eulerian cycle (subfigure d).
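The construction described above can be sketched in plain Python: k-mers become edges between their (k-1)-mer prefix and suffix, and Hierholzer's algorithm finds an Eulerian path through them. This is a minimal illustration assuming error-free k-mers and that an Eulerian path exists; the function names are not taken from the repository.

```python
from collections import defaultdict


def de_bruijn_graph(kmers):
    """Map each prefix node to the suffix nodes its k-mers point to."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])  # e.g. ATG gives edge AT -> TG
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; visits every edge exactly once."""
    in_degree = defaultdict(int)
    for targets in graph.values():
        for node in targets:
            in_degree[node] += 1
    nodes = set(graph) | set(in_degree)
    # A linear genome starts at the node with one more outgoing than incoming edge.
    start = next(
        (n for n in nodes if len(graph.get(n, [])) - in_degree.get(n, 0) == 1),
        None,
    )
    if start is None:  # cyclic case: any node with edges works
        start = next(iter(graph))
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(kmers):
    """Spell the sequence along the path: first node plus last base of each next node."""
    path = eulerian_path(de_bruijn_graph(kmers))
    return path[0] + "".join(node[-1] for node in path[1:])
```

When the sequence contains repeats, several Eulerian paths may exist, so the assembly is guaranteed to reproduce the input k-mers but not necessarily the original order of the repeated segments.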
https://github.com/katwre/bioinformatics-projects/tree/master/genome_assembly
Evolutionary tree estimation - Felsenstein & NNI
Implementation of Felsenstein's tree-pruning algorithm and the Nearest-Neighbor Interchange (NNI) algorithm. Felsenstein's tree-pruning is a dynamic-programming algorithm for computing the likelihood of an evolutionary tree from nucleic acid sequence data; here it is applied to rooted binary phylogenetic trees under the Jukes-Cantor substitution model. NNI is a heuristic devoted to searching for an optimal tree structure, and it is one of the simplest tree-rearrangement methods.
Figure: An example of applying an NNI to a subsplit directed acyclic graph (sDAG) (Jennings-Shaffer et al. 2025).
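The likelihood computation can be sketched for a single alignment column as follows. The tree encoding (a leaf is a name, an internal node is a tuple of two children with their branch lengths) and the function names are assumptions made for this example; the Jukes-Cantor probabilities use branch lengths in expected substitutions per site.

```python
import math

BASES = "ACGT"


def jc69(t):
    """Jukes-Cantor probabilities of keeping / changing a base over branch length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay, 0.25 - 0.25 * decay


def partial_likelihoods(node, site):
    """Likelihood of the data below `node`, conditional on each base at `node`.

    node - a leaf name (str) or a tuple (left, t_left, right, t_right)
    site - dict mapping leaf names to the observed base in one alignment column
    """
    if isinstance(node, str):
        return [1.0 if base == site[node] else 0.0 for base in BASES]
    left, t_left, right, t_right = node
    result = [1.0] * 4
    for child, t in ((left, t_left), (right, t_right)):
        child_l = partial_likelihoods(child, site)
        same, diff = jc69(t)
        total = sum(child_l)
        for x in range(4):
            # sum over child bases y of P(x -> y) * L_child(y)
            result[x] *= diff * total + (same - diff) * child_l[x]
    return result


def site_likelihood(tree, site):
    """Combine root partial likelihoods with the uniform base frequencies."""
    return sum(0.25 * value for value in partial_likelihoods(tree, site))
```

A tree search such as NNI would call this likelihood repeatedly, once per candidate topology, and keep the rearrangement that scores best.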
https://github.com/katwre/bioinformatics-projects/tree/master/comparative_genomics
Regulatory DNA discovery - MSA & binomial enrichment
Bio Motif Ensembl is a Python tool for discovering potential regulatory DNA regions across related mammalian genomes using Ensembl’s public MySQL databases. It retrieves orthologous gene sequences (e.g., human, mouse, rat), aligns their upstream regions, and detects conserved non-coding segments. These conserved blocks are then analyzed with motif discovery algorithms such as MEME and AlignACE, and tested for statistical overrepresentation using a binomial model. The framework integrates comparative genomics, multiple sequence alignment, and motif enrichment to identify functionally significant regulatory elements.
Figure: It's a similar concept to the approach published by MacIsaac et al. 2025.
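The binomial overrepresentation test can be sketched as an upper-tail probability: given n conserved blocks and a background probability p that a block contains the motif by chance, how surprising is it to see the motif in k or more of them? The function name and the example numbers are assumptions for illustration.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the enrichment p-value."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical example: motif found in 8 of 20 conserved blocks,
# with a 10% chance of a background occurrence per block.
p_value = binomial_tail(8, 20, 0.1)
```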
https://github.com/katwre/bioinformatics-projects/tree/master/bio_motif_ensembl
Games: Sudoku (JavaScript) and Minesweeper (Java)
Sudoku
A simple Sudoku game implemented in JavaScript and jQuery.
https://github.com/katwre/sudoku
Minesweeper
A classic Minesweeper game implemented in Java using the Swing and AWT libraries.
https://github.com/katwre/Minesweeper
Django-Based Web Services
Django-based server for Multiple Sequence Alignment (MSA) visualization - https://github.com/freesci/MSA-vis-project
Mobile application using Django, manifesto app, and localStorage - https://github.com/katwre/phone_application
Discover Your Career Match (Pyodide)
Interactive tool that matches careers to users based on their personality profile (Big Five personality traits). Runs directly in the browser via Pyodide.
Figure: PCA plot showing career matches based on personality profile.