Portfolio
- Research
- Open Source Software
- Freelance
-
Side Projects
- Survival analysis with gene expression data
- CNNs and transfer learning for chest X-rays classification
- Autoencoder for scRNA-seq imputation
- Federated VAE for batch effect correction
- AI assistant for SPARQL bioinformatics queries
- Cell type deconvolution with VAE, NMF, BERT, regression
- Protein folding simulation (Replica Monte Carlo)
- Genome assembly with de Bruijn graph
- Sudoku in JavaScript
- Minesweeper in Java
- Django-Based Web Services (for Multiple Sequence Alignment and mobile app)
- Career-match browser tool (Pyodine)
Research
Bias interpretation in genomics
We used machine learning models - elastic net regression and principal component analysis (PCA)- to investigate genomic regions called 'HOT regions' which appear to attract unusually high numbers of proteins and are likely technical artifacts of chromatin immunoprecipitation followed by sequencing (ChIP-seq) experiments.
While factors like antibody quality and chromatin interactions are known to affect ChIP-seq reliability, our study revealed that GC- and CpG-rich sequences, DNA methylation, and RNA:DNA hybrids (R-loops) also contribute to these artifacts across species. This work shows how machine learning can uncover hidden biases in genomic data and improve experimental interpretation.
Publication: Wreczycka K et al, Nucleic Acids Research, 2019
Liquid biopsy epigenetics in disease
DNA methylation biomarkers in acute coronary syndrome (blood-derived cfDNA)
We explored circulating cell-free DNA (cfDNA) methylation as a non-invasive biomarker for acute coronary syndrome (ACS), based on the principle that damaged tissues release DNA into the bloodstream.
Using cfDNA methylation profiles, we differentiated ACS subtypes and identified cell type-specific DNA methylation markers to trace the origin of cfDNA. Hundreds of methylation markers linked to cardiovascular conditions and inflammation were identified and validated in an independent cohort, highlighting the potential of cfDNA methylation for ACS diagnosis.
Publication: Rafael R C Cuadrat et al, NAR Genomics and Bioinformatics, 2023
DNA methylation profiling in neuroblastoma (solid tissues and urine-derived cfDNA)
Neuroblastoma is a pediatric cancer ranging from mild to aggressive forms. While genetic changes explain some variability, we showed that DNA methylation plays a key role in its progression. In collaboration with Charité Hospital (Berlin), we analyzed primary tumor tissues and urine cfDNA using bisulfite-seq and RNA-seq, identifying methylation patterns distinguishing high- and low-risk tumors. We also linked MYCN-driven methylation changes to disrupted transcription factor networks, highlighting potential targets for therapies.
Figure: Methylation-based clustering of neuroblastoma patients using differentially methylated CpGs.
Open source software

genomation – a Bioconductor R package designed to simplify genomic feature and interval analysis. It includes functions for reading BED/GFF files as GRanges, summarizing features over regions, creating enrichment plots or heatmaps, and annotating regions with exons, introns, or promoters.
https://github.com/BIMSBbioinfo/genomation, developed in the team of Dr. Altuna Akalin at Bioinformatics and Omics Data Science Platform at MDC BIMSB.

PiGx – a collection of genomics pipelines implemented using Snakemake, Python, and R. Each pipeline is easily configured with a sample sheet and a simple settings file. PiGx generates comprehensive, interactive HTML reports that summarize key findings from your samples.
https://github.com/BIMSBbioinfo/pigx, developed in the team of Dr. Altuna Akalin at Bioinformatics and Omics Data Science Platform at MDC BIMSB.

motifActivity – an R package for identifying key transcription factors (TFs) responsible for changes in gene expression or epigenetic marks across samples. It predicts TF activity profiles using input data from RNA-seq, BS-seq, ChIP-seq, ATAC-seq, and similar methods, combined with a set of DNA motifs.
https://github.com/katwre/motifActivity, developed in the team of Dr. Altuna Akalin at Bioinformatics and Omics Data Science Platform at MDC BIMSB.
Freelance
Prioritization of therapeutic targets in clinical trials
Visualization and survival analysis of biomarkers
We developed interactive visualizations, including oncoprints, to highlight key biomarkers in patients with limited treatment options. These visual summaries help uncover genomic alterations and support identifying new therapeutic targets.
We focused on patients from clinical trial databases facing poor outcomes or lacking effective therapies. Our statistical analyses, including survival analysis, demonstrate the clinical relevance of nominated targets.
Figure: Example of biomarker visualization and survival analysis.
Machine learning/AI for target identification
To prioritize therapeutic targets, we applied Positive and Unlabeled (PU) learning, ideal for cases where only confirmed targets are known. PU classifiers helped distinguish potential targets using gene expression, mutations, and therapy annotations.
Figure: PU learning principle (figure adapted from a blogpost).
Additionally, we utilized autoencoders to uncover hidden patterns and prioritize key molecular features in an unsupervised way.
Figure: Schematic of a Variational Autoencoder (figure adapted from a blogpost).
Web app feature development
I contributed to enhancing the IGV web application, an interactive tool for visual exploration of genomic data (source code). Built with JavaScript and Python, this tool allows visualization of both public and in-house datasets.
- Enabled dynamic visualization of new in-house genomic datasets.
- Added highlighting of genomic regions of interest (e.g., genetic variants).
- Developed new display options for RefSeq and GENCODE annotations:
- Collapse/expand all transcript isoforms.
- Extend selected gene isoforms for detailed view.
- Added controls to adjust track widths for optimal display.
- Linked visualized tracks to their source databases.
- Implemented command-line tool for automated snapshots of defined genes or regions.

Side projects
Survival analysis with clinical and gene expression data
I developed several survival models to predict the risk of mortality or relapse in newly diagnosed multiple myeloma patients, using baseline clinical and/or gene expression data.

Figure: C-index comparison of multiple survival models.

Figure: Kaplan–Meier plot of the best performing model.
https://github.com/katwre/survival_analysis/tree/main/
CNNs and transfer learning for image classification tasks based on chest X-rays
I applied convolutional neural networks (CNNs) to classify chest X-ray images using both 224×224 and 64×64 pixel inputs, aiming to explore whether lightweight models can still retain sufficient diagnostic power for image-based classification tasks. Beyond training a basic CNN from scratch, transfer learning was employed by leveraging pretrained convolutional backbones such as ResNet, to assess whether pretrained models can further enhance classification performance when applied to chest X-ray images.

Healthy

Pneumonia
Figure: Example X-ray images of a healthy individual and a pneumonia patient.
https://github.com/katwre/ML-projects/blob/main/CNN_and_TransferLearning_Xray/
Autoencoder for scRNA-seq dimensionality reduction and data imputation
I developed a simple autoencoder with a custom loss function for imputing missing values in single-cell RNA-seq data. The approach was inspired by the method proposed by Badsha et al. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7144625/).

Figure: Imputed scRNA-seq.

Figure: Model's output and the true gene expression values. Non-imputed data (blue): where the model reconstructed known values. Imputed data (orange): where the model predicted missing (masked) values.
https://github.com/katwre/ML-projects/tree/main/autoencoder_scRNAseq/
Variational autoencoder (VAE) to mitigate batch effects in scRNA-seq using federated learning
This project explored a scVI model (variational autoencoder for single-cell data) in a federated setting with secure aggregation using the Flower framework (Flower.ai) and the SecAgg+ secure aggregation protocol. For comparison, the same model was also trained in a centralized setting.

Figure: Baseline Gene Expression UMAP.

Figure: Centralized scVI Model.

Figure: Federated scVI Model.
https://github.com/katwre/ML-projects/tree/main/federated_learning_scRNA-seq/
LLM-powered SPARQL Bioinformatics Assistant
This project explored an AI-powered assistant that helps researchers ask questions about biology in plain English and automatically turns them into SPARQL queries against public databases:
- UniProt (proteins, sequences, annotations)
- OMA (orthologs / evolutionary relationships)
- Bgee (gene expression in species)
The assistant is powered by LLMs (Mistral, Llama via Groq, Ollama) combined with retrieval-augmented generation (RAG) using Qdrant and FastEmbed. You can interact with the assistant either in the terminal/CLI or through a simple chat web app (Chainlit web UI).
Key goals:
- Allow researchers to query complex biological knowledge bases witha nice web interface.
- Validate and execute queries automatically.
- Provide results summarized in plain language.

Figure: A web UI for an LLM of choice (Mistral, Llama via Groq, Ollama).
https://github.com/katwre/ML-projects/tree/main/llm-biodata/
VAE, BERT, semi-supervised NMF and lasso/ridge/elastic net for the cell type deconvolution
This project studies DNA fragments that circulate in the blood. These fragments come from many different cell types in the body. When tissues are damaged or diseased, they release more DNA than usual, so the mix of DNA in the blood changes.By figuring out which cell types the DNA comes from, we can get an early picture of tissue health. I applied multiple deconvolution methods to estimate cell type proportions from bulk DNA methylation data. Regression-based approaches (NNLS, Lasso, Ridge, Elastic Net) model methylation profiles as mixtures of reference cell types. In addition, I developed: - A variational autoencoder (VAE) that reconstructs CpG profiles while jointly predicting cell type proportions. - A semi-supervised NMF (ssNMF) that anchors factorization to known reference signatures. - A lightweight Transformer model, treating CpG regions as tokens with embeddings and self-attention to capture genomic dependencies.

Figure: Deconvolution of the DNA methylation signal from blood DNA sequenced using Bisulfite-seq.
https://github.com/katwre/ML-projects/blob/main/VAE_NMF_Transformer_regression_cfDNA/
Protein Folding in the HP Model
Implementation of simulated annealing and replica exchange Monte Carlo algorithm for protein folding in the HP model in Python and NumPy. The HP model simplifies protein folding by using hydrophobic (H) and polar (P) amino acids on a square lattice. Metropolis–Hastings algorithm enables sampling protein configurations based on the Boltzmann distribution.

Figure: Lattice HP model showing global energy.
https://github.com/katwre/bioinformatics-projects/tree/master/Molecular_Dynamics
Genome assembly with de Bruijn graph
Implementation of de Bruijn graph-based genome assembly with Eulerian walk to reconstruct DNA sequences from k-mers. Includes short-read assembly principles based on publications by Compeau et al. (2011) and Pevzner et al. (2001)

Figure: De Bruijn graph.
https://github.com/katwre/bioinformatics-projects/tree/master/genome_assembly
Sudoku
A simple Sudoku game implemented in JavaScript and JQuery.

https://github.com/katwre/sudoku
Minesweeper
A classic Minesweeper game implemented in Java using SWING and AWT libraries.

https://github.com/katwre/Minesweeper
Django-Based Web Services
Django-based server for Multiple Sequence Alignment (MSA) visualization - https://github.com/freesci/MSA-vis-project
Mobile application using Django, manifesto app, and localStorage - https://github.com/katwre/phone_application
Discover Your Career Match
Interactive tool that matches careers to users based on their personality profile (Big Five personality traits). Runs directly in the browser via Pyodide.

Figure: PCA plot showing career matches based on personality profile.