AI Model Finds Marks on DNA

Summary
- Tumor development is often affected by molecular marks on DNA that change cellular function.
- The genome contains millions of so-called methyl marks, and measuring where they are is a challenge.
- A new AI model to support the task is unveiled in a study led by Sylvester researchers.
A new artificial intelligence model successfully predicts the layout of a key feature of the human genome: its modification by methylation, a molecular mark added to DNA.
Methyl marks can have a powerful effect on DNA, often shutting down a gene entirely. And the location of these marks on DNA can tell scientists a lot about a cell, such as whether it is cancerous.
“Studying DNA methylation is extremely relevant, and it’s one of the things that has been demonstrated to capture cellular identity most accurately, such as whether a cell is benign or malignant,” said Maria “Ken” Figueroa, M.D., associate director of translational research at Sylvester Comprehensive Cancer Center, part of the University of Miami Miller School of Medicine.

The new model promises to accelerate research on the link between DNA methylation and cancer, as well as other conditions, said Dr. Figueroa, also professor of biochemistry and molecular biology at the Miller School. The model was built using technology similar to that behind well-known “generative” AI tools such as the image generator DALL-E.
The model was unveiled in a study in Science Advances on April 11. Dr. Figueroa is co-lead author of the study along with Yan Guo, Ph.D., professor of public health sciences at the Miller School and director of the Biostatistics and Bioinformatics Shared Resource at Sylvester.
Studying DNA Methylation
Dr. Guo and Dr. Figueroa began discussing the project in 2023, when Dr. Guo was first interviewing for a position at Sylvester.
“I asked if any of her data has missing parts. And if we could find a way to impute that, would it be useful to her?” Dr. Guo said.

Imputing is the process of replacing missing values in a dataset with estimates. And that is just what Dr. Figueroa needed for her studies on DNA methylation and its role in cancer biology.
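To make the idea of imputation concrete, here is a minimal Python sketch of a deliberately naive approach: each missing value is replaced with the average of the nearest measured values on either side. This is not the method used in the study. The function name and the sample numbers are hypothetical, chosen only for illustration.

```python
from typing import List, Optional


def impute_from_neighbors(values: List[Optional[float]]) -> List[float]:
    """Fill each missing entry (None) with the mean of the nearest observed
    values to its left and right. A naive baseline, for illustration only."""
    observed = [i for i, v in enumerate(values) if v is not None]
    if not observed:
        raise ValueError("need at least one observed value")

    filled = []
    for i, v in enumerate(values):
        if v is not None:
            filled.append(v)  # measured values are kept unchanged
            continue
        # Nearest observed positions on each side (None if there is no such side).
        left = max((j for j in observed if j < i), default=None)
        right = min((j for j in observed if j > i), default=None)
        neighbors = [values[j] for j in (left, right) if j is not None]
        filled.append(sum(neighbors) / len(neighbors))
    return filled


# Hypothetical methylation fractions at five neighboring sites; None = not measured.
measurements = [1.0, None, 0.5, None, None]
print(impute_from_neighbors(measurements))
```

A simple rule like this ignores everything except the nearest measurements. A learned model can draw on much richer context to make its estimates.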
Studying DNA methylation is challenging.
Methyl groups are added to areas of DNA called CpG sites. There are more than 28 million CpG sites in the DNA of each cell of the body, and more than half of these sites are typically methylated. Moreover, the location of methyl marks varies with cell type, disease state, age and other physiological factors.
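A CpG site is a spot where a cytosine (C) is immediately followed by a guanine (G) along one DNA strand. As a small illustration, the Python sketch below finds CpG sites in a short, made-up sequence. The function name and the example sequence are hypothetical, and real genome analyses work on far larger inputs with dedicated tools.

```python
from typing import List


def find_cpg_sites(sequence: str) -> List[int]:
    """Return the 0-based positions of every 'CG' dinucleotide in a DNA string."""
    seq = sequence.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


# Made-up example sequence, not real genomic data.
example = "ACGTCGCGA"
sites = find_cpg_sites(example)
print(f"{len(sites)} CpG sites at positions {sites}")
```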
Dr. Figueroa likens the DNA sequence to a laptop and methylation to interchangeable software that changes how the laptop can be used.
Determining exactly how that software is working can be expensive. Techniques that accurately capture the location of methylated CpG sites typically cost thousands of dollars per sample. That can add up quickly, especially for large-scale use in clinical trials.
Dr. Guo set out to build a model to generate accurate data from less expensive techniques that only capture a fraction of methylated CpG sites.
AI to Build a Research Model
To build the model, Dr. Guo and his colleagues leveraged an AI technique called diffusion.
Diffusion models start with random noise and iteratively refine it until it resembles realistic data. DALL-E, the popular generative AI tool used to create images from prompts, is also powered by a diffusion model.
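The "start from noise, then refine step by step" loop can be sketched in a few lines of Python. This is a toy illustration under strong assumptions, not the DiffuCpG implementation. In a real diffusion model a trained neural network predicts the noise at each step. Here a hypothetical "oracle" that already knows the answer stands in for that network, so only the refinement loop itself is shown (a deterministic, DDIM-style update). All function names and the noise schedule are assumptions made for this example.

```python
import math
import random
from typing import Callable, List


def make_alpha_bars(steps: int) -> List[float]:
    """Signal-retention schedule: alpha_bars[0] == 1.0 (clean data), shrinking
    toward 0 (almost pure noise) at the last step. Schedule values are assumed."""
    betas = [1e-4 + (0.2 - 1e-4) * t / (steps - 1) for t in range(steps)]
    alpha_bars = [1.0]
    for beta in betas:
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alpha_bars


def oracle_noise_predictor(target: List[float]) -> Callable:
    """Stand-in for a trained network: it 'predicts' the noise exactly because
    it is handed the true answer. A real model must learn this from data."""
    def predict(x: List[float], t: int, alpha_bars: List[float]) -> List[float]:
        ab = alpha_bars[t]
        return [(xi - math.sqrt(ab) * ti) / math.sqrt(1.0 - ab)
                for xi, ti in zip(x, target)]
    return predict


def reverse_diffusion(predict_noise: Callable, length: int,
                      steps: int = 50, seed: int = 0) -> List[float]:
    """Start from random noise and refine it step by step."""
    rng = random.Random(seed)
    alpha_bars = make_alpha_bars(steps)
    x = [rng.gauss(0.0, 1.0) for _ in range(length)]  # pure noise
    for t in range(steps, 0, -1):
        ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
        eps = predict_noise(x, t, alpha_bars)
        # Estimate the clean data implied by the predicted noise...
        x0_hat = [(xi - math.sqrt(1.0 - ab_t) * ei) / math.sqrt(ab_t)
                  for xi, ei in zip(x, eps)]
        # ...then move to the next, slightly less noisy level.
        x = [math.sqrt(ab_prev) * x0 + math.sqrt(1.0 - ab_prev) * ei
             for x0, ei in zip(x0_hat, eps)]
    return x


# Hypothetical methylation fractions the oracle "knows".
target = [0.0, 1.0, 0.5]
result = reverse_diffusion(oracle_noise_predictor(target), len(target))
print([round(v, 6) for v in result])
```

Because the stand-in predictor is perfect, the loop recovers the target values. With a learned predictor, the output is instead a plausible sample shaped by whatever patterns the model absorbed during training.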
The new model is dubbed DiffuCpG. It enables researchers to estimate, or “impute,” the location of methylated CpGs. The model outperformed other methods for estimating the locations of these marks, according to the study.
“The algorithm fills in the missing data,” said Dr. Figueroa. “This is now going to allow us to get very comprehensive models of the DNA methylation landscape without having to break the bank.”
That will speed up cancer research. DNA methylation patterns can predict responses to certain therapies and estimate outcomes such as probability of survival. DNA methylation can also be used diagnostically, for example, to identify certain subtypes of leukemia or signs of colorectal cancer in stool-based tests.
“These types of applications will now be much more powerful, because we can collect more data,” said Dr. Figueroa.
She is already applying the model to fill out existing, incomplete DNA methylation datasets in her lab.
Filling in the Missing DNA Methylation Data

One of the brains behind the new model is Sylvester computer scientist Fengyao Yan, Ph.D. Dr. Yan was inspired by a 2022 publication in his field that described an AI diffusion model trained to reconstruct a human face using profile images with portions of the face cut out. The model learned to fill in the missing parts of the face.
“It was an immediate click for me,” said Dr. Yan. “This model could potentially be used to impute methylation data, in which the methylation probability pattern is clear and the distribution can be learned through a large dataset.”
CpG methylation is influenced by several types of information in the genome, such as the pattern of DNA sequences and the three-dimensional architecture of the DNA as it folds up within a cell. Such information was built into the model, which was then trained on high-quality data from 26 patients with acute myeloid leukemia. The model was validated and tested on separate datasets.
The researchers are now aiming to use a similar approach for other types of biological data.
“In theory, the model should be suitable for any type of data imputation, provided that the underlying data distribution pattern is strong,” said Dr. Yan, who is first author of the new study.
In a project with Sylvester computational oncologist Michele Ceccarelli, professor of public health sciences at the Miller School, the researchers are applying the model to detect cell boundaries in certain microscopy-based datasets (for spatial transcriptomics, which measures RNA expression and location in a cell).
The prospect of high-level collaborations on a variety of scientific questions is what ultimately convinced Dr. Guo to take the job at Sylvester.
“People here don’t just stay in their own lab. They talk to other people. Here, I always have researchers contacting me about using the latest technology to do things better,” said Dr. Guo. “AI is probably the most exciting thing that has happened in biology in the last three years.”