Data-Guided Control for Cellular Reprogramming with MATLAB
By Dr. Indika Rajapakse
“MATLAB is the best environment we have for exploring the networks that define how cellular reprogramming works—and how we can harness it to help transform cancer treatment and regenerative medicine.”
Chemotherapy is one of our most powerful weapons in the fight against leukemia and other cancers, but it can be a double-edged sword. In the process of destroying cancer cells, it frequently wipes out the immune system too. After administering this therapy, doctors often need to “treat the treatment,” performing a bone marrow transplant to help patients recover by rebuilding the immune system. These transplants introduce their own challenges because finding a compatible donor isn’t easy, and even when a match is found, serious complications such as graft-versus-host disease can arise.
Now imagine a different path. What if we could transform some of the patient’s own skin cells into the marrow cells they need? The dual challenges of finding a donor and avoiding rejection of donor tissue would be eliminated at once. That’s the promise of cellular reprogramming, and it’s at the heart of my research group’s work at the University of Michigan. We are developing methods for converting one type of cell directly into another, using molecules called transcription factors (proteins that help turn genes on or off) to reset a cell’s identity.
From an engineering perspective, this can be formulated as a classic control problem. If the current state of the system is a skin cell and the goal is a bone marrow cell, how do we guide the system to the goal? We model it as ẋ = f(x, u), where x is the state of the cell and u represents the amount and timing of the transcription factors we apply. To find the right set of factors—and when to apply them—we run experiments that generate significant amounts of raw data, including RNA sequencing (RNA-Seq) data, 3D genome organization data, and data on transcription factor binding.
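To make the formulation concrete, the short Python sketch below integrates ẋ = f(x, u) with forward Euler for a toy two-dimensional state. The linear form f(x, u) = Ax + Bu, the matrices A and B, the step size, and the constant input are all illustrative assumptions, not the models or data used in the research described here.

```python
def f(x, u, A, B):
    """Toy dynamics f(x, u) = A x + B u (the linear form is an assumption)."""
    n = len(x)
    m = len(u)
    return [
        sum(A[i][j] * x[j] for j in range(n)) + sum(B[i][k] * u[k] for k in range(m))
        for i in range(n)
    ]


def simulate(x0, u_of_t, A, B, dt=0.01, steps=1000):
    """Integrate x' = f(x, u) with forward Euler and return the final state."""
    x = list(x0)
    for step in range(steps):
        u = u_of_t(step * dt)
        dx = f(x, u, A, B)
        x = [xi + dt * dxi for xi, dxi in zip(x, dx)]
    return x


def constant_input(t):
    """Hypothetical input: one factor applied at a constant level."""
    return [1.0]


# Hypothetical system: both state components decay toward zero without input.
A = [[-1.0, 0.0], [0.0, -1.0]]
# The single input acts only on the second state component.
B = [[0.0], [1.0]]

x_start = [1.0, 0.0]  # toy "initial cell state"
x_end = simulate(x_start, constant_input, A, B)
print(x_end)
```

In this toy system the first component decays toward 0 while the input drives the second toward 1, so the state moves from near (1, 0) to near (0, 1). Choosing u to reach a prescribed target state is the control problem in miniature.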
The success of our research depends on our ability to process and analyze the complex biological data sets we collect—often large, matrix-based, and high-dimensional. It also depends on our ability to make sense of the networks underlying gene regulation and genome organization. In our work, genes and genomic regions are network nodes; their interactions—physical contacts, regulatory influence, co-expression—form the edges. These networks can be simple graphs or more intricate hypergraphs, and both are naturally represented as matrices. These matrices are the core data structures passed into algorithms and models, many of which rely on eigenvalue decomposition, singular value decomposition (SVD), and other linear algebra operations—all of which are naturally and efficiently handled in MATLAB®. We have relied on MATLAB for years to build the workflows that form the foundation of our work, and, more recently, we’ve begun using the Biopipeline Designer app to define and run portions of these as bioinformatics pipelines.
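As a small illustration of a network stored as a matrix, the Python sketch below builds the adjacency matrix of a four-node graph and estimates its dominant eigenvalue and eigenvector. It uses power iteration, a simple stand-in for the full eigenvalue decomposition mentioned above, and the graph itself is a made-up example rather than real genomic data.

```python
def adjacency_matrix(n_nodes, edges):
    """Build a symmetric 0/1 adjacency matrix from undirected edges."""
    A = [[0.0] * n_nodes for _ in range(n_nodes)]
    for i, j in edges:
        A[i][j] = 1.0
        A[j][i] = 1.0
    return A


def mat_vec(A, v):
    """Multiply matrix A by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def leading_eigenpair(A, iterations=200):
    """Approximate the dominant eigenvalue and eigenvector by power iteration."""
    v = [1.0] * len(A)
    for _ in range(iterations):
        w = mat_vec(A, v)
        norm = sum(wi * wi for wi in w) ** 0.5
        v = [wi / norm for wi in w]
    # Rayleigh quotient of the final unit vector gives the eigenvalue estimate.
    lam = sum(vi * wi for vi, wi in zip(v, mat_vec(A, v)))
    return lam, v


# Hypothetical network: a triangle (nodes 0, 1, 2) with node 3 attached to node 2.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
A = adjacency_matrix(4, edges)
lam, v = leading_eigenpair(A)
print(lam, v)
```

The entries of the leading eigenvector act as a centrality score: node 2, which touches every other node, receives the largest value. Power iteration is used only to keep the sketch dependency-free, and it assumes a connected, non-bipartite graph such as this one.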
A Brief History of Cellular Reprogramming and the Emergence of Data-Guided Control
Although the idea of reprogramming cells had been around for decades, the field of cellular reprogramming took a dramatic leap forward in 2006 when Shinya Yamanaka showed that just four transcription factors could revert a mature skin cell back to a pluripotent, stem cell–like state. His discovery of induced pluripotent stem cells earned a Nobel Prize and, in my view, completely changed the paradigm of biology. Interestingly, while Yamanaka’s work captured global attention, this wasn’t the first demonstration that cells could fundamentally change their identity and function. About twenty years earlier, Harold Weintraub, working at the Fred Hutchinson Cancer Research Center (where I later completed my postdoctoral fellowship in Genome Cell Biology), had demonstrated that one mature cell type could be directly converted into another, bypassing the pluripotent state entirely. He published that work in 1987, but at the time, the field wasn’t quite ready to absorb what he had discovered or to appreciate its brilliance.
Building on the vision of these early pioneers, our lab has focused its efforts on direct reprogramming and, in particular, on how to make it more reliable and predictable. Among our principal contributions is a framework called data-guided control (Figure 1). This approach optimizes the use of transcription factors in cellular reprogramming by employing principles from mathematical control theory.
Figure 1. An overview of data-guided control, including a summary of control equation variables (A), the representation of topologically associating domains (TADs) as nodes in a dynamic network with edges determined from time-series RNA-Seq data (B), and a conceptual illustration of identifying a set of transcription factors (TFs) that push the cell state from one basin to another (C).
In data-guided control, we construct models for the natural evolution of cell populations by sampling gene expression at multiple time points throughout the cell cycle. To manage complexity, we cluster gene expression based on topologically associating domains (TADs) and model the dynamics of their expression levels. (TADs are regions of the genome that physically interact within themselves more frequently than with outside regions, forming discrete three-dimensional structural units.) To build these dynamical models, we integrate Hi-C data—which maps physical interactions between different regions of the genome—with RNA-Seq data that tracks how gene expression changes over time (Figure 2). The models, combined with data on transcription factor binding sites and activity, enable us to systematically identify the most promising transcription factor candidates for specific reprogramming tasks.
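The steps above can be sketched in a few lines of Python. The example sums gene-level expression into one value per TAD, then ranks candidate transcription factors by how close each one drives the TAD-level state to a target state. The discrete-time linear model x_next = A x + b u, the matrix A, the per-factor effect vectors b, and every gene, factor, and number shown are hypothetical placeholders, not the fitted models or data described here.

```python
def tad_expression(gene_expr, gene_to_tad, n_tads):
    """Sum gene-level expression into one value per TAD."""
    x = [0.0] * n_tads
    for gene, value in gene_expr.items():
        x[gene_to_tad[gene]] += value
    return x


def step(x, A, b, u):
    """One step of the assumed model x_next = A x + b * u."""
    return [sum(a * xj for a, xj in zip(row, x)) + bi * u for row, bi in zip(A, b)]


def final_distance(x0, target, A, b, u=1.0, steps=20):
    """Euclidean distance to the target after applying a constant input."""
    x = list(x0)
    for _ in range(steps):
        x = step(x, A, b, u)
    return sum((xi - ti) ** 2 for xi, ti in zip(x, target)) ** 0.5


def rank_candidates(x0, target, A, candidates):
    """Order candidate factors from closest to farthest from the target."""
    scores = {name: final_distance(x0, target, A, b) for name, b in candidates.items()}
    return sorted(scores, key=scores.get)


# Hypothetical data: three genes grouped into two TADs.
gene_expr = {"g1": 2.0, "g2": 1.0, "g3": 0.5}
gene_to_tad = {"g1": 0, "g2": 0, "g3": 1}
x0 = tad_expression(gene_expr, gene_to_tad, n_tads=2)

A = [[0.5, 0.0], [0.0, 0.5]]   # assumed natural dynamics of the two TADs
target = [1.0, 4.0]            # assumed target cell state
candidates = {                 # assumed effect of each factor on each TAD
    "TF_a": [0.5, 0.0],
    "TF_b": [0.0, 2.0],
    "TF_c": [0.5, 2.0],
}
ranking = rank_candidates(x0, target, A, candidates)
print(ranking)
```

Here the hypothetical factor TF_c ranks first because it is the only candidate that acts on both TADs in the direction of the target. In practice the effect vectors would be informed by binding-site and activity data rather than chosen by hand.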
Using data-guided control, we have successfully identified factors previously validated in reprogramming experiments. More importantly, we have used it to identify potentially powerful new combinations. Matrix and visualization capabilities in MATLAB have proved to be valuable in this work, enabling us to efficiently process the complex mathematical operations underlying our control algorithms and to interpret the resulting high-dimensional biological data.
Streamlining Hypergraph Analysis and Bioinformatics Pipelines
While data-guided control gives us a way to model gene expression dynamics, explaining those dynamics often requires capturing regulatory interactions that go beyond simple pairwise models. Many biological interactions involve not just two, but many cellular components simultaneously. For example, gene regulation often requires the coordinated binding of several transcription factors and coactivators to enhancer and promoter regions of the genome. Standard network models, which represent relationships as connections between pairs of elements, cannot adequately capture these multi-way interactions. To address this complexity, our lab developed the Hypergraph Analysis Toolbox (HAT), a publicly available toolbox for analyzing and visualizing higher order structures in MATLAB. HAT enables researchers to construct, visualize, and analyze hypergraphs—mathematical structures where a single connection (hyperedge) can link multiple nodes, precisely representing multi-way interactions in complex biological systems. This capability is particularly valuable in cellular reprogramming, where understanding the intricate dynamics of gene regulatory networks and chromatin interactions can reveal optimal intervention points for converting one cell type into another. HAT helps us to identify critical regulatory modules and control points that would be invisible to pairwise network models, improving our ability to design effective reprogramming strategies.
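A brief Python sketch shows why pairwise models lose information. It builds the incidence matrix of a hypergraph (rows are nodes, columns are hyperedges) and compares one three-way interaction with three separate pairwise interactions. The functions are written from scratch for illustration and are not HAT functions; the node labels are arbitrary.

```python
from itertools import combinations


def incidence_matrix(n_nodes, hyperedges):
    """Rows are nodes, columns are hyperedges; 1 marks membership."""
    H = [[0] * len(hyperedges) for _ in range(n_nodes)]
    for e, members in enumerate(hyperedges):
        for node in members:
            H[node][e] = 1
    return H


def node_degrees(H):
    """Number of hyperedges each node belongs to."""
    return [sum(row) for row in H]


def clique_expansion(hyperedges):
    """Pairwise edges obtained by replacing each hyperedge with a clique."""
    pairs = set()
    for members in hyperedges:
        pairs.update(combinations(sorted(members), 2))
    return pairs


# One simultaneous three-way interaction among nodes 0, 1, and 2 ...
three_way = [{0, 1, 2}]
# ... versus three separate pairwise interactions among the same nodes.
pairwise = [{0, 1}, {1, 2}, {0, 2}]

H_three = incidence_matrix(3, three_way)
H_pair = incidence_matrix(3, pairwise)
print(H_three, node_degrees(H_three))
print(H_pair, node_degrees(H_pair))
print(clique_expansion(three_way) == clique_expansion(pairwise))
```

Both systems collapse to the same set of pairwise edges, so an ordinary graph cannot tell them apart, while their incidence matrices remain different. That distinction is what a hypergraph representation preserves.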
Hypergraph analysis is often performed within the context of a multistep process or pipeline. Our experimental pipelines typically involve raw data collection from sequencing platforms, alignment to reference genomes, filtering, and other downstream steps. The Biopipeline Designer app enables us to streamline these processes (Figure 3). We can, for example, build a pipeline that starts with sequencing data, aligns it, quantifies gene expression, performs filtering and normalization, and then extracts biologically meaningful features—a signature—that can be used to identify or classify cells, track reprogramming progress, or guide interventions. We can build and run end-to-end bioinformatics workflows interactively, connecting both established bioinformatics tools and custom-developed code into cohesive analytical pipelines. Further, we can create custom blocks to represent any MATLAB function—including our HAT functions—and integrate them with prebuilt blocks for common bioinformatics operations.
This approach is particularly powerful when processing our RNA-Seq data for gene expression analysis. This data is critical to informing our understanding of cell state and reprogramming dynamics. The Biopipeline Designer app saves us time and ensures reproducibility, as completed pipelines can be shared or adapted for different data types with minimal modification. For cellular reprogramming research, where iterative experimentation and analysis are essential, we rely on the ability to rapidly adjust and rerun analyses with different parameters to refine our computational models and control strategies.
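The idea of chaining filtering, normalization, and signature extraction can be sketched in plain Python as a list of stages applied in order. This is a conceptual illustration only: it does not use Biopipeline Designer, alignment and quantification are assumed to have already produced a count table, and the gene names, counts, threshold, and signature rule are invented.

```python
def filter_low_counts(counts, min_total=10):
    """Keep genes whose total count across samples reaches min_total."""
    return {gene: c for gene, c in counts.items() if sum(c) >= min_total}


def normalize_cpm(counts):
    """Scale each sample (column) to counts per million."""
    n_samples = len(next(iter(counts.values())))
    totals = [sum(c[s] for c in counts.values()) for s in range(n_samples)]
    return {
        gene: [1e6 * c[s] / totals[s] for s in range(n_samples)]
        for gene, c in counts.items()
    }


def top_changing_genes(cpm, k=2):
    """Toy signature: the k genes that change most from first to last sample."""
    change = {gene: abs(v[-1] - v[0]) for gene, v in cpm.items()}
    return sorted(change, key=change.get, reverse=True)[:k]


def run_pipeline(data, stages):
    """Apply each stage to the output of the previous one."""
    for stage in stages:
        data = stage(data)
    return data


# Hypothetical count table: gene -> counts at an early and a late time point.
counts = {
    "geneA": [100, 400],
    "geneB": [500, 100],
    "geneC": [1, 2],
    "geneD": [400, 500],
}
signature = run_pipeline(counts, [filter_low_counts, normalize_cpm, top_changing_genes])
print(signature)
```

Because each stage is an ordinary function, rerunning the analysis with a different threshold or signature size means swapping one stage, for example `lambda c: filter_low_counts(c, min_total=50)`, and leaving the rest of the pipeline unchanged.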
MATLAB and Mathematical Biology in the Classroom
In our lab, we use MATLAB in practically everything we do. This philosophy extends into my graduate-level instruction, where I teach the courses Mathematics of Biological Networks and Mathematics of Data. These courses cover essential concepts such as spectral graph theory, network controllability, SVD, probabilistic modeling, and neural networks—all applied to biological data sets using MATLAB.
When possible, we feature guest lectures from Cleve Moler, cofounder of MathWorks and creator of MATLAB, whose talk “How SVD Saved the Universe” both inspires my students and demonstrates the profound impact of linear algebra in scientific computing.
Current and Future Developments
While our current approach has primarily relied on Hi-C data for mapping pairwise chromatin interactions and identifying TADs, our lab is now working on integrating Oxford Nanopore Technologies’ long-read sequencing technology to enhance our understanding of chromatin architecture (the way DNA is packaged with proteins in the cell nucleus). Unlike traditional short-read sequencing, the company’s Pore-C method captures multi-way chromatin interactions and epigenetic modifications, providing a more comprehensive view of the 3D genome structure (Figure 4). This advancement will necessitate adaptations in our data processing workflows, and we plan to use Biopipeline Designer to manage and analyze the more complex data sets involved.
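To illustrate the difference in data structure, the Python sketch below takes a few multi-way contacts, each listing the genomic bins found together on one read, and counts the pairwise contacts they imply. The bin names and reads are invented, and the decomposition is a simplified assumption about how multi-way data relates to a pairwise contact map, not a description of any specific Pore-C software.

```python
from collections import Counter
from itertools import combinations


def pairwise_contacts(multiway_reads):
    """Count the pairwise contacts implied by each multi-way read."""
    counts = Counter()
    for read in multiway_reads:
        loci = sorted(set(read))  # ignore repeated bins within a read
        counts.update(combinations(loci, 2))
    return counts


def max_contact_order(multiway_reads):
    """Largest number of distinct bins observed together on one read."""
    return max(len(set(read)) for read in multiway_reads)


# Hypothetical reads: each tuple lists the bins captured on one long read.
reads = [
    ("binA", "binB", "binC"),
    ("binA", "binB"),
    ("binB", "binC", "binD"),
]
contacts = pairwise_contacts(reads)
print(contacts)
print(max_contact_order(reads))
```

The pairwise counts can populate an ordinary contact matrix, but the fact that binA, binB, and binC were observed together on a single read is lost in the process. Keeping each read as a hyperedge retains that multi-way information.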
We are also extending our data-guided control framework to incorporate hypergraph representations, allowing us to more effectively model higher order gene regulatory interactions. In addition, we plan to move beyond population-level reprogramming and incorporate single-cell reprogramming, with the goal of improving reprogramming success rates. Looking further ahead, we are exploring tissue fabrication, that is, the potential of assembling functional tissues from reprogrammed cells. To support these endeavors, our long-term vision includes the development of fully automated laboratory systems, in which digital twins of the necessary robotic systems will be modeled and simulated in Simulink®.
When researchers discuss taking a skin cell, reprogramming it, and reintroducing it into a patient, it may sound like science fiction. As science fiction author Arthur C. Clarke famously noted, “Any sufficiently advanced technology is indistinguishable from magic.” In this spirit, I believe that MATLAB tools are crucial in enabling us to turn this “magic” into reality.
Published 2025