Could AI-upscaled MRIs replace expensive scanners, and *finally* make diagnostics cheaper?
MRI Super-Resolution with Deep Learning: A Comprehensive Survey
The hidden cost of clinical MRI and why software might be the answer
High-resolution MRI scans reveal the subtle pathology that saves lives. A radiologist needs crystal-clear images to spot a small tumor, detect early dementia, or diagnose cardiac damage. But getting those images comes with a hidden price tag: patients must lie still for 30 minutes or longer while the machine whirs and clicks. Longer scans mean more motion artifacts, more patient discomfort, more claustrophobia. Longer scans also mean more expensive hardware requirements and higher operating costs. The fundamental engineering problem is this: resolution and speed are locked in a trade-off, and neither hardware improvements nor economic incentives have broken that coupling completely.
What if the answer wasn't better hardware but smarter software? What if hospitals could acquire fast, cheap, low-resolution scans and then use deep learning to reconstruct the high-resolution details that clinicians need? This is the promise of MRI super-resolution, and it's reshaping how researchers and engineers think about medical imaging.
A comprehensive new survey examines this landscape, mapping how deep learning has transformed MRI super-resolution from a niche signal processing problem into a vibrant field where computer vision, physics, and clinical medicine intersect. Understanding this field requires stepping back and seeing how different research communities are approaching the same problem, sometimes talking past each other, sometimes discovering they're solving the same underlying mathematics.
The clinical motivation: why speed and resolution can't both be cheap
MRI works by manipulating magnetic fields and measuring how tissue responds. The resolution of the final image depends directly on how many frequency samples the machine collects. More samples mean sharper images but longer scan times. This isn't a limitation of today's hardware. It's a fundamental property of Fourier sampling.
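To make the trade-off concrete, here is a minimal sketch under a deliberately simple assumption: a basic 2D acquisition that collects one k-space line per repetition time (TR), with no parallel imaging or other acceleration. The function names and the example numbers (256 mm field of view, 500 ms TR) are illustrative choices, not values from the survey.

```python
def pixel_size_mm(fov_mm: float, n_samples: int) -> float:
    """In-plane pixel size along one axis: field of view divided by sample count."""
    return fov_mm / n_samples


def scan_time_seconds(tr_ms: float, n_phase_encodes: int, n_averages: int = 1) -> float:
    """Scan time when each phase-encode line costs one TR (simplifying assumption)."""
    return tr_ms * n_phase_encodes * n_averages / 1000.0


# Halving the pixel size doubles the number of phase-encode lines,
# and therefore doubles the scan time under this simple model.
for n in (128, 256, 512):
    print(n, pixel_size_mm(256.0, n), scan_time_seconds(500.0, n))
```

Real protocols use many tricks to bend this curve, but the basic coupling between sample count and scan time is what super-resolution tries to sidestep.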
Clinicians face constant pressure: a patient with chest pain needs answers quickly, but rushed scans might miss important details. A pediatric patient can only hold still for so long. A patient on medication must be monitored regularly, but repeated high-resolution MRI scans require unrealistic time commitments, while switching to faster modalities such as CT raises radiation exposure concerns. The current solution is compromise, and the compromise is often unsatisfying.
This is where computational approaches diverge from hardware engineering. Hardware improvements follow established physics and economics: better magnets cost more, take longer to develop, require larger rooms and more cooling. Software improvements follow different laws. Once trained, a feed-forward deep learning model can run in well under a second on a modern GPU. It requires no new scanner hardware. It can be deployed on existing scanners globally. The marginal cost approaches zero.
Two fundamentally different ways to think about filling in missing details

Perspectives on MRI super-resolution methods: (a) Data-driven approach that learns LR-to-HR mapping purely from data; (b) Physics-informed approach that incorporates the underlying imaging physics while mapping the LR-to-HR
The field splits along a conceptual fault line. One school of thought treats MRI super-resolution as a generic image enhancement problem. The other treats it as physics-based inverse problem solving. This distinction shapes everything downstream: the architectures that get built, the training data that matters, the methods that work in the clinic versus only in research papers.
The data-driven perspective is intuitive. Collect thousands of pairs: a low-resolution scan paired with a high-resolution scan of the same anatomy. Train a neural network to learn the patterns that connect them. In effect, the network answers one question: "When I see this blurry pattern, what sharp version typically produces it?" With enough training pairs and sufficient network capacity, the network becomes remarkably good at this pattern matching. The approach is simple to implement, flexible to adapt, and requires no deep understanding of MRI physics. You just need data.
But this perspective has a hidden vulnerability. Where do those training pairs come from? In research, they usually come from taking real high-resolution scans and artificially downsampling them to create fake low-resolution versions. The problem: real low-resolution scans acquired at low resolution are different. They have different noise characteristics, different artifacts, different contrast properties. A network trained on fake pairs performs noticeably worse on real scans. This domain gap is the dirty secret that explains why many published super-resolution methods work beautifully on benchmark datasets but fail clinically.
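The "fake pair" recipe is easy to sketch, which is part of why it is so common. The version below assumes one particular degradation, keeping only the central block of k-space, and the function name `synthetic_lr_from_hr` is made up for illustration. Other pipelines blur and subsample in image space instead.

```python
import numpy as np


def synthetic_lr_from_hr(hr: np.ndarray, factor: int) -> np.ndarray:
    """Simulate a low-resolution image by keeping only the central k-space block."""
    h, w = hr.shape
    lh, lw = h // factor, w // factor
    k = np.fft.fftshift(np.fft.fft2(hr))           # centered k-space of the HR image
    top, left = (h - lh) // 2, (w - lw) // 2
    k_lr = k[top:top + lh, left:left + lw]         # discard high spatial frequencies
    lr = np.fft.ifft2(np.fft.ifftshift(k_lr)) / factor ** 2  # keep intensity scale
    return np.abs(lr)                              # magnitude image


rng = np.random.default_rng(0)
hr = rng.random((64, 64))
lr = synthetic_lr_from_hr(hr, factor=2)
print(hr.shape, "->", lr.shape)
```

Note what is missing: no scanner noise at the lower resolution, no motion, no sequence-dependent contrast change. That omission is the domain gap in code form.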
The physics-informed perspective starts with a different question: what is MRI actually doing? The scanner measures Fourier components of the imaged volume. It applies gradients, transmits radiofrequency pulses, listens to echoes. This entire process can be modeled mathematically as a forward model that takes a hypothetical high-resolution image and produces the measured low-resolution data. Super-resolution becomes an inverse problem: given the measured low-res data, find the high-res image that, when passed through the MRI forward model, would produce that data.
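A heavily simplified forward model can be written in a few lines. The sketch below assumes a single coil, Cartesian sampling, and a forward operator that keeps only the central block of k-space. `forward_model` and `data_residual` are illustrative names, and real forward models also include coil sensitivities, noise, and sequence effects.

```python
import numpy as np


def forward_model(hr: np.ndarray, factor: int) -> np.ndarray:
    """A(x): Fourier transform the image, then keep only the central k-space block."""
    h, w = hr.shape
    lh, lw = h // factor, w // factor
    k = np.fft.fftshift(np.fft.fft2(hr, norm="ortho"))
    top, left = (h - lh) // 2, (w - lw) // 2
    return k[top:top + lh, left:left + lw]


def data_residual(candidate: np.ndarray, measured: np.ndarray, factor: int) -> float:
    """How far a candidate HR image is from explaining the measured data."""
    return float(np.linalg.norm(forward_model(candidate, factor) - measured))


rng = np.random.default_rng(0)
truth = rng.random((64, 64))
measured = forward_model(truth, factor=2)

# A checkerboard lives entirely at the highest spatial frequency,
# so adding it to the image leaves the measurement unchanged.
checker = (np.indices((64, 64)).sum(axis=0) % 2) * 2.0 - 1.0
print(data_residual(truth, measured, 2))
print(data_residual(truth + checker, measured, 2))
print(data_residual(np.zeros((64, 64)), measured, 2))
```

The first two residuals are numerically zero even though the two images differ, which is exactly why the inverse problem needs a prior: the measurements alone cannot tell those candidates apart.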
This shift is profound because it changes what you're optimizing for. Instead of learning "what does a high-res version of this pattern look like?" you're learning "what high-res image is physically consistent with what I actually measured?" Physics-informed methods bake this mathematical structure directly into the network architecture, not just the training data. The advantage is principled reasoning that generalizes better. The disadvantage is that your physics model is always an approximation of reality.
The survey's taxonomy captures this dichotomy in its primary organizing axis. The most promising recent approaches don't choose between these visions. They blend them.
The architectural evolution: where super-resolution actually happens in the network

Overview of super-resolution architectures: (a) Pre-upsampling, (b) Final upsampling, (c) Progressive upsampling, (d) Residual learning with long skip connections, (e) Dense connections with efficient feature reuse, (f) Recursive approaches
The problem of upsampling seems deceptively simple. You have a low-resolution image and want a high-resolution version. You need to go bigger at some point. The decision of where and how to enlarge the image cascades into the entire network design.
Pre-upsampling enlarges the image first, then processes it. This is intuitive but computationally expensive, because all downstream operations happen at high resolution. Final upsampling processes at low resolution and enlarges at the end. This is efficient but risky: you might lose important detail before having a chance to recover it. Progressive upsampling gradually increases resolution in stages, learning refinements at each scale. This balances efficiency with multiple opportunities for detail recovery and has become the dominant approach in practice because it works across diverse imaging scenarios.
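The cost gap between the first two options is simple arithmetic. The sketch below counts multiply-accumulate operations for a stack of identical 3x3 convolutions. The layer count, channel width, and image size are arbitrary illustrative values.

```python
def conv_macs(height: int, width: int, c_in: int, c_out: int, kernel: int = 3) -> int:
    """Multiply-accumulates for one stride-1, same-padded 2D convolution."""
    return height * width * c_in * c_out * kernel * kernel


def body_macs(height: int, width: int, channels: int = 64, n_layers: int = 10) -> int:
    """Cost of a plain stack of convolutions at a fixed spatial resolution."""
    return n_layers * conv_macs(height, width, channels, channels)


lr_h, lr_w, scale = 128, 128, 2
pre = body_macs(lr_h * scale, lr_w * scale)   # pre-upsampling: convolve at HR size
post = body_macs(lr_h, lr_w)                  # final upsampling: convolve at LR size
print(pre / post)                             # grows with scale squared in 2D
```

For 3D volumes the same reasoning gives a factor of scale cubed, which is one reason pre-upsampling is rarely attractive for volumetric MRI.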
But upsampling location is just one decision among many. The real power comes from how information flows through the network.
Residual learning changed deep learning by recognizing a simple fact: in very deep networks, it's easier to learn the differences from a baseline than to learn everything from scratch. In MRI super-resolution, this insight is crucial because you're not creating anatomy from nothing. You're taking an existing low-resolution scan and enhancing it. A residual connection lets the network learn the high-frequency details to add rather than learning the entire high-resolution image. This prevents information loss about coarser anatomical structures that are already present in the low-res version.
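In code, the residual idea is a single addition. The sketch below uses nearest-neighbor interpolation as the baseline and a hand-written sharpening term, `toy_residual`, as a hypothetical stand-in for what a trained network would predict.

```python
import numpy as np


def upsample_nearest(lr: np.ndarray, scale: int) -> np.ndarray:
    """Baseline enlargement: repeat each pixel scale x scale times."""
    return np.repeat(np.repeat(lr, scale, axis=0), scale, axis=1)


def toy_residual(baseline: np.ndarray) -> np.ndarray:
    """Stand-in for a learned module: a small Laplacian sharpening term."""
    lap = (4 * baseline
           - np.roll(baseline, 1, 0) - np.roll(baseline, -1, 0)
           - np.roll(baseline, 1, 1) - np.roll(baseline, -1, 1))
    return 0.1 * lap


def super_resolve_residual(lr: np.ndarray, scale: int, residual_fn) -> np.ndarray:
    """Output = interpolated baseline + predicted high-frequency correction."""
    baseline = upsample_nearest(lr, scale)
    return baseline + residual_fn(baseline)


rng = np.random.default_rng(0)
lr = rng.random((16, 16))
sr = super_resolve_residual(lr, 2, toy_residual)
print(sr.shape)
```

If the residual branch outputs zeros, the result is just the interpolated scan, so the coarse anatomy is never at the mercy of what the network has learned.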
Dense connections take this idea further by explicitly reusing features from earlier layers throughout the network. Instead of a feature only being available at the layer where it was computed, it gets concatenated to features at all downstream layers. This creates multiple paths for information flow and prevents the "information bottleneck" problem where a single narrow layer loses detail.
Recursive architectures use the same processing module multiple times. This reduces parameters and can be elegant, but only if the module is actually good at the task of refinement rather than trying to do everything in one pass.
These architectural choices matter not because one is universally best but because they encode different assumptions about what's happening during super-resolution. Are you building structure from scratch, or enhancing existing structure? Do you need multiple passes to refine details, or can you do it once? These questions don't have universal answers, which is why the field evolved multiple approaches rather than converging on a single champion.
Physics embedded in the network: deep unfolding and equilibrium models

Deep unfolding models provide a physics-informed approach to imaging inverse problems by integrating the forward model with a neural network within an iterative reconstruction framework
Deep unfolding represents a different design philosophy entirely. Instead of hand-crafting an architecture, start with the mathematical equations that solve the inverse problem, then replace expensive operations with learned neural modules.
Classical inverse problem solving works like this: you have measurements and a forward model. You iteratively refine your guess of the high-resolution image. Each iteration applies the forward model to see what measurements your guess would produce, compares to actual measurements, and takes a step toward consistency. This works but requires many iterations and doesn't exploit learned priors about what MRI images actually look like.
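The loop described above can be sketched as a Landweber (gradient descent) iteration on the data-fit term. The forward model here is an assumption chosen for brevity: a binary mask that keeps the center of k-space, applied after an orthonormal FFT. The names `center_mask`, `A`, `AH`, and `landweber` are illustrative.

```python
import numpy as np


def center_mask(shape, factor):
    """Binary mask keeping the central k-space block, in unshifted FFT layout."""
    h, w = shape
    lh, lw = h // factor, w // factor
    mask = np.zeros(shape)
    top, left = (h - lh) // 2, (w - lw) // 2
    mask[top:top + lh, left:left + lw] = 1.0
    return np.fft.ifftshift(mask)


def A(x, mask):
    """Forward model: image -> masked k-space."""
    return mask * np.fft.fft2(x, norm="ortho")


def AH(k, mask):
    """Adjoint of the forward model: masked k-space -> image."""
    return np.fft.ifft2(mask * k, norm="ortho")


def landweber(y, mask, n_iter=20, step=0.5):
    """Repeat: simulate measurements, compare with the real ones, step toward consistency."""
    x = np.zeros(mask.shape, dtype=complex)
    for _ in range(n_iter):
        residual = A(x, mask) - y
        x = x - step * AH(residual, mask)
    return x


rng = np.random.default_rng(0)
truth = rng.random((32, 32))
mask = center_mask(truth.shape, factor=2)
y = A(truth, mask)
x_hat = landweber(y, mask)
print(np.linalg.norm(A(x_hat, mask) - y) / np.linalg.norm(y))
```

The result matches the measurements but contains no high-frequency content at all, since nothing in the loop knows what anatomy looks like. That missing prior is the gap learned methods try to fill.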
Deep unfolding takes this iterative algorithm and unrolls it into network layers. Instead of iterations, you have layers. In each layer, you still apply the forward model (unchanged, this is physics), but you replace the traditional step with a learned neural module that's smarter about how to update the estimate. Now the network learns how to take an iterate and move toward the solution more effectively than classical optimization. You get the best of both worlds: mathematical grounding in physics and learned intelligence about realistic solutions.
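Structurally, an unrolled network is a short loop in which each pass has its own parameters. In the sketch below the physics step is a gradient step on the data-fit term under an assumed masked-FFT forward model, and `refine` is a hypothetical placeholder (a fixed smoothing blend) for the module a real network would learn. Nothing here is trained.

```python
import numpy as np


def center_mask(shape, factor):
    """Binary mask keeping the central k-space block, in unshifted FFT layout."""
    h, w = shape
    lh, lw = h // factor, w // factor
    mask = np.zeros(shape)
    top, left = (h - lh) // 2, (w - lw) // 2
    mask[top:top + lh, left:left + lw] = 1.0
    return np.fft.ifftshift(mask)


def A(x, mask):
    return mask * np.fft.fft2(x, norm="ortho")


def AH(k, mask):
    return np.fft.ifft2(mask * k, norm="ortho")


def refine(x, weight):
    """Placeholder for a learned module: blend the image with a 5-point average."""
    avg = (x + np.roll(x, 1, 0) + np.roll(x, -1, 0)
           + np.roll(x, 1, 1) + np.roll(x, -1, 1)) / 5.0
    return (1.0 - weight) * x + weight * avg


def unrolled_recon(y, mask, layers):
    """Each (step, weight) pair plays the role of one network layer."""
    x = AH(y, mask)
    for step, weight in layers:
        x = x - step * AH(A(x, mask) - y, mask)   # physics step, not learned
        x = refine(x, weight)                     # learned step in a real network
    return x


rng = np.random.default_rng(0)
truth = rng.random((32, 32))
mask = center_mask(truth.shape, 2)
y = A(truth, mask)
recon = unrolled_recon(y, mask, layers=[(1.0, 0.2)] * 5)
print(recon.shape)
```

In a trained model, the per-layer step sizes and the refinement module's weights would be optimized end to end, with gradients flowing through `A` and `AH`.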

Deep Equilibrium models provide a physics-driven approach to imaging inverse problems by incorporating the forward model and a neural network within a fixed-point formulation
Deep Equilibrium models take this even further. Instead of unrolling a fixed number of iterations into layers, they ask: what happens if I let the iteration run until it converges, and I make the network itself part of the convergence criterion? The network learns not a single refinement step but an equilibrium condition. This is mathematically elegant and can be efficient: at inference you run only as many iterations as needed, not a predetermined number, and training can differentiate through the fixed point implicitly instead of storing every iteration. But it's also more complex to train and reason about.
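The inference side of this idea is a fixed-point solve. The sketch below iterates one update until the estimate stops changing. The update combines a data-consistency step (assumed masked-FFT forward model) with a hand-built smoothing blend that stands in for a learned network, and the weights are chosen so the map is contractive and the loop is guaranteed to settle. Training by implicit differentiation is not shown.

```python
import numpy as np


def center_mask(shape, factor):
    """Binary mask keeping the central k-space block, in unshifted FFT layout."""
    h, w = shape
    lh, lw = h // factor, w // factor
    mask = np.zeros(shape)
    top, left = (h - lh) // 2, (w - lw) // 2
    mask[top:top + lh, left:left + lw] = 1.0
    return np.fft.ifftshift(mask)


def A(x, mask):
    return mask * np.fft.fft2(x, norm="ortho")


def AH(k, mask):
    return np.fft.ifft2(mask * k, norm="ortho")


def update(x, y, mask, step=0.5, weight=0.5):
    """One iteration: data-consistency step, then a stand-in for the learned module."""
    x = x - step * AH(A(x, mask) - y, mask)
    avg = (x + np.roll(x, 1, 0) + np.roll(x, -1, 0)
           + np.roll(x, 1, 1) + np.roll(x, -1, 1)) / 5.0
    return (1.0 - weight) * x + weight * avg


def fixed_point(f, x0, tol=1e-6, max_iter=500):
    """Iterate x <- f(x) until the relative change falls below tol."""
    x = x0
    for i in range(1, max_iter + 1):
        x_next = f(x)
        if np.linalg.norm(x_next - x) <= tol * (np.linalg.norm(x) + 1e-12):
            return x_next, i
        x = x_next
    return x, max_iter


rng = np.random.default_rng(0)
truth = rng.random((32, 32))
mask = center_mask(truth.shape, 2)
y = A(truth, mask)
x_star, n_iter = fixed_point(lambda x: update(x, y, mask),
                             np.zeros(truth.shape, dtype=complex))
print(n_iter)
```

The number of iterations is decided by the data rather than by the architecture, which is the practical difference from unrolling.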
These physics-informed architectures represent a genuine innovation in how deep learning engages with domain knowledge. Rather than entering only implicitly through the training data (as in purely data-driven methods), physics becomes part of the computational graph itself. The gradient flow during backpropagation directly incorporates the forward model. This turns out to be surprisingly powerful for generalization because the network is pulled back toward physical consistency at every step, whatever patterns it learns.
The continuous frontier: images as functions rather than grids

Illustration of INR-based MRI super-resolution. An encoder maps the input image to a coordinate feature grid, from which spatial coordinates and corresponding features are sampled. During training, the network learns a continuous representation
A newer paradigm challenges the discrete pixel foundation of traditional imaging. Implicit Neural Representations encode an image as a continuous function: feed in spatial coordinates and the network outputs intensity at that location. For MRI super-resolution, this means you encode the low-resolution scan into a learned feature map, then query that representation at any resolution you want. The resulting image is smooth and free from discrete grid artifacts that plague traditional upsampling methods.
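The "query at any resolution" step can be sketched without any training. Below, a feature grid is sampled at continuous coordinates by bilinear interpolation, then a tiny MLP maps each feature vector plus its coordinate to an intensity. The grid and MLP weights are random placeholders, and the names `sample_features` and `query_inr` are illustrative. A real system would learn both from data.

```python
import numpy as np


def sample_features(grid, ys, xs):
    """Bilinearly interpolate an (H, W, C) feature grid at continuous coordinates."""
    h, w, _ = grid.shape
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    dy = (ys - y0)[..., None]
    dx = (xs - x0)[..., None]
    top = grid[y0, x0] * (1 - dx) + grid[y0, x0 + 1] * dx
    bottom = grid[y0 + 1, x0] * (1 - dx) + grid[y0 + 1, x0 + 1] * dx
    return top * (1 - dy) + bottom * dy


def query_inr(grid, mlp, out_shape):
    """Evaluate the continuous representation on an output grid of any size."""
    h, w, _ = grid.shape
    ys, xs = np.meshgrid(np.linspace(0, h - 1, out_shape[0]),
                         np.linspace(0, w - 1, out_shape[1]), indexing="ij")
    feats = sample_features(grid, ys, xs)
    coords = np.stack([ys / (h - 1), xs / (w - 1)], axis=-1)
    inputs = np.concatenate([feats, coords], axis=-1)
    hidden = np.maximum(inputs @ mlp["w1"] + mlp["b1"], 0.0)   # ReLU
    return (hidden @ mlp["w2"] + mlp["b2"])[..., 0]


rng = np.random.default_rng(0)
grid = rng.standard_normal((16, 16, 8))        # placeholder for encoder output
mlp = {"w1": rng.standard_normal((10, 16)), "b1": np.zeros(16),
       "w2": rng.standard_normal((16, 1)), "b2": np.zeros(1)}
print(query_inr(grid, mlp, (32, 32)).shape, query_inr(grid, mlp, (37, 53)).shape)
```

The same representation serves both output sizes. Nothing about the model is tied to a 2x or 4x factor.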
This representation has clinical advantages. Patient anatomy is continuous and has no hard pixel boundaries. An implicit representation respects that continuity in a way that discrete pixels don't. You can also trivially generate outputs at any resolution, not just standard sizes, which could enable custom sampling for specific clinical questions.

Comparison of feature representations between INR and GS. (a) INRs model pixels as discrete point samples; (b) GS represents each pixel as a self-adaptive continuous Gaussian field, allowing smooth and explicit evaluation of field
Gaussian Splatting extends this idea by representing each location not as a point but as a small Gaussian "splat", a fuzzy blob. This gives you explicit, interpretable representations where you can see what the model learned about each region. For MRI, this means representing anatomy as a collection of overlapping smooth structures rather than hard pixel boundaries, which is closer to physical reality. The approach is newer in the context of medical imaging but shows promise for producing more clinically plausible reconstructions.
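A stripped-down version of the idea: describe an image as a handful of Gaussians and render it on whatever grid you like. This sketch assumes isotropic Gaussians with hand-picked parameters. Actual Gaussian Splatting methods fit positions, anisotropic covariances, and amplitudes to data, none of which is shown here.

```python
import numpy as np


def render_gaussians(centers, sigmas, amplitudes, out_shape):
    """Sum isotropic 2D Gaussians on a grid covering the unit square."""
    ys, xs = np.meshgrid(np.linspace(0.0, 1.0, out_shape[0]),
                         np.linspace(0.0, 1.0, out_shape[1]), indexing="ij")
    image = np.zeros(out_shape)
    for (cy, cx), sigma, amp in zip(centers, sigmas, amplitudes):
        dist2 = (ys - cy) ** 2 + (xs - cx) ** 2
        image += amp * np.exp(-dist2 / (2.0 * sigma ** 2))
    return image


centers = [(0.5, 0.5), (0.25, 0.7)]
sigmas = [0.10, 0.05]
amplitudes = [2.0, 1.0]

coarse = render_gaussians(centers, sigmas, amplitudes, (33, 33))
fine = render_gaussians(centers, sigmas, amplitudes, (129, 129))
print(coarse.shape, fine.shape)
```

Each parameter is directly inspectable: a center, a width, a strength. That is where the interpretability claim comes from.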
These continuous representations hint at a future where medical imaging software thinks about anatomy differently than traditional computer vision. Traditional vision cares about sharp object boundaries. Medicine cares about tissue properties that vary smoothly. The mathematics can now accommodate that difference explicitly.
The generative turn: embracing uncertainty in super-resolution

Summary of forward and reverse processes in three major diffusion model formulations: DDPMs, SGMs, and SDE-based models
Diffusion models represent a philosophical reorientation in how to think about super-resolution. The older perspective asks: what is the single best high-resolution version of this scan? The newer perspective asks: what is the distribution of plausible high-resolution versions consistent with what I measured?
The mechanics of diffusion are becoming widely known. Start with clean data and gradually corrupt it with noise until you have pure noise. Train a network to reverse this process: predict the denoised version at each step. At inference time, start with noise and iterate through denoising. The elegance is that you can choose what kind of corruption to use. Standard approaches use Gaussian noise added to pixels. For MRI super-resolution, you can instead corrupt by downsampling: start with a high-resolution training image, progressively downsample it, train the network to undo the downsampling, then at inference time start with the measured low-res scan and iteratively upsample and refine.
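For the standard Gaussian case, the corruption at any step has a closed form, so training never needs to simulate the whole chain. The sketch below assumes the commonly used linear noise schedule (1000 steps, beta from 1e-4 to 0.02). It shows only the forward process, and it covers the Gaussian variant rather than the downsampling-based corruption.

```python
import numpy as np


def linear_alpha_bar(n_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, n_steps)
    return np.cumprod(1.0 - betas)


def q_sample(x0, t, alpha_bar, rng):
    """Jump straight to step t: scaled clean image plus scaled Gaussian noise."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps


rng = np.random.default_rng(0)
alpha_bar = linear_alpha_bar()
x0 = rng.random((32, 32))
for t in (0, 250, 999):
    x_t, _ = q_sample(x0, t, alpha_bar, rng)
    print(t, float(np.corrcoef(x0.ravel(), x_t.ravel())[0, 1]))
```

The printed correlation with the clean image falls toward zero as `t` grows. A denoising network would be trained to predict `eps` from `x_t` and `t`.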
Different corruption spaces create different inverse problems. Pixel-space diffusion works directly with image pixels. Frequency-space diffusion respects how MRI naturally operates, in Fourier space. Latent-space diffusion works on compressed representations learned by autoencoders, which can be computationally efficient.

Overview of diffusion corruption spaces in MRI SR. Top: Forward (red) and reverse (blue) denoising steps. Bottom: Three types of domains in which the diffusion process occurs: (i) pixel-based diffusion, operating directly on image pixels; (ii) frequency-based diffusion
Standard diffusion requires many denoising steps to reach high quality, making it computationally expensive. DDIM (denoising diffusion implicit models) accelerates this by using a deterministic update that can skip over intermediate steps while largely maintaining quality.

Illustration of DDIM sampling: instead of traversing all steps, the model maps x_{t+n} to x_{t} using a learned deterministic function conditioned on measured data. This enables accelerated inference by jumping between steps
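The deterministic DDIM update is short enough to write out. In this sketch the noise prediction comes from an oracle (the true noise) rather than a trained network, purely so the arithmetic can be checked. Conditioning on measured data, as in the figure, is omitted, and the linear schedule is an assumed default.

```python
import numpy as np


def linear_alpha_bar(n_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, n_steps)
    return np.cumprod(1.0 - betas)


def ddim_step(x_t, eps_pred, t, t_prev, alpha_bar):
    """Deterministic jump from step t to an earlier step t_prev."""
    a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
    x0_pred = (x_t - np.sqrt(1.0 - a_t) * eps_pred) / np.sqrt(a_t)   # implied clean image
    return np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps_pred


rng = np.random.default_rng(0)
alpha_bar = linear_alpha_bar()
x0 = rng.random((16, 16))
eps = rng.standard_normal(x0.shape)

t, t_prev = 900, 400                       # skip 500 steps in one update
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
x_prev = ddim_step(x_t, eps, t, t_prev, alpha_bar)
expected = np.sqrt(alpha_bar[t_prev]) * x0 + np.sqrt(1.0 - alpha_bar[t_prev]) * eps
print(np.allclose(x_prev, expected))
```

With a perfect noise estimate the jump lands exactly where step-by-step sampling would. With a real network the estimate is imperfect, which is why very large jumps cost some quality.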
Why does this generative perspective matter for clinical MRI? Because MRI is fundamentally uncertain. Noise, motion, measurement limitations, physiological variation. Multiple high-resolution anatomies could produce the same low-resolution measurement. Diffusion models don't pretend there's a single correct answer. They sample from the space of plausible answers. For clinicians, this is powerful: you can generate multiple candidates, recognize that uncertainty exists, and use that uncertainty in decision-making. You could average multiple samples to reduce noise or identify high-uncertainty regions that need closer attention.
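Turning a set of samples into something a reader can look at is mostly bookkeeping. The sketch below fakes the samples with random perturbations (a real pipeline would draw them from a trained generative model) and computes a per-pixel mean, a per-pixel standard deviation, and a mask of the most variable pixels. The 95th-percentile threshold is an arbitrary illustrative choice.

```python
import numpy as np


def summarize_samples(samples):
    """Per-pixel mean and standard deviation across candidate reconstructions."""
    stack = np.stack(samples, axis=0)
    return stack.mean(axis=0), stack.std(axis=0)


def flag_uncertain(std_map, quantile=0.95):
    """Mark pixels whose variability is above the chosen quantile."""
    return std_map > np.quantile(std_map, quantile)


# Simulated samples: small disagreement everywhere, large disagreement in one patch.
rng = np.random.default_rng(0)
base = rng.random((32, 32))
noise_scale = np.full((32, 32), 0.01)
noise_scale[10:18, 10:18] = 0.5
samples = [base + noise_scale * rng.standard_normal(base.shape) for _ in range(8)]

mean_img, std_map = summarize_samples(samples)
flagged = flag_uncertain(std_map)
print(int(flagged.sum()), "pixels flagged")
```

Whether a high-variance region means "look closer" or "do not trust this" is a clinical question the code cannot answer.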
The training data problem: synthetic versus real, paired versus unpaired

Learning paradigms for MRI super-resolution: (a) Supervised learning, where the network is trained on physically acquired and well-aligned pairs of LR-HR scans; (b) Unsupervised learning, where no LR-HR pairs are available
The gap between research papers and clinical reality often traces back to a single decision: where did the training data come from?
Supervised learning ideally uses perfectly aligned pairs of physically acquired low-resolution and high-resolution scans from the same patient. Train the network on these pairs and it learns to map low-res to high-res. The problem: acquiring both is twice the scan time. Hospitals don't do this routinely. So researchers compromise: take real high-resolution scans and artificially downsample them to create fake low-resolution training pairs.
This introduces a domain gap. Real low-resolution scans are fundamentally different from downsampled high-resolution scans. Real scans have motion artifacts specific to fast acquisition, different noise texture, different aliasing patterns, different contrast properties that come from different pulse sequences. A network trained on fake pairs learns to reverse artificial downsampling, not clinical reality.

Supervised vs. unsupervised super-resolution: Domain gaps between synthetic LR training data and real acquired LR images lead to discrepancies in the super-resolved outputs when models trained on synthetic data are applied to real scans
Unsupervised learning works with what hospitals actually have: abundant low-resolution scans, no high-resolution ground truth. Methods like cycle-consistency (where you super-resolve, then degrade the result, and should recover the original) or generative adversarial networks enforce consistency without paired data. The supervision signal is weaker because there is less information, but the model learns from real data. Self-supervised approaches add another angle: use physics directly. If you know the MRI forward model, you can simulate the low-res measurements that any candidate high-res volume would produce, with no paired data required. The network is then trained so that its output, passed through the forward model, reproduces the real measured low-res scan.
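The cycle-consistency check reduces to: enlarge, shrink again, compare with what you started from. The sketch below assumes a fixed, known degradation (average pooling) and uses simple stand-in generators. In CycleGAN-style methods both directions are learned networks and an adversarial term is added, neither of which appears here.

```python
import numpy as np


def downsample_avg(hr, scale):
    """Assumed degradation: average each scale x scale block."""
    h, w = hr.shape
    return hr.reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))


def cycle_loss(lr, generator, scale):
    """Mean absolute difference between the LR input and the re-degraded SR output."""
    sr = generator(lr)
    return float(np.mean(np.abs(downsample_avg(sr, scale) - lr)))


def consistent_generator(lr):
    """Stand-in super-resolver: nearest-neighbor enlargement by 2."""
    return np.repeat(np.repeat(lr, 2, axis=0), 2, axis=1)


def biased_generator(lr):
    """Stand-in that brightens the image, violating consistency."""
    return consistent_generator(lr) + 0.1


rng = np.random.default_rng(0)
lr = rng.random((16, 16))
print(cycle_loss(lr, consistent_generator, 2))
print(cycle_loss(lr, biased_generator, 2))
```

The loss needs only the low-resolution scan, which is why it suits unpaired data. It also cannot tell good high-frequency detail from bad, which is why it is usually combined with other terms.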
For clinical translation, this training data question might matter more than architectural innovations. A mediocre architecture trained on real unpaired data will often beat a brilliant architecture trained on synthetic paired data when deployed clinically. Yet published papers often emphasize architecture innovations over addressing the domain gap. This explains a frustrating reality: papers with impressive-looking results often fail when deployed in real hospitals.
How to know if it actually works: the metrics problem
Standard image quality metrics like PSNR (peak signal-to-noise ratio) and SSIM (structural similarity) are fast to compute and enable reproducible benchmarking. They're also incomplete for medical imaging. Two images can have identical PSNR, yet one can be clinically superior because it preserves the contrast of lesions. Another can have poor SSIM but be perfectly diagnostic because the noise pattern doesn't interfere with the relevant anatomy.
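A toy construction shows how PSNR can hide a clinically decisive difference. The reference below is a blank image with a small bright "lesion". One reconstruction erases the lesion entirely. The other keeps it but adds a faint uniform brightness offset, sized so both have the same mean squared error. The images and sizes are invented for illustration.

```python
import numpy as np


def psnr(reference, test, data_range=1.0):
    """Peak signal-to-noise ratio in decibels."""
    mse = np.mean((reference - test) ** 2)
    return float(10.0 * np.log10(data_range ** 2 / mse))


reference = np.zeros((64, 64))
reference[30:34, 30:34] = 1.0                 # 4x4 "lesion"

lesion_erased = np.zeros((64, 64))            # all error sits on the lesion
offset = np.sqrt(16.0 / (64 * 64))            # same total squared error, spread evenly
uniform_bias = reference + offset             # lesion contrast fully preserved

print(psnr(reference, lesion_erased), psnr(reference, uniform_bias))
```

The two scores match, but only one image would let a radiologist find the lesion.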
Physics-based metrics ask: does the super-resolved image preserve the physical properties that make MRI useful? Can you accurately extract tissue parameters from it? Do tissue boundaries align with anatomy? Does contrast look clinically realistic? These are harder to define and compute but more meaningful.
The only true evaluation is clinical validation: do radiologists prefer the super-resolved image for diagnosis? Do they catch more pathology? Do they need less time? This requires radiologist studies, ethics approval, and careful protocol design. Few papers do this. It's expensive and slow, which explains why so many papers report impressive metrics on research datasets without evidence of clinical value.
Generalization testing rarely happens but matters enormously. Does a method trained on brain imaging work on cardiac imaging? On a different scanner? On patients with disease not present in training data? Scoring well on a benchmark dataset is far easier than achieving real robustness.
What remains unsolved and why the field will keep evolving
Clinical translation sits at the intersection of technical capability and practical reality. A method that's 1% more accurate but requires retraining for every scanner and every anatomy is useless. A method that runs too slowly to integrate into existing workflows won't be adopted. Regulatory approval for medical devices involves rigorous validation that research papers don't require. These barriers explain why few research methods become clinical standards.
Generalization across imaging domains remains deeply challenging. Transfer learning helps but doesn't solve the fundamental problem: MRI of the brain is different from MRI of the heart, which is different from MRI of joints. Different tissues, different contrast mechanisms, different pathologies. A network trained on one domain performs poorly on another without adaptation.
Computational efficiency constraints are often invisible in papers. Diffusion models might be beautiful mathematically but require hundreds of steps to generate a sample. Deep unfolding networks require forward model evaluations that scale with image dimensions. On a busy hospital scanner where a radiologist needs results in seconds, these overhead costs become disqualifying. Pruning, quantization, and efficient architecture design are necessary but less novel-sounding than new diffusion formulations, so they receive less attention.
The physics incorporation question remains unresolved: how much physical modeling should you bake into the network versus learning from data? Too much physics and you're constrained by incorrect models or incomplete understanding. Too little and you lose the power of physical guidance. Different applications might need different balances, yet most methods make a fixed choice rather than adapting to the problem at hand.
Uncertainty quantification for clinical use remains nascent. Generative models naturally provide it: sample multiple times and look at variance. But converting that into clinically actionable information, understanding what uncertainty means about diagnostic confidence, and integrating it into radiologist workflows are all unsolved. Confidence scores without interpretation are useless.
Where the field is heading
The trajectory is unmistakable. Early MRI super-resolution treated it as generic image upsampling. The field is moving toward hybrid frameworks that respect MRI physics while leveraging deep learning flexibility. Physics-informed architectures that embed forward models directly into networks are becoming mainstream. Generative approaches that embrace uncertainty rather than pretending super-resolution has a single correct answer are gaining traction. Continuous representations that respect the smooth nature of medical anatomy are emerging.
For readers thinking about this field, the key insight is that MRI super-resolution is neither a solved problem nor a completely unsolved one. It's a problem where the definition of "solved" is being negotiated between researchers optimizing for accuracy, engineers optimizing for deployment, and clinicians optimizing for patient outcomes. These objectives don't always align. A paper with the highest reported metrics often doesn't translate clinically. A clinically successful deployment often relies on boring engineering work rather than novel algorithms.
The real advances will come from teams that understand all three perspectives simultaneously: researchers comfortable with both deep learning and MRI physics, engineers thinking about real deployment constraints, and clinical teams willing to validate rigorously. The mathematics is becoming mature. The missing piece is the integration across disciplines.