Generative models have made significant impacts across various domains, largely due to their ability to scale during training by increasing data, computational resources, and model size, a phenomenon characterized by scaling laws. Recent research has begun to explore inference-time scaling behavior in Large Language Models (LLMs), revealing how performance can further improve with additional computation during inference. Unlike LLMs, diffusion models inherently possess the flexibility to adjust inference-time computation via the number of denoising steps, although the performance gains typically flatten after a few dozen steps. In this work, we explore the inference-time scaling behavior of diffusion models beyond increasing denoising steps and investigate how generation performance can further improve with increased computation.
Specifically, we consider a search problem aimed at identifying better noises for the diffusion sampling process. We structure the design space along two axes: the verifiers used to provide feedback, and the algorithms used to find better noise candidates. Through extensive experiments on class-conditioned and text-conditioned image generation benchmarks, our findings reveal that increasing inference-time compute leads to substantial improvements in the quality of samples generated by diffusion models, and that, given the complex nature of images, the components of the framework can be chosen specifically to suit different application scenarios.
In recent years, a family of flexible generative models based on transforming pure noise \(\varepsilon \sim \mathcal{N}(0, \mathbf{I})\) into data \(x_* \sim p(x)\) has emerged.
This transformation can be described by a simple time-dependent process $$
x_t = \alpha_t x_* + \sigma_t \varepsilon
$$
with \(t\) defined on \([0, T]\), and \(\alpha_t, \sigma_t\) being time-dependent functions chosen such that \(x_0 \sim p(x)\) and \(x_T \sim \mathcal{N}(0, \mathbf{I})\).
At each \(t\), \(x_t\) has a conditional density \(p_t(x | x_*) = \mathcal{N}(\alpha_t x_*, \sigma_t^2\mathbf{I})\),
and our goal is to estimate the marginal density \(p_t(x) = \int p_t(x | x_*) p(x_*) \mathrm{d}x_*\).
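To make the notation concrete, here is a minimal sketch of the forward noising process; the linear schedule \(\alpha_t = 1 - t/T\), \(\sigma_t = t/T\) is an illustrative choice that satisfies the boundary conditions above, not the schedule of any particular model, and the function names are ours.

```python
import torch

def noise_schedule(t, T=1.0):
    """Illustrative linear schedule: alpha_0 = 1, sigma_0 = 0 (clean data)
    and alpha_T = 0, sigma_T = 1 (pure noise)."""
    return 1.0 - t / T, t / T

def forward_noising(x_star, t, T=1.0):
    """Sample x_t ~ p_t(x | x_*) = N(alpha_t * x_*, sigma_t^2 * I)."""
    alpha_t, sigma_t = noise_schedule(t, T)
    eps = torch.randn_like(x_star)           # epsilon ~ N(0, I)
    return alpha_t * x_star + sigma_t * eps  # x_t = alpha_t x_* + sigma_t eps

# Example: noising a batch of "clean" samples at an intermediate time t = 0.5.
x_star = torch.randn(4, 3, 32, 32)  # placeholder for data drawn from p(x)
x_t = forward_noising(x_star, t=0.5)
```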
Diffusion-based generation processes usually start from pure noise and require multiple forward passes of trained models to denoise and obtain clean data.
These forward passes are thus dubbed denoising steps. Since the number of denoising steps can be adjusted to trade sample quality for computational cost, the generation process of diffusion models naturally provides flexibility in allocating inference-time computation budget.
In the context of generative models, such a computation budget is also commonly measured by the number of function evaluations (NFE), which ensures a reasonable comparison with other families of models that use iterative sampling processes but lack denoising capabilities.
In theory, there is explicit randomness in the sampling of diffusion models: the randomly drawn initial noise, and the optional subsequent noise injected via procedures like SDE sampling.
We frame inference-time scaling as a search problem over the sampling noises; in particular, how do we know which sampling noises are good, and how do we search for them?
At a high level, there are two design axes we propose to consider:
Oracle Verifier, which utilizes full privileged information about the final evaluation of the selected samples.
On ImageNet, we directly take the most commonly-used FID and IS as the oracle verifiers.
For IS, we select the samples with the highest classification probability output by a pretrained InceptionV3 model.
Supervised Verifier, which has access to pre-trained models for evaluating both the quality of the samples
and their alignment with the specified conditioning inputs. This is a more realistic setup as the pre-trained models are not directly linked to the final evaluation of the samples, and
we want to investigate whether supervised verifiers can still provide reasonable feedback and enable effective inference-time scaling.
Although this strategy effectively improves the IS of the samples compared to purely scaling NFEs with increased denoising steps, the classifiers we use are only partially aligned with the goal of the FID score, since they operate point-wise and do not consider the global statistics of the samples. This can lead to a significant reduction in sample variance and eventually manifest as mode collapse as compute increases,
as demonstrated below by the increasing Precision and the decreasing Recall.
Self-Supervised Verifier, which uses the cosine similarity, in the feature space of DINO / CLIP respectively, between samples at a low noise level (\(\sigma = 0.4\)) and the corresponding clean samples (\(\sigma = 0.0\)) to evaluate the quality of initial noises.
We find that this similarity score is highly correlated with the logits output by the DINO / CLIP classifiers, and thus use it as an effective surrogate for the supervised verifier, as demonstrated below.
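As a concrete illustration, such a self-supervised score could be computed as below; `feature_extractor` stands in for a frozen DINO or CLIP image encoder, and the helper is a hypothetical sketch rather than a released implementation.

```python
import torch
import torch.nn.functional as F

def self_supervised_score(feature_extractor, x_low_noise, x_clean):
    """Score noise candidates by the cosine similarity between features of the sample
    at a low noise level (e.g. sigma = 0.4) and the fully denoised sample (sigma = 0.0).
    `feature_extractor` is any frozen image encoder, e.g. DINO or CLIP's vision tower."""
    with torch.no_grad():
        f_low = feature_extractor(x_low_noise)   # (B, D) feature vectors
        f_clean = feature_extractor(x_clean)     # (B, D) feature vectors
    return F.cosine_similarity(f_low, f_clean, dim=-1)  # higher = better candidate
```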
Random Search, which is essentially a Best-of-N strategy applied once to all noise candidates; the primary axis for scaling NFEs in search is simply the number of noise candidates to select from. Its effectiveness has been demonstrated in the previous section, and we note that since its search space is unconstrained, it accelerates the convergence of the search towards the biases of the verifiers, leading to a loss of diversity. This phenomenon is similar to reward hacking in reinforcement learning.
Zero-Order Search, which is similar to Zero-Order Optimization: starting from a pivot noise, it proposes candidates in a small neighborhood of the pivot, queries the verifier to pick the best one, and iteratively updates the pivot (a minimal sketch is given after this list).
Search over Paths, which iteratively refines the sampling trajectory itself rather than only the initial noise: intermediate samples are perturbed with forward noise and denoised again, and the verifier decides which paths to keep.
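To ground the algorithm descriptions, below is a minimal sketch of Zero-Order Search over initial noises; `sample_fn` and `verifier` are placeholders for a full denoising run and any of the verifiers above, and the neighborhood and update rules are simplified assumptions rather than the exact procedure used in our experiments. Random Search corresponds to dropping the neighborhood constraint: all candidates are drawn independently from the prior and the best one is selected in a single pass.

```python
import torch

def zero_order_search(sample_fn, verifier, shape, n_neighbors=2, n_iters=10, step_size=0.1):
    """Sketch of Zero-Order Search over initial noises.

    sample_fn(noise) -> image : runs the full denoising process from an initial noise.
    verifier(image)  -> float : higher scores indicate better samples.
    """
    pivot = torch.randn(shape)                  # initial pivot noise candidate
    best_score = verifier(sample_fn(pivot))
    for _ in range(n_iters):
        # Propose candidates in a small neighborhood of the current pivot.
        candidates = [pivot + step_size * torch.randn(shape) for _ in range(n_neighbors)]
        scores = [verifier(sample_fn(c)) for c in candidates]
        best_idx = max(range(n_neighbors), key=lambda i: scores[i])
        if scores[best_idx] > best_score:       # keep the pivot unless a neighbor improves it
            pivot, best_score = candidates[best_idx], scores[best_idx]
    return pivot, best_score
```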
With the instantiation of our search framework, we proceed to examine its inference-time scaling capability in larger-scale text-conditioned generation tasks, and study the alignment between verifiers and specific image generation tasks.
Datasets. For a more holistic evaluation of our framework, we use two datasets: DrawBench and T2I-CompBench.
Models. We use the newly released FLUX.1-dev model.
Verifiers. We expand the choice of verifiers to cope with the complex nature of text-conditioned image generation: Aesthetic Score Predictor, CLIPScore, ImageReward, and a Verifier Ensemble that combines all three.
Metrics. On DrawBench, we use all verifiers not employed in the search process as primary metrics to provide a more comprehensive evaluation. Considering the use of the Verifier Ensemble, we additionally introduce an LLM grader as a neutral evaluator for assessing sample quality.
We prompt the Gemini-1.5 model to assess synthesized images from five different perspectives: Accuracy to Prompt, Originality, Visual Quality, Internal Consistency, and Emotional Resonance.
Each perspective is rated on a scale from 0 to 100, and the averaged overall score is used as the final metric (a minimal aggregation sketch follows this paragraph).
On T2I-CompBench, we use the provided evaluation pipeline to assess the performance of our framework on compositional generation tasks.
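For concreteness, the aggregation of the LLM grader's five per-perspective scores could look like the sketch below; the rubric names match the perspectives listed above, but the actual prompt sent to Gemini-1.5 is not reproduced here and the helper is hypothetical.

```python
# Hypothetical aggregation for the LLM grader described above.
PERSPECTIVES = ["Accuracy to Prompt", "Originality", "Visual Quality",
                "Internal Consistency", "Emotional Resonance"]

def overall_llm_grade(scores: dict) -> float:
    """Average the five per-perspective scores (each on a 0-100 scale) into the final metric."""
    assert set(scores) == set(PERSPECTIVES)
    return sum(scores.values()) / len(PERSPECTIVES)

# Example: one sample rated by the grader.
print(overall_llm_grade({
    "Accuracy to Prompt": 90, "Originality": 70, "Visual Quality": 85,
    "Internal Consistency": 88, "Emotional Resonance": 60,
}))  # 78.6
```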
DrawBench. As shown above, and as indicated by the LLM Grader, searching with all verifiers generally improves sample quality, while specific improvement behaviors vary across different setups:
Yet, since searching with Aesthetic and CLIP does not lead to a total collapse in sample quality, they can be well-suited for tasks that require a focus on specific attributes such as visual appeal or textual accuracy, rather than maintaining general-purpose performance. These different behaviors across verifiers highlight the importance of aligning verifiers with the specific task at hand.
T2I-CompBench. Since T2I-CompBench emphasizes correctness with respect to the text prompt, we see that ImageReward becomes the best verifier, whereas Aesthetic Score leads to minimal improvements and even degradation, as demonstrated below.
Verifier | Color | Shape | Texture | Spatial | Numeracy | Complex |
---|---|---|---|---|---|---|
- | 0.7692 | 0.5187 | 0.6287 | 0.2429 | 0.6167 | 0.3600 |
Aesthetic | 0.7618 | 0.5119 | 0.5826 | 0.2593 | 0.6159 | 0.3472 |
CLIP | 0.8009 | 0.5722 | 0.7005 | 0.2988 | 0.6457 | 0.3704 |
ImageReward | 0.8303 | 0.6274 | 0.7364 | 0.3151 | 0.6789 | 0.3810 |
Ensemble | 0.8204 | 0.5959 | 0.7197 | 0.3043 | 0.6623 | 0.3754 |
These contrasting behaviors of verifiers on DrawBench and T2I-CompBench highlight how certain verifiers can be better suited for particular tasks than others. This inspires the design of more task-specific verifiers, which we leave as future works.
Algorithms. Below we demonstrate the performance of search algorithms on DrawBench. For Zero-Order Search, we set the number of neighbors to be \(N = 2\). For Search over Paths, we set the number of initial noises to be \(N = 2\) as well.
Verifier + Algorithm | Aesthetic | CLIPScore | ImageReward | LLM Grader |
---|---|---|---|---|
- | 5.79 | 0.71 | 0.97 | 84.29 |
Aesthetic + Random | 6.38 | 0.69 | 0.99 | 86.04 |
+ ZO-2 | 6.33 | 0.69 | 0.96 | 85.90 |
+ Paths-2 | 6.31 | 0.70 | 0.95 | 85.86 |
CLIPScore + Random | 5.68 | 0.82 | 1.22 | 86.15 |
+ ZO-2 | 5.72 | 0.81 | 1.16 | 85.48 |
+ Paths-2 | 5.71 | 0.81 | 1.14 | 85.45 |
ImageReward + Random | 5.81 | 0.74 | 1.58 | 87.09 |
+ ZO-2 | 5.79 | 0.73 | 1.50 | 86.22 |
+ Paths-2 | 5.76 | 0.74 | 1.49 | 86.33 |
Ensemble + Random | 6.06 | 0.77 | 1.41 | 88.18 |
+ ZO-2 | 5.99 | 0.77 | 1.38 | 87.25 |
+ Paths-2 | 6.02 | 0.76 | 1.34 | 86.84 |
We see that all three methods can effectively improve sample quality, with random search outperforming the other two in some aspects due to the local nature of Zero-Order Search and Search over Paths.
Both search and finetuning methods aim to align generated samples with reward models or human preferences, so we investigate whether search can be combined with models that have already been finetuned for alignment.
We take the DPO fine-tuned Stable Diffusion XL model and apply our search framework on top of it.
Model | Aesthetic | CLIP | PickScore |
---|---|---|---|
SDXL | 5.56 | 0.73 | 22.39 |
+ DPO | 5.59 | 0.74 | 22.54 |
+ DPO & Search | 5.66 | 0.76 | 23.54 |
We see that the search method generalizes to different models and can improve the performance of an already aligned model. This makes it a useful tool for mitigating cases where finetuned models disagree with reward models and for improving their generalizability.
Number of search iterations. Increasing the number of search iterations allows the selected noises to approach the optimal set with respect to the verifiers. We observed this behavior in all of our previous experiments.
Compute per search iteration. We denote this compute as NFEs/iter. During search, adjusting NFEs/iter can reveal distinct compute-optimal regions, as shown below.
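As a back-of-the-envelope illustration of this trade-off, the accounting below assumes that every candidate in every search iteration is denoised with the full number of steps; the specific numbers are hypothetical and only show how a fixed budget can be split between many cheap iterations and fewer, more thorough ones.

```python
def total_search_nfes(denoising_steps, candidates_per_iter, search_iters):
    """Total NFEs spent on search, assuming every candidate in every iteration
    is denoised with the full number of steps (an assumption of this sketch)."""
    nfes_per_iter = denoising_steps * candidates_per_iter   # NFEs/iter
    return nfes_per_iter * search_iters

# The same budget of 12,800 NFEs, split two different ways:
print(total_search_nfes(denoising_steps=50, candidates_per_iter=4, search_iters=64))   # 12800
print(total_search_nfes(denoising_steps=50, candidates_per_iter=16, search_iters=16))  # 12800
```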
We explore the effectiveness of scaling inference-time compute for smaller diffusion models and highlight its efficiency relative to the performance of their larger counterparts without search.
For ImageNet tasks, we utilize SiT-B and SiT-L, and for text-to-image tasks, we use the smaller transformer-based model PixArt-\(\Sigma\).
Since models of different sizes incur significantly different costs per forward pass, we use estimated GFLOPs to measure their computational cost instead of NFEs.
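A rough way to compare models of different sizes under this metric is to divide a fixed GFLOPs budget by an approximate per-forward-pass cost; the per-pass figures below are approximate values for DiT/SiT-style backbones at 256x256 resolution and should be read as assumptions for illustration, not measurements from our experiments.

```python
# Approximate per-forward-pass costs (GFLOPs); assumptions for illustration only.
GFLOPS_PER_FORWARD = {"SiT-B": 23.0, "SiT-L": 80.7, "SiT-XL": 118.6}

def nfes_within_budget(model, budget_gflops):
    """How many function evaluations a model can afford under a fixed GFLOPs budget."""
    return int(budget_gflops // GFLOPS_PER_FORWARD[model])

# Budget equal to 250 plain denoising steps of SiT-XL: smaller models can spend the
# leftover NFEs on search at no extra total compute.
budget = 250 * GFLOPS_PER_FORWARD["SiT-XL"]
print(nfes_within_budget("SiT-XL", budget))  # 250
print(nfes_within_budget("SiT-L", budget))   # 367
print(nfes_within_budget("SiT-B", budget))   # 1289
```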
As the ImageNet results above show, scaling inference-time compute for small models can be highly effective: at a fixed compute budget, SiT-L can outperform SiT-XL in regions with limited inference compute. Yet this requires the small model to have relatively strong standalone performance: SiT-B does not benefit from search as much as SiT-L and does not have an advantageous compute region.
These observations extend to the text-conditioned setting, as demonstrated below. With just one-tenth of the compute, PixArt-\(\Sigma\) outperforms FLUX.1-dev without search, and with roughly double the compute, PixArt-\(\Sigma\) surpasses FLUX.1-dev without search by a significant margin. These results have important practical implications: the substantial compute resources invested in training can be offset by a fraction of that compute during generation, enabling access to higher-quality samples more efficiently.
Model | Compute Ratio | Aesthetic | CLIP | ImageReward | LLM Grader |
---|---|---|---|---|---|
FLUX.1-dev | 1 | 5.79 | 0.71 | 0.97 | 84.29 |
PixArt-\(\Sigma\) | ~0.06 | 5.94 | 0.68 | 0.70 | 84.67 |
PixArt-\(\Sigma\) | ~0.09 | 6.03 | 0.71 | 0.97 | 85.62 |
PixArt-\(\Sigma\) | ~2.59 | 6.20 | 0.73 | 1.15 | 86.95 |
In this work, we present a framework for inference-time scaling in diffusion models, demonstrating that scaling compute through search can significantly improve performance across various model sizes and generation tasks, and that different inference-time compute budgets can lead to varied scaling behavior. Identifying verifiers and algorithms as the two crucial design axes of our search framework, we show that optimal configurations vary by task, with no universal solution. Additionally, our investigation into the alignment between different verifiers and generation tasks uncovers their inherent biases, highlighting the need for more carefully designed verifiers that align with specific vision generation tasks.
@article{ma2025inferencetimescalingdiffusionmodels,
title={Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps},
author={Nanye Ma and Shangyuan Tong and Haolin Jia and Hexiang Hu and Yu-Chuan Su and Mingda Zhang and Xuan Yang and Yandong Li and Tommi Jaakkola and Xuhui Jia and Saining Xie},
year={2025},
eprint={2501.09732},
archivePrefix={arXiv},
primaryClass={cs.CV}
}