CVPR 2026

Stepwise Credit Assignment for GRPO on Flow-Matching Models

Yash Savani1*, Branislav Kveton2, Yuchen Liu2, Yilin Wang2, Jing Shi2, Subhojyoti Mukherjee2, Nikos Vlassis2, Krishna Kumar Singh2
1Carnegie Mellon University    2Adobe Research
*Work done while interning at Adobe
Teaser figure showing stepwise credit assignment from temporal reward structure
Stepwise credit assignment from temporal reward structure. (Left) Two trajectories from the same initial noise reach similar final rewards (~0.90), but diverge substantially at intermediate steps. Uniform credit assignment treats them nearly identically; Stepwise-Flow-GRPO uses gains g_t = r_{t-1} − r_t to credit steps that improve reward and penalize those that hurt it. (Right) This finer credit assignment yields faster convergence and higher final reward.
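To make the contrast concrete, here is a minimal numeric sketch (values are illustrative, not experimental results). Rewards are listed in generation order, from noisy to clean, so each gain below corresponds to g_t = r_{t-1} − r_t in the paper's decreasing-time indexing:

# Illustrative per-step reward estimates for two trajectories that share
# the same final reward but differ at intermediate steps.
rewards_a = [0.10, 0.55, 0.70, 0.90]   # steady improvement
rewards_b = [0.10, 0.75, 0.40, 0.90]   # overshoots, then recovers

def gains(rewards):
    # Stepwise gain: the reward improvement contributed by each step.
    return [round(r1 - r0, 2) for r0, r1 in zip(rewards, rewards[1:])]

print(gains(rewards_a))  # [0.45, 0.15, 0.2]   -> every step credited
print(gains(rewards_b))  # [0.65, -0.35, 0.5]  -> harmful middle step penalized
# Uniform credit assignment would give every step of both trajectories the
# same signal, since both end at 0.90.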

Abstract

Flow-GRPO successfully applies reinforcement learning to flow-matching models, but assigns uniform credit across all denoising steps. This ignores the temporal structure of diffusion generation: early steps determine composition and content (low-frequency structure), while late steps resolve fine details and textures (high-frequency content). Moreover, assigning uniform credit based solely on the final image can inadvertently reward suboptimal intermediate steps, especially when errors are corrected later in the diffusion trajectory. We propose Stepwise-Flow-GRPO, which assigns credit based on each step's reward improvement. By leveraging Tweedie's formula to obtain intermediate reward estimates and introducing gain-based advantages, our method achieves superior sample efficiency and faster convergence. We also introduce a DDIM-inspired SDE that improves reward quality while preserving the stochasticity needed for policy gradients.
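The DDIM-inspired SDE can be pictured as a deterministic move toward the clean estimate plus freshly injected Gaussian noise. The sketch below is schematic only: the function name, the stochasticity parameter eta, and the rectified-flow path assumption x_t = (1−t)·x_0 + t·ε are illustrative choices, not the exact parameterization used in the paper.

import math
import torch

def ddim_like_sde_step(x_t, x0_hat, t, t_next, eta=0.7):
    # Schematic DDIM-inspired stochastic step (an assumption, not the
    # paper's exact coefficients). Under x_t = (1-t)*x0 + t*eps, the noise
    # implied by the current state and the clean estimate is:
    eps_hat = (x_t - (1.0 - t) * x0_hat) / t
    # Split the target noise level t_next into a deterministic part and a
    # freshly injected part of scale sigma, keeping the marginal level fixed.
    sigma = eta * (t - t_next)
    det = math.sqrt(max(t_next**2 - sigma**2, 0.0))
    return (1.0 - t_next) * x0_hat + det * eps_hat + sigma * torch.randn_like(x_t)

Injecting noise at each step keeps the per-step transition stochastic, which is what makes the trajectory log-probabilities usable for policy-gradient updates.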

Key Results

0.87 GenEval score (vs. 0.72 for Flow-GRPO)
3/4 settings with superior sample and wall-clock efficiency

Method

Stepwise-Flow-GRPO method overview
Stepwise-Flow-GRPO overview. Using Tweedie's formula, we predict clean images x̂_0(t) from intermediate noisy states x_t and score these estimates with the reward model. The stepwise reward gain g_t = r_{t-1} − r_t measures each step's contribution, providing fine-grained credit assignment without a separate critic model.
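A minimal sketch of this pipeline, assuming an SD3.5-style rectified-flow velocity parameterization (so x̂_0 = x_t − t·v) and hypothetical velocity_model, sampler_step, and reward_fn handles; the exact rollout and normalization details in the paper may differ:

import torch

@torch.no_grad()
def stepwise_gain_advantages(velocity_model, sampler_step, reward_fn, x, timesteps):
    # Roll out a trajectory, scoring the Tweedie-style clean estimate at
    # every step, then convert reward gains into GRPO-style advantages.
    rewards = []
    for i, t in enumerate(timesteps):          # t decreases from ~1 toward 0
        v = velocity_model(x, t)
        x0_hat = x - t * v                     # Tweedie-style clean-image estimate
        rewards.append(reward_fn(x0_hat))      # per-sample reward, shape [B]
        if i + 1 < len(timesteps):
            x = sampler_step(x, v, t, timesteps[i + 1])  # stochastic SDE step
    rewards = torch.stack(rewards)             # shape [T, B]
    gains = rewards[1:] - rewards[:-1]         # g_t: improvement from each step
    # Normalize across the sampling group, as in GRPO (the normalization
    # axis here is an assumption); no learned critic is needed.
    return (gains - gains.mean()) / (gains.std() + 1e-6)

Because the advantage comes directly from reward gains, a step that degrades the intermediate estimate receives a negative signal even when the final image scores well.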

Sample Efficiency

Stepwise-Flow-GRPO consistently outperforms Flow-GRPO in reward per training step across all settings.

Reward vs. training step across four settings: PickScore, ImageReward, and UnifiedReward on GenEval, and PickScore on PickScore Dataset.

Wall-Clock Efficiency

Despite the additional computation needed to score intermediate denoised estimates, our method converges faster in wall-clock time in three of the four settings.

Reward vs. wall-clock time for the same settings. Stepwise-Flow-GRPO achieves visibly superior performance in 3 out of 4 settings.

Extended Training

Extended GenEval training results
400 GPU-hour run. Stepwise-Flow-GRPO achieves 0.87 GenEval, substantially outperforming Flow-GRPO (0.72) and surpassing GPT-4o (0.84).

GenEval Benchmark

| Model | Overall | Single Obj. | Two Obj. | Counting | Colors | Position | Attr. Bind. |
|---|---|---|---|---|---|---|---|
| Pretrained models | | | | | | | |
| SD3.5-M (cfg=1.0) | 0.28 | 0.71 | 0.23 | 0.15 | 0.45 | 0.05 | 0.08 |
| SD3.5-M (cfg=4.5) | 0.63 | 0.98 | 0.78 | 0.50 | 0.81 | 0.24 | 0.52 |
| Standard training duration | | | | | | | |
| Flow-GRPO (cfg=1.0, PickScore) | 0.60 | 0.96 | 0.73 | 0.67 | 0.67 | 0.21 | 0.35 |
| Ours (cfg=1.0, PickScore) | 0.60 | 0.96 | 0.75 | 0.67 | 0.67 | 0.21 | 0.34 |
| Flow-GRPO (cfg=4.5, PickScore) | 0.68 | 0.98 | 0.82 | 0.64 | 0.82 | 0.24 | 0.59 |
| Ours (cfg=4.5, PickScore) | 0.71 | 0.98 | 0.85 | 0.70 | 0.82 | 0.29 | 0.59 |
| Extended training | | | | | | | |
| Flow-GRPO (cfg=4.5, GenEval, 400 GPU hrs) | 0.72 | – | – | – | – | – | – |
| Ours (cfg=4.5, UnifiedReward, 60 GPU hrs) | 0.74 | 0.99 | 0.89 | 0.73 | 0.83 | 0.34 | 0.66 |
| Ours (cfg=4.5, GenEval, 400 GPU hrs) | 0.87 | 0.99 | 0.93 | 0.89 | 0.87 | 0.73 | 0.80 |
| Reference: state-of-the-art models | | | | | | | |
| Janus-Pro-7B | 0.80 | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 |
| SANA-1.5 4.8B | 0.81 | 0.99 | 0.93 | 0.86 | 0.84 | 0.59 | 0.65 |
| GPT-4o | 0.84 | 0.99 | 0.92 | 0.85 | 0.92 | 0.75 | 0.61 |
After 400 GPU hours of extended training, our method achieves an overall GenEval score of 0.87, substantially outperforming Flow-GRPO (0.72) and surpassing GPT-4o (0.84).

Qualitative Results

Qualitative comparison between Flow-GRPO and Stepwise-Flow-GRPO
Qualitative comparison. Stepwise-Flow-GRPO produces better spatial reasoning, attribute binding, and counting compared to Flow-GRPO. Flow-GRPO sometimes merges objects or places them unrealistically, while our method generates more plausible compositions.

More Results

Extended qualitative results page 1
Extended qualitative results page 2
Extended qualitative results page 3
Extended qualitative results. Comparison of generated images from Stepwise-Flow-GRPO across diverse GenEval prompts, demonstrating improved compositional understanding, spatial reasoning, and attribute binding.
Qualitative comparison across training objectives
Comparison across training objectives. Generated images from GenEval prompts using base SD3.5-M (left), GenEval reward training (middle), and UnifiedReward training (right). GenEval reward training improves prompt adherence and benchmark scores, while UnifiedReward training produces higher overall visual quality and more photorealistic images.

BibTeX

@inproceedings{savani2026stepwise,
  title     = {Stepwise Credit Assignment for GRPO on Flow-Matching Models},
  author    = {Savani, Yash and Kveton, Branislav and Liu, Yuchen and Wang, Yilin and Shi, Jing and Mukherjee, Subhojyoti and Vlassis, Nikos and Singh, Krishna Kumar},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}