When a prompt like "A red car and a blue bicycle in watercolor style" is fed to SDXL or SD1.5, the model treats all tokens equally: the style bleeds unevenly across objects, and object boundaries tend to collapse. This is because standard cross-attention has no mechanism to scope style tokens to specific objects or to specific denoising stages.
Local Prompt Adaptation (LPA) addresses this in four steps (a minimal code sketch follows the list):

1. Split the prompt into content tokens (objects, layout) and style tokens (artistic style, texture) using a lightweight parser.
2. Define injection windows: content tokens are injected early (high-noise steps) to anchor layout; style tokens are injected later (low-noise steps) to apply texture.
3. Selectively replace the cross-attention keys/values in the targeted U-Net layers with the corresponding token subset during each timestep window.
4. Run standard DDIM/DPM++ sampling: no retraining, no fine-tuning, one config change.
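To make the mechanism concrete, here is a minimal PyTorch sketch of steps 1–3, not the authors' code: `split_prompt_tokens`, `active_token_indices`, and `lpa_cross_attention` are hypothetical names, the parsing heuristic is a toy stand-in for the paper's lightweight parser, and the window convention is an assumption. A real integration would patch the U-Net's cross-attention layers (e.g. via custom attention processors) rather than call a standalone function.

```python
import torch

# Toy stand-in for the paper's lightweight parser (hypothetical helper):
# everything from "in" onward is treated as style, the rest as content.
def split_prompt_tokens(prompt: str):
    words = prompt.lower().replace(",", "").split()
    cut = words.index("in") if "in" in words else len(words)
    return list(range(cut)), list(range(cut, len(words)))  # content_idx, style_idx

# Injection window (assumed convention): style tokens are active inside the
# window, content tokens everywhere else.
def active_token_indices(step: int, content_idx, style_idx, style_window=(300, 650)):
    lo, hi = style_window
    return style_idx if lo <= step <= hi else content_idx

# Cross-attention whose keys/values are built only from the selected tokens.
# q: (B, N_latent, C); text_emb: (B, N_text, C_text); to_k / to_v: nn.Linear.
def lpa_cross_attention(q, text_emb, to_k, to_v, token_idx):
    sub = text_emb[:, token_idx, :]                 # keep only the active subset
    k, v = to_k(sub), to_v(sub)
    scores = q @ k.transpose(-1, -2) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v
```

In practice the token indices refer to positions in the encoded prompt embedding rather than whitespace-split words; the word-level split above is only for illustration.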
The optimal configuration found through ablation is LPA Late Only with a 300–650 step injection window, which delivers the strongest balance of prompt alignment and style consistency.
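As a usage sketch, the Late Only idea can be approximated in a hand-rolled diffusers denoising loop by switching which text embedding the U-Net sees at each step. This applies the window to all cross-attention layers at once rather than the paper's layer-selective injection; the model ID, the sub-prompt split, and the mapping of the 300–650 window onto scheduler timesteps are assumptions, and CFG plus VAE decoding are omitted for brevity.

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

torch.set_grad_enabled(False)  # inference only

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")  # any SD1.5 checkpoint
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

def encode(text):  # CLIP embedding for one sub-prompt (assumed manual split)
    ids = pipe.tokenizer(text, padding="max_length", max_length=77,
                         truncation=True, return_tensors="pt").input_ids
    return pipe.text_encoder(ids)[0]

content_emb = encode("A red car and a blue bicycle")  # content sub-prompt
style_emb = encode("watercolor style")                # style sub-prompt
window = (300, 650)                                   # assumed timestep window

pipe.scheduler.set_timesteps(50)
latents = torch.randn(1, pipe.unet.config.in_channels, 64, 64)
latents = latents * pipe.scheduler.init_noise_sigma

for t in pipe.scheduler.timesteps:  # t decreases from ~1000 toward 0
    # Late Only: style tokens drive cross-attention inside the window,
    # content tokens everywhere else.
    emb = style_emb if window[0] <= int(t) <= window[1] else content_emb
    noise_pred = pipe.unet(latents, t, encoder_hidden_states=emb).sample
    latents = pipe.scheduler.step(noise_pred, t, latents).prev_sample
# latents can then be decoded with pipe.vae as usual.
```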
| Method | CLIP-Prompt ↑ | CLIP-Style ↑ | Training Required? |
|---|---|---|---|
| SDXL (vanilla) | 0.2841 | 0.2203 | — |
| SD1.5 (vanilla) | 0.2748 | 0.2184 | — |
| SDXL + CFG tuning | 0.2859 | 0.2218 | No |
| LPA Late Only (ours) | 0.2853 (+0.41%) | 0.2211 (+0.08%) | No |
Results on a T2I benchmark (SDXL comparison) and a custom 50-prompt style-rich benchmark; all gains come with no loss of output diversity.
```bibtex
@article{sanjyal2025lpa,
  title   = {Local Prompt Adaptation for Style-Consistent Multi-Object Generation in Diffusion Models},
  author  = {Sanjyal, Ankit},
  journal = {arXiv preprint arXiv:2507.20094},
  year    = {2025}
}
```