DiffuseHigh: Training-free Progressive High-Resolution Image Synthesis through Structure Guidance

Large-scale generative models, such as text-to-image diffusion models, have garnered widespread attention across diverse domains due to their creative and high-fidelity image generation. Nonetheless, existing large-scale diffusion models are confined to generating images of up to 1K resolution, which is far from meeting the demands of contemporary commercial applications. Directly sampling higher-resolution images often yields results marred by artifacts such as object repetition and distorted shapes. Addressing these issues typically requires training or fine-tuning models on higher-resolution datasets. However, this poses a formidable challenge, both because large-scale high-resolution images are difficult to collect and because of the substantial computational resources required. While several preceding works have proposed alternatives that bypass the cumbersome training process, they often fail to produce convincing results. In this work, we probe the generative ability of diffusion models at resolutions beyond their original capability and propose a novel progressive approach that fully utilizes generated low-resolution images to guide the generation of higher-resolution images. Our method obviates the need for additional training or fine-tuning, which significantly lowers the computational cost. Extensive experiments and results validate the efficiency and efficacy of our method.

DiffuseHigh adopts an 'interpolation + noising-denoising' technique to generate high-resolution images in a progressive manner. However, applying this approach naively often fails to capture certain structural properties and nuanced details of the low-resolution input. To remedy these issues, we incorporate Discrete Wavelet Transform (DWT)-based structure guidance into the proposed progressive pipeline. This guidance aims to enhance the fidelity of generated images by encouraging the preservation of crucial features from the low-resolution input. We also found that sharpening the interpolated image helps place it near the mode of the sharp data distribution, thereby enhancing image quality and clarity. (For more details, please refer to our paper!)
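The progressive 'interpolation + noising-denoising' loop described above can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: it works on plain nested lists of pixel values rather than SDXL latents, uses nearest-neighbour upsampling and a 3x3 unsharp mask as stand-ins for the interpolation and sharpening steps, and takes the denoiser as a hypothetical callback `denoise(noisy, guide)`. The function names, the `factors` schedule, and the `alpha_bar` noise level are all illustrative.

```python
import math
import random

def upsample_nearest(img, factor):
    """Nearest-neighbour upsampling (assumed stand-in for the interpolation step)."""
    return [[v for v in row for _ in range(factor)]
            for row in img for _ in range(factor)]

def box_blur(img):
    """3x3 mean filter with edge clamping."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += img[yy][xx]
            row.append(total / 9.0)
        out.append(row)
    return out

def sharpen(img, amount=1.0):
    """Unsharp masking (assumed sharpening op): add back detail removed by a blur."""
    blurred = box_blur(img)
    return [[p + amount * (p - b) for p, b in zip(r, br)]
            for r, br in zip(img, blurred)]

def add_noise(img, alpha_bar, rng):
    """Standard forward-diffusion noising: sqrt(a) * x + sqrt(1 - a) * eps."""
    s, n = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [[s * p + n * rng.gauss(0.0, 1.0) for p in row] for row in img]

def progressive_upscale(img, denoise, factors=(2, 2), alpha_bar=0.5, seed=0):
    """Repeat: interpolate, sharpen, add noise, then denoise with the guide."""
    rng = random.Random(seed)
    for f in factors:
        guide = sharpen(upsample_nearest(img, f))
        noisy = add_noise(guide, alpha_bar, rng)
        # `denoise` is a hypothetical callback standing in for the diffusion
        # model's reverse process; it also receives the guide image.
        img = denoise(noisy, guide)
    return img

# Placeholder denoiser that simply returns the guide (no real model involved).
low_res = [[0.0, 1.0], [1.0, 0.0]]
high_res = progressive_upscale(low_res, lambda noisy, guide: guide)
print(len(high_res), len(high_res[0]))
```

With `factors=(2, 2)` the side length doubles twice, mirroring a 1K to 2K to 4K progression. In a real setting the callback would run the pre-trained model's denoising steps from the chosen noise level.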

We validate the role of each component in our pipeline. As illustrated, our structural guidance enables the generated image to preserve essential structures. By forcing the denoising process to maintain the low-frequency details of the sample, which are obtained from well-structured low-resolution images, samples generated with our DWT-based structural guidance present desirable structures and shapes. In contrast, samples without structural guidance tend to have deformed shapes (the mouth of the hedgehog in (a)) or artifacts (dots around the face in (b)). We also observed that the sharpening operation in our pipeline further enhances image quality, particularly on blurred object boundaries and smoothed textures ((c) and (d)).
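To make the idea of maintaining low-frequency details concrete, here is a small sketch of low-frequency replacement with a single-level 2D Haar wavelet transform. It rests on assumptions: the Haar wavelet, the single decomposition level, and the names `haar_dwt2`, `haar_idwt2`, and `structure_guidance` are chosen for illustration and are not taken from the paper. The sketch keeps the high-frequency bands of the current estimate and swaps in the low-frequency (LL) band of the guide image.

```python
def haar_dwt2(img):
    """Single-level 2D Haar DWT of an image with even height and width."""
    ll, lh, hl, hh = [], [], [], []
    for y in range(0, len(img), 2):
        r_ll, r_lh, r_hl, r_hh = [], [], [], []
        for x in range(0, len(img[0]), 2):
            a, b = img[y][x], img[y][x + 1]
            c, d = img[y + 1][x], img[y + 1][x + 1]
            r_ll.append((a + b + c + d) / 2)  # low-frequency (coarse structure)
            r_lh.append((a + b - c - d) / 2)  # top-vs-bottom detail
            r_hl.append((a - b + c - d) / 2)  # left-vs-right detail
            r_hh.append((a - b - c + d) / 2)  # diagonal detail
        ll.append(r_ll)
        lh.append(r_lh)
        hl.append(r_hl)
        hh.append(r_hh)
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild each 2x2 block from its four coefficients."""
    out = []
    for r_ll, r_lh, r_hl, r_hh in zip(ll, lh, hl, hh):
        top, bottom = [], []
        for s, v, u, t in zip(r_ll, r_lh, r_hl, r_hh):
            top += [(s + v + u + t) / 2, (s + v - u - t) / 2]
            bottom += [(s - v + u - t) / 2, (s - v - u + t) / 2]
        out += [top, bottom]
    return out

def structure_guidance(estimate, guide):
    """Keep the estimate's high-frequency bands, take the LL band from the guide."""
    guide_ll, _, _, _ = haar_dwt2(guide)
    _, lh, hl, hh = haar_dwt2(estimate)
    return haar_idwt2(guide_ll, lh, hl, hh)

estimate = [[1, 3], [5, 7]]  # current denoised estimate (toy values)
guide = [[2, 2], [2, 2]]     # well-structured reference (toy values)
print(structure_guidance(estimate, guide))
```

In this toy case the guided result keeps the estimate's local contrast while its coarse level follows the guide, which is the behaviour the ablation attributes to structure guidance.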

Generated 2048 x 2048 images with DiffuseHigh + SDXL.

"A vast, bioluminescent forest where the trees glow in neon colors under a starry sky."

"A giant tortoise with a forest growing on its shell."

"A pirate captain with a weathered face and golden earring."

"An enigmatic wizard with a glowing staff and deep, wise eyes, standing in a lush, enchanted forest, with magical symbols swirling around him, rendered in a classic oil painting technique."

"Ninja racoon in cyberpunk, neonpunk style, photorealistic."

"A fairy garden with glowing flowers and tiny, winged creatures."

"A Van Gogh-inspired night sky over a quiet village."

"A royal queen in a majestic gown, adorned with jewels and a crown."

"A galaxy swirling inside a glass bottle, sitting on a wizard's desk."

Generated 4096 x 4096 images with DiffuseHigh + SDXL.

"A fairy garden with glowing flowers and tiny, winged creatures."

"A glowing white stag with crystal antlers in a magical forest."

"A regal fairy queen with luminescent wings and a crown of flowers, surrounded by shimmering fireflies and mystical fog, depicted in a vibrant, impressionistic oil painting."

"Cinematic photo of delicious chocolate icecream."

"A highly detailed oil painting of tiger."

"A cybernetic eagle with metallic feathers."

"A mystical elf with emerald green eyes and a crown of leaves, standing in a forest glade bathed in soft moonlight."

"A phoenix rising from its ashes in a fiery burst."

"Cyberpunk hero with neon tattoos and futuristic armor."

"A wise owl perched on a branch, surrounded by swirling magical energy."

BibTeX


    @article{kim2024diffusehigh,
      title={DiffuseHigh: Training-free Progressive High-Resolution Image Synthesis through Structure Guidance},
      author={Kim, Younghyun and Hwang, Geunmin and Zhang, Junyu and Park, Eunbyung},
      journal={arXiv preprint arXiv:2406.18459},
      year={2024}
    }

DiffuseHigh enables the pre-trained text-to-image diffusion models (SDXL in this figure) to generate higher-resolution images than the originally trained resolution, e.g., 4×, 16×, without any training or fine-tuning.
