DiffuseHigh: Training-free Progressive High-Resolution Image Synthesis through Structure Guidance

Younghyun Kim1*, Geunmin Hwang1*, Junyu Zhang2,3, Eunbyung Park1,3
1Department of Artificial Intelligence, Sungkyunkwan University, 2School of Computer Science and Engineering, Central South University, 3Department of Electrical and Computer Engineering, Sungkyunkwan University

DiffuseHigh enables pre-trained text-to-image diffusion models (SDXL in this figure) to generate images at higher resolutions than their original training resolution, e.g., 4× or 16×, without any training or fine-tuning.

Abstract

Large-scale generative models, such as text-to-image diffusion models, have garnered widespread attention across diverse domains due to their creative and high-fidelity image generation. Nonetheless, existing large-scale diffusion models are confined to generating images of up to 1K resolution, which is far from meeting the demands of contemporary commercial applications. Directly sampling higher-resolution images often yields results marred by artifacts such as object repetition and distorted shapes. Addressing these issues typically necessitates training or fine-tuning models on higher-resolution datasets, which poses a formidable challenge due to the difficulty of collecting large-scale high-resolution images and the substantial computational resources required. While several preceding works have proposed alternatives that bypass the cumbersome training process, they often fail to produce convincing results. In this work, we probe the generative ability of diffusion models at resolutions beyond their original capability and propose a novel progressive approach that fully utilizes generated low-resolution images to guide the generation of higher-resolution images. Our method obviates the need for additional training or fine-tuning, significantly lowering the computational cost. Extensive experiments validate the efficiency and efficacy of our method.

DiffuseHigh Pipeline

DiffuseHigh adopts an 'interpolation + noising-denoising' technique to generate high-resolution images in a progressive manner. However, naively applying this approach often fails to capture certain structural properties and nuanced details of the low-resolution input. To remedy the aforementioned issues, we incorporate Discrete Wavelet Transform (DWT)-based structure guidance into the proposed progressive pipeline, which enhances the fidelity of generated images by encouraging the preservation of crucial features from the low-resolution input. We also found that sharpening the interpolated image helps place it near a mode of the sharp-image data distribution, thereby enhancing image quality and clarity. (For more details, please refer to our paper!)
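The DWT-based structure guidance described above can be sketched in a few lines: a single-level Haar DWT splits an image into one low-frequency (LL) and three high-frequency (LH, HL, HH) subbands, and the LL subband of the denoised high-resolution estimate is swapped for that of the upsampled low-resolution reference. This is an illustrative NumPy sketch under stated assumptions, not the paper's implementation; `haar_dwt2`, `haar_idwt2`, and `structure_guidance` are hypothetical helper names, and the actual pipeline applies guidance inside the diffusion denoising loop.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar DWT: returns (LL, LH, HL, HH) subbands."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2  # low-frequency structure
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2  # finest high-frequency detail
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: reassembles the full-resolution image."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x

def structure_guidance(pred, ref):
    """Keep the high-frequency detail of `pred`, but inherit the
    low-frequency structure (LL subband) of the reference `ref`."""
    _, lh, hl, hh = haar_dwt2(pred)
    ll_ref, _, _, _ = haar_dwt2(ref)
    return haar_idwt2(ll_ref, lh, hl, hh)
```

In a full pipeline, this substitution would be applied to the intermediate denoised estimate during the early guided steps of the reverse diffusion process, so that the coarse layout of the low-resolution image constrains the high-resolution sample while new detail is still free to emerge.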

Ablating DiffuseHigh Components

We validate the role of each component of our pipeline. As illustrated, our structural guidance enables the generated image to preserve essential structures. By forcing the denoising process to maintain the low-frequency details of the sample, which are obtained from the well-structured low-resolution image, samples with our DWT-based structural guidance exhibit desirable structures and shapes, whereas samples without it tend to show deformed shapes (the hedgehog's mouth in (a)) or artifacts (the dots around the face in (b)). We also observed that the sharpening operation in our pipeline further enhances image quality, particularly around blurred object boundaries and smoothed textures ((c) and (d)).
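The sharpening step ablated above is commonly realized as unsharp masking: subtract a blurred copy of the image and add the residual back, which amplifies high-frequency content around edges. A minimal NumPy sketch assuming a 3×3 box blur with edge-replicated padding (the paper may use a different kernel; `box_blur3` and `unsharp` are hypothetical names):

```python
import numpy as np

def box_blur3(img):
    """3x3 box blur with edge-replicated padding."""
    p = np.pad(img, 1, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / 9.0

def unsharp(img, amount=1.0):
    """Unsharp masking: boost the high-frequency residual.

    Flat regions are unchanged; contrast across edges increases,
    which counteracts the smoothing introduced by interpolation."""
    return img + amount * (img - box_blur3(img))
```

Applied to the interpolated image before the noising-denoising step, this kind of sharpening moves blurred upsampled inputs closer to the sharp-image distribution the diffusion model was trained on.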

Generated 2048 x 2048 images with DiffuseHigh + SDXL.

Generated 4096 x 4096 images with DiffuseHigh + SDXL.

BibTeX


    @article{kim2024diffusehigh,
      title={DiffuseHigh: Training-free Progressive High-Resolution Image Synthesis through Structure Guidance},
      author={Kim, Younghyun and Hwang, Geunmin and Zhang, Junyu and Park, Eunbyung},
      journal={arXiv preprint arXiv:2406.18459},
      year={2024}
    }