Memory-Efficient Diffusion Image Generation: A Hybrid Framework for Low-Resource Environments
Abstract
Stable Diffusion and SDXL-Turbo are among the state-of-the-art text-to-image diffusion models. However, their effectiveness depends heavily on inference time, sampling configuration, and available computing resources. Most previous studies evaluate these models on recent GPUs and other high-end hardware, leaving CPU-only and free-tier environments largely unexamined.
This thesis addresses that gap by providing a comprehensive analysis and optimization approach for low-resource computational environments.
The thesis evaluates two popular diffusion models, Stable Diffusion 1.5 and SDXL-Turbo. The framework combines systematic parameter sweeps, structured prompt engineering, perceptual similarity assessment using LPIPS, and hybrid inference pipelines to optimize generation quality on modest hardware. All experiments were run entirely on CPUs and in free-tier cloud environments, with explicit memory-optimized model-loading policies to keep execution within tight computational constraints.
The pipeline starts with a baseline image generation stage using Stable Diffusion 1.5, chosen for its high structural consistency and its ability to run well in low-resource environments. Images generated in this phase are then passed through SDXL-Turbo via an Img2Img refinement step, forming a two-stage hybrid generation pipeline (SD1.5 → SDXL-Turbo). Quantitative LPIPS analysis, where lower scores indicate greater perceptual similarity, shows that the method significantly improves perceptual similarity to the baseline images: the two-stage pipeline scores 0.3716, substantially better than SDXL-Turbo alone at 0.6743. This suggests that SD1.5 provides consistent compositional stability, while SDXL-Turbo is effective at enhancing fine-grained texture and detail.
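The two-stage pipeline and the memory-conscious loading policy described above can be sketched with the Hugging Face diffusers library. This is a minimal sketch, not the thesis's exact configuration: the prompt, step counts, and the `strength` value are illustrative assumptions, and the heavy model imports are deferred into the function so the small helper can be used standalone.

```python
def img2img_denoising_steps(total_steps: int, strength: float) -> int:
    """Effective denoising steps an Img2Img pass performs: diffusers keeps
    only the last int(total_steps * strength) steps of the noise schedule."""
    return max(1, int(total_steps * strength))


def run_hybrid(prompt: str):
    """Sketch of the SD1.5 -> SDXL-Turbo hybrid pipeline on CPU.
    Model IDs are the public Hub checkpoints; all sampling settings
    here are illustrative assumptions."""
    # Imported lazily so the module loads without the heavy dependencies.
    import torch
    from diffusers import StableDiffusionPipeline, AutoPipelineForImage2Image

    # Stage 1: SD1.5 baseline for compositional stability. fp32 on CPU,
    # with low_cpu_mem_usage and attention slicing to limit peak memory.
    base = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float32,
        low_cpu_mem_usage=True,
    )
    base.enable_attention_slicing()
    baseline = base(prompt, num_inference_steps=25).images[0]

    # Stage 2: SDXL-Turbo Img2Img refinement for fine-grained detail.
    # With strength=0.5 and 4 steps, only 2 denoising steps actually run.
    refiner = AutoPipelineForImage2Image.from_pretrained(
        "stabilityai/sdxl-turbo",
        torch_dtype=torch.float32,
        low_cpu_mem_usage=True,
    )
    refiner.enable_attention_slicing()
    refined = refiner(
        prompt, image=baseline, strength=0.5, num_inference_steps=4
    ).images[0]
    return baseline, refined
```

The `strength` parameter is the key knob in the refinement stage: it controls how much of the SD1.5 composition survives, since Img2Img only runs the final `int(num_inference_steps * strength)` steps of the schedule on the partially noised baseline.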
Besides the hybrid pipeline, a lightweight pixel-average ensemble was also tested to explore whether averaging the outputs of multiple diffusion runs can further enhance image quality. The ensemble outputs appeared visually smoother; however, quantitative LPIPS results indicated greater perceptual divergence from the baseline images, consistent with previously reported findings on ensemble diffusion strategies.
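A pixel-average ensemble of the kind described above can be sketched in a few lines of NumPy; the function name and the dtype handling below are illustrative assumptions, not the thesis's implementation.

```python
import numpy as np


def pixel_average_ensemble(images):
    """Average a list of same-sized HxWxC uint8 images pixel-wise.

    Averaging is done in float32 to avoid uint8 overflow, then the
    result is clipped and rounded back to a displayable uint8 image.
    """
    if not images:
        raise ValueError("need at least one image")
    stack = np.stack([np.asarray(im, dtype=np.float32) for im in images])
    mean = stack.mean(axis=0)  # per-pixel, per-channel average
    return np.clip(mean, 0, 255).round().astype(np.uint8)
```

Note that this averaging smooths out run-to-run variation but also blends structures that disagree across runs, which is one plausible reason the ensemble diverges perceptually from any single baseline image under LPIPS.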
Overall, this thesis introduces a reproducible, hardware-efficient framework for evaluating and optimizing diffusion models in constrained environments. It offers practical guidance to researchers and developers working with limited computational resources by showing that high-quality generation and reliable quantitative evaluation are possible without high-end GPUs.
License
Copyright (c) 2026 MOSES MUKIIBI, Dr Khadak, Hussein Fouad Mohamed Ali, Ahmed Abdulhakim Al-Absi

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.