4 Generating Synthetic Images using Stable Diffusion and Prompts
A.V. Ronquillo
Stable Diffusion Model
Stable Diffusion is a generative model created by StabilityAI, founded by Emad Mostaque, and developed in collaboration with EleutherAI, LAION, RunwayML, and LMU Munich.[1] It was released to the public in August 2022 and has since become one of the most widely used AI image generators. The model is open source, currently carries a free license, and is highly accessible, allowing individuals to generate images on their own computers.
Stable Diffusion builds on the foundations of earlier generative models such as Generative Adversarial Networks (GANs).[2] Unlike most GANs, however, it includes a built-in text-to-image component that can be directed by carefully written prompts. Although this opens up countless possibilities for image generation, StabilityAI still expects users to operate the model within ethical boundaries.[3] When Python scripts are written around the Stable Diffusion model, prompting gives users even more agency, because the code can expose adjustable parameters focused on image quality. This further underlines the innovative and powerful nature of the model.
The model generates images through a diffusion process carried out over a series of steps. It begins with a noisy, unstructured image that is incrementally refined until it resembles the desired result. This involves training a denoising autoencoder to reconstruct the original image from its diffused versions, aligning the diffusion process with the goal of generating realistic images.[4] The model also uses a multi-scale architecture built from convolutional neural networks (CNNs), which allows it to capture complex dependencies across different scales and components of an image. Details about how to integrate and load the Stable Diffusion model can be found in the Stable Diffusion Jupyter Notebook (figure 4.1).
Figure 4.1: Example of a module found in the Stable Diffusion Jupyter Notebook. Each cell consists of an explanation of the snippet of code from the script. This image and notebook were created by A.V. Ronquillo.
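To make the loading step more concrete, the snippet below is a minimal sketch of how a Stable Diffusion pipeline can be loaded and run with the Hugging Face diffusers library. The checkpoint name, device, and prompt are illustrative assumptions, not the exact configuration used in the Stable Diffusion Jupyter Notebook.

```python
# Minimal sketch: loading a Stable Diffusion text-to-image pipeline with diffusers.
# The checkpoint name, device, and prompt below are illustrative assumptions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # assumed checkpoint; any SD 1.x/2.x checkpoint works
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")                  # a GPU is strongly recommended for local generation

# Generate a single image from a text prompt.
image = pipe("A realistic fox standing in a grassy clearing").images[0]
image.save("generated_fox.png")
```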
The main practical challenge with this model is that it may require powerful hardware to run locally. Even so, Stable Diffusion remains a very useful open-source model with applications across a wide range of creative domains. It excels at generating high-quality, diverse, and realistic images across categories such as animals and objects, and it can be adapted for tasks like inpainting missing parts of images or editing existing images while maintaining visual consistency.
In the context of this project, Stable Diffusion is specifically used to augment datasets for training other machine learning models, providing synthetic but realistic variations of existing data. It is utilized as one method for generating synthetic animal images starting with real empty trail camera images. This approach is used as a strategy to counter imbalanced datasets to avoid bias in model training.
Empty Trail Camera Images
Pre-existing empty images are used as base images for generating synthetic animals. 'Empty' images are defined as trail camera images in which no animals are present within the frame (figure 4.2). Synthetic animals generated with the Stable Diffusion model and prompting can then be pasted onto these empty images. A more detailed guide on how to load input images for the model can be found in the Stable Diffusion Jupyter Notebook. This method creates a type of synthetic image that places a synthetic entity on a real photograph. Using real, local trail camera photographs expands on Stable Diffusion's capability to create images with varying degrees of artificiality, and the extent to which an image is synthesized or altered can affect the performance and training of machine learning models.[5] Blending the same empty trail camera images to generate more animals from the minority class diversifies the existing conditions of the natural dataset. In this way, the strategy attempts to keep the synthetic data consistent with the world observed by the trail cameras.[6]
Figure 4.2: Sample batch of empty trail camera images from the data collection. These are base images on which synthetic animals will be pasted using the Stable Diffusion model. This visual was curated by A.V. Ronquillo and the empty trail camera images are sourced from the Wild Winnipeg database.
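As a rough illustration of the loading step, the sketch below reads a folder of empty trail camera images into memory as model inputs. The folder name, file extension, and target size are assumptions and do not reproduce the exact code in the Stable Diffusion Jupyter Notebook.

```python
# Illustrative sketch: loading empty trail camera images as input to the model.
# The folder name, file extension, and target size are assumptions.
from pathlib import Path
from PIL import Image

empty_dir = Path("empty_trail_cam_images")   # hypothetical local folder of 'empty' frames
empty_images = []
for path in sorted(empty_dir.glob("*.jpg")):
    img = Image.open(path).convert("RGB")
    empty_images.append(img.resize((512, 512)))   # a resolution Stable Diffusion handles well

print(f"Loaded {len(empty_images)} empty trail camera images")
```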
Prompting the Generative Process
The Stable Diffusion model excels at interpreting prompts, and a large block of prompts was used to drive the image generation process. This strategy aims to produce a wide range of variations in the synthetic object's appearance. The prompts are built from a base prompt structure containing interchangeable variables for the animal, its action, its position and location, and the scale of the desired synthetic object (figure 4.3). By supplying the script with sets of nouns and verbs, every possible combination of those words can be inserted into the block of prompts for the model to interpret (figure 4.4). A detailed guide on generating a large block of prompts can be found in the Prompt Generator Jupyter Notebook, and sample results of these prompts can be previewed in the Stable Diffusion Jupyter Notebook, where errors and prompt types are documented.
Figure 4.3: Image showing how different nouns, positions, and verbs are substituted into the base prompt through an automated process. The words can be changed and customized to meet one's targeted goals of image generation. This image was created by A.V. Ronquillo. The Prompt Generator Jupyter Notebook can be accessed through this link.
Figure 4.4: Example of a module with the block of prompts inserted into the Stable Diffusion script. Each sentence is the result of all the possible combinations of nouns and verbs from its prompt generator. This image was created by A.V. Ronquillo and can also be found in the Stable Diffusion Jupyter Notebook.
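The following sketch illustrates this combinatorial approach with Python's itertools.product. The word lists and the exact wording of the base prompt are assumptions; they stand in for the vocabulary and template used in the Prompt Generator Jupyter Notebook.

```python
# Sketch of a combinatorial prompt generator.
# The animal, verb, position, and location lists are illustrative assumptions.
from itertools import product

animals = ["fox", "deer", "raccoon"]
verbs = ["standing", "walking", "running"]
positions = ["left side", "center", "right side"]
locations = ["foreground", "midground", "background"]

base_prompt = (
    "A realistic photograph of one singular {animal} by itself with no background image. "
    "This cutout of the {animal} is {verb} at the {position} of the image, "
    "in the perspectival {location}, as an adult {animal}. "
    "Do NOT generate multiple {animal}s."
)

# Every possible combination of the word lists is substituted into the base prompt.
prompts = [
    base_prompt.format(animal=a, verb=v, position=p, location=l)
    for a, v, p, l in product(animals, verbs, positions, locations)
]
print(len(prompts), "prompts generated")
```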
The model interprets these prompts, and each prompt is randomly assigned to one of the empty trail camera images from the local database. As the images undergo the diffusion process, the model applies the image quality parameters set in the Stable Diffusion script (figure 4.5). These parameters are an attempt to blend the synthetic animals seamlessly with the real empty trail camera images; the adjustments include transparency, brightness, vibrancy, contrast, and motion blur. If the input is an empty image captured at nighttime, the script also ensures that the generated synthetic animal is rendered in grayscale. Essentially, these features aim to increase the realism of the synthetic object for purposes of model training and image classification, bringing the synthetic images closer to the real animal images in the local database. Any of these settings can be customized for a wide range of image generation tasks outside the scope of synthetic animals in wildlife research.
Figure 4.5: Example of a module explaining snippets of code focused on image quality parameters. This image was created by A.V. Ronquillo and can also be found in the Stable Diffusion Jupyter Notebook.
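The sketch below shows, using Pillow's ImageEnhance and ImageFilter modules, one way such adjustments might be applied before a generated cutout is pasted onto an empty background. The file names, enhancement factors, paste coordinates, and the simple nighttime check are all illustrative assumptions rather than the exact values in the Stable Diffusion script.

```python
# Illustrative sketch of image quality adjustments with Pillow.
# File names, enhancement factors, and paste coordinates are assumptions.
import numpy as np
from PIL import Image, ImageEnhance, ImageFilter

background = Image.open("empty_001.jpg").convert("RGBA")   # real empty trail cam frame
cutout = Image.open("fox_cutout.png").convert("RGBA")      # generated synthetic animal

# Brightness, contrast, and colour ("vibrancy") adjustments.
cutout = ImageEnhance.Brightness(cutout).enhance(0.9)
cutout = ImageEnhance.Contrast(cutout).enhance(1.1)
cutout = ImageEnhance.Color(cutout).enhance(0.8)

# Approximate motion blur with a light Gaussian blur.
cutout = cutout.filter(ImageFilter.GaussianBlur(radius=1))

# Crude nighttime check: infrared frames are effectively grayscale,
# so the RGB channels are nearly identical.
rgb = np.asarray(background.convert("RGB"), dtype=int)
is_night = abs(rgb[..., 0] - rgb[..., 1]).mean() < 2 and abs(rgb[..., 1] - rgb[..., 2]).mean() < 2
if is_night:
    alpha = cutout.split()[-1]
    cutout = cutout.convert("L").convert("RGBA")
    cutout.putalpha(alpha)

# Reduce the cutout's opacity (the 'alpha' adjustment) and paste it onto the background.
alpha = cutout.split()[-1].point(lambda a: int(a * 0.85))
cutout.putalpha(alpha)
background.paste(cutout, (300, 400), cutout)
background.convert("RGB").save("synthetic_001.jpg")
```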
The outputs of the model's diffusion process differ depending on the prompts, their sentence structure, and the adjustable image quality parameters. Stable Diffusion performs well at understanding the components of scale within an overall image. However, the method of pasting synthetic animals onto existing images did not yield perfect results. During the process, it became clear that prompt specificity is highly important for generating target images: more detailed sentences are more likely to yield the desired output. For instance, multiple foxes are more likely to appear on the empty image if the prompt does not specify 'one singular fox' (figure 4.6). Yet even when one singular fox was generated, the blending of the synthetic animal did not create a seamless visual hybridity between the synthetic and real entities. This was primarily because the model struggled to adhere to the prompts, rather than because the prompts lacked specificity. The model habitually generated contextual backgrounds along with the animals, even when the prompt explicitly instructed it not to. For instance, the model repeatedly generated a solid bounding box around the fox (figure 4.7). An attempt to counter this was made by adjusting the image quality parameters, such as the transparency of this box, but this still failed to produce a seamless image. More details about troubleshooting image quality and iterating on prompts can be found in the Stable Diffusion Jupyter Notebook. Although this method did not create seamlessly blended images, documenting this approach to image generation and automation remains important.
Figure 4.6: Examples of synthetic images created by A.V. Ronquillo through the Stable Diffusion model and prompts. The image on the left uses a prompt with little detail: "A realistic fox running in the center of the image." The image on the right uses a more detailed structure: "A realistic photograph of one singular fox by itself with no background image. This cutout of the fox is standing in the perspectival foreground as an adult fox. Do NOT generate multiple foxes."
Figure 4.7: Example of a synthetic image generated by A.V. Ronquillo through the Stable Diffusion model with this prompt: "A realistic photograph of one singular fox by itself with no background image. This cutout of the fox is standing in the perspectival midground as an adult fox. Do NOT generate multiple foxes." Results similar to this image were further developed by increasing the opacity of the fox and increasing the transparency of the solid bounding box through the 'alpha' and 'ImageEnhance' settings.
Other Methods using Stable Diffusion
The Stable Diffusion model was also used to generate images entirely from scratch. This type of image generation was attempted in order to test the realism of the model, and in particular its ability to adhere to prompt instructions while generating images with a proper sense of scale. Because this approach does not start from a pre-existing empty camera trap image, a location is specified in the prompt instead.
The model was asked to generate a photorealistic fox along with an accurate display of the location stated in the prompt (figure 4.8). A type of forest native to Manitoba was specified, and the model aimed to compose the animal in an accurate rendering of this environment. However, these images still do not reach the desired level of realism when it comes to their visual similarity to the real images from the original database.
Figure 4.8: Example of a synthetic image generated from scratch by A.V. Ronquillo using the Stable Diffusion model. A typical prompt used to generate these images: “A singular anatomically correct fox walking to the left side of the image, in a photorealistic setting in a Bur Oak Forest in Manitoba with a sense of scale in relation to the background.”
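A minimal sketch of this from-scratch approach is shown below, using the prompt from figure 4.8. The checkpoint name, sampling settings, and output file name are assumptions rather than the configuration used in the project's scripts.

```python
# Sketch of generating an image entirely from scratch (no input background image).
# Checkpoint, inference settings, and file names are illustrative assumptions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = (
    "A singular anatomically correct fox walking to the left side of the image, "
    "in a photorealistic setting in a Bur Oak Forest in Manitoba "
    "with a sense of scale in relation to the background."
)

image = pipe(
    prompt,
    num_inference_steps=50,   # number of denoising steps
    guidance_scale=7.5,       # how strongly the output follows the prompt
    height=512,
    width=512,
).images[0]
image.save("from_scratch_fox.png")
```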
The methodology of using masks was also attempted. Implementing the masks required additional steps and a new block of prompts, so the Stable Diffusion script was copied and modified to test this approach. The modified script uses the Stable Diffusion InPaint model, which was integrated with two sets of prompts. The first set generates high-contrast masks based on the empty images from the input folder (figure 4.9). The second set generates a realistic photograph of the animal within the outlines of the mask. The motive behind this method is to eliminate the solid bounding boxes that kept appearing in images produced by the earlier base prompts in the first Stable Diffusion script (figure 4.7).
By using the masks to assist in rendering the animals, the goal was to generate seamless animal objects by superimposing grass and foliage onto the image. This process was executed to increase the realism of the output images and to test a method of blending synthetic images without the use of transparency. Unfortunately, there were still issues with this script, and the test was not considered successful because the second set of prompts, which render the image using the masks, was not properly processed by the Stable Diffusion InPaint model (figure 4.9).
Figure 4.9: Example of generated masks with deer (left) and fox (right) outlines. These images were generated by A.V. Ronquillo using the Stable Diffusion InPaint model. The first set of prompts generated these images as a test so that the second set of prompts can render the animal and the superimposed foliage features within the mask outlines. The type of prompt utilized for these images is stated as: “A high contrast, black and white silhouette of a fox walking from the right side of the image in the foreground.”
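For reference, the snippet below is a minimal sketch of how a mask-based inpainting pass can be set up with the diffusers StableDiffusionInpaintPipeline. The checkpoint name, file names, and prompt are assumptions and do not reproduce the modified script described above.

```python
# Minimal sketch of mask-based inpainting with diffusers.
# The checkpoint, file names, and prompt are illustrative assumptions.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

# The empty trail camera frame and the generated high-contrast animal mask.
init_image = Image.open("empty_001.jpg").convert("RGB").resize((512, 512))
mask_image = Image.open("fox_mask.png").convert("RGB").resize((512, 512))

# White regions of the mask are repainted according to the prompt.
result = pipe(
    prompt="A realistic photograph of a fox walking in the foreground, "
           "blended with the surrounding grass and foliage",
    image=init_image,
    mask_image=mask_image,
).images[0]
result.save("inpainted_fox.jpg")
```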
Meerman, "History and Literature on (Stable) Diffusion."
Sara Beery et al., "Synthetic Examples Improve Generalization for Rare Classes," arXiv:1904.05916v2 [cs.CV], May 14, 2019, https://arxiv.org/abs/1904.05916.
Glossary
Stable Diffusion: A generative model created by StabilityAI that uses diffusion processes to generate images from text prompts. It is an open-source model that can generate high-quality, diverse, and realistic images across different categories.
LMU Munich: Ludwig Maximilian University of Munich, one of the academic institutions that collaborated in the development of Stable Diffusion.
Generative Adversarial Networks (GANs): A machine learning architecture consisting of two neural networks competing against each other in a game-theoretic framework. The generator network creates synthetic data (such as images) from random noise, while the discriminator network attempts to distinguish between real and generated data. Through this adversarial training process, the generator becomes increasingly skilled at producing realistic synthetic content that can fool the discriminator. GANs are widely used for generating images, videos, and other data types, though they typically do not have built-in text-to-image functions like more recent models such as Stable Diffusion.
Text-to-image: A capability of Stable Diffusion that allows the model to generate images based on textual descriptions or prompts, distinguishing it from traditional GANs that typically lack this direct text-to-image functionality.
Diffusion: The generative process used in Stable Diffusion that begins with an unclear and noisy image and incrementally refines it to become more like the desired image result through a series of denoising steps.
Denoising autoencoder: A neural network component used in Stable Diffusion that is trained to reconstruct original images from diffused (noisy) versions, enabling the model to generate clear images from random noise through iterative refinement.
Multi-scale architecture: The architectural approach used in Stable Diffusion that employs convolutional neural networks (CNNs) at different scales to capture complex dependencies across different scales and components of images.
Convolutional neural networks (CNNs): A type of neural network architecture particularly effective for image processing tasks. CNNs use convolutional layers to detect features in images and are commonly used in models like ResNet34 and as part of the Stable Diffusion architecture.
Inpainting: A technique for filling in missing parts of images or editing existing images while maintaining visual consistency. Stable Diffusion includes an InPaint model for this purpose.
Machine learning: A branch of AI that enables computers to learn from data by analyzing patterns and trends in large datasets, using algorithms to make predictions, recognize objects, and generate new ideas based on learned information.
Model training: The process of teaching a machine learning algorithm to make accurate predictions by feeding it labeled data examples, allowing the model to learn patterns and relationships that enable it to classify or predict outcomes on new, unseen data. This involves adjusting the model's internal parameters through iterative exposure to training data until the model achieves satisfactory performance.
Blending: The process of seamlessly combining synthetic animals with real background images to create realistic composite images, often involving adjustments to transparency, brightness, vibrancy, contrast, and motion blur.
Minority class: In machine learning datasets, classes that have fewer training examples compared to other classes, often requiring data augmentation techniques like synthetic image generation to improve model performance.
Vibrancy: An image quality parameter that adjusts the intensity and saturation of colors in synthetic objects to better match the color characteristics of the background image.
Image classification: The simplest image analysis technique offered by YOLOv8, image classification involves identifying what class the objects within a given image belong to and provides the class name and confidence score. This technique does not identify the location of the predicted objects within a given image.
Visual hybridity: The seamless integration of synthetic and real elements in a composite image, where the boundary between artificial and authentic components is not easily distinguishable.