An end-to-end pipeline built around a Stable Diffusion model fine-tuned from SD 1.5 on a custom dataset.
We were able to produce images that look as if they were drawn by an elementary school child using crayons and markers, at a level between the Scribble and Schematic drawing stages.
https://www.littlebigartists.com/articles/drawing-development-in-children-the-stages-from-0-to-17-years/
“A crayon drawing done by a child of Tahiti, ocean, palm trees, skewed perspectives, (people in canoes:0.9)” - Stable Diffusion
We needed to reproduce different scenes, all in the same style. DALL-E and Midjourney were great for producing the images we needed, but they were restricted to one-offs. DALL-E tended to be too refined, and Midjourney only produced images past the Schematic drawing stage. We chose Stable Diffusion because we could customize the model as much as we wanted and could build LoRAs. https://arxiv.org/abs/2106.09685
To generate multiple images in the same style, we found that Dreambooth was the best approach.
“A "personalization" of text-to-image diffusion models. Given as input just a few images of a subject, we fine-tune a pretrained text-to-image model such that it learns to bind a unique identifier with that specific subject. Once the subject is embedded in the output domain of the model, the unique identifier can be used to synthesize novel photorealistic images of the subject contextualized in different scenes.”
@article{ruiz2022dreambooth,
title={DreamBooth: Fine Tuning Text-to-image Diffusion Models for Subject-Driven Generation},
author={Ruiz, Nataniel and Li, Yuanzhen and Jampani, Varun and Pritch, Yael and Rubinstein, Michael and Aberman, Kfir},
booktitle={arXiv preprint arxiv:2208.12242},
year={2022}
}
Results
Goal
Team & Role
“drawing, poorly done by a child of london” - Midjourney
Example of VQGAN + CLIP
Generated using Faraday Cage; thank you to EleutherAI
To produce images that look as if they were drawn by an elementary school child using crayons and markers, at a level between the Scribble and Schematic drawing stages.
I worked closely with a marketing manager, a front-end developer, and a graphic designer. My role was to research, gather data, and train the Stable Diffusion model. Tools used: Stable Diffusion, Beautiful Soup, Python, JavaScript, HTML, CSS, Firebase, Automatic1111, Dreambooth, Flickr, Google Colab, Bash.
Research
Lite
There are multiple drawing stages a person goes through in life. We focused on the first three stages of developmental drawing: Scribble, Pre-Schematic, & Schematic.
Our final result needed to incorporate characteristics such as:
Scribble Stage:
Random scribbling.
Purposeful scribbling.
Naming scribbles.
Pre-Schematic Stage:
Attempt at representation.
Recognizable forms.
Tadpole man.
Floating objects.
Drawings tell stories.
Non-realistic colors.
Schematic Stage:
Use symbols for people and objects.
Development of a baseline.
Emphasize important features.
Show multiple perspectives at the same time.
X-ray drawings.
Midjourney vs DALL-E vs Stable Diffusion
Base model testing
Data gathering
We used Flickr, a site where people upload their own images, as our source of photos. We built our dataset only from images marked as public domain works (PDM 1.0, Public Domain Mark 1.0 Universal).
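One way to restrict a search to public domain images is to filter by license at query time. The sketch below is an illustration, not the scraping code we used (our tooling was Beautiful Soup): it swaps in the public Flickr REST API and only builds the request URL for the `flickr.photos.search` method. The helper name `build_search_url`, the placeholder API key, and the search text are hypothetical, and the license id `10` for Public Domain Mark is an assumption that should be checked against `flickr.photos.licenses.getInfo` before relying on it.

```python
from urllib.parse import urlencode

FLICKR_REST_ENDPOINT = "https://api.flickr.com/services/rest/"
# Assumption: Flickr lists "Public Domain Mark" under license id 10.
# Confirm with flickr.photos.licenses.getInfo before using this value.
PUBLIC_DOMAIN_MARK_LICENSE_ID = "10"


def build_search_url(api_key: str, text: str, page: int = 1, per_page: int = 100) -> str:
    """Build a flickr.photos.search URL limited to Public Domain Mark images."""
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "text": text,
        "license": PUBLIC_DOMAIN_MARK_LICENSE_ID,
        "media": "photos",
        "per_page": per_page,
        "page": page,
        "format": "json",
        "nojsoncallback": 1,  # ask for plain JSON instead of a JSONP wrapper
    }
    return f"{FLICKR_REST_ENDPOINT}?{urlencode(params)}"


# Hypothetical usage: the key and search text are placeholders.
url = build_search_url("YOUR_API_KEY", "child crayon drawing")
```

Filtering by license in the query keeps non-public-domain images out of the candidate set from the start, so nothing has to be checked by hand afterwards.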
Process & Design
Data
We created a dataset of 45 images to train with Dreambooth and produce our fine-tuned model. Dreambooth needs only 3-5 images of a subject, but because we needed to generate 10 different scenes in the same style, we used 4-5 images of each scene, each in one of the three drawing development styles (scribble, pre-schematic, schematic). We used one bash script to resize all of our input images to 512x512 and another to rename them “drawingscenes (1)” through “drawingscenes (45)”. The file name matters because it is required to evoke that style when generating later on.
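The resize and rename steps can be sketched in a few lines. We did this with bash scripts; the version below is a Python equivalent using Pillow, shown only as an illustration. The function names, the folder arguments, the `.png` output format, and the choice to center-crop to a square before resizing (rather than stretching) are all assumptions, not a record of what our scripts did.

```python
from pathlib import Path

TARGET_SIZE = (512, 512)          # training resolution stated in the text
INSTANCE_NAME = "drawingscenes"   # base file name stated in the text


def target_name(index: int, suffix: str = ".png") -> str:
    """Return the numbered file name, e.g. 'drawingscenes (1).png'."""
    return f"{INSTANCE_NAME} ({index}){suffix}"


def prepare_dataset(src_dir: str, dst_dir: str) -> list:
    """Crop/resize every image in src_dir to 512x512 and save it under a numbered name."""
    # Pillow is a third-party package (pip install Pillow); it is imported here
    # so that target_name() can be used without it.
    from PIL import Image, ImageOps

    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    extensions = {".jpg", ".jpeg", ".png"}
    sources = sorted(p for p in Path(src_dir).iterdir() if p.suffix.lower() in extensions)

    written = []
    for index, path in enumerate(sources, start=1):
        with Image.open(path) as img:
            # Assumption: center-crop to a square, then scale to the target size.
            fitted = ImageOps.fit(img.convert("RGB"), TARGET_SIZE)
        out_path = dst / target_name(index)
        fitted.save(out_path)
        written.append(out_path)
    return written
```

Calling `prepare_dataset("raw_images", "dataset")` (hypothetical folder names) would write `drawingscenes (1).png`, `drawingscenes (2).png`, and so on, one per source image in sorted order.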
Code
For our first few training runs we used an M1 MacBook Pro with 16 GB of memory, which took ~1.5 hours to train. We then switched to Google Colab Pro for more memory and actual GPUs. We used Python and Gradio, and we modified Automatic1111’s notebook.
Flow
Achieving our desired results required us to continually modify two aspects of this project: the dataset and the prompt.
Process overview:
Modify data ➡ Train in Dreambooth ➡ Load tuned model into Stable Diffusion ➡ Prompt & generate
↩︎
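On the prompt side of this loop, a small template helper makes it easier to keep the style wording fixed while only the scene changes. The sketch below is illustrative: the helper names are hypothetical, the `(text:weight)` form is the attention syntax of the Automatic1111 web UI as used in the example prompt at the top of this page, and whether or where the training name (for example `drawingscenes`) goes into the prompt is left as an optional assumption.

```python
from typing import Mapping, Optional, Sequence

# Style wording taken from the example prompt shown earlier on this page.
STYLE_PREFIX = "A crayon drawing done by a child of"


def weighted(text: str, weight: float) -> str:
    """Wrap text in the web UI's (text:weight) attention syntax."""
    return f"({text}:{weight:g})"


def build_prompt(
    scene: str,
    details: Sequence[str] = (),
    weighted_details: Optional[Mapping[str, float]] = None,
    style_token: Optional[str] = None,
) -> str:
    """Build one prompt: fixed style prefix, then scene, details, and weighted details."""
    parts = [f"{STYLE_PREFIX} {scene}"]
    if style_token:
        # Assumption: the training name is added as its own comma-separated term.
        parts.append(style_token)
    parts.extend(details)
    for text, weight in (weighted_details or {}).items():
        parts.append(weighted(text, weight))
    return ", ".join(parts)


prompt = build_prompt(
    "Tahiti",
    ["ocean", "palm trees", "skewed perspectives"],
    {"people in canoes": 0.9},
)
```

With these arguments the helper reproduces the Tahiti prompt from the example image. Changing only `scene` and `details` gives the other scenes with the same style wording.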
Hosting
This project only required us to produce the images for the client. We went a step further and built a simple web UI: a text box for the prompt and a button to generate. We host this on Firebase. Weighing generation costs against build costs, this is fine for the small number of generations this specific client needs. If you are building for a larger commercial company, I recommend running your own servers, as cloud computing bills can become unmanageable.
Improvements
I’m curious about the effects of synthetic data. A Lambda workstation would also be useful.