# Textual Inversion - SDXL
This tutorial walks through a Textual Inversion training run with a Stable Diffusion XL base model.
## 1 - Dataset

For this tutorial, we'll use a dataset consisting of 4 images of Bruce the Gnome:

This sample dataset is included in the invoke-training repo under `sample_data/bruce_the_gnome`.
Here are a few tips for preparing a Textual Inversion dataset:
- Aim for 4 to 50 images of your concept (object / style). The optimal number depends on many factors, and can be much higher than this for some use cases.
- Vary all of the image features that you don't want your TI embedding to contain (e.g. background, pose, lighting, etc.).
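As a quick sanity check before training, you can count the images in your dataset directory. This is a minimal sketch (the `check_ti_dataset` helper is hypothetical, not part of invoke-training), assuming the flat image-folder layout that `IMAGE_DIR_DATASET` expects:

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def check_ti_dataset(dataset_dir: str) -> int:
    """Return the number of images found, warning if outside the 4-50 guideline."""
    images = [
        p for p in Path(dataset_dir).iterdir()
        if p.suffix.lower() in IMAGE_EXTENSIONS
    ]
    if not 4 <= len(images) <= 50:
        print(f"Warning: found {len(images)} images; 4-50 is a common starting range.")
    return len(images)
```

The 4-50 range here is just the guideline from the tips above, not a hard limit enforced by the trainer.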
## 2 - Configuration

Below is the training configuration that we'll use for this tutorial.

Raw config file: `src/invoke_training/sample_configs/sdxl_textual_inversion_gnome_1x24gb.yaml`

Full config reference docs: Textual Inversion SDXL Config
```yaml
# Training mode: Textual Inversion
# Base model: SDXL
# GPU: 1 x 24GB

type: SDXL_TEXTUAL_INVERSION
seed: 1
base_output_dir: output/bruce/sdxl_ti

optimizer:
  optimizer_type: AdamW
  learning_rate: 2e-3

lr_warmup_steps: 200
lr_scheduler: cosine

data_loader:
  type: TEXTUAL_INVERSION_SD_DATA_LOADER
  dataset:
    type: IMAGE_DIR_DATASET
    dataset_dir: "sample_data/bruce_the_gnome"
    keep_in_memory: True
  caption_preset: object
  resolution: 1024
  center_crop: True
  random_flip: False
  shuffle_caption_delimiter: null
  dataloader_num_workers: 4

# General
model: stabilityai/stable-diffusion-xl-base-1.0
vae_model: madebyollin/sdxl-vae-fp16-fix
num_vectors: 4
placeholder_token: "bruce_the_gnome"
initializer_token: "gnome"
cache_vae_outputs: False
gradient_accumulation_steps: 1
weight_dtype: bfloat16
gradient_checkpointing: True
max_train_steps: 2000
save_every_n_steps: 200
validate_every_n_steps: 200
max_checkpoints: 20
validation_prompts:
  - A photo of bruce_the_gnome at the beach
  - A photo of bruce_the_gnome reading a book
train_batch_size: 1
num_validation_images_per_prompt: 3
```
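To get a feel for what `num_vectors: 4` means: SDXL has two text encoders (CLIP ViT-L with 768-dim token embeddings and OpenCLIP ViT-bigG with 1280-dim token embeddings), and Textual Inversion learns `num_vectors` new token embeddings in each. A quick back-of-the-envelope calculation (plain Python, not invoke-training code):

```python
# SDXL text encoder token-embedding dimensions.
CLIP_L_DIM = 768   # CLIP ViT-L
CLIP_G_DIM = 1280  # OpenCLIP ViT-bigG

num_vectors = 4  # from the config above

# Trainable parameters in the TI embedding: num_vectors learned token
# embeddings in each of the two text encoders.
trainable_params = num_vectors * (CLIP_L_DIM + CLIP_G_DIM)
print(trainable_params)  # 8192
```

This is why TI checkpoints are only a few kilobytes: everything else in the model stays frozen. Note that the placeholder token will occupy `num_vectors` of the prompt's token positions.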
## 3 - Start Training

If you haven't already, install invoke-training.
Launch the Textual Inversion training pipeline:
```bash
# From inside the invoke-training/ source directory:
invoke-train -c src/invoke_training/sample_configs/sdxl_textual_inversion_gnome_1x24gb.yaml
```
Training takes ~40 mins on an NVIDIA RTX 4090.
## 4 - Monitor
In a new terminal, launch Tensorboard to monitor the training run:
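For example (assuming TensorBoard is installed, and pointing `--logdir` at the `base_output_dir` from the config above):

```bash
tensorboard --logdir output/bruce/sdxl_ti
```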
Access TensorBoard at localhost:6006 in your browser. Sample images will be logged to TensorBoard so that you can see how the Textual Inversion embedding is evolving.
Once training is complete, select the epoch that produces the best visual results. For this tutorial, we'll choose epoch 500:

*Screenshot of the TensorBoard UI showing the validation images for epoch 500.*
## 5 - Transfer to InvokeAI

If you haven't already, set up InvokeAI by following its documentation.
Copy the selected TI embedding into your `${INVOKEAI_ROOT}/autoimport/embedding/` directory. For example:

```bash
cp output/bruce/sdxl_ti/1702587511.2273068/checkpoint_epoch-00000500.safetensors ${INVOKEAI_ROOT}/autoimport/embedding/bruce_the_gnome.safetensors
```
Note that we renamed the file to `bruce_the_gnome.safetensors`. You can choose any file name, but the name becomes the token used to reference your embedding: in our case, we can refer to the new embedding by including `<bruce_the_gnome>` in our prompts.
Launch InvokeAI and you can now use your new `bruce_the_gnome` TI embedding! 🎉
*Example image generated with the prompt "a photo of `<bruce_the_gnome>` at the park".*