`SdxlLoraAndTextualInversionConfig`

type

type: Literal["SDXL_LORA_AND_TEXTUAL_INVERSION"] = (
    "SDXL_LORA_AND_TEXTUAL_INVERSION"
)

model

model: str = 'stabilityai/stable-diffusion-xl-base-1.0'

Name or path of the base model to train. Can be in diffusers format or a single Stable Diffusion checkpoint file (e.g. 'stabilityai/stable-diffusion-xl-base-1.0' or '/path/to/JuggernautXL.safetensors').

hf_variant

hf_variant: str | None = 'fp16'

The Hugging Face Hub model variant to use. Only applies if model is a Hugging Face Hub model name.

lora_checkpoint_format

lora_checkpoint_format: Literal["invoke_peft", "kohya"] = (
    "kohya"
)

The format of the LoRA checkpoint to save. Choose between invoke_peft and kohya.

num_vectors

num_vectors: int = 1

Note: num_vectors can be overridden by initial_phrase.

The number of textual inversion embedding vectors that will be used to learn the concept.

Increasing num_vectors enables the model to learn more complex concepts, but has the following drawbacks:

greater risk of overfitting
increased size of the resulting output file
greater consumption of prompt capacity at inference time

Typical values for num_vectors are in the range [1, 16].

As a rule of thumb, num_vectors can be increased as the size of the dataset increases (without overfitting).

placeholder_token

placeholder_token: str

The special word to associate the learned embeddings with. Choose a unique token that is unlikely to already exist in the tokenizer's vocabulary.

initializer_token

initializer_token: str | None = None

A vocabulary token to use as an initializer for the placeholder token. It should be a single word that roughly describes the object or style that you're trying to train on. Must map to a single tokenizer token.

For example, if you are training on a dataset of images of your pet dog, a good choice would be dog.

initial_phrase

initial_phrase: str | None = None

Note: Exactly one of initializer_token or initial_phrase should be set.

A phrase that will be used to initialize the placeholder token embedding. The phrase will be tokenized, and the corresponding embeddings will be used to initialize the placeholder tokens. The number of embedding vectors will be inferred from the length of the tokenized phrase, so keep the phrase short. The consequences of training a large number of embedding vectors are discussed in the num_vectors field documentation.

For example, if you are training on a dataset of images of pokemon, you might use pokemon sketch white background.
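
The interaction between num_vectors, initializer_token, and initial_phrase can be sketched as follows. This is a minimal illustration, not the library's implementation: resolve_num_vectors is a hypothetical helper, and the whitespace tokenizer passed to it only stands in for the real tokenizer, which may split a single word into several tokens.

```python
from typing import Callable, List, Optional


def resolve_num_vectors(
    num_vectors: int,
    initializer_token: Optional[str],
    initial_phrase: Optional[str],
    tokenize: Callable[[str], List[str]],
) -> int:
    """Hypothetical helper: return the number of embedding vectors to train."""
    # Exactly one of the two initializers should be set.
    if (initializer_token is None) == (initial_phrase is None):
        raise ValueError(
            "Exactly one of initializer_token or initial_phrase should be set."
        )
    if initial_phrase is not None:
        # The tokenized phrase length overrides num_vectors.
        return len(tokenize(initial_phrase))
    return num_vectors


# Assumption: str.split is a toy tokenizer used only for illustration.
print(resolve_num_vectors(1, None, "pokemon sketch white background", str.split))
print(resolve_num_vectors(4, "dog", None, str.split))
```

With the toy tokenizer, the four-word phrase yields four vectors regardless of the configured num_vectors, which is why short phrases are recommended.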

train_unet

train_unet: bool = True

Whether to add LoRA layers to the UNet model and train it.

train_text_encoder

train_text_encoder: bool = True

Whether to add LoRA layers to the text encoder and train it.

train_ti

train_ti: bool = True

Whether to train the textual inversion embeddings.

ti_train_steps_ratio

ti_train_steps_ratio: float | None = None

The fraction of the total training steps for which the TI embeddings will be trained. For example, if we are training for a total of 5000 steps and ti_train_steps_ratio=0.5, then the TI embeddings will be trained for 2500 steps and will then be frozen for the remaining steps.

If None, then the TI embeddings will be trained for the entire duration of training.
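
The arithmetic in the example above can be written out as a short sketch. The helper name ti_training_steps is hypothetical, and rounding down to a whole step is an assumption; the library may round differently.

```python
from typing import Optional


def ti_training_steps(
    max_train_steps: int, ti_train_steps_ratio: Optional[float]
) -> int:
    """Hypothetical helper: number of steps for which TI embeddings are trained."""
    if ti_train_steps_ratio is None:
        # None means the embeddings are trained for the whole run.
        return max_train_steps
    # Assumption: fractional steps are rounded down.
    return int(max_train_steps * ti_train_steps_ratio)


print(ti_training_steps(5000, 0.5))   # 2500
print(ti_training_steps(5000, None))  # 5000
```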

optimizer

optimizer: AdamOptimizerConfig | ProdigyOptimizerConfig = (
    AdamOptimizerConfig()
)

text_encoder_learning_rate

text_encoder_learning_rate: float | None = 1e-05

The learning rate to use for the text encoder model. Set to null or 0 to use the optimizer's default learning rate.

unet_learning_rate

unet_learning_rate: float | None = 0.0001

The learning rate to use for the UNet model. Set to null or 0 to use the optimizer's default learning rate.

textual_inversion_learning_rate

textual_inversion_learning_rate: float | None = 0.001

The learning rate to use for textual inversion training of the embeddings. Set to null or 0 to use the optimizer's default learning rate.
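
The "null or 0" fallback shared by the three learning-rate fields can be sketched as below. resolve_learning_rate is a hypothetical helper, and the optimizer default of 1e-3 is an illustrative value, not one taken from the optimizer configs.

```python
from typing import Optional


def resolve_learning_rate(
    component_lr: Optional[float], optimizer_default_lr: float
) -> float:
    """Hypothetical helper: pick the learning rate for one model component."""
    # Both None (null) and 0 fall back to the optimizer's default.
    if not component_lr:
        return optimizer_default_lr
    return component_lr


default_lr = 1e-3  # illustrative value only
print(resolve_learning_rate(1e-05, default_lr))  # text encoder override is used
print(resolve_learning_rate(None, default_lr))   # falls back to the default
print(resolve_learning_rate(0, default_lr))      # falls back to the default
```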

lr_scheduler

lr_scheduler: Literal[
    "linear",
    "cosine",
    "cosine_with_restarts",
    "polynomial",
    "constant",
    "constant_with_warmup",
] = "constant"

lr_warmup_steps

lr_warmup_steps: int = 0

The number of warmup steps in the learning rate scheduler. Only applied to schedulers that support warmup. See lr_scheduler.
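
As a rough illustration of what warmup does, the sketch below computes a learning-rate multiplier for a schedule shaped like constant_with_warmup: a linear ramp from 0 to 1 over lr_warmup_steps, then constant. The function name is hypothetical and the exact ramp used by the underlying scheduler is an assumption.

```python
def warmup_multiplier(step: int, lr_warmup_steps: int) -> float:
    """Hypothetical helper: scale factor applied to the base learning rate."""
    if step < lr_warmup_steps:
        # Linear ramp; max(1, ...) guards against division by zero.
        return step / max(1, lr_warmup_steps)
    return 1.0


for step in (0, 50, 100, 200):
    print(step, warmup_multiplier(step, lr_warmup_steps=100))
```

With lr_warmup_steps=0 (the default) the multiplier is 1.0 from the first step, so the scheduler behaves as if no warmup were configured.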

min_snr_gamma

min_snr_gamma: float | None = 5.0

Min-SNR weighting for diffusion training was introduced in https://arxiv.org/abs/2303.09556. This strategy improves the speed of training convergence by adjusting the weight of each sample.

min_snr_gamma acts like an upper bound on the weight of samples with low noise levels.

If None, then Min-SNR weighting will not be applied. If enabled, the recommended value is min_snr_gamma = 5.0.
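
The weighting can be illustrated with a small sketch. It assumes epsilon prediction, for which the referenced paper defines the per-sample loss weight as min(SNR, gamma) / SNR; the function name is hypothetical and the library's implementation may differ (for example, for v_prediction).

```python
from typing import Optional


def min_snr_loss_weight(snr: float, min_snr_gamma: Optional[float]) -> float:
    """Hypothetical helper: Min-SNR loss weight, assuming epsilon prediction."""
    if min_snr_gamma is None:
        # Min-SNR weighting disabled: every sample keeps weight 1.
        return 1.0
    # Low-noise samples have a high SNR, so their weight is capped via gamma.
    return min(snr, min_snr_gamma) / snr


for snr in (0.1, 1.0, 5.0, 20.0):
    print(snr, min_snr_loss_weight(snr, min_snr_gamma=5.0))
```

Samples with SNR at or below gamma keep a weight of 1, while samples with SNR above gamma (low noise) are down-weighted in proportion to gamma / SNR.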

lora_rank_dim

lora_rank_dim: int = 4

The rank dimension to use for the LoRA layers. Increasing the rank dimension increases the model's expressivity, but also increases the size of the generated LoRA model.
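
To give a sense of how the rank affects model size, the sketch below counts the parameters that a standard LoRA adapter adds to one linear layer: a down-projection of shape (rank, in_features) plus an up-projection of shape (out_features, rank). The helper name and the layer width of 640 are illustrative assumptions.

```python
def lora_param_count(in_features: int, out_features: int, lora_rank_dim: int) -> int:
    """Hypothetical helper: parameters added by LoRA to a single linear layer."""
    down = lora_rank_dim * in_features   # down-projection matrix
    up = out_features * lora_rank_dim    # up-projection matrix
    return down + up


for rank in (4, 16, 64):
    print(rank, lora_param_count(640, 640, rank))
```

The count grows linearly with the rank, so doubling lora_rank_dim roughly doubles the size of the saved LoRA file.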

cache_text_encoder_outputs

cache_text_encoder_outputs: bool = False

If True, the text encoder(s) will be applied to all of the captions in the dataset before training starts, and the results will be cached to disk. This reduces the VRAM requirements during training (the text encoders do not have to be kept in VRAM) and speeds up training (the text encoders do not have to be run for each training example). This option can only be enabled if train_text_encoder == False and no caption augmentations are being applied.

cache_vae_outputs

cache_vae_outputs: bool = False

If True, the VAE will be applied to all of the images in the dataset before training starts, and the results will be cached to disk. This reduces the VRAM requirements during training (the VAE does not have to be kept in VRAM) and speeds up training (the VAE encoding step does not have to be run). This option can only be enabled if all non-deterministic image augmentations are disabled (i.e. center_crop=True, random_flip=False).
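
The preconditions for the two caching options can be summarized in a small check. This is a sketch under stated assumptions: cache_setting_problems is a hypothetical helper, and has_caption_augmentations is a stand-in flag, since the fields that control caption augmentation are not described here.

```python
from typing import List


def cache_setting_problems(
    cache_text_encoder_outputs: bool,
    train_text_encoder: bool,
    has_caption_augmentations: bool,  # hypothetical flag
    cache_vae_outputs: bool,
    center_crop: bool,
    random_flip: bool,
) -> List[str]:
    """Hypothetical helper: list the reasons a caching option cannot be used."""
    problems = []
    if cache_text_encoder_outputs and (train_text_encoder or has_caption_augmentations):
        problems.append(
            "cache_text_encoder_outputs requires train_text_encoder == False "
            "and no caption augmentations"
        )
    if cache_vae_outputs and (not center_crop or random_flip):
        problems.append(
            "cache_vae_outputs requires center_crop=True and random_flip=False"
        )
    return problems


# train_text_encoder defaults to True, so text encoder caching is rejected here.
print(cache_setting_problems(True, True, False, True, True, False))
```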

enable_cpu_offload_during_validation

enable_cpu_offload_during_validation: bool = False

If True, models will be kept in CPU memory and loaded into GPU memory one-by-one while generating validation images. This reduces VRAM requirements at the cost of slower generation of validation images.

gradient_accumulation_steps

gradient_accumulation_steps: int = 1

The number of gradient steps to accumulate before each weight update. This value is passed to Hugging Face Accelerate. This is an alternative to increasing the batch size when training with limited VRAM.
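
The relationship to batch size can be made concrete with a short calculation. The helper name and the num_processes argument (the number of devices used in a distributed run) are assumptions added for illustration.

```python
def effective_batch_size(
    train_batch_size: int,
    gradient_accumulation_steps: int,
    num_processes: int = 1,  # assumption: number of devices in the run
) -> int:
    """Hypothetical helper: samples contributing to each weight update."""
    return train_batch_size * gradient_accumulation_steps * num_processes


# A batch of 1 with 4 accumulation steps updates weights as often as a batch of 4.
print(effective_batch_size(train_batch_size=1, gradient_accumulation_steps=4))
print(effective_batch_size(train_batch_size=4, gradient_accumulation_steps=1))
```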

weight_dtype

weight_dtype: Literal["float32", "float16", "bfloat16"] = (
    "bfloat16"
)

All weights (trainable and fixed) will be cast to this precision. Lower precision dtypes require less VRAM, and result in faster training, but are more prone to issues with numerical stability.

Recommendations:

"float32": Use this mode if you have plenty of VRAM available.
"bfloat16": Use this mode if you have limited VRAM and a GPU that supports bfloat16.
"float16": Use this mode if you have limited VRAM and a GPU that does not support bfloat16.

mixed_precision

mixed_precision: Literal["no", "fp16", "bf16", "fp8"] = "no"

The mixed precision mode to use.

If mixed precision is enabled, then all non-trainable parameters will be cast to the specified weight_dtype, and trainable parameters are kept in float32 precision to avoid issues with numerical stability.

This value is passed to Hugging Face Accelerate. See accelerate.Accelerator.mixed_precision for more details.

xformers

xformers: bool = False

If true, use xformers for more efficient attention blocks.

gradient_checkpointing

gradient_checkpointing: bool = False

Whether or not to use gradient checkpointing to save memory at the expense of a slower backward pass. Enabling gradient checkpointing slows down training by ~20%.

max_checkpoints

max_checkpoints: int | None = None

The maximum number of checkpoints to keep. New checkpoints will replace earlier checkpoints to stay under this limit. Note that this limit is applied to 'step' and 'epoch' checkpoints separately.
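
One way to picture the limit is the pruning sketch below. It is not the library's implementation: checkpoints_to_delete is a hypothetical helper, the checkpoint names are made up, and the list is assumed to be ordered oldest first. Because the limit is applied to 'step' and 'epoch' checkpoints separately, such a helper would be called once per checkpoint type.

```python
from typing import List, Optional


def checkpoints_to_delete(
    existing: List[str], max_checkpoints: Optional[int]
) -> List[str]:
    """Hypothetical helper: oldest checkpoints to remove to respect the limit."""
    if max_checkpoints is None:
        # No limit configured: keep everything.
        return []
    excess = len(existing) - max_checkpoints
    # `existing` is assumed to be sorted from oldest to newest.
    return existing[:excess] if excess > 0 else []


step_checkpoints = ["step-100", "step-200", "step-300", "step-400"]
print(checkpoints_to_delete(step_checkpoints, max_checkpoints=2))
```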

prediction_type

prediction_type: (
    Literal["epsilon", "v_prediction"] | None
) = None

The prediction type to use for training: either 'epsilon' or 'v_prediction'. If None, the scheduler's own prediction type (noise_scheduler.config.prediction_type) is used.

max_grad_norm

max_grad_norm: float | None = None

Max gradient norm for clipping. Set to null or 0 for no clipping.
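
Gradient-norm clipping rescales the gradients whenever their global L2 norm exceeds the limit. The sketch below shows the idea on a plain list of floats; it is a simplified illustration with a hypothetical function name, not the tensor-based routine used during training.

```python
import math
from typing import List, Optional


def clip_by_global_norm(
    grads: List[float], max_grad_norm: Optional[float]
) -> List[float]:
    """Hypothetical helper: scale gradients so their L2 norm is at most the limit."""
    if not max_grad_norm:
        # None (null) or 0 disables clipping.
        return list(grads)
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_grad_norm:
        return list(grads)
    scale = max_grad_norm / total_norm
    return [g * scale for g in grads]


# The norm of [3, 4] is 5, so clipping to 1.0 scales every entry by 1/5.
print(clip_by_global_norm([3.0, 4.0], max_grad_norm=1.0))
```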

validation_prompts

validation_prompts: list[str] = []

A list of prompts that will be used to generate images throughout training for the purpose of tracking progress.

negative_validation_prompts

negative_validation_prompts: list[str] | None = None

A list of negative prompts that will be applied when generating validation images. If set, this list should have the same length as 'validation_prompts'.
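
The length requirement can be expressed as a simple check. This is a sketch with a hypothetical function name and example prompts; it is not necessarily what the config's own validation does.

```python
from typing import List, Optional


def prompts_are_consistent(
    validation_prompts: List[str],
    negative_validation_prompts: Optional[List[str]],
) -> bool:
    """Hypothetical helper: check that negative prompts pair up with prompts."""
    if negative_validation_prompts is None:
        # Negative prompts are optional.
        return True
    return len(negative_validation_prompts) == len(validation_prompts)


prompts = ["a photo of a dog", "a sketch of a dog"]  # example prompts
print(prompts_are_consistent(prompts, None))
print(prompts_are_consistent(prompts, ["blurry", "low quality"]))
print(prompts_are_consistent(prompts, ["blurry"]))
```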

num_validation_images_per_prompt

num_validation_images_per_prompt: int = 4

The number of validation images to generate for each prompt in 'validation_prompts'. Note that validation can become quite slow if this number is too large.

train_batch_size

train_batch_size: int = 4

The training batch size.

use_masks

use_masks: bool = False

If True, image masks will be applied to weight the loss during training. The dataset must contain masks for this feature to be used.

data_loader

data_loader: TextualInversionSDDataLoaderConfig

The data configuration.

See TextualInversionSDDataLoaderConfig for details.

vae_model

vae_model: str | None = None

The name of the Hugging Face Hub VAE model to train against. This will override the VAE bundled with the base model (specified by the model parameter). This config option is provided for SDXL models, because SDXL shipped with a VAE that produces NaNs in fp16 mode, so it is common to replace this VAE with a fixed version.

seed

seed: Optional[int] = None

A randomization seed for reproducible training. Set to any constant integer for consistent training results. If set to null, training will be non-deterministic.

base_output_dir

base_output_dir: str

The output directory where the training outputs (model checkpoints, logs, intermediate predictions) will be written. A subdirectory will be created with a timestamp for each new training run.

report_to

report_to: Literal[
    "all", "tensorboard", "wandb", "comet_ml"
] = "tensorboard"

The integration to report results and logs to. This value is passed to Hugging Face Accelerate. See accelerate.Accelerator.log_with for more details.

max_train_steps

max_train_steps: int | None = None

Total number of training steps to perform. One training step is one gradient update.

One of max_train_steps or max_train_epochs should be set.

max_train_epochs

max_train_epochs: int | None = None

Total number of training epochs to perform. One epoch is one pass over the entire dataset.

One of max_train_steps or max_train_epochs should be set.

save_every_n_epochs

save_every_n_epochs: int | None = None

The interval (in epochs) at which to save checkpoints.

One of save_every_n_epochs or save_every_n_steps should be set.

save_every_n_steps

save_every_n_steps: int | None = None

The interval (in steps) at which to save checkpoints.

One of save_every_n_epochs or save_every_n_steps should be set.

validate_every_n_epochs

validate_every_n_epochs: int | None = None

The interval (in epochs) at which validation images will be generated.

One of validate_every_n_epochs or validate_every_n_steps should be set.

validate_every_n_steps

validate_every_n_steps: int | None = None

The interval (in steps) at which validation images will be generated.

One of validate_every_n_epochs or validate_every_n_steps should be set.
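
The step/epoch pairs above (max_train_*, save_every_n_* and validate_every_n_*) share the same "one of" rule, which can be checked with a small sketch. The helper name is hypothetical, and reading "one of" as "exactly one" is an assumption.

```python
from typing import Optional


def exactly_one_set(steps: Optional[int], epochs: Optional[int]) -> bool:
    """Hypothetical helper: True if exactly one of the two values is not None."""
    return (steps is None) != (epochs is None)


# Example: train for a fixed number of steps, save and validate once per epoch.
config = {
    "max_train_steps": 5000,
    "max_train_epochs": None,
    "save_every_n_steps": None,
    "save_every_n_epochs": 1,
    "validate_every_n_steps": None,
    "validate_every_n_epochs": 1,
}
for prefix in ("max_train", "save_every_n", "validate_every_n"):
    ok = exactly_one_set(config[f"{prefix}_steps"], config[f"{prefix}_epochs"])
    print(prefix, ok)
```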

check_validation_prompts

check_validation_prompts()