SdLoraConfig
seed
A randomization seed for reproducible training. Set to any constant integer for consistent training results. If set to null, training will be non-deterministic.
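The effect of a fixed seed can be illustrated with Python's standard library alone. This is only a sketch of the general idea, not the trainer's own seeding code; the helper name sample_noise is hypothetical.

```python
import random


def sample_noise(seed, n=3):
    # A dedicated generator keeps the example independent of global state.
    # Passing seed=None seeds from OS entropy, i.e. non-deterministic output.
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


# Two "runs" with the same integer seed draw identical random numbers.
run_a = sample_noise(seed=1)
run_b = sample_noise(seed=1)
print(run_a == run_b)
```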
base_output_dir
The output directory where the training outputs (model checkpoints, logs, intermediate predictions) will be written. A subdirectory will be created with a timestamp for each new training run.
report_to
The integration to report results and logs to. This value is passed to Hugging Face Accelerate. See accelerate.Accelerator.log_with for more details.
max_train_steps
Total number of training steps to perform. One training step is one gradient update.
One of max_train_steps or max_train_epochs should be set.
max_train_epochs
Total number of training epochs to perform. One epoch is one pass over the entire dataset.
One of max_train_steps or max_train_epochs should be set.
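The relationship between steps and epochs can be sketched as follows. This is a minimal illustration under stated assumptions, not the trainer's actual logic: it assumes exactly one of the two fields is set, and that one step (one gradient update) consumes gradient_accumulation_steps batches. The function name and the num_batches_per_epoch parameter are hypothetical.

```python
import math


def resolve_training_length(
    num_batches_per_epoch,
    gradient_accumulation_steps=1,
    max_train_steps=None,
    max_train_epochs=None,
):
    # Assumption: exactly one of the two limits is provided.
    if (max_train_steps is None) == (max_train_epochs is None):
        raise ValueError("Set exactly one of max_train_steps or max_train_epochs.")

    # One training step is one gradient update, which (by assumption) spans
    # gradient_accumulation_steps batches.
    steps_per_epoch = math.ceil(num_batches_per_epoch / gradient_accumulation_steps)

    if max_train_steps is None:
        max_train_steps = max_train_epochs * steps_per_epoch
    else:
        max_train_epochs = math.ceil(max_train_steps / steps_per_epoch)
    return max_train_steps, max_train_epochs


print(resolve_training_length(100, 4, max_train_epochs=2))
print(resolve_training_length(100, 4, max_train_steps=60))
```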
save_every_n_epochs
The interval (in epochs) at which to save checkpoints.
One of save_every_n_epochs or save_every_n_steps should be set.
save_every_n_steps
The interval (in steps) at which to save checkpoints.
One of save_every_n_epochs or save_every_n_steps should be set.
validate_every_n_epochs
The interval (in epochs) at which validation images will be generated.
One of validate_every_n_epochs or validate_every_n_steps should be set.
validate_every_n_steps
The interval (in steps) at which validation images will be generated.
One of validate_every_n_epochs or validate_every_n_steps should be set.
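The save and validation intervals above can be read as "act whenever the completed count is a multiple of the interval". The sketch below shows that reading; it is an assumption about the semantics rather than the trainer's code, and is_due is a hypothetical helper.

```python
def is_due(completed_count, every_n):
    # An unset interval (None) never triggers.
    if every_n is None:
        return False
    # Trigger on positive multiples of the interval (assumed semantics).
    return completed_count > 0 and completed_count % every_n == 0


# With an interval of 5 steps, checkpoints or validation images would be
# produced after steps 5 and 10 in a 10-step run.
print([step for step in range(1, 11) if is_due(step, every_n=5)])
```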
model
Name or path of the base model to train. Can be in diffusers format, or a single Stable Diffusion checkpoint file. For example: 'runwayml/stable-diffusion-v1-5' or '/path/to/realisticVisionV51_v51VAE.safetensors'.
hf_variant
The Hugging Face Hub model variant to use. Only applies if model is a Hugging Face Hub model name.
base_embeddings
A mapping of embedding tokens to trained embedding file paths. These embeddings will be applied to the base model before training.
Example:
Consider also adding the embedding tokens to the data_loader.caption_prefix if they are not already present in the dataset captions.
Note that the embeddings themselves are not fine-tuned further, but they will impact the LoRA model training if they are referenced in the dataset captions. The list of embeddings provided here should be the same list used at generation time with the resultant LoRA model.
lora_checkpoint_format
The format of the LoRA checkpoint to save. Choose between invoke_peft or kohya.
train_text_encoder
Whether to add LoRA layers to the text encoder and train it.
text_encoder_learning_rate
The learning rate to use for the text encoder model. If set, this overrides the optimizer's default learning rate.
unet_learning_rate
The learning rate to use for the UNet model. If set, this overrides the optimizer's default learning rate.
lr_scheduler
lr_scheduler: Literal['linear', 'cosine', 'cosine_with_restarts', 'polynomial', 'constant', 'constant_with_warmup'] = 'constant'
lr_warmup_steps
The number of warmup steps in the learning rate scheduler. Only applied to schedulers that support warmup. See lr_scheduler.
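As an illustration of what warmup means for the 'constant_with_warmup' scheduler, the sketch below computes a learning-rate multiplier that ramps linearly from 0 to 1 over lr_warmup_steps and then stays at 1. This mirrors the usual linear-warmup formulation and is an assumption about the behavior, not a copy of the scheduler implementation; the function name is hypothetical.

```python
def constant_with_warmup_factor(step, lr_warmup_steps):
    # During warmup, scale the learning rate linearly with the step count.
    if step < lr_warmup_steps:
        return step / max(1, lr_warmup_steps)
    # After warmup, use the full learning rate.
    return 1.0


base_lr = 1e-4  # illustrative value only
for step in (0, 5, 10, 20):
    print(step, base_lr * constant_with_warmup_factor(step, lr_warmup_steps=10))
```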
min_snr_gamma
Min-SNR weighting for diffusion training was introduced in https://arxiv.org/abs/2303.09556. This strategy improves the speed of training convergence by adjusting the weight of each sample.
min_snr_gamma acts like an upper bound on the weight of samples with low noise levels.
If None, then Min-SNR weighting will not be applied. If enabled, the recommended value is min_snr_gamma = 5.0.
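The weighting from the paper can be sketched in a few lines. For a timestep with signal-to-noise ratio SNR, the paper clamps the weight to min(SNR, gamma); for a noise-prediction (epsilon) loss this is expressed as min(SNR, gamma) / SNR. The sketch below assumes epsilon prediction and takes the SNR as a plain number; how the trainer computes the SNR and handles other prediction types is not shown here.

```python
def min_snr_weight(snr, min_snr_gamma=5.0):
    # Low-noise timesteps have a high SNR. Clamping the SNR at gamma caps
    # how much those timesteps can contribute to the loss.
    clamped = min(snr, min_snr_gamma)
    # Loss weight for an epsilon-prediction objective (assumed here).
    return clamped / snr


# A low-noise sample (SNR = 10) is down-weighted; a noisy one (SNR = 2) is not.
print(min_snr_weight(10.0), min_snr_weight(2.0))
```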
lora_rank_dim
The rank dimension to use for the LoRA layers. Increasing the rank dimension increases the model's expressivity, but also increases the size of the generated LoRA model.
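To make the size trade-off concrete: a LoRA adapter on a linear layer adds two low-rank matrices, one of shape (rank x in_features) and one of shape (out_features x rank), so the number of added parameters grows linearly with the rank. The sketch below counts them for a single layer; the 768-wide layer is an illustrative choice, not a statement about any specific model.

```python
def lora_param_count(in_features, out_features, rank):
    # Down-projection A: rank x in_features
    # Up-projection   B: out_features x rank
    return rank * in_features + out_features * rank


for rank in (4, 8, 16):
    print(rank, lora_param_count(768, 768, rank))
```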
unet_lora_target_modules
The list of target modules to apply LoRA layers to in the UNet model. The default list will produce a highly expressive LoRA model.
For a smaller and less expressive LoRA model, the following list is recommended:
The list of target modules is passed to Hugging Face's PEFT library. See the docs for details.
text_encoder_lora_target_modules
The list of target modules to apply LoRA layers to in the text encoder models. The default list will produce a highly expressive LoRA model.
For a smaller and less expressive LoRA model, the following list is recommended:
The list of target modules is passed to Hugging Face's PEFT library. See the docs for details.
cache_text_encoder_outputs
If True, the text encoder(s) will be applied to all of the captions in the dataset before starting training and the results will be cached to disk. This reduces the VRAM requirements during training (the text encoders do not have to be kept in VRAM), and speeds up training (the text encoders do not have to be run for each training example). This option can only be enabled if train_text_encoder == False and there are no caption augmentations being applied.
cache_vae_outputs
If True, the VAE will be applied to all of the images in the dataset before starting training and the results will be cached to disk. This reduces the VRAM requirements during training (the VAE does not have to be kept in VRAM), and speeds up training (the VAE encoding step does not have to be run for each training example). This option can only be enabled if all non-deterministic image augmentations are disabled (i.e. center_crop=True, random_flip=False).
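The preconditions for the two caching options can be summarized as a small check. This is a hypothetical helper written for illustration, not part of the library; in particular, has_caption_augmentations is a made-up flag standing in for whatever caption augmentation settings are in use.

```python
def cache_setting_errors(
    cache_text_encoder_outputs=False,
    train_text_encoder=False,
    has_caption_augmentations=False,  # hypothetical flag, see lead-in
    cache_vae_outputs=False,
    center_crop=True,
    random_flip=False,
):
    errors = []
    # Cached text encoder outputs go stale if the encoder is being trained
    # or if captions change between epochs.
    if cache_text_encoder_outputs and train_text_encoder:
        errors.append("cache_text_encoder_outputs requires train_text_encoder == False")
    if cache_text_encoder_outputs and has_caption_augmentations:
        errors.append("cache_text_encoder_outputs requires no caption augmentations")
    # Cached VAE outputs require deterministic image preprocessing.
    if cache_vae_outputs and (not center_crop or random_flip):
        errors.append("cache_vae_outputs requires center_crop=True and random_flip=False")
    return errors


print(cache_setting_errors(cache_vae_outputs=True, random_flip=True))
```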
enable_cpu_offload_during_validation
If True, models will be kept in CPU memory and loaded into GPU memory one-by-one while generating validation images. This reduces VRAM requirements at the cost of slower generation of validation images.
gradient_accumulation_steps
The number of gradient steps to accumulate before each weight update. This value is passed to Hugging Face Accelerate. This is an alternative to increasing the batch size when training with limited VRAM.
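A quick way to reason about this setting is the effective batch size per weight update. The sketch below is plain arithmetic; the num_processes parameter is an added assumption to cover multi-GPU runs and is not a field described in this section.

```python
def effective_batch_size(batch_size, gradient_accumulation_steps, num_processes=1):
    # Each weight update sees this many examples in total.
    return batch_size * gradient_accumulation_steps * num_processes


# A batch size of 2 with 8 accumulation steps behaves, in terms of examples
# per update, like a batch size of 16 while only holding 2 examples in VRAM.
print(effective_batch_size(2, 8))
```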
weight_dtype
All weights (trainable and fixed) will be cast to this precision. Lower precision dtypes require less VRAM, and result in faster training, but are more prone to issues with numerical stability.
Recommendations:
"float32": Use this mode if you have plenty of VRAM available.
"bfloat16": Use this mode if you have limited VRAM and a GPU that supports bfloat16.
"float16": Use this mode if you have limited VRAM and a GPU that does not support bfloat16.
See also mixed_precision.
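The recommendations above amount to a simple decision rule, sketched here as a hypothetical helper. How "plenty of VRAM" and bfloat16 support are determined is left to the caller.

```python
def recommend_weight_dtype(plenty_of_vram, gpu_supports_bfloat16):
    # Full precision when memory is not a concern.
    if plenty_of_vram:
        return "float32"
    # Otherwise prefer bfloat16 where the GPU supports it.
    return "bfloat16" if gpu_supports_bfloat16 else "float16"


print(recommend_weight_dtype(plenty_of_vram=False, gpu_supports_bfloat16=True))
```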
mixed_precision
The mixed precision mode to use.
If mixed precision is enabled, then all non-trainable parameters will be cast to the specified weight_dtype, and trainable parameters are kept in float32 precision to avoid issues with numerical stability.
This value is passed to Hugging Face Accelerate. See accelerate.Accelerator.mixed_precision for more details.
gradient_checkpointing
Whether or not to use gradient checkpointing to save memory at the expense of a slower backward pass. Enabling gradient checkpointing slows down training by ~20%.
max_checkpoints
The maximum number of checkpoints to keep. New checkpoints will replace earlier checkpoints to stay under this limit. Note that this limit is applied to 'step' and 'epoch' checkpoints separately.
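The replacement behavior can be sketched as "append the new checkpoint, then drop the oldest ones until the limit is met". This is an illustrative sketch only: the checkpoint names are made up, and treating None as "no limit" is an assumption. Because the limit applies to 'step' and 'epoch' checkpoints separately, such a routine would be run once per checkpoint kind.

```python
def prune_checkpoints(existing, new_checkpoint, max_checkpoints):
    """Return (kept, deleted). `existing` is ordered oldest-first."""
    kept = existing + [new_checkpoint]
    if max_checkpoints is None:  # assumed to mean "no limit"
        return kept, []
    num_to_delete = max(0, len(kept) - max_checkpoints)
    # The oldest checkpoints are the ones that get replaced.
    return kept[num_to_delete:], kept[:num_to_delete]


kept, deleted = prune_checkpoints(["step_100", "step_200"], "step_300", max_checkpoints=2)
print(kept, deleted)
```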
prediction_type
The prediction_type that will be used for training. Choose between 'epsilon' and 'v_prediction', or leave as None.
If None, the prediction type of the scheduler (noise_scheduler.config.prediction_type) is used.
max_grad_norm
Max gradient norm for clipping. Set to None for no clipping.
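Clipping by gradient norm rescales all gradients by a common factor whenever their combined L2 norm exceeds the limit. The sketch below shows the idea on a flat list of numbers; it is a simplified stand-in written for illustration, not the routine the trainer calls.

```python
import math


def clip_by_global_norm(grads, max_grad_norm):
    # None disables clipping.
    if max_grad_norm is None:
        return list(grads)
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_grad_norm:
        return list(grads)
    # Scale every gradient by the same factor so the direction is preserved.
    scale = max_grad_norm / total_norm
    return [g * scale for g in grads]


# Gradients [3, 4] have norm 5; clipping to 1.0 scales both by 0.2.
print(clip_by_global_norm([3.0, 4.0], max_grad_norm=1.0))
```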
validation_prompts
A list of prompts that will be used to generate images throughout training for the purpose of tracking progress. See also 'validate_every_n_epochs'.
negative_validation_prompts
A list of negative prompts that will be applied when generating validation images. If set, this list should have the same length as 'validation_prompts'.
num_validation_images_per_prompt
The number of validation images to generate for each prompt in 'validation_prompts'. Be careful: validation can become quite slow if this number is too large.
use_masks
If True, image masks will be applied to weight the loss during training. The dataset must contain masks for this feature to be used.