A small patch release containing these fixes:
Full Changelog: https://github.com/huggingface/peft/compare/v0.19.0...v0.19.1
This PEFT release contains no fewer than nine new PEFT methods, described below. It also contains numerous enhancements that should make PEFT more useful to many users.
<img width="1248" height="560" alt="peft-v0 19 0" src="https://github.com/user-attachments/assets/f2878d0d-b1a1-46d0-9b61-55ab6097694c" />

@yeonjoon-jung01 added "GraLoRA: Granular Low-Rank Adaptation for Parameter-Efficient Fine-Tuning" to PEFT (#2851). This method subdivides the base weight into smaller blocks and applies LoRA to each of them. This more granular adaptation promises to increase expressiveness and improve performance, especially at higher ranks (64+), closing the gap to full fine-tuning.
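To make the block-wise idea concrete, here is a minimal numpy sketch of per-block low-rank updates in the spirit of GraLoRA. All names, shapes, and the 2x2 grid are illustrative assumptions, not PEFT's actual implementation:

```python
import numpy as np

# Hypothetical sketch: split a weight matrix into a k x k grid of sub-blocks
# and give each block its own rank-r LoRA pair (A, B).
rng = np.random.default_rng(0)
d_out, d_in, k, r = 8, 8, 2, 2  # 2x2 grid of 4x4 blocks, rank 2 per block

blocks_A = rng.standard_normal((k, k, r, d_in // k))
blocks_B = np.zeros((k, k, d_out // k, r))  # B starts at zero, as in LoRA

# assemble the full weight update from the per-block products B @ A
delta = np.zeros((d_out, d_in))
for i in range(k):
    for j in range(k):
        delta[i * (d_out // k):(i + 1) * (d_out // k),
              j * (d_in // k):(j + 1) * (d_in // k)] = blocks_B[i, j] @ blocks_A[i, j]

# with B initialized to zero, the initial update is zero, matching LoRA
print(delta.shape)
```

Each block can adapt its region of the weight independently, which is where the claimed gain in expressiveness at higher ranks comes from.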
@Conzel contributed BD-LoRA: "Block-Diagonal LoRA for Eliminating Communication Overhead in Tensor Parallel LoRA Serving" (#2895). With BD-LoRA, the LoRA weights are implemented in a block-diagonal way. This reduces communication overhead when using tensor parallelism (TP) and thus enables faster serving.
There is an experimental branch for BD-LoRA support in vLLM: vllm-project/vllm#28136.
Thanks to @kashif, PEFT now also supports Cartridges (#2953). The main purpose of this method is to train a prefix that compresses a long context into a short one and thus saves on tokens. At a low level, this is similar to prefix tuning. The PR also added an example recipe to quickly get started.
"PVeRA: Probabilistic Vector-Based Random Matrix Adaptation" was added to PEFT by @leofillioux in #2952. It is an extension of VeRA, a PEFT method that uses weight sharing between layers to be especially parameter efficient. PVeRA builds on top of that by adding a probabilistic element, sampling from the shared parameters and promising better performance overall.
@fei407 added PSOFT, "Efficient Orthogonal Fine-Tuning with Principal Subspace Adaptation", to PEFT in #3037. Orthogonal fine-tuning techniques like OFT and BOFT are good at preserving the structure, and thus the capabilities, of the underlying base model. PSOFT improves the efficiency of this technique by constraining the adaptation to a low-rank principal subspace.
@yibozhong added Lily: "Low-Rank Interconnected Adaptation across Layers" to PEFT in #2563. Lily is superficially similar to LoRA but has a sophisticated parameter-sharing scheme. The A parameters are shared blockwise (e.g. 4 consecutive q_proj layers share the same A). There is a pool of B parameters that is shared globally; the actual B matrices are chosen in a data-dependent way through a router. This allows Lily to use higher ranks than LoRA while maintaining a low trainable parameter count.
In #3084, "PEANuT: Parameter-Efficient Adaptation with Weight-aware Neural Tweakers" was added to PEFT, again by @yibozhong. PEANuT adds small neural networks (so-called weight-aware neural tweakers) to the base model. Compared to LoRA, this increases expressivity for the same trainable parameter count, or allows greatly lowering the parameter count without sacrificing expressivity. This comes at the expense of higher memory requirements for the same parameter count and decreased speed.
We have another serial contributor in @kashif, who also contributed TinyLoRA: "Learning to Reason in 13 Parameters" in #3024. This is a PEFT method that allows training an extremely small number of parameters, far fewer than what could be achieved even with LoRA at rank 1. The paper shows that, in particular with reinforcement learning, it can often be enough to train just a few parameters to achieve good results.
@LonglongaaaGo added "AdaMSS: Adaptive Multi-Subspace Approach for Parameter-Efficient Fine-Tuning" to PEFT. This method segments the base weights of the model into smaller subspaces that are targeted for fine-tuning. Moreover, it's possible to dynamically assign a lower parameter budget to less important subspaces during training, similar to what AdaLoRA does. This promises to provide higher expressiveness and better generalization than similar PEFT methods.
In #2939, we added functions to PEFT to allow converting checkpoints of many non-LoRA methods into LoRA checkpoints. This can be useful because many other packages, e.g. Diffusers and vLLM, support only LoRA but no other PEFT methods. With the new conversion tools, more PEFT methods than just LoRA can thus be used with those packages. Conversion is lossy, but empirical testing showed that with a sufficiently high LoRA rank, the error can be quite low.
@sambhavnoobcoder added a new way to initialize LoRA weights with "LoRA-GA: Low-Rank Adaptation with Gradient Approximation" (#2926). This allows you to initialize the LoRA weights in a way that aligns the gradients with full fine-tuning and should lead to faster training convergence.
In "LoRA vs Full Fine-tuning: An Illusion of Equivalence", the authors showed that LoRA fine-tuning can introduce so-called "intruder dimensions" which contribute to forgetting. We now have a utility function in PEFT to remove intruder dimensions, reduce_intruder_dimension. When calling it on a fine-tuned LoRA model, forgetting should be reduced while the fine-tuned task performance should remain almost the same.
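The rough idea can be sketched as follows. This is a hypothetical numpy illustration of the concept, not the actual reduce_intruder_dimension implementation; the threshold and the similarity measure are made-up assumptions:

```python
import numpy as np

# Sketch: flag singular vectors of the fine-tuned weight that have low cosine
# similarity to every singular vector of the base weight ("intruders"), then
# zero out the corresponding components of the SVD reconstruction.
rng = np.random.default_rng(0)
W_base = rng.standard_normal((16, 16))
W_ft = W_base + 0.1 * rng.standard_normal((16, 16))  # stand-in for a LoRA update

U_b, _, _ = np.linalg.svd(W_base)
U_f, S_f, Vt_f = np.linalg.svd(W_ft)

# max cosine similarity of each fine-tuned left singular vector to the base ones
sims = np.abs(U_b.T @ U_f).max(axis=0)
intruders = sims < 0.5  # hypothetical threshold

S_f_clean = np.where(intruders, 0.0, S_f)
W_clean = (U_f * S_f_clean) @ Vt_f  # reconstruction without intruder components
print(W_clean.shape)
```

Directions that closely match the base model survive; directions that appeared only through fine-tuning are damped, which is what is meant to reduce forgetting.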
In #3048, @balvisio added support for Transformer Engine, a quantization method by NVIDIA, to PEFT.
In a series of PRs (#3079, #3091, #3096), @michaelbenayoun added support for Tensor Parallelism to LoRA.
In many LLMs, the embedding and the LM head have tied weights to save on parameter count. This can, however, lead to tricky situations when trying to fine-tune those layers. Through a series of PRs (#2803, #2922, #2870, #2879, #3126), we improved the user experience when doing so. Most notably, users can now pass ensure_weight_tying=True to their PEFT config to force weight tying to be upheld. Please check the PEFT weight tying docs for how weight tying is now being handled. Thanks to @romitjain, @sambhavnoobcoder, and @Cursx for their contributions.
#3055 makes LoRA work with base models that use very low precision floats like torch.float8_e4m3fn. An example of that would be MiniMax-M2.5.
#3128 introduces zero init to Prefix Tuning which, according to our benchmarks, reduced the result variance significantly and yielded good task accuracy without the need for prompt engineering.
With #3088, the LoftQ implementation now supports error correction for int8 quantization (without using activation thresholding), in addition to the already supported nf4 quantization.
The Bone PEFT method was removed in #3115. Users are directed to use MiSS instead, which is the improved replacement for Bone. Use this Bone-to-MiSS conversion script if you want to port old Bone checkpoints.
These two quantization methods now use GPTQModel as their backend (#2932) thanks to @ZX-ModelCloud.
requires_grad in modules_to_save: Previously, PEFT would enable requires_grad on the original module if the corresponding modules_to_save module was disabled. This is almost never desirable and has been fixed. Although this change is technically backwards-incompatible, it concerns an extremely niche case, so we don't expect any users to be negatively affected by it.
- no_split_modules now captures values recursively by @githubnemo in https://github.com/huggingface/peft/pull/3032
- inference_mode when setting adapters with modules_to_save (Issue #2928) by @ada-ggf25 in https://github.com/huggingface/peft/pull/2931

Full Changelog: https://github.com/huggingface/peft/compare/v0.18.1...v0.19.0
Small patch release containing the following changes:
@ppetrushkov added RoAd: 2D Rotary Adaptation to PEFT in #2678. RoAd learns 2D rotation matrices that are applied using only element-wise multiplication, thus promising very fast inference with adapters in the unmerged state.
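The rotation trick can be sketched in a few lines of numpy. This is purely illustrative of the mechanism, not PEFT's RoAd code; pairing adjacent features and using one angle per pair are assumptions for the example:

```python
import numpy as np

# Sketch: group hidden features into pairs and rotate each pair by a learned
# angle, using only element-wise multiplications and additions.
rng = np.random.default_rng(0)
h = rng.standard_normal(8)      # hidden vector with an even dimension
theta = rng.standard_normal(4)  # one learned angle per feature pair

x1, x2 = h[0::2], h[1::2]
y1 = np.cos(theta) * x1 - np.sin(theta) * x2
y2 = np.sin(theta) * x1 + np.cos(theta) * x2

out = np.empty_like(h)
out[0::2], out[1::2] = y1, y2

# each 2D rotation preserves the norm of its pair, so the vector norm is kept
print(np.allclose(np.linalg.norm(out), np.linalg.norm(h)))
```

Because the whole adapter is a handful of element-wise products, applying it in the unmerged state adds very little inference overhead.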
Remarkably, besides LoRA, RoAd is the only PEFT method that supports mixed adapter batches. This means that when you have loaded a model with multiple RoAd adapters, you can use all of them for different samples in the same batch, which is much more efficient than switching adapters between batches:
```python
model = PeftModel.from_pretrained(base_model, <path-to-road-adapter-A>, adapter_name="adapter-A")
model.load_adapter(<path-to-road-adapter-B>, adapter_name="adapter-B")

inputs = ...  # input with 3 samples
# apply adapter A to sample 0, adapter B to sample 1, and use the base model for sample 2:
adapter_names = ["adapter-A", "adapter-B", "__base__"]
output_mixed = model(**inputs, adapter_names=adapter_names)
gen_mixed = model.generate(**inputs, adapter_names=adapter_names)
```
Activated LoRA is a technique added by @kgreenewald in #2609 for causal language models, which selectively enables LoRA adapters depending on a specific invocation sequence of tokens in the input. This has the major benefit that most of the KV cache can be re-used during inference when the adapter is only used to generate part of the response, after which the base model takes over again.
@TheTahaaa contributed not only support for Arrow, a dynamic routing algorithm between multiple loaded LoRAs in #2644, but also GenKnowSub, a technique built upon Arrow where the 'library' of LoRAs available to Arrow is first modified by subtracting general knowledge adapters (e.g., trained on subsets of Wikipedia) to enhance task-specific performance.
Thanks to @Bilican, Wavelet Fine-Tuning (WaveFT) was added to PEFT in #2560. This method trains sparse updates in the wavelet domain of residual matrices, which is especially parameter efficient. It is very interesting for image generation, as it promises to generate diverse outputs while preserving subject fidelity.
Decoupled Low-rank Adaptation (DeLoRA) was added by @mwbini in #2780. This new PEFT method is similar to DoRA in so far as it decouples the angle and magnitude of the learned adapter weights. However, DeLoRA implements this in a way that promises to better prevent divergence. Moreover, it constrains the deviation of the learned weight by imposing an upper limit of the norm, which can be adjusted via the delora_lambda parameter.
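A heavily simplified sketch of the decoupling idea follows. It assumes a single global magnitude and Frobenius-norm normalization for brevity (the actual DeLoRA method operates on the rank-one components of the update); the numeric values are made up:

```python
import numpy as np

# Sketch: decouple the *direction* of the low-rank update from its *magnitude*,
# and cap the magnitude with a delora_lambda-style upper bound.
rng = np.random.default_rng(0)
B = rng.standard_normal((8, 2))
A = rng.standard_normal((2, 8))
delora_lambda = 1.5  # upper bound on the update norm (hypothetical value)
magnitude = 4.2      # learned scale (hypothetical value)

direction = B @ A
direction /= np.linalg.norm(direction)       # unit Frobenius norm: pure direction
delta = min(magnitude, delora_lambda) * direction  # magnitude capped at lambda

print(round(float(np.linalg.norm(delta)), 4))  # norm never exceeds delora_lambda
```

Bounding the update norm this way is what limits how far the adapted weight can drift from the base weight, which is the claimed safeguard against divergence.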
Orthogonal Subspace Fine-tuning (OSF) was added by @NikhilNayak-debug in #2685. By freezing the high-rank subspace of the targeted weight matrices and projecting gradient updates onto a low-rank subspace, OSF achieves good performance on continual learning tasks. While it is a bit memory-intensive for standard fine-tuning, it is definitely worth checking out on tasks where performance degradation of previously learned tasks is a concern.
In #2525, @ved1beta added the text generation benchmark to PEFT. This is a framework to determine and compare metrics with regard to text generation of different PEFT methods, e.g. runtime and memory usage. Right now, this benchmark is still lacking experimental settings and a visualization, analogous to what we have in the MetaMathQA benchmark. If this is something that interests you, we encourage you to let us know or, even better, contribute to this benchmark.
PEFT has integrations with other libraries like Transformers and Diffusers. To facilitate this integration, PEFT now provides a stable interface of functions that should be used if applicable. For example, the set_adapter function can be used to switch between PEFT adapters on the model, even if the model is not a PeftModel instance. We commit to keeping these functions backwards compatible, so it's safe for other libraries to build on top of those.
Some Transformers models can have tied weights. This is especially prevalent when it comes to the embedding and the LM head. Currently, the way that this is handled in PEFT is not obvious. We thus drafted an issue to illustrate the intended behavior in #2864. This shows what our goal is, although not everything is implemented yet.
In #2803, @romitjain added the ensure_weight_tying argument to LoraConfig. This argument, if set to True, enforces weight tying of the modules targeted with modules_to_save. Thus, if embedding and LM head are tied, they will share weights, which is important to allow, for instance, weight merging. Therefore, for most users, we recommend enabling this setting if they want to fully fine-tune the embedding and LM head. For backwards compatibility, the setting is off by default, though.
Note that in accordance with #2864, the functionality of ensure_weight_tying=True will be expanded to also include trainable tokens (#2870) and LoRA (tbd.) in the future.
@grewalsk extended LoHa and LoKr to support nn.Conv1d layers, as well as nn.Conv2d with 1x1 kernels, in #2515.
Thanks to @macmacmacmac, we now have a new initialization option for prompt tuning, random discrete initialization (#2815). This option should generally work better than random initialization, as corroborated on our PEFT method comparison suite. Give it a try if you use prompt tuning.
If you use multiple LoRA adapters, you can merge them into a single adapter using model.add_weighted_adapter. However, so far, this only worked with positive weights per adapter. Thanks to @sambhavnoobcoder and @valteu, it is now possible to pass negative weights too.
At the time of writing, the Transformers v5 release is imminent. This Transformers version will be incompatible with PEFT < 0.18.0. If you plan to use Transformers v5 with PEFT, please upgrade PEFT to 0.18.0+.
This PEFT version no longer supports Python 3.9, which has reached its end of life. Please use Python 3.10+.
The OFT method has been updated to make it slightly faster and to stabilize the numerics in #2805. This means, however, that existing checkpoints may give slightly different results after upgrading to PEFT 0.18.0. Therefore, if you use OFT, we recommend retraining the adapter.
- hub_online_once in trainable token tests by @githubnemo in https://github.com/huggingface/peft/pull/2701
- to issue for 8-bit model by @yao-matrix in https://github.com/huggingface/peft/pull/2797
- trainable_token_indices for lm_head by @aflueckiger in https://github.com/huggingface/peft/pull/2863
- max_length to replace max_seq_length; correct README for by @kaixuanliu in https://github.com/huggingface/peft/pull/2862

Full Changelog: https://github.com/huggingface/peft/compare/v0.17.1...v0.18.0
This patch release contains a few fixes (via #2710) for the newly introduced target_parameters feature, which allows LoRA to target nn.Parameters directly (useful for mixture of expert layers). Most notably:
- Using multiple adapters (e.g. via model.add_adapter or model.load_adapter) did not work correctly. Since a solution is not trivial, PEFT now raises an error to prevent this situation.

@kkb-code contributed Sparse High Rank Adapters (SHiRA, paper) which promise to offer a potential gain in performance over LoRA - especially the concept loss when using multiple adapters is improved. Since the adapters only train 1-2% of the weights and are inherently sparse, switching between adapters may be cheaper than with LoRA. (#2584)
@JL-er added a new PEFT method, MiSS (Matrix Shard Sharing) in #2604. This method is an evolution of Bone, which, according to our PEFT method comparison benchmark, gives excellent results when it comes to performance and memory efficiency. If you haven't tried it, you should do so now.
At the same time, Bone will be deprecated in favor of MiSS and will be removed in PEFT v0.19.0. If you already have a Bone checkpoint, you can use scripts/convert-bone-to-miss.py to convert it into a MiSS checkpoint and proceed with training using MiSS.
LoRA is now able to target nn.Parameter directly (#2638, #2665)! Ever had this complicated nn.Module with promising parameters inside, but it was too custom to be supported by your favorite fine-tuning library? No worries: now you can target nn.Parameters directly using the target_parameters config attribute, which works similarly to target_modules.
This option can be especially useful for models with Mixture of Expert (MoE) layers, as those often use nn.Parameters directly and cannot be targeted with target_modules. For example, for the Llama4 family of models, use the following config to target the MoE weights:
```python
config = LoraConfig(
    ...,
    target_modules=[],  # <= prevent targeting any modules
    target_parameters=["feed_forward.experts.down_proj", "feed_forward.experts.gate_up_proj"],
)
```
Note that this feature is still experimental, as it comes with a few caveats, and therefore might change in the future. Also, MoE weights with many experts can be quite huge, so expect higher memory usage compared to targeting normal nn.Linear layers.
Injecting adapters based on a state_dict: Sometimes, it is possible that there is a PEFT adapter checkpoint but the corresponding PEFT config is not known for whatever reason. To inject the PEFT layers for this checkpoint, you would usually have to reverse-engineer the corresponding PEFT config, most notably the target_modules argument, based on the state_dict from the checkpoint. This can be cumbersome and error-prone. To avoid this, it is also possible to call inject_adapter_in_model and pass the loaded state_dict as an argument:
```python
from safetensors.torch import load_file

from peft import LoraConfig, inject_adapter_in_model

model = ...
state_dict = load_file(<path-to-safetensors-file>)
lora_config = LoraConfig()  # <= no need to specify further
model = inject_adapter_in_model(lora_config, model, state_dict=state_dict)
```
Find more on state_dict based injection in the docs.
A bug in prompt learning methods caused modules_to_save to be ignored. Especially classification tasks are affected since they usually add the classification/score layer to modules_to_save. In consequence, these layers were neither trained nor stored after training. This has been corrected now. (#2646)
Full Changelog: https://github.com/huggingface/peft/compare/v0.16.0...v0.17.0
In #2468, @AaronZLT added the LoRA-FA optimizer to PEFT. This optimizer is based on AdamW and it increases memory efficiency of LoRA training. This means that you can train LoRA with less memory, or, with the same memory budget, use higher LoRA ranks, potentially getting better results.
Thanks to @PaulAlbert31, a new PEFT method called RandLoRA was added to PEFT (#2464). Similarly to VeRA, it uses non-learnable random low rank matrices that are combined through learnable matrices. This way, RandLoRA can approximate full rank updates of the weights. Training models quantized with bitsandbytes is supported.
@Phoveran added Circular Convolution Adaptation, C3A, in #2577. This new PEFT method can overcome the limit of low rank adaptations as seen e.g. in LoRA while still promising to be fast and memory efficient.
Thanks to @gslama12 and @SP1029, LoRA now supports Conv2d layers with groups != 1. This requires the rank r to be divisible by groups. See #2403 and #2567 for context.
@dsocek added support for Intel Neural Compressor (INC) quantization to LoRA in #2499.
DoRA now supports Conv1d layers thanks to @EskildAndersen (#2531).
Passing init_lora_weights="orthogonal" now enables orthogonal weight initialization for LoRA (#2498).
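One common way to obtain an orthonormal initialization is via a QR decomposition. The following numpy sketch is illustrative of the general idea only; it is not necessarily what init_lora_weights="orthogonal" does internally, and keeping B at zero is an assumption borrowed from standard LoRA:

```python
import numpy as np

# Sketch: draw a random matrix and orthonormalize it with QR, then use the
# result as the LoRA A matrix (rows are orthonormal).
rng = np.random.default_rng(0)
d_out, d_in, r = 16, 16, 4

Q, _ = np.linalg.qr(rng.standard_normal((d_in, r)))
A = Q.T                    # shape (r, d_in), orthonormal rows
B = np.zeros((d_out, r))   # assumption: B stays zero so the initial delta vanishes

print(np.allclose(A @ A.T, np.eye(r)))  # rows of A are orthonormal
```

Orthonormal rows keep the input projection well-conditioned, which is the intuition behind the claimed improvement in training convergence.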
@gapsong brought us Quantization-Aware LoRA training in #2571. This can make QLoRA training more efficient, please check the included example. Right now, only GPTQ is supported.
There has been a big refactor of Orthogonal Finetuning, OFT, thanks to @zqiu24 (#2575). This makes the PEFT method run more quickly and require less memory. It is, however, incompatible with old OFT checkpoints. If you have old OFT checkpoints, either pin the PEFT version to <0.16.0 or retrain it with the new PEFT version.
Thanks to @keepdying, LoRA hotswapping with compiled models no longer leads to CUDA graph re-records (#2611).
- requires_grad of modules_to_save is now set to True when used directly with inject_adapter. This is relevant for PEFT integrations, e.g. Transformers or Diffusers.
- Due to a Transformers refactor of vision language models, if you previously applied PEFT to vlm.language_model, it will no longer work; please apply it to vlm directly (see #2554 for context). Moreover, the refactor results in different checkpoints. We managed to ensure backwards compatibility in PEFT, i.e. old checkpoints can be loaded successfully. There is, however, no forward compatibility, i.e. loading checkpoints trained after the refactor is not possible with package versions from before the refactor (PEFT <0.16.0 and transformers <4.52.0, respectively). In this case, you need to upgrade PEFT and transformers. More context in #2574.
- modules_to_save by @githubnemo in https://github.com/huggingface/peft/pull/2481
- add_weighted_adapter by @Beinsezii in https://github.com/huggingface/peft/pull/2512
- rank_pattern, rank_alpha for add_weighted_adapter by @Beinsezii in https://github.com/huggingface/peft/pull/2550
- prepare_model_for_gradient_checkpointing protected to public by @qgallouedec in https://github.com/huggingface/peft/pull/2569

Full Changelog: https://github.com/huggingface/peft/compare/v0.15.2...v0.16.0
This patch fixes a bug that resulted in prompt learning methods like P-tuning not working (#2477).
This patch includes a fix for #2450. In this bug, modules_to_save was not handled correctly when used in conjunction with DeepSpeed ZeRO stage 3, which resulted in those modules being saved as placeholder values in the checkpoints.
Full Changelog: https://github.com/huggingface/peft/compare/v0.15.0...v0.15.1
@iboing and @5eqn contributed CorDA: Context-Oriented Decomposition Adaptation of Large Language Models for Task-Aware Parameter-Efficient Fine-tuning. This task-driven initialization method has two modes, knowledge-preservation and instruction-preservation, both using external data to select ranks intelligently. The former can be used to select those ranks that correspond to weights not affiliated with knowledge from, say, a QA dataset. The latter can be used to select those ranks that correspond most to the task at hand (e.g., a classification task). (#2231)
The new Trainable Tokens tuner allows for selective training of tokens without re-training the full embedding matrix, e.g. when adding support for reasoning / thinking tokens. This is a lot more memory efficient and the saved checkpoint is much smaller. It can be used standalone or in conjunction with LoRA adapters by passing trainable_token_indices to LoraConfig. (#2376)
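The core idea can be sketched like this. The sketch is an illustrative pseudo-implementation in numpy, not PEFT's TrainableTokens code; the index values and helper name are made up:

```python
import numpy as np

# Sketch: keep the full embedding matrix frozen and learn a small delta only
# for the chosen token indices, so the checkpoint stores just those rows.
rng = np.random.default_rng(0)
vocab, dim = 1000, 16
embedding = rng.standard_normal((vocab, dim))        # frozen base embedding

trainable_token_indices = [7, 42, 999]               # e.g. new thinking tokens
delta = np.zeros((len(trainable_token_indices), dim))  # the only trained params

def embed(token_ids):
    """Hypothetical lookup: base embedding plus per-token learned delta."""
    token_ids = np.asarray(token_ids)
    out = embedding[token_ids].copy()
    for row, idx in enumerate(trainable_token_indices):
        out[token_ids == idx] += delta[row]
    return out

print(embed([1, 7, 42]).shape)
```

Only `delta` needs gradients and storage, which is why the checkpoint is tiny compared to saving a re-trained embedding matrix.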
LoRA now supports targeting multihead attention modules (but for now only those with _qkv_same_embed_dim=True). These modules were tricky as they may expose linear submodules but won't use their forward methods, therefore needing explicit support. (#1324)
Hotswapping now allows different alpha scalings and ranks without recompilation of the model when the model is prepared using a call to prepare_model_for_compiled_hotswap() before compiling the model. (#2177)
GPTQModel support was added in #2247 as a replacement for AutoGPTQ which is not maintained anymore.
- all-linear as target_modules for custom (non-transformers) models (#2267). With this change comes a bugfix where it was possible that non-linear layers were selected when they shared the same name with a linear layer (e.g., bar.foo and baz.foo).
- PEFT methods are now registered via a register_peft_method() call. (#2282)
- PEFT_TYPE_TO_MODEL_MAPPING is now deprecated and should not be relied upon. Use PEFT_TYPE_TO_TUNER_MAPPING instead. (#2282)
- modules_to_save keys wrongly matched parts of the state dict if the key was a substring of another key (e.g., classifier and classifier2). (#2334)
- Input dtype casting can now be disabled with disable_input_dtype_casting=True. (#2353)
- rank_pattern and alpha_pattern used by many adapters now support matching full paths as well by specifying the pattern with a caret in front, for example: ^foo to target model.foo but not model.bar.foo. (#2419)
- adapter_name conflict with tuner by @pzdkn in https://github.com/huggingface/peft/pull/2254
- "all-linear" to target custom models by @BenjaminBossan in https://github.com/huggingface/peft/pull/2267
- __all__ by @bluenote10 in https://github.com/huggingface/peft/pull/2280
- config.py by @innerlee in https://github.com/huggingface/peft/pull/2297
- prepare_model_for_kbit_training docstring by @NilBiescas in https://github.com/huggingface/peft/pull/2305
- resize_token_embeddings to docs by @bingwork in https://github.com/huggingface/peft/pull/2290
- get_peft_model() for in-place base model modification by @d-kleine in https://github.com/huggingface/peft/pull/2313
- low_cpu_mem_usage=True with 8bit bitsandbytes by @BenjaminBossan in https://github.com/huggingface/peft/pull/2325
- PEFT_TYPE_TO_MODEL_MAPPING variable with deprecation by @BenjaminBossan in https://github.com/huggingface/peft/pull/2328
- modules_to_save loading if substring by @BenjaminBossan in https://github.com/huggingface/peft/pull/2334
- modules_to_save by @BenjaminBossan in https://github.com/huggingface/peft/pull/2220
- torch.compile tests and docs by @BenjaminBossan in https://github.com/huggingface/peft/pull/2332
- nn.Conv1d by @CCLDArjun in https://github.com/huggingface/peft/pull/2333
- prepare_model_for_compiled_hotswap raises when no adapter was found by @BenjaminBossan in https://github.com/huggingface/peft/pull/2375
- hf_hub_download arguments are used when loading locally by @henryzhengr in https://github.com/huggingface/peft/pull/2373
- all-linear target modules by @BenjaminBossan in https://github.com/huggingface/peft/pull/2391
- PeftConfig.from_pretrained by @BenjaminBossan in https://github.com/huggingface/peft/pull/2397
- .eval() for inference by @faaany in https://github.com/huggingface/peft/pull/2408

Full Changelog: https://github.com/huggingface/peft/compare/v0.14.0...v0.15.0
@tsachiblau added a new soft prompt method called Context-aware Prompt Tuning (CPT), which is a combination of In-Context Learning and Prompt Tuning in the sense that, for each training sample, it builds a learnable context from training examples in addition to the single training sample. It allows for sample- and parameter-efficient few-shot classification and addresses recency bias.
@sirluk contributed a new LoRA initialization method called Explained Variance Adaptation (EVA). Instead of randomly initializing LoRA weights, this method uses SVD on minibatches of finetuning data to initialize the LoRA weights and is also able to re-allocate the ranks of the adapter based on the explained variance ratio (derived from SVD). Thus, this initialization method can yield better initial values and better rank distribution.
@JL-er added an implementation for Block Affine (Bone) Adaptation which utilizes presumed sparsity in the base layer weights to divide them into multiple sub-spaces that share a single low-rank matrix for updates. Compared to LoRA, Bone has the potential to significantly reduce memory usage and achieve faster computation.
PEFT now supports LoRA for int8 torchao quantized models (check this and this notebook). In addition, VeRA can now be used with 4-bit and 8-bit bitsandbytes quantization thanks to @ZiadHelal.
Hot-swapping of LoRA adapters is now possible using the hotswap_adapter function. Now you are able to load one LoRA and replace its weights in-place with the LoRA weights of another adapter which, in general, should be faster than deleting one adapter and loading the other adapter in its place. The feature is built so that no re-compilation of the model is necessary if torch.compile was called on the model (right now, this requires ranks and alphas to be the same for the adapters).
LoRA and IA³ now support Conv3d layers thanks to @jsilter, and @JINO-ROHIT added a notebook showcasing PEFT model evaluation using lm-eval-harness toolkit.
With the target_modules argument, you can specify which layers to target with the adapter (e.g. LoRA). Now you can also specify which modules not to target by using the exclude_modules parameter (thanks @JINO-ROHIT).
- Prefix tuning now uses the DynamicCache caching infrastructure of transformers (see #2096). If you are using this PEFT version and a recent version of transformers with an old prefix tuning checkpoint, you should double-check that it still works correctly and retrain it if it doesn't.
- Added a lora_bias parameter to LoRA layers to enable bias on the LoRA B matrix. This is useful when extracting LoRA weights from fully fine-tuned parameters with bias vectors so that these can be taken into account.
- from_pretrained now warns the user if PEFT keys are missing.
- modules_to_save is now properly and transparently handled.
- SFTConfig instead of SFTTrainer keyword args by @qgallouedec in https://github.com/huggingface/peft/pull/2150
- eval and no dropout by @ariG23498 in https://github.com/huggingface/peft/pull/2122
- rank_pattern and alpha_pattern together in LoraConfig by @sirluk in https://github.com/huggingface/peft/pull/2195
- meta device check bug + add multi-gpu functionality by @sirluk in https://github.com/huggingface/peft/pull/2218
- None check for loftq_config attribute in LoraConfig by @sirluk in https://github.com/huggingface/peft/pull/2215
- task_type in PEFT Configurations by @d-kleine in https://github.com/huggingface/peft/pull/2210

Full Changelog: https://github.com/huggingface/peft/compare/v0.13.2...v0.14.0
This patch release contains a small bug fix for an issue that prevented some LoRA checkpoints from being loaded correctly (mostly concerning stable diffusion checkpoints not trained with PEFT when loaded in diffusers, #2144).
Full Changelog: https://github.com/huggingface/peft/compare/v0.13.1...v0.13.2
This patch release contains a small bug fix for the low_cpu_mem_usage=True option (#2113).
Full Changelog: https://github.com/huggingface/peft/compare/v0.13.0...v0.13.1
@kallewoof added LoRA+ to PEFT (#1915). This is a function that allows initializing an optimizer with settings that are better suited for training a LoRA adapter.
@leo-yangli added a new method to PEFT called VB-LoRA (#2039). The idea is to have LoRA layers be composed from a single vector bank (hence "VB") that is shared among all layers. This makes VB-LoRA extremely parameter efficient and the checkpoints especially small (comparable to the VeRA method), while still promising good fine-tuning performance. Check the VB-LoRA docs and example.
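A rough sketch of the vector-bank composition follows. The names, bank size, and the top-k softmax mixing scheme are illustrative simplifications of the VB-LoRA idea, not the actual implementation:

```python
import numpy as np

# Sketch: every LoRA sub-vector is a learned mixture of a small, globally
# shared bank of vectors, so only the bank and the mixture logits are stored.
rng = np.random.default_rng(0)
num_vectors, vector_len = 32, 16
bank = rng.standard_normal((num_vectors, vector_len))  # shared across all layers

def compose(logits, k=2):
    """Hypothetical top-k mixture: pick k bank vectors, mix with softmax weights."""
    top = np.argsort(logits)[-k:]
    w = np.exp(logits[top])
    w /= w.sum()
    return w @ bank[top]

logits = rng.standard_normal(num_vectors)  # learned per sub-vector
sub_vector = compose(logits)
print(sub_vector.shape)
```

Since the bank is shared among all layers, the per-layer trainable state shrinks to the logits, which is why VB-LoRA checkpoints are so small.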
New Hugging Face team member @ariG23498 added the helper function rescale_adapter_scale to PEFT (#1951). Use this context manager to temporarily increase or decrease the scaling of the LoRA adapter of a model. It also works for PEFT adapters loaded directly into a transformers or diffusers model.
@ariG23498 also added DoRA support for embedding layers (#2006). So if you're using the use_dora=True option in the LoraConfig, you can now also target embedding layers.
For some time now, we have supported inference with batches that use different adapters for different samples, e.g. samples 1-5 use "adapter1" and samples 6-10 use "adapter2". However, this only worked for LoRA layers so far. @saeid93 extended this to also work with layers targeted by modules_to_save (#1990).
When loading a PEFT adapter, you now have the option to pass low_cpu_mem_usage=True (#1961). This will initialize the adapter with empty weights ("meta" device) before loading the weights instead of initializing on CPU or GPU. This can speed up loading PEFT adapters. So use this option especially if you have a lot of adapters to load at the same time or if these adapters are very big. Please let us know if you encounter issues with this option, as we may make this the default in the future.
Unless indicated otherwise, PEFT adapters are saved and loaded using the secure safetensors format. However, we also support the PyTorch format for checkpoints, which relies on the inherently insecure pickle protocol from Python. In the future, PyTorch will be more strict when loading these files to improve security by making the option weights_only=True the default. This is generally recommended and should not cause any trouble with PEFT checkpoints, which is why with this release, PEFT will enable this by default. Please open an issue if this causes trouble.
- merge_and_unload by @snarayan21 in https://github.com/huggingface/peft/pull/1978
- helper.rescale_adapter_scale by @ariG23498 in https://github.com/huggingface/peft/pull/1989
- test_vera_dtypes on XPU by @faaany in https://github.com/huggingface/peft/pull/2017
- TestModelAndLayerStatus device-agnostic by @faaany in https://github.com/huggingface/peft/pull/2026
- test_mixed_adapter_batches_lora_opt_timing on XPU by @faaany in https://github.com/huggingface/peft/pull/2021
- test_common_gpu.py to work on XPU by @faaany in https://github.com/huggingface/peft/pull/2031
- test_gpu_examples.py on XPU by @faaany in https://github.com/huggingface/peft/pull/2036
- tie_word_embeddings by @ltoniazzi in https://github.com/huggingface/peft/pull/2025
- evaluation_strategy by @muellerzr in https://github.com/huggingface/peft/pull/1664

Full Changelog: https://github.com/huggingface/peft/compare/v0.12.0...v0.13.0
@tokenizer-decode added support for a new LoRA initialization strategy called OLoRA (#1828). With this initialization option, the LoRA weights are initialized to be orthonormal, which promises to improve training convergence. Similar to PiSSA, this can also be applied to models quantized with bitsandbytes. Check out the accompanying OLoRA examples.
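As a hedged configuration sketch, OLoRA initialization is selected through LoraConfig as described in the PEFT docs (the r and lora_alpha values here are illustrative):

```python
from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=32,
    init_lora_weights="olora",  # orthonormal initialization of the LoRA weights
)
```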
@EricLBuehler added the X-LoRA method to PEFT (#1491). This is a mixture of experts approach that combines the strength of multiple pre-trained LoRA adapters. Documentation has yet to be added but check out the X-LoRA tests for how to use it.
@Phoveran, @zqgao22, @Chaos96, and @DSAILatHKUST added discrete Fourier transform fine-tuning to PEFT (#1838). This method promises to match LoRA in terms of performance while reducing the number of parameters even further. Check out the included FourierFT notebook.
@DaShenZi721 added support for Householder Reflection Adaptation (#1864). This method bridges the gap between low rank adapters like LoRA on the one hand and orthogonal fine-tuning techniques such as OFT and BOFT on the other. As such, it is interesting for both LLMs and image generation models. Check out the HRA example on how to perform DreamBooth fine-tuning.
- add_weighted_adapter method thanks to @alexrs (#1701).
- peft_model.get_layer_status() and peft_model.get_model_status() to get an overview of the layer/model status of the PEFT model. This can be especially helpful when dealing with multiple adapters or for debugging purposes. More information can be found in the docs (#1743).
- Important: If the base model is loaded in float16 (fp16) or bfloat16 (bf16), PEFT now autocasts adapter weights to float32 (fp32) instead of using the dtype of the base model (#1706). This requires more memory than previously but stabilizes training, so it's the more sensible default. To prevent this, pass autocast_adapter_dtype=False when calling get_peft_model, PeftModel.from_pretrained, or PeftModel.load_adapter.
The logic of device placement when loading multiple adapters on the same model has been changed (#1742). Previously, PEFT would move all adapters to the device of the base model. Now, only the newly loaded/created adapter is moved to the base model's device. This allows users to have more fine-grained control over the adapter devices, e.g. allowing them to offload unused adapters to CPU more easily.
- save_pretrained with the convert_pissa_to_lora argument is deprecated; the argument was renamed to path_initial_model_for_weight_conversion (#1828). Also, calling this no longer deletes the original adapter (#1933).
- Using this conversion (path_initial_model_for_weight_conversion) while also using use_rslora=True and rank_pattern or alpha_pattern now raises an error (#1930). Previously, this did not raise, but inference would return incorrect outputs. We also warn about this setting during initialization.
- We are now making sure to tag appropriate issues with the contributions welcome label. If you are looking for a way to contribute to PEFT, check out these issues.
- config.json when the base model_id is local. by @elementary-particle in https://github.com/huggingface/peft/pull/1668
- merge_and_unload docs by @younesbelkada in https://github.com/huggingface/peft/pull/1805
- merge_and_unload by @snarayan21 in https://github.com/huggingface/peft/pull/1944

Full Changelog: https://github.com/huggingface/peft/compare/v0.11.1...v0.12.0
Fix a bug that could lead to C++ compilation errors after importing PEFT (#1738 #1739).
Full Changelog: https://github.com/huggingface/peft/compare/v0.11.0...v0.11.1
Thanks to @yfeng95, @Zeju1997, and @YuliangXiu, PEFT was extended with BOFT: Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization (#1326, BOFT paper link). In PEFT v0.7.0, we already added OFT, but BOFT is even more parameter efficient. Check out the included BOFT controlnet and BOFT dreambooth examples.
If the parameter reduction of LoRA is not enough for your use case, you should take a close look at VeRA: Vector-based Random Matrix Adaptation (#1564, VeRA paper link). This method resembles LoRA but adds two learnable scaling vectors to the two LoRA weight matrices. However, the LoRA weights themselves are shared across all layers, considerably reducing the number of trainable parameters.
The bulk of this PR was implemented by contributor @vvvm23 with the help of @dkopi.
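To illustrate where VeRA's parameter savings come from, here is a minimal pure-Python sketch of the idea (this is not PEFT's implementation; all names and sizes are illustrative): the matrices A and B are random, frozen, and shared across layers, so each adapted layer only trains the small scaling vectors d and b.

```python
import random

random.seed(0)

def matvec(m, v):
    return [sum(mi * vi for mi, vi in zip(row, v)) for row in m]

def make_shared(r, d_in, d_out):
    # Frozen random projections, shared by every adapted layer
    A = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]
    B = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d_out)]
    return A, B

def vera_delta(x, A, B, d, b):
    # delta = diag(b) @ B @ diag(d) @ A @ x
    h = matvec(A, x)                            # project down (frozen A)
    h = [di * hi for di, hi in zip(d, h)]       # scale by trainable d
    out = matvec(B, h)                          # project up (frozen B)
    return [bi * oi for bi, oi in zip(b, out)]  # scale by trainable b

A, B = make_shared(r=2, d_in=4, d_out=3)
d, b = [0.1, 0.1], [1.0, 1.0, 1.0]  # the only per-layer trainable parameters
print(vera_delta([1.0, 2.0, 3.0, 4.0], A, B, d, b))
```

Since only d (size r) and b (size d_out) are trained per layer, the trainable parameter count is far below LoRA's per-layer r*(d_in+d_out).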
PiSSA, Principal Singular values and Singular vectors Adaptation, is a new initialization method for LoRA, which was added by @fxmeng (#1626, PiSSA paper link). The improved initialization promises to speed up convergence and improve the final performance of LoRA models. When using models quantized with bitsandbytes, PiSSA initialization should reduce the quantization error, similar to LoftQ.
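A hedged configuration sketch, following the option documented for LoraConfig:

```python
from peft import LoraConfig

config = LoraConfig(init_lora_weights="pissa")
# A faster variant using an approximate SVD is also documented, e.g.:
# config = LoraConfig(init_lora_weights="pissa_niter_16")
```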
Thanks to @fahadh4ilyas, PEFT LoRA linear layers now support Half-Quadratic Quantization, HQQ (#1618, HQQ repo). HQQ is fast and efficient (down to 2 bits), while not requiring calibration data.
Another new quantization method supported in PEFT is Easy & Efficient Quantization for Transformers, EETQ (#1675, EETQ repo). This 8-bit quantization method works for LoRA linear layers and should be faster than bitsandbytes.
We added a feature to show adapter layer and model status of PEFT models in #1663. With the newly added methods, you can easily check what adapters exist on your model, whether gradients are active, whether they are enabled, which ones are active or merged. You will also be informed if irregularities have been detected.
To use this new feature, call model.get_layer_status() for layer-level information, and model.get_model_status() for model-level information. For more details, check out our docs on layer and model status.
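A short usage fragment (assuming `model` is an existing PeftModel):

```python
layer_status = model.get_layer_status()  # per-layer adapter information
model_status = model.get_model_status()  # aggregated, model-level information
```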
modules_to_save

We had the issue that when using classes such as PeftModelForSequenceClassification, we implicitly added the classifier layers to model.modules_to_save. However, this would only add a new ModulesToSaveWrapper instance for the first adapter being initialized. When initializing a second adapter via model.add_adapter, this information was ignored. Now, peft_config.modules_to_save is updated explicitly to add the classifier layers (#1615). This is a departure from how this worked previously, but it reflects the intended behavior better.
Furthermore, when merging together multiple LoRA adapters using model.add_weighted_adapter, if these adapters had modules_to_save, the original parameters of these modules would be used. This is unexpected and will most likely result in bad outputs. As there is no clear way to merge these modules, we decided to raise an error in this case (#1615).
- lru_cache to import_utils calls that did not previously have it by @tisles in https://github.com/huggingface/peft/pull/1584
- Repository anymore by @Wauplin in https://github.com/huggingface/peft/pull/1641
- % to be sensible by @stas00 in https://github.com/huggingface/peft/pull/1648
- dreambooth Git link by @charliermarsh in https://github.com/huggingface/peft/pull/1660

Full Changelog: https://github.com/huggingface/peft/compare/v0.10.0...v0.11.0
We added a couple of changes to allow QLoRA to work with DeepSpeed ZeRO3 and Fully Sharded Data Parallel (FSDP). For instance, this allows you to fine-tune a 70B Llama model on two GPUs with 24GB memory each. Besides the latest version of PEFT, this requires bitsandbytes>=0.43.0, accelerate>=0.28.0, transformers>4.38.2, trl>0.7.11. Check out our docs on DeepSpeed and FSDP with PEFT, as well as this blogpost from answer.ai, for more details.
First time contributor @siddartha-RE added support for layer replication with LoRA. This allows you to duplicate layers of a model and apply LoRA adapters to them. Since the base weights are shared, this costs only very little extra memory, but can lead to a nice improvement of model performance. Find out more in our docs.
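To illustrate what layer replication does, here is a minimal pure-Python sketch (illustrative only; in PEFT this is configured via ranges passed to LoraConfig): each (start, end) pair selects a half-open range of original layer indices, and the ranges are concatenated to form the new, deeper stack. Base weights of repeated layers are shared; only their LoRA adapters differ.

```python
def expand_layers(layer_replication):
    """Return the list of original-layer indices making up the expanded model."""
    expanded = []
    for start, end in layer_replication:
        expanded.extend(range(start, end))
    return expanded

# A 4-layer model expanded to 6 layers, duplicating layers 2 and 3:
print(expand_layers([(0, 4), (2, 4)]))  # → [0, 1, 2, 3, 2, 3]
```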
Last release, we added the option to enable DoRA in PEFT by simply adding use_dora=True to your LoraConfig. However, this only worked for non-quantized linear layers. With this PEFT release, we now also support Conv2d layers, as well as linear layers quantized with bitsandbytes.
If you have a PEFT model with multiple LoRA adapters attached to it, it's now possible to apply different adapters (or, in fact, no adapter) on different samples in the same batch. To do this, pass a list of adapter names as an additional argument. For example, if you have a batch of three samples:
```python
output = model(**inputs, adapter_names=["adapter1", "adapter2", "__base__"])
```
Here, "adapter1" and "adapter2" should be the same name as your corresponding LoRA adapters and "__base__" is a special name that refers to the base model without any adapter. Find more details in our docs.
Without this feature, if you wanted to run inference with different LoRA adapters, you'd have to use single samples or try to group batches with the same adapter, then switch between adapters using set_adapter -- this is inefficient and inconvenient. Therefore, it is recommended to use this new, faster method from now on when encountering this scenario.
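Conceptually, this works by grouping samples by adapter name so that each adapter's sub-batch runs through its own weights, with results scattered back into the original order. A pure-Python sketch of the grouping step (illustrative only; PEFT handles this internally when you pass adapter_names):

```python
def group_by_adapter(adapter_names):
    """Map each adapter name to the list of sample indices that use it."""
    groups = {}
    for idx, name in enumerate(adapter_names):
        groups.setdefault(name, []).append(idx)
    return groups

print(group_by_adapter(["adapter1", "adapter2", "__base__", "adapter1"]))
# → {'adapter1': [0, 3], 'adapter2': [1], '__base__': [2]}
```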
We added an alternative way to initialize LoRA weights for a quantized model using the LoftQ method, which can be more convenient than the existing method. Right now, using LoftQ requires you to go through multiple steps as shown here. Furthermore, it's necessary to keep a separate copy of the quantized weights, as those are not identical to the quantized weights from the default model.
Using the new replace_lora_weights_loftq function, it's now possible to apply LoftQ initialization in a single step and without the need for extra copies of the weights. Check out the docs and this example notebook to see how it works. Right now, this method only supports 4bit quantization with bitsandbytes, and the model has to be stored in the safetensors format.
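A hedged usage fragment (assuming `peft_model` wraps a 4-bit bitsandbytes-quantized base model stored in the safetensors format):

```python
from peft import replace_lora_weights_loftq

replace_lora_weights_loftq(peft_model)
```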
The function prepare_model_for_int8_training was deprecated for quite some time and is now removed completely. Use prepare_model_for_kbit_training instead.
Besides these highlights, we added many small improvements and fixed a couple of bugs. All these changes are listed below. As always, we thank all the awesome contributors who helped us improve PEFT.
- [CI / Docker] Follow up from #1481 by @younesbelkada in https://github.com/huggingface/peft/pull/1487
- [Docs / bnb / DeepSpeed] Add clarification on bnb + PEFT + DS compatibilities by @younesbelkada in https://github.com/huggingface/peft/pull/1529
- num_parameters() and get_nb_trainable_parameters() in PEFT by @kmehant in https://github.com/huggingface/peft/pull/1531
- prompt_tuning_init==TEXT by @kmehant in https://github.com/huggingface/peft/pull/1519
- levenshtein_distance algorithm in peft_lora_seq2seq_accelera… by @SUNGOD3 in https://github.com/huggingface/peft/pull/1527
- prompt_based_methods.md by @insist93 in https://github.com/huggingface/peft/pull/1548
- BitsAndBytesConfig as load_in_* is deprecated by @BenjaminBossan in https://github.com/huggingface/peft/pull/1552
- [CI] Fix test docker CI by @younesbelkada in https://github.com/huggingface/peft/pull/1535

Full Changelog: https://github.com/huggingface/peft/compare/v0.9.0...v0.10.0
With PR #1364, we added new methods for merging LoRA weights together. This is not about merging LoRA weights into the base model. Instead, this is about merging the weights from different LoRA adapters into a single adapter by calling add_weighted_adapter. This allows you to combine the strength from multiple LoRA adapters into a single adapter, while being faster than activating each of these adapters individually.
Although this feature has already existed in PEFT for some time, we have added new merging methods that promise much better results. The first is based on TIES, the second on DARE, and a third, inspired by both, is called Magnitude Prune. If you haven't tried these new methods, or haven't touched the LoRA weight merging feature at all, you can find more information in our docs.
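To give an intuition for the TIES-style merging used by add_weighted_adapter with combination_type="ties", here is a minimal pure-Python sketch (illustrative only, not PEFT's tensor implementation): trim each task vector to its largest-magnitude entries, elect a sign per position from the summed values, then average only the entries whose sign agrees with the elected one.

```python
def ties_merge(vectors, keep_fraction=0.5):
    n = len(vectors[0])
    k = max(1, int(n * keep_fraction))
    trimmed = []
    for v in vectors:
        # 1) trim: keep only the k largest-magnitude entries of each vector
        threshold = sorted((abs(x) for x in v), reverse=True)[k - 1]
        trimmed.append([x if abs(x) >= threshold else 0.0 for x in v])
    merged = []
    for i in range(n):
        column = [v[i] for v in trimmed]
        # 2) elect a sign per position from the summed trimmed values
        sign = 1.0 if sum(column) >= 0 else -1.0
        # 3) average only the entries agreeing with the elected sign
        agreeing = [x for x in column if x * sign > 0]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged

print(ties_merge([[1.0, -2.0, 0.1, 3.0], [1.5, 2.0, 0.2, -0.1]]))
```

The trimming and sign election are what let TIES avoid the interference between adapters that plain weight averaging suffers from.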
Via #1394, we now support AutoAWQ in PEFT. This is a new method for 4bit quantization of model weights.
<img width="1197" alt="Screenshot 2024-02-28 at 09 41 40" src="https://github.com/huggingface/peft/assets/49240599/431d485b-c2b9-4e49-b407-89977875e6ef">

Similarly, we now support AQLM via #1476. This method allows quantizing weights to as low as 2 bits. Both methods support quantizing nn.Linear layers. To find out more about all the quantization options that work with PEFT, check out our docs here.
Note that these integrations do not support merge_and_unload() yet, so for inference you always need to keep the adapter weights attached to the base model.
We now support Weight-Decomposed Low-Rank Adaptation, aka DoRA, via #1474. This new method builds on top of LoRA and has shown very promising results. Especially at lower ranks (e.g. r=8), it should perform much better than LoRA. Right now, only non-quantized nn.Linear layers are supported. If you'd like to give it a try, just pass use_dora=True to your LoraConfig and you're good to go.
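Conceptually, DoRA decomposes each weight into a magnitude vector and a direction; LoRA updates the direction, which is re-normalized column-wise, and the trainable magnitude scales it back. A minimal pure-Python sketch of that decomposition (illustrative only, not PEFT's implementation):

```python
import math

def column_norms(W):
    cols = list(zip(*W))
    return [math.sqrt(sum(x * x for x in col)) for col in cols]

def dora_weight(W0, delta, m):
    """W' = m * (W0 + delta) / ||W0 + delta||, applied column-wise."""
    W = [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W0, delta)]
    norms = column_norms(W)  # direction is normalized per column
    return [[m[j] * W[i][j] / norms[j] for j in range(len(norms))]
            for i in range(len(W))]

W0 = [[3.0, 0.0], [4.0, 1.0]]
delta = [[0.0, 0.0], [0.0, 0.0]]  # no LoRA update yet
m = column_norms(W0)              # magnitudes initialized from W0
print(dora_weight(W0, delta, m))  # with a zero update, this recovers W0
```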
Thanks to @stevhliu and many other contributors, there have been big improvements to the documentation. You should find it more organized and more up-to-date. Our DeepSpeed and FSDP guides have also been much improved.
Check out our improved docs if you haven't already!
If you're implementing custom adapter layers, for instance a custom LoraLayer, note that all subclasses should now implement update_layer -- unless they want to use the default method by the parent class. In particular, this means you should no longer use different method names for the subclass, like update_layer_embedding. Also, we generally don't permit ranks (r) of 0 anymore. For more, see this PR.
Developers should have an easier time now since we fully embrace ruff. If you're the type of person who forgets to call make style before pushing to a PR, consider adding a pre-commit hook. Tests are now a bit less verbose by using plain asserts and generally embracing pytest features more fully. All of this comes thanks to @akx.
On top of these changes, we have added a lot of small changes since the last release, check out the full changes below. As always, we had a lot of support by many contributors, you're awesome!
- MatMul8bitLtBackward view issue by @younesbelkada in https://github.com/huggingface/peft/pull/1425
- [core / TPLinear] Fix breaking change by @younesbelkada in https://github.com/huggingface/peft/pull/1439
- set_adapters() after add_weighted_adapter by @sayakpaul in https://github.com/huggingface/peft/pull/1444
- modules_to_save config option when using DeepSpeed ZeRO-3 with ZeRO init enabled. by @pacman100 in https://github.com/huggingface/peft/pull/1450
- [core / get_peft_state_dict] Ignore all exceptions to avoid unexpected errors by @younesbelkada in https://github.com/huggingface/peft/pull/1458
- [Adaptation Prompt] Fix llama rotary embedding issue with transformers main by @younesbelkada in https://github.com/huggingface/peft/pull/1459
- [CI] Add CI tests on transformers main to catch early bugs by @younesbelkada in https://github.com/huggingface/peft/pull/1461
- magnitude_prune merging method by @pacman100 in https://github.com/huggingface/peft/pull/1466
- [CI] Fix adaptation prompt CI on transformers main by @younesbelkada in https://github.com/huggingface/peft/pull/1465
- [CI] Run tests only when relevant files are modified by @younesbelkada in https://github.com/huggingface/peft/pull/1482
- [CI / bnb] Fix failing bnb workflow by @younesbelkada in https://github.com/huggingface/peft/pull/1480
- [PromptTuning] Simple fix for transformers >= 4.38 by @younesbelkada in https://github.com/huggingface/peft/pull/1484
- [CI / Docker] Create a workflow to temporarily build docker images in case dockerfiles are modified by @younesbelkada in https://github.com/huggingface/peft/pull/1481
- [CI / Adaptation Prompt] Fix CI on transformers main by @younesbelkada in https://github.com/huggingface/peft/pull/1493
- [Docker] Notify us when docker builds pass or fail by @younesbelkada in https://github.com/huggingface/peft/pull/1503

Full Changelog: https://github.com/huggingface/peft/compare/v0.8.2...v0.9.0
- [core] fix critical bug in diffusers by @younesbelkada in https://github.com/huggingface/peft/pull/1427

Full Changelog: https://github.com/huggingface/peft/compare/v0.8.1...v0.8.2