Natural Language to LoRA Weights

Sakana AI recently released a paper called Text-to-LoRA (T2L), which proposes generating LoRA weights directly from natural-language task descriptions.

Challenges with Finetuning

In NLP, large pre-trained models can be used for various downstream tasks such as question answering, named entity recognition and translation. However, these models often require task-specific adaptation. Fully finetuning the model for each task is computationally expensive and may degrade its performance on other tasks. To address this, researchers explored methods such as finetuning only a subset of parameters or adding lightweight task-specific modules, though these approaches had their own limitations. In 2021, researchers from Microsoft introduced LoRA (Low-Rank Adaptation), which quickly gained widespread adoption not only in language models but also in domains such as vision transformers and diffusion models.

LoRA (Low Rank Adaptation)

LoRA efficiently finetunes pre-trained models by decomposing weight updates into low-rank matrices. Building on the observation that large pre-trained models have a low intrinsic dimension, the authors hypothesized that the weight updates made during adaptation also have a low intrinsic rank.
If $W_0 \in \mathbb{R}^{d \times d}$ is a frozen pre-trained weight, the update is parameterized as $\Delta W = BA$ where $A \in \mathbb{R}^{r \times d}$, $B \in \mathbb{R}^{d \times r}$ and $r \ll d$. Only $A$ and $B$ are trained, which significantly reduces the trainable parameter count. For example, if $W_0$ is 512x512 and the rank is 8, then $A$ is 8x512 and $B$ is 512x8, for a total of 8,192 trainable parameters, 32 times fewer than the 262,144 in $W_0$. There is also no additional inference latency, because $BA$ can be merged into $W_0$ after training.
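The arithmetic above can be checked with a small NumPy sketch. The random values for `A` and `B` are stand-ins for trained adapter weights, and the helper name `lora_forward` is hypothetical; this is only an illustration of the parameterization, not the reference LoRA implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8                                # sizes from the example above

W0 = rng.standard_normal((d, d))             # frozen pre-trained weight, d x d
A = rng.standard_normal((r, d)) * 0.01       # stand-in for trained A, r x d
B = rng.standard_normal((d, r)) * 0.01       # stand-in for trained B, d x r

def lora_forward(x, W0, A, B):
    """Compute (W0 + BA) x without materializing the d x d update."""
    return W0 @ x + B @ (A @ x)

full_params = W0.size                        # 512 * 512
lora_params = A.size + B.size                # 2 * 8 * 512
reduction = full_params // lora_params

# For inference, BA can be folded into W0 once, so no extra latency remains.
W_merged = W0 + B @ A
x = rng.standard_normal(d)
same = np.allclose(lora_forward(x, W0, A, B), W_merged @ x)

print(full_params, lora_params, reduction, same)
```

The merged and unmerged paths give the same output, which is why an adapter adds no cost at inference time once it has been folded into the base weight.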

Challenges in LoRA

The paper highlights the following challenges with LoRA:

A new LoRA adapter needs to be trained for each downstream task.
Although the number of trainable parameters is reduced, training a LoRA adapter still takes significant time.

Text-to-LoRA

In this paper, the authors from Sakana AI raise two questions:

Can we train neural networks to generate LoRA?
Can these neural networks generate new LoRAs from unseen task descriptions during test-time?

Generating LoRAs from natural-language task descriptors:
Training data: $N$ datasets $\{D_1, D_2, \ldots, D_N\}$, each with a task descriptor $T_i$ that contains a general description of the dataset.
Training objective for task $i$: $\Delta W_i = \arg\min_{\Delta W_i} \mathcal{L}_{\mathrm{SFT}}(D_i, W_0, \Delta W_i)$

Hypernetworks

A hypernetwork is the network used to synthesize the weights of the LoRA. Hypernetworks were proposed in 2016 as networks that generate the parameters of another neural network. In a sense, a hypernetwork compresses several neural networks within its weights, since it has far fewer parameters than the networks it synthesizes. Formally, given a layer descriptor vector $v_l$, the hypernetwork generates the parameters of layer $l$: $W_l = h(v_l)$. The hypernetwork is trained end-to-end on a downstream task.
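As a rough illustration of $W_l = h(v_l)$, the sketch below uses a single linear map as the hypernetwork. All sizes and names (`V`, `H_w`, `H_b`) are hypothetical toy choices, and real hypernetworks are typically deeper; the point is only that one shared set of parameters emits a different weight matrix for each layer descriptor.

```python
import numpy as np

rng = np.random.default_rng(0)
num_layers, emb_dim, out_dim = 32, 4, 16     # hypothetical toy sizes

V = rng.standard_normal((num_layers, emb_dim))            # layer descriptors v_l
H_w = rng.standard_normal((emb_dim, out_dim * out_dim))   # hypernetwork weights
H_b = np.zeros(out_dim * out_dim)                         # hypernetwork bias

def h(v):
    """Map a layer descriptor v_l to that layer's weight matrix W_l."""
    return (v @ H_w + H_b).reshape(out_dim, out_dim)

# One weight matrix per layer, all generated from the same hypernetwork.
W = [h(V[l]) for l in range(num_layers)]

hyper_params = V.size + H_w.size + H_b.size       # parameters that are trained
target_params = num_layers * out_dim * out_dim    # parameters that are generated

print(hyper_params, target_params)
```

In this toy setting the hypernetwork holds fewer parameters than the weights it generates, which is the compression effect described above.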

Text-to-LoRA using a Hypernetwork

Given a target module $m$ and a layer index $l$, T2L uses a hypernetwork to generate the low-rank matrices $A$ and $B$ based on a task description $z^{i}\in Z^{i}$ of task $t^{i}$:
$\Delta W_{m,l}^{i} = h_{\theta}(\phi_{m,l}^{i})$

Note: The task embedding can be a one-hot encoding or a learnable embedding; however, a one-hot encoding may lead to generalization issues.

$\phi_{m,l}^{i} = \text{concat}[f(z^{i}), E[m], E[l]]$, where $f$ is a function that generates an embedding of the task description, typically the $\text{[CLS]}$ token of a bidirectional transformer or the last-token activation of an LLM.
$E$ is a learnable embedding dictionary indexed by either the module $m$ or the layer index $l$.
The supervised finetuning objective for T2L is: $\theta = \arg\min_{\theta}\mathcal{L}_{\mathrm{SFT}}(D^{i},\psi,h_{\theta}(\phi^{i}))$, where $\psi$ denotes the frozen weights of the base model.

Note: We can batch over $m$ and $l$, which allows us to generate the parameters of all modules and layers in a single forward pass.
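The input construction and batching described above can be sketched as follows. Everything here is a toy stand-in: `f` is a hypothetical placeholder for a real text encoder, the module names and sizes are made up, and `head` is a single linear layer standing in for $h_{\theta}$ rather than the architecture used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
task_dim, emb_dim, d, r = 16, 4, 8, 2        # hypothetical toy sizes
modules = ["q_proj", "v_proj"]               # hypothetical target modules
num_layers = 3

E_module = {m: rng.standard_normal(emb_dim) for m in modules}   # E[m]
E_layer = rng.standard_normal((num_layers, emb_dim))            # E[l]

def f(description):
    """Stand-in for a text encoder: a fixed pseudo-random vector per string."""
    seed = sum(ord(c) for c in description)
    return np.random.default_rng(seed).standard_normal(task_dim)

def build_inputs(description):
    """Stack phi_{m,l} = concat[f(z), E[m], E[l]] for every (module, layer) pair."""
    z = f(description)
    rows = [np.concatenate([z, E_module[m], E_layer[l]])
            for m in modules for l in range(num_layers)]
    return np.stack(rows)

# Stand-in for h_theta: one linear layer that emits A and B together.
head = rng.standard_normal((task_dim + 2 * emb_dim, 2 * r * d)) * 0.1

phi = build_inputs("Answer grade-school math word problems.")
out = phi @ head                             # one forward pass for all (m, l)
A = out[:, : r * d].reshape(-1, r, d)        # one r x d matrix per (m, l)
B = out[:, r * d :].reshape(-1, d, r)        # one d x r matrix per (m, l)
delta_W = B @ A                              # one d x d update per (m, l)

print(phi.shape, delta_W.shape)
```

Because the rows of `phi` differ only in their module and layer embeddings, a single batched call yields a distinct low-rank update for every target weight.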

Different Hypernetwork Architectures:

The authors propose three architectures: L (Large), M (Medium) and S (Small).

L Architecture: The final layer outputs the $A$ and $B$ matrices simultaneously. Output parameters: $|\theta_{head}| = d_{head}\times 2 \times r \times d$, where $d_{head}$ is the output dimension of the last MLP block, $r$ is the rank of the matrices $A$ and $B$, and $d$ is the dimension of the weight matrix $W$.
M Architecture: The final layer outputs either the $A$ or the $B$ matrix depending on the input embedding, meaning that the output layer is shared between $A$ and $B$. Output parameters: $|\theta_{head}| = d_{head} \times r \times d$
S Architecture: The final layer outputs only one rank (a single vector of length $d$) of $A$ or $B$ at a time. Output parameters: $|\theta_{head}| = d_{head}\times d$, where $d$ is the dimension of the weight matrix $W$. This variant has the strongest inductive bias.

All of these architectures can generate all LoRAs in a single forward pass by batching the input text embeddings.
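To get a feel for how the three output heads compare, the formulas above can be evaluated directly. The values chosen for `d_head`, `r` and `d` below are hypothetical and are not taken from the paper.

```python
def head_sizes(d_head, r, d):
    """Number of weights in the output head for each architecture."""
    return {
        "L": d_head * 2 * r * d,   # emits A and B together
        "M": d_head * r * d,       # emits A or B through a shared head
        "S": d_head * d,           # head size does not depend on the rank r
    }

# Hypothetical sizes, for illustration only.
sizes = head_sizes(d_head=512, r=8, d=4096)
for name, n in sizes.items():
    print(f"{name}: {n:,}")
```

With these numbers the L head is twice the size of the M head and 16 times the size of the S head, since the S formula drops both the factor of 2 and the rank $r$.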

Training the Hypernetwork

The authors propose two ways of training the hypernetwork:

By reconstructing trained LoRAs.
By supervised finetuning on downstream tasks.

Training Text-to-LoRA via LoRA Reconstruction: The network is trained to reconstruct LoRA matrices. We can either use LoRAs from a pre-trained library or first perform PEFT training to produce LoRAs and then train the hypernetwork on them. T2L can be trained using either one-hot task embeddings or natural-language task descriptions. The downside of training with one-hot embeddings is that the network cannot be used in a zero-shot setting.
Given a suitable library of LoRA adapters $\Omega$, the reconstruction loss for T2L can be written as: $\mathcal{L}(\Omega,\theta) = \mathbb{E}_{\Delta W^i \sim \Omega} \left| \Delta W^i - h_{\theta}(\phi^i) \right|$
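A minimal sketch of this reconstruction loss is given below. It assumes the distance between the target and generated updates is the mean element-wise absolute error, which is a choice made here for illustration. The library of "trained" LoRAs is random toy data, and the predictions are passed in directly instead of coming from a real hypernetwork.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, num_tasks = 8, 2, 4                    # hypothetical toy sizes

# Toy stand-in for the library Omega: one low-rank update per task.
library = [rng.standard_normal((d, r)) @ rng.standard_normal((r, d))
           for _ in range(num_tasks)]

def reconstruction_loss(targets, predictions):
    """Mean absolute error between target and generated updates, averaged over tasks."""
    per_task = [np.abs(t - p).mean() for t, p in zip(targets, predictions)]
    return float(np.mean(per_task))

# A perfect hypernetwork would reproduce every adapter exactly.
perfect = reconstruction_loss(library, library)
# Shifting every entry by 0.1 gives a loss of about 0.1.
shifted = reconstruction_loss(library, [w + 0.1 for w in library])

print(perfect, shifted)
```

In actual training, the predictions would be $h_{\theta}(\phi^i)$ and the loss would be minimized with respect to $\theta$ by gradient descent.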
Training T2L via end-to-end finetuning: One issue with the reconstruction approach is that two LoRAs trained on similar tasks may have weights $\Delta W^1$ and $\Delta W^2$ that reside in different minima, in which case the trained hypernetwork might not generalize across similar tasks. We can avoid this by directly finetuning T2L on the target tasks, so that it implicitly learns to cluster similar LoRAs.

Experiments and Results

Model: Mistral-7B-Instruct
Dataset: Super-NaturalInstructions (SNI) dataset (https://huggingface.co/datasets/andersonbcdefg/supernatural-instructions-2m)

Experiment 1: Aim: To test whether T2L can recover the performance of task-specific LoRAs when trained with the reconstruction loss. Task-specific LoRAs are trained on the 9 benchmark datasets to create a library of LoRAs, and the hypernetwork is trained on this library via the reconstruction loss. The results show that T2L can effectively distill the LoRAs and even outperforms the task-specific LoRAs on multiple tasks (likely due to implicit regularization from the lossy compression of the LoRAs).

Experiment 2: Aim: To test whether T2L can generate LoRAs in a zero-shot setting. T2L is trained via SFT on 479 tasks from the SNI dataset. Multiple task descriptions per task are generated using GPT-4o-mini and sampled in an online fashion. The results on the evaluation set show that the model outperforms various other approaches, including multi-task LoRA, but falls short of task-specific LoRAs on several benchmarks. It does, however, outperform task-specific LoRAs on certain benchmarks.

Limitations and Future Work

Performance gap in zero-shot setting
Sensitive to task descriptions
Requires high-quality task descriptions (generated via GPT-4o-mini for the experiments)

Drag-and-Drop LLMs (DnD): uses a slightly different architecture for the same task of text-to-LoRA generation.
Hyperdecoders
HyperLoRA: a very similar framework, but few-shot examples are provided along with the task description.
Hypernetworks
Hypertuning