Everyone is GPU-poor these days, and some of us are poorer than others. So my mission is to fine-tune a LLaMA-2 model with only one GPU on Google Colab and run the trained model on my laptop using llama.cpp.
A lot has been said about when to do prompt engineering, when to do RAG (Retrieval Augmented Generation), and when to fine-tune an existing LLM. I will not get into the details of those arguments and will leave you with two in-depth analyses to explore on your own.
Right now, Meta’s LLaMA-2 is the gold standard of open-source LLMs, with good performance and permissive license terms. We will start with the smallest model (7B), since it is cheaper and faster to fine-tune. Once you have gone through the whole process, you will be well on your way to the 13B and 70B models if you like.
We have all heard about the tremendous cost associated with training a large language model, which is not something the average Jack or Jill will undertake. But what we can do is freeze the model weights in an existing LLM (e.g. 7B parameters), while fine-tuning a tiny adapter (less than 1% of total parameters, 130M for example).
One of these adapters is called LoRA (Low-Rank Adaptation), not to be confused with the red-haired heroine in the movie “Run, Lola, run!”.
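To get a feel for how tiny such an adapter is, here is a back-of-the-envelope count. LoRA learns two low-rank factors for each adapted weight matrix, so a matrix of shape d_in x d_out gains rank * (d_in + d_out) trainable parameters. The sketch below uses LLaMA-2 7B's published shapes (hidden size 4096, MLP size 11008, 32 layers). The rank of 16 and the choice to adapt all seven linear projections per layer are my illustrative assumptions, not settings taken from the guide.

```python
# Rough LoRA adapter size for LLaMA-2 7B (a sketch, not the guide's exact setup).
HIDDEN = 4096         # model hidden size
INTERMEDIATE = 11008  # MLP inner size
LAYERS = 32           # number of transformer blocks
RANK = 16             # ASSUMPTION: LoRA rank, chosen for illustration


def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_in x d_out weight matrix."""
    return rank * (d_in + d_out)


def adapter_params(rank=RANK):
    """Total LoRA parameters when every linear projection is adapted (assumption)."""
    attention = 4 * lora_params(HIDDEN, HIDDEN, rank)       # q, k, v, o projections
    mlp = 2 * lora_params(HIDDEN, INTERMEDIATE, rank)       # gate and up projections
    mlp += lora_params(INTERMEDIATE, HIDDEN, rank)          # down projection
    return LAYERS * (attention + mlp)


if __name__ == "__main__":
    total = adapter_params()
    print(f"{total:,} trainable parameters")
    print(f"{total / 7e9:.2%} of a nominal 7B-parameter model")
```

Under these assumptions the adapter has roughly 40 million parameters, well under 1% of the base model, and would take about 160 MB if stored as 32-bit floats.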
In addition, QLoRA backpropagates gradients through a frozen, 4-bit quantized pre-trained language model (instead of a 16-bit one) into the Low-Rank Adapters (LoRA). Thus the entire training run fits in the memory (VRAM) of a single commodity GPU.
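The saving from 4-bit quantization is easy to estimate. The sketch below only counts the frozen base weights for a nominal 7 billion parameters; it deliberately ignores quantization constants, activations, the adapter itself, and optimizer state, so real usage during training will be higher.

```python
# Approximate memory needed just to hold the frozen base weights.
def weight_memory_gib(n_params, bits_per_param):
    """Memory in GiB for n_params weights stored at bits_per_param each."""
    return n_params * bits_per_param / 8 / 2**30


if __name__ == "__main__":
    n_params = 7e9  # nominal parameter count of the 7B model
    for bits in (16, 4):
        print(f"{bits:>2}-bit weights: {weight_memory_gib(n_params, bits):.1f} GiB")
```

Going from 16 bits to 4 bits cuts the weight storage by a factor of four, which is what leaves room for the rest of the training state on a single GPU.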
In this article, I’m using the OVH Cloud guide with minor changes to the training parameters.
I used Google Colab Pro’s Nvidia A100 high memory instance, and the total fine-tuning ran about 7 hours and consumed 91 compute units.
Google Colab A100 high memory
Actual memory usage during training:
You can certainly use a single T4 high-memory instance (15GB of GPU RAM) instead, which will take longer but cost less. I started a run on one but did not let it finish; the estimate was about 24 hours and 50 compute units. I’m quite sure an Nvidia 4090 (24GB of GPU RAM) or an equivalent consumer GPU would work for this fine-tuning task as well.
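To put the trade-off in numbers, here is the burn rate for both options, using only the figures above. Keep in mind that the T4 numbers are an estimate from a run I did not finish.

```python
# Compute-unit burn rate for the two Colab options mentioned in this article.
def units_per_hour(total_units, hours):
    """Average compute units consumed per hour of training."""
    return total_units / hours


if __name__ == "__main__":
    runs = {
        "A100 (measured)": (91, 7),   # compute units, hours
        "T4 (estimated)": (50, 24),
    }
    for name, (units, hours) in runs.items():
        rate = units_per_hour(units, hours)
        print(f"{name}: {units} units over {hours} h = {rate:.1f} units/hour")
```

The A100 burns units several times faster per hour but finishes much sooner, so the total bill differs by less than a factor of two.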
Once the training is done, we save the LoRA adapter’s final checkpoints to mounted Google Drive so we don’t lose them once the Google Colab session is over:
output_dir = "results/llama2/final_checkpoint"
train(model, tokenizer, dataset, output_dir)
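If your output directory lives on the Colab VM's local disk, one way to keep the checkpoint is to copy it to Drive after training. This is a minimal sketch using only the standard library. The helper name `backup_checkpoint` and the Drive folder in the commented example are hypothetical; `/content/drive` is the usual Colab mount point, but adjust it to your setup.

```python
# Copy a finished checkpoint folder to a backup location (e.g. mounted Google Drive).
import shutil
from pathlib import Path


def backup_checkpoint(checkpoint_dir, backup_root):
    """Copy checkpoint_dir into backup_root, keeping the folder name. Returns the new path."""
    src = Path(checkpoint_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"No checkpoint directory at {src}")
    dst = Path(backup_root) / src.name
    # dirs_exist_ok lets a re-run overwrite an earlier backup (Python 3.8+).
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst


# Example for Colab (hypothetical Drive folder name), after mounting Drive with
#   from google.colab import drive; drive.mount("/content/drive")
# backup_checkpoint("results/llama2/final_checkpoint", "/content/drive/MyDrive/llama2")
```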
You can see that the file “adapter_model.bin” is tiny (152.7MB) compared to llama2–7b’s “consolidated.00.pth” (13.5GB).
Both fine-tuning tutorials use GPU-based inference, but a true cheapskate would probably want to use their own laptop with a low-spec CPU and GPU. This is where llama.cpp comes into play. Your fine-tuned 7B model will run comfortably and quickly on an M1-based MacBook Pro with 16GB of unified memory. You can push it to run the 13B model as well if you free up some memory from resource-hungry apps.
There are a few simple steps to get your recently fine-tuned model ready for llama.cpp use. All the models reside in the directory “models”. Let’s create a new directory called “lora” under “models”, copy over all the original llama2–7B files, and then copy over the two adapter files from the previous step. The folder “lora” should have the following files
Step 1: Convert the LoRA adapter model to a ggml-compatible format:
python3 convert-lora-to-ggml.py models/lora
Step 2: Convert into f16/f32 models:
python3 convert.py models/lora
Step 3: Quantize to 4 bits:
./quantize ./models/lora/ggml-model-f16.gguf ./models/lora/ggml-model-q4_0.gguf q4_0
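If you expect to repeat these three steps, they can be wrapped in a small driver script. The sketch below only assembles the exact commands shown above and, by default, prints them without running anything. The function names are mine, and it assumes you run it from the root of your llama.cpp checkout with the scripts and the `quantize` binary in place.

```python
# Assemble (and optionally run) the three llama.cpp conversion steps from this article.
import subprocess
from pathlib import Path


def build_commands(model_dir="models/lora", quant="q4_0"):
    """Return the three commands as argument lists, in the order they must run."""
    model_dir = Path(model_dir)
    f16_path = (model_dir / "ggml-model-f16.gguf").as_posix()
    # ASSUMPTION: the output file is named after the quantization type, as in this article.
    quant_path = (model_dir / f"ggml-model-{quant}.gguf").as_posix()
    return [
        ["python3", "convert-lora-to-ggml.py", model_dir.as_posix()],  # step 1
        ["python3", "convert.py", model_dir.as_posix()],               # step 2
        ["./quantize", f16_path, quant_path, quant],                   # step 3
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(build_commands(), dry_run=True)
```

Keeping the dry run as the default makes it easy to eyeball the paths before spending several minutes on a conversion.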
Finally, you have your shiny new gguf file, baked with your special training data. It’s time to use it, or in fancy words, “run inference with it”!
./main -m models/lora/ggml-model-q4_0.gguf --color -ins -n -1
You can see llama-2–7b-Lora running blazing fast while I have dozens of tabs open in two Chrome browsers, a Docker engine running a database and web server, Visual Studio Code, and every instant messaging app imaginable, all on an average MacBook Pro M1 with 16GB of memory.
Congratulations! You have just fine-tuned your first personal LLM and run it on your laptop. Now there are a few things you can do next: