“Men will literally start a machine learning blog where they learn in public instead of going to therapy.”
I love machine learning, and I think the fact that the field has arrived in a big way is both awesome and deserved. I do find that there are some unfortunate consequences that have arisen as a result of all of the money and attention, however:
The exact mechanics behind today’s biggest breakthroughs are shrouded in secrecy. Although there is still plenty of publishing happening in academia, the insights of the field’s true luminaries remain obscured from the general public behind the veil of the top foundation model labs.
There is a deluge of newcomers to the field, which is totally awesome! But along with the new faces comes an unintended flood of low-quality information and tutorials. Personally, I’ve found it somewhat difficult to differentiate the signal from the noise. Additionally, I’ve found that most tutorials use the Hugging Face Trainer, which I can’t stand. I understand its value as an abstraction that makes fine-tuning more accessible, but I much prefer the PyTorch training loop of for batch in dataloader: optimizer.zero_grad(); loss = model(batch); loss.backward(); optimizer.step() (sketched out below).
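For the uninitiated, here is roughly what that loop looks like written out in full. This is a minimal sketch that assumes model, dataloader, and optimizer are already defined (and that each batch carries labels), not the exact code from the Colab:

```python
# Minimal sketch of the plain PyTorch loop referenced above. `model`,
# `dataloader`, and `optimizer` are assumed to already be defined, and each
# batch is assumed to contain input_ids, attention_mask, and labels.
model.train()
for batch in dataloader:
    batch = {k: v.to(model.device) for k, v in batch.items()}
    optimizer.zero_grad()
    loss = model(**batch).loss  # Hugging Face causal LMs return .loss when labels are provided
    loss.backward()
    optimizer.step()
```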
And now I’m adding my own blog to the zeitgeist. Probably no one will read it, but writing it has been helpful for me and maybe reading it can be helpful for you too. I want to apologize in advance because the level of complexity I use in explaining concepts might not be suitable or helpful for any audience really (i.e., too complex for beginners, banal for experts). But again, I’m mostly writing this for myself. Anyways, feel free to shoot me an email if you want to discuss anything. I’m always happy to chat about machine learning, and I’m always looking to learn more.
Unsupervised pre-training (e.g., masked language modeling or next-token prediction) followed by fine-tuning is the dominant paradigm in language modeling right now, and has been for several years. The high-level intuition behind why this approach works so well is that the unsupervised pre-training step allows models to learn the statistical patterns that are requisite for a general understanding of language. After pre-training imbues the model with general language skills, it can then be fine-tuned to be performant on a target task using a smaller, task-specific dataset.
Like so many other advances in the field of natural language processing (NLP), the pre-train then fine-tune paradigm was developed at OpenAI by Radford et al. circa 2018. Initially, they were primarily focused on fine-tuning models for traditional NLP tasks like semantic textual similarity and textual entailment, but over time it has become evident that this paradigm extends to generative models as well, with approaches like instruction tuning, reinforcement learning from human feedback (RLHF), and reinforcement learning from AI feedback (RLAIF). This post will discuss instruction fine-tuning, but a future blog post will cover RLHF and RLAIF approaches.
Instruction tuning is a method in which a generative model is fine-tuned to follow instructions given by a user in the input prompt. Instruction tuning was pioneered at Google Research by Wei et al. ~2 years ago. Although instruction tuning is a nascent line of inquiry in the bigger picture of scientific progress, it is already an essentially canonical method in generative modeling, particularly because getting instruction tuning to work really well (via supervised fine-tuning and then RLHF) was one of the critical enablers of ChatGPT. Anyways, how do we do it? This blog will focus on instruction fine-tuning Microsoft’s Phi-2 transformer on a single A100 in Google Colab.
Building the instruction tuning dataset is arguably the most critical step in the process of instruction tuning. The dataset is pivotal because its quality significantly impacts the performance of the fine-tuned model. Consequently, careful consideration is given to the design of the instructions and the selection of the human experts who generate the outputs. For example, the FLAN collection from Longpre et al. provides a blueprint for building instruction tuning datasets. They show that techniques like mixture weighting, including few-shot prompt templates, and data augmentation via input inversion all lead to improved instruction-tuned models. Although there are some general scientific best practices, curating instruction tuning datasets is largely a mixture of art and science at present. In fact, it seems like lots of innovation in the space actually takes place in Discord channels and by hobbyists these days, which is a crazy departure from where the machine learning field was 5 years ago. To develop an intuition for what these instruction tuning datasets look like, I would highly recommend browsing some of the instruction tuning datasets available on Hugging Face, such as OpenHermes.
Despite the importance of the dataset, I have but limited time and only one A100 with which to perform ablation studies. Therefore, I just used the existing OIG-small-chip2 dataset. The OIG-small-chip2 dataset contains Python code examples, natural instruction examples, generic harmless instruction examples, instructions and responses with lists, follow-up questions, Wikipedia toxic adversarial questions, and the grade school math GSM8K dataset. For more information, see the dataset page here.
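For anyone following along, pulling the data off the Hugging Face Hub and tokenizing it looks roughly like the sketch below. The repo id and the text column name are assumptions on my part (OIG-style datasets typically store the whole exchange in a single field), so check the dataset page for the exact schema:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Rough sketch: load the instruction tuning data and tokenize it.
# The repo id and the "text" column are assumptions -- OIG-style datasets
# typically store the whole "<human>: ... <bot>: ..." exchange in one field.
dataset = load_dataset("0-hero/OIG-small-chip2", split="train")

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
tokenizer.pad_token = tokenizer.eos_token  # Phi-2's tokenizer has no pad token by default

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)
```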
First off—kudos to Microsoft for permissively licensing Phi-2 with the MIT license. Phi-2 is a transformer model with 2.7 billion parameters, which is extremely small compared to the current state-of-the-art models (per SemiAnalysis, GPT-4 has ~1.8 trillion parameters). However, Microsoft was able to achieve impressive performance from this model considering both the model size and its comparatively modest training run of 1.4T tokens. The authors attribute this performance to the high quality of their proprietary data, which they describe as “textbook-quality data” (you can access their paper here).
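Loading the base model is a one-liner with the transformers library. A minimal sketch, loading in bfloat16 so the 2.7 billion parameters fit comfortably on the A100:

```python
import torch
from transformers import AutoModelForCausalLM

# Load Phi-2 in bfloat16 so the 2.7B-parameter model fits comfortably on one A100.
# Note: older transformers releases required trust_remote_code=True for Phi-2.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    torch_dtype=torch.bfloat16,
)
```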
See the linked Colab for a full walkthrough of how to fine-tune Phi in less than 200 lines of code. In an ideal world (where I had more discretionary GPU access or time), I would have tuned the hyperparameters, including the number of training epochs, weight decay, batch size, and learning rate, using Bayesian optimization. Here, I employed a vibes-based training strategy where I just picked the reasonable-seeming values of 2 epochs, a weight decay of 1e-4, a learning rate of 1e-5, and an effective batch size of 64.
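For concreteness, here is roughly how those choices translate into code. The per-device batch size of 8 with 8 gradient-accumulation steps is an assumed split (only the product of 64 matters), and this is a sketch rather than the exact Colab code:

```python
import torch

# Vibes-based hyperparameters: 2 epochs, weight decay 1e-4, lr 1e-5, and an
# effective batch size of 64 via gradient accumulation (8 x 8 is an assumed split).
NUM_EPOCHS = 2
GRAD_ACCUM_STEPS = 8  # per-device batch of 8 * 8 accumulation steps = 64 effective

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=1e-4)

model.train()
for epoch in range(NUM_EPOCHS):
    for step, batch in enumerate(dataloader):
        batch = {k: v.to(model.device) for k, v in batch.items()}
        loss = model(**batch).loss / GRAD_ACCUM_STEPS  # scale so accumulated grads average out
        loss.backward()
        if (step + 1) % GRAD_ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()
```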
I used Microsoft’s DeepSpeed package to allow me to use a larger batch size here. DeepSpeed is a deep learning optimization library for streamlining both language model training and inference. For this fine-tuning run, I specifically employed the ZeRO stage 2 optimizer with optimizer states and gradients offloaded to the CPU to reduce GPU memory utilization, which critically allowed me to use a larger effective batch size.
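For reference, a ZeRO stage 2 setup along these lines looks roughly like the following. The specific values are assumptions that mirror the hyperparameters above (and the model is the one loaded earlier), not the exact config from the Colab:

```python
import deepspeed

# Sketch of a ZeRO stage 2 config with optimizer state/work offloaded to CPU RAM.
# The values here mirror the (assumed) batch-size split and hyperparameters above.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 8,  # 8 * 8 = 64 effective batch size
    "bf16": {"enabled": True},
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 1e-5, "weight_decay": 1e-4},
    },
    "zero_optimization": {
        "stage": 2,                               # partition optimizer states and gradients
        "offload_optimizer": {"device": "cpu"},   # push optimizer work to CPU memory
    },
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```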
The model takes about 3 hours to instruction tune. Given more time, I would certainly run the model on some benchmark tasks to get an objective measure of its performance. But based on vibes-only analysis, I can tell you that it is certainly far worse than GPT-4, which is wholly expected given the parameter disparity. However, it can still give reasonable answers. For example, when asked to produce a grocery list for stir fry:
User: Give me a grocery list for stir fry
Assistant: Here is a grocery list for stir fry:
1. Chicken breast
2. Broccoli
3. Carrots
4. Bell peppers
5. Onion
6. Garlic
7. Ginger
8. Soy sauce
9. Rice vinegar
10. Cornstarch
11. Sesame oil
12. Green onions
13. Rice noodles
14. Peanuts
15. Scallions
16. Cilantro
17. Sesame seeds
which sounds like a good stir fry to me!
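If you want to poke at the fine-tuned model yourself, generation looks roughly like this; the <human>/<bot> prompt format is an assumption that mirrors the OIG-style training data:

```python
# Rough sketch of prompting the fine-tuned model. The "<human>:"/"<bot>:" format
# is an assumption that mirrors the OIG-style training data.
prompt = "<human>: Give me a grocery list for stir fry\n<bot>:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```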
Anyways, that’s a wrap! In the future, if I had more GPUs, I would be interested in seeing how larger models do, like Mixtral 8x7B. I hope I conveyed through this blog post that instruction fine-tuning requires both a deep understanding of data preparation and of language model fine-tuning techniques. The result is an adapted model that can field human-specified tasks and instructions, providing more accurate and useful responses. I expect that increasingly sophisticated variants of instruction fine-tuning (such as explanation tuning from Mukherjee et al.) will continue to play a crucial role in shaping language models in the coming years.