SEEK: A Stacked Ensemble for Expert Knowledge
Large language models (LLMs) have seen recent spikes in popularity thanks to growing quantities of training data and powerful compute. However, LLMs continue to hallucinate and are by no means memory efficient. We attempt to mitigate these issues through domain-specific knowledge distillation. Domain-specific distillation combines a reduced parameter count and vocabulary alterations with fine-tuning on restricted domains of knowledge, such as law, medicine, math, or computer science, which allows models to answer domain-specific questions more accurately than domain-agnostic distillation. Through careful distillation, we propose a mixture-of-experts (MoE) of smaller, more efficient domain-specific and domain-agnostic models that in aggregate perform similarly to a larger model on domain-specific tasks.
Introduction and Motivation #
Large language models (LLMs) have surged in popularity in recent years as their capabilities have developed, especially with the release of ChatGPT and the announcement of GPT-4 (OpenAI, 2023). As demand for more accessible and personalized LLMs increases, we foresee issues with the large memory footprint of LLMs. We experiment with model distillation as a way to take advantage of the weights of existing LLMs while lowering the overall memory footprint. Knowledge distillation is the process of training a smaller *student* model on the input-output pairs of a larger *teacher* model. Using fine-tuned domain-specific models as teachers yields higher accuracy on domain-specific tasks (Yao et al., 2021). We hypothesize that by mixing these domain-specific student models together, a model with a smaller overall memory footprint but higher domain-specific accuracy can be constructed.
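To make the student-teacher setup concrete, here is a minimal sketch of a soft-label distillation loss in PyTorch. The temperature value and the `teacher`/`student` calls in the usage comments are illustrative assumptions, not the exact recipe used in this work.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label distillation: KL divergence between the teacher's and the
    student's temperature-softened output distributions."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, soft_targets, reduction="batchmean") * temperature ** 2

# Hypothetical usage inside a training step (teacher frozen, student trainable):
# with torch.no_grad():
#     teacher_logits = teacher(input_ids, attention_mask=mask).logits
# student_logits = student(input_ids, attention_mask=mask).logits
# loss = distillation_loss(student_logits, teacher_logits)
```

In practice this soft-label term is often mixed with the ordinary cross-entropy loss on the gold labels; the weighting between the two is a tunable hyperparameter.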
In this work, a mixture-of-experts model is built by stacking distilled BERT models fine-tuned for multiple choice question answering in the domains of arithmetic, science, and medicine. A mixture-of-experts architecture was chosen because it has been shown to improve model capacity by more than 1000x (Shazeer et al., 2017), making it an appealing way to combine smaller models. A mixture-of-experts ensemble also draws on multiple experts at once, which allows for a more well-rounded response. We further investigate whether multiple experts can help answer questions across different domains.
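As an illustration of how such an ensemble can combine expert outputs, the sketch below weights per-expert answer logits with a learned gate. The module name, the Hugging Face-style `.logits` interface on the experts, and the use of a precomputed question embedding as the gating input are assumptions for the sake of the example, not SEEK's actual architecture.

```python
import torch
import torch.nn as nn

class StackedExpertEnsemble(nn.Module):
    """Illustrative mixture-of-experts head: a learned gate weights the
    answer logits produced by several domain-specific expert models."""

    def __init__(self, experts, hidden_size):
        super().__init__()
        self.experts = nn.ModuleList(experts)          # e.g. distilled BERT MCQA models
        self.gate = nn.Linear(hidden_size, len(experts))

    def forward(self, question_embedding, input_ids, attention_mask):
        # One answer-logit vector per expert: (num_experts, batch, num_choices)
        expert_logits = torch.stack(
            [e(input_ids, attention_mask=attention_mask).logits for e in self.experts]
        )
        # Per-example gate weights over experts: (batch, num_experts)
        weights = torch.softmax(self.gate(question_embedding), dim=-1)
        # Weighted combination of expert predictions: (batch, num_choices)
        return torch.einsum("ebc,be->bc", expert_logits, weights)
```

A hard top-1 gate (routing each question to a single expert) is a natural alternative to this soft weighting when inference cost matters more than ensemble smoothness.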
We introduce SEEK, a stacked ensemble of expert knowledge for multiple choice question answering. We show that smaller specialized learners can, in aggregate, perform nearly as well as larger fine-tuned teacher models. Rather than growing model size exponentially in line with scaling laws, SEEK grows linearly in size as domains are added. The sections of this paper are as follows: in section 2 we outline current related work in knowledge distillation and mixture-of-experts; in sections 3 and 4 we expand on the importance of knowledge distillation and mixture-of-experts as a valid path toward compressed knowledge; in section 5 we introduce SEEK, followed by experiments and ablation studies in section 6; results are outlined in section 7.
Acknowledgements #
This work began as a class project for Natural Language Processing - Self Supervised Models at JHU, together with the great Kevin Kim and Sara Ren. The results, while not fully fleshed out, point, I think, in the right direction toward massive open-sourced MoE LLMs with 30B+ parameters. More information can be found here.