Create Mixtures of Experts with MergeKit

Thanks to the release of Mixtral, the Mixture of Experts (MoE) architecture has become popular in recent months. This architecture offers an interesting tradeoff: higher performance at the cost of increased VRAM usage. While Mixtral and other MoE architectures are pre-trained from scratch, Arcee's MergeKit library now offers another way of creating MoEs: ensembling several pre-trained models. These are often referred to as frankenMoEs or MoErges to distinguish them from pre-trained MoEs.

In this article, we will detail how the MoE architecture works and how frankenMoEs are created. Finally, we will make our own frankenMoE with MergeKit and evaluate it on several benchmarks. The code is available on Google Colab in a wrapper called LazyMergeKit.
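As a quick preview of what this looks like in practice, here is a minimal sketch of a mergekit-moe configuration. The base model, expert models, and prompts below are illustrative placeholders rather than the exact setup built later in the article.

```yaml
# Minimal mergekit-moe configuration sketch (model names are illustrative placeholders).
# The base model provides the shared layers; each expert contributes its MLP weights,
# and its positive_prompts are used to initialize the router gate for that expert.
base_model: mistralai/Mistral-7B-Instruct-v0.2
gate_mode: hidden        # initialize router gates from hidden-state representations of the prompts
dtype: bfloat16
experts:
  - source_model: teknium/OpenHermes-2.5-Mistral-7B
    positive_prompts:
      - "chat with the user"
      - "answer general questions"
  - source_model: WizardLM/WizardMath-7B-V1.1
    positive_prompts:
      - "solve this math problem"
      - "reason step by step"
```

A config like this is then passed to the mergekit-moe command (for example, `mergekit-moe config.yaml output-dir`) to assemble the frankenMoE, a process we will walk through step by step below.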

Special thanks to Charles Goddard, the creator of MergeKit, for proofreading this article.
