About training mem usage for moe

#2 opened by lucasjin

Hello. The parameter count shows roughly 40B total for Phi-3.5-MoE, even though only about 6B parameters are activated per token. When training, does that mean we should expect the same training memory usage as a dense ~40B model?

Microsoft org

Hi. Thanks for your interest in Phi-3.5-MoE. The answer is yes, unless you are using expert parallelism. If you treat it as a normal 42B dense model, DeepSpeed ZeRO Stage 3 will greatly reduce your memory consumption. You can see how to use Stage 3 for MoE here: https://github.com/microsoft/DeepSpeed/issues/4808
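For reference, a ZeRO Stage 3 configuration used through the Hugging Face Trainer integration might look roughly like the sketch below. This is only an illustration of the shape of such a config, not a tuned recipe: the CPU offload sections are optional and trade training speed for GPU memory, and the "auto" values are filled in from the Trainer arguments.

```python
# Illustrative ZeRO Stage 3 config as a Python dict (it could also be saved
# as a JSON file and passed by path). Values here are assumptions, not a
# tuned recipe for Phi-3.5-MoE.
ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,  # partition parameters, gradients, and optimizer states
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
        # Optional: offload to CPU for further memory savings at a speed cost.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
}
```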

Microsoft org

As mentioned in the comment by @ykim362, using DeepSpeed ZeRO Stage 3 will be helpful; in fact, we included sample code, sample_finetune.py, in the model files.
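For a rough picture of how such a ZeRO-3 config plugs into a standard Trainer run, a minimal sketch follows. sample_finetune.py in the model files is the authoritative reference; the toy dataset, the config file name "ds_zero3.json", and the hyperparameters below are placeholder assumptions.

```python
# Rough sketch only -- see sample_finetune.py in the model files for the
# actual recipe. The toy dataset, hyperparameters, and the config file name
# "ds_zero3.json" are placeholder assumptions.
import torch
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_id = "microsoft/Phi-3.5-MoE-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)

# Tiny toy dataset so the example is self-contained.
raw = Dataset.from_dict(
    {"text": ["Hello, world!", "Phi-3.5-MoE fine-tuning smoke test."]}
)
train_ds = raw.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=64))

args = TrainingArguments(
    output_dir="phi35-moe-finetune",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    deepspeed="ds_zero3.json",  # path to a ZeRO Stage 3 config like the one above
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # launch with e.g.: deepspeed --num_gpus=8 finetune_sketch.py
```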

Actually, I want to fine-tune Phi-3.5-MoE into a multi-modal model. Phi-Vision sets an extremely high standard for multi-modal language models (MLLMs). I just want to see whether using Phi-3.5-MoE would yield an even stronger MLLM. I have experience fine-tuning a Mistral 12B into an MLLM, and the results were quite good. Are there any points to note when fine-tuning a MoE model?
