An optimizer forms the basis for training most modern neural networks.
First published in 2014, the Adam optimizer, along with its variants, has become the dominant, go-to optimizer for training LLMs in industry today.
However, there is a problem with Adam that has been largely ignored because of its superior performance.

That problem is memory inefficiency.
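To see where that memory goes, here is a minimal sketch of the standard Adam update rule. The key point is that Adam keeps two extra state tensors per parameter, a first-moment estimate `m` and a second-moment estimate `v`, each the same shape as the weights themselves (the function and variable names below are my own illustration, not code from any particular library):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. Note the two state tensors m and v, each the
    same shape as theta -- this is the source of the memory overhead."""
    m = beta1 * m + (1 - beta1) * grad       # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2  # second-moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: three parameters, one update step.
theta = np.zeros(3)
m = np.zeros_like(theta)
v = np.zeros_like(theta)
grad = np.array([0.1, -0.2, 0.3])
theta, m, v = adam_step(theta, grad, m, v, t=1)
```

For a model with billions of parameters, `m` and `v` together double the parameter memory before gradients or activations are even counted.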
To train an LLM with 7 billion parameters, Adam requires around 86 GB of memory.

For models like Google's PaLM, which has 540 billion parameters, more than 50 GPUs are needed just to hold Adam's optimizer states.
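As a back-of-the-envelope check (my own arithmetic, not a figure from the article): Adam stores two fp32 moment tensors per parameter, and training setups commonly also keep an fp32 master copy of the weights. Under that assumption the estimate lands near the ~86 GB quoted above:

```python
def adam_state_gb(n_params, bytes_per_val=4, keep_master_weights=True):
    """Rough estimate of Adam optimizer memory: two moment tensors (m, v)
    per parameter, optionally plus an fp32 master copy of the weights.
    This accounting convention is an assumption, not from the article."""
    tensors = 2 + (1 if keep_master_weights else 0)
    return tensors * n_params * bytes_per_val / 1e9

print(adam_state_gb(7e9))    # 3 tensors * 7e9 params * 4 B = 84.0 GB
print(adam_state_gb(540e9))  # PaLM scale: thousands of GB of optimizer state
```

The 7B estimate (84 GB) is in the same ballpark as the quoted ~86 GB; the exact published number depends on precision and sharding details not spelled out here.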
But perhaps not anymore. Here's some exciting news!
A team of ML researchers has developed a better version of Adam called Adam-mini.

The Adam-mini optimizer is twice as memory-efficient and achieves 49.6% higher throughput than AdamW when used to train billion-parameter LLMs.