How knowledge distillation compresses neural networks
For instance, the famous BERT model has about 110 million parameters.

So, what is knowledge distillation? Let's imagine a very complex task, such as image classification across thousands of classes. Often, you can't just slap on a ResNet50 and expect it to achieve 99% accuracy. So, you build an ensemble of models, balancing out the flaws of each one. In broad strokes, the process is the following: