Microsoft AI ‘Distills’ Knowledge with New NLP Approach
A technique known as “ensemble learning” has been used to improve the performance of natural language processing models. However, the approach typically combines hundreds of different deep neural network models, making deployment prohibitively expensive because of the heavy computing cost of running inference across them.
AI researchers at Microsoft (NASDAQ: MSFT) came up with a way of compressing multiple ensemble models into a natural language processing algorithm dubbed Multi-Task Deep Neural Network (MT-DNN) using “knowledge distillation.” The technique is used to boil down the knowledge within a neural network, thereby compressing multiple models, or ensembles, into a single, compact and less expensive model.
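The core idea behind knowledge distillation can be sketched in a few lines: a teacher model's output logits are converted into a "softened" probability distribution, and the student is trained to match that distribution rather than the hard labels alone. The sketch below follows the general soft-target formulation of distillation, not the paper's exact loss; the logits, temperature value, and function names are illustrative.

```python
import math

def soft_targets(logits, T=2.0):
    """Temperature-scaled softmax: a higher T 'softens' the teacher's
    output, exposing how it ranks the incorrect classes as well."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_probs, student_probs):
    """Cross-entropy between the teacher's soft targets and the
    student's predicted distribution -- the quantity minimized
    when the student learns from the teacher."""
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

# Hypothetical teacher and student logits for a 3-class task
teacher = soft_targets([4.0, 1.0, 0.5], T=2.0)
student = soft_targets([3.5, 1.2, 0.4], T=2.0)
loss = distillation_loss(teacher, student)
```

Raising the temperature flattens the teacher's distribution, which is what lets the compact student absorb the teacher's "dark knowledge" about near-miss classes instead of just the single top label.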
The result, the Microsoft researchers noted in a paper, was more robust learning and “universal text representations” across multiple natural language understanding tasks.
Training and testing of the compressed model were performed on four Nvidia (NASDAQ: NVDA) Tesla V100 GPUs running on the PyTorch deep learning framework. The AI researchers reproduced their results on the General Language Understanding Evaluation (GLUE) benchmark using eight Tesla GPUs.
The GLUE benchmark consists of nine natural language understanding tasks, including question answering, sentiment analysis, text similarity and textual entailment. Textual entailment refers to the logical exercise of discerning whether one sentence can be inferred from another.
The MT-DNN algorithm was released in January and updated in April. Building on Google’s language representation model BERT (Bidirectional Encoder Representations from Transformers), the researchers claimed their distilled model outperformed both BERT and the previous state-of-the-art model on a benchmark leaderboard.
The Microsoft team said they trained a collection of different MT-DNNs serving as a “teacher,” then trained a single algorithm as a “student.” The multi-task learning technique was used to distill knowledge from the ensemble teachers, they reported.
Microsoft’s knowledge distillation process was then used for learning multiple tasks. A set of tasks was selected based on the availability of labeled training data. The “teacher” ensemble of different neural networks was trained with these labeled data sets, and the single “student” MT-DNN was then trained using the multi-task learning approach.
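The ensemble-to-student step described above can be sketched as follows: each teacher in the ensemble produces a class distribution for a training example, and those distributions are averaged to form the soft targets the single student is trained against, per task. This is a minimal illustration of the general averaging idea; the function names and toy logits are hypothetical, not taken from the paper.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_soft_targets(per_teacher_logits):
    """Average the class distributions of several teacher models to
    produce the soft targets used to train the single student."""
    dists = [softmax(logits) for logits in per_teacher_logits]
    n_classes = len(dists[0])
    n_teachers = len(dists)
    return [sum(d[c] for d in dists) / n_teachers for c in range(n_classes)]

# Hypothetical logits from three teacher models on one example
# of a binary task (e.g., an entailment decision)
teachers = [[2.0, 0.5], [1.5, 0.8], [2.2, 0.3]]
targets = ensemble_soft_targets(teachers)
```

Because the averaged targets reflect the consensus of the whole ensemble, the student can approach the ensemble's accuracy while paying the inference cost of only one model.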
Both the plain-vanilla and “distilled” versions of the MT-DNN algorithm outperformed the BERT model in benchmark testing, the distilled version by a wide margin, the researchers claimed. The results, they added, demonstrate that language representation using distilled MT-DNN is “more robust and universal” than previous approaches.
The GLUE benchmark results are posted on the public GLUE leaderboard.
Microsoft’s knowledge distillation approach is among a growing number of NLP frameworks being turned over to the open-source community as corporate AI research expands and investigators seek to bounce their ideas off outside developers. Earlier this month, for example, Intel’s AI lab released a sentiment analysis algorithm aimed at improving NLP applications that do not scale across different domains.