Toward Data-Efficient Learning in Computer Vision
Deep learning leads the performance in many areas of computer vision. However, as the number of parameters grows, deep neural networks usually require large amounts of data to train a good model. Collecting and labeling such large datasets is not always realistic, e.g., when recognizing rare diseases in the medical field, and both collection and labeling are labor-intensive and time-consuming. In contrast, studies show that humans can recognize new categories from even a single example, the opposite of what current machine learning algorithms require. Thus, data-efficient learning, where the scale of labeled data is relatively small, has recently attracted increasing attention. Following the key components of machine learning algorithms, data-efficient learning methods can be divided into three categories: data-based, model-based, and optimization-based. In this study, we investigate two data-based models and one model-based approach.
First, the most direct way to achieve data-efficient learning is to generate more data and thus mimic data-rich scenarios. To this end, we propose to integrate both spatial and Discrete Cosine Transform (DCT) based frequency representations to fine-tune the classifier. Second, beyond quantity, another important property of data is its quality to the model, which differs from its quality to human eyes. Since language carries denser information than natural images, we propose, mimicking language, to explicitly increase the input information density in the frequency domain. Third, model-based methods in data-efficient learning mainly aim to make models converge faster. After carefully examining the self-attention modules in Vision Transformers, we discover that trivial attention drowns out useful non-trivial attention because of its sheer accumulated amount. To solve this issue, we propose to divide attention weights into trivial and non-trivial ones by a threshold and to suppress the accumulated trivial attention weights. Extensive experiments demonstrate the effectiveness of the proposed models.
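The thresholded split of attention weights can be sketched as follows. This is a minimal illustration only: the threshold value, the zero-then-renormalize suppression scheme, and the function name are assumptions for exposition, not the exact procedure used in the study.

```python
import numpy as np

def suppress_trivial_attention(attn, threshold=0.05):
    """Split attention weights into trivial (< threshold) and non-trivial
    ones, zero out the trivial weights, and renormalize each row so it
    still sums to 1 (a hypothetical suppression scheme)."""
    suppressed = np.where(attn < threshold, 0.0, attn)
    row_sums = suppressed.sum(axis=-1, keepdims=True)
    # guard against rows where every weight was trivial
    row_sums = np.where(row_sums == 0.0, 1.0, row_sums)
    return suppressed / row_sums

# toy attention row over 5 tokens (already softmax-normalized)
attn = np.array([[0.50, 0.30, 0.10, 0.06, 0.04]])
out = suppress_trivial_attention(attn, threshold=0.05)
```

In this toy row, only the last weight (0.04) falls below the threshold; it is zeroed and the remaining mass is redistributed so the non-trivial weights again form a valid distribution.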