
Abstract: Few-shot computer vision algorithms have enormous potential to deliver the promised results for innovative applications that only have a small volume of example data for training. Current few-shot algorithm research focuses on applying transfer learning to deep neural networks that are pre-trained on big datasets. However, adapting the transformers requires highly expensive computation resources. In addition, our research identifies overfitting or underfitting problems and low accuracy on large numbers of classes in the face validation domain. This paper therefore proposes an alternative enhancement solution: adding contrasted attention on negative and positive face pairs to the training process. The extra attention is created through a clustering-based face pair creation algorithm. The evaluation results show that the proposed approach sufficiently addresses the problems without requiring high-cost resources.


Introduction
Face validation is one of the important machine learning research topics for a wide range of smart applications. In the last decade, the development of Convolutional Neural Network (CNN) architectures such as VGG-19 [1], ResNet [2] and Inception-v3 [3] provided good performance on face validation [4]. However, researchers in this domain realise there is a crucial difference between deep CNNs and human learning on face validation and other similar Artificial Intelligence (AI) tasks, which is the volume of data used. A human can learn image concepts from a very small number of examples, but a deep CNN needs a huge dataset to capture the features and can still make mistakes on novel images [5]. In the meantime, much research has focused on one-shot and few-shot learning algorithms since 2006 [6,7,8,9,10,11]. In this paper, we discuss two important issues with the current state-of-the-art Siamese neural network on face validation: overfitting or underfitting (for simplification, we use overfitting as the general term in this paper) and low accuracy on large numbers of classes. We propose an enhanced pairing algorithm to address these issues.
The Siamese neural network, introduced by [6], presents a learning structure that has two parallel neural networks. One network learns the same concept and the other learns the differences between different concepts. These two networks share the weights of the features learnt from the dataset. Therefore, the dataset needs to be pre-processed into two types of pairs: pairs of the same concept and pairs of different concepts. In the face validation domain, these are image pairs of the same person (positive pairs) and image pairs of different people (negative pairs). The core mathematical function behind the Siamese neural network is the contrastive loss function:

L(W, Y, X_1, X_2) = (1 - Y) \frac{1}{2} D_w^2 + Y \frac{1}{2} \{\max(0, m - D_w)\}^2

Here, Y can be 0 (same concept) or 1 (different concept), D_w^2 presents the similarity measurement and m is the margin. The similarity measurement is based on the Bayesian likelihood function. In face validation, the facial features can be projected into a Euclidean space where distance calculations directly correspond to a measure of face similarity. Figure 1 shows the overall working process of the Siamese neural network [12]. The feature models are normally created from Artificial Neural Networks (ANNs). If a Deep Neural Network (DNN), e.g. a multi-layer CNN, is applied, then it can be defined as a deep Siamese neural network [13].
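The contrastive loss above can be illustrated with a minimal, self-contained sketch in plain Python. The `margin` value here is an assumed hyperparameter, not one taken from the paper:

```python
def contrastive_loss(d_w, y, margin=1.0):
    """Contrastive loss for a single pair.

    d_w    -- Euclidean distance D_w between the two embeddings
    y      -- 0 for a positive (same-concept) pair, 1 for a negative pair
    margin -- m, the distance negative pairs are pushed beyond
    """
    positive_term = (1 - y) * 0.5 * d_w ** 2
    negative_term = y * 0.5 * max(0.0, margin - d_w) ** 2
    return positive_term + negative_term

# A positive pair is penalised for being far apart ...
print(contrastive_loss(0.8, y=0))               # 0.32
# ... while a negative pair is penalised only when closer than the margin.
print(contrastive_loss(0.8, y=1, margin=1.0))   # 0.02
print(contrastive_loss(2.0, y=1, margin=1.0))   # 0.0 (already far enough)
```

Note that the two terms are mutually exclusive: for any given pair, Y selects either the pulling term (positive pairs) or the pushing term (negative pairs).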

1. Deep Siamese neural network
The deep Siamese neural network applies a long sequence of convolution feature filters, and each filter consists of feature mapping, a convolution activation function and max-pooling (the CNN process). The Siamese neural network creates a pair of deep CNNs by joining them at the end with the loss function. The best deep Siamese neural network for image verification was claimed in [9] in 2015, which contains seven layers of convolution filters.
In the meantime, DeepFace [13], a deep Siamese neural network, was proposed to perform a human-level face validation task with a highly promising result of about 97% accuracy on the LFW image dataset. DeepFace's DNN has two connected blocks of CNNs. The first block contains 32 filters of 11×11×3 before the first max-pooling layer. The second has 16 filters of 9×9×16 followed by the second max-pooling layer, which has three subsequent convolution filters. To train such deep neural networks, DeepFace still requires a big dataset before applying the Siamese function. Thus, the other pathway is to apply transfer learning (Transformers) [14,15].

Transfer learning based Siamese neural network
Transfer learning means adopting well-built facial feature extraction models (general models) that are trained on a big dataset, and then tuning only the last layer using the application-specific (small) dataset. In this way, a well-trained general model can be reused without repeating its heavy consumption of computation resources and time.
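The "tune only the last layer" idea can be sketched with a toy stand-in: `pretrained_embed` below plays the role of the frozen general model (in practice a deep CNN; here just a fixed random projection), and only a logistic last layer is trained on the small dataset. All names and shapes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W_frozen = rng.standard_normal((4, 8))   # stand-in for pre-trained weights

def pretrained_embed(x):
    # Frozen "general model": never updated during fine-tuning.
    return np.tanh(x @ W_frozen)

# Small application-specific dataset (toy stand-in).
X = rng.standard_normal((32, 4))
y = (X[:, 0] > 0).astype(float)

# Only the last layer's parameters (w, b) are tuned.
w, b = np.zeros(8), 0.0
for _ in range(500):
    f = pretrained_embed(X)                      # frozen features
    p = 1.0 / (1.0 + np.exp(-(f @ w + b)))       # logistic last layer
    grad = p - y
    w -= 0.1 * f.T @ grad / len(X)               # gradient step on w only
    b -= 0.1 * grad.mean()

pred = (1.0 / (1.0 + np.exp(-(pretrained_embed(X) @ w + b)))) > 0.5
accuracy = (pred == (y == 1.0)).mean()
```

Because `W_frozen` never changes, the per-step cost is just the tiny last-layer update, which is the computational appeal of the transfer approach.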
One of the earliest transfer models was introduced in [16], which transfers a developed Joint Bayesian Method (JBM) learning model from other domains to perform face verification training on the LFW dataset. This work achieves an accuracy of 96.33%.
In the current state of the art, FaceNet [17] is the most well-known transfer model in the face validation domain. The unique characteristic of the FaceNet model is that it extracts face mapping features into a compact Euclidean space. As a result, the similarity of face images can be directly measured as Euclidean distances. Therefore, FaceNet is well suited to work with the Siamese neural network as the transfer model.
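The "distance equals similarity" property can be sketched as follows. The toy vectors stand in for the outputs of a pre-trained embedding model, and the threshold value is an assumed tuning parameter, not one prescribed by FaceNet:

```python
import numpy as np

def l2_normalise(v):
    # Embeddings are normalised to unit length so distances are comparable.
    return v / np.linalg.norm(v)

def same_person(emb_a, emb_b, threshold=1.1):
    # In a FaceNet-style Euclidean embedding space, the distance between two
    # face embeddings directly measures face similarity.
    dist = np.linalg.norm(l2_normalise(emb_a) - l2_normalise(emb_b))
    return dist < threshold

# Toy embeddings standing in for the outputs of a pre-trained model:
anchor    = np.array([0.9, 0.1, 0.2])
same      = np.array([0.8, 0.2, 0.1])    # close in embedding space
different = np.array([-0.7, 0.6, 0.3])   # far away

print(same_person(anchor, same))         # True
print(same_person(anchor, different))    # False
```

This is exactly why FaceNet slots naturally into a Siamese network: the contrastive loss operates directly on these Euclidean distances.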

3. Incrementation and Simplification processes on image classification
Based on the transfer learning idea, there have recently been two opposite directions of few-shot learning research on image classification. In 2019, an incremental few-shot learning algorithm [7] was developed that separates the learning process into two phases: base class weight learning (base learning) and meta-learning. The base learning phase applies transfer learning to collect network weights for general classes on pre-trained and classified images. The meta-learning phase only extracts feature weights, through a Siamese neural network, on the novel images that the first phase never learnt before. As a result, attention weights are collected focused only on the novel images. Finally, both sets of weights are integrated through an attractor-regulariser gate to complete the classification task. This approach works well on novel datasets with predefined classes such as dog, cat and fish. For comparing much more similar features (e.g. different types of fish), it does not work well, but this is not the aim of their research anyway.

In contrast to adding an extra layer to the Siamese neural network, the Facebook AI research team claimed a surprising exploratory research outcome: a simple Siamese neural network (SimSiam) [10] can extract enough meaningful features to classify images even without negative sample pairs, large batches or momentum encoders. The core component that makes this possible is a one-sided stop-gradient operation (see Figure 2, left). A combined and cloud-based clustering algorithm (SwAV) was also developed by performing feature learning on the same images through two different image-augmented versions (see Figure 2, right) [18]. The SwAV algorithm first encodes the class features into a prototype vector C, similar to the base class weight collection. Then the online cloud classification applies the swap prediction method to cluster the images into different classes.

Figure 2: SimSiam and SwAV networks
However, there is a major difference between these proposed image classification algorithms and the face validation Siamese neural network. The contrastive networks above are designed to predict or cluster the classes on one side and validate on the other side for the same encoded image, whereas face validation in a Siamese neural network is to identify whether two different images show the same person.

Limitations
In general, the current Siamese neural network approaches suffer from two problems: • A significant overfitting problem for a smaller training dataset, even with a transformer. Applying the FaceNet Siamese neural network to the Yale Face Dataset [19], we can clearly see an accuracy gap between training and validation, as shown in Figure 3.
• Poor performance for a large training dataset without incremental computing (time cost) and significant pre-trained deep networks (very costly to transfer). Applying only the FaceNet Siamese neural network to the Labelled Faces in the Wild (LFW) dataset [20], the accuracy rate drops dramatically compared to the smaller dataset, as clearly displayed in Figure 4. Here the smaller dataset refers to one whose number of distinct face classification labels is smaller than 20 and which contains only a few hundred images altogether.

Proposed Clustering-Based Attention Siamese Neural Network
In this section, we introduce the Clustering-based Face Validation Siamese neural network (CFVSiam). The hypothesis of CFVSiam is that a clustered pairing algorithm can significantly reduce the number of image pairs for training while being more efficient and accurate, because the networks become more sensitive to similar faces. More precisely, CFVSiam has three major steps: clustering (unsupervised learning) the few-shot face dataset based on the K-means algorithm, creating negative pairs from the same cluster and from different clusters (the positive pairs are created normally), and applying a FaceNet-based deep Siamese neural network to encode and contrastively compute the face validation. Figure 5 shows the architecture of CFVSiam.

1. Definition of CFVSiam
αi denotes the PCA-optimised trained parameters of the contrastive twin FaceNet DNNs f1 and f2 over positive pairs (same person, PP), negative pairs from the same cluster (NPS) and negative pairs from different clusters (NPD). The reason we can reduce the number of pairs is that the algorithm takes only one image from each different cluster to create negative pairs, but focuses on creating more negative pairs within the same cluster. Thus, the Siamese neural network is more sensitive and accurate. The idea behind this is that it is not difficult for a Siamese neural network to learn the differences between face images from two different clusters, but it is hard to achieve accurate feature learning within the same cluster.

2. Algorithm
The overall creation process of the CFVSiam network consists of two algorithms, presented in Algorithm 1 and Algorithm 2 (see the Appendices section).
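The clustered pairing idea can be sketched as follows. This is not the paper's Algorithm 1 or 2, but a minimal illustration under stated assumptions: a naive K-means stands in for the sklearn MiniBatchKMeans used in the paper, and the NPD step simply takes the first image of each cluster (a full implementation would also verify the person identities differ):

```python
import numpy as np
from itertools import combinations

def kmeans(X, k, iters=20):
    # Naive K-means with deterministic init from the first k points; a
    # simplified stand-in for MiniBatchKMeans.
    centres = X[:k].astype(float).copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centres[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centres[j] = X[labels == j].mean(axis=0)
    return labels

def make_pairs(embeddings, person_ids, k=2):
    labels = kmeans(embeddings, k)
    positive, neg_same, neg_diff = [], [], []
    for i, j in combinations(range(len(embeddings)), 2):
        if person_ids[i] == person_ids[j]:
            positive.append((i, j))        # PP: same person, created normally
        elif labels[i] == labels[j]:
            neg_same.append((i, j))        # NPS: hard negatives, same cluster
    # NPD: only one representative image per cluster, so far fewer pairs.
    for a, b in combinations(range(k), 2):
        neg_diff.append((int(np.argmax(labels == a)),
                         int(np.argmax(labels == b))))
    return positive, neg_same, neg_diff

# Toy embeddings: two well-separated clusters of three face images each.
emb = np.array([[0.0, 0], [0.1, 0], [0.2, 0], [5.0, 5], [5.1, 5], [5.2, 5]])
ids = [0, 0, 1, 2, 2, 3]                   # person label per image
pp, nps, npd = make_pairs(emb, ids, k=2)
```

On this toy data, all within-cluster negatives are kept (four NPS pairs) but only a single cross-cluster pair (NPD) is produced, which is exactly how the pair count is reduced while concentrating attention on hard, similar-looking negatives.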

Evaluations
The hypothesis is that negative pairs of people in the same cluster should have closer distances between them, and the variance should be larger among them, than in the random negative pairing generation process. Conversely, positive pairs of the same person from different clusters should have longer distances, and the variance should be smaller among them, than in the random positive pairing generation process. Figure 6 shows that the assumption is correct if we take the Yale face dataset as an example. At the top of the figure, the left plot presents different people paired within one of the clusters (mean = 8.237, variance = 3.721) compared to the right plot, which includes all the negative pairs (mean = 8.335, variance = 3.194). Therefore, we believe that the negative pairs created through clustering pay more attention to different people who have a certain degree of similarity, and vice versa for the positive pairs. With K-means clustering (the sklearn MiniBatchKMeans Python package), the faces can be grouped with a certain level of similarity. This makes the training data pay more attention to extracting the distinguishing features for different people's face images that may be difficult to separate within the same cluster, and the common features for the same person's face images that look very different across different clusters.

We can show that the overfitting problem is dramatically addressed. Compared to Figure 3, CFVSiam's validation results are rarely worse than its training results across all epoch rounds for both datasets. In addition, the accuracy of the performance is higher than without attention (see Figure 3 and Figure 4) when training on the Yale face dataset (train = 96, valid = 99) and the LFW dataset (train = 98, valid = 99) with the clustering attentions. So, we can conclude that CFVSiam's performance is not affected by the size of the data (the number of different people, or colour versus black-and-white images).
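The mean and variance statistics quoted above can be computed as follows. The embeddings here are toy stand-ins chosen to illustrate the computation, not the Yale face data:

```python
import numpy as np

def pair_distance_stats(embeddings, pairs):
    # Mean and variance of the Euclidean distances over a set of image-index
    # pairs, mirroring the statistics reported for the Yale face dataset.
    d = [float(np.linalg.norm(embeddings[i] - embeddings[j]))
         for i, j in pairs]
    return float(np.mean(d)), float(np.var(d))

# Toy embeddings: one tight cluster (images 0-2) plus three distant images.
emb = np.array([[0.0, 0], [1, 0], [2, 0], [10, 0], [11, 0], [12, 0]])
within  = [(0, 1), (0, 2), (1, 2)]              # negatives inside one cluster
all_neg = within + [(0, 3), (1, 4), (2, 5)]     # plus cross-cluster negatives

m_within, v_within = pair_distance_stats(emb, within)
m_all, v_all = pair_distance_stats(emb, all_neg)
print(m_within, m_all)   # within-cluster negatives sit closer together
```

Comparing `m_within` against `m_all` for a real dataset is how the left and right plots of Figure 6 are produced.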
Table 1 shows the results compared to the other DNN models tested on the same LFW dataset, which are not few-shot machine learning, from the survey literature [21,22,23]. The proposed attention-enhanced Siamese model's performance is as strong as the most state-of-the-art DNN models, which require much more computation resources and time.

Conclusion and Future Work
Few-shot learning algorithms have the advantage of using fewer data examples from each class to address classification or prediction problems. This advantage enables the algorithms to train faster with lower costs for resource-limited applications. However, we found that the Siamese neural network has problems with overfitting and low accuracy for large numbers of classes. To address these problems, we proposed the CFVSiam network by adding a cluster attention mechanism to the pair data creation process. The evaluation results on two different datasets proved our hypothesis on the proposed enhancement in the face validation domain. Future research will focus on: • applying the CFVSiam network to real-world applications for further evaluation;
• generalising the clustering-based attention algorithm to other neural networks and application domains.

Conflicts of Interest
The author declares that he has no conflicts of interest regarding this work.