@sambodhi 2018-08-03T06:52:48.000000Z

The Devil of Face Recognition is in the Noise

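1. We provide cleaned subsets of the mainstream face datasets, namely MegaFace and MS-Celeb-1M, and build a new large-scale noise-controlled dataset, IMDb-Face.
2. Using the original datasets and the cleaned subsets, we analyze the label noise properties of MegaFace and MS-Celeb-1M. We show that more samples are needed from the noisy data to reach the same accuracy as that obtained with the cleaned subsets.
3. We study how different types of noise (i.e., label flips and outliers) relate to face recognition accuracy.
4. We investigate ways to improve data cleanliness, including a comprehensive user study on how data labeling strategies affect annotation accuracy.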

1 Introduction




The second goal of our study is to build a clean face recognition dataset for the community. The dataset could help train better models and facilitate further understanding of the relationship between noise and face recognition performance. To this end, we build a clean dataset called IMDb-Face. The dataset consists of 1.7M images of 59K celebrities collected from movie screenshots and posters from the IMDb website[1]. Due to the nature of the data source, the images exhibit large variations in scale, pose, lighting, and occlusion. We carefully clean the dataset and simulate corruption by injecting noise into the training labels. The experiments show that the accuracy of face recognition decreases rapidly and nonlinearly as label noise increases. In particular, we confirm the common belief that the performance of face recognition is more sensitive to label flips (an example has erroneously been given the label of another class within the dataset) than to outliers (an image does not belong to any of the classes under consideration, but mistakenly has one of their labels). We also conduct an experiment to analyze the reliability of different ways of annotating a face recognition dataset. We found that label accuracy correlates with time spent on annotation. The study helps us to find the source of erroneous labels and thereafter design better strategies to balance annotation cost and accuracy.



2 How Noisy is Existing Data?


2.1 Face Recognition Datasets

Table 2.1 provides a summary of representative datasets used in face recognition research.

LFW: Labeled Faces in the Wild (LFW) [7] is perhaps the most popular dataset to date for benchmarking face recognition approaches. The database consists of 13,000 facial images of 1,680 celebrities. Images are collected from Yahoo News by running the Viola-Jones face detector. Limited by the detector, most of the faces in LFW are frontal. The dataset is considered sufficiently clean, although some incorrectly labeled matched pairs have been reported. Errata of LFW are provided at http://vis-www.cs.umass.edu/lfw/.

CelebFaces: CelebFaces [19,20] is one of the early face recognition training databases made publicly available. Its first version contains 5,436 celebrities and 87,628 images, and it was upgraded to 10,177 identities and 202,599 images a year later. Images in CelebFaces were collected from search engines and manually cleaned by workers.


VGG-Face: VGG-Face [15] contains 2,622 identities and 2.6M photos. More than 2,000 images per celebrity were downloaded from search engines. The authors treat the top 50 images as positive samples and train a linear SVM to select the top 1,000 faces. To avoid extensive manual annotation, the dataset was ‘block-wise’ verified, i.e., ranked images of each identity are displayed in blocks and annotators are asked to validate blocks as a whole. In this study we did not focus on VGG-Face [15] since it should have the same ‘search-engine bias’ problem as MS-Celeb-1M [5].

CASIA-WebFace: The images in CASIA-WebFace [25] were collected from the IMDb website. The dataset contains 500K photos of 10K celebrities and it is semi-automatically cleaned via tag-constrained similarity clustering. The authors start with each celebrity’s main photo and those photos that contain only one face. Then faces are gradually added to the dataset, constrained by feature similarity and name tag. CASIA-WebFace uses the same source as the proposed IMDb-Face dataset. However, limited by the feature and clustering steps, CASIA-WebFace may fail to recall many challenging faces.

MS-Celeb-1M: MS-Celeb-1M [5] contains 100K celebrities who are selected from the 1M celebrity list according to their popularity. Public search engines are then leveraged to provide approximately 100 images for each celebrity, resulting in about 10M web images. The data is deliberately left uncleaned for several reasons. Specifically, cleaning a dataset of this scale requires tremendous effort. Perhaps more importantly, leaving the data in this form encourages researchers to devise new learning methods that can naturally deal with the inherent noise.

MegaFace: Kemelmacher-Shlizerman et al. [13] clean a massive number of images published on Flickr by proposing algorithms to cluster and filter face data from the YFCC100M dataset. For each user’s albums, the authors merge face pairs with a distance closer than β times the average distance. Clusters that contain more than three faces are kept. Then they drop ‘garbage’ groups and clean potential outliers in each group. A total of 672K identities and 4.7M images were collected. MegaFace2 avoids the ‘search-engine’ bias found in VGG-Face [15] and MS-Celeb-1M [5]. However, we found that this cluster-based approach introduces a new bias. MegaFace prefers small groups with highly duplicated images, e.g., faces captured from the same video. Limited by the base model used for clustering, a considerable number of groups in MegaFace contain noise, or sometimes mix up multiple people in the same group.

2.2 An Approximation of Signal-to-Noise Ratio

Owing to the source of data and cleaning strategies, existing large-scale datasets invariably contain label noise. In this study, we aim to profile the noise distribution in existing datasets. Our analysis may provide a hint to future research on how one should exploit the distribution of these data.

It is infeasible to obtain the exact amount of noise due to the scale of the datasets. We bypass this difficulty by randomly selecting a subset of a dataset and manually categorizing its images into three groups: ‘correct identity assigned’, ‘doubtful’, and ‘wrong identity assigned’. We select a subset of 2.7M images from MegaFace [13] and 3.7M images from MS-Celeb-1M [5]. For CASIA-WebFace [25] and CelebFaces [19,20], we sampled 30 identities to estimate their signal-to-noise ratio. The final statistics are visualized in Figure 2(a). Due to the difficulty in estimating the exact ratio, we approximate an upper and a lower bound of noisy data during the estimation. The lower bound is more optimistic, treating doubtful labels as clean data. The upper bound is more pessimistic, treating all doubtful cases as badly labeled. We provide more details on the estimations in the supplementary material. As observed in Figure 2(a), the noise percentage increases dramatically with the scale of data. This is not surprising given the difficulty in data annotation. It is noteworthy that the proposed IMDb-Face pushes the envelope of large-scale data with a very high signal-to-noise ratio (noise is under 10% of the full data).
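The two bounds described above can be sketched in a few lines. This is only an illustration under assumptions: the label strings, the function name `noise_bounds`, and the sample counts are hypothetical and do not come from the study.

```python
from collections import Counter

def noise_bounds(annotations):
    """Return (lower, upper) noise fractions for a list of manual labels.

    Each label is 'correct', 'doubtful', or 'wrong' (hypothetical names).
    Lower bound: doubtful samples are counted as clean.
    Upper bound: doubtful samples are counted as noise.
    """
    counts = Counter(annotations)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no annotations given")
    lower = counts["wrong"] / total
    upper = (counts["wrong"] + counts["doubtful"]) / total
    return lower, upper

# Made-up sample: 70 correct, 10 doubtful, 20 wrong.
sample = ["correct"] * 70 + ["doubtful"] * 10 + ["wrong"] * 20
lower, upper = noise_bounds(sample)
print(f"noise between {lower:.0%} and {upper:.0%}")  # noise between 20% and 30%
```

The gap between the two numbers is exactly the share of doubtful samples, which is why the bounds tighten when annotators are more certain.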

We further investigate the noise distribution of the two largest public datasets to date, MS-Celeb-1M [5] and MegaFace [13]. We first categorize identities in a dataset based on their number of images. A total of six groups/bins are established. We then plot a histogram showing the signal-to-noise ratio of each bin along the noise lower and upper bounds. As can be seen in Figure 2(b,c), both datasets exhibit a long-tailed distribution, i.e., most identities have very few images. This phenomenon is especially obvious on the MegaFace [13] dataset: since it uses automatically formed clusters to determine identities, the same identity may be distributed across different clusters. MegaFace [13] contains less noise than MS-Celeb-1M [5] across all groups. However, we found that many images in the clean portion of MegaFace [13] are duplicated images. In Sec. 4.2, we will perform experiments on the MegaFace and MS-Celeb-1M datasets to quantify the effect of noise on the face recognition task.
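A rough sketch of the per-bin analysis follows. The bin edges are an assumption made for illustration (the text only states that six bins are used), and the label names and `per_bin_noise` helper are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical bin edges on images-per-identity; they yield six bins:
# <10, 10-19, 20-49, 50-99, 100-199, >=200.
BIN_EDGES = [10, 20, 50, 100, 200]

def bin_index(num_images):
    """Map an identity's image count to one of the six bins (0..5)."""
    return bisect_right(BIN_EDGES, num_images)

def per_bin_noise(identities):
    """identities: dict name -> list of labels ('correct'/'doubtful'/'wrong').

    Returns dict bin -> (lower, upper) noise fraction over all images in the bin.
    """
    counts = defaultdict(lambda: {"correct": 0, "doubtful": 0, "wrong": 0})
    for labels in identities.values():
        b = bin_index(len(labels))
        for label in labels:
            counts[b][label] += 1
    result = {}
    for b, c in sorted(counts.items()):
        total = sum(c.values())
        result[b] = (c["wrong"] / total, (c["wrong"] + c["doubtful"]) / total)
    return result

# Made-up identities: one with 5 images, one with 12 images.
toy = {
    "id_a": ["correct"] * 4 + ["wrong"],
    "id_b": ["correct"] * 8 + ["doubtful"] * 2 + ["wrong"] * 2,
}
print(per_bin_noise(toy))
```

Counting how many identities fall into each bin with the same `bin_index` would expose the long tail mentioned above.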

3 Building a Noise-Controlled Face Dataset

As shown in the previous section, face recognition datasets beyond the million scale typically have a noise ratio higher than 30%. How about building a large-scale noise-controlled face dataset? It can be used to train better face recognition algorithms. More importantly, it can be used to further understand the relationship between noise and face recognition performance. To this end, we seek not only a cleaner and more diverse source to collect face data, but also an effective way to label the data.

3.1 Celebrity Faces from IMDb

Search engines are important sources from which we can quickly construct a large-scale dataset. The widely used ImageNet [3] was built by querying images from Google Image. Most of the face recognition datasets were built in the same way (except MegaFace [13]). While querying from search engines offers the convenience of data collection, it also introduces data bias. Search engines usually operate in a high-precision regime [2]. Observing the queried images in Figure 3, they tend to have a simple background with sufficient illumination, and the subjects are often in a near frontal posture. These data, to a certain extent, are more restricted than those we could observe in reality, e.g., faces in videos (IJB-A [9] and YTF [24]) and selfie photos (millions of distractors in MegaFace). Another pitfall in crawling images from search engines is the low recall rate. We performed a simple analysis and found that on average the recall rate is only 40% for the first 200 photos we query for a particular name.


In this study, we turn our data collection source to the IMDb website. IMDb is more structured. It includes a diverse range of photos under each celebrity’s profile, including official photos, lifestyle photos, and movie snapshots. Movie snapshots, we believe, provide essential data samples for training a robust face recognition model. Those screenshots are rarely returned by querying a search engine. In addition, the recall rate when we query a name on IMDb is 90% on average, much higher than the 40% from search engines. The IMDb website lists about 300K celebrities who have official and gallery photos. By crawling the IMDb website, we finally collected and cleaned 1.7M raw images of 59K celebrities.

3.2 Data Distribution

Figure 4-a presents the distribution of yaw angle in our dataset compared with MS-Celeb-1M and MegaFace. Figures 4-c, -d and -e present the age, gender and race distributions. As can be observed, images in IMDb-Face exhibit larger pose variations, and they also show diversity in age, gender and race.

3.3 How Well Can Humans Label Identity?

The data downloaded from IMDb are noisy, as multiple celebrities may co-exist in the same image. We still need to clean the dataset before it can be used for training. We take this opportunity to study how human annotators would clean face data. The study will help us to identify the source of noise during annotation and design a better data cleaning strategy for the full dataset.

For the purpose of the user study, we extract a small subset of 30 identities from the IMDb raw data. For each identity, we carefully select three images with confirmed identity to serve as gallery images. The remaining images of these 30 identities are treated as query images. To make the user study more challenging and statistically more meaningful, we inject 20% outliers into the query set. Next, we prepare three annotation schemes as follows. The interface of each scheme is depicted in Figure 5.

Scheme I - Draw the box: We present the target person to a volunteer by showing the three gallery faces. We then show a query image selected from the query set. The image may contain multiple persons. If the target appears in the query image, the volunteer is asked to draw a bounding box on the target. The volunteer can either confirm the selection or assign a ‘doubt’ flag on the box if he/she is not confident about the choice. ‘No target’ is selected when he/she cannot find the target person.

Scheme II - Choose 1 in 3: Similar to Scheme I, we present the target person to a volunteer by showing the gallery images. We then randomly sample three faces detected from the query set, from which the volunteer will select a single image as the target face. We ensure that all query faces have the same gender as the target person. Again, the volunteer can choose a ‘doubt’ flag if he/she is not confident about the selection or choose ‘no target’ at all.

Scheme III - Yes or No: A binary query is perhaps the most natural and popular way to clean a face recognition set. We first rank all faces based on their similarity to the probe faces in the gallery, and then ask a volunteer to decide whether each one belongs to the target person. The volunteer is allowed to answer ‘doubt’.

Which scheme to choose?: Before we can quantify the effectiveness of different schemes, we first need to generate the ground truth of these 30 identities. We use a ‘consensus’ approach. Specifically, each of the aforementioned schemes was conducted on three different volunteers. We ensure that each query face was annotated nine times across the three schemes. If four of the annotations consistently point to the same identity, we assign the query face to the targeted identity. With this ground truth, we can measure the effectiveness of each annotation scheme.
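The consensus rule can be sketched as a simple vote count. This is a minimal illustration, assuming each of the nine annotations is reduced to a 'yes', 'no', or 'doubt' vote on whether the query face is the target, and reading "four" as "at least four"; the vote encoding and function name are hypothetical.

```python
def consensus_ground_truth(annotations, min_votes=4):
    """annotations: dict query_face_id -> list of votes ('yes', 'no', 'doubt').

    A query face is assigned to the target identity when at least
    `min_votes` annotations agree that it shows the target.
    """
    return {
        face: sum(vote == "yes" for vote in votes) >= min_votes
        for face, votes in annotations.items()
    }

# Made-up votes: nine annotations per face (three schemes x three volunteers).
votes = {
    "face_001": ["yes"] * 7 + ["doubt", "no"],
    "face_002": ["yes"] * 3 + ["doubt"] * 2 + ["no"] * 4,
}
print(consensus_ground_truth(votes))  # {'face_001': True, 'face_002': False}
```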

Figure 6 shows the receiver operating characteristic (ROC) curve of each of the three schemes[2]. Scheme I achieves the highest F1 score. It recalls more than 90% of the faces with under 10% false positive samples. Finding a face and drawing a box seems to make annotators more focused on finding the right face. Scheme II provides a high true positive rate when the false positive rate is low. The existence of distractors forces annotators to work harder to match the faces. Scheme III yields the worst true positive rate when the false positive rate is low. This is not surprising since this task is much easier than Schemes I and II. The annotators tend to make mistakes given this relaxing task, especially after a prolonged annotation process. We observe an interesting phenomenon: the longer a volunteer spends on annotating a sample, the more accurate the annotation is. Working at full speed for one hour, each volunteer can draw 180-300 faces in Scheme I, finish around 600 selections in Scheme II, or answer over 1000 binary questions in Scheme III. We believe the most reliable way to clean a face recognition dataset is to leverage both Schemes I and II to achieve high precision and recall. Limited by our budget, we only conducted Scheme I to clean the IMDb-Face dataset.
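For reference, the quantities compared above (true positive rate, false positive rate, F1) can be computed from one annotator's accept/reject decisions and the consensus ground truth as sketched below. This gives a single operating point; how the curves in Figure 6 were swept is not described here, so the example makes no attempt to reproduce them. The function name and sample data are hypothetical.

```python
def annotation_rates(predicted, truth):
    """predicted, truth: equal-length lists of booleans
    (True = face accepted as / actually is the target person).

    Returns (true_positive_rate, false_positive_rate, f1).
    """
    tp = sum(1 for p, t in zip(predicted, truth) if p and t)
    fp = sum(1 for p, t in zip(predicted, truth) if p and not t)
    fn = sum(1 for p, t in zip(predicted, truth) if not p and t)
    tn = sum(1 for p, t in zip(predicted, truth) if not p and not t)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return tpr, fpr, f1

# Made-up decisions for six query faces.
predicted = [True, True, True, False, False, True]
truth = [True, True, False, False, True, True]
print(annotation_rates(predicted, truth))  # (0.75, 0.5, 0.75)
```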

During the cleaning of IMDb-Face, since multiple identities may co-exist in the same image, we first annotated gallery images to confirm the queried identity. The gallery images come from the official gallery provided by the IMDb website, and most of these official gallery images contain the true identity. We ask volunteers to look through the 10 gallery images back and forth and draw a bounding box around the face that occurs most frequently. Then, annotators label the rest of the queried images guided by the three largest labeled faces as galleries. For identities having fewer than three gallery images, the queried images may have too much noise. To save labor, we did not annotate their images.

It took 50 annotators one month to clean the IMDb-Face dataset. Finally, we obtained 1.7M clean facial images from 2M raw images. We believe that the cleaning is of high quality. We estimate the noise level of IMDb-Face as the product of the approximated noise level in the IMDb raw data (2.7 ± 4.5%) and the false positive rate (8.7%) of Scheme I. The noise level is controlled under 2%. The quality of IMDb-Face is validated in our experiments.
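The arithmetic behind this estimate can be checked quickly. The sketch below assumes the raw-data noise figure reads as a mean of 2.7% with a spread of 4.5%, and that adding the spread to the mean gives a pessimistic value; both readings are assumptions, as the text does not spell out how the interval is used.

```python
# Figures quoted in the text.
raw_noise_mean = 0.027       # approximated noise level in the IMDb raw data
raw_noise_spread = 0.045     # spread of that estimate (assumed interpretation)
false_positive_rate = 0.087  # false positive rate of Scheme I

# Noise that survives cleaning = raw noise that the annotators wrongly accept.
central = raw_noise_mean * false_positive_rate
pessimistic = (raw_noise_mean + raw_noise_spread) * false_positive_rate

print(f"central estimate:     {central:.2%}")
print(f"pessimistic estimate: {pessimistic:.2%}")
print("below 2%:", pessimistic < 0.02)
```

Under these assumptions even the pessimistic value stays well below the 2% figure stated above.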

4 Experiments

We divide our experiments into a few sections. First, we conduct ablation studies by simulating noise on our proposed dataset. The studies help us to observe the deterioration of performance in the presence of increasing noise, or when a fixed amount of clean data is diluted with noise. Second, we perform experiments on two existing datasets to further demonstrate the effect of noise. Third, we examine the effectiveness of our dataset by comparing it to other datasets under the same training condition. Finally, we compare the model trained on our dataset with other state-of-the-art methods. Next, we describe the experimental setting.

Evaluation Metric: We report rank-1 identification accuracy on the MegaFace benchmark [8]. It is a very challenging task to evaluate the performance of face recognition methods at the million scale of distractors. The MegaFace benchmark consists of one gallery set and one probe set. The gallery set contains more than 1 million images and the probe set consists of two existing datasets: Facescrub [14] and FGNet. We use Facescrub [14] as the MegaFace probe dataset in our experiments. Verification performance on MegaFace (reported as TPR at FPR = 10^-6) is included in the supplementary material due to the page limit. We also test on LFW [7] and YTF [24] in Section 4.4.

Architecture: To better examine the effect of noise, we use the same architecture in all experiments. After a comparison among ResNet-50, ResNet-101 and Attention-56 [22], we finally choose Attention-56, which achieves a good balance between computation and accuracy. As a reference, training on a database converges in 80 hours on an 8-GPU server with a batch size of 256. The output of Attention-56 is a 256-dimensional feature for each input image. We use cosine similarity to compute scores between image pairs.
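To make the scoring and the rank-1 metric concrete, here is a small sketch in plain Python. It is a simplification, not the MegaFace evaluation code: each probe is paired with a single gallery mate, the distractor set is shared, and the 3-dimensional toy vectors stand in for the 256-dimensional features. All names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank1_accuracy(pairs, distractors):
    """pairs: list of (probe_feature, mate_feature) from the same identity.
    distractors: features of people who are none of the probe identities.

    A probe is correct when its mate scores higher than every distractor.
    """
    correct = 0
    for probe, mate in pairs:
        mate_score = cosine_similarity(probe, mate)
        best_distractor = max(cosine_similarity(probe, d) for d in distractors)
        if mate_score > best_distractor:
            correct += 1
    return correct / len(pairs)

# Toy features (made up).
pairs = [
    ([1.0, 0.0, 0.0], [0.9, 0.1, 0.0]),  # mate is close to the probe
    ([0.0, 1.0, 0.0], [0.5, 0.5, 0.7]),  # mate is far: a distractor wins
]
distractors = [[0.0, 0.9, 0.1], [0.0, 0.0, 1.0]]
print(rank1_accuracy(pairs, distractors))  # 0.5
```

With a million distractors, a single high-scoring impostor is enough to make a probe fail, which is what makes the benchmark demanding.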

Pre-processing: We cropped and aligned faces, then rigidly transformed them onto a mean shape. Then we resized the cropped image to 224 × 256 and subtracted the mean value of each RGB channel.

Loss: We apply three losses: Softmax [20], Center Loss [23] and A-Softmax [12]. Our implementation is based on the public implementation of these losses:

Softmax: Softmax loss is the most commonly used loss, either for model initialization or for establishing a baseline.

Center Loss: Wen et al. [23] propose center loss, which minimizes the intra-class distance to enhance the features’ discriminative power. The authors jointly trained a CNN with the center loss and the softmax loss.

A-Softmax: Liu et al. [12] formulate A-Softmax to explicitly enforce an angular margin between different identities. The weight vector of each category is restricted to a hypersphere.


4.1 Investigating the Effect of Noise on IMDb-Face

The proposed IMDb-Face dataset enables us to investigate the effect of noise. There are two common types of noise in large-scale face recognition datasets: 1) label flips: an example has erroneously been given the label of another class within the dataset; 2) outliers: an image does not belong to any of the classes under consideration, but mistakenly has one of their labels. Sometimes even non-faces may be mistakenly included. To simulate the first type of noise, we randomly perturb faces into incorrect categories. For the second type, we randomly replace faces in IMDb-Face with images from MegaFace.
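The two kinds of simulated noise can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the function names, the uniform choice of the wrong class, and the string placeholders for images are all assumptions.

```python
import random

def inject_label_flips(labels, ratio, rng):
    """Give a fraction `ratio` of samples the label of a different class."""
    classes = sorted(set(labels))
    noisy = list(labels)
    n_flip = round(len(labels) * ratio)
    for i in rng.sample(range(len(labels)), n_flip):
        wrong_classes = [c for c in classes if c != labels[i]]
        noisy[i] = rng.choice(wrong_classes)
    return noisy

def inject_outliers(images, ratio, outlier_pool, rng):
    """Replace a fraction `ratio` of images with images of people outside
    the label set; the labels themselves are left untouched."""
    noisy = list(images)
    n_out = round(len(images) * ratio)
    positions = rng.sample(range(len(images)), n_out)
    replacements = rng.sample(outlier_pool, n_out)
    for i, replacement in zip(positions, replacements):
        noisy[i] = replacement
    return noisy

rng = random.Random(0)                       # fixed seed for repeatability
labels = [0] * 50 + [1] * 50                 # toy two-class label list
images = [f"imdb_{i}" for i in range(100)]   # placeholders for image files
pool = [f"mega_{i}" for i in range(30)]      # placeholders for outlier images

flipped = inject_label_flips(labels, 0.2, rng)
with_outliers = inject_outliers(images, 0.1, pool, rng)
print(sum(a != b for a, b in zip(labels, flipped)), "labels flipped")
print(sum(x.startswith("mega_") for x in with_outliers), "outliers injected")
```

Label flips damage two classes at once (the sample is missing from its true class and pollutes another), whereas an outlier only pollutes the class whose label it carries.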

Here we perform two experiments: 1) We gradually contaminate our dataset with different types of noise, increasing the noise in our dataset to 10%, 20% and 50%. 2) We fix the size of the clean data and ‘dilute’ it with label flips. We do not use ensemble models in these experiments.


A-Softmax, which achieves a better result on a clean dataset, becomes worse than Center Loss and Softmax in the high-noise region. 3) Outliers seem to have a less abrupt effect on the performance across all losses, matching the observations in [10] and [17].

The second experiment was inspired by a recent work from Rolnick et al. [17]. They found that if a dataset contains sufficient clean data, a deep learning model can still be properly trained on it when the data is diluted by a large amount of noise. They show that a model can still achieve a feasible accuracy on CIFAR-10, even when the ratio of noise to clean data is increased to 20 : 1. Can we transfer their conclusion to face recognition? Here we sample four subsets from IMDb-Face with 1E5, 2E5, 5E5 and 1E6 images, and we dilute them with an equal amount, double, and five times the amount of label flip noise. Figure 7(c) shows that a large performance gap still exists against the completely clean baseline, even though we maintain the same amount of clean data. We conjecture two reasons why cleanliness of data still plays a key role in face recognition: 1) The current dataset, even when clean, is still far from sufficient to address the challenging face recognition problem, thus noise matters. 2) Noise is more lethal on a 10,000-class problem than on a 10-class problem.
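One way to picture the dilution setup is sketched below. The mechanism is an assumption made for illustration: the clean subset is kept intact and extra samples are appended with deliberately wrong labels, at a chosen multiple of the clean set size. How the noise samples were actually drawn is not detailed in the text, and the names here are hypothetical.

```python
import random

def dilute_with_label_flips(clean, noise_pool, factor, rng):
    """clean: list of (image, label) pairs that stay untouched.
    noise_pool: list of (image, true_label) pairs not contained in `clean`.
    factor: noise-to-clean ratio, e.g. 1, 2 or 5.

    Returns the clean samples plus factor * len(clean) mislabeled samples.
    """
    classes = sorted({label for _, label in clean})
    n_noise = int(len(clean) * factor)
    extra = []
    for image, true_label in rng.sample(noise_pool, n_noise):
        wrong_label = rng.choice([c for c in classes if c != true_label])
        extra.append((image, wrong_label))
    return clean + extra

rng = random.Random(0)
clean = [(f"clean_{i}", i % 5) for i in range(20)]   # toy data, 5 classes
pool = [(f"noise_{i}", i % 5) for i in range(100)]   # toy noise candidates

diluted = dilute_with_label_flips(clean, pool, 2, rng)
print(len(clean), "clean samples,", len(diluted) - len(clean), "noisy samples")
```

The clean samples are the same in every diluted variant, so any drop in accuracy can be attributed to the added noise rather than to a loss of clean data.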

4.2 The Effect of Noise on MegaFace and MS-Celeb-1M

To further demonstrate the effect of noise, we perform experiments on two public datasets: MegaFace and MS-Celeb-1M. In order to quantify the effect of noise on the face recognition, we sampled subsets from the two datasets and manually cleaned them. This provides us with a noisy sampled subset and a clean subset for each dataset. For a fair comparison, the noisy subset was sampled to have the same distribution of image numbers to identities as the original dataset. Also, we control the scale of noisy subsets to make sure the scales for each clean subset are nearly the same. Because of the large size of the sampled subsets, we have chosen the third labeling method mentioned in Sec. 3.3, which is the fastest.

Three different losses, namely, Softmax, Center Loss and A-Softmax, are respectively applied to the original datasets, sampled subsets, and cleaned subsets. Table 2 summarizes the results on the MegaFace recognition challenge [8]. The effect of clean datasets is tremendous. Comparing the results between cleaned datasets and sampled datasets, the average improvement in accuracy is as large as 4.14%. The accuracies on clean subsets even surpass those on raw datasets, which are 4 times larger on average. The results suggest the effectiveness of reducing noise for large-scale datasets. As a matter of fact, the result of this experiment is part of our motivation to collect the IMDb-Face dataset.

It is worth pointing out that recent metric learning based methods such as A-Softmax [12] and Center Loss [23] also benefit from learning on clean datasets, although they already perform much better than Softmax [20]. As shown in Table 2, the improvements in accuracy on MegaFace using A-Softmax and Center Loss are over 5%. The results suggest that reducing dataset noise is still helpful, especially when metric learning is performed. Reducing noisy samples could help an algorithm focus more on learning hard examples, rather than picking up meaningless noise.

4.3 Comparing IMDb-Face with other Face Datasets

In the third experiment, we wish to show the competitiveness of IMDb-Face against several well-established face recognition training datasets including: 1) CelebFaces [19,20], 2) CASIA-WebFace [25], 3) MS-Celeb-1M(v1) [5], and 4) MegaFace [13]. The data size of the two latter datasets is a few times larger than the proposed IMDb-Face. Note that MS-Celeb-1M has a larger subset (v2), containing 900,000 identities. Limited by our computational resources, we did not conduct experiments on it. We do not use ensemble models in this experiment. Table 3 summarizes the results of using different datasets as the training source across three losses. We observed that the proposed noise-controlled IMDb-Face dataset is competitive as a training source despite its smaller size, validating the effectiveness of the IMDb data source and the cleanliness of IMDb-Face.

4.4 Comparisons with State-of-the-Arts

We are interested in comparing the performance of the model trained on IMDb-Face with the state of the art. Evaluation is conducted on MegaFace [8], LFW [7], and YTF [24] following the standard protocol. For LFW [7] we compute the equal error rate (EER). For YTF [24] we report recognition accuracy. To highlight the effect of training data, we do not adopt model ensembles. The comparative results are shown in Table 4. Our single model trained on IMDb-Face (A-Softmax^, IMDb-Face) achieves state-of-the-art performance on LFW, MegaFace, and YTF against published methods. It is noteworthy that the performance of our final model is also comparable to a few private methods on MegaFace.

5 Conclusion

Beyond existing efforts of developing sophisticated losses and CNN architectures, our study has investigated the problem of face recognition from the data perspective. Specifically, we developed an understanding of the source of label noise and its consequences. We also collected new large-scale data from the IMDb website, which is naturally a cleaner and wilder source than search engines. Through user studies, we have discovered an effective yet accurate way to clean our data. Extensive experiments have demonstrated that both data source and cleaning effectively improve the accuracy of face recognition. As a result of our study, we have presented a noise-controlled IMDb-Face dataset, and a state-of-the-art model trained on it. A clean dataset is important as the face recognition community has been looking for large-scale clean datasets for two practical reasons: 1) to better study the training performance of contemporary deep networks as a function of noise level in data. Without a clean dataset, one cannot induce controllable noise to support a systematic study. 2) to benchmark large-scale automatic data cleaning methods. Although one can use the final performance of a deep network as a yardstick, this measure can be affected by many uncontrollable factors, e.g., network hyperparameter settings. A clean and large-scale dataset enables unbiased analysis.

[2] We should emphasize that the curves in Figure 6 are different from actual human performance on verifying arbitrary face pairs. This is because in our study the faces from a query set are very likely to belong to the same person. The ROC thus represents human accuracy on ‘verifying face pairs that likely belong to the same identity’.
