Bin-Bin Gao (高斌斌)
Ph.D. candidate
National Key Laboratory for Novel Software Technology
Department of Computer Science & Technology
Nanjing University
Supervisor: Prof. Jianxin Wu

Nanjing 210023, China
Email: or


I will join Tencent YouTu Lab in July. I received my Ph.D. from Nanjing University, in the LAMDA Group led by Prof. Zhi-Hua Zhou, advised by Prof. Jianxin Wu. Before that, I got my B.S. degree in applied mathematics in 2010, and received my M.S. degree in 2013 from Southwest University, Chongqing, China. My research interests include computer vision and machine learning, especially visual recognition and deep learning. I serve as a reviewer (or PC member) for CVPR, ICCV, AAAI, ECCV, ACCV, NN, TNNLS, etc.


Publications [Google Scholar]

Age Estimation Using Expectation of Label Distribution Learning
Bin-Bin Gao, Hong-Yu Zhou, Jianxin Wu and Xin Geng
In: Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI 2018), Stockholm, Sweden, July 2018. (accepted.)
[ Paper ] [ Project Page ] [ Slides ] [ Poster ] [ Abstract ] Age estimation performance has been greatly improved by using convolutional neural networks. However, existing methods have an inconsistency between the training objectives and the evaluation metric, so they may be suboptimal. In addition, these methods always adopt image classification or face recognition models with a large number of parameters, which bring expensive computation cost and storage overhead. To alleviate these issues, we design a light network architecture and propose a unified framework which can jointly learn age distribution and regress age. The effectiveness of our approach has been demonstrated on apparent and real age estimation tasks. Our method achieves new state-of-the-art results using a single model with 36× fewer parameters and a 2.6× reduction in inference time. Moreover, our method can achieve results comparable to the state-of-the-art even when the model parameters are further reduced to 0.9M (3.8MB disk storage). We also show through analysis that ranking methods implicitly learn label distributions. [ BibTeX ]
           title={Age Estimation Using Expectation of Label Distribution Learning},
           author={Gao, Bin-Bin and Zhou, Hong-Yu and Wu, Jianxin and Geng, Xin},
           booktitle={Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI 2018)},

Resource Constrained Deep Learning: Challenges and Practices (in Chinese)
Jianxin Wu, Bin-Bin Gao, Xiu-Shen Wei and Jian-Hao Luo
SCIENTIA SINICA Informatics, 48(5):501-510, 2018.
[ Paper ] [ Abstract ] Deep learning has made significant progress in recent years. However, deep models require a lot of computation-related resources, and their learning process needs a huge number of data points and their labels. Hence, one current research focus in deep learning is to reduce its resource consumption, i.e., resource constrained deep learning. In this paper, we first analyze deep learning’s thirst for various types of resources and the challenges it leads to, then briefly introduce research progress from three aspects: data, label and computation resources. We then give detailed introductions to these areas using our research results as examples. [ BibTeX ]
           title={Resource constrained deep learning: Challenges and practices},
           author={Wu, Jianxin and Gao, Bin-Bin and Wei, Xiu-Shen and Luo, Jian-Hao},
           journal={SCIENTIA SINICA Informatics},

Adaptive Feeding: Achieving Fast and Accurate Detections by Adaptively Combining Object Detectors
Hong-Yu Zhou, Bin-Bin Gao and Jianxin Wu
In: Proceedings of the IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, October 2017, pp. 3505-3513.
[ Paper ] [ Project Page ] [ Abstract ] Object detection aims at high speed and accuracy simultaneously. However, fast models are usually less accurate, while accurate models cannot satisfy our need for speed. A fast model can be 10 times faster but 50% less accurate than an accurate model. In this paper, we propose Adaptive Feeding (AF) to combine a fast (but less accurate) detector and an accurate (but slow) detector, by adaptively determining whether an image is easy or hard and choosing an appropriate detector for it. In practice, we build a cascade of detectors, including the AF classifier which makes the easy vs. hard decision and the two detectors. The AF classifier can be tuned to obtain different tradeoffs between speed and accuracy; it has negligible training time and requires no additional training data. Experimental results on the PASCAL VOC, MS COCO and Caltech Pedestrian datasets confirm that AF can achieve speed comparable to the fast detector and accuracy comparable to the accurate one at the same time. As an example, by combining the fast SSD300 with the accurate SSD500 detector, AF leads to a 50% speedup over SSD500 with the same precision on the VOC2007 test set. [ BibTeX ]
           title={Adaptive Feeding: Achieving Fast and Accurate Detections by Adaptively Combining Object Detectors},
           author={Zhou, Hong-Yu and Gao, Bin-Bin and Wu, Jianxin},
           booktitle={Proceedings of the IEEE International Conference on Computer Vision (ICCV 2017)},
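
The routing idea described in the Adaptive Feeding abstract above can be sketched roughly as follows. This is only an illustration and not the code released with the paper: the `difficulty` score, the threshold-based `make_af_classifier`, and the toy detectors are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from typing import Callable, Dict, List, Sequence

Image = Dict[str, float]   # toy stand-in: an "image" is a dict with a difficulty score
Detections = List[str]     # toy stand-in: detections are plain label strings


def make_af_classifier(threshold: float) -> Callable[[Image], bool]:
    """Return an easy-vs-hard decision function (hypothetical: thresholds a score).

    Moving the threshold trades speed for accuracy: a higher threshold sends
    more images to the fast detector.
    """
    def is_easy(image: Image) -> bool:
        return image["difficulty"] < threshold
    return is_easy


def adaptive_feeding(
    images: Sequence[Image],
    is_easy: Callable[[Image], bool],
    fast_detector: Callable[[Image], Detections],
    accurate_detector: Callable[[Image], Detections],
) -> List[Detections]:
    """Feed each image to the fast detector if it is judged easy, else to the accurate one."""
    results = []
    for image in images:
        detector = fast_detector if is_easy(image) else accurate_detector
        results.append(detector(image))
    return results


# Toy detectors that only report which branch handled the image.
def fast_detector(image: Image) -> Detections:
    return ["fast"]


def accurate_detector(image: Image) -> Detections:
    return ["accurate"]


images = [{"difficulty": 0.1}, {"difficulty": 0.9}, {"difficulty": 0.4}]
routed = adaptive_feeding(images, make_af_classifier(0.5), fast_detector, accurate_detector)
print(routed)
```

In this sketch only the second image is treated as hard, so it alone pays the cost of the slower detector.
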
Sunrise or Sunset: Selective Comparison Learning for Subtle Attribute Recognition
Hong-Yu Zhou, Bin-Bin Gao and Jianxin Wu
In: Proceedings of the 28th British Machine Vision Conference (BMVC 2017), London, UK, September 2017.
[ Paper ] [ Project Page ] [ Abstract ] The difficulty of image recognition has gradually increased from general category recognition to fine-grained recognition and to the recognition of some subtle attributes such as temperature and geolocation. In this paper, we focus on the classification between sunrise and sunset and hope to give a hint about how to tell the difference in subtle attributes. Sunrise vs. sunset is a difficult recognition task, which is challenging even for humans. Towards understanding this new problem, we first collect a new dataset made up of over one hundred webcams from different places. Since existing algorithmic methods have poor accuracy, we propose a new pairwise learning strategy to learn features from selective pairs of images. Experiments show that our approach surpasses baseline methods by a large margin and achieves better results even compared with humans. We also apply our approach to existing subtle attribute recognition problems, such as temperature estimation, and achieve state-of-the-art results. [ BibTeX ]
            title={Sunrise or Sunset: Selective Comparison Learning for Subtle Attribute Recognition},
            author={Zhou, Hong-Yu and Gao, Bin-Bin and Wu, Jianxin},
            booktitle={Proceedings of the 28th British Machine Vision Conference (BMVC 2017)},
Deep Label Distribution Learning with Label Ambiguity
Bin-Bin Gao, Chao Xing, Chen-Wei Xie, Jianxin Wu and Xin Geng
IEEE Transactions on Image Processing (TIP 2017), 26(6):2825-2838, 2017.
[ Paper ] [ Project Page ] [ Abstract ] Convolutional Neural Networks (ConvNets) have achieved excellent recognition performance in various visual recognition tasks. A large labeled training set is one of the most important factors for their success. However, it is difficult to collect sufficient training images with precise labels in some domains such as apparent age estimation, head pose estimation, multi-label classification and semantic segmentation. Fortunately, there is ambiguous information among labels, which makes these tasks different from traditional classification. Based on this observation, we convert the label of each image into a discrete label distribution, and learn the label distribution by minimizing a Kullback-Leibler divergence between the predicted and ground-truth label distributions using deep ConvNets. The proposed DLDL (Deep Label Distribution Learning) method effectively utilizes the label ambiguity in both feature learning and classifier learning, which prevents the network from over-fitting even when the training set is small. Experimental results show that the proposed approach produces significantly better results than state-of-the-art methods for age estimation and head pose estimation. At the same time, it also improves recognition performance for multi-label classification and semantic segmentation tasks. [ BibTeX ]
         author={Gao, Bin-Bin and Xing, Chao and Xie, Chen-Wei and Wu, Jianxin and Geng, Xin},
         title={Deep Label Distribution Learning with Label Ambiguity},
         journal={IEEE Transactions on Image Processing},
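
As a rough illustration of the training objective described in the DLDL abstract above (not the code released with the paper), the sketch below turns a single age label into a discrete label distribution and measures the Kullback-Leibler divergence to a predicted distribution. The Gaussian shape, the `sigma` value, the 0-100 age range, and the stand-in logits are assumptions made for this example only.

```python
import numpy as np


def label_distribution(label: float, bins: np.ndarray, sigma: float = 2.0) -> np.ndarray:
    """Discrete label distribution: a normalized Gaussian centered on `label` (assumed form)."""
    density = np.exp(-0.5 * ((bins - label) / sigma) ** 2)
    return density / density.sum()


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array of logits."""
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


def kl_divergence(target: np.ndarray, predicted: np.ndarray, eps: float = 1e-12) -> float:
    """KL(target || predicted) for two discrete distributions, clipped to avoid log(0)."""
    target = np.clip(target, eps, 1.0)
    predicted = np.clip(predicted, eps, 1.0)
    return float(np.sum(target * (np.log(target) - np.log(predicted))))


bins = np.arange(0, 101, dtype=float)          # candidate ages 0..100 (assumed range)
target = label_distribution(30.0, bins)        # ground-truth label distribution for age 30
logits = -0.5 * ((bins - 33.0) / 3.0) ** 2     # stand-in for the network's raw outputs
predicted = softmax(logits)
loss = kl_divergence(target, predicted)        # quantity a DLDL-style model would minimize
print(f"KL loss: {loss:.4f}")
```

Spreading probability mass over neighboring ages is what lets nearby labels contribute to training, which is the label ambiguity the abstract refers to.
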
Exploit Bounding Box Annotations for Multi-label Object Recognition
Hao Yang, Joey Tianyi Zhou, Yu Zhang, Bin-Bin Gao, Jianxin Wu and Jianfei Cai
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, June 2016, pp. 280-288.
[ Paper ] [ Abstract ] Convolutional neural networks (CNNs) have shown great performance as general feature representations for object recognition applications. However, for multi-label images that contain multiple objects from different categories, scales and locations, global CNN features are not optimal. In this paper, we incorporate local information to enhance the feature discriminative power. In particular, we first extract object proposals from each image. With each image treated as a bag and object proposals extracted from it treated as instances, we transform the multi-label recognition problem into a multi-class multi-instance learning problem. Then, in addition to extracting the typical CNN feature representation from each proposal, we propose to make use of ground-truth bounding box annotations (strong labels) to add another level of local information by using nearest-neighbor relationships of local regions to form a multi-view pipeline. The proposed multi-view multi-instance framework utilizes both weak and strong labels effectively, and more importantly it has the generalization ability to even boost the performance of unseen categories by partial strong labels from other categories. Our framework is extensively compared with state-of-the-art handcrafted feature based methods and CNN based methods on two multi-label benchmark datasets. The experimental results validate the discriminative power and the generalization ability of the proposed framework. With strong labels, our framework is able to achieve state-of-the-art results on both datasets. [ BibTeX ]
  title={Exploit bounding box annotations for multi-label object recognition},
  author={Yang, Hao and Zhou, Joey Tianyi and Zhang, Yu and Gao, Bin-Bin and Wu, Jianxin and Cai, Jianfei},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
Representing Sets of Instances for Visual Recognition
Jianxin Wu, Bin-Bin Gao and Guoqing Liu
In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI 2016), Phoenix, Arizona, USA, Feb 2016, pp. 2237-2243.
[ Paper ] [ Abstract ] In computer vision, a complex entity such as an image or video is often represented as a set of instance vectors, which are extracted from different parts of that entity. Thus, it is essential to design a representation to encode information in a set of instances robustly. Existing methods such as FV and VLAD are designed based on a generative perspective, and their performances fluctuate when different types of instance vectors are used (i.e., they are not robust). The proposed D3 method effectively compares two sets as two distributions, and proposes a directional total variation distance (DTVD) to measure their dissimilarity. Furthermore, a robust classifier-based method is proposed to estimate DTVD robustly, and to efficiently represent these sets. D3 is evaluated in action and image recognition tasks. It achieves excellent robustness, accuracy and speed. [ BibTeX ]
  title={Representing Sets of Instances for Visual Recognition},
  author={Wu, Jianxin and Gao, Bin-Bin and Liu, Guoqing},
  booktitle={Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence},
Deep Label Distribution Learning for Apparent Age Estimation
Xu Yang, Bin-Bin Gao, Chao Xing, Zeng-Wei Huo, Xiu-Shen Wei, Ying Zhou, Jianxin Wu and Xin Geng
In: Proceedings of the IEEE ICCV’15 ChaLearn Looking at People workshop (ICCVW 2015), Santiago, Chile, Dec 2015, pp. 102-108.
[ Paper ] [ Slides ] [ Abstract ] In the age estimation competition organized by ChaLearn, apparent ages of images are provided. Uncertainty of each apparent age is induced because each image is labeled by multiple individuals. Such uncertainty makes this age estimation task different from common chronological age estimation tasks. In this paper, we propose a method using deep CNN (Convolutional Neural Network) with distribution-based loss functions. Using distributions as the training targets can exploit the uncertainty induced by manual labeling to learn a better model than using ages as the target. To the best of our knowledge, this is one of the first attempts to use the distribution as the target of deep learning. In our method, two kinds of deep CNN models are built with different architectures. After pre-training each deep CNN model with different datasets as one corresponding stream, the competition dataset is then used to fine-tune both deep CNN models. Moreover, we fuse the results of two streams as the final predicted ages. On the final testing dataset provided by the competition, the age estimation performance of our method is 0.3057, which is significantly better than the human-level performance (0.34) provided by the competition organizers. [ BibTeX ]
  title={Deep label distribution learning for apparent age estimation},
  author={Yang, Xu and Gao, Bin-Bin and Xing, Chao and Huo, Zeng-Wei and Wei, Xiu-Shen and Zhou, Ying and Wu, Jianxin and Geng, Xin},
  booktitle={Proceedings of the IEEE International Conference on Computer Vision Workshops},

Deep Spatial Pyramid Ensemble for Cultural Event Recognition
Xiu-Shen Wei, Bin-Bin Gao and Jianxin Wu
In: Proceedings of the IEEE ICCV’15 ChaLearn Looking at People workshop (ICCVW 2015), Santiago, Chile, Dec 2015, pp. 38-44.
[ Paper ] [ Slides ] [ Abstract ] Semantic event recognition based only on image-based cues is a challenging problem in computer vision. In order to capture rich information and exploit important cues like human poses, human garments and scene categories, we propose the Deep Spatial Pyramid Ensemble framework, which is mainly based on our previous work, i.e., Deep Spatial Pyramid (DSP). DSP can build universal and powerful image representations from CNN models. Specifically, we employ five deep networks trained on different data sources to extract five corresponding DSP representations for event recognition images. To combine the complementary information from different DSP representations, we ensemble these features by both “early fusion” and “late fusion”. Finally, based on the proposed framework, we come up with a solution for the track of the Cultural Event Recognition competition at the ChaLearn Looking at People (LAP) challenge in association with ICCV 2015. Our framework achieved one of the best cultural event recognition performances in this challenge. [ BibTeX ]
  title={Deep spatial pyramid ensemble for cultural event recognition},
  author={Wei, Xiu-Shen and Gao, Bin-Bin and Wu, Jianxin},
  booktitle={Proceedings of the IEEE International Conference on Computer Vision Workshops},

Coordinate Descent Fuzzy Twin Support Vector Machine for Classification
Bin-Bin Gao, Jian-Jun Wang, Yao Wang and Chan-Yun Yang
In: Proceedings of the IEEE Conference on Machine Learning and Applications (ICMLA 2015), Miami, Florida, USA, Dec 2015, pp. 7-12.
[ Paper ] [ Code ] [ Abstract ] In this paper, we develop a novel coordinate descent fuzzy twin SVM (CDFTSVM) for classification. The proposed CDFTSVM not only inherits the advantages of twin SVM but also leads to rapid and robust classification results. Specifically, our CDFTSVM has two distinguishing advantages: (1) An effective fuzzy membership function is produced for removing the noise incurred by contaminated inputs. (2) A coordinate descent strategy with shrinking by active set is used to deal with the computational complexity brought by high dimensional input. In addition, a series of simulation experiments are conducted to verify the performance of the CDFTSVM, which further supports our previous claims. [ BibTeX ]
  title={Coordinate Descent Fuzzy Twin Support Vector Machine for Classification},
  author={Gao, Bin-Bin and Wang, Jian-Jun and Wang, Yao and Yang, Chan-Yun},
  booktitle={Proceedings of the IEEE Conference on Machine Learning and Applications (ICMLA)},

Technical Reports

Deep Spatial Pyramid: The Devil is Once Again in the Details
Bin-Bin Gao, Xiu-Shen Wei, Jianxin Wu, Weiyao Lin
arXiv:1504.05277v2, 2015.
[ Paper ] [ Code ] [ Abstract ] In this paper we show that by carefully making good choices for various detailed but important factors in a visual recognition framework using deep learning features, one can achieve a simple, efficient, yet highly accurate image classification system. We first list 5 important factors, based on both existing research and ideas proposed in this paper. These important detailed factors include: 1) ℓ2 matrix normalization is more effective than unnormalized or ℓ2 vector normalization, 2) the proposed natural deep spatial pyramid is very effective, and 3) a very small K in Fisher Vectors surprisingly achieves higher accuracy than normally used large K values. Along with other choices (convolutional activations and multiple scales), the proposed DSP framework is not only intuitive and efficient, but also achieves excellent classification accuracy on many benchmark datasets. For example, DSP’s accuracy on SUN397 is 59.78%, significantly higher than previous state-of-the-art (53.86%). [ BibTeX ]
  author    = {Gao, Bin-Bin and Wei, Xiu-Shen and Wu, Jianxin and Lin, Weiyao},
  title     = {Deep Spatial Pyramid: The Devil is Once Again in the Details},
  journal   = {CoRR},
  volume    = {abs/1504.05277},
  year      = {2015},
  url       = {},


  • Nanruijibao Scholarship in Nanjing University, 2016.
  • Second-class Academic Scholarship of Nanjing University, 2014-2015 & 2015-2016.
  • Outstanding Thesis Award of Southwest University, 2013.
  • First-class Academic Scholarship of Southwest University, 2011-2012.
  • Outstanding Undergraduates Awards, 2010.
  • National Scholarship for Encouragement, 2007-2008 & 2008-2009.


  • First runner-up in Cultural Event Recognition at ICCV 2015. (with Xiu-Shen Wei and Jianxin Wu)
  • Fourth place in Apparent Age Estimation at ICCV 2015.
  • Meritorious Winner of Certificate Authority Cup Mathematical Contest in Modeling, 2012. (with Qiu-Lin Li and Hong-Yan Yang)
  • Second Prize in China Graduate Mathematical Contest in Modeling (CGMCM), 2011. (with Qiu-Lin Li and Ji-Lian Guo)
  • Third Prize in China Undergraduate Mathematical Contest (Mathematics, Finals) (CMC), 2010.
  • First Prize in China Undergraduate Mathematical Contest (Mathematics, Preliminaries) (CMC), 2009.

Teaching Assistants

Professional Activities



    Bin-Bin Gao

    National Key Laboratory for Novel Software Technology

    Nanjing University

    Nanjing 210023, China

    Laboratory: 913, Computer Science Building, Xianlin Campus of Nanjing University

    LAMDA homepage

    GitHub homepage

Updated on May 21, 2018.