Recent advances in deep neural networks have helped to solve many challenges in computer vision, natural language processing and artificial intelligence. With the advances of deep models, understanding the high-level and fine-grained semantics of visual contents becomes possible and urgent. It includes but not limited to the tasks of object detection, semantic and instance segmentation, and scene graph generation. Based on the results of fine-grained visual understanding, we can further explore higher-level visual reasoning, which still remains uncertain how to effectively and appropriately formulate in the deep neural networks. The progress of fine-grained visual understanding and reasoning would significantly promote a great number of downstream tasks that require visual content understanding, e.g., visual question answering (VQA) and visual dialog.