Computer vision research has long been motivated by the dream of building an intelligent machine that can see like humans do: one that automatically analyzes and understands massive visual inputs. With the explosive growth of computing power, this dream has evolved into many exciting emerging applications, such as intelligent robots, autonomous vehicles, intelligent video surveillance, and computer-aided diagnosis. A core component of these applications is visual recognition (including object classification, detection, and localization). Recent advances in artificial neural networks (a.k.a. "deep learning") have significantly advanced the state of the art in visual recognition. However, because they do not model semantic structure or domain knowledge, most current deep learning approaches lack explicit mechanisms for visual inference and reasoning. As a result, they are unable to explain, interpret, or understand the relations among visual entities.

In this talk, I present my recent research on computational modeling of visual compositionality. It is a new mechanism that unifies probabilistic graphical models and deep learning into an effective learning framework for robust visual recognition. Visual compositionality refers to the decomposition of complex visual patterns into hierarchies of simpler ones. It not only provides much stronger expressive power for visual patterns, but also helps resolve ambiguities in smaller, lower-level visual patterns through larger, higher-level ones. I first introduce my unified computational model, called the "composition network", for object recognition and detection. It learns visual features, parts, compositional relationships, and structures in a unified, end-to-end fashion. This model is not only as effective as conventional deep learning approaches but is also able to interpret an object by parsing it into parts and substructures.
Then, I present a second line of work that aims to automatically learn the compositional relationships among body parts. We design a novel deep inference network architecture that performs both bottom-up and top-down inference across multiple semantic levels. Based on this model, I have built an image-based human pose estimation system that ranks first on the MPII human pose benchmark.
Wei Tang is a Ph.D. candidate in the Department of Computer Science and Electrical Engineering at Northwestern University, advised by Ying Wu. He received his M.E. and B.E. degrees, both from Beihang University, China, in 2015 and 2012, respectively. He was a research intern at Microsoft Research Redmond in 2018. His research interests include computer vision, machine learning, image/video analytics, and human-computer intelligent interaction. He has published 16 journal and conference papers in this field. He received the Northwestern Murphy Fellowship in 2015. While at Beihang, he received the Outstanding Master's Thesis Award and the Top Ten Graduate Students Award in 2014, and he was awarded a National Scholarship and a Lee Kum Kee Scholarship in 2013 and 2012, respectively.