Vision and Language
Humans communicate using language, and our conversations frequently refer to things that can be seen in the world. Any robot that is going to communicate effectively about the world with a human will inevitably need to relate vision and language in much the same way that humans do.
Team Members
Anton van den Hengel
University of Adelaide
Professor van den Hengel and his team have developed world-leading methods in a range of areas within computer vision and machine learning, including methods that have placed first on a variety of international leaderboards, such as PASCAL VOC (2015 & 2016), CityScapes (2016 & 2017), Virginia Tech VQA (2016 & 2017), and the Microsoft COCO Captioning Challenge (2016).
Professor van den Hengel’s team placed 4th in the ImageNet detection challenge in 2015, ahead of Google, Intel, Oxford, CMU and Baidu, and 2nd in ImageNet Scene Parsing in 2016. ImageNet is one of the most hotly contested challenges in computer vision.
Stephen Gould
Australian National University
Stephen Gould is a Professor in the Research School of Computer Science at ANU. He is also the ANU Node Leader and sits on the Executive Committee of the ARC Centre of Excellence for Robotic Vision.
He received his BSc degree in Mathematics and Computer Science and his BE degree in Electrical Engineering from the University of Sydney in 1994 and 1996, respectively. He received his MS degree in Electrical Engineering from Stanford University in 1998 and his PhD, also from Stanford, in 2010. He then worked in industry for a number of years, during which he co-founded Sensory Networks, which was sold to Intel in 2013. His research interests include computer and robotic vision, machine learning, probabilistic graphical models, deep learning and optimisation.
In 2017, Steve spent a year in Seattle leading a team of computer vision researchers and engineers at Amazon, before returning to Australia in 2018. He was awarded an ARC Future Fellowship in 2020 for the project “Declarative Networks: Towards Robust and Explainable Deep Learning”.
Chunhua Shen
University of Adelaide
Chunhua Shen is a Professor at the School of Computer Science, University of Adelaide. He is also an adjunct Professor of Data Science and AI at Monash University.
Prior to that, he was with the computer vision program at NICTA (National ICT Australia), Canberra Research Laboratory, for about six years. His research interests lie at the intersection of computer vision and statistical machine learning. He studied at Nanjing University and at ANU, and received his PhD degree from the University of Adelaide. From 2012 to 2016, he held an Australian Research Council Future Fellowship. He is an Associate Editor (AE) of the Pattern Recognition journal and IEEE Transactions on Circuits and Systems for Video Technology, and has served as an AE for several other journals, including IEEE Transactions on Neural Networks and Learning Systems.
Anthony Dick
University of Adelaide
Anthony is an Associate Professor at the University of Adelaide’s School of Computer Science. He holds a Bachelor of Mathematics and Computer Science (Hons) from the University of Adelaide and received his PhD from the University of Cambridge in 2001. Anthony’s research interests centre on computer vision: the problem of teaching computers how to see. He is particularly interested in tracking many people or objects at once, and in building 3D models from video. His work has attracted over 2,400 citations, with an h-index of 24.
Yuankai Qi
University of Adelaide
Yuankai is a postdoctoral research fellow based at The University of Adelaide, working with Prof. Anton van den Hengel and Dr. Qi Wu. He received his B.E., M.S. and Ph.D. degrees from Harbin Institute of Technology in 2011, 2013 and 2018, respectively.
His research focuses on computer vision tasks, especially visual object tracking and instance-level video segmentation. He is currently working on the vision-and-language navigation task.
Hui Li
University of Adelaide
Hui Li completed her PhD in 2018 at The University of Adelaide under the supervision of Chief Investigator Chunhua Shen and Associate Investigator Qi Wu. She received a Dean’s Commendation for Doctoral Thesis Excellence. Hui’s research interests include visual question answering, text detection and recognition, car license plate detection and recognition, and deep learning techniques. She became a Research Fellow with the Centre in July 2018.
Violetta Shevchenko
University of Adelaide
Violetta joined the Centre as a PhD researcher in 2018. She received her Bachelor’s degree in Computer Science from Southern Federal University, Russia, in 2015. She then participated in a double-degree program with Lappeenranta University of Technology in Finland, where she completed her Master’s in Computational Engineering in 2017. Her research interests lie in computer vision and deep learning, in particular solving the task of visual question answering.
Yicong Hong
Australian National University
Yicong completed his Bachelor of Engineering at ANU in 2018, majoring in Mechatronic Systems. He was a research student at Data61/CSIRO from 2017 to 2018, working on an honours project on human shape and pose estimation. Yicong joined the Centre as a PhD researcher in 2019 under the supervision of Chief Investigator Professor Stephen Gould. His research interests include visual grounding and textual grounding problems, and he is currently working in the Centre’s Vision and Language research project.
Zheyuan ‘David’ Liu
Australian National University
David graduated from ANU in 2018 with first-class honours in a Bachelor of Engineering (Research and Development), majoring in Electronics and Communication Systems and minoring in Mechatronic Systems. David joined the Centre in 2019 as a PhD student at ANU under the supervision of Chief Investigator Professor Stephen Gould. His research interests centre on vision and language tasks in deep learning, particularly visual grounding and reasoning.
Sam Bahrami
University of Adelaide
Sam is a Research Programmer at the University of Adelaide, where he also gained a Bachelor of Engineering (Honours) and a Bachelor of Mathematical & Computer Sciences. As a research programmer, he works on robotics, machine learning, and vision and language problems within the Centre. Sam has been part of the Centre since November 2018.
Project Aim
The ability to process vision takes up more of the human brain than any other function, and language is our primary means of communication. Any robot that is going to communicate flexibly about the world with a human will inevitably need to relate vision and language in much the same way that humans do. It is not that this is the best way to sense or communicate, only that it is the human way, and communicating with humans is central to what robots need to be able to do.
This project used technology developed for vision and language purposes to develop capabilities relevant to visual robotics. More than just Visual Question Answering (VQA) for robots or Dialogue for Tasking, it included questions about what needed to be learnt, stored, and reasoned over for a robot to be able to carry out a general task specified by a human through natural language.
Key Results
The team had two major milestones for 2020. The first was to achieve state-of-the-art performance in Vision-and-Language Navigation (VLN), a task that requires an agent to navigate a real-world environment by following natural language instructions. Towards this milestone, Yuankai proposed an Object-and-Action Aware Model and Yicong proposed a Language and Visual Entity Relationship Graph model. From both the textual and visual perspectives, the team found that the relationships among the scene, its objects and directional clues are essential for the agent to interpret complex instructions and correctly perceive the environment. To capture and use these relationships, they proposed a novel Language and Visual Entity Relationship Graph that models the inter-modal relationships between text and vision and the intra-modal relationships among visual entities, together with a message-passing algorithm that propagates information between language elements and visual entities in the graph; this information is then combined to determine the agent’s next action. The model achieved state-of-the-art performance in the emerging field of VLN.
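For illustration only, the sketch below shows what one round of cross-modal message passing between instruction tokens and candidate visual entities could look like; the class name, feature dimensions, projections and update rules are assumptions made for this sketch, not the published model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalMessagePassing(nn.Module):
    """Illustrative single round of message passing between language
    tokens and visual entities. All dimensions and layers are
    assumptions for this sketch, not the published architecture."""

    def __init__(self, dim=256):
        super().__init__()
        self.lang_proj = nn.Linear(dim, dim)         # project language nodes
        self.vis_proj = nn.Linear(dim, dim)          # project visual nodes
        self.update_vis = nn.Linear(2 * dim, dim)    # fuse entity + incoming message
        self.update_lang = nn.Linear(2 * dim, dim)   # fuse token + incoming message

    def forward(self, lang_feats, vis_feats):
        # lang_feats: (num_tokens, dim), vis_feats: (num_entities, dim)
        # Inter-modal affinities between every token and every entity.
        scores = self.lang_proj(lang_feats) @ self.vis_proj(vis_feats).t()

        # Messages in each direction are attention-weighted sums over
        # the other modality.
        lang_to_vis = F.softmax(scores.t(), dim=-1) @ lang_feats
        vis_to_lang = F.softmax(scores, dim=-1) @ vis_feats

        # Each node is updated by fusing its feature with its message.
        new_vis = torch.relu(self.update_vis(torch.cat([vis_feats, lang_to_vis], dim=-1)))
        new_lang = torch.relu(self.update_lang(torch.cat([lang_feats, vis_to_lang], dim=-1)))
        return new_lang, new_vis


# Toy usage: 12 instruction tokens and 5 candidate visual entities.
mp = CrossModalMessagePassing(dim=256)
lang = torch.randn(12, 256)
vis = torch.randn(5, 256)
lang2, vis2 = mp(lang, vis)
# Score each candidate against a summary of the instruction.
action_logits = vis2 @ lang2.mean(dim=0)
```

In this toy version, inter-modal affinities drive attention-weighted messages in both directions, each node is updated by fusing its original feature with the message it receives, and the updated entity features are then scored against a summary of the instruction to rank candidate actions.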
The second milestone was the development of a demonstrator in collaboration with the Manipulation Demonstrator team. Specifically, the aim was to deliver an interface that receives a text description and a tabletop image as input and outputs a bounding box of the queried object to the robot, so that the robot can grasp, manipulate and pass the object to a human.
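A rough sketch of such an interface is shown below; the names `ground_query`, `detect_objects` and `score_against_text` are placeholders introduced for this illustration, not the Centre’s actual code.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

@dataclass
class Detection:
    box: Box
    score: float  # similarity between the region and the text query

def ground_query(image, query: str,
                 detect_objects: Callable, score_against_text: Callable) -> Box:
    """Illustrative grounding interface: given a tabletop image and a
    natural-language description, return the bounding box of the most
    likely referred-to object. `detect_objects` and `score_against_text`
    stand in for an object detector and a vision-language matching
    model; both are assumptions of this sketch."""
    candidates: List[Detection] = [
        Detection(box=b, score=score_against_text(image, b, query))
        for b in detect_objects(image)
    ]
    if not candidates:
        raise ValueError("no objects detected on the tabletop")
    best = max(candidates, key=lambda d: d.score)
    return best.box  # handed to the manipulation pipeline as the grasp target


# Toy stand-ins so the sketch runs end-to-end (replace with real models).
fake_detect = lambda img: [(10, 20, 60, 80), (100, 40, 160, 120)]
fake_score = lambda img, box, text: float(len(text)) / (1 + box[0])
print(ground_query(None, "the red mug on the left", fake_detect, fake_score))
```

Separating generic object detection from language-conditioned scoring means either component can be swapped out independently; only the winning box needs to cross the interface to the robot.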
In 2020, the project team had six papers accepted at the Conference on Computer Vision and Pattern Recognition, two at the International Joint Conference on Artificial Intelligence, one at the annual conference on Neural Information Processing Systems, one at the conference on Empirical Methods in Natural Language Processing, four at the European Conference on Computer Vision and four at the Association for Computing Machinery Multimedia conference. The team also won the TextVQA Challenge and the MediVQA Challenge, and hosted their first Remote Embodied Visual Referring Expression in Real Indoor Environments challenge at the 2020 meeting of the Association for Computational Linguistics.