We Need to Build Interactive Computer Vision Systems

You hear strong proclamations about how AI is taking over the world. And then you read about how sophisticated AI models are easily fooled by small perturbations in their input (see figure below).

Figure: Google Inception-v3 classifier, from "Strike (with) a Pose: Neural Networks Are Easily Fooled by Strange Poses of Familiar Objects".

The reality is that we're in a time of transition, as we learn more and more about what AI is (and is not) capable of. To truly advance, we need to start designing more interactive algorithms with end use cases in mind: breaking our solutions into modules whose outputs are understandable by regular users. That way, if an algorithm were confused about a decision, it could simply ask for help. Humans are emotionally wired to help other human beings, and we feel similar empathy towards machines too. Let's grow that side of the interaction.

AI and Robustness

There are many thoughts out there about what the future holds, long-term, for AI. I recommend reading the bold predictions by Rodney Brooks in early 2018, and following the debate between Gary Marcus and Yann LeCun on the merits of Deep Learning and symbolic AI.

In this post, I take a narrower, more practical view, as an engineer who wants to build accurate, scalable, and robust systems to solve today's exciting problems. I focus on computer vision because the robustness problem is more visible there, and the computation involved is more complex than in other fields such as NLP. However, the arguments below apply just as well to the broader field of AI.

Lack of robustness is the core challenge to overcome with the deep learning models currently used in computer vision. Yes, we can add more data and train even better models, but the best models still fail unpredictably. Training end-to-end models, in which the network maps raw inputs directly to the outputs of a complex system, creates even more uncertainty about when and how the system will fail.

Because the actual representation used by a neural network is still an unfolding mystery, diagnosing failures in a complex system is challenging. For example, deep learning has led to unprecedented progress in autonomous driving. However, we have yet to see a fully autonomous car on the road without a human behind the wheel, and it seems unlikely to happen anytime soon.

Consider the ImageNet challenge, where a system is presented with a fixed-size image and asked to identify the object in it from a pre-defined list of 1,000 classes. Solutions to this challenge usually report top-5 predictions; when they are evaluated on the top-1 prediction alone, accuracy drops significantly. (Check the table here.) This shows that even when a model cannot reliably pick the single correct label, it can quickly produce a shortlist of candidates that very likely contains the right answer. If that shortlist is then presented to a human, they can quickly and robustly pick the right answer, saving time overall.
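As a concrete illustration, here is a minimal sketch of how such a shortlist could be generated, assuming a pretrained torchvision ResNet-50 and its bundled ImageNet class names; the model choice and the `shortlist` helper are illustrative, not part of the original post:

```python
# Minimal sketch: build a top-5 "shortlist" from a pretrained ImageNet
# classifier, suitable for presenting to a human reviewer.
import torch
from torchvision import models
from torchvision.models import ResNet50_Weights

weights = ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()          # resize/crop/normalize as the model expects
class_names = weights.meta["categories"]   # the 1,000 ImageNet class labels

def shortlist(image, k=5):
    """Return the top-k (label, probability) pairs for one PIL image."""
    with torch.no_grad():
        logits = model(preprocess(image).unsqueeze(0))
        probs = logits.softmax(dim=1).squeeze(0)
    top = torch.topk(probs, k)
    return [(class_names[i.item()], p.item())
            for p, i in zip(top.values, top.indices)]

# A human (or a downstream UI) picks the right answer from this short list
# instead of trusting the single top-1 prediction.
```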

To address the problem of robustness, we should make humans responsible for identifying the right answer to any perceptual task. The role of algorithms, then, is to process huge amounts of data at scale and generate shortlists of likely outcomes. This kind of collaboration between algorithms and humans can be applied to a surprisingly large number of problem areas. Existing examples of such interactive computer vision systems include:

  • Editing tools in Adobe Photoshop or VFX software
  • Semi-autonomous robot control systems, such as the ones built for the DARPA Grand Challenge

Designing Interactive Systems

The main distinction between such interactive systems and some prevalent approaches today is that the role of humans must not be seen as a temporary crutch that helps the algorithm succeed until it has learned enough. Instead, we must design systems in which the algorithms generate abstracted symbols they can use to communicate with humans, asking either for confirmation or for help disambiguating areas of confusion.
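To make "asking for help" concrete, here is a small sketch that routes a decision to a human only when the model is unsure. The confidence threshold and the `ask_human` interface are assumptions made for the example, not something prescribed by the post:

```python
# Illustrative sketch: act autonomously when confident, defer to a human
# with a named shortlist when confused.
import torch

CONFIDENCE_THRESHOLD = 0.85  # assumed value; tune per application

def decide(probs: torch.Tensor, class_names, ask_human):
    """probs: 1-D tensor of class probabilities for a single input."""
    top_p, top_i = probs.max(dim=0)
    if top_p.item() >= CONFIDENCE_THRESHOLD:
        return class_names[top_i.item()]       # confident: no human needed
    # Confused: present a small, named shortlist and ask for help.
    k = min(5, probs.numel())
    top = torch.topk(probs, k)
    candidates = [class_names[i.item()] for i in top.indices]
    return ask_human(candidates)               # human disambiguates
```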

Finding the right symbols is a hard problem and has long been a central focus of the symbolic AI community. These symbols are domain-specific, but there are certain characteristics they could all share. Each symbol should represent a visual construct that has a name. For instance, while a random skin patch on a face isn't localizable, cheeks have a specific location, and we all know what they are.

Robustness isn't the only reason to design interactive systems, nor the only benefit of doing so:

  • Another important reason to design interactive AI solutions is to reduce the fear of AI and the hysteria over the doomsday scenario of machine intelligence surpassing humans. AI systems will stop being perceived as a scary black box and start to be seen as yet another complex system we interact with, one with understandable strengths and weaknesses.
  • Training an interactive AI system might require very little labeled data, as humans can provide the necessary labels through their responses (see the sketch after this list). This can break the incessant need to obtain large amounts of labeled data for any successful deployment of an AI solution.
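For instance, the answers humans give during normal use could be recorded as labels and periodically folded back into training. The sketch below is purely illustrative; `labeled_pool`, `record_human_answer`, and `fine_tune` are hypothetical names, and the batching policy is an assumption:

```python
# Illustrative sketch: every human response doubles as a training label,
# so the labeled dataset grows as the system is used.
labeled_pool = []   # (input, label) pairs collected from human answers

def record_human_answer(x, label):
    """Store the input together with the label the human provided."""
    labeled_pool.append((x, label))

def maybe_update(model, fine_tune, min_new_labels=100):
    """Periodically fine-tune on labels gathered through interaction,
    instead of commissioning a separate large annotation effort."""
    if len(labeled_pool) >= min_new_labels:
        fine_tune(model, labeled_pool)
        labeled_pool.clear()
```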

AI systems that efficiently combine the strengths of humans and algorithms have the potential to solve many otherwise intractable problems, and to do so efficiently.

What do you think of interactive systems? Have you been working on any? Let us know on LinkedIn or Twitter.
Ajay Mishra, Advisor, Computer Vision
Ajay is a Manifold affiliate, serving in an advisory capacity. Most recently, he has been Director of the Computer Vision & Deep Learning Group at HOVER Inc., where he built metric-accurate 3D models of houses and buildings from a few images captured with mobile phone cameras. You can find him on LinkedIn.