Artificial intelligence has learned to “see” just as humans do

Engineers at the University of California, Los Angeles (UCLA) and Stanford University have demonstrated a computer system that can detect and identify real-world objects it “sees”, using a visual learning method modeled on the way people learn.

The new system is considered a step forward for “computer vision”, the technology that allows computers to read and identify visual images. It could bring us closer to general artificial intelligence – self-learning computers that are able to reason and make independent decisions. Modern AI computer vision systems are becoming more powerful and effective every day, but they remain tied to specific tasks: their ability to identify what they see is limited by how much humans have trained and programmed them.

Even today’s best computer vision systems cannot construct a complete picture of an object from only some of its parts, so they can be fooled when an object appears in an unfamiliar setting. Engineers want to build computer systems without this flaw. A person, for example, can recognize a dog even when it hides behind a chair and only its paws and tail are visible: using intuition, a person easily infers where the dog’s head and the rest of its body are. Most AI systems still lack this ability.

Modern computer vision systems are not designed to learn on their own; instead, they are trained by being shown thousands of images of the objects they must identify. Moreover, computers cannot intuitively determine what a photograph depicts: AI-based systems do not form internal models of familiar objects the way people do. A new method, described in the journal Proceedings of the National Academy of Sciences, shows how these problems can be addressed.
 

The computer vision system developed at the University of California, Los Angeles can identify objects based only on parts of them. © UCLA
 
The approach consists of three stages. First, the system breaks an image into small chunks, which the researchers call “viewlets”. Second, the computer learns how these viewlets can be combined with one another to form the object in question. Third, the AI looks at what other objects are in the visible area and whether they are relevant to describing and identifying the primary object. To help the new system “learn” and become more human-like, the engineers decided to immerse it in an Internet replica of the human environment.
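The three stages above can be sketched in code. This is only an illustrative toy, not the paper’s method: all names here (`extract_viewlets`, `ViewletModel`, `identify`) and the scoring scheme are hypothetical stand-ins for the real system.

```python
# Toy sketch of the three-stage "viewlet" pipeline described in the article.
# Stage 1 splits an image into patches, stage 2 remembers which patches
# co-occur for an object class, stage 3 folds in scene context.
from itertools import combinations
from collections import Counter

def extract_viewlets(image, patch_size=2):
    """Stage 1: break a 2-D image (list of rows) into small patches ("viewlets")."""
    h, w = len(image), len(image[0])
    patches = []
    for r in range(0, h, patch_size):
        for c in range(0, w, patch_size):
            patch = tuple(
                tuple(image[r + dr][c + dc]
                      for dc in range(patch_size) if c + dc < w)
                for dr in range(patch_size) if r + dr < h
            )
            patches.append(patch)
    return patches

class ViewletModel:
    """Stage 2: remember which viewlets co-occur in images of one object class."""
    def __init__(self):
        self.pair_counts = Counter()

    def learn(self, viewlets):
        # Record every pair of distinct viewlets seen together in one image.
        for pair in combinations(sorted(set(viewlets)), 2):
            self.pair_counts[pair] += 1

    def score(self, viewlets):
        """Fraction of viewlet pairs that match previously remembered combinations."""
        pairs = list(combinations(sorted(set(viewlets)), 2))
        if not pairs:
            return 0.0
        return sum(self.pair_counts[p] > 0 for p in pairs) / len(pairs)

def identify(model, viewlets, scene_objects, related_context):
    """Stage 3: combine the viewlet score with evidence from surrounding objects."""
    context_bonus = 0.1 * sum(obj in related_context for obj in scene_objects)
    return model.score(viewlets) + context_bonus
```

A model trained this way can score a new image even when only some viewlets are present, and a familiar surrounding object (say, a chair near a partially hidden dog) nudges the identification score upward.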
 
“Fortunately, the Internet provides two things that help a brain-inspired computer vision system learn the same way people do. First, there is an abundance of images and video clips showing objects of the same type. Second, these objects are shown from many points of view – partially hidden, from a bird’s-eye view, up close – and placed in all kinds of settings,” says UCLA professor and research leader Vwani Roychowdhury.

Starting from infancy, we learn about an object by seeing many of its variations in different contexts. Such contextual learning is considered a key feature of our brains: it helps us build reliable models of objects as parts of an integrated worldview in which everything is functionally connected.
 
 
This understanding helped the engineers achieve their result: they successfully tested the system on about 9,000 images, each showing people along with other objects. The platform built a detailed model of the human body without external guidance or image labeling. The engineers ran similar tests with images of motorcycles, cars and aircraft.

In all cases, their system performed better than, or at least as well as, traditional computer vision systems that had undergone years of training, which gives hope for further progress.