The ‘dark matter’ of visual data can help AI understand images like humans

What makes us humans so good at making sense of visual data?

That’s a question that has preoccupied artificial intelligence and computer vision scientists for decades.

Deep neural networks are like layers of complex mathematical functions stacked on top of each other.

During training, the deep neural network is given many examples (e.g. images) and their corresponding outcome (e.g. the object the images contain).

The DNN adjusts the weights of its functions to represent the common patterns found across objects of the same class.

Training an image classifier typically requires millions of labeled examples. That’s a lot of cat pictures.
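To make this concrete, here is a minimal sketch of what “adjusting the weights” looks like in practice. It uses PyTorch and random stand-in tensors instead of real cat pictures, and the tiny architecture is purely illustrative:

```python
# A minimal, illustrative sketch of training a deep neural network:
# stacked mathematical functions whose weights are nudged to fit
# labeled examples. The data is random stand-in tensors, not real images.
import torch
import torch.nn as nn

# Layers of functions stacked on top of each other: each Linear layer
# is a weighted sum, followed by a non-linearity.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(32 * 32 * 3, 128),
    nn.ReLU(),
    nn.Linear(128, 2),  # two classes, e.g. "cat" vs. "not cat"
)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Stand-in for a labeled dataset: 64 random 32x32 RGB "images".
images = torch.randn(64, 3, 32, 32)
labels = torch.randint(0, 2, (64,))

for epoch in range(10):
    optimizer.zero_grad()
    logits = model(images)          # forward pass through the stacked functions
    loss = loss_fn(logits, labels)  # how far predictions are from the labels
    loss.backward()                 # gradients with respect to every weight
    optimizer.step()                # adjust the weights toward common patterns
```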

In fact, deep neural networks have existed for decades.

Advances in processors and the availability of large datasets have allowed AI scientists to create and train bigger neural networks in short timespans.

But at their core, neural networks are still statistical engines that search for visible patterns in pixels.

That is only part of what the human vision system does.

The scientists also point out that human vision is not the memorization of pixel patterns.

How can we achieve human-level computer vision?

These ‘dark’ components, invisible in the pixels themselves, are functionality, intuitive physics, intent, causality, and utility (FPICU).

From infancy, we start to explore the world, much of it through observation.

We can reason not only about rigid objects but also about the properties of liquids and sand.

What needs to change in current AI systems?

Causality

Causality is the ultimate missing piece of today’s artificial intelligence algorithms and the foundation of all FPICU components.

Does the rooster’s crow cause the sun to rise, or does the sunrise prompt the rooster to crow?

Does the rising temperature raise the mercury level in a thermometer?

Does flipping the switch turn on the lights or vice versa?

Today’s AI struggles with such questions because causal events are not always visible, and they require an understanding of how the world works.

Human observers recruit their counterfactual reasoning capacity to interpret visual events, asking, for instance, what would have happened to the light had the switch not been flipped.
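A minimal sketch (not taken from the paper) can show why causal direction matters. Here is a hypothetical two-variable structural causal model of the switch-and-light example; intervening on the cause changes the effect, but intervening on the effect leaves the cause untouched:

```python
# A hypothetical structural causal model: the switch causes the light.
# do()-style interventions reveal the asymmetry of causation.
import random

def sample(do_switch=None, do_light=None):
    """Sample (switch, light), with optional interventions on either one."""
    switch = do_switch if do_switch is not None else random.choice([0, 1])
    # Structural equation: the light follows the switch...
    light = switch
    # ...unless we intervene on the light directly.
    if do_light is not None:
        light = do_light
    return switch, light

# Intervening on the cause changes the effect: the light is always on.
print(sample(do_switch=1))  # -> (1, 1)

# Intervening on the effect does not change the cause: the switch still varies.
print([sample(do_light=1)[0] for _ in range(5)])  # e.g. [0, 1, 1, 0, 1]
```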

Why is this important?

Searching for patterns instead of causes has largely been successful in areas such as board and video games.

If you want to carry water, you’ll look for a container.

If you want to climb a wall, you’ll look for objects or protrusions that can act as handles.

Our vision system is largely task-driven.

We reflect on our environment and the objects we see in terms of the functions they can perform.

We can classify objects based on their functionalities.

Again, this is missing from today’s AI.

Deep learning algorithms can find spatial consistency in images of the same object.

But what happens when they have to deal with a class of objects that is highly varied in appearance, such as chairs?
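Here is a toy sketch (with made-up thresholds, not the paper’s method) of what functionality-based classification might look like: an object counts as chair-like if it affords sitting, regardless of its appearance:

```python
# An illustrative affordance test: classify by function, not by looks.
from dataclasses import dataclass

@dataclass
class Object3D:
    name: str
    seat_height_m: float  # height of a flat, supportive surface
    supports_kg: float    # load that surface can bear

def affords_sitting(obj: Object3D) -> bool:
    """Crude, hypothetical test: human-sittable height and load-bearing."""
    return 0.2 <= obj.seat_height_m <= 0.7 and obj.supports_kg >= 60

for obj in [
    Object3D("ordinary chair", 0.45, 120),
    Object3D("weird sculptural chair", 0.40, 100),
    Object3D("coffee mug", 0.10, 1),
]:
    label = "chair-like" if affords_sitting(obj) else "not chair-like"
    print(obj.name, "->", label)
```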

Inferring intents and goals plays a very important part in our understanding of visual scenes.

Knowing what other people want allows us to reason about their courses of action.

And we do not even need rich visual cues to reason about intent.

Take the following video, which shows an old psychology experiment.

Can you tell what is happening?

Utility

Every possible action or state within a given model can be described with a single, uniform value: its utility.

Many AI systems incorporate utility functions, such as scoring more points in a game or optimizing resource usage.
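As a minimal illustration (with hypothetical actions and weights, not drawn from the paper), a utility function collapses each possible outcome into one comparable number, and the agent simply picks the action with the highest value:

```python
# A toy utility function guiding action selection.
actions = {
    "sit_on_chair": {"comfort": 0.9, "effort": 0.1},
    "sit_on_floor": {"comfort": 0.3, "effort": 0.0},
    "stand":        {"comfort": 0.1, "effort": 0.2},
}

def utility(outcome):
    """Collapse an outcome into a single uniform value (made-up weights)."""
    return outcome["comfort"] - outcome["effort"]

best = max(actions, key=lambda a: utility(actions[a]))
print(best)  # -> "sit_on_chair"
```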

But without incorporating the other components of FPICU, the use of utility functions remains very limited.

This, of course, is easier said than done.

You can read the original article here.
