A tipping point is defined as: “The point at which a series of changes becomes significant enough to cause a larger, more important change”. In the same way that IP video changed surveillance a decade ago, our industry is now feeling the impact of recent developments in Artificial Intelligence, Machine Learning, Deep Learning, Big Data, and Intelligent Video Analysis.

Keyword Definitions

Let’s start with a few more definitions. Artificial Intelligence (AI) deals with the simulation of intelligent behavior in computers. Machine Learning (ML) deals with developing computer algorithms that access data and use it to learn for themselves. Neural networks are computer systems that loosely mimic human brain operation.

Deep Learning is a subset of ML based on neural networks, and it has delivered breakthrough results on many problems that were previously considered unsolvable. Big Data refers to huge amounts of structured and/or unstructured data -- in our case, the immense quantities of video information generated daily by security cameras deployed in cities around the world. Deep Learning is tipped to change Intelligent Video Analysis (IVA), the digital video technology integrated with analytical software that is a basic tool for our industry.

AI In Surveillance

Traditionally, the main benefits of surveillance cameras have been the ability to collect evidence for debriefing or investigation and the ability to view events remotely in real time. A decade ago, video analytics technologies were introduced to solve the problem of human inattention -- computers don’t get tired, bored or distracted, and can monitor a camera continuously.

Then camera costs dropped, deployment skyrocketed, and video management systems began collecting reams of unstructured data that was costly to store and rarely used. AI technology seemed to answer the industry’s pressing new needs: using this Big Data effectively and earning a return on the investment in expensive storage, while maintaining (or even lowering) human capital costs.

Three Limiting Factors

All this was theoretical, however, as multiple technological barriers kept AI solutions out of real-world use. Despite decades of research on how to make a computer accurately recognize different objects in a video stream, the quality of the results, especially in urban environments, was, to put it mildly, underwhelming.

AI was limited primarily by these three factors: 

  • Lack of understanding -- To be useful, the software must distinguish between object types (person, vehicle, animal, etc.) under varying conditions (day, night, seasonal weather, etc.). Earlier systems could not do so reliably.

  • Inability to learn -- Traditional IVA applications relied on a rule-based approach that required software configuration -- by a human operator -- for each monitoring camera and each type of alert. Although effective in some scenarios, the exponential growth in camera counts rendered this approach impractical, given the amount of manual labor required to configure, reconfigure, and maintain rules.

  • High cost -- The hard truth is that budgets for security and safety will always be constrained. Until recently, implementing real-time AI was cost-prohibitive, sometimes requiring a 1:1 server-to-camera ratio.

Meeting Challenges

That was yesterday. Today, the use of AI in security applications has reached its tipping point and meets each of the challenges above.

  • Understanding -- Deep Learning has matured to the point where it can accurately detect and classify objects both in still images and in video. DL technology is fast becoming the basic building block for IVA.

  • Ability to Learn -- As an AI solution collects and analyzes data over time, it creates metadata that describes all objects in each video stream. Machine Learning techniques process this metadata to generate models of “normally observed” behavior. These models are applied in real time to detect behaviors that deviate from the norm. Only events flagged as suspicious require review by a human operator. This technique allows the solution to scale to very large numbers of cameras, with no need for a human to configure each new device.

  • Lower cost -- The rapid increase in GPU computational capacity, coupled with mass market adoption, has lowered server costs to a level that security budgets can absorb. Today, with the correct implementation, a single server can serve hundreds or even thousands of cameras.
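
The “Ability to Learn” point can be made concrete with a small sketch. The Python below is an illustration only, not a description of any vendor’s product: it assumes the metadata has already been reduced to hourly object counts per camera and object class, and it uses a simple z-score test as a stand-in for the behavioral models described above. The class name BaselineModel, its parameters, and the camera identifiers are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


class BaselineModel:
    """Learns typical object counts per (camera, object class, hour of day).

    Hypothetical illustration: real systems model far richer behavior
    (paths, speeds, dwell times) than a single hourly count.
    """

    def __init__(self, z_threshold=3.0, min_samples=5):
        self.z_threshold = z_threshold  # deviations beyond this are flagged
        self.min_samples = min_samples  # do not judge before enough history exists
        self.history = defaultdict(list)

    def observe(self, camera_id, object_class, hour, count):
        """Record one observed count; no per-camera rule is configured by hand."""
        self.history[(camera_id, object_class, hour)].append(count)

    def is_anomalous(self, camera_id, object_class, hour, count):
        """Return True when the count deviates strongly from the learned norm."""
        samples = self.history.get((camera_id, object_class, hour), [])
        if len(samples) < self.min_samples:
            return False  # still learning what "normal" looks like here
        mu = mean(samples)
        sigma = pstdev(samples)
        if sigma == 0:
            return count != mu  # history is constant, so any change stands out
        return abs(count - mu) / sigma > self.z_threshold


model = BaselineModel()
# Assumed history: people seen by camera "cam-1" between 03:00 and 04:00.
for past_count in [2, 3, 2, 3, 2, 3]:
    model.observe("cam-1", "person", 3, past_count)

typical_night = model.is_anomalous("cam-1", "person", 3, 3)
crowd_at_night = model.is_anomalous("cam-1", "person", 3, 40)
new_camera = model.is_anomalous("cam-2", "person", 3, 40)
```

In this sketch only the crowd at 03:00 on cam-1 would be passed to a human operator. A newly added camera needs no manual rules: it is simply not judged until it has accumulated enough history of its own.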

The convergence of Deep Learning for video analysis, advances in AI for fully automated event detection, plus the significant reduction in cost to implement these techniques -- including cloud-based software as a service (SaaS) models -- means that the fully automated video surveillance solution for cities is fast becoming a reality. We’ll see more of this type of solution being deployed over the coming months, and within the next few years, it will be standard in any Smart City deployment.