CTO Bjørn Skou Eilertsen on how advanced technologies shape innovation in the industry
Sunday, 13 August 2017
Author: Toyah Hunting
Hardware acceleration from NVIDIA and Intel, video content analysis and the Internet of Things. These are powerful driving forces that are transforming how video technology is used for surveillance.
These advanced technologies are here today, already rapidly changing our near-term future and creating new possibilities for surveillance and beyond. The challenge is to keep up with the pace of change and understand how they work together and influence each other.
As always, there will be winners and losers because of this change. The difference is that the winners are leading the way by giving their customers clear insight into how to use this change to their advantage.
In this engaging keynote speech from IFSEC International 2017, Milestone Systems’ Chief Technology Officer, Bjørn Skou Eilertsen, “joins the dots” between these advanced technologies to give you a clear picture of why they are important. Using plain language, he explains how they are shaping the current phase of innovation in our industry. And finally, what action you need to take to be a winner, for yourself and for your company:
Technology is advancing exponentially fast
"It’s difficult for any single company to keep up.
Video Content Analysis is a good example.
A transformation is happening within video content analysis that will change how video technology is used for surveillance and other things.
Think about it, when was the last time you used one of these?
Right, most of the time you do what this guy is doing: use your mobile.
The phone box provided a fixed, landline solution to distribute access to phones; mobile technology displaced it with exponential growth.
At the start, exponential growth looks deceptively slow.
Mobile technology started to pick up slowly around 1998; then it took off, and by 2008 mobile phone subscriptions had reached half of the planet’s population.
Today, video content analysis is following the same trend: it’s advancing exponentially, so fast that no single company can keep up.
As you probably know, Milestone is the global #1 video management software manufacturer.
We are at the forefront of video management technology, but we are not deep into video content analysis today.
However, this is a technology that is transforming our future business and we’re studying it closely.
In short, we believe the future is about how we combine intelligence; machines to automate most of the work, humans to assist when the machine is uncertain.
Today, I want to share with you what we have learned and how we came to this conclusion.
Legacy video analytics techniques
A few years ago, video analytics techniques looked like this.
Rule-based legacy video analytics means a human programmer must set rules to describe every situation that the system can recognize, and it’s fixed.
Let me give you a story from the city of Zagreb to show you what I mean.
In Zagreb culture, drivers think it’s OK to park on footpaths and tram lines.
To counter the problem, authorities are using video analytics techniques.
At critical traffic points around Zagreb, cameras are enabled with real-time event detection capabilities and configured with a “stopped vehicle” rule.
The system is integrated with the city’s Security Center and the National Car Registry database.
Today in Zagreb, if you park on the footpath, the next day a fine for a traffic violation is delivered to your home.
The result is a 50% reduction of traffic violations in the city.
Zagreb has tripled its revenue from traffic violations while increasing road safety, all because it is using video analytics techniques.
The system has been programmed to distinguish a genuine parking violation from a car that is legally stopped in traffic.
Then it finds and reads the number plate of the offending vehicle, identifies the owner and sends the fine.
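The pipeline described above, detect a stopped vehicle, read its plate, look up the owner, send the fine, can be sketched like this. This is a hypothetical illustration, not Zagreb’s actual system: every function name, plate, and registry entry here is an assumption for the sake of the example.

```python
# Hypothetical sketch of the Zagreb-style pipeline: a "stopped vehicle"
# rule triggers plate reading (ANPR), an owner lookup in a car registry,
# and a fine. All names and values are illustrative assumptions.
def handle_stopped_vehicle(event, read_plate, registry, send_fine):
    """Process one stopped-vehicle event from a camera."""
    if not event["is_violation"]:          # legally stopped in traffic
        return None
    plate = read_plate(event["frame"])     # read the number plate from the frame
    owner = registry.get(plate)            # registry lookup, e.g. National Car Registry
    if owner is not None:
        send_fine(owner, "parking on the footpath")
    return plate

fines = []
plate = handle_stopped_vehicle(
    {"is_violation": True, "frame": "frame-001"},
    read_plate=lambda frame: "ZG-1234-AB",
    registry={"ZG-1234-AB": "J. Horvat"},
    send_fine=lambda owner, reason: fines.append((owner, reason)),
)
```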
That’s fantastic.
However, legacy rule-based video analytics is like the phone box: it’s fixed.
And just like the phone box, new technologies are advancing video content analysis beyond the capabilities of rule-based systems at an exponential pace.
These new technologies are disrupting video content analysis and one of the consequences is that they are making the legacy patents associated with video analytics techniques irrelevant.
Because of these technologies, video content analysis systems can now identify everything in a scene, learn about new items as they appear, understand what’s normal behavior, and alert humans to what’s not.
In the future, the efficiency of machine intelligence will combine with the quality of human judgements to achieve an outcome that’s not possible for either one alone.
It’s not either/or; it’s the combination that makes the difference.
If there’s one thing that I want you to remember from my keynote, it’s this:
The future is about how we combine intelligence; machines to automate most of the work, humans to assist when the machine is uncertain.
Machines doing what they’re best at, working alongside humans doing what we’re best at.
For the rest of my keynote I want to show you how I think we’re going to get there; why it’s important for you, how it’s going to happen, and what action you need to take.
As we see it, it comes down to three technology trends that are driving the disruption and defining the near-term future for video content analysis.
- Sensor aggregation
- System automation
- Visual augmentation.
Let’s take a closer look at each trend.
Theme 1: Sensor aggregation
The first trend, sensor aggregation, is about joining data from many types of sensor into a single dataset.
It is driven by the Internet of Things and the way it connects vast numbers of cameras and other sensors.
Today, sensors are adding intelligence to more and more objects: video, audio, sensors of all types, all connected through the Internet of Things.
In less than five years, about 50% of the streams feeding into video management systems will come not from cameras alone, but from other types of sensor.
In the future, video content analysis will be about more than just video; it will be about joining and analyzing this massive input of data.
How do we use this massive amount of data?
How do we find the useful information?
How do we make this data actionable?
To understand what this massive amount of sensor data is telling us, and to find the useful information, we must join the data into one dataset. That’s sensor aggregation.
That’s why future video content analysis technology is developing at the central server: you cannot effectively aggregate data at the edge.
It’s like this picture: if you were an individual element, a video sensor if you like, all you would see is a bunch of other elements around you; you wouldn’t see what’s happening.
To understand what’s going on, you must aggregate the elements, the sensors, and look at them together.
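In code, the core of sensor aggregation can be sketched as merging timestamped readings from different sensor types into one time-ordered dataset. This is a minimal illustration with hypothetical sensor names, not any particular product’s data model.

```python
from dataclasses import dataclass

# Minimal sketch: each sensor emits timestamped readings; aggregation
# merges them into one time-ordered dataset so patterns across sensor
# types become visible. All names here are illustrative assumptions.
@dataclass
class Reading:
    timestamp: float      # seconds since some shared epoch
    sensor_type: str      # e.g. "video", "audio", "access-control"
    sensor_id: str
    payload: dict

def aggregate(streams):
    """Join per-sensor event streams into one dataset, ordered by time."""
    merged = [reading for stream in streams for reading in stream]
    return sorted(merged, key=lambda r: r.timestamp)

video = [Reading(10.0, "video", "cam-1", {"event": "motion"})]
audio = [Reading(9.5, "audio", "mic-7", {"event": "glass-break"})]
timeline = aggregate([video, audio])
# The audio event now sits just before the video motion event on one
# timeline, a correlation no single sensor could reveal on its own.
```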
Video sensor in a parts box
To answer the third question, how do we make this data actionable, let me show you an example I found on the internet that really made me stop and think.
Earlier, I said that video sensors are turning up everywhere.
This is a component parts bin on an assembly line, and it has a video sensor in it.
Can you imagine why there would be a video sensor in a component bin?
Well, the data it provides is used to optimize the assembly line process.
As the assembly line worker uses the components, the video sensor counts how many components are left in the bin, and how fast they’re being used.
Würth Industrie Service was the first supplier to introduce video sensors into component bins to provide quantity and usage at bin-level, and automatically place supply orders to refill each bin.
The video sensor also helps management to improve the process by providing a constant flow of data to optimize workflows and staffing.
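The bin-level reordering logic described above can be sketched in a few lines. This is a hedged illustration of the idea, count parts, reorder below a threshold, and every name and number is an assumption, not Würth’s actual system.

```python
# Hypothetical sketch of bin-level replenishment: a video sensor reports
# the current part count; when it falls below a reorder point, a supply
# order is placed automatically. Names and thresholds are illustrative.
def check_bin(bin_id, part_count, reorder_point, order_quantity, place_order):
    """Place a refill order when the observed count drops below threshold."""
    if part_count < reorder_point:
        place_order(bin_id, order_quantity)
        return True
    return False

orders = []
check_bin("bin-42", part_count=8, reorder_point=10, order_quantity=100,
          place_order=lambda b, q: orders.append((b, q)))
# orders == [("bin-42", 100)]
```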
Using sensor data to augment human decisions and optimize operations like this is revolutionizing materials management on assembly lines, and creating big commercial gains.
Theme 2: System automation
The second trend, system automation, is about combining machine intelligence with human intelligence.
It is being driven by the massively parallel compute capacity that is now available through modern Graphics Processing Units, or GPUs.
Before the GPU, it was impossible to comprehend the information hidden in the massive amounts of data from sensor aggregation.
Traditional CPU-based computing systems would take eons to complete this task.
Using the GPU’s massively-parallel compute capacity to power neural networks, we now use artificial intelligence, or AI, to analyze this data and understand what it is telling us.
Because of this, I see the GPU as the tipping point that changed the role of humans in video content analysis.
This collective intelligence, machines to automate most of the work, humans to assist when the machine is uncertain, is the future of video content analysis.
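The human-in-the-loop pattern behind this collective intelligence is simple to sketch: the machine acts on confident classifications and routes uncertain ones to a human operator. The threshold and names below are assumptions for illustration only.

```python
# Minimal sketch of machine-automates, human-assists routing: confident
# detections are handled automatically; uncertain ones go to an operator.
# The 0.9 threshold is an assumed value, not a recommendation.
def route(detection, confidence, threshold=0.9):
    """Return who handles this detection: the machine or a human."""
    if confidence >= threshold:
        return ("auto", detection)          # machine acts on its own
    return ("human-review", detection)      # operator assists the machine

route("loiterer", 0.97)   # handled automatically
route("loiterer", 0.55)   # escalated to a human
```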
Let’s look at how we use GPUs and neural networks to combine the efficiency of machine intelligence with the quality of human judgements to achieve something that’s not possible for either one alone.
NVIDIA Tesla P4 GPU
To analyze video streams from ten thousand cameras requires a new type of computing - using a GPU as a coprocessor is changing the game.
On both the NVIDIA and Milestone booths at IFSEC this week, we are using general-purpose servers from Dell and HP, fitted with multiple GPUs, to process and detect motion on 1,500 full HD (1080p) streams.
That’s 45,000 fps on a commercial off-the-shelf server, without dropping a single frame!
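As a sanity check on that figure: the keynote doesn’t state the per-stream frame rate, but assuming a typical 30 frames per second, the arithmetic works out to the quoted aggregate throughput.

```python
# Back-of-the-envelope check of the demo figure. The 30 fps per-stream
# rate is an assumption; only the 1,500-stream count is stated.
streams = 1500
fps_per_stream = 30   # assumed typical surveillance frame rate
total_fps = streams * fps_per_stream
print(total_fps)      # 45000
```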
As of 2016, GPUs, or AI accelerators, are an emerging class of coprocessors designed to accelerate neural networks and other machine learning models.
GPUs are many-core designs whose massively parallel structure makes them more efficient than general-purpose CPUs for this kind of processing.
The Tesla P4 Inferencing Accelerator (I still call it a GPU) is enabling huge advances in video content analysis because the P4 is cooled, powered and built to run in a 24/7 high-heat server environment.
I really love this card, it’s as if it was purpose-built for the surveillance industry!
Combining sensor aggregation with the compute-power of GPU technology like the Tesla P4, we can use neural networks efficiently to understand what these huge amounts of video data can tell us.
Because of this combination, we are starting to apply deep learning.
Today, we buy a product to solve a pre-defined problem; tomorrow, we will buy a neural network to add artificial intelligence to a process.
Neural networks are computer systems used in machine learning and artificial intelligence.
They’re a bit like your brain in that they are based on a large collection of connected simple units called artificial neurons.
Neural networks are not rule-based like traditional systems.
Rather than being explicitly programmed, they are self-learning processes that can be trained by the user.
Neural networks learn what is normal behavior for people, vehicles, and the environment by observing patterns of characteristics in the video such as size, speed, color, grouping, vertical or horizontal orientation and so forth.
The neural network classifies this video data, tags objects and patterns in the video, and continuously builds up and refines definitions of what is normal or average behavior.
After several weeks of learning, a neural network can recognize when something breaks the pattern and send an alert.
Neural networks excel in areas that are difficult to solve using rule-based programming.
For example, a neural network learns that it’s normal for cars to drive on the road.
When it sees a car drive up onto the pavement, it recognizes this as not normal.
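The learn-what’s-normal-then-flag-deviations idea can be sketched with simple statistics. To be clear, this is not a neural network, just a minimal stand-in for the same behavioral principle; real systems learn far richer features (size, speed, color, grouping), and all values below are assumptions.

```python
import statistics

# Minimal sketch of anomaly detection from learned "normal" behavior.
# A neural network would learn a far richer model; here a z-score on one
# feature stands in for the same flag-the-unusual principle.
class NormalModel:
    def __init__(self):
        self.samples = []

    def observe(self, value):
        """Accumulate observations of normal behavior."""
        self.samples.append(value)

    def is_anomaly(self, value, z_threshold=3.0):
        """Flag values far outside the learned distribution."""
        mean = statistics.mean(self.samples)
        stdev = statistics.pstdev(self.samples) or 1e-9  # avoid divide-by-zero
        return abs(value - mean) / stdev > z_threshold

# Train on "cars drive on the road": lateral offset in meters from lane center.
model = NormalModel()
for offset in [0.1, -0.2, 0.0, 0.15, -0.1, 0.05]:
    model.observe(offset)

model.is_anomaly(0.1)   # False: on the road, within normal variation
model.is_anomaly(3.5)   # True: up on the pavement, breaks the pattern
```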
GPU and neural network technology is transforming video content analysis; soon we will use neural networks to reliably find suspects in a crowd, assess situational behavior and estimate intentions.
Theme 3: Visual augmentation
The third trend, visual augmentation, is about using artificial intelligence to signal to us humans when something out of the ordinary happens.
By the end of 2014, IHS estimated that there were over 245 million operational surveillance cameras globally.
In fact, that’s just a fraction of the total number of devices capturing video; today, just about every event is captured on video by a security camera, a smartphone or a body-worn camera.
In my opinion, most video data is rarely looked at, because searching through hours of recorded video to find exactly what you’re looking for is tedious and takes a lot of time.
It may be very difficult to find, but there’s highly valuable information locked in this video data:
- Insight into terrorist activity in planning stages
- Criminal activity in progress
- Clues that can become evidence.
The question is, how do you search all those hours of video for patterns of behavior and clues to evidence?
Utilizing the massive parallel processing capability of GPU technology, neural networks can analyze the video and multimedia data, search the content, tag objects it finds and extract information from it.
By extracting this information, the machine intelligence is making the data actionable.
Let me show you a couple of examples.
BriefCam: Video Synopsis Technology
BriefCam is a Milestone partner that specializes in visually augmenting video data to make it easy for humans to comprehend.
Their Video Synopsis Technology enables humans to rapidly review video by simultaneously presenting objects, events and activities that occurred at different times so that we can find evidence very quickly.
Using GPU and neural network technology, Video Synopsis tracks and analyzes moving objects and converts video streams into a database of objects and events.
The visual augmentation process collects all objects from the target period, and shifts them in time to create a much shorter video in which objects and activities that originally occurred at different times are displayed simultaneously.
As you can see in the slide, on average one hour of video can be “synopsized” down to one minute of review time while preserving all essential activities of the original video.
In addition, the video can be further visually augmented by arranging it according to parameters that we can specify, for example, size, color, speed, area and direction.
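The time-shifting at the heart of this process can be sketched very simply: each tracked object keeps its own trajectory, but its start time is shifted so that activities from different times play back simultaneously. This is a hedged sketch of the idea only; it ignores the real system’s work of avoiding visual collisions between shifted objects, and the track data is invented.

```python
# Sketch of synopsis-style time shifting: rebase every object's track to
# t=0 so hours of sparse activity overlap into a short review clip.
# Track contents are illustrative assumptions.
def synopsize(tracks):
    """Shift each object's track to start at t=0."""
    shifted = []
    for track in tracks:
        t0 = track[0][0]                            # first timestamp in the track
        shifted.append([(t - t0, pos) for t, pos in track])
    return shifted

# Two objects seen an hour apart in the source footage...
tracks = [
    [(0, "gate"), (5, "door")],                     # object A at t = 0 s
    [(3600, "gate"), (3607, "car park")],           # object B at t = 1 h
]
# ...now appear together in the first seconds of the synopsis.
synopsis = synopsize(tracks)
```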
This is intelligence augmentation, or IA.
Using intelligence augmentation like this enables humans to view recorded video for suspects or potential threats and make the final positive identification much faster.
As another example, body-worn cameras are being used in increasing numbers by public safety and law enforcement professionals to capture video from the individual’s perspective and record for long periods.
Video from body-worn sensors is more difficult to analyze than video from fixed surveillance cameras, because the background scene is changing.
More recently, using GPU and neural network technology, the accuracy of video content analysis has been improved through machine learning, and specifically deep learning.
Deep learning is a relatively new technique in machine learning that enables, among other applications, very accurate classification of images.
Correct application of this technology can provide a dramatic increase in accuracy alongside a dramatic decrease in false alarms, which have long been the Achilles’ heel of video content analysis applications in surveillance.
Digital Barriers, another Milestone partner, is a great example of this process.
Digital Barriers specialize in zero-latency streaming and analysis of secure video and related intelligence over wireless networks, including cellular, satellite, IP mesh and cloud.
For example, the officer here is wearing a video and audio badge. By analyzing the video in real time, the video content analysis system can raise an alert if it identifies the person confronting the officer as a known criminal.
Looking a little further ahead, video content analysis will not only recognize patterns in the video but will also be able to predict behavior.
Using deep learning, the Computer Science and Artificial Intelligence Laboratory at MIT has created a visual augmentation model that can predict human actions from what people are doing in the seconds before.
Researchers fed the program with 600 hours of YouTube videos to see if the model could learn about and predict certain human interactions like hugs, kisses, high-fives, and handshakes.
Analyzing video of people who are seconds away from one of these interactions, the computer managed a 43% success rate, compared to the 71% reached by actual humans.
The MIT team says that the model will be much more successful if it consumes more video data than the 600 hours used for the experiment.
Conclusion: Three technology trends
The future is about how we combine intelligence; machines to automate most of the work, humans to assist when the machine is uncertain.
Machines doing what they’re best at, together with us humans doing what we’re best at.
This collective intelligence is being enabled by three inter-linked technology trends:
- sensor aggregation
- system automation
- visual augmentation
These trends are themselves driven by GPU and neural network technologies, and together they are disrupting the video content analysis industry.
In the past, the video analytics techniques were tightly linked with patents, and as you probably know, most of the patents were acquired by one player to control the future of video content analysis.
At Milestone, we call that an end-to-end solution.
We believe such an end-to-end solution is a dead end, and as I have shown you today, video content analysis technology has already overtaken most of the video analytics techniques and these patents.
In the past, people were scared to engage with video content analysis because they were concerned about lawsuits arising from patent infringements connected with legacy video analytics techniques.
Let me tell you, patents controlling legacy video analytics techniques are no longer a barrier to your future.
Because technology has overtaken legacy video analytics techniques.
Because patents are no longer a barrier.
Because we see that such an end-to-end solution is a dead end.
There is now a new future for everyone in video content analysis technology; but you can’t win alone.
Technology is advancing exponentially fast; it’s difficult for any single company to keep up.
The future for video content analysis technology is combining machine intelligence with human intelligence.
The future for video content analysis companies is also about combining.
To win, you must team-up with other innovators so that their skills add to yours.
Today, I want to inspire you to step up to this challenge.
- Learn about the new GPU and neural network technologies
- Combine your skills as a community
- Build your customers a best-of-breed solution that’s very easy to use.
It is very important for me, that you act now and join a community that fits your company.
Not long from now, when we look back on our industry, there are two questions we can ask.
What were we thinking? Didn’t we see the opportunity, the need to do something now?
Or, we can ask a second question, and this is the question I want us to ask.
How did we challenge the status quo and unleash our imagination?
It’s up to us to make this happen.
This is not the time for business as usual.
As a community, we have everything we need.
Together we must set the agenda, change the rules, and lead our industry."
Bjørn Skou Eilertsen
Chief Technology Officer, Milestone Systems
Bjørn Skou Eilertsen came to Milestone in 2013, bringing a strong entrepreneurial background from the IT industry, having held key roles at a series of startups as well as product management, marketing and sales roles with both IBM and Microsoft.
Prior to joining Milestone, Bjørn headed up EMEA product management and sales operations for the Microsoft CRM business. His education includes an M.Sc. in Computer Science and Business Administration.