Surveillance feed has essentially been monitored manually until recent years. Video analytics as a technology has made great strides and leverages video surveillance networks to derive searchable, actionable, and quantifiable intelligence from live or recorded video content.
Driven by artificial intelligence and deep learning, video intelligence solutions detect and extract objects in a video. These solutions identify target objects based on trained Deep Neural Networks and then classify each object to enable intelligent video analysis, including search & filtering, alerting, data aggregation and visualization.
In our session, we will:
With the basics covered, it’s LIGHTS! CAMERA! ACTION... Let us show you how this works. We will present a live demo that explains the performance-computing trade-offs between different models and techniques, and their limitations.
What you can expect to take away from our session:
Rishan Sanjay: A very good morning, good afternoon and good evening to everyone present at the Data + AI Summit. Today, I will be presenting along with my colleague Vinamre Dhar, and we will be presenting our work on Detecting Anomalous Behavior with Surveillance Analytics. There will be two speakers today, both from Kushagramati Analytics. We are a Bangalore-based firm that consults with Databricks. The two speakers are Vinamre Dhar and Rishan Sanjay, which is me. We’re both data scientists at Kushagramati Analytics.
So, the agenda for today is to discuss the popular analysis tools that are used on surveillance data, the challenges in the existing solutions, our proposed solution and demonstration, and architecting the same for scale. The popular use cases for surveillance analytics are primarily of the security and monitoring types, but there are also a few miscellaneous ones that we’ll be going through. Starting with the security use cases, we look at physical role-based access management. This aids physical access management for different roles in different working capacities. Next is violent behavior detection in public spaces. Government and local jurisdictions can work on reducing and deterring crime by using violent behavior detection with an automated model. This can be important for notifying a nearby police station while something like this occurs, and it can also be used in terms of the charges that are laid on a person for committing an illegal activity this way.
We can also look at crime and intrusion alerts. This primarily applies at the level of a residential or commercial complex, where we’re looking at preventing trespassing and various other crimes that can be detected using surveillance. Trespassing would be the primary focus of such a case. Moving on to monitoring: flagging logged activity and creating searchable events. This is crucial in reducing the time spent looking for an anomaly or a crime committed, because we have a searchable log, which is a lot easier than going through all the data.
Now, looking at trend forecasting. For retail businesses, we can look at forecasting foot traffic data, which can similarly be done for government infrastructure and housing plans as well. The idea is to train a model on regular footfall and predict it using trend forecasting and ML models. Further, we can look at COVID protocols in public spaces. This is something that has been implemented by the team at Kushagramati, which we’ll not be demonstrating today, but you can always reach out to us for the [inaudible]. COVID protocols such as social distancing can be monitored using machine learning. For example, we can create a model that detects this, and the minute people do not adhere to the preset condition, it can raise an alert.
Then there are fire and other safety protocols: for example, integrating fire alarms with your current surveillance tools, as well as simply detecting a fire. Various things like that can be integrated into a surveillance analytics solution. For the miscellaneous part, we can speak about illegal parking, or traffic violations, where the government authorities can use license plate recognition both to send a fine and to understand who is committing the traffic violation. Then there are infrastructure bottlenecks and unused spaces. Heat map analysis is currently being conducted in various industries to ensure that there aren’t too many unused spaces in shops, as well as in industrial places such as factories, et cetera.
Looking at automation and customer intelligence solutions, we primarily approach this on a per-client basis. You can automate a customer intelligence solution, but it must be client based; it is tailored towards that specific client. Next, I will be speaking about the challenges in traditional surveillance. Primarily, monitoring manually is an error-prone, slow activity, simply because it requires a lot of manpower and introduces human bias. Human bias could be anything, for example, in preventing unacceptable behavior, telling two people who have a liking for each other apart from two people conspiring to do something. So, human biases tend to play a role in the errors, as well as just plain human error. It’s expensive to scale for the same reason. We’re also looking at limited integration with various other tools.
So how do we tackle this? We tackle it by building the following use cases. Of the three use cases that we have built, we will be primarily focusing on abandoned object detection, but we have also worked on loitering and unauthorized individual detection, as well as unauthorized vehicle entry or location detection. For example, take loitering. If this is a bank, as in the use case, and a person is walking about continuously for an extended period of time, say 10 or 12 days, it seems to be someone conspiring to do something illegal, and that is what we look at detecting. The same reasoning applies to unauthorized vehicles. And abandoned objects would essentially refer to anything like when someone throws a bag and runs away, right? So, you know there is malicious intent there.
Looking at our architecture diagram: video feed is an unstructured kind of data. We gather the video feed through the Real Time Streaming Protocol, otherwise known as RTSP. We access the information on the camera, through the DVR, using RTSP. Once this occurs, we store the data and implement some cleansing and pre-processing steps, such as background subtraction, erosion and dilation, and the generation of a time-average frame. Object detection is our final stage, where we look at our abandoned objects, loitering individuals and unauthorized vehicles. Once these events occur and are logged, they are put into a searchable log, which resides in the central management system or the central monitoring system.
So, how do we access the video feed? Like I said, we use RTSP and a video capture object. The general format of the RTSP URL, as you can see, requires the credentials as well as the camera number. Now, looking at the steps that we implement for pre-processing and cleansing of the video data: primarily, we require low latency and a fluent frame rate for it to be a viable solution. So, what we do is let the video skip certain frames if they aren’t received in time, to prevent overloading of the protocol. We also implement gray scaling to obtain the binary format of these frames. The pre-processing steps are listed on the right-hand side of the slide, and include gray scaling, Gaussian blur, background subtraction, erosion and dilation.
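As a rough illustration of the access step described here, the sketch below builds an RTSP URL from credentials and a camera number and hands it to an OpenCV video capture object. The URL path layout, the default port and the helper names `build_rtsp_url` and `open_stream` are assumptions made for this example; the talk does not show the exact format, and each DVR vendor documents its own.

```python
from urllib.parse import quote


def build_rtsp_url(user, password, host, channel, port=554):
    """Build an RTSP URL from credentials and a camera (channel) number.

    The path "/Streaming/Channels/<n>" is a hypothetical layout used only
    for illustration; check the DVR's documentation for the real one.
    """
    # Percent-encode credentials so characters like "@" or ":" do not break the URL.
    creds = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return f"rtsp://{creds}@{host}:{port}/Streaming/Channels/{channel}"


def open_stream(url):
    """Open the stream with OpenCV (imported lazily; needs opencv-python)."""
    import cv2

    capture = cv2.VideoCapture(url)
    if not capture.isOpened():
        # Do not echo the URL here, since it contains the credentials.
        raise ConnectionError("could not open the RTSP stream")
    return capture
```

Once a capture object is open, `capture.read()` returns a success flag and the next frame, which is where the pre-processing steps would start.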
So, the key modules with regard to video processing are as follows. First, background subtraction, where we’re looking at extracting a foreground object from the background. For example, a bag that is not attended is an anomaly, and how would we separate that from the background? We pick out that specific object using background subtraction. Erosion and dilation are two techniques that we use to ensure that the contours are complete and clear. This is done to ensure correct detection. Then, the generation of a time-average frame: this is a crucial step in our process, simply because the time-average frame prevents everyday behavior from looking like anomalies. For example, if we are sitting in an office environment and there are staplers and different office tools that regularly reside on a particular table, those objects should not be detected as anomalous, and that is why we use the time average.
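The time-average idea can be sketched with NumPy alone. This is a minimal illustration, not the speakers’ code: it assumes an exponential running average (the same idea as OpenCV’s `cv2.accumulateWeighted`), and the class name, `alpha` and `threshold` values are made up for the example. Erosion and dilation are left out.

```python
import numpy as np


class TimeAveragedBackground:
    """Keep an exponential running average of grayscale frames as the reference."""

    def __init__(self, alpha=0.05):
        self.alpha = alpha    # weight of each new frame (assumed value)
        self.average = None   # float32 running average, set on the first frame

    def update(self, gray_frame):
        frame = gray_frame.astype(np.float32)
        if self.average is None:
            self.average = frame.copy()
        else:
            # average = (1 - alpha) * average + alpha * frame
            self.average = (1.0 - self.alpha) * self.average + self.alpha * frame
        return self.average

    def foreground_mask(self, gray_frame, threshold=30):
        """Mark pixels that differ from the running average by more than threshold."""
        diff = np.abs(gray_frame.astype(np.float32) - self.average)
        return diff > threshold


# A static "desk" scene is absorbed into the average; a newly placed bag is not.
background = TimeAveragedBackground(alpha=0.1)
desk = np.full((4, 4), 100, dtype=np.uint8)
for _ in range(20):
    background.update(desk)

with_bag = desk.copy()
with_bag[1:3, 1:3] = 200          # a 2x2 bright object appears
mask = background.foreground_mask(with_bag)
```

Because objects that sit in the scene all the time are part of the average, only the new 2x2 region ends up in `mask`.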
Looking further at object recognition itself, we’ve used OpenCV for simpler objects like boxes, but we’ve also taken it a step further and used pre-trained models, which were trained on the COCO dataset, which is an industry standard. We use these for more complex objects such as people, vehicles, et cetera. Finally, we also implement certain thresholds so that we reduce false positive detections. I’ll be handing it over to Vinamre now, and he will take you through a demo of our solution.
Vinamre: Thank you, Rishan. I’ll move over to the demo aspect of our solution and showcase how exactly we’ll be working on the abandoned object detection itself. Yeah, 5, 4, 3, 2, 1…
Awesome. Thanks a lot, Rishan, for the introduction and for discussing the overall agenda of today’s demo. This will be the technical brief and the technical demonstration that I’ll be beginning now. I’ll be sharing my notebook in a bit, and I’ll also be showing you a recorded demonstration of the real-time solution that we built for abandoned objects. We shall be showing you that when we drop an object and pick it up, we track both the distance of the person from the bag and how long the two have been that far apart. Those two identifiers are what define the use case for abandoned object detection.
Similarly, we’ve built our solution for other use cases as well, like loitering individuals. When individuals loiter in an area, we count the number of individuals. We also do a little bit of COVID social monitoring on them. We also have vehicle detection, as you can see with that vehicle currently. Those are some of the other use cases that we’ve built; today, because of time restrictions, we’ll only be demoing the first use case, which is the abandoned object. Thanks a lot, Rishan, for describing it so well.
So overall, I’ll first take you through the code that we’ve built around it. This is the conda environment that we built it on. We called it Garuda, because that’s the code name of the project, and these are the libraries and modules that we utilize. We have cv2; obviously OpenCV will be required. There are NumPy arrays, so for any image manipulation we’ll be doing, you will require NumPy as a module as well. You have requests, imutils and argparse. They’re all utility items.
There’s TensorFlow. Obviously, that’s the object detection and the [inaudible] we’re using. For abandoned objects, we had to build our own understanding of… and we built it using OpenCV. We’ll show you that in a minute. Import math: it’s a lot easier when you can use any of the math functions directly. And then some of the utilities, video streams and frames per second, just to maintain a constant, fluid frame rate, so that there are no frame drops in the video.
So, this is the class that we’ll define. The basic issue we had, as Rishan discussed, was that we had a lot of differing pre-trained models and a lot of different datasets that we could further train them on. Basically, our issue was that there were a lot of models, and they had different computational requirements. Some of our clients might have GPU servers at their location, or at the [inaudible] location. Those are very helpful, because then we can do a lot of quick intact API, according to their models, and we can have some highly accurate models, which are also computationally quite expensive. Those are things that varied according to computational needs.
Some of our clients only had enough CPU power. They did not have a lot of GPU power. They probably had a very basic integrated GPU in their systems. So, for those, we had a lower model that we used. Obviously, accuracy [inaudible], where it was good enough for a solution to still run accurately for the business requirement of that particular client. So, for that reason, we had the detector API. Whatever model you’ve created, whatever model you’ve deployed, if they’re more or less fixed as per the drift happening, you can pre-train it using a lot of transfer learning methodologies. You can train it for your particular use case as well. We found it to be quite useful before putting our model into a use case.
Second, we were processing some frames, and this is just the frame rate we’re getting for each and every single frame. We’re getting what the boxes are. This was not just for person detection; there are also results for cars, suitcases, balls. So, different objects would be detected by this. The dataset we were using was [inaudible], as I mentioned earlier, with the smaller SSD model that we had for our use case, right? Within that, we were scoring each and every bounding box, or each and every object detected. We also had boxes, which is just the bounding box that we formed over them. Then there is which class they belong to, so it could be person or a different object type. And num is the count, the number of objects.
It also helped us maintain harmony and maintain the overall accuracy of the system. So, if the system crashed, let’s say when we were running on something with only a CPU, we reduced this number, or we had a cap on the number of individuals that could be around in the use case that we built. So that is the detector API that we define. Then we move on to the actual implementation of these functions and class objects. We have the model path. Like we discussed, it is the SSD, which is built on COCO. This is also very much the standard for most of the datasets that are used. Then there is the detector API, which we defined, for this model path. Then the threshold for a person. We had a 0.5 threshold. We also tested it with 0.6 and 0.7. Between 0.5 and 0.7, it should work for your use case. For the person center, we initialize all the variables, and it’s in the list.
Basically, this is a while True loop, which opens up the image object and starts taking in the frames from the camera. It goes through the DVR using RTSP, which we discussed before, and then it goes on to manipulate each and every single frame. Here we’re only keeping the class which is human, which detects people, with the threshold that was defined as 0.5. According to that, we build boxes around each person and put the text “Person” on them, and if there was no person, we just set our person center to (0, 0), which is the top left of your frame. Then we also difference the frame against the reference image using absolute difference. This is particularly important because of the time average we discussed, which is something that I’ll be showing you in the demonstration. It’ll be a lot clearer then. Basically, what we’re doing is trying to extract the foreground objects from the background, and for that reference, [inaudible].
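The person-filtering step in this loop could look roughly like the sketch below. It assumes the detector returns parallel lists of boxes, scores and class ids, and that boxes are already in pixel coordinates as (x_min, y_min, x_max, y_max); the function name is hypothetical. Class id 1 is “person” in the COCO label map used by TensorFlow’s detection models, and 0.5 is the threshold quoted in the talk.

```python
PERSON_CLASS_ID = 1      # "person" in the COCO label map
PERSON_THRESHOLD = 0.5   # score cut-off quoted in the talk


def person_centers(boxes, scores, classes, threshold=PERSON_THRESHOLD):
    """Return the center (x, y) of every box classified as a person above threshold.

    Boxes are assumed to be (x_min, y_min, x_max, y_max) in pixels.
    """
    centers = []
    for box, score, class_id in zip(boxes, scores, classes):
        if class_id == PERSON_CLASS_ID and score > threshold:
            x_min, y_min, x_max, y_max = box
            centers.append(((x_min + x_max) // 2, (y_min + y_max) // 2))
    # Fall back to the top-left corner when nobody is in the frame.
    return centers or [(0, 0)]
```

Raising the threshold towards 0.7 trades missed detections for fewer false positives, which matches the range the speaker says was tested.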
So, this reference image became quite good for some of the use cases, if you took a time-average frame as the reference. For some use cases we also used adaptive thresholding, which is another manner in which you can extract the foreground objects. Some common things that were used after this were the contouring parts, so the image dilation or image erosion that might be required. Those are things that we also had in different use cases. They weren’t really required for the first use case, though. Then we converted it into the gray frame. Up to this point it is still an RGB colored frame, which is differenced from the reference frame; that is also a lot more accurate. Then we created the grayscale version, so that we could do other manipulations, like thresholding it and finding the different contours on it.
After that, we built a bounding box around each of those contours and kept the ones with an area above a certain limit, like a thousand pixels, and less than 20,000, which means they were not small boxes, which would be an error or something that’s not correct. We also checked width and height. Extremely long, thin objects may have an area bigger than this limit while the width or height is very small, so those we remove as well. Anyhow, we appended that [inaudible] to the list. And for all the objects in that list, we checked whether or not those objects were present across different frames, and also whether those objects were a particular distance away from the person object.
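The size filter described here can be sketched as a small predicate over bounding boxes in the (x, y, width, height) form that `cv2.boundingRect` returns. The area bounds are the ones mentioned in the talk; the aspect-ratio limit and the function name are assumptions, since the actual width and height limits are not stated.

```python
MIN_AREA = 1_000     # pixels, lower bound quoted in the talk
MAX_AREA = 20_000    # pixels, upper bound quoted in the talk
MAX_ASPECT = 4.0     # hypothetical limit on long side / short side


def is_candidate(box):
    """Decide whether a bounding box (x, y, width, height) could be a dropped object."""
    _, _, width, height = box
    if width <= 0 or height <= 0:
        return False
    area = width * height
    if not (MIN_AREA < area < MAX_AREA):
        return False          # too small (noise) or too large (not a bag-sized object)
    # Reject long, thin boxes whose area passes but whose shape is implausible.
    aspect = max(width, height) / min(width, height)
    return aspect <= MAX_ASPECT


boxes = [(0, 0, 50, 60), (5, 5, 10, 10), (0, 0, 300, 5), (0, 0, 200, 200)]
candidates = [box for box in boxes if is_candidate(box)]
```

Only the first box survives: the second is too small, the third is a thin strip, and the fourth is too large.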
We also had the timer, which we had built into this as well. This also maintains the FPS in the background. That is something that [inaudible]. You can also help this using the key functions. This also changes to what extent you want to do that. If you want a slightly more fluid frame rate, you can increase the number on this. Or you may want a frame which is more real time, so it’s a trade-off between real time and fluency. It also depends on the computational resources that you might or might not have. So, I’ll just show the demo now.
So, as you can see, this is the real-time frame, which is something… Let’s move this right here. This is the real-time frame which is coming in. It is also time-averaging the frame right now. What’s happening is that it’s creating a time average of all the pixels that are coming in. These are fairly constant. It also helps us remove any objects which are just statically there, or even adjust for brightness values over a period of days. So, we take a 24-hour-long time-average frame if we also want to adjust for brightness or different contrast levels in a deployed scenario. Here you see, I drop the bag… and I come back, and it tries to detect whether it’s an abandoned object or not. After a few checks, basically distance and time, it detects abandoned objects within the video frame. This display is something you need if you’re demoing, as I am right now, to show how the code exactly works, but in a deployed solution, we don’t need to be seeing this frame.
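The “distance and time” checks mentioned in the demo can be illustrated with a small state holder like the one below. This is a sketch under assumptions, not the code shown in the notebook: the class name and both threshold values are hypothetical, and timestamps are passed in explicitly so the logic is easy to test.

```python
import math


class AbandonmentCheck:
    """Flag an object once it has stayed far from every person for long enough."""

    def __init__(self, min_distance=150.0, min_seconds=10.0):
        self.min_distance = min_distance   # pixels (hypothetical value)
        self.min_seconds = min_seconds     # seconds (hypothetical value)
        self.unattended_since = None       # time at which everyone first moved away

    def update(self, object_center, person_centers, now):
        """Return True when the object counts as abandoned at time `now` (seconds)."""
        nearest = min(
            (math.dist(object_center, person) for person in person_centers),
            default=math.inf,              # nobody in frame counts as far away
        )
        if nearest < self.min_distance:
            self.unattended_since = None   # someone is close, reset the clock
            return False
        if self.unattended_since is None:
            self.unattended_since = now
        return now - self.unattended_since >= self.min_seconds
```

Picking the bag up again, or simply walking back to it, resets the timer, so the alarm only fires when both the distance and the duration conditions hold together.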
So, all the computational requirement of actually displaying the frame on our screen will also go away. You just have a system which will detect the event and trigger the alarm. Like you see here, we print out a statement, “The alarm was successfully triggered.” This is basically triggering the alarm and sending this part of the video. So, 30 seconds or three minutes, whatever the client’s requirement is, we will have that part of the video sent to the central monitoring system, which will monitor the video and tell us whether the alarm was correctly triggered or not. That applies across different types of use cases. This also helps us send all these videos into the Delta Lake format, and into something that we call, let’s say, a summarized video format, from which clients can get the summary video.
So we don’t have to go through hours and hours of video feed. We can rely only on these short snippets of the feed and extract a lot of value from them. This makes surveillance systems very autonomous, and it is also a very efficient use of your manpower.
We further move on to the metrics aspect of it. As you saw in the video, it was quite good and at a pretty high FPS as well. I thank everyone for this. Thank you.
One of the key metrics that we were able to track for our solution was that we were able to get 25 to 30 frames per second on a GPU-enabled system. This is also a standard for the pre-trained models and dataset that we use, and we got really high accuracy rates as well. One of the business impacts of our solution is that the video feed can be summarized into an activity log of searchable events. Like Rishan was mentioning earlier, the problem in traditional surveillance analytics is that there are hours and hours of footage that individuals have to go through manually to understand where an anomaly happened or where a crime might have happened. This is also very post facto. So, if we can make an automated solution, and there can be an activity log of searchable events, you can just search, go back to that particular event, and see what exactly went wrong and who committed that particular crime. It becomes a lot easier in terms of the business impact.
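To make the idea of a searchable activity log concrete, here is a minimal sketch that uses Python’s built-in `sqlite3` module as a stand-in store. The talk keeps this data in Delta Lake, so SQLite is a substitution chosen only to keep the example self-contained, and the table layout and function names are hypothetical.

```python
import sqlite3


def create_event_log(path=":memory:"):
    """Create (or open) the event table; an in-memory database by default."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS events (
               id INTEGER PRIMARY KEY,
               camera_id TEXT NOT NULL,
               event_type TEXT NOT NULL,
               started_at TEXT NOT NULL,
               clip_path TEXT
           )"""
    )
    return conn


def log_event(conn, camera_id, event_type, started_at, clip_path=None):
    """Record one detected event; started_at is an ISO 8601 timestamp string."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO events (camera_id, event_type, started_at, clip_path) "
            "VALUES (?, ?, ?, ?)",
            (camera_id, event_type, started_at, clip_path),
        )


def find_events(conn, event_type, since):
    """Return (camera_id, started_at, clip_path) rows of one type from `since` on."""
    rows = conn.execute(
        "SELECT camera_id, started_at, clip_path FROM events "
        "WHERE event_type = ? AND started_at >= ? ORDER BY started_at",
        (event_type, since),
    )
    return rows.fetchall()


log = create_event_log()
log_event(log, "cam-1", "abandoned_object", "2021-05-01T09:15:00", "clips/a.mp4")
log_event(log, "cam-2", "loitering", "2021-05-01T10:00:00")
log_event(log, "cam-1", "abandoned_object", "2021-05-02T18:30:00", "clips/b.mp4")
recent = find_events(log, "abandoned_object", since="2021-05-02T00:00:00")
```

Instead of scrubbing through footage, an investigator queries by event type and time and goes straight to the stored snippet.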
It also reduces the cost of scaling, because when you scale out to big complexes, with many cameras across large areas and large sections of space, you also need a lot more manpower. There’s a cost reduction when you scale with our solution. We also think that AI significantly increases accuracy and improves responsiveness, because now you’re no longer relying on a human who may commit an error or let human biases creep in. That’s something that is going to be a lot more accurate.
Some key considerations to note in going forward with our solution: since video data is quite large in nature, and also quite unstructured, GPU-based computing is quite critical. Especially if you’re doing real-time surveillance analytics and deploying it in a real-time solution like we have, you will require a GPU-based solution, that is, a graphics processing unit. There’s also network latency that needs to be taken care of, which is the delay that might happen in real time. There might be internet bandwidth issues, which might create issues with the packets reaching the place where the video is being processed. Those might also create some sort of lag and affect the accuracy. Those are some key considerations to note.
In terms of Databricks Delta, we think that when we’re creating that summarized video data, Databricks Delta helps in efficiently probing that summarized data. You see how an activity log can be created using Databricks Delta, and you can pick and choose which exact anomaly you want to probe into and investigate further. That’s something that can be done efficiently using Databricks Delta. Lastly, managing object detection and tracking models becomes significantly easier using MLflow experiment management. When you’re using a lot of image detection models, or people detection models, in deployment, a lot of drift can happen, which means that your models are no longer accurate. To keep the models up to date and constantly improving, you can use MLflow experiment management, which can help you do that task quite efficiently.
How do we scale this particular solution with Databricks? We’re looking at a lot of sensors, IoT, and a lot of unstructured data that’s coming in. This could be burglar alarms. This could be fire alarms. This could be any kind of sensor, even motion detectors, that you could be getting data from. You also have the activity log, which is the searchable event activity log, and the video feed that’s coming in. All of that is ingested using the data store or Data Factory into the Data Lake storage. The data in it really helps us, like I explained, in pinpointing which exact anomaly and which exact searchable event you want to go back into and investigate. Delta Lake is really helpful for scaling that solution.
In terms of tracking and training with MLflow experiment management, like I explained on the previous slide, it’s really helpful for deploying models and keeping them up to date. So, the three use cases that we dealt with were these. Abandoned object detection: an individual might abandon an object and move away from it. Loitering individuals: when individuals loiter around in an area or in a zone. And particularly [inaudible] is vehicles: when vehicles are parked in illegal parking areas, or when they’re going where they’re not supposed to be going. Those and more solutions are something that we can scale with Databricks. All of these activities can be logged and summarized into the central monitoring system.
In terms of the conclusions: we demonstrated video analytics on a live surveillance feed. We used open-source packages and frameworks like OpenCV, NumPy, TensorFlow and Keras, and many more. We developed a real-time surveillance analytics pipeline. We showed you how we can scale it with Databricks as well. For any feedback, please feel free to reach out to us at firstname.lastname@example.org. You can also talk to us about the different surveillance analytics use cases that we’ve built and that we’re looking forward to building with you. Thanks a lot, everyone, for your time.