Abstract
In this growing age of technology, various sensors are used to capture data
from their surrounding environments. The captured data is multimedia in nature. For
example, CCTV cameras are used in places where security matters or where
continuous monitoring is required. Hence object detection, object recognition, and face
recognition have become key elements of city surveillance applications. Manual surveillance
is time-consuming, and storing the captured data requires huge space; hence video
surveillance contributes significantly to unstructured big data. All surveillance
techniques and approaches are based on Object Tracking, Target Tracking, Object
Recognition, and Object Mobile Tracking Systems (OMTS). The main difficulty,
however, lies in processing the captured data effectively in real time. Therefore, finding a solution
still needs careful consideration. This paper mainly targets smart city
surveillance systems and reviews existing surveillance systems based on various
technologies such as wireless sensor networks, machine learning, and deep
learning. The author identifies the problems in the existing methods and summarizes
them in the paper. The aim is to point out the various challenges and offer new
research prospects for multimedia-oriented surveillance systems, as opposed to traditional
surveillance systems, in the smart city network architecture. The survey in this
paper starts with object recognition and proceeds to action recognition, image
annotation, and scene understanding. It also presents a comparative
analysis of algorithms, models, and datasets, in addition to the
methodologies themselves.