Over the past decade, feature extraction and knowledge acquisition based on video analysis have been extensively researched and tested, with varying degrees of success, in applications such as closed-circuit television (CCTV) data analysis, large-scale public event control, and other routine security monitoring and surveillance operations. However, video processing is a multi-phase task that draws on a wide range of theories and techniques, from fundamental image processing, computational geometry and graphics, and machine vision to advanced artificial intelligence, pattern analysis, and even cognitive science, so many important problems remain to be resolved before it can be widely applied. Video event identification and detection are two prominent ones. In contrast to the frame-to-frame processing mode used by most of today's approaches and systems, this project reorganizes video data as a 3D volume structure that provides hybrid spatial and temporal information in a unified space. This paper reports an innovative technique for transforming the original video frames into 3D volume structures described by spatial and temporal features. It then highlights the use of the volume array structure in a so-called "pre-suspicion" mechanism that prepares the data for later processing. The focus of this report is the development of an effective and efficient voxel-based segmentation technique that suits the volumetric nature of video events and is ready for deployment in 3D clustering operations. The paper concludes with a performance evaluation of the devised technique and a discussion of future work on accelerating the pre-processing of the original video data.
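To make the idea of a spatio-temporal volume concrete, the following minimal sketch stacks grayscale frames into a volume indexed as volume[t][y][x], marks voxels whose intensity changes between consecutive frames, and groups them into 6-connected components. It is an illustration only, not the technique devised in this paper: the function names, the frame-difference threshold, and the flood-fill labelling are all assumptions chosen for brevity.

```python
from collections import deque


def frames_to_volume(frames):
    """Stack equally sized grayscale frames into a volume indexed as volume[t][y][x]."""
    height, width = len(frames[0]), len(frames[0][0])
    for frame in frames:
        if len(frame) != height or any(len(row) != width for row in frame):
            raise ValueError("all frames must share the same dimensions")
    return [[list(row) for row in frame] for frame in frames]


def active_voxels(volume, threshold):
    """Flag voxels whose intensity differs from the previous frame by more than threshold.

    Assumption: a plain frame difference stands in for whatever feature the
    real system would use to decide that a voxel is of interest.
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    mask = [[[False] * width for _ in range(height)] for _ in range(depth)]
    for t in range(1, depth):
        for y in range(height):
            for x in range(width):
                change = abs(volume[t][y][x] - volume[t - 1][y][x])
                mask[t][y][x] = change > threshold
    return mask


def label_components(mask):
    """Label 6-connected groups of active voxels; returns (labels, component_count)."""
    depth, height, width = len(mask), len(mask[0]), len(mask[0][0])
    labels = [[[0] * width for _ in range(height)] for _ in range(depth)]
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    count = 0
    for t in range(depth):
        for y in range(height):
            for x in range(width):
                if not mask[t][y][x] or labels[t][y][x]:
                    continue
                # Start a new component and flood-fill it breadth-first.
                count += 1
                labels[t][y][x] = count
                queue = deque([(t, y, x)])
                while queue:
                    ct, cy, cx = queue.popleft()
                    for dt, dy, dx in steps:
                        nt, ny, nx = ct + dt, cy + dy, cx + dx
                        inside = 0 <= nt < depth and 0 <= ny < height and 0 <= nx < width
                        if inside and mask[nt][ny][nx] and not labels[nt][ny][nx]:
                            labels[nt][ny][nx] = count
                            queue.append((nt, ny, nx))
    return labels, count
```

Because neighbouring voxels are connected along the time axis as well as in the image plane, a change that persists at the same pixel over several frames ends up in a single component, which is the kind of grouping a frame-by-frame pass cannot express directly.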