Abstract: With the continuous advancement of smart city development, safety issues at building edges have become increasingly severe, as accidental falls and falling objects occur frequently, creating an urgent need for more intelligent and efficient monitoring solutions. To address the limited temporal modeling capability of current object detection methods, particularly in recognizing small, occluded, and fast-moving targets, this study proposes a video detection framework that integrates multiple temporal semantic enhancement mechanisms for the unified detection of both people and falling objects. The proposed method is built on a Faster R-CNN backbone and incorporates three temporal-aware modules: a Motion-Aware Module (MAM), Temporal Region of Interest Align (TRoI Align), and a Sequence-Level Semantic Aggregation (SELSA) head. These modules enhance the model's perception of dynamic objects in complex temporal scenes from three perspectives: motion saliency modeling, spatial alignment, and semantic aggregation. To support model training and evaluation, a dedicated video dataset covering multiple building-edge scenarios and various types of risk targets is constructed. Experimental results demonstrate that the proposed method performs strongly on both the "detection of personnel behavior at building edges" and "falling object detection" tasks, exhibiting excellent cross-task robustness and practical application potential.