Unsupervised Uncertainty Estimation Using Spatiotemporal Cues in Video Saliency Detection

 Collaborators: Tariq AlShawi, Zhiling Long, and Ghassan AlRegib


Goal/Motivation: To quantify the uncertainty of video saliency maps, improving saliency-based video processing algorithms and enabling more reliable performance and objective risk assessment of the applications that rely on them.

Challenges: Most existing research efforts focus on computational saliency models; far less attention has been given to quantifying the reliability of the generated saliency maps. The validity of such maps is crucial for integrating visual attention into various image and video processing applications. It is common practice to treat the validity of a saliency detection model, at every pixel, as directly related to its average performance on image and video datasets. In other words, a saliency detection model is first evaluated on typical visual stimuli datasets with eye-tracking data. Algorithms that detect salient regions effectively, according to the ground truth predefined in the dataset, are then assumed to perform well when used in various applications. However, such saliency detectors might fail to produce reliable results in certain contexts or situations, despite their superior performance elsewhere. Thus, it is important to consider the reliability of a saliency map given the context of the image or video at hand.

High Level Description of the Work: We address the problem of quantifying the uncertainty of detected saliency maps for videos. First, we study spatiotemporal eye-fixation data from the public CRCNS dataset [1] and demonstrate that there is typically high correlation in saliency between a pixel and its direct neighbors. Then, we propose estimating a pixel-wise uncertainty map that reflects our confidence in the computational saliency map by relating a pixel's value to the values of its direct neighbors in a computationally efficient way. The novelty of this method is that it is unsupervised and independent of the dataset used for testing, which makes it more suitable for generalization. Also, the method exploits information from both the spatial and temporal domains, and is thus uniquely suitable for videos. Moreover, the flexibility of the algorithm parameters allows for customization to specific video content. Additionally, we propose a systematic procedure to evaluate uncertainty estimation performance by explicitly computing an uncertainty ground truth in terms of a given saliency map and the eye fixations of human subjects watching the associated video segment. A minimal sketch of the neighborhood-based idea is given below.
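The full formulation and parameter choices are given in the publications below. As a rough illustration only, the following sketch estimates per-pixel uncertainty as the deviation of each saliency value from the mean of its spatiotemporal neighborhood; the 3x3x3 window, the absolute-deviation measure, and the function name estimate_uncertainty are illustrative assumptions, not the published algorithm.

```python
import numpy as np
from scipy.ndimage import uniform_filter


def estimate_uncertainty(saliency_volume, size=(3, 3, 3)):
    """Illustrative sketch (not the published algorithm).

    saliency_volume: float array of shape (frames, height, width) holding a
    computational saliency map for each video frame, with values in [0, 1].
    Each pixel's uncertainty is taken as the absolute deviation of its
    saliency value from the mean of its spatiotemporal neighborhood; pixels
    that disagree strongly with their direct neighbors receive high uncertainty.
    """
    volume = np.asarray(saliency_volume, dtype=np.float64)
    # Mean over a (temporal, vertical, horizontal) neighborhood around each pixel.
    neighborhood_mean = uniform_filter(volume, size=size, mode="nearest")
    return np.abs(volume - neighborhood_mean)


if __name__ == "__main__":
    # Example: a random 10-frame, 64x64 saliency volume.
    rng = np.random.default_rng(0)
    saliency = rng.random((10, 64, 64))
    uncertainty = estimate_uncertainty(saliency)
    print(uncertainty.shape)  # (10, 64, 64)
```

Larger spatial or temporal window sizes trade sensitivity to local disagreement for robustness to noise, which is one way the parameters could be tuned to specific video content.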

Related Publications

  1. Alshawi, Tariq, Zhiling Long, and Ghassan AlRegib. "Unsupervised uncertainty estimation in saliency detection for videos using temporal cues." IEEE Global Conference on Signal and Information Processing (GlobalSIP), Orlando, Florida, Dec. 2015. [PDF][Code]

  2. Alshawi, Tariq, Zhiling Long, and Ghassan AlRegib. "Understanding spatial correlation in eye-fixation maps for visual attention in videos." IEEE International Conference on Multimedia and Expo (ICME), 2016. [PDF][Code]

Datasets Used:

  1. https://crcns.org/data-sets/eye/eye-1