Active Frame, Location, and Detector Selection for Automated and Manual Video Annotation

Abstract

We describe an information-driven active selection approach to determine which detectors to deploy, at which location, in which frame of a video shot, in order to minimize semantic class label uncertainty at every pixel, at the smallest computational cost that ensures a given uncertainty bound. We show minimal performance reduction compared to a "paragon" algorithm that runs all detectors at all locations in all frames, at a small fraction of the computational cost. Our method accounts for uncertainty in the labeling mechanism, so it handles both "oracles" (manual annotation) and noisy detectors (automated annotation).

Overview

Results

Several video examples of the "baseline" (all frames labeled, without temporal consistency) and of our approach (using 20% of the frames with temporal consistency) can be seen below. Several still frames are also shown below.

Code and dataset

A MATLAB implementation of the algorithms described in the paper, together with the data, can be downloaded here (320 MB). To evaluate our approach we use video sequences from Human-Assisted Motion [1], ViSOR [2], MOSEG [3], and Berkeley Video Segmentation [4], as well as additional videos from Flickr. Frames from these sequences are shown below. Pixelwise ground-truth annotations and frames from our sequences can be downloaded here (240 MB).

If you use this work in your research, please cite our paper:
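The selection criterion described in the abstract can be illustrated with a short sketch: greedily choose the (frame, location, detector) action with the largest expected entropy reduction per unit cost, stopping once every pixel's label entropy is below a bound. This is an informal Python illustration under assumed names and a simple confusion-matrix detector model; it is not the released MATLAB code and may differ from the exact algorithm in the paper.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a categorical distribution along the last axis."""
    p = np.clip(p, eps, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def expected_entropy_after(post, confusion):
    """Expected posterior entropy at one pixel after running a detector.

    post      : (K,) current class posterior at the pixel
    confusion : (K, K) detector model, confusion[j, k] = P(report j | true class k)
    """
    K = post.shape[0]
    exp_h = 0.0
    for j in range(K):
        lik = confusion[j, :]                 # P(report j | class k)
        p_obs = float(lik @ post)             # marginal probability of report j
        if p_obs <= 0.0:
            continue
        new_post = lik * post / p_obs         # Bayes update given report j
        exp_h += p_obs * entropy(new_post)
    return exp_h

def greedy_select(posteriors, candidates, confusions, costs, h_max):
    """Pick (frame, location, detector) actions greedily by expected entropy
    reduction per unit cost, until every pixel's entropy is below h_max.

    posteriors : dict (frame, location) -> (K,) class posterior
    candidates : list of (frame, location, detector_id) actions
    confusions : dict detector_id -> (K, K) confusion matrix
    costs      : dict detector_id -> scalar computational cost
    """
    chosen = []
    remaining = list(candidates)
    while remaining and max(entropy(p) for p in posteriors.values()) > h_max:
        best, best_gain = None, 0.0
        for (f, x, d) in remaining:
            post = posteriors[(f, x)]
            gain = (entropy(post) - expected_entropy_after(post, confusions[d])) / costs[d]
            if gain > best_gain:
                best, best_gain = (f, x, d), gain
        if best is None:                      # no action reduces uncertainty further
            break
        chosen.append(best)
        remaining.remove(best)
        # In the full system, the posterior at (frame, location) would be updated
        # with the detector's actual output and propagated temporally; here we
        # only record the selected actions.
    return chosen
```

In this simplified sketch the posteriors are not updated after each selection, so the loop serves only to rank candidate actions; plugging in the actual detector outputs (or an oracle's answers) and a temporal propagation step would complete the loop.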
References
Please report problems with this page to Vasiliy Karasev.