You can also find the paper on PapersWithCode here.

## Abstract

- Introduces VideoTaskformer, a new pre-trained video model focused on representing the semantics and structure of instructional videos
- Pre-trains VideoTaskformer by predicting weakly supervised textual labels for steps that are randomly masked out of an instructional video
- Learns step representations globally, using video of the entire surrounding task as context
- Introduces two new benchmarks for detecting mistakes in instructional videos, plus a long-term forecasting benchmark, and outperforms previous baselines on these tasks
- Evaluates VideoTaskformer on 3 existing benchmarks, achieving new state-of-the-art performance

## Paper Content

### Introduction

- Motivating example: building a bookshelf from a YouTube video means repeatedly hitting pause; an interactive assistant could instead guide the user through the task
- A composite task involves multiple fine-grained activities, so an ideal assistant needs both high-level and low-level understanding
- Prior work models step representations from single short video clips
- VideoTaskformer learns step representations for masked video steps, taking the whole video as input
- A mistake detection task and dataset are introduced for verifying video representations
- The network learns to predict labels for the masked steps, and the resulting representations improve performance on downstream tasks
- VideoTaskformer is capable of detecting mistake types

### Related works

- Large-scale narrated instructional video datasets enable learning joint video-language representations and task structure from videos
- The Assembly-101 and IKEA ASM datasets provide videos of people assembling and disassembling toys and furniture
- Existing benchmarks for evaluating representations learned on instructional video datasets include step localization, step classification, procedural activity recognition, and step forecasting
- Recent works attempt to learn procedures from instructional videos
- Video action recognition models have improved over the last few years
- Some works learn representations for longer video clips containing semantically more complex actions

### Learning task structure through masked modeling of steps

- Goal: learn task-aware step representations from instructional videos
- VideoTaskformer is a video model pre-trained with a BERT-style masked modeling loss, where masking is done at the step level
- The framework consists of two stages: pre-training and fine-tuning
- Pre-training is done on weakly labeled data; during fine-tuning, a subset of the parameters is adjusted using labeled data from the downstream tasks
- The pre-training approach uses a masked step modeling loss, extending the masked language modeling techniques of BERT and VideoBERT (a minimal sketch of this masking scheme appears at the end of this post)
- Step representations are learned from the entire video, with all steps given as input
- Step classification and distribution matching are used as training objectives
- The model is evaluated on 6 downstream tasks

### Downstream tasks

- Mistake step detection: identify which step in a video is incorrect (a fine-tuning sketch for this task is also given at the end of this post)
- Mistake ordering detection: verify whether the steps in a video are in the correct temporal order
- Short-term forecasting: predict the next step label given the previous n segments
- Long-term step forecasting: predict the step labels for the next 5 steps given a single step
- Procedural activity recognition: recognize the procedural activity (i.e., the task being performed) from the video

## Link to paper

The full paper is available here.
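
To make the pre-training idea above concrete, here is a minimal PyTorch sketch of BERT-style masked modeling applied at the step level. It assumes each step clip has already been encoded into a fixed-size feature vector and that weakly supervised step labels are available; the dimensions, class count, and masking ratio are illustrative assumptions, not values from the paper.

```python
# A minimal sketch of step-level masked modeling. Assumes step clips are
# pre-encoded into fixed-size feature vectors; dimensions, class count, and
# masking ratio are illustrative, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedStepModel(nn.Module):
    def __init__(self, feat_dim=768, num_step_classes=1000, num_layers=6, num_heads=8):
        super().__init__()
        # learned [MASK] embedding that replaces the features of masked-out steps
        self.mask_token = nn.Parameter(torch.zeros(feat_dim))
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # classification head over the weakly supervised step-label vocabulary
        self.step_head = nn.Linear(feat_dim, num_step_classes)

    def forward(self, step_feats, mask):
        # step_feats: (B, S, feat_dim) features for the S steps of one task video
        # mask:       (B, S) bool, True where a step is masked out
        # (positional embeddings omitted for brevity)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(step_feats), step_feats)
        x = self.encoder(x)           # every step attends to the whole task
        return self.step_head(x)      # (B, S, num_step_classes) logits

def masked_step_loss(model, step_feats, step_labels, mask_prob=0.25):
    # step_labels: (B, S) weakly supervised step-label indices
    mask = torch.rand(step_feats.shape[:2], device=step_feats.device) < mask_prob
    logits = model(step_feats, mask)
    # cross-entropy only on the masked positions (the step-classification objective)
    return F.cross_entropy(logits[mask], step_labels[mask])
```

Because every step attends to every other step in the transformer encoder, the prediction for a masked step is conditioned on the entire surrounding task, which is the global, task-aware property the summary emphasizes.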
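
Along the same lines, the sketch below shows one hypothetical way the resulting encoder could be fine-tuned for mistake step detection, one of the downstream tasks listed above. The scoring head, loss, and labeling scheme (a single mistaken-step index per video) are assumptions for illustration; the paper's exact fine-tuning recipe may differ.

```python
# A hypothetical fine-tuning sketch for mistake step detection, reusing the
# contextual encoder from MaskedStepModel above. Head, loss, and label format
# are illustrative assumptions, not the paper's recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MistakeStepDetector(nn.Module):
    def __init__(self, pretrained, feat_dim=768):
        super().__init__()
        self.backbone = pretrained.encoder        # pre-trained contextual step encoder
        self.score_head = nn.Linear(feat_dim, 1)  # "how likely is this step a mistake?"

    def forward(self, step_feats):
        # step_feats: (B, S, feat_dim); no masking at fine-tuning time
        ctx = self.backbone(step_feats)           # each step sees the whole task as context
        return self.score_head(ctx).squeeze(-1)   # (B, S) one score per step

def mistake_detection_loss(detector, step_feats, mistake_index):
    # mistake_index: (B,) index of the incorrect step in each video
    scores = detector(step_feats)
    # treat detection as picking the mistaken step out of the S steps
    return F.cross_entropy(scores, mistake_index)
```

Mistake ordering detection could be handled analogously, e.g. with a binary head over a pooled sequence representation that predicts whether the step order is correct.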