SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory

    Published 12/3/2024 by Cheng-Yen Yang, Hsiang-Wei Huang, Wenhao Chai, Zhongyu Jiang, Jenq-Neng Hwang

    Overview

    • New visual tracking system called SAMURAI that adapts the Segment Anything Model 2 (SAM 2) for zero-shot object tracking
    • Uses motion-aware memory to track objects across video frames without additional training or fine-tuning
    • Combines SAM 2's segmentation abilities with motion prediction
    • Achieves state-of-the-art zero-shot performance on standard tracking benchmarks (LaSOT, LaSOT-ext, GOT-10k)
    • Operates without prior knowledge of object categories

    SAM 2 struggles in crowded scenes and occlusions.

    Original caption: Figure 1: Illustration of two common failure cases in visual object tracking using SAM 2: (1) In a crowded scene with similar appearances between target and background objects, SAM 2 tends to ignore the motion cue and predict where the mask has the higher IoU score. (2) The original memory bank simply chooses and stores the previous n frames into the memory bank, resulting in introducing some bad features during occlusion.

    Visual object tracking results on LaSOT, LaSOT-ext, and GOT-10k datasets.

    Trackers        Source   LaSOT                            LaSOText                         GOT-10k
                             AUC (%)    Pnorm (%)  P (%)      AUC (%)    Pnorm (%)  P (%)      AO (%)     OP0.5 (%)  OP0.75 (%)
    Supervised method
    SiamRPN++ [27]  CVPR’19  49.6       56.9       49.1       34.0       41.6       39.6       51.7       61.6       32.5
    DiMP288 [13]    CVPR’20  56.3       64.1       56.0       -          -          -          61.1       71.7       49.2

    Original caption: Table 1: Visual object tracking results on LaSOT [16], LaSOText [17], and GOT-10k [23]. † LaSOText are evaluated on trackers to be trained with LaSOT. ‡ GOT-10k protocol only allows trackers to be trained using its corresponding train split. The T, S, B, L represents the size of the ViT-based backbone while the subscript is the search region. Bold represents the best while underline represents the second.
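
    The GOT-10k columns in the table can be read as simple statistics over per-frame overlaps. The sketch below assumes AO is the mean IoU between predicted and ground-truth boxes and that OP at a threshold is the fraction of frames whose IoU exceeds it. The function names and the strict ">" comparison are illustrative choices, not taken from the benchmark toolkit.

```python
from typing import Sequence


def average_overlap(ious: Sequence[float]) -> float:
    """Mean IoU over all frames of a sequence (AO), as a fraction in [0, 1]."""
    return sum(ious) / len(ious)


def overlap_precision(ious: Sequence[float], threshold: float) -> float:
    """Fraction of frames whose IoU is above `threshold` (e.g. OP0.5, OP0.75).

    Assumption: a strict '>' comparison; an official toolkit may differ.
    """
    hits = sum(1 for value in ious if value > threshold)
    return hits / len(ious)


# Hypothetical per-frame IoUs for one short sequence.
example_ious = [0.9, 0.6, 0.4, 0.8]
ao = average_overlap(example_ious)
op50 = overlap_precision(example_ious, 0.5)
op75 = overlap_precision(example_ious, 0.75)
```

    Multiplying these fractions by 100 gives percentages in the form the table reports.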

    Plain English Explanation

    SAMURAI works like a digital eye that can follow objects in videos without needing to be trained on them first. Think of it like a security guard who can track a person moving through different camera feeds, but for any object - not just people.

    The system uses two main components: the Segment Anything Model, which identifies object boundaries, and a motion prediction system that anticipates where objects will move next. This is similar to how humans track moving objects: we recognize an object's shape and also predict the path it will take.

    Key Findings

    • Achieved competitive performance against specialized tracking systems
    • Successfully tracked objects through occlusion and appearance changes
    • Demonstrated ability to track any object category without prior training
    • Memory system effectively maintained object identity across frames
    • Showed robust performance in challenging scenarios like fast motion and deformation

    Technical Explanation

    SAMURAI builds on SAM 2 by adding a motion-aware memory mechanism. The system maintains a history of object appearances and positions, using this information to predict future locations. The zero-shot capability comes from SAM 2's general understanding of object boundaries combined with motion patterns.
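
    One way to picture how a position history can steer mask selection is sketched below. It is only an illustration: the constant-velocity predictor stands in for whatever motion model the paper uses, and the names `predict_next_box`, `select_candidate`, and the weight `alpha` are hypothetical. Each candidate is scored by a blend of its overlap with the predicted box and its own confidence.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def predict_next_box(history: List[Box]) -> Box:
    """Constant-velocity extrapolation from the last two boxes (an assumption)."""
    if len(history) < 2:
        return history[-1]
    prev, last = history[-2], history[-1]
    return tuple(2 * l - p for l, p in zip(last, prev))


def select_candidate(history: List[Box],
                     candidates: List[Tuple[Box, float]],
                     alpha: float = 0.5) -> int:
    """Return the index of the candidate with the best blended score.

    Each candidate is (box, confidence). `alpha` weights agreement with the
    motion prediction against the candidate's own confidence.
    """
    predicted = predict_next_box(history)
    scores = [alpha * iou(predicted, box) + (1 - alpha) * conf
              for box, conf in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# The target has been moving right by 5 pixels per frame.
history = [(0, 0, 10, 10), (5, 0, 15, 10)]
# A look-alike at the old position has higher confidence than the true target.
candidates = [((0, 0, 10, 10), 0.9), ((10, 0, 20, 10), 0.8)]
chosen = select_candidate(history, candidates)
```

    With `alpha=0` the choice falls back to confidence alone, which mirrors the crowded-scene failure described in the Figure 1 caption.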

    The architecture processes frames sequentially, updating its memory bank with new object appearances while removing outdated information. This allows it to adapt to appearance changes while maintaining consistent tracking.
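
    A minimal sketch of such a memory bank follows, assuming each frame comes with a scalar quality score and that low-quality frames (for example, during occlusion) are simply not stored. The class name, the `quality` field, the threshold, and the default capacity are all hypothetical choices for illustration, not the paper's actual selection rule.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class MemoryEntry:
    frame_index: int
    feature: List[float]  # placeholder for the frame's encoded features
    quality: float        # assumed per-frame reliability score in [0, 1]


class SelectiveMemoryBank:
    """Fixed-size memory that skips unreliable frames and evicts the oldest."""

    def __init__(self, capacity: int = 7, min_quality: float = 0.5):
        self.min_quality = min_quality
        # deque(maxlen=...) drops the oldest entry automatically when full.
        self.entries: Deque[MemoryEntry] = deque(maxlen=capacity)

    def update(self, entry: MemoryEntry) -> bool:
        """Store the entry if it is reliable enough; report whether it was kept."""
        if entry.quality < self.min_quality:
            return False
        self.entries.append(entry)
        return True

    def frame_indices(self) -> List[int]:
        return [e.frame_index for e in self.entries]


bank = SelectiveMemoryBank(capacity=3, min_quality=0.5)
for index, quality in enumerate([0.9, 0.8, 0.2, 0.7, 0.1, 0.95]):
    bank.update(MemoryEntry(frame_index=index, feature=[], quality=quality))
kept = bank.frame_indices()
```

    Compared with storing the previous n frames unconditionally, this keeps frames 2 and 4 (the low-quality ones) out of memory while still letting older entries age out.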

    Critical Analysis

    The current implementation faces challenges with multiple similar objects and extreme lighting changes. The system's reliance on SAM's segmentation quality means errors can propagate through the tracking sequence.

    Further research could explore:

    • Handling multiple object interactions
    • Improving performance in low-light conditions
    • Reducing computational requirements
    • Incorporating temporal consistency mechanisms

    Conclusion

    SAMURAI represents a significant step toward general-purpose visual tracking systems. Its ability to track arbitrary objects without training makes it valuable for applications like robotics, surveillance, and augmented reality. The success demonstrates how foundation models like SAM can be adapted for specialized tasks while maintaining their zero-shot capabilities.

    Full paper

    Read original: arXiv:2411.11922



    This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

    Related Papers

    Efficient Track Anything

    Yunyang Xiong, Chong Zhou, Xiaoyu Xiang, Lemeng Wu, Chenchen Zhu, Zechun Liu, Saksham Suri, Balakrishnan Varadarajan, Ramya Akula, Forrest Iandola, Raghuraman Krishnamoorthi, Bilge Soran, Vikas Chandra

    Segment Anything Model 2 (SAM 2) has emerged as a powerful tool for video object segmentation and tracking anything. Key components of SAM 2 that drive the impressive video object segmentation performance include a large multistage image encoder for frame feature extraction and a memory mechanism that stores memory contexts from past frames to help current frame segmentation. The high computation complexity of multistage image encoder and memory module has limited its applications in real-world tasks, e.g., video object segmentation on mobile devices. To address this limitation, we propose EfficientTAMs, lightweight track anything models that produce high-quality results with low latency and model size. Our idea is based on revisiting the plain, nonhierarchical Vision Transformer (ViT) as an image encoder for video object segmentation, and introducing an efficient memory module, which reduces the complexity for both frame feature extraction and memory computation for current frame segmentation. We take vanilla lightweight ViTs and efficient memory module to build EfficientTAMs, and train the models on SA-1B and SA-V datasets for video object segmentation and track anything tasks. We evaluate on multiple video segmentation benchmarks including semi-supervised VOS and promptable video segmentation, and find that our proposed EfficientTAM with vanilla ViT perform comparably to SAM 2 model (HieraB+SAM 2) with ~2x speedup on A100 and ~2.4x parameter reduction. On segment anything image tasks, our EfficientTAMs also perform favorably over original SAM with ~20x speedup on A100 and ~20x parameter reduction. On mobile devices such as iPhone 15 Pro Max, our EfficientTAMs can run at ~10 FPS for performing video object segmentation with reasonable quality, highlighting the capability of small models for on-device video object segmentation applications.

    12/2/2024

    Performance and Non-adversarial Robustness of the Segment Anything Model 2 in Surgical Video Segmentation

    Yiqing Shen, Hao Ding, Xinyuan Shao, Mathias Unberath

    Fully supervised deep learning (DL) models for surgical video segmentation have been shown to struggle with non-adversarial, real-world corruptions of image quality including smoke, bleeding, and low illumination. Foundation models for image segmentation, such as the segment anything model (SAM) that focuses on interactive prompt-based segmentation, move away from semantic classes and thus can be trained on larger and more diverse data, which offers outstanding zero-shot generalization with appropriate user prompts. Recently, building upon this success, SAM-2 has been proposed to further extend the zero-shot interactive segmentation capabilities from independent frame-by-frame to video segmentation. In this paper, we present a first experimental study evaluating SAM-2's performance on surgical video data. Leveraging the SegSTRONG-C MICCAI EndoVIS 2024 sub-challenge dataset, we assess SAM-2's effectiveness on uncorrupted endoscopic sequences and evaluate its non-adversarial robustness on videos with corrupted image quality simulating smoke, bleeding, and low brightness conditions under various prompt strategies. Our experiments demonstrate that SAM-2, in zero-shot manner, can achieve competitive or even superior performance compared to fully-supervised deep learning models on surgical video data, including under non-adversarial corruptions of image quality. Additionally, SAM-2 consistently outperforms the original SAM and its medical variants across all conditions. Finally, frame-sparse prompting can consistently outperform frame-wise prompting for SAM-2, suggesting that allowing SAM-2 to leverage its temporal modeling capabilities leads to more coherent and accurate segmentation compared to frequent prompting.

    8/19/2024

    Segment-Anything Models Achieve Zero-shot Robustness in Autonomous Driving

    Jun Yan, Pengyu Wang, Danni Wang, Weiquan Huang, Daniel Watzenig, Huilin Yin

    Semantic segmentation is a significant perception task in autonomous driving. It suffers from the risks of adversarial examples. In the past few years, deep learning has gradually transitioned from convolutional neural network (CNN) models with a relatively small number of parameters to foundation models with a huge number of parameters. The segment-anything model (SAM) is a generalized image segmentation framework that is capable of handling various types of images and is able to recognize and segment arbitrary objects in an image without the need to train on a specific object. It is a unified model that can handle diverse downstream tasks, including semantic segmentation, object detection, and tracking. In the task of semantic segmentation for autonomous driving, it is significant to study the zero-shot adversarial robustness of SAM. Therefore, we deliver a systematic empirical study on the robustness of SAM without additional training. Based on the experimental results, the zero-shot adversarial robustness of the SAM under the black-box corruptions and white-box adversarial attacks is acceptable, even without the need for additional training. The finding of this study is insightful in that the gigantic model parameters and huge amounts of training data lead to the phenomenon of emergence, which builds a guarantee of adversarial robustness. SAM is a vision foundation model that can be regarded as an early prototype of an artificial general intelligence (AGI) pipeline. In such a pipeline, a unified model can handle diverse tasks. Therefore, this research not only inspects the impact of vision foundation models on safe autonomous driving but also provides a perspective on developing trustworthy AGI. The code is available at: https://github.com/momo1986/robust_sam_iv.

    10/2/2024

    Zero-Shot Surgical Tool Segmentation in Monocular Video Using Segment Anything Model 2

    Ange Lou, Yamin Li, Yike Zhang, Robert F. Labadie, Jack Noble

    The Segment Anything Model 2 (SAM 2) is the latest generation foundation model for image and video segmentation. Trained on the expansive Segment Anything Video (SA-V) dataset, which comprises 35.5 million masks across 50.9K videos, SAM 2 advances its predecessor's capabilities by supporting zero-shot segmentation through various prompts (e.g., points, boxes, and masks). Its robust zero-shot performance and efficient memory usage make SAM 2 particularly appealing for surgical tool segmentation in videos, especially given the scarcity of labeled data and the diversity of surgical procedures. In this study, we evaluate the zero-shot video segmentation performance of the SAM 2 model across different types of surgeries, including endoscopy and microscopy. We also assess its performance on videos featuring single and multiple tools of varying lengths to demonstrate SAM 2's applicability and effectiveness in the surgical domain. We found that: 1) SAM 2 demonstrates a strong capability for segmenting various surgical videos; 2) When new tools enter the scene, additional prompts are necessary to maintain segmentation accuracy; and 3) Specific challenges inherent to surgical videos can impact the robustness of SAM 2.

    8/6/2024