Tuesday, January 28, 2025

InternVideo2.5: Hierarchical Token Compression and Task Preference Optimization for Video MLLMs

Understanding Multimodal Large Language Models (MLLMs)

Multimodal large language models (MLLMs) are a significant advance in artificial intelligence, combining multiple types of sensory information. However, they still fall short of human performance on basic vision tasks. The main challenges include:

- **Object Recognition**: accurately identifying objects in a scene.
- **Localization**: determining where objects are located.
- **Motion Recall**: remembering movements over time.

Despite ongoing research, human-like visual understanding remains difficult: building systems that can interpret and reason accurately across different sensory inputs is complex.

Current Research Approaches

Researchers are exploring several methods to improve visual understanding in MLLMs, including:

- **Combining Technologies**: pairing vision encoders with language models for tasks such as image description and visual question answering.
- **Video Processing**: extending MLLMs to understand sequences of frames and changes over time.

Two main strategies have emerged for fine-grained visual tasks:

- **Pixel-to-Sequence (P2S)**: expressing fine-grained visual outputs directly as token sequences.
- **Pixel-to-Embedding (P2E)**: representing visual details as embeddings decoded by task-specific heads.

Introducing InternVideo2.5

A team from Shanghai AI Laboratory, Nanjing University, and the Shenzhen Institutes of Advanced Technology developed InternVideo2.5, which improves video MLLM capabilities through:

- **Long and Rich Context (LRC) Modeling**: better understanding of detailed video content and temporal sequences.
- **Integrating Annotations**: using direct preference optimization to incorporate fine-grained visual task annotations.
- **Adaptive Hierarchical Token Compression**: building compact, efficient representations of spatiotemporal data.

Key Features of InternVideo2.5

InternVideo2.5 has several important features:

- **Dynamic Video Sampling**: processing between 64 and 512 frames and compressing each 8-frame clip into 128 tokens.
- **Advanced Components**: a temporal head based on CG-DETR and a mask head initialized from SAM2's pre-trained weights.
- **Optimized Processing**: two-layer MLPs for encoding and positioning spatial inputs.

Performance Improvements

InternVideo2.5 shows notable gains in video understanding:

- **Enhanced Accuracy**: more than 3 points of improvement on MVBench and the Perception Test for short-video prediction.
- **Superior Recall**: stronger memory of visual details in complex, long-context tasks.

Conclusion

InternVideo2.5 marks significant progress in video MLLM technology, focusing on:

- **Improved Visual Capabilities**: better object tracking and understanding.
- **Future Research Opportunities**: reducing high computational costs and extending context-processing techniques.

Transform Your Business with AI

To stay competitive, consider applying InternVideo2.5 in your operations:

- **Identify Automation Opportunities**: discover areas in customer interactions that can benefit from AI.
- **Define KPIs**: ensure your AI projects have measurable impact.
- **Select an AI Solution**: choose tools that fit your needs and allow customization.
- **Implement Gradually**: start with a pilot project, gather data, and expand AI use deliberately.

For AI KPI management advice, reach out to us. For ongoing insights, follow us on our social media channels. Explore how AI can enhance your sales processes and customer engagement.
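To make the sampling and compression figures quoted above concrete (64–512 frames, each 8-frame clip compressed into 128 tokens), here is a rough back-of-the-envelope sketch. The function names and the fixed-rate sampling policy are illustrative assumptions, not taken from the released InternVideo2.5 code; only the numeric constants come from the article.

```python
def sample_frame_count(video_seconds: float,
                       min_frames: int = 64,
                       max_frames: int = 512,
                       fps: float = 1.0) -> int:
    """Hypothetical dynamic sampling: pick a frame count proportional to
    video length, clamped to the 64-512 range cited in the article."""
    return max(min_frames, min(max_frames, int(video_seconds * fps)))


def token_budget(num_frames: int,
                 clip_len: int = 8,
                 tokens_per_clip: int = 128) -> int:
    """Each 8-frame clip is compressed into 128 tokens, so the LLM sees
    ceil(num_frames / 8) * 128 visual tokens in total."""
    num_clips = (num_frames + clip_len - 1) // clip_len  # ceiling division
    return num_clips * tokens_per_clip


# A maximally sampled 512-frame video yields 64 clips -> 8192 visual tokens.
print(token_budget(sample_frame_count(10_000)))
```

The point of the clip-level compression is visible in the arithmetic: without it, 512 frames at a typical few hundred patch tokens each would dwarf any LLM context window, whereas 128 tokens per 8-frame clip keeps the budget in the low thousands.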
