MONDAY 2 MARCH 2026
Visual Representation Learning and World Modeling
08:30 - 10:30
Harbour B
08:30IMAGE-265
IMAGE KEYNOTE: Advancing visual perception for embodied intelligence: From 2D scenes to dynamic 3D worlds
Recent advances in AI and deep learning have significantly improved perceptual, cognitive, and generative capabilities. As a result, embodied intelligence, which tightly integrates perception, cognition, decision-making, and action, has gained increasing attention. However, visual perception -- the cornerstone of embodied systems -- still faces critical challenges due to the real world's complexity, three-dimensionality, and temporal dynamics. This talk presents a progression of visual perception research from 2D images to dynamic 3D scenes, addressing key limitations in accuracy, robustness, and efficiency. First, we introduce MixingMask, a novel contour-focused method that enhances 2D object detection and segmentation by capturing fine-grained boundary information. Second, we propose a boundary-optimized approach for monocular 3D plane reconstruction, improving geometric accuracy through precise segmentation and centerness filtering. Third, we present an octree-based semantic occupancy framework for multi-view panoramic scenes, significantly reducing computational cost by exploiting the sparsity of 3D space. Lastly, we develop LinkOcc, a temporally-aware 3D occupancy method that uses a sparse query mechanism and contrastive learning to improve temporal consistency. Together, these contributions push the boundaries of visual perception towards practical, resource-efficient deployment in real-world embodied systems. Experimental results across multiple benchmarks validate the effectiveness of our methods and provide promising directions for future research in perception for embodied AI.
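The sparsity argument behind the octree-based occupancy framework can be illustrated with a toy example. The sketch below is not the speakers' method; it is a minimal, generic octree that subdivides only cells containing occupied voxels and counts how many nodes it stores. The function name `count_octree_nodes` and the 8x8x8 grid are assumptions made for illustration.

```python
def count_octree_nodes(occupied, origin, size):
    """Count nodes of an octree that subdivides only cells holding occupied voxels.

    occupied: list of (x, y, z) integer voxel coordinates
    origin:   (x, y, z) corner of the current cubic cell
    size:     edge length of the current cell (a power of two)
    """
    inside = [
        p for p in occupied
        if all(origin[a] <= p[a] < origin[a] + size for a in range(3))
    ]
    # An empty cell, or a cell that is a single voxel, is stored as one leaf.
    if not inside or size == 1:
        return 1
    half = size // 2
    total = 1  # the internal node for this cell
    for dx in (0, half):
        for dy in (0, half):
            for dz in (0, half):
                child = (origin[0] + dx, origin[1] + dy, origin[2] + dz)
                total += count_octree_nodes(inside, child, half)
    return total


# Hypothetical scene: one occupied voxel in an 8 x 8 x 8 grid.
grid_size = 8
sparse_nodes = count_octree_nodes([(0, 0, 0)], (0, 0, 0), grid_size)
dense_voxels = grid_size ** 3
print(sparse_nodes, dense_voxels)
```

In this toy case the octree stores far fewer cells than the dense grid, because empty regions collapse into single leaves. The savings in a real multi-view occupancy model depend on the scene and on design choices not described in the abstract.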
09:10IMAGE-266
From pixels to worlds: A survey on the new wave of high-fidelity video generation, Weijuan Xi, (US)
09:30IMAGE-267
Dual-stream feature disentanglement network for single domain generalized facial expression recognition, Ningyu Chen; Wenshui Lin; Chang Shu; Yan Yan
Towards Agentic AI
15:30 - 17:30
Harbour B
15:30IMAGE-268
Let AI see, practice, and learn in synthetic worlds, Liu He, Amazon (US)
This keynote highlights how agentic systems and multimodal large language models (MLLMs) are transforming synthetic world generation and reasoning. I present a framework that combines fast 3D perception (System-1) with deeper chain-of-thought reasoning (System-2), demonstrated through a multi-agent design for video generation. Supporting projects include Kubrick for collaborative video synthesis, DocAgent for long-context multimodal understanding, and the Ulti3D dataset for advancing 3D visual benchmarks. I also introduce LongPerceptualThoughts, which distills reasoning into vision-language models. Together, these advances chart a path toward scalable, human-like intelligence where AI agents can plan, act, and self-reflect.
15:50IMAGE-269
A survey of mobile agents: From code mobility to large multimodal model-driven autonomy
This paper presents a comprehensive survey of mobile agents, charting their evolution from the early paradigm of autonomous, migrating code to the contemporary era of sophisticated agents driven by Large Multimodal Models (LMMs). We begin by establishing a foundational taxonomy that distinguishes historical agent architectures from modern LMM-native systems. We then analyze the operational workflows and architectural patterns, notably the shift to multi-agent frameworks, that enable robust automation of complex tasks on Graphical User Interfaces (GUIs). A critical review of state-of-the-art systems, such as Mobile-Agent-v3 and V-Droid, is presented alongside an examination of benchmarks such as AndroidWorld that drive their development. Furthermore, we address the expanded security landscape, contrasting traditional threats with novel vulnerabilities inherent to LMM-powered agents. The paper concludes by synthesizing key trends and outlining future research directions, underscoring the transformative impact of LMMs on achieving generalized mobile autonomy.
16:10IMAGE-270
A configurable multi-agent system for feature extraction from multimedia documents
This paper presents an experimental multi-agent system developed for robust feature extraction from diverse multimedia documents, including images, PDFs, and technical drawings. Addressing the enterprise demand for structuring unstructured data, the system employs a flexible architecture that intelligently orchestrates specialized agents, ranging from Optical Character Recognition (OCR) and image processing to Large Language Models (LLMs), to achieve high-fidelity extraction. A key innovation is the system's high configurability, which keeps human experts in the loop to refine extraction logic via prompt engineering. Furthermore, the architecture supports hybrid edge-cloud deployment, allowing raw documents to be processed locally to satisfy strict data sovereignty requirements, with only non-sensitive data ingested centrally. The experimental system has shown scalability and efficiency in real-world use cases.
16:30IMAGE-271
A new dual embedding framework for zero-shot image classification, Qin LI, Shenzhen University of Information Technology (China (Mainland)); Jane You, The Hong Kong Polytechnic University (Hong Kong (Greater China)); Lin SHU, South China University of Technology (China (Mainland))
This paper presents a new dual embedding framework to improve zero-shot learning (ZSL), which aims to classify unseen classes via semantic transfer. Unlike the conventional approach, which suffers from noisy attribute features due to intra-attribute visual variations, the proposed dual-embedding framework, named CRAE (Class Representation and Attribute Embedding), jointly optimizes class representations and attribute features. The novelty of CRAE lies in three aspects: (1) adaptive softmax normalization to suppress attribute noise; (2) hard sample-aware contrastive learning for discriminative attribute embedding; (3) class-level contrastive removal to enhance inter-class separation. The advantages of CRAE are evidenced by its performance on the widely used zero-shot learning datasets AWA2, CUB, and SUN, benchmarked against existing conventional ZSL (CZSL) and generalized ZSL (GZSL) methods. CRAE achieved accuracies of 79.4% on CUB, 67.7% on SUN, and 75.8% on AWA2 in CZSL, while reaching superior harmonic mean (H) scores in GZSL. These results demonstrate its effectiveness in mitigating domain shift and improving zero-shot generalization.
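For readers unfamiliar with the harmonic mean (H) score reported for GZSL, it is conventionally computed from the accuracy on seen classes and the accuracy on unseen classes. The sketch below shows that standard formula. The accuracy values are hypothetical and are not results from the paper.

```python
def harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    """Harmonic mean H of seen-class and unseen-class accuracy (GZSL metric)."""
    total = seen_acc + unseen_acc
    if total == 0:
        return 0.0  # both accuracies are zero, so H is defined as zero
    return 2.0 * seen_acc * unseen_acc / total


# Hypothetical accuracies chosen for illustration only.
h = harmonic_mean(0.80, 0.60)
print(round(h, 4))
```

H is high only when both accuracies are high, which is why it is preferred over a plain average when a model is biased toward seen classes.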
TUESDAY 3 MARCH 2026
Efficient Machine Learning for Applications
08:30 - 10:30
Harbour B
08:30IMAGE-272
DLSS at the edge: Deep learning super sampling for low-latency gaming
Edge platforms are resource-constrained for real-time rendering. NVIDIA's DLSS uses AI-based super-resolution to boost image quality and performance under tight compute budgets. This talk reviews the evolution of DLSS and how it is reshaping rendering pipelines to improve the edge gaming experience.
08:50IMAGE-273
A real-time on-device defect detection framework for laser power-meter sensors via unsupervised learning, Dongqi Zheng; Wenjin Fu; Guangzong Chen
We present an automated vision-based system for defect detection in laser power meter sensor coatings. The system employs an unsupervised anomaly detection framework that trains exclusively on normal sensor images, enabling detection of both known and novel defect types without requiring extensive labeled defect datasets. Our methodology consists of: (1) a robust preprocessing pipeline using circle detection and K-means clustering, (2) synthetic data augmentation via StyleGAN2, and (3) a UFlow-based neural network for multi-scale feature extraction. Following rigorous experimental protocols to eliminate data leakage and using normal-only threshold selection, we achieve 0.957 image-level AUROC and 0.961 pixel-level AUROC. Comprehensive ablation studies demonstrate that preprocessing contributes 7.9% AUROC improvement and synthetic augmentation adds 3.6%. Comparative evaluation against state-of-the-art methods shows UFlow outperforms alternatives by 3.9% while maintaining 0.5-second inference time suitable for edge deployment.
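The "normal-only threshold selection" mentioned above can be illustrated with a simple rule: choose the decision threshold from anomaly scores of held-out normal images alone, so that no defect labels influence it. The abstract does not state which rule the authors use. The nearest-rank quantile below, the function names, and the scores are all assumptions made for illustration.

```python
import math


def normal_only_threshold(normal_scores, quantile=0.95):
    """Pick an anomaly threshold using scores from normal images only.

    Uses the nearest-rank quantile: the smallest score such that at least
    `quantile` of the normal scores are at or below it.
    """
    if not normal_scores:
        raise ValueError("need at least one normal score")
    ordered = sorted(normal_scores)
    rank = max(1, math.ceil(quantile * len(ordered)))
    return ordered[rank - 1]


def is_defective(score, threshold):
    """Flag an image whose anomaly score exceeds the threshold."""
    return score > threshold


# Hypothetical anomaly scores from a held-out set of normal sensor images.
normal_scores = [0.10, 0.12, 0.08, 0.15, 0.11, 0.09, 0.13, 0.14, 0.10, 0.30]
threshold = normal_only_threshold(normal_scores, quantile=0.85)
print(threshold, is_defective(0.42, threshold))
```

The quantile sets the tolerated false-alarm rate on normal parts. A lower quantile flags more images and a higher one flags fewer.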
09:10IMAGE-274
Topological adaptive weighted drug-target affinity prediction, Linman Du, (China (Mainland)); Wenzong Jiang; Bin Shen; Weifeng Liu; Baodi Liu
Abstract unavailable.
Image Processing for Health
11:00 - 12:20
Harbour B
11:00IMAGE-275
Key feature dynamic enhancement network for breast cancer pathological image classification, Xianghe Cui; Wenzong Jiang; Bin Shen; Weifeng Liu; Baodi Liu
Breast cancer pathological images are considered the "gold standard" for clinical diagnosis of breast cancer, but manual diagnosis suffers from inherent drawbacks such as low efficiency and high subjectivity. Computer-aided diagnosis (CAD) systems can provide objective decision support for clinicians by deeply mining multi-level features such as tissue architecture and cytology from pathological images. However, current CAD systems are still challenged by complex background noise and inconsistency in cross-scale feature representation, which hinder the extraction of critical features. Therefore, this paper proposes a key feature dynamic enhancement network for breast cancer pathological image classification (KFDE), in which the channel-spatial feature enhancement module (CSFE) and the multi-scale feature dynamic fusion module (MFDF) serve as the two core components. The CSFE module effectively suppresses background noise and highlights lesion regions through local channel variance analysis and an energy entropy-driven spatial focusing mechanism. The MFDF module employs a heterogeneous multi-branch convolutional architecture to intelligently fuse cross-scale features, addressing the issue of information fragmentation caused by magnification variation. Experiments on the BreakHis dataset demonstrate that KFDE achieves significant performance improvements, with a benign/malignant classification accuracy of 99.74% and an eight-class subtype classification accuracy of 96.35%, significantly outperforming existing mainstream models.
11:20IMAGE-276
Video-based full-spectrum biomechanical gait analysis for remote gait health assessment
Remote video analysis offers a scalable way to monitor human mobility, yet current video-based assessments typically rely on a small set of spatiotemporal parameters and provide limited insight into overall walking health. Most existing methods focus narrowly on a few basic gait measures, often without explaining which biomechanical characteristics truly differentiate healthy from abnormal walking. We address this gap with a comprehensive, standardized, and multimedia-friendly framework that mines a broad spectrum of gait indicators from ordinary monocular videos and integrates them into an interpretable measure of walking health. Unlike prior work that is contact-free but limited in feature diversity, our method emphasizes both feature breadth and clinical grounding, thereby delivering results that are not only human-readable but also clinically meaningful. By weighting top features according to stability and clinical correlation, we propose a composite Full Spectrum Gait Dysfunction Index (FSGDI). This index supports scalable at-home monitoring on resource-constrained mobile devices and fair benchmarking across datasets. Tests on both public and clinical cohorts show that the method consistently identifies the most clinically informative and stable gait features, highlighting which movement patterns matter most. By coupling computer vision-based video analytics with interpretable modeling, our method achieves better evaluation accuracy than conventional ML pipelines while producing trustworthy results. We provide a reusable standardized feature library, an evaluation protocol, and a transparent gait-severity metric that clinicians can relate to established scores. Overall, this work establishes a practical foundation for remote, non-invasive, and clinically aligned gait assessment and monitoring.
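The idea of weighting features by stability and clinical correlation to form a composite index can be sketched as follows. The abstract does not give the FSGDI formula, so everything here is an assumption: the weighting rule (stability times absolute correlation, normalized to sum to one), the use of absolute z-scores against a healthy reference, and the feature names and values.

```python
def feature_weights(stability, clinical_corr):
    """Weight each feature by stability * |clinical correlation|, normalized to sum to 1.

    This weighting rule is an illustrative assumption, not the published method.
    """
    raw = {name: stability[name] * abs(clinical_corr[name]) for name in stability}
    total = sum(raw.values())
    if total == 0:
        raise ValueError("all feature weights are zero")
    return {name: value / total for name, value in raw.items()}


def composite_index(z_scores, weights):
    """Weighted sum of absolute deviations from a healthy reference (in z-score units)."""
    return sum(weights[name] * abs(z_scores[name]) for name in weights)


# Hypothetical features and values, for illustration only.
stability = {"cadence": 0.9, "step_width": 0.6}
clinical_corr = {"cadence": -0.5, "step_width": 0.5}
weights = feature_weights(stability, clinical_corr)

z_scores = {"cadence": 1.0, "step_width": 2.0}
index = composite_index(z_scores, weights)
print(weights, round(index, 3))
```

Because each weight is visible and tied to a named gait feature, a clinician can see which movement patterns drive a given score, which is the kind of interpretability the abstract emphasizes.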
11:40IMAGE-277
Local attention and detail-enhanced network for mass segmentation in whole mammograms, Qingkun Guo (China (Mainland)); Yixuan Wang (China (Mainland)); Wenzong Jiang (China (Mainland)); Bin Shen; Weifeng Liu; Lin Cong; Baodi Liu
Mammography is one of the most commonly used tools for early screening of breast cancer. Developing computer-aided diagnosis (CAD) based on mammographic images to assist doctors in making efficient and accurate diagnoses holds significant research value. Mass segmentation in mammograms is a core component of breast cancer CAD systems and an essential step in further qualitative analysis of breast cancer. However, significant challenges persist in the field of mass segmentation in whole mammograms, including model misalignment due to the small proportion of mass regions and difficulties in segmenting boundaries caused by blurred edges of mass areas. To solve these challenges, this paper proposes a local attention and detail-enhanced network (LADE-Net) for mass segmentation in whole mammograms. LADE-Net employs an asymmetric encoder-decoder architecture and introduces a lightweight local attention (LA) module aimed at early and precise localization of breast mass regions. Importantly, we design a new detail-enhanced fusion residual network (DEFRB) to refine and enhance the learning of edge features in breast masses. We evaluated the performance of LADE-Net on two publicly available datasets (INbreast, CBIS-DDSM). Compared to previous works, LADE-Net achieved superior performance.