Imaging and Multimedia Analytics at the Edge 2023
Monday 16 January 2023
KEYNOTE: Data & Learning (M1)
Session Chair: Qian Lin, HP Inc. (United States)
8:45 – 10:20 AM
Balboa
8:45
Conference Welcome
8:50 IMAGE-264
KEYNOTE: Small data, big insights, Raja Bala, Amazon (United States)
Dr. Raja Bala is a principal applied scientist at Amazon. His research interests include computer vision, deep learning, image/video processing, mobile imaging, and color imaging. Bala is an inventor on 180 patents and has authored over 100 publications in the field of digital imaging and computer vision. He is co-editor of the IEEE-Wiley book "Computer Vision and Imaging in Intelligent Transportation Systems" and is the principal liaison for numerous industry-university partnerships. Prior to joining Amazon, Bala was a principal scientist and leader of the Collaborative Visual Computing Group at PARC. Bala is a Fellow of IS&T and a Senior Member of IEEE.
Deep learning has defined the state of the art for many computer vision tasks, thanks to advances in computing hardware and the availability of large datasets. However, for many practical applications, data acquisition and annotation is a costly and time-consuming task. This is especially true in specialized fields such as medical imaging, fashion, and beauty care, where annotation must be carried out by domain experts. We present several novel approaches to tackle the “Small Data” challenge. These include i) incorporation of domain knowledge into deep networks via regularizers and priors to reduce training demands; ii) image synthesis exploiting latent structure in deep generative models and adversarial methods to generate hard examples; iii) guided image acquisition to ensure high-quality data; and iv) rapid human-in-the-loop data annotation with augmented reality.
9:30 IMAGE-265
Connecting images and AR content using CLIP embedding, Yulong Liu, Snap (United States)
Recommending relevant AR content (lenses, filters, stickers, etc.) based on a user's input image has been challenging because of the high dimensionality of the input space. We introduce an embedding-based retrieval system that encodes the user's input image and the AR content into the same semantic space to allow fast approximate nearest neighbor (ANN) candidate retrieval. This method has drastically improved the retrieval candidate pool of our ranking system.
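As a hedged illustration of the kind of embedding-based retrieval described above (not Snap's implementation): image and AR-asset embeddings are assumed to come from a shared CLIP-style encoder, and a brute-force cosine search stands in for the production ANN index.

```python
import numpy as np

def normalize(v):
    """L2-normalize rows so a dot product equals cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def retrieve_ar_content(image_embedding, content_embeddings, k=10):
    """Return indices of the k AR assets closest to the query image in the
    shared semantic space (brute force; a real system would use an ANN index)."""
    q = normalize(image_embedding[None, :])     # (1, d) query embedding
    db = normalize(content_embeddings)          # (n, d) catalog embeddings
    scores = (db @ q.T).ravel()                 # cosine similarities
    return np.argsort(-scores)[:k], scores

# Usage with random stand-in embeddings (512-d, as in common CLIP variants):
rng = np.random.default_rng(0)
query = rng.normal(size=512)
catalog = rng.normal(size=(10_000, 512))
top_ids, sims = retrieve_ar_content(query, catalog, k=5)
```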
9:50 IMAGE-266
Artificial intelligence and general data protection regulation (GDPR) – a contradiction in terms? (Invited), Reiner Fageth, CEWE Stiftung & Co.KGaA (Germany)
Today, more photos are being taken than ever before. This increases the number of photos that can be used for image products and storytelling, as well as the challenge for customers to find their relevant memories in the flood of images. This paper describes how artificial intelligence (AI) helps people find their important photos without turning them into "transparent consumers". This can be done, albeit with considerably more effort and cost, while personal image data is protected according to European standards, unlike some approaches from other regions. We also evaluate whether newer techniques such as federated learning can help support GDPR compliance.
10:20 – 10:50 AM Coffee Break
Watch What You Eat (M2.1)
Session Chair: Qian Lin, HP Inc. (United States)
10:50 AM – 12:00 PM
Balboa
10:50 IMAGE-267
Harnessing the power of pixels to assess dietary intake (Invited), Fengqing Zhu, Purdue University (United States)
Diet is a complex exposure, and measuring dietary intake presents more challenges than other environmental exposures. Image-based dietary assessment refers to the process of determining what someone eats and how much energy is consumed from visual data and associated contextual information. New, unbiased data-capture methods such as mobile dietary assessment technologies are inexpensive and customizable, and could overcome well-characterized limitations of current approaches that rely primarily on self-reporting. In this talk, I will present the design and development of a mobile, image-based dietary assessment system that records and analyzes images of eating occasions. We have developed various food image analysis methods, including hierarchy-based food image classification, single-view food portion size estimation, saliency-aware food image segmentation, and image-based clustering to extract eating environments. Our system has been deployed and evaluated in more than 30 dietary studies (controlled-feeding and community-dwelling) with over 2,500 participants between the ages of 6 months and 70 years in both domestic and international locations.
11:20 IMAGE-268
Conditional synthetic food image generation, Wenjin Fu1, Yue Han2, Sriram Baireddy2, Jiangpeng He2, Mridul Gupta2, and Fengqing Zhu2; 1The Ohio State University and 2Purdue University (United States)
Generative Adversarial Networks (GANs) have been widely investigated due to their powerful representation learning in image synthesis. In this work, we explore different GAN architectures for synthetic food image generation. Despite the impressive performance of GANs for natural image generation, synthesized food images still contain severe visual artifacts. Therefore, we aim to explore the capability of state-of-the-art methods for image generation and improve their performance on synthetic food image generation. Specifically, we first analyze the experimental results of existing synthetic image generation techniques (e.g., StyleGAN3) on food images as the baseline. Then, we identify two issues during training that can cause performance degradation on food images: (1) inter-class feature entanglement during multi-food-class training, and (2) loss of high-resolution detail during image downsampling. To address both issues, we propose to train one food category at a time to avoid feature entanglement and to leverage image patches cropped from high-resolution datasets to retain fine details. Our methods are evaluated on the Food-101 dataset and show improved performance and higher quality compared with the baseline. Our conditional food image generation model shows great potential for improving food image analysis by providing high-quality training samples for data augmentation.
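A minimal sketch of the high-resolution patch sampling idea mentioned in the abstract, assuming a directory of JPEG images for a single food category (paths, patch size, and counts are illustrative, not the paper's settings):

```python
import random
from pathlib import Path
from PIL import Image

def sample_patches(class_dir, patch_size=256, patches_per_image=4):
    """Yield random crops taken at native resolution from one food category,
    avoiding the detail loss that comes from resizing whole images."""
    for path in Path(class_dir).glob("*.jpg"):
        img = Image.open(path).convert("RGB")
        w, h = img.size
        if w < patch_size or h < patch_size:
            continue  # skip images too small to crop without upsampling
        for _ in range(patches_per_image):
            x = random.randint(0, w - patch_size)
            y = random.randint(0, h - patch_size)
            yield img.crop((x, y, x + patch_size, y + patch_size))

# Example: build a per-class training set, one category at a time,
# as the paper proposes to avoid inter-class feature entanglement.
# patches = list(sample_patches("food-101/images/ramen"))
```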
11:40 IMAGE-269
Unsupervised visual representation learning on food images, Andrew W. Peng, Jiangpeng He, and Fengqing Zhu, Purdue University (United States)
Food image classification is the groundwork for image-based dietary assessment, which is the process of monitoring what kinds of food and how much energy is consumed using captured food or eating scene images. Existing deep learning-based methods learn the visual representation for food classification based on human annotation of each food image. However, most food images captured in real life are obtained without labels, and annotating them to train deep learning-based methods is not feasible for real-world deployment due to high costs. To make use of the vast amount of unlabeled images, many existing works focus on unsupervised or self-supervised learning to learn the visual representation directly from unlabeled data. However, none of these existing works focuses on food images, which are more challenging than general objects due to their high inter-class similarity and intra-class variance. In this paper, we focus on two items: the comparison of existing models and the development of an effective self-supervised learning model for food image classification. Specifically, we first compare the performance of existing state-of-the-art self-supervised learning models, including SimSiam, SimCLR, SwAV, BYOL, MoCo, and the Rotation Pretext Task, on food images. The experiments are conducted on the Food-101 dataset, which contains 101 different classes of foods with 1,000 images in each class. Next, we analyze the unique features of each model and compare their performance on food images to identify the key factors in each model that can help improve accuracy. Finally, we propose a new model for unsupervised visual representation learning on food images for the classification task.
PANEL: Watch What You Eat: Panel on Food/Health from the Perspective of AI and Privacy (M2.2)
Panel Moderator: Reiner Fageth, CEWE Stiftung & Co.KGaA (Germany)
12:00 – 12:30 PM
Balboa
12:30 – 2:00 PM Lunch
Monday 16 January PLENARY: Neural Operators for Solving PDEs
Session Chair: Robin Jenkin, NVIDIA Corporation (United States)
2:00 PM – 3:00 PM
Cyril Magnin I/II/III
Deep learning surrogate models have shown promise in modeling complex physical phenomena such as fluid flows, molecular dynamics, and material properties. However, standard neural networks assume finite-dimensional inputs and outputs, and hence cannot withstand a change in resolution or discretization between training and testing. We introduce Fourier neural operators that can learn operators, which are mappings between infinite-dimensional spaces. They are independent of the resolution or grid of the training data and allow for zero-shot generalization to higher-resolution evaluations. When applied to weather forecasting, neural operators capture fine-scale phenomena and have skill similar to gold-standard numerical weather models for predictions up to a week or longer, while being 4-5 orders of magnitude faster.
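For orientation, a minimal 1D sketch of the Fourier layer underlying such neural operators, written in PyTorch; the channel count, mode count, and single-layer setup are assumptions for illustration rather than the configuration used in the talk:

```python
import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    """One Fourier layer: FFT, learned mixing of the lowest modes, inverse FFT.
    Because the weights live in frequency space, the layer is independent of
    the grid resolution used at training or evaluation time."""
    def __init__(self, channels=32, modes=16):
        super().__init__()
        self.modes = modes
        self.weight = nn.Parameter(
            torch.randn(channels, channels, modes, dtype=torch.cfloat) / channels)

    def forward(self, x):                      # x: (batch, channels, grid)
        x_ft = torch.fft.rfft(x)               # spectral coefficients
        out_ft = torch.zeros_like(x_ft)
        out_ft[..., :self.modes] = torch.einsum(
            "bim,iom->bom", x_ft[..., :self.modes], self.weight)
        return torch.fft.irfft(out_ft, n=x.size(-1))

# Zero-shot change of resolution: the same layer runs on a 64-point
# grid and on a 256-point grid without retraining.
layer = SpectralConv1d()
print(layer(torch.randn(8, 32, 64)).shape, layer(torch.randn(8, 32, 256)).shape)
```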
Anima Anandkumar, Bren professor, California Institute of Technology, and senior director of AI Research, NVIDIA Corporation (United States)
Anima Anandkumar is a Bren Professor at Caltech and Senior Director of AI Research at NVIDIA. She is passionate about designing principled AI algorithms and applying them to interdisciplinary domains. She has received several honors, including the IEEE Fellowship, the Alfred P. Sloan Fellowship, the NSF CAREER Award, and faculty fellowships from Microsoft, Google, Facebook, and Adobe. She is part of the World Economic Forum's Expert Network. Anandkumar received her BTech from the Indian Institute of Technology Madras and her PhD from Cornell University, did her postdoctoral research at MIT, and was an assistant professor at the University of California, Irvine.
3:00 – 3:30 PM Coffee Break
Prime Video (M3)
Session Chair: Raja Bala, Amazon (United States)
3:30 – 5:00 PM
Balboa
3:30 IMAGE-270
Learn spatio-temporal downsampling for effective video upscaling (Invited), Xiaoyu Xiang1, Yapeng Tian2, Vijay Rengaranjan1, Lucas Young1, Bo Zhu1, and Rakesh Ranjan1; 1Meta and 2The University of Texas at Dallas (United States)
No further information is available at this time
4:00 IMAGE-271
Movie character re-identification by agglomerative clustering of deep features, Samuel Ducros1,2, William Puech1, Gérard Subsol1, Mathieu Lafourcade1, Jean-Marie Barthélémy2, and Bianca Jansen van Rensburg3; 1Université de Montpellier, 2ECOSM, and 3presenter only (France)
In this paper, we present a method for hierarchical clustering of the characters in a video. Given an edited video featuring people, we seek to identify each person with the character they represent. The proposed method is based on hierarchical clustering using first-neighbour relations. First, the heads and faces of each person are detected and tracked in each shot of the video. Then, we create a representation vector for each tracked person in a shot. Finally, we compare the vector representations and use first-neighbour relations to group them into distinct characters. The main contribution of this work is a person re-identification framework based on a hierarchical clustering method, applied to edited videos with large scene variations.
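A rough sketch of first-neighbour clustering over per-shot track descriptors, assuming one feature vector per tracked person per shot; the cosine similarity and union-find grouping are illustrative choices, not necessarily the authors' exact criterion:

```python
import numpy as np

def first_neighbour_clusters(features):
    """Link every track descriptor to its nearest neighbour and return the
    connected components as character clusters (simple union-find)."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)
    nearest = sim.argmax(axis=1)            # first (nearest) neighbour of each track

    parent = list(range(len(features)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i
    for i, j in enumerate(nearest):         # union each track with its neighbour
        parent[find(i)] = find(int(j))

    roots = [find(i) for i in range(len(features))]
    return {r: [i for i, x in enumerate(roots) if x == r] for r in set(roots)}

# tracks = np.vstack([...])                 # one descriptor per tracked person per shot
# characters = first_neighbour_clusters(tracks)
```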
4:20 IMAGE-272
Light-weight recurrent network for real-time video super-resolution, Tianqi Wang1, Qian Lin2, and Jan P. Allebach1; 1Purdue University and 2HP Labs, HP Inc. (United States)
No further details about this work can be provided at this time, since a patent application may be filed prior to the start of the conference on 15 January 2023 to protect the technology.
4:40 IMAGE-273
Depth assisted portrait video background blurring, Yezhi Shen1, Weichen Xu1, Qian Lin2, Jan P. Allebach1, and Fengqing Zhu1; 1Purdue University and 2HP Labs, HP Inc. (United States)
Video conferencing usage increased dramatically during the pandemic and is expected to remain high in hybrid work. One key aspect of the video experience is background blur or background replacement, which relies on good-quality portrait segmentation in each frame. Software and hardware manufacturers have worked together to utilize depth sensors to improve the process. Existing solutions have incorporated the depth map into post-processing to generate a more natural blurring effect. In this paper, we propose to collect background features with the help of the depth map to improve the segmentation result from the RGB image. Our results show significant improvements over methods using RGB-based networks, and our method runs faster than model-based background feature collection models.
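A hedged sketch of the final compositing step common to such systems: blur the whole frame, then blend it with the sharp original using the portrait mask, optionally softening the transition with the depth map (the mask and depth inputs are assumed to come from the segmentation network and the depth sensor; the depth heuristic is illustrative):

```python
import cv2
import numpy as np

def blur_background(frame_bgr, person_mask, depth=None, ksize=31):
    """Blend a blurred copy of the frame with the sharp original.
    person_mask: float array in [0, 1], 1 on the subject.
    depth: optional normalized depth map used to soften the transition."""
    blurred = cv2.GaussianBlur(frame_bgr, (ksize, ksize), 0)
    alpha = person_mask.astype(np.float32)
    if depth is not None:
        # keep pixels at a similar depth to the subject sharper (illustrative heuristic)
        subject_depth = np.median(depth[person_mask > 0.5])
        alpha = np.clip(alpha + (1.0 - np.abs(depth - subject_depth)) * 0.3, 0.0, 1.0)
    alpha = cv2.GaussianBlur(alpha, (11, 11), 0)[..., None]   # feather mask edges
    return (alpha * frame_bgr + (1.0 - alpha) * blurred).astype(np.uint8)
```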
EI 2023 Highlights Session
Session Chair: Robin Jenkin, NVIDIA Corporation (United States)
3:30 – 5:00 PM
Cyril Magnin II
Join us for a session that celebrates the breadth of what EI has to offer with short papers selected from EI conferences.
NOTE: The EI-wide "EI 2023 Highlights" session is concurrent with Monday afternoon COIMG, COLOR, IMAGE, and IQSP conference sessions.
IQSP-309
Evaluation of image quality metrics designed for DRI tasks with automotive cameras, Valentine Klein, Yiqi LI, Claudio Greco, Laurent Chanas, and Frédéric Guichard, DXOMARK (France)
Driving assistance is increasingly used in new car models. Most driving assistance systems are based on automotive cameras and computer vision. Computer vision, regardless of the underlying algorithms and technology, requires the images to have good image quality, defined according to the task. This notion of good image quality is still to be defined for computer vision, as its criteria are very different from those of human vision: humans have a better contrast detection ability than imaging chains. The aim of this article is to compare three different metrics designed for the detection of objects with computer vision: the Contrast Detection Probability (CDP) [1, 2, 3, 4], the Contrast Signal to Noise Ratio (CSNR) [5], and the Frequency of Correct Resolution (FCR) [6]. For this purpose, the computer vision task of reading the characters on a license plate is used as a benchmark. The objective is to check the correlation between each objective metric and the ability of a neural network to perform this task. Thus, a protocol to test these metrics and compare them to the output of the neural network has been designed, and the pros and cons of each of the three metrics are noted.
SD&A-224
Human performance using stereo 3D in a helmet mounted display and association with individual stereo acuity, Bonnie Posselt, RAF Centre of Aviation Medicine (United Kingdom)
Binocular Helmet Mounted Displays (HMDs) are a critical part of the aircraft system, allowing information to be presented to the aviator with stereoscopic 3D (S3D) depth, potentially enhancing situational awareness and improving performance. The utility of S3D in an HMD may be linked to an individual’s ability to perceive changes in binocular disparity (stereo acuity). Though minimum stereo acuity standards exist for most military aviators, current test methods may be unable to characterise this relationship. This presentation will investigate the effect of S3D on performance when used in a warning alert displayed in an HMD. Furthermore, any effect on performance, ocular symptoms, and cognitive workload shall be evaluated in regard to individual stereo acuity measured with a variety of paper-based and digital stereo tests.
IMAGE-281
Smartphone-enabled point-of-care blood hemoglobin testing with color accuracy-assisted spectral learning, Sang Mok Park1, Yuhyun Ji1, Semin Kwon1, Andrew R. O’Brien2, Ying Wang2, and Young L. Kim1; 1Purdue University and 2Indiana University School of Medicine (United States)
We develop an mHealth technology for noninvasively measuring blood Hgb levels in patients with sickle cell anemia, using the photos of peripheral tissue acquired by the built-in camera of a smartphone. As an easily accessible sensing site, the inner eyelid (i.e., palpebral conjunctiva) is used because of the relatively uniform microvasculature and the absence of skin pigments. Color correction (color reproduction) and spectral learning (spectral super-resolution spectroscopy) algorithms are integrated for accurate and precise mHealth blood Hgb testing. First, color correction using a color reference chart with multiple color patches extracts absolute color information of the inner eyelid, compensating for smartphone models, ambient light conditions, and data formats during photo acquisition. Second, spectral learning virtually transforms the smartphone camera into a hyperspectral imaging system, mathematically reconstructing high-resolution spectra from color-corrected eyelid images. Third, color correction and spectral learning algorithms are combined with a spectroscopic model for blood Hgb quantification among sickle cell patients. Importantly, single-shot photo acquisition of the inner eyelid using the color reference chart allows straightforward, real-time, and instantaneous reading of blood Hgb levels. Overall, our mHealth blood Hgb tests could potentially be scalable, robust, and sustainable in resource-limited and homecare settings.
AVM-118
Designing scenes to quantify the performance of automotive perception systems, Zhenyi Liu1, Devesh Shah2, Alireza Rahimpour2, Joyce Farrell1, and Brian Wandell1; 1Stanford University and 2Ford Motor Company (United States)
We implemented an end-to-end simulation for perception systems, based on cameras, that are used in automotive applications. The open-source software creates complex driving scenes and simulates cameras that acquire images of these scenes. The camera images are then used by a neural network in the perception system to identify the locations of scene objects, providing the results as input to the decision system. In this paper, we design collections of test scenes that can be used to quantify the perception system’s performance under a range of (a) environmental conditions (object distance, occlusion ratio, lighting levels), and (b) camera parameters (pixel size, lens type, color filter array). We are designing scene collections to analyze performance for detecting vehicles, traffic signs and vulnerable road users in a range of environmental conditions and for a range of camera parameters. With experience, such scene collections may serve a role similar to that of standardized test targets that are used to quantify camera image quality (e.g., acuity, color).
VDA-403
Visualizing and monitoring the process of injection molding, Christian A. Steinparz1, Thomas Mitterlehner2, Bernhard Praher2, Klaus Straka1,2, Holger Stitz1,3, and Marc Streit1,3; 1Johannes Kepler University, 2Moldsonics GmbH, and 3datavisyn GmbH (Austria)
In injection molding machines the molds are rarely equipped with sensor systems. The availability of non-invasive ultrasound-based in-mold sensors provides better means for guiding operators of injection molding machines throughout the production process. However, existing visualizations are mostly limited to plots of temperature and pressure over time. In this work, we present the result of a design study created in collaboration with domain experts. The resulting prototypical application uses real-world data taken from live ultrasound sensor measurements for injection molding cavities captured over multiple cycles during the injection process. Our contribution includes a definition of tasks for setting up and monitoring the machines during the process, and the corresponding web-based visual analysis tool addressing these tasks. The interface consists of a multi-view display with various levels of data aggregation that is updated live for newly streamed data of ongoing injection cycles.
COIMG-155
Commissioning the James Webb Space Telescope, Joseph M. Howard, NASA Goddard Space Flight Center (United States)
Astronomy is arguably in a golden age, where current and future NASA space telescopes are expected to contribute to this rapid growth in understanding of our universe. The most recent addition to our space-based telescopes dedicated to astronomy and astrophysics is the James Webb Space Telescope (JWST), which launched on 25 December 2021. This talk will discuss the first six months in space for JWST, which were spent commissioning the observatory with many deployments, alignments, and system and instrumentation checks. These engineering activities help verify the proper working of the telescope prior to commencing full science operations. For the session: Computational Imaging using Fourier Ptychography and Phase Retrieval.
HVEI-223
Critical flicker frequency (CFF) at high luminance levels, Alexandre Chapiro1, Nathan Matsuda1, Maliha Ashraf2, and Rafal Mantiuk3; 1Meta (United States), 2University of Liverpool (United Kingdom), and 3University of Cambridge (United Kingdom)
The critical flicker fusion (CFF) is the frequency of changes at which a temporally periodic light will begin to appear completely steady to an observer. This value is affected by several visual factors, such as the luminance of the stimulus or its location on the retina. With new high dynamic range (HDR) displays, operating at higher luminance levels, and virtual reality (VR) displays, presenting at wide fields-of-view, the effective CFF may change significantly from values expected for traditional presentation. In this work we use a prototype HDR VR display capable of luminances up to 20,000 cd/m^2 to gather a novel set of CFF measurements for never before examined levels of luminance, eccentricity, and size. Our data is useful to study the temporal behavior of the visual system at high luminance levels, as well as setting useful thresholds for display engineering.
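For context, such measurements extend a classical first-order model, the Ferry-Porter law, in which the critical flicker frequency grows roughly linearly with the logarithm of luminance; a schematic form is

\[ \mathrm{CFF}(L) \;\approx\; a\,\log_{10} L + b, \]

where the coefficients a and b depend on eccentricity and stimulus size and must be fit to data (they are not taken from this paper).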
HPCI-228
Physics guided machine learning for image-based material decomposition of tissues from simulated breast models with calcifications, Muralikrishnan Gopalakrishnan Meena1, Amir K. Ziabari1, Singanallur Venkatakrishnan1, Isaac R. Lyngaas1, Matthew R. Norman1, Balint Joo1, Thomas L. Beck1, Charles A. Bouman2, Anuj Kapadia1, and Xiao Wang1; 1Oak Ridge National Laboratory and 2Purdue University (United States)
Material decomposition of Computed Tomography (CT) scans using projection-based approaches, while highly accurate, poses a challenge for medical imaging researchers and clinicians due to limited or no access to projection data. We introduce a deep learning image-based material decomposition method guided by physics and requiring no access to projection data. The method is demonstrated to decompose tissues from simulated dual-energy X-ray CT scans of virtual human phantoms containing four materials: adipose, fibroglandular, calcification, and air. The method uses a hybrid unsupervised and supervised learning technique to tackle the material decomposition problem. We take advantage of the unique X-ray absorption rate of calcium compared to body tissues to perform a preliminary segmentation of calcification from the images using unsupervised learning. We then perform supervised material decomposition using a deep-learned UNET model, which is trained using GPUs on the high-performance systems at the Oak Ridge Leadership Computing Facility. The method is demonstrated on simulated breast models to decompose calcification, adipose, fibroglandular, and air.
3DIA-104
Layered view synthesis for general images, Loïc Dehan, Wiebe Van Ranst, and Patrick Vandewalle, Katholieke University Leuven (Belgium)
We describe a novel method for monocular view synthesis. The goal of our work is to create a visually pleasing set of horizontally spaced views based on a single image. This can be applied in view synthesis for virtual reality and glasses-free 3D displays. Previous methods produce realistic results on images that show a clear distinction between a foreground object and the background. We aim to create novel views in more general, crowded scenes in which there is no clear distinction. Our main contributions are a computationally efficient method for realistic occlusion inpainting and blending, especially in complex scenes. Our method can be effectively applied to any image, which is shown both qualitatively and quantitatively on a large dataset of stereo images. Our method performs natural disocclusion inpainting and maintains the shape and edge quality of foreground objects.
ISS-329
A self-powered asynchronous image sensor with independent in-pixel harvesting and sensing operations, Ruben Gomez-Merchan, Juan Antonio Leñero-Bardallo, and Ángel Rodríguez-Vázquez, University of Seville (Spain)
A new self-powered asynchronous sensor with a novel pixel architecture is presented. Pixels are autonomous and can harvest or sense energy independently. During image acquisition, pixels toggle to a harvesting operation mode once they have sensed their local illumination level. With the proposed pixel architecture, the most illuminated pixels provide an early contribution to powering the sensor, while lowly illuminated ones spend more time sensing their local illumination. Thus, the equivalent frame rate is higher than that offered by conventional self-powered sensors that harvest and sense illumination in independent phases. The proposed sensor uses a Time-to-First-Spike readout that allows trading off image quality against data and bandwidth consumption. The sensor has HDR operation with a dynamic range of 80 dB. Pixel power consumption is only 70 pW. In the article, we describe the sensor and pixel architectures in detail. Experimental results are provided and discussed. Sensor specifications are benchmarked against the state of the art.
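As a simplified model of the Time-to-First-Spike readout mentioned above, using generic symbols rather than the authors' notation: each pixel integrates its photocurrent on a capacitance until a reference threshold is crossed, so brighter pixels spike earlier and the spike time encodes the inverse of the local illuminance,

\[ t_{\mathrm{spike}} \;=\; \frac{C\,V_{\mathrm{th}}}{I_{\mathrm{ph}}}, \qquad I_{\mathrm{ph}} \propto E_{\mathrm{pixel}}. \]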
COLOR-184
Color blindness and modern board games, Alessandro Rizzi1 and Matteo Sassi2; 1Università degli Studi di Milano and 2consultant (Italy)
The board game industry is experiencing a strong renewed interest. In the last few years, about 4,000 new board games have been designed and distributed each year. The gender balance among board game players is approaching parity, although males currently remain a slight majority. This means that (at least) around 10% of board game players are color blind. How does the board game industry deal with this? Recently, awareness has begun to rise in board game design, but so far there is a big gap compared with, for example, the computer game industry. This paper presents some data on the current situation, discussing exemplary cases of successful board games.
5:00 – 6:15 PM EI 2023 All-Conference Welcome Reception (in the Cyril Magnin Foyer)
Tuesday 17 January 2023
KEYNOTE: Applications I (T1)
Session Chair: Raja Bala, Amazon (United States)
8:50 – 10:10 AM
Balboa
8:50 IMAGE-274
KEYNOTE: Multi-scale representations for human pose estimation: Advances and applications, Andreas Savakis, Rochester Institute of Technology (United States)
Prof. Andreas Savakis is director of the Center for Human-aware AI (CHAI) and Professor of Computer Engineering at the Rochester Institute of Technology. His primary area of research is computer vision, with secondary interests in computational imaging and image processing. Savakis founded the Vision and Image Processing lab (VIP-lab) at RIT, where he works with students on topics including recognition, tracking, segmentation, pose estimation, facial expression, scene analysis, domain adaptation, and robust learning.
Human pose estimation is a topic of interest for many applications, including human-computer interaction, activity recognition, sports analysis, and health monitoring. Pose estimation methods have improved significantly in recent years due to advances in deep learning architectures and multi-scale representations. We present an efficient Waterfall Atrous Spatial Pooling (WASP) architecture for multi-scale feature extraction that is useful for both pose estimation (UniPose, OmniPose, HandyPose) and semantic segmentation (WASPnet, GourmetNet). Our Waterfall architecture leverages the efficiency of progressive filtering in cascade, while maintaining multi-scale fields-of-view comparable to spatial pyramid configurations. The waterfall approach is used for pose estimation in an encoder-decoder framework, producing state-of-the-art results for single-person 2D pose with UniPose and multi-person 2D pose with OmniPose. We extend our framework to other types of pose, such as 3D pose from a single image with UniPose+, 2D hand pose with HandyPose, and vehicle pose with VehiPose. We conclude by outlining new directions and applications for human pose and object pose.
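A loose PyTorch sketch of the waterfall idea: dilated (atrous) convolutions applied in cascade, with every stage's output concatenated so the module keeps spatial-pyramid-like fields of view while reusing each branch's computation. Channel counts and dilation rates are illustrative assumptions, not the published WASP configuration.

```python
import torch
import torch.nn as nn

class WaterfallAtrousPooling(nn.Module):
    """Cascade of dilated 3x3 convolutions; the outputs of every stage are
    concatenated and fused, approximating a spatial-pyramid field of view."""
    def __init__(self, in_ch=256, branch_ch=64, dilations=(1, 6, 12, 18)):
        super().__init__()
        self.stages = nn.ModuleList()
        ch = in_ch
        for d in dilations:
            self.stages.append(nn.Sequential(
                nn.Conv2d(ch, branch_ch, 3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(branch_ch), nn.ReLU(inplace=True)))
            ch = branch_ch                      # each stage feeds the next (the "waterfall")
        self.fuse = nn.Conv2d(branch_ch * len(dilations), in_ch, 1)

    def forward(self, x):
        outs, h = [], x
        for stage in self.stages:
            h = stage(h)
            outs.append(h)
        return self.fuse(torch.cat(outs, dim=1))

# features = torch.randn(1, 256, 64, 48); out = WaterfallAtrousPooling()(features)
```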
9:30 IMAGE-275
Robust hand hygiene monitoring for food safety using hand images, Shengtai Ju, Amy R. Reibman, and Amanda J. Deering, Purdue University (United States)
Hand hygiene is essential for food safety and food handlers. Maintaining proper hand hygiene can improve food safety and promote public welfare. However, traditional methods of evaluating hygiene during the food handling process, such as visual auditing by human experts, can be costly and inefficient compared to a computer vision system. Because of the varying conditions and locations of real-world food processing sites, computer vision systems for recognizing handwashing actions can be susceptible to changes in lighting and environments. Therefore, we design a robust and generalizable video system, based on ResNet50, that includes a hand extraction method and a two-stream network for classifying handwashing actions. More specifically, our hand extraction method eliminates the background and helps the classifier focus on hand regions under changing lighting conditions and environments. Our results demonstrate that our system with the hand extraction method improves action recognition accuracy and is more generalizable, achieving over 20% improvement in overall classification accuracy when evaluated on completely unseen data.
9:50 IMAGE-276
Evaluating the efficacy of skincare product: A realistic short-term facial pore simulation, Ling Li1, Bandara Dissanayake2, Tatsuya Omotezako2, Yunjie Zhong1, Qing Zhang3, Rizhao Cai1, Qian Zheng4, Dennis Sng1, Weisi Lin1, Yufei Wang5, and Alex C. Kot1; 1Nanyang Technological University (Singapore), 2Procter & Gamble (Singapore), 3East China Normal University (China), 4Zhejiang University (China), and 5China-Singapore International Joint Research Institute (China)
Simulating the effects of skincare products on the face is a potential new mode of product self-promotion while helping consumers choose the right product. Furthermore, such simulations enable one to anticipate one's skin condition and better manage skin health. However, effective simulations are lacking today. In this paper, we propose the first simulation model to reveal facial pore changes after using skincare products. Our simulation pipeline consists of two steps: training data establishment and facial pore simulation. To establish training data, we collect face images with various pore quality indexes from short-term (8-week) clinical studies. People experience significant skin fluctuations (due to natural rhythms, external stressors, etc.), which introduce large perturbations, so we propose a sliding-window mechanism to clean the data and select representative indexes to represent facial pore changes. The facial pore simulation stage consists of three modules: a UNet-based segmentation module to localize facial pores; a regression module to predict time-dependent warping hyperparameters; and a deformation module that takes the warping hyperparameters and pore segmentation labels as inputs to precisely deform pores accordingly. The proposed simulation renders realistic facial pore changes. This work will pave the way for future research in facial skin simulation and skincare product development.
10:00 AM – 7:30 PM Industry Exhibition - Tuesday (in the Cyril Magnin Foyer)
10:20 – 10:50 AM Coffee Break
Applications II (T2)
Session Chair: Qian Lin, HP Inc. (United States)
10:50 AM – 12:30 PM
Balboa
10:50 IMAGE-277
AI technology for aquatic and nautical search and rescue (TANSAR), Theus Aspiras, Ruixu Liu, and Vijayan K. Asari, University of Dayton (United States)
Natural disasters devastate local communities and make search and rescue slow for those directly affected. Especially in harsh conditions, the search and rescue of people can take several days, which may be life-threatening in some cases. Technology that allows faster analysis will improve search and rescue efforts in both search time and resource management. Our proposed artificial intelligence technology for aquatic and nautical search and rescue (TANSAR) provides deep learning processing for robust and efficient UAV image analysis in various search and rescue scenarios. We analyze various UAV data, implement state-of-the-art deep learning algorithms with near real-time processing, and integrate features for object detection.
11:10 IMAGE-278
Wearable spectrum imaging and telemetry at edge, Yang Cai, CMU (United States)
We present a head-mounted holographic display system for thermographic image overlay, biometric sensing, and wireless telemetry. The system is lightweight and reconfigurable for multiple field applications, including object contour detection and enhancement, breathing rate detection, and telemetry over a mobile phone for peer-to-peer communication and incident commanding dashboard. Due to the constraints of the limited computing power of an embedded system, we developed a lightweight image processing algorithm for edge detection and breath rate detection, as well as an image compression codec. The system can be integrated into a helmet or personal protection equipment such as a face shield or goggles. It can be applied to firefighting, medical emergency response, and other first-response operations. Finally, we present a case study of "Cold Trailing" for forest fire prevention in the wild.
11:30 IMAGE-279
Eidetic recognition of cattle using keypoint alignment, Manu Ramesh, Amy R. Reibman, and Jacquelyn Boerman, Purdue University (United States)
The ability to identify individual cows quickly and readily in the barn would enable real-time monitoring of their behavior, health, eating habits, and more, all of which could save time, money, or effort. This work focuses on creating an eidetic recognition, or re-identification (ReID), algorithm that learns to recognize individual cows from just a single training example per cow and with near-zero time to learn to identify a new cow, both features that existing cattle ReID systems lack. Our algorithm is designed to improve recognition robustness to deformations in cow bodies that occur when they are walking, turning, or seen slightly off-angle. Individual cows are first detected and localized using popular keypoint and mask detection techniques, then aligned to a fixed template, pixelated, binarized to reduce lighting effects, and serialized to obtain bit-vectors. Bit-vectors from cows at inference time are matched to those from training time using Hamming distance. To improve results, we add modules to verify the validity of detected keypoints, interpolate missing keypoints, and combine predictions from multiple frames using a majority vote. The video-level accuracy is over 60% for a set of nearly 150 Holstein cows.
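A hedged sketch of the matching stage described above, assuming aligned, pixelated, binarized crops: each crop becomes a packed bit-vector, gallery matching uses Hamming distance against a single enrolled example per cow, and per-frame predictions are combined by majority vote (the binarization threshold and voting scheme are illustrative):

```python
import numpy as np
from collections import Counter

def to_bitvector(aligned_patch, threshold=None):
    """Binarize a small aligned grayscale patch and pack it into bytes."""
    t = np.median(aligned_patch) if threshold is None else threshold
    return np.packbits((aligned_patch > t).astype(np.uint8).ravel())

def hamming(a, b):
    """Hamming distance between two packed bit-vectors of equal length."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

def identify(frame_bitvectors, gallery):
    """gallery: {cow_id: bitvector} built from a single example per cow.
    Match each frame independently, then majority-vote across frames."""
    votes = [min(gallery, key=lambda cid: hamming(bv, gallery[cid]))
             for bv in frame_bitvectors]
    return Counter(votes).most_common(1)[0][0]
```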
11:50 IMAGE-280
Challenges and constraints when applying few shot learning to a real-world scenario: In-the-wild camera-trap species classification, Haoyu Chen, Stacy Lindshield, and Amy R. Reibman, Purdue University (United States)
Few-shot learning (FSL) describes the challenge of learning to classify when only a few labeled examples are available. The goal is to be able to adapt to a new classification task using a minimal amount of new data. However, when applying FSL to real-world problems, a number of constraints and challenges are encountered: model selection, training data selection, hyper-parameter and cost-function choices, and final decision making. In this paper, we consider a realistic problem that fits perfectly with the narrative of FSL: our goal is to classify animal species that appear in our in-the-wild camera traps located in Senegal, when these species have yet to appear in popular animal datasets. Following the philosophy of FSL, we first train an FSL network to learn to separate animal species using popular, large animal datasets, and then evaluate the network on our data, for which there are few labeled images. Using this framework, we conduct a comparison between FSL models, parameter selections, training strategies, etc. We also discuss a weakness of the current FSL framework: the lack of a "does not belong to any of the classes" option. This is a common problem in real-world systems due to false positives when detecting bounding boxes that contain, in our case, animals. We then propose an additional constraint to overcome this weakness.
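A minimal sketch of the extra option the paper argues for: prototype-based few-shot classification that can answer "none of the classes" when a query is too far from every prototype. The distance threshold is a hypothetical hyper-parameter, not a value from the paper.

```python
import numpy as np

def classify_with_rejection(query_emb, prototypes, reject_threshold=1.5):
    """prototypes: {class_name: mean embedding of its few labeled examples}.
    Returns the nearest class, or None if the query does not belong to any."""
    names = list(prototypes)
    dists = np.array([np.linalg.norm(query_emb - prototypes[n]) for n in names])
    best = int(dists.argmin())
    if dists[best] > reject_threshold:
        return None   # e.g., a false-positive bounding box containing no animal
    return names[best]
```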
12:10 IMAGE-281
Smartphone-enabled point-of-care blood hemoglobin testing with color accuracy-assisted spectral learning, Sang Mok Park1, Yuhyun Ji1, Semin Kwon1, Andrew R. O’Brien2, Ying Wang2, and Young L. Kim1; 1Purdue University and 2Indiana University School of Medicine (United States)
We develop an mHealth technology for noninvasively measuring blood Hgb levels in patients with sickle cell anemia, using the photos of peripheral tissue acquired by the built-in camera of a smartphone. As an easily accessible sensing site, the inner eyelid (i.e., palpebral conjunctiva) is used because of the relatively uniform microvasculature and the absence of skin pigments. Color correction (color reproduction) and spectral learning (spectral super-resolution spectroscopy) algorithms are integrated for accurate and precise mHealth blood Hgb testing. First, color correction using a color reference chart with multiple color patches extracts absolute color information of the inner eyelid, compensating for smartphone models, ambient light conditions, and data formats during photo acquisition. Second, spectral learning virtually transforms the smartphone camera into a hyperspectral imaging system, mathematically reconstructing high-resolution spectra from color-corrected eyelid images. Third, color correction and spectral learning algorithms are combined with a spectroscopic model for blood Hgb quantification among sickle cell patients. Importantly, single-shot photo acquisition of the inner eyelid using the color reference chart allows straightforward, real-time, and instantaneous reading of blood Hgb levels. Overall, our mHealth blood Hgb tests could potentially be scalable, robust, and sustainable in resource-limited and homecare settings.
12:30 – 2:00 PM Lunch
Tuesday 17 January PLENARY: Embedded Gain Maps for Adaptive Display of High Dynamic Range Images
Session Chair: Robin Jenkin, NVIDIA Corporation (United States)
2:00 PM – 3:00 PM
Cyril Magnin I/II/III
Images optimized for High Dynamic Range (HDR) displays have brighter highlights and more detailed shadows, resulting in an increased sense of realism and greater impact. However, a major issue with HDR content is the lack of consistency in appearance across different devices and viewing environments. There are several reasons, including varying capabilities of HDR displays and the different tone mapping methods implemented across software and platforms. Consequently, HDR content authors can neither control nor predict how their images will appear in other apps.
We present a flexible system that provides consistent and adaptive display of HDR images. Conceptually, the method combines both SDR and HDR renditions within a single image and interpolates between the two dynamically at display time. We compute a Gain Map that represents the difference between the two renditions. In the file, we store a Base rendition (either SDR or HDR), the Gain Map, and some associated metadata. At display time, we combine the Base image with a scaled version of the Gain Map, where the scale factor depends on the image metadata, the HDR capacity of the display, and the viewing environment.
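Schematically, the display-time combination described above can be written as a per-pixel weighting of the Gain Map (a paraphrase of the description, not the exact published formulation):

\[ I_{\mathrm{display}}(x) \;=\; I_{\mathrm{base}}(x)\cdot G(x)^{\,w}, \qquad 0 \le w \le 1, \]

where w = 0 reproduces the Base rendition, w = 1 yields the alternate rendition, and intermediate values adapt the image to the display's HDR headroom and the viewing environment.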
Eric Chan, Fellow, Adobe Inc. (United States)
Eric Chan is a Fellow at Adobe, where he develops software for editing photographs. Current projects include Photoshop, Lightroom, Camera Raw, and Digital Negative (DNG). When not writing software, Chan enjoys spending time at his other keyboard, the piano. He is an enthusiastic nature photographer and often combines his photo activities with travel and hiking.
Paul M. Hubel, director of Image Quality in Software Engineering, Apple Inc. (United States)
Paul M. Hubel is director of Image Quality in Software Engineering at Apple. He has worked on computational photography and image quality of photographic systems for many years on all aspects of the imaging chain, particularly for iPhone. He trained in optical engineering at University of Rochester, Oxford University, and MIT, and has more than 50 patents on color imaging and camera technology. Hubel is active on the ISO-TC42 committee Digital Photography, where this work is under discussion, and is currently a VP on the IS&T Board. Outside work he enjoys photography, travel, cycling, coffee roasting, and plays trumpet in several bay area ensembles.
3:00 – 3:30 PM Coffee Break
5:30 – 7:00 PM EI 2023 Symposium Demonstration Session (in the Cyril Magnin Foyer)
Wednesday 18 January 2023
10:00 AM – 3:30 PM Industry Exhibition - Wednesday (in the Cyril Magnin Foyer)
10:20 – 10:50 AM Coffee Break
12:30 – 2:00 PM Lunch
Wednesday 18 January PLENARY: Bringing Vision Science to Electronic Imaging: The Pyramid of Visibility
Session Chair: Andreas Savakis, Rochester Institute of Technology (United States)
2:00 PM – 3:00 PM
Cyril Magnin I/II/III
Electronic imaging depends fundamentally on the capabilities and limitations of human vision. The challenge for the vision scientist is to describe these limitations to the engineer in a comprehensive, computable, and elegant formulation. Primary among these limitations are visibility of variations in light intensity over space and time, of variations in color over space and time, and of all of these patterns with position in the visual field. Lastly, we must describe how all these sensitivities vary with adapting light level. We have recently developed a structural description of human visual sensitivity that we call the Pyramid of Visibility, that accomplishes this synthesis. This talk shows how this structure accommodates all the dimensions described above, and how it can be used to solve a wide variety of problems in display engineering.
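In its simplest published form, the Pyramid of Visibility models log contrast sensitivity as approximately linear in temporal frequency, spatial frequency, eccentricity, and log luminance; a schematic version, with coefficients to be fit to data, is

\[ \log S(w, f, e, L) \;\approx\; c_0 + c_W\, w + c_F\, f + c_E\, e + c_L \log L, \]

where S is sensitivity, w temporal frequency, f spatial frequency, e eccentricity, and L the adapting luminance.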
Andrew B. Watson, chief vision scientist, Apple Inc. (United States)
Andrew Watson is Chief Vision Scientist at Apple, where he leads the application of vision science to technologies, applications, and displays. His research focuses on computational models of early vision. He is the author of more than 100 scientific papers and 8 patents. He has 21,180 citations and an h-index of 63. Watson founded the Journal of Vision, and served as editor-in-chief 2001-2013 and 2018-2022. Watson has received numerous awards including the Presidential Rank Award from the President of the United States.
3:00 – 3:30 PM Coffee Break
Imaging and Multimedia Analytics at the Edge 2023 Interactive (Poster) Paper Session
5:30 – 7:00 PM
Cyril Magnin Foyer
The following works will be presented at the EI 2023 Symposium Interactive (Poster) Paper Session.
IMAGE-282
Lightweight single pass numerical reading extraction for displays in the wild, Yan-Ming Chiou and Bob Price, Palo Alto Research Center Incorporated (United States)
In assistance applications or interfaces to legacy, non-connected devices, it can be helpful to extract readings from digital displays. For instance, one might want to read a microwave display, read a scale or thermometer, or check a glucose monitor and automatically fill in a log. On a server, one can take the approach of performing text spotting, cropping and normalizing the text, and then feeding it to an OCR engine. In a mobile or embedded device setting, we want something compact with low latency. In this paper we show that a simple model, loosely inspired by a Google Street View model [Goodfellow13], can simultaneously isolate and decode digits in digital displays using a lightweight network capable of running on low-power devices. The approach makes use of display synthesis and augmentation techniques to implement sim-to-real style training. We show that the model generalizes to a variety of devices and can read times, weights, temperatures, and other types of values. When coupled with a generic object detector such as Yolo X, it provides a powerful, computationally efficient solution for recognizing objects and their displays.
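A rough sketch of the kind of single-pass, multi-position reading head the abstract alludes to, in the spirit of the cited Street View model: a shared backbone feature feeds several per-position classifiers, each of which can also predict a "blank" class so variable-length readings need no separate detector (the backbone, number of positions, and class set are assumptions):

```python
import torch
import torch.nn as nn

class DisplayReader(nn.Module):
    """One forward pass predicts up to max_digits characters at once;
    class index 10 means 'blank', so shorter readings need no extra logic."""
    def __init__(self, feat_dim=256, max_digits=6, num_classes=11):
        super().__init__()
        self.backbone = nn.Sequential(            # stand-in for a small conv net
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU())
        self.heads = nn.ModuleList(
            nn.Linear(feat_dim, num_classes) for _ in range(max_digits))

    def forward(self, x):                         # x: (batch, 1, H, W) display crop
        f = self.backbone(x)
        return torch.stack([h(f) for h in self.heads], dim=1)  # (batch, positions, classes)

# logits = DisplayReader()(torch.randn(2, 1, 64, 128))
# reading = logits.argmax(-1)   # per-position digit indices, 10 = blank
```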
IMAGE-283
Robust tracking of industrial objects across environments from small samples in single environments using chroma-key and occlusion augmentations, Yan-Ming Chiou and Bob Price, Palo Alto Research Center Incorporated (United States)
Training deep models that can be deployed on embedded systems to robustly detect and track highly specialized industrial objects in a variety of field environments remains very challenging. Large deep foundation models (e.g., [yuan21]) make it easier than ever to detect and track everyday objects but do not work as well for specialized industrial objects. These models are often very large and not suitable for deployment on embedded systems. In this work we show that chroma-key-like substitution combined with artificial occlusion generation allows one to capture objects against a fixed background in the lab and then generalize to novel backgrounds that work in the real world under realistic conditions. We show that our methods handle this case significantly better than state-of-the-art methods such as MOSAIC augmentation on a Yolo V4 object detection task, obtaining up to 17 absolute percentage point improvements over standard techniques.
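A hedged sketch of chroma-key-style substitution augmentation, assuming the lab captures use a roughly uniform green backdrop; the HSV bounds and the occlusion rectangle are illustrative defaults, not the paper's parameters:

```python
import random
import cv2
import numpy as np

def chroma_key_augment(lab_bgr, background_bgr,
                       lower=(35, 60, 60), upper=(85, 255, 255), occlude=True):
    """Replace the green-screen backdrop with a random background image and
    optionally paste a random occluding rectangle over part of the object."""
    hsv = cv2.cvtColor(lab_bgr, cv2.COLOR_BGR2HSV)
    bg_mask = cv2.inRange(hsv, np.array(lower, np.uint8), np.array(upper, np.uint8)) > 0
    bg = cv2.resize(background_bgr, (lab_bgr.shape[1], lab_bgr.shape[0]))
    out = np.where(bg_mask[..., None], bg, lab_bgr)
    if occlude:
        h, w = out.shape[:2]
        x, y = random.randint(0, w // 2), random.randint(0, h // 2)
        out[y:y + h // 4, x:x + w // 4] = bg[y:y + h // 4, x:x + w // 4]
    return out
```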
5:30 – 7:00 PM EI 2023 Symposium Interactive (Poster) Paper Session (in the Cyril Magnin Foyer)
5:30 – 7:00 PM EI 2023 Meet the Future: A Showcase of Student and Young Professionals Research (in the Cyril Magnin Foyer)