Image Processing: Algorithms and Systems XXI
Monday 16 January 2023
10:20 – 10:50 AM Coffee Break
12:30 – 2:00 PM Lunch
Monday 16 January PLENARY: Neural Operators for Solving PDEs
Session Chair: Robin Jenkin, NVIDIA Corporation (United States)
2:00 PM – 3:00 PM
Cyril Magnin I/II/III
Deep learning surrogate models have shown promise in modeling complex physical phenomena such as fluid flows, molecular dynamics, and material properties. However, standard neural networks assume finite-dimensional inputs and outputs and hence cannot withstand a change in resolution or discretization between training and testing. We introduce Fourier neural operators that can learn operators, which are mappings between infinite-dimensional spaces. They are independent of the resolution or grid of the training data and allow zero-shot generalization to higher-resolution evaluations. When applied to weather forecasting, neural operators capture fine-scale phenomena and show skill similar to that of gold-standard numerical weather models for predictions up to a week or longer, while being 4-5 orders of magnitude faster.
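As a rough illustration of the resolution-independence property described above, below is a minimal sketch of a 1D spectral convolution layer, the core building block of a Fourier neural operator, assuming PyTorch; the class and parameter names are illustrative and this is not the speaker's implementation.

```python
# Minimal sketch of a 1D Fourier (spectral) convolution layer, the core
# building block of a Fourier neural operator. Names and sizes are
# illustrative; this is not the speaker's implementation.
import torch
import torch.nn as nn


class SpectralConv1d(nn.Module):
    def __init__(self, in_channels, out_channels, n_modes):
        super().__init__()
        self.n_modes = n_modes  # number of low Fourier modes to keep
        scale = 1.0 / (in_channels * out_channels)
        self.weights = nn.Parameter(
            scale * torch.randn(in_channels, out_channels, n_modes, dtype=torch.cfloat)
        )

    def forward(self, x):
        # x: (batch, channels, grid_points); the grid size may differ between
        # training and testing because the weights act on Fourier modes, not pixels.
        x_ft = torch.fft.rfft(x)
        out_ft = torch.zeros(
            x.shape[0], self.weights.shape[1], x_ft.shape[-1],
            dtype=torch.cfloat, device=x.device,
        )
        modes = min(self.n_modes, x_ft.shape[-1])
        out_ft[:, :, :modes] = torch.einsum(
            "bim,iom->bom", x_ft[:, :, :modes], self.weights[:, :, :modes]
        )
        return torch.fft.irfft(out_ft, n=x.shape[-1])
```

Because the learned weights act on a fixed number of Fourier modes rather than on grid points, the same layer can be evaluated on a finer grid than it was trained on, which is the property exploited for zero-shot super-resolution.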
Anima Anandkumar, Bren professor, California Institute of Technology, and senior director of AI Research, NVIDIA Corporation (United States)
Anima Anandkumar is a Bren Professor at Caltech and Senior Director of AI Research at NVIDIA. She is passionate about designing principled AI algorithms and applying them to interdisciplinary domains. She has received several honors, including an IEEE Fellowship, an Alfred P. Sloan Fellowship, an NSF CAREER Award, and faculty fellowships from Microsoft, Google, Facebook, and Adobe. She is part of the World Economic Forum's Expert Network. Anandkumar received her BTech from the Indian Institute of Technology Madras and her PhD from Cornell University, completed postdoctoral research at MIT, and was an assistant professor at the University of California, Irvine.
3:00 – 3:30 PM Coffee Break
EI 2023 Highlights Session
Session Chair: Robin Jenkin, NVIDIA Corporation (United States)
3:30 – 5:00 PM
Cyril Magnin II
Join us for a session that celebrates the breadth of what EI has to offer with short papers selected from EI conferences.
NOTE: The EI-wide "EI 2023 Highlights" session is concurrent with Monday afternoon COIMG, COLOR, IMAGE, and IQSP conference sessions.
IQSP-309
Evaluation of image quality metrics designed for DRI tasks with automotive cameras, Valentine Klein, Yiqi LI, Claudio Greco, Laurent Chanas, and Frédéric Guichard, DXOMARK (France) [view abstract]
Driving assistance is increasingly used in new car models. Most driving assistance systems are based on automotive cameras and computer vision. Computer vision, regardless of the underlying algorithms and technology, requires images of good quality, defined according to the task. This notion of good image quality still needs to be defined for computer vision, as its criteria are very different from those of human vision: humans, for instance, have better contrast detection ability than imaging chains. The aim of this article is to compare three metrics designed for object detection with computer vision: the Contrast Detection Probability (CDP) [1, 2, 3, 4], the Contrast Signal to Noise Ratio (CSNR) [5], and the Frequency of Correct Resolution (FCR) [6]. For this purpose, the computer vision task of reading the characters on a license plate is used as a benchmark. The objective is to check the correlation between each objective metric and the ability of a neural network to perform this task. A protocol to test these metrics and compare them to the output of the neural network has been designed, and the pros and cons of each of the three metrics are noted.
SD&A-224
Human performance using stereo 3D in a helmet mounted display and association with individual stereo acuity, Bonnie Posselt, RAF Centre of Aviation Medicine (United Kingdom) [view abstract]
Binocular Helmet Mounted Displays (HMDs) are a critical part of the aircraft system, allowing information to be presented to the aviator with stereoscopic 3D (S3D) depth, potentially enhancing situational awareness and improving performance. The utility of S3D in an HMD may be linked to an individual's ability to perceive changes in binocular disparity (stereo acuity). Though minimum stereo acuity standards exist for most military aviators, current test methods may be unable to characterise this relationship. This presentation investigates the effect of S3D on performance when used in a warning alert displayed in an HMD. Furthermore, any effects on performance, ocular symptoms, and cognitive workload are evaluated with regard to individual stereo acuity measured with a variety of paper-based and digital stereo tests.
IMAGE-281
Smartphone-enabled point-of-care blood hemoglobin testing with color accuracy-assisted spectral learning, Sang Mok Park1, Yuhyun Ji1, Semin Kwon1, Andrew R. O’Brien2, Ying Wang2, and Young L. Kim1; 1Purdue University and 2Indiana University School of Medicine (United States) [view abstract]
We develop an mHealth technology for noninvasively measuring blood Hgb levels in patients with sickle cell anemia, using the photos of peripheral tissue acquired by the built-in camera of a smartphone. As an easily accessible sensing site, the inner eyelid (i.e., palpebral conjunctiva) is used because of the relatively uniform microvasculature and the absence of skin pigments. Color correction (color reproduction) and spectral learning (spectral super-resolution spectroscopy) algorithms are integrated for accurate and precise mHealth blood Hgb testing. First, color correction using a color reference chart with multiple color patches extracts absolute color information of the inner eyelid, compensating for smartphone models, ambient light conditions, and data formats during photo acquisition. Second, spectral learning virtually transforms the smartphone camera into a hyperspectral imaging system, mathematically reconstructing high-resolution spectra from color-corrected eyelid images. Third, color correction and spectral learning algorithms are combined with a spectroscopic model for blood Hgb quantification among sickle cell patients. Importantly, single-shot photo acquisition of the inner eyelid using the color reference chart allows straightforward, real-time, and instantaneous reading of blood Hgb levels. Overall, our mHealth blood Hgb tests could potentially be scalable, robust, and sustainable in resource-limited and homecare settings.
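As a rough illustration of the color-correction step (mapping the camera's RGB values of the chart patches to known reference values), here is a minimal least-squares sketch; the affine model and the function names are assumptions for illustration, not the authors' exact pipeline.

```python
# Illustrative sketch of chart-based color correction: fit a linear/affine map
# from the camera's measured RGB of reference patches to their known target
# values, then apply it to the eyelid image. The affine model and names are
# assumptions, not the authors' exact algorithm.
import numpy as np


def fit_color_correction(measured_rgb, reference_rgb):
    """measured_rgb, reference_rgb: (n_patches, 3) arrays."""
    # Append a constant term so the fit is affine (a 4x3 matrix).
    X = np.hstack([measured_rgb, np.ones((measured_rgb.shape[0], 1))])
    M, *_ = np.linalg.lstsq(X, reference_rgb, rcond=None)
    return M  # shape (4, 3)


def apply_color_correction(image_rgb, M):
    """image_rgb: (h, w, 3) float array of the eyelid photo."""
    h, w, _ = image_rgb.shape
    X = np.hstack([image_rgb.reshape(-1, 3), np.ones((h * w, 1))])
    return (X @ M).reshape(h, w, 3)
```

The corrected image would then feed the spectral-learning stage that reconstructs high-resolution spectra from the three color channels.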
AVM-118
Designing scenes to quantify the performance of automotive perception systems, Zhenyi Liu1, Devesh Shah2, Alireza Rahimpour2, Joyce Farrell1, and Brian Wandell1; 1Stanford University and 2Ford Motor Company (United States) [view abstract]
We implemented an end-to-end simulation of camera-based perception systems used in automotive applications. The open-source software creates complex driving scenes and simulates cameras that acquire images of these scenes. The camera images are then used by a neural network in the perception system to identify the locations of scene objects, providing the results as input to the decision system. In this paper, we design collections of test scenes that can be used to quantify the perception system's performance under a range of (a) environmental conditions (object distance, occlusion ratio, lighting levels) and (b) camera parameters (pixel size, lens type, color filter array). We are designing scene collections to analyze performance for detecting vehicles, traffic signs, and vulnerable road users across a range of environmental conditions and camera parameters. With experience, such scene collections may serve a role similar to that of the standardized test targets used to quantify camera image quality (e.g., acuity, color).
VDA-403
Visualizing and monitoring the process of injection molding, Christian A. Steinparz1, Thomas Mitterlehner2, Bernhard Praher2, Klaus Straka1,2, Holger Stitz1,3, and Marc Streit1,3; 1Johannes Kepler University, 2Moldsonics GmbH, and 3datavisyn GmbH (Austria) [view abstract]
In injection molding machines the molds are rarely equipped with sensor systems. The availability of non-invasive ultrasound-based in-mold sensors provides better means for guiding operators of injection molding machines throughout the production process. However, existing visualizations are mostly limited to plots of temperature and pressure over time. In this work, we present the result of a design study created in collaboration with domain experts. The resulting prototypical application uses real-world data taken from live ultrasound sensor measurements for injection molding cavities captured over multiple cycles during the injection process. Our contribution includes a definition of tasks for setting up and monitoring the machines during the process, and the corresponding web-based visual analysis tool addressing these tasks. The interface consists of a multi-view display with various levels of data aggregation that is updated live for newly streamed data of ongoing injection cycles.
COIMG-155
Commissioning the James Webb Space Telescope, Joseph M. Howard, NASA Goddard Space Flight Center (United States) [view abstract]
Astronomy is arguably in a golden age, with current and future NASA space telescopes expected to contribute to rapid growth in our understanding of the universe. The most recent addition to our space-based telescopes dedicated to astronomy and astrophysics is the James Webb Space Telescope (JWST), which launched on 25 December 2021. This talk discusses the first six months in space for JWST, which were spent commissioning the observatory through many deployments, alignments, and system and instrumentation checks. These engineering activities help verify the proper working of the telescope prior to commencing full science operations. For the session: Computational Imaging using Fourier Ptychography and Phase Retrieval.
HVEI-223
Critical flicker frequency (CFF) at high luminance levels, Alexandre Chapiro1, Nathan Matsuda1, Maliha Ashraf2, and Rafal Mantiuk3; 1Meta (United States), 2University of Liverpool (United Kingdom), and 3University of Cambridge (United Kingdom) [view abstract]
The critical flicker fusion (CFF) is the frequency of changes at which a temporally periodic light begins to appear completely steady to an observer. This value is affected by several visual factors, such as the luminance of the stimulus or its location on the retina. With new high dynamic range (HDR) displays operating at higher luminance levels and virtual reality (VR) displays presenting wide fields of view, the effective CFF may change significantly from values expected for traditional presentation. In this work we use a prototype HDR VR display capable of luminances up to 20,000 cd/m^2 to gather a novel set of CFF measurements at previously unexamined levels of luminance, eccentricity, and size. Our data are useful for studying the temporal behavior of the visual system at high luminance levels, as well as for setting useful thresholds for display engineering.
HPCI-228
Physics guided machine learning for image-based material decomposition of tissues from simulated breast models with calcifications, Muralikrishnan Gopalakrishnan Meena1, Amir K. Ziabari1, Singanallur Venkatakrishnan1, Isaac R. Lyngaas1, Matthew R. Norman1, Balint Joo1, Thomas L. Beck1, Charles A. Bouman2, Anuj Kapadia1, and Xiao Wang1; 1Oak Ridge National Laboratory and 2Purdue University (United States) [view abstract]
Material decomposition of Computed Tomography (CT) scans using projection-based approaches, while highly accurate, poses a challenge for medical imaging researchers and clinicians due to limited or no access to projection data. We introduce a deep learning image-based material decomposition method guided by physics that requires no access to projection data. The method is demonstrated by decomposing tissues from simulated dual-energy X-ray CT scans of virtual human phantoms containing four materials: adipose, fibroglandular, calcification, and air. The method uses a hybrid unsupervised and supervised learning technique to tackle the material decomposition problem. We take advantage of the unique X-ray absorption rate of calcium compared to body tissues to perform a preliminary segmentation of calcification from the images using unsupervised learning. We then perform supervised material decomposition using a deep-learned U-Net model, trained on GPUs in the high-performance systems at the Oak Ridge Leadership Computing Facility. The method is demonstrated on simulated breast models to decompose calcification, adipose, fibroglandular, and air.
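The physics guidance exploits calcium's much higher X-ray attenuation than soft tissue; a toy sketch of such an unsupervised pre-segmentation is shown below, with threshold values and variable names chosen purely for illustration rather than taken from the paper.

```python
# Toy sketch of the physics-guided idea: calcium attenuates X-rays far more
# strongly than soft tissue, so a simple rule on the reconstructed dual-energy
# images can pre-segment calcifications before the supervised U-Net stage.
# Threshold values and variable names are hypothetical.
import numpy as np


def presegment_calcification(low_kev_img, high_kev_img, ratio_threshold=1.8):
    """low_kev_img, high_kev_img: reconstructed images at two X-ray energies."""
    eps = 1e-6
    # The low/high energy attenuation ratio separates calcium from soft tissue
    # more robustly than a single-energy threshold (illustrative rule only).
    ratio = low_kev_img / (high_kev_img + eps)
    bright = high_kev_img > np.percentile(high_kev_img, 90)
    return (ratio > ratio_threshold) & bright  # boolean calcification mask
```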
3DIA-104
Layered view synthesis for general images, Loïc Dehan, Wiebe Van Ranst, and Patrick Vandewalle, Katholieke Universiteit Leuven (Belgium) [view abstract]
We describe a novel method for monocular view synthesis. The goal of our work is to create a visually pleasing set of horizontally spaced views based on a single image. This can be applied to view synthesis for virtual reality and glasses-free 3D displays. Previous methods produce realistic results on images that show a clear distinction between a foreground object and the background. We aim to create novel views in more general, crowded scenes in which there is no such clear distinction. Our main contribution is a computationally efficient method for realistic occlusion inpainting and blending, especially in complex scenes. Our method can be effectively applied to any image, which we show both qualitatively and quantitatively on a large dataset of stereo images. Our method performs natural disocclusion inpainting and maintains the shape and edge quality of foreground objects.
ISS-329
A self-powered asynchronous image sensor with independent in-pixel harvesting and sensing operations, Ruben Gomez-Merchan, Juan Antonio Leñero-Bardallo, and Ángel Rodríguez-Vázquez, University of Seville (Spain) [view abstract]
A new self-powered asynchronous sensor with a novel pixel architecture is presented. Pixels are autonomous and can harvest or sense energy independently. During image acquisition, pixels toggle to a harvesting operation mode once they have sensed their local illumination level. With the proposed pixel architecture, the most illuminated pixels provide an early contribution to powering the sensor, while dimly illuminated ones spend more time sensing their local illumination. Thus, the equivalent frame rate is higher than that offered by conventional self-powered sensors that harvest and sense illumination in independent phases. The proposed sensor uses a Time-to-First-Spike readout that allows trading off image quality against data and bandwidth consumption. The sensor offers HDR operation with a dynamic range of 80 dB. Pixel power consumption is only 70 pW. In the article, we describe the sensor and pixel architectures in detail. Experimental results are provided and discussed. Sensor specifications are benchmarked against the state of the art.
COLOR-184
Color blindness and modern board games, Alessandro Rizzi1 and Matteo Sassi2; 1Università degli Studi di Milano and 2consultant (Italy) [view abstract]
The board game industry is experiencing strong renewed interest. In the last few years, about 4000 new board games have been designed and distributed each year. The gender balance among board game players is approaching parity, but the male component is still a slight majority. This means that (at least) around 10% of board game players are color blind. How does the board game industry deal with this? Awareness has recently begun to rise in board game design, but so far there is a big gap compared with, for example, the computer game industry. This paper presents some data about the current situation, discussing exemplary cases of successful board games.
5:00 – 6:15 PM EI 2023 All-Conference Welcome Reception (in the Cyril Magnin Foyer)
Tuesday 17 January 2023
10:00 AM – 7:30 PM Industry Exhibition - Tuesday (in the Cyril Magnin Foyer)
10:20 – 10:50 AM Coffee Break
12:30 – 2:00 PM Lunch
Tuesday 17 January PLENARY: Embedded Gain Maps for Adaptive Display of High Dynamic Range Images
Session Chair: Robin Jenkin, NVIDIA Corporation (United States)
2:00 PM – 3:00 PM
Cyril Magnin I/II/III
Images optimized for High Dynamic Range (HDR) displays have brighter highlights and more detailed shadows, resulting in an increased sense of realism and greater impact. However, a major issue with HDR content is the lack of consistency in appearance across different devices and viewing environments. There are several reasons, including varying capabilities of HDR displays and the different tone mapping methods implemented across software and platforms. Consequently, HDR content authors can neither control nor predict how their images will appear in other apps.
We present a flexible system that provides consistent and adaptive display of HDR images. Conceptually, the method combines both SDR and HDR renditions within a single image and interpolates between the two dynamically at display time. We compute a Gain Map that represents the difference between the two renditions. In the file, we store a Base rendition (either SDR or HDR), the Gain Map, and some associated metadata. At display time, we combine the Base image with a scaled version of the Gain Map, where the scale factor depends on the image metadata, the HDR capacity of the display, and the viewing environment.
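A minimal sketch of how a Base rendition and Gain Map might be combined at display time is shown below, assuming the interpolation is done as a per-pixel power of the gain ratio in linear space; the weighting function and array shapes are illustrative assumptions, not the presenters' exact formulation.

```python
# Sketch of combining a Base rendition with a scaled Gain Map at display time.
# Working in linear space and the form of the display-adaptive weight are
# assumptions for illustration; the talk's exact formulation may differ.
import numpy as np


def display_render(base_linear, gain_map, display_headroom, content_headroom):
    """base_linear: (h, w, 3) linear Base image.
    gain_map: per-pixel HDR/SDR ratio, shape (h, w) or (h, w, 3).
    display_headroom, content_headroom: peak/diffuse-white luminance ratios."""
    # Weight moves from 0 (show the Base rendition as-is) to 1 (full alternate
    # rendition) as the display's available headroom approaches the content's.
    w = np.clip(np.log2(display_headroom) / np.log2(content_headroom), 0.0, 1.0)
    gain = np.power(gain_map, w)
    if gain.ndim == 2:
        gain = gain[..., None]  # broadcast a single-channel map over RGB
    return base_linear * gain
```

The same idea also covers the viewing environment: the metadata and ambient conditions would simply modify the effective headroom fed into the weight.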
Eric Chan, Fellow, Adobe Inc. (United States)
Eric Chan is a Fellow at Adobe, where he develops software for editing photographs. Current projects include Photoshop, Lightroom, Camera Raw, and Digital Negative (DNG). When not writing software, Chan enjoys spending time at his other keyboard, the piano. He is an enthusiastic nature photographer and often combines his photo activities with travel and hiking.
Paul M. Hubel, director of Image Quality in Software Engineering, Apple Inc. (United States)
Paul M. Hubel is director of Image Quality in Software Engineering at Apple. He has worked on computational photography and the image quality of photographic systems for many years, covering all aspects of the imaging chain, particularly for iPhone. He trained in optical engineering at the University of Rochester, Oxford University, and MIT, and has more than 50 patents on color imaging and camera technology. Hubel is active on the ISO TC42 (Digital Photography) committee, where this work is under discussion, and is currently a VP on the IS&T Board. Outside work he enjoys photography, travel, cycling, and coffee roasting, and plays trumpet in several Bay Area ensembles.
3:00 – 3:30 PM Coffee Break
5:30 – 7:00 PM EI 2023 Symposium Demonstration Session (in the Cyril Magnin Foyer)
Wednesday 18 January 2023
10:00 AM – 3:30 PM Industry Exhibition - Wednesday (in the Cyril Magnin Foyer)
10:20 – 10:50 AM Coffee Break
12:30 – 2:00 PM Lunch
Wednesday 18 January PLENARY: Bringing Vision Science to Electronic Imaging: The Pyramid of Visibility
Session Chair: Andreas Savakis, Rochester Institute of Technology (United States)
2:00 PM – 3:00 PM
Cyril Magnin I/II/III
Electronic imaging depends fundamentally on the capabilities and limitations of human vision. The challenge for the vision scientist is to describe these limitations to the engineer in a comprehensive, computable, and elegant formulation. Primary among these limitations are the visibility of variations in light intensity over space and time, of variations in color over space and time, and of all of these patterns with position in the visual field. Lastly, we must describe how all these sensitivities vary with adapting light level. We have recently developed a structural description of human visual sensitivity, which we call the Pyramid of Visibility, that accomplishes this synthesis. This talk shows how this structure accommodates all the dimensions described above, and how it can be used to solve a wide variety of problems in display engineering.
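For reference, the core approximation usually associated with the Pyramid of Visibility (after Watson and Ahumada) can be summarized as below; this is a hedged paraphrase rather than the talk's exact formulation, and it holds away from the lowest spatial and temporal frequencies.

```latex
% Hedged summary of the Pyramid of Visibility approximation: log contrast
% sensitivity S is roughly linear in spatial frequency f, temporal frequency w,
% and log adapting luminance L; c_0, c_f, c_w, c_L are fitted constants.
\log S(f, w, L) \approx c_0 + c_f\, f + c_w\, w + c_L \log L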
Andrew B. Watson, chief vision scientist, Apple Inc. (United States)
Andrew Watson is Chief Vision Scientist at Apple, where he leads the application of vision science to technologies, applications, and displays. His research focuses on computational models of early vision. He is the author of more than 100 scientific papers and 8 patents. He has 21,180 citations and an h-index of 63. Watson founded the Journal of Vision, and served as editor-in-chief 2001-2013 and 2018-2022. Watson has received numerous awards including the Presidential Rank Award from the President of the United States.
3:00 – 3:30 PM Coffee Break
KEYNOTE: Systematic Data Labeling (W3.1)
Session Chairs: Karen Egiazarian, Tampere University (Finland) and Atanas Gotchev, Tampere University (Finland)
3:30 – 4:15 PM
Cyril Magnin III
3:30
Conference Welcome
3:35 IPAS-284
KEYNOTE: Systematic data labeling at the point of ingestion in enterprise systems, Gevorg Karapetyan, Zero Cognitive Systems (United States) [view abstract]
Gevorg Karapetyan is co-founder and Chief Technology Officer of Zero Cognitive Systems. In this role, Karapetyan leads the long-term technology vision and is responsible for the direction, coordination, and delivery of technology. Founded in 2015 in Los Gatos, California, Zero is dedicated to applying artificial intelligence and smart automation to the most pressing operational challenges of the professional services industry. Karapetyan previously worked at Imagenomic as a Senior Software Engineer and attended the National Polytechnic University of Armenia. Karapetyan holds a PhD in Computer Science and has more than 10 years of experience in developing intelligent automation systems.
Almost 80% of enterprise data is unstructured. Unstructured data includes documents, emails, images, web pages, video files, audio files, etc., which are stored in different data silos. Classification of unstructured data is an important topic for the world's largest enterprises. One approach is labeling the content per particular project. We present a system for systematic labeling of unstructured data at the point of ingestion. This approach makes it possible to systematically generate metadata from incoming unstructured data, which can be stored in data catalogs, unlocking the ability to derive business insights from the data and reduce security risks.
Machine Learning for Image Processing (W3.2)
Session Chairs:
Karen Egiazarian, Tampere University (Finland) and Atanas Gotchev, Tampere University (Finland)
4:15 – 5:35 PM
Cyril Magnin III
4:15 IPAS-285
ORCA: An end-to-end video object removal framework with cropping interested region and quality assessment, Minseong Son, Hansol Lee, Sungkeun Kwak, and Jihwan Woo, CJ OliveNetworks (Republic of Korea) [view abstract]
Recently, various types of video inpainting models have been released. Video inpainting is used to naturally erase an unwanted object from a video. However, inpainting models usually require frames extracted from a video together with masks, and these data are mostly prepared manually. We propose a novel end-to-end video object removal framework with cropping of the interested region and video quality assessment (ORCA). ORCA is built in an end-to-end way by combining detection, segmentation, and inpainting modules. A key characteristic of the proposed framework is a cropping step before the inpainting step. In addition, we propose our own video quality assessment, since ORCA uses two models for inpainting; the new metric indicates which of the two models produces the higher-quality result. Experimental results show the superior performance of the proposed methods.
4:35 IPAS-286
Detection of object throwing behavior in surveillance videos, Ivo P.C. Kersten, Erkut Akdag, Egor Bondarev, and Peter H. de With, Eindhoven University of Technology (the Netherlands) [view abstract]
Anomalous behavior detection is a challenging research area within computer vision. One such behavior is throwing actions in traffic flow, which is one of the unique requirements of our Smart City project to enhance public safety. This paper proposes a solution for throwing action detection in surveillance videos using deep learning. At present, datasets for throwing actions are not publicly available. To address the use case of our Smart City project, we first generate the novel public 'Throwing Action' dataset, consisting of 271 videos of throwing actions performed by traffic participants, such as pedestrians, bicyclists, and car drivers, and 130 normal videos without throwing actions. Second, we compare the performance of different feature extractors for our anomaly detection method on the UCF-Crime and Throwing-Action datasets. Finally, we improve the performance of the anomaly detection algorithm by applying the Adam optimizer instead of Adadelta, and we propose a mean normal loss function that yields better anomaly detection performance. The experimental results reach an area under the ROC curve of 86.10 on the Throwing-Action dataset and 80.13 on the combined UCF-Crime+Throwing dataset.
4:55 IPAS-287
Hybrid diffractive optics (DOE & refractive lens) for broadband EDoF imaging, SeyyedReza MiriRostami, Samuel Pinilla, Igor Shevkunov, Vladimir Katkovnik, and Karen Egiazarian, Tampere University (Finland) [view abstract]
In the considered hybrid diffractive imaging system, a refractive lens is arranged together with a multilevel phase mask (MPM) as a diffractive optical element (DOE) for achromatic extended-depth-of-field (EDoF) imaging. This paper proposes a fully differentiable image formation model that uses neural network techniques to maximize imaging quality by optimizing the MPM, the digital image reconstruction algorithm, the refractive lens parameters (aperture size, focal length), and the distance between the MPM and the sensor. First, model-based numerical simulations and end-to-end joint optimization of imaging are used. A spatial light modulator (SLM) is employed in the second stage of the design to implement the MPM optimized in the first stage, and the image processing is optimized experimentally using a learning-based approach. The third stage of optimization targets joint optimization of the SLM phase pattern and the image reconstruction algorithm in a hardware-in-the-loop (HIL) setup, which allows compensation for the mismatch between numerical modeling and the physical reality of optics and sensor. A comparative analysis of the imaging accuracy and quality using the optical parameters is presented. It is demonstrated experimentally, for the first time to the best of our knowledge, that wavefront phase modulation can provide imaging quality that compares favorably with some commercial multi-lens cameras.
5:15 IPAS-288
Evaluating active learning for blind imbalanced domains, Hiroshi Kuwajima1, Masayuki Tanaka2, and Masatoshi Okutomi2; 1DENSO Corporation and 2Tokyo Institute of Technology (Japan) [view abstract]
Deep learning, which has been very successful in recent years, requires a large amount of data. Active learning has been widely studied and used for decades to reduce annotation costs and now attracts considerable attention in deep learning. Many real-world deep learning applications use active learning to select the informative data to be annotated. In this paper, we first investigate laboratory settings for active learning. We show significant gaps between the results from different laboratory settings and describe a practical laboratory setting that reasonably reflects active learning use cases in real-world applications. Then, we introduce the problem setting of blind imbalanced domains. Any data set includes multiple domains, e.g., individuals in handwritten character recognition with different social attributes. Major domains have many samples, and minor domains have few samples in the training set; however, we must accurately infer both major and minor domains in the test phase. We experimentally compare different active learning methods for blind imbalanced domains in our practical laboratory setting. We show that a simple active learning method using the softmax margin and a model training method using distance-based sampling with center loss, both working in the deep feature space, perform well.
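As an illustration of the softmax-margin acquisition rule mentioned above, the sketch below selects the unlabeled samples whose top two class probabilities are closest; the function name and interface are illustrative, not taken from the paper.

```python
# Minimal sketch of softmax-margin sampling for active learning: pick the
# unlabeled samples whose top-two class probabilities are closest, i.e. the
# ones the model is least decisive about. Names are illustrative.
import numpy as np


def select_by_softmax_margin(probs, budget):
    """probs: (n_samples, n_classes) softmax outputs on the unlabeled pool."""
    top2 = np.sort(probs, axis=1)[:, -2:]   # two largest class probabilities
    margin = top2[:, 1] - top2[:, 0]        # small margin = uncertain sample
    return np.argsort(margin)[:budget]      # indices to send for annotation
```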
5:30 – 7:00 PM EI 2023 Symposium Interactive (Poster) Paper Session (in the Cyril Magnin Foyer)
5:30 – 7:00 PM EI 2023 Meet the Future: A Showcase of Student and Young Professionals Research (in the Cyril Magnin Foyer)
Image Processing: Algorithms and Systems XXI Interactive (Poster) Paper Session (W4)
5:35 – 7:00 PM
Cyril Magnin Foyer
The following work will be presented at the EI 2023 Symposium Interactive (Poster) Paper Session.
IPAS-290
MLExchange: An integrated platform for scientific machine learning, Guanhua Hao1, Tanny Chavez1, Zhuowen Zhao1, Elizabeth Holman1, Eric Roberts1, Howard Yanxon2, Adam Green1, Harinarayan Krishnan1, Dylan McReynolds1, Nicholas Schwarz2, Petrus Zwart1, Alexander Hexemer1, and Dilworth Parkinson1; 1Lawrence Berkeley National Laboratory and 2Argonne National Laboratory (United States) [view abstract]
Scientific user facilities are some of the world’s leading producers of scientific data. As a collaboration effort across several Department of Energy (DOE) national labs, a project called “MLExchange” is under development to build a collective machine learning platform, and it is targeted to serve as a toolbox to enhance the experience for users working with large scientific data. Two applications within the platform are designed to focus on image analysis: an image segmentation application and an image labeling pipeline. The segmentation application has a web-based interface with embedded machine learning algorithms to aid segmentation tasks. Three machine learning models, including a Mixed-Scale Dense Convolutional Neural Network (MSDNet), have been successfully deployed. The labeling pipeline consists of three web-based applications: Label Maker, Data Clinic, and MLCoach, and aims to provide automatic sample-type identification/classification tasks. Several use cases have been successfully deployed using X-ray scattering and microCT data.
Thursday 19 January 2023
Face and Facial Image Processing (R1)
Session Chairs:
Karen Egiazarian, Tampere University (Finland) and Atanas Gotchev, Tampere University (Finland)
8:50 – 9:50 AM
Cyril Magnin III
8:50 IPAS-291
Facial expression recognition using visual transformer with histogram of oriented gradients, Jieun Kim, Ju o Kim, Seungwan Je, and Deokwoo Lee, Keimyung University (Republic of Korea) [view abstract]
Emotions play an important role in our lives as a response to our interactions with others, our decisions, and so on. Among various emotional signals, facial expression is one of the most powerful and natural means for humans to convey their emotions and intentions, and it has the advantage that information can easily be obtained using only a camera, so facial-expression-based emotion research is being actively conducted. Facial expression recognition (FER) has been studied by classifying expressions into seven basic emotions: anger, disgust, fear, happiness, sadness, surprise, and neutral. Before the advent of deep learning, handcrafted feature extractors and simple classifiers such as SVM and AdaBoost were used to extract facial emotion. With deep learning, it is now possible to extract facial expressions without hand-designed feature extractors. Despite its excellent performance in FER research, it remains a challenging task due to external factors such as occlusion, illumination, and pose, and to the similarity between different facial expressions. In this paper, we propose a method, called FViT, that trains a ResNet [1] and Vision Transformer [2] and uses Histogram of Oriented Gradients (HOG) [3] features to address the similarity problem between facial expressions.
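As an illustration of the HOG features referenced in [3], the sketch below extracts a HOG descriptor from a face crop with scikit-image; the file path is hypothetical and the cell and block parameters are common defaults, not necessarily the authors' settings.

```python
# Illustrative HOG extraction for a face crop using scikit-image; the cell and
# block parameters are common defaults, not necessarily the authors' settings.
from skimage import color, io, transform
from skimage.feature import hog

image = color.rgb2gray(io.imread("face.jpg"))   # hypothetical input path
image = transform.resize(image, (224, 224))

features = hog(
    image,
    orientations=9,
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),
    feature_vector=True,
)
print(features.shape)  # 1-D descriptor that could feed the transformer branch
```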
9:10 IPAS-292
Face expressions understanding by geometrical characterization of deep human faces representation, Adrien Raison, Theo Biardeau, Pascal Bourdon, and David Helbert, University of Poitiers (France) [view abstract]
Understanding facial expressions is key to a better understanding of human nature. In this contribution we propose an end-to-end pipeline that takes color images as input and produces a semantic graph that numerically encodes facial emotions. This approach leverages low-level geometric details as the face representation; these are numerical representations of facial muscle activation patterns used to build this emotional understanding. We show that our method recovers social expectations of what characterizes facial emotions.
9:30 IPAS-293
Crowd counting using deep learning based head detection, Maryam Hassan1, Farhan Hussain1, Sultan D. Khan2, Mohib Ullah3, Mudassar Yamin3, and Habib Ullah4; 1NUST College of Electrical & Mechanical Engineering (Pakistan), 2National University of Technology (Pakistan), 3Norwegian University of Science and Technology (Norway), and 4Norwegian University of Life Sciences (NMBU) (Norway) [view abstract]
Scale invariance and high miss-detection rates for small objects are among the challenging issues for object detection and often lead to inaccurate results. This research aims to provide an accurate detection model for crowd counting by focusing on human head detection in natural scenes acquired from the publicly available Casablanca, Hollywood-Heads, and SCUT-HEAD datasets. In this study, we tuned YOLOv5, a deep convolutional neural network (CNN) based object detection architecture, and then evaluated the model using the mean average precision (mAP) score, precision, and recall. A transfer learning approach is used for fine-tuning the architecture. Training on one dataset and testing the model on another leads to inaccurate results due to the different types of heads in the different datasets. Another main contribution of our research is combining the three datasets into a single dataset that includes heads of every size: small, medium, and large. The experimental results show that this YOLOv5 architecture achieves significant improvements in small head detection in crowded scenes compared to baseline approaches such as Faster R-CNN and the VGG-16-based SSD MultiBox Detector.
10:20 – 10:50 AM Coffee Break
KEYNOTE: Vulnerability of Neural Networks (R2.1)
Session Chairs: Karen Egiazarian, Tampere University (Finland) and Atanas Gotchev, Tampere University (Finland)
10:50 – 11:30 AM
Cyril Magnin III
IPAS-294
KEYNOTE: Surprising vulnerability of neural networks: Recovering training and input data in federated learning and split computing, Pavlo Molchanov, NVIDIA Corporation (United States) [view abstract]
Pavlo Molchanov obtained his PhD (2014) from Tampere University of Technology, Finland, in the area of signal processing. His dissertation focused on designing automatic target recognition systems for radars. Since 2015 he has been with the Learning and Perception Research team at NVIDIA, where he currently holds a senior research scientist position. His research focuses on methods for neural network acceleration and on designing novel human-computer interaction systems and human understanding. On network acceleration, he is interested in neural network pruning methods and conditional inference. For human understanding, he works on landmark estimation, gesture recognition, and hand pose estimation.
We present a number of studies that demonstrate the possibility of recovering the training data distribution given only the final trained model. We also study the effect of data recovery in the split computing scenario, where only intermediate features are shared. Finally, we present results of a gradient attack in federated learning that for the first time demonstrates almost exact image recovery. The focus is on large convolutional networks such as ResNets and transformers, and on complex datasets such as ImageNet.
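For context, a compact sketch of the generic gradient-matching (gradient inversion) idea behind such attacks is shown below: a dummy input is optimized so that the gradients it induces match the gradients shared by a federated-learning client. This is the generic formulation, with illustrative names, not the specific attack presented in the keynote.

```python
# Generic sketch of a gradient-matching (gradient inversion) attack: optimize a
# dummy input and soft label so the gradients they induce match the gradients
# shared by a client. This is the generic idea, not the keynote's method.
import torch
import torch.nn.functional as F


def gradient_inversion(model, observed_grads, input_shape, n_classes, steps=300):
    dummy_x = torch.randn(input_shape, requires_grad=True)   # e.g. (1, 3, 224, 224)
    dummy_y = torch.randn(1, n_classes, requires_grad=True)  # soft label logits
    opt = torch.optim.Adam([dummy_x, dummy_y], lr=0.1)

    for _ in range(steps):
        opt.zero_grad()
        pred = model(dummy_x)
        loss = F.cross_entropy(pred, dummy_y.softmax(dim=-1))
        grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
        # Match the induced gradients to the observed (shared) gradients.
        match = sum(((g - o) ** 2).sum() for g, o in zip(grads, observed_grads))
        match.backward()
        opt.step()
    return dummy_x.detach()  # reconstructed approximation of the client's input
```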
Segmentation, Classification, and Tracking (R2.2)
Session Chairs:
Karen Egiazarian, Tampere University (Finland) and Atanas Gotchev, Tampere University (Finland)
11:30 AM – 12:30 PM
Cyril Magnin III
11:30 IPAS-295
Exploring effects of colour and image quality in semantic segmentation (JIST-first), Kanjar De, Luleå University of Technology (Sweden) [view abstract]
Recent advances in convolutional neural networks and vision transformers have brought a revolution in the area of computer vision. Studies have shown that the performance of deep learning-based models is sensitive to the quality of the images. The human visual system is trained to infer semantic information from poor quality images, but deep learning algorithms may find it challenging to perform this task. In this paper, we study the effect of image quality and color parameters on deep learning models trained for the task of semantic segmentation. One of the major challenges in benchmarking robust deep learning-based computer vision models is the lack of challenging data covering different quality and colour parameters. In this paper, we have generated data using the subset of the standard benchmark semantic segmentation dataset (ADE20K) with the goal of studying the effect of different quality and colour parameters for the semantic segmentation task. To the best of our knowledge, this is one of the first attempts to benchmark semantic segmentation algorithms under different colour and quality parameters, and this study will motivate further research in this direction.
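As an illustration of generating quality- and colour-perturbed copies of a dataset image (JPEG compression and saturation shown), a minimal sketch follows; the file name and parameter values are hypothetical and the paper's exact parameter grid is not reproduced.

```python
# Illustrative generation of quality/colour-perturbed variants of a dataset
# image (JPEG quality and colour saturation shown); the paper's exact parameter
# grid is not reproduced here, and the file name is hypothetical.
from PIL import Image, ImageEnhance

src = Image.open("ADE_train_00000001.jpg").convert("RGB")

for quality in (90, 50, 10):                       # increasing JPEG degradation
    src.save(f"degraded_q{quality}.jpg", "JPEG", quality=quality)

for saturation in (0.25, 0.5, 1.5):                # colour parameter sweep
    ImageEnhance.Color(src).enhance(saturation).save(f"degraded_sat{saturation}.png")
```

Segmentation masks stay unchanged, so each perturbed copy can be evaluated against the original ADE20K annotations.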
11:50 IPAS-296
ILIAC: Efficient classification of degraded images using knowledge distillation with cutout data augmentation, Dinesh Daultani1, Masayuki Tanaka1, Masatoshi Okutomi1, and Kazuki Endo2; 1Tokyo Institute of Technology and 2Teikyo Heisei University (Japan) [view abstract]
Image classification is extensively used in applications such as satellite imagery, autonomous driving, smartphones, and healthcare. Most of the images used to train classification models can be considered ideal, i.e., without any degradation due to corruption of pixels in the camera sensors, sudden shake blur, or compression of images into a specific format. In this paper, we propose a novel CNN-based architecture for the classification of degraded images, named ILIAC, based on intermediate-layer knowledge distillation and the cutout data augmentation approach. Our approach achieves 1.1% and 0.4% mean accuracy improvements over all degradation levels of JPEG and AWGN, respectively, compared to the current state-of-the-art approach. Furthermore, the ILIAC method is computationally efficient, at about half the size of the previous state-of-the-art approach in terms of model parameters and GFLOPs count. Additionally, we demonstrate that a larger teacher network is not necessarily needed in knowledge distillation to improve the performance and generalization of a smaller student network for the classification of degraded images.
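A minimal sketch of the cutout augmentation used alongside the distillation (masking a random square of the training input) is shown below; the patch size is an illustrative choice, not the paper's setting.

```python
# Minimal sketch of cutout augmentation: zero out a random square patch of the
# input image during training. The patch size is an illustrative choice.
import torch


def cutout(img, size=16):
    """img: (channels, height, width) tensor; returns a cut-out copy."""
    _, h, w = img.shape
    cy = torch.randint(h, (1,)).item()
    cx = torch.randint(w, (1,)).item()
    y0, y1 = max(0, cy - size // 2), min(h, cy + size // 2)
    x0, x1 = max(0, cx - size // 2), min(w, cx + size // 2)
    out = img.clone()
    out[:, y0:y1, x0:x1] = 0.0   # mask the patch
    return out
```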
12:10 IPAS-297
AInBody: Are you in shape? - An integrated deep learning model that tracks your body measurement, Nakyung Lee, Youngsun Cho, Minseong Son, Sungkeun Kwak, and Jihwan Woo, CJ OliveNetworks (Republic of Korea) [view abstract]
This paper presents AInBody, a novel deep learning-based body shape measurement solution. We have devised a user-centered design that automatically tracks changes in the body by integrating various methods, including human parsing, instance segmentation, and image matting. Our system guides the user's pose when taking photos by displaying the outline of the user's most recent picture, divides the human body into several parts, and compares before and after photos at the body part level. Parsing performance is improved through an ensemble approach and a denoising phase in our main module, the Advanced Human Parser. In evaluation, the proposed method is 0.1% to 4.8% better in average precision than the next best-performing model on 3 out of 5 parts, and 1.4% and 2.4% superior in mAP and mean IoU, respectively. Furthermore, the inference time of our framework is approximately three seconds per HD image, demonstrating that our structure can be applied to real-time applications.
12:30 – 2:00 PM Lunch
Biomedical Image Processing (R3)
Session Chairs:
Karen Egiazarian, Tampere University (Finland) and Atanas Gotchev, Tampere University (Finland)
2:00 – 3:00 PM
Cyril Magnin III
2:00 IPAS-298
Deep learning based speech emotion recognition for Parkinson patient, Habib Khan1, Mohib Ullah2, Fadi Al-Machot3, Faouzi Alaya Cheikh2, and Muhammad Sajjad2; 1Islamia College University Peshawar (Pakistan), 2Norwegian University of Science and Technology (Norway), and 3Norwegian University of Life Sciences (Norway) [view abstract]
Speech emotions (SEs) are an important component of human interactions and an efficient way of influencing human behavior. The recognition of emotions from speech is an emergent but challenging area of digital signal processing (DSP). Healthcare professionals are looking for the best ways to understand patient voices for better diagnosis and treatment. Speech emotion recognition (SER) from the human voice, particularly in a person with a neurological disorder such as Parkinson's disease (PD), can expedite the diagnostic process. Patients with PD are mostly diagnosed via expensive tests and continuous monitoring, which is time-consuming and very costly. The primary goal of this research is to develop a system that can accurately identify common SEs such as anger, happiness, neutral, and sadness. We propose a novel lightweight deep model to predict common SEs. The adaptive wavelet thresholding method is employed for pre-processing the audio data. The proposed method is trained on spectrograms generated from the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. The proposed deep learning method contains convolution layers for learning discriminative features from the spectrograms. A dense layer with a Softmax classifier is used for classification. The accuracy of the proposed framework is evaluated on standard performance metrics, which show promising real-time results for PD patients.
2:20 IPAS-299
Blind denoising of dental X-ray images, Mykola Ponomarenko1, Oleksandr Miroshnichenko2, Vladimir Lukin2, Sergey Krivenko2, and Karen Egiazarian1; 1Tampere University (Finland) and 2National Aerospace University (Ukraine) [view abstract]
The paper considers the problem of automatic analysis and noise suppression in dental X-ray images, e.g., images acquired by a dental Morita system. Such images contain spatially correlated noise with an unknown spectrum and a standard deviation that varies across image regions. In the paper, we propose two deep convolutional neural networks. The first network estimates the spectrum and level of noise for each pixel of a noisy image, predicting maps of noise standard deviation at three image scales. The second network uses these maps as inputs to suppress noise in the image. It is shown, using simulated and real-life images, that the proposed networks provide PSNR for dental X-ray images that is 2.7 dB better than that of other modern denoising methods.
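Schematically, the two-stage arrangement described above can be expressed as a noise-map estimator followed by a denoiser conditioned on those maps, as in the sketch below; the network classes are placeholders, not the authors' architectures, and the multi-scale maps are simplified to a single resolution.

```python
# Schematic of the two-stage pipeline: one network predicts per-pixel noise
# standard-deviation maps, a second network takes the noisy image concatenated
# with those maps and outputs the denoised image. The sub-networks are
# placeholders; the paper predicts maps at three scales, simplified here.
import torch
import torch.nn as nn


class TwoStageDenoiser(nn.Module):
    def __init__(self, noise_estimator: nn.Module, denoiser: nn.Module):
        super().__init__()
        self.noise_estimator = noise_estimator  # predicts sigma maps
        self.denoiser = denoiser                # conditioned on those maps

    def forward(self, noisy):
        sigma_maps = self.noise_estimator(noisy)      # (B, K, H, W)
        x = torch.cat([noisy, sigma_maps], dim=1)     # condition on the maps
        return self.denoiser(x)                       # denoised image
```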
2:40 IPAS-300
Automatic estimation of mucosal waves lateral peak sharpness – Modern approach, Ales Zita1, Simon Gresko1, Adam Novozamsky1, Michal Sorel1, Barbara Zitova1, Jan Svec2, and Jitka Vydrova3; 1Institute of Information Theory and Automation, 2Palacky University, and 3Voice Centre Prague, Medical Healthcom, Ltd (Czechia) [view abstract]
Videokymographic (VKG) images of the human larynx are often used for automatic vibratory feature extraction for diagnostic purposes. One of the most challenging parameters to evaluate is the mucosal wave's presence and its lateral peaks' sharpness. Although these features can be clinically helpful and give an insight into the health and pliability of vocal fold mucosa, the identification and visual estimation of the sharpness can be challenging for human examiners and even more so for an automatic process. This work aims to create and validate a method that can automatically quantify the lateral peak sharpness from the VKG images using a convolutional neural network.