Monday 17 January 2022
IS&T Welcome & PLENARY: Quanta Image Sensors: Counting Photons Is the New Game in Town
07:00 – 08:10
The Quanta Image Sensor (QIS) was conceived as a different kind of image sensor—one that counts photoelectrons one at a time using millions or billions of specialized pixels read out at high frame rate, with computational imaging used to create grayscale images. QIS devices have been implemented in a baseline room-temperature CMOS image sensor (CIS) technology without using avalanche multiplication, and also with SPAD arrays. This plenary details the QIS concept, how it has been implemented in CIS and in SPADs, and what the major differences are. Applications that can be disrupted or enabled by this technology are also discussed, including smartphone cameras, where CIS-QIS technology could be deployed within just a few years.
Eric R. Fossum, Dartmouth College (United States)
Eric R. Fossum is best known for the invention of the CMOS image sensor “camera-on-a-chip” used in billions of cameras. He is a solid-state image sensor device physicist and engineer, and his career has included academic and government research and entrepreneurial leadership. At Dartmouth he is a professor of engineering and vice provost for entrepreneurship and technology transfer. Fossum received the 2017 Queen Elizabeth Prize, considered by many the Nobel Prize of engineering, from HRH Prince Charles, “for the creation of digital imaging sensors,” along with three others. He was inducted into the National Inventors Hall of Fame and elected to the National Academy of Engineering, among other honors including a recent Emmy Award. He has published more than 300 technical papers and holds more than 175 US patents. He co-founded several startups and co-founded the International Image Sensor Society (IISS), serving as its first president. He is a Fellow of IEEE and OSA.
08:10 – 08:40 EI 2022 Welcome Reception
KEYNOTE: Quality of Experience
Session Chairs: Mark McCourt, North Dakota State University (United States) and Jeffrey Mulligan, PRO Unlimited (United States)
08:40 – 09:45
KEYNOTE: Two aspects of quality of experience: Augmented reality for the industry and for the hearing impaired & current research at the Video Quality Experts Group (VQEG), Kjell Brunnström1,2; 1RISE Research Institutes of Sweden AB and 2Mid Sweden University (Sweden)
This presentation will be divided into two parts. (1) Focus on the Quality of Experience (QoE) of Augmented Reality (AR) for industrial applications and for aids for the hearing impaired. Examples will be given from research done at RISE Research Institutes of Sweden and Mid Sweden University on remote control of machines and speech-to-text presentations in AR. (2) An overview of current work of the Video Quality Experts Group (VQEG), an international organization of video experts from both industry and academia. Initially, VQEG focused on measuring perceived video quality. In the 20 years since its formation, its expertise has shifted from the visual quality of video to QoE (not involving audio), taking a more holistic view of the visual quality perceived by users of contemporary video-based services and applications.
Kjell Brunnström, PhD, is a Senior Scientist at RISE (Digital Systems, Dept. Industrial Systems, Unit Networks), leading Visual Media Quality, and an Adjunct Professor at Mid Sweden University. He is Co-Chair of the Video Quality Experts Group (VQEG). Brunnström’s research interests are in Quality of Experience (QoE) for video and in display quality assessment (2D/3D, VR/AR, immersive). He is an associate editor of the Elsevier journal Signal Processing: Image Communication and has written more than a hundred articles in international peer-reviewed scientific journals and conferences.
Special Session: Perception of Collective Behavior
Mark McCourt, North Dakota State University (United States); Jeffrey Mulligan, PRO Unlimited (United States); and Jan Jaap van Assen, Delft University of Technology (the Netherlands)
10:10 – 11:10
Behavioural properties of collective flow, Jan Jaap R. van Assen and Sylvia Pont, Delft University of Technology (the Netherlands) [view abstract]
In nature we are visually exposed to many complex patterns. Over time our visual system has developed an amazing finesse in interpreting these patterns. In this study we investigate the perception of collective flow. This type of flow is created by a body of individual agents that show both collective and individual behaviours following a coordinated set of rules, e.g., swarms of insects and schools of fish. Using a browser-based simulator of a relatively simple six-dimensional parametric model we displayed a range of collective behaviours. In a variety of experiments with free naming, name selection, rating tasks, and similarity judgements we started exploring the parametric space and its perceived behavioural dimensions. We find that observers can identify and name a wide range of behavioural properties such as discipline and agitation. We further found that multiple behaviours can change across a single physical parameter and that the same behaviour can occur across multiple physical parameters as well. We explored multiple experimental methods to interpret and quantify this intriguing and malleable perceptual space of collective flow.
A visual explanation of ‘flocking’ in human crowds, William H. Warren, Gregory C. Dachner, Trenton D. Wirth, and Emily Richmond, Brown University (United States) [view abstract]
Patterns of collective motion or ‘flocking’ in birds, fish schools, and human crowds are believed to emerge from local interactions between individuals. Most models of collective motion attribute these interactions to hypothetical rules or forces, often inspired by physical systems, and described from an overhead bird’s-eye view. We develop a visual model of human flocking from an embedded view, based on optical variables that actually govern pedestrian interactions. Specifically, people control their walking speed and direction by canceling the average optical expansion and angular velocity of their neighbors, weighted by visual occlusion. We test the model by simulating data from VR experiments with virtual crowds and real human ‘swarms’. The visual model outperforms our previous ‘bird’s-eye’ model (PRSB, 2018) (BFvb>100) as well as a motion model that ignores occlusion (BFvm>100). Moreover, it explains basic properties of physics-inspired models: ‘repulsion’ forces reduce to canceling optical expansion, ‘attraction’ forces to canceling optical contraction, and ‘alignment’ to canceling the combination of expansion/contraction and angular velocity. Critically, the neighborhood of interaction follows from Euclid’s Law of perspective and the geometry of occlusion. We conclude that the local interactions underlying human flocking are a natural consequence of the laws of optics.
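As a rough illustration of the control law described in this abstract (not the authors' implementation), a pedestrian's speed and heading updates can be sketched as follows. The function name, gains, and input encoding are hypothetical; `expansions`, `angular_velocities`, and `visibilities` stand for each neighbor's optical expansion rate, angular velocity, and unoccluded fraction as seen by the pedestrian:

```python
def visual_flocking_update(expansions, angular_velocities, visibilities,
                           k_speed=1.0, k_heading=1.0):
    """Return (speed change, heading change) for one pedestrian.

    Each neighbor i contributes expansions[i], its optical expansion rate
    (looming: positive when the neighbor's image grows), angular_velocities[i],
    the drift of the neighbor's image across the field of view, and
    visibilities[i], the fraction of the neighbor left unoccluded by others
    (0 = fully hidden, 1 = fully visible).

    The pedestrian accelerates or decelerates to cancel the visibility-weighted
    mean expansion, and turns to null the weighted mean angular velocity
    (sign conventions here are a modeling choice).
    """
    total = sum(visibilities)
    if total == 0.0:  # no visible neighbors: keep current speed and heading
        return 0.0, 0.0
    mean_expansion = sum(e * v for e, v in zip(expansions, visibilities)) / total
    mean_angular = sum(a * v for a, v in zip(angular_velocities, visibilities)) / total
    dv = -k_speed * mean_expansion      # neighbors looming -> slow down
    dtheta = k_heading * mean_angular   # turn with the image drift to null it
    return dv, dtheta
```

Note how the physics-inspired notions reduce to optics here: "repulsion" is a negative response to positive expansion, "attraction" to contraction, and "alignment" emerges from nulling angular velocity.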
Simulating pedestrians and crowds based on synthetic vision, Julien Pettre, Inria (France) [view abstract]
Crowd simulation is a very active topic, with applications to safety, e.g., simulating the evacuation of people from given environments. We present a pedestrian and crowd simulation technique based on agents with synthetic vision capabilities. Simulation agents visually perceive their environment by graphically rendering it from their own point of view and perspective. Further analysis of the perceived images provides information about perceived movement, resulting both from the agents' self-motion and from the motion of surrounding objects. To move and adjust their trajectories, locomotion is decomposed into a set of simple visual tasks that agents achieve according to what they perceive. We demonstrate that this approach better imitates the vision-locomotor loop real humans use to navigate in dynamic environments such as crowds.
HVEI Discussion: Perception of Collective Behavior
Session Chairs: Damon Chandler, Ritsumeikan University (Japan); Mark McCourt, North Dakota State University (United States); Jeffrey Mulligan, PRO Unlimited (United States); and Jan Jaap van Assen, Delft University of Technology (the Netherlands)
11:10 – 13:10
Gather, in the Cafe (entrance near the Reg Desk)
Discussion within the HVEI community to follow the HVEI Special Session, "Perception of Collective Behavior".
KEYNOTE: High Dynamic Range
Session Chairs: Damon Chandler, Ritsumeikan University (Japan) and Jeffrey Mulligan, PRO Unlimited (United States)
15:00 – 16:00
KEYNOTE: HDR arcana, Scott Daly, Dolby Laboratories, Inc. (United States)
Consumers seeing the high-end versions of these displays for the first time typically comment that the imagery shows more depth (“looks like 3D”), looks more realistic (“feels like you’re there”), has stronger affectivity (“it’s visceral”), or has a wow effect (“#$@*&% amazing”). Prior to their introduction to the consumer market, such displays were demonstrated to the technical community. This motivated detailed discussions of the need for an ecosystem (capture, signal format, and display), which were fruitful but at the same time often led to widely stated misunderstandings. These often boiled HDR down to a single issue, with statements like “HDR is all about the explosions,” referring to its capability to convey strong transients in luminance. Another misconception was “HDR causes headaches,” referring to effects caused by poor creative choices or sloppy automatic processing. Other simplifying terms, such as brightness, bit-depth, contrast ratio, image-capture f-stops, and display capability, have all been used to describe “the key” aspect of HDR. One misunderstanding, circa 2010, that permeated photography hobbyists was “HDR makes images look like paintings,” often meant as a derision. While the technical community has moved beyond such oversimplifications, there are still key perceptual phenomena involved with HDR displayed imagery that are either poorly understood or rarely mentioned. The field of applied vision science is at a mature enough state to have enabled engineering design for the signal formats, image capture, and display capabilities needed to create both consumer and professional HDR ecosystems. Light-adaptive CSF models, optical PSF and glare, LMS cone capture, opponent colors, and color volume are examples used in the ecosystem design. However, we do not have a similar level of quantitative understanding of why HDR triggers the kinds of expressions mentioned at the beginning of this paragraph.
This talk will give a survey of the apparently mysterious perceptual issues of HDR being explored by a handful of researchers, often unaware of each other’s work. Coupled with several hypotheses and some speculation, this focus on the arcane aspects of HDR perception hopes to motivate more in-depth experiments and understanding.
Scott Daly is an applied perception scientist at Dolby Laboratories, Sunnyvale, CA, with specialties in spatial, temporal, and chromatic vision. He has significant experience in applications toward display engineering, image processing, and video compression with over 100 technical papers. Current focal areas include high dynamic range, auditory-visual interactions, physiological assessment, and preserving artistic intent. He has a BS in bioengineering from North Carolina State University (NCSU), Raleigh, NC, and an MS in bioengineering from the University of Utah, Salt Lake City, UT. Past accomplishments led to the Otto Schade award from the Society for Information Display (SID) in 2011, and a team technical Emmy in 1990. He is a member of the IEEE, SID, and SMPTE. He recently completed the 100-patent dash in just under 30 years.
Perception and appreciation of tactile objects: The role of visual experience and texture parameters (JPI-first), A.K.M. Rezaul Karim1, Sanchary Prativa1, and Lora T. Likova2; 1University of Dhaka (Bangladesh) and 2Smith-Kettlewell Eye Research Institute (United States) [view abstract]
This exploratory study was designed to examine the effects of visual experience and specific texture parameters on both discriminative and affective aspects of tactile perception. To this end, we conducted two experiments using a novel behavioral (ranking) approach in blind and (blindfolded) sighted individuals. Groups of congenitally blind, late blind and (blindfolded) sighted participants made relative stimulus preference, aesthetic appreciation and smoothness or softness judgment of 2D or 3D tactile surfaces through active touch. In both experiments, the aesthetic judgment was assessed on three affective dimensions, Relaxation, Hedonics, and Arousal, hypothesized to underlie visual aesthetics in a prior study. Results demonstrated that none of these behavioral judgments significantly varied as a function of visual experience in either experiment. However, irrespective of visual experience, significant differences were identified in all these behavioral judgments across the physical levels of smoothness or softness. In general, 2D smoothness or 3D softness discrimination was proportional to the level of physical smoothness or softness. Secondly, the smoother or softer tactile stimuli were preferred over the rougher or harder tactile stimuli. Thirdly, the three-dimensional affective structure of visual aesthetics appeared to be amodal and applicable to tactile aesthetics. However, analysis of the aesthetic profile across the affective dimensions revealed some striking differences between the forms of appreciation of smoothness and softness, uncovering unanticipated substructures in the nascent field of tactile aesthetics. While the physically softer 3D stimuli received higher ranks on all three affective dimensions, the physically smoother 2D stimuli received higher ranks on the Relaxation and Hedonics but lower ranks on the Arousal dimension. 
Moreover, the Relaxation and Hedonic ranks accurately overlapped with one another across all the physical levels of softness/hardness, but not across the physical levels of smoothness/roughness. The theoretical and practical implications of these novel findings are discussed.
Session Chairs: Damon Chandler, Ritsumeikan University (Japan) and Jeffrey Mulligan, PRO Unlimited (United States)
16:15 – 17:15
A comparison of non-experts and experts using DSIS method, Yasuko Sugito and Yuichi Kusakabe, NHK (Japan) [view abstract]
Subjective evaluations are necessary to learn how expected viewers perceive the quality of a system. Traditionally, non-expert subjective tests are preferred over expert tests. In this study, we conducted subjective evaluation experiments with non-experts and experts on compressed 8K videos using the double stimulus impairment scale (DSIS) method and analyzed the experimental results expressed in terms of the mean opinion score (MOS), the average of the individual scores. Furthermore, we investigated the differences between non-experts and experts by considering a new method in P.913 that estimates an improved MOS and a new experimental method using experts, called the expert viewing protocol (EVP). Our contribution shows the advantages of conducting expert subjective tests such as EVP: expert tests make it possible to perform experiments with fewer subjects, to distinguish between original and distorted images, to determine a lower threshold for image quality, to distribute scores over an appropriate range, and to consistently obtain MOS values equal to the improved MOS values.
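For reference, the MOS used in such studies is simply the arithmetic mean of the individual ratings; a minimal sketch follows (the helper name and the normal-approximation 95% confidence interval are illustrative additions, not from the paper):

```python
import math
from statistics import mean, stdev

def mos(scores, z=1.96):
    """Mean opinion score of one stimulus with a 95% confidence interval.

    `scores` are individual subject ratings, e.g. DSIS impairment ratings
    on a 1-5 scale; z = 1.96 gives the normal-approximation 95% CI.
    """
    m = mean(scores)
    if len(scores) < 2:
        return m, 0.0
    ci = z * stdev(scores) / math.sqrt(len(scores))
    return m, ci
```

A narrower CI with fewer subjects is one concrete way the expert-test advantage claimed above would show up in practice.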
Analysis of differences between skilled and novice subjects for visual inspection by using eye trackers, Koichi Ashida1, Atsuyuki Kaneda2, Toshihiro Ishizuki2, Shuichi Sato3, Norimichi Tsumura1, and Akira Tose4; 1Chiba University, 2Gazo Co., Ltd., 3Niigata Artificial Intelligence Laboratory Co., and 4Niigata University (Japan) [view abstract]
In this study, we developed a learning model that discriminates between skilled and novice inspectors in visual inspection by using eye trackers to visualize their skills in terms of gaze direction. This model enabled us to analyze the differences in skill between skilled and novice inspectors.
A method proposal for evaluating color tolerance in viewing multiple white points focusing on the vehicle instrument panels, Taesu Kim, Hyeon-Jeong Suk, and Hyeonju Park, Korea Advanced Institute of Science and Technology (KAIST) (Republic of Korea) [view abstract]
Although chromatic adaptation helps our vision adapt to nuanced whites, viewing more than two substantially different white balances imposes a perceptual workload and suggests poor quality control. This study proposed a method for evaluating the color tolerance of light modules using a uniformity analyzer, focusing on the instrument panels of passenger cars: two premium line-up vehicles from Hyundai and Mercedes-Benz. Using a luminance uniformity analyzer, we captured three main lighting regions of their instrument panels: clusters, steering wheel, and center console. Based on u′ and v′ values, we identified and compared the chromaticity coordinates of the white lighting components. The measurement-based judgment supports the manufacturer in assessing quality objectively and consistently.
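The u′, v′ values referred to above are CIE 1976 UCS chromaticity coordinates. As a minimal sketch of how such a tolerance comparison works (the function names and any threshold choice are illustrative, not from the paper):

```python
import math

def uv_prime(X, Y, Z):
    """CIE 1976 u', v' chromaticity coordinates from XYZ tristimulus values:
    u' = 4X / (X + 15Y + 3Z), v' = 9Y / (X + 15Y + 3Z)."""
    denom = X + 15.0 * Y + 3.0 * Z
    return 4.0 * X / denom, 9.0 * Y / denom

def delta_uv(uv1, uv2):
    """Euclidean distance in the u'v' plane -- a common way to express
    white-point color tolerance between two lighting components."""
    return math.hypot(uv1[0] - uv2[0], uv1[1] - uv2[1])
```

Two white lighting components whose Δu′v′ falls below a chosen tolerance would be judged consistent; the u′v′ plane is used because distances in it are closer to perceptually uniform than in xy.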
Tuesday 18 January 2022
Session Chairs: Mark McCourt, North Dakota State University (United States) and Jeffrey Mulligan, PRO Unlimited (United States)
07:00 – 08:00
Enhancing visual speech cues for age-related reductions in vision and hearing (Invited), Harry Levitt, Helen Simon, and Al Lotze, Smith-Kettlewell Eye Research Institute (United States) [view abstract]
Speech is conveyed by both auditory and visual speech cues. Auditory cues dominate for speech in quiet. For speech in noise, visual speech cues increase in importance as the speech-to-noise ratio is decreased. Similarly, a person with a hearing loss relies on visual speech cues to compensate for the loss of auditory speech cues; the greater the hearing loss, the greater the dependence on vision. Vision and hearing both decrease with age, resulting in a concomitant age-related reduction in speech understanding. For seniors, this combined sensory decline is by far the most common cause of reduced speech understanding. Much research has focused on improving speech communication ability for sighted people with hearing loss. The problem is more severe for seniors with age-related hearing and vision loss, in that less vision is available when good vision is most needed. It is particularly severe for blind people, in that age-related hearing loss reduces both speech understanding and wayfinding ability. This paper identifies ways in which image processing can enhance both auditory and visual speech cues, or recode the cues in another modality, for dual sensory loss.
Smelling sensations: Olfactory crossmodal correspondences (JPI-first), Ryan J. Ward, Sophie Wuerger, and Alan Marshall, University of Liverpool (United Kingdom) [view abstract]
Olfaction is ingrained into the fabric of our daily lives and constitutes an integral part of our perceptual reality. Within this reality, there are crossmodal interactions and sensory expectations; understanding how olfaction interacts with other sensory modalities is crucial for augmenting interactive experiences with more advanced multisensorial capabilities. This knowledge will eventually lead to better designs, more engaging experiences and enhancing the perceived quality of experience. Towards this end, we investigated a range of crossmodal correspondences between ten olfactory stimuli and different modalities (angularity of shapes, smoothness of texture, pleasantness, pitch, colors, musical genres, and emotional dimensions) using a sample of 68 observers. Consistent crossmodal correspondences were obtained in all cases, including our novel modality (the smoothness of texture). These associations are most likely mediated by both the knowledge of an odor's identity and the underlying hedonic ratings: the knowledge of an odor's identity plays a role when judging the emotional and musical dimensions but not for the angularity of shapes, smoothness of texture, perceived pleasantness, or pitch. Overall, hedonics was the most dominant mediator of crossmodal correspondences.
Multisensory visio-tactile interaction in semi-immersive environments, Elena A. Fedorovskaya, Minyao Li, Lily Gaffney, Elise Guth, Kavya Phadke, and Susan Farnand, Rochester Institute of Technology (United States) [view abstract]
With the development of mixed reality systems, there is a demand to create and model multimodal interactions for producing immersive experiences. A great deal of effort has been placed on 3D rendering of materials and evaluation of the resulting image quality to ensure observers can effectively recognize objects and adequately interpret visual scenes. In future medical, educational, and entertainment applications, providing tactile information will become increasingly important for user experience and task performance. However, the interaction of visual and tactile information and its underlying mechanisms are not well understood. We report the results of a study in which participants were asked to match their tactile experience with visual images displayed in a semi-immersive environment. Several tactile stimuli consisting of various textures of physical objects were selected to span a range of tactile experiences, following previously reported dimensions for texture perception such as “hard–soft” and “rough–smooth”. In the experiment, participants touched these objects while the objects were concealed from view. Photographic reproductions of the same objects, used as visual stimuli, were randomly presented on the screen of a high-resolution 31″ monitor in a darkened environment. The participants were instructed to press, as quickly as possible, the “yes” button if the visual and tactile stimuli were the same and the “no” button if they were not. Response time, gaze patterns, and rate of correctness were recorded. Differences in these responses were observed between situations when the visual and tactile stimuli were congruent versus incongruent. The results of this study will be important in the development of mixed reality systems and in understanding how humans decide whether or not a visual object matches a tactile sensation.
Session Chairs: Mark McCourt, North Dakota State University (United States) and Jeffrey Mulligan, PRO Unlimited (United States)
10:00 – 11:00
Augmented remote operating system for scaling in smart mining applications: Quality of experience aspects, Shirin Rafiei1,2, Elijs Dima2, Mårten Sjöström2, and Kjell Brunnström1,2; 1RISE Research Institutes of Sweden and 2Mid Sweden University (Sweden) [view abstract]
Remote operation and Augmented Telepresence are fields of interest for novel industrial applications in e.g. construction and mining. In this study, we report on an ongoing investigation of the Quality of Experience aspects of an Augmented Telepresence system for remote operation, wherein view augmentation is achieved with selective content removal and novel-perspective view generation. An initial, formal subjective study has been performed with test participants scoring their experience while using the system with different levels of view augmentation stimuli. The participants also gave free-form feedback on the system and their experiences. The results reveal an unexpected shift in how the experiment task was perceived by the users, changing focus from remote interaction to a perception and memorization problem. Within that task, participants indicated preference towards original and content-removed views, without using the novel-perspective views. The results and feedback also reveal specific issues with the test setup relating to insufficient fidelity of presentation, and the need for a remote interaction mode that relies more on the spatial structure of the remote environment.
A feedforward model of spatial lightness computation by the human visual system, Michael E. Rudd, University of Nevada, Reno (United States) [view abstract]
Last year at HVEI, I presented a computational model of lightness perception inspired by data from primate neurophysiology. That model first encodes local, spatially directed contrast in the image, then integrates the resulting local contrast signals across space to compute lightness (Rudd, J Percept Imaging, 2020; HVEI Proceedings, 2021). Here I computer-simulate the lightness model and generalize it to model color perception by including in the model local color contrast detectors with properties similar to those of cortical “double-opponent” (DO) neurons. DO neurons make local spatial comparisons between the activities of L vs M and S vs (L + M) cones and half-wave rectify these local color contrast comparisons to produce psychophysical channels that encode, roughly, amounts of ‘red’, ‘green’, ‘blue’, and ‘yellow’.
SalyPath360: Saliency and scanpath prediction framework for omnidirectional images, Mohamed A. Kerkouri1, Marouane Tliba1, Aladine Chetouani1, and Mohamed Sayah2; 1Université d'Orléans (France) and 2University of Oran (Algeria) [view abstract]
This paper introduces a new framework to predict visual attention in omnidirectional images. The key feature of our architecture is the simultaneous prediction of the saliency map and a corresponding scanpath for a given stimulus. The framework implements a fully convolutional encoder-decoder neural network augmented by an attention module to generate representative saliency maps. In addition, an auxiliary network is employed to generate probable viewport-center fixation points through the SoftArgMax function, which allows fixation points to be derived from feature maps. To take advantage of the scanpath prediction, an adaptive joint probability distribution model is then applied to construct the final unbiased saliency map by combining the encoder-decoder-based saliency map and the scanpath-based saliency heatmap. The proposed framework was evaluated in terms of saliency and scanpath prediction, and the results were compared to state-of-the-art methods on the Salient360! dataset. The results show the relevance of our framework and the benefits of such an architecture for further omnidirectional visual attention prediction tasks.
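The SoftArgMax mentioned above is a differentiable surrogate for argmax: a softmax over the feature map followed by an expectation over pixel coordinates, so a fixation point can be regressed from a feature map while gradients still flow. A minimal pure-Python sketch, outside any network (the function name and temperature β are illustrative):

```python
import math

def soft_argmax_2d(fmap, beta=10.0):
    """Expected (row, col) under a softmax over a 2D feature map.

    fmap is a list of rows of floats. With a large beta the softmax
    concentrates on the maximum, so the result approaches the hard argmax
    while remaining differentiable with respect to the map values.
    """
    m = max(max(row) for row in fmap)  # subtract max for numerical stability
    w = [[math.exp(beta * (v - m)) for v in row] for row in fmap]
    total = sum(sum(row) for row in w)
    row = sum(i * x for i, r in enumerate(w) for x in r) / total
    col = sum(j * x for r in w for j, x in enumerate(r)) / total
    return row, col
```

In a network this operation would be applied to the auxiliary branch's feature maps, yielding continuous fixation coordinates that can be supervised with ground-truth scanpaths.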
Wednesday 19 January 2022
IS&T Awards & PLENARY: In situ Mobility for Planetary Exploration: Progress and Challenges
07:00 – 08:15
This year saw exciting milestones in planetary exploration with the successful landing of the Perseverance Mars rover, followed by its operation and the successful technology demonstration of the Ingenuity helicopter, the first heavier-than-air aircraft ever to fly on another planetary body. This plenary highlights new technologies used in this mission, including precision landing for Perseverance, a vision coprocessor, new algorithms for faster rover traverse, and the ingredients of the helicopter. It concludes with a survey of challenges for future planetary mobility systems, particularly for Mars, Earth’s moon, and Saturn’s moon, Titan.
Larry Matthies, Jet Propulsion Laboratory (United States)
Larry Matthies received his PhD in computer science from Carnegie Mellon University (1989), before joining JPL, where he has supervised the Computer Vision Group for 21 years, the past two coordinating internal technology investments in the Mars office. His research interests include 3-D perception, state estimation, terrain classification, and dynamic scene analysis for autonomous navigation of unmanned vehicles on Earth and in space. He has been a principal investigator in many programs involving robot vision and has initiated new technology developments that impacted every US Mars surface mission since 1997, including visual navigation algorithms for rovers, map matching algorithms for precision landers, and autonomous navigation hardware and software architectures for rotorcraft. He is a Fellow of the IEEE and was a joint winner in 2008 of the IEEE’s Robotics and Automation Award for his contributions to robotic space exploration.
Human Vision and Electronic Imaging 2022 Posters
08:20 – 09:20
Poster interactive session for all conference authors and attendees.
P-05: A simple and efficient deep scanpath prediction, Mohamed A. Kerkouri and Aladine Chetouani, Université d'Orléans (France) [view abstract]
A visual scanpath is the sequence of fixation points that the human gaze travels while observing an image, and its prediction helps in modeling the visual attention to an image. To this end, several models were proposed in the literature, using complex deep learning architectures and frameworks. We explore the efficiency of using common deep learning architectures in a simple, fully convolutional, regressive manner. We evaluate how well these models can predict scanpaths on two datasets, compare them with other models using different metrics, and show competitive results that sometimes surpass previous complex architectures. We also compare the different leveraged backbone architectures based on their performance in the experiments to deduce which ones are the most suitable for the task.
P-06: INDeeD: Identical and disparate feature decomposition from multi-label data, Tserendorj Adiya and Seungkyu Lee, Kyung Hee University (Republic of Korea) [view abstract]
We propose Identical and Disparate Feature Decomposition (INDeeD) from multi-label data, which explicitly learns the characteristics of individual labels.