Computer vision (i.e., image understanding) is the problem of inferring the 3D scene that produced an image. It is challenging because the computer itself must decide how to act based on its interpretation of the image. Key image understanding tasks include depth computation, as well as object detection, localization, recognition and tracking. Current state-of-the-art techniques cannot perform any of these tasks robustly with the precision and accuracy demanded by many real-world applications, and operational and environmental factors complicate matters further. For humans, visual recognition is fast and accurate, yet robust against occlusion, clutter, viewpoint variations, and changes in lighting conditions. Moreover, learning new categories requires minimal supervision and a very small set of exemplars. Achieving this level of performance in a wearable, portable system would enable a great number of useful applications, especially for enhancing mobile phone and camera operation.
Methods for object category recognition divide broadly into appearance models, which may take the form of bag-of-words models, part-based models, discriminative/generative methods, and approaches based on joint segmentation and recognition. Other techniques build upon the appearance models to include the structure of the parts, features or even individual pixels: they take location into account and construct appearance models from these underlying building blocks. This is difficult because the process must account for the many structures and appearances that real 3D world objects can manifest, and the mapping between a particular 3D object and its appearance can be one-to-many or many-to-one. The main challenges in object category recognition include viewpoint invariance, illumination, occlusion, scale, deformation, background clutter and intra-class variation. The three main issues are (1) representation; (2) learning (i.e., forming a classifier from given training data); and (3) recognition (i.e., how to use the classifier on novel data). We strongly believe that the first issue, representation, is key to a successful approach.

Our object category recognition method is a compromise between the various methods, taking advantage of the generality of the bag-of-words approach while capitalizing on the real-world robustness of the structural approaches. Our approach takes the neighboring locations of robust features into consideration. We have developed a technique best described as a bag-of-words approach, with the exception that the word tokens are not individual features but spatial neighbor clusters. Thus, each spatial neighbor cluster encodes not only appearance but also geometrical properties. We have experimented to determine which feature detection and representation schemes work best for this method.
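To make the "spatial neighbor cluster" token concrete, a minimal sketch follows. The function name, the k-nearest-neighbor grouping rule, and the token layout (concatenated member descriptors plus relative offsets) are our illustrative assumptions, not the authors' exact scheme; the point is only that each token carries both appearance and geometry:

```python
import numpy as np

def spatial_neighbor_clusters(points, descriptors, k=3):
    """Group each keypoint with its k nearest spatial neighbors.

    Each "word" token combines the member descriptors (appearance) with
    the neighbors' offsets relative to the anchor point (geometry).
    points: (N, 2) keypoint locations; descriptors: (N, D).
    Returns an (N, (k+1)*D + k*2) array of cluster tokens.
    """
    points = np.asarray(points, dtype=float)
    descriptors = np.asarray(descriptors, dtype=float)
    n = len(points)
    # Pairwise squared distances between keypoint locations.
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # exclude the point itself
    nbrs = np.argsort(d2, axis=1)[:, :k]  # indices of k nearest neighbors
    tokens = []
    for i in range(n):
        appearance = np.concatenate(
            [descriptors[i]] + [descriptors[j] for j in nbrs[i]])
        geometry = (points[nbrs[i]] - points[i]).ravel()  # relative offsets
        tokens.append(np.concatenate([appearance, geometry]))
    return np.stack(tokens)
```

Because the geometry part stores neighbor offsets rather than absolute positions, the token is unchanged under image translation, which is one step toward the invariances the abstract describes.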
Our object model learns a set of codebooks of spatial neighbor clusters for each object category. A multi-stage process for optimal matching between model codebooks and test image codebooks has also been developed. Our method is robust against scale, rotation and affine transformations in general, and it determines the locations and poses of object instances in addition to detecting their presence.
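The abstract does not specify how the codebooks are learned, but a standard way to build a codebook from such tokens is vector quantization; a minimal k-means sketch (our illustrative assumption, not the authors' multi-stage matcher) shows the idea:

```python
import numpy as np

def learn_codebook(tokens, n_words=8, n_iter=20, seed=0):
    """Vector-quantize cluster tokens into a codebook via plain k-means."""
    rng = np.random.default_rng(seed)
    tokens = np.asarray(tokens, dtype=float)
    # Initialize codewords from randomly chosen tokens.
    centers = tokens[rng.choice(len(tokens), n_words, replace=False)]
    for _ in range(n_iter):
        # Assign each token to its nearest codeword.
        d = ((tokens[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # Move each codeword to the mean of its assigned tokens.
        for w in range(n_words):
            members = tokens[labels == w]
            if len(members):
                centers[w] = members.mean(axis=0)
    return centers, labels

def quantize(tokens, centers):
    """Histogram of codeword assignments: the image's bag of clusters."""
    d = ((np.asarray(tokens)[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.bincount(d.argmin(axis=1), minlength=len(centers))
```

A test image would then be described by its codeword histogram, and matching against a category reduces to comparing histograms.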
A logical extension would be to perform 3D object category recognition. To acquire depth information, active methods, which send out an array of energy and measure its return (strength or time), are the easiest way to get a measurement. However, they require special equipment, consume considerable power, and usually provide only a single depth point. Passive techniques that make use of ambient energy, such as imaging, are an attractive alternative. Stereo vision can provide scene depth, but due to calibration issues, precise and relatively expensive machining is again a prerequisite. Humans use up to 10 different visual cues to ascertain depth, one of them being stereo, the dominant depth cue for objects up to 1.5 m away. Another cue, computing depth from a single camera moving through space and time (also referred to as structure from motion), is an excellent alternative. Challenges include not only computing the depth field but also keeping track of the camera's spatial location through time (the SLAM problem: Simultaneous Localization And Mapping). We have developed novel techniques that build upon strictly computer vision methods to achieve structure from motion. We have also been inspired by the symbiotic relationship between the visual and vestibular systems in humans, a cooperative relationship necessary for stable vision (as opposed to dizziness), and have increased the robustness of our own structure from motion algorithms by incorporating inertial sensors (e.g., accelerometers, gyros). Two structure from motion techniques from the literature are those based on optical flow (i.e., short baseline, short time displacement, referred to as instantaneous) and those based on discrete feature matching (i.e., long baseline). We have found that combining the two methods in a probabilistic inference mechanism produces much better results than either technique by itself.
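The abstract does not detail the probabilistic inference mechanism; the simplest instance of the idea, assuming each method returns a depth estimate with an uncertainty, is Gaussian inverse-variance fusion (a sketch under that assumption, not the authors' actual combiner):

```python
def fuse_depth(z_flow, var_flow, z_feat, var_feat):
    """Fuse two independent Gaussian depth estimates of the same point.

    z_flow/var_flow: estimate and variance from the optical-flow
    (short-baseline) method; z_feat/var_feat: from discrete feature
    matching (long baseline). Inverse-variance weighting yields the
    maximum-likelihood combined estimate, whose variance is smaller
    than either input's - the sense in which fusion beats each method alone.
    """
    w1, w2 = 1.0 / var_flow, 1.0 / var_feat
    z = (w1 * z_flow + w2 * z_feat) / (w1 + w2)
    var = 1.0 / (w1 + w2)
    return z, var
```

In practice the short-baseline estimate is more reliable for nearby, fast-moving structure and the long-baseline estimate for distant structure, so the variances would vary per point and the fusion adapts accordingly.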
We have also experimented with using vision and inertial measurements to improve upon GPS localization. We have developed a unique and novel particle filter that provides robustness to outliers by fusing three families of hypotheses: (1) hypotheses where the motion is estimated from both visual and inertial measurements (in addition to GPS); (2) hypotheses where the motion is estimated from only visual measurements and GPS; and (3) hypotheses where the motion is estimated from only inertial measurements and GPS. Having both depth and object recognition allows us to perform visual SLAM, triangulating on discovered landmarks to obtain geo-location when GPS is not available. The computer vision computations also allow us to perform visual odometry, similar to odometry with inertial sensors, but without the same accumulation of error.
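One step of such a filter could be sketched as follows. The function, the noise levels, and the tagging of each particle with one of the three hypothesis families are our illustrative assumptions; the mechanism shown is why the filter resists outliers, since particles whose hypothesis trusts a corrupted sensor receive low weight and die off in resampling:

```python
import numpy as np

def pf_step(positions, hyp, motion_by_hyp, gps, gps_sigma=5.0, rng=None):
    """One predict/update/resample step of a multi-hypothesis particle filter.

    positions: (N, 2) particle positions; hyp: (N,) index into
    motion_by_hyp, a list of three (dx, dy) motion estimates
    [visual+inertial, visual-only, inertial-only]. Particles whose
    motion model disagrees with the GPS fix get low weight, so an
    outlier in one sensor suppresses only the hypotheses that trust it.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(positions)
    # Predict: move each particle by its hypothesis's motion estimate + noise.
    moved = positions + np.array([motion_by_hyp[h] for h in hyp])
    moved = moved + rng.normal(scale=0.5, size=moved.shape)
    # Update: weight particles by a Gaussian GPS likelihood.
    d2 = ((moved - gps) ** 2).sum(axis=1)
    w = np.exp(-0.5 * d2 / gps_sigma**2)
    w /= w.sum()
    # Resample particles (carrying their hypothesis tags) by weight.
    idx = rng.choice(n, size=n, p=w)
    return moved[idx], hyp[idx]
```

Running this repeatedly, the surviving mix of hypothesis tags itself indicates which sensor combination is currently trustworthy.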
We have developed and demonstrated all of these techniques in our lab, and we argue that all of this potential can be packaged within a smartphone such as an iPhone.
John Zelek is Associate Professor in Systems Design Engineering at the University of Waterloo, with expertise in intelligent mechatronic control systems that interface with humans, specifically: (1) wearable sensory substitution and assistive devices; (2) probabilistic visual and tactile perception; (3) wearable haptic devices, including their design, synthesis and analysis; and (4) human-robot interaction. He received the best paper award at the 2007 International IEEE/IAPRS Computer and Robot Vision conference, Distinguished Performance Awards from the Faculty of Engineering at the University of Waterloo in 2006 and 2008, and the 2004 Young Investigator Award from the Canadian Image Processing and Pattern Recognition Society for his work in robotic vision. He is also the CTO of Tactile Sight Inc.