By Nadav Cohen
Ever wondered what role multiple cameras play on your smartphone? Here, we break it down for you, and show why, with AI-based imaging, the dual camera is here to stay.
The field of computer 3D perception has developed rapidly since smartphones took center stage in consumer technology. Since HTC released the first dual-camera smartphone in 2011, leading tech companies such as Apple and Samsung have made multiple rear cameras standard.
As smartphone photography has grown in popularity, so has the demand for high-quality images. But small camera systems, which lack heavy parts and large glass lenses, face several optical limitations. AI algorithms have therefore been developed to mimic professional effects that were previously unachievable on smartphones. But what role do these multiple cameras play, and how do they help overcome the size limitations?
Portrait mode: a critical smartphone feature
A key smartphone camera feature that has soared in popularity in recent years is portrait mode, which uses an effect called ‘bokeh’: objects in the background appear blurry, while objects in the foreground remain in sharp focus. It empowers consumers with what was previously available only on powerful professional cameras.
Traditionally, this effect is created by setting a wide aperture, which produces a shallow depth of field: a sharp subject against a blurry background. It is easy to achieve on any DSLR camera, but very difficult on small smartphone cameras due to size constraints. Given the bokeh effect’s popularity, smartphone companies have developed digital variants that mimic it. Because the amount of blur an object receives depends on its distance from the focal plane, a depth map is necessary to create a realistic digital bokeh effect.
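The idea can be sketched in a few lines of code. The snippet below is a deliberately simplified, illustrative model (not any vendor's actual pipeline): given a depth map and a chosen focal depth, it assigns each pixel a blur radius that grows with its distance from the focal plane, which is the core of any digital bokeh renderer.

```python
import numpy as np

def blur_radius(depth, focal_depth, max_radius=8.0):
    """Map each pixel's depth to a blur radius: pixels on the focal
    plane stay sharp, and blur grows with distance from that plane.
    Illustrative linear model only -- real pipelines use lens optics."""
    depth = np.asarray(depth, dtype=float)
    offset = np.abs(depth - focal_depth)
    # Normalise so the farthest offset receives the maximum blur.
    scale = offset.max() if offset.max() > 0 else 1.0
    return max_radius * offset / scale

# Toy depth map in meters: subject at 2 m, background at 10 m.
depth_map = np.array([[2.0, 2.0, 10.0],
                      [2.0, 6.0, 10.0]])
radii = blur_radius(depth_map, focal_depth=2.0)
print(radii)  # subject pixels -> 0.0 (sharp), background -> 8.0 (max blur)
```

A renderer would then blur each pixel by its computed radius, leaving the subject crisp. Note how everything hinges on the depth map being accurate: errors in it translate directly into wrongly blurred (or wrongly sharp) regions.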
Depth mapping: the key to digital bokeh
A depth map is the key to creating the effect artificially, by mapping out which objects are nearer to and farther from the camera. But producing a reliable depth map for a given scene is no trivial task. As neural networks have proven superior for visual perception tasks, deep learning algorithms are commonly used for depth perception. There are two main approaches: monocular (single camera) and stereo (two cameras).
Stereo depth estimation
Stereo depth estimation is considered the closest analogue to natural depth perception, in which the brain fuses information from two eyes. The logic is simple: when matching two images from two rectified cameras, close objects show a large displacement between the views, while faraway objects show little. Processing this displacement (parallax) is key to both human and machine depth perception.
Fig 1: A simplified illustration of the parallax of an object against a distant background due to a perspective shift. When viewed from "Viewpoint A", the object appears to be in front of the blue square. When the viewpoint is changed to "Viewpoint B", the object appears to have moved in front of the red square.
Source: JustinWick at English Wikipedia, CC BY-SA 3.0 <http://creativecommons.org/licenses/by-sa/3.0/>, via Wikimedia Commons
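The displacement-to-depth relationship illustrated above is the classic triangulation formula for rectified stereo cameras: depth = focal length × baseline / disparity. The sketch below applies it with hypothetical numbers (a 1000-pixel focal length and a 12 mm baseline, roughly the lens spacing on a phone); the exact values are assumptions for illustration.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Triangulation for rectified stereo: depth = f * B / d.
    Disparity and focal length are in pixels, baseline in meters."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# Hypothetical rig: focal length 1000 px, baseline 12 mm.
near = depth_from_disparity(disparity_px=40.0, focal_px=1000.0, baseline_m=0.012)
far  = depth_from_disparity(disparity_px=4.0,  focal_px=1000.0, baseline_m=0.012)
print(near, far)  # 0.3 m vs 3.0 m: larger displacement means a closer object
```

The hard part in practice is not this formula but finding the disparity itself, i.e. matching each pixel between the two views, which is exactly where the deep networks mentioned above come in.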
Monocular depth estimation
Monocular depth estimation relies instead on other cues such as patterns, lighting, and object recognition. Although these cues can yield useful depth information, the single-camera approach has significant shortcomings, as it depends largely on learned priors. (Imagine seeing a cat: we can estimate its depth because we know its probable size.)
Stereo vs mono: so which is better?
This reliance on recognized patterns can cause obvious mistakes with the monocular approach, as unseen patterns easily fool monocular networks. In addition, without parallax, objects can be made to appear closer or farther away simply by changing their scale. Stereo networks easily avoid these mistakes thanks to the additional parallax information.
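The scale ambiguity can be made concrete with a small sketch. Under a pinhole camera model, an object's apparent (angular) size depends only on the ratio of its physical size to its distance, so a single camera cannot tell a small, near object from a large, far one; the sizes and distances below are illustrative.

```python
import math

def angular_size_rad(object_size_m, distance_m):
    """Angle subtended by an object at a given distance (pinhole model)."""
    return 2 * math.atan(object_size_m / (2 * distance_m))

# A 0.5 m object at 2 m and a 1.0 m object at 4 m subtend the same
# angle, so a single camera sees them identically. A stereo pair
# would measure different disparities and resolve the ambiguity.
a = angular_size_rad(0.5, 2.0)
b = angular_size_rad(1.0, 4.0)
print(math.isclose(a, b))  # True
```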
Because monocular algorithms do not use real depth information, they tend to cluster objects into discrete depths without regard to the actual depth differences between them. This can cause the main subject to appear to stay at a constant depth even as it moves back and forth, which in turn produces an unnatural, synthetic-looking blur when used for a bokeh effect.
Using real depth information in the learning process also tends to yield better perception of small details, since separating fine details from their surroundings based on monocular cues alone is hard. The stereo approach therefore generally delivers higher accuracy in depth estimation.
Fig 2: The top row shows stereo output, and the middle row shows monocular output. The stereo output provides more accurate maps, whereas the monocular output is less accurate and less sensitive to small details. For example, the wall has uniform depth, but the monocular output perceives it as having varying depth: the pictures hanging on the wall confuse the monocular network, tricking it into predicting different depths. The stereo output calculates these depths more accurately.
Depth estimation algorithms are becoming increasingly important in computer vision applications. In smartphones, other approaches are also in use: Google uses “dual pixel” sensors, and Apple uses LiDAR. However, given the limitations of the monocular approach, the dual camera will continue to play a critical role in creating the bokeh (portrait mode) effect in small camera systems. We’ve become accustomed to seeing multiple cameras on smartphones, and we can expect to see them appearing in other small cameras, like laptops and webcams, very soon.