Immersive Reality: Opening New Kinds of Interactive Experiences
If you’ve ever been engrossed in the setting of a novel, you know that visual detail is not required to suspend disbelief; it is the quality of the story itself that takes you away. The same holds for video games: the low-fidelity blocks of Minecraft can be as engrossing as the most advanced photo-realistically rendered games. Low-fidelity mobile games have cut deeply into the high-fidelity game console market because people can play anywhere, anytime and still feel immersed in the action of the game. Nevertheless, research has shown that visual and audio detail do affect our sense of reality, making it easier to “suspend disbelief” and allow ourselves to become part of the story.
The killer application for immersive reality is probably gaming — where a single individual can wear a headset and become immersed in an artificial world. But “virtual” reality is not only about computer-generated artificial worlds. It is also about bringing remote places closer — places to which it is difficult to travel can be made real by delivering all of the light and sound that our eyes and ears can perceive using ultra-high resolution cameras, microphones, and low-latency networks. This kind of remote reality is a subset of virtual reality where the objects in the scene really do exist in the physical world, just not right in front of you.
The resolution of today’s cameras seems astoundingly high: typical consumer cameras exceed 12 million pixels, and some image sensors can capture more than 120 million pixels. At the same time, display resolutions are increasing from High-Definition (HD, or 2 megapixels) to 4K (8 megapixels), and new screens are already available at 8K (about 33 megapixels) resolution. While these resolutions seem more than adequate for a “lean-back” experience of watching television or cinema content, are they enough for a visually immersive experience in which the audience interacts within the remote space, tracking objects and adjusting their focus in a scene at will?
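The megapixel figures quoted for each display format are rounded. As a quick check, the short sketch below multiplies out the frame dimensions, assuming the common consumer definitions of HD (1920 x 1080), 4K (3840 x 2160), and 8K (7680 x 4320); these dimensions are our assumption and are not stated in the text.

```python
# Frame dimensions (width, height) in pixels.
# Assumption: consumer UHD definitions of "4K" and "8K", not the cinema variants.
FORMATS = {
    "HD": (1920, 1080),
    "4K": (3840, 2160),
    "8K": (7680, 4320),
}


def megapixels(width: int, height: int) -> float:
    """Return the pixel count of one frame in millions of pixels."""
    return width * height / 1_000_000


for name, (w, h) in FORMATS.items():
    print(f"{name}: {w} x {h} = {megapixels(w, h):.1f} megapixels")
```

Under these assumptions, each step from HD to 4K to 8K quadruples the pixel count.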
A conventional movie or television presentation is a “lean-back” experience where the audience passively lets the director tell a story and the viewer follows wherever the camera takes them. In an immersive experience, however, each member of the audience can define their own interests that can change in an instant, depending on the activity. Let’s be clear — we are recreating reality, not making a movie, not telling a story in which we draw the viewer’s attention to a main character or object in a scene. In immersive reality, all of the visual and auditory data that a viewer may desire has to be ready at the turn of their head.
Immersive remote reality opens new kinds of interactive experiences where rich visual detail can make important differences. Imagine surgeons across the globe operating on patients remotely, skilled technicians manufacturing and repairing complex machinery, and students experiencing the most advanced hard-to-reach frontiers of science on our planet and eventually off-planet as well. People will soon be able to fly drones to tour exotic locations, pilot robots to meet with people, shop at the most exclusive boutiques in the world, witness point-of-view performances of world-class athletes on the field, and feel the full visceral experience of live music, on stage, right there with the band — all delivered in real time over low-latency networks.
Although the value of these experiences is clear, a critical question for developing the technology remains: what is the upper limit of detail needed to maximize the value of immersive experiences? Once that question is answered, we can specify the upper limit on the amount of data that cameras, displays, and networks will need to process.
Upper Limits of Human Visual Perception
It turns out that today’s most advanced cameras and displays provide only a fraction of the detail surrounding us in the real world. Our eyes can detect dots in our view as fine-grained as 0.3 arc-minutes (an arc-minute is one-sixtieth of a degree), meaning we can differentiate approximately 200 distinct dots per degree. Converting that to “pixels” on a screen depends on the size of the pixel and the distance between our eyes and the screen, but let’s use 200 pixels per degree as a reasonable estimate. Our eyes can mechanically shift across 150 degrees horizontally and 90 degrees vertically, which would require a region of 30,000 by 18,000 pixels, or 540 million pixels, for full coverage.
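The 540-million-pixel figure follows from multiplying the angular range of the eyes by the assumed pixel density in each direction. Here is a minimal sketch of that arithmetic; the constant and function names are ours, chosen for illustration.

```python
# Figures taken from the paragraph above.
PIXELS_PER_DEGREE = 200   # estimate derived from 0.3 arc-minute acuity (60 / 0.3)
EYE_RANGE_H_DEG = 150     # horizontal range of eye movement, in degrees
EYE_RANGE_V_DEG = 90      # vertical range of eye movement, in degrees


def field_pixels(h_deg: int, v_deg: int, ppd: int = PIXELS_PER_DEGREE) -> tuple[int, int, int]:
    """Return (columns, rows, total pixels) needed to cover a field of view."""
    cols = h_deg * ppd
    rows = v_deg * ppd
    return cols, rows, cols * rows


cols, rows, total = field_pixels(EYE_RANGE_H_DEG, EYE_RANGE_V_DEG)
print(f"{cols:,} x {rows:,} = {total / 1e6:.0f} million pixels")
```

Because the density applies in both directions, the total grows with the square of pixels per degree: halving the assumed acuity would cut the pixel count to a quarter.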
That is up to 540 million pixels for a static image, but the world does not sit still. For motion video, multiple static images are flashed in sequence, typically at a rate of 24 to 30 images per second for film and television. But the human eye does not operate like a camera. Our eyes receive light continuously, not in discrete frames, and while 30 frames per second is adequate for moderate-speed motion in movies and TV shows, the human eye can perceive much faster motion, with some estimates as high as 200 frames per second. For sports, games, science, and other high-speed immersive experiences, rates of 60 or even 120 frames per second are needed to avoid “motion blur” and disorientation.
Other characteristics of the human eye exceed current display technologies. Our eyes can perceive a contrast ratio of nearly 1 million levels of brightness, requiring up to 8 bytes to fully encode the perceptible color gamut for each screen pixel.
Let’s do a quick back-of-the-envelope calculation of the upper limit now. 540 million pixels at 8 bytes per pixel at 120 frames per second would be 518 Gigabytes (GB) of data per second. No digital system or network in the foreseeable future can handle that kind of raw throughput. Fortunately, there is significant redundancy in visual data that allows for a great deal of compression, depending on the complexity of the images. Even at a very high compression ratio of 300:1, which would require very powerful computers to encode and decode the video, a 518 GB per second stream would still leave us with 1.7 GB of data per second.
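The back-of-the-envelope calculation can be reproduced in a few lines. The sketch below uses only the figures from the paragraph above and assumes decimal gigabytes (1 GB = 10^9 bytes); the names are illustrative.

```python
# Figures taken from the text.
PIXELS = 540_000_000       # eye-only field of view at 200 pixels per degree
BYTES_PER_PIXEL = 8
FRAMES_PER_SECOND = 120
COMPRESSION_RATIO = 300    # the "very high" 300:1 ratio discussed above

GB = 1_000_000_000         # assumption: decimal gigabytes


def raw_bytes_per_second(pixels: int, bytes_per_pixel: int, fps: int) -> int:
    """Uncompressed data rate of a video stream in bytes per second."""
    return pixels * bytes_per_pixel * fps


raw = raw_bytes_per_second(PIXELS, BYTES_PER_PIXEL, FRAMES_PER_SECOND)
compressed = raw / COMPRESSION_RATIO

print(f"raw:        {raw / GB:.1f} GB/s")
print(f"compressed: {compressed / GB:.2f} GB/s")
```

The raw rate comes to 518.4 GB per second and the compressed rate to about 1.73 GB per second, which the text rounds to 518 GB and 1.7 GB.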
1.7 GB per second sounds enormous, but it still only scratches the surface of the amount of data surrounding us in the real world. If we add the ability for our heads to turn and bodies to rotate, we would expand the visual field to 360 degrees horizontally and approximately 270 degrees vertically, any part of which our eyes can focus on at any instant: about 3.9 gigapixels at 8 bytes per pixel and 120 frames per second requires a 3.7 terabyte per second transport system before compression! There are further attributes of human vision that can push the requirements even higher, as our eyes can adjust focal depth almost instantaneously. In the future, depth-capable displays will need to allow our eyes to dynamically bring objects in and out of focus. Today’s commercial 3D displays cannot provide that, but light-field displays in research labs can, and this will require even more data per frame.
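To see how the full-field figures relate to the eye-only case, the sketch below repeats the calculation for both fields of view. It assumes the same 200 pixels per degree, 8 bytes per pixel, and 120 frames per second used earlier, with no compression applied; the helper name is ours.

```python
# Assumptions carried over from the earlier estimate.
PIXELS_PER_DEGREE = 200
BYTES_PER_PIXEL = 8
FRAMES_PER_SECOND = 120


def raw_rate(h_deg: int, v_deg: int) -> tuple[int, int]:
    """Return (pixels per frame, uncompressed bytes per second) for a field of view."""
    pixels = (h_deg * PIXELS_PER_DEGREE) * (v_deg * PIXELS_PER_DEGREE)
    return pixels, pixels * BYTES_PER_PIXEL * FRAMES_PER_SECOND


eye_pixels, eye_rate = raw_rate(150, 90)      # eyes moving, head still
full_pixels, full_rate = raw_rate(360, 270)   # head and body free to turn

print(f"full field: {full_pixels / 1e9:.2f} gigapixels, {full_rate / 1e12:.1f} TB/s raw")
print(f"that is {full_pixels / eye_pixels:.1f} times the eye-only field")
```

Under these assumptions, letting the head and body turn multiplies the pixel count, and therefore the raw data rate, by a factor of 7.2.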
In addition to transporting enormous amounts of data, full-field communication requires that the data be compressed and transported in real time. In 1968, Robert B. Miller, an early human factors scientist for IBM, established that humans perceive a response as “instantaneous” only when it arrives in less than 100 milliseconds. Additionally, some of the motion sickness that people suffer in VR systems is caused by the delay between a person’s motion and the system’s response, so even shorter latencies are required to reduce such vertigo.
Making Remote Reality a Reality
These data rates may sound excessive in today’s world, where most home networks receive no more than 20 Mbit/s and businesses 100 Mbit/s, which is on the order of one percent of the throughput that immersive reality requires. You may also question the demand for such ultra-fidelity in a marketplace where HD video has only recently supplanted Standard Definition (SD). However, while HD may have felt like a future-proof format when first announced, it is now being quickly supplanted by 4K video cameras and displays. Forty-three-inch 4K displays are now available for less than US$600, and many video streaming services offer 4K content. The resolution and physical size of displays continue to grow: just this year, several camera and display manufacturers have announced 8K products, and some streaming services have already begun to offer 8K content.
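To put the one-percent figure in context, the sketch below compares the quoted home and business connection speeds with the 1.7 GB per second compressed stream estimated earlier. It assumes 8 bits per byte and decimal units; the names are illustrative.

```python
# Figures taken from the text.
REQUIRED_BYTES_PER_SECOND = 1_700_000_000      # 1.7 GB/s compressed immersive stream
LINKS_MBIT = {"home": 20, "business": 100}     # typical connection speeds in Mbit/s


def share_of_required(link_mbit: float) -> float:
    """Fraction of the required throughput that a link of the given speed provides."""
    required_bits_per_second = REQUIRED_BYTES_PER_SECOND * 8
    return link_mbit * 1_000_000 / required_bits_per_second


print(f"required: {REQUIRED_BYTES_PER_SECOND * 8 / 1e9:.1f} Gbit/s")
for name, mbit in LINKS_MBIT.items():
    print(f"{name}: {mbit} Mbit/s is {share_of_required(mbit):.2%} of the requirement")
```

By this estimate a 100 Mbit/s business connection provides a little under one percent of the required 13.6 Gbit/s, and a 20 Mbit/s home connection well under a quarter of one percent.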
Unlike the old days, when new broadcast standards required a decade or more for widespread adoption, ultra-resolution systems are being adopted very quickly now that internet distribution allows new formats to be downloaded on demand.
MirrorSys: The Network is the Key
Huawei’s first step toward realizing the goals of full-field communication is a research prototype called MirrorSys, which provides a fully life-size, real-time, realistic visual and auditory reproduction of a remote space. Introduced in Barcelona at Mobile World Congress 2015, and again in Hanover at CeBIT 2015, MirrorSys consisted of a 32-megapixel wall-sized display (a seamless array of sixteen HD projectors, 5 meters wide by 2.6 meters high) designed to match the resolution that human retinas can discern at a two-meter viewing distance. Following these early demonstrations, our lab has doubled the size and pixel count with a 10-meter-wide by 2.6-meter-high display. The audio system employs a 32-microphone array to capture the directionality of sounds from the source environment, which is then accurately reproduced over a 22.2-channel speaker system. Because this system precisely localizes the point of origin for each sound, the viewer can separate multiple conversations occurring simultaneously within different zones of the shared space. On the camera side, we have stitched together three 4K cameras running at 60 frames per second and have compressed and transported the payload in less than 150 milliseconds across a dedicated network, and we continue to drive for the lowest possible latencies.
Although the first MirrorSys prototype is capable of pushing today’s network infrastructure to the limit, we know that we are really only scratching the surface. Whether the source imagery is from live cameras in the real world or computer generated, we understand that true immersion requires enormous amounts of data to match human perceptual sensitivities. Networks of the future will need to carry orders of magnitude more data and at latencies that are imperceptible to humans.
All of this makes clear that ultrafast networks are essential for the widespread adoption of fully immersive media. Other than playing games on dedicated local machines, most applications will require transport of some or all of the “reality” between servers and end-users. Even gaming is moving to cloud-based infrastructures that require shipping large data payloads around the world, as data centers are increasingly needed to provide the computing power for rendering realistic, artificial worlds.
Huawei R&D is developing technologies that can compress and transport all of the light and sound of a live remote or virtual environment in real-time for life-size, full-fidelity reproduction.
The future of MirrorSys, and full-field communication in general, is the ability for people to practically and routinely teleport to any location in the world, an accomplishment that will open a huge number of new business opportunities. Telemedicine will shift from talking heads over HD or 4K video circuits to the comprehensive visual detail necessary for accurate medical diagnosis and even remote surgery; remote technicians will operate and repair complex machinery; shoppers will examine real estate, precious gems, detailed manufacturing processes, electronic circuits, and other visually complex products with unprecedented precision. On the consumer side, fully immersive systems will let people engage in out-of-the-ordinary activities that most of us will only ever dream of: climbing Mount Everest, visiting the Taj Mahal, piloting a Formula 1 car, standing on the field for World Cup football matches, skydiving, or diving the Great Barrier Reef. The experiences may be spectacular or simply heartening: the opportunity to visit with Mom back home, or to check in with your children before bedtime when you are far away, provides irreplaceable moments in our lives. Virtual and augmented realities are opening countless new possibilities, but how many realities can a person actually live in? Just one, of course, but it’s going to get a whole lot bigger.
MirrorSys: Converging Video and the Environment