On June 6, at WWDC23, the long-rumored Apple headset (known in rumors as 'Apple Glass') was finally announced as the Apple Vision Pro.

Because it is a new product category for Apple, because of its steep $3,499 price, and because only a minority of people have actually experienced it, all sorts of observations and opinions are emerging.
In particular, its exterior looks like a familiar sight from the market that existing VR devices have already created, and Apple even chose a tethered battery, so it is drawing plenty of criticism from users judging it against their experience with existing VR devices.
Like a fool, I bought the Meta Quest Pro at full price, never managed to make the most of it, and now it mostly sits on display. Having worked in VR in the past, and having used the HoloLens 2, once praised as the pioneering device in this field, to create content, give lectures, and experience things firsthand, I have a lot of thoughts on all of this.

But, why did Apple specifically create and announce the device called 'Vision Pro'?
Apple is not a research group; it is a company that makes thoroughly marketable products. Some call it an icon of innovation, but it chooses 'innovation' not to make products for innovation's sake, but to create value that opens consumers' wallets.
It was like this in the Steve Jobs era, and it has been the same in the succeeding Tim Cook era.
So quite a few products have quietly disappeared from the lineup, and quite a few were introduced with a convincing look but never released. Some products are even released quietly, without any promotion, when the market reaction is expected to be lukewarm.
Apple put seven years of effort into this Vision Pro, and as rumors circulated that it had not even entered mass production, some speculated that Apple might ultimately give up on releasing it, judging the sector unlikely to generate revenue. After this announcement, many also expected that the Vision Pro's specs were so high relative to its price that Apple would actually lose money on every unit sold.
The stock market's response matched this mood: Apple's share price dipped slightly immediately after the announcement.
Therefore, the fundamental question is:
Why Did Apple Create the Vision Pro?
I want to write about this from the perspective of Spatial Computing.
1. Spatial Computing
I think the first hint can easily be found in the keyword "Spatial Computing," mentioned in the presentation and on Apple's homepage.

When rumors first started circulating that Apple was going to release a Head-Mounted Display (HMD) type device, there was a lot of curiosity in the industry about whether Apple would make a VR device or AR glasses.
And as time passed and the announcement and release were delayed, quite specific stories emerged that Apple would release an MR (Mixed Reality) device and use the term XR (eXtended Reality)! (from MacRumors, etc.)
However, Apple's choice was the Vision Pro, which is introduced as a Spatial Computing Device.
What exactly is Spatial Computing, as Apple describes it?
To implement spatial computing in the literal sense, various technologies are required. First, sensors that can capture spatial information, such as LiDAR scanners and ToF (Time of Flight) sensors, must be used to perceive the surrounding environment, and that information must be processed numerically.
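To make this concrete, here is a minimal sketch of how per-frame depth data can be read and handled numerically on a LiDAR-equipped iPhone or iPad using ARKit. This is ordinary iOS code for illustration only, not the Vision Pro's actual (undisclosed) sensing pipeline, and the class name `DepthReader` is my own.

```swift
import ARKit

// A minimal sketch: read the LiDAR-derived depth map that ARKit produces
// each frame, i.e. the kind of "numerical" spatial information discussed above.
final class DepthReader: NSObject, ARSessionDelegate {
    let session = ARSession()

    func start() {
        let config = ARWorldTrackingConfiguration()
        // Per-frame depth maps are only available on LiDAR-equipped devices.
        if ARWorldTrackingConfiguration.supportsFrameSemantics(.sceneDepth) {
            config.frameSemantics.insert(.sceneDepth)
        }
        session.delegate = self
        session.run(config)
    }

    func session(_ session: ARSession, didUpdate frame: ARFrame) {
        guard let depth = frame.sceneDepth else { return }
        // depthMap is a CVPixelBuffer of 32-bit floats, one distance
        // (in meters) per pixel; on current iPhones roughly 256 x 192.
        let map = depth.depthMap
        print("Depth map:",
              CVPixelBufferGetWidth(map), "x", CVPixelBufferGetHeight(map))
    }
}
```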

Additionally, the actually visible imagery must be matched to the depth information obtained this way. The MS HoloLens 2, released in the past, chose to render the processed information on translucent glass in a manner similar to a hologram.

At the time, it was a very innovative and surprising technology, with the advantage of being intuitive and usable while viewing the actual scene. However, due to the limitations of transparent displays and their drawing methods, it suffered from a narrow field of view and poor color reproduction in real-world viewing.
According to industry rumors, one reason Microsoft may have abandoned development of a HoloLens 3 is that improving the HoloLens 2's display components was difficult and manufacturing costs were too high. This shows how, in this field, the strengths come paired with equally significant weaknesses.

The approach chosen for the Vision Pro is similar to the Meta Quest Pro's color pass-through: capturing images from external cameras and combining them in software via image stitching (a method of image processing that merges information obtained from multiple cameras into one seamless image).
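To give a sense of what the stitching step involves, below is a hedged sketch of its core operation: estimating the homography that aligns two overlapping camera frames, here using Apple's Vision framework. It illustrates the general technique only; the Vision Pro's actual pass-through pipeline has not been disclosed, and the function name is my own.

```swift
import Vision
import CoreGraphics
import simd

// Estimate the 3x3 homography that maps points in `moving` into the
// coordinate space of `reference`. Warping one frame by this matrix and
// blending the overlap is the essence of image stitching.
func stitchingHomography(from moving: CGImage,
                         to reference: CGImage) throws -> matrix_float3x3? {
    let request = VNHomographicImageRegistrationRequest(
        targetedCGImage: moving, options: [:])
    let handler = VNImageRequestHandler(cgImage: reference, options: [:])
    try handler.perform([request])
    guard let observation = request.results?.first
            as? VNImageHomographicAlignmentObservation else { return nil }
    return observation.warpTransform
}
```

Doing this registration, plus the warping and blending that follow, for several high-resolution camera feeds on every single frame is precisely the computational load discussed below.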

This method first became known to the public when Meta began allowing the grayscale pass-through mode on the Quest 2 to be used for other purposes. As I remember it, the grayscale pass-through on the Quest 2 originally existed so that users, whose field of view is completely covered while wearing a VR headset, could at least perceive their surroundings in grayscale and mark out a safe play area to avoid colliding with obstacles.

When I first used the Quest 2, I found the black-and-white pass-through mode fascinating. Although the Quest 2's pass-through suffers from low resolution, likely because it reuses the infrared cameras meant for gesture and object tracking, I thought it would be great if we could build an MR environment that matched the surrounding information using this technology.
At the time, Meta was responding incredibly quickly with updates, and it accommodated requests from developers like me to use pass-through mode for other purposes. Leveraging this, features like linking with a physical keyboard made it feel as though true mixed reality could be approached as a relatively affordable solution, rather than down HoloLens's expensive and difficult path.
Having felt both the potential and the limitations of the HoloLens 2, when Meta, the market leader in VR devices, announced the Quest Pro with a color pass-through mode and talked about addressing the existing MR market through collaboration with Microsoft, I bought one immediately, without hesitation, the moment sales opened, despite the high price.
However, contrary to expectations, the Meta Quest Pro's color pass-through was pathetic in resolution, color noise, motion sickness, and more, and users like me, who had most looked forward to it as an MR device, were greatly disappointed.
The Quest Pro debuted with great ambition, yet failed to deliver a proper color pass-through mode. There may be various reasons, but I personally believe it ultimately came down to a lack of performance for processing high-definition images.
As mentioned earlier, stitching multi-angle images captured by the external cameras into one naturally connected spatial image requires a great deal of computational resources.

This technology has long been used to create environment maps; over a decade ago, when I worked at a research institute, I also attempted stitching while shooting HDRI (High Dynamic Range Imaging) data myself to create VR environment maps.
It goes without saying that to achieve a realistic feel in VR content, high-resolution footage must be captured and processed separately across a range of exposure values. Even though we filmed in a small studio, I remember the computation running on a high-end workstation for over 24 hours, and the resulting file was so large, well into the gigabytes, that it would not even open in Photoshop.
So to actually use it, we had to cut it down to a fairly low resolution and compress it; only once the file size was sufficiently reduced would it run on the VR software we were using.
In other words, capturing external footage, stitching it in real time, and displaying it on screen is a computation-heavy task, and I cautiously suspect that Meta is forced to process it at a lower resolution because of the limits of its internal chipset's performance.
However, looking at the videos shown in the Vision Pro introduction and at user reviews, there are accounts that it is surprisingly natural and does not feel significantly removed from reality.


Although detailed specs are not yet known, reports indicate that it uses about four external cameras and sensors to acquire external information and generate the internal images. If it can stitch sufficiently high-resolution images in real time, naturally enough for actual use, while seamlessly compositing them with the 3D and other objects the Vision Pro displays... that would be truly amazing technology.
From my perspective, this is truly amazing: I spent years stitching by hand with open-source software MIT had released 12 years earlier, shooting with a Canon camera while bracketing exposure in 6 steps, capturing from different angles, gathering the data, stitching it in software, post-processing the results, shrinking the files, and struggling to make it all run live.
The iPhone Pro's LiDAR scanner is known to use Sony's IMX590 image sensor, whose resolution is around 648 x 488, which is not very large.

In other words, the amount of information from the image sensors used for depth computation is not as detailed as one might think, yet even that is known to be difficult to process.
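As a back-of-the-envelope illustration using the numbers above (and assuming, say, a 60 Hz update): 648 × 488 is roughly 316,000 depth samples per frame, which already comes to about 19 million distance measurements to ingest every second, before any fusion with the higher-resolution color cameras even begins.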
Although the exact specifications of the Vision Pro's external cameras and depth sensors are not yet known, if it can implement mixed reality as naturally as Apple has shown, that is genuinely remarkable engineering in many respects.
Of course, on top of that there are features like hand tracking, facial recognition, voice recognition, spatial audio, and so on... They say there are really so many features... Shivers..

Actually, on YouTube and elsewhere, reviews are being posted by tech YouTubers who were invited to WWDC and got to try the Vision Pro. While they clearly mention some shortcomings and discuss various issues, the mixed reality the Vision Pro shows seems top-tier overall, and most reactions are ones of surprise.

Given this level of completeness, it seems Apple is deliberately talking about spatial computing, a way for users to put all these technologies to use, rather than about VR, AR, or MR as everyone expected.
And what if a device capable of the level of spatial computing Apple claims could also... run for about 2 hours on a battery in standalone mode, and be light enough to wear on your head? Perhaps the $3,499 price, as some argue, could even feel like good value for money. (Just think about implementing this with other devices..? Wow..)
In other words, if traditional PCs and Macs were devices you mainly installed and used at a desk, Apple has created a computer you can use while freely walking around and living in a space familiar to you.
Once again, I want to emphasize that I don't think Spatial Computing as Apple claims for the Vision Pro is something that can be easily implemented.
First of all, among existing devices, it seems to be the first to reach this level.
4. The M2 Chip and the Arrival of the R1

I believe one of the biggest changes and shocks in the computer chipset market over the past five years has been Apple silicon.
The arrival of the M1, which drew incredible performance out of a single chip, completely changed the positioning of chips for tasks that demand a certain level of performance or higher.

After unveiling the M2 line for MacBooks and the iPad Pro, Apple seems to have created an environment capable of delivering far more efficient and consolidated performance than the M1, M1 Pro, and M1 Max. And that is the chipset that went into the Vision Pro.
To implement the spatial computing discussed earlier, a truly diverse range of computational processing is required. As I have emphasized repeatedly, I believe a major reason existing companies like Microsoft and Meta proposed this value but failed to deliver it properly or bring it to the general public is that they fundamentally lacked the computational throughput required to run it.

Microsoft's HoloLens and Meta's Quest Pro use XR chips made by Qualcomm. While I cannot test the exact figures directly, considering the basic performance of those chips, the performance gap between them and the M2 chip used in the Vision Pro is enormous even on paper.


I don't know exactly which M2 variant is in the Vision Pro, nor how reliable a Geekbench run on the Quest Pro is, but simply comparing the numbers, there appears to be a gap of more than 5x in single-core and more than 9x in multi-core performance.
(This information may be inaccurate, as the test criteria are not clear.)
Graphics performance is harder to compare cleanly because of the difference between the Vulkan API and OpenCL, but there appears to be a significant gap in the chipsets' performance there as well.


In other words, thanks to the inclusion of the M2 chip, countless processes that had previously been limited to ideas and testing became possible, which is likely why the Vision Pro was able to come to fruition.
I had heard industry rumors that these issues were why Apple could not go ahead and release the product. The story was that, in the early stage of re-examining the product's value and working on its packaging, the planned features could not run properly because the main chipset under consideration at the time lacked the performance, which is why release and mass production kept being delayed.
That was a story I heard about two years ago. Later, when the M1 chip actually appeared and rumors circulated that the next-generation Apple device, the 'Apple Glass,' would carry the M1, I thought the rumor had credibility. Then Apple began, seemingly out of nowhere, to put its silicon chips into the iPad line, which I read as a process of mass-producing the chips and consolidating the platform, an interpretation that seemed very plausible to me.

And seeing the M2 line inside this Vision Pro, I can't help speculating that Apple developed its own chipset from the very beginning of the Vision Pro's development, so that it would work naturally with Apple's OS, deliver excellent performance, and remain fully under Apple's control right down to mass production... or so I imagine.
Furthermore, the Vision Pro includes another chipset, called the R1, which is said to focus on processing sensor data.

As mentioned earlier, processing and using data acquired through sensors consumes significant resources. The depth sensor, often referred to as a LiDAR scanner, has a resolution limitation: to obtain more accurate positional information about the surroundings, higher-resolution infrared patterns must be emitted, which naturally requires far more processing power.
Therefore, it seems Apple uses the R1 chip to separately organize the sensor data, and then feeds the organized data to the M2 chip for the main computation.
According to what is known, the R1 chip processes all of this within 12 milliseconds. An average blink takes about 100 milliseconds, so the chip finishes its processing in roughly one-eighth of the time of a blink (100 ms ÷ 12 ms ≈ 8.3), preventing users from experiencing dizziness or other problems caused by latency.
Since Qualcomm's AR2 chip, announced last year, also adopted a separate auxiliary processor, the Vision Pro's architecture, a main processor plus an auxiliary processor dedicated to the sensors, seems to be the direction devices handling sensor-based environmental processing will take.

Nevertheless, to date, Apple is the only company that possesses all of the following: OS development, hardware development, an established consumer base, in-house chipset development, and a developer pool. I therefore believe the gap created by these in-house chipsets, along with the differences in performance and manufacturing quality, will be difficult to close for some time.
Against this backdrop, Meta, Samsung, Google, and Qualcomm have been making various announcements at MWC and elsewhere, and there is talk that they are preparing new products more efficiently to counter, or respond to, the market that will emerge after the Vision Pro launches...

Consumers would welcome such collaboration between manufacturers and chip companies, but it is not easy for companies with different values to collaborate as organically as a single company like Apple, and there are few major success stories for this collaboration model other than the Android OS market, which Google gave away for free... We will have to wait and see whether the result can be as complete as Apple's.
In this situation, it feels a bit regrettable that MS, which held the most technology in every area other than chip manufacturing (hardware, OS, SDK), has lost interest in HMDs and is focusing on AI.. If the Vision Pro had been announced just two years earlier.. would MS still be treating the HoloLens lineup as a second-class citizen.. (It brings tears to my eyes that the MR effort survives as little more than an MVP.)

Wrapping up the first article on the Vision Pro..
Since the content has gotten too long, I'll wrap up the first article on why Apple created the Vision Pro here.
(It feels like I'm writing too much about a product that hasn't even been released or used yet.. )
I have written a very long story about spatial computing and the performance of the M2 chipset, but in the end it comes down to this: for a particular 'purpose,' Apple built a computing system with incredible performance relative to its size, something existing systems cannot implement, and as a result it appears to have brought the "spatially aware computing" environment that MS and Meta kept attempting up to a sufficient level of completeness.
As I will emphasize again in the next article on the 'input system,' 'home entertainment,' 'price,' and 'developer ecosystem,' I believe the way Apple prepared for and launched the Vision Pro reflects its will and purpose regarding the market it envisions for the future.
Most of the scenes in the demo show individual users watching videos, viewing photos, browsing the web, playing games, and handling work tasks in spaces they are familiar with.



Of course, scenes of interacting with acquaintances via FaceTime, or with others via video conferencing solutions like Zoom or Teams, do appear, but the main focus never seems to be "communication" in a virtual environment; it is "utility" in one's actual life.
In other words, rather than challenging and pioneering a new virtual world on the strength of its computing and OS technology, Apple seems to have released the Vision Pro to bet its future on adding a single "layer" on top of the lifestyle people already live, overlaying a "value" that breaks free of spatial constraints on the services Apple already provides.
Of course, Apple was not the first to propose these values; companies that have continuously released virtual reality devices have argued for them too. I am sure many people besides me were drawn in and held high expectations.

However, in reality, device performance limits and technical constraints meant that only a very small fraction of content could actually be used, and that in itself was a high barrier to entry, leading to a failure of mass adoption.
Consequently, the virtual reality device market has been dominated largely by immersive game content, and I believe it has largely split into relatively inexpensive mass-produced Chinese devices and the Quest line led by Meta.
Therefore, when using a VR device, most people understand it as "enjoying an immersive device."
Of course, Meta took on what is commonly called the metaverse by launching its product Horizon, even making the bold move of renaming the whole company "Meta".. However, Horizon's reception was very poor, and despite several years having passed since that story began, I understand it is still not officially available as a service.

There have also been interesting attempts like Workrooms, which applied the value of mixed reality well, and MS explored the possibilities through a solution called Mesh, but... only a tiny minority of users like me, who enjoy new technology, actually tried them. With the devices themselves far from mass adoption, only a handful of people ever experienced and felt this value. (And even that didn't run properly ㅠㅠ)

<Video of the demonstration I did in 2021 using HoloLens 2 and Mesh at Pangyo Metaverse Hub. Source: my YouTube channel>
Therefore, Apple appears to have set aside the social side of spatial computing, still underdeveloped due to device supply problems, the high entry barriers of network technology, and a lack of content and infrastructure, as well as resource-hungry fields like digital twins, and focused instead on personal entertainment and lifestyle, the areas where Apple excels and which it enjoys.
In other words, it seems Apple looked at the future market and made the right choices and focused on what they do best.
And I think the product that best projects that intent is the Apple Vision Pro.
It seems Apple aims, with an intuitive input system, UX interface, and powerful computing performance, to implement mixed reality as added layers via the Vision Pro, realize various "personal" values freed from spatial constraints within that space, and create a new value market.
If the Vision Pro can build an environment where I can watch truly massive videos, listen to music, enjoy sports content with AR elements, and naturally blend interaction with family and friends into my own space, I think it could attract not just users with a geeky, early-adopter mindset, but general users with purchasing power, unlike the existing VR market.

Actually, even around me there are older gentlemen who like to watch movies and enjoy music at home, and they spare no expense on large expensive TVs, home theaters, beam projectors, and the like.

But what if, with a single $3,499 device, tolerating the small inconvenience of pulling it over your head (?), you could watch movies and videos on an incredibly large screen, enjoy high-level music and sound through spatial audio, and break free of the screen-size limits of devices like the Mac you already use, working on your Mac easily while walking around the house and improving productivity...
Is $3,499 a price worth investing in?
Of course, once you actually use it, the evaluation might be completely different..
Considering how Apple has always minded commercial value better than any other company, one wonders what confidence led them to create the Vision Pro. The more I think about it, the more it seems it could be quite a valuable product.