A $9B AI fail

How NOT to develop AI products

I love studying AI fails. To be clear: I don’t enjoy seeing other people lose tons of money, but I like putting myself in their shoes and asking “Would I have made the same mistakes?”. And obviously, analyzing others’ failures makes you less likely to run into the same problems in the future.

3 years ago I wrote a viral post called “A $9B AI fail”, analyzing the mistakes that led the real estate marketplace Zillow to lose…well…$9B. The TL;DR is that they were trying to use AI to predict house prices, a classic exercise for ML students, so it may seem simple. It turns out it’s extremely complex and nuanced when you’re using it to allocate $9B.

I remember it taking me a lot of time to understand the problem they were trying to solve, how they approached it, and what went wrong. Today we’re starting to see the first generative AI fails, and I wish they were as nuanced. They’re pretty naive, and I’d even dare say fairly easy to predict.

The 3 gen AI fails I want to look at are the Humane AI Pin, the Rabbit R1, and Cognition Labs’ Devin. I’ll first describe their shortcomings, then go into the mistakes the companies made, and finish with some recommendations to avoid repeating them. Let’s have some fun.

The fails

The Humane AI Pin made noise in the tech world with an inspiring, somewhat dystopian launch demo. It was followed 5 months later by a “worst product I’ve ever reviewed” video from Marques Brownlee, the biggest tech YouTuber in the world. Side note: Humane has raised $230M in funding.

TL;DR, the Humane AI Pin is:

  • slow (every action runs in the cloud, and LLMs are still slow)

  • inaccurate (it gave Marques wrong information about the solar eclipse)

  • frustrating (Marques asked for directions with traffic to the Empire State Building, and it answered “Use the AI voice command to get traffic directions”, which he…did?)

The second fail is the Rabbit R1, basically a cheaper ($199), non-wearable version of the Humane AI Pin. Its hardware was designed by Teenage Engineering, a company famous for making cool, well-designed, expensive audio gear. Oh, and they sold over 100,000 Rabbit R1s (that’s about $20M in revenue).

We’re still waiting for a full review, but I’ll tell you this: when Marques asked it for the weather, it gave him the weather in New Jersey. Marques wasn’t in New Jersey. When asked why, it answered that it gave him the weather for an “example location chosen randomly”. What the fuck.

The last fail is Cognition Labs’ Devin. The company went insanely viral with a simple video pitching and “demonstrating” their product Devin as an “AI software engineer”, capable of completing software engineering tasks on Upwork. I put “demonstrating” in quotes because the demo turned out to be partially faked.

A YouTuber debunked the video, and even the people who posted the Upwork jobs Devin allegedly solved said the tasks weren’t actually completed. TL;DR:

  1. In the video, Devin solved tasks that didn't align with the customer's original request

  2. It fixed “errors” in files that…didn’t exist

  3. The quality of its code is poor

  4. It’s actually very slow. The video suggests the task was completed quickly, but the timestamps in the chat reveal it took many hours, stretching into the next day

They had $21M in funding before the video. They now have $196M.

What’s behind the fails?

All 3 projects started with legitimate, valuable use cases (a voice-based assistant for Humane and Rabbit, coding for Devin).

The mistake they made is the same, though: they focused solely on the problem and ignored the limitations of the technology.

I can imagine designers and product people reading this and getting angry: “but Gian, you MUST be focused on the problem! Haven’t you heard of Design Thinking? OMG are you NOT customer-centric?”.

I’m not saying that focusing on the user problem is wrong or unimportant. I’m saying that even though I hate wasting time traveling and would really love to teleport instead of flying, no airline should design a teleportation system, because it’s not fucking possible.

This is the problem with AI today: it’s hard to distinguish between what’s out of reach, what it can do reliably, and what it can’t do reliably yet but could with prompt engineering or other AI techniques.

And on top of that, “out of reach” is a moving target, so it’s hard to say which out-of-reach capabilities will become feasible in a month.

So yeah, I wish there was some nuanced explanation for these fails. Instead, the TL;DR is they sold a dream they couldn’t build.

How do you…not fail?

I propose a simple, 3-step framework:

  1. Deeply understand AI’s limitations (the so-called “jagged frontier”)

  2. Design around limitations

  3. Prototype and test A LOT, be objective and don’t lie

Simple, right? I guess it’s easier said than done when you get free attention and lots of budget by promising teleportation.

Let’s look at an example: GitHub Copilot vs Devin.

GitHub Copilot is a tool that helps developers code faster. It’s basically auto-complete on steroids for code. And it’s extremely successful, with more than 1M active users and over 50% faster task completion.

Copilot is in the same problem space as Devin: make coding faster. But Copilot is designed around current limitations: AI doesn’t always make the right prediction (especially without the right context), and it’s better at small tasks than large ones. Therefore, Copilot doesn’t suggest anything if it’s not confident in its suggestion, and when it does recommend something, it helps developers complete at most a handful of lines of code at a time. It’s extremely easy for a developer to see the recommendation and accept or ignore it with a single keystroke.

Devin ignored these limitations and tried to get rid of the software engineer entirely, trusting a technology you shouldn’t trust and taking on problems out of reach for today’s LLMs. The consequences are faulty products, lies, and lots of funding.

Will it work at some point? Will these investors make some returns in the future? Hard to say, but one thing is for sure: today, this isn’t “the first AI software engineer”. It’s a lie.

To be very practical, how can you implement these tips?

I can’t stress enough the importance of step #3: build prototypes (and be critical about their quality). Prototypes don’t have to be fancy, and you don’t have to be a coder to build one. A prototype can be as simple as a series of prompts that you test with tens of different examples, with different people, and in different contexts. Your testing discipline is more important than how fancy the prototype looks.

An example: if you were Rabbit or Humane, you could simply gather a few tens of sample requests that users may have in their daily lives (like asking about the weather). You could test those requests with ChatGPT using your own custom instructions, and benchmark how many times it gives a satisfying answer versus an embarrassing one (like giving you the weather in a random location, you know). If the performance is acceptable, you ship the product. If it’s not, you go back to the drawing board and either invest years trying to fix it, or narrow the problem scope to a subset of tasks you know you can deliver on.
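To make this concrete, here’s a minimal sketch of what such a prompt-level prototype could look like in Python. To be clear, everything here is an assumption for illustration, not anything from Humane’s or Rabbit’s actual stack: I’m using the OpenAI SDK as the model client, a made-up system prompt, a handful of made-up sample requests, and manual pass/fail labeling (automated judging is its own hard problem).

```python
# A minimal prototype-benchmark sketch: run sample requests through a model
# and tally satisfying vs. embarrassing answers by hand.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical custom instructions (a stand-in for whatever your product would use).
SYSTEM_PROMPT = "You are a voice assistant. Answer briefly, and say so when you don't know."

# A few everyday requests; in practice, collect tens of these from different people.
SAMPLE_REQUESTS = [
    "What's the weather like right now?",
    "How's traffic to the Empire State Building?",
    "When is the next solar eclipse, and can I see it from here?",
]

passed = 0
for request in SAMPLE_REQUESTS:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model name; swap in whatever you're evaluating
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": request},
        ],
    )
    answer = response.choices[0].message.content
    # Judge each answer by hand: satisfying (y) or embarrassing (n).
    if input(f"Q: {request}\nA: {answer}\nSatisfying? [y/n] ").strip().lower() == "y":
        passed += 1

print(f"Satisfying answers: {passed}/{len(SAMPLE_REQUESTS)}")
```

Run something like this with tens of requests, different people, and different phrasings, and you have a rough satisfying-vs-embarrassing benchmark before you commit to building hardware.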

In other words: fewer PowerPoints, more prototypes. Fewer lies, more tests.

Want to build your own AI projects?

Participants of our Generative AI Project Bootcamp have built 100+ projects without writing a single line of code. The projects can be about anything you’re interested in: past participants built everything from an AI McKinsey consultant to a personal email travel advisor.

You can join the next edition here, or reach out to run a private edition personalized to your company.