The Prompt: What Happens When We Hit The ‘Data Wall’?

Welcome back to The Prompt.

With the Olympics in full swing, companies advertising during the games are looking for ways to connect their business to athletics — and sometimes that’s a stretch. In Google’s latest commercial, a father turns to Google’s AI chatbot Gemini to write a letter from his daughter to her favorite athlete. People were shocked Google would suggest replacing the human element of a devoted fan-written letter with AI: online commentators noted the commercial took “a little chunk out of my soul,” “makes me want to scream” and is “something seemingly nobody wants.”

Now let’s get into the headlines.

BIG PLAYS

On Thursday, OpenAI announced it is testing a new prototype search engine called SearchGPT. The tool, built on top of GPT-4 models, will search and summarize real-time information from across the web. OpenAI said it’s partnering with publishers like News Corp and The Atlantic to source content and that its answers will “prominently cite” sources at the end of responses.

SearchGPT could pose a real challenge to search behemoth Google, which now provides AI-written summaries at the top of search results, and nascent AI search startup Perplexity, which similarly draws from news articles in its AI responses to search queries. After Forbes reported that Perplexity had republished journalistic work from multiple news outlets without proper attribution, the company announced today that it’s partnering with publishers including Time, Fortune and Texas Tribune and launching a “revenue sharing program.” Under the program, brands will sponsor follow-up or relevant questions, and publishers cited in those answers will receive an undisclosed share of the revenue that Perplexity earns.

PEAK PERFORMANCE

Building apps is now as simple as writing a text-based prompt. Airtable, whose low-code tools have already helped whip up 50 million applications, has launched a new generative AI tool called Airtable Cobuilder, which uses OpenAI’s GPT-4 models and information about a person’s job title and company to suggest and create relevant apps.

AI DEAL OF THE WEEK

AI logistics platform Altana raised $200 million at a $1 billion valuation, Forbes reported. Founded in 2019, Altana uses artificial intelligence to analyze and produce insights from a map of the supply chain. The company also uses generative AI to let users ask questions about bottlenecks in their supply chain, allowing them to spot vulnerabilities and act on them.

Also notable: Legal AI startup Harvey raised a $100 million investment at a $1.5 billion valuation, and design startup Canva acquired text-to-image generator Leonardo AI.


DEEP DIVE

In 2011, Marc Andreessen, whose venture capital firm Andreessen Horowitz has since invested in some of the biggest startups in AI, wrote that “software is eating the world.” More than a decade later, it is literally doing just that.

Artificial intelligence, specifically the large language models that power it, is a voracious consumer of data. But that data is finite and it is running out. Companies have mined everything in their efforts to train ever more powerful AIs: YouTube video transcripts and subtitles, public Facebook and Instagram posts, copyrighted books and news articles — sometimes without permission, sometimes with licensing deals. OpenAI’s ChatGPT, the chatbot that helped mainstream AI, has already been trained on the entire public internet, roughly 300 billion words including all of Wikipedia and Reddit. At some point, there will be nothing left.

Researchers call this “hitting the data wall.” And they say it’s likely to happen as soon as 2026.

That makes creating more AI training data a billion-dollar question, one that an emerging cohort of startups is looking for new ways to answer.

One possibility: creating artificial data. That’s five-year-old startup Gretel’s approach to AI’s data problem. It makes what’s known as “synthetic data” — AI-generated data that closely mimics factual information, but isn’t actually real.

But synthetic data has its limits. It can exaggerate biases in an original dataset and fail to include outliers, rare exceptions that you’d only see in real data. That could make AI’s tendency to hallucinate even worse. Or models trained on fake data could simply fail to produce anything new. Gretel CEO Ali Golshan calls this a “death spiral,” but it’s more widely known as “model collapse.” To avoid it, he requires new customers to provide Gretel with a chunk of real, high-quality data. “Junk safe data is still junk data,” Golshan told Forbes.

Read the full story on Forbes.


YOUR WEEKLY DEMO

Elon Musk’s social media platform X quietly turned on a default setting that gives the company permission to use public posts and interactions for training Grok, a large language model developed by another Musk-owned startup, xAI. Other AI chatbots like OpenAI’s ChatGPT have also been trained on public data from Twitter posts. The good news? You can prevent your X posts from being used this way by disabling the feature in the privacy and settings tab.


QUIZ

This former NFL quarterback has launched an AI company that aims to help people create and publish stories through generative AI tools that can develop characters and edit dialogue.

  1. Tom Brady
  2. Colin Kaepernick
  3. Dan Marino
  4. Patrick Mahomes

Check if you got it right here.


MODEL BEHAVIOR

A new startup called Friend, founded by Harvard dropout Avi Schiffmann, is selling a $99 AI necklace to “combat loneliness.” Yes, you read that right. The necklace’s “always-listening” pendant reacts in real time to whatever its wearer is doing; if you tap it, it texts you a quip. A trailer for the technology suggests use cases like casually chatting with it about how good your falafel wrap is or how bad your gaming skills are.
