Cartesia claims its AI is efficient enough to run pretty much anywhere

It’s becoming increasingly costly to develop and run AI. OpenAI’s AI operations costs could reach $7 billion this year, while Anthropic’s CEO recently suggested that models costing over $10 billion could arrive soon.

So the hunt is on for ways to make AI cheaper.

Some researchers are focusing on techniques to optimize existing model architectures — i.e. the structure and components that make models tick. Others are developing new architectures they believe have a better shot of scaling up affordably.

Karan Goel is in the latter camp. At the startup he helped to co-found, Cartesia, Goel’s working on what he calls state space models (SSMs), a newer, highly efficient model architecture that can handle large amounts of data — text, images, and so on — at once.

“We believe new model architectures are necessary to build truly useful AI models,” Goel told TechCrunch. “The AI industry is a competitive space, both commercial and open source, and building the best model is crucial to success.”

Academic roots

Before joining Cartesia, Goel was a Ph.D. candidate in Stanford’s AI lab, where he worked under the supervision of computer scientist Christopher Ré, among others. While at Stanford, Goel met Albert Gu, a fellow Ph.D. candidate in the lab, and the two sketched out what would become the SSM.

Goel eventually took a job at Snorkel AI, then Salesforce, while Gu became an assistant professor at Carnegie Mellon. But Gu and Goel went on studying SSMs, releasing several pivotal research papers on the architecture.

In 2023, Gu and Goel — along with two of their former Stanford peers, Arjun Desai and Brandon Yang — decided to join forces to launch Cartesia to commercialize their research.

Cartesia, whose founding team also includes Ré, is behind many derivatives of Mamba, perhaps the most popular SSM today. Gu and Princeton professor Tri Dao started Mamba as an open research project last December, and continue to refine it through subsequent releases.

Cartesia builds on top of Mamba in addition to training its own SSMs. Like all SSMs, Cartesia’s give AI something like a working memory, making the models faster — and potentially more efficient — in how they draw on computing power.

SSMs vs. transformers

Most AI apps today, from ChatGPT to Sora, are powered by models with a transformer architecture. As a transformer processes data, it adds entries to something called a “hidden state” to “remember” what it processed. For instance, if the model is working its way through a book, the hidden state values might be representations of words in the book.

The hidden state is part of the reason transformers are so powerful. But it’s also the cause of their inefficiency. To “say” even a single word about a book a transformer just ingested, the model would have to scan through its entire hidden state — a task as computationally demanding as rereading the whole book.

In contrast, SSMs compress every prior data point into a sort of summary of everything they’ve seen before. As new data streams in, the model’s “state” gets updated, and the SSM discards most previous data.

The result? SSMs can handle large amounts of data while outperforming transformers on certain data generation tasks. With inference costs going the way they are, that’s an attractive proposition indeed.

Ethical concerns

Cartesia operates like a community research lab, developing SSMs in partnership with outside organizations as well as in-house. Sonic, the company’s latest project, is an SSM that can clone a person’s voice or generate a new voice and adjust the tone and cadence in the recording.

Goel claims that Sonic, which is available through an API and web dashboard, is the fastest model in its class. “Sonic is a demonstration of how SSMs excel on long-context data, like audio, while maintaining the highest performance bar when it comes to stability and accuracy,” he said.

While Cartesia has managed to ship products quickly, it’s stumbled into many of the same ethical pitfalls that’ve plagued other AI model-makers.

Cartesia trained at least some of its SSMs on The Pile, an open data set known to contain unlicensed copyrighted books. Many AI companies argue that fair-use doctrine shields them from infringement claims. But that hasn’t stopped authors from suing Meta and Microsoft, plus others, for allegedly training models on The Pile.

And Cartesia has few apparent safeguards for its Sonic-powered voice cloner. A few weeks back, I was able to create a clone of former vice president Kamala Harris’ voice using campaign speeches (listen below). Cartesia’s tool only requires that you check a box indicating that you’ll abide by the startup’s ToS.