It has been three years since ChatGPT first reshaped the AI landscape, yet surprisingly few organizations have managed to develop their own successful large language models (LLMs) trained on proprietary data.
When I first thought about why this was happening, I suspected the problem lay in tokenization — the seemingly simple yet intricate process of breaking down vast amounts of text, code, or other data into smaller units (“tokens”) that a model can understand.
At first glance, tokenization sounds straightforward: just split text into words or subwords. But in reality, it’s a subtle art. Consider the complexity of tokenizing code — dealing with spaces, commas, unique symbols, and formatting — or expanding this idea to tokenizing images, videos, audio, and even biological data like proteins.
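To make that concrete, here is a minimal sketch of what a subword tokenizer actually produces. It uses the open-source tiktoken library and its cl100k_base encoding purely as an example; other tokenizers will split the same text differently, so treat the output as an illustration rather than a reference.

```python
# Minimal illustration of subword tokenization using the open-source
# tiktoken library (pip install tiktoken). The cl100k_base encoding is
# just an example; other tokenizers split text differently.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = [
    "Tokenization sounds simple.",
    "def add(a, b):\n    return a + b",        # code: whitespace and symbols matter
    "Patient presented with hyperlipidemia.",  # domain jargon often fragments
]

for text in samples:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{len(token_ids):>3} tokens -> {pieces}")
```

Depending on the tokenizer, indentation, punctuation, and rare domain terms can each fragment into several tokens, and those seemingly small design choices ripple into everything the model later learns.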
I realized how complex tokenization really is after watching Andrej Karpathy’s tutorials, where he demonstrated tokenizing Wikipedia text. What looked simple turned out to be a deeply thought-out design process full of caveats and tradeoffs. (I highly recommend his materials for anyone curious about how LLMs learn language from the ground up.)
Yet, as intricate as tokenization is, it’s not the main reason most enterprises have failed to develop powerful in-house LLMs.
The Unrealized Promise of Domain-Specific AI
Industries such as healthcare, law, and finance would benefit enormously from domain-specialized AI — models trained on their own rich datasets, capable of understanding their jargon, workflows, and decision logic.
Despite this potential, progress has been painfully slow. Meanwhile, the tech industry — the birthplace of AI — has successfully integrated LLMs into software development itself. Tools like Cursor, Windsurf, and Anthropic’s coding assistants have become indispensable, with millions of developers willing to pay for them.
So why haven’t hospitals, law firms, or financial institutions seen similar breakthroughs?
Four Major Barriers Beyond Tokenization
Any organization building an enterprise-grade, proprietary LLM faces challenges that run far deeper than tokenization. Here are the four biggest:
- Massive Compute Costs
Training state-of-the-art models requires immense computing power and access to high-end GPUs. The hardware expense alone can be prohibitive for most companies.
- Data Quality and Curation
While many firms sit on mountains of data, much of it is messy, outdated, or full of errors. Cleaning, labeling, and organizing this information into a usable, high-quality dataset is a monumental effort.
- Cross-Disciplinary Collaboration
Successful AI development demands close collaboration among researchers, engineers, and domain experts, a blend of technical depth and contextual understanding that few organizations have achieved. As Karpathy once put it, training AI models is part engineering, part art, and sometimes part luck.
- Long Development Cycles and High Failure Risk
Training and fine-tuning an LLM from scratch can take years, with no guarantee of success. Most firms simply can't sustain that level of uncertainty and cost.
Why RAG (Retrieval-Augmented Generation) Became the Practical Alternative
Given these barriers, most companies — including my own — have turned to RAG (Retrieval-Augmented Generation) systems.
RAG is practical, cost-effective, and relatively easy to implement. It allows organizations to connect existing language models with their own data securely, without the need for full retraining. Information can be retrieved from a managed database in real time, with proper source attribution and audit trails.
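For readers who have not built one, the core loop of a RAG system is short. The sketch below is a simplified outline, not production code: embed, vector_store, and llm are placeholders for whatever embedding model, vector database, and LLM API an organization actually uses.

```python
# A simplified sketch of a RAG query loop. The embed(), vector_store.search(),
# and llm.complete() calls are hypothetical placeholders for whichever
# embedding model, vector database, and LLM API you actually deploy.
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str   # kept for source attribution and audit trails
    text: str

def answer_with_rag(question: str, embed, vector_store, llm, top_k: int = 5) -> str:
    # 1. Embed the user's question into the same vector space as the documents.
    query_vector = embed(question)

    # 2. Retrieve the top-k most similar chunks from the managed index.
    hits: list[Document] = vector_store.search(query_vector, top_k=top_k)

    # 3. Build a prompt that grounds the model in the retrieved text,
    #    keeping doc_ids so the answer can cite its sources.
    context = "\n\n".join(f"[{d.doc_id}] {d.text}" for d in hits)
    prompt = (
        "Answer the question using only the sources below. "
        "Cite the source ids you rely on.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )

    # 4. Generate the answer with an existing, unmodified language model.
    return llm.complete(prompt)
```

Nothing in this loop requires retraining a model, which is exactly its appeal; its weak points are the ones discussed next.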
However, RAG is not a perfect solution. It struggles with multi-hop reasoning, where an answer requires connecting insights scattered across multiple documents. Retrieval quality depends heavily on how well the documents are indexed, and vague queries can easily come back empty. Latency also adds up in complex search pipelines. And in day-to-day use, a RAG system rarely feels as capable or engaging as the best general-purpose AI tools.
The Emerging Middle Ground: Knowledge Graphs and Structured Queries
To overcome RAG’s limitations, many enterprises are now exploring hybrid approaches, such as integrating structured data queries (using GraphQL or SQL agents) and knowledge graphs to represent relationships between entities.
These tools don’t replace LLMs but instead enhance them — enabling models to reason more effectively over structured and unstructured information alike.
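As a rough illustration of the hybrid pattern, the sketch below routes a question either to a structured query or to plain text retrieval. The llm.complete, sql_engine.run, and rag helpers, the routing prompt, and the claims table schema are all hypothetical; a knowledge-graph query (for example, Cypher against a graph database) could slot into the same place as the SQL branch. Real systems usually sit behind an agent framework, but the division of labor is the same.

```python
# A rough sketch of the hybrid pattern: let the LLM decide whether a question
# is better answered from structured data (here, a SQL table) or from plain
# document retrieval, then ground the final answer in whichever result returns.
# llm.complete(), sql_engine.run(), and rag() are hypothetical placeholders,
# not a specific product's API.

def answer_hybrid(question: str, llm, sql_engine, rag) -> str:
    # 1. Ask the model to classify the question.
    route = llm.complete(
        "Reply with exactly 'SQL' if this question is about structured records "
        "(counts, dates, amounts), otherwise reply 'TEXT'.\n"
        f"Question: {question}"
    ).strip().upper()

    if route == "SQL":
        # 2a. Have the model draft a query against a known schema, then execute it.
        query = llm.complete(
            "Write a single SQL query for the table claims(id, patient_id, "
            f"amount, status, filed_on) that answers: {question}"
        )
        rows = sql_engine.run(query)  # validate and sandbox this in practice
        return llm.complete(
            f"Question: {question}\nQuery result: {rows}\n"
            "Answer the question using only this result."
        )

    # 2b. Fall back to ordinary retrieval over unstructured documents.
    return rag(question)
```

The design choice worth noting is that the model never answers from memory alone: every branch ends with the answer grounded in either a query result or retrieved text, which is what makes the approach auditable.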