Everything Can Be Tokenized (by Jensen Huang)

At NVIDIA’s GTC 2025, Jensen Huang said it loud and clear: “Everything can be tokenized.”
And with the sheer computing power of GPUs, he added, “everything can be decoded and figured out — it’s just a matter of electricity.”

He’s right. But most people don’t fully grasp what “everything can be tokenized” really means.

Let’s unpack that.

AI has already shown us this magic across three fundamental senses: reading, seeing, and hearing.

  • Reading: Language models turn words into tokens, abstract units carrying meaning that machines can process.
  • Seeing: Convolutional networks break images into feature maps, effectively tokenizing vision into patterns of edges, shapes, and objects.
  • Hearing: Audio models convert sound into frequency tokens, transforming raw waves into structured representations the AI can interpret.

When Huang says “everything can be tokenized,” he’s pointing to a deeper truth:
Whether it’s text, pixels, sound, motion, or even financial markets — all reality can be represented as data, discretized into meaningful units, and understood by machines.

Reading — Embedding and Word Prediction

Language models don’t read words the way humans do.
They first tokenize text into discrete units (subwords or characters), then embed each token into a high-dimensional vector — a point in semantic space.
From there, neural networks learn relationships among these vectors: which words tend to appear together, which convey similar meanings, and how context shifts meaning.

Word prediction then becomes a matter of geometry — finding the most probable next vector in this space.
That’s how AI “reads” and “writes” — by navigating a continuous landscape of meaning.
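
To make the geometry concrete, here is a minimal Python sketch (NumPy only, a toy five-word vocabulary, and random untrained embeddings — all illustrative assumptions, not a real language model) of how next-token prediction reduces to comparing vectors:

```python
import numpy as np

# Toy vocabulary and a hypothetical embedding table (in a real model these
# vectors are learned; here they are random just to show the mechanics).
vocab = ["the", "cat", "sat", "on", "mat"]
rng = np.random.default_rng(0)
embed = rng.normal(size=(len(vocab), 8))   # one 8-dim vector per token

def tokenize(text):
    """Split text into token ids (real models use subword tokenizers)."""
    return [vocab.index(w) for w in text.split()]

def next_token_probs(context_ids):
    """Score every vocab entry by similarity to a crude context vector."""
    context_vec = embed[context_ids].mean(axis=0)   # summarize the context
    logits = embed @ context_vec                    # geometry: dot products
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                          # softmax -> probabilities

probs = next_token_probs(tokenize("the cat sat on"))
print(dict(zip(vocab, probs.round(3))))
```

With trained embeddings, the highest-probability vector would land near "mat"; here the point is only the mechanics: tokenize, embed, compare, predict.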

Seeing — Convolution and Vision Hierarchies

Images start as matrices of pixel values.
A convolutional neural network (CNN) slides small filters over these matrices, detecting patterns like edges and shapes.
Each layer builds on the last — from lines, to textures, to object parts, to entire scenes.
By sharing the same filters across positions, CNNs achieve translation invariance and efficient pattern recognition.
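
Here is a minimal NumPy sketch of that sliding-filter idea, using a hand-crafted vertical-edge filter and a tiny synthetic image rather than a trained network:

```python
import numpy as np

# A tiny 6x6 "image": a bright square on a dark background.
img = np.zeros((6, 6))
img[1:5, 1:5] = 1.0

# One 3x3 filter that responds to vertical edges (Sobel-like).
kernel = np.array([[1, 0, -1],
                   [2, 0, -2],
                   [1, 0, -1]], dtype=float)

def conv2d(image, k):
    """Slide the same filter over every position (weight sharing)."""
    kh, kw = k.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * k)
    return out

feature_map = conv2d(img, kernel)   # strong responses mark vertical edges
print(feature_map)
```

A real CNN stacks many learned filters like this one, layer after layer, which is where the hierarchy from edges to objects comes from.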

That’s how AI “sees” — through layered pattern detection across space.


Hearing — From Waveforms to Spectrograms

Sound is a time-based signal — a one-dimensional waveform.
AI first converts this into a spectrogram, turning time and frequency into a 2D image of sound energy.
Then, similar to vision, CNNs can detect local time-frequency patterns, while RNNs or Transformers model the long-term sequence — phonemes, rhythm, meaning.
Some newer models go further: wav2vec 2.0 learns discrete speech units directly from raw waveforms, and Whisper maps spectrograms straight into text tokens, making hearing as “tokenized” as language itself.
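
A minimal sketch of that first step, using SciPy's short-time Fourier analysis on a synthetic tone (a stand-in for real recorded audio; the sample rate and window size are arbitrary choices):

```python
import numpy as np
from scipy.signal import spectrogram

# Synthesize one second of "audio": a 440 Hz tone plus noise.
fs = 16_000                               # sample rate in Hz
t = np.arange(fs) / fs
wave = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(fs)

# Short-time Fourier analysis turns the 1-D waveform into a 2-D
# time-frequency map that a CNN or Transformer can consume.
freqs, times, power = spectrogram(wave, fs=fs, nperseg=512)
print(power.shape)   # (frequency bins, time frames): the "image of sound energy"
```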

That’s how AI “hears” — by transforming vibration into structured, learnable patterns.

The true ingenuity of AI isn’t just in the model architecture (transformers, CNNs, etc.) — it’s in how humans figured out how to tokenize the world so machines can understand it.

Every sense begins with chaos. What changed everything was our ability to discretize that chaos: to turn the continuous and messy into symbolic, structured units a neural network can process. Once something is tokenized, the rest is pattern recognition and prediction, which is exactly what neural networks do best. That’s the power of tokenization!

Now I’d like to explore how to tokenize nontraditional senses — smell, touch, pressure — so machines can feel the world the way we do.


🧠 1. The principle still holds: everything starts with measurable signals

Every sense — including smell and touch — is just a mapping from physics to data.
The trick is to find what to measure and how to represent it consistently.

Once a physical signal can be turned into structured, digital data, it can be tokenized — then neural networks can learn on it, just like they do with sound, vision, or text.
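
As a toy illustration of "discretize the continuous", here is a sketch that quantizes some made-up sensor readings into a handful of symbolic bins; the bin count and value range are arbitrary assumptions:

```python
import numpy as np

# Pretend these are raw readings from some physical sensor (arbitrary units).
signal = np.array([0.02, 0.15, 0.31, 0.72, 0.95, 0.40, 0.11])

# Step 1: choose a consistent representation; here, 8 amplitude bins on [0, 1].
bins = np.linspace(0.0, 1.0, 9)

# Step 2: discretize; each continuous reading becomes a symbolic token id.
tokens = np.digitize(signal, bins) - 1
print(tokens)    # [0 1 2 5 7 3 0]: structured units a network can model
```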


👃 2. Smell → chemical space → olfactory tokens

Smell comes from volatile molecules binding to receptor neurons.
Each molecule has a measurable chemical structure — defined by atomic composition, bonds, molecular weight, and 3D geometry.

AI can represent smells by tokenizing molecules into chemical embeddings:

  • Use graph neural networks (GNNs) to represent molecules as nodes (atoms) and edges (bonds).
  • Train on datasets that map structure → odor description (“floral,” “smoky,” “acidic”).
  • Result: a vector embedding of smell — a “token” in olfactory space.

This is already happening: Google Research’s graph-neural-network odor model (2023) can predict human odor perception from molecular graphs.

So for an AI-powered robot, equipping it with a chemical sensor array (an “e-nose”) that outputs molecular fingerprints lets it tokenize smell directly.
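
A rough sketch of the graph idea, with a made-up three-atom molecule and a single untrained message-passing step in NumPy (real systems use learned GNN weights and much richer atom and bond features):

```python
import numpy as np

# Hypothetical toy molecule: atoms as nodes, bonds as edges.
# Node features are one-hot atom types: carbon, carbon, oxygen.
node_feats = np.array([[1, 0],    # carbon
                       [1, 0],    # carbon
                       [0, 1]])   # oxygen
adj = np.array([[0, 1, 0],        # bond C1-C2
                [1, 0, 1],        # bonds C2-C1, C2-O
                [0, 1, 0]], dtype=float)

# One round of message passing: each atom averages its neighbours' features
# and mixes them through a (randomly initialised, untrained) weight matrix.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 4))
deg = adj.sum(axis=1, keepdims=True)
messages = (adj @ node_feats) / deg     # aggregate neighbour features
hidden = np.tanh(messages @ W)          # transform into a hidden space

# Pool the node states into one vector: a crude "olfactory token".
smell_embedding = hidden.mean(axis=0)
print(smell_embedding.shape)            # (4,)
```

Trained on structure-to-odor data, an embedding like this is what lets nearby vectors correspond to similar smells.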


✋ 3. Touch and pressure → spatiotemporal tactile patterns

Touch and pressure are spatially distributed time signals.
Think of a flexible sensor skin on a robot’s hand — each small cell records:

  • Pressure intensity
  • Shear (directional force)
  • Temperature
  • Vibration pattern

That’s a tensor (like an image over time).
So we can process it just like vision or sound:

  • CNNs capture local texture and pressure distribution.
  • RNNs or Transformers capture how it changes over time (sliding, gripping, tapping).

These tactile readings can then be tokenized into “touch embeddings” — compact vectors representing the “feel” of a surface (soft, rough, smooth, sticky).
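
Here is a minimal PyTorch sketch along those lines; the sensor grid size, channel set, and embedding width are all illustrative assumptions, and the weights are untrained:

```python
import torch
import torch.nn as nn

# Hypothetical tactile clip: 20 time steps from a 16x16 sensor skin with
# 3 channels (pressure, shear, vibration). Shapes are illustrative only.
clip = torch.randn(1, 20, 3, 16, 16)    # (batch, time, channels, H, W)

conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)  # local texture/pressure patterns
gru = nn.GRU(input_size=8 * 16 * 16, hidden_size=32, batch_first=True)

b, t, c, h, w = clip.shape
frames = conv(clip.reshape(b * t, c, h, w))       # per-frame spatial features
frames = frames.reshape(b, t, -1)                 # back to a time sequence
_, last_hidden = gru(frames)                      # how the contact evolves over time

touch_embedding = last_hidden.squeeze(0)          # (batch, 32) "feel" vector
print(touch_embedding.shape)
```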

Some groups, like the teams behind MIT’s GelSight sensors and Meta’s DIGIT fingertip sensor, are already building tactile sensing and tokenization systems for robotic hands.
