Research Internship Paper 1
The Problem
Robotic manipulation tasks are non-Markovian — the right action depends on history, not just the current frame. Example: cube is hidden under a cup. You can't know where it is from what you see right now. Most VLAs ignore this and predict actions solely from the current observation.
Why not just feed the last N frames directly? The paper tests this ("Multi-frame baseline") — it actually hurts on generic benchmarks:
- −3.3% on RoboCasa, −8.8% on LIBERO
- 3.6× more GPU memory, 35% slower inference
- Model learns spurious correlations between consecutive frames ("causal confusion") — memorizes the order of frames, not the meaning
Note: HAMLET is a wrapper around an existing VLA model, not a new VLA architecture.
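HAMLET's Solution: Three Components
- Moment Tokens: compress each timestep into a small vector summary
- Time-Contrastive Learning (TCL): pre-training step that makes those summaries temporally discriminative
- Memory Module: Transformer that selectively reads the history of those summaries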
Component 1: Moment Tokens
At each timestep $t$, append 4 learnable vectors to the VLM's input sequence. Feed everything through the VLM and read out the outputs at the moment-token positions:
$$m'_t = \mathrm{VLM}(o_t,\, c,\, m)$$
- $o_t$ = current camera images
- $c$ = language instruction ("cover the cube with the nearest cup")
- $m$ = learnable moment token vectors (4 vectors, trained)
- $m'_t$ = output, a compressed summary of the scene at timestep $t$
Because the VLM uses causal attention internally and the moment tokens sit at the end of the sequence, $m'_t$ has attended to both the images and the instruction. It's a cheap 4-vector description of "what matters at this moment." This is what gets stored across time instead of raw frames.
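The causal-attention argument above can be checked with a toy mask. This is only a sketch: the token counts are hypothetical, and `causal_mask` is an illustrative helper, not anything from the paper or a VLM library.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

# Hypothetical toy sizes, not the real sequence lengths of any VLM.
n_image, n_text, n_moment = 6, 3, 4
n_total = n_image + n_text + n_moment
mask = causal_mask(n_total)

context_cols = range(n_image + n_text)           # image + instruction tokens
moment_rows = range(n_image + n_text, n_total)   # appended moment tokens

# Every moment token can see every image and instruction token.
moment_sees_context = all(mask[i][j] for i in moment_rows for j in context_cols)

# No image or instruction token can see a moment token, so appending
# moment tokens leaves the earlier positions' outputs unchanged.
context_sees_moment = any(mask[i][j] for i in context_cols for j in moment_rows)

print(moment_sees_context, context_sees_moment)
```

Appending at the end is what makes the summary cheap: the moment tokens read the whole frame and instruction, while the original VLM positions never see them.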
Component 2: Time-Contrastive Learning (TCL) — making moment tokens useful
Without TCL, moment tokens might collapse and encode useless static background. TCL is a pre-training step that forces them to be temporally discriminative.
Core idea:
- Same frame + augmentation → should look SIMILAR (positive pair)
- Different timestep → should look DIFFERENT (negative pair)
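| Symbol | Meaning |
|---|---|
| $z_t$ | Anchor: projected moment token for the current frame |
| $z_t^{+}$ | Positive: same frame, blurred/jittered/occluded |
| $z_t^{-}$ | Negative: different timestep (>16 steps away) |
The loss (InfoNCE / contrastive loss), written in the standard InfoNCE form:
$$\mathcal{L}_{\mathrm{TCL}} = -\sum_{t} \log \frac{\exp\big(\mathrm{sim}(z_t, z_t^{+})/\tau\big)}{\exp\big(\mathrm{sim}(z_t, z_t^{+})/\tau\big) + \sum_{z_t^{-}} \exp\big(\mathrm{sim}(z_t, z_t^{-})/\tau\big)}$$
- $\mathrm{sim}(\cdot,\cdot)$ = cosine similarity
- $\tau$ = temperature (controls how sharp the separation is)
- Summed over all anchors $t$ in the minibatch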
Result: moment tokens learn to attend to things that CHANGE over time (gripper, objects being manipulated) and ignore static things (walls, table background). Figure 4(a) shows this visually — attention concentrates on gripper and task-relevant objects after TCL.
During TCL training, the VLM backbone is frozen — only the moment token embeddings and projection head g are trained.
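A minimal sketch of the contrastive objective for a single anchor, assuming the standard InfoNCE form with cosine similarity. The vectors and the temperature value `tau=0.1` are made up for illustration and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss for one anchor: -log of the positive's softmax share."""
    pos = math.exp(cosine(anchor, positive) / tau)
    neg = sum(math.exp(cosine(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))

# Hypothetical projected moment tokens (3-dimensional for readability).
anchor = [1.0, 0.0, 0.2]
augmented = [0.9, 0.1, 0.2]                        # same frame, perturbed
far_steps = [[0.0, 1.0, 0.1], [-0.5, 0.8, 0.0]]    # distant timesteps

# Correct pairing: the augmented view is the positive.
good = info_nce(anchor, augmented, far_steps)
# Wrong pairing: a distant timestep is treated as the positive.
bad = info_nce(anchor, far_steps[0], [augmented, far_steps[1]])

print(good < bad)
```

The loss is small when the augmented view of the same frame is the closest vector to the anchor, and large when a distant timestep is closer, which is the pressure that pushes tokens toward things that change over time.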
Component 3: Memory Module (the Transformer)
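Stack the moment tokens of the last $T$ timesteps into one history matrix:
$$H_t = \big[\, m'_{t-(T-1)k};\ \dots;\ m'_{t-k};\ m'_t \,\big] \in \mathbb{R}^{(T \cdot n_m) \times d}$$
- $k$ = action chunk length (how many actions are predicted at once); with action chunking, consecutive history entries are $k$ steps apart
- $T \cdot n_m$ = total token rows ($T$ timesteps, $n_m$ moment tokens each)
- $d$ = embedding dimension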
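Run standard Transformer self-attention (Equation 6 in paper):
$$\tilde{H}_t = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + M\right)V, \qquad Q = H_t W_Q,\; K = H_t W_K,\; V = H_t W_V$$
Breaking this down:
- $W_Q, W_K, W_V$ = learned weight matrices that project the history matrix $H_t$ into three different "views": queries, keys, and values (from the "Attention Is All You Need" paper)
- $QK^{\top}$ = every token's query dot-producted with every token's key → relevance scores (an $N \times N$ matrix, where $N$ is the number of token rows in $H_t$)
- $1/\sqrt{d}$ = scaling that keeps the scores from growing with the embedding dimension $d$; without it the softmax saturates and gradients vanish when $d$ is large
- $M$ = causal mask: adds $-\infty$ where token $i$ would look at future position $j > i$, making those weights $0$ after softmax. Token at step $t$ cannot cheat by looking at step $t+1$.
- $\mathrm{softmax}$ = converts raw scores to probability weights (each row sums to 1)
- Multiplication by $V$ = weighted sum of value vectors: each token's output is a blend of past tokens, weighted by relevance
Output $\tilde{H}_t$: same shape as input, but now every token is informed by relevant history.
Take only the last $n_m$ rows of $\tilde{H}_t$ → this is $\tilde{m}'_t$, the history-augmented moment token.
It started as "what's happening at timestep $t$" and after the Transformer it's "what's happening at $t$, informed by everything relevant from the past 4 timesteps."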
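The attention step can be sketched in plain Python as single-head causal self-attention over a stacked history. Everything here is illustrative: the sizes are toy values, random matrices stand in for learned weights, and `causal_self_attention` is a hypothetical helper, not the paper's code.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(h, w_q, w_k, w_v):
    """Return (output, weights) for softmax(QK^T / sqrt(d) + mask) V."""
    d = len(w_q[0])
    q, k, v = matmul(h, w_q), matmul(h, w_k), matmul(h, w_v)
    scores = matmul(q, [list(col) for col in zip(*k)])   # Q K^T
    weights = []
    for i, row in enumerate(scores):
        # Causal mask: positions j > i get -inf, so their weight becomes 0.
        masked = [s / math.sqrt(d) if j <= i else float("-inf")
                  for j, s in enumerate(row)]
        weights.append(softmax(masked))
    return matmul(weights, v), weights

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
T, n_m, d = 4, 2, 3                  # toy sizes, chosen for readability
history = rand_matrix(T * n_m, d)    # stacked moment tokens, oldest first
out, weights = causal_self_attention(
    history, rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d))

augmented = out[-n_m:]               # last n_m rows: history-augmented tokens
print(len(out), len(out[0]), len(augmented))
```

Only the last `n_m` rows are kept because, under the causal mask, they are the only rows that have seen the entire window.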
Why Transformer beats simple concatenation
| Approach | What happens |
|---|---|
| Simple concatenation | Treats all timesteps equally → smears everything together, noise dilutes signal |
| Transformer | Weight by relevance → M2 suppressed to ~7%, M1 at ~91% when M1 is what matters |
The Transformer is a learned, content-based lookup into episode history. Not recency-based — relevance-based.
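A toy comparison of equal weighting against relevance weighting. The relevance scores are invented for illustration; they are not the attention weights reported in the paper.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical memory: one past summary carries the signal (1.0), three do not.
signal = [1.0, 0.0, 0.0, 0.0]
# Hypothetical relevance scores a trained attention layer might assign.
scores = [3.0, 0.0, 0.0, 0.0]

uniform = sum(signal) / len(signal)                      # treat all equally
weights = softmax(scores)
attended = sum(w * s for w, s in zip(weights, signal))   # weight by relevance

print(uniform, round(attended, 2))
```

With equal weights the useful entry is diluted to a quarter of its strength, while the relevance-weighted read keeps most of it.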
Integration into Action Prediction
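The action expert consumes the current-frame features together with the memory output:
$$\hat{a}_t = \mathcal{A}(h_t,\, \tilde{m}'_t,\, s_t)$$
- $\hat{a}_t$ = predicted action chunk
- $h_t$ = VLM's representation of the current frame
- $\tilde{m}'_t$ = history-augmented memory from the Memory Module
- $s_t$ = proprioceptive state (joint angles, gripper state)
- $\mathcal{A}$ = action expert (diffusion/flow-matching head)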
The whole pipeline is trained end-to-end with standard action prediction loss. VLM backbone is frozen during fine-tuning for GR00T N1.5; both VLM and moment tokens stay trainable for CogACT.
Results
Real-world tasks (Table 1) — the main result
| Method | Has History? | Avg Success |
|---|---|---|
| GR00T N1.5 baseline | No | 29.2% |
| + Multi-frame (naive) | Yes | 45.8% |
| + HAMLET | Yes | 76.4% |
+47.2 percentage points over the baseline (29.2% → 76.4%). Multi-frame helps but HAMLET dominates.
Real tasks tested: Pick-and-Place Twice, Cover-and-Stack, Swap Cubes — all require remembering what happened earlier in the episode.
Simulation (Table 2) — Multi-frame hurts, HAMLET helps
On RoboCasa and LIBERO (tasks that don't specifically require history):
- Multi-frame: −3.3% (RoboCasa), −8.8% (LIBERO) ← causal confusion
- HAMLET: +2.8% (RoboCasa), +2.0% (LIBERO) ← still helps without hurting
Key insight: HAMLET keeps feeding the VLM a single frame and handles history externally in the memory module, so it preserves the single-frame VLA's generalizability.
Efficiency (Table 4)
| Method | History Length | Latency | Peak Memory |
|---|---|---|---|
| GR00T N1.5 | 1 | 80.5ms (1.00×) | 289MB (1.00×) |
| + Multi-frame | 4 | 108.5ms (1.35×) | 1051MB (3.64×) |
| + HAMLET | 4 | 82.4ms (1.02×) | 566MB (1.96×) |
| + Multi-frame | 8 | 193.0ms (2.40×) | 2023MB (7.00×) |
| + HAMLET | 8 | 85.8ms (1.07×) | 578MB (2.00×) |
HAMLET adds almost no latency (1.02× at history length 4) compared to multi-frame's 1.35×. Peak memory still roughly doubles (1.96×), but that is well below multi-frame's 3.64×.
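As a quick sanity check, the multipliers in the table can be recomputed from the raw latency and memory numbers. The figures below are copied from the table above; only the dictionary layout is my own.

```python
# Baseline row: GR00T N1.5 with history length 1.
base_latency_ms, base_memory_mb = 80.5, 289

# (method, history length) -> (latency in ms, peak memory in MB)
rows = {
    ("Multi-frame", 4): (108.5, 1051),
    ("HAMLET", 4): (82.4, 566),
    ("Multi-frame", 8): (193.0, 2023),
    ("HAMLET", 8): (85.8, 578),
}

ratios = {
    key: (round(lat / base_latency_ms, 2), round(mem / base_memory_mb, 2))
    for key, (lat, mem) in rows.items()
}

for (method, length), (lat_x, mem_x) in ratios.items():
    print(f"{method:11s} T={length}: latency {lat_x:.2f}x, memory {mem_x:.2f}x")
```

Doubling the history length from 4 to 8 nearly doubles multi-frame's cost, while HAMLET's latency and memory barely move.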
Ablations — which parts actually matter (Table 5)
Component analysis (Table 5a)
| Moment Token | TCL | Memory Module | Score |
|---|---|---|---|
| ✗ | ✗ | ✗ | 62.6 |
| ✓ | ✗ | ✗ | 63.1 |
| ✓ | ✓ | ✗ | 63.4 |
| ✓ | ✗ | ✓ | 64.8 |
| ✓ | ✓ | ✓ | 65.4 |
Memory module is the most critical component: removing it from the full model drops the score from 65.4 to 63.4 (−2.0), the biggest drop. TCL consistently helps (+0.3 without memory, +0.6 with it) but isn't the main driver.
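The per-component contributions can be read off the table with a few subtractions. The scores are copied from Table 5a as listed above; the tuple encoding is just a convenience.

```python
# (moment token, TCL, memory module) -> score, copied from the table above.
scores = {
    (False, False, False): 62.6,
    (True, False, False): 63.1,
    (True, True, False): 63.4,
    (True, False, True): 64.8,
    (True, True, True): 65.4,
}

full = scores[(True, True, True)]
drop_without_memory = round(full - scores[(True, True, False)], 1)
drop_without_tcl = round(full - scores[(True, False, True)], 1)
total_gain = round(full - scores[(False, False, False)], 1)

print(drop_without_memory, drop_without_tcl, total_gain)
```

Removing the memory module costs more than three times as much as removing TCL, which is the basis for calling it the most critical component.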
Memory architecture (Table 5c)
| Method | Score |
|---|---|
| No Memory | 62.6 |
| Moment Concat (just paste tokens together) | 62.7 |
| RNN | 64.5 |
| LSTM | 65.0 |
| GRU | 64.3 |
| Transformer | 65.4 |
Moment Concat ≈ No Memory (62.7 vs 62.6): just concatenating tokens blindly doesn't work. Any learned sequence model helps (RNN, GRU, and LSTM reach 64.3 to 65.0), and the Transformer's selective attention does best at 65.4.
Moment token length (Table 5b)
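| Moment token length | Score |
|---|---|
| 1 | 64.3 |
| 4 | 65.4 |
| 8 | 66.4 |
| 16 | 65.9 |
| 32 | 62.7 |
| 64 | 62.5 |
Best score is at 8 (66.4), and 4 to 16 all stay within about a point of it. At 32 and above the score falls back to the no-memory level (62.6): too many tokens → redundancy hurts performance.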
Why it generalizes beyond its window size
HAMLET uses T=4 history length by default, but on "Pick-and-Place Three Times" the robot needs to remember 14+ steps. HAMLET still improves dramatically (37.5% vs 8.3% baseline).
Reason: Transformer KV-cache propagation. Each token's key-value pair influences all later tokens during attention. Even after old tokens drop from the explicit window, their information has already been baked into the newer tokens' representations through the attention mechanism.
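A toy illustration of how information can outlive the explicit window. It assumes, as the note above describes, that the stored entries have already been mixed with earlier history. The blend rule and the `MIX` weight are invented stand-ins for attention, not HAMLET's actual mechanism.

```python
from collections import deque

T = 4        # explicit window length, as in the notes
MIX = 0.5    # hypothetical blend weight, chosen only for illustration
STEPS = 15

window = deque(maxlen=T)   # the oldest entry is evicted automatically
for step in range(STEPS):
    if window:
        # Toy stand-in for attention: average the stored contribution maps.
        history = {}
        for entry in window:
            for src, w in entry.items():
                history[src] = history.get(src, 0.0) + w / len(window)
        # New entry = part history, part fresh observation from this step.
        blended = {src: MIX * w for src, w in history.items()}
        blended[step] = blended.get(step, 0.0) + (1 - MIX)
    else:
        blended = {step: 1.0}   # first step has no history to mix in
    window.append(blended)      # each entry maps source step -> weight

latest = window[-1]
print(len(window), 0 in latest, latest[0] > 0)
```

Step 0 left the explicit window long ago, yet it still has a small nonzero weight in the newest entry. In this toy the trace decays geometrically, so it shows that carry-over is possible, not how strong it is in the real model.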
Summary Table
| Component | What it does | Why it's needed |
|---|---|---|
| Moment Tokens | 4 learnable vectors per timestep, compressed scene summary via VLM | Cheap to store, captures task-relevant info |
| TCL | Pre-trains tokens to distinguish different timesteps | Stops tokens encoding static/useless background |
| Memory Module (Transformer) | Self-attention over last 4 timesteps of moment tokens | Selectively retrieves relevant past, ignores irrelevant |
| Causal Mask | Prevents attending to future tokens | Ensures proper temporal ordering |
| History-augmented feature m̃'_t | Last n_m rows of Transformer output | Combined current + history signal fed to action expert |
Key Takeaways
- History matters but raw frames are expensive and cause causal confusion — you need compression
- Moment tokens = cheap compressed summaries of each timestep (4 vectors instead of full images)
- TCL forces tokens to be temporally discriminative — ignore static, capture change
- The Transformer memory selectively attends — not all past timesteps matter equally
- Just concatenating past tokens doesn't help — the selective attention of the Transformer is the key mechanism
- Near-zero latency overhead (1.02× at history length 4) vs multi-frame's 1.35×; peak memory is about 2× the single-frame baseline