Pooja Palod:

That’s an excellent technical question about the Transformer architecture!

The design choices behind Query (Q), Key (K), and Value (V) and the scaling factor √d_k are core to how attention actually works.

Why 3 copies (Q, K, V)? Why not 2 or 4?

The self-attention mechanism mimics a search process:

• Query (Q): “What am I looking for?”

• Key (K): “What information do I have?”

• Value (V): “What content should I return?”

Each plays a distinct role:

1. The Query searches through all Keys to measure relevance (QK^T).

2. The resulting scores weight the Values to compute the output: a weighted average of the relevant information.

If we used only two (say Q and V), the model couldn’t separate “matching” from “retrieval.”

If we used more than three, it'd be redundant: Q, K, and V already fully describe the lookup process.
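Here is a minimal NumPy sketch of that two-step lookup (the shapes, variable names, and random inputs below are just illustrative assumptions, not any particular library's API):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: Q, K, V are (seq_len, d_k) arrays."""
    d_k = Q.shape[-1]
    # 1. Matching: compare every query against every key.
    scores = Q @ K.T / np.sqrt(d_k)              # (seq_len, seq_len)
    # 2. Softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # 3. Retrieval: weighted average of the Values.
    return weights @ V                           # (seq_len, d_k)

# Toy usage: in a Transformer, Q, K, V are learned projections of the same input X.
X = np.random.randn(5, 64)
W_q, W_k, W_v = (np.random.randn(64, 64) for _ in range(3))
out = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)  # (5, 64)

The three separate projection matrices are exactly what lets "matching" (W_q, W_k) be learned independently of "retrieval" (W_v).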

Why divide by √d_k (and not d_k or d_k^2)?

This scaling is for numerical stability.

When Q and K have many dimensions, their dot products can grow large. If the components of a query q and a key k are independent with mean 0 and variance 1, then the dot product q·k = Σ q_i k_i has mean 0 and variance d_k (it is a sum of d_k unit-variance terms).

That means large d_k → large variance → large softmax inputs → saturation (where one score ≈ 1 and the others ≈ 0).

That kills gradients and makes training unstable.
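A quick, purely illustrative way to see the saturation (the logit values below are made up for the example):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))     # ~[0.09, 0.24, 0.67]: still spread out
print(softmax(np.array([10.0, 20.0, 30.0])))  # ~[2e-9, 5e-5, 1.0]: saturated, near-zero gradients

Once the weights are pinned at 0 and 1, small changes to the inputs barely change the output, so almost no gradient flows back.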

Dividing by √d_k normalizes this variance back to 1, keeping the softmax inputs in a healthy range.

Dividing by d_k would shrink the scores too much; dividing by d_k^2 would destroy the signal entirely.
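You can check the variance argument numerically. This small sketch assumes unit-variance Gaussian query/key components, as in the informal argument above:

import numpy as np

d_k = 64
q = np.random.randn(100_000, d_k)    # unit-variance components
k = np.random.randn(100_000, d_k)
dots = (q * k).sum(axis=1)           # 100k independent dot products

print(np.var(dots))                   # ≈ 64   (grows linearly with d_k)
print(np.var(dots / np.sqrt(d_k)))    # ≈ 1    (what the Transformer uses)
print(np.var(dots / d_k))             # ≈ 1/64 (over-shrunk)

Only the √d_k divisor brings the variance back to roughly 1, which is the regime where the softmax stays well-behaved.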