dopetalk does not endorse any advertised product nor does it accept any liability for its use or misuse


Our Discord Notification Server invitation link is https://discord.gg/jB2qmRrxyD

Author Topic: Attention Mechanisms and Self-Attention  (Read 6 times)

Online Chip (OP)

Attention Mechanisms and Self-Attention
« on: Today at 09:38:34 PM »



Attention is the core mechanism that makes transformers work. 
It replaces older sequential processing with a system where every token can directly interact with every other token.

---

1. The Core Idea of Attention

Instead of reading text one token at a time, the way a person scans a sentence, the model does the following:

  • Look at all tokens at once
  • Decide which tokens matter for each other token
  • Weigh those relationships dynamically

So each token asks:

Code: [Select]
"What other tokens in this sequence are relevant to me?"

---

2. Query / Key / Value Vectors

Each token is transformed into three vectors:

  • Query (Q) → what this token is looking for
  • Key (K) → what this token offers
  • Value (V) → the information content of this token
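
As a rough sketch of how one token becomes three vectors: the token's embedding is multiplied by three separate weight matrices. The embedding values, the matrices W_q, W_k, W_v and the helper project() below are all made up for illustration; in a real model the matrices are learned during training and are much larger.

```python
# Hypothetical 4-dimensional embedding for one token (values are made up).
embedding = [1.0, 0.0, 2.0, -1.0]

# Hypothetical 4 x 2 projection matrices (learned in a real model).
W_q = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.0], [0.0, 0.5]]
W_k = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.5], [0.5, 0.0]]
W_v = [[1.0, 1.0], [0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]


def project(vector, matrix):
    """Multiply a row vector by a matrix (vector @ matrix)."""
    columns = len(matrix[0])
    return [
        sum(vector[i] * matrix[i][j] for i in range(len(vector)))
        for j in range(columns)
    ]


query = project(embedding, W_q)  # what this token is looking for
key = project(embedding, W_k)    # what this token offers
value = project(embedding, W_v)  # the information this token carries

print(query, key, value)
```

The same embedding produces three different vectors only because the three matrices differ.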
---

How interaction works

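For each pair of tokens, one token's Query is compared with the other token's Key:

Code: [Select]
Attention score = (Query · Key) / sqrt(d_k)

Here d_k is the dimension of the key vectors; the scaling keeps scores from growing with vector size. After the scores are normalized into weights (see section 3):

Code: [Select]
Output representation = weighted sum of Values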

So:

  • Query asks a question
  • Key determines relevance
  • Value supplies information
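
The three roles above can be put together in a few lines. This is a minimal sketch of single-head attention in plain Python, assuming small hand-written vectors; the function names softmax() and attention() are my own. The scores are divided by the square root of the key dimension, which is the usual scaled dot-product form.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Query asks a question: compare it with every Key.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        # Key determines relevance: scores become weights.
        weights = softmax(scores)
        # Value supplies information: weighted sum of all Values.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Hypothetical 2-dimensional vectors for three tokens.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

for row in attention(Q, K, V):
    print(row)
```

Each output row is a blend of all three Value vectors, weighted by how well that token's Query matched each Key.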

---

3. Attention Weighting

The model calculates how important each token is to every other token.

Example sentence:

Code: [Select]
"The cat sat on the mat because it was tired"

For the token "it":

  • High weight → "cat"
  • Low weight → "mat"
  • Low weight → "sat"

These weights are normalized (usually via softmax):

Code: [Select]
For each token, its attention weights over the sequence sum to 1.0

So each token distributes its focus across others.
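
To make the normalization concrete, here is a small sketch of softmax applied to the "it" example. The raw scores are invented for illustration (a real model would compute them from Query and Key vectors), and only three of the sentence's tokens are included to keep it short.

```python
import math

# Hypothetical raw attention scores for the token "it" (made-up numbers).
tokens = ["cat", "sat", "mat"]
raw_scores = [4.0, 1.0, 1.0]

# Softmax: exponentiate, then divide by the total.
peak = max(raw_scores)  # subtracting the max keeps exp() numerically safe
exps = [math.exp(s - peak) for s in raw_scores]
total = sum(exps)
weights = [e / total for e in exps]

for token, weight in zip(tokens, weights):
    print(f"{token}: {weight:.3f}")

print("sum of weights:", sum(weights))
```

The highest raw score ("cat") ends up with most of the weight, while the others keep a small but non-zero share.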

---

4. Token Relationships

Attention allows explicit modelling of relationships:

  • Pronoun resolution ("it" → "cat")
  • Subject-object links
  • Long-range dependencies
  • Syntax structure

Unlike in recurrent models, information does not have to pass through every intermediate token: each pair of tokens is connected directly.

---

5. Context Windows

Attention operates within a fixed window of tokens.

This is called the context window.

Example:

Code: [Select]
You can only attend to the last N tokens

Implications:

  • Model "memory" is limited to context size
  • Older tokens outside the window are inaccessible
  • Long documents may be truncated or chunked

Modern models use very large windows (thousands to millions of tokens in some systems).
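
The two usual workarounds mentioned above, truncating and chunking, can be sketched like this. The function names and the tiny window size are assumptions for illustration; real systems work on tokenizer output and often overlap chunks.

```python
def truncate_to_window(tokens, window_size):
    """Keep only the most recent window_size tokens."""
    if window_size <= 0:
        return []
    return tokens[-window_size:]


def chunk(tokens, window_size):
    """Split a long token list into consecutive pieces that each fit the window."""
    return [
        tokens[i:i + window_size]
        for i in range(0, len(tokens), window_size)
    ]


# Hypothetical "document" of six tokens and a window of four.
document = ["the", "cat", "sat", "on", "the", "mat"]

print(truncate_to_window(document, 4))
print(chunk(document, 4))
```

With truncation the first two tokens are simply gone; with chunking they survive, but tokens in different chunks cannot attend to each other.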

---

6. Parallel Processing vs Recurrence

RNNs / LSTMs (old approach):

Code: [Select]
Token 1 → Token 2 → Token 3 → Token 4
(sequential processing)

Problems:
  • Slow to train (steps cannot be parallelized across the sequence)
  • Hard to retain long-range dependencies
  • Vanishing gradient issues

---

Transformers (attention-based):

Code: [Select]
All tokens in the sequence are processed simultaneously
(during training; text generation still emits one token at a time)

Benefits:
  • Massive parallelism (GPU efficient)
  • No sequential bottleneck
  • Direct long-range connections
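
A toy sketch of why one approach parallelizes and the other does not. Both functions are simplified stand-ins of my own, not real RNN or transformer code: the point is only the dependency structure. The recurrent state needs the previous state, while each pairwise score depends on just two inputs.

```python
# Hypothetical one-number "embeddings" for four tokens.
inputs = [1.0, 2.0, 3.0, 4.0]


def recurrent_states(xs, decay=0.5):
    """Toy recurrence: each state needs the previous one, so order is forced."""
    states = []
    h = 0.0
    for x in xs:
        h = decay * h + x  # cannot start until the previous h is known
        states.append(h)
    return states


def pairwise_scores(xs):
    """Toy attention scores: entry [i][j] uses only xs[i] and xs[j].

    No entry depends on another, so all of them could be computed at once.
    """
    return [[xi * xj for xj in xs] for xi in xs]


print(recurrent_states(inputs))
for row in pairwise_scores(inputs):
    print(row)
```

On a GPU the whole score table can be filled in as one matrix multiplication, whereas the recurrence has to run its steps one after another.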

---

7. Why Transformers Replaced RNNs/LSTMs

Transformers won because attention solves key limitations:

1. Long-range dependency problem
RNNs struggle to connect distant tokens. 
Attention connects everything directly.

2. Parallel computation
Transformers process entire sequences at once.

3. Better scaling
Performance improves predictably with more data and parameters.

4. Stable training
Fewer vanishing gradient issues than recurrent models.

5. Expressive power
Every token can interact with every other token within a single layer.

---

8. Intuition: What Attention Is Doing

You can think of attention as:

Code: [Select]
Each word in a sentence voting on which other words matter most

Or more formally:

Code: [Select]
Context-sensitive weighted information routing system

Each layer refines these relationships repeatedly.

---

9. Key Insight

Attention replaces sequence with relationships.

So instead of:

Code: [Select]
A → B → C → D (chain processing)

you get:

Code: [Select]
A ↔ B, A ↔ C, A ↔ D, B ↔ C, B ↔ D, C ↔ D (fully connected dynamic graph)

This shift is what makes modern language models capable of coherent reasoning across long contexts.

