Course Link: How Transformer LLMs Work
2025-08-14
- chapter: understanding language models: language as a Bag-of-Words
- non-transformer, encoder-only, decoder-only, encoder-decoder
- decoder-only: such as GPT
- tokenization -> tokens -> vocabulary -> vector embeddings
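A minimal sketch of the tokenization -> vocabulary -> vector pipeline in its simplest, bag-of-words form. The example sentences and the whitespace tokenizer are illustrative assumptions, not from the course; real LLMs use learned subword tokenizers and dense embeddings rather than raw counts:

```python
# Minimal bag-of-words sketch: tokenize, build a vocabulary, count tokens per sentence.
# Sentences and the whitespace tokenizer are illustrative assumptions.
sentences = [
    "that is a cute dog",
    "my cat is cute",
]

# Tokenization: split each sentence into tokens (real LLMs use subword tokenizers).
tokenized = [s.lower().split() for s in sentences]

# Vocabulary: every unique token gets an index.
vocab = {tok: i for i, tok in enumerate(sorted({t for toks in tokenized for t in toks}))}

# Bag-of-words vector: count how often each vocabulary token appears in a sentence.
def bag_of_words(tokens):
    vec = [0] * len(vocab)
    for t in tokens:
        vec[vocab[t]] += 1
    return vec

for s, toks in zip(sentences, tokenized):
    print(s, "->", bag_of_words(toks))
```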
- chapter: understanding language models: (word) embeddings
- word2vec: expresses a word's meaning as an array of floats (an embedding vector)
- such as cats [.91, -.11, .19 … ]
- types of embeddings
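A small sketch of the embedding idea: each word maps to an array of floats, and words with related meanings end up with vectors that point in similar directions. The toy 3-dimensional vectors below are made up for illustration; real word2vec embeddings have hundreds of dimensions:

```python
import math

# Toy word vectors (made up for illustration; real word2vec vectors are much longer).
embeddings = {
    "cat": [0.91, -0.11, 0.19],
    "dog": [0.85, -0.20, 0.25],
    "car": [-0.40, 0.70, 0.05],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1 means similar meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))  # high: related meanings
print(cosine_similarity(embeddings["cat"], embeddings["car"]))  # low: unrelated meanings
```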
- chapter: understanding language models: encoding and decoding context with attention
- recurrent neural networks (RNNs)
- key applications: natural language processing (translate, text generation, sentiment analysis)
- speech recognition
- time series prediction (weather, stock price)
- autoregressive
- meaning: the model predicts the current (or future) value based on past values, and the prediction itself can be fed back as input for subsequent predictions (see the sketch after this list)
- attention
- allows a model to focus on parts of the input that are relevant to one another
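A toy sketch of autoregressive prediction: each output is produced from the values seen so far and is then fed back as input for the next step. The `predict_next` function is a hypothetical stand-in for a real model:

```python
# Autoregressive generation sketch: each new token is predicted from the tokens so far,
# then appended to the input for the next step.
def predict_next(tokens):
    # Hypothetical toy "model": returns a placeholder token; a real LLM returns the most
    # probable next token given the whole preceding sequence.
    return f"tok{len(tokens)}"

sequence = ["<start>"]
for _ in range(5):
    sequence.append(predict_next(sequence))  # the prediction is fed back as new input
print(sequence)
```

And a minimal sketch of (scaled dot-product) attention, which scores how relevant every input position is to every other and mixes the value vectors accordingly. The shapes and random inputs are arbitrary choices for illustration:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # relevance of every position to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over positions
    return weights @ V                                         # weighted mix of the value vectors

# Toy example: 4 token positions, 8-dimensional vectors (sizes chosen for illustration).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```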
- chapter: understanding language models: transformers
- Attention Is All You Need (paper)
- transformer: a new architecture based solely on attention, with no RNN
- self-attention
- representation models, like embedding models
- Bidirectional Encoder Representations from Transformers (BERT), used for classification
- pre-training on a large dataset -> fine-tune for downstream tasks: classification, named entity recognition, paraphrase identification (see the sketch after this list)
- generative models, like GPT
- context length: the maximum number of tokens the model can process at once (one of its key parameters)
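A minimal sketch of the pre-train -> fine-tune pattern for a representation model like BERT, using the Hugging Face transformers library. The checkpoint name and the two-label setup are assumptions for illustration, and the actual fine-tuning loop over labeled data is omitted:

```python
# Sketch: load a pre-trained BERT encoder with a fresh classification head.
# Fine-tuning would then train this model on labeled examples for the downstream task
# (e.g. sentiment classification); only a single forward pass is shown here.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"  # assumed checkpoint, chosen for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("my cat is cute", return_tensors="pt")
outputs = model(**inputs)   # BERT encodes the tokens bidirectionally
print(outputs.logits)       # logits from the new head; meaningful only after fine-tuning
```

Generative, decoder-only models like GPT are used differently: instead of adding a task head, they generate tokens autoregressively within their context length.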