Course Link: How Transformer LLMs Work
2025-08-14
- chapter: understanding language models: language as a Bag-of-Words
- non-transformer, encoder-only, decoder-only, encoder-decoder
- decoder-only: such as GPT
- tokenization -> tokens -> vocabulary -> vector embeddings
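A minimal sketch of the tokenization -> vocabulary -> vector pipeline in its simplest, bag-of-words form. The example sentences and the whitespace tokenizer are illustrative assumptions, not from the course; real LLMs use learned subword tokenizers and dense embeddings rather than raw counts:

```python
# Minimal bag-of-words sketch: tokenize, build a vocabulary, count tokens per sentence.
# Sentences and the whitespace tokenizer are illustrative assumptions.
sentences = [
    "that is a cute dog",
    "my cat is cute",
]

# Tokenization: split each sentence into tokens (real LLMs use subword tokenizers).
tokenized = [s.lower().split() for s in sentences]

# Vocabulary: every unique token gets an index.
vocab = {tok: i for i, tok in enumerate(sorted({t for toks in tokenized for t in toks}))}

# Bag-of-words vector: count how often each vocabulary token appears in a sentence.
def bag_of_words(tokens):
    vec = [0] * len(vocab)
    for t in tokens:
        vec[vocab[t]] += 1
    return vec

for s, toks in zip(sentences, tokenized):
    print(s, "->", bag_of_words(toks))
```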
- chapter: understanding language models: (word) embeddings
- word2vec: expresses a word's meaning as an array of floats (an embedding vector)
- such as cats [.91, -.11, .19 … ]
- types of embeddings
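A small sketch of the embedding idea: each word maps to an array of floats, and words with related meanings end up with vectors that point in similar directions. The toy 3-dimensional vectors below are made up for illustration; real word2vec embeddings have hundreds of dimensions:

```python
import math

# Toy word vectors (made up for illustration; real word2vec vectors are much longer).
embeddings = {
    "cat": [0.91, -0.11, 0.19],
    "dog": [0.85, -0.20, 0.25],
    "car": [-0.40, 0.70, 0.05],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1 means similar meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))  # high: related meanings
print(cosine_similarity(embeddings["cat"], embeddings["car"]))  # low: unrelated meanings
```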
- chapter: understanding language models: encoding and decoding context with attention
- recurrent neural networks (RNNs)
- key applications: natural language processing (translate, text generation, sentiment analysis)
- speech recognition
- time series prediction (weather, stock price)
- autoregressive
- meaning: the model predicts the current (or future) value based on past values, and the prediction itself can be fed back as input for subsequent predictions (see the sketch after this list)
- attention
- allows a model to focus on parts of the input that are relevant to one another
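A toy sketch of autoregressive prediction: each output is produced from the values seen so far and is then fed back as input for the next step. The `predict_next` function is a hypothetical stand-in for a real model:

```python
# Autoregressive generation sketch: each new token is predicted from the tokens so far,
# then appended to the input for the next step.
def predict_next(tokens):
    # Hypothetical toy "model": returns a placeholder token; a real LLM returns the most
    # probable next token given the whole preceding sequence.
    return f"tok{len(tokens)}"

sequence = ["<start>"]
for _ in range(5):
    sequence.append(predict_next(sequence))  # the prediction is fed back as new input
print(sequence)
```

And a minimal sketch of (scaled dot-product) attention, which scores how relevant every input position is to every other and mixes the value vectors accordingly. The shapes and random inputs are arbitrary choices for illustration:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # relevance of every position to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over positions
    return weights @ V                                         # weighted mix of the value vectors

# Toy example: 4 token positions, 8-dimensional vectors (sizes chosen for illustration).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```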
- chapter: understanding language models: transformers
- Attention Is All You Need (paper)
- transformer: a new architecture based solely on attention, with no RNN
- self-attention
- representation models, like embedding models
- Bidirectional Encoder Representations from Transformers (BERT), used for classification
- pre-training on a large dataset -> fine-tune for downstream tasks: classification, named entity recognition, paraphrase identification (see the sketch after this list)
- generative models, like GPT
- context length: the maximum number of tokens the model can process at once (one of its key parameters)
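A minimal sketch of the pre-train -> fine-tune pattern for a representation model like BERT, using the Hugging Face transformers library. The checkpoint name and the two-label setup are assumptions for illustration, and the actual fine-tuning loop over labeled data is omitted:

```python
# Sketch: load a pre-trained BERT encoder with a fresh classification head.
# Fine-tuning would then train this model on labeled examples for the downstream task
# (e.g. sentiment classification); only a single forward pass is shown here.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"  # assumed checkpoint, chosen for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("my cat is cute", return_tensors="pt")
outputs = model(**inputs)   # BERT encodes the tokens bidirectionally
print(outputs.logits)       # logits from the new head; meaningful only after fine-tuning
```

Generative, decoder-only models like GPT are used differently: instead of adding a task head, they generate tokens autoregressively within their context length.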