======
Efficient Estimation of Word Representations in Vector Space
Distributed Representations of Words and Phrases and their Compositionality
GloVe: Global Vectors for Word Representation
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
ELMo: Deep contextualized word representations
Contextual Word Representations: A Contextual Introduction
The Illustrated BERT, ELMo, and co.
Jurafsky and Martin Chapter 11 (Fine-Tuning and Masked Language Models)
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-3: Language Models are Few-Shot Learners
LLaMA: Open and Efficient Foundation Language Models
InstructGPT: Aligning language models to follow instructions
Scaling Instruction-Finetuned Language Models
Self-Instruct: Aligning Language Models with Self-Generated Instructions
Alpaca: A Strong, Replicable Instruction-Following Model
Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Parameter-Efficient Transfer Learning for NLP
LoRA: Low-Rank Adaptation of Large Language Models
QLoRA: Efficient Finetuning of Quantized LLMs
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters
Gradient Checkpointing: Training Deep Nets with Sublinear Memory Cost
What is Gradient Accumulation?
vLLM: Efficient Memory Management for Large Language Model Serving with PagedAttention
Fast Inference from Transformers via Speculative Decoding
======
Stanford SLP book notes on Neural Networks, Backpropagation
HKUST Prof. Kim's PyTorchZeroToAll Tutorial
Deep Learning Practical Methodology
Stanford CS224N notes on Language Models, RNN, GRU and LSTM
Stanford CS224N notes on Self-Attention & Transformers
Stanford CS224N notes on Word Vectors