Large Language Models (LLMs) start by learning to predict text one word at a time. Two further training stages then shape them to be helpful: supervised fine-tuning (SFT) uses human-written question-answer pairs to teach the model what a helpful response looks like, and reinforcement learning from human feedback (RLHF) uses human rankings of outputs, a reward model, and PPO optimization to push the model's behavior toward human preferences. The result is a model that still predicts the next word but appears to understand human intent.
Deep Dive
How AI Gets Trained to Sound Human #AI #LLM #DeepLearning
Did you know an LLM can feel like it knows anything? It doesn't. It learned to sound like it does. Here is how.
Every LLM starts the same way. It reads trillions of tokens: news, books, code, Reddit threads, everything.
One task. Predict the next word.
No right answer, no wrong answer, just what comes next.
After enough of that, it absorbs syntax, facts, reasoning patterns. The shape of human thought.
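To make "predict the next word" concrete, here is a minimal sketch using word-pair counts instead of a neural network. It is not how an LLM is implemented; the toy corpus and the function names `train_bigram` and `next_word_probs` are made up for illustration. The idea it shows is the same, though: the only training signal is which word actually came next.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Turn the follow-up counts for `prev` into probabilities."""
    total = sum(counts[prev].values())
    return {w: c / total for w, c in counts[prev].items()}

# Toy corpus, invented for this example.
corpus = "the cat sat on the mat the cat ate"
model = train_bigram(corpus)
probs = next_word_probs(model, "the")
print(probs)  # "cat" follows "the" twice, "mat" once
```

A real model replaces the count table with billions of learned weights, but the question it answers is the same: given what came before, what is likely to come next?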
But helpful was never in the task. Give the base model a question, and it might complete it like a Reddit thread.
Or a forum argument from 2009.
It doesn't know it's supposed to answer.
To it, "how do vaccines work" and "how do vaccines cause harm" are equally valid continuations.
Helpful and toxic look identical. Two training stages fix that.
First, supervised fine-tuning, SFT.
Thousands of human-written question and answer pairs. The model learns to imitate. This is what a helpful reply looks like. The weights shift just enough to reshape the output format.
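As a rough sketch of what "learns to imitate" can mean numerically: SFT typically reuses the next-word objective on the human-written pairs. The function `sft_loss` and the probabilities below are hypothetical, and scoring only the answer tokens (not the question) is a common convention assumed here, not something the video states.

```python
import math

def sft_loss(token_probs, is_response):
    """Mean negative log-likelihood over response tokens only.

    token_probs: probability the model gave to each correct next token.
    is_response: True where the token belongs to the human-written answer.
    """
    losses = [-math.log(p) for p, keep in zip(token_probs, is_response) if keep]
    return sum(losses) / len(losses)

# Hypothetical numbers: two question tokens, then three answer tokens.
probs = [0.9, 0.8, 0.5, 0.25, 0.5]
mask = [False, False, True, True, True]
loss = sft_loss(probs, mask)
print(loss)  # lower means the model imitates the written answer more closely
```

Training nudges the weights to lower this number, which is why the result is imitation of a format rather than pursuit of a goal.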
But it's still imitating, copying a pattern, not understanding a goal. Second fix, and this is the one that changes everything.
RLHF.
Reinforcement learning from human feedback.
Humans rank outputs. This reply beats that one.
A reward model learns those preferences.
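One common way to turn "this reply beats that one" into a training signal is a pairwise (Bradley-Terry style) loss. The sketch below assumes that choice; the video does not say which loss is used, and `preference_loss` and the scores are illustrative only.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)): small when the preferred reply scores higher."""
    diff = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-diff))

# Hypothetical reward-model scores for a pair of replies.
agree = preference_loss(2.0, 0.0)     # reward model agrees with the human ranking
disagree = preference_loss(0.0, 2.0)  # reward model disagrees with it
tie = preference_loss(1.0, 1.0)       # reward model cannot tell them apart
print(agree, tie, disagree)
```

Minimizing this loss over many ranked pairs pushes the reward model to score the human-preferred reply higher.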
Then PPO (Proximal Policy Optimization), a policy-gradient reinforcement learning algorithm, runs thousands of steps pushing the weights toward higher-reward behavior.
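The core of PPO is a clipped objective that limits how far one update can move the model. Below is a minimal per-sample sketch of that objective. The name `ppo_clipped_objective` and the default `clip_eps=0.2` are illustrative assumptions; in RLHF the advantage would be derived from the reward model's score, usually with extra terms left out here.

```python
def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio: new-policy probability of the output divided by old-policy probability.
    advantage: how much better than expected the output scored.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Good output the model already over-shifted toward: the gain is capped.
gain_capped = ppo_clipped_objective(ratio=1.5, advantage=1.0)
# Bad output the model shifted toward: the full penalty is kept.
penalty_kept = ppo_clipped_objective(ratio=1.5, advantage=-1.0)
print(gain_capped, penalty_kept)
```

The cap is what keeps each of those thousands of steps small, so the model is reshaped gradually rather than overwritten.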
The model doesn't get a rule book, it gets shaped. Not a filter on top, baked into the parameters. Remember the hook? It learned to sound like it does.
That's not an insult, that's the mechanism.
Same architecture, same starting weights, just shaped by very specific human feedback. Now, how much can the trained model actually read at once?
That's the context window and why it forgets. Subscribe, that's next.