Large Language Models (LLMs) are first pretrained to predict text one token at a time, then shaped in two further stages. First, supervised fine-tuning (SFT) uses human-written question-answer pairs to teach the model what helpful responses look like. Second, reinforcement learning from human feedback (RLHF) uses human rankings of outputs, a reward model trained on those rankings, and PPO optimization to push the model's behavior toward human preferences. The result is a model that still predicts the next token but appears to understand human intent.
Deep Dive
How AI Gets Trained to Sound Human #AI #LLM #DeepLearning
Did you know an LLM can feel like it knows anything? It doesn't. It learned to sound like it does. Here is how.
Every LLM starts the same way. Reads trillions of tokens: news, books, code, Reddit threads, everything.
One task. Predict the next word.
No right answer, no wrong answer, just what comes next.
After enough of that, it absorbs syntax, facts, reasoning patterns. The shape of human thought.
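To make "predict the next word" concrete, here is a minimal sketch that assumes a toy bigram counter in place of a real neural network. The function names and the tiny corpus are hypothetical and only illustrate the task; an actual LLM learns a far richer conditional distribution over tokens.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for each word, which words follow it in the text."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical toy corpus; real pretraining uses trillions of tokens.
corpus = "the cat sat on the mat . the cat ate the fish ."
model = train_bigram(corpus)
print(predict_next(model, "the"))  # prints "cat"
```

Nothing in this objective rewards being helpful. The counter only reproduces whatever tends to come next in the training text.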
But helpful was never in the task. Give the base model a question, it might complete it like a Reddit thread.
Or a forum argument from 2009.
It doesn't know it's supposed to answer.
To it, "how do vaccines work" and "how do vaccines cause harm" are equally valid continuations.
Helpful and toxic look identical. Two training stages fix that.
First, supervised fine-tuning, SFT.
Thousands of human-written question and answer pairs. Model learns to imitate. This is what a helpful reply looks like. The weights shift just enough to reshape the output format.
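The imitation step can be sketched as a loss function. This is a minimal illustration, assuming the common practice of scoring only the answer tokens and masking out the prompt; the transcript does not specify that detail, and the names `sft_loss` and `loss_mask` are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sft_loss(logits_per_step, target_ids, loss_mask):
    """Mean negative log-likelihood of the human-written tokens.

    Positions where loss_mask is 0 (assumed here to be the prompt)
    are skipped, so only the answer tokens are imitated.
    """
    total, n = 0.0, 0
    for logits, target, keep in zip(logits_per_step, target_ids, loss_mask):
        if keep:
            total += -math.log(softmax(logits)[target])
            n += 1
    return total / n if n else 0.0


# Toy example: vocabulary of 3 tokens, 3 positions, first one is prompt.
uncertain = [[0.0, 0.0, 0.0]] * 3
confident = [[0.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]
targets = [0, 1, 2]
mask = [0, 1, 1]
print(sft_loss(uncertain, targets, mask) > sft_loss(confident, targets, mask))
```

Lowering this loss moves the model's predictions toward the human-written answers, which is the "imitate" part.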
But it's still imitating, copying a pattern, not understanding a goal. Second fix, and this is the one that changes everything.
RLHF.
Reinforcement learning from human feedback.
Humans rank outputs. This reply beats that one.
A reward model learns those preferences.
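One common way to turn "this reply beats that one" into a training signal is a pairwise (Bradley-Terry style) loss. The sketch below assumes that formulation; the transcript does not say which loss is used, and the function name is hypothetical.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed stably.

    The loss is small when the reward model already scores the
    human-preferred reply higher, and large when it scores it lower.
    """
    diff = reward_chosen - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Reward model agrees with the human ranking: low loss.
print(pairwise_preference_loss(2.0, 0.0))
# Reward model disagrees with the human ranking: high loss.
print(pairwise_preference_loss(0.0, 2.0))
```

Only the difference between the two scores matters here, which matches the data: humans supply rankings, not absolute grades.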
Then PPO, Proximal Policy Optimization, a policy-gradient reinforcement learning algorithm, runs thousands of steps pushing the weights toward higher reward behavior.
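The core of PPO is a clipped objective that limits how far a single update can move the model. The following is a minimal per-sample sketch of that clipped term, assuming the commonly used clip range of 0.2; real RLHF pipelines add more pieces around it, and the names here are hypothetical.

```python
def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO clipped surrogate objective (to be maximized).

    ratio:     new_policy_prob / old_policy_prob for the sampled action
    advantage: how much better the action was than expected, which in
               RLHF is derived from the reward model's score
    clip_eps:  assumed clip range; 0.2 is a common default
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action, policy already moved a lot: the gain is capped at 1.2x.
print(ppo_clipped_objective(1.5, 1.0))
# Bad action, policy already moved away: the objective stays at the clipped value.
print(ppo_clipped_objective(0.5, -1.0))
```

The clipping is why the model is nudged over many small steps instead of being rewritten in one jump.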
The model doesn't get a rule book, it gets shaped. Not a filter on top, baked into the parameters. Remember the hook? It learned to sound like it does.
That's not an insult, that's the mechanism.
Same architecture, same starting weights, just shaped by very specific human feedback. Now, how much can the trained model actually read at once?
That's the context window and why it forgets. Subscribe, that's next.