
How AI Gets Trained to Sound Human #AI #LLM #DeepLearning
Added: 2026-05-17

101 views · 42:15 · max-techie · Original Release: 2026-05-11

Large Language Models (LLMs) are first pretrained to predict text one word at a time, then shaped by two further training stages. Supervised fine-tuning (SFT) uses human-written question-answer pairs to teach the model what helpful responses look like. Reinforcement learning from human feedback (RLHF) then uses human rankings of outputs to train a reward model, and PPO pushes the model's weights toward higher-reward responses, so the model comes to sound as though it understands human intent.

[00:00:00]Did you know an LLM can feel like it knows anything? It doesn't. It learned to sound like it does. Here is how.

[00:00:07]Every LLM starts the same way. Reads trillions of tokens, news, books, code, Reddit threads, everything.

[00:00:15]One task. Predict the next word.

[00:00:19]No right answer, no wrong answer, just what comes next.

[00:00:24]After enough of that, it absorbs syntax, facts, reasoning patterns. The shape of human thought.
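
As a rough illustration of "predict the next word", here is a minimal sketch that counts which word follows which in a tiny corpus and predicts the most frequent follower. It is a bigram counter, not a neural network, and the corpus and the function names `train_bigram` and `predict_next` are made up for this example. Real LLMs learn the same kind of objective with billions of parameters rather than a lookup table.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for every word, how often each other word follows it."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical toy corpus; a real model reads trillions of tokens.
corpus = "the cat sat on the mat and the cat slept"
model = train_bigram(corpus)
print(predict_next(model, "the"))  # prints "cat" ("cat" follows "the" twice, "mat" once)
```

Note that nothing in this objective rewards being helpful: the counter simply reproduces whatever tends to come next in its data.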

[00:00:31]But helpful was never in the task. Give the base model a question, it might complete it like a Reddit thread.

[00:00:37]Or a forum argument from 2009.

[00:00:41]It doesn't know it's supposed to answer.

[00:00:43]To it, how do vaccines work and how do vaccines cause harm are equally valid continuations.

[00:00:50]Helpful and toxic look identical. Two training stages fix that.

[00:00:55]First, supervised fine-tuning, SFT.

[00:00:59]Thousands of human-written question and answer pairs. Model learns to imitate. This is what a helpful reply looks like. The weights shift just enough to reshape the output format.
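
One way to picture an SFT training example is sketched below. The video does not describe the data format, so everything here is an assumption: the "Question: ... Answer:" template, whitespace splitting in place of a real tokenizer, and the helper name `build_sft_example`. The loss mask reflects a common practice of scoring the model only on the answer tokens, so it learns to imitate the reply rather than the question.

```python
def build_sft_example(question, answer):
    """Turn one human-written Q&A pair into tokens plus a loss mask.

    Mask value 0 means "do not train on this token" (the prompt),
    1 means "train the model to produce this token" (the answer).
    """
    # Hypothetical prompt template; whitespace split stands in for a tokenizer.
    prompt_tokens = f"Question: {question} Answer:".split()
    answer_tokens = answer.split()
    tokens = prompt_tokens + answer_tokens
    loss_mask = [0] * len(prompt_tokens) + [1] * len(answer_tokens)
    return tokens, loss_mask


tokens, loss_mask = build_sft_example(
    "How do vaccines work?",
    "They train the immune system.",
)
for token, flag in zip(tokens, loss_mask):
    print(flag, token)
```

The training objective is still next-word prediction; only the data has changed to curated, helpful replies.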

[00:01:10]But it's still imitating, copying a pattern, not understanding a goal. Second fix, and this is the one that changes everything.

[00:01:18]RLHF.

[00:01:19]Reinforcement learning from human feedback.

[00:01:23]Humans rank outputs. This reply beats that one.

[00:01:27]A reward model learns those preferences.
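
A minimal sketch of how a reward model can learn from rankings, assuming the pairwise (Bradley-Terry style) loss that is commonly used for this step. The video does not specify the loss, and the two hand-made features, the training pairs, and the function names are hypothetical. A real reward model is a neural network that scores full text, not a two-weight linear function.

```python
import math


def reward(weights, features):
    """Linear reward: a weighted sum of the reply's features."""
    return sum(w * x for w, x in zip(weights, features))


def train_reward_model(pairs, n_features, lr=0.5, epochs=200):
    """Fit weights so each preferred reply scores above the rejected one."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in pairs:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # Modeled probability that the preferred reply wins: sigmoid(margin).
            p = 1.0 / (1.0 + math.exp(-margin))
            # Gradient step on the loss -log(p): widen the margin when p is low.
            for i in range(n_features):
                weights[i] += lr * (1.0 - p) * (preferred[i] - rejected[i])
    return weights


# Hypothetical features per reply: [answers_the_question, is_rude]
pairs = [
    ([1, 0], [0, 0]),  # answering beats not answering
    ([1, 0], [1, 1]),  # a polite answer beats a rude answer
    ([0, 0], [0, 1]),  # a polite non-answer beats a rude non-answer
]
weights = train_reward_model(pairs, n_features=2)
print(weights)  # first weight ends up positive, second negative
```

The humans never assign numeric scores; the scores emerge from which reply won each comparison.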

[00:01:31]Then PPO, a gradient optimizer, runs thousands of steps pushing the weights toward higher reward behavior.
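
PPO (Proximal Policy Optimization) is, more precisely, a policy-gradient reinforcement learning algorithm. The video does not go into its objective, so the sketch below shows only the standard clipped surrogate term for a single sampled reply; the function name and the example numbers are illustrative. `ratio` is how much more likely the updated model makes the reply compared with the old model, and `advantage` is how much better than expected the reply scored.

```python
def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective for one sample (to be maximized).

    Clipping the ratio to [1 - eps, 1 + eps] removes the incentive to
    move the model far from its previous behavior in a single update.
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Good reply already made 1.5x more likely: the gain is capped at 1.2.
print(ppo_clipped_objective(ratio=1.5, advantage=1.0))
# Bad reply made half as likely: the objective is capped at the clipped value.
print(ppo_clipped_objective(ratio=0.5, advantage=-1.0))
# No change in probability: the objective is just the advantage.
print(ppo_clipped_objective(ratio=1.0, advantage=2.0))
```

Each training step nudges the weights to raise this quantity, which is what keeps the many small updates from overshooting.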

[00:01:39]The model doesn't get a rule book, it gets shaped. Not a filter on top, baked into the parameters. Remember the hook? It learned to sound like it does.

[00:01:50]That's not an insult, that's the mechanism.

[00:01:53]Same architecture, same starting weights, just shaped by very specific human feedback. Now, how much can the trained model actually read at once?

[00:02:03]That's the context window and why it forgets. Subscribe, that's next.

#technology #ai #ai training #large language model #llms
