Install our extension to search inside any video instantly.

How AI Learns to Follow Instructions ? | SFT
Added: 2026-05-12

136 views911:39AttentionVisualizedOriginal Release: 2026-05-12

Supervised Fine-Tuning (SFT) is a training technique that transforms raw language models into helpful assistants by teaching them to follow instructions, using the same algorithm as pre-training but with a loss mask that only grades response tokens, requiring only 1,000-10,000 examples to achieve significant quality improvements, with quality plateauing early at around 85% and demonstrating that knowledge comes from pre-training while behavior comes from SFT.

[00:00:00]Imagine you ask a language model a simple question. What is the capital of France?

[00:00:05]On the left here, the model writes back something strange. What is the capital of Germany? What is the capital of Spain? Match the country to its capital.

[00:00:15]It rambles. It quotes things it has read. It does not actually answer your question.

[00:00:21]On the right, the same model, given the exact same prompt, writes back, "The capital of France is Paris." Clean, direct, helpful, same neural network, one small training step apart. How does that happen?

[00:00:39]Every chatbot you have ever used, had to close this gap. The base model, the one on the left, has read most of the public internet. It knows the answer, but it does not know that questions are supposed to be answered.

[00:00:53]So, what is the smallest change that fixes this? You might guess a brand new algorithm or some clever new architecture.

[00:01:01]But here's the surprising part. It is not either of those.

[00:01:06]Here is one way to picture it.

[00:01:08]Pre-training is the long expensive process where the raw shape gets carved out of marble. The model learns facts, languages, code, style, the underlying form. After months of carving, the statue is there, but the surface is rough. SFT is the polish. A few hours of careful smoothing on top of an already finished sculpture. SFT does not add marble. It does not change the form. It just makes the existing statue presentable.

[00:01:38]Here's another way to see it. A pre-trained model is a brilliant student who has read every book in the library but has never been to school. Ask them a question and they continue your sentence with more questions or quote something they read or trail off. After a brief format school, the same student answers directly in the form they have seen demonstrated mechanically. Here's what is going on.

[00:02:01]Pre-training shows the model raw text and plays a guessing game. Predict the next word, get penalized for being wrong, update the weights. SFT shows the model instruction response pairs and plays the exact same game.

[00:02:16]Same forward pass, same cross entropy loss, same optimizer, same architecture.

[00:02:24]What changes is the data and one tiny thing at the very end, a mask. We will see exactly what that mask does next.

[00:02:33]Here's a single training example written using the llama style chat template. The template wraps the question and answer with special role marker tokens that mark whose turn it is. The user [music] wrote, "Special token start. What is 2 + 2? Special token end." The assistant is supposed to answer 2 + 2 equals 4. 10 tokens total. The first six are the prompt. The last four are the response.

[00:03:00]In pre-training, we would grade the model on every single position. The cross entropy loss formula above sums over all of them minus log probability of the target token at each position.

[00:03:12]But here's the question that makes SFT different. Look at these 10 tokens.

[00:03:17]Which positions should we grade?

[00:03:23]Take a moment. The answer is the four response positions only. We do not want the model to learn to generate user prompts. We want it to learn to generate responses. So we apply a loss mask 0 0 0 0 0 0 0 for the prompt 1 1 for the response. The cross entropy formula transforms in place. The summation index changes from all positions to positions where the mask equals 1. Loss bars rise only on the response tokens.

[00:03:56]In PyTorch, this is implemented with a tiny convention. For every prompt position, you set the label to -100.

[00:04:04]The cross entropy function silently skips those tokens. One bit per token decides what counts. That is the entire change from pre-training.

[00:04:14]Same algorithm plus a mask.

[00:04:18]Now let us watch one full training step run on this example. The token strip is the same. The mask is still beneath and to the right. A probability table opens up one row for each of the four response positions. What does the model think the [music] next token should be at each response position? Right now, the model is mostly guessing. Pre-training never showed it this exact format. So, at position 7, where the target is 2 + 2, the probability the model assigns is 0.05, just 5%. At position 8, target equals the probability is 0.4.

[00:04:56]At position 9, target the digit four, the probability is 0.1. At position 10, the period 0.3.

[00:05:06]Cross entropy turns each probability into a loss. Negative log of 0.05 is about 3. Negative log of 0.4 is 0.92.

[00:05:17]Negative log of 0.1 is 2.3.

[00:05:20]Negative log of 0.3 is 1.2.

[00:05:24]average them 1.85.

[00:05:28]That is the mean loss the model takes on this single example.

[00:05:32]Now we run gradient descent forward pass compute loss back prop update repeat many many times across thousands of examples like this one.

[00:05:44]What do you think happens to those four probabilities?

[00:05:47]Watch. They sharpen in place. 0.05 climbs to 0.85. [music] 0.4 climbs to 0.92.

[00:05:58]0.1 becomes 0.95.

[00:06:01]0.3 becomes 0.9. Each per token loss collapses. The mean loss falls from 1.85 all the way down to 0.1, a 95% reduction.

[00:06:14]Take a moment to appreciate what just happened. All this gradient ever did across all those steps was sharpen the model's response token probabilities.

[00:06:23]It did not teach the model that 2 + 2 is 4. The pre-trained weights already encoded that. It taught the model to commit to producing the answer when it sees the close inst marker. The knowledge was already there. SFT gave it the reflex to actually use it.

[00:06:40]Okay, we have seen the mechanism. Now, how much demonstration data does this take? Imagine an empty number line [music] stretching from 1,000 examples to 100,000. How many examples do you think you need to teach a giant language model to follow instructions?

[00:06:59]Take a guess. Let us place the real data sets on the line. Lima, the famous 2023 paper used 1,000 hand curated examples.

[00:07:09]Instruct GPT. The original chat GPT recipe used about 13,000.

[00:07:14]Llama SFT 27,540.

[00:07:19]Stanford's Alpaca 52,000. and fine tone the modern llama 3 recipe 100,000.

[00:07:26]Now watch this quality bar fill in below. You expect it to climb left to right. More data, more quality, right?

[00:07:34]That is the intuition, but it does not.

[00:07:37]The bar fills to about 85% across nearly the entire range. Quality plateaus early. The production sweet spot today is just 2,000 to 6,000 mixed examples.

[00:07:49]Less than people think. a lot less.

[00:07:54]So, let us zoom out to the production reality.

[00:07:57]SFT does not stand alone. It is the first stage in a three-stage alignment pipeline that every modern aligned LLM passes through. Stage one, pre-training, weeks of training on a GPU cluster, trillions of tokens, hundreds of thousands of dollars in compute. This is where the knowledge comes from.

[00:08:19]Stage two, SFT hours on a single GPU, 10,000 or so demonstrations, a few hundred.

[00:08:27]Stage three, preference alignment, RLHF, DPO, or constitutional AI days of preference labeling that polishes the last 10%.

[00:08:36]Every aligned chatbot you have used follows this recipe. Chatyp Claude, Gemini, Llama 3, Instruct, Quen Chat.

[00:08:45]Each one passed through SFT.

[00:08:49]The engineering version comes in three flavors or three popular ones anyway.

[00:08:54]Full fine-tuning updates every weight fastest but high memory. Laura freezes the base and trains [music] tiny adapter matrices on the side about 40% less memory. Same quality.

[00:09:06]QLA goes further compressing the base to 4bit precision. A 70 billion parameter model fits on a single A.

[00:09:15]The reference recipe llama 3.1 8 billion parameters plus Laura on the finetome data set runs in 4 [music] hours and 45 minutes on a single A100.

[00:09:26]SFT is not a research project anymore.

[00:09:29]It is a Tuesday afternoon and that easy engineering reality leads to two counterintuitive results. In 2023, the Lima paper fine-tuned a 65 billion parameter model on exactly 1,000 hand curated examples, no preference tuning afterward. Then, human raiders compared Lima against state-of-the-art chatbots. How often do you think Lima beat GPT4?

[00:09:59]Take a guess. 43% of the time against GPT4.

[00:10:05]Against Bard, 58%. Against Da Vinci 003, 65% 1,000 examples, no further training from preferences. Quality beats quantity.

[00:10:17]That alone is a striking finding.

[00:10:23]And from the original Instruct GPT paper, the 1.3 billion parameter aligned model beat the 175 billion parameter raw GPT3 on human preference. A 100 dox parameter reduction with a quality improvement alignment beats scale.

[00:10:42]There is a name for this whole pattern.

[00:10:44]The superficial alignment hypothesis from the Lima paper. Almost word for word almost all knowledge in large language models is learned during pre-training. Only limited instruction tuning data is needed to teach models to produce highquality output.

[00:11:01]knowledge from pre-training behavior from SFT.

[00:11:05]The base model on the left of our opening, the one rambling about Germany and Spain, already knew the capital of France was Paris. The fact was sitting in its weights.

[00:11:16]SFT just taught it to say so when asked.

[00:11:19]That more than anything is the lasting insight from SFT.

[00:11:24]A thousand examples. Same algorithm as pre-training. That is supervised fine-tuning.

[00:11:31]Next time we will see what happens after SFT when the model meets human preferences.

Related Videos

Artificial Intelligence

OpenHuman VS Hermes AI: Who Wins?

JulianGoldieSEO

285 views•2026-05-29

Artificial Intelligence

Long-Running Agents — Build an Agent That Never Forgets with Google ADK

suryakunju

142 views•2026-05-30

Artificial Intelligence

This computer is made from real human brain cells. And you can buy it.

Talktmsmedia

3K views•2026-05-28

Artificial Intelligence

BREAKING: Microsoft’s New Image Generating Model Beat Out GPT 1.5 and Nano Banana 2

aimmediahouse

122 views•2026-06-03

Artificial Intelligence

I Made the Same Anime Fight Scene in Every AI Video Generator

NobleGooseAnime

295 views•2026-05-30

Artificial Intelligence

Nvidia Bets Big On AI PCs | New Chip To Power Windows Laptops | Technology | AI Updates | N18S

cnnnews18

3K views•2026-06-01

Artificial Intelligence

I Tested NEW Opus 4.8 on Four Projects (Updated LLM Leaderboard)

AICodingDaily

298 views•2026-05-29

Artificial Intelligence

3D Platformer Update - NO CAPES

SolarLune

294 views•2026-05-30

Trending

Computer Science

The Meta AI Hack Is a DISASTER

LowLevelTV

141K views•2026-06-03

The Casino Had Us Guessing All Day

VegasMatt

157K views•2026-06-03

The Dancing Plague...

HoodieGuyStories

1730K views•2026-05-30

The Fastest Way To Board A Plane 😮

zackdfilms

6504K views•2026-05-29