A precise breakdown of structural data hygiene that transforms chaotic wide-format data into a rigorous analytical framework. It is an essential guide for anyone looking to master the foundational logic of modern data manipulation.
Deep Dive
Prerequisite Knowledge
- No data available.
Where to go next
- No data available.
Deep Dive
Tidying data with pivot_longerAdded:
Hey everybody. Today we're tidying data in R using the pivot longer function.
We're in a situation where the names of columns in our spreadsheet are actually values of a variable that we'd like to use. So, before we can do any real work with our data, we need to reshape it.
The preferred format that ggplot2 for data viz and dplyr for data wrangling want is called tidy data. And there's three features of a tidy data set that you probably don't even think about when you have them, but are really worth having in the front of your mind when you're running into data sets that are in more exotic formats. In a tidy tidy data set, every data set holds exactly one type of observation.
So, there aren't like sub type sub tables or weird formatting things that you might find in you know, like spreadsheets Excel spreadsheets in the working each column is one and only one variable, and each row is one and only one observation. We call this tidy data.
If you have a data set that doesn't have these properties, and you're working in R, the first step is going to be to reshape the data so that it does.
The most common form of untidy data is a wide format. And one example of that is the relig income data set that comes built in with tidyverse.
I've already loaded up tidyverse with library tidyverse. Let's just take a look at the relig income data set.
So, we have one column for religion, and it has values like agnostic for instance is the very first one. And then we have a column for various what for each of the different various income levels, less than 10,000, 10,000 to 20,000, et cetera.
Now, my ultimate goal here is to generate a plot that's going to show the number of agnostic respondents for each income level. So, I'm going to restrict my attention to just agnostics for now.
In order to make a ggplot for that for instance, I will need a column called income level and another one called count, neither of which I have right now.
This data set is what we in what we call a wide format, where the current column names, except for religion, are actually values of one of the variables that we want to use, in this case income level.
And then the cells in those columns contain values of another variable, in this case counts. And that variable, count, isn't even named in this data set. It's just assumed by the authors of the set that we are going to know what these cell entries actually mean.
So, in order to fix this, we're going to have to let R, or whatever programming language you might be using, know what the problem columns are. So, here it's going to be all the different sort of income levels, starting with less than 10,000 K and going all the way up to that DKRF at the very end.
Then we're going to have to say the name of the new column where the column names should go. For us, that's going to be income level. And then we're going to have to say the name of the new column where we want the values from those columns to go. And in this case, that's going to be count.
When we use the pivot longer function, which is our you know, workhorse function here, we are going to specify those three things in order.
So, starting on line 43 here, you can see the code that's actually going to do it. I'm going to take the relig income data set and pipe it into pivot longer.
I have an old pipe here for some reason, I'll fix that. I like the newer one better.
Um pivot longer after we specify the name of the data set, relig income, needs three arguments. We have to say the problem columns, cols equal. Then we have to say where the names are going and where the values are going. So, the names of these problem columns are going to go into a new column called income.
The values in those columns are going to go to a new column called count.
By the way, this negative sign right here in the in cols equal is saying to select all columns except for the religious columns. So, sort of subtract that out of the set of columns being considered.
So, I'll go ahead and execute that and then glimpse the result. And just like magic, suddenly the data set is in the format we want. We have all the same information, it's just now been reshaped into a format that ggplot is going to like a little better. As advertised, we have a column for religion, as we started with, income and count. Those are the two column names that we specified in our previous in our previous pivot longer call.
So, now let's get our plot. We can remove all the non-agnostics using a filter. I'm doing that here on line 54.
And then doing just a very basic plot of our data. And I'll pretty this up in just a moment.
So, there it is. You can see I've just put one point for each income level referring to the count, the number of people that responded saying that they had that income level just among agnostics.
Now, I said I wanted a lollipop chart.
So, I want to run a little bit of code to pretty this up. In particular, I'd like to collect connect all of those dots horizontally to the vertical axis.
And this will solve the problem of making it look like the the ratios are different than they are. Currently for instance, it looks like 75 50 to 75 K is many many many times greater than less than 10 K. So, you know, the largest and the smallest values in the set. That's not actually the case, and we'll be able to see that better if we include zero on this plot.
So, here's a more attractive version of it. First, I'm going to put the income levels in order so that for instance, 100 to 150 K isn't in between 10 and 20 K. Currently these are going in alpha numeric order.
I'm not going to say too much about how I'm doing that. I'm using the factor relevel command. I have a whole vid put about working with factors in R. I'll throw a link to that up top if you want more detail.
Um then, once the factors are in the order that I want, I'm going to make my lollipop chart.
Here's the slightly cleaned up version.
You'll see that I've added geom_segment.
So, that is going to require two additional aesthetics because to identify a segment, we need to say both the starting and ending points. We already have the starting end points in our original AES command. So, now I'm just adding my end points. For every observation, I want my segment to start at x equals zero, or rather end at x equals zero and y equals whatever the categorical value is for the income.
So, by including the y-axis here, basically the value for count equals zero, and then connecting all our points to it with a horizontal segment, we get a more authentic representation of the relationships between these groups. Generally, when your data is at a ratio level of measurement, so zero actually means nothing or none at all, it's a good idea to include zero in your plot as in as like I've done here.
Now, there's lots more that can be done with pivot longer. And to illustrate some of the tools that are at our disposal for more complicated data sets, I want to look at the Billboard data set, which also comes built in with tidyverse, and includes song rankings from the Billboard charts from the year 2000.
Currently, each row is a song, which is great for some purposes, but if we want to track the progress of a song over time, we're going to need something like a week variable. So, just looking at the head of this data set, you can see that we have um to start, artist Tupac, and song uh Baby Don't Cry, that started on the charts in its first week at 87. We then have its rank on week two, week three, week four, and so on, in addition to some other variables here, like the date that it actually entered the the chart.
So, if we want to make any sort of time plot, we're going to need a week variable. So, let's do a pivot that will fix this issue. And to start, I'll do it in the simplest way possible, basically using similar code to what I had before.
The problem columns are all the ones that start with week. So, here I've used the helper function starts_with to select all columns that well, start with the letters WK.
Then I am going to specify that those column names should go to a new column called week, and that the values in those columns should go to a new column called rank.
Let's just take a glimpse at the at the top of this. You can see that now we have Tupac Baby Don't Cry for every single week that it was on the charts.
Now, this output does have a few problems. Each song currently does have 76 rows, that's the number of week columns that we have in this data set.
Um even if the data even if that song was only on the charts for one or two weeks. So, really we'd prefer it if we only had one row per week that the song was actually on the charts. And if I change up my window just a little bit here, you'll be able to see the problem a bit more clearly.
There we go. Like Baby Don't Cry has was off the charts after seven weeks. So, we have a bunch of NAs here. And because I'm using the head command, we'll only see the first 10 rows of this. There are actually 76 entries for every song, including Baby Don't Cry. So, basically a lot of NAs.
The second problem I have here is with the week column. I don't need this WK in front of everything. I would rather just have 1 2 3 4 5.
While I'm at it, I don't want this to be a character vector. I would rather just have numbers.
So, you know, all of this can be fixed using commands from the dplyr package, things like mutate for the week and filter to get rid of the NAs.
But, um pivot_longer has built-in tools for this that I think under a lot of circumstances are more simple to use.
So, I just want to illustrate that.
Here's the more complicated version of the pivot that I just did that's going to resolve all of those issues. So, let me run it first.
I will again look at the head of this.
There we go.
And we can see how the issue's been resolved. So, first let me just show you that it has been resolved. I'll rerun that head command.
There we go.
And so, I guess I need n equals 10 so that we can actually see the NAs are gone.
There we go. Now, we only have the seven rows for Baby Don't Cry before we get on to the next artist, which I think is to get her is the name of it. I could be wrong. I'm not familiar with that.
Notice that week is now integer and does not include the WK.
Okay, so let's see how we did that.
Going back up here.
Here's the code. Let me just get it a little bit more visible.
There we go.
All right. So, the first thing I want to point out here is the names_prefix argument, which is saying that each of the column names in the original data set starts with the letters WK and says we should strip that off. It's very common when you have a wide data set for all the columns that are problem that all the problem columns to start with a prefix like that. It's um if the names of the columns are actually values of a variable, typically the person who's actually coding in the data kind of subconsciously realizes that and puts in a prefix to indicate it.
Um in order to make all of the columns um numeric, I have used the names_transform argument. To be fully transparent about this one, typically when I'm doing the um the changing the data type of a column, I end up piping my pivot_longer into a mutate and doing it manually. But, I thought I would illustrate the internal syntax for it here.
Names_transform um needs a list as its value. And inside that list, we say the names of the new columns and then the function that we want to apply to all of them. In other words, how are we going to transform those columns? So, fundamentally, this is just doing a mutate underneath the hood.
In order to deal with all of those NAs in my Billboard long data set, I've added this argument values_drop_NA, which is saying that whenever we get an NA in the values to column, so in this case rank, we are just going to leave out that row. And that's specifically for this situation where many of the original columns have a lot of NAs. When we do the pivot, those new rows that we've created don't end up holding any information any new information, so we just drop them.
So, you know, these are the tools that I use most frequently when I'm using pivot_longer when I have a wide data set that I need in a longer format for the purposes of modeling or visualization.
I do just want to quickly point out some of the tools that are available to you for more complicated pivots.
Cuz untidy data can get really, really complicated. And the Anscombe data set, again, built-in with tidyverse, is a really good example. So, the um Anscombe data set is a real classic. It is um four two-value data sets with um where the variables X and Y always have identical correlations or nearly identical correlations, as well as nearly identical means and variances.
So, it's sort of a a cautionary tale about uh over-interpreting things like correlation and variance.
The version in tidyverse, I think, is intentionally very untidy.
We have columns called X1, X2, X3, X4, and Y1, Y2, Y3, Y4. So, we have the X values and Y values for the four different sets.
If we were trying to get plots of these, for instance, like, you know, a pivoted scatter rather a faceted scatter plot, we'd want variables X, Y, and set, you know, like, which set number are we in?
So, the data set that we have here, the raw data set Anscombe, actually is a situation where the column names are encoding the values of two variables.
So, the code to unwind that is a bit more complicated. I'll just flash it at you right here. We um are saying that the names have to go to two different columns. And then we've used little code or a little bit of it um more sophisticated instruction to say what the names of those columns should be. We also have to let R know where to split up the column names.
Now, while I don't want to talk about this code in detail, I do have a whole video where I do, and you can see the link here. I'll put it in the description below so you can just click on it directly. If you're looking to get some practice with your pivots, this is a good place to start.
>> [music]
Related Videos
Agentforce NOW AMA: Build with React and Salesforce Multi-Framework
SalesforceDevs
490 views•2026-05-28
How agent o11y differs from traditional o11y — Phil Hetzel, Braintrust
aiDotEngineer
450 views•2026-05-28
WEB TECHNOLOGIES UNIT-2 | Degree 4th sem BCOM Computers web technologies unit-2 full explanation💯✅
LearnwithSahera
1K views•2026-05-29
More tests are always better? How to use AI to identify tests that bring little value
Alliance4Qualification
335 views•2026-05-29
Search Algorithms Explained in 60 Seconds! 🤖💨
samarthtuliofficial
218 views•2026-06-01
People of Game of Thrones using JavaScript DOM
AltCampus
296 views•2026-05-30
Introduction to Problem Solving Part - 1 | Lecture 1 | Intermediate DSA
ascensionix
107 views•2026-05-29
So What's Odin Lang Even Good For
TechOverTea
131 views•2026-06-01











