Data cleaning in Python follows a systematic three-step workflow: (1) Load and profile your data to understand its structure and identify missing values, duplicates, and inconsistent formats; (2) Handle obvious issues by standardizing column names, cleaning text values, removing duplicates, and imputing missing values, using the median for numerical columns and 'unknown' for categorical columns; (3) Run integrity checks to validate data logic, such as recalculating derived fields and checking for impossible values. Key tips include using the ydata-profiling library for a comprehensive data assessment, applying the median instead of the mean for numerical imputation, using mapping dictionaries for text standardization, and using regex to remove currency symbols.
Deep Dive
🧹 Watch me CLEAN DATA in Minutes with Python (+10 Tips for Complex Datasets)
Data cleaning is boring. Everyone knows that, but I have cleaned data sets with millions of rows for banks, for police departments, for startups, and I developed my own 10 insider tips for cleaning data in Python that save hours. In this video, I will show you my exact workflow. Let's get into it. So, we will be working with synthetic data that I created specifically for this video, and I added all the possible issues that can happen to a data set, to show you how to fix them. But, let's be honest, if you work with data coming from a database, there might be minor problems, but not all of the problems together. So, we will be working in Google Colab. You can choose any integrated development environment you prefer. Sometimes I use VS Code, sometimes Google Colab, but if you're just starting with Python, I believe that Google Colab is the easiest to work with because you don't need to install anything. You literally just Google "Google Colab". You need to have a Gmail account, and that's it. When I started in data, I was always working with Excel. However, if you want to work with big data, if you're working for industries like banking or telco, for example, there are just billions of transactions per day. The first problem with Excel is that it is limiting: you can't process data this huge in Excel.
You need to work with Python. Secondly, with Excel, if you have to clean, you have to repeat the same actions again and again, while with Python, you write the script once, and you can rerun it again and again on different data sets. So, it is a way of automating your workflow.
And also, Python is much faster, so those millions, billions of rows can be processed very quickly. There are different libraries to work with data. I will be working with pandas, which is almost like the default library, but you can also work with Polars or DuckDB, which are much faster on bigger data sets. So, the very first step, we need to install the libraries we will be working with for cleaning our data. And to do that, we say !pip install (exclamation mark, pip install), and then we need to indicate the names of those libraries. So, the first one will be ydata-profiling. And also, if you don't have other libraries, you can indicate them after this. For example, pandas, numpy, and others, but I already have them installed. And tip number one, if you add --quiet, then it will not print everything that's happening.
So, usually when you install libraries, pip just prints everything that's happening and so on. But, if you add this --quiet, it will not print that huge screen. I actually discovered it a couple of weeks ago. So, I'm sharing my little tip with you. Before we keep going, if you want to practice this on a real data set, I have a free resource pack I put together for data analysts.
It has templates, tools, and exercises to actually apply what you're learning with these videos. Go grab it by the link below or the QR code. It is free.
Now, once the libraries are installed, we need to import them. So, we will import pandas as pd, we will import numpy as np, and then we will import my secret library, which is called ydata_profiling. And from ydata_profiling, we will import ProfileReport. And also, since we are working in Google Colab (if you work in VS Code, you don't need this last line of code), from google.colab we will need to import files. Let's run it. So, now finally, we can open our data set. So, we need to say uploaded equals, and then we say files.upload(), empty brackets. This syntax is used for Google Colab. So, once you run it, it will allow you to choose the file. So, we click choose files, and I am selecting my messy customer orders CSV.
That's actually the file we will be working with. We will be using pandas, as I said. We say df, which stands for data frame, and you can use any name you want. The default is usually df. So, df equals pd, which stands for pandas; it is the alias we gave to the pandas library over here. So, we say pd.read_csv, brackets, and in the brackets, in quotes, we need to indicate the path to our file. Our file is the messy customer orders CSV. So, I like using this uploaded because now you can copy the name of the file from here. Usually, you have to go into the files, and you know, copy the path. So, the very first step is to actually load and look at our data. My very first tip is to use the ydata-profiling library, which we installed previously, to see what is happening in our data set. We will say profile equals ProfileReport, brackets, df. So, we're referring to our data set, comma, and then you see it's giving us a suggestion. So, we need to give a title, correct? We can use something like Profile Report. And also, let's add this explorative equals True. And we will save this profiling report. So, we will say profile.to_file.
And then you can say report.html.
That's the name of our HTML report. So, it can be anything you want, and we'll download our report.html. Once this code is executed, this report will be downloaded. Okay, let's open it. And this is an amazing thing. So, it has all the details about your data set.
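The profiling step described above can be sketched roughly as follows. This is a minimal sketch, not the exact notebook from the video: the function name `build_profile_report` and the title string are illustrative assumptions, and the import is kept inside the function so the snippet still loads when ydata-profiling is not installed.

```python
import pandas as pd


def build_profile_report(df: pd.DataFrame, out_path: str = "report.html") -> str:
    """Write an HTML profiling report for df and return the output path.

    Hypothetical helper; requires `pip install ydata-profiling --quiet`.
    """
    # Imported lazily so the rest of a notebook runs without the package.
    from ydata_profiling import ProfileReport

    # explorative=True turns on the more detailed analyses.
    profile = ProfileReport(df, title="Profile Report", explorative=True)
    profile.to_file(out_path)
    return out_path
```

In Colab you would load the uploaded CSV with pd.read_csv first and then pass the resulting data frame to this helper.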
Usually, we write a lot of code to get this information, but if you're using ydata-profiling, you can get it all at once.
And this is interactive. For example, we can see how many variables we have, how many rows, how many missing cells, which is usually separate code. Then, in this part, you can select a column. For example, we have country. You can already spot the problem with our data set. So, we have DE and Germany, which is the same thing. We also have United States and USA. So, further in this video we will have to clean this and make them the same. So, rename one or the other. And by clicking like that on every category, we can see which problems there are. So, here we can see we have electronics in all capitals and in lowercase.
Then, we have heat maps, we have a correlation matrix. We can also see missing values here for each of the columns. And we also have a sample of our data. Usually, when I do my analysis, I print head or tail to see what the data looks like because I'm a visual person and this really helps me to understand what I'm working with. So, for example, here I already see that quantity can be negative and probably we will be working with it. With our status, we can see that there is lowercase. In the is returned column, we have booleans, we have 1, we have Y, which we have to fix. And you can also see the last rows if you want.
And then, it shows duplicate rows. So, basically this ydata-profiling gives you all the information you need.
But, if you don't want to do data profiling, we can do the typical cleaning steps. So, the very first four lines of code that I always run for every project are looking at the shape, looking at data types, and looking for missing values.
And as I said, I print either the head or the tail of the data set to see those rows.
So, very first we will print our shape.
We say df.shape and it shows us how many rows and columns we have. So, 51,000 rows, 12 columns. Next thing we will look at dtypes, what data types we have.
Usually, if it has dates, you need to convert them. I don't know, working with dates is a headache overall. So, let's see what we have.
Objects; quantity is integer, that's good. Unit price is object. I expected it to be numeric, so we need to look into it. Next one, we are printing our missing values. To do that, we say df.isna().sum(). And now we see which columns have missing values and how many missing values there are. And the last step, we are printing our head or tail. So, you need to say df.head(). By default, it prints five rows, but you can change the number of rows, for example, to 10. And we printed our 10 rows and we see how the data is structured. You can see the naming.
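The four first-look checks above can be sketched on a tiny stand-in data frame. The column names and values below are invented for illustration and are not the data set from the video.

```python
import pandas as pd

# Tiny stand-in for the messy orders file (hypothetical columns and values).
df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "quantity": [2, -1, 5],
    "unit_price": ["$10.00", None, "1,299"],
    "order_date": ["2024-01-15", "01/16/2024", None],
})

print(df.shape)         # (rows, columns)
print(df.dtypes)        # data type of every column
print(df.isna().sum())  # missing values per column
print(df.head(10))      # first rows; df.tail(10) shows the last ones
```

A price column that shows up as a text type here is the same hint the video picks up on: something non-numeric is hiding in it.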
And by the way, we can see that order date has slashes and dashes. So, that's what we will be working with. My tip number two, when you start an analysis, always print shape to see how big the data set is. Print head or tail to see the data set visually. Of course, check data types and check missing values. Tip number three, I usually prefer to standardize column names in one line.
So, in our case, actually, all the column names look all right, but sometimes you have spaces, you have some capital letters. So, usually my next step is standardizing those column names. So, we will say df.columns equals, and then we again say df.columns. Colab is actually helping us and suggesting what we will be doing.
All the spaces we will replace with underscores. We will do .str.lower() to change all the uppercase, if you have it anywhere, to lowercase. So, actually we'll use what Google Colab suggested to us. We will also do .strip(). So, you say .str.strip() with empty brackets. We will strip all white space around our names. Then we run it. And in case your column names were different, for example, you had some white space or you had uppercase, it will all be fixed by this line of code.
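A one-line version of the column-name cleanup might look like this. The messy names are made up for the example, and the sketch strips white space before replacing spaces, an ordering choice of this sketch so that leading or trailing spaces do not turn into underscores.

```python
import pandas as pd

# Hypothetical messy column names.
df = pd.DataFrame(columns=[" Order ID", "Unit Price ", "Country"])

# Strip outer white space, lowercase, then turn inner spaces into underscores.
df.columns = (
    df.columns.str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
)

print(list(df.columns))
```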
Okay, so the next step, we will actually print percentages of missing values. So, we'll see which columns have missing values and how many missing values there are. To do that, we will create a new variable which I will call missing, equals df.isnull(), empty brackets, dot mean(), empty brackets, multiplied by 100. So, we will compute the percentage of those missing values. So, with .isnull(), pandas is looking for missing values. So, is this value null? The mean of those true and false flags is the share of missing values, and multiplying by 100 gives the percentage. And now we can print all those columns. So, we will print missing, square brackets, again missing where it is more than zero, and then we will sort our values in descending order. So, we will say sort_values.
Yes, and it suggests ascending false.
Correct. And it should be ascending equals False. Now, let's run it, and you can see that it printed the name of each column and what percent of its values are missing. So, category has almost 6%, country 5%. So, of course, if your data has 40-50% missing, it is already not even a data problem. It's probably a data pipeline problem, and you need to talk to data engineers or to the people who are responsible for the data itself to understand why such a huge number of values is missing. The next step we will do is we will deal with our duplicates.
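The missing-percentage report from the step above can be sketched as follows, using a small invented data frame rather than the video's data.

```python
import pandas as pd

# Hypothetical sample with gaps in two columns.
df = pd.DataFrame({
    "category": ["a", None, "b", None],
    "country": ["US", "DE", None, "UK"],
    "quantity": [1, 2, 3, 4],
})

# isnull() gives True/False per cell; the mean of those flags is the
# fraction missing, so multiplying by 100 yields a percentage.
missing = df.isnull().mean() * 100

# Keep only columns that actually have gaps, worst first.
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```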
So, my little tip is to check the duplicate count before dropping them. I think because I came from a finance background, I don't really like to remove anything because in accounting, you can't just delete things. So, I always prefer to check, and if I can, I separate it somewhere else instead of completely deleting and removing it from a data set. So, before we remove anything, we will print how many duplicates exist. So, let's say n_dupes, which is the number of duplicates, equals df.duplicated(), empty brackets, .sum(), empty brackets. And now, if you print this variable, we can see that we have a thousand duplicates in our data set. If you want to see what those duplicates look like, we can do it with the following line of code. We will say df, square brackets, df.duplicated. All right, Colab is actually suggesting the whole line of code that does this. So, it is df.duplicated with keep equals False, and then we are sorting our values, and it suggests how many rows we want. We want 10. You can, of course, make it less or more. And these are our duplicates. So, by duplicate, we mean the whole row is exactly the same. So, every value in every column is exactly the same. Because you can have apparent duplicates when, for example, the same customer made several purchases. It looks like a duplicate if you just look at the customer name. But, of course, if you look at the date of purchase or maybe the orders, it will be not a duplicate transaction, but a real transaction of a returning customer. At this point, you really need to turn on your analytical brain and think about what you need for your analysis. So, now we have looked at those duplicates, and if you're happy to proceed, you can drop them. And to drop the duplicates, we will say df equals df.drop_duplicates(), empty brackets. So, it is done. And if you want to check how your data frame changed, we can again print the shape like we did before. So, we will say df.shape.
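The count, inspect, then drop sequence described above might be sketched like this. The rows are invented; note that the last row shares a customer with the first but is a different order, so it is not a duplicate.

```python
import pandas as pd

# Hypothetical orders: rows 0 and 1 are exact copies of each other.
df = pd.DataFrame({
    "customer": ["ann", "ann", "bob", "ann"],
    "order_date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-02-01"],
    "amount": [10, 10, 20, 10],
})

# 1) Count fully identical rows before touching anything.
n_dupes = df.duplicated().sum()

# 2) Inspect them: keep=False marks every copy, not just the later ones.
dupes = df[df.duplicated(keep=False)].sort_values(list(df.columns)).head(10)
print(n_dupes)
print(dupes)

# 3) Drop only once you are satisfied they are true duplicates.
deduped = df.drop_duplicates()
print(deduped.shape)
```

Writing the result to a new name (`deduped`) is a choice made here so the original rows stay available, in the spirit of the "don't just delete things" remark.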
And if you remember, we had 51,000 rows, and now we have 50,000 rows because we dropped 1,000 duplicates. At the next stage, we will be dealing with missing values. We will replace missing values in numerical columns with the median. So, I prefer the median instead of the mean because if there are outliers, they will impact our mean value, while the median is the value that sits in the middle. So, that's why for numeric columns, I prefer the median. And for categorical columns, we will just replace those blanks with unknown. So, then we can actually filter them out if we need to.
So, what we will do, at first, we will create a copy of our data frame: df equals df.copy(), empty brackets.
And then for numeric columns, we are creating a little loop. So, we say for col in, and then which columns, in our case it is the total amount column: df[col] equals df[col].fillna.
So, we are replacing our NA values and we say what we are replacing them with. We are replacing them with the median value of this column. That's why we say df, square brackets, col, .median().
And for categorical columns, we are indicating which categorical columns we have. We have customer ID, country, and category. And you can see that we are not replacing values in unit price and I will explain why. Now, we will replace missing values in the categorical columns with unknown. We say df, square brackets, categorical columns, equals the same columns again, .fillna, brackets, unknown.
Let's run it. Now, we will check which missing values remain. So, we are printing df.isnull().sum(), and in square brackets, we say df.isnull().sum() where it is greater than zero. And you can see it says that in unit price we have 2,000 missing values, exactly how we had it before. So, I wanted to show you that we replaced missing values in other columns, but we didn't touch this one for illustrative purposes so you can see that it worked.
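A small sketch of the imputation step: median for the numeric column, 'unknown' for the categorical ones, with unit price deliberately left alone as in the video. The column names and values are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data; 1000.0 is an outlier that would drag the mean upward.
df = pd.DataFrame({
    "total_amount": [10.0, np.nan, 30.0, 1000.0],
    "customer_id": ["c1", None, "c3", "c4"],
    "country": ["US", None, "DE", "UK"],
    "category": [None, "toys", "toys", None],
    "unit_price": [5.0, np.nan, 15.0, np.nan],
})
df = df.copy()  # work on a copy, as in the transcript

# Numeric columns: fill gaps with the column median.
for col in ["total_amount"]:
    df[col] = df[col].fillna(df[col].median())

# Categorical columns: fill gaps with a visible placeholder.
cat_cols = ["customer_id", "country", "category"]
df[cat_cols] = df[cat_cols].fillna("unknown")

# unit_price was left untouched, so it is the only column still listed.
remaining = df.isnull().sum()
remaining = remaining[remaining > 0]
print(remaining)
```

In this toy data the median of 10, 30 and 1000 is 30, while the mean would be about 347, which illustrates why the median is the safer fill value when outliers are present.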
The next step, remember when we did the data profiling, we saw that categories are all over the place. We have capital letters, lowercase, uppercase. So, we need to standardize our messy text columns, and we will do it with a mapping dictionary. So, I will copy this code because it's a lot of typing, and I will explain it to you. First of all, again, we are creating a little loop. So, we're saying for col in, and then we indicate which specific columns we're talking about. We're talking about category, status, and country. And we will strip all the white space, and we will convert it to lowercase. Also, some people prefer uppercase, some people prefer lowercase. Please comment below, are you a lowercase or an uppercase person? In this case, we are doing lowercase, and because it is a string, we always say .str.strip() and .str.lower(). Now, we will create a map which will know what we're changing to what. So, for example, electronics we're changing to this, clothing to this, etc., etc. This is a great example where I use AI to help me because it can be done so much quicker by AI. So, I just give it a snippet of my data. Maybe this specific column is not even giving away any really secret information. Or at the company I'm working at, we have an internal ChatGPT, so this information doesn't go public. It is only internal for the company. So, I can upload a little snippet and I show that, you see, electronics is spelled this way. Or, for example, for the status map, we have delivered, pending, and then I say, "Please create me a script," which I copy. So, similarly with the country, you can see there are so many spellings of United States or UK. So, once we have created those mappings, what will be changed to what, we're saying df, square brackets, category, and we're mapping our old values to new values. So, we're saying .map, and we're referring to category map. So, for our column category, we're using category map. For status, we're using status map.
And for country, country map, very logical. Let's run it and let's do df.head again. We will again print 10 rows. You can see how country and status look now.
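The normalize-then-map pattern described above can be sketched as follows. The dictionaries and sample values here are invented stand-ins; the real maps would list whatever spellings show up in your own profiling report.

```python
import pandas as pd

# Hypothetical messy text values.
df = pd.DataFrame({
    "category": [" ELECTRONICS", "electronics ", "Clothing"],
    "country": ["USA", "united states", " DE "],
})

# Normalize first so the maps only need lowercase, trimmed keys.
for col in ["category", "country"]:
    df[col] = df[col].str.strip().str.lower()

# Illustrative mapping dictionaries: old value -> standard value.
category_map = {"electronics": "Electronics", "clothing": "Clothing"}
country_map = {
    "usa": "United States",
    "united states": "United States",
    "de": "Germany",
    "germany": "Germany",
}

df["category"] = df["category"].map(category_map)
df["country"] = df["country"].map(country_map)
print(df)
```

One thing to keep in mind with .map and a dictionary: any value that is not a key in the dictionary becomes missing, so it is worth checking for new gaps after mapping.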
Everything starts with a capital letter now. So, everything looks so much better than it used to. You can see we still have problems if you look at unit price because here we have a dollar sign. And also, I highly recommend you look for commas in your unit prices or total amounts because you need to remove them as well for calculations. So, again we will use chained operations. Chained operations are when you do several steps in one statement and they are separated by this dot. So, we say df, square brackets, unit price, and we are changing our unit price. So, we again repeat df unit price, .astype(str), so we are converting the value to a string, and then we are replacing our pound sign or dollar sign with nothing. So, here the little r marks a raw string, and the pattern inside it is a regex, a regular expression that allows you to work with text. And we are saying what to find: we find those symbols. So, like a dollar sign, a pound sign, maybe euro, you know your data better. And we are replacing them with nothing. So, this example will catch these 1,299-style values. And then we are stripping all the white space. And after we've done this procedure, we need to convert our unit price back to a numeric value. That's why we again say df, square brackets, unit price, equals pd.to_numeric. So, we are converting the value to a numeric value and we are saying errors equals coerce. Errors coerce is my little tip because it will not crash on errors if there are any: anything that can't be converted becomes a missing value. So, let's run it. We will rerun our head to see how it looks. And now you can see there are no symbols anymore and this 1,920 doesn't have a comma anymore. But we still have a problem with our is returned column. So we have this boolean, we have no, Y, true, everything. So let's fix that bit. So tip number nine, we are handling mixed boolean values again with a mapping dictionary.
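The currency cleanup described earlier in this passage can be sketched as below. The exact regex from the video is not visible in the transcript, so the character class here (dollar, pound, euro, comma) is an assumption, as are the sample values.

```python
import pandas as pd

# Hypothetical price strings with symbols, commas, spaces and junk.
df = pd.DataFrame({"unit_price": ["$1,299.00", "£15.50", " 20 ", None, "n/a"]})

cleaned = (
    df["unit_price"]
    .astype(str)                                # everything becomes text
    .str.replace(r"[$£€,]", "", regex=True)     # drop symbols and commas
    .str.strip()                                # trim white space
)

# errors="coerce" turns anything unparsable into NaN instead of raising.
df["unit_price"] = pd.to_numeric(cleaned, errors="coerce")
print(df)
```

Because coerce silently produces missing values, comparing the missing count before and after the conversion is a cheap way to see how many entries could not be parsed.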
So we are creating a new dictionary which we will call boolean map, and dictionaries always have curly brackets.
So we are creating these curly brackets, and a dictionary is made of key-value pairs. The first part, before the colon, is a key and the second part is a value. So everything that's on the left will be replaced with what is on the right. So all variants of yes, Y, capital Y, lowercase y will be changed to True. Similarly here for False. And after we have created this map, we again say df, square brackets, is returned, or the name of the column we are working with, equals df is returned .map. We are mapping these values, and then in brackets boolean map, and then .fillna(False). Let's run it and we will rerun our head again.
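A sketch of the boolean cleanup with a mapping dictionary follows. The keys in `boolean_map` and the sample values are assumptions; this version also lowercases the values first so the dictionary does not need a separate entry for every capitalization.

```python
import pandas as pd

# Hypothetical mixed yes/no encodings.
df = pd.DataFrame({"is_returned": ["Yes", "Y", "no", "0", "TRUE", None]})

# Keys on the left are replaced by the values on the right.
boolean_map = {
    "yes": True, "y": True, "true": True, "1": True,
    "no": False, "n": False, "false": False, "0": False,
}

normalized = df["is_returned"].astype(str).str.strip().str.lower()

# Unmapped or missing entries become NaN after map; treat them as False.
df["is_returned"] = normalized.map(boolean_map).fillna(False).astype(bool)
print(df)
```

Filling unknowns with False follows the transcript; if "not recorded" and "not returned" mean different things in your data, leaving them missing may be the safer choice.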
You can see how amazing this column looks. It's so beautiful. So everything is standardized. While we're looking at this, we can notice that the other column that we haven't touched yet is order date and it has all possible formats. So we have dashes, slashes, we have just other types of date formatting.
So we will need to fix this. First of all, we say df, square brackets, order date, equals pd.to_datetime, so we are converting this column to datetime, and again in the brackets we are passing df order date, dayfirst equals False, errors equals coerce. As I mentioned, coerce handles anything unparsable without crashing your script. Let's run it. Okay, after we ran this, we can check again how our data looks.
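The date parsing above can be sketched as follows. One caveat that goes beyond the transcript: on pandas 2.x, a single pd.to_datetime call over a column with mixed formats infers one format and, with errors="coerce", may turn the other formats into NaT unless format="mixed" is passed. To stay independent of the pandas version, this sketch parses each value on its own; the sample strings are invented.

```python
import pandas as pd

# Hypothetical column mixing dashes, slashes and junk.
raw = pd.Series(["2024-01-15", "01/16/2024", "2024/01/17", "not a date"])

# Parse value by value so each string can have its own format.
# dayfirst=False reads 01/16/2024 as month/day/year.
parsed = raw.apply(
    lambda value: pd.to_datetime(value, dayfirst=False, errors="coerce")
)

# Make sure the result is a proper datetime column (no-op if it already is).
parsed = pd.to_datetime(parsed, errors="coerce")
print(parsed)
```

Per-value parsing is slower than one vectorized call, so on a large column it may be worth trying the single call with format="mixed" on pandas 2.0 or later and checking how many NaT values come back.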
Let's do df.head() and see if the data was updated. And you can see now order date looks much better than it did before. Again, we have one standardized date type. And another thing, we can extract some useful features from those dates. So, something that I always do in Excel can also be done in Python, obviously. So, for example, we can extract a year, a month, and a day of the week. To extract the year, we will create a new column called order year. So, df, square brackets, order year, equals df order date, so we are working with our order date column, .dt.year.
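The year extraction just described, together with the analogous month and weekday accessors on the same .dt namespace, can be sketched like this. The new column names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical, already-parsed order dates.
df = pd.DataFrame({"order_date": pd.to_datetime(["2024-01-15", "2024-02-03"])})

df["order_year"] = df["order_date"].dt.year
df["order_month"] = df["order_date"].dt.month            # attribute, no brackets
df["order_day_of_week"] = df["order_date"].dt.day_name() # method, needs brackets

print(df[["order_date", "order_year", "order_month", "order_day_of_week"]].head())
```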
To extract the month, it is a similar script, but we say .dt.month. And if we want to find the day of the week, we are saying order date .dt.day_name() with empty brackets. And now, let's print just these: order date, order year, order month, order day of week. And we can also print a head for those columns only. And you can see that our order date is fixed, and we have order year, yes, month, and the day of the week. So, it is also very valuable for analysis, if you want to analyze whether there are any spikes on a specific date, etc., especially if you are working with visualization. And the last step we will do is to apply our logical, analytical thinking to the data set. If you look back at what we have, we standardized all the naming, we cleaned everything we possibly could. But if you look at row number two, we have quantity minus two, and unit price is NaN. So, there's no unit price, but total amount is 690. So, this raises suspicion that there might be some problems with total amount. And sometimes when you are exporting data, especially marketing data, from my experience, sometimes that percentage is not accurate. If you recalculate it, the number will be different. And in this example, it looks like total amount might be wrong. So, we can easily recalculate it. Total amount is quantity multiplied by unit price.
So, we don't have to rely on the column provided in the data set; we can recalculate it ourselves. So, we know that values should not be negative or zero unless it's something like returns. But, if you work for a company, you know your data best. You know if this makes sense or not. But, we will create a new mask.
We will call it impossible quantity mask. And it is where quantity is less than or equal to zero. And we will print all the rows that meet this criterion. So, for order ID, product, quantity, unit price, and total amount, we want to see if there are any of those. Then, you need to apply logic. What you need to do: you either remove it or you work with it, maybe replace it. That depends on the task you are trying to achieve. Data cleaning doesn't have to take you a whole day. Today, we did it in a simple workflow. Step one, load and look at your data first. Step two, handle the duplicates, missing values, messy columns. Step three, run a fast integrity check. Along the way, you learned 10 practical tips to speed up each step. If you want the templates and the free resources to practice on your own data, grab them with the link in the description. And if you want to know what to actually do with clean data, how to run your first real analysis in Python, watch this next.
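The integrity checks from step three can be sketched as follows. The data is invented to mirror the suspicious row mentioned in the video (negative quantity, missing unit price, a total that is still filled in), and writing the recalculation to a separate `total_recalc` column is an assumption of this sketch so the original total stays visible for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned orders.
df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "product": ["laptop", "mouse", "desk"],
    "quantity": [2, -2, 1],
    "unit_price": [100.0, np.nan, 250.0],
    "total_amount": [200.0, 690.0, 300.0],
})

# Recalculate the derived field instead of trusting the exported one.
df["total_recalc"] = df["quantity"] * df["unit_price"]

# Rows where a recalculation is possible and disagrees with the export.
mismatch_mask = df["total_recalc"].notna() & ~np.isclose(
    df["total_amount"], df["total_recalc"]
)

# Rows with a quantity that is zero or negative.
impossible_qty_mask = df["quantity"] <= 0

cols = ["order_id", "product", "quantity", "unit_price", "total_amount"]
print(df.loc[impossible_qty_mask, cols])
print(df.loc[mismatch_mask, cols + ["total_recalc"]])
```

What to do with the flagged rows (drop, correct, or keep as returns) is a business decision, as the video says; the masks only make them easy to find.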