This tutorial covers essential pandas operations for data analysis, including basic DataFrame exploration (head, tail, shape, info, describe), filtering with Boolean indexing and logical operators (AND/OR), selecting specific rows and columns using loc and iloc, adding new columns through arithmetic operations, string operations (str.startswith, str.contains), applying custom functions with apply and lambda, grouping data with groupby for aggregation, and merging DataFrames using pd.merge. The instructor demonstrates these concepts using a music dataset with columns for artist, track, album, views, likes, comments, and streams.
Deep Dive
Prerequisite Knowledge
- No data available.
Where to go next
- No data available.
Deep Dive
CS-334 Tutorial 2
Added:okay I think we're good to go so after reading my data frame into sorry after reading my CSV into a data frame I'm going to display the first five entries and this is what our data set today looks like if I remove do head I see that I have 2,718 rows across 10 columns so uh essentially our dat data set contains details about some music some tracks available on YouTube and on Spotify okay similarly you can use do tail to obtain the last few entries of your data set okay you can use dot shape to display the number of rows against the number of columns we've already verified this number dot columns is going to directly give you a list of columns so here we have artist track album the type of album the number of views likes comments as a Boolean variable we know if the track is licensed or not then we know if it's an official video or not and then we have the stream count similarly you can write do index to know how many indexes there are right now we have integer so this is how you interpret it it starts at zero it stops at 20718 and and a step of one just means there is an increment of one after each index okay to display the general information in your data frame we use the command sample data. info we're going to study the details of how to use this how to process this when we cover module 2 Data cleaning for now it tells you how many nonnull entries you have rather it it helps you understand how many missing entries you have what is the type of the variable stored in each column so apparently artist was stored as uh strings however the data type at the back end as we discussed in our last lecture as well is actually an object then float 64 represents a number over here okay with that moving on for our numerical attributes we use the function. describe which automatically filters out all the numeric attrib rutes in this case it's just views likes comments and streams and then it prints the summary statistics of it the count the mean the standard deviation min max and all your quiles for non-numeric variables you can uh you cannot use this function however info Works value counts works it tells you the um count of each unique instance in your given column similarly n unique gives you the number of unique entries and do unique gives you the names of the unique entries so with that let's move further then for further for numerical attributes what you can do is you can refer to the column and then if you write Min do mean Max or count you're going to obtain those properties directly these are some functions in pandas that are available for numerical attribut however I believe that Min and Max is going to work for your string operations as well so similarly if I want to get the sum of total comments this is how I would do so so uh the name of my data frame then in square brackets I refer to the column name and then do sum as my aggregation operator so with that we have looked at some basics of some basics of pandas and we're going to move on to questions this was just for a quick recap my assumption is you guys have been through these things before as well so we have a list of practice questions for simple ones and then slightly tougher ones I'm going to use the same data frame okay let's start with some basic filtering queries so I'm going to move at a slight slly faster Pace if you guys are stuck anywhere in between feel free to stop me and ask any questions so first you have to select the attributes of all songs by the artist Edge herein this is how we would do so data frame and then if I write artist double equal for comparison and then I will give the name of the artist and this is going to give me a Boolean series it's not an array it's a series but generally we refer to it as an array as well and it works and I can directly refer this inside my data frame and just like that I have selected all the attributes which means all columns for the rows where the value of the artist is equal to Edge herein are there any questions at this point okay I think we're good to move on then similarly if you have to select the top 10 entries of single type albums so first I'm going to of course select album type I think yeah that is the name double equals and then I have the keyword single it's going to give me a Boolean array or a Boolean series again I will put it inside my data frame can anyone tell me how to refer to the first 10 entries now like if you write head and in the brackets you write 10 perfect excellent so head and 10 will automatically give you the first 10 entries what if I want to reset these indexes like 2 6 8 1 11 what if I want it to be zero till 10 can anyone tell me how to do that okay so guys for this you have dot reset index command this is what that does for you okay and then you can drop this index column as well but we look at that later in further modules with that let's look at question number three find all songs that have more than 10 million views so I'm going to select data frame views and then greater than 10 million I think that yeah that is six zeros and that will give me a Boolean array and I can pass it to my data frame but right now I'm getting all the rows against which there are more than 10 million views how do I select the name of only the track for instance can someone tell me how do I select only the track for these rows okay so your given result is a data frame in your data frame if I select track like that it will give me a series with only the track names if I pass track as a list it will essentially give me a data a data frame with only tracks I think with that let's move to question number four some basic operations um let me see okay so first you have to find all the songs that are licensed or that have an official video available so uh I have my column name for licensed here and then one for official video here so let's use those so data frame license equals true okay that's my first query and then data frame official video equals true now I have two operations and I want to have an or operator in between them so in FAS this is used as your or operator and whenever you do that you will essentially not get a Boolean array right now you'll get an error because it's important to ensure whenever you have a query like this that's a combination of other queries square brackets parentheses are a must otherwise it fails and then I can pass this Boolean array inside my data frame and thus I obtain all the rows 15,723 in total against which the songs are licensed or that have an official Video available similarly I can I have another question over here okay let's try this out get all tracks from albums where album type is album with more than 50 million streams so that is how we would do so okay album type double equals album that's going to give me one Boolean array then similarly data frame views greater than oh it's streams it's not views let's change that and that is going to give me my hold on let's see what the issue is so I think uh yeah we have stream as the keyword not streams and that's going to give me my second Boolean area essentially I want to ensure that the album type is album and that they have more than 50 million steams stream so I will use my and operator not my or operator so and will be true only when both conditions are true then of course I have to include parentheses and I can pass it to my data frame hence I have obtained 7,756 rows moving on to the next question select all songs with more than 5 million views and fewer than 10,000 likes so I will have views greater than 5 million and similarly likes less than 1 million can anyone tell me if I am looking for the and operation in between these quaries or the or operator and operator and operator so hence I obtain 9885 rows guys are there any questions at this point okay I have a question for you all afterwards focus on what we're doing over here I am storing four results in variables a b c and d so I want you guys to take a minute here and think about what the type of each variable is which will return a series and which will return a data frame okay so I'm going to give you guys a minute to process this and then when we display the types we're going to look at the solution e okay I think that's enough so let's start with a is a going to be a data frame or a series guys please feel free to turn on your mics for this one what am I expecting in a anyone okay so a is going to give you a data frame I'm going to comment out the rest of these so that we can view one solution at a time it's going to give you a data frame essentially here I am selecting a series but the do2 frame command automatically converts my series to a data frame see otherwise it would have just been a series okay moving on what about B was anyone able to decipher this is B going to give me a series or a data Frame data frame I guess yeah it it will give you a data frame because essentially you're just passing a Boolean series to it you're just passing it a Boolean array and then you're selecting all the rows hence B is also a data frame okay what about C A series a series precisely and what about D I guess this will give a data frame this will give a data frame now would anyone like to volunteer and tell me why C gives you a series while D gives you a data frame essentially it's just head and tail right I guess probably because of the brackets like in the C you have uh you're using one square bracket and in the D1 you're using mulp absolutely Square absolutely yes that is true so here I'm just passing a single variable a single string track here I'm passing an entire list of column names although my list just contains one name but it is a list hence I get an entire data frame with that let's move on from an mcqs and true false point of view you guys should be able to very nicely distinguish between these two data structures okay so we have a few indexing and slicing questions I think uh locator and index locator can often times be confusing for students the best way to go about this is just to practice as much as as you are able to so let's look at question number eight use do loock to select the first five rows and display only the artist track and Views column so DF do lock and uh inside my lock what will I have the first five rows so I think one way of going about this is to do that here I have my first five rows and then you pass a list of of columns which I am already given over here artist track and Views and that is how I obtain the first five rows against only these three columns if I want to use slicing instead because I think it would be much more efficient then can someone tell me if I should write 0 till 5 or if I should write 0 till four oh my bad 0 till 4 or 0 till three 0 till 4 0 till 4 and why is that so because I remembered like when we were like like m' taught this is is function can you can write four instead of three yes precisely you can do that so the thing is whenever you use Dot Lock it includes the starting and the ending index it does not exclude either of them so for reference for instance if I just uh do if I do not enter the name of Any Given column I will automatically select all columns so 0 till four will include the zero row and the fourth row instead if I use iock it will not include the fourth row because generally in slicing the final index is not included so that is one key difference between these to to remember okay with that let's move on oh yeah alternatively if you had to do the exact same operation using iock okay so up up until this point if I'm selecting all the columns this is how I would do so of course I would have five instead of four but that is how I would do so but how will I select the columns one way of going about it is to pass it a list of the indexes of those column so I know artist is at the zeroth index then I know track is at the first index and then I know views is at the 0 1 2 3 and fourth index so that will work but what if I want to reference them by name and I don't want and suppose I do not know the index at which each is a learning because in my data frame it could be on any index in a new CSV of similar data it could be on any index so I'm going to for your information teach you guys an alternate way of doing that in ilog see we have uh an interesting function over here yeah okay we have an interesting function over here um you can ignore the X It Go goes on till the fifth row in ilock and then look at the list that I'm passing to it sample data do columns now we know what sample data do columns is going to give us it gives you a list of columns rather a list of columns in the form of indexes now if I write dot get location and then I write the name of the column suppose I write artist it's going to give me zero because zero is the index at which I have the name of the column artist so that is how I can automatically obtain all the indexes in a list and that is an alternative way of using iog are there any questions at this point um I wanted to ask what's the difference between log and iog yeah okay so basically when you're using loog you are referencing on the basis of um on the basis of labels and when you're using iock you are referencing on the basis of integers only integers such as if you would do so in a list so I think one way of making this more clear is let's see so let's display the head see what columns we have what if I take the name of the album and set it as the index okay I think we can do that so DF do set index album all right now my index is album and of course we would either receive it here or we would do in place equals true for now let's keep it simple okay now suppose I want to use locator using do locator I can directly give it the name of an album like that and it will work because it goes to the indexes in the rows first it iterates first over the row over here as the first parameter and in the second parameter it expects a list or a slice of column names now we're referring to it using labels suppose here I just passed Demon Days so this so this was my result similarly I passed the names of column I could just as well have an entire list over here and hence it has filtered out only the albums that are or only the indexes that are in this list similarly you can use slicing as well just like that starts over here ends over here it's not going to work right now because uh my labels are not unique I think we've discussed that in class as well and uh right now I do not know for this particular data set which labels are unique but if you have unique labels this will work as well okay now let me show you what iog will do differently if I use index locator my labels for rows or columns will no longer work rather it will look at the indexes of these suppose all my uh indexes were stored in an array in which case uh Demon Days would be at the zeroth index of my array Plastic Beach would be on the first index of my area so essentially it expects the number of these indexes and similarly for columns as well so for columns suppose the names of all my columns were stored in a list then artist would be at the zeroth index of my list track would be at the first index of my list and so on and so forth so that's exactly what it expects over here the numbers of those indexes I hope that answers your question can you please tell what unique W thing means oh yeah of course okay so let's look at one interesting thing over here suppose I do not set the index okay and then this is what the head of my data frame looks like so if I write DF dot sorry DF artist.
unique see this is what I'm getting and I I suppose I want to check the length of this instead so I have have 279 unique artists similarly you can just do n unique and that works just as well let me show you and if I remove the head out of this then I know that in my data frame I have 2,718 entries but I only have 279 unique artists so that simply means if I give it the starting index as um for instance for artist I give it the starting index as gorillas it's not going to know whether it starts here or it starts here because the same thing occurs so many times across my data frame so that is why whenever you're setting something as your index and you're using locator along with it it always needs those labels to be unique and and that is generally really the issue that we encounter let's suppose I had um 0 1 2 3 4 5 4 and then 2 3 10 all right and I am basically asking my locator operation to start from Two and end at 10 okay suppose I wanted to start at two end at 10 two occurs over here and two occurs over here I get an error because it doesn't really know which two to refer to as the starting point does that make sense yes yes it does okay all right I think let's move on then this is a very important distinction to Remember by the way for everyone okay let's look at another question and let's try to determine which of these four will return a series and which will return a data frame so at this point again I'm I'm going to give you guys a few seconds to process this on your own so I'm going to remove this over here and let's suppose that now this is what my data frame looks like okay let's suppose again I have just these rows now my index is no longer artist or album my index is just like starting at zero and ending at 177,000 something 20,000 something so take a few minutes let's discuss this okay guys I think that will be all let's start from a what is the data type of a going to be would anyone like to answer this a data frame and let's look at that yep it will be a data frame because essentially here you're just telling it to start from the start and end at the end means include all rows here you're telling it to include all columns so it's an entire data frame what about B uh I'm not sure but like we are using zero for the reference rather than a label no see the thing is if we had used this with lock it would have given us an error but we're using it with iog and iog is expecting integer values it's just expecting uh indexes in the form of integer so it will not give us an error in fact let's try and print B so this is what you get it's actually a series gives it away but it's going to be a series it's going to be a series because um essentially over here you're just giving it the index of one particular row okay if this was a list instead like this then it would have been an entire data frame okay what about C Mya series let's check that out so C gives you an entire data frame and I'll tell you guys what the difference is the difference is here I'm using just one integer so it's going just to one row here I'm using a slice although the slice includes only one single instance but whenever you use a slice it calls the entire data frame instead of just a series okay what about D uh can you show me the output for BNC for BNC sure so this is what B gives us okay essentially what happens is when you just when you're just selecting one particular row it converts the column names into the indexes and it uses the values of that row over here that's just how iog works so this is to show you guys how it works and this is the output for C this is why C is giving you an entire data frame I hope that makes sense the difference is just in slicing you guys have to remember that okay guys what about D we're just left with d now here you're selecting all rows but you're selecting just one column okay let's see or must be a network issue or something it's loading there you go it gives you a series so let's look at what D gives us this is what that looks like so I'm going to tell you guys something interesting now and it's important to hold on to this if you are giving one integer value or you're giving it in columns it's always going to give you a series if you're going to give slices in both rows and columns only then you're going to get an entire data frame whether you're using lock or you're using ick both will work the same way in this regard see here I'm giving it slices in rows and columns that's why it gives me a data frame here although I'm giving it my columns in slices but my rows are only in a single index so it gives me a series here I'm giving both in slices thus I get a data frame here although I'm giving rows in slices I'm giving my columns as just a single index that's why I get a series is that clear are there any questions I just wanted to ask uh what kind of code are you talking about because you work with both generally like dat like we usually see the table dat yeah yeah you're absolutely right usually when you have a data frame generally most data that you're working with you're not just going to have one column you're going to have multiple columns data frames but essentially when you're working further down the line when you're transforming data when you're extracting useful information out of data so you will find yourself to be wor working with series as well it's important to be able to differentiate between the two okay okay anything else guys okay let's print it okay now let me just uh show you something suppose I modify B like this a b ke output do you understand the output of B right now what's in front of us here yes yes so essentially when I take this SCE and I change it to just the zeroth row so what it does is values it kind of transposes this Matrix and all of these column names become your indexes and these become the values against your indexes like that now view it see these are your indexes stream official video all column names are your indexes and these are your values I hope that makes sense okay okay I get it yes yes thank you okay perfect let's move on okay uh there are some operations that I think I've introduced them but it's important to just reiterate them quickly suppose you have to find the row with the maximum number of views the first question is how do you find the maximum number number of views and I think this we've covered before so this is the maximum number of views across my entire data frame now I want to know the row that contains these maximum number of views I'm going to write a simple filtering query DF views equals equals and what what am I trying to do here let's suppose I score this here and this is going to give me a boan array now if I pass this Boolean array to my data frame this is what I get is that okay so guys it's important to ensure excuse me it's important to ensure that you take this entire Boolean array and pass it over here because it could also happen that there are multiple rows with the maximum number of views in which case this will filter out all those views all those rows and you can do this directly over here instead of using a separate variable as well so you should know how to filter on the basis of an aggregation operator as well essentially I'm aggregating and then I'm filtering on the basis of that aggregated value although it's not across groups similarly suppose you have to find the total number of views for all tracks by the uh I think this is the name of the artist so DF artist and that is going to give me a Boolean erray see this artist has multiple tracks inside my data frame so if I pass this Boolean array to my data frame this is what I get I get all those rows where the artist is equal to sick Legend and then I want to count the total number of views so now I will refer to My Views column see now I get a series and then if I use my aggregation operation sum this is what I get and everyone should know how to do that is that clear similarly I can find the maximum number of views for this particular Artist as well the minimum number of views as well and pretty much all of these will work okay moving on to the next question extract the artist name and track name for any tracks with the least comments so first I would want to find all the tracks with the least comments of course I would look at comments equals equals TF comments do Min see this will give me a Boolean array so there are multiple such columns with the minimum number of comments I think the minimum number of comments must probably be zero so if I pass this Boolean array to my data frame this is what I get like I told you guys it is zero that's why we have multiple instances 4.99 such rows in total okay now you have to find for this particular data frame you have to find the artist name and the track name so essentially this is how we would do so I'm going to use Simple indexing let me show you guys something suppose I store this in a new data frame just to avoid any confusion this is exactly what that looks like now I want to select a subset of columns for this data frame so I want to select the artist name and I would like to select the name of the track this is how I would do so see essentially I stored it in a new data frame but you can just as well do that directly on the same data frame there's no harm in doing so in which case you you don't need to initialize a new one to begin with but sometimes a query like this can be overwhelming for students okay are there any questions at this point yeah go on artist track dou Brack it will give us a cies it will give us an error let me show you see now now let's look at an error key error can can anyone interpret this error for me what is the problem here okay let me show you guys an instance where this will work see it works like that and it gives us a series if I add another bracket it gives us a data frame then why does it not work if I have two values instead of one because guys I can only pass it either a list or I can pass it an entire I can pass it a single variable or I can pass it a list if I have two variables they have to be inside a list is that okay otherwise it kind of thinks that artist comma track is a multi-index which it isn't okay I hope that answers your question with that let's move on to the next one sort the data frame by views in descending order I think that's pretty straightforward DF do sort values by equals views but I think if we look at views you will find that this is sorted in an ascending order by default if you want to sort it in a descending order can anyone tell me what we do in that scenario it is sorted in an ascending equals true uh yeah absolutely we have to sort it in a descending order so we're actually going to set it to false oh yes yes yes but you are right and see that is how we were able to do that now if you look at view so it's sorted in that particular order okay let's look at our last two questions extract the top 10 most viewed songs how would I do that uh would anyone like to contribute to this top 10 most viewed songs come on people whether you answer or not try to have something worked out at your own ends though you're going to find it really helpful I guess Max function no no the thing is Max function bracket because Max function gives you just one value I guess yes that we can do so head tail so let's see I have a sorted it by views in descending order let's store this somewhere first sorted by views equals okay and there we go so that is how we were able to do so similarly you can just extract the names of the Artist as well now let's look at question number 15 extract only the artist names for the three least liked tracks again I want you guys to take a few seconds and think about this uh is procedure how do we do that like uh if we arrange in an ascending order artist names accordingly views go and then we just display the head equals 3 I'm sorry could you come again uh again I'm asending order we're going to sort by what uh by the views by the views yeah sort by views okay in an ascending order a there there is a problem with this can anyone tell me the problem with this yep that is one problem with this artist the thing is you're finding the artist names for the three least liked tracks so you're going to sort by likes instead other than that uh you're absolutely right okay and then if you want to extract the top three you can simply do that see now we have a list of the top three artists only actually the bottom three in terms of likes right now you have have it in the form of a series so I'm going to tell you guys something useful if you want to convert this to a list this is how you do so you'll find it useful later I think in your assignment as well two list command series list convert see just like that now you have a list you have an alternate command as well actually if I write list and add brackets around it it will do the exact same thing I hope that's clear assment first assignment key official release date is February 3rd but we're trying to release it uh by first or 2 inshallah otherwise third out best to it will definitely be released to you guys okay okay and then you'll have two weeks to finish the assignment essentially it's not a very long assignment I that I think we're going to have around 25 questions focused on Panda he I would recommend that everyone gives it their best short it's going to be lovely for practice and will it be like very difficult uh there is going to be a spectrum of difficulty actually some questions will be simple and then some questions will be of a medium difficulty level and then of course there will be some tougher questions too okay but overall you guys will not find it very tough okay is there anything else okay so guys uh I can see that we are a little short on time but I have another notebook over here just may we have a few more practice questions essentially Group by questions and then just a brief look at pivot tables and string operations so for those of you people who have to leave you may feel free to do so later because I will be uploading this anyway but I'm going to try and make this quick so that we can at least go through a few string operations right now you're going to find it useful for your assignment okay I'm using the exact same data set I'm going to show you guys how to add a new column to your data set suppose you're adding a new column uh just to display the head so you remember what the data frame looked like you'll always find this useful add a new column called engagement that calculates engagement by summing up likes and comments so let's see how we create new columns so DF the name of the uh or let's suppose right now I just write likes and this will give me a series similarly if I write DF comments it will also give me a series this is something amazing that you can do in vandas if I add a plus sign in between it will give me a series for which against each corresponding index it will just add all the values so if I take this series and um suppose now I want to store it in a new row in my data frame it's actually very simple engagement equals and now if I display suppose the first three entries of my data frame see this is what that looks like I hope that's simple enough is that okay let's move on similarly you're adding a new column called official licensed video that is true only if a video is licensed or it is sorry if it is official and is licensed so DF licensed I hope you guys remember this is another way to write your or operation let me show you guys we can do it the other way around as well and then DF official video and this will give you a series of that's how you do R and then you can simply write DF this was the name of your new column equals and now if I display the top three values of my data frame hence I've added a new column over here we have the drop column command as well we're going to be testing it in the first programming assignment too essentially we're hoping you guys will look up a lot of these things so you simply write DF do drop and then you write the name of the column it's very simple so suppose this is one column that I want to drop me see what the problem is you have to specify the axis in this command since it is across columns we will specify one over here similarly if I want to drop an entire list of columns this is how I would do so I would simply pass it as a list essentially now if I print my data frame I still have those columns and here it's important to remember the in place command I need to manually always set in place to True otherwise it will not work the way that I want it to work just a question in place or ACC yeah okay so access by default zero and zero access means it's going to iterate over the rows of your data frame when you set the axis to one it iterates over the names of columns instead so that that is the first thing access pandas key commands and then in place is something that you use to make sure that your original data frame is modified in place is always false by default in which case data frame modification rather it returns a data frame with the modifications so then you have to receive it in your data frame I hope that's okay yeah okay so guys error because I've already removed these two so now I'm trying to remove them again so let's not let that puzzle us and let's move on okay I'm not going to go over the Sorting functions again because we've covered this already oh one important thing though if you want to sort by two values this is how you do so you have to pass a list to buy now first I want to sort by artist and then for each artist I want to sort by views this is how I would do so other than that I'm going to leave some things for you but it's very important that we go over string operations string operations are very useful find and display all tracks where the track name starts with the letter S so if I do that I'm going to get a series right now uh if I check the type then for each given value it is not essentially a string but it is an object that is just how Panda streets strings by default when I want to convert it from an object C for instance let me remind you all if you press DF doino see for artist the data type is given as object even though I know my artists are all strings so this is just how pandas store strings by default generally so the first thing that I want to do to perform any operation is to convert these objects to strings for which the command is very simple do SDR and that it just works that way now I have my artist stored as strings I want to have this printed here so do St Str and then you have some very useful string operations like let's suppose for our numerical columns you would simply do Min or Max for instance you had data frame likes and you would directly call an operation such as Min such as Max or such as mean for string operations you have a different list of functions for instance Len Len will give you the length of all the strings all right it will give you the length of all the strings I forgot to add parentheses to this first similarly um well we already know about Max and Min but suppose uh we have we have some other interesting operations so I want to get all the tracks where the name starts with s so I simply have dot starts with and then I can specify s what does it give me now it gives me a Boolean array and we know very well how to use a Boolean array just like that hence I have filtered out all the rows over here where my track name begins with s are there any questions at this point I'm going to show you guys something fascinating similarly you have ends with as well written s twice similarly you have contains as well so for uh each string it will check if it contains this letter similarly you have Len as well so for instance elen is greater than 20 characters so now it's going to filter out all the tracks for which the names are of a length that are greater than 20 characters so you're going to find string operations to be very useful as well now suppose I want to create a new column I'm just going to display the first three rows so that we know what our data frame looked like so suppose I want to create a new column called track length that stores the length of each track name and I want to display the first five rows who can tell me voluntarily how we would do that how can we do this using the same string operations that we've studied okay I'm going to do this over here so DF track length equals and then DF track. st. alens and that is simply how I would do so now if you look at your data frame here you have a track length column introduced automatically see the only new thing is St Str this is something you have to add similarly if you have to find the tracks that contain a certain substring in their names then you would do track Dost do contains and in contains I would just pass this so it will automatically look for that substring and it will return to you a Boolean array when I I pass the Boolean array to my data frame this is what I get see you can clearly see that this word is visible across all track names now are there any questions at this point as far as string operations are concerned okay all right let's move on then so then we have a few basic custom functions which we've already been over to suppose now you're creating a custom function that will characterize the count of views into low high or medium based on certain defined thresholds so first I'm going to Define my custom function suppose I call it get view rating okay and then of course I'm going to pass it one given value at a time if my value is greater than I think that's 10 million or rather it's 1 billion then I would want to return High similarly if my value is greater than or equal to this number and less than this number then I would want to return medium and of course if it's even lower than that then I would want to return low up until now I have just created a custom function if you want to apply this custom function you use the DOT apply keyword which we've studied so right now I would want to apply this on my column of views DF views.
apply and then you just pass it the name of the function see and this is what I get now there is one last thing to do I want to store this series in another column in my data frame see apply this function to create a new column called view category and now if I view the first few rows of my data frame here I have introduced view category okay are there any questions at this point okay okay let's move on similarly there's another practice question over here which classifies a track as highly engaged if the sum of its likes and comments exceeds 50,000 then you're going to apply this to the entire data frame and count the number of Highly engaged tracks so essentially now the function that we're defining is not going to be applied to each row rather it's going to be applied to um sorry not to each value in a column but rather to an entire row which means we'll have to switch the axis so let's suppose I call my function check engagement I'm going to pass it an entire row and this is what it will return if so I will get the likes I will get the comments I will add those two things so if the sum of these two is greater than 50,000 then I would want to return highly engaged otherwise I would want to return suppose I say not highly engaged now essentially I apply it the exact same way so DF now suppose I want to introduce a new row highly engaged or not equals DF now I'm going to call this function on my entire data frame and then you just pass the name of the function but I'm going to get an error right now and I get an error because I have um I've not specified the axis now I'm not iterating over just a single column rather an entire data frame in which case as discussed in class already you always have to specify that the axis is now one you're going over the columns instead so now if I display the first few values this is where my new column is with that let's move on to some basic grouping and aggregation questions so first you're going to group the data by album type and calculate the average views for each type then you're going to store the result in a data frame before storing it let's split our question across parts and start with just grouping so DF do group by then by equals or rather you can directly specify the name album type hence I obtain a data frame Group by object which is just a group of data frames so if I display the groups that is what oh or rather I would have to store it somewhere before displaying these and then I have to apply an aggregation operation to it so suppose I actually want to use this on a particular series instead so I get a series Group by object you can use views over here or you can use views over here it will give you the exact same result okay dot and then I have my aggregation operation which is mean in this case oh you're actually going to have to refer to mean in the numai library and we do mean let me see what the problem is key error album type let's see maybe we're doing something wrong okay you know let me look into this and get back to you guys on this I don't understand right now what the problem is but let me definitely look into it similarly we have a few more grouping queries I'm going to leave some of these for you guys to practice okay and then let's move on towards uh Lambda functions now I hope you all remember that Lambda functions are basically just simple oneline functions for which um you do not have a name you directly call these using dot apply or using do filter and it's automatically called for every single row or for every single column in your data frame depending on what you're calling it for so suppose you want to scale the views and likes column so suppose you want to create a new column views in millions that simply takes the views and divides it by 1 million and then you can use a Lambda function for any other column as well so I will do DF views in millions equals and I want to give it a series so DF views dot I will call Dot apply and I will Define a Lambda function over here when I write Lambda X it means every single view in inside this column and then I want to return X ided by 1 million now if I display the first three rows this is what I have been able to introduce okay with that let's move on okay I'm going to leave some of these for you guys to practice and I'm going to move on to a slightly tougher question for each album type find the track with the highest likes display the album type track like s artist for these tracks you're going to use Group by to group by the album type and then you're going to use apply to find the track with the maximum likes so let's see the only important thing that I wanted to introduce over here was do idx Max and what it does so let's suppose I go inside my data frame and I do that so what exactly is this operation doing for me if I simply write DF likes. Max or let's suppose I get the row of the data frame for which my likes are equal to Max this is what the particular row of the data frame is where I have the maximum number of likes and this is the ID of that Row the index of that row 11 147 so if I directly write you know if I could just as well refer to index over here and it will give me 11 147 or I could write TF likes dot idx Max which will give me the index where the likes are Max maximum I think similarly you have idx Min as well which is going to be very useful so here what I want to do is I want to write DF do group by and I want to group by album type all right and then what I want to do is for each group I want to apply this function and this is what I've been able to obtain let's look at this so I have three album types album compilation and single against each album type I went to the particular instance where my likes were maximum so where is it yep here are my likes the maximum number of likes across album were this number and then I have those for compilation and for single and then it essentially gave me the entire row or the list of rows in this case it's just one row so it gave me the entire row where the likes were the maximum for each album type this essentially solves one question given to you guys as part of the case study in the last lecture as well uh I I don't think it was part of the case study actually where were're looking at the elections data frame and when you directly use Group by and then let's suppose I do group by and then I want to do likes dot Max I get the maximum number of likes if I remove likes and then I do dot Max it gives me the maximum value of artist the maximum value of track the maximum value of album and then the maximum value individually across every single column for each kind of album type but these values are not corresponding to each other similarly for elections in one particular instance we got Obama in 2016 with a maximum percentage of votes however those values were not corresponding to each other in which case you will be using idx Max in this way you have other ways of going about it to such as the first operator I'm not going to go over it right now but that's definitely something something you can look into as well so idx max is going to prove to be helpful over here now I'm getting exactly the corresponding row against which the number of likes are maximum I'm going to leave this for you guys to explore at your own ends now and I'm going to skip over to merging because this is important we're going to use this in the assignment as well and uh let me just show you the new data frame that we are creating so I've created a particular data frame with an artist and with a country and um let's suppose now I have another data frame which is my of course I have my original data frame as well which looked like this suppose I print only the first two rows right now see so in my original data frame I have 10 columns and I think I have over 20,000 rows in my new data frame I have just two rows and I have two columns if I want to merge these two data frames you're always able to do so on the basis of one common column that exists between them kind of the way that you would do in SQL when you start when you study database systems so it's the same kind of merging andas allows you to do that too now uh My First Data frame artist details contains artist and Country my next data frame contains 10 rows but it does not contain the countries over here for each artist so if I essentially merge these two data frames on the basis of artist where the artist name is the same as each other then I will essentially get a new column which is country now there are three different kinds of merging that you can do I'm not going to go into the depth of each merging essentially the way that you do inner joints outer joints and then of course left and right inner and left and right outer joints we have all of those available in pandas right now I'm just going to show you how to do an an inner joint and then you can modify the how parameter and experiment with this on your own so the command is pd. merge inside your pd. merge command you just add the names of both data frames I have DF and then I have artist details then your next parameter is on where you're specifying the name of the common colum on which you want to merge them and then you're specifying how this is what I obtain as a result so what inner joint did was it dropped all the rows uh where the artist value did not exist in my other data frame and this is what I've obtained as a result on that note we will be concluding including this session I'm going to share these notebooks with you guys uh on that note hold on where is it yep okay I'm going to share these notebooks with you guys and uh next week I will try and hold a brief session where we will go over any problems that you've encountered in these two
Related Videos
LBF101 Creating an XML Changelog
liquibase7511
3K views•2026-06-15
Alta Labs Cloud Dashboard Real time Network & Xnet Insights!
ShinyTechThings
158 views•2026-06-17
Wait... Group Policy Not Applying? Check This First!
keeplearning_iT
144 views•2026-06-15
Leetcode Weekly Contest 506 | Life's boring these days
Pudeesht
2K views•2026-06-14
microJAM: MAKING A MICRO GAME FOR A GAME JAM IN CLOJURESCRIPT AND TOTALLY NOT C
janetacarr
156 views•2026-06-18
Partitioning vs Bucketing vs Clustering: How to Make Queries 100x Faster
thedataandaiguy
194 views•2026-06-16
Design Claude Code Like a Senior Engineer
hayk.simonyan
344 views•2026-06-19
Linus Torvalds: AI Won’t Replace Understanding Code
SavvyNik
140 views•2026-06-19











