Data engineering involves building and maintaining pipelines that move data from source systems (like CRM and ERP tools) into centralized storage systems (data warehouses, lakes, or lakehouses) for analysis and decision-making. The core skills required include SQL (essential for data manipulation), Python (for pipeline orchestration), cloud platforms (AWS, Azure, GCP), and version control (Git). The data engineering lifecycle consists of five stages: generation (data sources), storage (warehouses, lakes, or lakehouses), ingestion (batch or streaming), transformation (using SQL, Python, dbt, or Spark), and serving (BI tools like Power BI). Key undercurrents include orchestration (Airflow), software engineering practices (CI/CD), and data architecture. The recommended learning path prioritizes SQL and Python as foundational skills, followed by cloud platforms, then specialized tools like dbt and Spark.
Deep Dive
How I Would Learn to be a Data Engineer
Data nerds, in this video, we're going to be going through a data-backed ranking system of the top skills you should be learning as a data engineer. But hold up, this is blurred out. Hey, editor Luke, remove the blur. Honestly, this isn't going to make sense until we learn the basics of what's actually needed for data engineering. We're going to first start with understanding what a data engineer actually does and breaking it down step by step based on what's known as the data engineering life cycle, focusing first on the concepts and then introducing the most popular tools to solve these engineering problems.
Now, if you're new here, I'm Luke, and after gaining my engineering degree, I served in the United States submarine force, spending a total of 2 years underwater. After that, I got my start working in data with a global Fortune 500 company. I've made data courses not only on YouTube, but also for over 30,000 students on DataCamp.
And my last job was with MrBeast, building out data pipelines. Now, more recently, I founded datanerd.tech. It's a real-time job market platform that's completely free. It not only tells you what the top demanded skills are for roles like data engineers, based on your home country, but it also goes a step further, providing you with real-time job postings so you can apply. Anyway, we'll be using the data from my app of over four million different job postings combined with that data engineering life cycle so you can get an unbiased opinion of what tools you should be learning and in what order.
So, let's start at the very beginning, understanding what a data engineer actually does by looking at the problem they solve. So, when I worked at that Fortune 500 company, our data was spread all over the place. Our customer data was in Salesforce, which is a CRM tool, or customer relationship management tool. All of our product data was in a program called SAP, which is an ERP tool, or enterprise resource planning. Oh, and I could never forget all those tribal knowledge Excel files that had pricing information. Needless to say, this is a mess, especially if you're trying to analyze your customers' buying patterns. You have to go to three different sources.
Well, that's where data engineers come in. They take all that data spread across your source systems and build and maintain pipelines to move this data into a centralized system such as a data warehouse. And it's not just to move the data, but also to provide clean and reliable data, which can then be used to serve your different end consumers such as data analysts, data scientists, and really anybody else that needs it.
Now, there are three general roles that work in this type of ecosystem. The first are data engineers, which is why you're on this video. They help build and maintain pipelines. Downstream of this are data analysts, who use this data to inform business decisions.
And then we also have data scientists, who use this data to build predictive models of what will happen. So in practice, what does this look like? Well, you'll be provided with a strategic business problem to solve and you'll build a pipeline around this. Then data analysts will use tools like Excel and Power BI to generate reports and build dashboards to understand what is happening now.
Similarly, data scientists will use tools like Python to build models in order to understand what is happening tomorrow. Now, if you want to dive deeper in understanding what a data engineer does along with a lot of these concepts that we're going to cover in this video, I have a completely free data engineering crash course that's delivered directly to your inbox every morning for 10 days. This course covers all the essentials you need to know to land your first data engineering role.
Check out the link in the description to sign up. Now, before we get into breaking down the concepts of the data engineering life cycle, I want to take a brief stop to look at some popular tools. Now, specifically with this, I don't care which ones are first and last per se, but instead I want you to see how the tools are segmented. So, here we're looking at the likelihood of a skill appearing in a data engineering job posting over the past 12 months. First, let's start with programming languages, where the top contenders are. We have SQL and Python, which appear in more than two out of every three job postings, and then also Bash. Although it's down here at the bottom, it's a severely undervalued skill. More on this later.
Next up are cloud providers. The top two are AWS and Azure. And then GCP, or Google Cloud Platform, is down a little bit further. Then we have data platforms. The two dominant ones are Snowflake and Databricks. They're in one of every four job postings. Then there are transformation tools. Spark is appearing in nearly 40% of postings. And then down near the bottom, we have dbt in almost 10% of job postings. This one, like Bash, is severely undervalued. After this, we have the orchestration tool of Airflow. And really, there are no other competitors that come close right now.
After that, we have the BI tools of Power BI and Tableau. Both are neck and neck depending on the region of the world you're in. And then finally we have version control, where the dominant player is Git. Once again, this one I feel is also severely undervalued. Now, there are other types of tools as well, but they drop off significantly, so we're not really going to be covering these. Now that we understand this grouping of tools, let's dive into the life cycle. And real quick, got to give a shout-out. The concepts that we're going to be going over, such as that life cycle, are from this book, Fundamentals of Data Engineering, from Joe Reis and Matt Housley. I got to meet them a few years ago when they first released this book and I can't recommend it enough. So, let's dive into this life cycle with a quick overview.
First, data needs to come from somewhere. This is the generation stage.
Think of Salesforce like that CRM or those Excel files that I discussed earlier. Data engineers don't control this. So, we need to move this into a location that we do have control over.
And the first portion of that is thinking of where we're actually going to store all this data. Once we have that stage settled, we can then ingest it into our environment. And data is never clean or shaped how we like it. So then from there, we need to transform it. Once it's in tip-top shape, we can then deliver it to those data analysts and data scientists via our serving stage. Now, in order to sustain these five stages, we need what's known as undercurrents, such as orchestration, which automates when and in what order tasks should run; software engineering, managing all of our version control, testing, and code quality; DataOps, which controls the continuous integration and continuous delivery of this life cycle; and then how this all pieces together with our architecture, the blueprint of how everything connects.
Now, let's get into breaking down each one of those stages and undercurrents step by step, because if you're like me, when I first saw this, I was like, what the heck is going on? So, with this, we're actually not going to start with generation or ingestion, but instead with the foundation of storage, and this all starts with three major storage abstractions. First, data warehouses.
You've probably heard of this one. Next, data lakes, and then finally data lakehouses. Let's dive deeper into the data warehouse first. Now, this abstraction only supports one type of data, and that is structured data, such as CSVs, Excel files, text files, or, probably most popular, tables from another SQL database.
Anyway, this data is inserted and cleaned up either before or once it's in the data warehouse. Now, because we use this structured data, this is great to use for analytics and also business intelligence. However, for data science, this is more limiting, as they're typically using unstructured or raw data in order to build their models. And then similarly with machine learning, they're building out models as well and they typically rely on more unstructured data. Now, moving on to a data lake.
This takes any type of data, such as those structured files that we talked about previously, and then semistructured and even unstructured data such as movie files, JPEGs, or even JSONs. You dump it in all raw and figure out how to use it later. I call this the YOLO approach.
Now, because of this approach, analytics and business intelligence are typically harder unless they can connect to some sort of structured data. However, this is really great for data science and also machine learning, as it now has that semistructured and unstructured data that's great for building models.
Last up is a data lakehouse. And it's quite simple once you understand that this is just a combination of a data warehouse plus a data lake. So with this data lakehouse, the input data can be either structured or unstructured. We dump it into the data lake and then from there we clean it up to get our structured data in our warehouse. So this has the best of all worlds in that our analytics and BI can connect to our warehouse and then data science and machine learning connect to our unstructured data in our data lake.
Now, when it comes to these storage abstractions, which one should you be learning first as a data engineer? Well, let's first understand some trends. For data warehouses, demand is stable. They're easier to build and maintain than lakehouses. Data lakes, however, have been on the decline and are not as in demand. And that's because demand for data lakehouses has increased, thus absorbing those that were data lake only. Because a data lakehouse is just a data warehouse plus a data lake, and demand for warehouses is stable, I recommend starting with data warehouses. And then once you understand data warehouses, shift to data lakehouses next.
So now that we understand these three major types of storage abstractions, we're going to take it a step further and start introducing tools, specifically cloud providers and also data platforms.
Starting with cloud platforms, we're going to look at Google Cloud, walking through a warehouse example. And one quick note: AWS and Azure have equivalent services at every stage that we're going through as well. Anyway, cloud platforms are like a hardware store. You walk in and get access to all the raw building blocks you need to build a pipeline. For Google Cloud, we'd use their popular data warehouse option of BigQuery. We could use Cloud Dataflow to ingest the data, SQL inside of our BigQuery database to transform it, and then their BI tool of Looker to serve it. So as a data engineer, you have to piece this all together to make this work.
Now, how would this look for a lakehouse abstraction? For storage, we put our structured data into the data warehouse of BigQuery and then our semistructured and unstructured data into something like Cloud Storage. And this would all be managed under Google's lakehouse storage engine of BigLake. Then from there, we could ingest it with something like Cloud Dataflow and then transform it outside of our warehouse with something like Dataproc, which is basically like Spark, or use the data warehouse itself of BigQuery to transform it with SQL, and then serve it with something like Looker. Anyway, I think you get the point here. With a cloud provider, you're putting this all together. Some people don't want to build a fully customized solution.
That's where data platforms come in. And there are two major tools here: Snowflake and Databricks. And it's important to understand these are not competitors to GCP, AWS, or even Azure.
These data platforms run on top of cloud providers. You actually select which cloud provider you want to use. So looking at an example using Databricks, it's going to be a completely different approach because Databricks is a unified platform. So in this you don't manage the different storage tiers, clusters, or servers. You focus just on the data.
So the takeaway for this section is that companies hire data engineers, not platform specialists. If you can build a pipeline by putting together the different services from a cloud provider, you're going to be more than capable of jumping over to a platform and using their tools to also build out a data pipeline. So I recommend starting with cloud providers over data platforms. And to prove this, the engineering life cycle for datanerd.tech is built in a cloud provider. I use Google Cloud to manage all the different services in here. And specifically for this storage section that we're on, I use BigQuery.
This is what's crunched and cleaned up over 4 million job postings. So that's the storage stage. It's the biggest one we're going over because it's the base that we're going to build everything on top of. We're now going to move into two stages: understanding where our data is generated and how we can ingest it into our storage layer. For this, it's important to understand there are two types of systems: our source systems, where data is generated, and our analytical systems, where that data is piped to. And we covered the three major storage abstractions associated with the latter. For source systems, the data popularly comes from CRM or ERP systems, or it could even be your custom app backend, which is a database. And these are not maintained by data engineers.
They're typically maintained by software engineers or even database engineers.
Your job as a data engineer is to pipe this data into your analytical system.
Now, there are two flavors of how you ingest your data. The first we're going to look at is batch: getting it in scheduled chunks. An example of this is having it run at scheduled intervals such as at 3:00, 6:00, 9:00, and so on. You could even do daily, hourly, or every 15 minutes, whatever the business needs. This is the most popular flavor of loading data, as it's not only simple to use, but most people also only need data on a daily, weekly, or monthly basis. Which brings us to our second flavor of data ingestion, which is streaming. This moves data continuously as events happen. Every click, every transaction flows immediately. This type of ingestion is built for speed, where every second or minute matters. Think of something like fraud detection, where you need to know immediately if you have a fraud case.
You can't wait until tomorrow to flag a stolen credit card. As expected, these are harder to build and debug, and they're also more expensive. So, let's get into looking at some popular tools for batch ingestion. Now, there are a few options here. We're going to look first at tools that prioritize zero code. A solution like Fivetran provides pre-built connectors to connect to a host of different sources.
However, there's not always a connector to get the data that you need.
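When no pre-built connector exists, a hand-written batch job is one option. Here is a minimal sketch in plain Python, assuming a hypothetical API that returns a JSON body with a `results` list. The response shape, the field names, and the `ingest_batch` helper are all made up for illustration, and a hard-coded string stands in for the HTTP call so the example runs on its own.

```python
import json
from datetime import datetime, timezone


def ingest_batch(api_response: str, loaded_at: datetime) -> list[str]:
    """Turn one API response into newline-delimited JSON rows.

    Each posting is kept as-is under "raw" and only load metadata is added,
    so nothing from the source is lost at ingestion time.
    """
    payload = json.loads(api_response)
    rows = []
    for posting in payload["results"]:  # "results" is an assumed key
        record = {"loaded_at": loaded_at.isoformat(), "raw": posting}
        rows.append(json.dumps(record, sort_keys=True))
    return rows


# Stand-in for the body a real HTTP request would return (hypothetical data)
sample_response = json.dumps({
    "results": [
        {"id": 1, "title": "Data Engineer", "skills": ["sql", "python"]},
        {"id": 2, "title": "Analytics Engineer", "skills": ["sql", "dbt"]},
    ]
})

# A scheduler would call this once per interval, for example at 3:00
batch_time = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)
ndjson_rows = ingest_batch(sample_response, batch_time)
print("\n".join(ndjson_rows))
```

In a real job, the printed lines would be written to a file or loaded into the storage layer instead, and something outside the script would trigger it on a schedule.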
For my need of datanerd.tech, I needed to connect to an API that pulled job postings from LinkedIn, Upwork, and Indeed. For this, I just used vanilla Python, so Python alone, and I coded the whole solution myself. This is obviously more work than a no-code solution, but it's the only option that I had, and it's quite a popular ingestion technique.
Now, moving on to streaming options. For this, there's really only one major player in this area: Kafka. This tool is open-source and free, so if you wanted to, you could go download it. It was originally built by LinkedIn to handle their massive real-time data feeds. This type of tool is meant to operate at scale, as it handles millions of events per second. So when building your ingestion pipeline, you need to decide between batch or streaming and then from there decide the tool needed to get your data from point A to point B.
So moving on, now that we understand ingestion, it's time to get into my favorite stage: transformation. Now, we just learned that our job is to get data from those source systems into analytical systems.
And we need to do this for a few reasons. First, this puts our data in a central location that can be easily served to our end consumers. And this puts it into a database that is made for analysis. What do I mean by this? Well, we actually need to take a step back and understand how databases are designed by looking at two major design principles.
For this, we'll take the example of the job postings powering datanerd.tech. In a platform like LinkedIn or Upwork, they have many different tables. Maybe something for job postings, for locations, skills, and so on. These are all separate tables. This is known as normalized tables, or normalization: a design principle to break data into separate tables to minimize duplication and maintain data integrity. This type of structure is great for apps that have millions of reads and writes per second.
The problem, however, is that this gets painful for analytics, since you have to combine all these tables.
That's where denormalized tables, or denormalization, come in. This is a design principle to intentionally introduce redundancy into a database by combining data from multiple tables into fewer, wider tables. The duplication is the point here. You trade a little extra storage for dramatically faster and easier queries. So the whole purpose of data modeling is figuring out just how much and how far you should denormalize tables to make them useful for analysis.
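To make the trade-off concrete, here is a small sketch using Python's built-in `sqlite3` module. The table names, columns, and rows are invented for illustration and are not the actual schema of any platform mentioned here. Three normalized tables are joined once into a single wide table, and the analytical query then needs no joins at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized source design: each fact is stored once, in its own table
    CREATE TABLE locations (location_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE job_postings (job_id INTEGER PRIMARY KEY, title TEXT, location_id INTEGER);
    CREATE TABLE skills (job_id INTEGER, skill TEXT);

    INSERT INTO locations VALUES (1, 'United States'), (2, 'Germany');
    INSERT INTO job_postings VALUES (10, 'Data Engineer', 1), (11, 'Data Engineer', 2);
    INSERT INTO skills VALUES (10, 'sql'), (10, 'python'), (11, 'sql');

    -- Denormalized design: one wide table, title and country repeated per skill
    CREATE TABLE job_skills_wide AS
    SELECT p.job_id, p.title, l.country, s.skill
    FROM job_postings AS p
    JOIN locations AS l ON l.location_id = p.location_id
    JOIN skills AS s ON s.job_id = p.job_id;
""")

wide_rows = conn.execute(
    "SELECT job_id, title, country, skill FROM job_skills_wide ORDER BY job_id, skill"
).fetchall()

# The analytical question needs no joins against the wide table
skill_counts = conn.execute(
    "SELECT skill, COUNT(*) FROM job_skills_wide GROUP BY skill ORDER BY skill"
).fetchall()

for row in wide_rows:
    print(row)
print(skill_counts)
```

Note how `Data Engineer` and the country now repeat on every skill row. That redundancy is the storage cost paid for the simpler query.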
And that data modeling is just one part of the equation. We also need to select the correct database to put that design pattern into. For this, there are two major types of databases. First is OLTP, or online transaction processing. Popular options like Postgres and MySQL are designed around this. The other type is OLAP, or online analytical processing. Popular options here are, well, a lot of the ones we've spoken of previously. These OLTP databases are built for running applications. They handle thousands to millions of reads, writes, updates, and deletes per second.
However, they're slow at scanning and aggregating large data sets. Conversely, we have OLAP, which is built for analyzing data. It's blazingly fast at scanning and aggregating large sets of data. However, it's slow at handling lots of updates at scale. So, let's put this all together and understand when we'd use something like OLTP. Well, you would typically see this for CRM, ERP, or app backends. Basically, any type of source system. Now, you could have outliers. That's why I have typical in quotes. Additionally, with these types of databases, you typically see them using a normalized table design pattern.
Conversely, OLAP is typically used with our analytical systems of data warehouses and data lakehouses. And the data typically tends to be denormalized inside of these systems. Now, I gave you that whole background of OLTP versus OLAP because you need to pick the correct database based on your use case.
For my use case of datanerd.tech, I need to be able to transform and analyze millions of job postings. So, not only is BigQuery my storage layer, it's also where I'm doing my transformations. And so, diving deeper into those transformations, the data that I'm getting from those job postings uses more of that normalized design, where it's spread across all those different tables. This makes it super hard, like I said, for analytics. So, in this transformation stage, we need to denormalize it to get it into a better design pattern.
Now, the most common design pattern here is what's known as a star schema. In the middle, you have a fact table. This carries events or measurements. Think of sales, orders, or clicks. Then, surrounding all this, you have your dimension tables, which are the context around those events, such as customers, products, or even dates. Now, star schema is not the only design pattern. There are other popular options like snowflake schema, constellation schema, and flat or wide tables, along with related techniques like slowly changing dimensions. Now, I'm not going to go any further into the details of this modeling technique, but I do have a resource that I use and reference all the time, and this is The Data Warehouse Toolkit. It's by this OG, Ralph Kimball, and it's a definitive guide to understanding how to perform dimensional modeling.
Now, I do want to call out that this transformation doesn't just happen in one step of going from the source design to this analytical design pattern. Instead, most teams break this up into layers. The industry standard for this is the medallion architecture, popularized by Databricks. In this, you import your data into the first layer, typically called bronze, and this is raw, unchanged data exactly as it was ingested from the source system. Next is the silver layer, and this has data that's cleaned and also validated. So duplicates removed, nulls handled; useful for exploration, but not yet modeled for business analytics.
Finally we get into the third layer of gold.
This is the business ready modeled data.
We have fact and dimension tables in here optimized for dashboards and reports. Now, for the past 5 minutes, I've been going over the what of what we need to do for these transformations.
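The bronze, silver, and gold layers can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the records, the `to_silver` and `to_gold` helpers, and the choice of one dimension table plus one fact table are all hypothetical, and a real pipeline would do this inside a warehouse or lakehouse instead of in-memory lists.

```python
# Bronze: raw records exactly as ingested, duplicates and nulls included
bronze = [
    {"job_id": 1, "title": "Data Engineer", "country": "US"},
    {"job_id": 1, "title": "Data Engineer", "country": "US"},  # duplicate
    {"job_id": 2, "title": "Data Engineer", "country": None},  # missing country
    {"job_id": 3, "title": "Data Analyst", "country": "DE"},
]


def to_silver(rows):
    """Clean and validate: drop repeated job_ids, fill missing countries."""
    seen, cleaned = set(), []
    for row in rows:
        if row["job_id"] in seen:
            continue
        seen.add(row["job_id"])
        cleaned.append({**row, "country": row["country"] or "Unknown"})
    return cleaned


def to_gold(rows):
    """Model for analysis: a title dimension and a fact table that references it."""
    titles = sorted({row["title"] for row in rows})
    dim_title = [{"title_key": i, "title": t} for i, t in enumerate(titles, start=1)]
    key_by_title = {d["title"]: d["title_key"] for d in dim_title}
    fact_postings = [
        {
            "job_id": row["job_id"],
            "title_key": key_by_title[row["title"]],
            "country": row["country"],
        }
        for row in rows
    ]
    return dim_title, fact_postings


silver = to_silver(bronze)
dim_title, fact_postings = to_gold(silver)
print(silver)
print(dim_title)
print(fact_postings)
```

The bronze list is never modified, which mirrors the idea of keeping the raw layer unchanged so later layers can always be rebuilt from it.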
But now we need to dive into the tools to understand how we'll be doing these transformations. And for this, we have two main options. The first option is to use SQL for your transformation. This is the language used to query and manage the data stored in a database. For this you write a SQL query of what you want to transform. This is sent to the database to manipulate the tables and then from there you get an output from that database. So for this form of transformation we're using a SQL engine to commonly operate on either a data warehouse or a data lakehouse. Now the great thing about SQL is we don't have to worry about data size. Modern warehouses automatically scale compute across multiple machines under the hood.
Now, SQL is great when working with structured data for things like joining tables and aggregating and grouping.
It's a beast at doing this and is my number one choice. However, there are times when SQL is going to hit a wall and we need something else, such as if we have data that's nested within a JSON. SQL's not too good at this. Or if we need to do some sort of natural language processing or ML feature engineering to classify messy text. This is where option two comes in. As data engineers, we typically reach for the programming language of Python. Although Scala and Java do appear in postings, they're just not as popular. Now, Python is a multi-purpose programming language.
So, it's great at unpacking nested data, but even better at things like AI and machine learning. So, this would be the tool used when we have those outlier cases that we need to transform data.
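As a rough sketch of what unpacking nested data can look like, the snippet below flattens a nested JSON record into flat columns and then produces one row per skill. The record and the `flatten` and `explode` helpers are hypothetical, written only for this example.

```python
import json

# A hypothetical nested record, similar in spirit to an API result
raw = """
{"job_id": 7, "title": "Data Engineer",
 "company": {"name": "Acme", "location": {"country": "US", "city": "Austin"}},
 "skills": ["sql", "python"]}
"""


def flatten(obj, prefix=""):
    """Recursively turn nested dicts into one flat dict with joined key names."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


def explode(record, list_key):
    """Produce one row per element of the list stored under list_key."""
    base = {k: v for k, v in record.items() if k != list_key}
    return [{**base, list_key: item} for item in record[list_key]]


record = flatten(json.loads(raw))
rows = explode(record, "skills")
for row in rows:
    print(row)
```

Each output row is flat and tabular, so it can be loaded into a structured table and handled with SQL from there.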
And you typically see this implemented in either something like a data lake or lakehouse where the storage is separated from compute. However, times are changing and modern data warehouses like BigQuery and Snowflake now support Python natively in a warehouse, although it's not super common. Now, to help SQL and Python out, there's two popular tools used for these transformations.
dbt, or data build tool, is used for managing SQL transformations, and then Apache Spark is used for scaling Python transformations. Let's dive into dbt first. Now, for simple transformations where there aren't a lot of tables, it's more than acceptable to just use vanilla SQL. You don't need dbt. However, let's take my use case of datanerd.tech and the transformations I have to go through. I have to first make my main table, then all my other ones.
These transformations have to be run in order. Oh, and I still have to do a bunch of transformations to get from silver to gold. This is not simple. And this is where dbt comes to the rescue.
Well, we're still going to be using SQL to perform these transformations, but dbt is used to manage the dependencies, or specifically what order you're going to be running all these queries in. So, this tool manages that entire transformation process. It also brings in popular software engineering practices like version control, testing, and documentation. There are no major competitors at scale for dbt, so you really won't go wrong learning this.
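The dependency problem itself can be illustrated without dbt. The sketch below is not dbt's API or how dbt is implemented. It only shows the underlying idea, using Python's standard `graphlib` module and made-up model names: declare what each model selects from, and let a topological sort work out a valid run order.

```python
from graphlib import TopologicalSorter

# Hypothetical models: each name maps to the set of models it selects from
models = {
    "bronze_postings": set(),
    "silver_postings": {"bronze_postings"},
    "dim_skills": {"silver_postings"},
    "dim_companies": {"silver_postings"},
    "fact_job_skills": {"silver_postings", "dim_skills", "dim_companies"},
}

# static_order() yields every model after all of the models it depends on
run_order = list(TopologicalSorter(models).static_order())

for step, model in enumerate(run_order, start=1):
    print(step, model)
```

Adding a new model only means adding one entry with its upstream models. The run order is derived instead of maintained by hand, which is the part that gets painful with a folder of vanilla SQL scripts.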
Now, moving into Python, let's dive into Apache Spark. Now, similar to how SQL doesn't always require dbt, you don't always need Spark with Python, especially if you can keep it on a single computer. Take for example datanerd.tech. For this, I just use vanilla Python to unpack those JSON results from our generation stage. And this works because it's only a small subset of incoming data that I can fit on a single computer. However, when you're dealing with big data, you need a cluster of computers. This is where Spark comes to the rescue, as it's a distributed processing engine that splits work across a cluster of machines to process data at scale. So for example, for a data warehouse, you could use Spark during ingestion and transformation in cases that SQL can't handle.
Alternatively, if you have a data lakehouse, you could use it right alongside your data inside of there. You wouldn't necessarily need to do it at ingest. Similar to dbt, there aren't a lot of competitors. However, tools like Polars and Dask are gaining ground for midsize data.
All right, we've got one more stage to get to before we get into the undercurrents, and that is the serving stage. This is all about getting the right data to the right consumer in the right format. There are three main areas where this is delivered: business intelligence and analytics, machine learning and artificial intelligence, and reverse ETL. Let's dive into each.
For business intelligence and analytics, let's look at the who and what that's involved. Well, data analysts are the main ones involved in this.
They're going to be the ones querying and investigating your gold tables.
However, whenever data analysts can't be found, this is more commonly becoming part of a data engineer's role as well, especially in smaller teams like startups. This is done by providing reports that analyze what's going on or providing dashboards so managers can explore further. Looking at the popularity of these tools for data engineers, Power BI is the clear winner for dashboarding tools and, surprisingly, Excel even shows up for them.
Moving on to machine learning and AI. Data scientists have historically been in charge of this area. However, we're seeing a growth in other roles, such as AI or machine learning engineers, that work in this area as well. Now, these jobs may find themselves building dashboards, but primarily they're focused on building models, and Python has been the dominant tool of choice for this. Last up is reverse ETL, and this revolves less around a who and what.
Instead, this actually goes back to the life cycle. Specifically, with reverse ETL, we're sending data that we've cleaned up back into that generation stage. This is commonly done in platforms we've seen before, such as CRM tools, ad and marketing platforms, and then from time to time, ERP tools. Data that you've cleaned up as data engineers is ingested back into these systems and used in a different manner. So, this now wraps up getting your data from those source systems all the way to your end consumers.
But in the real world, you're going to hear a different acronym, either ETL or ELT, that covers this type of process. Now, let's start with ETL, or extract, transform, and load. This is the legacy pattern that deals with extracting your data from those source systems, then transforming it in order to clean up the data to get it presentable, and then finally loading it, specifically into something like a data warehouse. This was commonly done in the pre-cloud era when warehouse compute was expensive. That's in contrast to what we discussed earlier with the medallion architecture: just loading your raw data as is directly into your analytical system and then cleaning it up from there. This is the newer form, ELT, where you extract out of that source system, load it into either your data warehouse, data lake, or lakehouse, and then once it's inside of there, perform all the transformations. This is the modern pattern because cloud compute got cheaper and more powerful. Also, if you caught it, ETL was designed back when only data warehouses existed. The introduction of lakes and lakehouses has refined the approach as well.
Now, ELT saved my butt. It's the approach that I've used from the very beginning with building out datanerd.tech. Anyway, I recently went through and not only redesigned the app, but also the data pipeline to streamline it.
Because I loaded and have always preserved that raw data, I found during my rebuild that I actually had missing job attributes. If I had used ETL and not preserved that raw data, I would have never found those insights. So, super helpful.
Let's now shift into understanding the undercurrents of the life cycle. So, we just looked at what happens to your data. With this, we're getting into how to do this work well in production, with four major areas: orchestration, automating when and in what order tasks run; software engineering, using a computer science approach to develop, operate, and maintain software; DataOps, to automate, observe, and respond to any incidents; and finally data architecture, putting all of the principles that we've covered in this video together to build out a sustainable pipeline.
Let's dive into orchestration first. Before going to any tools, let's understand what qualifies as orchestration. This controls when to start jobs, what order to run those jobs in, what the heck to do if they fail, and what their status even is. This is handled under the areas of scheduling, dependency management, failure handling, and observability. This is commonly done by managing every stage of the ELT process with something called a DAG.
You're probably like, Luke, what the heck is a DAG? Well, this stands for directed acyclic graph. Directed means that work flows one way, so from task A to B. We demonstrate this as an arrow going from one to the other. Acyclic means that there are no loops. Task A can't loop back to task A. Everything needs to run only once. And then finally, graph. It's, well, a graph. The nodes of A, B, and C represent the tasks or jobs to complete.
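Here is a minimal sketch of those orchestration ideas in plain Python, using the standard `graphlib` module. It is not Airflow or any real orchestrator. The `run_dag` helper, the task names, and the simulated failure are all hypothetical. It covers ordering, a simple retry on failure, skipping anything downstream of a failed task, a status report, and rejection of graphs that contain a loop.

```python
from graphlib import CycleError, TopologicalSorter


def run_dag(tasks, dependencies, max_retries=1):
    """Run tasks in dependency order and return a status per task."""
    status = {}
    # Building the order raises CycleError if the graph contains a loop
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        upstream = dependencies.get(name, ())
        if any(status.get(dep) != "success" for dep in upstream):
            status[name] = "skipped"  # never run on top of a failed upstream
            continue
        for _attempt in range(max_retries + 1):
            try:
                tasks[name]()
                status[name] = "success"
                break
            except Exception:
                status[name] = "failed"
    return status


attempts = {"load": 0}


def extract():
    pass


def load():
    # Simulated flaky step: fails the first time it is called
    attempts["load"] += 1
    if attempts["load"] == 1:
        raise RuntimeError("warehouse temporarily unavailable")


def transform():
    pass


def serve():
    pass


tasks = {"extract": extract, "load": load, "transform": transform, "serve": serve}
dependencies = {
    "extract": set(),
    "load": {"extract"},
    "transform": {"load"},
    "serve": {"transform"},
}

status = run_dag(tasks, dependencies)
print(status)

# A graph where a depends on b and b depends on a is not acyclic
try:
    TopologicalSorter({"a": {"b"}, "b": {"a"}}).prepare()
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

With one retry allowed, the flaky `load` step succeeds on its second attempt and the rest of the pipeline runs. With `max_retries=0`, `load` would be marked failed and `transform` and `serve` skipped.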
And then the lines between them represent their dependencies. So a general DAG structure would have us extract from our source systems, load into our data warehouse, transform using something like dbt, and then finally serve the data.
Now, for orchestration tools, Airflow is the most popular option by far. However, there are alternatives such as Prefect and also Dagster. Airflow is an open-source option that allows you to write your DAGs in Python. And the cool thing about it is that it has its own user interface that allows you to monitor your DAGs as they run, and you can rerun them with the click of a button. Now, this raises the question: do you always need Airflow? I would argue no. My first version of datanerd.tech used Airflow. The problem, however, is that Airflow is required to be running 24/7, and this can get really expensive, especially if you're using cloud computers. So when I rebuilt it as version two, I built it not to depend on Airflow and instead just use Google Cloud's different tools to manage all the different capabilities needed to fully be considered orchestration. I would argue that this new version was actually easier than Airflow because I didn't have to learn a new library, but I did have to understand the engineering behind what was needed to orchestrate this inside of Google by piecing this all together.
So that is orchestration. We now need to get into one of my favorite topics: software engineering. This is a disciplined approach to developing, operating, and maintaining software using engineering principles. Basically, computer science nerds came up with this process. And there are three main pillars: operate, develop, and maintain.
Operate means how do I run my code?
Develop goes over how do I update my code? And maintain goes over how do I test my code? With this, we're going to dive directly into the tools for each. Operate is done via the command line, which takes, well, two things: one, your terminal, and two, the language inside the terminal, which in our case is Bash. The terminal allows you to programmatically run commands to control your computer. This is how chatbots are building all these apps and interacting with your computer system.
They're running Bash commands, or, well, zsh on my Mac, to operate on my system. You can use commands for everything from running SQL queries and Python pipelines to deploying to the cloud, running transformations, and even version control. This brings us to develop: how do I update my code? This is handled by version control, and there are two popular tools for it: Git, the actual version control system itself, and GitHub, the online platform for collaborating with others on your Git repositories. Let's break both of these down. Git is a free and open-source tool that you install on your computer to manage version control of your pipelines.
There are really no competitors to it.
With Git, you have different areas where you can store your changes and then commit to managing those changes. Now, related to this is GitHub. This is Git hosting in the cloud, for your remote repositories. This is where other people besides yourself can go and actually access your code. Now, GitHub does have alternatives, and it's been getting a lot of flak recently for its downtime, especially since its acquisition by Microsoft. Anyway, how do these interact with each other? Well, using your terminal, you create your code and then from there, you'd add it to something like your staging area.
When that's all in tip-top shape, you can then commit it to your local repository. When you want to share it with others, you would then push it up to your remote repository on GitHub. If you or somebody else made changes to that remote repository, you could then fetch and merge them in order to work on them. Anyway, we just talked about the terminal, so I wanted to show you that there are actually terminal commands for each of these steps. Last up is maintaining. How do I test my code? This is going to depend a lot on the programming language and software that you use. If you're using Python, you can use the pytest library. For SQL, the best way to manage tests is inside of dbt. Anyway, testing is sort of boring to me, so we're going to continue on.
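For a sense of what a Python test looks like, here is a minimal sketch in the style pytest expects: plain functions whose names start with `test_` and that use bare `assert` statements. The `clean_salary` function and its behavior are hypothetical, invented only for this example.

```python
def clean_salary(raw):
    """Hypothetical transformation: turn '$85,000' into 85000.0; blanks become None."""
    if raw is None or raw.strip() == "":
        return None
    return float(raw.replace("$", "").replace(",", ""))


def test_clean_salary_parses_currency():
    # A formatted currency string should come back as a plain float.
    assert clean_salary("$85,000") == 85000.0


def test_clean_salary_handles_blank():
    # Blank input should become None rather than raising an error.
    assert clean_salary("  ") is None
```

If this were saved in a file such as `test_salary.py` (the file name is also an assumption), running `pytest` in that folder would discover and run both tests.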
Next up on our undercurrents is DataOps. DataOps consists of three main pillars. First is automation. We're lazy, so we want to automate things. Next, we don't want to get fired, so we need observability. And finally, we've got to actually work, so we need incident response. Automation in this sense is about ensuring code reaches production safely without manual steps. Basically, spending days to automate something that probably takes 5 minutes to do. But on a serious note, this actually falls under continuous integration and deployment, which we'll cover in just a sec. Then there's observability: knowing what's happening with your pipeline and data.
This is done through logs, quality checks, and lineage tracking, which tools like DBT and Airflow are great at helping with. And then from there comes incident response: detecting, responding, and learning when things break. These are things like alerts and notifications to let you know what's going on. Now back to automation.
Let's understand how continuous integration and continuous deployment actually work. So we talked previously about using Git to manage our code. Whenever we run git push on our final code, we then perform automated checks before the code goes into production, such as running any tests and linting the code, which is basically spellchecking and validating syntax. Once those checks all pass and we've verified that our code is clean and ready to go into production, we can then move into continuous deployment. So getting it out there by deploying it, applying any changes that are necessary, and of course monitoring it. This safely sails the ship. And it's called continuous integration and deployment because you repeat this process over and over again as you're pushing new changes to production.
All right, getting into the last and most important undercurrent, and that is data architecture. This is the set of structural decisions about how your data systems fit together: deciding where this pipeline will run, how fast our data needs to move, what storage abstraction we're going to go with, in what order we should move and transform, what the system is optimized for, and how we layer the transformations. So data architecture is putting that all together.
All right, so now that we have a foundation in understanding the data engineering life cycle, we can truly move into understanding what tools we should be learning and in what order.
For this, we're going to be using a tier list. For anyone unfamiliar with this, S is the top tier, or crème de la crème, of what you need for a job. And then it goes from A all the way to E, where E is like, okay, not really necessary for this. I want to start with a list of tools. We're going to be using these top ones shown here, plus Bash. So, we're going to be walking through each of these tools on the right-hand side and moving them into their respective tiers. One note on cloud technologies: whether it was a cloud platform or an actual cloud tool like BigQuery, it's all combined into this one cloud symbol. And this just means you need to have an understanding of cloud platforms. For this, we're going to jump around a little bit. I want to start with the most foundational skills first, the ones you have to learn to have a job. First is SQL. It's a non-negotiable. Every data warehouse runs on it. Second only to SQL would be Python. This is the language of choice for orchestrating your pipelines; it will manage your ingestion automations, and it glues everything together. With that, it's vital that you can put your SQL and Python code into the cloud and use one of the cloud platforms' respective data warehouses at minimum.
So those are the three most essential skills to start. There may be more.
We're going to start now at the bottom in the E tier. With this, I'm going to move two technologies, specifically Java and Scala. Both of these are programming languages. I've primarily found them in legacy systems or enterprise systems that are maintaining these code bases.
So, if you're applying to one of those roles, you may need to move this up.
Next up, in the D tier, I'm going to place BI tools. They're not foundational to data engineering roles. However, if you find yourself applying to analytics engineer roles, I'd bump these up. These are tools that you can learn in a weekend.
Moving into C tier, this is going to be reserved for data platforms like Snowflake and Databricks. If you've mastered those cloud technologies up in the S tier by piecing together your pipeline, you're going to have no problem adapting to and learning a data platform.
Moving into the B tier, we'll start with the orchestration tool Airflow. If you're targeting enterprise roles, this may be more of a must and I'd bump it up to A. At minimum here in the B tier, I think you need to be interview-level aware of this. Can you talk about it? Not necessarily, have you built a lot of things with it? After you learn the basics of Airflow, I'd then move on to Spark. Remember, this is for transformations that you can't necessarily do in SQL and that need scaling. Similar to what I said about Airflow, if you're applying for an enterprise role, bump it up. Next up on the list is Kafka. This is for when you need to switch from batch to streaming. Of all of the tools on this tier, it's the lowest priority, and I'd even consider moving it down to C tier.
All right, on to that A tier. This is where DBT is going to go. And I listed it in A tier for two main reasons. One is that you need to know not only SQL, but also Python in order to run this tool. So, you need to get that S tier first. Also, you're more than capable of building pipelines without DBT your first time, and you probably should.
And then you can see just how powerful this tool is. This leaves two tools, and they're at the bottom of the list, but both of these are highly undervalued. Git for version control and Bash used inside of a terminal are two essentials required in order to build a data pipeline. You can't build a pipeline without them. Now, I feel the S tier alone is enough to get you job ready and start applying, but I think it would set you apart if you also had that A tier of DBT. And just to prove the power of this, my data nerd life cycle runs on just this: Python for ingestion, SQL coupled with DBT for transformations, Google Cloud for my storage and also orchestration, and then Bash or the terminal to run all my code, along with Git and GitHub for version control and deployment. So, where should you start in learning this S tier? Well, conveniently, I have two courses geared just for this. The first is my SQL for data engineering course on YouTube, and the second is my Python for data analytics course. Don't let the name deceive you. It teaches you all the fundamentals of Python you need to know.
Anyway, in the SQL course, you're going to build a real-world pipeline in the cloud and you're going to manage this entire process using the terminal, Git, and GitHub. In the Python course, similar to the SQL course, we're going to start at the very beginning, going over all the necessary coding basics to get you to become an expert at coding in this language. We'll also build a project in this one and deploy it using the terminal, Git, and GitHub. And the best way to land your job while also learning the skills is clearly projects, as told by 4,000 of my subscribers. That's why in both my courses, you build real-world projects, such as in my SQL course, where you build an end-to-end data pipeline. All right. So, if you'd like to get everything from this video in a course, I have a 10-day course for that. It has not only this, but a lot more, including my three-step approach to landing a job.
If you're ready after that to jump into my course, here's the SQL for data engineering course. All right, with that, I'll see you in the next one.