Databricks provides Structured Streaming capabilities with two trigger modes (available now for batch-like jobs and processing time for continuous data flow), watermarking to handle late data in windowed aggregations, and three output modes (append, update, complete). Auto Loader automatically ingests new cloud files from storage systems like S3, Google Cloud Storage, and Azure Data Lake Storage with schema evolution support, using either directory listing or file notification modes. Unity Catalog integration enables secure cloud storage access through IAM roles and policies, requiring proper credential configuration and external location setup. Delta Live Tables offers a declarative framework for building ETL pipelines with built-in data quality checks, CDC support, and automated orchestration.
Deep Dive
Databricks Streaming, Auto Loader, and Delta Live Tables
All right. When you are using Structured Streaming, you might want to know when things will get triggered. There are two options: available now and processing time. With available now, the idea is that it runs once, processes all the data that is available, and then stops. This is really good if you are doing batch-like jobs or Auto Loader. You set it with trigger(availableNow=True). With processing time, the idea is that it runs repeatedly at a fixed interval. This is great if you've got data constantly flowing when you're using Structured Streaming. But let's take a look here at watermarking and late data handling.
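Before moving on, the two trigger options can be sketched in PySpark. This is a minimal sketch rather than code from the lesson: `trigger_kwargs` and `start_stream` are hypothetical helper names, the one-minute interval is an illustrative default, and `df` is assumed to be a streaming DataFrame you already have.

```python
def trigger_kwargs(mode: str, interval: str = "1 minute") -> dict:
    """Map the two trigger options to DataStreamWriter.trigger() keyword arguments."""
    if mode == "available_now":
        # Process all data available at start, then stop (batch-like runs).
        return {"availableNow": True}
    if mode == "processing_time":
        # Fire a micro-batch repeatedly at a fixed interval.
        return {"processingTime": interval}
    raise ValueError(f"unknown trigger mode: {mode!r}")


def start_stream(df, table_name: str, checkpoint_path: str, mode: str):
    """Start a streaming write of `df` (assumed to be a streaming DataFrame)."""
    return (
        df.writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(**trigger_kwargs(mode))
        .toTable(table_name)
    )
```

The checkpoint location is what lets an available-now run pick up where the previous run stopped, which is why it appears in both cases.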
In streaming, data can arrive late or out of order. Watermarking tells Spark how long to wait for late data before finalizing a window. So what exactly is watermarking? It defines a threshold for event-time delay. Spark drops late records that arrive after the watermark.
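As a rough sketch of that rule, assuming an `event_time` column and illustrative values of a 10-minute delay and 5-minute windows: `is_dropped_as_late` is a hypothetical, simplified model of the threshold (Spark's real bookkeeping works per window and per micro-batch), and `windowed_counts` shows how the same idea is usually written with `withWatermark`.

```python
from datetime import datetime, timedelta


def is_dropped_as_late(event_time: datetime, max_event_time_seen: datetime,
                       delay: timedelta) -> bool:
    """Simplified model: the watermark trails the latest event time seen by `delay`."""
    watermark = max_event_time_seen - delay
    return event_time < watermark


def windowed_counts(events, time_col: str = "event_time"):
    """Count events per 5-minute window, tolerating up to 10 minutes of lateness."""
    # Imported here so the helper above runs without Spark installed.
    from pyspark.sql import functions as F

    return (
        events.withWatermark(time_col, "10 minutes")
        .groupBy(F.window(F.col(time_col), "5 minutes"))
        .count()
    )
```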
Here we have withWatermark setting the event-time threshold to 10 minutes. This means Spark will wait up to 10 minutes for late events, based on the event time. It prevents unbounded state growth in windowed aggregations and balances accuracy and resource usage. An example here would be a groupBy where we specify the window with the aggregate, and then we have our withWatermark information there. The idea is that it will process data in five-minute windows; late events within 10 minutes are included, and anything later than that is just discarded. Okay, let's take a look at output modes in streaming. Spark supports three different modes based on what you want.
First is append. Only new rows are written, and it's used when there are no updates to existing rows. So, here you can see append.
It's great for logs, events, and insert-only data. We have update, which writes only updated rows and requires the query to have a unique key or state. It's best for rolling counts or running totals. We have complete, which outputs the entire result table each time.
And this is great for full aggregation.
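The three modes map onto `outputMode` on the stream writer. A minimal sketch, assuming a streaming DataFrame `df` and the console sink (chosen here only because it accepts all three modes); the use-case labels and the helper name are hypothetical.

```python
OUTPUT_MODE_FOR = {
    "insert_only": "append",         # logs, events: only new rows are written
    "running_totals": "update",      # only rows that changed since the last trigger
    "full_aggregation": "complete",  # the entire result table is written each time
}


def start_console_query(df, use_case: str):
    """Write a streaming DataFrame to the console with the matching output mode."""
    return (
        df.writeStream
        .outputMode(OUTPUT_MODE_FOR[use_case])
        .format("console")
        .start()
    )
```

Note that complete mode only makes sense for aggregating queries, and append mode on an aggregation generally needs a watermark so that Spark knows when a window is final.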
So those are your three modes. Let's make sure we know the difference between triggered versus continuous pipelines.
We'll look at the features on the left-hand side and go through them one by one.
For execution mode, triggered runs in micro-batches on a schedule, while continuous runs continuously as data arrives. For latency, triggered is higher and depends on the trigger interval, while continuous is lower, near real time, because it's processing data continuously. Triggered is more cost effective because it runs more infrequently, whereas continuous is constant, so it's going to cost you more. Triggered is great for batch jobs and periodic reports, whereas continuous is for real-time dashboards and alerts.
With triggered, the idea is that you're starting and stopping compute, whereas with continuous the compute is always running, so that's going to cost you money. Triggered has available now and processing time, as we saw, but continuous has no triggers and is always running. So use triggered for scheduled, cost-conscious workloads.
Use continuous for low-latency, real-time needs. Hey, this is Andrew Brown. Let's take a look here at Auto Loader, which automatically ingests new files from cloud storage with no manual tracking. It uses the cloudFiles source in Structured Streaming to ingest new cloud files incrementally, with support for schema evolution, multiple formats, and massive scale. So imagine you have your cloud storage, AWS S3, Google Cloud Storage, whichever one you want to use, and you create new files in there. Auto Loader is going to load them into Databricks.
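In code, that idea looks roughly like the sketch below. It assumes a Databricks notebook where `spark` is available; the helper names and any paths you pass in are placeholders, not values from the lesson.

```python
def auto_loader_options(file_format: str, schema_location: str) -> dict:
    """Options for the cloudFiles (Auto Loader) streaming source."""
    return {
        "cloudFiles.format": file_format,              # e.g. "json", "csv", "parquet"
        "cloudFiles.schemaLocation": schema_location,  # where the inferred schema is tracked
    }


def read_new_files(spark, source_path: str, schema_location: str,
                   file_format: str = "json"):
    """Incrementally read new files from cloud storage as a streaming DataFrame."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**auto_loader_options(file_format, schema_location))
        .load(source_path)
    )
```

A call such as `read_new_files(spark, "s3://<your-bucket>/<folder>/", "<schema-path>")` would then be followed by a `writeStream` with a checkpoint location, which is where Auto Loader keeps track of the files it has already ingested.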
Okay, pretty straightforward. It's best used if you have large and continuously updating data sources, for obvious reasons. Let's talk about key features. It auto-detects and ingests new and existing files in formats like CSV, JSON, Parquet, Avro, and more. It supports S3, Azure Data Lake Storage, Google Cloud Storage, Azure Blob Storage, and the Databricks File System. It handles schema drift and evolution. It stores state in RocksDB checkpoints for exactly-once processing. It integrates with DLT for production-grade pipelines. It has two modes, directory listing and file notification, which we will talk about shortly. It uses cloudFiles in Structured Streaming to read new CSV files from cloud storage incrementally. So there on the left-hand side you can see the format for cloudFiles, and we will see that more when we actually dive into a lab. Auto Loader beats plain read streams with scalable ingestion, automatic schema handling, and lower-cost file detection. So consider using Auto Loader over just using your standard old readStream. Let's compare the two modes.
We have directory listing, which is the default mode, and file notification. Directory listing scans the directories to discover new files, whereas file notification uses cloud-native event services to detect file changes. Directory listing has a simple setup and no external dependencies, whereas file notification is faster and more cost effective at scale.
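Switching between the two modes comes down to one Auto Loader option. A minimal sketch, with a hypothetical helper name; file notification mode also needs cloud-side permissions for the event services, which this sketch does not set up.

```python
def file_detection_options(use_file_notification: bool) -> dict:
    """Extra cloudFiles options for the chosen file detection mode."""
    if not use_file_notification:
        # Directory listing is the default, so no extra option is needed.
        return {}
    # File notification mode: rely on cloud-native events instead of listing.
    return {"cloudFiles.useNotifications": "true"}
```

These options would be merged into the reader with something like `spark.readStream.format("cloudFiles").options(**file_detection_options(True))`.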
For directory listing, we have slower for larger directories, where it's supposed to line up against the other one. I think the slide is just out of order, but we'll come back to that one in a second.
Use directory listing for quick and easy setup, and choose file notification for large-scale ingestion with better performance and lower cost. Going back up one step.
So for directory listing, it's slower for large directories. File notification requires additional setup. Let's talk about the Auto Loader schema inference limitation. By default, Auto Loader may infer all fields as string, especially from formats like JSON, which don't enforce strict typing. So here's an example where we have our structure, and Auto Loader might infer it as string, string, string. Even if price should be a float and in stock should be a boolean, each will be a string. So what do we do in this case? We provide an explicit schema using schema or a schema location, or enable the inferColumnTypes option if it's supported. There's an example of turning it on. This ensures correct types are used and avoids downstream issues with transformations or validations. So there you go.
Hey, this is Andrew Brown. In this video we're going to try to figure out how to get data from Amazon S3 into Databricks by using Auto Loader. The guide also seems to suggest that Unity Catalog might be something that we'll incorporate. So, we'll use this as our guide and try to work through it.
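Before the lab gets going, the schema inference fix described above can be sketched as follows. Only `price` and `in_stock` come from the lesson's example; the `product` field, the helper names, and the paths are assumptions for illustration.

```python
# Hypothetical schema for the example record, written as a Spark DDL string.
PRODUCT_SCHEMA_DDL = "product STRING, price FLOAT, in_stock BOOLEAN"


def inferred_type_options(schema_location: str) -> dict:
    """Let Auto Loader infer real column types instead of defaulting to strings."""
    return {
        "cloudFiles.format": "json",
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.inferColumnTypes": "true",
    }


def read_with_explicit_schema(spark, source_path: str):
    """Option 1: provide the schema yourself so nothing is inferred."""
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .schema(PRODUCT_SCHEMA_DDL)
        .load(source_path)
    )


def read_with_inferred_types(spark, source_path: str, schema_location: str):
    """Option 2: keep inference but ask for typed columns."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**inferred_type_options(schema_location))
        .load(source_path)
    )
```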
I'll just keep the guide off screen here.
If we need to pull it up, we'll work through it. I find that, for the most part, not all of these guides work, so we might have a detour, but I think it might work. Also, I wonder, if we switch here, could it tell us for GCP?
Oh, it does. That's cool. We can switch over to Google Cloud Storage or AWS here, or SAP, but we'll stick with AWS for now. Okay.
So here it's suggesting we create a cluster, pick a runtime, and then make a notebook. I'm just going to go ahead and create ourselves a new notebook. And you know, actually, I'll name it this time. Last time I didn't name my notebook; this one will just be "load data S3". Okay, so we have that there. It's going to suggest that we use a cluster. We already have our serverless one, so I'll go ahead and connect it here. It literally tells us to make a new cluster. We're not doing that. We'll try to use the serverless one here. We're going to drop it down to use the SQL language. Here it shows a notice about changing the default language away from Python.
That's totally fine. I don't care. And so now we are in SQL mode. Here they actually have it in both SQL and Python.
But we'll go ahead here, and we need to put in the path to our source data. Now, I think our source data is an S3 bucket address. I think that's what it is, because if we go back over here and look up Auto Loader for Databricks, I'm pretty certain that it's showing that address. Right.
So, replace the path to source data with the path to the directory that contains your data. Okay. "To view a sample of the records to better understand the contents" of each thing. So is this before we uploaded, or... hmm, let me go figure this out, one second. Okay.
So yeah, here it's suggesting an external location path as a bucket. The thing is, how is it going to get permissions to S3? Right now I'm in the Databricks intelligence platform and I'm not running this in AWS. Do I have to go launch that there first? Is there another way to get permissions? I'm expecting this to fail because I haven't connected AWS. Anyway, I am logged into my AWS account, so we can go ahead and give that a go. I'm going to go over to my account, over into here.
And I'll go here to S3. I'm going to create myself a new bucket. I've got a bunch of buckets I need to delete. This is my developer account, so all those buckets can go. I'm going to call this databricks autoloader and just put some numbers on the end. I'm trying not to do 1 2 3 4; it seems like those are all the numbers I ever do. We'll go all the way down to the bottom and hit create bucket. How would it access this bucket? We would have to update our policy, do something there, but I'm going to go over to Databricks. So now we're into Auto Loader. We're going to need some kind of data structure here. I'm going to go over to Claude and say, you know, give me a JSON Lines file.
Well, does it support JSON Lines or just JSON? That's something we'll have to go back and double check. If we go here, it tells us JSON, CSV, etc. So I'll just say, you know, "I am using Auto Loader for Databricks.
I need to generate a JSON file to test for loading." Okay.
So, we'll go ahead and do that. But again, it doesn't say whether it's JSON Lines here, which is what I would expect it to be. It just says JSON. But anyway, we're getting data here. We'll give it a try. So, whatever data it wants to produce, go ahead and grab it.
I'm going to copy this. You know, get data however you've got to get data. And I'm going to go back over to Visual Studio Code, just so I have somewhere to save it for a moment. I'm going to go here and make a new text file. I'm going to drop the data in here. So now I have this here. We'll go ahead and save it.
Sure, we'll just go with "example data". Okay. So that is now saved. I'm going to upload that to S3.
I'll reveal that in my explorer if I can find it here. I'm just right-clicking here.
Reveal in Explorer.
There we go. What I'm going to do is go over to here, and I'm just dragging that onto here to upload that file. But again, I'm expecting a permissions issue. So here we have the file. That's our destination. Let's go back over to here. I probably should have put it in a folder. It's not going to matter, but it's probably good to put things in a folder. Now we want to go back over to our Auto Loader one here. Where did it go?
We have it here.
Okay. So here it's saying LIST and then the source path. We'll go ahead and try that. My expectation is that it's not going to work, because how would it work? It doesn't have permission. So we'll go ahead and run that.
Okay.
I'm expecting that to fail. So there obviously has to be some way to connect this with AWS. I'm going to go over here in another tab and see if there are any options. Sometimes over here there might be something for access to cloud providers.
But I'm not seeing that there. We'll go back over to this tab. Here it says no such file found. That totally makes sense. Let's try "diagnose error" and see if it makes any suggestions here.
This obviously has some kind of AI here, but that's not going to be helpful. Like, look: "I'm trying to list an S3 bucket, but I haven't connected AWS in any way.
How do I do that?"
Does it even know? How would it know that? "Ensure that your Databricks workspace has the necessary IAM roles and policies to access the S3 bucket. Configure the credentials using IAM roles, or the same with access keys. Once the credentials are configured, you can use dbutils." It really wants to use dbutils here. "Ensure your Databricks workspace has the necessary IAM roles." So we are in this workspace here; there must be an option to configure it here.
I don't know. Give me a moment and let me see if I can figure it out. Okay. All right. So, it's suggesting that we need to use Unity Catalog as a means to connect. Over here we have those instructions. Let me give it a read and see how to do that. Okay. So, interestingly, we have a lot of requirements here. They suggest keeping it in the same region. Do not use dot notation in your bucket name. It's not telling us how to create it as of yet, but again, I'll just continue to read here and then speed us through it. Okay. All right.
So, let's go ahead and create an IAM role with a custom trust policy. I'm going to go back over to here. We obviously have our bucket, but we'll just make a new tab. I'm going to go over to IAM, and what are we creating? A policy. A policy, right? Yeah, it's a custom policy. So, we'll go over here and create ourselves a policy. I think we just go over here right away.
We don't have to fiddle around with any of that stuff. We'll go ahead and copy this, paste it in, and take a look here. So, we have a principal.
If you don't know what a principal is, it's whoever you're saying is going to be given access. Here we're specifying a specific Databricks account. So this must be Databricks, and they're saying allow the Unity Catalog role to be able to assume a role. And then here we have this external ID. I'm not sure if we need that. Question is, why are we getting some errors here?
This template looks okay to me. But we'll go down below: "Add a Resource or NotResource element to the policy statement." I guess the thing is that it's saying this is the trust policy, so I'm wondering if this is the right place. Let's go back a step. Okay, this is creating a policy. But normally when you create a role... let's go take a look at a role for a second.
Here it is. Okay. Custom trust policy.
There it is. Okay. It's actually supposed to go here. Now, generally an external ID is used in AWS to match against another account. They'll have an ID and you'll match that ID, so I'm imagining they probably want us to do that here as well. I have a few too many tabs open, so I'm just going to close a couple so that we have a bit of focus here.
Okay.
And there must be something about the external ID. So, "the policy sets the external ID to 0000 as a placeholder.
You'll update this to the external ID of your storage credential in a later step." Oh, okay. Later. All right, that's fine. "Skip the permission policy configuration. You'll go back to this step later." What? So, create an IAM role with a custom trust policy.
Okay. The policy establishes a cross-account, whatever, whatever. That's fine.
Okay. So, skip the permission policy.
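The shape of that custom trust policy can be sketched as a Python dictionary. This is a minimal sketch, not the exact template from the guide: the Databricks principal ARN below is a placeholder you would copy from the Databricks documentation, and `initial_trust_policy` is a hypothetical helper name.

```python
import json


def initial_trust_policy(databricks_principal_arn: str,
                         external_id: str = "0000") -> dict:
    """Trust policy letting the Databricks principal assume this role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The principal is who is being given access.
                "Principal": {"AWS": [databricks_principal_arn]},
                "Action": "sts:AssumeRole",
                # Placeholder external ID, replaced in a later step.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


# Placeholder ARN: take the real value from the Databricks guide.
policy = initial_trust_policy(
    "arn:aws:iam::<DATABRICKS-ACCOUNT-ID>:role/<UNITY-CATALOG-ROLE>"
)
print(json.dumps(policy, indent=2))
```

Because it is a trust policy, it has a `Principal` and no `Resource`, which is why pasting it into the ordinary policy editor produces the "add a Resource" error seen above.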
So, we'll go ahead here. We'll hit next and next. I didn't realize you could skip the permission policy. Do they want this to have a very specific name? Save the IAM role as what? This will be, I guess, Databricks policy.
We'll call it Databricks access policy, "allow Databricks platform to assume a role". So it seems like this role is incomplete. We'll go ahead and create that.
They said policy, not role, but that's fine. Earlier here it says it's a role. Now: "Create the following IAM policy in the same account as the S3 bucket, replacing the following values." So why wouldn't we just create the policy first and then attach it to the role? But whatever, that's fine. The name of the bucket, and the KMS key if you have one.
So we do have a bucket. Your name is going to be different than mine. I'm going to go ahead and copy that. We'll go back over to IAM. We're going to go ahead and create a policy. I'm just going to paste this over here so I do not lose it.
And "create the following IAM policy in the same account".
Oh, here it is. Okay. So, I'm going to go ahead and copy this. We'll go over to our JSON and paste this in here. It's giving get, put, delete, list, and a bunch of other bucket actions.
And then here we have very specific things. I'm thinking we've got to put our bucket name here. We'll go over to here and copy this.
Okay.
And then we'll also do it over here.
Good. It has a KMS key. I don't think we set a KMS key on this.
Pretty sure we didn't.
Let's go over to our permissions... properties, encryption.
Yeah, we're not using KMS. We're just using SSE-S3. So we'll go here and just remove that, as we do not need this statement. Okay. And so now we have the account ID and the AWS IAM role. I'm assuming this is from the other one.
Let's go back over here for a second.
It's like, who should have access to it, right? And I would imagine they want Databricks to have access.
So here it says the account ID of your AWS account, not your Databricks one. Oh, okay.
The name of the role. Okay, that makes sense. All right. So I'm going to go up here to the top, grab my account ID, and go back over to here. They really should have been explicit in this tutorial and said, "Hey, name the role exactly this." I already forgot what the role is called. So I'll go ahead here and open up a new tab. It's something like Databricks access, but I don't want to get it wrong. It's always best to copy and paste just so that things work. We'll type in Databricks access. I wrote policy. It's really a role here, but we're not going to rename it. We'll just live with our mistakes. It's not a big deal. And that is over here. So we'll go ahead and place the role name as such. We'll go back over to wherever we were.
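Putting those substitutions together, the permissions policy can be sketched like this. The action list is an abridged assumption based on the get, put, delete, and list actions mentioned above, the KMS actions are likewise an assumption, and the bucket, account ID, and role name in the usage line are placeholders.

```python
from typing import Optional


def s3_access_policy(bucket: str, aws_account_id: str, role_name: str,
                     kms_key_arn: Optional[str] = None) -> dict:
    """IAM policy granting read/write on one bucket, plus self-assume on the role."""
    statements = [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:ListBucket",
                "s3:GetBucketLocation",
            ],
            # The bucket name appears twice: the bucket itself and its objects.
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
        },
        {
            "Effect": "Allow",
            "Action": ["sts:AssumeRole"],
            # Your own AWS account ID and role name, not the Databricks ones.
            "Resource": [f"arn:aws:iam::{aws_account_id}:role/{role_name}"],
        },
    ]
    if kms_key_arn is not None:
        # Only needed when the bucket is encrypted with a KMS key (not SSE-S3).
        statements.insert(1, {
            "Effect": "Allow",
            "Action": ["kms:Decrypt", "kms:Encrypt", "kms:GenerateDataKey*"],
            "Resource": [kms_key_arn],
        })
    return {"Version": "2012-10-17", "Statement": statements}


policy = s3_access_policy("<your-bucket>", "<aws-account-id>", "<role-name>")
```

Leaving `kms_key_arn` as `None` mirrors what happens in the video, where the KMS statement is deleted because the bucket uses SSE-S3.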
So the IAM policy grants read and write access. Good.
However, this may be unnecessary, because you can mark the storage credential as read-only, and any write access granted by the IAM role will be ignored.
Okay, that's fine. Sure. So, I'm going to go back over to here and hit next.
Databricks access S3 bucket policy. So this one actually is a policy. I'm going to copy this name here so I don't have to go find it later.
Put it right there. And we'll go ahead and create that. So now we have that policy.
This is going to be our bucket policy.
Create an IAM policy for the file events in the same account. Well, hold on. What are we doing with this one?
"This IAM policy grants read and write access. You can create a policy that grants read access only." No, that's okay. But don't we have to attach this to something?
Okay. So, we'll go down below here.
Create an IAM policy for the file events in the same account.
So, now we're just creating another policy.
I mean, sure. So, we'll go ahead and copy this one.
Create another policy here. We'll go to JSON.
And we obviously have our bucket name somewhere in here.
Where's our bucket? Right there. So, we're going to go back over to here.
We'll grab this bucket name.
Go on back.
What a lot of work just to get this connected, eh. So, that one is set.
The rest of the stuff looks like wildcards, so I do not think we need to fill it in. Let's go back over to here.
Replace bucket. And we did that.
Attach the IAM policies to the IAM role.
Okay. But first, this one will be called Databricks access...
Was it like file store? What was this for?
File events. Okay, sure. So, we'll go down here. This will be file events policy. We'll copy that name as well. I'm going to go ahead and create that one.
Use alphanumeric characters. What?
Just take off the end there. Maybe it's just too long.
Mhm. Is there a space in the front?
Okay, that's what I thought. I'll put policy back in there, just because I named the other one as policy, even though it's probably redundant to name it like that. I'm going to go back over to here and paste that there. So now it's suggesting we go back to the original role and attach them. Let's make sure that is the case.
Attach the IAM policies to the IAM role.
Okay, sure. So we'll go back over to our role, which is here.
Really, I don't think there's any reason we couldn't have created those policies first and then attached them, but that's totally fine. So, attach those policies. We'll search for Databricks.
There are our two policies that it wants us to attach. We'll add those permissions. So now they have been added. Give Databricks the IAM role details. So, in Databricks, log into the workspace that is linked to the Unity Catalog metastore.
How would I know that? So, click Catalog.
Okay. So we'll go over to here. Oh, it's over here. Okay. So this is Unity Catalog. It just says Catalog. Okay. So that is step one.
On the quick access page, click the External Data button and go to the Credentials tab.
There it is. Credentials. Create credential. Ah, so there we go. We have storage credential and service credential. I mean, we're trying to deal with storage, so I'm assuming that's what we want.
It says copy from instance profile. That's usually talking about a compute instance, but obviously we're not doing that right now.
Credential name: I guess whatever we want to call this. So it'll be like AWS developers ExamPro. And then here it wants the IAM role ARN.
So I'll go back over to here. We'll grab the ARN from there. That's on the role page. We'll paste that in. We'll look at advanced options here. We can limit it.
I don't need to do that. And so we have created it. I'm going to assume that was what the rest of the instructions wanted us to do, because it's pretty straightforward.
Yeah, it was already set to that. Set to read-only. In the "Storage credential created" dialog,
copy the external ID. Okay, so there's our external ID. We're going to copy that external ID and go back over to here. We're going to edit... oh no, maybe not that. It's the trust relationship that we need to edit. So we'll go edit that and paste it in there. Let's just make sure there's no prefix we've got to put in here.
Bind the storage credential to specific workspaces. Optional. I'm not sure what that is. Update the IAM role.
So here it's just suggesting to do that.
Edit the trust relationship policy to look like the following. So now we have this, but then we also have that. So why did we do... well, whatever. We'll do this again.
Because if you go here, I mean, it's fine. I'm just wondering why. So this one is the same, and this one's the same. And we'll grab our external ID here.
We'll take this. We'll remove this part.
And so we need to bring in our account ID. We should already have it. No, actually we don't have it here, but we'll go to the top and grab it.
Paste this in.
I guess we're permitting the role to assume itself, because it's literally the same role here. It's just strange because it's the same role, but I guess you can do that, and we should be all set up. We'll go ahead and update that policy. So now our trust relationship is updated.
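The edit just made to the trust relationship amounts to two changes: swap the placeholder external ID for the real one, and add the role's own ARN as a second principal. A minimal sketch, assuming the policy has a single statement; the helper name, account ID, role name, and ARNs are placeholders.

```python
import copy


def finalize_trust_policy(trust_policy: dict, external_id: str,
                          aws_account_id: str, role_name: str) -> dict:
    """Return a copy with the real external ID and a self-assuming principal."""
    updated = copy.deepcopy(trust_policy)
    statement = updated["Statement"][0]
    self_arn = f"arn:aws:iam::{aws_account_id}:role/{role_name}"

    principals = statement["Principal"]["AWS"]
    if isinstance(principals, str):
        principals = [principals]
    if self_arn not in principals:
        # The role is allowed to assume itself, alongside the Databricks principal.
        principals = principals + [self_arn]
    statement["Principal"]["AWS"] = principals

    # Replace the placeholder with the storage credential's external ID.
    statement["Condition"] = {"StringEquals": {"sts:ExternalId": external_id}}
    return updated


initial = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": ["arn:aws:iam::<DATABRICKS-ACCOUNT-ID>:role/<UNITY-CATALOG-ROLE>"]},
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": "0000"}},
    }],
}
final = finalize_trust_policy(
    initial, "<STORAGE-CREDENTIAL-EXTERNAL-ID>", "<aws-account-id>", "<role-name>"
)
```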
Let's go back over to here.
Validate the storage credential. "After you have made the changes to the IAM policy in step three, verify that your IAM role is properly configured. To validate the configuration, you must be the storage credential owner" (I am, because I made the account), "a metastore admin" (I must be, because I created the account), "or you have CREATE EXTERNAL LOCATION" (didn't do that, but it should work). So here it says click Catalog, and on the quick access page click External Data and go to the Credentials tab. There's an alternative: go to the gear, select the storage credential you want to validate, and validate configuration. Okay, cool, there's a validate button. I feel like it's right here. There it is. We'll click that. Success. So we are now connected via Unity Catalog. That's great. I'm not sure about the rest of this, but we want to get back on track to what we were doing, which was working with Auto Loader. So I'm going to back out here for a second.
I should have the tab here. This was for connecting to here. Create a storage credential for connecting to S3. I mean, specifically, it's for Auto Loader.
Maybe it's over here. Nope.
So, that's fine. We'll just go ahead and type "Databricks Auto Loader S3".
Okay. So we'll go back over to here, and then on the left-hand side, "Onboard data from S3". That's what we were doing. Now that we're here, this is what we were working through, and so we should be able to list the source.
I imagine that now our account should know about it. The question is, does a workspace have to be enabled for Unity Catalog? Because that was something else that kind of suggested, hey, is your workspace whatever, but isn't this whole thing our workspace? I don't know. So, let's go back into the file we were using, wherever it was.
It's here. Okay. And we're still connected. Let's go ahead and try this again and see what happens, because there might be something else we need to configure, but it looks like it's working, right? I don't think the other one was doing this. I don't think it had, like, statement 01. I'm going to pause here just to speed this up. Okay, so we still have a failure, right? So there's obviously more to this. Let's go back over to here and look at some of the prerequisites. Before you begin: access to a Databricks workspace with Unity Catalog enabled.
I mean, isn't it enabled? We created it.
A workspace that is enabled for Unity Catalog.
"In most accounts, Unity Catalog is enabled by default when you create a workspace." I mean, I didn't create a workspace. I already have one. "Ensure your workspace is enabled for Unity Catalog." If the workspace predates automatic enablement, enabling Unity Catalog essentially means creating a Unity Catalog metastore in the workspace region and confirming any groups that you have created, whatever, whatever. And by the way, when I created the bucket, I didn't check the region, so I might have egress costs. "Upgrade the tables that are registered in your" thing. Okay. Can I just make a new workspace?
I mean, this is my workspace, right? But we have it here, so it must be enabled, because we have Unity Catalog here. So just a moment, let me figure this out. One thing I notice over here in the catalog is that we can add a connection.
So I wonder if we're supposed to be doing this. I'm going to just try it anyway and see. Do we even have access to this with AWS? AWS, Amazon... Nope, it's for Redshift. So that kind of doesn't work for me here. But let me just figure this out a bit further. Okay, so in here it just says how to confirm, so I went over to here. It says use the account console to confirm your Unity Catalog enablement. So we'll click Workspaces.
This just says workspace. Oh, up here we have workspace.
Oh, we have multiple workspaces here.
What if we go to manage account?
Okay. So now we have more workspaces.
Now we can see where it is. This is in US West 2. I see serverless only. And it has a metastore.
Is that what it needs for it to work? Let's go back over to here.
It was over here. So, use the account console's Workspaces page. Find the workspace and check the Metastore column. If a metastore name is present, your workspace is attached to a Unity Catalog metastore and therefore enabled for Unity Catalog.
Run a SQL query to confirm Unity Catalog enablement: run the following SQL query in the SQL query editor or a notebook. So we'll go ahead and try this as well. But it looks like we do have a connection.
So we'll go back over to our workspace, back into here. I'm just going to create a new... oh, Shift+Enter to run and move. I don't want to do that. I just want to add another cell. Where do I add another cell?
Normally there's a little plus underneath. Oh, here it is. Okay, great.
So, I'm just going to paste this in here. I want to drag this above. Let's just run that and see what we get. I mean, again, I do believe that we have it enabled.
We'll give that a moment. So, we see US West 2. I didn't know I was in US West 2, but that's great.
And therefore we're enabled, right?
So that's good. But we are back onto this step, where there's an issue connecting. So: access to a Databricks workspace with Unity Catalog.
The READ VOLUME permission on a Unity Catalog external volume, or READ FILES on the external location. The path to your data source. An external location path example. Cluster creation permissions. One thing I would think is, does it need additional permissions for S3? I don't think so, because it should have them.
And so here, the path to your data source.
Now maybe we don't need the full address. Let's go back and take a look here.
I mean, that isn't the full address.
It's just the file there. No such file or folder called that.
Hm.
Interesting. Okay. So, I'm going to run it again, even though it's probably futile, but we'll try again. All right.
So, coming back here and just reading a bit more thoroughly: it says when the storage credential validates, you can create an external location. So I'm guessing we need to set up an external location. I'm going to go back over to our catalog, into our metastore. I think I'm at the top level there.
Let's go back here. Yeah, sorry: Catalog and External Data. I was fiddling around with this a bit and gave it additional permissions, which I don't know if I was supposed to do, so I'm going to go ahead and revoke those for now. But let's go back over to Catalog. Maybe there's an option here to add an external location.
Create external location. Oh, there we go. Okay, good. So, we have AWS Quickstart. I like that. That sounds good.
Quickstart sounds good to me. So now we need to provide our bucket. We'll go back over to our bucket name here, and we'll do s3:// and paste that in. "We're going to send you to the AWS console to configure an external location. Once you sign in, you'll paste the personal access token and create a CloudFormation stack." Oh, okay. Bucket name. Make sure you have the following: an AWS account,
the name of an existing bucket. Generate an access token, and copy and paste it into the Databricks access token field.
Okay, we'll generate that out. So, I have that. Obviously, I will expire that before you see it. Now, we'll go ahead and launch quick start. So, I guess there's additional things we need to provision. I was expecting that. Um, so we'll dump that into here and I'll say acknowledge. I'm kind of curious as to what this template's doing. Let's take a look here at the view template. Um, if you don't know how to read cloudformation, sorry, I'm pretty good at it. But if we go over here to canvas, we can see we have a data bricks API function. So it's creating a llama and then a copy zips function and then it's going to a llama zip bucket and then we have a S3 data buckets access ro. Um, I guess this is all fine. Sure, why not?
So, we'll go down below here and we will create it.
I'm not sure if like the role is the exact same thing as we had before. Um, if we go to our resources here, well, this will take a bit of time.
Roles always take forever to create, but we'll just wait here and come back here in a second. Okay. All right. So, um, that CloudFormation template's done running. Um, we give this a nice refresh here and we can see all the resources created. Let's go back over to here and so copy this token and we did that already.
So, we'll go ahead and hit okay. Um, we already did that. The bucket overlaps with existing external location. We'll go back over to here. We'll go into workspaces default information. So, I'm just trying to find that external location. But I mean it must be loaded in here now even though I don't necessarily see it here. What I'm going to do is go back to our workspace.
We're going to go back into load data from S3. Let's try to run this again, because I would think it works now that there's an external location. The command still doesn't work. So give me a moment to check here. I'll be back in just a moment.
Okay. So now here it's saying create an external location. All I did was click um up here. I just clicked down to the next step. So it just said like oh do you want to use with SQL? I mean that's what we want to do. And so here I guess we are providing it a location name a name of the external location.
The location name is special characters must be whatever um the bucket path. So I suppose we are creating it here. Um storage credential name the name of the storage credential that authorizes the reading. So I guess we could do it this way as well.
Did we not need to do this? This is option three, right? Let's go back up to the options. Why did I do option one? This one: create an external location for an S3 bucket using an AWS CloudFormation template. This is the recommended approach to creating external locations for an S3 bucket. When you create an external location this way, Databricks configures the external location and creates a storage credential for you. Oh, so the whole thing we did before, all that work we did before, we could have automated it, but the other pages didn't tell us that. So if we go back over to the catalog here, right, and we go into external data, uh, we have an external location now, right? We go over to credentials. Now we have two. So we essentially did it manually and now we did it the automated way. I would have loved to have known that up front, but that's okay. We learned quite a bit doing that. But now we actually can see that we have an external location. And so this is the name of the location. So I wonder if we copy this.
What a roundabout way to do this. But this is why I'm walking you through this stuff, because just because they have instructions on it doesn't mean that it is easy or clear. Let's go ahead and run this and see what happens. So it says the public DBFS root is disabled.
Access denied on your path.
Okay. So I mean we're kind of getting closer. I need another thing that we could do like we could try to make that work. But the other thing that I would like to do because we did all that manual work is why don't we go to that third step and create that external location because we already have made that connection, right? Um, and we might have better luck that way. I'm just trying to find this one. So, we'll go back all the way at the top here. We're going to go to this one here and we'll copy this command.
Okay. And we'll go back over to our SQL.
I wonder if we can just run it in the SQL editor here. Like, why wouldn't we be able to do that? And so this will just be, um, you know, my S3 bucket, and then our bucket path. We'll go back over to here. We'll grab this. I'm going to assume that it wants the s3:// form, like s3://bucket, and then the storage credential name.
So we're going to go back over to catalog here. We will go into, uh, external data, credentials, this credential, right? And so this is the one that I created earlier. We'll go back over to here and we will provide it this. I don't really need a comment, but we'll put one in just in case it really wants one, so it'll be like "access to our S3 bucket", assuming we can run it in here, unless we have to run it somewhere else. And so I'm going to see if that runs. We'll see what happens. Okay, it has a syntax error, maybe because that is optional.
So, I'll just take this out. It's not really clear, um, if that is or not, but we'll go ahead and run it. I don't usually see SQL have square brackets.
Oh, yep, yep, yep. So, I think yeah, there this stuff is optional, right?
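To spell out what tripped us up there: in the docs, the square brackets only mark a clause as optional, and they are never part of the SQL you actually run. Here's a minimal sketch in Python that assembles the statement and only emits the optional pieces when you ask for them. The helper name and the location, bucket, and credential names are all made up for illustration, and the clause order follows the CREATE EXTERNAL LOCATION syntax as I understand it, so check it against the docs for your workspace.

```python
from typing import Optional


def _quote(value: str) -> str:
    """Wrap a value in single quotes, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def create_external_location_sql(
    name: str,
    url: str,
    credential: str,
    comment: Optional[str] = None,
    if_not_exists: bool = False,
) -> str:
    """Build a CREATE EXTERNAL LOCATION statement (hypothetical helper).

    Clauses shown in [square brackets] in the docs are optional, so they are
    only emitted on request. The brackets never appear in the output.
    """
    parts = ["CREATE EXTERNAL LOCATION"]
    if if_not_exists:
        parts.append("IF NOT EXISTS")  # optional clause
    parts.append(f"`{name}`")
    parts.append(f"URL {_quote(url)}")
    parts.append(f"WITH (STORAGE CREDENTIAL `{credential}`)")
    if comment is not None:
        parts.append(f"COMMENT {_quote(comment)}")  # optional clause
    return " ".join(parts)


# Hypothetical names: substitute your own location, bucket, and credential.
statement = create_external_location_sql(
    name="my_s3_bucket",
    url="s3://example-bucket/",
    credential="example_credential",
)
print(statement)
```

You would paste the printed statement into the SQL editor. Leaving `comment` as `None` is the same as deleting the bracketed COMMENT part by hand.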
Okay. So, we'll go ahead and do this.
We'll go run this again.
And here it says with storage credential. I mean, what else would it be? Try that again.
Does that work? Please, please, please, please, please.
Okay. So, it's saying that it overlaps basically because there is another location that already exists, right? So, that's what I'm hearing from here. Uh, what I'm going to do is I'm going to tear I know we're really in the weeds here, but you know, again, this is great learning, great learning experience. I'm going to delete this one here. I'm going to see if I can roll back um, one of these here so that there's not two locations. So, we're going to delete that. And while that is deleting, I'm going to go back over to our catalog here and I'm going to go into external data.
And I want to get rid of that location, but I'm going to go over to credentials here. I mean, shouldn't it tear it? I I don't think it would actually tear it down on this end. Let's go over to here.
I'm going to go into this one.
And how do I delete it? Over here. Delete.
So, we'll delete it. So, now there's no conflicts. Um, and then we'll go to this one, which is the second one. We'll go ahead. We'll delete it. Okay.
And so now we shouldn't have a conflict.
And this should in theory roll back. If it created an S3 bucket, it could cause us an issue, but it looks like it's cleared out there. Okay, great. So, we're going to go back over to SQL.
Okay, and now we'll run this again. And so, it shouldn't have a conflict because that other one shouldn't be there, right?
So, now it has created that. So, now it's called my S3 bucket. So, we have that. Let's go over to um SQL. Actually, we're in SQL. Can we just do here list?
Can we um do this list my S3 bucket?
There's only a single file in that whole thing. So, let's see if it works now.
Says operation allowed list with resource type. Um, okay. I'm not sure. Oh, you know what?
Maybe we do like quotations here. One second.
Let's try that.
So, it says public DBFS root is disabled. Access denied on the path. So clearly we have a permissions issue, whatever that is. Um, and it's the same one. So we ruled it out. "You need to mount your S3 bucket to the Databricks file system before you can list its contents." Of course there's just more and more as it goes here. But this is showing us in Python, right? So I mean we could take this and run it, I guess, as Python. "Replace your access key and secret with your..." Really, that doesn't seem like that should be the way we should have to do it. Um, but anyway, let's go back over to our catalog here, because I did notice over here if we click into this, permissions.
There's no permissions here.
I mean, does it need permission to read this? Like, it isn't clear as to what permissions are going here, but we'll go ahead and just try this. Right. So, now we're just saying everyone can read from this one. Let's go back over to our SQL editor. Let's try this one more time.
All right. Just a moment. Okay.
And so here it's talking about like because it might have suggested that maybe it's disabled in the free edition, but they're talking about like the serverless compute that it might not work there. Um so that is a challenge here.
So just give me a second here. Okay, I'm going to see if there's another solution. All right, one other observation I noticed here is if we go to browse, it actually allows us to browse the file. So we clearly can see stuff in here, right? And here it says all workspaces have access. Um so clearly we do have a connection here, right? We have read, write, list, delete. We have full access. Um, so I don't think it's a matter of our external location doesn't work. There's just some mismatch with this information in terms of um, uh, these steps here. So again, I'll just continue on here to just try to debug it, but I just wanted to show that we do indeed have a connection. Okay. Um, but we need to figure this out over here. We have like is the workspace selected, which is true. So we are in our workspace. We're running this in the serverless starter warehouse.
Okay, so just give me a second. Okay, so here it's suggesting there is another way, and this is an older way to do it, where we would actually have to grant access, which I believe we did. So again we'll go back over to here and take a look. We click into this, we go to permissions, and we have granted all users of this account all privileges. So there's no chance that we do not have that. Um, but again we'll continue to try here. Okay. So, again, I'll just keep looking. One other thing I figured to check is whether there's a limit on the trial. And so the trial forces you to use serverless compute. Under the general limitations here it says you must use Unity Catalog to connect to external data sources, and use external locations to access data sources. So there's nothing suggesting that it has to do with the fact that we're on a trial account using serverless compute. But I'll continue to look. Okay. All right. Uh, just as a side detour here, one thing it was suggesting before is to check my audit logs. And so I asked where it stores them, and it says system.access.audit. It might not be specific to the one that we're doing here, but if we go over to actions we can kind of see things like list connections, things that we're doing. So I'm just kind of wondering if we can find the information here, and is it ordered by latest or oldest? So I'd probably do an order by event time descending here, and I'll just say limit 100 because I don't need that many records, and so maybe this will make it a little bit clearer. Um, oh, here, limit. We need a limit on this.
Sorry. It's nice. I just use standard SQL and it should just work.
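For reference, the audit query I'm typing there looks roughly like this. It's a small sketch that just builds the SQL text in Python so the limit is easy to change. The function name is mine, and picking three columns instead of everything is my own narrowing (in the video I just grab all of it), so adjust the column list to whatever your audit table actually exposes.

```python
def recent_audit_events_sql(limit: int = 100) -> str:
    """Build a query for the most recent audit events (hypothetical helper)."""
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    return (
        "SELECT event_time, service_name, action_name\n"
        "FROM system.access.audit\n"
        "ORDER BY event_time DESC\n"  # newest actions first
        f"LIMIT {int(limit)}"
    )


query = recent_audit_events_sql(100)
print(query)
```

Sorting by event time descending puts the actions you just ran at the top, which is what you want when you're chasing a permissions error.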
Let's see if that helps it try to narrow down at least see the last actions that we were running. Right. So, if I go over to here, I mean, it's grabbing everything. That's a little bit too much information. Um, the only thing we really want here is uh, well, I guess we'll just have all of it for now, but let's go take a look here. So, here we have token login, O list credentials, list volumes, list tables.
So, I'm not really sure this is going to be the way that we're going to figure it out. But anyway, I wanted to show that because it's nice to know where the logs are. And I guess the logs are in SQL, but I'll keep trying here with just a moment. One thing I did notice here is that they also have this load data using a Unity catalog external location. Um I mean like this has external location.
So, you would think that this can take it from this other way, unless you can't use this with, um, Unity Catalog, because here it says you'll learn how to access and correspond with a Unity Catalog volume (recommended) or a Unity Catalog external location.
Okay. Um, let's go over to here and try this one. So, you know, this isn't a public preview feature, but if it is here, maybe we can try to utilize it. So here we have external data, external locations. We've already done that, so we know we have that already.
Create the managed table. In the sidebar of your workspace, add new data, because all we're really doing is importing data. Um, and so maybe this is a way that we can do it. Add or upload data here.
Okay.
And create table from S3. Yes, we'll go ahead and do that. It did say like does our schema have access to do whatever.
Yeah. So there it is right there. We'll click it.
We'll preview the table.
Cannot read files when the input path does not exist. Okay. So here it's suggesting that the input path does not exist.
Huh.
Let's go back a step.
Now, that's interesting that we're getting a different error here.
Go. Okay, let's try this again.
Maybe we're on to something because we're getting kind of a different thing.
But here, it's showing the contents of it. We can see it. So, I'm checkboxing what I want, right?
And then I say preview table.
Interesting. So, let's feed it in and see what it has to say here because obviously it does exist. Obviously, it's able to see stuff. It's just very, very confusing, right?
Or is not accessible. So, verify the S3 path is correct and that the bucket and folder exist. You can do this by navigating to the S3 console. I mean, like, how did it load it if it had it? Check the permissions: S3 list bucket, get object, put object. Check for typos. Use the correct path in your code.
Here's an example you might use with Auto Loader, which is what we want to use. So I'm going to do something extreme, and I mean it's not like super extreme, but what I'm going to do is I'm going to go back over to catalog, because we had that CloudFormation template, right? And that one apparently set everything up for us and it's just supposed to work, right? So, what I'm going to do is I'm going to go ahead and delete this. I know it's crazy. We're going to go over to credentials. I'm going to go to this one and I'm going to delete it. And now what we'll do is we'll get set up via the automation thing. So, it was here and I know we want to do... Yeah, here it is.
Create external location. Excellent.
We'll go next. And then what we'll do is go back over to here. We will grab our bucket name, which is this.
And I know we already did this, but we're just going to do it again because if we can get it to work, we'll be in good shape. Let's take out that space here.
We'll generate a new personal access token. We'll copy it. We'll launch in quick start. Okay, we'll go down here.
We will enter that key in. We're going to go down here. We're going to create the stack. This is going to set up everything for us, which by the way I would have loved to do the first time, because this way we can rule out saying like, okay, it's not us, right?
Because if we do this exactly as it's asked, then the permissions need like the access should be set up. We will have to go to permissions and make sure we have that, but we'll figure that out in just a moment. Okay.
All right. So, this is now set up. We'll go back here.
And the idea here is that Oh, we'll have to copy this. Copy this. Oh, we already did that. The idea here is that now that we have this external location set up.
This one's more likely to work. But we'll go into permissions here. We'll go ahead. We'll grant permissions. We'll give it all. We'll give it manage. We'll give it everything. Just let it be able to do something. Okay. So, go ahead. We'll add it. This way, we're very unlikely to make a mistake here.
Let's go copy this. We'll go over to SQL. We're going to now replace it here.
And it should work.
If it doesn't work, we'll give up. But it should work because now we're literally using what it's telling us to use. So, it's still telling us it's not going to work. Um, the other thing that we could rule out, I guess, would be um the compute. So, we go over to here. And I mean, like, as far as I'm aware of, there's nothing special that the compute needs, right?
So, we go here. Can I even use um another compute on a free tier? I'm not even sure. Let's go ahead and take a look. Um, no, we're at the max. So, yeah, it's not clear like if this one doesn't work, if this doesn't like what else could it be, right? Because now it's doing all the work to set it up for us. Um, the only thing I don't remember doing, did we actually enter in what it needed access to? Let's go take a look here because I would assume this would just get full access to S3. Let's go down here and take a look. So here it sets the account ID which is fine.
And this one is giving access to a bucket that it's created. So yeah, I believe this actually created a bucket for us. Let's go back over here.
We'll go over to here. Resources.
zips.
So, if we go back over to here for a moment into our catalog, external data, we'll click into here.
We'll go to browse.
I guess we must have plugged that in.
like what like what else could it need?
Do you know what I mean? Um so we go back over to here and I got to again check check it because here this would have created a role, right? Like it absolutely would have. And so we have the S3 data access role. So we'll click into this one, right? This is where it would have a problem saying it can't be found. Are we looking at old old data here? Let's go back over here to cloud formation.
Also, why is this in Tokyo? Databricks S3 ingestion. I'm in the Tokyo region. I wonder if that's a problem at all.
So, what I'm going to do, I'm going to delete, I know this is ridiculous, but I'm going to try this one more time. I'm going to switch over to US West 2. Okay. So, now we're going to be in the same region and, um, uh, some net boot. I'm not sure what this is for, but I'm going to ignore it for now.
What I'm going to do is now that I'm in US West 2, I'm going to do this again. I know it's crazy, but now that we're now, see, it doesn't even have access anymore because we deleted that permission. So, it's so strange that that would not be an issue, but let's go ahead. I just want to rule that out. I know we've done this so many times, but we'll go over to here into external data into our uh credentials here. I'm going to go ahead.
I'm going to delete this. I'm going to go back over to external locations.
We're going to create. We're going to go ahead go to quick start. We will provide the bucket name, which we do here.
Also, let's take a look at where the bucket is. I'm going to make sure the bucket is in the same region, US West 2.
Um the bucket I probably just created is probably in Tokyo. So, we'll go here and type in data bricks. It is it's in Tokyo because I haven't been paying attention to what I've been doing. Okay. So, we're going to go ahead and just rule this out, right?
We'll empty the bucket. We will permanent delete. That's totally fine.
Yep. Permanently delete. I like how they made it like two steps after each other now, which is nice. So, now we're getting rid of that bucket. We're going to create a new Databricks bucket. So, I want to go back here. We're going to say, um, Databricks autoload bucket, because that's what this is supposed to be for this entire time.
Going to copy this. I'm going to go all the way down. We're going to create that bucket. That bucket should now be created. We're going to go back over to here. We're going to dump in our bucket name, which is s3://. We're going to generate another personal access token. We're going to hit launch Quickstart. We're going to go down here. We're going to dump in the personal access token. We're going to go all the way to the bottom. We're going to check box this. We're going to hit create. And we're going to wait and we're going to see what's going to happen. Okay. I just noticed again it says Asia Pacific, Tokyo. Why is it over here? I don't understand. Hold on one second. Hold on one second. I can't create it in this region. It's going to drive me crazy. You know, I think I'm going to delete this. Like, I definitely chose this region.
Let's check our S3 bucket. Let's make sure that bucket that I just made is in the right location because if it's not, it's going to drive me crazy here.
So, Databricks. No, none of these buckets are in the right location. Okay, delete. Just get rid of these buckets. Please get rid of them.
Yeah, let's delete it. Bye-bye bucket.
Okay. And I'm gonna go back over to here. Look, it switched me back. I think it's because um when I'm over here, I don't know why it's just defaulting me.
We'll try this one more time. Data bricks. We'll delete this one. I have to pay. Yeah, look, it switched me back.
So, this is something weird that um AWS is doing. They never used to do that.
I'm not sure why it really wants to persist me over into the Tokyo region.
Um but we'll go back here and we'll make sure we're in Oregon. We will create a bucket. We will make sure that we choose.
At least it's not me. That we'll make sure that it's in the location we want it to be.
Can I choose where it is?
No. But it says right here. So that's okay. We'll go ahead and create that bucket. We are going to make sure it's in the region we want it to be in. It is. Excellent. Okay. We're going to go back. We're going to do this again.
Sure, we'll use this. No, we'll have to do this again from scratch. So, we'll go here. Create a location. Next, and uh really, I got the name autoload.
That is such a, um, surprise. We'll go ahead and type s3://. We'll generate a new personal access token. We'll go next. We'll hit quick launch. We will make sure it's this location. Again, I don't think that's the reason why it's not working, but I want to make sure that we rule it out, right? Because we are in US West 2, if you didn't notice earlier. Well, I'll show you again. But if we go over to workspace, it says US West 2. So, let's make sure we are in the same region. Um, and see if this makes any difference. So, we'll wait here again. Okay. All right. So now we have our bucket in US West, our credentials in US West, our workspace in US West. And so if all these things are in US West, you know, maybe that will resolve our issue. I'm going to go into this external location.
I'm going to look at browse.
Oh, this has nothing in the bucket, right? So we'll go into this bucket.
We're going to have to upload something to it. Um, so I'm going to go ahead and do that.
I still have the file off screen here.
Okay.
Just one moment here.
Um, I should have the folder. One moment.
Here we go. We will drag this one on over. We're going to upload it. It's now uploaded. We'll go back over to here. We can see it can see that there's a file in here, right? We're going to go back uh here. We're going to copy. Well, before we do anything else, we're going to grant permissions. We're going to go read write. We're going to or not read write, we're going to give it everything. We are just going to be very permissive and just say everyone can do everything. We're going to go into um uh over here, we're going to go into external data. We'll go to our credentials. We'll do the same thing.
We'll say, "Hey everybody, you can use whatever you want. Just go for it." So now everybody has access to everything.
There should be no permissions issues whatsoever. Um I mean probably will have problems but we'll uh we'll try it anyway. We're going to copy that. We're going to go to SQL editor. We're going to now replace this with this. Okay.
Everything is now in the same location.
We're going to try this again.
Please work. Please work. Please work.
Please work. Please work. Please work.
Please work. And it doesn't work. So I don't know why it doesn't work. I can't solve it in this video. Um, we are following everything that it's suggesting. We cannot find anything online.
They they're talking about something else that there's something called a a catalog volume. So maybe there's another way to uh add the data. I'm not sure.
But this has been a very frustrating experience. But we have learned a lot going through all this stuff. Now we understand catalog configuration quite well. Um, but why it doesn't work, I do not know.
It's it is insanely frustrating, right?
So, um, we're going to call the video quits here because we're at an hour and you know, it's just it is what it is.
Okay. Uh, but yeah, there you go. So, what are Delta Live Tables? It is a declarative framework in Databricks that simplifies building batch and streaming ETL pipelines with built-in data quality, orchestration, and monitoring.
So here down below you can see a pipeline moving from uh left to right.
So it's that visual thing you're seeing there. Its key benefits: it's unified, with batch and streaming pipeline support, declarative syntax with just a few lines of SQL or Python, automated orchestration and error recovery,
built-in data quality checks via EXPECT, and incremental and CDC support.
We have a slide on CDC somewhere in the course if you need to know what it is.
Serverless compute for cost efficiency, visual monitoring, and lineage tracking.
Common use cases would be declarative ETL, change data capture, real-time streaming ingestion, and SQL-based pipelines with automated scaling. Um, and so down below we have an example where we have the REFRESH LIVE TABLE. So the idea is you're selecting information and you can see that information there. But we'll get more into it as we go. Let's take a look at notebook libraries and targets. So in Delta Live Tables, pipelines are defined by notebook libraries, which are the code for transformations, and targets, where output data is stored. Notebook libraries define logic using dlt.table, SQL, or Python. Targets specify the catalog and schema where output tables are saved.
DLT automatically manages execution order based on dependencies. So down below the idea here is we have your notebook library where the DLT commands are, and it's going to execute that and write the data to that target. Uh, if that doesn't make sense, it'll probably make more sense as we continue on here.
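If "execution order based on dependencies" sounds abstract, here's a rough sketch of the idea in plain Python. To be clear, this is not DLT code and not how DLT is implemented internally. It only uses the standard library to show how an order can be worked out from which table reads from which, and the table names are invented.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical pipeline: each table maps to the set of tables it reads from.
dependencies: Dict[str, Set[str]] = {
    "bronze_orders": set(),
    "silver_orders": {"bronze_orders"},
    "gold_daily_revenue": {"silver_orders"},
    "gold_top_customers": {"silver_orders"},
}


def execution_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which every table comes after the tables it reads."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

You never write this ordering yourself in a pipeline. You declare the tables and what each one selects from, and the framework figures out that bronze has to run before silver, and silver before gold.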
All right, let's take a look here at the syntax for DLT. Um, so the idea here is that we use LIVE to define managed tables and STREAM to read from streaming inputs. So the idea here is that we create a live table and then we're selecting our data. Um, so LIVE tells DLT to manage the table or view life cycle. STREAM treats the source as streaming even if it is batch capable.
So pretty straightforward. You have a LIVE and you've got a STREAM. Let's take a look at violations in Delta Live Tables.
So you define your expectations as data constraints to validate incoming data.
You control how violations are handled using the ON VIOLATION clause. So let's take a look at our violation handling options. ON VIOLATION FAIL UPDATE stops the pipeline if any rows violate the expectation. ON VIOLATION DROP ROW skips only the invalid rows, and the rest of the pipeline continues normally. Use fail update when the bad data should stop the pipeline.
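To make the difference between the two options concrete, here's a tiny simulation in plain Python. This is not the DLT API, just a sketch of the behavior described above: drop row filters out the bad records and carries on, and fail update stops at the first bad record. The function, the exception class, and the sample rows are all made up for illustration.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class ExpectationFailed(Exception):
    """Raised when a row breaks an expectation handled with FAIL UPDATE."""


def apply_expectation(
    rows: List[Row],
    predicate: Callable[[Row], bool],
    on_violation: str,
) -> List[Row]:
    """Simulate an expectation with a DROP ROW or FAIL UPDATE action."""
    if on_violation not in ("DROP ROW", "FAIL UPDATE"):
        raise ValueError(f"unknown action: {on_violation}")
    kept: List[Row] = []
    for row in rows:
        if predicate(row):
            kept.append(row)
        elif on_violation == "FAIL UPDATE":
            # Bad data stops the whole update.
            raise ExpectationFailed(f"row violates expectation: {row}")
        # DROP ROW: skip the invalid row and keep going.
    return kept


def valid_location(row: Row) -> bool:
    """The expectation: location must not be null."""
    return row["location"] is not None


# Hypothetical sample rows; the second one has no location.
rows = [
    {"id": 1, "location": "Toronto"},
    {"id": 2, "location": None},
    {"id": 3, "location": "Tokyo"},
]

kept = apply_expectation(rows, valid_location, "DROP ROW")
print([row["id"] for row in kept])

try:
    apply_expectation(rows, valid_location, "FAIL UPDATE")
    failed = False
except ExpectationFailed:
    failed = True
print(failed)
```

With drop row, rows 1 and 3 survive and row 2 is skipped. With fail update, the same input raises as soon as row 2 is reached, so nothing gets through.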
A typical case for fail update is missing required fields. Use drop row when it's okay to skip bad records and keep going. So here's an example of the idea. We enforce that location is not null: constraint valid location, expect this thing to not be null, and then apply the on violation action if it fails. Okay. Um, but let's take a look at change data capture. This tracks inserts, updates, and deletes in source data so you can reflect those changes in a target table without reprocessing everything. It's great for keeping your warehouses or analytics tables up to date. It reduces cost and latency in pipelines and avoids full refreshes. And so the command here to do it is APPLY CHANGES INTO. The source must include a change type column, because you need to know what has been changed. The target Delta table is where the changes are applied. Um, you need to have a primary key, used to match and update the rows in the target. So down below we say APPLY CHANGES INTO target FROM source stream. Okay. And so we have insert, update, delete. Inserts are added to the target. Updates are matched on a key and updated. Deletes are matched on a key and removed, if configured. But having an additional table that has the individual changes is really good in OLAP, like analytical systems. Um, so it's just another way to incrementally see the changes. Okay, let's just talk about error diagnostics in DLT pipelines.
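Going back to change data capture for a second, here's a small plain-Python sketch of what applying a change feed to a target means. Again, this is a simulation and not the DLT command itself. The change type column and the primary key come straight from the description above. The sequence column used to order the changes is my own addition for the sketch, and all the names and sample records are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, object]


def apply_changes(
    target: Dict[object, Row],
    changes: List[Row],
    key: str,
    sequence_by: str,
) -> Dict[object, Row]:
    """Apply insert, update, and delete events to a target keyed by `key`."""
    for change in sorted(changes, key=lambda c: c[sequence_by]):
        row_key = change[key]
        operation = change["change_type"]
        if operation == "delete":
            # Matched on the key and removed.
            target.pop(row_key, None)
        elif operation in ("insert", "update"):
            # Keep only the data columns, not the CDC bookkeeping columns.
            target[row_key] = {
                column: value
                for column, value in change.items()
                if column not in ("change_type", sequence_by)
            }
        else:
            raise ValueError(f"unknown change type: {operation}")
    return target


# Hypothetical change feed for a customers table.
changes = [
    {"id": 1, "name": "Ana", "change_type": "insert", "seq": 1},
    {"id": 2, "name": "Ben", "change_type": "insert", "seq": 2},
    {"id": 1, "name": "Anna", "change_type": "update", "seq": 3},
    {"id": 2, "name": "Ben", "change_type": "delete", "seq": 4},
]

customers = apply_changes({}, changes, key="id", sequence_by="seq")
print(customers)
```

Only the four change records are processed, not the whole source table, which is where the cost and latency savings come from. After the feed is applied, customer 1 carries the updated name and customer 2 is gone.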
So Databricks provides built-in error diagnostics to track and resolve issues.
When a pipeline fails, you can quickly identify which notebook or transformation caused the failure, the specific error message and the line of code that failed, and the pipeline stage and table name where the issue occurred.
You can quickly debug broken ETL stages without manually tracing notebook dependencies.
I don't have a screenshot here for you, but as we work through our labs, you'll be able to see it clearly. So, the DLT event log helps track pipeline progress, diagnose failures, and ensure reliability. It tracks state changes like initializations, updates, and failures. It helps debug issues by identifying which notebook or table caused a failure, and lets you review metrics, schema changes, and job duration. We can audit data lineage to trace dependencies across tables. So you have a tab there that says DLT event log, which gives you information. You have a filter. Um, pretty straightforward. Okay.