Schirripa provides a masterclass in systems archeology, offering a pragmatic blueprint for injecting modern extensibility into legacy architectures without breaking upstream compatibility. His "happy path, escape hatch" philosophy is a vital lesson in balancing developer flexibility with the rigid constraints of decades-old code.
Deep Dive
Prerequisite Knowledge
- No data available.
Where to go next
- No data available.
Deep Dive
VillageSQL: Decisions, Principles, & Lessons in Teaching MySQL New Tricks (Steve Schirripa)Added:
[music] >> I don't give a [ __ ] about my sequel.
I don't give a [ __ ] about Oracle.
>> [music] >> I don't give a [ __ ] about ClickHouse.
I LOVE POSTGRES.
>> [screaming] >> I DON'T GIVE A [ __ ] ABOUT ANYBODY BUT ME.
>> [music] >> All right, guys. Let's get started.
We're excited today to have Steve. Uh he is the CTO and co-founder of Village SQL, which is one of the the the new startups trying to lead the way of of progressing MySQL into the future. Um Steve is also uh CMU alumni. He completed his master's degree here in in 2000. So, it's always good to have a fellow Tartan with us. Uh as always, if you have a question for Steve while he's giving this talk, feel free to unmute yourself and ask your question anytime.
And that way he's not talking to himself for an hour on Zoom. Steve, thank you so much for being here. The floor is yours.
Tell us about MySQL.
Thank you. So, um this is a bit probably of a different topic for um most people in the database space cuz they're kind of used to, you know, thinking about scalability and performance and things like that. But, we wanted to talk about extensibility.
Um it's often overlooked, but it's actually really important. So, thank you so much for inviting me to talk today about this. So, over the past year I've been working uh on Village SQL trying to innovate in the MySQL space. And one of the things that we've been doing is trying to add extensibility to MySQL.
So, for agenda, I just wanted to talk a little bit about just give some background and then just go into war stories.
And then from from there try to synthesize some principles, if you will.
Um so, uh as you said, I'm a CMU uh alumnus. Uh and I'm the technical co-founder. Uh my other half uh Dominic is on as well.
Um and you know, prior to that I spent about 21 years at Google. I was really fortunate to work uh for about 15 years in storage there um leading uh BigTable and Colossus, the file system.
So, um you know, while we we thought a lot about scalability and performance for those things, we also thought quite a bit about making sure we were meeting the needs of our customers internally going forward. Um and and so, you know, uh I think MySQL outside needs that uh as well.
Uh so, our mission is really to drive innovation in the community in the MySQL community specifically. And our experience and sort of perception is that MySQL has been lagging a bit here um you know, Oracle has uh carried the torch for quite a while.
And um only really recently in in the last few years has uh have they really changed their their focus. And um you know, we've seen other systems, Postgres being the the big one that gets more sort of love and attention. And so, um the community for MySQL has sort of been diminishing over time. And we want to we want to change that. There's still a huge install base and still a lot of need for the database. So, um we think it's a it's a good time to try to innovate in this space. And one of those ways that we're trying to innovate is with extensibility.
So, what's the problem? So, you know, in terms of OLTP databases, right? Um and talking about open source, really you know, MySQL is the big one, right? Um it has the most usage and, you know, it's ranked number two in in DB-Engines.
Um And again, Oracle has sort of slowed the pace of innovation and and and contributions from the outside. Um and a lot of the internal team for MySQL has even just been focusing their attention elsewhere on sort of internal um you know, Oracle Cloud adoption, those sorts of um those sorts of plays. Um and so, the the premise of MySQL is that there's really this large discrepancy between the usage and the needs of the MySQL community and you know, where you know, the the sort of love and attention is. And so, like we we we think that we need to inject some of that into the community and try to evolve the system over time cuz if it's just going to be a static database, it's not going to continue to meet the needs um for for everyone out there.
So, what is the fix? Well, as we see it, we really think that we need community-driven development again. Um and and give people, you know, sort of the the permission to go ahead and and contribute to this ecosystem. Um And, you know, part of that is taking a little bit of a playbook from from Postgres. I mean, Postgres is known for being extensible. And we think that that kind of thing in MySQL would be really useful as well. But, we just have to do it in a different way. Right? There's like 30 years of history here. And, you know, a 3 million plus uh line code base in C++. So, you know, it's not an an easy feat. But, we felt like we could uh you know, give it a try and and and see if we could improve the situation.
Um one of the important things for us also is that we really wanted to maintain a compatible fork with MySQL so that you could, you know, try it out. Um and if you decided to go back to the upstream uh Oracle MySQL, that you you could do so.
Um So, like over time this this sort of graph just shows, you know, what we what we expect and what the premise is of of doing an extensibility framework. Um is that sort of in the beginning we would sort of plant those seeds of extensibility and invest in in hooks, if you will. And that over time those hooks would enable more and more use cases.
And the combined effort of the community, which includes us, would uh sort of get to a a a critical point where, you know, a lot of the effort start to be um shifted towards uh actual extension development. Um and that over time uh that we wouldn't need to, you know, hook as many you know, as many features because um just the the sort of menu would be large enough that you could get most of what you want done um uh by that time.
Steve, you were going ahead of your Are you like What is your definition of a core like hook?
Uh what I mean by that is, you know, changing the, you know, the MySQL code itself in our fork to permit um you know, some sort of extensibility hook that an extension author could take advantage of. Right?
Whether that's, you know, rewriting queries or uh you know, exporting metrics or like adding new types or a new type of index or or whatever it may be.
Hey, maybe you can get into more detail with like like I'd be interested to know what parts of the MySQL code are more amenable to inputting hooks in versus other parts. Again, if that's if that's later on, we can hold off. Yeah, that's fair. I mean, most of this talk is focused particularly on the type system and sort of what we learn. I touch a little bit on sort of uh functions as well. Like right now we have uh a functioning, you know, type system and uh function ex- extensibility that can use those types. Um and then and like right now one of the things we're working on is indexing.
But, we also have a number of other features that we we have, you know, PRs out for that will uh you know, that address some some of these other things.
So, um some of them are actually pretty straightforward to do. It just takes time to do them, right? It's like it could be just a single change. A single pull request could add, you know, some really, you know, useful functionality.
But, for the type system and functions and indexing, those were like those were like the big things that required that we felt required like a lot of effort. And we really wanted to focus on those things because those were the riskiest parts of the of the extensibility story.
All right. So, the question in chat is when it comes to extensions, is the plan interface similar to that of Postgres?
MySQL does not have a user-defined types, I believe. So, by default, I guess guessing that would be added.
Yeah, so we have the ability in Village SQL to to add uh to add types.
Um it's different than Postgres. Not that I'm I'm familiar with all the internals of Postgres, but um you know, it's uh similar types of things can be done, but it's not like we we have a system where it's like, "Okay, there's there's tables that that that have all this extensibility." And really it's a matter of like writing rows in into that table to sort of define, you know, what the sort of structure of the type is, if you will. Um So, we've had to go a different route, which is discussed here um in this presentation um as to how we we integrated types.
All right. I I I think keep going.
Okay.
Great. So, I'll move on to some extensibility story time, if you will.
Um so, but first, uh I just wanted to show what this might look like, right?
So, so from the user's standpoint who wants to, you know, make some tables and and and write some queries, um here's just like a simple example of a a new type, complex, um which, you know, they're creating a table with it. It has an index on it, but the index is using a function that's defined on complex. So, it's sort of a functional index.
Um and they insert some values, and they're doing it as strings, the string representation of complex.
Um and they're able to, you know, select from that table, um and look for particular, you know, particular values of of the the the complex magnitude. Um and the the point of this is really just to illustrate that the type becomes sort of a, you know, it's like a a first-class concern, right? It's it's um you know, even even some advanced features you might think of like functional indexes should work with this, right? Um it shouldn't be like, "Oh, you know, you can only do trivial things with it, right?"
Um and then from the author's experience, like right now, we are you know, we only allow you to implement these um extensions as C++, but in the future, they could potentially be other languages.
And we sort of have this, you know, idiomatic, you know, C++ builder sort of pattern for for um for for defining your um for defining your types and your extension, your your functions, so forth. Um but the point is like to try to minimize the details of um of the database system and and try and um let the framework take care of the rest. So, here we can see that um we're defining a a type complex, and um we're we're giving, you know, various properties about it, um including sort of the how you take a string um and turn it into a complex, that's the the from string, and how you turn an internal representation of a complex into a string, um and and then like a single comparison function. We'll we'll talk about why the single comparison function thing is important later on, but um you know, it's like pretty it's pretty straightforward to understand um what's going on here. Like at the at the bottom, we're we're saying we're defining a function complex ABS that is um going to take complex as a single parameter and return [clears throat] a real.
And and of course, you have to actually define the I I didn't show the actual functions themselves, but it's it's pretty straightforward um C++.
Okay, so now I want to talk a little bit about like the journey of, you know, like coming into this, you know, I had a lot of experience with internal, you know, Google systems, but not MySQL, which is a huge code base to to learn from from scratch. Um and you know, it's a 30-year-old code code base, as I said, and um you know, it wasn't even clear even from the beginning like, is this going to work? Like how do we even approach this problem?
Um so, the first thing that I wanted to do is just like, "Okay, let me pick a type and try to like directly integrate it, cuz I wanted to learn how would you actually do that?"
And so, I picked complex numbers, and oddly enough, like I didn't know it at the time, but like that is the example type that's used by Postgres, and um uh I just sort of felt like, "Okay, now I'm on the same wavelength because of that." Um but uh but anyway, like direct integration has a lot of difficulties. So, you know, what what do you do? Well, you there's this type enum, um so you have to add a new entry there.
You have to ensure that all the locations, all the switch statements that use it, um are going to know about your new type.
Um there's like a whole tree of um of classes or or multiple trees even.
Like there's a there's a field class, and you need to like subclass that and have a field for it, so that um [clears throat] a column of that type knows what to do. Uh you have to sort of widen your type conversion. So, like knowing how to convert type X to Y, like now you have complex also, so you need to teach it for all axes how to go from X to to complex, if that makes sense, and and vice versa.
Um but, you know, I knew this wasn't going to work from the beginning for various reasons, but um but my thought was like, "Okay, like maybe just doing this, maybe there's a way to generalize it. Like maybe you can make one new type that you that you sort of take to mean it's the extended type, right? And then for all the extensions out there, like you keep track of it in some sidecar, you know, like what the actual type is and what, you know, you might have structures and stuff that the that the extension author might define.
Um and then you sort of could then you know, delegate more generically uh for the actual new type based on that sidecar information. Um but, you know, that that doesn't work. Um so, let's investigate that a little bit. So, here's the here's the enum that I was mentioning before. Um and uh you know, clearly these are persistent, like the the data dictionary of the system, um which was new in in MySQL 8.0, um it keeps track of all of the it's sort of the set of system tables, if you will, that that keep track of the uh of some of the core the core information of the database, the user tables, the the structure of the columns, the the indices, etc. Um and so, um you know, like a column is going to have one of these types. So, that's definitely not going to work, right?
Like if you if upstream decides like they're going to introduce a new type, and they happen to pick the value that you pick, well, that's bad. Like you can't be backwards compatible then.
Um you know, or you'd have to go and somehow know how to update all of the references for for that custom type that it's not feasible, right? So, um but it doesn't mean it's, you know, not a possibility at all. Like maybe you could avoid a conflict somehow, right? Like maybe you could use a value that's, you know, 1024, cuz clearly they're trying to use values in 255 and you know, in one byte here.
But looking at this definition, like your spidey sense should be going off.
Like what could possibly go wrong here?
Um and really there's a hidden assumption here that well, only came up much later on, but you know, when I first looked at this, I thought, "Okay, there's some gap here, right?
Between 16 and and 243." Like they're really trying to keep it in one byte, but it's not obvious. There's no comments saying like it has to.
Uh C++ doesn't make it so, like it's not taking up a byte in memory.
Um so, so like what's going on here? Um not immediately obvious. Um and there's this gap management code, so um like they went through the trouble of adding that that gap and starting to put the types going backwards from 255.
Um why would [clears throat] they do that, right? Like um you know, if you know, they need the gap for some reason, and then they needed to deal with these let's say matrices, these table lookups, and they wanted a more compact representation, so they needed to make sure they weren't having, you know, um 256 * 256, you know, 64K um entries in these tables in memory.
And so, you know, the gap management helps with this kind of thing, because they can manage to keep this to be, you know, 30 by 30 instead of um 64K entries. Um but again, it doesn't really explain what's going on here. Um and it turns out that only like much later on did I find a comment elsewhere in the storage engine, which is like, you know, below the below the MySQL query engine, there's a storage engine, right? And it's mostly InnoDB.
So, InnoDB [clears throat] being the most popular one. Um and it turns out that um InnoDB is expecting that um that that type would only be a single byte. And I suspect this comes from like historical for historical reasons, because this was actually metadata that was stored on disk by uh by InnoDB, and they wanted it to be um really compact. But they sort of keep it in this um first, you know, this this least significant byte of this um uh what they call a precise type, the PR type. And then in the code all over the place, they have things like switch PR type um and then bitwise, you know, and it with this mask.
Um And so like if you picked, you know, if you wanted to jump outside and do 1024 um because you wanted to avoid any potential conflicts, then like you'd have to change a whole bunch of stuff elsewhere too. And so um this is just an example of like uh like leaking the leaking some information and and across this boundary, right, of of the of the storage engine and the and the query engine itself.
Um so just to like go through a little bit more that that, you know, the danger of enums, right? Um the upstream really controls the official mapping.
Um and when you're going for compatibility these enums can be a huge source of potential danger.
Um especially or really only like the really the thing you have to worry about is are these persisted or are these communicated out of process?
Cuz if they are I mean certainly if they're persisted, you want to avoid them. When they're communicated out of process they may be persisted somewhere later on and you just don't know about it.
Um so it it's a bit of um dangerous territory.
Um so you know the you know, conflicts are also a problem, right? Like I said, if you can avoid them right?
Um that may be one path to using them, but reassigning them is difficult. If you do find later on there is a conflict then you're in trouble because of this persistent the persistence of the of the value itself.
Um and then even worse I would say is even if you feel like jumping out of bounds, meaning like starting a new block of numbers later on to avoid conflict is a way um for you to to handle this situation you can't always rely upon that because as last slide showed, like you know, there are other potential hidden assumptions in the code um ensuring that the range has to be, you know, within within some set of values.
Um And this sort of Can we let's take a step back for a second. So Village SQL is you're adding sensibility to MySQL. Maybe can you talk about like what version of MySQL are you targeting?
And then also you you mentioned this like this aspect of MySQL that's super important that like it has this since it's original design, at least from the '90s, this pluggable storage manager architecture of which InnoDB is the de facto standard that everyone uses, but in theory right? Like you could put in RocksDB or MyISAM and I don't know.
Yeah, like what version of what version of MySQL are you targeting?
How are you guys thinking about handling like the MySQL layer versus InnoDB?
Yeah, so the so there's there's really two questions there. So so first we're we're targeting 8.4 um as as the version that we're working off of because that's the long-term support version currently. But later on when when Oracle says, you know, whatever 9.7 or whatever it turns out to be is the next long-term support version then we will port our platform over to that as well.
Um the the pluggable storage engine is a good point. You know, there are existing extensibility mechanisms inside of MySQL already. They're just not very fully featured, let's say.
Um so there's the oldest one is called this plugin uh you know, framework and that includes the storage engines, but there are other types of plugins that you can do as well.
Um and and then there's this newer version for or newer framework I should say called components um that again is is sort of like it doesn't have all the features you would want.
And so you know, it's unclear you know, it's sort of like plugins are deprecated except I don't see how they're going to deprecate plugins for storage engines since that's just so core to the system.
Um and then components is this new thing that solves a bunch of problems that plugins have. For example, like uh ordering like like the order in which they load up and dependencies and things like that. But but they're fairly limited and it's not clear how much effort Oracle is going to put into them.
So when we approach this, we're saying we're coming from the standpoint of oh, it's not that we can't use these things as, you know um as building blocks, but for for our extensibility, we need to come up with our own sort of framework, if you will, to to give us control over um all the hooks that we want to put into the system.
Uh does that answer the question?
Yes, but I think also we so MySQL 8.04, right? Uh for 8.4, yeah. And then you're targeting I'm assuming InnoDB is the default, but like higher level question is like in this example here where where the the data type has to be 16 bits.
Um like are there other how do you know there's not other assumptions like the some design decision that's up in the the MySQL layer because of MyISAM wanted that 20 years ago and InnoDB just adopted it, but it doesn't have to be that way. So we made the decision not to support MyISAM. We only support InnoDB right now. Like it's it's, you know it's the vast majority of uh of uh of the databases use that as the storage engine. I'm dating myself, but is the catalog completely off of MyISAM now?
Uh that was that was like version 5.7 I think or something.
The in 8.0 they introduced this data dictionary and that's built that's you know, that's InnoDB.
Yeah. Got it. Okay, thanks. Yeah, and so it's not to say that eventually we won't support other storage engines. Um but but for now like and and I have to say like again, it's it's sort of a historical thing. It it seems to me I mean I only have a year's worth of experience here, but there's, you know, there are assumptions that sort of bleed through the sort of handler interface, which is sort of the interface between the storage engine and the and the upper query engine, if you will.
So Steve, what you're saying is that InnoDB is the primary the default for Village Village SQL and like so like column like the columnar storage wouldn't work, say, or like let's say like RocksDB MyRocks, even though it's archived, there are other LSM engines.
Would there be like a a protocol that the other engines would have to follow to like be pluggable into Village SQL, say? Kind of thing or So we don't support the secondary engine stuff that would be sort of closer to the columnar store. Um and then at least right now, like again, we we're open to like this whole thing is supposed to be that you know, we will build some of this infrastructure to start, but we want to get community engagement and understand what the needs of the community are. Um MyRocks, which was given as an example too, is again not supported initially, but it may be something that we visit later on down the line.
Cool. Thank you.
Okay.
So there's another side to this though is that there's some safety in enums also.
Um you know, there are times when enums are used and they're non-persistent. So a good example of this is like the lexer and parser um the tokens. So if you introduce new tokens into the language um that the the numbering for the enums is is completely internal.
It gets recompiled every time. So if you hit if you introduce some sort of conflict, you can just take a different number and it's fine cuz when you compile all the you know, the query's going to come in, it's going to get parsed, it's going to be using those internal numbers and that's what's going to be used for, you know, for the rest of the query. Um so so that's really, you know, that's a case where you can use it safely.
But there are what I what I consider being as flytraps. Like they look they look just fine. So commands is an example where we wanted to introduce a new statement type like saying install extension.
Um and that again looks like it's completely internal and safe.
Um it's not persistent. But it turns out that there are some plugins like audit plugins that record the commands, sort of log them, if you will, and externally communicate these values. So you know, is that is it you know, is that dangerous or not?
Well, in this particular case, what we what we did was we actually introduced gap management, the the the thing that was done for the the MySQL types themselves, we did that for commands and introduce a gap so that we could jump outside and and and use those numbers without conflict. Um and it's something that we'll have to maintain going forward, but we think this is a good way for us to still be able to um to add commands.
Okay, so um another possibility is like if you if you wanted to extend the database, maybe since the direct thing didn't really work out, like maybe wrapping and subclassing will just like you know, 2,500 classes in the code base. So, um you know, it seems like it might be natural to to think about that. But, you know, there are some gotchas there as well. Um you know, one of the big ones is that uh there's downcasting used all over the code base, and the downcasting um leads to undefined behavior or crashes.
Um and so there's there's an example there where okay, in this very important um uh parent class called item, um there is a uh a type inside that sort of tracks what it expects the subclass to effectively be, and at times will do a downcast.
And so, if you had some kind of wrapper for item, and you wanted to be able to just like add your own items, um then you know, the wrapper might do delegation of its type method. And if the thing being wrapped was a was a field item, it might say, "Oh, um you know, it might return that as the value." But then if your wrapped item were not actually a subclass of item field, then this whole thing would blow up. So, there's that. There's There's also other maintenance burdens associated with it, like you're tempted to change visibility of certain methods and and and and fields. Um there are final methods in in classes. There are non-virtual methods that need to be changed virtual. So, like you're introducing all sorts of problems that you then have to keep up as a maintenance burden.
Um plus, not every potential hook that you want to add even has sort of like a class, you know, representing it. And you know, in general, like you can use subclassing, but I would say it has to be like a really well-defined interface.
It has to be effectively a seam, right?
It has to have been intended to be used that way in order for for you to take advantage of that. And that doesn't exist um you know, universally throughout the code base. There are places where where it is used.
Okay, so so what do we actually do? Um we use uh metadata or type injection. Um It effectively uh we end up carrying some extra type information, which we put into something called a type context. And we stick them uh that type context in uh some of the important um uh class hierarchies.
Um and this really survived because it was not fighting the code base, I would say.
Um So, like for example, when the uh when the system is loading, you know, into memory the structure of a table, it builds it into something called a table share. And um that mimics the schema uh from the from the data dictionary that I had mentioned before, those system tables.
Um and essentially what we do is then in that code that's loading that table share, we will sort of join that data with our own sidecar information, our own metadata. And then build that in-memory structure to contain these type contexts that carry around the extra type information.
Um and that way we can still piggyback off of the existing infrastructure. For example, we don't have to change, you know, the fact that when you add a column with a custom type, um that that is, you know, that's actually a column in a table. So, it should still exist in the data dictionary as a column um in in that table.
Um it's just that we add that information um you know, on the uh on the way on the way in as it's being uh loaded from the data dictionary, we figure out that okay, this is supposed to be a complex, and then keep that type information with it, and then use that in memory as appropriate.
Um and the good thing is that we don't need to do this everywhere. Like the currently we make all the custom types are going to derive from this field varbinary, which is effectively like a varbinary type.
Um and then we put the internal representation of the sort of binary representation of the custom type, which is up to the extension author how that's laid out. Um so, for example, for complex it'll just be like two doubles next to each other. Um you know, we can um uh we end up just having to sort of put put that extra information in these uh field varbinary, and we don't have to like go to every field subclass and and and deal with it, you know, deal with it there, because those are not going to be custom types.
Um so, here's an example of just, you know, like the kind of thing that you have to do.
Well, okay, we're going to we're going to take the SQL type, we're going to turn it into a string, um you know, for printing out or whatever.
Well, okay, if it's a if it's a custom type, then um because we have a type context, then we have like a special function we call out to that will take that type context and fill the string up with the name.
Otherwise, it's just the the regular code. And that way we sort of minimize uh the amount of conflict that we have.
Okay.
Um so, just another thing that comes up in a really old code base uh is that over time, um there are many ways to do the same thing, right? So, there's at least eight ways I could find um where data is ordered and uses different code to do the ordering.
And that's really unfortunate. Um because I I would have thought that there would be like, okay, there's going to just going to be one place to inject with the comparison function.
Um but it turned out that like, oh, you get the you know, where clauses working, and like you're like, okay, let me try some joins. And you're like, oh, those work.
It's amazing. Yes, this is great. And then like some other joins don't work, or like the between operator doesn't work, or you know, you do an order by and it doesn't work. And so, um we've had to just like take all of these ways of doing ordering and map them onto one way that the extension author, you know, uh thinks about it and writes that function, and then um and then those all work, right? Um so, that's the the kind of thing that we we need to deal with there.
Okay, another bit of a surprise, I mean, maybe it shouldn't have been a surprise, is that like create an alter table, especially because they they share some code, they look like they would be they would be relatively easy to support one from the other, where like you that you're working on one would sort of make the other one work. Because like an alter in theory seems like it's oh, create a create a temporary table, copy a bunch of data to it, and then like rename the table at the end. But there's all kinds of details like that are written here, like, you know, there's there's differences in locking.
There's differences in the way they get registered in the DD, because like the temporary table in one case is truly temporary, and in the other case is not.
Um Alter has to deal with all types of complexity around dropping, renaming, and changing columns. It has multiple strategies. Sometimes it copies the data. Sometimes it doesn't have to.
Um you know, trying to optimize for for for certain cases.
Um and so, you think, oh, okay, even alter is is like if if I can cover alter, then like we're done. But no, like CTAS, like create table as, you know, as select is you know, complicates it even further, and is yet another you know, another path of code. Um and you would think that that would seem it would seem like that was something that was much closer to some of the other cases, but it's not, right? It's like it's different enough that it has its own path.
Is this Is this Is this the MySQL there, or this is this is MySQL plus InnoDB? All of this. This is This is MySQL, just MySQL alone. Although, you know, some of the like the implementation of some of these things, like the the copy, let's say, are happening inside of InnoDB for, you know, for performance reasons, but but but basically this is just the the the entire ecosystem, you know, like the the entire system, storage engine plus, you know, plus uh plus query engine, uh have to decide which features are supported by the by the storage engine for what the what things they can push down, and which things they can't, right? So, but again, like there's there's a lot of like, you know, differences and assumptions and locking and so forth that that make this so that like you need to you need to visit all these paths. Which by the way, like I didn't mention this before, but you know, there are so many different features to support and so many variations on on queries and statements.
Like, how do you even go about making sure that you you cover all these things? Well, you you build up you know, a huge set of test cases.
And initially, they they pretty much all fail, right? And then over time, you burn them down. Right? You you you find [clears throat] that, you know, you cover create and alter doesn't work and also CTAS doesn't work. And then you you focus on alter and CTAS still isn't work. And then you get CTAS to work, right? Um so so yeah, it's just like a uh uh uh um a trial and error thing, right? Like, we're seeing we we we make a change, we see which tests all of a sudden work as expected. And some things might not work. And it's like, well, why doesn't that one work? And you'll look into and find some difference in the in the code path.
Is there any anytime there's a conflicting changes?
Like, something you do like, oh, create table as needed this, but alter table needed something different. Like, in that changes the the the extension infrastructure you're building? Or is it just whack-a-mole, just fixing the bugs as as you find them?
I mean, it's actually pretty rare, I would say, that we go and add add a new feature, like enable something new, and then it breaks something something that we already had working.
But there are sort of some code like like I would say actually create and alter are sufficiently intertwined that that risk is a bit higher. Um but but it's not quite whack-a-mole.
It's more like just capturing the long tail of features, right?
Sure. Yeah, but but you're like you're running a fuzzer with a bunch of SQL queries from either the MySQL test suite or something you guys developed. And like you're just running it and see and see what breaks and you fix it.
Yeah, but but like we built we built the set of test cases based on like the the grammar itself, like the parsing grammar, right? So like go through and and come up with this more exhaust exhaustive, you know, test suite that covers all these, you know, um you know, less used features and so forth, right?
So Steve, are you are you saying that you modifying these code paths uh basically change the way like the the optimizer reacts, like the different paths, like the different branches it'll go through? And running these cases you kind of have to figure out, okay, you know, which branches it touch, you know, which parts do we have to uh cuz cuz in the case that Andy said running like my my SQL's uh test suite, there's a lot of cases there. And you know, you'll be able to identify a lot of edge cases that you [snorts] know, each time you really touch any any part of code, right? Uh Right. My I mean, my comment was more about the just the sheer amount of statements and variants on statements that you can possibly have. The issue that I think you're bringing up around the optimizer is actually covered in the next slide.
Where we did find one such thing where um I'll I'll just jump to that bullet at the bottom.
Um you know, it looks like a fairly innocuous query, right? You're just selecting doing a simple join.
Um and and on on the same column where that, you know, that column is happens to be a um uh an an extended type or custom type.
Um and it turned out that, you know, in some situations, we could we would find that running it one time worked and then running it the second time didn't work.
Right? And why was that the case? Um well, it turned out that the optimizer was choosing different paths of execution for these. Like, in the first case, it was doing maybe a full table scan. And the second time, it was actually, you know, um uh deciding to do a hash join. Right?
And so, um you know, there are situations like that that where it was like, oh, okay, we never we never did any integrations on the hash join. I thought that that's where we found we we were missing something, right?
Like, we had missed some integration there. So, you definitely do have to be aware of of the fact that, you know, the optimizer can make different choices. So it's another it's another in the list of gotchas, if you will. So with that, Steve, the the the um so when you what cuz you have the complex let's say this is this is a custom type, right? So uh it it these these custom extensible like the these ex these extensions that you're adding to the system, now you're going to have to really extend the MTR suite, right? Like uh with let's this like drawing kind of query, let's say group buys and and and these these different uh >> Um and also extensions can have their own tests as well. So complex has its own. We have a number of other extensions that we've added like UUID, and we added um uh network addresses and so forth. So like there are, you know, extensions can also participate in the in the fun of of MTR um if they so desire. Um yeah.
So Okay, so so this was just like this is the end of the gotchas. Um low bulk bulk import like load data. The load data command is is like completely separate from insert. Like, I thought naively maybe they would share something like but insert has so many layers to it that, you know, it it would it would turn these string values into the internal representation and then store them correctly, but but load data wasn't working. It just turns out like it's just way lower level, right? It's like reading something from a file and then it just sticks the thing in the buffer, right? Like, it just sticks that value into the the buffer for the column, for the field, um and then stores that. And so, um you know, none of that worked without actually, you know, intercepting that and and and converting the the value properly.
Um and then another interesting one was this, you know, default value handling.
Well, what do you do? So MySQL has this special extension to the language where you can say ignore.
Um and so like if you create a table that has a column that's not null, right? What happens if you insert into that column with a null? Well, that's an error. But if you insert ignore into it with a null, what happens? It's supposed to not error out.
So what do you put? Do you skip the row?
Do you put some default value? What if there isn't a default value?
Um it turns out that MySQL uses something what it calls the intrinsic value um or the intrinsic default value. And so, um we have a way of saying that, okay, for complex, it's, you know, it's 0,0.
Um and that's the value that gets um put in. But, you know, it's one of these things that's like very specific to the system how it behaves.
Okay, I'm ready to move on to some principles and closing thoughts, if that's all right.
Okay, uh backwards compatibility I mentioned. Um it should be really easy to to try the system without fear lock-in. Should be able to revert. Uh we talked about enums and persistence um and some of the dangers of the data dictionary. Specifically, you can store values in the data dictionary, that's fine.
Um like, you know, columns for instance, but you can't you can't extend the data dictionary. If we needed additional information with the column, we can't just add a you know, add our own column to the to the data dictionary table that that that is tracking that information. Um that's a that's a big no-no. Um that would not be backwards compatible. Um and then we have to like avoid workarounds that can bite us later. So an example of this is like early on in prototyping, I thought, oh, you know, maybe we can use the every column allows a comment. So maybe I can just stick some JSON in the in the column comment that will contain enough information for me to look up the you know, extended type information that we want, the custom type information.
But, you know, it's a little bit kludgy to do things that way because comments carry a maximum length and you're taking up some amount of that. Um and so, you know, that that wouldn't necessarily be the best approach um for the long haul.
Um We'll we'll talk about Did I mention So so what we ended up doing was we have our own system tables. I think I have a slide about it.
Yeah, here.
Uh uh we have our own system tables. They are um really actually user tables, but we've done enough integrating >> [clears throat] >> um them into the system that they are somewhat differentiated.
Um and uh you know, this is sort of a building block that we have that has sort of shown us that is like a really useful thing to have around. So, like for example, one of the things we're going through right now is adding support for stored procedures having custom types and uh because of the way stored procedures work, we needed to keep around extra information about some of the parameters and variables that get used inside the stored procedure that may themselves be custom types.
Um and so, we needed to add a a new table for it. And because we built the infrastructure for that, we've invested in that from the beginning, um it was fairly easy to to to do that.
Um and we have this sort of abstraction called the evictionary client that sort of manages all of our interfacing with those uh those system tables.
Um So, another principle is that we want to be Myonic. So, Myonic is sort of this joke term originally that uh I made up uh that was supposed to be you know, in the sense that Pythionic means the Python way, like Myonic was going to be the the MySQL way.
Um as if, you know, uh you know, the Pi part was an important part and not the onic part. Um and it sort of just means like doing things the way that, you know, uh MySQL would do it. And and we always try to do that. So, whether that's error handling conventions, which you know, you know, you can say, "Oh, return true is is positive." Or could mean that it's a it's an error. Like MySQL does it with falses, you know, is the positive thing and and true is an error. So, we do that. You know, they tend to not use exceptions, so we don't use exceptions.
Um you know, the whole intrinsic default value thing. It's like, "Okay, well, the initial version just skipped the row."
But, you know, the expectation from MySQL users would be that because, you know, insert ignore works a certain way, we should really work that way, too. And we should support an intrinsic default value for custom types. Um even the way that we patch extensions, I would say, is much more like how, you know, plug-in and component developers would expect than say a Postgres developer that has, you know, an an you know, an install script a SQL script that that is used. Um like we didn't we didn't go that route.
Okay, low conflict burden. So, this was just the whole like we wanted to make sure that, you know, when a new version of MySQL comes out and we want to port the platform over, you know, how do we make sure that we minimize the amount of uh conflict, right? We can't just like go in and refactor the code any way we want. Um and so, we prefer surgical changes and try to call out into, you know, new files in a in a Village SQL specific directory, um you know, using new classes or structs or functions uh um and, you know, one way that we make sure that this is happening is that we track an internal metric around, you know, how many lines of code have we have we added into, you know, what we consider Village SQL specific files and directories uh versus how much change we've made to the MySQL upstream code base itself, you know, and initially, we were hovering around I think it was like 6.6 or so was the ratio of lines of Village SQL change to lines of MySQL change. And we've driven that up to over 10. So, there's there's more than 10 lines of Village SQL code for every line that we change in in MySQL.
Okay, and this last one is just um you know, this come this didn't really come up that much with the type system, but it did come up when we were thinking about um packaging extensions and and about the the sort of um custom functions.
Um Dan on our team, Dan Bentley, he he really uh pushed this and it really has served us well.
Um and he he paraphrases uh Larry Wall uh you know, in the third Pearl um with this uh he says, "Happy path, escape hatch." And I say, the way I usually say it is like, "Let's make the common things easy and the exoti- the exotic things possible, right?" So, so, what you want to do is um you know, make sure that, you know, uh the thing that people want to do is really straightforward, right? And and there may be a bunch of optional stuff that that people can do extension authors can write that will um that will handle some of the more exotic cases, but um it shouldn't be the common thing.
Um and then over time, as we get, you know, feedback from the community about what other ways they they want to use the framework, then we can think about making some of those use cases much easier and straightforward um uh while still maintaining that that you can can, you know, uh open the escape hatch in the case in the case that you need to do something that is um a little bit more exotic.
Um so, examples are like we provide pre-run and and and post-run hooks for all the all the functions, but, you know, the vast majority of users or sorry, extension authors that want to write a custom function don't need this.
Um and so, they just write one function, but if if they if someone needs it, they can write three functions, right? The main function, the pre-run and the and the post-run run.
Um and another example is like for for parametric types, meaning types that take a parameter. Like you think about something like a vector type. So, we have this test vector uh type that you can say the dimension 1536, um but and so, like we make it really easy to support that one parameter that's an integer that usually means something like a dimension um or a length or a size, something like that. But, if you want to support additional parameters or like the underlying, you know, you want to say the vector can support float or double as the individual, you know, entry of every dimension, you can do that kind of thing, but then you have to give this this more um you have to give a string that has comma-separated um values, right? And the the 1536 shortcut means, you know, dimension is 1536 and the default value that the extension author picked for all the other dimensions.
Okay, so the last the the last slide. Um remarks on AI. I just want to say, you know, this is especially geared towards those who are early in their careers or or coming out of school especially, I personally don't feel uh doom and gloom for the industry uh here. Like it's definitely a shift and it's a tool to be used. Um and it's certainly shaping how we do software development, but I think it's worth keeping things in perspective. Like the way I see it, the demand for software globally has always been way higher than we than than like the supply of software engineers to write that that that software. Like there's so many things that never got written because there just weren't people to to produce it. Um and so, like AI will certainly, you know, capture some of that demand in that it will allow some people who wouldn't normally be writing software to to to try their hand at it. Um and it will make uh engineers be uh you know, uh you know, more productive at at at writing software.
Um but, you know, software engineering is not just coding, right? And um AI doesn't always suggest the right approach. And I think that a lot of times experience and judgment are are needed to to make decisions about that and and sort of guide it along. So, so, anyway, like I just want to say I think it's really important that um you know, we make sure that the next generation of engineers are being trained to get that experience um and and get the mentorship that they need to ensure um you know, future success in this industry.
Then end with that, thank you.
And any questions?
I I will clap clap with everyone. Steve, that was fantastic. Uh Jignesh, won't you go first?
Yeah, this was a very interesting talk.
Obviously, this is really difficult to do in an old code base.
I have uh two questions. One is you're hitting all of these points in the code.
Obviously, MySQL wasn't written with extensibility as a first-class design, and you're hitting all these place in the code where I break. So, the first question is many years ago, I guess decades ago, when MySQL added the spatial data type, which is pretty complex. It also has things like provide the reference coordinate system and things like that.
Uh did it have to deal with all these problems? Or were there no abstractions created at that point to say, "Here are the kind of places in the code where you might have well-defined uh entry points to add extensibility?" And the second part related to that, as you do that, And were no such there wasn't that type of abstraction done before, would it make sense for you guys to create it now?
Uh um thanks for the question. I You know, I can't speak historically about um you know, about when that was introduced and sort of like all the things that went on with it, but I can I can say just by looking at like I've looked at for example the the vector changes coming up in in the 9.0. Um and the amount of change that has to happen to natively add a type is just it's it's a lot. Um and I think we are doing the the the latter to to to answer your question. Um I think we're introducing the capability of putting in in you know, uh these types.
Um It's possible that we won't encounter some type that needs a deeper integration, but I would say we should cross that bridge when it comes. Like if if it turns out that our type system is not flexible enough for some um you know, uh you know, bespoke special case later on down the road.
I would hope like that with you know, community engagement we would be able to figure out you know, ways around that, ways of generalizing the type system even more to to allow those cases cuz I do know that they like as you as you allude to like the I see it all over in the code base. Like the some of these later additions of types have very special handling to them. Like they say, "Oh, if it's a spatial type, do this special thing. Oh, if it's JSON, do this special thing, right?" Um and so um So yeah, but so far like we've been we've been able to add a number of different types and and not hit um too many difficulties.
Um we have been also extending some of the storage mechanisms underneath. Like we're adding a storage API that allow you to do things like potentially store the data outside um of the the B-tree itself. And so um that will help in cases where like the data is really large for instance, right? So you know, going back to the principle that we talked about at the end. Um you know, like we should make it possible to add whatever you want to add, but it may not be easy to do so, right? Um especially if it if it has like um if if it behaves very differently from from traditional types, I would say.
Yeah. Maybe a quick follow-up. Like in Postgres in the type extension a component, you have these well-defined functions that at least get you something basic running, right? You as not being a developer who's deep down in the storage and indexing and locking and all of that code.
Do you see a hope there might be an equivalent like very lightweight version? Okay, this is not going to run fast, but it's going to run correctly um in the system or the architecture and the the way the code is spread out is just a bunch of case statements for type switches all over the code that you it's not that easy to get that type of clean separation.
No, I mean this is what I've been surprised at. Like I knew it would be difficult, but I feel like you know, we've hit so many of those cases already, right? Like we like yes, we have those well-defined functions, right? If you come you you can come with a from string and a two string and a compare and there's a hash function and so forth. There are these well-defined um functions that you can that you can provide um for your type. And then you can add your own functions as well that you'll be able to call from from SQL, but but yeah, I don't I don't have a uh you know, I don't really think it's a matter of hope. Like I I've I've become convinced that we can do this. Like MySQL can actually do this. It's just a matter of looking at it a different way. It's not that um you know, the system has been designed from the beginning this way, but I'm pretty confident that we will um be able to make this work by taking the approach that we're taking.
Great. Thank you.
Thanks.
Uh time for maybe one question from the audience?
Otherwise, I will see you at the time.
Okay. Um So after UDS and UDTs, what's the next target category of extensions that you guys want to support in Village SQL? Or do you think those two will take you pretty far for what people are asking for?
I think like indexing is is really the big one, right? Like we want like there's no de facto standard for vector search in MySQL today.
Right? And that is unfortunate. Right?
But even if vector search turns out to be something that whatever people don't think is that important, there'll be something down the line that will be, right? So we we have to be better prepared and and more nimble by having this infrastructure in place to be able to to move on the next thing. Um I think that will be really important. I do think vector search is important for what it's worth. Um Yeah. And then and then Abby's Go ahead.
Sorry.
Well, I was just going to add like there are there are a number of different, you know, extensions that maybe can't be done today in MySQL or even in Village and so you know, we we we've done analysis. We understand different categories of those of those hooks and and have ideas for which ones we think are really important to add and we even have um you know, as I said, PRs open for for some of them. Like we will be adding we won't just be doing them sort of one after the other. Like, "Oh, we're doing types and functions now and now we're working on indexing. When we get indexing done, then we'll we'll you know, we'll look at all the other needs." No, we have we have a number of other extensions that are much sort of sorry, extensibility hooks that are much smaller in scope and effort that we are just going to put in as we as we complete them and as people are asking for them.
Got it.
So related to that, you know, Abby's paper from VLDB last year, you know, showed all the showed the the pitfalls, the dangers of having unfettered extension APIs or in Postgres, right? Like they collaborate each other. So So I guess maybe are you guys Have you I mean you're obviously mindful of this and maybe it's in C++ so it's a little bit cleaner to do this. But like are you putting in the the sort of the failsafe mechanisms, the protections to make sure that like extension writers Yeah, if it's arbitrary extension, you can do whatever, but like at least they're not going to interfere with other extensions or break compatibility with the system because two extensions get installed at different times. Have you guys thought about that problem or is it not an issue cuz it's C++?
I mean we have we have talked about that. I mean there's a whole bunch of issues that range from like what happens in a replicated setup if like you know, the the replicas are not in the same state, right? Or um um I don't know, conflicts of some sort.
I mean I think you know, extension dependencies is another thing, right? Like how do you make it so that like one one extension can use and utilize and and depend on the fact that another extension is actually there. Those are actually like extension dependency is something that we're tackling right now. Like one of you know, Dan who I mentioned before has been working on like how is that actually going to work? You know, it's not trivial. Like there are definitely things to work out. Um but we've you know, we've had discussions about all these sorts of things. We've even had discussions about how do we how do we do better isolation if we want if if someone was really concerned even about the operating of these things being um uh sort of conflicting with each other.
Um how do you make sure that um that you have that level of of isolation and control. So So like we've thought about a bunch of these things. They're not all solved, right? We're sort of trying to prioritize where we think we have to focus next and and just work on on those things.
Got it. But we're open for suggestions and comments from from the public. Like we we'd love to get you know, uh you know, discussions going about some of these things.
One more question if you have a few more minutes. Uh towards the end, Steve, you had something super interesting to say about you know, how much software is not written.
Uh MySQL is a massive code base and you've done such deep software development at Google before on massive code bases.
How is your day-to-day change? Like what does your workspace look like? Do you have like three screens with agents doing a lot of coding? Like what's your dev cycle look like today?
Yeah, I mean I Yes, we like we use an agentic, you know, workflow for our normal development. I mean when I started Village SQL, I I wasn't doing that and and pretty quickly it sort of came into the into the forefront uh and we started, you know, we experimented with Cloud Code and and Gemini CLI as well.
And like we've been um you know, using them for our normal day-to-day.
Um you know, it's it's been definitely a boon. Like it's great at explaining things. It's great at writing code. It doesn't always come up with the best suggestions. Sometimes it says, "Oh, you know, we can do A, B, or C."
And you're like, "No, how about D?"
Right? Like you didn't think about that.
Um but you know, it is a it is a large code base. So it's not like it has the whole thing in context, right? So um but but it absolutely changes the way that that you work on a day-to-day basis.
So yes, I have, you know, two or three screens depending on on uh uh where I'm working, but um but yeah, I usually have, you know, um even multiple uh um uh multiple agents going working on, you know, different things simultaneously.
All right. We're out of time. Let me ask one last question. Um Oh, wait. I have a question in the chat.
Wait, no, Andy. Oh, sorry. Oh, Abby, go >> [laughter] >> Abby, go for it. Go Go for it. Oh. I just wanted to ask about the hook dependencies. If you've thought about the If you plan on adding hooks in the Postgres style, like I think a lot of the problems that we ran in Postgres were like um caused by like hooks depending on one another. Like people using hooks to set up one state and then it gets kind of overridden by another extension or something like that. The other thing is I was curious about is that uh one problem that we found in our paper, I guess I'm Abby, um for reference, but is like when you're doing uh index type analysis, like we found that a lot of indexes, if you're adding planning on adding indexes next, don't work with like all different UDTs. They like only work with the specified UDTs that they're made for. And I think that this can be useful in some cases like PG vector, but can also be kind of tricky in like other cases.
Um and ideally you would want to have like an index like a generic index can support like many different types. So like have you thought about these problems? Are you addressing them so on and so forth? Thank you.
Yeah, sure. So we are for indexing, we're really focusing on on vector first and sort of like that being being able to enable something like a like a PG vector. So there, as you said, it's it's a very specific thing, but I think down the road like handling a more generic like GIST kind of thing uh might be something that we we want to think about. Like we've we've talked about that as a possibility. Um and then Sorry, you asked something about Oh, oh, about the uh dependencies. Um the uh Like we've thought a lot about making sure that we that like extensions have their own, you know, if they have their own, let's say, system variables or things like that, um that they are isolated from other extensions. Like we don't want the the stomping that that can happen, right? So we want to want to be more prescriptive about like here's the infrastructure that you can depend on. Like, oh, you can You can make your own system variables. This is how you do it. Um rather than than like, oh, I'm just going to like, you know, um look over there to some on the file system and and I know that the extension is going to have that data over there and so I'm going to just go ahead and and and read it. Like we we you know, we don't want that kind of thing uh to happen.
I don't know if that's the kind of thing that you were encountering when when you wrote the paper.
Um I think yeah, definitely like a good portion of stuff you said was like stuff we encountered when we were writing the paper.
So I'm happy to follow up >> Happy to follow up offline uh if you'd like to to get in contact and talk about that.
>> Oh, yeah, yeah, of course. That would be fun. Yeah. Yes. All right. All right.
It's taking time. Let me ask one last question. It's a bit grandiose, and so if you have a short answer, it's okay.
Um you know, given all the trials and tribulations you have of of getting, you know, user-defined types working, all the periods parts could you detach, the horror of realizing there's like eight or nine different parts that we have to do key comparisons of, you know, these such and such data types.
Did you guys ever consider just saying screw it and just rewriting MySQL from scratch similar to what the Torso guys are doing for SQLite in Rust?
No.
>> [laughter] >> Okay.
All right. We'll we'll leave it at that.
Yeah, yeah, yeah. Like we we we definitely wanted to to to serve the market that's concerned with, you know, making sure that um that they could go back to the uh that they don't break compatibility and that yeah, ba- basically we we that's a totally different undertaking.
Related Videos
Agentforce NOW AMA: Build with React and Salesforce Multi-Framework
SalesforceDevs
490 views•2026-05-28
How agent o11y differs from traditional o11y — Phil Hetzel, Braintrust
aiDotEngineer
450 views•2026-05-28
WEB TECHNOLOGIES UNIT-2 | Degree 4th sem BCOM Computers web technologies unit-2 full explanation💯✅
LearnwithSahera
1K views•2026-05-29
More tests are always better? How to use AI to identify tests that bring little value
Alliance4Qualification
335 views•2026-05-29
Search Algorithms Explained in 60 Seconds! 🤖💨
samarthtuliofficial
218 views•2026-06-01
People of Game of Thrones using JavaScript DOM
AltCampus
296 views•2026-05-30
Introduction to Problem Solving Part - 1 | Lecture 1 | Intermediate DSA
ascensionix
107 views•2026-05-29
So What's Odin Lang Even Good For
TechOverTea
131 views•2026-06-01











