Betzen effectively reframes observability as a proactive design discipline rather than a reactive production safety net. By integrating telemetry into the development loop, he transforms system transparency from an afterthought into a core architectural requirement.
Deep Dive
Telemetry-Driven Development - Noah Betzen | ElixirConf US 2025
[Applause] Hi everyone. My name is Noah, and I'm here to talk about a concept called telemetry-driven development. The subtitle for the talk is "the purpose of the system is what it does," and I'll get to that a little bit more. But first, a little bit about me. If I were a node in an Erlang cluster, I'm Noah.
I go by Nezteb pretty much everywhere online. I have approximate knowledge of many things, so I don't really claim to be an expert with Elixir, but I've got several years of experience, I've been working with it for quite a while, and I've got some stuff to talk about. But first, where do I work? I work for a company called SmartRent. Some of you have probably heard about it through the Nerves community. You might also know it as a platinum sponsor of the Elixir Forum. One of the things I enjoy most about the company is their willingness to contribute back to open source via financial support.
Another fun thing about SmartRent is that we operate at incredibly large scale. We are an IoT company, distributed IoT, usually for multifamily homes. Currently in our production environment we've got about 848,000 hubs. These are smart IoT hubs that other devices like locks and thermostats connect to. We've got about 3.5 million of those devices in prod, so that is about four devices per hub: imagine a lock, a thermostat, maybe a leak sensor, etc. That's our average. These graphs might be a little hard to read, but to give a sense of the scale we operate at: we've got about 231 load balancer connections per second and 3,600 queue messages sent per second. Our P99 latency is about 350 milliseconds, and we do probably about 7,000 DB row updates, deletes, and inserts every second on top of that. So we operate at pretty large scale, and one thing that's especially important to us is telemetry.
For those of you unfamiliar, TDD historically stands for test-driven development. The idea is that you write the unit test first, you make it fail, then you implement the thing, you make it pass, and that's kind of the whole process. What I'm proposing here is a play on that TDD acronym, but for telemetry. "The purpose of the system is what it does" is an idea from systems thinking. It initially did not apply to computer systems. Systems thinking in this context actually refers to systems of people: governments, communities, etc., and how they relate to each other.
One of my favorite things about this idea (I wouldn't call it a manifesto) is that there's no point in claiming that the purpose of a system is to do what it constantly fails to do. We build systems here, usually with Elixir. But how do we know that the systems are actually doing what their purpose is? Are they accomplishing the goal that they set out to do?
So my thesis is that the best time to utilize telemetry in your Elixir application was July 3rd, 2021.
That's when the 1.0 release of the telemetry library came out. The second best time, I think, is now. If you're not already utilizing telemetry, I think you should. Especially if you, or an AI/LLM (there, I've used the buzzword), really want to understand what the system is doing and what its purpose is, you need some form of telemetry.
So what is telemetry? What is OpenTelemetry? I'm going to recap a little bit about what these things are and then introduce a little demo. We've got a live demo. Hopefully it works. Then I'm going to go over the three main Mix environments that most of us are familiar with: dev, test, and prod. Then I'll wrap up with some challenges of combining Elixir and OpenTelemetry specifically, as well as a little bit about the future of what I call TDD.
Telemetry is not a word that we as a software industry invented. It actually comes from old-school electronic devices that were measuring scientific data in various places.
Long story short, I like the word telemetry. The Greek roots basically mean "far-off measure." You're trying to measure something that is far away from you. In this case we're talking about computer systems, but the original root word means far-off measurement. One thing I like to compare telemetry to is pressure gauges, especially in water pipes and gas systems. I would consider a pressure gauge to be a form of telemetry. You can't see what's going through the pipes. You might have built the pipes. You might know that water is going through them, but you don't necessarily know what the pressure is at every single point along the pipe. So you install pressure gauges at regular intervals. Or, you know, you define a telemetry event if you're in Erlang or Elixir. That's the best visualization I have when comparing telemetry to a physical system.
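To make the pressure-gauge analogy concrete, here is a minimal sketch of defining one telemetry event and attaching one handler with the telemetry library. The module name, event name, measurement, and metadata are all made up for illustration; only :telemetry.attach/4 and :telemetry.execute/3 come from the library itself.

```elixir
Mix.install([{:telemetry, "~> 1.0"}])

defmodule PressureGauge do
  # Hypothetical event name, chosen only to mirror the pipe analogy.
  @event [:pipeline, :segment, :pressure]

  def event, do: @event

  # Telemetry handlers run synchronously in the process that emits the event.
  # Here the handler just forwards the reading to the pid given as handler config.
  def handle_event(@event, measurements, metadata, %{reply_to: pid}) do
    send(pid, {:reading, measurements.psi, metadata.segment})
  end
end

# "Install the gauge": attach a handler to the event.
:ok =
  :telemetry.attach(
    "pressure-gauge-handler",
    PressureGauge.event(),
    &PressureGauge.handle_event/4,
    %{reply_to: self()}
  )

# "Take a reading": emit the event with a measurements map and a metadata map.
:telemetry.execute(PressureGauge.event(), %{psi: 42}, %{segment: "north"})

receive do
  {:reading, psi, segment} -> IO.puts("segment #{segment}: #{psi} psi")
after
  1_000 -> IO.puts("no reading received")
end
```

The emitting code knows nothing about who is listening, which is what lets the same event feed a local console, a test process, or a production exporter.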
Funnily enough, you'll notice the telemetry Wikipedia article says "not to be confused with telemetry in software," but if you click on that link, it doesn't actually say telemetry. It says observability. This is another term you'll hear a lot. In short, observability means you can observe it. That makes sense, but Honeycomb defines it a little more precisely: a measure of how well you can understand and explain any state your system can get into, no matter how novel or bizarre, without needing to ship new code. So telemetry is one of the main observability pillars that we're working with here.
For a little more context on Elixir telemetry, most of you are probably familiar with these two libraries: telemetry and telemetry_metrics. Telemetry, long story short, is for emitting events of any form. They are basically maps.
Telemetry_metrics is for defining how those events are aggregated into meaningful metrics. You would use telemetry to instrument your code and define your events. Telemetry_metrics would determine how you aggregate them: whether you want them as a summary, a counter, or one of a variety of other options. What these two libraries don't necessarily do by themselves is integrate with or send these telemetry events to other systems, whether it's Datadog, Prometheus, StatsD, etc. So we need a standard, because otherwise it's up to each individual vendor to define their own, and then you have to pull in their custom SDKs, and migrating is a pain in the butt. That's why OpenTelemetry exists.
OTel, or OpenTelemetry, is a vendor-neutral open-source observability framework for instrumenting, generating, collecting, and exporting telemetry data such as traces, metrics, and logs. Traces, metrics, and logs are also what is known in the OpenTelemetry world as signals. OpenTelemetry actually defines some additional signals that you may not have heard of before. Logs, metrics, and traces most of us are pretty familiar with. But there are also concepts such as baggage. Emotional baggage, maybe.
Baggage is basically just an arbitrary amount of context that you can pass to and from the other signals. There are also proposals for other things like events (think of event streaming or a continuous log) and profiles.
Profiles I can't even begin to explain. As far as I know, they're about profiling specific behaviors of your telemetry.
So where does this leave us with Elixir? What is Elixir's support for OpenTelemetry like? It's pretty good. You will notice, however, that of the three main signals that we care about, traces are the only one marked as stable. Metrics and logs are still in development currently.
If you need metrics and logs in your application, you're pretty much forced, at least in Erlang and Elixir for now, to use the aforementioned telemetry and telemetry_metrics libraries.
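As a rough illustration of the telemetry_metrics side of that pairing, the sketch below declares how some events should be aggregated. The DemoMetrics module and the demo.work.stop event are hypothetical names; counter/2, summary/2, and last_value/2 are the metric constructors from Telemetry.Metrics. These definitions only describe metrics. A reporter (for a vendor, Prometheus, StatsD, and so on) is still needed to actually collect and ship them.

```elixir
Mix.install([{:telemetry_metrics, "~> 1.0"}])

defmodule DemoMetrics do
  import Telemetry.Metrics

  # Each metric name is "<event name>.<measurement>". For example,
  # "demo.work.stop.duration" reads the :duration measurement of the
  # hypothetical [:demo, :work, :stop] event.
  def metrics do
    [
      # How many times the event fired.
      counter("demo.work.stop.count"),
      # Statistical summary of durations, converted from native time units.
      summary("demo.work.stop.duration", unit: {:native, :millisecond}),
      # Most recent value of a measurement.
      last_value("vm.memory.total", unit: :byte)
    ]
  end
end

for metric <- DemoMetrics.metrics() do
  IO.puts("#{inspect(metric.__struct__)} -> #{Enum.join(metric.name, ".")}")
end
```

A list like this is what you would hand to a reporter, for example the Telemetry.Metrics.ConsoleReporter that ships with the library for local experiments.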
However, all of the repos mentioned at the very bottom are actively being worked on, and there is a GitHub issue in the OpenTelemetry Erlang library for adding a logging domain. So that's where we're headed in the future. At this point I start thinking, hang on a minute.
I've seen so many standards that this feels like we're entering a space of too many standards that basically don't do anything. I will say, with a caveat, that I think OpenTelemetry is doing this a little more carefully.
I say that, but then you have libraries like the aptly named opentelemetry_telemetry library, which acts as a bridge between the Elixir telemetry library and the OpenTelemetry library. If I say telemetry another time, my head might explode. It feels like Mr. Bones's Wild Ride and we want to get off. At this point I might have convinced you to never even approach OpenTelemetry. So I advise we take a deep breath. I didn't actually bring my water or I would drink it right now.
A little more context. I've been thinking about some of this stuff in my head since at least January of 2022. The demo I will show is currently a private repo, but at some point after the talk I'll make it public. I've basically been collecting links to talks, blog posts, etc. that have to do with telemetry, Elixir, and OpenTelemetry, and I've tried to summarize them all in a single demo. For those of you unaware, Grafana is one open-source tool for collecting observability data like logs, metrics, and traces. For the purpose of this demo, we're going to have a Docker container running, which has all of the fancy UI that we need, and then we're going to structure our Elixir application around that.
I'm going to go over the three different environments. The It's Always Sunny in Philadelphia meme feels particularly relevant, because we not only have three Mix environments, dev, test, and prod, but if you think about it, we also have local tests and CI tests. And then for every environment that your company might deploy its products to, you have another environment. You might have dev, test, QA, UAT, prod, and so on. It gets unwieldy very quickly, but we're going to start with MIX_ENV=dev. Now let's see if I can actually get my demo onto the other screen correctly.
So, what did I do? Well, I've built this demo three different times, and each time it was too complicated.
I literally finished it again this morning. I don't know if there's any way I can turn up the contrast here or switch to a light theme, but we're going to walk through this a bit. So, MIX_ENV=dev. What are some uses of telemetry in local development? I won't show off too much, but the big one for me is this: say you have a ticket to add or update your existing telemetry. Maybe you send traces to Datadog. How do you know they're actually working if you have to make an update, deploy, and then wait for data to be sent to Datadog before you figure out whether it actually works? Well, I can run Grafana locally.
I can run my OpenTelemetry Collector locally. I can do this all on my machine. I don't need to worry about sending it up to a production Datadog environment or costing any amount of money. The gist of the OpenTelemetry ecosystem is a series of dependencies.
This is another part of the ecosystem that gets a little cumbersome, especially in Elixir, because the documentation is split up across multiple places. Generally the _api version of the library will tell you what the actual API is.
It's designed so that the API is a separate thing from the actual implementation. It's a pretty sensible design, but it does make local setup a bit onerous. Another example is in our application: when we start up the application, I've got some stuff that I would never do in production, like logging all of my environment, but that's just for debugging. Mostly, all you really need to do is make a couple of OpenTelemetry-specific setup calls in your application.
Then there are a couple of other pieces of specific configuration that are needed. There's a config.exs for OpenTelemetry. You can set it so that traces get exported to standard out locally when you're running with MIX_ENV=dev.
There's all sorts of config you can do.
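As a hedged sketch of what that configuration can look like (this is not the demo's actual config, which was not shown in the transcript), the fragment below sends traces to standard out in dev using the stdout exporter from the OpenTelemetry Erlang SDK. The commented-out alternative assumes a collector listening on the conventional OTLP/HTTP port 4318 on localhost and the opentelemetry_exporter dependency.

```elixir
# config/dev.exs (sketch; the file name and values are assumptions)
import Config

# Print finished spans to standard out instead of exporting them anywhere.
config :opentelemetry,
  traces_exporter: {:otel_exporter_stdout, []}

# Alternative: export over OTLP to a locally running collector.
# Assumes the collector accepts OTLP/HTTP on localhost:4318.
#
# config :opentelemetry, traces_exporter: :otlp
#
# config :opentelemetry_exporter,
#   otlp_protocol: :http_protobuf,
#   otlp_endpoint: "http://localhost:4318"
```

Keeping the exporter choice in per-environment config means the instrumentation code itself stays identical across dev, test, and prod.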
I mentioned earlier the Docker Compose file, or the Docker container, that I wanted to use. That's here: grafana/otel-lgtm. LGTM stands for Loki, Grafana, Tempo, and Mimir, which I think is their Prometheus alternative. Most of us probably know LGTM because when you approve a PR you write "looks good to me" and then blindly approve it without testing it. So we've got Grafana running locally. This includes the collector and all those other things we talked about earlier.
So let's run it, if I can bring up our Docker Compose stack here on this tiny screen.
Oh, yeah. There are a couple of commands in here.
Let's do docker compose up. Here we go. Okay. So, what's this going to do?
This is going to start our stack here.
We've got a simple Phoenix app. That's it. I don't know how to do any front-end development. I'm joking. But the meat of this demo is in an aptly titled demo server. This is a contrived GenServer that really just exists to get the point across about what kind of telemetry we're collecting here. It basically runs every 10 seconds, does some work, and then sends some telemetry events. The work it's doing is very contrived. It's just doing some work and creating a span in OpenTelemetry. With tracing, traces are usually built up of spans.
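The demo's source was not public at the time of the talk, so the following is only a guess at the general shape of such a GenServer, not the speaker's code. The module name, span names, attribute key, and event name are hypothetical. Tracer.with_span and Tracer.set_attributes come from the opentelemetry_api package. With only the API installed (no SDK), the tracing calls are no-ops, so the sketch runs without a collector.

```elixir
Mix.install([
  {:telemetry, "~> 1.0"},
  {:opentelemetry_api, "~> 1.0"}
])

defmodule DemoServer do
  use GenServer
  require OpenTelemetry.Tracer, as: Tracer

  # Run the contrived work every 10 seconds, as described in the talk.
  @interval :timer.seconds(10)

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    schedule()
    {:ok, %{runs: 0}}
  end

  @impl true
  def handle_info(:work, state) do
    do_work()
    schedule()
    {:noreply, %{state | runs: state.runs + 1}}
  end

  # Wrap the work in a parent span with one child span, then emit a telemetry event.
  def do_work do
    Tracer.with_span "demo_server.work" do
      started = System.monotonic_time()

      result =
        Tracer.with_span "demo_server.cpu_work" do
          Enum.sum(1..100_000)
        end

      # Attach the result of the operation to the current (parent) span.
      Tracer.set_attributes(%{"demo.result" => result})

      duration = System.monotonic_time() - started
      :telemetry.execute([:demo, :work, :stop], %{duration: duration}, %{result: result})
      result
    end
  end

  defp schedule, do: Process.send_after(self(), :work, @interval)
end

IO.inspect(DemoServer.do_work(), label: "work result")
```

Adding the opentelemetry SDK and an exporter to the dependencies is what would turn these no-op spans into traces that show up in a tool like Tempo.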
Once you've started a span, you can do any arbitrary work within it. You can set attributes based on the results of specific operations. But the coolest part is that at localhost:3000 we actually have our own working instance of Grafana running locally. Which one of these is Explore? If we go over to Tempo, which is the Grafana tool for managing traces, we can start to explore some of the traces that we're seeing here. The screen's a bit small and the UI is extremely cluttered, but here's an example trace that the GenServer sent. This basically just does some work. I'm trying to emulate some CPU-intensive work and some IO-intensive work, and those are broken down into three different spans within this trace. That is basically the gist of the demo, but I have a couple of other things to talk about.
So, that was MIX_ENV=dev. Now MIX_ENV=test. Back to the demo. This one's much simpler. The gist of this is mix test: how can we use telemetry there? You can actually unit test your telemetry events and handlers. You can also use them as a way of not adding Process.sleep to tests. If you ever have a GenServer that's doing some work and you want your unit test to assert that the work is finished, one thing you can do is Process.sleep until the work is finished. A better thing to do is to call :telemetry.attach or :telemetry.attach_many. There are different ways to do this, but you basically just say: whenever I receive this telemetry event, send myself, aka the test process, a message of some sort. Then we can do some operations and assert that we received that message.
And you'll notice there's no Process.sleep.
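A minimal sketch of that testing pattern, assuming a hypothetical Worker module that finishes asynchronously and emits a [:worker, :run, :stop] event. The test attaches a handler that forwards the event to the test process and then uses assert_receive instead of sleeping.

```elixir
Mix.install([{:telemetry, "~> 1.0"}])

ExUnit.start()

defmodule Worker do
  # Hypothetical worker: does its work in another process and emits an event when done.
  def run_async do
    Task.start(fn ->
      result = Enum.sum(1..10)
      :telemetry.execute([:worker, :run, :stop], %{result: result}, %{})
    end)
  end
end

defmodule WorkerTest do
  use ExUnit.Case, async: true

  # Handler that forwards the event to the pid passed in as handler config.
  def forward(event, measurements, metadata, test_pid) do
    send(test_pid, {:telemetry_event, event, measurements, metadata})
  end

  test "emits a stop event when the work finishes" do
    # Unique handler id per test process so concurrent tests do not collide.
    handler_id = "worker-test-#{inspect(self())}"
    :ok = :telemetry.attach(handler_id, [:worker, :run, :stop], &__MODULE__.forward/4, self())
    on_exit(fn -> :telemetry.detach(handler_id) end)

    Worker.run_async()

    # Waits up to one second for the message; no Process.sleep needed.
    assert_receive {:telemetry_event, [:worker, :run, :stop], %{result: 55}, _metadata}, 1_000
  end
end
```

The test finishes as soon as the event arrives, so it is both faster and less flaky than guessing a sleep duration.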
So this is a common thing I see in unit tests, at least for testing telemetry.
And MIX_ENV=prod. This one I've kind of already gone over, because the Grafana demo in this case is actually running with MIX_ENV=prod: it does a mix release, puts it in the container, spins it up, and hooks it up to Grafana. The code for all this will be available after the talk. I need to clean it up a bit more and probably remove some swear words. The other applications of telemetry in prod are pretty obvious. Most of us are familiar with telemetry in production.
But multi-provider telemetry is a big one. If you want to send your application traces to Datadog, if you want to send your application logs to Splunk, if you want to stream other data to an LLM integration that will steal your identity, go for it. There are plenty of options there. Another deep breath. I don't have my water. I'm sweating.
So, when to use each? It's pretty obvious to us that logs are useful. I'm sure plenty of you are familiar with wolf fence debugging, aka print statement debugging: reached here, reached here, reached here, A, B, C. But what about metrics and tracing? I hope that I have at least partially convinced you that you can do a lot of the same stuff locally and get a lot of the same benefit that you could in production. Similarly in testing, I'm sure some of you have seen capture_io used to assert that a given function logged a given thing. That's kind of hacky, but I've done it, so it's not that bad. Production logs, metrics, and tracing most of us are familiar with. Like I said, we all use something like Datadog or Grafana. In prod you're very used to all of these things happening. But do you also measure your telemetry in dev? Do you measure it in test? Most of us probably don't, but maybe we should.
Ideally, all of these environments are as close to identical as possible. So if we're doing it in prod, we should probably be doing it in the other environments as well.
One of my favorite bits of telemetry advice comes from Ethan Gunderson. At the time he was working at Cars, and his ultimate hot take is that logs are garbage. They're super expensive and really low utility, in his opinion. I actually 100% agree with Ethan, but you can pry my logs from my cold, dead hands.
I will keep using logs, but I do agree with the idea that traces are a lot more valuable. Like I showed with the Grafana example, you can drill down into a span and see exactly how much time each thing is taking.
A last caveat: telemetry isn't free.
It costs money. The business value of telemetry is often hard to find, especially if you have a lot of noise and not a lot of signal. Signal.
Haha. Signals: logging, metrics, traces. Signal versus noise. No? Just me.
There's also a lot of effort. You've got to encode the telemetry, you've got to transmit it, you've got to parse it, you sometimes have to store it, and you've got to pay for the storage. Another warning, I guess: Goodhart's law states that when a measure or a metric becomes a target, it ceases to be a good measure.
That applies here as well. If you start arbitrarily defining goals based on your metrics and your telemetry data, you're going to have a bad time. Also, some ad hoc telemetry data could probably just be application data. Do you really need to send a telemetry event and keep track of every time you add a new user, or do you really just need to run SELECT COUNT(*) FROM users? You probably don't need telemetry just for that.
As for the future, like I said, someday I would love to see both metrics and logs reach stable for the OpenTelemetry Elixir and Erlang libraries. Again, as in the snippet from a diagram earlier, you can collect all the telemetry you want and then you get to choose where it goes. You can send it to Claude, you can send it to Gemini, you can send it to your other robot of choice. Or you can send it to other providers.
Brief recap. We talked about telemetry and OpenTelemetry, showed a little bit of a demo that'll be open sourced later, went over the three main environments for Mix, and then covered some of the challenges of combining the two.
That's my face on José's body, by the way. Any questions?
[Applause]