Building AI products that work is different from building AI products that are reliable; reliability requires addressing five key dimensions: edge behavior (handling unexpected inputs gracefully), graceful degradation (admitting uncertainty rather than confidently hallucinating), observability (detecting issues before customers report them), reproducibility (ensuring consistent outputs), and distribution shift resilience (maintaining performance as data distributions change). Evaluation should be designed from the start, not added after bugs occur, using both golden reference tests and LLM-as-judge approaches for non-deterministic models.
AI @ Qonto: shipping and evaluating AI in production
Hello, we are so happy to be back with you today for a great new TPC live stream. Hello everyone. I saw that some people have started introducing themselves in the chat, that's great. Feel free to continue: tell us who you are and what kind of job you do. For me, it's super interesting to get a little taste of who we're going to talk to and exchange with today. There you go. I'll start directly with the slides, as usual, for the TPC presentation.
OK, here we are. I am in the presence of another exceptional person today, Marianne Borzic du Courneau, who comes straight from Qonto. She has an exceptional career path, hold on tight.
She went through the United States, stopping in San Francisco. She worked at Uber and Amazon, and today she's back in France to lead the AI strategy and all of Qonto's AI products. We're so lucky to have you with us, Marianne. Would you like to say a few words?
Well, the luck is shared. I too am very lucky to be able to speak to The Product Crew today.
I've been watching your live streams for a long time and I'm delighted to be able to be on the other side of the microphone today. [laughs] There you go, I hope that will interest all our listeners and viewers.
I'm sure it will. Given what we've already discussed, I know that what you've prepared for us is something big. Okay, let's get straight to the point. So: AI at Qonto, shipping and evaluating AI in production. We're going to see a little of what goes on behind the scenes and how the products were built. As usual, the program starts with TPC resources: I'm going to bother you for a little while, 5 minutes, with the TPC news. Next, we'll switch to Marianne's talk, and we'll have a 10-15 minute Q&A at the end as usual. Feel free to post your questions as they come up. I will take them one by one, and if some questions need to be asked live, we will ask them in the middle of the presentation. Otherwise, we'll save a dedicated session for the end.
Some news, and not small news actually, big news: the salary study dedicated to leaders is available on the website. As you know, we at TPC have already done a huge study that spanned all the engineering, data, HR, product and design professions; that one has been live for a little while. And now we've released the one dedicated to C-level and VP-level roles, with some pretty big learnings. I was quite struck by what we learned about VPs of Engineering. I won't tell you any more, you'll have to go and look to find out.
Another big news is that we're launching a podcast and the first episode of our Reshape podcast was released today. You can find it on almost all platforms, YouTube, Spotify, Apple Music, Anything and we interviewed the CEO of Starling who explains how he rebuilt his entire company in 90 days with Delia Saint flight of the towers.
In addition to that, as you know, TPC recruitment has 70 off-market job offers that are open to all our members. You can be matched with the best jobs on the market thanks to TPC.
Whatever your profession, and we'll see this right after, we have a form that takes 3 minutes to fill in.
Before showing the form, I'll start with an off-market opportunity that's super hot: a Tech Lead position in a start-up based in Paris, with hybrid working.
So, still on the same theme, it has a generative approach that replaces 3 weeks of creative briefing with 30 minutes. In other words, we're going to Uberize the creative brief, and in my opinion there's potential to carve out a significant place in the market. There's already an AR in place and an EU deployment underway. So the idea is to consolidate and strengthen the teams around a product that has already found its place. There you have it, a super interesting opportunity to step up as a tech lead.
In addition to that, you also have, as usual, the link to our form to find all the off-market jobs.
So there you go, you just have to scan this QR code. Enter your information in 3 minutes: first name, last name, email, your LinkedIn, the type of position you are looking for, your experience, minimum salaries, and we can match you with Intass offers. You will be accompanied by our excellent recruiters to find the perfect job and position you where you need to be.
In addition to that, we have plenty of resources for you. We do software engineering case studies, PHP case studies, and case studies not only in engineering but also in data, design and product. It's just that I wear the engineering hat, so that's what I emphasize. These are a lot of resources that can help you in your job search, in structuring your processes, and in how you implement your techniques. So it's really worth checking out, and likewise, you can find it all via the QR code. As usual, our live streams are available to watch again from the afternoon, so if you miss this one, you can find it later. That's all from me regarding TPC, and I'll now hand over to Marianne.
Perfect. Thank you very much, Emma. And for those who have joined in the meantime, hello. I'll just quickly introduce myself. I am Marianne, Head of AI Products at Qonto. During this lunch break, I'm going to talk to you about what we learned at my company by putting AI into production.
So, I'm going to start with a little bit of provocation. It's a quote from me, you can quote me on it: making an AI product that works has become easy.
In a few weeks you can have a prototype that impresses others, impresses yourself, passes demos, and does things you wouldn't have thought possible 3 years ago. The real issue for me is what comes next: making something that works well, that behaves exactly as expected, that won't surprise your customers in a bad way, and that holds up in production under load and on real data. And that, for me, is a completely different job. So that's what we're going to talk about today in this talk. I'm going to explain how we did that at Qonto, what we learned from putting four different AI products into production, and what that implies for the teams that build AI products. I'll also give you some learnings to take away and apply yourselves.
We're going to do this in three steps. First, I'll present the context, a little intro. Then, as I said, we'll look at four different products that are in production, with demos on real accounts. In the last part, I will go behind the scenes and leave you with two keys in particular: a grid to distinguish an AI product that works from an AI product that is reliable, and a method for thinking about evaluation before the bug occurs.
And if we have time, I will share some thoughts on how the engineering profession is evolving, specifically for machine learning engineers and back-end engineers. We'll see depending on the timing.
Nice program, a busy program. So we're already moving on to the first part, which is to set the scene a little and explain, or remind you if you already know, what Qonto is.
So, Qonto is a professional online account for SMEs and freelancers. It's really B2B. In this professional account, we have payment tools, obviously, like any good account, plus invoicing tools and expense management tools, all in one place in a single UI. You can see it on the right here, and you will see it again as the demos progress.
In terms of scale, we have over 500,000 customers and we are present in 8 European markets, France being our biggest. In terms of internal scale, Qonto has approximately 1500 employees, more than a third of whom are in tech. At Qonto, I'm in tech, and I lead the AI Products team. The name is a bit misleading, but AI Products is very much a tech team.
In my team, I have approximately 15 engineers. They are machine learning engineers (that's their official title) or data engineers. I'm mainly going to talk about machine learning engineers today, since they make up the majority of my team. If my data engineers are watching the video [laughs], hello to them, they are definitely part of the team, but here the focus is on MLEs. In any case, we have a common mission, which is to improve the customer experience through machine learning directly integrated into the product. And I'm going to emphasize one point: we're not a research team, we don't publish papers. The only things we publish are Medium blog posts. We really ship ML features into production for customers, and we monitor everything end to end. We are responsible for everything, from start to finish.
There is another key distinction I want to make before getting into the heart of the matter: internal machine learning versus product machine learning. This is an important difference, and in this talk I will only cover the right-hand column, product ML. The difference is not production deployment, since internal applications can also be advanced and deployed; it is really who the direct user of the product is.
Internal ML is what will, in my case, help Qonto's internal teams make better decisions. I saw in particular a live stream, about 4 years ago I think, on The Product Crew, about marketing mix modeling and topics like that, where data scientists build models, or even put models into production, for internal teams, to help them allocate their marketing budget. That's internal ML. Similarly, something like early churn detection will help customer success teams identify customers at risk of churning and give them a warning signal so they can act before the customers actually leave. The end users of this ML are Qonto employees, so it's not in my team at all.
I am on the right, product ML. Here we are directly in the interface, within the Qonto application, the interface our clients use. The end users are the half a million companies. And this scale changes everything. The reliability requirements are much higher. The feedback cycle is also longer than when it's done internally. And the customer impact is, I won't say more real, because on the other side it is real too, but let's say more direct. So again, the four product ML use cases that I'm going to explore today are part of the right-hand column, product ML.
Yeah, exactly.
So, for each of the four products that I'm going to present, I'll follow the same approach: the initial customer problem, then how we solved it and with which stack, then the impact on the customer. I'll also show a little demo each time, which I recorded in advance so as not to risk the live-demo effect four times over. One extra piece of information: the products are presented in chronological order of release, so the first one went to production first, and so on. Beyond these four products, we have many others, but I chose these because they are quite different from one another, so you get a good view of the type of things we can do in my team.
So, the first product is OCR. I wrote invoice extraction, but it's not necessarily invoices; let's say financial document extraction. This first product is not necessarily specific to Qonto or to fintech. I don't know if we have friends from... oh yes, we have friends from Alan.
[laughs] Yeah, I know Alan well. You also developed a product to extract information from documents, health documents in particular. So here is the customer problem: imagine you are a Qonto customer, at the head of an SME, and you receive invoices from your suppliers. These invoices can be PDFs that you receive by email, images, paper documents that you photograph, or things scanned hastily. And you, as the head of this SME, need to extract the information required to pay the invoice: who am I paying, what amount do I have to pay, what is the due date, and, as part of who I am paying, what is the IBAN of the person or company I have to pay.
And if you have 50 invoices per month, doing it by hand is long, repetitive and error-prone. So that's the problem. How did we solve it? Something quite classic: we automated it with ML. What is less classic is that we decided to build our tool entirely in-house instead of outsourcing it. We have a two-stage pipeline. First of all, we have an OCR stage.
OCR stands for optical character recognition. It detects the text in the image. If it's a PDF, we don't need to do it. But if we're talking about an actual image, we detect the text in the image and transcribe it into raw text, into a text layer.
That's the first step. Second step: once we have a proper text layer, we run an mT5 model. It's a model that was developed by Google, fine-tuned on our own data to extract the structured fields: as I was saying, the amount, the date, the IBAN and so on. A text layer contains a lot of information that is useless for us. For example, the issue date, the date the invoice was created, can be useful in general, but for payment it is not. What we need is the due date. So we map (I don't have the French equivalent) the text layer to the correct fields, the ones that interest us and that I mentioned earlier.
We tested this model on 1.6 million documents, and then we manually rated about 10,000 documents for quality. The "m" in mT5, I think it stands for "multi"; in any case, it's multilingual. So it handles French, German, Italian and Spanish, which are the languages we see most often in our invoices at Qonto.
And with that, we have over 95% accuracy on the main fields we want to extract, from an image or a PDF document, and in less than 2 seconds, which is also important.
In less than 2 seconds, we have all the fields extracted from the invoice.
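As a rough illustration of the two-stage flow described here (skip OCR when the PDF already has a text layer, then run a field extractor on the text), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Document` class, the `extract_fields` function and the two stand-in functions are hypothetical, and the regex-based `fake_extractor` only takes the place of the fine-tuned mT5 model so that the sketch can run. It is not Qonto's implementation.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Document:
    """An uploaded file; text_layer is set when a PDF already embeds its text."""
    raw_bytes: bytes
    text_layer: Optional[str] = None


def extract_fields(
    doc: Document,
    run_ocr: Callable[[bytes], str],
    run_extractor: Callable[[str], dict],
) -> dict:
    """Stage 1: OCR only when there is no text layer. Stage 2: structured fields."""
    text = doc.text_layer if doc.text_layer is not None else run_ocr(doc.raw_bytes)
    return run_extractor(text)


# Hypothetical stand-ins so the sketch runs. In the talk, stage 1 is an OCR
# engine and stage 2 is a fine-tuned mT5 model, not regular expressions.
def fake_ocr(raw: bytes) -> str:
    return raw.decode("utf-8")


def fake_extractor(text: str) -> dict:
    amount = re.search(r"Total:\s*([\d.]+)", text)
    due = re.search(r"Due date:\s*(\d{4}-\d{2}-\d{2})", text)
    iban = re.search(r"IBAN:\s*([A-Z]{2}[0-9A-Z ]+)", text)
    return {
        "amount": float(amount.group(1)) if amount else None,
        "due_date": due.group(1) if due else None,  # the issue date is ignored
        "iban": iban.group(1).replace(" ", "") if iban else None,
    }
```

Passing the two stages as callables keeps the routing logic (OCR or no OCR) separate from the models themselves, so either stage could be swapped without touching the other.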
So it gives structured results.
Exactly, structured output, a JSON. Well, not a JSON for the user, obviously. I'm going to show the demo, and there it appears in boxes, in a completely user-friendly interface. What did I want to [clears throat] add to that?
Yes, I used the example of the invoice, but as I said, this tool, Flex, for Financial Language Extraction, was built for a whole range of other financial documents, including card receipts. If you go to a restaurant, you get your restaurant receipt with the list of what you had for lunch, and so on. [clears throat] So I'm going to show you, I'm going to play the video. I think you can see it. Okay, here we are in the Qonto interface. I'm going to make a transfer. I actually received an invoice from one of my suppliers; as I was saying, that was our illustrative example. So in this case, I upload the invoice. Since it's a PDF, the text layer is already there and only the mT5 model will run. And as you can see, the fields are correctly extracted.
I owe money to the artisanal bakery. As I already had it in my beneficiaries, it recognized it, and I just have to click on continue. I'm going to pause here.
In fact, on the invoice, the due date has been extracted automatically. And here in the interface, we offer the option to either make the transfer today or make the transfer on the due date, which is what everyone does for cash management reasons: the money is [clears throat] better off with you than with your suppliers.
So there you go, there's just this little click to make and then you validate it as a user.
I click continue. Then a final look confirms that everything is fine. And here's the payment. Yes, I'm being asked for confirmation.
There you go, you've seen my, uh, very secure code.
And there you have it, the transfer is underway.
So, the transfer is pending simply because, on that account, I didn't have all the rights of the Qonto account administrator.
So this is the first product I wanted to talk to you about. It's been integrated into the Qonto product for years. As you can see, this is very core banking, I would say, very closely linked to the primary features of a bank account.
The second product I want to talk to you about... yes? We already have a question about Paddle. I think it's quite relevant to ask that question on this product.
Yes.
Uh, the question is from Vincent.
Have you benchmarked Paddle against vision LLMs (Claude, GPT) on the extraction of structured documents?
Yes, the answer is yes, and I can elaborate. Vincent, yes, we benchmarked, and in our case OCR was better. In terms of cost, obviously, since it's open source, we only have infrastructure costs. And then in terms of latency, which is what really guided our technical choices. Here, we absolutely wanted minimal friction, and therefore minimal time, between uploading a document and extracting the data. We're at P80... well, in most cases, the data is extracted in less than 2 seconds, and that's unbeatable compared with an external API.
Hm hm. It's true that it's quite a wow effect to see the results live. That said, the OCR is ultimately quite basic, especially for handwriting: there are sometimes documents, such as taxi receipts, that are handwritten, and it's not very good at those. But with the invoicing reform, we have fewer and fewer documents of this kind, where the critical part is extracting the text layer. The real criticality comes mainly from the second part, which is to properly map the text layer to the right fields.
So there you have it, with electronic invoices, there is less need to tune the recognition.
There is another question. Actually, there are two questions about seq2seq, from Ayou Massoui. I've posted it: what are the costs and ease of integration of the mT5 seq2seq model?
Sorry, I don't have the costs [laughs]. Ease of integration... I imagine these are fairly general questions about seq2seq.
Sorry, Ayou, could you perhaps clarify your question? I'm not sure I understand it.
Then he asked about ease of integration. Okay, I'll read it. OK. Ah, the costs. Well, the costs are actually really minimal. They are the infrastructure costs of having your model hosted.
We are on AWS, and you have an internal pipeline that you query every time you need to do serving, that is, every time you need to do a field extraction. So these are really minimal costs, except for fine-tuning and retraining, where I think it can amount to €10,000 per retraining. As for ease of integration, again, we're on AWS, and it's like any other machine learning model that has been pre-trained and that we then make available for serving. So it's very, very easy to integrate.
And a follow-up question, still on seq2seq and still from Ayou: are you sending data to Google with seq2seq?
No, no, no. It's open source. It came from Google Research, if I'm not mistaken; to be honest, that should be verified. I'll let you check [clears throat], but in any case we don't send anything at all to Google, and if I'm not mistaken it's simply an open source model.
OK. And finally, one last question before moving on to the next product.
How is Paddle OCR hosted?
It is hosted with us again on AWS infrastructure and it is also open source.
OK, very clear. Listen, I won't bother you any longer.
Great. And yes, there will also be time at the end, as you already said. The second product I wanted to [clears throat] talk about is document-transaction reconciliation. This second one comes after the first product in the life of the account. So the problem: imagine you're still a Qonto customer, which is good, you haven't changed. You pay an invoice in January, but you actually received this invoice in December and decided to pay it in January.
How do you know that the invoice and the transaction match? Done manually, this is accounting reconciliation work, which is tedious and which blocks the monthly closing. Not everyone may be familiar with this concept, but as a company, you have to close your books every month.
To close your month, your accountant needs each transaction you made during the month to be supported by a receipt. Hence the importance of having proper documents with the correct fields clearly filled in. So at the end of January, your accountant will ask you for all the invoices, all the documents that correspond to your payments for the month of January. The problem is that you may not have received the invoice in January. It may have been received in December or even earlier.
Without automatic matching, it's manual work every time, at each closing, to say "OK, this transaction that I made on January 15th corresponds to the invoice that I uploaded on December 15th." And sometimes you make a transaction, let's say on January 16th, and you haven't received the invoice yet; it hasn't been uploaded to your account yet. So there is no matching to do. It is really very tedious, very time-consuming, and low-value work. So it's something we wanted to solve for our clients, and again, as with Flex, the Financial Language Extractor, we wanted to automate this pain point as much as possible.
So we created a machine learning model, a simple one: it's a decision tree trained to score document-transaction pairs. It is triggered either every time a new document arrives in your Qonto application, or every time a new transaction arrives: if you make a transfer, if you receive money, and so on. As soon as there is a new event, either a document or a transaction, it scores the candidate pairs and outputs one of three types of response. Either it says there is a total match between this document and this transaction, and it reconciles the two. Or it says there is a partial match: there is a good chance that this document and this transaction are linked, and we suggest this to the customer, who validates it in one click. Or there is no match, either because we did not find anything obvious, or because, as I was saying, you uploaded a document and have not yet made the transaction, or the other way around. And this covers both supplier accounts and customer accounts.
That's a bit of jargon: supplier accounts means it covers the invoices you receive from your suppliers and the transactions that go with them, and customer accounts means the invoices you issue to your customers and the transactions that go with them.
So, in terms of a demo, we'll be able to look at that together.
You can see that I've become richer: I now have an account with 8 million. Things are going well for SMEs. [laughs] So, here I am in the transactions. Look, I actually paid €500 at Leclerc.
So, uh, there's no attachment.
Okay, I'm going to pause briefly. What I was showing you is that I had a transaction, I paid by card at Leclerc on June 27th, and I don't have any document attached.
So when June 30th arrives, that is going to be a problem for my accountant. In any case, he's going to ask me for the document.
Now, obviously, when I did my shopping, I asked for an invoice for my company. So I simply retrieve the invoice. Here, on the right, in the Flex box, you have the OCR and extraction at work. Everything is correct again: the dates and so on. The total amount is correctly extracted. It's not very legible on screen, but it's an invoice that does indeed correspond to that transaction. You can't see it well, but it clearly says €500 and the supplier is clear. In short, I upload this into my invoice library.
So once again the fields are extracted correctly. There's nothing left to do but close the window.
What we're going to look at together now is that it appears in my invoices. And if I go back to the business account and transactions section, you can see that on the card transaction line for the €500 at Leclerc on June 27, there is now an attachment. It wasn't me who did it; it was Atma, the Attachment Matcher, that ran because I uploaded an invoice that corresponded to this transaction. I'll take this opportunity, since I haven't mentioned how the decision tree is built in terms of features: there is obviously the date of the transaction, and the name of the recipient, the one to whom we paid. What else do we have? The amount, obviously. Here, it's a perfect match on the amount, and that's what allows the decision tree to match the two. In this case, it was a complete match. We are 100% certain that this document corresponds to this transaction. So here, I just open the line, the transaction view, and we can see the document correctly attached, automatically. That's the one we just uploaded.
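To make the features mentioned here more concrete (transaction date, recipient name, amount), the sketch below turns one document-transaction pair into numeric features that a tree-based scorer could consume. The class names, field names and the use of `difflib.SequenceMatcher` for name similarity are all assumptions made for illustration; the talk does not say how the real features are computed.

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher


@dataclass
class Invoice:
    supplier: str
    amount: float
    issue_date: date


@dataclass
class Transaction:
    counterparty: str
    amount: float
    booked_on: date


def pair_features(inv: Invoice, tx: Transaction) -> dict:
    """Turn one (document, transaction) pair into numeric features."""
    # Case-insensitive string similarity between the two names, in [0, 1].
    name_sim = SequenceMatcher(
        None, inv.supplier.lower(), tx.counterparty.lower()
    ).ratio()
    return {
        "amount_abs_diff": abs(inv.amount - tx.amount),
        "days_apart": abs((tx.booked_on - inv.issue_date).days),
        "name_similarity": name_sim,
    }
```

For a pair like the Leclerc example (same amount, same day, same name), all three features take their best possible value, which is the kind of case where a scorer would be expected to return a complete match.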
And I have a question: what is this confidence threshold you were talking about? Here it was a 100% match, so it was great. I'm wondering what percentage you're aiming for, and whether you show a grey area: if it's only 70%, say, will it be shown differently in the product?
Yes. I haven't recorded that demo and I'm not going to do it live, unless you want to go into the staging environment at the end. I don't have the thresholds in mind, but roughly, we have two confidence values in the background. Below the first confidence value, we don't show anything to the user. We simply ignore the event that just happened,
the upload of the invoice, let's say. So we do nothing, and we keep the invoice in the invoice library.
Between this first threshold and the second one, which is higher, we show a small call to action in the transaction view, where we say: we think this transaction matches this document, do you confirm or not? And so the suggestion is either validated or not. And then, above the second threshold, it's the case I just presented to you, where the matching is done automatically. The user can still unmatch manually, but we obviously track the unmatching rate, and that is part of the performance metrics we have in production.
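The two-threshold behaviour described here can be sketched as a small routing function. The threshold values `0.5` and `0.9` are placeholders invented for the example (the speaker explicitly does not give the real ones), and `MatchAction` and `route_match` are hypothetical names.

```python
from enum import Enum


class MatchAction(Enum):
    IGNORE = "ignore"            # below the low threshold: show nothing
    SUGGEST = "suggest"          # between thresholds: ask the user to confirm
    AUTO_ATTACH = "auto_attach"  # above the high threshold: attach automatically


def route_match(score: float, low: float = 0.5, high: float = 0.9) -> MatchAction:
    """Map a confidence score to one of the three product behaviours.

    The default thresholds are placeholders, not values from the talk.
    """
    if not 0.0 <= low <= high <= 1.0:
        raise ValueError("expected 0 <= low <= high <= 1")
    if score >= high:
        return MatchAction.AUTO_ATTACH
    if score >= low:
        return MatchAction.SUGGEST
    return MatchAction.IGNORE
```

Keeping the thresholds as parameters means they can be tuned against production metrics such as the unmatching rate, without changing the routing logic.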
Great. I have a follow-up question that was just asked about this. Yeah.
So, with this feature that asks the user whether the document actually corresponds to the transaction, is there a learning mode for the model? Does the model learn when the user confirms that the match with a supplier account movement is right?
Yeah. Great question, and as always, the answer is yes and no. We persist it, as we say: we record this information somewhere in a database, the information being whether the matching was correct or incorrect. From that, we extend our training dataset, and when we later retrain our model manually, the training data will include this new data saying that yes, the matching between this and that was good. So it's a yes-and-no answer to that question. For now, there are still a lot of manual steps in the MLOps management of this product.
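The "persist now, retrain manually later" loop described here can be sketched with a small feedback table. This is only an illustration using Python's built-in `sqlite3`; the table name, column names and function names are hypothetical, and the talk does not say which database or schema is actually used.

```python
import sqlite3


def open_feedback_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create (if needed) the table that stores user confirmations."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS match_feedback ("
        "document_id TEXT, transaction_id TEXT, is_match INTEGER)"
    )
    return conn


def record_feedback(
    conn: sqlite3.Connection, document_id: str, transaction_id: str, is_match: bool
) -> None:
    """Persist whether the user confirmed or rejected a suggested match."""
    conn.execute(
        "INSERT INTO match_feedback VALUES (?, ?, ?)",
        (document_id, transaction_id, int(is_match)),
    )
    conn.commit()


def export_training_rows(conn: sqlite3.Connection) -> list[tuple[str, str, int]]:
    """Read labelled pairs back, to be appended to the next training set."""
    return conn.execute(
        "SELECT document_id, transaction_id, is_match "
        "FROM match_feedback ORDER BY rowid"
    ).fetchall()
```

Nothing here retrains a model: the feedback is only stored, which mirrors the "yes and no" answer above, where learning happens at the next manual retraining rather than online.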
OK. Um, one last question.
Then I'll let you go because time is running out. Um, what was complicated about retrieving a training dataset?
Yes. Do you already have enough data internally that is directly usable?
Yeah. Well, yes, it's still complicated. In a few words: we have a large mass of data that comes from before, from when the feature didn't exist and users had to do the matching themselves. Either they did it, or they left that pleasure to their accountant. So we have this historical data. But once the feature has been put into production, we can no longer really rely on the manual work of our clients. So it's either PMs or machine learning engineers who take care of it, and they build quality datasets that are smaller than what we get when the users themselves do the work. And frankly, it's tedious. Labelling an extraction from a document is not very pleasant, but it's relatively easy. Matching means searching among 100 transactions without a document, then searching among the documents to find which one might match.
It's very difficult and very unpleasant to do: it doesn't go very fast, it's very laborious, and so on. There you go, I hope that answers the question. Don't hesitate to tell us, Cyril Colin. I think that's a very good answer. I'll let you move on to the next part and I'll save the questions for later. Yeah, okay. Third product.
So this is a feature that I think will resonate with many people, because I believe it is offered by all banks, including B2C banks for individuals. Again, the problem: you are a Qonto customer, you definitely make good choices, and to manage your cash flow you need to know where your money goes and where it comes from. To be more specific: how much did I spend on software this quarter, or how much did I spend on marketing or on travel? What are my sources of income? If you don't have a reliable categorization of your transactions, of your banking movements, these questions will remain unanswered. And categorizing them manually brings us back to the same problem as always: when it's done by hand, it's a pain. Especially if you are, once again, at the helm of an SME with hundreds of transactions per month, nobody really does that. Or at best, people start and then stop. So quite a few SMEs manage their cash flow by eye. But not you, since you are with Qonto and you have chosen automated cash flow management. So that's how we solved this problem: simply by automating the categorization of transactions.
We used machine learning again. This is yet another kind of model, which is also why I'm presenting this product to you. It's also a two-layer model architecture.
First, you have what is called a backbone model, which is a classification model from Facebook research. Again, we don't give our data to Facebook: it's open source, and it's called fastText.
It is trained on all of our transactions, or rather, on our customers' transactions. It covers the 50 default categories that we have at Qonto.
So, for example, office supplies, taxes: these kinds of categories exist by default, so they are common to all users. The purpose of this fastText model is to categorize any new transaction into one of the 50 existing categories. Again, the training is done on all Qonto transactions. As usual, if you have a machine learning or data science background: when I say all transactions, I mean that we mix the customers, and obviously we have a training set and a test set, so that there is no leakage and so on.
That is the first model. On top of this backbone model, and this is where I really think we distinguish ourselves from other companies, and I think it will please the person who asked the earlier question about whether user feedback is taken into account, there is a second layer, which is a neural adapter and which is a per-organization model. It's a very small neural network that is trained on the specific habits of each customer using transfer learning, and it allows us to predict categories that are specific to the customer, categories that the customer has created themselves.
So if you didn't like Office Supplies and you added Paris Office Supplies and Berlin Office Supplies, the backbone alone is not able to predict these categories by default, because not all customers have them. But with this second layer, the neural adapter, we are able to predict these specific categories, and the more you use it, the more it adapts to you.
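To make the two-layer idea concrete, here is a minimal sketch. It is not Qonto's implementation: the backbone is replaced by hypothetical keyword rules instead of a trained fastText classifier, and the per-organization layer is a simple vote counter instead of a neural adapter trained by transfer learning. All names (`backbone_predict`, `OrgAdapter`, `categorize`) are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def backbone_predict(description: str) -> str:
    """Stand-in for the shared backbone: keyword rules, not a trained model."""
    text = description.lower()
    if "autoroute" in text or "taxi" in text:
        return "transport"
    if "paper" in text or "stapler" in text:
        return "office_supplies"
    return "other"


class OrgAdapter:
    """Per-organization layer (hypothetical): remembers which custom category
    this customer chose for a given (default category, counterparty) pair."""

    def __init__(self) -> None:
        self._votes = defaultdict(Counter)

    def learn(self, default_category: str, counterparty: str, custom_category: str) -> None:
        # Each manual correction by the customer counts as one vote.
        self._votes[(default_category, counterparty.lower())][custom_category] += 1

    def refine(self, default_category: str, counterparty: str) -> str:
        votes = self._votes.get((default_category, counterparty.lower()))
        if not votes:
            # No customer-specific signal: fall back to the default category.
            return default_category
        return votes.most_common(1)[0][0]


def categorize(description: str, counterparty: str, adapter: OrgAdapter) -> str:
    """Backbone first, then the organization-specific refinement."""
    return adapter.refine(backbone_predict(description), counterparty)
```

With an adapter that has seen one correction, `categorize("printer paper", "Papeterie Paris", adapter)` returns the customer's own category, while an organization with an empty adapter still gets the default `office_supplies`.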
And we have a retraining job that is automatic and triggered daily by Airflow. Where I am particularly proud of this product is its performance metrics. We have 85% overall accuracy across all transactions, meaning 85% of transactions are categorized the way the customer wanted. And we are at 95% on recurring transactions, which themselves represent a third of the transaction volume per customer. As for the fact that the categorization supports custom categories, I haven't personally seen that elsewhere in B2C, and we get a lot of very positive feedback from customers on this feature, which allows them, as I was saying, to manage their cash flow, do meaningful analytics, and have dashboards by category built on something reliable and solid. So that's the problem and the solution, and then, as usual, the demo. Actually, this demo video is more illustrative than anything else. It's really to show you... we've lost a bit of revenue there, sorry.
We lost a little bit of cash flow. We went from fast to 4 million.
It's true. Yeah, that's true. That's why we had top-ups. We had even dropped to zero. We topped up to 4 million. [laughs] We did indeed give someone a big gift. So what you should have seen in the video wasn't exactly a demo. I simply went to the transactions tab, and when this transaction took place, it was automatically categorized under transport. It's simply to illustrate what it looks like in the UI. You can ignore the fact that it's not categorized here. In fact, Vinci Autoroutes is also categorized under transport. That's easy for fastText, which is a model that captures semantics, so it has no problem putting a motorway under transport.
And there you have it. Do you have any questions on that, Emma?
No questions came up. I just wanted to point out that the accuracy percentages are really excellent.
Yeah, I'm personally impressed.
[laughs] Yes, I'm delighted, and in fact we are going crescendo, both chronologically and perhaps also in terms of wow effect, because I am particularly proud of the last product too, given the customer feedback we have. So that makes for a good transition for me.
Hm.
So, the conversational analyst. It's the most recent product, and it's perhaps the one that will speak to you the most. So yes, I prepared this pun, but I thought it was pretty good.
[laughs] It's LLMs, it's GenAI in production. Imagine you're a customer. Normally, financial analysis is done mainly by hand. You scroll through your transaction history, you filter, you export to CSV, you open Excel or something else.
For example, to answer "how much did I pay Uber this month?" or "have my travel expenses increased this quarter?", it takes several minutes. Once again, all the data is correctly displayed in the interface. It's simply that either you use the filters in the accounting interface, or you export to Excel and do your pivot tables and your small calculations separately.
And so sometimes we just say no, I'm not going to look for the answer, not because the data doesn't exist, but just because accessing it takes too much time.
How did we solve this difficulty for our customers? We introduced AI agents that are customer-facing, in the same way that you can have customer support bots. The conversational agent is accessed via a single entry point in the Qonto application. If you ask it analytical questions, relating to what I called financial analysis, this entry point redirects to a sub-agent that we call the analyst, which gives you the answers. It performs the queries and pivots in the background and returns the answers. Hm.
In terms of construction, we use an ephemeral DuckDB. At Qonto, in terms of warehouse, we have Snowflake, but the agent we build is not going to query Snowflake in production at all. That would be far too dangerous, because a customer might try to access the data of other customers, or might try to write to our Snowflake, for example to write that they received more than 1 million today.
Obviously, that's forbidden. So we move from Snowflake to DuckDB, which is an ephemeral database into which only the financial data of the customer themselves is loaded.
Why is this product difficult? Because in a financial product, a hallucination is not just a bad user experience. It's not like asking ChatGPT how Newton is doing and getting the answer "last I heard he's doing well and he's eating apples, very healthy". That's funny. But if you ask how much you spent on marketing and the agent answers 3,000 when it's actually 30,000, you lose confidence, potentially permanently. So for us it was extremely important to have a particularly rigorous and continuous evaluation framework on this tool, especially since it's an LLM and therefore non-deterministic. We built this continuous evaluation framework with test questions mapped to known scenarios, LLM-as-a-judge, and regression tests on each prompt change. I will talk about this a little more in takeaway 2. I'm checking the time; I'll have time.
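The isolation idea described above can be sketched as follows. This is an illustration under stated assumptions, not Qonto's code: the standard-library `sqlite3` in-memory database stands in for DuckDB, `WAREHOUSE_ROWS` is hypothetical sample data standing in for a warehouse extract, and `build_ephemeral_db` and `run_agent_sql` are invented names. The point is that the agent's SQL only ever sees one organization's rows, and that writes are rejected once loading is done.

```python
import sqlite3

# Hypothetical warehouse extract: (org_id, booked_on, counterparty, amount).
WAREHOUSE_ROWS = [
    ("org_a", "2025-04-01", "Uber", -23.5),
    ("org_a", "2025-04-02", "Uber", -12.0),
    ("org_a", "2025-04-03", "Vinci Autoroutes", -8.0),
    ("org_b", "2025-04-01", "Uber", -99.0),
]


def build_ephemeral_db(org_id, rows):
    """Create a throwaway in-memory database holding only one organization's data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions (booked_on TEXT, counterparty TEXT, amount REAL)"
    )
    # Filter BEFORE loading: other customers' rows never enter the database.
    own_rows = [(day, who, amount) for (org, day, who, amount) in rows if org == org_id]
    conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", own_rows)
    conn.commit()
    # SQLite pragma that makes the connection reject any further data changes.
    conn.execute("PRAGMA query_only = ON")
    return conn


def run_agent_sql(conn, sql):
    """Run SQL produced by the agent against the isolated database."""
    return conn.execute(sql).fetchall()
```

Because the database is rebuilt per conversation and discarded afterwards, even a badly behaved query can at worst read the customer's own data.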
And then, just before moving on to the demo, in terms of impact and feedback from our users: all 500,000 have access to this for free, and we are at a CSAT of 80%. Frankly, for a product like this, a chatbot, such high customer satisfaction is really impressive. I had set the bar so that over 70 would be great, between 50 and 70 would be good, and below 50 we would have to keep investing in it. So I'm really very happy with this product, which I can show you here. Emma, do you want to talk at the same time to save time?
Yeah, go ahead. On the CSAT: did you get 80% right from the start, or did you have to fine-tune the architecture a little to achieve this result?
Actually, at the beginning we had even more, closer to 90%. [laughs] We did the release in several stages, with an alpha release and a beta release, choosing user groups that were initially very techie, as we say, who worked in AI and so on, and they were much more lenient than the general public, than a restaurant owner for example. So the CSAT decreased, but the analyst has now been available to everyone for several weeks and it's completely stable. So, here I'm asking for the balance as of April 3rd. As you can see, there are several answers. That's because I'm in a situation where I don't have just one account: I'm an organization with several bank accounts. So it's not a bug; I have seven accounts. The balance on a given date, like April 3rd, is something users really like.
I can ask for the balance on April 3rd at 5:28 PM and it will give me the answer. And that's something that isn't even available in the UI. In effect, it allows for time travel. So it is also very much appreciated by our users.
Um, do you want to ask a question?
Two quick questions. [laughs] Sorry, there are really a lot of people who are super interested in this product. First, quickly: do you have a Medium article or another online article that goes into detail?
People will think we paid this person. Who asked this question?
This is Guillaume Vibert.
Guillaume doesn't work for me.
But [laughs] yes, I even wrote a Medium article about it myself. So I refer you to it: you can find it on Medium by typing my name plus "analyst", or on my LinkedIn profile, where I posted it. In it I give more details on the infrastructure and the architecture we used for the analyst. So I'll refer you to that.
OK, great. We can share the link again at the end, when we move on to the Q&A. Actually, there are so many questions that I'll let you finish and we'll ask them afterwards.
Yeah, well, I'm going to finish quickly. As I told you, I wanted to give you two takeaways, two things to take away. Small disclaimer first: what I said before was purely factual. Here we enter opinionated territory, life according to Marianne Bordic du Courneau, head of products at Qonto. Takeaway one, for me, is a framework for telling "it works" apart from "it's reliable". For me, there are five dimensions that separate the two. First, behavior at the edges: what happens when a customer uploads an invoice in Korean? What happens then with the products I presented earlier? What happens when a transaction has an empty description?
How do we categorize it?
What happens when the analyst receives a question that is completely out of scope, or a question like "What is the schema of your database?" We really don't want to answer that. A reliable product has a planned response for all of these cases: not a crash, not an absurd answer. Someone must have thought about how to deal with them. So that's edge behavior. Then graceful degradation. I don't know if we say that in French, but graceful degradation: when the model is uncertain, will it say "I don't know" or will it confidently make something up? A model that says "I'm not sure" is much more valuable than a model that is confidently wrong. Then, most importantly, observability: do you know when things go wrong in production before your customers contact you to say there's a problem? Then reproducibility, which is also very important: do runs on the same input give the same result? And again, you want to know that before you get customer feedback. This is particularly difficult and critical for LLMs, since they are non-deterministic.
And finally, resilience to distribution shift in production. This is something we have known about in ML for a long time, but it is still just as important, and it is part of what distinguishes "it works" from "it's reliable". When a new market is opened, for example, how will your model's performance evolve under that distribution shift? After that, I've put some questions on the slide for you; I'm not going to read them, you don't need me to.
Anyway, I'll give you the slides. But before going to production, whether you are an MLE or an MLE team lead, you should ask these simple questions about the five dimensions I just mentioned. That lets you know whether you are ready for production, whether you have moved beyond "yeah, it works" to "it's reliable, it's good, we can put it into production".
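Two of the five dimensions above, edge behavior and graceful degradation, can be illustrated with a small wrapper. This is only a sketch under assumptions: `predict` is any hypothetical function returning a `(label, confidence)` pair, `fake_predict` is a toy stand-in, and the 0.7 threshold is an arbitrary illustrative value, not a figure from the talk.

```python
from typing import Callable, Optional, Tuple

UNKNOWN = "uncategorized"


def categorize_safely(
    description: Optional[str],
    predict: Callable[[str], Tuple[str, float]],
    min_confidence: float = 0.7,  # illustrative threshold, to be tuned per product
) -> str:
    """Return a category, or UNKNOWN instead of crashing or guessing."""
    # Edge behavior: empty or missing descriptions get a planned answer.
    if description is None or not description.strip():
        return UNKNOWN
    try:
        label, confidence = predict(description)
    except Exception:
        # Edge behavior: a model failure must not crash the product.
        return UNKNOWN
    # Graceful degradation: admit uncertainty rather than guess confidently.
    return label if confidence >= min_confidence else UNKNOWN


def fake_predict(description: str) -> Tuple[str, float]:
    """Toy model used only to exercise the wrapper."""
    if "autoroute" in description.lower():
        return ("transport", 0.95)
    return ("other", 0.30)
```

The design choice is that "uncategorized" is a legitimate output: the caller can surface it to the user or log it for observability instead of showing a confident but wrong category.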
Second takeaway: evaluation from the design stage.
That's also something very opinionated, especially when you want to go fast and GenAI gives you the opportunity to. For my part, I think that before the first line of code is written, whether we write it ourselves or the AI does, we must be able to answer the question: how will we know that it works?
For me there are three things. Build the dataset before training the model. Define failure explicitly, that is, decide what counts as a bad output. For example, if Flex, the Financial Language extractor, extracts an amount that is one cent different from what is on the document, is that a serious problem? Is it a bug, or is it acceptable, within the tolerance zone? The answer is not at all the same depending on the product, or rather on how the product is used.
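Defining failure explicitly can be as simple as writing the tolerance down in code. The sketch below is an assumption-laden illustration: `amount_matches` is a hypothetical helper, and the tolerance values are examples, not Qonto's actual thresholds. `Decimal` is used so that cent-level comparisons are exact.

```python
from decimal import Decimal


def amount_matches(extracted: str, expected: str, tolerance: str = "0.00") -> bool:
    """True when the extracted amount is within `tolerance` of the expected one.

    Amounts are passed as strings and parsed as Decimal to avoid float rounding.
    """
    difference = abs(Decimal(extracted) - Decimal(expected))
    return difference <= Decimal(tolerance)
```

With the default zero tolerance, a one-cent difference is a failure (`amount_matches("100.01", "100.00")` is false); a product that can accept it would state so explicitly with `tolerance="0.01"`.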
And then, for LLMs, whose outputs are non-deterministic, we use both approaches. We have golden reference tests, and then LLM-as-a-judge, where one LLM scores the outputs of another LLM (we don't use the same provider for both) on different dimensions: relevance, accuracy, hallucination.
LLM evaluation as such is not optional; I would say it is the bare minimum. It's always imperfect, but at least it scales, so it should be put in place from the design stage, or at least thought about from the design stage. Well, there you go. I had a little bonus, but maybe I'll let you speak and leave more room for interaction.
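The two approaches mentioned above can be sketched side by side. Everything here is hypothetical: `stub_agent` and `stub_judge` are plain functions standing in for real LLM calls, the 1 to 5 scale (higher is better on every dimension) and the `min_score` of 4 are illustrative choices, and only the three dimension names come from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

DIMENSIONS = ("relevance", "accuracy", "hallucination")


@dataclass(frozen=True)
class GoldenCase:
    question: str
    expected_answer: str


def golden_pass_rate(agent: Callable[[str], str], cases: List[GoldenCase]) -> float:
    """Golden reference tests: exact match against known answers."""
    if not cases:
        raise ValueError("at least one golden case is required")
    passed = sum(1 for c in cases if agent(c.question).strip() == c.expected_answer)
    return passed / len(cases)


def judge_failures(
    agent: Callable[[str], str],
    judge: Callable[[str, str], Dict[str, int]],
    questions: List[str],
    min_score: int = 4,  # illustrative bar on a 1-5 scale
) -> List[Tuple[str, str]]:
    """LLM-as-a-judge: list every (question, dimension) scored below the bar."""
    failures = []
    for question in questions:
        scores = judge(question, agent(question))
        for dimension in DIMENSIONS:
            if scores[dimension] < min_score:
                failures.append((question, dimension))
    return failures


def stub_agent(question: str) -> str:
    """Stand-in for the agent under test."""
    answers = {"How much did I pay Uber this month?": "35.50 EUR"}
    return answers.get(question, "I don't know.")


def stub_judge(question: str, answer: str) -> Dict[str, int]:
    """Stand-in for a judge model from a different provider."""
    declined = answer == "I don't know."
    return {"relevance": 2 if declined else 5, "accuracy": 5, "hallucination": 5}
```

Running both on every prompt change gives a regression signal: the golden pass rate should not drop, and the list of judge failures should not grow.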
Well, anyway, know that you have a ton of questions, your talk was really interesting and I don't know if you've already concluded, or if you want to conclude, but you really went into a depth that is extremely appreciated. Thank you so much.
I personally come away from it having learned a lot.
I think we can leave out the bonus on how the role has shifted in 12 months; I can set it aside. It's opinion, and less directly applicable anyway. By way of conclusion, simply to recap: if you had to remember just three things from my talk, the first is that shipping AI in production is not at all the same as in research, or in vibe coding. The second is that "it works" is very different from "it's reliable", and in my opinion there are five dimensions that let us distinguish between the two. The third is to think about evaluation by design. Here I'm thinking especially of software engineers who want to move toward AI engineering. I noticed that it wasn't a reflex they had, and that's okay, we all evolve; just be aware of it, and then maybe work on it. And one last point: we're recruiting in my team. So if what I've presented interests you, don't hesitate to contact me. You'll find my LinkedIn link on the first slide. We're recruiting in five different countries, and also fully remote, for Machine Learning Engineer positions, which are increasingly becoming AI Engineer positions.
Ah great, thank you very much for this talk. Personally, it really makes me want to join your team. [laughs] The number of topics you cover and the tech you explore is awesome. Lots of people are thanking you in the chat: excellent, great experience feedback, top. We'll extend the Q&A session a little if that works for you, because there are quite a few questions. People were quite interested. So, I'm going to go by the number of likes on the questions, since there are so many.
So, Ayou again. What do you think of having an LLM generate test data in the absence of ground truth, when it's complicated to produce manually?
I'm sure Ayou actually has the answer to his own question. [laughs] Well, it's imperfect, and it is never worth as much as real-life data, but when you have no other choice, or when you're pressed for time, it does the job. It gives us something good. Excellent, no, but good enough. It still allows you to quickly ship a quality product. So to summarize, for me it's a solution either when we have absolutely no choice because the real data is unrecoverable, or when we are pressed for time, in which case it gives us a product that is already pretty good. The additional performance points will then come from real data, and nothing beats user data and user feedback.
Hm. Great, thank you. Another question that's been asked quite a bit, so we're back to your third product, I think: how are horizontal attacks handled with the all-SQL LLM approach? Oh no, the fourth, sorry.
Hm.
On that point, I'll refer you to my blog post [laughs], and if it answers the question well, I'd like some claps on Medium in exchange. [laughs]
Very good. Well done. Stay tuned for the answer.
Um, to what extent have the edge cases been identified... OK, we'll do this one.
Daniel, to what extent were the edge cases identified upstream and by what method? Because on the product side, we always know that users are the best at finding the worst ones.
Ah yes, indeed, that's a very good question. Uh, there's no miracle solution for that.
Simply be aware that some time must be spent working out how the tool can be misused, abused, and so on. Just doing this exercise, forcing yourself to look for edge cases, will cover a good part of them; I'm not going to give a number. It's a bit like what distinguishes a good product from a very good product. After that, it's a matter of continuous improvement based on what is detected in production. Ah, and for the first part as well, customer interviews help to anticipate this. Even before the product launches, we do a demo for a customer, and they say, "Oh, I'd like to ask it this." Really?
We hadn't planned for that. So it happens in several phases, but clearly we have to be very careful not to ship something we developed exclusively for our golden path.
Okay. Very interesting. Thank you. A question from Mathieu Cabrera: "When do you run these tests and based on what criteria?" I've lost the context, but basically he's asking: have you set up pipelines? Does it run on the developer's workstation?
Is this every day? Is it with every release?
Well, I'm not sure which tests he's talking about, but I'll interpret the question as the tests that measure the performance of our models. Hm. So, if I stay on the analyst, because I think that's where it applies, which again is based on generative AI (generative, sorry, not generational; it's really time we went to lunch [laughs]): we run our tests before merging, as soon as we make a change in the codebase that can change the responses to the tests we planned. So, really very regularly.
In this case, the analyst is a sub-agent that is part of a much larger set of agents, which I haven't told you about. But if you remember, I told you that there was a single entry point for conversational AI in the UI, and that there was a router which, when the question is analytical, routes to our agent.
If we make a change and want to merge a PR on the analyst part, we rerun all the tests that concern the router and all the tests that concern the analyst. If we make changes to the other sub-agents, we don't need to run the tests on the analyst sub-agent again. So, long story short, tests run really often, but you still have to be clever about it, because it's expensive. It's only when you want to open the PR that you check that it will not degrade the performance of the agent compared to the previous version. Hm.
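The selection rule described above (a change to one sub-agent reruns that sub-agent's tests plus the router's) can be written down as a small function. The sub-agent names other than the analyst are hypothetical, and the rule for a change to the router itself is not stated in the talk: rerunning everything in that case is an assumption of this sketch.

```python
from typing import Set

ROUTER = "router"
# "analyst" comes from the talk; the other sub-agent names are hypothetical.
SUB_AGENTS = {"analyst", "support", "invoicing"}


def suites_to_run(changed_components: Set[str]) -> Set[str]:
    """Pick the evaluation suites to run before merging a PR."""
    unknown = changed_components - SUB_AGENTS - {ROUTER}
    if unknown:
        raise ValueError(f"unknown components: {sorted(unknown)}")
    if not changed_components:
        return set()
    if ROUTER in changed_components:
        # Assumption: a router change can affect every sub-agent, so run all suites.
        return SUB_AGENTS | {ROUTER}
    # A sub-agent change reruns its own suite plus the router suite.
    return set(changed_components) | {ROUTER}
```

Since each LLM-based suite costs money to run, keeping this mapping explicit is one way to run tests often without running all of them on every change.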
And can I bother you with one last question, a slightly meta one, which was asked at the very beginning by our friend Guillaume Vibert: how is the internal ML team structured? Are there any synergy issues with your team?
Well, there isn't really an internal ML team, strictly speaking. I don't know if Guillaume was there at the beginning of the talk; probably, since he's asking me this question. As I was saying at the beginning, contrary to what the name of my team, AI Products, might suggest, my team is in tech.
And what I didn't say is that, more specifically, my team is in data. In this data team there is also a large Business Analytics team, and they are the ones who have taken on the machine learning and data science role for internal clients. So, for example, it is Business Analytics that builds the marketing mix modelling and churn detection models I was talking about.
However, they are not the ones who build tools to enable teams to work more effectively. That is handled by yet another team, which is not in data but in tech, and which manages the licenses for Claude (we use Claude a lot) and builds tools such as an internal PR review bot; that was not built by Business Analytics. So in terms of synergy with Business Analytics, we don't have much.
It's more with the rest of tech, so outside data, because... well, we didn't have time to talk about it, but MLEs are increasingly becoming software engineers. How are the skills... Sorry about that [laughs], I'm reaching the limit of my translation abilities.
Don't worry.
The skills that are missing in a traditional MLE are more on the software engineering side, and that's where we're going to learn. So there you have it: more discussion with the backend teams than with the internal ML side. I don't know if I answered the question properly, Guillaume.
Yeah, Guillaume, speak up in the comments if you're still here. I think it was very comprehensive and I really like this step back on MLEs and what they're becoming. And with that question, we finally conclude the talk.
Thank you so much, Marianne, for your time and precision. It was very pleasant to host you. And a huge thank you to the chat, who were incredibly kind and super active. Thank you for being there.
Feel free to tell us if you enjoyed the talk. You can give it a rating from 1 to 5, from so-so to great. Yeah, it was really great. Send some encouragement to Marianne and TPC if you want. We'll see each other again very soon. We'll post the replay of the live stream this afternoon, and then in a few weeks we'll do it again live.
Oh, but there are lots of 5s.