He rewrote chardet with Claude. The internet blew up. Here's his take.
Show notes
elvex is the enterprise agent platform that goes first—and brings your whole company with it. We interview builders about their "AI thing," with the caveat that it has to be something other people actually use: not just a flashy demo. If you are interested in being a guest expert on the podcast, contact us here and include "podcast" in your message.
Dan Blanchard has maintained a Python library largely by himself for more than a decade. It’s a very popular library—more than 130 million downloads a month, meaning it’s included in untold amounts of software. Recently, he used Claude Code to rewrite the entire library from scratch, and released the new version under a different open-source license.
The internet blew up.
When original author Mark Pilgrim resurfaced after 15 years to challenge the relicensing, it sparked a firestorm across the open source community about AI-assisted rewrites, "clean room" methodology, and the future of software licensing. Bruce Perens, the person who originally defined “open source,” is on the record comparing the moment to the invention of the printing press and the scientific method. “The entire economics of software development are dead, gone, over, kaput!” he told The Register.
We sat down with Dan to get his take. It should be noted that we are friendly to Dan, having been colleagues with him at previous companies.
Dan's position is clear: he demonstrated structural independence (JPlag similarity under 1.3% vs. 80% in prior human-maintained versions), consulted legal counsel, disclosed his entire process transparently, and built something objectively better for users. What’s more, he’s never earned a cent from maintaining chardet and doesn't intend to in the future. The new license for the free software he has been maintaining for free is itself more free and less restrictive. The new version is 48x faster. Given the number of devices this software runs on, some have quipped that this rewrite is “the first carbon negative use of AI” because of the lower aggregate CPU utilization.
The full transcript of the conversation is below.
Podcast Transcript
This transcript has been lightly edited. The first few minutes of the video are a “sizzle reel” from later parts of the podcast.
Sachin: Hello everybody, welcome to another episode of Building for Others, the podcast where you have to build something that people use with AI to get on it. And today I have the really amazing pleasure to introduce everybody to Dan Blanchard. Dan and I go back, I don't know, like 10 years, something like that, when we worked together at a company called Parse.ly. Actually, all three of us worked at Parse.ly. Dan was one of our amazing engineers and worked on a wide range of different things. Since then he's worked at a couple of different companies. He's now an engineer over at Monarch Money. But we're here to talk with him today about something that made him go viral on the internet over the last, I don't know, like 10 days or so. So Dan, why don't you just introduce yourself? Tell us a little bit about your background, what you tend to focus on, and then tell us what chardet actually is.
Dan Blanchard: Yeah, so I'll start with my background. Yeah, as you said, I used to work at Parse.ly with you. I think I was there for six years, so we've had a lot of experience working together. Even when I was at Parse.ly, I was maintaining chardet. I've been maintaining chardet, the open source Python character encoding detection library, for about 12 years. It was 2012 or 2013, I think, when I first got involved, and I think I took over in 2014. I don't remember the exact dates, but it's roughly 12 years. So, sure.
Sachin: Number one, what does chardet do, and what is character encoding detection? And then number two, why did you take over maintaining it?
Dan Blanchard: Yeah, so that's actually kind of a funny story. At the time I started working on it, I worked at ETS, Educational Testing Service, on their software for automatically grading essays, like one of the first AI things back before that was a thing. And yeah, so character encoding detection is basically the idea that every file that's got text in it, on your computer or on the internet, is at its core a bunch of ones and zeros, as people always like to say, but you might want to think of it as a series of numbers. And so a character encoding is a mapping, kind of like one of those Caesar ciphers where it's like, this number means this letter, only it's this number means this letter, or this question mark, or any of the symbols that you can type on your keyboard, basically. So the most well-known character encoding was the original one, ASCII, which only supported 127 characters. And for a long time, those of us in America, who were very English-centric, were like, yeah, that's all you need. You only need 127 characters, because why would you need anything other than what's on the keyboard? So. The Chinese language doesn't exist.
Dan Blanchard: Yeah, exactly.
Dan Blanchard: Chinese language doesn't exist. Neither does Hebrew or any of that. So these other character encodings started to develop historically in the 70s and 80s, like some of the DOS-era ones that are kind of interesting and still used a little bit here and there. There are ones that are even earlier than that, like mainframe encodings and stuff like that, for systems that almost no one uses anymore. But the idea behind character encoding detection is that, given an arbitrary file, you look at it and you're like, is this in English, is this in Japanese, is this in Chinese? It tries to figure out, statistically and using some other methods, what encoding this is in and what language it's in, because figuring out what language it's in usually helps you narrow down what encoding it's in. Certain encodings are only used for certain languages. So if you're looking at an English document and you see lots of the numbers that tend to correspond to T, H, and E in certain encodings, then that's a good indication that that's English. You know, that kind of thing. Mm-hmm.
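To make the idea concrete, here is a minimal Python sketch of the two steps Dan describes: an encoding is a mapping between numbers and characters, and a detector can guess the mapping by trying candidates and scoring the result. This is a toy under stated assumptions, not chardet's actual algorithm. The function names, the candidate list, and the letter-frequency score are all made up for illustration.

```python
from typing import Optional, Sequence

# Text stored under a given encoding is just a sequence of numbers (bytes).
SAMPLE = "日本語".encode("shift_jis")


def try_decode(data: bytes, encoding: str) -> Optional[str]:
    """Decode `data` with `encoding`, or return None if the bytes are invalid in it."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


def english_score(text: str) -> float:
    """Fraction of characters that are very common in English (hypothetical heuristic)."""
    common = set("etaoinshrdlu ")
    return sum(ch.lower() in common for ch in text) / max(len(text), 1)


def guess_encoding(
    data: bytes,
    candidates: Sequence[str] = ("utf-8", "utf-16-le", "cp1252"),
) -> Optional[str]:
    """Return the candidate whose decoding looks most like English, or None if none decode."""
    best_name, best_score = None, -1.0
    for name in candidates:
        text = try_decode(data, name)
        if text is None:
            continue  # these bytes are not valid in this encoding
        score = english_score(text)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

For example, `try_decode(SAMPLE, "utf-8")` returns `None` because those bytes are not valid UTF-8, while decoding them as `latin-1` succeeds but yields unrelated characters. Real detectors rely on much richer statistics than this single score.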
Dan Blanchard: So yeah, chardet's just a system that does that. It's based off of, well, I should say the original version of it was based off of what was called the universal character encoding detector inside Netscape way back in the day in the late 90s, which then, you know, became Mozilla, which then became Firefox. And so the original version of chardet, the Python version, was created in 2006 by a guy named Mark Pilgrim, who took the C code for that and just translated it into Python. And if I had to, like, explain this for non-tech people: you've built, or you're maintaining, essentially a Lego block that goes into larger pieces of software, and it's used to understand, for example, what language is this document in? And so I think you said it was 130 million downloads a month. So this thing is used in tons of different pieces of software all over the place. Is that right?
Dan Blanchard: Yes. Yeah, that's right. The reason it's used so much is because there's a Python library called requests, which anyone who's worked with Python is probably familiar with. But requests is kind of the main library that's used to download web pages or anything on the internet. And under the hood, requests relies on a character encoding detector. Older versions of it only use chardet. There are newer versions that use this other one called charset-normalizer, or you can use either one, it's your choice. So because requests is used for basically anything that uses the internet, that's one of the reasons that chardet gets used so much as well.
Sachin: So super high utility, super ubiquitous, how did you become the maintainer of this?
Dan Blanchard: Yeah, so when I was at ETS, we were kind of ahead of the curve on adopting Python 3. There was this big controversy, like 15, 20 years ago at this point, about how the Python language had these big changes and you had to rewrite all your code that previously worked in 2 to work in 3. Mark wrote chardet in 2006 and then abandoned it in 2009 or 2010, I don't know the exact date, when he deleted himself from the internet and deleted all the code repositories he had, including chardet. Some other people recovered it. We can get into that later. Basically, there was a Python 2 version of chardet and a Python 3 fork of it called charade, made by Ian Cordasco, who is one of my fellow maintainers of chardet and also one of the maintainers of requests. The requests project was trying to adopt Python 3 and they were like, we need a Python 3 version of this. And so he forked it. And then where I came into the picture was I wanted to make it one code base again. So there was the Python 2 version and the Python 3 version. I had a lot of experience making things that worked with both, which was hard to do technically at the time. So I merged the code bases back together. That was chardet version 2.3. Maybe I can fact check that later. I'm pretty sure it was version 2-point-something. And then that's when we started releasing versions that were compatible with both, until we dropped Python 2 support a long time later. Later than it should have been. But yeah.
Doyle Irvin: Long time ago, yeah.
Sachin: Okay. So let's fast forward now. You're like 10, 15 years past that. You've been maintaining it for a while. Mark has been off the internet. I guess I'm assuming not contributing anything to the project at that point. Okay. It's you and a couple of other people maintaining it. So what happened a couple, I guess, like, when did you start thinking about a new license? When did you start thinking about, you know, the things that happened that caused a controversy here? And then.
Dan Blanchard: Yeah. No, no. Correct, yeah.
Sachin: what made you take action.
Dan Blanchard: Yeah, so one of the craziest things about this is that I've been thinking about changing the license for quite a long time. So I did not want the license to be LGPL almost from like the day I started working on it because I knew that, yeah, sure.
Sachin: And for people here like LGPL, MIT, what do these licenses mean?
Dan Blanchard: Sure. So the LGPL is the Lesser GNU Public License. The GPL is like the, I don't know, greater GNU Public License, but it's just without any L in the front of it. And the idea there is it's a license that's typically used for libraries, because it's a little more lenient than the GPL, which has even stricter guidelines. The guidelines that make it kind of untenable for certain uses are the parts that make it harder to use commercially without sharing your code. And so the MIT license is more open and says, basically, do what you want with this code, just, you know, I'm not responsible for whatever you do with it, and if you make changes to it, it'd be nice if you contributed them back, but you're not required to. It's a much more flexible license. It's like, I'm...
Sachin: So all three of these are for open source projects. They have varying degrees of restrictions or limitations. Developers or contributors to open source projects can, if they're the ones creating it, pick one of these to apply to their project. And so, yeah, you've been frustrated with the LGPL because of some of the restrictions there, right?
Dan Blanchard: Yeah. And I will say the reason I've been frustrated with some of the restrictions is not because I ever have made, or have any plans to make, any money from this. That's the thing about this controversy that drives me crazy: people are like, he's just going to go out there and make bank on this. I'm like, I have no plans to ever make money off of this. It would be nice if someone paid me to support it sometime, but I'm not going to go out there and make millions of dollars on a character encoding detection library that's been around for 30 years. The new version of it is a ground-up rewrite, so it's technically a different thing, which is the reason people are making accusations about me doing things with it.
Doyle Irvin: Yeah. If other people wanted to use it in software where they intended to make money, this new license makes that whole thing way easier. And so a change could mean that, even though it's already mass adopted, it could be even more so, like more ubiquitous. And the more attention an open source project gets, the more support it gets from people, in terms of, you know, all these voluntary... I mean, you've been working on this unpaid for more than a decade. It'd be nice to get a little bit of help, right? So.
Dan Blanchard: Yes, exactly. Right. So if you look, issue number 36 on the chardet repo is from October 2014. It's actually kind of funny to me, because I started off saying, this is weird for me to ask because I'm one of the maintainers, but can we change the license? And then there's this whole debate between me and Ian and, I think, Erik Rose, because he was more involved back then. And basically, we came to the conclusion that, well, we would need to talk to Mark about it, and no one knows how to talk to Mark about it because Mark's gone. So we didn't even know what to do. Like six months after that... I actually dug this up. So I deleted my Twitter account, which is annoying, because I had this conversation on Twitter in 2015 with Guido van Rossum, who's the guy who made Python. And I think I have that somewhere.
Sachin: Right on, yeah.
Dan Blanchard: Where? I pulled that up before this because I wanted to mention that. Yeah, in like April? Yeah, it was April of 2015. I talked to...
Sachin: Do you have a screenshot of it or something you share?
Dan Blanchard: Yeah, well, I don't have a screenshot of it, because it's recreated. I basically tasked Claude with finding anywhere on the internet, like on archive.org, that parts of it existed. Then I had to make an X account, because you can't even load pages on X without one. So I had to make a temporary X account to go to Guido's page, and you can see the parts of the conversation where my messages are deleted, but you can see what he was responding to. Yeah. Right, right, right. Yeah.
Dan Blanchard: I don't know if I can send that to you later to put in a post or like, yeah. But basically, Guido... No, no, no, that's OK. Just talk us through it. Yeah.
Dan Blanchard: asked my opinion about adding chardet to the Python standard library, and I was like, yeah, that would be great. The main reason this came up is because at the time some people were proposing putting requests in the standard library. That never ended up happening, but requests had two dependencies. It had chardet and something else that I can't remember. But I remember it had two dependencies and they were like, yeah, but chardet's LGPL. And so that's a problem because we can't add LGPL...
Sachin: That's from Guido. Guido's like, we can't do it if it's LGPL.
Dan Blanchard: Yeah. So he was like, but the license. And so we were like, well, maybe we could change the license. And he was like, yeah, that'd be great if you could figure out a way to do that. And someone on the thread, I remember, saw this and reached out to Mark, supposedly. I didn't know them personally. They asked him, and he was like, well, we can't change the license, because the license is LGPL because the code that I based my code on is LGPL. And that's part of the thing about LGPL: it's got what they call the copyleft provision, which is that anything you make that's derivative of the thing has to have the same license as the thing. And so taking code and literally translating it from one language to another, that very obviously needs to have the same license, because it's definitely derivative. But one of the kind of funny things is that code was LGPL at the time that he copied it, but later they actually changed the license to the Mozilla Public License, the MPL, which is slightly more permissive than the LGPL. I wanted to change ours to the MPL at the time. That was my goal. I was like, it doesn't make any sense that the license on this thing that you copied has changed and we can't just change our license to match. And it was like, well, that would be hard. We would have needed Mark's sign-off, and we would have needed anyone who had contributed to it since then to sign off too, because in order to change the license, typically you need the sign-off of every contributor, which is very hard to get. Yeah, exactly. It can happen. It happens very rarely, but it can happen. So there's just a practicality thing here. It's like, not gonna happen.
Dan Blanchard: I think, yeah, I remember one time there was some open source project I had contributed to. I cannot remember which one. It was run by a company, and they wanted to change the license, and they sent out this email to everyone who had ever contributed to it, even a one-line change, that was like, hey, we would like to change the license. Can you sign this agreement that says you're okay with it? If you don't sign the agreement, we're just going to remove your contribution. They're like, we're just going to cut that part of the code out, and we'll get by without it. That was kind of their attitude. It will come to me, whatever library that was, but I can't remember. But yeah, so this has been something I've wanted to do for a long time, change the license. The new thing I rewrote has a different license. In a way, it's a relicensing, because I kept the name the same. And the reason I kept the name is because I wanted it to be a thing that would just slot right in for people, so that people didn't have to switch to some new thing and, you know, figure that out. Because one of the benefits of the new version is it's not just that the license is different. It's almost 50 times faster than the old version. And it's significantly more accurate. It's like 10 percentage points more accurate on this test set I have of 2,000-plus files. So it's much more accurate, much faster, uses less memory. It's just better in basically every conceivable way. Sorry, the cat feeders are going off. If you can hear that over there, I apologize. It's like the noise of rocks hitting bowls. I don't know if that's getting through or not. Okay, cool. So...
Doyle Irvin: All good.
Dan Blanchard: Yeah, so it's better in lots of ways for users. And the thing is, when something's being used by 130 million people every month, I mean, it's being downloaded that much. That's not even talking about how much it's being used every day. If you make a change that noticeably impacts performance, I don't know, not to toot my own horn, but that could have a measurable energy impact on the compute for all of these projects everywhere, which is crazy. You know, people are talking about the downsides of the energy use of LLMs and stuff. Somebody on the Parse.ly Discord was joking that this might be the first carbon neutral or carbon negative thing that someone has done with AI. And I was like, maybe, I don't know. Yeah.
Doyle Irvin: Yeah, I think it's a fair claim.
Sachin: So basically, you're like, it's not practical to get this license changed. So I'm just going to rebuild it from the ground up using Claude Code and give it the license that it should have had, right?
Dan Blanchard: Yeah. Yes, and it's not even just for the license. There have been changes that I've wanted to make, broader structural changes to make it work better and be faster, that have kind of been percolating in the back of my mind, but I'm like, I'm not going to put all that effort in and have it still be stuck in the same place. That was a frustrating thing for me. My goal has always been that it would be really nice to get this in the standard library, because then there'd be more people tasked with supporting it than just me. And also, it is a standard utility. It doesn't make sense that it's not. So yeah.
Sachin: Yeah. So OK, you did it. OK, no big deal. There's a new chardet. It's better, faster, stronger. It's got a new license. And then maybe Doyle, now you can read your quote.
Dan Blanchard: Yeah.
Doyle Irvin: Yeah, I mean, it's just too crazy. You can't not read this quote. This is in Ars Technica. I don't know how much readership the specific article got, but I mean, I saw Hacker News threads with hundreds of comments. You know, lots of attention. And the guy goes, this is Bruce Perens, first of all, and he goes, I'm breaking the glass and pulling the fire alarm. The entire economics of software development are dead, gone, over.
Dan Blanchard: Yeah.
Doyle Irvin: Kaput! We have been there, for example, when the printing press happened. We've been there when the scientific method proliferated. You know, I think this one is just as large. And it's like, okay, Dan, you're being compared to the printing press and scientific method.
Dan Blanchard: Yeah, I was not expecting... Yeah, I certainly wasn't expecting that. I really appreciated Bruce Perens's take on that. For people who don't know, Bruce Perens is the guy who coined the term open source. So him saying that is kind of a big deal. And yeah, that was actually a quote. Ars Technica was quoting him from an earlier article in The Register. And that article in The Register also quoted me. They asked me for a comment. And it was really crazy to see my name and things I said printed next to things Bruce Perens said. Like, wow. Yeah.
Sachin: Amazing.
Doyle Irvin: yeah, one of those moments for sure.
Sachin: Well, so now let's get into the implications of all this stuff, because there's a lot of discussion now of what is the value of licensing writ large if you can just go in and build things from the ground up using AI. And I guess you could always rebuild something. So that's not a change. It's just that the ability to do that now is way, way easier. The barrier to entry is maybe even gone for certain things. Right? So like.
Dan Blanchard: yeah.
Sachin: I don't know where does that all play in your head? How do you think about that?
Dan Blanchard: So I think LLMs just let you do something that everyone's always been able to do, just way faster. You know, one of the things that's kind of ironic about all this is it used to be the free software people who were celebrating when different things that were proprietary got rewritten to make them more free. Like, you know, Richard Stallman, the guy who came up with the free software movement, basically, he famously did it because the printer driver on his computer didn't work and he wanted to change it and make it work. And he couldn't get access to the code, you know, all of this, and he reverse engineered it and all that. And there are also Windows emulators out there. If you have a Steam Deck, the reason you can play all the Windows games you want on there, even though it doesn't run Windows, it runs Linux, is because somebody at some point reverse engineered a lot of their APIs and stuff like that. And we all celebrated that, because that made software better for people and made it so people could use for free something that they used to have to pay for. And I think... Yeah, there's this certain sort of irony that people are mad that I'm making this thing that is free with restrictions have fewer restrictions on it. And I think.
Sachin: Why do you think they're mad? What is it about now that makes it so different?
Dan Blanchard: Yeah, I think some of the takes I've seen that are more balanced say the reason people are mad is because I kept the name the same. A lot of people think this would not have been a controversy if I hadn't tried to keep the name. And I think there's probably some merit to that, right? I think it's probably true. But I also think there's not a lot in a name, other than that the name is the package that gets installed, right? To me, I don't actually care what it's called. If there was a way that I could say, make it so this other thing gets installed instead of what was previously called chardet, I would have changed the name. I did it because I wanted to make it so it slotted in for people in a way that was low friction, right? So they wouldn't have to think about it. There is a possible world in which, you know, this goes to court and it goes a way where they're like, you can't keep the name. There are a bunch of different legal outcomes we can talk about. But if I was forced to change the name of the new project, I am still in charge of the old project. I would just archive the old project on GitHub with a notice that said, this project is dead, go look over here for the new project. And I probably would even make it so it printed a warning when you installed the old version of chardet that was like, hey, this is no longer supported. It's deprecated. Go download this other one. But that has the exact opposite effect of what I was talking about with the potential energy impact. If everyone who uses this suddenly has this message printing out everywhere, that adds up when you're talking about millions of downloads. I don't know, the reactions seem out of proportion to what I've done. One person said I should be put on a registry like a sex offender. That is, there should be a registry for software engineers who violate license things, and then they should just be blacklisted and not hired by anyone. And then, yeah, it was just crazy.
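As a rough illustration of the deprecation notice Dan describes, a Python package can emit a warning that points users to its replacement. Note the assumption: Dan talks about a message at install time, whereas this sketch warns at import time, which is one common way to surface such a notice. The function and package names are hypothetical.

```python
import warnings


def warn_deprecated(old_name: str, new_name: str) -> None:
    """Emit a visible warning pointing users from an old package to its replacement."""
    warnings.warn(
        f"{old_name} is no longer supported and is deprecated; install {new_name} instead.",
        FutureWarning,  # shown by default, unlike DeprecationWarning in most contexts
        stacklevel=2,   # attribute the warning to the caller, not to this helper
    )


# In a real package this call would sit at the top of the old package's __init__.py:
# warn_deprecated("old-package", "new-package")
```

Because the warning fires wherever the old package is imported, the message would appear across every project that still depends on it, which is the aggregate cost Dan is pointing at.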
Sachin: I also wonder, say we went back in time and you didn't use Claude Code to do it. You just did it yourself, line by line, rewriting this thing. Do you still think it would have had controversy, or is it the fact that it was AI-assisted that makes it scary for people?
Dan Blanchard: Mm-hmm. Yeah. So I think there's two parts to that, right? One of the reasons I chose to do it with AI is because it would be faster. I had seen an article that someone had shared with me at work about how someone had rewritten Next.js. I can't remember which company it was, but there was some company that rewrote Next.js in like a couple of weeks or something like that. And I was like, that's crazy. And I was like, if they could do that, I could certainly do that with chardet. But one of the reasons I hadn't done it myself is because there's this thing that traditionally has been used to defend rewrites, or things where you change the license, called the clean room argument. The idea is that traditionally, if you wanted to take Windows code or something like that and have someone else write it, you needed... This is my understanding of it. My understanding is you basically need a team of people who have seen the original code and know what it does and everything, and they write reports that are like, hey, it'd be great if it did this, this, and this, but they can't say anything about how it should be written. You know, like, yeah, right. And then they hand that to another team and the
Sachin: the code itself or like how it's done.
Dan Blanchard: other team implements it based on those specs, and that's been legally defensible for decades. You know, that's a thing that people have gotten away with. And I knew that I was the only person working on this. And so I knew anyone would have been like, hey, if Dan rewrites it, he has way too much knowledge of the code. But I thought if I used Claude to do it, I would be doing the same thing. I wouldn't be telling it what to write. I would be saying, hey, absolutely do not use the code that was there before. One of the things that I haven't posted publicly yet that I'm planning to soon, I guess you got that scoop, is that I had Claude put together a little single-page app that went through my entire history of the conversations over the five days that I rewrote this. Because it was a mess. It was scattered over multiple directories, because I was renaming things on my local file system. So it combines that all into one giant history. And then it summarizes what each of the sessions was about. And then if you click on the session, it tells you, here's what each of the turns of the conversation was. And then anytime there were external things, like when it did a web search or a web fetch or looked at a file, those are flagged so that you can filter and see exactly what it looked at while it was working on this. And yeah, I will say, some people have pointed out that in some of the prompts, I guess not the prompts, the plans... So I use superpowers, and superpowers makes these plan docs, and I put them in the repository so that it would be really transparent with people. They could look at what I had done. And in one of the plans, it tells it to go look at this one file that is in the chardet repository, but it's a file that's metadata about the encodings. It's basically saying, this one is usually used for Japanese and this one's used for English, and we consider these ones modern and these ones legacy. So first of all, it's basically publicly available information anyway. And the other thing beyond that is that I wrote that file. You know, the thing is, it's always been the case that you can change the license if you can get the people who wrote the things to agree to change the license. And I wrote that in its entirety. That was a research project I did a few years ago to come up with every single encoding that mapped every language.
So there are a couple more that are going to be revealed in this thing when I post it that show a couple of other points of potential controversy. But I think all of them basically boil down to either something I wrote in its entirety, like a script that it uses for training the models, or a case where it did a final pass after it wrote everything to make sure that the API was compatible. It looked at it, but it had already written it. Saying you can't go back and compare things after they're written, that doesn't feel right to me if that's true, but, you know, I'm not a lawyer, so we'll see how that goes when I put that out there. That might just be adding fuel to the fire, but I try to be a fairly transparent person about these things. I feel like if I put all the facts out there, they are what they are. And if a court case comes out of this, that information would come out then anyway, right? So why not just have it out there now? So.
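The session viewer Dan describes, where every web search, web fetch, or file read is flagged so a reader can filter for exactly what the model looked at, could be sketched roughly as follows. Everything here is an assumption for illustration: the interview does not describe the real log format, and the `Turn` class, the event kinds, and the sample entries are hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical event kinds; the real tool's log format is not described in the interview.
EXTERNAL_KINDS = {"web_search", "web_fetch", "file_read"}


@dataclass
class Turn:
    session: str  # which working session the turn belongs to
    kind: str     # "message", "web_search", "web_fetch", or "file_read"
    detail: str   # prompt text, search query, URL, or file path


def external_accesses(turns: List[Turn]) -> List[Turn]:
    """Keep only the turns where the model looked at something outside the conversation."""
    return [t for t in turns if t.kind in EXTERNAL_KINDS]


# Made-up sample history, merged from several sessions into one list.
history = [
    Turn("day-1", "message", "Plan the detector pipeline"),
    Turn("day-1", "web_search", "shift_jis byte ranges"),
    Turn("day-2", "file_read", "docs/example-plan.md"),
    Turn("day-2", "message", "Write tests"),
]
flagged = external_accesses(history)
```

Here `flagged` holds only the web search and the file read, which is the kind of filtered view that would let an outside reader audit what the model consulted.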
Sachin: Right. And so like now that the controversy has bubbled up, now that you have like people taking sides on this thing, like obviously you didn't know that was gonna happen when you started that. Like where's your head at now? Like, do you feel like they're like, you know, hey, stay the course on this stuff. You're like, maybe I should have done this a different way. Like, how do you think about it now?
Dan Blanchard: No, no, yeah.
Sachin: with kind of hindsight is 20-20 now.
Dan Blanchard: So I think I'm too stubborn maybe sometimes. There's no world in which people being mad at me and mean to me is going to make me change my opinion on anything. Like, if you can come at me with facts and not argue them like you're a 12-year-old who's a troll, then I am happy to have that conversation and maybe you can change my mind. I would like to think that I'm a pretty logical person. I've changed my mind on lots of important things throughout my life and, you know, if you can convince me that I'm wrong, it's okay, I'm okay with that. Along those lines, what's his name? Fontana, I hope that name is right. Or is it Bruce? No, Bruce Perens, right, Bruce Perens, Brian Fontana. Yeah, I'm pretty sure that's Brian Fontana. He's one of the guys who wrote the latest version of the LGPL, which is technically not the version that chardet was under. It was under two and he wrote three. But he opened up an issue on the chardet issue tracker that was like, how much of this code was written by a person? And I was like, basically none of it, you know, I didn't directly write any of it really. And he was like, well, you know, current court cases are indicating that things that are written by AI are probably public domain, if they're entirely written by AI. And I've sent off to two more lawyers about that, and that does seem to be the consensus. And so I actually would not be upset if the outcome of this is, oh, chardet has to be public domain, because that's just even freer. Where that gets complicated, though, is that outside of the US public domain doesn't exist. It's not a legal concept that exists in other countries. And so that's like a whole nightmare. But that conversation where actual lawyers have been jumping in on that, and yes, occasionally there's a troll who I have to mark their comment as abusive or whatever. But for the most part, it's
Doyle Irvin: Mm. Mm.
Dan Blanchard: like there have been some pretty good conversations on there. Somebody was pointing out there's this license I didn't get to look into yet called like the zero BSD license which is the like it's public domain but like it's public domain that might work internationally and I might yeah I might change the license to that. One of the other interesting things that somebody brought up was like how do you If it's public domain, what about once it's open source and people start contributing to it? Like, what are their contributions? Because like, you know, like they get to choose what license their contributions are.
Dan Blanchard: technically, right? So there's not really a good legal structure, I think, right now for you to be able to even say, like, this repository is public domain and any contributions you make to it, you are putting into the public domain. There's not really a legal mechanism for that, as far as I'm aware. And that's something that I think people need to, that's like the next phase of open source. I saw that Sentry,
Dan Blanchard: I didn't get to read this article, but I saw that they put out something about like they were calling it fair source or something like that, which was the idea that, you know,
Sachin: I don't know.
Dan Blanchard: When things are public domain like this, basically their new thing wasn't like a copyright license because they're like, this is not copyrightable. And so, if this code isn't copyrightable, how do we still not be jerks to each other? And the fair source thing was basically like, you sign this agreement, but it wasn't immediately clear to me what the mechanism of enforcement was, that was like, you won't use this thing as a direct competitor to the thing. It was like, you put an agreement up that was like, yeah, this is totally public domain, but don't be a dick about it, you know? And they were trying to argue that that's really the only path we have forward right now, and I think it's really interesting.
Doyle Irvin: Well, so my big question is, like, who ultimately would be they? Like, who would bring, would Mark Pilgrim rise from the mysterious depths that he disappeared into and be like, I'm suing? Like, who would actually enforce, yeah.
Dan Blanchard: Yeah. So we didn't actually mention this, I don't think, which is that the controversy came up because Mark Pilgrim resurfaced to file an issue on the issue tracker. Yes, so confirmed supposedly, like two different people have confirmed that it's him. Allegedly Mark Pilgrim or actually?
Dan Blanchard: So I'm going to believe that it's him. It's a GitHub account that has done almost nothing, but it was created 15 years ago. So it's like, you know, it seems like it's an account that he had that was like this like shadow account, you know, that he was like, he had it, but he never used it. And so I choose to believe that it was, it was really him. And to be clear, I've never met him or talked to him ever other than this one time. And so it's like, I, some people say I have this like disdain for him and stuff like I don't.
Doyle Irvin: Okay, okay.
Dan Blanchard: Like he did a thing, it was really useful, and then he abandoned it and other people took care of it. And to me, you know, it's weird to say that something is owned by someone who deleted it, right? That's the other part that always gets me, is people are like, he did all this stuff, and I'm like, yes he did, but he also took the algorithm that was, you know, as far as I've looked at the original code, right? The original code was basically a line-by-line C to Python translation. It was the most non-Python-looking Python I've ever worked with, right? All the names of the variables were really crazy. Like these things that were internal to Netscape and their naming conventions were in the original version of chardet. And all of that's gone now.
Doyle Irvin: Yeah. So he himself got it from something else. So he can't say that was mine. It's like he translated that into something else. He abandoned it. You maintained it out of the dumpster for 10 years and now you're trying to make it even more free. You're not even trying to make any money off of it.
Dan Blanchard: Yeah, exactly. Yeah. And the new version, I used this plagiarism analysis tool because I was like, okay, maybe there's this fact that yes, I told it specifically not to use any GPL or LGPL code when it was writing it. I said, don't look at the original chardet code. You can look at the test data, because the test data is not actually code. The test data is files that he had downloaded from the internet to test it out and also files that I downloaded from the internet, right? Because that's how you test this crap. People are like, hey, it doesn't work on this file on this random website. And you're like, okay, I'll add it to the test set. Right. And those have always said, like, copyright their respective publishers and stuff like that. Those have never had LGPL attached to them, because they're not even code. So I totally lost my train of thought. What was your question?
Sachin: You're talking about the plagiarism test.
Dan Blanchard: Yeah, plagiarism test. Yeah, thanks. So I did a plagiarism test using this thing called JPlag. And yeah, I ran it. And in a response to Mark's message about me not having a right to change the code, I showed that if you looked at the code that was in the chardet package that people download from his version, version one (he only worked on version one; version two and beyond were other people), the highest similarity level between any file in the new version and the old version is 1.3% and the average similarity is 0.6%. If you look at version six, so version seven is this controversial one. Version six was actually a version where I made major changes under the hood to how models were trained and stored and everything. Even that one, the maximum similarity is like 80%, you know? Because some of the files transfer over, right? And the average, I think, is like 10% or something like that. It's between 10 and 20%. I don't have the numbers right in front of me. And so I feel like that pretty clearly shows. And I made a table that's got over time how the similarities have gotten lower, but there's always been some maximum similarity that's been pretty high because some files lived on. So this shows that none of the files lived on. It's all completely different. I've been thinking about trying to come up with different ways you could do that analysis. I have some blog posts I'm planning to do that kind of walk through some of them. You know, it's kind of funny to ask Claude to help investigate itself, but I think LLMs are actually pretty good about that. And so I asked it to look at the structure of six and seven and say, like, describe the approaches each of these take to solving this problem and, you know, lay out architecturally how they work and tell me what's the same. Like, try and point out points of similarity.
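To make the "maximum" and "average" similarity figures concrete, here is a minimal Python sketch of that kind of per-file comparison. It is an assumption-laden stand-in, not what Dan ran: it uses the standard library's `difflib.SequenceMatcher` on raw text rather than JPlag's token-based matching, and the function names `similarity` and `compare_versions` are hypothetical.

```python
from difflib import SequenceMatcher
from statistics import mean
from typing import Mapping, Tuple


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a rough stand-in for a real plagiarism metric."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def compare_versions(
    old_files: Mapping[str, str], new_files: Mapping[str, str]
) -> Tuple[float, float]:
    """For each new file, find its closest old file; return (max, mean) of those scores."""
    if not new_files:
        return 0.0, 0.0
    best_per_file = [
        max((similarity(new_text, old_text) for old_text in old_files.values()), default=0.0)
        for new_text in new_files.values()
    ]
    return max(best_per_file), mean(best_per_file)
```

In this framing, a file carried over unchanged pushes the maximum to 1.0 even when the average stays low, which matches the pattern described for the earlier versions.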
And it's like, yeah, the only stuff that's similar is that they're both solving the same problem and that they both are using statistics and pattern matching, which is how you would solve this problem. There's also a thing called a BOM, a byte order mark. Some encodings have a little special number at the front of the file that says, like, I'm UTF-8 or I'm UTF-16. It says what encoding it is in this little thing. So obviously checking if those are there is a pretty surefire way to know what the encoding is. But it's like, yeah, other than these two very obvious techniques for solving the same problem, you know, it has almost nothing in common with it. They're solving the same problem using the same basic ideas about a couple things that are just the only way you would solve this problem. So, yeah.
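As a rough illustration of the BOM check described here, a minimal Python sketch follows. It is not chardet's code: the function name `detect_bom` and the returned labels are assumptions chosen for the example, and only the BOM constants from the standard `codecs` module are relied on.

```python
import codecs
from typing import Optional

# Check the 4-byte UTF-32 marks before the 2-byte UTF-16 marks:
# the UTF-32 LE mark starts with the same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes) -> Optional[str]:
    """Return an encoding label if the data starts with a known BOM, else None."""
    for bom, label in _BOMS:
        if data.startswith(bom):
            return label
    return None
```

A detector would presumably try a check like this first and only fall back to statistical analysis when it returns `None`, since most files carry no BOM at all.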
Doyle Irvin: yeah.
Sachin: So much of what happens with open source free software is about the community. That is the big value you get out of it, right? And the community has responded fairly strongly to what you did here. What's your takeaway from that? Do you feel like that means, in the Bruce Perens frame, there's some sort of reckoning to happen with open source software licensing and stuff like that, or just an overreaction or what?
Dan Blanchard: Yeah. So I think Bruce Perens is right. I think software licensing is, like, everything in the courts so far seems to indicate that AI-generated code is likely not copyrightable. And that has way farther-reaching implications than anything about chardet, right? Like, basically every company right now is using LLMs to generate significant parts of their code. Any engineer that I know is using LLMs to write code, because it's just so much faster, right? And if all of that isn't copyrightable, I don't know what that means about how we sell software, right? Like it seems...
Dan Blanchard: Like that's why Bruce Perens' reaction was like that, because he's like, I don't know how you make any money selling software. Like his example in the article is, yeah, there's this company that has this tool that's really expensive. I don't want to pay for it. So I just pointed an LLM at the docs and I said, hey, make a clone of this, and it made one for him in like a couple days, and he was like, the amount I spent on tokens there and stuff like that was way cheaper than what I was gonna pay them. So, you know, software as a service is like, you need to have your hooks in deep with people to be able to survive someone just being able to clone you in 10 minutes. The kind of things that used to be, like, a startup would be founded around some simple idea and their MVP, like, your MVP can be cloned in like 10 minutes, you know, like it's.
Dan Blanchard: I don't love it. You know, and it's like, you know, as someone who's a professional software engineer, I don't love that that's the world we live in, but that's the world we live in, you know, um, so.
Sachin: what do you think needs to change?
Dan Blanchard: I mean, I feel like broad-scale societal things need to change. Like, you know, yeah. I mean, it's like the universal basic income guys might have been right, you know? Like, yeah, I feel like, you know, just to kind of wax philosophical about AI stuff in general, I do frequently think about, like, the more I use AI,
Doyle Irvin: Hehehehehe
Sachin: and solve it all.
Dan Blanchard: professionally or even for open source stuff, the more I'm like, at some point it feels like there's going to come a point where my job doesn't, like, please keep paying me, right? But it just feels like there's a world in which most of the things that humans are checking have diminishing returns, and I could see a world where a company is like, yeah, humans have bugs too, so we don't care. We don't care if it's not perfect because people aren't perfect either, right? And, meh, that's scary. Yeah.
Sachin: But I think the key difference, because I agree with you, is that agents don't really have accountability, but people do. In some ways I feel like the role of humans has moved from, like, execution to now, like, insurance policy, right?
Dan Blanchard: Yeah, no, I totally agree with that. Yeah, that's the thing. I'm trying to think. I think I can say this about my job, that one of the guidelines we have with using AI is, you know, the pull request has your name on it because you're the one accountable for it. And it's up to you to check that it's... And right.
Sachin: Same, yeah, same with us, right?
Dan Blanchard: And I feel like that's how I see the chardet thing too, right? So my name is on all of those commits. I very purposely made it do the commits so that it says, like, co-authored with Claude. The commits that only I did have just my name on them, but otherwise it's got both. But yeah, I've been checking this stuff and I made it do so much testing. There are so many areas of the codebase that are tested now that would not have been, like, I would have been lazy about it if I didn't use it. There are so many parts of the chardet codebase that were just completely untested before that it's scary to think about sometimes. And I'm like, okay, now there's 100% unit test coverage, which is not something I would have taken the time to do myself properly.
Sachin: So maybe the optimist view on this stuff is, yes, there was a challenge with licensing in an AI world, but the outcome is now a much better project with lower energy use for hundreds of millions of developers, something crazy like that.
Dan Blanchard: Yeah, right. Yeah, exactly. Like, I think this will be a net positive for the Python community.
Sachin: As much as you want to do whatever you think the right and ethical thing is, which of course everybody wants to try to do here, you still can't slow down in this pace of innovation. You can't just be like, right, there's so much value to accrue when you use stuff like this that you can't just say, no, we're not going to.
Dan Blanchard: Yeah. Right, yeah. I've released two more versions since the controversial one. I guess three. So 7.0 was the controversial one. Then there was 7.0.1, which had this really minor bug fix. And then I did 7.1 and 7.2. I think I put that out maybe two days ago or something like that. Like
Sachin: And so arguably that's the most progress this project has ever seen in its history, in like three weeks.
Dan Blanchard: Yes, yes, yeah, 100%. Previously, there was a three-year gap between releases, because it would be just whenever I had some time. But now it's like, someone files an issue, and I'm like, I could totally fix that in like 10 minutes. And then I'm like, might as well cut a new release, you know? And yeah, one of the funny things about, I'm sorry, one second. We have a cat. Can we see a cat? Do we get cat on camera?
Dan Blanchard: Cat's stealing the other cat's. Yeah, the cat was stealing the other cat's food and I had to stop that. No, she ran away.
Sachin: All right, Dan, well, we appreciate you sharing this whole story with us. It is like super interesting and I think it like is not just interesting in terms of like what you did and how it happened, but interesting for the future of all this stuff and where it goes. So maybe we have to like have you back on in like, I don't know, a month or so we can chat again.
Dan Blanchard: yeah, I mean, I'd be happy to chat some more. I'm hoping, you know, there'll be positive developments or at least less controversial developments in the future or maybe, you know, more controversial things that are fun to talk about. I guess, I guess maybe the like soundbite thing is like, I don't care about which way the court case goes. I think that's the thing that people don't understand. Like, a decision would be beneficial to everyone. Like, the uncertainty right now is the part that sucks.
Doyle Irvin: Mmm.
Sachin: Right, we're in the Wild West, we don't know what is what.
Dan Blanchard: Yeah, right. Like if somebody says to me, you legally have to change the license back to LGPL, that sucks, right? I don't like that, but that's like, okay, that is a decision. And it's no worse off than it was before. If somebody says to me, this has to go in public domain, it's not copyrightable, also fine with me. Because that could mean it potentially gets into the standard library. If someone's like, you can keep it with MIT as it is right now, that's also great, right? Because that's what I was trying to do. But the only way I see legal stuff being a downside for me is if I lose a court case and there's, like, damages, but I don't understand how there really can be damages for a free thing being sued by a free thing, other than, like, paying for the lawyers, right? You know, like having to pay the other team's legal fees. And you know, I'm not gonna name names, but people have volunteered to help fund my legal fund if this is a problem. So, you know, I'm hoping it doesn't come to that, but I honestly don't know how else we get a, like, "this is the way it is." Yeah, so
Sachin: a line in the sand.
Doyle Irvin: yeah.
Sachin: But I can imagine if it did go to court and this was being argued, that line in the sand would include so many different parties that it might be a long road.
Dan Blanchard: Yeah, exactly. That's true. Yeah. There is one court case in Colorado right now that is kind of relevant. This guy made some art with AI and they're trying to hash out how much a human has to contribute to it for it to count as them being the author. I don't know that code would go the same way as art, but it's not that it would be some sort of... Anyway.
Doyle Irvin: yeah.
Dan Blanchard: I'll shut up now. Yeah, it was really great talking to y'all.
Sachin: Thanks, Dan.
Doyle Irvin: No, well, thank you. We really appreciate it.


