Video Summary and Transcription
The Talk discusses the challenges of detecting and combating bots on the web. It explores various techniques such as user agent detection, tokens, JavaScript behavior, and cache analysis. The evolution of bots and the advancements in automated browsers have made them more flexible and harder to detect. The Talk also highlights the use of canvas fingerprinting and the need for smart people to combat the evolving bot problem.
1. Introduction to Web Bots
I'm here to ask what's going on with bots on the web. We'll talk about simple detections, how the bots got better, what's possibly the best bot out there at cheating most detection solutions, and lastly my favorite part, which is how you can find it anyway. My job is playing hide and seek with these bots so advertisers can avoid them. Social media, concert ticket sellers, a lot of people face this issue, because the internet was not designed with bot detection in mind. So to make the internet better, we want to detect them. Starting with the basics: the user agent, the HTTP request header identifying the browser. You see it's a Python bot, you block that. The bot makers figured this out, so they hide the user agent.
Hey, everyone. I'm Adam. I'm super happy to be here, and I'm here to ask what's going on with bots on the web. I'm not talking about the nice ones, the testing ones. I'm talking about the bad ones. We'll talk about simple detections and how the bots got better. We'll talk about what's possibly the best bot out there, cheating most detection solutions. And we'll lastly get to my favorite part, which is how you can find it anyway.
But before all that, one reason I'm here is because I always liked hacking stuff, and now I'm a reverse engineer for DoubleVerify. They measure ads. My job is playing hide and seek with these bots, so advertisers can avoid them. But it's not just advertisers. It's going to be social media, concert ticket sellers, a lot of people facing this issue, because the internet was not designed with bot detection in mind. Seriously. The only real standard is robots.txt, telling bots what they're allowed and disallowed to do. Basically the honor system, asking good people to play nice. When you do that, yeah, real story: when I was 16, high school projects may or may not have dropped service to some site. But some people actually do this on purpose and at scale, denying service to real users, using what they have to steal sneakers, sneaking around social media with fake users. I practiced that part. So to make the internet better, we want to detect them.
Let's talk detections, starting with the basics. Not because bot makers can't play around these, but because simple detections are pretty straightforward, and they're usually the first thing you rely on when you come up with something more complicated. User agent: that's the HTTP request header identifying the browser. You guys know this. You see it's a Python bot, you block that. Probably not a real user behind that. The bot makers figured this out, so they hide the user agent. Let's say you don't run JavaScript on your bot.
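The user agent check described above can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: the function name and the marker list are assumptions, and treating a missing header as suspicious is an added assumption too. As the talk says, this only catches bots that do not bother to hide their user agent.

```typescript
// Substrings found in the default User-Agent of some common HTTP clients.
// Illustrative only: the list is an assumption and far from exhaustive.
const SCRIPTED_CLIENT_MARKERS = [
  "python-requests",
  "python-urllib",
  "curl/",
  "wget/",
  "go-http-client",
];

// Returns true when the User-Agent header openly identifies a script
// rather than a browser. A missing or empty header is also flagged
// (an assumption of this sketch, not something the talk states).
function looksLikeScriptedClient(userAgent: string | undefined): boolean {
  if (userAgent === undefined || userAgent.trim() === "") return true;
  const ua = userAgent.toLowerCase();
  return SCRIPTED_CLIENT_MARKERS.some((marker) => ua.includes(marker));
}
```

A bot that sends a copied browser user agent passes this check, which is why the talk moves on to tokens and JavaScript behavior.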
2. Detecting Bots with Tokens and JavaScript
You can use tokens and JavaScript behavior to detect bots on your site. Browser quirks can be used to verify the true nature of a browser. Digging deep into JavaScript can reveal attempts to hide something.
Maybe you make a token as the detection on the site, and then you actually make sure it's created. So if you have a bot that's navigating to your site, not generating this token, not running JavaScript, you know something's going wrong. But let's say they do run JavaScript. All of a sudden, you can check how the browser behaves. You people probably hate browser quirks. Bot makers hate them too, because they can be used to verify what's under the hood and not what the browser is reporting at face value. And sometimes you can dig deep in JavaScript to see if somebody's trying to hide something.
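The "make sure the token is created" idea can be sketched on the server side as a set difference between sessions that requested a page and sessions whose page script reported a token. The function name and the session ID inputs are hypothetical. Real users with JavaScript disabled or blocked would also show up here, so treat the result as a hint rather than proof.

```typescript
// Returns the session IDs that requested a page but never reported the
// token the page script is supposed to create. Inputs are hypothetical:
// one entry per page hit, and one entry per token report received.
function sessionsMissingToken(pageHits: string[], tokenReports: string[]): string[] {
  const reported = new Set(tokenReports);
  const missing = new Set<string>();
  for (const sessionId of pageHits) {
    if (!reported.has(sessionId)) missing.add(sessionId);
  }
  // Each session is listed once, in order of first appearance.
  return Array.from(missing);
}
```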
3. Detecting Bots with User Agent and Behavior Tests
Bot makers hide the user agent with Object.defineProperty, but fixing a bad attribute accidentally leaves behind an artifact that can be used for detection. This cat and mouse theme repeats: they hide the toString too, there are clever ways around that, and each patch opens new vectors. A JavaScript fingerprinting library like CreepJS only goes so far. Data gives another arena for catching bots: per-page tokens, duplicate tokens, nonsense navigations. Caches are another avenue for detection. Behavior tests look at things like user click frequency.
User agent, we talked about that. That property on window.navigator is going to be read-only. So bot makers, they're going to hide that with Object.defineProperty. You look at the property descriptor, you see somebody did funny stuff there, trying to hide a user agent. That's going to be suspicious. The message here being: you have a bad attribute identifying you, you fix it into something perfectly fine, and as the bot maker you accidentally leave behind an artifact that can be used to incriminate you. That's going to be used for detection.
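A minimal sketch of the property descriptor check, under one stated assumption: in current mainstream browsers `userAgent` is an accessor defined on `Navigator.prototype`, so the `navigator` instance itself normally has no own descriptor for it. The helper name is hypothetical, and it takes the object as a parameter so it can be tried outside a browser.

```typescript
// Returns true when `prop` is defined directly on `target` instead of
// being inherited. Calling Object.defineProperty(navigator, "userAgent", ...)
// leaves exactly this kind of own-property artifact behind.
function hasOwnOverride(target: object, prop: string): boolean {
  return Object.getOwnPropertyDescriptor(target, prop) !== undefined;
}

// In a browser this would be called as: hasOwnOverride(navigator, "userAgent")
```

A bot maker can avoid this particular artifact by patching the prototype instead, which in turn leaves different traces. That is the cat and mouse pattern the talk keeps returning to.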
This is going to be a common theme: the cat and mouse of bot detections. Another example is that the bot maker can override the toString on something they're trying to hide. So you look at the toString, and they hide that too. There's a fun game established there, with clever ways around this. The key takeaway here is the cat and mouse theme, which is going to repeat and give you vectors for more detections.
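One way to look past an overridden `toString` is to borrow the original `Function.prototype.toString` instead of calling the method on the function itself. This is a sketch with a hypothetical helper name, and it is not conclusive on its own: as the talk notes, bot makers can hide this layer too (for example by patching `Function.prototype.toString`), and some legitimate functions, such as bound functions, also print as native code.

```typescript
// Returns true when the engine itself reports the function body as
// "[native code]". An own `toString` placed on the function is bypassed
// because the call goes through Function.prototype.toString directly.
function looksNative(fn: Function): boolean {
  const source = Function.prototype.toString.call(fn);
  return /\{\s*\[native code\]\s*\}\s*$/.test(source);
}
```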
Let's say you're really good at this stuff. You make a JavaScript library like CreepJS to fingerprint a browser under the hood. That only goes so far, because the bot makers can see what you're doing. Every time you find them some way, they're going to evolve, they're going to patch a little bit, and now we've got to use something else.
Let's say on a site you want to use data. Data is going to be tricky because you have to mind privacy issues or unruly users that you're testing, but let's say you have a site that users go to, just for the sake of argument. On each page you put a generated token you can validate as a user goes through. You can see where that user went. All of a sudden you have a whole new arena for catching these bots: duplicate tokens, nonsense navigations, anything that's giving you even the tiniest hint that somebody may have just pre-programmed the logic in some way, and it's not an actual user navigating. Some of you might be thinking, hey, won't users do this too? And that's absolutely right, with caches and stuff. That's why this isn't a smoking gun on its own. But caches, for example, also introduce a whole other avenue for bot detection: some bots clean the cache too much.
And we'll get into advanced detection in just a moment, but still in the domain of old-school simple detections: behavior tests. How many times should the user click on the site? Let's say in an hour. OK. So, there's going to be a little spectrum.
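The per-page token idea described above can be sketched as a small ledger: every page hands out a token, and each token should come back once. The class and its method names are hypothetical, and the in-memory map stands in for whatever storage a real site would use. Because browser caches can make real users replay a token, a "duplicate" result is a hint to combine with other signals, not a verdict.

```typescript
type TokenFinding = "ok" | "duplicate" | "unknown";

// Tracks tokens embedded in served pages and how they come back.
class PageTokenLedger {
  // token -> whether it has already been redeemed
  private readonly redeemed = new Map<string, boolean>();

  // Call when a page containing `token` is served.
  issue(token: string): void {
    this.redeemed.set(token, false);
  }

  // Call when a navigation presents `token`.
  redeem(token: string): TokenFinding {
    const used = this.redeemed.get(token);
    if (used === undefined) return "unknown"; // never issued: nonsense navigation
    if (used) return "duplicate"; // replayed token
    this.redeemed.set(token, true);
    return "ok";
  }
}
```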
4. Bots, CAPTCHAs, and Automated Browsers
Click frequency forms a spectrum, from zero clicks at one edge to 172 clicks per second at the other. CAPTCHAs were originally used to distinguish humans from bots, but today they mostly just slow the bots down. The evolution of bots and the advancements in automated browsers have made them more flexible and harder to detect. Puppeteer, Google's browser automation library, is the kingpin in botting, and the stealth tooling built on it makes good botting accessible and difficult to detect.
How much does each user do it? You're going to look at the edges. On one side you have zero, which is going to be my father clicking absolutely zero times after a long day of work. But on the other edge, you have people clicking 172 times per second. OK, we're getting somewhere there.
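Looking at the fast edge of the click spectrum can be sketched as a rate computation over click timestamps. The function names and the default ceiling of 20 clicks per second are assumptions chosen for illustration; a real threshold would come from the site's own traffic.

```typescript
// Average clicks per second over the observed span of timestamps (in ms).
function clickRatePerSecond(clickTimesMs: number[]): number {
  if (clickTimesMs.length < 2) return 0;
  const sorted = [...clickTimesMs].sort((a, b) => a - b);
  const spanSeconds = (sorted[sorted.length - 1] - sorted[0]) / 1000;
  if (spanSeconds === 0) return Infinity; // several clicks at the same instant
  return (sorted.length - 1) / spanSeconds;
}

// Flags sessions whose click rate exceeds a ceiling. The default of 20
// per second is a hypothetical value, not a figure from the talk.
function isImplausiblyFast(clickTimesMs: number[], maxPerSecond = 20): boolean {
  return clickRatePerSecond(clickTimesMs) > maxPerSecond;
}
```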
And also, CAPTCHAs. That's what we came up with originally to distinguish humans from bots. You might be asking, hey, why didn't you start with that? The reason is that once bots train, they absolutely demolish humans at the simple ones. And this goes for complex CAPTCHAs too. So CAPTCHAs aren't there to prevent bots. What they do currently, the reason you see them, is that they slow the bots down.
Moving forward, let's talk about how the bots got better. Getting closer to the advanced detection part: the bot makers haven't been sleeping all this time. They got better, and they keep getting better; with every little patch they evolve. Eventually, the game became written in their favor. Ten years ago, they were struggling with Python scripts. In recent news, it's now publicly retweeted, so here's some guy complaining about this to Elon Musk. Ping me in the Q&A if you want to talk about the seekers and the hiders some more. Point being that evolution is really good at gradual improvement and problem solving.
Bots might eventually become indistinguishable from humans entirely. But moving from philosophy rambles to practice, the biggest technical advance bots made is going to be automated browsers. Automated browsers have changed the game. No more requests in Python: you're taking the whole browser, you run the DOM and JavaScript, you can even fake the user. Automated browsers got really good at their thing, so they let bot makers fake attributes like the user agent without leaving behind the artifact that you'd normally leave in JavaScript. That makes bots more flexible and harder to detect. Browser quirks and all that are not much use when the browser automation solution is driving the actual browser with whatever quirks it has. And you can fake the user too with these, using vanilla JavaScript or browser hooks, scrolling through some articles with window.scroll.
Here are some automated browsers that are good for testing, but don't be like some of the bot operators I found using these for fraud schemes, because they're not meant to hide anything and they're easy to detect. No, and we're getting to the kingpin here, you want something that's based on this guy, Puppeteer. That's Google's automated browser. It's not malicious or trying to hide, but it's really good at its thing and makes itself super easy to extend, makes everything super smooth.
With that, all the pieces kind of came into place. We see bots getting way better, automated browsers available, and operators improving their game. Thus the king bot was born. This guy, Puppeteer Extra Stealth, is the best at hiding, so it can run headless, meaning no rendering on the screen, and still doesn't get detected. It's amazing, criminally easy to use, making really good botting really accessible.
The community behind it is an army looking for even the slightest discrepancies. For example, they patch hardware concurrency, that's the number of available processors you have, so they can scale the operations and run many of these on the same computer without even raising the suspicion that there are bots. A whole different playing field there. They left some traces when they patched this attribute on the prototype. People detected that, they found it, they patched it, and they did this fast.
5. Detecting Bots: Canvas Fingerprinting and Beyond
Canvas fingerprinting should be harder to fake, yet bot makers report fake values through JavaScript hooks. Headless Chromium lacks objects such as chrome.csi and chrome.app, so the stealth tooling fills them with fake values to defeat that detection. Easy bots are easy to detect, but hard bots call for remaining JavaScript artifacts such as hardware concurrency, behavior tests, and session-level data analysis. By analyzing how user agents are distributed across traffic, you can identify even the best bots. The internet needs more smart people to combat the evolving bot problem.
Let's talk about something harder to fake: canvas fingerprinting. That's when you dynamically render a canvas to fingerprint your device alongside the browser. Should be harder to fake. Bam, these folks developed an extension that reports fake values using JavaScript hooks, so they solve pretty much anything.
I want to take the time to explain just one of these before I fly through the others. Chromium here, when it's headless, does not have chrome.csi, chrome.app, all that performance stuff. So you think, okay, can I detect Puppeteer when it's headless with this? All of a sudden, not so fast, buddy: they fill every single object with fake values like this, navigator, chrome.loadTimes, chrome.runtime, chrome.app, so on and so forth, anything that can be used for detection. It's making it stupid easy to use, annoyingly elegant. Remember CAPTCHAs? Two lines here, a little bit of money, and they solve that. And all of this is going to be just with these two lines here. Everything we talked about, super easy to use. Bot tests fail to find it; they just hang the bot detection results on their repo. But I promised to get to what actually works, and that part, I'll make it within 10 minutes, all right.
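The naive check the speaker describes, looking for the `chrome` objects that headless Chromium lacks, might look like the sketch below. The key list is taken from the names mentioned in the talk; which of them a genuine Chrome exposes can vary by version and page context, so that list and the helper name are assumptions. As the talk points out, the stealth tooling fills these objects with fake values, so this check does not catch it.

```typescript
// Names mentioned in the talk; treat the exact list as an assumption.
const EXPECTED_CHROME_KEYS = ["app", "csi", "loadTimes", "runtime"];

// Returns the expected keys that are absent from `win.chrome`.
// Takes a window-like object so it can be exercised outside a browser.
function missingChromeKeys(win: { chrome?: Record<string, unknown> }): string[] {
  const chromeObject = win.chrome;
  if (chromeObject === undefined) return [...EXPECTED_CHROME_KEYS];
  return EXPECTED_CHROME_KEYS.filter((key) => !(key in chromeObject));
}
```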
Obviously, I can't specify too much here, but let's start with the easy part. Easy bots are easy to detect. I'm going to name three ways to go after hard bots. This is going to be quick, but it's going to be more than what's out there. Starting with stuff like hardware concurrency: there are still more JavaScript artifacts to be found if you know where to look. These are becoming increasingly rare, though, so I wouldn't count on them long term, but the upside here is that they're very clear-cut.
What will hold over time is behavior tests and session-level data analysis. Behavior tests that still work usually look at window context discrepancies when interacting with the DOM. Data analysis can take many shapes. For example, let's say you fake the user agent perfectly. The question is what value you put there. Look at this graph: that's the user agents navigating to an app, and each point is how many people navigated with that specific user agent. You'd say there's some variance here, and that is what it's supposed to look like. The blue line's almost flat, and that's probably because the botters got right that the user agent should vary, but they got the weight part entirely wrong. They're probably just producing these at random. Normal sites don't look like this, and here's one way you can detect even the best bot.
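The "almost flat line" observation can be turned into a number. The talk shows a graph rather than a formula, so the measure below, normalized Shannon entropy of the per-user-agent hit counts, is a swapped-in illustration, and the function name is hypothetical. For counts $c_i$ with $p_i = c_i / \sum_j c_j$ over $n$ distinct user agents, it computes $-\sum_i p_i \ln p_i / \ln n$, which approaches 1 when every user agent is equally common (the randomly generated case) and drops toward 0 when a few user agents dominate, as in organic traffic.

```typescript
// Normalized Shannon entropy of hit counts, in the range [0, 1].
// Values near 1 mean the user agents are almost uniformly distributed.
function normalizedEntropy(counts: number[]): number {
  const positive = counts.filter((count) => count > 0);
  if (positive.length < 2) return 0; // nothing to compare
  const total = positive.reduce((sum, count) => sum + count, 0);
  let entropy = 0;
  for (const count of positive) {
    const p = count / total;
    entropy -= p * Math.log(p);
  }
  return entropy / Math.log(positive.length);
}
```

What counts as "too flat" depends on the site, so any cutoff would have to be calibrated against traffic known to be human.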
So at the start I asked you what's up with bots on the web? I can't tell you for sure, but what I do know is that they're getting better, we need more smart people like you to be aware so that the internet becomes a better place.