Detect and crash Chromium bots
146 points by avastel 1 month ago | 47 comments

- oefrha 1 month ago
> The call to page.evaluate just hangs, and the browser dies silently. browser.close() is never reached, which can cause memory leaks over time.
Not just memory leaks. Since a couple of months ago, if you use Chrome via Playwright etc. on macOS, it deposits a copy of Chrome (more than 1GB) into /private/var/folders/kd/<...>/X/com.google.Chrome.code_sign_clone/, and if you exit without a clean browser.close(), the copy of Chrome remains there. I noticed after it ate up ~50GB in two days. No idea what the point of this code sign clone thing is, but I had to add --disable-features=MacAppCodeSignClone to all my invocations to prevent it, which is super annoying.
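A minimal sketch of the cleanup pattern implied above, using Playwright for Node (the timeout values are illustrative; the flag is the one from the comment):

```ts
// Sketch: always reach browser.close(), even if page.evaluate() never
// settles, and disable the macOS code-sign clone mentioned above.
import { chromium } from 'playwright';

async function visit(url: string): Promise<void> {
  const browser = await chromium.launch({
    // Prevents the ~1GB com.google.Chrome.code_sign_clone copies on macOS.
    args: ['--disable-features=MacAppCodeSignClone'],
  });
  try {
    const page = await browser.newPage();
    await page.goto(url, { timeout: 15_000 });
    // Race the evaluate call against a timeout so a hung page cannot
    // keep us from the finally block below.
    const title = await Promise.race([
      page.evaluate(() => document.title),
      new Promise<never>((_, reject) =>
        setTimeout(() => reject(new Error('evaluate timed out')), 10_000)
      ),
    ]);
    console.log(title);
  } finally {
    await browser.close(); // runs even if evaluate hangs or the page crashes
  }
}

visit('https://example.com').catch(console.error);
```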
- closewith 1 month ago
That's an open bug at the minute, but the one saving grace is that they're APFS clones, so they don't actually consume disk space.
- oefrha 1 month ago
Interesting, IIRC I did free up quite a bit of disk space when I removed all the clones, but I also deleted a lot of other stuff that time so I could be mistaken. du(1) being unaware of APFS clones makes it hard to tell.
- chrismorgan 1 month ago
Checking https://issues.chromium.org/issues/340836884, I’m mildly surprised to find the report just under a year old, with no attention at all (bar a me-too comment after four months), despite having been filed with priority P1, which I understand is supposed to mean “aim to fix it within 30 days”. If it continues to get no attention, I’m curious if it’ll get bumped automatically in five days’ time when it hits one year, given that they do something like that with P2 and P3 bugs, shifting status to Available or something, can’t quite remember.
I say only “mildly”, because my experience on Chromium bugs (ones I’ve filed myself, or ones I’ve encountered that others have filed) has never been very good. I’ve found Firefox much better about fixing bugs.
- carlhjerpe 1 month ago
I guess it depends on what kind of bug it is; this one took 25 years to fix: https://news.ycombinator.com/item?id=40431444
- Dylan16807 1 month ago
To be fair, that bug was only P3.
- wraptile 1 month ago
I find the "don't let googlebot see this" attitude kinda funny considering how the top Google results are often much worse. The captcha/anti-bot situation is getting so bad I had to move to Kagi to block some domains specifically, as browsing the contemporary web is almost impossible at times. Why isn't Google down-ranking this experience?
- lifthrasiir 1 month ago
Previously on HN: Detecting Noise in Canvas Fingerprinting https://news.ycombinator.com/item?id=43170079
The reception at the time was not really positive, for the obvious reason.
- wslh 1 month ago
In Google Chrome, at least, I tried an infinite loop modifying document.title and it freezes pages in other tabs as well. I'm not at my computer right now to try it again.
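A rough reproduction of the experiment described, assuming it is pasted into a page's DevTools console (it busy-loops the renderer's main thread, which is what can freeze other tabs sharing that process):

```ts
// Busy-loop mutating document.title; never yields back to the event loop.
let n = 0;
while (true) {
  document.title = `spin ${n++}`;
}
```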
- neuroelectron 1 month ago
I, for one, find it hilarious that "headless browsers" are even required. JavaScript interpreters serving webpages is just another amusing bit of serendipity. "Version-less HTML" hahaha
- kevin_thibedeau 1 month ago
It exists because adtech providers and CDNs punish legitimate users who don't execute untrusted code on their property.
- Thorrez 1 month ago
Headless browsers exist because adtech providers and CDNs punish legitimate users who don't execute untrusted code on their property?
If we ask the creators of headless chrome or selenium why they created them, would they say "because adtech providers and CDNs punish legitimate users who don't execute untrusted code on their property"?
- jillyboel 1 month ago [flagged]
- seventh12 1 month ago
The intention is to crash bots' browsers, not users' browsers.
- ramesh31 1 month ago
Please point me to this 100% correct bot detection system with zero false positives.
- FridgeSeal 1 month ago
You understand the difference between intent and reality, right?
The article even warns about this side-effect.
- h4ck_th3_pl4n3t 1 month ago
If you are scraping data forbidden by my robots.txt, I don't give a damn. I am gonna mess with your bots however I like, and I'm willing to go as far as it takes to teach you a lesson about respecting my robots.txt.
- anthk 1 month ago
If you crash some browser coming from a directory disallowed in your robots.txt, it's not your fault.
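A minimal sketch of that tactic, assuming an Express server; the /trap path and the payload placeholder are hypothetical stand-ins, not the article's actual code:

```ts
import express from 'express';

const app = express();

// Well-behaved crawlers read this and never touch /trap.
app.get('/robots.txt', (_req, res) => {
  res.type('text/plain').send('User-agent: *\nDisallow: /trap\n');
});

// Only clients that ignore robots.txt ever request this path.
app.get('/trap', (_req, res) => {
  res.type('text/html').send(
    '<script>/* hypothetical Chromium-crashing payload */</script>'
  );
});

app.listen(3000);
```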
- lightedman 1 month ago
If that's the case, what do we do about websites and apps that disable your back button (your phone's native one) or your right-click (desktop browser), when that disabling isn't in the ToS or even disclosed to you upon visiting the site or using the app?
- dmitrygr 1 month ago
Then maybe we need laws about crashing my server by crawling it 163,000 times per minute nonstop, ignoring robots.txt? Until then, no pity for the bots.
- sMarsIntruder 1 month ago
Running a bot farm?
- jillyboel 1 month ago
of course not, why are you immediately jumping to accusations? if i was i'd just patch the bug locally and thank OP for pointing out how they're doing it.
it's just blatantly illegal and i wouldn't want anyone to get into legal trouble
- omneity 1 month ago [flagged]
- randunel 1 month ago
How do you deal with the usual CF, Akamai, and others fingerprinting and blocking you? Or is that the customer's job to figure out?
- omneity 1 month ago
Thank you for the question! It depends on the scale you're operating at.
1. For individual use (or company use where each user is on their own device), the traffic is typically drowned out in regular user activity since we use the same browser; no particular measure is needed, it just works. We have options for power users.
2. For large-scale use, we offer tailored solutions depending on the anti-bot measures encountered. Part of it is to emulate #1.
3. We don't deal with "blackhat bots" (social spambots etc.), so we don't offer support for working around legitimate anti-bot measures.
- lyu07282 1 month ago
If you don't put significant effort into it, any headless browser running from cloud IP ranges will be banned by large parts of the internet. This isn't just about spam bots; you can't even read news articles in many cases. You will have some competition from residential proxies and other custom automation solutions that take care of all of that for their customers.
- erekp 1 month ago
We have a similar solution at metalsecurity.io :) handling large-scale automation for enterprise use cases and bypassing anti-bot systems.
- omneity 1 month ago
That's super cool, thank you for sharing! It's based on Playwright though, right? Can you verify whether the approach you are using is also subject to the bug in TFA?
My original point was not necessarily about bypassing anti-bot protections, but rather to offer a different branch of browser automation independent of incumbent solutions such as Puppeteer, Selenium, and others, which we believe are not made for this purpose and have many limitations, as TFA mentions, requiring way too many workarounds, as your solution illustrates.
- erekp 1 month ago
We fix the leaks and bugs of automation frameworks, so we don't have that problem. The problem with using the user's browser, as yours does, is that you will burn the user's fingerprint depending on scale.
- volemo 1 month ago
Guess we gotta find a way to crash these bots too. :D