Silkenweb Example: Hackernews Clone

Detect and crash Chromium bots

146 points by avastel 1 month ago | 47 comments

oefrha 1 month ago
> The call to page.evaluate just hangs, and the browser dies silently. browser.close() is never reached, which can cause memory leaks over time.
Not just memory leaks. Since a couple months ago, if you use Chrome via playwright etc. on macOS, it will deposit a copy of Chrome (more than 1GB) into /private/var/folders/kd/<...>/X/com.google.Chrome.code_sign_clone/, and if you exit without a clean browser.close(), the copy of Chrome will remain there. I noticed after it ate up ~50GB in two days. No idea what's the point of this code sign clone thing, but I had to add --disable-features=MacAppCodeSignClone to all my invocations to prevent it, which is super annoying.
- closewith 1 month ago
  That's an open bug at the minute, but the one saving grace is that they're APFS clones so don't actually consume disk space.
  - oefrha 1 month ago
    Interesting, IIRC I did free up quite a bit of disk space when I removed all the clones, but I also deleted a lot of other stuff that time so I could be mistaken. du(1) being unaware of APFS clones makes it hard to tell.
chrismorgan 1 month ago
Checking https://issues.chromium.org/issues/340836884, I’m mildly surprised to find the report just under a year old, with no attention at all (bar a me-too comment after four months), despite having been filed with priority P1, which I understand is supposed to mean “aim to fix it within 30 days”. If it continues to get no attention, I’m curious if it’ll get bumped automatically in five days’ time when it hits one year, given that they do something like that with P2 and P3 bugs, shifting status to Available or something, can’t quite remember.
I say only “mildly”, because my experience on Chromium bugs (ones I’ve filed myself, or ones I’ve encountered that others have filed) has never been very good. I’ve found Firefox much better about fixing bugs.
- carlhjerpe 1 month ago
  I guess it depends on what kind of bug it is, this took 25 years to fix https://news.ycombinator.com/item?id=40431444
  - Dylan16807 1 month ago
    To be fair that bug was only P3.
wraptile 1 month ago
I find the "don't let googlebot see this" kinda funny considering how top google results are often much worse. The captcha/anti-bot is getting so bad I had to move to Kagi to block some domains specifically as browsing contemporary web is almost impossible at times. Why isn't google down ranking this experience?
lifthrasiir 1 month ago
Previously on HN: Detecting Noise in Canvas Fingerprinting https://news.ycombinator.com/item?id=43170079
The reception was not really positive for the obvious reason at that time.
wslh 1 month ago
In Google Chrome, at least, I tried an infinite loop modifying document.title and it freezes pages in other tabs as well. Now, I am not at my computer to try again.
- 1 month ago
neuroelectron 1 month ago
I, for one, find it hilarious that "headless browsers" are even required. JavaScript interpreters serving webpages is just another amusing bit of serendipity. "Version-less HTML" hahaha
- kevin_thibedeau 1 month ago
  It exists because adtech providers and CDNs punish legitimate users who don't execute untrusted code on their property.
  - Thorrez 1 month ago
    Headless browsers exist because adtech providers and CDNs punish legitimate users who don't execute untrusted code on their property?
    If we ask the creators of headless chrome or selenium why they created them, would they say "because adtech providers and CDNs punish legitimate users who don't execute untrusted code on their property"?
    - Bjartr 1 month ago
      Whether or not it's true aside, why people decide to do something and why they say something is being done don't have to match.
    - immibis 1 month ago
      Another use is testing websites.
jillyboel 1 month ago
[flagged]
- seventh12 1 month ago
  The intention is to crash bots' browsers, not users' browsers
  - ramesh31 1 month ago
    Please point me to this 100% correct bot detection system with zero false positives.
    - FridgeSeal 1 month ago
      You understand the difference between intent and reality right?
      The article even warns about this side-effect.
  - jillyboel 1 month ago
    [flagged]
    - h4ck_th3_pl4n3t 1 month ago
      If you are scraping forbidden data in my robots.txt, I don't give a damn. I am gonna mess with your bots however I like, and I'm willing to go as far as it takes to teach you a lesson about respecting my robots.txt.
    - anthk 1 month ago
      If you are crashing some browser from a disallowed directory in robots.txt, is not your fault.
    - lightedman 1 month ago
      If that's the case what do we do about websites and apps which do things like disable your back button (mobile phone's direct one) or your right click capabilities (desktop browser) while such functionality disabling is not present in the ToS or even presented to you upon visiting the site or using the app?
- dmitrygr 1 month ago
  Then maybe we need laws about crashing my server by crawling it 163,000 times per minute nonstop, ignoring robots.txt? Until then, no pity for the bots.
  - jillyboel 1 month ago
    if your software crashes due to normal usage then you only have yourself to blame
    - dmitrygr 1 month ago
      Yes indeed. Nginx running out of RAM due to A”I” companies hammering my server is my fault.
- sMarsIntruder 1 month ago
  Running a bot farm?
  - jillyboel 1 month ago
    of course not, why are you immediately jumping at accusations? if i was i'd just patch the bug locally and thank OP for pointing out how they're doing it.
    it's just blatantly illegal and i wouldn't want anyone to get into legal trouble
omneity 1 month ago
[flagged]
- randunel 1 month ago
  How do you deal with the usual CF, akamai and other fingerprinting and blocking you? Or is that the customer's job to figure out?
  - omneity 1 month ago
    Thank you for the question! It depends on the scale you're operating at.
    1. For individual use (or company use but each user is on their device) typically the traffic is drown out in regular user activity since we use the same browser and no particular measure is needed, it just works. We have options for power users.
    2. For large scale use, we offer tailored solutions depending on the anti-bot measures encountered. Part of it is to emulate #1.
    3. We don't deal with "blackhat bots", so we don't offer support to work around legitimate anti-bot measures such as social spambots etc.
    - lyu07282 1 month ago
      If you don't put significant effort into it, any headless browser from cloud IP ranges will be banned by large parts of the internet. This isn't just about spam bots, you can't even read news articles in many cases. You will have some competition from residential proxies and other custom automation solutions that take care of all of that for their customers.
- erekp 1 month ago
  We have a similar solution at metalsecurity.io :) handling large-scale automation for enterprise use cases, bypassing antibots
  - omneity 1 month ago
    That's super cool, thank you for sharing! It's based on playwright though right? Can you verify if the approach you are using is also subject to the bug in TFA?
    My original point was not necessarily about bypassing anti-bot protections, and rather to offer a different branch of browser automation independent of incumbent solutions such as Puppeteer, Selenium and others, which we believe are not made for this purpose, and has many limitations as TFA mentions, requiring way too many workarounds as your solution illustrates.
    - erekp 1 month ago
      we fix leaks and bugs of automation frameworks, so we don't have that problem. The approach of using the user's browser, like yours, is that you will burn the user's fingerprint depending on scale.
- volemo 1 month ago
  Guess we gotta find a way to crash these bots too. :D