Silkenweb Example: Hackernews Clone

YouTube-dl has an interpreter for a subset of JavaScript in 870 lines of Python

473 points by yuuta 2 years ago | 155 comments

lolinder 2 years ago
To be clear, this is an extremely tiny subset of JS. It looks like they only implemented the features needed to run a very specific function. For example, the only symbol allowed after "new" is "Date", everything else throws an exception.
It's still fun that it's there, but it's not as big a deal as it sounds from the tweet.
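To make the restriction concrete, here is a minimal sketch of an evaluator that accepts `new Date(...)` and rejects every other constructor. It is not youtube-dl's actual code: the names `eval_new` and `UnsupportedJS`, the regex, and the choice to accept only a single ISO 8601 string argument are all assumptions made for illustration.

```python
import re
from datetime import datetime, timezone


class UnsupportedJS(Exception):
    """Raised for any JS construct outside the tiny supported subset."""


# Hypothetical pattern: matches `new Name("arg")` with one string argument.
_NEW_RE = re.compile(r'^new\s+(?P<ctor>[A-Za-z_$][\w$]*)\(\s*"(?P<arg>[^"]*)"\s*\)$')


def eval_new(expr):
    """Evaluate `new Date("<ISO 8601>")` to milliseconds since the Unix epoch."""
    match = _NEW_RE.match(expr.strip())
    if match is None:
        raise UnsupportedJS(f"cannot parse: {expr!r}")
    if match.group("ctor") != "Date":
        # Every constructor other than Date is rejected outright.
        raise UnsupportedJS(f"unsupported constructor: {match.group('ctor')}")
    parsed = datetime.fromisoformat(match.group("arg"))
    if parsed.tzinfo is None:
        # Assumption: treat timestamps without an offset as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.timestamp() * 1000
```

Because the whole expression is matched by one regex, small changes in the input (a second argument, single quotes) already fall outside what this sketch accepts, which is the kind of brittleness described above.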
- krab 2 years ago
  It will only grow - as new scripts will need to be interpreted, new features will be added.
  - lolinder 2 years ago
    I would be horrified if this grew much further. It's perfectly fine for its current scope, but the architecture would not scale at all to a full interpreter without essentially starting from scratch.
    - kelnos 2 years ago
      Yeah, at some point you have to question if it's worth spending time maintaining a quirky, error-prone, ever-growing mini-JS interpreter, or just adding a dependency on v8 or node or something. And then you don't have to worry about supporting new scripts, as they'll just always work.
    - whateveracct 2 years ago
      How much does it need to "scale"? It just has to be fast and correct enough for a CLI to work.
- mid-kid 2 years ago
  Yeah, it's essentially used as a javascript expression solver. You can see the full extent of its capabilities in the testsuite: https://github.com/ytdl-org/youtube-dl/blob/master/test/test...
  The specific site modules in youtube-dl will take care to extract the bare minimum necessary to solve whatever challenge.
- em-bee 2 years ago
  if it's going to need much more than that then it probably would make more sense to port the whole application to javascript instead.
  but then this could be turned into a commandline browser that is able to interpret a whole web-page and save the resulting html structure instead of the source as curl/wget would do.
  - pvillano 2 years ago
    Eventually, YouTube-dl might have to simulate an entire browser and human user to fool Google. Until then, the usefulness of YouTube-dl is that it's less heavy than a full browser.
    I bet someone's already started a YouTube downloader that uses a headless browser
Uptrenda 2 years ago
Anyone who has ever pulled a website from a script knows the pain that is Javascript. Normally you want to just get some text and work out the API actions but a lot of sites use horribly obfuscated Javascript -- either because that's what modern web development is (lolz) -- or because it's part of their 'security.' That means if you want to write browser-based bots properly -- you ought to use a browser. There are special browsers that run 'headlessly' or are designed mostly for bot use. Like https://www.selenium.dev/ which plugs into a few different 'browser engines.'
But now you have another problem. Your simple script goes from being a small, simple, self-contained, and elegant gem, to requiring a full browser, specialized drivers, and/or daemons running just to work. If you're using something like Python you just frankly don't have very good packaging. So it's hard to string together all that into a solution and have it magically work for everyone. What YouTube-dl have done is good engineering. Even though it's not a full JS interpreter: they've kept their software lean, self-contained, and easier to use.
- Scaevolus 2 years ago
  Embedding V8 can work quite well: https://github.com/sqreen/PyMiniRacer
  You probably have to emulate some of the DOM, but you can interact directly with whatever obfuscated/packed scripts in a more lightweight and secure way than driving an entire browser.
- hansvm 2 years ago
  I use pyminiracer to great effect for that sort of scraping.
- eurasiantiger 2 years ago
  Just npm install puppeteer.
  - lolinder 2 years ago
    Puppeteer is cool, but it's exactly what OP is warning against: it's a full browser that is downloaded and run through npm. It's remarkably well packaged, but still far more error prone than a simple HTTP request, and far more likely to break on its own just with the passage of time.
    - eurasiantiger 2 years ago
      Yes, but:
      "Your simple script goes from being small, simple, self-contained, and elegant gem, to requiring a full browser, specialized drivers, and/or daemons running just to work"
      Complex problems cannot be solved by simple scripts, but they can be abstracted away to vendor libraries when/if they are well maintained, such as in this case. While it can break with time, at least someone else fixes it for you.
    - ciupicri 2 years ago
      There's also puppeteer-core which lets you use your own (Google Chrome) browser and if your own browser is broken then you're having bigger problems than youtube-dl not working.
  - ciupicri 2 years ago
    By the way there is also Playwright [1] and it has Python bindings too [2].
    [1]: https://playwright.dev/
    [2]: https://playwright.dev/python/docs/intro
delusional 2 years ago
Can we stop the trend of linking to tweets that just contain another link to the content? What's the point? Wouldn't this be 10x better if it was a link directly to the GitHub repo?
- derangedHorse 2 years ago
  I like the Twitter linking since it's almost like the OP is giving credit to where they found the information from.
  - plaguepilled 2 years ago
    Agreed. If you only know this from someone else's observation, you should link the observation.
    - mkl 2 years ago
      That is against HN guidelines: "Please submit the original source. If a post reports on something found on another site, submit the latter." - https://news.ycombinator.com/newsguidelines.html
- kelnos 2 years ago
  I was thinking the same thing; link to the file on Github, with the same title text as is there now, and it saves me an extra click. And any time I don't have to visit Twitter, I consider that a win.
- caned 2 years ago
  I often share links to HN instead of the referred link. Many times the comments are as interesting as the content. This applies to sharing Twitter or Reddit links, too, albeit with a lower S/N ratio.
  - Firmwarrior 2 years ago
    Is there some trick to actually being able to see information on Twitter? When I click a tweet, I get the tweet, then a random smattering of 2-3 semi-related tweets, and then a login popup that breaks the page
    Do you guys use an extension to process it or something?
    (Same issue with Reddit of course)
    - paulmd 2 years ago
      replace "twitter.com" with "nitter.net", or for video embedding (discord, etc) use vxtwitter.com or fxtwitter.com. Tweetdeck is what a lot of twitter people use for "serious twittering" (lol).
      For reddit use old.reddit.com instead of www.reddit.com. Reddit is Fun is a great native app for android and on iOS there's Apollo.
      Both sites are laser-focused on driving conversions and engagement which means forcing you into an account and native apps (specifically their shitty native apps), and undoubtedly they'll start breaking the workarounds and third-party clients for realsies at some point.
      But I mean, if users don't even have an account and native app install, how can they possibly get you doomscrolling all day? It's 2022, it's all about the engagement metrics, fuck user experience.
sylware 2 years ago
Nowadays "javascript" refers to the scriptable, grotesquely and absurdly complex and massive web engines, aka the Google-financed Blink and Gecko, then the Apple-financed WebKit, along with their SDKs.
The currently obfuscated javascript media players will try to break yt-dlp by leveraging the complexity and size of those scripted web engines. They will make them out of reach to small teams or individuals and it is even "better", it will force ppl to use apple or google web engine, killing any attempt to provide a real alternative.
A standalone javascript interpreter is actually some work, but seems to stay in the "reasonable" realm: look at quickjs from M. Bellard and friends (the guy who created qemu, ffmpeg, tinycc, etc): plain and simple C (no need of a c++ compiler), doing the job more than well enough.
That's why noscript/basic (x)html is so much important.
- dtx1 2 years ago
  > but seems to stay in the "reasonable" realm
  > M. Bellard and friends
  Choose one, that dude is a wizard wielding C like a brain surgeon wields a scalpel.
- olliej 2 years ago
  Yeah I agree with almost all of this - the massive size and complexity of commercial engines makes it seem like JS the language must also be complex.
  I also agree with the idea that these sites will probably be able to/want to create JS that breaks these small/lightweight engines requiring constant work :-/
  This final point I disagree with entirely. You can't point to Bellard doing something as evidence that it's reasonable. This is a guy that wrote a program that generated a TV signal via a VGA card. :D
- axiolite 2 years ago
  > quickjs from M. Bellard and friends
  Is the M key next to the F key on your particular keyboard by chance? Because I've always called him "Fabrice."
  https://en.wikipedia.org/wiki/Fabrice_Bellard
  - a_e_k 2 years ago
    Could just be the usual abbreviation for Monsieur.
  - ganjatech 2 years ago
    Monsieur Bellard - M. Bellard
    - sylware 2 years ago
      Yeah... M = Monsieur (in French, namely Mister in English), I forgot the 'r'... I should have written Mr. Bellard, I kneel upon the weight of my apology.
- randyrand 2 years ago
  Chrome and Safari both have open source JS engines…
  - userbinator 2 years ago
    That's beside the point. Open-source is not useful to the smaller players if it is too complex to comprehend and constantly churned.
    - kelnos 2 years ago
      That's not the case, though. There are even python modules that let you evaluate JS code in v8 (Chrome's JS interpreter). It'd be pretty trivial for youtube-dl to make use of that if the author thought it was worth doing.
- oblak 2 years ago
  ah, but quickjs is an actual js engine. I have tried a couple of versions with real progress between them. This thing here is not
- languageserver 2 years ago
  > That's why noscript/basic (x)html is so much important.
  xhtml has been dead for a decade
esprehn 2 years ago
This isn't really JS, it's a purpose built evaluator that's only for evaluating a particular script on YouTube, assuming a huge list of things are true about how YouTube JS is written.
Ex. It's got a hard coded list of methods for String, and it doesn't respect prototypes. It only supports creating Date instances, and won't work if you override the global Date. It parses with regexes and implements all operators with Python's operator module (which is the wrong type semantics) etc. Nearly none of the semantics of JS are implemented.
It's sort of the sandwich categorization problem:
If I write a C# "interpreter" in Perl that's only 200 lines and just handles string.Join, string.Concat and Console.WriteLine, and it doesn't actually try to implement C# syntax or semantics at all and just uses Perl semantics for those operations, is it actually C#? :P
I say "not a sandwich".
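The "wrong type semantics" point can be illustrated with a small sketch: in JavaScript, `"1" + 2` evaluates to `"12"`, whereas Python's `operator.add` raises `TypeError` for the same operands. The helper `js_add` below is hypothetical and only approximates the string/number case; it is not the full ECMAScript `+` algorithm and is not taken from youtube-dl.

```python
import operator


def _to_js_string(value):
    """Render ints and integral floats the way JS prints numbers (1.0 -> "1")."""
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value)


def js_add(left, right):
    """Approximate JS `+` for str/int/float operands only (illustrative)."""
    if isinstance(left, str) or isinstance(right, str):
        # JS concatenates when either side is a string: "1" + 2 === "12".
        return _to_js_string(left) + _to_js_string(right)
    return left + right


def python_add_fails(left, right):
    """Return True when Python's operator.add rejects the operands."""
    try:
        operator.add(left, right)
    except TypeError:
        return True
    return False
```

An evaluator that maps JS operators straight onto the `operator` module behaves like `python_add_fails`, not like `js_add`, whenever a script relies on implicit coercion.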
- jraph 2 years ago
  And as a user of youtube-dl, I'm quite happy about this. This probably allows a very safe, restricted "subset" of JS. Way better than using a full JS engine. 900 lines is still small and manageable.
  - mjevans 2 years ago
    yt-dlp sometimes doesn't know how to evaluate the javascript / ECMAScript and will call out to an optional dependency, a real javascript interpreter, if installed.
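    A minimal sketch of that fallback pattern, assuming a built-in evaluator that signals failure with an exception and an external interpreter discovered on `PATH`. This is not yt-dlp's code: `evaluate`, `BuiltinInterpreterError`, the callables, and the default binary name are all hypothetical.

```python
import shutil


class BuiltinInterpreterError(Exception):
    """Hypothetical error raised when the built-in mini interpreter gives up."""


def evaluate(code, builtin_eval, external_eval,
             external_binary="phantomjs", which=shutil.which):
    """Try the built-in evaluator first, then an optional external interpreter.

    builtin_eval and external_eval are caller-supplied callables taking the
    JS source; `which` is injectable so the lookup can be faked in tests.
    """
    try:
        return builtin_eval(code)
    except BuiltinInterpreterError:
        if which(external_binary) is None:
            # The optional dependency is not installed: surface the original error.
            raise
        return external_eval(code)
```

    Keeping the external interpreter optional preserves the "works out of the box" property while still giving users an escape hatch when the built-in evaluator cannot handle a new script.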
  - sebzim4500 2 years ago
    I'm trying to get the threat model here. Is the concern that Youtube will inject JS into the payload which tries to break out of the youtube-dl js sandbox using some zero day in whatever js engine they would use instead?
    - pabs3 2 years ago
      One of the reasons people use yt-dlp/youtube-dl (and nitter.net/etc) is to transform the modern proprietary JavaScript web into something more suitable for enthusiasts of the old document web and of FOSS. If the web switched to plain <video> then yt-dlp/youtube-dl would become completely unnecessary. Your browser should not have to run JS to watch an embedded video.
    - rwmj 2 years ago
      Google attempting zero days on client computers would be something. It's not totally without precedent (Sony CD rootkits - https://en.wikipedia.org/wiki/Sony_BMG_copy_protection_rootk...) but would still be major news.
    - jraph 2 years ago
      Let's say they end up using Node. Node has a quite complete standard library that lets you access files and everything.
      Now if they do it right and only embed some bare JS interpreter, it's still way harder to audit than these < 900 lines, for which it is quite easy to convince oneself that the interpreted script cannot do much.
    - kevingadd 2 years ago
      Embedding a whole js engine and then interopping with it from python would be non trivial. Good luck fixing any bugs or corner cases you hit that way. The V8 and spidermonkey embedding apis are both c++ (iirc) and non trivial to use correctly.
      Having full control like this +simple code is probably lower risk and more maintainable, even if there's the challenge of expanding feature set if scripts change.
      The alternative would be a console js shell, but those are very different from browsers so that poses its own challenges.
    - loeg 2 years ago
      youtube-dl targets a lot of websites other than Google properties, many of which are a lot sketchier (think, uh, NSFW streaming sites).
  - jiggawatts 2 years ago
    That’s the exact same logic I hear from developers who say things like:
    Why do I need a full XML parser when I can just extract what I need with regex?
    And:
    All that RPC IDL stuff is overcomplicated, REST is so much easier because I can just write the client by hand.
- dang 2 years ago
  Ok, we've changed this title to shrink the scope of the interpreter.
  Submitted title was "YouTube-dl has a JavaScript interpreter written in 870 lines of Python".
  - ec109685 2 years ago
    Hence why HN is better than Twitter.
    The amount of high engagement just plain wrong tweets there are is just sad.
- tra3 2 years ago
  It quacks like a duck at midnight, but it’s actually a frog?
- blast 2 years ago
  I suppose this means it would be easy for YouTube to fuck with youtube-dl simply by throwing in more features of JS?
  - joshenders 2 years ago
    Cat, meet mouse.
    - nyanpasu64 2 years ago
      It's unfortunate, https://github.com/mpv-player/mpv/issues/8655#issuecomment-1...:
      > Youtube now throttles requests of more than 10MB at a time, yt-dlp works around it by making many requests of 10MB using Range HTTP headers (yt-dlp calls it the http-chunk-size), but ffmpeg which does the downloading for mpv doesn't support that yet.
      I want to change mpv or yt-dlp to support range-based video URLs (eg. appending &range=333999644-335298975&rn=5&rbuf=0 to URLs) which speed up stream seeking and probably eliminate throttling altogether, but I haven't taken the time to look into how to achieve it. For anyone interested, I have an open bug report at https://github.com/mpv-player/mpv/issues/10601, and have found https://satadalsengupta.github.io/docs/papers/2017_nossdav_y... describing these parameters.
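      The chunked-request workaround in the quote can be sketched as a generator of `Range` header values. The function name `chunk_ranges` is hypothetical, and treating "10MB" as 10 * 1024 * 1024 bytes is an assumption; the quote does not say which definition yt-dlp uses.

```python
def chunk_ranges(total_size, chunk_size=10 * 1024 * 1024):
    """Yield HTTP Range header values that together cover total_size bytes."""
    if total_size < 0 or chunk_size <= 0:
        raise ValueError("total_size must be >= 0 and chunk_size must be > 0")
    start = 0
    while start < total_size:
        # Range ends are inclusive, so the last byte of a chunk is start + size - 1.
        end = min(start + chunk_size, total_size) - 1
        yield f"bytes={start}-{end}"
        start = end + 1
```

      Each yielded value would be sent as the `Range` header of a separate request, so no single request asks for more than one chunk.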
- Test0129 2 years ago
  This really isn't fair. Just because it doesn't faithfully implement whatever standard Javascript is on doesn't mean it isn't an interpreter. All an interpreter is is something that executes a script directly rather than requiring compilation. It is a de facto interpreter for a subset of javascript. Nothing more, nothing less. The title could be more clear, however.
  - blast 2 years ago
    esprehn didn't say it isn't an interpreter. They're saying it is an interpreter and what it's interpreting isn't (all of) JS. That's also what you're saying, so you're agreeing with esprehn.
    Edit: You misunderstood baobabKoodaa in the same way. Nobody is arguing about what constitutes an interpreter, except you. The question is what language is being interpreted.
    Before accusing someone of pedantry, it would first be good not to completely misread them.
  - baobabKoodaa 2 years ago
    There's a huge difference between an interpreter for "JavaScript" and an interpreter for a "subset of JavaScript".
    - Test0129 2 years ago
      Making a pedantic argument on what constitutes an interpreter is silly. The title is bad. It is an interpreter. I'll continue to eat downvotes on this because of the pedantry of HN.
haunter 2 years ago
The same in yt-dlp https://github.com/yt-dlp/yt-dlp/blob/master/yt_dlp/jsinterp...
Interesting to see the diffcheck between the two https://www.diffchecker.com/8EJGN27K
- cheschire 2 years ago
  Is yt-dlp's implementation being better the reason why I have fewer throttling issues than with youtube-dl?
  - LeoPanthera 2 years ago
    Maybe this isn't true anymore, but for a while they would hit different APIs. yt-dlp was using the Android YouTube API because it had no throttling.
kristopolous 2 years ago
To understand why, I have a far simpler tool that focuses on a subset of sites (adult content video aggregators)
https://github.com/kristopolous/tube-get
It too deals with this problem but does so in a way that'd be easy to maliciously sabotage
Look right about here https://github.com/kristopolous/tube-get/blob/master/tube-ge...
As to why this program exists, this was originally written between about 2010-2015 or so, so it technically predates the yt-* ecosystem.
The tool still works fine and it's not a strict subset of yt-dlp or YouTube-dl because it takes a different approach. Although its overall site coverage is smaller, I've had it be a "second try" system when yt-* fails and it comes up with success maybe about half the time
- pabs3 2 years ago
  Would you mind switching to subprocess with shell=False? os.popen is obsolete and insecure because it passes the command through the shell.
  PS: I found it quite easy to contribute to yt-dlp and the reviewers are ultra-helpful and kind, you might want to migrate all of your extractors there.
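  For reference, a minimal sketch of the suggested pattern: pass the command as a list to `subprocess.run` with `shell=False`, so that no shell ever parses the arguments. The wrapper name `run_command` and the demo input are hypothetical; this is not code from tube-get.

```python
import subprocess
import sys


def run_command(args):
    """Run a command given as a list of arguments, without involving a shell."""
    result = subprocess.run(
        args,
        shell=False,          # arguments go to the program as-is, unparsed
        capture_output=True,  # collect stdout/stderr instead of inheriting them
        text=True,            # decode output to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


if __name__ == "__main__":
    # A value that would be dangerous if interpolated into a shell command line.
    hostile = "clip.mp4; echo injected"
    # The child simply echoes its first argument back; the ';' is never interpreted.
    print(run_command([sys.executable, "-c", "import sys; print(sys.argv[1])", hostile]))
```

  With a list of arguments, a URL or filename containing `;`, `$()` or quotes reaches the child process as one literal argument.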
  - kristopolous 2 years ago
    1. It's ancient code but sure
    2. They're fundamentally not compatible approaches. This is worthless to them
aeyes 2 years ago
They just don't want to use any external dependencies... There is also an AES implementation: https://github.com/ytdl-org/youtube-dl/blob/master/youtube_d...
M30 2 years ago
How should a programming noob interpret this? Be impressed at what was achieved here? Be concerned about security implications using the tool? Something else entirely?
- rkangel 2 years ago
  This is the compiler writer equivalent of parsing HTML with regex:
  It is technically wrong - it isn't a sufficiently rich and powerful approach to handle all JS (HTML) that you might throw at it. It'll work for a while until it eventually barfs when you least expect it.
  EXCEPT that if the inputs you are giving it come from some understood source(s) that aren't likely to change, then a simpler approach than the "all singing all dancing" correct one may be appropriate and justified. E.g. because it might be easier to write, easier to maintain and/or less attack surface etc.
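  The HTML-with-regex analogy can be made concrete with a small sketch that extracts link targets two ways. Both helper names and the sample regex are made up for illustration; the regex only matches the exact `<a href="...">` shape it was written for.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags using the standard-library parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def links_with_parser(html):
    """Robust variant: tolerates extra attributes and either quote style."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def links_with_regex(html):
    """Simple variant: fine for one known input shape, silent on anything else."""
    return re.findall(r'<a href="([^"]*)">', html)
```

  On `<a href="/a">x</a><a class="c" href='/b'>y</a>` the parser finds both links while the regex finds only the first, which is acceptable only as long as the source keeps emitting the first shape.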
  - pwdisswordfish9 2 years ago
    > some understood source(s) that aren't likely to change
    Does that apply to YouTube? Or any of the other hundreds of supported sites?
    - rkangel 2 years ago
      Presumably because it gets tested with those sites and the JS doesn't change that much it can be fixed or adjusted as required.
- lolinder 2 years ago
  It's an extremely tiny subset of JS—as an example, the only object that can be instantiated is Date. Anything other than "Date" after "new" throws an exception.
  It's definitely neat, but not especially useful outside of the confines of its current application, and the security concerns of such a tiny subset will be minimal.
  - petters 2 years ago
    > Anything other than "Date" after "new" throws an exception
    It's even very sensitive to white space.
- smcl 2 years ago
  All of the above, really.
- chlorion 2 years ago
  The "interpreter" in the youtube-dl source is probably safe from a security standpoint.
  yt-dlp seems to support running javascript in a full javascript interpreter/headless browser called phantomjs though. Running javascript in a full interpreter like this is a lot more scary from a security standpoint. I am not sure whether phantomjs sandboxes the javascript evaluation from the rest of the system, and if it does, whether the sandbox actually works properly at all. It looks like the project is not being maintained which is another bad sign.
  Big projects with lots of manpower behind them such as chromium have trouble keeping javascript evaluation safe, so I would really suggest not trusting phantomjs on untrusted input.
- bjt2n3904 2 years ago
  The goal of youtube-dl is to download a video off of YouTube for offline storage.
  This isn't something YouTube particularly enjoys. They would rather you keep coming back -- every visit is more ad revenue for them. If you have an offline copy, you don't need to visit YouTube anymore.
  YouTube has an incentive, therefore, to make it more difficult to download (or "scrape") their content.
  I'm not particularly sure of the specific details, but apparently YouTube has added JavaScript (a programming language that executes in the browser) as a hurdle to jump over. A simple python script doesn't have enough brains to execute JavaScript, only enough to realize that it exists. (Clearly, youtube-dl is sophisticated enough to have jumped over it.)
  These are the conclusions I come to, having written software for about a decade.
  1) Once you give information to someone, be it text, pictures, sound, or video -- they will do whatever they want with it, and you have no control. Oh, yes -- it may be illegal. Maybe unethical. But the fact of the matter is you do not have control over information once it leaves your hands.
  2) Adding hurdles to make it harder to access the information does little to stop someone who is dedicated to accessing it.
  3) Implementing a subset of JavaScript in such an elegant and tiny manner is quite impressive.
  How you interpret these facts depends on your worldviews. If you are a media and content creator, you will view these facts differently than a politician, and a teenager.
  As an engineer and amateur philosopher, I certainly support the rights of content creators to be paid for their work. And yet, I fear that more and more, content creators want to lease me a right to listen to their music, instead of own a copy of it.
  I used to own CDs, DVDs, movies, and books. What happens if Amazon or YouTube decides to not serve me anymore? Anything I've "purchased" from them, I lose access to.
  Furthermore, if I create a song, I used to be able to burn copies of CDs and distribute it on the street corners. Now, you have to sign up to stream on Spotify. This is a double edged sword -- I get a wide audience, but Spotify will do whatever they want with me.
  This troubles me.
- Test0129 2 years ago
  > How should a programming noob interpret this?
  Usually in a virtual machine.
- tenebrisalietum 2 years ago
  > How should a programming noob interpret this?
  The browser is client-facing and everything there is possible to reverse engineer and figure out. So if you design a web-based application, and are depending on client-side Javascript for any security or distribution enforcement, it can be helpful, but can ultimately be unwound and cracked even if obfuscated, etc.
  > Be impressed at what was achieved here?
  Yes. Try to download a YouTube video without it or an online service which is probably using it internally.
  - Supermancho 2 years ago
    Youtube-dl is impressive. This particular hack is not.
    - pwdisswordfish9 2 years ago
      youtube-dl as a whole is not particularly impressive either. It’s a big pile of unresolved technical debt, of hacks-upon-hacks and quick-and-dirty temporary solutions just like this one staying there for years.
- Tao3300 2 years ago
  In the face of weird shit like this, I give you the permission to go with your gut.
lewisl9029 2 years ago
Another really cool JS dialect I recently learned about is njs from the nginx team: https://github.com/nginx/njs
This video goes into some of the design and tradeoffs: https://www.youtube.com/watch?v=Jc_L6UffFOs
TL;DW: they optimized for fast creation/destruction of low-footprint VMs with no JIT or garbage collection.
homarp 2 years ago
the tests for it: https://github.com/ytdl-org/youtube-dl/blob/master/test/test...
olliej 2 years ago
This is super cool.
Some of the stuff is kind of questionable to me in the sense that I could believe you could probably make some kind of sufficiently wonky JS that this would do the "wrong" thing.
But it's super cool that they are able to do this as I think it shows that claims of JS complexity based on the size of JS engines are overlooking just how much of that size/complexity comes from the "make it fast" drive vs. what the language requires. Here you have a <1000LoC implementation of the core of the JS language, removed from things like regex engines, GCs, etc.
Mad props to them for even attempting it as well - it simply would not have ever occurred to me to say "let's just write a small JS engine" and I would have spent stupid amounts of time attempting to use JSC* from python instead.
[* JSC appears to be the only JS engine with a pure C API, and the API and ABI are stable so on iOS/macOS at least you can just use the system one which reduces binary size+build annoyance. The downside is that C is terrible, and C++ (differently terrible? :D) APIs make for much more pleasant interfaces to the VM - constructors+destructors mean that you get automatic lifetime management so handles to objects aren't miserable, you can have templates that allow your API to provide handles that have real type information. JSC only has JSValueRef and JSObjectRef, and as a JSObjectRef is a JSValueRef it's actually just a typedef to const JSValueRef :D OTOH I do think JSC's partially conservative GC for stack/temporary variables is superior to Handles for the most part, but it's also absolutely necessary to have an API that isn't absolutely wretched. The real problem with JSC's API is that it has not got any love for many many many .... many years so it doesn't have any way to handle or interact with many modern features without some kludgy wrappers where you push your API objects into JS and have the JS code wrap them up. The API objects are also super slow, as they basically get treated as "oh ffs" objects that obey no rules. I really do wish it would get updated to something more pleasant and really usable.]
- esprehn 2 years ago
  This doesn't actually implement any of the JS language though, it just reuses all of python's semantics and hard coded a tiny list of ex. String methods
  I also assume you mean mainstream JS engine, but Duktape, JerryScript and QuickJS are all C APIs.
  They probably could have used ex. https://github.com/PetterS/quickjs instead of the hacks in the OP linked file.
  - olliej 2 years ago
    Ah, I only briefly scanned the implementation, and it looked like it was doing actual work - is it mostly string replacing to get approximate python equivalent syntax? Regardless that's disappointing.
    You are correct though that I was only thinking of the big engines - bias on my part alas.
    For your suggested alternate engines, JerryScript and QuickJS seem more complete than Duktape but I can't quite work out the GC strategy of JerryScript. Bellard says QuickJS has a cycle detector but I'm generally dubious of them based on prior experience.
    If I was shipping software that had to actually include a JS engine, if perf was not an issue I would probably use JerryScript or QuickJS as binary size I think would be a more critical component.
jraph 2 years ago
I do wonder why YouTube does not try harder to make it difficult to do this computation meant to prove you are a legit YouTube web client. Providing an easy-to-find, simple JS function interpretable with 900 lines of Python is like they don't try at all. They might as well do nothing.
Or is their goal just to make youtube-dl not 100% reliable? Or to be able to say "look, you are running our code in a way we did not intend, you can't do this because you are breaking the EULA"?
- zuminator 2 years ago
  I'd guess that their efforts to make it harder are limited by the fact that they want YouTube to be able to play on thousands of different low powered set top boxes and cheap phones. So whatever obfuscated code they use has to be simple enough to be run and periodically updated by all these different devices, and that same simplicity makes it emulable.
- Arnavion 2 years ago
  They do make it harder from time to time. In fact yt-dlp's interpreter has been broken for a month or so now and the devs finally gave up and told users to just install PhantomJS (which itself hasn't been updated since 2016 and probably has bugs / vulns of its own, but whatever).
  https://github.com/yt-dlp/yt-dlp/issues/4635#issuecomment-12...
  - whywhywhywhy 2 years ago
    I mean if this is the direction it’s heading it makes more sense to port yt-dlp to node. It’s already dependent on a scripting language, it may as well be the one YouTube speaks.
- Cthulhu_ 2 years ago
  I'm guessing the amount of people using it is low enough to not bother with mitigation. Then again, there's a LOT of YT videos that take clips from other videos (which in most cases falls under fair use), which I can imagine would use this tool.
mdaniel 2 years ago
I was expecting this to be about Duktape <https://github.com/svaarala/duktape>, but heh, for sure no. I'd bet $1 there's no way youtube-dl would switch, but I wonder if yt-dlp would?
rcarmo 2 years ago
Awesome. Even if it's likely incomplete, it might come in really handy for some scraping I need to do...
Too 2 years ago
They must have been inspired by this PyCon presentation, where David Beazley live codes a fully working webassembly interpreter, in under one hour. https://youtu.be/VUT386_GKI8
atan2 2 years ago
This seems to be a pretty small subset of JavaScript, but I personally love small projects like this for educational purposes. Removing the noise and keeping things minimal helps my brain reason about things.
Earlier this year I enrolled in an online class called "Building a Programming Language" taught by Roberto Ierusalimschy (creator of Lua) and Gustavo Pezzi (creator of pikuma.com). We created a toy language interpreter/VM and the final code was around 1,800 lines of Lua code. Keeping things as simple (and sometimes naive) as possible was definitely the right choice for me to really wrap my head around the basic theory and connect the dots.
Thanks for the link.
Tao3300 2 years ago
Greenspun's Tenth Rule:
> Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp. [1]
And here we have a complicated Python program with a partial JS implementation in it.
[1] https://en.wikipedia.org/wiki/Greenspun's_tenth_rule
anony23 2 years ago
What purpose does it serve?
- rany_ 2 years ago
  They need to run a JavaScript function to download YouTube videos at normal speeds.
  Edit: it's also required to download music, otherwise it will just fail
  Source:
  - https://github.com/ytdl-org/youtube-dl/issues/29326#issuecom...
  - https://github.com/ytdl-org/youtube-dl/blob/d619dd712f63aab1...
  - https://github.com/ytdl-org/youtube-dl/commit/cf001636600430...
  - ajkjk 2 years ago
    Wow:
```
   Overview of the control flow (already known):
   The Youtube API provides you with n - your video access token
   If their new changes apply to your client (they do for "web") then it is expected your client will modify n based on internal logic. This logic is inside player...base.js
   n is modified by a cryptic function
   Modified n is sent back to server as proof that we're an official client. If you send n unmodified, the server will eventually throttle you.
```
    So they can always change the function to keep you on your toes, hence you need to be able to run semi-arbitrary JS in order to keep using the API.
    Waste of human brainpower but I guess that energy is better spent imagining a world where Google isn't in charge instead of kvetching about what they're doing with their influence.
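    A minimal sketch of the client side of that flow, assuming the token travels as an `n` query parameter on the media URL. `transform_n` is a made-up stand-in (it just reverses the string); the real function lives in the obfuscated player JS and is not reproduced here. The URL is hypothetical too.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def transform_n(n: str) -> str:
    # Hypothetical placeholder for the "cryptic function" in the player JS.
    # The real logic is obfuscated and changes over time.
    return n[::-1]


def rewrite_media_url(url: str, transform=transform_n) -> str:
    """Return `url` with its `n` query parameter replaced by transform(n)."""
    parts = urlsplit(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    if "n" in query:
        # parse_qs maps each key to a list of values; only the first is used.
        query["n"] = [transform(query["n"][0])]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


if __name__ == "__main__":
    # Hypothetical URL, for illustration only.
    print(rewrite_media_url("https://media.example/videoplayback?id=abc&n=token123"))
```

    The point of passing `transform` as an argument is that the URL plumbing stays fixed while the function itself is whatever was most recently extracted from the player script.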
    - isatty 2 years ago
      There is a reason Google is able to serve the amount of video bandwidth, and also a reason why there are no worthwhile youtube clones. Some amount of scrape protection is absolutely essential.
- elaus 2 years ago
  I'd have to read up on the specifics as well, but I think basically Youtube uses a lot of obfuscated, rapidly and automatically changing Javascript code to fetch the video data. A project like youtube-dl has to run this code to be able to download videos, because that's what's happening in the browser as well.
  - temp_account_32 2 years ago
    For those interested further, at several points over the past few weeks youtube-dl stopped working for multiple hours at a time, and it was precisely related to this code.
    We have a custom-made Discord music bot on our server which uses ytdl to stream songs so we can listen together, and at one point we were listening and suddenly got some obscure JavaScript error.
    We began joking that there's some bug in the code which breaks it after 6PM, but later found out that Google had changed some of the obfuscated JS and this basically broke this part of the code, which prevented us from fetching the song information.
  - londons_explore 2 years ago
    If you start a youtube video and then pause it and resume a few days later, you'll notice that the youtube page plays for ~30 seconds (i.e. what's buffered) and then the page refreshes. I'd guess this refresh is to pick up the new javascript and any updates to the HTML code.
    It's kinda annoying if you have a lot of youtube tabs open for a long time and come back to them.
  - bitexploder 2 years ago
    What is interesting is it seems to be constant cat and mouse. I download a YT vid. It crawls. Update yt-dlp, it flies again. I love yt-dlp and use it a lot.
  - lupire 2 years ago
    But why not just use a normal JS engine called from Python?
- hadrien01 2 years ago
  It's used in the YouTube extractor: https://github.com/ytdl-org/youtube-dl/blob/d619dd712f63aab1...
  I believe YouTube limits your bitrate if you don't pass a specific calculated value; it's possible youtube-dl has to parse and eval JS to get it.
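  To give a feel for what "parse and eval JS" can mean at the small end, here is a toy sketch of a recursive-descent evaluator for a tiny JS-like expression subset (integers, names, `+ - * %`, parentheses). It is not youtube-dl's interpreter and all names in it are invented for illustration.

```python
import re

# One token per match: an integer, an identifier, or any single character.
TOKEN = re.compile(r"\s*(?:(\d+)|([A-Za-z_$][\w$]*)|(.))")


def tokenize(src):
    """Split source into (kind, value) pairs; whitespace is dropped."""
    tokens = []
    for num, name, op in TOKEN.findall(src):
        if num:
            tokens.append(("num", int(num)))
        elif name:
            tokens.append(("name", name))
        elif op.strip():
            tokens.append(("op", op))
    return tokens


def js_mod(a, b):
    # JS `%` keeps the sign of the dividend; Python's `%` follows the divisor.
    # Unlike JS (which yields NaN), b == 0 raises ZeroDivisionError here.
    r = abs(a) % abs(b)
    return -r if a < 0 else r


class Parser:
    def __init__(self, tokens, env):
        self.tokens, self.pos, self.env = tokens, 0, env

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else ("end", None)

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):  # expr := term (('+' | '-') term)*
        value = self.term()
        while self.peek() in (("op", "+"), ("op", "-")):
            op = self.next()[1]
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):  # term := atom (('*' | '%') atom)*
        value = self.atom()
        while self.peek() in (("op", "*"), ("op", "%")):
            op = self.next()[1]
            rhs = self.atom()
            value = value * rhs if op == "*" else js_mod(value, rhs)
        return value

    def atom(self):  # atom := number | name | '(' expr ')'
        kind, val = self.next()
        if kind == "num":
            return val
        if kind == "name":
            return self.env[val]  # unknown names raise KeyError
        if (kind, val) == ("op", "("):
            value = self.expr()
            if self.next() != ("op", ")"):
                raise SyntaxError("expected ')'")
            return value
        raise SyntaxError(f"unsupported token: {val!r}")


def evaluate(src, env=None):
    """Evaluate one expression with variables looked up in `env`."""
    parser = Parser(tokenize(src), env or {})
    value = parser.expr()
    if parser.peek()[0] != "end":
        raise SyntaxError("unexpected trailing input")
    return value


if __name__ == "__main__":
    print(evaluate("(a + 2) * 3 - b % 4", {"a": 5, "b": -7}))
```

  Anything outside the grammar raises instead of being guessed at, which is the same "throw on everything unsupported" trade-off discussed upthread.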
  - RicoElectrico 2 years ago
    > I believe YouTube limits your bitrate if you don't pass a specific calculated value
    It's starting to become Widevine bullshit all over again.
    - kevin_thibedeau 2 years ago
      It's their platform. They can do with it what they want.
- oynqr 2 years ago
  You need to run some obscured JS to get decent download speeds from Youtube. Something along the lines of PoW.
  - db48x 2 years ago
    It’s not like proof of work at all. It’s just a challenge and response; youtube includes a random number in the webpage for each video, and expects to see a request parameter with a particular value calculated from that random number when you request the video. If you don’t do the arithmetic it throttles you to 50kb/s.
    Since the calculation of the response is done in JS, and they occasionally change the formula, some download programs are moving towards running the JS rather than trying to keep up with the changes.
    It’s really just bullshit to make people’s lives harder.
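    For illustration, the server side of such a challenge and response could look roughly like the sketch below. The formula, the function names, and reading "50kb/s" as 50 KiB per second are all assumptions made for the example; nothing here is YouTube's actual logic.

```python
import secrets
from typing import Optional

# "50kb/s" from the comment above, read here as 50 KiB/s (assumption).
THROTTLED_BYTES_PER_SEC = 50 * 1024


def expected_response(challenge: int) -> int:
    # Hypothetical formula. The real one is obfuscated and changes occasionally.
    return (challenge * 31 + 7) % 1_000_003


def issue_challenge() -> int:
    """Pick the random number that would be embedded in the video page."""
    return secrets.randbelow(1_000_000)


def allowed_rate(challenge: int, response: Optional[int]) -> Optional[int]:
    """Return a byte-rate cap, or None when the client answered correctly."""
    if response == expected_response(challenge):
        return None  # unthrottled
    return THROTTLED_BYTES_PER_SEC


if __name__ == "__main__":
    challenge = issue_challenge()
    print(allowed_rate(challenge, expected_response(challenge)))  # correct answer
    print(allowed_rate(challenge, None))  # no answer supplied
```

    Note that nothing in the check costs the client meaningful compute, which is why it differs from proof of work: the only "secret" is the formula itself.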
    - xg15 2 years ago
      Next step will probably be moving the calculation to webassembly or requiring the script to fetch the result via websocket or webrtc...
    - mistrial9 2 years ago
      .. pirate determination is a thing to behold, as is crazed-repetitive digital grabs.. It's not a fair or accurate characterization to dismiss it as "making people's lives harder" .. it is remarkable that the Debian distros now include ytdl; let's do what is reasonable to make it continue
    - dannyw 2 years ago
      YouTube PM: We need to stop youtube-dl.
      Engineers: make half arsed attempt.
- throwaway0984 2 years ago
  IIRC it's used to extract/generate the signatures needed for YouTube media URLs
tonetheman 2 years ago
If this got much bigger I would switch it to quickjs