HTML Optional Tags

62 points by softskunk 1 year ago | 34 comments
  • danbruc 1 year ago
    Would there have been a way to avoid this mess?

      - a browser must reject any invalid HTML in order to force the developers to fix their HTML
      - a browser must try hard to make sense of messed up HTML, otherwise users will switch to a competing browser that renders the mess for them
    
    Theoretically all browser vendors could coordinate so that everyone rejects invalid HTML, but there is probably no good way to avoid defectors. Why did this not happen for other technologies? My first thought was that there is no compilation step which allows forcing the developer to fix things without giving the end user any power through their choice of browser. But that seems not quite right, why do Bash or Python or your C++ compiler not make a best guess what your code is supposed to do? Because there is or was only one dominant implementation and therefore no competition? Because document markup is much more robust against small errors and probably remains readable while your code likely just crashes? That is probably one of the most important ones, I think. What role did browser specific features, evolving standards and incomplete implementations play?

    What is the end result? Nothing for the end user, they do not care whether the browser has to deal with nice HTML or a mess. Developers writing HTML get to be more sloppy at the price of a lot of additional complexity and pain wherever code has to deal with HTML. This might actually have some negative impact on end users because of bugs or security issues stemming from the additional complexity. Maybe it made HTML somewhat more accessible to the casual user as they could get away with some mistakes. But was this worth it, could better tooling not have achieved the same with good error messages helping to fix errors?

    • solardev 1 year ago
      Historically this was tried as XHTML Strict. It never caught on much and was soon forgotten about. https://en.wikipedia.org/wiki/XHTML

      At the end of the day, HTML's flexibility as a markup language is what made it popular and usable by anyone, and ambiguity is the price we pay for it.

      These days the DOM semantics are even less important, as everything is done in JS anyway for all but the simplest documents.

      • danbruc 1 year ago
        HTML was essentially always rigorously defined through an SGML DTD. XHTML also conforms to a DTD, and additionally conforms to XML, which is itself a restricted subset of SGML. SGML offers huge flexibility, for example with regard to implicitly closing tags or redefining all kinds of aspects; you could easily replace the angle brackets with whatever you want and have your HTML document look like

          (START div)Hello (START b)world\b/!\div/
        
        why ever you would want this.

        But despite all the flexibility offered by SGML, and probably because nobody implemented HTML parsing by using an SGML parser, we ended up with malformed HTML documents whose interpretation was essentially defined by whatever the ad hoc HTML parser implementations in browsers did.

        HTML5 finally abandoned the idea of basing HTML on SGML and a DTD and instead essentially formalized the status quo of HTML parsing and put it into the specification. At least this is my understanding as a non-web developer who gets to work with HTML only occasionally.

        • solardev 1 year ago
          The informality of HTML wasn't solely a matter of markup strictness though. There were too many differences in CSS rendering (ACID tests), Javascript runtimes, SVG and PNG rendering, scrollbar and border and frame and iframe displays, etc. With so many variables, it couldn't NOT be a browser dependent implementation.

          The W3C was too strict where it didn't really matter (the basic markup) and too loose where it did matter (everything else).

      • lifthrasiir 1 year ago
        It is not really the mess---see the next chapter, 13.2 Parsing HTML documents, to see the actual mess. In fact, the HTML specification defines two concrete syntaxes for HTML, where the first one is for `text/html` and the other is for `application/xhtml+xml`. The latter has never been deprecated (though the name XHTML was abolished). Moreover the spec states that:

        > Some authors find it helpful to be in the practice of always quoting all attributes and always including all optional tags, preferring the consistency derived from such custom over the minor benefits of terseness afforded by making use of the flexibility of the HTML syntax. To aid such authors, conformance checkers can provide modes of operation wherein such conventions are enforced.

        In other words, it recognizes the benefit of explicit tags, but also recognizes the benefit of optional tags. So they are equally conforming.

        • cush 1 year ago
          > Because document markup is much more robust against small errors and probably remains readable while your code likely just crashes?

          Exactly. The parser is designed to parse documents, not code. The document has a structure (like sections, paragraphs, tables, etc). When the structure doesn’t quite make sense, the parser still displays the content (the blog, story, words, etc).

          > But was this worth it, could better tooling not have achieved the same with good error messages helping to fix errors?

          You need to think about compatibility, especially backwards compatibility. If the standard were so strict that any error resulted in the browser refusing to parse the document, then as the specification evolved every website would need to be updated. The lack of constraints around standards also means that different browsers can evolve and implement different features at different cadences.

          • perilunar 1 year ago
            > force the developers to fix their HTML

            Developers?

            In the early days of the web it wasn't developers writing HTML. It was anybody who wanted to publish anything on the web. Real programmers didn't touch it. That is why browsers had to be tolerant of bad code.

            • danbruc 1 year ago
              They did not have to, they could have rejected broken HTML. Writing valid HTML is not really harder than writing invalid but still parsable HTML, it just requires some additional tedious work fixing all the mistakes. Would this have been bad for adoption? Maybe.
          • JodieBenitez 1 year ago
            Yes... but why do this ? I don't regret the XHTML days and its feature stagnation, but this is just useless.
            • jraph 1 year ago
              I learned HTML when XHTML 1.0 was current. I've long preferred strict HTML written using XML. It helps spot mistakes, and there are no parsing surprises.

              Now that HTML5 parsing is well specified, I've come to think that either you want to be strict and have the browser tell you something is wrong, and you use XHTML for this, or all these optional tags are just useless.

              I want to optimize readability, and then file size. I believe closing all tags you opened and quoting all attributes helps readability, and also that all these <head>, <body>, <html> tags just get in the way: they make your eyes go through useless boilerplate and make your fingers type useless things too if you don't use templates.

              You still need to specify the charset so characters are interpreted correctly, so for me, if you are not going to use application/xhtml+xml anyway, this works well:

                  <!DOCTYPE html>
                  <meta charset="utf-8" />
                  <title> My title </title>
              
                  <p> Lorem ipsum... </p>
              
              Both quicker to read and write, while not raising maintenance costs.

              Though just yesterday I edited my resume written in XHTML and the browser actually spotted a dumb mistake, so I still like the strictness of the XML parsing mode.

              One counterpoint to dropping the optional tags is for pedagogy: if I had to teach HTML to someone, I would make them use all the tags, or the result of having html and body in the DOM and CSS working on them will be very confusing. Only when they understand the DOM, what nodes are in an HTML page, I'd make them drop the tags if they want. Which is an important step so they can understand that nodes that are present in the DOM are not necessarily in the source code.
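
              To illustrate that last point, here is a minimal Python sketch (the `TagCollector` and `source_tags` names are made up for this example) that lists the start tags literally written in the short document above, using the standard library's `html.parser`. That module only tokenizes and does not build a DOM, so it reports what is in the source; a browser's HTML parser would still create html, head and body elements for the same input.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the start tags that literally appear in the source text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called for every start tag found in the markup, including
        # self-closing forms such as <meta ... />.
        self.tags.append(tag)


# The minimal document from the comment above, with optional tags omitted.
SOURCE = (
    "<!DOCTYPE html>\n"
    '<meta charset="utf-8" />\n'
    "<title> My title </title>\n"
    "\n"
    "<p> Lorem ipsum... </p>\n"
)


def source_tags(markup):
    """Return the start tags present in the markup, in document order."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    # html, head and body never show up here because they are not in the source.
    print(source_tags(SOURCE))
```

              Comparing this list with what the browser's DOM inspector shows for the same file is one way to see which nodes were implied rather than written.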

              • xigoi 1 year ago
                You need to include the <html> tag for the lang attribute, which is important for accessibility.
                • jraph 1 year ago
                  Good point!

                  I believe it can be put on any element, but if you are adding a tag just for this, it might as well be <html>.

              • lifthrasiir 1 year ago
                It is a formal specification of what browsers used to do with broken HTML. Because it is now fully specified, everyone can safely write broken HTML (half joking!), but there is also no longer any surprising behavior caused by browsers diverging from each other.
                • JodieBenitez 1 year ago
                  Good... now let's make a formal specification for this:

                      <ahahah>mmm... <ohoho>nope.</ahaha></ohoho>
                  
                  Fully joking :-P
                  • ZeroGravitas 1 year ago
                    I believe whatwg did in fact specify for this kind of overlapping tag.

                    edit: apparently it has the cute name of "the adoption agency algorithm":

                    https://html.spec.whatwg.org/multipage/parsing.html#adoption...

                    > Note: This algorithm's name, the "adoption agency algorithm", comes from the way it causes elements to change parents, and is in contrast with other possible algorithms for dealing with misnested content.
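
                    For a feel of what "misnested" means here, below is a small Python sketch that only detects overlapping tags. It is not the adoption agency algorithm, which repairs the tree instead of reporting an error. The `NestingChecker` class and `misnested_tags` helper are hypothetical names, and the sketch assumes simple snippets with no void elements and no omitted end tags.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Flag end tags that do not close the most recently opened element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open elements
        self.misnested = []   # end tags that arrived out of order

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            # Properly nested: the end tag matches the innermost open element.
            self.open_tags.pop()
            return
        # Out of order (or never opened): record it.
        self.misnested.append(tag)
        if tag in self.open_tags:
            # Drop the most recent matching open tag so later checks stay sane.
            index = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[index]


def misnested_tags(markup):
    """Return the list of end tags that overlap another open element."""
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    return checker.misnested


if __name__ == "__main__":
    print(misnested_tags("<b>bold <i>both</b> italic</i>"))
    print(misnested_tags("<b>bold <i>both</i></b>"))
```

                    A strict parser would stop at the first reported tag; the HTML parsing spec instead defines how the tree is rebuilt so that every browser ends up with the same DOM.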

                • MrVandemar 1 year ago
                  It's very useful.

                  I use optional -- therefore abbreviated -- HTML syntax as an alternative to MarkDown for writing.

                  • JodieBenitez 1 year ago
                    You're saying you're using an abbreviated HTML as an alternative to a markup language that is a lightweight alternative to the markup language known as HTML?

                    We need to go deeper. •`_´•

                    • MrVandemar 1 year ago
                      Yes, exactly. There are three advantages:

                      1. No translation of markdown to HTML is required to turn it into a web-page, the medium I primarily work in. It's not required -- I can read nicely formatted raw HTML just fine.

                      2. Lower cognitive load. Damn if I can remember if a link in markdown is ()[] or []() or the link goes first or the link goes last. But a link in HTML is consistent with the rest of the language.

                      3. HTML is richer than Markdown. I make heavy use of <abbr> and <cite> and <time>, not to mention metadata tags. My position is if you need to include HTML in Markdown to express yourself, you might as well write in HTML instead of constructing some hybrid bastard document.

                  • GoblinSlayer 1 year ago
                    This way you can send spam sms messages with little html markup and google analytics.
                    • layer8 1 year ago
                      It was a codification of existing browser practice. Specifying something different wouldn’t have changed browser behavior, it would only have led to browsers ignoring the spec.
                      • alerighi 1 year ago
                        Why not? If there is no ambiguity you save characters (a few bytes, but for each page) and thus pages will load faster even on slow connections. Also if you write HTML by hand (something not a lot of people do these days, but for example for my site I do) it's fewer characters to type and it's simpler.
                        • kevincox 1 year ago
                          The problem is that every parser and emitter needs to be aware of these weird and changing rules. It wouldn't be that bad if the only things that read HTML were browsers but as it is every language has HTML parsers that are broken different ways leading to bugs and security vulnerabilities.

                          For example emitters need to know what void elements are because `<br></br>` is actually equivalent to `<br><br>`. But `<script src=foo.js/>` is only an opening tag so the rest of your document will be executed as JavaScript. So you can't just write an emitter for arbitrary elements, you need to emit different things for `br` and `script`. Plus `script` has special escaping rules that are often forgotten about. Plus you better keep that list up to date!
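
                          A rough Python sketch of what that means for an emitter: it has to carry a table of void elements and branch on it. The `emit_element` function is a made-up helper, and the `VOID_ELEMENTS` set is the list as I understand the current HTML spec, so treat it as an assumption to re-check. Script and style content is deliberately refused because its escaping rules are not covered here.

```python
from html import escape

# Void elements as I understand the current HTML spec (assumption: re-check
# against the spec before relying on this list).
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})

# Elements whose text content needs special escaping, not handled here.
RAW_TEXT_ELEMENTS = frozenset({"script", "style"})


def emit_element(tag, attrs=None, text=""):
    """Serialize one element with optional attributes and text content."""
    attr_text = "".join(
        f' {name}="{escape(value, quote=True)}"'
        for name, value in (attrs or {}).items()
    )
    if tag in VOID_ELEMENTS:
        if text:
            raise ValueError(f"<{tag}> is a void element and cannot have content")
        # Void elements get a start tag only; an end tag would be an error.
        return f"<{tag}{attr_text}>"
    if tag in RAW_TEXT_ELEMENTS and text:
        raise ValueError(f"<{tag}> content escaping is not handled in this sketch")
    # Every other element needs an explicit end tag; "<script ... />" would
    # be treated by an HTML parser as a start tag only.
    return f"<{tag}{attr_text}>{escape(text, quote=False)}</{tag}>"


if __name__ == "__main__":
    print(emit_element("br"))
    print(emit_element("script", {"src": "foo.js"}))
    print(emit_element("p", text="a < b"))
```

                          The point is the branch itself: without the element table, the same code path cannot be correct for both `br` and `script`.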

                          With XHTML it is very easy to write a parser that will construct a tree forever and can reserialize it with no issues. I have no issues with consistent changes such as empty attributes and unquoted attribute values, but I think that element insertion, auto-closing, void elements and non-replaceable character data are a mistake because you need to maintain an up-to-date dataset of these custom rules or you get an incorrect result.
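
                          As a small illustration of the XML side, here is a sketch using Python's standard `xml.etree.ElementTree`, which knows nothing about HTML element names. The `parse_strict` and `round_trip` helpers are hypothetical names for this example, and the snippets are namespace-free fragments rather than full XHTML documents.

```python
import xml.etree.ElementTree as ET


def parse_strict(markup):
    """Parse well-formed XML into a tree, or return None if it is malformed."""
    try:
        return ET.fromstring(markup)
    except ET.ParseError:
        # A generic XML parser rejects the input instead of guessing.
        return None


def round_trip(markup):
    """Parse and reserialize; no per-element rules are needed."""
    root = parse_strict(markup)
    if root is None:
        return None
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(round_trip("<div>Hello <b>world</b>!</div>"))
    # Overlapping tags are simply an error for an XML parser.
    print(round_trip("<b>bold <i>both</b></i>"))
```

                          No element table appears anywhere in this code, which is the property being argued for: the parser and serializer stay correct even when new elements are introduced.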

                        • pwdisswordfishc 1 year ago
                          HTML is still feature-stagnated; features are mostly added to JavaScript and DOM APIs.
                          • jimmaswell 1 year ago
                            CSS gets relatively frequent new features too. I think this status quo is fair enough - HTML is just there, and I never really think about needing more out of it. CSS and JS though I do find myself waiting for browsers to support upcoming experimental features often enough.
                            • JodieBenitez 1 year ago
                              Yeah, let's say I'm late to the party then. But still, the improvements compared to HTML4 are huge.
                          • paulddraper 1 year ago
                            And people wonder why XHTML was a thing.

                            (Is still a thing, but W3C recommends against it)

                            • throwaway87651 1 year ago
                              Fantastic! These are great suggestions to help write readable, maintainable HTML. Similar to: https://lofi.limo/blog/write-html-right
                              • mattkenefick 1 year ago
                                It bummed me out when modern browsers started supporting mistakes in code; like when Chrome would interpret incorrect markup and fix it for you.
                                • hyperhello 1 year ago
                                  There is some variant of the theorem that any sufficiently complex language can’t express its own correctness, and vice versa. We want to turn all expression failures into syntax errors; but we can’t. Just don’t write bad HTML, bad JavaScript, bad CSS, and there won’t be any trouble for you.