The web is built on HTML. Without HTML there are no web pages, no web sites, no web apps. Nothing. There may be plain-text documents, perhaps, or XML trees, in some parallel universe that enjoyed that particular kind of challenge. In this universe, HTML is the foundation of the user-facing web. There are many standards that the web is resting on, but HTML is certainly one of the most important ones.
How do we use HTML, then, how great of a foundation do we have? In the introductory section of the 2019 Markup chapter, author Brian Kardell suggested that for a long time, we haven't really known. There were some smaller samples. For example, there was Ian Hickson's research (one of modern HTML's parents) among a few others, but until last year we lacked major insight into how we as developers, as authors, make use of HTML. In 2019 we had both Catalin Rosu's work (one of this chapter's co-authors) as well as the 2019 edition of the Web Almanac to give us a better view again of HTML in practice.
Last year's analysis was based on 5.8 million pages, of which 4.4 million were tested on desktop and 5.3 million on mobile. This year we analyzed 7.5 million pages, of which 5.6 million were tested on desktop and 6.3 million on mobile, using the latest data on the websites users are visiting in 2020. We do make some comparisons to last year, but just as we've tried to analyze additional metrics for new insights, we've also tried to impart our own personalities and perspectives throughout the chapter.
In this Markup chapter, we're focusing almost exclusively on HTML, rather than SVG or MathML, which are also considered markup languages. Unless otherwise noted, stats presented in this chapter refer to the set of mobile pages. Additionally, the data for all Web Almanac chapters is open and available. Take a look at the results and share your observations with the community!
In this section, we're covering the higher-level aspects of HTML like document types, the size of documents, as well as the use of comments and scripts. "Living HTML" is very much alive!
96.82% of pages declare a doctype. HTML documents declaring a doctype is useful for historical reasons, "to avoid triggering quirks mode in browsers" as Ian Hickson explained in 2009. What are the most popular values?
|XHTML 1.0 Transitional||382,322||6.02%|
|XHTML 1.0 Strict||107,351||1.69%|
|HTML 4.01 Transitional||54,379||0.86%|
|HTML 4.01 Transitional (quirky)||38,504||0.61%|
You can already tell how the numbers decrease quite a bit after XHTML 1.0, before entering the long tail with a few standard, some esoteric, and also bogus doctypes.
Two things stand out from these results:
- Almost 10 years after the announcement of living HTML (aka "HTML5"), living HTML has clearly become the norm.
- The web before living HTML can still be seen in the next most popular doctypes, like XHTML 1.0. XHTML. Although their documents are likely delivered as HTML with a MIME type of
text/html, these older doctypes are not dead yet.
A page's document size refers to the amount of HTML bytes transferred over the network, including compression if enabled. At the extremes of the set of 6.3 million documents:
- 1,110 documents are empty (0 bytes).
- The average document size is 49.17 KB (in most cases compressed).
- The largest document by far weighs 61.19 MB, almost deserving its own analysis and chapter in the Web Almanac.
How is this situation in general, then? The median document weighs 24.65 KB, which comes without surprises:
We identified 2,863 different values for the
lang attribute on the
html start tag (compare that to the 7,117 spoken languages as per Ethnologue). Almost all of them seem valid, according to the Accessibility chapter.
22.36% of all documents specify no
lang attribute. The commonly accepted view is that they should, but ignoring the fact that software could eventually detect language automatically, document language can also be specified on the protocol level, which is something we didn't check.
Here are the 10 most popular (normalized) languages in our sample. It's important to note that the HTTP Archive crawls from US data centers with English language settings, so looking at the language pages are written in, will be skewed towards English. Nevertheless we present the
lang attributes seen to give some context to the sites analyzed.
Adding comments to code is generally a good practice and HTML comments are there to add notes to HTML documents, without having them rendered by user agents.
<!-- This is a comment in HTML -->
Although many pages will have been stripped of comments for production, we found that home pages in the 90th percentile are using about 73 comments on mobile, respectively 79 comments on desktop, while in the 10th percentile the number of the comments is about 2. The median page uses 16 (mobile) or 17 comments (desktop).
Around 89% of pages contain at least one HTML comment, while about 46% of them contain a conditional comment.
<!--[if IE 8]> <p>This renders in Internet Explorer 8 only.</p> <![endif]-->
The above is a non-standard HTML conditional comment. While those have proven to be helpful in the past in order to tackle browser differences, they have been consigned to history for some time as Microsoft dropped conditional comments in Internet Explorer 10.
Still, on the above percentile extremes, we found that web pages are using about 6 conditional comments in the 90th percentile, and 1 conditional comment while in the 10th percentile. Most of the pages include them for helpers such as html5shiv, selectivizr, and respond.js. While being decentish and still active pages, our conclusion is that many of them were using obsolete CMS themes.
For production, HTML comments are usually stripped by build tools. Considering all the above counts and percentages, and referring to the use of comments in general, we suppose that lots of pages are served without involving an HTML minifier.
As shown in the Top elements section below, the
script element is the 6th most frequently used HTML element. For the purposes of this chapter, we were interested in the ways the
script element is used across these millions of pages from the data set.
Overall, around 2% of pages contain no scripting at all, not even structured data scripts with the
type="application/ld+json" attribute. Considering that nowadays it's pretty common for a page to include at least one script for an analytics solution, this seems noteworthy.
At the opposite end of the spectrum, the numbers show that about 97% of pages contain at least one script, either inline or external.
When scripting is unsupported or turned off in the browser, the
noscript element helps to add an HTML section within a page. Considering the above script numbers, we were curious about the
noscript element as well.
Following the analysis, we found that about 49% of pages are using a
noscript element. At the same time, about 16% of
noscript elements contain an
iframe with a
src value referring to "googletagmanager.com".
This seems to confirm the theory that the total number of
noscript elements in the wild may be affected by common scripts like Google Tag Manager which enforce users to add a
noscript snippet after the
body start tag on a page.
type attribute values are used with
type="module", we found that 0.13% of
script elements currently specify this attribute-value combination.
nomodule is used by 0.95% of all tested pages. (Note that one metric relates to elements, the other to pages.)
36.38% of all scripts have no
type values set whatsoever.
In this section, the focus is on elements: what elements are used, how frequently, which elements are likely to appear on a given page, and how the situation is with respect to custom, obsolete, and proprietary elements. Is divitis still a thing? Yes.
Let's have a look at how diverse use of HTML actually is: Do authors use many different elements, or are we looking at a landscape that makes use of relatively few elements?
The median web page, it turns out, uses 30 different elements, 587 times:
Given that living HTML currently has 112 elements, the 90th percentile not using more than 41 elements may suggest that HTML is not nearly being exhausted by most documents. Yet it's hard to interpret what this really means for HTML and our use of it, as the semantic wealth that HTML offers doesn't mean that every document would need all of it: HTML elements should be used per purpose (semantics), not per availability.
How are these elements distributed?
Not that much changed compared to 2019!
In 2019, the Markup chapter of the Web Almanac featured the most frequently used elements in reference to Ian Hickson's work in 2005. We found this useful and had a look at that data again:
Nothing changed in the Top 7, but the
option element went a little out of favor and dropped from 8 to 10, letting both the
link and the
i element pass in popularity. These elements have risen in use, possibly due to an increase in use of resource hints (as with prerendering and prefetching), as well icon solutions like Font Awesome, which de facto misuses
i elements for the purpose of displaying icons.
Another thing we were curious about was the use of the
summary elements, especially since 2020 brought broad support. Are they being used? Are they attractive for—even popular—among authors? As it turns out, only 0.39% of all tested pages are using them—although it's hard to gauge whether they were all used the correct way in exactly the situations when you need them—"popular" is the wrong word.
Here's a simple example showing the use of a
summary in a
<details> <summary>Status: Operational</summary> <p>Velocity: 12m/s</p> <p>Direction: North</p> </details>
A while ago, Steve Faulkner pointed out how these two elements were used inadequately in the wild. As you can tell from the example above, for each
details element you'd need a
summary element that may only be used as the first child of
Accordingly, we looked at the number of
summary elements and it seems that they do continue to be misused. The count of
summary elements is higher on both mobile and desktop, with a ratio of 1.11
summary elements for every
details element on mobile, and 1.19 on desktop, respectively:
|Element||Mobile (0.39%)||Desktop (0.22%)|
Taking another look at element popularity, how likely is it to find a certain element in the DOM of a page? Sure,
body are present on every HTML page (even though these tags are all optional), making them common elements, but what other elements are to be commonly found?
Standard elements are those that are or were part of the HTML specification. Which ones are rare to find? In our sample, that would bring up the following:
We're including these elements to give an idea what elements may have gone out of favor. But while
basefont were last specified in XHTML 1.0 (2000) and are no longer part of HTML, the rare use of
rp (which was mentioned as early as 1998 and is still part of HTML), may just suggest that ruby markup is not very popular.
The 2019 edition of the Web Almanac handled custom elements by discussing several non-standard elements. This year, we found it valuable to have a closer look at custom elements. How did we determine these? Roughly by looking at their definition, notably their use of a hyphen. Let's focus on the top elements, in this case elements used on ≥1% of all URLs in the sample:
These elements come from three sources: Yandex Metrica (
ym-), an analytics solution we also saw last year; Slider Revolution (
rs-), a WordPress slider, for which there are more elements to be found near the top of the sample; and Wix (
wix-), a website builder.
Other groups that stand out include AMP markup with
amp- elements like
amp-img (11,700 pages),
amp-analytics (10,256), and
amp-auto-ads (7,621), as well as Angular
app- elements like
app-footer (6,745), and
There are more questions to ask about the use of HTML, including the use of obsolete elements (which are elements like
In our mobile dataset of 6.3 million pages, around 0.9 million pages (14.01%) contain one or more of these elements. Here are the top 9, which are used more than 10,000 times:
spacer is still being used 1,584 times, and present on every 5,000th page. We know that Google has been using a
center element on their homepage for 22 years now, but why are there so many imitators?
If you were wondering: The total number of
isindex elements in this dataset is: one. Exactly one page used an
isindex was part of the specs until HTML 4.01 and XHTML 1.0, yet only properly specified in 2006 (aligning with how it was implemented in browsers), and then removed in 2016.
In our set of elements we found some that were neither standard HTML (nor SVG nor MathML) elements, nor custom ones, nor obsolete ones, but somewhat proprietary ones. The top 10 that we identified are the following:
The source of these elements appears to be mixed, as in some are unknown while others can be traced. The most popular one,
noindex, is probably due to Yandex's recommendation of it to prohibit page indexing.
jdiv was noted in last year's Web Almanac and is from JivoChat.
mediaelementwrapper comes from the MediaElement media player. Both
yatag are also from Yandex. The
ss element could be from ProStores, a former ecommerce product from eBay, and
olark may be from the Olark chat software.
h7 appears to be a mistake.
limespot is probably related to the Limespot personalization program for ecommerce. None of these elements are part of a web standard.
|Heading||Occurrences||Average per page|
You might have expected to only see the standard
<h6> elements, but some sites actually use more levels:
|Heading||Occurrences||Average per page|
The last two have never been part of HTML, of course, and should not be used.
This section focuses on how attributes are used in documents and explores patterns in
data-* usage. Our findings show that
class is the queen of all attributes.
Similar to the section on the most popular elements, this section delves into the most popular attributes on the web. Given how important the
href attribute is for the web itself, or the
alt attribute in order to make information accessible, would these be most popular attributes?
The most popular attribute is
class, with nearly 3 billion occurrences in our dataset and constituting 34% of all attributes in use.
value attribute, which specifies the value of an
input element, surprisingly completes the top 10. It's surprising to us because, subjectively, we didn't get the impression
value attributes were used that frequently.
Are there attributes that we find in every document? Not quite, but almost:
These results raise questions that we cannot answer. For example,
alt? Do those 9.25% of pages not contain any images or are they just inaccessible?
Per the HTML spec,
data-* attributes "are intended to store custom data, state, annotations, and similar, private to the page or application, for which there are no more appropriate attributes or elements." How are they used? What are the popular ones? Is there anything interesting here?
The two most popular ones stand out because they are almost twice as popular than each of the attributes that followed (with >1% use):
data-src can have multiple generic uses although
data-toggle, where it's used as a state styling hook on toggle buttons. The Slick carousel plugin is the source of
data-element_type is part of Elementor's WordPress website builder. Both
data-requirecontext, then, are part of RequireJS.
Interestingly, the use of native lazy loading on images is similar to that of
data-src. 3.86% of pages use
<img> elements. This appears to be growing very fast, as back in February, this number was about 0.8%. It's possible that these are being used together for a cross-browser solution.
We've covered the use of HTML in general as well as the adoption of top elements and attributes. In this section, we're reviewing some of the special cases of viewports, favicons, buttons, inputs, and links. One thing we note here is that too many links still point to "http" URLs.
The viewport meta element is used to control layout on mobile browsers. While years ago, the motto was kind of "don't forget the viewport meta element" when building a web page, eventually this became a common practice and the motto changed to "make sure zooming and scaling are not disabled."
Users should be able to zoom and scale the text up to 500%. That's why audits in popular tools like Lighthouse or axe fail when
user-scalable="no" is used within the
meta name="viewport" element, and when the
maximum-scale attribute value is less than
We had a look at the data and in order to better understand the results, we normalized it by removing spaces, converting everything to lowercase, and sorting by comma values of the
|Content attribute value||Pages||Pages (%)|
viewportspecifications, and lack thereof.
The results show that almost half of the pages we analyzed are using the typical viewport
content value. Still, around 10% of mobile pages are entirely missing a proper
content value for the viewport meta element, with the rest of them using an improper combination of
For a while now, the Edge mobile browser allows users to zoom into a web page to at least 500%, regardless of the zoom settings defined by a web page employing the viewport meta element.
The situation around favicons is fascinating. Favicons work with or without markup—some browsers would fall back to looking at the domain root—, accept several image formats, and then also promote several dozen sizes (some tools are reported to generate 45 of them; realfavicongenerator.net would return 37 if requested to handle every case). As of this time of writing, there is an open issue for the HTML spec to help improve the situation.
When we built our tests we didn't check for the presence of images, but only looked at the markup. That means, when you review the following, note that it's more about how favicons are referenced rather than whether or how often they are used.
|Favicon format||Pages||Pages (%)|
|No favicon defined||1,643,136||25.88%|
|No extension specified (no format identifiable)||37,011||0.58%|
There are a couple of surprises in here:
- Support for other formats is there but ICO is still the go-to format for favicons on the web.
- JPG is a relatively popular favicon format even though it may not yield the best results (or a comparatively large weight) for many favicon sizes.
- WebP is twice as popular as SVG! This might change, however, with SVG favicon support improving.
There has been a lot of discussion on buttons lately and how often they are misused. We looked into this to present findings on some of the native HTML buttons.
|Button types||Occurrences||Pages (%)|
Our analysis shows that about 60% of pages contain a button element and more than half of those pages (32.43%) have at least one button that fails to specify a
type attribute. Note that the
button element has a default type of
submit, so the default behavior of buttons on these 32% of pages is to submit the current form data. To avoid possibly unexpected behavior like this, a best practice is to specify the
|Percentile||Buttons per page|
Pages in the 10th and 25th percentiles contain no buttons at all, while a page in the 90th percentile contains 13 native
button elements. In other words, 10% of pages contain 13 or more buttons.
The anchor element, or
a element, links web resources together. In this section, we analyze the adoption of the protocols included in the respective link targets.
We can see how
http are most dominant, followed by "benign" links to make writing email, making phone calls, and sending messages easier.
target="_blank" has been known to be a security vulnerability for some time now. Yet 71.35% of pages contain links with
As a rule of thumb and for usability reasons, it is recommended not to use
target="_blank" in the first place.
Within the latest Safari and Firefox versions, setting
a elements implicitly provides the same
rel behavior as setting
rel="noopener". This is already implemented in Chromium as well and will land in Chrome 88.
We've touched on some observations throughout the chapter, but as a reflection on the state of markup in 2020, here are some things that stood out for us:
Fewer pages land in quirks mode. In 2016, that number was at around 7.4%. At the end of 2019, we observed 4.85%. And now, we're at about 3.97%. This trend, to paraphrase Simon Pieters in his review of this chapter, seems clear and encouraging.
Although we lack historic data to draw the full development picture, "meaningless"
i markup has pretty much replaced the
table markup we've observed in the 1990s and early 2000s. While one may question whether
span elements are always used without there being a semantically more appropriate alternative, these elements are still preferable to
table markup, though, as during the heyday of the old web, these were seemingly used for everything but tabular data.
Elements per page and element types per page stayed roughly the same, showing no significant change in our HTML writing practice when compared to 2019. Such changes may require more time to manifest.
Proprietary product-specific elements like
g:plusone (used on 17,607 pages in the mobile sample) and
fb:like (11,335) have almost disappeared after still being among the most popular ones last year. However, we observe more custom elements for things like Slider Revolution, AMP, and Angular. Elements like
ymaps are also still prevalent. What we imagine we're seeing here is that, under the sea of slowly changing practices, HTML is very much being developed and maintained, as authors toss deprecated markup and embrace new solutions.
Now, the 2019 Web Almanac Markup chapter had 14 years of catch up to do since the last major study on the topic, so you'd think we wouldn't have much to cover in the year since. Yet what we observe with this year's data is that there's a lot of movement at the bottom and near the shore of said sea of HTML. We approach near-complete adoption of living HTML. We are quick to prune our pages of fads like Google and Facebook widgets. We're also fast in adopting and shunning frameworks, as both Angular and AMP (though a "component framework") seem to have significantly lost in popularity, likely for solutions like React and Vue.
And still, there are no signs we exhausted the options HTML gives us. The median of 30 different elements used on a given page, which is roughly a quarter of the elements HTML provides us with, suggests a rather one-sided use of HTML. That is supported by the immense popularity of elements like
span, and no custom elements to potentially meet the demands that these two elements may represent. Unfortunately, we couldn't validate each document in the sample; however, anecdotally and to be taken with caution, we learned that 79% of W3C-tested documents have validation errors. After everything we've seen, it looks like we're still far from mastering the craft of HTML.
That compels us to close with an appeal: Pay attention to HTML. Focus on HTML. It's important and worthwhile to invest in HTML. HTML is a document language that may not have the charm of a programming language, and yet the web is built on it. Use less HTML and learn what's really needed. Use more appropriate HTML—learn what's available and what it's there for. And validate your HTML. Anyone can write invalid HTML (just invite the next person you meet to write an HTML document and validate the output) but a professional developer can be expected to produce valid HTML. Writing correct and valid HTML is a craft to take pride in.
For the next edition of the Web Almanac's chapter, let's prepare to look closer at the craft of writing HTML and, hopefully, how we've been improving on it.
We're leaving the rest open to you. What are your observations? What has caught your eye? What do you think has taken a turn for the worse, and what has improved? [Leave a comment](https://discuss.httparchive.org/t/2039) to share your thoughts!