Proposal - including CSS "short descriptions" in mdn/data

wbamberg · July 27, 2018, 9:52pm

Including CSS “short descriptions” in mdn/data

The “short description” is the opening sentence or two of a CSS property reference page. It gives a very short overview of the property.

It follows a reasonably consistent pattern:

The foo CSS property is… It is related to the bar property…

For example:

The box-shadow CSS property is used to add shadow effects around an element’s frame. You can specify multiple effects separated by commas if you wish to do so. A box shadow is described by X and Y offsets relative to the element, blur and spread radii, and color.

The margin CSS property sets the margin area on all four sides of an element. It is a shorthand for setting all individual margins at once: margin-top, margin-right, margin-bottom, and margin-left.

Currently the short description is just part of the Wiki document for the property. It’s been proposed (e.g. https://github.com/mdn/data/issues/199) to move the short description out of the Wiki and into the JSON data structures in the mdn/data GitHub repository.

The rationale for doing this is that it makes it much easier for external tools to embed the short description. For example, an editor like VSCode could fetch the short description and display it in a contextual popup along with other useful information like browser compatibility. External tools can (and do) do this already by scraping the Wiki, but this is quite unreliable.

This document considers how we might go about moving the short description into an “externally embeddable” format such as mdn/data, and what the issues with doing this might be.

Precedents

This isn’t the first piece of MDN content to be moved out of the Wiki into an “externally embeddable” format.

The browser-compat-data project moves compatibility data from the Wiki into JSON files in GitHub, where it’s packaged into an npm module and can be consumed by external applications.
The existing contents of mdn/data were originally content in the Wiki, and are now accessible to external applications via a similar JSON=>npm setup.

One important difference is that this is the first time we’ve seriously considered migrating pure prose content out of the Wiki in this way, and this will bring its own challenges.

However, both the previous migrations listed above include some prose content:

browser compat includes “notes”, which are free text
the CSS data in mdn/data includes prose describing which elements the properties can apply to.

We’ll refer to both of these as possible precedents in meeting the challenges of migrating the short description.

The basic proposal

Adding the description to mdn/data

The mdn/data repository describes CSS properties in a properties.json file. Each entry contains data about a single CSS property, including its specification status and formal syntax. Here’s the entry for box-shadow:

"box-shadow": {
  "syntax": "none | <shadow>#",
  "media": "visual",
  "inherited": false,
  "animationType": "shadowList",
  "percentages": "no",
  "groups": [
    "CSS Backgrounds and Borders"
  ],
  "initial": "none",
  "appliesto": "allElements",
  "computed": "absoluteLengthsSpecifiedColorAsSpecified",
  "order": "uniqueOrder",
  "alsoAppliesTo": [
    "::first-letter"
  ],
  "status": "standard"
}

This data is currently used both in MDN pages and in external tools such as css-tree.

The obvious suggestion is to add the short description into this JSON structure:

"box-shadow": {
  "syntax": "none | <shadow>#",
  "media": "visual",
  "inherited": false,
  "animationType": "shadowList",
  "percentages": "no",
  "groups": [
    "CSS Backgrounds and Borders"
  ],
  "initial": "none",
  "appliesto": "allElements",
  "computed": "absoluteLengthsSpecifiedColorAsSpecified",
  "order": "uniqueOrder",
  "alsoAppliesTo": [
    "::first-letter"
  ],
  "status": "standard",
  "shortDescription": "The box-shadow CSS property is used to add shadow effects around an element's frame. You can specify multiple effects separated by commas if you wish to do so. A box shadow is described by X and Y offsets relative to the element, blur and spread radii, and color."
}
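As a rough illustration of what this would mean for a consumer, here is a minimal Python sketch. It assumes the field is named "shortDescription" as proposed above (that name is not settled), and the JSON string is a trimmed, hypothetical excerpt of properties.json rather than the real file.

```python
import json

# Hypothetical, trimmed excerpt of css/properties.json with the proposed field.
# "margin" deliberately has no description, to show the missing-field case.
PROPERTIES_JSON = """
{
  "box-shadow": {
    "syntax": "none | <shadow>#",
    "status": "standard",
    "shortDescription": "The box-shadow CSS property is used to add shadow effects around an element's frame."
  },
  "margin": {
    "status": "standard"
  }
}
"""


def short_description(properties, name):
    """Return the proposed shortDescription for a property, or None if absent."""
    entry = properties.get(name)
    if entry is None:
        return None
    return entry.get("shortDescription")


properties = json.loads(PROPERTIES_JSON)
print(short_description(properties, "box-shadow"))
```

A tool such as an editor plugin would still need to decide what to show when the field is missing, since not every property is guaranteed to have a description.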

Embedding short descriptions in MDN

Once we’ve done that, we’d want to be able to include the descriptions in MDN Wiki pages. The obvious approach would be to use a KumaScript macro to fetch the short description from mdn/data and embed it in the page. This is the same approach used to embed compatibility tables populated from browser-compat-data and to embed “info” tables populated from mdn/data.

Challenges and questions

Format

The first question: which format should we use for the short description? We’ll consider three options:

HTML
Markdown
reStructuredText

The advantages of HTML are that it’s very powerful, well-specified, and tools to process it are readily available. The disadvantages are that it’s hard to write and hard to read. If we did choose HTML, we should greatly restrict the elements we permit.

The advantages of Markdown are that it’s very familiar, easy to write and easy to read. The disadvantages are that it’s not well-standardised, is very limited, and doesn’t support semantic markup.

reStructuredText is somewhere in between the other two options. It is simpler to read and write than HTML, but is better standardised, more powerful, and more extensible than Markdown, and supports semantic markup.

Recommendation

One hidden question here is: are we choosing a markup format for short descriptions only, or are we building the first step of a system to move more complex content out of the Wiki? We would like to choose the simplest format possible, but there’s a risk that it will prove inadequate if we start trying to do more with it. A related question is: how difficult would it be to change the markup format, if we discover that our choice is inadequate?

I think that at this point it’s difficult to anticipate what our future requirements will be or how we will want to address them, so we should do something simple for now and acknowledge that we might change formats later.

For notes in browser-compat-data, we use HTML, restricted to <a> and <code> elements, and it has worked quite well.

Looking through some short descriptions, I think this would be enough for short descriptions, with the possible addition of <strong>. If in the future we want to represent more complex content, such as descriptions of property syntax, we might have to consider different options, and I’d tend to favour reStructuredText.
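To make the "restricted HTML" idea concrete, here is a minimal sketch of a check that a description only uses the permitted elements. The allow-list (a, code, strong) is taken from the recommendation above; the function and class names are made up for illustration and are not part of any existing mdn/data tooling.

```python
from html.parser import HTMLParser

# Assumed allow-list: <a> and <code> as in browser-compat-data notes,
# plus the possible addition of <strong>.
ALLOWED_TAGS = {"a", "code", "strong"}


class TagCollector(HTMLParser):
    """Record every start tag that is not on the allow-list."""

    def __init__(self):
        super().__init__()
        self.disallowed = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag not in ALLOWED_TAGS:
            self.disallowed.append(tag)


def disallowed_tags(description):
    """Return the list of disallowed tag names found in a description."""
    collector = TagCollector()
    collector.feed(description)
    collector.close()
    return collector.disallowed


print(disallowed_tags('The <code>margin</code> property, see <a href="#">here</a>.'))
print(disallowed_tags("<p>Some <em>emphasised</em> text</p>"))
```

A check like this only looks at element names. Restricting attributes (for example, allowing only href on links) would need an extra step.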

Localization

MDN’s Wiki content is provided in a number of languages. It’s translated entirely by volunteers, and provides its own platform for translators. The platform has serious limitations but it is possible for determined volunteers to create and maintain high-quality translations.

If we move the en-US text out of the Wiki and include it using a macro, how will localization be affected?

The two precedents we have for this project take different approaches here.

browser-compat-data

In the browser-compat-data project, we don’t (yet) address localization at all. Notes are given in en-US only, and are presented in en-US in all locales. This is probably more acceptable for notes in a table than for the first paragraph of the article, though!

Note that even if we don’t support localization explicitly in this project, it’s still possible for people to provide translations for the short description. The en-US page would include the {{ShortDescription}} macro. Translators could use the rendered version of the en-US page as the source, and other locales would include the translated text content directly, rather than the macro call.

But this means translators get no help from the localization tools in Kuma, and have to treat short descriptions as a special case.

mdn/data

In mdn/data, we do address localization explicitly. Localizable strings are referred to using identifiers, which can be used as keys to look up strings in a separate l10n dictionary:

// css/properties.json

"box-shadow": {
...
  "computed": "absoluteLengthsSpecifiedColorAsSpecified",
...
}

// l10n/css.json

"absoluteLengthsSpecifiedColorAsSpecified": {
  "de": "Längen absolut gemacht; angegebene Farben berechnet; ansonsten wie angegeben",
  "en-US": "any length made absolute; any specified color computed; otherwise as specified",
  "fr": "toute longueur sous forme absolue; toute couleur sous forme calculée; sinon comme spécifié",
  "ja": "指定値（length は全て絶対値となり、color については計算値となる）",
  "ru": "любая абсолютная длина; работает любой указанный цвет; если другое не указано"
}

This system is used by the cssinfo macro.
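The lookup described above can be sketched in a few lines of Python. The data below is copied from the example (trimmed to three locales); the fallback to en-US when a locale is missing is an assumption made for this sketch, not documented behaviour of the cssinfo macro.

```python
# Trimmed copy of the l10n/css.json entry shown above.
L10N = {
    "absoluteLengthsSpecifiedColorAsSpecified": {
        "de": "Längen absolut gemacht; angegebene Farben berechnet; ansonsten wie angegeben",
        "en-US": "any length made absolute; any specified color computed; otherwise as specified",
        "fr": "toute longueur sous forme absolue; toute couleur sous forme calculée; sinon comme spécifié",
    }
}


def localize(l10n, key, locale, fallback="en-US"):
    """Look up a string by identifier and locale.

    Falls back to `fallback` (assumed to be en-US) when the locale is missing,
    and returns None when the identifier itself is unknown.
    """
    translations = l10n.get(key, {})
    if locale in translations:
        return translations[locale]
    return translations.get(fallback)


key = "absoluteLengthsSpecifiedColorAsSpecified"
print(localize(L10N, key, "fr"))
print(localize(L10N, key, "es"))  # no Spanish entry, so the en-US text is used
```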

Recommendation

We need some kind of localization answer for short descriptions, even in the first version. I don’t think it’s acceptable to do the same as browser-compat-data here: untranslated notes in a table are far less prominent than an untranslated first paragraph of an article.

I’d recommend that we do the same thing as the other mdn/data items, as described above. But I’d really value input from localization experts and practitioners here.

As an aside: our localization strategy as a whole is unclear at the moment, and hopefully changes are going to come in this area.

Inline macros

The MDN Wiki platform, also known as Kuma, has its own macro system called KumaScript. It’s used for a wide variety of purposes, from macros like {{compat}} that build complete sections of the page, to what we could call “inline macros” that insert small bits of generated content inside static content. A major category of inline macros are cross-referencing macros, that generate links to documents.

Calls to inline KumaScript macros sometimes appear in the short description. For example, the short description for margin calls the {{cssxref}} macro, which generates a link to a page in the MDN CSS reference documentation:

The margin CSS property sets the margin area on all four sides of an element. It is a shorthand for setting all individual margins at once: {{cssxref("margin-top")}}, {{cssxref("margin-right")}}, {{cssxref("margin-bottom")}}, and {{cssxref("margin-left")}}.

This is a problem for opening up Wiki content to other applications, because these macros are not usable by applications other than Kuma. For example, suppose VSCode wants to fetch a short description for margin and gets the content quoted above. What can it do with {{cssxref("margin-right")}}?

So: when the short description’s content is moved out of the Wiki for consumption by other applications, what should happen to the macros it contains?

Recommendation

My recommendation here is that we deprecate these inline macros and remove them from the short description’s content.

The most common macro appearing in short descriptions is {{cssxref}}, and we can replace this with just standard HTML <a> and <code> elements.
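As a sketch of what that replacement could look like if done mechanically, the Python below rewrites simple {{cssxref("name")}} calls into plain HTML. It handles only the single-argument form with straight double quotes, and the URL pattern is an assumption based on the en-US CSS reference URLs quoted elsewhere in this thread. It does not reproduce whatever special cases the real macro implements.

```python
import re
from html import escape

# Matches only the simple form {{cssxref("name")}}; other argument forms are ignored.
CSSXREF = re.compile(r'\{\{\s*cssxref\(\s*"([^"]+)"\s*\)\s*\}\}', re.IGNORECASE)

# Assumed URL prefix for CSS reference pages in the en-US locale.
BASE_URL = "https://developer.mozilla.org/en-US/docs/Web/CSS/"


def expand_cssxref(text):
    """Replace simple cssxref macro calls with <a><code>...</code></a> markup."""

    def to_link(match):
        name = match.group(1)
        return '<a href="{}{}"><code>{}</code></a>'.format(BASE_URL, name, escape(name))

    return CSSXREF.sub(to_link, text)


source = 'It is a shorthand for {{cssxref("margin-top")}} and {{cssxref("margin-left")}}.'
print(expand_cssxref(source))
```

The output uses only the a and code elements, so it stays within the restricted HTML vocabulary suggested in the Format section.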

Note though that this will apply to all content that we want to make available to applications that aren’t Kuma, so if we want to extend the scope of this project beyond just short descriptions, its impact on KumaScript macros will be more widespread.

In this context it would be helpful for the MDN team to have guidelines about the proper scope and valid applications for KumaScript macros.

Editing workflow

Moving content out of the Wiki into GitHub is a big change to the contribution workflow. We’ve seen with browser-compat-data and interactive-examples that we still get good contributions from GitHub-hosted content, but prose content feels different from either structured data or code examples.

I’m not sure what to suggest here, other than to try it and see if it is successful.

The future

One major open question is: how does this change fit into our overall platform strategy for MDN? We are considering migrating more content out of the Wiki and into GitHub. It’s always tempting to make moves like this incrementally, testing each step before moving on. This has been our approach so far with experiments like browser-compat-data, mdn/data, and interactive-examples.

But the risk of this incremental approach is that by proceeding without an overall vision, you end up with an incoherent mess. How can we have confidence that the “short descriptions” project will fit cleanly into a “future MDN” where, for example, all reference content is externally embeddable? Would it be better to have even a sketched out vision of our endpoint, to guide the incremental steps we take?

sheppy · July 27, 2018, 10:40pm

I’m terribly torn about this. I very much agree that this is a good idea in principle; it makes it much easier to build tools that present useful information about Web technologies inline in tools, and that’s something we are very much building toward.

However, some of the potential concerns you raise are significant enough to give me pause at this time. I have some of my own concerns, too.

Data location

If we feel that the end goal is to move toward a structured format for MDN’s content, with specific content sections (and potentially optional “custom” sections) which are edited in a set of customized editing containers, then we should keep that in mind when making decisions here. Would we want all information about each CSS property to be stored in the same data file? Or would we have the short descriptions in properties.json and everything else in another file or database? If we aren’t sure about this, making any moves may be premature.

Macros

This is a bigger concern. The use of macros to generate content is pretty key to how we work on MDN right now, and making any changes to that must be considered carefully. Before removing the ability to use them, we need to make certain that we are able to replicate the functionality we lose.

I would actually prefer that we not lose them at all, and instead of putting the descriptions directly into the existing properties.json file, we generate properties.json from a source file that contains the data currently in properties.json and one or more additional files that provide data that needs to be run through the KumaScript engine to process macros. That might involve a separate property-details.json file, containing a set of records like this:

"background-clip": {
  "summary": "The <strong><code>background-clip</code></strong> CSS property specifies if an element's background, whether a {{cssxref(\"&lt;color&gt;\")}} or an {{cssxref(\"&lt;image&gt;\")}}, extends underneath its border."
}

We would then have a tool that runs the property-details.json file through KumaScript to generate the final data, which is the HTML that KumaScript produces after rendering all the macros used within the text. That JSON could either be kept in a separate file (property-descriptions.json, for example) or inserted into the properties.json file (meaning there would need to be a source file for that as well, which would be the file that would be manually edited).

This tool would be run as part of the process of releasing a new build of the MDN data, resulting in a properties.json file being distributed that is exactly like the one we have now, but with the added information present including all KumaScript macros, pre-rendered in their HTML.

Last thought

All that said, one thing I do feel strongly about is that at least at this time, the only realistic format to use for prose in the data is a subset of HTML. Trying to use another format complicates things, since having more than one means having to do conversions when rendering pages and also requires editors to deal with multiple formats, which isn’t really a good experience.

chrisdavidmills · July 28, 2018, 8:20am

This is a well-written proposal Will, and I’m pretty happy with it.

I did have a couple of points/responses:

I agree that in the short term, the same limited HTML vocabulary we use for BCD notes would be a good way to go. It really isn’t that hard to write when you’ve only got 4 or 5 elements to choose from, and we could provide a simple guide to fill in any knowledge gaps.
In the longer term, and for more in-depth bits of content existing on GitHub (just imagine if we went down the route of putting the whole editorial process on GitHub), I am not sure that reStructuredText is the best idea; it’s more of a content writer thing and would therefore raise the barrier to entry for web devs and others who don’t know it. You are far more likely to get contributions from the web dev community if you carry on using HTML or Markdown, and I’m not religious about which one to use at all, as they can be converted between easily. In terms of limited capabilities, I’d suggest that we do a similar thing to GitHub-flavoured Markdown: use Markdown for all the simple stuff, and then allow HTML to be included for anything non-standard.
Macros — in the short term, I don’t see a problem with removing them from the short descriptions. Most of the time they are used to link to closely-related properties, which more often than not are included in the sidebar for that group of CSS functionality anyway. In extreme cases, we could include some extra info below the short description. Thinking of a longer term plan, I guess we’d have to think carefully about this; any other resource that consumes our data could not be expected to render the macros in the same way. And do we REALLY need macros for things like links, which are simple and can easily be represented in HTML? Probably not, although I know the arguments for using them.
When you say “deprecate”, do you mean get rid of those macros from the short descriptions, or get rid of them altogether?
Yes, I totally agree that we need some kind of future vision and think about where this fits in with that. I personally am a huge fan of the idea of moving to GitHub for our editorial workflow, for many reasons previously discussed, but obviously I can’t guarantee any kind of timeline without talking to the devs. I don’t think they can guarantee any kind of timeline either, without more resources.

So how best to proceed here?

sphinx_knight · July 30, 2018, 6:00am

As @chrisdavidmills pointed out: it is indeed a neat proposal, thank you @wbamberg.

I will, of course, write about localization here

I 100% agree here. To me, regardless of the technical solution designed to answer this, localization must be part of it. It is not, in this case, a “nice to have”.

I don’t know the feelings and, most importantly, the time available from @jwhitlock about this but it might be a good way to test strings extraction so that localization can happen on Pontoon. (poking @Pike as well since we are talking about #l10n here).
Attempting this would also push towards HTML (ex. dealing with RTL locales) for the format of the content (though I don’t know ReST enough to say more).
If this proves successful, this could pave the way to achieve localization migration for existing macros/systems (e.g. browser compat data).

Adding a few things outside of localization:

+++ This proposal, if successful, could well be generalized to other sections too (for the same purposes that Will pointed out).
-- This creates a kind of mix/hierarchy between content of the same “level”. While I totally agree with the benefits of a git-based (or Pontoon for l10n) contribution model, having multiple “technologies” to contribute to an article (the first paragraph being the most important part of it) makes the contribution process more cumbersome (“here is the editing interface”, “oh wait this is how macros work”, “and this is GitHub/Pontoon”). To summarize all of this, I’d say that if this path is chosen (I would gladly walk it), we should be aware that the ultimate destination must include tooling to make contributions easy.

Again, thanks Will for writing this

Pike · July 30, 2018, 10:53am

Hard for me to suggest something on the l10n side here. Deciding on the l10n story based on the short descriptions seems like the tail wagging the dog.

I share the concerns about splitting up the content of a page between multiple toolchains. From editing to translation status.

If you externalize it, do so in a dedicated localization format, please.

Have you researched whether the data of interest can be taken from the current pages? I’d think we’d want that algorithm anyway to bootstrap the localization effort. Maybe that’s sustainable while you’re figuring out your content story at large?

jwhitlock · July 30, 2018, 2:40pm

It could be useful to have summaries in limited HTML in mdn/data, but the primary source should still be the wiki pages.

It is currently possible to extract the summaries from MDN pages with the $json API:

https://developer.mozilla.org/en-US/docs/Web/CSS/margin$json

Kuma already supports automatic extraction of the summary text (used to populate the meta description), translation of summaries, and live updates with edits. Moving the source of summaries to mdn/data means losing many of these features, with no plans to add them back.

Google is currently embedding W3Schools’ content for web platform items in search results. If you look at these “quick results” and compare them to the W3Schools page, they are scraped directly from the web page, with little or no markup to help Google determine the content. We should follow this example: the “summary” of the feature should be the leading content of the page, followed by a short example of the feature. Moving the summary to a second location makes it more difficult to craft a summary in the context of the page, and gets us further from the goal of being the embedded content for Google Search.

I think most of the proposal is good, such as JSON format and restricted HTML markup, but MDN should remain the source for this data. This could be accomplished this way:

mdn/data includes the URL of the related feature
A script periodically uses the $json API to update the summary data, submitted as a pull request. The script can live in the mdn/data repository.
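A minimal sketch of such a script, in Python, might look like the following. The "mdn_url" and "summary" field names in mdn/data are hypothetical, and the only thing assumed about the $json response is that it carries a "summary" field, as discussed in this thread. The fetch function is passed in as a parameter so the update logic can be exercised without network access.

```python
import json
from urllib.request import urlopen


def fetch_document_json(url):
    """Fetch the $json representation of an MDN page (performs a network request)."""
    with urlopen(url + "$json") as response:
        return json.loads(response.read().decode("utf-8"))


def update_summaries(properties, fetch=fetch_document_json):
    """Refresh stored summaries in place and return {property: new summary}.

    `properties` maps property names to entries that may carry a hypothetical
    "mdn_url" field. Entries without a URL are skipped.
    """
    changed = {}
    for name, entry in properties.items():
        url = entry.get("mdn_url")  # hypothetical field name
        if not url:
            continue
        summary = fetch(url).get("summary")
        if summary and summary != entry.get("summary"):
            entry["summary"] = summary
            changed[name] = summary
    return changed


def fake_fetch(url):
    """Stand-in for the real request, used here so the example runs offline."""
    return {"summary": "The margin CSS property sets the margin area on all four sides of an element."}


data = {
    "margin": {"mdn_url": "https://developer.mozilla.org/en-US/docs/Web/CSS/margin"},
    "box-shadow": {},  # no URL recorded, so it is left untouched
}
print(update_summaries(data, fetch=fake_fetch))
```

The returned dictionary of changes is what a real script would turn into a pull request. That step is left out here.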

This would retain the SEO and localization features of the current system, and give third parties a simple way to use these summaries in their own tools, without manually scraping MDN.

I strongly believe that the summary is part of the prose of the page, and treating it the same as pure data is a step backwards.

sheppy · July 30, 2018, 7:32pm

I agree wholeheartedly with @jwhitlock on this. We should have a tool that slurps the summaries out of the MDN pages and writes them into the JSON before each release of mdn/data is packaged up. Doing anything else causes a rift in our content that will be increasingly difficult to maintain.

If necessary, the tool can clean the summaries further than the existing bleach process does, removing tags we don’t want in the JSON. But the point is that the whole thing can be done automatically without having to make changes to workflow for the vast majority of contributors.

Sheppy

wbamberg · August 6, 2018, 6:59pm

Thanks for all your responses!

John’s amendment seems like a really good one. If I understand correctly, this will be the same as the original proposal for users of mdn/data, but changes the contribution side to keep the Wiki workflow, populating mdn/data from the Wiki using $json.summary.

The big advantage of this is that we keep the familiar workflow. I agree that editing JSON properties is a terrible workflow for prose content, and we will need to think more about that side if we want to move away from the Wiki as a contribution path.

This approach also resolves the localization issue for MDN itself, although not for anyone using the mdn/data copy.

I think the “build step” where we take $json.summary and add it to mdn/data offers value in giving us a way to validate the summary.

So to accomplish this, there seem to be about three pieces of work:

on the content side, make sure that the content returned by “$json.summary” is in good shape.
check that the way “$json.summary” is calculated is what we actually want. There is some documentation here: https://developer.mozilla.org/en-US/docs/MDN/Contribute/Tools/Document_parameters, but from that it’s not clear to me how “$json.summary” is calculated. Note also that there’s some confusion about how to mark summaries.
add the “short_description” (or “summary”, maybe) property to the mdn/data schema, and write a script that can generate a PR to update it. I guess we would run the script whenever we need to make a new mdn/data release.

atopal · August 6, 2018, 7:56pm

Thanks Will, that sounds good to me.

sheppy · August 6, 2018, 10:06pm

The existing $json.summary (or SEO summary) is currently generated basically like this:

Is there a span of text with the class “seoSummary”? If so, get that text, convert it to a plain text string, and return that.
Otherwise, obtain the contents of the first <p> block in the article, convert it to plain text, and return it.

If there’s much more to it than that, I have not encountered any examples of it.

wbamberg · August 6, 2018, 10:32pm

This can’t be quite right though. Take for example this page: https://developer.mozilla.org/en-US/docs/Web/CSS/margin-block-start - it doesn’t use seoSummary, but the first paragraph is “<p>{{CSSRef}}{{SeeCompatTable}}</p>”.

chrisdavidmills · August 7, 2018, 11:32am

The plan in general makes sense to me, as long as we can work out how $json.summary is generated

sheppy · August 7, 2018, 10:03pm

OK, I’m looking at the code right now.

Determining the page summary

Look to see if there’s a section in the article with the heading “Summary”. If there is, all of the below steps are constrained to only work within that section (it treats the “Summary” section as if it were the entire article).
Look to see if the .seoSummary class is used in the article. If there is one,
a) If the summary has been requested as plain text, return the text contained within the block with the .seoSummary class as the final SEO summary string (after removing all HTML elements). This is how it’s fetched for tooltips.
b) If the summary is requested in HTML format, every block found with the .seoSummary class is concatenated together, retaining their HTML formatting, and returned as the summary.
If the .seoSummary class is not used in the article, the first paragraph found in the article which meets the requirements listed below is used as the page summary. As before, the summary may be requested as either plain text or in HTML format with its elements intact.

Requirements for a `<p>` to be used as a summary

The following must all be true for a paragraph to be used as the summary. As indicated in step 3 above, this is how the summary is selected if .seoSummary is not used on the page.

The paragraph must not be empty
The paragraph must not use the word “Redirect” (case-sensitive)
The paragraph must not include the character “«” (indicates a “previous page” link)
The paragraph must be a top-level element (to avoid picking up paragraphs inside <div> elements like notes and warnings)
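To make the paragraph rules easier to reason about, here is a rough Python sketch of the fallback path only (the case where .seoSummary is not used). It is a re-implementation from the description above, not Kuma's actual code, and it assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they must not affect the nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "wbr"}


class TopLevelParagraphs(HTMLParser):
    """Collect the plain text of each <p> that is not nested in another element."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # number of currently open elements
        self.current = None   # text chunks of the top-level <p> being read
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag == "p" and self.depth == 0:
            self.current = []
        self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        self.depth = max(0, self.depth - 1)
        if tag == "p" and self.depth == 0 and self.current is not None:
            self.paragraphs.append("".join(self.current).strip())
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.current.append(data)


def pick_summary(html):
    """Return the first top-level paragraph that passes the listed requirements."""
    parser = TopLevelParagraphs()
    parser.feed(html)
    parser.close()
    for text in parser.paragraphs:
        if not text:              # must not be empty
            continue
        if "Redirect" in text:    # case-sensitive check
            continue
        if "«" in text:           # "previous page" link marker
            continue
        return text
    return None


page = (
    '<div class="note"><p>A note, not a summary.</p></div>'
    "<p></p>"
    "<p>« Previous</p>"
    "<p>The <code>margin</code> CSS property sets the margin area.</p>"
)
print(pick_summary(page))
```

Because the note paragraph sits inside a div, it is never considered. This is also why a first paragraph that accidentally ends up wrapped in a div would be skipped.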

Thoughts

It seems like those requirements for paragraphs to be used as summaries are a little fragile in places, and I’m sure they don’t catch all the cases. What about pages that don’t use the “previous” link but do have a “next” link? Those start with “»”. And dropping paragraphs that use “Redirect” seems like it might be a bit of a sledgehammer for a toothbrush job.

One thing that does happen sometimes is that the first paragraph (or paragraphs) somehow wind up inside a <div> block (I think there are editor scenarios that do this). That will cause the wrong content to be selected.

I expect that with this in hand we can sort out why some pages are not getting the right content selected (especially those pages that are unexpectedly using a note or even the “technical review needed” box as the summary). More importantly, I think it will help us make headway on making the changes previously discussed in this topic.

wbamberg · August 27, 2018, 10:34pm

Following on from the discussion here I’ve updated the proposal, which now lives in the mdn/data wiki: https://github.com/mdn/data/wiki/CSS-property-short-descriptions.

In the last sprint Daniel Beck has done some great analysis of short descriptions: https://github.com/mdn/data/issues/261#issuecomment-416053904. I’d encourage anyone interested in this to read his analysis. If you have comments, it would probably be best to add them to the issue over there.
