Thanks stas for outlining that. I can see why that architecture works for JS, but I don’t think it works for Django, or for Jinja and the other templating languages as far as I understand them.
I realised as I thought about it that the biggest issue is that in Django usage, internationalized strings appear in lots of places, not only in HTML templates. In these other contexts localized strings all need to be handled as plain text. The places include:
- things like labels on model attributes (see the Django docs), which can often end up being combined with other plain text strings.
- strings inside templates which are not in HTML mode (e.g. a template used for a plain text email)
- other strings that never go through the template mechanism at all e.g. the subject of an email.
We also then have strings that need to be escaped in an HTML context. These are typically output by the template engine, and also by some other utils like format_html.
So, in Django world it is entirely possible to have messages like this:
markup-instructions = You can wrap text in <b> and </b> to make it bold
This message should appear without any escaping when used in a plain text context, and should be HTML escaped in an HTML context:
You can wrap text in &lt;b&gt; and &lt;/b&gt; to make it bold
The correct, secure and fast way to handle this in Django is to do absolutely nothing. In plain text contexts, no further processing is necessary, and it is in fact required that we leave this text exactly as it is.
In HTML Django templates, and in format_html, the autoescape mechanism will automatically do this escaping. The autoescape mechanism in Jinja does the same thing. The HTML generating code should deal with HTML escaping, and it does so well: XSS is considered a solved problem in the Django world (when it comes to server-side generation), and also in Jinja, along with the related problem of avoiding double escaping, which we handle with mark_safe (badly named, should be mark_html) and MarkupSafe.
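To make the two contexts concrete, here is a minimal, stdlib-only sketch of the idea behind autoescaping and mark_safe. The names SafeHtml and conditional_escape below are simplified stand-ins written for illustration; they are not Django's or MarkupSafe's actual implementations.

```python
import html


class SafeHtml(str):
    """Illustrative marker type: text that is already HTML."""

    def __html__(self):
        return self


def conditional_escape(value):
    """Escape plain text for HTML, but pass already-marked HTML through."""
    if hasattr(value, "__html__"):
        return value.__html__()
    return SafeHtml(html.escape(str(value)))


message = "You can wrap text in <b> and </b> to make it bold"

# Plain text context (e.g. an email subject): use the message exactly as it is.
plain_output = message

# HTML context: escape exactly once, at the point of output.
html_output = conditional_escape(message)

# Escaping a value that is already marked as HTML is a no-op,
# which is what prevents double escaping.
assert conditional_escape(html_output) == html_output
```

The message layer itself does nothing here; only the HTML-generating code decides whether escaping is needed.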
This is different from the client side cases where you are always going to be inserting into the DOM, and where there often isn’t an existing framework for doing this stuff.
Of course, we then have the case of messages that need to embed HTML in them, like I had before:
confirm-email = Hello { $user }, please <a href="{ $url }">confirm your email address</a> to continue
Here the <a> must not be escaped. This is relatively rare, but there are still plenty of instances. So we’ve got different escaping rules required for different messages. We cannot tell which is which by guessing.
In your model, we solve this ‘from the outside’ of fluent.runtime by input processing and output processing. We would need some kind of method to know whether to do nothing, as above, or do the input and output processing - perhaps a naming convention.
For example, message ids ending in -html are treated as HTML:
confirm-email-html = Hello { $user }, please <a href="{ $url }">confirm your email address</a> to continue
Now we need to escape the $user and $url inputs before they go into FluentBundle.format, and then mark the whole output of FluentBundle.format as HTML so that it won’t be escaped again.
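A rough sketch of this "from the outside" approach might look like the following. It does not use fluent.runtime: format_pattern is a toy stand-in that only substitutes `{ $name }` placeholders, and SafeHtml and format_from_outside are hypothetical names used only for illustration.

```python
import html
import re


class SafeHtml(str):
    """Illustrative marker type: text that is already HTML."""


def format_pattern(pattern, args):
    """Toy stand-in for message formatting: substitutes `{ $name }` only."""
    return re.sub(r"\{ \$(\w+) \}", lambda m: str(args[m.group(1)]), pattern)


MESSAGES = {
    "markup-instructions": "You can wrap text in <b> and </b> to make it bold",
    "confirm-email-html": (
        'Hello { $user }, please <a href="{ $url }">'
        "confirm your email address</a> to continue"
    ),
}


def format_from_outside(message_id, args=None):
    args = args or {}
    pattern = MESSAGES[message_id]
    if message_id.endswith("-html"):
        # Input processing: escape every argument before formatting.
        escaped = {key: html.escape(str(value)) for key, value in args.items()}
        # Output processing: mark the whole result as HTML.
        return SafeHtml(format_pattern(pattern, escaped))
    # Plain text message: do nothing.
    return format_pattern(pattern, args)


confirm = format_from_outside(
    "confirm-email-html", {"user": "<script>", "url": "/confirm?a=1&b=2"}
)
instructions = format_from_outside("markup-instructions")
```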
This seems to work at first - we can have two different strategies for different types of messages. However, messages can refer to other messages and terms:
welcome-html = Welcome to <b>{ -brand }</b>!
thank-you-from-us = Thank you from your friends at { -brand }.
-brand = Jack & Jill
(I’m using & as a shorthand for “text that needs to be escaped in HTML context but must not be escaped in other contexts”).
With this addition, it’s simply not possible to get this correct unless fluent.runtime gains some understanding of the different escaping contexts.
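The problem can be shown with a toy resolver that expands term references but knows nothing about escaping. This is only an illustrative sketch (format_unaware is a hypothetical function, not part of fluent.runtime); each of the three obvious workarounds gets one of the outputs wrong.

```python
import html
import re

TERMS = {"brand": "Jack & Jill"}
MESSAGES = {
    "welcome-html": "Welcome to <b>{ -brand }</b>!",
    "thank-you-from-us": "Thank you from your friends at { -brand }.",
}


def format_unaware(message_id, terms=TERMS):
    """Toy resolver that expands `{ -term }` with no notion of escaping."""
    return re.sub(
        r"\{ -(\w+) \}",
        lambda m: terms[m.group(1)],
        MESSAGES[message_id],
    )


# Option 1: do nothing. The HTML message now contains a raw "&".
unescaped_html = format_unaware("welcome-html")

# Option 2: escape the whole output. The translator's <b> markup is destroyed.
over_escaped_html = html.escape(format_unaware("welcome-html"))

# Option 3: store the term pre-escaped. The HTML message is now right,
# but the plain text message shows a literal "&amp;".
escaped_terms = {name: html.escape(text) for name, text in TERMS.items()}
fixed_html = format_unaware("welcome-html", escaped_terms)
broken_text = format_unaware("thank-you-from-us", escaped_terms)
```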
Another issue is the way that in Django and other systems, we can mark blocks of text as already HTML escaped, and we might want to pass these into messages. For example, I have messages like this in the project I’m internationalizing:
award-received-html = { $username } earned award { $award }
In this case, $username and $award could be just text, but in fact they are links. I have utilities that create pre-built links to users, like this:
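from django.urls import reverse
from django.utils.html import format_html


def account_link(account):
    # Link to the user's stats page, with their full name as the tooltip.
    return link(reverse('user_stats', args=(account.username,)),
                account.username,
                title="{0} {1}".format(account.first_name,
                                       account.last_name))


def link(url, text, title=None):
    # format_html escapes each argument and marks the result as HTML.
    if title is None or title.strip() == "":
        return format_html('<a href="{0}">{1}</a>',
                           url, text)
    else:
        return format_html('<a href="{0}" title="{1}">{2}</a>',
                           url, title, text)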
In this case, I used <a href> but I could just as well have used <a data-username...> or <a onclick="...">. The use of format_html here marks the text as HTML so that it doesn’t need to be escaped again.
We don’t want the HTML for this to have to appear in every message that features a username - we want to be able to generate the HTML correctly in one place and pass it through. We also don’t want to have to change this kind of code just so that django-ftl can handle it - something designed for Django should work with this code.
This rules out post-processing to remove dangerous HTML, because we have no idea what HTML is benign and what is malicious. In any case, post-processing in Django/Jinja is:
- Unnecessary, if we just escape correctly, which we can.
- Very expensive.
Another problem is custom functions. If we pre-process all inputs, custom functions will need to deal with escaped text, when they might not be expecting that. In addition, they may accidentally introduce characters that ought to have been escaped. If we escape their output as well as input we get double escaping.
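As a small illustration of the custom function problem, consider a hypothetical truncate function (invented for this example) that keeps the first few characters of its argument. The sketch uses only the standard library's html.escape:

```python
import html


def truncate(text, limit):
    """Hypothetical custom function: keep the first `limit` characters."""
    return text[:limit]


name = "Jack & Jill"

# Pre-escaping the input: the function cuts the "&amp;" entity in half.
pre_escaped = truncate(html.escape(name), 8)

# Escaping both input and output: the ampersand is escaped twice.
double_escaped = html.escape(truncate(html.escape(name), 10))

# Escaping only at the point of interpolation: the function sees plain text.
correct = html.escape(truncate(name, 8))
```

Only the last variant gives the function the text it expects and still produces valid HTML.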
The escapers mechanism I’ve come up with handles all the above cases correctly. For django-ftl it uses a naming convention as above to distinguish types of messages (because there isn’t really any other option at the moment; in the future we could use semantic comments).
If you use the wrong message id, and therefore the wrong escaping, you still won’t end up with an XSS; you will get double escaping (or single escaping where none was required).
The boundaries between different types of messages are respected, so that a plain text term or message will be escaped if it is included in an HTML message. Escaping is done at the right point - the point where you interpolate - and correctly handles already escaped text marked with MarkupSafe/mark_safe.
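The idea of escaping at the point of interpolation can be sketched roughly as follows. This is a toy, stdlib-only model and not the actual django-ftl or fluent.runtime escaper API: SafeHtml, escape_html and format_message are hypothetical names, and the resolver only understands `{ $arg }` and `{ -term }` placeables.

```python
import html
import re


class SafeHtml(str):
    """Illustrative marker type: text that is already HTML."""

    def __html__(self):
        return self


def escape_html(value):
    """Escape plain text, but leave already-marked HTML untouched."""
    if hasattr(value, "__html__"):
        return value
    return SafeHtml(html.escape(str(value)))


TERMS = {"brand": "Jack & Jill"}
MESSAGES = {
    "welcome-html": "Welcome to <b>{ -brand }</b>!",
    "thank-you-from-us": "Thank you from your friends at { -brand }.",
    "award-received-html": "{ $username } earned award { $award }",
}
PLACEABLE = re.compile(r"\{ (\$|-)(\w+) \}")


def format_message(message_id, args=None):
    args = args or {}
    # Naming convention: the message id selects the escaping rules.
    is_html = message_id.endswith("-html")

    def interpolate(match):
        kind, name = match.groups()
        value = args[name] if kind == "$" else TERMS[name]
        # Escape each value as it crosses into an HTML message.
        return str(escape_html(value)) if is_html else str(value)

    result = PLACEABLE.sub(interpolate, MESSAGES[message_id])
    return SafeHtml(result) if is_html else result


welcome = format_message("welcome-html")
thanks = format_message("thank-you-from-us")
award = format_message(
    "award-received-html",
    {"username": SafeHtml('<a href="/u/ann">ann</a>'), "award": "Gold <1>"},
)
```

In this sketch the plain text term is escaped only when it is included in an HTML message, and the pre-built link passes through without being escaped a second time.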
In terms of performance overhead, for messages with no escapers applied, there is a very low runtime overhead for the resolver, and zero for the compiler.