Python fluent.runtime - plans

l10n

(Luke Plant) #1

I released fluent.runtime 0.1 just now :champagne:

We had some discussion on what happens next on GitHub, which we’re moving here.

Some big items for going forward are:

  1. Documentation
  2. Updating to Fluent 0.8 spec (I have a branch for that).
  3. Performance improvements, especially the second FluentBundle implementation I’ve done that uses a compile-to-Python strategy (Old PR and a more recent branch that is updated for Fluent 0.8 spec.)
  4. Higher level tools, including django-ftl.
  5. django-ftl depends on one further feature that I would like to add to fluent.runtime, namely the ability to add ‘escapers’, which enable you to do things like have HTML embedded in your messages and get all the escaping issues right. (There is an old branch for escapers, but I am working on rebasing this work from scratch because I really messed up several merges).

@Pike mentioned dog-fooding - do we want to start using this stuff anywhere? I’m starting to use django-ftl in a personal project, and trying to improve the design as I go, but it will take some time to go through this whole project and FTL-ise all the strings.


(Axel) #2

Thanks for opening this thread. I’ll be probably going with a reply per reply.

One part here is porting fluent-dom to django/python? It took stas and me quite some time to understand better which role DOM is playing in Fluent, and in the end, we were most happy with stuff like that being specific to the binding. So the escaping here shouldn’t be part of fluent.runtime. In client-side js, we ended up subclassing the localization class, but the gist of it is that we post-process the returned string. There might be something here that’s mid-way between fluent.runtime and django-fluent (I’m vouching for a rename :wink: ), as this functionality is related to more server-side html templating than just django.

A related part is pseudo localization. That one we actually put into the core resolver context, because it’s important that it only transforms literal text, and not placeables.


(Axel) #3

I agree that we’ll need to talk about performance of Fluent in python.

I’m looking at this aspect from this POV: We’re discussing a new l10n infra for www dot mozilla dot org. Thanks to your efforts, Fluent is one of the alternatives. But that also means high traffic, 100+ languages, many pages in many languages only rarely accessed.

I’m less worried about loading the same page 100 times in quick succession, 'cause we have load balancers for that.

I do care about “bootstrapping” performance, and even more so, security.

I just looked at the branch you have once more. And what I read there is that it calls into eval on the live process? I’d really like to avoid having that conversation with infra-sec :frowning:.

I’d prefer to evaluate other approaches, and see if we can get performance that’s close to that. Maybe creating functions for each message/term/attribute, which for simple strings just return a value, and for complex strings call into the resolver?

As for bootstrapping, I think there are some opportunities to speed up the parser, too. Just getting IDs and text by regular expressions instead of single-char iteration might get us a long way.


(Luke Plant) #4

I can’t see how you can do this correctly with post processing, especially in the context of Django, at least if you want to do server-side rendering of translated messages (which I do).

Suppose we have a Django template that currently looks like this:

<p>Hello {{ user }}, please <a href="{% url 'some-url' %}">confirm your email address</a> to continue.</p>

For this to work with FTL, at least some of the HTML has to become a part of the FTL message, e.g. something like this:

please-confirm = Hello { $user }, please <a href="{ $url }">confirm your email address</a> to continue.

(This kind of message hopefully wouldn’t constitute the majority of your messages, because you’d want to avoid HTML in FTL for the sake of translators, but it does come up a fair amount - in the project I’m working on I’ve got about 30 instances out of about 500 messages).

Now, for this to be correct and safe from XSS attacks, { $user } must be HTML escaped. In the original template, the {{ user }} string has this done automatically by the template engine.

For the FTL-ised version, we must have some way to call the FluentBundle.format method, passing the user argument. However, we can’t just HTML escape the whole output, because that would escape <a> to &lt;a&gt; which is not what we want. Neither can we escape just the { $user } part as a post-process task, because after fluent.runtime has finished interpolating, we don’t know which bit was the { $user } bit and which bit was the translator/developer supplied text and HTML - we’ve just got a single structureless string back.
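To make the problem concrete, here is a minimal sketch using only the standard library. `fake_format` is a hypothetical stand-in for FluentBundle.format that does naive substitution; it is not the real API, and the attack string is made up for illustration.

```python
import html

MESSAGE = ('Hello { $user }, please <a href="{ $url }">'
           'confirm your email address</a> to continue.')


def fake_format(message, args):
    """Hypothetical stand-in for FluentBundle.format: naive substitution."""
    out = message
    for name, value in args.items():
        out = out.replace("{ $%s }" % name, value)
    return out


args = {"user": "<script>alert(1)</script>", "url": "/confirm/"}

# 1. No escaping: the attacker-controlled markup survives in the output.
raw = fake_format(MESSAGE, args)

# 2. Escaping the whole output: safe, but the <a> from the message is
#    escaped as well, so the link is destroyed.
escaped_all = html.escape(raw)

# 3. Escaping each argument before it is interpolated: the only variant
#    here that is both safe and keeps the <a> intact.
escaped_args = fake_format(
    MESSAGE, {name: html.escape(value) for name, value in args.items()}
)
```

Once `raw` has been produced there is no way to tell which characters came from `$user`, which is why variant 3 has to happen before or during interpolation rather than afterwards.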

We could require that the developer HTML escapes $user when they pass it in, and then we don’t HTML escape the output any further for these messages. But this requires the developer to remember to do it every time, or face a bug and an XSS exploit, which is unacceptable in the Django template security philosophy.

So, I can’t see an alternative to having fluent.runtime gain some kind of facility to handle escaping - but please describe if you can!

The feature as I’ve built it doesn’t hard code any particular escaping, it just provides a generic mechanism, which I’ve tested works with several different systems.

django-fluent seems to be taken already, otherwise I would definitely have gone for it.


(Luke Plant) #5

My POV is almost opposite - dynamic pages that can almost never be cached, and/or small projects that would never have the money or need for load balancers. I’m hoping we’ll be able to cater for both of these use cases!

Yes, there is an ‘exec’/‘eval’, but:

  1. It is not done on anything derived from runtime arguments i.e. arguments to messages. It is done only on strings derived from FTL files. Certainly there is a possible attack vector here i.e. malicious or compromised translators, but it is a very different situation from eval’ing data that is coming in over the internet, for instance.

  2. This is not an unheard of technique for fast code. For example, Jinja2, which is one of the best and most popular template engines out there, does the same thing. It is certainly possible to get it right without introducing security vulnerabilities.

  3. It is not a massive surface area of code to review. Thanks to your prompt, I’ve improved the way that some of the code is structured and added more asserts, so you can review all the as_source_code methods in codegen.py and see where there might be holes without having to look at the rest of the code.

Certainly that would work to some extent, but I think the compiler approach offers a lot that you simply can’t get otherwise.

For instance, with the compiler many of the more advanced features become essentially free. Terms can be evaluated and inlined at compile time, so you pay nothing for them at runtime. You also get compile-time checking of many possible errors (exposed using the check_messages method).

Further, by generating Python, implementations like PyPy can really excel. The result is that with PyPy a simple message that has a single string substitution with the compiler is twice as fast as using gettext with ‘%’ style interpolation (which is the way that Django’s gettext usage does it), while the resolver is more than 10 times slower than gettext. With CPython 3.6 for this case, the compiler is still pretty good - only about 15% slower than gettext, while the resolver is 25 times slower than gettext. The resolver could probably be improved, but it is never going to compete with the compiler.

This stuff does matter. I often hear of people switching from Django templates to Jinja2, at some considerable effort, because Django’s template engine is too slow (and because of design issues it is basically impossible to use the technique from Jinja2 to speed it up), and I’ve had to do it myself.

We are in a situation where:

  • solutions like gettext are dominant.
  • these solutions are already 95% there for most people in terms of features.
  • and they are known to be fast enough.

If we want to convince people to try something new, 10-25 times slower than an existing solution for a common case is not an attractive sell. Yes, fluent is doing a lot more, at least potentially, but the performance cost is way beyond what seems reasonable for the common cases. And all the additional features you get are coming at even larger performance costs - while with the compiler, many of the additional features come at almost zero cost.

So, this is why I’d suggest simply having multiple implementations, which is how my branch is currently. We can clearly spell out the advantages and possible disadvantages. I’d be happy for the default to be the resolver (or a simplified ‘eval-free’ compiler that uses the resolver for anything beyond a static message) if that will make some security folks happy. But I’m pretty confident that the current compiler code doesn’t have serious holes, or at least that some code review will be enough to find them if there are.


(Staś Małolepszy) #6

The way we handled this in fluent.js is with an abstraction over FluentBundle. We called it Localization and it was responsible for managing the ordered sequence of single-language bundles which corresponded to the user’s language preferences. This allowed for a graceful fallback in cases when a translation was not available in the user’s first (or second, etc) language.

The Localization object has a format(id, args) method which iterates over the current sequence of bundles and selects the best one to format the translation called id. The exact strategy for deciding what best means is implementation-dependent. In fluent.js we settled on simply checking if the translation exists (bundle.hasMessage). In Python, we could also look at the errors returned by bundle.format and fall back to the next bundle if the errors are grave.
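A rough Python sketch of that fallback loop might look like the following. The class and method names are assumptions made for illustration: `DictBundle` is a stand-in, not FluentBundle, and "grave errors" is simplified here to "any error at all".

```python
class Localization:
    """Hypothetical sketch: tries each single-language bundle in order."""

    def __init__(self, bundles):
        # Bundles are ordered by the user's language preference.
        self.bundles = list(bundles)

    def format(self, message_id, args=None):
        for bundle in self.bundles:
            if not bundle.has_message(message_id):
                continue
            value, errors = bundle.format(message_id, args)
            if not errors:
                return value
            # Errors found: fall through to the next bundle.
        # Last resort: show the message id itself.
        return message_id


class DictBundle:
    """Stand-in for a single-language bundle, for demonstration only."""

    def __init__(self, messages):
        self.messages = messages

    def has_message(self, message_id):
        return message_id in self.messages

    def format(self, message_id, args=None):
        try:
            return self.messages[message_id].format(**(args or {})), []
        except KeyError as exc:
            # A missing argument is reported as an error, not raised.
            return None, [exc]


fr = DictBundle({"hello": "Bonjour {name}"})
en = DictBundle({"hello": "Hello {name}", "bye": "Goodbye"})
l10n = Localization([fr, en])
```

With this setup, `l10n.format("hello", {"name": "Ana"})` is served by the French bundle, while `l10n.format("bye")` falls back to English because French has no such message.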

In fluent.js we also had a DOMLocalization subclass which handled the logic specific to the DOM bindings. (ReactLocalization was another one.) This is where the XSS protection was implemented. In the case of Django bindings, it might be a good idea to start with a simpler design where there’s only one DjangoLocalization class rather than a hierarchy of inheritance.

Localization.format(id, args) should protect against XSS in two ways: by pre-processing the input arguments, and by post-processing the final translation as returned by bundle.format.

  1. Pre-processing: escape arguments which are strings before handing them off to bundle.format, or wrap them in a subtype of FluentString which escapes them in its format() method. I think the latter approach would be the one I prefer as it allows for greater flexibility: you could still use the original value of the argument for comparisons, for example.
  2. Post-processing: sanitize the resulting translation (with interpolated arguments already escaped). Keeping the <a> might be fine, but any unexpected <script> or <img src="xxx" onerror=""> should be removed.
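The wrapper idea from point 1 could be sketched roughly as below. `HtmlEscapedString` is a hypothetical class that stands in for a FluentString-like subtype; the `format(locale)` signature is an assumption, not something confirmed in this thread.

```python
import html


class HtmlEscapedString:
    """Hypothetical argument wrapper that escapes only when formatted."""

    def __init__(self, value):
        # The original value stays available, e.g. for comparisons.
        self.value = value

    def __eq__(self, other):
        return self.value == getattr(other, "value", other)

    def __hash__(self):
        return hash(self.value)

    def format(self, locale=None):
        # Escaping happens at output time only.
        return html.escape(self.value)


arg = HtmlEscapedString("Tom & <Jerry>")
```

Here `arg == "Tom & <Jerry>"` still holds, so selectors could compare against the raw value, while `arg.format()` yields the escaped form.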

Does this sound like a good architecture for Django? It has worked well for us in fluent.js and I’m curious to hear your thoughts about it.


(Axel) #7

I’d like to dissect two things here:

One is the question about exec. The other is about creating executable data structures instead of interpreting ones.

The latter is something I’d really like to see. When I looked at perf of the resolver a while back, @dispatch was dominant, so replacing that with a direct callable should get good performance. I can also see how the benchmarks get that code hot in pypy’s JIT, and add additional performance wins.

To the exec point, from what I’ve found on the internets, I don’t see how the use of exec adds to that performance gain? Would you have docs for that? I found https://pypy.org/performance.html, https://pypy.org/compat.html, and a rather old http://lucumr.pocoo.org/2011/2/1/exec-in-python/. I only found references to pypy’s JIT there.


(Luke Plant) #8

When talking about the performance of exec, I was making a different point to the one Armin is discussing on his page, as far as I can tell. (I think he is mainly talking about the performance of execfile vs import. Armin also recommends use of compile + exec instead of just exec, but in our case it makes no difference to performance because we only use it once and can’t re-use the code objects that compile produces).

In particular, I’m saying suppose we have this:

my-message = Hello from { -brand }!
-brand = MyBrand

The compiler strategy produces a function like this:

def my_message(args, errors):
    return "Hello from MyBrand!"

And then uses exec to put it into a namespace where we can look it up and call it.
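As a minimal illustration of that last step, the snippet below hard-codes the generated source (a real compiler would produce it from the FTL) and uses exec to load it into a namespace. The file name `<fluent-compiled>` is just an arbitrary label.

```python
# Source code as the compiler might emit it; hard-coded here for the sketch.
source = '''
def my_message(args, errors):
    return "Hello from MyBrand!"
'''

namespace = {}
# Compile once, then execute the module-level code so that the function
# definition lands in `namespace`.
exec(compile(source, "<fluent-compiled>", "exec"), namespace)

my_message = namespace["my_message"]
errors = []
result = my_message({}, errors)
```

After this, each call to format for `my-message` is a plain Python function call with no AST traversal.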

Any other strategy that relies on traversing the AST is going to have a very hard time competing with this or even coming close. With a compiler, we can also generate efficient specialized code for cases involving looking up arguments and calls to functions etc, and the PyPy JIT could do clever things on the resulting code making them even faster.

Of course, as you said we could make a compiler that matches some simple cases, and for those ones returns something that is just as performant. Something like this:

def simple_compiler(message):
    # Special case: the pattern consists of exactly one string literal.
    if (isinstance(message.value, Pattern) and
            len(message.value.elements) == 1 and
            isinstance(message.value.elements[0], StringLiteral)):
        body = message.value.elements[0].value
        errors = []

        def message_func(args):
            return (body, errors)

    else:
        # Otherwise delegate to the resolver
        def message_func(args):
            return resolver.resolve(message, args)

    return message_func

(This is not meant to be real working code, and ignoring distinctions about whether we pass in/return the errors list, which can work differently)

I was not claiming that a solution that involves exec would be any faster than this (for the special case of a single string with no substitutions) - I suspect they would be extremely similar - and this would be a good optimisation for the resolver.

However, this approach doesn’t scale. We can’t make a special cased message_func that matches every kind of message that we might find, and even with a large effort we are going to end up with stuff being done inside the function that could have been done outside the function by a compiling strategy, generic code instead of specialized code.

To build up executable message functions like the one the compiler does, as above, there are different options. We could build a Python bytecode object, and in this way you could avoid needing the exec call - you create a code object and then pass it to types.FunctionType to create a function. However, building bytecode is a horrible API for writing code, probably not portable between different Python versions, and massively harder to test compared to the test_compiler.py tests which involve readable Python functions. That’s why building up source code as a string and exec’ing seems to be the best option. It’s how Jinja, Mako and Genshi all do it, possibly others.

Hopefully that clarifies what I was trying to say.


(Luke Plant) #9

I had another thought - for an ‘exec free’ compiler, we could use parts of the compiler implementation to cover more cases than I outlined above. In particular, we could handle the case of a message that was entirely statically defined strings, such as the example with an evaluated and inlined term. The simple_compiler function would call the existing compiler machinery. If the end result after simplification was a single codegen.String object, with no errors found, then that string value could be used, otherwise it falls back to resolver.

In this way it would avoid ‘exec’ entirely. This would come with the disadvantage of a heavier upfront compilation mechanism.


(Axel) #10

Thanks for the response, getting to an ‘exec-free’ compiler was what I had in mind.

I actually thought I should make that more tangible, so I mocked something up last night, and given it just a little bit of polish this morning.

I’ve put that code up on https://gist.github.com/Pike/1faef13e891a73c9835bf4b895c59987.

The starting point is a runtime tree, which can be instantiated either for a resource or just an entry. Most of the AST constructs in the parser have a matching runtime AST, I just dropped some abstract base classes, I think.

The __call__ methods on those would effectively be what your code generator would create, to a significant part. Only the literals and pattern are interesting. I see that my NumberLiteral is a good deal away from your implementation still.

The Compiler class is just the visitor that transforms parser->runtime, plus an optimizer. I’ve demoed taking string literals out of placeables, and concating string literals in patterns, and fast-forwarding simple text instead of pattern expression.
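The following is not the code from the gist, just a small sketch of the same idea under simplified assumptions: runtime nodes are callables taking `(args, errors)`, and an optimizer merges adjacent text and collapses all-text patterns. All class names are illustrative.

```python
class Text:
    def __init__(self, value):
        self.value = value

    def __call__(self, args, errors):
        return self.value


class VariableReference:
    def __init__(self, name):
        self.name = name

    def __call__(self, args, errors):
        try:
            return str(args[self.name])
        except KeyError:
            # Record the problem and fall back to a visible placeholder.
            errors.append(LookupError(self.name))
            return "{$%s}" % self.name


class Pattern:
    def __init__(self, elements):
        self.elements = elements

    def __call__(self, args, errors):
        return "".join(el(args, errors) for el in self.elements)


def optimize(pattern):
    """Merge adjacent Text nodes; collapse an all-text pattern to one Text."""
    merged = []
    for el in pattern.elements:
        if isinstance(el, Text) and merged and isinstance(merged[-1], Text):
            merged[-1] = Text(merged[-1].value + el.value)
        else:
            merged.append(el)
    if len(merged) == 1 and isinstance(merged[0], Text):
        return merged[0]
    return Pattern(merged)


static = optimize(Pattern([Text("Hello from "), Text("MyBrand"), Text("!")]))
dynamic = optimize(Pattern([Text("Hello, "), VariableReference("user"), Text("!")]))
```

A fully static message ends up as a single `Text` node, so formatting it is one call that returns a stored string; no exec is involved at any point.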

Obviously, that code isn’t even trying to be complete :wink:

Ideas based on that concept that go around in my head:

  • create a class IsolatingVarRef(VariableReference) when isolation is wanted
  • keep parsed and runnable members on bundle, and only compile messages on demand
  • loop detection could be done at compile time, depending on whether we’d want loop detection to kick in on any possible loop, or just when actually triggering a loop

I did also spend more time looking at your compiler branch, and just want to emphasize that I realize how much energy you’ve put into that.


(Staś Małolepszy) #11

It looks like there are many approaches to compiling, and it might be helpful to define what we mean by it. I’m going to try to summarize the strategies discussed so far as well as add a few more thoughts.

  1. What landed in #81 is an interpreter; on each call to format it takes the Fluent AST and walks through it, handling each node according to some rules, and producing a string output.

  2. What @Pike posted in his gist is a partially evaluated interpreter. It takes the AST once during initialization and creates Python functions in memory (often called residuals). Each call to format translates to a call to one of these residuals. Residual functions take translation argument as input.

  3. @spookylukey’s compiler branch takes the AST and prints it into valid Python code. This code is then exec-ed, which creates Python functions in memory, ready to be called every time format is called.

  4. Yet another way would be to transform the Fluent AST into valid Python AST, then compile() and exec() it. This is similar to #3, except that it transforms Fluent AST into a Python AST rather than to text representing Python code. This might have the benefit of making it possible to run additional checks on the generated Python AST using tools designed for this purpose.
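Option 4 can be sketched with the standard library `ast` module. The helper below is hypothetical and handles only a static message; it parses a fixed skeleton and splices the message text in as an AST constant, so the text never passes through Python source code.

```python
import ast


def compile_static_message(func_name, text):
    """Build `def <func_name>(args, errors): return <text>` via a Python AST.

    Hypothetical helper for illustration; not part of fluent.runtime.
    """
    if not func_name.isidentifier():
        raise ValueError("not a valid Python identifier: %r" % func_name)
    # The skeleton is a fixed string; message content is never formatted
    # into it.
    module = ast.parse("def placeholder(args, errors):\n    pass\n")
    func = module.body[0]
    func.name = func_name
    # The message text becomes a Constant node, so quotes or newlines in
    # it cannot change the structure of the generated code.
    func.body = [ast.Return(value=ast.Constant(value=text))]
    ast.fix_missing_locations(module)
    namespace = {}
    exec(compile(module, "<fluent-ast>", "exec"), namespace)
    return namespace[func_name]


tricky = 'He said "hi"); import os  # \n'
my_message = compile_static_message("my_message", tricky)
```

Calling `my_message({}, [])` returns the tricky string unchanged, which is the property that makes this approach attractive compared to building source text.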

Is this a good summary of the possibilities?

As an additional optimization, we could also consider how well each approach lends itself to serializing the output of the compilation. I.e. would it be possible to run the compiler on build-time rather than once during runtime? In particular:

  • In #2, could pickle be used to serialize the residuals and then read them into memory on runtime?
  • In #3 and #4, I guess it would be possible to produce .py files on buildtime and import them on runtime?

(Luke Plant) #12

Thanks stas for outlining that. I can see why that architecture works for JS, but I don’t think it works for Django, or for Jinja and the other templating languages as far as I understand them.

I realised as I thought about it that the biggest issue is that in Django usage, internationalized strings appear in lots of places, not only in HTML templates. In these other contexts localized strings all need to be handled as plain text. The places include:

  • things like labels on model attributes (see the Django docs), which can often end up being combined with other plain text strings.
  • strings inside templates which are not in HTML mode (e.g. a template used for a plain text email)
  • other strings that never go through the template mechanism at all e.g. the subject of an email.

We also then have strings that need to be escaped in an HTML context. These are typically output by the template engine, and also by some other utils like format_html.

So, in Django world it is entirely possible to have messages like this:

markup-instructions = You can wrap text in <b> and </b> to make it bold

This message should appear without any escaping when used in a plain text context, and should be HTML escaped in an HTML context:

You can wrap text in &lt;b&gt; and &lt;/b&gt; to make it bold

The correct, secure and fast way to handle this in Django is to do absolutely nothing. In plain text contexts, no further processing is necessary, and it is in fact required that we leave this text exactly as it is.

In HTML Django templates, and in format_html, the autoescape mechanism will automatically do this escaping. Similarly the autoescape mechanism in Jinja does the same thing. The HTML generating code should deal with HTML escaping, and it does so well - XSS is considered a solved problem in Django world (when it comes to server-side generation), and also in Jinja, along with the related problem of avoiding double escaping, which we handle with mark_safe (badly named, should be mark_html) and MarkupSafe.

This is different from the client side cases where you are always going to be inserting into the DOM, and where there often isn’t an existing framework for doing this stuff.

Of course, we then have the case of messages that need to embed HTML in them, like I had before:

confirm-email = Hello { $user }, please <a href="{ $url }">confirm your email address</a> to continue 

Here the <a> must not be escaped. This is relatively rare, but there are still plenty of instances. So we’ve got different escaping rules required for different messages. We cannot tell which is which by guessing.

In your model, we solve this ‘from the outside’ of fluent.runtime by input processing and output processing. We would need some kind of method to know whether to do nothing, as above, or do the input and output processing - perhaps a naming convention.

For example, message ids ending in -html are treated as HTML:

confirm-email-html = Hello { $user }, please <a href="{ $url }">confirm your email address</a> to continue 

Now we need to escape the $user and $url inputs before they go into FluentBundle.format, and then mark the whole output of FluentBundle.format as HTML so that it won’t be escaped again.

This seems to work at first - we can have two different strategies for different types of messages. However, messages can refer to other messages and terms:

welcome-html = Welcome to <b>{ -brand }</b>!

thank-you-from-us = Thank you from your friends at { -brand }.

-brand = Jack & Jill

(I’m using & as a shorthand for “text that needs to be escaped in HTML context but must not be escaped in other contexts”).

With this addition, it’s simply not possible to get this correct unless fluent.runtime gains some understanding of the different escaping contexts.

Another issue is the way that in Django and other systems, we can mark blocks of text as already HTML escaped, and we might want to pass these into messages. For example, I have messages like this in the project I’m internationalizing:

award-received-html = { $username } earned award { $award }

In this case, $username and $award could be just text, but in fact they are links. I have utilities that create pre-built links to users, like this:

def account_link(account):
    return link(reverse('user_stats', args=(account.username,)),
                account.username,
                title="{0} {1}".format(account.first_name,
                                       account.last_name))

def link(url, text, title=None):
    if title is None or title.strip() == "":
        return format_html('<a href="{0}">{1}</a>',
                           url, text)
    else:
        return format_html('<a href="{0}" title="{1}">{2}</a>',
                           url, title, text)

In this case, I used <a href> but I could just as well have used <a data-username...> or <a onclick="...">. The use of format_html here marks the text as HTML so that it doesn’t need to be escaped again.

We don’t want the HTML for this to have to appear in every message that features a username - we want to be able to generate the HTML correctly in one place and pass it through. We also don’t want to have to change this kind of code just so that django-ftl can handle it - something designed for Django should work with this code.

This rules out post-processing to remove dangerous HTML, because we have no idea what HTML is benign and what is malicious. But post-processing in Django/Jinja is:

  1. Unnecessary, if we just escape correctly, which we can.
  2. Very expensive.

Another problem is custom functions. If we pre-process all inputs, custom functions will need to deal with escaped text, when they might not be expecting that. In addition, they may accidentally introduce characters that ought to have been escaped. If we escape their output as well as input we get double escaping.

The escapers mechanism I’ve come up with handles all the above cases correctly. For django-ftl it uses a naming convention as above to distinguish types of messages (because there isn’t really any other option at the moment, in the future we could use semantic comments).

If you use the wrong message id, and therefore the wrong escaping, you still won’t end up with an XSS, you will get double escaping (or single where none was required).

The boundaries between different types of messages are respected, so that a plain text term or message will be escaped if it is included in an HTML message. Escaping is done at the right point - the point where you interpolate - and correctly handles already escaped text marked with MarkupSafe/mark_safe.
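The rule "escape at the point of interpolation, unless already marked" can be illustrated with a small standard-library sketch. This is not the escapers implementation: `SafeHtml` is a hypothetical marker class playing the role of mark_safe/MarkupSafe, and the `(kind, value)` pairs are a made-up simplification of a resolved pattern.

```python
import html


class SafeHtml(str):
    """Hypothetical marker for text that is already valid, escaped HTML."""


def interpolate_html(parts):
    """Join the parts of an HTML message, escaping interpolated values.

    `parts` is a list of (kind, value) pairs: "literal" for text written
    in the HTML message itself, "value" for anything interpolated into it
    (arguments, plain text terms, other plain text messages).
    """
    out = []
    for kind, value in parts:
        if kind == "literal" or isinstance(value, SafeHtml):
            out.append(value)  # trusted markup, or already escaped
        else:
            out.append(html.escape(value))  # plain text crossing into HTML
    # The result is itself HTML, so mark it to avoid double escaping later.
    return SafeHtml("".join(out))


# welcome-html = Welcome to <b>{ -brand }</b>!  with  -brand = Jack & Jill
welcome = interpolate_html([
    ("literal", "Welcome to <b>"),
    ("value", "Jack & Jill"),
    ("literal", "</b>!"),
])

# award-received-html with a pre-built link for $username
user_link = SafeHtml('<a href="/u/ana">ana</a>')
award = interpolate_html([
    ("value", user_link),
    ("literal", " earned award "),
    ("value", "<Gold>"),
])
```

The plain text term is escaped exactly once when it enters the HTML message, and the pre-built link passes through untouched because it carries the marker.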

In terms of performance overhead, for messages with no escapers applied, there is a very low runtime overhead for the resolver, and zero for the compiler.


(Luke Plant) #13

Thanks stas, that’s a really helpful summary. :+1:

I hadn’t thought of doing compilation via a Python AST. It may be possible to convert my compiler approach to that with minimal changes in terms of structure - it would require implementing as_ast on each of my codegen.Expression classes. It looks like one difficulty is that the AST changes with each version of Python, but with the subset we use that might not be an issue.

I’ll look into it.


(Luke Plant) #14

I looked into using AST a bit, and it does indeed look promising. Initial proof of concept patch - https://gist.github.com/spookylukey/e79cec2684a2f6c8f5c3578b76eac29e

Some notes:

  • we would still have to use exec AFAICS. The difference is that we are no longer exec’ing a string, but a code object, and that code object is the result of compile run with ast objects, not string input. This does make a huge difference to potential security issues - we never have to go right back to a string and risk interpolating things wrongly, we directly build up AST objects.

  • it only requires a 2 line change to compiler.py. The codegen.py abstractions all need to gain as_ast() methods. Most of them already map directly to a single class in the ast module which makes them very easy - some of them easier than as_source_code(). Some may be a bit trickier, we’ll see.

  • with the help of this nice ast-decompiler, we can still write tests in terms of the Python functions we expect. I’m hoping most of the specific compiler tests won’t need to be changed at all (there might be some vertical whitespace issues). This makes a huge difference, because testing against the AST produced would be extremely unreadable.

  • Python version compatibility may be harder, we may need different implementations of as_ast for different Python versions, but perhaps not too much.

I don’t have any more time to work on this right now, but it does look promising.


(Staś Małolepszy) #15

Thanks for looking into it, @spookylukey! It does look very promising. In particular, I think the major benefit security-wise is how it drastically reduces the risk of interpolating unescaped strings, especially in deeply nested structures.

With the as_ast() methods, would it be possible to use its output with ast-decompiler or some other kind of serialization to implement as_source_code? This could help reduce the amount of code in codegen.py.


(Staś Małolepszy) #16

I’d like to offer another summary :slight_smile: It looks like there are two major themes in this compiler-related discussion. One is about improving the current interpreting design, and the other is about the use of exec in the compile-to-Python-code design. Both of these approaches have their merits. Rather than choose between them, I think we should recognize that they serve different use-cases, and go for both of them :slight_smile:

I’d be interested in seeing @Pike’s approach fleshed out. I think it would be a great improvement to the current interpreting implementation of FluentBundle. Not only does it perform the compilation once per message, but it also gives us an opportunity to optimize simple and static messages into callable objects which return strings. That said, we should measure what the performance and the memory footprint of this approach is. It might consume more memory than the current interpreter does (because every message becomes a callable object); OTOH, it would also make it possible to throw the Fluent AST away once compiled, and the perf win of format() might be worth it anyways.

At the same time, I want to help @spookylukey land his compiler branch, too. It’s fast and almost complete, which is fantastic. One open question I have about it is whether it should land in fluent.runtime or perhaps as a separate package (e.g. fluent.compiler). The latter option would make security audits of fluent.runtime nicely scoped to just the interpreting designs.

When I talked to @Pike earlier today, we also mentioned that in the context of using Fluent to localize mozilla.org, it would probably be easier to start with the interpreting approach, if only because it’s less code and a bit easier to reason about. This would be another reason to look into improving the current FluentBundle by evolving into the direction @Pike prototyped in his gist. I see this as a staged process: start with the less-controversial implementation, and consider using CompiledFluentBundle later on, if performance is still an issue.


(Luke Plant) #17

Yes - in fact the as_source_code methods can disappear entirely. They are now only used by the tests, and so can be replaced by a utility function in the tests that is implemented in terms of just some calls to as_ast and ast_decompiler.decompile.
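For readers without ast-decompiler at hand, the same kind of test helper can be sketched with `ast.unparse` from the standard library, which is available from Python 3.9 onwards. This swaps in a different tool from the one named above, and the helper name is made up.

```python
import ast


def as_source_code_for_tests(module_ast):
    """Hypothetical test utility: turn a generated AST back into source."""
    ast.fix_missing_locations(module_ast)
    return ast.unparse(module_ast)


# Stand-in for what a compiler's as_ast() output might be.
tree = ast.parse(
    "def my_message(args, errors):\n    return 'Hello from MyBrand!'\n"
)
source = as_source_code_for_tests(tree)
```

Tests can then compare `source` against the readable Python function they expect, instead of comparing AST dumps.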

I have a fully working branch with this approach, so far only targeting Python 3.6. Hopefully soon I’ll be able to put it up for review. However it would help if we could land the other branches first - sphinx docs and fluent.syntax update - I will then re-base my compiler branch from scratch because it now has a horribly mangled history.


(Luke Plant) #18

The two implementations currently share quite a bit in terms of utility functions and in terms of runtime functions (e.g. types.py), which could make this a bit difficult. I think the biggest difficulty is the tests. I’m currently using this trick to turn every existing test under tests.format into two tests, one for each implementation. When you run the tests you then get to run all of them together, and see individual failures for the different implementations. This has been a huge benefit in my development so far, and I imagine that in the future very often the same person will end up working on both implementations. Very often while working on one I realised that there was an improvement I could make in the other, or a bug to fix, sometimes in terms of them using common code, and very often in terms of common tests.
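One possible way to get "every test runs once per implementation" with plain unittest is sketched below. This is not necessarily the trick referred to above; the two bundle classes are trivial stand-ins and every name here is hypothetical.

```python
import unittest


class InterpretingBundle:
    """Stand-in for the resolver-based implementation."""

    def format(self, message_id, args=None):
        return "Hello", []


class CompilingBundle:
    """Stand-in for the compiler-based implementation."""

    def format(self, message_id, args=None):
        return "Hello", []


IMPLEMENTATIONS = {
    "Interpreter": InterpretingBundle,
    "Compiler": CompilingBundle,
}


def make_variants(base):
    """Create one TestCase subclass of `base` per implementation."""
    variants = {}
    for label, bundle_class in IMPLEMENTATIONS.items():
        name = base.__name__ + label
        variants[name] = type(
            name, (base, unittest.TestCase), {"bundle_class": bundle_class}
        )
    return variants


class FormatTests:
    # Deliberately not a TestCase, so it is never collected on its own.
    bundle_class = None

    def test_simple(self):
        value, errors = self.bundle_class().format("hello")
        self.assertEqual(value, "Hello")
        self.assertEqual(errors, [])


# Expose the generated classes at module level so test discovery finds them.
globals().update(make_variants(FormatTests))
```

A failure then shows up under `FormatTestsInterpreter` or `FormatTestsCompiler`, which makes it clear which implementation is affected.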


(Luke Plant) #19

@stas A quick status update - been a bit occupied with other things, but I’ve got a branch for the compiler that I think is ready to review (it’s here in case anything happens to me…)

Before I make a PR though I need https://github.com/projectfluent/python-fluent/pull/92 to be merged/handled first - the error handling strategy affects the compiler branch more than the resolver, and I don’t want to keep going backwards and forwards on this.