Announcement - Fluent implementation for Elm, elm-fluent

spookylukey · December 19, 2018, 1:27pm

This post is to announce elm-fluent which has been ‘ready-ish’ for some time but I never got round to announcing here. It is a complete implementation of the Fluent spec (up to 0.6 at the moment) that works by compiling FTL files to Elm code. In this way it can make use of strong type checking by the Elm compiler and eliminate the vast majority of possible mistakes in FTL files at compile time. The compiler itself is written in Python.

Currently there are some issues with dependencies because Elm has no official wrapper for Intl and there have been difficulties making progress on that front. This is especially problematic for Elm 0.19, the latest release at the time of writing, because that version of the compiler attempts to stop you from using non-official ‘kernel’ or ‘native’ code, which would be necessary to write your own wrapper for Intl.

Hopefully there will be some good solutions to these problems long term.

In the meantime, in addition to being usable with Elm 0.18, I’m hoping elm-fluent may be useful in testing the development of the Fluent spec from the point of view of a statically typed language. There are already some things that are problematic for an implementation like elm-fluent. For example, in the spec NUMBER and DATETIME can be omitted, and numeric/date types will then be formatted automatically. This implicitly assumes dynamic typing, and in elm-fluent it’s not possible to fully implement this (without severely impacting the design of the project and API of the generated code). elm-fluent can do a limited amount of type inference to detect numeric values (e.g. when a variable is matched against CLDR plural categories); otherwise it relies on explicit NUMBER/DATETIME calls.

stas · December 18, 2018, 6:48pm

This post is to announce elm-fluent which has been ‘ready-ish’ for some time but I never got round to announcing here.

Thank you for creating this implementation, @spookylukey, and for sharing it here. I’ve only had a little bit of experience with Elm in the past and I remember it piqued my interest with its static error-checking approach. I’ll take another look at it during the holiday break.

It is a complete implementation of the Fluent spec that works by compiling FTL files to Elm code. In this way it can make use of strong type checking by the Elm compiler and eliminate the vast majority of possible mistakes in FTL files at compile time. The compiler itself is written in Python.

This is neat. It looks like this automatically creates a separate function for every message, with statically-typed arguments? I love the idea that most runtime errors can be discovered before the app even runs for the first time.

Does it have the consequence of requiring that the exact same variables are used in all translations of a single message? This may be a good requirement and in fact it’s something that we plan to do in Mozilla’s build pipeline as well. I’m merely calling it out because I wonder if this is something we should try to standardize across implementations. Can we make a universal assumption that all variables must be used in all translations?

Another consequence that I see is that all translations need to be included in the bundle, increasing its size. Do you think there are ways around it? Are there ways to segment the bundle into smaller chunks and only fetch translations which are needed? Or maybe compile all translations during development, and compile only some, on demand, in production?

Lastly, would there be benefits to compiling Fluent messages to types rather than functions? If I understand the Elm docs correctly, types can be decorated with additional state which could be used to store the arguments for each message. You then end up with a single big format function pattern matching the message types. Without knowing Elm I’m not sure a) whether that’s a good practice and b) whether it would make things faster than having many small functions. I’m curious if you have considered such an approach.

There are already some things that are problematic for an implementation like elm-fluent. For example, in the spec NUMBER and DATE can be omitted and numeric/date types will be formatted automatically in this case. This is implicitly assuming dynamic typing (…).

This is a really interesting point, thanks for bringing it up. Fluent is effectively implicitly typed. I find the work-around to use NUMBER() as a de facto type annotation for the variable rather elegant, but it only works as long as all translations do it. As you are aware, semantic comments might be able to provide a viable alternative. Thank you for participating in that discussion and explaining the Elm use-case!

As an alternative, would it be possible to define a union type for all possible variable types (strings, numbers, dates) and use a generic Fluent.format: FluentType -> String function to format them inside of the compiled functions? I guess that would lack the benefit of type-checking the arguments passed to translations. It would still make it possible to verify that all the required variables are there, however, which might be a good compromise.

stas · December 18, 2018, 7:19pm

Can we make a universal assumption that all variables must be used in all translations?

To make this question more tangible, here’s an example that I constantly see in messaging apps. In Polish, the past tense of all verbs must agree with the gender of the subject. Ideally, the translation would have access to the gender information:

# English
user-wrote = {$user_name} wrote

# Polish
user-wrote = {$user_gender ->
    [male] {$user_name} napisał:
    [female] {$user_name} napisała:
   *[other] Użytkownik {$user_name} napisał:
}

Without any additional annotations, the compiler has no chance of knowing about the $user_gender variable when it compiles the English translation.

With semantic comments, things can be much clearer:

# @param $user_name (String)
# @param $user_gender (String male|female|other)
user-wrote = {$user_name} wrote:

spookylukey · December 19, 2018, 11:59am

Due to the way that Elm’s type system works, and the way I’ve propagated type information from callers to callees, it doesn’t require each language for a given message to use the same variables.

Using the example from your next post, elm-fluent generates the following functions (the type signatures are the most important part):

For English:

userWrote : Locale.Locale -> { a | user_name : String } -> String
userWrote locale_ args_ =
    String.concat [ "⁨"
                  , args_.user_name
                  , "⁩ wrote"
                  ]

For Polish:

userWrote : Locale.Locale -> { a | user_gender : String, user_name : String } -> String
userWrote locale_ args_ =
    case args_.user_gender of
        "male" ->
            String.concat [ "⁨"
                          , args_.user_name
                          , "⁩ napisał:"
                          ]
        "female" ->
            String.concat [ "⁨"
                          , args_.user_name
                          , "⁩ napisała:"
                          ]
        _ ->
            String.concat [ "Użytkownik ⁨"
                          , args_.user_name
                          , "⁩ napisał:"
                          ]

The ‘master’ despatch function:

userWrote : Locale.Locale -> { a | user_gender : String, user_name : String } -> String
userWrote locale_ args_ =
    case String.toLower (Locale.toLanguageTag locale_) of
        "pl" ->
            PL.userWrote locale_ args_
        "en" ->
            EN.userWrote locale_ args_
        _ ->
            EN.userWrote locale_ args_

We’re making use of Elm’s partially defined record types (extensible records) here. { a | user_gender : String } means "any record type that has a user_gender field of type String". This is a nice type system feature in general, and very useful for our use case, because it means that the English function doesn’t have to fully define the type (and suffer the type mismatch error that would result), nor do we have to repack any objects.

The second part is that the elm-fluent compiler looks up all ‘called’ functions (for example, in the case of the master function above, which calls 2 other functions) and propagates type information back, or throws an error if that is impossible, for example if a message in one language uses a variable as an argument to DATETIME and another passes the same variable to NUMBER. This means the master function can pick up all the requirements from the individual languages. This also means, of course, that the developer calling these functions has to provide all the arguments that all the languages need.
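To make the two parts concrete, here is a minimal, self-contained sketch of how the signatures fit together from the caller’s side. It is not elm-fluent’s actual output: the Locale type here is a hypothetical two-value stand-in for Locale.Locale, the function names userWroteEn and userWrotePl are made up, and the bidi isolation characters are left out for brevity.

```elm
module Example exposing (greeting)

-- Hypothetical stand-in for the Locale type used by the generated code.


type Locale
    = En
    | Pl



-- English only needs user_name; the open record type accepts extra fields.


userWroteEn : { a | user_name : String } -> String
userWroteEn args =
    args.user_name ++ " wrote"



-- Polish needs both fields.


userWrotePl : { a | user_gender : String, user_name : String } -> String
userWrotePl args =
    case args.user_gender of
        "male" ->
            args.user_name ++ " napisał:"

        "female" ->
            args.user_name ++ " napisała:"

        _ ->
            "Użytkownik " ++ args.user_name ++ " napisał:"



-- The dispatcher's argument type collects what every language requires.


userWrote : Locale -> { a | user_gender : String, user_name : String } -> String
userWrote locale args =
    case locale of
        En ->
            userWroteEn args

        Pl ->
            userWrotePl args



-- The caller supplies every required field; the extra `unread` field is allowed.


greeting : String
greeting =
    userWrote Pl { user_name = "Ania", user_gender = "female", unread = 3 }
```

Leaving user_gender out of the record passed to userWrote would be rejected by the Elm compiler, even though the English branch never reads it.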

However, semantic comments which make things explicit would definitely be an advantage, especially for the case where you can’t guess the type.

You wrote:

Can we make a universal assumption that all variables must be used in all translations?

I’m not exactly sure what you mean. In your example, the English version might just not care about the gender, so how can it use the gender? We surely don’t want to make the requirements of all languages have an impact on the FTL files of all other languages? Or do you mean just mentioning the variables in semantic comments?

Another consequence that I see is that all translations need to be included in the bundle, increasing its size. Do you think there are ways around it? Are there ways to segment the bundle into smaller chunks and only fetch translations which are needed? Or maybe compile all translations during development, and compile only some, on demand, in production?

I guess there are a few things here - unused messages, unused languages, and lazy loading.

In Elm 0.19, unused messages shouldn’t be too much of a problem as it has a really nice tree-shaking solution built-in. Unused messages will simply be omitted if you pass --optimize.

As mentioned, it’s not possible to use elm-fluent with Elm 0.19 yet, but I’m hopeful that situation will be resolved.

For unused languages: if there are certain languages that you want to omit from certain builds, this obviously requires knowing that at compile time. We could easily add a command line flag to the compiler saying which languages to include/exclude; at the moment it finds everything.

However, client-side “lazy loading” of translations “on demand” is much harder to implement with the current way that elm-fluent works, and I don’t have any good ideas for that use case. It’s possible that later improvements in the Elm compiler might be helpful - if you had some way to annotate “these functions are likely not to be used, load them on demand, grouped in the following way” etc.

Lastly, would there be benefits to compiling Fluent messages to types rather than functions? If I understand the Elm docs correctly, types can be decorated with additional state which could be used to store the arguments for each message. You then end up with a single big format function pattern matching the message types. Without knowing Elm I’m not sure a) whether that’s a good practice and b) whether it would make things faster than having many small functions. I’m curious if you have considered such an approach.

I think it would make the API bulkier in terms of usage, because you’d have to construct the object and then pass it to a formatting function - in Elm it is easier to construct a record type than a custom type in terms of syntax. In addition I think it works heavily against you for tree-shaking possibilities.

Also, using custom types for messages means you are forced to construct a specific object for each message call. If you use record types, along with the ‘partially defined’ type signatures I’m using on functions, you could conceivably create one object that works for multiple messages. For example, if a bunch of messages often use user_name and user_gender etc., you can construct an object once and pass it to all of them, even if some of them use none or few of those fields. This may or may not be a good idea, but it is an option.

So at the moment there doesn’t seem to be a strong motivation for this.
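For comparison, here is a rough sketch of the two styles side by side, with made-up messages and names (Message, format, userWrote, userLeft are all hypothetical, and neither half is what elm-fluent actually generates):

```elm
module MessageStyles exposing (viaCustomType, viaRecords)

-- Alternative 1: a custom type with one variant per message and one big formatter.


type Message
    = UserWrote { user_name : String }
    | UserLeft { user_name : String }


format : Message -> String
format message =
    case message of
        UserWrote args ->
            args.user_name ++ " wrote"

        UserLeft args ->
            args.user_name ++ " left"



-- Alternative 2: one function per message, each taking an open record.


userWrote : { a | user_name : String } -> String
userWrote args =
    args.user_name ++ " wrote"


userLeft : { a | user_name : String } -> String
userLeft args =
    args.user_name ++ " left"



-- With the custom type, each call wraps its arguments in the matching variant.


viaCustomType : List String
viaCustomType =
    [ format (UserWrote { user_name = "Ania" })
    , format (UserLeft { user_name = "Ania" })
    ]



-- With records, one value built once serves every message that needs a subset of it.


viaRecords : List String
viaRecords =
    let
        user =
            { user_name = "Ania", user_gender = "female" }
    in
    [ userWrote user, userLeft user ]
```

In the first style every branch hangs off the single format function, which is the reason I’d expect dead code elimination to have a harder time dropping unused messages.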

As an alternative, would it be possible to define a union type for all possible variable types (strings, numbers, dates) and use a generic Fluent.format: FluentType -> String function to format them inside of the compiled functions? I guess that would lack the benefit of type-checking the arguments passed to translations. It would still make it possible to verify that all the required variables are there, however, which might be a good compromise.

Yep, the union type approach was one possibility I thought about, but as well as the slacker compile-time type checking, the generated code also has to check the types at run-time (more work for me, the compiler writer…), and then do something for the error cases when you pass a date to NUMBER() etc. This didn’t feel like a good trade-off; I thought it better to require that you always use something in the message from which a type can be determined exactly. This limits us slightly to a subset of the FTL spec, but the nice thing is that the compiler (either elm-fluent or the Elm compiler itself) will catch any problems. So if you have to go and add NUMBER to a message, for example, even though it wouldn’t be needed for other implementations, this is not such a big deal.
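As a rough illustration of what that rejected design would involve, here is a sketch with hypothetical names (FluentType, format, formatNumber); none of this exists in elm-fluent. It assumes Elm 0.19’s String.fromFloat (on 0.18 that call would be toString) and leaves out the date variant to stay dependency-free:

```elm
module FluentValue exposing (FluentType(..), format, formatNumber)

-- Hypothetical union of the value types a message argument could hold.
-- A date variant would sit alongside these two.


type FluentType
    = FluentString String
    | FluentNumber Float



-- Generic formatter: it accepts any variant, so passing the wrong kind of
-- value is no longer a compile-time error.


format : FluentType -> String
format value =
    case value of
        FluentString s ->
            s

        FluentNumber n ->
            String.fromFloat n



-- Stand-in for NUMBER(): the variant has to be checked at run time, and the
-- mismatch case needs some policy (here it is reported as an Err).


formatNumber : FluentType -> Result String String
formatNumber value =
    case value of
        FluentNumber n ->
            Ok (String.fromFloat n)

        FluentString s ->
            Err ("NUMBER() received a string: " ++ s)
```

Every message that calls NUMBER() would need to deal with that Err branch somehow, which is the extra run-time work and error handling described above.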

stas · December 27, 2018, 3:43pm

The second part is that the elm-fluent compiler looks up all ‘called’ functions, (for example, for the case of the master function above which calls 2 other functions) and propagates type information back (or throws an error if that is impossible, for example if a message in one language uses a variable as an argument to DATETIME and another passes the same variable to NUMBER ). This means the master function can pick up all the requirements from the individual languages. This also means of course that the developer calling these functions has to provide all the arguments that all the languages need.

Ah, that’s the part which I didn’t understand looking at the code. Thanks for the explanation. It makes sense and it’s quite a nice solution. It also directly answers my concern about some languages not making use of some variables which are needed for other languages.

I’m not exactly sure what you mean? In your example, the English version might just not care the gender, so how can it use the gender?

I was thinking out loud, under the assumption that all translations of a single message would be required to use all variables passed into it. I then posted the example as a counter-argument which falsified that assumption. It was a bit of a back-and-forth with myself.

I now understand better the role of the master dispatch function, which addresses my concern by (please correct me if I’m wrong) expecting the sum of all variables seen in all translations and by relying on the partial types mechanism to define an interface for the arguments in the individual translations.

We surely don’t want to make the requirements of all languages have an impact on the FTL files of all other languages?

Yes, absolutely. Which is why I was concerned in the first place.

However, client-side “lazy loading” of translations “on demand” is much harder to implement with the current way that elm-fluent works, and I don’t have any good ideas for that use case. It’s possible that later improvements in the Elm compiler might be helpful - if you had some way to annotate “these functions are likely not to be used, load them on demand, grouped in the following way” etc.

This is mostly what I was asking about, thanks for elaborating.

you can construct an object once and pass it to all of them, even if some of them use none or few of those fields. This may or may not be a good idea, but it is an option.

That’s an interesting option! It reminds me of an old idea of defining a single bundle-wide object with variables available to all translations. I think that’s something that would be helpful in the JS implementation as well, in particular for things like the user’s gender, which might be required by the grammar in unexpected places.

(…) and then do something for the error cases when you pass a date to NUMBER() etc. This didn’t feel like a good trade off, I thought it better to require you always use something in the message from which a type can be determined exactly. This limits us slightly to a subset of the FTL spec (…).

I can definitely see the appeal of the approach you chose, as it can help discover mistyped arguments during the compilation step. The first (dismissed) approach, OTOH, is closer to what the JS resolver currently does. As an experiment, I’d be interested in seeing if it can work well enough in Elm, too. Fluent’s choice of a weak type system was intended to make it as easy as possible for localizers to work with the branching logic.

Perhaps with semantic comments we can have the best of the two worlds: weak typing in expressions and strongly-typed hints for the compiler in the comments.

This is also interesting to me in the larger context of the discussion about the reference resolver. All of the interpolation logic is currently unspecified because the spec only defines the parsing behavior. I would love to add a reference resolver to the spec in the future (most probably after 1.0 ships) so that we can have more informed discussion about the expected behaviors and what it means to be spec-compliant. This will have an impact on promoting the reference test fixtures as an aid to help create new implementations. Subset-driven implementations might want to be able to skip some test fixtures and then document how they diverge from the reference.