Engineering Effectiveness Newsletter (February/March 2025 Edition)

Welcome to the February/March edition of the Engineering Effectiveness Newsletter! The Engineering Effectiveness org makes it easy to develop, test and release Mozilla software at scale. See below for some highlights, then read on for more detailed info!

Highlights

  • A ton of progress has been made planning out the GCP → Azure migration, putting us in a great spot to kick the project off in H2.
  • The Taskcluster team is now officially part of Engineering Effectiveness.
    • As highlighted in September, we ran a trial period where the Taskcluster team would be closer to Release Engineering. The trial period has proven to be successful.
  • The Data Loss Prevention feature reached the 1.0 milestone after getting a green status from QA. This is a major milestone and DLP is now scheduled to ship in Firefox 138.
  • Four new language pairs are now live in Firefox Desktop:
    • en → zh-Hans (Translations into Simplified Chinese)
    • en → ja (Translations into Japanese)
    • en → ko (Translations into Korean)
    • en → ar (Translations into Arabic)
  • In Firefox-CI, generic-worker can now run tasks using the deprecated docker-worker payload format. This means that the last features are complete and parity has been achieved, paving the way to update our ancient CI workers to something modern, performant and secure.

Contributors

Detailed Project Updates

Build System and Mach Environment

Crash Management

  • Alex Franchuk has set up daily ingestion of crash ping symbolication/signatures into BigQuery, so that you can now access crash signatures and stack traces in places like Redash. The table is moz-fx-data-shared-prod.crash_ping_ingest_external.ingest_output, and may be joined on document_id and submission_timestamp (example).
    • All nightly and beta pings are ingested; for release, 10k pings per process type are randomly sampled.
  • Serge Guelton improved the accuracy of crash reports for failures involving mfbt.
  • Chris Martin added support for recording and analyzing non-fatal errors during minidump generation.
  • Chris Martin made it possible for child processes to transfer their auxiliary vector to the crash generator, making minidump generation more robust on Linux and Android.
  • Brian Tsoi landed and enabled client-side memory testing in the crash reporter client, which allows the client to detect faulty memory on user machines.
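
For readers who want to try the crash ping table mentioned above, here is a minimal sketch of the join it describes, written as a small Python helper that builds BigQuery SQL. The symbolicated table name and the two join keys come from this newsletter; the `PINGS_TABLE` name, the function name and the default date are hypothetical placeholders, so substitute whichever crash ping table you already query.

```python
# Table name taken from the newsletter text.
SYMBOLICATED_TABLE = (
    "moz-fx-data-shared-prod.crash_ping_ingest_external.ingest_output"
)

# ASSUMPTION: hypothetical placeholder, not a real table name. Replace it
# with the crash ping table you already use.
PINGS_TABLE = "your-project.your_dataset.crash_pings"


def build_signature_query(pings_table: str = PINGS_TABLE,
                          day: str = "2025-03-01") -> str:
    """Return BigQuery SQL joining crash pings to their symbolicated output.

    The join uses document_id and submission_timestamp, as described in
    the newsletter. `day` is interpolated directly into the SQL, so only
    pass trusted values.
    """
    return f"""
SELECT
  pings.document_id,
  pings.submission_timestamp,
  -- Keep every symbolicated column except the join keys, which would
  -- otherwise be duplicated in the result.
  symbolicated.* EXCEPT (document_id, submission_timestamp)
FROM `{pings_table}` AS pings
JOIN `{SYMBOLICATED_TABLE}` AS symbolicated
  ON pings.document_id = symbolicated.document_id
 AND pings.submission_timestamp = symbolicated.submission_timestamp
WHERE DATE(pings.submission_timestamp) = '{day}'
""".strip()


if __name__ == "__main__":
    print(build_signature_query())
```

The printed query can be pasted into Redash or the BigQuery console. Because release pings are sampled, an inner join like this one will drop release pings that were not ingested.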

Firefox-CI, Taskcluster and Treeherder

  • Pete Moore and Matt Boris finished implementing the last features of d2g (docker-worker to generic-worker). Docker-worker is the oldest type of worker. While unmaintained for many years, it’s still a central piece of the Firefox CI instance, powering literally all tasks that can run in a Linux container.
  • Yarik greatly improved Taskcluster’s ability to manage various launch configs within a worker pool. This prevents pools with broken configurations, or pools in a datacenter experiencing an outage, from picking up tasks.
  • Andrew Halberstadt created a Looker board for the Firefox-CI ETL, which includes a new dashboard tracking some basic metrics over time.
  • Joel Maher has finished migrating Windows 11 tests from the Oct 2022 update to the Oct 2024 update.
  • Sheriffs, Sebastian Hengst and Joel Maher have migrated MacOSX tests from 10.15 to 14.70. A small subset still runs on version 10.15 for compatibility testing.

OS Integration and Security

  • David Parks and Greg Stoll pushed our Data Loss Prevention (DLP) project to reach the 1.0 milestone with green status from QA. This is a major milestone and DLP is now scheduled to ship in Firefox 138.

PDF.js

Firefox Translations

  • Evgeny Pavlov has successfully trained four models that are now released in Firefox Desktop.

    • en → zh-Hans (Translations into Simplified Chinese)
    • en → ja (Translations into Japanese)
    • en → ko (Translations into Korean)
    • en → ar (Translations into Arabic)
  • Greg Tatum built out a model registry page for all of our trained models to make it easier to retrain models and track assets.

  • Greg Tatum updated the graph describing the training pipeline with new documentation.

  • Greg Tatum has been working on simplifying the pipeline for re-training languages to make it more robust, easier to use, and less wasteful by reusing existing assets. Pictured here is the same graph as above, but with just training a new student model, reducing training costs by 10x in some cases.

  • Greg Tatum landed new language identification rules to reduce the number of false-positive translation offers. These popups can be disruptive for users, so we now only offer translations when we are confident that the detected language is correct.

  • Erik Nordin added documentation on running Translations end-to-end and performance tests.

  • Erik Nordin fixed an issue that temporarily broke Translations in Nightly after migrating the TranslationsEngine from using JSWindowActors to instead use JSProcessActors.

  • Erik Nordin fixed several of the highest-frequency intermittent test failures within the Translations test suite: bug 1946901, bug 1943516 and bug 1917161.

  • Erik Nordin modified the Translations performance tests to track both the peak memory usage and the stabilized memory usage within the Inference Process.

  • Erik Nordin made it possible to live-upgrade models from a shared-vocab format to a split-vocab format in Firefox without breaking Translations for users.

  • Erik Nordin implemented the JavaScript DecompressionStream API for the Zstandard format, available within privileged contexts in Firefox. Zstandard decompression is much faster than the other supported formats, and will allow us to migrate the Translations ecosystem to decompress models on the fly. This reduces download sizes and times, as well as the on-device storage required for models, with virtually no impact on user-perceived performance.

  • Erik Nordin designed an Outreachy internship project that will help us migrate all of our models to the Zstandard compression format, and is actively searching for an intern to help with the work.

Power use

Phabricator, moz-phab, and Lando

  • Connor Sheehan made Lando autoformatting use mach format after Gijs landed changes to enable formatting-only updates in the ESLint linter.
  • Connor Sheehan merged all the changes from production Lando into the new Lando.
  • Zeid transitioned all Mercurial repositories (except try) to the new Lando.

Release Engineering and Release Management

  • Gabriel Bustamante landed the patches to start shipping Linux/ARM64 builds on Firefox 136.
  • Heitor Neiva has expired old Firefox partials (specifically versions prior to 100.0). Approximately 1.6 million files were removed from our archive, freeing up around 17.9 TB of cloud storage.
  • Bastien Orivel semi-automatically uploaded Fenix 137 to the Samsung Galaxy Store. After many years of manual uploads, Bastien is getting the publication there automated.
    • Next steps are to work through the remaining errors and integrate the Taskcluster task in release promotion.
  • Pascal updated whattrainisitnow.com to reflect our extended support of ESR 115 until September and created new API endpoints for the webcompat team.
  • Release Management shipped 2 major releases (Fx135 & Fx136) and 6 dot releases, including a chemspill and 2 dot releases to respond to major webcompat incidents.
    • 136.0.3 was released to address a webcompat incident. It shipped during Fx137 RC week, requiring additional work to delay the Fx137 merge day.
    • 136.0.4, 128.9.1, and 115.21.1 were chemspill releases. The time from having a reviewed patch to a 100% rollout release was ~19 hours.

Version Control

  • Julien Cristau and Connor Sheehan put up a Web Application Firewall to protect hg.mozilla.org from DDoS attacks.

Mercurial to Git project

Thanks for reading and see you in two months!