Making use of the Delta Releases

I recently posted a feature request / suggestion on GitHub to make the delta releases easy to use. The problem and the proposed solution are described in that issue, so I will not repeat them here, but I want to add some further insight:

  • According to this, transferring 1 GB of data over the Internet produces 3 kg CO2.
  • Here are the sizes of the last two datasets (all released language datasets combined, compressed on disk, excluding TCP overhead, which would add ~10%):
    • v16.1 - FULL: 645 GB
    • v17.0 - FULL: 663 GB
    • v17.0 - DELTA: 17.6 GB
  • So, the Delta is ~2.65% of the complete dataset (17.6 GB / 663 GB). Per language it will of course differ, depending on how much was contributed to that language in that timeframe, but larger datasets are downloaded more often, so the following values are conservative.
  • Some CO2 calculations - assuming each dataset is downloaded by 1,000 people:
    • v17.0 - FULL: 1,000 * 663 GB * 3 kg/GB => 1,989.0 tons CO2
    • v17.0 - DELTA: 1,000 * 17.6 GB * 3 kg/GB => 52.8 tons CO2
    • => Difference: 1,936.2 tons CO2 (would be saved by using delta versions - per release)
    • => 4 releases/year: 7,744.8 tons CO2
  • According to this, an average tree absorbs 10 kg CO2 per year. So:
    • We would need 774,480 (young) trees working all year to absorb the extra CO2 we produce by choosing the full versions (7,744,800 kg / 10 kg per tree).
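
For anyone who wants to plug in their own numbers, the arithmetic above can be sketched in a few lines of Python. This is only a rough sketch: the constants are the estimates quoted in this post (3 kg CO2 per GB, 1,000 downloads, 4 releases per year, 10 kg CO2 per tree per year), and the function and variable names are made up for illustration.

```python
# Back-of-envelope CO2 estimate for full vs. delta downloads.
# All constants are the estimates quoted in the post, not measured values.
FULL_GB = 663.0              # v17.0 full release, all languages combined
DELTA_GB = 17.6              # v17.0 delta release
KG_CO2_PER_GB = 3.0          # assumed transfer footprint per GB
DOWNLOADS = 1000             # assumed number of downloads per release
RELEASES_PER_YEAR = 4        # assumed release cadence
KG_CO2_PER_TREE_YEAR = 10.0  # assumed absorption of an average tree


def tons_co2(size_gb, downloads=DOWNLOADS):
    """CO2 in metric tons for `downloads` transfers of `size_gb` each."""
    return size_gb * downloads * KG_CO2_PER_GB / 1000.0


def estimate():
    """Return the per-release and per-year figures as a dict."""
    full = tons_co2(FULL_GB)
    delta = tons_co2(DELTA_GB)
    saved_per_release = full - delta
    saved_per_year = saved_per_release * RELEASES_PER_YEAR
    return {
        "delta_share_pct": DELTA_GB / FULL_GB * 100.0,
        "full_tons": full,
        "delta_tons": delta,
        "saved_per_release_tons": saved_per_release,
        "saved_per_year_tons": saved_per_year,
        # tons -> kg, then divide by what one tree absorbs per year
        "trees_needed": saved_per_year * 1000.0 / KG_CO2_PER_TREE_YEAR,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:,.1f}")
```

Changing `DOWNLOADS` or `KG_CO2_PER_GB` scales every result linearly, so even a much lower per-GB figure leaves the ratio between full and delta downloads unchanged.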

And the figures will increase with each new version, which will be larger.

Although these are estimates, I think the conclusion is clear. We need to change our workflows to use Delta releases (provided the suggestion is implemented), which would also save us bandwidth, time, and disk space (all of which produce additional CO2).

It is also a requirement if we take Responsible AI seriously.

Otherwise, each of those 1,000 people would need to plant roughly 775 trees…