This new transcribed dataset looks like it could be very helpful. I don’t know how feasible (or desirable) it would be to incorporate it into training for the next release of the main English model, but with 44.5k hrs transcribed it is roughly 44 times the size of the earlier LibriSpeech dataset (1,000 hrs).
It also has transcribed audio for other languages. Those quantities are less dramatic, but they could still be a big help compared with what’s currently available for those languages.