WebRequest Charset Detection?

A kind user just reported a bug in my plugin regarding special characters accidentally getting replaced. (My plugin scans webpages using WebRequest and replaces NSFW images, so I am manually handling text decoding to detect and replace base64-encoded images.) I dug into it, and it appears the root issue is that I always initialize TextDecoder with UTF-8 rather than with the appropriate charset for the page.
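
For context, here's a minimal sketch of the failure mode when the decoder is hard-coded to UTF-8 (the byte value is just an illustrative example, not from the actual bug report):

```js
// A lone 0xE9 is "é" in windows-1252 / ISO-8859-1, but an invalid sequence in UTF-8.
const bytes = new Uint8Array([0xe9]);
console.log(new TextDecoder("utf-8").decode(bytes));        // "�" (replacement character)
console.log(new TextDecoder("windows-1252").decode(bytes)); // "é"
```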

I am currently working on a fix that simply extracts the charset from the Content-Type header. I found a test suite from W3C here. Prior to this fix, my plugin failed tests 3, 4, 5, 8, and 9; now only test 9 fails (“Test of a iso-latin-1 page served as text/html with no declaration”).
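
Roughly, the extraction looks like this (a sketch rather than my exact code; the regex and the UTF-8 fallback are simplifications):

```js
// Given the responseHeaders array from an onHeadersReceived listener,
// pull the charset parameter out of the Content-Type header.
function charsetFromContentType(responseHeaders) {
  const header = responseHeaders.find(h => h.name.toLowerCase() === "content-type");
  if (!header) {
    return "utf-8"; // no Content-Type at all: fall back to UTF-8 for now
  }
  const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(header.value);
  return match ? match[1].toLowerCase() : "utf-8";
}

// charsetFromContentType([{ name: "Content-Type", value: "text/html; charset=ISO-8859-1" }])
//   -> "iso-8859-1"
```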

So I have three questions for those internationalization experts out there:

  1. Is there a recommended way to detect the charset rather than extracting it myself? I know charset detection can get quite complex, and I don’t handle some cases, such as detection via the META tag.
  2. Any thoughts on how to do the detection to pass Test Encoding #9? I didn’t see anything in the spec about how to handle this (see the sketch after this list for my current guess).
  3. Are there other test suites I should run my plugin against to check compliance?
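
For question 2, my current reading of the HTML Standard's encoding sniffing rules is that after the BOM and the transport-layer charset you prescan the first bytes for a META declaration, and if nothing turns up you fall back to a locale-dependent default, which for most Western locales is windows-1252 (browsers also treat the iso-8859-1 label as an alias for windows-1252). A sketch of that fallback order, with a deliberately naive META prescan:

```js
// headerCharset: whatever came out of the Content-Type header (may be null).
// bodyPrefix: the first chunk of the body, decoded as ASCII/latin1 just for scanning.
function pickCharset(headerCharset, bodyPrefix) {
  if (headerCharset) {
    return headerCharset;
  }
  // Naive prescan for <meta charset=...>; the real algorithm is more involved.
  const meta = /<meta[^>]+charset\s*=\s*["']?([^"'\s/>]+)/i.exec(bodyPrefix);
  if (meta) {
    return meta[1].toLowerCase();
  }
  return "windows-1252"; // assumed locale default; this is what would make test 9 pass
}
```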

Thanks for taking a peek at this! Also, let me know if anybody’s interested in seeing charset detection get folded back into the WebExtension examples. I feel both charset detection and proper chunk handling are current deficiencies of the http-response minimal example for developing real-world plugins.
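
To illustrate what I mean by chunk handling, here's a sketch of a streaming decode in an onBeforeRequest listener (detectedCharset is assumed to come from the detection above, and the URL filter is just illustrative):

```js
browser.webRequest.onBeforeRequest.addListener(details => {
  const filter = browser.webRequest.filterResponseData(details.requestId);
  const decoder = new TextDecoder(detectedCharset); // assumed to be set by the header detection
  const encoder = new TextEncoder();                // always emits UTF-8
  let html = "";

  filter.ondata = event => {
    // {stream: true} keeps multi-byte sequences split across chunks intact
    html += decoder.decode(event.data, { stream: true });
  };

  filter.onstop = () => {
    html += decoder.decode(); // flush any trailing partial sequence
    // ... replace base64-encoded images in `html` here ...
    filter.write(encoder.encode(html));
    filter.disconnect();
  };
}, { urls: ["<all_urls>"], types: ["main_frame"] }, ["blocking"]);
```

One wrinkle I'm still thinking about: TextEncoder only produces UTF-8, so after rewriting the body the page's declared charset no longer matches and presumably needs to be updated as well.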

Well, given the crickets in response here, I thought maybe I’d try to make the http-response example more robust; I have another post asking about the current maintainer, as no PRs seem to have been accepted lately. Stay tuned!