WebRequest Charset Detection?

A kind user just reported a bug in my plugin where special characters were being replaced by mistake. (My plugin scans webpages using WebRequest and replaces NSFW images, so I am manually handling text decoding to detect and replace base64-encoded images.) I dug into it, and the root issue appears to be that I always initialize TextDecoder with UTF-8 rather than with the appropriate charset for the page.

I am currently working on a solution that simply detects the charset from the Content-Type header. I found a test suite from W3C here. Prior to this fix, it would fail tests 3, 4, 5, 8, and 9. Now only test 9 fails (“Test of a iso-latin-1 page served as text/html with no declaration”).
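For anyone following along, a minimal sketch of pulling the charset out of the Content-Type header might look like the following. The function names are hypothetical and this is not the plugin's actual code; it only assumes the `{ name, value }` header objects that WebRequest passes in `responseHeaders`.

```javascript
// Hypothetical helper: extract the charset label from a Content-Type value.
// Returns a lowercase label such as "iso-8859-1", or null if none is declared.
function charsetFromContentType(contentType) {
  if (typeof contentType !== "string") return null;
  // Matches `; charset=value` with optional double quotes around the value.
  const match = /;\s*charset\s*=\s*("?)([^";\s]+)\1/i.exec(contentType);
  return match ? match[2].toLowerCase() : null;
}

// Hypothetical helper: find Content-Type in a WebRequest-style header array
// (objects with `name` and `value`) and return its charset label, if any.
function charsetFromHeaders(responseHeaders) {
  for (const header of responseHeaders || []) {
    if (header.name.toLowerCase() === "content-type") {
      return charsetFromContentType(header.value);
    }
  }
  return null;
}
```

A null result means the header carried no charset, so some other source (or a default) still has to decide the encoding.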

So I have three questions for those internationalization experts out there:

  1. Is there a recommended way to detect the charset rather than extracting it myself? I know charset detection can get quite complex, and I don’t handle some cases, such as a charset declared via the META tag.
  2. Any thoughts on how I should do the detection to pass test 9? I didn’t see a reference to the part of the spec that covers this case.
  3. Are there other test suites I should run my plugin against to check compliance?
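
On questions 1 and 2, one possible order of precedence, loosely modeled on the encoding sniffing algorithm in the HTML Standard, is: a byte order mark first, then the Content-Type charset, then a META declaration near the top of the document, then a default. The sketch below is a simplification under those assumptions. The function names are hypothetical, the regex is far cruder than the spec's prescan, and using windows-1252 as the last-resort fallback is an assumption that suits Western European pages (the Encoding Standard treats the iso-8859-1 label as windows-1252).

```javascript
// Hypothetical helper: detect a byte order mark at the start of the body.
function charsetFromBom(bytes) {
  if (bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) return "utf-8";
  if (bytes.length >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) return "utf-16be";
  if (bytes.length >= 2 && bytes[0] === 0xff && bytes[1] === 0xfe) return "utf-16le";
  return null;
}

// Hypothetical helper: crude scan of the first 1024 bytes for a META charset.
// Bytes are mapped one-to-one to characters, which is enough to read ASCII markup.
function charsetFromMeta(bytes) {
  const head = String.fromCharCode(...bytes.subarray(0, 1024));
  const match = /<meta[^>]*?charset\s*=\s*["']?\s*([a-z0-9_:.-]+)/i.exec(head);
  return match ? match[1].toLowerCase() : null;
}

// Assumed precedence: BOM, then header charset, then META, then a default.
// `bytes` is a Uint8Array holding the start of the response body.
function chooseCharset(bytes, headerCharset) {
  return charsetFromBom(bytes) || headerCharset || charsetFromMeta(bytes) || "windows-1252";
}
```

With this ordering, a page that has no BOM, no header charset, and no META declaration ends up decoded as windows-1252, which is the situation test 9 describes.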

Thanks for taking a peek at this! Also, let me know if anybody’s interested in seeing charset detection get folded back into the WebExtension examples. I feel that both charset detection and proper chunk handling are current deficiencies for developing real-world plugins based on the minimal http-response example.
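
To illustrate the chunk-handling point: a multi-byte character can be split across two response chunks, so decoding each chunk independently can corrupt it. A minimal sketch, assuming the chunks arrive as `Uint8Array` values and the charset has already been decided (the `decodeChunks` name is hypothetical), is to reuse one TextDecoder and pass `stream: true`:

```javascript
// Hypothetical helper: decode a sequence of byte chunks with one decoder.
// `stream: true` makes the decoder hold back an incomplete trailing sequence
// until the next chunk arrives instead of emitting a replacement character.
function decodeChunks(chunks, charset) {
  const decoder = new TextDecoder(charset);
  let text = "";
  for (const chunk of chunks) {
    text += decoder.decode(chunk, { stream: true });
  }
  // A final call with no input flushes anything still buffered.
  text += decoder.decode();
  return text;
}
```

In a real filter the same decoder would be kept alive across the data events rather than collecting the chunks into an array first.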

Well, given the crickets in response here, I thought maybe I’d try to make the http-response example more robust; I have another post asking about the current maintainer, as no PRs seem to have been accepted lately. Stay tuned!

Well, the PR got a bit bogged down, but it has some code that may be helpful for anyone interested. One thing I definitely discovered while working on this problem is that charset detection is not trivial to do correctly, and that ideally the browser itself would offer some way to suggest the appropriate charset. As part of the PR I also reviewed a number of addons on GitHub to see how they handled things, and found that most were either not aware of the issue or did not address it fully - the results are in the PR for anyone interested. So for anyone working with WebRequest filtering, I strongly recommend doing some type of charset detection so things work for your international users.

On the plus side, I heard back from the user who originally reported the issue - and their problems were resolved after I implemented the fix!

