A kind user just reported a bug in my plugin where special characters were accidentally getting replaced. (My plugin scans webpages using the webRequest API and replaces NSFW images, so I am manually handling text decoding to detect and replace base64-encoded images.) I dug into it, and the root cause appears to be that I always initialize TextDecoder with UTF-8 rather than with the page's actual charset.
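To make the root cause concrete, my decode/re-encode path follows the same shape as the http-response example, roughly like this (a simplified sketch rather than my actual plugin code; the URL filter and the image-scanning step are placeholders):

```js
// Simplified sketch of the current (buggy) pattern, based on the
// http-response example: the decoder is always UTF-8, whatever the page says.
function listener(details) {
  const filter = browser.webRequest.filterResponseData(details.requestId);
  const decoder = new TextDecoder("utf-8"); // <-- root cause: charset is hard-coded
  const encoder = new TextEncoder();        // TextEncoder always emits UTF-8

  filter.ondata = (event) => {
    let text = decoder.decode(event.data, { stream: true });
    // ... scan for base64-encoded images and swap them out here ...
    filter.write(encoder.encode(text));
  };
  filter.onstop = () => {
    filter.write(encoder.encode(decoder.decode())); // flush any buffered bytes
    filter.close();
  };
}

browser.webRequest.onBeforeRequest.addListener(
  listener,
  { urls: ["<all_urls>"], types: ["main_frame"] },
  ["blocking"]
);
```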
I am currently working on a solution that simply detects the charset from the Content-Type header. I found a test suite from W3C here. Prior to this fix, my plugin failed tests 3, 4, 5, 8, and 9; now only test 9 fails (“Test of a iso-latin-1 page served as text/html with no declaration”).
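The detection I am testing is roughly the following: read the charset parameter out of the Content-Type response header in onHeadersReceived and fall back to UTF-8 when nothing is declared (again a simplified sketch; the helper name and regex are just illustrative, not necessarily my exact code):

```js
// Sketch of Content-Type based detection: header names are case-insensitive
// and the charset value may be quoted.
function charsetFromContentType(responseHeaders) {
  const header = (responseHeaders || []).find(
    (h) => h.name.toLowerCase() === "content-type"
  );
  if (!header || !header.value) {
    // Nothing declared -- fall back to UTF-8 (this is exactly what test 9 trips on).
    return "utf-8";
  }
  const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(header.value);
  return match ? match[1].toLowerCase() : "utf-8";
}

browser.webRequest.onHeadersReceived.addListener(
  (details) => {
    const charset = charsetFromContentType(details.responseHeaders);
    let decoder;
    try {
      decoder = new TextDecoder(charset); // throws RangeError on unknown labels
    } catch (e) {
      decoder = new TextDecoder("utf-8");
    }
    // ... attach filterResponseData(details.requestId) using this decoder ...
  },
  { urls: ["<all_urls>"], types: ["main_frame"] },
  ["blocking", "responseHeaders"]
);
```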
So I have three questions for those internationalization experts out there:
- Is there a recommended way to detect the charset rather than extracting it myself? I know charset detection can get quite complex, and I don’t handle some cases, such as declarations via the META tag.
- Any thoughts on how to do the detection so I can pass Test Encoding #9? I didn’t find a spec reference describing how to handle this case.
- Are there other test suites I should run my plugin against to check compliance?
Thanks for taking a peek at this! Also, let me know if anybody’s interested in seeing charset detection get folded back into the WebExtension examples. I feel both charset detection and proper chunk handling are current deficiencies of the http-response minimal example when it comes to building real-world plugins on top of it.