webRequest's filterResponseData with data: URIs?

I’ve been quite enjoying the new filterResponseData APIs. I have a plugin that scans inbound images before displaying them to make sure they’re not NSFW.
However, there is one particularly notable place where this does not work: the first page of Google image results. I’m not sure of the exact loading mechanism, but it appears that the first page does not load the images one at a time. It does seem to use a “data:image/jpeg;base64…”-style URI, though.
Unfortunately, even with “<all_urls>” set as the filter, I don’t believe these “requests” are running through my code. After a careful reading of the match patterns page, I suspect this may simply not be supported.
Can anyone comment further on support for this or lack thereof? And if not supported, any suggestions on strategies to implement something similar?

Thanks!

Google will load those as images before inserting them as data: URIs. If you open the network inspector, you should see requests that look something like https://encrypted-tbn0.gstatic.com/images?q=tbn:<someID>.

Based on that you should be seeing them in the webRequest filter, though the load reason/resource type may not be image/imageset.
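To make that concrete, here’s a rough sketch of hooking those thumbnail requests with filterResponseData. The host pattern is taken from the URL shape observed in the network inspector above; the listener body is a minimal skeleton, not a complete scanner, and it assumes an MV2-style background script with the “webRequest”, “webRequestBlocking”, and matching host permissions.

```javascript
// Hypothetical matcher for the gstatic thumbnail URLs seen in the inspector,
// e.g. https://encrypted-tbn0.gstatic.com/images?q=tbn:<someID>
function isGstaticThumbnail(url) {
  return /^https:\/\/encrypted-tbn\d+\.gstatic\.com\/images\?q=tbn:/.test(url);
}

// Only register the listener when actually running inside an extension.
if (typeof browser !== "undefined" && browser.webRequest) {
  browser.webRequest.onBeforeRequest.addListener(
    (details) => {
      if (!isGstaticThumbnail(details.url)) return;
      const filter = browser.webRequest.filterResponseData(details.requestId);
      const chunks = [];
      filter.ondata = (event) => chunks.push(event.data);
      filter.onstop = () => {
        // ...scan the collected chunks here, then pass them through...
        for (const chunk of chunks) filter.write(chunk);
        filter.close();
      };
    },
    { urls: ["<all_urls>"] }, // note: the resource type may not be "image"
    ["blocking"]
  );
}
```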

Thanks! I gave this a shot. I see images coming through that pipe; however, it appears that still doesn’t catch the “first page” - just the ones below the fold.

There seems to be some upfront analysis of the viewport size when the request is made: whatever is above the fold is pre-baked, and the rest are dynamically loaded. So if you make your window smaller, fewer images come pre-baked. There is a request to something like this (a search for “dolphin”):
https://www.google.com/search?safe=active&biw=1284&bih=402&tbm=isch&sa=1&ei=DpTzW4ymENGctAXm06XYCA&q=dolphin&oq=dolphin&gs_l=img.3..0l10.147750.148805..149012...0.0..0.138.708.5j2…0…1…gws-wiz-img…0i67.9G7FepTWDI0

And it comes back with a full page of search results with an array of data URIs baked in, like so:

(function(){var data=[[[\"yyyyyyyyyyyy:\",\"data:image/jpeg;base64,/9j/4AAQSkZJRgABAQA ....

So no dice yet…

Thanks for giving me a new idea to try!

Well, while I still haven’t been able to use the data: URI approach, I was able to create a filter approach that scanned the HTML, extracted, and optionally replaced the base64 images. It’s not as clean, and scanning MBs of HTML with regexes and replacing matches won’t win any speed awards, but it works.

If anybody can tell me how to scan data: URIs better, I’d like to hear it.
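For reference, the extract-and-replace idea looks roughly like this. The regex and the placeholder URI are illustrative only - real Google results pages may escape the URIs differently (e.g. with backslash sequences), so this is a starting point, not the exact expression I use:

```javascript
// Illustrative regex for base64 data: image URIs embedded in HTML.
// Real pages may escape characters (e.g. \x3d for "="), which this
// simple character class does not handle.
const DATA_URI_RE = /data:image\/(?:jpeg|png|gif|webp);base64,[A-Za-z0-9+/=]+/g;

function extractDataUris(html) {
  return html.match(DATA_URI_RE) || [];
}

function replaceBadUris(html, isBad, placeholder) {
  // Replace only the URIs the classifier flags as "bad".
  return html.replace(DATA_URI_RE, (uri) => (isBad(uri) ? placeholder : uri));
}

const sample =
  '<img src="data:image/jpeg;base64,/9j/4AAQSkZJRg=="> and ' +
  '<img src="data:image/png;base64,iVBORw0KGgo=">';

const found = extractDataUris(sample);
const cleaned = replaceBadUris(
  sample,
  (u) => u.startsWith("data:image/png"), // stand-in for the real scan
  "data:,blocked"
);
```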

The performance of regular expressions on huge strings (assuming the way they are usually evaluated) really depends on the expression. It can be catastrophic, but it can also be linear with respect to the string length. Sometimes executing multiple expressions is a lot faster than a single overly complex one (e.g. split/replace first, then search).


Also, this may be useful (even if it is a bit annoying to read through). The basic idea is to force-enable Google’s SafeSearch and let it do its job:

Thanks @NilkasG - that was indeed an interesting read! My particular approach, though, is to actually perform an AI-based image scan in the browser itself, so the methodology can work with pretty much any image site on the web.

The good news on the regex front is that I think this particular regex should be more or less linear when compiled down to an automaton. However, I don’t think anybody would say that the whole process (collecting ArrayBuffers, merging them, TextDecoder’ing them, regexing for matches, converting matches into data URLs, loading them as images, doing the AI scan, creating string replacements for “bad” images, and finally re-TextEncoder’ing everything back into the underlying StreamFilter) is ever going to be that efficient - especially not compared to a data URL approach.
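For anyone curious, the merge/decode/rewrite/re-encode part of that pipeline looks roughly like this when lifted out of the extension context so the data flow is visible. In the real filter the chunks come from filter.ondata events; here they are simulated, and the regex and placeholder are illustrative:

```javascript
// Concatenate the ArrayBuffers collected from filter.ondata events.
function mergeChunks(chunks) {
  const total = chunks.reduce((n, c) => n + c.byteLength, 0);
  const merged = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) {
    merged.set(new Uint8Array(c), offset);
    offset += c.byteLength;
  }
  return merged;
}

// Decode, rewrite flagged data URIs, re-encode for filter.write().
function scanAndRewrite(chunks, isBad) {
  const html = new TextDecoder("utf-8").decode(mergeChunks(chunks));
  const rewritten = html.replace(
    /data:image\/[a-z]+;base64,[A-Za-z0-9+/=]+/g,
    (uri) => (isBad(uri) ? "data:,blocked" : uri)
  );
  return new TextEncoder().encode(rewritten);
}

// Simulated stream: one data URI split across two chunks.
const enc = new TextEncoder();
const chunks = [
  enc.encode('<img src="data:image/jpeg;base64,/9j/4AAQ'),
  enc.encode('SkZJRg==">'),
].map((u) => u.buffer);

const out = new TextDecoder().decode(scanAndRewrite(chunks, () => true));
```

Note that merging everything before decoding sidesteps the problem of a data URI straddling a chunk boundary, at the cost of buffering the whole response.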

Perhaps you or someone else happens to know the answers to the original question about data: URIs: 1) is this truly officially unsupported, and 2) if so, how could one request support for this feature? I feel like data URLs are likely to be a crucial use case for many of us in the future.

Thanks for your responses so far everybody!

Mhm. I mostly agree. A few more thoughts:

  • If the RegExp engine actually built the DFA, the runtime would always be linear, but it likely won’t, because the size of those DFAs explodes with the expression size.
    Backtracking (what’s usually used), on the other hand, can have very good or very bad runtimes, depending on the expression and the string. Wikipedia summarizes this very well.
  • If your matching logic is simple enough, you may be able to execute it directly on the byte stream, without collecting and decoding everything. This can actually be done without using any strings at all. (This might be an interesting exercise, even if it is somewhat off-topic.)
  • “AI-based image scan in the browser itself” – cool!
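Sketching out that off-topic exercise: the ASCII bytes of a needle like “data:image” can be compared directly against the raw Uint8Array, no string decoding involved. This naive version scans a single buffer; a real streaming version would also have to handle a match straddling two chunks (e.g. by carrying the last needle-length-minus-one bytes over to the next chunk):

```javascript
// The needle's characters are all ASCII, so encoding yields 1 byte per char.
const NEEDLE = new TextEncoder().encode("data:image");

// Naive byte-by-byte search; returns the offsets of all matches.
function findAll(haystack, needle) {
  const hits = [];
  outer: for (let i = 0; i + needle.length <= haystack.length; i++) {
    for (let j = 0; j < needle.length; j++) {
      if (haystack[i + j] !== needle[j]) continue outer;
    }
    hits.push(i);
  }
  return hits;
}

const bytes = new TextEncoder().encode(
  "x data:image/jpeg;base64,AAA y data:image/png;base64,BBB"
);
const offsets = findAll(bytes, NEEDLE);
```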

I found out why this method is not working - it is currently unsupported. Synchronous handling of non-HTTP requests is simply not available:
https://bugzilla.mozilla.org/show_bug.cgi?id=1475832 I have updated the webRequest MDN docs accordingly.

Note that my plugin hooks onHeadersReceived, which doesn’t appear to fire for data: URIs at all. onBeforeRequest does fire - but cancel, redirect, and filterResponseData all seem to fail silently. I’m looking forward to the bug fix!