Get HTML of remote page with JS

NilkasG · April 15, 2018, 11:25pm

Yes and yes. I typed the first part without my glasses and copied the second one. For reference I corrected it above.

Regarding host permission:
Yes, one does indeed need it (<all_urls> is the easiest way to test it). Here is why:

- https://developers.google.com/web/ilt/pwa/working-with-the-fetch-api#cross-origin_requests
- Having a host permission for the target lifts that cross-origin restriction. (CORS would be another way around it, if the server supports it.)

MeowHellYeah · April 16, 2018, 6:25am

Is the answer in your link, or did I misread it? So what should I do? I saw extensions somehow did get remote HTML.

NilkasG · April 16, 2018, 6:35am

The two bullet points after “Here is why” just explain why you need a host permission.

I’m not sure what you mean by the rest of your comment:

I saw extensions somehow did get remote HTML.

So it works? Great!

MeowHellYeah · April 16, 2018, 2:18pm

I meant that someone else’s extension did it: Group Speed Dial can get a page’s title and take a screenshot of it from a URL the user supplies. So it can be done, but the question is how.

NilkasG · April 16, 2018, 3:14pm

The code to load the source of a page and get its title and body (as DOM element) from that is in my first comment.
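
As a rough illustration of that kind of approach (not necessarily the exact code from the first comment), a sketch could fetch the page and parse it with DOMParser. The helper name fetchTitleAndBody and its injectable fetchFn and parseHtml parameters are assumptions made for illustration; in an extension page with a host permission for the target, the defaults would be used. Scripts in the fetched HTML are not executed, so this only sees the static markup.

```javascript
// Hypothetical helper: fetch a page and return its title and <body> element.
// fetchFn and parseHtml default to the browser's fetch and DOMParser; they are
// parameters only so the sketch can also be tried outside a browser.
async function fetchTitleAndBody(
  url,
  fetchFn = fetch,
  parseHtml = (html) => new DOMParser().parseFromString(html, "text/html")
) {
  const response = await fetchFn(url);
  if (!response.ok) {
    throw new Error(`Request for ${url} failed with status ${response.status}`);
  }
  // Parse the source into a detached document; nothing is rendered or run.
  const doc = parseHtml(await response.text());
  return { title: doc.title, body: doc.body };
}
```

In a background page this would be called as fetchTitleAndBody("https://developer.mozilla.org/").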

So far, you didn’t say that you want a screenshot of the rendered page. That is only possible to get from open/loaded pages. I am quite sure that the linked extension just waits for the page to be loaded by the user and grabs the screenshot then.

An alternative would be to render the pages on a server (this can work with puppeteer).

With tab hiding (experimental in Firefox) you may also be able to just load the desired url in a hidden tab and take a screenshot there. Loading hidden tabs may have unforeseen consequences, though.
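
A minimal sketch of the hidden-tab idea, assuming Firefox's tabs.hide() is available to the extension (it needs the tabHide permission). The helper name openHiddenTab is made up for illustration, and tabsApi stands in for browser.tabs. Whether a capture of a hidden tab turns out usable is something to verify, not something this sketch guarantees.

```javascript
// Hypothetical helper: open a URL in a background tab and try to hide it.
// tabsApi is expected to be browser.tabs.
async function openHiddenTab(tabsApi, url) {
  // Hiding only applies to non-active tabs, so create the tab in the background.
  const tab = await tabsApi.create({ url, active: false });
  // tabs.hide() resolves with the ids of the tabs that were actually hidden.
  const hiddenIds = await tabsApi.hide(tab.id);
  return { tab, hidden: hiddenIds.includes(tab.id) };
}
```

The hidden flag lets the caller fall back to a plain background tab if the browser declined to hide it.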

MeowHellYeah · April 16, 2018, 3:34pm

I found a library that generates a screenshot from a DOM element, so the discussion is relevant.
I hadn’t heard about hidden tabs; that may be interesting. As for the linked extension, it has two options for taking a screenshot: just take it, or take it when the page is visited.

I will try the hidden-tab trick a bit later. My first idea is to read all needed info with content script and send it to main script (wondering how to). Or can it be done more easily?

NilkasG · April 20, 2018, 12:33pm

It really depends on what exactly you intend to achieve. If the page is already loaded, https://developer.mozilla.org/en-US/Add-ons/WebExtensions/API/tabs/captureTab seems the most straightforward solution.

My first idea is to read all needed info with content script and send it to main script (wondering how to).

I don’t know what “needed info” you refer to and how any information (except for the entire DOM serialized with evaluated inline styles) could ever let you render an external page in a background script.

MeowHellYeah · April 16, 2018, 6:36pm

Document title and body, as I said.

NilkasG · April 16, 2018, 8:56pm

Document title is pretty clear, it’s a string, but “document body” in what form?

MeowHellYeah · April 16, 2018, 9:27pm

In a form that can be rendered to an image. A DOM element is suitable for now, unless my idea of getting it from the website and sending it from a content script to a script in my own HTML page is too hard to implement.

NilkasG · April 16, 2018, 9:37pm

Ok. If your goal is to render arbitrary page bodies in the background the same way they are / would be rendered on the webpage, then that can’t be done (or is very difficult). The reason is mostly that modern web pages do not consist only of HTML. When you fetch or serialize the body element as a string, you lose the external stylesheets, the scripts, and the state those scripts produce. Reconstructing all of that in general, without actually executing and rendering the entire page, is very far from trivial.

So (as I said):
You need to take (maybe partial) screenshots of the actual running page. That can’t be done in the background. You can either do it in a browser tab (maybe already open, maybe hidden) or on a server.

MeowHellYeah · April 16, 2018, 9:43pm

Sad to hear. Your method is promising, though. I’ll try it one of these days and report back on how it goes.

MeowHellYeah · April 20, 2018, 1:30pm

Thanks, it works and it is much easier than I imagined:

browser.tabs.create({url: "https://developer.mozilla.org/", active: false}).then(function (tab) {
    console.log("Tab:", tab);
    setTimeout(function () {
        browser.tabs.captureTab(tab.id).then(function(base64img) {
            console.log("Title:", tab.title); 
            console.log("Favicon:", tab.favIconUrl);
            console.log("Base64img", base64img);
            browser.tabs.remove(tab.id);
        });
    }, 3000); //give page some time to load itself
});

But I wonder why tab.title is just the URL and tab.favIconUrl is undefined. The extension has the tabs and <all_urls> permissions.

freaktechnik · April 20, 2018, 1:31pm

Probably because you’re not actually waiting for the tab to load and instead just wait an arbitrary number of seconds.

MeowHellYeah · April 20, 2018, 1:41pm

True, I am, but it’s enough time for the page to load completely and to take a good screenshot.
Btw, I don’t see anything like a tabs.onLoaded event in the tabs API. What is the right way?

freaktechnik · April 20, 2018, 1:48pm

MeowHellYeah · April 20, 2018, 3:28pm

Not really what I’m looking for. It fired 5 times (or fewer, and threw “Message manager disconnected” if the tab was closed), but I need it to fire only once, when the page has loaded.

freaktechnik · April 20, 2018, 1:58pm

You can check what changed, which is in your case the tab status that should change.
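
A sketch of that filtering, assuming the event in question is tabs.onUpdated: the listener ignores every update except the one where the status of the tab of interest changes to "complete", then unregisters itself so it fires only once. The helper name waitForTabComplete is made up for illustration, and tabsApi stands in for browser.tabs.

```javascript
// Hypothetical helper: resolves once the given tab reports status "complete".
// tabsApi is expected to be browser.tabs.
function waitForTabComplete(tabsApi, tabId) {
  return new Promise((resolve) => {
    function onUpdated(updatedTabId, changeInfo, tab) {
      // onUpdated fires for all tabs and for many kinds of change
      // (title, favicon, status, ...), so skip everything else.
      if (updatedTabId !== tabId || changeInfo.status !== "complete") {
        return;
      }
      // Unregister first so later updates do not trigger this again.
      tabsApi.onUpdated.removeListener(onUpdated);
      resolve(tab);
    }
    tabsApi.onUpdated.addListener(onUpdated);
  });
}
```

Usage would look like const loaded = await waitForTabComplete(browser.tabs, tab.id). The sketch does not cover a tab that is closed before it finishes loading (the promise then never settles) or one that had already finished before the listener was attached.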

MeowHellYeah · April 20, 2018, 2:15pm

Oh, it could even have worked before; I just missed that the tab info has to be refreshed with tabs.get(). The 3-second timer was a temporary placeholder anyway, so thanks.

NilkasG · April 20, 2018, 3:30pm

In your code, you wait before capturing, but you are still printing the old tab object, and that object is never updated. Every call to the tabs API gives you a new copy. If you need a fresh one, use tab = await browser.tabs.get(tab.id).
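
Applied to the capture step, that could look roughly like the sketch below. The helper name captureLoadedTab is made up for illustration, tabsApi stands in for browser.tabs, and the tab is assumed to have finished loading already.

```javascript
// Hypothetical helper: read current tab info and take a screenshot.
// tabsApi is expected to be browser.tabs; tabId must belong to a loaded tab.
async function captureLoadedTab(tabsApi, tabId) {
  // Tab objects are snapshots, so request a current one instead of reusing
  // the object that tabs.create() returned before the page had loaded.
  const tab = await tabsApi.get(tabId);
  // captureTab() resolves with the screenshot as a data URL string.
  const image = await tabsApi.captureTab(tabId);
  return { title: tab.title, favIconUrl: tab.favIconUrl, image };
}
```

The returned title and favIconUrl come from the refreshed object, so they reflect the loaded page; favIconUrl can still be undefined for pages that have no favicon.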