Get HTML of remote page with JS

MeowHellYeah · April 15, 2018, 2:03pm

How can I get the HTML content of a remote page from a given URL? In particular, I need its title and the whole body.

NilkasG · April 15, 2018, 8:15pm

const html = (await (await fetch(url)).text()); // html as text
const doc = new DOMParser().parseFromString(html, 'text/html');
doc.title; doc.body;

freaktechnik · April 15, 2018, 8:26pm

You probably meant .text()

Also view is the global scope, I assume (i.e. window).

MeowHellYeah · April 15, 2018, 9:09pm

I tried this and the browser said the request was blocked because the CORS header is missing. And then:

TypeError: NetworkError when attempting to fetch resource.

(Because the fetch failed, I guess?)

freaktechnik · April 15, 2018, 10:04pm

You do need a host permission for the page in most cases to circumvent CORS restrictions, yes.

NilkasG · April 15, 2018, 11:18pm

Yes and yes. I typed the first part without my glasses and copied the second one. For reference I corrected it above.

Regarding host permission:
Yes, one does indeed need it (<all_urls> is the easiest way to test it). Here is why:

- https://developers.google.com/web/ilt/pwa/working-with-the-fetch-api#cross-origin_requests
- Having a host permission for the target drops that cross-origin restriction (CORS would actually be another way around it, if the server supports it).
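
To make the host-permission point concrete, here is a minimal sketch that checks for the permission before fetching. The helper names hostPattern and fetchTitleAndBody are made up for illustration, and the code assumes it runs in a background page or extension page where browser, fetch, and DOMParser are available. It is stricter than necessary: a server that sends CORS headers would work without the permission.

```javascript
// Build a match pattern such as "https://example.com/*" for a URL.
// Match patterns are built from the hostname only, so any port is dropped.
function hostPattern(url) {
  const { protocol, hostname } = new URL(url);
  return `${protocol}//${hostname}/*`;
}

// Fetch a page and return its title and <body> element.
// Assumption: called from an extension page or background page.
async function fetchTitleAndBody(url) {
  const pattern = hostPattern(url);
  const allowed = await browser.permissions.contains({ origins: [pattern] });
  if (!allowed) {
    throw new Error(`No host permission for ${pattern}`);
  }
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`HTTP ${response.status} for ${url}`);
  }
  const html = await response.text();
  const doc = new DOMParser().parseFromString(html, "text/html");
  return { title: doc.title, body: doc.body };
}
```

With <all_urls> in the manifest the permission check should pass for any http or https page, so the check mostly helps when testing with narrower host permissions.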

MeowHellYeah · April 16, 2018, 6:25am

Is the answer in your link, or did I misread it? So what should I do? I saw extensions somehow did get remote HTML.

NilkasG · April 16, 2018, 6:35am

The two bullet points after “Here is why” just explain why you need a host permission.

I’m not sure what you mean with the rest of your comment:

I saw extensions somehow did get remote HTML.

So it works? Great!

MeowHellYeah · April 16, 2018, 6:40am

I meant that someone else’s extension did it: Group Speed Dial can get a page’s title and take a screenshot of it from a URL the user provides. So it can be done, but the question is how.

NilkasG · April 16, 2018, 3:14pm

The code to load the source of a page and get its title and body (as DOM element) from that is in my first comment.

So far, you didn’t say that you want a screenshot of the rendered page. That is only possible to get from open/loaded pages. I am quite sure that the linked extension just waits for the page to be loaded by the user and grabs the screenshot then.

An alternative would be to render the pages on a server (this can work with puppeteer).

With tab hiding (experimental in Firefox) you may also be able to just load the desired url in a hidden tab and take a screenshot there. Loading hidden tabs may have unforeseen consequences, though.

MeowHellYeah · April 16, 2018, 3:32pm

I found a library that generates a screenshot from a DOM element, so the discussion is still relevant.
I hadn’t heard about hidden tabs; that may be interesting. As for the linked extension, it offers two options for taking a screenshot: just take it, or take it by visiting the page.

I will try the hidden-tab trick a bit later. My first idea is to read all needed info with content script and send it to main script (wondering how to). Or can it be done more easily?
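
For the "wondering how to" part, a rough sketch of content-script-to-background messaging follows. The function names and the "page-info" message type are hypothetical. Note that messages are serialized, so the body has to travel as an HTML string rather than as a live DOM element.

```javascript
// Collect the title and body markup from a document-like object.
function collectPageInfo(doc) {
  return {
    type: "page-info", // hypothetical message type, chosen for this sketch
    title: doc.title,
    bodyHtml: doc.body ? doc.body.outerHTML : "",
  };
}

// Call this from the content script running in the target page.
function sendPageInfo() {
  return browser.runtime.sendMessage(collectPageInfo(document));
}

// Call this from the background script (or an extension page).
// onInfo receives the message and the tab that sent it.
function listenForPageInfo(onInfo) {
  browser.runtime.onMessage.addListener((message, sender) => {
    if (message && message.type === "page-info") {
      onInfo(message, sender.tab);
    }
  });
}
```

The receiving side could then parse bodyHtml with DOMParser, though scripts and externally loaded styles will not come along with the string.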

NilkasG · April 16, 2018, 5:14pm

It really depends on what exactly you intend to achieve. If the page is already loaded, tabs.captureTab() (see MDN) seems the most straightforward solution.

My first idea is to read all needed info with content script and send it to main script (wondering how to).

I don’t know what “needed info” you refer to and how any information (except for the entire DOM serialized with evaluated inline styles) could ever let you render an external page in a background script.

MeowHellYeah · April 16, 2018, 6:36pm

Document title and body as I said

NilkasG · April 16, 2018, 8:56pm

Document title is pretty clear, it’s a string, but “document body” in what form?

MeowHellYeah · April 16, 2018, 9:27pm

In a form that can be rendered to an image. A DOM element is suitable at the moment, unless my idea of getting it from the website and sending it from a content script to a script inserted into my own HTML page is too hard to implement.

NilkasG · April 16, 2018, 9:37pm

Ok. If your goal is to render arbitrary page bodies in the background the same way they are / would be rendered on the webpage, then that can’t be done (or is very difficult). The reason is mostly that modern web pages do not consist only of HTML. When you fetch or serialize the body element as a string, you are missing information. And reconstructing that in general, without actually executing and rendering the entire page, is very far from trivial.

So (as I said):
You need to take (maybe partial) screenshots of the actual running page. As I said, that can’t be done in the background. You can either do it in a browser tab (maybe already open, maybe hidden) or on a server.

MeowHellYeah · April 16, 2018, 9:43pm

Sad to hear. Your method is promising, though. I’ll try it one of these days and report back on how it goes.

MeowHellYeah · April 20, 2018, 1:29pm

Thanks, it works and it is much easier than I imagined:

browser.tabs.create({url: "https://developer.mozilla.org/", active: false}).then(function (tab) {
    console.log("Tab:", tab);
    setTimeout(function () {
        browser.tabs.captureTab(tab.id).then(function(base64img) {
            console.log("Title:", tab.title); 
            console.log("Favicon:", tab.favIconUrl);
            console.log("Base64img", base64img);
            browser.tabs.remove(tab.id);
        });
    }, 3000); //give page some time to load itself
});

But I wonder why tab.title is just the URL and tab.favIconUrl is undefined. The extension has the tabs and <all_urls> permissions.

freaktechnik · April 20, 2018, 1:31pm

Probably because you’re not actually waiting for the tab to load and instead just wait an arbitrary number of seconds.

MeowHellYeah · April 20, 2018, 1:38pm

True, I am just waiting, but it’s enough for the page to load completely and to take a good screenshot.
Btw, I don’t see anything like a tabs.onLoaded event in the tabs API. What is the right way to do it?
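
One common way to wait for a tab, sketched here under assumptions, is to listen on tabs.onUpdated until the tab reports status "complete". The helper names waitForTabComplete and captureUrl are made up for this example, the tabs API is passed in as a parameter (expected to be browser.tabs), and the about:blank check assumes the extension can read tab.url, which the tabs permission allows.

```javascript
// Resolve with the updated tab object once it has finished loading.
// tabsApi is expected to be browser.tabs; timeoutMs is an arbitrary safety limit.
function waitForTabComplete(tabsApi, tabId, timeoutMs = 30000) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => {
      tabsApi.onUpdated.removeListener(listener);
      reject(new Error(`Tab ${tabId} did not finish loading in ${timeoutMs} ms`));
    }, timeoutMs);

    function listener(updatedId, changeInfo, tab) {
      if (updatedId !== tabId || changeInfo.status !== "complete") return;
      // A new tab may first report "complete" for about:blank; skip that.
      if (tab.url === "about:blank") return;
      clearTimeout(timer);
      tabsApi.onUpdated.removeListener(listener);
      resolve(tab);
    }

    tabsApi.onUpdated.addListener(listener);
  });
}

// Open a URL in a background tab, wait for it, capture it, then close it.
async function captureUrl(url) {
  const tab = await browser.tabs.create({ url, active: false });
  try {
    // Register the listener right after creation so the load event is not missed.
    const loaded = await waitForTabComplete(browser.tabs, tab.id);
    const image = await browser.tabs.captureTab(tab.id);
    return { title: loaded.title, favicon: loaded.favIconUrl, image };
  } finally {
    await browser.tabs.remove(tab.id);
  }
}
```

The tab object returned by tabs.create() is a snapshot taken at creation time, which would explain a title that is just the URL. The object passed to the onUpdated listener reflects the loaded page, although favIconUrl may still arrive in a later update.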