Get HTML of remote page with JS


(Ruslan Komarichev) #1

How can I get the HTML content of a remote page from a given URL? In particular I need its title and whole body.


(Niklas Gollenstede) #2
const html = (await (await fetch(url)).text()); // html as text
const doc = new DOMParser().parseFromString(html, 'text/html');
doc.title; doc.body;

(Martin Giger) #3

You probably meant .text()

Also view is the global scope, I assume (i.e. window).


(Ruslan Komarichev) #4

I tried this and the browser said the request was blocked because the CORS header is missing. And then

TypeError: NetworkError when attempting to fetch resource.

(Because the fetch failed, I guess?)


(Martin Giger) #5

You do need a host permission for the page in most cases to circumvent CORS restrictions, yes.
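
To make the host permission point concrete, here is a minimal sketch of checking whether the extension already holds a permission that covers a given URL before fetching it. The helper names originPattern and hasHostPermission are made up for this example, and the api argument stands in for the global browser object so the code can be exercised outside an extension. It assumes the permission itself is declared in manifest.json (for instance "<all_urls>" in the permissions list).

```javascript
// Build a match pattern covering every path on the URL's host.
// (Hypothetical helper; the port is dropped by using hostname.)
function originPattern(url) {
  const { protocol, hostname } = new URL(url);
  return `${protocol}//${hostname}/*`;
}

// Ask the permissions API whether the extension may access that origin.
// `api` is expected to look like the global `browser` object.
async function hasHostPermission(api, url) {
  return api.permissions.contains({ origins: [originPattern(url)] });
}

// Possible use inside an extension page (not executed here):
// if (await hasHostPermission(browser, url)) {
//   const html = await (await fetch(url)).text();
// }
```

If the check returns false, the fetch is subject to the normal CORS rules and will likely fail with the NetworkError shown above.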


(Niklas Gollenstede) #6

Yes and yes. I typed the first part without my glasses and copied the second one. For reference I corrected it above.

Regarding host permission:
Yes, one does indeed need it (<all_urls> is the easiest way to test it). Here is why:


(Ruslan Komarichev) #7

Is the answer at your link, or did I misread it? So what should I do? I saw extensions somehow did get remote HTML.


(Niklas Gollenstede) #8

The two bullet points after “Here is why” just explain why you need a host permission.

I’m not sure what you mean with the rest of your comment:

I saw extensions somehow did get remote HTML.

So it works? Great!


(Ruslan Komarichev) #9

I meant that someone else’s extension did it: Group Speed Dial can get a page’s title and take a screenshot of it from a URL the user supplies. So it can be done, but the question is how.


(Niklas Gollenstede) #10

The code to load the source of a page and get its title and body (as DOM element) from that is in my first comment.

So far, you didn’t say that you want a screenshot of the rendered page. That is only possible to get from open/loaded pages. I am quite sure that the linked extension just waits for the page to be loaded by the user and grabs the screenshot then.

An alternative would be to render the pages on a server (this can work with puppeteer).

With tab hiding (experimental in Firefox) you may also be able to just load the desired url in a hidden tab and take a screenshot there. Loading hidden tabs may have unforeseen consequences, though.
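
A rough sketch of the hidden-tab idea, under several assumptions: the extension declares the "tabHide" permission, tabs.hide is available (Firefox only), and the helper name openHiddenTab is invented for this example. The api argument stands in for the global browser object. Whether a hidden tab can then be captured is not confirmed anywhere in this thread.

```javascript
// Open a URL in a background tab and try to hide it from the tab strip.
// `api` is expected to look like the global `browser` object.
async function openHiddenTab(api, url) {
  // A tab must not be active when it is hidden, so create it inactive.
  const tab = await api.tabs.create({ url, active: false });
  // tabs.hide resolves with the IDs of the tabs that were actually hidden.
  const hiddenIds = await api.tabs.hide(tab.id);
  return { tab, hidden: hiddenIds.includes(tab.id) };
}

// Possible use inside an extension (not executed here):
// const { tab, hidden } = await openHiddenTab(browser, "https://developer.mozilla.org/");
```

The returned hidden flag lets the caller fall back to a plain background tab if hiding was refused.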


(Ruslan Komarichev) #12

I found a library that generates a screenshot from a DOM element, so the discussion is relevant.
I hadn’t heard about hidden tabs, that may be interesting. As for the linked extension, it offers two options for taking a screenshot: take it directly, or take it by visiting the page.

I will try the hidden tab trick a bit later. My first idea is to read all needed info with content script and send it to main script (wondering how to). Or can it be done more easily?
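
On the "wondering how to" part: passing data from a content script to the background script is usually done with runtime.sendMessage and runtime.onMessage. A minimal sketch follows. The function names and the "page-info" message type are made up for this example, and api stands in for the global browser object. Note that the body travels as an HTML string, so external styles and script state are not included.

```javascript
// Content script side: gather the title and the serialized body.
function collectPageInfo(doc) {
  return { type: "page-info", title: doc.title, bodyHtml: doc.body.outerHTML };
}

// Content script side: send the data to the background script.
function sendPageInfo(api, doc) {
  return api.runtime.sendMessage(collectPageInfo(doc));
}

// Background side: react to messages of the (hypothetical) "page-info" type.
function listenForPageInfo(api, onInfo) {
  api.runtime.onMessage.addListener((message, sender) => {
    if (message && message.type === "page-info") {
      onInfo(message, sender);
    }
  });
}

// Possible wiring inside an extension (not executed here):
//   background.js: listenForPageInfo(browser, (info, sender) => console.log(sender.tab.id, info.title));
//   content.js:    sendPageInfo(browser, document);
```

The bodyHtml string can be turned back into a document with DOMParser on the receiving side, as in the first reply.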


(Niklas Gollenstede) #13

It really depends on what exactly you intend to achieve. If the page is already loaded, https://developer.mozilla.org/en-US/Add-ons/WebExtensions/API/tabs/captureTab seems the most straight-forward solution.

My first idea is to read all needed info with content script and send it to main script (wondering how to).

I don’t know what “needed info” you refer to and how any information (except for the entire DOM serialized with evaluated inline styles) could ever let you render an external page in a background script.


(Ruslan Komarichev) #14

Document title and body, as I said.


(Niklas Gollenstede) #15

Document title is pretty clear, it’s a string, but “document body” in what form?


(Ruslan Komarichev) #16

In a form that can be rendered to an image. A DOM element is suitable for now, unless my idea of getting it from the website and sending it from a content script to a script in my own HTML page is too hard to implement.


(Niklas Gollenstede) #17

Ok. If your goal is to render arbitrary page bodies in the background the same way they are / would be rendered on the webpage, then that can’t be done (or is very difficult). The reason is mostly that modern web pages do not only consist of HTML. When you fetch or serialize the body element as a string, you are missing information. And reconstructing that in general, without actually executing and rendering the entire page, is very far from trivial.

So (as I said):
You need to take (maybe partial) screenshots of the actual running page. As I said, that can’t be done in the background. You can either do it in a browser tab (maybe already open, maybe hidden) or on a server.


(Ruslan Komarichev) #18

Sad to know. Your method is promising though. I’ll try it one of these days and report back on how it goes.


(Ruslan Komarichev) #19

Thanks, it works and it is much easier than I imagined:

browser.tabs.create({url: "https://developer.mozilla.org/", active: false}).then(function (tab) {
    console.log("Tab:", tab);
    setTimeout(function () {
        browser.tabs.captureTab(tab.id).then(function(base64img) {
            console.log("Title:", tab.title); 
            console.log("Favicon:", tab.favIconUrl);
            console.log("Base64img", base64img);
            browser.tabs.remove(tab.id);
        });
    }, 3000); //give page some time to load itself
});

But I wonder why tab.title is just the URL and tab.favIconUrl is undefined. The extension has the tabs and <all_urls> permissions.


(Martin Giger) #20

Probably because you’re not actually waiting for the tab to load and instead just wait an arbitrary number of seconds.


(Ruslan Komarichev) #21

True, I am, but it’s enough time for the page to load completely and produce a good screenshot.
Btw, I don’t see anything like a tabs.onLoaded event in the tabs API. What is the right way to do it?
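
One common way to wait for the load instead of a fixed delay is to listen on tabs.onUpdated until the tab reports status "complete". The sketch below assumes that approach; the names waitForTabComplete and captureWhenLoaded are invented here, and tabsApi stands in for browser.tabs. The about:blank check is an assumption about what a freshly created tab may report first. The Tab object passed to the listener should carry the real title, though favIconUrl may still arrive in a later update.

```javascript
// Resolve with the Tab object once the given tab reports status "complete".
// `tabsApi` is expected to look like browser.tabs.
function waitForTabComplete(tabsApi, tabId, timeoutMs = 30000) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => {
      tabsApi.onUpdated.removeListener(onUpdated);
      reject(new Error(`Tab ${tabId} did not finish loading in time`));
    }, timeoutMs);

    function onUpdated(updatedId, changeInfo, tab) {
      if (updatedId !== tabId || changeInfo.status !== "complete") return;
      // Assumption: ignore an initial about:blank "complete" of a new tab.
      if (tab.url === "about:blank") return;
      clearTimeout(timer);
      tabsApi.onUpdated.removeListener(onUpdated);
      resolve(tab);
    }

    tabsApi.onUpdated.addListener(onUpdated);
  });
}

// Hypothetical replacement for the setTimeout version (defined, not called here).
async function captureWhenLoaded(url) {
  const created = await browser.tabs.create({ url, active: false });
  const loaded = await waitForTabComplete(browser.tabs, created.id);
  const image = await browser.tabs.captureTab(created.id);
  await browser.tabs.remove(created.id);
  return { title: loaded.title, favIconUrl: loaded.favIconUrl, image };
}
```

Because the listener is attached only after tabs.create resolves, a very fast load could in principle finish before it is registered; the timeout keeps that case from hanging forever.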