Download a complete web site

GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely used Internet protocols. It is a non-interactive command-line tool, so it can easily be called from scripts, cron jobs, terminals without X support, and so on. In this article we will show you how to download a web site with various flags, including how to limit the download rate and add a delay between requests to avoid getting banned.

If you ever need to download an entire web site, perhaps for offline viewing, wget can do the job. For example:

$ wget \
     --recursive \
     --no-clobber \
     --page-requisites \
     --html-extension \
     --convert-links \
     --restrict-file-names=windows \
     --domains website.com \
     --no-parent \
         www.website.com/somefolder/html/

This command downloads the web site www.website.com/somefolder/html/.

The options are:

    * --recursive: download the entire web site.
    * --domains website.com: don't follow links outside website.com.
    * --no-parent: don't follow links outside the directory somefolder/html/.
    * --page-requisites: get all the elements that compose the page (images, CSS and so on).
    * --html-extension: save files with the .html extension.
    * --convert-links: convert links so that they work locally, offline.
    * --restrict-file-names=windows: modify filenames so that they will work in Windows as well.
    * --no-clobber: don't overwrite any existing files (useful in case the download is interrupted and resumed).
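
For reference, here is the same command using the short form of each option where one exists (a minimal sketch; website.com and the folder path are the same placeholders as above):

$ wget -r -nc -p -E -k --restrict-file-names=windows \
     -D website.com -np \
         www.website.com/somefolder/html/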

Note that wget has its own built-in mirroring shortcut, '-m' or '--mirror', which is easy to use:

wget -m http://debian.org/

will save the full web site to your hard disk.
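
Note that -m by itself is shorthand for -r -N -l inf --no-remove-listing: it turns on recursion and time-stamping with infinite depth, but it does not convert links or fetch page requisites. For a copy that is actually browsable offline, you would typically combine it with -k and -p, for example:

$ wget -m -k -p http://debian.org/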

But the longer command above is more powerful, because it lets you control the download speed. More importantly, if you point -m at a well-maintained server you will most probably get banned, since you are downloading file after file without any delay. Think about it from the other side: if you noticed someone downloading all of your files at full speed and putting a high load on your web server, would you allow it? Of course not. So the best thing to do is to limit the download rate. Just add:

--wait=9 --limit-rate=10K

to your command so you don't kill the server you are trying to download from.

The --wait option introduces a number of seconds to wait between download attempts, and --limit-rate caps how much of the server's bandwidth you consume. Both are good ideas if you don't want to be blacklisted by the server's admin.
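
Putting it all together, a polite version of the original command might look like this (same placeholder site as before; tune the wait time and rate limit to what the server can tolerate):

$ wget \
     --recursive \
     --no-clobber \
     --page-requisites \
     --html-extension \
     --convert-links \
     --restrict-file-names=windows \
     --domains website.com \
     --no-parent \
     --wait=9 \
     --limit-rate=10K \
         www.website.com/somefolder/html/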

Posted on: 17/01/2011
