Converting HTML to Another Format

There are several ways to convert HTML files to other formats. You can convert the HTML to plain text for reading, processing, or conversion to still other formats; you can also convert it to PostScript, which you can view, print, or convert in turn to other formats such as PDF.

To simply remove the HTML formatting from text, use unhtml. It reads from the standard input (or a specified file name), and it writes its output to standard output. To peruse the file `index.html' with its HTML tags removed, type:

$ unhtml index.html | less RET

To remove the HTML tags from the file `index.html' and write the output to a file called `index.txt', type:

$ unhtml index.html > index.txt RET
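When several files need the same treatment, the redirection above can be wrapped in a loop. The following is only a sketch: the function name strip_tags_all is made up for illustration, and it assumes that unhtml accepts a file name argument as described above.

```shell
# strip_tags_all: hypothetical helper, not part of unhtml itself.
# Runs unhtml on each named HTML file and writes the result next to it
# with a .txt extension in place of .html.
strip_tags_all() {
    for f in "$@"; do
        # index.html -> index.txt; a name without .html just gains .txt
        out="${f%.html}.txt"
        unhtml "$f" > "$out" || return 1
    done
}
```

Once the function is defined in the current shell, running it on `index.html' should leave the stripped text in `index.txt':

$ strip_tags_all index.html RET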

When you remove the HTML tags from a file with unhtml, no further formatting is done to the text. Furthermore, it only works on files, and not on URLs themselves.

Use lynx to save an HTML file or a URL as a formatted text file, so that the resultant text looks like the original HTML when viewed in lynx. It can also preserve italics and hyperlink information in the original HTML. See section Perusing Text from the Web.

One thing you can do with this lynx output is pipe it to tools for spacing text, and then send that to enscript for setting in a font. This is useful for printing a Web page in typescript "manuscript" form, with images and graphics removed and text set double-spaced in a Courier font. To print a copy of the URL in typescript manuscript form, type:

$ lynx -dump -underscore -nolist | pr -d | enscript -B RET
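The same pipeline can also be run in two stages, so that the intermediate text is kept in a file that can be edited between them. This is a minimal sketch; the function names page_to_text and text_to_manuscript are hypothetical, and the options are the ones used in the pipeline above.

```shell
# Stage 1: save the formatted text of a page to a file for editing.
# page_to_text is a hypothetical name; $1 is the URL or HTML file,
# $2 is the text file to write.
page_to_text() {
    lynx -dump -underscore -nolist "$1" > "$2"
}

# Stage 2: double-space the (possibly edited) file with pr and pass
# it to enscript without page headers.
# text_to_manuscript is a hypothetical name; $1 is the text file.
text_to_manuscript() {
    pr -d "$1" | enscript -B
}
```

Between the two stages, the saved file can be opened in any text editor.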

NOTE: In some cases, you might want to edit the file before you print it, such as when a Web page contains text navigation bars or other text that you'd want to remove before you turn it into a manuscript. In such a case, you'd pipe the lynx output to a file, edit the file, and then use pr on the file and pipe that output to enscript for printing.

Finally, you can use html2ps to convert an HTML file to PostScript; this is useful when you want to print a Web page with all its graphics and images, or when you want to convert all or part of a Web site into PDF.

Give the URLs or file names of the HTML files to convert as arguments. Use the `-u' option to underline the anchor text of hypertext links, and specify a file name to write to as an argument to the `-o' option. The defaults are to not underline links, and to write to the standard output. To print a PostScript copy of the document at the URL to the default printer, type:

$ html2ps | lpr RET

To write a copy of the document at the URL to a PostScript file `' with all hypertext links underlined, type:

$ html2ps -u -o RET
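Since html2ps writes to the standard output by default, its PostScript can be piped straight into a PostScript-to-PDF converter to get the PDF mentioned earlier. The sketch below assumes that the ps2pdf tool from Ghostscript is installed, which this section does not otherwise cover; the function name html_to_pdf is made up for illustration.

```shell
# html_to_pdf: hypothetical helper, not part of html2ps.
# $1 is a URL or HTML file name, $2 is the PDF file to write.
html_to_pdf() {
    # html2ps writes PostScript to standard output by default;
    # ps2pdf reads standard input when its input name is "-".
    html2ps "$1" | ps2pdf - "$2"
}
```

With the function defined, converting the file `index.html' to a PDF file called `index.pdf' would look like this:

$ html_to_pdf index.html index.pdf RET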

Posted on: 17/12/2009
