checksite.py

Introduction

This program was designed to find link-related problems in a web site.

It was specifically designed to find (among other things):

dead links
redirects
links to specifically indicated "bad" URLs. (I used this to find hard-coded links to a temporary staging web site.)
links that use IP addresses instead of host names
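
As an illustration of the last item, here is a minimal sketch of one way to tell whether a URL uses an IP address instead of a host name. The function name host_is_ip_address is hypothetical and this is not necessarily how checksite.py performs the check; it only uses the standard urllib.parse and ipaddress modules.

```python
import ipaddress
from urllib.parse import urlsplit


def host_is_ip_address(url):
    """Return True if the host part of url is a literal IPv4 or IPv6 address.

    Hypothetical helper for illustration; not taken from checksite.py.
    """
    # .hostname drops any port number and the brackets around IPv6 literals.
    host = urlsplit(url).hostname
    if not host:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        # Not parseable as an address, so it is an ordinary host name.
        return False
    return True


print(host_is_ip_address("http://192.168.1.10/index.html"))
print(host_is_ip_address("http://mysite.com/index.html"))
```

The first call reports an IP address and the second does not.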

This program was tested on Linux with Python 3.11.3, but it should work on Windows and Mac OS, and with any version of Python later than 3.0. It will not work on Python 2.x.

Copyright and License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

Queries

If you have any comments, questions, or bug reports about this program, please contact me at glenn.story@gmail.com .

Installing the Program

The program is a single self-contained Python file. It uses only the standard modules that are distributed with Python. Therefore you need only copy the checksite.py file to your computer.

Running the Program

The program may be run either as

 python checksite.py [OPTIONS] URL

or, on a Macintosh or Linux system, you can place the program on your search path, give the file execute permission, and run it as:

 checksite.py [OPTIONS] URL

Either way, the program requires a complete URL, including the scheme (for example, http://):

 checksite.py http://mysite.com

The program will then scan the page at the specified URL, and any pages on the same site that are referenced by that page, and check their validity. (See the list of error messages under "Messages" below for the items that will be checked.)

The program is somewhat like a web crawler in that, starting at the page specified by URL, it will recursively analyze any additional pages that it finds. It differs from a normal web crawler, however, in that it will not follow links that have a different domain name from the original URL. (It will verify that such pages exist, but will not attempt to scan them for additional URLs.)

If checksite.py encounters a page more than once, it will only analyze the page the first time it is encountered.
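
The traversal rules described above can be sketched as follows. This is only an illustration under stated assumptions: the LINKS table is a made-up in-memory stand-in for fetching and parsing real pages, "same site" is approximated by comparing host names, and an explicit stack is used where the program itself is described as recursive. The names crawl and LINKS are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical in-memory "web": each page maps to the links found on it.
LINKS = {
    "http://mysite.com/": ["http://mysite.com/a.html", "http://other.example/x"],
    "http://mysite.com/a.html": ["http://mysite.com/", "http://mysite.com/b.html"],
    "http://mysite.com/b.html": [],
    "http://other.example/x": ["http://other.example/y"],
}


def crawl(start_url):
    """Return (scanned, checked_only) lists of URLs in visiting order."""
    home_host = urlsplit(start_url).hostname
    visited = set()
    scanned, checked_only = [], []
    pending = [start_url]
    while pending:
        url = pending.pop()
        if url in visited:
            continue  # a page is analyzed only the first time it is seen
        visited.add(url)
        if urlsplit(url).hostname != home_host:
            # Off-site link: existence would be verified, links not followed.
            checked_only.append(url)
            continue
        scanned.append(url)
        # Reverse so that links are visited in the order they appear.
        pending.extend(reversed(LINKS.get(url, [])))
    return scanned, checked_only


scanned, checked_only = crawl("http://mysite.com/")
print(scanned)
print(checked_only)
```

In this sketch the three mysite.com pages are scanned once each, http://other.example/x is checked but not scanned, and http://other.example/y is never reached.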

Options

The behavior of the program can be modified by specifying one or more command-line options.

The options are as follows:

--version

This displays the version of the program and exits.

-h or --help

This shows a help message describing the options and exits.

-b BAD or --bad=BAD

This option specifies a URL that you consider "bad" (for whatever reason). References to such URLs will be reported. Substitute the URL you wish reported for BAD in the option. You may use this option more than once if you have multiple URLs you consider "bad" and want the program to flag.

-p or --show-pages

This option will show a list of all the pages visited at the end of the run.

-n or --no-recurse

This option will prevent the program from recursively analyzing child pages of the page specified on the command line.

-v or --verbose

This option will request additional output messages. This option may be used multiple times to increase the verbosity of the output. The number of repetitions selects the level of detail as follows:

1 = HTTP level     
2 = Page level
3 = tag level     
4 = attribute level
5 = show page contents

For more detail, see the section on messages below.
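
A repeatable option like this is commonly implemented by counting occurrences. The following is a minimal sketch using the standard argparse module; whether checksite.py parses its options this way is an assumption, and the parser shown here defines only the --verbose option and the URL argument.

```python
import argparse

# Hypothetical parser for illustration; only --verbose and URL are defined.
parser = argparse.ArgumentParser(prog="checksite.py")
# action="count" adds 1 to the value each time the option appears.
parser.add_argument("-v", "--verbose", action="count", default=0)
parser.add_argument("url")

# Three occurrences select verbosity level 3 (tag level).
args = parser.parse_args(["-v", "-v", "-v", "http://mysite.com"])
print(args.verbose)
```

With no -v on the command line, the count stays at 0 and only error messages would be shown.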

-S STATUS-CODE or --skip=STATUS-CODE

This option tells the program not to report the specified HTTP status code. For example, to suppress reporting of permanent redirects (301 Moved Permanently), you could specify --skip=301. This option may be used more than once if you have more than one status code you wish skipped. Replace "STATUS-CODE" with the 3-digit numeric code you want to skip. See http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html for the list of HTTP status codes and their meanings.
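
If you are unsure what a status code means before skipping it, the standard http module can look up the reason phrase. This short sketch is independent of checksite.py; the three codes shown are just examples.

```python
from http import HTTPStatus

# Print the standard reason phrase for a few common status codes.
for code in (301, 302, 404):
    status = HTTPStatus(code)
    print(code, status.phrase)
```

This prints "301 Moved Permanently", "302 Found" and "404 Not Found".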

--skip-html-errors

This option will suppress reporting of incorrect HTML.

--skip-url=SKIP-URL

This option will prevent the program from following or reporting on specific URLs. Replace SKIP-URL with the URL you wish to skip. This option may be used more than once if you have more than one URL to skip.

-i IGNORE_FILE or --ignore-file=IGNORE_FILE

This allows you to specify an input file that indicates specific errors not to report. The format of each line of the file is as follows:

ERROR URL

where ERROR is one of the following:

NNN - a three-digit HTTP status code
html - HTML errors
SNN - "S" followed by a socket error.
??? - An unhandled exception in the program. (Usually a recursion error.)

The easiest way to create this file is to run checksite.py with the --summary-file option. That output file has the same format as this ignore file. You can thus cut and paste lines from the summary file to the ignore file for errors you can't correct and want to ignore on subsequent runs.

Substitute a valid pathname for IGNORE_FILE.
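
To make the "ERROR URL" line format concrete, here is a minimal sketch of reading such lines into (error, url) pairs. The function name parse_ignore_lines is hypothetical, and the handling of blank or malformed lines (they are skipped) is an assumption, not documented behavior of checksite.py. The second sample URL is made up.

```python
def parse_ignore_lines(lines):
    """Return a set of (error, url) pairs from lines of the form 'ERROR URL'.

    Hypothetical helper for illustration; not taken from checksite.py.
    """
    entries = set()
    for line in lines:
        # Split on the first run of whitespace: error code, then the URL.
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # assumption: blank or malformed lines are skipped
        error, url = parts
        entries.add((error, url))
    return entries


sample = [
    "404 http://fonts.googleapis.com/#\n",
    "html http://mysite.com/old.html\n",
    "\n",
]
ignored = parse_ignore_lines(sample)
print(("404", "http://fonts.googleapis.com/#") in ignored)
```

A problem would then be suppressed when its (error, url) pair is found in the set.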

--summary-file=SUMMARY_FILE

This option creates an output file with one line for each problem found. I use this file in two ways: (1) As a way of seeing and acting on a summary of the errors found. If I fix an error, I delete the line from the summary file. (2) For errors I can't or won't fix, I cut and paste lines from the summary file to the ignore file.

Substitute a valid pathname for SUMMARY_FILE. Since the file will be created by this program, it need not exist. If it does exist, the old contents will be overwritten.

Messages

Most of the output of this program is sent to the standard output file (usually your terminal window).

There are two primary kinds of messages: Error messages indicating problems found in the scan and verbose messages requested by the --verbose option.

Error messages look like this:

*** 0 Response from http://fonts.googleapis.com/#: 404 Not Found Page = http://fonts.googleapis.com/# Parent = http://mysite.com

Verbose messages look like this:

1 Response from http://glennastory.net/fsync/#editng: 200 OK Page = http://glennastory.net/fsync/#editng Parent = http://glennastory.net/fsync/fsync.htm

The number (1 in the example) indicates the verbosity level: 1 means this message is displayed with one or more occurrences of --verbose on the command line, 2 means two or more occurrences, etc.

Here is the current complete list of messages by level:
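
The relationship between a message's level and the number of --verbose options can be sketched as below. The function name format_message is hypothetical; the "*** " prefix on level 0 messages and the leading level number follow the two examples above, but the rest is an assumption about how the program decides what to print.

```python
def format_message(level, text, verbosity):
    """Return the output line, or None if the message is suppressed.

    Hypothetical helper for illustration; not taken from checksite.py.
    """
    if level > verbosity:
        return None  # not enough --verbose options to show this message
    # Error messages (level 0) are marked with "*** " in the examples above.
    prefix = "*** " if level == 0 else ""
    return f"{prefix}{level} {text}"


print(format_message(0, "No HTTP header", verbosity=0))
print(format_message(1, "Skipping this URL", verbosity=1))
print(format_message(2, "Already visited", verbosity=1))
```

The first two calls produce output lines; the third returns None because a level 2 message needs at least two occurrences of --verbose.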

level 0 - error messages

0, "Error parsing HTML:  %s"
0, "Error parsing page contents\n%s%s"
0, "Error receiving response from  %s:  %s (1)"
0, "Error sending request to %s:  %s (1)"
0, "IP Address in URL:  %s"
0, "No HTTP header"
0, "Page is on the 'bad' site:  %s"
0, "Response from %s:  %s %s"
0, "Unhandled Exceeption:  %s: %s"

1 = HTTP level

1, "content-type = %s"
1, "Response from %s:  %s %s"
1, "Skipping this URL"

2 = Page level

2, "Already visited"
2, "----- continuing %s"
2, "^^^^^ End processing %s"
2, "Not part of home site"
2, "Recursion depth: %d"
2, " vvvvv Processing %s"

3 = tag level

3, "found tag %s"

4 = attribute level

4, "found attribute %s=%s"

5 = show page contents

5, "Page contents = %s"