This program was designed to find link-related problems in a web site.
It was specifically designed to find (among other things):
This program was tested on Linux with Python 2.7, but it should work on Windows and Mac OS and with any Python 2 version later than 2.3. It will not work on Python 3.
Copyright Glenn Story, 2010, 2015, 2018
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
If you have any comments, questions, or bug reports about this program, please contact me at email@example.com .
The program is a single self-contained Python file. It uses only the standard modules that are distributed with Python. Therefore you need only copy the checksite.py file to your computer.
The program may be run either as
python checksite.py [OPTIONS] URL
or, on a Macintosh or Linux system, you can place the program on your search path, give the file execute permission, and run it as:
checksite.py [OPTIONS] URL
Either way, the program requires a standard fully-formatted URL:
The program will then scan the page at the specified URL, and any pages on the same site that are referenced by that page, and check their validity. (See the list of error messages under "Messages" below for the items that will be checked.)
The program is somewhat like a web crawler in that, starting at the page specified by URL, it will recursively analyze any additional pages that it finds. It differs from a normal web crawler, however, in that it will not follow links that have a different domain name from the original URL. (It will verify that such files exist, but will not attempt to scan them for additional URLs.)
If checksite.py encounters a page more than once, it will only analyze the page the first time it is encountered.
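The scoping rules described above can be sketched roughly as follows. This is only an illustration and is not taken from checksite.py: the function names same_site and plan_visit, the return values, and the example.com URLs are all hypothetical, and the sketch assumes that "same site" means "same host name".

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2


def same_site(url, start_url):
    """Return True when url has the same host name as start_url (assumed rule)."""
    return urlparse(url).hostname == urlparse(start_url).hostname


def plan_visit(url, start_url, visited):
    """Decide how to treat url: 'skip', 'scan', or 'check-only' (hypothetical labels)."""
    if url in visited:
        return "skip"        # a page is analyzed only the first time it is seen
    visited.add(url)
    if same_site(url, start_url):
        return "scan"        # same domain: fetch it and look for more links
    return "check-only"      # other domain: verify it exists, do not recurse
```

For example, with an empty visited set, the first call for a page on the starting site yields "scan", a second call for the same page yields "skip", and a link to another domain yields "check-only".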
The behavior of the program can be modified by specifying one or more command-line options.
The options are as follows:
This displays the version of the program and exits.
-h or --help
This shows a help message describing the options and exits.
-b BAD or --bad=BAD
This option specifies a URL that you consider "bad" (for whatever reason). References to such URLs will be reported. Substitute the URL you wish reported for BAD in the option. You may use this option more than once if you have multiple URLs you consider "bad" and want the program to flag.
-p or --show-pages
At the end of the run, this option will show a list of all the pages visited.
-n or --no-recurse
This option will prevent the program from recursively analyzing child pages of the page specified on the command line.
-v or --verbose
This option will request additional output messages. It may be used multiple times to increase the verbosity of the output. The number of repetitions selects the level of detail as follows:
1 = HTTP level
2 = Page level
3 = tag level
4 = attribute level
5 = show page contents
For more detail, see the section on messages below.
-S STATUS-CODE or --skip=STATUS-CODE
This option tells the program not to report the specified HTTP status code. For example, to suppress reporting of permanent redirects, you could specify --skip=301. This option may be used more than once if you have more than one status code you wish skipped. Replace "STATUS-CODE" with the 3-digit numeric code you want to skip. See http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html for the list of HTTP status codes and their meanings.
This option will suppress reporting of incorrect HTML.
This option will prevent the program from following or reporting on specific URLs. Replace SKIP-URL with the URL you wish to skip. This option may be used more than once if you have more than one URL to skip.
-i IGNORE_FILE or --ignore-file=IGNORE_FILE
This allows you to specify an input file that indicates specific errors not to report. The format of each line of the file is as follows:
where ERROR is one of the following:
NNN - a three-digit HTTP status code
html - HTML errors
SNN - "S" followed by a socket error
??? - an unhandled exception in the program (usually a recursion error)
The easiest way to create this file is to run checksite.py with the --summary-file option. That output file has the same format as this ignore file. You can thus cut and paste lines from the summary file to the ignore file for errors you can't correct and want to ignore on subsequent runs.
Substitute a valid pathname for IGNORE_FILE.
This option creates an output file with one line for each problem found. I use this file in two ways: (1) To see and act on a summary of the errors found. If I fix an error, I delete its line from the summary file. (2) For errors I can't or won't fix, I cut and paste lines from the summary file to the ignore file.
Substitute a valid pathname for SUMMARY_FILE. Since the file will be created by this program, it need not exist. If it does exist, the old contents will be overwritten.
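To illustrate how the repeatable options behave, here is a minimal sketch of a parser for some of the options listed above. It is an assumption for illustration only, not the parser used inside checksite.py (which may well be built differently); it uses the standard argparse module, covers only options whose flags are spelled out above, and the URLs in the sample command line are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser mirroring some of the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="checksite.py")
    parser.add_argument("url", metavar="URL")
    # Options that "may be used more than once" collect their values in a list.
    parser.add_argument("-b", "--bad", action="append", default=[], metavar="BAD")
    parser.add_argument("-S", "--skip", action="append", default=[],
                        metavar="STATUS-CODE")
    # Each repetition of -v raises the verbosity level by one.
    parser.add_argument("-v", "--verbose", action="count", default=0)
    parser.add_argument("-p", "--show-pages", action="store_true")
    parser.add_argument("-n", "--no-recurse", action="store_true")
    parser.add_argument("-i", "--ignore-file", metavar="IGNORE_FILE")
    return parser


# Sample command line: two -v flags, two skipped status codes, one "bad" URL.
args = build_parser().parse_args(
    ["-v", "-v", "--skip=301", "-S", "404",
     "-b", "http://bad.example.com/", "http://example.com/"])
```

With this sample command line, args.verbose is 2, args.skip holds the two status codes as strings, and args.bad holds the single "bad" URL.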
Most of the output of this program is sent to the standard output file (usually your terminal window).
There are two primary kinds of messages: Error messages indicating problems found in the scan and verbose messages requested by the --verbose option.
Error messages look like this:
*** 0 Response from http://fonts.googleapis.com/#: 404 Not Found
Page = http://fonts.googleapis.com/#
Parent = http://mysite.com
Verbose messages look like this:
1 Response from http://glennastory.net/fsync/#editng: 200 OK
Page = http://glennastory.net/fsync/#editng
Parent = http://glennastory.net/fsync/fsync.htm
The number (1 in the example) indicates the verbosity level: 1 means the message is displayed when --verbose appears one or more times on the command line, 2 means two or more times, and so on.
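The relationship between a message's level and the number of --verbose options can be sketched as below. This is a guess at the logic based on the examples above, not code from checksite.py; the function name format_message and its parameters are hypothetical.

```python
def format_message(level, text, verbosity):
    """Return the output line for a message, or None when it is suppressed.

    level     -- the message's level (0 for errors, 1 to 5 for verbose output)
    text      -- the message text
    verbosity -- how many times --verbose was given on the command line
    """
    if level > verbosity:
        return None                      # not verbose enough to show this one
    prefix = "*** " if level == 0 else ""  # errors are flagged with asterisks
    return "%s%d %s" % (prefix, level, text)
```

Under this sketch, level 0 messages are always shown, because the verbosity count is never below zero.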
Here is the current complete list of messages by level:
0 = error messages
0, "Error parsing HTML: %s"
0, "Error parsing page contents\n%s%s"
0, "Error receiving response from %s: %s (1)"
0, "Error sending request to %s: %s (1)"
0, "IP Address in URL: %s"
0, "No HTTP header"
0, "Page is on the 'bad' site: %s"
0, "Response from %s: %s %s"
0, "Unhandled Exceeption: %s: %s"
1 = HTTP level
1, "content-type = %s"
1, "Response from %s: %s %s"
1, "Skipping this URL"
2 = Page level
2, "Already visited"
2, "----- continuing %s"
2, "^^^^^ End processing %s"
2, "Not part of home site"
2, "Recursion depth: %d"
2, " vvvvv Processing %s"
3 = tag level
3, "found tag %s"
4 = attribute level
4, "found attribute %s=%s"
5 = show page contents
5, "Page contents = %s"