August 26, 2014

How to Watch Webpages for Changes

Today, I encountered a webpage with no RSS or Atom feed, so I whipped up something in Python to watch the page myself and report on any changes. Hey, Guido, if you integrate requests into the standard library, the script won't have any non-stdlib requirements. What do you say? Please? Not that the BDFL reads my blog, but anyway, here's the code:


#!/Users/hdiwan/.virtualenvs/globetrekker/bin/python
import argparse
import hashlib
import json
import logging
import smtplib
import sys

import requests  # the only non-stdlib dependency


def send_mail(msg, user, password):
    '''Send msg from the given Gmail account to itself.'''
    server = smtplib.SMTP('smtp.gmail.com', 587)
    server.ehlo()
    server.starttls()
    server.ehlo()
    server.login(user, password)
    server.sendmail(user, user, msg)
    server.quit()

if __name__ == '__main__':
    argparser = argparse.ArgumentParser(description='Check a website for changes')
    argparser.add_argument('-n', '--url', type=str, default=None, help='Add URL to watcher', action='store')
    argparser.add_argument('-l', '--list', action='store_true', help='List watched URLs')
    argparser.add_argument('-u', '--user', type=str, default='hd1@jsc.d8u.us', help='Your username', action='store')
    argparser.add_argument('-p', '--password', type=str, help='Your password', action='store')
    argparser.add_argument('-v', '--verbose', action='store_true', help='Enable debug logging')
    parsed = argparser.parse_args()

    if parsed.verbose:
        logging.basicConfig(level=logging.DEBUG)
    else:
        logging.basicConfig(level=logging.FATAL)

    if parsed.url:
        # Register the new URL with a sentinel hash of 0, preserving any
        # URLs already on file.
        try:
            with open('/var/tmp/.globetrekker.txt', 'r') as fin:
                data = json.load(fin)
        except (IOError, ValueError):
            data = {}
        data[parsed.url] = 0
        logging.debug(data)
        with open('/var/tmp/.globetrekker.txt', 'w') as fout:
            json.dump(data, fout)
        sys.exit()

    with open('/var/tmp/.globetrekker.txt', 'r') as fin:
        stored_hash = json.load(fin)
    logging.debug(stored_hash)
    if parsed.list:
        for url in stored_hash:
            print(url)
        sys.exit()
    for url in stored_hash:
        logging.debug('{} is our URL'.format(url))
        response = requests.get(url)
        logging.debug('page retrieved -- {}'.format(url))
        # sha1 needs bytes, so encode the page text, substituting XML
        # character references for anything that won't encode.
        encoded = response.text.encode('utf-8', errors='xmlcharrefreplace')
        logging.debug(encoded)
        new_hash = hashlib.sha1(encoded).hexdigest()
        logging.debug('Calculated hash code: {}'.format(new_hash))
        logging.debug('Stored hash: {}'.format(stored_hash[url]))
        if new_hash != stored_hash[url]:
            logging.debug('{} changed'.format(url))
            # A stored hash of 0 is the just-added sentinel; don't mail
            # on the very first fetch.
            if stored_hash[url] != 0:
                send_mail(u'Subject: {} Change detected\r\n\r\n--H'.format(url), parsed.user, parsed.password)

            stored_hash[url] = new_hash
    with open('/var/tmp/.globetrekker.txt', 'w') as fout:
        json.dump(stored_hash, fout)
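
That's all there is to it. Assuming you save the script as watcher.py and make it executable (the name is arbitrary), usage looks something like this; the URL and credentials are placeholders, of course:

./watcher.py -n http://example.com/page.html    # start watching a page
./watcher.py -l                                 # list the watched URLs
./watcher.py -u you@gmail.com -p yourpassword   # check every watched page, mailing yourself on changes

The last invocation is the one to put in cron.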

4 comments:

  1. The way to check this is to just ask the webserver with a HEAD request instead of a GET, so you don't pull down the webpage, just some headers. You use one of the special headers intended for exactly this task, sending it up on the request and checking for it in the result, saving its unique token so you don't have to do your own hashing. The If-Modified-Since header is the most popular, since the token is a timestamp: if the page hasn't changed, you get a 304 code and no webpage, instead of the 200 code and page data you'd get if it has changed since that timestamp. You simply save the server response's Last-Modified header's timestamp, pass it back as the next request's If-Modified-Since value, and check for a 304 return code vs. a 200. Some servers may, alternatively or in addition, send you an ETag header with a GUID-like hash token that is unique for that version of the webpage; you pass that token back on requests in the If-None-Match header and do the same 304-vs-200 check. The server generates a new unique token every time the webpage changes. This code example page should help explain it (see also the sketch after this thread): http://2buntu.com/articles/1493/monitoring-webpages-with-last-modified-and-etag-headers/

    Replies
    1. Oh, if my memory serves, these headers work with both the HEAD and GET commands. If you were actually going to process the webpage, I'd just add the header to the GET and use the 304 return code to skip page processing. But for just a notification check, a HEAD call is much faster, since it's less work for the server and less traffic, allowing you to check many more sites much more often if you like. I'd also consider collecting up all the changed URLs and sending them in one email message body, in case there are many changes since the last program run.

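Here's a minimal sketch of the conditional-request approach described in the thread above, again using requests; the example URL and the validators dict are placeholders, and a server that ignores these headers will simply keep answering 200:

import requests

def check_changed(url, validators):
    '''Make one conditional HEAD request. validators is a per-URL dict
    holding the last Last-Modified and ETag values the server sent.'''
    headers = {}
    if 'Last-Modified' in validators:
        headers['If-Modified-Since'] = validators['Last-Modified']
    if 'ETag' in validators:
        headers['If-None-Match'] = validators['ETag']
    response = requests.head(url, headers=headers)
    if response.status_code == 304:
        return False  # not modified since we last looked
    # Changed, or the server doesn't support conditional requests; stash
    # the new validators for next time.
    for name in ('Last-Modified', 'ETag'):
        if name in response.headers:
            validators[name] = response.headers[name]
    return True

print(check_changed('http://example.com/', {}))

If a server refuses HEAD, as the next comment points out, swapping requests.head for requests.get leaves the 304-vs-200 logic unchanged.
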
  2. Some (most) modern web sites do not allow the HEAD verb, Klaatu.

    Replies
    1. Sacrilege! That used to be built into the server. Are they deprecating it from the protocol, or are all the lazy youngsters just rolling their own webservers and only writing a GET handler? Anyway, you should be able to apply the same strategy with the GET command.
