August 26, 2014

How to Watch Webpages for Changes

Today, I encountered a webpage with no RSS or Atom feed, so I whipped up something in Python to watch the page myself and report on any changes. Hey, Guido, if you integrate requests into the standard library, the script won't have any non-stdlib requirements. What do you say? Please? Not that the BDFL reads my blog, but anyway, here's the code:


#!/Users/hdiwan/.virtualenvs/globetrekker/bin/python
import argparse
import hashlib
import json
import logging
import smtplib
import sys

import requests  # the only non-stdlib dependency


def send_mail(msg, user, password):
    '''Send msg from the given Gmail account to itself.'''
    server = smtplib.SMTP('smtp.gmail.com', 587)
    server.ehlo()
    server.starttls()
    server.ehlo()
    server.login(user, password)
    server.sendmail(user, user, msg)
    server.quit()

if __name__ == '__main__':
    argparser = argparse.ArgumentParser(description='Check a website for changes')
    argparser.add_argument('-n', '--url', type=str, default=None, help='Add URL to watcher', action='store')
    argparser.add_argument('-l', '--list', action='store_true', help='List watched URLs')
    argparser.add_argument('-u', '--user', type=str, default='hd1@jsc.d8u.us', help='Your username', action='store')
    argparser.add_argument('-p', '--password', type=str, help='Your password', action='store')
    argparser.add_argument('-v', '--verbose', action='store_true', help='Enable debug logging')
    parsed = argparser.parse_args()

    if parsed.verbose:
        logging.basicConfig(level=logging.DEBUG)
    else:
        logging.basicConfig(level=logging.FATAL)

    if parsed.url:
        # Register the new URL with a sentinel hash of 0, preserving any
        # URLs already on file.
        try:
            with open('/var/tmp/.globetrekker.txt', 'r') as fin:
                data = json.load(fin)
        except (IOError, ValueError):
            data = {}
        data[parsed.url] = 0
        logging.debug(data)
        with open('/var/tmp/.globetrekker.txt', 'w') as fout:
            json.dump(data, fout)
        sys.exit()

    with open('/var/tmp/.globetrekker.txt', 'r') as fin:
        stored_hash = json.load(fin)
    logging.debug(stored_hash)
    if parsed.list:
        for url in stored_hash:
            print(url)
        sys.exit()
    for url in stored_hash:
        logging.debug('{} is our URL'.format(url))
        response = requests.get(url)
        logging.debug('page retrieved -- {}'.format(url))
        # sha1 needs bytes, so encode the page text, substituting XML
        # character references for anything that won't encode.
        encoded = response.text.encode('utf-8', errors='xmlcharrefreplace')
        logging.debug(encoded)
        new_hash = hashlib.sha1(encoded).hexdigest()
        logging.debug('Calculated hash code: {}'.format(new_hash))
        logging.debug('Stored hash: {}'.format(stored_hash[url]))
        if new_hash != stored_hash[url]:
            logging.debug('{} changed'.format(url))
            # A stored hash of 0 is the just-added sentinel; don't mail
            # on the very first fetch.
            if stored_hash[url] != 0:
                send_mail(u'Subject: {} Change detected\r\n\r\n--H'.format(url), parsed.user, parsed.password)

            stored_hash[url] = new_hash
    with open('/var/tmp/.globetrekker.txt', 'w') as fout:
        json.dump(stored_hash, fout)
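
That's all there is to it. Assuming you save the script as watcher.py and make it executable (the name is arbitrary), usage looks something like this; the URL and credentials are placeholders, of course:

./watcher.py -n http://example.com/page.html    # start watching a page
./watcher.py -l                                 # list the watched URLs
./watcher.py -u you@gmail.com -p yourpassword   # check every watched page, mailing yourself on changes

The last invocation is the one to put in cron.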

4 comments:

  1. The way to check this is to just ask the webserver with a HEAD request instead of a GET, so you don't pull down the webpage, just some headers. You use one of the special headers intended for exactly this task, sending it up on the request and checking for it in the result, saving its unique token so you don't have to do your own hashing. The If-Modified-Since header is the most popular, since the token is a timestamp: if the page hasn't changed, you get a 304 code and no webpage, instead of the 200 code and page data you'd get if it has changed since that timestamp. You simply save the server response's Last-Modified header's timestamp, pass it back as the next request's If-Modified-Since value, and check for a 304 return code vs. a 200. Some servers may, alternatively or in addition, send you an ETag header with a GUID-like hash token that is unique for that version of the webpage; you pass that token back on requests in the If-None-Match header and do the same 304-vs-200 check. The server generates a new unique token every time the webpage changes. This code example page should help explain it (see also the sketch after this thread): http://2buntu.com/articles/1493/monitoring-webpages-with-last-modified-and-etag-headers/

    Replies
    1. Oh, if my memory serves, these headers work with both the HEAD and GET commands. If you were actually going to process the webpage, I'd just add the header to the GET and use the 304 return code to skip page processing. But for just a notification check, a HEAD call is much faster, since it's less work for the server and less traffic, allowing you to check many more sites much more often if you like. I'd also consider collecting up all the changed URLs and sending them in one email message body, in case there are many changes since the last program run.

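Here's a minimal sketch of the conditional-request approach described in the thread above, again using requests; the example URL and the validators dict are placeholders, and a server that ignores these headers will simply keep answering 200:

import requests

def check_changed(url, validators):
    '''Make one conditional HEAD request. validators is a per-URL dict
    holding the last Last-Modified and ETag values the server sent.'''
    headers = {}
    if 'Last-Modified' in validators:
        headers['If-Modified-Since'] = validators['Last-Modified']
    if 'ETag' in validators:
        headers['If-None-Match'] = validators['ETag']
    response = requests.head(url, headers=headers)
    if response.status_code == 304:
        return False  # not modified since we last looked
    # Changed, or the server doesn't support conditional requests; stash
    # the new validators for next time.
    for name in ('Last-Modified', 'ETag'):
        if name in response.headers:
            validators[name] = response.headers[name]
    return True

print(check_changed('http://example.com/', {}))

If a server refuses HEAD, as the next comment points out, swapping requests.head for requests.get leaves the 304-vs-200 logic unchanged.
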
  2. Some (most) modern web sites do not allow the HEAD verb, Klaatu.

    Replies
    1. Sacrilege! That used to be built into the server. Are they deprecating it from the protocol, or are all the lazy youngsters just rolling their own webservers and only writing a GET handler? Anyway, you should be able to apply the same strategy with the GET command.
