July 3, 2014

How to Track a Website for Changes

There still exist websites not hip enough to embrace RSS or Twitter feeds for changes. And I still want to consume their content lazily. Enter my favourite tool:

#!/Users/hdiwan/.virtualenvs/globetrekker/bin/python
from selenium import webdriver
import logging
import smtplib
import argparse
import hashlib

def get_globetrekker_page(site):
    browser = webdriver.Chrome()
    browser.get(site)
    return browser

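def send_mail(msg, user, password):
    # Gmail accepts STARTTLS on port 587 (port 465 expects SSL from the start)
    server = smtplib.SMTP('smtp.gmail.com', 587)
    try:
        server.ehlo()
        server.starttls()
        server.ehlo()
        server.login(user, password)
        # Sender and recipient are the same account: mail yourself the alert
        server.sendmail(user, user, msg)
    finally:
        # Always close the SMTP connection, even if login or sending fails
        server.quit()
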
if __name__ == '__main__':
    try:
        argparser = argparse.ArgumentParser(description='Check a website for changes')
        argparser.add_argument('-l','--url',type=str,default='http://www.pilotguides.com/tv-shows/globe-trekker/',help='Page URL to Globetrekker', action='store')
        argparser.add_argument('-u','--user',type=str,default='hd1@jsc.d8u.us',help='Your username', action='store')
        argparser.add_argument('-p','--password',type=str,help='Your password', action='store', required=True)
        argparser.add_argument('-v','--verbose',action='store_true',help='Log debug output')
        parsed = argparser.parse_args()

        # Debug logging only when -v is given; otherwise stay quiet
        if parsed.verbose:
            logging.basicConfig(level=logging.DEBUG)
        else:
            logging.basicConfig(level=logging.FATAL)

        browser = get_globetrekker_page(parsed.url)
        # find_elements(by, value) is used because newer Selenium releases dropped find_elements_by_id
        elem_ = browser.find_elements('id', 'destination-dropdown-filter')

        try:
            with open('/var/tmp/.globetrekker.txt') as fin:
                stored_hash = fin.read()
                logging.debug('Stored Hash Code: {}'.format(stored_hash))
        except IOError: 
            stored_hash = 0

        episodes_ = 0
        for elem in elem_:
            episodes_ = len(elem.get_attribute('value')) + episodes_
        # hashlib needs bytes on Python 3, so encode the string before hashing
        new_hash = hashlib.sha1(str(episodes_).encode('utf-8')).hexdigest()
        logging.debug('Calculated hash code: {}'.format(new_hash))
        if new_hash != stored_hash: # Page changed
            with open('/var/tmp/.globetrekker.txt','w') as fout:
                fout.write('{}'.format(new_hash))
            send_mail(u'Subject: Page Changed {}\r\n\r\n--H'.format(parsed.url), parsed.user, parsed.password)

    finally:
        # browser is unset if argument parsing or Chrome startup failed
        if 'browser' in globals():
            browser.quit()

Some notes on this script: it uses Selenium, which Proshot or somebody was asking me about the other day (sorry, man, it's not Ruby, but it's my code, so...). You can find plenty about Selenium from sources other than this blog.

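Change detection compares SHA-1 digests: the script sums the lengths of the dropdown's value attributes, hashes that total, and compares the result with the hash saved from the previous run. SHA-1 is considered cryptographically weakened, and hash collisions are possible in theory, but as far as I know no attack has been seen in the wild. For noticing that a page has changed, that weakness hardly matters.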
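
As a rough sketch of the check described above, and not the script's exact logic, the comparison can be pulled out into a small function that hashes the text itself rather than a length. The names page_changed and state_path are made up for illustration, and SHA-1 is kept only to match the script; any hashlib digest would do.

```python
import hashlib
import os
import tempfile


def page_changed(content, state_path):
    """Return True if content differs from what was seen on the last call.

    Hypothetical helper: the digest of the last-seen content is kept in
    state_path and overwritten whenever the content changes.
    """
    new_hash = hashlib.sha1(content.encode('utf-8')).hexdigest()
    try:
        with open(state_path) as fin:
            stored_hash = fin.read().strip()
    except IOError:
        # First run: nothing stored yet, so treat the page as changed
        stored_hash = None
    if new_hash == stored_hash:
        return False
    with open(state_path, 'w') as fout:
        fout.write(new_hash)
    return True


if __name__ == '__main__':
    # Throwaway state file so the demo does not touch real data
    state = os.path.join(tempfile.mkdtemp(), 'demo-hash.txt')
    print(page_changed(u'Episode 1', state))              # first sighting
    print(page_changed(u'Episode 1', state))              # unchanged
    print(page_changed(u'Episode 1, Episode 2', state))   # changed
```

Run on its own, the demo should print True, False, True. Hashing the element text instead of a length would also catch edits that happen to leave the total length the same.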
