September 10, 2014

How to Manipulate XML in Python with the Standard Library

I just added tests to the XML serialiser and removed all of its external dependencies:

import logging
from xml import sax

logging.basicConfig(level=logging.FATAL)

class ContentHandler(sax.ContentHandler):
    def __init__(self):
        sax.ContentHandler.__init__(self)
        self.dictionary = {}

    def startElement(self, qname, attrs):
        if qname == 'element':
            self.dictionary[attrs.getValue('key')] = attrs.getValue('value')

def loads(__xml):
    """ 
        Restores XML as a dict, without the XML declaration
    """
    ch = ContentHandler()
    sax.parseString(__xml, ch)
    return(ch.dictionary)

def load(file_):
    __xml = None
    with open(file_) as fin:
        __xml = loads(fin.read())
    return(__xml)

def dumps(obj):
    """
       Dumps obj to XML
    """
    logging.debug(obj)
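    __xml = ''
    for k in obj.keys():
        # one <element> per dictionary entry, key and value stored as attributes
        __xml = __xml + '<element key="{}" value="{}"/>'.format(k, obj[k])
    return('<root>{}</root>'.format(__xml))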

def dump(obj, output):
    __bytes = dumps(obj)
    logging.debug(__bytes)
    with open(output, 'w') as fout:
        fout.write(__bytes)
    return(fout.name)

if __name__ == '__main__':
    dictionary = {'1':'True', '0' : 'False'}
    xml__ = dumps(dictionary)
    # dumps() returned without raising
    print('PASSED serialisation')
    logging.debug(xml__)
    # round trip: the parsed XML must equal the original dictionary
    if loads(xml__) == dictionary:
        print('PASSED deserialisation')
    else:
        print('FAILED deserialisation')

What does the XML that comes out look like? Take a look under the hood:

<root><element key="1" value="True"/><element key="0" value="False"/></root>
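
One caveat: the serialiser above pastes keys and values into the attributes as-is, so a value containing a quote or an ampersand would produce malformed XML. Below is a minimal sketch of an escaping variant, assuming Python 3 and string keys and values. The names ElementHandler and dumps_escaped are made up for this illustration; quoteattr comes from the standard library's xml.sax.saxutils.

```python
from xml import sax
from xml.sax.saxutils import quoteattr


class ElementHandler(sax.ContentHandler):
    """Collects the key/value attributes of <element> tags into a dict."""

    def __init__(self):
        sax.ContentHandler.__init__(self)
        self.dictionary = {}

    def startElement(self, name, attrs):
        if name == 'element':
            self.dictionary[attrs.getValue('key')] = attrs.getValue('value')


def dumps_escaped(obj):
    # quoteattr() escapes &, < and > and wraps the result in suitable quotes,
    # so the format string carries no quote characters of its own
    parts = ['<element key={} value={}/>'.format(quoteattr(k), quoteattr(v))
             for k, v in obj.items()]
    return '<root>{}</root>'.format(''.join(parts))


def loads(xml_string):
    handler = ElementHandler()
    # parseString() accepts bytes; encode so this works on any Python 3
    sax.parseString(xml_string.encode('utf-8'), handler)
    return handler.dictionary


if __name__ == '__main__':
    tricky = {'a&b': 'say "hi" <now>'}
    print(loads(dumps_escaped(tricky)) == tricky)
```

The parser undoes the escaping on the way back in, so the round trip should return the original strings unchanged.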

How to Manipulate XML Pythonically

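I just submitted my first PyPI package, xickle. It lets you persist to and read from XML just as easily as you manipulate JSON with the json module. Indeed, the function names are the same (dump, dumps, load and loads), although in the examples below dump and load take a filename rather than an open file. Example code: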

toxml = {True: '1', False: 0} # Dictionary
# To dump to a file
import xickle
xickle.dump(toxml, filename)

# To dump to a string containing the well-formed xml
import xickle
xickle.dumps(toxml)

# To read in from a file
import xickle
dictionary = xickle.load(filename)

# To read in from a string
import xickle
dictionary = xickle.loads(xmlstring)

No, persistence isn't pretty. Then again, no matter what James says, XML was not meant to be read by humans. It's meant to be read by machines.
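
For comparison with the xickle calls above, here are the same four operations using the standard json module. This is only a sketch: the temporary file path is a placeholder of my own, and the one difference to note is that json.dump and json.load work on open file objects.

```python
import json
import os
import tempfile

data = {'1': 'True', '0': 'False'}

# dumps/loads: to and from a string
text = json.dumps(data)
from_string = json.loads(text)

# dump/load: to and from an open file object (not a filename)
path = os.path.join(tempfile.mkdtemp(), 'data.json')
with open(path, 'w') as fout:
    json.dump(data, fout)
with open(path) as fin:
    from_file = json.load(fin)

print(from_string == data and from_file == data)
```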

September 9, 2014

How to Parse XML using Python

xmltodict is quite possibly the single best way to parse XML into Python objects and back. The sample below parses this blog's Atom feed and prints the title of each entry:


In [2]: import requests,xmltodict

In [3]: xml = requests.get('http://www.prolificprogrammer.com/atom.xml').content

In [4]: blog = xmltodict.parse(xml)

In [24]: def return_title(e):
    return(e['title']['#text'])

In [52]: print "\n".join([return_title(e) for e in blog['feed']['entry']])
How to Record Data
How to Watch webpages for Changes
How to Sign Text Using Python
How to Set Your User-Agent using PyCurl
How to Search for Ports on BSD
How to Synchronise a Syndication Feed to Reddit
How to be your Own CNBC Analyst
How to Produce JSON Properly from Spring
How to Visualise Deaths in Iraq Pt 2
How to Visualise Deaths in Iraq Pt 1
How to Track your website Accesses on the Web
How to Find a Link I Sent You Without Being Embarrassed #2
How to Collect Your Gists from Github
How to Draw a Histogram in Python
When can Americans Expect to Live to 100?
How to Visualise Sent Links
How to add a Custom Git command
How to determine Word Frequency in Java
How to Handle Gzipped Files in Ruby
How to Visualise Data
How to Cleanup Gmail
How to self-document Using Spring
How to Reformat Logback Output
How to Track Multiple Websites For Changes
How to Track a Website for Changes
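
If you would rather not install anything, roughly the same title listing can be had from the standard library. The sketch below uses xml.etree.ElementTree. The inline feed is a made-up two-entry sample standing in for the real atom.xml, and entry_titles is a name chosen for this illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree spells namespaced tags as {namespace-uri}localname
ATOM = '{http://www.w3.org/2005/Atom}'

SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title type="text">Sample feed</title>
  <entry><title type="text">How to Record Data</title></entry>
  <entry><title type="text">How to Watch webpages for Changes</title></entry>
</feed>"""


def entry_titles(xml_text):
    """Return the text of each entry's <title>, in document order."""
    root = ET.fromstring(xml_text)
    return [entry.findtext(ATOM + 'title') for entry in root.iter(ATOM + 'entry')]


if __name__ == '__main__':
    print("\n".join(entry_titles(SAMPLE_FEED)))
```

The feed-level title is skipped because only title elements directly inside an entry are looked up.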

Now, to convert the parsed dictionary back into XML...


In [59]: xmltodict.unparse(blog)
Out[58]: u'<?xml version="1.0" encoding="utf-8"?>\n<feed xmlns="http://www.w3.org/2005/Atom" xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/" xmlns:blogger="http://schemas.google.com/blogger/2008" xmlns:georss="http://www.georss.org/georss" xmlns:gd="http://schemas.google.com/g/2005" xmlns:thr="http://purl.org/syndication/thread/1.0"><id>tag:blogger.com,1999:blog-6157408210125261684</id><updated>2014-09-09T00:03:50.967-07:00</updated><category term="python"></category><category term="csv"></category><category term="r"></category><category term="java"></category><category term="podcasts"></category><category term="json"></category><category term="rss"></category><category term="automation"></category><category term="ggplot"></category><category term="gmail"></category><category term="openpgp"></category><category term="privacy"></category><category term="shell script"></category><category term="visualisation"></category><category term="xml"></category><category term="apache"></category><category term="atom"></category><category term="gpg"></category><category term="news"></category><category term="nginx"></category><category term="pandas"></category><category term="postgresql"></category><category term="ruby"></category><category term="security"></category><category term="spring"></category><category term="SHA"></category><category term="combinedlog"></category><category term="conversion"></category><category term="email"></category><category term="http"></category><category term="jython"></category><category term="lighttpd"></category><category term="linux"></category><category term="load average"></category><category term="load balancer"></category><category term="mh"></category><category term="netbsd"></category><category term="openbsd"></category><category term="perl"></category><category term="pgp"></category><category term="scaling"></category><category term="sharedlinks"></category><category term="bittorrent"></category><category 
term="cgi"></category><category term="change tracking"></category><category term="databases"></category><category term="feedparser"></category><category term="flask"></category><category term="geolocation"></category><category term="git"></category><category term="gnu privacy guard"></category><category term="histogram"></category><category term="ios"></category><category term="iraqbodycount"></category><category term="j2ee"></category><category term="jodatime"></category><category term="jpa"></category><category term="llibcurl"></category><category term="loadavg"></category><category term="macintosh"></category><category term="mail handler"></category><category term="nmh"></category><category term="numpy"></category><category term="optparse"></category><category term="proc"></category><category term="productivity"></category><category term="pycurl"></category><category term="quakes"></category><category term="random"></category><category term="requests"></category><category term="rest"></category><category term="search engine"></category><category term="selenium"></category><category term="sh"></category><category term="subprocess"></category><category term="twitter"></category><category term="weather"></category><category term="webpages"></category><category term="youtube"></category><category term="RDBMS"></category><category term="WDI"></category><category term="addressbook"></category><category term="america"></category><category term="analysis"></category><category term="apple"></category><category term="attachments"></category><category term="awk"></category><category term="bitly"></category><category term="bsd"></category><category term="c"></category><category term="chat"></category><category term="checklists"></category><category term="cocoa"></category><category term="comment"></category><category term="communication"></category><category term="coursera"></category><category term="database migration"></category><category term="date"></category><category 
term="design"></category><category term="development"></category><category term="dictionary"></category><category term="dns"></category><category term="documentation"></category><category term="emacs"></category><category term="enclosures"></category><category term="encryption"></category><category term="endpoint"></category><category term="erlang"></category><category term="etree"></category><category term="exploration"></category><category term="fetchmail"></category><category term="ffmpeg"></category><category term="filtering"></category><category term="flickr"></category><category term="freebsd"></category><category term="georgegalloway"></category><category term="ggmap"></category><category term="gis"></category><category term="gist"></category><category term="github"></category><category term="glob"></category><category term="google talk"></category><category term="gps"></category><category term="grep"></category><category term="gui"></category><category term="gzipreader"></category><category term="h2"></category><category term="hashlib"></category><category term="hql"></category><category term="html"></category><category term="imap"></category><category term="imgur"></category><category term="imgurl"></category><category term="instant messaging"></category><category term="interactive"></category><category term="ipv4"></category><category term="ipv6"></category><category term="jabber"></category><category term="javamail"></category><category term="javascript"></category><category term="jdbc"></category><category term="knitr"></category><category term="libcurl"></category><category term="libmpg123"></category><category term="linkedin"></category><category term="links"></category><category term="locationservices"></category><category term="log formatting"></category><category term="log4j"></category><category term="logback"></category><category term="macosx"></category><category term="makefile"></category><category term="migration"></category><category 
term="mode"></category><category term="mongo"></category><category term="mp3"></category><category term="mp4"></category><category term="mpg123"></category><category term="mplayer"></category><category term="multimedia"></category><category term="mutagen"></category><category term="nlp"></category><category term="nosql"></category><category term="oauth"></category><category term="objectivec"></category><category term="openstreetmap"></category><category term="org-mode"></category><category term="package management"></category><category term="parallelisation"></category><category term="photosharing"></category><category term="pipes"></category><category term="pki"></category><category term="play"></category><category term="plots"></category><category term="ports"></category><category term="postal code"></category><category term="postgis"></category><category term="postgres"></category><category term="presstv"></category><category term="procmail"></category><category term="psql"></category><category term="psycopg2"></category><category term="pymongo"></category><category term="rails"></category><category term="readline"></category><category term="reddit"></category><category term="reproducible research"></category><category term="resample"></category><category term="retrieval"></category><category term="rexml"></category><category term="rubyonrails"></category><category term="scheduling"></category><category term="shapefiles"></category><category term="shell"></category><category term="smugmug"></category><category term="spring data"></category><category term="sql"></category><category term="statistics"></category><category term="sum of residuals"></category><category term="svn"></category><category term="swing"></category><category term="syndication"></category><category term="text mining"></category><category term="tm"></category><category term="traffic"></category><category term="unix"></category><category term="upload"></category><category term="url 
escaping"></category><category term="user agent"></category><category term="web mechanize"></category><category term="web output"></category><category term="web service"></category><category term="webapi"></category><category term="word count"></category><category term="world time"></category><category term="worldbank"></category><category term="xmpp"></category><category term="yaml"></category><category term="yelp"></category><category term="zamzar"></category><category term="zenroll"></category><category term="zlib"></category><title type="text">The Prolific Programmer -- on the web....</title><subtitle type="html">Free code</subtitle><link rel="http://schemas.google.com/g/2005#feed" type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/posts/default"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default?alt=atom"></link><link rel="alternate" type="text/html" href="http://www.prolificprogrammer.com/"></link><link rel="hub" href="http://pubsubhubbub.appspot.com/"></link><link rel="next" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default?alt=atom&start-index=26&max-results=25"></link><author><name>Hasan Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><generator version="7.00" uri="http://www.blogger.com">Blogger</generator><openSearch:totalResults>106</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-6157408210125261684.post-4629828637351397901</id><published>2014-09-01T10:06:00.000-07:00</published><updated>2014-09-01T10:06:02.457-07:00</updated><category 
scheme="http://www.blogger.com/atom/ns#" term="csv"></category><category scheme="http://www.blogger.com/atom/ns#" term="flask"></category><category scheme="http://www.blogger.com/atom/ns#" term="http"></category><category scheme="http://www.blogger.com/atom/ns#" term="json"></category><category scheme="http://www.blogger.com/atom/ns#" term="python"></category><category scheme="http://www.blogger.com/atom/ns#" term="rest"></category><title type="text">How to Record Data</title><content type="html"><p>I know it\'s <a href="http://en.wikipedia.org/wiki/Labor_Day">Labor Day</a> and what-nots and I\'m supposed to be celebrating the end of <a href="http://www.burningman.com/">Burning man</a>, being at a bar-be-queue, but I\'m not. Instead, I\'m tweaking things, bringing me to what I just accomplished -- a <a href="http://flask.pocoo.org">flask</a>-based REST API to data, in <a href="http://python.org">python</a>, naturally:<code><pre><br />from backports import lzma<br />import cStringIO as StringIO<br />import csv<br />import datetime<br />from flask import Flask, request, Response<br />import json<br /><br /># TODO force SSL for post -- http://flask.pocoo.org/snippets/111/<br /><br />DATA_FILE = \'sanguine.csv.xz\'<br />app = Flask(__name__)<br /><br />@app.route(\'/\', methods = [\'GET\'])<br />def index():<br />    with lzma.LZMAFile(DATA_FILE, \'r\') as data:<br />        output = StringIO.StringIO()<br />        reader = csv.DictReader(data, fieldnames=[\'Timestamp\',\'User\',\'Latitude\',\'Longitude\'], quoting=csv.QUOTE_MINIMAL, lineterminator=\'\\r\\n\')<br />        reader.next() # skip header line<br />        return(Response(json.dumps(list(reader)), mimetype=\'application/json\'))<br /><br />@app.route(\'/\', methods=[\'POST\'])<br />def newdatapiece():<br />    with lzma.LZMAFile(DATA_FILE, \'a\') as data:<br />        writer = csv.DictWriter(data, fieldnames=[\'Timestamp\',\'User\', \'Latitude\',\'Longitude\'], quoting = csv.QUOTE_MINIMAL, 
lineterminator=\'\\r\\n\')<br />        row = {}<br />        row[\'Timestamp\'] = datetime.datetime.now().strftime(\'%s\')<br />        row[\'User\'] = request.form[\'user_id\']<br />        row[\'Latitude\'] = request.form[\'lat\']<br />        row[\'Longitude\'] = request.form[\'lon\']<br />        writer.writerow(row)<br />    return \'\', 201<br /><br />@app.route(\'/analyze\', methods=[\'GET\']) <br />def analysis():<br />    lines = []<br />    with lzma.LZMAFile(DATA_FILE, \'r\') as data:<br />        lines = data.readlines()<br />    return(Response(lines, mimetype=\'application/csv\'))<br /><br />if __name__ == \'__main__\':<br />    with lzma.LZMAFile(DATA_FILE, \'w\') as data:<br />        writer = csv.DictWriter(data, fieldnames=[\'Timestamp\',\'User\', \'Latitude\',\'Longitude\'], quoting = csv.QUOTE_MINIMAL, lineterminator=\'\\r\\n\')<br />        writer.writeheader()<br />    app.run(host=\'0.0.0.0\', port=8080, debug=True)<br /></pre></code></content><link rel="replies" type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/4629828637351397901/comments/default" title="Post Comments"></link><link rel="replies" type="text/html" href="http://www.prolificprogrammer.com/2014/09/how-to-record-data.html#comment-form" title="0 Comments"></link><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/4629828637351397901"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/4629828637351397901"></link><link rel="alternate" type="text/html" href="http://www.prolificprogrammer.com/2014/09/how-to-record-data.html" title="How to Record Data"></link><author><name>Hasan Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" 
src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6157408210125261684.post-682776287318068816</id><published>2014-08-26T22:57:00.000-07:00</published><updated>2014-08-26T22:57:23.584-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="atom"></category><category scheme="http://www.blogger.com/atom/ns#" term="hashlib"></category><category scheme="http://www.blogger.com/atom/ns#" term="json"></category><category scheme="http://www.blogger.com/atom/ns#" term="python"></category><category scheme="http://www.blogger.com/atom/ns#" term="rss"></category><category scheme="http://www.blogger.com/atom/ns#" term="SHA"></category><title type="text">How to Watch webpages for Changes</title><content type="html"><p>Today, I encountered a webpage for which there is no rss feed, nor atom feed. I whipped up something to watch the page myself and report on any changes in <a href="http://python.org">python</a>. Hey, <a href="https://www.python.org/~guido/">Guido</a>, if you integrate <a href="https://pypi.python.org/pypi/requests">requests</a>, I won\'t have any non-stdlib requirements to the script. What do you say? <b>please</b>? 
Not that the <a href="https://wiki.python.org/moin/BDFL">BDFL</a> reads my blog but anyway, here\'s the code:<pre><code><br />#!/Users/hdiwan/.virtualenvs/globetrekker/bin/python<br />import argparse<br />import hashlib<br />import json<br />import logging<br />import pprint<br />import requests<br />import smtplib<br /><br /><br />def send_mail(msg, user, password):<br />    server = smtplib.SMTP(\'smtp.gmail.com\', 587)<br />    server.ehlo()<br />    server.starttls()<br />    server.ehlo()<br />    server.login(user, password)<br />    server.sendmail(user, user, msg)<br /><br />if __name__ == \'__main__\':<br />    argparser = argparse.ArgumentParser(description=\'Check a website for changes\')<br />    argparser.add_argument(\'-n\', \'--url\', type=str, default=None, help=\'Add URL to watcher\',  action=\'store\')<br />    argparser.add_argument(\'-l\', \'--list\', action=\'store_true\')<br />    argparser.add_argument(\'-u\', \'--user\', type=str, default=\'hd1@jsc.d8u.us\', help=\'Your username\',  action=\'store\')<br />    argparser.add_argument(\'-p\', \'--password\', type=str, help=\'Your password\', action=\'store\')<br />    argparser.add_argument(\'-v\', \'--verbose\', action=\'store_false\')<br />    parsed = argparser.parse_args()<br /><br />    if not parsed.verbose:<br />        logging.basicConfig(level=logging.DEBUG)<br />    else:<br />        logging.basicConfig(level=logging.FATAL)<br /><br />    if parsed.url:<br />        new_hash = {parsed.url: 0}<br />        output = json.dumps(new_hash)<br />        logging.debug(output)<br />        try:<br />            with open(\'/var/tmp/.globetrekker.txt\', \'a\') as fin:<br />                data = json.load(fin)<br />                data[parsed.url] = 0<br />                logging.debug(data)<br />                json.dump(data, fin)<br />        except:<br />            with open(\'/var/tmp/.globetrekker.txt\', \'w\') as fout:<br />                json.dump(new_hash, fout)<br />        
exit()<br /><br />    with open(\'/var/tmp/.globetrekker.txt\', \'r\') as fin:<br />        stored_hash_json = json.load(fin)<br />        logging.debug(stored_hash_json)<br />        if parsed.list:<br />            for k in stored_hash_json:<br />                print(k)<br />            exit()<br />    new_hashes = []<br />    stored_hash = stored_hash_json<br />    logging.debug(stored_hash)<br />    for url in stored_hash:<br />        logging.debug(\'{} is our URL\'.format(url))<br />        browser = requests.get(url)<br />        encoding = \'utf-8\'<br />        logging.debug(\'page retrieved -- {}\'.format(url[0]))<br />        text = browser.content<br />        encoded = text.encode(encoding, errors=\'xmlcharrefreplace\')<br />        logging.debug(encoded)<br />        decoded = encoded.decode(encoding, errors=\'xmlcharrefreplace\')<br />        logging.debug(decoded)<br />        new_hash = hashlib.sha1(decoded).hexdigest()<br />        logging.debug(\'Calculated hash code: {}\'.format(new_hash))<br />        logging.debug(\'Stored hash: {}\'.format(stored_hash[url]))<br />        if new_hash != stored_hash[url]:<br />            logging.debug(\'{} changed\'.format(url))<br />            if stored_hash[url] != 0:<br />                send_mail(u\'Subject: {} Change detected\\r\\n\\r\\n--H\'.format(url), parsed.user, parsed.password)<br /><br />            stored_hash[url] = new_hash<br />    with open(\'/var/tmp/.globetrekker.txt\', \'w\') as fout:<br />        json.dump(stored_hash, fout)<br /></code></pre></content><link rel="replies" type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/682776287318068816/comments/default" title="Post Comments"></link><link rel="replies" type="text/html" href="http://www.prolificprogrammer.com/2014/08/how-to-watch-webpages-for-changes.html#comment-form" title="0 Comments"></link><link rel="edit" type="application/atom+xml" 
href="http://www.blogger.com/feeds/6157408210125261684/posts/default/682776287318068816"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/682776287318068816"></link><link rel="alternate" type="text/html" href="http://www.prolificprogrammer.com/2014/08/how-to-watch-webpages-for-changes.html" title="How to Watch webpages for Changes"></link><author><name>Hasan Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6157408210125261684.post-8204778705105347148</id><published>2014-08-23T19:25:00.000-07:00</published><updated>2014-08-23T19:25:26.053-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="gnu privacy guard"></category><category scheme="http://www.blogger.com/atom/ns#" term="openpgp"></category><category scheme="http://www.blogger.com/atom/ns#" term="pki"></category><category scheme="http://www.blogger.com/atom/ns#" term="python"></category><title type="text">How to Sign Text Using Python</title><content type="html"><p>An important tenet of any asymmetric encryption system is that a public key must be distributed far and wide -- it is used to encrypt information -- while the corresponding private key must be kept, well, private. 
The code below shows how to sign some text with your public key in python:<code><pre><br />def sign(message):<br />    gpg = gnupg.GPG(gnupghome=\'{}/.gnupg\'.format(os.path.expanduser(\'~\')))<br />    gpg.encoding = \'utf-8\'<br />    gpg.secret_keyring=[\'secring.gpg\']<br />    gpg.public_keyring=[\'pubring.gpg\']<br />    signed = gpg.sign(message)<br />    return str(signed)<br /></pre></code>Since <a href="https://pythonhosted.org/python-gnupg/">python-gnupg</a> can take a list for both public and private keys, both filenames go in as a list. Message is a string of plaintext, whose return value\'s str method returns a cleartext signature as well as the message itself.</p></content><link rel="replies" type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/8204778705105347148/comments/default" title="Post Comments"></link><link rel="replies" type="text/html" href="http://www.prolificprogrammer.com/2014/08/how-to-sign-text-using-python.html#comment-form" title="0 Comments"></link><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/8204778705105347148"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/8204778705105347148"></link><link rel="alternate" type="text/html" href="http://www.prolificprogrammer.com/2014/08/how-to-sign-text-using-python.html" title="How to Sign Text Using Python"></link><author><name>Hasan Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" 
src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6157408210125261684.post-2916320482359317525</id><published>2014-08-23T13:49:00.002-07:00</published><updated>2014-08-23T13:49:56.480-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="libcurl"></category><category scheme="http://www.blogger.com/atom/ns#" term="pycurl"></category><category scheme="http://www.blogger.com/atom/ns#" term="python"></category><category scheme="http://www.blogger.com/atom/ns#" term="user agent"></category><title type="text">How to Set Your User-Agent using PyCurl</title><content type="html"><p>An unnamed CDN was blocking my sharing script because it wasn\'t in its <a href="http://www.useragentstring.com">whitelist of approved user agents</a>. And it\'s a common one. What to do? Fake it, till you make it, to borrow a turn of phrase, like so:<pre><code><br />curlObj.setOpt(pycurl.USER_AGENT, \'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36\')<br /></code></pre></content><link rel="replies" type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/2916320482359317525/comments/default" title="Post Comments"></link><link rel="replies" type="text/html" href="http://www.prolificprogrammer.com/2014/08/how-to-set-your-user-agent-using-pycurl.html#comment-form" title="0 Comments"></link><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/2916320482359317525"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/2916320482359317525"></link><link rel="alternate" type="text/html" href="http://www.prolificprogrammer.com/2014/08/how-to-set-your-user-agent-using-pycurl.html" title="How to Set Your User-Agent using 
PyCurl"></link><author><name>Hasan Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6157408210125261684.post-5063330501642838945</id><published>2014-08-23T00:03:00.000-07:00</published><updated>2014-08-23T00:03:53.478-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="openbsd"></category><category scheme="http://www.blogger.com/atom/ns#" term="ports"></category><category scheme="http://www.blogger.com/atom/ns#" term="python"></category><title type="text">How to Search for Ports on BSD</title><content type="html"><p>BSD systems, well, at least, <a href="http://www.freebsd.org">FreeBSD</a> and <a href="http://www.openbsd.org">OpenBSD</a> feature 3rd party packages in a ports-system. The <a href="http://python.org">python</a> script below lets you search for ports by substring:</p><pre><code><br />#!/home/hdiwan/.virtualenvs/ports/bin/python<br />import argparse<br />import csv<br /><br />INDEX = \'/usr/ports/INDEX\'<br />if __name__ == \'__main__\':<br />    args_ = argparse.ArgumentParser(description=\'Ports tool for OpenBSD\')<br />    args_.add_argument(\'query\', help=\'Query\', type=unicode, action=\'store\')<br />    args = args_.parse_args()<br />    with open(INDEX, \'r\') as index:<br />        reader = csv.reader(index, delimiter=\'|\')<br />        print(args.query)<br />        for line in list(reader):<br />            if line[0].find(args.query) > 0:<br />                print line[0]<br />                exit<br /></code></pre><p>You need python and to have an index file (modify the path if necessary). 
The output of this looks like: <pre><br />% python ./ports.py -q "ruby"<br />ruby<br />jruby-jdbc-h2-1.3.170.1<br />jruby-jdbc-mysql-5.1.22.1<br />jruby-jdbc-postgres-9.2.1002.1<br />jruby-jdbc-sqlite3-3.7.2.1<br />vim-7.4.135p0-gtk2-perl-python-ruby<br />vim-7.4.135p0-gtk2-perl-python3-ruby<br />vim-7.4.135p0-no_x11-perl-python-ruby<br />vim-7.4.135p0-no_x11-perl-python3-ruby<br />jruby-1.7.9<br />weechat-ruby-0.4.2<br />eruby-1.0.5p14<br />mod_ruby-1.2.6p7<br />jruby-profligacy-1.0</pre></content><link rel="replies" type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/5063330501642838945/comments/default" title="Post Comments"></link><link rel="replies" type="text/html" href="http://www.prolificprogrammer.com/2014/08/how-to-search-for-ports-on-bsd.html#comment-form" title="0 Comments"></link><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/5063330501642838945"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/5063330501642838945"></link><link rel="alternate" type="text/html" href="http://www.prolificprogrammer.com/2014/08/how-to-search-for-ports-on-bsd.html" title="How to Search for Ports on BSD"></link><author><name>Hasan Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6157408210125261684.post-7777492820641836786</id><published>2014-08-19T09:15:00.000-07:00</published><updated>2014-08-19T21:28:32.189-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="python"></category><category scheme="http://www.blogger.com/atom/ns#" term="reddit"></category><category 
scheme="http://www.blogger.com/atom/ns#" term="syndication"></category><title type="text">How to Synchronise a Syndication Feed to Reddit</title><content type="html"><p>The <a href="http://www.python.org">Python</a> code below posts the newest entry in your blog\'s syndication feed to <a href="http://www.reddit.com">Reddit</a>\'s <a href="http://www.reddit.com/r/programming">programming</a> subreddit automatically:<pre><code><br />#!/Users/hdiwan/.virtualenvs/blogger2reddit/bin/python<br />import argparse<br />import feedparser<br />import logging<br />import operator<br />import praw<br /><br />if __name__ == \'__main__\':<br />    parse = argparse.ArgumentParser(description="Submit a feed\'s newest entry to reddit")<br />    parse.add_argument(\'-f\', \'--feed\', action=\'store\', help=\'Feed URL\', default=\'http://www.prolificprogrammer.com/atom.xml\')<br />    parse.add_argument(\'-p\', \'--password\', action=\'store\', help=\'Reddit password\')<br />    parse.add_argument(\'-u\', \'--user\', action=\'store\', help=\'Reddit Username\')<br />    parse.add_argument(\'-v\', \'--verbose\', action=\'store_true\', help=\'Verbose debugging\')<br />    args = parse.parse_args()<br /><br />    if args.verbose:<br />        logging.basicConfig(level=logging.DEBUG)<br />    else:<br />        logging.basicConfig(level=logging.FATAL)<br /><br />    feed = feedparser.parse(args.feed)<br />    entries = feed.entries<br />    # Sort newest first so that entries[0] is the latest post<br />    entries = sorted(entries, key=operator.itemgetter(\'published_parsed\'), reverse=True)<br />    logging.debug(entries)<br /><br />    entry = entries[0]<br />    submission_title = entry.title<br />    submission_link = entry.link<br /><br />    r = praw.Reddit(user_agent=\'example\')<br /><br />    r.login(args.user, args.password)<br />    logging.debug(\'logged in to reddit as {}\'.format(args.user))<br />    sr = r.get_subreddit(\'programming\')<br />    sr.submit(submission_title, url=submission_link)<br /></code></pre></content><link rel="replies" 
type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/7777492820641836786/comments/default" title="Post Comments"></link><link rel="replies" type="text/html" href="http://www.prolificprogrammer.com/2014/08/how-to-synchronise-syndication-feed-to.html#comment-form" title="0 Comments"></link><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/7777492820641836786"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/7777492820641836786"></link><link rel="alternate" type="text/html" href="http://www.prolificprogrammer.com/2014/08/how-to-synchronise-syndication-feed-to.html" title="How to Synchronise a Syndication Feed to Reddit"></link><author><name>Hasan Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6157408210125261684.post-4716980168873773952</id><published>2014-08-18T14:45:00.001-07:00</published><updated>2014-08-18T15:21:32.401-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="csv"></category><category scheme="http://www.blogger.com/atom/ns#" term="feedparser"></category><category scheme="http://www.blogger.com/atom/ns#" term="python"></category><category scheme="http://www.blogger.com/atom/ns#" term="random"></category><title type="text">How to be your Own CNBC Analyst</title><content type="html"><p>We were having a cheeky discussion on lily today about how "You could probably be recognised as a market analyst if you just reported "Dow &lt;rises/falls&gt; on &lt;top headline from news.google.com&gt;" at the end of the day." 
I decided to put this to the test, and automate the talking heads out of a job, by mining a random headline from the <a href="http://www.nytimes.com">New York Times</a> and the latest NASDAQ quote:<code><pre><br />#!/Users/hdiwan/.virtualenvs/marketAnalyst/main.py<br />import cStringIO as StringIO<br />import csv<br />import feedparser<br />import logging<br />import random<br />import requests<br /><br />if __name__ == \'__main__\':<br />    logging.basicConfig(level = logging.FATAL)<br />    quotes_ = requests.get(\'http://ichart.yahoo.com/table.csv?s=QQQ\')<br />    quotes_ = quotes_.content<br />    quotes = StringIO.StringIO(quotes_)<br />    reader = list(csv.reader(quotes))<br />    # Columns are Date,Open,High,Low,Close,Volume,Adj Close, so Close is index 4<br />    todays_close = float(reader[1][4])<br />    yesterdays_close = float(reader[2][4])<br />    <br />    news = feedparser.parse(\'http://www.nytimes.com/roomfordebate/index.rss?category=business\')<br />    entries = news.entries<br />    random.shuffle(entries)<br />    logging.info(news)<br />    story = entries[0]<br /><br />    difference = todays_close - yesterdays_close<br />    reason = story.title<br />    print(\'NASDAQ change {} because of {}\'.format(difference, reason))<br /></pre></code><pre><br />python ~/.virtualenvs/marketAnalyst/main.py<br />NASDAQ change 0.05 because of Can the U.S. 
Still Be a Leader in the Middle East?<br /></pre></content><link rel="replies" type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/4716980168873773952/comments/default" title="Post Comments"></link><link rel="replies" type="text/html" href="http://www.prolificprogrammer.com/2014/08/how-to-be-your-own-cnbc-analyst.html#comment-form" title="0 Comments"></link><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/4716980168873773952"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/4716980168873773952"></link><link rel="alternate" type="text/html" href="http://www.prolificprogrammer.com/2014/08/how-to-be-your-own-cnbc-analyst.html" title="How to be your Own CNBC Analyst"></link><author><name>Hasan Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6157408210125261684.post-8212675849430011430</id><published>2014-08-17T04:45:00.001-07:00</published><updated>2014-08-17T04:45:58.159-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="java"></category><category scheme="http://www.blogger.com/atom/ns#" term="json"></category><category scheme="http://www.blogger.com/atom/ns#" term="spring"></category><title type="text">How to Produce JSON Properly from Spring</title><content type="html"><p>Not that the default <a href="http://www.spring.io">spring JSON</a> is <i>that</i> bad. 
It looks like this:<pre><br />[{ 2.62779739789553,1556.68506945,\'El Pollo Loco\'},{4.087178144481979,1632.109670148,\'Paper or Plastik Cafe\'}<br /></pre>I don\'t like this and want something more like:<pre><br />[{\'azimuth\': 1.3775424158235956,  \'distance\': 625.924396521,  \'name\': \'Starbucks\'}, {\'azimuth\': 1.628478725514169,  \'distance\': 646.038250929,  \'name\': \'Asian Cuisine\'}]<br /></pre>And I figured it out:<pre><code><br />for (results.next(); !results.isAfterLast(); results.next()) {<br />    Spot spot = new Spot();<br />    spot.setAzimuth(results.getDouble("bearing"));<br />    spot.setDistance(results.getDouble("distance"));<br />    spot.setName(results.getString("name"));<br />    LOGGER.debug(spot.toString());<br />    spots.add(spot);<br />}</code><br /></pre>Yes, by making it a list of a bean I wrote, instead of retrieving the rows directly into a collection, it seems I can force a hash as output from <a href="http://spring.io">Spring</a>.</p></content><link rel="replies" type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/8212675849430011430/comments/default" title="Post Comments"></link><link rel="replies" type="text/html" href="http://www.prolificprogrammer.com/2014/08/how-to-produce-json-properly-from-spring.html#comment-form" title="0 Comments"></link><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/8212675849430011430"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/8212675849430011430"></link><link rel="alternate" type="text/html" href="http://www.prolificprogrammer.com/2014/08/how-to-produce-json-properly-from-spring.html" title="How to Produce JSON Properly from Spring"></link><author><name>Hasan Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" 
height="32" src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6157408210125261684.post-3202721524976507913</id><published>2014-08-14T11:02:00.000-07:00</published><updated>2014-08-14T11:03:22.307-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="csv"></category><category scheme="http://www.blogger.com/atom/ns#" term="ggmap"></category><category scheme="http://www.blogger.com/atom/ns#" term="imgur"></category><category scheme="http://www.blogger.com/atom/ns#" term="iraqbodycount"></category><category scheme="http://www.blogger.com/atom/ns#" term="r"></category><title type="text">How to Visualise Deaths in Iraq Pt 2</title><content type="html"><p>The <a href="http://www.prolificprogrammer.com/2014/08/how-to-visualise-deaths-in-iraq-pt-1.html">perl script</a> massages the data to a csv and puts it in the temporary directory for further processing. 
The <a href="http://www.r-project.org">R</a> script below does that further processing, uploads the result to <a href="http://imgur.com/">imgur</a> automatically, and prints the link to the image on the console:<pre><br />#!/usr/bin/Rscript<br />require(RJSONIO)<br />require(RCurl)<br />require(ggmap)<br />require(imguR)<br /><br />setwd(\'~\')<br />filename <- system(\'/usr/bin/perl ./bin/ibc.pl\', intern=TRUE)<br />iraq <- read.csv(filename, header=TRUE, stringsAsFactors=FALSE)<br />iraq <- cbind(iraq, geocode(iraq$City))<br />map <- get_map(\'iraq\', zoom = 6)<br />mymap <- ggmap(map)+geom_point(data = iraq, aes(x=lon, y=lat, size=Casualty.Count))<br />ggsave(\'/tmp/ibc.jpg\')<br />print(imguRupload(\'/tmp/ibc.jpg\')$link)<br />unlink(c(filename,\'/tmp/ibc.jpg\'))<br /></pre>The resulting image is:<br/><img src="http://i.imgur.com/ABnQSsG.jpg" height="50%" width="50%"/><br/>(click <a href="http://i.imgur.com/ABnQSsG.jpg">here</a> for the full-size image)</content><link rel="replies" type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/3202721524976507913/comments/default" title="Post Comments"></link><link rel="replies" type="text/html" href="http://www.prolificprogrammer.com/2014/08/how-to-visualise-deaths-in-iraq-pt-2.html#comment-form" title="0 Comments"></link><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/3202721524976507913"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/3202721524976507913"></link><link rel="alternate" type="text/html" href="http://www.prolificprogrammer.com/2014/08/how-to-visualise-deaths-in-iraq-pt-2.html" title="How to Visualise Deaths in Iraq Pt 2"></link><author><name>Hasan Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" 
src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6157408210125261684.post-1460123936902167027</id><published>2014-08-14T10:46:00.001-07:00</published><updated>2014-08-14T10:46:47.403-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="csv"></category><category scheme="http://www.blogger.com/atom/ns#" term="iraqbodycount"></category><category scheme="http://www.blogger.com/atom/ns#" term="perl"></category><title type="text">How to Visualise Deaths in Iraq Pt 1</title><content type="html"><p>The <a href="https://www.iraqbodycount.org">Iraq Body Count</a> project has taken on the morbid task of cataloging "the violent civilian deaths that have resulted from the 2003 military intervention in Iraq". Being a pretty gruesome task, they leave visualisations of this data to others. The <a href="http://perl.org">perl</a> script below reformats their latest data as a CSV:<pre><br />#!/usr/bin/perl<br />use strict;<br />use warnings;<br />use File::Temp;<br />use HTML::TreeBuilder;<br />use LWP::UserAgent;<br />use Net::SSL;<br />use Text::CSV_XS;<br />use Date::Manip::DM5;<br />use URI::Escape qw/uri_escape_utf8/;<br />use XML::Parser;<br />use vars qw($in $line);<br /><br />my $fh = File::Temp->new(SUFFIX=>\'.csv\', UNLINK => 0);<br />my $out = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });<br />my $ua = LWP::UserAgent->new( ssl_opts => { verify_hostname => 0 }, );<br />my $res = $ua->get(\'https://www.iraqbodycount.org/database/recent\');<br />my $html = HTML::TreeBuilder -> new_from_content($res -> content);<br />my @dates = $html->look_down(\'_tag\',\'p\');<br />$out->eol("\\r\\n");<br />$out->print($fh, [\'City\', \'Casualty Count\']);<br />foreach my $date (@dates) {<br /> next unless $date->as_text =~ /:/ and $date =~ /[a-z]/;<br /> next if $date -> as_text =~ /CASUALTIES SO FAR/i;<br /> $line = ["$1, 
Iraq",$2] if $date->as_text =~ /(^[[:alpha:][:space:]]+){1}:\\s+(\\d+)/;<br /> $out->print($fh, $line);<br />}<br />$html->delete;<br />print $fh->filename."\\n";<br /></pre></content><link rel="replies" type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/1460123936902167027/comments/default" title="Post Comments"></link><link rel="replies" type="text/html" href="http://www.prolificprogrammer.com/2014/08/how-to-visualise-deaths-in-iraq-pt-1.html#comment-form" title="0 Comments"></link><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/1460123936902167027"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/1460123936902167027"></link><link rel="alternate" type="text/html" href="http://www.prolificprogrammer.com/2014/08/how-to-visualise-deaths-in-iraq-pt-1.html" title="How to Visualise Deaths in Iraq Pt 1"></link><author><name>Hasan Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6157408210125261684.post-1737176967585174538</id><published>2014-08-11T23:17:00.001-07:00</published><updated>2014-08-11T23:19:33.862-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="apache"></category><category scheme="http://www.blogger.com/atom/ns#" term="lighttpd"></category><category scheme="http://www.blogger.com/atom/ns#" term="log formatting"></category><category scheme="http://www.blogger.com/atom/ns#" term="nginx"></category><category scheme="http://www.blogger.com/atom/ns#" term="python"></category><category scheme="http://www.blogger.com/atom/ns#" term="web output"></category><title 
type="text">How to Track your website Accesses on the Web</title><content type="html"><p>The script below will reformat an <a href="http://httpd.apache.org/docs/2.2/logs.html#accesslog">apache access log</a>, as produced by <a href="http://lighttpd.net">lighttpd</a>, as a web page:<code><pre><br />#!/usr/local/bin/python<br />import cgi<br />import csv<br />import cStringIO as StringIO<br /><br />if __name__ == \'__main__\':<br />    outfile = StringIO.StringIO()<br />    outfile.write(\'&lt;html&gt;&lt;head&gt;&lt;title&gt;Accesses&lt;/title&gt;&lt;/head&gt;&lt;body&gt;&lt;h1&gt;Accesses to hasan.d8u.us&lt;/h1&gt;&lt;table&gt;&lt;tr&gt;&lt;th&gt;Remote IP address&lt;/th&gt;&lt;th&gt;Username&lt;/th&gt;&lt;th&gt;Timestamp&lt;/th&gt;&lt;th&gt;HTTP Verb&lt;/th&gt;&lt;th&gt;HTTP Endpoint&lt;/th&gt;&lt;th&gt;HTTP Status Code&lt;/th&gt;&lt;th&gt;HTTP Request Length&lt;/th&gt;&lt;/tr&gt;\')<br /><br />    with open(\'/var/log/lighttpd/access.log\') as infile:<br />        reader = csv.reader(infile, delimiter=\' \', doublequote=False)<br />        for row in reversed(list(reader)):<br />            try:<br />                remote_ip = row[0]<br />                auth_user = row[2]<br />                if auth_user==\'-\': <br />                    auth_user = \'n/a\'<br />                request_timestamp = \'{} {}\'.format(row[3], row[4]).replace(\'[\',\'\').replace("]","")<br />                request = row[5].split(\' \')<br />                request_type = request[0]<br />                request_endpoint = \'http://{}{}\'.format(row[1],request[1])<br />                request_version = request[2].replace(\'HTTP/\',\'\')<br />                request_code = row[6]<br />                request_length = row[7]<br />                outfile.write(\'&lt;tr&gt;\')<br />                outfile.write(\'&lt;td&gt;{}&lt;/td&gt;&lt;td&gt;{}&lt;/td&gt;&lt;td&gt;{}&lt;/td&gt;&lt;td&gt;{}&lt;/td&gt;&lt;td&gt;&lt;a 
href="{}"&gt;{}&lt;/a&gt;&lt;/td&gt;&lt;td&gt;{}&lt;/td&gt;&lt;td&gt;{}&lt;/td&gt;\'.format(remote_ip, auth_user, request_timestamp, request_type, request_endpoint, request_endpoint, request_code, request_length))<br />                outfile.write(\'&lt;/tr&gt;\')<br />            except IndexError,e:<br />                continue<br /><br />    outfile.write(\'&lt;/table&gt;&lt;/body&gt;&lt;/html&gt;\')<br />    print(\'Content-Type: text/html\\r\\n\')<br />    print(outfile.getvalue())<br /></pre></code></content><link rel="replies" type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/1737176967585174538/comments/default" title="Post Comments"></link><link rel="replies" type="text/html" href="http://www.prolificprogrammer.com/2014/08/how-to-track-your-website-accesses-on.html#comment-form" title="0 Comments"></link><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/1737176967585174538"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/1737176967585174538"></link><link rel="alternate" type="text/html" href="http://www.prolificprogrammer.com/2014/08/how-to-track-your-website-accesses-on.html" title="How to Track your website Accesses on the Web"></link><author><name>Hasan Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6157408210125261684.post-8968183935467816668</id><published>2014-08-10T19:59:00.004-07:00</published><updated>2014-08-11T11:24:42.911-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="html"></category><category scheme="http://www.blogger.com/atom/ns#" 
term="python"></category><category scheme="http://www.blogger.com/atom/ns#" term="sharedlinks"></category><title type="text">How to Find a Link I Sent You Without Being Embarrassed #2</title><content type="html"><p>I inadvertently dropped the <a href="http://www.prolificprogrammer.com/2014/05/how-to-find-link-i-sent-you-without.html">atom feed</a> a little while ago. However, I\'ve resurrected the functionality using <a href="http://hasan.d8u.us/links.py.cgi">HTML</a>, and here\'s the snippet: <code><pre><br />    with gzip.open(DATA_FILE, \'r\') as csvin:<br />        reader = csv.DictReader(csvin, fieldnames = [\'Time\',\'Recipient\',\'Link\'], quoting = csv.QUOTE_MINIMAL, lineterminator = \'\\n\')<br />        print(\'Status: 200 OK\\nContent-Type: text/html\\n\')<br />        print(\'&lt;!DOCTYPE html&gt;\\n&lt;html&gt;&lt;head&gt;&lt;title&gt;Shared Links&lt;/title&gt;&lt;/head&gt;&lt;body&gt;Email addresses are starred out for privacy reasons.&lt;table&gt;&lt;tr&gt;&lt;th&gt;Time&lt;/th&gt;&lt;th&gt;Recipient&lt;/th&gt;&lt;th&gt;Link&lt;/th&gt;&lt;/tr&gt;\'),<br />        links = list(reader)<br />        for link in reversed(links[1:]):<br />            time = datetime.datetime.fromtimestamp(float(link[\'Time\']))<br />            recipient = \'*\'*8+\'-at-\'+link[\'Recipient\'][link[\'Recipient\'].index(\'@\')+1:]<br />            link = \'&lt;a href="{}"&gt;{}&lt;/a&gt;\'.format(link[\'Link\'], link[\'Link\'])<br />            print(\'&lt;tr&gt;&lt;td&gt;{}&lt;/td&gt;&lt;td&gt;{}&lt;/td&gt;&lt;td&gt;{}&lt;/td&gt;&lt;/tr&gt;\'.format(time, recipient, link))<br />        print(\'&lt;/table&gt;&lt;/body&gt;&lt;/html&gt;\')<br /></pre></code></content><link rel="replies" type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/8968183935467816668/comments/default" title="Post Comments"></link><link rel="replies" type="text/html" href="http://www.prolificprogrammer.com/2014/08/how-to-find-link-i-sent-you-without.html#comment-form" title="0 Comments"></link><link rel="edit" 
type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/8968183935467816668"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/8968183935467816668"></link><link rel="alternate" type="text/html" href="http://www.prolificprogrammer.com/2014/08/how-to-find-link-i-sent-you-without.html" title="How to Find a Link I Sent You Without Being Embarrassed #2"></link><author><name>Hasan Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6157408210125261684.post-6507430235309079966</id><published>2014-08-09T23:37:00.000-07:00</published><updated>2014-08-09T23:39:40.151-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="gist"></category><category scheme="http://www.blogger.com/atom/ns#" term="github"></category><category scheme="http://www.blogger.com/atom/ns#" term="python"></category><category scheme="http://www.blogger.com/atom/ns#" term="requests"></category><title type="text">How to Collect Your Gists from Github</title><content type="html"><p>A question came up as to how to collect one\'s <a href="https://gist.github.com">gists</a> into a single repository, so I thought I might as well take a crack at it. 
Here\'s what I\'ve come up with:<pre><code><br />#!/usr/bin/python<br />import argparse<br />import logging<br />import requests<br /><br />if __name__ == \'__main__\':<br />    parser = argparse.ArgumentParser(description = \'Suck up all your public gists into a single git repository on github\')<br />    parser.add_argument(\'-u\',\'--user\', type=unicode, action=\'store\', help=\'Github username\')<br />    parser.add_argument(\'-v\',\'--verbose\', action=\'store_true\', help=\'Up debugging\')<br />    args = parser.parse_args()<br />    <br />    if args.verbose:<br />        logging.basicConfig(level = logging.DEBUG)<br />    else:<br />        logging.basicConfig(level = logging.FATAL)<br /><br />    gists = requests.get(\'https://api.github.com/users/{}/gists\'.format(args.user)).json()<br />    logging.debug(gists)<br />    <br />    print(\'Add the following urls as externals \'),<br />    print([g[\'git_pull_url\'] for g in gists])<br />    <br /></code></pre>You\'ll need the <a href="http://docs.python-requests.org/en/latest/">requests</a> module. The rest is included in the standard library. 
A sample run: <pre><br />% python /tmp/gists.py --user hdiwan                                                                                                       <br />Add the following urls as externals  [u\'https://gist.github.com/6432761.git\', u\'https://gist.github.com/6430078.git\', u\'https://gist.github.com/5573723.git\', u\'https://gist.github.com/5486130.git\', u\'https://gist.github.com/5476099.git\', u\'https://gist.github.com/5318931.git\', u\'https://gist.github.com/5254126.git\', u\'https://gist.github.com/5174259.git\']<br /></pre></p></content><link rel="replies" type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/6507430235309079966/comments/default" title="Post Comments"></link><link rel="replies" type="text/html" href="http://www.prolificprogrammer.com/2014/08/how-to-collect-your-gists-from-github.html#comment-form" title="0 Comments"></link><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/6507430235309079966"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/6507430235309079966"></link><link rel="alternate" type="text/html" href="http://www.prolificprogrammer.com/2014/08/how-to-collect-your-gists-from-github.html" title="How to Collect Your Gists from Github"></link><author><name>Hasan Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6157408210125261684.post-1661304007794388515</id><published>2014-08-06T18:59:00.002-07:00</published><updated>2014-08-06T18:59:32.634-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="ggplot"></category><category 
scheme="http://www.blogger.com/atom/ns#" term="histogram"></category><category scheme="http://www.blogger.com/atom/ns#" term="pandas"></category><category scheme="http://www.blogger.com/atom/ns#" term="python"></category><category scheme="http://www.blogger.com/atom/ns#" term="sharedlinks"></category><title type="text">How to Draw a Histogram in Python</title><content type="html"><code><pre><br />today = links[links[\'Time\'] > int((datetime.date.today()-datetime.timedelta(days=0)).strftime(\'%s\')) - 1]<br />today[\'Hour\'] = [int(datetime.datetime.fromtimestamp(t).strftime(\'%H\')) for t in today[\'Time\']]<br />print(ggplot(today, aes(x=today[\'Hour\'])) + geom_histogram() + xlab(\'Hour of {}\'.format(datetime.date.today())))<br /></pre></code><center><a href="http://imgur.com/BATT9sF"><img src="http://i.imgur.com/BATT9sF.png" title="Hosted by imgur.com" /></a></center></content><link rel="replies" type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/1661304007794388515/comments/default" title="Post Comments"></link><link rel="replies" type="text/html" href="http://www.prolificprogrammer.com/2014/08/how-to-draw-histogram-in-python.html#comment-form" title="0 Comments"></link><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/1661304007794388515"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/1661304007794388515"></link><link rel="alternate" type="text/html" href="http://www.prolificprogrammer.com/2014/08/how-to-draw-histogram-in-python.html" title="How to Draw a Histogram in Python"></link><author><name>Hasan Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" 
src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6157408210125261684.post-5204031939679344028</id><published>2014-08-05T18:57:00.001-07:00</published><updated>2014-08-05T18:57:48.986-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="ggplot"></category><category scheme="http://www.blogger.com/atom/ns#" term="r"></category><category scheme="http://www.blogger.com/atom/ns#" term="WDI"></category><category scheme="http://www.blogger.com/atom/ns#" term="worldbank"></category><title type="text">When can Americans Expect to Live to 100?</title><content type="html"><p>Inspired by <a href="http://freakonometrics.hypotheses.org/">Freakonomics</a> hypothesising that <a href="http://freakonometrics.hypotheses.org/16165">the life expectancies of males and females will converge</a>, I decided to try to answer the query <a href="http://i.imgur.com/UDLgoPL.png">&quot;when can I expect to have descendants that live to 100?&quot;</a>:<center><img src="http://i.imgur.com/UDLgoPL.png" height="337" width="199"/></center><br/>The answer to the question: around 2160.</p></content><link rel="replies" type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/5204031939679344028/comments/default" title="Post Comments"></link><link rel="replies" type="text/html" href="http://www.prolificprogrammer.com/2014/08/when-can-americans-expect-to-live-to-100.html#comment-form" title="0 Comments"></link><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/5204031939679344028"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/5204031939679344028"></link><link rel="alternate" type="text/html" href="http://www.prolificprogrammer.com/2014/08/when-can-americans-expect-to-live-to-100.html" title="When 
can Americans Expect to Live to 100?"></link><author><name>Hasan Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6157408210125261684.post-8649562704534078334</id><published>2014-08-04T09:52:00.000-07:00</published><updated>2014-08-04T09:52:16.572-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="gui"></category><category scheme="http://www.blogger.com/atom/ns#" term="java"></category><category scheme="http://www.blogger.com/atom/ns#" term="sharedlinks"></category><category scheme="http://www.blogger.com/atom/ns#" term="swing"></category><title type="text">How to Visualise Sent Links</title><content type="html"><p>Finally, a GUI for <a href="http://www.prolificprogrammer.com/2014/07/how-to-visualise-data.html">my shared links data</a>. 
It uses only standard JDK methods, and the source is below, enjoy:<code><pre><br />import java.awt.BorderLayout;<br />import java.io.File;<br />import java.io.FileInputStream;<br />import java.io.InputStream;<br />import java.io.IOException;<br />import java.net.URL;<br />import java.util.Date;<br />import java.util.Scanner;<br />import java.util.zip.GZIPInputStream;<br />import javax.swing.JButton;<br />import javax.swing.JFrame;<br />import javax.swing.JLabel;<br />import javax.swing.JOptionPane;<br />import javax.swing.JPanel;<br />import javax.swing.JScrollPane;<br />import javax.swing.JTable;<br />import javax.swing.table.DefaultTableCellRenderer;<br />import javax.swing.table.DefaultTableModel;<br /><br />/**<br /> * Visualises my shared links<br /> * @author Hasan Diwan <hasan.diwan@gmail.com><br /> */<br />public class LinksVisualiser extends JFrame {<br />    JTable table;<br />    DefaultTableModel model;<br />    JButton closeButton, webButton;<br /> /**<br />  * Takes data from a CSV file and places it into a table for display.<br />  * @param source - a reference to the file where the CSV data is located.<br />  */<br /> static String title = "Shared Links";<br /> public LinksVisualiser(String source) {<br />  super(title);<br />  table = new JTable();<br />  JScrollPane scroll = new JScrollPane(table);<br />  String[] colNames = { "Timestamp", "Recipient", "Link"};<br />  model = new DefaultTableModel(colNames, 0);<br />  InputStream is;<br />  try {<br />   if(source.indexOf("http")==0) {<br />    URL facultyURL = new URL(source);<br />    is = new GZIPInputStream(facultyURL.openStream());<br />   }<br />   else { //local file?<br />    File f = new File(source);<br />    is = new GZIPInputStream(new FileInputStream(f));<br />   }<br />   insertData(is);<br />   //table.getColumnModel().getColumn(0).setCellRenderer(new CustomCellRenderer());<br />  }<br />  catch(IOException ioe) {<br />   JOptionPane.showMessageDialog(this, ioe, "Error reading 
data", JOptionPane.ERROR_MESSAGE);<br />  }<br /><br />  JPanel buttonPanel = new JPanel();<br />  closeButton = new JButton("Close");<br />  webButton = new JButton("Weblog");<br />  buttonPanel.add(closeButton);<br />  buttonPanel.add(new JLabel("   You can download this file from our site: "));<br />  buttonPanel.add(webButton);<br /><br />  getContentPane().add(scroll, BorderLayout.CENTER);<br />  getContentPane().add(buttonPanel, BorderLayout.SOUTH);<br />  pack();<br /> }<br /><br /> /**<br />  * Places the data from the specified stream into this table for display.  The data from the file must be in CSV format<br />  * @param is - an input stream which could be from a file or a network connection or URL.<br />  */<br /> void insertData(InputStream is) {<br />  Scanner scan = new Scanner(is);<br />  scan.nextLine();<br />  String[] array;<br />  while (scan.hasNextLine()) {<br />   String line = scan.nextLine();<br />   if(line.indexOf(",")>-1)<br />    array = line.split(",");<br />   else<br />    array = line.split("\\t");<br />   Object[] data = new Object[array.length];<br />   for (int i = 0; i < array.length; i++)<br />    data[i] = array[i];<br />   Date time = new Date(new Long((String)data[0])*1000);<br />   data[0] = time;<br />   model.addRow(data);<br />  }<br />  table.setModel(model);<br /> } <br /><br /> public static void main(String args[]) {<br />  LinksVisualiser l = new LinksVisualiser("http://hasan.d8u.us/sent_links.csv.gz");<br />  l.setVisible(true);<br />  l.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);<br /> }<br />}<br /></code></pre></content><link rel="replies" type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/8649562704534078334/comments/default" title="Post Comments"></link><link rel="replies" type="text/html" href="http://www.prolificprogrammer.com/2014/08/how-to-visualise-sent-links.html#comment-form" title="0 Comments"></link><link rel="edit" type="application/atom+xml" 
href="http://www.blogger.com/feeds/6157408210125261684/posts/default/8649562704534078334"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/8649562704534078334"></link><link rel="alternate" type="text/html" href="http://www.prolificprogrammer.com/2014/08/how-to-visualise-sent-links.html" title="How to Visualise Sent Links"></link><author><name>Hasan Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6157408210125261684.post-8333764843870929716</id><published>2014-07-31T18:46:00.001-07:00</published><updated>2014-07-31T18:46:23.129-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="git"></category><category scheme="http://www.blogger.com/atom/ns#" term="sh"></category><title type="text">How to add a Custom Git command</title><content type="html"><p><a href="http://git-scm.com">Git</a> is nothing more than <a href="http://blog.thehippo.de/2012/03/tools-and-software/how-to-create-a-custom-git-command-extension/">a collection of shell scripts</a>, or so claim <a href="https://www.facebook.com/david.cross">David</a> and <a href="http://blog.thehippo.de">others</a>. After last night, when I wrote my own git subcommand, I will have to add my name to the list. Presenting <b>git lost</b>. What, on $DEITY\'s green earth, does the lost subcommand do, you ask? Very good question, dear reader. The code is below, and an explanation follows:<code><pre><br />#!/bin/sh<br /># Pass every argument through unchanged, preserving quoting<br />git stash "$@"<br /></pre></code>It\'s an alias for the <a href="https://www.kernel.org/pub/software/scm/git/docs/git-stash.html">stash subcommand</a>! 
Save it as an executable file named <tt>git-lost</tt> somewhere on your path, and <tt>git lost</tt> will do exactly the same thing as stash.</p></content><link rel="replies" type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/8333764843870929716/comments/default" title="Post Comments"></link><link rel="replies" type="text/html" href="http://www.prolificprogrammer.com/2014/07/how-to-add-custom-git-command.html#comment-form" title="0 Comments"></link><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/8333764843870929716"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/8333764843870929716"></link><link rel="alternate" type="text/html" href="http://www.prolificprogrammer.com/2014/07/how-to-add-custom-git-command.html" title="How to add a Custom Git command"></link><author><name>Hasan Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6157408210125261684.post-2510422092014239025</id><published>2014-07-26T23:50:00.001-07:00</published><updated>2014-07-26T23:50:13.113-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="h2"></category><category scheme="http://www.blogger.com/atom/ns#" term="java"></category><category scheme="http://www.blogger.com/atom/ns#" term="word count"></category><title type="text">How to determine Word Frequency in Java</title><content type="html"><p>Q: Given a text file of arbitrary size, rank its words from most to least common.</p><p>A: Divide the file into individual words, put them into an RDBMS and use that to count. 
In the code below, I\'ve chosen to use the embedded flavour of <a href="http://www.h2database.com">H2</a>. Why H2? From its page:<pre><br />The main features of H2 are:<br /><br />Very fast, open source, JDBC API<br />Embedded and server modes; in-memory databases<br />Browser based Console application<br />Small footprint: around 1.5 MB jar file size<br /></pre>I\'ve chosen <a href="http://java.sun.com">Java</a> to do this task. And I decided to not use any NLP library to handle word-segmentation out of a concern for disk space. I do believe <a href="http://www.h2database.com/html/functions.html#csvwrite">h2 will write delimited data natively</a> as well. Maybe for future improvement. But for now, the code reads:<pre><br /> public static void main (String[] args) {<br />  try {<br />   Class.forName("org.h2.Driver").newInstance();<br />  } catch (ClassNotFoundException e) {<br />   e.printStackTrace(System.err);<br />   throw new RuntimeException(e.getMessage());<br />  } catch (InstantiationException e) {<br />   e.printStackTrace(System.err);<br />   throw new RuntimeException(e.getMessage());<br />  } catch (IllegalAccessException e) {<br />   e.printStackTrace(System.err);<br />   throw new RuntimeException(e.getMessage());<br />  }<br />  Long MAX_TRANSACTION_LENGTH = null;<br />  try {<br />   MAX_TRANSACTION_LENGTH = Long.parseLong(System.getProperty("max.transaction.length"));<br />  } catch (NumberFormatException e) {<br />   MAX_TRANSACTION_LENGTH = 40l;<br />  }<br />  Connection conn = null;<br />  try {<br />   conn = DriverManager.getConnection("jdbc:h2:alation;LOG=0;CACHE_SIZE=65536;LOCK_MODE=0;UNDO_LOG=0", "sa","");<br />   conn.setAutoCommit(false);<br />  } catch (SQLException e) {<br />   e.printStackTrace(System.err);<br />   throw new RuntimeException(e.getMessage());<br />  }<br />  String stringToExamine = "";<br />  try {<br />   conn.createStatement().execute("DROP TABLE IF EXISTS indexing;");<br />   conn.createStatement().execute("CREATE 
TABLE indexing (word VARCHAR UNIQUE, frequency INT)");<br />   conn.commit();<br />  } catch (SQLException e) {<br />   System.err.println("Creation failed -- "+e.getMessage());<br />   e.printStackTrace(System.err);<br />  }<br />  try {<br />   PreparedStatement insertStatement = conn.prepareStatement("INSERT INTO indexing (word, frequency) VALUES (LOWER(?),1)");<br />   int transactionLength = 0;<br />   Long insertStart = System.currentTimeMillis();<br />   BufferedReader reader = new BufferedReader(new FileReader(args[0]));<br />   String args_ = null;<br />   String line = null;<br />   while ((line = reader.readLine()) != null) {<br />    for (String word : line.split(" ")) {<br />     // Strip punctuation once so the insert and the update see the same key<br />     String cleaned = word.replaceAll("\\\\p{Punct}","");<br />     // Skip empty tokens left over from repeated spaces or bare punctuation<br />     if (cleaned.trim().isEmpty()) {<br />      continue;<br />     }<br />     insertStatement.setString(1, cleaned);<br />     try {<br />      int inserted = insertStatement.executeUpdate();<br />     } catch (JdbcSQLException j) {<br />      // The word already exists (UNIQUE violation), so bump its count instead<br />      PreparedStatement prepped = conn.prepareStatement("UPDATE indexing SET frequency = frequency + 1 WHERE word = LOWER(?)", ResultSet.TYPE_SCROLL_SENSITIVE, ResultSet.CONCUR_UPDATABLE);<br />      prepped.setString(1, cleaned);<br />      int updated = prepped.executeUpdate();<br />     }<br />     conn.commit();<br />    }<br />   }<br />   reader.close();<br />  } catch  (SQLException e) {<br />   System.err.println(e.getMessage());<br />   e.printStackTrace(System.err);<br />  } catch (IOException e) {<br />   System.err.println(e.getMessage());<br />   e.printStackTrace(System.err);<br />  } <br />   <br />  try {<br />   ResultSet words = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY).executeQuery("SELECT word||\'\\t\'||frequency as res FROM indexing ORDER BY frequency DESC, word");<br />   PrintStream out = null;<br />   try {<br />    out = new PrintStream(args[1]);<br />   } catch (Exception e) {<br />    out = new PrintStream(System.out);<br />   }<br />   
Long start = System.currentTimeMillis();<br />   while (words.next()) {<br />    out.println(words.getString("res"));<br />   } <br />   conn.close();<br />   System.err.println("\\n\\nProcessed in "+(System.currentTimeMillis() - start)+" milliseconds.");<br />  } catch (SQLException e) {<br />   System.err.println(e.getMessage());<br />   e.printStackTrace(System.err);<br />  }<br /> }<br /></pre></content><link rel="replies" type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/2510422092014239025/comments/default" title="Post Comments"></link><link rel="replies" type="text/html" href="http://www.prolificprogrammer.com/2014/07/how-to-determine-word-frequency-in-java.html#comment-form" title="0 Comments"></link><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/2510422092014239025"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/2510422092014239025"></link><link rel="alternate" type="text/html" href="http://www.prolificprogrammer.com/2014/07/how-to-determine-word-frequency-in-java.html" title="How to determine Word Frequency in Java"></link><author><name>Hasan Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6157408210125261684.post-5997513011948024184</id><published>2014-07-23T10:36:00.002-07:00</published><updated>2014-07-23T10:36:47.574-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="gzipreader"></category><category scheme="http://www.blogger.com/atom/ns#" term="ruby"></category><category scheme="http://www.blogger.com/atom/ns#" term="zlib"></category><title 
type="text">How to Handle Gzipped Files in Ruby</title><content type="html"><p>Late last night, I was debugging the fact that the shared links are now stored compressed and realised the <a href="http://hasan.d8u.us/atom.xml">atom feed</a> I had so carefully put together earlier was not working because ruby wasn\'t hip enough to recognise <a href="http://www.gzip.org">Gzip</a> compression out of the box. To be fair, though, that would be a performance hit without much benefit, especially given that solving the problem is a piece of cake:<pre><code>require \'zlib\'<br />Zlib::GzipReader.open("/path/to/compressed-file") {|gz|<br />   # do what you will here<br />}<br /></code></pre></content><link rel="replies" type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/5997513011948024184/comments/default" title="Post Comments"></link><link rel="replies" type="text/html" href="http://www.prolificprogrammer.com/2014/07/how-to-handle-gzipped-files-in-ruby.html#comment-form" title="0 Comments"></link><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/5997513011948024184"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/5997513011948024184"></link><link rel="alternate" type="text/html" href="http://www.prolificprogrammer.com/2014/07/how-to-handle-gzipped-files-in-ruby.html" title="How to Handle Gzipped Files in Ruby"></link><author><name>Hasan Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" 
src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6157408210125261684.post-8623287165922810691</id><published>2014-07-21T21:42:00.000-07:00</published><updated>2014-07-21T21:42:19.510-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="imgurl"></category><category scheme="http://www.blogger.com/atom/ns#" term="pandas"></category><category scheme="http://www.blogger.com/atom/ns#" term="python"></category><title type="text">How to Visualise Data</title><content type="html"><p>Amongst my friends, one of the most dreaded subject lines in an email from me is, "Thought you might be interested in...". This is the result of a <a href="http://python.org">python</a> script that lets me share links with you. Over the weekend, I greatly enhanced it, adding the ability to view, remove and visualise whatever links I\'ve sent in a histogram. It\'s the latter piece of code that is mirrored here:<code><pre><br />    import pandas as pd<br />    from ggplot import geom_bar, aes, ggplot, ggsave, ggtitle<br />    from imgur.factory import factory<br /><br />    today = links[links[\'Time\'] > int(datetime.date.today().strftime(\'%s\')) - 1]<br />    logging.debug(today.to_string())<br />    today[\'Hour\'] = [datetime.datetime.fromtimestamp(t).strftime(\'%H\') for t in today[\'Time\']]<br />    logging.debug(today[\'Hour\'].to_string())<br />    p = ggplot(today, aes(x=\'Hour\')) + geom_bar() + ggtitle(\'Links shared by hour of day today\')<br />    with tempfile.NamedTemporaryFile() as fileout:<br />        ggsave(p, fileout.name)<br />        imgur_key = u\'4feb29d00face5bc1b9dae536e15c373\'<br />        req = factory.build_request_upload_from_path(fileout.name)<br />        res = imgur.retrieve(req)<br />        print(\'Image may be viewed at {}\'.format(res[\'link\']))<br /></pre></code></content><link rel="replies" 
type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/8623287165922810691/comments/default" title="Post Comments"></link><link rel="replies" type="text/html" href="http://www.prolificprogrammer.com/2014/07/how-to-visualise-data.html#comment-form" title="0 Comments"></link><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/8623287165922810691"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/8623287165922810691"></link><link rel="alternate" type="text/html" href="http://www.prolificprogrammer.com/2014/07/how-to-visualise-data.html" title="How to Visualise Data"></link><author><name>Hasan Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6157408210125261684.post-3634143816946948756</id><published>2014-07-18T12:51:00.002-07:00</published><updated>2014-07-18T12:51:57.881-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="gmail"></category><category scheme="http://www.blogger.com/atom/ns#" term="java"></category><category scheme="http://www.blogger.com/atom/ns#" term="javamail"></category><title type="text">How to Cleanup Gmail</title><content type="html"><p>Slight detour from the <a href="http://www.prolificprogrammer.com/search/label/python">pythonic</a> nature of this blog. I\'m almost out of disk space at work, and most of the email is unnecessary. So, I wrote the Java class below to delete all mail older than a month. 
You\'ll need <a href="http://central.maven.org/maven2/javax/mail/mail/1.4.7/mail-1.4.7.jar">javamail</a> on your classpath and a compiled class of this:<pre><code><br />package us.d8u;<br />import java.util.Calendar;<br />import java.util.Date;<br />import java.util.Properties;<br />import javax.mail.Authenticator;<br />import javax.mail.Flags;<br />import javax.mail.Folder;<br />import javax.mail.Message;<br />import javax.mail.MessagingException;<br />import javax.mail.NoSuchProviderException;<br />import javax.mail.PasswordAuthentication;<br />import javax.mail.Session;<br />import javax.mail.Store;<br /><br />public class CleanupGmail {<br />    private static Session session = null;<br />    public static void cleanup(String folderName) throws NoSuchProviderException, MessagingException {<br />\tStore store = session.getStore("imaps");<br />\tif (!store.isConnected()) {<br />\t    store.connect(System.getProperty("mail.imaps.host"), System.getProperty("mail.imaps.user"), System.getProperty("mail.imaps.password"));<br />\t}<br />\tFolder inbox = store.getFolder(folderName);<br />\tinbox.open(Folder.READ_WRITE);<br />\tMessage messages[] = inbox.getMessages();<br />\tfor (Message message : messages) {<br />\t    Calendar c = Calendar.getInstance();<br />\t    c.add(Calendar.MONTH, -1);<br />\t    Date receivedDate = message.getReceivedDate();<br />\t    if (receivedDate.before(c.getTime())) {<br />\t\t// Flag only this message; flagging the whole array would delete everything<br />\t\tmessage.setFlag(Flags.Flag.DELETED, true);<br />\t    }<br />\t}<br />\tinbox.close(true);<br />    }<br />    public static void main(String[] args) {<br />\tfinal Properties props = System.getProperties();<br />\tprops.setProperty("mail.imaps.host", "imap.gmail.com");<br />\tprops.setProperty("mail.imaps.port", "993");<br />\tprops.setProperty("mail.imaps.connectiontimeout", "5000");<br />\tprops.setProperty("mail.imaps.timeout", "5000");<br />\tif (props.getProperty("mail.imaps.user") == null) {<br />\t    
props.setProperty("mail.imaps.user", args[0]);<br />\t}<br />\tif (props.getProperty("mail.imaps.password") == null) {<br />\t    props.setProperty("mail.imaps.password", args[1]);<br />\t}<br />\ttry {<br />\t    session = Session.getDefaultInstance(props, new Authenticator() {<br />\t\t    public PasswordAuthentication getPasswordAuthentication() {<br />\t\t\treturn new PasswordAuthentication(props.getProperty("mail.imaps.user"), props.getProperty("mail.imaps.password"));<br />\t\t    }<br />\t\t\t<br />\t\t});<br />\t    cleanup("[Gmail]/All Mail");<br />\t} catch (NoSuchProviderException e) {<br />\t    e.printStackTrace();<br />\t    System.exit(1);<br />\t} catch (MessagingException e) {<br />\t    e.printStackTrace();<br />\t    System.exit(2);<br />\t}<br />    }<br />}<br />\t    <br /></code></pre>You run it using <tt>java -Dmail.imaps.user=&lt;your gmail username&gt; -Dmail.imaps.password=&lt;your gmail password&gt; -jar ~/bin/CleanupGmail.jar</tt>; the <tt>-D</tt> options must come before <tt>-jar</tt>, or they are passed to the program as ordinary arguments. Personally, I\'ve just stuck it in a monthly cron job.</p></content><link rel="replies" type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/3634143816946948756/comments/default" title="Post Comments"></link><link rel="replies" type="text/html" href="http://www.prolificprogrammer.com/2014/07/how-to-cleanup-gmail.html#comment-form" title="0 Comments"></link><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/3634143816946948756"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/3634143816946948756"></link><link rel="alternate" type="text/html" href="http://www.prolificprogrammer.com/2014/07/how-to-cleanup-gmail.html" title="How to Cleanup Gmail"></link><author><name>Hasan Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" 
src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6157408210125261684.post-8073062262876032118</id><published>2014-07-15T23:47:00.000-07:00</published><updated>2014-07-15T23:47:00.381-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="documentation"></category><category scheme="http://www.blogger.com/atom/ns#" term="endpoint"></category><category scheme="http://www.blogger.com/atom/ns#" term="java"></category><category scheme="http://www.blogger.com/atom/ns#" term="python"></category><category scheme="http://www.blogger.com/atom/ns#" term="spring"></category><title type="text">How to self-document Using Spring</title><content type="html"><p>I\'ve rediscovered my love affair with <a href="http://spring.io">Spring</a>. The following code will list whatever endpoints your controller has in <a href="http://json.org">JSON</a>: <code><pre><br />@RequestMapping(value = "/endpoints", method = RequestMethod.GET)<br /> public String getEndPointsInView() {<br />     return requestMappingHandlerMapping.getHandlerMethods().keySet().toString();<br /> }<br /></pre></code>Now, to visualise JSON using <a href="http://jython.org">jython</a>:<code><pre><br />import json<br />import logging<br />import optparse<br />import urllib2<br />from javax.swing import JFrame, JScrollPane, JTable<br />from javax.swing.table import DefaultTableModel<br /><br />if __name__ == \'__main__\':<br />    parser = optparse.OptionParser()<br />    parser.add_option(\'-u\', \'--url\', action=\'store\')<br />    parser.add_option(\'-v\', \'--verbose\', action=\'store_true\', dest=\'verbose\')<br />    parsed = parser.parse_args()<br />    if parsed[0].verbose:<br />        logging.basicConfig(level=logging.DEBUG)<br />    else:<br />        logging.basicConfig()<br /><br />    json_url = urllib2.urlopen(parsed[0].url)<br />    host = 
parsed[0].url[:parsed[0].url.rindex(\'/\')]<br />    json_source = json_url.read()<br />    logging.debug(json_source)<br />    table_data = json.loads(json_source)<br />    logging.debug(table_data)<br /><br />    model = DefaultTableModel(table_data)<br />    tbl = JTable(model)<br />    scrollable = JScrollPane(tbl)<br />    <br />    frame = JFrame(\'Methods for {}\'.format(host))<br />    frame.add(scrollable)<br />    frame.pack()<br />    frame.defaultCloseOperation = JFrame.EXIT_ON_CLOSE<br />    frame.visible = True<br /></code></pre></content><link rel="replies" type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/8073062262876032118/comments/default" title="Post Comments"></link><link rel="replies" type="text/html" href="http://www.prolificprogrammer.com/2014/07/how-to-self-document-using-spring.html#comment-form" title="0 Comments"></link><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/8073062262876032118"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/8073062262876032118"></link><link rel="alternate" type="text/html" href="http://www.prolificprogrammer.com/2014/07/how-to-self-document-using-spring.html" title="How to self-document Using Spring"></link><author><name>Hasan Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6157408210125261684.post-2539811975160040350</id><published>2014-07-13T09:50:00.001-07:00</published><updated>2014-07-13T09:50:16.765-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="csv"></category><category 
scheme="http://www.blogger.com/atom/ns#" term="etree"></category><category scheme="http://www.blogger.com/atom/ns#" term="json"></category><category scheme="http://www.blogger.com/atom/ns#" term="logback"></category><category scheme="http://www.blogger.com/atom/ns#" term="python"></category><category scheme="http://www.blogger.com/atom/ns#" term="spring"></category><category scheme="http://www.blogger.com/atom/ns#" term="xml"></category><title type="text">How to Reformat Logback Output</title><content type="html"><p><a href="http://spring.io">Spring</a> defaults to using <a href="http://logback.qos.ch">logback</a> for logging. It spits the logs out on standard output, where they are not persisted. So, we must first send the log output to a file. This is done by leveraging the <a href="http://logback.qos.ch/apidocs/ch/qos/logback/core/FileAppender.html">FileAppender</a> class, as follows:<pre><br />  &lt;appender name=&quot;FILE&quot; class=&quot;ch.qos.logback.core.FileAppender&quot;&gt;<br />    &lt;file&gt;/home/hdiwan/around.log&lt;/file&gt;<br />    &lt;encoder&gt;<br />      &lt;pattern&gt;&quot;%date&quot; &quot;%level&quot; &quot;[%thread]&quot; &quot;%logger&quot; &quot;%file : %line&quot; &quot;%msg&quot;%n&lt;/pattern&gt;<br />    &lt;/encoder&gt;<br />  &lt;/appender&gt;<br /></pre></p><p>Now the logs go to the file indicated. Make sure <tt>LOGFILE_PATH</tt> at the top of the script matches the configuration:<pre><code>import argparse<br />import cgi<br />import csv<br />import cStringIO as StringIO<br />import json<br />import logging<br /><br />from lxml import etree<br /><br />if __name__ == \'__main__\':<br />    LOGFILE_PATH = \'/home/hdiwan/around.log\'<br /><br />    logging.basicConfig(level=logging.FATAL)<br /><br />    web = cgi.FieldStorage()<br />    format_ = web.getfirst(\'format\', default=\'csv\')<br />    csv.register_dialect(\'arounddialect\')<br />    logging.debug(csv.list_dialects())<br />    if format_ == \'csv\':<br />    
    print(\'Content-Type: application/csv\\n\')<br />    elif format_ == \'xml\':<br />        print(\'Content-Type: text/xml\\n\')<br />    elif format_ == \'json\':<br />        print(\'Content-Type: application/json\\n\')<br /><br />    with open(LOGFILE_PATH,\'rb\') as fin:<br />        reader = csv.reader(fin, dialect=\'arounddialect\')<br />        out = StringIO.StringIO()<br />        if format_ == \'csv\':<br />            writer = csv.writer(out)<br />            writer.writerows(list(reader))<br /><br />        elif format_ == \'xml\':<br />            document = etree.Element(\'log\')<br />            for r in list(reader):<br />                logging.debug(len(r))<br />                node = etree.SubElement(document, \'entry\')<br /><br />                timestamp = etree.SubElement(node, \'timestamp\')<br />                timestamp.text = etree.CDATA(r[0])<br /><br />                level = etree.SubElement(node, \'level\')<br />                level.text = etree.CDATA(r[1])<br /><br />                thread = etree.SubElement(node, \'thread\')<br />                try:<br />                    thread.text = etree.CDATA(r[2])<br />                except IndexError,e:<br />                    thread.text = etree.CDATA(\'\')<br /><br />                class_ = etree.SubElement(node, \'class\')<br />                try:<br />                    class_.text = etree.CDATA(r[3])<br />                except IndexError, e:<br />                    class_.text = etree.CDATA(\'\')<br /><br />                msg = etree.SubElement(node, \'message\')<br />                try:<br />                    msg.text = etree.CDATA(r[4])<br />                except IndexError, e:<br />                    msg.text = etree.CDATA(\'\')<br /><br />            out.write(etree.tostring(document, encoding=\'utf-8\', xml_declaration=True, pretty_print=True))<br /><br />        elif format_ == \'json\':<br />            out.write(json.dumps(list(reader)))<br />        <br />    
    print out.getvalue()<br /></code></pre>The other novel part here is the use of <a href="http://lxml.de">lxml</a> to generate the XML, which removes the need for cgi.escape and friends to get the XML properly escaped, and pretty-prints it automatically.</p></content><link rel="replies" type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/2539811975160040350/comments/default" title="Post Comments"></link><link rel="replies" type="text/html" href="http://www.prolificprogrammer.com/2014/07/how-to-reformat-logback-output.html#comment-form" title="0 Comments"></link><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/2539811975160040350"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/2539811975160040350"></link><link rel="alternate" type="text/html" href="http://www.prolificprogrammer.com/2014/07/how-to-reformat-logback-output.html" title="How to Reformat Logback Output "></link><author><name>Hasan Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6157408210125261684.post-1460122346711344175</id><published>2014-07-11T14:48:00.000-07:00</published><updated>2014-07-11T14:48:33.064-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="change tracking"></category><category scheme="http://www.blogger.com/atom/ns#" term="python"></category><category scheme="http://www.blogger.com/atom/ns#" term="selenium"></category><category scheme="http://www.blogger.com/atom/ns#" term="SHA"></category><category scheme="http://www.blogger.com/atom/ns#" term="webpages"></category><title 
type="text">How to Track Multiple Websites For Changes</title><content type="html"><p>Just enabled multiple-page support for the <a href="http://www.prolificprogrammer.com/2014/07/how-to-track-website-for-changes.html">poor man\'s web-page delta tracker</a>:<pre><code><br />#!/Users/hdiwan/.virtualenvs/globetrekker/bin/python<br />from selenium import webdriver<br />from selenium.common.exceptions import WebDriverException<br />import argparse<br />import hashlib<br />import json<br />import logging<br />import pprint<br />import smtplib<br /><br /><br />def get_globetrekker_page(site):<br />    browser = webdriver.Chrome()<br />    browser.get(site)<br />    return browser<br /><br /><br />def send_mail(msg, user, password):<br />    server = smtplib.SMTP(\'smtp.gmail.com\', 587)<br />    server.ehlo()<br />    server.starttls()<br />    server.ehlo()<br />    server.login(user, password)<br />    server.sendmail(user, user, msg)<br /><br /><br />if __name__ == \'__main__\':<br />    argparser = argparse.ArgumentParser(description=\'Check a website for changes\')<br />    argparser.add_argument(\'-n\', \'--url\', type=str, default=None, help=\'Add URL to watcher\',  action=\'store\')<br />    argparser.add_argument(\'-l\', \'--list\', action=\'store_true\')<br />    argparser.add_argument(\'-u\', \'--user\', type=str, default=\'hd1@jsc.d8u.us\', help=\'Your username\',  action=\'store\')<br />    argparser.add_argument(\'-p\', \'--password\', type=str, help=\'Your password\', action=\'store\')<br />    argparser.add_argument(\'-v\', \'--verbose\', action=\'store_false\')<br />    parsed = argparser.parse_args()<br /><br />    if not parsed.verbose:<br />        logging.basicConfig(level=logging.DEBUG)<br />    else:<br />        logging.basicConfig(level=logging.FATAL)<br /><br />    if parsed.url:<br />        new_hash = {parsed.url: 0}<br />        output = json.dumps(new_hash)<br />        logging.debug(output)<br />        try:<br />            with 
open(\'/var/tmp/.globetrekker.txt\', \'r\') as fin:<br />                data = json.load(fin)<br />                data.append(new_hash)<br />        except IOError, v:<br />            with open(\'/var/tmp/.globetrekker.txt\', \'w\') as fout:<br />                json.dump([new_hash], fout)<br />        exit()<br /><br />    with open(\'/var/tmp/.globetrekker.txt\', \'r\') as fin:<br />        stored_hash_json = json.load(fin)<br />        logging.debug(stored_hash_json)<br />        if parsed.list:<br />            for k in stored_hash_json:<br />                print(k)<br />            exit()<br />    new_hashes = []<br />    stored_hash = stored_hash_json<br />    for stored_hash_ in stored_hash:<br />        for url in stored_hash_.keys():<br />            logging.debug(\'{} is our URL\'.format(url))<br />            try:<br />                browser = get_globetrekker_page(url)<br />            except WebDriverException, e:<br />                continue<br />            encoding = \'ascii\'<br />            text = browser.find_element_by_tag_name(\'html\').text<br />            encoded = text.encode(encoding, errors=\'replace\')<br />            logging.debug(encoded)<br />            decoded = encoded.decode(encoding, errors=\'replace\')<br />            logging.debug(decoded)<br />            new_hash = hashlib.sha1(decoded).hexdigest()<br />            logging.debug(\'Calculated hash code: {}\'.format(new_hash))<br />            logging.debug(\'Stored hash: {}\'.format(stored_hash_[url]))<br />            if new_hash != stored_hash_[url]:<br />                logging.debug(\'{} changed\'.format(url))<br />                send_mail(u\'Subject: {} Change detected\\r\\n\\r\\n--H\'.format(url), parsed.user, parsed.password)<br />                stored_hash_[url] = new_hash<br /><br />    browser.quit()<br /><br />    with open(\'/var/tmp/.globetrekker.txt\', \'w\') as fout:<br />        fout.write(json.dumps(stored_hash))<br /></code></pre></content><link 
rel="replies" type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/1460122346711344175/comments/default" title="Post Comments"></link><lire></content><link rel="replies" type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/1460122346711344175/comments/default" title="Post Comments"></link><li
nk rel="replies" type="text/html" href="http://www.prolificprogrammer.com/2014/07/how-to-track-multiple-websites-for.html#comment-form" title="0 Comments"></link><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/1460122346711344175"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/1460122346711344175"></link><link rel="alternate" type="text/html" href="http://www.prolificprogrammer.com/2014/07/how-to-track-multiple-websites-for.html" title="How to Track Multiple Websites For Changes"></link><author><name>Hasan Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6157408210125261684.post-260891854824070462</id><published>2014-07-03T01:34:00.001-07:00</published><updated>2014-07-03T01:34:19.172-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="change tracking"></category><category scheme="http://www.blogger.com/atom/ns#" term="python"></category><category scheme="http://www.blogger.com/atom/ns#" term="selenium"></category><category scheme="http://www.blogger.com/atom/ns#" term="SHA"></category><category scheme="http://www.blogger.com/atom/ns#" term="webpages"></category><title type="text">How to Track a Website for Changes</title><content type="html"><p>There still exist websites not hip enough to enbrace RSS or twitter feeds for changes. And, I still want to consume their content lazily. 
Enter <a href="http://python.org">my favourite tool</a>:<code><pre><br />#!/Users/hdiwan/.virtualenvs/globetrekker/bin/python<br />from selenium import webdriver<br />import logging<br />import smtplib<br />import argparse<br />import hashlib<br /><br />def get_globetrekker_page(site):<br />    browser = webdriver.Chrome()<br />    browser.get(site)<br />    return browser<br /><br />def send_mail(msg, user, password):<br />    server = smtplib.SMTP(\'smtp.gmail.com\',587) #port 465 or 587<br />    server.ehlo()<br />    server.starttls()<br />    server.ehlo()<br />    server.login(user,password)<br />    server.sendmail(user,user,msg)<br />    <br />if __name__ == \'__main__\':<br />    try:<br />        argparser = argparse.ArgumentParser(description=\'Check a website for changes\')<br />        argparser.add_argument(\'-l\',\'--url\',type=str,default=\'http://www.pilotguides.com/tv-shows/globe-trekker/\',help=\'Page URL to Globetrekker\', action=\'store\')<br />        argparser.add_argument(\'-u\',\'--user\',type=str,default=\'hd1@jsc.d8u.us\',help=\'Your username\', action=\'store\')<br />        argparser.add_argument(\'-p\',\'--password\',type=str,help=\'Your password\', action=\'store\', required=True)<br />        argparser.add_argument(\'-v\',\'--verbose\',action=\'store_false\')<br />        parsed = argparser.parse_args()<br /><br />        if not parsed.verbose:<br />            logging.basicConfig(level=logging.DEBUG)<br />        else:<br />            logging.basicConfig(level=logging.FATAL)<br /><br />        browser = get_globetrekker_page(parsed.url)<br />        elem_ = browser.find_elements_by_id(\'destination-dropdown-filter\')<br /><br />        try:<br />            with open(\'/var/tmp/.globetrekker.txt\') as fin:<br />                stored_hash = fin.read()<br />                logging.debug(\'Stored Hash Code: {}\'.format(stored_hash))<br />        except IOError: <br />            stored_hash = 0<br /><br />        episodes_ = 0<br />  
      for elem in elem_:<br />            episodes_ = len(elem.get_attribute(\'value\')) + episodes_<br />        new_hash = hashlib.sha1(str(episodes_)).hexdigest()<br />        logging.debug(\'Calculated hash code: {}\'.format(new_hash))<br />        if new_hash != stored_hash: # Page changed<br />            with open(\'/var/tmp/.globetrekker.txt\',\'w\') as fout:<br />                fout.write(\'{}\'.format(new_hash))<br />            send_mail(u\'Subject: Page Changed {}\\r\\n\\r\\n--H\'.format(parsed.url), parsed.user, parsed.password)<br /><br />    finally:<br />        browser.quit()<br /></code></pre><p>Some notes on this script, it uses <a href="http://docs.seleniumhq.org/">selenium</a>, which <a href="http://ffdr.d8u.us">Proshot</a> or somebody was saking me about the other day (sory, man, it\'s not <a href="http://ruby-lang.org">ruby</a>, but it\'s my code, so....), you can find enough on other sources aside from this blog.</p><p>The metrics for detecting whether a page has changed is performed by <a href="http://tools.ietf.org/html/rfc3174">IETF-standard SHA-1</a>, which while <a href="http://code.google.com/p/hashclash/">compromised</a>, no attack has been found in the wild, in theory there is a <a href="https://www.schneier.com/blog/archives/2005/02/cryptanalysis_o.html">hash collision</a>.</p></content><link rel="replies" type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/260891854824070462/comments/default" title="Post Comments"></link><link rel="replies" type="text/html" href="http://www.prolificprogrammer.com/2014/07/how-to-track-website-for-changes.html#comment-form" title="0 Comments"></link><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/260891854824070462"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/260891854824070462"></link><link rel="alternate" type="text/html" 
href="http://www.prolificprogrammer.com/2014/07/how-to-track-website-for-changes.html" title="How to Track a Website for Changes"></link><author><name>Hasan Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><thr:total>0</thr:total></entry></feed>'

In [59]:
thread.text = etree.CDATA(r[2])<br />                except IndexError,e:<br />                    thread.text = etree.CDATA(\'\')<br /><br />                class_ = etree.SubElement(node, \'class\')<br />                try:<br />                    class_.text = etree.CDATA(r[3])<br />                except IndexError, e:<br />                    class_.text = etree.CDATA(\'\')<br /><br />                msg = etree.SubElement(node, \'message\')<br />                try:<br />                    msg.text = etree.CDATA(r[4])<br />                except IndexError, e:<br />                    msg.text = etree.CDATA(\'\')<br /><br />            out.write(etree.tostring(document, encoding=\'utf-8\', xml_declaration=True, pretty_print=True))<br /><br />        elif format_ == \'json\':<br />            out.write(json.dumps(list(reader)))<br />        <br />        print out.getvalue()<br /></code></pre>The other novel part here is the use of <a href="http://lxml.de">lxml</a> to generate the XML, which alleviates the need to use cgi.escape and friends to get the xml properly formatted and pretty prints it automatically.</p></content><link rel="replies" type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/2539811975160040350/comments/default" title="Post Comments"></link><link rel="replies" type="text/html" href="http://www.prolificprogrammer.com/2014/07/how-to-reformat-logback-output.html#comment-form" title="0 Comments"></link><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/2539811975160040350"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/2539811975160040350"></link><link rel="alternate" type="text/html" href="http://www.prolificprogrammer.com/2014/07/how-to-reformat-logback-output.html" title="How to Reformat Logback Output "></link><author><name>Hasan 
Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6157408210125261684.post-1460122346711344175</id><published>2014-07-11T14:48:00.000-07:00</published><updated>2014-07-11T14:48:33.064-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="change tracking"></category><category scheme="http://www.blogger.com/atom/ns#" term="python"></category><category scheme="http://www.blogger.com/atom/ns#" term="selenium"></category><category scheme="http://www.blogger.com/atom/ns#" term="SHA"></category><category scheme="http://www.blogger.com/atom/ns#" term="webpages"></category><title type="text">How to Track Multiple Websites For Changes</title><content type="html"><p>Just enabled multiple-page support for the <a href="http://www.prolificprogrammer.com/2014/07/how-to-track-website-for-changes.html">poor man\'s web-page delta tracker</a>:<pre><code><br />#!/Users/hdiwan/.virtualenvs/globetrekker/bin/python<br />from selenium import webdriver<br />from selenium.common.exceptions import WebDriverException<br />import argparse<br />import hashlib<br />import json<br />import logging<br />import pprint<br />import smtplib<br /><br /><br />def get_globetrekker_page(site):<br />    browser = webdriver.Chrome()<br />    browser.get(site)<br />    return browser<br /><br /><br />def send_mail(msg, user, password):<br />    server = smtplib.SMTP(\'smtp.gmail.com\', 587)<br />    server.ehlo()<br />    server.starttls()<br />    server.ehlo()<br />    server.login(user, password)<br />    server.sendmail(user, user, msg)<br /><br /><br />if __name__ == \'__main__\':<br />    argparser = argparse.ArgumentParser(description=\'Check a website for 
changes\')<br />    argparser.add_argument(\'-n\', \'--url\', type=str, default=None, help=\'Add URL to watcher\',  action=\'store\')<br />    argparser.add_argument(\'-l\', \'--list\', action=\'store_true\')<br />    argparser.add_argument(\'-u\', \'--user\', type=str, default=\'hd1@jsc.d8u.us\', help=\'Your username\',  action=\'store\')<br />    argparser.add_argument(\'-p\', \'--password\', type=str, help=\'Your password\', action=\'store\')<br />    argparser.add_argument(\'-v\', \'--verbose\', action=\'store_false\')<br />    parsed = argparser.parse_args()<br /><br />    if not parsed.verbose:<br />        logging.basicConfig(level=logging.DEBUG)<br />    else:<br />        logging.basicConfig(level=logging.FATAL)<br /><br />    if parsed.url:<br />        new_hash = {parsed.url: 0}<br />        output = json.dumps(new_hash)<br />        logging.debug(output)<br />        try:<br />            with open(\'/var/tmp/.globetrekker.txt\', \'r\') as fin:<br />                data = json.load(fin)<br />                data.append(new_hash)<br />        except IOError, v:<br />            with open(\'/var/tmp/.globetrekker.txt\', \'w\') as fout:<br />                json.dump([new_hash], fout)<br />        exit()<br /><br />    with open(\'/var/tmp/.globetrekker.txt\', \'r\') as fin:<br />        stored_hash_json = json.load(fin)<br />        logging.debug(stored_hash_json)<br />        if parsed.list:<br />            for k in stored_hash_json:<br />                print(k)<br />            exit()<br />    new_hashes = []<br />    stored_hash = stored_hash_json<br />    for stored_hash_ in stored_hash:<br />        for url in stored_hash_.keys():<br />            logging.debug(\'{} is our URL\'.format(url))<br />            try:<br />                browser = get_globetrekker_page(url)<br />            except WebDriverException, e:<br />                continue<br />            encoding = \'ascii\'<br />            text = 
browser.find_element_by_tag_name(\'html\').text<br />            encoded = text.encode(encoding, errors=\'replace\')<br />            logging.debug(encoded)<br />            decoded = encoded.decode(encoding, errors=\'replace\')<br />            logging.debug(decoded)<br />            new_hash = hashlib.sha1(decoded).hexdigest()<br />            logging.debug(\'Calculated hash code: {}\'.format(new_hash))<br />            logging.debug(\'Stored hash: {}\'.format(stored_hash_[url]))<br />            if new_hash != stored_hash_[url]:<br />                logging.debug(\'{} changed\'.format(url))<br />                send_mail(u\'Subject: {} Change detected\\r\\n\\r\\n--H\'.format(url), parsed.user, parsed.password)<br />                stored_hash_[url] = new_hash<br /><br />    browser.quit()<br /><br />    with open(\'/var/tmp/.globetrekker.txt\', \'w\') as fout:<br />        fout.write(json.dumps(stored_hash))<br /></code></p
000");<br />\tprops.setProperty("mail.imaps.timeout", "5000");<br />\tif (props.getProperty("mail.imaps.user") == null) {<br />\t    props.setProperty("mail.imaps.user", args[0]);<br />\t}<br />\tif (props.getProperty("mail.imaps.password") == null) {<br />\t    props.setProperty("mail.imaps.password", args[1]);<br />\t}g<br />\ttry {<br />\t    session = Session.getDefaultInstance(props, new Authenticator() {<br />\t\t    public PasswordAuthentication getPasswordAuthentication() {<br />\t\t\treturn new PasswordAuthentication(props.getProperty("mail.imaps.user"), props.getProperty("mail.imaps.password"));<br />\t\t    }<br />\t\t\t<br />\t\t});<br />\t    cleanup("[Gmail]/All Mail");<br />\t} catch (NoSuchProviderException e) {.<br />\t    e.printStackTrace();<br />\t    System.exit(1);<br />\t} catch (MessagingException e) {<br />\t    e.printStackTrace();<br />\t    System.exit(2);<br />\t}<br />    }<br />}<br />\t    <br /></pre></code>You run it using <tt>java -jar ~/bin/CleanupGmail.jar -Dmail.imaps.user=&lt;your gmail username&gt; -Dmail.imaps.password=&lt;your gmail password&gt;</tt>. 
Personally, I\'ve just stuck it in a monthly cron job.</p></content><link rel="replies" type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/3634143816946948756/comments/default" title="Post Comments"></link><link rel="replies" type="text/html" href="http://www.prolificprogrammer.com/2014/07/how-to-cleanup-gmail.html#comment-form" title="0 Comments"></link><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/3634143816946948756"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/3634143816946948756"></link><link rel="alternate" type="text/html" href="http://www.prolificprogrammer.com/2014/07/how-to-cleanup-gmail.html" title="How to Cleanup Gmail"></link><author><name>Hasan Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6157408210125261684.post-8073062262876032118</id><published>2014-07-15T23:47:00.000-07:00</published><updated>2014-07-15T23:47:00.381-07:00</updated><category scheme="http://www.blo
gger.com/atom/ns#" term="documentation"></category><category scheme="http://www.blogger.com/atom/ns#" term="endpoint"></category><category scheme="http://www.blogger.com/atom/ns#" term="java"></category><category scheme="http://www.blogger.com/atom/ns#" term="python"></category><category scheme="http://www.blogger.com/atom/ns#" term="spring"></category><title type="text">How to self-document Using Spring</title><content type="html"><p>I\'ve rediscovered my love affair with <a href="http://spring.io">Spring</a>. The following code will list whatever endpoints your controller has in <a href="http://json.org">JSON</a>: <code><pre><br />@RequestMapping(value = "/endpoints", method = RequestMethod.GET)<br /> public String getEndPointsInView() {<br />     return requestMappingHandlerMapping.getHandlerMethods().keySet().toString();<br /> }<br /></pre></code>Now, to visualise JSON using <a href="http://jython.org">jython</a>:<code><pre><br />import json<br />import logging<br />import optparse<br />import urllib2<br />from javax.swing import JFrame, JScrollPane, JTable<br />from javax.swing.table import DefaultTableModel<br /><br />if __name__ == \'__main__\':<br />    parser = optparse.OptionParser()<br />    parser.add_option(\'-u\', \'--url\', action=\'store\')<br />    parser.add_option(\'-v\', \'--verbose\', action=\'store_true\', dest=\'verbose\')<br />    parsed = parser.parse_args()<br />    if parsed[0].verbose:<br />        logging.basicConfig(level=logging.DEBUG)<br />    else:<br />        logging.basicConfig()<br /><br />    json_url = urllib2.urlopen(parsed[0].url)<br />    host = parsed[0].url[:parsed[0].url.rindex(\'/\')]<br />    json_source = json_url.read()<br />    logging.debug(json_source)<br />    table_data = json.loads(json_source)<br />    logging.debug(table_data)<br /><br />    model = DefaultTableModel(table_data)<br />    tbl = JTable(model)<br />    scrollable = JScrollPane(tbl)<br />    <br />    frame = JFrame(\'Methods for 
{}\'.format(host))<br />    frame.add(scrollable)<br />    frame.pack()<br />    frame.defaultCloseOperation = JFrame.EXIT_ON_CLOSE<br />    frame.visible = True<br /></code></pre></content><link rel="replies" type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/8073062262876032118/comments/default" title="Post Comments"></link><link rel="replies" type="text/html" href="http://www.prolificprogrammer.com/2014/07/how-to-self-document-using-spring.html#comment-form" title="0 Comments"></link><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/8073062262876032118"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/8073062262876032118"></link><link rel="alternate" type="text/html" href="http://www.prolificprogrammer.com/2014/07/how-to-self-document-using-spring.html" title="How to self-document Using Spring"></link><author><name>Hasan Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6157408210125261684.post-2539811975160040350</id><published>2014-07-13T09:50:00.001-07:00</published><updated>2014-07-13T09:50:16.765-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="csv"></category><category scheme="http://www.blogger.com/atom/ns#" term="etree"></category><category scheme="http://www.blogger.com/atom/ns#" term="json"></category><category scheme="http://www.blogger.com/atom/ns#" term="logback"></category><category scheme="http://www.blogger.com/atom/ns#" term="python"></category><category scheme="http://www.blogger.com/atom/ns#" term="spring"></category><category 
scheme="http://www.blogger.com/atom/ns#" term="xml"></category><title type="text">How to Reformat Logback Output</title><content type="html"><p><a href="http://spring.io">Spring</a> defaults to using <a href="http://logback.qos.ch">logback</a> for logging. It spits the logs out on standard output, which cannot be persisted. So, we must first send the log output to a file. This is done by leveraging the <a href="http://logback.qos.ch/apidocs/ch/qos/logback/core/FileAppender.html">FileAppender</a> class, as follows:<pre><nk rel="replies" type="text/html" href="http://www.prolificprogrammer.com/2014/07/how-to-track-multiple-websites-for.html#comment-form" title="0 Comments"></link><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/1460122346711344175"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/1460122346711344175"></link><link rel="alternate" type="text/html" href="http://www.prolificprogrammer.com/2014/07/how-to-track-multiple-websites-for.html" title="How to Track Multiple Websites For Changes"></link><author><name>Hasan Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6157408210125261684.post-260891854824070462</id><published>2014-07-03T01:34:00.001-07:00</published><updated>2014-07-03T01:34:19.172-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="change tracking"></category><category scheme="http://www.blogger.com/atom/ns#" term="python"></category><category scheme="http://www.blogger.com/atom/ns#" term="selenium"></category><category scheme="http://www.blogger.com/atom/ns#" 
term="SHA"></category><category scheme="http://www.blogger.com/atom/ns#" term="webpages"></category><title type="text">How to Track a Website for Changes</title><content type="html"><p>There still exist websites not hip enough to enbrace RSS or twitter feeds for changes. And, I still want to consume their content lazily. Enter <a href="http://python.org">my favourite tool</a>:<code><pre><br />#!/Users/hdiwan/.virtualenvs/globetrekker/bin/python<br />from selenium import webdriver<br />import logging<br />import smtplib<br />import argparse<br />import hashlib<br /><br />def get_globetrekker_page(site):<br />    browser = webdriver.Chrome()<br />    browser.get(site)<br />    return browser<br /><br />def send_mail(msg, user, password):<br />    server = smtplib.SMTP(\'smtp.gmail.com\',587) #port 465 or 587<br />    server.ehlo()<br />    server.starttls()<br />    server.ehlo()<br />    server.login(user,password)<br />    server.sendmail(user,user,msg)<br />    <br />if __name__ == \'__main__\':<br />    try:<br />        argparser = argparse.ArgumentParser(description=\'Check a website for changes\')<br />        argparser.add_argument(\'-l\',\'--url\',type=str,default=\'http://www.pilotguides.com/tv-shows/globe-trekker/\',help=\'Page URL to Globetrekker\', action=\'store\')<br />        argparser.add_argument(\'-u\',\'--user\',type=str,default=\'hd1@jsc.d8u.us\',help=\'Your username\', action=\'store\')<br />        argparser.add_argument(\'-p\',\'--password\',type=str,help=\'Your password\', action=\'store\', required=True)<br />        argparser.add_argument(\'-v\',\'--verbose\',action=\'store_false\')<br />        parsed = argparser.parse_args()<br /><br />        if not parsed.verbose:<br />            logging.basicConfig(level=logging.DEBUG)<br />        else:<br />            logging.basicConfig(level=logging.FATAL)<br /><br />        browser = get_globetrekker_page(parsed.url)<br />        elem_ = 
browser.find_elements_by_id(\'destination-dropdown-filter\')<br /><br />        try:<br />            with open(\'/var/tmp/.globetrekker.txt\') as fin:<br />                stored_hash = fin.read()<br />                logging.debug(\'Stored Hash Code: {}\'.format(stored_hash))<br />        except IOError: <br />            stored_hash = 0<br /><br />        episodes_ = 0<br />        for elem in elem_:<br />            episodes_ = len(elem.get_attribute(\'value\')) + episodes_<br />        new_hash = hashlib.sha1(str(episodes_)).hexdigest()<br />        logging.debug(\'Calculated hash code: {}\'.format(new_hash))<br />        if new_hash != stored_hash: # Page changed<br />            with open(\'/var/tmp/.globetrekker.txt\',\'w\') as fout:<br />                fout.write(\'{}\'.format(new_hash))<br />            send_mail(u\'Subject: Page Changed {}\\r\\n\\r\\n--H\'.format(parsed.url), parsed.user, parsed.password)<br /><br />    finally:<br />        browser.quit()<br /></code></pre><p>Some notes on this script, it uses <a href="http://docs.seleniumhq.org/">selenium</a>, which <a href="http://ffdr.d8u.us">Proshot</a> or somebody was saking me about the other day (sory, man, it\'s not <a href="http://ruby-lang.org">ruby</a>, but it\'s my code, so....), you can find enough on other sources aside from this blog.</p><p>The metrics for detecting whether a page has changed is performed by <a href="http://tools.ietf.org/html/rfc3174">IETF-standard SHA-1</a>, which while <a href="http://code.google.com/p/hashclash/">compromised</a>, no attack has been found in the wild, in theory there is a <a href="https://www.schneier.com/blog/archives/2005/02/cryptanalysis_o.html">hash collision</a>.</p></content><link rel="replies" type="application/atom+xml" href="http://www.prolificprogrammer.com/feeds/260891854824070462/comments/default" title="Post Comments"></link><link rel="replies" type="text/html" 
href="http://www.prolificprogrammer.com/2014/07/how-to-track-website-for-changes.html#comment-form" title="0 Comments"></link><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/260891854824070462"></link><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6157408210125261684/posts/default/260891854824070462"></link><link rel="alternate" type="text/html" href="http://www.prolificprogrammer.com/2014/07/how-to-track-website-for-changes.html" title="How to Track a Website for Changes"></link><author><name>Hasan Diwan</name><uri>https://plus.google.com/108538472544668153897</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh3.googleusercontent.com/-YZ6ZwPevQL4/AAAAAAAAAAI/AAAAAAAAAMs/sop83s0_YiY/s512-c/photo.jpg"></gd:image></author><thr:total>0</thr:total></entry></feed>'

... in other words, the same string as was handed to it....
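
Eyeballing two feed dumps of this size is error-prone, so it may be easier to let Python confirm that the string that came out matches the one that went in. The snippet below is only a sketch of that check: `first_difference` is a hypothetical helper name (it is not part of xickle or the standard library), and the short sample strings stand in for the real feed text.

```python
def first_difference(expected, actual, context=20):
    """Return None when the two strings are identical.

    Otherwise return a tuple of (index, expected_snippet, actual_snippet)
    describing the first position at which they disagree.
    """
    if expected == actual:
        return None
    # Only the overlapping part can be compared character by character.
    limit = min(len(expected), len(actual))
    # If no character differs inside the overlap, one string is a prefix
    # of the other and the difference starts where the shorter one ends.
    index = next((i for i in range(limit) if expected[i] != actual[i]), limit)
    start = max(0, index - context)
    return index, expected[start:index + context], actual[start:index + context]


if __name__ == '__main__':
    # Placeholder strings; substitute the input and output of the real call.
    handed_in = '<feed><entry>1</entry></feed>'
    came_out = '<feed><entry>1</entry></feed>'
    difference = first_difference(handed_in, came_out)
    if difference is None:
        print('PASSED: output is the same string as the input')
    else:
        print('FAILED: first difference at index {}'.format(difference[0]))
```

Returning the surrounding snippet rather than just a boolean makes it quicker to find where a long feed string was altered, if it ever is.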