July 23, 2014

How to Handle Gzipped Files in Ruby

Late last night, I was debugging the fact that the shared links are now stored compressed, and realised the Atom feed I had so carefully put together earlier was not working because Ruby isn't hip enough to recognise gzip compression out of the box (to be fair, though, that would be a performance hit without much benefit, especially given that solving the problem is cake):

require 'zlib'
Zlib::GzipReader.open("/path/to/compressed-file") {|gz|
   # do what you will here
}
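
For comparison, Python (the usual language around here) handles the same thing in its standard library with gzip.open; a minimal sketch that round-trips some data through a throwaway file:

```python
import gzip
import tempfile

# Create a throwaway file path to compress into.
with tempfile.NamedTemporaryFile(suffix='.gz', delete=False) as tmp:
    path = tmp.name

# Write compressed, then read it back transparently decompressed.
with gzip.open(path, 'wb') as gz:
    gz.write(b'do what you will here')

with gzip.open(path, 'rb') as gz:
    content = gz.read().decode()
print(content)  # do what you will here
```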

July 21, 2014

How to Visualise Data

Amongst my friends, one of the most dreaded subject lines in an email from me is, "Thought you might be interested in...". This is the result of a python script that lets me share links with you. Over the weekend, I greatly enhanced it, adding the ability to view, remove and visualise whatever links I've sent in a histogram. It's the latter piece of code that is mirrored here:

    import datetime
    import logging
    import tempfile

    import pandas as pd
    from ggplot import geom_bar, aes, ggplot, ggsave, ggtitle
    from imgur.factory import factory

    # `links` (a DataFrame of shared links with a 'Time' column of Unix
    # timestamps) and `imgur` (the imgur API client) are set up earlier
    # in the script.
    today = links[links['Time'] > int(datetime.date.today().strftime('%s')) - 1]
    logging.debug(today.to_string())
    today['Hour'] = [datetime.datetime.fromtimestamp(t).strftime('%H') for t in today['Time']]
    logging.debug(today['Hour'].to_string())
    p = ggplot(today, aes(x='Hour')) + geom_bar() + ggtitle('Links shared by hour of day today')
    with tempfile.NamedTemporaryFile() as fileout:
        ggsave(p, fileout.name)
        imgur_key = u'4feb29d00face5bc1b9dae536e15c373'
        req = factory.build_request_upload_from_path(fileout.name)
        res = imgur.retrieve(req)
        print('Image may be viewed at {}'.format(res['link']))
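
The core of the histogram, counting links by hour of day, can be reproduced with pandas alone; ggplot only handles the rendering. A sketch with made-up timestamps:

```python
import datetime

import pandas as pd

# Made-up share times: two in hour 09, one in hour 14.
times = [
    datetime.datetime(2014, 7, 20, 9, 15),
    datetime.datetime(2014, 7, 20, 9, 45),
    datetime.datetime(2014, 7, 20, 14, 5),
]
links = pd.DataFrame({'Time': [int(t.timestamp()) for t in times]})

# Same transformation as the script: epoch seconds -> hour-of-day label.
links['Hour'] = [datetime.datetime.fromtimestamp(t).strftime('%H')
                 for t in links['Time']]
counts = links['Hour'].value_counts().sort_index()
print(counts.to_dict())  # {'09': 2, '14': 1}
```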

July 18, 2014

How to Cleanup Gmail

Slight detour from the pythonic nature of this blog. I'm almost out of disk space at work, and most of the email is unnecessary. So, I wrote the Java class below to delete all mail older than a month. You'll need JavaMail on your classpath and a compiled class of this:


package us.d8u;
import java.util.Calendar;
import java.util.Date;
import java.util.Properties;
import javax.mail.Authenticator;
import javax.mail.Flags;
import javax.mail.Folder;
import javax.mail.Message;
import javax.mail.MessagingException;
import javax.mail.NoSuchProviderException;
import javax.mail.PasswordAuthentication;
import javax.mail.Session;
import javax.mail.Store;

public class CleanupGmail {
    private static Session session = null;
    public static void cleanup(String folderName) throws NoSuchProviderException, MessagingException {
	Store store = session.getStore("imaps");
	if (!store.isConnected()) {
	    store.connect(System.getProperty("mail.imaps.host"), System.getProperty("mail.imaps.user"), System.getProperty("mail.imaps.password"));
	}
	Folder inbox = store.getFolder(folderName);
	inbox.open(Folder.READ_WRITE);
	Calendar c = Calendar.getInstance();
	c.add(Calendar.MONTH, -1);
	Date cutoff = c.getTime();
	Message messages[] = inbox.getMessages();
	for (Message message : messages) {
	    Date receivedDate = message.getReceivedDate();
	    if (receivedDate != null && receivedDate.before(cutoff)) {
		// delete just this message, not the whole batch
		message.setFlag(Flags.Flag.DELETED, true);
	    }
	}
	inbox.close(true);
    }
    public static void main(String[] args) {
	final Properties props = System.getProperties();
	props.setProperty("mail.imaps.host", "imap.gmail.com");
	props.setProperty("mail.imaps.port", "993");
	props.setProperty("mail.imaps.connectiontimeout", "5000");
	props.setProperty("mail.imaps.timeout", "5000");
	if (props.getProperty("mail.imaps.user") == null) {
	    props.setProperty("mail.imaps.user", args[0]);
	}
	if (props.getProperty("mail.imaps.password") == null) {
	    props.setProperty("mail.imaps.password", args[1]);
	}
	try {
	    session = Session.getDefaultInstance(props, new Authenticator() {
		    public PasswordAuthentication getPasswordAuthentication() {
			return new PasswordAuthentication(props.getProperty("mail.imaps.user"), props.getProperty("mail.imaps.password"));
		    }
			
		});
	    cleanup("[Gmail]/All Mail");
	} catch (NoSuchProviderException e) {
	    e.printStackTrace();
	    System.exit(1);
	} catch (MessagingException e) {
	    e.printStackTrace();
	    System.exit(2);
	}
    }
}
	    
You run it using java -Dmail.imaps.user=<your gmail username> -Dmail.imaps.password=<your gmail password> -jar ~/bin/CleanupGmail.jar — note the -D flags have to come before -jar to end up as system properties; alternatively, pass the username and password as the first two program arguments. Personally, I've just stuck it in a monthly cron job.
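
The only subtle bit above is the date arithmetic. Here is the same "older than one calendar month" cutoff sketched in Python (a hypothetical one_month_ago helper, stdlib only, clamping the day-of-month roughly the way Calendar.add does):

```python
import datetime

def one_month_ago(today):
    # Roll back one calendar month, clamping the day when the previous
    # month is shorter (e.g. March 31 -> February 28).
    year, month = (today.year, today.month - 1) if today.month > 1 else (today.year - 1, 12)
    last_day = (datetime.date(year + month // 12, month % 12 + 1, 1)
                - datetime.timedelta(days=1)).day
    return today.replace(year=year, month=month, day=min(today.day, last_day))

received = [datetime.date(2014, 5, 1), datetime.date(2014, 7, 10)]
cutoff = one_month_ago(datetime.date(2014, 7, 18))
# Only the May message falls before the cutoff, so only it gets deleted.
print([d for d in received if d < cutoff])
```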

July 15, 2014

How to Self-Document Using Spring

I've rediscovered my love affair with Spring. The following code will list whatever endpoints your controller has in JSON; you'll need the RequestMappingHandlerMapping autowired in, and @ResponseBody (or a @RestController) so the string goes straight into the response:

@Autowired
private RequestMappingHandlerMapping requestMappingHandlerMapping;

@RequestMapping(value = "/endpoints", method = RequestMethod.GET)
@ResponseBody
public String getEndPointsInView() {
    return requestMappingHandlerMapping.getHandlerMethods().keySet().toString();
}
Now, to visualise JSON using jython:
import json
import logging
import optparse
import urllib2
from javax.swing import JFrame, JScrollPane, JTable
from javax.swing.table import DefaultTableModel

if __name__ == '__main__':
    parser = optparse.OptionParser()
    parser.add_option('-u', '--url', action='store')
    parser.add_option('-v', '--verbose', action='store_true', dest='verbose')
    parsed = parser.parse_args()
    if parsed[0].verbose:
        logging.basicConfig(level=logging.DEBUG)
    else:
        logging.basicConfig()

    json_url = urllib2.urlopen(parsed[0].url)
    host = parsed[0].url[:parsed[0].url.rindex('/')]
    json_source = json_url.read()
    logging.debug(json_source)
    table_data = json.loads(json_source)
    logging.debug(table_data)

    # DefaultTableModel wants 2-D row data plus column names, so wrap
    # each endpoint entry in a single-column row.
    rows = [[entry] for entry in table_data]
    model = DefaultTableModel(rows, ['Endpoint'])
    tbl = JTable(model)
    scrollable = JScrollPane(tbl)
    
    frame = JFrame('Methods for {}'.format(host))
    frame.add(scrollable)
    frame.pack()
    frame.defaultCloseOperation = JFrame.EXIT_ON_CLOSE
    frame.visible = True

July 13, 2014

How to Reformat Logback Output

Spring defaults to using logback for logging. Out of the box it spits the logs out on standard output, where they aren't persisted anywhere. So, we must first send the log output to a file. This is done by leveraging the FileAppender class, as follows:

  <appender name="FILE" class="ch.qos.logback.core.FileAppender">
    <file>/home/hdiwan/around.log</file>
    <encoder>
      <pattern>"%date" "%level" "[%thread]" "%logger" "%file : %line" "%msg"%n</pattern>
    </encoder>
  </appender>
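
Each line that pattern produces is a row of space-separated, double-quoted fields, which Python's csv module can split directly. A quick Python 3 sketch against a made-up log line:

```python
import csv
import io

# A line shaped like the logback pattern above (made-up content).
line = '"2014-07-13 10:00:00,000" "INFO" "[main]" "us.d8u.Main" "Main.java : 42" "hello"\n'

# Space as the delimiter, double quotes around each field.
reader = csv.reader(io.StringIO(line), delimiter=' ', quotechar='"')
fields = next(reader)
print(fields[1], fields[5])  # INFO hello
```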

Now you'll be getting logs in the file indicated; make sure LOGFILE_PATH at the top of the script below matches the <file> element in the configuration:

import cgi
import csv
import cStringIO as StringIO
import json
import logging

from lxml import etree

if __name__ == '__main__':
    LOGFILE_PATH = '/home/hdiwan/around.log'

    logging.basicConfig(level=logging.FATAL)

    web = cgi.FieldStorage()
    format_ = web.getfirst('format', default='csv')
    # space-separated, double-quoted fields, matching the logback pattern
    csv.register_dialect('arounddialect', delimiter=' ', quotechar='"')
    logging.debug(csv.list_dialects())
    if format_ == 'csv':
        print('Content-Type: application/csv\n')
    elif format_ == 'xml':
        print('Content-Type: text/xml\n')
    elif format_ == 'json':
        print('Content-Type: application/json\n')

    with open(LOGFILE_PATH,'rb') as fin:
        reader = csv.reader(fin, dialect='arounddialect')
        out = StringIO.StringIO()
        if format_ == 'csv':
            writer = csv.writer(out)
            writer.writerows(list(reader))

        elif format_ == 'xml':
            document = etree.Element('log')
            for r in list(reader):
                logging.debug(len(r))
                node = etree.SubElement(document, 'entry')

                timestamp = etree.SubElement(node, 'timestamp')
                timestamp.text = etree.CDATA(r[0])

                level = etree.SubElement(node, 'level')
                level.text = etree.CDATA(r[1])

                thread = etree.SubElement(node, 'thread')
                try:
                    thread.text = etree.CDATA(r[2])
                except IndexError,e:
                    thread.text = etree.CDATA('')

                class_ = etree.SubElement(node, 'class')
                try:
                    class_.text = etree.CDATA(r[3])
                except IndexError, e:
                    class_.text = etree.CDATA('')

                msg = etree.SubElement(node, 'message')
                try:
                    msg.text = etree.CDATA(r[4])
                except IndexError, e:
                    msg.text = etree.CDATA('')

            out.write(etree.tostring(document, encoding='utf-8', xml_declaration=True, pretty_print=True))

        elif format_ == 'json':
            out.write(json.dumps(list(reader)))
        
        print out.getvalue()
The other novel part here is the use of lxml to generate the XML, which alleviates the need for cgi.escape and friends to get the XML properly escaped, and pretty-prints it automatically.
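
For contrast, doing it by hand with only the stdlib means escaping every field yourself; roughly what lxml is saving us from:

```python
from xml.sax.saxutils import escape

# A log message with characters that would break raw XML.
msg = 'a < b & "c"'
entry = '<entry><message>{}</message></entry>'.format(escape(msg))
print(entry)  # <entry><message>a &lt; b &amp; "c"</message></entry>
```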

July 11, 2014

How to Track Multiple Websites For Changes

Just enabled multiple-page support for the poor man's web-page delta tracker:


#!/Users/hdiwan/.virtualenvs/globetrekker/bin/python
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
import argparse
import hashlib
import json
import logging
import pprint
import smtplib


def get_globetrekker_page(site):
    browser = webdriver.Chrome()
    browser.get(site)
    return browser


def send_mail(msg, user, password):
    server = smtplib.SMTP('smtp.gmail.com', 587)
    server.ehlo()
    server.starttls()
    server.ehlo()
    server.login(user, password)
    server.sendmail(user, user, msg)


if __name__ == '__main__':
    argparser = argparse.ArgumentParser(description='Check a website for changes')
    argparser.add_argument('-n', '--url', type=str, default=None, help='Add URL to watcher',  action='store')
    argparser.add_argument('-l', '--list', action='store_true')
    argparser.add_argument('-u', '--user', type=str, default='hd1@jsc.d8u.us', help='Your username',  action='store')
    argparser.add_argument('-p', '--password', type=str, help='Your password', action='store')
    argparser.add_argument('-v', '--verbose', action='store_false')
    parsed = argparser.parse_args()

    if not parsed.verbose:
        logging.basicConfig(level=logging.DEBUG)
    else:
        logging.basicConfig(level=logging.FATAL)

    if parsed.url:
        new_hash = {parsed.url: 0}
        logging.debug(json.dumps(new_hash))
        try:
            with open('/var/tmp/.globetrekker.txt', 'r') as fin:
                data = json.load(fin)
            data.append(new_hash)
        except IOError, v:
            data = [new_hash]
        # persist the updated list whether or not the file already existed
        with open('/var/tmp/.globetrekker.txt', 'w') as fout:
            json.dump(data, fout)
        exit()

    with open('/var/tmp/.globetrekker.txt', 'r') as fin:
        stored_hash_json = json.load(fin)
        logging.debug(stored_hash_json)
        if parsed.list:
            for k in stored_hash_json:
                print(k)
            exit()
    stored_hash = stored_hash_json
    for stored_hash_ in stored_hash:
        for url in stored_hash_.keys():
            logging.debug('{} is our URL'.format(url))
            try:
                browser = get_globetrekker_page(url)
            except WebDriverException, e:
                continue
            try:
                encoding = 'ascii'
                text = browser.find_element_by_tag_name('html').text
                encoded = text.encode(encoding, errors='replace')
                logging.debug(encoded)
                decoded = encoded.decode(encoding, errors='replace')
                logging.debug(decoded)
            finally:
                # quit each browser as we go instead of leaking one Chrome per URL
                browser.quit()
            new_hash = hashlib.sha1(decoded).hexdigest()
            logging.debug('Calculated hash code: {}'.format(new_hash))
            logging.debug('Stored hash: {}'.format(stored_hash_[url]))
            if new_hash != stored_hash_[url]:
                logging.debug('{} changed'.format(url))
                send_mail(u'Subject: {} Change detected\r\n\r\n--H'.format(url), parsed.user, parsed.password)
                stored_hash_[url] = new_hash

    with open('/var/tmp/.globetrekker.txt', 'w') as fout:
        fout.write(json.dumps(stored_hash))
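
Stripped of Selenium and SMTP, the change-detection core is just "hash the text, compare with what we stored". A minimal sketch; the check helper and the in-memory stored dict are illustrative, the real script keeps the hashes in the JSON file:

```python
import hashlib

def check(stored, url, text):
    # Returns True (and records the new hash) when the page text no
    # longer matches what we saw last time.
    new_hash = hashlib.sha1(text.encode('ascii', errors='replace')).hexdigest()
    changed = new_hash != stored.get(url)
    stored[url] = new_hash
    return changed

stored = {}
print(check(stored, 'http://example.com', 'hello'))    # True: first sighting
print(check(stored, 'http://example.com', 'hello'))    # False: unchanged
print(check(stored, 'http://example.com', 'goodbye'))  # True: page changed
```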

July 3, 2014

How to Track a Website for Changes

There still exist websites not hip enough to embrace RSS or Twitter feeds for changes. And I still want to consume their content lazily. Enter my favourite tool:

#!/Users/hdiwan/.virtualenvs/globetrekker/bin/python
from selenium import webdriver
import logging
import smtplib
import argparse
import hashlib

def get_globetrekker_page(site):
    browser = webdriver.Chrome()
    browser.get(site)
    return browser

def send_mail(msg, user, password):
    server = smtplib.SMTP('smtp.gmail.com',587) #port 465 or 587
    server.ehlo()
    server.starttls()
    server.ehlo()
    server.login(user,password)
    server.sendmail(user,user,msg)
    
if __name__ == '__main__':
    try:
        argparser = argparse.ArgumentParser(description='Check a website for changes')
        argparser.add_argument('-l','--url',type=str,default='http://www.pilotguides.com/tv-shows/globe-trekker/',help='Page URL to Globetrekker', action='store')
        argparser.add_argument('-u','--user',type=str,default='hd1@jsc.d8u.us',help='Your username', action='store')
        argparser.add_argument('-p','--password',type=str,help='Your password', action='store', required=True)
        argparser.add_argument('-v','--verbose',action='store_false')
        parsed = argparser.parse_args()

        if not parsed.verbose:
            logging.basicConfig(level=logging.DEBUG)
        else:
            logging.basicConfig(level=logging.FATAL)

        browser = get_globetrekker_page(parsed.url)
        elem_ = browser.find_elements_by_id('destination-dropdown-filter')

        try:
            with open('/var/tmp/.globetrekker.txt') as fin:
                stored_hash = fin.read()
                logging.debug('Stored Hash Code: {}'.format(stored_hash))
        except IOError: 
            stored_hash = 0

        episodes_ = 0
        for elem in elem_:
            episodes_ = len(elem.get_attribute('value')) + episodes_
        new_hash = hashlib.sha1(str(episodes_)).hexdigest()
        logging.debug('Calculated hash code: {}'.format(new_hash))
        if new_hash != stored_hash: # Page changed
            with open('/var/tmp/.globetrekker.txt','w') as fout:
                fout.write('{}'.format(new_hash))
            send_mail(u'Subject: Page Changed {}\r\n\r\n--H'.format(parsed.url), parsed.user, parsed.password)

    finally:
        # guard against the browser never having been created
        if 'browser' in locals():
            browser.quit()

Some notes on this script: it uses Selenium, which Proshot or somebody was asking me about the other day (sorry, man, it's not ruby, but it's my code, so....); you can find plenty about it in sources aside from this blog.

Detecting whether a page has changed is done with the IETF-standard SHA-1 hash. While SHA-1 is theoretically compromised (hash collisions exist in principle), no attack has been found in the wild, and it's more than adequate for spotting changed pages.
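
For the record, computing such a digest in Python is a one-liner:

```python
import hashlib

# SHA-1 of the page text, as a 40-character hex string.
digest = hashlib.sha1('hello'.encode('ascii')).hexdigest()
print(digest)  # aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d
```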