July 31, 2014

How to add a Custom Git command

Git is nothing more than a collection of shell scripts, or so claims David, among others. After last night, when I wrote my own git subcommand, I will have to add my name to the list. Presenting git lost. What, on $DEITY's green earth, does the lost subcommand do, you ask? Very good question, dear reader. The code is below, and an explanation follows it:

#!/bin/sh
git stash "$@"
It's an alias for the stash subcommand! Save it as git-lost, make it executable, put it somewhere on your PATH, and you can run git lost and have it do the exact same thing as stash.

July 26, 2014

How to determine Word Frequency in Java

?: Given a text file of arbitrary size, rank its words from most to least common.

A: Divide the file into individual words, put them into an RDBMS and use that to count. In the code below, I've chosen to use the embedded flavour of H2. Why H2? From its page:

The main features of H2 are:

Very fast, open source, JDBC API
Embedded and server modes; in-memory databases
Browser based Console application
Small footprint: around 1.5 MB jar file size
I've chosen Java to do this task, and I decided not to use any NLP library to handle word segmentation, out of a concern for disk space. I do believe H2 will write delimited data natively as well, which may be a future improvement. But for now, the code reads:
 public static void main (String[] args) {
  try {
   Class.forName("org.h2.Driver").newInstance();
  } catch (ClassNotFoundException e) {
   e.printStackTrace(System.err);
   throw new RuntimeException(e.getMessage());
  } catch (InstantiationException e) {
   e.printStackTrace(System.err);
   throw new RuntimeException(e.getMessage());
  } catch (IllegalAccessException e) {
   e.printStackTrace(System.err);
   throw new RuntimeException(e.getMessage());
  }
  Long MAX_TRANSACTION_LENGTH = null;
  try {
   MAX_TRANSACTION_LENGTH = Long.parseLong(System.getProperty("max.transaction.length"));
  } catch (NumberFormatException e) {
   MAX_TRANSACTION_LENGTH = 40l;
  }
  Connection conn = null;
  try {
   conn = DriverManager.getConnection("jdbc:h2:alation;LOG=0;CACHE_SIZE=65536;LOCK_MODE=0;UNDO_LOG=0", "sa","");
   conn.setAutoCommit(false);
  } catch (SQLException e) {
   e.printStackTrace(System.err);
   throw new RuntimeException(e.getMessage());
  }
  String stringToExamine = "";
  try {
   conn.createStatement().execute("DROP TABLE IF EXISTS indexing;");
   conn.createStatement().execute("CREATE TABLE indexing (word VARCHAR UNIQUE, frequency INT)");
   conn.commit();
  } catch (SQLException e) {
   System.err.println("Creation failed -- "+e.getMessage());
   e.printStackTrace(System.err);
  }
  try {
   PreparedStatement insertStatement = conn.prepareStatement("INSERT INTO indexing (word, frequency) VALUES (LOWER(?),1)");
   int transactionLength = 0;
   Long insertStart = System.currentTimeMillis();
   BufferedReader reader = new BufferedReader(new FileReader(args[0]));
   String args_ = null;
   String line = null;
   while ((line = reader.readLine()) != null) {
    for (String word : line.split("\\s+")) {
     // Strip punctuation first, then skip tokens that end up empty.
     String cleaned = word.replaceAll("\\p{Punct}", "");
     if (cleaned.isEmpty()) {
      continue;
     }
     insertStatement.setString(1, cleaned);
     try {
      insertStatement.executeUpdate();
     } catch (SQLException j) {
      // The UNIQUE constraint rejected a repeated word, so bump its count instead.
      PreparedStatement prepped = conn.prepareStatement("UPDATE indexing SET frequency = frequency + 1 WHERE word = LOWER(?)");
      prepped.setString(1, cleaned);
      prepped.executeUpdate();
      prepped.close();
     }
     conn.commit();
    }
   }
   reader.close();
  } catch  (SQLException e) {
   System.err.println(e.getMessage());
   e.printStackTrace(System.err);
  } catch (IOException e) {
   System.err.println(e.getMessage());
   e.printStackTrace(System.err);
  } 
   
  try {
   ResultSet words = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY).executeQuery("SELECT word||'\t'||frequency as res FROM indexing ORDER BY frequency DESC, word");
   PrintStream out = null;
   try {
    out = new PrintStream(args[1]);
   } catch (Exception e) {
    out = new PrintStream(System.out);
   }
   Long start = System.currentTimeMillis();
   while (words.next()) {
    out.println(words.getString("res"));
   } 
   conn.close();
   System.err.println("\n\nProcessed in "+(System.currentTimeMillis() - start)+" milliseconds.");
  } catch (SQLException e) {
   System.err.println(e.getMessage());
   e.printStackTrace(System.err);
  }
 }
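
For comparison, the same let-the-database-count idea can be sketched in a few lines of Python. This is only an illustration, not the code above: it swaps H2 for the sqlite3 module from the standard library, and the function name word_frequencies is made up for the example.

```python
import re
import sqlite3


def word_frequencies(lines):
    """Return (word, count) pairs, most common first, ties broken alphabetically."""
    conn = sqlite3.connect(':memory:')  # in-memory database, nothing touches disk
    conn.execute('CREATE TABLE indexing (word TEXT PRIMARY KEY, frequency INTEGER)')
    for line in lines:
        for token in line.split():
            # Strip anything that is not a letter, digit or underscore, then lowercase.
            word = re.sub(r'[^\w\s]', '', token).lower()
            if not word:
                continue
            # Insert unseen words with a zero count, then increment unconditionally.
            conn.execute('INSERT OR IGNORE INTO indexing VALUES (?, 0)', (word,))
            conn.execute('UPDATE indexing SET frequency = frequency + 1 WHERE word = ?', (word,))
    rows = conn.execute('SELECT word, frequency FROM indexing ORDER BY frequency DESC, word').fetchall()
    conn.close()
    return rows
```

Passing an open file object as lines works, since iterating over a file yields one line at a time.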

July 23, 2014

How to Handle Gzipped Files in Ruby

Late last night, I was debugging the fact that the shared links are now stored compressed, and realised the Atom feed I had so carefully put together earlier was not working because Ruby wasn't hip enough to recognise Gzip compression out of the box. To be fair, though, that would be a performance hit without much benefit, especially given that solving the problem is cake:

require 'zlib'
Zlib::GzipReader.open("/path/to/compressed-file") {|gz|
   # do what you will here
}

July 21, 2014

How to Visualise Data

Amongst my friends, one of the most dreaded subject lines in an email from me is, "Thought you might be interested in...". This is the result of a Python script that lets me share links with you. Over the weekend, I greatly enhanced it, adding the ability to view, remove and visualise whatever links I've sent in a histogram. It's the last piece, the histogram, that is mirrored here:

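    import datetime
    import logging
    import tempfile
    import time

    import pandas as pd
    from ggplot import geom_bar, aes, ggplot, ggsave, ggtitle
    from imgur.factory import factory

    # links is assumed to be a DataFrame whose 'Time' column holds Unix timestamps.
    midnight = time.mktime(datetime.date.today().timetuple())
    # Copy the slice so that adding a column does not trigger SettingWithCopyWarning.
    today = links[links['Time'] >= midnight].copy()
    logging.debug(today.to_string())
    today['Hour'] = [datetime.datetime.fromtimestamp(t).strftime('%H') for t in today['Time']]
    logging.debug(today['Hour'].to_string())
    p = ggplot(today, aes(x='Hour')) + geom_bar() + ggtitle('Links shared by hour of day today')
    # Give the temporary file a .png extension so the plot is written as a PNG.
    with tempfile.NamedTemporaryFile(suffix='.png') as fileout:
        ggsave(p, fileout.name)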
        imgur_key = u'4feb29d00face5bc1b9dae536e15c373'
        req = factory.build_request_upload_from_path(fileout.name)
        res = imgur.retrieve(req)
        print('Image may be viewed at {}'.format(res['link']))
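
The bucketing that the histogram relies on can be tried without pandas or ggplot. The sketch below is a simplified stand-in, assuming the link times are plain Unix timestamps. The name links_per_hour is invented for the illustration.

```python
import collections
import datetime


def links_per_hour(timestamps):
    """Count Unix timestamps by local hour of day, as sorted ('HH', count) pairs."""
    hours = [datetime.datetime.fromtimestamp(t).strftime('%H') for t in timestamps]
    return sorted(collections.Counter(hours).items())
```

Each pair is one bar of the histogram: the zero-padded hour and the number of links shared during it.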

July 18, 2014

How to Cleanup Gmail

Slight detour from the pythonic nature of this blog. I'm almost out of disk space at work, and most of the email is unnecessary. So, I wrote the Java class below to delete all mail older than a month. You'll need JavaMail on your classpath and a compiled copy of this class:


package us.d8u;
import java.util.Calendar;
import java.util.Date;
import java.util.Properties;
import javax.mail.Authenticator;
import javax.mail.Flags;
import javax.mail.Folder;
import javax.mail.Message;
import javax.mail.MessagingException;
import javax.mail.NoSuchProviderException;
import javax.mail.PasswordAuthentication;
import javax.mail.Session;
import javax.mail.Store;

public class CleanupGmail {
    private static Session session = null;
    public static void cleanup(String folderName) throws NoSuchProviderException, MessagingException {
	Store store = session.getStore("imaps");
	if (!store.isConnected()) {
	    store.connect(System.getProperty("mail.imaps.host"), System.getProperty("mail.imaps.user"), System.getProperty("mail.imaps.password"));
	}
	Folder inbox = store.getFolder(folderName);
	inbox.open(Folder.READ_WRITE);
	Message messages[] = inbox.getMessages();
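	// Compute the cutoff once: anything received before this instant is deleted.
	Calendar c = Calendar.getInstance();
	c.add(Calendar.MONTH, -1);
	Date cutoff = c.getTime();
	for (Message message : messages) {
	    Date receivedDate = message.getReceivedDate();
	    if (receivedDate != null && receivedDate.before(cutoff)) {
		// Flag only this message; setFlags(messages, ...) would flag the whole folder.
		message.setFlag(Flags.Flag.DELETED, true);
	    }
	}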
	inbox.close(true);
    }
    public static void main(String[] args) {
	final Properties props = System.getProperties();
	props.setProperty("mail.imaps.host", "imap.gmail.com");
	props.setProperty("mail.imaps.port", "993");
	props.setProperty("mail.imaps.connectiontimeout", "5000");
	props.setProperty("mail.imaps.timeout", "5000");
	if (props.getProperty("mail.imaps.user") == null) {
	    props.setProperty("mail.imaps.user", args[0]);
	}
	if (props.getProperty("mail.imaps.password") == null) {
	    props.setProperty("mail.imaps.password", args[1]);
	}
	try {
	    session = Session.getDefaultInstance(props, new Authenticator() {
		    public PasswordAuthentication getPasswordAuthentication() {
			return new PasswordAuthentication(props.getProperty("mail.imaps.user"), props.getProperty("mail.imaps.password"));
		    }
			
		});
	    cleanup("[Gmail]/All Mail");
	} catch (NoSuchProviderException e) {
	    e.printStackTrace();
	    System.exit(1);
	} catch (MessagingException e) {
	    e.printStackTrace();
	    System.exit(2);
	}
    }
}
	    
You run it using java -Dmail.imaps.user=<your gmail username> -Dmail.imaps.password=<your gmail password> -jar ~/bin/CleanupGmail.jar; the -D options must come before -jar, or the JVM hands them to the program as ordinary arguments. Personally, I've just stuck it in a monthly cron job.
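
Back in Python for a moment: the "older than a month" test is worth seeing on its own. Calendar.add(Calendar.MONTH, -1) steps back one calendar month and clamps the day of month where needed, so 31 March becomes 28 February. A rough equivalent, with the hypothetical names one_month_before and is_stale, might look like this:

```python
import calendar
import datetime


def one_month_before(day):
    """Return the date one calendar month before `day`, clamping the day of month."""
    year, month = (day.year, day.month - 1) if day.month > 1 else (day.year - 1, 12)
    # monthrange gives (weekday of the 1st, number of days) for that month.
    last_day = calendar.monthrange(year, month)[1]
    return datetime.date(year, month, min(day.day, last_day))


def is_stale(received, today):
    """True when a message received on `received` is more than a month old."""
    return received < one_month_before(today)
```

This works on whole dates only, whereas the Java class compares full timestamps.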

July 15, 2014

How to self-document Using Spring

I've rediscovered my love affair with Spring. The following code will list whatever endpoints your controller has in JSON:

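@RequestMapping(value = "/endpoints", method = RequestMethod.GET)
@ResponseBody
 public List<String> getEndPointsInView() {
     // Returning a list lets the JSON message converter (assuming Jackson is on the classpath) emit a JSON array.
     List<String> endpoints = new ArrayList<String>();
     for (RequestMappingInfo info : requestMappingHandlerMapping.getHandlerMethods().keySet()) endpoints.add(info.toString());
     return endpoints;
 }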
Now, to visualise the JSON using Jython:
import json
import logging
import optparse
import urllib2
from javax.swing import JFrame, JScrollPane, JTable
from javax.swing.table import DefaultTableModel

if __name__ == '__main__':
    parser = optparse.OptionParser()
    parser.add_option('-u', '--url', action='store')
    parser.add_option('-v', '--verbose', action='store_true', dest='verbose')
    parsed = parser.parse_args()
    if parsed[0].verbose:
        logging.basicConfig(level=logging.DEBUG)
    else:
        logging.basicConfig()

    json_url = urllib2.urlopen(parsed[0].url)
    host = parsed[0].url[:parsed[0].url.rindex('/')]
    json_source = json_url.read()
    logging.debug(json_source)
    table_data = json.loads(json_source)
    logging.debug(table_data)

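    # DefaultTableModel needs rows plus column names, so show one endpoint per row.
    model = DefaultTableModel([[entry] for entry in table_data], ['Endpoint'])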
    tbl = JTable(model)
    scrollable = JScrollPane(tbl)
    
    frame = JFrame('Methods for {}'.format(host))
    frame.add(scrollable)
    frame.pack()
    frame.defaultCloseOperation = JFrame.EXIT_ON_CLOSE
    frame.visible = True

July 13, 2014

How to Reformat Logback Output

Spring defaults to using logback for logging. It spits the logs out on standard output, which is not persisted anywhere. So, we must first send the log output to a file. This is done with the FileAppender class, as follows:

  <appender name="FILE" class="ch.qos.logback.core.FileAppender">
    <file>/home/hdiwan/around.log</file>
    <encoder>
      <pattern>"%date" "%level" "[%thread]" "%logger" "%file : %line" "%msg"%n</pattern>
    </encoder>
  </appender>

Now you'll be getting logs in the file indicated. Make sure the LOGFILE_PATH at the top of the script matches the configuration:

import argparse
import cgi
import csv
import cStringIO as StringIO
import json
import logging

from lxml import etree

if __name__ == '__main__':
    LOGFILE_PATH = '/home/hdiwan/around.log'

    logging.basicConfig(level=logging.FATAL)

    web = cgi.FieldStorage()
    format_ = web.getfirst('format', default='csv')
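    # The logback pattern separates double-quoted fields with single spaces.
    csv.register_dialect('arounddialect', delimiter=' ', quotechar='"')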
    logging.debug(csv.list_dialects())
    if format_ == 'csv':
        print('Content-Type: text/csv\n')
    elif format_ == 'xml':
        print('Content-Type: text/xml\n')
    elif format_ == 'json':
        print('Content-Type: application/json\n')

    with open(LOGFILE_PATH,'rb') as fin:
        reader = csv.reader(fin, dialect='arounddialect')
        out = StringIO.StringIO()
        if format_ == 'csv':
            writer = csv.writer(out)
            writer.writerows(list(reader))

        elif format_ == 'xml':
            document = etree.Element('log')
            for r in list(reader):
                logging.debug(len(r))
                node = etree.SubElement(document, 'entry')

                timestamp = etree.SubElement(node, 'timestamp')
                timestamp.text = etree.CDATA(r[0])

                level = etree.SubElement(node, 'level')
                level.text = etree.CDATA(r[1])

                thread = etree.SubElement(node, 'thread')
                try:
                    thread.text = etree.CDATA(r[2])
                except IndexError,e:
                    thread.text = etree.CDATA('')

                class_ = etree.SubElement(node, 'class')
                try:
                    class_.text = etree.CDATA(r[3])
                except IndexError, e:
                    class_.text = etree.CDATA('')

                # With the pattern above, field 4 is "file : line" and field 5 is the message.
                msg = etree.SubElement(node, 'message')
                try:
                    msg.text = etree.CDATA(r[5])
                except IndexError, e:
                    msg.text = etree.CDATA('')

            out.write(etree.tostring(document, encoding='utf-8', xml_declaration=True, pretty_print=True))

        elif format_ == 'json':
            out.write(json.dumps(list(reader)))
        
        print out.getvalue()
The other novel part here is the use of lxml to generate the XML, which removes the need for cgi.escape and friends to get the XML properly escaped, and pretty-prints it automatically.
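
To see why the csv module can read the log at all, here is a small standalone sketch of the parsing step. It is written for Python 3 rather than the Python 2 of the script above. The field names in FIELDS and the function name parse_log are my own labels for the six quoted parts of the pattern, and a message that itself contains a double quote would need extra care.

```python
import csv
import io

# One label per quoted field in the logback pattern, in order.
FIELDS = ['timestamp', 'level', 'thread', 'logger', 'location', 'message']


def parse_log(text):
    """Turn space-separated, double-quoted log lines into a list of dicts."""
    reader = csv.reader(io.StringIO(text), delimiter=' ', quotechar='"')
    # Skip blank lines; pair each remaining row with the field labels.
    return [dict(zip(FIELDS, row)) for row in reader if row]
```

Each dict can then be handed to json.dumps or written out as one XML entry, whichever format was asked for.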