Sunday, January 10, 2016

Processing CSV Files in Java -- A Java 8 Perspective

Well, after a couple of years' hiatus, I'm going to try to start blogging here again, and I'm going to do a few things differently going forward. First, I'm going to moderate comments, since Blogger lets too much spam through. Second, instead of jumping through hoops to squeeze a bunch of properly formatted code into Blogger, I'm going to keep the bulk of the code in a GitHub repository I set up expressly for this purpose. You can find it here --

For today's post, I'm going to tackle the common task of processing CSV files. My fiancee recently spent a lot of time writing one-off CSV parsers for her work, where she had to read in a file, make a few conditional changes to some values, and write the modifications back out. The canonical way of processing a CSV file without any higher-level assistance goes something like this:
  1. Read a line of the input file.
  2. Split it.
  3. Change one value in the line.
  4. Change another value in the line, then another, until complete.
  5. Write the line back out.
  6. Repeat.
Libraries like Apache Commons CSV and OpenCSV take care of the parsing details (quoting, escaping, mapping values to headers), so one can easily do a bit better in terms of conciseness, readability, and robustness.
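For reference, the row-at-a-time loop described above can be sketched in plain Java. This is only an illustration under stated assumptions: the column layout (name, city, zip), the class name `RowOrientedExample`, and the naive comma split are all made up for the example, and the split does not handle quoted fields the way a real CSV library would.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RowOrientedExample {

    // Walks the file row by row: split, edit a few values, re-join.
    // Assumes the first line is a header and that no field contains
    // quotes or embedded commas (a real parser handles those cases).
    static List<String> process(List<String> lines) {
        List<String> out = new ArrayList<>();
        if (lines.isEmpty()) {
            return out;
        }
        out.add(lines.get(0)); // copy the header through unchanged
        for (String line : lines.subList(1, lines.size())) {
            String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
            // Hypothetical layout: 0 = name, 1 = city, 2 = zip.
            fields[0] = fields[0].trim();
            if (fields.length > 2 && fields[2].length() == 5) {
                fields[2] = fields[2] + "-0000";
            }
            out.add(String.join(",", fields));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "name,city,zip",
                " Ann ,Citrus Heights,95610",
                "Bob,Davis,95616-1234");
        process(input).forEach(System.out::println);
    }
}
```

Note how the edits for different columns end up interleaved inside one loop body; that is the part that tends to get messy as the rules pile up.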

But after watching my fiancee's specific issues, what I really wanted to be able to do was process columns of values instead of rows. Something like this:
  1. Read the entire input file to an in-memory data structure.
  2. Change all of the values in one column.
  3. Change the values in another column, etc. until complete.
  4. Write the file back out.
(Granted, this requires reading the entire file into memory, but for many applications, this is not an issue.)
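A rough sketch of that column-at-a-time flow, using only the JDK, might look like the following. The names here (`ColumnOrientedExample`, `load`, `updateColumn`, `dump`) are hypothetical and are not taken from the proof-of-concept code; the parsing is again a naive comma split that assumes no quoted fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

public class ColumnOrientedExample {

    // Step 1: load every row into memory, keyed by header name.
    // LinkedHashMap preserves the original column order.
    static List<Map<String, String>> load(List<String> lines) {
        String[] headers = lines.get(0).split(",", -1);
        List<Map<String, String>> rows = new ArrayList<>();
        for (String line : lines.subList(1, lines.size())) {
            String[] fields = line.split(",", -1);
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < headers.length; i++) {
                // Short rows are padded with empty strings.
                row.put(headers[i], i < fields.length ? fields[i] : "");
            }
            rows.add(row);
        }
        return rows;
    }

    // Steps 2 and 3: rewrite every value in one named column.
    static void updateColumn(List<Map<String, String>> rows,
                             String column,
                             UnaryOperator<String> change) {
        rows.forEach(row -> row.computeIfPresent(column, (k, v) -> change.apply(v)));
    }

    // Step 4: turn the rows back into CSV lines.
    // Simplification: with no data rows, the header is not written.
    static List<String> dump(List<Map<String, String>> rows) {
        List<String> out = new ArrayList<>();
        if (rows.isEmpty()) {
            return out;
        }
        out.add(String.join(",", rows.get(0).keySet()));
        for (Map<String, String> row : rows) {
            out.add(String.join(",", row.values()));
        }
        return out;
    }
}
```

With this shape, each business rule becomes a single call such as `updateColumn(rows, "zip", z -> z + "-0000")`, so the rules read as a flat list of column edits rather than one crowded loop body.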

My intuition was that column-oriented processing would make the business logic clearer and less error-prone in the code. To further this goal, I wanted to be able to do this with Java 8 features: lambdas, streams, and the like.

My proof-of-concept code is available here. It uses Apache Commons CSV to read the input into a data structure called the "CSVMaster", which holds a list of rows whose values are keyed by column header for easy access. Many CSV frameworks support that much. What's slightly new is that the row list's iterator and stream are exposed directly, so you can write code like this:

// Change each zip value to a default zip+4.
master.forEach(row -> row.set("zip", row.get("zip") + "-0000"));

// Get all of the rows from a specific zip code.
List specialRows = master.stream().filter(row -> row.get("zip").startsWith("95610")).collect(Collectors.toList());
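To make the idea concrete, here is a minimal, self-contained sketch of how a container could expose both an iterator and a stream over its rows. It is not the actual CSVMaster implementation: `RowTableSketch`, `RowTable`, `Row`, and `addRow` are hypothetical stand-ins invented for this example, and the CSV reading step is left out entirely.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class RowTableSketch {

    // Hypothetical row type: values are addressed by column header.
    static class Row {
        private final Map<String, String> values = new LinkedHashMap<>();

        String get(String column) {
            return values.get(column);
        }

        void set(String column, String value) {
            values.put(column, value);
        }
    }

    // Hypothetical stand-in for a CSVMaster-like container.
    // Implementing Iterable gives callers forEach(...) and for-each loops;
    // stream() hands out a Stream over the same underlying rows.
    static class RowTable implements Iterable<Row> {
        private final List<Row> rows = new ArrayList<>();

        Row addRow() {
            Row row = new Row();
            rows.add(row);
            return row;
        }

        @Override
        public Iterator<Row> iterator() {
            return rows.iterator();
        }

        Stream<Row> stream() {
            return rows.stream();
        }
    }

    public static void main(String[] args) {
        RowTable master = new RowTable();
        master.addRow().set("zip", "95610");
        master.addRow().set("zip", "94103");

        // Column-wide edit through the Iterable's default forEach.
        master.forEach(row -> row.set("zip", row.get("zip") + "-0000"));

        // Row selection through the exposed stream.
        List<Row> special = master.stream()
                .filter(row -> row.get("zip").startsWith("95610"))
                .collect(Collectors.toList());
        System.out.println(special.size());
    }
}
```

Because the rows are mutable objects shared by the iterator and the stream, an edit made through `forEach` is visible to any later `stream()` query, which is what makes chaining column edits and filters feel natural.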

Provided this turns out to be useful and usable, I will continue to enhance and extend it. Additional thoughts welcome.
