Some tax-time automation
I often struggle to find the right balance between automation and manual work. As it is tax time, and Chase bank only gives you 90 days of statements, I find myself every year going back through our statements to find any business expenses and do our overall financial review for the year. In the past I’ve played around with MS Money, Quicken, Mint and kept my own spreadsheets. Now, I just download the statements at the end of year and use acrobat to combine and ruby to massage the combined PDF into a spreadsheet.1
To do my analysis I need everything in a CSV format. After, getting one PDF, I end up looking at the structure of the document which looks like:
Earn points [truncated] and 1{aaa01f1184b23bc5204459599a780c2efd1a71f819cd2b338cab4b7a2f8e97d4} back per $1 spent on all other Visa Card purchases.
Date of Transaction Merchant Name or Transaction Description $ Amount
PAYMENTS AND OTHER CREDITS
01/23 -865.63
AUTOMATIC PAYMENT - THANK YOU
PURCHASES
12/27 AMAZON MKTPLACE PMTS AMZN.COM/BILL WA 15.98
12/29 NEW JERSEY E-ZPASS 888-288-6865 NJ 25.00
12/30 AMAZON MKTPLACE PMTS AMZN.COM/BILL WA 54.01
0000001 FIS33339 C 2 000 Y 9 26 15/01/26 Page 1 of 2
I realize that I want all lines that have a number like MM/DD followed by some spaces and a bunch of text, followed by a decimal number and some spaces. In regular expression syntax, that looks like:
/^(\d{2}\/\d{2})\s+(.*)\s+(\d+\.\d+)\s+$/
which is literally just a way of describing to the computer where my data are.
Through using Ruby, I can easily get my expenses as CSV:
Boom. Hope this helps some of you who might otherwise be doing a lot of typing. Also, if you want to combine PDFs on the command line, you can use PDFtk thus:
pdftk file1.pdf file2.pdf cat output -
- The manual download takes about 10 minutes. When I get some time, I’m up for the challenge of automating this eventually with my own screen scraper and web automation using some awesome combination Ruby and Capybara. I also use PDFtk to combine PDF files. ↩
this actually works better:
require ‘csv’
i = 0
tot = 0
CSV.open(“out.csv”, “wb”) do |csv|
csv << [‘date’, ‘payee’, ‘amount’]
File.open(‘data.txt’).each_line do |l|
tot = tot + 1
if l[/^(\d{2}\/\d{2})\s+(.*)\s+((?:[-,\d]+)?.\d+)\s+$/]
#puts “‘#{$1.strip}’ and ‘#{$2.strip}’ and ‘#{$3.strip}'”
csv << [$1.strip, $2.strip, $3.strip]
i = i + 1
else
puts “not for #{l}”
end
end
end
puts ” ———- ”
puts ” completed with #{i} of #{tot} —- or #{i/tot}”
puts ” ———- “