I want my Amazon data

Tracking and categorizing financial transactions can be tedious, especially when it comes to Amazon orders. With Amazon’s new data policies, it’s harder than ever to retrieve detailed purchase information in a usable format. Amazon is so broad that a single credit card charge could be for a digital product, a physical product, or a Whole Foods purchase. The purchases themselves are complicated, with charges tied to individual shipments and gift card transactions mixed in. You can request all your data, but it arrives in a terrible format: basically one big flattened table that doesn’t correlate order_id to invoiced transactions.

I have a well-oiled set of code that downloads all my transactions from every bank, credit card, etc., uses machine learning (naive Bayes) to categorize them, and uploads them to a Google Sheet where I can balance everything, check categories, and add details. My code then loads 25 years of transactions (every penny I’ve ever spent) into Postgres (both local and cloud-hosted), which lets me use R and Tableau for a bunch of analysis. It’s a hobby of mine to sit down and build a Google Slides deck filled with data on where our money is going and how our investments are doing.

Our orders are going up, and it’s time for automation.

This system has evolved over time and works really well. Here, I want to share how I get Amazon transactions to match my bank charges so I can list the products I’m buying.

Step 1: Get your Amazon order data into a database

This is tricky. Search for Amazon’s “privacy central” page (the link has changed a couple of times in the last year or so), where you can request a zip file of your data. There are two parts to this: you request your data, then wait for an email with a link to download it. It’s surprising that what is surely a fully automated process can take days, but in my experience it generally takes hours.

Eventually you get your Orders.zip, which contains:

├── Digital-Ordering.1
│   ├── Digital Items.csv
...
├── Retail.CustomerReturns.1.1
│   └── Retail.CustomerReturns.1.1.csv
├── Retail.OrderHistory.1
│   └── Retail.OrderHistory.1.csv
├── Retail.OrderHistory.2
│   └── Retail.OrderHistory.2.csv

The file we want is Retail.OrderHistory.1.csv. You can get that into a database with this code:
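A minimal sketch of what such a loader might look like. It is an illustration under assumptions, not a definitive script: it uses SQLite from the Python standard library as a stand-in for Postgres, it assumes the CSV sits at `Retail.OrderHistory.1/Retail.OrderHistory.1.csv` inside the zip (as the listing above suggests), and the table name `amazon_order_history` and both function names are hypothetical. Column names are taken from the CSV header at load time, so nothing about Amazon's export schema is hard-coded.

```python
import csv
import io
import sqlite3
import zipfile

# Assumed path inside Orders.zip, based on the directory listing above.
ORDER_HISTORY_MEMBER = "Retail.OrderHistory.1/Retail.OrderHistory.1.csv"


def load_csv(conn, table, text_stream):
    """Create `table` from the CSV header row and insert every row as text."""
    reader = csv.reader(text_stream)
    header = next(reader)
    # Quote column names so headers containing spaces are valid identifiers.
    columns = ['"' + name.replace('"', '""') + '"' for name in header]
    column_defs = ", ".join(col + " TEXT" for col in columns)
    placeholders = ", ".join("?" for _ in columns)

    conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    conn.execute(f'CREATE TABLE "{table}" ({column_defs})')
    # Rows whose field count differs from the header are skipped.
    rows = [row for row in reader if len(row) == len(columns)]
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return len(rows)


def load_order_history(conn, zip_path, member=ORDER_HISTORY_MEMBER):
    """Read the order history CSV straight out of Orders.zip."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            # utf-8-sig drops a byte-order mark if the export has one.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            return load_csv(conn, "amazon_order_history", text)
```

Calling `load_order_history(sqlite3.connect("amazon.db"), "Orders.zip")` would then return the number of rows loaded. Everything is stored as text here; casting dates and amounts is left for a later step because the exact column names depend on the export.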

Step 2: Get your Amazon invoice data into the database (via scraping)

That took a lot of work to get right. The order history works well for about 80% of my transactions, but the rest required matching actual invoice amounts with order_id. To make that work, you have to scrape your orders page, click on each order, and download the detail. I’ve written plenty of code like that before, but it’s a pain to get right (Chrome’s developer tools are a game-changer for that). Fortunately, I found this code that does exactly that: https://github.com/dcwangmit01/amazon-invoice-downloader

The Amazon Invoice Downloader is a Python script that automates the process of downloading invoices for Amazon purchases using the Playwright library. It logs into an Amazon account with provided credentials, navigates to the “Returns & Orders” section, and retrieves invoices within a specified date range or year. The invoices are saved as PDF files in a local directory, with filenames formatted to include the order date, total amount, and order ID. The script mimics human behavior to avoid detection and skips downloading invoices that already exist.

You get all the invoices, but the most helpful output is the resulting CSV:

cat downloads/2024/transactions-2024.csv
Date,Amount,Card Type,Last 4 Digits,Order ID,Order Type
2024-01-03,7.57,Visa,1111,114-2554918-8507414,amazon
2024-01-03,5.60,Visa,1111,114-7295770-5362641,amazon

You can use this code to get this into the database:
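A minimal sketch of such a loader, again with SQLite standing in for Postgres. The CSV column names come from the sample output above; the table name `amazon_invoice_transactions`, its column names, and the function name are hypothetical. Amounts are stored as integer cents so later comparisons with bank charges are exact. The primary key is an assumption that makes re-runs idempotent; note that it would also collapse two genuinely identical charges on the same order and day.

```python
import csv
import io
import sqlite3
from decimal import Decimal

CREATE_SQL = """
CREATE TABLE IF NOT EXISTS amazon_invoice_transactions (
    order_id     TEXT NOT NULL,
    charge_date  TEXT NOT NULL,     -- ISO date, e.g. 2024-01-03
    amount_cents INTEGER NOT NULL,  -- 7.57 is stored as 757
    card_type    TEXT,
    last4        TEXT,
    order_type   TEXT,
    PRIMARY KEY (order_id, charge_date, amount_cents)
)
"""

INSERT_SQL = """
INSERT OR IGNORE INTO amazon_invoice_transactions
    (order_id, charge_date, amount_cents, card_type, last4, order_type)
VALUES (?, ?, ?, ?, ?, ?)
"""


def load_invoice_transactions(conn, text_stream):
    """Insert rows from a transactions CSV; return how many were new."""
    conn.execute(CREATE_SQL)
    before = conn.total_changes
    for row in csv.DictReader(text_stream):
        # Decimal avoids float rounding when converting dollars to cents.
        cents = int(Decimal(row["Amount"]) * 100)
        conn.execute(INSERT_SQL, (
            row["Order ID"],
            row["Date"],
            cents,
            row["Card Type"],
            row["Last 4 Digits"],
            row["Order Type"],
        ))
    conn.commit()
    return conn.total_changes - before
```

Usage would look like `load_invoice_transactions(conn, open("downloads/2024/transactions-2024.csv", newline=""))`. Because of `INSERT OR IGNORE`, loading the same file twice adds nothing the second time.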

And finally, I have a script that compares the database to my Google Sheet and adds the matching order to each uncategorized transaction.
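The matching step can be sketched roughly as follows. This is one plausible approach under stated assumptions, not the exact script: it pairs each bank charge with an invoice transaction of the same amount whose date is within a few days, and uses each invoice at most once. The dictionary keys (`id`, `date`, `amount_cents`, `order_id`), the `max_days` window, and the function name are all hypothetical, and reading from or writing back to the Google Sheet is left out entirely; rows are assumed to be already fetched into plain Python dictionaries with positive amounts in cents.

```python
from datetime import date


def match_charges(bank_charges, invoices, max_days=5):
    """Map bank charge id -> Amazon order_id.

    A charge matches an invoice when the amounts are equal and the dates
    are at most `max_days` apart; the closest date wins, and each invoice
    is consumed by at most one charge.
    """
    unused = list(invoices)
    matches = {}
    for charge in sorted(bank_charges, key=lambda c: c["date"]):
        candidates = [
            inv for inv in unused
            if inv["amount_cents"] == charge["amount_cents"]
            and abs((charge["date"] - inv["date"]).days) <= max_days
        ]
        if not candidates:
            continue  # leave this charge uncategorized
        best = min(
            candidates,
            key=lambda inv: abs((charge["date"] - inv["date"]).days),
        )
        unused.remove(best)
        matches[charge["id"]] = best["order_id"]
    return matches


invoices = [
    {"order_id": "114-2554918-8507414", "date": date(2024, 1, 3), "amount_cents": 757},
    {"order_id": "114-7295770-5362641", "date": date(2024, 1, 3), "amount_cents": 560},
]
bank_charges = [
    {"id": 1, "date": date(2024, 1, 4), "amount_cents": 757},
    {"id": 2, "date": date(2024, 1, 3), "amount_cents": 560},
    {"id": 3, "date": date(2024, 1, 5), "amount_cents": 999},
]
matched = match_charges(bank_charges, invoices)
```

With the two sample invoices from the CSV above, charges 1 and 2 pick up their order IDs and charge 3 stays unmatched. The date window matters because a card charge often posts a day or more after the invoice date.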

This was a bit tricky, but it all works well now. Hopefully this saves you some time. Please reach out if you have any questions, and happy analyzing.
