CSVDIFF: How We Cut Database CSV Comparison Time from Minutes to Seconds
TL;DR: During a data transformation system migration, this tool cut the time to compare our reporting system database exports against the output of our new transformation scripts from minutes to seconds. Tool repo: https://github.com/aswinkarthik/csvdiff
Background
We were migrating our data transformation system from a legacy solution to a new one. The old system transformed data from our main production database to the reporting system, but it wasn't documented and was difficult to maintain. The new system uses SQL scripts to extract data from multiple tables and transform it to match the structure of our current reporting system.
The biggest challenge was that the logic behind the current transformations wasn't well documented. We had to dig through Java code with multiple abstraction layers to understand how the transformations worked.
Why We Needed a SQL Comparison Tool
After analysing the legacy code and writing new SQL scripts, we needed to verify that the data from our new scripts matched the actual data currently stored in the reporting system. This would help us identify major issues in our scripts and find fields where the transformation logic needed fixing.
The reporting system consisted of many tables, with row counts ranging from 50,000 to 10 million in the largest table and a typical table holding around 300,000 rows. Comparing these tables manually or in Excel would have been a waste of time.
First Attempt: Building Our Own Comparison Tool
"It's not that complicated, right?"
Initially, we built a simple Python script using the Pandas library and its merge function. As a script written in less than a day by one person, it didn't look bad at first. The script executed in a few seconds, outputted non-matching rows to CSV files, and allowed us to identify and fix issues in our transformation code.
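The original script isn't shown here, but its core can be sketched in a few lines. The sketch below is a hypothetical reconstruction, not our actual code: it uses pandas' `merge` with `indicator=True` to find rows that appear in only one of the two exports.

```python
import pandas as pd

def diff_csvs(old_path: str, new_path: str) -> pd.DataFrame:
    """Return rows that appear in only one of the two CSV exports."""
    old = pd.read_csv(old_path)
    new = pd.read_csv(new_path)
    # An outer merge on every column tags each row: "both" means the row
    # is identical in both files; "left_only"/"right_only" means it
    # exists (in that exact form) in only one of them.
    merged = old.merge(new, how="outer", indicator=True)
    return merged[merged["_merge"] != "both"]
```

For small exports this is perfectly serviceable; the trouble described below only starts once row and column counts grow.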
However, as we began writing transformations for larger tables, both the row count and field count increased. The script took longer to execute and the output became harder to understand. Eventually, it was taking 4-5 minutes to run, which slowed down our testing cycle significantly.
This slowdown had several causes. Pandas isn't optimized for one-shot computations over this much data, and column types weren't always inferred correctly, which forced manual script changes and extended our testing time. In short, our quickly prepared script had flaws, and we needed to find something better.
Enter CSVDIFF
When we discovered this tool and tested it for the first time, it worked like a charm. Comparisons, even for our largest tables, took only a few seconds. The tool offers several formats for displaying file differences, gives an easy overview of the counts of deletions, additions, and modifications between two CSVs, and much more.
Most importantly, it lets you declare which column(s) form the primary key, so rows are matched by key and changes are reported per row rather than line by line.
The tool runs in the terminal, is written in Go, and compares rows by hashing them, which is much faster than comparing every field value directly.
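The idea behind hash-based comparison can be sketched in a few lines of Python. This is a simplified illustration of the technique, not csvdiff's actual implementation (which is written in Go); SHA-256 stands in for whatever hash function the tool really uses.

```python
import csv
import hashlib

def load_hashes(path: str, key_col: int = 0) -> dict:
    """Map each row's primary-key value to a hash of the full row."""
    hashes = {}
    with open(path, newline="") as f:
        for row in csv.reader(f):
            digest = hashlib.sha256(",".join(row).encode()).hexdigest()
            hashes[row[key_col]] = digest
    return hashes

def diff(old_path: str, new_path: str, key_col: int = 0):
    """Return (additions, deletions, modifications) as sets of keys."""
    old = load_hashes(old_path, key_col)
    new = load_hashes(new_path, key_col)
    deletions = old.keys() - new.keys()   # keys only in the old file
    additions = new.keys() - old.keys()   # keys only in the new file
    # Same key but a different row hash means the row was modified.
    modifications = {k for k in old.keys() & new.keys() if old[k] != new[k]}
    return additions, deletions, modifications
```

Hashing each row once and then comparing fixed-size digests keyed by primary key is what makes this approach scale: the expensive per-field comparison is replaced by dictionary lookups.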
Installation
The tool requires the Go toolchain to be installed on your system. Installation instructions are available on the official Go website.
Installation steps can be found in the original repository. Below is how I installed it on Windows:
$ git clone https://github.com/aswinkarthik/csvdiff
$ go get ./...
$ go build
Usage
The tool has many features and usage options, but here's how I primarily used it:
csvdiff file1.csv file2.csv --primary-key 0 -o word-diff > csvdiff.txt
- `file1.csv file2.csv` — the files you want to compare
- `--primary-key <int>` — which column(s) to use as the primary key (counting starts from 0)
- `-o word-diff` — output format for showing differences
- `> csvdiff.txt` — redirects the results to a file; without this, the output is displayed in the terminal, which isn't convenient for large differences
Common Problem When Comparing CSVs
The main issue I encountered was mixing up which column number represents the primary key column. With the wrong key column, rows can no longer be matched reliably and the reported differences become meaningless.
Key Takeaway
The main lesson: in situations like this, look first for tested, proven tools that match your requirements (especially open-source ones) rather than building your own solution, even when rolling your own seems quick and easy. Using this tool helped us significantly speed up our script testing cycle.
Thanks for reading!