Building Open Access to NC Campaign Finance Data — The Plan
Welcome back! In our previous post, we introduced our North Carolina Open Campaign Finance project. If you missed it, you may want to go back and read that one first; it provides some good context for this post (don’t worry, we’ll wait).
Today we’re talking about the game plan: how are we going to pull this project off? It’s a pretty exciting solution (at least to us), because it uses a lot of different technologies that we at Crossroads CX believe in. We see many of these pieces as trends organizations are moving toward, and we think they’ll be crucial to the next generation of how technology is used in business. I’ll try not to get too technical, but if you have any questions, feel free to comment below or reach out via email! As a reminder, this is an open source project, so feel free to jump into the codebase as well if you feel so inclined.
At a high level, our plan is as follows:
Data Pull:
- Pull digitally submitted (machine-readable) campaign finance records from the NCSBE website
- Pull scanned PDF reports (human-readable only) in cases where the data was not submitted digitally, and transcribe those reports into a machine-readable format.
- Drop all data files into a Google Cloud Storage location to be processed
Technologies used:
- Node.js is a popular JavaScript runtime we’ve used to build a number of utilities that pull data from NCSBE’s website and push it to a Google Cloud Storage bucket
- Google Cloud Storage is used for storing the initial files, along with staged files later down the line after they are cleaned and processed.
- Google Cloud Functions run our Node utilities. A function is triggered by an event that we’ve defined (a schedule, or a new file being uploaded), then runs whatever code we’ve built out (there’s a rough sketch of this pattern right after this list). This is a serverless offering from Google, so not only is it powerful and customizable, it’s also very cost-efficient.
- Google Pub/Sub is the messaging service between our functions and buckets. It keeps downstream services informed about what has run, triggers new functions, lets external services know when something has happened, and more.
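To make that a little more concrete, here’s a rough sketch of the pattern described above. It is not our actual code: the bucket name, topic name, environment variable, and use of the axios library are all placeholders we’ve assumed for illustration.

```js
// Sketch of a scheduled Cloud Function: download a report, drop it into a
// Cloud Storage bucket, and publish a Pub/Sub message so downstream steps
// know there's something new to process. Names below are placeholders.
const { Storage } = require("@google-cloud/storage");
const { PubSub } = require("@google-cloud/pubsub");
const axios = require("axios");

const storage = new Storage();
const pubsub = new PubSub();

exports.pullReports = async (event, context) => {
  // Placeholder: the real utility figures out which NCSBE export to pull.
  const url = process.env.NCSBE_EXPORT_URL;
  const response = await axios.get(url, { responseType: "stream" });

  // Stream the download straight into the raw-data bucket.
  const file = storage
    .bucket("nc-campaign-finance-raw")
    .file(`incoming/transactions-${Date.now()}.csv`);
  await new Promise((resolve, reject) => {
    response.data
      .pipe(file.createWriteStream())
      .on("finish", resolve)
      .on("error", reject);
  });

  // Tell the rest of the pipeline a new file is waiting.
  await pubsub.topic("new-raw-file").publishMessage({ json: { name: file.name } });
};
```

In practice a function like this would be wired to Cloud Scheduler for the regular pulls, or to a bucket notification when a new file lands, which is exactly the kind of glue Pub/Sub gives us.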
Data aggregation & cleansing
As anyone who works with data will tell you, it’s not always (almost never) in a perfect format for analysis. We’ll again employ Node via Google Cloud Functions to do some…
- …ingestion: What files have come in, what time period do they represent, and what types of transactions are we looking at?
- …cleansing: Making sure numbers are numbers, dates are dates, text is text, etc. We’ll also want to clean up extra whitespace, strange characters, and a whole lot of other issues that I’m sure we’ll run into along the way (a small sketch of this kind of cleanup follows this list).
- …processing: Matching up records that come from the same person/organization but may have different spellings. Checking for duplicate records.
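To give a feel for the cleansing step, here’s a tiny sketch of the kind of helper we have in mind. The field names (amount, date, name) are made up for illustration and won’t match the actual NCSBE columns.

```js
// Minimal illustration of record cleansing: numbers become numbers, dates
// become dates, and free text gets its whitespace tidied up.
function cleanRecord(raw) {
  return {
    // Strip "$", commas, and stray whitespace so amounts parse as real numbers.
    amount: Number(String(raw.amount || "").replace(/[$,\s]/g, "")) || 0,
    // Keep dates only if they actually parse; otherwise store null, not garbage.
    date: isNaN(Date.parse(raw.date)) ? null : new Date(raw.date),
    // Collapse repeated whitespace and trim the ends of free-text fields.
    name: String(raw.name || "").replace(/\s+/g, " ").trim(),
  };
}
```

Every record coming through the pipeline would pass through something like this before it’s staged for Snowflake.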
Cleaned data is then pushed to another Cloud Storage bucket and ingested into Snowflake via Snowpipe.
Technologies used:
- Snowflake is a powerful database platform that is entirely cloud-based and allows for easy sharing of data. We’ll also use this to do some of our analysis and visualization.
- Snowpipe — a Snowflake feature that automatically loads new files into Snowflake as soon as they’re uploaded to the Google Cloud Storage bucket it’s watching.
- SnowAlert (running on Google Kubernetes Engine) — a monitoring tool developed by Snowflake employees (but run separately) that watches our Snowflake instance and lets us know if an upload fails or some other issue arises.
- Zingg — an open source entity-resolution tool that identifies when records refer to the same person or organization, even when names are spelled differently, and flags duplicate records (a simplified illustration of this matching problem follows this list).
- Node, Google Cloud Platform (GCP) — see above for descriptions
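To give a flavor of the matching problem Zingg solves for us, here’s a deliberately over-simplified sketch. This is not how Zingg works internally (it matches probabilistically and at a much larger scale), and the name field is just an illustrative placeholder.

```js
// Naive illustration of entity matching: normalize names into a key and group
// records that share it, so "Smith, John" and "JOHN SMITH" land together.
function nameKey(name) {
  return String(name)
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ") // drop punctuation like the comma in "Smith, John"
    .split(/\s+/)
    .filter(Boolean)
    .sort() // word order no longer matters
    .join(" ");
}

function groupLikelyMatches(records) {
  const groups = new Map();
  for (const rec of records) {
    const key = nameKey(rec.name);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(rec);
  }
  // Any group with more than one record is a candidate duplicate or same-entity match.
  return [...groups.values()].filter((group) => group.length > 1);
}
```

The real matching will be much fuzzier than this (typos, nicknames, abbreviated committee names), which is exactly why we’re reaching for a dedicated tool instead of rolling our own.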
Data analysis & visualization
- Once we’ve cleaned our data, it’s time to put it to use! We’ll be making the raw data publicly available, but we also want to provide a couple of tools so anyone can do some digging, even if they’re not as familiar with data analysis.
- Using Tableau Public and/or d3.js, we’ll build a few dashboards to help you see trends and patterns in the data: who is spending what, where they’re spending it, and when (a tiny d3 sketch of the idea is just below). This will require a lot of iteration and feedback, so if you have thoughts, please submit them via a GitHub issue (see the Introduction post for more how-to on submitting issues) or via email!
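Here’s a minimal d3.js sketch of one such idea: a “top committees by total spend” bar chart. The data file and column names (spending.csv, committee, amount) are placeholders we’ve assumed; the real dashboards will be built on the cleaned data described above.

```js
// Assumes d3 v7 is loaded on the page (e.g. via a <script> tag).
// Draws a horizontal bar chart of total spending for the top ten committees.
d3.csv("spending.csv", d3.autoType).then((rows) => {
  const totals = d3
    .rollups(rows, (v) => d3.sum(v, (d) => d.amount), (d) => d.committee)
    .sort((a, b) => d3.descending(a[1], b[1]))
    .slice(0, 10);

  const width = 640;
  const barHeight = 24;
  const x = d3.scaleLinear()
    .domain([0, d3.max(totals, (d) => d[1])])
    .range([0, width - 220]);

  const svg = d3.select("body").append("svg")
    .attr("width", width)
    .attr("height", totals.length * barHeight);

  // One bar per committee, scaled by total spend.
  svg.selectAll("rect")
    .data(totals)
    .join("rect")
    .attr("x", 220)
    .attr("y", (d, i) => i * barHeight)
    .attr("width", (d) => x(d[1]))
    .attr("height", barHeight - 4);

  // Committee names as labels to the left of each bar.
  svg.selectAll("text")
    .data(totals)
    .join("text")
    .attr("x", 214)
    .attr("y", (d, i) => i * barHeight + barHeight / 2)
    .attr("text-anchor", "end")
    .attr("dominant-baseline", "middle")
    .text((d) => d[0]);
});
```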
Technologies used:
- Tableau Public is the free version of the popular data visualization platform; dashboards and visualizations built there are published publicly.
- d3.js is an open source data visualization library.
- Snowflake — see above for description
And that’s it for the moment! There is certainly a lot of work to be done, but we think it’s going to be a simple but powerful way to take a step toward more transparency in North Carolina’s campaign finance environment. As for timing, we anticipate that this project will take 6–8 months in total. That said, these tasks aren’t completed serially, where we’d have to finish each step in order before starting the next. We are already pulling the digital transactions and should have some initial visualizations up publicly in the next few weeks. They’ll be built on incomplete data that still needs cleanup, but hopefully they’ll start to show what this project can offer once it’s complete. Check back soon to see our next update when those visualizations go live!