Building Open Access to NC Campaign Finance Data — Entity Resolution (pt. 1)
Hello again! We’ve been hard at work trying to make this data available to everyone, but wanted to pause for a moment to talk through one of the trickier aspects of the project. When we started out, we talked to a few folks who had worked on something similar (or on the exact same project). One thing we heard fairly frequently was that the data would be pretty messy. A data project with messy data? So unheard of (/sarcasm).
When we thought about it, it made sense: we are dealing with human-generated data, after all. For example, let’s look at a very common expense for many campaigns, Facebook advertising. Each campaign or organization that uses this tool has someone write down what a given expense is for. If you had to guess how many different ways there are to write “Facebook,” I can almost guarantee that your guess will come in under the mark. Think about your answer; we’ll see what it looks like later on in the post.
We could always approach this the old-fashioned way: just look through the list and create buckets of names that refer to the same thing, right? That may work for a couple hundred or a few thousand names, so let’s take a look at how many entries there are:
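A count like this takes only a few lines of Python. The sketch below is a minimal illustration, not our actual pipeline: the sample rows, the column name "name", and the helper count_unique_names are all hypothetical stand-ins for the real transaction export.

```python
import csv
import io

# Hypothetical sample standing in for the real transaction export.
# The column name "name" is an assumption, not the actual schema.
SAMPLE_CSV = """name,amount
Facebook,120.00
FACEBOOK,75.50
Facebook Inc.,300.00
facebook,20.00
Facebook,15.25
"""


def count_unique_names(rows, column="name"):
    """Return the number of distinct raw values in the given column."""
    return len({row[column] for row in rows})


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
unique_count = count_unique_names(rows)
print(unique_count)
```

Note that the raw strings are compared exactly, so "Facebook" and "FACEBOOK" count as two different names. That is the whole problem in miniature, and it is why the count on the full dataset gets so large.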
Nearly half a million unique names? Option one is out. We’ll have to do this with a programmatic, machine-learning approach, and fortunately that is right in our wheelhouse! After poking around this space for a bit, we discovered a powerful open-source tool called Zingg.ai that seemed to fit our use case perfectly. We got to work setting it up and quickly found out just how powerful the tool is. Remember the Facebook question above? Here are the matches Zingg was able to come up with that had over $1k in spend:
There are twenty-two different ways people have managed to record significant spend toward the same entity. If we remove the $1k limit, we end up with 42(!!!) different variants. And this happens over and over again for most of the expenses that are recorded. Until everyone at these campaigns looks up the exact entity name for each expense when recording this data (a guy can dream, right?), this problem isn’t going anywhere.
All in all, Zingg found ~17k clusters (groups of names that refer to the same person or organization), with about $250 million in donations and expenses going to or from NC campaigns. That’s a whole lot of money whose sources and destinations we couldn’t fully trace without a tool like this.
One thing to note: we’re only showing the different names that were matched above, but the process actually looks at multiple data fields. The transactions are matched on both name and address, which helps us ensure the quality of the matches. If we see John Smith and John R. Smith in the same zip code, I’m not really convinced those are the same person. However, if both of those names have a similar or identical address, I feel much more confident. This is a huge part of both training the model and verifying that our results are accurate.
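To make the name-plus-address idea concrete, here is a rough sketch using Python’s standard-library difflib. To be clear, this is not how Zingg works internally (it learns its matching from labeled training pairs); the thresholds, the record layout, and the helper likely_same_entity are all assumptions made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def likely_same_entity(rec_a, rec_b, name_threshold=0.8, address_threshold=0.8):
    """Treat two records as a match only if BOTH name and address are similar.

    The 0.8 thresholds are arbitrary illustrative values, not tuned ones.
    """
    name_score = similarity(rec_a["name"], rec_b["name"])
    address_score = similarity(rec_a["address"], rec_b["address"])
    return name_score >= name_threshold and address_score >= address_threshold


# Hypothetical records: same zip code throughout, but only a and b share a street address.
a = {"name": "John Smith", "address": "123 Main St, Raleigh, NC 27601"}
b = {"name": "John R. Smith", "address": "123 Main Street, Raleigh, NC 27601"}
c = {"name": "John R. Smith", "address": "987 Oak Ave, Raleigh, NC 27601"}

print(likely_same_entity(a, b))  # similar name, near-identical address
print(likely_same_entity(a, c))  # similar name, different street in the same zip
```

Requiring both fields to agree is the design choice that matters here: a shared zip code alone is not enough to call two John Smiths the same person, while a near-identical street address makes the match far more believable.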
This has been a very interesting process so far, and I can see tons of uses across so many verticals. Cleaning up a CRM or sales records, monitoring internal campaign donations, scrubbing email lists, or standardizing health records could all benefit from a tool like this.
In the next post, I’ll get into the nuts and bolts of how we actually deployed our solution, so if you think this is something you could use, keep an eye out for it. If you’re very eager and want to get started ASAP, you could also head over to the Zingg website; the documentation is an excellent place to begin. Or feel free to get in touch with us. We’re happy to talk through your needs and get you pointed in the right direction.