$ Political Cash Flows

Methodology

How I collected, cleaned, and mapped FEC disbursement data to trace the last mile of political money.

Overview

This project traces political spending from its source — committees registered with the Federal Election Commission — to its destination: the specific vendors that receive the money. Most campaign finance reporting focuses on contributions (who gives). This project focuses on disbursements (who gets paid, and where they are).

The analysis covers three complete election cycles: 2020, 2022, and 2024, comprising approximately 4.8 million itemized expenditure records filed by political committees of both major parties.

Data Sources

All financial data comes from the Federal Election Commission's bulk data downloads, which are pipe-delimited flat files updated nightly and available for public download at fec.gov/data.

Files used

Operating Expenditures (oppexp)
The primary dataset. Each record represents an itemized payment from a political committee to a vendor, including the payee name, city, state, ZIP code, dollar amount, date, and a free-text purpose description. Files used: oppexp20.zip, oppexp22.zip, oppexp24.zip
Committee Master File (cm)
Links each committee ID to a committee name, party affiliation, and connected candidate ID (if applicable). This is how we determine whether a payment came from a Democratic or Republican committee, and which candidate it supports. Files used: cm20.zip, cm22.zip, cm24.zip
Candidate Master File (cn)
Maps candidate IDs to candidate names, office sought (House, Senate, or President), state, and district. This allows us to link vendor payments through committees to specific races. Files used: cn20.zip, cn22.zip, cn24.zip

What's not included

This analysis does not currently include independent expenditures (such as super PAC spending) or individual contributions. Independent expenditures are filed separately and use a different schema; adding them is planned as a future enhancement.

Data Pipeline

The raw FEC files go through a five-step pipeline before they reach the frontend. The full pipeline is written in Python and is available in the project repository.

Step 1: Download

Bulk files are downloaded programmatically from the FEC's S3-hosted archive using httpx with streaming and progress bars. Files that already exist locally are skipped. Total download size across three cycles is approximately 200MB compressed.
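
The skip-if-present behavior can be sketched as follows. This is a minimal illustration, not the project's code: the pipeline uses httpx with progress bars, while this version uses the standard library's urllib so it runs without extra dependencies and omits the progress display. The names `stream_url`, `download_if_missing`, and the `fetch` parameter are hypothetical.

```python
from pathlib import Path
from typing import Callable, Iterable
from urllib.request import urlopen

CHUNK_SIZE = 1 << 16  # read 64 KiB at a time instead of loading the whole file


def stream_url(url: str) -> Iterable[bytes]:
    """Yield the response body in chunks (stdlib stand-in for httpx streaming)."""
    with urlopen(url) as response:
        while chunk := response.read(CHUNK_SIZE):
            yield chunk


def download_if_missing(
    url: str,
    dest: Path,
    fetch: Callable[[str], Iterable[bytes]] = stream_url,
) -> bool:
    """Download url to dest unless dest already exists. Returns True if fetched."""
    if dest.exists():
        return False  # already downloaded on a previous run
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")
    with tmp.open("wb") as out:
        for chunk in fetch(url):
            out.write(chunk)
    tmp.rename(dest)  # only a fully written file gets the final name
    return True
```

Writing to a `.part` file first means an interrupted download is never mistaken for a complete one on the next run.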

Step 2: Parse & Merge

FEC bulk files are pipe-delimited with no headers. Column positions are mapped using the FEC's published data dictionaries. After parsing, each expenditure record is merged with the committee master file to attach party affiliation and candidate ID, then with the candidate master file to attach candidate name and office.

Records are filtered to the two major parties (DEM/DFL and REP). Negative amounts (refunds) and records with missing vendor names or addresses are dropped.
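
The parse, merge, and filter steps above might look roughly like this. It is a simplified sketch using only the standard library (the pipeline itself uses pandas). The column lists are hypothetical and heavily trimmed; real positions come from the FEC data dictionaries. Treating DFL as Democratic and treating "address" as city, state, and ZIP are our reading of the description, not confirmed details.

```python
import csv
import io

# Hypothetical, trimmed column layouts; the real files have many more fields.
EXPENDITURE_COLS = ["cmte_id", "vendor", "city", "state", "zip", "amount"]
COMMITTEE_COLS = ["cmte_id", "cmte_name", "party", "cand_id"]
# Assumption: DFL (Minnesota's Democratic affiliate) is folded into DEM.
MAJOR_PARTIES = {"DEM": "DEM", "DFL": "DEM", "REP": "REP"}


def read_pipe_file(text, columns):
    """Parse headerless pipe-delimited text into a list of dicts."""
    reader = csv.reader(io.StringIO(text), delimiter="|", quoting=csv.QUOTE_NONE)
    return [dict(zip(columns, row)) for row in reader if row]


def merge_and_filter(expenditures, committees):
    """Attach party and candidate ID, then drop records the analysis excludes."""
    by_id = {c["cmte_id"]: c for c in committees}
    kept = []
    for rec in expenditures:
        committee = by_id.get(rec["cmte_id"])
        if committee is None or committee["party"] not in MAJOR_PARTIES:
            continue  # unknown committee or not a major party
        try:
            amount = float(rec["amount"])
        except ValueError:
            continue  # unparseable amount
        if amount < 0:
            continue  # negative amounts are refunds
        if not all(rec[field].strip() for field in ("vendor", "city", "state", "zip")):
            continue  # missing vendor name or address
        kept.append({
            **rec,
            "amount": amount,
            "party": MAJOR_PARTIES[committee["party"]],
            "cand_id": committee["cand_id"],
        })
    return kept
```

The join to the candidate master file would follow the same lookup pattern, keyed on `cand_id`.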

Each record also gets a purpose category derived from the free-text purpose field. We map keywords like "MEDIA", "DIRECT MAIL", "CONSULTING", "POLLING", "TRAVEL", etc. to standardized categories. Records that don't match any keyword are categorized as "Other".
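
A keyword mapping of this kind can be sketched in a few lines. The keyword table and category labels below are illustrative, not the project's actual list, and `categorize_purpose` is a hypothetical name.

```python
# Hypothetical keyword table, checked in order; the first match wins.
PURPOSE_KEYWORDS = [
    ("DIRECT MAIL", "Direct Mail"),
    ("MEDIA", "Media"),
    ("CONSULTING", "Consulting"),
    ("POLLING", "Polling"),
    ("TRAVEL", "Travel"),
]


def categorize_purpose(purpose):
    """Map a free-text purpose string to a standardized category."""
    # Uppercase and collapse whitespace so "direct  mail" still matches.
    text = " ".join((purpose or "").upper().split())
    for keyword, category in PURPOSE_KEYWORDS:
        if keyword in text:
            return category
    return "Other"
```

Because the first match wins, the order of the table matters: in this sketch a description like "MEDIA CONSULTING" lands in "Media", not "Consulting".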

Step 3: Vendor Name Deduplication

The same vendor frequently appears under multiple name variations in FEC filings. For example:

  • GMMB INC
  • GMMB, INC.
  • GMMB
  • GMMB INC.

We use a two-phase approach:

  1. Normalization: Strip common suffixes (INC, LLC, CORP), remove punctuation, collapse whitespace, and uppercase everything.
  2. Blocking + fuzzy matching: Rather than comparing all ~290,000 unique vendor names against each other (O(n²)), we group names by the first five characters of their first significant word, then fuzzy-match only within each block. We use rapidfuzz's token_sort_ratio scorer, which handles word reordering ("SMITH CONSULTING" vs "CONSULTING SMITH"). Matches above 85% similarity are automatically merged. The canonical name is the variant with the highest total spend.
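
The two phases above can be sketched as follows. Several things here are assumptions for illustration: the pipeline uses rapidfuzz's token_sort_ratio, while this version approximates it with the standard library's difflib on alphabetically sorted tokens, so scores will not match rapidfuzz exactly. The "first significant word" rule is simplified to "first word", and each name is compared only against the top-spending name of each existing cluster. All function names are hypothetical.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher

SUFFIXES = {"INC", "LLC", "CORP"}
THRESHOLD = 85.0  # merge when similarity is above this score (0-100 scale)


def normalize(name):
    """Uppercase, drop punctuation, collapse whitespace, strip trailing suffixes."""
    tokens = re.sub(r"[^A-Z0-9 ]+", " ", name.upper()).split()
    while len(tokens) > 1 and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def block_key(normalized):
    """First five characters of the first word (simplified blocking rule)."""
    words = normalized.split()
    return words[0][:5] if words else ""


def similarity(a, b):
    """Token-sorted similarity, 0-100 (difflib stand-in for token_sort_ratio)."""
    sa, sb = " ".join(sorted(a.split())), " ".join(sorted(b.split()))
    return 100.0 * SequenceMatcher(None, sa, sb).ratio()


def dedupe(spend_by_name):
    """Return {raw name: canonical name}, comparing names only within a block."""
    blocks = defaultdict(list)
    for raw in spend_by_name:
        blocks[block_key(normalize(raw))].append(raw)

    canonical = {}
    for names in blocks.values():
        clusters = []  # each cluster lists raw names, highest spend first
        for raw in sorted(names, key=spend_by_name.get, reverse=True):
            for cluster in clusters:
                if similarity(normalize(raw), normalize(cluster[0])) > THRESHOLD:
                    cluster.append(raw)
                    break
            else:
                clusters.append([raw])
        for cluster in clusters:
            for raw in cluster:
                canonical[raw] = cluster[0]  # top-spend variant is canonical
    return canonical
```

Sorting each block by spend before clustering means the first member of every cluster is automatically the highest-spend variant. One consequence of this simplified blocking rule is that reordered names are only compared when they share a block key.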

Across three cycles, this process merged approximately 40,000 name variants into ~250,000 unique vendor identities. The blocking strategy reduced the deduplication runtime from over an hour to under 10 seconds.
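
To see why blocking matters, it helps to count comparisons. The all-pairs figure below follows directly from the ~290,000 names; the blocked figure assumes, purely for illustration, that names split evenly into 10,000 blocks of 29. Real blocks are uneven, so the actual count differs.

```python
def pair_count(n):
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Comparing every name against every other name.
all_pairs = pair_count(290_000)          # 42,049,855,000 comparisons

# Hypothetical even split: 10,000 blocks of 29 names each.
blocked_pairs = 10_000 * pair_count(29)  # 10,000 * 406 = 4,060,000 comparisons
```

Under that assumed split, blocking cuts the work by roughly four orders of magnitude.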

Step 4: Geocoding

FEC expenditure records include the vendor's city, state, and ZIP code, but not a street address. We geocode vendor locations using ZIP code centroids from a public dataset of U.S. ZIP code coordinates. Each vendor is placed at the centroid of its ZIP code, with small random jitter applied to prevent exact overlaps.
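
A centroid lookup with jitter might look like this. The two centroids are approximate and included only for illustration; the jitter size, the handling of 9-digit ZIPs, and the function names are assumptions rather than details taken from the pipeline.

```python
import random

# Illustrative, approximate centroids; the pipeline loads a full public table.
ZIP_CENTROIDS = {
    "20007": (38.914, -77.074),
    "78701": (30.271, -97.742),
}
JITTER_DEGREES = 0.005  # hypothetical; about 0.5 km north-south


def clean_zip(raw):
    """Return the 5-digit ZIP, or None if the value is not a usable ZIP."""
    digits = "".join(ch for ch in (raw or "") if ch.isdigit())
    if len(digits) not in (5, 9):  # accept ZIP or ZIP+4 only
        return None
    return digits[:5]


def geocode(raw_zip, rng):
    """Return a jittered (lat, lon) near the ZIP centroid, or None if unknown."""
    zip5 = clean_zip(raw_zip)
    if zip5 is None or zip5 not in ZIP_CENTROIDS:
        return None
    lat, lon = ZIP_CENTROIDS[zip5]
    return (
        lat + rng.uniform(-JITTER_DEGREES, JITTER_DEGREES),
        lon + rng.uniform(-JITTER_DEGREES, JITTER_DEGREES),
    )
```

Passing in a seeded `random.Random` instance keeps the jitter reproducible between pipeline runs, so points do not shift each time the map is rebuilt.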

This provides city-level accuracy, which is sufficient for the clustering analysis this project presents. Approximately 87% of records are successfully geocoded; the remaining 13% have invalid or non-standard ZIP codes and are excluded from the map but included in aggregate spending totals.

Step 5: Aggregation & Export

The geocoded data is aggregated into several output files consumed by the frontend:

  • Vendor map (GeoJSON): One point per unique vendor, with total spend, party split, purpose category, and active cycles. Vendors below $1,000 total spend are excluded to keep the file manageable (~155,000 features).
  • State spend (JSON): Total DEM/REP spending per state.
  • Top vendors (JSON): Top 50 vendors ranked by total spend with enriched metadata.
  • Vendor cycles (JSON): Cycle-over-cycle spending for the top 30 repeat vendors.
  • Vendor details (JSON): For the top 500 vendors: monthly spending timeline, purpose breakdown, top 10 paying committees, and linked candidates.
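
As an illustration of the vendor map export, the sketch below aggregates geocoded records into one GeoJSON point per vendor and applies the $1,000 cutoff. The record keys, property names, and tie-breaking rule are hypothetical, and the purpose category is omitted for brevity.

```python
import json
from collections import defaultdict

MIN_TOTAL_SPEND = 1_000.0  # vendors below this total are left off the map


def build_vendor_geojson(records):
    """Aggregate geocoded records into one GeoJSON point per vendor."""
    vendors = defaultdict(
        lambda: {"dem": 0.0, "rep": 0.0, "cycles": set(), "coords": None}
    )
    for rec in records:
        v = vendors[rec["vendor"]]
        v["dem" if rec["party"] == "DEM" else "rep"] += rec["amount"]
        v["cycles"].add(rec["cycle"])
        # Keep the first location seen; GeoJSON order is [longitude, latitude].
        v["coords"] = v["coords"] or (rec["lon"], rec["lat"])

    features = []
    for name, v in vendors.items():
        total = v["dem"] + v["rep"]
        if total < MIN_TOTAL_SPEND:
            continue
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": list(v["coords"])},
            "properties": {
                "vendor": name,
                "total": round(total, 2),
                "dem_share": round(v["dem"] / total, 4),
                # Assumption: ties go to DEM; the source does not say.
                "party": "DEM" if v["dem"] >= v["rep"] else "REP",
                "cycles": sorted(v["cycles"]),
            },
        })
    return {"type": "FeatureCollection", "features": features}
```

The result can be written out with `json.dump` and loaded directly as a GeoJSON source on the map.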

Frontend

The interactive is built with SvelteKit using the static adapter, meaning the final output is a fully self-contained static site with no backend server. All data is loaded from pre-built JSON and GeoJSON files at page load.

  • Maps: MapLibre GL JS with CARTO Positron tiles (no API key required)
  • Charts: D3.js for the choropleth, treemap, and cycle timeline
  • Scroll: Custom Intersection Observer action (no Scrollama dependency)
  • Clustering: MapLibre's built-in GeoJSON clustering for performant rendering of 155K+ vendor points

Limitations & Caveats

  • Address ≠ work location. Vendor locations reflect mailing addresses (derived from ZIP codes), not where the work is performed. A D.C.-based consulting firm may produce ads that run in swing states. The map shows where the money is sent, not necessarily where its effects are felt.
  • ZIP centroid accuracy. Vendors are placed at the centroid of their ZIP code, not at their actual street address. In dense urban areas, this is accurate within a few blocks. In rural areas, it may be off by miles.
  • Name matching is imperfect. Fuzzy deduplication may over-merge distinct vendors with similar names (e.g., two different "Smith Consulting" firms) or fail to merge vendors with very different naming conventions. We use a conservative 85% threshold to minimize false merges.
  • Party attribution is by committee, not vendor. A vendor coded as "DEM" received money from Democratic-affiliated committees. Many vendors work for both parties, and their dot color reflects whichever party paid them more.
  • Purpose categorization is keyword-based. The FEC purpose field is free text entered by filers with no controlled vocabulary. Our keyword mapping captures common patterns but may miscategorize atypical descriptions. Records that don't match any keyword default to "Other."
  • No independent expenditures. This analysis currently covers only operating expenditures (direct committee spending). Super PAC independent expenditures, which represent a significant share of total political spending, are not yet included.

Tools & Technologies

Data pipeline: Python 3.11+, pandas, rapidfuzz, httpx, pyarrow
Frontend: SvelteKit 2, Svelte 5, TypeScript
Visualization: D3.js v7, MapLibre GL JS v5
Map tiles: CARTO Positron (no-labels)
Geocoding: ZIP code centroids (U.S. Census)
Boundary data: US Atlas TopoJSON

Reproducibility

The full data pipeline is open source. To reproduce from scratch:

git clone [repo-url]
cd political-cash-flows/pipeline
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python -m pipeline.src.run_pipeline

This will download all FEC bulk files (~200MB compressed), parse ~4.8M records, deduplicate vendor names, geocode locations, and export the final datasets. The full pipeline takes approximately 5–10 minutes depending on your internet connection.

How to Cite

If you reference this project in your own reporting, research, or analysis, please cite it as:

Lysik, Tory. "The Last Mile of Political Money." 2026. torythetortle.github.io/political-cash-flows

BibTeX:

@misc{lysik2026lastmile,
  author = {Lysik, Tory},
  title = {The Last Mile of Political Money},
  year = {2026},
  url = {https://torythetortle.github.io/political-cash-flows}
}

License

The code for this project is released under the MIT License. The underlying data is public, sourced from the Federal Election Commission.

Get the Data

The processed datasets used in this project are available for journalistic and research purposes. If you're a reporter, researcher, or newsroom that would like access to the full dataset — including the geocoded vendor database, deduplication mapping, and committee-candidate linkage tables — please get in touch.

For data requests or questions about this project:

lysiktory@gmail.com

Please include your name, affiliation, and a brief description of how you plan to use the data.