What actually predicts a data-center rejection: 1258 decisions, 30 surviving signals
Raymond Xu
May 19, 2026 · 6 min read
We just shipped what we believe is the largest publicly-cited corpus of US data-center entitlement decisions: 1258 outcomes across 42 states, 2022–2026, every row carrying a HEAD-verified primary-source URL. With a corpus this size we can actually answer the question consultants get paid $400/hour to guess at: what predicts a community fight? The answer disagrees with a surprising amount of conventional wisdom — and the strongest non-policy signal that survived statistical shrinkage is one no industry analyst would have picked manually.
1258 outcomes, 42 states, $87 of API spend
The corpus expanded 5.3× over the prior public version (236 → 1258 outcomes) through nine waves of ingest. Each row was extracted by DeepSeek V4 Flash from a primary-source article (county commission minutes, regional newspapers, data-center industry trades) discovered via Anthropic web_search and, in the final wave, Serper. Every URL was HEAD-checked before merge; the few that 404'd were dropped. Total API cost: $87, versus an Opus-4.7 baseline estimate of $300.
- Outcomes tagged: 1258
- Resistance rate: 28%
- States covered: 42
- Variables modeled: 22
Each outcome was tagged against 22 variables: moratorium status, organized opposition, PILOT offered, existing industrial zoning, pre-filing engagement, and 17 others. Three of those variables — prior_denial_50mi_36mo, prior_approvals_50mi_36mo_count, hyperscaler_precedent_50mi — were then recomputed deterministically from county centroids using a 50-mile radius and 36-month window. That enrichment found 47–85% of the model-guessed values were wrong; the deterministic values went into the fit.
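As a rough illustration of what a deterministic recompute can look like, here is a minimal sketch of prior_denial_50mi_36mo using great-circle distance between county centroids. The record fields (lat, lon, outcome, decided, filed), the helper names, and the choice to count only rows labeled "denied" are assumptions made for the example, not the pipeline's actual schema.

```python
import calendar
import math
from datetime import date

EARTH_RADIUS_MI = 3958.8  # mean Earth radius in miles


def haversine_mi(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MI * math.asin(math.sqrt(a))


def months_before(d, months):
    """Date `months` calendar months before `d`, clamping the day to the month's length."""
    total = d.year * 12 + (d.month - 1) - months
    year, month0 = divmod(total, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(d.day, last_day))


def prior_denial_50mi_36mo(target, decisions, radius_mi=50.0, window_months=36):
    """True if any earlier denial lies inside the radius and the lookback window.

    `target` and each entry of `decisions` are dicts with hypothetical keys:
    lat/lon (county centroid), outcome, decided (date), filed (date, target only).
    """
    cutoff = months_before(target["filed"], window_months)
    for d in decisions:
        if d["outcome"] != "denied":
            continue
        if not (cutoff <= d["decided"] < target["filed"]):
            continue
        if haversine_mi(target["lat"], target["lon"], d["lat"], d["lon"]) <= radius_mi:
            return True
    return False
```

Because the function depends only on coordinates and dates, rerunning it on the same rows always gives the same answer, which is the property the model-guessed values lacked.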
The 30 signals that survived lasso shrinkage
On top of the 22 hand-picked variables, we joined each outcome to its county's full Census ACS 5-year profile — 124 features covering income, education, race, age, industry employment, housing, broadband, migration, language, disability, and veteran status. The combined 70-feature matrix (after dropping collinear raw count totals) went into an L1 lasso logistic regression. Forty features shrank to zero. The thirty that survived are the ones actually doing predictive work.
Plain English: we tested 70 possible predictors; lasso kept 30 and pushed the weak ones to zero. CV AUC 0.82 means that, averaged across five train/test splits, a randomly chosen rejected project was ranked riskier than a randomly chosen approved one about 82% of the time. Z-scaled means every input was standardized to mean 0 and standard deviation 1 before fitting, so coefficient sizes (the bar lengths) are directly comparable.
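The ranking reading of AUC can be checked by brute force: count the share of (resistant, approved) pairs in which the resistant project gets the higher score. The sketch below does exactly that; the function name and the toy scores are made up for illustration and are not drawn from the corpus.

```python
def pairwise_auc(scores, labels):
    """AUC as the share of (resistant, approved) pairs ranked correctly.

    labels: 1 = resistant, 0 = approved. Tied scores count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: two resistant projects, three approved ones.
toy_scores = [0.9, 0.8, 0.3, 0.6, 0.2]
toy_labels = [1, 0, 1, 0, 0]
print(round(pairwise_auc(toy_scores, toy_labels), 3))  # 4 of 6 pairs correct -> 0.667
```

An AUC of 0.5 means the scores order pairs no better than a coin flip, and 1.0 means every resistant project outranks every approved one.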
The top of the list looks like conventional wisdom — active moratorium (β=2.73), organized opposition (β=0.94), PILOT offered (β=-1.54), existing industrial zoning (β=-0.88). The bottom is where it gets interesting: percentage of county employment in transportation and utilities (β=-0.22, protective), percentage of county residents who moved in from a different county (β=-0.20, protective), and an arts-and-food employment share (β=+0.07, resistance-pushing). None of these came from the hand-picked schema. They came from the data.
Half the conventional wisdom is statistical noise
Several variables that have sat in industry checklists for years turn out to have little or no predictive power, or to point the opposite way, once the corpus is large enough to test them. Battlefield or historic district within 2 miles: previously claimed at 1.37× resistance odds, now β=-0.08 (effectively null). USGS high-groundwater stress: claimed 1.08×, now β=-0.32, in the opposite direction. Foreign-capital or shell-entity applicant: claimed 1.49×, now β=+0.14 (real but small). NDA/codename pattern: claimed 1.08×, now β=-0.12 (slightly protective, the opposite of the original claim).
| Variable | Old β (N=225) | New β (N=1006) | What changed |
|---|---|---|---|
| Moratorium active at filing | +1.40 | +2.73 | Almost doubled. Moratoriums are decisively the strongest resistance signal. |
| Water-intensive cooling proposed | -0.23 | +0.76 | Sign flipped — communities now resist water-cooled proposals. |
| Pre-filing community engagement | -1.15 | -0.41 | Overstated 3×. Still helpful, but not the silver bullet. |
| Hyperscaler precedent within 50 mi | -0.25 | +0.05 | Was overstated. Communities don't approve because a Microsoft DC is nearby. |
| New substation required | -0.02 | -0.50 | Was claimed null. Actually moderately protective. |
| Existing industrial zoning | -0.54 | -0.88 | Stronger than originally measured. |
| Battlefield / historic district nearby | +0.31 | -0.08 | Was wrong-direction noise. Null effect. |
| Groundwater stress (USGS) | +0.07 | -0.32 | Sign flipped. Real effect is slightly protective, not resistance-pushing. |
| Rural pristine setting | +0.47 | +0.07 | Much weaker. Original overcredited landscape framing. |
| ≥750 MW (near-gigawatt scale) | -0.30 | -0.59 | Bigger projects are MORE likely approved (PILOT + state cover). |
| PILOT / tax abatement offered | -1.23 | -1.54 | Remains the strongest protective tool by 2×. |
Two of the most-cited 'silver bullet' interventions in industry advice are overstated by roughly 3× or more. Pre-filing community engagement was published as β=-1.15 (about 3.2× the approval odds); the refit on the 1006 non-pending decisions gives β=-0.41 (about 1.5×). Hyperscaler precedent within 50 miles was claimed at β=-0.25 (about 1.3× the approval odds); after correct deterministic measurement, it's β=+0.05, a null effect. Communities don't see 'a Microsoft DC two counties over' as license to approve yours.
Most striking: water-intensive cooling. The original published coefficient was -0.23 (slightly protective). The refit on the 5×-larger corpus gives +0.76 — a sign flip. Communities have shifted hard on water use over the 2022–2026 window. Filings that propose evaporative or chilled-water cooling are now meaningfully more likely to draw resistance than air-cooled designs.
The demographic signals no one was tracking
Six Census features survived shrinkage and weren't anywhere in the original schema. Older counties (higher median age) are more resistant. Counties with a larger white population share are more resistant; counties with a larger Black population share are less. More-educated counties (higher bachelor's-or-above share) are more resistant, a counterintuitive finding that's consistent with literature on educated-NIMBY mobilization. Counties with high arts-and-food employment shares (the 'amenity economy': Sedona, Hudson Valley, Asheville) are more resistant.
The strongest new signal is the percentage of county workforce in transportation and utilities: at β=-0.22, every standard deviation above the mean cuts resistance odds by about 20%. Counties whose labor market already absorbs a heavy-industry workforce treat data-center siting as more of the same. The second-strongest new signal is the migration variable: counties with a higher share of residents who recently moved in from a different county (population churn) are less resistant (β=-0.20). This suggests NIMBY-ism tracks community stability more than specific opposition to data centers.
Three practical implications for siting + outreach
First, the demographic risk profile of a candidate county is now part of the underwriting. A site in a slow-growth, older, whiter, more-educated county carries a hidden risk premium beyond its explicit policy variables. Sites in transport/utilities-heavy or recent-in-migration counties carry a hidden discount. The 22-variable schema didn't capture this because none of the original variables were demographic; the Census join exposes it.
Second, PILOT remains the single most leveraged tool in the toolkit. At β=-1.54 it shifts approval odds by a factor of 4.7 — bigger than community engagement, hyperscaler precedent, and existing industrial zoning combined. If you can structure a material PILOT into the filing, it dominates almost every other intervention.
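The factor of 4.7 comes from exponentiating the coefficient, and the same arithmetic covers the "combined" comparison. The sketch below reproduces both from the table values. The dictionary keys other than hyperscaler_precedent_50mi are hypothetical labels, and the probability helper assumes the coefficient applies to a one-unit change with everything else held fixed, starting from the 28% corpus-wide resistance rate; treat its output as an illustration, not a site-specific estimate.

```python
import math

# Coefficients on resistance log-odds, copied from the comparison table.
BETA = {
    "pilot_offered": -1.54,
    "prefiling_engagement": -0.41,
    "hyperscaler_precedent_50mi": 0.05,
    "existing_industrial_zoning": -0.88,
}


def approval_odds_factor(beta):
    """Multiplier on approval odds implied by a resistance coefficient: exp(-beta)."""
    return math.exp(-beta)


def shifted_resistance_probability(baseline_p, beta):
    """Resistance probability after adding `beta` to the baseline log-odds."""
    odds = baseline_p / (1.0 - baseline_p) * math.exp(beta)
    return odds / (1.0 + odds)


pilot = approval_odds_factor(BETA["pilot_offered"])
others = approval_odds_factor(
    BETA["prefiling_engagement"]
    + BETA["hyperscaler_precedent_50mi"]
    + BETA["existing_industrial_zoning"]
)
print(f"PILOT alone: {pilot:.2f}x approval odds")        # about 4.66x
print(f"other three combined: {others:.2f}x")            # about 3.46x
print(f"{shifted_resistance_probability(0.28, BETA['pilot_offered']):.3f}")  # about 0.077
```

Summing coefficients before exponentiating is what "combined" means on the log-odds scale: the three other levers together still move approval odds less than PILOT does alone.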
Third, the prior published model was overconfident. The original leave-one-out AUC of 0.96 was an artifact of overfitting on 225 rows; the honest AUC on 1006 non-pending decisions is 0.805. That's a real model: it ranks a resistant project above an approved one about 80% of the time. But AUC measures ranking, not accuracy, and 0.805 is well short of the 0.96 the smaller corpus seemed to promise. Practitioners should treat resistance scoring as a structured prior, not a deterministic verdict.
How it works under the hood
Ingestion: a DeepSeek V4 Flash classifier reads article body text (HTML-stripped, ~4KB cap) and emits structured JSON with the 22 variables tagged. Discovery uses two sources: Anthropic web_search via Haiku 4.5 for waves 1–8, and the Serper API (Google search) for wave 9 at one-tenth the cost. Every URL is HEAD-checked, and rows are deduplicated on a normalized project_name + outcome key. The full pipeline lives in scripts/data-center-resistance/.
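For readers who want a feel for the merge step, here is a minimal standard-library sketch of a HEAD check and a dedup pass. The normalization rule (lowercase, punctuation stripped, whitespace collapsed), the function names, and the status-below-400 cutoff are assumptions for the example; the actual scripts may differ.

```python
import re
import urllib.request


def normalize_name(name):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", name.lower())
    return " ".join(cleaned.split())


def dedupe(rows):
    """Keep the first row seen for each (normalized project_name, outcome) key."""
    seen = set()
    kept = []
    for row in rows:
        key = (normalize_name(row["project_name"]), row["outcome"].strip().lower())
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def url_is_live(url, timeout=10.0):
    """True if a HEAD request returns a status below 400; False on any failure."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except (OSError, ValueError):
        # OSError covers URLError/HTTPError and timeouts; ValueError covers malformed URLs.
        return False


rows = [
    {"project_name": "Project Falcon", "outcome": "denied"},
    {"project_name": "project  FALCON!", "outcome": "Denied"},   # duplicate after normalizing
    {"project_name": "Project Falcon", "outcome": "approved"},   # same name, different outcome
]
print(len(dedupe(rows)))  # 2
```

Keying on name plus outcome keeps a project that was denied and later approved as two rows, which matters when both decisions are real events. Some servers reject HEAD requests outright, so a production check might fall back to GET before dropping a URL.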
Analysis: Census ACS 5-year (2022) data is pulled via the free Census API: 124 features × 3144 counties. The county-level join uses the US Census Bureau 2020 gazetteer for FIPS centroids. Univariate Pearson r is computed against the binary 'resistant' outcome (denied / withdrawn / voided / moratorium-blocked / delayed >12mo vs approved). The lasso fit is an L1-penalized logistic regression solved by proximal gradient on standardized features, with 5-fold CV. The published coefficients come from an L2-regularized (ridge) fit with bootstrap standard errors (300 resamples).
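To make the shrinkage mechanism concrete, the following is a small pure-Python sketch of proximal-gradient (ISTA) L1-penalized logistic regression on z-scaled inputs. It is not the pipeline's code: the function names, the fixed iteration count, and the step-size rule (a conservative bound that holds for standardized columns) are assumptions chosen to keep the example short.

```python
import math


def standardize(X):
    """Z-scale each column to mean 0 and sd 1 (constant columns become zeros)."""
    n, d = len(X), len(X[0])
    cols = []
    for j in range(d):
        col = [row[j] for row in X]
        mu = sum(col) / n
        sd = math.sqrt(sum((v - mu) ** 2 for v in col) / n)
        cols.append([(v - mu) / sd if sd > 0 else 0.0 for v in col])
    return [[cols[j][i] for j in range(d)] for i in range(n)]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink toward zero, clipping at zero."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def lasso_logistic(X, y, lam, iters=2000):
    """Minimize mean log-loss + lam * ||w||_1; the intercept b is not penalized."""
    n, d = len(X), len(X[0])
    # For z-scaled columns the gradient's Lipschitz constant is at most (d + 1) / 4.
    step = 4.0 / (d + 1)
    w, b = [0.0] * d, 0.0
    for _ in range(iters):
        resid = [
            sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) - yi
            for row, yi in zip(X, y)
        ]
        grad_w = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(d)]
        grad_b = sum(resid) / n
        # Gradient step on the smooth loss, then soft-threshold for the L1 term.
        w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, grad_w)]
        b -= step * grad_b
    return w, b
```

The soft-threshold step is what sets a coefficient to exactly zero whenever its gradient signal is weaker than the penalty, which is how weak predictors drop out of the fit. In practice the penalty strength would be chosen by cross-validation, as the 5-fold CV above implies.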
Primary sources + reproducibility