
Pricing model

How parcelpump charges for the data + infrastructure it operates. Captured 2026-05-05 as the canonical replacement for docs/pre-publish-roadmap.md §4 ("Financial model"), which proposed classic SaaS tiers and is now superseded.

Mental model

parcelpump is public-good infrastructure with proprietary operations. The code and adapters are not open-sourced. The cost ledger, architecture, API surface, refresh schedules, and catalog of what's wired are open. Customers see exactly what we spend and what we charge on top.

Three principles drive everything below:

  1. Cost-plus, not value-based. We don't price on what the data is worth to the buyer. We price on what it costs us to produce, plus a published markup. The markup is the same for everyone.
  2. The live API is for tiles + per-parcel reads. Bulk needs a different channel. Live-API hammering exhausts Lambda concurrency and inflates RDS load. Anyone who needs sweep access goes through bulk export, which is cheap on our side and priced accordingly.
  3. Counties are stakeholders, not targets. parcelpump runs as a visible, identifiable scraper with a public contact path. If a county sees runaway scraping in their logs, we want them to be able to find us, ask us about it, and (when it's not us) get us to absorb that traffic into our infrastructure so their portals stop getting hammered.

What we sell — and what we keep internal

A hard product boundary: customers see straight, normalized data from the source counties. They do not see parcelpump's internal analytical layers.

Customer-facing (sold via API + bulk export)

  • Parcel polygons + canonical attributes (situs, mailing, zoning, acreage, valuation history, sales, owner names) — pulled from the county's own assessor / treasurer portal, normalized to the canonical Scrape type.
  • Parcel scrape data exactly as it came from the source, with consistent field names across vendors. The normalization is the service.

Internal-only (never returned via API or export)

  • CSB-derived agricultural flags (is_agricultural, dominant_crop, ag_year_count). These exist in the parcels table to target which parcels we scrape (cuts the per-parcel scrape budget by ~93% for ag-focused workflows). They do not ship.
  • The csb_fields table (USDA Crop Sequence Boundaries polygons). Internal infrastructure for the ag-targeting join.
  • The findings table (review-engine output: assessor differential analysis, anomaly flags, comparative valuations). This is parcelpump's analytical layer — proprietary work product. Not exposed to API customers.
  • Ownership-graph match decisions between SoS entities and parcels (when that ships). The matched fact ("this parcel's owner_name contains this LLC name") is fine to surface; the inferred match weight / clustering logic stays internal.

Why the boundary matters

  1. Aligned with the cost-plus framing. Customers pay for the straight-through scrape pipeline. The analytical layers are value-add we may eventually monetize separately, but they are not part of the "data utility" product.
  2. Provenance clarity. Everything we ship has a clear county source. No customer can confuse parcelpump's internal classification with the county's own record.
  3. Future product surface. If/when we sell tax-appeal evidence, ag market intelligence, or ownership-graph queries, those are separate product lines built on the internal layers — not bundled into the data utility.

API + export consequences

  • GET /parcels/:source/:id returns canonical attributes + scrape_data only. is_agricultural, dominant_crop, ag_year_count columns are not serialized.
  • GET /findings/:source/:id becomes admin-only. Currently open to any valid key in src/api/server.ts; needs locking down as part of the build.
  • Bulk export schemas: same redactions. Exported parquet schemas exclude all internal-only columns.
  • POST /scrape-jobs/enqueue-ag-county continues to work — the customer requests "enqueue ag parcels in this county" without ever seeing the underlying flag. The classification is a server-side filter, not a returned field.
  • Search responses: ag-flag fields not surfaced; ag-keyed query shaping (e.g. ?agricultural=true) is not exposed.
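The redaction rule above can be sketched as a serialization filter. This is an illustrative helper, not the actual server code; the column names come from this doc, everything else (function and variable names) is hypothetical:

```typescript
// Internal-only columns that must never reach API or export responses
// (column names from this doc; the helper itself is a sketch).
const INTERNAL_ONLY = new Set([
  "is_agricultural",
  "dominant_crop",
  "ag_year_count",
]);

/** Strip internal-only keys from a parcel row before serialization. */
function toCustomerFacing<T extends Record<string, unknown>>(
  row: T,
): Record<string, unknown> {
  return Object.fromEntries(
    Object.entries(row).filter(([key]) => !INTERNAL_ONLY.has(key)),
  );
}

// Example: a row as it might come out of the parcels table.
const row = {
  source: "cook-il",
  parcel_id: "14-21-101-001",
  owner_name: "EXAMPLE LLC",
  is_agricultural: true,
  dominant_crop: "corn",
  ag_year_count: 5,
};
const out = toCustomerFacing(row);
```

Applying the same filter to the bulk-export schema builder keeps the two channels in lockstep: one deny-list, two consumers.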

The 35% markup

All prices = published AWS+vendor cost × 1.35.

The 35% is calibrated to cover, in rough proportion:

  • ~20% engineering: adapter maintenance and on-call response
  • ~10% reserve for adapter rebuilds when county portals change
  • ~5% legitimate margin

This number is published verbatim at parcelpump.dev/about/cost alongside the live AWS bill breakdown. If our true cost moves (RDS size up, ScraperAPI rate change), the underlying numbers update; the 35% does not.
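A minimal sketch of the pricing rule, assuming prices are tracked in integer cents (the function name and rounding choice are illustrative; integer math avoids float drift before the round-up):

```typescript
/** Published price: underlying cost times the fixed 1.35 markup, rounded up.
 *  Uses integer math (cost * 135 / 100) so e.g. 100 cents never drifts past 135. */
function publishedPriceCents(costCents: number): number {
  return Math.ceil((costCents * 135) / 100);
}

publishedPriceCents(100); // → 135
publishedPriceCents(7); // → 10 (9.45 rounds up; rounding always favors the reserve)
```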

Three product surfaces

| Surface | Path | Access | Backed by | Pricing primitive |
|---|---|---|---|---|
| Live API | api.parcelpump.io | per-key auth, rate-limited | RDS hot path | per-request, cost-plus |
| Bulk export | signed S3 URLs | per-account subscriptions or one-shots | scheduled snapshot pipeline | per-snapshot, cost-plus |
| Scrape funding | parcelpump.dev/data | logged-in users | scrape worker fleet + adapters | wire / refresh / one-shot, three eng tiers |

1. Live API

Per-request billing. Free tier covers casual use; paid plans cover production reads. Concretely:

  • Free key: 10K requests/month, hard rate-limit 1 RPS sustained / 10 burst.
  • Paid: pre-paid balance against per-request cost. A typical read costs us a fraction of a cent in Lambda + RDS + CloudFront; the customer is charged that × 1.35. Balance hits zero → calls return 402.
  • No tiered subscription. Pay for what you use; no monthly minimum.
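The balance mechanics above can be sketched as follows. All names are hypothetical; the one real design point is that per-request costs are sub-cent, so the balance should be tracked in an integer sub-cent unit rather than floating-point cents:

```typescript
// Sketch of per-request prepaid billing. Request costs are fractions of a
// cent, so the balance is integer micro-cents (1 cent = 1,000,000 µ¢).
interface KeyAccount {
  balanceMicroCents: number;
}

/** Debit one request; return the HTTP status the API would answer with. */
function chargeRequest(account: KeyAccount, costMicroCents: number): number {
  if (account.balanceMicroCents < costMicroCents) {
    return 402; // Payment Required: balance exhausted, nothing debited
  }
  account.balanceMicroCents -= costMicroCents;
  return 200;
}
```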

Tile bytes (api.parcelpump.io/tiles/...) stay free and uncapped — they're CloudFront-cached and trivially cheap on our side. The whole point of the tile layer is broad embedding.

2. Bulk export

Pre-built or on-demand snapshots delivered as signed S3 URLs. Customers download via curl/aws-cli; URL expires in 7 days.

  • Format: GeoParquet by default. GeoPackage / CSV+WKT / Shapefile available with a small per-format premium (extra encode time).
  • Scope: by county, by state, or by filter (e.g. "assessed value > $1M in TX", "all residential land-use in Cook County"). Filter expressions use canonical scrape fields only — internal-only columns (CSB ag flags, findings, match weights) are not filterable by customers. Filter exports are quoted.
  • Cadence: one-shot ("most recent snapshot, $X") or subscription ("weekly OK statewide, $Y/mo").
  • Snapshot reuse: if multiple customers subscribe to the same (scope, cadence) pair, we generate the snapshot once and charge each at marginal serving cost (S3 GET + egress + 35%). Initial generation cost amortizes across subscribers.
  • Watermarking: every snapshot embeds account_id + generated_at in metadata. If a snapshot leaks to a non-customer, we can identify the source.
  • Freshness disclosure: every catalog entry shows last_refresh_at + next_scheduled_at. If you buy an export of a county we haven't re-scraped in 6 months, you're getting 6-month-old data. Visible upfront.
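The snapshot-reuse pricing above reduces to a small formula. A sketch, with hypothetical names (generation cost split evenly across current subscribers, plus each subscriber's own marginal serving cost, all times the published 1.35 markup):

```typescript
/** Per-subscriber price for a shared (scope, cadence) snapshot, in cents.
 *  Generation cost amortizes across subscribers; serving cost
 *  (S3 GET + egress) is paid per subscriber. Rounded up. */
function perSubscriberPriceCents(
  generationCostCents: number,
  servingCostCents: number,
  subscriberCount: number,
): number {
  const rawCents = generationCostCents / subscriberCount + servingCostCents;
  return Math.ceil(rawCents * 1.35);
}

// 4 subscribers to the same weekly statewide export:
perSubscriberPriceCents(1000, 100, 4); // → 473 (= ceil((250 + 100) × 1.35))
```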

3. Scrape funding

The funding mechanism for adding new data. Customer wants Cook County IL → they fund the wiring + initial scrape. After that, they (or anyone else) can fund a refresh subscription.

Three engineering tiers for adapter wiring, published as a fixed catalog:

| Tier | What it covers | Markup base |
|---|---|---|
| Bronze | Existing vendor, new county (e.g., another Tyler PACS county) | half-day eng + initial scrape |
| Silver | New vendor pattern (a vendor we haven't wired yet) | 2-3 days eng + initial scrape |
| Gold | Auth / captcha / human-in-the-loop required (e.g., Tyler EagleWeb counties needing registered-user creds) | 1-2 weeks eng + initial scrape + ongoing creds maintenance |

Refund-or-recompute clause: if a Bronze quote turns out to need Silver/Gold work mid-build, we either refund the funder and stop, or re-quote and ask for incremental funding.

Cadence subscriptions for funded counties:

  • Customer subscribes to a (source, cadence) pair: Cook IL daily, weekly, monthly, etc.
  • Cost = (Lambda invokes + ScraperAPI proxy + RDS writes for that cadence) × 1.35.
  • No exclusivity window. Once a county is wired, the data is available to all keys at the live-API rate. The funder's leverage is setting the refresh cadence, not owning the data. If another customer wants a higher cadence, they pay the marginal delta.

This aligns with the public-utility framing: pay for the work, not for the right to exclude others.

Anti-API-scrape posture

The live API is fragile under sweep traffic. Three layers:

  1. TOS clause at parcelpump.dev/terms: explicit "no using the live API for bulk data extraction. Use the bulk export channel."
  2. Rate limits: per-key hard cap (free 1 RPS, paid scaled with plan). WAF on CloudFront for IP-level abuse.
  3. Pattern detection: a heuristic flags accounts pulling many distinct parcel IDs with low spatial locality in a short window. Soft-throttle + a UI message: "this looks like a bulk extraction pattern. We offer this as an export at /dashboard/exports — switch over and your costs drop ~100x." Carrot, not just stick.
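One way the "low spatial locality" heuristic could look, as a sketch. The thresholds and the bounding-box measure are illustrative assumptions, not the production detector:

```typescript
// Heuristic sketch: flag a key that reads many distinct parcels spread
// over a wide bounding box within one window. Thresholds are illustrative.
interface Read {
  parcelId: string;
  lat: number;
  lng: number;
}

function looksLikeBulkExtraction(
  reads: Read[],
  minDistinct = 500, // many distinct parcels in the window
  minSpreadDeg = 1.0, // bounding-box diagonal in degrees = low locality
): boolean {
  const distinct = new Set(reads.map((r) => r.parcelId));
  if (distinct.size < minDistinct) return false;
  const lats = reads.map((r) => r.lat);
  const lngs = reads.map((r) => r.lng);
  const spread = Math.hypot(
    Math.max(...lats) - Math.min(...lats),
    Math.max(...lngs) - Math.min(...lngs),
  );
  return spread >= minSpreadDeg;
}
```

A map-browsing session reads many parcels too, but they cluster tightly, so the spread check keeps it below the flag line; a county-wide sweep does not.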

County-trust posture

Counties are not targets and not adversaries. parcelpump's posture:

  • Identifiable User-Agent on every scrape: parcelpump/1.0 (+https://parcelpump.dev/for-counties; ops@parcelpump.dev). A county seeing this in their logs knows immediately who we are and how to reach us.
  • /for-counties page explains who we are, what we scrape, our refresh cadences, our rate-limit philosophy (we throttle ourselves to portal-friendly rates), and a contact form for issues.
  • Future: a county-officials registry (gated by .gov email verification) where counties can opt into rate-limit honoring, register a primary contact, and report unwanted third-party scrapers we should help absorb. Deferred until first concrete county engagement; not worth building before then.
  • Possible future product: paid tier for counties to redirect third-party scraping traffic to us. They send wild scrapers our way; we add their portal to our catalog at zero cost to them; the county's portal infra stops getting beaten up. Speculative, but the architecture supports it.

Schema implications

New tables needed before any of the pricing above is real:

  • accounts — billing principal: id, contact, attribution string, credit balance, plan, stripe_customer_id, created_at
  • api_keys — gains account_id (multi-key per account); existing capabilities array stays
  • scrape_funding — funding ledger: account_id, source, kind (wire / refresh / one-shot), amount_cents, aws_cost_cents, proxy_cost_cents, eng_cost_cents, eng_tier, created_at
  • exports — bulk export subscriptions: account_id, scope, format, cadence, last_generated_at, last_s3_key, last_size_bytes, last_cost_cents, status
  • usage_log — per-request: api_key_id, endpoint, status_code, ms, response_bytes, cost_cents, occurred_at
  • sources — gains funded_by_account_id (nullable), wired_at, wiring_cost_cents
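One possible TypeScript mirror of two of the proposed tables, to make the shapes concrete. Field names are taken from the list above; the types and nullability are assumptions, not decided schema:

```typescript
// Sketch only: billing principal and funding-ledger rows as TS shapes.
interface Account {
  id: string;
  contact: string;
  attribution: string;
  credit_balance_cents: number;
  plan: string;
  stripe_customer_id: string | null; // null until Stripe wire-up
  created_at: string;
}

interface ScrapeFunding {
  account_id: string;
  source: string;
  kind: "wire" | "refresh" | "one-shot";
  amount_cents: number;
  aws_cost_cents: number;
  proxy_cost_cents: number;
  eng_cost_cents: number;
  eng_tier: "bronze" | "silver" | "gold" | null; // null for refresh/one-shot
  created_at: string;
}

const funding: ScrapeFunding = {
  account_id: "acct_1",
  source: "cook-il",
  kind: "wire",
  amount_cents: 50000,
  aws_cost_cents: 2000,
  proxy_cost_cents: 1500,
  eng_cost_cents: 33500,
  eng_tier: "bronze",
  created_at: "2026-05-05",
};
```

Keeping aws / proxy / eng cost columns separate (rather than one total) is what lets the /about/cost page break the published bill down by category.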

Open decisions still pending

  • Stripe wire-up: per the original roadmap, defer until 60 days of free usage data. Build the schema + UI to be Stripe-ready; flip the switch later.
  • Free-tier abuse vector: 10K requests/mo free is generous. If we see abuse (e.g., one user creating many accounts), tighten to 1K/mo and require credit card on file.
  • .gov email verification for the county registry: which provider? Probably custom — match a regex + send a verification email. Defer until first county wants in.
  • Bulk export pre-build vs. on-demand thresholds: when does an on-demand request become a pre-built subscription? Probably "if three customers ever ask for the same scope." Codify later.
  • Parcels-as-public-records license clarity: we should publish a "data license" page making explicit that individual parcel records are public records (county provenance) and customers can use them freely; the prohibition is on bulk republication of our compiled dataset. TOS lawyer review territory.

Supersedes

  • docs/pre-publish-roadmap.md §4 (four-tier SaaS pricing model). Marked superseded in that doc; kept for historical context.

Update log

  • 2026-05-05 — initial capture from chat with WM. Cost-plus + 35% + three surfaces + county-trust posture.