[$ xmrhost] _

$ pwd

/playbook/scraping

[$ ] use-case: scraping

// NAME

scraping — web scraping & data extraction servers.

// SYNOPSIS

xmrhost-cli playbook describe --workload=scraping
xmrhost-cli provision --workload=scraping --region=<is|ro>

// TL;DR

$ head -n1 README

// stable-asn vps for ethical crawling: clean ip reputation, generous egress, no per-target rate limits.

// DESCRIPTION

$ man playbook(scraping)

// ASN reputation + clean egress > exotic proxy stacks

Most scraping blocking is not a residential-vs-datacenter question — it is ASN reputation. A VPS in a clean Romanian datacenter ASN routinely outperforms a stale residential proxy because the target site's bot-detection layer has not yet poisoned the /24 the VPS lives in. Cloudflare's bot-management scoring, Datadome, and PerimeterX all weight ASN reputation heavily; a brand-new IP in a low-noise ASN starts with a clean slate.

The technical posture for ethical scraping (robots.txt honored, rate limits respected, no PII without consent) does not need exotic infrastructure: vps-4 with Playwright + Chromium preinstalled handles 90% of the workload. CAPTCHA solvers (2Captcha, CapMonster) integrate at the application layer over the network we offer — there is no CAPTCHA-defeat magic at the host level. Bandwidth is the constraint that bites at scale: scraping pipelines routinely hit 10–30 TB egress / month, which is why the catalog defaults to generous monthly allowances and a no-overage policy on legitimate growth.
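The 10–30 TB figure is easy to sanity-check with back-of-envelope arithmetic. A minimal sketch — the page weight and crawl rate below are illustrative assumptions, not catalog figures:

```python
# Back-of-envelope egress estimate for a steady headless-browser crawl.
# PAGE_KB and PAGES_PER_SEC are illustrative assumptions, not xmrhost numbers.

PAGE_KB = 2_048           # ~2 MB per page once Chromium pulls HTML + JS + images
PAGES_PER_SEC = 5         # a modest, rate-limit-respecting crawl speed
SECONDS_PER_MONTH = 30 * 24 * 3600

def monthly_egress_tb(page_kb: float, pages_per_sec: float) -> float:
    """Estimated egress in TB/month for a continuous crawl."""
    kb_per_month = page_kb * pages_per_sec * SECONDS_PER_MONTH
    return kb_per_month / 1024**3  # KB -> TB

print(f"{monthly_egress_tb(PAGE_KB, PAGES_PER_SEC):.1f} TB/month")
# -> 24.7 TB/month, squarely inside the 10-30 TB range quoted above
```

Even a single mid-size pipeline at these modest assumptions lands in the quoted range, which is why the egress allowance, not CPU, is usually the binding constraint.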

Where scraping targets EU sites containing personal data, GDPR Articles 6 and 14 apply regardless of where the scraper runs — a Romanian VPS does not absolve the operator of a lawful-basis analysis. We do not host scraping operations targeting personally identifiable data without documented consent, and we cooperate with substantiated CFAA-style complaints where they apply (most scraping operations never see one).

// see also

  • Cloudflare — How Bot Management Works (developers.cloudflare.com/bots)
  • GDPR Articles 6 and 14 (Regulation 2016/679)
  • robots.txt — RFC 9309 (IETF, 2022)
  • hiQ v. LinkedIn — 9th Cir. 2019 (CFAA scraping precedent)

// THREAT MODEL + AUP BOUNDARY

$ xmrhost-cli scope --workload=scraping

// the hosting layer is one component of the threat model. what we cover, and what we explicitly don't:

// scope: in

  • Clean ASN reputation against Cloudflare / Datadome / PerimeterX bot-detection scoring
  • Generous monthly egress allowance with no-overage policy on legitimate growth
  • Playwright + Chromium + Puppeteer preinstalled, version-pinned
  • Free IP swap once per month on vps-4 and above for IPs that get burned

// scope: out

  • robots.txt compliance — that is the operator's responsibility, enforced at the application layer
  • GDPR Article 6 / 14 analysis when the scrape touches PII (we are not your DPO)
  • CAPTCHA solving — 2Captcha / CapMonster integrate at the application layer
  • Target-site ToS interpretation — read the precedent yourself (hiQ v. LinkedIn is the start)
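
Since robots.txt handling is explicitly the operator's job, here is a minimal sketch of what it looks like at the application layer, using Python's standard-library parser (RFC 9309 is the current spec for the format). The policy file is inlined so the sketch is self-contained; in production you would point `set_url()` at the live target and call `read()`. The user-agent string is a placeholder.

```python
import urllib.robotparser

# Inlined policy for a self-contained example; real code fetches the live
# file with rp.set_url("https://target/robots.txt") followed by rp.read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

UA = "my-crawler/1.0"  # hypothetical user-agent

print(rp.can_fetch(UA, "https://example.com/public/page"))   # True
print(rp.can_fetch(UA, "https://example.com/private/data"))  # False
print(rp.crawl_delay(UA))                                    # 2 -> sleep this long between requests
```

The `crawl_delay` value feeds straight into the fetch loop's sleep; honoring it is what keeps a crawl on the "ethical" side of the AUP boundary below.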

// AUP boundary

Customers are responsible for compliance with target sites' Terms of Service, robots.txt directives, and applicable computer-misuse and data-protection law (CFAA, GDPR for personal-data scraping, country-specific equivalents). We do not host scraping operations targeting personally identifiable data without consent.

// SEE ALSO

  • playbook — full workload list
  • node — full catalog
  • location — region posture
  • why-monero — billing rationale