
Cloudflare Forces AI Firms to Separate Search Crawlers from Training Bots

New default settings will block mixed-use crawlers from ad-supported sites starting September, pushing the industry toward transparent data practices

Arjun S. Mehta

Staff Writer · Singapore

Jul 2, 2026



A Line in the Sand

Cloudflare has drawn a hard line for AI companies: separate your search crawlers from your training bots, or lose access to millions of ad-supported websites. The infrastructure giant announced Wednesday that starting September 15, its default configurations will automatically block any crawler that mixes traditional search indexing with AI training or agentic capabilities. The policy targets sites displaying advertising and will apply to all new customers, new sites from existing customers, and the entire free-tier user base.
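As a rough illustration, the default described above reduces to a two-part test: does the default cover this site, and does the crawler mix purposes? The Python sketch below models only the rule as reported here. The class names, purpose labels, and fields are hypothetical and are not taken from Cloudflare's configuration or API.

```python
from dataclasses import dataclass

# Purpose labels are illustrative; Cloudflare's actual taxonomy is not
# specified in the announcement as reported.
SEARCH = "search"
TRAINING = "training"
AGENTIC = "agentic"


@dataclass(frozen=True)
class Crawler:
    name: str
    purposes: frozenset  # purposes the operator declares for this bot


@dataclass(frozen=True)
class Site:
    shows_ads: bool
    on_free_tier: bool = False
    # True for a new customer's site or a new site from an existing customer.
    created_after_cutoff: bool = False


def default_applies(site):
    """The default targets ad-supported sites that are new or on the free tier."""
    return site.shows_ads and (site.on_free_tier or site.created_after_cutoff)


def is_mixed_use(crawler):
    """A crawler is mixed-use if it combines search with training or agentic use."""
    return SEARCH in crawler.purposes and bool(crawler.purposes & {TRAINING, AGENTIC})


def blocked_by_default(crawler, site):
    """Model only the mixed-use rule; other bot rules are out of scope here."""
    return default_applies(site) and is_mixed_use(crawler)


bundled = Crawler("bundled-bot", frozenset({SEARCH, TRAINING}))
search_only = Crawler("search-bot", frozenset({SEARCH}))
free_tier_ad_site = Site(shows_ads=True, on_free_tier=True)

bundled_blocked = blocked_by_default(bundled, free_tier_ad_site)          # True
search_only_blocked = blocked_by_default(search_only, free_tier_ad_site)  # False
```

Under these assumptions, a firm that splits one bundled bot into two single-purpose bots keeps search access while leaving the training decision to the publisher.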

The move arrives at a pivotal moment. Non-human traffic recently overtook human browsing for the first time, a milestone that wasn't anticipated until 2027. At DailyTechWire, we've tracked this bot surge across Asia's digital infrastructure, where the crawl load on regional publishers has roughly tripled since early 2025. Cloudflare's intervention reflects a brewing tension: publishers want discoverability without surrendering intellectual property to model training pipelines that generate no revenue in return.

CEO Matthew Prince framed the decision as necessary for a sustainable web ecosystem. The subtext is clear: the current arrangement, where a handful of large AI labs hoover up content under the guise of search functionality, cannot hold. By forcing clarity of intent, Cloudflare is effectively creating two classes of bot access and inviting publishers to price them differently.

The Google Problem

Cloudflare's announcement takes a pointed jab at what it calls "the world's largest search engine," noting that this unnamed giant enjoys access to roughly twice the data of competing AI firms. The reference is transparent: the search leader has historically bundled search indexing and AI feature development into a single crawler, making it difficult for site owners to allow one without enabling the other.

The company in question does offer a separate bot for opting out of training and generative products, but its primary crawler still feeds AI-powered search features including overview summaries and conversational modes. Publishers face a binary choice: accept the full bundle or risk disappearing from search results. Cloudflare's policy shift aims to collapse that false binary by making granular control the default.
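The limits of today's controls can be seen in a short robots.txt check. The sketch below uses Python's standard `urllib.robotparser` and assumes the two tokens involved are `Googlebot` and `Google-Extended`; the announcement as reported names neither, and the example URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Assumed tokens: "Googlebot" for the primary crawler and "Google-Extended"
# for the training opt-out. Neither is named in the announcement as reported.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/news/story.html"  # placeholder URL

# The primary crawler stays allowed so the site remains in search results.
search_allowed = parser.can_fetch("Googlebot", url)          # True

# The separate opt-out token is refused.
training_allowed = parser.can_fetch("Google-Extended", url)  # False
```

Nothing in this file can separate classic indexing from the AI-powered search features fed by the primary crawler, which is the bundle the paragraph above describes.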

For Asia's emerging AI ecosystem, the implications run deeper. Seoul-based search competitors and Jakarta's homegrown language model labs have long complained that dominant Western platforms leverage search monopolies to secure training advantages. Cloudflare's move could level access, but only if regional players adopt transparent crawler practices themselves. The temptation to replicate the bundling strategy will be strong.

Charging for Value, Not Just Volume

Alongside the blocking policy, Cloudflare is evolving its marketplace model from Pay Per Crawl to Pay Per Use. The original concept, launched in 2025, allowed publishers to charge AI companies per fetch request. The updated framework charges when content actually creates value, such as appearing in an AI-generated answer or being retrieved for premium synthesis.

The company's internal data shows that more than half of AI crawler traffic is spent re-fetching pages that haven't changed, a pattern that wastes publisher bandwidth and reveals inefficient training loops. By tying payment to utility rather than volume, Cloudflare hopes to align incentives: AI firms will crawl more selectively, and publishers will earn revenue proportional to their content's contribution to downstream products.
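The difference between the two billing models can be sketched with a toy crawl log. Everything in the Python example below is hypothetical: the log format, the prices (in whole cents), and the function names are illustrative and do not describe how Cloudflare meters traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fetch:
    url: str
    content_hash: str  # fingerprint of the page body at fetch time


def redundant_fetches(fetches):
    """Count fetches whose content matches the previous fetch of the same URL."""
    last_seen = {}
    redundant = 0
    for fetch in fetches:
        if last_seen.get(fetch.url) == fetch.content_hash:
            redundant += 1
        last_seen[fetch.url] = fetch.content_hash
    return redundant


def pay_per_crawl(fetches, price_per_fetch_cents):
    """Volume pricing: every fetch is billed, changed or not."""
    return len(fetches) * price_per_fetch_cents


def pay_per_use(use_count, price_per_use_cents):
    """Utility pricing: only downstream uses (e.g. citations) are billed."""
    return use_count * price_per_use_cents


log = [
    Fetch("/a", "v1"), Fetch("/a", "v1"), Fetch("/a", "v1"),
    Fetch("/b", "v1"), Fetch("/b", "v2"),
]

wasted = redundant_fetches(log)      # 2 of the 5 fetches returned unchanged content
crawl_bill = pay_per_crawl(log, 2)   # 10 cents at a made-up 2 cents per fetch
use_bill = pay_per_use(3, 5)         # 15 cents for 3 uses at a made-up 5 cents each
```

In this toy case the per-fetch bill grows with every redundant request, while the per-use bill depends only on how often the content actually surfaces downstream.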

Two partners are piloting the Pay Per Use system. Ceramic.ai, an AI search platform, will compensate publishers when their material surfaces in results. You.com will pay when accessing premium content behind authentication layers. Both arrangements hinge on Cloudflare's ability to track content flow from origin to output, a non-trivial technical challenge given the opacity of most model inference chains.

Other AI companies can adopt the framework, but participation remains voluntary. The real enforcement lever is the September default change: firms that refuse to unbundle their crawlers will find themselves locked out of a significant slice of the web, at least among Cloudflare's customer base. That footprint includes a substantial portion of Asia-Pacific's small and mid-sized publishers, many of whom lack the resources to negotiate individual licensing deals with AI labs.

Fragmenting the Open Web

Cloudflare's policy accelerates a broader fragmentation. The open web, once indexed uniformly by a handful of search engines, is splintering into access tiers governed by commercial relationships. Publishers with leverage are signing direct licensing agreements with frontier labs. Those without are left to rely on intermediaries like Cloudflare's marketplace or to block AI crawlers entirely.

The risk is a two-speed information economy. Well-resourced outlets in Singapore, Tokyo, and Sydney can command licensing fees; smaller operations in Hanoi, Manila, and Colombo may find themselves either ignored by AI systems or scraped without compensation. Cloudflare's default blocking might protect the latter from exploitative crawling, but it doesn't guarantee they'll earn revenue from legitimate use.

There's also a technical wrinkle. Separating search from training isn't as clean as it sounds. Modern retrieval-augmented generation blurs the line: a query triggers a search, which feeds context into a model, which generates an answer. Is that search, training, or agentic behavior? The answer depends on whether the retrieved content updates model weights, populates a vector database, or simply serves as one-time context. Cloudflare's policy will force AI companies to make those distinctions explicit in their crawler metadata, but verifying compliance will require ongoing monitoring.

What Comes Next

Cloudflare's September 15 deadline gives the industry just under eleven weeks to adapt. Expect a flurry of announcements from AI labs introducing new, purpose-specific crawlers with clear labels. Some will comply in good faith; others will rebrand existing bots with minimal functional change. Publishers will need to audit their Cloudflare settings and decide which crawlers to permit, a task that requires more sophistication than most small teams possess.

The broader question is whether other infrastructure providers will follow. If Cloudflare's stance becomes an industry standard, the AI training landscape shifts significantly. If it remains an outlier, labs may simply route around it, sourcing content from platforms that don't enforce separation. Asia's cloud and CDN providers, many of which serve dual roles as AI developers and infrastructure vendors, will face pressure to pick a side.

For now, Cloudflare is betting that transparency and compensation can coexist with the scale AI companies need. The alternative, a web where publishers block bots en masse, serves no one. But threading that needle requires trust, and trust in the AI industry's data practices is in short supply. September will reveal whether the deadline prompts genuine reform or just more sophisticated evasion.

