Crawler Hints: Reduce The Environmental Impact Of Web Searches With Cloudflare

Cloudflare is known for its creativity and for ground-breaking initiatives that push the Internet forward. During Impact Week, we wanted to apply that same spirit of innovation to the Internet’s effect on the environment. It is easy to assume that the only role tech can play here is mitigation: climate credits, carbon offsets, and the like. Those are important steps, but we wanted to go further and reduce the harm at its source. So we asked: how can the Internet as a whole use less energy, and how can we be more thoughtful about how we consume computing resources in the first place?

Cloudflare has a global view of Internet traffic: we see the traffic flowing to and from the more than 1 in 6 websites that use our network. While most people think of Internet usage as a very human activity, automated systems actually generate close to 50% of all traffic on the global network.

We examined this automated traffic, generated by so-called “bots,” to understand its environmental impact. Much of this bot traffic is malicious. By shielding our customers from it, Cloudflare reduces their environmental footprint: if we did not stop these bots, they would hit origin databases and force dynamic pages to be generated on services far less efficient than Cloudflare’s network.

We even went a step further and pledged to plant trees to offset the carbon cost of our bot mitigation services. We would love to simply tell the bad actors to stop running their bots and think of the environment, but we doubt they would listen, so instead we aim to block them as thoroughly as we can.

There is another kind of bot, however, that we do not want to go away: the good bots that index the web. These trusted bots account for more than 5% of all Internet traffic, and the bulk of that traffic comes from search engine crawlers, which are essential to making the web navigable.

Large-Scale Opportunities and Large-Scale Issues

Web search still feels like magic. Type a query into a box on Google, Bing, Yandex, or Baidu, and you instantly get a list of websites that contain what you are looking for. To pull off this magic, search engines must crawl the web and, roughly speaking, keep a copy of its contents stored and organized on their own servers, ready to be queried at a moment’s notice.

The companies that run search engines have worked hard on efficiency, pushing the limits of server and data center performance. But one area remains glaringly wasteful: excessive crawl.

At Cloudflare, we see traffic from all the major search crawlers. For the past year we have been studying how often these trusted bots revisit a page that has not changed since their last visit. Every one of those visits is wasted effort. Unfortunately, our analysis suggests that 53% of this good-bot traffic is wasted.

The Boston Consulting Group estimates that running the Internet accounts for 2% of global carbon emissions, or roughly 1 billion metric tonnes per year. If good bots make up 5% of all Internet traffic, and 53% of their crawling is wasted, then curbing excessive crawl could save as much as 26 million tonnes of carbon per year. According to the U.S. Environmental Protection Agency, that is equivalent to planting 31 million acres of forest, permanently shutting down 6 coal-fired power plants, or taking 5.5 million cars off the road.
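
(To spell out the back-of-the-envelope arithmetic behind that estimate: 1 billion tonnes × 5% good-bot share × 53% wasted crawl ≈ 26.5 million tonnes per year.)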

Of course, it is not quite that simple. But suffice it to say that if we can make sure search engines crawl a page only when it has actually changed, we have a significant opportunity to reduce the Internet’s environmental cost.

Since spotting this problem, we have been talking with the largest operators of good bots to see if we can solve it together.

Crawler Hints

Today, we are thrilled to introduce Crawler Hints. Crawler Hints gives search engine crawlers high-quality data about when content has changed on sites using Cloudflare, so they can time their crawls precisely, avoid wasteful ones, and generally cut the resources consumed at customer origins, on crawler infrastructure, and on Cloudflare’s own infrastructure. And as icing on the cake: because crawlers now know when content is fresh, the search experiences powered by these “good bots” will improve, giving Internet users more relevant and useful results. Crawler Hints is a win for the Internet and a win for the Internet’s energy footprint.

By giving bot owners an additional signal that tells them when content has been added to or changed on a site, rather than relying on preferences or past changes that may not reflect a site’s true update cadence, we hope to make crawling a little more tractable.

How does it work?

At its simplest, we want a way to proactively tell a search engine that a page has changed, rather than waiting for the search engine to discover it. In fact, search engines already have a few ways to be notified when a page, or a set of pages, changes.

For example, if you ask Google to recrawl a website, they will do so within “a few days to a few weeks.”

If you wanted to notify Google promptly every time a page changed, you would also have to keep track of when Google last crawled your site. You would not want to ping Google on every change: there is a delay between requesting a recrawl and the crawler actually showing up, so you could end up telling Google to come back while it is already on its way.

On top of that, there is not just one search engine, and new crawlers appear all the time. Trying to keep every search engine up to date as your site changes would be messy and very hard, partly because this model never explicitly says what changed or when.

This model simply does not work well, and it is one reason search engine crawlers end up recrawling websites over and over, whether or not there is anything new to find.

There is, however, a great mechanism already in place that search engines use to understand the structure of websites: the sitemap. The sitemap is a well-defined, open protocol for telling a crawler which pages a site contains, when they last changed, and how often they are expected to change.

Sitemaps have some restrictions (on the number of URLs and bytes per file), though the protocol does provide a mechanism for massive sites with millions of URLs. But generating sitemaps can be complex and usually requires specialized tools, and getting a consistent, up-to-date sitemap for a website (especially one built from multiple technologies) can be very hard.
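
As a concrete illustration, here is a minimal sketch of what a single entry looks like under the open sitemaps.org protocol, assembled with a few lines of Python. The URL, change frequency, and priority values are hypothetical; only the XML structure and field names come from the protocol itself.

```python
from datetime import date

# A minimal sitemap with a single entry, following the open sitemaps.org schema.
# The URL, changefreq, and priority below are hypothetical examples.
entry = {
    "loc": "https://www.example.com/blog/post-1",  # page address
    "lastmod": date.today().isoformat(),           # when the page last changed
    "changefreq": "weekly",                        # how often it is expected to change
    "priority": "0.8",                             # relative importance within this site
}

sitemap = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    "  <url>\n"
    + "".join(f"    <{k}>{v}</{k}>\n" for k, v in entry.items())
    + "  </url>\n"
      "</urlset>\n"
)

print(sitemap)
```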

This is where Cloudflare comes in. Because we see which pages our customers are serving and know which of them have changed (whether by hash value or timestamp), we can automatically build an exhaustive record of which pages changed and when.

We can also record when each search crawler last visited a particular page, so we can serve up only the content that has changed since that visit. Because we can track this per search engine, the approach can be remarkably efficient: every search engine gets its own automatically updated sitemap, or list of URLs, containing only the content that has changed since its last visit.
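
To make the idea concrete, here is a minimal sketch in Python of how such per-crawler change tracking could work. It is an illustrative model under our own assumptions, not Cloudflare’s actual implementation: pages are fingerprinted by content hash, each crawler’s last request is remembered, and a crawler is handed only the URLs whose content changed since it last asked.

```python
import hashlib
import time

class ChangeTracker:
    """Illustrative sketch: track page changes and serve per-crawler deltas."""

    def __init__(self):
        self.page_versions = {}   # url -> (content_hash, last_changed_timestamp)
        self.crawler_visits = {}  # crawler_name -> timestamp of last delta served

    def record_response(self, url: str, body: bytes) -> None:
        """Call whenever a page is served; note the time if its content changed."""
        digest = hashlib.sha256(body).hexdigest()
        previous = self.page_versions.get(url)
        if previous is None or previous[0] != digest:
            self.page_versions[url] = (digest, time.time())

    def changed_urls_for(self, crawler: str) -> list[str]:
        """Return only the URLs that changed since this crawler last asked."""
        since = self.crawler_visits.get(crawler, 0.0)
        changed = [url for url, (_, ts) in self.page_versions.items() if ts > since]
        self.crawler_visits[crawler] = time.time()
        return changed

# Hypothetical usage:
tracker = ChangeTracker()
tracker.record_response("https://example.com/", b"<html>v1</html>")
print(tracker.changed_urls_for("examplebot"))  # ['https://example.com/']
tracker.record_response("https://example.com/", b"<html>v1</html>")  # unchanged
print(tracker.changed_urls_for("examplebot"))  # []
```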

And all of this puts no extra burden on the origin website. As soon as a site changes, Cloudflare can notify search engines almost immediately and show them exactly what has changed since their last visit.

The sitemaps protocol also includes a priority for each page. Because we know how often a page is visited, we can tell search engines that a heavily visited page may be more important to include in the index than one that is rarely seen.

The protocol is open and does not depend on Cloudflare in any way; in fact, we sincerely hope that every host and Cloudflare-like service will consider adopting it. There are still a few technicalities to iron out, such as how a search engine should identify itself to retrieve its own set of URLs, and we intend to keep working with the hosting and search communities to refine the protocol and make it more efficient. Our goal is for search engines to hold the freshest possible indexes, for content creators to have their new content indexed optimally, and for a significant chunk of needless Internet traffic, and its associated carbon cost, to simply disappear.

Conclusion

Crawler Hints does not just benefit search engines. It ensures that search engines and other bot-powered experiences always see the freshest version of your content, which means happier users and, ultimately, better search rankings for our customers and origin owners. Crawler Hints also means less traffic to your origin, better resource utilization, and a smaller carbon footprint. And because bots will no longer be competing with your real customers for resources, your website’s performance will improve as well.

What about Internet users? Whether you are using bot-powered experiences such as search engines or pricing tools, which we all rely on daily whether we realize it or not, those experiences will now return more useful results, because Cloudflare has told the bot owners when it is time to refresh their data.

Finally, and this is the part that excites us most for the Internet as a whole, it will be greener: less wasted crawl means less wasted energy online.

A triple win. These are the kinds of outcomes that get us up in the morning, and exactly how we see ourselves helping to build a better Internet.

This is an exciting problem to tackle, and we look forward to working with others who want to make the Internet more efficient and less wasteful of energy. We hope to have more to share on this front soon. If you operate a bot that relies on fresh content and would like to partner with us on this effort, please email crawlerhints@cloudflare.com.

“Yandex values long-term sustainability over short-term success and, alongside the rest of the world, is committed to mitigating climate change. As part of our commitment to quality of service and user experience, Yandex focuses on keeping search results relevant and useful. We look forward to working with Cloudflare to make good bots across the Internet more efficient; we believe this solution will improve the accuracy of the results we return and so strengthen search quality.”
— Yandex

“DuckDuckGo is supportive of anything that makes search better, protects user privacy, and is better for the environment. We are excited to work with Cloudflare on this initiative.”
— Gabriel Weinberg, DuckDuckGo’s CEO and founder.

“The Internet Archive’s Wayback Machine has partnered with Cloudflare for about a year, powering its ‘Always Online’ service and, in turn, helping the Internet Archive discover high-quality Web URLs to archive. That win-win partnership has helped us advance our mission of preserving large portions of the public Web and making them accessible to future generations. Building on it, the Internet Archive is excited to start using this new ‘Crawler Hints’ service. We expect it will help us do more with less: focusing our server and bandwidth resources on Web pages that have changed, and spending less on pages that have not. We expect this to have a material impact on our work. The fact that the service also promises to reduce the overall carbon footprint of the Web makes it especially attractive, and we are proud to be part of the effort.”
— Mark Graham, Director of the Internet Archive’s Wayback Machine
