Finding Dead Websites

https://news.ycombinator.com/rss Hits: 7
Summary

As some of the work planned for Marginalia Search this year has been progressing a bit faster than anticipated, there was time to implement an unplanned change.

This post details the implementation of a system for detecting when servers are online, to avoid serving dead links and improve data quality, and for detecting when websites undergo significant changes, including ownership transfers and parking.

Feature Rationale

Availability detection is useful not just for filtering dead links out of the search results, but also for informing the crawler that it should stop trying to reach a dead domain, as well as a host of other things. Likewise, ownership change detection is relevant to the crawler, which might be told to do a clean recrawl instead of an incremental one, and it could also enter as a factor in domain ranking.

Since misbehaving bots spamming web servers with unwanted requests are a very big and hot issue, and reputation is incredibly important for a small search engine like Marginalia, the feature is designed to rely, as far as possible, solely on HEAD requests, normally sent 1-2 times per day per domain, along with DNS queries at a similar interval.

So how much information can you really get from a HEAD request and a DNS query? It's possible to extract a fairly large number of factors, many of which only weakly indicate an ownership change on their own, but which taken together paint a fairly lucid picture.

The ownership change detection looks at DNS history, details of the certificate, the security posture of the website, and other headers such as X-Powered-By and Server. One or a few of these changing is normal, but many of them changing at the same time indicates, at the very least, a major redesign or infrastructure overhaul.

Since just about the only thing we do not use to detect changes and uptime is ICMP ping, the name ping-process was chosen for the process responsible for availability and change monitoring.

Data Representation

The data i...
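
The reasoning in the Feature Rationale section, that one changed signal is normal while many changing at once is suspicious, can be illustrated with a small Python sketch. This is not Marginalia's implementation: the `Snapshot` fields, the helper names, and the threshold of 3 are all assumptions made for illustration, and the real ping-process may weigh its signals quite differently.

```python
import http.client
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical per-domain record of weak signals (field names are illustrative)."""
    dns_a_records: frozenset = frozenset()
    cert_issuer: Optional[str] = None
    cert_subject: Optional[str] = None
    server: Optional[str] = None        # value of the Server header
    x_powered_by: Optional[str] = None  # value of the X-Powered-By header
    hsts: Optional[str] = None          # value of Strict-Transport-Security


def head_headers(host: str, timeout: float = 10.0) -> dict:
    """Send one HEAD request for "/" over HTTPS and return the headers.

    Header names are lower-cased. Redirects are not followed, so exactly
    one request is sent. Connection errors propagate to the caller, which
    can treat them as an availability signal.
    """
    conn = http.client.HTTPSConnection(host, timeout=timeout)
    try:
        conn.request("HEAD", "/")
        resp = conn.getresponse()
        return {name.lower(): value for name, value in resp.getheaders()}
    finally:
        conn.close()


def changed_signals(old: Snapshot, new: Snapshot) -> list:
    """Return the names of the fields that differ between two snapshots."""
    return [f.name for f in fields(Snapshot)
            if getattr(old, f.name) != getattr(new, f.name)]


def looks_like_major_change(old: Snapshot, new: Snapshot, threshold: int = 3) -> bool:
    """Flag a domain only when several signals changed at once.

    The threshold of 3 is an arbitrary assumption for this sketch.
    """
    return len(changed_signals(old, new)) >= threshold
```

A caller would build a `Snapshot` from `head_headers("example.com")` plus a DNS lookup and certificate details, store it, and compare it with the previous one on the next run. Counting changed fields is the simplest possible scoring; weighting individual signals would be a natural refinement.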

First seen: 2025-06-19 14:02

Last seen: 2025-06-19 20:11