Imagine there's a new Chinese restaurant in town. It serves a wide variety of Chinese dishes. The restaurant employs two waiters, Zhāng and Lǐ. Zhāng speaks English very well, while Lǐ only understands a few words. If you ask Zhāng for a dish that the restaurant does not offer, he will explain that he cannot bring you that dish and will suggest some alternatives. Lǐ, on the other hand, takes a shortcut: whenever he does not understand what a customer wants to eat, he simply brings them wonton noodles.
Image courtesy: CC BY 松林Ｌ
You have never eaten Chinese food before and have no idea what Chinese dishes look and taste like. You go to the new restaurant and order one of the dishes. How can you tell whether you got the right dish, or whether the waiter just brought you any dish because he did not understand you?
Yes, this story is a little absurd. Nevertheless, the situation is not that far-fetched. Imagine you have a list of URLs and want to know whether they are still "alive". The simplest approach is to request the page behind each URL and look at the HTTP status code. Just like Zhāng, most web servers will answer with the HTTP status code 404 to tell you that the page you requested no longer exists.
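This first pass can be sketched in a few lines of Python using only the standard library; the `status_of` and `looks_dead` helpers are illustrative names, and the sketch assumes that any 4xx or 5xx response marks a dead link:

```python
import urllib.error
import urllib.request

def status_of(url, timeout=10):
    """Fetch a URL and return its HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises for 4xx/5xx; the code is still what we want.
        return error.code

def looks_dead(status):
    """Naive rule: treat any 4xx or 5xx response as a dead link."""
    return status >= 400
```

As the next paragraph shows, this rule is exactly what breaks down in practice.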
Unfortunately, the HTTP status code is not always reliable. Sometimes a status code in the 3xx range is returned when a non-existing page is requested. The 3xx class denotes redirection to another page, just like the way Lǐ serves you his "fallback menu" when he doesn't understand you. This policy may make sense for some websites, but not in advertising. Imagine that you're looking for the specific product Xyz and you're redirected to a generic page about the product series X. That is a sensible choice for normal, organic traffic, but a bad one when you click on an ad. In that case the marketing team should be notified to update the ad's URL, since most likely the website structure was changed.
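To even notice such redirects, a checker must not follow them silently. One way to surface 3xx responses with Python's standard library is a redirect handler that refuses to redirect; this is a sketch of the idea, not necessarily the exact mechanism our checker uses:

```python
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Refuse to follow redirects so 3xx responses surface as errors."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None makes urllib raise HTTPError for the 3xx,
        # instead of transparently fetching the redirect target.
        return None

opener = urllib.request.build_opener(NoRedirect)
# opener.open(url) now raises urllib.error.HTTPError for any 3xx
# response, letting us log the redirect instead of silently following it.
```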
Faced with this dilemma, you come up with a solution. You go to the restaurant and place an order that you are sure they won't be able to fulfill, for example pasta al pomodoro. Of course a Chinese restaurant does not serve Italian food, therefore Lǐ will bring you wonton noodles instead. You take a picture of the dish, eat it and leave. The next time you're at the restaurant and order something unknown to you, you can simply compare the served dish with the picture you took before. If they both have a reasonable level of similarity, you can be pretty sure that Lǐ just served you his "fallback dish".
Translated to our URL status checking, we send a request to every site and ask for the pages that we are interested in. We save the responses for later. Then we ask for a page that we are sure doesn't exist, for example by appending a UUID to an existing URL, and record how the website responds. Finally, we compare the answers between the two sets. If some of the pages that we are interested in look suspiciously similar to the pages that do not exist, we have a strong indicator of an invalid page that conventional methods (looking only at the HTTP status code) cannot detect.
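The steps above can be sketched as follows; `probe_url`, the `similarity` helper and the 0.9 threshold are illustrative choices for this post, not tuned values from our checker:

```python
import difflib
import uuid

def probe_url(url):
    """Build a URL that almost certainly does not exist on the site
    by appending a random UUID to the original path."""
    return url.rstrip("/") + "/" + uuid.uuid4().hex

def similarity(page_a, page_b):
    """Ratio in [0, 1]; values near 1.0 mean near-identical pages."""
    return difflib.SequenceMatcher(None, page_a, page_b).ratio()

def looks_like_fallback(page, fallback_page, threshold=0.9):
    """If a supposedly live page resembles the site's response to a
    nonexistent page, flag it as a likely dead link."""
    return similarity(page, fallback_page) >= threshold
```

In practice you would fetch both `url` and `probe_url(url)`, then run `looks_like_fallback` on the two response bodies.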
To be efficient, the system ignores many of the formalities: logos, images and other parts of the page that do not affect the comparison are stripped out. We focus on what matters and optimize accordingly. We are also polite enough to state that our requests are made by a robot, so people should not worry about seeing extra "suspicious" traffic on their web server (unless they also worry about Google discovering their website).
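Declaring the robot is as simple as setting a User-Agent header; the bot name and info URL below are made up for illustration:

```python
import urllib.request

# A hypothetical bot identity; identify yourself honestly so server
# operators can recognize (and, if they wish, whitelist) the traffic.
USER_AGENT = "DeadLinkChecker/1.0 (+https://example.com/bot-info)"

def polite_request(url):
    """Build a request that openly declares itself as a robot."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
```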
In software engineering, there is always room for improvement. At the moment we simply use Python's difflib to calculate the similarity of two pages. In the future, we could use other methods such as the Levenshtein distance or cosine similarity to compare pages. Furthermore, the downloader script is currently a monolithic program that uses thread pools to download the different pages. It could be made more reliable and scalable by creating isolated download jobs and using a distributed task queue like Celery.
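For instance, a cosine similarity over simple word counts fits in a few lines; this bag-of-words variant is just one possible formulation, shown here as a sketch:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity over word-count vectors: 1.0 for identical
    word distributions, 0.0 for texts sharing no words at all."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Unlike character-level diffing, this measure is insensitive to word order, which can be an advantage or a drawback depending on how the fallback pages differ.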
If you have remaining questions or know of other methods to discover dead links, please leave a comment below!
(Credits: This article and the method it describes are based on the work of Dimitris Leventeas, who left us to finish his Master's Thesis at Google. We wish you much success in your future career!)