Google.be
There's much ado about the Google.be case: two Belgian newspapers won a lawsuit against Google for archiving their news stories, which conflicted with their business model of charging a subscription for access to older archived articles. The court ruled in favour of the newspapers and ordered Google.be to publish a legal notice about the ruling on its front page and to remove the articles from its archives.
A lot has been said about this case: that it could have been avoided with robots.txt, that it was all about the money, etc. Francois Planque has a nice roundup. The case is IMO quite stupid, will cost the two Belgian newspapers quite a few visitors (the number of people who reach sites through Google is far from negligible) and shows that courts still struggle to understand how the internet works. If I download a web page, it gets copied in at least three places: my browser cache, my local Squid proxy cache, and probably the cache of my ISP's transparent proxy. Must all of those copies be deleted too once the page has been archived on the original site?
However, I personally don't think this could have been resolved that easily with a robots.txt file: robots.txt is only a 'recommendation' telling spiders what to index or not, and many bots don't even read it. I think a lot of website builders simply aren't aware of the proper way to tell caches how to handle their content: the Expires header (RFC 2616, section 14.21) and the Cache-Control header (section 14.9). The Belgian newspapers could easily have set an expiration period of, say, 24 hours and still have kept a working business model.
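To make this concrete, here is a minimal sketch in Python (WSGI) of the kind of response the newspapers' servers could have sent. The article_app function, the port and the article body are purely illustrative assumptions of mine, not anything the papers actually run; the point is simply the two headers.

    # Minimal sketch: serve an article with a 24-hour cache lifetime,
    # using the Cache-Control (RFC 2616, 14.9) and Expires (14.21) headers.
    from wsgiref.simple_server import make_server
    from wsgiref.handlers import format_date_time
    import time

    CACHE_LIFETIME = 24 * 60 * 60  # 24 hours, in seconds

    def article_app(environ, start_response):
        # Illustrative article body; in reality this would come from the CMS.
        body = b"<html><body>Today's article...</body></html>"
        headers = [
            ("Content-Type", "text/html; charset=utf-8"),
            # Any cache (browser, proxy, search engine) may reuse this
            # response for at most CACHE_LIFETIME seconds.
            ("Cache-Control", "public, max-age=%d" % CACHE_LIFETIME),
            # Absolute expiry date, for older HTTP/1.0 caches.
            ("Expires", format_date_time(time.time() + CACHE_LIFETIME)),
        ]
        start_response("200 OK", headers)
        return [body]

    if __name__ == "__main__":
        make_server("", 8000, article_app).serve_forever()

With headers like these, Google's cache, my Squid proxy and my browser would all be told to throw the page away after a day, which is exactly where the paid archive begins.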