Recently I had the displeasure of participating in an odd PR fire drill. I had scheduled a release to go out several days later, pending final customer review. Because of our structure, I don't handle the wire service myself but rely on someone else to help in that area. When we can, we queue up the release with the wire service and with our Web team early.
Imagine my dismay when the customer alerted us that they had seen it on the Web... five days before it was supposed to go live. My manager and I spent a good 30 hours or so tracking down what had happened. At first, because the stories that appeared had originated with a UK wire service, we figured that one of our EMEA teams was trigger-happy and had sent it out prematurely. But no; after going through all the channels we were still a little stumped: no one appeared to have sent it out. The wire service was closed for a UK bank holiday and unresponsive (which was odd to me... US wires are reachable 24/7, 365).
Finally the answer surfaced. It turned out that a Web glitch of some sort had pushed the release to our own company site early, where it appeared for a few hours. When the error was caught, the release was immediately removed. Unfortunately, in those hours, the damage had been done.
Most people in our own company never noticed. Editors didn't pick up the news early (then again, it wasn't material news; if it had been, that might not have been the case). The customer might never have realized, except for the little job of scraping that the UK wire service did in those hours when the release was live. The service pulled the release off our site and pushed it out over its own wire, which meant pickup by sites that rely on wire services for their news. Sites like Yahoo and Google Finance, TradingMarkets.com and more. The kind of Web sites that executives and investors read every day. The kind of Web sites that are sure to be noticed by all the wrong people when you are trying to fix a mistake.
But what exactly is scraping? And why do companies and individuals do it? Wikipedia describes it thus:
Web scraping (sometimes called harvesting) generically describes any of various means to extract content from a website over HTTP for the purpose of transforming that content into another format suitable for use in another context. Those who scrape websites may wish to store the information in their own databases or manipulate the data within a spreadsheet (often, spreadsheets are only able to contain a fraction of the data scraped). Others may utilize data extraction techniques as a means of obtaining the most recent data possible, particularly when working with information subject to frequent changes. Investors analyzing stock prices, realtors researching home listings, meteorologists studying weather, or insurance salespeople following insurance prices are a few individuals who might fit this category of users of frequently updated data.
Access to certain information may also provide users with strategic advantage in business. Attorneys might wish to scrape arrest records from county courthouses in search of potential clients. Businesses that know the locations of competitors can make better decisions about where to focus further growth. Another common, but controversial use of information taken from websites is reposting scraped data to other sites.
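To make that concrete, here is about the simplest possible scraper; a minimal Python sketch using only the standard library, not anything any real wire service necessarily runs. The URL and the choice of heading tags are hypothetical stand-ins for whatever content a scraper is after.

```python
# Minimal scraping sketch: fetch a page over HTTP and pull out its headline
# text. The URL and the h1/h2 tag choices are made-up examples.
from urllib.request import urlopen
from html.parser import HTMLParser

class HeadlineParser(HTMLParser):
    """Collects the text found inside <h1> and <h2> tags."""
    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2"):
            self.in_heading = True

    def handle_endtag(self, tag):
        if tag in ("h1", "h2"):
            self.in_heading = False

    def handle_data(self, data):
        if self.in_heading and data.strip():
            self.headlines.append(data.strip())

# Hypothetical target page standing in for any public pressroom.
html = urlopen("http://example.com/pressroom").read().decode("utf-8", "replace")
parser = HeadlineParser()
parser.feed(html)
for headline in parser.headlines:
    print(headline)  # the scraper can now store, repost, or reformat these
```

That's all it takes: one HTTP fetch and a little parsing. Which is why, once our release sat on a public page for a few hours, it was trivial for the wire service to grab it.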
Have you ever been to one of those blogs where there seems to be content, but it's really just regurgitated from other places? Most likely it was scraped. Like this one, which snagged some strange little portion of my post on books. At least in that case it linked back to me. I'm rather loath to link to it (I denied the ping on my blog already) but, for the sake of explanation, there you go.
Scraping helps to raise Web rank for those sites... if they link back to my post it may show up as a pingback, which in turn is another link back. And savvy marketers know that Google rank is raised through inbound links: the more sites you have linking in, the higher your ranking. In many cases it's about pushing advertising on a site. But in some cases, like the one above that scraped my book post, I can't for the life of me figure out the purpose. Anyone care to enlighten me? My guess is that they plan to implement Google AdSense for revenue, or they used to have ads and the ads were yanked.
In the case of the UK press wire, I suspect they are working to spread their name across the Web in conjunction with as many press releases as possible. The more people see their name attached to an announcement, the more popular the service appears (we humans are rather sheeplike in that sense, automatically assuming that popularity is an indicator of the level of service; we all covet what our neighbor has). The more their name is out there (especially on "press releases" that appear to originate with big-brand companies), the more likely people are to recognize them and come back to pay for services. Smart, but VERY shady in my opinion.
When someone else scrapes your Web site, it can help raise awareness of your own site. In the case of that wire service picking up our news, it helped spread the information to a variety of places we may not have reached through our paid subscription to BusinessWire. The more awareness of a legitimate press release, the better, in my opinion.
However, you may not want all that extra dissemination. If so, what can you do to keep your Web site from being scraped? Again, Wikipedia lists a few measures that Web masters can take (I've sketched a couple of them in code after the list):
Technical measures to stop bots
A web master can use various measures to stop or slow a bot. Some techniques include:
- Blocking an IP address. This will also block all browsing from that address.
- If the bot is well behaved, it will adhere to entries added to robots.txt. You can stop Google and other well-behaved bots this way.
- Sometimes bots declare who they are; well-behaved ones do (for example, 'googlebot'), and they can be blocked on that basis. Unfortunately, malicious bots may claim to be a normal browser.
- Bots can be blocked by excess traffic monitoring.
- Bots can be blocked with tools to verify that it is a real person accessing the site, such as the CAPTCHA project.
- Sometimes bots can be blocked with carefully crafted JavaScript.
- Bots can be located with a honeypot or other method that identifies the IP addresses of automated crawlers, and blocked accordingly.
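For the curious, here is roughly what two of those measures look like in practice; a minimal Python sketch, not production code, combining user-agent blocking with excess-traffic monitoring. The blocked-agent list, rate limit, and port are all made-up examples you would tune for your own site.

```python
# Sketch of two anti-bot measures: rejecting declared bad user agents and
# throttling any single IP that makes too many requests per time window.
import time
from collections import defaultdict, deque

BLOCKED_AGENTS = ("badbot", "evilscraper")  # hypothetical offenders
MAX_REQUESTS = 30                           # allowed hits per WINDOW, per IP
WINDOW = 60.0                               # seconds

hits = defaultdict(deque)                   # IP -> recent request timestamps

def is_allowed(ip, user_agent):
    """Return False if the request looks like an unwanted bot."""
    ua = (user_agent or "").lower()
    if any(bad in ua for bad in BLOCKED_AGENTS):
        return False
    now = time.time()
    recent = hits[ip]
    while recent and now - recent[0] > WINDOW:  # drop stale timestamps
        recent.popleft()
    recent.append(now)
    return len(recent) <= MAX_REQUESTS          # too many hits = likely a bot

def app(environ, start_response):
    """WSGI app that serves content only to requests passing both checks."""
    ip = environ.get("REMOTE_ADDR", "")
    ua = environ.get("HTTP_USER_AGENT", "")
    if not is_allowed(ip, ua):
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Automated access denied.\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello, human reader.\n"]

if __name__ == "__main__":
    from wsgiref.simple_server import make_server
    make_server("", 8000, app).serve_forever()
```

In real deployments this logic usually lives in the Web server or a firewall rule rather than the application itself, and a determined scraper can of course spoof its user agent, which is exactly why the traffic-monitoring half matters.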
Additional resources:
- What's with all those Spam ping-bots?
- Catching Unwanted Spiders and Content Scraping Bots in ASP.NET
- Stop RSS Feed Scraping (Wordpress Plugin)
- Stop Rogue Web Bots from Eating BandWidth and Stealing Content (2005 but seems relevant)
Scraping is sometimes put to other uses, as described in this recent Wired article. Mashups are one way bots can pull information from a Web site and reformat it so the information is easier to find and navigate. Users may benefit from this sort of bot/scraping usage, but corporations are faced with bandwidth costs, copyright infringement and, more importantly, lost revenue.
Overall, though, this is a great lesson for companies: double-check the methods you use to make sure releases don't land on your Web site before you want them to. Once they hit the Web site, you may no longer have control over the material. Serious food for thought, huh?