LinkedIn Explains Data Scraping Amid Reports of More Data Hacks and Breaches

Over the past few months, there have been various reports of significant LinkedIn data hacks, with huge databases of user info being sold on the dark web, available to the highest bidder.

Back in April, Cyber News reported that personal data scraped from 500 million LinkedIn users was being made available for sale on various hacking forums, while just last month, another set, reportedly incorporating info from 700 million LinkedIn profiles, also became available online.

In each case, LinkedIn has denied that the data sets indicate a breach of its security, pointing instead to 'data scraping' as the culprit: the (mostly legal) process of gathering publicly available info from platforms at scale, then combining it with material from other sources to build larger data sets.

As LinkedIn explained in response to the most recent reported leak:

"Our teams have investigated a set of alleged LinkedIn data that has been posted for sale. We want to be clear that this is not a data breach and no private LinkedIn member data was exposed. Our initial investigation has found that this data was scraped from LinkedIn, and other various websites, and includes the same data reported earlier this year in our April 2021 scraping update."

Yet despite these explanations, a level of user angst remains. That's why today, as part of its effort to provide more context on what has actually occurred and what it's doing about it, LinkedIn has posted an overview of how data scraping works, and what users can do to better protect their LinkedIn profiles in future.

As per LinkedIn:

"Scraping has been around since the start of the internet, but it’s grown dramatically in scale and sophistication. Today, the scraping we hear most about is unauthorized scraping, which uses code and automated collection methods to make (up to) thousands of queries per second and evade technical blocks, in order to take data without permission. Scraped data can be gathered from multiple sites, combined, and sold in large batches, to be used for phishing and other campaigns designed to trick you into sharing private information."

LinkedIn has been working to stop third parties from scraping its user data for years, even heading to the Supreme Court to stop one specific business from gathering public info from LinkedIn profiles for its own purposes. But that case, thus far, has not gone in LinkedIn's favor - so even if it wanted to block data scraping entirely, it legally can't, which, in some ways, limits its capacity for response.

A key consideration within this is how much data LinkedIn makes publicly available. LinkedIn could further limit the ways in which user info can be accessed, which would also limit scraping, but that would additionally reduce discovery in the app, in search engines, and via other means, which would restrict the broader utility of the platform.

For example, LinkedIn currently displays your name and job title to all searchers, unless you've made your profile private. That data is also accessible to search engines, which helps to boost discovery. LinkedIn could limit that further, but being found in relevant searches, on and off platform, is a key value proposition of the app, so it needs to keep a level of that info accessible to users and search tools.

As such, in some ways, it's stuck in between, as it works to manage how much profile data it makes publicly available, and how much it hides behind privacy settings. But within that, users do also have a choice as to how much of their personal info they make publicly accessible.

"Spend some time looking at what info you’ve added, from contact details to work history, and get familiar with your settings. In addition, take a look at your public profile page, to understand what information might be public and ensure it’s exactly what you want to be viewable to search engines and other off-LinkedIn services. You can choose to limit or adjust choices if you’d like."

LinkedIn does note that unauthorized data scraping is in breach of its terms of service, and that it has processes in place to detect and protect against it.

But even then, unauthorized scraping does not constitute a breach or a 'hack'.

"Scraping does not mean an attacker has been able to get inside secure systems, subvert firewalls or access protected network information. Unauthorized scraping can mean that bad actors can collect a lot of data and use it in ways that you didn’t expect."

LinkedIn uses bot detection tools and rate limits to restrict such activity, but the key point LinkedIn is seeking to highlight is that these reported leaks are not the result of hacking or data breaches, as such. Users can further limit the data they share to reduce their exposure, but scraping, in some form, will likely always exist.
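LinkedIn hasn't detailed how its rate limits actually work, but for a rough sense of the general idea, here's a minimal sketch of one common approach, a per-client 'token bucket'. Everything in it (the RateLimiter class, the capacity and refill_per_sec parameters, the client IDs) is hypothetical and purely illustrative, not a description of LinkedIn's systems.

```python
import time


class RateLimiter:
    """Hypothetical per-client token bucket (illustrative only)."""

    def __init__(self, capacity, refill_per_sec, clock=time.monotonic):
        self._capacity = float(capacity)      # maximum burst size per client
        self._refill = float(refill_per_sec)  # tokens regained per second
        self._clock = clock                   # injectable so it can be tested
        self._buckets = {}                    # client_id -> (tokens, last_seen)

    def allow(self, client_id):
        """Return True if this client may make a request right now."""
        now = self._clock()
        # A client seen for the first time starts with a full bucket.
        tokens, last = self._buckets.get(client_id, (self._capacity, now))
        # Top the bucket up for the time elapsed, never above capacity.
        tokens = min(self._capacity, tokens + (now - last) * self._refill)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0  # each permitted request spends one token
        self._buckets[client_id] = (tokens, now)
        return allowed


if __name__ == "__main__":
    limiter = RateLimiter(capacity=3, refill_per_sec=1.0)
    # Five requests in quick succession from the same hypothetical client.
    print([limiter.allow("client-a") for _ in range(5)])
```

In this sketch, a client can make a short burst of requests, after which further requests are refused until enough time has passed for the bucket to refill. A scraper making thousands of queries per second from one address would hit that ceiling almost immediately, which is also why scrapers often spread their requests across many addresses, and why rate limits are generally paired with bot detection.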

LinkedIn is still pursuing a legal case against hiQ Labs over its use of LinkedIn member data, which could end up being a precedent-setting ruling that would give more power to platforms over data scraping. But the fact is that some data will always be publicly accessible, and when it is, third parties will look to use those sources to build databases that they can on-sell to marketing firms.

It's an important technical distinction to note, and a good example of how the digital landscape is evolving while laws are still catching up in many respects.

But to be clear, these datasets are not a result of data hacking at LinkedIn, and you can limit your exposure via your own profile settings.