All of us in the software industry care deeply about data security. But how much data is needed for behavioral analytics? I just stumbled upon an interesting article on data minimization that argues that one can build accurate predictive models with the right subset of data.
I would go much further. For behavioral analytics, you can and should throw away ASAP all personally identifiable information (PII) - credit card numbers, social security numbers, driver's license numbers, bank account numbers, etc. (I detail a few exceptions below). The reason is that such data is useless for analysis: what you are interested in is finding patterns in purchases, clicks, and responses, not in credit card numbers.
There are a few caveats:
- You might need a way to identify that the same person came back to your store or your web site, and some PII is the key for doing this. In that case, transform the PII into something that still identifies the same person consistently but cannot be traced back to the original value, with a one-way hash for example.
- Your analytics might identify high-value customers that you want to market to. This is similar to the previous case except you need to keep a record of the transformation so you can reverse it later. The important point is that your analytics tool or vendor only gets an opaque customer ID.
- You might augment your data with information about your customers - demographic info, for example. In this case, the approach is similar. Use the PII to augment the data, but then get rid of it before you pass that data into your analytics tools.
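To make the caveats above concrete, here is a minimal Python sketch, not a production design. It assumes records are plain dictionaries, and the field names, `SECRET_KEY`, `pseudonymize`, `TokenVault` and `strip_pii` are all hypothetical names made up for illustration. I use a keyed hash (HMAC-SHA256) rather than a bare hash, because values like card numbers come from a small enough space that an unkeyed hash could be reversed by simply trying every candidate.

```python
import hashlib
import hmac
import secrets

# Hypothetical secret key; assumed to be stored outside the analytics environment.
SECRET_KEY = secrets.token_bytes(32)

# Assumed names of the PII fields in a record.
PII_FIELDS = {"email", "credit_card", "ssn"}


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Keyed one-way hash: the same input always yields the same opaque ID."""
    normalized = value.strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


class TokenVault:
    """Keeps the opaque ID -> original value mapping for later reversal.

    Only the business side holds the vault; the analytics tool never sees it.
    """

    def __init__(self, key: bytes = SECRET_KEY):
        self._key = key
        self._lookup = {}

    def tokenize(self, value: str) -> str:
        customer_id = pseudonymize(value, self._key)
        self._lookup[customer_id] = value
        return customer_id

    def reveal(self, customer_id: str) -> str:
        return self._lookup[customer_id]


def strip_pii(record: dict, vault: TokenVault, id_field: str = "email") -> dict:
    """Drop every PII field and add an opaque customer ID in its place."""
    cleaned = {k: v for k, v in record.items() if k not in PII_FIELDS}
    cleaned["customer_id"] = vault.tokenize(record[id_field])
    return cleaned


vault = TokenVault()
raw = {"email": "Ann@Example.com", "credit_card": "4111111111111111", "item": "book"}
safe = strip_pii(raw, vault)  # contains only "item" and "customer_id"
```

Only `safe` would be handed to the analytics tool or vendor. If the analysis later flags a `customer_id` as high value, `vault.reveal(...)` maps it back to the original contact detail. If you never need to reverse the mapping, skip the vault and call `pseudonymize` directly.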
The bottom line is that you don't need personally identifiable information to do behavioral analytics, so why take the risk?