There are junk charts, and there are junk data.
That was the thought that ran through my mind when I saw this post about a new FourSquare app (link). For those not familiar with it, FourSquare is a website that lets you broadcast your current location to your friends/followers. This new app, which won a competition hosted by FourSquare, lets users fake their check-ins, in other words, pretend to be somewhere they're not. It's being portrayed as a way of marketing yourself to your social circle.
This is one of many problems with the so-called Big Data era. Yes, we collect lots of data. But a lot of those data are junk. Worse than junk, really, because they are mixed in with the good stuff, and it is often difficult, if not impossible, to tell them apart.
In this case, I wonder: if I were given a dataset of all these check-ins, some of them faked, would I be able to filter out the fake ones? One way is to identify the source of each check-in and blacklist apps like CouchCachet. That only works if (a) the only way to post fake check-ins is through trackable apps, and (b) there are no legitimate check-ins from those blacklisted apps.
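To make that concrete, here is a minimal sketch of such a filter. The field name ("source_app") and the blacklist are hypothetical; real check-in data may not record the posting app at all.

```python
# Hypothetical blacklist of apps known to generate fake check-ins.
BLACKLISTED_APPS = {"CouchCachet"}

def filter_checkins(checkins):
    """Keep only check-ins not posted through a blacklisted app.

    Each check-in is assumed to be a dict with a "source_app" field;
    check-ins missing that field are kept, which is itself a judgment call.
    """
    return [c for c in checkins if c.get("source_app") not in BLACKLISTED_APPS]

checkins = [
    {"user": "alice", "venue": "MoMA", "source_app": "foursquare"},
    {"user": "bob", "venue": "The Standard", "source_app": "CouchCachet"},
]
print(filter_checkins(checkins))  # only alice's check-in survives
```

Of course, this fails exactly where conditions (a) and (b) fail: untrackable sources slip through, and legitimate check-ins from blacklisted apps get discarded.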
Alternatively, I would have to create a labeled dataset in which I verify that some of the past check-ins are faked. That would be very hard to put into practice.
The next question to ask is: if Big Data contain a lot of such faked or just bad data, how much can we trust the analyses?
Would love to hear about your experiences with junk data.
Here's a humorous example of junk data. Some people attempt to estimate popularity based on the number of hits in a Google search. For example, some people record the number of hits for "R programming" and "SAS programming." Interestingly, documents such as "SAS(R) Programming Course Notes" and "Step-by-Step Programming with Base SAS(R) Software" appear in the results for "R programming"! The registered trademark symbol in a SAS publication's title makes the search engine think the document is about R.
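A rough illustration of the mechanism, using a naive tokenizer that strips punctuation the way an indexer might (not how Google actually works, of course):

```python
import re

def tokenize(text):
    # Split on anything that isn't a letter or digit; the parentheses
    # around the trademark "R" vanish, leaving a bare "r" token.
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]

title = "Step-by-Step Programming with Base SAS(R) Software"
tokens = tokenize(title)
print(tokens)
# ['step', 'by', 'step', 'programming', 'with', 'base', 'sas', 'r', 'software']
print({"r", "programming"} <= set(tokens))  # True: a "hit" for R programming
```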
Posted by: Rick Wicklin | 02/05/2013 at 09:03 AM
Isn't this just another case of separating the signal from the noise? (In this case, the noise has been deliberately added.)
With a large enough dataset, would fake entries have a way of identifying themselves (via something like Benford's law)?
Posted by: Andy Palmer | 02/05/2013 at 09:35 AM
I don't think "junk data" is limited to "big data". I recently performed a data quality analysis of something you would think would be "good data" at this point, computer/data breach records: http://l.rud.is/W9wCqi
Generating any kind of real analysis from it requires more caveats than there is text in the report. There are components of the data set that are good enough for use in real analytics. Like Andy said, we just need to figure out the best method for separating signal from noise.
Posted by: Hrbrmstr | 02/05/2013 at 10:21 AM
When you ask people a survey question, you know there is a gap between the answers you get and objective reality. Since you are again dealing with information people volunteer rather than information gathered about them objectively, you run into exactly the same sampling problem.
This is not a new thing; it's an old thing in a new dataset.
So you do what you always did: try to find a way to handle that gap, apply it as best you can, or just offer qualifiers on the results.
Posted by: Phil H | 02/05/2013 at 10:29 AM
Rick: that's a good example. Related to that, search engines look for relevant matches, not exact matches, so even if "R" didn't appear in "SAS(R)", I'd think some SAS results would still be considered somewhat relevant.
Andy: It is certainly a kind of noise, but it is one that would be hard to put a distribution on unless you know how it is generated. It's easier if I know who CouchCachet is, but the challenge is that there may be a hundred other such outfits that the analyst does not know about. The "deliberate" part is also important.
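That said, for anyone curious what Andy's Benford-style screen might look like, here is a rough sketch. It assumes we have some count variable to test (check-ins per venue, say, which is hypothetical here), and a large deviation would only be a prompt for closer inspection, not proof of fakery:

```python
import math
from collections import Counter

def benford_deviation(counts):
    """Chi-square-style distance between the observed leading-digit
    frequencies and the Benford's-law expectation; larger = worse fit."""
    leading = [int(str(abs(n))[0]) for n in counts if n != 0]
    observed = Counter(leading)
    total = len(leading)
    stat = 0.0
    for d in range(1, 10):
        expected = total * math.log10(1 + 1 / d)
        stat += (observed.get(d, 0) - expected) ** 2 / expected
    return stat

# Hypothetical check-in counts per venue.
checkins_per_venue = [12, 187, 34, 1, 9, 23, 156, 78, 2, 41]
print(benford_deviation(checkins_per_venue))
```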
Hrbrmstr: Good point, and great post. The kind of exploratory analysis you did there is essential to every project. The problem with Big Data is that if there are 1,000 variables, it's pretty difficult to do the type of exploration you demonstrated. But not doing it exposes the analysis to all kinds of potential errors.
Phil, Hrbrmstr & others: I think most data analysts realize that "Big Data" has not changed anything; all the best practices of the old days still apply. I agree, but would add that Big Data make life much more difficult. When there are more dimensions and more cases, there is much more room for bad analyses; for example, any one of those dimensions or cases could be "faked".
Posted by: Kaiser | 02/05/2013 at 09:48 PM