A new data mining tool created by Harvard University and MIT’s Broad Institute can not only sort vast data sets to find patterns, but also rank multiple patterns within the data. Why is this significant? Data mining in general is a powerful tool for pulling pertinent information out of groups of data, but when it comes to very large data sets that contain multiple patterns, existing programs tend to fall short.
What is Data Mining?
The term data mining may paint a picture of miners with pickaxes, but in reality it’s another means of using computers to automate common sense. Before computerized data mining, a store owner might simply observe which items sold well, and experiment to determine the best placement for products based on his observations and historical sales data. With automated data mining, this simple operation is possible on a much larger scale, incorporating data from multiple stores and various seasons to determine patterns in customer behavior. What if a new store owner has several years’ worth of data on consumer purchasing behavior, but isn’t sure which patterns are significant? That’s where more robust data mining programs like MINE come in.
MINE Data Mining for Large Data Sets and Multiple Patterns
MINE, which stands for Maximal Information-based Nonparametric Exploration, is the brainchild of a team from Harvard and the Broad Institute. The paper, published December 16, 2011 in Science, outlines the work of this team, composed of David Reshef, Yakir Reshef, Hilary Finucane, Sharon Grossman, Gilean McVean, and Eric Lander. Decoded Science had the opportunity to speak with co-lead author David Reshef about this project.
Decoded Science: What industry do you think will benefit the most from this extremely robust data mining software?
D. Reshef: Our hope is that this tool will be useful in just about any field that is amassing large amounts of data. In terms of specific disciplines that might find this useful, I would hesitate to make a strong claim since I’m not an expert in these fields, but here are a few examples:
a) Scientific research: Some scientific fields—genomics, proteomics, and the study of the human microbiome, for instance—were founded as a result of the explosion of data in the last few decades. Other fields—particle physics, sociology, econometrics, neuroscience, earth and atmospheric science—predate this development but are also becoming saturated with data. In each of these fields, exploring the emerging large data sets is becoming challenging.
b) Finance: Everything on Wall Street is measured: trading volume, stock prices, exchange rates, and more are logged at an impressive temporal resolution stretching back decades. I could imagine financial companies using tools like this to mine the vast amounts of data that they surely keep.
c) Sports statistics: Sports teams might explore the performance statistics of individual players or whole teams to get an edge on their competition.
d) Media, social media and the internet: The age of the internet, Facebook, and 24-hour news networks has created an overload of news and multimedia. Tools like this could be used in this arena to track patterns in news, societal memes, or cultural trends. I could even imagine some interesting data mining being done on data sets kept by data giants like Google and Facebook.