Trust Google to come up with interesting stuff. With a ginormous collection of over 5 million e-books, they created the Google Ngram Viewer app. The purpose is simple: to observe how often a set of words has been used in literary works over a period of time. As with most Google creations, the app has great applications, especially for someone interested in Natural Language Processing and the study of how topics evolve.
As an inquisitive NLP analyst, I find this app really exciting. It helps me tie together a few hypotheses that could not be tested decisively with smaller text corpora.
There is a plethora of articles on the rise of big data and the underlying hardware and software that drive this phenomenon. No, I don’t intend to add to that clutter. This post attempts to infer the evolution of this phenomenon through a study of the evolution of the words that define it.
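As a rough illustration of what such a word trend measures, the sketch below computes the share of all words in a given year that a chosen word accounts for. The toy corpus, the function name `yearly_relative_frequency` and the simple whitespace tokenization are assumptions made up for this example; this is not how the Ngram Viewer itself is implemented.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: (publication year, text) pairs, invented for illustration.
TOY_CORPUS = [
    (1985, "the report lists figures and tables"),
    (1995, "the data show that data quality matters"),
    (2005, "big data and data analytics drive data strategy"),
]


def yearly_relative_frequency(corpus, word):
    """Return {year: occurrences of `word` / total words seen for that year}."""
    word_counts = defaultdict(Counter)
    for year, text in corpus:
        # Naive tokenization (an assumption): lowercase and split on whitespace.
        word_counts[year].update(text.lower().split())
    return {
        year: counts[word] / sum(counts.values())
        for year, counts in sorted(word_counts.items())
    }


trend = yearly_relative_frequency(TOY_CORPUS, "data")
for year, share in trend.items():
    print(year, round(share, 3))
```

Normalizing by the total number of words per year matters because the number of books published changes from year to year, so raw counts alone would mostly reflect the size of the corpus.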
Here is an example.
Right around the end of the 1980s, “data” started gaining popularity, as evidenced by its preponderance in literary works. Also notice the big growth in data analytics texts between 2000 and 2005.
Here is another one.
Traditional data analytics seems to have inflected almost exactly when machine learning started becoming a popular concept. Conventional methods still dominate texts but the growth of machine learning is noteworthy.
This graph should explain it.
It is interesting to note the near parallel evolution of both fields. More interesting is how Big Data deviated from the norm between 1999 and 2007 only to converge in preponderance lately. Now here is the conjecture.
The race between the size of data and the ability to process more data is a close one. In the 1980s, no one wanted to process large volumes of data (transactional, for instance) because they simply couldn’t. They would have if they could, given the really interesting analytical techniques that evolved during this period. I’m sure they would have loved to hack some game-changing insights with these techniques back then. Then we put a supercomputer (in Gen Y parlance) in the hands of the average person, thanks to Gordon Moore and Bob Noyce. Processing power outdid data. Then someone said “let’s capture every piece of data we can”. Data got bigger. Then we put quad-core processors in mobile phones. The race is really a close one: no clear winner has emerged in the 20 years it has been on.
The fundamental question is this – when does the race between superior hardware and the human ability to capture relevant or irrelevant data end?
As an analyst, I see two possible ends. Either a disruptive technology wave stops trying to fit more FLOPS (floating-point operations per second) into smaller chips and figures out a smarter way to process very large volumes of data, or we just stop analyzing ginormous amounts of data. In other words, either hardware transforms or the need ends.
But what I see happening is that we, the analysts, are trying to end this race by giving hardware a turbocharge. We are trying to bring software into this race to help hardware win. Sure, such software has drastically reduced the time it takes to sort petabytes of data, but I don’t want to just sort data and call it analysis. I want to be able to run support vector machines on it! I want to do in-memory processing of Big Data. To me, software is at best a temporary solution. The real solution is transforming hardware, not evolving it. Will software become increasingly irrelevant in this race? Or can it coexist with the hardware to match the rapid changes in today’s data-rich and insight-poor (DRIP) world?
In my own experience, I have seen how Big Data can yield some very non-intuitive insights, or “knowledge nuggets” as we call them. It removes the “small sample” problem of conventional data and makes simple averages equivalent to complex regression analysis. It makes anomaly detection practical rather than theoretical. It works! So I know that the world will, should and will want to use data irrespective of its size and structure.
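The “small sample” point can be sketched numerically: the standard error of a simple average shrinks roughly with the square root of the sample size. The simulated values, the seed and the sample sizes below are assumptions chosen purely for illustration, and the sketch does not attempt to demonstrate the comparison with regression.

```python
import random
import statistics


def standard_error(sample):
    """Estimated standard error of the mean: sample std dev / sqrt(n)."""
    return statistics.stdev(sample) / len(sample) ** 0.5


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical measurements: draws from a normal distribution (mean 100, sd 15).
small_sample = [rng.gauss(100, 15) for _ in range(30)]
large_sample = [rng.gauss(100, 15) for _ in range(30_000)]

for name, sample in (("small", small_sample), ("large", large_sample)):
    mean = statistics.mean(sample)
    print(name, len(sample), round(mean, 2), round(standard_error(sample), 3))
```

With a thousand times more observations, the uncertainty around the average drops by a factor of about thirty, which is why a plain mean over a very large data set can be trusted far more than the same mean over a handful of records.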
At BRIDGEi2i, we know how to deal with Big Data problems. We have developed algorithms and technology-based solutions that mine unstructured and big data to address business problems. Based on that experience, we believe that 10 years from now we will witness an assortment of hardware and software that will allow us to view terabytes of data in spreadsheets on our laptops. We are likely to enter an era where disruptive processing technology will permanently end the Big Data debate, and I eagerly look forward to it. In the meantime, however, Hadoop is your best friend! Embrace it.
This blog is authored by Arun Krishnamoorthy, Senior Manager, Analytics at BRIDGEi2i.
BRIDGEi2i provides Business Analytics Solutions to enterprises globally, enabling them to achieve accelerated business impact by harnessing the power of data. Our analytics services and technology solutions enable business managers to consume more meaningful information from big data, generate actionable insights from complex business problems and make data-driven decisions across pan-enterprise processes to create sustainable business impact. To learn more, visit www.bridgei2i.com.
The views and opinions expressed in this article are those of the author and do not necessarily reflect the official position or viewpoint of BRIDGEi2i.