3 years of Swimming. In Data.


As 2016 is *finally* coming to an end, a year whose repercussions will be felt for many more to come, I cannot help but reflect on the past few years that I have spent swimming. Swimming in data, that is.

I have had the opportunity to be a productive participant in the field of data science from many perspectives. Having spent time in both academia and industry, and as both a producer and a consumer of data science solutions, I have plenty of lessons that I keep reminding myself of. Here are the key takeaways from all those years of swimming in data, which will hopefully be useful to someone:

It takes a village.

‘Big Data’ was a big hype word until very recently, when everyone realized that the ‘game-changing’, ‘life-changing’ and ‘world-changing’ data either did not exist or was in the hands of very few people or companies.

There is an unhealthy disregard in academia and industry for collecting data. In AI, ML and all the related fields that collectively affect our ability to make the most of our data, research progresses as a race to publish the most sophisticated mathematical solution.

In an article on ‘Why Deep Learning is changing your Life?’ I was happy to note that there was a small paragraph dedicated to Fei-Fei Li, a Stanford AI professor who started the first serious concerted effort to collect data for computer vision. It was only after she created ImageNet that the computer vision researchers of the world could show off their sophisticated math skills at making sense of complex data. It was this effort that created the arena where thousands of researchers with Olympic levels of math skills could finally use their expertise meaningfully.

Which brings me to my first point: creating and having good datasets is essential, crucial and possibly more useful than all those complicated math models that some people spend hours first building and then trying to make sense of. A good dataset with a simple model can sometimes tell you far more than the most complicated function approximator run on a bad dataset.

In short, the prestige incentives of academia are not always efficiently aligned with the real-world need for making (game/life/world)-changing data science advances.

Now on to why I titled this section ‘It takes a village’.

My hunch is that the ‘Big Data’ hype was generated because, the way the world operates, what matters to the big guys is usually projected as what should matter to everyone else. For all the hype and McKinsey reports on the big data revolution, the big data companies that actually exist are, in my opinion, exactly these: Google, Facebook and Amazon. (Apple and Microsoft notably have their foundations as product companies.)

Just as the problems of the developing world are never even discussed, let alone fixed, the real data problems of the rest of us don’t garner enough attention. Most research in the field today is trying to solve problems for scenarios that largely exist only for these companies (and the NSA, but shhh, that’s a secret).

Now that a (smart) phone exists in every hand, people in the industry want to answer many questions in a ‘smarter’ way, but for most of those questions the right datasets do not exist. And definitely not in the context of the developing world. (There is a connection that I see between this gap and the recent wave in the Indian startup scene, but more on that in another post.)

There is another hurdle to creating such datasets. Data collection takes time, and it takes still more time and resources before the data starts generating dividends. In industry, where it is necessary to book profits and revenues quickly, the Chinese whispers between the sales team, the deal makers and everybody else mean that solutions are sold before there is time to figure out whether the questions the client is asking can even be answered with the data available.

It takes guts to ask the question ‘Do we even have the data?’ and to consider the possibility of hearing back a ‘No’. Then it takes deep pockets to fix that issue. And then it takes perseverance to come up with a meaningful solution. Which is why I am not playing a blame game here. Businesses and companies need to be run, and what needs to be sold to that end needs to be sold. But in the end, only long-term thinking can produce anything valuable.

So, in the industry too, the incentives are not always efficiently aligned with the real-world need for making (game/life/world)-changing data science advances.

So, as a data science PM, engineer, researcher and so on, what does all this mean for you? As much as possible, fix that misalignment. *It takes a village* to build a useful data science solution, right from the engineer who builds the system for data collection (if you have the luxury of having such a person in your vicinity) to the designer who can finally make your beautiful insights visible to those who matter. Get in touch with all of them and take them along, up and onwards with you. Every link in that pipeline matters.

Get your language right.

When you offer insights about/from data to someone, you operate in an abstract space of ideas. In such a scenario, you could be talking about the same thing and still not be on the same page. This is usually the case because you are quite literally using words that mean something different to the other party or using different meanings of the same word.

Again, the Chinese whispers between the sales team, the client management and the company management mean that by the time the data and the problem reach the engineering team, several things have been lost in translation. In academia, when researchers from different backgrounds or fields collaborate, there are several different ways of expressing the same ideas, which creates room for misunderstanding.

For example, when you talk to a natural language processing researcher, audio and visual information is ‘context’ or background, i.e. all the extra information that they may or may not want to include in their machine learning models. Similarly, when you talk to a speech technology researcher, language and visual cues are ‘context’. As someone who has worked on building machine learning models that combine all these communication media, I very quickly realized that while discussing or presenting my work, I would have to avoid the word ‘context’ like the plague if I wanted to get my point across to everyone together.

Listen. Listen. And then listen some more.

To get the previous point right, you must listen to your audience’s language in the first place. But there is a bigger reason to keep one’s ears wide open. In my short stint so far, I have usually found that people don’t actually know the questions they want answers to. However, people do know very well the pain points of the problems they are facing. Try to listen to these in detail and with great patience. Even to the egoistic (a*holes) who might come to you full of insight and in no mood to listen to what the data is actually saying.

This does two important things for you. First, you can decide whether a person really has a problem or simply has an answer that they have already decided they want to hear. Next, if they do have a problem, it is better for everyone if you can quickly assess the situation and come up with the questions that can be answered and then the questions that need to be answered. Many times, they are not the same. If you are ever part of a project in which these happen to be the same, jump at it with all you’ve got. This does not happen often.

This is analogous to the famous Steve Jobs mantra that consumers do not know what they want. So, you must get good at listening because, before finding a solution, you must transform a problem into a good question.

Math is beautiful.

When I had just started out, I was brash and naïve. I wanted to apply the full force of my training and change things every day. But there was a slow and painful process of realizing that a) the world does not know its problems, b) it is not always looking for solutions to the problems it does know, and c) so many years of my education had not quite prepared me to face the factors that were important in solving problems.

I immediately changed gears to try to fix some of those issues in my life. The world is hard and disappointing. However, Math is only hard, not disappointing. As a data scientist I place my bets on Math, and it is that bet that inspires me to get up every day and get to work, no matter what.

I hope this was useful! Here’s wishing a happy productive 2017 to all!
