Hadoop: an ex-Yahoo CTO’s perspective


Raymie Stata is the co-founder of Altiscale and a former CTO of Yahoo. Altiscale provides big data services to companies that cannot manage their own big data infrastructure. Stata recently appeared on the Structure Show podcast to talk about Hadoop, Altiscale’s approach to it, and Hadoop as a service. Here are the highlights of the conversation.

Why Hadoop is best consumed as a service

“At some point, somebody will say, ‘Well, how many nodes do I get?’ And that’s when we say, ‘OK, great. Think about it. Why do you care?’ You shouldn’t be thinking about the number of nodes. That’s what it means to have Hadoop as a service.”

Unlike the typical cloud service, where clients are charged by the number of nodes, Stata’s Altiscale bills users monthly based on computing time and storage.

Hortonworks and Cloudera aren’t arguing over nothing

“When you think about a market, first of all, there won’t be one vendor, there will be multiple ones, and, second of all, they’re going to be fiercely competitive.”

“I think ages ago, in the SQL wars, it was primarily kind of the field — sales and marketing — that would get in these kind of bloody battles.”

“With the open source element of Hadoop, I think that has brought the competition to the engineering level.”

“I think the trial by fire that you get in those environments where Hadoop is operating at scale is very valuable for people contributing to the Hadoop code base. … There’s a realization that theory and practice often diverge, and as systems scale up and become more complicated, that happens more and more. And so taking a bit more a data-driven approach to making improvements — versus just ‘Hey, I’ve got an idea’ and hacking away for days at a time and then contributing it and saying, ‘Hey, isn’t this better?’ — there’s just a certain respect, if you will, for the complexity of the system.”

Companies like Cloudera, Hortonworks, MapR and Pivotal are the biggest players in the Hadoop space. However, in Stata’s opinion, advances in relational databases may render this competition moot.

Why everyone loves Spark

“From a performance perspective, because it’s an in-memory solution, obviously, it’s a lot faster than old-style MapReduce, where after every iteration of your algorithm you have to write out to HDFS. With Spark, you’re just kind of updating in place, in memory, so you can do many, many iterations of your iterative algorithms very, very quickly. Those kind of iterative algorithms are very common in machine learning.”

Recently, some recursive and iterative methods have given way to methods with a complexity of ‘n*log(n)’ and sometimes even ‘n’. Spark might someday replace MapReduce as the de facto processing framework on Hadoop clusters.
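The contrast Stata draws, between keeping an iterative algorithm's state in memory and writing it out after every iteration, can be illustrated with a toy sketch. This is not Spark or MapReduce code: it is plain Python, and the function names (`step`, `iterate_in_memory`, `iterate_via_disk`), the tiny dataset, and the use of a local JSON file as a stand-in for HDFS are all assumptions made for illustration.

```python
import json
import os
import tempfile


def step(w, data, lr=0.1):
    """One gradient-descent update for fitting y = w * x by least squares."""
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad


def iterate_in_memory(data, iterations):
    """Keep the working state in memory between iterations (Spark-like pattern)."""
    w = 0.0
    for _ in range(iterations):
        w = step(w, data)
    return w, 0  # no storage round trips


def iterate_via_disk(data, iterations):
    """Write the state out and read it back after every iteration
    (a stand-in for old-style MapReduce writing to HDFS)."""
    round_trips = 0
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "state.json")  # hypothetical stand-in for HDFS
        w = 0.0
        for _ in range(iterations):
            w = step(w, data)
            with open(path, "w") as f:
                json.dump({"w": w}, f)
            with open(path) as f:
                w = json.load(f)["w"]
            round_trips += 1
    return w, round_trips


# Toy data generated from y = 2 * x, so the fitted weight should approach 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w_mem, trips_mem = iterate_in_memory(data, 50)
w_disk, trips_disk = iterate_via_disk(data, 50)
print(w_mem, trips_mem)
print(w_disk, trips_disk)
```

Both versions arrive at the same weight. The only difference is the storage round trip paid on every iteration, which is the cost that grows with the number of iterations in the machine learning workloads the quote mentions.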

Applications are still the linchpin for big data adoption

“We distinguish what we call applications from tools. Tools are still horizontal. A Platfora, a Tableau — those are still horizontal tools. They certainly raise the level of abstraction versus Java, but they don’t have any what we call domain specificity. So an application to us is something that actually solves a domain-specific problem. … When you say ‘attribution analysis,’ that’s where that domain-specificity comes in that that’s to me what qualifies as a real solution.”

Stata acknowledges that, even in the Hadoop sector, there is still a gap between a solution and the task it is meant to perform.

Innovating search in the mold of Moore’s law

“There’s continuous innovation and more-disruptive innovation, and I think that continuous innovation is real innovation and often is the most scientific. If you think about Moore’s law and what it took to kind of maintain Moore’s law, that’s continuous innovation, but it’s deep, deep work and has been enormously important. So, when I look at classic [algorithmic] search, I would put it in that category of continuous innovation. I think there’s probably some metric where they’re doubling it every 18 to 24 months (some metric of relevance), and maintaining that level of improvement is important because there’s more and more noise out there, so you have to crank up capabilities. But at the same time, it just kind of fades into the background because it just kind of changes.”

In Stata’s opinion, the web search business has been evolving at a pace similar to the advances in microprocessors under Moore’s law. He continued:

“I do think that if you back up, if you look at the bigger picture of people wanting to inform themselves for various purposes, it does seem like there are more disruptive innovations waiting to happen.”

It’s hard to make computer science a family business

“It’s not like a restaurant, where from 4 years old on you’re inside doing that.”

“The thing about high-tech is that you can’t really participate until you’re fairly old. And I’m seeing that with my kids, by the way. I’ve got kids, and at 4 years old they didn’t want to do software testing. Imagine that!”

Stata credits his father with instilling an entrepreneurial spirit in him, but he also says that he was not destined or trained to be a technologist. His first exposure to the industry came through an internship at Analog Devices while he was in grad school.