
Monday, January 05, 2009

A Technical Look at How Parallel Processing Brings Vast New Capabilities to Large-Scale Data Analysis

Transcript of BriefingsDirect podcast on new technical approaches to managing massive data problems using parallel processing and MapReduce technologies.

Listen to the podcast. Download the podcast. Find it on iTunes/iPod. Learn more. Sponsor: Greenplum.

Dana Gardner: Hi, this is Dana Gardner, principal analyst at Interarbor Solutions, and you're listening to BriefingsDirect. Today we present a sponsored podcast discussion on new data-crunching architectures and approaches, ones designed with petabyte data sizes in their sights.

It's now clear that the Internet-size data gathering, swarms of sensors, and inputs from the mobile device fabric, as well as enterprises piling up ever more kinds of metadata to analyze, have stretched traditional data-management models to the breaking point.

In response, advances in parallel processing using multi-core chipsets have prompted new software approaches, such as MapReduce, that can handle these data sets at surprisingly low total cost.

We'll examine the technical underpinnings that support the new demands being placed on, and by, extreme data sets. We'll also uncover the means by which powerful new insights are being derived from massive data compilations in near real time.

Here to provide an in-depth look at parallelism, modern data architectures, MapReduce technologies, and how they are coming together, is Joe Hellerstein, professor of computer science at UC Berkeley. Welcome, Joe.

Joe Hellerstein: Good to be here, Dana.

Gardner: Also Robin Bloor, analyst and partner at Hurwitz & Associates. Thanks for joining, Robin.

Robin Bloor: It's good to be here.

Gardner: We're also joined by Luke Lonergan, CTO and co-founder of Greenplum. Welcome to the show, Luke.

Luke Lonergan: Hi, Dana, glad to be here.

Gardner: The technical response to oceans of data is something that has been building for some time. Multi-core processing has also been something in the works for a number of years. Let's go to Joe Hellerstein first. What's different now? What is in the current confluence of events that is making this a good mixture of parallelism, multi-core, and the need to crunch ever more data?

Hellerstein: It's an interesting question, because it's not necessarily a good thing. It's a thing that's emerged that seems to work. One thing you can look at is data growth. Data growth has been following and exceeding Moore's Law over time. What we've been seeing is that the data sets that people are gathering and storing over time have been doubling at a rate of even faster than every 18 months.

That used to track Moore's Law well enough. Processors would get faster about every 18 months. Disk storage densities would go up about every 18 months. RAM sizes would go up by a factor of two about every 18 months.

What's changed in the last few years is that clock speeds on processors have stopped doubling every 18 months. They're growing very slowly, and chip manufacturers like Intel have moved instead to utilizing Moore's Law to put twice as many transistors on a chip every 18 months, but not to make those transistors run your CPU faster.

Instead, what they are doing is putting more processing cores on every chip. You can expect the number of processors on your chip to double every 18 months, but they're not going to get any faster.

So data is growing faster, and we have chips basically standing still, but you're getting more of them. If you want to take advantage of that data, you're going to have to program in parallel to make use of all those processors on the chips. That's the confluence that's happening. It's the slowdown in clock speed growth against the continued growth in data.

Effects on mainstream compute problems

Gardner: Joe, where do you expect that this is going to crop up first? I mentioned a few examples of large data sets from the Internet, such as with Google and what it's doing. We're concerned about the mobile tier and how much data that's providing to the carriers. Is this something that's only going to affect a select few problems in computing, or do you expect this to actually migrate down into what we consider mainstream computing issues?

Hellerstein: We tend to think of Google as having the biggest data sets around. The amazing thing about the Web is the amount of data there that was typed in by people. It's just phenomenal to think about the amount of typing that's gone on to generate that many petabytes of data.

What we're going to see over time is that data production is going to be mechanized and follow Moore's Law as well. We'll have devices generating data. You mentioned sensors. Software logs are big today, and there will be other sources of data ... camera feeds and so on, where automated generation is going to pump out lots of data.

That data doesn't naturally go to Web search, per se. That's data that manufacturers will have, based on their manufacturing processes. There is security data that people who have large physical plants will have coming from video cameras. All the retail data that we are already capturing with things like Universal Product Code (UPC) and radio-frequency identification (RFID) is going to increase as we get finer-grain monitoring of objects, both in the supply chain and in retail.

We're going to see all kinds of large organizations gathering data from all sorts of automated sources. The only reason not to gather that data is when you run out of affordable processing and storage. Anybody with the budget will have as much data as they can budget for and will try to monetize that. It's going to be pervasive.

Gardner: Robin Bloor, you've been writing about these issues for some time. Now, we have had multi-core silicon, and we've had virtualization for some time, but there seems to be a lag between how the software can take advantage of what's going on on the metal. What's behind this discrepancy, and where do you expect that to go?

Bloor: There are different strands to this, because if we talk about parallelization, then with large database products, to a certain extent, we have already moved into the parallelization.

It's an elastic lag that comes from the fact that, when a chip maker does something new on the chip, unless it's just a speed increase -- which was the great thing about clock speed -- you have to change your operating system to some degree to take advantage of what's new on the chip. Then, you may have to change the compilers and the way you write code in order to take advantage of what's on the chip.

It immediately throws a lag into the progress of software, even if the software can take advantage of it. With multi-core, we don't have specific tools to write parallel software, except in one or two circumstances, where people have gone through the trouble to do that. They are not pervasive.

You don't have operating systems naturally built for sharing the workload of multi-core. We have applications like virtualization, for example, that can take advantage of multi-core to some degree, but even those were not specifically written for multi-core. They were written for single-core processes.

So, you have a whole lag in the works here. That, to a certain extent, makes multi-core compelling for where you have parallel software, because it can attack those problems very, very well and can deliver benefit immediately. But you run into a paradox when Intel comes out with a four-way or an eight-way or a 16-way chip set. Then the question is how are you going to use that?

Multi-core becomes the killer app

Gardner: You've written recently, Robin, that the killer app, so to speak, for multi-core is data query. Why do you feel that's the case?

Bloor: There are a lot of reasons for it. First of all, it parallelizes extremely well. Basically, you have a commanding node that's looking after a data query. You can divide the data and the resources in such a way that you just basically run everything in parallel.

The other thing that's really neat about this application is it's a complete batch application, in the sense that you just keep pushing the data through an engine that keeps doing the queries. So, you're making pretty effective use of all the processors that are available to you. It's very high usage.

If you run an operating system that's based upon intervals, you're waiting for things to happen. At various times, the operating system is idle. It doesn't seem like they're very long times, but mostly on a PC the operating system is never doing anything. Even when you're running applications on a PC, it's rarely doing very much, even in a single-CPU situation. In a multiple-CPU situation, it's very hard to divide the workload.

So that's the situation. You've got this problem that we have with very large heaps of data. They've been growing roughly at a factor of about 1,000 every six years. It's an awesome growth rate. At the same time, we have the technology where we can take a very good dash at this and use the CPU power we've got very effectively.
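To put the growth figure quoted above in perspective, the short calculation below converts "a factor of about 1,000 every six years" into a doubling time and compares it with the 18-month doubling mentioned earlier in the discussion. It is only a back-of-the-envelope sketch; the function names are illustrative, and both rates are the speakers' rough estimates rather than measured values.

```python
import math

def doubling_time_months(growth_factor: float, period_months: float) -> float:
    """Months needed to double, given the total growth over a period."""
    return period_months / math.log2(growth_factor)

def growth_over(period_months: float, doubling_months: float) -> float:
    """Total growth factor over a period for a given doubling time."""
    return 2 ** (period_months / doubling_months)

SIX_YEARS = 72  # months

# Data: roughly 1,000x in six years (figure quoted in the discussion).
data_doubling = doubling_time_months(1000, SIX_YEARS)

# Hardware: one doubling every 18 months (figure quoted in the discussion).
hardware_growth = growth_over(SIX_YEARS, 18)

print(f"Data doubles about every {data_doubling:.1f} months")
print(f"An 18-month doubling gives {hardware_growth:.0f}x in six years")
```

On these assumptions, data doubles roughly every seven months, while an 18-month doubling yields only about 16x over the same six years. That gap is what parallelism has to close.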

Gardner: Luke Lonergan, we now have a data problem, and we have some shifts and changes in the hardware and infrastructure. What now needs to be brought into this to create a solution among these disparate variables?

Lonergan: Well, it's interesting. As I listened to Joe and Robin talk about the problem, what comes to mind is a transition in computing that happened in the 1970s and 1980s. What we've done at Greenplum is to make a parallel operating system for data analysis.

If you look back on super computing, there were times when people were tackling larger and larger problems of compute. We had to invent different kinds of computers that could tackle that kind of problem with a greater amount of parallelism than people had seen before -- the Connection Machine with 64,000 processors and others.

What we've done with data analysis is to make what Robin brings forward happen -- have all units available within a group of commodity computers, which is the popular computing platform. It's really required for cost-efficient analysis to bring that to bear automatically on structured query language (SQL) queries and a number of different data-intensive computing problems.

The combination of the software-switch interconnect, which Greenplum built into the Greenplum product, and the underlying use of commodity parallel computers, is brought together in this database system that makes it possible to do SQL query and languages like MapReduce with automatic parallelism. We're already handling problems that involve thousands of individual cores on petabytes of data.

The problem is very much real. As Joe indicated, there are very many people storing and analyzing more data. We're very encouraged that most of our customers are finding new uses for data that are earning them more money. Consequently, the driver to analyze more and more data continues to grow. As our customers get more successful, this use of data is becoming really important.

Gardner: Back to Joe. This seems to be a bright spot in computer science, tackling these issues, particularly in regard to massive data sets, not just relational data, of course, but a multitude of different types of content and data. What's being done at the research level that backs up this direction or supports this new solution direction?

Data-centered approach has huge power

Hellerstein: It's an interesting question, because the research goes back a ways. We talked about how database systems and relational query, like SQL, can parallelize neatly. That comes straight out of the research literature, from projects like the Gamma Research Project at Wisconsin in the 1980s and the Bubba Project at MCC. What's happened with that work over time is that it has matured in products like Greenplum, but it's been kind of cornered in the SQL world.

Along came Google and borrowed, reused, and reapplied a lot of that technology to text- and Web-processing with their MapReduce framework. The excitement that comes from such a successful company as Google tackling such a present problem as we have today with the Web, has begun to get the rest of computer science to wake up to the notion that a data-centric approach to parallelism has enormous power.

The traditional approach to parallelism and research in the 1980s was to think about taking algorithms -- typically complicated scientific algorithms that physicists might want to use -- and trying to very cleverly figure out how to run them on lots of cores.

Instead, what you're seeing today is people say, "Wow, well, let's get a lot of data. It's easy to parallelize the data. You break it up into little chunks and you throw it out to different machines. What can we do cleverly in computing with that kind of a framework?" There are a lot of ideas for how to move forward in machine learning and computer vision, and a variety of problems, not just databases now, where you are taking this massively parallel data-flow approach.

Gardner: I've heard this term "shared nothing architecture," and I have to admit I don't know anything about what it means. Robin, do you have a sense of what that means, and how that relates to what we are discussing?

Bloor: Yeah, I do. The first time I ran into this was not in respect to this at all. I did some work for the Hong Kong Jockey Club in the 1990s. What they do is take all the gambling on all the horse racing that goes on in Hong Kong. It's a huge operation, much, much bigger than its name sounds.

In those days, they got, I think, the largest transaction rate in the world, or at least it was amongst the top ten. They were getting 3,000 bets in the last second before a race, and they lose the money from the bet if the bet doesn't go on.

The law in Hong Kong was that the bet has to be registered on disk, before it was actually a real bet. So, if in any way, anything fell over or broke during the minute leading up to a race, a lot of money could be lost.

Basically they had an architecture that was a shared nothing architecture. They had a router in front of an awful lot of servers, which were doing nothing but taking bets and writing them to disk. It was server, after server, after server. If at any point, there was any indication that the volume was going up, they would just add servers, and it would divide the workload into smaller and smaller chunks, so it could do it.

You can think of it as almost being like a supermarket, in the sense of lots and lots of different tills and lots of queues for people, but each till is a resource on its own, and it shares nothing with anything else. Therefore, no bottlenecks can build up around any particular line.

If you have somebody directing the traffic, you can make sure that the flow goes through. So you go from that, straight into a query on a very large heap of data, if you manage to divide the data up in an efficient way.

A lot of these very big databases consist of nothing more than one big fact table -- a little bit more, but not much more than one big fact table. You split that over 100 machines, and you have a query against a whole fact table. Then, you just actually have 100 queries against 100 different data sets, and you bring the answer back together again.

You can even do fault tolerance in terms of the router for all this. So, with that, you can end up with nothing being shared, and you just have the speed. Basically, any device that's out there is doing a bit of query for you. If you've got 1,000 of them, you go 1,000 times faster. This scales extraordinarily well, because nothing is shared.
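The "one query becomes 100 queries" idea described above can be sketched in a few lines. The example below is a toy, single-machine illustration and not how any particular product works: the fact table, the `partition` scheme, and the thread pool standing in for separate servers are all assumptions made for the sketch. Each partition is queried on its own, sharing nothing with the others, and only the small partial results are brought back together.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(rows, n_nodes):
    """Deal rows round-robin into n_nodes independent partitions."""
    parts = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        parts[i % n_nodes].append(row)
    return parts

def local_query(part):
    """Per-node work: sum and count of amounts in this partition only."""
    amounts = [amount for _, amount in part]
    return sum(amounts), len(amounts)

def scatter_gather(rows, n_nodes):
    """Run the same query on every partition, then merge the partial results."""
    parts = partition(rows, n_nodes)
    # Threads stand in for separate machines here; this is an assumption
    # for illustration and will not show a real speedup on one computer.
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(local_query, parts))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total, count

# Hypothetical fact table of (bet_id, amount) rows.
fact_table = [(i, (i % 7) + 1) for i in range(10_000)]
total, count = scatter_gather(fact_table, n_nodes=100)
print(total, count)
```

The merge step works because sums and counts combine cleanly across partitions. Queries that need rows from different partitions to meet, such as joins, take more machinery.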

Gardner: Luke, tell me how these concepts of being able to scale relate to what the developers need to do. It seems to me that we've got some infrastructure benefits, but if we don't have the connection between how these business analysts and others that are seeking the results can use these infrastructure benefits, we're not capitalizing. What needs to happen now in terms of the logic as that relates to the data?

The net effects on users

Lonergan: It's a good question, because, in the end, it's about users being able to gain access to all that power. What really turned the corner for general data analysis using SQL is the ability for a user not to have to worry about what kind of table structure they have. They can have lots of small tables joining to lots of big tables, and big tables joining to each other.

These are things they do to make the business map better to the data analysis they're doing. That throws a monkey wrench in the beautiful picture of just subdividing the data and then running individual queries.

What the developer needs is an engine that doesn't care how the data is distributed, per se, just being able to use all of that parallelism on the problems of interest. The core problem we've solved is the ability for our engine to redistribute the data and the computation on the fly, as these queries and analysis are being performed.

It's the combination, as Robin put it earlier, of a compiler technology, which is our parallelizing optimizer, and a software interconnect, which we call a soft switch technology. The combination of those two things enables a developer of business logic and business analysis not to have to worry about what is underneath them.

The physical model of how the database is distributed in a shared nothing architecture in a Greenplum system is not visible to the developer. That is where the SQL-focused data analytics realm has gone by necessity. It really has made it possible to continue to grow the amount of data, and continue to be able to run SQL analysis against that data. It's the ability to express arbitrarily constructed business rules against a large-scale data store.
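One common textbook way to join tables that are scattered across nodes is to redistribute both sides by a hash of the join key, so that matching rows land on the same node and each node can then join locally. The sketch below illustrates that general technique only. It is not a description of Greenplum's optimizer or interconnect, and the table contents, function names, and three-node layout are hypothetical.

```python
from collections import defaultdict

def redistribute(node_rows, key_index, n_nodes):
    """Send every row to the node chosen by hashing its join key."""
    out = [[] for _ in range(n_nodes)]
    for rows in node_rows:
        for row in rows:
            out[hash(row[key_index]) % n_nodes].append(row)
    return out

def local_hash_join(left, right, left_key, right_key):
    """Ordinary single-node hash join of two row lists."""
    index = defaultdict(list)
    for row in left:
        index[row[left_key]].append(row)
    return [l + r for r in right for l in index.get(r[right_key], [])]

def parallel_join(left_nodes, right_nodes, left_key, right_key):
    """Redistribute both tables on the join key, then join node by node."""
    n = len(left_nodes)
    left_parts = redistribute(left_nodes, left_key, n)
    right_parts = redistribute(right_nodes, right_key, n)
    result = []
    for lp, rp in zip(left_parts, right_parts):
        result.extend(local_hash_join(lp, rp, left_key, right_key))
    return result

# Hypothetical data, placed on three nodes with no regard to the join key.
customers_by_node = [[(1, "Ann"), (4, "Dee")], [(2, "Bob")], [(3, "Cy")]]
orders_by_node = [[(100, 2, 9.5), (101, 3, 4.0)], [(102, 1, 7.25)],
                  [(103, 2, 1.0), (104, 4, 3.0)]]

joined = parallel_join(customers_by_node, orders_by_node, 0, 1)
for row in sorted(joined):
    print(row)
```

The caller never says where the rows started out. The redistribution step hides the physical layout, which is the property the discussion above is pointing at.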

Gardner: We did one of these podcasts not too long ago with Tim O'Reilly. He mentioned that he'd heard from Joe Hellerstein that every freshman now at UC Berkeley studying computer science is being taught Hadoop, which is related on an open-source development level and community to MapReduce. SQL is now an elective for seniors.

It seems that maybe we've crossed a threshold here in terms of how people are preparing themselves for this new era. Joe, how does that relate to how this new logic and ability to derive queries from these larger data sets is unfolding?

Hellerstein: What you're seeing there is three things happening at once. The first is that we have a real desire on the educational side to teach the next generation of programmers something about parallelism. It's really sticking your head in the sand to teach programming the way we have always taught it and not address the fact that every efficient program over the next ... forever is going to have to be a parallel program. That's the first issue.

The second issue is what's the simplest thing you can teach to computer science students to give them a tangible feeling for parallelism, to actually get them running code on lots of machines and get it going? The answer is data parallelism -- not a complicated scientific algorithm that's been carefully untangled, but simple data parallelism in a language that doesn't really require them to learn any new conceptual ideas that they wouldn't have learned in a high school AP course where they learned say Python or Java.

When you look at those requirements, you come up with the Google MapReduce model as instantiated in the open-source code of Hadoop. They can write simple straight-line programs that are procedural. They look just like "For" loops and "If-Then" statements. The students can see how that spreads out over a lot of data on a lot of machines. It's a very approachable way to get students thinking about parallelism.
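The kind of straight-line program described above can be illustrated with the classic word-count example. This is a minimal single-process sketch of the map and reduce steps in plain Python, not the Hadoop API: the `map_reduce` driver and the function names are assumptions made for illustration, and a real framework would run the mappers and reducers on many machines.

```python
from collections import defaultdict
from itertools import chain

def map_words(line):
    """Map step: emit a (word, 1) pair for each word in one line of text."""
    for word in line.lower().split():
        yield word, 1

def reduce_counts(word, counts):
    """Reduce step: add up all the counts emitted for one word."""
    return word, sum(counts)

def map_reduce(records, mapper, reducer):
    """Toy driver: run the mapper, group values by key, run the reducer."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper(r) for r in records):
        groups[key].append(value)
    return dict(reducer(k, vs) for k, vs in groups.items())

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = map_reduce(lines, map_words, reduce_counts)
print(counts)
```

The programmer writes only `map_words` and `reduce_counts`, each of which is an ordinary loop or sum. Splitting the input, grouping by key, and spreading the work across machines is the framework's job.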

The third piece of this, which you can't discount, is the fact that Google is very interested in making sure that they have a pipeline of programmers coming in. They very aggressively have been providing useful pedagogical tools, curriculum, and software projects, to universities to ramp this up.

So it's a win-win for the students, for the university, and frankly for Google, Yahoo, and IBM, who have been pushing this stuff. It's an interesting thing, an academic-industrial collaboration for education.

At the business level

Gardner: Let's bring this from a slightly abstract level down to a business level. We seem to be focusing more on purpose-built databases, appliances, packaging these things a little differently than we had in a distributed environment. Luke, what's going on in terms of how we package some of these technologies, so that businesses can start using them, perhaps at a crawl, walk, run type of a ramp up?

Lonergan: Businesses have invested a tremendous amount of their time over the last 15 to 25 years in SQL, and some of the more traditional kinds of business analysis that pay off very well are ensconced in that programming model. So, packaging a system that can do transactional, mixed workloads with large amounts of concurrency, with applications that use the SQL paradigm, is very important.

Second, the ability to leverage the trends in microprocessors and inexpensive servers, and combine those with this kind of software model that scales and takes advantage of a very high degree of parallelism, requires a certain amount of integration expertise.

Packaging this together as software plus hardware, making that available as a reference architecture for customers, has been very important and has been very successful in our accounts at New York Stock Exchange, Fox, MySpace, and many others.

Finally, as Joe and you were hinting at, there are changes in the programming paradigm. In being able to crawl, walk, and then run, you have to support the legacy, but then give people a way to get to the future. The MapReduce paradigm is very interesting, because it bridges the gap between traditional data-intensive programming with SQL and the procedural world of unstructured text analysis.

This set of technologies, put together into a single operating system-like formulation and package, has been our approach, and it's been very popular.

Gardner: Robin Bloor, this whole notion of legacy integration is pretty important. A lot of enterprises don't have the luxury of starting out "green field," don't have the luxury of hiring the best and brightest new computer scientists, and working on architecture from a pure requirements-based perspective. They have to deal with what they have in place. Increasingly, they want to relate more of what they have in place into an analytic engine of some kind.

What's being done from your perspective vis-à-vis parallelization and things like MapReduce that allow for backward compatibility, as well as setting yourself up to be positioned to expand and to take advantage of some of these advancements?

Bloor: The problem you have with what is fondly called legacy by everybody is that it really is impossible. The kind of things that were done in the past very strongly bound the software to the data and to the environment it ran in. Therefore, unhooking that, other than starting again from scratch, is a very difficult thing to do.

Certainly, a lot of work is going on in this area. One thing that you can do is to create something -- I don't know if there is an official title to it, but everybody seems to use the word data fabric. The idea being that you actually just siphon data off from all of the data pools that you have throughout an organization, and use the newer technology in one way or another to apply to the whole data resource, as it exists.

This isn't a trivial thing to do, by the way. There are a lot of things involved, but it's certainly a direction in which things are actually going to move. It's possibly not as well acknowledged as it should be, but most of the things that we call data warehouses out there, the implementations have been done in the area of business intelligence (BI), actually don't run very well.

You have situations where people post queries, and it may take hours to answer a query. Because it takes hours to answer a query, and you have a whole scheme, a reason why you are actually mining the data for something, if every step takes a couple of hours, it's very difficult to carry out an analysis like that in a particularly effective way.

A 100-to-1 value improvement

If you take something like the Greenplum technology, and you point to the same problem, even though you are not dealing with petabytes of data, you can still have this parallel effect. You can get answers back that used to take 100 minutes, and you will get 100 to 1 out of this. You may get more, but you will certainly get 100 to 1 out of this, and it changes the way that you do the job that you have.

One thing that's kind of invisible is that there is a lot of data out there that's not being analyzed fast enough to be analyzed effectively. That's something that I think parallelism is going to address.

The other thing where it is going to play a part is that organizations are going to build data fabrics. In one way or another, they will siphon the data off and just handle it in a parallel manner. There is a lot you can do with that, basically.

Gardner: Joe Hellerstein, is there more being brought to this from the data architecture perspective, jibing the old with the new, and then providing yet better performance when it comes to these massive analytic chores?

Hellerstein: What I'm excited about, and I see this at Greenplum -- there's another company called Aster Data that's doing this, and I wouldn't be surprised if we see more of this in the market over time -- is the combination of SQL and MapReduce in a unified way in programming environments. This is a short-term step, but it's a very pragmatic one that can help with people's ability to get their hands on data in an organization.

The idea is that, first of all, you want to have the same access to all your data via either an SQL interface or a MapReduce programming interface. When I say all the data, I mean the stuff you used to get with SQL, the database data, and the stuff you might currently be getting with MapReduce, which might be text files or log files in a distributed-file system. You ought to be able to access those with whatever language suits you, mix and match.

So, you can take your raw log files, which are raw text, and use SQL to join those against a customer table. Or, if you're a MapReduce programmer who does analytics and doesn't know SQL, say you're a statistician, you can write a MapReduce program that does some fancy statistical analysis. You can point it at text fields in a database full of user comments, or at purchase records that you used to have to dump out of the database into text formats to get your hands on. So, part of this is getting more access to more people who have programming paradigms at their fingertips.
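As a small illustration of joining raw log text against a customer table with SQL, the sketch below parses a few log lines and queries them alongside a table in an in-memory SQLite database. SQLite is only a convenient stand-in here. The log format, table names, and sample values are all hypothetical, and the systems discussed above would expose the files to the SQL engine in their own way.

```python
import sqlite3

# Hypothetical raw log text; the line format is an assumption for this sketch.
raw_log = """\
2009-01-02 10:15:00 customer=17 action=view
2009-01-02 10:16:30 customer=42 action=purchase
2009-01-02 10:17:05 customer=17 action=purchase
"""

def parse_line(line):
    """Turn one log line into a (timestamp, customer_id, action) tuple."""
    date, time, cust, action = line.split()
    return (f"{date} {time}", int(cust.split("=")[1]), action.split("=")[1])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(17, "Acme"), (42, "Globex")])
conn.execute("CREATE TABLE log (ts TEXT, customer_id INTEGER, action TEXT)")
conn.executemany("INSERT INTO log VALUES (?, ?, ?)",
                 [parse_line(line) for line in raw_log.splitlines()])

# Join the parsed log rows against the customer table with ordinary SQL.
rows = conn.execute("""
    SELECT c.name, COUNT(*) AS purchases
    FROM log AS l
    JOIN customers AS c ON c.id = l.customer_id
    WHERE l.action = 'purchase'
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)
```

Once the text is visible to the SQL engine as rows, the join itself is unremarkable. That is the practical appeal of giving both kinds of data the same interface.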

Another piece of this is that some things are easier to do in MapReduce, and some things are easier to do in SQL, even when you know both. Good programmers have a lot of tools in their tool belt. They like to be able to use whatever tool is appropriate for the task. Having both of these things interleaved is really quite helpful.

Gardner: Luke, to what degree are they interleaved now, and to what degree can we expect to see more?

Lonergan: It's been very gratifying that just making some of those pragmatic capabilities available and helping customers to use them has so far yielded some pretty impressive results. We have customers who have solved core business problems, in ways they couldn't have before, by unifying the unstructured text-file data sources with the data that was previously locked up inside the database.

As Joe points out, it's a good programmer who knows how to use all of the various tools that they have at their disposal. Being able to pull one that's just right for the task off the shelf is a great thing to do. With the Greenplum system we've made this available as a simple extension and just another language that one can use with the same parallel data engine, and that's been very successful so far.

Impact on cloud computing

Gardner: Let's look at how this impacts one of the hot topics of the day, and that's cloud computing, the idea that sourcing of resources can come from a variety of organizations. You're not just going to get applications as a service or even Web services, but increasingly infrastructure functionality as a service.

Does this parallelization, some of these new approaches to programming, and the ability to scale have an impact on how well organizations can start taking advantage of what's loosely defined as cloud computing? Let's start with you, Joe.

Hellerstein: I'm not quite sure how this is going to play out. There are a couple of questions about how an individual organization's data will end up in the cloud. Inevitably it will, but in the short-term, people like to keep their data close, particularly database data that's traditionally been in the warehouses, very carefully managed. Those resources are very carefully protected by people in the organization.

It's going to be some time until we really see everybody's data warehouses up in the cloud. That said, as services move into the cloud, the data that those services spit out and generate, their log files, as well as the data that they're actually managing, are going to be up in the cloud anyway.

So, there is this question of, how long will it be until you really get big volumes of data in the cloud. The answer is that certainly new applications will be up there. We may start to see old data getting uploaded in the cloud as well.

There's another class of data that's already becoming available in the cloud. There is this recent announcement from Amazon that they are going to make some large data sets available on their platform for public access. I think we'll see more of this, of data as a utility that's provided by third parties, by governments, by corporations, by whomever has data that they want to share.

We'll start to see big data sets up there that don't necessarily belong to anyone, and they are going to be big. In that environment, you can imagine big data analytics will have to run in the cloud, because that's where the data will be.

One of the fun things about the cloud that's really exciting is the elasticity of the resources. You don't buy yourself a data center full of machines, but you rent as many machines as you need for a task.

If you have a task that's going to look at a lot of data, you would rent a lot of machines for a few hours, and then you would shrink your pool. What this is going to allow people to do is that even small organizations may, for a short period of time, look at an enormous amount of data, which perhaps doesn't originate in their own data production environment, but is something that they want to utilize for their purposes.

There is going to be a democratization of the ability to take advantage of information, and it comes from this ability to share these compute resources, as well as the actual content, in a temporary way.

Gardner: Let's go to Robin on that. It seems that there is a huge potential payoff if, as Joe mentioned, you can gather data from a variety of sources, perhaps not in your own applications, not your own infrastructure and/or legacy, but go out and rent or borrow some data, but then do some very interesting things with it. That requires joins, that requires us to relate data from one cloud to another or to suck it into one cloud, do some wonderful magic-dust pixie sprinkling on it, and then move along.

How do you view this problem of managing boundaries of clouds, given that there is such a potential, if we could do it well, with data?

Looking at networks

Bloor: There would have to be, because you are looking at a technical problem, and you really are going to have to have specific interfaces for doing that, especially if you are joining data across clouds. Let's drop the word "cloud" and just think large network, because everything that is representative of the cloud ultimately comes down to being somewhat of a larger network.

When you've got something very large, like what Google and Amazon have, then you have this incredible flexibility of resources. You can push resources in or redeploy these resources very, very effectively. But you're not going to be able to do joins across data heaps in one cloud and another cloud, and in perhaps a particular network without there being interfaces that allow you to do that, and without query agents sitting in those particular clouds that are going to go off and do the work. You're going to care very much as to how fast they do that work as well.

This is going to be a job for big engines like Greenplum, rather than your average relational database, because your average relational database is going to be very slow.

Also, you have to master the join. In other words, the result has to arrive somewhere, and be brought together. There are a number of technical issues that are going to have to be addressed, if we're going to do this effectively, but I don't see anything that stops it being done. We have the fast networks to enable this. So, I think it can be done.
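As a rough illustration of the pattern Bloor describes, with query agents doing the work where the data sits and a master bringing the results together, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the sample rows, and the hash-partitioning scheme are hypothetical, the "agents" are ordinary function calls rather than remote processes, and it is not a description of how Greenplum or any other engine implements joins.

```python
from collections import defaultdict


def partition(rows, n_agents):
    """Route each row to an agent by hashing its join key (the first field)."""
    parts = [[] for _ in range(n_agents)]
    for row in rows:
        parts[hash(row[0]) % n_agents].append(row)
    return parts


def local_join(customers, orders):
    """Hash join of two partitions that hold the same slice of keys."""
    names_by_id = defaultdict(list)
    for cust_id, name in customers:
        names_by_id[cust_id].append(name)
    return [
        (cust_id, name, amount)
        for cust_id, amount in orders
        for name in names_by_id.get(cust_id, [])
    ]


def distributed_join(customers, orders, n_agents=3):
    """Partition both inputs, join each slice separately, then gather."""
    cust_parts = partition(customers, n_agents)
    order_parts = partition(orders, n_agents)
    # Each agent joins only its own slice; a real system would run these in parallel.
    partials = [local_join(c, o) for c, o in zip(cust_parts, order_parts)]
    # The master gathers the partial results into one sorted answer.
    return sorted(row for part in partials for row in part)


# Hypothetical data: customers held in one network, orders in another.
customers = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
orders = [(1, 20.0), (3, 5.0), (1, 7.5), (4, 9.0)]
result = distributed_join(customers, orders)
```

Because both inputs are partitioned on the same key, matching rows always land on the same agent, so no agent needs to see the whole data set. The cost Bloor points to shows up in the two steps this sketch glosses over: shipping rows to the right agent and shipping partial results back to the master.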

Gardner: Luke, last word goes to you. I don't expect you to pre-announce necessarily, but how do you, from Greenplum's perspective, address this need for joining, but recognizing it's a difficult technical problem?

Lonergan: Well, the cloud really manifests itself as a few different things to us. Joe was talking about how people are going to be putting, and are already putting, a lot of services up in the cloud that are generating a lot of new data. That requires that the kinds of data analysis, as Robin was hinting at, scale to meet that demand.

We already have the engine that implements those kinds of join-between-networks abilities. So we are cloud capable. The real action is going to be when people start to do business that counts on public clouds to function properly, and are generating enormous amounts of very valuable data that requires the kind of parallel compute that we provide.

Joining inside clouds, using cloud resources to do the kind of data analysis work, this is all happening as we speak, and this is another aspect of what's forcing the change from an earlier paradigm of database to the modern massively parallel one.

Gardner: I just want to wrap up quickly now. Thank you. Joe Hellerstein, you mentioned earlier on Moore's Law and how it stalled a bit on the silicon. Are we going to see a similar track, however, to what we did with processing over the last 15 years -- a rapid decrease in the total cost associated with these tasks? Even if we don't necessarily see the same effect in terms of the computing, are we going to be able to do what we've been describing here today at an accelerating decreased total cost?

Hellerstein: Absolutely. The only barrier to this is the productivity of programmers.

Just think about storage. I have a terabyte disk in my basement that holds videos, and it costs $100 or so at Amazon.com. Ten years ago a terabyte was referred to by the experts in the field as a "terror byte." That's how worried people were about data volumes like that.

We'll see that again. Disk densities show no signs of slowing down. So, data is going to be essentially no cost. The data-gathering infrastructure is also going to be mechanized. We're going through what I call the industrial revolution of data production. We're just going to build machines to generate data, because we think we can get value out of that data, and we can store it essentially for free.

The compute cost of multi-core with parallelism is going to continue Moore's Law. It's just going to continue it in a parallel programming environment. If we can get all those cores looking at all that data, it won't cost much to do that, and the cost of that will continue to shrink by half.

The only real barrier to the process is to make those systems easy to program and manageable. Cloud helps somewhat with manageability, and programming environments like SQL and MapReduce are well-suited to parallelism. We're going to just see an enormous use of data analysis over time. It's just going to grow, because it gets cheaper and cheaper and bigger and bigger.

Gardner: Well, great, that's very exciting. We've been discussing advances in parallel processing using multi-core chipsets and how that's prompted new software approaches such as MapReduce that can handle these large data sets, as we have just pointed out, at surprisingly low total cost.

I want to thank our panel for today. We have been joined by Joe Hellerstein, professor of computer science at UC Berkeley, and I should point out also an adviser at Greenplum. Thank you for joining, Joe.

Hellerstein: It was a pleasure.

Gardner: Robin Bloor, analyst and partner at Hurwitz & Associates. I appreciate your input, Robin.

Bloor: Yeah, it was fun.

Gardner: Luke Lonergan, CTO and co-founder at Greenplum. Thank you, sir.

Lonergan: Thanks, Dana.

Gardner: This is Dana Gardner, principal analyst at Interarbor Solutions. You have been listening to a sponsored podcast from BriefingsDirect. Thanks and come back next time.


Transcript of BriefingsDirect podcast on new technical approaches to managing massive data problems using parallel processing and MapReduce technologies. Copyright Interarbor Solutions, LLC, 2005-2009. All rights reserved.

Tuesday, December 16, 2008

MapReduce-scale Analytics Change Business Intelligence Landscape as Enterprises Mine Ever-Expanding Data Sets

Transcript of BriefingsDirect podcast on new computing challenges and solutions in data processing and data management.

Listen to the podcast. Download the podcast. Find it on iTunes/iPod. Learn more. Sponsor: Greenplum.

Dana Gardner: Hi, this is Dana Gardner, principal analyst at Interarbor Solutions, and you're listening to BriefingsDirect. Today, we present a sponsored podcast discussion on the architectural response to a significant and fast-growing class of new computing challenges. We will be discussing how Internet-scale data sets and Web-scale analytics have placed a different set of requirements on software infrastructure and data processing techniques.

Following the lead of such Web-scale innovators as Google, and through the leveraging of powerful performance characteristics of parallel computing on top of industry-standard hardware, we are now focusing on how MapReduce approaches are changing business intelligence (BI) and the data-management game.

More types of companies and organizations are seeking new inferences and insights across a variety of massive datasets -- some into the petabyte scale. How can all this data be sifted and analyzed quickly, and how can we deliver the results to an inclusive class of business-focused users?

We'll answer some of these questions and look deeply at how these new technologies will produce the payback from cloud computing and massive data mining and BI activities. We'll discover how the results can quickly reach the hands of more decision makers and strategists across more types of businesses.

While the challenge is great, the new value for managing these largest data sets effectively offers deep and powerful new tools for business and for social and economic progress.

To provide an in-depth look at how parallelism, modern data infrastructure, and MapReduce technologies come together, we welcome Tim O’Reilly, CEO and founder of O’Reilly Media, and a top influencer and thought leader in the blogosphere. Welcome, Tim.

Tim O’Reilly: Hi, thanks for having me.

Gardner: We're also joined by Jim Kobielus, senior analyst at Forrester Research. Thank you, Jim.

Jim Kobielus: Hi, Dana. Hi, everybody.

Gardner: Also, Scott Yara, president and co-founder at Greenplum. Welcome, Scott.

Scott Yara: Thank you.

Gardner: We're still dealing with oceans of data, even though we have harsh economic times. We see reduction in some industries, of course, but the amount of data and need for analytics across the Internet is still growing rapidly. BI has become a killer application over the past few years, and we're now extending that beyond enterprise-class computing into cloud-class computing.

I want to go to Jim Kobielus first. Jim, why has this taken place now? What is happening in the world that is simultaneously creating these huge data sets, but also making necessary even better analytics across more businesses?

Kobielus: Thanks, Dana. A number of things are happening or have been happening over the past several years, and the trend continues to grow. In terms of the data sets, it’s becoming ever more massive for analytics. It’s equivalent to Moore’s Law, in the sense that every several years, the size of the average data warehouse or data mart grows by an order of magnitude.

In the early 1990s or the mid 1990s, the average data warehouse was in gigabytes. Now, in the mid to late 2000s, it's in the terabytes. Pretty soon, in the next several years, the average data warehouse will be in the petabyte range. That’s at least a thousand times larger than the current middle-of-the-road data warehouse.

Why are data warehouses bulking up so rapidly? One key thing is that organizations, especially in tough times when they're trying to cut costs, continue to consolidate a lot of disparate data sets into fewer data centers, onto fewer servers, and into fewer data warehouses that become ever-more important for their BI and advanced analytics.

What we're seeing is that more data warehouses are becoming enterprise data warehouses and are becoming multi-domain and multi-subject. You used to have tactical data marts, one for your customer data, one for your product data, one for your finance data, and so forth. Now, the enterprise data warehouse is becoming the be all and end all -- one hub for all of those sets.

What that means is that you have a lot of data coming together that never needed to come together before. Also, the data warehouse is becoming more than a data warehouse. It's becoming a full-fledged content warehouse, not just structured relational data, but unstructured and semi-structured data -- from XML, from your enterprise content management (ECM) system, from the Web, from various formats, and so forth. It's coming together and converging into your warehouse environment. That’s like the bottom of the iceberg that’s coming up, you're seeing it now, and it's coming into your warehouse.

Also, because of the Web 2.0 world and social networking, a lot of the customer and market intelligence that you need is out there in blogs, RSS feeds, and various formats. Increasingly, that is the data that enterprises are trying to mine to look for customers, marketing opportunities, cross-sell opportunities, and clickstream analysis. That’s a massive amount of data that’s coming together in warehouses, and it's going to continue to grow in the foreseeable future.

Gardner: Let’s go to Tim O’Reilly. Tim, from your perspective, what has changed over the past 10 or 20 years that makes these datasets so important?

Long-term perspective

O'Reilly: If you look at what I would call Web 2.0 in a long-term historical perspective, in one sense it's a story about the evolution of computing.

In the first age of computing, business models were dominated by hardware. In the second age, they were dominated by software. What started to happen in the 1990s, underneath everybody’s nose, but not understood and seen, was the commodification of software via open industry standards. Open source started to create new business models around data, and, in particular, around network applications that built huge data sets through user participation. That’s the essence of what I call Web 2.0.

Look at Google. It's a BI company, based on massive data sets, where, first of all, they are spidering all the activity off of the Web, and that’s one layer. Then, they do this detailed analysis of the link structure of that Web, and that’s another layer. Then, they start saying, "Well, what else can we find?" They start looking at clickstream data. They start looking at browsing history, and where people go afterward. Think of all the data. Then, they deliver service against that.

That’s the essence of Web 2.0, building a massive data set, doing real-time analytics against it, and then figuring out what services you can deliver. What’s happening today is that movement is transferring from the consumer Web into business. People are starting to realize, "Oh, the companies that are doing better are better with their data."

A great example of that is Wal-Mart. You can think of Wal-Mart as a Web 2.0 company. They've got end-to-end analytics in the same way that Google does, except they're doing it with stuff. Somebody takes something off the shelf at Wal-Mart and rings it up. Wal-Mart knows, and it sends a signal downstream to the supplier.

We need to understand that this move to real-time understanding of data at massive scale is going to become more and more important as the lever of competitive advantage -- not just in computer businesses, but in all businesses. Data warehousing and analytics aren't just something that you do in the back office and it's a nice-to-have. It's the very essence of competitive advantage moving forward.

When we think about where this is going, we first have to understand that everybody is connected all the time via applications, and this is accelerating, for example, via mobile. The need for real-time analytics against massive data sets is universal.

Look at some of the things that are happening on the phone. Okay, where am I? What data is relevant to me right now, because you know where I am? Speech recognition is starting to come into focus on the phone. Again, it's a massive data problem, integrating not only speech recognition, but also local dialogs. Oh, wait, local again, you start to see some cross connections between data streams that will help you do better.

I was even talking with someone from Nuance about why Google is able to do some interesting things in the particular domain of search and speech recognition. It’s because they're able to cross-correlate two different data sets -- the speech data set and the search data set. They say, "Okay, yeah, when somebody says that, they are most likely looking for this, because we know that when they type, they also are most likely looking for that." So this idea of cross-correlation between data sets is starting to come up more and more.

This is a real frontier of competitive advantage. You look at the way that new technologies are being explored by startups. So many of the advantages are in data.

A great example is the company where I'm on the board. It's called Wesabe. They're a personal finance application. People upload their bank statements or give Wesabe information to upload their bank statements. Wesabe is able to do customer analytics for these guys, and say, "Oh, you spent so much on groceries." But, more than that, they're able to say, "The average person who shops at Safeway, spends this much. The average person who shops at Lucky spends this much in your area." Again, it's a massive data problem. That’s the heart of their application.

Now, you think the banks are going to get clued into this and they are going to start to say, "Well, what services can we offer?" Phone companies: "What services can we offer against our data?"

One thing that’s going to happen is the migration of all the BI competencies from the back office to the front office, from being something that you do and generate reports from, to something that you actually generate real-time services from. In order to do that, you've absolutely got to have high performance at massive scale.

Second, a lot of these data sets are not the old-fashioned data sets where it was simply structured data.

Gardner: Let’s go to Scott Yara. Scott, we need this transformation. We need this competitive differentiation and new, innovative business approaches by more real-time analytics across larger sets and more diverse sets of content and inference. What’s the approach on the solution side? What technologies are being brought to bear, and how can we start dealing with this at the time and scale that’s required?

A big shift

Yara: Sure. For Greenplum, one of the more interesting aspects of what’s going on is that big technology concepts and ideas that have really been around for two or three decades are being brought to bear, because of the big shift that Tim alludes to, and we are big believers. We're now entering this new cycle, where companies are going to be defined by their ability to capture and make use of the data and the user contributions that are coming from their customers and community. That is really being able to make parallel computing a reality.

We look at the other major computing trend today, and it’s a very mainstream thing like virtualization. Well, virtualization itself was born on the mainframe well over 30 years ago. So, why is virtualization today, in 2008, so important?

Well, it took this intersection of major trends. You had x86 and, as Tim mentioned, the commoditization of both hardware and software, and x86 and multi-core machines became incredibly cheap. At the same time, you had a high-level business trend, an industry trend. The rising cost of data centers and power became so significant that CIOs had to think about the efficiency of their data centers and their infrastructure and what could lower the cost of computing.

If you look at running applications on a much cheaper and much more efficient set of commodity systems and consolidating applications through virtualization, that would be a really compelling thing, and we've seen a multi-billion dollar industry born of that.

You're seeing the same thing here, because businesses are now driven by Web 2.0, by the success of Google, and by their own use of the Web, and they are realizing how important data is to their own businesses. That’s become a very big driver, because it turns out that parallel computing, combined with commodity hardware, is a very disruptive platform for doing large-scale data analysis.

The fact is that you can take very, very cheap machines, as Google has shown -- off-the-shelf PCs -- and, with the right software, combine hundreds, thousands, and tens of thousands of them to deliver analytics at a scale that people couldn’t do before. It’s that confluence and that intersection of market factors that's actually making this whole thing possible.

While parallel computing has been around for 30 years, the timing has become such that it’s now having an opportunity to become really mainstream. Google has become a thought leader in how to do this, and there are a lot of companies creating technologies and models that are emblematic of that.

But, at the end of the day, the focus is on software that is purpose-built to provide parallelism out of the box. This allows companies to sift through huge amounts of data, whether structured or unstructured data. All the fault tolerance, all the parallelism, all those things that you need are done in software, so that you choose off-the-shelf hardware from HP, IBM, Dell, and white-box systems. That’s a model that's as disruptive a shift as client-server and symmetric multiprocessing (SMP) computing were to the mainframe.

Gardner: Jim Kobielus, speak to this point of moving the analytic results, the fruits of this impressive engine and architectural shift from the back office to the front office. This requires quite a shift in tools. We're not going to have those front-office folks writing long SQL queries. They're not going to study up on some of the traditional ways that we interact with data.

What’s in the offing for development, so developers can create applications that target this data, now that it’s in a format that we can get out and is cross-pollinated in huge data sets that are themselves diverse? What’s in store for app dev, and what’s in store for the business-strategist type of user who is looking for a graphical way in?

Self-service paradigm

Kobielus: One thing we're seeing in the front-end app development is, to take Tim’s point even further, it’s very much becoming more of a Web 2.0 user-centric, self-service development paradigm for analytics.

Look at the ongoing evolution of the online analytical processing (OLAP) market, for example. There are things going on in terms of user self-service: development of data mining and advanced analytic applications within the browser and within the spreadsheet. Users can pull data from various warehouses and marts, and online transaction processing (OLTP) systems, but in a visual, intuitive paradigm.

That can cache a lot of that information in the front-end -- in other words, on the desktop or in the mobile device -- and allows the user to graphically build ever-richer reports and dashboards, and then be able to share that all out to the others in their teams. You can build a growing and collective analytical knowledge base that can be shared. That whole paradigm is coming to the fore.

At Forrester, we published a number of reports on it. Recently, Boris Evelson and I looked at the next generation of OLAP technology. One very important initiative to look at is what Microsoft is doing with Project Gemini. They're still working on that, but they demoed it a couple of months ago at their BI show.

The front office is the actual end user, and power users are the ones who are going to do the bulk of the BI and analytics application development in this new paradigm. This will mean that for the traditional high priesthood of data modelers and developers and data mining specialists, more and more of this development will be offloaded from them, so they can do more sophisticated statistical analysis, and so forth.

The front office will do the bulk of the development. The back office -- in other words, the traditional IT data-modeling professionals -- will be there. They'll be setting the policies and they'll be providing the tooling that the end users and the power users will use to build applications that are personalized to their needs.

So IT then will define the best practices, and they'll provide the tooling. They'll provide general coaching and governance around all of the user-centric development that will go on. That’s what’s going to happen.

It’s not just Microsoft. You can look at the OLAP tooling, more user-centric in-memory spreadsheet-centric approaches that IBM, Cognos, Oracle, and others are rolling out or have already rolled out in their product sets. This is where it’s all going.

Gardner: Tim O’Reilly, in the past, when we've opened up more technological power to more people, we've often encountered much greater innovation, unpredictably so. Should we expect some sort of a wisdom-of-crowd effect to come into play, when we take more of these data sets and analytic tools and make them available?

O'Reilly: There's a distinction between the wisdom of crowds and collective intelligence. The wisdom-of-crowds thesis, as expounded by Surowiecki, is that if you get a whole bunch of people independently, really independently, to weigh in on some subject, their average guess is better than any individual expert's. That’s really about a certain kind of quantitative stuff.

But, there's also a machine-learning approach in which you're not necessarily looking for the average, but you're finding different kinds of meaning in data. I think it’s important to distinguish those two.

Google realized that there was meaning in links that every other search engine of the day was throwing away. This was a way of harnessing collective intelligence, but it wasn’t just the wisdom of crowds. This was actually an insight into the structure of the data and the meaning that was hidden in it.

The breakthroughs are coming from the ability of people to discern meaning in data. That meaning sometimes is very difficult to extract, but the more data you have, the better you can be at it.

A great example of this recently is from the last election. Nate Silver, who ran 538.com, was uncannily accurate in calling the results of the election. The reason he was able to do that was that he looked at everybody’s polls, but didn’t just say, "Well, I'm just going to take the average of them." He used all kinds of deep thinking to understand, "Well, what’s the bias in this one. What’s the bias in that one?" And, he was able to develop an algorithm in which he weighted these things differently.
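The difference between a plain average and the kind of weighting O'Reilly describes can be sketched in a few lines of Python. This is only an illustration: the poll numbers, the bias estimates, and the choice to weight by sample size are all made up for the example, and nothing here is Silver's actual method.

```python
def simple_average(polls):
    """Treat every poll as equally trustworthy."""
    return sum(p["margin"] for p in polls) / len(polls)


def weighted_average(polls):
    """Subtract each poll's assumed bias, then weight by sample size."""
    total_weight = sum(p["sample_size"] for p in polls)
    adjusted = sum((p["margin"] - p["bias"]) * p["sample_size"] for p in polls)
    return adjusted / total_weight


# Hypothetical polls: margin and bias are in percentage points.
polls = [
    {"margin": 4.0, "bias": 1.0, "sample_size": 1000},
    {"margin": 1.0, "bias": -1.0, "sample_size": 500},
    {"margin": 6.0, "bias": 2.0, "sample_size": 500},
]

plain = simple_average(polls)       # every poll counts the same
weighted = weighted_average(polls)  # bias-corrected, larger polls count more
```

With these invented numbers the plain average is about 3.67 points and the weighted one is 3.0, which is the point of the passage: the same polls give a different answer once you model how much each one should be trusted.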

Gardner: I suppose it’s important for us to take the ability to influence the algorithms that target these advanced data sets and put them into the hands of the people that are closer to the real business issues.

More tools are critical


O'Reilly: That’s absolutely true. Getting more tools for handling larger and more complex data sets, and in particular, being able to mix data sets, is critical.

One of the things that Nate did that nobody else did was that he took everybody’s polls and then created a meta-poll.

Another example is really interesting. You guys probably are familiar with the Netflix Challenge, where Netflix has put up a healthy sum of money to whoever can improve their recommendation algorithm by 10 percent. What’s interesting is that people seem to be stuck at about 8 percent, and they haven’t been able to get the last couple of percent.

It occurred to me in a conversation I was having last night that the breakthroughs will come, not by getting a better algorithm against the Netflix data set, but by understanding some other data set that, when mixed with the Netflix data set, will give better predicted results.

Again, that tells us something about the future of data mining and the future of business intelligence. It is larger, more complex, and more diverse data sets in which you are able to extract meaning in new ways.

One other thing. You were talking earlier about the democratization of these tools. One thing I don’t want to pass by is a comment that was made recently by Joe Hellerstein, who is a computer science professor at UC Berkeley. It was one of those real wake-up-and-smell-the-coffee moments. He said that at Berkeley, every freshman student in CS is now being taught Hadoop. SQL is an elective for seniors. You say, "Whoa, that is a fundamental change in our thinking."

That’s why I think what Greenplum is doing is really interesting, trying to marry the old BI world of SQL with the new business intelligence world of these loose, unstructured data sets that are often analyzed with a MapReduce kind of approach. Can we bring the best of these things together?
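To make the SQL-versus-MapReduce contrast concrete, here is a small Python sketch that computes the same per-visitor click count two ways: once with explicit map and reduce steps over raw log lines, and once with a SQL GROUP BY over the same data loaded into a table. The log format, the names, and the use of an in-memory SQLite database are assumptions chosen so the example is self-contained. It says nothing about Greenplum's own interfaces, and a real MapReduce system would spread both phases across many machines.

```python
import sqlite3
from collections import defaultdict

# Hypothetical raw log lines in the form "visitor page".
log_lines = ["ann /home", "bob /home", "ann /cart", "ann /home", "bob /checkout"]


def map_phase(lines):
    """Emit a (visitor, 1) pair for every log line."""
    for line in lines:
        visitor, _page = line.split()
        yield visitor, 1


def reduce_phase(pairs):
    """Sum the emitted values for each key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


mapreduce_counts = reduce_phase(map_phase(log_lines))

# The same question asked in SQL, once the lines are parsed into rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (visitor TEXT, page TEXT)")
conn.executemany("INSERT INTO clicks VALUES (?, ?)", [line.split() for line in log_lines])
sql_counts = dict(conn.execute("SELECT visitor, COUNT(*) FROM clicks GROUP BY visitor"))
conn.close()
```

Both routes arrive at the same counts. The practical difference is where the structure lives: the SQL version needs the data parsed into a table first, while the map function can be pointed directly at loosely structured input.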

That fits with this idea of crossing data sets being one of the new competencies that people are going to have to get better at.

Kobielus: If I can butt in here just one moment, I want to tie into something that Tim just said, that I said a little bit earlier. One important thing is that when you add more data sets to, say, your analytic environment, it gives you the potential to see more cross-correlations among different entities or domains. So, that’s one of the value props for an all-encompassing or more multi-domain enterprise data warehouse.

Before, you had these subject-specific marts -- customer data here, product data there, finance data there -- and you didn’t have any easy way to cross-correlate them. When you bring them all together into a common repository, implementing common dimensions and hierarchies, and conforming with common metadata, it makes it a whole lot easier for the data miners, the power users, and the end users, to build the applications that can tie it all together.

There is the "aha" moment. "Aha, I didn’t realize all these hooked up in these various ways." You can extract more meaning by bringing it all together into a unified, enterprise data warehouse.

Gardner: To you, Scott Yara. There's a great emphasis here on bringing together different data sets from disparate sources, with entirely different technologies underlying them. It's not a trivial problem. It’s not a matter of scale necessarily.

What do you see as the potential? What is Greenplum working on to allow folks to mix and match in such a way that the analytics can be innovative and game-changing in a harsh economic environment?

Price/performance improvement

Yara: A couple of things. One, I definitely agree with the assertion that analysis gets easier the more data you have. Whether those are heterogeneous data sets or just the scale of data that people can collect, it's fundamentally easier, cheaper.

In general, these businesses are pretty smart. The executives, analysts, or people that are driving business know that their data is valuable and that insight in improving customer experience through data is key. It’s just really hard and expensive, and that has made it prohibitive for a long, long time.

Now, we're talking about using parallel computing techniques, open-source software, and commodity hardware. It’s literally a 10- to 100-fold improvement in price performance. When the cost of data analysis comes down 10 to 100 times, that’s when new things become possible.

O'Reilly: Absolutely.

Yara: We see lots of customers now from the New York Stock Exchange. These are all businesses that are across vertical industries, but are all affected by the Web and network computing at some level.

Algorithmic trading is driving financial services in a way that we haven’t seen before. They're processing billions of trades every day. Whether it's security, surveillance, or real-time support that they need to provide to very large trading companies, that ability to mine and sift through billions of transactions on a real-time basis is acute.

We were sitting down with one of our large telecom customers yesterday, and there was this convergence that Tim’s talking about. You've got companies with very large mobile carrier businesses. They're broadband service providers, fixed-line service providers, and Internet companies.

Today, the kind of basic personalization that companies like Amazon, eBay, or Google do, telecom carriers are just at the beginning of trying to do that. They have to aggregate the consumer event stream from all these disparate communication systems, and it’s at massive scale.

Greenplum is solely focused on making that happen and mixing the modalities of data, as Tim suggested. Whether it’s unstructured data, whether those are things that exist in legacy databases, or whether you want to mix and match SQL or MapReduce, fundamentally you need to make it easy for businesses to do those things. That’s starting to happen.

Gardner: I suppose part of the new environment that we are in economically is that incremental change is probably not going to cut it. We need to find new forms of revenue and be able to attain them at a very low cost, upfront if possible, and be transformative in how we can take our businesses out through the public networks to reach more customers and give them more value.

Now that we've established that we have these data sets, we can combine them to a certain degree, and that will improve over time. What are the ways in which companies can start actually making money in new ways using these technologies?

Apple’s Genius comes to mind for me as a way of saying, "Okay, you pick a song in your iTunes library, and we're going to use our data and our analytics, and come back with some suggestions on what you might like as a result of that." Again, this is sort of a first go at this, but it opens my eyes to a lot of other types of business development opportunities. Any thoughts on this, Tim O’Reilly?

O'Reilly: In general, as I said earlier, this is the frontier of competitive advantage. Sure, iTunes has Genius, but it's the same thing with Netflix recommendations. Amazon has been doing this for years. It's part of their competitive advantage. I mentioned earlier how this is starting to be a force in areas like banking. Think about phone companies and all of the opportunities for new local services.

Not only that, one of my pet hobbyhorses is that phone companies have this call-history database, but they're not building new services for users against it. Your phone still only remembers the last few people that you called. Why can’t I do a search against somebody I talked to three months ago? "Who the heck was that? Was it a guy from this company?" You should be able to search that. They've got the data.
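The call-history search O'Reilly describes could look roughly like the following. This is a minimal sketch, assuming a hypothetical `CallRecord` schema and a hypothetical `search_calls` helper; real carrier call-detail records and query interfaces would differ.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CallRecord:
    # Hypothetical schema for illustration only.
    day: date
    number: str
    contact: str  # contact name, if known
    company: str  # contact's company, if known


def search_calls(history, start, end, text=""):
    """Return calls between start and end (inclusive) whose contact
    or company contains `text`, ignoring case."""
    needle = text.lower()
    return [
        call for call in history
        if start <= call.day <= end
        and (needle in call.contact.lower() or needle in call.company.lower())
    ]


# Made-up sample data.
history = [
    CallRecord(date(2008, 9, 12), "555-0101", "Pat Lee", "Acme Corp"),
    CallRecord(date(2008, 9, 30), "555-0102", "Sam Roy", "Globex"),
    CallRecord(date(2008, 12, 1), "555-0103", "Kim Park", "Acme Corp"),
]

# "Who was that person from Acme I talked to back in September?"
matches = search_calls(history, date(2008, 9, 1), date(2008, 9, 30), "acme")
```

At carrier scale the same filter would run as a query inside the database rather than over an in-memory list; the point is only that the data needed to answer the question is already there.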

So, as I said earlier, the frontier is turning the back office into new user-facing services, and having the analytics in place to be able to do that meaningfully at scale in real-time. This applies to supply chains. It applies to any business that has data that gets better through user interaction.

This is the lesson of the Web. We saw it first in Web applications. I gave you the example earlier of Wal-Mart. They realized, "Oh, wait a minute. Every time somebody buys something, it’s a vote." That’s the same point that Wesabe is trying to exploit. A credit card statement is a voting list.

I went to this restaurant once. That doesn’t necessarily mean anything. If I go back every week, that may mean something. I spend on average this much. It’s going up. That means something. I spend on average this much. It’s going down, and that means something. So, it's about finding meaning in the data that I already have: How could this be useful not just to me but to my users and my customers, and what services could I build?
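The "voting list" reading of a statement can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical `merchant_signals` helper and made-up transactions; comparing the average of the earlier half of purchases with the later half is just one simple way to define a trend, not a method described by the speakers.

```python
from collections import defaultdict
from statistics import mean


def merchant_signals(transactions, min_visits=3):
    """Summarize repeat visits and the spending trend per merchant.

    `transactions` is an iterable of (merchant, amount) pairs in
    chronological order. Merchants with fewer than `min_visits`
    purchases are skipped, since a single visit says little.
    """
    by_merchant = defaultdict(list)
    for merchant, amount in transactions:
        by_merchant[merchant].append(amount)

    signals = {}
    for merchant, amounts in by_merchant.items():
        # At least two purchases are needed to compare halves.
        if len(amounts) < max(min_visits, 2):
            continue
        half = len(amounts) // 2
        earlier, later = mean(amounts[:half]), mean(amounts[half:])
        if later > earlier:
            trend = "up"
        elif later < earlier:
            trend = "down"
        else:
            trend = "flat"
        signals[merchant] = {"visits": len(amounts), "trend": trend}
    return signals


# Made-up statement: one repeat merchant, one one-off visit.
statement = [
    ("Cafe Uno", 20.0),
    ("Bistro", 45.0),
    ("Cafe Uno", 24.0),
    ("Cafe Uno", 30.0),
    ("Cafe Uno", 34.0),
]
signals = merchant_signals(statement)
```

Here the single visit to the hypothetical "Bistro" is ignored, while the repeat visits to "Cafe Uno" register as a merchant with rising spend.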

This is the frontier, particularly in the world that we are entering, in which computing is going mobile, because so many of the mobile services are fundamentally going to be driven by BI. You need to be able to say in real-time or close to real-time, "This is the relevant data set for this person based on where they are right now."

Needed: future view


Kobielus: I want to underline what Tim just said. Traditionally, data warehouses existed to provide you with perfect hindsight on the customer -- historical data, massive historical data, hopefully on the customer, and that 360 degree view of everything about the customer and everything they have ever done in the past, back to the dawn of recorded time.

Now, it’s coming down to managing that customer relationship and evolving and growing with that relationship. You have to have not so much a past or historical view, but a future view on that customer. You need to know that customer and where they are going better than they know themselves.

In other words, that’s where the killer app of the online recommendation engine becomes critical. Then, the data warehouse, as the platform for recommendation engines, can take not only the historical data that persists, but also the continuing streams of real-time event data on pricing, on customer interaction in various channels -- be it on the Web or over the phone or whatever -- on customer transactions that are going on now, and on things and events that are going on in the customer’s social network.

Then, you feed that all into a recommendation engine, which is a predictive-analytics model running inside the data warehouse. That can optimize that customer’s interaction at every touch point. Let’s say they're dealing with a call-center person live. The call-center person knows exactly how the world looks to that customer right now and has a really good sense for what that customer might need now or might need in three months, six months, or a year, in terms of new services or products, because other customers like them are doing similar things.
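The "other customers like them" idea can be illustrated with a very small similarity-based recommender. This is a sketch under stated assumptions: the `jaccard` and `recommend` helpers, the customer names, and the product names are all hypothetical, and the in-memory set-overlap heuristic stands in for whatever predictive model would actually run inside a data warehouse.

```python
def jaccard(a, b):
    """Overlap between two sets of products, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(customer, holdings, top_n=3):
    """Rank products the customer lacks by the summed similarity
    of the other customers who already have them."""
    mine = holdings[customer]
    scores = {}
    for other, theirs in holdings.items():
        if other == customer:
            continue
        similarity = jaccard(mine, theirs)
        if similarity == 0:
            continue  # nothing in common, so no evidence either way
        for product in theirs - mine:
            scores[product] = scores.get(product, 0.0) + similarity
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [product for product, _ in ranked[:top_n]]


# Made-up service holdings per customer.
holdings = {
    "ann": {"voice", "data"},
    "bob": {"voice", "data", "roaming"},
    "cat": {"voice", "tv"},
    "dan": {"broadband"},
}
suggestions = recommend("ann", holdings)
```

In this toy data, "roaming" ranks ahead of "tv" for "ann" because the customer who holds it shares more services with her.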

It can have recommendations being generated and scripted for the call-center agent in real-time saying, "You know what we think. We recommend that you upgrade to the following service plan because it provides you with these features that you will find useful in your lifestyle, blah, blah, blah."

In other words, it's understanding the customer in their future, in their possible future, and suggesting things to the customers that they themselves didn’t realize until you suggested them. That’s the future of analytics, and competitive advantage.

O'Reilly: I couldn’t agree more.

Gardner: Scott Yara, we've been discussing this with a little bit of a business-to-consumer (B2C) flavor. In the business-to-business (B2B) world many things are equal in a commoditized market, with traditional types of products and services.

An advantage might be that, as a supplier, I'm going to give you analytics that I can derive from data sets that you might not have access to. I might provide analytical results to you as a business partner free of charge, but as an enticement for you to continue to do business with me, when I don’t have any other way to differentiate. What do you see are some of the scenarios possible on the B2B side?

Yara: You don’t have to look much further than what Salesforce.com is doing. In a lot of ways, they're pioneering what it means to be an enterprise technology company that sells services, and ultimately data, back to their customers. By creating a common platform, where applications can be built, they are very much thinking about how the data is being aggregated on the platforms in use, not by their individual customers, but in aggregate.

You're going to see lots of cases where for traditional businesses that are selling services and products to other businesses, the aggregation of data is going to be interesting and relevant. At the same time, you have companies where even the internal analysis of their data is something they haven’t been able to do before.

We were talking about Google, which is an amazing company. They have this big vision to organize the world’s information. What the rest of the business world is finding out is that while it’s a great vision and they have a lot of data, they only have a small fraction of the overall data in the world. Telecommunication companies, financial stock exchanges, and retail companies have all of this real-world data that's not being indexed or organized by Google. These companies actually have access to amazing amounts of information about their customers and businesses.

They are saying, "Why can’t we, at the point of interaction -- like eBay, Amazon, or some of these recommendation engines -- start to take some of this aggregate information and turn it into improving businesses in the way that the Web companies have done so successfully?" That’s going to be true for B2C businesses, as well as for B2B companies.

We're just at the beginning of that. That’s fundamentally what’s so exciting about Greenplum and where we're headed.

Gardner: Jim Kobielus, who does this make sense for right away? Some companies might be a little skeptical. They're going to have to think about this. But where is the low-hanging fruit, where are the no-brainer applications for this approach to data and analytics?

Kobielus: No-brainers -- I always hate that term. It sounds like I am condescending, but low-hanging fruit should be one of those "aha!" opportunities that everybody realizes intuitively. You don’t have to explain to them, so in a sense it's a no-brainer. It’s call center -- customer-contact center.

The customer-contact center is where you touch the customer, and where you hopefully initiate, cultivate, nurture, maintain, and grow the customer relationship. It's one of the many places where you do that. There are people in your organization who are in that front-line capacity.

It doesn’t have to be just people. It could be automated programs through your Website that need to be empowered continuously with the full customer context -- the history of that customer's interactions, the customer’s current state, current sentiment and feelings, and with a full context on the customer’s likely future evolution. So, really it's the call center.

In fact, I cover data warehousing for Forrester. I talk to the data warehousing vendors and their customers about in-database analytics, where they are selling this capability right now into real-world deployment. The customer call center is, far and away -- with a bullet -- the number one place for inline analytics to drive the customer interaction in a multi-channel fashion.

Gardner: How about you, Tim O’Reilly. Where are some of the hot verticals and early adopters likely to be on this?

O'Reilly: I've already said several times, mobile apps of various kinds are probably highest on the list. But, I'm a big fan of supply chain. There's a lot to be done there, and there's a huge amount of data. There already is a BI infrastructure, but it hasn’t really been tuned to think about it as a customer-facing application. It's really more a back-office or planning tool.

There are enormous opportunities in media, if you want to put it that way. If you think about the amount of money that’s spent on polling and the power of integrating actual data, rather than stated preference, I think it's huge.

How do we actually figure out what people are going to do? There is a great marketing study. I forget who told this story, but it was about a consumer product. They showed examples of different colors. It was a boom box or something like that.

They said, "How many of you think white is the cool color, how many of you think black, how many, blah, blah, blah?" All the people voted, and then they had piles of the boom boxes by the door that the people took as their thank you gift. What they said and what they did were completely at variance.

One of the things that’s possible today is that, increasingly, we are able to see what people actually do, rather than what they say they will do or think they will do.

Gardner: We're just about out of time. Scott Yara, what’s your advice for those folks who are just getting their heads wrapped around this on how to get started? It’s not a trivial activity. It does require a great deal of concerted effort across multiple aspects of IT, perhaps more so than in the past. How do you get started, what should you be doing to get ready?

Yara: That’s one of the real advantages. In sort of an orthogonal way, the ability to create new businesses online in the age of Web 2.0 has been fundamentally cheaper and faster. Doing something disruptive inside of a business with its data has to be a fundamentally cheaper and easier thing. So not starting with the big vision of where they need to go, and starting with something tactical -- whether it lives in the call center or in some departmental application -- is the best way to get going.

There are technologies, services, and people now that you can actually peel off a real project, and you can deliver real value right away.

I agree with Tim. We're going to see a lot of activity in the mobility and telecommunication space. These companies are just realizing this. If you think about the kind of personalization that you get with almost every major Internet site today, what level of personalization do you get from your carrier, relative to how much data they have? You're going to see lots of telecom companies do things with data that will have real value.

One of our customers was saying that in the traditional old data warehousing world, where it was back office, the service level agreement (SLA) was that when a call got placed and logged, it just needed to make its way into the warehouse seven days later. Seven days from the point of origination of a call, it would make itself into a back-office warehouse.

Those are the kinds of things that are going to change, if we are going to really provide mobility, locality, and recommendation services to customers.

It's having a clear idea of the first application that can benefit from data. Call centers are going to be a good area to provide the service representative with a profile of a customer and be able to change the experience. I think we are going to see those things.

So, they're tractable problems. Start small. What held back enterprise data warehousing before was that companies were looking at these huge investments of people, capital, and infrastructure. I think that’s really changing.

Gardner: I am afraid we have to leave it there. We've been discussing new approaches to managing data, processing data, mixing data types and sets, and extracting real-time business results from that. We've looked at tools, and we've looked at some of the verticals and business advantages.

I want to thank our panel. We've been joined today by Tim O’Reilly, the CEO and founder of O’Reilly Media. Thank you Tim.

O'Reilly: Glad to do it.

Gardner: Jim Kobielus, Forrester senior analyst. Thank you Jim.

Kobielus: Dana, always a pleasure.

Gardner: Scott Yara, president and co-founder of Greenplum. Appreciate it, Scott.

Yara: Great. Thanks everybody.

Gardner: This is Dana Gardner, principal analyst at Interarbor Solutions. You've been listening to a sponsored BriefingsDirect podcast. Thanks, and come back next time.

Transcript of BriefingsDirect podcast on new computing challenges and solutions in data processing and data management. Copyright Interarbor Solutions, LLC, 2005-2008. All rights reserved.