Tuesday, July 21, 2009

Seeking to Master Information Explosion: Enterprises Gain Better Ways to Discover and Manage their Information

Transcript of a BriefingsDirect podcast on new strategies and tools for dealing with the burgeoning problem of information overload.

Listen to the podcast. Download the podcast. Download the transcript. Find it on iTunes/iPod and Podcast.com. Learn more. Sponsor: Hewlett-Packard.

Join a free HP Solutions Virtual Event on July 28 on four main IT themes. Learn more. Register.

Dana Gardner: Hi, this is Dana Gardner, principal analyst at Interarbor Solutions, and you’re listening to BriefingsDirect.

Today, we present a sponsored podcast discussion on how enterprises can better manage the explosion of information around them. Businesses of all stripes need better means of access, governance, and data lifecycle best practices, given the vast ocean of new information coming from many different directions. By getting a better handle on information explosion, analysts and users gain clarity in understanding what is really going on within the businesses, and, especially these days, across the dynamic market environment.

The immediate solution approach requires capturing, storing, managing, finding, and using information better. We've all seen a precipitous drop in the cost of storage and a dramatic rise in the incidence of data from all kinds of devices and across more kinds of business processes, from sensors to social media.

To help us better understand how to best manage and leverage information, even as it’s exploding around us, we’re joined by Suzanne Prince, worldwide director of information solutions marketing at Hewlett-Packard (HP). Welcome, Suzanne.

Suzanne Prince: Thanks, Dana.

Gardner: As I mentioned, things have changed rather dramatically in the past several years, in terms of the amount of information, the complexity, and the sources of that information. From your perspective, how has the world changed for the worse when it comes to managing information?

Prince: Well, it’s certainly a change for the worse. The flood is getting bigger and bigger. You’ve touched on a couple of things already about the volume and the complexity, and it’s not getting any better. It’s getting worse by the minute, in terms of new types of information. But, more importantly, we’re noticing major shifts going on in the business environment, which are partially driven by the economy, but they were already happening anyway.

We're moving more into the collaboration age, with flatter organizations. And the way information is consumed is changing rapidly. We live in the always-on age, and we all expect and want instant access, instant gratification for whatever we want. It's just compounding the problems.

Gardner: I’m afraid there's a price to be paid if one loses control over this burgeoning level and complexity of information.

Prince: Absolutely. There are horror stories that we all regularly read in the press, ranging from massive compliance and eDiscovery fines to major losses of revenue.

I'll give you an example of an oil company that was hit by Hurricane Katrina in the Gulf of Mexico. Their drilling rigs were hit and damaged severely. They had to rebuild them, and they were ready to start pumping, but they had to regenerate the paperwork, because the environmental compliance documentation was actually on paper.

Guess what happened in the storm -- it got lost. It took them two weeks to regenerate that documentation and, in that time, they lost $200 million worth of revenue. So, there are massive numbers associated with this risk around information.

Gardner: We're talking about not just information that originates in a digital format, but information that originates in a number of different formats across a number of different modalities, from rich media to plain text. That has to be brought into a manageable digital environment.

Information is life

Prince: Absolutely. You often hear people saying that information is life -- it’s the lifeblood of an organization. But, in reality, that analogy breaks down pretty quickly, because it does not run smoothly through veins. It’s sitting in little pockets everywhere, whether it’s the paper files I just talked about that get lost, on your or my memory sticks, on our laptops, or in the data center.

Gardner: We’ve heard a lot about data management and data mining. That tends to focus on structured data, but I suppose we need to include other sorts and types of information.

Prince: Yes. The latest analyst tracker reports -- showing what type of storage is being bought and used -- reveal that the growth in unstructured content is double the growth that's going on in the structured world. It makes sense, if you think about it, because for the longest time now, IT has really focused on the structured side of data, the stuff that's in databases. But, with the growth of content that was just mentioned -- whether it's videos, tweets, or whatever -- we're seeing a massive uptick in the problems around content storage.

Gardner: While we’re dealing with a hockey stick curve on volume, I suppose that the amount of time that we have to react to markets is shrinking rapidly. We’ve had an economic storm and folks have had to adjust, perhaps cutting 30-40 percent of their businesses as quickly as possible. So, in order to react to environments that are themselves changing, we can’t wait for a batch reply on some look at information from 3-10 weeks ago.

Prince: No. That comes back to what I said previously about instant gratification. In reality, it's a necessity. Where do I shed costs? Where are the costs that I can cut and still not cut into the meat of my company? More importantly, it's all about where are my best customers? How do I focus my sales energy on my best customers? As we all know, it costs more to get a new customer than it does to retain an old one.

Gardner: Also compounding the complexity nowadays, we're hearing quite a bit about cloud computing. One of the promises of the vision around cloud computing is being able to share certain data with certain applications, certain people, and certain processes, but not others. So, we need to start managing how we allow access to data at a much more granular level.

Prince: The whole category of information governance really comes into play when you start talking about cloud computing, because we’ve already talked about the fact that we’ve got disparate sources, pockets of information throughout an organization. That’s already there now. Now, you open it up with cloud and you’ve got even more. There are quality issues, security issues, and data integration issues, because you most likely want to pull information from your cloud applications or services and integrate that within something like a customer relationship management (CRM) system to be able to pull business intelligence (BI) out.

Gardner: I just spoke with a number of CIOs last week at an HP conference, and their modus operandi these days is that they need to show a return on whatever new investments they make in a matter of one or two months. They don’t have a 12- or 18-month window for return on their activities. What are the business paybacks, when one starts to do data mining, management, cleansing, storing, the whole process? When they do it right, what do they get?

Prince: We’ve seen very good returns on investment (ROIs) ranging from 230 percent to 350 percent. We’ve seen major net benefits in the millions. And, in today’s world, the most important thing is, to get the cost out and use that cost to invest for growth. There are places you can look, where you can get cost out quite quickly.

I already mentioned one of them, which is around the costs of eDiscovery. It may not be provisioned yet in the IT budget, but may be in your legal department’s budget. They are spending millions in responding to court cases. If you put an eDiscovery solution in, you could get that cost back and then reallocate that to other projects. This is one example. Storage virtualization is another one. Also outsourcing -- look into what you could outsource and turn capital expenditure into operating expenditure.

Gardner: I suppose too that productivity, when lost, comes with a high penalty. So, getting accurate timely information in the hands of your decision makers perhaps has a rapid ROI as well, but it’s not quite as easy to measure.

Right information at the right time

Prince: No, it's not as easy to measure, but here's something quite interesting. We did a survey in February of this year in several countries around the world, covering both IT and line-of-business decision makers. The top business priority for the people we talked to, far and above everything else, was having the right information at the right time, when needed. It was above reducing operating costs, and even above reducing IT costs. What that tells us is that business managers see this need for information as business critical.

Gardner: I suppose another rationale for making investments, even in a tough budgetary environment, is regulatory compliance. One really doesn’t have a choice.

Prince: You don’t have a choice. You have to do it. The main thing is how can you do it for least cost and also make sure that you’re covering your risk.

Gardner: Well, we’ve had an opportunity to look at the problem set. What sorts of solutions can organizations begin to anticipate and put into place?

Prince: I touched on a few, when I was talking about some of the areas to look for cost savings. At the infrastructure layer, we've talked about storage. You can definitely optimize your storage -- virtualization, deduplication. You really need to look at deleting what I would call "nuisance information," so that you're not storing things you don't need to. In other words, if I'm emailing you to see if you'd like to come have a cup of coffee, that doesn't need to be stored. So, optimize your storage and your data-center infrastructure.
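To make the deduplication point concrete, here is a minimal sketch in Python of content-addressed deduplication: files that hash to the same digest are byte-for-byte identical, so all but one copy are redundant. This is an illustration only; the path is invented, and storage appliances typically deduplicate at the block or sub-file level rather than whole files.

```python
import hashlib
import os

def find_duplicates(root):
    """Group files under `root` by a SHA-256 digest of their contents."""
    by_digest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                # Hash in 1 MB chunks so large files don't fill memory.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_digest.setdefault(h.hexdigest(), []).append(path)
    # Digests with more than one path represent duplicated storage.
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}

# Hypothetical usage: report duplicate files under a file share.
for digest, paths in find_duplicates("/srv/fileshare").items():
    print(digest[:12], paths)
```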

Also, we talked about the pockets of information everywhere.

Another area to look at is content repository consolidation, or data mart consolidation. I’m talking about consolidating the content and data stores.

As an example, a pharmaceutical company that we know of has over 39 different content management solutions. In this situation, a) how do you get an enterprise view of what's going on, and b) what's the cost? So, at the infrastructure layer, it's definitely around consolidating, standardizing, and automating.

Then, at the governance layer, you need to look at data integration. You need to have a quality plan. You need to have a governance plan that brings together business and IT. This is not just an IT problem, it’s a business problem and all parties need to be at the table. You’re going to need to have your compliance officers, your legal people, and your records manager involved.

One of the most important things we believe is that IT needs to deliver information as a business-ready service. You need to be able to hide the complexity of all of that plumbing that I was talking about with those 39 different applications. You need to be able to hide that from your end users. They don’t care where information came from. They just want what they want in the format that they want it in, which is usually an Office application, because that’s what they’re most used to. You’ve got to hide the complexity underneath by delivering that information as a service.

Gardner: It sounds like an integration problem as well, given that we’re not going to take all these different types of information and data and put them into a single repository. It sounds as if we’re going to leave it where it is natively, but extract some values and some indexing and gain the ability to access it rather rapidly.

Prince: Yes, because business users, when they want things, want them quickly or they do it themselves. We all do it. Each one of us does it. "Oh, let’s get some spreadsheet going" or whatever. We will never be in a place where we have everything in one bucket. So, it’s always going to be federated. It’s always going to be a data integration issue. As I said, we really need to shield the end users from all of that and give them an easy-to-use interface at the top end.

Gardner: Are there any standards that have jumped out in recent years that seem more valuable in solving this problem than others?

No single standard

Prince: No, not really. A lot of people keep taking runs at it, and there are groups looking at it. Industry groups like ARMA are looking at records management, and AIIM is looking at information and content management. But there isn't any one particular standard that's coming out above the others. Because of the complexity underneath, and the fact that you will always have a heterogeneous environment, I would recommend open standards, so that you can play more of a plug-and-play game.

Gardner: It seems that what we were doing with information in some ways is mimicking what we have done with applications around integration and consolidation. Are there means that we have already employed in IT that can be now reused or applied to this information explosion in terms of infrastructure, service orientation, enterprise service buses, or policy engines? How does this information chore align with some of the other IT activity?

Prince: It sort of lines up. You touched on something there about the applications. What you said is exactly true. People are now looking at information as the issue. Before they would look at the applications as the issue. Now, there's the realization that, when we talk about IT, there is an "I" there that says "Information." In reality, the work product of IT is information. It’s not applications. Applications are what move it around, but, at the end of the day, information is what is produced for the business by IT.

Gardner: Years ago, when we had one mainframe that had several applications, all drawing on the same data, it wasn’t the same issue it is today, where the data is essentially divorced from the application.

Prince: Yes, and you mentioned it before. It's going to get even more so with cloud. It's going to get even more divorced.

Gardner: From HP’s perspective, what do you have to bring to the table from a methods, product, process, and people perspective? I'm getting the impression that this has to be done in totality. How do you get started? What do you do?

Prince: There are two questions there. From an HP perspective, as you said, we bring the total package from our expertise and experience, which is vital in all of this. One of the main things is that you need people who have done it before. They know the tricks, have maturity models and best practices in their back pockets, and bring those out.

We've definitely got the expertise and the flexible sourcing, so that we can help reduce the total cost of ownership and move expenditure around. We've got that side of the fence and we've obviously got the adaptive infrastructure. We already talked about the data warehouse consolidation. We've got services around governance. So, we've got the whole stack. But, you also asked where to start, and the answer is wherever the customer needs to start.

Gardner: It's that big of a problem?

Increasing lawsuits

Prince: Yes, it is that big, and it's going to depend. If I'm a manufacturing company, I might be getting a lot of lawsuits, because the number of lawsuits has gone sky high, with people trying to get money out of enterprises any way they can. So, look for where your cost is, get that cost out, and then, as I said before, use that to fund innovation, which is where growth comes from. It's all about how you transform your company by using information.

Gardner: So, you identify the tactical cost centers, and that gives you the business rationale and opportunity to invest perhaps at a strategic level along the way, employing governance as well?

Prince: It’s like any other large project. You need to get senior executive commitment and sponsorship -- and I mean real commitment. I mean that they are involved. It’s also the old adage of "how do you eat an elephant?" You eat an elephant in small chunks. In other words, you have a strategic plan and you know where you are going, but you tackle it in tactical projects that return business benefits. And then, IT needs to be very visible in communicating the benefits they are making in each of those steps, so that it reinforces the re-investment cycle.

Gardner: Something you mentioned earlier that caught my attention was the new options around sourcing. Whether it's on-premises, modernized data center, on-premises cloud-like or grid-like or utility types of resource pools, moving towards colocation, outsourcing and even a third-party cloud provider, how does that spectrum of sourcing come into play on a solutions level for information explosion?

Prince: Again, it goes back to the strategies that we were talking about. There needs to be an underpinning strategy, and people need to look at the business values of information.

There is some information that you will never want outsourced. You will always want it close at hand -- the CEO’s numbers that he is monitoring the business with. They're under lock and key in his office. It’s core business value information. There are others that you can move out. So, it’s going to involve the spectrum of looking at the business value, the security, and the data integration needs, assessing all of that, and then making your decisions.

Gardner: Are there some examples we can look to for a track record, an approach, and some lessons learned along the way? After we have a sense of what people have done, what kind of success rates do they tend to enjoy?

Prince: Because it’s such a broad topic, it’s hard to hone in on any one thing, but I will give you an example of document processing outsourcing. It’s just an example. With the acquisition of EDS, we offer a service where we will automate the mailroom. So, when the mail comes into the mailroom, it gets digitized and then sent to the appropriate application or user. If it’s a customer complaint, it will go to the complaints department. If it’s a sales request, it will get sent to the call center.

That’s a totally outsourced environment. What all of our customers are seeing is a) reduction in cost, and b) an increase in efficiency, because that paper comes in and, once digitized, moves around as a digital item.

Gardner: We perhaps wouldn't name names, but have you encountered situations where certain companies, in fact, found themselves at a significant competitive deficit as a result of not doing the right thing around information?

Lack of information

Prince: Well, I can give you one. Actually, it’s in the public domain. So, I can name names. New Century. They were the first sub-prime mortgage company to go under in the US, and it’s publicly documented.

The bankruptcy examiner actually wrote in his report that one of the major reasons they went under was the lack of information at the management level. In fact, they were running their business for the longest time on Excel spreadsheets, which were not being transmitted to management. So, they were not aware of the risks they were actually exposed to.

Gardner: We’ve certainly seen quite clear indicators that risk wasn’t always being measured properly across a number of different industries over the past several years. I suppose we would have to attribute that not only to a process, but to simply not knowing what’s going on within their systems.

Prince: Yes. I'll give you another public domain example of something from a completely different angle -- a European police database. They have just admitted -- in fact, I think it went public in February -- that they had 83 percent errors in their database. As a result of that, over a million people either lost their jobs or were fired because they were wrongly categorized as being criminals.

You can have absolutely catastrophic events if you don't look after quality and if you don't have governance programs in place.

Gardner: I want to hear more about how we get started in terms of approaching a problem, but I also understand that we should have some hope that new technologies, approaches, and processes are coming out. Has there been anything at the labs level or the R&D level, where investments are being made that offer some new opportunities in terms of the problems and solution tension that we have been discussing?

Prince: In HP Labs, we have eight major focus areas, and I would categorize six of them as being focused on information -- the next set of technology challenges. It ranges all the way from content transformation, which is the complete convergence of the physical and digital information, to having intelligent information infrastructure. So, it’s the whole gamut. But, six out of eight of our key projects are all based on information, information processing, and information management.

I'll give you an example of one that's in beta at the moment. It's Taxonom, which is an information-as-a-service (IaaS) taxonomy builder. One thing that is really important, especially in the content world, is the classification of the content. If you don't classify it, you can't find it. We are in beta at the moment, but you are going to see a lot more energy around these types of solutions.
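As a rough illustration of why classification aids findability -- this is a generic keyword-scoring toy, not how Taxonom works; the categories and terms are invented -- tagging each document against a taxonomy gives search something better than raw text to match on:

```python
# Invented taxonomy: category -> indicative terms.
TAXONOMY = {
    "finance": {"invoice", "payment", "audit", "revenue"},
    "legal": {"contract", "litigation", "compliance", "discovery"},
    "hr": {"hiring", "benefits", "payroll", "onboarding"},
}

def classify(text):
    """Rank taxonomy categories by keyword overlap with `text`."""
    words = set(text.lower().split())
    scores = {cat: len(words & terms) for cat, terms in TAXONOMY.items()}
    return sorted((c for c, s in scores.items() if s > 0),
                  key=lambda c: -scores[c])

print(classify("Audit of payment compliance for the Q3 invoice run"))
# -> ['finance', 'legal']: the document is now findable by category.
```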

Gardner: So the majority of R&D money, at least at HP, is now being focused on this information explosion problem set.

Prince: Yes, yes, absolutely.

Gardner: Interesting. Well, some folks may be interested in getting some more detailed information. They perhaps have some easily identified pain points and they want to drill down on that tactical level, consider some of the other strategic approaches, and look to some of those benefits and risk reduction. Where can they go to get started?

Prince: The first person to call is your HP account representative. Talk to them and start exploring how we can help you solve the issues in your company. If you want to just generally browse, go to hp.com. I'd also strongly recommend a sub-page -- hp.com/go/imhub.

Gardner: Very good. Well, we were discussing this burgeoning problem around information explosion, along with some of the risks and penalties that unfortunately many folks suffer and some of the paybacks for those who start to get a handle on this problem.

We've also looked at some examples of winners and, unfortunately, losers and we have found some early ways to start in on this solutions road map. I want to thank our guest today. We have been talking with Suzanne Prince, worldwide director of information solutions marketing at HP. Thank you, Suzanne.

Prince: Thanks, Dana. It was a pleasure.

Gardner: This is Dana Gardner, principal analyst at Interarbor Solutions. You have been listening to BriefingsDirect. Thanks and come back next time.

Listen to the podcast. Download the podcast. Download the transcript. Find it on iTunes/iPod and Podcast.com. Learn more. Sponsor: Hewlett-Packard.

Join a free HP Solutions Virtual Event on July 28 on four main IT themes. Learn more. Register.

Transcript of a BriefingsDirect podcast on new strategies and tools for dealing with the burgeoning problem of information overload. Copyright Interarbor Solutions, LLC, 2005-2009. All rights reserved.

Tuesday, December 16, 2008

MapReduce-scale Analytics Change Business Intelligence Landscape as Enterprises Mine Ever-Expanding Data Sets

Transcript of BriefingsDirect podcast on new computing challenges and solutions in data processing and data management.

Listen to the podcast. Download the podcast. Find it on iTunes/iPod. Learn more. Sponsor: Greenplum.

Dana Gardner: Hi, this is Dana Gardner, principal analyst at Interarbor Solutions, and you're listening to BriefingsDirect. Today, we present a sponsored podcast discussion on the architectural response to a significant and fast-growing class of new computing challenges. We will be discussing how Internet-scale data sets and Web-scale analytics have placed a different set of requirements on software infrastructure and data processing techniques.

Following the lead of such Web-scale innovators as Google, and through the leveraging of powerful performance characteristics of parallel computing on top of industry-standard hardware, we are now focusing on how MapReduce approaches are changing business intelligence (BI) and the data-management game.

More types of companies and organizations are seeking new inferences and insights across a variety of massive datasets -- some into the petabyte scale. How can all this data be sifted and analyzed quickly, and how can we deliver the results to an inclusive class of business-focused users?

We'll answer some of these questions and look deeply at how these new technologies will produce the payback from cloud computing and massive data mining and BI activities. We'll discover how the results can quickly reach the hands of more decision makers and strategists across more types of businesses.

While the challenge is great, the new value for managing these largest data sets effectively offers deep and powerful new tools for business and for social and economic progress.

To provide an in-depth look at how parallelism, modern data infrastructure, and MapReduce technologies come together, we welcome Tim O’Reilly, CEO and founder of O’Reilly Media, and a top influencer and thought leader in the blogosphere. Welcome, Tim.

Tim O’Reilly: Hi, thanks for having me.

Gardner: We're also joined by Jim Kobielus, senior analyst at Forrester Research. Thank you, Jim.

Jim Kobielus: Hi, Dana. Hi, everybody.

Gardner: Also, Scott Yara, president and co-founder at Greenplum. Welcome, Scott.

Scott Yara: Thank you.

Gardner: We're still dealing with oceans of data, even though we have harsh economic times. We see reduction in some industries, of course, but the amount of data and need for analytics across the Internet is still growing rapidly. BI has become a killer application over the past few years, and we're now extending that beyond enterprise-class computing into cloud-class computing.

I want to go to Jim Kobielus first. Jim, why has this taken place now? What is happening in the world that is simultaneously creating these huge data sets, but also making necessary even better analytics across more businesses?

Kobielus: Thanks, Dana. A number of things are happening or have been happening over the past several years, and the trend continues to grow. In terms of the data sets, it’s becoming ever more massive for analytics. It’s equivalent to Moore’s Law, in the sense that every several years, the size of the average data warehouse or data mart grows by an order of magnitude.

In the early 1990s or the mid 1990s, the average data warehouse was in gigabytes. Now, in the mid to late 2000s, it's in the terabytes. Pretty soon, in the next several years, the average data warehouse will be in the petabyte range. That’s at least a thousand times larger than the current middle-of-the-road data warehouse.

Why are data warehouses bulking up so rapidly? One key thing is that organizations, especially in tough times when they're trying to cut costs, continue to consolidate a lot of disparate data sets into fewer data centers, onto fewer servers, and into fewer data warehouses that become ever-more important for their BI and advanced analytics.

What we're seeing is that more data warehouses are becoming enterprise data warehouses and are becoming multi-domain and multi-subject. You used to have tactical data marts, one for your customer data, one for your product data, one for your finance data, and so forth. Now, the enterprise data warehouse is becoming the be all and end all -- one hub for all of those sets.

What that means is that you have a lot of data coming together that never needed to come together before. Also, the data warehouse is becoming more than a data warehouse. It's becoming a full-fledged content warehouse, not just structured relational data, but unstructured and semi-structured data -- from XML, from your enterprise content management (ECM) system, from the Web, from various formats, and so forth. It's coming together and converging into your warehouse environment. That's like the bottom of the iceberg rising into view: you're seeing it now, and it's coming into your warehouse.

Also, because of the Web 2.0 world and social networking, a lot of the customer and market intelligence that you need is out there in blogs, RSS feeds, and various formats. Increasingly, that is the data that enterprises are trying to mine to look for customers, marketing opportunities, cross-sell opportunities, and clickstream analysis. That’s a massive amount of data that’s coming together in warehouses, and it's going to continue to grow in the foreseeable future.

Gardner: Let’s go to Tim O’Reilly. Tim, from your perspective, what has changed over the past 10 or 20 years that makes these datasets so important?

Long-term perspective

O'Reilly: If you look at what I would call Web 2.0 in a long-term historical perspective, in one sense it's a story about the evolution of computing.

In the first age of computing, business models were dominated by hardware. In the second age, they were dominated by software. What started to happen in the 1990s, underneath everybody’s nose, but not understood and seen, was the commodification of software via open industry standards. Open source started to create new business models around data, and, in particular, around network applications that built huge data sets through user participation. That’s the essence of what I call Web 2.0.

Look at Google. It's a BI company, based on massive data sets, where, first of all, they are spidering all the activity off of the Web, and that's one layer. Then, they do this detailed analysis of the link structure of that Web, and that's another layer. Then, they start saying, "Well, what else can we find?" They start looking at clickstream data. They start looking at browsing history, and where people go afterward. Think of all the data. Then, they deliver services against that.

That’s the essence of Web 2.0, building a massive data set, doing real-time analytics against it, and then figuring out what services you can deliver. What’s happening today is that movement is transferring from the consumer Web into business. People are starting to realize, "Oh, the companies that are doing better are better with their data."

A great example of that is Wal-Mart. You can think of Wal-Mart as a Web 2.0 company. They've got end-to-end analytics in the same way that Google does, except they're doing it with stuff. Somebody takes something off the shelf at Wal-Mart and rings it up. Wal-Mart knows, and it sends a signal downstream to the supplier.

We need to understand that this move to real-time understanding of data at massive scale is going to become more and more important as the lever of competitive advantage -- not just in computer businesses, but in all businesses. Data warehousing and analytics aren't just something that you do in the back office and it's a nice-to-have. It's the very essence of competitive advantage moving forward.

When we think about where this is going, we first have to understand that everybody is connected all the time via applications, and this is accelerating, for example, via mobile. The need for real-time analytics against massive data sets is universal.

Look at some of the things that are happening on the phone. Okay, where am I? What data is relevant to me right now, because you know where I am? Speech recognition is starting to come into focus on the phone. Again, it's a massive data problem, integrating not only speech recognition, but also local dialects. Oh, wait, local again. You start to see some cross-connections between data streams that will help you do better.

I was talking with someone from Nuance about why Google is able to do some interesting things in the particular domain of search and speech recognition. It's because they're able to cross-correlate two different data sets -- the speech data set and the search data set. They say, "Okay, yeah, when somebody says that, they are most likely looking for this, because we know that. When they type, they also are most likely looking for that." So this idea of cross-correlation between data sets is starting to come up more and more.

This is a real frontier of competitive advantage. You look at the way that new technologies are being explored by startups. So many of the advantages are in data.

A great example is the company where I'm on the board. It's called Wesabe. They're a personal finance application. People upload their bank statements or give Wesabe information to upload their bank statements. Wesabe is able to do customer analytics for these guys, and say, "Oh, you spent so much on groceries." But, more than that, they're able to say, "The average person who shops at Safeway, spends this much. The average person who shops at Lucky spends this much in your area." Again, it's a massive data problem. That’s the heart of their application.

Now, don't you think the banks are going to get clued into this? They're going to start to say, "Well, what services can we offer?" Phone companies: "What services can we offer against our data?"

One thing that’s going to happen is the migration of all the BI competencies from the back office to the front office, from being something that you do and generate reports from, to something that you actually generate real-time services from. In order to do that, you've absolutely got to have high performance at massive scale.

Second, a lot of these data sets are not the old-fashioned data sets of simply structured data.

Gardner: Let’s go to Scott Yara. Scott, we need this transformation. We need this competitive differentiation and new, innovative business approaches by more real-time analytics across larger sets and more diverse sets of content and inference. What’s the approach on the solution side? What technologies are being brought to bear, and how can we start dealing with this at the time and scale that’s required?

A big shift

Yara: Sure. For Greenplum, one of the more interesting aspects of what's going on is that big technology concepts and ideas that have been around for two or three decades are being brought to bear, because of the big shift that Tim alludes to, and we are big believers. We're now entering this new cycle, where companies are going to be defined by their ability to capture and make use of the data and the user contributions coming from their customers and community. That means being able to make parallel computing a reality.

Look at the other major computing trend today, a very mainstream thing like virtualization. Virtualization itself was born on the mainframe well over 30 years ago. So, why is virtualization today, in 2008, so important?

Well, it took an intersection of major trends. As Tim mentioned, you had the commoditization of both hardware and software, and x86 multi-core machines became incredibly cheap. At the same time, you had a high-level business trend, an industry trend. The rising cost of data centers and power became so significant that CIOs had to think about the efficiency of their data centers and their infrastructure and what could lower the cost of computing.

If you look at running applications on a much cheaper and much more efficient set of commodity systems and consolidating applications through virtualization, that would be a really compelling thing, and we've seen a multi-billion dollar industry born of that.

You're seeing the same thing here, because business is now driven by Web 2.0, by the success of Google, and by businesses' own use of the Web, which has made them realize how important data is to their own operations. That's become a very big driver, because it turns out that parallel computing, combined with commodity hardware, is a very disruptive platform for doing large-scale data analysis.

As Google has shown, you can take very, very cheap machines -- off-the-shelf PCs -- and, with the right software, combine hundreds, thousands, and tens of thousands of systems to deliver analytics at a scale that people couldn't reach before. It's that confluence and that intersection of market factors that's actually making this whole thing possible.

While parallel computing has been around for 30 years, the timing has become such that it’s now having an opportunity to become really mainstream. Google has become a thought leader in how to do this, and there are a lot of companies creating technologies and models that are emblematic of that.

But, at the end of the day, the focus is on software that is purpose-built to provide parallelism out of the box. This allows companies to sift through huge amounts of data, whether structured or unstructured. All the fault tolerance, all the parallelism, all those things that you need are done in software, so that you can choose off-the-shelf hardware from HP, IBM, Dell, or white-box systems. That's a model that's as disruptive a shift as client-server and symmetric multiprocessing (SMP) computing was to the mainframe.
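A minimal sketch of the pattern Yara describes, using Python's multiprocessing on a single machine to stand in for a cluster: the software shards the data, processes each partition independently, and merges the partial results. The data and shard count here are invented.

```python
from multiprocessing import Pool

def partial_count(chunk):
    """Map step: count word occurrences within one data partition."""
    counts = {}
    for line in chunk:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

def merge(results):
    """Reduce step: combine per-partition counts into one total."""
    total = {}
    for counts in results:
        for word, n in counts.items():
            total[word] = total.get(word, 0) + n
    return total

if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "the fox again"]
    partitions = [lines[0::2], lines[1::2]]    # shard the data
    with Pool(processes=2) as pool:            # workers stand in for nodes
        print(merge(pool.map(partial_count, partitions)))
```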

Gardner: Jim Kobielus, speak to this point of moving the analytic results, the fruits of this impressive engine and architectural shift from the back office to the front office. This requires quite a shift in tools. We're not going to have those front-office folks writing long SQL queries. They're not going to study up on some of the traditional ways that we interact with data.

What's in the offing for development, so that developers can create applications targeting this data, now that it's in a format we can get at and cross-pollinated across huge data sets that are themselves diverse? What's in store for app dev, and what's in store for a graphical way in for the business-strategist type of user?

Self-service paradigm

Kobielus: One thing we're seeing in front-end app development, to take Tim's point even further, is that it's very much becoming a Web 2.0, user-centric, self-service development paradigm for analytics.

Look at the ongoing evolution of the online analytical processing (OLAP) market, for example. Things are going on in terms of user self-service: the development of data mining and advanced analytic applications within the browser and within the spreadsheet. Users can pull data from various warehouses and marts, and from online transaction processing (OLTP) systems, but in a visual, intuitive paradigm.

That can cache a lot of that information in the front end -- in other words, on the desktop or in the mobile device -- and allow the user to graphically build ever-richer reports and dashboards, and then share all of that out to the others on their teams. You can build a growing, collective analytical knowledge base that can be shared. That whole paradigm is coming to the fore.

At Forrester, we published a number of reports on it. Recently, Boris Evelson and I looked at the next generation of OLAP technology. One very important initiative to look at is what Microsoft is doing with Project Gemini. They're still working on that, but they demoed it a couple of months ago at their BI show.

The front office -- the actual end users and the power users -- is who will do the bulk of the BI and analytics application development in this new paradigm. This means that more and more of this development will be offloaded from the traditional high priesthood of data modelers, developers, and data-mining specialists, so they can focus on more sophisticated statistical analysis, and so forth.

The front office will do the bulk of the development. The back office -- in other words, the traditional IT data-modeling professionals -- will be there. They'll be setting the policies and they'll be providing the tooling that the end users and the power users will use to build applications that are personalized to their needs.

So IT then will define the best practices, and they'll provide the tooling. They'll provide general coaching and governance around all of the user-centric development that will go on. That’s what’s going to happen.

It’s not just Microsoft. You can look at the OLAP tooling, more user-centric in-memory spreadsheet-centric approaches that IBM, Cognos, Oracle, and others are rolling out or have already rolled out in their product sets. This is where it’s all going.

Gardner: Tim O’Reilly, in the past, when we've opened up more technological power to more people, we've often encountered much greater innovation, unpredictably so. Should we expect some sort of a wisdom-of-crowd effect to come into play, when we take more of these data sets and analytic tools and make them available?

O'Reilly: There's a distinction between the wisdom of crowds and collective intelligence. The wisdom-of-crowds thesis, as expounded by James Surowiecki, is that if you get a whole bunch of people independently -- really independently -- to weigh in on some subject, their average guess is better than any individual expert's. That's really about a certain kind of quantitative question.

But, there's also a machine-learning approach in which you're not necessarily looking for the average, but you're finding different kinds of meaning in data. I think it’s important to distinguish those two.

Google realized that there was meaning in links that every other search engine of the day was throwing away. This was a way of harnessing collective intelligence, but it wasn’t just the wisdom of crowds. This was actually an insight into the structure of the data and the meaning that was hidden in it.

The breakthroughs are coming from the ability of people to discern meaning in data. That meaning sometimes is very difficult to extract, but the more data you have, the better you can be at it.

A great example of this recently is from the last election. Nate Silver, who ran 538.com, was uncannily accurate in calling the results of the election. The reason he was able to do that was that he looked at everybody's polls, but didn't just say, "Well, I'm just going to take the average of them." He used all kinds of deep thinking to understand, "Well, what's the bias in this one? What's the bias in that one?" And, he was able to develop an algorithm in which he weighted these things differently.
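A toy version of that kind of weighting, with invented numbers: rather than a naive mean, each poll is weighted by an estimated reliability and corrected for its estimated house bias. This sketches the general idea only; it is not Silver's actual model.

```python
def weighted_poll_average(polls):
    """Bias-correct each poll, then take a reliability-weighted mean."""
    num = sum(weight * (share - bias) for share, weight, bias in polls)
    den = sum(weight for _share, weight, _bias in polls)
    return num / den

# (candidate_share_%, reliability_weight, estimated_bias_%) -- all invented.
polls = [(52.0, 1.0, +1.5), (48.0, 0.6, -2.0), (50.0, 1.4, +0.5)]
naive = sum(share for share, _w, _b in polls) / len(polls)
print(f"naive mean: {naive:.1f}%")                           # 50.0%
print(f"weighted, bias-adjusted: {weighted_poll_average(polls):.1f}%")
```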

Gardner: I suppose it’s important for us to take the ability to influence the algorithms that target these advanced data sets and put them into the hands of the people that are closer to the real business issues.

More tools are critical

O'Reilly: That’s absolutely true. Getting more tools for handling larger and more complex data sets, and in particular, being able to mix data sets, is critical.

One of the things that Nate did that nobody else did was that he took everybody’s polls and then created a meta-poll.

Another example is really interesting. You guys probably are familiar with the Netflix Challenge, where Netflix has put up a healthy sum of money for whoever can improve its recommendation algorithm by 10 percent. What's interesting is that people seem to be stuck at about 8 percent, and they haven't been able to get the last couple of percentage points.

It occurred to me in a conversation I was having last night that the breakthroughs will come, not by getting a better algorithm against the Netflix data set, but by understanding some other data set that, when mixed with the Netflix data set, will give better predicted results.

Again, that tells us something about the future of data mining and the future of business intelligence: larger, more complex, and more diverse data sets from which you are able to extract meaning in new ways.

One other thing. You were talking earlier about the democratization of these tools. One thing I don’t want to pass by is a comment that was made recently by Joe Hellerstein, who is a computer science professor at UC Berkeley. It was one of those real wake-up-and-smell-the-coffee moments. He said that at Berkeley, every freshman student in CS is now being taught Hadoop. SQL is an elective for seniors. You say, "Whoa, that is a fundamental change in our thinking."

That's why I think what Greenplum is doing is really interesting, trying to marry the old BI world of SQL with the new world of these loose, unstructured data sets that are often analyzed with a MapReduce kind of approach. Can we bring the best of these things together?

That fits with this idea of crossing data sets being one of the new competencies that people are going to have to get better at.
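A minimal sketch of that marriage, with Python's built-in sqlite3 standing in for the SQL side and plain map/reduce functions over raw text standing in for the MapReduce side; the table, log lines, and join are invented for illustration.

```python
import sqlite3
from collections import Counter
from functools import reduce

# Structured side: ordinary SQL over a relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "alice"), (2, "bob")])
names = dict(conn.execute("SELECT id, name FROM customers"))

# Unstructured side: a MapReduce-style pass over raw log lines.
logs = ["1 viewed product", "2 viewed product", "1 bought product"]

def map_phase(line):
    # Emit a ((customer_id, action), 1) count for each raw line.
    customer_id, _, action = line.partition(" ")
    return Counter({(int(customer_id), action): 1})

def reduce_phase(acc, counts):
    # Fold per-line counts into one aggregate.
    acc.update(counts)
    return acc

totals = reduce(reduce_phase, map(map_phase, logs), Counter())

# Mixing the modalities: join the reduced counts back to the SQL entities.
for (cid, action), n in sorted(totals.items()):
    print(names[cid], action, n)
```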

Kobielus: If I can butt in here just one moment, I want to tie into something that Tim just said, that I said a little bit earlier. One important thing is that when you add more data sets to say your analytic environment, it gives you the potential to see more cross-correlations among different entities or domains. So, that’s one of the value props for an all-encompassing or more multi-domain enterprise data warehouse.

Before, you had these subject-specific marts -- customer data here, product data there, finance data there -- and you didn't have any easy way to cross-correlate them. When you bring them all together into a common repository, implementing common dimensions and hierarchies and conforming to common metadata, it becomes a whole lot easier for the data miners, the power users, and the end users to build applications that tie it all together.

There is the "aha" moment: "Aha, I didn't realize all these hook up in these various ways." You can extract more meaning by bringing it all together into a unified enterprise data warehouse.

Gardner: To you, Scott Yara. There's a great emphasis here on bringing together different data sets from disparate sources, with entirely different technologies underlying them. It's not a trivial problem. It’s not a matter of scale necessarily.

What do you see as the potential? What is Greenplum working on to allow folks to mix and match in such a way that the analytics can be innovative and game-changing in a harsh economic environment?

Price/performance improvement

Yara: A couple of things. One, I definitely agree with the assertion that analysis gets easier the more data you have, whether those are heterogeneous data sets or just a greater volume of data that people can collect. It's fundamentally easier and cheaper.

In general, these businesses are pretty smart. The executives, analysts, and people driving the business know that their data is valuable and that insight into improving the customer experience through data is key. It's just been really hard and expensive, and that has made it prohibitive for a long, long time.

Now, we're talking about using parallel computing techniques, open-source software, and commodity hardware. It’s literally a 10- to 100-fold improvement in price performance. When the cost of data analysis comes down 10 to 100 times, that’s when new things become possible.

O'Reilly: Absolutely.

Yara: We see lots of customers now, like the New York Stock Exchange. These are businesses across vertical industries, but all affected by the Web and network computing at some level.

Algorithmic trading is driving financial services in a way that we haven't seen before. They're processing billions of trades every day. Whether it's security, surveillance, or the real-time support they need to provide to very large trading companies, the need to mine and sift through billions of transactions on a real-time basis is acute.

We were sitting down with one of our large telecom customers yesterday, and there was this convergence that Tim’s talking about. You've got companies with very large mobile carrier businesses. They're broadband service providers, fixed-line service providers, and Internet companies.

Today, telecom carriers are just at the beginning of trying to do the kind of basic personalization that companies like Amazon, eBay, or Google do. They have to aggregate the consumer event stream from all these disparate communication systems, and it's at massive scale.

Greenplum is solely focused on making that happen and mixing the modalities of data, as Tim suggested. Whether it’s unstructured data, whether those are things that exist in legacy databases, or whether you want to mix and match SQL or MapReduce, fundamentally you need to make it easy for businesses to do those things. That’s starting to happen.

Gardner: I suppose part of the new environment that we are in economically is that incremental change is probably not going to cut it. We need to find new forms of revenue and be able to attain them at a very low cost, upfront if possible, and be transformative in how we can take our businesses out through the public networks to reach more customers and give them more value.

Now that we've established that we have these data sets, we can combine them to a certain degree, and that will improve over time. What are the ways in which companies can start actually making money in new ways using these technologies?

Apple’s Genius comes to mind for me as a way of saying, "Okay, you pick a song in your iTunes library, and we're going to use our data and our analytics, and come back with some suggestions on what you might like as a result of that." Again, this is sort of a first go at this, but it opens my eyes to a lot of other types of business development opportunities. Any thoughts on this, Tim O’Reilly?

O'Reilly: In general, as I said earlier, this is the frontier of competitive advantage. Sure, iTunes has Genius, but it's the same thing with Netflix recommendations. Amazon has been doing this for years. It's part of their competitive advantage. I mentioned earlier how this is starting to be a force in areas like banking. Think about phone companies and all of the opportunities for new local services.

Not only that, one of my pet hobbyhorses is that phone companies have this call-history database, but they're not building new services for users against it. Your phone still only remembers the last few people that you called. Why can't I do a search against somebody I talked to three months ago? "Who the heck was that? Was it a guy from this company?" You should be able to search that. They've got the data.

So, as I said earlier, the frontier is turning the back office into new user-facing services, and having the analytics in place to be able to do that meaningfully at scale in real-time. This applies to supply chains. It applies to any business that has data that gets better through user interaction.

This is the lesson of the Web. We saw it first in Web applications. I gave you the example earlier of Wal-Mart. They realized, "Oh, wait a minute. Every time somebody buys something, it’s a vote." That’s the same point that Wesabe is trying to exploit. A credit card statement is a voting list.

I went to this restaurant once. That doesn't necessarily mean anything. If I go back every week, that may mean something. I spend on average this much, and it's going up. That means something. I spend on average this much, and it's going down. That means something. So, it's about finding meaning in the data I already have: how could this be useful, not just to me, but to my users and my customers, and what services could I build?
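A toy version of that spending-trend signal, with invented numbers: compare the average of the most recent months against the window before them and flag the direction.

```python
def spend_trend(monthly_totals, window=3):
    """Label spend as rising, falling, or steady; needs 2*window months."""
    recent = sum(monthly_totals[-window:]) / window
    prior = sum(monthly_totals[-2 * window:-window]) / window
    if recent > prior * 1.1:    # 10% band to ignore small wobbles
        return "rising"
    if recent < prior * 0.9:
        return "falling"
    return "steady"

# Hypothetical monthly grocery spend for one cardholder.
print(spend_trend([210, 195, 220, 240, 260, 285]))   # -> rising
```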

This is the frontier, particularly in the world that we are entering, in which computing is going mobile, because so many of the mobile services are fundamentally going to be driven by BI. You need to be able to say in real-time or close to real-time, "This is the relevant data set for this person based on where they are right now."

Needed: future view

Kobielus: I want to underline what Tim just said. Traditionally, data warehouses existed to provide you with perfect hindsight on the customer -- massive historical data and that 360-degree view of everything about the customer and everything they have ever done in the past, back to the dawn of recorded time.

Now, it’s coming down to managing that customer relationship and evolving and growing with that relationship. You have to have not so much a past or historical view, but a future view on that customer. You need to know that customer and where they are going better than they know themselves.

In other words, that’s where the killer app of the online recommendation engine becomes critical. Then, the data warehouse, as the platform for recommendation engines, can take both the historical data that persists, but also can take the continuing streams of real-time event data on pricing, on customer interaction in various channels -- be it on the Web or over the phone or whatever -- customer transactions that are going on now, and things and events that are going on in the customer social network.

Then, you feed that all into a recommendation engine, which is a predictive-analytics model running inside the data warehouse. That can optimize that customer's interaction at every touch point. Let's say they're dealing with a call-center person live. The call-center person knows exactly how the world looks to that customer right now and has a really good sense of what that customer might need now, or might need in three months, six months, or a year, in terms of new services or products, because other customers like them are doing similar things.

It can have recommendations being generated and scripted for the call-center agent in real-time, saying, "You know what we think. We recommend that you upgrade to the following service plan, because it provides you with these features that you will find useful in your lifestyle, blah, blah, blah."

In other words, it's understanding the customer in their future, in their possible future, and suggesting things to the customers that they themselves didn’t realize until you suggested them. That’s the future of analytics, and competitive advantage.

O'Reilly: I couldn’t agree more.
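For a concrete, if drastically simplified, picture of what such an engine computes, here is a neighborhood-style sketch: score items by how often "customers like you" hold them. The purchase histories and the Jaccard similarity are illustrative choices, not what any particular warehouse product runs.

```python
def recommend(target, histories, top_n=2):
    """Suggest items the most similar customers have that `target` lacks."""
    target_items = histories[target]
    scores = {}
    for other, items in histories.items():
        if other == target:
            continue
        # Jaccard similarity between the two purchase histories.
        sim = len(target_items & items) / len(target_items | items)
        for item in items - target_items:
            scores[item] = scores.get(item, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

histories = {                        # invented purchase histories
    "ann": {"basic_plan", "roaming"},
    "bob": {"basic_plan", "roaming", "data_addon"},
    "cat": {"basic_plan", "intl_calls"},
}
print(recommend("ann", histories))   # -> ['data_addon', 'intl_calls']
```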

Gardner: Scott Yara, we've been discussing this with a little bit of a business-to-consumer (B2C) flavor. In the business-to-business (B2B) world many things are equal in a commoditized market, with traditional types of products and services.

An advantage might be that, as a supplier, I'm going to give you analytics that I can derive from data sets that you might not have access to. I might provide analytical results to you as a business partner free of charge, but as an enticement for you to continue to do business with me, when I don’t have any other way to differentiate. What do you see are some of the scenarios possible on the B2B side?

Yara: You don't have to look much further than what Salesforce.com is doing. In a lot of ways, they're pioneering what it means to be an enterprise technology company that sells services, and ultimately data, back to its customers. By creating a common platform where applications can be built, they are very much thinking about how data is being aggregated on the platform -- not by their individual customers, but in aggregate.

You're going to see lots of cases where for traditional businesses that are selling services and products to other businesses, the aggregation of data is going to be interesting and relevant. At the same time, you have companies where even the internal analysis of their data is something they haven’t been able to do before.

We were talking about Google, which is an amazing company with a big vision to organize the world's information. What the rest of the business world is finding out is that, while it's a great vision and Google has a lot of data, it holds only a small fraction of the overall data in the world. Telecommunications companies, financial exchanges, and retail companies have all of this real-world data that's not being indexed or organized by Google, and they have access to amazing amounts of information about customers and businesses.

They are saying, "Why can't we, at the point of interaction -- like eBay, Amazon, or some of these recommendation engines -- start to take some of this aggregate information and turn it into improved businesses, the way the Web companies have done so successfully?" That's going to be true for B2C businesses, as well as for B2B companies.

We're just at the beginning of that. That’s fundamentally what’s so exciting about Greenplum and where we're headed.

Gardner: Jim Kobielus, who does this make sense for right away? Some companies might be a little skeptical and will have to think about it. But where is the low-hanging fruit? Where are the no-brainer applications for this approach to data and analytics?

Kobielus: No-brainers -- I always hate that term. It sounds condescending, but the low-hanging fruit should be one of those "aha!" opportunities that everybody recognizes intuitively. You don't have to explain it to them, so in that sense it's a no-brainer. It's the call center -- the customer-contact center.

The customer-contact center is where you touch the customer, and where you hopefully initiate, cultivate, nurture, maintain, and grow the customer relationship. It's one of the many places where you do that. There are people in your organization who are in that front-line capacity.

It doesn't have to be just people. It could be automated programs on your Website that need to be empowered continuously with the full customer context -- the history of that customer's interactions, the customer's current state, current sentiment and feelings, and full context on the customer's likely future evolution. So, really, it's the call center.

In fact, I cover data warehousing for Forrester, and I talk to the data warehousing vendors and their customers about in-database analytics and where they are selling this capability into real-world deployments right now. The customer call center is, far and away -- with a bullet -- the number one place for inline analytics to drive the customer interaction in a multichannel fashion.
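As a rough illustration of what "in-database" means here -- pushing the scoring down into the SQL engine instead of pulling rows out to an application server -- the small Python sketch below uses SQLite as a stand-in for a real analytic warehouse. The table, columns, and scoring formula are assumptions made up for the example.

    # Illustrative only: SQLite stands in for a warehouse; schema and formula are invented.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer_events (
            customer_id    INTEGER,
            overage_events INTEGER,  -- streamed in near real time
            months_tenure  INTEGER   -- historical warehouse data
        );
        INSERT INTO customer_events VALUES (1, 4, 30), (2, 0, 6), (3, 2, 48);
    """)

    # The model runs where the data lives; only the ranked result leaves the database.
    row = conn.execute("""
        SELECT customer_id,
               0.6 * MIN(overage_events, 5) / 5.0
             + 0.4 * MIN(months_tenure, 60) / 60.0 AS upgrade_score
        FROM customer_events
        ORDER BY upgrade_score DESC
        LIMIT 1
    """).fetchone()
    print(f"Top upgrade candidate: customer {row[0]} (score {row[1]:.2f})")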

Gardner: How about you, Tim O'Reilly? Where are some of the hot verticals, and where are the early adopters likely to be?

O'Reilly: As I've already said several times, mobile apps of various kinds are probably highest on the list. But I'm a big fan of supply chain. There's a lot to be done there, and there's a huge amount of data. There already is a BI infrastructure, but it hasn't really been tuned to work as a customer-facing application. It's really more of a back-office or planning tool.

There are enormous opportunities in media, if you want to put it that way. If you think about the amount of money that's spent on polling, and the power of integrating actual behavioral data rather than stated preferences, I think it's huge.

How do we actually figure out what people are going to do? There's a great marketing study about this. I forget who told the story, but it was about a consumer product -- a boom box or something like that. They showed examples of it in different colors.

They said, "How many of you think white is the cool color? How many of you think black? How many, blah, blah, blah?" All the people voted, and then there were piles of the boom boxes by the door that people took as their thank-you gift. What they said and what they did were completely at variance.

One of the things that’s possible today is that, increasingly, we are able to see what people actually do, rather than what they say they will do or think they will do.
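A tiny sketch makes that gap between stated and revealed preference easy to see. The percentages below are invented for the sake of the boom-box story, not taken from any real study.

    # Toy illustration of stated versus revealed preference; all numbers invented.
    stated   = {"white": 0.70, "black": 0.30}  # what people said they preferred
    observed = {"white": 0.20, "black": 0.80}  # what people actually took home

    for color in stated:
        gap = observed[color] - stated[color]
        print(f"{color}: stated {stated[color]:.0%}, took {observed[color]:.0%}, gap {gap:+.0%}")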

Gardner: We're just about out of time. Scott Yara, what's your advice for folks who are just getting their heads around this? It's not a trivial activity. It requires a great deal of concerted effort across multiple aspects of IT, perhaps more so than in the past. How do you get started, and what should you be doing to get ready?

Yara: That's one of the real advantages. In a somewhat orthogonal way, creating new businesses online in the age of Web 2.0 has become fundamentally cheaper and faster, and doing something disruptive inside a business with its data has to become fundamentally cheaper and easier as well. So the best way to get going is not to start with the big vision of where you need to go, but with something tactical -- whether it lives in the call center or in some departmental application.

There are technologies, services, and people now that let you peel off a real project and deliver real value right away.

I agree with Tim. We're going to see a lot of activity in the mobility and telecommunications space. These companies are just realizing this. If you think about the kind of personalization you get from almost every major Internet site today, what level of personalization do you get from your carrier, relative to how much data they have? You're going to see lots of telecom companies do things with data that will have real value.

One of our customers was saying that, in the traditional, back-office data warehousing world, the service-level agreement (SLA) was that when a call got placed and logged, it just needed to make its way into the warehouse seven days later. Seven days from the point of origination of a call, it would finally land in a back-office warehouse.

Those are the kinds of things that are going to change, if we're really going to provide mobility, locality, and recommendation services to customers.
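As a rough sketch of what that SLA shift means in code, the toy pipeline below contrasts the old seven-day batch path with per-record loading. The "warehouse" is just an in-memory list, and the record fields are invented for illustration.

    # Toy contrast between a seven-day batch SLA and near-real-time loading.
    import time

    SEVEN_DAYS = 7 * 24 * 3600
    warehouse = []

    def batch_load(call_records, batch_ran_at):
        """Old model: call records queue up and land in the warehouse days later."""
        for rec in call_records:
            rec["latency_s"] = batch_ran_at - rec["placed_at"]
            warehouse.append(rec)

    def stream_load(call_record):
        """New model: each record is loaded as it arrives, so locality and
        recommendation services can see it within seconds rather than days."""
        call_record["latency_s"] = time.time() - call_record["placed_at"]
        warehouse.append(call_record)

    # A call placed a week ago finally arriving via the old batch path ...
    batch_load([{"caller": "555-0199", "placed_at": time.time() - SEVEN_DAYS}],
               batch_ran_at=time.time())
    # ... versus a call placed two seconds ago arriving via the new path.
    stream_load({"caller": "555-0100", "placed_at": time.time() - 2})

    for rec in warehouse:
        print(f"{rec['caller']}: visible to analytics after {rec['latency_s']:,.0f}s")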

It's about having a clear idea of the first application that can benefit from the data. Call centers are going to be a good place to put a profile of the customer in front of the service representative and change the experience from there. I think we're going to see those things.

So, these are tractable problems. What held back enterprise data warehousing before was the inability to start small -- companies were looking at huge investments of people, capital, and infrastructure. I think that's really changing.

Gardner: I'm afraid we have to leave it there. We've been discussing new approaches to managing data, processing data, mixing data types and sets, and extracting real-time business results from them. We've looked at tools, and we've looked at some of the verticals and business advantages.

I want to thank our panel. We've been joined today by Tim O'Reilly, the CEO and founder of O'Reilly Media. Thank you, Tim.

O'Reilly: Glad to do it.

Gardner: Jim Kobielus, Forrester senior analyst. Thank you, Jim.

Kobielus: Dana, always a pleasure.

Gardner: Scott Yara, president and co-founder of Greenplum. Appreciate it, Scott.

Yara: Great. Thanks everybody.

Gardner: This is Dana Gardner, principal analyst at Interarbor Solutions. You've been listening to a sponsored BriefingsDirect podcast. Thanks, and come back next time.

Listen to the podcast. Download the podcast. Find it on iTunes/iPod. Learn more. Sponsor: Greenplum.

Transcript of BriefingsDirect podcast on new computing challenges and solutions in data processing and data management. Copyright Interarbor Solutions, LLC, 2005-2008. All rights reserved.