Showing posts with label data warehouse. Show all posts

Wednesday, April 22, 2015

ECommerce Portal Avito Uses Big Data to Master Just-in-Time Ad Fraud Detection

Transcript of a BriefingsDirect discussion on how a Russian ecommerce and search engine site leverages big data analytics to identify fraud.

Listen to the podcast. Find it on iTunes. Get the mobile app for iOS or Android. Download the transcript. Sponsor: HP.

Dana Gardner: Hello, and welcome to the next edition of the HP Discover Podcast Series. I'm Dana Gardner, Principal Analyst at Interarbor Solutions, your host and moderator for this ongoing sponsored discussion on IT innovation and how it’s making an impact on people’s lives.
Our next innovation case study interview highlights how Avito, a Russian eCommerce site and portal, uses big data technology to improve fraud detection, as well as better understand how their users adapt to new advertising approaches.

To learn more about how big data offers new insights to the eCommerce portal user experience, please join me in welcoming Nikolay Golov, Chief Data Warehousing Architect at Avito in Moscow. Welcome.

Nikolay Golov: Hi.

Gardner: Tell us a little bit about your site and your business at Avito. It sounds like the Craigslist of Russia.

Golov: Yes, Avito is a Russian Craigslist. It's a big site and also the biggest search engine for some goods. We at Avito have more searches, for example, from iPhones than Google or Yandex. Yandex is a Russian Google.

Gardner: Does Avito cover all types of goods, services, business-to-business commerce?

Golov: On Avito, you can sell almost anything that can be bought in the market. You can sell cars, you can sell houses, or rent them, for example. You can even find boats or business jets. We right now have about three business jets listed.

Gardner: So quite a diversity. What are your big data needs? It sounds as if in a country as large as Russia -- with that many goods and services -- you have a high-volume-of-data issue.

Size advantage

Golov: The main advantages of Avito are, first, its size. Everybody in Russia knows that if you want to buy or sell something, the best place for it is Avito. That’s first. Second is speed. It is very easy to use; we have a very easy interface. So we must keep these two advantages. But there are also some people who want to use Avito to sell weapons, drugs, and prohibited medicines. It's absolutely critical for Avito to keep it all clean, to prevent such items from appearing in the queries of our visitors.

We're growing very fast, and if we use moderators we'll have to increase our spending on moderation linearly as we grow. So, the only way to avoid a linear increase in expenses is to use automation.

Gardner: In order to rapidly decide which should or should not be appearing on your site, you’ve decided to use a data warehouse that provides a streaming real-time data automation effect. Tell me what your requirements are for that technology?

Golov: We have various requirements. For example, we need to be able to perform fast fraud detection. The warehouse has to have very little delay. Hours are not permitted, it must be 10 minutes, no more.

Second, we have to have data for long periods of time to train our data mining algorithms, to create reports, and to analyze trends. So our data warehouse has to be very big. It has to store months, possibly years, of data. It has to be fast, or only slightly delayed, and it has to be big.

Third, we're developing very quickly. We're adding some new services, and we're integrating with partners. Not long ago, for example, we added information from Google AdWords to optimize banners. So the warehouse must be very flexible. It must be able to grow in all three ways.

Gardner: How long have you been using HP Vertica and how did you come to choose that particular platform?

Golov: Well over a year now. We chose Vertica for two main advantages. First, the speed of loading data. The I/O speed provided by Vertica is awesome.
Second is its ability to upgrade, thanks to the commodity hardware. So if you have some new requirements that require you to increase performance, you can just buy new hardware -- commodity hardware -- and its power just increases.

It’s great and it can be done really fast. Vertica was the winner.

Measuring the impact

Gardner: Do you have a sense of what the performance and characteristics of Vertica and your data warehouse have gotten for you? Do you have a sense of reduced fraud by X percent or better analytics that have given you a business advantage of some sort? Are there ways to measure the true impact?

Golov: During the last year, Avito grew really fast. We had a moderation team of about 250 people at the beginning of this process. Now, we have the same moderation team, but the number of items has increased two-fold. I suppose that's one of the best measures that can be used.

Gardner: Fair enough. Now, looking to the future, when you're working in a business where your margins, your business, your revenue comes from the ability to provide advertisement placements, improving the performance and the value on the actual distribution of ads and the costs associated is critical.

In addition to rapid fraud detection and protection, is there a value from your analytics that refines the business algorithms and therefore the retail value to your customers?

Golov: We're creating more products. The main aim of them is to create our own tool for optimizing the directions of advertising. We have banners, marketing campaigns, and SMS. So we've achieved some results in our reporting and in fraud prevention. We'll continue to work in that direction, and we are planning to add some new types of functionality to our data warehouse.

Gardner: It certainly seems that a data warehouse delivers a tactical benefit but then over time moves to a strategic benefit. The more data, inference, and understanding you have of your processes, the more powerful you can become as a total business.

Golov: Yes. One of my teachers in data warehouses explained the role of data warehouses in an enterprise. It’s like a diesel engine inside a ship. It just works, works, and works, and it’s hot around it. You can create various tools to increase it, to make it better.

But there must always be something deep inside that continuously provides all of the associated tools with power and strong data services from all sides of the business.

Gardner: I wonder for others who are listening to you and saying, "We really need to have that core platform in order to build out these other values over time." Do you have any lessons that you have learned that you might share? That is to say, if you're starting out to develop your own data warehouse and your own business intelligence (BI) and analytics capabilities, do you have any advice?

Be flexible

Golov: First, you have to be flexible. If you ask the business whether things will change, they'll tell you that they won't. It will be absolutely this, every time. And in two months, it will still change. If you're not ready to change your data warehouse to deliver the needed data and analytics, it will be a disaster. That's first.

Second, there always will be errors in data, there will be gaps, and it's absolutely critical to start building a data warehouse together with an automated data quality system that will automatically control and monitor the quality of all the data. This will help you to see the problems when they occur.
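
As a rough illustration of the kind of automated check described here, the following Python sketch computes a null rate for one field and flags gaps in arrival times. The row layout, the names `QualityReport` and `check_batch`, and the 10-minute gap threshold are all assumptions made for the example; nothing here is taken from Avito's actual system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class QualityReport:
    null_rate: float  # share of rows where the checked field is missing
    gaps: list        # (start, end) timestamp pairs with no rows in between


def check_batch(rows, field, max_gap=timedelta(minutes=10)):
    """Report missing values in `field` and arrival gaps longer than `max_gap`.

    `rows` is a list of dicts that each carry a 'ts' datetime (assumed layout).
    """
    if not rows:
        return QualityReport(null_rate=0.0, gaps=[])
    missing = sum(1 for r in rows if r.get(field) is None)
    ordered = sorted(r["ts"] for r in rows)
    # Compare each timestamp with its successor to find silent periods.
    gaps = [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]
    return QualityReport(null_rate=missing / len(rows), gaps=gaps)


t0 = datetime(2015, 4, 22, 12, 0)
sample = [
    {"ts": t0, "price": 100},
    {"ts": t0 + timedelta(minutes=5), "price": None},
    {"ts": t0 + timedelta(minutes=30), "price": 250},
]
report = check_batch(sample, "price")
print(report.null_rate, report.gaps)
```

In practice a check like this would run on every load and raise an alert when a rate or gap crosses a threshold, which is one way to "see the problems when they occur."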

Gardner: I'm afraid we'll have to leave it there. We've been discussing how Avito, a large e-commerce portal and super retail site in Moscow, has been deploying a data warehouse and BI capability to not only prevent fraud, but also to grow its business through a better understanding of its customers and processes.

So, a big thank you to our guest, Nikolay Golov, Chief Data Warehousing Architect at Avito. Thank you so much.

Golov: Thanks a lot.
Gardner: And I'd like to thank our audience as well for joining us today for our special big data innovation discussion.

I'm Dana Gardner, Principal Analyst at Interarbor Solutions, your host for this ongoing series of HP-sponsored discussions. Thanks for listening, and come back next time.


Transcript of a BriefingsDirect discussion on how a Russian ecommerce and search engine site leverages big data analytics. Copyright Interarbor Solutions, LLC, 2005-2015. All rights reserved.


Thursday, January 22, 2015

adMarketplace Solves Search Intent Challenge with HP Vertica Big Data Warehouse Solution

Transcript of a BriefingsDirect podcast on how a consumer intent search company is able to handle massive amounts of data and analyze it quickly with HP Vertica.

Listen to the podcast. Find it on iTunes. Download the transcript. Sponsor: HP.

Dana Gardner: Hello, and welcome to the next edition of the HP Discover Podcast Series. I’m Dana Gardner, Principal Analyst at Interarbor Solutions, your host and moderator for this ongoing sponsored discussion on IT innovation and how it’s making an impact on people’s lives.

Once again, we're focusing on how companies are adapting to the new style of IT to improve IT performance and deliver better user experiences, as well as better business results.

This time, we're coming to you directly from the recent HP Discover 2014 Conference in Las Vegas. We're here to learn directly from IT and business leaders alike how big data, cloud, and converged infrastructure implementations are supporting their goals.

Our next innovation case study interview explores how New York-based adMarketplace, a search syndication advertising network, has met its daunting data-warehouse requirements.
We'll learn how adMarketplace captures and analyzes massive data to allow for efficient real-time bidding for traffic sources for online advertising. And we'll hear how the data-analysis infrastructure also delivers rapid cost-per-click insights to advertisers.

To learn more about how adMarketplace manages its big-data challenges, please join me in welcoming Michael Yudin, the Chief Technology Officer at adMarketplace.

Michael Yudin: Hello. Thank you, Dana.

Gardner: Tell us first about what adMarketplace does. It sounds very interesting, but I'm not sure I fully understand it.

Yudin: Well, adMarketplace is the leading marketplace for search intent advertising, and let me explain what that means. Search advertising is the best form of advertising ever invented. For the first time, a consumer actually tells a computer what they're interested in. That’s why Google became so successful as a search engine.

Some things are changing in the marketplace these days. Consumer search intent is fracturing. You probably wonder what this means. It’s very simple. What this means is Google is no longer the only place you go to search for stuff.

I'll give you an example. Last night, I was looking for a Brazilian steakhouse here in Las Vegas. I didn't go on Google. I opened my iPhone, fired up a yellow pages (YP) app, and entered "Brazilian steakhouse" in the search box.

There are a variety of apps on my phone like that for travel, sports, news, and various other things I'm interested in. Anytime I search there, I don’t go to Google. Consumer search has really fractured, and adMarketplace has solved the monetization problem for that.

Providing value

Gardner: So when people are searching in areas other than say Google or Yahoo, how does your organization intercept with that and how does that provide value to both the consumer that’s searching and advertisers that want to provide them information?

Yudin: It benefits both the consumer and the advertiser. In the search world, an ad is really nothing more than a search result in response to user’s query. That’s why it’s so great.

Our clients are the Internet's largest marketers and brands. They use adMarketplace to acquire additional customers in addition to the other marketing channels like Google, where they are pretty much already maxed out. There are only so many searches that happen in Google and they're declining. So advertisers are looking for new ways to capture consumer intent and to convert this into sales and measurable return on investment (ROI), and that's what we do for them.

Gardner: Of course, a really important thing here is to match properly, and that requires data and analysis -- and it requires speed. Tell us a little about the requirements. How do you do this technically?

Yudin: You just nailed it. This is a very, very big data problem and it has to be solved at scale and fast. And it’s also a 24x7 problem. We can never take our system down. We have a global business, and anytime you go and you search for something as a consumer, you expect to see the result right away.
Our network handles about half a billion search queries per day and this results in about two terabytes of data per hour constantly generated by our platform, across multiple data centers. We needed a very scalable and robust analytical data warehouse solution that could handle this. Two years ago, we evaluated a number of vendors and settled on HP Vertica, which was best able to satisfy our tough requirements.
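
To put those two figures side by side, here is a back-of-the-envelope calculation in Python. It assumes decimal terabytes and, purely for illustration, spreads all generated data evenly across search queries; neither assumption comes from the interview.

```python
QUERIES_PER_DAY = 500_000_000   # "about half a billion search queries per day"
TB_PER_HOUR = 2                 # "about two terabytes of data per hour"
BYTES_PER_TB = 10**12           # decimal terabyte (assumption)

# Average query rate over a 24-hour day.
queries_per_second = QUERIES_PER_DAY / 86_400

# Daily data volume and the average data generated per query.
bytes_per_day = TB_PER_HOUR * 24 * BYTES_PER_TB
bytes_per_query = bytes_per_day / QUERIES_PER_DAY

# Sustained ingest rate the warehouse has to absorb.
bytes_per_second = TB_PER_HOUR * BYTES_PER_TB / 3600

print(f"{queries_per_second:,.0f} queries/s")
print(f"{bytes_per_query / 1000:,.0f} kB generated per query")
print(f"{bytes_per_second / 1e6:,.0f} MB/s sustained ingest")
```

Under these assumptions the platform averages several thousand queries per second and roughly half a gigabyte per second of new data, which gives a sense of why ingest speed and low-latency querying are both named as requirements.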

Gardner: And are these requirements primarily about the scale and volume, or are we talking also about a need for rapid query, or all the above? Give us a bit more insight into the actual requirements for your network?

Yudin: That's a great question, and I think this is what makes Vertica unique. There are products out there that can store a lot of data, but you can't get this data out of these solutions quickly and at high concurrency. We require a system that can ingest large amounts of data constantly. I am talking about terabytes and terabytes of data. This data has to be queryable right away, with very low latency requirements.

Some of our queries for Advertiser 3D and analytical dashboard are preplanned queries obviously, but they are very big data queries and the service-level agreement (SLA) on these queries is two seconds. Very few products can do that. Some queries are obviously more complex, but we're still talking about seconds and not hours.

Concurrency requirement

On top of this, there's a concurrency requirement and that’s a very big weak spot of a lot of products. Vertica is actually able to provide sufficient concurrency, and it’s never enough.

I do know that there's an upcoming release of Vertica 7, where this is going to be improved even further, but it’s quite acceptable right now. And it has to be fault tolerant, which means that it should be able to sustain a hardware failure on any of its nodes -- and it can do that.

Gardner: Tell us a bit about where you've built Vertica in terms of data centers. Are they your own? Do you have managed service providers? How are you managing your infrastructure that supports Vertica and then therefore your data processes?

Yudin: We own our own infrastructure. So these are not managed services. We actually once used managed services, but we've outgrown them. And Vertica runs on dedicated hardware.

We also have several other Vertica clusters that run on virtualized hardware, and even though it’s dedicated infrastructure, it’s really dedicated at the cloud level now. So call it private cloud. It's a buzzword. It's a mix of dedicated and virtualized. It's elastic scaling.

Gardner: And the transition. You mentioned that two years ago, you were searching for a product. How were you able to bring this on board and what sort of growth have you had as a result -- in terms of data volume, but also in your business, in terms of customers and overall business metrics of growth?

Yudin: This was driven by business requirements. We didn’t just decide that we needed this. So we started to undertake a very, very ambitious project -- Advertiser 3D. If you go to our website, you can read more about it.

This is a very elegant, simple, and yet powerful, system to match and price traffic across a multitude of traffic sources. To deliver this product, we didn’t have a choice. We had to have a powerful analytical back-end data warehouse. That's when we started to evaluate products and chose Vertica.

Gardner: And have there been any other benefits of going to Vertica in terms of being able to increase the number of features, or have you been able to leverage the technology in new business opportunities in terms of what you can offer your customers, not just to have met the requirements, but perhaps whole new types of benefits?

Heavy lifting

Yudin: Definitely. Our customers don’t know and don’t even care that we use Vertica on the back end. That’s probably why we won an HP award, because we integrated it into our overall solution very elegantly and seamlessly, but it obviously does a lot of heavy lifting on the back end.

And the project was successful and transformed our business. Our growth rates have accelerated over 50 percent on our core revenue and performance. Data-savvy marketers and our clients started to see significant double-digit improvements in ROI.

Gardner: As Chief Technology Officer there, you've gone through a fairly significant change in your infrastructure and adoption, as you've just described. Looking back, are there any lessons learned that you could offer to others who are also running into a wall with their data infrastructure or looking for alternatives? Any thoughts on how you would advise them to make the transition?

Yudin: Definitely. The number one piece of advice I would give anybody is don’t believe anything until you do two things: Try it yourself and get references from people who actually use this and whom you trust. That's very important.

Gardner: Well, great. We've been talking about how adMarketplace captures and analyzes massive data to allow for efficient real-time bidding for traffic sources for online advertising.

I would like to thank our guest, Michael Yudin, the Chief Technology Officer at adMarketplace. Thanks so much.

Yudin: Thank you, Dana. My pleasure.

Gardner: And I also want to thank our audience as well for joining us for this special new style of IT discussion coming to you directly from the recent HP Discover 2014 Conference.

I'm Dana Gardner, Principal Analyst at Interarbor Solutions, your host for this ongoing series of HP-sponsored discussions. Thanks again for listening, and come back next time.


Transcript of a BriefingsDirect podcast on how a consumer intent search company is able to handle massive amounts of data and analyze it quickly with HP Vertica. Copyright Interarbor Solutions, LLC, 2005-2015. All rights reserved.


Wednesday, January 08, 2014

Nimble Storage Leverages Big Data and Cloud to Produce Data Performance Optimization on the Fly

Transcript of a BriefingsDirect podcast on how a hybrid storage provider can analyze operational data to bring about increased efficiency.

Listen to the podcast. Find it on iTunes. Download the transcript.

Dana Gardner: Hello, and welcome to the next edition of the HP Discover Performance Podcast Series. I'm Dana Gardner, Principal Analyst at Interarbor Solutions, your moderator for this ongoing discussion of IT innovation and how it’s making an impact on people’s lives.

Once again, we’re focusing on how IT leaders are improving their business performance for better access, use and analysis of their data and information.

Our next innovation case study focuses on how optimized hybrid storage provider Nimble Storage has leveraged big data and cloud to produce significant storage performance and efficiency. Nimble is, of course, also notable for its recent successful IPO.

Learn here how Nimble Storage has leveraged the HP Vertica analytics platform to analyze operational data on mixed-storage environments to optimize workloads. High-performing, cost-effective big-data processing via cloud helps to make the best use of dynamic storage resources, it turns out. A fascinating story.

To learn more, join me in welcoming our guest, Larry Lancaster, Chief Data Scientist at Nimble Storage Inc. in San Jose, California. Welcome, Larry.

Larry Lancaster: Hi, Dana, it's great to talk to you today.

Gardner: I'm glad you could join us. As I said, it's a fascinating use-case. Tell us about the general scope of how you use data in the cloud to create this hybrid storage optimization service.

Lancaster: At a high level, Nimble Storage recognized early, near the inception of the product, that if we were able to collect enough operational data about how our products are performing in the field, get it back home and analyze it, we'd be able to dramatically reduce support costs. Also, we can create a feedback loop that allows engineering to improve the product very quickly, according to the demands that are being placed on the product in the field.

Looking at it from that perspective, to get it right, you need to do it from the inception of the product. If you take a look at how much data we get back for every array we sell in the field, we could be receiving anywhere from 10,000 to 100,000 data points per minute from each array. Then, we bring those back home, we put them into a database, and we run a lot of intensive analytics on those data.

Once you're doing that, you realize that as soon as you do something, you have this data you're starting to leverage. You're making support recommendations and so on, but then you realize you could do a lot more with it. We can do dynamic cache sizing. We can figure out how much cache a customer needs based on an analysis of their real workloads.

We found that big data is really paying off for us. We want to continue to increase how much it's paying off for us, but to do that we need to be able to do bigger queries faster. We have a team of data scientists and we don't want them sitting here twiddling their thumbs. That’s what brought us to Vertica at Nimble.

Using big data

Gardner: It's an interesting juxtaposition that you're using big data in order to better manage data and storage. What better use of it? And what sort of efficiencies are we talking about here, when you are able to get that data in that massive scale and do these analytics and then go back out into the field and adjust? What does that get for you?

Lancaster: We have a very tight feedback loop. In one release we put out, we may make some changes in the way certain things happen on the back end, for example, the way NVRAM is drained. There are some very particular details around that, and we can observe very quickly how that performs under different workloads. We can make tweaks and do a lot of tuning.

Without the kind of data we have, we might have to have multiple cases being opened on performance in the field and escalations, looking at cores, and then simulating things in the lab.

It's a very labor-intensive, slow process with very little data to base the decision on. When you bring home operational data from all your products in the field, you're now talking about being able to figure out in near real-time the distribution of workloads in the field and how people access their storage. I think we have a better understanding of the way storage works in the real world than any other storage vendor, simply because we have the data.

Gardner: So it's an interesting combination of a product lifecycle approach to getting data -- but also combining a service with a product in such a way that you're adjusting in real time.

Lancaster: That’s right. We do a lot of neat things. We do capacity forecasting. We do a lot of predictive analytics to try to figure out when the storage administrator is going to need to purchase something, rather than having them just stumble into the fact that they need to provision for equipment because they've run out of space.
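
One simple way to picture capacity forecasting is to fit a straight line to recent usage and project when it crosses the installed capacity. The Python sketch below does that with ordinary least squares. The function names and sample numbers are hypothetical, and a linear trend is an assumption chosen for brevity; the interview does not say which models Nimble uses.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(days, used_tb, capacity_tb):
    """Project the linear usage trend forward to the capacity limit.

    Returns None when usage is flat or shrinking, since no crossing exists.
    """
    slope, intercept = fit_line(days, used_tb)
    if slope <= 0:
        return None
    full_day = (capacity_tb - intercept) / slope
    # Never report a negative number of days if the array is already full.
    return max(0.0, full_day - days[-1])


days = [0, 1, 2, 3, 4]                 # day index of each sample
used = [10.0, 10.5, 11.0, 11.5, 12.0]  # terabytes in use on that day
remaining = days_until_full(days, used, capacity_tb=20.0)
print(remaining)
```

A forecast like this is what lets a vendor warn an administrator ahead of time instead of leaving them to discover a full array.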

A lot of things that should have been done in storage from the very beginning, and that sound straightforward, were simply never done. We're the first company to take a comprehensive approach to it. We open and close 80 percent of our cases automatically, and 90 percent of them are automatically opened.

We have a suite of tools that run on this operational data, so we don't have to call people up and say, "Please gather this data for us. Please send us these log posts. Please send us these statistics." Now, we take a case that could have taken two or three days and we turn it into something that can be done in an hour.

That’s the kind of efficiency we gain that you can see, and the InfoSight service delivers that to our customers.

Gardner: Larry, just to be clear, you're supporting both flash and traditional disk storage, but you're able to exploit the hybrid relationship between them because of this data and analysis. Tell us a little bit about how the hybrid storage works.

Challenge for hard drives

Lancaster: At a high level, you have hard drives, which are inexpensive, but they're slow for random I/O. For sequential I/O, they are all right, but for random I/O performance, they're slow. It takes time to move the platter and the head. You're looking at 5 to 10 milliseconds seek time for random read.

That's been the challenge for hard drives. Flash drives have come out and they can dramatically improve on that. Now, you're talking about microsecond-order latencies, rather than milliseconds.
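
The latency figures translate directly into random-read throughput. The Python sketch below assumes a single outstanding request at a time, so it ignores queuing and parallelism, and it picks 100 microseconds as a stand-in for "microsecond-order" flash latency; both are assumptions for illustration.

```python
def random_iops(latency_seconds):
    """Random reads per second with one outstanding request at a time."""
    return 1.0 / latency_seconds


hdd_slow = random_iops(10e-3)   # 10 ms seek, upper end of the quoted range
hdd_fast = random_iops(5e-3)    # 5 ms seek, lower end of the quoted range
flash = random_iops(100e-6)     # 100 microseconds (assumed flash latency)

print(f"hard drive: {hdd_slow:.0f} to {hdd_fast:.0f} random reads/s")
print(f"flash:      {flash:.0f} random reads/s")
```

Under these assumptions a hard drive serves on the order of a hundred or two random reads per second, while flash serves thousands.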

But the challenge there is that they're expensive. You could go buy all flash or you could go buy all hard drives and you can live with those downsides of each. Or, you can take the best of both worlds.

Then, there's a challenge. How do I keep the data that I need to access randomly in flash, but keep the rest of the data, where I don't care so much about frequent random-read performance, on the hard drives only, and in that way optimize my use of flash? That's the way you can save money, but it's difficult to do that.

It comes down to having some understanding of the workloads that the customer is running and being able to anticipate the best algorithms and parameters for those algorithms to make sure that the right data is in flash.
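
To make the caching trade-off concrete, here is a small Python simulation of a flash read cache in front of hard drives. It uses a plain least-recently-used (LRU) policy, which is a textbook stand-in and not Nimble's algorithm; the block trace and the latency figures (0.1 ms for flash, 7.5 ms for disk) are likewise assumptions.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-size cache that evicts the least recently read block."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)    # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            self.blocks[block_id] = True         # fetch from disk into flash
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)  # evict the oldest block


def average_latency_ms(cache, flash_ms=0.1, disk_ms=7.5):
    """Mean read latency given the observed hit/miss counts."""
    total = cache.hits + cache.misses
    return (cache.hits * flash_ms + cache.misses * disk_ms) / total


def replay(trace, capacity):
    cache = LRUCache(capacity)
    for block_id in trace:
        cache.read(block_id)
    return cache


trace = [1, 2, 1, 3, 1, 2, 4, 1, 2, 1]  # hypothetical block reads
small = replay(trace, capacity=2)
large = replay(trace, capacity=3)
print(small.hits, average_latency_ms(small))
print(large.hits, average_latency_ms(large))
```

Replaying the same trace against different cache sizes is a toy version of sizing flash from real workload data: the extra slot only pays off because this particular workload keeps returning to a few blocks.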

We've built up an enormous dataset covering thousands of system-years of real-world usage to tell us exactly which approaches to caching are going to deliver the most benefit. It would be hard to be the best hybrid storage solution without the kind of analytics that we're doing.

Gardner: Then, to extrapolate a little bit higher, or maybe wider, for how this benefits an organization, the analysis that you're gathering also pertains to the data lifecycle, things like disaster recovery (DR), business continuity, backups, scheduling, and so forth. Tell us how the data gathering analytics has been applied to that larger data lifecycle equation.

Lancaster: You're absolutely right. One of the things that we do is make sure that we audit all of the storage that our customers have deployed to understand how much of it is protected with local snapshots, how much of it is replicated for disaster recovery, and how much incremental space is required to increase retention time and so on.

We have very efficient snapshots, but at the end of the day, if you're making changes, snapshots still do take some amount of space. So we learn exactly what that overhead is, and how we can help you achieve your disaster recovery goals.

We have a good understanding of that in the field. We go to customers with proactive service recommendations about what they could and should do. But we also take into account the fact that they may be doing DR when we forecast how much capacity they are going to need.

Larger lifecycle

You're right. It is part of a larger lifecycle that we address, but at the end of the day, for my team it's still all about analytics. It's about looking to the data as the source of truth and as the source of recommendation.

We can tell you roughly how much space you're going to need to do disaster recovery on a given type of application, because we can look in our field and see the distribution of the extra space that would take and what kind of bandwidth you're going to need. We have all that information at our fingertips.

When you start to work this way, you realize that you can do things you couldn't do before. And the things you could do before, you can do orders of magnitude better. So we're a great case of actually applying data science to the product lifecycle, but also to front-line revenue and cost enhancement.

Gardner: I think this is a great example and I think you're a harbinger of what we're going to see more and more, which is bringing this high level of intelligence to bear on many other different services, for many different types of products. IT and storage is great and makes a lot of sense as an early adopter. But I can see this is pertaining to many other vertical industries. It illustrates where a lot of big-data value is going to go.

Now, let's dig into how you actually can get that analysis in the speed, at the scale, and at the cost that you require. Tell us about your journey in terms of different analytics platforms and data architectures that you've been using and where you're headed.

Lancaster: To give you a brief history of my awareness of HP Vertica and my involvement around the product, I don’t remember the exact year, but it may have been eight years ago roughly. At some point, there was an announcement that Mike Stonebraker was involved in a group that was going to productize the C-Store Database, which was sort of an academic experiment at UC Berkeley, to understand the benefits and capabilities of real column store.


I was immediately interested and contacted them. I was working at another storage company at the time. I had a 20 terabyte (TB) data warehouse, which at the time was one of the largest Oracle on Linux data warehouses in the world.

They didn't want to touch that opportunity just yet, because they were just starting out in alpha mode. I hooked up with them again a few years later, when I was CTO at a company called Glassbeam, where we developed what's substantially an extract, transform, and load (ETL) platform.

By then, they were well along the road. They had a great product and it was solid. So we tried it out, and I have to tell you, I fell in love with Vertica because of the performance benefits that it provided.

When you start thinking about collecting as many different data points as we like to collect, you have to recognize that you’re going to end up with a couple choices on a row store. Either you're going to have very narrow tables and a lot of them or else you're going to be wasting a lot of I/O overhead, retrieving entire rows where you just need a couple fields.
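The row-store trade-off described here can be put in rough numbers. The following is a minimal sketch only; the table width, field size, and row count are invented for illustration and are not figures from the interview.

```python
# Hypothetical table shape: 50 numeric fields per row, 8 bytes each.
FIELDS_PER_ROW = 50
BYTES_PER_FIELD = 8
ROW_COUNT = 1_000_000
FIELDS_NEEDED = 2  # the query touches only two of the 50 fields


def row_store_bytes(rows, fields_per_row, bytes_per_field):
    """A row store reads whole rows, whatever the query actually needs."""
    return rows * fields_per_row * bytes_per_field


def column_store_bytes(rows, fields_needed, bytes_per_field):
    """A column store reads only the columns the query references."""
    return rows * fields_needed * bytes_per_field


row_bytes = row_store_bytes(ROW_COUNT, FIELDS_PER_ROW, BYTES_PER_FIELD)
col_bytes = column_store_bytes(ROW_COUNT, FIELDS_NEEDED, BYTES_PER_FIELD)

print(f"row store:    {row_bytes:,} bytes scanned")
print(f"column store: {col_bytes:,} bytes scanned")
print(f"ratio:        {row_bytes // col_bytes}x")
```

Under these assumed numbers the wide row store scans 25 times as many bytes as the column store for the same two-field query, before any compression is considered.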

Greater efficiency

That was what piqued my interest at first. But as I began to use it more and more at Glassbeam, I realized that the performance benefits you could gain by using HP Vertica properly were another order of magnitude beyond what you would expect just with the column-store efficiency.

That's because of certain features that Vertica allows, such as something called pre-join projections. We can drill into that sort of stuff more if you like, but, at a high-level, it lets you maintain the normalized logical integrity of your schema, while having under the hood, an optimized denormalized query performance physically on disk.

Now you might ask how you can be efficient if you have a denormalized structure on disk. It's because Vertica allows you to do some very efficient types of encoding on your data. So all of the low cardinality columns that would have been wasting space in a row store end up taking almost no space at all.
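One encoding that makes low cardinality columns nearly free is run-length encoding. The sketch below is plain Python, not Vertica's storage code, and the interview does not say which encodings were actually used; the column contents are hypothetical.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical low cardinality column: 10,000 rows, three distinct values.
column = ["EU"] * 4000 + ["US"] * 5000 + ["APAC"] * 1000

# Sorting the column first gives the longest possible runs.
sorted_column = sorted(column)
encoded = rle_encode(sorted_column)

print(len(sorted_column), "values stored as", len(encoded), "pairs")
print(encoded)
```

The 10,000 stored values shrink to three pairs, which is why a repeated, denormalized column costs so little once it is sorted and encoded.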

What you find, at least it's been my impression, is that Vertica is the data warehouse that you would have wanted to have built 10 or 20 years ago, but nobody had done it yet.

Nowadays, when I'm evaluating other big data platforms, I always have to look at it from the perspective of it's great, we can get some parallelism here, and there are certain operations that we can do that might be difficult on other platforms, but I always have to compare it to Vertica. Frankly, I always find that Vertica comes out on top in terms of features, performance, and usability.

Gardner: When you arrived there at Nimble Storage, what were they using, and where are you now on your journey into a transition to Vertica?

Lancaster: I built the environment here from the ground up. When I got here, there were roughly 30 people. It's a very small company. We started with Postgres. We started with something free. We didn’t want to have a large budget dedicated to the backing infrastructure just yet. We weren’t ready to monetize it yet.

So, we started on Postgres and we've scaled up now to the point where we have about 100 TBs on Postgres. We get decent performance out of the database for the things that we absolutely need to do, which are micro-batch updates and transactional activity. We get that performance because the database lives on Nimble Storage.

I don't know what the largest unsharded Postgres instance is in the world, but I feel like I have one of them. It's a challenge to manage and leverage. Now, we've gotten to the point where we're really enjoying doing larger queries. We really want to understand the entire installed base, and we want to do analyses that extend across the entire base.

Rich information

We want to understand the lifecycle of a volume. We want to understand how it grows, how it lives, what its performance characteristics are, and then how gradually it falls into senescence when people stop using it. It turns out there is a lot of really rich information that we now have access to, which lets us understand storage lifecycles in a way I don't think was possible before.

But to do that, we need to take our infrastructure to the next level. So we've been doing that, and we've loaded a large amount of our sensor data -- that’s the numerical data I have talked about -- into Vertica, started to compare the queries, and then started to use Vertica more and more for all the analysis we're doing.

Internally, we're using Vertica, just because of the performance benefits. I can give you an example. We had a particular query, a particularly large query. It was to look at certain aspects of latency over a month across the entire installed base to understand a little bit about the distribution, depending on different factors, and so on.

We ran that query in Postgres, and depending on how busy the server was, it took anywhere from 12 to 24 hours to run. On Vertica, to run the same query on the same data takes anywhere from three to seven seconds.

I anticipated that because we were aware upfront of the benefits we'd be getting. I've seen it before. We knew how to structure our projections to get that kind of performance. We knew what kind of infrastructure we'd need under it. I'm really excited. We're getting exactly what we wanted and better.

This is only a three node cluster. Look at the performance we're getting. On the smaller queries, we're getting sub-second latencies. On the big ones, we're getting sub-10 second latencies. It's absolutely amazing. It's game changing.

People can sit at their desktops now, manipulate data, come up with new ideas and iterate without having to run a batch and go home. It's a dramatic productivity increase. Data scientists tend to be fairly impatient. They're highly paid people, and you don’t want them sitting at their desk waiting to get an answer out of the database. It's not the best use of their time.

Gardner: Larry, is there another aspect to the HP Vertica value when it comes to the cloud model for deployment? It seems to me that if Nimble Storage continues to grow rapidly and scales that, bringing all that data back to a central single point might be problematic. Having it distributed or in different cloud deployment models might make sense. Is there something about the way Vertica works within a cloud services deployment that is of interest to you as well?

No worries

Lancaster: There's the ease of adding nodes without downtime, the fact that you can create a K-safe cluster. If my cluster is 16 nodes wide now, and I want two nodes redundancy, it's very similar to RAID. You can specify that, and the database will take care of that for you. You don’t have to worry about the database going down and losing data as a result of the node failure every time or two.

I love the fact that you don’t have to pay extra for that. If I want to put more cores or nodes on it or I want to put more redundancy into my design, I can do that without paying more for it. Wow! That’s kind of revolutionary in itself.

It's great to see a database company incented to give you great performance. They're incented to help you work better with more nodes and more cores. They don't have to worry about people not being able to pay the additional license fees to deploy more resources. In that sense, it's great.

We have our own private cloud -- that’s how I like to think of it -- at an offsite colocation facility. We do DR through Nimble Storage. At the same time, we have a K-safe cluster. We had a hardware glitch on one of the nodes last week, and the other two nodes stayed up, served data, and everything was fine.

Those kinds of features are critical, and that ability to be flexible and expand is critical for someone who is trying to build a large cloud infrastructure, because you're never going to know in advance exactly how much you're going to need.

If you do your job right as a cloud provider, people just want more and more and more. You want to get them hooked and you want to get them enjoying the experience. Vertica lets you do that.

Gardner: I'm afraid we'll have to leave it there. We've been learning about how optimized hybrid storage provider Nimble Storage has leveraged big data and cloud to produce unique storage performance analytics and efficiencies. And we've seen how the HP Vertica Analytics platform has been used to analyze Nimble's operational data across mixed storage environments in near real-time, so that they can optimize their workloads and also extend the benefits to a data lifecycle.

So, a big thank you to our guest, Larry Lancaster, Chief Data Scientist at Nimble Storage. Thank you, Larry.

Lancaster: Thanks, Dana.

Gardner: Also, thank you to our audience for joining us for this special HP Discover Performance Podcast.

I'm Dana Gardner, Principal Analyst at Interarbor Solutions, your host for this ongoing series of HP-sponsored discussions. Thanks again for joining, and come back next time.

Listen to the podcast. Find it on iTunes. Download the transcript. Sponsor: HP.

Transcript of a BriefingsDirect podcast on how a hybrid storage provider can analyze operational data to bring about increased efficiency.  Copyright Interarbor Solutions, LLC, 2005-2014. All rights reserved.


Monday, January 05, 2009

A Technical Look at How Parallel Processing Brings Vast New Capabilities to Large-Scale Data Analysis

Transcript of BriefingsDirect podcast on new technical approaches to managing massive data problems using parallel processing and MapReduce technologies.

Listen to the podcast. Download the podcast. Find it on iTunes/iPod. Learn more. Sponsor: Greenplum.

Dana Gardner: Hi, this is Dana Gardner, principal analyst at Interarbor Solutions, and you're listening to BriefingsDirect. Today we present a sponsored podcast discussion on new data-crunching architectures and approaches, ones designed with petabyte data sizes in their sights.

It's now clear that the Internet-size data gathering, swarms of sensors, and inputs from the mobile device fabric, as well as enterprises piling up ever more kinds of metadata to analyze, have stretched traditional data-management models to the breaking point.

In response, advances in parallel processing, using multi-core chipsets have prompted new software approaches such as MapReduce that can handle these data sets at surprisingly low total cost.

We'll examine the technical underpinnings that support the new demands being placed on, and by, extreme data sets. We'll also uncover the means by which powerful new insights are being derived from massive data compilations in near real time.

Here to provide an in-depth look at parallelism, modern data architectures, MapReduce technologies, and how they are coming together, is Joe Hellerstein, professor of computer science at UC Berkeley. Welcome, Joe.

Joe Hellerstein: Good to be here, Dana.

Gardner: Also Robin Bloor, analyst and partner at Hurwitz & Associates. Thanks for joining, Robin.

Robin Bloor: It's good to be here.

Gardner: We're also joined by Luke Lonergan, CTO and co-founder of Greenplum. Welcome to the show, Luke.

Luke Lonergan: Hi, Dana, glad to be here.

Gardner: The technical response to oceans of data is something that has been building for some time. Multi-core processing has also been something in the works for a number of years. Let's go to Joe Hellerstein first. What's different now? What is in the current confluence of events that is making this a good mixture of parallelism, multi-core, and the need to crunch ever more data?

Hellerstein: It's an interesting question, because it's not necessarily a good thing. It's a thing that's emerged that seems to work. One thing you can look at is data growth. Data growth has been following and exceeding Moore's Law over time. What we've been seeing is that the data sets that people are gathering and storing over time have been doubling at a rate of even faster than every 18 months.

That used to track Moore's Law well enough. Processors would get faster about every 18 months. Disk storage densities would go up about every 18 months. RAM sizes would go up by a factor of two about every 18 months.

What's changed in the last few years is that clock speeds on processors have stopped doubling every 18 months. They're growing very slowly, and chip manufacturers like Intel have moved instead to utilizing Moore's Law to put twice as many transistors on a chip every 18 months, but not to make those transistors run your CPU faster.

Instead, what they are doing is putting more processing cores on every chip. You can expect the number of processors on your chip to double every 18 months, but they're not going to get any faster.

So data is growing faster, and we have chips basically standing still, but you're getting more of them. If you want to take advantage of that data, you're going to have to program in parallel to make use of all those processors on the chips. That's the confluence that's happening. It's the slowdown in clock speed growth against the continued growth in data.

Effects on mainstream compute problems

Gardner: Joe, where do you expect that this is going to crop up first? I mentioned a few examples of large data sets from the Internet, such as with Google and what it's doing. We're concerned about the mobile tier and how much data that's providing to the carriers. Is this something that's only going to affect a select few problems in computing, or do you expect this to actually migrate down into what we consider mainstream computing issues?

Hellerstein: We tend to think of Google as having the biggest data sets around. The amazing thing about the Web is the amount of data there that was typed in by people. It's just phenomenal to think about the amount of typing that's gone on to generate that many petabytes of data.

What we're going to see over time is that data production is going to be mechanized and follow Moore's Law as well. We'll have devices generating data. You mentioned sensors. Software logs are big today, and there will be other sources of data ... camera feeds and so on, where automated generation is going to pump out lots of data.

That data doesn't naturally go to Web search, per se. That's data that manufacturers will have, based on their manufacturing processes. There is security data that people who have large physical plants will have coming from video cameras. All the retail data that we are already capturing with things like Universal Product Code (UPC) and radio-frequency identification (RFID) is going to increase as we get finer-grain monitoring of objects, both in the supply chain and in retail.

We're going to see all kinds of large organizations gathering data from all sorts of automated sources. The only reason not to gather that data is when you run out of affordable processing and storage. Anybody with the budget will have as much data as they can budget for and will try to monetize that. It's going to be pervasive.

Gardner: Robin Bloor, you've been writing about these issues for some time. Now, we have had multi-core silicon, and we've had virtualization for some time, but there seems to be a lag between how the software can take advantage of what's going on on the metal. What's behind this discrepancy, and where do you expect that to go?

Bloor: There are different strands to this, because if we talk about parallelization, then with large database products, to a certain extent, we have already moved into the parallelization.

It's an elastic lag that comes from the fact that, when a chip maker does something new on the chip, unless it's just speed -- which was a great thing about clock speed -- you have to change your operating system to some degree to take advantage of what's new on the chip. Then, you may have to change the compilers and the way you write code in order to take advantage of what's on the chip.

It immediately throws a lag into the progress of software, even if the software can take advantage of it. With multi-core, we don't have specific tools to write parallel software, except in one or two circumstances, where people have gone through the trouble to do that. They are not pervasive.

You don't have operating systems naturally built for sharing the workload of multi-core. We have applications like virtualization, for example, that can take advantage of multi-core to some degree, but even those were not specifically written for multi-core. They were written for single-core processes.

So, you have a whole lag in the works here. That, to a certain extent, makes multi-core compelling for where you have parallel software, because it can attack those problems very, very well and can deliver benefit immediately. But you run into a paradox when Intel comes out with a four-way or an eight-way or a 16-way chip set. Then the question is how are you going to use that?

Multi-core becomes the killer app

Gardner: You've written recently Robin that the killer app, so to speak, for multi-core is data query. Why do you feel that's the case?

Bloor: There are a lot of reasons for it. First of all, it parallelizes extremely well. Basically, you have a commanding node that's looking after a data query. You can divide the data and the resources in such a way that you just basically run everything in parallel.

The other thing that's really neat about this application is it's a complete batch application, in the sense that you just keep pushing the data through an engine that keeps doing the queries. So, you're making pretty effective use of all the processes that are available to you. It's very high usage.

If you run an operating system that's based upon intervals, you're waiting for things to happen. At various times, the operating system is idle. It doesn't seem like they're very long times, but mostly on a PC the operating system is never doing anything. Even when you're running applications on a PC, it's rarely doing very much, even in a single-CPU situation. In a multiple-CPU situation, it's very hard to divide the workload.

So that's the situation. You've got this problem that we have with very large heaps of data. They've been growing roughly at a factor of about 1,000 every six years. It's an awesome growth rate. At the same time, we have the technology where we can take a very good dash at this and use the CPU power we've got very effectively.
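The growth rate quoted here can be converted into a doubling time and set against the 18-month doubling discussed earlier in the conversation. This is simple arithmetic on the figures as stated; the function names are illustrative.

```python
import math


def doubling_time_months(growth_factor, period_months):
    """Months needed to double, given total growth over a period."""
    return period_months * math.log(2) / math.log(growth_factor)


def growth_over(period_months, doubling_months):
    """Total growth factor over a period for a given doubling time."""
    return 2 ** (period_months / doubling_months)


SIX_YEARS = 72  # months

# Data: a factor of about 1,000 every six years (figure from the discussion).
data_doubling = doubling_time_months(1000, SIX_YEARS)

# Hardware: doubling every 18 months (figure from the discussion).
moore_growth = growth_over(SIX_YEARS, 18)

print(f"data doubles roughly every {data_doubling:.1f} months")
print(f"18-month doubling gives {moore_growth:.0f}x in six years")
```

On those figures, data doubles roughly every seven months, while an 18-month doubling yields only 16x over the same six years.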

Gardner: Luke Lonergan, we now have a data problem, and we have some shifts and changes in the hardware and infrastructure. What now needs to be brought into this to create a solution among these disparate variables?

Lonergan: Well, it's interesting. As I listened to Joe and Robin talk about the problem, what comes to mind is a transition in computing that happened in the 1970s and 1980s. What we've done at Greenplum is to make a parallel operating system for data analysis.

If you look back on super computing, there were times when people were tackling larger and larger problems of compute. We had to invent different kinds of computers that could tackle that kind of problem with a greater amount of parallelism than people had seen before -- the Connection Machine with 64,000 processors and others.

What we've done with data analysis is to make what Robin brings forward happen -- have all units available within a group of commodity computers, which is the popular computing platform. It's really required for cost-efficient analysis to bring that to bear automatically on structured query language (SQL) queries and a number of different data-intensive computing problems.

The combination of the software-switch interconnect, which Greenplum built into the Greenplum product, and the underlying use of commodity parallel computers, is brought together in this database system that makes it possible to do SQL query and languages like MapReduce with automatic parallelism. We're already handling problems that involve thousands of individual cores on petabytes of data.

The problem is very much real. As Joe indicated, there are very many people storing and analyzing more data. We're very encouraged that most of our customers are finding new uses for data that are earning them more money. Consequently, the driver to analyze more and more data continues to grow. As our customers get more successful, this use of data is becoming really important.

Gardner: Back to Joe. This seems to be a bright spot in computer science, tackling these issues, particularly in regard to massive data sets, not just relational data, of course, but a multitude of different types of content and data. What's being done at the research level that backs up this direction or supports this new solution direction?

Data-centered approach has huge power

Hellerstein: It's an interesting question, because the research goes back a ways. We talked about how database systems and relational query, like SQL, can parallelize neatly. That comes straight out of the research literature projects, like the Gamma Research Project at Wisconsin in the 1980s, and the Bubba Project at MCC. What's happened with that work over time is that it has matured in products like Greenplum, but it's been kind of cornered in the SQL world.

Along came Google and borrowed, reused, and reapplied a lot of that technology to text- and Web-processing with their MapReduce framework. The excitement that comes from such a successful company as Google tackling such a present problem as we have today with the Web, has begun to get the rest of computer science to wake up to the notion that a data-centric approach to parallelism has enormous power.

The traditional approach to parallelism and research in the 1980s was to think about taking algorithms -- typically complicated scientific algorithms that physicists might want to use -- and trying to very cleverly figure out how to run them on lots of cores.

Instead, what you're seeing today is people say, "Wow, well, let's get a lot of data. It's easy to parallelize the data. You break it up into little chunks and you throw it out to different machines. What can we do cleverly in computing with that kind of a framework?" There are a lot of ideas for how to move forward in machine learning and computer vision, and a variety of problems, not just databases now, where you are taking this massively parallel data-flow approach.

Gardner: I've heard this term "shared nothing architecture," and I have to admit I don't know anything about what it means. Robin, do you have a sense of what that means, and how that relates to what we are discussing?

Bloor: Yeah, I do. The first time I ran into this was not in respect to this at all. I did some work for the Hong Kong Jockey Club in the 1990s. What they do is take all the gambling on all the horse racing that goes on in Hong Kong. It's a huge operation, much, much bigger than its name sounds.

In those days, they got, I think, the largest transaction rate in the world, or at least it was amongst the top ten. They were getting 3,000 bets in the last second before a race, and they lose the money from the bet if the bet doesn't go on.

The law in Hong Kong was that the bet has to be registered on disk, before it was actually a real bet. So, if in any way, anything fell over or broke during the minute leading up to a race, a lot of money could be lost.

Basically they had an architecture that was a shared nothing architecture. They had a router in front of an awful lot of servers, which were doing nothing but taking bets and writing them to disk. It was server, after server, after server. If at any point, there was any indication that the volume was going up, they would just add servers, and it would divide the workload into smaller and smaller chunks, so it could do it.

You can think of it as almost being like a supermarket, in the sense of lots and lots of different tills and lots of queues for people, but each till is a resource on its own, and it shares nothing with anything else. Therefore, no bottlenecks can build up around any particular line.

If you have somebody directing the traffic, you can make sure that the flow goes through. So you go from that, straight into a query on a very large heap of data, if you manage to divide the data up in an efficient way.

A lot of these very big databases consist of nothing more than one big fact table -- a little bit more, but not much more than one big fact table. You split that over 100 machines, and you have a query against a whole fact table. Then, you just actually have 100 queries against 100 different data sets, and you bring the answer back together again.

You can even do fault tolerance in terms of the router for all this. So, with that, you can end up with nothing being shared, and you just have the speed. Basically, any device that's out there is doing a bit of query for you. If you've got 1,000 of them, you go 1,000 times faster. This scales extraordinarily well, because nothing is shared.
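The pattern Bloor describes, one query becoming 100 local queries whose answers are brought back together, can be sketched as follows. This is a toy, single-process illustration with an invented fact table; in a real shared nothing system each slice would sit on its own machine and the local queries would run at the same time.

```python
from collections import Counter


def partition(rows, node_count):
    """Deal rows out round-robin so each node owns a private slice."""
    nodes = [[] for _ in range(node_count)]
    for i, row in enumerate(rows):
        nodes[i % node_count].append(row)
    return nodes


def local_query(rows):
    """Sum sales per region using only the rows this node holds."""
    totals = Counter()
    for region, amount in rows:
        totals[region] += amount
    return totals


def gather(partials):
    """Bring the per-node answers back together into one result."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds the counts
    return merged


# Hypothetical fact table of (region, sale_amount) rows.
fact_table = [("north", 10), ("south", 5), ("north", 7), ("east", 3)] * 250

nodes = partition(fact_table, 100)
result = gather(local_query(node_rows) for node_rows in nodes)

print(dict(result))
```

Because a sum can be merged from partial sums, no node ever needs to see another node's rows, which is what lets the approach scale by simply adding machines.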

Gardner: Luke, tell me how these concepts of being able to scale relate to what the developers need to do. It seems to me that we've got some infrastructure benefits, but if we don't have the connection between how these business analysts and others that are seeking the results can use these infrastructure benefits, we're not capitalizing. What needs to happen now in terms of the logic as that relates to the data?

The net effects on users

Lonergan: It's a good question, because, in the end, it's about users being able to gain access to all that power. What really turned the corner for general data analysis using SQL is the ability for a user not to have to worry about what kind of table structure they have. They can have lots of small tables joining to lots of big tables, and big tables joining to each other.

These are things they do to make the business map better to the data analysis they're doing. That throws a monkey wrench in the beautiful picture of just subdividing the data and then running individual queries.

What the developer needs is an engine that doesn't care how the data is distributed, per se, just being able to use all of that parallelism on the problems of interest. The core problem we've solved is the ability for our engine to redistribute the data and the computation on the fly, as these queries and analysis are being performed.
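A common way to join tables whose rows start out on arbitrary nodes is to rehash both sides on the join key, so that matching rows land on the same node, and then join locally. The sketch below shows that generic technique in plain Python. It is not Greenplum's optimizer or interconnect, and the tables, names, and layout are invented for illustration.

```python
def redistribute(node_slices, key_index, node_count):
    """Send every row to the node chosen by hashing its join key."""
    shuffled = [[] for _ in range(node_count)]
    for rows in node_slices:
        for row in rows:
            target = hash(row[key_index]) % node_count
            shuffled[target].append(row)
    return shuffled


def local_join(orders, customers):
    """Join the orders and customers that ended up on one node."""
    names = {cust_id: name for cust_id, name in customers}
    return [
        (order_id, names[cust_id], amount)
        for order_id, cust_id, amount in orders
        if cust_id in names
    ]


NODE_COUNT = 3

# Hypothetical starting layout: rows sit wherever they were loaded.
orders_by_node = [
    [(1, 101, 20.0), (2, 102, 35.0)],  # (order_id, customer_id, amount)
    [(3, 101, 12.5)],
    [(4, 103, 8.0)],
]
customers_by_node = [
    [(103, "Cara")],  # (customer_id, name)
    [(101, "Ann")],
    [(102, "Bob")],
]

# Rehash both tables on customer_id, then join node by node.
orders_shuffled = redistribute(orders_by_node, 1, NODE_COUNT)
customers_shuffled = redistribute(customers_by_node, 0, NODE_COUNT)

joined = []
for node in range(NODE_COUNT):
    joined.extend(local_join(orders_shuffled[node], customers_shuffled[node]))

print(sorted(joined))
```

The person writing the query never chooses the layout; the shuffle step makes any starting distribution joinable, at the cost of moving rows between nodes.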

It's the combination, as Robin put it earlier, of a compiler technology, which is our parallelizing optimizer, and a software interconnect, which we call a soft switch technology. The combination of those two things enables a developer of business logic and business analysis not to have to worry about what is underneath them.

The physical model of how the database is distributed in a shared nothing architecture in a Greenplum system is not visible to the developer. That is where the SQL-focused data analytics realm has gone by necessity. It really has made it possible to continue to grow the amount of data, and continue to be able to run SQL analysis against that data. It's the ability to express arbitrarily constructed business rules against a large-scale data store.

Gardner: We did one of these podcasts not too long ago with Tim O'Reilly. He mentioned that he'd heard from Joe Hellerstein that every freshman now at UC Berkeley studying computer science is being taught Hadoop, which is related on an open-source development level and community to MapReduce. SQL is now an elective for seniors.

It seems that maybe we've crossed a threshold here in terms of how people are preparing themselves for this new era. Joe, how does that relate to how this new logic and ability to derive queries from these larger data sets is unfolding?

Hellerstein: What you're seeing there is three things happening at once. The first is that we have a real desire on the educational side to teach the next generation of programmers something about parallelism. It's really sticking your head in the sand to teach programming the way we have always taught it and not address the fact that every efficient program over the next ... forever is going to have to be a parallel program. That's the first issue.

The second issue is what's the simplest thing you can teach to computer science students to give them a tangible feeling for parallelism, to actually get them running code on lots of machines and get it going? The answer is data parallelism -- not a complicated scientific algorithm that's been carefully untangled, but simple data parallelism in a language that doesn't really require them to learn any new conceptual ideas that they wouldn't have learned in a high school AP course where they learned say Python or Java.

When you look at those requirements, you come up with the Google MapReduce model as instantiated in the open-source code of Hadoop. They can write simple straight-line programs that are procedural. They look just like "For" loops and "If-Then" statements. The students can see how that spreads out over a lot of data on a lot of machines. It's a very approachable way to get students thinking about parallelism.
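The kind of straight-line program described here looks roughly like the word count below. It is a single-process imitation of the map, shuffle, and reduce steps in plain Python, not code written against the Hadoop API; the function names and input lines are illustrative.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Combine every value seen for one key into a single total."""
    return word, sum(counts)


lines = ["the quick brown fox", "the lazy dog", "the fox"]

# In a real cluster each line (or file block) could be mapped on a
# different machine; here everything runs in one loop.
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(word, ones) for word, ones in shuffle(mapped).items())

print(counts)
```

The student writes only the two small functions with a loop and a sum; spreading the lines across machines and regrouping the pairs by key is the framework's job.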

The third piece of this, which you can't discount, is the fact that Google is very interested in making sure that they have a pipeline of programmers coming in. They very aggressively have been providing useful pedagogical tools, curriculum, and software projects, to universities to ramp this up.

So it's a win-win for the students, for the university, and frankly for Google, Yahoo, and IBM, who have been pushing this stuff. It's an interesting thing, an academic-industrial collaboration for education.

At the business level

Gardner: Let's bring this from a slightly abstract level down to a business level. We seem to be focusing more on purpose-built databases, appliances, packaging these things a little differently than we had in a distributed environment. Luke, what's going on in terms of how we package some of these technologies, so that businesses can start using them, perhaps at a crawl, walk, run type of a ramp up?

Lonergan: Businesses have invested a tremendous amount of their time over the last 15 to 25 years in SQL, and some of the more traditional kinds of business analysis that pay off very well are ensconced in that programming model. So, packaging a system that can do transactional, mixed workloads with large amounts of concurrency, with applications that use the SQL paradigm, is very important.

Second, the ability to leverage the trends in microprocessors and inexpensive servers, and combine those with this kind of software model that scales and takes advantage of very high degree of parallelism, requires a certain amount of integration expertise.

Packaging this together as software plus hardware, making that available as a reference architecture for customers, has been very important and has been very successful in our accounts at New York Stock Exchange, Fox, MySpace, and many others.

Finally, as Joe and you were hinting at, there are changes in the programming paradigm. In being able to crawl, walk, and then run, you have to support the legacy, but then give people a way to get to the future. The MapReduce paradigm is very interesting, because it bridges the gap between traditional data-intensive programming with SQL and the procedural world of unstructured text analysis.

This set of technologies, put together into a single operating system-like formulation and package, has been our approach, and it's been very popular.

Gardner: Robin Bloor, this whole notion of legacy integration is pretty important. A lot of enterprises don't have the luxury of starting out "green field," don't have the luxury of hiring the best and brightest new computer scientists, and working on architecture from a pure requirements-based perspective. They have to deal with what they have in place. Increasingly, they want to relate more of what they have in place into an analytic engine of some kind.

What's being done from your perspective vis-à-vis parallelization and things like MapReduce that allow for backward compatibility, as well as setting yourself up to be positioned to expand and to take advantage of some of these advancements?

Bloor: The problem you have with what is fondly called legacy by everybody is that it really is impossible. The kind of things that were done in the past, very strongly bound the software to the data, to the environment it ran in. Therefore, unhooking that, other than starting again from scratch, is a very difficult thing to do.

Certainly, a lot of work is going on in this area. One thing that you can do is to create something -- I don't know if there is an official title to it, but everybody seems to use the word data fabric. The idea being that you actually just siphon data off from all of the data pools that you have throughout an organization, and use the newer technology in one way or another to apply to the whole data resource, as it exists.

This isn't a trivial thing to do, by the way. There are a lot of things involved, but it's certainly a direction in which things are actually going to move. It's possibly not as well acknowledged as it should be, but most of the things that we call data warehouses out there, the implementations that have been done in the area of business intelligence (BI), actually don't run very well.

You have situations where people post queries, and it may take hours to answer a query. You have a whole scheme, a reason why you are actually mining the data for something, and if every step takes a couple of hours, it's very difficult to carry out an analysis like that in a particularly effective way.

A 100-to-1 value improvement

If you take something like the Greenplum technology, and you point to the same problem, even though you are not dealing with petabytes of data, you can still have this parallel effect. You can get answers back that used to take 100 minutes, and you will get 100 to 1 out of this. You may get more, but you will certainly get 100 to 1 out of this, and it changes the way that you do the job that you have.
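The parallel effect described here can be sketched as a scan that is split into segments, each producing a partial result that is then combined. This is a minimal sketch under stated assumptions: the segment count of 100 is chosen only to mirror the 100-to-1 figure, the helper names are hypothetical, and Python threads are used to show the shape of the computation rather than to claim a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_segments):
    """Split rows round-robin into n_segments roughly equal slices."""
    return [rows[i::n_segments] for i in range(n_segments)]


def scan_segment(segment):
    """Work done independently on one segment: a partial (sum, count)."""
    return sum(segment), len(segment)


def parallel_average(rows, n_segments):
    """Average computed by combining the partial results of every segment."""
    with ThreadPoolExecutor(max_workers=n_segments) as pool:
        partials = list(pool.map(scan_segment, partition(rows, n_segments)))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


def ideal_elapsed_minutes(serial_minutes, n_segments):
    """Best-case elapsed time if the work divides evenly with no overhead."""
    return serial_minutes / n_segments


rows = list(range(1, 1001))          # hypothetical column of values
avg = parallel_average(rows, 100)    # assumed 100 segments
best_case = ideal_elapsed_minutes(100, 100)
```

Under that idealized model, a ten-step analysis in which each step used to take 100 minutes drops from 1,000 minutes to 10, which is the change in working style the passage points to. Real systems would see less than the ideal because of coordination and data movement.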

One thing that's kind of invisible is that there is a lot of data out there that's not being analyzed fast enough to be analyzed effectively. That's something that I think parallelism is going to address.

The other thing where it is going to play a part is that organizations are going to build data fabrics. In one way or another, they will siphon the data off and just handle it in a parallel manner. There is a lot you can do with that, basically.

Gardner: Joe Hellerstein, is there more being brought to this from the data architecture perspective, jibing the old with the new, and then providing yet better performance when it comes to these massive analytic chores?

Hellerstein: What I'm excited about, and I see this at Greenplum -- there's another company called Aster Data that's doing this, and I wouldn't be surprised if we see more of this in the market over time -- is the combination of SQL and MapReduce in a unified way in programming environments. This is a short-term step, but it's a very pragmatic one that can help with people's ability to get their hands on data in an organization.

The idea is that, first of all, you want to have the same access to all your data via either an SQL interface or a MapReduce programming interface. When I say all the data, I mean the stuff you used to get with SQL, the database data, and the stuff you might currently be getting with MapReduce, which might be text files or log files in a distributed-file system. You ought to be able to access those with whatever language suits you, mix and match.

So, you can take your raw log files, which are raw text, and use SQL to join those against a customer table. Or, if you're a MapReduce programmer who does analytics and doesn't know SQL, say you're a statistician, you can write a MapReduce program that does some fancy statistical analysis. You can point it at text fields in a database full of user comments, or at purchase records that you used to have to dump out of the database into text formats to get your hands on. So, part of this is getting more access to more people who have programming paradigms at their fingertips.

Another piece of this is that some things are easier to do in MapReduce, and some things are easier to do in SQL, even when you know both. Good programmers have a lot of tools in their tool belt. They like to be able to use whatever tool is appropriate for the task. Having both of these things interleaved is really quite helpful.
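The mix-and-match access described above can be sketched with the same question answered two ways: raw log lines joined against a customer table through SQL, and the same lines processed in a map-and-reduce style. This is an illustrative sketch only. It uses Python's built-in `sqlite3` module as a stand-in database, and the log format, table layout, and names are all assumptions; it does not show Greenplum's or Aster Data's actual syntax.

```python
import sqlite3
from collections import defaultdict

# Hypothetical raw log lines in the form "<customer_id> <action>".
RAW_LOG = ["1 view", "2 view", "1 purchase", "3 view", "1 view"]
# Hypothetical customer table rows: (id, name).
CUSTOMERS = [(1, "Ana"), (2, "Boris"), (3, "Chen")]


def parse(line):
    """Turn one raw text line into a (customer_id, action) record."""
    customer_id, action = line.split()
    return int(customer_id), action


def events_per_customer_sql(log_lines, customers):
    """SQL route: load parsed log lines next to the table, then join."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    db.execute("CREATE TABLE log (customer_id INTEGER, action TEXT)")
    db.executemany("INSERT INTO customers VALUES (?, ?)", customers)
    db.executemany("INSERT INTO log VALUES (?, ?)", map(parse, log_lines))
    rows = db.execute(
        "SELECT c.name, COUNT(*) FROM log l "
        "JOIN customers c ON c.id = l.customer_id "
        "GROUP BY c.name"
    ).fetchall()
    db.close()
    return dict(rows)


def events_per_customer_mapreduce(log_lines, customers):
    """MapReduce route: map each line to (name, 1), then sum per key."""
    names = dict(customers)  # small table used as a lookup in the map step
    grouped = defaultdict(list)
    for line in log_lines:
        customer_id, _ = parse(line)
        grouped[names[customer_id]].append(1)  # map, then group by key
    return {name: sum(ones) for name, ones in grouped.items()}  # reduce


sql_result = events_per_customer_sql(RAW_LOG, CUSTOMERS)
mr_result = events_per_customer_mapreduce(RAW_LOG, CUSTOMERS)
```

Both routes give the same per-customer counts here, so the choice between them comes down to which tool the programmer finds more natural for the task.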

Gardner: Luke, to what degree are they interleaved now, and to what degree can we expect to see more?

Lonergan: It's been very gratifying that just making some of those pragmatic capabilities available and helping customers to use them has so far yielded some pretty impressive results. We have customers who have solved core business problems, in ways they couldn't have before, by unifying the unstructured text-file data sources with the data that was previously locked up inside the database.

As Joe points out, it's a good programmer who knows how to use all of the various tools that they have at their disposal. Being able to pull one that's just right for the task off the shelf is a great thing to do. With the Greenplum system we've made this available as a simple extension and just another language that one can use with the same parallel data engine, and that's been very successful so far.

Impact on cloud computing

Gardner: Let's look at how this impacts one of the hot topics of the day, and that's cloud computing, the idea that sourcing of resources can come from a variety of organizations. You're not just going to get applications as a service or even Web services, but increasingly infrastructure functionality as a service.

Does this parallelization, some of these new approaches to programming, and the ability to scale have an impact on how well organizations can start taking advantage of what's loosely defined as cloud computing? Let's start with you, Joe.

Hellerstein: I'm not quite sure how this is going to play out. There are a couple of questions about how an individual organization's data will end up in the cloud. Inevitably it will, but in the short-term, people like to keep their data close, particularly database data that's traditionally been in the warehouses, very carefully managed. Those resources are very carefully protected by people in the organization.

It's going to be some time until we really see everybody's data warehouses up in the cloud. That said, as services move into the cloud, the data that those services spit out and generate, their log files, as well as the data that they're actually managing, are going to be up in the cloud anyway.

So, there is this question of, how long will it be until you really get big volumes of data in the cloud. The answer is that certainly new applications will be up there. We may start to see old data getting uploaded in the cloud as well.

There's another class of data that's already becoming available in the cloud. There is this recent announcement from Amazon that they are going to make some large data sets available on their platform for public access. I think we'll see more of this, of data as a utility that's provided by third parties, by governments, by corporations, by whomever has data that they want to share.

We'll start to see big data sets up there that don't necessarily belong to anyone, and they are going to be big. In that environment, you can imagine big data analytics will have to run in the cloud, because that's where the data will be.

One of the fun things about the cloud that's really exciting is the elasticity of the resources. You don't buy yourself a data center full of machines, but you rent as many machines as you need for a task.

If you have a task that's going to look at a lot of data, you would rent a lot of machines for a few hours, and then you would shrink your pool. What this is going to allow people to do is that even small organizations may, for a short period of time, look at an enormous amount of data, which perhaps doesn't originate in their own data production environment, but is something that they want to utilize for their purposes.
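The arithmetic behind this elasticity argument can be made concrete with a small sketch. It assumes the work divides evenly across machines and that machines are billed per machine-hour; the price and workload figures are hypothetical and chosen only for illustration.

```python
def rental_plan(machine_hours, machines, price_per_machine_hour):
    """Elapsed hours and total cost, assuming the work divides evenly."""
    elapsed_hours = machine_hours / machines
    cost = machine_hours * price_per_machine_hour
    return elapsed_hours, cost


# Hypothetical job needing 200 machine-hours at an assumed 0.5 per machine-hour.
small = rental_plan(machine_hours=200, machines=1, price_per_machine_hour=0.5)
burst = rental_plan(machine_hours=200, machines=100, price_per_machine_hour=0.5)
```

Under these assumptions the bill is the same either way, but renting 100 machines brings the elapsed time down from 200 hours to 2, which is why a small organization could afford to look at a very large data set for a short period.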

There is going to be a democratization of the ability to take advantage of information, and it comes from this ability to share, in a temporary way, both the compute resources and the actual content.

Gardner: Let's go to Robin on that. It seems that there is a huge potential payoff if, as Joe mentioned, you can gather data from a variety of sources, perhaps not in your own applications, not your own infrastructure and/or legacy, but go out and rent or borrow some data, but then do some very interesting things with it. That requires joins, that requires us to relate data from one cloud to another or to suck it into one cloud, do some wonderful magic-dust pixie sprinkling on it, and then move along.

How do you view this problem of managing boundaries of clouds, given that there is such a potential, if we could do it well, with data?

Looking at networks

Bloor: There would have to be, because you are looking at a technical problem, and you really are going to have to have specific interfaces for doing that, especially if you are joining data across clouds. Let's drop the word "cloud" and just think large network, because everything that is representative of the cloud ultimately comes down to being somewhat of a larger network.

When you've got something very large, like what Google and Amazon have, then you have this incredible flexibility of resources. You can push resources in or redeploy these resources very, very effectively. But you're not going to be able to do joins across data heaps in one cloud and another cloud, and in perhaps a particular network without there being interfaces that allow you to do that, and without query agents sitting in those particular clouds that are going to go off and do the work. You're going to care very much as to how fast they do that work as well.

This is going to be a job for big engines like Greenplum, rather than your average relational database, because your average relational database is going to be very slow.

Also, you have to master the join. In other words, the result has to arrive somewhere, and be brought together. There are a number of technical issues that are going to have to be addressed, if we're going to do this effectively, but I don't see anything that stops it being done. We have the fast networks to enable this. So, I think it can be done.
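One way to picture the query agents and the need to bring the result together in one place is the sketch below. It uses a semi-join style reduction: the agents first exchange only keys, then each ships just its matching rows to a coordinator that assembles the join. The `QueryAgent` class, its methods, and the sample data are hypothetical, and keys are assumed to be unique within each data set; this is not a description of any vendor's engine.

```python
class QueryAgent:
    """Hypothetical agent sitting next to one cloud's data, filtering locally."""

    def __init__(self, rows):
        self._rows = rows  # list of (key, value) pairs; keys assumed unique

    def keys(self):
        """Return only the join keys, which are cheap to send over a network."""
        return {k for k, _ in self._rows}

    def select(self, wanted_keys):
        """Return only rows whose key is wanted, so less data is shipped."""
        return [(k, v) for k, v in self._rows if k in wanted_keys]


def federated_join(left, right):
    """Join two remote data sets at a coordinator, shipping matching rows only."""
    shared = left.keys() & right.keys()       # step 1: exchange keys only
    left_rows = dict(left.select(shared))     # step 2: each agent filters locally
    right_rows = dict(right.select(shared))
    # Step 3: the partial results arrive in one place and are brought together.
    return {k: (left_rows[k], right_rows[k]) for k in sorted(shared)}


# Hypothetical data held in two separate clouds.
cloud_a = QueryAgent([(1, "x"), (2, "y"), (3, "z")])
cloud_b = QueryAgent([(2, "p"), (3, "q"), (4, "r")])
joined = federated_join(cloud_a, cloud_b)
```

How fast each agent can do its local filtering, and how much data still has to cross the network, are the performance concerns raised in the discussion above.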

Gardner: Luke, last word goes to you. I don't expect you to pre-announce necessarily, but how do you, from Greenplum's perspective, address this need for joining, but recognizing it's a difficult technical problem?

Lonergan: Well, the cloud really manifests itself as a few different things to us. As Joe was saying, people are going to be putting, and are already putting, a lot of services up in the cloud that are generating a lot of new data, and that requires that the kinds of data analysis Robin was hinting at scale to meet that demand.

We already have the engine that implements those kinds of abilities to join between networks. So we are cloud capable. The real action is going to be when people start to do business that counts on public clouds to function properly, and are generating enormous amounts of very valuable data that requires the kind of parallel compute that we provide.

Joining inside clouds, using cloud resources to do the kind of data analysis work, this is all happening as we speak, and this is another aspect of what's forcing the change from an earlier paradigm of database to the modern massively parallel one.

Gardner: I just want to wrap up quickly now. Thank you. Joe Hellerstein, you mentioned earlier on Moore's Law and how it stalled a bit on the silicon. Are we going to see a similar track, however, to what we did with processing over the last 15 years -- a rapid decrease in the total cost associated with these tasks? Even if we don't necessarily see the same effect in terms of the computing, are we going to be able to do what we've been describing here today at an accelerating decreased total cost?

Hellerstein: Absolutely. The only barrier to this is the productivity of programmers.

Just think about storage. I have a terabyte disk in my basement that holds videos, and it costs $100 or so. Ten years ago, a terabyte was referred to by the experts in the field as a "terror byte." That's how worried people were about data volumes like that.

We'll see that again. Disk densities show no signs of slowing down. So, data is going to be essentially no cost. The data-gathering infrastructure is also going to be mechanized. We're going through what I call the industrial revolution of data production. We're just going to build machines to generate data, because we think we can get value out of that data, and we can store it essentially for free.

The compute cost of multi-core with parallelism is going to continue Moore's Law. It's just going to continue it in a parallel programming environment. If we can get all those cores looking at all that data, it won't cost much to do that, and the cost of that will continue to shrink by half.

The only real barrier to the process is to make those systems easy to program and manageable. Cloud helps somewhat with manageability, and programming environments like SQL and MapReduce are well-suited to parallelism. We're going to just see an enormous use of data analysis over time. It's just going to grow, because it gets cheaper and cheaper and bigger and bigger.

Gardner: Well, great, that's very exciting. We've been discussing advances in parallel processing using multi-core chipsets and how that's prompted new software approaches such as MapReduce that can handle these large data sets, as we have just pointed out, at surprisingly low total cost.

I want to thank our panel for today. We have been joined by Joe Hellerstein, professor of computer science at UC Berkeley, and I should point out also an adviser at Greenplum. Thank you for joining, Joe.

Hellerstein: It was a pleasure.

Gardner: Robin Bloor, analyst and partner at Hurwitz & Associates. I appreciate your input, Robin.

Bloor: Yeah, it was fun.

Gardner: Luke Lonergan, CTO and co-founder at Greenplum. Thank you, sir.

Lonergan: Thanks, Dana.

Gardner: This is Dana Gardner, principal analyst at Interarbor Solutions. You have been listening to a sponsored podcast from BriefingsDirect. Thanks and come back next time.

Sponsor: Greenplum.

Transcript of BriefingsDirect podcast on new technical approaches to managing massive data problems using parallel processing and MapReduce technologies. Copyright Interarbor Solutions, LLC, 2005-2009. All rights reserved.