Transcript
of a BriefingsDirect podcast on how a hybrid storage provider can
analyze operational data to bring about increased efficiency.
Listen to the podcast. Find it on iTunes. Download the transcript.
Dana Gardner: Hello, and welcome to the next edition of the
HP Discover Performance Podcast Series. I'm
Dana Gardner, Principal Analyst at
Interarbor Solutions, your moderator for this ongoing discussion of
IT innovation and how it’s making an impact on people’s lives.
Once
again, we're focusing on how IT leaders are improving their business
performance through better access, use, and analysis of their data and
information.
Our next innovation case study focuses on how optimized hybrid storage provider
Nimble Storage has leveraged
big data and
cloud to produce significant storage performance and efficiency gains. Nimble is, of course, also notable for its
recent successful IPO.
Learn here how
Nimble Storage has leveraged the
HP Vertica analytics
platform to analyze operational data on
mixed-storage environments to optimize workloads. High-performing, cost-effective big-data processing via cloud helps
to make the best use of dynamic storage resources, it turns out. A fascinating
story.
To learn more, join me in welcoming our guest,
Larry Lancaster, Chief Data Scientist at Nimble Storage Inc. in San Jose, California. Welcome, Larry.
Larry Lancaster: Hi, Dana, it's great to talk to you today.
Gardner:
I'm glad you could join us. As I said, it's a fascinating use case.
Tell us about the general scope of how you use data in the cloud to
create this
hybrid storage optimization service.
Lancaster:
At a high level, Nimble Storage recognized early, near the inception of
the product, that if we were able to collect enough operational data
about how our products are performing in the field, get it back home
and analyze it, we'd be able to dramatically reduce support costs. Also,
we can create a
feedback loop
that allows engineering to improve the product very quickly, according
to the demands that are being placed on the product in the field.
Looking at it from that perspective, to get it right,
you need to do it from the inception of the product. If you take a look
at how much data we get back for
every array we sell in the field, we
could be receiving anywhere from 10,000 to 100,000 data points per
minute from each array. Then we bring those back home, put them into
a database, and run a lot of intensive analytics on those data.
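To put that collection rate in perspective, here is a quick back-of-the-envelope calculation; the fleet size is hypothetical, purely for illustration, not a Nimble figure:

```python
# Telemetry volume implied by the rates above: up to 100,000 data points
# per minute from each array, aggregated across a hypothetical fleet.
points_per_min = 100_000          # top end of the per-array rate
arrays = 1_000                    # hypothetical fleet size, for illustration
per_day = points_per_min * 60 * 24 * arrays
print(f"{per_day:,} data points per day")
```

At that scale, the choice of analytics database quickly becomes the bottleneck.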
Once
you're doing that, you realize that as soon as you do something, you
have this data you're starting to leverage. You're making support
recommendations and so on, but then you realize you could do a lot more
with it. We can do
dynamic cache sizing. We can figure out how much cache
a customer needs based on an
analysis of their real workloads.
We
found that big data is really paying off for us. We want to continue to
increase how much it's paying off for us, but to do that we need to be
able to do bigger queries faster. We have a team of data scientists and
we don't want them sitting here twiddling their thumbs. That’s what
brought us to Vertica at Nimble.
Using big data
Gardner:
It's an interesting juxtaposition that you're using big data in order
to better manage data and storage. What better use of it? And what sort
of efficiencies are we talking about here, when you are able to get that
data in that massive scale and do these analytics and then go back out
into the field and adjust? What does that get for you?
Lancaster:
We have a very tight feedback loop. In one release we put out, we may
make some changes in the way certain things happen on the back end, for
example, the way
NVRAM
is drained. There are some very particular details around that, and we
can observe very quickly how that performs under different workloads. We
can make tweaks and do a lot of tuning.

Without
the kind of data we have, we might have to have multiple cases being
opened on performance in the field and escalations, looking at cores,
and then simulating things in the lab.
It's a very
labor-intensive, slow process with very little data to base the decision
on. When you bring home operational data from all your products in the
field, you're now talking about being able to figure out in near real-time the distribution of workloads in the field and how people access
their storage. I think we have a
better understanding of the way storage works in the real world than any other storage vendor, simply because
we have the data.
Gardner: So it's an
interesting combination of a product lifecycle approach to getting data -- but also combining a service with a product in such a way that you're
adjusting in real time.
Lancaster: That’s right.
We do a lot of neat things. We do capacity forecasting. We do a lot of
predictive analytics to try to figure out when the storage administrator
is going to need to purchase something, rather than having them just
stumble into the fact that they need to provision new equipment because
they've run out of space.
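The forecasting idea can be sketched in a few lines; this is an illustrative linear trend fit, not Nimble's actual predictive model:

```python
# Illustrative capacity forecast: fit a least-squares trend to daily
# usage samples and estimate when the array fills up.
def days_until_full(daily_used_gb, capacity_gb):
    """Return estimated days until capacity is reached, or None if usage is flat or shrinking."""
    n = len(daily_used_gb)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_used_gb) / n
    # Least-squares slope: growth in GB per day.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_used_gb)) \
            / sum((x - mean_x) ** 2 for x in xs)
    if slope <= 0:
        return None
    remaining = capacity_gb - daily_used_gb[-1]
    return remaining / slope

# An array growing ~10 GB/day with 500 GB free: roughly 50 days of headroom.
print(days_until_full([100, 110, 120, 130, 140], 640))
```

A real model would account for seasonality and bursts, but the principle is the same: warn the administrator before the space runs out.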
A
lot of things that should have been done in storage from the very
beginning that sound straightforward were simply never done. We're
the first company to take a comprehensive approach to it. We open and close
80 percent of our cases automatically, and 90 percent of them are
opened automatically.
We have a suite of tools that run
on this operational data, so we don't have to call people up and say,
"Please gather this data for us. Please send us these log files. Please
send us these statistics." Now, we take a case that could have taken two
or three days and we turn it into something that can be done in an
hour.
That’s the kind of efficiency we gain that you can see, and the
InfoSight service delivers that to our customers.
Gardner: Larry, just to be clear, you're supporting both
flash
and traditional disk storage, but you're able to exploit the hybrid
relationship between them because of this data and analysis. Tell us a
little bit about how the
hybrid storage works.
Challenge for hard drives
Lancaster:
At a high level, you have hard drives, which are inexpensive, but
they're slow for random
I/O. For sequential I/O, they are all right, but
for random I/O performance, they're slow. It takes time to move the
platter and the head. You're looking at 5 to 10 milliseconds seek time
for random read.
That's been the challenge for hard
drives. Flash drives have come out and they can dramatically improve on
that. Now, you're talking about microsecond-order latencies, rather than
milliseconds.
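The latency gap translates directly into a random-IOPS ceiling. Some back-of-the-envelope arithmetic using the figures above:

```python
# A single disk head serializes random reads, so average access latency
# caps the random IOPS a device can deliver.
def max_random_iops(latency_seconds):
    return 1.0 / latency_seconds

hdd_iops = max_random_iops(7.5e-3)   # ~7.5 ms average seek time
ssd_iops = max_random_iops(100e-6)   # ~100 microsecond flash read
print(round(hdd_iops), round(ssd_iops))
```

Roughly 133 random IOPS for the hard drive versus 10,000 for the flash device: close to two orders of magnitude.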
But the challenge there is that they're
expensive. You could go buy all flash or you could go buy all hard
drives and you can live with those downsides of each. Or, you can take
the best of both worlds.
Then, there's a challenge. How
do I keep the data that I need to access randomly in flash, while keeping
the rest of the data, whose random-read performance I don't care so much
about, on the hard drives only, and in that way optimize my use of flash?
That's the way you can save money, but it's difficult to do.
It comes down to having some
understanding of the workloads that the customer is running and being
able to anticipate the best algorithms and parameters for those
algorithms to make sure that the right data is in flash.
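A toy version of such a policy, purely for illustration; the block IDs and the admission rule here are invented, not Nimble's algorithm:

```python
# Toy flash-admission policy in the spirit described: given per-block
# random-read counts observed in a workload trace, pin the hottest blocks
# in flash and leave the cold, sequential bulk on disk.
from collections import Counter

def choose_flash_resident(random_reads, flash_capacity_blocks):
    """random_reads: iterable of block IDs, one per observed random read."""
    counts = Counter(random_reads)
    hottest = counts.most_common(flash_capacity_blocks)
    return {block for block, _ in hottest}

trace = [1, 7, 7, 3, 7, 1, 9, 7, 1, 2]   # blocks 7 and 1 dominate random I/O
print(choose_flash_resident(trace, 2))    # blocks 7 and 1 go to flash
```

The hard part in practice is exactly what Larry describes: knowing, from real-world workload data, which counting window and admission thresholds actually pay off.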
We've
built up an enormous dataset covering thousands of system-years of
real-world usage to tell us exactly which approaches to caching are
going to deliver the most benefit. It would be hard to be the best
hybrid storage solution without the kind of analytics that we're doing.
Gardner:
Then, to extrapolate a little bit higher, or maybe wider, for how this
benefits an organization, the analysis that you're gathering also
pertains to the data lifecycle, things like
disaster recovery (DR),
business continuity,
backups, scheduling, and so forth. Tell us how the data gathering
analytics has been applied to that larger data lifecycle equation.
Lancaster:
You're absolutely right. One of the things that we do is make sure that
we audit all of the storage that our customers have deployed to
understand how much of it is protected with local snapshots, how much of
it is replicated for disaster recovery, and how much incremental space
is required to increase retention time and so on.
We
have very efficient snapshots, but at the end of the day, if you're
making changes, snapshots still take some amount of space. So we
learn exactly what that overhead is and how we can help you achieve your
disaster-recovery goals.
We have a good understanding
of that in the field. We go to customers with proactive service
recommendations about what they could and should do. But we also take
into account the fact that they may be doing DR when
we forecast how much capacity they are going to need.
Larger lifecycle
You're
right. It is part of a larger lifecycle that we address, but at the end
of the day, for my team it's still all about analytics. It's about
looking to the data as the source of truth and as the source of
recommendation.
We can tell you roughly how much space
you're going to need to do disaster recovery on a given type of
application, because we can look in our field and see the distribution
of the extra space that would take and what kind of bandwidth you're
going to need. We have all that information at our fingertips.
When
you start to work this way, you realize that you can do things you
couldn't do before. And the things you could do before, you can do
orders of magnitude better. So we're a great case of actually applying
data science to the product lifecycle, but also to front-line revenue
and cost enhancement.
Gardner: I think this is a
great example and I think you're a harbinger of what we're going to see
more and more, which is bringing this high level of intelligence to
bear on many other different services, for many different types of
products. IT and storage is great and makes a lot of sense as an early
adopter. But I can see this is pertaining to many other vertical
industries. It illustrates where a lot of big-data value is going to go.
Now,
let's dig into how you actually can get that analysis in the speed, at
the scale, and at the cost that you require. Tell us about your journey
in terms of different analytics platforms and data architectures that
you've been using and where you're headed.
Lancaster:
To give you a brief history of my awareness of
HP Vertica and my
involvement around the product, I don’t remember the exact year, but it
may have been eight years ago roughly. At some point, there was an
announcement that
Mike Stonebraker was involved in a group that was going to productize the
C-Store database, which was sort of an academic project at MIT, to understand the benefits and capabilities of a real
column store.
I was immediately interested and contacted them. I was working at another storage company at the time. I had a 20
terabyte (TB) data warehouse, which at the time was one of the largest
Oracle on
Linux data warehouses in the world.
They
didn't want to touch that opportunity just yet, because they were just
starting out in alpha mode. I hooked up with them again a few years
later, when I was CTO at a company called
Glassbeam, where we developed
what's substantially an
extract, transform, and load (ETL) platform.
By
then, they were well along the road. They had a great product and it
was solid. So we tried it out, and I have to tell you, I fell in love
with Vertica because of the performance benefits that it provided.
When
you start thinking about collecting as many different data points as we
like to collect, you have to recognize that you're going to end up with
a couple of choices on a row store. Either you're going to have very
narrow tables and a lot of them, or else you're going to pay a lot of
I/O overhead, retrieving entire rows where you just need a couple
of fields.
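The I/O penalty he's describing is easy to quantify with rough arithmetic; the table dimensions below are invented for illustration:

```python
# To scan two fields from a wide table, a row store reads every column,
# while a column store touches only the columns the query needs.
def bytes_scanned_row_store(rows, bytes_per_row):
    return rows * bytes_per_row

def bytes_scanned_column_store(rows, needed_column_widths):
    return rows * sum(needed_column_widths)

rows = 1_000_000
row_width = 400                      # e.g. 50 columns averaging 8 bytes each
print(bytes_scanned_row_store(rows, row_width))       # full-row scan
print(bytes_scanned_column_store(rows, [8, 8]))       # two 8-byte columns
```

In this sketch, the row store scans 400 MB where the column store scans 16 MB, a 25x reduction in I/O before any compression is applied.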
Greater efficiency
That
was what piqued my interest at first. But as I began to use it more and
more at Glassbeam, I realized that the performance benefits you could
gain by using HP Vertica properly were another order of magnitude beyond
what you would expect just with the column-store efficiency.
That's
because of certain features that Vertica allows, such as something
called pre-join projections. We can drill into that sort of stuff more
if you like, but, at a high level, it lets you maintain the normalized
logical integrity of your schema while having, under the hood, an
optimized, denormalized physical layout on disk for query performance.
Now,
you might ask how you can be efficient if you have a denormalized structure
on disk. It's because Vertica allows you to do some very efficient
types of encoding on your data. So all of the low cardinality columns
that would have been wasting space in a row store end up taking almost
no space at all.
What you find, at least it's been my
impression, is that Vertica is the data warehouse that you would have
wanted to have built 10 or 20 years ago, but nobody had done it yet.
Nowadays,
when I'm evaluating other big data platforms, I always have to look at
it from the perspective of it's great, we can get some parallelism here,
and there are certain operations that we can do that might be difficult
on other platforms, but I always have to compare it to Vertica.
Frankly, I always find that Vertica comes out on top in terms of
features, performance, and usability.
Gardner:
When you arrived there at Nimble Storage, what were they using, and
where are you now on your journey into a transition to Vertica?
Lancaster:
I built the environment here from the ground up. When I got here, there
were roughly 30 people. It's a very small company. We started with
Postgres.
We started with something free. We didn’t want to have a large budget
dedicated to the backing infrastructure just yet. We weren’t ready to
monetize it yet.
So, we started on Postgres and we've
scaled up now to the point where we have about 100 TBs on Postgres. We
get decent performance out of the database for the things that we
absolutely need to do, which are micro-batch updates and transactional
activity. We get that performance because the database lives on Nimble
Storage.
I don't know what the largest
unsharded
Postgres instance is in the world, but I feel like I have one of them.
It's a challenge to manage and leverage. Now, we've gotten to the point
where we're really enjoying doing larger queries. We really want to
understand the entire installed base of how we want to do analyses that
extend across the entire base.
Rich information
We
want to understand the lifecycle of a volume. We want to understand how
it grows, how it lives, what its performance characteristics are, and
then how gradually it falls into senescence when people stop using it.
It turns out there is a lot of really rich information that we now have
access to for understanding storage lifecycles in a way I don't think was
possible before.
But to do that, we need to take our
infrastructure to the next level. So we've been doing that. We've
loaded a large amount of our sensor data -- that's the numerical data I
talked about -- into Vertica, started to compare the queries, and then
started to use Vertica more and more for all the analysis we're doing.
Internally,
we're using Vertica, just because of the performance benefits. I can
give you an example. We had a particular query, a particularly large
query. It was to look at certain aspects of latency over a month across
the entire installed base to understand a little bit about the
distribution, depending on different factors, and so on.
We
ran that query in Postgres, and depending on how busy the server was,
it took anywhere from 12 to 24 hours to run. On Vertica, to run the
same query on the same data takes anywhere from three to seven seconds.
I
anticipated that because we were aware upfront of the benefits we'd be
getting. I've seen it before. We knew how to structure our projections
to get that kind of performance. We knew what kind of infrastructure
we'd need under it. I'm really excited. We're getting exactly what we
wanted and better.
This is only a three-node cluster.
Look at the performance we're getting. On the smaller queries, we're
getting sub-second latencies. On the big ones, we're getting sub-10
second latencies. It's absolutely amazing. It's game changing.
People
can sit at their desktops now, manipulate data, come up with new ideas
and iterate without having to run a batch and go home. It's a
dramatic
productivity increase. Data scientists tend to be fairly impatient.
They're highly paid people, and you don’t want them sitting at their
desk waiting to get an answer out of the database. It's not the best use
of their time.
Gardner: Larry, is there another
aspect to the HP Vertica value when it comes to the cloud model for
deployment? It seems to me that if Nimble Storage continues to grow
rapidly and scales that, bringing all that data back to a central single
point might be problematic. Having it distributed or in different cloud
deployment models might make sense. Is there something about the way
Vertica works within a cloud services deployment that is of interest to
you as well?
No worries
Lancaster:
There's the ease of adding nodes without downtime, the fact that you
can create a
K-safe cluster. If my cluster is 16 nodes wide now, and I
want two-node redundancy, it's very similar to
RAID.
You can specify that, and the database will take care of that for you.
You don't have to worry about the database going down and losing data as
a result of an occasional node failure.
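The K-safety rule Larry describes can be sketched as a simple availability predicate; this is a simplification, as Vertica's actual rules also depend on how data segments are placed across nodes:

```python
# Simplified K-safety check: each node's data is also kept on K buddy
# nodes, so the cluster survives up to K simultaneous node failures,
# provided a majority of nodes remains up.
def cluster_survives(total_nodes, k_safety, failed_count):
    up = total_nodes - failed_count
    return failed_count <= k_safety and up > total_nodes / 2

print(cluster_survives(3, 1, 1))    # the two surviving nodes carry on
print(cluster_survives(16, 2, 3))   # three failures exceed K=2
```

This matches the episode Larry mentions below: a three-node, K=1 cluster losing one node to a hardware glitch keeps serving data.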
I love
the fact that you don’t have to pay extra for that. If I want to put
more cores or nodes on it or I want to put more redundancy into my
design, I can do that without paying more for it. Wow! That’s kind of
revolutionary in itself.
It's great to see a database
company incented to give you great performance. They're incented to help
you work better with more nodes and more cores. They don't have to
worry about people not being able to pay the additional license fees to
deploy more resources. In that sense, it's great.
We
have our own private cloud -- that’s how I like to think of it -- at an
offsite colocation facility. We do DR through Nimble Storage. At the same time, we have
a K-safe cluster. We had a hardware glitch on one of the nodes last
week, and the other two nodes stayed up, served data, and everything was
fine.
Those
kinds of features are critical, and that ability to be flexible and
expand is critical for someone who is trying to build a large cloud
infrastructure, because you're never going to know in advance exactly
how much you're going to need.
If you do your job right
as a cloud provider, people just want more and more and more. You want
to get them hooked and you want to get them enjoying the experience.
Vertica lets you do that.
Gardner: I'm afraid we'll have to leave it there. We've been learning about
how optimized hybrid storage provider Nimble Storage has leveraged big
data and cloud to produce unique storage performance analytics and
efficiencies. And we've seen how the HP Vertica Analytics platform has
been used to analyze Nimble's operational data across mixed storage
environments in near real-time, so that they can optimize their
workloads and also extend the benefits to a data lifecycle.
So,
a big thank you to our guest, Larry Lancaster, Chief Data Scientist at Nimble Storage. Thank you, Larry.
Lancaster: Thanks, Dana.
Gardner: Also, thank you to our audience for joining us for this special HP Discover Performance Podcast.
I'm
Dana Gardner; Principal Analyst at Interarbor Solutions, your host for
this ongoing series of HP-sponsored discussions. Thanks again for
joining, and come back next time.
Sponsor: HP.
Copyright
Interarbor Solutions, LLC, 2005-2014. All rights reserved.