BriefingsDirect Transcripts: HPE

Monday, December 07, 2020

How to Industrialize Data Science to Attain Mastery of Repeatable Intelligence Delivery

Transcript of a discussion on the latest methods, tools, and thinking around making data science an integral core function of any business.

Listen to the podcast. Find it on iTunes. Download the transcript. Sponsor: Hewlett Packard Enterprise.

Dana Gardner: Hello, and welcome to the next BriefingsDirect Voice of Analytics Innovation podcast series. I’m Dana Gardner, Principal Analyst at Interarbor Solutions, your host and moderator for this ongoing discussion on the latest insights into data science advances and strategy.

Businesses these days are quick to declare their intention to become data-driven, yet the deployment of analytics and the use of data science remains spotty, isolated, and often uncoordinated. To fully reach their digital business transformation potential, businesses large and small need to make data science more of a repeatable assembly line -- an industrialization, if you will, of end-to-end data exploitation.

Stay with us now as we explore the latest methods, tools, and thinking around making data science an integral core function that both responds to business needs and scales to improve every aspect of productivity.

To learn more about the ways that data and analytics behave more like a factory -- and less like an Ivory Tower -- please join me now in welcoming Doug Cackett, EMEA Field Chief Technology Officer at Hewlett Packard Enterprise. Welcome, Doug.

Doug Cackett: Thank you so much, Dana.

Gardner: Doug, why is there a lingering gap -- and really a gaping gap -- between the amount of data available and the analytics that should be taking advantage of it?

Data’s potential at edge

Cackett: That’s such a big question to start with, Dana, to be honest. We probably need to accept that we’re not doing things the right way at the moment. Actually, Forrester suggests that something like 40 zettabytes of data are going to be under management by the end of this year, which is quite enormous.

And, significantly, more of that data is being generated at the edge through applications, Internet of Things (IoT), and all sorts of other things. This is where the customer meets your business. This is where you’re going to have to start making decisions as well.

So, the gap is two things. It’s the gap between the amount of data that’s being generated and the amount you can actually comprehend and create value from. In order to leverage that data from a business point of view, you need to make decisions at the edge.

You will need to operationalize those decisions and move that capability to the edge where your business meets your customer. That’s the challenge for which we’re all looking to machine learning (ML) -- and the operationalization of all of those ML models into applications -- to make the difference.

Gardner: Why does HPE think that moving more toward a factory model, industrializing data science, is part of the solution to compressing and removing this gap?

Cackett: It’s a math problem, really, if you think about it. If there is exponential growth in data within your business, if you’re trying to optimize every step in every business process you have, then you’ll want to operationalize those insights by making your applications as smart as they can possibly be. You’ll want to embed ML into those applications.

Because, correspondingly, there’s exponential growth in the demand for analytics in your business, right? And yet, the number of data scientists you have in your organization -- I mean, growing them exponentially isn’t really an option, is it? And, of course, budgets are also pretty much flat or declining.

So, it’s a math problem because we need to somehow square away that equation. We somehow have to generate exponentially more models for more data, getting to the edge, but doing that with fewer data scientists and lower levels of budget.
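The arithmetic behind that "math problem" can be made concrete with a rough sketch. The starting numbers, the growth rate, and the function name below are illustrative assumptions, not figures from the discussion.

```python
def models_per_scientist(year, base_models=10, growth=2.0, scientists=5):
    """Models each data scientist must deliver in a given year.

    Assumes (hypothetically) that demand for models multiplies by
    `growth` every year while the team size stays flat.
    """
    demand = base_models * growth ** year
    return demand / scientists


# With a flat team, the per-person workload grows as fast as demand does.
for year in range(5):
    print(year, models_per_scientist(year))
```

Under these assumptions the workload per person doubles every year, which is the gap that removing waste and automating hand-offs is meant to absorb.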

Industrialization, we think, is the only way of doing that. Through industrialization, we can remove waste from the system and improve the quality and control of those models. All of those things are going to be key going forward.

Gardner: When we’re thinking about such industrialization, we shouldn’t necessarily be thinking about an assembly line of 50 years ago -- where there are a lot of warm bodies lined up. I’m thinking about the Lucille Ball assembly line, where all that candy was coming down and she couldn’t keep up with it.

Perhaps we need more of an ultra-modern assembly line, where it’s a series of robots and with a few very capable people involved. Is that a fair analogy?

Industrialization of data science

Cackett: I think that’s right. Industrialization is about manufacturing where we replace manual labor with mechanical mass production. We are not talking about that. Because we’re not talking about replacing the data scientist. The data scientist is key to this. But we want to look more like a modern car plant, yes. We want to make sure that the data scientist is maximizing the value from the data science, if you like.

We don’t want to go hunting around for the right tools to use. We don’t want to wait for the production line to play catch up, or for the supply chain to catch up. In our case, of course, that’s mostly data or waiting for infrastructure or waiting for permission to do something. All of those things are a complete waste of their time.

As you look at the amount of productive time data scientists spend creating value, that can be pretty small compared to their non-productive time -- and that’s a concern. Part of the non-productive time, of course, has been with those data scientists having to discover a model and optimize it. Then they would do the steps to operationalize it.

But maybe doing the data and operations engineering things to operationalize the model can be much more efficiently done with another team of people who have the skills to do that. We’re talking about specialization here, really.

But there are some other learnings as well. I recently wrote a blog about it. In it, I looked at the modern Toyota production system and started to ask questions around what we could learn from what they have learned, if you like, over the last 70 years or so.

It was not just about automation, but also how they went about doing research and development, how they approached tooling, and how they did continuous improvement. We have a lot to learn in those areas.

An awful lot of organizations that I deal with haven’t had a lot of experience around such operationalization problems. They haven’t built that part of their assembly line yet. Automating supply chains and mistake-proofing things, what Toyota called jidoka, are also really important. It’s a really interesting area to be involved with.

Gardner: Right, this is what US manufacturing, in the bricks and mortar sense, went through back in the 1980s when they moved to business process reengineering, adopted kaizen principles, and did what Deming and more quality-emphasis had done for the Japanese auto companies.

And so, back then there was a revolution, if you will, in physical manufacturing. And now it sounds like we’re at a watershed moment in how data and analytics are processed.

Cackett: Yes, that’s exactly right. To extend that analogy a little further, I recently saw a documentary about Morgan cars in the UK. They’re a hand-built kind of car company. Quite expensive, very hand-built, and very specialized.

And I ended up by almost throwing things at the TV because they were talking about the skills of this one individual. They only had one guy who could actually bend the metal to create the bonnet, the hood, of the car in the way that it needed to be done. And it took two or three years to train this guy, and I’m thinking, “Well, if you just automated the process, and the robot built it, you wouldn’t need to have that variability.” I mean, it’s just so annoying, right?

In the same way, with data science we’re talking about laying bricks -- not Michelangelo hammering out the figure of David. What I’m really trying to say is a lot of the data science in our customers’ organizations is fairly mundane. To get that through the door, get it done and dusted, and give them time to do the other bits of finesse using more skills -- that’s what we’re trying to achieve. Both [the basics and the finesse] are necessary and they can all be done on the same production line.

Gardner: Doug, if we are going to reinvent and increase the productivity generally of data science, it sounds like technology is going to be a big part of the solution. But technology can also be part of the problem.

What is it about the way that organizations are deploying technology now that needs to shift? How is HPE helping them adjust to the technology that supports a better data science approach?

Define and refine

Cackett: We can probably all agree that most of the tooling around MLOps is relatively young. The two types of company we see are either companies that haven’t yet gotten to the stage where they’re trying to operationalize more models. In other words, they don’t really understand what the problem is yet.

Forrester research suggests that only 14 percent of organizations that they surveyed said they had a robust and repeatable operationalization process. It’s clear that the other 86 percent of organizations just haven’t refined what they’re doing yet. And that’s often because it’s quite difficult.

Many of these organizations have only just linked their data science to their big data instances or their data lakes. And they’re using it both for the workloads and to develop the models. And therein lies the problem. Often they get stuck with simple things like trying to have everyone use a uniform environment. All of your data scientists are both sharing the data and sharing the computer environment as well.

And data scientists can often be very destructive in what they’re doing. Maybe overwriting data, for example. To avoid that, you end up replicating the data. And if you’re going to replicate terabytes of data, that can take a long period of time. That also means you need new resources, maybe more compute power, and that means approvals, and it might mean new hardware, too.

Often the biggest challenge is in provisioning the environment for data scientists to work on, the data that they want, and the tools they want. That can all often lead to huge delays in the process. And, as we talked about, this is often a time-sensitive problem. You want to get through more tasks and so every delayed minute, hour, or day that you have becomes a real challenge.

The other thing that is key is that data science is very peaky. You’ll find that data scientists may need no resources or tools on Monday and Tuesday, but then they may burn every GPU you have in the building on Wednesday, Thursday, and Friday. So, managing that as a business is also really important. If you’re going to get the most out of the budget you have, and the infrastructure you have, you need to think differently about all of these things. Does that make sense, Dana?

Gardner: Yes. Doug, how is HPE Ezmeral being designed to give data scientists more of what they need, how they need it, and to help close the gap between the ad hoc approach and the right kind of assembly line approach?

Two assembly lines to start

Cackett: Look at it as two assembly lines, at the very minimum. That’s the way we want to look at it. And the first thing the data scientists are doing is the discovery.

The second is the MLOps processes. There will be a range of people operationalizing the models. Imagine that you’re a data scientist, Dana, and I’ve just given you a task. Let’s say there’s a high defection or churn rate from our business, and you need to investigate why.

First you want to find out more about the problem because you might have to break that problem down into a number of steps. And then, in order to do something with the data, you’re going to want an environment to work in. So, in the first step, you may simply want to define the project, determine how long you have, and develop a cost center.

You may next define the environment: Maybe you need CPUs or GPUs. Maybe you need them highly available and maybe not. So you’d select the appropriate-sized environment. You then might next go and open the tools catalog. We’re not forcing you to use a specific tool; we have a range of tools available. You select the tools you want. Maybe you’re going to use Python. I know you’re hardcore, so you’re going to code using Jupyter and Python.

And the next step, you then want to find the right data, maybe through the data catalog. So you locate the data that you want to use and you just want to push a button and get provisioned for that lot. You don’t want to have to wait months for that data. That should be provisioned straight away, right?
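The steps just described (define the project, size the environment, pick tools, pick data, then press a button) can be pictured as a single self-service request. The sketch below is a hypothetical data structure for illustration only; the class name, fields, and example values are assumptions and do not represent any HPE product API.

```python
from dataclasses import dataclass, field


@dataclass
class WorkspaceRequest:
    """Hypothetical self-service request a data scientist might submit."""
    project: str
    cost_center: str
    duration_days: int
    gpus: int = 0                     # 0 means a CPU-only environment
    highly_available: bool = False
    tools: list = field(default_factory=list)      # picked from a tools catalog
    datasets: list = field(default_factory=list)   # picked from a data catalog

    def validate(self):
        """Return a list of problems; an empty list means ready to provision."""
        problems = []
        if self.duration_days <= 0:
            problems.append("duration must be positive")
        if not self.tools:
            problems.append("select at least one tool")
        if not self.datasets:
            problems.append("select at least one dataset")
        return problems


request = WorkspaceRequest(
    project="churn-investigation",
    cost_center="CC-0000",            # placeholder value
    duration_days=30,
    gpus=2,
    tools=["jupyter", "python"],
    datasets=["customer_events"],     # hypothetical dataset name
)
print(request.validate())
```

Capturing the whole request in one validated object is what makes push-button provisioning plausible: nothing is left to a follow-up email or approval thread.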

You can do your work, save all your work away into a virtual repository, and save the data so it’s reproducible. You can also then check the things like model drift and data drift and those sorts of things. You can save the code and model parameters and those sorts of things away. And then you can put that on the backlog for the MLOps team.
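As one illustration of what checking data drift can mean in practice, the sketch below computes a population stability index (PSI) for a single numeric feature, comparing a training-time sample with a recent one. PSI is only one commonly used drift measure, chosen here as an assumption; the discussion does not say which checks are used, and the function name and bin count are illustrative.

```python
import math


def population_stability_index(expected, actual, bins=10):
    """PSI between a training-time sample and a recent sample of one feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0   # fall back to 1.0 for constant data

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)   # clamp out-of-range values
            counts[idx] += 1
        # A small floor keeps the logarithm defined for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


training = [i / 100 for i in range(100)]
shifted = [v + 0.5 for v in training]
print(population_stability_index(training, training))
print(population_stability_index(training, shifted))
```

An identical sample scores zero and the score rises as the recent data moves away from the training distribution, so a check like this can run on a schedule against the saved training data.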

Then the MLOps team picks it up and goes through a similar data science process. They want to create their own production line now, right? And so, they’re going to seek a different set of tools. This time, they need continuous integration and continuous delivery (CICD), plus a whole bunch of data stuff they need to operationalize your model. They’re going to define the way that that model is going to be deployed. Let’s say, we’re going to use Kubeflow for that. They might decide on, say, an A/B testing process. So they’re going to configure that, do the rest of the work, and press the button again, right?
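To make the A/B testing step concrete, here is a minimal sketch of how requests might be split between the current model and a candidate. It is plain Python, not Kubeflow configuration, and the function name, the 10 percent default share, and the hashing scheme are all assumptions for illustration.

```python
import hashlib


def assign_variant(customer_id, share_b=0.1):
    """Deterministically route a customer to model 'A' or candidate 'B'.

    Hashing the ID keeps each customer on the same variant across requests.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000   # value in [0, 1)
    return "B" if bucket < share_b else "A"


# Roughly one in ten customers should see the candidate model.
sample = [assign_variant(f"customer-{i}") for i in range(10000)]
print(sample.count("B") / len(sample))
```

Deterministic assignment matters because a customer who flips between models mid-journey would blur the comparison the test is meant to make.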

Clearly, this is an ongoing process. Fundamentally that requires workflow and automatic provisioning of the environment to eliminate wasted time, waiting for stuff to be available. It is fundamentally what we’re doing in our MLOps product.

But in the wider sense, we also have consulting teams helping customers get up to speed, define these processes, and build the skills around the tools. We can also do this as-a-service via our HPE GreenLake proposition as well. Those are the kinds of things that we’re helping customers with.

Gardner: Doug, what you’re describing as needed in data science operations is a lot like what was needed for application development with the advent of DevOps several years ago. Is there commonality between what we’re doing with the flow and nature of the process for data and analytics and what was done not too long ago with application development? Isn’t that also akin to more of a cattle approach than a pet approach?

Operationalize with agility

Cackett: Yes, I completely agree. That’s exactly what an MLOps process is about. It’s analogous to the sort of CICD, DevOps, part of the IT business. But a lot of that tool chain is being taken care of by things like Kubeflow and MLflow Project, some of these newer, open source technologies.

I should say that this is all very new, the ancillary tooling that wraps around the CICD. The CICD set of tools are also pretty new. What we’re also attempting to do is allow you, as a business, to bring these new tools and on-board them so you can evaluate them and see how they might impact what you’re doing as your process settles down.

The idea is to put them in a wrapper and make them available so we get a more dynamic feel to this. The way we’re doing MLOps and data science generally is progressing extremely quickly at the moment. So you don’t want to lock yourself into a corner where you’re trapped into a particular workflow. You want to be able to have agility. Yes, it’s very analogous to the DevOps movement as we seek to operationalize the ML model.

The other thing to pay attention to is the changes that need to happen to your operational applications. You’re going to have to change those so they can call the ML model at the appropriate place, get the result back, and then render that result in whatever way is appropriate. So changes to the operational apps are also important.

Gardner: You really couldn’t operationalize ML as a process if you’re only a tools provider. You couldn’t really do it if you’re a cloud services provider alone. You couldn’t just do this if you were a professional services provider.

It seems to me that HPE is actually in a very advantageous place to allow the best-of-breed tools approach where it’s most impactful but to also start putting some standard glue around this -- the industrialization. How is HPE in an advantageous place to have a meaningful impact on this difficult problem?

Cackett: Hopefully, we’re in an advantageous place. As you say, it’s not just a tool, is it? Think about the breadth of decisions that you need to make in your organization, and how many of those could be optimized using some kind of ML model.

You’d understand that it’s very unlikely that it’s going to be a tool. It’s going to be a range of tools, and that range of tools is going to be changing almost constantly over the next 10 to 20 years.

This is much more to do with a platform approach because this area is relatively new. Like any other technology, when it’s new it almost inevitably tends to be very technical in implementation. So using the early tools can be very difficult. Over time, the tools mature, with a mature UI and a well-defined process, and they become simple to use.

But at the moment, we’re way up at the other end. And so I think this is about platforms. And what we’re providing at HPE is the platform through which you can plug in these tools and integrate them together. You have the freedom to use whatever tools you want. But at the same time, you’re inheriting the back-end system. So, that’s Active Directory and Lightweight Directory Access Protocol (LDAP) integrations, and that’s linkage back to the data, your most precious asset in your business. Whether that be in a data lake or a data warehouse, in data marts or even streaming applications.

This is the melting point of the business at the moment. And HPE has had a lot of experience helping our customers deliver value through information technology investments over many years. And that’s certainly what we’re trying to do right now.

Gardner: It seems that HPE Ezmeral is moving toward industrialization of data science, as well as other essential functions. But is that where you should start, with operationalizing data science? Or is there a certain order by which this becomes more fruitful? Where do you start?

Machine learning leads change

Cackett: This is such a hard question to answer, Dana. It’s so dependent on where you are as a business and what you’re trying to achieve. Typically, to be honest, we find that the engagement is normally with some element of change in our customers. That’s often, for example, where there’s a new digital transformation initiative going on. And you’ll find that the digital transformation is being held back by an inability to do the data science that’s required.

There is another Forrester report that I’m sure you’ll find interesting. It suggests that 98 percent of business leaders feel that ML is key to their competitive advantage. It’s hardly surprising then that ML is so closely related to digital transformation, right? Because that’s about the stage at which organizations are competing after all.

So we often find that that’s the starting point, yes. Why can’t we develop these models and get them into production in time to meet our digital transformation initiative? And then it becomes, “Well, what bits do we have to change? How do we transform our MLOps capability to be able to do this and do this at scale?”

Often this shift is led by an individual in an organization. There develops a momentum in an organization to make these changes. But the changes can be really small at the start, of course. You might start off with just a single ML problem related to digital transformation.

We acquired MapR some time ago, which is now our HPE Ezmeral Data Fabric. And it underpins a lot of the work that we’re doing. And so, we will often start with the data, to be honest with you, because a lot of the challenges in many of our organizations have to do with the data. And as businesses become more real-time and want to connect more closely to the edge, really that’s where the strengths of the data fabric approach come into play.

So another starting point might be the data. A new application at the edge, for example, has new, very stringent requirements for data and so we start there with building these data systems using our data fabric. And that leads to a requirement to do the analytics and brings us obviously nicely to the HPE Ezmeral MLOps, the data science proposition that we have.

Gardner: Doug, is the COVID-19 pandemic prompting people to bite the bullet and operationalize data science because they need to be fleet and agile and to do things in new ways that they couldn’t have anticipated?

Cackett: Yes, I’m sure it is. We know it’s happening; we’ve seen all the research. McKinsey has pointed out that the pandemic has accelerated a digital transformation journey. And inevitably that means more data science going forward because, as we talked about already with that Forrester research, some 98 percent think that it’s about competitive advantage. And it is, frankly. The research goes back a long way to people like Tom Davenport, of course, in his famous Harvard Business Review article. We know that customers who do more with analytics, or better analytics, outperform their peers on any measure. And ML is the next incarnation of that journey.

Gardner: Do you have any use cases of organizations that have gone to the industrialization approach to data science? What has it done for them?

Financial services benefits

Cackett: I’m afraid names are going to have to be left out. But a good example is in financial services. They have a problem in the form of many regulatory requirements.

When HPE acquired BlueData it gained an underlying technology, which we’ve transformed into our MLOps and container platform. BlueData had a long history of containerizing very difficult, problematic workloads. In this case, this particular financial services organization had a real challenge. They wanted to bring on new data scientists. But the problem is, every time they wanted to bring a new data scientist on, they had to go and acquire a bunch of new hardware, because their process required them to replicate the data and completely isolate the new data scientist from the other ones. This was their process. That’s what they had to do.

So as a result, it took them almost six months to do anything. And there’s no way that was sustainable. It was a well-defined process, but it still involved a six-month wait each time.

So instead we containerized their Cloudera implementation and separated the compute and storage as well. That means we could now create environments on the fly within minutes effectively. But it also means that we can take read-only snapshots of data. So, the read-only snapshot is just a set of pointers. So, it’s instantaneous.
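The point that a read-only snapshot is "just a set of pointers" can be shown with a toy model. This is a simplified sketch, not how any particular storage product works: the class and method names are hypothetical, and it copies the whole block map on each write, where real systems use far more efficient structures.

```python
from types import MappingProxyType


class Volume:
    """Toy volume: maps block numbers to immutable block contents."""

    def __init__(self):
        self._blocks = {}

    def write(self, block_no, data):
        # Build a new block map instead of mutating the old one, so any
        # existing snapshot keeps pointing at the blocks it captured.
        self._blocks = {**self._blocks, block_no: bytes(data)}

    def read(self, block_no):
        return self._blocks[block_no]

    def snapshot(self):
        # No block data is copied: the snapshot is a read-only view of
        # the current block map, so taking it is effectively instant.
        return MappingProxyType(self._blocks)


volume = Volume()
volume.write(0, b"original")
snap = volume.snapshot()
volume.write(0, b"changed")
print(snap[0], volume.read(0))
```

Because the snapshot shares blocks with the live volume rather than duplicating them, each new data scientist can get an isolated, read-only view without terabytes being replicated first.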

They were able to scale out their data science without scaling up their costs or the number of people required. Interestingly, recently, they’ve moved that on further as well. They are now doing all of that in a hybrid cloud environment. And they only have to change two lines of code to allow them to push workloads into AWS, for example, which is pretty magical, right? And that’s where they’re doing the data science.

Another good example that I can name is GM Finance, a fantastic example of how, having started in one area of the business -- all about risk and compliance -- they’ve been able to extend the value to things like credit risk.

But doing credit risk and risk in terms of insurance also means that they can look at policy pricing based on dynamic risk. For example, for auto insurance based on the way you’re driving. How about you, Dana? I drive like a complete idiot. So I couldn’t possibly afford that, right? But you, I’m sure you drive very safely.

But in this use case, because they have the data science in place, they can know how a car is being driven. They are able to look at the value of the car at the end of that lease period and create more value from it.

These are the types of detailed business outcomes we’re talking about. This is about giving our customers the means to do more data science. And because the data science becomes better, you’re able to do even more data science and create momentum in the organization, which means you can do increasingly more data science. It’s really a very compelling proposition.

Gardner: Doug, if I were to come to you in three years and ask similarly, “Give me the example of a company that has done this right and has really reshaped itself.” Describe what you think a correctly analytically driven company will be able to do. What is the end state?

A data-science driven future

Cackett: I can answer that in two ways. One relates to talking to an ex-colleague who worked at Facebook. And I’m so taken with what they were doing there. Basically, he said, what originally happened at Facebook, in his very words, is that to create a new product in Facebook they had an engineer and a product owner. They sat together and they created a new product.

Sometime later, they would ask a data scientist to get involved, too. That person would look at the data and tell them the results.

Then they completely changed that around. What they now do is first find the data scientist and bring him or her on board as they’re creating a product. So they’re instrumenting up what they’re doing in a way that best serves the data scientist, which is really interesting.

The data science is built-in from the start. If you ask me what’s going to happen in three years’ time, as we move to this democratization of ML, that’s exactly what’s going to happen. I think we’ll end up genuinely being information-driven as an organization.

That will build the data science into the products and the applications from the start, not tack them on to the end.

Gardner: And when you do that, it seems to me the payoffs are expansive -- and perhaps accelerating.

Cackett: Yes. That’s the competitive advantage and differentiation we started off talking about. But the technology has to underpin that. You can’t deliver the ML without the technology; you won’t get the competitive advantage in your business, and so your digital transformation will also fail.

This is about getting the right technology with the right people in place to deliver these kinds of results.

Gardner: I’m afraid we’ll have to leave it there. You’ve been with us as we explored how businesses can make data science more of a repeatable assembly line -- an industrialization, if you will -- of end-to-end data exploitation. And we’ve learned how HPE is ushering in the latest methods, tools, and thinking around making data science an integral core function that both responds to business needs and scales to improve nearly every aspect of productivity.

So please join me in thanking our guest, Doug Cackett, EMEA Field Chief Technology Officer at HPE. Thank you so much, Doug. It was a great conversation.

Cackett: Yes, thanks everyone. Thanks, Dana.

Gardner: And a big thank you as well to our audience for joining this sponsored BriefingsDirect Voice of Analytics Innovation discussion. I’m Dana Gardner, Principal Analyst at Interarbor Solutions, your host for this ongoing series of Hewlett Packard Enterprise-supported discussions.

Thanks again for listening. Please pass this along to your IT community, and do come back next time.

Transcript of a discussion on the latest methods, tools, and thinking around making data science an integral core function of any business. Copyright Interarbor Solutions, LLC, 2005-2020. All rights reserved.

Friday, November 20, 2020

How the Journey to Modern Data Management is Paved with an Inclusive Edge-to-Cloud Data Fabric

Transcript of a discussion on the best ways widely inclusive data can be managed for today’s data-rich but too often insights-poor organizations.

Listen to the podcast. Find it on iTunes. Download the transcript. Sponsor: Hewlett Packard Enterprise.

Dana Gardner: Hello, and welcome to the next BriefingsDirect Voice of Analytics Innovation discussion. I’m Dana Gardner, Principal Analyst at Interarbor Solutions, your host and moderator for this ongoing discussion on the latest insights into end-to-end data management strategies.

As businesses seek to gain insights for more elements of their physical edge -- from factory sensors, myriad machinery, and across field operations -- data remains fragmented. But a Data Fabric approach allows information and analytics to reside locally at the edge yet contribute to the global improvement in optimizing large-scale operations.

Stay with us now as we explore how edge-to-core-to-cloud dispersed data can be harmonized with a common fabric to make it accessible for use by more apps and across more analytics.

To learn more about the ways all data can be managed for today’s data-rich but too often insights-poor organizations, we’re joined by Chad Smykay, Field Chief Technology Officer for Data Fabric at Hewlett Packard Enterprise (HPE). Welcome, Chad.

Chad Smykay: Thank you.

Gardner: Chad, why are companies still flooded with data? It seems like they have the data, but they’re still thirsty for actionable insights. If you have the data, why shouldn’t you also have the insights readily available?

Smykay: There are a couple of reasons for that. We still see challenges for our customers today. One is just having a common data governance methodology. That’s not just to govern the security and audits, and the techniques around that -- but determining just what your data is.

I’ve gone into so many projects where they don’t even know where their data lives; just a simple matrix of where the data is, where it lives, and how it’s important to the business. This is really the first step that most companies just don’t do.

Gardner: What’s happening with managing data access when they do decide they want to find it? What’s been happening with managing the explosive growth of unstructured data from all corners of the enterprise?

Tame your data

Smykay: Five years ago, it was still the Wild West of data access. But we’re finally seeing some great standards being deployed and application programming interfaces (APIs) for that data access. Companies are now realizing there’s power in having one API to rule them all. In this case, we see mostly Amazon S3.

There are some other great APIs for data access out there, but just having more standardized API access into multiple datatypes has been great for our customers. It allows for APIs to gain access across many different use cases. For example, business intelligence (BI) tools can come in via an API. Or an application developer can access the same API. So that approach really cuts down on my access methodologies, my security domains, and just how I manage that data for API access.
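As a rough illustration of the point about one API serving several kinds of consumers, here is a minimal Python sketch. It assumes a hypothetical S3-style interface (`ObjectStore`, with `put_object`, `get_object`, and `list_objects`) and an in-memory stand-in backend. None of these names come from the discussion or from any HPE product; they only show a BI-style job and an application sharing the same access calls.

```python
from typing import Dict, List, Protocol


class ObjectStore(Protocol):
    """Hypothetical S3-style interface: keys in buckets, bytes as values."""

    def put_object(self, bucket: str, key: str, body: bytes) -> None: ...
    def get_object(self, bucket: str, key: str) -> bytes: ...
    def list_objects(self, bucket: str, prefix: str = "") -> List[str]: ...


class InMemoryStore:
    """Stand-in backend for illustration only (an assumption, not a real service)."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[str, bytes]] = {}

    def put_object(self, bucket: str, key: str, body: bytes) -> None:
        self._data.setdefault(bucket, {})[key] = body

    def get_object(self, bucket: str, key: str) -> bytes:
        return self._data[bucket][key]

    def list_objects(self, bucket: str, prefix: str = "") -> List[str]:
        return sorted(k for k in self._data.get(bucket, {}) if k.startswith(prefix))


def bi_row_count(store: ObjectStore, bucket: str, prefix: str) -> int:
    """BI-style consumer: count CSV data rows across all objects under a prefix."""
    total = 0
    for key in store.list_objects(bucket, prefix):
        lines = store.get_object(bucket, key).decode().splitlines()
        total += max(len(lines) - 1, 0)  # skip the header row
    return total


def app_latest_reading(store: ObjectStore, bucket: str, key: str) -> str:
    """Application-style consumer: return the last line of a single object."""
    return store.get_object(bucket, key).decode().splitlines()[-1]


# Both consumers go through the same three calls and the same store object.
store = InMemoryStore()
store.put_object("sensors", "plant-a/day1.csv", b"ts,temp\n1,70\n2,71\n")
store.put_object("sensors", "plant-a/day2.csv", b"ts,temp\n3,72\n")
```

Because both functions depend only on the interface, swapping the in-memory backend for a real object store would not change the consumers, which is the reuse being described.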

Gardner: And when we look to get buy-in from the very top levels of businesses, why are leaders now rethinking data management and exploitation of analytics? What are the business drivers that are helping technologists get the resources they need to improve data access and management?

Smykay: The business drivers gain when data access methods are as reusable as possible across the different use cases. It used to be that you’d have different point solutions, or different open source tools, needed to solve a business use-case. That was great for the short-term, maybe with some quarterly project or something for the year you did it in.

But then, down the road, say three years out, they would say, “My gosh, we have 10 different tools across the many different use cases we’re using.” It makes it really hard to standardize for the next set of use cases.

So that’s been a big business driver, gaining a common, secure access layer that can access different types of data. That’s been the biggest driver for our HPE Data Fabric. That and having common API access definitely reduces the management layer cost, as well as the security cost.

Gardner: It seems to me that such data access commonality, when you attain it, becomes a gift that keeps giving. The many different types of data often need to go from the edge to dispersed data centers and sometimes dispersed in the cloud. Doesn’t data access commonality also help solve issues about managing access across disparate architectures and deployment models?

Smykay: You just hit the nail on the head. Having commonality for that API layer really gives you the ability to deploy anywhere. When I have the same API set, it makes it very easy to go from one cloud provider, or one solution, to another. But that can also create issues in terms of where my data lives. You still have data gravity issues, for example. And if you don’t have portability of the APIs and the data, you start to see some lock-in with either the point solution you went with or the cloud provider that’s providing that data access for you.

Gardner: Following through on the gift that keeps giving idea, what is it about the Data Fabric approach that also makes analytics easier? Does it help attain a common method for applying analytics?

Data Fabric deployment options

Smykay: There are a couple of things there. One, it allows you to keep the data where it may need to stay. That could be for regulatory reasons or just depend on where you build and deploy the analytics models. A Data Fabric helps you to start separating out your computing and storage capabilities, but also keeps them coupled for wherever the deployment location is.

For example, a lot of our customers today have the flexibility to deploy IT resources out in the edge. That could be a small cluster or system that pre-processes data. They may typically slowly trickle all the data back to one location, a core data center or a cloud location. Having these systems at the edge gives them the benefit of both pushing information out, as well as continuing to process at the edge. They can choose to deploy as they want, and to make the data analytics solutions deployed at the core even better for reporting or modeling.

Gardner: It gets to the idea of act locally and learn globally. How is that important, and why are organizations interested in doing that?

Smykay: It’s just-in-time, right? We want everything to be faster, and that’s what this Data Fabric approach gets for you.

In the past, we’ve seen edge solutions deployed, but you weren’t processing a whole lot at the edge. You were pushing along all the data back to a central, core location -- and then doing something with that data. But we don’t have the time to do that anymore.

Unless you can change the laws of physics -- last time I checked, they haven’t done that yet -- we’re bound by the speed of light for these networks. And so we need to keep as much data and systems as we can out locally at the edge. Yet we need to still take some of that information back to one central location so we can understand what’s happening across all the different locations. We still want to make the rearview reporting better globally for our business, as well as allow for more global model management.
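The "act locally, learn globally" pattern described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names (`SiteSummary`, `process_at_edge`, `merge_at_core`), the alert threshold, and the sample readings are all hypothetical. Each edge site reacts to its own raw readings immediately and forwards only a compact summary, which the core merges into a cross-site view.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class SiteSummary:
    """Compact per-site summary sent to the core instead of raw readings."""
    site: str
    count: int
    total: float
    maximum: float


def process_at_edge(
    site: str, readings: Iterable[float], alert_above: float
) -> Tuple[List[float], SiteSummary]:
    """Act locally: flag alerts right away and build a small summary.

    Assumes at least one reading; max() would fail on an empty list.
    """
    values = list(readings)
    alerts = [r for r in values if r > alert_above]
    summary = SiteSummary(site, len(values), sum(values), max(values))
    return alerts, summary


def merge_at_core(summaries: List[SiteSummary]) -> Dict[str, float]:
    """Learn globally: combine per-site summaries without needing the raw data."""
    count = sum(s.count for s in summaries)
    total = sum(s.total for s in summaries)
    return {
        "sites": len(summaries),
        "mean": total / count,
        "max": max(s.maximum for s in summaries),
    }


alerts_a, summary_a = process_at_edge("site-a", [70.0, 72.0, 95.0], alert_above=90.0)
alerts_b, summary_b = process_at_edge("site-b", [60.0, 64.0], alert_above=90.0)
global_view = merge_at_core([summary_a, summary_b])
```

Only the two summaries cross the network in this sketch. The alert at site A is raised without waiting on a round trip to the core.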

Gardner: Let’s look at some of the hurdles organizations have to overcome to make use of such a Data Fabric. What is it about the way that data and information exist today that makes it hard to get the most out of it? Why is it hard to put advanced data access and management in place quickly and easily?

Track the data journey

Smykay: It’s tough for most organizations because they can’t take the wings off the airplane while flying. We get that. You have to begin by creating some new standards within your organization, whether that’s standardizing on an API set for different datatypes, multiple datatypes, or a single datatype.

Then you need to standardize the deployment mechanisms within your organization for that data. With the HPE Data Fabric, we give the ability to just say, “Hey, it doesn’t matter where you deploy. We just need some x86 servers and we can help you standardize either on one API or multiple APIs.”

We now support more than 10 APIs, as well as the many different data types that these organizations may have.

Typically, we see a lot of data silos still out there today with customers -- and they’re getting worse. By worse, I mean they’re now all over the place between multiple cloud providers. I may use some of these cloud storage bucket systems from cloud vendor A, but I may use somebody else’s SQL databases from cloud vendor B, and those may end up having their own access methodologies and their own software development kits (SDKs).

Next you have to consider all the networking in the middle. And let’s not even bring up security and authorization to all of them. So we find that the silos still exist, but they’ve just gotten worse and they’ve just sprawled out more. I call it the silo sprawl.

Gardner: Wow. So, if we have that silo sprawl now, and that complexity is becoming a hurdle, the estimates are that we’re going to just keep getting more and more data from more and more devices. So, if you don’t get a handle on this now, you’re never going to be able to scale, right?

Smykay: Yes, absolutely. If you’re going to have diversity of your data, the right way to manage it is to make it use-case-driven. Don’t boil the ocean. That’s where we’ve seen all of our successes. Focus on a couple of different use cases to start, especially if you’re getting into newer predictive model management and using machine learning (ML) techniques.

But, you also have to look a little further out to say, “Okay, what’s next?” Right? “What’s coming?” When you go down that data engineering and data science journey, you must understand that, “Oh, I’m going to complete use case A, that’s going to lead to use case B, which means I’m going to have to go grab from other data sources to either enrich the model or create a whole other project or application for the business.”

You should create a data journey and understand where you’re going so you don’t just end up with silo sprawl.

Gardner: Another challenge for organizations is their legacy installations. When we talk about zettabytes of data coming, what is it about the legacy solutions -- and even the cloud storage legacy -- that organizations need to rethink to be able to scale?

Zettabytes of data coming

Smykay: It’s a very important point. Can we just have a moment of silence? Because now we’re talking about zettabytes of data. Okay, I’m in.

Some 20 years ago, we were talking about petabytes of data. We thought that was a lot of data, but if you look out to the future, some studies show connected Internet of Things (IoT) devices generating zettabytes of data.

If you don’t get a handle on where your data points are going to be generated, how they’re going to be stored, and how they’re going to be accessed now, this problem is just going to get worse and worse for organizations.

Look, Data Fabric is a great solution. We have it, and it can solve a ton of these problems. But speaking as a consultant, if you don’t get ahead of these issues right now, you’re going to be under the umbrella of probably 20 different cloud solutions for the next 10 years. So, really, we need to look at the datatypes that you’re going to have to support, the access methodologies, and where those need to be located and supported for your organization.

Gardner: Chad, it wasn’t that long ago that we were talking about how to manage big data, and Hadoop was a big part of that. NoSQL and other open source databases in particular became popular. What is it about the legacy of the big data approach that also needs to be rethought?

Smykay: One common issue we often see is the tendency to go either/or. By that I mean saying, “Okay, we can do real-time analytics, but that’s a separate data deployment. Or we can do batch, rearview reporting analytics, and that’s a separate data deployment.” But one thing that our HPE Data Fabric has always been able to support is both -- at the same time -- and that’s still true.

So if you’re going down a big data or data lake journey -- I think the term now is a data lakehouse, that’s a new one. For these, basically I need to be able to do my real-time analytics, as well as my traditional BI reporting or rearview mirror reporting -- and that’s what we’ve been doing for over 10 years. That’s probably one of the biggest limitations we have seen.

But it’s a heavy lift to get that data from one location to another, just because of the metadata layer of Hadoop. And then you had dependencies with some of these NoSQL databases out there on Hadoop, which caused some performance issues. You can only get so much performance out of those databases, which is why we have NoSQL databases just out of the box of our Data Fabric -- and we’ve never run into any of those issues.

Gardner: Of course, we can’t talk about end-to-end data without thinking about end-to-end security. So, how do we think about the HPE Data Fabric approach helping when it comes to security from the edge to the core?

Secure data from edge to core

Smykay: This is near-and-dear to my heart because everyone always talks about these great solutions out there to do edge computing. But I always ask, “Well, how do you secure it? How do you authorize it? How does my application authorization happen all the way back from the edge application to the data store in the core or in the cloud somewhere?”

That’s what I call auth-sprawl, where those issues just add up. If we don’t have one way to secure and manage all of our different data types, then what happens is, “Okay, well, I have this object-based system out there, and it has its own authorization techniques.” It has its own authentication techniques. By the way, it has its own way of enforcing security in terms of who has access to what, unless … I haven’t talked about monitoring, right? How do we monitor this solution?

So, now imagine doing that for each type of data that you have in your organization -- whether it’s a SQL database, because that application is just a driving requirement for that, or a file-based workload, or a block-based workload. You can see where this starts to steamroll and build up to be a huge problem within an organization, and we see that all the time.

And, by the way, when it comes to your application developers, that becomes the biggest annoyance for them. Why? Because when they want to go and create an application, they have to go and say, “Okay, wait. How do I access this data? Oh, it’s different. Okay. I’ll use a different key.” And then, “Oh, that’s a different authorization system. It’s a completely different way to authenticate with my app.”

I honestly think that’s why we’re seeing a ton of issues today in the security space. It’s why we’re seeing people get hacked. It happens all the way down to the application layer, as you often have this security sprawl that makes it very hard to manage all of these different systems.

Gardner: We’ve come upon this word sprawl several times now. We’re sprawling with this, we’re sprawling with that; there’s complexity and then there’s going to be even more scale demanded.

The bad news is there is quite a bit to consider when you want end-to-end data management that takes the edge into consideration and has all these other anti-sprawl requirements. The good news is a platform and standards approach with a Data Fabric forms the best, single way to satisfy these many requirements.

So let’s talk about the solutions. How does HPE Ezmeral generally -- and the Ezmeral Data Fabric specifically -- provide a common means to solve many of these thorny problems?

Smykay: We were just talking about security. We provide the same security domain across all deployments. That means having one web-based user interface (UI), or one REST API call, to manage all of those different datatypes.

We can be deployed across any x86 system. And having that multi-API access -- we have more than 10 -- allows for multi-data access. It includes everything from storing data into files and storing data in blocks. We’re soon going to be able to support blocks in our solution. And then we’ll be storing data into bit streams such as Kafka, and then into a NoSQL database as well.

Gardner: It’s important for people to understand that HPE Ezmeral is a larger family and that the Data Fabric is a subset. But the whole seems to be greater than the sum of the parts. Why is that the case? How has what HPE is doing in architecting Ezmeral been a lot more than data management?

Smykay: Whenever you have this “whole is greater than the sum of the parts,” you start reducing so many things across the chain. When we talk about deploying a solution, that includes, “How do I manage it? How do I update it? How do I monitor it?” And then back to securing it.

Honestly, there is a great report from IDC that says it best. We show a 567-percent, five-year return on investment (ROI). That’s not from us, that’s IDC talking to our customers. I don’t know of a better business value from a solution than that. The report speaks for itself, but it comes down to these paper cuts of managing a solution. When you start to have multiple paper cuts, across multiple arms, it starts to add up in an organization.

Gardner: Chad, what is it about the HPE Ezmeral portfolio and the way the Data Fabric fits in that provides a catalyst to more improvement?

All data put to future use

Smykay: One, the HPE Data Fabric can be deployed anywhere. It can be deployed independently. We have hundreds and hundreds of customers. We have to continue supporting them on their journey of compute and storage, but today we are already shipping a solution where we can containerize the Data Fabric as a part of our HPE Ezmeral Container Platform and also provide persistent storage for your containers.

The HPE Ezmeral Container Platform comes with the Data Fabric, it’s a part of the persistent storage. That gives you full end-to-end management of the containers, not only the application APIs. That means the management and the data portability.

So, now imagine being able to ship the data by containers from your location, as it makes sense for your use case. That’s the powerful message. We have already been on the compute and storage journey; been down that road. That road is not going away. We have many customers for that, and it makes sense for many use cases. We’ve already been on the journey of separating out compute and storage. And we’re in general availability today. There are some other solutions out there that are still on a road map as far as we know, but at HPE we’re there today. Customers have this deployed. They’re going down their compute and storage separation journey with us.

Gardner: One of the things that gets me excited about the potential for Ezmeral is when you do this right, it puts you in a position to be able to do advanced analytics in ways that hadn’t been done before. Where do you see the HPE Ezmeral Data Fabric helping when it comes to broader use of analytics across global operations?

Smykay: One of our CMOs, Jack Morris, used to say it best: “If it’s going to be about the data, it better be all about the data.”

When you improve automating data management across multiple deployments -- managing it, monitoring it, keeping it secure -- you can then focus on those actual use cases. You can focus on the data itself, right? That’s living in the HPE Data Fabric. That is the higher-level takeaway. Our users are not spending all their time and money worrying about the data lifecycle. Instead, they can now go use that data for their organizations and for future use cases.

HPE Ezmeral sets your organization up to use your data instead of worrying about your data. We are set up to start using the Data Fabric for newer use cases and separating out compute and storage, and having it run in containers. We’ve been doing that for years. The high-level takeaway is you can go focus on using your data and not worrying about your data.

Gardner: How about some of the technical ways that you’re doing this? Things like global namespaces, analytics-ready fabrics, and native multi-temperature management. Why are they important specifically for getting to where we can capitalize on those new use cases?

Smykay: The global namespace is probably the top feature we hear back from our customers on. It allows them to gain one view of the data with the same common security model. Imagine you’re a lawyer sitting at your computer and you double-click on a Data Fabric drive; you can literally then see all of your deployments globally. That helps with discovery. That helps with bringing onboard your data engineers and data scientists. Over the years that’s been one of the biggest challenges: Organizations spend a lot of time building up their data science and data engineering groups, and then on just discovering the data.

Global namespace means I’m reducing my discovery time to figure out where the data is. A lot of this analytics-ready value we’ve been supporting in the open source community for more than 10 years. There’s a ton of Apache open source projects out there, like Presto, Hive, and Drill. Of course there’s also Spark-ready, and we have been supporting Spark for many years. That’s pretty much the de facto standard we’re seeing when it comes to doing any kind of real-time processing or analytics on data.

As for multi-temperature, that feature allows you to decrease the cost of your deployment, but still allows managing all your data in one location. There are a lot of different ways we do that. We use erasure coding. We can tier off to Amazon S3-compliant devices to reduce the overall cost of deployment.

These features contribute to making it still easier. You gain a common Data Fabric, common security layer, and common API layer.
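To make the multi-temperature idea concrete, here is a small Python sketch of an age-based tiering decision. It is an assumption-laden illustration, not how the HPE Data Fabric decides placement: the function name `choose_tier`, the tier labels, and the 30-day and 180-day thresholds are all hypothetical. Only the mention of erasure coding and offloading to S3-compatible storage comes from the discussion.

```python
from datetime import datetime, timedelta


def choose_tier(
    last_access: datetime,
    now: datetime,
    warm_after: timedelta = timedelta(days=30),   # hypothetical threshold
    cold_after: timedelta = timedelta(days=180),  # hypothetical threshold
) -> str:
    """Pick a storage tier from how long ago the data was last accessed."""
    age = now - last_access
    if age >= cold_after:
        return "cold-object-store"    # e.g., offloaded to S3-compatible storage
    if age >= warm_after:
        return "warm-erasure-coded"   # cheaper capacity, still in the cluster
    return "hot-replicated"           # fastest access, highest cost


now = datetime(2020, 11, 20)
recent = choose_tier(now - timedelta(days=1), now)
aging = choose_tier(now - timedelta(days=45), now)
archived = choose_tier(now - timedelta(days=365), now)
```

The data stays addressable in one place in this picture. Only the tier label, and with it the cost of storage, changes as the data cools.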

Gardner: Chad, we talked about much more data at the edge, how that’s created a number of requirements, and the benefits of a comprehensive approach to data management. We talked about the HPE Data Fabric solution, what it brings, and how it works. But we’ve been talking in the abstract.

What about on the ground? Do you have any examples of organizations that have bitten off and made Data Fabric core for them? As an adopter, what do they get? What are the business outcomes?

Central view benefits businesses

Smykay: We’ve been talking a lot about edge-to-core-to-cloud, and the one example that’s just top-of-mind is a big, tier-1 telecoms provider. This provider makes the equipment for your AT&Ts and your Vodafones. That equipment sits out on the cell towers. And they have many Data Fabric use cases, more than 30 with us.

But the one I love most is real-time antenna tuning. They’re able to improve customer satisfaction in real time and reduce the need to physically return to hotspots on an antenna. They do it via real-time data collection on the antennas and then aggregating that across all of the different layers that they have in their deployments.

They gain a central view of all of the data using a modern API for the DevOps needs. They still centrally process data, but they also process it at the edge today. We replicate all of that data for them. We manage that for them and take a lot of the traditional data management tasks off the table for them, so they can focus on the use case of the best way to tune antennas.

Gardner: They have the local benefit of tuning the antenna. But what’s the global payback? Do we have quantitative or qualitative business returns for them in doing that?

Smykay: Yes, but they’re pretty secretive. We’ve heard that they’ve gotten a payback in the millions of dollars, but an immediate, direct payback for them is in reducing the application development spend everywhere across the layer. That reduction is because they can use the same type of API to publish that data as a stream, and then use the same API semantics to secure and manage it all. They can then take that same application, which is deployed in a container today, and easily deploy it to any remote location around the world.

Gardner: There’s that key aspect of the application portability that we’ve danced around a bit. Any other examples that demonstrate the adoption of the HPE Data Fabric and the business pay-offs?

Smykay: Another one off the top of my head is a midstream oil and gas customer in the Houston area. This one’s not so much about edge-to-core-to-cloud. This is more about consolidation of use cases.

We discussed earlier that we can support both rearview reporting analytics as well as real-time reporting use cases. And in this case, they actually have multiple use cases, up to about five or six right now. Among them, they are able to do predictive failure reports for heat exchangers. These heat exchangers are deployed regionally and they are really temperamental. You have to monitor them all the time.

But now they have a proactive model where they can do a predictive failure monitor on those heat exchangers just by checking the temperatures on the floor cameras. They bring in all real-time camera data and they can predict, “Oh, we think we’re having an issue with this heat exchanger on this time and this day.” So that decreases management cost for them.

They also gain a dynamic parts management capability for all of their inventory in their warehouses. They can deliver faster, not only on parts, but reduce their capital expenditure (CapEx) costs, too. They have gained material measurement balances. When you push oil across a pipeline, they can detect where that balance is off across the pipeline and detect where they’re losing money, because if they are not pushing oil across the pipe at x amount of psi, they’re losing money.

So they’re able to dynamically detect that and fix it along the pipe. They also have a pipeline leak detection that they have been working on, which is modeled to detect corrosion and decay.

The point is there are multiple use cases. But because they’re able to start putting those data types together and continue to build off of it, every use case gets stronger and stronger.

Gardner: It becomes a virtuous adoption cycle; the more you can use the data generally, then the more value, then the more you invest in getting a standard fabric approach, and then the more use cases pop up. It can become very powerful.

This last example also shows the intersection of operational technology (OT) and IT. Together they can start to discover high-level, end-to-end business operational efficiencies. Is that what you’re seeing?

Data science teams work together

Smykay: Yes, absolutely. A Data Fabric is kind of the Kumbaya set among these different groups. If they’re able to standardize on the IT and developer side, it makes it easier for them to talk the same language. I’ve seen this with the oil and gas customer. Now those data science and data engineering teams work hand in hand, which is where you want to get in your organization. You want those IT teams working with the teams managing your solutions today. That’s what I’m seeing. As you get a better, more common data model or fabric, you get faster and you get better management savings by having your people working better together.

Gardner: And, of course, when you’re able to do data-driven operations, procurement, logistics, and transportation, you get to what we’re referring to generally as digital business transformation.

Chad, how does a Data Fabric approach then contribute to the larger goal of business transformation?

Smykay: It allows organizations to work together through a common data framework. That’s been one of the biggest issues I’ve seen, when I come in and say, “Okay, we’re going to start on this use case. Where is the data?”

Depending on the size of the organization, you’re talking to three to five different groups, and sometimes 10 different people, just to put a use case together. But as you create a common data access method, you see an organization where it’s easier and easier for not only your use cases, but your businesses to work together on the goal of whatever you’re trying to do and use your data for.

Gardner: I’m afraid we’ll have to leave it there. We’ve been exploring how a Data Fabric approach allows information and analytics to reside locally at the edge, yet contribute to a global improvement in optimizing large-scale operations.

And we’ve learned how HPE Ezmeral Data Fabric makes modern data management more attainable so businesses can dramatically improve their operational efficiency and innovate from edge to core to clouds.

So please join me in thanking our guest, Chad Smykay, Field Chief Technology Officer for Data Fabric at HPE. Thanks so much, Chad.

Smykay: Thank you, I appreciate it.

Thanks again for listening. Please pass this along to your IT community, and do come back next time.

Transcript of a discussion on the best ways widely inclusive data can be managed for today’s data-rich but too often insights-poor organizations. Copyright Interarbor Solutions, LLC, 2005-2020. All rights reserved.
