Tuesday, February 05, 2008

New Ways Emerge to Improve IT Operational Performance While Heading Off Future Datacenter Reliability Problems

Transcript of BriefingsDirect podcast on IT operational performance using Integrien Alive.

Listen to podcast here. Sponsor: Integrien.

Dana Gardner: Hi, this is Dana Gardner, principal analyst at Interarbor Solutions, and you’re listening to BriefingsDirect.

Today, a sponsored podcast discussion about new ways to improve IT operational performance, based on real-time analytics and the ability to effectively compare data-center performance from a normal state to something that is going to be a problem. We’re going to look at the ability to get a “heads-up” that something is about to go wrong, rather than going into firefighting mode.

Today’s complexity in IT systems is making previous error prevention approaches for operators inefficient and costly. IT staffs are expensive to retain, and are increasingly hard to find. So even when operators have a sufficient staff, a quality staff, it simply takes too long to interpret and resolve IT failures and glitches, given the complexity of distributed systems.

There is also insufficient information about what’s going on in the context of an entire systems setup, and operators are using manual processes -- in firefighting mode -- to maintain critical service levels.

IT executives are therefore seeking more automated approaches to not only remediate problems, but also to get earlier detection. These same operators don’t want to replace their systems management investments, they want to better use them in a cohesive manner to learn more from them, and to better extract the information that these systems emit.

To help us better understand the problems and some of the new solutions and approaches to remediation and detection of IT issues, we’re joined by Steve Henning, the Vice President of Products for Integrien. Welcome to the show, Steve.

Steve Henning: Thanks a lot, Dana.

Gardner: Let’s take a look at some of the real-life issues that are affecting IT operators, drill down into them a bit, look at some of the solutions and benefits, and perhaps some examples of what these bring in terms of relief and increased savings of time and energy.

Tell me a little bit about complexity and problems. How do you view the current state of affairs in the datacenter operations field?

Henning: It’s a dichotomous situation for the vice president of IT operations at this point. On one hand, they're working at growing companies. They need to manage more things in their environment -- devices and resources. Also, given the changes and how people are deploying applications today, they are dealing with more complexity as well.

Service oriented architecture (SOA) and virtualization increase the management problem by at least a factor of three. So you can see that this is a more complex and challenging environment to manage.

On the other side of this equation is the fact that IT operations is being told to either keep their budgets static or to reduce them. Traditionally, the way that the vice president of IT operations has been able to keep the problems from occurring in these environments has been by throwing more people at it. We now see 70-plus-percent of the IT operations budget spent on labor costs.

Just the other day, I was talking to the vice president of IT operations of a large online financial company. He told me that he had 10 people on staff just to understand the normal behavior of their systems. They are literally cutting out graphs and holding them up to the light to compare them against what they have seen in previous incarnations of the system, trying to see when the behavior of this system is normal.

He told me that this is just not scalable. There is no way -- given the fact that he has to scale his infrastructure by a factor of three over the next two years -- that he can possibly hire the people that he would need to support that. Even if he had the budget, he couldn’t find the people today.

So it’s a very troubling environment these days. It’s really what’s pushing people toward looking at different approaches, of taking more of a probabilistic look, measuring variables, and looking at probable outcomes -- rather than trying to do things in a deterministic way, measuring every possible variable, looking at it as quickly as possible, and hoping that problems just don’t slip by.

Gardner: It seems as if we're looking at both a quality and a quantity issue here. We've got a quantity of outputs from these different systems, many times in different formats, but what we really need to do is find that “needle in the haystack” to detect the true issue that’s going to create a failure.

Do you agree that we are dealing with both quality and quantity issues?

Henning: Absolutely. If you look at most of the companies that we talk to today, they are mired in these monitoring events. Most of the companies we talk to have multiple monitoring tools, and they're siloed. You've got the network guys using one tool. You've got the OS and hardware guys using another. The app guys and database guys have their tools, and there is no place where all of this data is analyzed holistically.

Each system emits sets of events typically based on arbitrary hard thresholds that have been set in the environment. There's this massive manual effort of looking at these individual events that are coming from these systems and trying to determine whether they are the actual precursors to real problems, or if they're just a normal behavior of the system that can be ignored. It’s very difficult to keep your hands around that.

Gardner: I suppose it wasn’t that long ago when you could have specialists that would oversee different specific aspects of the IT infrastructure, and they would just be responsible for maintaining that particular part. But, as you mentioned, we have SOA, virtualization, datacenter consolidation, and finding ways of reducing total costs that, in effect, accelerate the interdependencies. I suppose we need more specialization, but -- at the same time -- those specialists need to communicate with the rest of the environment, or the people running it.

Henning: If you look at the applications that are being delivered today, monitoring everything from a silo standpoint and hoping to be able to solve problems in that environment is absolutely impossible. There has to be some way for all of the data to be analyzed in a holistic fashion, understanding the normal behaviors of each of the metrics that are being collected by these monitoring systems. Once you have that normal behavior, you’re alerting only to abnormal behaviors that are the real precursors to problems. That’s where Integrien comes in.

Gardner: You mentioned that you've got reams and reams of events pouring in, and that, in many cases, people are sifting through these manually, charting them, and then comparing them in sort of a haphazard way. What sort of solutions or alternatives are there?

Henning: One of the alternatives is separating the wheat from the chaff and learning the normal behavior of the system. If you look at Integrien Alive, we use sophisticated, dynamic thresholding algorithms. We have multiple algorithms looking at the data to determine that normal behavior and then alerting only to abnormal precursors of problems.

It’s really the hard-threshold-based monitoring that’s the issue here, because hard-threshold-based monitoring does two things. One, it results in alert storms for perfectly normal behavior. Two, it masks real problem behavior that you just can't catch with hard thresholds.

For example, let’s say that at 9 p.m. the normal behavior for a set of servers in some online system is 10 percent CPU utilization. But let’s say that it’s at 60 percent utilization. If you have your hard threshold set at 80 percent, you've got a pending problem that you have no idea about. That’s why it’s so important to have an adaptive learning mechanism for determining behavior and when something is important enough to raise to an operator.
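
The 9 p.m. example can be sketched in a few lines of Python. This is only an illustration, not Integrien Alive's algorithm, which the discussion does not describe: it assumes a simple baseline (mean and standard deviation of past readings for the same hour), and the history values, the function names, and the multiplier `k` are all hypothetical.

```python
from statistics import mean, stdev

HARD_THRESHOLD = 80.0  # percent CPU: the static limit from the example


def hard_alert(cpu_percent):
    """Static check: fires only when the fixed limit is reached."""
    return cpu_percent >= HARD_THRESHOLD


def dynamic_alert(history, cpu_percent, k=3.0):
    """Baseline check: fires when a reading strays more than k standard
    deviations from the mean of past readings for the same hour.
    (Assumed rule for illustration; k=3.0 is an arbitrary choice.)"""
    baseline = mean(history)
    spread = stdev(history)
    return abs(cpu_percent - baseline) > k * spread


# Hypothetical 9 p.m. readings from previous days, hovering near 10 percent.
history_9pm = [9.0, 11.0, 10.0, 12.0, 8.0, 10.0, 9.5, 10.5]
reading = 60.0

print("hard threshold alert:", hard_alert(reading))
print("dynamic threshold alert:", dynamic_alert(history_9pm, reading))
```

With these made-up numbers the 60 percent reading stays under the 80 percent hard threshold, so the static check is silent, while the baseline check flags it as far outside the usual 9 p.m. range.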

Gardner: When you're able to do this comparison on the basis of, "Hey, this is deviating from a pattern," rather than a binary-basis, on-off problem, what kind of benefits can people derive?

Henning: Well, you're automating this massive manual effort that I was talking about. If you look at that vice president of IT operations of the online financial company I talked about earlier, he has 10 guys who are sitting around doing nothing but analyzing this data all day.

Now, that data analysis can be completely automated with sophisticated dynamic thresholding. These 10 guys are freed up to do real problem solving, rather than just looking at these event storms, trying to figure out what’s important and what’s not, when the company is having an issue with one of their mission-critical systems.

Gardner: Do you have any examples of how effective this has been for companies, if they start to take that manpower and focus it where it's most effective? What kind of paybacks are we talking about?

Henning: We see up to a 95 percent reduction in this manual effort around setting thresholds and dealing with events. So it’s a huge reduction in time. We see up to a 50 percent reduction in the time it takes to solve problems, because of this kind of information and because we consolidate alerts based on topology, which makes it much quicker to get down to where the root cause of the problem is, and to focus efforts there.

Gardner: You mentioned getting this “normal state,” of gathering enough information and putting it in the context of use scenarios. How do operators do that? How do they know what’s going to lead to problems by virtue of detecting baseline?

Henning: If you look at most IT environments today, the IT people will tell you that three or four minutes before a problem occurs, they will start to understand that little pattern of events that lead to the problem.

But most of the people that I speak to tell me that’s too late. By the time they identify the pattern that repeats and leads to a particular problem -- for example, a slowdown of a particular critical transaction -- it’s too late. Either the system goes down or the slowdown is such that they are losing business.

We found these abnormal behaviors are the earliest precursors to problems in the IT environment -- either slowdowns or applications actually going down. Once you've learned the normal behavior of the system, these abnormal behaviors far downstream of where the problem actually occurs are the earliest precursors to these problems. We can pick up that these problems are going to occur, sometimes an hour before the problem actually happens.

If you think about a typical IT environment, you're talking about tens of thousands of servers and hundreds of thousands, even millions, of metrics. To correlate all that data, and to understand the relationships between different metrics and which of them lead up to problems, is really a humanly unsolvable problem. That’s where this ability to “connect the dots” -- this ability to model problems when they occur -- is a really important capability.

Gardner: I suppose we’re talking about some fairly large libraries of models to compare and contrast -- something that is far beyond the scale of 5 or 10 people.

Henning: Yes, but these models are learned based on the environment, understanding the normal behaviors of all the metrics in a particular IT operation, and understanding what the key indicators of business performance are.

For example, you might say that if this transaction ever takes more than five seconds, then I know I have a problem. Or you could say that if this database metric, open cursors, goes above 1,000, I know I have a problem. Once you understand what those key indicators are, you can set them. And when you have those, you can actually capture a model of what this problem looks like when that key indicator is exceeded.

That’s the key thing: building this model, having the analytic capability to connect the dots and understand what the precursors are that lead up to problems, even an hour before the problem occurs. That’s one of the things that Integrien Alive can do.
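
As a rough sketch of the idea of key indicators and captured problem models, the Python below uses the two limits mentioned above (five seconds for a transaction, 1,000 open cursors). Everything else is an assumption made for illustration: the metric names, the sample values, and the shape of the stored model are hypothetical and are not taken from the product.

```python
# Limits taken from the two examples in the discussion.
KEY_INDICATORS = {
    "transaction_seconds": 5.0,
    "db_open_cursors": 1000.0,
}


def breached_indicators(sample):
    """Return the names of key indicators whose limit is exceeded."""
    return sorted(
        name
        for name, limit in KEY_INDICATORS.items()
        if sample.get(name, 0.0) > limit
    )


def capture_model(problem_name, abnormal_metrics):
    """Record which metrics were behaving abnormally before the breach.
    (Hypothetical structure: a problem name plus a set of precursors.)"""
    return {"problem": problem_name, "precursors": frozenset(abnormal_metrics)}


# Hypothetical measurements at the moment of a slowdown.
sample = {"transaction_seconds": 6.2, "db_open_cursors": 640.0}
# Hypothetical metrics that were outside their normal range beforehand.
abnormal_before = ["app_tier.heap_used", "db_tier.lock_waits"]

breached = breached_indicators(sample)
model = capture_model(breached[0], abnormal_before) if breached else None
print(breached, model)
```

Here only the transaction-time indicator is exceeded, so one model is captured, tying that problem to the two metrics that were abnormal beforehand.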

Gardner: What sort of benefits do we get from this deeper correlation of what’s good, what’s bad, and what’s gray and that could become bad? Are we talking about minutes or days? What sort of impact does this have on the business?

Henning: We see a couple of things. One is that it’s solving this massive data correlation issue that right now is very limited in the IT operations that we go into. There are just a few highly trained experts who have “tribal knowledge” of the application, and who know even the beginnings of what these correlations are. With a product like Integrien Alive you can solve that kind of massive data correlation issue.

The second benefit of it is that the first time a problem occurs, the capture of a model of the problem, with all the abnormal behaviors that led up to it, can often target for you the places in the applications that are performing abnormally and are likely to be the causes of the problem.

For example, you might find that a particular problem is showing abnormal behavior in the application server tier and the database tier. Now, there's no reason to get on the phone with the network guy, the Web server guy, and other people who can't contribute to the resolution of that problem. Targeting and understanding which metrics are behaving abnormally can get you to a much quicker mean time to identify and repair the problem. As I said, we see up to 50 percent reduction in the time it takes to resolve problems.

The final thing is the ability to get predictive alerts, and that’s kind of the nirvana of IT operations. Once you’ve captured models of the recurring problems in the IT environment, a product like Integrien Alive can see the incoming stream of real-time data and compare that against the models in the library.

If it sees a match with a high enough probability it can let you know ahead of time, up to an hour ahead of time, that you are going to have a particular problem that has previously occurred. You can also record exactly what you did to solve the problem, and how you have diagnosed it, so that you can solve it.
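
The matching step can be pictured with a small Python sketch. It is deliberately simplified and makes several assumptions: the score is just the fraction of a model's precursor metrics that are currently abnormal, which stands in for whatever probability the product actually computes; the library contents, metric names, and the 0.75 cutoff are hypothetical.

```python
def match_score(model_precursors, current_abnormal):
    """Fraction of a model's precursors that are abnormal right now (0.0-1.0).
    This overlap ratio is a stand-in for a real probability estimate."""
    if not model_precursors:
        return 0.0
    return len(model_precursors & current_abnormal) / len(model_precursors)


def predictive_alerts(library, current_abnormal, min_score=0.75):
    """Return (problem, score) pairs at or above the cutoff, best match first."""
    hits = []
    for problem, precursors in library.items():
        score = match_score(precursors, current_abnormal)
        if score >= min_score:
            hits.append((problem, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical library of previously captured problem models.
library = {
    "slow_checkout": {"app.heap_used", "db.lock_waits",
                      "db.open_cursors", "app.queue_depth"},
    "db_crash": {"db.open_cursors", "db.disk_latency"},
}
# Hypothetical set of metrics currently outside their normal range.
current = {"app.heap_used", "db.lock_waits", "db.open_cursors"}

alerts = predictive_alerts(library, current)
print(alerts)
```

With this data, three of the four "slow_checkout" precursors are abnormal, so that model clears the cutoff, while "db_crash" matches only half of its precursors and is not reported.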

Gardner: Then, you can share that. Now, you mentioned “tribal knowledge.” It sounds like we are taking what used to reside in wetware -- in people’s minds and experience. Instead of having to throw those people at a problem without knowing the depth of the problem, or even losing that knowledge if they walk out the door, we're saying, "Enough of that. Let’s go and instantiate this knowledge into the systems and become less dependent on individual experienced people."

Henning: The way I look at it is that we're actually enhancing the expertise of these folks. You're always going to need experts in there. You’re always going to need the folks who have the tribal knowledge of the application. What we are doing, though, is enabling them to do their job better, with earlier understanding of where the problems are occurring, by solving this massive data correlation issue when a problem occurs.

Even the tribal experts will tell you that just a few minutes before a problem occurs they can start to see the problem. We are offering them a solution that allows them to see this problem forming up to an hour ahead of time, notifying them of abnormal behavior and patterns of behavior that would be seemingly unrelated to them based on their current knowledge of the application.

Gardner: When you do resolve a problem and capture that and make that available for future use, that sounds more like a collaboration issue. How do we deal with so many inputs, so much information, not only on the receiving end, but on the outgoing end, after a resolution?

Henning: This is what we were talking about before. You’ve got all of the siloed sources of monitoring data and alerts, and there's currently no way to consolidate that data for holistic problem solving. So, it’s very important that any kind of solution can integrate a wide variety of monitoring tools, so that all the data can be in one place and available for this kind of collaborative problem solving.

For example, in one environment that we went into, we had an alert that went to an application server administrator. He happened to notice that there was a prediction that a database key indicator was going out of its normal range, which would have caused a crash of the database with 85 percent probability in 15 minutes. Armed with that information, he got the alert over to the database administrator, who was able to make some configuration changes that staved off the problem.

Being able to analyze this data holistically and being able to share the data that’s typically been in the siloed monitoring solutions allows quicker and more collaborative problem resolution. We're really talking about centralizing and automating data analysis across the silos of IT.

Gardner: It also reminds me, conceptually, of SOA, where you want to transform the information into a form that can be used generally. It sounds like you are doing that and applying it to this whole notion of IT management and remediation.

Henning: Very much so. There are seemingly unrelated things happening within an application infrastructure that can result in a problem. The fact that all the data is analyzed in a single place holistically through these statistical algorithms, allows us to provide an interface where people can work together and collaborate. This makes the team more effective and makes it much easier for people to solve problems quickly.

Gardner: So, we standardize gathering and managing the information. We also standardize the way in which people can access it and use it, so that they are not fixing the same broken wheel over and over again at different times. It can recognize when they are going to need to do it and have the fix ready to go. This sounds like a real big saver when it comes to labor and lowering costs for your staff, but it also gets at that root saving around no downtime or reduced downtime.

Henning: Right. When we typically work with customers, most of the IT operations folks that we talk to are really concerned with reducing the labor costs and reducing the time to identify and resolve the problem. In truth, the real benefit to the business is really removing downtime and removing slowdowns of the applications that cause you to lose or reduce business.

So although we see major benefits of real-time analytic solutions in providing reduction in labor costs, we also say that it’s a very big boon to the business, in terms of keeping the applications effectively generating revenue.

Gardner: Another current trend is the ability to gather interface views, graphical views of the system. There are a lot of dashboards out there for business issues. What do we get in terms of visibility for end-to-end operations, even in a real-time or close to real-time setting from the Integrien Alive that you are describing?

Henning: Once again, it’s still a real issue when you have siloed monitoring tools. Even though a lot of companies have a manager of managers, that’s typically used by the level-one operations folks to filter through the alerts and determine who they need to be passed off to, who can actually take a look at them and resolve them. But, we find that most of the companies that we talk to don’t have any tools that allow them to be efficient in role-based problem solving.

One of the things that Integrien Alive provides is this idea of customizable role-based dashboards, this library of custom analysis widgets that allows people to slice and dice the data in whatever way is most effective for that particular individual in problem solving. We talked earlier about the holistic data analysis that was really enabling effective teamwork. When we talk about role-based dashboards for problem solving showing the database administrator exactly what they need, we are really talking about making each team member more effective.

That’s one of the benefits of the role-based dashboards. The other thing is giving visibility all the way up to the CIO and the vice president of operations who are concerned with much different views. They want it filtered in a much different way, because they are more concerned about business performance than any individual server or resource problems that might be occurring in the environment.

Gardner: What sort of views do those business folks prefer over what the outputs of some of these monitoring tools might be?

Henning: You want to look at things from a business-service perspective: How are my critical business services performing? If I have an investment banking solution, and I’ve got a couple of other mission-critical applications that are outward facing, I want to know how those are performing right now, in terms of the critical transaction performance.

I want to be able to accommodate business data as well. So, if I see that from an IT performance level the transactions seem to be performing well and I can see that I am also processing a consistent number of transactions that are enabling my business, I have a good view that things are going well in my operation at this point. So, it’s really a higher level view.

I am going to be much more concerned with any kind of alerts that are affecting my entire business service. If we see an alert that’s been consolidated all the way up to the investment banking business-service level, that’s going to be something that’s very important for the VP of IT operations, because he’s got a problem now that’s actually affecting his business.
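
One way to picture consolidating alerts up to the business-service level is to walk each alerting resource up a topology tree. The Python below is a sketch under assumed names: the server names, the tier names, and the parent map are hypothetical, and the product's real topology model is not described in this discussion.

```python
# Hypothetical topology: each resource points at its parent, ending at a
# business service that has no parent of its own.
PARENT = {
    "db-01": "database tier",
    "app-01": "application server tier",
    "app-02": "application server tier",
    "database tier": "investment banking",
    "application server tier": "investment banking",
}


def business_service(resource):
    """Follow parent links until reaching the top of the topology."""
    while resource in PARENT:
        resource = PARENT[resource]
    return resource


def consolidate(alerting_resources):
    """Group alerting resources under the top-level service they belong to."""
    rollup = {}
    for resource in alerting_resources:
        rollup.setdefault(business_service(resource), []).append(resource)
    return rollup


# Hypothetical alerts; "mail-01" is not part of the mapped topology.
rollup = consolidate(["db-01", "app-02", "mail-01"])
print(rollup)
```

In this example the database and application server alerts collapse into a single "investment banking" entry, the kind of view a VP of IT operations would watch, while the unmapped server stays on its own.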

Gardner: I suppose from the IT side the more that we can show and tell to the business folks about how well we are doing the better. It makes us seem less like we are firefighters and that we're proactive and on top of things. If there are any decisions several months or years out about outsourcing, we have a nice trail, a cookie-crumb trail, if you will, of how well things are going and how costs are being managed.

Henning: That’s absolutely true. I was talking to the CIO of a large university the other day. One thing that was very frustrating for him was that he was in a meeting with the president of the university, and the president was saying that it seemed like some of the critical applications were down a lot.

This CIO was very frustrated, because he knew that wasn’t the case, but he didn’t have effective reporting tools to show that it was not the case. That was one of the things that he was very excited about, when he took a look at our product.

Gardner: We know that complexity is substantial. It’s pretty clear that that complexity is going to continue as we see organizations move toward SOA and software as a service, and hybrid issues, where a holistic business process could be supported by your systems, partner systems, or perhaps third-party systems.

I can just imagine there is going to be finger pointing when things go wrong. You’re going to want to be able to say, "Hey, not my problem, but I am ready, willing and able to help you fix it. In fact, I've got more insight into your systems than you do."

Henning: That’s absolutely the case.

Gardner: Give me a sense of where Integrien and Alive, as a product set, are going in the future. I know you can't pre-announce things, but as these new complexities in terms of permeable organizational boundaries kick in and virtualization kicks in, what might we expect in the future?

Henning: One of the things that you’re going to see from us is a comprehensive solution around the virtualized environment. Several other companies claim to have solutions in this space, but from what we have been able to see so far, the issue of motion of virtual machines (VMs), moving them between different servers, is still an issue for all of these solutions.

We’re working extremely diligently to solve the issue of how to deal with performance monitoring in a virtualized environment, where you have got the individual VMs moving all over the place, based on changes in capacity, and things like that. So, look out for that solution coming from Integrien in the coming months.

Gardner: So we're talking about instances of entire stacks, provisioning and moving dynamically among systems. That sounds like a whole other level of complexity that we are adding to an already difficult situation.

Henning: Yes, it’s a big math problem. You can also compound that with the fact that when a VM moves from one physical server to another, it might be allocated a different percentage of resources. So, when you think about this whole hard-threshold based monitoring paradigm that IT is in now, what does a hard-threshold really mean in an environment like that? It makes absolutely no sense at all.

If you don’t have some way to understand the normal behavior, to provide context, and to quickly learn and adapt to changes in the environment, managing the virtualized environment is going to be an absolute nightmare. Based on spending some time with the folks over at VMware, and attending the VMworld show this year, you could certainly see in their customers this concern about how to deal with this complex management problem.

Gardner: The old manual wetware approaches just aren’t going to cut it in that environment?

Henning: That’s correct.

Gardner: I appreciate your candor and I look forward to seeing some of these newer solutions focused on virtualization.

We have been talking about remediation and ability to get in front of problems for IT operators using predictive and analytic algorithmic approaches. To help us understand this, we have been joined by Steve Henning, the Vice President of Products at Integrien. Thank you, Steve.

Henning: Thank you very much, Dana.

Gardner: This is Dana Gardner, principal analyst at Interarbor Solutions. You have been listening to BriefingsDirect. Thanks and come back next time.


Transcript of BriefingsDirect podcast on IT operational performance using Integrien Alive with Integrien's Steve Henning. Copyright Interarbor Solutions, LLC, 2005-2008. All rights reserved.