BriefingsDirect Transcripts: Integrien

Tuesday, February 05, 2008

New Ways Emerge to Improve IT Operational Performance While Heading Off Future Datacenter Reliability Problems

Transcript of BriefingsDirect podcast on IT operational performance using Integrien Alive.

Listen to podcast here. Sponsor: Integrien.

Dana Gardner: Hi, this is Dana Gardner, principal analyst at Interarbor Solutions, and you’re listening to BriefingsDirect.

Today, a sponsored podcast discussion about new ways to improve IT operational performance, based on real-time analytics and the ability to effectively compare data-center performance from a normal state to something that is going to be a problem. We’re going to look at the ability to get a “heads-up” that something is about to go wrong, rather than going into firefighting mode.

Today’s complexity in IT systems is making previous error prevention approaches for operators inefficient and costly. IT staffs are expensive to retain, and are increasingly hard to find. So even when operators have a sufficient staff, a quality staff, it simply takes too long to interpret and resolve IT failures and glitches, given the complexity of distributed systems.

There is also insufficient information about what’s going on in the context of an entire systems setup, and operators are using manual processes -- in firefighting mode -- to maintain critical service levels.

IT executives are therefore seeking more automated approaches to not only remediate problems, but also to get earlier detection. These same operators don’t want to replace their systems management investments, they want to better use them in a cohesive manner to learn more from them, and to better extract the information that these systems emit.

To help us better understand the problems and some of the new solutions and approaches to remediation and detection of IT issues, we’re joined by Steve Henning, the Vice President of Products for Integrien. Welcome to the show, Steve.

Steve Henning: Thanks a lot, Dana.

Gardner: Let’s take a look at some of the real-life issues that are affecting IT operators, drill down into them a bit, look at some of the solutions and benefits, and perhaps some examples of what these bring in terms of relief and increased savings of time and energy.

Tell me a little bit about complexity and problems. How do you view the current state of affairs in the datacenter operations field?

Henning: It’s a dichotomous situation for the vice president of IT operations at this point. On one hand, they're working at growing companies. They need to manage more things in their environment -- devices and resources. Also, given the changes and how people are deploying applications today, they are dealing with more complexity as well.

Service oriented architecture (SOA) and virtualization increase the management problem by at least a factor of three. So you can see that this is a more complex and challenging environment to manage.

On the other side of this equation is the fact that IT operations is being told to either keep their budgets static or to reduce them. Traditionally, the way that the vice president of IT operations has been able to keep the problems from occurring in these environments has been by throwing more people at it. We now see 70-plus-percent of the IT operations budget spent on labor costs.

Just the other day, I was talking to the vice president of IT operations of a large online financial company. He told me that he had 10 people on staff just to understand the normal behavior of their systems. They are literally cutting out graphs and holding them up to the light to compare them against what they have seen in previous incarnations of the system, trying to see when the behavior of this system is normal.

He told me that this is just not scalable. There is no way -- given the fact that he has to scale his infrastructure by a factor of three over the next two years -- that he can possibly hire the people that he would need to support that. Even if he had the budget, he couldn’t find the people today.

So it’s a very troubling environment these days. It’s really what’s pushing people toward looking at different approaches, of taking more of a probabilistic look, measuring variables, and looking at probable outcomes -- rather than trying to do things in a deterministic way, measuring every possible variable, looking at it as quickly as possible, and hoping that problems just don’t slip by.

Gardner: It seems as if we're looking at both a quality and a quantity issue here. We've got a quantity of outputs from these different systems, many times in different formats, but what we really need to do is find that “needle in the haystack” to detect the true issue that’s going to create a failure.

Do you agree that we are dealing with both quality and quantity issues?

Henning: Absolutely. If you look at most of the companies that we talk to today, they are mired in these monitoring events. Most of the companies we talk to have multiple monitoring tools, and they're siloed. You've got the network guys using one tool. You've got the OS and hardware guys using another. The app guys and database guys have their tools, and there is no place where all of this data is analyzed holistically.

Each system emits sets of events typically based on arbitrary hard thresholds that have been set in the environment. There's this massive manual effort of looking at these individual events that are coming from these systems and trying to determine whether they are the actual precursors to real problems, or if they're just a normal behavior of the system that can be ignored. It’s very difficult to keep your hands around that.

Gardner: I suppose it wasn’t that long ago where you could have specialists that would oversee different specific aspects of the IT infrastructure, and they would just be responsible for maintaining that particular part. But, as you mentioned, we have SOA, virtualization, datacenter consolidation, and finding ways of reducing total costs that, in effect, accelerate the interdependencies. I suppose we need more specialization, but -- at the same time -- those specialists need to communicate with the rest of the environment, or the people running it.

Henning: If you look at the applications that are being delivered today, monitoring everything from a silo standpoint and hoping to be able to solve problems in that environment is absolutely impossible. There has to be some way for all of the data to be analyzed in a holistic fashion, understanding the normal behaviors of each of the metrics that are being collected by these monitoring systems. Once you have that normal behavior, you’re alerting only to abnormal behaviors that are the real precursors to problems. That’s where Integrien comes in.

Gardner: You mentioned that you've got reams and reams of events pouring in, and that, in many cases, people are sifting through these manually, charting them, and then comparing them in sort of a haphazard way. What sort of solutions or alternatives are there?

Henning: One of the alternatives is separating the wheat from the chaff and learning the normal behavior of the system. If you look at Integrien Alive, we use sophisticated, dynamic thresholding algorithms. We have multiple algorithms looking at the data to determine that normal behavior and then alerting only to abnormal precursors of problems.

It’s really the hard-threshold-based monitoring that’s the issue here, because hard-threshold-based monitoring does two things. One, it results in alert storms for perfectly normal behavior. Two, it masks real problem behavior that you just can't catch with hard thresholds.

For example, let’s say that at 9 p.m. the normal behavior for a set of servers in some online system is 10 percent CPU utilization. But let’s say that it’s at 60 percent utilization. If you have your hard threshold set at 80 percent, you've got a pending problem that you have no idea about. That’s why it’s so important to have an adaptive learning mechanism for determining behavior and when something is important enough to raise to an operator.
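
A minimal sketch of the contrast Henning draws here, assuming a per-hour baseline of mean and standard deviation. The transcript does not say which algorithms Integrien Alive uses, so the function names, the `k` multiplier, and the sample history below are all hypothetical illustrations rather than the product's method.

```python
from statistics import mean, stdev

HARD_THRESHOLD = 80.0  # percent CPU: the static limit from the example


def learn_baseline(history):
    """history maps hour of day -> past CPU readings (percent).
    Returns hour -> (mean, standard deviation) as the learned 'normal'."""
    return {hour: (mean(vals), stdev(vals)) for hour, vals in history.items()}


def is_abnormal(baseline, hour, value, k=3.0, min_band=2.0):
    """Flag a reading outside mean +/- k standard deviations.
    min_band is an assumed floor so a very flat history does not over-alert."""
    mu, sigma = baseline[hour]
    band = max(k * sigma, min_band)
    return abs(value - mu) > band


# Hypothetical history: at 21:00 these servers normally sit near 10 percent.
history = {21: [9.0, 10.0, 11.0, 10.5, 9.5, 10.0, 10.2]}
baseline = learn_baseline(history)

reading = 60.0
hard_alert = reading > HARD_THRESHOLD                # static rule stays silent
dynamic_alert = is_abnormal(baseline, 21, reading)   # learned baseline fires
print(hard_alert, dynamic_alert)
```

With these made-up numbers the 60 percent reading passes the 80 percent hard threshold unnoticed, while the learned 9 p.m. baseline flags it as far outside normal.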

Gardner: When you're able to do this comparison on the basis of, "Hey, this is deviating from a pattern," rather than a binary-basis, on-off problem, what kind of benefits can people derive?

Henning: Well, you're automating this massive manual effort that I was talking about. If you look at that vice president of IT operations of the online financial company I talked about earlier, he has 10 guys who are sitting around doing nothing but analyzing this data all day.

Now, that data analysis can be completely automated with sophisticated dynamic thresholding. These 10 guys are freed up to do real problem solving, rather than just looking at these event storms, trying to figure out what’s important and what’s not, when the company is having an issue with one of their mission-critical systems.

Gardner: Do you have any examples of how effective this has been for companies, if they start to take that manpower and focus it where it's most effective? What kind of paybacks are we talking about?

Henning: We see up to a 95 percent reduction in this manual effort around setting thresholds and dealing with events. So it’s a huge reduction in time. We see up to a 50 percent reduction in the time it takes to solve problems, because of this kind of information and the fact that we consolidate alerts based on topology, which makes it much quicker to get down to where the root cause of the problem is, and to focus efforts there.

Gardner: You mentioned getting this “normal state,” of gathering enough information and putting it in the context of use scenarios. How do operators do that? How do they know what’s going to lead to problems by virtue of detecting baseline?

Henning: If you look at most IT environments today, the IT people will tell you that three or four minutes before a problem occurs, they will start to understand that little pattern of events that lead to the problem.

But most of the people that I speak to tell me that’s too late. By the time they identify the pattern that repeats and leads to a particular problem -- for example, a slowdown of a particular critical transaction -- it’s too late. Either the system goes down or the slowdown is such that they are losing business.

We found these abnormal behaviors are the earliest precursors to problems in the IT environment -- either slowdowns or applications actually going down. Once you've learned the normal behavior of the system, these abnormal behaviors far downstream of where the problem actually occurs are the earliest precursors to these problems. We can pick up that these problems are going to occur, sometimes an hour before the problem actually happens.

If you think about a typical IT environment, you're talking about tens of thousands of servers and hundreds of thousands, even millions, of metrics. Correlating all that data and understanding the relationships between different metrics, and which ones lead up to problems, is really a humanly unsolvable problem. That’s where this ability to “connect the dots” -- this ability to model problems when they occur -- is a really important capability.

Gardner: I suppose we’re talking about some fairly large libraries of models to compare and contrast -- something that is far beyond the scale of 5 or 10 people.

Henning: Yes, but these models are learned based on the environment, understanding the normal behaviors of all the metrics in a particular IT operation, and understanding what the key indicators of business performance are.

For example, you might say that if this transaction ever takes more than five seconds, then I know I have a problem. Or you could say that if this database metric, open cursors, goes above 1,000, I know I have a problem. Once you understand what those key indicators are, you can set them. And when you have those, you can actually capture a model of what this problem looks like when that key indicator is exceeded.

That’s the key thing: building this model, having the analytic capability to be able to connect the dots and understand what the precursors are that lead up to problems, even an hour before the problem occurs. That’s one of the things that Integrien Alive can do.
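
The key-indicator idea above can be sketched roughly as follows. The two limits come from Henning's examples; everything else (the metric names, the 60-minute look-back window, the shape of the captured model) is an assumption made for illustration, not a description of how Integrien Alive stores its models.

```python
# Key indicators and limits taken from the examples in the transcript.
KEY_INDICATORS = {
    "transaction_seconds": 5.0,  # "if this transaction ever takes more than five seconds"
    "db_open_cursors": 1000,     # "open cursors goes above 1,000"
}


def violated_indicators(sample):
    """Return the key indicators whose limits the sample exceeds."""
    return [name for name, limit in KEY_INDICATORS.items()
            if sample.get(name, 0) > limit]


def capture_model(problem, abnormal_events, violation_minute, window_minutes=60):
    """abnormal_events: (minute, metric_name) pairs for metrics seen outside
    their normal range. Keep those inside the look-back window (assumed)."""
    start = violation_minute - window_minutes
    precursors = sorted({metric for minute, metric in abnormal_events
                         if start <= minute <= violation_minute})
    return {"problem": problem, "precursors": precursors}


# Hypothetical data: a slow transaction at minute 150.
sample = {"transaction_seconds": 6.2, "db_open_cursors": 400}
abnormal_events = [(10, "web.queue_depth"),
                   (100, "app.heap_used"),
                   (130, "db.lock_waits")]

violated = violated_indicators(sample)
model = capture_model("slow transaction", abnormal_events, violation_minute=150)
print(violated, model)
```

The captured model is just the set of metrics that were abnormal in the hour before the indicator was exceeded; the event at minute 10 falls outside the window and is left out.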

Gardner: What sort of benefits do we get from this deeper correlation of what’s good, what’s bad, and what’s gray and that could become bad? Are we talking about minutes or days? What sort of impact does this have on the business?

Henning: We see a couple of things. One is that it’s solving this massive data correlation issue that right now is very limited in the IT operations that we go into. There are just a few highly trained experts who have “tribal knowledge” of the application, and who know even the beginnings of what these correlations are. With a product like Integrien Alive you can solve that kind of massive data correlation issue.

The second benefit of it is that the first time a problem occurs, the capture of a model of the problem, with all the abnormal behaviors that led up to it, can often target for you the places in the applications that are performing abnormally and are likely to be the causes of the problem.

For example, you might find that a particular problem is showing abnormal behavior in the application server tier and the database tier. Now, there's no reason to get on the phone with the network guy, the Web server guy, and other people who can't contribute to the resolution of that problem. Targeting and understanding which metrics are behaving abnormally can get you to a much quicker mean time to identify and repair the problem. As I said, we see up to 50 percent reduction in the time it takes to resolve problems.

The final thing is the ability to get predictive alerts, and that’s kind of the nirvana of IT operations. Once you’ve captured models of the recurring problems in the IT environment, a product like Integrien Alive can see the incoming stream of real-time data and compare that against the models in the library.

If it sees a match with a high enough probability it can let you know ahead of time, up to an hour ahead of time, that you are going to have a particular problem that has previously occurred. You can also record exactly what you did to solve the problem, and how you have diagnosed it, so that you can solve it.
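
One rough way to picture matching live data against a library of problem models is shown below. The score here is a simple overlap fraction standing in for the probability Henning mentions; the transcript does not describe the actual matching method, and the library entries, metric names, and recorded resolutions are hypothetical.

```python
def match_score(model_precursors, currently_abnormal):
    """Fraction of a model's precursors that are abnormal right now.
    A stand-in for a real probability estimate."""
    wanted = set(model_precursors)
    if not wanted:
        return 0.0
    return len(wanted & set(currently_abnormal)) / len(wanted)


def predictive_alerts(library, currently_abnormal, min_score=0.75):
    """Return (problem, score, recorded resolution) for models that match
    the live data closely enough, best match first."""
    alerts = []
    for entry in library:
        score = match_score(entry["precursors"], currently_abnormal)
        if score >= min_score:
            alerts.append((entry["problem"], score, entry.get("resolution", "")))
    return sorted(alerts, key=lambda alert: alert[1], reverse=True)


# Hypothetical library of previously captured problems and their fixes.
library = [
    {"problem": "checkout slowdown",
     "precursors": ["app.heap_used", "db.lock_waits",
                    "db.open_cursors", "app.thread_pool"],
     "resolution": "raise connection pool size"},
    {"problem": "login outage",
     "precursors": ["ldap.latency", "web.5xx_rate"],
     "resolution": "restart directory proxy"},
]

currently_abnormal = {"app.heap_used", "db.lock_waits",
                      "db.open_cursors", "net.retransmits"}
alerts = predictive_alerts(library, currently_abnormal)
print(alerts)
```

Storing the resolution alongside each model mirrors the point about recording what was done to solve the problem, so the alert can carry the earlier fix with it.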

Gardner: Then, you can share that. Now, you mentioned “tribal knowledge.” It sounds like we are taking what used to reside in wetware -- in people’s minds and experience. Instead of having to throw those people at a problem without knowing the depth of the problem, or even losing that knowledge if they walk out the door, we're saying, "Enough of that. Let’s go and instantiate this knowledge into the systems and become less dependent on individual experienced people."

Henning: The way I look at it is that we're actually enhancing the expertise of these folks. You're always going to need experts in there. You’re always going to need the folks who have the tribal knowledge of the application. What we are doing, though, is enabling them to do their job better, with earlier understanding of where the problems are occurring, by solving this massive data correlation issue when a problem occurs.

Even the tribal experts will tell you that just a few minutes before a problem occurs they can start to see the problem. We are offering them a solution that allows them to see this problem forming up to an hour ahead of time, notifying them of abnormal behavior and patterns of behavior that would be seemingly unrelated to them based on their current knowledge of the application.

Gardner: When you do resolve a problem and capture that and make that available for future use, that sounds more like a collaboration issue. How do we deal with so many inputs, so much information, not only on the receiving end, but on the outgoing end, after a resolution?

Henning: This is what we were talking about before. You’ve got all of the siloed sources of monitoring data and alerts, and there's currently no way to consolidate that data for holistic problem solving. So, it’s very important that any kind of solution can integrate a wide variety of monitoring tools, so that all the data can be in one place and available for this kind of collaborative problem solving.

For example, in one environment that we went into we had an alert that went to an application server administrator. He happened to notice that there was a prediction that a database key indicator was going out of its normal range, which would have caused a crash of the database with 85 percent probability in 15 minutes. Armed with that information, he got the alert over to the database administrator, who was able to make some configuration changes that staved off the problem.

Being able to analyze this data holistically and being able to share the data that’s typically been in the siloed monitoring solutions allows quicker and more collaborative problem resolution. We're really talking about centralizing and automating data analysis across the silos of IT.

Gardner: It also reminds me, conceptually, of SOA, where you want to transform the information into a form that can be used generally. It sounds like you are doing that and applying it to this whole notion of IT management and remediation.

Henning: Very much so. There are seemingly unrelated things happening within an application infrastructure that can result in a problem. The fact that all the data is analyzed in a single place holistically through these statistical algorithms, allows us to provide an interface where people can work together and collaborate. This makes the team more effective and makes it much easier for people to solve problems quickly.

Gardner: So, we standardize gathering and managing the information. We also standardize the way in which people can access it and use it, so that they are not fixing the same broken wheel over and over again at different times. It can recognize when they are going to need to do it and have it fixed ready to go. This sounds like a real big saver when it comes to labor and lowering costs for your staff, but also gets that root saving around no downtime or reduced downtime.

Henning: Right. When we typically work with customers, most of the IT operations folks that we talk to are really concerned with reducing the labor costs and reducing the time to identify and resolve the problem. In truth, the real benefit to the business is really removing downtime and removing slowdowns of the applications that cause you to lose or reduce business.

So although we see major benefits of real-time analytic solutions in providing reduction in labor costs, we also say that it’s a very big boon to the business, in terms of keeping the applications effectively generating revenue.

Gardner: Another current trend is the ability to gather interface views, graphical views of the system. There are a lot of dashboards out there for business issues. What do we get in terms of visibility for end-to-end operations, even in a real-time or close to real-time setting, from the Integrien Alive product that you are describing?

Henning: Once again, it’s still a real issue when you have siloed monitoring tools. Even though a lot of companies have a manager of managers, that’s typically used by the level-one operations folks to filter through the alerts and determine who they need to be passed off to, who can actually take a look at them and resolve them. But, we find that most of the companies that we talk to don’t have any tools that allow them to be efficient in role-based problem solving.

One of the things that Integrien Alive provides is this idea of customizable role-based dashboards, this library of custom analysis widgets that allows people to slice and dice the data in whatever way is most effective for that particular individual in problem solving. We talked earlier about the holistic data analysis that was really enabling effective teamwork. When we talk about role-based dashboards for problem solving showing the database administrator exactly what they need, we are really talking about making each team member more effective.

That’s one of the benefits of the role-based dashboards. The other thing is giving visibility all the way up to the CIO and the vice president of operations who are concerned with much different views. They want it filtered in a much different way, because they are more concerned about business performance than any individual server or resource problems that might be occurring in the environment.

Gardner: What sort of views do those business folks prefer over what the outputs of some of these monitoring tools might be?

Henning: You want to look at things from a business-service perspective, how are my critical business services performing? If I have an investment banking solution, and I’ve got a couple of other mission-critical applications that are outward facing, I want to know how those are performing right now, in terms of the critical transaction performance.

I want to be able to accommodate business data as well. So, if I see that from an IT performance level the transaction seemed to be performing well and I can see that I am also processing a consistent number of transactions that are enabling my business, I have a good view that things are going well in my operation at this point. So, it’s really a higher level view.

I am going to be much more concerned with any kind of alerts that are affecting my entire business service. If we see an alert that’s been consolidated all the way up to the investment banking business-service level, that’s going to be something that’s very important for the VP of IT operations, because he’s got a problem now that’s actually affecting his business.
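
Consolidating alerts up to the business-service level can be pictured as a walk up a topology tree. This is a sketch under assumed names: the topology, resource names, and alert messages are invented for illustration, and the transcript does not describe how Integrien Alive represents topology.

```python
# Hypothetical topology: each resource points to its parent;
# nodes with no parent are business services.
PARENT = {
    "db-01": "database-tier",
    "app-01": "app-tier",
    "app-02": "app-tier",
    "database-tier": "investment-banking",
    "app-tier": "investment-banking",
    "mail-01": "email",
}


def business_service(resource):
    """Walk up the topology until reaching a node with no parent."""
    node = resource
    while node in PARENT:
        node = PARENT[node]
    return node


def consolidate(alerts):
    """Group raw (resource, message) alerts under the service they affect."""
    grouped = {}
    for resource, message in alerts:
        grouped.setdefault(business_service(resource), []).append((resource, message))
    return grouped


alerts = [
    ("db-01", "lock waits abnormal"),
    ("app-02", "heap usage abnormal"),
    ("mail-01", "queue length abnormal"),
]
by_service = consolidate(alerts)
print(by_service)
```

An executive view would then show one entry per business service, while the underlying resource alerts remain available to the administrators who need them.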

Gardner: I suppose from the IT side the more that we can show and tell to the business folks about how well we are doing the better. It makes us seem less like firefighters and more like we're proactive and on top of things. If there are any decisions several months or years out about outsourcing, we have a nice trail, a cookie-crumb trail, if you will, of how well things are going and how costs are being managed.

Henning: That’s absolutely true. I was talking to the CIO of a large university the other day. One thing that was very frustrating for him was that he was in a meeting with the president of the university, and the president was saying that it seemed like some of the critical applications were down a lot.

This CIO was very frustrated, because he knew that wasn’t the case, but he didn’t have effective reporting tools to show that it was not the case. That was one of the things that he was very excited about, when he took a look at our product.

Gardner: We know that complexity is substantial. It’s pretty clear that that complexity is going to continue as we see organizations move toward SOA and software as a service, and hybrid issues, where a holistic business process could be supported by your systems, partner systems, or perhaps third-party systems.

I can just imagine there is going to be finger pointing when things go wrong. You’re going to want to be able to say, "Hey, not my problem, but I am ready, willing and able to help you fix it. In fact, I've got more insight into your systems than you do."

Henning: That’s absolutely the case.

Gardner: Give me a sense of where Integrien and Alive, as a product set, are going in the future. I know you can't pre-announce things, but as these new complexities in terms of permeable organizational boundaries kick in and virtualization kicks in, what might we expect in the future?

Henning: One of the things that you’re going to see from us is a comprehensive solution around the virtualized environment. Several other companies claim to have solutions in this space, but from what we have been able to see so far, the issue of motion of virtual machines (VM), moving them between different servers, is still an issue for all of these solutions.

We’re working extremely diligently to solve the issue of how to deal with performance monitoring in a virtualized environment, where you have got the individual VMs moving all over the place, based on changes in capacity, and things like that. So, look out for that solution coming from Integrien in the coming months.

Gardner: So we're talking about instances of entire stacks, provisioning and moving dynamically among systems. That sounds like a whole other level of complexity that we are adding to an already difficult situation.

Henning: Yes, it’s a big math problem. You can also compound that with the fact that when a VM moves from one physical server to another, it might be allocated a different percentage of resources. So, when you think about this whole hard-threshold based monitoring paradigm that IT is in now, what does a hard-threshold really mean in an environment like that? It makes absolutely no sense at all.

If you don’t have some way to understand the normal behavior, to provide context, and to quickly learn and adapt to changes in the environment, managing the virtualized environment is going to be an absolute nightmare. Based on spending some time with the folks over at VMware, and attending the VMworld show this year, you could certainly see in their customers this concern about how to deal with this complex management problem.

Gardner: The old manual wetware approaches just aren’t going to cut it in that environment?

Henning: That’s correct.

Gardner: I appreciate your candor and I look forward to seeing some of these newer solutions focused on virtualization.

We have been talking about remediation and ability to get in front of problems for IT operators using predictive and analytic algorithmic approaches. To help us understand this, we have been joined by Steve Henning, the Vice President of Products at Integrien. Thank you, Steve.

Henning: Thank you very much, Dana.

Gardner: This is Dana Gardner, principal analyst at Interarbor Solutions. You have been listening to BriefingsDirect. Thanks and come back next time.

Monday, September 24, 2007

Probabilistic Analysis Predicts IT Systems Problems Before Costly Applications Outages

Edited transcript of BriefingsDirect[TM] podcast on probabilistic IT systems analysis and management, recorded Aug. 16, 2007.

Listen to the podcast here. Sponsor: Integrien Corp.

Dana Gardner: Hi, this is Dana Gardner, principal analyst at Interarbor Solutions, and you're listening to BriefingsDirect. Today our sponsored podcast focuses on the operational integrity of data centers, the high cost of IT operations, and the extremely high cost of application downtime and performance degradation.

Rather than losing control to ever-increasing complexity, and gaining less and less insight into the root causes of problematic applications and services, enterprises and on-demand application providers alike need to predict how systems will behave under a variety of conditions.

By adding real-time analytics to their systems management practices, operators can fully determine the normal state of how systems should be performing. Then, by measuring the characteristics of systems under many conditions over time, datacenter administrators and systems operators gain the ability to predict and prevent threats to the performance of their applications and services.

As a result they can stay ahead of complexity, and contain the costs of ongoing high-performance applications delivery.

This ability to maintain operational integrity through predictive analytics means IT operators can significantly reduce costs while delivering high levels of service.

Here to help us understand how to manage complexity by leveraging probabilistic systems management and remediation, we are joined by Mazda Marvasti, the CTO of Integrien Corp. Welcome to the show, Mazda.

Mazda Marvasti: Thank you, Dana.

Gardner: Why don’t we take a look at the problem set? Most people are now aware that their IT budgets are strained just by ongoing costs. Whether they are in a medium-sized company, large enterprise, or service-hosting environment, some 70 percent to 80 percent of budgets are going to ongoing operations.

That leaves very little left over for discretionary spending. If you have constant change or a dynamic environment, you're left without many resources to tap in order to meet a dynamic market shift. Can you explain how we got to this position? Why are we spending so much just to keep our heads above water in IT?

Marvasti: When we started in the IT environment, if you remember the mainframe days, it was pretty well defined. You had a couple of big boxes. They ran a couple of large applications. It was well understood. You could collect some data from it, so you knew what was going on within it.

We graduated to the client-server era, where we had more flexibility in terms of deployment -- but with that came increasing complexity. Then we moved ahead to n-tier Web applications, and we had yet another increase in complexity. A lot of products came in to try to alleviate that complexity for deep-data collection. And management systems grew out, covering an entire enterprise for data collection, but the complexity was still there.

Now, with service-oriented architecture (SOA) and virtualization moving into application-development and data-center automation, there is a tremendous amount of complexity in the operations arena. You can’t have the people who used to have the "tribal knowledge" in their head determining where the problems are coming from or what the issues are.

The problems and the complexity have gone beyond the capability of people just sitting there in front of screens of data, trying to make sense out of it. So, as we gained efficiency from application development, we need consistency of performance and availability, but all of this added to the complexity of managing the data center.

That’s how the evolution of the data center went from being totally deterministic, meaning that you knew every variable, could measure it, and had very specific rules telling you if certain things happened, and what they were and what they meant -- all the way to a non-deterministic era, which we are in right now.

Now, you can't possibly know all the variables, and the rules that you come up with today may be invalid tomorrow, all just because of change that has gone on in your environment. So, you cannot use the same techniques that you used 10 or 15 years ago to manage your operations today. Yet that’s what the current tools are doing. They are just more of the same, and that’s not meeting the requirements of the operations center anymore.

Gardner: At the same time, we are seeing that a company’s applications are increasingly the primary way that it reaches out to its sell side, to customers -- as well as its buy side, to its supply chain, its partners, and ecology. So applications are growing more important. The environment is growing more complex, and the ability to know what’s going on is completely out of hand.

Marvasti: That’s right. You used to know exactly where your application was, what systems it touched, and what it was doing. Now, because of the demand of the customers and the demands of the business to develop applications more rapidly, you’ve gone into an SOA era or an n-tier application era, where you have a lot of reusability of components for faster development and better quality of applications -- but also a lot more complexity in the operations arena.

What that has led to is that you no longer even know in a deterministic fashion where your applications might be touching or into what arenas they might be going. There's no sense of, "This is it. These are the bounds of my application." Now it’s getting cloudier, especially with SOA coming in.

Gardner: We’ve seen some attempts in the conventional management space to keep up with this. We’ve been generating more agents, putting in more sniffers, applying different kinds of management. And yet we still seem to be suffering the problems. What do you think is the next step in terms of coming to grips with this -- perhaps on a holistic basis -- so that we can get as much of the picture as possible?

Marvasti: The "business service" is the thing that the organization offers to its customers. It runs through their data center, IT operations, and the business center. It goes across multiple technology layers and stacks. So having data collection at a specific stack or for a specific technology silo, in and of itself, is insufficient to tell you the problems with the business service, which is what you are ultimately trying to get to. You really need to do a holistic analysis of the data from all of the silos that the business service runs through.

You may have some networking silos, where people are using specific tools to do network management -- and that’s perfectly fine. But then the business service may go through some Web tier, application tier, database tier, or storage -- and then all of those devices may be virtualized. There may be some calls to a SOA.

There are deep-dive tools to collect data and report upon specifics of what may be going on within silos, but you really need to do an analysis across all the silos to tell you where the problems of the business service may be coming from. The interesting thing is that there is a lot of information locked into these metrics. Once correlated across the silos, they paint a pretty good picture as to the impending problem or the root cause of what a problem may be.

By looking at individual metrics collected via silos you don’t get as full a picture as if you were to correlate that individual metric with another metric in another silo. That paints a much larger picture as to what may be going on within your business service.
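As an illustration of correlating a metric in one silo with a metric in another, here is a minimal sketch using a plain Pearson correlation. The metric names and sample values are hypothetical, and this is not a description of how Integrien Alive computes its correlations; it only shows the general idea that two series from different silos can move together.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 samples")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation information
    return cov / sqrt(var_x * var_y)

# Hypothetical samples taken at the same timestamps in two different silos.
db_query_latency_ms = [12, 14, 13, 30, 45, 60]   # database tier
web_queue_depth = [3, 4, 3, 9, 14, 20]           # Web tier

r = pearson(db_query_latency_ms, web_queue_depth)
print(f"cross-silo correlation: {r:.2f}")
```

A coefficient close to 1 suggests the two silos are moving together, which is the kind of relationship that looking at either metric alone would not reveal.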

Gardner: So if we want to gather insights and even predictability into the business service level -- that higher abstraction of what is productive -- we need to go in and mine this data in this context. But it seems to me that it’s in many different formats. From that "Tower of Babel" how do you create a unified view? Or are you creating metadata? What’s the secret sauce that gets you from raw data to analysis?

Marvasti: One misperception is that, "I need to have every metric that I collect go into a magical box that then tells me everything I need to know." In fact, you don’t need to have every metric. There is much information locked in the correlations between the metrics. We’ve seen at our customers that a gap in monitoring in one silo can often be compensated for by data collection in other silos.

So, if you have a monitoring system already -- IBM Tivoli, as an example -- and you are collecting operating-system metrics, you may have one or two other application-specific metrics that you are also collecting. That may be enough to tell you everything that is going on within your business service. You don't need to go to the nth degree of data collection and harmonization of that data into one data repository to get a clear picture.

Even starting with what you’ve got now, without having to go very deep, what we’ve seen in our customers is that it actually lights up a pretty good volume of information in terms of what may be going on across the silos. They couldn't achieve that by just looking at individual metrics.

Gardner: It’s a matter of getting to the right information that’s going to tell you the most across the context of a service?

Marvasti: To a certain degree, but a lot of times you don’t even know what the right metrics are. Basically I go to our customers and say, "What do you have?" Let’s just start with that, and then the application will determine whether you actually have gaps in your monitoring or whether these metrics that you are collecting are the right ones to solve those specific problems.

If not, we can figure out where the gaps may be. A lot of times, customers don’t even know what the right metrics are. And that’s one of the mental shifts of thinking deterministically versus probabilistically.

Deterministically is, "What are the right metrics that I need to collect to be able to identify this problem?" In fact, what we’ve found out is that a particular problem in a business service can be modeled by a group or a set of metric event conditions that are seemingly unrelated to that problem, but are pretty good indicators of the occurrence of that problem.

When we start with what they have, we often point out that there is a lot more information within that data set. They don’t really need to ask, "Do I have the right metrics or not?"

Gardner: Once you’ve established a pretty good sense of the right metrics and the right data, then I suppose you need to apply the right analysis and tools. Maybe this would be a good time for you to explain about the heritage of Integrien, how it came about, and how you get from this deterministic to more probabilistic or probability-oriented approach?

Marvasti: I’ve been working on these types of problems for the past 18 years. Since graduate school, I’ve been analyzing data and extracting information from disparate data. I went to work for Ford and General Motors -- really large environments. Back then, it was client-server and how those environments were being managed. I could see the impending complexity, because I saw the level of pressure that there was on application developers to develop more reusable code and to develop faster with higher quality.

All that led to the Web application era. Back then, I was the CTO of a company called LowerMyBills.com here in the Los Angeles area. One problem I had was that I had a few people with the tribal knowledge to manage and run the systems, but that was very scary to me. I couldn't rely on these people to be able to have a continuous business going on.

So I started looking at management systems, because I thought it was probably a solved problem. I looked at a lot of management tools out there, and saw that they were mainly centered on data collection, manual rule writing, and better ways of presenting the same data over and over.

I didn’t see any way of doing a deep analysis of the data to bring out insights. That’s when I and my business partner Al Eisaian, who is our CEO, formed a company to specifically attack this problem. That was in 2001. We spent a couple of years developing the product, got our first set of customers in 2003, and really started proving the model.

One of the interesting things is that if you have a small environment, your tendency is to think that it's small enough that you can manage it, and that actually may be true. You develop some specific technical knowledge about your systems and you can move from there. But in the larger environments where there is so much change happening in the environment it becomes impossible to manage it that way.

A product like ours almost becomes a necessity, because we’ve transitioned from people knowing in their heads what to do, to not being able to comprehend all of the things happening in the data center. The technology we developed was meant to address this problem of not being able to make sense of the data coming through, so that you could make an intelligent decision about problems occurring in the environment.

Gardner: Clearly a tacit knowledge approach is not sufficient, and just throwing more people at it is not going to solve the problem. What’s the next step? How do we get to a position where we can gather and then analyze data in such a way that we get to that Holy Grail, which is predictive, rather than reactive, response?

Marvasti: Obviously, the first step is collecting the data. Without the data, you can’t really do much. A lot of investment has already gone into data collection mechanisms, be it agent-based or agent-less. So there is data being collected right now.

The missing piece is the utilization of that data and the extraction of information from that data. Right now, as you said at the beginning of your introduction, a lot of cost is going toward keeping the lights on at the operations center. That’s typically people cost, where people are deployed 24/7, looking at monitors, looking at failures, and then trying to do postmortem on the problem.

This does require a little bit of mind shift from deterministic to probabilistic. The reason for that is that a lot of things have been built to make the operations center do a really good job of cleaning up after an accident, but not a lot of thought has been put into place of what to do if you're forewarned of an accident, before it actually happens.

Gardner: How do I intercede? How do I do something?

Marvasti: How do I intercede? What do I do? What does it mean? For example, one of the outputs from our product is a predictive alert that says, "With 80 percent confidence, this particular problem will occur within the next 15 minutes." Well, nothing has happened yet, so what does my run book say I should do? The run book is missing that information. The run book only has the information on how to clean it up after an accident happens.

That’s the missing piece in the operations arena. Part of the challenge for our company is getting the operations folks to start thinking in a different fashion. You can do it a little at a time. It doesn’t have to be a complete shift in one fell swoop, but it does require that change in mentality. Now that I am actually forewarned about something, how do I prevent it, as opposed to cleaning up after it happens?
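To make the "80 percent confidence within the next 15 minutes" style of alert concrete, here is a minimal sketch that estimates such a confidence from history: the fraction of past precursor events that were followed by the problem within a time horizon. The function name, timestamps, and the frequency-counting approach are illustrative assumptions, not the method used by the product discussed here.

```python
def precursor_confidence(precursor_times, problem_times, horizon_minutes=15):
    """Fraction of precursor occurrences followed by the problem within the horizon."""
    if not precursor_times:
        return 0.0
    hits = 0
    for t in precursor_times:
        # Count the precursor as a hit if any problem falls in (t, t + horizon].
        if any(t < p <= t + horizon_minutes for p in problem_times):
            hits += 1
    return hits / len(precursor_times)

# Hypothetical history, in minutes since the start of observation.
precursor_times = [10, 100, 250, 400, 520]
problem_times = [18, 112, 263, 410]

confidence = precursor_confidence(precursor_times, problem_times)
print(f"With {confidence:.0%} confidence, the problem follows within 15 minutes.")
```

With this sample history, four of the five precursor events were followed by the problem inside the horizon, so the estimate comes out at 80 percent.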

Gardner: When we talk about operational efficiency, are we talking about one or two percent here and there? Is this a rounding error? Are we talking about some really wasteful practices that we can address? What’s the typical return on investment that you are pursuing?

Marvasti: It’s not one or two percent. We're talking about a completely different way of managing operations. After a problem occurs, you typically have a lot of people on a bridge call, and then you go through a process of elimination to determine where the problem is coming from, or what might have caused it. Once the specific technology silo has been determined, then they go to the experts for that particular silo to figure out what’s going on. That actually has a lot of time and manpower associated with it.

What we're talking about is being proactive, so that you know something is about to happen, and we can tell you to a certain probability where it’s going to be. Now you have a list of low-hanging fruits to go after, as opposed to just covering everybody in the operations center, trying to get the problem fixed.

The first order of business is, "Okay, this problem is about to occur, and this is where it may occur. So, that’s the guy I’m going to engage first." Basically, you have a way of following down from the most probable to the least probable, and not involving all the people that typically get involved in a bridge call to try to resolve the issues.

One gain is the reduction in mean time to identify where the problem is coming from. The other one is not having all of those people on these calls. This reduces the man-hours associated with root-cause determination and source identification of the problem. In different environments you're going to see different percentages, but in one environment that I saw first hand, one of the largest health-care organizations, it is like 20-30 percent of cost, just associated with people being on bridge calls, on a continuous basis.

Gardner: Now, this notion of "management forensics," can you explain that a little bit?

Marvasti: One of the larger problems in IT is actually getting to the root cause of problems. What do you know? How do you know what the root cause is? Oftentimes, something happens and the necessity of getting the business service back up forces people to reboot the servers and worry later about figuring out what happened. But, when you do that, you lose a lot of information that would have been very helpful in determining what the root cause was.

The forensic side of it is: The data is collected already, so we already know what it is. If you have the state when a problem occurred, that’s a captured environment in the database that you can always go back to.

What we offer is the ability to walk back in time, without having the server down, while you are doing your investigation. You can bring the server back up, come back to our product, and then walk back in time to see exactly what the leading indicators were for the problems you experienced. Using those leading indicators, you can get to the root causes very quickly. That eliminates the guesswork of where to start, reduces the time to get to the root cause, and may even help you prevent the problem.

Sometimes you only have so much time to work on something. If you can’t solve it by that time, you move on, and then the problem occurs again. That's the forensic side.
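The "walk back in time" idea can be sketched as a query over already-stored events: given the time of an incident, pull the out-of-normal events recorded in the window just before it. The tuple layout, event descriptions, and function name below are hypothetical and only illustrate the lookup, assuming the history is kept sorted by timestamp.

```python
from bisect import bisect_left

def leading_events(history, incident_time, lookback):
    """Return events recorded in [incident_time - lookback, incident_time).

    `history` is a list of (timestamp, source, description) tuples,
    sorted by timestamp.
    """
    timestamps = [event[0] for event in history]
    start = bisect_left(timestamps, incident_time - lookback)
    end = bisect_left(timestamps, incident_time)
    return history[start:end]

# Hypothetical stored events; timestamps are seconds since some reference point.
history = [
    (100, "storage", "write latency above normal"),
    (160, "database", "lock waits above normal"),
    (175, "app-server", "thread pool saturation"),
    (180, "web", "HTTP 500 rate spike"),  # the incident itself
    (240, "web", "server rebooted"),
]

indicators = leading_events(history, incident_time=180, lookback=60)
for timestamp, source, description in indicators:
    print(timestamp, source, description)
```

Because the events were captured before the reboot, the investigation can run after the service is back up, which is the point made in the discussion above.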

Gardner: We talked earlier about this notion of capturing the normal state, and now you've got this opportunity to capture an abnormal state. You can compare and contrast. Is that something that you use on an ongoing basis to come up with these probabilities? Or is the probability analysis something different?

Marvasti: No, that’s very much part and parcel of it. What we do is look to see what is the normal operating state of an environment. Then it is the abnormalities from that normal that become your trigger points of potential issues. Those are your first indicators that there might be a problem growing. We also do a cross-event analysis. That’s another probability analysis that we do. We look at patterns of events, as opposed to a single event, indicating a potential problem. One thing we've found is that events in seemingly unrelated silos are very good indicators of a potential problem that may brew some place else.

Doing that kind of analysis, looking at what’s normal, then abnormal becomes your first indicator. Then, doing a cross-event analysis to see what patterns indicate a particular problem completes the path from normal state to problem prevention.
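A minimal sketch of "learn what is normal, then treat deviations as the first indicator" is shown below, using a mean and standard deviation learned from past samples and a simple threshold on the number of standard deviations. The three-sigma threshold and the sample values are assumptions chosen for illustration; the transcript does not say which statistical model the product uses.

```python
from statistics import mean, stdev

def learn_baseline(samples):
    """Learn a (mean, standard deviation) baseline from normal-state samples."""
    return mean(samples), stdev(samples)

def is_abnormal(value, baseline, threshold=3.0):
    """True if `value` lies more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical CPU utilization samples (percent) from a normal operating period.
normal_cpu = [41, 39, 40, 42, 38, 40, 41, 39]
baseline = learn_baseline(normal_cpu)

print(is_abnormal(41, baseline))  # within the learned normal range
print(is_abnormal(55, baseline))  # far outside the learned normal range
```

Abnormal readings like the second one would then feed the cross-event analysis, where patterns of such events across silos are examined together.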

Gardner: There has to be a cause-and-effect. As much as we would like to imagine ghosts in the machines, that’s really not the case. It's simply a matter of tracking it down.

Marvasti: Exactly. The interesting thing is that you may be measuring a specific metric that is a clear indicator of a problem, but it is oftentimes some other metric on another machine that gets to be out of normal first, before the cause of the problem surfaces in the machine in question. So early indicators to a problem become events that occur some place else, and that’s really important to capture.

When I was talking about the cross-silo analysis, that’s the information that it brings out. It gives you a lot more "heads-up" time to a potential problem than if you were just looking at a specific silo.

Gardner: Of course, each data center is unique, each company has its own history and legacy, and its IT department has evolved on its own terms. But is there any general crossover analysis? That is to say, is there a way of having an aggregate view of things and predicting things based on some assumptions, because of the particular systems that are in use? Or, is it site by site on a detailed level?

Marvasti: Every customer that I have seen is totally different. We developed our applications specifically to be learning-based, not rules-based. And by rules I mean not having any preconceived notion of what an environment may look like. Because if you have that, and the environment doesn’t look like that, you're going to be sending a lot of false positives -- which we definitely did not want to do.

Ours is a purely learning-based system. That means that we install our product, it starts gathering the metrics, and then it starts learning what your systems look like and behave like. Then based on that behavior it starts formulating the out-of-normal conditions that can lead to problems. That becomes unique to the customer environment. That is an advantage, because when you get something, it actually adapts itself to an environment.

For example, it learns your change management patterns. If you have change windows occurring, it learns those change windows. It knows that those change windows occur without anybody having to enter anything into the application. When you are doing a wholesale upgrade of devices, it knows that change is coming about, because it has learned your patterns.

The downside of that is that it does take two to three weeks of gathering your data and learning what has been happening before it starts producing useful results. The good side of it is that you get something that completely maps to your business, as opposed to having to map your business to a product.

Gardner: The name of your product set is Alive, is that correct?

Marvasti: That’s correct.

Gardner: I understand you are going to have a release coming out later this year, Alive 6.0?

Marvasti: That’s correct.

Gardner: I don’t expect you to pre-release, but perhaps you can give us some sense of the direction that the major new offerings within the product set will take. What are they directed toward? Can you give us a sneak peek on that?

Marvasti: Basically, we have three pillars that the product is based on. First is usability. That's a particular pet peeve of mine. I didn't find any of the applications out there very usable. We have spent a lot of time working with customers and working with different operations groups, trying to make sure that our product is actually usable for the people that we are designing for.

The second piece is interoperability. The majority of the organizations that we go to already have a whole bunch of systems, whether it be data collection systems, event management systems, or configuration management databases, etc. Our product absolutely needs to leverage those investments -- and they are leveragable. But even those investments in their silos don’t produce as much benefit to the customer as a product like ours going in there and utilizing all of that data that they have in there, and bringing out the information that’s locked within it.

The third piece is analytics. What we have in the product coming out is scalability to 100,000 servers. We've kind of gone wild on the scalability side, because we are designing for the future. Nobody that I know of right now has that kind of a scale, except maybe Google, but theirs is basically the same thing replicated thousands of times over, which is different than the enterprises we deal with, like banks or health-care organizations.

A single four-processor Xeon box, with Alive installed on it, can run real-time analytics for up to 100,000 devices. That’s the level of scale we're talking about. In terms of analytics, we've got three new pieces coming out, and basically every event we send out is a predictive event. It’s going to tell you this event occurred, and then this other set of events has a certain probability of occurring within a certain timeframe.

Not only that, but then we can match it to what we call our "finger printing." Our finger printing is a pattern-matching technology that allows us to look at patterns of events and formulate a particular problem. It indicates particular problems and those become the predictive alerts to other problems.

What’s coming out in the product is really a lot of usability, reporting capabilities, and easier configurations. Tens of thousands of devices can be configured very quickly. We have interoperability -- Tivoli, OpenView, Hyperic -- and an open API that allows you to connect to our product and pump in any kind of data, even if it’s business data.

Our technology is context agnostic. What that means is that it does not have any understanding of applications, databases, etc. You can even put in business-type data and have it correlated with your IT data, and extract information that way as well.
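The finger-printing idea described above, matching a pattern of events rather than a single event, can be sketched as follows. This is only an assumption-laden illustration: a "fingerprint" is modeled here as a set of event types that must all occur within one time window, and the event names are hypothetical. The actual pattern-matching technology in the product is not described in this discussion.

```python
def matches_fingerprint(events, fingerprint, window):
    """True if every event type in `fingerprint` occurs within one `window`-long span.

    `events` is a list of (timestamp, event_type) tuples in any order.
    """
    # Keep only events that belong to the fingerprint, ordered by time.
    relevant = sorted((t, kind) for t, kind in events if kind in fingerprint)
    for i, (start, _) in enumerate(relevant):
        # Event types seen from this starting event up to `window` later.
        seen = {kind for t, kind in relevant[i:] if t - start <= window}
        if seen == set(fingerprint):
            return True
    return False

# Hypothetical out-of-normal events; timestamps are in minutes.
events = [
    (0, "db_lock_waits_high"),
    (4, "disk_io_high"),
    (9, "app_threads_high"),
    (50, "db_lock_waits_high"),
]
fingerprint = frozenset({"db_lock_waits_high", "app_threads_high"})

print(matches_fingerprint(events, fingerprint, window=10))
```

Because the matcher only looks at event types and timestamps, it is indifferent to whether an event came from IT metrics or business data, which mirrors the context-agnostic point made above.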

Gardner: You mentioned usability. Who are the typical users and buyers of a product like Integrien Alive? Who is your target audience?

Marvasti: The typical user would be at the operations center. The interesting thing is that we have seen a lot of different users come in after the product is deployed. I've seen database administrators use our product, because they like to see what is normal behavior of their databases. So they run the analytics under database type metrics and get information that way.

I've seen application folks who want to have more visibility in terms of how this particular application is impacting the database. They become users. But the majority of users are going to be at the operations center -- people doing day-to-day event management and who are responsible for reducing the mean time to identify where the problems come from.

The typical buyers are directors of IT operations or VPs of IT operations. We are really on the operations side, as opposed to the application development side.

Gardner: Do you suppose that in the future, when we get more deeply into SOA and virtualization, that some of the analysis that is derived through Integrien Alive becomes something that’s fed into a business dashboard, or something that’s used in governance around how services are going to be provisioned, or how service level agreements are going to be met?

Can we extrapolate as to how the dynamics of the data center and then the job of IT itself change, and how your value might shift as well?

Marvasti: That link between IT and the business is starting to occur. I definitely believe that our product can play a major part in illuminating what in the business side gets impacted by IT. Because we are completely data agnostic, you can put in IT-type data, business-type data, or customer data -- and have all of it be correlated.

You then have one big holistic view as to what may impact what. ... If this happens, what else might happen? If I want to increase this, what are the other parameters that may be impacted?

So, you know what you want to plan from the business side in terms of growth. Having that, we project how IT needs to change in order to support that growth. The information is there within the data and the very fact that we are completely data agnostic allows us to do that kind of a multi-function analysis within an enterprise.

Gardner: It sounds like you can move from an operational efficiency value to a business efficiency value pretty quickly?

Marvasti: Absolutely. Our initial target is the operations arena, because of the tremendous amount of inefficiencies there. But as we move into the future, that’s something we are going to look into.

Gardner: We mentioned Alive 6.0. Do you have a ball-park figure on when that’s due? Is it Q4 of 2007?

Marvasti: We are going to come out with it in 2007, and it will be available in Q4.

Gardner: Well, I think that covers it, and we are just about out of time. I want to thank Mazda Marvasti, the CTO of Integrien, for helping us understand more about the notion of management forensics and probabilistic- rather than deterministic-based analysis.

We have been seeking to understand better how to address high costs and inefficiencies in data centers, as well as managing application performance -- perhaps in quite a different way than many companies have been accustomed to. Is there anything else you would like to add before we end, Mazda?

Marvasti: No, I appreciate your time, Dana, and thank you for your good questions.

Gardner: This is Dana Gardner, principal analyst at Interarbor Solutions. You have been listening to a sponsored BriefingsDirect podcast. Thanks, and come back next time.

Listen to the podcast here. Sponsor: Integrien Corp.

Transcript of BriefingsDirect podcast on systems management efficiencies and analytics. Copyright Interarbor Solutions, LLC, 2005-2007. All rights reserved.

Monday, September 24, 2007

Probabilistic Analysis Predicts IT Systems Problems Before Costly Applications Outages
