Monday, September 24, 2007

Probabilistic Analysis Predicts IT Systems Problems Before Costly Applications Outages

Edited transcript of BriefingsDirect[TM] podcast on probabilistic IT systems analysis and management, recorded Aug. 16, 2007.

Listen to the podcast here. Sponsor: Integrien Corp.

Dana Gardner: Hi, this is Dana Gardner, principal analyst at Interarbor Solutions, and you're listening to BriefingsDirect. Today our sponsored podcast focuses on the operational integrity of data centers, the high cost of IT operations, and the extremely high cost of application downtime and performance degradation.

Rather than losing control to ever-increasing complexity, and gaining less and less insight into the root causes of problematic applications and services, enterprises and on-demand application providers alike need to predict how systems will behave under a variety of conditions.

By adding real-time analytics to their systems management practices, operators can fully determine the normal state of how systems should be performing. Then, by measuring the characteristics of systems under many conditions over time, datacenter administrators and systems operators gain the ability to predict and prevent threats to the performance of their applications and services.

As a result they can stay ahead of complexity, and contain the costs of ongoing high-performance applications delivery.

This ability to maintain operational integrity through predictive analytics means IT operators can significantly reduce costs while delivering high levels of service.

Here to help us understand how to manage complexity by leveraging probabilistic systems management and remediation, we are joined by Mazda Marvasti, the CTO of Integrien Corp. Welcome to the show, Mazda.

Mazda Marvasti: Thank you, Dana.

Gardner: Why don’t we take a look at the problem set? Most people are now aware that their IT budgets are strained just by ongoing costs. Whether they are in a medium-sized company, large enterprise, or service-hosting environment, some 70 percent to 80 percent of budgets are going to ongoing operations.

That leaves very little left over for discretionary spending. If you have constant change or a dynamic environment, you're left without much resources to tap in order to meet a dynamic market shift. Can you explain how we got to this position? Why are we spending so much just to keep our heads above water in IT?

Marvasti: When we started in the IT environment, if you remember the mainframe days, it was pretty well defined. You had couple of big boxes. They ran a couple of large applications. It was well understood. You could collect some data from it, so you knew what was going on within it.

We graduated to the client-server era, where we had more flexibility in terms of deployment -- but with that came increasing complexity. Then we moved ahead to n-tier Web applications, and we had yet another increase in complexity. A lot of products came in to try to alleviate that complexity for deep-data collection. And management systems grew out, covering an entire enterprise for data collection, but the complexity was still there.

Now, with service-oriented architecture (SOA) and virtualization moving into application-development and data-center automation, there is a tremendous amount of complexity in the operations arena. You can’t have the people who used to have the "tribal knowledge" in their head determining where the problems are coming from or what the issues are.

The problems and the complexity have gone beyond the capability of people just sitting there in front of screens of data, trying to make sense out of it. So, as we gained efficiency from application development, we need consistency of performance and availability, but all of this added to the complexity of managing the data center.

That’s how the evolution of the data center went from being totally deterministic, meaning that you knew every variable, could measure it, and had very specific rules telling you if certain things happened, and what they were and what they meant -- all the way to a non-deterministic era, which we are in right now.

Now, you can't possibly know all the variables, and the rules that you come up with today may be invalid tomorrow, all just because of change that has gone on in your environment. So, you cannot use the same techniques that you used 10 or 15 years ago to manage your operations today. Yet that’s what the current tools are doing. They are just more of the same, and that’s not meeting the requirements of the operations center anymore.

Gardner: At the same time, we are seeing that a company’s applications are increasingly the primary way that it reaches out to its sell side, to customers -- as well as its buy side, to its supply chain, its partners, and ecology. So applications are growing more important. The environment is growing more complex, and the ability to know what’s going on is completely out of hand.

Marvasti: That’s right. You used to know exactly where your application was, what systems it touched, and what it was doing. Now, because of the demand of the customers and the demands of the business to develop applications more rapidly, you’ve gone into an SOA era or an n-tier application era, where you have a lot of reusability of components for faster development and better quality of applications -- but also a lot more complexity in the operations arena.

What that has led to is that you no longer even know in a deterministic fashion where your applications might be touching or into what arenas they might be going. There's no sense of, "This is it. These are the bounds of my application." Now it’s getting cloudier, especially with SOA coming in.

Gardner: We’ve seen some attempts in the conventional management space to keep up with this. We’ve been generating more agents, putting in more sniffers, applying different kinds of management. And yet we still seem to be suffering the problems. What do you think is a next step in terms of coming to grips with this -- perhaps on an holistic basis -- so that we can get as much of the picture as possible?

Marvasti: The "business service" is the thing that the organization offers to its customers. It runs through their data center, IT operations, and the business center. It goes across multiple technology layers and stacks. So having data collection at a specific stack or for a specific technology silo, in and of itself, is insufficient to tell you the problems with the business service, which is what you are ultimately trying to get to. You really need to do a holistic analysis of the data from all of the silos that the business service runs through.

You may have some networking silos, where people are using specific tools to do network management -- and that’s perfectly fine. But then the business service may go through some Web tier, application tier, database tier, or storage -- and then all of those devices may be virtualized. There may be some calls to a SOA.

There are deep-dive tools to collect data and report upon specifics of what maybe going on within silos, but you really need to do an analysis across all the silos to tell you where the problems of the business service may be coming from. The interesting thing is that there is a lot of information locked into these metrics. Once correlated across the silos, they paint a pretty good picture as to the impending problem or the root cause of what a problem may be.

By looking at individual metrics collected via silos you don’t get as full a picture as if you were to correlate that individual metric with another metric in another silo. That paints a much larger picture as to what may be going on within your business service.

Gardner: So if we want to gather insights and even predictability into the business service level -- that higher abstraction of what is productive -- we need to go in and mine this data in this context. But it seems to me that it’s in many different formats. From that "Tower of Babel" how do you create a unified view? Or you are creating metadata? What’s the secret sauce that gets you from raw data to analysis?

Marvasti: One misperception is that, "I need to have every piece of metric that I collect go into a magical box that then tells me everything I need to know." In fact, you don’t need to have every piece of metrics. There is much information locked between the correlation of the metrics. We’ve seen at our customers that a gap in monitoring in one silo can often be compensated by data collection in other silos.

So, if you have a monitoring system already -- IBM Tivoli, as an example -- and you are collecting operating-system metrics, you may have one or two other application-specific metrics that you are also collecting. That may be enough to tell you everything that is going on within your business service. You don't need to go to the nth degree of data collection and harmonization of that data into one data repository to get a clear picture.

Even starting with what you’ve got now, without having to go very deep, what we’ve seen in our customers is that it actually lights up a pretty good volume of information in terms of what may be going on across the silos. They couldn't achieve that by just looking at individual metrics.

Gardner: It’s a matter of getting to the right information that’s going to tell you the most across the context of a service?

Marvasti: To a certain degree, but a lot of times you don’t even know what the right metrics are. Basically I go to our customers and say, "What do you have?" Let’s just start with that, and then the application will determine whether you actually have gaps in your monitoring or whether these metrics that you are collecting are the right ones to solve those specific problems.

If not, we can figure out where the gaps may be. A lot of times, customers don’t even know what the right metrics are. And that’s one of the mental shifts of thinking deterministically versus probabilistically.

Deterministically is, "What are the right metrics that I need to collect to be able to identify this problem?" In fact, what we’ve found out is that a particular problem in a business service can be modeled by a group or a set of metric event conditions that are seemingly unrelated to that problem, but are pretty good indicators of the occurrence of that problem.

When we start with what they have, we often point out that there is a lot more information within that data set. They don’t really need to ask, "Do I have the right metrics or not?"

Gardner: Once you’ve established a pretty good sense of the right metrics and the right data, then I suppose you need to apply the right analysis and tools. Maybe this would be a good time for you to explain about the heritage of Integrien, how it came about, and how you get from this deterministic to more probabilistic or probability-oriented approach?

Marvasti: I’ve been working on these types of problems for the past 18 years. Since graduate school, I’ve been analyzing data extraction of information from disparate data. I went to work for Ford and General Motors -- really large environments. Back then, it was client-servers and how those environments were being managed. I could see the impending complexity, because I saw the level of pressure that there was on application developers to develop more reusable code and to develop faster with higher quality.

All that led to the Web application era. Back then, I was the CTO of a company called LowerMyBills.com here in the Los Angeles area. One problem I had was that I had a few people with the tribal knowledge to manage and run the systems, but that was very scary to me. I couldn't rely on these people to be able to have a continuous business going on.

So I started looking at management systems, because I thought it was probably a solved problem. I looked at a lot of management tools out there, and saw that it was mainly centered on data collection, manual rule writing, and better way of presenting the same data over and over.

I didn’t see any way of doing a deep analysis of the data to bring out insights. That’s when I and my business partner Al Eisaian, who is our CEO, formed a company to specifically attack this problem. That was in 2001. We spent a couple of years developing the product, got our first set of customers in 2003, and really started proving the model.

One of the interesting things is that if you have a small environment, your tendency is to think that it's small enough that you can manage it, and that actually may be true. You develop some specific technical knowledge about your systems and you can move from there. But in the larger environments where there is so much change happening in the environment it becomes impossible to manage it that way.

A product like ours almost becomes a necessity, because we’ve transitioned from people knowing in their heads what to do, to not being able to comprehend all of the things happening in the data center. The technology we developed was meant to address this problem of not being able to make sense of the data coming through, so that you could make an intelligent decision about problems occurring in the environment.

Gardner: Clearly a tacit knowledge approach is not sufficient, and just throwing more people at it is not going to solve the problem. What’s the next step? How do we get to a position where we can gather and then analyze data in such a way that we get to that Holy Grail, which is predictive, rather than reactive, response.

Marvasti: Obviously, the first step is collecting the data. Without the data, you can’t really do much. A lot of investment has already gone into data collection mechanisms, be it agent-based or agent-less. So there is data being collected right now.

The missing piece is the utilization of that data and the extraction of information from that data. Right now, as you said at the beginning of your introduction, a lot of cost is going toward keeping the lights on at the operations center. That’s typically people cost, where people are deployed 24/7, looking at monitors, looking at failures, and then trying to do postmortem on the problem.

This does require a little bit of mind shift from deterministic to probabilistic. The reason for that is that a lot of things have been built to make the operations center do a really good job of cleaning up after an accident, but not a lot of thought has been put into place of what to do if you're forewarned of an accident, before it actually happens.

Gardner: How do I intercede? How do I do something?

Marvasti: How do I intercede? What do I do? What does it mean? For example, one of the outputs from our product is a predictive alert that says, "With 80 percent confidence, this particular problem will occur within the next 15 minutes." Well, nothing has happened yet, so what does my run book say I should do? The run book is missing that information. The run book only has the information on how to clean it up after an accident happens.

That’s the missing piece in the operations arena. Part of the challenge for our company is getting the operations folks to start thinking in a different fashion. You can do it a little at a time. It doesn’t have to be a complete shift in one fell swoop, but it does require that change in mentality. Now that I am actually forewarned about something, how do I prevent it, as opposed to cleaning up after it happens.

Gardner: When we talk about operational efficiency, are we talking about one or two percent here and there? Is this a rounding error? Are we talking about some really wasteful practices that we can address? What’s the typical return on investment that you are pursuing?

Marvasti: It’s not one or two percent. We're talking about a completely different way of managing operations. After a problem occurs, you typically have a lot of people on a bridge call, and then you go through a process of elimination to determine where the problem is coming from, or what might have caused it. Once the specific technology silo has been determined, then they go to the experts for that particular silo to figure out what’s going on. That actually has a lot of time and manpower associated with it.

What we're talking about is being proactive, so that you know something is about to happen, and we can tell you to a certain probability where it’s going to be. Now you have a list of low-hanging fruits to go after, as opposed to just covering everybody in the operations center, trying to get the problem fixed.

The first order of business is, "Okay, this problem is about to occur, and this is where it may occur. So, that’s the guy I’m going to engage first." Basically, you have a way of following down from the most probable to the least probable, and not involving all the people that typically get involved in a bridge call to try to resolve the issues.

One gain is the reduction in mean time to identify where the problem is coming from. The other one is not having all of those people on these calls. This reduces the man-hours associated with root-cause determination and source identification of the problem. In different environments you're going to see different percentages, but in one environment that I saw first hand, one of the largest health-care organizations, it is like 20-30 percent of cost, just associated with people being on bridge calls, on a continuous basis.

Gardner: Now, this notion of "management forensics," can you explain that a little bit?

Marvasti: One of the larger problems in IT is actually getting to the root cause of problems. What do you know? How do you know what the root cause is? Often times, something happens and the necessity of getting the business service back up forces people to reboot the servers and worry later about figuring out what happened. But, when you do that, you lose a lot of information that would have been very helpful in determining what the root cause was.

The forensic side of it is: The data is collected already, so we already know what it is. If you have the state when a problem occurred, that’s a captured environment in the data base that you can always go back to.

What we offer is the ability to walk back in time, without having the server down, while you are doing your investigation. You can bring the server back up, come back to our product, and then walk back in time to see exactly what were the leading indicators to the problems you experienced. Using those leading indicators, you can get to the root causes very quickly. That eliminates the guess work of where to start, reduces the time to get to the root cause, and maybe even prevent it.

Sometimes you only have so much time to work on something. If you can’t solve it by that time, you move on, and then the problem occurs again. That's the forensic side.

Gardner: We talked earlier about this notion of capturing the normal state, and now you've got this opportunity to capture an abnormal state. You can compare and contrast. Is that something that you use on an ongoing basis to come up with these probabilities? Or is the probability analysis something different?

Marvasti: No, that’s very much part and parcel of it. What we do is look to see what is the normal operating state of an environment. Then it is the abnormalities from that normal that become your trigger points of potential issues. Those are your first indicators that there might be a problem growing. We also do a cross-event analysis. That’s another probability analysis that we do. We look at patterns of events, as opposed to a single event, indicating a potential problem. One thing we've found is that events in seemingly unrelated silos are very good indicators of a potential problem that may brew some place else.

Doing that kind of analysis, looking at what’s normal, then abnormal becomes your first indicator. Then, doing a cross-event analysis to see what patterns indicate a particular problem becomes total normal to problem-prevention scenario.

Gardner: There has to be a cause-and-effect. As much as we would like to imagine ghosts in the machines, that’s really not the case. It's simply a matter of tracking it down.

Marvasti: Exactly. The interesting thing is that you may be measuring a specific metric that is a clear indicator of a problem, but it is oftentimes some other metric on another machine that gets to be out of normal first, before the cause of the problem surfaces in the machine in question. So early indicators to a problem become events that occur some place else, and that’s really important to capture.

When I was talking about the cross-silo analysis, that’s the information that it brings out. It gives you lot more "heads-up" time to a potential problem than if you were just looking at a specific silo.

Gardner: Of course, each data center is unique, each company has its own history and legacy, and its IT department has evolved on its own terms. But is there any general crossover analysis? That is to say, is there a way of having an aggregate view of things and predicting things based on some assumptions, because of the particular systems that are in use? Or, is it site by site on a detailed level?

Marvasti: Every customer that I have seen is totally different. We developed our applications specifically to be learning-based, not rules-based. And by rules I mean not having any preconceived notion of what an environment may look like. Because if you have that, and the environment doesn’t look like that, you're going to be sending a lot of false positives -- which we definitely did not want to do.

Ours is a purely learning-based system. That means that we install our product, it starts gathering the metrics, and then it starts learning what your systems look like and behave like. Then based on that behavior it starts formulating the out-of-normal conditions that can lead to problems. That becomes unique to the customer environment. That is an advantage, because when you get something, it actually adapts itself to an environment.

For example, it learns your change management patterns. If you have a change windows occurring, it learns that change window. It knows that those change windows occur without anybody having to enter anything into the application. When you are doing wholesale upgrade of devices, it knows that change is coming about, because it has learned your patterns.

The downside of that is that it does take two to three weeks of gathering your data and learning what has been happening for it to become useful. The good side of it is that you get something that completely maps to your business, as opposed to having to map your business through a product. The downside is that it takes two or three weeks of learning time, before it starts producing some results for you.

Gardner: The name of your product set is Alive, is that correct?

Marvasti: That’s correct.

Gardner: I understand you are going to have a release coming out later this year, Alive 6.0?

Marvasti: That’s correct.

Gardner: I don’t expect you to pre-release, but perhaps you can give us some sense of the direction that the major new offerings within the product set will take. What they are directed toward? Can you give us a sneak peek on that?

Marvasti: Basically, we have three pillars that the product is based on. First is usability. That's a particular pet peeve of mine. I didn't find any of the applications out there very usable. We have spent a lot of time working with customers and working with different operations groups, trying to make sure that our product is actually usable for the people that we are designing for.

The second piece is interoperability. The majority of the organizations that we go to already have a whole bunch of systems, whether it be data collection systems, event management systems, or configuration management databases, etc. Our product absolutely needs to leverage those investments -- and they are leveragable. But even those investments in their silos don’t produce as much benefit to the customer as a product like ours going in there and utilizing all of that data that they have in there, and bringing out the information that’s locked within it.

The third piece is analytics. What we have in the product coming out is scalability to 100,000 servers. We've kind of gone wild on the scalability side, because we are designing for the future. Nobody that I know of right now has that kind of a scale, except maybe Google, but theirs is basically the same thing replicated thousands of times over, which is different than the enterprises we deal with, like banks or health-care organizations.

A single four-processor Xeon box, with Alive installed on it, can run real-time analytics for up to 100,000 devices. That’s the level of scale we're talking about. In terms of analytics, we've got three new pieces coming out, and basically every event we send out is a predictive event. It’s going to tell you this event occurred, and then this other set of events have a certain probability within a certain timeframe to occur.

Not only that, but then we can match it to what we call our "finger printing." Our finger printing is a pattern-matching technology that allows us to look at patterns of events and formulate a particular problem. It indicates particular problems and those become the predictive alerts to other problems.

What’s coming out in the product is really a lot of usability, reporting capabilities, and easier configurations. Tens of thousands of devices can be configured very quickly. We have interoperability -- Tivoli, OpenView, Hyperic -- and an open API that allows you to connect to our product and pump in any kind of data, even if it’s business data.

Our technology is context agnostic. What that means is that it does not have any understanding of applications, databases, etc. You can even put in business-type data and have it correlated with your IT data, and extract information that way as well.

Gardner: You mentioned usability. Who are the typical users and buyers of a product like Integrien Alive? Who is your target audience?

Marvasti: The typical user would be at the operations center. The interesting thing is that we have seen a lot of different users come in after the product is deployed. I've seen database administrators use our product, because they like to see what is normal behavior of their databases. So they run the analytics under database type metrics and get information that way.

I've seen application folks who want to have more visibility in terms of how this particular application is impacting the database. They become users. But the majority of users are going to be at the operations center -- people doing day-to-day event management and who are responsible for reducing the mean time to identify where the problems come from.

The typical buyers are directors of IT operations or VP of IT operations. We are really on the operation side, as opposed to the application development side.

Gardner: Do you suppose that in the future, when we get more deeply into SOA and virtualization, that some of the analysis that is derived through Integrien Alive becomes something that’s fed into a business dashboard, or something that’s used in governance around how services are going to be provisioned, or how service level agreements are going to be met?

Can we extrapolate as to how the dynamics of the data center and then the job of IT itself changes, on how your value might shift as well?

Marvasti: That link between IT and the business is starting to occur. I definitely believe that our product can play a major part in illuminating what in the business side gets impacted by IT. Because we are completely data agnostic, you can put in IT-type data, business-type data, or customer data -- and have all of it be correlated.

You then have one big holistic view as to what may impact what. ... If this happens, what else might happen? If I want to increase this, what are the other parameters that may be impacted?

So, you know what you want to play from the business side in terms of growth. Having that, we project how IT needs to change in order to support that growth. The information is there within the data and the very fact that we are completely data agnostic allows us to do that kind of a multi-function analysis within an enterprise.

Gardner: It sounds like you can move from an operational efficiency value to a business efficiency value pretty quickly?

Marvasti: Absolutely. Our initial target is the operations arena, because of the tremendous amount of inefficiencies there. But as we move into the future, that’s something we are going to look into.

Gardner: We mentioned Alive 6.0. Do you have a ball-park figure on when that’s due? Is it Q4 of 2007?

Marvasti: We are going to come out with it in 2007, and it will be available in Q4.

Gardner: Well, I think that covers it, and we are just about out of time. I want to thank Mazda Marvasti, the CTO of Integrien, for helping us understand more about the notion of management forensics and probabilistic- rather than deterministic-based analysis.

We have been seeking to understand better how to address high costs, and inefficiencies in data centers, as well as managing application performance -- perhaps in quite a different way than many companies have been accustomed to. Is there anything else you would like to add before we end, Mazda?

Marvasti: No, I appreciate your time, Dana, and thank you for your good questions.

Gardner: This is Dana Gardner, principal analyst at Interarbor Solutions. You have been listening to a sponsored BriefingsDirect podcast. Thanks, and come back next time.

Listen to the podcast here. Sponsor: Integrien Corp.

Transcript of BriefingsDirect podcast on systems management efficiencies and analytics. Copyright Interarbor Solutions, LLC, 2005-2007. All rights reserved.