Transcript of BriefingsDirect podcast on Agile Development principles and practices with Borland Software.
Listen to the podcast. Sponsor: Borland Software.
Dana Gardner: Hi, this is Dana Gardner, principal analyst at Interarbor Solutions, and you’re listening to BriefingsDirect. Today we present a sponsored podcast discussion about Agile software development.
We're going to be talking to a software executive from Borland Software about Borland's own Agile “journey.” They deployed Agile practices and enjoyed benefits from that, as well as gained many lessons learned, as they built out their latest application lifecycle management (ALM) products. [See product and solution rundowns.]
We're going to talk with Pete Morowski, the senior vice president of research and development (R&D) at Borland Software. Welcome to the show, Pete.
Peter Morowski: Thank you, Dana. It's good to be here.
Gardner: Before we get into Borland Software's journey, I want to level-set about Agile development practices in general. Why is Agile development a good idea now? What is it about the atmosphere in the evolution of development that makes this timely?
Morowski: From the standpoint of software development, it's a realization that development is an empirical process, a process of discovery. Look at the late delivery cycles that traditional waterfall methodologies have brought upon us. Products have been delivered and end up on the shelf. The principles behind Agile right now allow teams to deliver on a much more frequent cycle and also to deliver more focused releases.
Gardner: There are also, I suppose, technical and business drivers: better quality, faster turnaround, more complexity, and, of course, distributed teams. What is it about the combination? Why is this important now in terms of some of these other technical business and even economic imperatives?
Morowski: With the advent of Web applications, businesses really expect a quicker turnaround time. In addition, when you look at cost structures, the time spent on features that go unused is a critical business inhibitor at this point.
Gardner: Let's help out some folks who might not be that familiar with Agile and its associated process called Scrum. Tell us a little bit from an elevator-pitch perspective. What is Agile and what is Scrum?
Morowski: Agile really is a set of principles, and these principles are based on things like self-directed teams, using working code as a measure of progress, and also looking at software development in terms of iteration. What we mean by that is that when you look at traditional software development, we talked about things like design, code, and testing as actual phases in a development lifecycle. Within Agile, in an iteration, these are just activities that occur in each iteration.
Now, when you talk about Scrum, that is more of a process and a methodology. This is actually taking those Agile principles and then being more prescriptive on how to apply them to a software-development cycle.
In the case of Scrum, it's based upon a concept called a sprint, which is a two-to-four week iteration that the team plans for and then executes. In that two-to-four weeks, whatever they get done is considered completed during that sprint, and whatever work hasn't been completed goes into what they call the "product backlog" for prioritization on what is done in the next sprint. You chain several of these iterations together for a release.
The beauty of this is that now you have a way to induce change on the borders of those iterations. So, one of the things that's really advantageous about Agile is its ability to adapt to changing requirements.
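[To make the sprint mechanics Pete describes concrete, here is a minimal sketch in Python. It is an illustration only, not Borland's tooling; the story names, point values, and capacity figure are all made up.]

```python
# Minimal sketch of sprint/backlog mechanics: the team pulls the
# highest-priority stories that fit its capacity; whatever is not
# completed returns to the product backlog for reprioritization.
from dataclasses import dataclass, field

@dataclass
class Story:
    title: str
    points: int       # estimated effort
    priority: int     # lower number = higher priority
    done: bool = False

@dataclass
class ProductBacklog:
    stories: list = field(default_factory=list)

    def add(self, story):
        self.stories.append(story)
        self.stories.sort(key=lambda s: s.priority)  # reprioritize at boundaries

def run_sprint(backlog, capacity):
    """Run one two-to-four week iteration against the team's capacity."""
    committed, remaining = [], []
    for story in backlog.stories:
        if story.points <= capacity:
            capacity -= story.points
            story.done = True            # assume committed work completes
            committed.append(story)
        else:
            remaining.append(story)      # goes back into the backlog
    backlog.stories = remaining
    return committed

backlog = ProductBacklog()
for title, points, priority in [("login", 5, 1), ("search", 8, 2), ("reports", 13, 3)]:
    backlog.add(Story(title, points, priority))
done = run_sprint(backlog, capacity=10)
print([s.title for s in done], "->", [s.title for s in backlog.stories])
```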
Gardner: When I try to explain Agile to people, some of them come away thinking that it's an oxymoron or is conflicted because they say, "Okay, your goal is to do things better and faster, but you are telling people use fewer rules, use less structure, and have your teams be self-selecting." People see a conflict here. Why isn't that a conflict?
Morowski: I think it's a misconception that self-directed teams and that type of thing mean that we can do whatever we want. What it's really about is that teams begin to take ownership of delivering the product. What happens is that, by allowing these teams to become self-directed, they own the schedule for delivery.
What happens is that you see some things like traditional breakdowns of roles, where they are looking at what work needs to be finished in a sprint, versus "Well, I am a developer. I don't do testing," or "I am a doc writer, and I can't contribute on requirements," and those types of things. It really builds a team, which makes it a much more efficient use of resources and processes, and you end up with better results than you do in a traditional methodology.
Gardner: It almost sounds like we're using market forces, whereby entrepreneurs or small startups tend to be more energized and focused than teams within a larger, centralized organization. Is that a fair characterization?
Morowski: Yeah, I think it is very fair.
Gardner: And, given that we're looking for this empirical learn-as-you-go, do what's right for you, I suppose that also means that one size does not fit all. So, Agile would probably look very different from organization to organization.
Morowski: It could. One thing we chose to do, though, was really to set a benchmark process. So, when Borland first started developing in Agile, we had multiple locations, and each site was, in essence, developing its own culture around Agile. What I found was that we were getting into discussions about whose Agile was more pure and things like that, and so I decided to develop a Borland Agile culture. [See case study on Borland and Agile.]
We broke that up on a geographic basis, where we started with one site, had one "ScrumMaster," and built what we call the reference process. As we've grown, and our projects have gotten more complex, the fact that we evolved from site to site based on the same process and the same terminology has allowed us to choose more complex Agile techniques, like Scrum of Scrums, to work across organizations, and to have a common vocabulary and a common way of working.
Gardner: It also sounds like you are taking the best of what a centralized approach offers and the best of what a decentralized approach offers, in terms of incentive; take charge, and local ownership, and then making them co-exist.
Morowski: That's correct.
Gardner: All right, let's get specifically into Borland's situation. What is it about the way that Borland has been developing software, which is of course a core competency for a large independent software vendor (ISV) like yourselves, and it has been for 15-plus years … How difficult was it for you to come into this established organization and shake things up?
Morowski: Initially, it wasn't an issue because, like most organizations, when we went through and looked at it, there were a couple of grassroots efforts underway. From an Agile perspective, one of the things we did was to begin to leverage that activity and the successes that it had to use as a benchmark with other teams. As we grew and moved into other organizations that were not necessarily grassroots efforts, there were some challenges.
Gardner: So, it might be quite possible that a lot of organizations that do development have people who are Agile-minded, and perhaps even followers of Agile doing this already. Perhaps they should look for those and start there.
Morowski: I would recommend that you start with your grassroots efforts, establish your benchmark process, and then begin to move out from there.
One thing we clearly did was, once we saw the benefits of doing this, we put a lot of executive sponsorship around it. I made it one of the goals for the year to expand our use of Agile within the organization, so that teams knew it was safe to go ahead and begin to look at it. In addition, because we had a reference implementation of it, it also gave teams a starting point to begin their experimentation. We also paid for our teams to undergo training and those types of things. We created an environment that encouraged transformation.
Gardner: Let's learn a little bit more about you, Pete. Tell us a little bit about your background and how you came into development and then into Agile?
Morowski: I've been in this business a little over 25 years now. I started in the defense and aerospace industries and then moved into commercial ISVs later in my career. I've been an executive at Novell. I've also been a CTO at IBM Tivoli, and prior to Borland, was the vice president of software at Dell.
Gardner: You've taken on this Agile project at Borland, and you've written a paper on the "Borland Agile Journey." I've had the pleasure of reading it. I think it's a really nice read, and I commend you for it.
Morowski: Oh, thank you.
Gardner: Tell us about this particular product set [Borland Software Delivery Management information] that Borland is coming out with. It's a product set about helping people develop software. Is there a commonality between some of the lessons you learned and then what you may have actually visited in terms of requirements for your products? [See demo and see launch video.]
Morowski: Oh, absolutely. One of the interesting things about the products that we are delivering is that one of them is a product for managing Agile development, especially in distributed teams and managing the requirements. So, we had the advantage of actually using the tools as we were developing them.
Now, we were also very cautious, because you can get myopic about that type of thing. But we were also using Agile principles, and we involved our customers in the process as well. So we were getting the best of both worlds.
Gardner: What makes software development different? In reading your paper, I was thinking about how these principles about self-empowerment and working quickly and then setting these boundaries -- "Okay, we're going to just work and do this for three weeks and then we'll revisit any changes" -- might apply to almost any creative activity where a team is involved.
Is Agile something you think applies to any creative activity, a complex team-based activity, or is there something about it that really is specific and germane to software development?
Morowski: If you look at Agile principles, conceptually, they do apply to a lot of things. Anything in which you are going into a period of discovery, one of the key things is knowing what your goal or mission is. In the case of software, that's a requirement, and what you want the product to be.
But in any kind of empirically based endeavor, this would be something that you could apply. Now, when you get down to the actual Scrum process itself, the terminology, the measures, the metrics, and all those types of things are really tailored for software development.
Gardner: When I read your paper, I also came away with some interesting observations. You say there is a difference between how development is supposed to work and how it actually works. It sounds like many companies are living in denial, or with a certain level of dysfunction, that they are not necessarily facing.
Morowski: It's one of the issues with laying a manufacturing process over something that's inherently an empirical process. In the end, all software R&D organizations or IT shops responsible for applications are responsible to the business for delivering results. And, in doing so, we all try to measure those things.
What I have observed over my career was the fact that there really existed two worlds. There is what I call the "management plane," and this is a plane of milestones, of phase transitions and a very orderly process and progress through the software development lifecycle.
Underneath it though, in reality, is a world of chaos. It's a world of rework, a world of discovery, in which the engineers, testers, and frontline managers live. We traditionally use Gantt charts, a task-based measure. That requires a translation from the implementation world to the management world to show indications of progress. Any time you do a translation, there can be a loss of information, and that's why today software is such an experience-based endeavor.
Gardner: And it's often been perceived as sort of a dark art. People don't appreciate or understand how it's done, and that those who do it should say, "Hey, leave me alone, get away from me. I'll come back with the results in three months."
Morowski: Exactly.
Gardner: But that doesn't necessarily or hasn't historically been the best approach.
Morowski: Absolutely not.
Gardner: Also, at times, you see them downplay process and say that doing good hiring probably is the biggest issue here. What's the relationship between hiring and what people, not always affectionately, refer to as human resources? What's the relationship between HR and Agile?
Morowski: Well, first of all, just getting back a little bit to the hiring thing. Hiring is important, regardless of what methodology you use, and I tend to stress that. I do contend there are different kinds of personalities and skill sets you are going to be looking for when you are building Agile teams, and it's important to highlight those.
It's very important that whoever comes onboard an Agile team is collaborative in nature. Coming from traditional software environments, there are two roles you may struggle with, and you have to look at them closely. One is the manager. If a manager is a micromanager type, that's not going to work in an Agile environment.
And, the other one, interestingly enough, is the chief architect role. What's interesting about that is that you would think it would fit into Agile very easily, but in a lot of traditional software organizations, all decisions of a technical nature on a project go through the chief architect. In an Agile world, it's much more collaborative, and everybody contributes. So, for some personalities, this would be a difficult change.
Gardner: So there is that grassroots element, and you have to be open to it.
Morowski: Right.
Gardner: What is it about the structures here? Again, for folks who might not be that familiar with Agile, tell us a little bit about some of the hierarchy.
Morowski: There are really two key roles. There is the ScrumMaster and the ScrumMaster runs what they call the daily stand-up. This is basically a meeting, where everybody on the team gets together on a daily basis and they answer three questions. "What did I get accomplished yesterday?" "What am I going to do today?" And "What's blocking me?"
Everybody goes around the room. It's a 15-minute meeting. You solve any particular problems, but you log things. The role of the ScrumMaster is to run that meeting and to remove blocks for the team, and it's a very key role.
The second major role within Scrum is the product owner, and this is the individual responsible for prioritizing the requirements, or what we call the product backlog -- what is going to be done during the sprint, which features are going to be completed. Those are the two primary roles, and from there everybody else is pretty much a team member.
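[A minimal sketch of the daily stand-up loop Pete describes, again in Python and purely illustrative; the team members, updates, and dictionary shape are assumptions, not Borland's practice or product.]

```python
# Each team member answers the three stand-up questions; the ScrumMaster
# collects the blockers and works to remove them after the meeting.
def daily_standup(team_updates):
    blockers = []
    for member, update in team_updates.items():
        print(f"{member}: yesterday={update['yesterday']}, today={update['today']}")
        if update.get("blocked_by"):
            blockers.append((member, update["blocked_by"]))
    return blockers

updates = {
    "dev_a":    {"yesterday": "API endpoint", "today": "unit tests",
                 "blocked_by": None},
    "tester_b": {"yesterday": "test plan", "today": "regression run",
                 "blocked_by": "staging environment down"},
}
for member, blocker in daily_standup(updates):
    print(f"ScrumMaster to unblock {member}: {blocker}")
```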
Gardner: When you decided to bring this into play at Borland, a very large, distributed organization, you didn't try to bite off too much. You didn't say, "We are going to transform the entire company and organization." You did this on more of an iterative basis. It seems that most people, when they do Agile, will probably follow a similar path. They'll start on a project basis and then say, "Now we need to expand this and make it holistic."
Many organizations, however, across all kinds of different management activities, can stumble at that transition from the project, or the tactical, into the holistic, or general, across an organization. What did you learn in making this transition from small to large scale at Borland?
Morowski: A couple of things. One is that, as we rolled it out site by site, we grew from team to team. The ScrumMasters worked very collaboratively to help each other out, because, in the end, they were responsible for delivering at the end of those sprints. That was a very positive effect.
As we moved out to distributed teams, there were a number of challenges -- things like the daily stand-up, when I have people in Singapore supporting a particular sprint, say, from the system-testing standpoint. That made things difficult. But what I found is that the team was pretty creative in involving those individuals, whether they recorded sprints or shifted time zones, and they did this all on their own.
That was the absolute positive, one of the things that surprised me. It was an interesting discovery.
As we started to interact more broadly with the non-Agile parts of the organization, this was a little bit more of a challenge, and I learned a couple of things. In doing any kind of outsourcing, if you try to match a traditional, contractual-based -- statement of work (SOW)-type -- outsourcer with an Agile team, that's going to present problems. The outsourcer is expecting very detailed specifications as a statement of work, and that's just not produced during an Agile or Scrum-type development activity.
The other thing is internal, at the beginning and the end of the pipe: working with marketing and our new-product-introduction processes, and with support and getting sales out. One of the things we found is that we started to have the capacity to release more often, but the organization as a whole had to adjust: first, to provide market requirements to us in a different manner, and second, to change our process at the end to be able to accept more rapid releases.
Gardner: So, in order to get the most out of Agile, it sounds like, for those organizations where software development is a core competency, important to their success as a company, a government organization, or a public not-for-profit, the edges of Agile start to blend into other departments. The whole organization can perhaps borrow some of these principles from development and extend them into the entire lifecycle.
Morowski: Yes, we no longer look at it as strictly an R&D thing anymore, just because of that. And, it's interesting. You know you are making progress from a development team perspective, when you are starting to output more than the organization can accept.
Gardner: Interesting. So, adjustments along the way, and that's again a principle of the approach.
All right. In your Agile journey, you came away with three basic observations about the benefits. One was around self-directing teams; the second, around being able to manage change well; and the third, about the relationship with the customer -- in this case, the customer being the folks who are interested in getting the software. Tell us about these three benefits and what you have learned.
Morowski: Well, we touched on the self-directing teams. The key, and one of the most important things as an executive, is that you really have to take the lead and let your teams go and develop -- let them truly own their projects. There will be mistakes along the way, but once they take that ownership, it's an extremely powerful concept.
One of the great things about Agile is that it's a very open and very visible methodology. I can attend any daily stand-up and sit there and listen to what's going on. I can't contribute in those meetings, because they're run by the ScrumMaster. But at one of the daily stand-ups I attended, I knew the teams had progressed a great deal.
They were looking at the remaining work backlog they had for that particular sprint, and there were a couple of tests that needed to be run that nobody was assigned to. One of the developers had time, looked at that, and picked it up.
Now, normally, that would never happen, because we behave in a silo fashion. "I am an engineer." "I am a tester." It's an "I am a …" type of thing. But, when you really have a self-directing team, the team owns that schedule and it's very interested in making sure that they meet their commitments.
Gardner: I suppose that also fosters a willingness of people to move in and out of roles, without just saying, "Well, that's not my job …", and to take more group responsibility, even as individuals.
Morowski: Absolutely correct, and that to me has been one of the more powerful things that I have personally observed.
Gardner: Change management has often been something that drives developers crazy. They hate when people come in and start changing requirements when they are in the middle of doing code or test. On the other hand, things don't stay the same, and change is part of everything in life and business, perhaps more so today than ever. How do you reconcile those two?
Morowski: Well, I think the reality is that there is going to be change during these development cycles, and so the question is, what's the best way to handle it? If you look at a traditional waterfall methodology, you march along phase transitions. Even if you have iteration in place, if you discover a design or coding defect late in the game, you have to go backward to a different phase and start going into the design or fixing the code. Then, you repeat the process again, and you continue to move along your phase-transition line.
The thing that's interesting is that with Agile you have an orderly way of injecting change. In other words, as a sprint completes and you've demonstrated the code -- and you demonstrate it after that three-week iteration -- if something has changed and you need to change the prioritization, you have a way to inject that change along that boundary, and then let the team go forward. That's what I always like to say, "We're always going forward in Agile."
Gardner: And how do the teams adjust to that?
Morowski: It's part of the process. The changes go into the backlog. The product owner looks at them and then prioritizes it based upon the complexity of the work and the timing and so on and so forth, and just how important that is. If it's important enough, it will go into the next iteration. The teams are used to doing that, because you are not, in essence, disrupting at a random point. They have already finished what work they were working on, and now there is a cleaner opportunity to inject that change.
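[Here is a small sketch of that boundary discipline: changes queue up while a sprint runs and are merged and reprioritized only when the iteration ends. The class and item names are hypothetical, used only to illustrate the idea.]

```python
# Changes arriving mid-sprint wait in a queue; the product owner merges
# and reprioritizes them into the backlog only at the sprint boundary.
class ChangeQueue:
    def __init__(self):
        self.pending = []

    def request(self, change, priority):
        self.pending.append((priority, change))   # no mid-sprint disruption

    def merge_at_boundary(self, backlog):
        backlog.extend(self.pending)
        backlog.sort(key=lambda item: item[0])    # product owner reprioritizes
        self.pending.clear()
        return backlog

queue = ChangeQueue()
backlog = [(2, "search"), (3, "reports")]
queue.request("new compliance rule", priority=1)  # arrives mid-sprint
# ... the sprint runs undisturbed ...
print(queue.merge_at_boundary(backlog))  # change enters the next iteration
```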
Gardner: So, boundaries allow for those who want change to get it done without having to wait for a particularly long period of time or until the project is done. But, for those involved in the project, they have these sections where it's not going to become chaotic and they are not going to lose track of their overall process, because of this injection of change.
Morowski: No, as a matter of fact, the process encourages it.
Gardner: How about what you call customer relationships? It sounds to me as though it's just being transparent.
Morowski: It is. It's a different approach, in the sense that you are actually bringing in the customer as what I would call a partner in the development. They participate in sprint reviews, which come at the end of a sprint, where you show the working code and what you have completed. Those are done on an every-three-week basis, and we involve our customers.
They also take early drops of the code and provide input into the product backlog on requests that they want, and things like that. It's proven to be very beneficial for us. The one thing is that, when you choose these customers to participate, it's important for them to be Agile as well and to understand that. They need to approach this as a partnership, not just an opportunity to get their particular features or requirements in.
Gardner: And, that must also help keep expectations in line, right?
Morowski: Absolutely. What I have found is that the customers we have involved want to get used to our cycles and our delivery rhythm. They are less adamant about getting every feature on a list in a particular release, because they know it's a relatively short time before the next one comes around.
Gardner: When we describe these customers, would that, in many organizations, include bringing in the marketing people and the salespeople? Can they get involved, so that this becomes something that enters the market as an Agile activity, rather than having Agile happen on the development side and then fall back into a waterfall mentality when it comes to the go-to-market activities?
Morowski: Yes, we do, and the transparency that's there actually helps build confidence in the rest of the organization on what we are delivering, because they see it as we progress along. It's not something that mysteriously shows up on their doorstep.
Gardner: It certainly sounds great in theory, and it sounds like you've been able to accomplish quite a bit in practice, but what about metrics of success? How have you been able to say, "it works?" Has Borland cut their cost, their time to development? Do they have better products? All of the above? How do we know we are succeeding?
Morowski: I'd say it's a combination of all of the above. The first thing is that, by putting these teams together, they are much smaller teams than in traditional organizations. So, if you look at it, my teams are almost 30 percent smaller on the Agile side than they are on the traditional side.
Gardner: And what's accounting for that change?
Morowski: I think one is the ownership by the teams, and two, the breakdown of very specific roles.
Gardner: Would I be going out on a limb in saying you have eliminated the middle management factor?
Morowski: There is absolutely that as well. The other thing is the fact that we're delivering working code and involving customers. We are developing fewer superfluous features. When a product goes out the door, it generally has the most important features intended for that release. So, it really helps the prioritization process.
Gardner: Not too many cooks in the kitchen?
Morowski: Exactly.
Gardner: Cool! Tell us a little bit about what surprised you the most about this Agile journey of Borland.
Morowski: I think the power of the daily stand-up. I mean, yes, we got a lot of benefits, and yes, we had a number of successes -- we were able to transition code between locations and things like that -- but I owe a lot of that to the daily stand-up. The thing that surprised me is how powerful it is each morning when everybody gets around the table and actually goes through what they've done, basically saying, "Have I lived up to my commitments? What am I committing to the team today? And is there anything blocking me?"
Generally speaking, a lot of developers tend to be quiet and not the most social. What this did is that it no longer allowed just the few people who were social to dominate the input on what was going on. The daily stand-up had everybody on the team contributing, and it really changed the relationships in the team. It was just a very, very powerful thing.
Gardner: It sounds like balance among personality types, but that balance directed toward the real activity that is developing code.
Morowski: Absolutely.
Gardner: Interesting! Well, congratulations. I enjoyed reading your paper, and this certainly sounds like the future of development, I know that's what many people in the business think. We've been talking about Agile development practices and principles and how Borland Software has been undertaking an Agile journey itself, in a development project around development process tools and application lifecycle management products.
Back to those products. Is there anything about the synergy between doing it this way and then presenting products into the field that you think will help other people engage with Agile benefits?
Morowski: Are you talking about the products themselves?
Gardner: Yes.
Morowski: The products themselves, absolutely. We have a product coming out called Team Analytics. The key to this is that, while we talked about self-directed teams, we still have responsibilities to reporting to the business and how we are progressing.
Team Analytics gives us a view into the process, gives us the ability to go ahead and look at how the team is progressing, and those types of things, what features have been included or dropped, without having to go into the team and request that information. So that's a very powerful thing.
Gardner: Right. So, it's one thing to agree that visibility and transparency are good, but it's another to actually accomplish it in terms of complexity in large teams and hierarchy.
Morowski: Absolutely. This allows us to move from what I call a "reported" to a "monitored" methodology of metrics. What I mean by that is that, typically, at the senior vice president or vice president level, you really get to look at the state of your products once a month, in the sense that you have operations reviews, or some kind of review cycle, where all your teams come in and report the progress of what's going on.
With Team Analytics, you are able to actually look at that on a daily basis and see if anything’s changed over time. That way, you know where you need to spend your time and that's why we call it monitored, at this point.
Gardner: Super! Well, thank you for sharing your insights. I think there is a lot to be taken away here and learned.
We have been talking with Pete Morowski, the senior vice president of research and development for Borland Software. We were looking at Agile principles in the context of Borland's Agile journey.
Thanks, Pete.
Morowski: Thank you, Dana.
Gardner: This is Dana Gardner, principal analyst at Interarbor Solutions, and you’ve been listening to a sponsored BriefingsDirect podcast.
Thanks for joining us and come back next time.
Listen to the podcast. Sponsor: Borland Software.
Transcript of BriefingsDirect podcast on Agile development principles with Borland Software. Copyright Interarbor Solutions, LLC, 2005-2008. All rights reserved.
Probabilistic Analysis Predicts IT Systems Problems Before Costly Application Outages
Edited transcript of BriefingsDirect[TM] podcast on probabilistic IT systems analysis and management, recorded Aug. 16, 2007.
Listen to the podcast here. Sponsor: Integrien Corp.
Dana Gardner: Hi, this is Dana Gardner, principal analyst at Interarbor Solutions, and you're listening to BriefingsDirect. Today our sponsored podcast focuses on the operational integrity of data centers, the high cost of IT operations, and the extremely high cost of application downtime and performance degradation.
Rather than losing control to ever-increasing complexity, and gaining less and less insight into the root causes of problematic applications and services, enterprises and on-demand application providers alike need to predict how systems will behave under a variety of conditions.
By adding real-time analytics to their systems management practices, operators can fully determine the normal state of how systems should be performing. Then, by measuring the characteristics of systems under many conditions over time, datacenter administrators and systems operators gain the ability to predict and prevent threats to the performance of their applications and services.
As a result they can stay ahead of complexity, and contain the costs of ongoing high-performance applications delivery.
This ability to maintain operational integrity through predictive analytics means IT operators can significantly reduce costs while delivering high levels of service.
Here to help us understand how to manage complexity by leveraging probabilistic systems management and remediation, we are joined by Mazda Marvasti, the CTO of Integrien Corp. Welcome to the show, Mazda.
Mazda Marvasti: Thank you, Dana.
Gardner: Why don’t we take a look at the problem set? Most people are now aware that their IT budgets are strained just by ongoing costs. Whether they are in a medium-sized company, large enterprise, or service-hosting environment, some 70 percent to 80 percent of budgets are going to ongoing operations.
That leaves very little left over for discretionary spending. If you have constant change or a dynamic environment, you're left without many resources to tap in order to meet a dynamic market shift. Can you explain how we got to this position? Why are we spending so much just to keep our heads above water in IT?
Marvasti: When we started in the IT environment, if you remember the mainframe days, it was pretty well defined. You had a couple of big boxes. They ran a couple of large applications. It was well understood. You could collect some data from it, so you knew what was going on within it.
We graduated to the client-server era, where we had more flexibility in terms of deployment -- but with that came increasing complexity. Then we moved ahead to n-tier Web applications, and we had yet another increase in complexity. A lot of products came in to try to alleviate that complexity for deep-data collection. And management systems grew out, covering an entire enterprise for data collection, but the complexity was still there.
Now, with service-oriented architecture (SOA) and virtualization moving into application-development and data-center automation, there is a tremendous amount of complexity in the operations arena. You can’t have the people who used to have the "tribal knowledge" in their head determining where the problems are coming from or what the issues are.
The problems and the complexity have gone beyond the capability of people just sitting there in front of screens of data, trying to make sense out of it. So, as we gained efficiency in application development, we needed consistency of performance and availability, but all of this added to the complexity of managing the data center.
That’s how the evolution of the data center went from being totally deterministic -- meaning that you knew every variable, could measure it, and had very specific rules telling you, if certain things happened, what they were and what they meant -- all the way to a non-deterministic era, which we are in right now.
Now, you can't possibly know all the variables, and the rules that you come up with today may be invalid tomorrow, all just because of change that has gone on in your environment. So, you cannot use the same techniques that you used 10 or 15 years ago to manage your operations today. Yet that’s what the current tools are doing. They are just more of the same, and that’s not meeting the requirements of the operations center anymore.
Gardner: At the same time, we are seeing that a company’s applications are increasingly the primary way that it reaches out to its sell side, to customers -- as well as its buy side, to its supply chain, its partners, and ecology. So applications are growing more important. The environment is growing more complex, and the ability to know what’s going on is completely out of hand.
Marvasti: That’s right. You used to know exactly where your application was, what systems it touched, and what it was doing. Now, because of the demand of the customers and the demands of the business to develop applications more rapidly, you’ve gone into an SOA era or an n-tier application era, where you have a lot of reusability of components for faster development and better quality of applications -- but also a lot more complexity in the operations arena.
What that has led to is that you no longer even know in a deterministic fashion where your applications might be touching or into what arenas they might be going. There's no sense of, "This is it. These are the bounds of my application." Now it’s getting cloudier, especially with SOA coming in.
Gardner: We’ve seen some attempts in the conventional management space to keep up with this. We’ve been generating more agents, putting in more sniffers, applying different kinds of management. And yet we still seem to be suffering the problems. What do you think is the next step in terms of coming to grips with this -- perhaps on a holistic basis -- so that we can get as much of the picture as possible?
Marvasti: The "business service" is the thing that the organization offers to its customers. It runs through their data center, IT operations, and the business center. It goes across multiple technology layers and stacks. So having data collection at a specific stack or for a specific technology silo, in and of itself, is insufficient to tell you the problems with the business service, which is what you are ultimately trying to get to. You really need to do a holistic analysis of the data from all of the silos that the business service runs through.
You may have some networking silos, where people are using specific tools to do network management -- and that’s perfectly fine. But then the business service may go through some Web tier, application tier, database tier, or storage -- and then all of those devices may be virtualized. There may be some calls to a SOA.
There are deep-dive tools to collect data and report upon specifics of what may be going on within silos, but you really need to do an analysis across all the silos to tell you where the problems of the business service may be coming from. The interesting thing is that there is a lot of information locked into these metrics. Once correlated across the silos, they paint a pretty good picture as to the impending problem or the root cause of what a problem may be.
By looking at individual metrics collected via silos you don’t get as full a picture as if you were to correlate that individual metric with another metric in another silo. That paints a much larger picture as to what may be going on within your business service.
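[A tiny sketch of cross-silo correlation in Python, using made-up metric samples from a database silo and a web tier; real products analyze far more than a single correlation coefficient. Note that statistics.correlation requires Python 3.10 or later.]

```python
# If a database-silo metric and a web-tier metric move together, the two
# silos should be investigated jointly rather than one at a time.
from statistics import correlation  # Python 3.10+

# Hypothetical metrics sampled at the same minutes in two silos.
db_lock_waits  = [2, 3, 2, 8, 15, 4, 2]                 # database silo
web_latency_ms = [110, 120, 115, 300, 480, 160, 112]    # web-tier silo

r = correlation(db_lock_waits, web_latency_ms)
if abs(r) > 0.8:
    print(f"silos move together (r={r:.2f}): correlate, don't isolate")
```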
Gardner: So if we want to gather insights and even predictability into the business service level -- that higher abstraction of what is productive -- we need to go in and mine this data in this context. But it seems to me that it’s in many different formats. From that "Tower of Babel" how do you create a unified view? Or you are creating metadata? What’s the secret sauce that gets you from raw data to analysis?
Marvasti: One misperception is that, "I need to have every metric that I collect go into a magical box that then tells me everything I need to know." In fact, you don't need to have every metric. There is much information locked in the correlations between metrics. We've seen at our customers that a gap in monitoring in one silo can often be compensated for by data collection in other silos.
So, if you have a monitoring system already -- IBM Tivoli, as an example -- and you are collecting operating-system metrics, you may have one or two other application-specific metrics that you are also collecting. That may be enough to tell you everything that is going on within your business service. You don't need to go to the nth degree of data collection and harmonization of that data into one data repository to get a clear picture.
Even starting with what you’ve got now, without having to go very deep, what we’ve seen in our customers is that it actually lights up a pretty good volume of information in terms of what may be going on across the silos. They couldn't achieve that by just looking at individual metrics.
Gardner: It’s a matter of getting to the right information that’s going to tell you the most across the context of a service?
Marvasti: To a certain degree, but a lot of times you don’t even know what the right metrics are. Basically I go to our customers and say, "What do you have?" Let’s just start with that, and then the application will determine whether you actually have gaps in your monitoring or whether these metrics that you are collecting are the right ones to solve those specific problems.
If not, we can figure out where the gaps may be. A lot of times, customers don’t even know what the right metrics are. And that’s one of the mental shifts of thinking deterministically versus probabilistically.
Deterministically is, "What are the right metrics that I need to collect to be able to identify this problem?" In fact, what we’ve found out is that a particular problem in a business service can be modeled by a group or a set of metric event conditions that are seemingly unrelated to that problem, but are pretty good indicators of the occurrence of that problem.
When we start with what they have, we often point out that there is a lot more information within that data set. They don’t really need to ask, "Do I have the right metrics or not?"
Gardner: Once you’ve established a pretty good sense of the right metrics and the right data, then I suppose you need to apply the right analysis and tools. Maybe this would be a good time for you to explain about the heritage of Integrien, how it came about, and how you get from this deterministic to more probabilistic or probability-oriented approach?
Marvasti: I've been working on these types of problems for the past 18 years. Since graduate school, I've been analyzing data and extracting information from disparate sources. I went to work for Ford and General Motors -- really large environments. Back then, it was client-server, and how those environments were being managed. I could see the impending complexity, because I saw the level of pressure there was on application developers to develop more reusable code and to develop faster with higher quality.
All that led to the Web application era. Back then, I was the CTO of a company called LowerMyBills.com here in the Los Angeles area. One problem I had was that I had only a few people with the tribal knowledge to manage and run the systems, and that was very scary to me. I couldn't rely on those few people to keep a continuous business going.
So I started looking at management systems, because I thought it was probably a solved problem. I looked at a lot of management tools out there and saw that it was mainly centered on data collection, manual rule writing, and better ways of presenting the same data over and over.
I didn’t see any way of doing a deep analysis of the data to bring out insights. That’s when I and my business partner Al Eisaian, who is our CEO, formed a company to specifically attack this problem. That was in 2001. We spent a couple of years developing the product, got our first set of customers in 2003, and really started proving the model.
One of the interesting things is that if you have a small environment, your tendency is to think that it's small enough that you can manage it, and that actually may be true. You develop some specific technical knowledge about your systems, and you can move from there. But in larger environments, where there is so much change happening, it becomes impossible to manage it that way.
A product like ours almost becomes a necessity, because we’ve transitioned from people knowing in their heads what to do, to not being able to comprehend all of the things happening in the data center. The technology we developed was meant to address this problem of not being able to make sense of the data coming through, so that you could make an intelligent decision about problems occurring in the environment.
Gardner: Clearly a tacit knowledge approach is not sufficient, and just throwing more people at it is not going to solve the problem. What’s the next step? How do we get to a position where we can gather and then analyze data in such a way that we get to that Holy Grail, which is predictive, rather than reactive, response.
Marvasti: Obviously, the first step is collecting the data. Without the data, you can’t really do much. A lot of investment has already gone into data collection mechanisms, be it agent-based or agent-less. So there is data being collected right now.
The missing piece is the utilization of that data and the extraction of information from that data. Right now, as you said at the beginning of your introduction, a lot of cost is going toward keeping the lights on at the operations center. That’s typically people cost, where people are deployed 24/7, looking at monitors, looking at failures, and then trying to do postmortem on the problem.
This does require a little bit of mind shift from deterministic to probabilistic. The reason for that is that a lot of things have been built to make the operations center do a really good job of cleaning up after an accident, but not a lot of thought has been put into place of what to do if you're forewarned of an accident, before it actually happens.
Gardner: How do I intercede? How do I do something?
Marvasti: How do I intercede? What do I do? What does it mean? For example, one of the outputs from our product is a predictive alert that says, "With 80 percent confidence, this particular problem will occur within the next 15 minutes." Well, nothing has happened yet, so what does my run book say I should do? The run book is missing that information. The run book only has the information on how to clean it up after an accident happens.
That’s the missing piece in the operations arena. Part of the challenge for our company is getting the operations folks to start thinking in a different fashion. You can do it a little at a time. It doesn’t have to be a complete shift in one fell swoop, but it does require that change in mentality. Now that I am actually forewarned about something, how do I prevent it, as opposed to cleaning up after it happens.
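[To illustrate what such a predictive alert might look like as a data structure, here is a hedged sketch; the field names and thresholds are invented for illustration and are not Integrien Alive's actual schema.]

```python
# A predictive alert carries a probability and a time horizon, so the
# run book can key off the forewarning rather than off the failure.
from dataclasses import dataclass

@dataclass
class PredictiveAlert:
    problem: str
    confidence: float      # probability the problem occurs
    horizon_minutes: int   # predicted window

    def actionable(self, threshold=0.75):
        return self.confidence >= threshold

alert = PredictiveAlert("order-service degradation", 0.80, 15)
if alert.actionable():
    print(f"Act now: {alert.problem} likely within "
          f"{alert.horizon_minutes} min ({alert.confidence:.0%})")
```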
Gardner: When we talk about operational efficiency, are we talking about one or two percent here and there? Is this a rounding error? Are we talking about some really wasteful practices that we can address? What’s the typical return on investment that you are pursuing?
Marvasti: It’s not one or two percent. We're talking about a completely different way of managing operations. After a problem occurs, you typically have a lot of people on a bridge call, and then you go through a process of elimination to determine where the problem is coming from, or what might have caused it. Once the specific technology silo has been determined, then they go to the experts for that particular silo to figure out what’s going on. That actually has a lot of time and manpower associated with it.
What we're talking about is being proactive, so that you know something is about to happen, and we can tell you with a certain probability where it's going to be. Now you have a list of low-hanging fruit to go after, as opposed to pulling in everybody in the operations center to try to get the problem fixed.
The first order of business is, "Okay, this problem is about to occur, and this is where it may occur. So, that’s the guy I’m going to engage first." Basically, you have a way of following down from the most probable to the least probable, and not involving all the people that typically get involved in a bridge call to try to resolve the issues.
One gain is the reduction in mean time to identify where the problem is coming from. The other one is not having all of those people on these calls. This reduces the man-hours associated with root-cause determination and source identification of the problem. In different environments you're going to see different percentages, but in one environment that I saw first-hand, one of the largest health-care organizations, it is something like 20 to 30 percent of cost associated just with people being on bridge calls on a continuous basis.
Gardner: Now, this notion of "management forensics," can you explain that a little bit?
Marvasti: One of the larger problems in IT is actually getting to the root cause of problems. What do you know? How do you know what the root cause is? Oftentimes, something happens, and the necessity of getting the business service back up forces people to reboot the servers and worry later about figuring out what happened. But, when you do that, you lose a lot of information that would have been very helpful in determining what the root cause was.
The forensic side of it is this: the data is collected already, so we already know what it is. If you have the state when a problem occurred, that's a captured environment in the database that you can always go back to.
What we offer is the ability to walk back in time, without having the server down, while you are doing your investigation. You can bring the server back up, come back to our product, and then walk back in time to see exactly what were the leading indicators to the problems you experienced. Using those leading indicators, you can get to the root causes very quickly. That eliminates the guess work of where to start, reduces the time to get to the root cause, and maybe even prevent it.
Sometimes you only have so much time to work on something. If you can’t solve it by that time, you move on, and then the problem occurs again. That's the forensic side.
Gardner: We talked earlier about this notion of capturing the normal state, and now you've got this opportunity to capture an abnormal state. You can compare and contrast. Is that something that you use on an ongoing basis to come up with these probabilities? Or is the probability analysis something different?
Marvasti: No, that’s very much part and parcel of it. What we do is look to see what is the normal operating state of an environment. Then it is the abnormalities from that normal that become your trigger points of potential issues. Those are your first indicators that there might be a problem growing. We also do a cross-event analysis. That’s another probability analysis that we do. We look at patterns of events, as opposed to a single event, indicating a potential problem. One thing we've found is that events in seemingly unrelated silos are very good indicators of a potential problem that may brew some place else.
Doing that kind of analysis, looking at what's normal, the abnormal becomes your first indicator. Then, doing a cross-event analysis to see what patterns indicate a particular problem completes the whole normal-to-problem-prevention scenario.
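[A minimal sketch of "learn normal, alert on out-of-normal": a baseline with a simple standard-deviation test. The data, metric, and threshold are made up, and real behavior-learning models are far richer than a z-score.]

```python
# Flag a sample that deviates more than k standard deviations from the
# learned baseline; the out-of-normal event is an early indicator, not
# yet a failure.
from statistics import mean, stdev

def out_of_normal(history, sample, k=3.0):
    mu, sigma = mean(history), stdev(history)
    return sigma > 0 and abs(sample - mu) > k * sigma

baseline = [52, 48, 50, 51, 49, 53, 50, 47]   # learned "normal" CPU %
print(out_of_normal(baseline, 51))   # False: within normal behavior
print(out_of_normal(baseline, 95))   # True: trigger point for analysis
```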
Gardner: There has to be a cause-and-effect. As much as we would like to imagine ghosts in the machines, that’s really not the case. It's simply a matter of tracking it down.
Marvasti: Exactly. The interesting thing is that you may be measuring a specific metric that is a clear indicator of a problem, but it is oftentimes some other metric on another machine that gets to be out of normal first, before the cause of the problem surfaces in the machine in question. So early indicators to a problem become events that occur some place else, and that’s really important to capture.
When I was talking about the cross-silo analysis, that’s the information that it brings out. It gives you lot more "heads-up" time to a potential problem than if you were just looking at a specific silo.
Gardner: Of course, each data center is unique, each company has its own history and legacy, and its IT department has evolved on its own terms. But is there any general crossover analysis? That is to say, is there a way of having an aggregate view of things and predicting things based on some assumptions, because of the particular systems that are in use? Or, is it site by site on a detailed level?
Marvasti: Every customer that I have seen is totally different. We developed our application specifically to be learning-based, not rules-based. By that I mean not having any preconceived notion of what an environment may look like, because if you have that, and the environment doesn't look like it, you're going to be sending a lot of false positives -- which we definitely did not want to do.
Ours is a purely learning-based system. That means that we install our product, it starts gathering the metrics, and then it starts learning what your systems look and behave like. Based on that behavior, it starts formulating the out-of-normal conditions that can lead to problems. That becomes unique to the customer environment, and it is an advantage, because you get something that actually adapts itself to your environment.
For example, it learns your change-management patterns. If you have change windows occurring, it learns those change windows. It knows that those change windows occur, without anybody having to enter anything into the application. When you are doing a wholesale upgrade of devices, it knows that change is coming about, because it has learned your patterns.
The downside is that it does take two to three weeks of gathering your data and learning what has been happening before it becomes useful and starts producing results for you. The good side is that you get something that completely maps to your business, as opposed to having to map your business to a product.
Gardner: The name of your product set is Alive, is that correct?
Marvasti: That’s correct.
Gardner: I understand you are going to have a release coming out later this year, Alive 6.0?
Marvasti: That’s correct.
Gardner: I don’t expect you to pre-release, but perhaps you can give us some sense of the direction that the major new offerings within the product set will take. What they are directed toward? Can you give us a sneak peek on that?
Marvasti: Basically, we have three pillars that the product is based on. First is usability. That's a particular pet peeve of mine. I didn't find any of the applications out there very usable. We have spent a lot of time working with customers and working with different operations groups, trying to make sure that our product is actually usable for the people that we are designing for.
The second piece is interoperability. The majority of the organizations that we go to already have a whole bunch of systems, whether they be data collection systems, event management systems, or configuration management databases. Our product absolutely needs to leverage those investments -- and they are leverageable. But even those investments, in their silos, don't produce as much benefit to the customer as a product like ours going in there, utilizing all of the data they have, and bringing out the information that's locked within it.
The third piece is analytics. What we have in the product coming out is scalability to 100,000 servers. We've kind of gone wild on the scalability side, because we are designing for the future. Nobody that I know of right now has that kind of scale, except maybe Google, but theirs is basically the same thing replicated thousands of times over, which is different from the enterprises we deal with, like banks or health-care organizations.
A single four-processor Xeon box, with Alive installed on it, can run real-time analytics for up to 100,000 devices. That's the level of scale we're talking about. In terms of analytics, we've got three new pieces coming out, and basically every event we send out is a predictive event. It's going to tell you that this event occurred, and that this other set of events has a certain probability of occurring within a certain timeframe.
Not only that, but then we can match it to what we call our "fingerprinting." Our fingerprinting is a pattern-matching technology that allows us to look at patterns of events, rather than single events, and tie them to a particular problem. Those patterns indicate particular problems and become the predictive alerts to other problems.
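[Here is a toy sketch of the fingerprinting idea: match the events firing now against stored patterns that historically preceded known problems. The fingerprints below are hand-written, hypothetical sets; Alive learns its patterns from data.]

```python
# Match currently firing events against learned event patterns
# ("fingerprints") that historically preceded a known problem.
FINGERPRINTS = {
    "db connection-pool exhaustion": {"web.queue.growth",
                                      "db.sessions.high",
                                      "app.timeouts.rising"},
}

def match_fingerprints(active_events, min_overlap=0.66):
    hits = []
    for problem, pattern in FINGERPRINTS.items():
        overlap = len(active_events & pattern) / len(pattern)
        if overlap >= min_overlap:
            hits.append((problem, overlap))
    return hits

now = {"web.queue.growth", "db.sessions.high", "disk.io.normal"}
for problem, score in match_fingerprints(now):
    print(f"predicted: {problem} (pattern overlap {score:.0%})")
```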
What’s coming out in the product is really a lot of usability, reporting capabilities, and easier configurations. Tens of thousands of devices can be configured very quickly. We have interoperability -- Tivoli, OpenView, Hyperic -- and an open API that allows you to connect to our product and pump in any kind of data, even if it’s business data.
Our technology is context agnostic. What that means is that it does not have any understanding of applications, databases, etc. You can even put in business-type data and have it correlated with your IT data, and extract information that way as well.
Gardner: You mentioned usability. Who are the typical users and buyers of a product like Integrien Alive? Who is your target audience?
Marvasti: The typical user would be at the operations center. The interesting thing is that we have seen a lot of different users come in after the product is deployed. I've seen database administrators use our product, because they like to see what the normal behavior of their databases is. So they run the analytics on database-type metrics and get information that way.
I've seen application folks who want to have more visibility in terms of how this particular application is impacting the database. They become users. But the majority of users are going to be at the operations center -- people doing day-to-day event management and who are responsible for reducing the mean time to identify where the problems come from.
The typical buyers are directors of IT operations or VP of IT operations. We are really on the operation side, as opposed to the application development side.
Gardner: Do you suppose that in the future, when we get more deeply into SOA and virtualization, that some of the analysis that is derived through Integrien Alive becomes something that’s fed into a business dashboard, or something that’s used in governance around how services are going to be provisioned, or how service level agreements are going to be met?
Can we extrapolate as to how the dynamics of the data center and then the job of IT itself changes, on how your value might shift as well?
Marvasti: That link between IT and the business is starting to occur. I definitely believe that our product can play a major part in illuminating what in the business side gets impacted by IT. Because we are completely data agnostic, you can put in IT-type data, business-type data, or customer data -- and have all of it be correlated.
You then have one big holistic view as to what may impact what. ... If this happens, what else might happen? If I want to increase this, what are the other parameters that may be impacted?
So, you know what you want to play from the business side in terms of growth. Having that, we project how IT needs to change in order to support that growth. The information is there within the data and the very fact that we are completely data agnostic allows us to do that kind of a multi-function analysis within an enterprise.
Gardner: It sounds like you can move from an operational efficiency value to a business efficiency value pretty quickly?
Marvasti: Absolutely. Our initial target is the operations arena, because of the tremendous amount of inefficiencies there. But as we move into the future, that’s something we are going to look into.
Gardner: We mentioned Alive 6.0. Do you have a ball-park figure on when that’s due? Is it Q4 of 2007?
Marvasti: We are going to come out with it in 2007, and it will be available in Q4.
Gardner: Well, I think that covers it, and we are just about out of time. I want to thank Mazda Marvasti, the CTO of Integrien, for helping us understand more about the notion of management forensics and probabilistic- rather than deterministic-based analysis.
We have been seeking to understand better how to address high costs, and inefficiencies in data centers, as well as managing application performance -- perhaps in quite a different way than many companies have been accustomed to. Is there anything else you would like to add before we end, Mazda?
Marvasti: No, I appreciate your time, Dana, and thank you for your good questions.
Gardner: This is Dana Gardner, principal analyst at Interarbor Solutions. You have been listening to a sponsored BriefingsDirect podcast. Thanks, and come back next time.
Listen to the podcast here. Sponsor: Integrien Corp.
Transcript of BriefingsDirect podcast on systems management efficiencies and analytics. Copyright Interarbor Solutions, LLC, 2005-2007. All rights reserved.
Transcript of BriefingsDirect podcast on systems management efficiencies and analytics.
Listen to the podcast here. Sponsor: Integrien Corp.
Dana Gardner: Hi, this is Dana Gardner, principal analyst at Interarbor Solutions, and you're listening to BriefingsDirect. Today our sponsored podcast focuses on the operational integrity of data centers, the high cost of IT operations, and the extremely high cost of application downtime and performance degradation.
Rather than losing control to ever-increasing complexity, and gaining less and less insight into the root causes of problematic applications and services, enterprises and on-demand application providers alike need to predict how systems will behave under a variety of conditions.
By adding real-time analytics to their systems management practices, operators can fully determine the normal state of how systems should be performing. Then, by measuring the characteristics of systems under many conditions over time, datacenter administrators and systems operators gain the ability to predict and prevent threats to the performance of their applications and services.
As a result, they can stay ahead of complexity and contain the costs of ongoing high-performance application delivery.
This ability to maintain operational integrity through predictive analytics means IT operators can significantly reduce costs while delivering high levels of service.
Here to help us understand how to manage complexity by leveraging probabilistic systems management and remediation, we are joined by Mazda Marvasti, the CTO of Integrien Corp. Welcome to the show, Mazda.
Mazda Marvasti: Thank you, Dana.
Gardner: Why don’t we take a look at the problem set? Most people are now aware that their IT budgets are strained just by ongoing costs. Whether they are in a medium-sized company, large enterprise, or service-hosting environment, some 70 percent to 80 percent of budgets are going to ongoing operations.
That leaves very little for discretionary spending. If you have constant change or a dynamic environment, you're left with few resources to tap in order to meet a dynamic market shift. Can you explain how we got to this position? Why are we spending so much just to keep our heads above water in IT?
Marvasti: When we started in the IT environment -- if you remember the mainframe days -- it was pretty well defined. You had a couple of big boxes, and they ran a couple of large applications. It was well understood. You could collect some data from those systems, so you knew what was going on within them.
We graduated to the client-server era, where we had more flexibility in terms of deployment -- but with that came increasing complexity. Then we moved to n-tier Web applications, and we had yet another increase in complexity. A lot of products came in to try to alleviate that complexity with deep data collection, and management systems grew to cover an entire enterprise for data collection, but the complexity was still there.
Now, with service-oriented architecture (SOA) and virtualization moving into application development and data-center automation, there is a tremendous amount of complexity in the operations arena. You can no longer rely on the people who used to carry the "tribal knowledge" in their heads to determine where the problems are coming from or what the issues are.
The problems and the complexity have gone beyond the capability of people just sitting in front of screens of data, trying to make sense of it. So, even as we gained efficiency in application development and needed consistency of performance and availability, all of this added to the complexity of managing the data center.
That’s how the evolution of the data center went from being totally deterministic -- meaning that you knew every variable, could measure it, and had very specific rules telling you, if certain things happened, what they were and what they meant -- all the way to the non-deterministic era we are in right now.
Now, you can't possibly know all the variables, and the rules that you come up with today may be invalid tomorrow, all just because of change that has gone on in your environment. So, you cannot use the same techniques that you used 10 or 15 years ago to manage your operations today. Yet that’s what the current tools are doing. They are just more of the same, and that’s not meeting the requirements of the operations center anymore.
Gardner: At the same time, we are seeing that a company’s applications are increasingly the primary way that it reaches out to its sell side, to customers -- as well as its buy side, to its supply chain, its partners, and ecology. So applications are growing more important. The environment is growing more complex, and the ability to know what’s going on is completely out of hand.
Marvasti: That’s right. You used to know exactly where your application was, what systems it touched, and what it was doing. Now, because of the demand of the customers and the demands of the business to develop applications more rapidly, you’ve gone into an SOA era or an n-tier application era, where you have a lot of reusability of components for faster development and better quality of applications -- but also a lot more complexity in the operations arena.
What that has led to is that you no longer know, in a deterministic fashion, what your applications might be touching or what arenas they might be going into. There's no sense of, "This is it. These are the bounds of my application." Now it’s getting cloudier, especially with SOA coming in.
Gardner: We’ve seen some attempts in the conventional management space to keep up with this. We’ve been generating more agents, putting in more sniffers, applying different kinds of management. And yet we still seem to be suffering the same problems. What do you think is the next step in terms of coming to grips with this -- perhaps on a holistic basis -- so that we can get as much of the picture as possible?
Marvasti: The "business service" is the thing that the organization offers to its customers. It runs through their data center, IT operations, and the business center. It goes across multiple technology layers and stacks. So having data collection at a specific stack or for a specific technology silo, in and of itself, is insufficient to tell you the problems with the business service, which is what you are ultimately trying to get to. You really need to do a holistic analysis of the data from all of the silos that the business service runs through.
You may have some networking silos, where people are using specific tools to do network management -- and that’s perfectly fine. But then the business service may go through a Web tier, application tier, database tier, or storage -- and all of those devices may be virtualized. There may be some calls to an SOA.
There are deep-dive tools to collect data and report on the specifics of what may be going on within silos, but you really need to do an analysis across all the silos to tell you where the problems of the business service may be coming from. The interesting thing is that there is a lot of information locked in these metrics. Once correlated across the silos, they paint a pretty good picture of the impending problem, or of the root cause of a problem.
By looking at an individual metric collected within one silo, you don’t get as full a picture as when you correlate that metric with another metric in another silo. That paints a much larger picture of what may be going on within your business service.
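To make the cross-silo idea concrete, here is a minimal sketch of correlating one silo's metric with another's. The metric names and sample values are invented for illustration, not drawn from any real monitoring feed.

```python
# Minimal sketch: correlate a database-silo metric with a web-tier
# metric sampled at the same timestamps. A strong correlation hints
# that the two move together across silos.
from statistics import mean, stdev

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

db_lock_waits  = [3, 4, 3, 9, 14, 18, 22, 25]               # database silo
web_latency_ms = [110, 112, 108, 160, 240, 300, 380, 430]   # web-tier silo

print(f"cross-silo correlation: {pearson(db_lock_waits, web_latency_ms):.2f}")
```

A correlation near 1.0 here says more about the health of the business service than either metric viewed inside its own silo.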
Gardner: So if we want to gather insights and even predictability at the business-service level -- that higher abstraction of what is productive -- we need to go in and mine this data in this context. But it seems to me that it’s in many different formats. From that "Tower of Babel," how do you create a unified view? Or are you creating metadata? What’s the secret sauce that gets you from raw data to analysis?
Marvasti: One misperception is that, "I need to have every metric I collect go into a magical box that then tells me everything I need to know." In fact, you don’t need every metric. Much of the information is locked in the correlations between metrics. We’ve seen at our customers that a gap in monitoring in one silo can often be compensated for by data collection in other silos.
So, if you have a monitoring system already -- IBM Tivoli, as an example -- and you are collecting operating-system metrics, you may have one or two other application-specific metrics that you are also collecting. That may be enough to tell you everything that is going on within your business service. You don't need to go to the nth degree of data collection and harmonization of that data into one data repository to get a clear picture.
Even starting with what you’ve got now, without going very deep, what we’ve seen at our customers is that it surfaces a pretty good volume of information about what may be going on across the silos -- something they couldn't achieve by looking at individual metrics alone.
Gardner: It’s a matter of getting to the right information that’s going to tell you the most across the context of a service?
Marvasti: To a certain degree, but a lot of times you don’t even know what the right metrics are. Basically I go to our customers and say, "What do you have?" Let’s just start with that, and then the application will determine whether you actually have gaps in your monitoring or whether these metrics that you are collecting are the right ones to solve those specific problems.
If not, we can figure out where the gaps may be. A lot of times, customers don’t even know what the right metrics are. And that’s one of the mental shifts -- from thinking deterministically to thinking probabilistically.
Thinking deterministically means asking, "What are the right metrics I need to collect to be able to identify this problem?" In fact, what we’ve found is that a particular problem in a business service can be modeled by a set of metric event conditions that are seemingly unrelated to that problem, yet are pretty good indicators that it is about to occur.
When we start with what they have, we often point out that there is a lot more information within that data set. They don’t really need to ask, "Do I have the right metrics or not?"
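As a rough illustration of that idea -- a problem modeled by seemingly unrelated event conditions -- the sketch below estimates, from invented history, how often a pattern of indicator events preceded a problem. The event names and records are hypothetical.

```python
# Sketch: score how well a set of seemingly unrelated event conditions
# predicts a problem, from historical observation windows.
history = [
    # (indicator events seen in a window, did the problem follow?)
    ({"cpu_spike_hostA", "queue_depth_hostC"}, True),
    ({"cpu_spike_hostA"},                      False),
    ({"cpu_spike_hostA", "queue_depth_hostC"}, True),
    ({"queue_depth_hostC"},                    False),
    ({"cpu_spike_hostA", "queue_depth_hostC"}, False),
]

pattern = {"cpu_spike_hostA", "queue_depth_hostC"}
outcomes = [hit for events, hit in history if pattern <= events]
if outcomes:
    confidence = sum(outcomes) / len(outcomes)
    print(f"P(problem | pattern) ~ {confidence:.0%}")  # 67% on this data
```

Neither event names the problem directly; it is the joint pattern, scored probabilistically, that carries the predictive information.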
Gardner: Once you’ve established a pretty good sense of the right metrics and the right data, then I suppose you need to apply the right analysis and tools. Maybe this would be a good time for you to explain the heritage of Integrien, how it came about, and how you get from this deterministic approach to a more probabilistic one?
Marvasti: I’ve been working on these types of problems for the past 18 years. Since graduate school, I’ve been working on extracting information from disparate data. I went to work for Ford and General Motors -- really large environments. Back then, it was client-server, and the question was how those environments were being managed. I could see the impending complexity, because I saw the level of pressure on application developers to develop more reusable code and to develop faster with higher quality.
All of that led to the Web application era. Back then, I was the CTO of a company called LowerMyBills.com here in the Los Angeles area. One problem I had was that only a few people held the tribal knowledge to manage and run the systems, and that was very scary to me. I couldn't rely on a handful of people to keep the business running continuously.
So I started looking at management systems, because I thought it was probably a solved problem. I looked at a lot of management tools out there and saw that they mainly centered on data collection, manual rule writing, and better ways of presenting the same data over and over.
I didn’t see any way of doing a deep analysis of the data to bring out insights. That’s when my business partner Al Eisaian, who is our CEO, and I formed a company specifically to attack this problem. That was in 2001. We spent a couple of years developing the product, got our first set of customers in 2003, and really started proving the model.
One of the interesting things is that, if you have a small environment, your tendency is to think it's small enough that you can manage it -- and that may actually be true. You develop specific technical knowledge about your systems, and you work from there. But in larger environments, where there is so much change happening, it becomes impossible to manage it that way.
A product like ours almost becomes a necessity, because we’ve transitioned from people knowing in their heads what to do, to not being able to comprehend all of the things happening in the data center. The technology we developed was meant to address this problem of not being able to make sense of the data coming through, so that you could make an intelligent decision about problems occurring in the environment.
Gardner: Clearly a tacit-knowledge approach is not sufficient, and just throwing more people at it is not going to solve the problem. What’s the next step? How do we get to a position where we can gather and then analyze data in such a way that we get to that Holy Grail -- a predictive, rather than reactive, response?
Marvasti: Obviously, the first step is collecting the data. Without the data, you can’t really do much. A lot of investment has already gone into data collection mechanisms, be it agent-based or agent-less. So there is data being collected right now.
The missing piece is the utilization of that data and the extraction of information from it. Right now, as you said in your introduction, a lot of cost goes toward keeping the lights on at the operations center. That’s typically people cost, where people are deployed 24/7, looking at monitors, looking at failures, and then trying to do a postmortem on the problem.
This does require a bit of a mind shift from deterministic to probabilistic. The reason is that a lot has been built to help the operations center do a really good job of cleaning up after an accident, but not a lot of thought has gone into what to do when you're forewarned of an accident, before it actually happens.
Gardner: How do I intercede? How do I do something?
Marvasti: How do I intercede? What do I do? What does it mean? For example, one of the outputs from our product is a predictive alert that says, "With 80 percent confidence, this particular problem will occur within the next 15 minutes." Well, nothing has happened yet, so what does my run book say I should do? The run book is missing that information. The run book only has the information on how to clean it up after an accident happens.
That’s the missing piece in the operations arena. Part of the challenge for our company is getting the operations folks to start thinking in a different fashion. You can do it a little at a time. It doesn’t have to be a complete shift in one fell swoop, but it does require that change in mentality: now that I am actually forewarned about something, how do I prevent it, as opposed to cleaning up after it happens?
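As a hedged sketch of what such a forewarning might carry -- as opposed to a classic after-the-fact alarm -- here is one possible shape for a predictive alert. The field names are hypothetical, not Alive's actual schema.

```python
# Hypothetical shape of a predictive alert: it names a problem that has
# NOT happened yet, with a confidence and a time horizon, so a run book
# can hold preventive steps rather than only cleanup steps.
from dataclasses import dataclass, field
from datetime import timedelta

@dataclass
class PredictiveAlert:
    problem: str                 # the problem the pattern points to
    confidence: float            # 0.80 means "80 percent confidence"
    horizon: timedelta           # expected time until the problem occurs
    indicators: list = field(default_factory=list)  # triggering events

alert = PredictiveAlert(
    problem="checkout service degradation",
    confidence=0.80,
    horizon=timedelta(minutes=15),
    indicators=["db lock waits out of normal", "app queue depth out of normal"],
)

# The run-book entry is keyed on the *predicted* problem:
if alert.confidence >= 0.75:
    print(f"Preventive action for '{alert.problem}': "
          f"act within {alert.horizon} while nothing is broken yet")
```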
Gardner: When we talk about operational efficiency, are we talking about one or two percent here and there? Is this a rounding error? Are we talking about some really wasteful practices that we can address? What’s the typical return on investment that you are pursuing?
Marvasti: It’s not one or two percent. We're talking about a completely different way of managing operations. After a problem occurs, you typically have a lot of people on a bridge call, and then you go through a process of elimination to determine where the problem is coming from, or what might have caused it. Once the specific technology silo has been determined, then they go to the experts for that particular silo to figure out what’s going on. That actually has a lot of time and manpower associated with it.
What we're talking about is being proactive, so that you know something is about to happen, and we can tell you, to a certain probability, where it’s going to be. Now you have a list of low-hanging fruit to go after, as opposed to pulling everybody in the operations center into trying to get the problem fixed.
The first order of business is, "Okay, this problem is about to occur, and this is where it may occur. So, that’s the guy I’m going to engage first." Basically, you have a way of working down from the most probable source to the least probable, without involving all the people who typically get pulled into a bridge call to resolve the issues.
One gain is the reduction in mean time to identify where the problem is coming from. The other is not having all of those people on these calls. This reduces the man-hours associated with root-cause determination and source identification. You'll see different percentages in different environments, but in one environment I saw first hand -- one of the largest health-care organizations -- something like 20 to 30 percent of cost was associated with people being on bridge calls on a continuous basis.
Gardner: Now, this notion of "management forensics," can you explain that a little bit?
Marvasti: One of the larger problems in IT is actually getting to the root cause of problems. What do you know? How do you know what the root cause is? Oftentimes something happens, and the necessity of getting the business service back up forces people to reboot the servers and worry later about figuring out what happened. But when you do that, you lose a lot of information that would have been very helpful in determining the root cause.
The forensic side of it is this: the data is already collected, so we already know what it is. If you have the state of the environment when a problem occurred, that’s captured in the database, and you can always go back to it.
What we offer is the ability to walk back in time, without keeping the server down while you do your investigation. You can bring the server back up, come back to our product, and then walk back in time to see exactly what the leading indicators to the problem were. Using those leading indicators, you can get to the root cause very quickly. That eliminates the guesswork of where to start, reduces the time to get to the root cause, and may even let you prevent the problem next time.
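A minimal sketch of that walk-back, assuming metric history is stored with an out-of-normal flag; the timestamps, hosts, and metrics are invented.

```python
# Forensic walk-back: given an incident time and stored observations,
# list which metrics went out of normal beforehand, earliest first.
incident_t = 1000   # incident time (epoch seconds, illustrative)
lookback   = 900    # walk back 15 minutes

# metric -> list of (timestamp, was_out_of_normal) observations
history = {
    "hostB.disk_await":  [(200, False), (400, True), (900, True)],
    "hostA.cpu_user":    [(200, False), (700, True), (950, True)],
    "hostC.net_retrans": [(200, False), (980, True)],
}

def first_deviation(obs, start, end):
    """Earliest in-window timestamp at which a metric went out of normal."""
    return next((t for t, bad in obs if start <= t <= end and bad), None)

leading = [(m, first_deviation(o, incident_t - lookback, incident_t))
           for m, o in history.items()]
for metric, t in sorted((x for x in leading if x[1] is not None),
                        key=lambda x: x[1]):
    print(f"{metric}: out of normal at t={t}, {incident_t - t}s before incident")
```

The earliest deviators are the leading indicators; in this toy run, hostB's disk latency deviates long before the incident, so that is where the investigation starts.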
Sometimes you only have so much time to work on something. If you can’t solve it by that time, you move on, and then the problem occurs again. That's the forensic side.
Gardner: We talked earlier about this notion of capturing the normal state, and now you've got this opportunity to capture an abnormal state. You can compare and contrast. Is that something that you use on an ongoing basis to come up with these probabilities? Or is the probability analysis something different?
Marvasti: No, that’s very much part and parcel of it. What we do is look to see what is the normal operating state of an environment. Then it is the abnormalities from that normal that become your trigger points of potential issues. Those are your first indicators that there might be a problem growing. We also do a cross-event analysis. That’s another probability analysis that we do. We look at patterns of events, as opposed to a single event, indicating a potential problem. One thing we've found is that events in seemingly unrelated silos are very good indicators of a potential problem that may brew some place else.
Doing that kind of analysis, what's abnormal becomes your first indicator. Then, doing a cross-event analysis to see which patterns indicate a particular problem takes you from simply knowing what's normal all the way to a problem-prevention scenario.
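Here is a toy version of the "learn normal, flag out-of-normal" step, using a rolling baseline and a z-score threshold. Real analytics would use far richer models of normal behavior; this only illustrates the shape of the technique.

```python
# Learn a metric's normal operating range from its own history, then
# flag values that deviate strongly from that learned normal.
from collections import deque
from statistics import mean, stdev

class Baseline:
    def __init__(self, window=60, threshold=3.0):
        self.obs = deque(maxlen=window)   # rolling history of the metric
        self.threshold = threshold        # how many std-devs is "abnormal"

    def update(self, value):
        """Record value; return True if it is out of normal vs. history."""
        out = False
        if len(self.obs) >= 10:           # need some history before judging
            mu, sigma = mean(self.obs), stdev(self.obs)
            out = sigma > 0 and abs(value - mu) / sigma > self.threshold
        self.obs.append(value)
        return out

b = Baseline()
for v in [50, 52, 49, 51, 50, 53, 48, 52, 50, 51, 50, 95]:
    if b.update(v):
        print(f"out of normal: {v}")      # flags the 95
```

An event like that flagged 95, correlated with out-of-normal events from other silos, is what feeds the cross-event analysis.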
Gardner: There has to be cause and effect. As much as we would like to imagine ghosts in the machines, that’s really not the case. It's simply a matter of tracking it down.
Marvasti: Exactly. The interesting thing is that you may be measuring a specific metric that is a clear indicator of a problem, but oftentimes some other metric, on another machine, goes out of normal first, before the problem surfaces in the machine in question. So the early indicators of a problem are events that occur someplace else, and that’s really important to capture.
When I was talking about the cross-silo analysis, that’s the information it brings out. It gives you a lot more "heads-up" time on a potential problem than if you were just looking at a specific silo.
Gardner: Of course, each data center is unique, each company has its own history and legacy, and its IT department has evolved on its own terms. But is there any general crossover analysis? That is to say, is there a way of having an aggregate view of things and predicting things based on some assumptions, because of the particular systems that are in use? Or, is it site by site on a detailed level?
Marvasti: Every customer I have seen is totally different. We developed our application specifically to be learning-based, not rules-based -- by which I mean having no preconceived notion of what an environment should look like. Because if you have that notion, and the environment doesn’t match it, you're going to send a lot of false positives -- which we definitely did not want to do.
Ours is a purely learning-based system. That means we install our product, it starts gathering the metrics, and then it learns what your systems look like and how they behave. Based on that behavior, it starts formulating the out-of-normal conditions that can lead to problems. That becomes unique to the customer environment, which is an advantage, because what you get adapts itself to your environment.
For example, it learns your change-management patterns. If you have change windows occurring, it learns them. It knows that those change windows occur, without anybody having to enter anything into the application. When you are doing a wholesale upgrade of devices, it knows that change is coming, because it has learned your patterns.
The downside is that it takes two to three weeks of gathering your data and learning what has been happening before it becomes useful and starts producing results for you. The good side is that you get something that completely maps to your business, as opposed to having to map your business to a product.
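As a small sketch of how a recurring change window might be learned from event history alone: if bursts of change events keep landing in the same weekday-and-hour slot, that slot can be treated as an expected window and its alerts suppressed. The dates and the recurrence threshold are illustrative.

```python
# Learn recurring change windows from raw change-event timestamps.
from collections import Counter
from datetime import datetime

change_events = [
    datetime(2007, 9, 2, 2, 10), datetime(2007, 9, 9, 2, 5),
    datetime(2007, 9, 16, 2, 20), datetime(2007, 9, 23, 2, 15),
    datetime(2007, 9, 12, 14, 3),   # a one-off change, not a pattern
]

slots = Counter((e.weekday(), e.hour) for e in change_events)
learned = {slot for slot, n in slots.items() if n >= 3}   # recurring slots

def in_change_window(t: datetime) -> bool:
    """True if t falls in a learned (weekday, hour) change window."""
    return (t.weekday(), t.hour) in learned

print(learned)                                            # {(6, 2)}: Sun ~02:00
print(in_change_window(datetime(2007, 9, 30, 2, 30)))     # True -> suppress
```

Nobody entered a maintenance calendar; the window emerged from the data, which is the learning-based point being made here.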
Gardner: The name of your product set is Alive, is that correct?
Marvasti: That’s correct.
Gardner: I understand you are going to have a release coming out later this year, Alive 6.0?
Marvasti: That’s correct.
Gardner: I don’t expect you to pre-release, but perhaps you can give us some sense of the direction that the major new offerings within the product set will take. What are they directed toward? Can you give us a sneak peek?
Marvasti: Basically, we have three pillars that the product is based on. First is usability. That's a particular pet peeve of mine. I didn't find any of the applications out there very usable. We have spent a lot of time working with customers and with different operations groups, trying to make sure that our product is actually usable for the people we are designing it for.
The second piece is interoperability. The majority of organizations we go to already have a whole bunch of systems -- data-collection systems, event-management systems, configuration management databases, and so on. Our product absolutely needs to leverage those investments -- and they can be leveraged. But even those investments, sitting in their silos, don’t produce as much benefit as a product like ours going in, utilizing all of the data they hold, and bringing out the information that’s locked within it.
The third piece is analytics. What we have in the product coming out is scalability to 100,000 servers. We've kind of gone wild on the scalability side, because we are designing for the future. Nobody I know of right now has that kind of scale, except maybe Google, but theirs is basically the same thing replicated thousands of times over, which is different from the enterprises we deal with, such as banks or health-care organizations.
A single four-processor Xeon box, with Alive installed on it, can run real-time analytics for up to 100,000 devices. That’s the level of scale we're talking about. In terms of analytics, we've got three new pieces coming out, and basically every event we send out is a predictive event. It’s going to tell you that this event occurred, and that this other set of events has a certain probability of occurring within a certain timeframe.
Not only that, but we can then match it to what we call our "fingerprinting." Fingerprinting is a pattern-matching technology that allows us to look at patterns of events and tie them to a particular problem. Those patterns indicate particular problems, and they become the predictive alerts for them.
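A schematic of what such event-pattern fingerprinting could look like, with invented fingerprints: known problems map to sets of event types, and the live event stream is scored against each.

```python
# Fingerprinting sketch: each known problem has a fingerprint (a set of
# event types); a partially matched fingerprint becomes a predictive alert.
fingerprints = {
    "db connection pool exhaustion": {"app_queue_up", "db_conns_up", "gc_time_up"},
    "storage latency cascade":       {"disk_await_up", "db_io_wait_up"},
}

live_events = {"app_queue_up", "db_conns_up"}   # events seen so far

def match_score(fingerprint: set, seen: set) -> float:
    """Fraction of a fingerprint's events already observed."""
    return len(fingerprint & seen) / len(fingerprint)

for problem, fp in fingerprints.items():
    s = match_score(fp, live_events)
    if s >= 0.6:   # illustrative alerting threshold
        print(f"predictive alert: '{problem}' pattern {s:.0%} matched")
```

Two of the three events in the first fingerprint have fired, so this sketch would warn about pool exhaustion before the remaining event arrives.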
What’s coming out in the product is really a lot of usability, reporting capabilities, and easier configuration. Tens of thousands of devices can be configured very quickly. We have interoperability -- Tivoli, OpenView, Hyperic -- and an open API that allows you to connect to our product and pump in any kind of data, even if it’s business data.
Our technology is context agnostic. What that means is that it has no built-in understanding of applications, databases, and so on. You can even put in business-type data, have it correlated with your IT data, and extract information that way as well.
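Since the open API itself isn't described in this discussion, the endpoint and payload below are purely hypothetical; the sketch only illustrates the idea of pumping arbitrary IT or business metrics into one context-free stream.

```python
# Hypothetical feed of mixed business and IT metrics into an analytics
# engine over an open HTTP API. Endpoint and field names are invented;
# Alive's real API is not documented in this podcast.
import json
from urllib.request import Request, urlopen

samples = [
    {"source": "business", "metric": "orders_per_minute", "value": 412},
    {"source": "it",       "metric": "hostA.cpu_user",    "value": 73.5},
]

req = Request(
    "http://alive.example.com/api/metrics",        # hypothetical endpoint
    data=json.dumps(samples).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# urlopen(req)  # would submit the samples in a real deployment

# To a context-agnostic engine, both rows are just time-stamped numbers
# to correlate; it doesn't care which is business data and which is IT.
```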
Gardner: You mentioned usability. Who are the typical users and buyers of a product like Integrien Alive? Who is your target audience?
Marvasti: The typical user would be at the operations center. The interesting thing is that we have seen a lot of different users come in after the product is deployed. I've seen database administrators use our product, because they like to see the normal behavior of their databases. So they run the analytics on database-type metrics and get information that way.
I've seen application folks who want to have more visibility in terms of how this particular application is impacting the database. They become users. But the majority of users are going to be at the operations center -- people doing day-to-day event management and who are responsible for reducing the mean time to identify where the problems come from.
The typical buyers are directors or VPs of IT operations. We are really on the operations side, as opposed to the application-development side.
Gardner: Do you suppose that in the future, when we get more deeply into SOA and virtualization, that some of the analysis that is derived through Integrien Alive becomes something that’s fed into a business dashboard, or something that’s used in governance around how services are going to be provisioned, or how service level agreements are going to be met?
Can we extrapolate as to how the dynamics of the data center, and the job of IT itself, might change -- and how your value might shift as well?
Marvasti: That link between IT and the business is starting to occur. I definitely believe that our product can play a major part in illuminating what in the business side gets impacted by IT. Because we are completely data agnostic, you can put in IT-type data, business-type data, or customer data -- and have all of it be correlated.
You then have one big holistic view as to what may impact what. ... If this happens, what else might happen? If I want to increase this, what are the other parameters that may be impacted?
So, you know where you want to go on the business side in terms of growth. Having that, we can project how IT needs to change to support that growth. The information is there within the data, and the very fact that we are completely data agnostic allows us to do that kind of multi-function analysis within an enterprise.
Gardner: It sounds like you can move from an operational efficiency value to a business efficiency value pretty quickly?
Marvasti: Absolutely. Our initial target is the operations arena, because of the tremendous amount of inefficiencies there. But as we move into the future, that’s something we are going to look into.
Gardner: We mentioned Alive 6.0. Do you have a ballpark figure on when that’s due? Is it Q4 of 2007?
Marvasti: We are going to come out with it in 2007, and it will be available in Q4.
Gardner: Well, I think that covers it, and we are just about out of time. I want to thank Mazda Marvasti, the CTO of Integrien, for helping us understand more about the notion of management forensics and probabilistic- rather than deterministic-based analysis.
We have been seeking to better understand how to address high costs and inefficiencies in data centers, as well as how to manage application performance -- perhaps in quite a different way than many companies are accustomed to. Is there anything else you would like to add before we end, Mazda?
Marvasti: No, I appreciate your time, Dana, and thank you for your good questions.
Gardner: This is Dana Gardner, principal analyst at Interarbor Solutions. You have been listening to a sponsored BriefingsDirect podcast. Thanks, and come back next time.
Listen to the podcast here. Sponsor: Integrien Corp.
Transcript of BriefingsDirect podcast on systems management efficiencies and analytics. Copyright Interarbor Solutions, LLC, 2005-2007. All rights reserved.