
Sunday, April 27, 2008

HP Creates Security Reference Model to Better Manage Enterprise Information Risk

Transcript of BriefingsDirect podcast on best practices for integrated management of security, risk and compliance approaches.

Listen to the podcast here. Sponsor: Hewlett-Packard.

Dana Gardner: Hi, this is Dana Gardner, principal analyst at Interarbor Solutions, and you’re listening to BriefingsDirect. Today, a sponsored podcast discussion about risk, security, and management in the world’s largest organizations. We're going to talk about the need for verifiable best practices, common practices, and common controls at a high level.

The idea is for management of processes, and the ability to prevent unknown and undesirable outcomes -- not at the silo level, or the instance-level of security breaches that we hear about in the news. We will focus instead on what security requires at the high level of business process.

These processes have been newly managed through Information Security Service Management (ISSM) approaches, and there is a reference model (ISSM RM) that goes along with it.

To help us learn more about ISSM, we are joined by two Hewlett-Packard (HP) executives. We are going to be talking with Tari Schreider, the chief security architect in the Americas Security Practice within HP’s Consulting & Integration (C&I) unit.

Also joining us to help us understand ISSM is John Carchide, the worldwide governance solutions manager in the Security and Risk Management Practice within HP C&I. Welcome to you both.

Tari Schreider: Thank you.

John Carchide: Thank you, Dana.

Gardner: John, we have a lot of compliance and regulations to be concerned about. We are in an age where there is so much exposure to networks and the World Wide Web. When something goes wrong, and the word gets out -- it gets out in a big way.

Help us to understand the problem. Then perhaps we'll begin to get closer to the solutions for mitigating risk at the conceptual and practical levels.

Carchide: Part of the problem, Dana, is that we've had several highly publicized incidents where certain things have happened that have prompted regulatory actions by local, state, and foreign governments. They are developing standards, defining best practices, and defining what they call control objectives and detailed controls for one to comply with, prior to being a viable entity within an industry.

These regulatory requirements are coming at us from all directions. Our senior management is struggling, because personal liability and fines are now attached as each event occurs, like the TJX breach. The industry is being inundated with compliance and regulatory requirements.

On the other side of this, there are some industry-driving forces, like Visa, which has established standards and requirements that, if you want to do business with Visa, you need to be Payment Card Industry Data Security Standard (PCI DSS) compliant.

All these requirements are hitting senior-level managers within organizations, and they're looking at their IT environment and asking their management teams to address compliance. “Are we compliant?” The answers they're getting are usually vague, and that’s because of the standards.

What Tari Schreider has done is establish a process of defining requirements, based on open standards, and mapping them to risk levels and maturity levels. This provides customers with a clear, succinct, and articulated picture. This tells them what their current state is, what they are doing well, what they are not doing well, where they're in compliance, where they're not in compliance. And it helps them to build the controls in a very logical and systematic way to bring them into compliance.

In my 32 years of security experience, Tari is one of the most forward-thinking individuals I've met. It gives me nothing but great pleasure to bring Tari to a much larger audience so he can share his vision.

Information Security Service Management is his vision, his brainchild. We've invested heavily, and will continue to, in the development and maturity of this process. It incorporates all of HP’s services from the C&I organizations and others. It takes HP’s best practices, methodologies, and proven processes, and incorporates them into a solution for a customer.

So, I would like to introduce everyone to the ISSM godfather, Tari Schreider -- probably one of the most innovative individuals you will ever have the privilege of meeting.

Gardner: Thank you, John. Tari, that’s a lot to live up to. Tell us a little bit about how you actually got started in this? How did you end up being the “godfather” of ISSM?

Schreider: Well, let me compose myself from that introduction. When I joined the Security Practice, we would make sales calls to some of HP’s largest customers. Although we were always viewed as great technologists and operationally competent providers of products and services, we weren’t really viewed -- or weren’t on the radar screen -- as a security service provider, or even a security consulting organization.

Through close alignment with the financial services vertical -- because they had basically heard the same message -- we came up with a strategy where we would go out to the top 30 or so financial services clients and talk with them.

"What is it that you're looking for? Where would you like to see us provide leadership? Where do you see us as a component provider of security services? What level do you view us playing at?"

We took that information, went throughout HP, and invited individuals that we felt were thought leaders within the organization. We invited people from the CTO’s office, from HP Labs, from financial services, worldwide security, as well as representation from a number of senior solution architects.

We got together in Chicago for what we look back on and refer to as the "Chicago Sessions." We hammered out a framework based upon some early work that was done principally in control assessments, building on top of that, and leveraging experiences with delivery in terms of what worked and what didn’t.

We started off with what was referred to then as the "building of the house" and the "blueprint." Then, over the last couple of years, as we have delivered and worked with various parts of the organization, as well as clients, we realized that one of the success factors that we would have to quickly align ourselves with was the momentum that we had with HP’s ITSM, now called Service Management Framework. We had to articulate security as a security service management function within that stack. It really came together when we started viewing security as an end-to-end operational process.

Gardner: What happened that required this to become more of a top-down approach? In John’s introduction, it sounded as if there was a lot of history, where a CIO or an executive would just ask for reports, and the information would flow from the bottom on up.

It sounds like something happened at some point where that was no longer tenable, that the complexity and the issues had outgrown that type of an approach. What happened to make compliance require a top-down, systemic approach?

Schreider: One problem that we were constantly faced with was that clients were asking us, "Where is your thought leadership on security? We know we bring you in here when we have to fix security vulnerabilities on the server, and we get that. We know that you know what you are doing and you're competent there. But frankly, we don’t know what it is that you do. We don’t know the value that you can bring to the table. When we invite you in, you come in with a slide deck full of products. Pretty much, you are like everybody else. So where is your thought leadership?"

Nobody will ever argue against the fact that HP is an operations- and process-oriented company, and we wanted to leverage that. We wanted to end the assessment-and-reporting bureaucracy that CIOs, CSOs, and CFOs were mired in because of Sarbanes-Oxley and so forth, and provide real meat to their information security programs.

The problem was, we had some very large customers that we were losing to competition, because we basically ran out of things to sell them -- only because we didn’t know we had anything to sell them. We had all of this knowledge. We had all of this legacy of doing security in technology for 20 or 30 years, and we didn’t know how to articulate it.

So we formulated this into a reference model, the Information Security Service Management Reference Model, where it would basically serve as an umbrella, by which all of the pillars of security for trusted infrastructure and proactive security management -- and identity and access management, and governance and so forth -- would be showcased under this thought leadership umbrella.

It got us invited into the door, with things like, "You guys are a breath of fresh air. We have all of these Big Four accounting firm-type organizations. They are burying us in reports. And at the end of the day we still fail audits and nothing gets done."

Gardner: I know this is a large and complex topic, on common security and risk management controls, but in a nutshell, or as simply as we can for those folks that might be coming to this from a different perspective, what is ISSM, and what does it mean conceptually?

Schreider: Well, if you look at ISSM, it’s very specifically referred to as the Information Security Service Management Reference Model. It is several things, a framework, architecture, a model, and a methodology. It's a manner in which you can take an information-security program and turn it into a process-driven system within your organization.

That provides you with a better level of security alignment with the business objectives of your organization. It positions security as a driver for IT business-process improvement. It reduces the amount of operational risk, which ensures a higher degree of continuity of business operations. It’s instrumental in uncovering inadequate or failing internal processes that stave off security breaches, and it also turns security into a highly leveraged, high-value process within your organization.

Gardner: This becomes, in effect, a core competency with a command and control structure, rather than something that’s done ad hoc?

Schreider: Absolutely. The other aspect is that through the definition of linked attributes, which we can talk about later, it allows you to actually make security sticky to other business processes.

If you're a financial institution, and you are going to have Web-based banking, it gives you the ability to have sticky security controls, rather than “stovepipes.”

If you're in the utility industry, and you have to comply with North American Electric Reliability Corporation (NERC) Critical Infrastructure Protection (CIP) regulations, it gives you the ability to have sticky security controls around all of your critical cyber assets. Today, they’re simply security controls that are buried in some spreadsheet or Word document, and there is really no way to manage the behavior of those controls.

Gardner: Why don’t we then just name somebody the “Chief Risk Officer” and tell them to pull this all together and organize it in such a way that this is no longer just piecemeal? Is that enough or does something bigger or more methodological have to take place as well?

Schreider: What’s important to understand is that all of our clients represent fairly large global concerns with thousands of employees and billions of dollars in revenue, and with many demands on their day-to-day operations. A lot of them have done some things for security over time.

Pulling the risk manager aside and leaving him with the impression that everything they're doing is wrong is probably not the best course. We've recognized that through trial and error.

We want to work with that individual and position the ISSM Reference Model as the middle layer, which is typically missing, to pull together all the pieces of their disparate security programs, tools, policies, and processes in an end-to-end system.

Gardner: It sounds as if we really need to look at security and risk in a whole new way.

Schreider: I believe we do. And this is key because what differentiates us from our contemporaries is that we are now “operationalizing” security as a process or a workflow.

Many times, when we pull up The Wall Street Journal or Information Week, and we read about a breach of security -- the proverbial tape rolling off the back of the truck with all of the Social Security numbers -- we find that, when you look at the morphology of that security breach, it’s not necessarily that a product failed. It’s not necessarily that an individual failed. It’s that the process failed. There was no end-to-end workflow and nobody understood where the break points were in the process.

Our unique methodology, which includes a number of frameworks and models, has a component called the P5 Model, where every control has five basic properties:
  • Property 1 -- People: the right roles have to be applied to the control.
  • Property 2 -- Policies: there has to be clear and unambiguous governance in order for controls to work.
  • Property 3 -- Processes: an end-to-end workflow, where everyone understands where the touch points are.
  • Property 4 -- Products: technology has to be applied in many cases to bring these controls to life and keep them functioning appropriately.
  • Property 5 -- Proof: there have to be proof points to demonstrate that all of this is actually working as prescribed by a standard, a regulation, or a best practice.
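To make the P5 idea concrete, here is a minimal sketch of a single control carrying all five properties. The class and field names are invented for illustration; the podcast describes the P5 Model only at this conceptual level, not as a schema.

```python
from dataclasses import dataclass

@dataclass
class Control:
    """One security control with the five P5 properties.
    Field names are illustrative, not an actual HP schema."""
    name: str
    people: list[str]      # roles accountable for the control
    policies: list[str]    # governing policy documents
    processes: list[str]   # end-to-end workflows the control lives in
    products: list[str]    # technologies that enforce it
    proof: list[str]       # evidence that it works as prescribed

    def gaps(self) -> list[str]:
        """List the properties still empty -- a control missing any
        of the five is unlikely to function end to end."""
        return [p for p in ("people", "policies", "processes",
                            "products", "proof")
                if not getattr(self, p)]

ctrl = Control(
    name="Server hardening",
    people=["Network operations"],
    policies=["Baseline configuration policy"],
    processes=["Quarterly hardening workflow"],
    products=["Configuration scanner"],
    proof=[],  # no evidence collected yet
)
print(ctrl.gaps())  # -> ['proof']
```

A control that reports any gap is a candidate break point of the kind Schreider describes: the product may be in place while the proof, or the process around it, is missing.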
Gardner: It seems that you are weaving this together so that you get a number of checks and balances, backstops and redundancies -- so that there aren’t unforeseen holes through which these risky practices might fall.

Schreider: I couldn’t say it any better than that.

Gardner: How do I know that I am a company that needs this? Maybe I am of the impression that, "Well, I've done a lot. I've complied and studied and I've got my reports."

Are there any telltale signs that an organization needs to shift the way they are thinking about holistic security and compliance?

Schreider: I'm often asked that question. When I sit down with CFOs or CIOs or business-unit stakeholders, I can ask one question that will be a telltale sign of whether they have a well-managed, continuously improving information security program. That question is, "How much did you spend on security last year?" Then I just shut up.

Gardner: And they don’t have an answer for it at all?

Schreider: They don't have any answer. If you don’t know what you are spending on security, then you actually don’t know what you are doing for security. It starts from there.

Gardner: That’s because these measures are scattered around in a variety of budgets. And, as you say, they evolve through a “siloed” approach. It was, "Okay, we've got to put a band-aid here, a band-aid there. We need to react to this." Over time, however, you've just got a hairball, rather than a concerted, organized, principled approach.

Schreider: That’s correct, Dana. As a matter of fact, we have a number of tools in our methodology that expose this fragmented approach to security. Within Property 4 (Products) of the P5 Model, we have a tool that allows us to go in and inventory all of the products an organization has.

Then we map that to things like the Open Systems Interconnection (OSI) Reference Model for security on a layered approach, a "defense in depth" approach, an investment approach, and also from a risk and a threat model approach, and in ownership.

When they see the results of that, they say, "Wait a second. I thought we only had 10 or 12 security products, and I manage that." We show them that they actually have 40, 50, or 60, because they're spread throughout the organization, and there's a tremendous amount of duplication.

It’s not unusual for us to present back to a client that they have three or four different identity management systems that they never knew about. They might have four or five disparate identity stores spread throughout the organization. If you don’t know it and if you can’t see it, you can’t manage it.

Gardner: Now, it sounds as if, from an organizational and a power-structure perspective, this could organize itself in several places. It could be a function within IT, or within a higher accounting or auditing level or capability.

Does it matter, or is there high variability from organization to organization as to where the authority comes for this? Do you have more of a prescriptive approach as to how they should do it?

Schreider: The answer to both of those questions is "yes." We recognize that just because of the dynamics, the culture, and the bureaucracy, in many of our customers' organizations, security is going to live in multiple silos or departments. Through our P5 Model, we have the ability to basically take and share the governance of the control.

So, for example, the office of the Business Information Security Officers (BISO) or the Chief Security Officer (CSO) typically owns policies and proof. For the technology piece -- which has been always a struggle between the office of security and the office of technology on who owns what -- we can define the control of the attributes. So, the network-operations people can then own the technical controls, because they are not going to give up their firewalls and their intrusion detection systems. They actually view that as an integral component of their overall network plumbing.

The beauty of ISSM is that it's very nimble and very malleable. We can assign responsibilities at an attribute level for control, which allows people to contribute and then it allows them to have a sharing-of-power strategy, if you will, for security.

Gardner: There's an analogy here to Service Oriented Architecture (SOA) from the IT side. In many respects, we want to leave the resources, assets, applications, and data where they are, but elevate them through metadata to a higher abstraction. That allows us then to manage, on a policy basis, for governance, but also to create processes that are across business domains and which can create a higher productivity level.

I'm curious, did this evolve from the way that IT is dealing with its complexity issues? Is there an analogy here?

Schreider: It's very much similar to how IT is managed, where basically you want to push out to the lowest common denominator and as close as possible to the customer the services that you provide.

Under this whole concept of what we refer to as BISOs, there are large components of security that should actually live in the business unit, but they shouldn’t be off doing their own thing. It shouldn’t be the Wild West. There is a component that needs to be structured for overall corporate governance.

We're certainly not shy about lessons learned and about borrowing from what contemporaries have done in the IT world. We're not looking to buck the trend. That’s why we had to make sure that our reference model supported the general direction of where IT has been moving over the last few years.

Gardner: Conceptually I have certainly bought into this. It makes a great deal of sense. But implementation is an entirely different story. How do you approach this in a large global organization, and actually get started on this? To me, it's not so much daunting conceptually, but how do you get started? How do you implement?

Schreider: One of the reasons people come to HP is that we are a global organization. We have the ability to field 600 security consultants in over 80 countries and deliver with uniformity, regardless of where you are as a customer.

There is still a bit of work that goes in. Although we have the ISSM Reference Model, and we have a tremendous amount of methodology and collateral, we are not positioning ourselves as a cookie-cutter approach. We spend a good bit of time educating ourselves about where the customer is, understanding where their security program currently lies, and -- based on business direction and external drivers, for example, regulatory concerns -- where it needs to go.

We also want to understand where they want to be in terms of maturity range, according to the Capability Maturity Model (CMM). Once we learn all of that, then we come back to them and we create a road map. We say that, "Today, we view that you are probably at a maturity level of ‘One.’ Based upon the risk and threat profile of your organization, it is our recommendation that you be at a maturity level of ‘Three’."

We can put together process improvement plans that show them step-by-step how they move along the maturity continuum to get to a state that’s appropriate for their business model, their level of investment, and appetite for risk.
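The roadmap step Schreider outlines can be sketched as a simple gap computation between current and target CMM maturity levels, prioritizing the biggest gaps. The domains and levels below are invented for illustration; the discussion names only CMM levels in general.

```python
# Standard CMM level names; the per-domain scores are hypothetical.
CMM_LEVELS = {1: "Initial", 2: "Repeatable", 3: "Defined",
              4: "Managed", 5: "Optimizing"}

current = {"Identity management": 1, "Incident response": 2, "Governance": 1}
target  = {"Identity management": 3, "Incident response": 3, "Governance": 3}

# Order the improvement work so the largest maturity gaps come first.
roadmap = sorted(
    ((domain, current[domain], target[domain]) for domain in current
     if current[domain] < target[domain]),
    key=lambda item: item[2] - item[1],
    reverse=True,
)
for domain, cur, tgt in roadmap:
    print(f"{domain}: {CMM_LEVELS[cur]} ({cur}) -> {CMM_LEVELS[tgt]} ({tgt})")
```

The output is the kind of step-by-step plan described above: each domain, its current state, and the target state appropriate to the organization's risk appetite.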

Gardner: How would one ever know that they are done, that you are in a compliant state, that your risk has been mitigated? Is this a destination, or is it a journey?

Schreider: It's a journey, with stops along the way. If you are in the IT world -- compliance, risk management, continuity of operation -- it will always be a journey. Technology changes. Business models change. There are many aspects to an organization that require that they continually be moving forward in order to stay competitive.

We map out a road map, which is their journey, but we have very defined stops along the way. They may not ever need to go past a level of maturity of “Three,” for example, but there are things that have to occur for them to maintain that level. There's never a time when they can say, "Aha, we have arrived. We are completely safe."

Security is a mathematical model. As long as math exists, and as long as there are infinite numbers, there will be people who will be able to scientifically or mathematically define exploits to systems that are out there. As long as we have an infinite number of numbers we will always have the potential for a breach of security.

Gardner: I also have to imagine that this is a moving target. Seven years ago, we didn’t worry about Sarbanes-Oxley, ISO, and similar market pressures. We don’t know what’s going to come down the pike in a few years, perhaps especially in the financial vertical.

Is there something about putting this ISSM model in place that allows you to better absorb those unforeseen issues and/or compliance dictates? And is there a return on investment (ROI) benefit of setting up your model sooner rather than later?

Schreider: Absolutely. Historically, businesses throughout the world have lacked the discipline to self-regulate. So there is no question that the more onerous types of regulations are going to continue. That's what happened in the subprime [mortgage] arena, and the emphasis toward [mitigating] operational risk is going to continue and require organizations to have a greater level of due diligence and control over their businesses.

Businesses are run on technology, and technologies require security and continuity of operations. So, we understand that this is a moving target.

One of the things we have done with the ISSM Reference Model is to recognize that there has to be an internal framework or a controlled taxonomy that allows you to have a base root that never changes. What happens around you will always change, and regulations always change -- but how you manage your security program at its core will relatively stay the same.

Let me provide an example. If you have a process for hardening a server to make sure that the soft, chewy inside is less likely to be attacked by a hacker or compromised by malware, that process will improve over time as technology changes. But at the end of the day it is not going to fundamentally change, nor should it change, just because a regulation comes out. How you report on what you are doing is going to change almost on a daily basis.

So we have adopted the open standards of the ISO 27001 and 17799 security-control taxonomy. We have structured the internal framework of ISSM around 1,186 base controls that we have then mapped to virtually every industry regulation and standard out there.

As long as you are minding the store, if you will, which is the inventory of controls based on ISO, we can report out to any change at any regulatory level without having to reverse engineer or reorganize your security program. That level of flexibility is crucial for organizations. When you don't have to redo how you look at security every time a new regulation comes out, the cost savings are just obvious.
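The control-taxonomy idea can be sketched as a small inventory of base controls, each mapped to the external regulations it satisfies; reporting against any regulation then reads straight from the inventory, with no restructuring of the program. The control IDs and mappings below are invented for this example, not the actual 1,186-control set.

```python
# A tiny stand-in for an ISO-derived base control inventory.
# Each control carries its implementation status and the external
# regulations it has been mapped to (mappings are hypothetical).
controls = {
    "A.9.1":  {"status": "implemented", "maps_to": {"SOX", "PCI DSS"}},
    "A.10.4": {"status": "implemented", "maps_to": {"PCI DSS", "NERC CIP"}},
    "A.13.2": {"status": "missing",     "maps_to": {"SOX"}},
}

def report(regulation: str) -> dict:
    """Report compliance for one regulation directly from the base
    inventory, without reorganizing the security program."""
    relevant = {cid: c for cid, c in controls.items()
                if regulation in c["maps_to"]}
    met = [cid for cid, c in relevant.items()
           if c["status"] == "implemented"]
    return {"regulation": regulation,
            "controls": sorted(relevant),
            "met": sorted(met)}

print(report("SOX"))      # A.9.1 met, A.13.2 missing
print(report("PCI DSS"))  # both mapped controls implemented
```

When a new regulation arrives, only a new mapping is added; the base controls, and how they are operated, stay the same, which is the cost saving Schreider points to.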

Gardner: I suppose there is another analogy to IT, in that this is like a standardized component object model approach.

Schreider: Absolutely.

Gardner: Okay. How about examples of how well this works? Can you tell us about some of your clients, their experiences, or any metrics of success?

Schreider: Let me share with you as many different cross-industry examples that come to mind. One of the first early adopters of ISSM was one of the largest banks based in Mumbai, India.

One issue was that a great deal of their IT operation was outsourced. They were entering an area with a significant amount of regulatory oversight for security that had never existed before. They also had an environment where operational efficiencies were not necessarily viewed as positive. The cost of applying human resources to solve a problem or monitor something manually was virtually unlimited, because of the demographics of where the financial institution was located.

However, they needed to structure a program to manage the fact that they had literally hundreds of security professionals working in dozens of different areas of the bank, and they were all basically doing their own things, creating their own best practices, and they lacked sort of that middleware that brought them all together.

ISSM gave them the flexibility of a model that accounted for the fact that they could have a great number of security engineers without worrying so much about cost. What was important to them was that everyone was following the same set of standards and the same control model.

It worked very well in their example, and they were able to pass the audits of all of the new security regulations.

Another thing was, this organization was looking to do financial instruments with other financial organizations from around the world. They now had an internationally adopted, common control framework, in which they could provide some level of assurance that they were securing their technology in a manner that was aligned to an internationally vetted global and widely accepted standard.

Gardner: That brings to mind another issue. If I am that organization and I have gone through this diligence, and I have a much greater grasp on my risks and security issues, it seems to me I could take that to a potential suitor in a merger and acquisition situation.

I would be a much more attractive mate in terms of what they would need to assume, in terms of what they would be inheriting in regard to risk and security.

Schreider: Sure. When you acquire a company, not only do you acquire their assets, you also acquire their risk. And it’s not unusual for an organization not to pay any attention whatsoever to the threats and vulnerabilities that they are inheriting.

We have numerous stories of manufacturing or financial concerns that open up their network to a new company. They have never done a security assessment, and now, all of a sudden, they have a lot of Barbarians behind the firewall.

Gardner: Interesting. Any other examples of how this works?

Schreider: Actually, there are two others that I would like to talk about quickly. One of the largest public municipalities in the world was in the process of integrating all of its disparate 911 systems into a common framework. What they had, basically, was 700 pages of security controls spread over almost 40 different documents, with a lot of duplication, and they had expected all of their agencies to follow these for years.

What resulted was that there was no commonality of security approach. Every agency was out there negotiating their own deals with security providers, service providers, and product providers. Now that they were consolidating, they basically had a Tower of Babel.

One thing we were able to do with the ISSM Reference Model was to take all of these disparate control constructs, normalize them into our framework, and articulate a comprehensive end-to-end security approach that all of the agencies could then follow.

They had uniformity in terms of their security approaches, their people, their roles, responsibilities, policies, and how they would actually have common proof points to ensure that the key performance indicators and the metrics and the service-level agreements (SLAs) were all working in unity for one homogenized system.

Another example, and one that is rapidly growing within our security practice, is the utility industry. NERC has now passed a whole series of CIP standards and requirements for protecting critical cyber assets.

These just passed in January 2008. All U.S.-based utility organizations -- whether a water utility, an electric utility, or anybody providing and using a control system -- have to abide by these new standards. These organizations are very “stove-piped.” They operate in a very tightly controlled manner. Most of them have never had to worry about applying security controls at all.

Because of the malleability of the ISSM Reference Model, we now have an edition called the ISSM Reference Model Energy Edition. It comes preloaded with all of the NERC CIP standards. Very specific types of controls are built into the system, along with the policies, procedures, and workflows that are unique to the energy industry, as well as partnerships with products like N-Dimension, Symantec, and our own TCS-e product. We build a compliance portfolio to allow them to become NERC CIP-compliant.

Gardner: That brings to mind another ancillary benefit of the ISSM approach and that is business continuity. It is your being able to maintain business continuity through unforeseen or unfortunate issues with nature or man. What’s the relationship between the business continuity goals and what ISSM provides?

Schreider: There are many who will argue that security is just one facet of business continuity. If you look at continuity of operations and you look at where the disrupters are, it could be acts of man, natural disasters, breaches of security, and so forth. That’s why, when you look at our Service Management Framework, the availability, continuity, and security service management functions are all very closely aligned.

It's that cohesion that we bring to the table. How they intersect with one another, and how we have common workflows developed for the process in an organization gives the client a sense that we are paying attention to the entire continuum of continuity of business.

Gardner: So when you look at it through that lens, this also bumps up against business transformation and how you run your overall business across the board?

Schreider: Continuity of business, and security in particular, is an enabler for business transformation. There are organizations out there that could do so much better in their business model if they were able to figure out a way to get a higher degree of intimacy with their customer, but they can’t unless they can guarantee that transaction is secure.

Gardner: Well, great. We've learned a lot today about ISSM as a reference model for getting risk, security, and management together under a common framework, best practices and common controls approach.

I want to thank our guest, Tari Schreider, the chief security architect in the America’s Security Practice at HP’s Consulting & Integration Unit. We really appreciate your input. Tari, great to have you on the show.

Schreider: Thank you, Dana.

Gardner: I also want to thank our introducer, John Carchide, the worldwide governance solutions manager in the Security & Risk Management Practice, also within HP C&I. Thanks to you, John, as well.

Carchide: Thank you very much, Dana.

Gardner: This is Dana Gardner, principal analyst at Interarbor Solutions. You have been listening to a sponsored podcast discussion. This is the BriefingsDirect Podcast Network. Thank you for joining, and come back next time.

Listen to the podcast here. Sponsor: Hewlett-Packard.

Transcript of BriefingsDirect podcast on best practices for integrated security, risk and compliance approaches. Copyright Interarbor Solutions, LLC, 2005-2008. All rights reserved.

Monday, September 24, 2007

Probabilistic Analysis Predicts IT Systems Problems Before Costly Applications Outages

Edited transcript of BriefingsDirect[TM] podcast on probabilistic IT systems analysis and management, recorded Aug. 16, 2007.

Listen to the podcast here. Sponsor: Integrien Corp.

Dana Gardner: Hi, this is Dana Gardner, principal analyst at Interarbor Solutions, and you're listening to BriefingsDirect. Today our sponsored podcast focuses on the operational integrity of data centers, the high cost of IT operations, and the extremely high cost of application downtime and performance degradation.

Rather than losing control to ever-increasing complexity, and gaining less and less insight into the root causes of problematic applications and services, enterprises and on-demand application providers alike need to predict how systems will behave under a variety of conditions.

By adding real-time analytics to their systems management practices, operators can fully determine the normal state of how systems should be performing. Then, by measuring the characteristics of systems under many conditions over time, datacenter administrators and systems operators gain the ability to predict and prevent threats to the performance of their applications and services.

As a result they can stay ahead of complexity, and contain the costs of ongoing high-performance applications delivery.

This ability to maintain operational integrity through predictive analytics means IT operators can significantly reduce costs while delivering high levels of service.

Here to help us understand how to manage complexity by leveraging probabilistic systems management and remediation, we are joined by Mazda Marvasti, the CTO of Integrien Corp. Welcome to the show, Mazda.

Mazda Marvasti: Thank you, Dana.

Gardner: Why don’t we take a look at the problem set? Most people are now aware that their IT budgets are strained just by ongoing costs. Whether they are in a medium-sized company, large enterprise, or service-hosting environment, some 70 percent to 80 percent of budgets are going to ongoing operations.

That leaves very little left over for discretionary spending. If you have constant change or a dynamic environment, you're left with few resources to tap in order to meet a dynamic market shift. Can you explain how we got to this position? Why are we spending so much just to keep our heads above water in IT?

Marvasti: When we started in the IT environment, if you remember the mainframe days, it was pretty well defined. You had a couple of big boxes. They ran a couple of large applications. It was well understood. You could collect some data from it, so you knew what was going on within it.

We graduated to the client-server era, where we had more flexibility in terms of deployment -- but with that came increasing complexity. Then we moved ahead to n-tier Web applications, and we had yet another increase in complexity. A lot of products came in to try to alleviate that complexity through deep data collection, and management systems grew to cover data collection across an entire enterprise, but the complexity was still there.

Now, with service-oriented architecture (SOA) and virtualization moving into application-development and data-center automation, there is a tremendous amount of complexity in the operations arena. You can’t have the people who used to have the "tribal knowledge" in their head determining where the problems are coming from or what the issues are.

The problems and the complexity have gone beyond the capability of people just sitting there in front of screens of data, trying to make sense out of it. So, as we gained efficiency in application development, we needed consistency of performance and availability, but all of this added to the complexity of managing the data center.

That's how the evolution of the data center went from being totally deterministic -- meaning that you knew every variable, could measure it, and had very specific rules telling you, if certain things happened, what they were and what they meant -- all the way to a non-deterministic era, which is where we are right now.

Now, you can't possibly know all the variables, and the rules that you come up with today may be invalid tomorrow, all just because of change that has gone on in your environment. So, you cannot use the same techniques that you used 10 or 15 years ago to manage your operations today. Yet that’s what the current tools are doing. They are just more of the same, and that’s not meeting the requirements of the operations center anymore.

Gardner: At the same time, we are seeing that a company’s applications are increasingly the primary way that it reaches out to its sell side, to customers -- as well as its buy side, to its supply chain, its partners, and ecology. So applications are growing more important. The environment is growing more complex, and the ability to know what’s going on is completely out of hand.

Marvasti: That’s right. You used to know exactly where your application was, what systems it touched, and what it was doing. Now, because of the demand of the customers and the demands of the business to develop applications more rapidly, you’ve gone into an SOA era or an n-tier application era, where you have a lot of reusability of components for faster development and better quality of applications -- but also a lot more complexity in the operations arena.

What that has led to is that you no longer even know in a deterministic fashion where your applications might be touching or into what arenas they might be going. There's no sense of, "This is it. These are the bounds of my application." Now it’s getting cloudier, especially with SOA coming in.

Gardner: We've seen some attempts in the conventional management space to keep up with this. We've been generating more agents, putting in more sniffers, applying different kinds of management. And yet we still seem to be suffering the problems. What do you think is the next step in terms of coming to grips with this -- perhaps on a holistic basis -- so that we can get as much of the picture as possible?

Marvasti: The "business service" is the thing that the organization offers to its customers. It runs through their data center, IT operations, and the business center. It goes across multiple technology layers and stacks. So having data collection at a specific stack or for a specific technology silo, in and of itself, is insufficient to tell you the problems with the business service, which is what you are ultimately trying to get to. You really need to do a holistic analysis of the data from all of the silos that the business service runs through.

You may have some networking silos, where people are using specific tools to do network management -- and that’s perfectly fine. But then the business service may go through some Web tier, application tier, database tier, or storage -- and then all of those devices may be virtualized. There may be some calls to a SOA.

There are deep-dive tools to collect data and report upon specifics of what maybe going on within silos, but you really need to do an analysis across all the silos to tell you where the problems of the business service may be coming from. The interesting thing is that there is a lot of information locked into these metrics. Once correlated across the silos, they paint a pretty good picture as to the impending problem or the root cause of what a problem may be.

By looking at individual metrics collected via silos you don’t get as full a picture as if you were to correlate that individual metric with another metric in another silo. That paints a much larger picture as to what may be going on within your business service.
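To make the cross-silo idea concrete, here is a minimal sketch (an illustration only, not Integrien's actual analytics; the silo and metric names are invented) of ranking metric pairs drawn from different silos by how strongly they move together:

```python
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def correlated_pairs(silos, threshold=0.9):
    """Rank metric pairs from *different* silos by correlation strength.

    `silos` maps silo name -> {metric name -> list of samples}.
    """
    flat = [(silo, name, series)
            for silo, metrics in silos.items()
            for name, series in metrics.items()]
    pairs = []
    for (sa, na, xs), (sb, nb, ys) in combinations(flat, 2):
        if sa == sb:          # only cross-silo pairs are interesting here
            continue
        r = pearson(xs, ys)
        if abs(r) >= threshold:
            pairs.append((f"{sa}.{na}", f"{sb}.{nb}", r))
    return sorted(pairs, key=lambda p: -abs(p[2]))
```

On a toy data set where database lock waits track Web latency, the database/Web pair surfaces while a weakly related error series does not -- the kind of cross-silo relationship described above.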

Gardner: So if we want to gather insights and even predictability into the business-service level -- that higher abstraction of what is productive -- we need to go in and mine this data in this context. But it seems to me that it's in many different formats. From that "Tower of Babel," how do you create a unified view? Or are you creating metadata? What's the secret sauce that gets you from raw data to analysis?

Marvasti: One misperception is that, "I need to have every metric that I collect go into a magical box that then tells me everything I need to know." In fact, you don't need every metric. There is a great deal of information locked in the correlations between metrics. We've seen at our customers that a gap in monitoring in one silo can often be compensated for by data collection in other silos.

So, if you have a monitoring system already -- IBM Tivoli, as an example -- and you are collecting operating-system metrics, you may have one or two other application-specific metrics that you are also collecting. That may be enough to tell you everything that is going on within your business service. You don't need to go to the nth degree of data collection and harmonization of that data into one data repository to get a clear picture.

Even starting with what you’ve got now, without having to go very deep, what we’ve seen in our customers is that it actually lights up a pretty good volume of information in terms of what may be going on across the silos. They couldn't achieve that by just looking at individual metrics.

Gardner: It’s a matter of getting to the right information that’s going to tell you the most across the context of a service?

Marvasti: To a certain degree, but a lot of times you don’t even know what the right metrics are. Basically I go to our customers and say, "What do you have?" Let’s just start with that, and then the application will determine whether you actually have gaps in your monitoring or whether these metrics that you are collecting are the right ones to solve those specific problems.

If not, we can figure out where the gaps may be. A lot of times, customers don’t even know what the right metrics are. And that’s one of the mental shifts of thinking deterministically versus probabilistically.

Deterministically is, "What are the right metrics that I need to collect to be able to identify this problem?" In fact, what we’ve found out is that a particular problem in a business service can be modeled by a group or a set of metric event conditions that are seemingly unrelated to that problem, but are pretty good indicators of the occurrence of that problem.

When we start with what they have, we often point out that there is a lot more information within that data set. They don’t really need to ask, "Do I have the right metrics or not?"

Gardner: Once you’ve established a pretty good sense of the right metrics and the right data, then I suppose you need to apply the right analysis and tools. Maybe this would be a good time for you to explain about the heritage of Integrien, how it came about, and how you get from this deterministic to more probabilistic or probability-oriented approach?

Marvasti: I’ve been working on these types of problems for the past 18 years. Since graduate school, I’ve been analyzing data extraction of information from disparate data. I went to work for Ford and General Motors -- really large environments. Back then, it was client-servers and how those environments were being managed. I could see the impending complexity, because I saw the level of pressure that there was on application developers to develop more reusable code and to develop faster with higher quality.

All that led to the Web application era. Back then, I was the CTO of a company here in the Los Angeles area. One problem I had was that only a few people held the tribal knowledge to manage and run the systems, and that was very scary to me. I couldn't rely on so few people to keep the business running continuously.

So I started looking at management systems, because I thought it was probably a solved problem. I looked at a lot of management tools out there, and saw that it was mainly centered on data collection, manual rule writing, and better ways of presenting the same data over and over.

I didn’t see any way of doing a deep analysis of the data to bring out insights. That’s when I and my business partner Al Eisaian, who is our CEO, formed a company to specifically attack this problem. That was in 2001. We spent a couple of years developing the product, got our first set of customers in 2003, and really started proving the model.

One of the interesting things is that if you have a small environment, your tendency is to think that it's small enough that you can manage it, and that actually may be true. You develop some specific technical knowledge about your systems and you can move from there. But in larger environments, where there is so much change happening, it becomes impossible to manage things that way.

A product like ours almost becomes a necessity, because we’ve transitioned from people knowing in their heads what to do, to not being able to comprehend all of the things happening in the data center. The technology we developed was meant to address this problem of not being able to make sense of the data coming through, so that you could make an intelligent decision about problems occurring in the environment.

Gardner: Clearly a tacit-knowledge approach is not sufficient, and just throwing more people at it is not going to solve the problem. What's the next step? How do we get to a position where we can gather and then analyze data in such a way that we get to that Holy Grail, which is predictive, rather than reactive, response?

Marvasti: Obviously, the first step is collecting the data. Without the data, you can’t really do much. A lot of investment has already gone into data collection mechanisms, be it agent-based or agent-less. So there is data being collected right now.

The missing piece is the utilization of that data and the extraction of information from that data. Right now, as you said at the beginning of your introduction, a lot of cost is going toward keeping the lights on at the operations center. That’s typically people cost, where people are deployed 24/7, looking at monitors, looking at failures, and then trying to do postmortem on the problem.

This does require a bit of a mind shift from deterministic to probabilistic. The reason is that a lot of things have been built to make the operations center do a really good job of cleaning up after an accident, but not a lot of thought has been put into what to do if you're forewarned of an accident, before it actually happens.

Gardner: How do I intercede? How do I do something?

Marvasti: How do I intercede? What do I do? What does it mean? For example, one of the outputs from our product is a predictive alert that says, "With 80 percent confidence, this particular problem will occur within the next 15 minutes." Well, nothing has happened yet, so what does my run book say I should do? The run book is missing that information. The run book only has the information on how to clean it up after an accident happens.

That’s the missing piece in the operations arena. Part of the challenge for our company is getting the operations folks to start thinking in a different fashion. You can do it a little at a time. It doesn’t have to be a complete shift in one fell swoop, but it does require that change in mentality. Now that I am actually forewarned about something, how do I prevent it, as opposed to cleaning up after it happens.
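As a hedged illustration of how an alert like "with 80 percent confidence, this problem will occur within the next 15 minutes" might be derived (a toy sketch, not Integrien's actual algorithm; the event names are hypothetical), one can estimate the conditional probability of a problem following a precursor event in a historical log:

```python
def precursor_confidence(events, precursor, problem, window):
    """Estimate P(problem within `window` minutes | precursor observed)
    from a historical log of (minute, event_name) pairs."""
    precursor_times = [t for t, e in events if e == precursor]
    problem_times = [t for t, e in events if e == problem]
    if not precursor_times:
        return 0.0
    hits = sum(
        any(t < p <= t + window for p in problem_times)
        for t in precursor_times)
    return hits / len(precursor_times)

def predictive_alert(events, precursor, problem, window, min_conf=0.75):
    """Emit an alert string once the learned confidence is high enough."""
    conf = precursor_confidence(events, precursor, problem, window)
    if conf >= min_conf:
        return (f"With {conf:.0%} confidence, '{problem}' will occur "
                f"within the next {window} minutes.")
    return None
```

The point of the sketch is the shift in posture: the alert fires before the problem occurs, which is exactly the situation a traditional run book does not cover.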

Gardner: When we talk about operational efficiency, are we talking about one or two percent here and there? Is this a rounding error? Are we talking about some really wasteful practices that we can address? What’s the typical return on investment that you are pursuing?

Marvasti: It’s not one or two percent. We're talking about a completely different way of managing operations. After a problem occurs, you typically have a lot of people on a bridge call, and then you go through a process of elimination to determine where the problem is coming from, or what might have caused it. Once the specific technology silo has been determined, then they go to the experts for that particular silo to figure out what’s going on. That actually has a lot of time and manpower associated with it.

What we're talking about is being proactive, so that you know something is about to happen, and we can tell you to a certain probability where it's going to be. Now you have a list of low-hanging fruit to go after, as opposed to pulling in everybody in the operations center to try to get the problem fixed.

The first order of business is, "Okay, this problem is about to occur, and this is where it may occur. So, that’s the guy I’m going to engage first." Basically, you have a way of following down from the most probable to the least probable, and not involving all the people that typically get involved in a bridge call to try to resolve the issues.

One gain is the reduction in mean time to identify where the problem is coming from. The other is not having all of those people on these calls. This reduces the man-hours associated with root-cause determination and source identification of the problem. In different environments you're going to see different percentages, but in one environment I saw firsthand, at one of the largest health-care organizations, roughly 20 to 30 percent of cost was associated with people being on bridge calls on a continuous basis.

Gardner: Now, this notion of "management forensics," can you explain that a little bit?

Marvasti: One of the larger problems in IT is actually getting to the root cause of problems. What do you know? How do you know what the root cause is? Oftentimes, something happens, and the necessity of getting the business service back up forces people to reboot the servers and worry later about figuring out what happened. But when you do that, you lose a lot of information that would have been very helpful in determining the root cause.

The forensic side of it is this: The data is collected already, so we already know what it is. If you have the state when a problem occurred, that's a captured environment in the database that you can always go back to.

What we offer is the ability to walk back in time, without having the server down while you are doing your investigation. You can bring the server back up, come back to our product, and then walk back in time to see exactly what the leading indicators of the problem you experienced were. Using those leading indicators, you can get to the root cause very quickly. That eliminates the guesswork of where to start, reduces the time to get to the root cause, and may even let you prevent the problem.

Sometimes you only have so much time to work on something. If you can’t solve it by that time, you move on, and then the problem occurs again. That's the forensic side.

Gardner: We talked earlier about this notion of capturing the normal state, and now you've got this opportunity to capture an abnormal state. You can compare and contrast. Is that something that you use on an ongoing basis to come up with these probabilities? Or is the probability analysis something different?

Marvasti: No, that’s very much part and parcel of it. What we do is look to see what is the normal operating state of an environment. Then it is the abnormalities from that normal that become your trigger points of potential issues. Those are your first indicators that there might be a problem growing. We also do a cross-event analysis. That’s another probability analysis that we do. We look at patterns of events, as opposed to a single event, indicating a potential problem. One thing we've found is that events in seemingly unrelated silos are very good indicators of a potential problem that may brew some place else.

In that kind of analysis, what's abnormal relative to normal becomes your first indicator. Then a cross-event analysis, showing which patterns indicate a particular problem, takes you from normal-state monitoring all the way to a problem-prevention scenario.
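The "normal, then abnormal" trigger can be sketched in a few lines (a deliberate simplification, assuming a simple mean-and-spread baseline; a real product would learn far richer models):

```python
from statistics import mean, stdev

def learn_baseline(history):
    """Learn a per-metric 'normal' band (mean and spread) from past samples."""
    return mean(history), stdev(history)

def out_of_normal(value, baseline, n_sigmas=3.0):
    """An out-of-normal event: the reading leaves the learned normal band."""
    mu, sigma = baseline
    return abs(value - mu) > n_sigmas * sigma
```

For example, a CPU metric that has hovered around 42 percent makes a reading of 95 an out-of-normal event, with no hand-written threshold rule involved.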

Gardner: There has to be a cause-and-effect. As much as we would like to imagine ghosts in the machines, that’s really not the case. It's simply a matter of tracking it down.

Marvasti: Exactly. The interesting thing is that you may be measuring a specific metric that is a clear indicator of a problem, but it is oftentimes some other metric on another machine that gets to be out of normal first, before the cause of the problem surfaces in the machine in question. So early indicators to a problem become events that occur some place else, and that’s really important to capture.

When I was talking about the cross-silo analysis, that’s the information that it brings out. It gives you lot more "heads-up" time to a potential problem than if you were just looking at a specific silo.

Gardner: Of course, each data center is unique, each company has its own history and legacy, and its IT department has evolved on its own terms. But is there any general crossover analysis? That is to say, is there a way of having an aggregate view of things and predicting things based on some assumptions, because of the particular systems that are in use? Or, is it site by site on a detailed level?

Marvasti: Every customer that I have seen is totally different. We developed our applications specifically to be learning-based, not rules-based. By that I mean they start with no preconceived notion of what an environment may look like. If you have such a notion, and the environment doesn't match it, you're going to send a lot of false positives -- which we definitely did not want to do.

Ours is a purely learning-based system. That means that we install our product, it starts gathering the metrics, and then it starts learning how your systems look and behave. Then, based on that behavior, it starts formulating the out-of-normal conditions that can lead to problems. That becomes unique to the customer environment. It is an advantage, because what you get actually adapts itself to your environment.

For example, it learns your change-management patterns. If you have change windows occurring, it learns those change windows. It knows that those change windows occur without anybody having to enter anything into the application. When you are doing a wholesale upgrade of devices, it knows that change is coming, because it has learned your patterns.
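One naive way to picture how recurring change windows could be learned from history alone (purely illustrative; Integrien's actual method is not described in the interview) is to look for hours of the week in which changes have landed in most of the observed weeks:

```python
def learn_change_windows(change_hours, min_share=0.5):
    """Infer recurring weekly change windows from past change timestamps.

    `change_hours` are hours since the start of observation; a window is
    an hour-of-week (0..167) that saw changes in at least `min_share`
    of the observed weeks."""
    n_weeks = max(change_hours) // 168 + 1
    weeks_seen = {}
    for h in change_hours:
        weeks_seen.setdefault(h % 168, set()).add(h // 168)
    return sorted(hour for hour, weeks in weeks_seen.items()
                  if len(weeks) / n_weeks >= min_share)
```

A change that recurs at the same hour every week is recognized as a window, while a one-off change is not -- so activity inside a learned window need not raise an out-of-normal event.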

The downside is that it takes two to three weeks of gathering your data and learning what has been happening before it becomes useful. The upside is that you get something that completely maps to your business, as opposed to having to map your business onto a product.

Gardner: The name of your product set is Alive, is that correct?

Marvasti: That’s correct.

Gardner: I understand you are going to have a release coming out later this year, Alive 6.0?

Marvasti: That’s correct.

Gardner: I don't expect you to pre-release, but perhaps you can give us some sense of the direction that the major new offerings within the product set will take. What are they directed toward? Can you give us a sneak peek?

Marvasti: Basically, we have three pillars that the product is based on. First is usability. That's a particular pet peeve of mine. I didn't find any of the applications out there very usable. We have spent a lot of time working with customers and working with different operations groups, trying to make sure that our product is actually usable for the people that we are designing for.

The second piece is interoperability. The majority of the organizations that we go to already have a whole bunch of systems, whether data-collection systems, event-management systems, or configuration management databases. Our product absolutely needs to leverage those investments -- and they are leverageable. But those investments, sitting in their silos, don't produce as much benefit to the customer as a product like ours going in, utilizing all of the data they hold, and bringing out the information that's locked within it.

The third piece is analytics. What we have in the product coming out is scalability to 100,000 servers. We've kind of gone wild on the scalability side, because we are designing for the future. Nobody that I know of right now has that kind of a scale, except maybe Google, but theirs is basically the same thing replicated thousands of times over, which is different than the enterprises we deal with, like banks or health-care organizations.

A single four-processor Xeon box, with Alive installed on it, can run real-time analytics for up to 100,000 devices. That’s the level of scale we're talking about. In terms of analytics, we've got three new pieces coming out, and basically every event we send out is a predictive event. It’s going to tell you this event occurred, and then this other set of events have a certain probability within a certain timeframe to occur.

Not only that, but we can then match it to what we call our "fingerprinting." Fingerprinting is a pattern-matching technology that allows us to look at patterns of events and associate them with a particular problem. Those patterns indicate particular problems, and they become predictive alerts for other problems.
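A toy version of fingerprint matching (illustrative only; the problem names and event labels here are invented) might score the currently active set of out-of-normal events against stored problem signatures by set overlap:

```python
def jaccard(a, b):
    """Similarity of two event sets (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b)

def match_fingerprints(active_events, fingerprints, threshold=0.6):
    """Match currently active out-of-normal events against stored problem
    fingerprints; return (problem, score) pairs above the threshold,
    strongest match first."""
    scored = [(problem, jaccard(active_events, signature))
              for problem, signature in fingerprints.items()]
    return sorted([(p, s) for p, s in scored if s >= threshold],
                  key=lambda ps: -ps[1])
```

When the active events reproduce a known pattern, the matching fingerprint identifies the likely problem, which is what turns a bag of scattered events into a predictive alert.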

What’s coming out in the product is really a lot of usability, reporting capabilities, and easier configurations. Tens of thousands of devices can be configured very quickly. We have interoperability -- Tivoli, OpenView, Hyperic -- and an open API that allows you to connect to our product and pump in any kind of data, even if it’s business data.

Our technology is context agnostic. What that means is that it does not have any understanding of applications, databases, etc. You can even put in business-type data and have it correlated with your IT data, and extract information that way as well.

Gardner: You mentioned usability. Who are the typical users and buyers of a product like Integrien Alive? Who is your target audience?

Marvasti: The typical user would be at the operations center. The interesting thing is that we have seen a lot of different users come in after the product is deployed. I've seen database administrators use our product, because they like to see the normal behavior of their databases. So they run the analytics on database-type metrics and get information that way.

I've seen application folks who want to have more visibility in terms of how this particular application is impacting the database. They become users. But the majority of users are going to be at the operations center -- people doing day-to-day event management and who are responsible for reducing the mean time to identify where the problems come from.

The typical buyers are directors or VPs of IT operations. We are really on the operations side, as opposed to the application-development side.

Gardner: Do you suppose that in the future, when we get more deeply into SOA and virtualization, that some of the analysis that is derived through Integrien Alive becomes something that’s fed into a business dashboard, or something that’s used in governance around how services are going to be provisioned, or how service level agreements are going to be met?

Can we extrapolate as to how the dynamics of the data center and then the job of IT itself changes, on how your value might shift as well?

Marvasti: That link between IT and the business is starting to occur. I definitely believe that our product can play a major part in illuminating what in the business side gets impacted by IT. Because we are completely data agnostic, you can put in IT-type data, business-type data, or customer data -- and have all of it be correlated.

You then have one big holistic view as to what may impact what. ... If this happens, what else might happen? If I want to increase this, what are the other parameters that may be impacted?

So, you know what you want to play from the business side in terms of growth. Having that, we project how IT needs to change in order to support that growth. The information is there within the data and the very fact that we are completely data agnostic allows us to do that kind of a multi-function analysis within an enterprise.

Gardner: It sounds like you can move from an operational efficiency value to a business efficiency value pretty quickly?

Marvasti: Absolutely. Our initial target is the operations arena, because of the tremendous amount of inefficiencies there. But as we move into the future, that’s something we are going to look into.

Gardner: We mentioned Alive 6.0. Do you have a ball-park figure on when that’s due? Is it Q4 of 2007?

Marvasti: We are going to come out with it in 2007, and it will be available in Q4.

Gardner: Well, I think that covers it, and we are just about out of time. I want to thank Mazda Marvasti, the CTO of Integrien, for helping us understand more about the notion of management forensics and probabilistic, rather than deterministic, analysis.

We have been seeking to understand better how to address high costs, and inefficiencies in data centers, as well as managing application performance -- perhaps in quite a different way than many companies have been accustomed to. Is there anything else you would like to add before we end, Mazda?

Marvasti: No, I appreciate your time, Dana, and thank you for your good questions.

Gardner: This is Dana Gardner, principal analyst at Interarbor Solutions. You have been listening to a sponsored BriefingsDirect podcast. Thanks, and come back next time.

Listen to the podcast here. Sponsor: Integrien Corp.

Transcript of BriefingsDirect podcast on systems management efficiencies and analytics. Copyright Interarbor Solutions, LLC, 2005-2007. All rights reserved.