
Wednesday, December 01, 2021

How Houwzer Speeds Growth and Innovation for Online Real Estate by Gaining Insights into API Use and Behavior


Transcript of a discussion on how a cloud-based home-brokerage-enabler, Houwzer, constructed a resilient API-based platform as the heart of its services integration engine.

Listen to the podcast. Find it on iTunes. Download the transcript. Sponsor: Traceable AI.

Dana Gardner: Hi, this is Dana Gardner, Principal Analyst at Interarbor Solutions, and you’re listening to BriefingsDirect.

Complexity and security challenges can hobble the growth of financial transactions for private-data-laden, consumer-facing software-as-a-service (SaaS) applications. Add to that the need to deliver user experiences that are simple, intuitive, and personalized -- and you have a thorny thicket of software development challenges.

Stay with us now as we explore how streamlined and cost-efficient home-brokerage-enabler Houwzer constructed a resilient application programming interface (API)-based platform as the heart of its services integration engine for buying and selling real estate online.

To learn how Houwzer makes the most of APIs and protects its user data while preventing vulnerabilities, please welcome Greg Phillips, Chief Technology Officer (CTO) at Houwzer. Welcome, Greg.

Greg Phillips: Thanks, Dana. It’s nice to be here.

Gardner: Greg, what does Houwzer do, and why is an API-intensive architecture core to your platform?

Phillips: We are more than just a real estate brokerage. We’re also a mortgage brokerage and a title agency. The secret sauce for that is our technology platform, which binds those services together and creates a seamless, end-to-end experience for our consumers, whether they are buying or selling a home.

Those services are typically fragmented among different companies, which can lead to an often-chaotic transaction. We streamline all of that into a much smoother experience with our salaried agents and a consistent technology platform across the whole transaction. We are rethinking how real estate transactions are done by making it a better experience across the board, inclusive of all those services.

Early on, we decided to build, essentially, a protocol for conducting real estate transactions. There are laws and regulations for how to conduct such transactions in different jurisdictions. Instead of having an unmanageable variety of local rules and regulations -- in one area they’re doing it one way, and in another area doing it another way -- we looked for the common elements for doing real estate transactions, mortgages, and titles.

We’ve built into our system these common elements around real estate transactions. From there, we can localize to the local jurisdictions to provide the end services. But we still have a consistent experience across the country in terms of offering services.

That’s why we began with an API-first architecture. We focused on the protocols and building-blocks of the platform that we offered to our agents, coordinators, and mortgage advisers for their services. Then we layered on the front end, which has a lot more localization and other services. So, we very intentionally thought about it as a protocol for conducting real estate transactions, rather than building an app to manage just specific types of real estate transactions in specific jurisdictions.

Gardner: When you say API-first, what do you mean? Was that how you constructed your internal platform? How you deliver the services? Was it also for the third-party and internal integration points? All of the above?

Real estate transactions gain flexibility

Phillips: All of the above, yes. We wanted to build an API core that was flexible enough to support lots of variants within different types of real estate transactions. We’re already in seven states. And we’re still pretty early in our journey. We’re going to be adding more states and jurisdictions.

We wanted to build an API core that was flexible enough to support lots of variants within different types of real estate transactions. ... We put a lot of thought into our data model and our API platform.

So, we knew from the get-go that was our direction. We put a lot of thought into our data model and our API platform, such that we wouldn’t have to rewrite or break up the APIs every time we entered a new jurisdiction. We wanted a flexible underlying API that we could use to offer a finished product, even though it might look a bit different in Maryland than it does in Pennsylvania, for example.
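To make that concrete, here is a minimal, hypothetical sketch of what a jurisdiction-flexible transaction model might look like. The field and rule names are invented for illustration and are not Houwzer’s actual schema; the point is that the common protocol stays fixed while per-jurisdiction rules layer on top.

    from dataclasses import dataclass, field

    @dataclass
    class Transaction:
        """Common elements shared by every real estate transaction."""
        listing_id: str
        buyer_id: str
        seller_id: str
        price_cents: int
        jurisdiction: str  # e.g., "PA" or "MD"
        documents: dict = field(default_factory=dict)

    # Hypothetical per-jurisdiction rules that localize the shared protocol.
    JURISDICTION_RULES = {
        "PA": {"required_documents": ["seller_disclosure", "agreement_of_sale"]},
        "MD": {"required_documents": ["residential_contract", "property_disclosure"]},
    }

    def missing_documents(txn):
        """Return the documents the local jurisdiction still requires."""
        required = JURISDICTION_RULES[txn.jurisdiction]["required_documents"]
        return [doc for doc in required if doc not in txn.documents]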

Gardner: It’s evident that such flexibility, speed of development, and reuse of services are some of the good things about APIs. But are there any downsides? What can detract from that versatility when going API-first?

Phillips: One of the downsides of an API is that once you put it out there into the world, you are supporting that API, for better or worse. Things get built against it. And if you want to change or rethink what you do with that API, you have downstream dependencies reliant on it.

It’s not like working in a single code base, where if you want to refactor, you can use your IDE to discover all the things that might break if you change something. With the API model, it’s harder to know exactly who’s out there using it, or what might break if you change the API.


That means there’s a semi-permanence to an API. That’s somewhat unique in the software development realm where things typically move with a lot of flux. We have libraries that are updating all the time, especially in the JavaScript ecosystem. Things are going a mile a minute.

When you deliver an open API, you have to be more thoughtful about what you put out there ahead of time, because it is harder to change, harder to version, and harder to migrate. It’s by definition something you’ve chosen to set in stone, at least for some period of time, so people can build against it.
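One common way to honor that semi-permanence is to freeze published endpoints behind an explicit version prefix, so existing consumers keep working while new behavior ships under a new version. A minimal sketch, assuming a FastAPI-style service; the routes and fields are illustrative, not Houwzer’s actual API.

    from fastapi import FastAPI

    app = FastAPI()

    @app.get("/v1/listings/{listing_id}")
    def get_listing_v1(listing_id: str):
        # The v1 contract is frozen: existing integrations rely on these exact fields.
        return {"id": listing_id, "status": "active"}

    @app.get("/v2/listings/{listing_id}")
    def get_listing_v2(listing_id: str):
        # New fields ship in v2; v1 stays untouched until consumers migrate.
        return {"id": listing_id, "status": "active", "jurisdiction": "PA"}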

Our API interacts with third parties. The vast majority of the usage of our API is for our internal front-end application. It’s not like we have tons of different stakeholders on the API. But we need to factor in those third parties and partners.

Obviously, then, security is another huge undertaking when you put an API out there. This is not an API that is just sitting behind a firewall. This is an API on the Internet for conducting real estate transactions, which are highly sensitive transactions. So, obviously, security is a huge concern when building an API.

Gardner: With so many different parties involved in real estate transactions, to get people to rely on Houwzer as a hub, there needs to be an element of trust. Not just trust about performance, but trust that the activity is going to be safe, and privacy is assured.

What did you do to bring that level of resiliency to your API? How did you troubleshoot your own API to make sure that others would view it favorably?

Keep data safe from start to finish

Phillips: From the very beginning, we’ve been really concerned with security. Even before we had any transactions running through the system -- and we were just in the design phases of the API -- we knew we’re in an industry that’s constantly under attack.

The most common and dangerous thing that happens in the real estate brokerage industry is when some non-public information about a transaction is somehow leaked. There are a lot of criminals out there who can use that information to attempt to exploit our customers. For example, if they find out when a closing is supposed to be and the name of the title company, they could pose as an agent of that title company and say, “Hey, for your upcoming closing, the wiring instructions have changed. You actually need to wire ‘here’ instead of ‘there’.”

There are a lot of criminals out there who can use that information to attempt to exploit our customers. It's been a huge problem in the industry. We need to make sure that the information stays private.

We’ve seen brokerages across the country fall victim to that consistently over the past five to 10 years, if not longer. It’s been a huge problem in the industry. So, while an API enables a great user experience by having very streamlined transactions, we need to make sure that the information stays private to only our clients, agents, and coordinators -- and not leak any of that data to the public through the API. That’s been paramount for us.

As far as performance goes, we’ve been fortunate that our business has relatively few high-value transactions. We haven’t had to achieve super-scale yet with our APIs. Our security concerns are a 10, but our scalability concerns, fortunately, are at a two. So far, it’s not open to the masses. It’s more of a premium service for a smaller audience than a free service on the Internet.

Gardner: Given the need for that high level of security, you can’t depend on just the perimeter security tools. You need to look at different ways of anticipating vulnerabilities to head them off.

Phillips: Yes. You must be aware of what you’re putting out into the world. You must assume the worst about who is going to interact with your API, and make sure there is no way for an unauthorized person to gain access to information they’re not privy to.

Since the beginning of building this platform, that kept me up at night. One of the things that ultimately led me to Traceable AI was that I wanted to effectively gain more confidence about how my APIs were being used out in the world. You try to anticipate as much as you can when you’re building it.

You reason: “Okay, who’s going to be calling on this? We don’t want to expose any additional information here. We want to have just the information needed, with no additional information that might leak out. We want really strong access controls on each API request, such as what parameters will be accepted, what will be updated, and what will show in each scenario based on all the different users’ rules.”
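The kind of per-role control described here can be expressed as explicit allow-lists for what each role may submit and see on a given endpoint. This is a hypothetical sketch; the role and field names are invented and are not Houwzer’s.

    ROLE_POLICY = {
        "client": {"writable": {"offer_amount"},
                   "readable": {"offer_amount", "closing_date"}},
        "agent": {"writable": {"offer_amount", "closing_date"},
                  "readable": {"offer_amount", "closing_date", "commission"}},
        "coordinator": {"writable": {"closing_date"},
                        "readable": {"offer_amount", "closing_date", "wiring_status"}},
    }

    def apply_policy(role, incoming, record):
        """Drop parameters the role may not write; strip fields it may not read."""
        policy = ROLE_POLICY[role]
        accepted = {k: v for k, v in incoming.items() if k in policy["writable"]}
        visible = {k: v for k, v in record.items() if k in policy["readable"]}
        return accepted, visible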

Obviously, that’s a lot to keep track of. And you always worry there is some misalignment or misconfiguration that you’re missing somewhere. You want to be able to monitor how the API is getting used -- and, essentially, have an artificial intelligence (AI) capability look for that type of thing in addition to your ability to query for it.

That has been very attractive for us. It’s given us a lot of confidence that, in practice, we are not leaking data. It’s an additional level of validation. Instead of enforcing a perimeter and not letting anybody in, we’re very careful about what we put out there beyond the perimeter. And not only are we careful about what we put out there beyond the perimeter, we’re also monitoring it very closely, which I think is key.

Monitor who’s doing what, where, and when

Gardner: Such monitoring gives you the opportunity to create a baseline of behaviors, so that even for unintended consequences of how people use your API, you have a data record. And you’re doing it at scale because there’s a lot of data involved that humans couldn’t keep up with. Instead, you have machine learning (ML) and AI technologies to bring to bear on that.

What have you learned from that capability to observe and trace to such a high degree?

Phillips: We have discovered a few vulnerabilities that we weren’t aware of. So, there were areas where we were exposing, or potentially exposing, more information than we meant to through a given API endpoint. That was identified and fixed.


We’ve also seen some areas where people have tried to attack us. Even though we don’t have the vulnerability, we’ve seen malicious actors hitting our API, attempting to do a SQL injection, for example, or attempting to read a file on the file system, or to run a command on the system. You can actually see that stuff and observe how they’re doing it without having to parse through raw API requests, which aren’t humanly readable. Those are the first order of insights we’ve gained.
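Platforms like Traceable AI surface these attempts automatically, but the underlying idea can be shown with a crude pattern check over request parameters. This is a toy sketch only; real detection relies on far richer behavioral signals than keyword matching.

    import re

    SUSPICIOUS_PATTERNS = [
        re.compile(r"('|\")\s*(or|and)\s+\d+=\d+", re.IGNORECASE),  # classic SQL injection probe
        re.compile(r"\.\./|/etc/passwd"),                           # path traversal / file read
        re.compile(r";\s*(cat|ls|curl|wget)\b"),                    # command injection attempt
    ]

    def flag_suspicious(params):
        """Return the names of request parameters whose values match a known attack pattern."""
        return [name for name, value in params.items()
                if any(p.search(str(value)) for p in SUSPICIOUS_PATTERNS)]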

The second order of things we’ve seen are also very interesting. We can look at the API requests segmented by our users and our user roles. That means learning what API requests our clients, agents, and coordinators tend to make. We can now examine how these different stakeholders interact with the API. It has been really interesting to see from a planning perspective.

We can look at the API requests segmented by our users and our user roles. We can now examine how these different stakeholders interact with the API. The API is a living, breathing thing that you can look at and observe.

Even outside of security, it’s been fascinating to see how the system gets used, and the kinds of natural rhythms that occur, such as when it is used during the day and which types of things are happening versus others.

It’s interesting to see things that would be very hard to spot in raw, non-human-readable API requests. When you aggregate the data and display it in an information display, you can see it. The API is a living, breathing thing that you can look at and observe as it’s out there in the world.

Gardner: Not only as it breathes and lives, but it’s easily updated. So how do you create a feedback loop from what you learn in your observability phase and bring that into the development iteration process?

As the CTO, are you the one that has to cross the chasm between what you can observe in operations and what you can subsequently ameliorate in development?

Security now part of every job

Phillips: Generally, yes. I view that as a key part of my role. Our software engineers are in there looking at it as well, but I hold myself accountable for that function. Also, I try to recruit generalist software engineers who can take security into account, just like with user experience, when they’re building things. 

I find it very hard to build a cohesive and secure product if you are just throwing requirements over the fence to the software engineers from different departments, saying, “Build this.” I think you lose something.

Rather, there has to be a complete understanding in one accountable individual’s mind to deliver the complete product. And that’s not to say those areas of the company shouldn’t have input on what gets built. But the engineers, in my mind, have to have a deeper understanding. I like to give them as much data as possible to understand what they’re putting out. Then they have that all in their minds when they’re writing the code.

Gardner: Have your developers been receptive to this observability of API behavior data, or do they say, “Well, that’s the security person’s job, not mine”?

Phillips: All of us on the team feel a responsibility for the security of our systems. I think everyone takes that really seriously. I don’t think anyone thinks that it’s “someone else’s” problem. We all know that we all have to watch out for it.

That being said, not everyone is a security expert. Some people may know more or less than others about information security. None of us are dedicated information security professionals. We rely on the inputs from the Traceable AI platform and from what we’re seeing happen to learn about the things that we should be worried about. What are the things that we don’t even know about yet?

It’s about having a culture of learning and having generalists who want to get better at building secure systems and secure APIs. That is increasingly part of the job description for software engineers, to take that into account. That’s especially critical as we see higher value services, like our own, being offered directly on the Internet.

Things are so different now. Years ago, real estate and other financial transactions had some kind of application front end. Then some person would put it all into a mainframe that night and do the financial transactions. Then, the next morning, after it ran on the mainframe, the humans would look at it again. And then they would update your bank account.


Now there’s far more automation. Things happen live via APIs on the Internet. And that’s created much more reason for developers to truly understand the security implications of what they’re building. As you start to eliminate process friction, which is what consumers want, you simply can’t insert a failsafe as easily. There are fewer natural insertion points for a true dedicated security or dedicated fraud prevention review. You have to do these processes live and in an automated way. Security therefore has to be built into the thing itself.

Gardner: Of course, these transactions come with high urgency for people. This is their home, one of the biggest transactions of their lives. They’re not interested in a wishy-washy API.

How easy was using Traceable AI to bring automation for better security into your organization?

Data delivers better development

Phillips: What I like about the Traceable AI user experience is that you can engage with it at multiple levels. On the most basic level, you log in and it’s pushing out immediate alerts of threats. You can view what has happened since you last logged in, and you can review your bots. It surfaces the most important things right away, which is great.

But then you can also pursue questions about the APIs in production. For example, you can plot how the APIs are being used. They give you great tools to drill down so you can navigate to different ways of aggregating the API usage data and then visualize, as I mentioned before, those usage patterns.

What I like about the Traceable AI user experience is that you can engage with it at multiple levels. It surfaces the most important things right away, which is great. They also give you the tools to drill down so you can navigate to aggregating the API usage data.

You can look at performance as well as security. So even if you’re feeling good about the security, you can determine if latency doesn’t look great, for example. There are a lot of things in there to show where you can go really deep. I don’t think I’ve gotten to the bottom. There’s more to discover, and there’s tons of ways to slice, dice, and look at things. I tend to do that a lot because I’m a power user and like to figure things out. But at the same time, Traceable AI does a great job of using their intelligence to surface the most important things and the most critical security concerns and get those in front of you in the first place.

Gardner: It sounds like these data deliverables provide you an on-ramp to a more analytics-driven approach to not only development -- but for improving the processes around development, too.

Phillips: Yes. I would even extend that into the processes around our business operations, our real estate operations. We’re offering a product through our technology that is ultimately a real estate transaction engine. And we can actually see in the API things that we need to do to make the real-world solution better.

We have three critical stakeholders: the buyer client, the real estate agent, and the transaction coordinator, who makes sure everything goes smoothly. And, using these tools, we can see if the user or coordinator is trying to do something and getting errors. We can see if there is a point in the real estate transaction where we might not have everything included. Maybe the information that was expected to be there is incomplete, and so they are not able to get to the next step of the transaction.

So, you can actually uncover things that are not explicitly in the technology, like a process problem. We need this information ahead of that point in the process, and we don’t always have it. We want to then know what next to build into our protocols for the future.

Gardner: Greg, what are your suggestions for other folks grappling with the API Economy, as some people call it? Any words of wisdom now that you’ve been through an API development and refinement journey?

Take one real estate step at a time

Phillips: Start small and expand. Don’t try to put everything and the kitchen sink out there all at once. We currently represent people selling their home, buying a home, and getting a mortgage, people who need title insurance -- people doing all of those things together all at once.

However, the first transaction through our system was just people listing their homes. We said, “Let’s take on this specific process.” And even at the time that we launched, it was a much less detailed version of the process we have today. It’s really important to release something early that is complete but limited in scope. Scope creep -- of trying to pack in a lot at once -- is what causes security issues. It’s what causes performance issues. It causes usability issues. So, start simple and expand. It’s probably the best piece of advice I have.

Gardner: Assuming you are going to continue to crawl, walk, and run, what comes next for Houwzer? What does the future portend? What other transactions might this protocol approach lend itself to?

Phillips: We thought about all the things needed to consummate a real estate transaction. We have covered three of those. But we are missing one, which is homeowners’ insurance. We consider the core services to purchasing a home as brokerage, mortgage, title, and homeowners’ insurance. So that piece is in the works for us.


Outside of those core pieces, however, there are lots of things people need when they’re buying and selling homes. It could be resources to fix up their current home, resources to move in, or guidance around where in the country they should move to as a remote worker. There are lots of different services to build out from the core.

We began at the core transactions, and now we can build our way out. That was a very intentional strategy. When you look at Zillow, Redfin, or some of the other real estate technology companies, they began with the portal and then tried to bolt on the services.

We’re trying to build the best technology-enabled real estate services, and then build from that core outward into more of those needed services. Some of the next things in our product road map, for example, are pre-transaction, helping our consumers make more educated decisions about the transactions they’re going to enter into. And we can do that because we have this bullet-proof, secure, battle-tested system for doing it all and great real estate agents that will help guide you through the process.

Gardner: I’m afraid we’ll have to leave it there. You’ve been listening to a sponsored BriefingsDirect discussion on how a streamlined and cost-efficient home brokerage enabler, Houwzer, constructed a resilient core API platform.

And we’ve learned how protecting user data and preventing vulnerabilities across an end-to-end API services approach has allowed Houwzer to deliver user experiences that are simple, intuitive, personalized, and trusted. So, a big thank you to our guest, Greg Phillips, Chief Technology Officer at Houwzer. Thanks so much, Greg.

Phillips: Yes, thank you as well. It’s been a pleasure.

Gardner: And a big thank you as well for our audience for joining this BriefingsDirect API resiliency discussion. I’m Dana Gardner, Principal Analyst at Interarbor Solutions, your host throughout this series of Traceable AI-sponsored BriefingsDirect interviews.

Thanks again for listening. Please pass this along to your business community and do come back for our next chapter.

Listen to the podcast. Find it on iTunes. Download the transcript. Sponsor: Traceable AI.

Transcript of a discussion on how streamlined and cost-efficient home-brokerage-enabler, Houwzer, constructed a resilient API-based platform as the heart of its services integration engine. Copyright Interarbor Solutions, LLC, 2005-2021. All rights reserved.


Tuesday, October 19, 2021

How FinTech Innovator Razorpay Uses Open-Source Tracing to Manage Fast-Changing APIs

Transcript of a discussion on an open-source project, Hypertrace, and how it helps designers, builders, and testers of modern APIs gain visibility across their internal and third-party services.

Listen to the podcast. Find it on iTunes. Download the transcript. Sponsor: Traceable AI.

Dana Gardner: Hi, this is Dana Gardner, Principal Analyst at Interarbor Solutions, and you’re listening to BriefingsDirect.

The speed and complexity of microservices-intense applications often leave their developers in the dark. They too often struggle to track and visualize the actual underlying architecture of these distributed services.

The designers, builders, and testers of modern API-driven apps, therefore, need an ongoing and instant visibility capability into the rapidly changing data flows, integration points, and assemblages of internal and third-party services.

Thankfully, an open-source project to advance the sophisticated distributed tracing and observability platform called Hypertrace is helping.

Stay with us now as we hear about the evolution and capabilities of Hypertrace and how an early adopter in the online payment suite business, Razorpay, has gained new insights and deeper understanding of their services components.


To learn how Hypertrace discovers, monitors, visualizes, and optimizes increasingly complex services architectures, please welcome Venkat Vaidhyanathan, Architect at Razorpay in Bangalore, India. Welcome, Venkat.

Venkat Vaidhyanathan: Thank you, Dana, for the warm welcome.

Gardner: We’re also here with Jayesh Ahire, Founding Engineer at Traceable AI and Product Manager for Hypertrace. Welcome, Jayesh.

Jayesh Ahire: Thanks, Dana. Glad to be here.

Gardner: Venkat, what does Razorpay do and why is tracing and understanding your services architecture so important?

Built by developers, for developers

Venkat: Razorpay’s mission is to enable frictionless banking and payment experiences by powering the entire financial infrastructure for businesses of all shapes and sizes. It’s a full-stack financial solution that enables thousands of small- to medium-sized enterprises (SMEs) and enterprises to accept, process, and disburse payments at scale.

Today, we process billions of dollars of payments from millions of businesses across India. As a leading payments provider, we have been the first to bring to market most of the major online innovations in payments for the last five years.

For the last two years, we have successfully curated neo banking and lending services. We have seen outstanding growth in the last five years and attracted $300 million-plus in funding from investors such as Sequoia, Tiger Global, Ribbit Capital, Matrix Partners, and others.

One of the fundamental principles about designing Razorpay has been to build a largely API-driven ecosystem. We are a developer-first company. Our general principle of building is, “It is built by developers for developers,” which means that every single product we build is always going to be API-driven first. In that regard, we must ensure that our APIs are resilient. That they perform to the best and most optimum capacity is of extreme importance to us.

Gardner: What is it about being an API-driven organization that makes tracing and observability such an important undertaking?

Venkat: We are an extremely Agile organization. As a startup, we have an obsession around our customers. Focus on building quality products is paramount to creating the best user experience (UX).

Our customers have amazing stories around our projects, products, and ecosystem. We have worked through extreme times (for example, demonetization, and the Yes Bank outage), and that has helped our customers build a lot of trust in what we do -- and what we can do.


We have quickly taken up the challenge and turned the tables for most of our customers to build a lot of trust in the kinds of things we do.

After all, we are dealing with one of the most sensitive aspects of human lives, which is their money. So, in this regard, the resiliency, security, and all the usability parameters are extremely important for our success.

Gardner: Jayesh, why is Razorpay a good example of what businesses are facing when it comes to APIs? And what requirements for such users are you attempting to satisfy with your distributed tracing and observability platform?

Observability offers scale, insight, resilience

Ahire: Going back to the days when it all started, people began building applications using monoliths. And it was easier then to begin with monolithic applications to get the business moving.

But in recent times, that is not the only important thing for businesses. As we heard, Venkat needs scale and resiliency in the platform while building with APIs. Most modern organizations use microservices, which complicates these architectures. They become hard to manage, especially at large-scale organizations where you can have 100 to 300 microservices, with thousands of APIs communicating between those microservices.

It’s just hard now for businesses to have the visibility and observability to determine if they have any issues and to see if the APIs are performing as expected.

I use a list of four brief questions that every organization needs to answer at some point. Are their APIs:

  • Providing the functionality they are supposed to deliver?

  • Performing in the way they are supposed to?

  • Secure for their business users?

  • Understood across all their APIs and microservices uses?

They must understand if the APIs and microservices are performing up to the actual expectations and required functionality. They need something that can provide the answers to these questions, at the very least.

Observability helps answer these essential questions without having to open the black box and go to each service and every API. Instead, the instrumentation data provides those insights. You can ask questions of your system and it will give you the answers. You can ask, for example, how your system is performing -- and it will give you some answers. Such observability helps large-scale organizations keep up with the scale and with the increasing number of users. And that keeps the systems resilient.

Gardner: Venkat, what are your business imperatives for using Hypertrace? Is it for UX? What is the business case for gaining more observability in your services development?

Metrics, logs, and traces limit trouble

Venkat: There are three fundamental legs to what we define as modern observability. One part is with respect to metrics, the next part has to do with the logs, and the third part is in respect to the traces.

Up until recently, we had application performance monitoring (APM) systems that monitored some of these things, with a single place to gather some metrics and insights. However, as microservices grew wider in use, APMs are no longer necessarily the right way to do these things. For such metrics, a lot of work is already going on in the open-source ecosystem with respect to Prometheus and others. I wrote a blog about our journey into scaling our metrics platform to trillions of data points.
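For context, metrics instrumentation of that kind typically looks like the following with the Prometheus Python client. The metric names are illustrative, not Razorpay’s.

    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter("payment_requests_total", "Payment API requests", ["gateway", "status"])
    LATENCY = Histogram("payment_request_seconds", "Payment request latency in seconds")

    @LATENCY.time()
    def create_payment(gateway):
        # ... handle the payment ...
        REQUESTS.labels(gateway=gateway, status="success").inc()

    start_http_server(8000)  # exposes /metrics for Prometheus to scrape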

Once you can get logs -- whether it is from open-source ELK Stack [Elasticsearch, Logstash, and Kibana], or whether it is from a lot of platform as a service (PaaS) and software as a service (SaaS) log providers -- fundamentally the issue comes down to traces.

As microservices evolve, you're talking about a lot more problems, such as how much time would a network call take? How much time would a database call take? Was my DNS request the biggest impediment? What really happened?

Now, traces can be visualized in a very primitive way, such as for instrumenting a particular piece of code to understand its behavior. It could be for a timing function, for example.

However, as microservices evolve, you’re talking about a lot more problems, such as how much time would a network call take? How much time would the database call take? Was my DNS request the biggest impediment? What really happened in the last mile?

And when you’re talking about an entire graph of services, it’s very important to know what particular point in the entire graph breaks down often – or doesn’t break down very often.

Understanding all these things, as Jayesh said, and asking the right questions cannot happen only by using metrics or just logs. They only give different slices of the problems. And it cannot happen only by using tracing, which also only gives a different slice of the problem.

In an ideal, nirvana world, you need to combine all these things and create a single place that can correlate these various things and allow a deep dive with respect to a specific component, module, function, system, query, or whatever. Being able to identify root causes and the mean time to detect (MTTD), these are some of the most paramount things that we probably need to worry about.
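A common building block for that correlation is stamping every log line with the active trace ID, so a metric spike, a log entry, and a trace all point at the same request. A minimal sketch with the OpenTelemetry Python API, assuming the service is already instrumented and a tracer provider is configured; the logger and message are illustrative.

    import logging
    from opentelemetry import trace

    logger = logging.getLogger("payments")

    def handle_request():
        span = trace.get_current_span()
        trace_id = format(span.get_span_context().trace_id, "032x")
        # Every log line carries the trace ID, so logs, metrics, and traces
        # for the same request can be joined in one place.
        logger.info("payment created", extra={"trace_id": trace_id})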

In complex, large-scale systems, things go wrong. Why things went wrong is one part, when did things go wrong is another part, and being able to arrive and fix things – the MTTD and the mean time to recovery (MTTR) -- those largely define the success of any business.

We are just one of the many financial ecosystem providers. There are tons of providers in the world. So, the customer has many options to switch from one provider to another. For any business, how they react to these performance issues is the most important.

Observability tools like Hypertrace put us in control, rather than just leaving it to hypothesis.

Gardner: Jayesh, how does Hypertrace improve on such key performance controls as MTTD and MTTR? How is Hypertrace being used to cut down on that all important time to remediation that makes the user experience more competitive?

Tracing eases uncovering the unknown

Ahire: As Venkat pointed out, in these modern systems, there are too many unknown unknowns. Finding out what caused any problem at any point in time is hard.

At Hypertrace, in trying to help businesses, we present entity-focused, API-first views. Hypertrace provides a very detailed, out-of-the-box service dashboard and overview. Such a backend API overview helps show which services are talking to each other, how they are talking to each other, the interactions between the different services, and which APIs are talking to those services. It provides a list of APIs.

Hypertrace provides a single pane view into the services and API trace data. The insights gained from the trace data makes it easier to find which API or service has some issue. That’s where the entity-first API view makes the most sense. The API dashboard helps people get to the issue very easily and helps reduce the MTTD and MTTR.

Venkat: Just to add to what Jayesh mentioned, in our world our ecosystem is internally a Kubernetes ecosystem. And Kubernetes is extremely dynamic in nature. You’re no longer dealing with single, private IPs or public IPs, or any of those things. Services can come up. Pods can come up. Deployments can come up and go down.

So, service discoverability becomes a problem, which means that tying back a particular behavior to these services, which are themselves a collection of services, and to the underlying infrastructure -- whether you’re talking about queues or network calls -- you’re talking about any number of interconnected infrastructure components as well. That becomes extremely challenging.

Cardinality becomes an extremely important issue. Metrics alone cannot solve that [service discoverability] problem. Logs alone cannot solve that problem. A very simple payments request carries at least 35 different cardinality dimensions.

The second aspect is that, implicitly, most of our ecosystems run on preemptible workloads, or spot workloads. So, nodes can come up, nodes can go down. How do you put these things together? While we can identify a particular service as problematic, I want to find out if it is the service that is problematic or the underlying cloud provider. And within the cloud provider, is it the network or the actual hardware or operating system (OS)? If it is OS, which part precisely? Is it just a particular part that is problematic, or is the entire hardware problematic? That’s one view.

The other view is that cardinality becomes an extremely important issue. Metrics alone cannot solve that problem. Logs alone cannot solve that problem. A very simple request, for example, a payment-create-request in our world, carries at least 30 to 35 different cardinality dimensions (e.g.: the merchant identity, gateway, terminal, network, and whether the payment is domestic vs international, etc.).
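In tracing terms, those dimensions typically end up as attributes on the span for the request, which is what makes high-cardinality slicing possible later. A hedged sketch using the OpenTelemetry Python API; the attribute names mirror the dimensions mentioned above but are illustrative, not Razorpay’s actual schema.

    from opentelemetry import trace

    tracer = trace.get_tracer("payments")

    def create_payment(merchant_id, gateway, terminal, is_international):
        with tracer.start_as_current_span("payment.create") as span:
            # Each attribute is a cardinality dimension the trace can later be sliced by.
            span.set_attribute("merchant.id", merchant_id)
            span.set_attribute("payment.gateway", gateway)
            span.set_attribute("payment.terminal", terminal)
            span.set_attribute("payment.international", is_international)
            # ... perform the payment ...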


A variety of these parameters comes into play. You need to know if it’s an issue overall, is it at a particular merchant, and at what dimension? So, you need to narrow down the problem in a tight production scenario.

To manage those aspects, tools like Hypertrace, or any observability tool, for that matter -- tracing in general -- makes it a lot easier to arrive at the right conclusions.

Gardner: You mentioned there are other options for tracing. How did you at Razorpay come to settle on Hypertrace? What’s the story behind your adoption of Hypertrace after looking at the tracing options landscape?

The why and how of choosing Hypertrace

Venkat: When we began our observability journey, we realized we had to go further into visibility tracing because the APMs were not answering a lot of questions we were asking of the APM tool. The best open-source version was that offered by Jaeger. We evaluated a lot of PaaS/SaaS solutions. We really didn't want to build an in-house observability stack.

There were a few challenges in all the PaaS offerings including storage, ability to drill down, retention, and cost versus value offered. Additionally, many of the providers were just giving us Jaeger with add-ons. The overall cost-to-benefit ratio suffered because we were growing with both the number of services and users. Any model that charges us on the user level, data storage level, or services level -- these become prohibitive over time.

Although maintaining an in-house observability tool is not the most natural business direction for us, we soon realized that maybe it’s best for us to do it in-house. We were doing some research and hit upon this solution called Hypertrace. It looked interesting so we decided to give it a try.

They offered the ability for me to jump into a Slack call. And that’s all I did. I just signed up. In fact, I didn’t even sign up with my company email address. I signed up with my personal email address and I just jumped on to their Slack call.


I started asking the Hypertrace team lots of questions. Started with a Docker-compose, straight out of their GitHub repo. The integration was quite straightforward. We did a set of proof-of-concepts and said, “Okay, this sort of makes sense.” The UX was on par with any commercial SaaS provider. That blew my mind. How can an open-source product build such a fantastic user interface (UI)? I think that was the first thing that hit most of our heads. And I think that was the biggest sell. We said, “Let’s just jump in and see how it evaluates.” And that’s the story.

Gardner: What sort of paybacks or metrics of success have you enjoyed since adopting Hypertrace? As open source, are you injecting your own requirements or desired functions and features into it?

Venkat: First and foremost, we wanted to understand the beast we were dealing with in our APIs, which meant we had to build in the instrumentation and software development kits (SDKs), including OpenCensus, OpenTracing, and OpenTelemetry agents.

We had to make internal developer adoption easier by building the right toolkits, the right frameworks, and the right SDKs because applications have their own business asks, and you shouldn't be adding woes to their existing development life cycles.

The next step was integrating these tools within our services and ecosystem. There are challenges in terms of internally standardizing all our instrumentation, using best practices, and ensuring that applications are adopted. We had to make internal developer adoption easier by building the right toolkits, the right frameworks, and the right SDKs because applications have their own business asks, and you shouldn’t be adding woes to their existing development life cycle. Integration should be simple! So, we formulated a virtual team internally within Razorpay to build the observability stack.
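As a rough picture of what such internal SDKs and toolkits wrap, a basic OpenTelemetry setup that exports spans to a collector looks roughly like this. The endpoint and service name are placeholders, and Razorpay’s actual tooling adds its own conventions on top.

    from opentelemetry import trace
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

    provider = TracerProvider(resource=Resource.create({"service.name": "payments-api"}))
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
    )
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("startup-check"):
        pass  # spans now flow to the collector, and on to a backend such as Hypertrace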

As we built the SDKs and tooling and started instrumenting, we did a lot of adoption exercises within the organization. Now, we have more than 15 critical services and a lot more in the pipeline. Over a period of time, we were able to make tracing a habit rather than just another “nice to have.”

One of the biggest benefits we started seeing from the production monitoring is that our internal engineering teams figured out how to run performance tests in pre-production. Some of this wouldn’t have been possible before, such as being able to pin down the right problem areas.


Now, during the performance testing, our engineers can early-on pinpoint the root cause of the problems. And they’ve gone back to fix their code even before the code goes into production. And believe me that it’s a lot more valuable for us than the code going into production and then facing these problems.

The misfortune with all monitoring tools is that the typical metrics might not be applicable. Why? Because when things go right, nobody wants to look at monitoring. It’s only when things go wrong that people log into a monitoring tool.

The benefits of Hypertrace come in terms of how many issues you’re able to detect much earlier in the stages of development. That’s probably the biggest benefit we have gotten.

Gardner: Jayesh, what makes Hypertrace unique in the tracing market?

Democratic data for API analytics

Ahire: There are two different ways to analyze, visualize, and use the data to better understand the systems. The first important thing is how we do data collection. Hypertrace provides data collection from any standard instrumentation.

If your application is instrumented with Jaeger, Zipkin, or OpenTelemetry, and you start sending the instrumentation data to Hypertrace, it will be able to analyze it and show you the dashboard. You then will be able to slice and dice the data using our explorer. You can discover a lot of different things.

That democratization of the data collection aspect is one important thing Hypertrace provides. And if you want to use any other tracing platform you can do that with Hypertrace because we support all the standard instrumentation.

Next is how we utilize that data. Most tracing platforms provide a way to slice and dice their data. So that’s just one explorer view where there’s all the data from the instrumentation available and you can find the information you want. Ask the question and then you will get the information. That’s one way to look at it.

Hypertrace provides, in addition to that explorer view, a detailed service graph. With it, you can go to applications, see the service interactions, the latency markings, and learn which services are having errors right away. Out-of-the-box services derived from instrumentation data provide many necessary metrics and visualizations, including latency, error rate, and call rate.

You can see more of the API interactions. You can see comparison data against current data -- for example, whatever your latency was over the last day versus the last hour. It provides you a comparison for that. And it’s pretty helpful to be able to compare between deployments, such as whether performance, latency, or error rate is affected. There are a lot of use cases you can solve with Hypertrace.

With such observability, you can achieve early problem detection easily and reduce MTTD and MTTR using these dashboards.

The expectation is for availability of 99.99 percent. In the case of Razorpay, it's very critical. Any downtime has a business impact. For most businesses, that's the case.

Then there’s availability. The expectation is for availability of 99.99 percent. In the case of Razorpay, it’s very critical. Any downtime has a business impact. For most businesses, that’s the case. So, availability is a critical issue.

The Hypertrace dashboards help you to maintain that as well. Currently, we are working on alerting features on deviations -- and those deviations are calculated automatically. We calculate baselines from the previous data, and whenever a deviation happens, we give an alert. That obviously helps in reducing MTTD as well as increasing availability generally.
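The baseline-and-deviation idea can be sketched very simply: compute a baseline from recent history and alert when the current value drifts beyond a tolerance. A toy illustration, not Hypertrace’s actual algorithm.

    from statistics import mean, stdev

    def should_alert(history, current, sigmas=3.0):
        """Alert when the current value deviates more than `sigmas` standard
        deviations from the baseline computed over recent history."""
        if len(history) < 2:
            return False
        baseline, spread = mean(history), stdev(history)
        return abs(current - baseline) > sigmas * max(spread, 1e-9)

    # Example: latency (ms) samples from the last hour versus the latest measurement.
    print(should_alert([120, 118, 125, 122, 119, 121], 310))  # True -> raise an alert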

Hypertrace strives to make the UX seamless. As Venkat mentioned, we have a beautiful UI that looks professional and attractive. The UI work we put into our SaaS security solution, Traceable AI, also goes into Hypertrace, and so helps the community. It helps people such as Venkat at Razorpay solve the problems in their environment. That’s pretty good.

Gardner: Venkat, for other organizations facing similar complexity and a need to speed remediation, what recommendations do you have? What should other companies be thinking about as they evaluate observability and tracing choices? What do you recommend they do as they get more involved with API resiliency?

Evaluate then invest in your journey

Venkat: A fundamental problem today in the open-source world with tracing is the quality of standards. We have OpenCensus on one side going to OpenTelemetry and OpenTracing going to OpenTelemetry. In trying to keep it all compatible, and because it’s all so nascent, there is not a lot of automation.

For most startups, it is quite daunting to build their own observability stack.

My recommendation is to start with an existing tracing provider and evaluate that against your past solutions. Over time it may become cost prohibitive. At some point, you must start looking inward. That’s the time when systems like Hypertrace become quite useful for an organization.

The truth is it’s not easy to build out an observability stack. So, experiment with a SaaS provider on a lower scale. Then invest in the right tooling, one that gives you the liberty to not maintain the stack, such as Hypertrace. Keep the internal tooling separate, experiment, and come back. That’s what I would recommend.

The cost is not just the physical infrastructure cost, or the licensing cost. Cost is also the engineering cost of running the stack. If the stack goes down, who monitors the monitor? It’s a big question. So, there are trade-offs. There is no right answer, but it’s a journey.

After our experience with Hypertrace, I have connected with a couple of my friends in different organizations, and I’ve told them of the benefits. I do not know their results, but I’ve told them some of the benefits that we have leveraged using Hypertrace.

Gardner: And just to follow up on your advice for others, Venkat, what is it about open source that helps with those trade-offs?

Venkat: One advantage we have with open source is there is no vendor lock-in. That’s one major advantage. One of our critical services is in PHP, and hence we had to use OpenCensus for instrumenting it.

We're working with the Hypertrace community to build in some new features, such as tool design, Blue Coat, knowledge sharing, and bug-fixing. For us, it's been an interesting and exciting journey.

But there were a lot of performance and resilience issues with this codebase. Today, the original OpenCensus PHP implementation points to Razorpay’s fork.

And we are working with the Hypertrace community, too, to build some features, whether it is in tool design, Blue Coat, knowledge sharing, and bug-fixing. For us it’s been an interesting and exciting journey.

Ahire: Yes, that has been the mutual experience from our end as well. We learned a lot of things. We had made assumptions in the beginning about what users might expect or want.

But Razorpay worked with us. On some things they said, “Okay, this is not going to work. You have to change this part.” And we modified some things, we added a few features, and we removed a few things. That’s how it came to where it is today. The whole collaboration aspect has been very rewarding.

Venkat: Even though we have only a handful of critical services, the data instrumented from them came to over two terabytes a day. And while that is a good problem to have, we have other interesting scaling challenges we need to deal with.

So how do you optimize these things at scale? In the SaaS form, we could have just gone and said, “Hey, this sort of doesn’t work.” We would stick with them for a few months, then go to another SaaS provider and ask, “Are you going to solve this problem or not?”

The flexibility we get with open source is to say, “Okay, here’s the problem. How do we fix it?” Because, of course, they’re now under our control, right? I think that’s super powerful.

Ahire: Here we all learn together.

Gardner: Yes, it certainly sounds like a partnership relationship. Jayesh, tell us a little bit about the roadmap for Hypertrace, and particularly for the smaller organizations who might prefer a SaaS model, what do you have in store for them?

Ahire: We are currently working on alerting. We’ll soon release dynamic anomaly-based alerting.

We are also working on metric ingestion and integrations throughout the Hypertrace platform. An important aspect of tracing and observability is being able to correlate the data. To propagate context throughout the system is very important. That’s what we will be doing with our metric integration. You will be able to send application metrics, and you will be able to correlate back to trace data and log data.


And talking of SaaS, when it comes to smaller organizations with maybe 10, 20, or 30 developers and a not very well-defined DevOps team, it can be hard to deploy and manage this kind of platform.

So, for those users, we are working toward a SaaS model so smaller companies will be able to use the Hypertrace stack functionality.

Gardner: Where can organizations go to learn more about Hypertrace and start to use some of these features and functions?

Ahire: You can head on to hypertrace.org, our website, and find the details of our use cases. There’s a Slack channel link, GitHub, and everything is available there. Those are good places to start.

Venkat: Just try it first: go to GitHub, and within a few minutes you should have the entire stack up and running. I mean, that’s as simple as simplicity can get.

For further details, just go to the Slack channel and start communicating. Their team is super-duper responsive and super-duper helpful. In fact, we have never had to talk to them saying, “Hey, what’s this?” because we sort of realized that they come back with a patch much faster than you can imagine.

Gardner: I’m afraid we’ll have to leave it there. You’ve been listening to a sponsored BriefingsDirect discussion on how the speed and complexity of microservices-laden applications can often leave developers in the dark as to what’s going on with their underlying dynamic service architectures.

And we’ve learned how a sophisticated, distributed tracing and observability platform called Hypertrace discovers, monitors, visualizes, and optimizes services for an innovative online payments business, Razorpay.

So, a big thank you to our guests, Venkat Vaidhyanathan, Architect at Razorpay in Bangalore, India. Thank you so much, Venkat.


Venkat: Thank you, Dana, for the opportunity, and thank you, Jayesh, and the Hypertrace team for helping us to build and make our systems far more robust.

Gardner: We’ve also been here with Jayesh Ahire, Founding Engineer at Traceable AI and Product Manager for Hypertrace. Thank you, Jayesh.

Ahire: Thanks, Dana. It was great talking to you and sharing our story.

Gardner: And a big thank you as well for our audience for joining this BriefingsDirect API resiliency discussion. I’m Dana Gardner, Principal Analyst at Interarbor Solutions, your host throughout this series of Traceable AI-sponsored BriefingsDirect interviews.

Thanks again for listening. Please pass this along to your business community and do come back for our next chapter.

Listen to the podcast. Find it on iTunes. Download the transcript. Sponsor: Traceable AI.

Transcript of a discussion on an open-source project, Hypertrace, and how it helps designers, builders, and testers of modern APIs gain visibility across their internal and third-party services. Copyright Interarbor Solutions, LLC, 2005-2021. All rights reserved.
