
Friday, June 24, 2016

Here's How Two Part-Time DBAs Maintain Mobile App Ad Platform Tapjoy’s Massive Data Needs

Transcript of a discussion on how mobile app advertising platform Tapjoy handles fast and massive data flows with just two part-time database administrators.

Listen to the podcast. Find it on iTunes. Get the mobile app. Download the transcript. Sponsor: Hewlett Packard Enterprise.

Dana Gardner: Hello, and welcome to the next BriefingsDirect Voice of the Customer innovation case study discussion. I'm Dana Gardner, Principal Analyst at Interarbor Solutions, your host and moderator for this ongoing discussion on IT transformation, and how it’s making an impact on people’s lives.

Our next big-data case study discussion examines how mobile app advertising platform Tapjoy handles fast and massive data -- some two dozen terabytes per day -- with just two part-time database administrators (DBAs).

We'll examine how Tapjoy's data-driven business of serving 500 million global mobile users -- more than 1.5 million ad engagements per day, at a data volume of 120 terabytes -- runs on HPE Vertica.

We'll learn more about how high scale and complexity meet minimal labor for building user and advertiser loyalty from David Abercrombie, Principal Data Analytics Engineer at Tapjoy in San Francisco.
Welcome, David.

David Abercrombie: Thank you so much for having me.

Gardner: Mobile advertising has really been a major growth area, perhaps more than any other type of advertising. We hear a lot about advertising waning, but not mobile app advertising. How does Tapjoy and its platform help contribute to the success of what we're seeing in the mobile app ad space?

Abercrombie: The key to Tapjoy's success is engaging the users and rewarding them for engaging with an ad. Our advertising model is that you engage with an ad and then typically get some sort of reward: virtual currency in the game you're playing, or some sort of discount.

We actually have the kind of ads that lead users to seek us out to engage with the ads and get their rewards.

Gardner: So this is quite a bit different than a static presented ad. This is something that has a two-way street, maybe multiple directions of information coming and going. Why the analysis? Why is that so important? And why the speed of analysis?

Abercrombie: We have basically three types of customers. We have the app publishers who want to monetize and get money from displaying ads. We have the advertisers who need to get their message out and pay for that. Then, of course, we have the users who want to engage with the ads and get their rewards.

The key to Tapjoy's success is being able to balance the needs of all of these disparate users. We can't charge the advertisers too much for their ads, even though the monetizers would like that. It's a delicate balancing act, and that can only be done through big-data analysis, careful optimization, and careful monitoring of the ad network assets and operation.

Gardner: Before we learn more about the analytics, tell us a bit more about what role Tapjoy plays specifically in what looks like an ecosystem play for placing, evaluating, and monetizing app ads? What is it specifically that you do in this bigger app ad function?

Ad engagement model

Abercrombie: Specifically what Tapjoy does is enable this rewarded ad engagement model, so that the advertisers know that people are going to be paying attention to their ads and so that the publishers know that the ads we're displaying are compatible with their app and are not going to produce a jarring experience. We want everybody to be happy -- the publishers, the advertisers, and the users. That’s a delicate compromise that’s Tapjoy’s strength.

Gardner: And when you get an end user to do something, to take an action, that’s very powerful, not only because you're getting them to do what you wanted, but you can evaluate what they did under what circumstances and so forth. Tell us about the model of the end user specifically. What is it about engaging with them that leads to the data -- which we will get to in a moment?

Abercrombie: In our model of the user, we talk about long-term value. So even though it may be a new user who has just started with us, maybe their first engagement, we like to look at them in terms of their long-term value, both to the publishers and the advertiser.

We don’t want people who are just engaging with the ad and going away, getting what they want and not really caring about it. Rather, we want good users who will continue their engagement and continue this process. Once again, that takes some fairly sophisticated machine-learning algorithms and very powerful inferences to be able to assess the long-term value.

As an example, we have our publishers who are also advertisers. They're advertising their app within our platform and for them the conversion event, what they are looking for, is a download. What we're trying to do is to offer them users who will not only download the game once to get that initial payoff reward, but will value the download and continue to use it again and again.

So all of our models are designed with that end in mind -- to look at the long-term value of the user, not just the immediate conversion at this instant in time.
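Abercrombie doesn't spell out Tapjoy's models, but a toy version of the idea -- scoring a new user's likely long-term value from early engagement signals -- might look like the following Python sketch. The features, synthetic labels, and model choice are all invented for illustration; this is not Tapjoy's algorithm.

```python
# Illustrative sketch only -- not Tapjoy's production model.
# Score a new user's likely long-term value as the probability of
# still being active at day 30, from hypothetical first-session signals.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 5_000

# Hypothetical early-engagement features.
X = np.column_stack([
    rng.poisson(3, n),        # ad engagements in the first session
    rng.exponential(10, n),   # minutes spent in the app
    rng.integers(0, 2, n),    # completed the conversion event (download)
])

# Synthetic retention label; in practice this would come from event logs.
logits = 0.3 * X[:, 0] + 0.05 * X[:, 1] + 1.2 * X[:, 2] - 2.5
y = rng.random(n) < 1.0 / (1.0 + np.exp(-logits))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# The predicted probability acts as a crude long-term-value score.
print("holdout accuracy:", model.score(X_test, y_test))
```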

Gardner: So perhaps it’s a bit of a misnomer to talk about ads in apps. We're really talking about a value-add function in the app itself.

Abercrombie: Right. The people who are advertising don’t want people to just see their ads. They want people to follow up with whatever it is they're advertising. If it’s another app, they want good users for whom that app is relevant and useful.

That’s really the way we look at it. That’s the way to enhance the overall experience in the long-term. We're not just in it for the short-term. We're looking at developing a good solid user base, a good set of users who engage thoroughly.

Gardner: And as I said in my set-up, there's nothing hotter in all of advertising than mobile apps and how to do this right. It’s early innings, but clearly the stakes are very high.

A tough business

Abercrombie: And it’s a tough business. People are saturated. Many people don’t want ads. Some of the business models are difficult to master.

For instance, there may be a sequence of multiple ad units. There may be a video followed by another ad to download something. It becomes a very tricky thing to balance the financing here. If it was just a simple pass-through and we take a cut, that would be trivial, but that doesn't work in today's market. There are more sophisticated approaches, which do involve business risk.

If we reward the user, based on the fact that they're watching the video, but then they don't download the app, then we don't get money. So we have to look very carefully at the complexity of the whole interaction to make it as smooth and rewarding as possible, so that the thing works. That's difficult to do.

Gardner: So we're in a dynamic, fast-growing, fairly fresh, new industry. Knowing what's going to happen before it happens is always fun in almost any industry, but in this case, it seems with those high stakes and to make that monetization happen, it’s particularly important.
Tell me now about gathering such large amounts of data, being able to work with it, and then allowing analysis to happen very swiftly. How do you go about making that possible?

Abercrombie: Our data architecture is relatively standard for this type of clickstream operation. There is some data that can be put directly into a transactional database in real time, but typically that's only when you get to the very bottom of the funnel, the conversion stuff. All the clickstream stuff gets written to JSON-formatted log files, gets swept up by a queuing system, and then put into our data systems.

Our legacy system involved a homegrown queuing system, dumping data into HDFS. From there, we would extract and load CSVs into Vertica. As with so many other organizations, we're moving to more real-time operations. Our queuing system has evolved from a couple of different homegrown applications, and now we're implementing Apache Kafka.

We use Spark as part of our infrastructure, as sort of a hub, if you will, where data is farmed out to other systems, including a real-time, in-memory SQL database, which is fairly new to us this year. Then, we're still putting data in HDFS, and that's where the machine learning occurs. From there, we're bringing it into Vertica.
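As an illustration of that micro-batch pattern, here is a minimal sketch of a consumer that reads JSON clickstream events from Kafka and bulk-loads them into Vertica with COPY. The topic, table, columns, and connection details are hypothetical, and the kafka-python and vertica-python client libraries are assumed; this is the general shape of such a pipeline, not Tapjoy's code.

```python
# Sketch of a Kafka-to-Vertica micro-batch loader; names are hypothetical.
import json

import vertica_python
from kafka import KafkaConsumer

VERTICA = {"host": "vertica.example.com", "port": 5433,
           "user": "etl", "password": "...", "database": "ods"}

consumer = KafkaConsumer(
    "clickstream-events",                      # hypothetical topic
    bootstrap_servers="kafka.example.com:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

def flush(rows, cursor):
    """Bulk-load one batch via COPY, the fastest path into Vertica."""
    data = "\n".join("|".join(str(r[c]) for c in ("ts", "user_id", "event"))
                     for r in rows)
    cursor.copy("COPY events_raw (ts, user_id, event) "
                "FROM STDIN DELIMITER '|'", data)

with vertica_python.connect(**VERTICA) as conn:
    cursor = conn.cursor()
    batch = []
    for message in consumer:
        batch.append(message.value)
        if len(batch) >= 10_000:               # micro-batch threshold
            flush(batch, cursor)
            batch.clear()
```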

Our Vertica cluster has two main purposes. First, there is the operational data store, which has the raw, flat tables: one row for every event, with millisecond timestamps and the IDs of all the different entities involved.

From that operational data store, we do a pure SQL ETL extract into kind of an old-school star schema within Vertica, the same database.

Pure SQL

So our business intelligence (BI) ETL is pure SQL and goes into a full-fledged snowflake schema, moderately denormalized, with all the old-school bells and whistles: type 1 and type 2 slowly changing dimensions. With Vertica, we're able to denormalize that data warehouse to a large degree.
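As a sketch of what a pure-SQL, type 2 slowly-changing-dimension load can look like, here are two statements driven from Python. The dimension and staging tables (dim_app, stage_app) are hypothetical stand-ins for the pattern, not Tapjoy's schema.

```python
# Sketch of a type 2 SCD load: expire changed rows, insert new versions.
# Table and column names are hypothetical.
import vertica_python

SCD2_CLOSE = """
    UPDATE dim_app
    SET valid_to = CURRENT_TIMESTAMP, is_current = false
    WHERE is_current
      AND app_id IN (SELECT s.app_id FROM stage_app s
                     WHERE s.app_id = dim_app.app_id
                       AND s.app_name <> dim_app.app_name);
"""

SCD2_OPEN = """
    INSERT INTO dim_app (app_id, app_name, valid_from, valid_to, is_current)
    SELECT s.app_id, s.app_name, CURRENT_TIMESTAMP, NULL, true
    FROM stage_app s
    LEFT JOIN dim_app d ON d.app_id = s.app_id AND d.is_current
    WHERE d.app_id IS NULL;
"""

with vertica_python.connect(host="vertica.example.com", port=5433,
                            user="etl", password="...",
                            database="dw") as conn:
    cursor = conn.cursor()
    cursor.execute(SCD2_CLOSE)  # step 1: close out rows whose attributes changed
    cursor.execute(SCD2_OPEN)   # step 2: open fresh current rows
    conn.commit()
```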

Sitting on top of that we have a BI tool. We use MicroStrategy, for which we have defined our various metrics and our various attributes, and it’s very adept at knowing exactly which fact table and which dimensions to join.

So we have sort of a hybrid architecture. I'd say that we have all the way from real-time, in-memory SQL, Hadoop and all of its machine learning and our algorithmic pipelines, and then we have kind of the old-school data warehouse with the operational data store and the star schema.

Gardner: So a complex, innovative, custom architectural approach to this and yet I'm astonished that you are running and using Vertica in multiple ways with two part-time DBAs. How is it possible that you have minimal labor, given this topology that you just described?

Abercrombie: Well, we found Vertica very easy to manage. It has been very well-behaved, very stable.

For instance, we don’t even really use the Management Console, because there is not enough to manage. Our cluster is about 120 terabytes. It’s only on eight nodes and it’s pretty much trouble free.

One of the part-time DBAs deals with more operating-system-level stuff -- patches, cluster recovery, those sorts of issues. The other part-time DBA is me. I deal more with data-structure design, SQL tuning, and Vertica training for our staff.

In terms of ad-hoc users of our Vertica database, we have well over 100 people who have the ability to run any query they want at any time into the Vertica database.

When we first started out, we tried running Vertica in Amazon EC2. Mind you, this was four or five years ago. Amazon EC2 was not where it is today. It failed. It was very difficult to manage. There were perplexing problems that we couldn't solve. So we moved our Vertica and essentially all of our big-data systems out of the cloud onto dedicated hardware, where they are much easier to manage and where it's much easier to bring the proper resources to bear.

Then, at one point in our history, when we built a dedicated hardware cluster for Vertica, we failed to heed the hardware planning guide and did not provision enough disk I/O bandwidth. In those situations, Vertica is unstable, and we had a lot of problems.

But once we got the proper disk I/O, it has been smooth sailing. I can’t even remember the last time we even had a node drop out. It has been rock solid. I was able to go on a vacation for three weeks recently and know that there would be no problem, and there was no problem.

Gardner: The ultimate key performance indicator (KPI), "I was able to go on vacation."

Fairly resilient

Abercrombie: Exactly. And with the proper hardware design, HPE Vertica is fairly resilient against out-of-control queries. There was a time when half my time was spent monitoring for slow queries, but again, with the proper hardware, it's smooth sailing. I don’t even bother with that stuff anymore.

Our MicroStrategy BI tool writes very good SQL. Part of the key to our success with this BI portion is designing the Vertica schema and the MicroStrategy metadata layer to take advantage of each other’s strengths and avoid each other’s weaknesses. So that really was key to the stable, exceptional performance we get. I basically get no complaints of slow queries from my BI tool. No problem.

Gardner: The right kind of problem to have.

Abercrombie: Yes.

Gardner: Okay, now that we have heard quite a bit about how you are doing this, I'd like to learn, if I could, about some of the paybacks when you do this properly, when it is running well, in terms of SQL queries, ETL load times reduction, the ability for you to monetize and help your customers create better advertising programs that are acceptable and popular. What are the paybacks technically and then in business terms?

Abercrombie: In order to get those paybacks, a key element was confidence in the data, the results that we were shipping out. The only way to get that confidence was by having highly accurate data and extensive quality control (QC) in the ETL.

What that also means is that while a product is under development and its instrumentation isn't ready, that data doesn't make it into our BI tool. You can only get at it through ad-hoc queries.

So the benefit has been a very clear understanding of the day-to-day operations of our ad network, both for our internal monitoring to know when things are behaving properly, when the instrumentation is working as expected, and when the queues are running, but also for our customers.

Because of the flexibility we get from a traditional BI system, with 500 metrics over a couple of dozen dimensions, our customers, the publishers and the advertisers, get incredible detail, customized exactly the way they need for ingestion into their systems or to help them understand how Tapjoy is serving them. Again, that comes from confidence in the data.

Gardner: When you have more data and better analytics, you can create better products. Where might we look next to where you take this? I don’t expect you to pre-announce anything, but where can you now take these capabilities as a business and maybe even expand into other activities on a mobile endpoint?

Flexibility in algorithms

Abercrombie: As we expand our business and move into new areas, what we really need is flexibility in our algorithms and the way we deal with some of our real-time decision making.

So one area that's new to us this year is an in-memory SQL database like MemSQL. Some of our old real-time ad optimization was based on pre-calculating data and serving it up through HBase key-value lookups, but now we can do real-time aggregation queries using SQL that are easy to understand, easy to modify, very expressive, and very transparent. That gives us more flexibility in terms of fine-tuning our real-time decision-making algorithms, which is absolutely necessary.
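To make that concrete, the kind of real-time aggregation the decisioning layer might run could look like this sketch against a MySQL-protocol in-memory store such as MemSQL. The table, columns, time window, and thresholds are hypothetical illustrations, not Tapjoy's actual queries.

```python
# Sketch of a real-time aggregation query for ad decisioning.
# MemSQL speaks the MySQL wire protocol, so a standard client works.
import pymysql

conn = pymysql.connect(host="memsql.example.com", user="serve",
                       password="...", database="realtime")

QUERY = """
    SELECT ad_id,
           SUM(converted) / COUNT(*) AS conv_rate,
           COUNT(*)                  AS impressions
    FROM ad_events
    WHERE event_time > NOW() - INTERVAL 15 MINUTE
    GROUP BY ad_id
    HAVING impressions >= 100        -- ignore ads with too little data
    ORDER BY conv_rate DESC
    LIMIT 10
"""

with conn.cursor() as cursor:
    cursor.execute(QUERY)            # aggregates over the last 15 minutes
    for ad_id, rate, n in cursor.fetchall():
        print(f"ad {ad_id}: {rate:.3f} conversion rate over {n} impressions")
```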

As an example, we acquired a company in Korea called 5Rocks that does app tech and tracks users within the app -- what level they're on, what activities they're doing, what they enjoy -- with an eye toward in-app purchase optimization.

And so we're blending the in-app purchase optimization along with traditional ad network optimization, and the two have different rules and different constraints. So we really need the flexibility and expressiveness of our real-time decision making systems.

Gardner: One last question. You mentioned machine learning earlier. Do you see that becoming more prominent in what you do and how you're working with data scientists, and how might that expand in terms of where you employ it?

Abercrombie: Tapjoy started with machine learning. Our data scientists are machine learning. Our predictive algorithm team is about six times larger than our traditional Vertica BI team. Mostly what we do at Tapjoy is predictive analytics and various machine-learning things. So we wouldn't be alive without it. And we've expanded. We're not shifting in one direction or another. It's apples and oranges, and there's a place for both.
Gardner: I'm afraid we will have to leave it there. We've been examining how mobile app advertising platform Tapjoy handles fast and massive data flows with just two part-time database administrators. And we've learned how Tapjoy’s data-driven business of serving 500 million global mobile users, more than 1.5 million ad engagements per day, runs on HPE Vertica.

Please join me in thanking our guest. We've been here with David Abercrombie, Principal Data Analytics Engineer at Tapjoy in San Francisco. Thank you so much, David.

Abercrombie: Thank you.

Gardner: And I'd like to thank our audience as well for joining us for this big data innovation case study discussion.

I'm Dana Gardner, Principal Analyst at Interarbor Solutions, your host for this ongoing series of HPE-sponsored discussions. Thanks again for listening, and come back next time.

Listen to the podcast. Find it on iTunes. Get the mobile app. Download the transcript. Sponsor: Hewlett Packard Enterprise.

Transcript of a discussion on how mobile app advertising platform Tapjoy handles fast and massive data flows with just two part-time database administrators. Copyright Interarbor Solutions, LLC, 2005-2016. All rights reserved.


Tuesday, April 12, 2016

How Etsy Uses Big Data for eCommerce to Put Buyers and Sellers in the Best Light

Transcript of a discussion on how Etsy uses data science to improve its buyers' and sellers' experience as well as its own corporate destiny.

Listen to the podcast. Find it on iTunes. Get the mobile app. Download the transcript. Sponsor: Hewlett Packard Enterprise.

Dana Gardner: Hello, and welcome to the next edition of the Hewlett Packard Enterprise (HPE) innovator podcast series. I'm Dana Gardner, Principal Analyst at Interarbor Solutions, your host and moderator for this ongoing discussion on IT innovation -- and how it’s making an impact on people’s lives.

Our next big-data case study discussion explores how Etsy, a global e-commerce site focused on handmade and vintage items, uses data science to improve buyers and sellers’ discovery and shopping experiences. We'll learn how mining big data helps Etsy define and distribute top trends, and allows those with specific interests to find items that will best appeal to them.

To learn more about leveraging big data in the e-commerce space, please join me in welcoming Chris Bohn, aka "CB," a Senior Data Engineer at Etsy, based in Brooklyn, New York. Welcome, CB.

CB: Thank you.

Gardner: Tell us about Etsy for those that aren’t familiar with it. I've heard it described as it’s like being able to go through your grandmother's basement. Is that fair?
CB: Well, I hope it’s not as musty and dusty as my grandmother’s basement. The best way to describe it is that Etsy is a marketplace. We create a marketplace for sellers of handcrafted goods and the people who want to buy those goods.

We've been around for 10 years. We're the leader in this space and we went public in 2015. Just some quick metrics: the total value of the merchandise sold on Etsy in 2014 was about $1.93 billion. We have about 1.5 million sellers and about 22 million buyers.

Gardner: That's an awful lot of stuff that’s being moved around. What does the big data and analytics role bring to the table?

CB: It's all about understanding more about our customers, both buyers and sellers. We want to know more about them and make the buying experience easier for them. We want them to be able to find products more easily. Too much choice sometimes is no choice. You want to get them to the product they want to buy as quickly as possible.

We also want to know how people are different in their shopping habits across the geography of the world. There are some people in different countries that transact differently than we do here in the States, and big data lets us get some insight into that.

Gardner: Is this insight derived primarily from what they do via their clickstreams, what they're doing online? Or are there other ways that you can determine insights that then you can share among yourself and also back to your users?

Data architecture

CB: I'll describe our data architecture a little bit. When Etsy started out, we had a monolithic Postgres database and we threw everything in there. We had listings, users, sellers, buyers, conversations, and forums. It was all in there, but we outgrew that really quickly, and so the solution to that was to shard horizontally.

Now we have many hundreds of horizontally sharded MySQL servers. Then we decided that we needed to do some analytics on this stuff. So we scratched our heads. This was about five years ago. We said, "Let's just set up a Postgres server and we'll copy all the data from these shards into the Postgres server that we call the BI server." And we got that done.

Then, we kind of scratched our heads and said, "Wait a minute. We just came full circle. We started with a monolithic database, then we went sharded, and now all the data is back monolithic."

It didn't perform well, because it's hard to get the volume of big data into that database. A relational database like Postgres just isn’t designed to do analytic-type queries. Those are big aggregations, and Postgres, even though it is a great relational database, is really tailored for single-record lookup.

So we decided to get something else going on. About three-and-a-half years ago, we set about searching for the replacement to our monolithic business-intelligence (BI) database and looked at what the landscape was. There were a number of very worthy products out there, but we eventually settled on HPE Vertica for a number of reasons.

One of those is that it derives, in large part, from Postgres. Postgres has a Berkeley license, so companies can take it private. They can take that code and they don't have to republish it to the community, unlike other types of open-source licenses.

So we found out that the parser was right out of Postgres, and all the date-handling and typecasting stuff that is usually different from database to database was spot-on the same between Vertica and Postgres. Also, data ingestion via the COPY command, the best way to bulk-load data, is exactly the same in both, and it's the same format.

We said, "This looks good, because we can get the data in quickly, and queries will probably not have to be edited much." So that's where we went. We experimented with it and we found exactly that. Queries would run unchanged, except they ran a lot faster and we were able to get the data in easily.

We built some data-replication tools to get data from the shards, and also from some legacy Postgres databases that we had laying around for billing, and got all that data into HPE Vertica.

Then, we built some tools that allowed our analysts to bring over custom tables they had created on that old BI machine. We were able to get up to speed really quickly with Vertica, and boom, we had an analytics database that we could hit the ground running with.

Gardner: And is the challenge for you about the variety of that data? Is it about the velocity that you need to move it in and out? Is it about simply volume that you just have so much of it, or a little of some of those?

All of the above

CB: It’s really all of those problems. Velocity-wise, we want our replication system to be eventually consistent, and we want it to be as near real-time as possible. There is a challenge in that, because you really start to get into micro-batching data in.

This is where we ended up having to pay off some technical debt, because years ago, disk storage was fairly pricey, and databases were designed to minimize storage. Practices grew up around that fact. So data would get deleted and updated. That's the policy that the early originators of Etsy followed when they designed the first database for it.

What we've ended up with is lossy data. If someone changes the description or the tags that are associated with a listing, the old ones go away. They are lost forever. And that's too bad, because if we had kept those, we could do analytics on a product that wasn't selling for a long time and all of a sudden started selling. What changed? We would love to do analytics on that, but we can't, because of the loss of data. That's one thing that we learned in this whole process.

But getting back to your question here about velocity and then also the volume of data, we have a lot of data from our production databases. We need to get it all into Vertica. We also have a lot of clickstream data. Etsy is a top 50 website, I believe, for traffic, and that generates a lot of clicks and that all gets put into Vertica.

We run big batch jobs every night to load that. It's important that we have that, because one of the biggest things that our analysts like to do is correlate clickstream data with our production data. Clickstream data doesn't have a lot of information about the user who is doing those clicks. It's just information about their path through the site at that time.

To really get a value-add on that, you want to be able to join on your user details tables, so that you can know where this person lives, how old they are, or their buying history in the past. You need to be able to join those, too, and we do that in HPE Vertica.
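A sketch of that kind of clickstream-to-user join, as it might be run in Vertica, follows; the tables and columns are hypothetical stand-ins for the pattern CB describes, not Etsy's schema.

```python
# Sketch: enrich a week of clickstream with user attributes in Vertica.
# Table and column names are hypothetical.
import vertica_python

QUERY = """
    SELECT u.country,
           DATE_TRUNC('day', c.event_time) AS day,
           COUNT(*) AS page_views
    FROM clickstream c
    JOIN user_details u ON u.user_id = c.user_id
    WHERE c.event_time >= CURRENT_DATE - 7
    GROUP BY 1, 2
    ORDER BY 1, 2
"""

with vertica_python.connect(host="vertica.example.com", port=5433,
                            user="analyst", password="...",
                            database="bi") as conn:
    cursor = conn.cursor()
    cursor.execute(QUERY)
    for country, day, views in cursor.fetchall():
        print(country, day, views)
```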

Gardner: CB, give us a sense about the paybacks, when you do this well, when you've architected, and when you've paid your technical debts, as you put it. How are your analysts able to leverage this in order to make your business better and make the experience of your users better?

CB: When we first installed Vertica, it was just a small group of analysts that were using it. Our analytics program was fairly new, but it just exploded. Everybody started to jump in on it, because all of a sudden, there was a database with which you could write good SQL, with a rich SQL engine, and get fantastic results quickly.

The results weren’t that different from what we were getting in the past, but they were just coming to us so fast, the cycle of getting information was greatly shortened. Getting result sets was so much better that it was like a whole different world. It’s like the Pony Express versus email. That’s the kind of difference it was. So everybody started jumping in on it.

More dashboards

Engineers who were adding new facets of the product wanted to have dashboards, more or less real time, so they could monitor what the thing was doing. For example, we added postage to Etsy, so that our sellers can have preprinted labels. We'd like to monitor that in real time to see how it's going. Is it going well or what?

That was something that took a long time to analyze before we got into big-data analytics. All of a sudden, we had Vertica and we could do that for them, and that pattern has repeated with other groups in the company.

We're doing different aspects of the site. All of a sudden, you have your marketing people, your finance people, saying, "Wow, I can run these financial reports that used to take days in literally seconds." There was a lot of demand. Etsy has about 750 employees and we have way more than 200 Vertica accounts. That shows you how popular it is.
One anecdotal story. I've been wanting to update Vertica for the past couple of months. The woman who runs our analytics team said, "Don't you dare. I have to run Q2 numbers. Everybody is working on this stuff. You have to wait until this certain week to be able to do that." It’s not just HPE Vertica, but big data is now relied on for so many things in the company.

Gardner: So the technology led to the culture. Many times we think it's the other way around, but having that ability to do those easy SQL queries and get information opened up people's imagination, but it sounds like it has gone beyond that. You have a data-driven company now.

CB: That's an astute observation. You're right. This is technology that has driven the culture. It's really changed the way people do their job at Etsy. And I hear that elsewhere also, just talking to other companies and stuff. It really has been impactful.

Gardner: Just for the sake of those of our readers who are on the operations side, how do you support your data infrastructure? Are you thinking about cloud? Are you on-prem? Are you split between different data centers? How does that work?

CB: I have some interesting data points there for you. Five-plus years ago, we started doing Hadoop stuff, and we started out spinning up Hadoop in Amazon Web Services (AWS).

We would run nightly jobs. We collected all of the search terms that were used and buying patterns and we fed these into MapReduce jobs. The output from that then went into MATLAB, and we would get a set of rules out of that, that then would drive our search engine, basically improving search.

Commodity hardware

We did that for a while and then realized we were spending a lot of money in AWS. It was many thousands of dollars a month. We said, "Wait a minute. This is crazy. We could actually buy our own servers. This is commodity hardware that this can run on, and we can run this in our own data center. We will get the data in faster, because there are bigger pipes." So that's what we did.

We created what we call Etsydoop, which has got 200+ nodes and we actually save a lot of money doing it that way. That's how we got into it.

We really have a bifurcated data analytics, big-data system. On the one hand, we have Vertica for doing ad hoc queries, because the analysts and the people out there understand SQL and they demand it. But for batch jobs, Hadoop rocks, and it's really, really good for that.

But the tradeoff is that those are hard jobs to write. Even a good engineer is not going to get it right every time, and for most analysts, it's probably a little bit beyond their reach to get down, roll up their sleeves, and get into actual coding and that kind of stuff.

But they're great at SQL, and we want to encourage exploration and discovering new things. We've discovered things about our business just by some of these analysts wildcatting in the database, finding interesting stuff, and then exploring it, and we want to encourage that. That's really important.

Gardner: CB, in getting to understand Etsy a little bit more, I saw that you have something called Top Trends and Etsy Finds, ways that you can help people with affinity for a product or a craft or some interest to pursue that. Did that come about as a result of these technologies that you have put in place, or did they have a set of requirements that they wanted to be able to do this and then went after you to try to accommodate it? How do you pull off that Etsy Finds capability?

CB: A lot of that is cross-architecture. Some of our production data is used to find that. Then, a lot of the hard crunching is done in Vertica to find that. Some of it is MapReduce. There's a whole mix of things that go into that.

I couldn't claim for Etsy Finds, for example, that it’s all big data. There are other things that go in there, but definitely HPE Vertica plays a role in that stuff.

I'll give you another example, fraud. We fingerprint a lot of our users digitally, because we have problems with resellers. These are people who are selling resold mass-produced stuff on Etsy. It's not huge, but it's an annoyance. Those products compete against really quality handmade products that our regular sellers sell in their shops.

Sometimes it’s like a game of Whack-a-Mole. You knock one of these guys down -- sometimes they're from the Far East or other parts of the world -- and as soon as you knock one down, another one pops up. Being able to capture them quickly is really important, and we use Vertica for that. We have a team that works just on that problem.

What's next?

Gardner: Thinking about the future, with this great architecture, with your ability to do things like fraud detection and affinity correlations, what's next? What can you do that will help make Etsy more impactful in its market and make your users more engaged?

CB: The whole idea behind databases and computing in general is just making things faster. When the first punch-card machines came out in the 1930s or whatever, the phone companies could do faster billing, because billing was just getting out of control. That’s where the roots of IBM lie.

As time went by, punch cards were slow and they wanted to go faster. So they developed magnetic tape, and then spinning rust disks. Now, we're into SSDs, the flash drives. And it’s the same way with databases and getting answers. You always want to get answers faster.

We do a lot of A/B testing. We have the ability to set the site so that maybe a small percentage of users get an A path through the site, and the others a B path, and there's control stuff on that. We analyze those results. This is how we test to see if this kind of button works better than this other one. Is the placement right? If we just skip this page, is it easier for someone to buy something?

So we do A/B testing. In the past, we've done it where we had to run the test, gather the data, and then comb through it manually. But now with Vertica, the turnaround time to iterate over each cycle of an A/B test has shrunk dramatically. We get our data from the clickstreams, which go into Vertica, and then the next day, we can run the A/B test results on that.

The next step is shrinking that even more. One of the themes out there at the various big-data conferences is streaming analytics. That's a really big thing. There is a new database out there called PipelineDB, a fork of Postgres. It allows you to create an event stream into Postgres.

You can then create a view and a window on top of that stream. Then you can pump in your event data, like your clickstream data, and you can join the data in that window to your regular Postgres tables, which is really great, because we could get A/B information in real time. You get a one-minute turnaround as opposed to one day. I think that's where a lot of things are going.
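For the curious, the PipelineDB pattern CB outlines looked roughly like this in the project's 0.9-era syntax: a stream, a continuous view over it, and ordinary SQL to read always-current aggregates. The stream and view names are hypothetical, and since PipelineDB speaks the Postgres wire protocol, a standard Postgres client is assumed.

```python
# Sketch of a PipelineDB continuous view for near-real-time A/B results.
# Stream and view names are hypothetical; syntax per 0.9-era PipelineDB.
import psycopg2

conn = psycopg2.connect("dbname=pipeline user=etl")
cursor = conn.cursor()

cursor.execute("CREATE STREAM ab_events (variant text, converted int)")
cursor.execute("""
    CREATE CONTINUOUS VIEW ab_results AS
    SELECT variant,
           COUNT(*)       AS sessions,
           SUM(converted) AS conversions
    FROM ab_events
    GROUP BY variant
""")
conn.commit()

# Clickstream events are written into the stream as they arrive...
cursor.execute("INSERT INTO ab_events (variant, converted) "
               "VALUES ('A', 1), ('B', 0)")
conn.commit()

# ...and the aggregates are queryable at any moment, not the next day.
cursor.execute("SELECT variant, sessions, conversions FROM ab_results")
print(cursor.fetchall())
```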

If you just look at the history of big data, MapReduce started about 10 years ago at Google, and that was batch jobs, overnight runs. Then, we started getting into the columnar stores to make databases like Vertica possible, and it’s really great for aggregation. That kicked it up to the next level.

Another thing is real-time analytics. It’s not going to replace any of these things, just like Vertica didn't replace Hadoop. They're complementary. Real-time streaming analytics will be complementary. So we're continuing to add these tools to our big data toolbox.

Gardner: It has compressed those feedback loops if we provide that capability into innovative, creative organization. The technology might drive the culture, and who knows what sort of benefits they will derive from that.

All plugged in

CB: That's very true. You touched earlier about how we do our infrastructure. I'm in data engineering, and we're responsible for making sure that our big databases are healthy and running right. But we also have our operations department. They're working on the actual pipes and hardware and making sure it’s all plugged in. It's tough to get all this stuff working right, but if you have the right people, it can happen.

I mentioned earlier about AWS. The reason we were able to move off of that and save money is because we have the people who can do it. When you use AWS extensively, what you're really paying for is a very high-priced but good IT staff at Amazon. If you have a good IT staff of your own, you're probably going to be able to realize some efficiencies there, and that's why we moved over. We do it all ourselves.

Gardner: Having it as a core competency might be an important thing moving forward.

CB: Absolutely. You have to stay on top of all this stuff. A lot is made of the word disruption, and you don't go knocking on disruption’s door; it usually knocks on yours. And you had better be agile enough to respond to it.

I'll give you an example that ties back into big data. One of the most disruptive things that has happened to Etsy is the rise of the smartphone. When Etsy started back in 2005, the iPhone wasn't around yet; it was still two years out. Then, it came on the scene, and people realized that this was a suitable device for commerce.

It’s very easy to just be complacent and oblivious to new technologies sneaking up on you. But we started seeing that there was more and more commerce being done on smartphones. We actually fell a little bit behind, as a lot of companies did five years ago. But our management made decisions to invest in mobile, and now 60 percent of our traffic is on mobile. That's turned around in the past two years and it has been pretty amazing.

Big data helps us with that, because we do a lot of crunching of what these mobile devices are doing. Mobile is not the best device maybe for buying stuff because of the form factor, but it is a really good device for managing your store, paying your Etsy bill, and doing that kind of stuff. So we analyzed all that and crunched it in big data.

Gardner: And big data allowed you to know when to make that strategic move and then take advantage of it?

CB: Exactly. There are all sorts of crossover points that happen with technology, and you have to monitor it. You have to understand your business really well to see when certain vectors are happening. If you can pick up on those, you're going to be okay.
Gardner: I'm afraid we'll have to leave it there. We've been exploring how Etsy, a global e-commerce site focused on handmade and vintage items, uses data science to improve their buyers' and sellers’ experience as well as their own corporate destiny.

I'd like to thank our guest, CB, Senior Data Engineer at Etsy in Brooklyn, New York. Thanks, CB.

CB: Thank you very much, Dana.

Gardner: And I would also like to thank our audience for joining us for this Hewlett Packard Enterprise big data innovation case study discussion. I'm Dana Gardner, Principal Analyst at Interarbor Solutions, your host for this ongoing series of HPE-sponsored discussions.

Listen to the podcast. Find it on iTunes. Get the mobile app. Download the transcript. Sponsor: Hewlett Packard Enterprise.

Transcript of a discussion on how Etsy uses data science to improve its buyers' and sellers' experience as well as its own corporate destiny. Copyright Interarbor Solutions, LLC, 2005-2016. All rights reserved.


Tuesday, March 08, 2016

IoT Plus Big Data Analytics Translate into Better Services Management at Auckland Transport

Transcript of a discussion on the impact and experience of using Internet of Things technologies together with big data analysis in a regional public enterprise.

Listen to the podcast. Find it on iTunes. Get the mobile app. Download the transcript. Sponsor: Hewlett Packard Enterprise.

Dana Gardner: Hello, and welcome to the next edition of the HPE Discover business transformation series. I'm Dana Gardner, Principal Analyst at Interarbor Solutions, your host and moderator for this ongoing discussion on IT innovation and how it’s making an impact on people’s lives.

Our next top innovator case study discussion explores the impact and experience of using Internet of Things (IoT) technologies together with big data analysis to better control and manage a burgeoning transportation agency in New Zealand.

To hear more about how fast big data supports rapidly-evolving demand for different types of sensor outputs -- and massive information inputs -- please join me in welcoming our guest, Roger Jones, CTO for Auckland Transport in Auckland, New Zealand. Welcome, Roger.

Roger Jones: Thank you.

Gardner: Tell us about your organization, its scope, its size and what you're doing for the people in Auckland.
Jones: Auckland Transport was formed five years ago -- we just celebrated our fifth birthday -- from an amalgamation of six regional councils. All the transport functions were merged along with the city functions, to form a super-city concept, of which transport was pulled out and set up as a council-controlled organization.

But it's a semi-government organization as well. So we get funded by the government and the ratepayer and then we get our income as well.

We have multiple stakeholders. We're run by a board, an independent board, as a commercial company.

We look after everything to do with transport in the city: All the roads, everything on the roads, light poles, rubbish bins, the maintenance of the roads and the footpaths and the grass bins, boarding lights, and public transport. We run and operate the ferries, buses and trains, and we also promote and manage cycling across the city, walking activities, commercial vehicle planning, how they operate across the ports and carry their cargoes, and also carpooling schemes.

Gardner: Well, that's a very large, broad set of services and activities. Of course a lot of people in IT are worried about keeping the trains running on time as an analogy, but you're literally doing that.

Real-time systems

Jones: Yeah. We have got a lot of real-time systems, and trains. We've just brought in a whole new electric train fleet. So all of the technology that goes with that has to be worked through. That's the real-time systems on the platforms, right through to how we put Wi-Fi on to those trains and get data off those trains.

So all of those trains have closed-circuit television (CCTV) cameras on them for safety. It's how you get all that information off and analyze it. There's about a terabyte of data that comes off all of those trains every month. It's a lot of data to go through and work out what you need to keep and what you don’t.

Gardner: Of course, you can't manage and organize things unless you can measure and keep track of them. In addition to that terabyte you talked about from the trains, what's the size of the data -- and not just data as we understand it, unstructured data, but content -- that you're dealing with across all these other activities?

Jones: Our traditional data warehouse is about three terabytes, in round numbers, and on the CCTV we take about eight petabytes of data a week, and that's what we're analyzing. That's from about 1,800 cameras that are out on the streets. They're in a variety of places, mostly on intersections, and they're doing a number of functions.

They're counting vehicles. Under the new role, what we want to do is count pedestrians and cyclists and have the cyclists activate the traffic lights. From a cycle-safety perspective, the new carbon fiber bikes don’t activate the magnetic loops in the roads. That's a bone of contention -- they can’t get the lights to change. We'll change all that using CCTV analytics and promote that.

But we'll also be able to count vehicles that turn right and where they go in the city through number plate recognition. By storing that, when a vehicle comes into the city, we would be able to see if they traveled through the city and their average length of stay.

What we're currently working on is putting in a new parking system, where we'll collect all the data about the occupancy of parking spaces and be able to work out, in real time, the probability of getting a car parked in a certain street, at a certain time. Then, we'll be able to make that available to the customer, and especially the tradesman, who need to be able to park to do their business.
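As a deliberately simple illustration of that parking idea (not Auckland Transport's actual system), one could estimate from historical sensor observations the chance that a given street has a free space at a given hour. The street name and numbers below are invented.

```python
# Toy sketch: probability of finding a parking space, from history.
from collections import defaultdict

# (street, hour) -> list of observed fractions of free spaces
history = defaultdict(list)

def record(street, hour, free, total):
    """Store one occupancy observation from the parking sensors."""
    history[(street, hour)].append(free / total)

def p_find_space(street, hour):
    """Historical probability that at least one space was free."""
    observations = history[(street, hour)]
    if not observations:
        return None                    # no data yet for this slot
    return sum(1 for frac in observations if frac > 0) / len(observations)

# Hypothetical observations for one street at 9 a.m.
record("Queen St", 9, free=2, total=40)
record("Queen St", 9, free=0, total=40)
record("Queen St", 9, free=5, total=40)
print(p_find_space("Queen St", 9))     # ~0.67: a space was free 2 times in 3
```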

Gardner: Very interesting. We've heard a lot about smart cities and bringing intelligence to bear on some of these problems and issues. It sounds like you're really doing that. In order for you to fulfill that mission, what was lacking in your IT infrastructure? What did you need to change, either in architecture or an ability to scale or adapt to these different types of inputs?

Merged councils

Jones: The key driver was that, having merged five councils, we had five different CCTV systems, watched by people manually. If you think about 1,800 cameras being monitored by maybe three staff at a time, it's very obvious that they can't actually see what's happening in real time, and most of the public-safety events were being missed. The cameras were being used for reactive investigation rather than active management of a problem at a given point in time.

That drove what we were doing around CCTV and its analytics: how we automate it and make it easy for operators to be presented, in real time, with the situation they need to manage now, so they can be proactive. That was the key driver.

When we looked at that and at all the other scenes around the city, we asked how we could put it all together, process it in real time, and make it available again, both to ourselves, to the police, to the emergency services, and to other third-party application developers who can build their own applications using that data. It's of no value if it's historic.

Gardner: So, a proverbial Tower of Babel. How did you solve this problem in order to bring those analytics to the people who can then make good use of it and in a time frame where it can be actionable?

Jones: We did a scan, as most IT shops would do, around what could and couldn't be done. There's a mix of technologies out there, lots and lots of technologies. One of the considerations was which partner we should go with. Which one was going to give us longevity of product and association? You could buy a product today, and in the changing world of IT, it's out of business, bought out, or changed in three years' time. We needed a brand that was going to be in there for the long haul.
Part of that was the brand, and there are multiple big brands out there. Did they have the breadth of the toolsets that we were looking for, both from a hardware perspective, managing the hardware, and the application perspective? That’s where we selected Hewlett Packard Enterprise (HPE), taking all of those factors into account.

Gardner: Tell us a bit about what you're doing with data. On the front end, you're using a high-speed approach, perhaps in a warehouse, you're using something that will scale and allow for analytics to take place more quickly. Tell us about the tiering and the network and what you've been able to do with that?

Jones: What we've done is taken a tiered approach. For instance, the analytics on the CCTV comes in and gets processed by the HPE IDOL engine. That strips most of it out. We integrate that into an incident management system, which is also running on the IDOL engine.

Then, we take the statistics and the pieces that we want to keep and we're storing that in HPE Vertica. The parking system will go into HPE Vertica because it’s near real-time processing of significant volumes.

The traditional data warehouse, which was a SQL data warehouse, is still very valid today, and it will be valid tomorrow. That's where we're putting in a lot of the corporate information and tying a lot of the statistical information together, so that we have all the historic information around the real-time data, which was always in an old data warehouse.

Combining information

We tie that together with our financials. A lot of smaller changing datasets are held in that data warehouse. Then, we combine that information with the stuff in Vertica and the Microsoft Analytics Platform System (APS) appliances to get us an integrated reporting at the front end in real time.

We're making a lot of that information available through an API manager, so that whatever we do internally is just a service that we can pick up and reuse or make available to whoever we want to make it available to. It’s not all public, but some of it is to our partners and our stakeholders. It’s a platform that can manage that.

Gardner: You mentioned that APS appliance, a Microsoft and HPE collaboration. That’s to help you with that real-time streaming, high velocity, high volume data, and then you have your warehouse. Where are these being run? Do you have a private cloud? Do you have managed hosting, public cloud? Where are the workloads actually being supported?

Jones: The key workloads around the CCTV, the IDOL engine, and Vertica are all running on HPE kit on our premises, but managed by HPE Critical Watch. That's almost an end-to-end HPE service; it just happens to be on our facilities. The rest is, again, on our facilities.

The problem in New Zealand is that there aren't many private clouds that can be used by government agencies. We can’t offshore it because of latency issues and the cost of shipping data to and from the cloud from the ISPs, who know how to charge on international bandwidth.

Gardner: Now that you've put your large set of services together, what are some of the paybacks that you've been able to get? How do you get a return on investment (ROI), which must be pretty sizable to get this infrastructure in place? What are you able to bring back to the public service benefits by having this intelligence, by being able to react in real time?

Jones: There are two bits to this. The traditional data warehouse was bottlenecked. From an internal business perspective, take the processing out of our integrated feed system, which was a batch-driven system: the processing window each night is around 4.5 hours, and processing the batch file took just over that.

We were actually running into not getting the batch file processed until about 6 a.m. At that time, the service operators, the bus operators, the ferry operators have already started work for the day. So they weren’t getting yesterday’s information in time to analyze what to do today.

Using the Microsoft APS appliance we've cut that down, and that process now takes about two hours, end-to-end. So we have a huge performance increase. That means that by the time the operators come in, they have yesterday’s information and they can make the right business decisions.

Customer experience

On the public front, I'd put it back to the customer experience. If you go into a car park and have an incident with somebody in the car park, your expectation is that somebody is monitoring that and somebody will come to your aid. Under the old system, that was not the case. It would be pure coincidence if that happened.

Under the new scenario, from a public perception, that will be alerted, something will happen, and someone will come to you. So public safety takes a huge step up. That has no financial ROI directly for us. It has across the medical spectrum and the broader community spectrum, but for us as a transport agency, it has no true ROI, except for customer expectations and perceptions.

Gardner: Well, as taxpayers having expectations met, it's probably a very strong attribute for you. When we look at your architecture, it strikes me that this is probably something more people will be looking to do, because of this IoT trend, where more sensors are picking up more data. It’s data that’s coming in, maybe in the form of a video feed across many different domains or modes. It needs to be dealt with rapidly. What do you see from your experience that might benefit others as they consider how to deal with this IoT architectural challenge?

Jones: We had some key learning from this. That’s a very good point. IoT is all about connecting in devices. When we went from the old CCTV systems to a new one, we didn’t actually understand that some of that data was being aggregated and lost forever at the front end, and what was being received at the back end was only a snippet.

When you start streaming data in real-time at those volumes, it impacts your data networks. Suddenly your data networks become swamped, or potentially swamped, with large volumes of data.

That then drove us to thinking about how to put that through a firewall, and the reality is you can't. The firewalls aren't built to handle that. We're running F5s, and we looked at that; they would not have handled running the volume of CCTV through them.

So then you start driving to other things about how you secure your data and how you secure the endpoints. Tools that watch your networks -- so that you understand what's connected or what's changed at the connection end, and what's changing in the traffic patterns on your network -- become essential to an organization like us, because there is no way we can secure all the endpoints.

Now, a set of traffic lights has a full data connection at the end. If someone opens a cabinet and plugs in a PC, how do you know they have done that? That's what we have to protect against. The only way to do that is to know that something abnormal is there -- that it's not the normal traffic coming from that area of the network -- and then flag it and block it off. That's where we're heading, because that's the only way we can see the IoT working from a security perspective.
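The flag-the-abnormal approach Jones describes can be sketched with something as simple as a rolling z-score per network segment. This toy monitor is an illustration of the idea, not the product Auckland Transport uses; the window size and threshold are arbitrary.

```python
# Toy sketch: flag traffic that departs from a segment's own baseline.
from collections import deque
from statistics import mean, stdev

class TrafficMonitor:
    """Rolling z-score detector for one network segment."""

    def __init__(self, window=288, threshold=4.0):
        self.samples = deque(maxlen=window)   # e.g., 24h of 5-minute samples
        self.threshold = threshold

    def observe(self, bytes_per_sec):
        """Record a sample; return True if it looks abnormal."""
        alarm = False
        if len(self.samples) >= 30:           # wait for a baseline first
            mu, sigma = mean(self.samples), stdev(self.samples)
            if sigma > 0 and abs(bytes_per_sec - mu) / sigma > self.threshold:
                alarm = True                  # abnormal for this segment
        self.samples.append(bytes_per_sec)
        return alarm

monitor = TrafficMonitor()
normal = [1000, 1100, 950, 1050] * 10         # a steady baseline...
for rate in normal + [50_000]:                # ...then a rogue device
    if monitor.observe(rate):
        print("flagged abnormal traffic:", rate)
```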

Gardner: Now Roger, when you put this amount of data to work, when you've solved some of those networking issues and you have this growing database and historical record of what takes place, that can also be very valuable. Do you expect that you'll be analyzing this data over historical time periods, looking for trends and applying that to feedback loops where you can refine and find productivity benefits? How does this grow over time in value for you as a public-service organization?

Integrated system

Jones: The first real payback for us has been the integrated ticketing system. We run a tag-on, tag-off electronic system. For the first time, we understand where people are traveling to and from, the times of day they're traveling, and to a certain extent, the demographics of those travelers. We know if they're a child, a pensioner, a student, or just a normal adult user.

For the first time, we're actually understanding, not only just where people get on, but where they get off and the time. We can now start to tailor our messaging, especially for transport. For instance, if we have a special event, a rugby game or a pop concert, which may only be of interest to a certain segment of the population, we know where to put our advertising or our messaging about the transport options for that. We can now tailor that to the stops where people are there at the right time of day.

We could never do that before, but from a planning perspective, we now have a view of who travels across town, who travels in and out of the city, how often, how many times a day. We've never ever had that. The planners have never had that. When we get the parking information coming in about the parking occupancy, that’s a new set of data that we have never had.

This is very much about the planners having reliable information. And if we go through the license plate reading, we'll be able to see where trucks come into the city and where they go through.

One of our big issues at the moment is that we have a link route that goes into the port for the trucks. It's a motorway. How many of the trucks use that versus how many take the shortcut straight through the middle of the city? We don't know that. We can do ad-hoc surveys, but now we'll have that in real time, constantly, forever, and the planners can then use it when they're planning the heavy-transport options.

Gardner: I'm afraid we will have to leave it there. We've been learning about how big data, modern networks, and a tiered architectural approach have helped a transportation agency in New Zealand improve its public safety, its reaction to traffic and other congestion issues, and also set in place a historic record to help it improve its overall transportation capabilities.

So I'd like to thank our guest, Roger Jones, CTO for Auckland Transport in Auckland, New Zealand. Thank you, Roger.
Jones: Thanks very much.

Gardner: And thank you, too, to our audience for joining us for this Hewlett Packard Enterprise transformation and innovation interview. I’m Dana Gardner, Principal Analyst at Interarbor Solutions, your host for this ongoing series of HPE-sponsored discussions. Thanks again for listening, and come back next time.

Listen to the podcast. Find it on iTunes. Get the mobile app. Download the transcript. Sponsor: Hewlett Packard Enterprise.

Transcript of a discussion on the impact and experience of using Internet of Things technologies together with big data analysis in a regional public enterprise. Copyright Interarbor Solutions, LLC, 2005-2016. All rights reserved.

You may also be interested in:

  • Extreme Apps approach to analysis makes on-site retail experience king again
  • How New York Genome Center Manages the Massive Data Generated from DNA Sequencing
  • Microsoft sets stage for an automated hybrid cloud future with Azure Stack Technical Preview
  • The Open Group president, Steve Nunn, on the inaugural TOGAF User Group and new role of EA in business transformation
  • Learn how SKYPAD and HPE Vertica enable luxury brands to gain rapid insight into consumer trends
  • Procurement in 2016—The supply chain goes digital
  • Redmonk analysts on best navigating the tricky path to DevOps adoption
  • DevOps by design--A practical guide to effectively ushering DevOps into any organization
  • Need for Fast Analytics in Healthcare Spurs Sogeti Converged Solutions Partnership Model
  • HPE's composable infrastructure sets stage for hybrid market brokering role
  • Nottingham Trent University Elevates Big Data's role to Improving Student Retention in Higher Education
  • Forrester analyst Kurt Bittner on the inevitability of DevOps
  • Agile on fire: IT enters the new era of 'continuous' everything
  • Big data enables top user experiences and extreme personalization for Intuit TurboTax