
Thursday, February 11, 2016

How New York Genome Center Manages the Massive Data Generated from DNA Sequencing

Transcript of a discussion on how the drive to better diagnose diseases and develop more effective treatments is aided by swift, cost efficient, and accessible big data analytics infrastructure.

Listen to the podcast. Find it on iTunes. Get the mobile app. Download the transcript. Sponsor: Hewlett Packard Enterprise.

Dana Gardner: Hello, and welcome to the next edition of the HPE Discover Podcast Series. I'm Dana Gardner, Principal Analyst at Interarbor Solutions, your host and moderator for this ongoing discussion on IT innovation and how it’s making an impact on people’s lives.

Our next big-data use case leadership discussion examines how the non-profit New York Genome Center manages and analyzes up to 12 terabytes of data generated each day from its genome sequence appliances. We’ll learn how the drive to better diagnose disease and develop more effective treatments is aided by swift, cost efficient, and accessible big-data analytics.

To hear how genome analysis pioneers exploit vast data outputs to then speedily correlate for time-sensitive reporting, please join me in welcoming our guest.
We're here with Toby Bloom, Deputy Scientific Director for Informatics at the New York Genome Center in New York. Welcome, Toby.

Toby Bloom: Hi. Thank you.

Gardner: First, tell us a little bit about your organization. It seems like it’s a unique institute, with a large variety of backers, consortium members. Tell us about it.

Bloom: New York Genome Center is about two-and-a-half years old. It was formed initially as a collaboration among 12 of the large medical institutions in New York: Cornell, Columbia, NewYork-Presbyterian Hospital, Mount Sinai, NYU, Einstein Montefiore, and Stony Brook University. All of the big hospitals in New York decided that it would be better to have one genome center than have to build 12 of them. So we were formed initially to be the center of genomics in New York.

Gardner: And what does one do at a center of genomics?

Bloom: We're a biomedical research facility that has a large capacity to sequence genomes and use the resulting data output to analyze the genomes, find the causes of disease, and hopefully treatments of disease, and have a big impact on healthcare and on how medicine works now.

Gardner: When it comes to doing this well, it sounds like you are generating an awesome amount of data. What sort of data is that and where does it come from?

Bloom: Right now, we have a number of genome sequencing instruments that produce about 12 terabytes of raw data per day. That raw data is basically lots of strings of As, Cs, Ts and Gs -- the DNA data from genomes from patients who we're sequencing. Those can be patients who are sick and we are looking for specific treatment. They can be patients in large research studies, where we're trying to use and correlate a large number of genomes to find the similarities that show us the cause of the disease.

Gardner: When we look at a typical big data environment such as in a corporation, it’s often transactional information. It might also be outputs from sensors or machines. How is this a different data problem when you are dealing with DNA sequences?

Lots of data

Bloom: Some of it’s the same problem, and some of it’s different. We're bringing in lots of data. The raw data, I said, is probably about 12 terabytes a day right now. That could easily double in the next year. But then we analyze the data, and I probably store three to four times that much data in a day.

In a lot of environments, you start with the raw data, you analyze it, and you cook it down to your answers. In our environment, it just gets bigger and bigger for a long time, before we get the answers and can make it smaller. So we're dealing with very large amounts of data.

We do have one research project now that is taking in streaming data from devices, and we think over time we'll likely be taking in data from things like cardiac monitors, glucose monitors, and other kinds of wearable medical devices. Right now, we are taking in data off apps on smartphones that are tracking movement for some patients in a rheumatoid arthritis study we're doing.

We have to analyze a bunch of different kinds of data together. We’d like to bring in full medical records for those patients and integrate it with the genomic data. So we do have a wide variety of data that we have to integrate, and a lot of it is quite large.

Gardner: When you were looking for the technological platforms and solutions to accommodate your specific needs, how did that pan out? What works? What doesn’t work? And where are you in terms of putting in place the needed infrastructure?

Bloom: The data that comes off the machines is in large files, and a lot of the complex analysis we do, we do initially on those large files. I am talking about files that are from 150 to 500 gigabytes or maybe a terabyte each, and we do a lot of machine-learning analysis on those. We do a bunch of Bayesian statistical analyses. There are a large number of methods we use to try to extract the information from that raw data.
When we've figured out the variance and mutations in the DNA that we think are correlated with the disease and that we were interested in looking at, we then want to load all of that into a database with all of the other data we have to make it easy for researchers to use in a number of different ways. We want to let them find more data like the data they have, so that they can get statistical validation of their hypotheses.

We want them to be able to find more patients for cohorts, so they can sequence more and get enough data. We need to be able to ask questions about how likely it is, if you have a given genomic variant, you get a given disease. Or, if you have the disease, how likely it is that you have this variant. You can only do that if it’s easy to find all of that data together in one place in an organized way.

So we really need to load that data into a database and connect it to the medical records or the symptoms and disease information we have about the patients and connect DNA data with RNA data with epigenetic data with microbiome data. We needed a database to do that.

We looked at a number of different databases, but we had some very hard requirements to satisfy. We were looking for one that could handle tens of trillions of rows in a table without falling over, and that could answer queries fast across multiple tables with tens of trillions of rows. We also need to be able to easily change and add new kinds of data, because we're always finding new kinds of data we want to correlate. So there are things like that.

Simple answer

We need to be able to load terabytes of data a day. But more than anything, I had a lot of conversations with statisticians about why they don’t like databases, about why they keep asking me for all of the data in comma-delimited files instead of databases. And the answer, when you boiled it down, was pretty simple.

When you have statisticians who are looking at data with huge numbers of attributes and huge numbers of patients, the kinds of statistical analysis they're doing means they want to look at some much smaller combinations of the attributes for all of the patients and see if they can find correlations, and then change that and look at different subsets. That absolutely requires a column-oriented database. A row-oriented relational database will bring in the whole database to get you that data. It takes forever, and it’s too slow for them.
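[The tradeoff Bloom describes can be sketched in a few lines of Python; the attribute names are made up for illustration, and real column stores of course implement this at the storage layer, not in application code:]

```python
# Toy sketch of why a column layout wins for "a few attributes,
# all patients" queries (attribute names are hypothetical).

# Row layout: one record per patient.
rows = [{"age": 30 + i % 50, "variant_x": i % 2, "bmi": 22.5, "smoker": i % 3 == 0}
        for i in range(1000)]

# Column layout: one contiguous list per attribute.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# The statistician's query: just two of the attributes, for every patient.
# Row store: every full record must be visited to pick out two fields.
subset_rows = [(r["age"], r["variant_x"]) for r in rows]

# Column store: only the two needed columns are read at all.
subset_cols = list(zip(columns["age"], columns["variant_x"]))

assert subset_rows == subset_cols  # same answer, far less data touched
```

With thousands of attributes per patient instead of four, the row-oriented scan reads the entire table to answer a two-column question, which is exactly the slowness the statisticians were complaining about.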

So, we started from that. We must have looked at four or five different databases. Hewlett Packard Enterprise (HPE) Vertica was the one that could handle the scale and the speed and was robust and reliable enough, and is our platform now. We're still loading in the first round of our data. We're still in the tens of billions of rows, as opposed to trillions of rows, but we'll get there.

Gardner: You’re also in the healthcare field. So there are considerations around privacy, governance, auditing, and, of course, price sensitivity, because you're a non-profit. How did that factor into your decision? Is the use of off-the-shelf hardware a consideration, or off-the-shelf storage? Are you looking at conversion infrastructure? How did you manage some of those cost and regulatory issues?

Bloom: Regulatory issues are enormous. There are regulations on clinical data that we have to deal with. There are regulations on research data that overlap and are not fully consistent with the regulations on clinical data. We do have to be very careful about who has access to which sets of data, and we have all of this data in one database, but that doesn’t mean any one person can actually have access to all of that data.

We want it in one place, because over time, scientists integrate more and more data and get permission to integrate larger and larger datasets, and we need that. There are studies we're doing that are going to need over 100,000 patients in them to get statistical validity on the hypotheses. So we want it all in one place.

What we're doing right now is keeping all of the access-control information about who can access which datasets as data in the database, and we basically append clauses to every query to filter down the data to the data that any particular user can use. Then we'll tell them the answers for the datasets they have and how much data that’s there that they couldn’t look at, and if they needed the information, how to go try to get access to that.
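[The clause-appending approach Bloom describes can be sketched as follows; the table and column names here are hypothetical, and this stands in for what would be done against Vertica in practice:]

```python
# Sketch of access control as data: permissions live in a table, and
# every user query gets a clause appended so it returns only rows
# from datasets that user is allowed to see.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE variants (patient_id INT, dataset TEXT, variant TEXT);
    CREATE TABLE access (username TEXT, dataset TEXT);
    INSERT INTO variants VALUES (1, 'studyA', 'BRCA1'), (2, 'studyB', 'APOE4');
    INSERT INTO access VALUES ('alice', 'studyA');
""")

def filtered_query(user, base_sql):
    """Append an access-control clause restricting rows to permitted datasets."""
    keyword = " AND" if "WHERE" in base_sql.upper() else " WHERE"
    clause = keyword + " dataset IN (SELECT dataset FROM access WHERE username = ?)"
    return conn.execute(base_sql + clause, (user,)).fetchall()

# alice sees only studyA rows; studyB is silently filtered out.
print(filtered_query("alice", "SELECT patient_id, variant FROM variants"))
# → [(1, 'BRCA1')]
```

A real system would also report, as Bloom notes, how much data was withheld and how to request access to it.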

Gardner: So you're able to manage some of those very stringent requirements around access control. How about that infrastructure cost equation?

Bloom: Infrastructure cost is a real issue, but essentially, what we're dealing with is, if we're going to do the work we need to do and deal with the data we have to deal with, there are two options. We spend it on capital equipment or we spend it on operating costs to build it ourselves.

In this case, not all cases, it seemed to make much more sense to take advantage of the equipment and software, rather than trying to reproduce it and use our time and our personnel's time on other things that we couldn’t as easily get.

A lot of work went into HPE Vertica. We're not going to reproduce it very easily. The open-source tools that are out there don’t match it yet. They may eventually, but they don’t now.

Getting it right

Gardner: When we think about the paybacks or determining return on investment (ROI) in a business setting, there’s a fairly simple, straightforward formula. For you, how do you know you’ve got this right? What are the equivalents of what we might refer to in the business world as service-level agreements (SLAs) or key performance indicators (KPIs)? What are you looking for to know that you’ve got it right and that you’re getting the job done, based on all of its requirements and for all of these different constituencies?

Bloom: There’s a set of different things. The thing I am looking for first is whether the scientists who we work with most closely, who will use this first, will be able to frame the questions they want to ask in terms of the interface and infrastructure we’ve provided.

I want to know that we can answer the scientific questions that people have with the data we have and that we’ve made it accessible in the right way. That we’ve integrated, connected and aggregated the data in the right ways, so they can find what they are looking for. There's no easy metric for that. There’s going to be a lot of beta testing.

The second thing is, are we hitting the performance standards we want? How much data can I load how fast? How much data can I retrieve from a query? Those statisticians who don’t want to use relational databases still want to pull out all those columns, and they want to do their sophisticated analysis outside the database.

Eventually, I may convince them that they can leave the data in the database and run their R-scripts there, but right now they want to pull it out. I need to know that I can pull it out fast for them, and that they're not going to object that this is organized so they can get their data out.

Gardner: Let's step back to the big picture of what we can accomplish in a health-level payback. When you’ve got the data managed, when you’ve got the input and output at a speed that’s acceptable, when you’re able to manage all these different level studies, what sort of paybacks do we get in terms of people’s health? How do we know we are succeeding when it comes to disease, treatment, and understanding more about people and their health?

Bloom: The place where this database is going to be the most useful, not by any means the only way it will be used, is in our investigations of common and complex diseases, and how we find the causes of them and how we can get from causes to treatments.

I'm talking about looking at diseases like Alzheimer’s, asthma, diabetes, Parkinson’s, and ALS, which is not so common, but certainly falls in the complex disease category. These are diseases that are caused by some combination of genomic variants, not by a single gene gone wrong. There are a lot of complex questions we need to ask in finding those. It takes a lot of patients and a lot of genomes to answer those questions.

The payoff is that we can use this data to collect enough information about enough diseases to ask the questions that say: it looks like this genomic variant is correlated with this disease. How many people in your database have this variant, and of those, how many actually have the disease? And of the ones who have the disease, how many have this variant? I need to ask both those questions, because a lot of these variants confer risk, but they don’t absolutely give you the disease.
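[The two questions Bloom poses are the two directions of a conditional probability, estimated from cohort counts. The numbers below are invented purely for illustration:]

```python
# Both directions of the variant/disease question, from simple counts.
def conditional(n_both, n_condition):
    """P(A | B) estimated as count(A and B) / count(B)."""
    return n_both / n_condition

n_variant = 500   # patients carrying the variant
n_disease = 800   # patients with the disease
n_both = 300      # patients with both

p_disease_given_variant = conditional(n_both, n_variant)  # 300/500 = 0.6
p_variant_given_disease = conditional(n_both, n_disease)  # 300/800 = 0.375

# Both numbers matter: the variant confers risk (60% of carriers are
# affected), yet explains only some cases (37.5% of patients carry it).
print(p_disease_given_variant, p_variant_given_disease)
```

This is exactly why a variant can "confer risk" without "absolutely giving you the disease": the two conditional probabilities are independent quantities, and each needs its own count query against the full cohort.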

If I am going to find the answers, I need to be able to ask those questions, and those are the things that are really hard to do with the raw data in files. If I can do just that, think about the impact on all of us. If we can find the molecular causes of Alzheimer’s, that could lead to treatments or prevention, and for all of those other diseases as well.

Gardner: It’s a very compelling and interesting big data use case, one of the best I’ve heard.

I am afraid we’ll have to leave it there. We've been examining how the New York Genome Center manages and analyzes vast data outputs to speedily correlate for time-sensitive reporting, and we’ve learned how the drive to better diagnose diseases and develop more effective treatments is aided by swift, cost efficient, and accessible big data analytics infrastructure.
So, join me in thanking our guest, Toby Bloom, Deputy Scientific Director for Informatics at the New York Genome Center. Thank you so much, Toby.

Bloom: Thank you, and thanks for inviting me.

Gardner: Thank you also to our audience for joining us for this big data innovation case study discussion. I'm Dana Gardner, Principal Analyst at Interarbor Solutions, your host for this ongoing series of HPE-sponsored discussions. Thanks again for listening, and come back next time.


Transcript of a discussion on how the drive to better diagnose diseases and develop more effective treatments is aided by swift, cost efficient, and accessible big data analytics infrastructure. Copyright Interarbor Solutions, LLC, 2005-2016. All rights reserved.


Thursday, January 14, 2016

How SKYPAD and HPE Vertica Enable Luxury Retail Brands to Gain Rapid Insight into Consumer Sales Trends

Transcript of a BriefingsDirect discussion on how Sky I.T. has changed its platform and solved the challenges around variety, velocity, and volume for big data to make better insights available to retail users.

Listen to the podcast. Find it on iTunes. Get the mobile app. Download the transcript. Sponsor: Hewlett Packard Enterprise.

Dana Gardner: Hello, and welcome to the next edition of the HPE Discover Podcast Series. I'm Dana Gardner, Principal Analyst at Interarbor Solutions, your host and moderator for this ongoing discussion on IT innovation and how it’s making an impact on people’s lives.

Our next big-data use case leadership discussion explores how retail luxury goods market analysis provider Sky I.T. Group has upped its game to provide more buyer behavior analysis faster and with more depth. We will see how Sky I.T. changed its data analysis platform infrastructure and why that has helped solve its challenges around data variety, velocity, and volume to make better insights available to its retail users.

Here to share how retail intelligence just got a whole lot smarter, we are joined by Jay Hakami, President of Sky I.T. Group in New York. Welcome, Jay.

Jay Hakami: Thank you very much. Thank you for having us.

Gardner: We're also here with Dane Adcock, Vice President of Business Development at Sky I.T. Group. Welcome, Dane.

Dane Adcock: Thank you very much.

Gardner: And we're here with Stephen Czetty, Vice President and Chief Technology Officer at Sky I.T. Group. Welcome to BriefingsDirect, Stephen.
Stephen Czetty: Thank you, Dana, and I'm looking forward to the chance.

Gardner: What are the top trends that are driving the need for greater and better big-data analysis for retailers? Why do they need to know more, better, faster?

Adcock: Well, customers have more choices. As a result, businesses need to be more agile and responsive and fill the customer's needs more completely or lose the business. That's driving the entire industry into practices that mean shorter times from design to shelf in order to be more responsive.

It has created a great deal of gross margin pressure, because there's simply more competition and more selections that a consumer can make with their dollar today.

Gardner: Is there anything specific to the retail process around luxury goods that makes this additional speed even more pressing? Are there more choices and higher expectations of the end user?

Greater penalty

Adcock: Yes. The downside to making mistakes in terms of designing a product and allocating it in the right amounts to locations at the store level carries a much greater penalty, because it has to be liquidated. There's not a chance to simply cut back on the supply chain side, and so margins are more at risk in terms of making the mistake.

Ten years ago, from a fashion perspective, it was about optimizing the return and focusing on winners. Today, you also have to plan to manage and optimize the margins on your losers as well. So, it's a total package.

Gardner: So, clearly, the more you know about what those users are doing or have done is essential. It seems to me, though, that we're talking about a market-wide look rather than just one store, one retailer, or one brand.

How does that work, Jay? How do we get to the point where we've been able to gather information at a fairly comprehensive level, rather than cherry-picking or maybe getting a non-representative look based on only one organization’s view into the market?

Hakami: With SKYPAD, what we're doing is collecting data from the supplier, from the wholesaler, as well as from their retail stores, their wholesale business, and their dot-com -- meaning the whole omni-channel. When we collect that data, we cleanse it to make sure it's meaningful to the user.

Now, we're dealing with a connected world where the retailer, wholesalers, and suppliers have to talk to one another and plan together for the buying season. So the partnerships and the insight that they get into the product performance is extremely important, as Dane mentioned, in terms of the gross margin and in terms of the software information. SKYPAD basically provides that intelligence, that insight, into this retail/wholesale world.

Gardner: Correct me if I'm wrong, but isn’t this also a case where people are opening up their information and making it available for the benefit of a community or recognizing that the more data and the more analysis that’s available, the better it is for all the participants, even if there's an element of competition at some point?

Hakami: Dana, that's correct. The retail business likes to share the information with their suppliers, but they're not sharing it across all the suppliers. They're sharing it with each individual supplier. Then, you have the market research companies who come in and give you aggregation of trends and so on. But the retailers are interested in sell-through. They're interested in telling X supplier, "This is how your products are performing in my stores."

If they're not performing, then there's going to be a mark down. There's going to be less of a margin for you and for us. So, there's a very strong interest between the retailer and a specific supplier to improve the performance of the product and the sell-through of those products on the floor.

Gardner: Before we learn more about the data science and dealing with the technology and business case issues, tell us a little bit more about Sky I.T. Group, how you came about, and what you're doing with SKYPAD to solve some of these issues across this entire supply chain and retail market spot.

Complex history

Hakami: I'll take the beginning. I'll give you a little bit of the history, Dana, and then maybe Dane and Stephen can jump in and tell you what we are doing today, which is extremely complex and interesting at the same time.

We started with SKYPAD about eight years ago. We found a pain point within our customers where they were dealing with so many retailers, as well as their own retail stores, and not getting the information that they needed to make sound business decisions on a timely basis.

We started with one customer, which was Theory. We came to them and we said, "We can give you a solution where we're going to take some data from your retailers, from your retail stores, from your dot-com, and bring it all into one dashboard, so you can actually see what’s selling and what’s not selling."

Fast forward, we've been able to take not only EDI transactions, but also retail portals. We're taking information from any format you can imagine -- from Excel, PDF, merchant spreadsheets -- bringing that wealth of data into our data warehouse, cleansing it, and then populating the dashboard.

So today, SKYPAD is giving a wealth of information to the users by the sheer fact that they don’t have to go out by retailer and get the information. That’s what we do, and we give them, on a Monday morning, the information they need to make decisions.

Dane, can you elaborate more on this as well?

Adcock: This process has evolved from a time when EDI was easy, because it was structured, but it was also limited in the number of metrics that were provided by the mainstream. As these business intelligence (BI) tools have become more popular, the distribution of data coming from the retailers has gotten more ubiquitous and broader in terms of the metrics.

But the challenge has moved from reporting to identification of all these data sources and communication methodologies and different formats. These can change from week to week, because they're being launched by individuals, rather than systems, in terms of Excel spreadsheets and PDF files. Sometimes, they come from multiple sources from the same retailer.

One of our accounts would like to see all of their data together, so they can see trends across categories and different geographies and markets. The challenge is to bring all those data sources together and align them to their own item master file, rather than the retailer’s item master file, and then be able to understand trends, which accounts are generating the most profits, and what strategies are the most profitable.
It's been a shifting model from the challenge of reporting all this data together, to data collection. And there's a lot more of it today, because more retailers report at the UPC level, size level, and the store level. They're broadcasting some of this data by day. The data pours in, and the quicker they can make a decision, the more money they can make. So, there's a lot of pressure to turn it around.

Gardner: Let me understand, Dane. When you're putting out those reports on Monday morning, do you get queries back? Is this a sort of a conversation, if you will, where not only are you presenting your findings, but people have specific questions about specific things? Do you allow for them to do that, and is the data therefore something that’s subject to query?

Subject to queries

Adcock: It’s subject to queries in the sense that they're able to do their own discovery within the data. In other words, we put it in a BI tool, it’s on the web, and they're doing their own analysis. They're probing to see what their best styles are. They're trying to understand how colors are moving, and they're looking to see where they're low on stock, where they may be able to backfill in the marketplace, and trying to understand what attributes are really driving sales.

But of course, they always have questions about completeness of the data. When things don’t look correct, they have questions about it. That drives us to be able to do analysis on the fly, on-demand, and deliver some responses, "All your stores are there, all of your locations, everything looks normal." Or perhaps there seems to be some flaws or things in the data that don’t actually look correct.

Not only do we need to organize it and provide it to them so that they can do their own broad, flexible analysis, but they're coming back to us with questions about how their data was audited. And they're looking for us to do the analysis on the spot and provide them with satisfactory answers.

Gardner: Stephen Czetty, we've heard about the use case, the business case, and how this data challenge has grown in terms of variety as well as volume. What do you need to bring to the table from the architecture and the data platform to sustain this growth and provide for the agility that these market decision makers are demanding?

Czetty: We started out with an abacus, in a sense, but today we collect information from thousands of sources literally every single week. Close to 9,000 files will come across to us, and we'll process them correctly and sort them out -- what client they belong to and so forth -- but the challenge is forever growing.

We needed to go from older technology to newer technology, because our volumes of data are increasing while the amount of time we have to take that data in is static.

So we're quite aware that we have a time limit. We found Vertica as a platform for us to be able to collect the data into a coherent structure in a very rapid time as opposed to our legacy systems.

It allows us to treat the data in a truly vertical way, although that has nothing to do with the application or the database itself. In the past we had to deal with each client separately. Now we can deal with each retailer separately and just collect their data for every single client that we have. That makes our processes much more pipelined and far faster in performance.

The secret sauce behind that is the ability in our Vertica environment to rapidly sort out the data -- where it belongs, who it belongs to -- calculate it out correctly, put it into the database tables that we need to, and then serve it back to the front end that we're using to represent it.

That's why we've shifted from a traditional database model to a Vertica-type model. It's 100 percent SQL for us, so it looks the same for everybody who is querying it, but under the covers we get tremendous performance and compression and lots of cost savings.

Gardner: For some organizations that are dealing with different sources and different types of data, cleansing is one problem. Then, the ability to warehouse that data and make it available for queries is a separate problem. You've been able to tackle both at the same time with the same platform. Is that right?

Proprietary parsers

Czetty: That's correct. We get the data, and we have proprietary parsers for every single data type that we get. There are a couple of hundred of them at this point. But all of that data, after parsing, goes into Vertica. From there, we can very rapidly figure out what is going where and what is not going anywhere, because it’s incomplete or it’s not ours, which happens, or it’s not relevant to our processes, which happens.
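[The parser-per-format routing Czetty describes can be sketched as a dispatch registry; the format names and the "set aside" behavior here are illustrative assumptions, not Sky I.T.'s actual code:]

```python
# Sketch of a parser registry: each incoming file format has its own
# parser, and inbound files are routed by extension before loading.
import csv
import io
import json

PARSERS = {}

def parser(fmt):
    """Register a parse function for one file format."""
    def register(fn):
        PARSERS[fmt] = fn
        return fn
    return register

@parser("csv")
def parse_csv(text):
    return list(csv.DictReader(io.StringIO(text)))

@parser("json")
def parse_json(text):
    return json.loads(text)

def ingest(filename, text):
    """Route a file to its parser; unknown formats are set aside for review."""
    fmt = filename.rsplit(".", 1)[-1].lower()
    if fmt not in PARSERS:
        return None  # incomplete, not ours, or not relevant
    return PARSERS[fmt](text)

rows = ingest("sales.csv", "sku,qty\nA1,3\nB2,5")
print(rows)  # → [{'sku': 'A1', 'qty': '3'}, {'sku': 'B2', 'qty': '5'}]
```

With a couple of hundred registered parsers, everything downstream of `ingest` sees one normalized shape, which is what lets the sorted output flow straight into the database.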

We can sort out what we've collected very rapidly and then integrate it with the information we already have, or insert new information if it's brand-new. Prior to this, we'd been doing this largely by hand, and that's not effective any longer with our number of clients growing.

Gardner: I'd like to hear more about what your actual deployment is, but before we do that, let’s go back to the business case. Dane and Jay, when Vertica came online, when Steve was able to give you some of these more pronounced capabilities, how did that translate into a benefit for your business? How did you bring that out to the market, and what's been the response?

Hakami: I think the first response was "wow." And I think the second response was "Wow, how can we do this fast and move quickly to this platform?"

Let me give you some examples. When Steve did the proof of concept (POC) with the folks from HP, we were very impressed with the statistics we had seen. In other words, going from a processing time of eight or nine hours to minutes was a huge advantage that we saw from the business side, showing our customers that we can load data much faster.

The ability to use less hardware and infrastructure as a result of the architecture of Vertica allowed us to reduce, and to continue to reduce, the cost of infrastructure. These two are the major benefits that I've seen in the evolution of us moving from our legacy to Vertica.

From the business perspective, if we're able to deliver faster and more reliably to the customer, we accomplished one of the major goals that we set for ourselves with SKYPAD.

Adcock: Let me add something there. Jay is exactly right. The real impact, as it translates into the business, is that we have to stop processing and stop collecting data at a certain point in the morning and start processing it in order for us to make our service-level agreements (SLAs) on reporting for our clients, because they start their analysis. The retail data comes in staggered over the morning and it may not all be in by the time that we need to shut that processing off.

One of the things that moving to Vertica has allowed us to do is to cut that time off later, and when we cut it off later, we have more data, as a rule, for a customer earlier in the morning to do their analysis. They don’t have to wait until the afternoon. That’s a big benefit. They get a much better view of their business.

Driving more metrics

The other thing that it has enabled us to do is drive more metrics into the database and do some processing in the database, rather than in the user tool, which makes the user tool faster and it provides more value.

For example, for a metric like age on the floor, we can do the calculation in the background, in the database, so it doesn't impede the response in the front-end engine. We get more metrics calculated in the database rather than in our user tool, and the tool becomes more flexible and more valuable.
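To make that concrete, here's a toy sketch of pushing a metric like "age on the floor" into the database so the front-end tool just reads a ready-made column. It uses SQLite as a stand-in for Vertica, and the table and column names are invented, not Sky I.T.'s actual schema.

```python
import sqlite3

# Hypothetical sketch: compute "age on the floor" in the database rather
# than doing the date math per-row in the front-end reporting tool.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE inventory (
    sku TEXT, store TEXT, first_receipt TEXT, on_hand INTEGER)""")
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?, ?, ?)",
    [("BLU-SHIRT", "FL-01", "2015-10-01", 40),
     ("BLU-SHIRT", "MI-02", "2015-12-01", 5)])

# The metric is calculated once, in SQL, so the user tool just reads a
# ready-made column instead of computing it on every render.
rows = conn.execute("""
    SELECT sku, store,
           CAST(julianday('2016-01-01') - julianday(first_receipt) AS INTEGER)
               AS age_on_floor_days
    FROM inventory
    ORDER BY store""").fetchall()
print(rows)
```

The same query shape would run against any SQL analytics store; the point is where the calculation happens, not the specific engine.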
Gardner: So not only are you doing what you used to do faster, better, cheaper, but you're able to now do things you couldn't have done before in terms of your quality of data and analysis. Is there anything else that is of a business nature that you're able to do vis-à-vis analytics that just wasn't possible before, and might, in fact, be equivalent of a new product line or a new service for you?

Czetty: In the old model, when we got a new client we had to essentially recreate the processes that we'd built for other clients to match that new client, because we were collecting that data just for that client at that moment.

So 99 percent of it is the same as any other client, but one percent is always different, and it had to be built out. On-boarding a client, as we call it, took us a considerable amount of time -- we are talking weeks.

In the current model, where we're centered on retailers, the only thing that will take us a long time to do in this particular situation is if there's a new retailer that we've never collected data from. We have to understand their methodology of delivery, how it comes, how complex it is and so forth, and then create the logic to load that into the database correctly to match up with what we are collecting for others.

In this scenario, since we've got so many clients, very few new stores or new retailers show up; typically it's just a retail chain we already collect for our other clients. Our on-boarding is simplified, because if we're getting Nordstrom's data for client A, we're getting the same exact data for clients B, C, D, E, and F.

Now, it comes through a single funnel and it's the Nordstrom funnel. It’s just a lot easier to deal with, and on-boarding comes naturally.
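The "single funnel" idea can be sketched in a few lines: data is collected once per retailer, and on-boarding a client whose retailers are already being collected is just a subscription. All the names and record shapes here are hypothetical.

```python
# Illustrative sketch of one funnel per retailer: collection happens once,
# and each client simply subscribes to the retailers that carry its goods.
feeds = {}          # retailer -> collected records (one funnel each)
subscriptions = {}  # client -> set of retailers it subscribes to

def collect(retailer, records):
    # One load per retailer, regardless of how many clients want it.
    feeds.setdefault(retailer, []).extend(records)

def onboard(client, retailers):
    # On-boarding against an existing retailer feed is just a subscription;
    # no new collection logic has to be built.
    subscriptions[client] = set(retailers)

def data_for(client):
    return {r: feeds.get(r, []) for r in subscriptions.get(client, set())}

collect("Nordstrom", [{"sku": "A", "units": 12}])
onboard("client_a", ["Nordstrom"])
onboard("client_b", ["Nordstrom"])  # reuses the same funnel
```

Adding client C, D, or E for Nordstrom touches only `subscriptions`, which is why on-boarding drops from weeks to nearly nothing once a retailer's feed exists.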

Hakami: In addition to that, since we're adding more significant clients, the ability to increase variety, velocity, and volume is very important to us. We couldn't scale without having Vertica as a foundation for us. We'd be standing still, rather than moving forward and being innovative, if we stayed where we were. So this is a monumental change and a very instrumental change for us going forward.

Gardner: Steve, tell us a little bit about your actual deployment. Is this a single tenant environment? Are you on a single database? What’s your server or data center environment? What's been the impact of that on your storage and compression and costs associated with some of the ancillary issues?

Multi-tenant environment

Czetty: To begin with, we're coming from a multi-tenant environment. Every client had its own private database in the past, because in DB2, we couldn't add all these clients into one database and get the job done. There was not enough horsepower to do the queries and the loads.

We ran a number of databases on a farm of servers, on Rackspace as our hosting system. When we brought in Vertica, we put up a minimal configuration with three nodes, and we're still living with that minimal configuration with three nodes.

We haven't exhausted our capacity on the license by any means whatsoever in loading up this data. The compression is obscenely high for us, because at the end of the day, our data absolutely lends itself to being compressed.

Everything repeats over and over again every single week. In the world of Vertica, that means it only appears once, wherever it lives in the database, and the rest of it is magic. Not to get into the technology underneath it, from our perspective it's just very effective in that scenario.
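Here's a toy illustration of why that repetitive retail data compresses so well in a column store: a sorted column of repeated values collapses to a handful of (value, run length) pairs. Vertica's actual encodings are more sophisticated than this; the sketch just shows the principle.

```python
from itertools import groupby

# Run-length encode a column: consecutive repeats become (value, count).
def run_length_encode(column):
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

# A week of sales rows for the same store repeats the store ID every row.
store_column = ["Nordstrom-FL-01"] * 5000 + ["Nordstrom-MI-02"] * 5000
encoded = run_length_encode(store_column)
print(encoded)  # two pairs instead of 10,000 strings
```

Because retail feeds repeat store IDs, SKUs, and dates millions of times, columnar encodings like this are what make the "obscenely high" compression plausible.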

Also, in our DB2 world, we were using quite costly, large SAN configurations with lots of spindles, so that the data could be distributed across the spindles for performance on DB2, and that does improve the performance of that product.

However, in Vertica, we have 600 GB drives and we can just pop more in if we need to expand our capacity. With the three nodes, we've had zero problems with performance. It hasn't been an issue at all. We're just looking back and saying that we wish we had this a little sooner.

Vertica came in and did the install for us initially. Then, we ended up taking those servers down and reinstalling it ourselves. With a little information from the guide, we were able to do it. We wanted to learn it for ourselves. That took us probably a day and a half to two days, as opposed to Vertica doing it in two hours. But other than that, everything is just fine. We’ve had a little training, we’ve gone to the Vertica event to learn how other people are dealing with things, and it's been quite a bit of fun.

Now there is a lot of work we have to do at the back end to transform our processes to this new methodology. There are some restrictions on how we can do things, updates and so forth. So, we had to reengineer that into this new technology, but other than that, no changes. The biggest change is that we went vertical on the retail silos. That's just a big win for us.

Gardner: As you know, Vertica is cloud ready. Is there any benefit to that further down the road where maybe it’s around issues of a spike demand in holiday season, for example, or for backup recovery or business continuity? Any thoughts about where you might leverage that cloud readiness in the future?

Dedicated servers

Czetty: We're already sort of in the cloud with the use of dedicated servers, but in our business, the volume increase in the stores around the holidays is not a doubling of volume. It's adding 10 percent, 15 percent, maybe 20 percent for the holiday season. It hasn't been that big a problem in DB2, so it's certainly not going to be a problem in Vertica.

We've looked at virtualization in the cloud, but with the size of the hardware we actually want to run, we want to take advantage of the speed, the memory, and everything else. We put up pretty robust servers ourselves, and it turns out that in secure cloud environments like the one we're using right now at Rackspace, it's simply less expensive to do it with dedicated equipment. Spinning up another node for us at Rackspace takes about the same time as setting up and configuring a virtual system, a day or so. They can give us another node just like this one on our rack.

We looked at the cloud financially every single time that somebody came around and said there was a better cloud deal, but so far, owning it seems to be a better financial approach.

Gardner: Before we close out, looking to the future, I suppose the retailers are only going to face more competition. They're going to be getting more demand from their end users and customers for a better experience and more information.

We're going to see more mobile devices that will be used in a dot-com world or even a retail world. We are going to start to see geolocation data brought to bear. We're going to expect the Internet of Things (IoT) to kick in at some point where there might be more sensors involved either in a retail environment or across the supply chain.

Clearly, there's going to be more demand for more data doing more things faster. Do you feel like you're in a good position to do that? Where do you see your next challenges from the data-architecture perspective?

Czetty: Not to disparage too much the industry of luxury, but at this point, they're not on the bleeding edge on the data collection and analysis side, though they are on the bleeding edge on social media and so forth. We've anticipated that. We've got some clients who are collecting information about their web activities, and we have done analysis to identify customers who present different personas through the different channels they use to contact the company.

We're dabbling in that area, and that's going to grow as the interfaces become more tablet- and phone-oriented. A lot of sales are potentially going to go through social media, and not just the official websites, in the future.

We'll be capturing that information as well. We’ve got some experience with that kind of data that we’ve done in the past. So, this is something I'm looking forward to getting more of, but as of today, we’re only doing it for a few clients.

Well positioned

Hakami: In terms of planning, we're very well-positioned as a hub between the wholesaler and the retailer, the wholesaler and their own retail stores, as well as the wholesaler and their dot-coms. One of the things that we are looking into, and this is going to probably get more oxygen next year, is also taking a look at the relationships and the data between the retailer and the consumer.

As you mentioned, this is a growing area, and the retailers are looking to capture more of the consumer information so they can target-market to them, not based on segment but based on individual preferences. This is again a huge amount of data that needs to be cleansed, populated, and then presented to the CMOs of companies to be able to sell more, market more, and be in front of their customers much more than ever before.

Gardner: That’s a big trend that we are seeing in many different sectors of the economy -- that drive for personalization, and it really is a result of these data technologies to allow that to happen.

Last word to you, Dane. Any other thoughts about where the intersection of computer science capabilities and market intelligence demands are coming together in new and interesting ways?

Adcock: I'm excited about the whole approach to leveraging some predictive capabilities alongside the great inventory of data that we've put together for our clients. It's not just about creating better forecasts of demand, but about optimizing different metrics: using this data to understand when product should be marked down, which product attributes seem to be favored by stores that are alike in terms of their shopper profiles, and how to allocate breadth and depth of product to individual locations to drive a higher percentage of full-price selling and fewer markdowns for our clients.

So it’s a predictive side, rather than discovery using a BI tool.

Czetty: Just to add to that, there's the margin. When we talked to CEOs and CFOs five or six years ago and told them we could improve business by two, three, or four percent, they were laughing at us, saying it was meaningless to them. Now, three, four, or five percent, even in the luxury market, is a huge improvement to business. The companies like Michael Kors, Tory Burch, Marc Jacobs, Giorgio Armani, and Prada are all looking for those margins.

So, how do we become more efficient with product assortment, how do we become more efficient with distribution of all these products to different sales channels, and then how do we increase our margins? How do we avoid over-manufacturing, not creating those blue shirts for Florida, where they're not selling, and creating them instead for Detroit, where they're selling like hotcakes?

These are the things that customers are looking at and they must have that tool or tools in place to be able to manage their merchandising and by doing so become a lot more agile and a lot more profitable.

Gardner: Well, great. I'm afraid we will have to leave it there. We've been discussing how retail luxury goods and fashion market goods providers are using analysis from Sky I.T. Group and how Sky I.T. Group heads up its game through using HPE Vertica to provide more buyer behavior analysis faster, better, and cheaper.

And we’ve seen how Sky I.T. has changed its platform and solved the challenges around variety, velocity, and volume for that data to make those better insights available to those retail users, allowing them to become more data-driven across their entire market.

So please join me in thanking our guests. We have been talking with Jay Hakami, President of Sky I.T. Group in New York. Thank you so much, Jay.
Hakami: Thank you, Dana. I appreciate it very much.

Gardner: And we've also been talking with Dane Adcock, Vice President of Business Development at Sky I.T. Group. Thank you, Dane.

Adcock: It’s great to have the conversation. Thank you.
Gardner: I've enjoyed it myself. And lastly, a big thank you to Stephen Czetty, Vice-President and Chief Technology Officer there at Sky I.T. Group. Thank you, Stephen.

Czetty: You're very welcome, and I enjoyed the conversation. Thank you.

Gardner: And I’d also like to thank our audience as well for joining us for this big-data use case leadership discussion.

I'm Dana Gardner, Principal Analyst at Interarbor Solutions, your host for this ongoing series of Hewlett Packard Enterprise-sponsored discussions. Thanks again for listening, and come back next time.

Listen to the podcast. Find it on iTunes. Get the mobile app. Download the transcript. Sponsor: Hewlett Packard Enterprise.

Transcript of a BriefingsDirect discussion on how Sky I.T. has changed its platform and solved the challenges around variety, velocity, and volume for big data to make better insights available to retail users. Copyright Interarbor Solutions, LLC, 2005-2016. All rights reserved.

You may also be interested in:

Monday, December 21, 2015

How INOVVO Delivers Analysis that Leads to Greater User Retention and Loyalty for Mobile Operators

Transcript of a discussion on how advanced analytics drawing on multiple data sources provides wireless operators improved interactions with their subscribers and enhances customer experience through personalized insights.

Listen to the podcast. Find it on iTunes. Get the mobile app. Download the transcript. Sponsor: Hewlett Packard Enterprise.

Dana Gardner: Hello, and welcome to the next edition of the HPE Discover Podcast Series. I'm Dana Gardner, Principal Analyst at Interarbor Solutions, your host and moderator for this ongoing discussion on IT innovation and how it’s making an impact on people’s lives.

Our next big-data case study discussion examines how INOVVO delivers impactful analytical services to mobile operators to help them engender improved end-user loyalty.

We'll see how advanced analytics, drawing on multiple data sources, enables INOVVO’s mobile carrier customers to provide mobile users with faster, more reliable, and relevant services.

To learn more about how INOVVO uses big data to make major impacts on mobile services, please join me in welcoming Joseph Khalil, President and CEO of INOVVO in Reston, Virginia. Welcome, Joseph.
Joseph Khalil: Thank you, Dana. I'm glad to be here.

Gardner: User experience and quality of service are so essential nowadays. What has been the challenge for you to gain an integrated and comprehensive view of subscribers and networks that they're on in order to uphold that expectation for user experience and quality?

Khalil: As you mentioned in your intro, we cater to the mobile telco industry. Our customers are mobile operators who have customers in North America, Europe, and the Asia-Pacific region. There are a lot of privacy concerns when you start talking about customer data, and we're very sensitive to that.

The challenge is to handle the tremendous volume of data generated by the wireless networks and still adhere to all privacy guidelines. This means we have to deploy our solutions within the firewalls of network operators. This is a big-data solution, and as you know, big data requires a lot of hardware and a big infrastructure.

So our challenge is how we can deploy big data with a small hardware footprint and high storage capacity and performance. That’s what we’ve been working on over the last few years. We have a very compelling offer that we've been delivering to our customers for the past five years. We're leveraging HPE Vertica for our storage technology, and it has allowed us to meet very stringent deployment requirements. HPE has been and still is a great technology partner for us.

Gardner: Tell us a little bit more about how you do that in terms of gathering that data, making sure that you adhere to privacy concerns, and at the same time, because velocity, as we know, is so important, quickly deliver analytics back. How does that work?

User experience

Khalil: We deal with a large number of records that are generated daily within the network. This is data coming from deep packet inspection probes. Almost every operator we talk to has them deployed, because they want to understand the user experience on their networks.

These probes capture large volumes of clickstream data. Then, they relay it to us in near-real-time fashion. This is the velocity component. We leverage open-source technologies, adapted to our needs, that allow us to deal with the influx of streaming data.

We're now in discussions with HPE about their Kafka offering, which deals with streaming data and scalability issues and seems to complement our current solution, enhancing our ability to deal with the velocity and volume issues. Then, our challenge is not just dealing with the data velocity, but also how to access the data and render reports in a few seconds.
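The general pattern behind buffering near-real-time clickstream data is micro-batching: accumulate streaming records and load them in chunks the analytics store can ingest efficiently. Here's a minimal, generic sketch of that pattern; it is not INOVVO's actual ingestion code, and all names are invented.

```python
from collections import deque

# Minimal micro-batcher: buffer incoming records, hand off full batches
# to a loader callable, and allow a final flush for the partial batch.
class MicroBatcher:
    def __init__(self, batch_size, loader):
        self.batch_size = batch_size
        self.loader = loader        # callable that loads one batch
        self.buffer = deque()

    def receive(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            batch = list(self.buffer)
            self.buffer.clear()
            self.loader(batch)

loaded = []
batcher = MicroBatcher(batch_size=3, loader=loaded.append)
for event in range(7):
    batcher.receive({"event": event})
batcher.flush()  # drain the partial final batch
```

A production version would also flush on a timer so quiet periods still meet the "last hour" freshness the care product needs, which is where a streaming platform like Kafka fits in.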

One of our offerings is a care product that's used by care organizations. They want to know what their customers did in the last hour on the network. So there's a near-real-time urgency to have this data streamed, loaded, processed, and available for reporting. That's what our platform offers.

Gardner: Joseph, given that you're global in nature and that there are so many distribution points for the gathering of data, do you bring this all into a single data center? Do you use cloud or other on-demand elements? How do you manage the centralization of that data?

Khalil: We don’t have cloud deployments to date, even though our technology allows for it. We could deploy our software in the cloud, but again, due to privacy concerns with customers' data, we end up deploying our solutions in-network within the operators’ firewalls.

One of the big advantages of our solution is that we can choose to host it locally on customers’ premises. We typically store data for up to 13 months. So our customers can go and see the performance of everything that’s happened on the network for the last 13 months.

We store the data at different levels -- hourly, daily, weekly, monthly -- but to answer your question, we deploy on-site, and that’s where all the data is centralized.

Gardner: Let’s look at why this is so important to your customer, the mobile carrier, the mobile operator. What is it that helps their business and benefits their business by having this data and having that speed of analysis?

Customer care

Khalil: Our customer care module, the Subscriber Analytix Care, is used by care agents. These are the individuals that respond to 611 calls from customers complaining about issues with their devices, coverage, or whatever the case may be.

When they're on the phone with a customer and they put in a phone number to investigate, they want to be able to get the report to render in under five seconds. They don’t want to have the customer waiting while the tool is churning trying to retrieve the care dashboard. They want to hit "go," and have the information come on their screen. They want to be able to quickly determine if there's an issue or not. Is there a network issue, is it a device issue, whatever the case may be?

So we give them that speed and simplicity, because the data we are collecting is very complex, and we take all the complexity away. We have our own proprietary data analysis and modeling techniques, and it happens on-the-fly as the data is going through the system. So when the care agent loads that screen, it’s right there at a glance. They can quickly determine what the case may be that’s impacting the customer.
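One way to read "it happens on-the-fly as the data is going through the system" is that per-subscriber aggregates are maintained as records stream in, so rendering the care dashboard is a single key lookup rather than a scan. This is a generic sketch of that idea, not INOVVO's proprietary modeling, and the field names are invented.

```python
# Maintain per-subscriber rollups at ingest time so the care screen
# reads precomputed numbers instead of scanning raw clickstream data.
care_index = {}

def ingest(record):
    # Update the subscriber's rollup as each record arrives.
    stats = care_index.setdefault(record["msisdn"],
                                  {"sessions": 0, "failures": 0})
    stats["sessions"] += 1
    if record["status"] != "ok":
        stats["failures"] += 1

def care_dashboard(msisdn):
    # What the care agent's screen reads: already computed, O(1).
    return care_index.get(msisdn, {"sessions": 0, "failures": 0})

for rec in [{"msisdn": "5551234", "status": "ok"},
            {"msisdn": "5551234", "status": "timeout"},
            {"msisdn": "5559876", "status": "ok"}]:
    ingest(rec)
```

Shifting the work to ingest time is the design choice that makes a sub-five-second render feasible while an agent has a customer on the line.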

Our care module has been demonstrated to reduce the average call handle time, the time care personnel spend with the customer on the phone. For big operators, you could imagine how many calls they get every day. Shaving a few minutes off each call can amount to a lot of savings in terms of dollars for them.

Gardner: So in a sense, there’s a force-multiplier by having this analysis. Not only do you head off the problems and fix them before they become evident, which includes better user experience, they're happier as a customer. They stay on the network. But then, when there are problems, you can empower those people who are solving the problem, who are dealing with that customer directly to have the right information in hand.

Khalil: Exactly. They have everything. We give them all the tools that are available to them to quickly determine on the fly how to resolve the issue that the customer is having. That’s why speed is very important for a module like care.
For our marketing module, speed is important, but not as critical as for care, because now you don't have a customer waiting on the line while you run your report to see how subscribers are using the network or how they're using their devices. We still produce reports fairly quickly, in a few seconds, which is also what the platform can offer for marketing.

Gardner: So those are some of the immediate and tactical benefits, but I should think that, over time, as you aggregate this data, there is a strategic benefit, where you can predict what demands are going to be on your networks and/or what services will be more in demand than others, perhaps market by market, region by region. How does that work? How do you provide that strategic level of analysis as well?

Khalil: This is on the marketing side of our platform, Subscriber Analytix Marketing. It's used by the CMO organizations, by marketing analysts, to understand how subscribers are using the services. For example, an operator will have different rate plans or tariff plans. They have different devices, tablets, different offerings, different applications that they're promoting.

How are customers using all these services? Before the advent of deep packet inspection probes and before the advent of big data, operators were blind to how customers are using the services offered by the network. Traditional tools couldn’t get anywhere near handling the amount of data that’s generated by the services.

Specific needs

Today, we can look at this data and synthesize it for them, so they can easily look at it and slice and dice it along many dimensions such as age, gender, device type, location, and time. Marketing analysts can then use these dimensions to ask very detailed questions about usage on the network. Based on that, they can target specific customers with specific offers that match their specific needs.
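The slice-and-dice operation Khalil describes is, at bottom, grouping one usage table by different dimension combinations. Here's a toy version; the records and dimensions are made up, and the point is only that one table answers many segment questions.

```python
from collections import defaultdict

# Invented usage records with a few analysis dimensions.
usage = [
    {"age_band": "18-24", "device": "tablet", "mb": 900},
    {"age_band": "18-24", "device": "phone",  "mb": 300},
    {"age_band": "35-44", "device": "tablet", "mb": 120},
    {"age_band": "18-24", "device": "tablet", "mb": 100},
]

def total_mb_by(*dims):
    # Group-and-sum along any combination of dimensions.
    totals = defaultdict(int)
    for row in usage:
        totals[tuple(row[d] for d in dims)] += row["mb"]
    return dict(totals)

by_device = total_mb_by("device")
by_age_and_device = total_mb_by("age_band", "device")
```

An analyst asking "how much data do 18-24-year-olds on tablets use?" is just choosing a different `dims` tuple over the same synthesized table.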

Gardner: Of course, in a highly competitive environment, where there are multiple carriers vying for that mobile account, the one that’s first to market with those programs can have a significant advantage.

Khalil: Exactly. Operators are competing now based on the services they offer and their related costs. Back 10-15 years ago, radio coverage footprint and voice plans were the driving factors. Today, it's the data services offered and their associated rate plans.

Gardner: Joseph, let’s learn a little bit more about INOVVO. You recently completed purchase of comScore’s wireless solutions division. Tell us a bit about how you’ve grown as a company, both organically and through acquisition, and maybe the breadth of your services beyond what we've already described?

Khalil: INOVVO is a new company. We started in May 2015, but the business is very mature. My senior managers and I have been in this business since 2005. We started the Subscriber Analytix product line back in 2005. Then, comScore acquired us in 2010, and we stayed with them for about 5 years, until this past May.

At that time, comScore decided that they wanted to focus more on their core business and they decided to divest the Subscriber Analytix group. My senior management and I executed a management buyout, and that’s how we started INOVVO.

However, comScore is still a key partner for us. A key component of our product is a dictionary for categorizing and classifying websites, devices, and mobile apps. That's produced by comScore, and comScore is known in this industry as the gold standard for these types of categorizations.

We have exclusive licensing rights to use the dictionary in our platform. So we have a very close partnership with comScore. Today, as far as the services that INOVVO offers, we have a Subscriber Analytix product line, which is for care, marketing, and network.

We talked about care and marketing; we also have a network module. This is for engineers and network planners. We help engineers understand the utilization of their network elements and help them plan and forecast what the utilization is going to be in the near future, given current trends, so they can stay ahead of the curve. Our tool allows them to anticipate when existing network elements will exhaust their current capacity.
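A back-of-the-envelope version of that capacity forecast fits a linear trend to recent utilization samples and projects when a network element crosses its capacity. Real planning models are far richer; the numbers below are invented.

```python
# Least-squares linear trend over weekly utilization samples, then
# project forward from the latest sample to the capacity ceiling.
def weeks_until_exhaustion(samples, capacity):
    """samples: utilization per week, oldest first; returns weeks from now."""
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
             / sum((x - mean_x) ** 2 for x in xs))
    if slope <= 0:
        return None  # flat or declining usage never exhausts capacity
    return (capacity - samples[-1]) / slope

# Utilization grows ~5 units/week toward a capacity of 100.
print(weeks_until_exhaustion([60, 65, 70, 75, 80], capacity=100))
```

The planner's question "when do I need to add capacity?" becomes the number of weeks this returns, which is enough lead time to order and deploy hardware.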

Gardner: And given that platform and technology providers like HPE are enabling you to handle streaming real-time highly voluminous amounts of data, where do you see your services going next?

It appears to me that more than just mobile devices will be on these networks. Perhaps we're moving toward the Internet of Things (IoT). We're looking at people replacing other networks with their mobile network for entertainment and other aspects of their personal and business lives. At the packet level, where you examine this traffic, it seems to me that you can offer more services to more people in the fairly near future.

Two paths

Khalil: IoT is big and it’s showing up on everybody’s radar. We have two paths that we're pursuing on our roadmap. There is the technology component, and that’s why HPE is a key partner for us. We believe in all their big data components that they offer. And the other component for us is the data-science component and data analysis.

The innovation is going to be in the type of modeling techniques that are going to be used to help, in our case, our customers, the mobile operators. Moving down the road, there could be other beneficiaries of that data, for example companies that are deploying the sensors that are generating the data.

I'm sure they want some feedback on all that data that their sensors are generating. We have all the building blocks now to keep expanding what we have and start getting into those advanced analytics, advanced methodologies, and predictive modeling. These are the areas, and this is where we see really our core expertise, because we understand this data.

Today you see a lot of platforms showing up that say, “Give me your data and I'll show you nice looking reports.” But there is a key component that is missing and that is the domain expertise in understanding the data. This is our core expertise.

Gardner: Before we finish up, I'd like to ask you about lessons learned that you might share with others. For those organizations that are grappling with the need for near real-time analytics with massive amounts of data, having tremendous amount of data available to them, maybe it’s on a network, maybe it’s in a different environment, do you have any 20/20 hindsight that you might offer on how to make the best use of big data and monetize it?

Khalil: There is a lot of confusion in the industry today about big data. What is big data, and what do I need for it? You hear terms like Hadoop: "I've deployed a Hadoop cluster, so I've solved my big-data needs." You ask people what their big-data strategy is, and they say they've deployed Hadoop. Well, then, what are you doing with Hadoop? How are you accessing the data? How are you reporting on the data?

My advice is that it's a new field, and you need to consider not just the Hadoop storage layer but the other analytical layers that complement it. Everybody is excited about big data. Everybody wants to have a strategy to use big data, and there are multiple components to it. We offer a key component. We don't pitch ourselves to our customers and say, "We are your big-data solution for everything you have."

There is an underlying framework that they have to deploy, and Hadoop is one of them. Then comes our piece. It sits on top of the data-hosting infrastructure and feeds from all the different data types, because in our industry, typical operators have hundreds, if not thousands, of data silos in their organizations.

So you need a framework to host the various data sources, and Hadoop could be one of them. Then, you need a higher-level reporting layer, an analytical layer, that can start combining these data silos, making sense of them, and bringing value to the organization. It's a complete strategy for how to handle big data.

Gardner: And that analytics layer that's what HPE Vertica is doing for you.

Key component

Khalil: Exactly. HPE is a key component of what we do in our analytical layer. There are misconceptions. When we talk to our customers, they say, "Oh, you're using your Vertica platform to replicate our big data store," and we say that we're not. The big data store is a lower level, and we're an analytical layer. We're not going to keep everything. We're going to look at all your data, throw away a lot of it, keep just what you really need, and then synthesize it to be modeled and reported on.
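That "throw away a lot of it, keep what you need, synthesize the rest" layer can be sketched as a filter-and-rollup step between the raw store and the analytical store. The record shapes and drop rules here are illustrative assumptions, not INOVVO's actual pipeline.

```python
# Filter raw records, then roll the survivors up to the grain the
# analytical layer actually reports on.
def synthesize(raw_records, keep_fields, drop_if):
    rollup = {}
    for rec in raw_records:
        if drop_if(rec):
            continue                      # thrown away, never stored
        key = tuple(rec[f] for f in keep_fields)
        rollup[key] = rollup.get(key, 0) + 1
    return rollup

raw = [
    {"subscriber": "a", "site": "news.example",  "status": "ok"},
    {"subscriber": "a", "site": "news.example",  "status": "ok"},
    {"subscriber": "b", "site": "video.example", "status": "error"},
    {"subscriber": "b", "site": "heartbeat",     "status": "ok"},
]
modeled = synthesize(raw,
                     keep_fields=("subscriber", "site"),
                     drop_if=lambda r: r["site"] == "heartbeat")
```

The analytical store then holds the small `modeled` rollup rather than a replica of the raw feed, which is exactly the distinction Khalil draws between the two layers.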

Gardner: I'm afraid we'll have to leave it there. We've been exploring how INOVVO delivers impactful analytical services to mobile operators so they can foster improved end-user loyalty, and we've identified how advanced analytics, drawing on multiple data sources, provides a better network quality assurance and, of course, an all-important better user experience.
So join me in thanking Joseph Khalil, President and CEO of INOVVO in Reston, Virginia. And a big thank you as well to our audience for joining us for this big data innovation case study discussion.

I'm Dana Gardner, Principal Analyst at Interarbor Solutions, your host for this ongoing series of HPE-sponsored discussions. Thanks again for listening, and do come back next time.

Listen to the podcast. Find it on iTunes. Get the mobile app. Download the transcript. Sponsor: Hewlett Packard Enterprise.

Transcript of a discussion on how advanced analytics drawing on multiple data sources provides wireless operators improved interactions with their subscribers and enhances customer experience through personalized insights. Copyright Interarbor Solutions, LLC, 2005-2015. All rights reserved.

You may also be interested in: