Last week, I attended the OpenStack Summit in Paris. I wanted to capture key notes I had from the sessions I found most interesting and share them. I’ll also provide a general recap of my own observations. I’d encourage others to write up their observations and notes. I’d be happy to share and link to interesting write-ups, as they might be interesting to both of the readers of this blog. ☺ I’ve read Mark Potts’ great recap here, as well as this post by Forrester’s James Staten.
For this post, I am recapping Wes Jossey’s session (video) on Tapjoy-1 on OpenStack (here’s a link to the deck as well). Tapjoy is a leading mobile advertising network company. Tapjoy-1, as I understand it, refers to a private cloud, running a core Tapjoy workload (i.e., the bulk of their mobile ad network). Though it had run entirely on Amazon Web Services (AWS) and Tapjoy had been an early AWS adopter, Tapjoy now runs both AWS and Tapjoy-1, its OpenStack based private cloud, provided by Metacloud.
Tapjoy is seeing real business value from running Tapjoy-1 on OpenStack: for the same amount they were paying AWS, Tapjoy gets much more capacity from OpenStack in a private cloud (see chart below).
Here’s what this chart says. Assume the same amount of spend over 3 years. The blue bars signify the amount of capacity provided in terms of nodes, IOPS, storage, etc. on Amazon Web Services. The red bars signify the capacity that the same spend buys using OpenStack. The deltas are significant across all 5 major vectors. Eyeballing the chart, I estimated that Tapjoy achieved a doubling of cores, a 4x increase in RAM, a 10x increase in IOPS, 50% more disk, and a 10x increase in nodes. This is the “money” slide, so to speak, in that it makes OpenStack seem like a no-brainer, certainly for Tapjoy.
Wes indicated that he and his team penciled out 3 years of spend on their private cloud, including hardware, software, services, floor space, power, etc. He also stated that Tapjoy had experienced better uptime with their OpenStack private cloud implementation than they had when they’d run the app on AWS or on bare metal.
The Tapjoy experience and case study is an important bellwether for the OpenStack movement. Despite much excitement and momentum around OpenStack, as a community, our project is still early in full production design wins with concrete business value evidence. Tapjoy’s experience and the evidence presented are a solid example. I wanted to share this and provide some of my own perspectives.
What is Tapjoy-1
Tapjoy-1 is an internal private cloud, built on OpenStack, that all Tapjoy devs have access to. It comprises 348 “Data” All Purpose Nodes (see slide below). According to my notes, Tapjoy-1 has 12 management nodes. As I understood it, Tapjoy-1 delivers a key production workload for Tapjoy, namely delivering mobile ads.
Tapjoy’s Observations / Lessons Learned
Wes shared several observations and lessons learned in his talk, which I’m capturing in no particular order.
Lesson 1: Vendor selection matters
Wes was a happy reference for Metacloud, and he described how they helped validate the design, and then they did the OpenStack deployment and provisioned the network. Quick commercial plug: HP Helion is now open for business, and we can offer these capabilities as well! ☺
Wes was also a positive reference for Equinix.
While he didn’t bash Quanta directly, he did say that delays in procuring hardware blew out both his timelines and his hardware contingency budget. Quick commercial plug #2: HP is a leading server and hardware vendor, and we can *definitely* help here! ☺
Lesson 2: There’s nothing agile about writing a big check up front. At the same time, there is something smart about planning through all the money.
Wes’s point here was one he covered only briefly, but it has stuck with me. Tapjoy made the bet to move to an OpenStack-based private cloud, specifically choosing to go from Opex (pay as you go) to Capex (invest up front). With this, he was pointing out that they had to do a lot more planning up front. While it was less agile and real-time than just adding or shrinking capacity as they went along, I think his point was that smart planning, plus the compelling economics, made this a stark and valuable business case. I expect we will see more examples of this in the market and the community.
Agile is a very popular and often appropriate methodology, and in many cases it will work. But in this case, and I suspect in many others, particularly where the app and workload are well understood, planning up front and shifting to Capex was a no-brainer, business-wise. Quick commercial plug #3: HP has financing options available across hardware and software, so if you want to build a private cloud, we can help you do so either in Capex (pay up front) or Opex (pay-as-you-go).
Lesson 3: Challenges
As is common with many new IT initiatives, selling a change internally can be a challenge, and Wes hit on this theme. There had been no previous success story with OpenStack internally, so Wes and his team encountered a lot of skepticism. In addition, Wes cited frustration with turnaround time (again principally on hardware delivery), as well as with the need to plan out all the different steps (Wes said, “you’ll need a Gantt chart”).
Lesson 4: Design Principles for Cloud
A few key points here. First off was this message: plan for failure.
Wes talked about Tapjoy’s preference for trusting the application more than disk replication. In other words, they think of replication as occurring at the app layer. They don’t use Ceph, as they want to think of things as ephemeral, and don’t want to be required to re-mount disks. (He also mentioned that Tapjoy doesn’t use AWS’s EBS for similar reasons, namely that it doesn’t seem very reliable.)
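To make the idea of app-layer replication a bit more concrete, here is a minimal Python sketch. To be clear, this is my own illustration, not Tapjoy’s code: the `Node` and `ReplicatedStore` classes, the `min_acks` threshold, and the in-memory dictionaries are all assumptions I’ve made up for the example. The only point it tries to show is that the application itself writes each record to several ephemeral nodes, so losing one node does not require re-mounting or recovering a disk.

```python
class Node:
    """Hypothetical stand-in for one ephemeral instance with local storage."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = {}

    def put(self, key, value):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value

    def get(self, key):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.data[key]


class ReplicatedStore:
    """Writes every record to all nodes; the app, not the disk, replicates."""

    def __init__(self, nodes, min_acks=2):
        self.nodes = nodes
        self.min_acks = min_acks  # assumed threshold, chosen for illustration

    def put(self, key, value):
        acks = 0
        for node in self.nodes:
            try:
                node.put(key, value)
                acks += 1
            except ConnectionError:
                continue  # a dead node is expected; keep going
        if acks < self.min_acks:
            raise RuntimeError(f"only {acks} replicas accepted the write")
        return acks

    def get(self, key):
        # Read from the first healthy node that has the record.
        for node in self.nodes:
            try:
                return node.get(key)
            except (ConnectionError, KeyError):
                continue
        raise KeyError(key)
```

In this sketch, if one node disappears, reads simply fall through to the next replica, and a replacement node can be started empty and filled by later writes.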
He also discussed the concept of service boundaries. He described that there were 20ish services that were connecting, and his point was that with such a broad mesh of potential failure points, you have to avoid assuming that the other side of the service is healthy. In the architecture of the app, he had the concept of the Circuit Breaker, where if a service were down, there was an alternative flow that the architecture followed. This was Tapjoy’s architectural design to deal with the reality that they couldn’t assume that all 20 services were always going to work. It is kind of a real-time backup plan. He also talked about having hardware and software contingencies, backup links, and temporary caches. Finally, he advocated testing failure in production.
Basically, you can see everything I’ve described above by watching the video of Wes’s talk that’s linked to at the top of this post. Here is my perspective.
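For readers who haven’t run into the Circuit Breaker pattern, here is a minimal Python sketch of the general idea. This is my own illustration under stated assumptions, not Tapjoy’s implementation: the class, the `max_failures` and `reset_after` parameters, and the `fetch_targeted_ad` / `cached_default_ad` functions are all hypothetical names I’ve invented for the example.

```python
import time


class CircuitBreaker:
    """After repeated failures, skip the primary call and use the fallback."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # failures before the circuit opens
        self.reset_after = reset_after    # seconds to wait before retrying
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Circuit is open: do not touch the unhealthy service.
                return fallback()
            # Cool-down has elapsed: fall through and allow one trial call.
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        # Success: close the circuit and forget earlier failures.
        self.failures = 0
        self.opened_at = None
        return result


def fetch_targeted_ad():
    """Hypothetical call to a downstream service that is currently down."""
    raise ConnectionError("targeting service unavailable")


def cached_default_ad():
    """Hypothetical alternative flow, e.g. a temporary cache."""
    return "default-ad"
```

Calling `CircuitBreaker().call(fetch_targeted_ad, cached_default_ad)` would return the cached ad, and after a few failures the breaker stops calling the broken service at all until the cool-down has passed. The design choice is that the caller never assumes the other side is healthy, and always has an alternative flow ready.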
Perspective 1: I’d expect that we will see more examples of compelling business evidence like this from other early adopters.
This was my third OpenStack Summit, starting at Hong Kong a year ago. Wes’s talk was interesting to me, as it was really the first one I’ve sat through with OpenStack running in production, at scale, on a central workload for the business. (Others have talked more at a lab or dev/test level, though I may have missed sessions at some point.)
Wes and his team at Tapjoy are early adopters for sure. But the business results they are realizing are pretty breakthrough. While an early example, it is an important one. The results put so much daylight between OpenStack and the status quo of just using AWS or other proprietary offerings that I expect more companies will have to start looking at OpenStack on a private cloud as an option.
Perspective 2: We’ll see more discussion on agile on-demand vs. planned cloud deployments
Mobile adtech startups like Tapjoy exist on the basis of being disruptive and moving quickly. Speed and fast iteration are hallmarks of tech startups. And this speed and agility is a Good Thing, not just for startups, but for anyone.
With this ‘move fast and break things’ ethos, the attraction of public cloud, pay as you go pricing is compelling and natural. You can spin up resources as customers demand those resources.
At the same time, savvy IT shops will seek to optimize cost and value. This should be especially so for tech startups, who generally have to be very aggressive at hacking the most cash efficient options available to them, as the cost of capital for venture funded startups is so high.
So while the image of hacker tech startups just choosing to go Opex and AWS and calling it a day is fashionable, the Tapjoy example provides an interesting contrast. Leading VCs have warned recently that a reckoning is coming in terms of startups and that burn rates are getting out of control. With that as context, you’d have to forecast that more startups will evaluate whether they can reasonably achieve Tapjoy-like economics going private cloud/Capex versus Opex.
For more mature enterprises, who are seeking to drive more agile methodologies more broadly, it will be interesting to watch how this discussion evolves.
Net/ I’d expect that we see more companies coming forward in the next 12 months with similar types of discussions to the Tapjoy example.
Perspective 3: More discussion on building and architecting cloud native apps
Wes had some very useful perspective on building a cloud native app with the assumption that things wouldn’t work. In particular, and I heard this in other sessions too, Tapjoy had architected its app such that replication was occurring at the app layer, and this obviated the need to replicate storage and remount disks. For those farther down the path of going cloud native, this is known. This still seems an emerging practice, however, as I often talk to customers who are asking about data and disk replication. I think moving replication to the app layer will need continued focus and push.
Really enjoyed Wes’s session, and I’d like to thank him and the Tapjoy team for putting together such a useful case study.
Anyone else got thoughts on this session or others at the OpenStack Summit?