Ep 22: Ran Ribenzaft CTO at Epsagon

Ryan Jones: Welcome to the Talking Serverless podcast. I'm your host, Ryan Jones, joined today by Ran Ribenzaft, co-founder and CTO at Epsagon, one of the most popular serverless observability companies on the market. Ran is also an AWS Serverless Hero and has been working in the software industry for more than a decade.

Introduction

Ryan Jones: Q: What things are you working on at the moment, and how are they going?
Ran Ribenzaft: Being a co-founder means working alone, within the product. And for me especially, I really like evangelising the concept of serverless and the concepts of observability. So today, the main focus that we're putting into it is combining serverless and observability in the broader sense. In short, serverless is not only Lambda functions, it's much more than that. At Epsagon we really believe that it doesn't have to be one or zero; it's a spectrum. And when you want to provide observability, you need to do that for Fargate, ECS or Kubernetes as well. If it's part of your stack, you need to gain observability from end to end. The second part is that observability used to be kind of a synonym for monitoring, which was okay, but now everybody talks about the four pillars of observability. So what we're trying to do now is bring in applied observability, which means that whatever you're getting, you'll be able to triage across all the pillars together from the same platform and the same experience, rather than having logs in one place and monitoring in another. What we find today in many companies is that they have CloudWatch, another solution for infrastructure monitoring, and another solution for APM. We really want engineers to be able to work from the same dashboard. So that's what I'm up to.

Ryan Jones: “You bring up a really good topic. A lot of times when we talk about serverless, we talk about Lambda functions, DynamoDB, things like that. And a lot of clients that I've worked with might be using Lambda functions, but the bulk of their stuff is probably still legacy applications that they have to interact with, like you said, running on ECS, containers, all those things.”

Epsagon

Q: Have you seen an adoption increase since covering everything, not just functions?
Ran Ribenzaft: For almost every customer, there is a small piece of non-serverless alongside their serverless; it's 20% of their stack, or even 30%. So being able to provide one solution that really covers, for example, all the AWS stuff works much better. It's obviously a lot of work and a lot of effort to cover more than just Lambda functions, but it's totally rewarding. People are actually seeing their stack, for example an Express application running on EKS, putting a message into an SNS topic, which triggers a Lambda function, which then calls an AppSync API: the whole spectrum of services in AWS, covered in a single pane of glass. That's priceless.
Q: How was Epsagon founded? Where did the idea spring from and how did you get involved?
Ran Ribenzaft: I first started with serverless about three or four years ago. I bought an Alexa for my home with the aspiration that I was going to build something big, voice-wise. Since they started, there's been the Alexa SDK for Python that you can interact with. At that time, there wasn't something like the Serverless Framework, or Zappa, Chalice, or the other frameworks where you just say deploy and it ends up somewhere and it's running. With some of the skills that I built, after a month or so, I got hundreds of thousands of requests every day. At that point, I'd never built something of that order of magnitude; I know that it's definitely not a lot, but going from several users at most to, all of a sudden, tens of thousands or hundreds of thousands of users and requests every day is mind-blowing. I wondered, how do I make it all scale? How do I do everything? And that's the point where I started to really get into Lambda functions and understand exactly how they work. The magical part was not needing to take care of all those things that I didn't really want to do.
At that point I started to understand what happens when it's not just a single Lambda function: when you involve message queues, databases and storage, and another third-party API, it gets more and more complex. So when my co-founder, the CEO, and I got together, it was pretty obvious to us that when you're building these kinds of distributed applications, you get to the point where you can't easily understand what's going on. You can start day one, day two, day three, and all seems fine. A few logs, everything works, okay. But then you hit production, with hundreds of thousands of requests, and you can't find logs anymore, you can't understand the metrics, and you haven't prepared yourself for that point. Since both of us come from a long engineering background, we really understood what the pain was going to be for engineers when they have so many resources together. It was clear to us that we needed to build something for engineers. We started initially with serverless, something that was very tailored to Lambda. Over time, we expanded it more and more, all the way to where we are today. If you have an AWS account, regardless of whether you're running Python, Node or Java on ECS, Fargate, Lambda functions, or even AppSync APIs, you get complete coverage using Epsagon.

Q: Would you recommend, for people that are interested in learning serverless, to try to build Alexa skills as a first step?
Ran Ribenzaft: I really think it's a good way to start. Mainly because you don't need any background knowledge, except for maybe a bit of Python or Node, which are super simple to get started with. You don't need to understand how infrastructure works, how to set up web servers or integrate complex things; you just need very minimal knowledge of one of the scripting languages and you're all set. And I think it's much more rewarding due to the fact that you're talking to something. It's not like code that is CLI based, where you don't really feel anything when something runs. When something talks back to you and you're having a logical conversation, with responses based on speech, that's super fun. So I really recommend this as a good way to start; it's super fun and super rewarding. The ideas can be endless. I built some stuff that tells me when my bus will come, how long it will take me to drive my car back home, or things like that. It's very nice that you can speak instead of just creating another API.


Q: How have you seen things grow at Epsagon, since learning that distributed applications are difficult to manage and engineers are having a hard time with it, and building a solution for it?
Ran Ribenzaft: We've seen a lot of trends moving with serverless. If you asked someone what serverless was in 2017 or 2018, the answer would likely be that it is simply Lambda. At some point, people began to understand that it's not only Lambda. I think 2019 was the year of the biggest breakthrough for serverless, especially for Lambda, which I think leads the serverless revolution. People were debating, realising that it's not only function as a service, it's also the whole composition of the application that we're building. And then the main topic was, what actually is serverless? So, you know, people get to the far end, to Fargate, which is also serverless. Any orchestration is also serverless. Now it's getting to the point where people understand that it doesn't really matter how serverless is defined; what matters more is the approach you take when building applications. The fact is that your code is a liability, not an asset, and writing more code doesn't bring you more value. It's not like intellectual property: the less code you write, the better it can be for you and your team.
So people now understand that the agenda for serverless is to focus on what they're doing and spend less time on building things that already exist. This can be anything from third-party APIs to functions and containers, and anything in between. So that's probably the biggest trend that we're seeing. On top of that, what helps the rapid adoption of such new environments is the tooling, which is continuously evolving in every aspect. You can see that everywhere, from security to development, deployment, observability, monitoring and logging. Experience especially was something that was really lacking for serverless, but now you can see more and more best practices and use cases. My fellow AWS Serverless Heroes were really doing a good job this year spreading that experience, which is priceless for the community and helps it grow into the right state of mind. Ten or fifteen years ago, I used to take IBM and Dell servers and install them on a rack with screws, plugging in power supplies and networks, and configuring the operating system through the BIOS was a nightmare. Today, you don't need to take care of that, and in five years, the average developer won't really care what Nginx or a web server is.

Ryan Jones: "...something that I'm often talking about in my personal conversations is that idea that code is a liability. ... it doesn't necessarily bring you value ... this started happening as I saw AppSync start with the VTL templates and things like that. There was a lot of pushback around VTL templates, because of the complexity side of it. You can use this VTL template to talk directly to DynamoDB, for instance, or another service. And you have no Lambda code in between. No maintenance, no monitoring, it's just right there. It's static, and it works every single time."

Q: AppSync was a really big release. I believe you are one of the only observability platforms that support AppSync. Is that right?
Ran Ribenzaft: Yes, definitely. We provide logs, traces and metrics for AppSync. I really love AppSync. I'm a big fan. The only thing that I really don't like is the VTL templates. I wish they could come up with something simpler and more intuitive.

Ryan Jones: "When I think about observability, a lot of people that I see working with serverless, there's definitely evolutions that happen. They start with Lambda, and then they add SNS or SQS, API Gateway, ... eventually at some point they tackle observability."

Q: I've seen people stick pretty hard to CloudWatch. Have you had any success converting people over to using something like Epsagon? What does that process look like?
Ran Ribenzaft: Yeah, that's the first go-to when you're developing on Lambda. Probably the next thing that you're doing is opening CloudWatch Logs, because it's there, it exists, you don't need to do anything. Any console log or print that you make will just go there. So it's really good for getting started, especially in a development environment. But then comes the part where hundreds of people are on your website or your product, generating tonnes of data, and you can't really read logs in that place anymore, so it doesn't really help you. In that sense, it's a pretty easy move, not always necessarily to Epsagon, but to any of several kinds of solutions. Probably the most popular solution that we're seeing people take up today is shipping logs to Elastic. You can very easily set up and manage Elasticsearch on AWS. There's a somewhat built-in integration, it brings much more value, and it's much simpler, especially at scale. Elastic itself still needs to be managed, but I think it provides a much better offering.
One thing to bear in mind is that logging is a good tool to tell you what happened, but you need to do it right. So many engineers I see are just doing console logs, which are raw prints. Instead, you should make sure that you're printing JSON-formatted logs that carry some more metadata: for example, what's the function name? What timestamp are we in? Which internal function in the code are we in? What's the message? That really helps teams, because once you have a centralised location for all of the logs you really need context: you'll have tonnes of logs from tonnes of services. It's not necessarily a single engineer responsible for the logs who can read everything; it's probably several teams working with the same centralised logging, so you need uniformity.
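As a minimal sketch of the structured logging described above, assuming Python on Lambda: the field names (`function_name`, `code_location`, and so on) and the `context` extra are illustrative choices, not a standard and not something Epsagon prescribes. `AWS_LAMBDA_FUNCTION_NAME` is an environment variable that the Lambda runtime sets.

```python
import json
import logging
import os
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        core = {
            "timestamp": time.strftime(
                "%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)
            ),
            "level": record.levelname,
            # Set by the Lambda runtime; falls back to "local" elsewhere.
            "function_name": os.environ.get("AWS_LAMBDA_FUNCTION_NAME", "local"),
            # Which internal function in the code emitted the line.
            "code_location": f"{record.module}.{record.funcName}",
            "message": record.getMessage(),
        }
        # Optional per-call metadata passed as extra={"context": {...}}.
        # Core fields win if a key appears in both.
        extra = getattr(record, "context", {})
        return json.dumps({**extra, **core})


def build_logger(name="app"):
    """Return a logger that writes JSON lines to the standard stream."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger().info("order placed", extra={"context": {"order_id": "42"}})` then emits a single JSON line that a centralised log store can index by field, which is what makes logs from several teams comparable.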
First make sure logs are structured, then try to standardise the way you're logging inside your organisation so everything will be uniform. Once you're doing that you can build really nice things on top of Elastic. You can show metrics, show logs, and start to use correlation IDs across different Lambda functions as messages move from one Lambda function to the other. It can be through SNS, HTTP, or even a direct invocation. When people start to do it, they soon understand that it will take a lot of time to actually implement everything I've just mentioned. So at this point, people usually start to look for different solutions. It might be even earlier: some don't really want to build anything, they just want a third-party service that will take care of all of the observability aspects. But otherwise, at this point, engineers usually understand that it's going to be a wild ride to have correlation IDs, to be able to alert and integrate with Slack and PagerDuty, to visualise everything, and to detect serverless-specific problems like timeouts or out-of-memory errors. Continuously building dashboards, integrating everything and training all the engineers can take weeks, or more realistically months, just to get it up and running. That's the point where they start to evaluate other solutions.
When they're hitting Epsagon, that's kind of the best of the products available today for serverless. You're getting out-of-the-box dashboards for Lambda functions, including timeouts, out-of-memory errors, cost, invocation analysis and insights. Then, using the tracing, you can really drill into what's going on inside the function itself and any calls that were made. You really don't need to take care of anything like correlation, shipping the logs, scale or storage; you're getting everything out of the box, which is much easier. The whole agenda of serverless is to move fast, continue building what we're building, and not build anything that is not part of the business.
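To make the correlation ID idea concrete, here is a rough Python sketch, under stated assumptions, of carrying one ID across HTTP, SNS and direct invocations. The header name `x-correlation-id` and the attribute name `correlation_id` are hypothetical conventions chosen for illustration; the event shapes follow the API Gateway proxy and SNS formats that Lambda receives, and the returned dictionary matches the keyword arguments of boto3's `sns.publish`.

```python
import uuid

# Hypothetical naming conventions; pick one set and use it everywhere.
CORRELATION_HEADER = "x-correlation-id"
CORRELATION_ATTRIBUTE = "correlation_id"


def extract_correlation_id(event):
    """Return the ID carried by an incoming event, or mint a new one."""
    # HTTP: API Gateway proxy events carry request headers under "headers".
    headers = event.get("headers") or {}
    for name, value in headers.items():
        if name.lower() == CORRELATION_HEADER:
            return value

    # SNS: message attributes arrive under Records[0].Sns.MessageAttributes.
    records = event.get("Records") or []
    if records and "Sns" in records[0]:
        attributes = records[0]["Sns"].get("MessageAttributes") or {}
        if CORRELATION_ATTRIBUTE in attributes:
            return attributes[CORRELATION_ATTRIBUTE]["Value"]

    # Direct invocation: the caller is assumed to put the ID in the payload.
    if CORRELATION_ATTRIBUTE in event:
        return event[CORRELATION_ATTRIBUTE]

    # First hop of the chain: start a new ID.
    return str(uuid.uuid4())


def sns_publish_kwargs(topic_arn, message, correlation_id):
    """Build keyword arguments for sns.publish that forward the ID."""
    return {
        "TopicArn": topic_arn,
        "Message": message,
        "MessageAttributes": {
            CORRELATION_ATTRIBUTE: {
                "DataType": "String",
                "StringValue": correlation_id,
            }
        },
    }
```

Each function would call `extract_correlation_id(event)` on entry, include the value in every log line, and pass it on with `sns.publish(**sns_publish_kwargs(...))`. Doing this by hand in every function is the repetitive work the speaker says pushes teams towards ready-made tooling.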

Ryan Jones: "... so you can use Elastic, you can go that route. But once you get into these more complex things like correlation IDs, that's when things are really going to start breaking down and becoming very complex, and then Epsagon just does that stuff for you. That's really cool."

Industry Advice

Q: Do you have a good example for where listeners can find best practices for logging?
Ran Ribenzaft: Alongside the podcasts that I know of, there are several libraries for Node and for Python. I think that the best way would be to type your programming language into Google, and then type structured logs. You'll probably find libraries that work out of the box, capture some things from the environment variables, and help you do so. The most important part would be to define what your structure is and which pieces of metadata you need to capture. The Burning Monk gives some good examples on his website. The other thing that is really important is to try to automate the process as much as possible, because every log line means that somebody has to add another line of code to their own code, and you need a lot of logging. That means that you need to add a lot of lines of code to your original code base, which everybody wants to avoid.
It's called instrumentation, or middlewares. So for example, every time your function gets an invocation, wrap it with a middleware. One of the greatest ones in JavaScript or Node is Middy, a great framework that allows you to put middlewares across your Lambda functions instead of every developer being responsible for documenting what was in the event, the context, what you want, or where you're running. That can really simplify things, since you don't need to set them up in every function. It's the same for any kind of operation that you're making: AWS SDK calls, HTTP calls, that's instrumentation. It's pretty advanced. I wouldn't say it's the first thing that you should do, but once you do that, you're really removing the need for any engineer to manually log most of the things that you would otherwise need to log.
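Middy is a Node framework; as a hedged illustration of the same middleware idea in Python, a plain decorator can wrap a handler so that the event, the function name and the duration are logged without any logging code inside the handler. The decorator name `logging_middleware` and the output fields are invented for this sketch; `function_name` and `aws_request_id` are attributes of the Lambda context object.

```python
import functools
import json
import time


def logging_middleware(handler):
    """Wrap a Lambda handler so every invocation is logged automatically."""

    @functools.wraps(handler)
    def wrapper(event, context):
        started = time.time()
        base = {
            "function_name": getattr(context, "function_name", "unknown"),
            "request_id": getattr(context, "aws_request_id", "unknown"),
        }
        # default=str keeps unusual event values from breaking the log line.
        print(json.dumps({**base, "stage": "start", "event": event}, default=str))
        try:
            result = handler(event, context)
        except Exception as error:
            # Log the failure, then let Lambda see the original exception.
            print(json.dumps({**base, "stage": "error", "error": repr(error)}))
            raise
        duration_ms = round((time.time() - started) * 1000, 2)
        print(json.dumps({**base, "stage": "end", "duration_ms": duration_ms}))
        return result

    return wrapper


@logging_middleware
def handler(event, context):
    # Business logic only; no logging boilerplate needed here.
    return {"statusCode": 200, "body": "ok"}
```

The design point is that the wrapper is written once and applied to every function, so uniform log lines no longer depend on each developer remembering to add them. Instrumenting AWS SDK or HTTP calls follows the same pattern but is more involved and is not shown here.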

Q: Let's say you were going to build an application today. Would you immediately start with serverless? How do you view new applications being built?
Ran Ribenzaft: So assuming I'm familiar with the infrastructure and I'm feeling comfortable with serverless, ECS, Kubernetes, or whatever, my first goal would be to try to understand how I can implement it on top of serverless. If it's completely serverless, like Lambda functions and the rest of the services that are completely serverless, I definitely need to take scale and performance into consideration. I think that there are only two things that might move me away from a serverless state of mind. The first is very large scale, which has cost implications. There are some scenarios where building a fully native serverless application would be much more expensive than running it on non-serverless. But I'm talking about tens of billions or hundreds of billions of events every day. In such scenarios I would reconsider, because the cost you're paying can be an order of magnitude higher. The second thing is performance. I'm not trying to whine about needing another 100 milliseconds for the page to load or something like that. I mean building something that needs to perform in a matter of milliseconds; usually it's the advertising start-ups or companies that need to respond in a matter of three or four milliseconds. That's something that can trip up serverless, because no matter how hard you try, you won't be able to respond that quickly. We do have, at Epsagon, one location where we just can't do that, because we need to reply in a matter of very few milliseconds. So for us, ECS was the obvious choice there, without even considering serverless or API Gateway. Apart from those scenarios my go-to would be serverless. If neither of the two that I mentioned will impact you, you can build it in serverless. Many people are afraid and wonder how they are going to go event-driven and distributed, but if you haven't already done it, you should do so. It will bring you much more value in terms of ruggedness and development velocity, which is priceless compared to building other things.

Conclusion

Ran Ribenzaft: "My Twitter DMs are open if anybody wants to ask anything. Also, being an AWS Serverless Hero, I really love helping, regardless of observability. Feel free to ping me on Twitter. Thanks for having me."

Ryan Jones: To all our listeners, definitely check out Epsagon and definitely hit Ran up. He's the guru of observability. This has been another Talking Serverless podcast with Ryan Jones. If you like our show and want to learn more, check out TalkingServerless.io and feel free to leave us a review on iTunes, Spotify and Google Podcasts. Join us next time as we sit down with another fantastic serverless guest.