How to Defeat Downtime with Observability


In today’s world, the essential ingredient for the success of an organization is the ability to reduce downtime. If not handled properly, it interrupts the company’s growth, impacts customer satisfaction, and could result in significant monetary losses. Resolutions can also be difficult when the correct data is unavailable, thus prolonging the downtime. This affects the SLA and decreases the product’s reliability in the market.

The best way to deal with downtime is to avoid its occurrence. Data teams should have access to tools and measures to prevent such an incident by detecting it even before it happens. This kind of transparency can be achieved using Observability. By implementing Observability, teams can manage the health of their data pipeline and dramatically reduce downtime and resolution time.

What is Observability? 

Introduction to Observability

Observability is the ability to measure the internal status of a system by examining its outputs. A system is highly observable if it does not require additional coding and services to assess and analyze what’s going on. During downtime, it is of utmost importance to determine which part of the system is faulty at the earliest possible time. 

Three Pillars of Observability

The three pillars that must be considered simultaneously to obtain Observability are logs, metrics, and traces. When you combine these three “pillars,” a remarkable ability to understand the whole state of your system emerges. Let us learn more about these pillars:

Logs are the archival records of your system's functions and errors. They are always time-stamped and come in binary or plain text, as well as structured formats that combine text and metadata. Logs allow you to look back and see what went wrong, and where, within a system.
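As a toy illustration, a structured log entry can be as simple as a time-stamped JSON object; the field names below are our own invention, not a standard schema:

```python
import json
import time

def format_log(level, message, **metadata):
    """Emit one time-stamped, structured log line: plain text plus metadata."""
    entry = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "level": level,
        "message": message,
        **metadata,
    }
    return json.dumps(entry)

# A structured entry is easy for both humans and log pipelines to search.
line = format_log("ERROR", "payment gateway timeout", service="payments", attempt=3)
print(line)
```

Because every entry carries the same machine-readable fields, a log pipeline can filter by `service` or `level` instead of grepping free text.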

Metrics can be a wide range of values monitored over some time. Metrics are often vital performance indicators such as CPU capacity, memory usage, latency, or anything else that provides insights into the health and performance of your system. The changes in these metrics allow teams to understand the system’s end performance better. Metrics offer modern businesses a measurable means to improve the user experience.
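As a minimal sketch of tracking one such metric in code, the class below keeps a rolling window of latency samples; the window size and percentile logic are simplified assumptions, not a production implementation:

```python
from collections import deque
from statistics import mean

class RollingMetric:
    """Track the last N samples of a metric (e.g. request latency in ms)."""
    def __init__(self, window=100):
        self.samples = deque(maxlen=window)

    def record(self, value):
        self.samples.append(value)

    def average(self):
        return mean(self.samples)

    def p95(self):
        # Naive percentile: fine for a sketch, not for sparse windows.
        ordered = sorted(self.samples)
        return ordered[int(0.95 * (len(ordered) - 1))]

latency = RollingMetric(window=5)
for ms in [120, 95, 110, 480, 105]:  # one slow outlier
    latency.record(ms)
```

Watching the average alongside a high percentile is what lets a team spot that "most requests are fine, but some users are suffering."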

Traces are a method to follow a user’s journey through your application. Trace documents the user’s interaction and requests within the system, starting from the user interface to the backend systems and then back to the user once their request is processed. 
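The journey described above can be sketched as a toy flow in which one identifier is minted at the edge and carried through every hop; the component names are illustrative, and real systems delegate this to a tracing library:

```python
import uuid

def handle_request(user_action):
    """Toy request flow: one trace ID is minted at the edge and carried by every hop."""
    trace_id = uuid.uuid4().hex
    events = []

    def hop(component, detail):
        # Each component records its work against the same trace ID.
        events.append({"trace_id": trace_id, "component": component, "detail": detail})

    hop("frontend", f"received '{user_action}'")
    hop("api-gateway", "routed to orders service")
    hop("orders-service", "queried the database")
    hop("frontend", "rendered the response")
    return events

events = handle_request("checkout")
```

Because every event shares the trace ID, the full path of one user request can be reassembled even when each component logs to a different place.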

A system’s overall performance can be maintained and enhanced by implementing the three pillars of Observability, i.e., logs, metrics, and traces. As distributed systems become more complex, these three pillars give IT, DevSecOps, and SRE teams the ability to access real-time insight into the system’s health. Areas of degrading health can be prioritized for troubleshooting before impacting the system’s performance. 

What are the benefits of Observability?

Observability tools are no longer a luxury but a necessity in this fast-paced, data-driven world. Key benefits of Observability include:

  1. Detect anomalies before they impact the business, preventing monetary losses
  2. Speed up resolution time and meet customer SLAs
  3. Reduce repeat incidents
  4. Reduce escalations
  5. Improve collaboration between data teams (engineers, analysts, etc.)
  6. Increase trust in and reliability of data
  7. Make decisions more quickly

Observability Use-cases

Observability is essential because it gives you greater control over complex systems. Simple systems have fewer moving parts, making them easier to manage. Monitoring CPU, memory, databases, and networking conditions is usually enough to understand these systems and apply the appropriate fix.

Distributed systems have a far higher number of interconnected parts, so the number and types of failure are also higher. Additionally, distributed systems are constantly updated, and every change can create a new kind of failure. Understanding a current problem is an enormous challenge in a distributed environment, mainly because it produces more “unknown unknowns” than simpler systems. Because monitoring requires “known unknowns,” it often fails to address problems in these complex environments adequately.

Observability is better suited for the unpredictability of distributed systems, mainly because it allows you to ask questions about your system’s behavior as issues arise. “Why is X broken?” or “What is causing latency right now?” are a few questions that Observability can answer.

SREs often waste valuable time combing through heaps of data to identify what matters and requires action. Rather than slowing down operations with tedious, manual processes, Observability provides automation to identify which data is critical, so SREs can act quickly, dramatically improving productivity and efficiency.

Best practices to implement Observability

  • Monitor what matters most to your business so you do not overload your teams with alerts.
  • Collect and explore all of your telemetry data in a unified platform.
  • Determine the root cause of your application’s immediate, long-term, or gradual degradations.
  • Validate service delivery expectations and find hot spots that need focus.
  • Optimize the feedback loop between issue detection and resolution.

Observability tools

Features to consider while choosing the right tool

Observability tools have become critical to meeting operational challenges at scale. To get the best out of Observability implementation, you will need a reliable tool that enables your teams to minimize toil and maximize automation. Some of the key features to consider while choosing an application are:

  • Core features offered
  • Initial set-up experience
  • Ease of use 
  • Pricing
  • Third-party integrations
  • After-sales support and maintenance

List of tools

Considering the above factors, we have compiled a list of effective observability tools that can offer you the best results:

  • ContainIQ
  • SigNoz
  • Grafana Labs
  • DataDog
  • Dynatrace
  • Splunk
  • Honeycomb
  • LightStep
  • LogicMonitor
  • New Relic


Reporting for Observability

With effective observability tools, you also need a reliable reporting tool that can deliver professional reports from these tools to your stakeholders regularly and on time. If you use Grafana for Observability or the Elastic Stack for SIEM, check out Skedler Reports.

Skedler Reports helps Observability and SOC teams automate stakeholder reports in a snap without breaking the budget. You can test-drive Skedler for free and experience its value for your team. Click here to download Skedler Reports.

Is observability the future of systems monitoring?

As the pressure increases to resolve issues faster and understand the underlying cause of the problem, IT and DevOps teams need to go beyond reactive application and system monitoring.

They will need to dig into the tiniest technical details of every application, system, and endpoint, seeing both real-time performance and past anomalies, in order to correct repeat incidents.

A mature observability strategy can give you an insight into previous unknowns and help you more quickly understand why incidents occur. And as you continue on your observability journey and understand what and why things break, you’ll be able to implement increasingly automated and effective performance improvements that impact your company’s bottom line.

Three Pillars of Observability – Traces (Part 3)


Observability is the ability to measure a system's internal state. It helps us understand what's happening within the system by looking at specific outputs or data points. It is essential, especially for the complex, distributed systems that power many apps and services today. Its benefits include better workflows, improved visibility, faster debugging and fixes, and greater agility.

Observability depends on three pillars: logs, metrics, and traces. Hence, the term also refers to multiple tools, processes, and technologies that make it possible. We have already touched upon logs and metrics, and this article will cover the last pillar, traces.

Understanding Traces

The word ‘trace’ refers to discovery by investigation, or to finding a source, for example tracing the origin of a call. In observability, the term means something similar: the ability to track a user request fully through a complex system. A trace differs from a log. A log may only tell us that something went wrong at a certain point; a trace goes back through all the steps to pinpoint the exact instance of the error or exception.

A trace is more granular than a log and a great tool for understanding and sorting out bottlenecks in a distributed system. A trace is made up of ‘spans’ that track user activity through a distributed system (microservices), with the help of a universally unique identifier that travels with the request so it can be tracked.

Multiple spans form a trace that can be represented pictorially as a graph. One of the most common frameworks used for Traces is OpenTelemetry, created from OpenCensus and OpenTracing.
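Here is a sketch of that span-and-trace relationship, a simplified stand-in for what frameworks like OpenTelemetry manage for you; the field names are illustrative, not OpenTelemetry's actual API:

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """One unit of work. Spans in a trace share a trace_id; parent links form the graph."""
    name: str
    trace_id: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])

def child_of(parent, name):
    """Start a child span that inherits the parent's trace context."""
    return Span(name=name, trace_id=parent.trace_id, parent_id=parent.span_id)

# A three-hop request: frontend -> auth service -> database query.
root = Span(name="GET /checkout", trace_id=uuid.uuid4().hex)
auth = child_of(root, "auth-service")
db = child_of(auth, "db-query")
```

Following the `parent_id` links upward from any span reconstructs the graph that tracing UIs draw as a waterfall or flame chart.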

Why do we need to use Traces?

Traces help us correct failures, provided we are using the right tools. Traces are life-savers for admin and DevOps teams responsible for monitoring and maintaining a system. They can follow the path a user request takes, see where bottlenecks happened and why, and decide what corrective actions need to be taken.

While metrics and logs provide adequate information, traces go a step further, giving us the context to better understand and utilize the other two pillars. Traces add crucial visibility, making that information easier to decipher.

They are better suited for debugging complex systems and answering many essential questions related to their health. For example, to identify which logs are relevant, which metrics are most valuable, which services need to be optimized, and so on.

Software tracing has been around for quite some time. However, distributed tracing is the buzzword in the IT industry these days. It works across complex systems that span cloud-based environments and microservices.

Therefore, we cannot pick one of the three observability pillars over the others. Traces work well alongside metrics and logs, providing much-needed overall efficiency. That is what observability is all about: keeping our systems running smoothly and efficiently.


Implementing traces in systems is a complex and tedious task, especially considering most systems are distributed. It might involve code changes in many places, which can be challenging for DevOps personnel. Every piece of data in a user request must be traced through and through. Implementing tracing across multiple frameworks, languages, etc., makes the task even more challenging.

Also, tracing can be an issue if you have many third-party apps as part of your distributed system. However, proper planning, usage of compatible tools that support custom traces, monitoring the right metrics, etc., can go a long way in overcoming these.

The Skedler advantage

As we have seen, to make good use of the three pillars of observability, we need to rely on good tools. We need a reliable reporting tool if we want good visualizations built from traces and the information they expose. That's where Skedler comes in.

Skedler works with many components in the DevOps ecosystem, such as the ELK stack and Grafana, making it easier to achieve observability. The Skedler 5.7.2 release supports distributed tracing, the need of the hour, and ships with a new panel editor and a unified data model.

Skedler gives you an edge by leveraging the best of the underlying tools to provide incredible visualized outputs. These reports help you make sense of the multitude of logs, metrics, traces, and more, giving you enriched insights into your system to keep you ahead. It thus helps ensure a stable, highly available system that delivers a great customer experience.


In conclusion, observability is a key aspect of maintaining distributed systems. Keeping track of the three pillars of observability is critical: logs, metrics, and traces. Together, they form the backbone of a healthy system and a crucial monitoring technique for all system stakeholders.

While multiple tools are available for this purpose, a crucial requirement is that the tool give you clear visibility into the system's health. A good observability tool should generate, process, and output telemetry data with a sound storage system that enables fast retrieval and long-term retention. Skedler can help you deliver automated, periodic, visualized reports to distributed stakeholders, prompting them to take the necessary action.

Episode 9 – Top 5 Challenges for Mobile Service Providers Today and How to Tackle Them with DevOps and Analytics

In episode 9 of Infralytics, Shankar spoke with John Griffiths. John is the Senior Product Manager for Openmind Networks, a leading provider of messaging infrastructure for mobile service operators and intercarriers. The subject of discussion was the “Top 5 challenges for Mobile Service Providers today and how to tackle them with DevOps and analytics.”

Mobile Service Providers: The Interview

Telcos are planning for the 5G rollout and there are huge expectations among consumers and businesses regarding how 5G could transform and improve connectivity. Meeting such expectations is never easy. What are the top challenges faced by mobile service providers today?

Due to increased competition in the mobile sector and to government regulation, operators are having to deal with decreasing revenues and shrinking margins for the same services. They have to do this in the face of the usual challenges of needing to upgrade and invest in their networks.

5G is the latest technology that requires serious investment. Mobile operators are also no longer competing only among themselves: new competitors are entering the space and offering over-the-top services. The risk in this climate is that mobile operators become marginalized, with the worst-case scenario being that they are reduced to providers of data bandwidth over which messaging and streaming services are carried. It's a huge challenge for operators to stay relevant.

How are mobile service providers addressing these challenges around competition?

The more cutting-edge operators have realized that to survive in this environment, with the disruptive innovations that keep happening, they have to create a slim and efficient network. Mobile operators can gain efficiencies by reducing and optimizing their hardware footprint. This is done through Network Function Virtualization (NFV), the mobile sector's preferred approach for maximizing efficiency. NFV architecture enables operators to plan for tomorrow's systems and applications with hardware that can run multiple applications simultaneously.

Another way mobile service providers are addressing these challenges is by becoming more IT-centric. Network technologies are being reworked and moved into an IT-centric, software-driven environment. IT and Internet companies are much farther ahead of mobile operators in this process; they have employed DevOps, continuous integration, and continuous delivery, enabled through automation and optimization of services. The most cutting-edge mobile operators are beginning to learn from these companies and are adopting these techniques. DevOps involves releasing small incremental improvements weekly or monthly, so the cutting-edge operators have done away with large upgrade projects.

DevOps also enables automated testing, which replaces manual testing. In addition to these efficiencies, DevOps adds value in its own right.

Have you seen adoption of container technologies by Mobile Service providers or is it too early?

Leading mobile operators are beginning to adopt container technologies. For example, Openmind's new platform is based on containers, and it's Docker-based. The advantage of containers is that they enable us to deploy in any environment. It's also very DevOps-friendly. So containerization makes everything smooth and easy from a mobile operator perspective.

There used to be a lot of testing before a major release, but with automated testing you can immediately run tests whenever there is any small change.

Are Telcos implementing any monitoring tools since changes are so frequent?

At Openmind we provide all of the software updates, testing services, and monitoring to the operators. 

In terms of going from 4G LTE to 5G, are there more endpoints that they need to monitor?

Yes there are always a huge number of updates and software rollouts on the network with any new technology. From a messaging perspective, it’s yet to be determined whether the architecture will change in 5G.

I read about the RCS that is being adopted by all of the hardware vendors and Telcos. Any thoughts on that?

RCS has been held up as a great hope for a number of years, and Openmind is the first messaging vendor to have a GSMA-accredited RCS product. Despite this, industry pickup hasn't been as strong as was hoped. Even after Google came on board, adoption hasn't been as great as many hoped it would be.

How is analytics being used to address the various challenges you mentioned?

With the new generation of messaging products, Big Data and analytics are part of the products themselves. These can be customized as customer-facing services and AI services. Mobile operators have valuable data that enterprises can use to communicate more effectively with their customers. This new generation of products incorporates that customer data into the products themselves.

Big Data is also aiding the messaging space in artificial intelligence. The latest developments in machine learning and neural networks are now being applied to message categorization based on the content in the message. Mobile operators can then use this classification and categorization to make intelligent routing decisions about where to route different types of messages. This enables them to offer different levels of services and charge different rates for the different levels. 

As far as security is concerned, if you can categorize messages as spam and/or fraudulent, you can block those types of messages.
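The categorize-then-route idea described above can be sketched with a toy classifier; the markers, category labels, and queue names here are all invented for illustration, where a real operator would use a trained ML model:

```python
# Invented markers standing in for a trained content classifier.
SPAM_MARKERS = ("free prize", "click here", "winner")

def categorize(message):
    """Toy stand-in for an ML classifier: assign a category based on message content."""
    text = message.lower()
    if any(marker in text for marker in SPAM_MARKERS):
        return "spam"
    return "person-to-person"

# Hypothetical routing table: category -> delivery route (and, in practice, rate tier).
ROUTES = {"spam": "quarantine-queue", "person-to-person": "priority-route"}

def route(message):
    return ROUTES[categorize(message)]
```

The point of the sketch is the pipeline shape: classification happens once per message, and routing, rating, and blocking decisions all hang off the resulting category.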

Is the concept of sending targeted messages to different demographics in the form of ads similar to what Google and Facebook are already doing with their ad businesses something that we are heading towards in the messaging space?

Yes. The categorization is really about targeting the messages and campaigns. It involves offering services to enterprises by making use of the knowledge you have about the consumers themselves. For example, if you know that certain subscribers are roaming in a certain location, because you can detect that, and they are in a shopping mall, it's possible to send a campaign to that targeted group rather than spamming the whole group of subscribers. It's about using real-time data and customer profiling to target messages and campaigns.

So at Openmind are you using your own stack or what do you use to offer these analytics capabilities?

Some operators have their own analytics systems, but for customers that want a messaging system plus analytics capabilities, we base our products on what Elasticsearch and Kibana offer and build on top of that. One of the things we have put on top of standard Kibana is the Skedler Reporting Tool, for sending scheduled reports to people who don't need to access the analytics systems themselves and just need a regular report sent to them.


What are your thoughts on the revelation that mobile service providers can send targeted messages based on real-time data collected and customer profiling similar to how companies like Facebook and Google use data to target ads? 

Openmind uses the Skedler reporting tool to send scheduled reports to their Telco customers. Are you interested in trying Skedler reports or Skedler alerts for your business? Start your free trial today!

Episode 8 – How to Build a Cloud-Scale Monitoring System

In Episode 8 of the Infralytics Show, Shankar interviewed Molly Struve. Molly is the Lead Site Reliability Engineer for DEV Community, an online portal designed as a place where programmers can exchange ideas to help each other. The discussion focused on two topics, “How to build a cloud-scale monitoring system” and “How to scale your Elastic Stack for cloud-scale monitoring.” 

[video_embed video=”8bzSK3EiIPw” parameters=”” mp4=”” ogv=”” placeholder=”” width=”700″ height=”400″]

How Molly started working in software engineering and cloud-scale monitoring

Molly earned an aerospace degree from MIT after originally thinking she would study software engineering. Since all engineering degrees provide students with the same core problem-solving skills, when she later decided to work in software engineering she already had the background she needed to make the transition. The reason she didn't go the aerospace route is that the aerospace industry is concentrated in California and Washington, and she is from Chicago and didn't really want to move. It's good to know that people with various educational backgrounds can still find success in software engineering!

Let’s jump into the discussion of cloud-scale monitoring! Here are the key points Molly made in reference to the topics listed above.

The Interview – building a cloud-scale monitoring system

What are some of the key requirements to look for when you build out a large cloud-scale monitoring system?

When you start monitoring, you just want coverage, and to get it you often keep adding tools until, before you know it, you have 6, 7, or 8 different tools doing all this monitoring. When the time comes to use them, you have to open all these different windows in your browser just to piece together what is actually going on in your system. So one of the key things Molly tells people building a monitoring system is to consolidate all of the reporting. You can have different tools, but you need to consolidate the reporting in a single place: a one-stop shop where you can find all the information you need.

When an alert triggers, it must require an action; otherwise alert fatigue becomes a big problem, as it is in many monitoring systems. When you have a small team, it might seem fine to have exceptions that everyone knows about, alerts nobody responds to, but as your team gets larger you have to tell every new engineer what the exceptions are, and that process simply doesn't scale. So you have to be very disciplined in responding to alerts.

The goal is to get to a point where whoever is on call, whether it’s one person, two people, or three people, can handle the error workload that is coming into the system by way of alerts. 

In the beginning, when you are setting up a monitoring system, you might have a lot of errors, and you just have to fix things; improvement of the system comes with time. The ideal metric is zero errors, so you need to be aware of when errors reach the point where they need to be addressed.

Monitoring from an infrastructure perspective is different from monitoring from a security perspective

Trying to figure out what to monitor is also very challenging. You have to set up your monitoring and adjust it as you go, depending on what perspective you are monitoring from. Knowing what to monitor is a little bit trial and error. If there is data you wish you had monitoring for, you can address the error and then go in and add the necessary code so that it's there in the future. After you do that a few times, you will end up with a really robust system, so the next time an error occurs, all the information you need will be there and it might only take you a few minutes to figure out what's wrong.

Beyond bringing the data together and optimizing alerting, what are the other best practices?

Another best practice is tracking monitoring history. When trying to solve the error from an alert, you will want to know what the past behavior was. Past behavior can help you debug a problem. What were you alerted about in the past and how was the problem addressed then?

Also, you have to remove all manual monitoring for your monitoring system to be truly scalable. Some systems require employees to check a dashboard every few hours, but this task is easily forgotten. So, if you want a monitoring system to scale you have to remove all manual monitoring. You don’t want to rely on someone opening up a file or checking a dashboard to find a problem. The problem should automatically come to you or whoever is tasked with addressing it. 

What tools did you use to automate?

At Kenna we used Datadog. It's super simple, and it integrates easily with Ruby, which is the language I primarily work with.

Anything else important on the topic of best practices for cloud-scale monitoring?

Having the ability to mute alerts when you are in the process of fixing them is important. When a developer is trying to fix a problem, it’s distracting to have an alert going off repeatedly every half hour. Having the ability to mute an alert for a set amount of time like an hour or a day can be very helpful. 
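A minimal sketch of that muting mechanism, with explicit timestamps so the behavior is easy to follow (the alert names and durations are illustrative):

```python
import time

class AlertMuter:
    """Suppress re-firing of alerts an engineer has muted for a set duration."""
    def __init__(self):
        self._muted_until = {}

    def mute(self, alert_name, seconds, now=None):
        now = time.time() if now is None else now
        self._muted_until[alert_name] = now + seconds

    def should_fire(self, alert_name, now=None):
        now = time.time() if now is None else now
        # Unknown alerts are never muted; muted ones fire again after expiry.
        return now >= self._muted_until.get(alert_name, 0.0)

muter = AlertMuter()
muter.mute("high-latency", seconds=3600, now=1000.0)  # mute for one hour
```

The key design point is that the mute expires on its own: the engineer gets quiet while fixing the problem, but the alert cannot be silenced forever by accident.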

What else is part of your monitoring stack?

The list goes on and on. You can use Honeybadger for application errors, AWS metrics for your low-end infrastructure metrics, StatusCake for your APIs to make sure your site is actually up, Elasticsearch for monitoring, and CircleCI for continuous integration. It's a long list of different tools, but we consolidated them all through Datadog.

What kind of metrics did your management team look for?

Having a great monitoring system allows you to catch incidents and problems before they become massive problems. It’s best to be able to fix issues before the point at which you would have to alert users to the problem. You want to solve problems before they impact your user base. That way on the front-end it looks to the user like your product is 100% reliable, but it’s just because developers have a system on the backend that alerts them to problems so they can stop them before they directly impact users. Upper management obviously wants the app to run well because that’s what they are selling and the monitoring system allows for that to happen.

How big was the Elasticsearch cluster where you worked before?

The logging cluster that we used at Kenna had 10 data nodes. The cluster we used for searching client data was even bigger: a 21-node cluster.

What were some of the problems when it came to managing this large cluster?

You want to define what you are logging and make it systematic. Early on at Kenna, when we logged user information we would end up with a ton of different keys, which created more work for Elasticsearch. It also made searching and using the data nearly impossible. To avoid this, you need to come up with a logging system by defining keys and making sure that everyone uses those keys when they are in the system logging data.
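One way to enforce that key discipline is to normalize every event before it is indexed; the allowed key names below are invented examples of such an agreed-on schema:

```python
import json

# The agreed-on schema; these specific key names are just examples.
ALLOWED_KEYS = {"user_id", "action", "status", "duration_ms"}

def normalize_event(event):
    """Keep only agreed-on keys; stash anything off-schema in one 'extra' field
    instead of letting each stray key become a new index mapping."""
    known = {k: v for k, v in event.items() if k in ALLOWED_KEYS}
    extra = {k: v for k, v in event.items() if k not in ALLOWED_KEYS}
    if extra:
        known["extra"] = json.dumps(extra, sort_keys=True)
    return known

evt = normalize_event({"user_id": 42, "action": "login", "favorite_color": "teal"})
```

Off-schema data isn't lost; it is parked in a single catch-all field, so the index's field count stays fixed no matter what developers log.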

We set up our indexes by date, which is common. When an index is a month past its date, you want to shrink it to a single shard, which decreases the resources Elasticsearch needs to use that index. Even further out, you should eventually close the index so that Elasticsearch doesn't spend any resources on it.
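That aging policy can be sketched as a small decision function; the 30- and 90-day thresholds and the `logs-YYYY.MM.dd` naming pattern are assumptions for illustration, mapped in comments onto Elasticsearch's shrink and close APIs:

```python
from datetime import date

def index_lifecycle_action(index_date, today):
    """Decide what to do with a dated index: shrink after 30 days, close after 90.
    Thresholds and the index-name pattern are illustrative assumptions."""
    age_days = (today - index_date).days
    name = f"logs-{index_date:%Y.%m.%d}"
    if age_days >= 90:
        # Would map to Elasticsearch's close API: POST /<index>/_close
        return ("close", name, None)
    if age_days >= 30:
        # Would map to the shrink API: POST /<index>/_shrink/<target>
        return ("shrink", name, {"index.number_of_shards": 1})
    return ("keep", name, None)
```

Running such a function nightly over all dated indexes gives you the "shrink at a month, close later" behavior without anyone touching a dashboard.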

Any other best practices for cloud-scale monitoring?

Keeping your mapping strict can help you avoid problems. If you are doing the searching yourself, use filters rather than queries where you can. Filters run a lot faster and are easier on Elasticsearch, so you want to use them when you are searching through data.
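To make the filter-versus-query distinction concrete, here are the two request-body shapes side by side, expressed as Python dicts; the structure follows Elasticsearch's standard query DSL, while the field names (`message`, `status`, `@timestamp`) are illustrative:

```python
# Scored full-text query: Elasticsearch computes a relevance score per document.
scoring_query = {
    "query": {
        "match": {"message": "gateway timeout"}
    }
}

# Filter context: yes/no matching, no scoring, and results are cacheable.
filtered_query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"status": "error"}},
                {"range": {"@timestamp": {"gte": "now-1h"}}},
            ]
        }
    }
}
```

For log exploration you usually want exact matches over a time window, not relevance ranking, which is exactly what the cheaper filter context provides.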

Finally, educating your users on how to use Elasticsearch is important. If developers don't know how to use it correctly, Elasticsearch will time out. So teach users how to search keys, analyzed fields, unanalyzed fields, etc. This also helps users get the targeted, accurate data they are looking for, so educating them on Elasticsearch is for their benefit as well. Internal users at Kenna (who are being referred to here) conducted searches through Kibana. Clients interfaced with the data relevant to them (after training) through an interface the Kenna team built, which prevented clients from doing things that could take down the entire cluster.

So are you using elasticsearch in your current role at DEV?

DEV is currently using a paid search tool, but we hope to switch to Elasticsearch because it is open source and will give us more control over our data and how we search it.

There’s an affordable solution for achieving the best practices described

Molly described the importance of consolidating reporting, responding to alerts, avoiding alert fatigue, automating alerts and reports, and tracking monitoring history. Just two weeks prior to this interview, Shankar gave a presentation about avoiding alert fatigue, and this relevant topic keeps becoming a focus of discussions. Many of the points Molly made, from the importance of automating alerts and reports to the importance of consolidating reporting, are the reasons we started Skedler. 

Are you looking for an affordable way to send periodic reports from Elasticsearch to users when they need them? Try Skedler Reports for free!

Do you want to automate the monitoring of Elasticsearch data and notify users of anomalies in the data even when they aren't in front of their dashboards? Sign up for a free trial of Skedler Alerts!

We hope you are enjoying our podcast so far. Happy holidays to all of our listeners. We will be taking a short break, but will be back with new episodes of The Infralytics Show in 2020!

Episode 7 – Best Practices for Implementing Observability in Microservices Environments

In this episode of Infralytics, Shankar interviewed Stefan Thies, the DevOps Evangelist at Sematext, a provider of infrastructure and application performance monitoring and log management solutions including consulting services for Elastic Stack and Solr. Stefan also has extensive experience as a product manager and pre-sales engineer in the Telecom domain. Here are some of the key discussion points from our interview with Stefan on implementing observability in microservices!

[video_embed video=”hY1gkea4LDo” parameters=”” mp4=”” ogv=”” placeholder=”” width=”700″ height=”400″]

Microservices based on containers have become widely popular as the platform for deploying solutions in public, private, or hybrid clouds. What are the top monitoring and management challenges faced by organizations deploying container-based microservices that want to implement observability?

There are quite a lot of challenges. Some people start with a simple host and later use orchestration tools like Kubernetes, and what we see is that containers add another infrastructure layer and a new kind of resource management. At the same time, we are monitoring performance with new kinds of metrics. What we developed in the past were specialized monitoring agents to collect these new kinds of metrics at all layers: a cluster node has performance metrics for that specific node, on top of that are Kubernetes pods, and within a pod there are several containers and multiple processes. So first, the new monitoring agents need to be container-aware, so that they can collect metrics from all of the layers.

The second challenge is the new way of dynamic deployment and orchestration. You deal with more objects than just servers and services: you also deal with cluster nodes, containers, and the deployment status of your containers. This can be very dynamic, and orchestrators like Kubernetes move your applications around, so an application may fail on one node and the cluster shifts it to another. It becomes very hard to track errors and failures in your application. The new orchestration tools add challenges for DevOps people, because they need to see not only what happens in the applications but also at the cluster level, and because things are moving around, there is now another layer of complexity in the process.

What are additional challenges that come with containers? What should administrators be looking for?

There are metrics on every layer: servers, clusters, pods, containers, deployment status. Another challenge is that log management has also changed completely. You need a logging agent that’s able to collect the container logs. With every log line we add information about which node it is on, in which pod it is deployed, and which container and container image it belongs to, so we have better visibility. The next thing that comes with container deployment is microservices. Architectures today are typically more distributed and split into small services that work closely together, but it’s harder to trace transactions that go through multiple services. Transaction tracing is a new pillar of observability, but it requires more work to implement the necessary code.
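The enrichment Stefan describes, attaching node, pod, and container metadata to every log line, can be sketched in a few lines of Python. The field names and the metadata lookup below are hypothetical illustrations, not any specific agent’s actual schema:

```python
# Sketch of container-aware log enrichment: every raw log line gets
# tagged with the node, pod, and container it came from. Field names
# and metadata are invented for illustration.
def enrich_log_line(line, container_id, metadata):
    """Attach orchestration metadata to a raw log line."""
    meta = metadata.get(container_id, {})
    return {
        "message": line,
        "container_id": container_id,
        "container_image": meta.get("image"),
        "pod": meta.get("pod"),
        "node": meta.get("node"),
    }

# Example metadata as an agent might discover it from the container runtime.
metadata = {
    "abc123": {"image": "nginx:1.25", "pod": "web-7f9c", "node": "node-1"},
}

event = enrich_log_line("GET /health 200", "abc123", metadata)
```

With every line carrying these tags, a failed pod’s logs can still be found after the orchestrator reschedules it to another node.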

Basically, log management becomes a challenge because of all of these microservices, and you are also doing tracing, so you are no longer looking just at metrics and events but also at all of the trace data. Having more data requires people to have larger data stores.

How do you consolidate the different datasets?

We use monitoring agents and log agents. Both tools apply the same tags, so the logs and metrics can be correlated.
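The shared-tag correlation Stefan mentions can be sketched as a simple join: because both data sets carry the same label, records about the same pod end up grouped together. The tag name and records here are illustrative, not Sematext’s implementation:

```python
# Sketch of correlating logs and metrics through a shared tag: records
# from both data sets are grouped by the tag value, so everything about
# one pod can be viewed side by side. Tag names are illustrative.
def correlate(logs, metrics, tag="pod"):
    """Group log and metric records by a shared tag value."""
    joined = {}
    for record in logs + metrics:
        joined.setdefault(record[tag], []).append(record)
    return joined

logs = [{"pod": "web-7f9c", "message": "OOMKilled"}]
metrics = [{"pod": "web-7f9c", "memory_mb": 512}]

by_pod = correlate(logs, metrics)
```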

How do you standardize the different standards and practices?

With open source, it’s a lot of do-it-yourself, which means you need to sit down and think: what metrics do I have? What labels do I need? Then do the same for the logging and for the monitoring.

What are your recommended strategies for organizations?

More and more users expect 24/7 services because that is what they get from Google and Facebook. All the big vendors offer 24/7 services. Smaller software vendors face a real challenge to be on the same level and to become aware of any problem as soon as possible.

What you need to do is start with availability monitoring, then add metrics for infrastructure monitoring: are your servers healthy? Are all the processes running? The next level is application monitoring, to check the performance of your databases, your message queues, and the other tools you use in your stack, and finally the performance of your own applications.

When it comes to troubleshooting and you recognize that some service is not performing well, then you need the logs. In the initial stage, people typically use SSH: log into the server, try to find the log file, and look for errors. Instead, you need to collect the logs from all of your servers, all of your processes, and all of your containers, then index the data and make it searchable and accessible. If you want to be really advanced, you go to the level of code instrumentation and tracing.
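A toy version of “index the data and make it searchable” can be sketched with a simple inverted index. A real deployment would use Elasticsearch or a similar store; this only illustrates the idea of going from grepping files over SSH to querying indexed logs:

```python
# Toy inverted index over collected log lines: each token maps to the
# set of line positions containing it, so searches avoid a full scan.
from collections import defaultdict

def build_index(log_lines):
    """Map each lowercase token to the positions of lines containing it."""
    index = defaultdict(set)
    for i, line in enumerate(log_lines):
        for token in line.lower().split():
            index[token].add(i)
    return index

def search(index, log_lines, term):
    """Return matching lines in their original order."""
    return [log_lines[i] for i in sorted(index.get(term.lower(), []))]

logs = [
    "server-1 connection timeout",
    "server-2 request ok",
    "server-1 disk error",
]
index = build_index(logs)
hits = search(index, logs, "server-1")
```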

What is observability? How is it different from monitoring?

Observability is the whole process: monitoring gives you metrics and logs, and transaction tracing gives you code-level visibility. Together they allow you to pinpoint exactly where a failure happens, so it’s easier to fix. When you have more information, you solve the problem much faster.

How would an organization move from just monitoring to observability?

At Sematext, our log management is very well accepted, so people typically start with collecting the logs because that’s the first challenge they have: where do I store all these logs? Should I set up a separate server for it, or do I go for Software as a Service? These are the types of questions people ask. We see that people start by collecting logs, then discover more features, learn that we offer monitoring, install monitoring agents, and start asking about specific applications. Step by step they do more, and that is the process our customers normally follow.

Are you interested in learning more about what Stefan’s company offers? Check out Sematext. Are you looking for an easy-to-use system for data monitoring that provides automated delivery of metrics and code-free, easy-to-manage alerts? Check out Skedler!

If you want to learn more tips from experts like Stefan, you can read more articles about the Infralytics video podcast on our blog!

Episode 6 – Cybersecurity Alerts: 6 Steps To Manage Them

Is your Security Ops team overwhelmed by cybersecurity alerts? In this episode of The Infralytics Show, Shankar, Founder, and CEO of Skedler, describes the seemingly endless number of cybersecurity alerts that security ops teams encounter. 

[video_embed video=”7nul5V5pM9o” parameters=”” mp4=”” ogv=”” placeholder=”” width=”700″ height=”400″]

The Problem Of Too Many Cybersecurity Alerts

Just to give you an understanding of how far-reaching this problem is, here are some facts. According to information published in a recent study by Bitdefender, 72% of CISOs reported alert or agent fatigue. So, don’t worry, you aren’t alone. A report published by Critical Start found that 70% of SOC Analysts who responded to the study said they investigate 10+ cybersecurity alerts every day. This is a dramatic increase from just last year when only 45% said they investigate more than 10 alerts each day. 78% spend more than 10 minutes investigating each of the cybersecurity alerts, and 45% reported that they get a rate of 50% or more false positives. 

When asked, “If your SOC has too many alerts for the analysts to process, what do you do?”, 38% said they turn off high-volume alerts, and the same percentage said they hire more analysts. However, the problem with hiring more analysts is that more than three quarters of respondents reported an analyst turnover rate of more than 10%, with at least half reporting a 10–25% rate. This turnover rate is directly impacted by the overwhelming number of cybersecurity alerts, which raises the question: what do you do if you need to hire more analysts to handle the endless number of alerts, but the alerts themselves are contributing to a high SOC analyst turnover rate? It seems a situation has been created where there are never enough SOC analysts to meet the demand.

To make matters worse, more than 50% of respondents reported that they have experienced a security breach in the past year! Thankfully, you can eliminate alert fatigue and manage alerts effectively with these 6 simple steps.


The Solution To Being Overwhelmed By Cybersecurity Alerts

1. Prioritize Detection and Alerting

According to Shankar’s research, step 1 is to prioritize threat detection and alerting based on your business and security goals and the resources you have available to achieve them. Defining your goals is a great way to start. Use knowledge of your available resources to plan how you will respond to alerts and how many you can manage per day.

2. Map Required Data

Step 2 is to map your goals to the data that you are already capturing. Then you can see whether you are collecting all of the data required to adequately monitor and meet your security requirements. Identify the gaps by completing a gap analysis to see what information you are not collecting that needs to be collected, and then set up your telemetry architecture to collect the data that is needed.
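The gap analysis in step 2 is, at its core, a set difference between the data sources your goals require and the ones you already collect. A minimal sketch, with hypothetical source names:

```python
# Sketch of a security-data gap analysis: compare required data sources
# against what is already collected. Source names are made-up examples.
def gap_analysis(required, collected):
    """Return the data sources that are needed but not yet captured."""
    return sorted(set(required) - set(collected))

required = {"dns_logs", "auth_logs", "firewall_logs", "endpoint_telemetry"}
collected = {"auth_logs", "firewall_logs"}

missing = gap_analysis(required, collected)
```

The resulting list becomes the to-do list for extending your telemetry architecture.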

3. Define Metric-Based Cybersecurity Alerts

Step 3 is to define metric-based alerts. What types of alerts are you going to monitor? Favor metric-based alerts, which look for variations in events. Metric-based alerts are more efficient than other types of alerts, so Shankar recommends them to those of you at this step. You can also augment your alerts with machine learning.
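A metric-based alert that “looks for variations in events” can be sketched as a simple deviation check against a recent average. The metric, window size, and threshold factor below are illustrative choices, not Skedler’s algorithm:

```python
# Sketch of a metric-based alert: fire when the latest value deviates
# from the recent average by more than a set multiple. The window and
# factor are illustrative tuning knobs.
def deviates(history, value, window=5, factor=2.0):
    """Alert when `value` exceeds `factor` times the recent average."""
    recent = history[-window:]
    if not recent:
        return False
    baseline = sum(recent) / len(recent)
    return value > factor * baseline

failed_logins = [10, 12, 9, 11, 10]   # counts per interval
alert = deviates(failed_logins, 45)   # well above twice the average
```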

Definitely avoid cookie-cutter detection. The cookie-cutter approach is a one-size-fits-all approach that will almost certainly not be the best fit for YOUR organization. Each organization has its own unique setup, and yours should be derived from your own security goals. Also, optimize event-based detection, but keep event-based alerts to a minimum so that your analysts do not end up overwhelmed.

4. Automate Enrichment and Post-Alert Standard Analysis

Once you have set up these rules, the next step is to see how you can automate gathering the additional data your analysts need for their analysis. Can you automate the enrichment of the alert data so that your analysts don’t have to manually look for additional data to provide more context? Also, 70–80% of the analysis an analyst performs while investigating an alert is very standard. So ask yourself: is it possible to automate it?

5. Set Up a Universal Workbench

  • Use a setup similar to Kanban or Trello, where you have a queue and the alerts that need to be investigated move from one stage to the next. This keeps everything organized and lets you arrange alerts in order of importance, so your analysts know which to address first.
  • Add enriched data to these alerts: automate the enrichment process so it is readily available to your analysts through the workbench.
  • Provide more intelligence to the alerts (adding data or whatever else is needed to provide context). This helps you build a narrative for each alert and come up with recommendations that your security analysts can investigate.

These first five steps are not intended to be a one-time initiative but rather a repetitive process, where each step can be perfected over a long period.

6. Measure and Refine

  • Continuous improvement – measure the effectiveness of your alert system: how many alerts flow into the system, how much time your analysts take to investigate each alert, and what the false-positive rate is versus the true-positive rate.
  • Iterative approach – think in terms of sprints: what changes can you make to improve your results in the next iteration? Add more data or change your alert algorithms to get more precise results.
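The step-6 measurements can be sketched as a small computation over resolved alerts: the false-positive rate and the mean investigation time. The alert records below are made-up examples:

```python
# Sketch of measuring alert-system effectiveness: false-positive rate
# and mean minutes spent per investigation. Records are invented.
def alert_stats(alerts):
    """Return (false_positive_rate, mean_investigation_minutes)."""
    total = len(alerts)
    false_positives = sum(1 for a in alerts if not a["true_positive"])
    mean_minutes = sum(a["minutes"] for a in alerts) / total
    return false_positives / total, mean_minutes

resolved = [
    {"true_positive": False, "minutes": 12},
    {"true_positive": True,  "minutes": 25},
    {"true_positive": False, "minutes": 8},
    {"true_positive": True,  "minutes": 15},
]
fp_rate, avg_time = alert_stats(resolved)
```

Tracking these two numbers sprint over sprint shows whether rule changes are actually paying off.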

By making regular changes to improve your results, you can reduce the operations costs of your organization and provide more security coverage, reducing the overall likelihood of a major cybersecurity breach.

If you are looking for alerting and reporting for ELK SIEM or Grafana that is easy to use check out Skedler. Interested in other episodes of the Infralytics Show? Check out our blog for the Infralytics Show videos and articles in addition to other informative articles that may be relevant to your business!

Episode 5 – Elasticsearch Data Leaks: Top 5 Prevention Steps

For this week’s episode, Shankar discussed Elasticsearch data leaks with Simone Scarduzio, Project Lead at ReadOnlyREST, a security plugin for Elasticsearch and Kibana. Before we jump into the interview on how you can prevent an Elasticsearch data leak, here is some context on why this topic is especially relevant today.

[video_embed video=”N5F79BHgTiI” parameters=”” mp4=”” ogv=”” placeholder=”” width=”700″ height=”400″]

Recent Elasticsearch Data Leaks

There were three instances of massive data leaks involving Elasticsearch databases just in the week prior to our interview with Simone. 

  1. An Elasticsearch database containing the records of 2.5 million customers of Yves Rocher, a cosmetics company, was found unsecured. 
  2. A database containing the personal data of the entire population of Ecuador (16.6 million people) was found unsecured. 
  3. An Elasticsearch database containing personally identifiable information linked to 198 million car buyer records was found unsecured.

The frequent occurrence of Elasticsearch database data leaks raises the question, “How can we prevent a data leak in Elasticsearch data stores?” For the answer, we interviewed an Elasticsearch security expert and asked his opinion on the top 5 data leak prevention techniques.

What are the Root Causes of These Data Leaks?

The common theme among these data leaks is that their causes traced back to outsourcing contracts. Contracts should include not only functional requirements but also security requirements. The good news is that solutions already exist, and they are free.

If you think about Amazon Elasticsearch Service, it’s very cheap and convenient. However, you can’t install plugins in Amazon because they’re blocked, so a developer will just find a way around this problem without a viable security plugin, which ultimately leaves the database vulnerable. A lot of the issue has to do with how Amazon built the Amazon Elasticsearch Service: they split the responsibility for security between the user and the infrastructure manager, which is Amazon itself, so Amazon is not contractually liable for security problems that arise.

Amazon allows anyone to open up an Elasticsearch cluster without any warning. Simone says he “does not agree with this practice. Amazon should either avoid it or have a very big warning” so that data leaks like the three recent ones can be avoided.

Another problem is that the companies that had these clusters exposed had a massive amount of data accumulated, and Simone says that “even if it was secure, it is not a good practice and the entities that created the GDPR would not agree with the practice” of holding that much data in such a way. It is almost like they were inviting a data breach.

5 Ways To Prevent An Elasticsearch Data Leak


If you have an Elasticsearch cluster and want to keep it protected follow these rules:

  1. Remember that data accumulation is a liability: collect only what is necessary, and give every piece of data an expiration date.
  2. From the minute a company obtains user data, it should accept the responsibility that comes with it and focus on the importance of data handling and data management. Minimize outsourced access to the data, and keep the objectives of the different actors aligned at all times.
  3. Use security plugins. When you accumulate data, the security layer should be as close as possible to the data itself.
  4. Use encryption on the HTTP interface and between the Elasticsearch nodes for next-level security.
  5. Rigorously implement local data regulations and laws, like the GDPR in the European Union.
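Rule 1, giving every piece of data an expiration date, can be sketched as a retention sweep: find the records that have outlived their retention period and should be deleted. The field names and the 90-day window are illustrative; in practice an Elasticsearch deployment would typically use index lifecycle policies for this:

```python
# Sketch of a retention sweep for rule 1: flag records older than the
# retention window for deletion. Fields and the 90-day window are
# illustrative choices, not a specific product's policy.
from datetime import date, timedelta

def expired(records, today, retention_days=90):
    """Return the ids of records collected before the retention cutoff."""
    cutoff = today - timedelta(days=retention_days)
    return [r["id"] for r in records if r["collected"] < cutoff]

records = [
    {"id": "user-1", "collected": date(2019, 1, 10)},
    {"id": "user-2", "collected": date(2019, 9, 1)},
]
stale = expired(records, today=date(2019, 9, 15))
```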

If you are looking to increase the security of your Elasticsearch cluster, using a security plugin is a great measure to start with and can help you prevent a data leak from exposing your clients’ data. Learn more about ReadOnlyREST’s security plugin for Elasticsearch and Kibana here.

The Infralytics Show

Thanks for reading our article and tuning in to episode 5 of the Infralytics show. We have a great show planned for next week as well, so be sure to come back! Interested in checking out our past episodes? Here’s a link to episode 4.

Episode 4 – Let’s Go Phishing for Ransomware

Shankar Radhakrishnan, Founder of Skedler, recently sat down with the CEO of TCE Strategy, Bryce Austin, who is a Cyber Security Expert and Professional Speaker as well as the author of the book Secure Enough? 20 Questions on Cybersecurity. The topic of the discussion, phishing for ransomware, is incredibly important as many organizations and individuals around the world are exposed to the perils of phishing and ransomware attacks daily. Bryce was able to detail why hackers target individual accounts and what best practices organizations can employ to proactively mitigate attacks or handle the fallout after a phishing and ransomware attack. 

[video_embed video=”I5ys5nOowTo” parameters=”” mp4=”” ogv=”” placeholder=”” width=”700″ height=”400″]

Top Phishing Scenarios That Organizations Face

A recent report found that 64% of organizations have experienced a phishing attack in the past year. Research by IBM reveals that 59% of ransomware attacks originate with phishing emails and a remarkable 91% of all malware is delivered by email. This tells us that more users are seeing attacks, but since they are not trained in how to spot or handle them, they become a victim of them. With the volume and variety of phishing attacks on the rise, many organizations are struggling to keep up with the barrage of ransomware attacks that are constantly hitting their networks.

In order to combat these terrible attacks, we must first understand what they are and what their purpose is. Bryce explains that “phishing comes in many forms. It can be a vague email. It could be from someone you know. It could be from someone you know who says ‘I thought you might find this link interesting,’ and it tries to get you to click on a weblink.” Bryce goes on to detail how “it could [also] state ‘please see the file for the next, new, exciting thing in technology.’ Something vague and nondescript.” This is an incredibly important aspect of these attacks since many people open emails that interest them, even if they don’t know the sender. Once the individual clicks through, the hacker has everything they need in order to obtain remote access to the user’s desktop or copy their address book to carry out a phishing attack of epic proportions.


Best Practices For Safeguarding Your Company From Phishing Attacks

First and foremost, cybercriminals are interested in money. If they think that there is a reasonable chance of them getting money from a user or company, they will try. This is why, Bryce explains that “one of the biggest things you can do is to have cybersecurity awareness training for yourself and for anyone in your company.” In essence, Bryce tells us that, “Cybersecurity awareness training is cybersecurity 101. It’s the basics of what these phishing scams look like. That is far and above the #1 way to prevent it.”

Too often, employees aren’t familiar with the signs of ransomware and therefore make their companies vulnerable to attacks. This is why, to mitigate the risk of a phishing or ransomware attack, it’s imperative to provide regular, mandatory cybersecurity training to ensure all employees can spot and avoid a potential phishing scam in their inbox. You also need to look into endpoint detection and ensure that you have built a really strong security posture. This will give everyone from your frontline employees to your executives the tools they need to stop a phishing attack in its tracks before it becomes a catastrophe.


Best Practices For Post-Phishing Attacks

If a threat actor successfully phishes an employee, it can provide them with access to the company’s entire network of resources. Bryce explains that “if a phishing attack is successful, it inherits whatever abilities the user has.” This means that a single phishing attack can provide a hacker with access to the organization’s sensitive financial and intellectual property data which can be devastating.

To combat the spread of a phishing attack once it has already made its way into your network, Bryce explains that a huge mitigating step is to “proactively remove local administrator rights so that users don’t log in as a local admin at the company.” This is similar to throwing sand on a roaring fire pit. It doesn’t undo what has already happened, but it can keep the damage from getting out of hand.

Don’t forget to subscribe and review us, because we want to help others like you improve their IT operations and security operations and streamline business operations. If you want to learn more about Skedler and how we can help you, visit our website, where you’ll find tons of information on Kibana, Grafana, and Elastic Stack reporting. You can also download a free trial so you can see how it all works. Thanks for joining. We hope you will tune in to our next episode!

Episode 3 – Are Today’s SOC Ready to Handle Emerging Cyber Threats?

Shankar Radhakrishnan, Founder of Skedler, recently sat down with the Director of Security Operations at Rocus Networks aka Corvid Cyber Defense, John Britton to discuss the top cyber threats that businesses face and if Security Operations Centers (SOCs) are prepared to handle them. John was able to provide a wealth of knowledge on these specific talking points and give us a higher level view of how cyber threats have evolved. Without further ado, let’s review the top cyber threats that plague businesses and if SOCs are up to the task of combatting these threats before they become an issue.

[video_embed video=”eKMmycRGMRY” parameters=”” mp4=”” ogv=”” placeholder=”” width=”700″ height=”400″]

Today’s Top Cyber Threats  

While small and midsized businesses are increasingly targets for cybercriminals, companies are struggling to devote enough resources to protecting their technology from attack. John describes how “5 or 6 years ago, if someone wanted to go steal some money, they would go to a bank.” He goes on to explain that “today, the way that the internet has connected everybody and all businesses are now operationalized to be ‘always on,’ every organization is targetable.” These small businesses don’t have access to a large information technology staff, and many don’t have expensive, sophisticated software designed to monitor their systems. This leaves them defenseless against these types of cyber-attacks.

John tells us that “the biggest thing that really affects any organization is the people because people make mistakes and they can be manipulated out of things.” This is why being aware of the tactics and methods used by hackers implementing social engineering attacks, and applying that awareness to our everyday lives, is the key to a solid defense. As more organizations experience these types of attacks, more will become aware of ways to combat them internally; in the meantime, it’s best to look to the guidance of an SOC to help you keep the ship afloat in rocky cyber waters.

What Techniques are Hackers Using?

A recent Ponemon Institute-Keeper report showed that 66% of organizations surveyed have experienced a breach within the last 12 months. Since businesses are still proving to be vulnerable to cyberattacks, it’s clear that more needs to be done so they adapt to a fast-moving and ever-increasing threat landscape. In their quest to achieve this goal, businesses are continuing to invest in their IT security and systems.

John explains that “we find that, at least this year, that the biggest threat to any organization is social engineering.” One eye-opening statistic to understand is that 64% of companies have experienced web-based attacks with 62% experiencing phishing & social engineering attacks.  Social engineering attacks are especially dangerous because all it takes is one weak link in an organization to initiate a damaging event. Companies need to remain vigilant when it comes to cybersecurity, because social engineering is only going to get more sophisticated in the future.

Are SOCs Prepared to Handle These Threats? 

SMBs have continued to embrace mobile devices as a way to run their businesses, which has led to an increase in convenience and efficiency that comes at a price: the diminished role of cybersecurity in their companies. John explains that, in the future, “organizations are going to [need] security as a 24/7 monitoring, data retention, and policy assessments.” SOCs are well up to the task of providing companies of all sizes with innovative solutions that are integrated to work efficiently, ensuring that they always have the strongest and most effective cybersecurity defense at their disposal.

Don’t forget to subscribe and review us below, because we want to help others like you improve their IT operations and security operations and streamline business operations. If you want to learn more about Skedler and how we can help you, visit our website, where you’ll find tons of information on Kibana, Grafana, and Elastic Stack reporting. You can also download a free trial with us so you can see how it all works. Thanks for joining, and we’ll see you next episode.

Episode 2 – Tactical Security Intelligence and Zero Trust Architecture: How to Adapt Your SIEM and SOC

Welcome to another episode of Infralytics. This episode brings together Shankar Radhakrishnan, Founder of Skedler, and Justin Henderson. Justin is a certified SANS instructor and a member of the Cyber Guardian Blue team at SANS, authoring a number of courses at SANS. Justin is also the Founder and lead consultant at H&A Security Solutions.

Together, Shankar and Justin discuss the intricacies of “Tactical Security Intelligence and Zero Trust Architecture: How to adapt your SIEM and SOC​” during their informative video podcast. Let’s recap their discussion and learn more about what sets tactical security intelligence and zero trust architectures apart from other cybersecurity approaches.

[video_embed video=”0p2PDLyByLg” parameters=”” mp4=”” ogv=”” placeholder=”” width=”700″ height=”400″]

What is Tactical Security Intelligence?

Tactical security intelligence provides information about the tactics, techniques, and procedures (TTPs) used by threat actors to achieve their goals (e.g., to compromise networks, exfiltrate data, and so on). It’s intended to help defenders understand how their organization is likely to be attacked, so they can determine whether appropriate detection and mitigation mechanisms exist or whether they need to be implemented.

What Sources of Data/Information Can Be Divulged?

Tactical security intelligence can divulge what tools threat actors are using during the course of their operations to compromise target networks and exfiltrate data. This type of information will usually come from post-mortem analyses of successful or unsuccessful attacks, and will ideally include details of the specific malware or exploit kits used. It can also identify the specific techniques that threat actors are using to delay or avoid detection. Justin Henderson tells us that most organizations are using tactical security intelligence to “[perform] critical alerting and monitoring back where the data normally resides. The best visibility to see the attacker doesn’t exist there, it exists earlier on like the desktops and laptops.”


How do you adapt your SIEM platform for effective tactical intelligence?

In some cases, tactical security intelligence will highlight the need for an organization to invest additional resources to address a specific threat. Your tactical security intelligence may lead you to implement a new security protocol or reconfigure an existing technology, simplifying matters and continuing to drive innovation forward while averting serious threats. Unfortunately, incident response efficacy relies heavily on human expertise, so it can be more difficult to measure the impact of tactical threat intelligence when it comes to identifying serious threats. This is why, when supplementing your SIEM platform with tactical security intelligence solutions, it’s best to implement a strong feedback loop between frontline defenders and your threat intelligence experts to ensure more robust network protection.

What is Zero Trust and How Does it Differ From Other Approaches?

Zero trust, as an approach, is a reflection of the modern working environment that more and more organizations are embracing. Under the zero trust approach, organizations trust nothing and verify everything. This requires logging, authentication, and encryption of all data communication. While it is impossible to fully implement zero trust, Justin Henderson tells us that the best way to manage it is to “know a baseline, find deviations, then investigate.” The approach is all-pervasive, capable of supporting not only large but also small organizations across many types of industries.
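Justin’s “know a baseline, find deviations, then investigate” loop can be sketched as a per-entity comparison of current activity against a learned baseline. The entities, counts, and tolerance factor below are invented for illustration:

```python
# Sketch of "know a baseline, find deviations, then investigate":
# flag entities whose current activity exceeds a multiple of their
# baseline. Unknown entities (baseline 0) are always flagged, which
# matches the zero trust stance of verifying everything. Numbers are
# invented for illustration.
def find_deviations(baseline, current, tolerance=3.0):
    """Return entities whose activity exceeds tolerance x baseline."""
    return sorted(
        entity for entity, count in current.items()
        if count > tolerance * baseline.get(entity, 0)
    )

baseline = {"alice": 20, "bob": 15}           # typical daily logins
current = {"alice": 22, "bob": 80, "eve": 5}  # today's observed logins

suspects = find_deviations(baseline, current)
```

Here "bob" deviates from his own baseline and "eve" has no baseline at all; both would be queued for investigation.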


How Does Zero Trust Impact Your SOC?

To protect your organization, adopting a zero trust approach may be your best bet for success, as it allows you to seamlessly monitor suspicious activity. This real-time visibility lets your IT team reduce the potential for security exposure and leverage the power of their SOC immediately. Doing so can help your organization sidestep a data breach, which costs $3.9 million on average per a 2019 Ponemon Institute report.

Don’t forget to subscribe and review us below, because we want to help others like you improve their IT operations and security operations and streamline business operations. If you want to learn more about Skedler and how we can help you, visit our website, where you’ll find tons of information on Kibana, Grafana, and Elastic Stack reporting. You can also download a free trial with us so you can see how it all works. Thanks for joining, and we’ll see you next episode.

Copyright © 2023 Guidanz Inc