Three Pillars of Observability – Metrics (Part 2)

Introduction

In distributed systems, services and servers are spread across multiple clouds, and the users who consume those services keep growing in number, devices of choice, and locations. Having visibility into the client’s experience while using the application – i.e., observability – is now vital to operating the applications in your infrastructure, and metrics are a core part of it.

What Are Metrics?

A metric is a quantifiable value measured over time, with specific characteristics such as a name, a timestamp, and a value. Unlike logs, metrics are structured by default, which makes them easier to query and cheaper to store, so you can retain them for more extended periods.

Metrics help answer some of the most fundamental questions of the IT department. Is a performance issue affecting customers? Are employees having trouble accessing applications? Is traffic volume unusually high? Is the rate of customer churn going up?

Standard metrics include:

  1. System metrics such as CPU usage, memory usage, and disk I/O
  2. Application metrics such as request rate, error count, and response time
  3. Business metrics such as revenue, signups, bounce rate, and cart abandonment
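Whatever the category, a metric data point has the same basic shape described above: a name, a value, a timestamp, and some identifying characteristics. A minimal sketch (the field and label names here are illustrative assumptions, not a standard schema):

```python
from dataclasses import dataclass, field
import time

@dataclass
class MetricSample:
    """One quantifiable value measured at a point in time."""
    name: str                                   # e.g. "cpu.usage_percent"
    value: float
    timestamp: float = field(default_factory=time.time)
    labels: dict = field(default_factory=dict)  # extra characteristics, e.g. host or service

# System, application, and business metrics all share this shape:
cpu = MetricSample("cpu.usage_percent", 73.5, labels={"host": "web-01"})
errors = MetricSample("http.errors_total", 12, labels={"service": "checkout"})
signups = MetricSample("signups_daily", 240)
```

Because every sample has the same structure, storage backends can index and compress them aggressively, which is why metrics are cheap to retain.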

Different Components of Metrics

Metrics are the most valuable of the three pillars because they are generated frequently and by every module, from operating systems to applications. Correlating them can give you a complete view of an issue, but doing so by hand is a huge and tedious task for human operators.

Data Collection

Most metrics are small and do not consume much space. You can gather them cheaply and store them for an extended period. They give you a general overview of the whole system, though without deep insight into individual events.

So, metrics answer the question, “How does my system’s performance change over time?”

Data Storage

Many teams used statsd with Graphite as the storage backend. Many now prefer Prometheus, an open-source, metrics-based monitoring system. It does one thing well: with a simple yet powerful data model and query language, it lets you analyze how your applications and infrastructure perform.
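Part of what made the statsd-plus-Graphite pairing popular is how simple the wire format is: statsd clients emit plain-text lines of the form `name:value|type` over UDP. A minimal formatter as a sketch (this is an illustration of the protocol, not the official client library):

```python
def statsd_line(name: str, value: float, metric_type: str, sample_rate: float = 1.0) -> str:
    """Format one metric in the statsd plain-text protocol: <name>:<value>|<type>[|@<rate>]."""
    line = f"{name}:{value}|{metric_type}"
    if sample_rate < 1.0:
        line += f"|@{sample_rate}"
    return line

# "c" = counter, "g" = gauge, "ms" = timer
print(statsd_line("page.views", 1, "c"))           # page.views:1|c
print(statsd_line("load.avg", 0.72, "g"))          # load.avg:0.72|g
print(statsd_line("api.latency", 320, "ms", 0.1))  # api.latency:320|ms|@0.1
```

Prometheus inverts this push model: instead of your application sending lines to a daemon, the Prometheus server scrapes an HTTP endpoint your application exposes.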

Visualization and Reporting

I would also consider visualization a part of metrics, as it goes hand in hand with metrics.

Grafana is commonly used to visualize data scraped by Prometheus, which it queries as a data source using a pull model. You can also use Kibana as your visualization tool, which primarily supports the Elastic Stack.

And you can use Skedler to generate reports from these visualizations to share with your stakeholders.

Skedler offers a simple and effective way to add reporting to your Elasticsearch Kibana (including Open Distro for Elasticsearch) or Grafana applications deployed on Kubernetes.

You can deploy Skedler in air-gapped, private, or public cloud environments with Docker or a VM on various flavors of Linux.

Skedler is easy to install, configure, and use with Kibana or Grafana. Skedler’s no-code drag-n-drop UI generates PDF, CSV, or Excel reports from Kibana or Grafana in minutes and saves up to 10 hours per week.

Try our new and improved Skedler for custom-generated Grafana or Kibana reports for free!

Download Skedler

Conclusion

Metrics are the entry point to every monitoring platform, built on data collected from CPU, memory, disk, networks, and more. They no longer belong only to operations: metrics can be created by anyone and any system in the distributed network. For instance, a developer may choose to expose application-specific data such as the number of tasks performed, the time required to complete them, and their status. The objective is to link these data points across different levels of the system and build an application profile that identifies the architecture the distributed system needs. This leads to improved performance, reliability, and security system-wide.

Metrics that development teams use to pinpoint source code that needs improvement can also help operators assess the system requirements and capacity planning needed to support user demand, and help the wider team drive and improve the adoption and use of the application.

Three Pillars of Observability – Logs

Introduction

Observability evaluates what’s happening in your software from the outside, as one cohesive capability. Its goal is to help you see the condition of your entire system.

Observability relies on metrics, traces, and logs – the three pillars. When you combine these three “pillars,” a remarkable ability to understand the whole state of your system emerges, revealing information that might go unnoticed within any one pillar on its own. Some observability solutions bring all of this information together but present it as separate capabilities, leaving it to the observer to connect them. Observability isn’t just about monitoring each pillar one at a time; it’s the ability to see the whole picture, to see how the pieces fit together like a puzzle and show you the actual state of your system.

The Three Pillars of Observability

As mentioned earlier, there are three pillars of observability: Logs, Metrics, and Traces.

Logs are the archival records of your system’s functions and errors. They are always time-stamped and come in binary, plain text, or a structured format that combines text and metadata. Logs allow you to look back and see what went wrong, and where, within a system.

Metrics can be a wide range of values monitored over time. Metrics are often vital performance indicators such as CPU capacity, memory usage, or latency – anything that provides insight into the health and performance of your system. Changes in these metrics help teams better understand the system’s end-to-end performance. Metrics offer modern businesses a measurable means to improve the user experience.

Traces are a method of following a user’s journey through your application. A trace documents the user’s interactions and requests within the system, starting from the user interface to the backend systems and then back to the user once their request is processed.

This is a three-part blog series on the three pillars of observability. In this first part, we will dive into logs.

Check out this article to learn more about observability.

The First Pillar – Logs

In this part of the blog, we will go through the first pillar of Observability – Logs. 

Logs consist of the structured and unstructured data a system produces while specific programs run. Overall, you can think of a log as a database of events within an application. Logs help diagnose unpredictable and irregular behaviors of the components in a system.
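To make the “database of events” idea concrete, here is a hedged sketch of a structured log event as a JSON line (the field names are illustrative, not a standard):

```python
import datetime
import json

def log_event(level: str, message: str, **context) -> str:
    """Serialize one log event as a JSON line with a timestamp and arbitrary context."""
    event = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "level": level,
        "message": message,
        **context,  # structured metadata alongside the free text
    }
    return json.dumps(event)

line = log_event("ERROR", "payment failed", order_id="A-1042", service="checkout")
print(line)
```

Because each event is structured, downstream tools such as Elasticsearch can index the fields individually instead of grepping raw text.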

They are relatively easy to generate. Almost all application frameworks, libraries, and languages support logging. In a distributed system, every component generates logs of actions and events at any point.

Log files record complete system details, such as a fault and the specific time it occurred. By examining the logs, you can troubleshoot your program and identify where and why the error occurred. Logs are also helpful for troubleshooting security incidents in load balancers, caches, and databases.

Logs play a crucial role in understanding your system’s performance and health. Good logging practice is essential to powering a good observability platform across your system design. Monitoring involves collecting and analyzing logs and system metrics, and log analysis is the process of deriving information from those logs. To conduct a proper log analysis, you first need to generate the logs, collect them, and store them. Two things developers need to get better at are deciding what to log and how to log it.

But one problem with logging is the sheer amount of logged data and the difficulty of searching through it all efficiently. Storing and analyzing logs is expensive, so it’s essential to log only the information needed to identify and manage issues. It also helps to categorize log messages into priority buckets called logging levels, such as Error, Warn, Info, Debug, and Trace. Logging helps us understand the system better and helps set up the necessary monitoring alerts.
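Python’s standard logging module already implements these priority buckets; a minimal sketch of level-based filtering (the logger name and messages are made up for illustration):

```python
import io
import logging

# Route log records to an in-memory buffer so the example is self-contained.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.propagate = False        # keep the example's output out of the root logger
logger.setLevel(logging.INFO)   # Debug and below are filtered out

logger.debug("cart contents: %s", ["sku-1"])   # dropped: below the INFO threshold
logger.info("order received")
logger.error("payment gateway timeout")

output = buffer.getvalue()
print(output)
```

Raising the level to `WARNING` in production is one common way to cut log volume without touching the instrumentation itself.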

Insights from Logs

You need to know what happened in the software to troubleshoot system- or software-level issues. Logs give you information about what happened before, during, and after an error occurred.

A trained eye monitoring logs can tell what went wrong during a specific time segment within a particular piece of software.

Logs offer the most granular analysis of the three pillars. You can use logs to discover the primary causes of your system’s errors and find out why they occurred. Many tools are available for log management.

You can then monitor logs using Grafana, Kibana, or any other visualization tool.

The Logs app in Kibana helps you search, filter, and tail all your logs stored in Elasticsearch. Log panels in Grafana are also very useful when you want to see correlations between visualized data and logs at a given time. You can also filter your logs by a specific term, label, or time period.
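The filter-by-term-or-level idea those tools implement can be sketched in a few lines (a toy in-memory filter, not the Kibana or Grafana implementation):

```python
def filter_logs(logs, term=None, level=None):
    """Return log entries matching an optional search term and/or logging level."""
    result = []
    for entry in logs:
        if level is not None and entry["level"] != level:
            continue
        if term is not None and term not in entry["message"]:
            continue
        result.append(entry)
    return result

logs = [
    {"level": "INFO", "message": "user login ok"},
    {"level": "ERROR", "message": "db connection refused"},
    {"level": "ERROR", "message": "user login failed"},
]
errors = filter_logs(logs, level="ERROR")
login_errors = filter_logs(logs, term="login", level="ERROR")
```

Real log stores do the same thing against an inverted index, which is what makes the search fast at scale.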

Check out these 3 best Grafana reporting tools here

Limitations of Logs

Logs show what is happening in a specific program. For companies running microservices, the issue may not lie within a given service but in how different services interact. Logs alone may reveal the problem but not how often it has occurred. Retaining logs that go back a long time can also increase costs due to the amount of storage required to keep all the information.

Similarly, spinning up new containers or instances to handle client activity means increased logging and storage costs.

To solve this issue, you need to look to another of the three pillars of observability: metrics. We will cover metrics in the second part of our observability series. Stay tuned to learn more about observability.

Try our new and improved Skedler for custom-generated Grafana reports for free!

Download Skedler

Everything You Need to Know about Grafana

What is Grafana?

According to Grafana Labs, Grafana is open-source visualization and analytics software. No matter where your data is stored, Grafana can query, visualize, and explore it. In plain English, it provides you with tools to turn your time-series database (TSDB) data into beautiful graphs and visualizations.

Why do companies use Grafana?

Companies use Grafana to monitor their infrastructure and log analytics, predominantly to improve their operational efficiency. Dashboards make tracking users and events easy by automating the collection, management, and viewing of data. Product leaders, security analysts, and developers use this data to guide their decisions. Studies show that companies relying on database analytics and visualization tools like Grafana are far more profitable than their peers.

Why is Grafana important?

Grafana shows teams and companies what their users really do, not just what they say they do. These are known as revealed behaviors. Users aren’t very adept at predicting their own futures. Having analytics allows tech teams to dig deeper than human-error-prone surveys and monitoring.

Grafana makes that data useful again by integrating all data sources into one single organized view.

What Is a Grafana Dashboard?

A Grafana dashboard supports multiple panels in a single grid, so you can visualize results from multiple data sources simultaneously. Grafana is a powerful open-source analytics and visualization tool in which individual panels are arranged on a grid, and it supports a huge list of data sources including (but not limited to) AWS CloudWatch, Microsoft SQL Server, Prometheus, MySQL, and InfluxDB.
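Under the hood, a dashboard is just a JSON document describing panels and their positions on the grid. As a rough sketch, a stripped-down dashboard might look like the following (a real export carries many more fields; the titles here are illustrative):

```python
import json

# A stripped-down dashboard: two panels side by side on Grafana's 24-column grid.
dashboard = {
    "title": "Service Overview",
    "panels": [
        {"id": 1, "type": "timeseries", "title": "Request Rate",
         "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8}},
        {"id": 2, "type": "timeseries", "title": "Error Rate",
         "gridPos": {"x": 12, "y": 0, "w": 12, "h": 8}},
    ],
}
exported = json.dumps(dashboard, indent=2)
print(exported)
```

Because dashboards are plain JSON, teams routinely version-control and share them the same way they share code.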

What features does Grafana provide?

The tools that teams actually use to uncover insights vary from organization to organization. The following are the most common (and useful) features to expect of a data analytics and visualization tool like Grafana.

Common Grafana features:

  • Visualize: Grafana has a plethora of visualization options, from graphs to histograms, to help you understand your data.
  • Alerts: Grafana lets you define thresholds visually and get notified via Slack, PagerDuty, and more.
  • Unify: You can bring your data together to get better context. Grafana natively supports dozens of databases.
  • Open Source: It’s completely open source. You can use Grafana Cloud or easily install it on any platform.
  • Explore Logs: Using label filters, you can quickly filter and search through the laundry list of logs.
  • Display dashboards: Visualize data with templated or custom reports.
  • Create and Share reports: Create and share reports with your customers and stakeholders. This feature is not available in the open-source version; you can upgrade to use it.

Check out these 3 best Grafana reporting tools here

How to use Grafana

All data visualization platforms are built around two core functions that help companies answer questions about users and events:

  • Tracking data: Capturing visits, events, and monitoring actions through logs
  • Analyzing data: Visualizing data through dashboards and reports.

With data that’s been tracked, captured, and organized, companies are free to analyze:

  • What actions are users taking on the device, network, etc.?
  • The typical behavior flow that users take through our network or app
  • Opportunities to reduce SLA churn

and more.

The answers they receive arm them with statistically valid facts upon which to base security and operational decisions. Grafana is also commonly used to monitor synthetic metrics.

What are Synthetic Metrics?

Synthetic metrics are a collection of multi-stage steps required to complete an API call or transaction.

A set of metrics for an API call would contain:

  • Time to connect to API (connect latency)
  • Duration of request (response latency)
  • Size of response payload
  • Result Code of request (200, 204, 400, 500, etc)
  • Success/Failure state of the request
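The bullet list above amounts to wrapping a request in a timer and recording the outcome. A hedged sketch against a stand-in request function (no real network call, so connect latency is omitted here):

```python
import time

def measure_call(request_fn):
    """Run one synthetic check and collect the metrics listed above."""
    start = time.perf_counter()
    try:
        status_code, payload = request_fn()
        elapsed = time.perf_counter() - start
        return {
            "response_latency_s": elapsed,
            "payload_bytes": len(payload),
            "status_code": status_code,
            "success": 200 <= status_code < 300,
        }
    except Exception:
        # Failed calls still produce a metric sample, with success=False.
        return {"response_latency_s": time.perf_counter() - start,
                "payload_bytes": 0, "status_code": None, "success": False}

# Stand-in for a real API call (an assumption for the example):
metrics = measure_call(lambda: (200, b'{"ok": true}'))
```

Run on a schedule and pushed to a time-series store, samples like this become the synthetic metrics Grafana graphs.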

From there, teams typically graduate to proving or disproving hypotheses. For instance, a patch management solution provider or user may get the following questions addressed: “When is the best time to patch all the systems? Which systems in the network are unpatched? What are the most vulnerable devices in the network?” Over time, teams build up a repository of data-backed evidence, which allows them to create positive feedback loops. That is, the more data teams get back from Grafana, the more they can iterate on their operations.

Getting started with Grafana is easy — Install Grafana Locally > Configure your data source > Create your first dashboard

What Are Some of the Real-World Industry Use Cases of Grafana?

As mentioned by 8bitmen.com, Grafana dashboards are deployed all over the industry, be it Gaming, IoT, FinTech, or E-Comm.

StackOverflow used the tool to enable their developers & site reliability teams to create tailored dashboards to visualize data & optimize their server performance.

DigitalOcean uses Grafana to share visualization data between their teams and to provide a common visual data-sharing platform.

What about Grafana reporting?

Grafana allows companies to fully understand the Hows and Whats of users/events with respect to their infrastructure or network. It is especially useful for security analytics teams so they can track events and users’ digital footprints to see what they are doing inside their network. Analytics is a critical piece of modern SecOps and DevOps as most apps and websites aren’t designed to run detailed reports or visualizations on themselves. Without proper visualizations, the data they collect is often inconsistent and improperly formatted (known as unstructured data). Grafana makes that data useful again by integrating all data sources into one single organized view.

The data has to be translated into meaningful reports and shared among the stakeholders. What if you could just use a tool to take care of this task? Skedler is a report automation tool that can automate your Grafana reports. It can create, share and distribute customized reports to all of your stakeholders, all without a single line of code.

Want to read more about Grafana reporting? Well, we have just the blog for you. Click here and check it out.

Episode 6 – Cybersecurity Alerts: 6 Steps To Manage Them

Is your Security Ops team overwhelmed by cybersecurity alerts? In this episode of The Infralytics Show, Shankar, Founder and CEO of Skedler, describes the seemingly endless number of cybersecurity alerts that security ops teams encounter.


The Problem Of Too Many Cybersecurity Alerts

Just to give you an understanding of how far-reaching this problem is, here are some facts. According to information published in a recent study by Bitdefender, 72% of CISOs reported alert or agent fatigue. So, don’t worry, you aren’t alone. A report published by Critical Start found that 70% of SOC Analysts who responded to the study said they investigate 10+ cybersecurity alerts every day. This is a dramatic increase from just last year when only 45% said they investigate more than 10 alerts each day. 78% spend more than 10 minutes investigating each of the cybersecurity alerts, and 45% reported that they get a rate of 50% or more false positives. 

When asked, “If your SOC has too many alerts for the analysts to process, what do you do?”, 38% said they turn off high-volume alerts, and the same percentage said they hire more analysts. The problem with hiring more analysts, however, is that more than three quarters of respondents reported an analyst turnover rate of more than 10%, with at least half reporting a 10-25% rate. This turnover is directly impacted by the overwhelming number of cybersecurity alerts, which raises the question: what do you do if you need more analysts to handle the endless alerts, but the alerts themselves are driving high SOC analyst turnover? It seems a situation has been created where there are never enough SOC analysts to meet the demand.

To make matters worse, more than 50% of respondents reported that they have experienced a security breach in the past year! Thankfully, you can eliminate alert fatigue and manage alerts effectively with these 6 simple steps.


The Solution To Being Overwhelmed By Cybersecurity Alerts

1. Prioritize Detection and Alerting

According to Shankar’s research, step 1 is to prioritize threat detection and alerting based on your business and security goals and the resources you have available to achieve them. Defining your goals is a great way to start. Use knowledge of your available resources to plan how you will respond to alerts and how many you can manage per day.

2. Map Required Data

Step 2 is to map your goals to the data you are already capturing, so you can see whether you are collecting all of the data required to adequately monitor and meet your security requirements. Identify the gaps in your data by completing a gap analysis of what you are not yet collecting, then set up your telemetry architecture to collect the data that is needed.
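At its core, this gap analysis is a set difference between the data you need and the data you already collect. A minimal sketch (the field names are illustrative, not a recommended telemetry schema):

```python
def data_gap(required_fields, collected_fields):
    """Return which required telemetry fields are missing and which are already covered."""
    required, collected = set(required_fields), set(collected_fields)
    return {
        "missing": sorted(required - collected),   # gaps to close
        "covered": sorted(required & collected),   # already collected
    }

gap = data_gap(
    required_fields=["src_ip", "dst_ip", "user", "process", "dns_query"],
    collected_fields=["src_ip", "dst_ip", "user"],
)
print(gap["missing"])
```

The "missing" list becomes the to-do list for your telemetry architecture.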

3. Define Metric-Based Cybersecurity Alerts

Step 3 is to define metric-based alerts. What type of alerts are you going to monitor? Look for metric-based alerts, which often search for variations in events. Metric-based alerts are more efficient than other types of alerts, so Shankar recommends them to those of you at this step. You should also augment your alerts with machine learning.

Definitely avoid cookie-cutter detection. The cookie-cutter approach is one-size-fits-all and will most definitely not be the best approach for YOUR organization. Each organization has its own unique setup, and yours should be derived from your own security goals. Also, optimize event-based detection but keep it to a minimum so that your analysts do not end up overwhelmed by the alerts.
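One simple reading of “alerts that search for variations in events” is a deviation check over a rolling window of a metric. A toy sketch (the threshold and data are illustrative, and real anomaly detection is considerably more sophisticated):

```python
import statistics

def deviation_alert(history, current, threshold=3.0):
    """Fire when the current value deviates from recent history by more than
    `threshold` standard deviations - a crude stand-in for real anomaly detection."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > threshold

logins_per_minute = [12, 15, 11, 14, 13, 12, 14]
assert not deviation_alert(logins_per_minute, 16)  # within normal variation
assert deviation_alert(logins_per_minute, 90)      # sudden spike fires an alert
```

Because the rule adapts to the observed baseline, it tends to produce fewer false positives than a fixed event count, which is part of why metric-based alerts are more efficient.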

4. Automate Enrichment and Post-Alert Standard Analysis

Once you have set up these rules, the next step is to see how much of the additional data your analysts need for their analysis can be gathered automatically. Can you automate the enrichment of the alert data so that your analysts don’t have to manually look for additional data to provide more context to the alerts? Also, 70-80% of the analysis an analyst goes through while investigating an alert is very standard. So ask yourself: is it possible to automate it?

5. Setup a Universal Work Bench

  • Use a setup similar to Kanban or Trello, where you have a queue and the alerts that need to be investigated move from one stage to the next. This keeps everything organized and lets you arrange the alerts in order of importance so that your analysts know which alerts to address first.
  • Add enriched data to these alerts: automate the enrichment process so it is readily available to your analysts through the workbench.
  • Provide more intelligence to the alerts (adding data or whatever else is needed to provide context). This helps you build a narrative for each alert and use machine learning to come up with recommendations that your security analysts can investigate.
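The workbench described above can be modeled as a handful of ordered stages that each alert moves through. A minimal sketch (the stage names and alert fields are assumptions for illustration):

```python
STAGES = ["new", "enriched", "investigating", "resolved"]

def advance(alert):
    """Move an alert to the next stage of the workbench, Kanban-style."""
    i = STAGES.index(alert["stage"])
    if i < len(STAGES) - 1:
        alert["stage"] = STAGES[i + 1]
    return alert

alert = {"id": 101, "severity": "high", "stage": "new"}
advance(alert)  # enrichment data gets attached at this stage
alert["context"] = {"asset": "web-01", "owner": "payments team"}
advance(alert)  # now queued for an analyst to investigate
```

Keeping every alert in one of a few well-defined stages is what makes the queue sortable by importance and easy to audit.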

These first five steps are not intended to be a one-time initiative but rather a repeating process in which each step can be perfected over a long period.

6. Measure and Refine

  • Continuous improvement: measure the effectiveness of your alert system. How many alerts flow into the system, how much time does it take your analysts to investigate each alert, and what is the false-positive rate versus the true-positive rate?
  • Iterative approach: think in terms of a sprint-based approach. What changes can you make to improve your results in the next sprint iteration? Add more data or change your alert algorithms to get different results and be more precise.
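The measurements named in these bullets reduce to a few ratios over the alert log. A hedged sketch (the record fields and sample numbers are made up for illustration):

```python
def alert_effectiveness(alerts):
    """Summarize alert volume, mean investigation time, and false-positive rate."""
    total = len(alerts)
    false_positives = sum(1 for a in alerts if not a["true_positive"])
    mean_minutes = sum(a["minutes_to_investigate"] for a in alerts) / total
    return {
        "total_alerts": total,
        "false_positive_rate": false_positives / total,
        "mean_investigation_minutes": mean_minutes,
    }

# One illustrative sprint's worth of investigated alerts:
summary = alert_effectiveness([
    {"true_positive": False, "minutes_to_investigate": 12},
    {"true_positive": True,  "minutes_to_investigate": 25},
    {"true_positive": False, "minutes_to_investigate": 8},
    {"true_positive": True,  "minutes_to_investigate": 15},
])
print(summary)
```

Comparing this summary sprint over sprint shows whether your tuning is actually reducing false positives and investigation time.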

By making regular changes to improve your results, you can reduce the operations costs of your organization and provide more security coverage, reducing the overall likelihood of a major cybersecurity breach.

If you are looking for easy-to-use alerting and reporting for ELK SIEM or Grafana, check out Skedler. Interested in other episodes of The Infralytics Show? Check out our blog for The Infralytics Show videos and articles, in addition to other informative articles that may be relevant to your business!

Copyright © 2023 Guidanz Inc