Reading and analyzing log files in the RRD database format
Source: vignettes/using_rrd.Rmd
I have frequent conversations with R champions and Systems Administrators responsible for R, in which they ask how they can measure and analyze the usage of their servers. Among the many solutions to this problem, one of my favourites is to use an RRD database and RRDtool.
From Wikipedia:
RRDtool (round-robin database tool) aims to handle time series data such as network bandwidth, temperatures or CPU load. The data is stored in a circular buffer based database, thus the system storage footprint remains constant over time.
RRDtool is a library written in C, with implementations that can also be accessed from the Linux command line. This makes it convenient for system development, but makes it difficult for R users to extract and analyze this data.
I am pleased to announce that I’ve been working on the rrd R package to import RRD files directly into tibble objects, thus making it easy to analyze your metrics.
As an aside, the RStudio Pro products (specifically RStudio Server Pro and RStudio Connect) also make use of RRD to store metrics – more about this later.
Understanding the RRD format as an R user
The name RRD is an initialism of Round Robin Database. The “round robin” refers to the fact that the database is always fixed in size, and as a new entry enters the database, the oldest entry is discarded. In practical terms, the database collects data for a fixed period of time, and information that is older than the threshold gets removed.
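The round-robin idea can be sketched in a few lines of base R. This is only a toy illustration of a fixed-size buffer that overwrites its oldest entry; the names new_ring() and ring_push() are made up for this example and say nothing about how RRDtool actually lays out its files.

```r
# A toy round-robin buffer: fixed capacity, oldest value overwritten first.
# Illustration only; not a description of the RRD file format.
new_ring <- function(capacity) {
  list(values = rep(NA_real_, capacity), next_slot = 1L, capacity = capacity)
}

ring_push <- function(ring, x) {
  ring$values[ring$next_slot] <- x
  # Wrap around to slot 1 once the last slot has been written
  ring$next_slot <- ring$next_slot %% ring$capacity + 1L
  ring
}

ring <- new_ring(3)
for (x in c(10, 20, 30, 40)) ring <- ring_push(ring, x)

# The buffer still holds three values; 40 has replaced the oldest entry (10)
ring$values
#> [1] 40 20 30
```

However many values you push, the buffer never grows, which is why the storage footprint of an RRD file stays constant.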
A second quality of RRD databases is that each datum is stored in different “consolidation data points”, where every data point is an aggregation over time. For example, a data point can represent an average value for the time period, or a maximum over the period. Typical consolidation functions include average, min and max.
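To make consolidation concrete, here is a small base R sketch that collapses raw readings into consolidated data points using the three functions mentioned above. The readings are made-up values and the bin width of five is an arbitrary choice for illustration; RRDtool performs this step internally when data is written.

```r
# Twenty raw readings (made-up values, for illustration only)
raw <- c(1, 3, 2, 5, 4,   6, 8, 7, 9, 10,   2, 2, 2, 2, 2,   0, 4, 8, 4, 4)

# Group every five consecutive readings into one consolidated data point
bin <- rep(seq_len(length(raw) / 5), each = 5)

# Apply each consolidation function to every group
consolidated <- data.frame(
  average = as.numeric(tapply(raw, bin, mean)),
  min     = as.numeric(tapply(raw, bin, min)),
  max     = as.numeric(tapply(raw, bin, max))
)
consolidated
```

Each row of the result summarises five raw readings, so the same period is described by a quarter as many rows.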
The third quality is that every RRD database file typically consists of multiple archives. Each archive measures data for a different time period. For instance, the archives can capture data for intervals of 10 seconds, 30 seconds, 1 minute or 5 minutes.
As an example, here is a description of an RRD file that originated in RStudio Connect:
describe_rrd("rrd_cpu_0")
#> A RRD file with 10 RRA arrays and step size 60
#> [1] AVERAGE_60 (43200 rows)
#> [2] AVERAGE_300 (25920 rows)
#> [3] MIN_300 (25920 rows)
#> [4] MAX_300 (25920 rows)
#> [5] AVERAGE_3600 (8760 rows)
#> [6] MIN_3600 (8760 rows)
#> [7] MAX_3600 (8760 rows)
#> [8] AVERAGE_86400 (1825 rows)
#> [9] MIN_86400 (1825 rows)
#> [10] MAX_86400 (1825 rows)
This RRD file contains data for the properties of CPU 0 of the system. In this example, the first RRA archive contains averaged metrics for one minute (60s) intervals, while the second RRA measures the same metric, but averaged over five minutes. The same metrics are also available for intervals of one hour and one day.
Notice also that every archive has a different number of rows, representing a different historical period over which the data is kept. For example, the per-minute data AVERAGE_60 is retained for 43,200 periods (30 days) while the daily data MAX_86400 is retained for 1,825 periods (5 years).
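The retention period of an archive follows directly from its step size and row count. As a quick check, the base R snippet below recomputes it for the four AVERAGE archives, with the numbers copied from the describe_rrd() output above.

```r
# Step size (seconds) and row count of each AVERAGE archive, copied from the
# describe_rrd() output shown earlier
archives <- data.frame(
  archive = c("AVERAGE_60", "AVERAGE_300", "AVERAGE_3600", "AVERAGE_86400"),
  step    = c(60, 300, 3600, 86400),
  rows    = c(43200, 25920, 8760, 1825)
)

# Retention in days = rows * step (seconds) / seconds per day
archives$days <- archives$rows * archives$step / 86400
archives
```

The days column works out to 30, 90, 365 and 1,825 days, so the coarser the resolution, the longer the history that is kept.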
If you want to know more, please read the excellent introductory tutorial on RRD databases.
Introducing the rrd package
Until recently, it wasn’t easy to import RRD files into R. But I was pleased to discover that a Google Summer of Code 2014 project created a proof-of-concept R package to read these files. The author of this package is Plamen Dimitrov, who published the code on GitHub and also wrote an explanatory blog post.
Because I had to provide some suggestions to our customers, I decided to update the package, provide some example code, and generally improve the reliability.
The result is not yet on CRAN, but you can install the development version of the package from GitHub.
Installing the package
To build the package from source, you first need to install librrd. Installing RRDtool from your Linux package manager will usually also install this library.
Using Ubuntu:
Using RHEL / CentOS:
Once you have the system requirements in place, you can install the development version of the R package from GitHub using:
# install.packages("devtools")
devtools::install_github("andrie/rrd")
Using the package
Once you’ve installed the package, you can start to use it. The package itself contains some built-in RRD files, so you should be able to run the following code directly.
Describing the contents of an RRD
To describe the contents of an RRD file, use describe_rrd(). This function reports the name of each archive (RRA), its consolidation function, and the number of observations:
rrd_cpu_0 <- system.file("extdata/cpu-0.rrd", package = "rrd")
describe_rrd(rrd_cpu_0)
#> An RRD file with 10 RRA arrays and step size 60
#> [1] AVERAGE_60 (43200 rows)
#> [2] AVERAGE_300 (25920 rows)
#> [3] MIN_300 (25920 rows)
#> [4] MAX_300 (25920 rows)
#> [5] AVERAGE_3600 (8760 rows)
#> [6] MIN_3600 (8760 rows)
#> [7] MAX_3600 (8760 rows)
#> [8] AVERAGE_86400 (1825 rows)
#> [9] MIN_86400 (1825 rows)
#> [10] MAX_86400 (1825 rows)
Reading an entire RRD file
To read an entire RRD file, i.e. all of the RRA archives, use read_rrd(). This returns a list of tibble objects:
cpu <- read_rrd(rrd_cpu_0)
str(cpu, max.level = 1)
#> List of 10
#> $ AVERAGE60 : tibble [43,199 × 9] (S3: tbl_df/tbl/data.frame)
#> $ AVERAGE300 : tibble [25,919 × 9] (S3: tbl_df/tbl/data.frame)
#> $ MIN300 : tibble [25,919 × 9] (S3: tbl_df/tbl/data.frame)
#> $ MAX300 : tibble [25,919 × 9] (S3: tbl_df/tbl/data.frame)
#> $ AVERAGE3600 : tibble [8,759 × 9] (S3: tbl_df/tbl/data.frame)
#> $ MIN3600 : tibble [8,759 × 9] (S3: tbl_df/tbl/data.frame)
#> $ MAX3600 : tibble [8,759 × 9] (S3: tbl_df/tbl/data.frame)
#> $ AVERAGE86400: tibble [1,824 × 9] (S3: tbl_df/tbl/data.frame)
#> $ MIN86400 : tibble [1,824 × 9] (S3: tbl_df/tbl/data.frame)
#> $ MAX86400 : tibble [1,824 × 9] (S3: tbl_df/tbl/data.frame)
Since the resulting object is a list of tibble objects, you can easily use R functions to work with an individual archive:
names(cpu)
#> [1] "AVERAGE60" "AVERAGE300" "MIN300" "MAX300" "AVERAGE3600"
#> [6] "MIN3600" "MAX3600" "AVERAGE86400" "MIN86400" "MAX86400"
To inspect the contents of the first archive (AVERAGE60), simply print the object. Since it’s a tibble, you get 10 lines of output.
For example, the CPU metrics contain a time stamp and metrics for average user and sys usage, as well as the nice value, idle time, interrupt requests and soft interrupt requests:
cpu[[1]]
#> # A tibble: 43,199 × 9
#> timestamp user sys nice idle wait irq softirq stolen
#> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2018-04-02 12:24:00 0.0104 0.00811 0 0.981 0 0 0 0.000137
#> 2 2018-04-02 12:25:00 0.0126 0.00630 0 0.979 0 0 0 0.00192
#> 3 2018-04-02 12:26:00 0.0159 0.00808 0 0.976 0 0 0 0
#> 4 2018-04-02 12:27:00 0.00853 0.00647 0 0.985 0 0 0 0
#> 5 2018-04-02 12:28:00 0.0122 0.00999 0 0.978 0 0 0 0
#> 6 2018-04-02 12:29:00 0.0106 0.00604 0 0.983 0 0 0 0
#> 7 2018-04-02 12:30:00 0.0147 0.00427 0 0.981 0 0 0 0.000137
#> 8 2018-04-02 12:31:00 0.0193 0.00767 0 0.971 0 0 0 0.00191
#> 9 2018-04-02 12:32:00 0.0300 0.0274 0 0.943 0 0 0 0
#> 10 2018-04-02 12:33:00 0.0162 0.00617 0 0.978 0 0 0 0.000137
#> # … with 43,189 more rows
#> # ℹ Use `print(n = ...)` to see more rows
Since the data is in tibble format, you can easily extract specific data, e.g., the last values of the system usage:
tail(cpu$AVERAGE60$sys)
#> [1] 0.0014390667 0.0020080000 0.0005689333 0.0000000000 0.0014390667
#> [6] 0.0005689333
Reading only a single archive
The underlying code in the rrd package is written in C, and is therefore blazingly fast. Reading an entire RRD file takes a fraction of a second, but sometimes you may want to extract a specific RRA archive immediately.
To read a single RRA archive from an RRD file, use read_rra(). To use this function, you must specify several arguments that define the specific data to retrieve. This includes the consolidation function (e.g., "AVERAGE") and time step (e.g., 60). You must also specify either the start time or the number of steps, n_steps.
In this example, I extract the average for one-minute periods (step = 60) for one day (n_steps = 24 * 60):
end_time <- as.POSIXct("2018-05-02") # timestamp with data in example
avg_60 <- read_rra(rrd_cpu_0, cf = "AVERAGE", step = 60, n_steps = 24 * 60,
end = end_time)
avg_60
#> # A tibble: 1,440 × 9
#> timestamp user sys nice idle wait irq softirq stolen
#> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2018-05-01 00:01:00 0.00458 2.01e-3 0 0.992 0 0 0 1.44e-3
#> 2 2018-05-01 00:02:00 0.00258 5.70e-4 0 0.996 0 0 0 5.70e-4
#> 3 2018-05-01 00:03:00 0.00633 1.44e-3 0 0.992 0 0 0 0
#> 4 2018-05-01 00:04:00 0.00515 2.01e-3 0 0.991 0 0 0 1.44e-3
#> 5 2018-05-01 00:05:00 0.00402 5.69e-4 0 0.995 0 0 0 5.69e-4
#> 6 2018-05-01 00:06:00 0.00689 1.44e-3 0 0.992 0 0 0 0
#> 7 2018-05-01 00:07:00 0.00371 2.01e-3 0 0.993 1.44e-3 0 0 0
#> 8 2018-05-01 00:08:00 0.00488 2.01e-3 0 0.993 5.69e-4 0 0 0
#> 9 2018-05-01 00:09:00 0.00748 5.68e-4 0 0.992 0 0 0 0
#> 10 2018-05-01 00:10:00 0.00516 0 0 0.995 0 0 0 0
#> # … with 1,430 more rows
#> # ℹ Use `print(n = ...)` to see more rows
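The value passed as n_steps is just the length of the window divided by the step size. A small helper makes that arithmetic explicit; note that steps_for_window() is a hypothetical function written for this example and is not part of the rrd package.

```r
# Hypothetical helper (not part of the rrd package): number of steps needed to
# cover a window of `days` days at a resolution of `step` seconds
steps_for_window <- function(days, step) {
  as.integer(days * 24 * 60 * 60 / step)
}

steps_for_window(days = 1, step = 60)    # one day of one-minute data
#> [1] 1440
steps_for_window(days = 30, step = 300)  # thirty days of five-minute data
#> [1] 8640
```

The first result matches the 1,440 rows returned above for n_steps = 24 * 60.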
Plotting the results
The original RRDtool library for Linux contains some functions to easily plot the RRD data, a feature that distinguishes RRD from many other databases.
However, R already has very rich plotting capability, so the rrd R package doesn’t expose any specific plotting functions.
Instead, you can easily plot these data using your favourite packages, like ggplot2.
Getting the RRD files from RStudio Server Pro and RStudio Connect
As I mentioned in the introduction, both RStudio Server Pro and RStudio Connect use RRD to store metrics. In fact, these metrics are used to power the administration dashboard of these products.
This means that often the easiest solution is simply to enable the admin dashboard and view the information there.
However, sometimes R users and system administrators have a need to analyze the metrics in more detail, so in this section, I discuss where you can find the files for analysis.
The administration guides for these products explain where to find the metrics files:
- The admin guide for RStudio Server Pro discusses metrics in section 8.2 Monitoring Configuration.
  - By default, the metrics are stored at /var/lib/rstudio-server/monitor/rrd, although this path is configurable by the server administrator.
  - RStudio Server Pro stores system metrics as well as user metrics.
- The admin guide for RStudio Connect discusses metrics in section 16.1 Historical Metrics.
  - The default path for metrics logs is /var/lib/rstudio-connect/metrics, though again, this is configurable by the server administrator.
rsc <- "/var/lib/rstudio-connect/metrics/rrd"
rsp <- "/var/lib/rstudio-server/monitor/rrd"
If you want to analyze these files, copy them to a different location first. The security and permissions on both products are configured in such a way that it’s not possible to read the files while they are in the original folder, so we recommend that you run the analysis against the copies.
Warning about using the RStudio Connect RRD files:
The RStudio Connect team is actively planning to change the way content-level metrics are stored, so data related to shiny apps, markdown reports, etc. will likely look different in a future release.
To be clear:
- The schemas might change
- RStudio Connect may stop tracking some metrics
- It’s also possible that the entire mechanism might change
The only guarantees that we make in RStudio Connect are around the data that we actually surface:
- server-wide user counts
- RAM
- CPU data
This means that if you analyze RRD files, you should be aware that the entire mechanism for storing metrics might change in future.
Additional caveat
- The metrics collection process runs in a sandboxed environment, and it is not possible to publish a report to RStudio Connect that reads the metrics directly. If you want to automate a process to read the Connect metrics, you will have to set up a cron job to copy the files to a different location, and run the analysis against the copied files. (Also, re-read the warning that everything might change!)
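The copy step itself can be done from R as well as from a shell script. The sketch below uses only base R; copy_rrd_files() is a hypothetical helper, and the demonstration uses temporary folders as stand-ins for the real metrics paths. In practice the job would have to run as a user with permission to read the original files.

```r
# Hypothetical helper: copy every .rrd file from `from` into `to`,
# creating `to` if needed. Returns the paths of the successful copies.
copy_rrd_files <- function(from, to) {
  dir.create(to, recursive = TRUE, showWarnings = FALSE)
  files <- list.files(from, pattern = "\\.rrd$", full.names = TRUE)
  copied <- file.copy(files, to, overwrite = TRUE)
  file.path(to, basename(files))[copied]
}

# Demonstration with temporary folders standing in for the real paths
src <- file.path(tempdir(), "metrics-src")
dst <- file.path(tempdir(), "metrics-copy")
dir.create(src, showWarnings = FALSE)
file.create(file.path(src, c("cpu-0.rrd", "notes.txt")))

# Only the .rrd file is copied; notes.txt is left behind
basename(copy_rrd_files(src, dst))
#> [1] "cpu-0.rrd"
```

A scheduled job could call a function like this before running the analysis against the copied files.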
Example
In the following worked example, I copied some rrd files that originated in RStudio Connect to a different location on disk, and stored this location in a config file.
First, list the file names:
rrd_location <- system.file("extdata", package = "rrd")
list.files(rrd_location)
#> [1] "content-1.rrd" "cpu-0.rrd"
The file names indicate that RStudio Connect collects metrics for the system (CPU, RAM, etc.), as well as for every piece of published content.
To look at the system load, first describe the contents of the "cpu-0.rrd" file:
cpu <- file.path(rrd_location, "cpu-0.rrd")
describe_rrd(cpu)
#> An RRD file with 10 RRA arrays and step size 60
#> [1] AVERAGE_60 (43200 rows)
#> [2] AVERAGE_300 (25920 rows)
#> [3] MIN_300 (25920 rows)
#> [4] MAX_300 (25920 rows)
#> [5] AVERAGE_3600 (8760 rows)
#> [6] MIN_3600 (8760 rows)
#> [7] MAX_3600 (8760 rows)
#> [8] AVERAGE_86400 (1825 rows)
#> [9] MIN_86400 (1825 rows)
#> [10] MAX_86400 (1825 rows)
This output tells you that metrics are collected every 60 seconds (one minute), and then consolidated at selected intervals (1 minute, 5 minutes, 1 hour and 1 day). You can also tell that the consolidation functions are average, min and max.
To extract one month of data, averaged at 5-minute intervals, use step = 300:
cpu <- file.path(rrd_location, "cpu-0.rrd")
dat <- read_rra(cpu, cf = "AVERAGE", step = 300L, n_steps = (3600 / 300) * 24 * 30, end = end_time)
dat
#> # A tibble: 8,640 × 9
#> timestamp user sys nice idle wait irq softirq stolen
#> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2018-04-02 00:05:00 0.0162 0.00821 0 0.975 0 0 0 8.22e-4
#> 2 2018-04-02 00:10:00 0.0221 0.00988 0 0.967 0 0 0 8.19e-4
#> 3 2018-04-02 00:15:00 0.0448 0.0186 0 0.934 0.00124 0 4.14e-4 1.24e-3
#> 4 2018-04-02 00:20:00 0.0177 0.00816 0 0.974 0 0 0 0
#> 5 2018-04-02 00:25:00 0.0285 0.00799 0 0.963 0.000411 0 0 4.10e-4
#> 6 2018-04-02 00:30:00 0.140 0.00910 0 0.850 0 0 0 8.47e-4
#> 7 2018-04-02 00:35:00 0.0151 0.0151 0 0.969 0.000351 0 0 3.84e-4
#> 8 2018-04-02 00:40:00 0.0565 0.00942 0 0.929 0.00491 0 0 0
#> 9 2018-04-02 00:45:00 0.0241 0.00835 0 0.967 0 0 0 8.50e-4
#> 10 2018-04-02 00:50:00 0.0251 0.0111 0 0.963 0 0 0 7.95e-4
#> # … with 8,630 more rows
#> # ℹ Use `print(n = ...)` to see more rows
It is very easy to plot these data using your preferred plotting package, e.g., ggplot2:
library(ggplot2)
ggplot(dat, aes(x = timestamp, y = user)) +
  geom_line() +
  geom_smooth(method = "loess", span = 0.05, formula = y ~ x)
#> Warning: Removed 1 rows containing non-finite values (stat_smooth).
Conclusion
The rrd package, available from GitHub, makes it very easy to read metrics stored in the RRD database format. Reading an archive is very quick, and your resulting data is a tibble for an individual archive, or a list of tibbles for the entire file.
This makes it easy to analyze your data using the tidyverse packages, and to plot the information.