Collectors Overview

Performance profiles originate either from the user’s own means (i.e. by building their own collectors and generating profiles according to the Specification of Profile Format) or by using one of the collectors from Perun’s tool suite.

Perun can collect profiling data in two ways:

  1. By directly running collectors through the perun collect command, which generates a profile using a single collector with the given collector configuration. The resulting profiles are not postprocessed in any way.
  2. By using a job specification, either as a single run or a batch of profiling jobs using perun run job, or according to the specification of the so-called job matrix using the perun run matrix command.
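
The following invocations sketch both ways; the global options for specifying the profiled command, its arguments and workload are described in Command Line Interface:

    perun collect trace --method userspace
    perun run matrix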

The format of the resulting profiles follows the Specification of Profile Format. The origin is set to the current HEAD of the wrapped repository. Note, however, that uncommitted changes may skew the resulting profile and Perun cannot guard your project against this. Further, collector_info is filled with the configuration of the run collector.

All of the automatically generated profiles are stored in the .perun/jobs/ directory as a file with the .perf extension. The filename is by default automatically generated according to the following template:

bin-collector-workload-timestamp.perf
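
For instance, profiling the binary ./mybin with the trace collector and the workload input.txt might yield a file named along these lines (the name and the exact timestamp format are purely illustrative):

    mybin-trace-input.txt-2017-08-14-14-33-02.perf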

Profiles can be further registered and stored in persistent storage using the perun add command. Then both stored and pending profiles (i.e. those not yet assigned) can be postprocessed using perun postprocessby or interpreted using the available interpretation techniques using perun show. Refer to Command Line Interface, Postprocessors Overview and Visualizations Overview for more details about running command line commands, capabilities of postprocessors and interpretation techniques respectively. Internals of Perun storage are described in Perun Internals.
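
A typical workflow might then look as follows; the <profile> and <postprocessor> placeholders stand for a concrete profile and postprocessing unit, and the exact argument syntax is described in Command Line Interface:

    perun add <profile>
    perun postprocessby <profile> <postprocessor>
    perun show <profile> scatter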

_images/architecture-collectors.svg

Supported Collectors

Perun’s tool suite currently contains the following three collectors:

  1. Trace Collector (authored by Jirka Pavela) collects running times of C/C++ functions along with the sizes of the structures they were executed on, e.g. a resource stating that the function search over the class SingleLinkedList took 100ms on a single linked list with one million elements. Examples shows concrete examples of profiles generated by Trace Collector.
  2. Memory Collector (authored by Radim Podola) collects specifications of allocations in C/C++ programs, such as the type of allocation or the full call trace. Examples shows concrete profiles generated by Memory Collector.
  3. Time Collector collects overall running times of arbitrary commands. Internally it is implemented as a simple wrapper over the time utility.

All of the listed collectors can be run from the command line. For more information about the command line interface of individual collectors refer to Collect units.

Collector modules are implementation independent (hence, they can be written in any language) and only require a simple Python interface registered within Perun. For a brief tutorial on how to create and register new collectors in Perun refer to Creating your own Collector.

Trace Collector

The trace collector collects running times of C/C++ functions. The collected data are suitable for further postprocessing using regression analysis and for visualization by scatter plots.

Overview and Command Line Interface

perun collect trace

Generates a trace performance profile, capturing running times of functions depending on the underlying structure sizes.

  • Limitations: C/C++ binaries
  • Metric: mixed (captures both time and size consumption)
  • Dependencies: SystemTap (+ corresponding requirements e.g. kernel -dbgsym version)
  • Default units: us for time, element number for size

An example of a collected resource is as follows:

{
    "amount": 11,
    "subtype": "time delta",
    "type": "mixed",
    "uid": "SLList_init(SLList*)",
    "structure-unit-size": 0
}

The trace collector provides various collection strategies that supply sensible default settings for the collection. This allows the user to choose a suitable collection method without specifying detailed rules or sampling. Currently supported strategies are:

  • userspace: This strategy traces all userspace functions / code blocks without the use of sampling. Note that this strategy might be resource-intensive.
  • all: This strategy traces all userspace + library + kernel functions / code blocks that are present in the traced binary, without the use of sampling. Note that this strategy might be very resource-intensive.
  • u_sampled: Sampled version of the userspace strategy. This method uses sampling to reduce the overhead and resource consumption.
  • a_sampled: Sampled version of the all strategy. Its goal is to reduce the overhead and resource consumption of the all method.
  • custom: User-specified strategy. Requires the user to specify rules and sampling manually.

Note that manually specified parameters have higher priority than the strategy specification; it is thus possible for the user to override concrete rules / sampling.
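
For example, one might pick a strategy and override parts of it as follows (a hypothetical invocation using only the options documented below, which selects the userspace strategy, sets a global sampling of 10 and places a probe on the SLList_search function):

    perun collect trace -m userspace -g 10 -f SLList_search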

The collector interface operates with two seemingly identical concepts: the (external) command and the binary. The external command refers to the script, executable, makefile, etc. that will be called / invoked during the profiling, such as ‘make test’, ‘run_script.sh’ or ‘./my_binary’. The binary, on the other hand, refers to the actual binary or executable file that will be profiled and that contains the specified functions / static probes. It is expected that the binary is invoked / called as part of the external command script, or that the external command and the binary are the same.

The interface for rules (functions, static probes) specification offers a way to specify profiled locations both with and without sampling. Note that sampling can reduce the overhead imposed by the profiling. Static rules can further be paired - paired rules act as a start and end point for time measurement. Without a pair, the rule measures the time between each two hits of the probe. The pairing is done automatically for static locations following the convention <name> and <name>_end or <name>_END. Otherwise, it is possible to pair rules with the ‘#’ delimiter, such as <name1>#<name2>.
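
The following hypothetical invocations illustrate both pairing conventions (the probe names are illustrative and the -s option is assumed to be repeatable):

    perun collect trace -m custom -s cache_lookup -s cache_lookup_end
    perun collect trace -m custom -s process_begin#process_finish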

Trace profiles are suitable for postprocessing by Regression Analysis since they capture the dependency of time consumption on the size of the structure. This allows one to model the performance of individual functions.

Scatter plots are a suitable visualization for profiles collected by the trace collector: they plot individual points along with regression models (if the profile was postprocessed by regression analysis). Run perun show scatter --help or refer to Scatter Plot for more information about scatter plots.

Refer to Trace Collector for a more thorough description and examples of the trace collector.

perun collect trace [OPTIONS]

Options

-m, --method <method>

Select the strategy for probing the binary. See the documentation for a detailed explanation of each strategy. [required]

-f, --func <func>

Set the probe point for the given function.

-s, --static <static>

Set the probe point for the given static location.

-d, --dynamic <dynamic>

Set the probe point for the given dynamic location.

-fs, --func-sampled <func_sampled>

Set the probe point and sampling for the given function.

-ss, --static-sampled <static_sampled>

Set the probe point and sampling for the given static location.

-ds, --dynamic-sampled <dynamic_sampled>

Set the probe point and sampling for the given dynamic location.

-g, --global-sampling <global_sampling>

Set the global sampling for all probes; sampling parameters for specific rules have higher priority.

--with-static, --no-static

The selected method will also extract and profile static probes.

-b, --binary <binary>

The profiled executable. If not set, then the command is considered to be the profiled executable and is used as the binary parameter.

-t, --timeout <timeout>

Set time limit for the profiled command, i.e. the command will be terminated after reaching the time limit. Useful for endless commands etc.

--cleanup, --no-cleanup

Enable/disable the pre-cleanup of possibly running SystemTap processes that could cause corruption of the output file due to multiple writes.

-vt, --verbose-trace

Set the trace file output to be more verbose, useful for debugging.
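
Putting the options together, a collection run similar to the example profile below could be launched roughly as follows (the binary path and the timeout value are illustrative, and the -f option is assumed to be repeatable):

    perun collect trace -m custom -f SLList_init -f SLList_insert -f SLList_search -f SLList_destroy -b ./tst -t 60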

Examples

{
  "origin": "f7f3dcea69b97f2b03c421a223a770917149cfae",
  "header": {
    "cmd": "../stap-collector/tst",
    "type": "mixed",
    "units": {
      "mixed(time delta)": "us"
    },
    "workload": "",
    "params": ""
  },
  "collector_info": {
    "name": "complexity",
    "params": {
      "rules": [
        "SLList_init",
        "SLList_insert",
        "SLList_search",
        "SLList_destroy"
      ],
      "sampling": [
        {
          "func": "SLList_insert",
          "sample": 1
        },
        {
          "func": "func1",
          "sample": 1
        }
      ],
      "method": "custom",
      "global_sampling": null
    }
  },
  "postprocessors": [],
  "global": {
    "time": "6.8e-05s",
    "resources": [
      {
        "type": "mixed",
        "amount": 6,
        "subtype": "time delta",
        "uid": "SLList_init(SLList*)",
        "structure-unit-size": 0
      },
      {
        "type": "mixed",
        "amount": 0,
        "subtype": "time delta",
        "uid": "SLList_search(SLList*, int)",
        "structure-unit-size": 0
      },
      {
        "type": "mixed",
        "amount": 1,
        "subtype": "time delta",
        "uid": "SLList_insert(SLList*, int)",
        "structure-unit-size": 0
      },
      {
        "type": "mixed",
        "amount": 0,
        "subtype": "time delta",
        "uid": "SLList_insert(SLList*, int)",
        "structure-unit-size": 1
      },
      {
        "type": "mixed",
        "amount": 1,
        "subtype": "time delta",
        "uid": "SLList_insert(SLList*, int)",
        "structure-unit-size": 2
      },
      {
        "type": "mixed",
        "amount": 1,
        "subtype": "time delta",
        "uid": "SLList_insert(SLList*, int)",
        "structure-unit-size": 3
      },
      {
        "type": "mixed",
        "amount": 1,
        "subtype": "time delta",
        "uid": "SLList_destroy(SLList*)",
        "structure-unit-size": 4
      }
    ]
  }
}

The above is an example of data profiled for a simple program manipulating a single linked list. The profile captures the running times of the functions SLList_init (initialization of the single linked list), SLList_insert (insertion into the single linked list), SLList_search (search over the single linked list) and SLList_destroy (destruction of the single linked list).

Highlighted lines show important keys and regions in the profile, e.g. the origin, collector-info or resources.

_images/complexity-scatter-with-models.png

The Scatter Plot above shows an example visualization of a trace profile. Each point corresponds to the running time of the SLList_search function over a single linked list with structure-unit-size elements. The points are further interleaved with the set of models obtained by Regression Analysis. The light green line corresponds to the linear model, which seems to fit the performance of the given function best.

_images/complexity-bars.png

The Bars Plot above shows the overall sum of the running times for each structure-unit-size of the SLList_search function. The interpretation highlights that most of the running time was consumed over single linked lists with 41 elements.

_images/complexity-flow.png

The Flow Plot above shows the trend of the average running time of the SLList_search function depending on the size of the structure the search is executed on.

Memory Collector

The memory collector collects allocations of C/C++ functions, target addresses of allocations, types of allocations, etc. The collected data are suitable for visualization using e.g. Heap Map.

Overview and Command Line Interface

perun collect memory

Generates a memory performance profile, capturing memory allocations of different types along with the target address and the full call trace.

  • Limitations: C/C++ binaries
  • Metric: memory
  • Dependencies: libunwind.so and custom libmalloc.so
  • Default units: B for memory

The following snippet shows an example of resources collected by the memory collector. It captures allocations done by functions with a more detailed description, such as the type of the allocation, the trace, etc.

{
    "type": "memory",
    "subtype": "malloc",
    "address": 19284560,
    "amount": 4,
    "trace": [
        {
            "source": "../memory_collect_test.c",
            "function": "main",
            "line": 22
        }
    ],
    "uid": {
        "source": "../memory_collect_test.c",
        "function": "main",
        "line": 22
    }
}

Memory profiles can be efficiently interpreted using the Heap Map technique (together with its heat mode), which shows memory allocations (by functions) in a memory address map.
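
Assuming a stored or pending profile, its heap map interpretation can be launched along these lines (the <profile> placeholder and the exact argument syntax are described in Command Line Interface):

    perun show <profile> heapmap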

Refer to Memory Collector for a more thorough description and examples of the memory collector.

perun collect memory [OPTIONS]

Options

-s, --sampling <sampling>

Sets the sampling interval for profiling the allocations, i.e. memory snapshots will be collected every <sampling> seconds.

--no-source <no_source>

Will exclude allocations done from <no_source> file during the profiling.

--no-func <no_func>

Will exclude allocations done by <no func> function during the profiling.

-a, --all

Will record the full trace for each allocation, i.e. it will include all allocators and even unreachable records.
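
For illustration, the following hypothetical invocation collects snapshots every 0.025 seconds while excluding allocations originating from a (made-up) file external.c:

    perun collect memory -s 0.025 --no-source external.c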

Examples

{
  "origin": "74288675e4074f1ad5bbb0d3b3253911ab42267a",
  "header": {
    "type": "memory",
    "workload": "",
    "cmd": "./mct",
    "units": {
      "memory": "B"
    },
    "params": ""
  },
  "collector_info": {
    "name": "memory",
    "params": {
      "no_func": null,
      "no_source": null,
      "all": false,
      "sampling": 0.025
    }
  },
  "postprocessors": [],
  "snapshots": [
    {
      "resources": [
      {
          "type": "memory",
          "subtype": "malloc",
          "address": 19284560,
          "amount": 4,
          "trace": [
            {
              "source": "unreachable",
              "function": "malloc",
              "line": 0
            },
            {
              "source": "../memory_collect_test.c",
              "function": "main",
              "line": 22
            },
            {
              "source": "unreachable",
              "function": "__libc_start_main",
              "line": 0
            },
            {
              "source": "unreachable",
              "function": "_start",
              "line": 0
            }
          ],
          "uid": {
            "source": "../memory_collect_test.c",
            "function": "main",
            "line": 22
          }
        },
        {
          "type": "memory",
          "subtype": "free",
          "address": 19284560,
          "amount": 0,
          "trace": [
            {
              "source": "unreachable",
              "function": "free",
              "line": 0
            },
            {
              "source": "../memory_collect_test.c",
              "function": "main",
              "line": 27
            },
            {
              "source": "unreachable",
              "function": "__libc_start_main",
              "line": 0
            },
            {
              "source": "unreachable",
              "function": "_start",
              "line": 0
            }
          ],
          "uid": {
            "source": "../memory_collect_test.c",
            "function": "main",
            "line": 27
          }
        }
      ],
      "time": "0.025000"
    }
  ]
}

The above is an example of data profiled on a simple binary, which makes several minor allocations. The profile shows a simple allocation followed by a deallocation and highlights important keys and regions in memory profiles, e.g. the origin, collector_info or resources.

_images/memory-flow.png

The Flow Plot above shows the mean of allocated amounts per allocation site (i.e. uid) in stacked mode. The stacking of the means clearly shows where the biggest allocations were made during the program run.

_images/memory-flamegraph.png

The Flame Graph is an efficient visualization of the inclusive consumption of resources. The width of the base of each flame shows the bottlenecks and hotspots of the profiled binaries.

_images/memory-heapmap.png

The Heap Map shows the address space through time (snapshots) and visualizes the fragmentation of memory allocations per allocation site. The heap map above shows the difference between allocations using lists (purple), skiplists (pinkish) and standard vectors (blue). The map itself is interactive and displays details about individual address cells.

_images/memory-heatmap.png

Heat map is a mode of the heap map which aggregates the allocations over all of the snapshots and uses warmer colours for address cells where more allocations were performed.

Time Collector

The time collector is a simple wrapper over the time utility. There is nothing special about it: the profiles are simple and no visualization is especially suitable for this mode.

Overview and Command Line Interface

perun collect time

Generates time performance profile, capturing overall running times of the profiled command.

  • Limitations: none
  • Metric: running time
  • Dependencies: none
  • Default units: s

This is a wrapper over the Linux time utility and captures resources in the following form:

{
    "amount": 0.59,
    "type": "time",
    "subtype": "sys",
    "uid": cmd
    "order": 1
}

Refer to Time Collector for a more thorough description and examples of the time collector.

perun collect time [OPTIONS]

Options

-w, --warm-up-repetition <warmup>

Before the actual timing, the collector will execute <int> warm-up executions.

-r, --repeat <repeat>

The timing of the given binaries will be repeated <int> times.
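
The profile in the example below could be obtained by an invocation along these lines (the profiled command itself is passed via the global collect options, which are not shown here):

    perun collect time -w 3 -r 2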

Examples

{
  "origin": "8de6cd99e4dc36cd73a2af906cde12456e96d9f1",
  "header": {
    "type": "time",
    "params": "",
    "units": {
      "time": "s"
    },
    "cmd": "./list_search",
    "workload": "100000"
  },
  "collector_info": {
    "params": {
      "repeat": 2,
      "warmup": 3
    },
    "name": "time"
  },
  "postprocessors": [],
  "global": {
    "timestamp": 0.565476655960083,
    "resources": [
      {
        "subtype": "real",
        "uid": "./list_search",
        "order": 1,
        "type": "time",
        "amount": 0.26
      },
      {
        "subtype": "user",
        "uid": "./list_search",
        "order": 1,
        "type": "time",
        "amount": 0.25
      },
      {
        "subtype": "sys",
        "uid": "./list_search",
        "order": 1,
        "type": "time",
        "amount": 0.0
      },
      {
        "subtype": "real",
        "uid": "./list_search",
        "order": 2,
        "type": "time",
        "amount": 0.27
      },
      {
        "subtype": "user",
        "uid": "./list_search",
        "order": 2,
        "type": "time",
        "amount": 0.28
      },
      {
        "subtype": "sys",
        "uid": "./list_search",
        "order": 2,
        "type": "time",
        "amount": 0.0
      }
    ]
  }
}

The above is an example of data profiled using the time wrapper, with important regions and keys highlighted. The given command was profiled two times.

Creating your own Collector

New collectors can be registered within Perun in several steps. Internally, they can be implemented in any programming language; in order to work with Perun, three phases need to be specified as given in Collectors Overview: before(), collect() and after(). Each new collector requires an interface module run.py, which contains these three functions and, moreover, a CLI API for Click.

You can register your new collector as follows:

  1. Run perun utils create collect mycollector to generate a new module in the perun/collect directory with the following structure. The command takes predefined templates for new collectors and creates __init__.py and run.py according to the supplied command line arguments (see Utility Commands for more information about the interface of the perun utils create command):

    /perun
    |-- /collect
        |-- /mycollector
            |-- __init__.py
            |-- run.py
        |-- /trace
        |-- /memory
        |-- /time
        |-- __init__.py
    
  2. Implement the __init__.py file, including the module docstring with a brief collector description and definitions of constants that are used for the automatic setting of profiles (namely the collector_info), which has the following structure:

"""..."""

COLLECTOR_TYPE = 'time|memory|mixed'
COLLECTOR_DEFAULT_UNITS = {
    'type': 'unit'
}

__author__ = 'You!'
  3. Next, implement the run.py module with the collect() function (optionally with before() and after()). The collect() function should do the actual collection of the profiling data over the given configuration. Each function should return the integer status of the phase, the status message (used in case of an error) and a dictionary including the params passed to additional phases and ‘profile’ with a dictionary w.r.t. the Specification of Profile Format. A minimal sketch of such a run.py is shown after this list.
def before(**kwargs):
    """(optional)"""
    return STATUS, STATUS_MSG, dict(kwargs)


def collect(**kwargs):
    """..."""
    return STATUS, STATUS_MSG, dict(kwargs)


def after(**kwargs):
    """(optional)"""
    return STATUS, STATUS_MSG, dict(kwargs)
  4. Additionally, implement the command line interface function in run.py, named the same as your collector. This function is called from the command line as perun collect mycollector and is based on the Click library.
--- collectors_run.py
+++ collectors_run_api.py
@@ -1,3 +1,8 @@
+import click
+
+import perun.logic.runner as runner
+
+
 def before(**kwargs):
     """(optional)"""
     return STATUS, STATUS_MSG, dict(kwargs)
@@ -11,3 +16,10 @@
 def after(**kwargs):
     """(optional)"""
     return STATUS, STATUS_MSG, dict(kwargs)
+
+
+@click.command()
+@click.pass_context
+def mycollector(ctx, **kwargs):
+    """..."""
+    runner.run_collector_from_cli_context(ctx, 'mycollector', kwargs)
  5. Finally, register your newly created module in get_supported_module_names() located in perun.utils.__init__.py:
--- supported_module_names.py
+++ supported_module_names_collectors.py
@@ -6,7 +6,7 @@
         ))
     return {
         'vcs': ['git'],
-        'collect': ['trace', 'memory', 'time'],
+        'collect': ['trace', 'memory', 'time', 'mycollector'],
         'postprocess': ['filter', 'normalizer', 'regression-analysis'],
         'view': ['alloclist', 'bars', 'flamegraph', 'flow', 'heapmap', 'raw', 'scatter']
     }[package]
  6. Preferably, verify that registering the collector did not break anything in Perun and, if you are not using the developer installation, reinstall Perun:

    make test
    make install
    
  7. At this point you can start using your collector either via perun collect or via the following commands, which set the job matrix and run the batch collection of profiles:

    perun config --edit
    perun run matrix
    
  8. If you think your collector could help others, please consider making a Pull Request.
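
For illustration, here is a minimal, hypothetical sketch of a run.py for mycollector that fills the ‘profile’ key with a single wall-clock resource. The keyword arguments it receives (e.g. cmd) and the status constant are assumptions; the actual values passed by Perun are described in the steps and templates above.

"""Hypothetical run.py sketch for 'mycollector' (illustrative only)."""
import subprocess
import time

import click

import perun.logic.runner as runner

# status constant: 0 is assumed to stand for a successful phase here
OK = 0


def collect(cmd='', **kwargs):
    """Runs the profiled command once and records its wall-clock running time."""
    start = time.time()
    subprocess.call(cmd, shell=True)
    duration = time.time() - start

    # store the collected data under the 'profile' key, as required by Perun
    kwargs['profile'] = {
        'global': {
            'timestamp': duration,
            'resources': [
                {'amount': duration, 'uid': cmd, 'type': 'time', 'subtype': 'real'}
            ]
        }
    }
    return OK, "", dict(kwargs)


@click.command()
@click.pass_context
def mycollector(ctx, **kwargs):
    """Collects the wall-clock running time of the profiled command."""
    runner.run_collector_from_cli_context(ctx, 'mycollector', kwargs)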