

    Archived pages: 74. Archive date: 2014-06.

  • Title: Valgrind
    Descriptive info: Valgrind User Manual, 6. Callgrind: a call-graph generating cache and branch prediction profiler.

Contents:
- Overview (Functionality; Basic Usage)
- Advanced Usage (Multiple profiling dumps from one program run; Limiting the range of collected events; Counting global bus events; Avoiding cycles; Forking Programs)
- Callgrind Command-line Options (Dump creation options; Activity options; Data collection options; Cost entity separation options; Simulation options; Cache simulation options)
- Callgrind Monitor Commands
- Callgrind specific client requests
- callgrind_annotate Command-line Options
- callgrind_control Command-line Options

To use this tool, you must specify --tool=callgrind on the Valgrind command line.

Callgrind is a profiling tool that records the call history among functions in a program's run as a call-graph. By default, the collected data consists of the number of instructions executed, their relationship to source lines, the caller/callee relationship between functions, and the numbers of such calls. Optionally, cache simulation and/or branch prediction (similar to Cachegrind) can produce further information about the runtime behavior of an application.

The profile data is written out to a file at program termination. For presentation of the data, and interactive control of the profiling, two command line tools are provided:

callgrind_annotate: This command reads in the profile data and prints a sorted list of functions, optionally with source annotation. For graphical visualization of the data, try KCachegrind, a KDE/Qt based GUI that makes it easy to navigate the large amount of data that Callgrind produces.

callgrind_control: This command enables you to interactively observe and control the status of a program currently running under Callgrind's control, without stopping the program. You can get statistics information as well as the current stack trace, and you can request zeroing of counters or dumping of profile data.

Cachegrind collects flat profile data: event counts (data reads, cache misses, etc.) are attributed directly to the function they occurred in. This cost attribution mechanism is called "self" or "exclusive" attribution. Callgrind extends this functionality by propagating costs across function call boundaries. If function foo calls bar, the costs from bar are added into foo's costs. Applied to the program as a whole, this builds up a picture of so-called "inclusive" costs: the cost of each function includes the costs of all functions it called, directly or indirectly.

As an example, the inclusive cost of main should be almost 100 percent of the total program cost. Because of costs arising before main is run, such as initialization of the run-time linker and construction of global C++ objects, the inclusive cost of main is not exactly 100 percent of the total program cost. Together with the call graph, this allows you to find the specific call chains starting from main in which the majority of the program's costs occur. Caller/callee cost attribution is also useful for profiling functions called from multiple call sites, and where optimization opportunities depend on changing code in the callers, in particular by reducing the call count.

Callgrind's cache simulation is based on that of Cachegrind. Read the documentation for Cachegrind: a cache and branch-prediction profiler first.
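To illustrate self versus inclusive attribution, consider the following sketch (the function names and the division of work are invented for illustration):

    /* Illustrative only: with exclusive attribution, bar's events are
       charged to bar alone; with inclusive attribution, they also appear
       in the totals of foo and main, since both are on the call chain. */
    static void bar(void) { /* does most of the actual work */ }

    static void foo(void) {
        bar();   /* bar's inclusive cost is folded into foo's */
    }

    int main(void) {
        foo();   /* main's inclusive cost covers foo and bar */
        return 0;
    }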
The material below describes the features supported in addition to Cachegrind's features. Callgrind's ability to detect function calls and returns depends on the instruction set of the platform it is run on. It works best on x86 and amd64, and unfortunately currently does not work so well on PowerPC, ARM, Thumb or MIPS code. This is because there are no explicit call or return instructions in these instruction sets, so Callgrind has to rely on heuristics to detect calls and returns.

As with Cachegrind, you probably want to compile with debugging info (the -g option) and with optimization turned on. To start a profile run for a program, execute:

    valgrind --tool=callgrind [callgrind options] your-program [program options]

While the simulation is running, you can observe execution with:

    callgrind_control -b

This will print out the current backtrace. To annotate the backtrace with event counts, run callgrind_control -e -b.

After program termination, a profile data file named callgrind.out.<pid> is generated, where <pid> is the process ID of the program being profiled. The data file contains information about the calls made in the program among the functions executed, together with Instruction Read (Ir) event counts.

To generate a function-by-function summary from the profile data file, use

    callgrind_annotate [options] callgrind.out.<pid>

This summary is similar to the output you get from a Cachegrind run with cg_annotate: the list of functions is ordered by exclusive cost, and only those functions are shown. Two options are important for the additional features of Callgrind:

--inclusive=yes: Instead of using the exclusive cost of functions as the sorting order, use and show inclusive cost.

--tree=both: Interleave into the top-level list of functions information on the callers and the callees of each function. In these lines, which represent executed calls, the cost gives the number of events spent in the call. Indented, above each function, there is the list of callers, and below, the list of callees. The sum of events in calls to a given function (caller lines), as well as the sum of events in calls from the function (callee lines), together with the self cost, gives the total inclusive cost of the function.

Use --auto=yes to get annotated source code for all relevant functions for which the source can be found. In addition to the source annotation produced by cg_annotate, you will see the annotated call sites with call counts. For all other options, consult the (Cachegrind) documentation for cg_annotate.

For a better call graph browsing experience, it is highly recommended to use KCachegrind. If your code has a significant fraction of its cost in cycles (sets of functions calling each other in a recursive manner), you have to use KCachegrind, as callgrind_annotate currently does not do any cycle detection, which is important to get correct results in this case.

If you are additionally interested in measuring the cache behavior of your program, use Callgrind with the option --cache-sim=yes. For branch prediction simulation, use --branch-sim=yes. Expect a further slowdown of approximately a factor of 2.

If the program section you want to profile is somewhere in the middle of the run, it is beneficial to fast forward to this section without any profiling, and then enable profiling. This is achieved by using the command line option --instr-atstart=no and running, in a shell,

    callgrind_control -i on

just before the interesting code section is executed. To exactly specify the code position where profiling should start, use the client request CALLGRIND_START_INSTRUMENTATION.
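For example, a minimal sketch of this pattern (setup_phase and interesting_work are hypothetical stand-ins for your own code), intended to be run with --instr-atstart=no:

    #include <valgrind/callgrind.h>

    /* Hypothetical phases of the program being profiled. */
    static void setup_phase(void)      { /* expensive initialization */ }
    static void interesting_work(void) { /* the code to be profiled  */ }

    int main(void)
    {
        setup_phase();                    /* runs uninstrumented: fast  */
        CALLGRIND_START_INSTRUMENTATION;  /* begin instrumenting here   */
        interesting_work();               /* this section is profiled   */
        CALLGRIND_STOP_INSTRUMENTATION;   /* back to minimal slowdown   */
        return 0;
    }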
If you want to see assembly-code-level annotation, specify --dump-instr=yes. This will produce profile data at instruction granularity. Note that the resulting profile data can only be viewed with KCachegrind. For assembly annotation, it is also interesting to see more details of the control flow inside of functions, i.e. (conditional) jumps. This will be collected by further specifying --collect-jumps=yes.

Sometimes you are not interested in the characteristics of a full program run, but only of a small part of it, for example the execution of one algorithm. If there are multiple algorithms, or one algorithm running with different input data, it may even be useful to get different profile information for different parts of a single program run.

Profile data files have names of the form callgrind.out.<pid>.<part>-<threadID>, where <pid> is the PID of the running program, <part> is a number incremented on each dump (".<part>" is skipped for the dump at program termination), and <threadID> is a thread identification ("-<threadID>" is only used if you request dumps of individual threads with --separate-threads=yes).

There are different ways to generate multiple profile dumps while a program is running under Callgrind's supervision. Nevertheless, all methods trigger the same action, which is "dump all profile information since the last dump or program start, and zero cost counters afterwards". To allow for zeroing cost counters without dumping, there is a second action, "zero all cost counters now". The different methods are:

Dump on program termination. This method is the standard way and doesn't need any special action on your part.

Spontaneous, interactive dumping. Use

    callgrind_control -d [hint [PID/Name]]

to request the dumping of profile information of the supervised application with the given PID or name. <hint> is an arbitrary string you can optionally specify to later be able to distinguish profile dumps. The control program will not terminate before the dump is completely written. Note that the application must be actively running for detection of the dump command. So, for a GUI application, resize the window; for a server, send a request. If you are using KCachegrind for browsing of profile information, you can use the toolbar button "Force dump". This will request a dump and trigger a reload after the dump is written.

Periodic dumping after execution of a specified number of basic blocks. For this, use the command line option --dump-every-bb=<count>.

Dumping at enter/leave of specified functions. Use the options --dump-before=<function> and --dump-after=<function>. To zero cost counters before entering a function, use --zero-before=<function>. You can specify these options multiple times for different functions. Function specifications support wildcards: e.g. use --dump-before='foo*' to generate dumps before entering any function starting with foo.

Program controlled dumping. Insert CALLGRIND_DUMP_STATS; at the position in your code where you want a profile dump to happen. Use CALLGRIND_ZERO_STATS; to only zero profile counters. See the client request reference for more information on Callgrind specific client requests.

If you are running a multi-threaded application and specify the command line option --separate-threads=yes, every thread will be profiled on its own and will create its own profile dump. Thus, the last two methods will only generate one dump of the currently running thread. With the other methods, you will get multiple dumps (one for each thread) on a dump request.
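As a sketch of the program-controlled dumping described above (phase is a hypothetical workload; one dump per loop iteration, each tagged with a hint string so the dumps can be told apart later):

    #include <stdio.h>
    #include <valgrind/callgrind.h>

    static void phase(int i) { /* hypothetical work for step i */ (void)i; }

    int main(void)
    {
        CALLGRIND_ZERO_STATS;               /* discard startup costs      */
        for (int i = 0; i < 3; i++) {
            char hint[32];
            phase(i);
            snprintf(hint, sizeof hint, "phase-%d", i);
            CALLGRIND_DUMP_STATS_AT(hint);  /* dump, then zero counters   */
        }
        return 0;
    }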
For aggregating events (function enter/leave, instruction execution, memory access) into event numbers, first the events must be recognizable by Callgrind, and second the collection state must be enabled.

Event collection is only possible if instrumentation for program code is enabled. This is the default, but for faster execution (identical to valgrind --tool=none), it can be disabled until the program reaches a state in which you want to start collecting profiling data. Callgrind can start without instrumentation by specifying the option --instr-atstart=no. Instrumentation can be enabled interactively with

    callgrind_control -i on

and off by specifying "off" instead of "on". Furthermore, the instrumentation state can be changed programmatically with the macros CALLGRIND_START_INSTRUMENTATION and CALLGRIND_STOP_INSTRUMENTATION.

In addition to enabling instrumentation, you must also enable event collection for the parts of your program you are interested in. By default, event collection is enabled everywhere. You can limit collection to a specific function by using --toggle-collect=<function>. This will toggle the collection state on entering and leaving the specified functions. When this option is in effect, the default collection state at program start is "off". Only events happening while running inside of the given function will be collected. Recursive calls of the given function do not trigger any action.

It is important to note that with instrumentation disabled, the cache simulator cannot see any memory access events, and thus any simulated cache state will be frozen and wrong without instrumentation. Therefore, to get useful cache events (hits/misses) after switching on instrumentation, the cache first must warm up, probably leading to many cold misses. [...] the simulator starts with an empty cache at that moment. Switch on event collection later to cope with this error.

--collect-atstart=yes|no [default: yes]: Specify whether event collection is enabled at the beginning of the profile run. To only look at parts of your program, you have two possibilities: (1) zero event counters before entering the program part you want to profile, and dump the event counters to a file after leaving that program part; or (2) switch the collection state on/off as needed to only see event counters happening while inside the program part you want to profile. The second possibility can be used if the program part you want to profile is called many times; option 1, i.e. creating a lot of dumps, is not practical here. Collection state can be toggled at entry and exit of a given function with the option --toggle-collect. If you use this option, collection state should be disabled at the beginning. Note that the specification of --toggle-collect implicitly sets --collect-atstart=no. Collection state can also be toggled by inserting the client request CALLGRIND_TOGGLE_COLLECT; at the needed code positions.

--toggle-collect=<function>: Toggle collection on entry/exit of <function>.

--collect-jumps=no|yes [default: no]: This specifies whether information for (conditional) jumps should be collected. As above, callgrind_annotate currently is not able to show you this data; you have to use KCachegrind to get jump arrows in the annotated code.

--collect-systime=no|yes [default: no]: This specifies whether information for system call times should be collected.
--collect-bus=no|yes [default: no]: This specifies whether the number of global bus events executed should be collected. The event type "Ge" is used for these events.

Cost entity separation options. These options specify how event counts should be attributed to execution contexts. For example, they specify whether the recursion level or the call chain leading to a function should be taken into account, and whether the thread ID should be considered.

--separate-threads=no|yes [default: no]: This option specifies whether profile data should be generated separately for every thread. If yes, the file names get "-threadID" appended.

--separate-callers=<callers> [default: 0]: Separate contexts by at most <callers> functions in the call chain.

--separate-callers<number>=<function>: Separate <number> callers for <function>.

--separate-recs=<level> [default: 2]: Separate function recursions by at most <level> levels.

--separate-recs<number>=<function>: Separate <number> recursions for <function>.

--skip-plt=no|yes [default: yes]: Ignore calls to/from PLT sections.

--skip-direct-rec=no|yes [default: yes]: Ignore direct recursions.

--fn-skip=<function>: Ignore calls to/from a given function. E.g. if you have a call chain A > B > C, and you specify function B to be ignored, you will only see A > C. This is very convenient to skip functions handling callback behaviour. For example, with the signal/slot mechanism in the Qt graphics library, you only want to see the function emitting a signal call the slots connected to that signal. First, determine the real call chain to see the functions that need to be skipped, then use this option.

Simulation options:

--cache-sim=yes|no [default: no]: Specify if you want to do full cache simulation. By default, only instruction read accesses will be counted ("Ir"). With cache simulation, further event counters are enabled: cache misses on instruction reads ("I1mr"/"ILmr"), data read accesses ("Dr") and related cache misses ("D1mr"/"DLmr"), data write accesses ("Dw") and related cache misses ("D1mw"/"DLmw"). For more information, see the Cachegrind documentation.

--branch-sim=yes|no [default: no]: Specify if you want to do branch prediction simulation. Further event counters are enabled: the number of executed conditional branches and related predictor misses ("Bc"/"Bcm"), and executed indirect jumps and related misses of the jump address predictor ("Bi"/"Bim").

Cache simulation options:

--simulate-wb=yes|no [default: no]: Specify whether write-back behavior should be simulated, allowing LL cache misses with and without write-backs to be distinguished. The cache model of Cachegrind/Callgrind does not specify write-through vs. write-back behavior, and this is also not relevant for the number of generated miss counts. However, with explicit write-back simulation it can be decided whether a miss triggers not only the loading of a new cache line, but also whether a write-back of a dirty cache line had to take place before. The new dirty miss events are ILdmr, DLdmr, and DLdmw, for misses because of instruction read, data read, and data write, respectively. As they produce two memory transactions, they should account for a doubled time estimate relative to a normal miss.

--simulate-hwpref=yes|no [default: no]: Specify whether simulation of a hardware prefetcher should be added which is able to detect stream access in the second level cache by comparing accesses to separate memory pages. As the simulation cannot decide about any timing issues of prefetching, it is assumed that any triggered hardware prefetch succeeds before a real access is done. Thus, this gives a best-case scenario by covering all possible stream accesses.
--cacheuse=yes|no [default: no]: Specify whether cache line use should be collected. For every cache line, from loading until it is evicted, the number of accesses as well as the number of actually used bytes is determined. This behavior is related to the code which triggered loading of the cache line. In contrast to miss counters, which show the position where the symptoms of bad cache behavior (i.e. latencies) happen, the use counters try to pinpoint the reason (i.e. the code with the bad access behavior). The new counters are defined in such a way that worse behavior results in higher cost. AcCost1 and AcCost2 are counters showing bad temporal locality for L1 and LL caches, respectively. This is done by summing up reciprocal values of the numbers of accesses of each cache line, multiplied by 1000 (as only integer costs are allowed). E.g. for a given source line with 5 read accesses, a value of 5000 AcCost means that for every access, a new cache line was loaded and directly evicted afterwards without further accesses. Similarly, SpLoss1/2 shows bad spatial locality for L1 and LL caches, respectively. It gives the "spatial loss" count of bytes which were loaded into cache but never accessed. It pinpoints code accessing data in a way such that cache space is wasted. This hints at bad layout of data structures in memory. Assuming a cache line size of 64 bytes and 100 L1 misses for a given source line, the loading of 6400 bytes into L1 was triggered. If SpLoss1 shows a value of 3200 for this line, this means that half of the loaded data was never used, or, using a better data layout, only half of the cache space would have been needed. Please note that for cache line use counters, it currently is not possible to provide meaningful inclusive costs; therefore, the inclusive cost of these counters should be ignored.

--I1=<size>,<associativity>,<line size>: Specify the size, associativity and line size of the level 1 instruction cache.

--D1=<size>,<associativity>,<line size>: Specify the size, associativity and line size of the level 1 data cache.

--LL=<size>,<associativity>,<line size>: Specify the size, associativity and line size of the last-level cache.
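For example, a run combining full cache and branch simulation with an explicitly sized last-level cache might look like this (the cache parameters here are arbitrary illustrations, not recommendations):

    valgrind --tool=callgrind --cache-sim=yes --branch-sim=yes --LL=8388608,16,64 your-program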
Callgrind Monitor Commands. The Callgrind tool provides monitor commands handled by the Valgrind gdbserver (see Monitor command handling by the Valgrind gdbserver):

dump [<dump_hint>]: requests a dump of the profile data.

zero: requests zeroing of the profile data counters.

instrumentation [on|off]: requests to set (if the on/off parameter is given) or get the current instrumentation state.

status: requests that some status information be printed out.

Callgrind specific client requests. Callgrind provides the following specific client requests in callgrind.h. See that file for the exact details of their arguments.

CALLGRIND_DUMP_STATS: Force generation of a profile dump at the specified position in the code, for the current thread only. Written counters will be reset to zero.

CALLGRIND_DUMP_STATS_AT(string): Same as CALLGRIND_DUMP_STATS, but allows a string to be specified so that profile dumps can be distinguished.

CALLGRIND_ZERO_STATS: Reset the profile counters for the current thread to zero.

CALLGRIND_TOGGLE_COLLECT: Toggle the collection state. This makes it possible to ignore events with regard to profile counters. See also the options --collect-atstart and --toggle-collect.

CALLGRIND_START_INSTRUMENTATION: Start full Callgrind instrumentation if not already enabled. When cache simulation is done, this will flush the simulated cache and lead to an artificial cache warmup phase afterwards, with cache misses which would not have happened in reality. See also option --instr-atstart.

CALLGRIND_STOP_INSTRUMENTATION: Stop full Callgrind instrumentation if not already disabled. This flushes Valgrind's translation cache, and does no additional instrumentation afterwards: it effectively will run at the same speed as Nulgrind, i.e. at minimal slowdown. Use this to speed up the Callgrind run for uninteresting code parts. Use CALLGRIND_START_INSTRUMENTATION to enable instrumentation again. See also option --instr-atstart.

callgrind_annotate Command-line Options:

-h --help: Show summary of options.

--version: Show version of callgrind_annotate.

--show=A,B,C [default: all]: Only show figures for events A,B,C.

--sort=A,B,C [default: event column order]: Sort columns by events A,B,C.

--threshold=<0--100> [default: 99%]: Percentage of counts (of the primary sort event) we are interested in.

--auto=yes|no [default: no]: Annotate all source files containing functions that helped reach the event count threshold.

--context=N [default: 8]: Print N lines of context before and after annotated lines.

--inclusive=yes|no [default: no]: Add subroutine costs to function calls.

--tree=none|caller|calling|both [default: none]: Print for each function its callers, the called functions, or both.

-I, --include=<dir>: Add <dir> to the list of directories to search for source files.

callgrind_control Command-line Options:

By default, callgrind_control acts on all programs run by the current user under Callgrind. It is possible to limit the actions to specified Callgrind runs by providing a list of PIDs or program names as arguments. The default action is to give some brief information about the applications being run under Callgrind.

-h --help: Show a short description, usage, and summary of options.

--version: Show version of callgrind_control.

-l --long: Show also the working directory, in addition to the brief information given by default.

-s --stat: Show statistics information about active Callgrind runs.

-b --back: Show stack/back traces of each thread in active Callgrind runs. For each active function in the stack trace, the number of invocations since program start (or last dump) is also shown. This option can be combined with -e to show the inclusive cost of active functions.

-e [A,B,...] (default: all): Show the current per-thread, exclusive cost values of event counters. If no explicit event names are given, figures for all event types which are collected in the given Callgrind run are shown; otherwise, only figures for event types A, B, ... are shown. If this option is combined with -b, the inclusive cost for the functions of each active stack frame is provided, too.

--dump[=<desc>] (default: no description): Request the dumping of profile information. Optionally, a description can be specified which is written into the dump as part of the information giving the reason which triggered the dump action. This can be used to distinguish multiple dumps.

-z --zero: Zero all event counters.

-k --kill: Force a Callgrind run to be terminated.

--instr=on|off: Switch instrumentation mode on or off. If a Callgrind run has instrumentation disabled, no simulation is done and no events are counted. This is useful to skip uninteresting program parts, as there is much less slowdown (the same as with the Valgrind tool "none"). See also the Callgrind option --instr-atstart.

-w=<dir>: Specify the startup directory of an active Callgrind run. On some systems, active Callgrind runs cannot be detected. To be able to control these, the failed auto-detection can be worked around by specifying the directory where a Callgrind run was started.
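As an illustration (the PID 1234 is hypothetical), a typical interactive session might run callgrind_control -s 1234 to check statistics, then callgrind_control -e -b 1234 for an event-annotated backtrace, and finally callgrind_control --dump=after-warmup 1234 to force a dump tagged "after-warmup".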

    Original link path: /docs/manual/cl-manual.html

  • Title: Valgrind
    Descriptive info: Valgrind User Manual, 7. Helgrind: a thread error detector.

Contents:
- Detected errors: Misuses of the POSIX pthreads API
- Detected errors: Inconsistent Lock Orderings
- Detected errors: Data Races (A Simple Data Race; Helgrind's Race Detection Algorithm; Interpreting Race Error Messages)
- Hints and Tips for Effective Use of Helgrind
- Helgrind Command-line Options
- Helgrind Client Requests
- A To-Do List for Helgrind

To use this tool, you must specify --tool=helgrind on the Valgrind command line.

Helgrind is a Valgrind tool for detecting synchronisation errors in C, C++ and Fortran programs that use the POSIX pthreads threading primitives. The main abstractions in POSIX pthreads are: a set of threads sharing a common address space, thread creation, thread joining, thread exit, mutexes (locks), condition variables (inter-thread event notifications), reader-writer locks, spinlocks, semaphores and barriers.

Helgrind can detect three classes of errors, which are discussed in detail in the next three sections:

1. Misuses of the POSIX pthreads API.
2. Potential deadlocks arising from lock ordering problems.
3. Data races -- accessing memory without adequate locking or synchronisation.

Problems like these often result in unreproducible, timing-dependent crashes, deadlocks and other misbehaviour, and can be difficult to find by other means.

Helgrind is aware of all the pthread abstractions and tracks their effects as accurately as it can. On x86 and amd64 platforms, it understands and partially handles implicit locking arising from the use of the LOCK instruction prefix. On PowerPC/POWER and ARM platforms, it partially handles implicit locking arising from load-linked and store-conditional instruction pairs.

Helgrind works best when your application uses only the POSIX pthreads API. However, if you want to use custom threading primitives, you can describe their behaviour to Helgrind using the ANNOTATE_* macros defined in helgrind.h.

Following those sections is a section containing hints and tips on how to get the best out of Helgrind. Then there is a summary of command-line options. Finally, there is a brief summary of areas in which Helgrind could be improved.

Detected errors: Misuses of the POSIX pthreads API. Helgrind intercepts calls to many POSIX pthreads functions, and is therefore able to report on various common problems. Although these are unglamorous errors, their presence can lead to undefined program behaviour and hard-to-find bugs later on. The detected errors are:

- unlocking an invalid mutex
- unlocking a not-locked mutex
- unlocking a mutex held by a different thread
- destroying an invalid or a locked mutex
- recursively locking a non-recursive mutex
- deallocation of memory that contains a locked mutex
- passing mutex arguments to functions expecting reader-writer lock arguments, and vice versa
- when a POSIX pthread function fails with an error code that must be handled
- when a thread exits whilst still holding locked locks
- calling pthread_cond_wait with a not-locked mutex, an invalid mutex, or one locked by a different thread
- inconsistent bindings between condition variables and their associated mutexes
- invalid or duplicate initialisation of a pthread barrier
- initialisation of a pthread barrier on which threads are still waiting
- destruction of a pthread barrier object which was never initialised, or on which threads are still waiting
- waiting on an uninitialised pthread barrier

In addition, for all of the pthreads functions that Helgrind intercepts, an error is reported, along with a stack trace, if the system threading library routine returns an error code, even if Helgrind itself detected no error.
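For instance, a minimal sketch (not the actual tc09_bad_unlock.c test referenced below) that triggers the "unlocked a not-locked mutex" report:

    #include <pthread.h>

    int main(void)
    {
        pthread_mutex_t mx;
        pthread_mutex_init(&mx, NULL);
        pthread_mutex_unlock(&mx);    /* error: mx was never locked */
        pthread_mutex_destroy(&mx);
        return 0;
    }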
Checks pertaining to the validity of mutexes are generally also performed for reader-writer locks. Various kinds of this-can't-possibly-happen events are also reported. These usually indicate bugs in the system threading library.

Reported errors always contain a primary stack trace indicating where the error was detected. They may also contain auxiliary stack traces giving additional information. In particular, most errors relating to mutexes will also tell you where the mutex first came to Helgrind's attention (the "was first observed at" part), so you have a chance of figuring out which mutex it is referring to. For example:

    Thread #1 unlocked a not-locked lock at 0x7FEFFFA90
       at 0x4C2408D: pthread_mutex_unlock (hg_intercepts.c:492)
       by 0x40073A: nearly_main (tc09_bad_unlock.c:27)
       by 0x40079B: main (tc09_bad_unlock.c:50)
      Lock at 0x7FEFFFA90 was first observed
       at 0x4C25D01: pthread_mutex_init (hg_intercepts.c:326)
       by 0x40071F: nearly_main (tc09_bad_unlock.c:23)
       by 0x40079B: main (tc09_bad_unlock.c:50)

Helgrind has a way of summarising thread identities, as you see here with the text "Thread #1". This is so that it can speak about threads and sets of threads without overwhelming you with details. See below for more information on interpreting error messages.

Detected errors: Inconsistent Lock Orderings. In this section, and in general, to "acquire" a lock simply means to lock that lock, and to "release" a lock means to unlock it.

Helgrind monitors the order in which threads acquire locks. This allows it to detect potential deadlocks which could arise from the formation of cycles of locks. Detecting such inconsistencies is useful because, whilst actual deadlocks are fairly obvious, potential deadlocks may never be discovered during testing and could later lead to hard-to-diagnose in-service failures.

The simplest example of such a problem is as follows. Imagine some shared resource R, which, for whatever reason, is guarded by two locks, L1 and L2, which must both be held when R is accessed. Suppose a thread acquires L1, then L2, and proceeds to access R. The implication of this is that all threads in the program must acquire the two locks in the order first L1 then L2. Not doing so risks deadlock. The deadlock could happen if two threads -- call them T1 and T2 -- both want to access R. Suppose T1 acquires L1 first, and T2 acquires L2 first. Then T1 tries to acquire L2, and T2 tries to acquire L1, but those locks are both already held. So T1 and T2 become deadlocked.

Helgrind builds a directed graph indicating the order in which locks have been acquired in the past. When a thread acquires a new lock, the graph is updated, and then checked to see if it now contains a cycle. The presence of a cycle indicates a potential deadlock involving the locks in the cycle. In general, Helgrind will choose two locks involved in the cycle and show you how their acquisition ordering has become inconsistent. It does this by showing the program points that first defined the ordering, and the program points which later violated it. Here is a simple example involving just two locks:

    Thread #1: lock order "0x7FF0006D0 before 0x7FF0006A0" violated
    Observed (incorrect) order is: acquisition of lock at 0x7FF0006A0
       at 0x4C2BC62: pthread_mutex_lock (hg_intercepts.c:494)
       by 0x400825: main (tc13_laog1.c:23)
     followed by a later acquisition of lock at 0x7FF0006D0
       at 0x4C2BC62: pthread_mutex_lock (hg_intercepts.c:494)
       by 0x400853: main (tc13_laog1.c:24)
    Required order was established by acquisition of lock at 0x7FF0006D0
       at 0x4C2BC62: pthread_mutex_lock (hg_intercepts.c:494)
       by 0x40076D: main (tc13_laog1.c:17)
     followed by a later acquisition of lock at 0x7FF0006A0
       at 0x4C2BC62: pthread_mutex_lock (hg_intercepts.c:494)
       by 0x40079B: main (tc13_laog1.c:18)
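A source-level sketch of the kind of inconsistency reported above (a simplified stand-in, not the actual tc13_laog1.c test):

    #include <pthread.h>

    pthread_mutex_t L1 = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_t L2 = PTHREAD_MUTEX_INITIALIZER;

    int main(void)
    {
        /* First establishes the order "L1 before L2"... */
        pthread_mutex_lock(&L1);
        pthread_mutex_lock(&L2);
        pthread_mutex_unlock(&L2);
        pthread_mutex_unlock(&L1);

        /* ...and later violates it with "L2 before L1". */
        pthread_mutex_lock(&L2);
        pthread_mutex_lock(&L1);
        pthread_mutex_unlock(&L1);
        pthread_mutex_unlock(&L2);
        return 0;
    }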
When there are more than two locks in the cycle, the error is equally serious. However, at present Helgrind does not show the locks involved, sometimes because that information is not available, but also so as to avoid flooding you with information. For example, a naive implementation of the famous Dining Philosophers problem involves a cycle of five locks (see helgrind/tests/tc14_laog_dinphils.c). In this case Helgrind has detected that all 5 philosophers could simultaneously pick up their left fork and then deadlock whilst waiting to pick up their right forks:

    Thread #6: lock order "0x80499A0 before 0x8049A00" violated
    Observed (incorrect) order is: acquisition of lock at 0x8049A00
       at 0x40085BC: pthread_mutex_lock (hg_intercepts.c:495)
       by 0x80485B4: dine (tc14_laog_dinphils.c:18)
       by 0x400BDA4: mythread_wrapper (hg_intercepts.c:219)
       by 0x39B924: start_thread (pthread_create.c:297)
       by 0x2F107D: clone (clone.S:130)
     followed by a later acquisition of lock at 0x80499A0
       at 0x40085BC: pthread_mutex_lock (hg_intercepts.c:495)
       by 0x80485CD: dine (tc14_laog_dinphils.c:19)
       by 0x400BDA4: mythread_wrapper (hg_intercepts.c:219)
       by 0x39B924: start_thread (pthread_create.c:297)
       by 0x2F107D: clone (clone.S:130)

Detected errors: Data Races. A data race happens, or could happen, when two threads access a shared memory location without using suitable locks or other synchronisation to ensure single-threaded access. Such missing locking can cause obscure timing-dependent bugs. Ensuring programs are race-free is one of the central difficulties of threaded programming. Reliably detecting races is a difficult problem, and most of Helgrind's internals are devoted to dealing with it. We begin with a simple example.

A Simple Data Race. About the simplest possible example of a race is as follows. In this program, it is impossible to know what the value of var is at the end of the program. Is it 2? Or 1?

    #include <pthread.h>

    int var = 0;

    void* child_fn ( void* arg ) {
       var++; /* Unprotected relative to parent */ /* this is line 6 */
       return NULL;
    }

    int main ( void ) {
       pthread_t child;
       pthread_create(&child, NULL, child_fn, NULL);
       var++; /* Unprotected relative to child */ /* this is line 13 */
       pthread_join(child, NULL);
       return 0;
    }

The problem is that there is nothing to stop var being updated simultaneously by both threads. A correct program would protect var with a lock of type pthread_mutex_t, which is acquired before each access and released afterwards. Helgrind's output for this program is:

    Thread #1 is the program's root thread

    Thread #2 was created
       at 0x511C08E: clone (in /lib64/libc-2.8.so)
       by 0x4E333A4: do_clone (in /lib64/libpthread-2.8.so)
       by 0x4E33A30: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.8.so)
       by 0x4C299D4: pthread_create@* (hg_intercepts.c:214)
       by 0x400605: main (simple_race.c:12)

    Possible data race during read of size 4 at 0x601038 by thread #1
    Locks held: none
       at 0x400606: main (simple_race.c:13)

    This conflicts with a previous write of size 4 by thread #2
    Locks held: none
       at 0x4005DC: child_fn (simple_race.c:6)
       by 0x4C29AFF: mythread_wrapper (hg_intercepts.c:194)
       by 0x4E3403F: start_thread (in /lib64/libpthread-2.8.so)
       by 0x511C0CC: clone (in /lib64/libc-2.8.so)

    Location 0x601038 is 0 bytes inside global var "var" declared at simple_race.c:3
This is quite a lot of detail for an apparently simple error. The last clause is the main error message. It says there is a race as a result of a read of size 4 (bytes), at 0x601038, which is the address of var, happening in function main at line 13 in the program. Two important parts of the message are:

- Helgrind shows two stack traces for the error, not one. By definition, a race involves two different threads accessing the same location in such a way that the result depends on the relative speeds of the two threads. The first stack trace follows the text "Possible data race during read of size 4" and the second trace follows the text "This conflicts with a previous write of size 4". Helgrind is usually able to show both accesses involved in a race. At least one of these will be a write (since two concurrent, unsynchronised reads are harmless), and they will of course be from different threads. By examining your program at the two locations, you should be able to get at least some idea of what the root cause of the problem is. For each location, Helgrind shows the set of locks held at the time of the access. This often makes it clear which thread, if any, failed to take a required lock. In this example neither thread holds a lock during the access.

- For races which occur on global or stack variables, Helgrind tries to identify the name and defining point of the variable. Hence the text "Location 0x601038 is 0 bytes inside global var "var" declared at simple_race.c:3". Showing names of stack and global variables carries no run-time overhead once Helgrind has your program up and running. However, it does require Helgrind to spend considerable extra time and memory at program startup to read the relevant debug info. Hence this facility is disabled by default. To enable it, you need to give the --read-var-info=yes option to Helgrind.

The following section explains Helgrind's race detection algorithm in more detail.

Helgrind's Race Detection Algorithm. Most programmers think about threaded programming in terms of the basic functionality provided by the threading library (POSIX Pthreads): thread creation, thread joining, locks, condition variables, semaphores and barriers. The effect of using these functions is to impose constraints upon the order in which memory accesses can happen. This implied ordering is generally known as the "happens-before relation". Once you understand the happens-before relation, it is easy to see how Helgrind finds races in your code. Fortunately, the happens-before relation is itself easy to understand, and is by itself a useful tool for reasoning about the behaviour of parallel programs. We now introduce it using a simple example.

Consider first the following buggy program:

    Parent thread:                 Child thread:

    int var;

    // create child thread
    pthread_create(...)
                                   var = 20;
    var = 10;                      exit

    // wait for child
    pthread_join(...)

    printf("%d\n", var);

The parent thread creates a child. Both then write different values to some variable var, and the parent then waits for the child to exit. What is the value of var at the end of the program, 10 or 20? We don't know. The program is considered buggy (it has a race) because the final value of var depends on the relative rates of progress of the parent and child threads. If the parent is fast and the child is slow, then the child's assignment may happen later, so the final value will be 10; and vice versa if the child is faster than the parent. The relative rates of progress of parent vs child are not something the programmer can control, and will often change from run to run. They depend on factors such as the load on the machine, what else is running, the kernel's scheduling strategy, and many other factors.
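Returning to the earlier simple_race.c example: a sketch of the corrected program, protecting var with a pthread_mutex_t as the text suggests, so that the two increments are ordered and Helgrind reports no race:

    #include <pthread.h>

    int var = 0;
    pthread_mutex_t mx = PTHREAD_MUTEX_INITIALIZER;

    void* child_fn(void* arg)
    {
        pthread_mutex_lock(&mx);
        var++;                      /* now ordered w.r.t. the parent */
        pthread_mutex_unlock(&mx);
        return NULL;
    }

    int main(void)
    {
        pthread_t child;
        pthread_create(&child, NULL, child_fn, NULL);
        pthread_mutex_lock(&mx);
        var++;                      /* now ordered w.r.t. the child */
        pthread_mutex_unlock(&mx);
        pthread_join(child, NULL);
        return 0;
    }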
[...] we would particularly appreciate feedback from folks who have used Helgrind to successfully debug Qt 4 and/or KDE4 applications.

Runtime support library for GNU OpenMP (part of GCC), at least for GCC versions 4.2 and 4.3. The GNU OpenMP runtime library (libgomp.so) constructs its own synchronisation primitives using combinations of atomic memory instructions and the futex syscall, which causes total chaos in Helgrind, since it cannot "see" those. Fortunately, this can be solved using a configuration-time option (for GCC): rebuild GCC from source, and configure using --disable-linux-futex. This makes libgomp.so use the standard POSIX threading primitives instead. Note that this was tested using GCC 4.2.3 and has not been re-tested using more recent GCC versions. We would appreciate hearing about any successes or failures with more recent versions.

If you must implement your own threading primitives, there is a set of client request macros in helgrind.h to help you describe your primitives to Helgrind. You should be able to mark up mutexes, condition variables, etc, without difficulty.

It is also possible to mark up the effects of thread-safe reference counting using the ANNOTATE_HAPPENS_BEFORE, ANNOTATE_HAPPENS_AFTER and ANNOTATE_HAPPENS_BEFORE_FORGET_ALL macros. Thread-safe reference counting using an atomically incremented/decremented refcount variable causes Helgrind problems because a one-to-zero transition of the reference count means the accessing thread has exclusive ownership of the associated resource (normally, a C++ object) and can therefore access it (normally, to run its destructor) without locking. Helgrind doesn't understand this, and markup is essential to avoid false positives.

Here are recommended guidelines for marking up thread-safe reference counting in C++. You only need to mark up your release methods -- the ones which decrement the reference count. Given a class like this:

    class MyClass {
       unsigned int mRefCount;

       void Release ( void ) {
          unsigned int newCount = atomic_decrement(&mRefCount);
          if (newCount == 0) {
             delete this;
          }
       }
    }

the release method should be marked up as follows:

    void Release ( void ) {
       unsigned int newCount = atomic_decrement(&mRefCount);
       if (newCount == 0) {
          ANNOTATE_HAPPENS_AFTER(&mRefCount);
          ANNOTATE_HAPPENS_BEFORE_FORGET_ALL(&mRefCount);
          delete this;
       } else {
          ANNOTATE_HAPPENS_BEFORE(&mRefCount);
       }
    }

There are a number of complex, mostly-theoretical objections to this scheme. From a theoretical standpoint it appears to be impossible to devise a markup scheme which is completely correct in the sense of guaranteeing to remove all false races. The proposed scheme however works well in practice.

Avoid memory recycling. If you can't avoid it, you must tell Helgrind what is going on via the VALGRIND_HG_CLEAN_MEMORY client request (in helgrind.h); a sketch appears after these hints. Helgrind is aware of standard heap memory allocation and deallocation that occurs via malloc/free or new/delete and from entry and exit of stack frames. In particular, when memory is deallocated via free, delete, or function exit, Helgrind considers that memory clean, so when it is eventually reallocated, its history is irrelevant.
However, it is common practice to implement memory recycling schemes. In these, memory to be freed is not handed to free/delete, but instead put into a pool of free buffers to be handed out again as required. The problem is that Helgrind has no way to know that such memory is logically no longer in use, and its history is irrelevant. Hence you must make that explicit, using the VALGRIND_HG_CLEAN_MEMORY client request to specify the relevant address ranges. It's easiest to put these requests into the pool manager code, and use them either when memory is returned to the pool, or when it is allocated from it.

Avoid POSIX condition variables. If you can, use POSIX semaphores (sem_t, sem_post, sem_wait) to do inter-thread event signalling. Semaphores with an initial value of zero are particularly useful for this. Helgrind only partially correctly handles POSIX condition variables. This is because Helgrind can see inter-thread dependencies between a pthread_cond_wait call and a pthread_cond_signal / pthread_cond_broadcast call only if the waiting thread actually gets to the rendezvous first (so that it actually calls pthread_cond_wait). It can't see dependencies between the threads if the signaller arrives first. In the latter case, POSIX guidelines imply that the associated boolean condition still provides an inter-thread synchronisation event, but one which is invisible to Helgrind.

The result of Helgrind missing some inter-thread synchronisation events is to cause it to report false positives. The root cause of this synchronisation lossage is particularly hard to understand, so an example is helpful. It was discussed at length by Arndt Muehlenfeld ("Runtime Race Detection in Multi-Threaded Programs", Dissertation, TU Graz, Austria). The canonical POSIX-recommended usage scheme for condition variables is as follows:

    b is a Boolean condition, which is False most of the time
    cv is a condition variable
    mx is its associated mutex

    Signaller:                     Waiter:

    lock(mx)                       lock(mx)
    b = True                       while (b == False)
    signal(cv)                        wait(cv,mx)
    unlock(mx)                     unlock(mx)

Assume b is False most of the time. If the waiter arrives at the rendezvous first, it enters its while-loop, waits for the signaller to signal, and eventually proceeds. Helgrind sees the signal, notes the dependency, and all is well. If the signaller arrives first, b is set to True, and the signal disappears into nowhere. When the waiter later arrives, it does not enter its while-loop and simply carries on. But even in this case, the waiter code following the while-loop cannot execute until the signaller sets b to True. Hence there is still the same inter-thread dependency, but this time it is through an arbitrary in-memory condition, and Helgrind cannot see it. By comparison, Helgrind's detection of inter-thread dependencies caused by semaphore operations is believed to be exactly correct. As far as I know, a solution to this problem that does not require source-level annotation of condition-variable wait loops is beyond the current state of the art.

Make sure you are using a supported Linux distribution. At present, Helgrind only properly supports glibc-2.3 or later. This in turn means we only support glibc's NPTL threading implementation. The old LinuxThreads implementation is not supported.

Round up all finished threads using pthread_join. Avoid detaching threads: don't create threads in the detached state, and don't call pthread_detach on existing threads. Using pthread_join to round up finished threads provides a clear synchronisation point that both Helgrind and programmers can see. If you don't call pthread_join on a thread, Helgrind has no way to know when it finishes, relative to any significant synchronisation points for other threads in the program. So it assumes that the thread lingers indefinitely and can potentially interfere indefinitely with the memory state of the program. It has every right to assume that -- after all, it might really be the case that, for scheduling reasons, the exiting thread did run very slowly in the last stages of its life.

Perform thread debugging (with Helgrind) and memory debugging (with Memcheck) together. Helgrind tracks the state of memory in detail, and memory management bugs in the application are liable to cause confusion. In extreme cases, applications which do many invalid reads and writes (particularly to freed memory) have been known to crash Helgrind. So, ideally, you should make your application Memcheck-clean before using Helgrind. It may be impossible to make your application Memcheck-clean unless you first remove threading bugs. In particular, it may be difficult to remove all reads and writes to freed memory in multithreaded C++ destructor sequences at program termination. So, ideally, you should make your application Helgrind-clean before using Memcheck. Since this circularity is obviously unresolvable, at least bear in mind that Memcheck and Helgrind are to some extent complementary, and you may need to use them together.

POSIX requires that implementations of standard I/O (printf, fprintf, fwrite, fread, etc.) are thread safe. Unfortunately GNU libc implements this by using internal locking primitives that Helgrind is unable to intercept. Consequently Helgrind generates many false race reports when you use these functions. Helgrind attempts to hide these errors using the standard Valgrind error-suppression mechanism. So, at least for simple test cases, you don't see any. Nevertheless, some may slip through. Just something to be aware of.

Helgrind's error checks do not work properly inside the system threading library itself (libpthread.so), and it usually observes large numbers of (false) errors in there. Valgrind's suppression system then filters these out, so you should not see them. If you see any race errors reported where libpthread.so or ld.so is the object associated with the innermost stack frame, please file a bug report at http://www.valgrind.org/.
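As a sketch of the markup recommended in the memory-recycling hint above (the pool type and the take_from_free_list helper are hypothetical, not part of Valgrind's API):

    #include <stddef.h>
    #include <valgrind/helgrind.h>

    struct pool;                                      /* hypothetical pool type   */
    void* take_from_free_list(struct pool*, size_t);  /* hypothetical helper      */

    void* pool_alloc(struct pool* p, size_t len)
    {
        void* buf = take_from_free_list(p, len);
        /* Tell Helgrind the recycled buffer's previous history is irrelevant. */
        VALGRIND_HG_CLEAN_MEMORY(buf, len);
        return buf;
    }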
Helgrind Command-line Options. The following end-user options are available:

--free-is-write=no|yes [default: no]: When enabled (not the default), Helgrind treats freeing of heap memory as if the memory was written immediately before the free. This exposes races where memory is referenced by one thread, and freed by another, but there is no observable synchronisation event to ensure that the reference happens before the free. This functionality is new in Valgrind 3.7.0, and is regarded as experimental. It is not enabled by default because its interaction with custom memory allocators is not well understood at present. User feedback is welcomed.

--track-lockorders=no|yes [default: yes]: When enabled (the default), Helgrind performs lock order consistency checking. For some buggy programs, the large number of lock order errors reported can become annoying, particularly if you're only interested in race errors. You may therefore find it helpful to disable lock order checking.
--history-level=none|approx|full [default: full]: --history-level=full (the default) causes Helgrind to collect enough information about "old" accesses that it can produce two stack traces in a race report -- both the stack trace for the current access, and the trace for the older, conflicting access. To limit memory usage, "old" access stack traces are limited to a maximum of 8 entries, even if the --num-callers value is bigger. Collecting such information is expensive in both speed and memory, particularly for programs that do many inter-thread synchronisation events (locks, unlocks, etc.). Without such information, it is more difficult to track down the root causes of races. Nonetheless, you may not need it in situations where you just want to check for the presence or absence of races, for example when doing regression testing of a previously race-free program. --history-level=none is the opposite extreme. It causes Helgrind not to collect any information about previous accesses. This can be dramatically faster than --history-level=full. --history-level=approx provides a compromise between these two extremes. It causes Helgrind to show a full trace for the later access, and approximate information regarding the earlier access. This approximate information consists of two stacks, and the earlier access is guaranteed to have occurred somewhere between the program points denoted by the two stacks. This is not as useful as showing the exact stack for the previous access (as --history-level=full does), but it is better than nothing, and it is almost as fast as --history-level=none.

--conflict-cache-size=N [default: 1000000]: This flag only has any effect at --history-level=full. Information about "old" conflicting accesses is stored in a cache of limited size, with LRU-style management. This is necessary because it isn't practical to store a stack trace for every single memory access made by the program. Historical information on not recently accessed locations is periodically discarded, to free up space in the cache. This option controls the size of the cache, in terms of the number of different memory addresses for which conflicting access information is stored. If you find that Helgrind is showing race errors with only one stack instead of the expected two stacks, try increasing this value. The minimum value is 10,000 and the maximum is 30,000,000 (thirty times the default value). Increasing the value by 1 increases Helgrind's memory requirement by very roughly 100 bytes, so the maximum value will easily eat up three extra gigabytes or so of memory.

--check-stack-refs=no|yes [default: yes]: By default Helgrind checks all data memory accesses made by your program. This flag enables you to skip checking for accesses to thread stacks (local variables). This can improve performance, but comes at the cost of missing races on stack-allocated data.

Helgrind Client Requests. The following client requests are defined in helgrind.h. See that file for exact details of their arguments.

VALGRIND_HG_CLEAN_MEMORY: This makes Helgrind forget everything it knows about a specified memory range. This is particularly useful for memory allocators that wish to recycle memory.

ANNOTATE_NEW_MEMORY, ANNOTATE_RWLOCK_CREATE, ANNOTATE_RWLOCK_DESTROY, ANNOTATE_RWLOCK_ACQUIRED, ANNOTATE_RWLOCK_RELEASED: These are used to describe to Helgrind the behaviour of custom (non-POSIX) synchronisation primitives, which it otherwise has no way to understand. See comments in helgrind.h for further documentation.

A To-Do List for Helgrind. The following is a list of loose ends which should be tidied up some time:

- For lock order errors, print the complete lock cycle, rather than only doing so for size-2 cycles as at present.
- The conflicting access mechanism sometimes mysteriously fails to show the conflicting access' stack, even when provided with unbounded storage for conflicting access info. This should be investigated.

- Document races caused by GCC's thread-unsafe code generation for speculative stores. In the interim, see http://gcc.gnu.org/ml/gcc/2007-10/msg00266.html and http://lkml.org/lkml/2007/10/24/673.

- Don't update the lock-order graph, and don't check for errors, when a "try"-style lock operation happens (e.g. pthread_mutex_trylock). Such calls do not add any real restrictions to the locking order, since they can always fail to acquire the lock, resulting in the caller going off and doing Plan B (presumably it will have a Plan B). Doing such checks could generate false lock-order errors and confuse users.

- Performance can be very poor. Slowdowns on the order of 100:1 are not unusual. There is limited scope for performance improvements.

    Original link path: /docs/manual/hg-manual.html

  • Title: Valgrind
    Descriptive info: Valgrind User Manual, 8. DRD: a thread error detector.

Contents:
- Overview (Multithreaded Programming Paradigms; POSIX Threads Programming Model; Multithreaded Programming Problems; Data Race Detection)
- Using DRD (DRD Command-line Options; Detected Errors: Data Races; Detected Errors: Lock Contention; Detected Errors: Misuse of the POSIX threads API; Client Requests; Debugging GNOME Programs; Debugging Boost.Thread Programs; Debugging OpenMP Programs; DRD and Custom Memory Allocators; DRD Versus Memcheck; Resource Requirements; Hints and Tips for Effective Use of DRD)
- Using the POSIX Threads API Effectively (Mutex types; Condition variables; pthread_cond_timedwait and timeouts)
- Limitations
- Feedback

To use this tool, you must specify --tool=drd on the Valgrind command line.

DRD is a Valgrind tool for detecting errors in multithreaded C and C++ programs. The tool works for any program that uses the POSIX threading primitives or that uses threading concepts built on top of the POSIX threading primitives.

There are two possible reasons for using multithreading in a program:

- To model concurrent activities. Assigning one thread to each activity can be a great simplification compared to multiplexing the states of multiple activities in a single thread. This is why most server software and embedded software is multithreaded.

- To use multiple CPU cores simultaneously for speeding up computations. This is why many High Performance Computing (HPC) applications are multithreaded.

Multithreaded programs can use one or more of the following programming paradigms. Which paradigm is appropriate depends e.g. on the application type. Some examples of multithreaded programming paradigms are:

- Locking. Data that is shared between threads is protected from concurrent accesses via locking. E.g. the POSIX threads library, the Qt library and the Boost.Thread library support this paradigm directly.

- Message passing. No data is shared between threads, but threads exchange data by passing messages to each other. Examples of implementations of the message passing paradigm are MPI and CORBA.

- Automatic parallelization. A compiler converts a sequential program into a multithreaded program. The original program may or may not contain parallelization hints. One example of such parallelization hints is the OpenMP standard. In this standard a set of directives is defined which tell a compiler how to parallelize a C, C++ or Fortran program. OpenMP is well suited for computationally intensive applications. As an example, an open source image processing software package is using OpenMP to maximize performance on systems with multiple CPU cores. GCC supports the OpenMP standard from version 4.2.0 on.

- Software Transactional Memory (STM). Any data that is shared between threads is updated via transactions. After each transaction it is verified whether there were any conflicting transactions. If there were conflicts, the transaction is aborted; otherwise it is committed. This is a so-called optimistic approach. There is a prototype of the Intel C++ Compiler available that supports STM. Research into adding STM support to GCC is ongoing.

DRD supports any combination of multithreaded programming paradigms as long as the implementation of these paradigms is based on the POSIX threads primitives. DRD however does not support programs that use e.g. Linux futexes directly. Attempts to analyze such programs with DRD will cause DRD to report many false positives.
    POSIX threads, also known as Pthreads, is the most widely available threading library on Unix systems. The POSIX threads programming model is based on the following abstractions:

    A shared address space. All threads running within the same process share the same address space. All data, whether shared or not, is identified by its address.

    Regular load and store operations, which allow values to be read from or written to the memory shared by all threads running in the same process.

    Atomic store and load-modify-store operations. While these are not mentioned in the POSIX threads standard, most microprocessors support atomic memory operations.

    Threads. Each thread represents a concurrent activity.

    Synchronization objects and operations on these synchronization objects. The following types of synchronization objects have been defined in the POSIX threads standard: mutexes, condition variables, semaphores, reader-writer synchronization objects, barriers and spinlocks.

    Which source code statements generate which memory accesses depends on the memory model of the programming language being used. There is not yet a definitive memory model for the C and C++ languages. For a draft memory model, see also the document WG21/N2338: Concurrency memory model compiler consequences.

    For more information about POSIX threads, see also the Single UNIX Specification version 3, also known as IEEE Std 1003.1.

    Depending on which multithreading paradigm is being used in a program, one or more of the following problems can occur:

    Data races. One or more threads access the same memory location without sufficient locking. Most but not all data races are programming errors, and they are the cause of subtle and hard-to-find bugs.

    Lock contention. One thread blocks the progress of one or more other threads by holding a lock too long.

    Improper use of the POSIX threads API. Most implementations of the POSIX threads API have been optimized for runtime speed. Such implementations will not complain about certain errors, e.g. when a mutex is unlocked by a thread other than the thread that obtained a lock on the mutex.

    Deadlock. A deadlock occurs when two or more threads wait for each other indefinitely.

    False sharing. If threads that run on different processor cores frequently access different variables located in the same cache line, this will slow down the involved threads a lot due to frequent exchange of cache lines.

    Although the likelihood of the occurrence of data races can be reduced through a disciplined programming style, a tool for automatic detection of data races is a necessity when developing multithreaded software. DRD can detect these, as well as lock contention and improper use of the POSIX threads API.

    The result of load and store operations performed by a multithreaded program depends on the order in which memory operations are performed. This order is determined by:

    All memory operations performed by the same thread are performed in program order, that is, the order determined by the program source code and the results of previous load operations.

    Synchronization operations determine certain ordering constraints on memory operations performed by different threads. These ordering constraints are called the synchronization order.

    The combination of program order and synchronization order is called the happens-before relationship. This concept was first defined by S. Adve et al. in the paper Detecting data races on weak memory systems,
    ACM SIGARCH Computer Architecture News, v. 19 n. 3, p. 234-243, May 1991.

    Two memory operations conflict if both operations are performed by different threads, refer to the same memory location, and at least one of them is a store operation.

    A multithreaded program is data-race free if all conflicting memory accesses are ordered by synchronization operations.

    A well-known way to ensure that a multithreaded program is data-race free is to ensure that a locking discipline is followed. It is e.g. possible to associate a mutex with each shared data item, and to hold a lock on the associated mutex while the shared data is accessed.

    All programs that follow a locking discipline are data-race free, but not all data-race free programs follow a locking discipline. There exist multithreaded programs where access to shared data is arbitrated via condition variables, semaphores or barriers. As an example, a certain class of HPC applications consists of a sequence of computation steps separated in time by barriers, where these barriers are the only means of synchronization. Although there are many conflicting memory accesses in such applications, and although such applications do not make use of mutexes, most of these applications do not contain data races.

    There exist two different approaches for verifying the correctness of multithreaded programs at runtime. The approach of the so-called Eraser algorithm is to verify whether all shared memory accesses follow a consistent locking strategy. Happens-before data race detectors, on the other hand, verify directly whether all interthread memory accesses are ordered by synchronization operations. While the latter approach is more complex to implement, and while it is more sensitive to OS scheduling, it is a general approach that works for all classes of multithreaded programs. An important advantage of happens-before data race detectors is that they do not report any false positives. DRD is based on the happens-before algorithm.

    The following command-line options are available for controlling the behavior of the DRD tool itself:

    --check-stack-var=<yes|no> [default: no]. Controls whether DRD detects data races on stack variables. Verifying stack variables is disabled by default because most programs do not share stack variables over threads.

    --exclusive-threshold=<n> [default: off]. Print an error message if any mutex or writer lock has been held longer than the time specified in milliseconds. This option enables the detection of lock contention.

    --join-list-vol=<n> [default: 10]. Data races that occur between a statement at the end of one thread and another thread can be missed if memory access information is discarded immediately after a thread has been joined. This option allows you to specify for how many joined threads memory access information should be retained.

    --first-race-only=<yes|no> [default: no]. Whether to report only the first data race that has been detected on a memory location, or all data races that have been detected on a memory location.

    --free-is-write=<yes|no> [default: no]. Whether to report races between accessing memory and freeing memory. Enabling this option may cause DRD to run slightly slower. Notes:

    Don't enable this option when using custom memory allocators that use the VG_USERREQ__MALLOCLIKE_BLOCK and VG_USERREQ__FREELIKE_BLOCK client requests, because that would result in false positives.
    Don't enable this option when using reference-counted objects, because that will result in false positives even when the code has been annotated properly with ANNOTATE_HAPPENS_BEFORE and ANNOTATE_HAPPENS_AFTER. See e.g. the output of the following command for an example:

        valgrind --tool=drd --free-is-write=yes drd/tests/annotate_smart_pointer

    --report-signal-unlocked=<yes|no> [default: yes]. Whether to report calls to pthread_cond_signal where the mutex associated with the signal through pthread_cond_wait or pthread_cond_timedwait is not locked at the time the signal is sent. Sending a signal without holding a lock on the associated mutex is a common programming error which can cause subtle race conditions and unpredictable behavior. There exist some uncommon synchronization patterns, however, where it is safe to send a signal without holding a lock on the associated mutex.

    --segment-merging=<yes|no> [default: yes]. Controls segment merging. Segment merging is an algorithm to limit the memory usage of the data race detection algorithm. Disabling segment merging may improve the accuracy of the so-called 'other segments' displayed in race reports, but can also trigger an out-of-memory error.

    --segment-merging-interval=<n> [default: 10]. Perform segment merging only after the specified number of new segments have been created. This is an advanced configuration option that allows you to choose between minimizing DRD's memory usage, by choosing a low value, and letting DRD run faster, by choosing a slightly higher value. The optimal value for this parameter depends on the program being analyzed. The default value works well for most programs.

    --shared-threshold=<n> [default: off]. Print an error message if a reader lock has been held longer than the specified time (in milliseconds).

    --show-confl-seg=<yes|no> [default: yes]. Show conflicting segments in race reports. Since this information can help to find the cause of a data race, this option is enabled by default. Disabling this option makes the output of DRD more compact.

    --show-stack-usage=<yes|no> [default: no]. Print stack usage at thread exit time. When a program creates a large number of threads it becomes important to limit the amount of virtual memory allocated for thread stacks. This option makes it possible to observe how much stack memory has been used by each thread of the client program. Note: the DRD tool itself allocates some temporary data on the client thread stack. The space necessary for this temporary data must be allocated by the client program when it allocates stack memory, but it is not included in the stack usage reported by DRD.

    The following options are available for monitoring the behavior of the client program:

    --trace-addr=<address> [default: none]. Trace all load and store activity for the specified address. This option may be specified more than once.

    --ptrace-addr=<address> [default: none]. Trace all load and store activity for the specified address, and keep doing that even after the memory at that address has been freed and reallocated.

    --trace-alloc=<yes|no> [default: no]. Trace all memory allocations and deallocations. May produce a huge amount of output.

    --trace-barrier=<yes|no> [default: no]. Trace all barrier activity.

    --trace-cond=<yes|no> [default: no]. Trace all condition variable activity.

    --trace-fork-join=<yes|no> [default: no]. Trace all thread creation and all thread termination events.

    --trace-hb=<yes|no> [default: no]. Trace execution of the ANNOTATE_HAPPENS_BEFORE(), ANNOTATE_HAPPENS_AFTER() and ANNOTATE_HAPPENS_DONE() client requests.
    --trace-mutex=<yes|no> [default: no]. Trace all mutex activity.

    --trace-rwlock=<yes|no> [default: no]. Trace all reader-writer lock activity.

    --trace-semaphore=<yes|no> [default: no]. Trace all semaphore activity.

    DRD prints a message every time it detects a data race. Please keep the following in mind when ... thread.

    ANNOTATE_IGNORE_WRITES_BEGIN tells DRD to ignore all memory stores performed by the current thread.

    ANNOTATE_IGNORE_WRITES_END tells DRD to stop ignoring the memory stores performed by the current thread.

    ANNOTATE_IGNORE_READS_AND_WRITES_BEGIN tells DRD to ignore all memory accesses performed by the current thread.

    ANNOTATE_IGNORE_READS_AND_WRITES_END tells DRD to stop ignoring the memory accesses performed by the current thread.

    ANNOTATE_NEW_MEMORY(addr, size) tells DRD that the specified memory range has been allocated by a custom memory allocator in the client program and that the client program will start using this memory range.

    ANNOTATE_THREAD_NAME(name) tells DRD to associate the specified name with the current thread and to include this name in the error messages printed by DRD.

    VALGRIND_MALLOCLIKE_BLOCK and VALGRIND_FREELIKE_BLOCK from the Valgrind core are implemented; they are described in The Client Request mechanism.

    Note: if you compiled Valgrind yourself, the header file valgrind/drd.h will have been installed in the directory /usr/include by the command make install. If you obtained Valgrind by installing it as a package, however, you will probably have to install another package with a name like valgrind-devel before Valgrind's header files are available.

    GNOME applications use the threading primitives provided by the glib and gthread libraries. These libraries are built on top of POSIX threads, and hence are directly supported by DRD. Please keep in mind that you have to call g_thread_init before creating any threads, or DRD will report several data races on glib functions. See also the GLib Reference Manual for more information about g_thread_init.

    One of the many facilities provided by the glib library is a block allocator, called g_slice. You have to disable this block allocator when using DRD by adding the following to the shell environment variables: G_SLICE=always-malloc. See also the GLib Reference Manual for more information.

    The Boost.Thread library is the threading library included with the cross-platform Boost Libraries. This threading library is an early implementation of the upcoming C++0x threading library. Applications that use the Boost.Thread library should run fine under DRD. More information about Boost.Thread can be found here: Anthony Williams, Boost.Thread Library Documentation, Boost website, 2007; What's New in Boost Threads?, Recent changes to the Boost Thread library, Dr. Dobb's Magazine, October 2008.

    OpenMP stands for Open Multi-Processing. The OpenMP standard consists of a set of compiler directives for C, C++ and Fortran programs that allow a compiler to transform a sequential program into a parallel program. OpenMP is well suited for HPC applications and allows working at a higher level compared to direct use of the POSIX threads API. While OpenMP ensures that the POSIX API is used correctly, OpenMP programs can still contain data races. So it definitely makes sense to verify OpenMP programs with a thread checking tool.

    DRD supports OpenMP shared-memory programs generated by GCC. GCC supports OpenMP since version 4.2.0. GCC's runtime support for OpenMP programs is provided by a library called libgomp.
    The synchronization primitives implemented in this library use the Linux futex system call directly, unless the library has been configured with the --disable-linux-futex option. DRD only supports libgomp libraries that have been configured with this option and in which symbol information is present. For most Linux distributions this means that you will have to recompile GCC. See also the script drd/scripts/download-and-build-gcc in the Valgrind source tree for an example of how to compile GCC. You will also have to make sure that the newly compiled libgomp library is loaded when OpenMP programs are started. This is possible by adding a line similar to the following to your shell startup script:

        export LD_LIBRARY_PATH=~/gcc-4.4.0/lib64:~/gcc-4.4.0/lib:

    As an example, the test OpenMP program drd/tests/omp_matinv triggers a data race when the option -r has been specified on the command line. The data race is triggered by the following code:

        #pragma omp parallel for private(j)
        for (j = 0; j < rows; j++)
        {
          if (i != j)
          {
            const elem_t factor = a[j * cols + i];
            for (k = 0; k < cols; k++)
            {
              a[j * cols + k] -= a[i * cols + k] * factor;
            }
          }
        }

    The above code is racy because the variable k has not been declared private. DRD will print the following error message for the above code:

        $ valgrind --tool=drd --check-stack-var=yes --read-var-info=yes drd/tests/omp_matinv 3 -t 2 -r
        ...
        Conflicting store by thread 1/1 at 0x7fefffbc4 size 4
           at 0x4014A0: gj.omp_fn.0 (omp_matinv.c:203)
           by 0x401211: gj (omp_matinv.c:159)
           by 0x40166A: invert_matrix (omp_matinv.c:238)
           by 0x4019B4: main (omp_matinv.c:316)
        Location 0x7fefffbc4 is 0 bytes inside local var "k"
        declared at omp_matinv.c:160, in frame #0 of thread 1.

    In the above output the function name gj.omp_fn.0 has been generated by GCC from the function name gj. The allocation context information shows that the data race has been caused by modifying the variable k.

    Note: for GCC versions before 4.4.0, no allocation context information is shown. With these GCC versions the most usable information in the above output is the source file name and the line number where the data race has been detected (omp_matinv.c:203).

    For more information about OpenMP, see also openmp.org.

    DRD tracks all memory allocation events that happen via the standard memory allocation and deallocation functions (malloc, free, new and delete), via entry and exit of stack frames, or that have been annotated with Valgrind's memory pool client requests. DRD uses memory allocation and deallocation information for two purposes:

    To know where the scope of POSIX objects that have not been destroyed explicitly ends. It is e.g. not required by the POSIX threads standard to call pthread_mutex_destroy before freeing the memory in which a mutex object resides.

    To know where the scope of variables ends. If e.g. heap memory has been used by one thread, that thread frees that memory, and another thread allocates and starts using that memory, no data races must be reported for that memory.

    It is essential for correct operation of DRD that the tool knows about memory allocation and deallocation events. When analyzing a client program with DRD that uses a custom memory allocator, either instrument the custom memory allocator with the VALGRIND_MALLOCLIKE_BLOCK and VALGRIND_FREELIKE_BLOCK macros, or disable the custom memory allocator. A sketch of the first approach follows below.

    As an example, the GNU libstdc++ library can be configured to use standard memory allocation functions instead of memory pools by setting the environment variable GLIBCXX_FORCE_NEW. For more information, see also the libstdc++ manual.
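    To illustrate the instrumentation approach, here is a minimal hedged sketch (the trivial bump allocator and the names my_alloc/my_free are invented for this example; the two macros come from the Valgrind header valgrind/valgrind.h):

        /* Sketch: telling Valgrind tools (including DRD) about a custom
           allocator. No overflow or reclamation logic; illustration only. */
        #include <stddef.h>
        #include <valgrind/valgrind.h>

        static char pool[1 << 20];      /* hypothetical allocation pool */
        static size_t pool_used = 0;

        void *my_alloc(size_t size)
        {
            void *p = &pool[pool_used];
            pool_used += (size + 7) & ~(size_t)7;  /* keep 8-byte alignment */
            /* Report the new block: redzone size 0, contents not zeroed. */
            VALGRIND_MALLOCLIKE_BLOCK(p, size, 0, 0);
            return p;
        }

        void my_free(void *p)
        {
            /* Report the deallocation; the redzone size must match above. */
            VALGRIND_FREELIKE_BLOCK(p, 0);
            /* (a real allocator would reclaim the memory here) */
        }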
    It is essential for correct operation of DRD that there are no memory errors, such as dangling pointers, in the client program. This means that it is a good idea to make sure that your program is Memcheck-clean before you analyze it with DRD. It is possible, however, that some of the Memcheck reports are caused by data races; in that case it makes sense to run DRD before Memcheck.

    So which tool should be run first? In case both DRD and Memcheck complain about a program, a possible approach is to run both tools alternately, and to fix as many errors as possible after each run of each tool, until neither of the two tools prints any more error messages.

    The requirements of DRD with regard to heap and stack memory, and its effect on the execution time of client programs, are as follows:

    When running a program under DRD with the default DRD options, between 1.1 and 3.6 times more memory will be needed compared to a native run of the client program. More memory will be needed if loading debug information has been enabled (--read-var-info=yes).

    DRD allocates some of its temporary data structures on the stack of the client program threads. This amount of data is limited to 1 to 2 KB. Make sure that thread stacks are sufficiently large.

    Most applications will run between 20 and 50 times slower under DRD than a native single-threaded run. The slowdown will be most noticeable for applications that perform frequent mutex lock / unlock operations.

    The following information may be helpful when using DRD:

    Make sure that debug information is present in the executable being analyzed, such that DRD can print function name and line number information in stack traces. Most compilers can be told to include debug information via the compiler option -g.

    Compile with option -O1 instead of -O0. This will reduce the amount of generated code, may reduce the amount of debug info, and will speed up DRD's processing of the client program. For more information, see also Getting started.

    If DRD reports any errors on libraries that are part of your Linux distribution, like e.g. libc.so or libstdc++.so, installing the debug packages for these libraries will make the output of DRD a lot more detailed.

    When using C++, do not send output from more than one thread to std::cout. Doing so would not only generate multiple data race reports, it could also result in output from several threads getting mixed up. Either use printf or do the following:

    Derive a class from std::ostreambuf and let that class send output line by line to stdout. This will avoid individual lines of text produced by different threads getting mixed up.

    Create one instance of std::ostream for each thread. This makes stream formatting settings thread-local. Pass a per-thread instance of the class derived from std::ostreambuf to the constructor of each instance.

    Let each thread send its output to its own instance of std::ostream.

    The Single UNIX Specification version two defines the following four mutex types (see also the documentation of pthread_mutexattr_settype):

    normal, which means that no error checking is performed, and that the mutex is non-recursive.

    error checking, which means that the mutex is non-recursive and that error checking is performed.

    recursive, which means that a mutex may be locked recursively.

    default, which means that error checking behavior is undefined, and that the behavior for recursive locking is also undefined.
    Or: portable code must neither trigger error conditions through the Pthreads API nor attempt to lock a mutex of default type recursively.

    In complex applications it is not always clear beforehand which mutexes will be locked recursively and which will not. Attempts to lock a non-recursive mutex recursively will result in race conditions that are very hard to find without a thread checking tool. So either use the error checking mutex type and consistently check the return value of Pthreads API mutex calls, or use the recursive mutex type.

    A condition variable allows one thread to wake up one or more other threads. Condition variables are often used to notify one or more threads about state changes of shared data. Unfortunately it is very easy to introduce race conditions by using condition variables as the only means of state information propagation. A better approach is to let threads poll for changes of a state variable that is protected by a mutex, and to use condition variables only as a thread wakeup mechanism. See also the source file drd/tests/monitor_example.cpp for an example of how to implement this concept in C++. The monitor concept used in this example is a well-known and very useful concept -- see also Wikipedia for more information about the monitor concept.

    Historically the function pthread_cond_timedwait only allowed the specification of an absolute timeout, that is, a timeout independent of the time when this function was called. However, almost every call to this function expresses a relative timeout. This typically happens by passing the sum of clock_gettime(CLOCK_REALTIME) and a relative timeout as the third argument. This approach is incorrect, since forward or backward clock adjustments by e.g. ntpd will affect the timeout. A more reliable approach is as follows (a minimal sketch appears at the end of this entry):

    When initializing a condition variable through pthread_cond_init, specify that the timeout of pthread_cond_timedwait will use the clock CLOCK_MONOTONIC instead of CLOCK_REALTIME. You can do this via pthread_condattr_setclock(..., CLOCK_MONOTONIC).

    When calling pthread_cond_timedwait, pass the sum of clock_gettime(CLOCK_MONOTONIC) and a relative timeout as the third argument.

    DRD currently has the following limitations:

    DRD, just like Memcheck, will refuse to start on Linux distributions where all symbol information has been removed from ld.so. This is e.g. the case for the PPC editions of openSUSE and Gentoo. You will have to install the glibc debuginfo package on these platforms before you can use DRD. See also openSUSE bug 396197 and Gentoo bug 214065.

    With gcc 4.4.3 and before, DRD may report data races on the C++ class std::string in a multithreaded program. This is a known issue -- see also GCC bug 40518 for more information.

    If you compile the DRD source code yourself, you need GCC 3.0 or later. GCC 2.95 is not supported.

    Of the two POSIX threads implementations for Linux, only the NPTL (Native POSIX Thread Library) is supported. The older LinuxThreads library is not supported.

    If you have any comments, suggestions, feedback or bug reports about DRD, feel free to either post a message on the Valgrind users mailing list or to file a bug report.
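    The CLOCK_MONOTONIC recipe above, as a minimal sketch (error handling omitted; the names state, state_mutex, state_cond and wait_for_state are invented for this example):

        /* Sketch: waiting on a condition variable with a relative timeout
           that is immune to wall-clock adjustments by e.g. ntpd. */
        #include <pthread.h>
        #include <time.h>

        static pthread_mutex_t state_mutex = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t  state_cond;
        static int             state;   /* the shared state being waited on */

        void init_cond(void)
        {
            pthread_condattr_t attr;
            pthread_condattr_init(&attr);
            /* Make pthread_cond_timedwait interpret its timeout argument
               against CLOCK_MONOTONIC instead of CLOCK_REALTIME. */
            pthread_condattr_setclock(&attr, CLOCK_MONOTONIC);
            pthread_cond_init(&state_cond, &attr);
            pthread_condattr_destroy(&attr);
        }

        /* Wait until state becomes nonzero, or until rel_sec seconds pass. */
        int wait_for_state(int rel_sec)
        {
            struct timespec deadline;
            int rc = 0;
            clock_gettime(CLOCK_MONOTONIC, &deadline);  /* now + timeout */
            deadline.tv_sec += rel_sec;
            pthread_mutex_lock(&state_mutex);
            while (state == 0 && rc == 0)
                rc = pthread_cond_timedwait(&state_cond, &state_mutex,
                                            &deadline);
            pthread_mutex_unlock(&state_mutex);
            return rc;   /* 0 on wakeup, ETIMEDOUT on timeout */
        }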

    Original link path: /docs/manual/drd-manual.html

  • Title: Valgrind
    Descriptive info: Massif: a heap profiler. Sections: Using Massif and ms_print; An Example Program; Running Massif; Running ms_print; The Output Preamble; The Output Graph; The Snapshot Details; Measuring All Memory in a Process; Acting on Massif's Information; Massif Command-line Options; Massif Monitor Commands; Massif Client Requests; ms_print Command-line Options; Massif's Output File Format.

    To use this tool, you must specify --tool=massif on the Valgrind command line.

    Massif is a heap profiler. It measures how much heap memory your program uses. This includes both the useful space and the extra bytes allocated for book-keeping and alignment purposes. It can also measure the size of your program's stack(s), although it does not do so by default.

    Heap profiling can help you reduce the amount of memory your program uses. On modern machines with virtual memory, this provides the following benefits:

    It can speed up your program -- a smaller program will interact better with your machine's caches and avoid paging.

    If your program uses lots of memory, it will reduce the chance that it exhausts your machine's swap space.

    Also, there are certain space leaks that aren't detected by traditional leak-checkers, such as Memcheck's. That's because the memory isn't ever actually lost -- a pointer remains to it -- but it's not in use. Programs that have leaks like this can unnecessarily increase the amount of memory they are using over time. Massif can help identify these leaks.

    Importantly, Massif tells you not only how much heap memory your program is using, it also gives very detailed information that indicates which parts of your program are responsible for allocating the heap memory.

    First off, as for the other Valgrind tools, you should compile with debugging info (the -g option). It shouldn't matter much what optimisation level you compile your program with, as this is unlikely to affect the heap memory usage.

    Then, you need to run Massif itself to gather the profiling information, and then run ms_print to present it in a readable way.

    An example will make things clear. Consider the following C program (annotated with line numbers) which allocates a number of different blocks on the heap:

         1      #include <stdlib.h>
         2
         3      void g(void)
         4      {
         5         malloc(4000);
         6      }
         7
         8      void f(void)
         9      {
        10         malloc(2000);
        11         g();
        12      }
        13
        14      int main(void)
        15      {
        16         int i;
        17         int* a[10];
        18
        19         for (i = 0; i < 10; i++) {
        20            a[i] = malloc(1000);
        21         }
        22
        23         f();
        24
        25         g();
        26
        27         for (i = 0; i < 10; i++) {
        28            free(a[i]);
        29         }
        30
        31         return 0;
        32      }

    To gather heap profiling information about the program prog, type:

        valgrind --tool=massif prog

    The program will execute (slowly). Upon completion, no summary statistics are printed to Valgrind's commentary; all of Massif's profiling data is written to a file. By default, this file is called massif.out.<pid>, where <pid> is the process ID, although this filename can be changed with the --massif-out-file option.

    To see the information gathered by Massif in an easy-to-read form, use ms_print. If the output file's name is massif.out.12345, type:

        ms_print massif.out.12345

    ms_print will produce (a) a graph showing the memory consumption over the program's execution, and (b) detailed information about the responsible allocation sites at various points in the program, including the point of peak memory allocation. The use of a separate script for presenting the results is deliberate: it separates the data gathering from its presentation, and means that new methods of presenting the data can be added in the future.
    After running this program under Massif, the first part of ms_print's output contains a preamble which just states how the program, Massif and ms_print were each invoked:

        --------------------------------------------------------------------------------
        Command:            example
        Massif arguments:   (none)
        ms_print arguments: massif.out.12797
        --------------------------------------------------------------------------------

    The next part is the graph that shows how memory consumption occurred as the program executed. [The original page shows an ASCII-art graph here: heap usage in KB, peaking at 19.63 KB, plotted against time in instructions (0 to 113.4 ki), with all the bars crowded at the far right, followed by the legend "Number of snapshots: 25. Detailed snapshots: [9, 14 (peak), 24]".]

    Why is most of the graph empty, with only a couple of bars at the very end? By default, Massif uses "instructions executed" as the unit of time. For very short-run programs such as the example, most of the executed instructions involve the loading and dynamic linking of the program. The execution of main (and thus the heap allocations) only occurs at the very end.

    For a short-running program like this, we can use the --time-unit=B option to specify that we want the time unit to instead be the number of bytes allocated/deallocated on the heap and stack(s). If we re-run the program under Massif with this option, and then re-run ms_print, we get a more useful graph. [The original page shows the resulting ASCII-art graph: the same 19.63 KB peak plotted against bytes allocated (0 to 29.48 KB), with ':' bars for normal snapshots, '@' bars for detailed snapshots and '#' bars for the peak, and the same legend "Number of snapshots: 25. Detailed snapshots: [9, 14 (peak), 24]".]

    The size of the graph can be changed with ms_print's --x and --y options. Each vertical bar represents a snapshot, i.e. a measurement of the memory usage at a certain point in time. If the next snapshot is more than one column away, a horizontal line of characters is drawn from the top of the snapshot to just before the next snapshot column.

    The text at the bottom shows that 25 snapshots were taken for this program, which is one per heap allocation/deallocation, plus a couple of extras. Massif starts by taking snapshots for every heap allocation/deallocation, but as a program runs for longer, it takes snapshots less frequently. It also discards older snapshots as the program goes on; when it reaches the maximum number of snapshots (100 by default, although changeable with the --max-snapshots option) half of them are deleted. This means that a reasonable number of snapshots are always maintained.

    Most snapshots are normal, and only basic information is recorded for them. Normal snapshots are represented in the graph by bars consisting of ':' characters. Some snapshots are detailed. Information about where allocations happened is recorded for these snapshots, as we will see shortly. Detailed snapshots are represented in the graph by bars consisting of '@' characters.
    The text at the bottom shows that 3 detailed snapshots were taken for this program (snapshots 9, 14 and 24). By default, every 10th snapshot is detailed, although this can be changed via the --detailed-freq option.

    Finally, there is at most one peak snapshot. The peak snapshot is a detailed snapshot, and records the point where memory consumption was greatest. The peak snapshot is represented in the graph by a bar consisting of '#' characters. The text at the bottom shows that snapshot 14 was the peak.

    Massif's determination of when the peak occurred can be wrong, for two reasons:

    Peak snapshots are only ever taken after a deallocation happens. This avoids lots of unnecessary peak snapshot recordings (imagine what happens if your program allocates a lot of heap blocks in succession, hitting a new peak every time). But it means that if your program never deallocates any blocks, no peak will be recorded. It also means that if your program does deallocate blocks but later allocates to a higher peak without subsequently deallocating, the reported peak will be too low.

    Even with this behaviour, recording the peak accurately is slow. So by default Massif records a peak whose size is within 1% of the size of the true peak. This inaccuracy in the peak measurement can be changed with the --peak-inaccuracy option.

    [The original page shows another ASCII-art graph here, from an execution of Konqueror, the KDE web browser, peaking at 3.952 MB; it illustrates what graphs for larger programs look like. The archived copy of the passage that follows is truncated.] ... code locations for which memory was allocated and then freed (line 20 in this case, the memory for which was freed on line 28). However, no code location details are given for this entry; by default, Massif only records the details for code locations responsible for more than 1% of useful memory bytes, and ms_print likewise only prints the details for code locations responsible for more than 1%. The entries that do not meet this threshold are aggregated. This avoids filling up the output with large numbers of unimportant entries. The thresholds can be changed with the --threshold option that both Massif and ms_print support.

    If the output file format string (controlled by --massif-out-file) does not contain %p, then the outputs from the parent and child will be intermingled in a single output file, which will almost certainly make it unreadable by ms_print.

    It is worth emphasising that by default Massif measures only heap memory, i.e. memory allocated with malloc, calloc, realloc, memalign, new, new[], and a few other, similar functions. (And it can optionally measure stack memory, of course.) This means it does not directly measure memory allocated with lower-level system calls such as mmap, mremap, and brk. Heap allocation functions such as malloc are built on top of these system calls. For example, when needed, an allocator will typically call mmap to allocate a large chunk of memory, and then hand over pieces of that memory chunk to the client program in response to calls to malloc et al. Massif directly measures only these higher-level malloc et al calls, not the lower-level system calls.

    Furthermore, a client program may use these lower-level system calls directly to allocate memory. By default, Massif does not measure these. Nor does it measure the size of code, data and BSS segments. Therefore, the numbers reported by Massif may be significantly smaller than those reported by tools such as top that measure a program's total size in memory.
    However, if you wish to measure all the memory used by your program, you can use the --pages-as-heap=yes option. When this option is enabled, Massif's normal heap block profiling is replaced by lower-level page profiling. Every page allocated via mmap and similar system calls is treated as a distinct block. This means that code, data and BSS segments are all measured, as they are just memory pages. Even the stack is measured, since it is ultimately allocated (and extended when necessary) via mmap; for this reason --stacks=yes is not allowed in conjunction with --pages-as-heap=yes.

    After --pages-as-heap=yes is used, ms_print's output is mostly unchanged. One difference is that the start of each detailed snapshot says:

        (page allocation syscalls) mmap/mremap/brk, --alloc-fns, etc.

    instead of the usual:

        (heap allocation functions) malloc/new/new[], --alloc-fns, etc.

    The stack traces in the output may be more difficult to read, and interpreting them may require some detailed understanding of the lower levels of a program, like the memory allocators. But for some programs having the full information about memory usage can be very useful.

    Massif's information is generally fairly easy to act upon. The obvious place to start looking is the peak snapshot. It can also be useful to look at the overall shape of the graph, to see if memory usage climbs and falls as you expect; spikes in the graph might be worth investigating.

    The detailed snapshots can get quite large. It is worth viewing them in a very wide window. It's also a good idea to view them with a text editor. That makes it easy to scroll up and down while keeping the cursor in a particular column, which makes following the allocation chains easier.

    Massif-specific command-line options are:

    --heap=<yes|no> [default: yes]. Specifies whether heap profiling should be done.

    --heap-admin=<size> [default: 8]. If heap profiling is enabled, gives the number of administrative bytes per block to use. This should be an estimate of the average, since it may vary. For example, the allocator used by glibc on Linux requires somewhere between 4 and 15 bytes per block, depending on various factors. That allocator also requires admin space for freed blocks, but Massif cannot account for this.

    --stacks=<yes|no> [default: no]. Specifies whether stack profiling should be done. This option slows Massif down greatly, and so is off by default. Note that Massif assumes that the main stack has size zero at start-up. This is not true, but doing otherwise accurately is difficult. Furthermore, starting at zero better indicates the size of the part of the main stack that a user program actually has control over.

    --pages-as-heap=<yes|no> [default: no]. Tells Massif to profile memory at the page level rather than at the malloc'd block level. See above for details.

    --depth=<number> [default: 30]. Maximum depth of the allocation trees recorded for detailed snapshots. Increasing it will make Massif run somewhat more slowly, use more memory, and produce bigger output files.

    --alloc-fn=<name>. Functions specified with this option will be treated as though they were a heap allocation function such as malloc. This is useful for functions that are wrappers to malloc or new, which can fill up the allocation trees with uninteresting information. This option can be specified multiple times on the command line, to name multiple functions. (A sketch of such a wrapper appears at the end of this entry.)

    Note that the named function will only be treated this way if it is the top entry in a stack trace, or just below another function treated this way.
    For example, if you have a function malloc1 that wraps malloc, and another function malloc2 that wraps malloc1, just specifying --alloc-fn=malloc2 will have no effect. You need to specify --alloc-fn=malloc1 as well. This is a little inconvenient, but the reason is that checking for allocation functions is slow, and it saves a lot of time if Massif can stop looking through the stack trace entries as soon as it finds one that doesn't match, rather than having to continue through all the entries.

    Note that C++ names are demangled. Note also that overloaded C++ names must be written in full. Single quotes may be necessary to prevent the shell from breaking them up. For example:

        --alloc-fn='operator new(unsigned, std::nothrow_t const&)'

    --ignore-fn=<name>. Any direct heap allocation (i.e. a call to malloc, new, etc., or a call to a function named by an --alloc-fn option) that occurs in a function specified by this option will be ignored. This is mostly useful for testing purposes. Any realloc of an ignored block will also be ignored, even if the realloc call does not occur in an ignored function. This avoids the possibility of negative heap sizes if ignored blocks are shrunk with realloc. The rules for writing C++ function names are the same as for --alloc-fn above.

    --threshold=<m.n> [default: 1.0]. The significance threshold for heap allocations, as a percentage of total memory size. Allocation tree entries that account for less than this will be aggregated. Note that this should be specified in tandem with ms_print's option of the same name.

    --peak-inaccuracy=<m.n> [default: 1.0]. Massif does not necessarily record the actual global memory allocation peak; by default it records a peak only when the global memory allocation size exceeds the previous peak by at least 1.0%. This is because there can be many local allocation peaks along the way, and doing a detailed snapshot for every one would be expensive and wasteful, as all but one of them will be later discarded. This inaccuracy can be changed (even to 0.0%) via this option, but Massif will run drastically slower as the number approaches zero.

    --time-unit=<i|ms|B> [default: i]. The time unit used for the profiling. There are three possibilities: instructions executed (i), which is good for most cases; real (wallclock) time (ms, i.e. milliseconds), which is sometimes useful; and bytes allocated/deallocated on the heap and/or stack (B), which is useful for very short-run programs, and for testing purposes, because it is the most reproducible across different machines.

    --detailed-freq=<n> [default: 10]. Frequency of detailed snapshots. With --detailed-freq=1, every snapshot is detailed.

    --max-snapshots=<n> [default: 100]. The maximum number of snapshots recorded. If set to N, for all programs except very short-running ones, the final number of snapshots will be between N/2 and N.

    --massif-out-file=<file> [default: massif.out.%p]. Write the profile data to the specified file rather than to the default output file, massif.out.<pid>.

    The Massif tool provides monitor commands handled by the Valgrind gdbserver:

    snapshot [<filename>]. Requests Massif to take a snapshot and save it in the given filename (default massif.vgdb.out).

    detailed_snapshot [<filename>]. Requests Massif to take a detailed snapshot and save it in the given filename (default massif.vgdb.out).

    Massif does not have a massif.h file, but it does implement two of the core client requests: VALGRIND_MALLOCLIKE_BLOCK and VALGRIND_FREELIKE_BLOCK; they are described in The Client Request mechanism.

    ms_print's options are:

    -h --help. Show the help message.

    --version. Show the version number.

    --threshold=<m.n> [default: 1.0]. Same as Massif's --threshold option, but applied after profiling rather than during.

    --x=<4..1000> [default: 72].
    Width of the graph, in columns.

    --y=<4..1000> [default: 20]. Height of the graph, in rows.

    Massif's file format is plain text (i.e. not binary) and deliberately easy to read for both humans and machines. Nonetheless, the exact format is not described here. This is because the format is currently very Massif-specific. In the future we hope to make the format more general, and thus suitable for possible use with other tools. Once this has been done, the format will be documented here.
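    As a sketch of the kind of wrapper --alloc-fn is meant for (the name xmalloc is invented for this example, and the invocation shown is simply the option applied to that wrapper):

        /* A typical allocation wrapper: without --alloc-fn=xmalloc, every
           allocation in the program is attributed to this one function,
           which makes Massif's allocation trees uninteresting. */
        #include <stdio.h>
        #include <stdlib.h>

        void *xmalloc(size_t size)      /* hypothetical wrapper name */
        {
            void *p = malloc(size);
            if (p == NULL) {
                fprintf(stderr, "out of memory\n");
                exit(EXIT_FAILURE);
            }
            return p;
        }

    Running valgrind --tool=massif --alloc-fn=xmalloc ./prog would then attribute the memory to xmalloc's callers instead of to xmalloc itself.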

    Original link path: /docs/manual/ms-manual.html

  • Title: Valgrind
    Descriptive info: DHAT: a dynamic heap analysis tool. Sections: Understanding DHAT's output; Interpreting the max-live, tot-alloc and deaths fields; Interpreting the acc-ratios fields; Interpreting "Aggregated access counts by offset" data; DHAT Command-line Options.

    To use this tool, you must specify --tool=exp-dhat on the Valgrind command line.

    DHAT is a tool for examining how programs use their heap allocations. It tracks the allocated blocks, and inspects every memory access to find which block, if any, it is to. The following data is collected and presented per allocation point (allocation stack):

    total allocation (number of bytes and blocks);

    maximum live volume (number of bytes and blocks);

    average block lifetime (number of instructions between allocation and freeing);

    average number of reads and writes to each byte in the block ("access ratios");

    for allocation points which always allocate blocks only of one size, and that size is 4096 bytes or less: counts showing how often each byte offset inside the block is accessed.

    Using these statistics it is possible to identify allocation points with the following characteristics:

    potential process-lifetime leaks: blocks allocated by the point just accumulate, and are freed only at the end of the run;

    excessive turnover: points which chew through a lot of heap, even if it is not held onto for very long;

    excessively transient: points which allocate very short lived blocks;

    useless or underused allocations: blocks which are allocated but not completely filled in, or are filled in but not subsequently read;

    blocks with inefficient layout -- areas never accessed, or with hot fields scattered throughout the block.

    As with the Massif heap profiler, DHAT measures program progress by counting instructions, and so presents all age/time related figures as instruction counts. This sounds a little odd at first, but it makes runs repeatable in a way which is not possible if CPU time is used.

    DHAT provides a lot of useful information on dynamic heap usage. Most of the art of using it is in interpretation of the resulting numbers. That is best illustrated via a set of examples.

    A simple example:

        ======== SUMMARY STATISTICS ========

        guest_insns:  1,045,339,534
        [...]
        max-live:     63,490 in 984 blocks
        tot-alloc:    1,904,700 in 29,520 blocks (avg size 64.52)
        deaths:       29,520, at avg age 22,227,424
        acc-ratios:   6.37 rd, 1.14 wr  (12,141,526 b-read, 2,174,460 b-written)
           at 0x4C275B8: malloc (vg_replace_malloc.c:236)
           by 0x40350E: tcc_malloc (tinycc.c:6712)
           by 0x404580: tok_alloc_new (tinycc.c:7151)
           by 0x40870A: next_nomacro1 (tinycc.c:9305)

    Over the entire run of the program, this stack (allocation point) allocated 29,520 blocks in total, containing 1,904,700 bytes in total. By looking at the max-live data, we see that not many blocks were simultaneously live, though: at the peak, there were 63,490 allocated bytes in 984 blocks. This tells us that the program is steadily freeing such blocks as it runs, rather than hanging on to all of them until the end and freeing them all.

    The deaths entry tells us that 29,520 blocks allocated by this stack died (were freed) during the run of the program. Since 29,520 is also the number of blocks allocated in total, that tells us that all allocated blocks were freed by the end of the program. It also tells us that the average age at death was 22,227,424 instructions. From the summary statistics we see that the program ran for 1,045,339,534 instructions, and so the average age at death is about 2% of the program's total run time.
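    As a hedged illustration of the "steadily freeing" pattern described above (not taken from the manual; process_token, token_buf and the loop count are invented), an allocation point like the following would show a large tot-alloc, a small max-live, and deaths equal to the total block count:

        /* Each iteration allocates a short-lived scratch buffer and frees
           it immediately: high turnover, low peak, young age at death. */
        #include <stdlib.h>
        #include <string.h>

        static void process_token(const char *tok)
        {
            size_t n = strlen(tok) + 1;
            char *token_buf = malloc(n);   /* hypothetical allocation point */
            if (token_buf == NULL)
                return;
            memcpy(token_buf, tok, n);
            /* ... transient work on token_buf ... */
            free(token_buf);               /* block dies young */
        }

        int main(void)
        {
            for (int i = 0; i < 100000; i++)
                process_token("example");
            return 0;
        }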
    Example of a potential process-lifetime leak. This next example (from a different program than the above) shows a potential process-lifetime leak. A process-lifetime leak occurs when a program keeps allocating data, but only frees the data just before it exits. Hence the program's heap grows constantly in size, yet Memcheck reports no leak, because the program has freed up everything at exit. This is particularly a hazard for long running programs.

        ======== SUMMARY STATISTICS ========

        guest_insns:  418,901,537
        [...]
        max-live:     32,512 in 254 blocks
        tot-alloc:    32,512 in 254 blocks (avg size 128.00)
        deaths:       254, at avg age 300,467,389
        acc-ratios:   0.26 rd, 0.20 wr  (8,756 b-read, 6,604 b-written)
           at 0x4C275B8: malloc (vg_replace_malloc.c:236)
           by 0x4C27632: realloc (vg_replace_malloc.c:525)
           by 0x56FF41D: QtFontStyle::pixelSize(unsigned short, bool) (qfontdatabase.cpp:269)
           by 0x5700D69: loadFontConfig() (qfontdatabase_x11.cpp:1146)

    There are two tell-tale signs that this might be a process-lifetime leak. Firstly, the max-live and tot-alloc ... its destructor will read from it. So the block's read and write ratios will be non-zero even if the object, once constructed, is never used, but only eventually destructed.

    Really, what we want is to measure only memory accesses in between the end of an object's construction and the start of its destruction. Unfortunately I do not know of a reliable way to determine when those transitions are made.

    For allocation points that always allocate blocks of the same size, and which are 4096 bytes or smaller, DHAT counts accesses per offset, for example:

        max-live:     317,408 in 5,668 blocks
        tot-alloc:    317,408 in 5,668 blocks (avg size 56.00)
        deaths:       5,668, at avg age 622,890,597
        acc-ratios:   1.03 rd, 1.28 wr  (327,642 b-read, 408,172 b-written)
           at 0x4C275B8: malloc (vg_replace_malloc.c:236)
           by 0x5440C16: QDesignerPropertySheetPrivate::ensureInfo (qhash.h:515)
           by 0x544350B: QDesignerPropertySheet::setVisible (qdesigner_propertysh...)
           by 0x5446232: QDesignerPropertySheet::QDesignerPropertySheet (qdesigne...)

        Aggregated access counts by offset:

        [   0]  28782 28782 28782 28782 28782 28782 28782 28782
        [   8]  20638 20638 20638 20638 0 0 0 0
        [  16]  22738 22738 22738 22738 22738 22738 22738 22738
        [  24]  6013 6013 6013 6013 6013 6013 6013 6013
        [  32]  18883 18883 18883 37422 0 0 0 0
        [  40]  5668 11915 5668 5668 11336 11336 11336 11336
        [  48]  6166 6166 6166 6166 0 0 0 0

    This is fairly typical for C++ code running on a 64-bit platform. Here, we have aggregated access statistics for 5668 blocks, all of size 56 bytes. Each byte has been accessed at least 5668 times, except for offsets 12--15, 36--39 and 52--55. These are likely to be alignment holes.

    Careful interpretation of the numbers reveals useful information. Groups of N consecutive identical numbers that begin at an N-aligned offset, for N being 2, 4 or 8, are likely to indicate an N-byte object in the structure at that point. For example, the first 32 bytes of this object are likely to have the layout:

        [0 ]  64-bit type
        [8 ]  32-bit type
        [12]  32-bit alignment hole
        [16]  64-bit type
        [24]  64-bit type

    As a counterexample, it's also clear that, whatever is at offset 32, it is not a 32-bit value. That's because the last number of the group (37422) is not the same as the first three (18883 18883 18883).

    This example leads one to enquire (by reading the source code) whether the zeroes at 12--15 and 52--55 are alignment holes, and whether 48--51 is indeed a 32-bit type.
    If so, it might be possible to place what's at 48--51 at 12--15 instead, which would reduce the object size from 56 to 48 bytes. (A sketch of this kind of field reordering appears at the end of this entry.)

    Bear in mind that the above inferences are all only "maybes". That's because they are based on dynamic data, not static analysis of the object layout. For example, the zeroes might not be alignment holes, but rather just parts of the structure which were not used at all for this particular run. Experience shows that's unlikely to be the case, but it could happen.

    DHAT-specific command-line options are:

    --show-top-n=<number> [default: 10]. At the end of the run, DHAT sorts the accumulated allocation points according to some metric, and shows the highest scoring entries. --show-top-n controls how many entries are shown. The default of 10 is quite small. For realistic applications you will probably need to set it much higher, at least several hundred.

    --sort-by=<string> [default: max-bytes-live]. --sort-by selects the metric used for sorting:

    max-bytes-live: maximum live bytes [default];

    tot-bytes-allocd: total allocation (turnover);

    max-blocks-live: maximum live blocks.

    This controls the order in which allocation points are displayed. You can choose to look at allocation points with the highest maximum liveness, or the highest total turnover, or the highest number of live blocks. These give usefully different pictures of program behaviour. For example, sorting by maximum live blocks tends to show up allocation points creating large numbers of small objects.

    One important point to note is that each allocation stack counts as a separate allocation point. Because stacks by default have 12 frames, this tends to spread data out over multiple allocation points. You may want to use the flag --num-callers=4, or some such small number, to reduce the spreading.
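    The field-reordering idea above can be illustrated with a hedged C sketch (the struct and field names are invented; the real layout of the profiled object is unknown):

        /* Sketch of the reordering suggested by the access counts: moving
           the 32-bit field at offset 48 into the alignment hole at offset
           12 shrinks the struct from 56 to 48 bytes on an LP64 platform. */
        #include <stdio.h>
        #include <stdint.h>

        struct before {           /* inferred 56-byte layout */
            uint64_t a;           /* [ 0] 64-bit type */
            uint32_t b;           /* [ 8] 32-bit type */
                                  /* [12] 32-bit alignment hole */
            uint64_t c;           /* [16] 64-bit type */
            uint64_t d;           /* [24] 64-bit type */
            uint8_t  e[4];        /* [32] byte-sized fields (counts differ) */
                                  /* [36] alignment hole */
            uint64_t f;           /* [40] 64-bit type */
            uint32_t g;           /* [48] 32-bit type */
                                  /* [52] alignment hole */
        };

        struct after {            /* g moved into the hole at offset 12 */
            uint64_t a;
            uint32_t b;
            uint32_t g;           /* fills the former hole */
            uint64_t c;
            uint64_t d;
            uint8_t  e[4];
            uint64_t f;
        };

        int main(void)
        {
            printf("%zu -> %zu bytes\n",
                   sizeof(struct before), sizeof(struct after));
            return 0;   /* typically prints "56 -> 48 bytes" */
        }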

    Original link path: /docs/manual/dh-manual.html

  • Title: Valgrind
    Descriptive info: SGCheck: an experimental stack and global array overrun detector. Sections: Overview; SGCheck Command-line Options; How SGCheck Works; Comparison with Memcheck; Still To Do: User-visible Functionality; Still To Do: Implementation Tidying.

    To use this tool, you must specify --tool=exp-sgcheck on the Valgrind command line.

    SGCheck is a tool for finding overruns of stack and global arrays. It works by using a heuristic approach derived from an observation about the likely forms of stack and global array accesses.

    There are no SGCheck-specific command-line options at present.

    When a source file is compiled with -g, the compiler attaches DWARF3 debugging information which describes the location of all stack and global arrays in the file. Checking of accesses to such arrays would then be relatively simple, if the compiler could also tell us which array (if any) each memory referencing instruction was supposed to access. Unfortunately the DWARF3 debugging format does not provide a way to represent such information, so we have to resort to a heuristic technique to approximate it. The key observation is that if a memory referencing instruction accesses inside a stack or global array once, then it is highly likely to always access that same array.

    To see how this might be useful, consider the following buggy fragment:

        {
           int i, a[10];  // both are auto vars
           for (i = 0; i <= 10; i++)
              a[i] = 42;
        }

    At run time we will know the precise address of a[] on the stack, and so we can observe that the first store resulting from a[i] = 42 writes a[], and we will (correctly) assume that that instruction is intended always to access a[]. Then, on the 11th iteration, it accesses somewhere else, possibly a different local, possibly an un-accounted for area of the stack (e.g. a spill slot), so SGCheck reports an error.

    There is an important caveat. Imagine a function such as memcpy, which is used to read and write many different areas of memory over the lifetime of the program. If we insist that the read and write instructions in its memory copying loop only ever access one particular stack or global variable, we will be flooded with errors resulting from calls to memcpy. To avoid this problem, SGCheck instantiates fresh likely-target records for each entry to a function, and discards them on exit. This allows detection of cases where (e.g.) memcpy overflows its source or destination buffers for any specific call, but does not carry any restriction from one call to the next. Indeed, multiple threads may make multiple simultaneous calls to (e.g.) memcpy without mutual interference.

    SGCheck and Memcheck are complementary: their capabilities do not overlap. Memcheck performs bounds checks and use-after-free checks for heap arrays. It also finds uses of uninitialised values created by heap or stack allocations. But it does not perform bounds checking for stack or global arrays. SGCheck, on the other hand, does do bounds checking for stack or global arrays, but it doesn't do anything else.

    This is an experimental tool, which relies rather too heavily on some not-as-robust-as-I-would-like assumptions about the behaviour of correct programs. There are a number of limitations which you should be aware of.

    False negatives (missed errors): it follows from the description above that the first access by a memory referencing instruction to a stack or global ... on a 2.4 GHz Core 2 machine. Reading this information also requires a lot of memory.
    To make it viable, SGCheck goes to considerable trouble to compress the in-memory representation of the DWARF3 data, which is why the process of reading it appears slow.

    Performance: SGCheck runs slower than Memcheck. This is partly due to a lack of tuning, but partly due to algorithmic difficulties. The stack and global checks can sometimes require a number of range checks per memory access, and these are difficult to short-circuit, despite considerable efforts having been made. A redesign and reimplementation could potentially make it much faster.

    Coverage: stack and global checking is fragile. If a shared object does not have debug information attached, then SGCheck will not be able to determine the bounds of any stack or global arrays defined within that shared object, and so will not be able to check accesses to them. This is true even when those arrays are accessed from some other shared object which was compiled with debug info. At the moment SGCheck accepts objects lacking debuginfo without comment. This is dangerous as it causes SGCheck to silently skip stack and global checking for such objects. It would be better to print a warning in such circumstances.

    Coverage: SGCheck does not check whether the areas read or written by system calls overrun stack or global arrays. This would be easy to add.

    Platforms: the stack/global checks won't work properly on PowerPC, ARM or S390X platforms, only on X86 and AMD64 targets. That's because the stack and global checking requires tracking function calls and exits reliably, and there's no obvious way to do it on ABIs that use a link register for function returns.

    Robustness: related to the previous point. Function call/exit tracking for X86 and AMD64 is believed to work properly even in the presence of longjmps within the same stack (although this has not been tested). However, code which switches stacks is likely to cause breakage/chaos.

    Still to do, user-visible functionality:

    Extend system call checking to work on stack and global arrays.

    Print a warning if a shared object does not have debug info attached, or if, for whatever reason, debug info could not be found or read.

    Add some heuristic filtering that removes obvious false positives. This would be easy to do. For example, an access transition from a heap to a stack object almost certainly isn't a bug and so should not be reported to the user.

    Still to do, implementation tidying (items marked CRITICAL are considered important for correctness: non-fixage of them is liable to lead to crashes or assertion failures in real use):

    sg_main.c: redesign and reimplement the basic checking algorithm. It could be done much faster than it is -- the current implementation isn't very good.

    sg_main.c: improve the performance of the stack / global checks by doing some up-front filtering to ignore references in areas which "obviously" can't be stack or globals. This will require using information that m_aspacemgr knows about the address space layout.

    sg_main.c: fix compute_II_hash to make it a bit more sensible for ppc32/64 targets (except that sg_ doesn't work on ppc32/64 targets, so this is a bit academic at the moment).
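    For reference, here is a contrived sketch of the class of bug SGCheck targets (an invented example, not from the manual): a global array overrun that Memcheck would not flag, because no heap memory is involved.

        /* A deliberate off-by-one overrun of a global array. The bounds of
           'table' are known to SGCheck from the DWARF3 debug info, so the
           write of table[16] is the kind of access its heuristic catches. */
        #include <stdio.h>

        static int table[16];          /* global array */

        int main(void)
        {
            int i, sum = 0;
            for (i = 0; i <= 16; i++)  /* bug: writes one past the end */
                table[i] = i;
            for (i = 0; i < 16; i++)
                sum += table[i];
            printf("%d\n", sum);
            return 0;
        }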

    Original link path: /docs/manual/sg-manual.html

  • Title: Valgrind
    Descriptive info: BBV: an experimental basic block vector generation tool
    Using Basic Block Vectors to create SimPoints
    BBV Command-line Options
    Basic Block Vector File Format
    Implementation
    Threaded Executable Support
    Validation
    Performance

To use this tool, you must specify --tool=exp-bbv on the Valgrind command line.

A basic block is a linear section of code with one entry point and one exit point. A basic block vector (BBV) is a list of all basic blocks entered during program execution, and a count of how many times each basic block was run. BBV is a tool that generates basic block vectors for use with the SimPoint analysis tool.

The SimPoint methodology enables speeding up architectural simulations by only running a small portion of a program and then extrapolating total behavior from this small portion. Most programs exhibit phase-based behavior, which means that at various times during execution a program will encounter intervals of time where the code behaves similarly to a previous interval. If you can detect these intervals and group them together, an approximation of the total program behavior can be obtained by only simulating a bare minimum number of intervals, and then scaling the results.

In computer architecture research, running a benchmark on a cycle-accurate simulator can cause slowdowns on the order of 1000 times, making it take days, weeks, or even longer to run full benchmarks. By utilizing SimPoint this can be reduced significantly, usually by 90-95%, while still retaining reasonable accuracy. A more complete introduction to how SimPoint works can be found in the paper "Automatically Characterizing Large Scale Program Behavior" by T. Sherwood, E. Perelman, G. Hamerly, and B. Calder.

To quickly create a basic block vector file, you will call Valgrind like this:

    valgrind --tool=exp-bbv /bin/ls

In this case we are running on /bin/ls, but this can be any program. By default a file called bb.out.PID will be created, where PID is replaced by the process ID of the running process. This file contains the basic block vector. For long-running programs this file can be quite large, so it might be wise to compress it with gzip or some other compression program.

To create actual SimPoint results, you will need the SimPoint utility, available from the SimPoint webpage. Assuming you have downloaded SimPoint 3.2 and compiled it, create SimPoint results with a command like the following:

    ./SimPoint.3.2/bin/simpoint -inputVectorsGzipped \
        -loadFVFile bb.out.1234.gz \
        -k 5 -saveSimpoints results.simpts \
        -saveSimpointWeights results.weights

where bb.out.1234.gz is your compressed basic block vector file generated by BBV. The SimPoint utility does random linear projection using 15 dimensions, then does k-means clustering to calculate which intervals are of interest. In this example we specify 5 intervals with the -k 5 option.

The outputs from the SimPoint run are the results.simpts and results.weights files. The first holds the 5 most relevant intervals of the program. The second holds the weight to scale each interval by when extrapolating full-program behavior. The intervals and the weights can be used in conjunction with a simulator that supports fast-forwarding; you fast-forward to the interval of interest, collect stats for the desired interval length, then use the statistics gathered in conjunction with the weights to calculate your results.

BBV-specific command-line options are:

--bb-out-file=<name> [default: bb.out.%p]
This option selects the name of the basic block vector file.
--pc-out-file=<name> [default: pc.out.%p]
This option selects the name of the PC file. This file holds program counter addresses and function name info for the various basic blocks. This can be used in conjunction with the basic block vector file to fast-forward via function names instead of just instruction counts.

--interval-size=<number> [default: 100000000]
This option selects the size of the interval to use. The default is 100 million instructions, which is a commonly used value. Other sizes can be used; smaller intervals can help programs with finer-grained phases. However, smaller interval sizes can lead to accuracy issues due to warm-up effects (when fast-forwarding, the various architectural features will be ... instructions are instrumented. This is slower (by approximately a factor of two) than a method that instruments at the basic block level, but there are some complications (especially with rep prefix detection) that make that method more difficult.

Valgrind actually provides instrumentation at a superblock level. A superblock has one entry point but, unlike a basic block, can have multiple exit points. Once a branch occurs into the middle of a block, it is split into a new basic block. Because Valgrind cannot produce "true" basic blocks, the generated BBV vectors will be different than those generated by other tools. In practice this does not seem to affect the accuracy of the SimPoint results. We do internally force the --vex-guest-chase-thresh=0 option to Valgrind, which forces a more basic-block-like behavior.

When a superblock is run for the first time, it is instrumented with our BBV routine. A block info (bbInfo) structure is allocated which holds the various information and statistics for the block. A unique block ID is assigned to the block, and then the structure is placed into an ordered set. Then each native instruction in the block is instrumented to call an instruction counting routine with a pointer to the block info structure as an argument.

At run-time, our instruction counting routines are called once per native instruction. The relevant block info structure is accessed and the block count and total instruction count are updated. If the total instruction count overflows the interval size then we walk the ordered set, writing out the statistics for any block that was accessed in the interval, then resetting the block counters to zero. (A simplified sketch of this counting scheme appears at the end of this section.)

On the x86 and amd64 architectures the counting code has extra code to handle rep-prefixed string instructions. This is because actual hardware counts a rep-prefixed instruction as one instruction, while a naive Valgrind implementation would count it as many (possibly hundreds, thousands, or even millions of) instructions. We handle rep-prefixed instructions specially, in order to make the results match those obtained with hardware performance counters.

BBV also counts the fldcw instruction. This instruction is used on x86 machines in various ways; it is most commonly found when converting floating point values into integers. On Pentium 4 systems the retired instruction performance counter counts this instruction as two instructions (all other known processors only count it as one). This can affect results when using SimPoint on Pentium 4 systems. We provide the fldcw count so that users can evaluate whether it will impact their results enough to avoid using Pentium 4 machines for their experiments.
It would be possible to add an option to this tool that mimics the double-counting, so that the generated BBV files would be usable for experiments using hardware performance counters on Pentium 4 systems.

BBV supports threaded programs. When a program has multiple threads, an additional basic block vector file is created for each thread (each additional file is the specified filename with the thread number appended at the end). There is no official method of using SimPoint with threaded workloads. The most common method is to run SimPoint on each thread's results independently, and use some method of deterministic execution to try to match the original workload. This should be possible with the current BBV.

BBV has been tested on x86, amd64, and ppc32 platforms. An earlier version of BBV was tested in detail using hardware performance counters; this work is described in a paper from the HiPEAC'08 conference, "Using Dynamic Binary Instrumentation to Generate Multi-Platform SimPoints: Methodology and Accuracy" by V. M. Weaver and S. A. McKee.

Using this tool slows down execution by roughly a factor of 40 over native execution. This varies depending on the machine used and the benchmark being run. On the SPEC CPU 2000 benchmarks running on a 3.4GHz Pentium D processor, the slowdown ranges from 24x (mcf) to 340x (vortex.2).
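To make the interval mechanism concrete, here is a minimal sketch, in ordinary C, of the counting scheme described in the implementation discussion above. It is not the tool's source: the names (BBInfo, blocks, INTERVAL_SIZE, count_instruction, dump_interval) and the fixed-size table are illustrative, the real tool keeps its records in an ordered set, and the "T :id:count" output lines only loosely imitate the frequency-vector records that BBV writes.

    #include <stdio.h>

    #define N_BLOCKS      4
    #define INTERVAL_SIZE 20ULL   /* tiny interval so this demo dumps often;
                                     the tool defaults to 100 million */

    typedef struct {
        unsigned long long id;     /* unique block ID, assigned on first run */
        unsigned long long count;  /* instructions executed in this interval */
    } BBInfo;

    static BBInfo blocks[N_BLOCKS];
    static unsigned long long total_instrs = 0;

    /* Walk the block records, emit a record for every block that ran
       during the interval, and reset the per-interval counters. */
    static void dump_interval(FILE *out)
    {
        fprintf(out, "T");
        for (int i = 0; i < N_BLOCKS; i++) {
            if (blocks[i].count != 0) {
                fprintf(out, " :%llu:%llu", blocks[i].id, blocks[i].count);
                blocks[i].count = 0;
            }
        }
        fprintf(out, "\n");
    }

    /* The instrumentation calls something like this once per guest
       instruction, passing the info record of the enclosing block. */
    static void count_instruction(BBInfo *bb)
    {
        bb->count++;
        if (++total_instrs >= INTERVAL_SIZE) {
            dump_interval(stdout);
            total_instrs = 0;
        }
    }

    int main(void)
    {
        for (int i = 0; i < N_BLOCKS; i++)
            blocks[i].id = (unsigned long long)(i + 1);

        /* Simulate a run that cycles through three of the four blocks. */
        for (int n = 0; n < 100; n++)
            count_instruction(&blocks[n % 3]);
        return 0;
    }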

    Original link path: /docs/manual/bbv-manual.html

  • Title: Valgrind
    Descriptive info: Lackey: an example tool
    Lackey Command-line Options

To use this tool, you must specify --tool=lackey on the Valgrind command line.

Lackey is a simple Valgrind tool that does various kinds of basic program measurement. It adds quite a lot of simple instrumentation to the program's code. It is primarily intended to be of use as an example tool, and consequently emphasises clarity of implementation over performance.

Lackey-specific command-line options are:

--basic-counts=<no|yes> [default: yes]
When enabled, Lackey prints the following statistics and information about the execution of the client program:

- The number of calls to the function specified by the --fnname option (the default is main). If the program has had its symbols stripped, the count will always be zero.
- The number of conditional branches encountered and the number and proportion of those taken.
- The number of superblocks entered and completed by the program. Note that due to optimisations done by the JIT, this is not at all an accurate value.
- The number ... and ALU operations, differentiated by their IR types. The IR types are identified by their IR name ("I1", "I8", ... "I128", "F32", "F64", and "V128").

--trace-mem=<no|yes> [default: no]
When enabled, Lackey prints the size and address of almost every memory access made by the program. See the comments at the top of the file lackey/lk_main.c for details about the output format, how it works, and inaccuracies in the address trace. Note that this option produces immense amounts of output.

--trace-superblocks=<no|yes> [default: no]
When enabled, Lackey prints out the address of every superblock (a single entry, multiple exit, linear chunk of code) executed by the program. This is primarily of interest to Valgrind developers. See the comments at the top of the file lackey/lk_main.c for details about the output format. Note that this option produces large amounts of output.

--fnname=<name> [default: main]
Changes the function for which calls are counted when --basic-counts=yes is specified.
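As a worked example of --fnname, consider the following small program; the file name demo.c and the function name step are hypothetical, and the counts Lackey prints will depend on your toolchain.

    /* demo.c -- a function whose call count Lackey can report */
    #include <stdio.h>

    int step(int x)
    {
        return x + 1;
    }

    int main(void)
    {
        int v = 0;
        for (int i = 0; i < 1000; i++)
            v = step(v);
        printf("%d\n", v);
        return 0;
    }

Compile it without optimisation (so the calls to step are not inlined away) and ask Lackey to count them:

    % gcc -O0 -g -o demo demo.c
    % valgrind --tool=lackey --basic-counts=yes --fnname=step ./demo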

    Original link path: /docs/manual/lk-manual.html

  • Title: Valgrind
    Descriptive info: Nulgrind: the minimal Valgrind tool

To use this tool, you must specify --tool=none on the Valgrind command line.

Nulgrind is the simplest possible Valgrind tool. It performs no instrumentation or analysis of a program, just runs it normally. It is mainly of use for Valgrind's developers for debugging and regression testing. Nonetheless you can run programs with Nulgrind. They will run roughly 5 times more slowly than normal, for no useful effect. Note that you need to use the option --tool=none to run Nulgrind (ie. not --tool=nulgrind).

    Original link path: /docs/manual/nl-manual.html

  • Title: Valgrind
    Descriptive info: COMPILE TIME
    Debugging Memory Problems
    by Steve Best
    May 2003
    Courtesy of Linux Magazine

Dynamic memory allocation seems straightforward enough: you allocate memory on demand -- using malloc() or one of its variants -- and free memory when it's no longer needed. Indeed, memory management would be that easy -- if only we programmers never made mistakes. Alas, we do make mistakes (from time to time) and memory management problems do occur.

For example, a memory leak occurs when memory is allocated but never freed. Leaks can obviously be caused if you malloc() without a corresponding free(), but leaks can also be inadvertently caused if a pointer to dynamically-allocated memory is deleted, lost, or overwritten. Memory corruption can occur when allocated (and in use) memory is overwritten accidentally. Buffer overruns -- caused by writing past the end of a block of allocated memory -- frequently corrupt memory.

Regardless of the root cause, memory management errors can have unexpected, even devastating effects on application and system behavior. With dwindling available memory, processes and entire systems can grind to a halt, while corrupted memory often leads to spurious crashes. System security is also susceptible to buffer overruns. Worse, it might take days before evidence of a real problem appears. Today, it's common for Linux systems to have a gigabyte of main memory. If a program leaks a small amount of memory, it'll take some time before the application and system show symptoms of a problem. Memory management errors can be quite insidious and very difficult to find and fix.

This month, let's take a look at Valgrind, a tool that can help detect common memory management errors. We'll review the basics, write some "buggy" code, and then use Valgrind to find the mistakes. Valgrind was written by Julian Seward and is available under the GNU Public License.

Dynamic Memory Functions

Of all of the library calls in Linux, only four manage memory: malloc(), calloc(), realloc(), and free(). All of these functions have prototypes in the stdlib.h include file.

malloc() allocates a memory block. Its prototype is void* malloc(size_t size), and the single argument is the number of bytes of memory to allocate. If the allocation is successful, malloc() returns a pointer to the memory. If memory allocation fails for some reason (for example, if the system is out of memory), malloc() returns a NULL pointer.

calloc() allocates an array in memory and initializes all of the memory to zero (with malloc(), the allocated memory is uninitialized). void* calloc(size_t nmemb, size_t size) is the prototype. The first argument is the number of elements in the array and the second argument is the size (in bytes) of each element. Like malloc(), calloc() returns a pointer if the memory allocation was successful, and NULL otherwise.

realloc() is defined as void* realloc(void *ptr, size_t size). realloc() changes the size of the object referenced by the pointer to a new size specified by the second argument. realloc() returns a pointer to the moved block of memory.

free() deallocates a memory block. It takes a pointer as an argument, as shown in its prototype, void free(void *ptr), and releases that memory.
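To tie the four functions together, here is a small, self-contained example that exercises each of them; the variable names are illustrative, not from the article's sample programs.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        char *buf = malloc(16);            /* 16 uninitialized bytes */
        if (buf == NULL)
            return 1;
        strcpy(buf, "hello");

        char *bigger = realloc(buf, 64);   /* grow the block; contents kept */
        if (bigger == NULL) {
            free(buf);
            return 1;
        }
        buf = bigger;
        strcat(buf, ", world");

        int *counts = calloc(10, sizeof(int));  /* 10 ints, all zeroed */
        if (counts == NULL) {
            free(buf);
            return 1;
        }

        printf("%s %d\n", buf, counts[0]);

        free(buf);                         /* forgetting either free() is   */
        free(counts);                      /* exactly the leak Valgrind finds */
        return 0;
    }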
While the API for memory management is unusually small, the number and kind of memory errors that can occur is quite substantial, including reading and using uninitialized memory; reading/writing memory after it has been freed; reading/writing from memory past the allocated size; reading/writing inappropriate areas on the stack; and memory leaks. Luckily, Valgrind can detect all of those problems. When a program is run under Valgrind, all memory reads and writes are inspected and all calls to malloc()/new and free()/delete are intercepted.

Installing Valgrind

Valgrind is closely tied to the architecture of the operating system. Currently, it's only supported on Linux x86 machines with kernels from the 2.2.x and 2.4.x series and glibc 2.1.x or 2.2.x. You can get the source for Valgrind from its website. (At the time of this writing, the latest stable release of Valgrind is 1.0.4. The latest development release is in the 1.9.x series.) Download the latest stable release (or the latest development release, depending on your sense of adventure) and build the software:

    % bunzip2 valgrind-1.0.4.tar.bz2
    % tar xvf valgrind-1.0.4.tar
    % cd valgrind-1.0.4
    % ./configure
    % make
    % make install

One great feature of Valgrind is that it doesn't require you to build (or re-build) your application in any special way. Simply place valgrind right in front of the program you want to inspect. For example, the command % valgrind ls -all inspects and monitors the ls command. (Running this command on Red Hat Linux 8.0 showed no errors.)

The output from Valgrind has the following format:

    ==20691== 8192 bytes in 1 blocks are definitely lost in loss record 1 of 1
    ==20691==    at 0x40048434: malloc (vg_clientfuncs.c:100)
    ==20691==    by 0x806910C: fscklog_init (fsckwsp.c:2491)
    ==20691==    by 0x806E7D0: initial_processing (xchkdsk.c:2101)
    ==20691==    by 0x806C70D: main (xchkdsk.c:289)

The ==xxxxx== string prefixes each line ...

    ==3016==    by 0x40271507: __libc_start_main (...) (in /jfs/article/sample2)
    ==3016== Address 0x40CA0454 is 0 bytes after a block of size 512 alloc'd
    ==3016==    at 0x400483E4: malloc (vg_clientfuncs.c:100)
    ==3016==    by 0x80484BF: main (in /jfs/article/sample2)
    ==3016==    by 0x40271507: __libc_start_main (...) (in /jfs/article/sample2)

Finally, to show how Valgrind finds invalid use of uninitialized memory, let's look at the results of analyzing the Journaled File System's (JFS) fsck utility. As before, we run fsck.jfs under the auspices of Valgrind:

    % valgrind -v --leak-check=yes fsck.jfs /dev/hdb1

Figure Three shows a snippet of the output.

Figure Three: Valgrind output for the Journaled File System utility fsck.jfs

    ==12903== Conditional jump or move depends on uninitialised value(s)
    ==12903==    at 0x8079FCC: __divdi3 (in /sbin/fsck.jfs)
    ==12903==    by 0x805CB0E: validate_super (fsckmeta.c:2331)
    ==12903==    by 0x805C266: validate_repair_superblock (fsckmeta.c:1833)
    ==12903==    by 0x806E2B5: initial_processing (xchkdsk.c:1968)

The validate_super() routine can be found in the jfsutils package in jfsutils-1.x/fsck/fsckmeta.c. Listing Three shows a portion of the code:

Listing Three: A code snippet from fsckmeta.c

    int validate_super( int which_super )
    {
        int64_t bytes_on_device;
        ...
        /* get physical device size */
        vfs_rc = ujfs_get_dev_size(Dev_IOPort, &bytes_on_device);
        ...
        dev_blks_on_device = bytes_on_device / Dev_blksize;  /* Line 2331 */
        if (sb_ptr->s_pbsize != Dev_blksize) {
        ...
The output from Valgrind indicates that an uninitialized variable is used on line 2331 -- that's the line that says dev_blks_on_device = bytes_on_device / Dev_blksize. As you can see, bytes_on_device is not set before it's used. Using Valgrind, this memory management problem was identified and fixed before an end-user ever came across it. (A minimal program that reproduces this class of error appears at the end of this article.)

Cache Profiling

Valgrind can also perform cache simulations and annotate your source line-by-line with the number of cache misses. In particular, it records:

- L1 instruction cache reads and misses;
- L1 data cache reads and read misses, and writes and write misses;
- L2 unified cache reads and read misses, and writes and write misses.

L1 is a small amount of SRAM memory that's used as a cache. L1 temporarily stores instructions and data, ensuring that the processor has a steady supply of data to process while memory catches up delivering new data. L1 is integrated or packaged within the same module as the processor. Level 2 caching is performed in L2.

Valgrind's cachegrind tool is used to do cache profiling -- you use it just like valgrind. For example, the following command looks at the fsck.jfs program:

    % cachegrind fsck.jfs -n -v /dev/hdb1

The output of cachegrind is collected in the file cachegrind.out. Sample output from analyzing fsck.jfs is shown in Figure Four.

Figure Four: cachegrind.out, cachegrind's analysis of fsck.jfs

    ==11004== I   refs:      99,813,615
    ==11004== I1  misses:         4,301
    ==11004== L2i misses:         3,210
    ==11004== I1  miss rate:        0.0%
    ==11004== L2i miss rate:        0.0%
    ==11004==
    ==11004== D   refs:      68,846,938  (65,916,678 rd + 2,930,260 wr)
    ==11004== D1  misses:        63,883  (    37,768 rd +    26,115 wr)
    ==11004== L2d misses:        37,485  (    14,330 rd +    23,155 wr)
    ==11004== D1  miss rate:        0.0% (       0.0%   +       0.8%  )
    ==11004== L2d miss rate:        0.0% (       0.0%   +       0.7%  )
    ==11004==
    ==11004== L2 refs:           68,184  (    42,069 rd +    26,115 wr)
    ==11004== L2 misses:         40,695  (    17,540 rd +    23,155 wr)
    ==11004== L2 miss rate:         0.0% (       0.0%   +       0.7%  )

    Events recorded abbreviations are:
    Ir  : I cache reads (ie. instructions executed)
    I1mr: I1 cache read misses
    I2mr: L2 cache instruction read misses
    Dr  : D cache reads (ie. memory reads)
    D1mr: D1 cache read misses
    D2mr: L2 cache data read misses
    Dw  : D cache writes (ie. memory writes)
    D1mw: D1 cache write misses
    D2mw: L2 cache data write misses

Next, you can annotate the output from cachegrind by using vg_annotate:

    % vg_annotate

vg_annotate produces output like that shown in Figure Five. The figure shows one annotation for the routine dmap_pmap_verify(). The entry states that 88,405,584 instructions of 99,813,615 total instructions were spent in dmap_pmap_verify(). This information is invaluable for deciding where to tune the program. You can also further annotate the source to find the actual instructions executed in that routine.

Figure Five: annotation of one entry of cachegrind for fsck.jfs

    ------------------------------------------------------------------------
    Ir         I1mr I2mr Dr         D1mr   D2mr Dw      D1mw D2mw
    ------------------------------------------------------------------------
    88,405,584 23   23   61,740,960 14,535 98   576,828 9    9    fsckbmap.c:dmap_pmap_verify
    END

For a complete description of cachegrind, see the Valgrind User's Manual in the Valgrind distribution.

Some Limitations Of Valgrind

There are two issues that you should be aware of when analyzing an application with Valgrind. First, an application running under Valgrind consumes more memory. Second, your program will run slower. However, these two minor annoyances shouldn't stop you from using this powerful memory management debug tool.
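For readers who want to reproduce the class of error shown in Figure Three, here is a minimal hypothetical program (the file name uninit.c is illustrative) that makes a conditional jump depend on an uninitialised value:

    /* uninit.c */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int *flag = malloc(sizeof(int));   /* malloc'd memory is uninitialized */
        if (flag == NULL)
            return 1;
        if (*flag > 0)                     /* the condition depends on junk */
            printf("positive\n");
        free(flag);
        return 0;
    }

Compile with debug info and run it under Valgrind to see the report:

    % gcc -g -o uninit uninit.c
    % valgrind ./uninit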
About the Author

Steve Best works in the Linux Technology Center of IBM in Austin, Texas. He is currently working on the Journaled File System (JFS) for Linux project. Steve has done extensive work in operating system development with a focus in the areas of file systems, internationalization, and security. He can be reached at sbest@us.ibm.com. You can download the sample programs used in this article here.

    Original link path: /gallery/linux_mag.html

  • Title:
    Descriptive info:
-----------------------------------------------------------------------------
2nd Official Valgrind Survey, September 2005: high-level summary
-----------------------------------------------------------------------------
The survey, hosted at www.valgrind.org, ran from September 22--October 8, 2005. 179 responses were received.

The following are recommended "action items" -- ie. things we should do.

Big, important:
- Should do more performance tuning. Now possible with Cachegrind and self-hosting. This will make many people happy.
- We should try to bring Addrcheck back, or give Memcheck the characteristics that people like about Addrcheck (speed, less memory use).
- We should monitor how effectively the aspacem rewrite helps memory usage problems; if there are still problems, we should try the compressed V bits representation.
- We should try to bring Helgrind back.

Big, not as important:
- We should consider how to improve suppressions. Just dumping auto-generated ones to file would be a good start. More flexible and simpler syntax would help too.
- We should work out if debugger attachment can be made more reliable.
- We should perhaps try to improve the understandability of error messages.
- We should investigate a strict definedness option; past experience has shown it to not work well but perhaps things could be improved.
- We should improve ISA coverage.
- It might be worth working on a coverage tool. Some work has been done on that front already by various people.

Small:
- Should make --leak-check=full the default for Memcheck.
- Should increase the --num-callers default from 12 to 20.
- Perhaps change what gets shown with -v.
- Advertise some of the changes that people haven't realised in the release notes of the next release (eg. --tool no longer mandatory).

Other:
- add all the projects to the projects page :)
- improve the survey for next time!

-----------------------------------------------------------------------------
Things to remember
-----------------------------------------------------------------------------
These are things that aren't exactly action items, but are good to remember.
- Memcheck is the most widely used tool by a long way
- x86/Linux is by far the most popular platform
- Almost all Valgrind use is on C and C++ programs
- People love the ease-of-use (ie. no recompilation required)

-----------------------------------------------------------------------------
Short summaries of responses to individual questions
-----------------------------------------------------------------------------
Q1. People have been using Valgrind for this long:
    37+ months:     7%
    25--36 months: 24%
    13--24 months: 37%
    0--12 months:  31%
The average is 21.7 months.
-----------------------------------------------------------------------------
Q2. People use it this frequently:
    hourly   1%
    daily   13%
    weekly  47%
    monthly 38%
-----------------------------------------------------------------------------
Q3. They use Valgrind on these hardware platforms:
    Probably x86           77.8%
    x86 or AMD64 (unclear) 10.5%
    Probably AMD64          9.9%
    Other                   1.8%
These numbers are rough; the question was poorly done, so understanding the answers required some guessing.
-----------------------------------------------------------------------------
Q4. The proportion of OS/distro usage with Valgrind is:
    Fedora Core                                     17.3%
    SuSE                                            16.0%
    Red Hat Enterprise Linux                        15.6%
    Red Hat (pre-Fedora, up to version 9.0)         11.2%
    Gentoo                                           7.6%
    Debian                                           7.2%
    Linux                                            7.2%
    Red Hat (unspecified if before or after Fedora)  4.0%
    Mandrake                                         3.4%
    Ubuntu                                           2.8%
    Slackware                                        2.1%
    Mandriva                                         1.7%
    CentOS                                           1.1%
    FreeBSD                                          0.6%
    Other                                            2.4%
All the Red Hat versions combined (including Fedora) account for 48.1%.
-----------------------------------------------------------------------------
Q5. They obtain Valgrind in the following ways (the percentages sum to more than 100% because many people mentioned more than one source):
    source from website                  78%
    pre-built (eg. RPM, Debian package)  25%
    source from CVS/SVN repository       21%
    already installed                    16%
    other: gentoo ebuild                  5%
    other: FreeBSD ports system          <1%
-----------------------------------------------------------------------------
Q6. The ...
    IA64     5.5%
    PPC      5.3%
    Linux   65.3%
    Windows  5.7%
    Solaris  5.5%
    MacOS X  2.8%
    HPUX     1.8%
These percentages are out of less than 100%, because less than 100% of responses mentioned both a CPU and an OS.
-----------------------------------------------------------------------------
Q10. The things people like most about Valgrind are (with the number of times the thing was mentioned):
    memcheck/memory checking      118
      - general praise             41
      - leak checking              32
      - undefined value checking   20
      - invalid access checking    18
      - other                       7
    ease-of-use                    40
      - no recompilation           20
      - easy to use                20
    other                          27
    accurate information           19
    robust/stable                  19
    GPL                            16
    speed                          13
    general praise                 11
-----------------------------------------------------------------------------
Q11. The things people like least about Valgrind are (with the number of times the thing was mentioned) (at least 10 mentions):
    speed                              45
    suppressions                       23
    simulation correctness/robustness  21
    error messages/output              20
    Memcheck's error detection         18
    memory use                         15
    debugger attachment                11
    want Helgrind back                 10
    Callgrind                          10
    documentation                      10
-----------------------------------------------------------------------------
Q12. The features people want added to Valgrind are (at least 10 mentions):
    other                                   59
    checking of static/stack bounds errors  14
    more platforms supported                13
    better debugger integration             10
    GUIs                                    10
-----------------------------------------------------------------------------
Q13. People had the following suggestions for other tools (at least 2 mentions):
    code coverage tool            9
    time profiler                 8
    function call tracer          3
    just improve existing tools   3
    resource usage measurer       2
    flight recorder/time machine  2
-----------------------------------------------------------------------------
Q14. People liked the following non-software aspects of Valgrind's development (5 mentions or more):
    mailing list/developers are good/responsive  31
    other                                        12
    development speed/activity                    9
    website                                       8
    frequent releases                             7
    open source/free software                     6
    high quality releases                         6
    don't know/no opinion                         6
    bug handling                                  5
-----------------------------------------------------------------------------
Q15. People thought the following non-software aspects could be improved (1 mention or more):
    none                                                          7
    Documentation could be improved                               7
    Hard to follow development, should provide more info          4
    Releases should have longer beta/RC period for more testing   3
    Callgrind should be integrated                                3
    Helgrind is broken                                            3
    Releases should be more frequent                              2
-----------------------------------------------------------------------------
Q16. People use Valgrind on the following kinds of project (five or more mentions):
    scientific/analysis          22
    programming (eg. compilers)  10
    telecommunications            8
    server                        7
    libraries                     7
    graphics                      6
    audio/video                   5
The proportion of language use is the following (1% or more):
    C++   49.9%
    C     42.9%
    F77    1.6%
    Ada    1.1%
    Java   1.0%
The number of programmers in the mentioned projects were:
    1         27
    1.5 - 2   19
    3 - 4     14
    5 - 9     26
    10 - 19   13
    20 - 39    8
    40 - 99    5
    100+       3
Code size of the projects varied greatly, from 2000 lines to "tens of millions", with fairly even representation across the whole range. 49 people gave permission to have their project listed on the Valgrind website.
-----------------------------------------------------------------------------
Q17. People had the following comments about the survey (3 or more mentions):
    it's good                  5
    input boxes are too small  4
-----------------------------------------------------------------------------
Q18. People had final comments in the following categories:
    short generic compliments             39
    longer generic compliments            16
    longer, more specific compliments     11
    compliments mentioning similar tools   8
    other                                  8
    shortcomings                           5
-----------------------------------------------------------------------------
Q19. The responses were from the following regions (with some breakdown):
    Europe      79.5 (inc. Russia)
      - Germany      18.5
      - UK           10.5
      - France       10
      - Netherlands   6
      - Belgium       5
    N. America  39.5
      - USA          37.5
      - Canada        2
    Australia    6
    Middle East  6 (inc. Turkey)
      - Israel        4
    Asia         4
    S. America   2
    Africa       1

    Original link path: /gallery/survey_05/summary.txt




