<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
<title>nedalloc Readme</title>
<style type="text/css">
<!--
body {
	text-align: justify;
}
h1, h2, h3, h4, h5, h6 {
	margin-bottom: -0.5em;
}
h1 {
	text-align: center;
}
h2 {
	text-decoration: underline;
	margin-bottom: -0.25em;
}
p {
	margin-top: 0.5em;
	margin-bottom: 0.5em;
}
ul li, ol li {
	margin-top: 0.2em;
	margin-bottom: 0.2em;
}
dl {
	margin-left: 2em;
}
dl dt {
	font-weight: bold;
}
dt + dd {
	margin-bottom: 1em;
}
.gitcommit {
	font-family: "Courier New", Courier, monospace;
	font-size: smaller;
}
-->
</style>
</head>

<body>

<div style="text-align: center">
	<h1 style="text-decoration: underline">nedalloc v1.10 beta 4 (?)</h1>
	<h2 style="text-decoration: none;">by Niall Douglas</h2>
	<p>Web site: <a href="http://www.nedprod.com/programs/portable/nedmalloc/">http://www.nedprod.com/programs/portable/nedmalloc/</a></p>
  <p>Trunk build status: <a href="https://travis-ci.org/ned14/nedmalloc"><img style="vertical-align:middle;border:none" src="https://travis-ci.org/ned14/nedmalloc.png?branch=master"/></a></p>
	<hr /></div>
<p>Enclosed is nedalloc, an alternative malloc implementation for multiple threads
without lock contention based on <a href="http://g.oswego.edu/" target="_blank">
dlmalloc</a> v2.8.6 and a specialised user mode page allocator (Windows Vista or
later only). It has the following features:</p>
<ol>
	<li>A per-thread small block cache for maximum CPU scalability.</li>
	<li>A per-thread arena to minimise lock contention.</li>
	<li>The ability to patch Windows binaries to replace the C memory allocation
	API malloc(), realloc(), free() et al such that by simply inserting nedmalloc.dll
	into a process one realises performance improvements without recompilation.</li>
	<li>On POSIX, it knows how to talk to valgrind so you can track memory
	corruption and/or memory leaks.</li>
	<li>A unique user mode page allocator implementation which delivers O(1) scaling
	for blocks of any size, including a very fast O(1) realloc(). Improves medium
	sized block (~1MB) allocation speeds by about 25 times on current hardware.
	Requires Windows Vista or later, and requires Administrator privileges
	as well as either UAC disabled or a UAC prompt at the start of each program
	run.</li>
    <li>A malloc v2 API which enables considerable improvements in efficiency by
    allowing client code to better inform the allocator on what (not) to do.</li>
	<li>An enhanced C++ STL allocator implementation to enable super-fast std::vector&lt;&gt;
	<strong>[unfinished]</strong></li>
</ol>
<p>It is licensed under the
<a href="http://www.boost.org/LICENSE_1_0.txt" target="_blank">Boost Software License</a>
which basically means you can do anything you like with it. This does not apply
to the malloc.c.h file which remains copyright to others. Commercial support is
available from <a href="http://www.nedproductions.biz/" target="_blank">ned Productions
Limited</a>.</p>
<p>It has been tested on win32 (x86), win64 (x64), Linux (x64), FreeBSD (x64) and
Apple Mac OS X (x86). It works very well on all of these and is very significantly
faster than the system allocator on Windows XP and FreeBSD &lt;v7. If you are using
Apple Mac OS X 10.6 or later, or Windows 7 or later, then you probably won&#39;t
see much improvement without modifying your source to use the v2 malloc API (and
kudos to Apple and Microsoft for adopting excellent allocators).</p>
<p>The user mode page allocator returns jaw dropping real world performance improvements
but requires running the process as the superuser. Without it, nedalloc still offers sizeable
gains on all older operating systems and, through the v2 malloc API, modest gains
on all very recent operating systems, especially in these situations:</p>
<ol>
	<li>If you are repeatedly extending large vector arrays, you will see a LARGE
	improvement if you use the address space reservation features.</li>
	<li>If you do a lot of work with 16 byte aligned vectors e.g. SSE or AVX vector
	arrays, you will find the v2 malloc API a godsend.</li>
</ol>
<p style="text-decoration: underline"><strong>Table of Contents: </strong></p>
<ol style="list-style-type: upper-alpha; position: relative; margin-top: -0.5em;">
	<li><a href="#touse">How to use</a><ul style="list-style-type: none; margin-left: 0; padding-left: 0">
		<li>A1. <a href="#CPPAPI">The C++ API</a></li>
		<li>A2. <a href="#v2mallocAPI">The v2 malloc C API</a></li>
	</ul>
	</li>
	<li><a href="#notes">Notes</a><ul style="list-style-type: none; margin-left: 0; padding-left: 0">
		<li>B1. <a href="#memorybloat">Memory Bloating</a></li>
		<li>B2. <a href="#memoryleaks">Memory Leakage</a></li>
		<li>B3. <a href="#threadcache">The Threadcache</a></li>
		<li>B4. <a href="#largepages">Large Page support</a></li>
		<li>B5. <a href="#logger">Memory operation logging</a></li>
		<li>B6. <a href="#windowsonly">Windows-only features</a></li>
	</ul>
	</li>
	<li><a href="#speedcomparisons">Speed Comparisons</a></li>
	<li><a href="#troubleshooting">Troubleshooting</a></li>
	<li><a href="#changelog">Changelog</a></li>
</ol>
<h2><a name="touse">A. To use:</a></h2>
<p>The quickest way is to drop nedmalloc.h, nedmalloc.c and malloc.c.h into your
project. Call nedmalloc(), nedcalloc(), nedrealloc() and nedfree() instead of your
normal allocator, or nedpmalloc(), nedpcalloc(), nedprealloc() and nedpfree() if
you want to segment your memory usage into pools. Make sure that you call neddisablethreadcache()
for every pool you use on thread exit, and don&#39;t forget neddisablethreadcache(0)
for the system pool if necessary. Run and enjoy!</p>
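<p>As a minimal sketch of the drop-in usage described above (it is not taken from test.c):
the helper names sum_of_squares() and worker_thread_exit() and the USE_NEDMALLOC switch
are hypothetical and exist only for illustration. When USE_NEDMALLOC is not defined, the
ned* names fall back to the C library so that the sketch compiles on its own.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#ifdef USE_NEDMALLOC            /* hypothetical build switch, not part of nedalloc */
#include "nedmalloc.h"
#else
/* Stand-ins so the sketch builds without nedalloc in the include path. */
#define nedmalloc(bytes)          malloc(bytes)
#define nedfree(mem)              free(mem)
#define neddisablethreadcache(p)  ((void)(p))
#endif

/* Sums 0^2 + 1^2 + ... + (count-1)^2 using a heap allocated scratch array.
   Returns -1 if the allocation fails. */
long sum_of_squares(size_t count)
{
    long *values, total = 0;
    size_t i;

    assert(count > 0);
    values = (long *) nedmalloc(count * sizeof *values);
    if (!values) return -1;
    for (i = 0; i < count; i++) values[i] = (long) (i * i);
    for (i = 0; i < count; i++) total += values[i];
    nedfree(values);
    return total;
}

/* Call this as the last thing a thread does: passing 0 selects the system pool. */
void worker_thread_exit(void)
{
    neddisablethreadcache(0);
}
```

<p>A thread which also used pools of its own would call neddisablethreadcache() once
per pool before the call for the system pool.</p>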
<p>To test, compile <a href="test.c">test.c</a> (C) and <a href="test.cpp">test.cpp</a>
(C++). Both will run a comparison between your system allocator and nedalloc and
tell you how much faster nedalloc is. They also serve as examples of usage.</p>
<p>If you&#39;d like nedalloc as a Windows DLL or POSIX ELF shared object, the easiest
thing to do is to use <a href="http://www.scons.org/" target="_blank">scons</a>
which comes with a myriad of build options listed using scons -h. <b>If you want
to build some MSVC project files for use with Microsoft Visual Studio</b> then what
you do is (i) install <a href="http://www.python.org/" target="_blank">python</a>
(ii) install <a href="http://www.scons.org/" target="_blank">scons</a> (iii) open
a Visual Studio Command Box for the Visual Studio you wish to use via Start Menu
=&gt; Programs =&gt; Microsoft Visual Studio XXXX =&gt; Visual Studio Tools =&gt; Visual Studio
XXXX Command Prompt (iv) change directory to the nedmalloc directory (e.g. by dragging
in its folder) (v) type &quot;!MakeMSVCProjs&quot; and hit Return. Note that for Visual Studio
2008 and later support you need scons v2.1 or later.</p>
<p>nedalloc comes with two new memory allocator APIs: one is for C++, and the other
is for C. <strong>Full documentation</strong> for all nedalloc&#39;s APIs and features
is provided in the enclosed <a href="nedalloc.chm">nedalloc.chm</a> which is in
Microsoft HTML Help format (Linux and Apple Mac OS X will happily read this format
too). If you don&#39;t want to use the CHM documentation, <a href="nedmalloc.h">nedmalloc.h</a>
is extensively commented with <a href="http://www.doxygen.org/" target="_blank">
doxygen markup</a>.</p>
<h3><a name="CPPAPI">A1: The C++ API:</a></h3>
<p>For the v1.10 release which was generously sponsored by
<a href="http://www.ara.com/" target="_blank">Applied Research Associates (USA)</a>,
a C++ metaprogrammed STL allocator was designed which makes use of advanced nedalloc
features to remedy many of the long standing problems and inefficiencies caused
by C++&#39;s traditional over-fondness for copying things. While its implementation
is complex, usage is extremely easy - simply supply nedallocator&lt;&gt; as the custom
allocator to STL container classes.</p>
<p>As nedmalloc can do even better for vector extension, nedmalloc.h also contains
a nedvector&lt;&gt; implementation which is the standard STL vector&lt;&gt; implementation except
that it makes use of the non-relocating facilities of realloc2() (see below). This
allows nedvector&lt;&gt; to not need to overallocate memory (most STL vector&lt;&gt; implementations
will overallocate by 50%) which saves a lot of memory as well as <strong>completely
avoiding array copy construction</strong> which makes std::vector&lt;&gt;::resize() so
very, very slow.</p>
<p>Even without nedalloc&#39;s major speed improvements as a simple C style allocator,
the improvements to the C++ memory infrastructure alone can generate huge performance
gains.</p>
<h3><a name="v2mallocAPI">A2: The v2 malloc C API:</a></h3>
<p><strong>[Note: This API will be completely replaced in v1.2]</strong></p>
<p>For the v1.10 release which was generously sponsored by
<a href="http://www.ara.com/" target="_blank">Applied Research Associates (USA)</a>,
a new general purpose allocator API was designed which is intended to remedy many
of the long standing problems and inefficiencies introduced by the ISO C allocator
API. Internally nedalloc&#39;s implementations of nedmalloc(), nedcalloc(), nedmemalign()
and nedrealloc() all call into this API:</p>
<ul>
	<li><code>void* malloc2(size_t bytes, size_t alignment, unsigned flags)</code></li>
	<li><code>void* realloc2(void* mem, size_t bytes, size_t alignment, unsigned
	flags)</code></li>
</ul>
<p>If nedmalloc.h is being included by C++ code, the alignment and flags parameters
default to zero which makes the new API identical to the old API (roll on the introduction
of default parameters to C!). The ability for realloc2() to take an alignment is
<em>particularly</em> useful for extending aligned vector arrays such as SSE/AVX
vector arrays. Hitherto SSE/AVX vector code had to jump through all sorts of unpleasant
hoops to maintain alignment during array extension :(.</p>
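<p>To illustrate the hoops mentioned above, here is a sketch of aligned array extension
using nothing but ISO C11. It is not nedalloc code: grow_aligned() and round_up() are
hypothetical names, and the function always allocates, copies and frees, which is the
work an alignment-aware realloc2() is in a position to avoid.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Rounds bytes up to a multiple of alignment (alignment must be a power of two). */
static size_t round_up(size_t bytes, size_t alignment)
{
    return (bytes + alignment - 1) & ~(alignment - 1);
}

/* Resizes an aligned block the long way round: allocate a new aligned block,
   copy the contents, free the old block. old_mem may be NULL. Returns NULL on
   failure, in which case old_mem is left untouched. */
void *grow_aligned(void *old_mem, size_t old_bytes, size_t new_bytes, size_t alignment)
{
    void *new_mem;

    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    /* C11 aligned_alloc() wants the size to be a multiple of the alignment. */
    new_mem = aligned_alloc(alignment, round_up(new_bytes, alignment));
    if (!new_mem) return NULL;
    if (old_mem) {
        memcpy(new_mem, old_mem, old_bytes < new_bytes ? old_bytes : new_bytes);
        free(old_mem);
    }
    return new_mem;
}
```

<p>Every extension through this helper costs a full copy of the array, so repeatedly
extending a large SSE/AVX array this way scales with the size of the array each time.</p>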
<p>The flags supported include the ability to zero memory, to prevent realloc2()
from moving a memory block, to force mmap() to be used from the beginning (useful
when you know an array will be repeatedly extended) and to cause malloc2() to reserve
additional address space after the allocation such that a realloc2() up to that
reserved space will be very quick. On 32 bit Windows and Linux this reservation
costs no address space in your process, so using it will NOT cause premature address
space exhaustion.</p>
<p>You should note that realloc()&#39;s thunk to realloc2() defaults the flags to M2_RESERVE_MULT(8),
i.e. if realloc() needs to allocate a block larger than mmap_threshold, it will
also reserve eight times the address space of that allocation in order to make future
realloc() calls up to that point much faster. This catches the vast majority of situations
where large arrays are repeatedly extended.</p>
<h2><a name="notes">B. Notes:</a></h2>
<p>If you want the very latest version of this allocator, get it from the TnFOX
GIT repository at either of the following (both are identical mirrors):</p>
<ul>
	<li>
	<a href="git://nedmalloc.git.sourceforge.net/gitroot/nedmalloc/nedmalloc">git://nedmalloc.git.sourceforge.net/gitroot/nedmalloc/nedmalloc</a></li>
	<li><a href="git://github.com/ned14/nedmalloc.git">git://github.com/ned14/nedmalloc.git</a></li>
</ul>
<p>IF YOU THINK YOU HAVE FOUND A BUG, PLEASE CHECK ONE OF THESE REPOS FIRST BEFORE
REPORTING IT!</p>
<h3><a name="memorybloat">B1: Memory Bloating</a></h3>
<p>Because of how nedalloc allocates an mspace per thread, it <em>can</em> cause
severe bloating of memory usage under certain allocation patterns. You can substantially
reduce this wastage by setting DEFAULTMAXTHREADSINPOOL or the threads parameter
to nedcreatepool() to a fraction of the number of threads which would normally be
in a pool at once. This will reduce bloating at the cost of an increase in lock
contention, with DEFAULTMAXTHREADSINPOOL=1 removing almost all bloating. If the
block sizes typically allocated are less than THREADCACHEMAX, locking is avoided
90-99% of the time and if most of your allocations are below this value, you can
safely set DEFAULTMAXTHREADSINPOOL or even MAXTHREADSINPOOL to one.</p>
<p>If you have LOTS of threads you may find that the threadcache held per thread
is causing memory bloating. You can call nedtrimthreadcache() to trim the cache
in a thread when you know that it won&#39;t be doing memory allocation (e.g. just before
going to sleep), or alternatively you can set THREADCACHEMAXFREESPACE to something
smaller than its default of 1MB.</p>
<p>Lastly, some people find that memory is not returned to the system when they
think it ought to be. dlmalloc only returns free memory to the system when there
is DEFAULT_TRIM_THRESHOLD (default=2MB) free in an mspace, and it only checks how
much there is free outside the topmost segment every MAX_RELEASE_CHECK_RATE free()&#39;s.
In other words, if your program very rapidly deallocates an awful lot of memory
and then does not call free() for some time thereafter, dlmalloc will not release
memory to the system. Generally in any real world code scenario free() will be called
fairly frequently, and if not then you can always force release using nedmalloc_trim().</p>
<h3><a name="memoryleaks">B2: Memory Leakage</a></h3>
<p>You will suffer memory leakage unless you call neddisablethreadcache() per pool
for every thread which exits (unless you are using nedalloc from its DLL on Windows).
This is because nedalloc cannot portably know when a thread exits and thus when
its thread cache can be returned for use by other code. Don&#39;t forget pool zero,
the system pool. On some POSIX threads implementations there exists a pthread_atexit()
which registers a termination handler for thread exit - if you don&#39;t have one of
these then you&#39;ll have to do it manually.</p>
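<p>Where no thread exit hook of that kind is available, one portable POSIX alternative
is a thread-specific data key with a destructor. The following is only a sketch under
that assumption: every name in it is hypothetical, and the destructor just increments a
counter at the point where the neddisablethreadcache() calls would go.</p>

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_key_t cleanup_key;
static pthread_once_t cleanup_once = PTHREAD_ONCE_INIT;
static int cleanups_run;   /* demo counter; only safe here because one worker runs at a time */

/* Runs automatically when a thread which registered itself exits. */
static void on_thread_exit(void *unused)
{
    (void) unused;
    /* With nedalloc linked in, this is where neddisablethreadcache(pool) would be
       called for every pool the thread used, followed by neddisablethreadcache(0). */
    cleanups_run++;
}

static void make_key(void)
{
    int rc = pthread_key_create(&cleanup_key, on_thread_exit);
    assert(rc == 0);
    (void) rc;
}

/* Call once near the start of each thread which allocates memory. */
void register_thread_cleanup(void)
{
    pthread_once(&cleanup_once, make_key);
    /* Destructors only fire for non-NULL values, so store any non-NULL pointer. */
    pthread_setspecific(cleanup_key, &cleanups_run);
}

static void *worker(void *arg)
{
    register_thread_cleanup();
    return arg;
}

/* Starts one worker, waits for it, and returns how many cleanups have run so far. */
int run_worker_and_count_cleanups(void)
{
    pthread_t thread;

    if (pthread_create(&thread, NULL, worker, NULL) != 0) return -1;
    pthread_join(thread, NULL);
    return cleanups_run;
}
```

<p>The destructor runs on the exiting thread itself, which is what a per-thread cache
cleanup needs.</p>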
<p>Equally if you use nedalloc from a dynamically loaded DLL or shared object which
you later kick out of memory, you will leak memory if you don&#39;t disable all thread
caches for all pools (as per the preceding paragraph), destroy all thread pools
using neddestroypool() and destroy the system pool using neddestroysyspool().</p>
<h3><a name="threadcache">B3: The Threadcache</a></h3>
<p>For C++ type allocation patterns (where the same small sizes of memory are regularly
allocated and deallocated as objects are created and destroyed), the threadcache
always benefits performance as it will cache all malloc/free allocations under THREADCACHEMAX
in size. If however your allocation patterns are different, searching the threadcache
may significantly slow down your code - as a rule of thumb, if cache utilisation
is below 80% (see the source for neddisablethreadcache() for how to enable debug
printing in release mode) then you should disable the thread cache for that thread.
You can compile out the threadcache code by setting THREADCACHEMAX to zero.</p>
<h3><a name="largepages">B4: Large Page support</a></h3>
<p>For some applications defining ENABLE_LARGE_PAGES can give a 10-15% performance
increase by having nedalloc allocate using large pages only (which are 2MB on x86/x64).
Large pages take much less space in the TLB cache and can greatly benefit programs
with a large working set, particularly on 64 bit systems.</p>
<p>Support for large pages is limited to Linux and Windows. On Linux one must employ
the libhugetlbfs library anyway as this is the &quot;official&quot; form of large page support,
and setting it up and configuring it involves mounting a special hugetlbfs filing
system. dlmalloc does not require a dependency on the libhugetlbfs headers, rather
it searches for the library in the current process and if not found it silently
disables support.</p>
<p>On Windows, large page support is only implemented on Windows Server 2003/Vista
or later and they are only permitted to be allocated by users holding the &quot;Lock
pages in memory&quot; local security setting which is DISABLED by default. Furthermore,
the process using nedalloc must hold the SeLockMemoryPrivilege privilege. If you
are using the DLL then the DLL attempts to enable the SeLockMemoryPrivilege during
initialisation - therefore if you are not using the DLL you will have to do this
manually yourself. As with Linux support, if at any stage large pages cannot be
allocated, then dlmalloc silently disables support - this allows one binary to function
correctly in any environment. <strong>Note that on Windows</strong> if your process
allocates a lot of memory at once when the machine has been running for an extended
period, then the whole computer may hang for several seconds as the Windows kernel
copies memory around in order to coalesce a large page. This is a problem with the
Windows kernel and its VM design, not nedmalloc! If you would like to see how large
pages ought to be implemented, research how FreeBSD implemented them.</p>
<h3><a name="logger">B5: Memory operation logging</a></h3>
<p>It is often very useful to have a log of the memory operations which an application
performs - you would be amazed at the inefficiencies in memory usage that this can
reveal. nedalloc contains a very fast memory operation logger which keeps a per-thread
log of selected operations, including an optional stack backtrace. On pool destruction,
or when nedflushlogs() is called, nedalloc will write out the log as a Comma Separated Value format
file which can be loaded into applications such as Excel for analysis.</p>
<p>To use, define ENABLE_LOGGING to the bitmask of enum LogEntryType items in which
you are interested, so 0xffffffff would log absolutely everything. The macro NEDMALLOC_TESTLOGENTRY,
whose default is (ENABLE_LOGGING &amp; logentrytype), is then used to determine which
items should be logged. You can also enable stack backtracing on MSVC and GCC using
NEDMALLOC_STACKBACKTRACEDEPTH.</p>
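<p>The bitmask test described above can be sketched as follows. The DEMO_ names are
hypothetical stand-ins: the real enum LogEntryType items and their values live in the
nedalloc sources and may well differ, so treat this purely as an illustration of the
masking logic.</p>

```c
#include <assert.h>

/* Hypothetical stand-ins for nedalloc's enum LogEntryType items. */
enum DemoLogEntryType {
    DEMO_LOG_MALLOC  = 1u << 0,
    DEMO_LOG_REALLOC = 1u << 1,
    DEMO_LOG_FREE    = 1u << 2
};

/* Log only malloc and free operations; 0xffffffff would select everything. */
#define DEMO_ENABLE_LOGGING (DEMO_LOG_MALLOC | DEMO_LOG_FREE)

/* Mirrors the default test described above: the mask ANDed with the entry type. */
#define DEMO_TESTLOGENTRY(logentrytype) ((DEMO_ENABLE_LOGGING) & (logentrytype))

/* Returns 1 if an operation of the given type would be logged, else 0. */
int demo_should_log(unsigned logentrytype)
{
    assert(logentrytype != 0);   /* a zero type can never match any mask */
    return DEMO_TESTLOGENTRY(logentrytype) != 0;
}
```

<p>Because the mask is a compile time constant, a compiler can usually remove the
logging call entirely for entry types which are masked out.</p>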
<h3><a name="windowsonly">B6: Windows-only features</a></h3>
<p>If you are running on Windows, there are quite a few extra options available
thanks to work generously sponsored by
<a href="http://www.ara.com/" target="_blank">Applied Research Associates (USA)</a>:</p>
<dl>
	<dt>Automatic threadcache cleanup and log output</dt>
	<dd>If you build nedalloc as a DLL and link that into your application, then
	the DLL can trap thread exits in your application and call neddisablethreadcache()
	on all currently existing nedpool&#39;s for you. On process exit, the DLL will also
	call nedflushlogs() for you on all still extant nedpool&#39;s.</dd>
	<dt>Replacing the system allocator in the whole process</dt>
	<dd>
	<p>If you define REPLACE_SYSTEM_ALLOCATOR when building the DLL then the DLL
	will replace <em>most</em> usage of the MSVCRT allocator (release MSVCRT, <strong>not</strong>
	debug MSVCRTD) within any process it is loaded
	into with nedalloc&#39;s routines instead, whilst remaining able to handle the odd
	free() of a MSVCRT allocated block allocated during CRT init. This very conveniently
	allows you to simply link with the nedalloc DLL and your application magically
	now uses it with no code changes required, and because the MSVC implementation
	of operators new and delete both call malloc() and free() it also covers all
	C++ code. The following code is suggested:</p>
	<code>#pragma comment(lib, &quot;nedmalloc.lib&quot;)</code>
	<p>This asks the linker to link against nedmalloc.lib during linking - without
	this pragma the linker will generally leave out nedmalloc as there are no explicitly
	imported routines that it understands. This auto-patching feature can also be
	combined with
	<a href="http://research.microsoft.com/en-us/projects/detours/" target="_blank">
	Microsoft&#39;s Detours</a> to run any arbitrary application using nedalloc instead
	of the system allocator:</p>
	<code>withdll /d:nedmalloc.dll program.exe</code>
	<p>For those not able to use Microsoft Detours, there is an enclosed unsupported/nedmalloc_loader
	program which does one variant of the same thing. It may or may not be useful
	to you - it is not intended to be maintained, and it probably doesn&#39;t work on
	newer systems.</p>
	<p>The reason that only the release MSVCRT and not the debug MSVCRTD is patched
	is twofold: (i) usually one <em>wants</em> the debug heap in debug builds so
	that it does memory corruption checking and reports memory leaks and (ii) the debug
	MSVC CRT actually implements operator new and malloc using a completely different
	implementation based on the Windows kernel HeapAlloc() function and it does
	a lot of hoop jumping to handle mismatching CRT versions and lots of other stuff.
	You can enable patching of the debug memory allocation functions in winpatcher.c
	by uncommenting the relevant lines.</p>
	</dd>
	<dt>User mode page allocation</dt>
	<dd>The user mode page allocator is a user space implementation of kernel memory
	page allocation made possible by misusing the Address Windowing Extensions (AWE)
	provided by newer versions of Microsoft Windows. AWE allows - with a bit of
	persuasion - direct control of the Memory Management Unit of the CPU, thus allowing
	memory pages to be arbitrarily remapped from one address to another. The user
	mode page allocator can therefore allocate memory in microseconds by simply
	mapping it into where it needs to be, or it can realloc() gigabytes of memory
	from its old location into a new bigger space in microseconds. This O(1) scaling
	gives processes running on the user mode page allocator an <strong>unholy</strong>
	speed increase which gets exponentially better the larger the data set.<br />
	<br />
    Want to know more in lots of detail? Here are two academic papers on the topic:
    <ol>
      <li>Douglas, N, (2011-May), '<a href="http://arxiv.org/abs/1105.1815">User Mode Memory Page Management: An old idea applied anew to the memory wall problem</a>', ArXiv e-prints, vol: 1105.1815.</li>
      <li>Douglas, N, (2011-May), '<a href="http://arxiv.org/abs/1105.1811">User Mode Memory Page Allocation: A Silver Bullet For Memory Allocation?</a>', ArXiv e-prints, vol: 1105.1811.</li>
    </ol>
  </dd>
</dl>
<h2><a name="speedcomparisons">C. Speed comparisons:</a></h2>
<p>See Benchmarks.xls for details.</p>
<p>The enclosed test.c can do one of two things: it can be a torture test which
mostly hammers realloc() or it can be a pure speed test which sticks to simple malloc()
and free(). If you enable C++ mode, half of the allocation sizes will be a power-of-two
multiple less than 512 bytes (to mimic C++ stack instantiated objects), sizes which are
extremely common in C++ code.</p>
<p>The torture test is designed to mercilessly work realloc() which is the most
complex and complete code path in any memory allocator. Most allocators have
<strong>very</strong> poor realloc() performance - not so nedalloc which makes use
of mremap() support on Linux and Windows. Even without mremap() support nedalloc&#39;s
realloc() tends to be significantly faster than any standard allocator.</p>
<p>The speed test is designed to be a representative synthetic memory allocator
test where most allocations follow a stack pattern. It works by randomly mixing
allocations with frees, with sizes being a random value less than 16KB.</p>
<p>The C++ test.cpp simply benchmarks how much difference nedalloc::nedallocatorise&lt;&gt;
makes to std::vector&lt;&gt; performance, particularly the performance of push_back(),
pop_back() and vector assignment all of which are very common in real world code.
As you will see, the STL - even with C++0x move constructor support - does not perform
anywhere close to nedalloc&#39;s version which achieves its gains by simply avoiding
copy and move construction completely.</p>
<p>The real world code results are from Tn&#39;s TestIO benchmark. This is a heavily
multithreaded and memory intensive benchmark with a lot of branching and other stuff
modern processors don&#39;t like so much. As you&#39;ll note, the test doesn&#39;t show the
benefits of the threadcache mostly due to the saturation of the memory bus being
the limiting factor.</p>
<h2><a name="troubleshooting">D. Troubleshooting:</a></h2>
<p>I get quite a few bug reports about code not working properly under nedalloc.
I do not wish to sound presumptuous, however in an overwhelming majority of cases
the problem is in your application code and not nedalloc (see below for all the
bugs reported and fixed since 2006). Some of the largest corporations and IT deployments
in the world use nedalloc pre-v1.10, and pre-v1.10 has been very heavily stress
tested on everything from 32 processor SMP clusters right through to root DNS servers,
ATM machine networks and embedded operating systems requiring a very high uptime.
The v1.10 release adds a LOT of new code and features, and hence there are quite
likely a lot of new bugs in the new code.</p>
<p>In particular, just because your application happens to appear to work under the system
allocator does not mean that it is not riddled with memory corruption
and non-ANSI usage of the API! And usually <strong>this is not your code&#39;s fault,
but rather that of the third party libraries being used, which sadly often
include system libraries</strong>.</p>
<p>Even though debugging an application for memory errors is a true black art made
possible only with a great deal of patience, intuition and skill, here is a checklist
for things to do before reporting a bug in nedalloc:</p>
<ol>
	<li>Make SURE you try nedalloc from GIT HEAD. For around six months of 2007
	I kept getting the same report of a bug long fixed in GIT HEAD.</li>
	<li>Make SURE you try nedalloc v1.06. If it works in v1.06 but isn&#39;t working
	in nedalloc &gt;= v1.10, then it&#39;s probably a bug in the new code (please report
	it to me!)</li>
	<li>Make use of nedalloc&#39;s internal debug routines. Try turning on full sanity
	checks by #define FULLSANITYCHECKS 1. Also make use of all the assertion checking
	performed when DEBUG is defined as 1. A lot of bug reports are made before running
	under a debug build where an assertion trip clearly showed the problem. Lastly,
	try changing the thread cache by #defining THREADCACHEMAX - this fundamentally
	changes how the memory allocator behaves: if everything is fine with the thread
	cache fully on or fully off, then this strongly suggests the source of your
	problem.</li>
	<li>Make SURE you are matching allocations and frees belonging to nedalloc if
	you are not defining REPLACE_SYSTEM_ALLOCATOR. Attempting to free a block not
	allocated by nedalloc will end badly, similarly passing one of nedalloc&#39;s blocks
	to another allocator will likely also end badly. I have inserted as many assertion
	and debug checks for this possibility as I can think of (further suggestions
	are welcome), but no system can ever be watertight. If you&#39;re using C++, make
	use of the C++ nedallocatorise API provided or else use some form of strong
	template type system to have the compiler guarantee membership of a memory pointer
	- see <a href="http://www.boost.org/" target="_blank">the Boost libraries</a>,
	or indeed <a href="http://www.nedprod.com/TnFOX/" target="_blank">my own TnFOX
	portability toolkit</a>.</li>
	<li>If you&#39;re still having problems, or more likely your code runs absolutely
	fine under debug builds but trips up under release which suggests a timing bug,
	it is time to deploy heavyweight tools. Under Linux, you should use
	<a href="http://valgrind.org/" target="_blank">valgrind</a>. Under Windows,
	there is an excellent commercial tool called
	<a href="http://www.glowcode.com/" target="_blank">Glowcode</a>. Any programming
	team serious about quality should ALWAYS run their projects through these tools
	before each and every release anyway - you would be amazed at what you miss
	during all other testing.</li>
	<li>Lastly, in the worst case scenario, consider hiring in a memory debugging
	expert. There are quite a few on the market and they often are authors of memory
	allocators. <a href="http://www.malloc.de/en/" target="_blank">Wolfram Gloger
	(the author of ptmalloc) provides consulting services</a>.
	<a href="http://www.nedproductions.biz/" target="_blank">My own consulting company
	ned Productions Limited</a> may be able to provide such a service depending
	on our current workload.</li>
</ol>
<p>I hope that these tips help. And I urge anyone considering simply dropping back
to the system allocator as a quick fix to reconsider: squashing memory bugs often
brings with it <strong>significant</strong> extra benefits in performance and reliability.
It may cost what appears to be a lot extra now, but it will usually repay
its cost many times over during the next few years. I know of one large multinational corporation
which <strong>saved hundreds of millions of dollars</strong> due to the debugging
of their system software performed when trying to get it working with nedalloc -
they found one bug in nedalloc but over a hundred in their own code, and in the
process improved performance <strong>threefold</strong> which saved an expensive
hardware upgrade and deployment. The conclusion can only be that fixing memory bugs
now tends to be worth it in the long run.</p>
<h2><a name="changelog">E. ChangeLog:</a></h2>
<h3>v1.10 beta 4 ?:</h3>
<ul>
	<li><span class="gitcommit">[master 726d9c7]</span> Fixed memory corruption
	introduced when creating more than two nedpool's (issue #7). Thanks to mxmauro
	for reporting this.</li>
	<li><span class="gitcommit">[master c191ea9]</span> Merged dlmalloc v2.8.6.</li>
	<li><span class="gitcommit">[master 06f1c70]</span> Added support for clang,
	plus fixed up some compile errors in C++11.</li>
	<li><span class="gitcommit">[master 8f8256c]</span> Added support for
	valgrind instrumentation so valgrind can track programs using nedmalloc.</li>
	<li><span class="gitcommit">[master 69825ca]</span> Fixed issue #8 where
	memory allocated via the independent_*() functions was being incorrectly
	identified as system allocated. Thanks to Geri for reporting this.</li>
	<li><span class="gitcommit">[master a6a0dec]</span> Fixed issue #10 where
    a failure to allocate memory on POSIX was not being trapped correctly. Thanks
    to btaudul for reporting this.</li>
	<li><span class="gitcommit">[master a559f9e]</span> Fixed issue #12 where
	RTLD_DEFAULT was undefined. Replaced this code entirely with new code which
    parses /proc/meminfo for huge page size. Thanks to Geri for reporting this.</li>
	<li><span class="gitcommit">[master 9119158]</span> Added Travis CI build bot
  support to nedalloc, testing gcc, clang and clang static analyser.</li>
	<li><span class="gitcommit">[master xxxxxxx]</span> Fixed issue #14 where
	nedalloc was using is_pod&lt;&gt; instead of is_trivially_copyable&lt;&gt;.
  Thanks to JustSid for reporting this.</li>
</ul>
<h3>v1.10 beta 3 17th July 2012:</h3>
<ul>
	<li><span class="gitcommit">[master 5f26c1a]</span> Due to a bug introduced
    in sha 7a9dd5c (17th April 2010), nedmalloc has never allocated more than a
    single mspace when using the system pool. This effectively had disabled
    concurrency for any allocation &gt; THREADCACHEMAX (8KB) which no doubt made
    nedmalloc v1.10 betas 1 and 2 appear no faster than system allocators. My
    thanks to the eagle eyes of Gavin Lambert for spotting this.</li>
</ul>
<h3>v1.10 beta 2 10th July 2012:</h3>
<ul>
	<li><span class="gitcommit">[master 51ab2a2]</span> scons now tests for C++0x
	support before turning it on and tries multiple libraries for clock_gettime()
	rather than assuming it lives in librt. This ought to fix miscompilation on
	Mac OS X. Thanks to Robert D. Blanchet Jr. for reporting this.</li>
	<li><span class="gitcommit">[master b2c3517]</span> Mac defines malloc_size
    to be const void *ptr, not void *ptr</li>
	<li><span class="gitcommit">[master 9333e50]</span> Updated to use the new
    O(1) Cfind(rounds=1) feature in nedtries</li>
	<li><span class="gitcommit">[master 54c7e44]</span> Avoid overflowing allocation
    size. Thanks to Xi Wang for supplying a patch fixing this.</li>
	<li><span class="gitcommit">[master 5b614a0]</span> Removed __try1 and __finally1
    from MinGW support as x64 target no longer supports SEH. Thanks to Geri for
    reporting this.</li>
	<li><span class="gitcommit">[master 48f1aa9]</span> Tidied up bitrot which
	had broken compilation due to mismatched #if...#endif.</li>
</ul>
<h3>v1.10 beta 1 19th May 2011:</h3>
<ul>
	<li><span class="gitcommit">[master 89f1806]</span> Moved from SVN to Git. Bumped
	version to v1.10 as new ARA contract will involve significant further improvements
	mainly centering around realloc() performance.</li>
	<li><span class="gitcommit">[master 254fe7c]</span> Added nedmemsize() for API
	compatibility with other allocators. Added DEFAULTMAXTHREADSINPOOL and set it
	to FOUR which is a BREAKING CHANGE from previous versions of nedalloc (which
	set it to 16).</li>
	<li><span class="gitcommit">[nedmalloc_fast_realloc 97d1420]</span> Added win32mremap()
	implementation.</li>
	<li><span class="gitcommit">[nedmalloc_fast_realloc 8a1001e]</span> Significantly
	improved test.c with new test options TESTCPLUSPLUS, BLOCKSIZE, TESTTYPE and
	MAXMEMORY.</li>
	<li><span class="gitcommit">[nedmalloc_fast_realloc 7ea606d]</span> Implemented
	two variants of direct mremap() on Windows, one using file mappings and the
	other using over-reservation. The former is used on 32 bit and the latter on
	64 bit.</li>
	<li><span class="gitcommit">[nedmalloc_fast_realloc 26ff9a7]</span> Added the
	malloc2() interface to nedalloc.</li>
	<li><span class="gitcommit">[nedmalloc_fast_realloc 5bc5d97]</span> Rewrote
	Readme.txt to become Readme.html which makes it much clearer to read.</li>
	<li><span class="gitcommit">[nedmalloc_fast_realloc 2efa595]</span> Added doxygen
	markup to nedmalloc.h and a first go at a policy driven STL allocator class.</li>
	<li><span class="gitcommit">[nedmalloc_fast_realloc d851bde]</span> Added a
	CHM documenting the nedalloc API.</li>
	<li><span class="gitcommit">[nedmalloc_fast_realloc dbd3991]</span> Added a
	fast malloc operations logger which outputs a CSV log on process exit.</li>
	<li><span class="gitcommit">[nedmalloc_fast_realloc d6a8585]</span> Added stack
	backtracing to the logger.</li>
	<li><span class="gitcommit">[master c7ea06d]</span> Finished user mode page
	allocator, so merged nedmalloc_fast_realloc branch.</li>
	<li><span class="gitcommit">[master 9a8800f]</span> Fixed small bug which was
	preventing the windows patcher from correctly finding the proper MSVCRT.</li>
	<li><span class="gitcommit">[master 37c58b1]</span> Fixed leak of mutexes when
	using pthread or win32 mutexes as locks. Thanks to Gavin Lambert for reporting
	this.</li>
	<li><span class="gitcommit">[master f67e284]</span> Fixed nedflushlogs() not
	actually flushing data and/or causing a segfault. Thanks to Roman Tatkin for
	reporting this.</li>
	<li><span class="gitcommit">[master 1324bf3]</span> Finally got round to retiring
	the MSVC project files as they were sources of never ending hassle due to being
	out of sync with the SConstruct config. Rebuilt scons build system to be fully
	compatible with MSVC instead (long overdue!)</li>
	<li><span class="gitcommit">[master 068494e]</span> As the release of v1.10
	RC1 approaches, fixed a long standing problem with the binary patcher where
	multiple MSVCRT versions in the process weren&#39;t handled - everything was sent
	to one MSVCRT only, and needless to say that sorta worked sometimes and sometimes
	not. Now when nedmalloc passes a foreign block to the system allocator, it runs
	a stack backtrace to figure out which MSVCRT in the process it ought to pass
	it to. It&#39;s slow, but fixes a very common segfault on process exit on VS2010.</li>
	<li><span class="gitcommit">[master 4cca52c]</span> Very embarrassingly, nedmalloc
	has been severely but unpredictably broken on POSIX for over a year now when built with DEBUG defined.
	This was turning on DEFAULT_GRANULARITY_ALIGNED whose POSIX implementation
	was causing random segfaults so mysterious that neither gdb nor valgrind
	could pick them up - in other words, the very worst kind of memory
	corruption: undetectable, untraceable and undebuggable. I only found them
	myself due to a recent bug report for TnFOX on POSIX where due to luck, very
	recent Linux kernels just happened by pure accident to cause this bug to
	manifest itself as preventing process init right at the very start - so
	early that no debugger could attach. After over a week of trial &amp; error I
	narrowed it down to being somewhere in nedmalloc, then having something to
	do with DEBUG being defined or not, then two hours ago the eureka moment
	arrived and I quite literally did a jig around the room in joy. Problem is
	now fixed thank the heavens!!!</li>
	<li><span class="gitcommit">[master 3d55a01]</span> Fixed a problem where the
	binary patcher was exiting its search too early and therefore failing to patch all
	the binaries properly. It would seem that the Microsoft linker doesn't sort
	the import table like I had thought it did - I would guess it sorts per DLL
	location, otherwise is unsorted. Thanks to Roman Tatkin for reporting this bug.</li>
	<li><span class="gitcommit">[master 6c74071]</span> Added override of _GNU_SOURCE
	for when HAVE_MREMAP is auto-detected. Thanks to Maxim Zakharov for reporting
	this issue.</li>
	<li><span class="gitcommit">[master dee2d27]</span> Marked off the v2 malloc API
    as deprecated in preparation for beta release. Updated CHM documentation.</li>
</ul>
<h3>v1.06 beta 2 21st March 2010:</h3>
<ul>
	<li>{ 1153 } Added detection of whether host process is using MSVCRT or MSVCRTD
	and the fixing up of which runtime tolerant nedalloc should use if nedalloc
	was linked differently. This ought to save a great deal of hassle later on by
	preventing failed-to-RTM user bug reports :)</li>
	<li>{ 1154 } Fixed nedalloc trying to use MLOCK_T even when USE_LOCKS=0. Thanks
	to Ariel Manzur for reporting this.</li>
	<li>{ 1155 } Fixed USE_SPIN_LOCKS=0 not compiling on Windows.</li>
	<li>{ 1157 } Fixed bug where foreign blocks entering the threadcache weren&#39;t
	being marked as such, thus typically causing a segfault on process exit.</li>
	<li>{ 1158 } Fixed compilation problems on mingw. Thanks to Amanieu d&#39;Antras
	for reporting these.</li>
	<li>{ 1159 } Released as beta2.</li>
</ul>
<h3>v1.06 beta 1 13th January 2010:</h3>
<ul>
	<li>{ 1079 } Fixed misdeclaration of struct mallinfo as C++ type. Thanks to
	James Mansion for reporting this.</li>
	<li>{ 1082 } Fixed dlmalloc bug which caused header corruption of mmap() allocations
	when running under multiple threads.</li>
	<li>{ 1088 } Fixed assertion failure for nedblksize() with latest dlmalloc.
	Thanks to Anteru for reporting this.</li>
	<li>{ 1088 } Added neddestroysyspool(). Thanks to Lars Wehmeyer for suggesting
	this.</li>
	<li>{ 1088 } Fixed thread id high bit set bug causing SIGABRT on Mac OS X. Thanks
	to Chris Dillman for reporting this.</li>
	<li>{ 1094 } Integrated dlmalloc v2.8.4 final.</li>
	<li>{ 1095 } Added nedtrimthreadcache(). Thanks to Hayim Hendeles for suggesting
	this.</li>
	<li>{ 1095 } Fixed silly assertion of null pointer dereference. Thanks to Ullrich
	Heinemann for reporting this.</li>
	<li>{ 1096 } Fixed lots of level 4 warnings on MSVC. Thanks to Anteru for suggesting
	this.</li>
	<li>{ 1098 } Improved non-nedalloc block detection to 6.25% probability of being
	wrong. Thanks to Applied Research Associates for sponsoring this.</li>
	<li>{ 1099 } Added USE_MAGIC_HEADERS which allows nedalloc to handle freeing
	a system allocated block. Added USE_ALLOCATOR which allows the changing of which
	backend allocator to use (with choices between the system allocator and dlmalloc
	- choosing the system allocator is intended for debug situations only e.g. valgrind).
	Thanks to Applied Research Associates for sponsoring this.</li>
	<li>{ 1105 } Added ability to build nedalloc as a DLL. Added support for a run
	time PE binary patcher which can patch all usage of the system allocator replacing
	it with nedalloc. Thanks to Applied Research Associates for sponsoring this.</li>
	<li>{ 1108 } Added patcher loader which can load any arbitrary program injecting
	the nedalloc DLL which then patches in its replacement for the system allocator.
	Doesn&#39;t work on all programs, but does on most e.g. Microsoft Word. Thanks to
	Applied Research Associates for sponsoring this.</li>
	<li>{ 1116 } Finished debugging and optimising the latest additions to the codebase.
	The patcher now works well on x64 as well as x86. Added support for large pages
	on Windows. Thanks to Applied Research Associates for sponsoring this.</li>
	<li>{ 1125 } Added nedpoollist() which returns a snapshot of the currently
	existing nedpools. The Windows DLL thread exit code now disables the thread cache for
	all currently existing nedpools. Thanks to Applied Research Associates for
	sponsoring this.</li>
	<li>{ 1126 } Added ENABLE_TOLERANT_NEDMALLOC which allows nedalloc to recognise
	system allocator blocks and to do the right thing with them.</li>
	<li>{ 1139 } Added link time code generation support for Windows builds. This
	currently yields no performance improvement on x64 (on MSVC9) but can add 15%
	to x86 performance (on MSVC9). Also added scons SConstruct and SConscript files.</li>
</ul>
<h3>v1.05 15th June 2008:</h3>
<ul>
	<li>{ 1042 } Added error check for TLSSET() and TLSFREE() macros. Thanks to
	Markus Elfring for reporting this.</li>
	<li>{ 1043 } Fixed a segfault when freeing memory allocated using nedindependent_comalloc().
	Thanks to Pavel Vozenilek for reporting this.</li>
</ul>
<h3>v1.04 14th July 2007:</h3>
<ul>
	<li>Fixed a bug with the new optimised implementation that failed to lock on
	a realloc under certain conditions.</li>
	<li>Fixed lack of thread synchronisation in InitPool() causing pool corruption.</li>
	<li>Fixed a memory leak of thread cache contents on disabling. Thanks to Earl
	Chew for reporting this.</li>
	<li>Added a sanity check for freed blocks being valid.</li>
	<li>Reworked test.c into a torture test.</li>
	<li>Fixed GCC assembler optimisation misspecification.</li>
</ul>
<h3>v1.04alpha_svn915 7th October 2006:</h3>
<ul>
	<li>Fixed failure to unlock thread cache list if allocating a new list failed.
	Thanks to Dmitry Chichkov for reporting this. Further thanks to Aleksey Sanin.</li>
	<li>Fixed realloc(0, &lt;size&gt;) segfaulting. Thanks to Dmitry Chichkov for reporting
	this.</li>
	<li>Made config defines #ifndef so they can be overridden by the build system.
	Thanks to Aleksey Sanin for suggesting this.</li>
	<li>Fixed deadlock in nedprealloc() due to unnecessary locking of preferred
	thread mspace when mspace_realloc() always uses the original block&#39;s mspace
	anyway. Thanks to Aleksey Sanin for reporting this.</li>
	<li>Made some speed improvements by hacking mspace_malloc() to no longer lock
	its mspace, thus allowing the recursive mutex implementation to be removed with
	an associated speed increase. Thanks to Aleksey Sanin for suggesting this.</li>
	<li>Fixed a bug where the number of mspaces allocated could overrun its maximum limit. Thanks to Aleksey
	Sanin for reporting this.</li>
</ul>
<h3>v1.03 10th July 2006:</h3>
<ul>
	<li>Fixed memory corruption bug in threadcache code which only appeared with
	&gt;4 threads and in heavy use of the threadcache.</li>
</ul>
<h3>v1.02 15th May 2006:</h3>
<ul>
	<li>Integrated dlmalloc v2.8.4, fixing the win32 memory release problem and
	improving performance still further. Speed is now up to twice the speed of v1.01
	(average is 67% faster).</li>
	<li>Fixed win32 critical section implementation. Thanks to Pavel Kuznetsov for
	reporting this.</li>
	<li>Fixed failure to lock an mspace when all mspaces were already locked. Thanks to Pavel Kuznetsov
	for reporting this.</li>
	<li>Added Apple Mac OS X support.</li>
</ul>
<h3>v1.01 24th February 2006:</h3>
<ul>
	<li>Fixed multiprocessor scaling problems by removing sources of cache sloshing.</li>
	<li>Earl Chew &lt;earl_chew &lt;at&gt; agilent &lt;dot&gt; com&gt; sent patches for the following:
	<ol>
		<li>size2binidx() wasn&#39;t working for the default code path (non-x86).</li>
		<li>Fixed failure to release mspace lock under certain circumstances which
		caused a deadlock.</li>
	</ol>
	</li>
</ul>
<h3>v1.00 1st January 2006:</h3>
<ul>
	<li>First release</li>
</ul>

</body>

</html>