ack_complete · 2026-06-01T23:26:26 1780356386

There is an analogous situation in graphics with signed normalized formats. The solution there is that the R16_SNORM format maps [-1, +1] to [-32767, 32767], with -32768 being a special value (not normally emitted, and mostly but not always interpreted as -32767). Some audio storage formats seem to use this mapping too.
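A minimal sketch of that symmetric mapping in Go, for illustration only. The function names are made up here, and the rounding and clamping choices are assumptions rather than the exact rules of any particular graphics API:

```go
package main

import (
	"fmt"
	"math"
)

// encodeSnorm16 (hypothetical name) maps a float in [-1, 1] to [-32767, 32767].
// Out-of-range inputs are clamped, so -32768 is never produced.
func encodeSnorm16(x float64) int16 {
	if math.IsNaN(x) {
		return 0 // assumption: treat NaN as zero
	}
	x = math.Max(-1, math.Min(1, x))
	return int16(math.Round(x * 32767))
}

// decodeSnorm16 (hypothetical name) maps back to [-1, 1].
// The special value -32768 is clamped so that it decodes like -32767.
func decodeSnorm16(v int16) float64 {
	return math.Max(float64(v)/32767, -1)
}

func main() {
	fmt.Println(encodeSnorm16(-1), encodeSnorm16(0), encodeSnorm16(1))
	fmt.Println(decodeSnorm16(-32768), decodeSnorm16(-32767), decodeSnorm16(32767))
}
```

The point of the symmetric range is that 0.0 encodes exactly to 0 and -x always encodes to the negation of x, at the cost of one unused code.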

ack_complete · 2026-05-07T02:48:24 1778122104

It's both. Originally Visual C++ binaries built for DLL-based C runtime relied on MSVCRT.DLL and that was installed by the redist. Starting with Visual Studio .NET 2002, separate CRT DLLs starting with MSVCR70.DLL were used. MSVCRT.DLL is now part of Windows to support parts of the OS itself and for compatibility with programs that still use it. I think some versions of MinGW also use MSVCRT.

Current versions of the OS ship with functions in MSVCRT.DLL that weren't in the last VC6 version, such as the updated C++ exception handler (__CxxFrameHandler4). AFAIK, there is no redistributable version of it, it's unique to the OS.

pjmlp · 2026-05-07T05:40:52 1778132452

That is for backwards compatibility. The now finally official C standard library distributed with the OS, since Windows 10, is the UCRT (Universal C Runtime).

ack_complete · 2026-05-02T18:24:59 1777746299

That's with a simple data operation and using a recent x86 vector ISA (AVX-512) that is only available on some systems, notably excluding any current Intel desktop CPU.

The real killer isn't the data operations, though, it's if the overflow checks interfere with converting the loop logic or data addressing to vectorizable form. Indexing with 32-bit signed int vs. unsigned int on a 64-bit platform in C is a classic case -- with unsigned the compiler cannot assume that addressing offsets don't wrap, which then prevents coalescing data accesses into vector loads and stores.

ack_complete · 2026-04-30T15:44:46 1777563886

Note that CPUs have also gotten dramatically wider in both execution width and vector capability since you were a teenager. The increased throughput shifts the balance more toward being able to burn operations to reduce dependency chains. It's possible for your idea to have been both non-viable on the CPUs at the time and more viable on CPUs now.

ack_complete · 2026-04-22T15:35:48 1776872148

Two issues.

First, regarding application compatibility: the heap was already changed once prior to the segment heap. The Low Fragmentation Heap (LFH) was added in XP and made default in Vista, with applications no longer having to opt into it:

https://learn.microsoft.com/en-us/windows/win32/memory/low-f...

Second, the segment heap has different tradeoffs that make it not a guaranteed win to swap in: it trades off performance for working set.

https://issues.chromium.org/issues/40138716

kh9000 · 2026-04-22T18:07:51 1776881271

It's complicated. It's not always a straightforward space vs time tradeoff. For chromium's allocation patterns, it sounds like segment heap was slower. But BinaryNinja reported the opposite! See https://github.com/Vector35/binaryninja-api/issues/2778

Side note on the Chromium topic: Google Chrome decided NT Heap is still best for their usage, but Microsoft Edge, which is also built on Chromium, uses segment heap. Not sure what Firefox uses. You can check by attaching WinDbg and doing !heap. Note that not every heap will be segment heap, even if you globally opt into segment heap. Some code paths explicitly create their own heaps as NT heaps.

At the very least, using fewer pages to allocate the same amount of data improves memory locality slightly. Folks should test and see what works best in their applications.

Another benefit of segment heap that we haven't discussed yet is that it's more strict and proactive about detecting problems and terminating. From what I understand, heap metadata is now stored separately from heap data, and they use guard pages. So heap buffer overruns don't overwrite the heap manager's bookkeeping. With NT heap, crashes due to use-after-free might manifest much later and more indirectly. Like, maybe it overwrote the free list, or it overwrote some newer allocation that landed on the same address. So, the crash is usually in some unlucky 'innocent bystander' call stack that worked with the corrupted region. With segment heap, you tend to get earlier, more actionable, specific crashing call stacks, closer to the site of the original bug. So, if you're an engineer who looks at a lot of difficult Windows crash dumps involving memory corruption, segment heap makes the challenge slightly more surmountable.

garganzol · 2026-04-22T20:54:01 1776891241

> Segment heap is more strict and proactive about detecting problems and terminating

I definitely noted that in my tests. Under load, machines with flaky RAM show higher memory access violation rates with segment heap than with NT Heap.

ack_complete · 2026-04-21T02:25:52 1776738352

The REP MOVS series of instructions has an interesting history due to the advantages and disadvantages of microcode and its shifting performance relative to manual code with each CPU generation. It has long been great for aligned large copies due to the microcode having access to cache-wide copies, but until recently struggled with small copies. Apparently, one of the reasons is a lack of branch prediction in microcode:

https://stackoverflow.com/questions/33902068/what-setup-does...

Non-temporal stores are tricky performance-wise. They can be dramatically faster than normal stores (~3x), they may be faster on some generations of CPUs than others, they may be slower if subsequent code needs the destination in the CPU cache, and even for GPUs they may not be ideal if an iGPU is sharing part of the cache hierarchy with the CPU. But the worst issue is that occasionally a specific CPU will have some random pathological behavior with them. IIRC, masked non-temporal stores were horrifically slow on some AMD APUs, on the order of hundreds to thousands of cycles per instruction. I find it hard to recommend them much anymore.

ack_complete · 2026-04-18T19:17:00 1776539820

This case actually works because for finite numbers of a given sign, the integer bit representations are monotonic with the value due to the placement of the exponent and mantissa fields and the implicit mantissa bit. For instance, 1.0 in IEEE float is 0x3F800000, and the next immediate representable value below it 1.0-e is 0x3F7FFFFF.

Signed zero and the sign-magnitude representation are more of an issue, but can be resolved by XORing the sign bit into the mantissa and exponent fields, flipping the negative range. This places -0 adjacent to 0 which is typically enough, and can be fixed up for minimal additional cost (another subtract).

kbolino · 2026-04-18T20:31:14 1776544274

I interpreted OP's "bit-cast to integer, strip few least significant bits and then compare for equality" message as suggesting this kind of comparison (Go):

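  // Requires: import "math"
  // equiv reports whether x and y have identical bits once the low
  // ignoreBits bits of each representation are masked off.
  func equiv(x, y float32, ignoreBits int) bool {
      mask := uint32(0xFFFFFFFF) << ignoreBits
      xi, yi := math.Float32bits(x), math.Float32bits(y)
      return xi&mask == yi&mask
  }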

with the sensitivity controlled by ignoreBits, higher values being less sensitive.

Supposing y is 1.0 and x is the predecessor of 1.0, the smallest value of ignoreBits for which equiv would return true is 24.

But a worst case example is found at the very next power of 2, 2.0 (bitwise 0x40000000), whose predecessor is quite different (bitwise 0x3FFFFFFF). In this case, you'd have to set ignoreBits to 31, and thus equivalence here is no better than checking that the two numbers have the same sign.

ack_complete · 2026-04-19T02:34:28 1776566068

Yeah, that's effectively quantization, which will not work for general tolerance checks where you'd convert float similarity to int similarity.

There are cases where the quantization method is useful, hashing/binning floats being an example. Standard similarity checks don't work there because of lack of transitivity. But that's fundamentally a different operation than is-similar.

ack_complete · 2026-04-06T19:22:57 1775503377

Eh, WinForms did a lot to make Win32 UI accessible and usable -- especially layout and easy customization -- but I have to differ on the cross-language story. It was great, IF you were making primarily a C# program that happened to use some C/C++ components.

From the native code side, it was not so great. The .NET 2.0 CLR had very poor support for hosting from the native side and really wanted you to make a program that was .NET first; it didn't work well if you wanted something like primarily a C++ program that hosted a C# UI in the same process. Reverse P/Invoke via native exports wasn't exposed, so creating DLLs for consumption by non-.NET programs was difficult. Mixed mode debugging was and still is painful, with the debugger being glacially slow at some operations like OutputDebugString() processing and blocking some native features like data breakpoints, and the CLR eating access violation exceptions from native code so they couldn't be debugged properly. Build-wise, we had to ban C++/CLI assemblies depending on C# assemblies because the C# project system didn't handle incremental builds properly and forced the dependent C++ assembly to rebuild all the time.

These issues still largely exist and also affect WPF. It's a great UI framework, but it's unusable unless your front end is primarily a C# program.

ack_complete · 2026-04-06T15:57:30 1775491050

WPF originally had two major rendering issues. One was the lack of pixel snapping support, and another was gamma correction issues during text rendering, particularly for light text on a dark background (due to an alpha correction approximation, IIRC). The two combined led to blurry text in WPF applications.

These were finally improved for WPF 4, since Visual Studio 2010 switched to it and had a near riot in the betas due to the poor rendering in the text editor.

ack_complete · 2026-04-06T15:39:48 1775489988

The main reason Win32 can't handle automatic background suspension or low-power push notifications is simply that those features haven't been exposed to it. There's nothing preventing a Win32 program from receiving those types of notifications and then being force-ended by the OS if it doesn't respond in time.

When I first started porting programs to Windows ARM64, I didn't have an ARM64 device and had to test in QEMU. It ran extremely slowly, probably 1/50th of real time. All UWP programs like Calculator ran like a slug. But which programs still ran reasonably? Classic WinDbg and Task Manager. Two programs that were still plain Win32.

There are significant issues with Win32, namely its lack of a permissions and isolation model and lack of hardware acceleration in the old windowing UI (User/GDI). But the idea that Win32 is inherently power inefficient is, IMO, just BS. Its roots go back to CPUs that were orders of magnitude slower than modern CPUs and there is nothing difficult about making a Win32 program that idles at 0% CPU when not in use.