akx · 2026-05-19T06:51:26 1779173486

> Typically, other archives like .tar.bz2 can be smaller. But those aren’t backwards-compatible!

Is there any point for (new) .bz2 archives in the era of Zstd?

j16sdiz · 2026-05-19T07:53:22 1779177202

Tooling?

It took years for bzip2 to land in every Linux distro, and we are _still_ using gzip.

LZMA / xz tools are starting to get more support, but they are nowhere near universal.

No idea how long zstd will need.

strenholme · 2026-05-19T09:15:18 1779182118

xz is pretty universal across POSIX and clones though. It comes with any modern Linux distro, and Busybox even has an .xz decompressor, so `tar xvJf file.tar.xz` does the right thing in *NIX land, which I presume includes macOS with Brew.

For Windows systems, 7-zip (.7z, similar compression to .xz) is a free download for Windows 10, and Windows 11 can open up a .7z file with a simple double click.

.zip and .gz no longer need to be used here in 2026.

lstodd · 2026-05-19T10:34:04 1779186844

.zip is used as a seekable container with some compression. There is no replacement comparable in simplicity. 7z is overcomplicated, compressed tar is not seekable.

.gz/deflate is used when something very cheap and very fast is needed. xz/lzma is quite often too slow or requires too much memory even on decompression.

so no, .zip and .gz are very much needed in 2026.
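
To illustrate the "seekable container" point: a minimal sketch with Python's standard `zipfile` module, using an in-memory archive and made-up member names (`a.txt`, `b.txt`). The central directory lets a reader jump straight to one member without decompressing the ones before it, which a compressed tar stream cannot do.

```python
import io
import zipfile

# Build a small archive in memory (member names are hypothetical).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", "first")
    zf.writestr("b.txt", "second")

# Read only the second member; the first is never decompressed.
with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
    second = zf.read("b.txt")
```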

adapiz · 2026-05-19T11:23:59 1779189839

Compared to xz and even parallel xz, gzip and parallel gzip are simply better when speed matters more. The compression is not superior, but it is already good compared with the uncompressed data. For long-term storage it makes sense to invest the extra time for better compression, but if it's about transfer time, you might end up with an overall longer processing time instead of just a longer transfer time because of a worse compression ratio. It's like with image formats: pick the right one for your use case.
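
If you want to check the trade-off on your own data rather than take anyone's word for it, here is a rough sketch using only the standard library. `compare_codecs` is a hypothetical helper name; zstd is only included when `compression.zstd` is importable (Python 3.14+), and all codecs run at their default levels, so treat the numbers as a first impression, not a benchmark.

```python
import bz2
import gzip
import lzma
import time

try:  # compression.zstd only exists on Python 3.14+
    from compression import zstd
except ImportError:
    zstd = None


def compare_codecs(data: bytes) -> dict:
    """Return {codec name: (compressed size in bytes, seconds taken)}."""
    compressors = {
        "gzip": gzip.compress,
        "bz2": bz2.compress,
        "xz": lzma.compress,
    }
    if zstd is not None:
        compressors["zstd"] = zstd.compress

    results = {}
    for name, compress in compressors.items():
        start = time.perf_counter()
        packed = compress(data)  # default compression level for each codec
        results[name] = (len(packed), time.perf_counter() - start)
    return results
```

Calling `compare_codecs(open(path, "rb").read())` on a representative file gives one size/time pair per codec to compare.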

MrDrMcCoy · 2026-05-19T17:55:21 1779213321

If you add zstd to the comparison matrix, it wins on both speed and compression ratio. Its adoption is quickly catching up to xz as a result, and I expect it to approach gzip in availability in a few years.

Dwedit · 2026-05-19T13:13:17 1779196397

GitHub won't let you upload a 7z file as an attachment for the issue tracker. Thus forcing me to use an inferior and obsolete compression format.

jgalt212 · 2026-05-19T12:10:29 1779192629

gzip is very fast, universally supported, and good enough. It will be around forever.

You need Python 3.14 for zstd in the standard library.

yjftsjthsd-h · 2026-05-19T14:27:51 1779200871

Zstd is implemented in C?

akx · 2026-05-27T06:14:02 1779862442

To be fair, GNU tar 1.31, released 2019-01-02, added `--zstd` (and automatic .zst detection). So that support has been there for a good 7 years and 4 months at the time of writing.

Am4TIfIsER0ppos · 2026-05-19T08:55:33 1779180933

Debian? Did they discover it yet?

sigio · 2026-05-19T10:18:30 1779185910

I think it's been in since Debian 11... at least 12. It's been in my default Ansible playbooks for a while.

akx · 2026-03-27T19:19:26 1774639166

Sounds like you're not familiar with https://docs.astral.sh/uv/ ...

duskdozer · 2026-03-28T12:13:22 1774700002

It sounds to me like they are: `You know they've given up on backward comparability and version control, when the solution is: run everything in a VM, with its own installation.`

uv taking over basically ensures that dependencies won't become managed properly and nothing will work without uv

akx · 2026-03-31T06:17:23 1774937843

What do you mean with "basically ensures that dependencies won't become managed properly"?

akx · 2026-03-27T12:31:24 1774614684

This doesn't do the same thing though, since it's not Unicode aware.

    >>> import re
    >>> 'x\u2009   a'.split()
    ['x', 'a']
    # incorrect; in bytes mode, `\S` doesn't know about unicode whitespace
    >>> list(re.finditer(br'\S+', 'x\u2009   a'.encode()))
    [<re.Match object; span=(0, 4), match=b'x\xe2\x80\x89'>, <re.Match object; span=(7, 8), match=b'a'>]
    # correct, in unicode mode
    >>> list(re.finditer(r'\S+', 'x\u2009   a'))
    [<re.Match object; span=(0, 1), match='x'>, <re.Match object; span=(5, 6), match='a'>]
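
The same difference shows up without regexes, as a small sketch: `bytes.split()` only splits on ASCII whitespace, while `str.split()` also treats U+2009 (thin space) as a separator. The variable names here are just for illustration.

```python
text = "x\u2009   a"

# str.split() treats U+2009 as whitespace.
unicode_words = text.split()

# bytes.split() only knows ASCII whitespace, so the thin space
# stays glued to the first word.
ascii_words = [w.decode("utf-8") for w in text.encode("utf-8").split()]
```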

est · 2026-03-27T15:44:00 1774626240

OP's .split_ascii() doesn't handle U+2009 either.

edit: OP's fully native C++ version using Pystd

zahlman · 2026-03-27T15:57:36 1774627056

Hmm? Which code are you looking at?

contravariant · 2026-03-27T13:04:40 1774616680

There's bound to be a way to turn a stream of bytes into a stream of unicode code points (at least I think that's what python is doing for strings). Though I'm explicitly not volunteering to write the code for it.

est · 2026-03-27T15:20:01 1774624801

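    import codecs
    import mmap
    from collections import Counter

    def word_count(filepath):
        freq = Counter()
        # Incremental decoder: copes with multi-byte characters split across chunks.
        decode = codecs.getincrementaldecoder('utf-8')().decode
        with open(filepath, 'rb') as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Caveat: a word straddling a chunk boundary is counted as two pieces.
            for chunk in iter(lambda: mm.read(65536), b''):
                freq.update(decode(chunk).split())
            # Flush any bytes still buffered in the decoder.
            freq.update(decode(b'', final=True).split())
        return freq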

contravariant · 2026-03-29T16:27:01 1774801621

Oh that's neat, though I might split this into two functions in most cases, no need to entangle opening the file and counting the words in a filelike object.

That's two neat tricks that I'm definitely adding to my bag of python trickery.
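
A rough sketch of that two-function split, with hypothetical names (`iter_words`, `count_words`, `count_words_in_file`). One thing a per-chunk `.split()` can do is cut a word in two when it straddles a chunk boundary, so this version holds back the last word of each chunk until it sees whether the next chunk continues it. It assumes a binary file-like object with a `.read(n)` method.

```python
import codecs
import io
from collections import Counter


def iter_words(chunks, encoding="utf-8"):
    """Yield whitespace-separated words from an iterable of byte chunks."""
    decoder = codecs.getincrementaldecoder(encoding)()
    tail = ""  # possibly unfinished word carried over from the previous chunk
    for chunk in chunks:
        text = tail + decoder.decode(chunk)
        words = text.split()
        # If the text does not end in whitespace, the last word may continue
        # in the next chunk, so hold it back instead of yielding it.
        if words and not text[-1].isspace():
            tail = words.pop()
        else:
            tail = ""
        yield from words
    # Flush the decoder and whatever word was still being held back.
    yield from (tail + decoder.decode(b"", final=True)).split()


def count_words(fileobj, chunk_size=65536):
    """Count words in any binary file-like object."""
    chunks = iter(lambda: fileobj.read(chunk_size), b"")
    return Counter(iter_words(chunks))


def count_words_in_file(path):
    """Open the file and delegate to count_words."""
    with open(path, "rb") as f:
        return count_words(f)
```

Because `count_words` only needs `.read()`, it can be exercised with `io.BytesIO(...)` and a tiny `chunk_size` to test the boundary handling.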

zahlman · 2026-03-27T15:51:15 1774626675

Sure, but making one string from the file contents is surely much better than having a separate string per word in the original data.

... Ah, but I suppose the existing code hasn't avoided that anyway. (It's also creating regex match objects, but those get disposed each time through the loop.) I don't know that there's really a way around that. Given the file is barely a KB, I rather doubt that the illustrated techniques are going to move the needle.

In fact, it looks as though the entire data structure (whether a dict, Counter etc.) should be a relatively small part of the total reported memory usage. The rest seems to be internal Python stuff.

contravariant · 2026-03-29T16:19:31 1774801171

I dislike loading files into memory entirely, in fact I consider avoiding that one of the few interesting problems here (the other problem being the issue of counting words in a stream of bytes, without converting the whole thing to a string).

If you don't care about efficiency you can just do len(set(text.split())), but that's barely worth making a function for.

akx · 2026-02-24T07:08:48 1771916928

This inspired me to update my over-a-decade-old glitch generator https://akx.github.io/glitch2/ with camera input and a handful of new modules.

akx · 2026-01-16T07:39:10 1768549150

It's pretty good. And for once, a software-engineering-ly high-quality codebase, too!

All too often, new models' codebases are just a dump of code that installs half the universe in dependencies for no reason, etc.

akx · 2025-12-29T07:49:38 1766994578

If someone's curious about those particular constants, they're the PAL Y' matrix coefficients: https://en.wikipedia.org/wiki/Y%E2%80%B2UV#SDTV_with_BT.470
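
For reference, a tiny sketch of what those weights do, assuming the constants in question are the BT.470 luma weights 0.299, 0.587 and 0.114 from the linked section. The function name `luma` and the 0..1 input range are my own choices for illustration.

```python
# BT.470 luma weights for R', G', B' (they sum to 1.0).
BT470_LUMA_WEIGHTS = (0.299, 0.587, 0.114)


def luma(r, g, b):
    """Weighted sum of gamma-encoded R'G'B' components in the 0..1 range."""
    wr, wg, wb = BT470_LUMA_WEIGHTS
    return wr * r + wg * g + wb * b
```

Green dominates the result, so `luma(0, 1, 0)` is much brighter than `luma(0, 0, 1)`.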

akx · 2025-11-11T15:02:19 1762873339

You'd think there was a LUT you could apply to the digital copies during playback to make it look (more) like the original...

akx · 2025-08-25T11:29:02 1756121342

These days, once you have https://docs.astral.sh/uv/ installed, `uvx --from service-ping-sping sping` is pretty much zero effort to run this software.

1718627440 · 2025-08-25T20:18:55 1756153135

Except if you don't like pulling random programs from the internet.

akx · 2025-08-22T13:52:19 1755870739

> Is `uv format` supposed to be an alias for `ruff check`?

I'd imagine not, since `ruff format` and `ruff check` are separate things too.

godelski · 2025-08-22T18:31:12 1755887472

That makes some more sense. I think I just misunderstood what Charlie was saying above.

But I'll also add another suggestion/ask. I think this could be improved

  $ ruff format --help
  Run the Ruff formatter on the given files or directories
  $ uv format --help
  Format Python code in the project

I think just a little more can go a long way. When --help is the docs instead of man I think there needs to be a bit more description. Just something like this tells users a lot more

  $ ruff format --help
  Formats the specified files. Acts as a drop-in replacement for black. 
  $ uv format --help
  Experimental uv formatting. Alias to `ruff format`

I think man/help pages are underappreciated. I know I'm not the only one that discovers new capabilities by reading them. Or even the double tab because I can't remember the flag name but see something I didn't notice before. Or maybe I did notice before but since the tool was new I focused on main features first. Having the ability to read enough information to figure out what these things do then and there really speeds up usage. When the help lines don't say much I often never explore them (unless there's some gut feeling). I know the browser exists, but when I'm using the tool I'm not in the browser.

akx · 2025-08-06T14:30:04 1754490604

This is a fun model for circuit-bending, because the voice style vectors are pretty small.

For instance, try adding `np.random.shuffle(ref_s[0])` after the line `ref_s = self.voices[voice]`...

EDIT: be careful with your system volume settings if you do this.