You don't have to reinvent the wheel for your script: all of parallel's options are ready to use and well documented. It's also packed with features that would take a long time to write into your own Python script.
I am trying to use Python by default when writing scripts nowadays, but sometimes the best tool for the job isn't Python or writing your own Python.
IMO, effective "scripting" just means the ability to solve ad hoc problems easily by writing task-specific glue that delegates the hard parts of the program to (1) an effective set of libraries you've written yourself and (2) external code or tools when it makes sense.
From this perspective, the languages of the glue, the libraries, and the external code all matter less than the ease of writing the glue, interfacing with the external code, and maintaining the libraries. The best language for this probably comes down to a combination of what you're comfortable writing (and reading, and maintaining) and what kinds of tasks you're trying to solve.
For me personally, using Python glue and libraries strikes a pretty good balance here. Writing a script "in Python" doesn't mean you need to reinvent the wheel. If you think `parallel` provides a better interface for map-reduce parallelism than `subprocess` (or than a library function you've written on top of `subprocess`), no problem: you can just call `parallel` from Python (and you'll probably find yourself writing a library function on top of it to abstract away the fact that it's an external Perl script).
But if you're much more effective working in Bash than Python, then writing your glue and developing your libraries in Bash could be the way to go.
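To make the "call `parallel` from Python behind a library function" idea concrete, here's a rough sketch. The names `build_parallel_argv` and `parallel_map` are made up for illustration, and it assumes the `parallel` on your PATH is GNU parallel (the moreutils tool of the same name takes different arguments). It only uses the `-j`, `-k` and `:::` syntax.

```python
import shutil
import subprocess


def build_parallel_argv(command, inputs, jobs=None, keep_order=True):
    """Build the argv for: parallel [-j N] [-k] command... ::: inputs...

    Hypothetical helper, kept separate so it can be checked without
    GNU parallel being installed.
    """
    argv = ["parallel"]
    if jobs is not None:
        argv += ["-j", str(jobs)]  # cap the number of simultaneous jobs
    if keep_order:
        argv.append("-k")  # emit output in input order
    argv += list(command) + [":::"] + [str(item) for item in inputs]
    return argv


def parallel_map(command, inputs, jobs=None):
    """Run `command` once per input via GNU parallel; return stdout lines.

    Assumption: the `parallel` found on PATH is GNU parallel.
    """
    if shutil.which("parallel") is None:
        raise RuntimeError("parallel not found on PATH")
    result = subprocess.run(
        build_parallel_argv(command, inputs, jobs=jobs),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.splitlines()
```

With that in place, something like `parallel_map(["gzip", "-9"], ["a.log", "b.log"], jobs=4)` reads like any other library call. One caveat: `parallel` hands the command to a shell, so inputs with unusual characters deserve some care.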
start a bunch of threads and e.g. invoke subprocess.run() from them
Done that many, many times and honestly combining Python with parallel is in many cases the best way to go. Write your Python script to be as fast as possible on one core and then use parallel to run it on all your cores. This has the added advantage that you can go from running on all the cores on your machine to running on all the cores on a 100-machine cluster by changing just a couple of lines of code.
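As a sketch of that "one core per invocation" shape: the script below handles exactly one input file and knows nothing about parallelism. The file name `count_words.py` and the word-counting task are just placeholders for whatever the real work is.

```python
import sys
from pathlib import Path


def count_words(path):
    """Count whitespace-separated words in a single file (placeholder task)."""
    return len(Path(path).read_text(encoding="utf-8").split())


def main(argv):
    # One file per invocation: the fan-out is left to the caller.
    if len(argv) != 2:
        print(f"usage: {argv[0]} FILE", file=sys.stderr)
        return 2
    print(f"{argv[1]}\t{count_words(argv[1])}")
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

Fanning it out is then the job of the shell, e.g. `parallel python3 count_words.py ::: *.txt`. As far as I know, spreading the same job over remote machines is mostly a matter of adding `-S` with a list of SSH hosts, plus getting the script and inputs onto those hosts.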
subprocess.run is likely to be significantly slower than a low-level dedicated utility like parallel, and it adds a lot of flakiness and overhead. I'm a big pythonaro but one should always use the best tool for the job.
E.g. in Python this would all be very easy to do. Just start a bunch of threads and e.g. invoke subprocess.run() from them.
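For reference, a minimal sketch of the threads-plus-`subprocess.run()` approach, using a thread pool rather than hand-started threads. The function name `run_all` and the default of 4 workers are arbitrary choices, not anything prescribed above.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_all(commands, max_workers=4):
    """Run each argv list as a subprocess, at most max_workers at a time."""

    def run_one(argv):
        # Threads are fine here: each one just blocks waiting on its child.
        return subprocess.run(argv, capture_output=True, text=True, check=False)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input commands.
        return list(pool.map(run_one, commands))


if __name__ == "__main__":
    # Demo: compute a few squares in separate Python processes.
    cmds = [[sys.executable, "-c", f"print({n} * {n})"] for n in range(5)]
    for result in run_all(cmds):
        print(result.returncode, result.stdout.strip())
```

This buffers each command's output in memory and doesn't retry or resume anything, which is roughly where the feature gap with `parallel` starts to show.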