Counterintuitive things about cleaning data or: How to Avoid Waking Up At 3 AM Worrying About a Crashed Job
If you’ve ever trained a modern-day machine learning model, there’s a good chance you’ve had to clean a lot of training data. Maybe that means creating captions for millions or billions of images, maybe that means going through a ton of files and filtering out low-quality ones; it could be anything, but one constant is that these scripts must handle a lot of input/output perfectly, which is a tall order considering how many ways things can go wrong. If you’ve ever been anxious about a data cleaning job failing, or frustrated by repeatedly running into stupid problems, this is for you.
Writing scripts to prepare training datasets may seem straightforward, but there are a lot of surprising aspects to it. If you disregard the counterintuitive principles of data cleaning scripts, you’ll likely find yourself worrying about issues like the script crashing 7 hours into a 12-hour run. You’ll wonder why your code is slow. You’ll spend significantly more time than expected debugging your code. Worse yet, things will crash because you overlooked some trivial-seeming detail, so every bug feels extra annoying, like a slap to your dignity as a software engineer. It’s one thing to discover your code failed because of some subtle mistake; it’s another level of embarrassing to discover your script crashed and your compute has sat idle for the past few hours because you forgot to consider that 1 in 100,000 image files in your dataset might be corrupted, and that you should have just skipped it instead of bringing the whole run down. Things don’t have to be this way.
I’ve found that there are ways to predictably make preparing large datasets less painful. We can break things down into three parts: making the code reliable, making the code run fast, and being able to quickly write the code itself.
Make code reliable: guard against failure
The biggest risk to long-running data processing jobs is something unexpectedly failing in the middle of the script. The computer does not respect your time. The computer does not respect your plans. The computer does not respect the concept of “working hours”. The computer does not respect your intentions when you were originally writing the code. The computer does not respect the idea of “sleep”. This is a frustrating default and we can do better.
You can spend a lot of time trying to make a script perfect, but it’s often hard to foresee every possible outcome. Here is a list of incorrect assumptions that can easily make data processing fail:
- The file I’m trying to open will never have corrupt data.
- The file I’m trying to open will always exist.
- Okay, I double-checked that the file definitely exists, then I tried to open it, and nothing bad will ever happen if I do this.
- The files in my list are all about the same size, and I will not encounter a 1-in-10,000 bizarre edge case that’s 20 times larger than anything my software was designed to handle.
- I will never run out of disk space because I requested a very large disk.
- The disk got full anyway? How big could this dataset even be?
- I will implement rate limiting correctly if I’m depending on an external API for my data processing needs.
- Okay, fine, we can play it safe and stay a bit under the rate limits they stated.
- Okay, let’s stay under half of their rate limit just to be extra safe.
- What do you mean “the rate limits are not guaranteed”?
- My data processing script will not unexpectedly take too much computer memory.
- Fine, it will take a lot of memory but it won’t take all the memory.
- Fine, maybe it takes all the memory in the system but at least I’ll still be able to log in and cancel the script if it gets that bad.
- There’s literally not enough memory left to create an SSH connection?
- You’re kidding me right?
- This list is complete and definitely didn’t forget to mention any other potential problems.
By default, computer programs are designed to crash and loudly complain upon reaching an error. This is a good default for a lot of use cases, but it’s a bad choice here. You do _not_ want to come back to work the next day and discover your virtual machine has been sitting idle since 2 AM because that’s when the script crashed trying to process a file that got half-eaten by a little digital gremlin.
The most obvious solution works: just wrap everything in a try-catch block. I know it goes against every software engineering intuition we’ve ever learned about error handling. If I were doing some other task, like writing a web server that blissfully returned HTTP 200 OK to everything, it would take forever to realize if something went wrong—a horrendous design decision in that context. But this is big-data scripting world, where things need to work, and if they don’t work, then the show must go on, and we should not just stop on the first troublesome data point.
# bad
for image_path in image_paths_list:
    process_image(Image.open(image_path))

# good
for image_path in image_paths_list:
    try:
        processed_image = process_image(Image.open(image_path))
    except Exception as e:
        print(f"Failed to process image: {image_path}. Error: {str(e)}")
# flooding your code with this pattern is often incredibly annoying to debug in a lot of other contexts, but not here.
No matter how clever you get with your error handling, though, you still generally can’t forecast every possible problem. That leads to my next counterintuitive point: checkpointing. If you’re processing a ton of data, it’s worth keeping track of what data you’ve already processed. If things crash halfway through, you can just restart the data processing and skip the half you’ve already finished (distributed systems/database people will recognize this property as idempotence). If you want to inspect how good your data processing is, you can just open up a checkpoint file. Checkpointing doesn’t have to be complicated: if I’m writing captions for images, I’m just going to store a bunch of JSON files that map image paths to captions. Once again, this is one of those things that’s really intuitive once you hear the idea, but software engineers doing this for the first time tend to overlook it because it has a hacky, messy feel to it. In a perfect world of pure software logic, you’d run the script once, you’d see all the results at the end, and nothing would ever break.
We live in the real world.
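Here is a minimal sketch of what that checkpointing can look like. The caption_image function, the checkpoints/ directory, and the flush-every-100 threshold are all placeholders for whatever your actual processing and storage look like.

import json
import uuid
from pathlib import Path

CHECKPOINT_DIR = Path("checkpoints")  # hypothetical location; point this wherever you like
CHECKPOINT_DIR.mkdir(exist_ok=True)

def load_done():
    # Merge every checkpoint file into a single {image_path: caption} dict.
    done = {}
    for ckpt in CHECKPOINT_DIR.glob("*.json"):
        done.update(json.loads(ckpt.read_text()))
    return done

def run(image_paths, caption_image):
    # image_paths: list of path strings; caption_image: stand-in for your real processing function.
    done = load_done()
    todo = [p for p in image_paths if p not in done]  # skip everything already finished
    batch = {}
    for path in todo:
        try:
            batch[path] = caption_image(path)
        except Exception as e:
            print(f"Failed to process {path}: {e}")
        if len(batch) >= 100:  # flush a checkpoint every 100 results
            # Unique file name so a rerun never overwrites earlier checkpoints.
            (CHECKPOINT_DIR / f"ckpt_{uuid.uuid4().hex}.json").write_text(json.dumps(batch))
            batch = {}
    if batch:
        (CHECKPOINT_DIR / f"ckpt_{uuid.uuid4().hex}.json").write_text(json.dumps(batch))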
Make code fast: fix the bottlenecks
Computers are fast and getting faster, but tasks like processing millions of images or billions of words don’t run instantly. I’ve had to let my data processing programs run overnight, or even for days or weeks, just to prepare the data before I get to start training a machine learning model. Many first drafts of data processing scripts are slow because they have one severe bottleneck dragging everything down. Making your code fast is about finding and fixing that bottleneck.
The most important bottleneck when processing data is CPU limitations. The obvious solution works well: parallelize as much computation as you can! Every example in your dataset is independent of the others, so you can have many CPU processes and threads processing separate parts of the data in parallel. For instance, instead of having one CPU loop through a million files, consider dividing the work among a hundred workers.
These days, it’s easy to write simple parallel code even in Python, whose notorious Global Interpreter Lock (GIL) ensures that only one thread in a process can execute Python bytecode at any given time. Creating multiple processes, or multiple threads for I/O-heavy tasks, works totally fine, and the multiprocessing and concurrent.futures modules are great for it. Fixing this bottleneck is a big deal: a 50x speedup is the difference between your script taking an entire month vs. half a day. A one-month delay is unfathomable in a field that moves as fast as machine learning. Meanwhile, half a day means you can start the processing right before you leave work and come back the next morning to see everything handled perfectly overnight. I’ve often done exactly that, and it’s really satisfying to know that the computer will work on my behalf while I’m sleeping.
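As a rough sketch of what that looks like in practice (the process_image stand-in and the worker count here are just placeholders), dividing the files across a process pool with concurrent.futures might go something like this:

import concurrent.futures
from PIL import Image

def process_image(img):
    # Stand-in for whatever per-image work you actually do.
    return f"{img.size[0]}x{img.size[1]}"

def process_one(path):
    # Wrap each file in try/except so one bad file doesn't take down the whole pool.
    try:
        return path, process_image(Image.open(path))
    except Exception as e:
        return path, f"FAILED: {e}"

def process_all(image_paths, num_workers=32):
    results = {}
    # Separate processes sidestep the GIL for CPU-bound work;
    # swap in ThreadPoolExecutor if the work is mostly I/O.
    with concurrent.futures.ProcessPoolExecutor(max_workers=num_workers) as pool:
        for path, caption in pool.map(process_one, image_paths, chunksize=64):
            results[path] = caption
    return results

# Call process_all(...) from under an if __name__ == "__main__": guard so the
# worker processes can be spawned safely.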
But the story doesn’t end there. After fixing a bottleneck in one place, another might pop up elsewhere. Bottlenecks can hide in many places: disk, network, any sort of I/O you’re doing, CPU, memory, or even your GPU if the job needs one. The best thing you can do to speed up your code is to learn the basic shell tools that tell you what your computer is doing:
- htop: how much of the CPU and memory is being used? Which processes are using the most resources?
- bmon: how fast is data being uploaded/downloaded over the network?
- iostat -x 1: how much data are we reading from and writing to the disk?
- df -h: how full is our disk?
- watch -n1 nvidia-smi: what does the GPU usage look like, second by second?
In order to figure out what part of your system is slow, you must observe the system instead of guessing. This practice has saved me countless hours of fruitless optimization. Once, I thought a job was running suspiciously slowly and suspected something was wrong with the network. One iostat later, I found the disk usage was at 100%, because most disks don’t like opening thirty thousand files all at once. Then there was the time I was hunting a bug that would silently freeze, but not crash, my data processing script; I only realized what the problem was after a df -h showed the disk was completely full, because to a computer, most pictures are worth a lot more than a thousand words.
And once you’ve gotten your script running quickly, you can sit back, relax, and enjoy the sight of your full computing power coming to life. What can I say? Some people like cars and pushing engines to their limits. Some people enjoy the thrill of skydiving. I like typing htop and seeing every CPU maxed out at 100%.
Write code quickly: use coding assistance
This is the most hacky and, honestly, the most underrated part of writing data processing scripts. It’s often not worth the time to try to make a script perfect. In many contexts, I’d much rather quickly throw together an imperfect script that fails on 5% to 10% of the data points, and go work on something else while it runs. By far the most impactful tool for quickly writing data processing scripts has been AI-based coding assistance.
There’s some controversy today about how useful coding assistants are in general, but there’s no debate in the domain of boring, boilerplate-heavy tasks, or in other words, exactly the kind of code that we write to process data. These days my workflow looks less like carefully pecking out code on the computer screen and a lot more like dumping this stream-of-consciousness mess1 in the chat box of a Large Language Model (LLM):
plz do this multithreaded or multiprocess whatever u like the thing is u need to process the images in folder foo/bar include all files with extension jpg png jpeg webp and ignore all other extensions and call the process_image funcaiotn which takes a single pIl img adn returns a string so each worker threads hould take a list of absolute image paths and then make a dict mapping abs image path -> string and then every 100 or so images write a checkpoint json file and also at the start of this script check the checkpoints directory to see which imags are already processed also plz wrap everything in try catch dont break the script just keep goin on failure and log every iage to a log file nad put stuff like num_wrokers and prcoesses and checkpoint paths in env vars
and then I switch tabs to check on a different bug or a training run while my coding assistant of choice produces 300 lines of usually-correct data-processing Python2.
Honorable mentions
A few other things that have made my life much easier:
- Log files: It’s really cheap to log the progress, elapsed time, and status for each of the data points you’re processing. Storage is cheap; we might as well use it.
- nohup: If you’re doing data processing of any appreciable size, you’re doing it on a more powerful computer than your average laptop, which means you’re probably going to SSH into the thing. That’s really bad news if your long-running process depends on the SSH connection staying open, which is exactly what happens by default when you run a shell command on a machine you’ve SSHed into. Just use nohup, or better yet, nohup combined with a log file. Between all the nohup and htop and bmon, you’re probably getting a sense that being familiar with the shell is really helpful here, and you’d be absolutely right.
- Make sure you can easily re-run the script. The easiest way is probably to put everything in a Docker container, because a container is easy to replicate: if I need to restart a data processing job months later, I can use the exact same container and not worry about dependencies being out of date, because everything is built into the image.
- Write things down: A little comment at the top of every script explaining exactly what the script does, what the output is, and how exactly to run it is super helpful. It’s often easy to see how a script processes data just by reading it; it’s much more useful to record why you would use it and how to run it (knowledge you’ll sorely miss when you’re trying to run the job on some new batch of data three months later and you’re at a loss). Comments are also great for having LLMs write a lot of the code for you, as they functionally become prompts.
- Test/dry-run mode: Consider implementing a testing mode that you can easily toggle on and off. I’ve implemented this in two ways: one displays how the script would modify the data without actually doing anything; the other takes a small batch of data and actually processes it, so I can manually check that the results look alright before setting the computer loose on the full dataset. Both greatly reduce the chance of the computer producing 10 million incorrect files from some incorrect code; a sketch of the second approach follows this list.
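Here is a minimal sketch of that small-batch test mode, assuming a hypothetical DRY_RUN environment variable and the same kind of stand-in processing function as before; the exact toggle is up to you.

import os
import random
from PIL import Image

DRY_RUN = os.environ.get("DRY_RUN", "0") == "1"  # hypothetical toggle: DRY_RUN=1 means "just show me a sample"

def process_image(img):
    # Stand-in for the real per-image work.
    return f"{img.size[0]}x{img.size[1]}"

def main(image_paths):
    if DRY_RUN:
        # Dry run: process a small random sample and print the results for manual inspection.
        sample = random.sample(image_paths, min(20, len(image_paths)))
        for path in sample:
            try:
                print(path, "->", process_image(Image.open(path)))
            except Exception as e:
                print(path, "-> FAILED:", e)
        print(f"Checked {len(sample)} of {len(image_paths)} files; nothing was written.")
        return
    # Real run: hand everything off to the full pipeline (checkpointing, parallelism, logging).
    ...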
“Better things are possible” as a way of software engineering
When you learn something new, the most useful details are usually the counterintuitive ones that are only obvious after you learn them. One of the most important lessons I learned early in my software engineering career is that software bugs are not at all inevitable: there are entire classes of bugs that are entirely predictable and preventable. We can bend some principles of software engineering in service of other, more relevant ones. We can sleep easy at night knowing the computers are chugging away and every error will be caught safely. We can easily restart a dataset processing script without worrying about losing our progress.
Better things are possible.
Thank you to Sinan Yumurtacı for reading my draft :)
Footnotes
1: Despite the numerous issues in the prompt (grammar, spelling, run-on sentences, etc.), LLMs are generally smart enough to understand exactly what I mean. One useful side note, while we’re on the topic of producing working code rapidly: it’s often not worth trying to formulate a perfect prompt. Just get something approximately right and fix issues in a second pass.
Here’s the cleaned-up version of the prompt (which, notably, was cleaned-up by the LLM itself) in case you have trouble reading the original:
Task: Create a multithreaded/multiprocessed image processing script
Requirements:
1. Process images in folder foo/bar
- Include files with extensions: jpg, png, jpeg, webp
- Ignore all other file types
2. Use process_image function
- Input: PIL image
- Output: string
3. Worker thread/process functionality:
- Take a list of absolute image paths
- Create a dict mapping absolute image path to string
4. Checkpointing:
- Write checkpoint JSON file every 100 processed images
- At script start, check checkpoints directory for already processed images
5. Error handling:
- Wrap everything in try-catch
- Continue on failure, don't break the script
- Log every image to a log file
6. Configuration:
- Use environment variables for:
- Number of workers
- Number of processes
- Checkpoint paths
Additional notes:
- Implement either multithreading or multiprocessing, whichever is preferred
- Ensure robust error handling and logging
2: At the time of writing (autumn 2024), the best LLM for coding is Claude 3.5 Sonnet (first released in June 2024). One of the most common complaints about AI-based coding assistants is that they confabulate and hallucinate software that doesn’t exist, but a) I rarely, if ever, see Claude 3.5 Sonnet do this, and b) in my experience, many blanket complaints about poor LLM performance on this type of small-scale, mundane coding task come from not using the most capable model available for the job.