Why Your Compliance Shell Scripts Fail at Scale (and How Northpoint’s Modular Checks Solve It)

You wrote a shell script that checks whether SELinux is enforcing on every server. It worked perfectly on your laptop and the first five test nodes. Then you pushed it to the full fleet of two hundred production hosts, and suddenly half the runs timed out, a few returned garbled output, and one accidentally triggered a false alarm that woke up the on-call engineer at 3 AM. That story is so common in compliance scripting that it has become a rite of passage for infrastructure teams.

Monolithic shell scripts — one file that checks everything from file permissions to kernel parameters to auditd rules — are the default starting point for compliance automation. They are easy to write, easy to understand, and they get the job done when the job is small. But as the fleet grows, as the compliance framework evolves, and as different teams start contributing checks, those scripts become brittle, slow, and dangerous. This article explains why that happens and how Northpoint's modular check approach turns each compliance requirement into a self-contained, testable, and composable unit that scales without breaking.

Why Monolithic Scripts Break Under Load

The first sign of trouble is usually a timeout. A script that iterates over a list of checks sequentially might take thirty seconds on a single host, which feels fast. Run one host at a time across a thousand hosts, and that same script needs 1,000 × 30 s = 30,000 s, or more than eight hours. If you run it in parallel across a hundred hosts, you are suddenly managing a hundred concurrent processes, each with its own stdout, stderr, exit code, and potential race conditions. The script was never designed for that.

Sequential Execution Becomes a Bottleneck

Most compliance scripts are written as a linear series of commands: check this file, check that kernel parameter, verify the service is running. Each command waits for the previous one to finish. On a modern server with low latency, that is fine. But when you wrap that script in a remote execution tool like Ansible, Salt, or a custom SSH loop, each batch typically waits for its slowest host. If one server has a disk I/O spike or a hung process, the entire batch is delayed. The fix is not to optimize the script's speed; it is to break it into independent checks that can run and report separately.

Hardcoded Paths and Environment Dependencies

A script that hardcodes /usr/local/bin/custom_tool works until you deploy to a container image that puts that tool in /opt/custom/bin. Or until the next OS upgrade moves the audit log from /var/log/audit to /var/log/auditd. Every hardcoded path is a future failure point. Modular checks solve this by externalizing configuration: each module reads its own config file or environment variable, so the same check logic works across different OS versions, cloud images, and container runtimes without editing the script.
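
As a minimal sketch of externalized configuration, a module could take its paths from environment variables with defaults and resolve tools at runtime instead of hardcoding them. The variable names NORTHPOINT_AUDIT_LOG_DIR and NORTHPOINT_TOOL_DIRS and the find_tool helper are assumptions made up for this example, not documented Northpoint settings.

```shell
#!/bin/sh
# NORTHPOINT_AUDIT_LOG_DIR and NORTHPOINT_TOOL_DIRS are hypothetical variable
# names used for illustration only.

# Take the audit log location from the environment, with a default.
AUDIT_LOG_DIR="${NORTHPOINT_AUDIT_LOG_DIR:-/var/log/audit}"

# Look up a tool on PATH plus optional extra directories (colon-separated).
# Prints the resolved path, or returns non-zero if the tool is not found.
find_tool() {
    tool_name=$1
    extra_dirs=${2:-${NORTHPOINT_TOOL_DIRS:-}}
    (
        # The PATH change stays inside this subshell.
        PATH="$PATH${extra_dirs:+:$extra_dirs}"
        command -v "$tool_name"
    )
}

echo "audit log directory: $AUDIT_LOG_DIR" >&2
```

With this shape, a call such as find_tool custom_tool /opt/custom/bin:/usr/local/bin works on both images from the example above, and moving the audit log only means changing one variable in the environment.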

Error Handling That Hides Failures

Monolithic scripts often use a single set -e at the top, which exits the entire script on the first error. That is fine for a linear pipeline, but in compliance checking you usually want to collect all failures, not stop at the first one. Worse, some scripts redirect all errors to a log file and never surface them to the operator. A modular approach wraps each check in its own error handler: the module captures stderr, writes a structured result (pass/fail/error), and continues. The orchestrator collects all results and presents them together, so you see the full picture.
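
The difference from a single set -e can be sketched as follows. The two check functions are hypothetical stand-ins; the point is only that each check runs inside its own wrapper, its stderr is captured, and the loop keeps going after a failure.

```shell
#!/bin/sh
# No `set -e` here on purpose: a failing check must not stop the run.
# The two check functions are hypothetical stand-ins for real checks.
check_always_passes() { return 0; }
check_always_fails()  { echo "simulated failure detail" >&2; return 1; }

# Run one check, capture only its stderr, and print a single result line.
run_check() {
    check_name=$1
    if err=$("$check_name" 2>&1 >/dev/null); then
        printf '%s pass\n' "$check_name"
    else
        printf '%s fail %s\n' "$check_name" "$err"
    fi
}

# Every check runs, and all results are collected before reporting.
results=$(for c in check_always_passes check_always_fails check_always_passes; do
    run_check "$c"
done)
printf '%s\n' "$results"
```

The third check still runs even though the second one failed, which is the behavior a compliance report needs.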

What You Need Before Refactoring to Modular Checks

Before you start splitting your monolithic script into modules, you need a clear understanding of the compliance framework you are implementing, a consistent execution environment, and a way to collect results. This section covers the prerequisites that teams often skip — and then regret.

A Clear Compliance Baseline

You cannot modularize what you have not defined. Write down each compliance rule as a single, testable statement: "SELinux must be in enforcing mode," "SSH root login must be disabled," "Audit log size must be at least 100 MB." Each statement becomes one module. If your compliance framework is CIS, DISA STIG, or PCI DSS, map each rule to a unique identifier. That identifier becomes the module name. This mapping is the hardest part of the refactor, and it is also the most valuable: once you have it, you can prioritize, skip, or deprecate individual rules without touching the rest.

Consistent Output Format

Each module must produce output in the same structured format. JSON is the obvious choice because it is machine-readable and easy to parse. Define a schema with at least these fields: check_id, status (pass/fail/error), message, timestamp. Every module writes that JSON to stdout. The orchestrator reads stdout and aggregates. This consistency is what lets you run a hundred modules in parallel and still make sense of the results. Without it, you are back to parsing free-text logs with grep, which is exactly the fragility you are trying to escape.
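
A shared helper is one way to keep the format consistent. The sketch below emits the four fields named above; the function names and the example check ID are assumptions, and the escaping covers only backslashes, double quotes, and line breaks, which is enough for short messages but not for arbitrary binary data.

```shell
#!/bin/sh
# Escape a string for use inside a JSON string value.
# Handles backslashes and double quotes, and turns newlines, tabs, and
# carriage returns into spaces. Other control characters are not handled.
json_escape() {
    printf '%s' "$1" | tr '\n\t\r' '   ' | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Print one result object on a single line.
# Usage: output_result <check_id> <pass|fail|error> <message>
output_result() {
    printf '{"check_id":"%s","status":"%s","message":"%s","timestamp":"%s"}\n' \
        "$(json_escape "$1")" "$2" "$(json_escape "$3")" \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}

# Hypothetical check ID and message, for illustration.
output_result "northpoint_check_selinux_enforcing" "fail" 'getenforce returned "Permissive"'
```

Keeping the object on one line makes it easy for an orchestrator to treat each line of collected output as one result.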

Execution Orchestration

You need a tool that can run modules on remote hosts, collect their output, and handle timeouts and retries. This could be Ansible with a custom module, a simple Python script that uses subprocess, or a dedicated compliance engine. The orchestrator's job is to spawn each module as a separate process, capture its stdout and stderr, kill it if it exceeds a timeout, and store the result. The orchestrator does not know what the check does — it only knows how to run it and collect the output. That separation of concerns is the core of the modular approach.
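
For a single host, the orchestrator's contract can be sketched in a few lines of shell. This is an illustration under assumptions, not a full engine: it runs modules one after another, expects GNU coreutils timeout (which exits with status 124 when the limit is hit), and the run_modules name and directory layout are invented for the example.

```shell
#!/bin/sh
# Run every executable file in a module directory with a time limit and
# print one result line per module. The orchestrator never inspects what
# a module checks; it only runs it and collects stdout.
run_modules() {
    module_dir=$1
    timeout_secs=${2:-30}
    for module in "$module_dir"/*; do
        [ -f "$module" ] && [ -x "$module" ] || continue
        name=$(basename "$module")
        rc=0
        # stdout is the module's result; stderr is discarded in this sketch.
        out=$(timeout "$timeout_secs" "$module" 2>/dev/null) || rc=$?
        if [ "$rc" -eq 124 ]; then
            printf '{"check_id":"%s","status":"timeout","message":"killed after %ss"}\n' \
                "$name" "$timeout_secs"
        elif [ -z "$out" ]; then
            printf '{"check_id":"%s","status":"error","message":"no output, exit code %s"}\n' \
                "$name" "$rc"
        else
            printf '%s\n' "$out"
        fi
    done
}

# Example call (the path is hypothetical):
#   run_modules /opt/northpoint/modules 30
```

A real orchestrator would also keep stderr for debugging and validate that each captured line is JSON before storing it.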

Building a Modular Check: Step by Step

Let's walk through converting a common compliance rule — verifying that the audit daemon is running — from a monolithic snippet to a reusable module. The goal is a self-contained script that can be run independently, tested in isolation, and combined with other modules.

Step 1: Define the Check ID and Expected State

Every module starts with a header that declares its identity. Use a standard naming convention like northpoint_check_auditd_status. The module should also define the expected state as a variable, not a hardcoded value. For example, EXPECTED_STATUS="active". This makes it easy to adapt the same module for different service names or statuses later.

Step 2: Implement the Check Logic

The actual check should be as simple as possible. Use system commands that are available on the target OS. For the auditd example, systemctl is-active auditd prints "active" if the service is running; for any other state it prints the state name (for example "inactive" or "failed") and exits non-zero, so a non-zero exit code alone does not mean the command broke. Capture the output and compare it to the expected status. If they match, print a JSON result with status "pass". If they do not match, print "fail" with a message that includes the actual status. If the command itself cannot run (e.g., systemctl is not installed), print "error" with the stderr output.

Step 3: Write the Result to Stdout

The module must never write anything except the JSON result to stdout. Debug messages, warnings, and progress indicators should go to stderr. This rule is non-negotiable: if the orchestrator cannot parse stdout as JSON, it cannot aggregate results. Use a simple echo or printf to output the JSON object. Many teams use a helper function like output_result() that builds the JSON string and prints it, ensuring every module uses the same format.

Step 4: Handle Edge Cases Gracefully

What if the server is a container that does not use systemd? What if the auditd package is not installed at all? The module should handle these cases by returning an "error" status with a descriptive message, not by crashing or hanging. Add a check at the top: if the required command is not found, print an error and exit. This prevents the orchestrator from waiting indefinitely for a module that will never complete.
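
Putting the four steps together, the auditd module might look like the sketch below. The check ID follows the naming convention from Step 1; the SERVICE_NAME override, the main function, and the exact messages are assumptions for illustration. For brevity the sketch discards systemctl's stderr instead of copying it into the error message.

```shell
#!/bin/sh
# Module: northpoint_check_auditd_status
# Step 1: identity and expected state are variables, not literals in the logic.
CHECK_ID="northpoint_check_auditd_status"
SERVICE_NAME="${SERVICE_NAME:-auditd}"
EXPECTED_STATUS="${EXPECTED_STATUS:-active}"

# Step 3: the only function that writes to stdout.
# Usage: output_result <pass|fail|error> <message>
output_result() {
    safe_message=$(printf '%s' "$2" | tr '\n' ' ' | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
    printf '{"check_id":"%s","status":"%s","message":"%s","timestamp":"%s"}\n' \
        "$CHECK_ID" "$1" "$safe_message" "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}

main() {
    # Step 4: report an error instead of crashing when systemd is absent.
    if ! command -v systemctl >/dev/null 2>&1; then
        output_result error "systemctl not found; host may not use systemd"
        return 2
    fi
    echo "debug: checking $SERVICE_NAME" >&2   # diagnostics go to stderr only

    # Step 2: is-active exits non-zero for any state other than active,
    # so keep the printed state and ignore the exit code.
    actual=$(systemctl is-active "$SERVICE_NAME" 2>/dev/null) || true
    if [ -z "$actual" ]; then
        output_result error "systemctl printed no state for $SERVICE_NAME"
        return 2
    elif [ "$actual" = "$EXPECTED_STATUS" ]; then
        output_result pass "$SERVICE_NAME is $actual"
    else
        output_result fail "$SERVICE_NAME is $actual, expected $EXPECTED_STATUS"
    fi
}

# As a standalone module file, the last line would be:  main
```

Because the service name and expected state come from variables, the same file can check another unit by setting SERVICE_NAME in the environment rather than by editing the logic.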

Tools and Environment Realities for Modular Compliance

Choosing the right tools and understanding the environment where your modules will run is critical. This section covers the practical decisions that make or break a modular compliance system.

Shell vs. Python vs. Go for Modules

Shell scripts are the lightest dependency — every Linux server has a shell. But shell is terrible at complex logic, error handling, and data structures. Python is more robust and has a rich standard library, but it requires Python to be installed on every target host. Go compiles to a static binary that runs anywhere, but it adds build complexity. Our recommendation: start with shell for simple checks (file existence, permission bits, service status) and use Python for checks that need JSON parsing, HTTP calls, or regex. Keep a single language per module — do not mix shell and Python in the same module because that doubles the dependency surface.

Parallel Execution and Resource Limits

Running a hundred modules in parallel on a single host can exhaust CPU, memory, or file descriptors. Set a concurrency limit at the orchestrator level. A good default is the number of CPU cores plus one. Also set a per-module timeout — 30 seconds is usually enough for a compliance check. If a module times out, the orchestrator should kill it and record a "timeout" result. This prevents a single hung check from blocking the entire batch.
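
One way to apply both limits on a single host is xargs with a job cap. This is a sketch that assumes GNU findutils and coreutils (xargs -P, find -print0, nproc, timeout); run_parallel and the result file naming are made up for the example.

```shell
#!/bin/sh
# Run every executable module in a directory with a concurrency cap and a
# per-module timeout. Each module writes to its own result file.
# Usage: run_parallel <module_dir> <results_dir> [max_jobs] [timeout_secs]
run_parallel() {
    module_dir=$1
    results_dir=$2
    max_jobs=${3:-$(( $(nproc) + 1 ))}   # default: CPU cores plus one
    timeout_secs=${4:-30}
    mkdir -p "$results_dir"
    find "$module_dir" -maxdepth 1 -type f -perm -u+x -print0 |
        RESULTS_DIR=$results_dir TIMEOUT_SECS=$timeout_secs \
        xargs -0 -n 1 -P "$max_jobs" sh -c '
            name=$(basename "$1")
            # Separate files per module, so parallel output never interleaves.
            timeout "$TIMEOUT_SECS" "$1" >"$RESULTS_DIR/$name.json" 2>"$RESULTS_DIR/$name.err"
            if [ "$?" -eq 124 ]; then
                printf "{\"check_id\":\"%s\",\"status\":\"timeout\"}\n" "$name" >"$RESULTS_DIR/$name.json"
            fi
            exit 0
        ' sh
}
```

Writing each result to its own file is a deliberate choice here: several processes sharing one stdout can produce interleaved lines, which would defeat the structured format.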

Idempotency and Side Effects

Compliance checks should never modify the system. A module that checks a file permission must not change that permission. If a module accidentally has a side effect (e.g., it touches a file or restarts a service), it can cause drift or outages. Enforce read-only operations by running modules as a non-root user with minimal privileges wherever the check allows it. Remounting critical filesystems read-only with mount -o remount,ro gives extra protection, but it is itself a change to the system and can disrupt running services, so it is rarely appropriate. The key is discipline: each module should be a pure check, not a remediation.

Adapting the Modular Approach for Different Constraints

Not every environment is the same. This section covers variations for air-gapped networks, containerized workloads, and legacy systems that cannot install new tools.

Air-Gapped and Offline Environments

In air-gapped networks, you cannot download modules or dependencies from the internet. The solution is to package all modules into a single archive (tar or zip) that includes the module scripts, a manifest file listing each module's dependencies, and a minimal orchestrator script. Transfer the archive via secure media. The orchestrator runs locally and does not need external connectivity. Each module should be self-contained — no apt-get install or pip install at runtime. If a module needs a tool like jq for JSON parsing, include a static binary in the archive.

Container and Immutable Infrastructure

In container environments, the host is ephemeral and you usually cannot install packages. Compliance checks must run from within the container or from the host using tools already present. For containers, build a sidecar container that contains all the compliance modules and the orchestrator. Configure the sidecar to share the target container's PID namespace and to mount the volumes you need to inspect, so it can check processes and file permissions; most runtimes share neither by default. For immutable hosts (like CoreOS or Bottlerocket), use the host's built-in tools (systemd, journalctl, etc.) and avoid any dependency on a package manager.

Legacy Systems Without Modern Shell

Some older Unix systems (AIX, HP-UX, Solaris) have limited shell capabilities and no JSON tools. For these, write modules in POSIX sh and output a simple key=value format instead of JSON. The orchestrator can parse that format with a small adapter. Keep the logic simple: avoid arrays, associative arrays, and complex string manipulation. Test each module on the actual legacy OS before deploying — what works in bash on Linux may fail in ksh on AIX.
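
A sketch of that key=value variant, together with the small adapter on the orchestrator side, could look like this. The function names and the three-field layout are assumptions; the module side sticks to plain POSIX sh features, and the adapter is meant to run on the machine that aggregates results.

```shell
#!/bin/sh
# Module side (legacy host): plain POSIX sh, one key=value pair per line.
# Usage: output_kv <check_id> <status> <message>
output_kv() {
    # Keep the message on one line so each field stays parseable.
    kv_message=$(printf '%s' "$3" | tr '\n' ' ')
    printf 'check_id=%s\nstatus=%s\nmessage=%s\n' "$1" "$2" "$kv_message"
}

# Orchestrator side: convert one key=value record on stdin to a JSON line.
kv_to_json() {
    kv_id='' kv_status='' kv_msg=''
    # Only the first "=" splits key from value; later ones stay in the value.
    while IFS='=' read -r key value; do
        case $key in
            check_id) kv_id=$value ;;
            status)   kv_status=$value ;;
            message)  kv_msg=$(printf '%s' "$value" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g') ;;
        esac
    done
    printf '{"check_id":"%s","status":"%s","message":"%s"}\n' "$kv_id" "$kv_status" "$kv_msg"
}

# Hypothetical check ID, for illustration.
output_kv "northpoint_check_cron_perms" "fail" "mode is 0666, expected 0600" | kv_to_json
```

The adapter escapes only the message field, on the assumption that check IDs and statuses come from a fixed vocabulary.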

Common Pitfalls and How to Debug Them

Even with modular checks, things go wrong. This section covers the most frequent failures and how to fix them.

Module Output Not Parsable as JSON

The most common issue is a module that prints debug messages to stdout, breaking the JSON output. The orchestrator then fails to parse the result and marks the check as "error". To debug, run the module manually on the target host and inspect its stdout. If you see anything other than a single JSON object, the module has a bug. Fix by redirecting all non-JSON output to stderr. Validate the output by piping it through a tool like python -m json.tool, which exits non-zero when its input is not valid JSON.
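
That manual check is easy to wrap in a small helper. The sketch assumes python3 is on the PATH of the machine you debug from; is_valid_json is a made-up name.

```shell
#!/bin/sh
# Return 0 if stdin is exactly one valid JSON document, non-zero otherwise.
# A stray debug line before or after the object makes the parse fail.
is_valid_json() {
    python3 -m json.tool >/dev/null 2>&1
}

# Example use against a module (the path is hypothetical):
#   ./northpoint_check_auditd_status 2>/dev/null | is_valid_json \
#       || echo "stdout is not clean JSON" >&2
```

Note that this only proves the output parses; it does not confirm that the required fields are present.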

Timeout Because of Network or Disk Wait

A module that tries to read a file on a network filesystem (NFS, FUSE) can hang indefinitely if the mount is stale. Always set a timeout in the orchestrator, but also add a timeout inside the module itself using the timeout command. For example: timeout 5 cat /path/to/file. This way, even if the orchestrator's timeout fails to kill the process (which can happen with certain process groups), the module exits on its own.
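
Inside a module, the pattern can be wrapped in a small function. This sketch assumes GNU coreutils timeout, which exits with status 124 when the limit is reached; read_with_timeout is a hypothetical helper name.

```shell
#!/bin/sh
# Read a file with a hard time limit and print its content.
# Exit status is 124 if the limit was hit, otherwise cat's own status.
# Usage: read_with_timeout <file> [limit_secs]
read_with_timeout() {
    file_path=$1
    limit_secs=${2:-5}
    # -k 2: send SIGKILL if cat is still alive 2 seconds after SIGTERM.
    timeout -k 2 "$limit_secs" cat "$file_path"
}

# Example use inside a module (the path is hypothetical):
#   content=$(read_with_timeout /mnt/shared/policy.conf 5) || status=error
```

One caveat: a process stuck in uninterruptible I/O on a hard-mounted, stale NFS share may not react to signals at all, so the orchestrator-level timeout is still needed as a backstop.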

False Positives from Incomplete Checks

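A check that only looks for the compliant line can miss a misconfiguration elsewhere in the file. For example, grep -q '^PermitRootLogin no' /etc/ssh/sshd_config passes whenever that line appears anywhere in the file, even if an earlier PermitRootLogin yes line takes precedence (sshd uses the first value it finds for each keyword) or a Match block overrides the setting for some users. Strictly speaking these are false passes: the module reports compliance that is not there. Always test your checks against real misconfigured systems. Build a test fixture, a set of intentionally broken configurations, and run your modules against them before deploying to production.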
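
A somewhat more careful version of the SSH check reads the first effective value in the global section rather than grepping for the desired line. The sketch below is still a simplification: it ignores Include files and the Keyword=value spelling, and the function names are assumptions. On a live host, sshd -T prints the effective configuration and is the more authoritative source.

```shell
#!/bin/sh
# Print the first value of a keyword in the global section of an
# sshd_config-style file (lowercased). Prints nothing if it is not set.
first_global_value() {
    keyword=$1
    config_file=$2
    awk -v kw="$keyword" '
        BEGIN { kw = tolower(kw) }
        /^[ \t]*#/ { next }                 # skip comment lines
        tolower($1) == "match" { exit }     # global section ends at the first Match
        tolower($1) == kw { print tolower($2); exit }   # first occurrence wins
    ' "$config_file"
}

# Print pass only when the first global PermitRootLogin value is "no".
# An unset keyword counts as fail because the rule wants it disabled explicitly.
check_permit_root_login() {
    value=$(first_global_value PermitRootLogin "${1:-/etc/ssh/sshd_config}")
    if [ "$value" = "no" ]; then echo pass; else echo fail; fi
}
```

Against a file that contains PermitRootLogin yes followed later by PermitRootLogin no, the grep version passes while this version reports fail, which matches how sshd reads the file.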

Frequently Asked Questions About Modular Compliance Checks

This section answers the questions that come up most often when teams adopt this approach.

How many modules should I have?

One module per compliance rule. If a rule has multiple sub-conditions (e.g., "SSH must use protocol 2 and disable root login"), split it into two modules. This keeps each module simple and makes it easy to track which rules pass and fail. A CIS Level 1 benchmark for Linux can easily contain a hundred or more rules, so expect at least that many modules. That sounds like a lot, but each module is small (10–30 lines) and the total is manageable.

Can I reuse modules across different compliance frameworks?

Yes, with a mapping layer. Write a module that checks "SELinux is enforcing." That module is the same whether you are implementing CIS, STIG, or PCI DSS. The difference is which rules are required. Create a manifest file for each framework that lists the required check IDs. The orchestrator reads the manifest and runs only the relevant modules. This way, you maintain one library of checks and multiple compliance profiles.
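
The mapping layer can be as small as a text file per framework. In the sketch below, the manifest format (one check ID per line, optionally followed by a description, with # comments) and the modules_for_manifest function are assumptions for illustration.

```shell
#!/bin/sh
# Print the path of each module listed in a manifest, in manifest order.
# Blank lines and lines starting with "#" are skipped; IDs without a
# matching executable module produce a warning on stderr.
# Usage: modules_for_manifest <manifest_file> <module_dir>
modules_for_manifest() {
    manifest=$1
    module_dir=$2
    # First word is the check ID; the rest of the line is a free-text description.
    while read -r check_id _description || [ -n "$check_id" ]; do
        case $check_id in
            ''|'#'*) continue ;;
        esac
        if [ -x "$module_dir/$check_id" ]; then
            printf '%s\n' "$module_dir/$check_id"
        else
            echo "warning: no module found for $check_id" >&2
        fi
    done < "$manifest"
}

# Example use (file names are hypothetical):
#   modules_for_manifest profiles/cis_level1.manifest modules/
```

Two frameworks that share a rule simply list the same check ID in their manifests, so the module itself exists once.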

How do I update a module when a rule changes?

Update the module file and redeploy the archive. Because modules are independent, changing one does not affect others. Version your modules with a simple version number in the header, and include that version in the JSON output. This lets you track which version of each check ran on each host. If a rule changes frequently, consider storing the expected value in a configuration file that the module reads at runtime — then you can update the config without touching the module code.

Next Steps: Building Your Modular Compliance Library

You now have the concepts and the workflow. Here are the specific actions to take in the next week.

First, pick one compliance rule that your current monolithic script handles poorly — ideally one that has caused a timeout or a false alarm. Write a standalone module for that rule following the steps in this guide. Test it manually on three different hosts (different OS versions or cloud images). Once it works, integrate it into your existing orchestration alongside the old script. Run both in parallel for a week and compare the results. This gives you confidence that the modular version is correct before you migrate the rest.

Second, create a manifest file for your primary compliance framework. List every rule as a separate line with the check ID and a short description. Use this manifest as your roadmap. Prioritize the rules that are most frequently violated or most critical to security — those are the ones that will give you the fastest return on investment.

Third, set up a simple test harness. Create a directory with intentionally non-compliant configurations (e.g., a copy of sshd_config with root login enabled, a copy of auditd.conf with a small log size). Run your modules against these fixtures and verify that they fail correctly. Then modify the fixtures to be compliant and verify that the modules pass. This test harness is your safety net for every future module you write.
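
A first version of that harness fits in one file. The sketch below uses the audit log size rule as its example; expect_status and check_audit_log_size are hypothetical names, and the check assumes the max_log_file setting in auditd.conf, which is expressed in megabytes.

```shell
#!/bin/sh
# Run a check against a fixture and compare its status with the expectation.
# Usage: expect_status <pass|fail|error> <command> [args...]
expect_status() {
    want=$1
    shift
    got=$("$@" 2>/dev/null) || true
    if [ "$got" = "$want" ]; then
        printf 'ok   %s -> %s\n' "$*" "$got"
    else
        printf 'FAIL %s -> %s (wanted %s)\n' "$*" "$got" "$want"
        return 1
    fi
}

# Hypothetical check: prints pass, fail, or error for the audit log size rule.
check_audit_log_size() {
    conf=$1
    min_mb=${2:-100}
    size=$(awk -F= '/^[ \t]*max_log_file[ \t]*=/ { gsub(/[ \t]/, "", $2); print $2; exit }' "$conf")
    case $size in
        ''|*[!0-9]*) echo error; return 0 ;;   # missing or non-numeric value
    esac
    if [ "$size" -ge "$min_mb" ]; then echo pass; else echo fail; fi
}

# Fixtures: one intentionally broken configuration, one compliant.
fixtures=$(mktemp -d)
printf 'max_log_file = 8\n'   > "$fixtures/auditd_small.conf"
printf 'max_log_file = 200\n' > "$fixtures/auditd_ok.conf"
expect_status fail check_audit_log_size "$fixtures/auditd_small.conf"
expect_status pass check_audit_log_size "$fixtures/auditd_ok.conf"
rm -rf "$fixtures"
```

Because expect_status returns non-zero on a mismatch, the same file can serve as a pre-deployment gate in whatever pipeline ships your modules.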

Finally, share the manifest and the first few modules with your team. Let them see how small and readable a well-written module is. The modular approach is not just a technical change — it is a cultural shift toward treating compliance checks as reusable, testable components. Once your team experiences the joy of fixing one module without breaking ten others, they will never go back to monolithic scripts.
