Log rotation is one of those tasks every admin sets up and rarely thinks about again—until a disk fills up or an audit finds a gap. This guide reveals a specific, often-overlooked failure mode that silently destroys logs without triggering alarms, and presents a robust fix derived from Northpoint’s operational playbook. We will walk through the root cause, detection methods, configuration patterns, and how to choose the right strategy for your environment.
The Problem: Why Log Rotation Fails Silently
Most log rotation tools—logrotate on Linux, Windows Event Log size limits, or custom scripts—rely on a combination of size thresholds, age limits, and post-rotation actions like compression or deletion. The silent failure occurs when the rotation completes successfully from the tool’s perspective, but the application or system service does not reopen its log file handle. Instead, the application continues writing to the now-renamed or deleted file, causing all subsequent log entries to be lost or written to a stale file that is never rotated again.
The Copy-Truncate vs. Create-New Confusion
Two common rotation modes exist: copy-truncate (copy the file, then truncate the original) and create-new (rename the old file, then create a new empty file with the same name). The create-new mode requires the application to detect the file change and reopen its handle. Many applications—especially older ones or those using buffered I/O—do not do this. Copy-truncate avoids this problem but can lose a few lines between copy and truncate. Admins often choose one mode without verifying whether their application supports it, leading to silent data loss.
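The gap between copy and truncate can be reproduced deterministically. The sketch below is an illustration only, assuming a POSIX system and a writer that opens its log in append mode; the function name and file names are made up for the demo.

```python
import os
import shutil


def simulate_copy_truncate(directory):
    """Replay copy-truncate with one write landing between the two steps."""
    log_path = os.path.join(directory, "app.log")
    rotated_path = log_path + ".1"

    # Append mode matters: after truncation the next write lands at offset 0.
    with open(log_path, "a") as writer:
        writer.write("line 1\n")
        writer.flush()

        shutil.copyfile(log_path, rotated_path)  # step 1: copy
        writer.write("line 2\n")                 # arrives in the gap
        writer.flush()
        os.truncate(log_path, 0)                 # step 2: truncate

        writer.write("line 3\n")
        writer.flush()

    with open(rotated_path) as f:
        rotated = f.read()
    with open(log_path) as f:
        current = f.read()
    return rotated, current
```

In this replay the rotated copy holds only "line 1" and the live file holds only "line 3"; "line 2" exists in neither, and nothing reports an error.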
Another subtle variant: rotation scripts that delete old logs based on count but fail to check whether the deletion actually freed space (e.g., if the file is still held open by a process). The disk remains full, but the tool reports success. This scenario is especially common with long-running services, such as Java applications, that keep their log file handles open for the life of the process.
In a composite case, a team I read about configured logrotate with a 7-day retention and daily rotation. After two weeks, they noticed disk usage was still climbing. Investigation revealed that the application (a legacy Java service) never reopened its log file after rotation. The old logs were being rotated and deleted, but the application was still writing to a deleted inode: entries vanished from view, the space was never reclaimed, and no error was logged. Because the service could not be relied on to reopen its log, the fix was to switch to copy-truncate, which keeps the application writing to the same inode.
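One way to spot space that a deletion did not free is to look for open files with no remaining directory entry. This is a minimal sketch, assuming Linux with /proc mounted and permission to inspect the target process; deleted_open_files is a hypothetical helper name.

```python
import os
import stat


def deleted_open_files(pid):
    """Return (fd, size_bytes) for regular files `pid` holds open after deletion."""
    fd_dir = f"/proc/{pid}/fd"
    held = []
    for name in os.listdir(fd_dir):
        try:
            # stat() follows the fd link to the open file, even if it is deleted.
            info = os.stat(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed while we were scanning
        # st_nlink == 0 means every directory entry for this inode is gone.
        if stat.S_ISREG(info.st_mode) and info.st_nlink == 0:
            held.append((int(name), info.st_size))
    return held
```

Any entry in the result is space that will stay allocated until the process closes the descriptor or exits.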
How Log Rotation Mechanisms Work—And Where They Break
To fix the silent failure, we must first understand the underlying mechanics. Log rotation tools typically operate in three phases: check (evaluate size/age thresholds), rotate (rename or copy the current file), and post-process (compress, delete old archives, or signal applications). The failure often resides in the signaling step or the application’s reaction to it.
File Descriptor Behavior
When a process opens a file, it obtains a file descriptor that refers to the inode, not to the file name. If the file is renamed (create-new mode), the descriptor still refers to the original inode, which now lives under the rotated name; the new file created under the old name is a different inode. The process keeps writing to the rotated file while the new file stays empty. Once the rotated file is deleted (or replaced by its compressed copy), the inode has no directory entry left, but its blocks are freed only when the last descriptor is closed. Until then, new entries are invisible and the old inode keeps consuming disk space.
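The behavior is easy to observe in a single process. The following sketch, assuming a POSIX filesystem, plays both roles: a writer that never reopens its file and a rotation that renames it. The function and file names are illustrative.

```python
import os


def rename_under_open_writer(directory):
    """Rename a log while a writer holds it open, then see where writes land."""
    log_path = os.path.join(directory, "app.log")
    rotated_path = log_path + ".1"

    writer = open(log_path, "a")
    writer.write("before rotation\n")
    writer.flush()
    inode_before = os.fstat(writer.fileno()).st_ino

    os.rename(log_path, rotated_path)  # what create-new rotation does
    open(log_path, "a").close()        # the fresh, empty replacement

    writer.write("after rotation\n")   # the writer never reopened its file
    writer.flush()

    inode_after = os.fstat(writer.fileno()).st_ino
    with open(rotated_path) as f:
        rotated_lines = f.read().splitlines()
    result = {
        "writer_inode_unchanged": inode_after == inode_before,
        "writer_follows_rotated_file": inode_after == os.stat(rotated_path).st_ino,
        "new_file_size": os.path.getsize(log_path),
        "rotated_lines": rotated_lines,
    }
    writer.close()
    return result
```

Both lines end up in the rotated file and the new app.log stays at zero bytes, which is exactly the state a verification step should look for.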
Signaling Mechanisms
Common mechanisms include SIGHUP (for daemons like syslog-ng), SIGUSR1/SIGUSR2, or application-specific commands (e.g., nginx -s reopen). If the rotation script does not send the correct signal, or the application ignores it, the failure occurs. Many admins assume the default logrotate configuration handles signaling, but the packaged defaults typically signal only the services they ship with, such as the syslog daemon. A custom application needs its own postrotate entry.
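On the application side, "reopening on signal" can be as small as the sketch below. It is an illustration, not any particular daemon's implementation: ReopenableLog is a hypothetical class, the signal number is left to the caller, and a POSIX system is assumed. The handler only sets a flag, so the actual reopen happens on the next write rather than inside the signal handler.

```python
import signal


class ReopenableLog:
    """Append-only log writer that reopens its path after a signal."""

    def __init__(self, path):
        self.path = path
        self._file = open(path, "a")
        self._reopen_requested = False

    def install(self, signum):
        # Must be called from the main thread, e.g. install(signal.SIGUSR1).
        signal.signal(signum, self._on_signal)

    def _on_signal(self, signum, frame):
        # Keep the handler trivial: just remember that a reopen was requested.
        self._reopen_requested = True

    def write(self, line):
        if self._reopen_requested:
            # Drop the old inode and pick up whatever now lives at self.path.
            self._file.close()
            self._file = open(self.path, "a")
            self._reopen_requested = False
        self._file.write(line + "\n")
        self._file.flush()

    def close(self):
        self._file.close()
```

A rotation script would then rename the file and send the agreed signal from its post-rotation step; an application without an equivalent of this logic keeps writing to the old inode no matter which signal it receives.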
Compression and Deletion Order
Another break point: compression of rotated logs before deletion. If compression fails silently (e.g., due to disk space or permission issues), old logs accumulate and retention policies are violated. The tool may still report success because the compression step is non-fatal.
To avoid these pitfalls, you must verify that your application supports the chosen rotation mode and that the signaling is correctly configured. A simple test: after rotation, check that the application writes new entries to the correct file (the newly created one, not the old inode). You can do this by monitoring file descriptors with lsof or checking inode numbers with stat.
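The inode comparison can be scripted instead of read off lsof output. A minimal sketch, assuming Linux with /proc and sufficient privileges to read the target process's descriptor table; process_has_open is a hypothetical helper name.

```python
import os


def process_has_open(pid, log_path):
    """True if `pid` holds a descriptor on the file currently named log_path."""
    target = os.stat(log_path)
    fd_dir = f"/proc/{pid}/fd"
    for name in os.listdir(fd_dir):
        try:
            info = os.stat(os.path.join(fd_dir, name))  # follows the fd link
        except OSError:
            continue  # descriptor closed while scanning
        # Same device and inode means the same underlying file.
        if (info.st_dev, info.st_ino) == (target.st_dev, target.st_ino):
            return True
    return False
```

If this returns False shortly after a rotation, the process is most likely still attached to the old inode.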
Northpoint’s Fix: A Structured Approach to Reliable Rotation
Northpoint’s operational team developed a repeatable process to eliminate silent rotation failures. It combines pre-flight checks, a standardized configuration template, and post-rotation verification. The fix is not a single tool but a methodology that can be adapted to any environment.
Step 1: Audit Current Configuration
Inventory all log rotation jobs. For each, note the rotation mode (copy-truncate vs. create-new), the application being rotated, and the post-rotation action (signal, compress, delete). Identify applications that do not reopen log files automatically—these are candidates for copy-truncate or require explicit signaling.
Step 2: Standardize on a Mode
Northpoint recommends using copy-truncate for all applications that do not natively support create-new with signal handling. For applications that do support it (e.g., nginx, Apache, rsyslog), use create-new with the correct signal. Document the signal for each application in a configuration management database.
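The mode decision can be encoded once and rendered per application. The sketch below generates a logrotate stanza from two inputs; render_stanza, its parameters, and the retention default are assumptions for illustration and are not Northpoint's actual template. The directives themselves (daily, rotate, missingok, notifempty, compress, delaycompress, copytruncate, create, postrotate/endscript) are standard logrotate options.

```python
def render_stanza(log_path, reopen_command=None, keep=7):
    """Render a logrotate stanza; no reopen command means copy-truncate."""
    lines = [
        f"{log_path} {{",
        "    daily",
        f"    rotate {keep}",
        "    missingok",
        "    notifempty",
        "    compress",
        "    delaycompress",
    ]
    if reopen_command is None:
        # Application cannot reopen its log: rotate in place.
        lines.append("    copytruncate")
    else:
        # Application reopens on request: rename, recreate, then notify it.
        lines += [
            "    create",
            "    postrotate",
            f"        {reopen_command}",
            "    endscript",
        ]
    lines.append("}")
    return "\n".join(lines) + "\n"
```

For example, render_stanza("/var/log/legacy/app.log") yields a copy-truncate stanza, while render_stanza("/var/log/nginx/access.log", "nginx -s reopen") yields create-new with a postrotate hook. Both paths are placeholders.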
Step 3: Add Verification Steps
After each rotation, run a script that checks:
- The new log file exists and is writable.
- The application’s file descriptor points to the new inode (or the old inode is no longer growing).
- Disk usage is within expected bounds.
If any check fails, alert immediately. Northpoint uses a simple cron job that runs after logrotate and sends a notification if anomalies are detected.
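The three checks can be sketched as one function that returns a list of problems for the alerting layer. This is an assumption-laden illustration rather than Northpoint's script: verify_rotation and its parameters are hypothetical, the 80% default is only an example threshold, and the growth check assumes the most recent rotated file is still uncompressed (for instance with delaycompress).

```python
import os
import shutil
import time


def verify_rotation(log_path, rotated_path, max_disk_used=0.80, settle_seconds=1.0):
    """Return a list of human-readable failures; an empty list means all clear."""
    failures = []

    # 1. The new log file exists and is writable.
    #    Note: os.access tests the user running this script, not the application.
    if not os.path.isfile(log_path):
        failures.append(f"{log_path} does not exist")
    elif not os.access(log_path, os.W_OK):
        failures.append(f"{log_path} is not writable")

    # 2. The old file is no longer growing (two size samples).
    if os.path.exists(rotated_path):
        size_before = os.path.getsize(rotated_path)
        time.sleep(settle_seconds)
        if os.path.getsize(rotated_path) > size_before:
            failures.append(f"{rotated_path} is still being written to")

    # 3. Disk usage is within expected bounds.
    usage = shutil.disk_usage(os.path.dirname(os.path.abspath(log_path)))
    if usage.total and usage.used / usage.total > max_disk_used:
        failures.append("disk usage above threshold")

    return failures
```

A cron job scheduled after logrotate could call this per log file and send a notification whenever the list is non-empty.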
Step 4: Implement Compressed Archive Validation
After compression, verify that the compressed file is not corrupted and that its size is reasonable. Use gzip -t or equivalent. If compression fails, retain the uncompressed file and alert.
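Where shelling out to gzip -t is inconvenient, roughly the same check can be done with the Python standard library by decompressing the archive and discarding the output. A minimal sketch; gzip_is_intact is a hypothetical name and the chunk size is arbitrary.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1024 * 1024):
    """Decompress the whole archive and discard the output; False on any error."""
    try:
        with gzip.open(path, "rb") as archive:
            # Reading to the end forces the CRC and length trailer to be checked.
            while archive.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers bad headers and CRC mismatches, EOFError a truncated stream.
        return False
    return True
```

If this returns False, the process described above applies: keep the uncompressed file and raise an alert instead of letting retention delete the only copy.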
This process turned a once-silent failure into a loud alert, reducing data loss incidents by over 90% in their environment. The key is not to rely on logrotate’s exit code alone, but to actively verify the outcome.
Tools, Stack, and Maintenance Realities
Choosing the right tooling is critical. While logrotate is ubiquitous on Linux, alternatives like systemd-journald, awslogs (for CloudWatch), or custom Python scripts offer different trade-offs. Below is a comparison of common approaches.
| Tool | Rotation Mode | Signaling Support | Compression | Verification |
|---|---|---|---|---|
| logrotate | Copy-truncate or create-new | Via postrotate script | gzip, bzip2, etc. | None built-in |
| systemd-journald | Automatic by size/age | N/A (journal is binary) | Built-in (LZ4/XZ) | Journal integrity checks |
| Custom Python script | Flexible | Manual | Via subprocess | Can be added |
Maintenance Considerations
Log rotation is not set-and-forget. Application updates may change file naming or signaling requirements. Filesystem changes (e.g., moving from ext4 to XFS) can change inode limits: ext4 fixes the inode count when the filesystem is created, while XFS allocates inodes dynamically. Regularly review rotation logs and disk usage trends. Automate the review with dashboards (e.g., Grafana showing log file sizes over time).
For containerized environments, log rotation is often handled by the container runtime (e.g., Docker’s json-file driver with max-size and max-file). However, if applications write to files inside the container, rotation must be managed inside the container or via volume mounts. Northpoint’s team found that using copy-truncate inside containers with a shared volume works reliably, but they also add a sidecar container that monitors log file descriptors.
Cloud-native patterns (e.g., writing to stdout and letting the runtime capture logs) simplify rotation but introduce new failure modes if the runtime’s log driver is misconfigured. Always test rotation in a staging environment before deploying to production.
Growth Mechanics: Scaling Log Rotation Across a Fleet
As your infrastructure grows, manual per-server rotation configuration becomes untenable. Centralized management using configuration management tools (Ansible, Puppet, Chef) or a dedicated log management platform (ELK, Splunk) becomes necessary. However, scaling introduces new failure modes.
Configuration Drift
When rotation settings are managed by different teams or inherited from base images, inconsistencies arise. One server might use copy-truncate, another create-new, leading to different data retention and loss profiles. Standardize via a single configuration template and enforce it with automated compliance checks.
Monitoring at Scale
Verification scripts that run on each server can generate alert noise if thresholds are too tight. Northpoint’s approach: aggregate rotation results into a central metrics system. Each server emits a metric (e.g., log_rotate_success with a 0/1 value). A dashboard shows the percentage of successful rotations across the fleet. Anomaly detection flags servers with persistent failures.
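The aggregation step might look like the sketch below. It assumes each host's recent log_rotate_success samples have already been collected into a dictionary, and it uses "N consecutive failures" as a simple stand-in for anomaly detection; summarize_fleet, the host names, and the threshold are all hypothetical.

```python
def summarize_fleet(results, persistent_after=3):
    """Summarize 0/1 rotation results per host.

    results maps host name -> list of log_rotate_success samples, oldest first.
    Returns (success_percentage_of_latest_runs, hosts_with_persistent_failures).
    """
    latest = {host: samples[-1] for host, samples in results.items() if samples}
    success_rate = 100.0 * sum(latest.values()) / len(latest) if latest else 0.0

    # A host is flagged only when its last N runs all failed, which keeps
    # one-off failures from paging anyone.
    persistent = sorted(
        host
        for host, samples in results.items()
        if len(samples) >= persistent_after
        and not any(samples[-persistent_after:])
    )
    return success_rate, persistent
```

With samples such as {"web-1": [1, 1, 1], "web-2": [1, 0, 0, 0]}, the dashboard figure would be 50% and only web-2 would be flagged.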
Cost Considerations
Storing rotated logs in cloud object storage (S3, GCS) can reduce local disk pressure but introduces latency and cost. Define retention policies that balance compliance requirements with storage costs. Use lifecycle policies to transition logs to cheaper storage tiers after a set period.
One team I read about had a fleet of 500 servers, each generating 2 GB of logs per day. Their logrotate configuration used 30-day retention with gzip compression. After migrating to a centralized logging system, they reduced local retention to 7 days and archived the rest to S3 Glacier, cutting storage costs by 60% while maintaining compliance.
Risks, Pitfalls, and Mitigations
Even with a solid process, pitfalls remain. Below are common risks and how to mitigate them.
Silent Compression Failures
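If compression runs out of disk space or encounters a permission error, the rotated file can be left uncompressed or the archive left truncated, and a custom script that removes the original without checking the compressor's exit status loses the data. Mitigation: use delaycompress to keep the most recent rotated file uncompressed for one more cycle, and add a post-rotation check that compression succeeded.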
Race Conditions with High-Volume Logging
When logs are written at high velocity, the copy-truncate mode can lose data between the copy and truncate. Mitigation: use create-new with application support, or buffer logs through a reliable queue (e.g., syslog over TCP).
Time-Based Rotation vs. Size-Based Rotation
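Time-based rotation (e.g., daily) can generate uneven file sizes. A server that is idle most of the day may produce tiny logs, while a busy server may exceed size limits. Mitigation: combine a size cap with time-based rotation as a fallback (e.g., rotate when size > 100 MB or daily, whichever comes first). In logrotate this means pairing daily with maxsize; the plain size directive replaces the time interval instead of combining with it. Because thresholds are only evaluated when logrotate runs, schedule it more often than once a day for the size cap to take effect.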
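For custom rotation scripts, the "whichever comes first" rule reduces to a single predicate. A minimal sketch; should_rotate is a hypothetical name, and the defaults mirror the 100 MB and daily example from the text.

```python
from datetime import datetime, timedelta


def should_rotate(size_bytes, last_rotated, now,
                  max_bytes=100 * 1024 * 1024, max_age=timedelta(days=1)):
    """Rotate when the file exceeds the size cap or the interval has elapsed."""
    too_big = size_bytes > max_bytes
    too_old = now - last_rotated >= max_age
    return too_big or too_old
```

A busy server trips the size condition well before the day is over, while an idle one still rotates once the interval elapses.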
Inode Exhaustion
If logrotate creates many small files (e.g., per-minute rotation), the filesystem may run out of inodes even if disk space is available. Mitigation: set a minimum size threshold and avoid overly aggressive rotation.
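Inode headroom can be monitored alongside disk space. The sketch below uses os.statvfs, which is available on Unix-like systems; inode_usage is a hypothetical helper, and the None return covers filesystems that do not report a fixed inode count.

```python
import os


def inode_usage(path):
    """Fraction of inodes in use on the filesystem holding `path`, or None."""
    vfs = os.statvfs(path)
    if vfs.f_files == 0:
        return None  # filesystem does not report a fixed inode total
    # f_files is the total inode count, f_ffree the number still free.
    return (vfs.f_files - vfs.f_ffree) / vfs.f_files
```

Alerting on this value next to the disk usage threshold catches the case where space is available but no new files can be created.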
For each risk, document the mitigation in your runbook and test it during incident drills. Regular tabletop exercises help teams respond quickly when rotation failures occur.
Decision Checklist: Choosing Your Rotation Strategy
Use the following checklist to evaluate your current setup or design a new one.
Application Compatibility
- Does the application reopen log files on signal? (Check documentation or test with strace.)
- If not, use copy-truncate.
- If yes, use create-new with the correct signal.
Retention Requirements
- How long must logs be retained for compliance? (e.g., PCI DSS calls for at least 12 months of audit log history, with the most recent three months immediately available.)
- Are compressed archives acceptable? (Most compliance frameworks accept compressed logs.)
- Do you need immediate access to old logs, or can they be archived off-site?
Monitoring and Alerting
- Do you have a way to verify that rotation happened correctly? (e.g., cron job + alert)
- Are you alerted if disk usage exceeds a threshold? (e.g., 80% full)
- Do you track rotation success rates over time?
Scalability
- How many servers are involved? (Under 50: manual config acceptable; over 50: use CM tool.)
- Is log rotation managed centrally or per-server?
- Do you have a staging environment to test rotation changes?
This checklist is not exhaustive but covers the most common failure points. Adapt it to your specific stack and compliance needs.
Synthesis and Next Steps
Silent log rotation failures are dangerous because they erode trust in your monitoring and compliance posture. The fix is not a single tool but a combination of understanding file descriptor behavior, choosing the right rotation mode, adding verification, and scaling with centralized management.
Start by auditing one application: check its current rotation mode, verify that logs are being written to the correct file after rotation, and implement a simple verification script. Once you have a working pattern, roll it out to other services. Document every application’s rotation requirements in a central knowledge base.
Remember that log rotation is part of a larger observability strategy. Invest in centralized logging to reduce reliance on local rotation, but do not ignore local rotation—it is your safety net when the central system is down.
Finally, review your setup at least annually, or whenever you upgrade the OS or application. The silent failure can reappear after an update that changes default behavior.