Streamlining Data Processing with Shell Scripting

Fundamental Concepts

What is Shell Scripting for Data Processing?

A shell script is a text file containing a sequence of commands executed by a Unix shell (e.g., Bash, Zsh). For data processing, shell scripts automate repetitive tasks like filtering logs, transforming CSV files, aggregating metrics, or generating reports. Unlike heavyweight tools (e.g., Python, Hadoop), shell scripts leverage built-in command-line utilities, making them lightweight, fast, and accessible on nearly all systems.

Key Tools in the Shell Data Processing Toolkit

Shell scripting’s power lies in combining simple, focused tools to solve complex problems. Here are the workhorses:

  • grep: search for text patterns in files, e.g., grep "ERROR" app.log (find "ERROR" in logs).
  • awk: process structured data such as CSVs and column-based logs, e.g., awk -F ',' '{print $1, $3}' data.csv.
  • sed: edit text streams (replace, delete, etc.), e.g., sed 's/old_value/new_value/g' file.txt.
  • sort: sort lines alphabetically or numerically, e.g., sort -n numbers.txt (numeric sort).
  • uniq: collapse adjacent duplicate lines, so sort the input first, e.g., uniq -c data.txt (count duplicates).
  • cut: extract specific columns from text, e.g., cut -d ',' -f2 data.csv (2nd CSV column).
  • cat: concatenate or print files, e.g., cat file1.txt file2.txt > combined.txt.

Understanding Data Streams: stdin, stdout, stderr

All Unix tools communicate via streams:

  • stdin (Standard Input): Data input to a command (e.g., from a file or keyboard).
  • stdout (Standard Output): Data output by a command (e.g., results printed to the terminal).
  • stderr (Standard Error): Errors or warnings (separate from stdout to avoid polluting results).

Redirection lets you route streams to/from files:

  • >: Overwrite a file with stdout (e.g., ls > file_list.txt).
  • >>: Append stdout to a file (e.g., echo "new line" >> file.txt).
  • 2>: Redirect stderr to a file (e.g., ./script.sh 2> errors.log).
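These can be combined on a single command line. A minimal sketch (the file and script names here are placeholders):

# Send results to one file and errors to another
sort -n measurements.txt > sorted.txt 2> sort_errors.log

# Merge stderr into stdout so both land in the same log
./process_data.sh > run.log 2>&1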

Usage Methods: From Basic to Advanced

Filtering and Searching Data

Use grep to isolate relevant data. For example, to pull error, warning, and critical entries out of a log file:

# Syntax: grep [options] "pattern" file
grep -i "error" app.log  # -i: case-insensitive search
grep -E "ERROR|WARN" app.log  # -E: regex search (ERROR or WARN)
grep -A 3 "CRITICAL" app.log  # -A 3: show 3 lines AFTER the match

Transforming Data with awk and sed

awk is ideal for structured data (CSVs, logs with columns). Suppose users.csv has:

id,name,email
1,Alice,[email protected]
2,Bob,[email protected]

Extract names and emails (skip the header):

# Syntax: awk -F "delimiter" 'pattern {action}' file
awk -F ',' 'NR > 1 {print $2 ", " $3}' users.csv  # NR > 1: skip first line

Output:

Alice, [email protected]
Bob, [email protected]

sed excels at text replacement. Clean up a messy CSV by replacing semicolons with commas:

# Syntax: sed 's/old/new/flags' file
sed 's/;/,/g' messy_data.txt > clean_data.csv  # g: global replace (all occurrences)

Aggregating and Sorting Data

Combine sort and uniq to count duplicates or rank values. For example, to find the most frequent IP addresses in an Apache access log (access.log):

# Extract IPs (1st column), sort, count, then sort by count (descending)
awk '{print $1}' access.log | sort | uniq -c | sort -nr | head -5

Output:

  150 192.168.1.1
   87 10.0.0.2
   42 172.16.0.5

Piping and Redirection: Combining Tools

Pipes (|) chain commands, passing stdout of one to stdin of the next. This “tool pipeline” enables complex workflows with minimal code.

Example: Analyze top 5 most requested URLs from an Apache log:

# Step 1: Extract URLs (7th column in Apache logs), then count and sort
cat access.log | grep "GET" | awk '{print $7}' | sort | uniq -c | sort -nr | head -5

Breakdown:

  • cat access.log: Read the log file.
  • grep "GET": Filter for HTTP GET requests.
  • awk '{print $7}': Extract the URL (7th column in Apache logs).
  • sort: Sort URLs alphabetically.
  • uniq -c: Count occurrences of each URL.
  • sort -nr: Sort counts numerically in reverse (highest first).
  • head -5: Show top 5 results.
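As a small design note, the leading cat is not strictly required: grep (and awk) can read the file directly, which saves one process and one copy of the data:

# Equivalent pipeline without the extra cat
grep "GET" access.log | awk '{print $7}' | sort | uniq -c | sort -nr | head -5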

Common Practices: Real-World Workflows

Log Analysis: Parsing and Summarizing Errors

Automate daily log reviews with a script. Example: Count 404 errors in an Nginx log and save results to a report:

#!/bin/bash
# File: analyze_404s.sh

LOG_FILE="/var/log/nginx/access.log"
REPORT_FILE="404_report_$(date +%Y%m%d).txt"

# Extract 404 URLs, count, and sort
echo "404 Error Report - $(date)" > "$REPORT_FILE"
echo "========================" >> "$REPORT_FILE"
awk '$9 == 404 {print $7}' "$LOG_FILE" | sort | uniq -c | sort -nr >> "$REPORT_FILE"

echo "Report generated: $REPORT_FILE"

Run with chmod +x analyze_404s.sh && ./analyze_404s.sh.

CSV Data Cleaning and Transformation

Use a script to remove duplicates, filter rows, and reformat a CSV. For example, clean sales.csv to keep only 2023 sales with positive revenue:

#!/bin/bash
# File: clean_sales.sh

INPUT="sales.csv"
OUTPUT="cleaned_sales_2023.csv"

# Step 1: Keep rows where year (column 4) is 2023 and revenue (column 5) is positive
# Step 2: Remove duplicate rows (based on all columns)
# The header is written separately so sort -u cannot shuffle it into the data
{
  head -n 1 "$INPUT"
  awk -F ',' 'NR > 1 && $4 == 2023 && $5 > 0' "$INPUT" | sort -u
} > "$OUTPUT"

echo "Cleaned data saved to $OUTPUT"

Automating System Data Aggregation

Schedule scripts with cron to collect system metrics (CPU, memory) hourly. Example script:

#!/bin/bash
# File: collect_metrics.sh

METRICS_FILE="/var/log/system_metrics.csv"

# Add header if file is new
if [ ! -f "$METRICS_FILE" ]; then
  echo "timestamp,cpu_usage(%),mem_usage(%)" > "$METRICS_FILE"
fi

# Get CPU and memory usage (using `top` or `vmstat`)
TIMESTAMP=$(date +"%Y-%m-%d %H:%M:%S")
CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print $2 + $4}')  # user + system CPU
MEM_USAGE=$(free | awk '/Mem/ {print $3/$2 * 100}' | cut -d '.' -f1)

# Append to metrics file
echo "$TIMESTAMP,$CPU_USAGE,$MEM_USAGE" >> "$METRICS_FILE"

Add to cron to run hourly:

# Edit crontab: crontab -e
0 * * * * /path/to/collect_metrics.sh  # Runs at minute 0 of every hour

Best Practices for Reliable and Efficient Scripts

Error Handling and Robustness

  • Use set -euo pipefail to exit on errors/unset variables/failed pipes:
    # Add this at the top of scripts
    set -euo pipefail  # -e: exit on error; -u: error on unset variable; -o pipefail: exit if any pipe command fails
  • Validate inputs (e.g., check if files exist):
    if [ ! -f "$INPUT_FILE" ]; then
      echo "Error: Input file $INPUT_FILE not found." >&2  # >&2: redirect to stderr
      exit 1
    fi
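Putting both practices together, a minimal skeleton might look like this (the input-file argument and the ERROR pattern are illustrative placeholders, not part of any particular workflow):

#!/bin/bash
# Exit on errors, unset variables, and failed pipeline stages
set -euo pipefail

# First argument is the input file; default to empty so set -u does not trip
INPUT_FILE="${1:-}"

if [ -z "$INPUT_FILE" ] || [ ! -f "$INPUT_FILE" ]; then
  echo "Usage: $0 <input_file> (file must exist)" >&2
  exit 1
fi

# Processing goes here; any failure aborts the script immediately
awk '/ERROR/ {n++} END {print n+0}' "$INPUT_FILE"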

Readability and Maintainability

  • Comment liberally: Explain why (not just what) the code does.
  • Use functions for reusable logic:
    # Function to clean CSV data
    clean_csv() {
      local input="$1"
      local output="$2"
      awk -F ',' 'NR == 1 || $3 > 0' "$input" | sort -u > "$output"
    }
    # Usage: clean_csv "raw.csv" "clean.csv"
  • Indent code (2–4 spaces) and use meaningful variable names.

Performance Optimization

  • Avoid shell-level loops over data (they are slow); use tools like awk or sed for bulk processing instead, as the comparison after this list shows.
  • Minimize I/O: Process data in memory with pipes instead of writing temporary files.
  • Use efficient tools: Prefer grep -F over grep -E for fixed strings, and sort -u instead of sort | uniq.
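To see the first point in practice, compare a line-by-line shell loop with a single awk pass. Both sum the second column of a hypothetical, header-less values.csv containing integer values; the awk version runs in one process and is typically far faster on large files:

# Slow: the shell re-runs its built-ins for every line
total=0
while IFS=',' read -r _ value _; do
  total=$((total + value))   # assumes integer values in column 2
done < values.csv
echo "$total"

# Fast: one awk process scans the whole file
awk -F ',' '{total += $2} END {print total}' values.csv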

Security Considerations

  • Sanitize inputs: Avoid injecting untrusted data into commands (e.g., use "$VAR" instead of $VAR to handle spaces/special chars).
  • Restrict file permissions: Make scripts executable only by the owner (chmod 700 script.sh).
  • Avoid temporary files: Use process substitution (<(command)) instead of writing to /tmp (risk of race conditions).
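For the last point, process substitution lets a command's output stand in for a file name, so intermediate results never touch the disk. A small sketch (the CSV file names are placeholders):

# Compare two datasets after sorting them, without temporary files
diff <(sort dataset_a.csv) <(sort dataset_b.csv)

# Join two extracts on their first column, again without temp files
join -t ',' <(sort -t ',' -k1,1 extract_a.csv) <(sort -t ',' -k1,1 extract_b.csv)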

Testing

  • Use shellcheck to lint scripts for bugs (install with apt install shellcheck):
    shellcheck my_script.sh  # Highlights syntax errors, undefined variables, etc.
  • Test with a small dataset first (e.g., head -100 raw_data.csv | ./clean_script.sh).

Conclusion

Shell scripting is a Swiss Army knife for data processing—lightweight, flexible, and deeply integrated with Unix systems. By combining tools like grep, awk, and sed, you can automate everything from log analysis to ETL pipelines. Adopting best practices (error handling, readability, security) ensures scripts are reliable and maintainable.

Start small: automate a tedious daily task (e.g., log filtering), then build up to complex workflows. With practice, shell scripting will become an indispensable tool for streamlining your data processing workflows.
