Advanced Data Transformation with jq and yq: A Comprehensive Guide

In modern DevOps and data processing workflows, dealing with structured data formats like YAML and JSON is a common task. Two powerful command-line tools, jq and yq, make these transformations easier and more efficient. This article explores how to use these tools together to perform complex data transformations.

Introduction to jq and yq

  • jq: A lightweight command-line JSON processor
  • yq: A YAML processor that can convert between YAML and JSON
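
Both tools share a similar filter syntax. As a quick orientation, here is the simplest possible use of each; note that the yq flags used throughout this article (-o, -P) assume Mike Farah's Go implementation of yq, version 4:

# Extract a single field from JSON with jq (-r prints the raw string)
echo '{"name": "Kate"}' | jq -r '.name'
# Kate

# Extract the same field from YAML with yq
yq '.name' input.yml
# Kate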

Real-World Use Case: Converting Complex YAML Structures

Let’s explore a practical example where we need to transform a YAML file with embedded key-value pairs into a structured format.

Initial Input (input.yml)

name: Kate
info: height=170 age=25 gender="female" school="MIT University"

Desired Output

name: Kate
height: "170"
age: "25"
gender: female
school: MIT University

Step-by-Step Solution

1. Converting YAML to JSON

First, we convert the YAML file to JSON for processing with jq:

yq -o json input.yml > input.json
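
For the sample input above, input.json should contain the name field unchanged and the info field as one flat string:

{
  "name": "Kate",
  "info": "height=170 age=25 gender=\"female\" school=\"MIT University\""
}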

2. Creating a jq Query

Let’s break down the complex jq query that extracts and transforms the data:

.info |
[scan("(?<key>[^ =]+)=(?<value>\"[^\"]+\"|[^\" ]+)") | 
{
  key: .[0], 
  value: .[1]|(if startswith("\"") then .[1:-1] else . end)
}] |
from_entries

Let’s analyze each part:

  1. .info: Selects the info field
  2. scan(): Uses regex to match key-value pairs
    • (?<key>[^ =]+): Captures the key (one or more characters that are neither a space nor =)
    • =: Matches the equals sign
    • (?<value>\"[^\"]+\"|[^\" ]+): Captures the value (quoted or unquoted)
  3. The object construction turns each [key, value] capture pair into a {key, value} object, stripping the surrounding quotes from quoted values with .[1:-1]
  4. from_entries: Converts array of key-value pairs into an object
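
Because the regex contains capture groups, scan() emits one array of captures per match. You can inspect this intermediate result by running only the scan step on the sample input:

yq -o json input.yml | jq -c '.info | [scan("(?<key>[^ =]+)=(?<value>\"[^\"]+\"|[^\" ]+)")]'
# [["height","170"],["age","25"],["gender","\"female\""],["school","\"MIT University\""]]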

3. Complete Pipeline

The complete transformation combining both tools:

yq -o json input.yml | \
  jq '. + (.info
       | [scan("(?<key>[^ =]+)=(?<value>\"[^\"]+\"|[^\" ]+)")
          | {key: .[0], value: .[1] | (if startswith("\"") then .[1:-1] else . end)}]
       | from_entries)
     | del(.info)' | \
  yq -P
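
Before the final yq -P step converts it back to YAML, the jq stage should produce the merged JSON object (all values are still strings, since scan() captures text):

{
  "name": "Kate",
  "height": "170",
  "age": "25",
  "gender": "female",
  "school": "MIT University"
}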

Best Practices and Additional Examples

1. Handling Multiple Data Types

# input.yml
name: John
info: age=30 score=98.5 active=true tags="python,java,golang"

Query to handle different data types:

.info |
[scan("(?<key>[^ =]+)=(?<value>\"[^\"]+\"|[^\" ]+)") |
{
  key: .[0],
  value: .[1]|(
    if startswith("\"") then .[1:-1]
    elif . == "true" then true
    elif . == "false" then false
    elif test("^[0-9]+$") then tonumber
    elif test("^[0-9]+\\.[0-9]+$") then tonumber
    else .
    end
  )
}] |
from_entries
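
Applied to the John example, this type-aware variant should produce real numbers and booleans instead of strings:

{
  "age": 30,
  "score": 98.5,
  "active": true,
  "tags": "python,java,golang"
}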

2. Processing Arrays

# input.yml
name: Team A
members:
  - info: role="developer" level=3 skills="java,python"
  - info: role="designer" level=2 skills="ui,ux"

Query to process arrays:

yq -o json input.yml | \
  jq '.members |= map(. + (.info
       | [scan("(?<key>[^ =]+)=(?<value>\"[^\"]+\"|[^\" ]+)")
          | {key: .[0], value: .[1] | (if startswith("\"") then .[1:-1] else . end)}]
       | from_entries) | del(.info))' | \
  yq -P
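
With the team file above, the expected YAML output is shown below. Note that level remains a string ("3", "2") because this query does not include the type-conversion logic from the previous example:

name: Team A
members:
  - role: developer
    level: "3"
    skills: java,python
  - role: designer
    level: "2"
    skills: ui,ux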

3. Error Handling Best Practices

  1. Always validate input format:
    
    yq eval 'type' input.yml > /dev/null || { echo "Invalid YAML"; exit 1; }
    
  2. Use proper error handling in jq:
    if (.info | type) != "string" then
      error("info field must be a string")
    else
      # processing logic
    end
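
Putting both ideas together, a defensive version of the earlier pipeline might look like this sketch:

yq -o json input.yml | jq '
  if (.info | type) != "string" then
    error("info field must be a string")
  else
    . + (.info
         | [scan("(?<key>[^ =]+)=(?<value>\"[^\"]+\"|[^\" ]+)")
            | {key: .[0], value: .[1] | (if startswith("\"") then .[1:-1] else . end)}]
         | from_entries)
    | del(.info)
  end' | yq -P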
    

Advanced Tips

  1. Performance Optimization: For very large files, convert to JSON once and use jq's streaming parser (--stream) so the whole document is not held in memory at once:
    
    yq -o json input.yml | jq -c --stream ...
    
  2. Debugging Complex Queries: Break down the pipeline:
    
    # Debug intermediate JSON
    yq -o json input.yml | tee debug1.json | \
      jq '.' | tee debug2.json | \
      yq -P
    
  3. Maintaining Code Quality: Store complex jq queries in separate files:
    
    # transform.jq
    def process_info:
      [scan("(?<key>[^ =]+)=(?<value>\"[^\"]+\"|[^\" ]+)")
       | {key: .[0], value: .[1] | (if startswith("\"") then .[1:-1] else . end)}]
      | from_entries;
    
    . + (.info | process_info) | del(.info)
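
The saved query can then be loaded with jq's -f (--from-file) option:

    yq -o json input.yml | jq -f transform.jq | yq -P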

Conclusion

The combination of jq and yq provides a powerful toolkit for data transformation. By understanding their capabilities and following best practices, you can build robust and maintainable data processing pipelines. Remember to:

  • Break down complex transformations into smaller steps
  • Use proper error handling
  • Store complex queries in separate files
  • Test thoroughly with different input formats
  • Consider performance for large-scale processing

These tools are invaluable for DevOps engineers, data engineers, and anyone working with structured data formats in the command line.

This post is licensed under CC BY 4.0 by the author.