Advanced Data Transformation with jq and yq: A Comprehensive Guide
In modern DevOps and data processing workflows, dealing with structured data formats like YAML and JSON is a common task. Two powerful command-line tools, jq and yq, make these transformations easier and more efficient. This article explores how to use these tools together to perform complex data transformations, starting with a quick round-trip example.
Introduction to jq and yq
- jq: A lightweight command-line JSON processor
- yq: A YAML processor that can convert between YAML and JSON
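As a quick orientation, the round trip between the two formats is a one-liner in each direction. This sketch assumes mikefarah's yq v4, where -o selects the output format and -P pretty-prints:

```bash
# YAML -> JSON
echo 'name: Kate' | yq -o json
# {
#   "name": "Kate"
# }

# JSON -> YAML (JSON is valid YAML, so yq reads it directly)
echo '{"name": "Kate"}' | yq -P
# name: Kate
```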
Real-World Use Case: Converting Complex YAML Structures
Let’s explore a practical example where we need to transform a YAML file with embedded key-value pairs into a structured format.
Initial Input (input.yml)
```yaml
name: Kate
info: height=170 age=25 gender="female" school="MIT University"
```
Desired Output
```yaml
name: Kate
height: "170"
age: "25"
gender: female
school: MIT University
```
Step-by-Step Solution
1. Converting YAML to JSON
First, we convert the YAML file to JSON for processing with jq:
```bash
yq -o json input.yml > input.json
```
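For the sample input above, input.json carries the whole info line as a single string, which is what the jq query will pick apart:

```json
{
  "name": "Kate",
  "info": "height=170 age=25 gender=\"female\" school=\"MIT University\""
}
```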
2. Creating a jq Query
Let’s break down the complex jq query that extracts and transforms the data:
```jq
.info |
[scan("(?<key>[^ =]+)=(?<value>\"[^\"]+\"|[^\" ]+)") |
  {
    key: .[0],
    value: .[1] | (if startswith("\"") then .[1:-1] else . end)
  }] |
from_entries
```
Let’s analyze each part:
- `.info`: selects the info field
- `scan()`: uses a regex to find every key=value pair in the string; because the regex contains capture groups, each match is emitted as an array of the captured strings
- `(?<key>[^ =]+)`: captures the key (anything before the `=`)
- `=`: matches the equals sign
- `(?<value>\"[^\"]+\"|[^\" ]+)`: captures the value (quoted or unquoted)
- The object constructor turns each match into a `{key, value}` entry, with `.[0]` as the key and `.[1]` as the value; the `if startswith("\"")` branch strips the surrounding quotes from quoted values
- `from_entries`: converts the array of key-value entries into a single object
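Running the query by itself against input.json is a good way to check the extraction before merging the result back (using -c for compact output):

```bash
jq -c '.info |
       [scan("(?<key>[^ =]+)=(?<value>\"[^\"]+\"|[^\" ]+)") |
        {key: .[0], value: .[1] | (if startswith("\"") then .[1:-1] else . end)}] |
       from_entries' input.json
# {"height":"170","age":"25","gender":"female","school":"MIT University"}
```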
3. Complete Pipeline
The complete transformation combining both tools:
```bash
yq -o json input.yml |
  jq '. + (.info
           | [scan("(?<key>[^ =]+)=(?<value>\"[^\"]+\"|[^\" ]+)")
              | {key: .[0], value: .[1] | (if startswith("\"") then .[1:-1] else . end)}]
           | from_entries)
      | del(.info)' |
  yq -P
```

Note that the line breaks sit inside the single-quoted jq program, so no backslash continuations are needed there; a trailing `\` inside the quotes would be passed to jq and break the program.
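As a side note, since the pattern already names its groups, jq's capture builtin with the "g" flag yields `{key, value}` objects directly, which makes the `.[0]`/`.[1]` indexing unnecessary. A sketch of an equivalent pipeline:

```bash
yq -o json input.yml |
  jq '. + (.info
           | [capture("(?<key>[^ =]+)=(?<value>\"[^\"]+\"|[^\" ]+)"; "g")
              | .value |= (if startswith("\"") then .[1:-1] else . end)]
           | from_entries)
      | del(.info)' |
  yq -P
```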
Best Practices and Additional Examples
1. Handling Multiple Data Types
```yaml
# input.yml
name: John
info: age=30 score=98.5 active=true tags="python,java,golang"
```
Query to handle different data types:
```jq
.info |
[scan("(?<key>[^ =]+)=(?<value>\"[^\"]+\"|[^\" ]+)") |
  {
    key: .[0],
    value: .[1] | (
      if startswith("\"") then .[1:-1]
      elif . == "true" then true
      elif . == "false" then false
      elif test("^[0-9]+$") then tonumber
      elif test("^[0-9]+\\.[0-9]+$") then tonumber
      else .
      end
    )
  }] |
from_entries
```
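Applied to the John example, this query yields native JSON types instead of strings:

```json
{
  "age": 30,
  "score": 98.5,
  "active": true,
  "tags": "python,java,golang"
}
```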
2. Processing Arrays
```yaml
# input.yml
name: Team A
members:
  - info: role="developer" level=3 skills="java,python"
  - info: role="designer" level=2 skills="ui,ux"
```
Query to process arrays:
```bash
yq -o json input.yml |
  jq '.members |= map(
        . + (.info
             | [scan("(?<key>[^ =]+)=(?<value>\"[^\"]+\"|[^\" ]+)")
                | {key: .[0], value: .[1] | (if startswith("\"") then .[1:-1] else . end)}]
             | from_entries)
        | del(.info))' |
  yq -P
```
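The output keeps the surrounding structure intact while expanding each info string. Values that were unquoted in the source, such as level=3, remain strings under this basic query, so yq quotes them to preserve the type:

```yaml
name: Team A
members:
  - role: developer
    level: "3"
    skills: java,python
  - role: designer
    level: "2"
    skills: ui,ux
```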
3. Error Handling Best Practices
- Always validate input format:
```bash
yq eval '.' input.yml > /dev/null || { echo "Invalid YAML"; exit 1; }
```
- Use proper error handling in jq:
```jq
if (.info | type) != "string"
then error("info field must be a string")
else .   # processing logic goes here
end
```
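Because error() makes jq exit with a non-zero status, the surrounding shell script can react to validation failures. A minimal sketch:

```bash
if ! yq -o json input.yml |
     jq 'if (.info | type) != "string"
         then error("info field must be a string")
         else . end' > output.json; then
  echo "transformation failed" >&2
  exit 1
fi
```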
Advanced Tips
- Performance Optimization: For very large files, use jq's streaming parser, which walks the document as a series of events instead of loading it all into memory (note that the --stream flag belongs to jq, not yq):
```bash
yq -o json input.yml | jq -c --stream ...
```
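For the Kate example, the streamed events look like this: each event is a [path, value] pair, plus a final path-only event that closes the enclosing object:

```bash
yq -o json input.yml | jq -c --stream '.'
# [["name"],"Kate"]
# [["info"],"height=170 age=25 gender=\"female\" school=\"MIT University\""]
# [["info"]]
```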
- Debugging Complex Queries: Break down the pipeline:
```bash
# Debug intermediate JSON
yq -o json input.yml | tee debug1.json |
  jq '.' | tee debug2.json |
  yq -P
```
- Maintaining Code Quality: Store complex jq queries in separate files:
```jq
# transform.jq
def process_info:
  [scan("(?<key>[^ =]+)=(?<value>\"[^\"]+\"|[^\" ]+)") |
   {key: .[0], value: .[1] | (if startswith("\"") then .[1:-1] else . end)}] |
  from_entries;

. + (.info | process_info) | del(.info)
```
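The saved program can then be applied with jq's -f (--from-file) flag:

```bash
yq -o json input.yml | jq -f transform.jq | yq -P
```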
Conclusion
The combination of jq and yq provides a powerful toolkit for data transformation. By understanding their capabilities and following best practices, you can build robust and maintainable data processing pipelines. Remember to:
- Break down complex transformations into smaller steps
- Use proper error handling
- Store complex queries in separate files
- Test thoroughly with different input formats
- Consider performance for large-scale processing
These tools are invaluable for DevOps engineers, data engineers, and anyone working with structured data formats in the command line.