Advanced Data Transformation with jq and yq: A Comprehensive Guide
In modern DevOps and data processing workflows, dealing with structured data formats like YAML and JSON is a common task. Two powerful command-line tools, jq and yq, make these transformations easier and more efficient. This article explores how to use these tools together to perform complex data transformations, starting with a quick round-trip example.
Introduction to jq and yq
- jq: A lightweight command-line JSON processor
- yq: A YAML processor that can convert between YAML and JSON
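As a quick orientation, the round trip between the two formats is a one-liner in each direction. This sketch assumes mikefarah's yq v4, where -o selects the output format and -P pretty-prints:

```bash
# YAML -> JSON
echo 'name: Kate' | yq -o json
# {
#   "name": "Kate"
# }

# JSON -> YAML (JSON is valid YAML, so yq reads it directly)
echo '{"name": "Kate"}' | yq -P
# name: Kate
```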
Real-World Use Case: Converting Complex YAML Structures
Let’s explore a practical example where we need to transform a YAML file with embedded key-value pairs into a structured format.
Initial Input (input.yml)
```yaml
name: Kate
info: height=170 age=25 gender="female" school="MIT University"
```
Desired Output
```yaml
name: Kate
height: "170"
age: "25"
gender: female
school: MIT University
```
Step-by-Step Solution
1. Converting YAML to JSON
First, we convert the YAML file to JSON for processing with jq:
```bash
yq -o json input.yml > input.json
```
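For the sample input above, input.json carries the whole info line as a single string, which is what the jq query will pick apart:

```json
{
  "name": "Kate",
  "info": "height=170 age=25 gender=\"female\" school=\"MIT University\""
}
```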
2. Creating a jq Query
Let’s break down the complex jq query that extracts and transforms the data:
```jq
.info |
[scan("(?<key>[^ =]+)=(?<value>\"[^\"]+\"|[^\" ]+)") |
  {
    key: .[0],
    value: .[1] | (if startswith("\"") then .[1:-1] else . end)
  }] |
from_entries
```
Let’s analyze each part:
- `.info`: selects the info field
- `scan()`: uses a regex to find every key=value pair in the string; because the regex contains capture groups, each match is emitted as an array of the captured strings
- `(?<key>[^ =]+)`: captures the key (anything before the `=`)
- `=`: matches the equals sign
- `(?<value>\"[^\"]+\"|[^\" ]+)`: captures the value (quoted or unquoted)
- The object constructor turns each match into a `{key, value}` entry, with `.[0]` as the key and `.[1]` as the value; the `if startswith("\"")` branch strips the surrounding quotes from quoted values
- `from_entries`: converts the array of key-value entries into a single object
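Running the query by itself against input.json is a good way to check the extraction before merging the result back (using -c for compact output):

```bash
jq -c '.info |
       [scan("(?<key>[^ =]+)=(?<value>\"[^\"]+\"|[^\" ]+)") |
        {key: .[0], value: .[1] | (if startswith("\"") then .[1:-1] else . end)}] |
       from_entries' input.json
# {"height":"170","age":"25","gender":"female","school":"MIT University"}
```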
3. Complete Pipeline
The complete transformation combining both tools:
```bash
yq -o json input.yml |
  jq '. + (.info
           | [scan("(?<key>[^ =]+)=(?<value>\"[^\"]+\"|[^\" ]+)")
              | {key: .[0], value: .[1] | (if startswith("\"") then .[1:-1] else . end)}]
           | from_entries)
      | del(.info)' |
  yq -P
```

Note that the line breaks sit inside the single-quoted jq program, so no backslash continuations are needed there; a trailing `\` inside the quotes would be passed to jq and break the program.
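As a side note, since the pattern already names its groups, jq's capture builtin with the "g" flag yields `{key, value}` objects directly, which makes the `.[0]`/`.[1]` indexing unnecessary. A sketch of an equivalent pipeline:

```bash
yq -o json input.yml |
  jq '. + (.info
           | [capture("(?<key>[^ =]+)=(?<value>\"[^\"]+\"|[^\" ]+)"; "g")
              | .value |= (if startswith("\"") then .[1:-1] else . end)]
           | from_entries)
      | del(.info)' |
  yq -P
```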
Best Practices and Additional Examples
1. Handling Multiple Data Types
```yaml
# input.yml
name: John
info: age=30 score=98.5 active=true tags="python,java,golang"
```
Query to handle different data types:
```jq
.info |
[scan("(?<key>[^ =]+)=(?<value>\"[^\"]+\"|[^\" ]+)") |
  {
    key: .[0],
    value: .[1] | (
      if startswith("\"") then .[1:-1]
      elif . == "true" then true
      elif . == "false" then false
      elif test("^[0-9]+$") then tonumber
      elif test("^[0-9]+\\.[0-9]+$") then tonumber
      else .
      end
    )
  }] |
from_entries
```
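Applied to the John example, this query yields native JSON types instead of strings:

```json
{
  "age": 30,
  "score": 98.5,
  "active": true,
  "tags": "python,java,golang"
}
```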
2. Processing Arrays
```yaml
# input.yml
name: Team A
members:
  - info: role="developer" level=3 skills="java,python"
  - info: role="designer" level=2 skills="ui,ux"
```
Query to process arrays:
```bash
yq -o json input.yml |
  jq '.members |= map(
        . + (.info
             | [scan("(?<key>[^ =]+)=(?<value>\"[^\"]+\"|[^\" ]+)")
                | {key: .[0], value: .[1] | (if startswith("\"") then .[1:-1] else . end)}]
             | from_entries)
        | del(.info))' |
  yq -P
```
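The output keeps the surrounding structure intact while expanding each info string. Values that were unquoted in the source, such as level=3, remain strings under this basic query, so yq quotes them to preserve the type:

```yaml
name: Team A
members:
  - role: developer
    level: "3"
    skills: java,python
  - role: designer
    level: "2"
    skills: ui,ux
```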
3. Error Handling Best Practices
- Always validate input format:
```bash
yq eval '.' input.yml > /dev/null || { echo "Invalid YAML"; exit 1; }
```
- Use proper error handling in jq:
```jq
if (.info | type) != "string"
then error("info field must be a string")
else .   # processing logic goes here
end
```
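Because error() makes jq exit with a non-zero status, the surrounding shell script can react to validation failures. A minimal sketch:

```bash
if ! yq -o json input.yml |
     jq 'if (.info | type) != "string"
         then error("info field must be a string")
         else . end' > output.json; then
  echo "transformation failed" >&2
  exit 1
fi
```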
Advanced Tips
- Performance Optimization: For very large files, use jq's streaming parser, which walks the document as a series of events instead of loading it all into memory (note that the --stream flag belongs to jq, not yq):
```bash
yq -o json input.yml | jq -c --stream ...
```
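For the Kate example, the streamed events look like this: each event is a [path, value] pair, plus a final path-only event that closes the enclosing object:

```bash
yq -o json input.yml | jq -c --stream '.'
# [["name"],"Kate"]
# [["info"],"height=170 age=25 gender=\"female\" school=\"MIT University\""]
# [["info"]]
```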
- Debugging Complex Queries: Break down the pipeline:
```bash
# Debug intermediate JSON
yq -o json input.yml | tee debug1.json |
  jq '.' | tee debug2.json |
  yq -P
```
- Maintaining Code Quality: Store complex jq queries in separate files:
```jq
# transform.jq
def process_info:
  [scan("(?<key>[^ =]+)=(?<value>\"[^\"]+\"|[^\" ]+)") |
   {key: .[0], value: .[1] | (if startswith("\"") then .[1:-1] else . end)}] |
  from_entries;

. + (.info | process_info) | del(.info)
```
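The saved program can then be applied with jq's -f (--from-file) flag:

```bash
yq -o json input.yml | jq -f transform.jq | yq -P
```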
Conclusion
The combination of jq and yq provides a powerful toolkit for data transformation. By understanding their capabilities and following best practices, you can build robust and maintainable data processing pipelines. Remember to:
- Break down complex transformations into smaller steps
- Use proper error handling
- Store complex queries in separate files
- Test thoroughly with different input formats
- Consider performance for large-scale processing
These tools are invaluable for DevOps engineers, data engineers, and anyone working with structured data formats in the command line.