Abstract

The Unified Navigation Language (UNL) is designed as an embeddable query language. This specification describes its syntax, data model, type system, and the necessary components of an execution environment for integration into a host programming language. UNL provides a single, coherent syntax to query, navigate, and transform data across multiple structured formats, including XML, JSON, CSV, and RDF.

Introduction

In an increasingly heterogeneous data ecosystem, developers must master multiple query languages (XPath for XML, JSONPath for JSON, SPARQL for RDF, etc.). UNL was designed to eliminate this complexity by offering a unified abstraction layer.

UNL's key innovations are:

Design Principles

Operating Models

A UNL query is always evaluated against an initial context. The initial context can be a single document or a sequence of documents.

In-Memory Navigation

In this model, the UNL engine operates on a data structure that is already parsed and held in memory. The context can be a single root Node or a sequence of Nodes. The UNL query is an expression that navigates this existing structure directly. This corresponds to the InMemoryQuery production in the EBNF grammar.

// context: Node(Object)
let myJson = { "user": { "name": "John" } };
// query: String
let inMemoryQuery = "user/@name";

// unl.run(context: Node, query: String) → Sequence(Leaf)
let result = unl.run(myJson, inMemoryQuery);
// result is a Sequence containing one item: Leaf("John")
          

Resource Loading and Navigation

In this model, the UNL query begins with a Resource Locator. This locator can be a single resource identifier (a file path or a URI) or a pattern (e.g., a glob) that resolves to a sequence of resources. This locator is immediately followed by a mandatory transformation pipe (|) that instructs the engine how to fetch and parse the resource(s). This corresponds to the ResourceQuery production in the EBNF grammar.

Important Note on Resource Locators

The ResourceLocator part of a query is treated as a literal string path or URI. It only supports the standard / as a path separator. The special UNL navigation operators such as ? (optional navigation), @ (leaf access), or wildcards are not valid within the locator itself. UNL's navigation logic is only applied to the path expression that follows the first transformation pipe.

// query: String (A ResourceQuery)
let resourceQuery = "data.csv|csv/*[1]/@price";

// unl.run(query: String) → Sequence(Leaf)
let result = unl.run(resourceQuery);
// result is a Sequence containing one item: Leaf("some_price")
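
The rule that UNL navigation only starts after the first transformation pipe can be illustrated with a small splitter. This is a minimal sketch, not part of any UNL implementation: `splitResourceQuery` is a hypothetical helper name, and treating a double-quoted locator (as in "data/*.xml") as a single unit is an assumption drawn from the examples in this document.

```javascript
// Hypothetical helper: split a ResourceQuery at its first transformation pipe.
// Assumption: a locator may be wrapped in double quotes, and a pipe inside
// those quotes does not count as the separator.
function splitResourceQuery(query) {
  let inQuotes = false;
  for (let i = 0; i < query.length; i++) {
    const ch = query[i];
    if (ch === '"') inQuotes = !inQuotes;
    if (ch === "|" && !inQuotes) {
      // Strip one pair of surrounding quotes from the locator, if present.
      const locator = query.slice(0, i).replace(/^"(.*)"$/, "$1");
      return { locator, pipeline: query.slice(i + 1) };
    }
  }
  throw new Error("A ResourceQuery requires a transformation pipe after the locator");
}

const parts = splitResourceQuery("data.csv|csv/*[1]/@price");
console.log(parts.locator);  // data.csv
console.log(parts.pipeline); // csv/*[1]/@price
```

Everything in `locator` is treated as a literal path or URI; only `pipeline` would be handed to the UNL parser.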
          

Host Language Integration

UNL is designed to be embedded within a host programming language (e.g., Python, JavaScript, Java). A UNL engine implementation MUST be provided with an Execution Environment by the host. This environment supplies the context and configurations necessary to execute a query.

The components of the Execution Environment are:

Initial Context
This is the data upon which the UNL query will run. It corresponds to one of the Operating Models: either a pre-parsed object/node (for In-Memory Navigation) or a string/list of strings representing Resource Locators.
Namespace Mappings
For queries involving prefixed QNames (e.g., foaf:name), the host environment MUST provide a mapping of prefixes to their full namespace URI strings. This is typically a hash map or dictionary. The UNL engine uses this map to expand each prefixed QName into its full namespace URI.
Variable Bindings (Optional)
To write secure and reusable queries, an implementation SHOULD support external variable binding. Variables within a UNL query are denoted with a `$` prefix (e.g., $username). The host environment can provide a map of variable names to their values. These values are mapped to UNL's conceptual data types, and can be primitive Leaves (String, Number, Boolean) or structured Nodes (Object, Array).
// Pseudo-code for variable binding
let variables = {
  "user_id": 123,
  "allowed_roles": ["admin", "editor"]
};
// Use a numeric variable in a predicate
unl.run(context, "users[@id = $user_id]", { variables });
// Use an array variable in a predicate
unl.run(context, "users[role = $allowed_roles]", { variables });
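
How host values might be mapped to UNL's conceptual data types can be sketched as follows. The function names and the `{ kind, type, ... }` record shape are assumptions made for illustration; the specification only states that bound values become primitive Leaves or structured Nodes.

```javascript
// Hypothetical mapping from host (JavaScript) values to UNL conceptual types.
function toUnlValue(value) {
  if (value === null) return { kind: "Leaf", type: "Null", value };
  switch (typeof value) {
    case "string":  return { kind: "Leaf", type: "String", value };
    case "number":  return { kind: "Leaf", type: "Number", value };
    case "boolean": return { kind: "Leaf", type: "Boolean", value };
  }
  if (Array.isArray(value)) {
    return { kind: "Node", type: "Array", items: value.map(toUnlValue) };
  }
  if (typeof value === "object") {
    const entries = Object.entries(value).map(([k, v]) => [k, toUnlValue(v)]);
    return { kind: "Node", type: "Object", entries: new Map(entries) };
  }
  // Functions, symbols, and undefined have no UNL counterpart in this sketch.
  throw new TypeError(`Cannot bind a value of type ${typeof value}`);
}

// Build the variable table that a query such as users[@id = $user_id] would read.
function bindVariables(variables) {
  const bound = new Map();
  for (const [name, value] of Object.entries(variables)) {
    bound.set(name, toUnlValue(value));
  }
  return bound;
}

const bound = bindVariables({ user_id: 123, allowed_roles: ["admin", "editor"] });
console.log(bound.get("user_id").type);       // Number
console.log(bound.get("allowed_roles").type); // Array
```

Because values are passed as typed data rather than spliced into the query string, a query built this way is not exposed to injection through variable contents.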
            
Custom Function Library (Optional)
For advanced extensibility, an implementation MAY allow the host language to register custom functions. To prevent name collisions with built-in functions, custom functions MUST be namespaced. The host environment provides a mapping of namespace prefixes to collections of functions.
// Pseudo-code for registering and using a custom function
// my_validators.js
const is_internal_email = (email_string) => email_string.endsWith('@example.com');

// Main application
unl.registerFunctionLibrary("validate", { isInternal: is_internal_email });

unl.run(context, "//user[validate:isInternal(@email)]");
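
One way a host binding could keep custom functions apart from built-ins is a registry keyed by namespace prefix. The class below is a sketch under that assumption; `FunctionRegistry` and its `resolve` method are hypothetical names, and only `registerFunctionLibrary` mirrors the pseudo-code above.

```javascript
// Hypothetical registry separating built-in names from namespaced custom functions.
class FunctionRegistry {
  constructor(builtins = {}) {
    this.builtins = new Map(Object.entries(builtins));
    this.libraries = new Map(); // prefix -> Map(local name -> function)
  }

  registerFunctionLibrary(prefix, functions) {
    if (typeof prefix !== "string" || prefix === "" || prefix.includes(":")) {
      throw new Error("A custom function library needs a non-empty prefix without ':'");
    }
    this.libraries.set(prefix, new Map(Object.entries(functions)));
  }

  // Resolve "name" to a built-in, or "prefix:name" to a custom function.
  resolve(name) {
    const colon = name.indexOf(":");
    const table = colon === -1 ? this.builtins : this.libraries.get(name.slice(0, colon));
    const fn = table && table.get(name.slice(colon + 1));
    if (!fn) throw new Error(`Unknown function: ${name}`);
    return fn;
  }
}

const registry = new FunctionRegistry({ not: (value) => !value });
registry.registerFunctionLibrary("validate", {
  isInternal: (email) => email.endsWith("@example.com"),
});
console.log(registry.resolve("validate:isInternal")("jo@example.com")); // true
```

Since unprefixed names are only ever looked up among the built-ins, a custom `isInternal` can never shadow a built-in of the same name.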
            

Sequence Transformations

UNL processes sequences of results with a two-stage pipeline model: a streaming stage (|) that handles items one at a time with low memory use, and an optional aggregation stage (||) that operates on the complete sequence.

The Streaming Pipeline (|)

The default pipeline, using the single pipe |, operates in a streaming fashion. Each operator processes items one by one as they arrive, typically with constant memory overhead (O(1)), or O(k) for operators such as tail(k) that keep a fixed-size buffer of k items. Memory use therefore stays independent of the size of the dataset. The streaming pipeline includes format parsing, decoding, and a class of "streamable aggregates".

Streamable Aggregate Operators

These operators can produce their result without needing to store the entire sequence in memory. They operate in the standard streaming pipeline.

Operator    Description                                                                          Type Signature
| head(n)   Takes the first n items from the stream and terminates the pipeline.                 Sequence(T) → Sequence(T)
| tail(n)   Maintains a fixed-size buffer to output the last n items once the stream ends.       Sequence(T) → Sequence(T)
| count     Counts all items in the stream and outputs a single leaf with the total at the end.  Sequence(T) → Leaf(Number)
| sum       Calculates the sum of all items in a numeric stream.                                 Sequence(Leaf(Number)) → Leaf(Number)
| avg       Calculates the average of all items in a numeric stream.                             Sequence(Leaf(Number)) → Leaf(Number)
| min       Finds the minimum value in a stream of comparable items.                             Sequence(T) → T
| max       Finds the maximum value in a stream of comparable items.                             Sequence(T) → T
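
The memory behavior of these operators can be sketched with JavaScript generators. This is an illustration of the streaming idea only; the function names mirror the operators but are not a defined host API, and plain iterables stand in for UNL sequences.

```javascript
// head(n): stop pulling from the source after n items (constant memory).
function* head(source, n) {
  if (n <= 0) return;
  let taken = 0;
  for (const item of source) {
    yield item;
    if (++taken >= n) return; // leaving the loop also closes the source iterator
  }
}

// tail(n): keep a buffer of at most n items and emit it when the source ends.
function* tail(source, n) {
  const buffer = [];
  for (const item of source) {
    buffer.push(item);
    if (buffer.length > n) buffer.shift();
  }
  yield* buffer;
}

// count and avg: a running total is all the state that is needed.
function count(source) {
  let total = 0;
  for (const _ of source) total++;
  return total;
}

function avg(source) {
  let sum = 0;
  let n = 0;
  for (const value of source) { sum += value; n++; }
  return n === 0 ? NaN : sum / n;
}

// An endless source shows that head() really terminates the pipeline early.
function* naturals() {
  let i = 1;
  while (true) yield i++;
}

console.log([...head(naturals(), 3)]);      // [ 1, 2, 3 ]
console.log([...tail([1, 2, 3, 4, 5], 2)]); // [ 4, 5 ]
```

Returning `NaN` for the average of an empty stream is a choice made for this sketch; the specification does not define that case.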

The Aggregation Pipeline (||)

The double pipe || acts as a blocking barrier. It instructs the engine to stop streaming, collect all results from the preceding pipeline into a full sequence in memory (O(n)), and then pass that complete sequence to the aggregation operators that follow. This is a conscious trade-off made by the user to enable powerful, whole-sequence operations.

Blocking Aggregate Operators

These operators MUST be preceded by the || barrier, as they require the entire sequence to be available to perform their work.

Operator         Description                                              Type Signature
| order-by(key)  Sorts the entire sequence based on a key expression.     Sequence(T) → Sequence(T)
| group-by(key)  Groups items in the sequence based on a key expression.  Sequence(T) → Sequence(Node(Group))
| distinct       Removes duplicate items from the sequence.               Sequence(T) → Sequence(T)
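
The effect of the || barrier can be sketched as "drain the stream into an array, then aggregate". The helper names below are hypothetical, leaves are modeled as JavaScript primitives, and the `{ key, items }` shape used for Node(Group) is an assumption based on the description of group-by.

```javascript
// The || barrier: materialise the whole stream in memory (O(n)).
function barrier(stream) {
  return Array.from(stream);
}

// distinct: keep the first occurrence of each value (primitive leaves assumed).
function distinct(sequence) {
  return [...new Set(sequence)];
}

// group-by(key): one group per distinct key, in order of first appearance.
function groupBy(sequence, keyOf) {
  const groups = new Map();
  for (const item of sequence) {
    const key = keyOf(item);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(item);
  }
  return [...groups].map(([key, items]) => ({ key, items }));
}

function* roles() {
  yield "editor";
  yield "admin";
  yield "editor";
}

console.log(distinct(barrier(roles()))); // [ 'editor', 'admin' ]
console.log(groupBy(barrier(roles()), (role) => role.length).map((g) => g.key)); // [ 6, 5 ]
```

Neither operator can emit its final answer before the last item has been seen, which is why they sit behind the barrier rather than in the streaming pipeline.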

Final Query Results

The result of a UNL query is the data produced by the final operator in the pipeline. If the pipeline ends with an aggregation operator like order-by or `distinct`, the result is a sequence (an array of nodes or leaves). The task of serializing this final sequence into a specific document format (e.g., by wrapping it in a root element) is left to the calling application or environment.

Fundamental Principles

UNL's data model is based on three core principles.

Node (No Prefix)
Represents a navigable structure that contains other nodes or leaves. Examples: an XML element, a JSON object, a CSV row represented as a node.
Leaf (@ Prefix)
Represents a final, non-navigable atomic value. It is the endpoint of a navigation path. Examples: a string, a number, a boolean value, or a node's metadata attribute.
The Pipe Transformation Principle (| Operator)
A Leaf is normally a terminal point of navigation. However, the pipe operator | allows the value of a Leaf to be re-interpreted as a new data source. This operation performs a type-casting of the leaf's value, creating a new, navigable Node structure. This mechanism is the key to nested parsing and is fundamental to UNL's power.
Symbolically: .../Leaf(String) |json → New Node(Object | Array)
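
The pipe principle amounts to handing a leaf's string value to a parser and continuing navigation on the result. The sketch below shows this for JSON only; the `parsers` table and the `pipe` function are hypothetical names, and plain JavaScript objects stand in for UNL Nodes.

```javascript
// Hypothetical parser table: each entry re-interprets a Leaf(String) as a Node.
const parsers = new Map([
  ["json", (text) => JSON.parse(text)],
]);

function pipe(leafValue, format) {
  const parse = parsers.get(format);
  if (!parse) throw new Error(`Unsupported format: ${format}`);
  if (typeof leafValue !== "string") throw new TypeError("Expected a Leaf(String)");
  return parse(leafValue);
}

// A record whose "payload" leaf holds an embedded JSON document.
const record = { id: 1, payload: '{"user": {"name": "John"}}' };

// Conceptually: .../@payload |json /user/@name
const node = pipe(record.payload, "json");
console.log(node.user.name); // John
```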

Type System and Comparisons

UNL operates on a set of conceptual data types defined in the Data Model Mapping appendix. The behavior of comparison and equality operators depends on these types.

Equality (`=` and `!=`)

The equality operator compares two values. The inequality operator `!=` is defined as the negation of `=`. The rules are applied in order:

  1. If one operand is a sequence of multiple items, the comparison is true if the other operand is equal to any item in the sequence.
  2. If both operands are of the same primitive type (e.g., Number, String, Boolean), they are compared by value.
  3. Type Coercion: If operands are of different primitive types, the engine attempts to coerce them to a common type before comparison. The primary rule is to attempt casting to a Number. For example, `5 = "5"` is true. If coercion fails, the items are not equal.
  4. If one operand is a Node and the other is a primitive Leaf, the node's "string value" (e.g., its text content for an XML element) is used for the comparison. For example, `book/title = "UNL"` is true if the text content of the title element is "UNL".
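
The equality rules can be sketched as a single function. The host representation is an assumption made for this example: a sequence is a JavaScript array, a Leaf is a primitive, and a Node is an object exposing a `stringValue` field. The sketch also assumes that only numeric strings are coercible to Number and that a Node's string value goes through the same coercion as any other string; the rules above leave both points open.

```javascript
// Assumed host representation (not defined by UNL):
//   sequence -> array, Leaf -> primitive, Node -> object with a stringValue field.
const isNode = (v) => v !== null && typeof v === "object" && "stringValue" in v;

// Assumption: only numbers and numeric strings can be cast to Number.
function toNumber(v) {
  if (typeof v === "number") return v;
  if (typeof v === "string" && v.trim() !== "") {
    const n = Number(v);
    return Number.isNaN(n) ? null : n;
  }
  return null;
}

function unlEquals(a, b) {
  // Rule 1: a sequence operand matches if any of its items matches.
  if (Array.isArray(a)) return a.some((item) => unlEquals(item, b));
  if (Array.isArray(b)) return b.some((item) => unlEquals(a, item));
  // Rule 4: a Node takes part in the comparison through its string value.
  if (isNode(a)) a = a.stringValue;
  if (isNode(b)) b = b.stringValue;
  // Rule 2: same primitive type, compare by value.
  if (typeof a === typeof b) return a === b;
  // Rule 3: different primitive types, try a numeric comparison.
  const na = toNumber(a);
  const nb = toNumber(b);
  return na !== null && nb !== null && na === nb;
}

// != is defined as the negation of =.
const unlNotEquals = (a, b) => !unlEquals(a, b);

console.log(unlEquals(5, "5"));                        // true
console.log(unlEquals(["admin", "editor"], "editor")); // true
console.log(unlEquals({ stringValue: "UNL" }, "UNL")); // true
console.log(unlEquals("abc", 5));                      // false
```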

Ordered Comparisons (`<`, `>`, `<=`, `>=`)

Ordered comparisons are primarily defined for primitive, orderable types (Numbers, Strings).

Sorting (`order-by`)

The `| order-by(key)` operator uses these comparison rules to sort a sequence. It evaluates the `key` expression for each item in the sequence, and then sorts the items based on the resulting key values. The host implementation SHOULD provide options to specify data type (e.g., numeric vs. text) for sorting to avoid ambiguity.
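
The ambiguity that a data type option resolves shows up as soon as numeric keys arrive as strings. The sketch below assumes a hypothetical `type` option with the values "text" and "numeric"; the specification only recommends that such an option exist.

```javascript
// Hypothetical host-side order-by with an explicit key type.
function orderBy(items, keyOf, type = "text") {
  const compare = type === "numeric"
    ? (a, b) => Number(a) - Number(b)
    : (a, b) => (String(a) < String(b) ? -1 : String(a) > String(b) ? 1 : 0);
  // Copy first so the input sequence is left untouched.
  return [...items].sort((x, y) => compare(keyOf(x), keyOf(y)));
}

const prices = ["10", "9", "100"];
console.log(orderBy(prices, (p) => p, "text"));    // [ '10', '100', '9' ]
console.log(orderBy(prices, (p) => p, "numeric")); // [ '9', '10', '100' ]
```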

Navigation Syntax

Handling Special Characters and Quoting

Many characters have a special syntactic meaning in UNL (e.g., / ? @ * [ ] ( ) | . :). If a node or leaf name in the source data contains one of these characters, it must be enclosed in single (') or double (") quotes to be treated as a literal name.

This quoting mechanism applies to any path segment that is a name test.

Rules for Quoting

When to Quote
A name segment MUST be quoted if it contains any special UNL characters or if it is ambiguous with a numeric literal (e.g., a key named "3").
Escaping within Quotes
To include a quote character within a quoted name, it MUST be escaped with a backslash (\). A literal backslash MUST also be escaped (\\).
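
A host library that builds queries from data keys could apply these rules mechanically. The helpers below are a sketch with hypothetical names; the set of characters that trigger quoting is the list given above plus whitespace, quote characters, and the backslash, which is an assumption.

```javascript
// Characters that force quoting: the special characters listed above, plus
// whitespace, quotes, and backslash (the last three are assumptions).
const SPECIAL = /[\/?@*\[\]()|.:\s"'\\]/;

function quoteName(name) {
  const needsQuotes = name === "" || SPECIAL.test(name) || /^[0-9]+$/.test(name);
  if (!needsQuotes) return name;
  // Escape backslashes first so the escapes added for quotes are not doubled.
  const escaped = name.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
  return `"${escaped}"`;
}

function unquoteName(segment) {
  const quote = segment[0];
  if (quote !== '"' && quote !== "'") return segment;
  // Drop the surrounding quotes, then resolve backslash escapes.
  return segment.slice(1, -1).replace(/\\(.)/g, "$1");
}

console.log(quoteName("title"));              // title
console.log(quoteName("a/b"));                // "a/b"
console.log(quoteName("3"));                  // "3"
console.log(quoteName('node with "quotes"')); // "node with \"quotes\""
```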

Examples

// Example 1: JSON key with a forward slash
// Data: { "a/b": { "c": 1 } }
// Query: "a/b"/@c

// Example 2: Filename with special characters
// Resource Locator: my-archive.zip
// Path in zip: reports/report-[v1].xml
my-archive.zip|decomp:zip/reports/"report-[v1].xml"|xml//...

// Example 3: XML element name with dots
// Data: <com.example.Node>Value</com.example.Node>
// Query: "com.example.Node"/text()

// Example 4: Quoting a name containing quotes
// Data: { "node with \"quotes\"": 42 }
// Query: @"node with \"quotes\""
    

URIs and Namespaces

UNL provides robust support for namespaced data formats like XML and RDF. To ensure unambiguous queries, UNL adopts the EQName (Extended QName) notation from [[XPATH-31]].

Prefixed QName (e.g., atom:title)
The classic prefix:local-name syntax. Its use is supported but discouraged in favor of URI-qualified names, as it relies on an external context to map the prefix to a namespace URI.
URI-Qualified Name (e.g., Q{http://www.w3.org/2005/Atom}title)
The Q{namespace-uri}local-name syntax. This is the recommended approach as it includes the full namespace URI directly within the expression, making queries self-contained and unambiguous.

Because RDF predicates are full URIs, using the URI-qualified syntax is the most precise method for navigating RDF graphs.
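
The relationship between the two notations can be sketched as a rewrite from the prefixed form to the Q{...} form using the host's namespace mappings. `toUriQualifiedName` is a hypothetical helper, and leaving unprefixed names untouched (no default namespace) is an assumption of this sketch.

```javascript
// Hypothetical helper: expand prefix:local into Q{namespace-uri}local.
function toUriQualifiedName(name, namespaces) {
  if (name.startsWith("Q{")) return name; // already URI-qualified
  const colon = name.indexOf(":");
  if (colon === -1) return name;          // unprefixed: left as is (assumption)
  const prefix = name.slice(0, colon);
  const local = name.slice(colon + 1);
  if (!namespaces.has(prefix)) {
    throw new Error(`Unbound namespace prefix: ${prefix}`);
  }
  return `Q{${namespaces.get(prefix)}}${local}`;
}

const namespaces = new Map([
  ["atom", "http://www.w3.org/2005/Atom"],
  ["foaf", "http://xmlns.com/foaf/0.1/"],
]);

console.log(toUriQualifiedName("atom:title", namespaces));
// Q{http://www.w3.org/2005/Atom}title
console.log(toUriQualifiedName("foaf:name", namespaces));
// Q{http://xmlns.com/foaf/0.1/}name
```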

// URI-qualified names are unambiguous and recommended
Q{http://www.w3.org/2005/Atom}feed/Q{http://www.w3.org/2005/Atom}entry/Q{http://www.w3.org/2005/Atom}title

// EQName used to query an RDF property
Q{http://xmlns.com/foaf/0.1/}Person/Q{http://xmlns.com/foaf/0.1/}knows

// Full URI used in a predicate value
*[rdf:type = <http://xmlns.com/foaf/0.1/Person>]
    

Format Transformations

The pipe operator (|) is used for transformations. As described in the Operating Models, the first pipe in a ResourceQuery serves to load and parse a resource. Subsequent pipes transform the current selection, often by type-casting a Leaf's value into a new Node structure, as defined in the Fundamental Principles.

Note on Sequences

All transformations described below adhere to the Implicit Iteration principle. When the input is a Sequence(T), the output will be a Sequence(U), where the transformation T → U has been applied to each item.
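
Lifting a transformation T → U to Sequence(T) → Sequence(U) can be sketched in a few lines. The function name is hypothetical, arrays stand in for sequences, and `JSON.parse` plays the role of the |json transformation.

```javascript
// Apply a single-item transformation to one item or to every item of a sequence.
function applyTransformation(input, transform) {
  if (Array.isArray(input)) return input.map((item) => transform(item));
  return transform(input);
}

const parseJson = (text) => JSON.parse(text);

const single = applyTransformation('{"id": 1}', parseJson);
const many = applyTransformation(['{"id": 1}', '{"id": 2}'], parseJson);

console.log(single.id);                 // 1
console.log(many.map((node) => node.id)); // [ 1, 2 ]
```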

Data Serialization Formats

These transformations parse a string-based or binary Leaf or resource into a navigable Node structure.

Format  Description                                                      Type Signature                                                    Reference
|xml    Extensible Markup Language.                                      (ResourceLocator | Leaf(String | Binary)) → Node(XML)             W3C XML 1.0
|exi    Efficient XML Interchange (binary XML).                          (ResourceLocator | Leaf(Binary)) → Node(XML)                      W3C EXI 1.0
|json   JavaScript Object Notation.                                      (ResourceLocator | Leaf(String | Binary)) → Node(Object | Array)  RFC 8259
|csv    Comma-Separated Values.                                          (ResourceLocator | Leaf(String | Binary)) → Node(Array)           RFC 4180
|rdf    Resource Description Framework.                                  (ResourceLocator | Leaf(String | Binary)) → Node(Graph)           W3C RDF 1.1
|html   HyperText Markup Language.                                       (ResourceLocator | Leaf(String | Binary)) → Node(HTML)            WHATWG HTML
|text   Plain text. Forces a binary leaf to be interpreted as a string.  (ResourceLocator | Leaf(Binary)) → Leaf(String)                   RFC 2046
|yaml   YAML Ain't Markup Language.                                      (ResourceLocator | Leaf(String | Binary)) → Node(Object | Array)  YAML 1.2.2
|toml   Tom's Obvious, Minimal Language.                                 (ResourceLocator | Leaf(String | Binary)) → Node(Object)          TOML 1.0.0

Filesystem Operations

This operator treats a local directory path as a navigable archive-like structure.

Format  Description                               Type Signature                              Reference
|ls     Lists the contents of a local directory.  ResourceLocator(Directory) → Node(Archive)  N/A

Decompression (decomp: prefix)

These transformations operate on a resource or a binary Leaf, decompressing it to expose a virtual filesystem of Nodes or a raw binary stream.

Format          Description             Type Signature                                    Reference
|decomp:zip     ZIP file format.        (ResourceLocator | Leaf(Binary)) → Node(Archive)  PKWARE ZIP
|decomp:tar     Tape Archive.           (ResourceLocator | Leaf(Binary)) → Node(Archive)  POSIX.1-2017
|decomp:gz      Gzip compression.       (ResourceLocator | Leaf(Binary)) → Leaf(Binary)   RFC 1952
|decomp:7z      7z archive format.      (ResourceLocator | Leaf(Binary)) → Node(Archive)  7-Zip Format
|decomp:rar     Roshal Archive.         (ResourceLocator | Leaf(Binary)) → Node(Archive)  N/A
|decomp:xz      XZ compression.         (ResourceLocator | Leaf(Binary)) → Leaf(Binary)   XZ Format
|decomp:bz2     Bzip2 compression.      (ResourceLocator | Leaf(Binary)) → Leaf(Binary)   Bzip2 Format
|decomp:zstd    Zstandard compression.  (ResourceLocator | Leaf(Binary)) → Leaf(Binary)   RFC 8878
|decomp:brotli  Brotli compression.     (ResourceLocator | Leaf(Binary)) → Leaf(Binary)   RFC 7932

Decoding (decode: prefix)

These are intermediate transformations that type-cast a Leaf's value.

Format                    Description                                                            Type Signature               Reference
|decode:base64            Base64 decoding.                                                       Leaf(String) → Leaf(Binary)  RFC 4648
|decode:hex               Hexadecimal decoding.                                                  Leaf(String) → Leaf(Binary)  RFC 4648
|decode:url               Percent-decoding.                                                      Leaf(String) → Leaf(String)  RFC 3986
|decode:html-entities     Decodes HTML/XML character entities.                                   Leaf(String) → Leaf(String)  WHATWG HTML
|decode:json-string       Un-escapes a string that was itself encoded as a JSON string literal.  Leaf(String) → Leaf(String)  RFC 8259
|decode:punycode          Decodes Punycode strings (IDN).                                        Leaf(String) → Leaf(String)  RFC 3492
|decode:quoted-printable  Decodes Quoted-Printable (MIME) content.                               Leaf(String) → Leaf(String)  RFC 2045
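
Three of these decoders map directly onto facilities available in Node.js, which makes the type signatures easy to see in practice. The `decoders` table and `decode` function are hypothetical names for this sketch; a Node.js `Buffer` stands in for Leaf(Binary).

```javascript
// Sketch for Node.js: Buffer models Leaf(Binary), string models Leaf(String).
const decoders = new Map([
  ["base64", (s) => Buffer.from(s, "base64")], // Leaf(String) -> Leaf(Binary)
  ["hex", (s) => Buffer.from(s, "hex")],       // Leaf(String) -> Leaf(Binary)
  ["url", (s) => decodeURIComponent(s)],       // Leaf(String) -> Leaf(String)
]);

function decode(name, leaf) {
  const decoder = decoders.get(name);
  if (!decoder) throw new Error(`Unsupported decoder: decode:${name}`);
  if (typeof leaf !== "string") throw new TypeError("decode: operators expect a Leaf(String)");
  return decoder(leaf);
}

console.log(decode("base64", "SGVsbG8=").toString("utf8")); // Hello
console.log(decode("hex", "48656c6c6f").toString("utf8"));  // Hello
console.log(decode("url", "a%20b%2Fc"));                    // a b/c
```

The binary results would typically be piped onward, for example into |text or a |decomp: operator.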

Predicates (Filters)

Predicates, placed between square brackets [...], are used to filter sets of nodes. Path expressions inside a predicate can be absolute (starting with /) or relative to the current node.

// Filter items where category leaf matches a global configuration value
// The path /config/default_category starts from the document root
//item[@category = /config/default_category]
    

Positional Predicates

These functions are used within a predicate to filter based on position in a sequence.

Function    Description                                                                                                   Type Signature
position()  Returns the 1-based position of the current item in its sequence. `[n]` is a shorthand for `[position()=n]`.  () → Leaf(Number)
last()      Returns the total number of items in the current sequence.                                                    () → Leaf(Number)
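
How a predicate sees position() and last() can be sketched by passing both values to the filter callback. The helper names are hypothetical and arrays stand in for sequences.

```javascript
// Evaluate a predicate for each item with its 1-based position and the sequence size.
function filterWithPosition(sequence, predicate) {
  const last = sequence.length;
  return sequence.filter((item, index) => predicate({ item, position: index + 1, last }));
}

// [n] is shorthand for [position() = n].
const nth = (n) => ({ position }) => position === n;

const rows = ["a", "b", "c"];
console.log(filterWithPosition(rows, nth(2)));                                      // [ 'b' ]
console.log(filterWithPosition(rows, ({ position, last }) => position === last)); // [ 'c' ]
```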

Built-in Functions

Built-in functions are called without a namespace prefix (e.g., text()). Custom functions provided by a host language MUST use a namespace prefix (e.g., myfuncs:my_func()).

Structural Navigation

These functions manipulate or query the structure of the data model.

Function            Description                                                                           Type Signature
root()              Returns the root node of the document. Equivalent to starting a path with /.          () → Node
outer(selector, n)  Navigates upwards n levels, jumping only over nodes that match the selector.          (Node, String, Number) → Sequence(Node)
inner(selector)     Navigates to the terminal elements matching the selector within the current context.  (Node, String) → Sequence(Node)

Hierarchical Filtering

These functions operate on sequences of nodes to filter them based on their hierarchical relationships.

Function          Description                                                                                  Type Signature
outermost(nodes)  From a set of nodes, keeps only those that are not contained within other nodes in the set.  (Sequence(Node)) → Sequence(Node)
innermost(nodes)  From a set of nodes, keeps only those that do not contain any other nodes from the set.      (Sequence(Node)) → Sequence(Node)
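
Both filters reduce to ancestor checks. The sketch below assumes each node carries a `parent` reference, which is a modeling choice for this example and not something the data model prescribes.

```javascript
// Minimal tree node with a parent pointer (assumed representation).
function makeNode(name, parent = null) {
  return { name, parent };
}

// Keep nodes that have no ancestor in the input set.
function outermost(nodes) {
  const set = new Set(nodes);
  return nodes.filter((node) => {
    for (let p = node.parent; p; p = p.parent) {
      if (set.has(p)) return false;
    }
    return true;
  });
}

// Keep nodes that are not an ancestor of any other node in the input set.
function innermost(nodes) {
  const set = new Set(nodes);
  const containsAnother = new Set();
  for (const node of nodes) {
    for (let p = node.parent; p; p = p.parent) {
      if (set.has(p)) containsAnother.add(p);
    }
  }
  return nodes.filter((node) => !containsAnother.has(node));
}

// root > section > para, and root > aside
const root = makeNode("root");
const section = makeNode("section", root);
const para = makeNode("para", section);
const aside = makeNode("aside", root);

const selection = [section, para, aside];
console.log(outermost(selection).map((n) => n.name)); // [ 'section', 'aside' ]
console.log(innermost(selection).map((n) => n.name)); // [ 'para', 'aside' ]
```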

Utility Functions

These functions provide general utility for inspection and logic within queries.

Function        Description                                                         Type Signature
only(elements)  Tests if the context contains only the specified elements.          (Node, Sequence(Node)) → Leaf(Boolean)
text()          Returns the text content of a node.                                 (Node) → Leaf(String)
lang()          When used on a literal leaf, returns its language tag as a string.  (Leaf) → Leaf(String)
count()         Returns the number of elements in a selection.                      (Sequence(T)) → Leaf(Number)
type()          Returns the node's type as a string (e.g., "element", "object").    (Node) → Leaf(String)
not(expr)       Negation of a predicate expression.                                 (Leaf(Boolean)) → Leaf(Boolean)

Examples

The following examples illustrate the two operating models and advanced features.

Basic Examples

// In-Memory: Get an attribute from an XML node
doc/book/@isbn

// In-Memory: Get the name from the first object in a JSON array
users[1]/@name

// Resource Loading: Get a column from a specific row in a CSV file
data.csv|csv/*[@id="ABC"]/@name

// Resource Loading: Navigate into a ZIP file to get an XML element
http://example.com/data.zip|decomp:zip/docs/report.xml|xml//title
        

Advanced Examples

// Recursive Parsing: A leaf's value is piped into a new parser
// Gets the 2nd tag from a comma-separated string within a CSV cell
users.csv|csv/*[1]/@tags|csv/*[1]/@*[2]

// API Chaining: A payload contains gzipped XML data
// The query decodes, decompresses, and parses the data in one pipeline
api/data|json/@payload|decode:base64|decomp:gz|xml//important

// Sequence Processing: Get all unique, sorted authors from a set of files
"data/*.xml"|xml//author/text()||distinct|order-by(.)
        

References

RFC 2119
Key words for use in RFCs to Indicate Requirement Levels, S. Bradner, IETF, March 1997.
RFC 8174
Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words, B. Leiba, IETF, May 2017.
ISO/IEC 14977:1996
Information technology — Syntactic metalanguage — Extended BNF.
W3C XML 1.0
Extensible Markup Language (XML) 1.0 (Fifth Edition), T. Bray, et al., W3C Recommendation, 26 November 2008.
XML-INFOSET
XML Information Set (Second Edition), J. Cowan, et al., W3C Recommendation, 04 February 2004.
XML-NAMES
Namespaces in XML 1.0 (Third Edition), T. Bray, et al., W3C Recommendation, 8 December 2009.
XPATH 3.1
XML Path Language (XPath) 3.1, J. Robie, et al., W3C Recommendation, 21 March 2017.
W3C EXI 1.0
Efficient XML Interchange (EXI) Format 1.0 (Second Edition), J. Schneider, et al., W3C Recommendation, 11 February 2014.
RFC 8259
The JavaScript Object Notation (JSON) Data Interchange Format, T. Bray, Ed., IETF, December 2017.
RFC 4180
Common Format and MIME Type for Comma-Separated Values (CSV) Files, Y. Shafranovich, IETF, October 2005.
W3C RDF 1.1
RDF 1.1 Concepts and Abstract Syntax, R. Cyganiak, et al., W3C Recommendation, 25 February 2014.
WHATWG HTML
HTML Living Standard, WHATWG.
RFC 2046
Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types, N. Freed, et al., IETF, November 1996.
YAML 1.2.2
YAML Ain’t Markup Language (YAML™) Version 1.2.2, O. Ben-Kiki, et al., 1 October 2021.
TOML 1.0.0
Tom's Obvious, Minimal Language (TOML) v1.0.0.
PKWARE ZIP
.ZIP File Format Specification, PKWARE Inc.
POSIX.1-2017
pax - portable archive interchange (defines the ustar and pax archive formats), IEEE Std 1003.1-2017.
RFC 1952
GZIP file format specification version 4.3, P. Deutsch, IETF, May 1996.
7-Zip Format
7z format, Igor Pavlov.
XZ Format
The .xz File Format, Tukaani.
Bzip2 Format
BZIP2 Format Specification, J. Seward.
RFC 8878
Zstandard Compression and the 'application/zstd' Media Type, Y. Collet & M. Kucherawy, Ed., IETF, February 2021.
RFC 7932
Brotli Compressed Data Format, J. Alakuijala & Z. Szabadka, IETF, July 2016.
RFC 4648
The Base16, Base32, and Base64 Data Encodings, S. Josefsson, IETF, October 2006.
RFC 3986
Uniform Resource Identifier (URI): Generic Syntax, T. Berners-Lee, et al., IETF, January 2005.
RFC 3492
Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA), A. Costello, IETF, March 2003.
RFC 2045
Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies, N. Freed & N. Borenstein, IETF, November 1996.

Common Transformation Paths

This appendix summarizes common data conversion pathways, showing the UNL operators used to transform an input representation into a desired output structure, as defined in the Data Model Mapping appendix.

Producing a Navigable Node (e.g., XML, JSON)

From a ResourceLocator
"my_data.json"|json/...
From a Leaf(String)
...@string_leaf|xml/...
From a Leaf(Binary)
...@binary_leaf|json/... (Parser auto-detects encoding)
...@gzipped_json_leaf|decomp:gz|json/... (Decompress then parse)

Producing a Navigable Archive Node

From a ResourceLocator
"my_files.zip"|decomp:zip/...
From a Leaf(Binary)
...@binary_zip_leaf|decomp:zip/...

Producing a Leaf(String)

From a ResourceLocator
"my_file.txt"|text
From a Leaf(Binary)
...@binary_leaf|text
From a Node
.../my_node|text (Serializes the node to its default text representation, e.g., outer XML)

Producing a Leaf(Binary)

From a ResourceLocator
This is the default interpretation of a resource locator before any pipe is applied.
From a Leaf(String)
...@string_leaf|decode:base64
From a Node
.../my_node|exi (Serializes an XML node to binary EXI format)

EBNF Grammar for UNL v1.0

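This appendix provides a non-normative grammar for the Unified Navigation Language. It is written in the W3C-style EBNF notation (`::=`, `?`, `*`, `|`) rather than the strict syntax of [[ISO-IEC-14977]], from which it borrows only the `(* ... *)` comment form.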

(* A full query can have an optional, final aggregation stage *)
UNLQuery           ::= ( ResourceQuery | InMemoryQuery ) ( '||' AggregationPath )?

(* Form 1: Starts with a resource, requires a parsing transformation *)
ResourceQuery      ::= ResourceLocator StreamingPipe ( StreamingPipe )*

(* Form 2: Starts with a path, operates on a pre-existing context *)
InMemoryQuery      ::= Path ( StreamingPipe )*

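(* A transformation may be followed by a path that navigates its output, e.g. |csv/*[1]/@price *)
StreamingPipe      ::= '|' ( DataFormat | FilesystemOp | DecompTransform | DecodeTransform | StreamingAggregate ) ( '/' Path )?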
AggregationPath    ::= BlockingAggregate ( '|' BlockingAggregate )*

Path               ::= ( '/' )? Step ( ( '/' | '?' ) Step )*
Step               ::= ( PrimaryStep | Axis ) Predicate*

PrimaryStep        ::= NameTest | Wildcard | LeafAccess | IdAccess | '.' | '(' Path ')' | FunctionCall

NameTest           ::= EQName | Literal
LeafAccess         ::= '@' ( EQName | '*' | Literal )

EQName             ::= QName | URIQualifiedName
QName              ::= ( NCName ':' )? NCName
URIQualifiedName   ::= 'Q{' URILiteral '}' NCName

Wildcard           ::= '*' | '**' | '?'
IdAccess           ::= '#' NCName
Axis               ::= '..' | '..' Integer | '...' | '+' | '-' | '~' | '~~'

Predicate          ::= '[' FilterExpression ']'

FilterExpression   ::= OrExpression
OrExpression       ::= AndExpression ( '|' AndExpression )*
AndExpression      ::= EqualityExpression ( '&' EqualityExpression )*
EqualityExpression ::= RelationalExpression ( ( '=' | '!=' ) RelationalExpression )?
RelationalExpression ::= PrimaryFilterExpr ( ( '<' | '>' | '<=' | '>=' | '~' ) PrimaryFilterExpr )*

PrimaryFilterExpr  ::= Literal | Variable | FunctionCall | Path | LeafAccess | '!' FilterExpression | '(' FilterExpression ')' | Integer

Variable           ::= '$' NCName
FunctionCall       ::= EQName '(' ( FilterExpression ( ',' FilterExpression )* )? ')'

(* Operator Definitions *)
DataFormat         ::= 'xml' | 'exi' | 'json' | 'csv' | 'rdf' | 'html' | 'text' | 'yaml' | 'toml'
FilesystemOp       ::= 'ls'
DecompTransform    ::= 'decomp:' ( 'zip' | 'tar' | 'gz' | '7z' | 'rar' | 'xz' | 'bz2' | 'zstd' | 'brotli' )
DecodeTransform    ::= 'decode:' ( 'base64' | 'hex' | 'url' | 'html-entities' | 'json-string' | 'punycode' | 'quoted-printable' )
StreamingAggregate ::= 'count' | 'sum' | 'avg' | 'min' | 'max' | 'head' '(' Integer ')' | 'tail' '(' Integer ')'
BlockingAggregate  ::= 'distinct' | 'order-by' '(' Path ')' | 'group-by' '(' Path ')'

(* Lexical Definitions (Informal) *)
ResourceLocator    ::= (* A literal string representing a URI or file path. It uses standard '/' separators and does not support UNL operators like '?' or '@'. UNL navigation begins after the first pipe. *)
URILiteral         ::= (* A string representing a valid URI, conforming to RFC3986 *)
NCName             ::= (* A Non-Colonized Name, as defined in [[XML-NAMES]]. It must not contain ':' and should be compliant with the full Unicode character set allowed by that standard. *)
Integer            ::= [0-9]+
Literal            ::= '"' ( [^"\\] | '\\' . )* '"' | "'" ( [^'\\] | '\\' . )* "'"
    

Data Model Mapping

This section defines how UNL's abstract concepts of Node and Leaf are mapped onto the concrete structures of each major supported format. UNL uses 1-based positional indexing throughout, following the XPath convention.

UNL's Conceptual Data Types

To describe transformations accurately, UNL uses a set of conceptual data types.

Node
The base type for any navigable structure. It has several specializations:
  • Node(Object): An unordered collection of key-value pairs, similar to a JSON object.
  • Node(Array): An ordered collection of other Nodes or Leaves.
  • Node(XML | HTML): A structure compliant with the [[XML-INFOSET]].
  • Node(Graph): A structure representing RDF triples.
  • Node(Archive): A virtual filesystem root, containing file and directory nodes. This is produced by |decomp: operators and |ls.
  • Node(Group): A special node produced by the group-by operator, containing a key and a sequence of items.
Leaf
The base type for any terminal, atomic value. It has several specializations:
  • Leaf(String): A Unicode string. This is the primary type for textual data.
  • Leaf(Binary): A sequence of raw bytes. This is the primary type for non-textual data. Text-based parsers like |xml can also consume a Leaf(Binary) directly by auto-detecting character encoding. The |text operator provides an explicit way to interpret binary data as text.
  • Leaf(Number), Leaf(Boolean), Leaf(Null): Primitive data types.
Sequence(T)
An ordered collection of items of type T, for example a Sequence(Node).
ResourceLocator
A string that represents a URI or a local file path, used as the initial input for a ResourceQuery.

XML

The |xml transformation is a strict parser that produces a navigable structure compliant with the [[XML-INFOSET]]. It will fail on malformed documents.

Node
An XML Element.
Leaf
An XML Attribute, or a Text node.

HTML

The |html transformation is a lenient parser that mimics browser behavior. It will attempt to fix errors and will always produce a navigable structure compliant with the [[XML-INFOSET]].

From the perspective of subsequent UNL path navigation, a structure parsed from HTML is indistinguishable from one parsed from well-formed XML (like XHTML). The UNL engine operates on the unified Infoset model.

JSON

Node
A JSON Object ({}) or a JSON Array ([]).
Leaf
A value within an object or array that is a String, Number, Boolean, or Null.

CSV

The |csv transformation parses data into an array of nodes. All indexing is 1-based.

Node
The result of the |csv transformation is a single Array Node. Each record (line) in the CSV is mapped to a child Node within this array.
Leaf
An individual cell value within a row. A row's leaves form an ordered collection that can be accessed in two equivalent ways:
  • By Name (if header exists): Using /@name. This is the most readable method.
  • By Position (always available): Using /@*[n], where n is a 1-based integer. @* selects all leaves, and [n] filters for the n-th position.
// Example 1: CSV with header (in "users.csv")
id,name,role,tags
1,Alice,admin,"a,b,c"

// Query 1: Get the 'role' leaf from rows where 'id' leaf is "1"
users.csv|csv/*[@id="1"]/@role          // Returns leaf "admin"

// Query 2: Get the 2nd tag from Alice's record. This requires a nested parse.
users.csv|csv/*[1]/@tags|csv/*[1]/@*[2]    // Returns leaf "b"

// Example 2: Headerless CSV (in "logs.csv")
1687354800,ERROR,auth_service

// Get the 2nd column of the 1st record
logs.csv|csv/*[1]/@*[2]                   // Returns leaf "ERROR"
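
The two access paths, by name and by position, and the nested parse from Query 2 can be reproduced with a small parser. This is a sketch only: `parseCsv`, its `header` option, and the `leaf`/`leafAt` accessors are hypothetical, and the parser splits on line breaks, so quoted fields that span several lines are not handled.

```javascript
// Split one CSV record into cells, honouring double-quoted fields.
function parseCsvLine(line) {
  const cells = [];
  let cell = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') { cell += '"'; i++; } // escaped quote
      else if (ch === '"') inQuotes = false;
      else cell += ch;
    } else if (ch === '"') inQuotes = true;
    else if (ch === ",") { cells.push(cell); cell = ""; }
    else cell += ch;
  }
  cells.push(cell);
  return cells;
}

// Assumption: the caller states whether the first record is a header.
function parseCsv(text, { header = false } = {}) {
  const lines = text.split(/\r?\n/).filter((line) => line !== "");
  const names = header && lines.length > 0 ? parseCsvLine(lines.shift()) : null;
  return lines.map((line) => {
    const cells = parseCsvLine(line);
    return {
      leafAt: (n) => cells[n - 1],                                        // @*[n], 1-based
      leaf: (name) => (names ? cells[names.indexOf(name)] : undefined),   // @name
    };
  });
}

const users = parseCsv('id,name,role,tags\n1,Alice,admin,"a,b,c"', { header: true });
console.log(users[0].leaf("role"));    // admin
const tags = users[0].leaf("tags");    // the raw cell value: a,b,c
console.log(parseCsv(tags)[0].leafAt(2)); // b

const logs = parseCsv("1687354800,ERROR,auth_service");
console.log(logs[0].leafAt(2));        // ERROR
```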
    

RDF

UNL navigates an RDF graph by following predicates (properties).

Node
A resource identified by a URI or a Blank Node, when it acts as a subject or object of a triple.
Leaf
A Literal value (e.g., a string, number, or date) that is the object of a triple. Literals are accessed via the @ prefix followed by the EQName of the predicate.
// --- Data (in Turtle syntax) ---
@prefix : <http://example.org/ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:book1 a :Book ;
    rdfs:label "UNL Specification"@en ;
    rdfs:label "Spécification UNL"@fr ;
    foaf:maker :person1 .

:person1 a foaf:Person ;
    foaf:name "John Doe" .
    
// --- UNL Queries (In-Memory Mode) ---
// Note the use of @ followed by the prefixed name of the property (prefixes are resolved through the Namespace Mappings)
:book1/foaf:maker/@foaf:name
// Returns leaf "John Doe"

// Filter leaves based on their language tag using the lang() function
:book1/@rdfs:label[lang()="fr"]
// Returns leaf "Spécification UNL"
    

Archives and Filesystems

Archives (via decomp:) and local filesystems (via |ls) are treated as a virtual filesystem. Both files and directories are modeled as Nodes to allow querying their metadata.

Node
A directory or a file. Directory nodes can contain other nodes. File nodes are terminal for path navigation but expose metadata leaves.
Leaf
Metadata about a file or directory node. When a file node is piped to another transformation, its raw content is used as the input. A standard set of metadata leaves is defined:
  • @name: The name of the file or directory.
  • @size: The uncompressed size in bytes (files only).
  • @compressed_size: The compressed size in bytes (files only).
  • @modified_date: The modification timestamp.
  • @is_dir: A boolean that is true if the node is a directory.
// Example 1: List contents of a local directory
"./src"|ls/*

// Example 2: Get the size of a specific file in a ZIP archive.
my_archive.zip|decomp:zip/docs/report.xml/@size

// Example 3: Filter files by metadata from a local directory, then count the matches.
"./src"|ls/*[@is_dir=false() & @name ~ "\.js$"]|count