Perfect Indexed XML (PI-XML)
Reduce Size. Keep XML. Zero Breakage for Modern Data Pipelines.
Extensible Markup Language (XML) has been used for years to model structured data and configuration. A common pattern is a root element with many repeated child elements that all share the same structure:
<users>
  <user>
    <first_name>Sammy</first_name>
    <last_name>Shark</last_name>
    <location>Ocean</location>
    <online>true</online>
    <followers>987</followers>
  </user>
  <user>
    <first_name>Sammy</first_name>
    <last_name>Shark</last_name>
    <location>Ocean</location>
    <online>true</online>
    <followers>987</followers>
  </user>
</users>
This is easy to understand, but it repeats tag names like
<first_name> and <followers> for every item. Over thousands
of records, that overhead adds up in terms of bytes, bandwidth and storage.
Perfect Indexed XML (PI-XML) applies the same idea as Perfect Indexed JSON (PI-JSON): move descriptive names into a header and use compact indexed references in the data, while staying 100% valid XML.
What Is Perfect Indexed XML (PI-XML)?
The idea behind PI-XML is simple:
- Define a header section that maps field names to indices.
- Represent each data row using short, index-based attributes instead of repeating full tag names.
- Keep everything as normal, well-formed XML that any XML parser can handle.
Starting from the normal structure:
<users>
  <user>
    <first_name>Sammy</first_name>
    <last_name>Shark</last_name>
    <location>Ocean</location>
    <online>true</online>
    <followers>987</followers>
  </user>
  <user>
    <first_name>Sammy</first_name>
    <last_name>Shark</last_name>
    <location>Ocean</location>
    <online>true</online>
    <followers>987</followers>
  </user>
</users>
We can define a PI-XML version like this:
<users>
  <header>
    <field name="first_name" index="0"/>
    <field name="last_name" index="1"/>
    <field name="location" index="2"/>
    <field name="online" index="3"/>
    <field name="followers" index="4"/>
  </header>
  <row f0="Sammy" f1="Shark" f2="Ocean" f3="true" f4="987"/>
  <row f0="Sammy" f1="Shark" f2="Ocean" f3="true" f4="987"/>
</users>
In this representation:
- The <header> defines the mapping between name and index.
- Each <row> element uses short attribute names f0, f1, f2, etc. to hold the actual data.
- The full field names appear just once, in the header.
Each row uses f0, f1 and so on to reference those columns, while the header describes what they mean.
PI-XML Is Still Just XML
PI-XML does not introduce a new file format or a new parser. It is purely a convention for structuring XML:
- There is still a single root element (<users>).
- All tags and attributes follow normal XML rules.
- Any standard XML library can parse and serialize this structure.
That means:
- You can still use Document Object Model (DOM), Simple API for XML (SAX) and XPath tools.
- Existing XML serializers and well-formedness checks continue to work as usual; schema validation only needs a schema that describes the header and row elements.
- Application Programming Interface (API) gateways, message queues and HTTP clients see it as normal XML.
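To illustrate that stock tooling keeps working, here is a minimal sketch that runs an ordinary path query over a PI-XML document with Python's standard xml.etree.ElementTree (which supports a limited XPath subset). The helper name attr_for and the second sample row are assumptions made up for this example; they are not part of the PI-XML convention.

```python
import xml.etree.ElementTree as ET

PI_XML = """<users>
  <header>
    <field name="first_name" index="0"/>
    <field name="last_name" index="1"/>
    <field name="location" index="2"/>
    <field name="online" index="3"/>
    <field name="followers" index="4"/>
  </header>
  <row f0="Sammy" f1="Shark" f2="Ocean" f3="true" f4="987"/>
  <row f0="Jesse" f1="Octopus" f2="Reef" f3="false" f4="12"/>
</users>"""  # second row is invented sample data

root = ET.fromstring(PI_XML)


def attr_for(root, field_name):
    """Hypothetical helper: find the f<index> attribute for a field name."""
    field = root.find(f"header/field[@name='{field_name}']")
    if field is None:
        raise KeyError(field_name)
    return "f" + field.get("index")


# Ordinary path query: all rows whose "online" field equals "true".
online_attr = attr_for(root, "online")
online_rows = root.findall(f"row[@{online_attr}='true']")
first_names = [row.get(attr_for(root, "first_name")) for row in online_rows]
print(first_names)  # ['Sammy']
```

The query itself is plain XPath-style syntax; only the translation from a field name to f3 goes through the header.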
The only extra step is a small mapping layer that:
- Uses the header to map index → field name when reading.
- Uses the same mapping to convert field names to attributes like f0, f1 when writing.
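The mapping layer described above can be sketched in a few lines of Python. This is one possible shape, not a reference implementation: the function names read_header, decode_rows and append_row are hypothetical, and all decoded values are left as strings.

```python
import xml.etree.ElementTree as ET

PI_XML = """<users>
  <header>
    <field name="first_name" index="0"/>
    <field name="last_name" index="1"/>
    <field name="location" index="2"/>
    <field name="online" index="3"/>
    <field name="followers" index="4"/>
  </header>
  <row f0="Sammy" f1="Shark" f2="Ocean" f3="true" f4="987"/>
</users>"""


def read_header(root):
    """Return the (index, field name) pairs declared in <header>."""
    return [(f.get("index"), f.get("name")) for f in root.findall("header/field")]


def decode_rows(root):
    """Reading: translate f<index> attributes back into named fields."""
    header = read_header(root)
    return [
        {name: row.get(f"f{index}") for index, name in header}
        for row in root.findall("row")
    ]


def append_row(root, record):
    """Writing: translate named fields into f<index> attributes."""
    attrs = {f"f{index}": str(record[name]) for index, name in read_header(root)}
    return ET.SubElement(root, "row", attrs)


root = ET.fromstring(PI_XML)
records = decode_rows(root)
print(records[0]["first_name"], records[0]["followers"])  # Sammy 987

append_row(root, {"first_name": "Sammy", "last_name": "Shark",
                  "location": "Ocean", "online": "true", "followers": 988})
print(len(root.findall("row")))  # 2
```

Both directions read the same header, so the document itself stays the single source of truth for the name-to-index mapping.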
Why Use Perfect Indexed XML?
- Less repetition of tag names
  In the normal XML, every <user> repeats <first_name>, <last_name>, <location>, etc. In PI-XML, those descriptive names appear once in the header and data is carried by compact attributes like f0, f1.
- Smaller XML size
  Shorter attribute names and no repeated tags mean fewer bytes:
  - Less bandwidth on the network.
  - Less storage on disk.
  - Potentially faster parsing and transferring.
- Index-friendly representation
  Each field index is stable across rows and payloads:
  - Index 0 always means first_name.
  - Index 1 always means last_name.
- No change to XML parsers
  PI-XML works with any existing XML library. Your code simply loads the XML and then interprets the <header> plus the <row> attributes.
- Minimal change to business logic
  Often you can:
  - Keep your internal model as normal objects with named fields.
  - Implement PI-XML only in the adapter that reads/writes the wire format.
- Better compression even before gzip
  You will likely still compress XML in transit, but removing repeated tag names makes the raw XML smaller before any compression is applied. Because gzip already encodes repeated tag names efficiently, expect the saving after compression to be smaller than the raw saving.
- Expandable header (mini schema)
  Just like with PI-JSON, the header can evolve into a schema-like description:
  <header>
    <field name="first_name" index="0" type="string" nullable="false"/>
    <field name="last_name" index="1" type="string" nullable="false"/>
    <field name="location" index="2" type="string" nullable="true"/>
    <field name="online" index="3" type="boolean" nullable="false"/>
    <field name="followers" index="4" type="integer" nullable="false"/>
  </header>
  This gives machines extra information without changing the basic idea.
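As a rough sketch of how a reader might use the type and nullable attributes from such a header, the Python below converts attribute strings into typed values. The semantics are assumptions made for illustration: a missing attribute is treated as a null value, and the CONVERTERS table and decode_typed function are hypothetical names.

```python
import xml.etree.ElementTree as ET

TYPED_PI_XML = """<users>
  <header>
    <field name="first_name" index="0" type="string" nullable="false"/>
    <field name="last_name" index="1" type="string" nullable="false"/>
    <field name="location" index="2" type="string" nullable="true"/>
    <field name="online" index="3" type="boolean" nullable="false"/>
    <field name="followers" index="4" type="integer" nullable="false"/>
  </header>
  <row f0="Sammy" f1="Shark" f3="true" f4="987"/>
</users>"""

# Assumed conversions for the three type names used in the header.
CONVERTERS = {
    "string": str,
    "boolean": lambda text: text == "true",
    "integer": int,
}


def decode_typed(xml_text):
    """Decode rows into dicts, converting values by the header's type hints."""
    root = ET.fromstring(xml_text)
    fields = root.findall("header/field")
    records = []
    for row in root.findall("row"):
        record = {}
        for field in fields:
            name = field.get("name")
            raw = row.get("f" + field.get("index"))
            if raw is None:
                # Assumption: an absent attribute means "null".
                if field.get("nullable") != "true":
                    raise ValueError(f"missing non-nullable field: {name}")
                record[name] = None
            else:
                record[name] = CONVERTERS[field.get("type", "string")](raw)
        records.append(record)
    return records


typed_records = decode_typed(TYPED_PI_XML)
print(typed_records[0])
```

In this sample the row omits f2, so location decodes to None while online and followers come back as a bool and an int.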
Normal XML vs PI-XML: Size and Overhead
XML tends to be more verbose than JSON due to opening and closing tags. That makes repeated field names even more expensive when you have many rows.
Imagine 100 <user> elements with the same structure. A simplified, minified XML
representation might look like this per row:
<user><first_name>Sammy</first_name><last_name>Shark</last_name><location>Ocean</location><online>true</online><followers>987</followers></user>
In PI-XML we move names to the header and compress the rows:
<header>
  <field name="first_name" index="0"/>
  <field name="last_name" index="1"/>
  <field name="location" index="2"/>
  <field name="online" index="3"/>
  <field name="followers" index="4"/>
</header>
<row f0="Sammy" f1="Shark" f2="Ocean" f3="true" f4="987"/>
Example size comparison (100 rows)
| Payload | Approx. length (bytes) | Relative size |
|---|---|---|
| Normal XML | 15000 | 100% |
| PI-XML | 9300 | ≈62% |
| Saved | 5700 | ≈38% reduction |
The exact numbers will vary by data, but the pattern is clear: repeated tag names are a big part of XML size, and PI-XML removes most of that repetition.
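To check figures like these against your own data, a small measurement script is usually enough. The sketch below builds both layouts for 100 copies of the sample record with Python's standard library and prints raw and gzipped sizes. It is only an illustration: identical rows compress unrealistically well, and the serializer's exact output (for example whitespace before "/>") affects the byte counts.

```python
import gzip
import xml.etree.ElementTree as ET

FIELDS = ["first_name", "last_name", "location", "online", "followers"]
RECORD = {"first_name": "Sammy", "last_name": "Shark", "location": "Ocean",
          "online": "true", "followers": "987"}
ROWS = 100


def build_normal(records):
    """Tag-per-field layout: one <user> element with a child per field."""
    root = ET.Element("users")
    for record in records:
        user = ET.SubElement(root, "user")
        for name in FIELDS:
            ET.SubElement(user, name).text = record[name]
    return ET.tostring(root)


def build_pi(records):
    """PI-XML layout: one header, then one attribute-only <row> per record."""
    root = ET.Element("users")
    header = ET.SubElement(root, "header")
    for index, name in enumerate(FIELDS):
        ET.SubElement(header, "field", name=name, index=str(index))
    for record in records:
        attrs = {f"f{i}": record[name] for i, name in enumerate(FIELDS)}
        ET.SubElement(root, "row", attrs)
    return ET.tostring(root)


records = [RECORD] * ROWS
normal, pi = build_normal(records), build_pi(records)
for label, payload in (("Normal XML", normal), ("PI-XML", pi)):
    print(f"{label}: {len(payload)} bytes raw, "
          f"{len(gzip.compress(payload))} bytes gzipped")
print(f"Raw size ratio: {len(pi) / len(normal):.0%}")
```

Swapping RECORD for a sample of real records gives a more honest picture, especially for the gzipped sizes.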
Comparison
Like with PI-JSON, it helps to visualize the impact of PI-XML with progress-bar style comparisons.
[Progress bars: 📦 payload size (example: 100 rows) and 🔖 tag name overhead (conceptual)]
Product List in PI-XML
Normal XML
<products>
  <product>
    <id>1</id>
    <name>Laptop</name>
    <price>899.99</price>
    <currency>USD</currency>
    <in_stock>true</in_stock>
  </product>
  <product>
    <id>2</id>
    <name>Mouse</name>
    <price>19.99</price>
    <currency>USD</currency>
    <in_stock>false</in_stock>
  </product>
</products>
PI-XML
<products>
  <header>
    <field name="id" index="0"/>
    <field name="name" index="1"/>
    <field name="price" index="2"/>
    <field name="currency" index="3"/>
    <field name="in_stock" index="4"/>
  </header>
  <row f0="1" f1="Laptop" f2="899.99" f3="USD" f4="true"/>
  <row f0="2" f1="Mouse" f2="19.99" f3="USD" f4="false"/>
</products>
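Going the other way is just as mechanical. The sketch below expands the PI-XML product list back into the tag-per-field layout, which can help when a downstream tool expects descriptive XML. Note one assumption: the PI-XML shown here does not record the original record element name, so the hypothetical expand_pi_xml function takes it as a parameter.

```python
import xml.etree.ElementTree as ET

PRODUCTS_PI_XML = """<products>
  <header>
    <field name="id" index="0"/>
    <field name="name" index="1"/>
    <field name="price" index="2"/>
    <field name="currency" index="3"/>
    <field name="in_stock" index="4"/>
  </header>
  <row f0="1" f1="Laptop" f2="899.99" f3="USD" f4="true"/>
  <row f0="2" f1="Mouse" f2="19.99" f3="USD" f4="false"/>
</products>"""


def expand_pi_xml(xml_text, record_tag):
    """Rebuild tag-per-field XML from a PI-XML document."""
    src = ET.fromstring(xml_text)
    fields = [(f.get("index"), f.get("name")) for f in src.findall("header/field")]
    out = ET.Element(src.tag)  # keep the same root element name
    for row in src.findall("row"):
        record = ET.SubElement(out, record_tag)
        for index, name in fields:
            value = row.get(f"f{index}")
            if value is not None:  # skip attributes the row does not carry
                ET.SubElement(record, name).text = value
    return ET.tostring(out, encoding="unicode")


expanded = expand_pi_xml(PRODUCTS_PI_XML, "product")
print(expanded)
```

The output is the minified form of the normal product list above, so existing consumers can keep reading the familiar layout behind an adapter.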
How to Encode and Decode PI-XML
Encoding on the server (conceptual Python)
from xml.etree.ElementTree import Element, SubElement, tostring

# Header mapping: (field name, index)
header_fields = [
    ("first_name", 0),
    ("last_name", 1),
    ("location", 2),
    ("online", 3),
    ("followers", 4),
]

users = [
    {
        "first_name": "Sammy",
        "last_name": "Shark",
        "location": "Ocean",
        "online": True,
        "followers": 987,
    },
    # more users ...
]


def to_attr(value):
    # str(True) would produce "True"; the XML examples use lowercase "true"/"false"
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


root = Element("users")

# Write the header once
header_el = SubElement(root, "header")
for name, idx in header_fields:
    field_el = SubElement(header_el, "field")
    field_el.set("name", name)
    field_el.set("index", str(idx))

# Write one <row> per user, using f<index> as the attribute name
for user in users:
    row_el = SubElement(root, "row")
    for name, idx in header_fields:
        row_el.set(f"f{idx}", to_attr(user[name]))

xml_bytes = tostring(root)  # can be written to file or sent over network
Decoding on the client (JavaScript)
// Assumes 'doc' is an XML Document parsed by DOM, for example:
// const doc = new DOMParser().parseFromString(xmlText, "application/xml");
const header = doc.querySelector("header");
const fieldNodes = header.querySelectorAll("field");

// Build the index → name mapping from the header
const indexToName = {};
fieldNodes.forEach(field => {
  const name = field.getAttribute("name");
  const index = field.getAttribute("index");
  indexToName[index] = name;
});

// Convert each <row> into an object with named fields
const rows = Array.from(doc.querySelectorAll("row"));
const users = rows.map(row => {
  const user = {};
  for (const [index, name] of Object.entries(indexToName)) {
    // Attribute values are always strings; a missing attribute yields null
    user[name] = row.getAttribute("f" + index);
  }
  return user;
});

// 'users' is now an array of objects like:
// { first_name: "Sammy", last_name: "Shark", location: "Ocean", ... }
When PI-XML Shines (and When It Does Not)
Great use cases
- Large XML feeds with many repeating records (logs, telemetry, analytics).
- Systems where XML is required but bandwidth or storage is limited.
- Internal services where both producer and consumer know the PI-XML convention.
- Data pipelines that process huge XML payloads and want to cut cost or latency.
Trade-offs and limitations
- Less human-readable: attributes like f0 are not self-describing.
- Requires a small mapping layer to convert between names and indices.
- Tools that expect descriptive XML might prefer the original tag-per-field layout.
- Header changes (adding or reordering fields) must be managed carefully for compatibility.
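One way to manage header changes is to compare the old and new header mappings before deploying a producer change. The sketch below assumes a simple compatibility policy that is not part of the PI-XML description itself: existing fields must keep their index, new fields may only be added with unused indices. The function name header_changes and the "bio" field are hypothetical.

```python
def header_changes(old_header, new_header):
    """Compare two {field name: index} mappings and list breaking changes."""
    problems = []
    for name, index in old_header.items():
        if name not in new_header:
            problems.append(f"removed field: {name}")
        elif new_header[name] != index:
            problems.append(
                f"index changed for {name}: {index} -> {new_header[name]}"
            )
    indices = list(new_header.values())
    if len(indices) != len(set(indices)):
        problems.append("duplicate index in new header")
    return problems


old = {"first_name": 0, "last_name": 1, "location": 2, "online": 3, "followers": 4}

# Appending a field with a fresh index is compatible under this policy.
appended = dict(old, bio=5)
print(header_changes(old, appended))  # []

# Swapping two indices silently changes the meaning of f0 and f1.
reordered = dict(old, first_name=1, last_name=0)
print(header_changes(old, reordered))
```

Consumers that always decode through the header in the same document are less exposed to reordering, but a check like this still catches changes that would break readers with cached or hard-coded indices.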
Conclusion: Smaller XML Without Breaking XML
Perfect Indexed XML (PI-XML) applies the same principle as Perfect Indexed JSON, but in the XML world:
- Move descriptive field names into a header.
- Reference them by compact indices inside data rows.
- Keep everything as valid XML parsable by existing libraries.
The result is that you keep the interoperability and tooling of XML, while dramatically reducing repeated tag overhead for large, tabular-style datasets.
If your systems still rely on XML and you need to optimize payload size without changing the underlying technology stack, PI-XML is a simple, compatible pattern worth exploring.