PI-XML Data Format Idea

Perfect Indexed XML (PI-XML)

Reduce Size. Keep XML. Zero Breakage for Modern Data Pipelines.

Extensible Markup Language (XML) has been used for years to model structured data and configuration. A common pattern is a root element with many repeated child elements that all share the same structure:

<users>
  <user>
    <first_name>Sammy</first_name>
    <last_name>Shark</last_name>
    <location>Ocean</location>
    <online>true</online>
    <followers>987</followers>
  </user>
  <user>
    <first_name>Sammy</first_name>
    <last_name>Shark</last_name>
    <location>Ocean</location>
    <online>true</online>
    <followers>987</followers>
  </user>
</users>

This is easy to understand, but it repeats tag names like <first_name> and <followers> for every item. Over thousands of records, that overhead adds up in terms of bytes, bandwidth and storage.

Perfect Indexed XML (PI-XML) applies the same idea as Perfect Indexed JSON (PI-JSON): move descriptive names into a header and use compact indexed references in the data, while remaining well-formed XML that standard parsers accept.

What Is Perfect Indexed XML (PI-XML)?

The idea behind PI-XML is simple:

  1. Define a header section that maps field names to indices.
  2. Represent each data row using short, index-based attributes instead of repeating full tag names.
  3. Keep everything as normal, well-formed XML that any XML parser can handle.

Starting from the normal structure:

<users>
  <user>
    <first_name>Sammy</first_name>
    <last_name>Shark</last_name>
    <location>Ocean</location>
    <online>true</online>
    <followers>987</followers>
  </user>
  <user>
    <first_name>Sammy</first_name>
    <last_name>Shark</last_name>
    <location>Ocean</location>
    <online>true</online>
    <followers>987</followers>
  </user>
</users>

We can define a PI-XML version like this:

<users>
  <header>
    <field name="first_name" index="0"/>
    <field name="last_name"  index="1"/>
    <field name="location"   index="2"/>
    <field name="online"     index="3"/>
    <field name="followers"  index="4"/>
  </header>

  <row f0="Sammy" f1="Shark" f2="Ocean" f3="true" f4="987"/>
  <row f0="Sammy" f1="Shark" f2="Ocean" f3="true" f4="987"/>
</users>

In this representation:

  • The <header> defines the mapping between name and index.
  • Each <row> element uses short attribute names f0, f1, f2, etc. to hold the actual data.
  • The full field names appear just once, in the header.

Conceptually, this is similar to a table where the header defines the column names and each row just stores values in a fixed column order. PI-XML uses attributes like f0, f1 to reference those columns, while the header describes what they mean.

PI-XML Is Still Just XML

PI-XML does not introduce a new file format or a new parser. It is purely a convention for structuring XML:

  • There is still a single root element (<users>).
  • All tags and attributes follow normal XML rules.
  • Any standard XML library can parse and serialize this structure.

That means:

  • You can still use Document Object Model (DOM), Simple API for XML (SAX) and XPath tools.
  • Existing XML serializers continue to work as usual, and so do schema validators, provided the schema (XSD or DTD) describes the header-and-row layout.
  • Application Programming Interface (API) gateways, message queues and HTTP clients see it as normal XML.
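
As a small illustration of the point about existing tools, the sketch below queries a PI-XML document using only Python's standard xml.etree.ElementTree module and its limited XPath support. The second row (Jesse Octopus) and the variable names are made up for this example. The index of the online field is read from the header rather than hard-coded.

```python
import xml.etree.ElementTree as ET

PI_XML = """
<users>
  <header>
    <field name="first_name" index="0"/>
    <field name="last_name" index="1"/>
    <field name="location" index="2"/>
    <field name="online" index="3"/>
    <field name="followers" index="4"/>
  </header>
  <row f0="Sammy" f1="Shark" f2="Ocean" f3="true" f4="987"/>
  <row f0="Jesse" f1="Octopus" f2="Reef" f3="false" f4="12"/>
</users>
"""

root = ET.fromstring(PI_XML)

# Look up which index the header assigns to the "online" field.
online_index = root.find("header/field[@name='online']").get("index")

# Select rows whose f<index> attribute equals "true" with an XPath predicate.
online_rows = root.findall(f"row[@f{online_index}='true']")
online_first_names = [row.get("f0") for row in online_rows]
```

No PI-XML-specific parser is involved here; the query is ordinary attribute matching on ordinary elements.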

The only extra step is a small mapping layer that:

  • Uses the header to map index → field name when reading.
  • Uses the same mapping to convert field names to attributes like f0, f1 when writing.
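
The reading half of that mapping layer can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the function name decode_pi_xml and the two-field sample document are assumptions made for illustration, and all values are returned as strings.

```python
import xml.etree.ElementTree as ET

def decode_pi_xml(xml_text):
    """Return a list of dicts with named fields from a PI-XML document."""
    root = ET.fromstring(xml_text)
    # Header: index -> field name.
    index_to_name = {
        field.get("index"): field.get("name")
        for field in root.findall("header/field")
    }
    records = []
    for row in root.findall("row"):
        # row.get() returns None for a missing attribute instead of raising.
        records.append({
            name: row.get(f"f{index}")
            for index, name in index_to_name.items()
        })
    return records

sample = (
    '<users><header>'
    '<field name="first_name" index="0"/>'
    '<field name="followers" index="1"/>'
    '</header>'
    '<row f0="Sammy" f1="987"/>'
    '</users>'
)
records = decode_pi_xml(sample)
```

The writing half is the same lookup in reverse: for each named field, emit an attribute called f followed by the field's index.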

Why Use Perfect Indexed XML?

  1. Less repetition of tag names
    In the normal XML, every <user> repeats <first_name>, <last_name>, <location>, etc. In PI-XML, those descriptive names appear once in the header and data is carried by compact attributes like f0, f1.
  2. Smaller XML size
    Shorter attribute names and no repeated tags mean fewer bytes:
    • Less bandwidth on the network.
    • Less storage on disk.
    • Potentially faster parsing and transfer.
  3. Index-friendly representation
    Each field index is stable across rows and payloads:
    • Index 0 always means first_name.
    • Index 1 always means last_name.
    This is similar to a table with fixed column positions and works well in analytical or batch processing scenarios.
  4. No change to XML parsers
    PI-XML works with any existing XML library. Your code simply loads the XML and then interprets the <header> plus the <row> attributes.
  5. Minimal change to business logic
    Often you can:
    • Keep your internal model as normal objects with named fields.
    • Implement PI-XML only in the adapter that reads/writes the wire format.
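  6. Smaller payload even before gzip
    You will likely still compress XML in transit, but removing repeated tag names makes the raw XML smaller before compression is applied. The compressed payload usually shrinks as well, although by a smaller margin, because gzip already encodes repeated tag names compactly.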
  7. Expandable header (mini schema)
    Just like with PI-JSON, the header can evolve into a schema-like description:
    <header>
      <field name="first_name" index="0" type="string"  nullable="false"/>
      <field name="last_name"  index="1" type="string"  nullable="false"/>
      <field name="location"   index="2" type="string"  nullable="true"/>
      <field name="online"     index="3" type="boolean" nullable="false"/>
      <field name="followers"  index="4" type="integer" nullable="false"/>
    </header>
    This gives machines extra information without changing the basic idea.
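
One way a reader could use such a typed header is sketched below. Several things here are assumptions rather than part of the PI-XML idea as described: the CONVERTERS table, the function name decode_typed, and the rule that a missing attribute stands for null (the text does not say how nulls are encoded).

```python
import xml.etree.ElementTree as ET

# Hypothetical converters keyed by the header's type attribute.
CONVERTERS = {
    "string": str,
    "integer": int,
    "boolean": lambda text: text == "true",
}

def decode_typed(xml_text):
    """Decode rows into dicts, converting values using the header's types."""
    root = ET.fromstring(xml_text)
    fields = root.findall("header/field")
    records = []
    for row in root.findall("row"):
        record = {}
        for field in fields:
            name = field.get("name")
            raw = row.get("f" + field.get("index"))
            if raw is None:
                # Assumption: a missing attribute represents null.
                if field.get("nullable") != "true":
                    raise ValueError(f"missing non-nullable field: {name}")
                record[name] = None
            else:
                record[name] = CONVERTERS[field.get("type", "string")](raw)
        records.append(record)
    return records

sample = """<users>
  <header>
    <field name="first_name" index="0" type="string" nullable="false"/>
    <field name="location" index="1" type="string" nullable="true"/>
    <field name="online" index="2" type="boolean" nullable="false"/>
    <field name="followers" index="3" type="integer" nullable="false"/>
  </header>
  <row f0="Sammy" f2="true" f3="987"/>
</users>"""
typed = decode_typed(sample)
```

With this sample, location is absent from the row and comes back as None, while online and followers come back as a boolean and an integer instead of strings.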

Normal XML vs PI-XML: Size and Overhead

XML tends to be more verbose than JSON due to opening and closing tags. That makes repeated field names even more expensive when you have many rows.

Imagine 100 <user> elements with the same structure. A simplified, minified XML representation might look like this per row:

<user><first_name>Sammy</first_name><last_name>Shark</last_name><location>Ocean</location><online>true</online><followers>987</followers></user>

In PI-XML we move names to the header and compact the rows:

<header>
  <field name="first_name" index="0"/>
  <field name="last_name"  index="1"/>
  <field name="location"   index="2"/>
  <field name="online"     index="3"/>
  <field name="followers"  index="4"/>
</header>

<row f0="Sammy" f1="Shark" f2="Ocean" f3="true" f4="987"/>

Example size comparison (100 rows)

Payload       Approx. length (bytes)   Relative size
Normal XML    15000                    100%
PI-XML        9300                     ≈62%
Saved         5700                     ≈38% reduction

These figures are illustrative rather than measured. The exact numbers depend on the data (field-name length, value length, row count), but the pattern is clear: repeated tag names are a big part of XML size, and PI-XML removes most of that repetition.
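
To check the effect on your own data, a measurement along the following lines may help. This is a rough sketch: it builds 100 identical minified rows from the sample values used in this article, so the ratio it prints applies to this sample only and need not match the illustrative figures in the table. Identical rows also compress unusually well, so the gzip numbers will look different with varied real data. The helper name sizes is an assumption for this example.

```python
import gzip

FIELDS = ["first_name", "last_name", "location", "online", "followers"]
VALUES = ["Sammy", "Shark", "Ocean", "true", "987"]
ROW_COUNT = 100

# One minified row in each layout, built from the same names and values.
normal_row = "<user>" + "".join(
    f"<{name}>{value}</{name}>" for name, value in zip(FIELDS, VALUES)
) + "</user>"
pi_row = "<row" + "".join(
    f' f{index}="{value}"' for index, value in enumerate(VALUES)
) + "/>"
header = "<header>" + "".join(
    f'<field name="{name}" index="{index}"/>'
    for index, name in enumerate(FIELDS)
) + "</header>"

normal_doc = "<users>" + normal_row * ROW_COUNT + "</users>"
pi_doc = "<users>" + header + pi_row * ROW_COUNT + "</users>"

def sizes(doc):
    """Return (raw bytes, gzip-compressed bytes) for a document string."""
    data = doc.encode("utf-8")
    return len(data), len(gzip.compress(data))

normal_raw, normal_gz = sizes(normal_doc)
pi_raw, pi_gz = sizes(pi_doc)
print(f"normal: {normal_raw} raw, {normal_gz} gzipped")
print(f"PI-XML: {pi_raw} raw, {pi_gz} gzipped")
print(f"raw size ratio: {pi_raw / normal_raw:.0%}")
```

Replacing FIELDS, VALUES and ROW_COUNT with values taken from a real feed gives a more meaningful estimate than any fixed percentage.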

Comparison

As with PI-JSON, it helps to look at the impact of PI-XML from two angles: total payload size and the share taken up by tag names.

📦 Payload size (example: 100 rows)

  • Normal XML: 15000 bytes (100%). Baseline XML with full tags repeated for every user.
  • PI-XML: 9300 bytes (≈62%). Descriptive names are stored once in the header, and each row uses compact attributes.
  • Saved: 5700 bytes (≈38%). Roughly 38% reduction in this example, without changing XML itself.

🔖 Tag name overhead (conceptual)

  • Normal XML tag text: 100%. Every tag like <first_name> is repeated for each user.
  • PI-XML tag text: ≈15%. Names appear once in the header; data rows reuse short attribute names like f0.
  • Saved: ≈85%. A large part of the XML payload is no longer consumed by repeated tag names.

TL;DR: PI-XML keeps XML intact but restructures it so that descriptive names live in a header and data rows become short, index-based elements or attributes.

Product List in PI-XML

Normal XML

<products>
  <product>
    <id>1</id>
    <name>Laptop</name>
    <price>899.99</price>
    <currency>USD</currency>
    <in_stock>true</in_stock>
  </product>
  <product>
    <id>2</id>
    <name>Mouse</name>
    <price>19.99</price>
    <currency>USD</currency>
    <in_stock>false</in_stock>
  </product>
</products>

PI-XML

<products>
  <header>
    <field name="id"        index="0"/>
    <field name="name"      index="1"/>
    <field name="price"     index="2"/>
    <field name="currency"  index="3"/>
    <field name="in_stock"  index="4"/>
  </header>

  <row f0="1" f1="Laptop" f2="899.99" f3="USD" f4="true"/>
  <row f0="2" f1="Mouse"  f2="19.99"  f3="USD" f4="false"/>
</products>

How to Encode and Decode PI-XML

Encoding on the server (conceptual Python)

from xml.etree.ElementTree import Element, SubElement, tostring

# Header mapping: (field name, index). Names appear once, in the header.
header_fields = [
    ("first_name", 0),
    ("last_name", 1),
    ("location", 2),
    ("online", 3),
    ("followers", 4),
]

users = [
    {
        "first_name": "Sammy",
        "last_name": "Shark",
        "location": "Ocean",
        "online": True,
        "followers": 987,
    },
    # more users ...
]


def to_xml_text(value):
    """Serialize a value as attribute text; booleans become "true"/"false"."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


root = Element("users")

# Write the header: one <field> element per name/index pair.
header_el = SubElement(root, "header")
for name, idx in header_fields:
    field_el = SubElement(header_el, "field")
    field_el.set("name", name)
    field_el.set("index", str(idx))

# Write one <row> per record, using f<index> attributes for the values.
for user in users:
    row_el = SubElement(root, "row")
    for name, idx in header_fields:
        row_el.set(f"f{idx}", to_xml_text(user[name]))

xml_bytes = tostring(root)  # can be written to a file or sent over the network

Decoding on the client (JavaScript sketch)

// Assumes 'doc' is an XML Document parsed by the DOM, for example:
//   const doc = new DOMParser().parseFromString(xmlText, "application/xml");

// Build the index → name mapping from the header.
const indexToName = {};
doc.querySelectorAll("header > field").forEach(field => {
  indexToName[field.getAttribute("index")] = field.getAttribute("name");
});

// Convert each <row> back into an object with named fields.
const users = Array.from(doc.querySelectorAll("row")).map(row => {
  const user = {};
  for (const [index, name] of Object.entries(indexToName)) {
    // getAttribute returns a string, or null if the attribute is absent.
    user[name] = row.getAttribute("f" + index);
  }
  return user;
});

// 'users' is now an array of objects like:
// { first_name: "Sammy", last_name: "Shark", location: "Ocean", ... }
// Every value is a string here ("true", "987"); convert types as needed.

When PI-XML Shines (and When It Does Not)

Great use cases

  • Large XML feeds with many repeating records (logs, telemetry, analytics).
  • Systems where XML is required but bandwidth or storage is limited.
  • Internal services where both producer and consumer know the PI-XML convention.
  • Data pipelines that process huge XML payloads and want to cut cost or latency.

Trade-offs and limitations

  • Less human-readable: attributes like f0 are not self-describing.
  • Requires a small mapping layer to convert between names and indices.
  • Tools that expect descriptive XML might prefer the original tag-per-field layout.
  • Header changes (adding or reordering fields) must be managed carefully for compatibility.
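
One way to manage the header-compatibility risk is a consistency check that runs before a payload is accepted. The sketch below is only an illustration under assumed rules: header indices and names must be unique, and rows may only use attributes the header declares. The function name check_pi_xml and the two tiny sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

def check_pi_xml(xml_text):
    """Return a list of problems; an empty list means header and rows agree."""
    root = ET.fromstring(xml_text)
    problems = []
    seen_indices = set()
    seen_names = set()
    for field in root.findall("header/field"):
        index, name = field.get("index"), field.get("name")
        if index in seen_indices:
            problems.append(f"duplicate index {index}")
        if name in seen_names:
            problems.append(f"duplicate name {name}")
        seen_indices.add(index)
        seen_names.add(name)
    # Rows may only use attributes that the header declares.
    allowed = {f"f{index}" for index in seen_indices}
    for position, row in enumerate(root.findall("row")):
        unknown = sorted(set(row.attrib) - allowed)
        if unknown:
            problems.append(
                f"row {position}: attributes not in header: {unknown}"
            )
    return problems

good = '<users><header><field name="a" index="0"/></header><row f0="x"/></users>'
bad = ('<users><header><field name="a" index="0"/></header>'
       '<row f0="x" f1="y"/></users>')
good_problems = check_pi_xml(good)
bad_problems = check_pi_xml(bad)
```

A check like this catches a producer that added a field to its rows without updating the header. It cannot catch a silent reordering in which two names swap indices, so versioning the header is still worth considering.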

Conclusion: Smaller XML Without Breaking XML

Perfect Indexed XML (PI-XML) applies the same principle as Perfect Indexed JSON, but in the XML world:

  • Move descriptive field names into a header.
  • Reference them by compact indices inside data rows.
  • Keep everything as valid XML parsable by existing libraries.

The result is that you keep the interoperability and tooling of XML, while dramatically reducing repeated tag overhead for large, tabular-style datasets.

If your systems still rely on XML and you need to optimize payload size without changing the underlying technology stack, PI-XML is a simple, compatible pattern worth exploring.