PI-JSON Data Format Idea
dearphp@gmail.com © 2025

Perfect Indexed JSON (PI-JSON)

Reduce Size. Keep JSON. Zero Breakage for Modern Data Pipelines.


When we work with JavaScript Object Notation (JSON), we often send an array of objects that all share the same structure. That makes the data easy to read, but it also means the same keys are repeated in every object:

[
  {
    "first_name": "Sammy",
    "last_name": "Shark",
    "location": "Ocean",
    "online": true,
    "followers": 987
  },
  {
    "first_name": "Sammy",
    "last_name": "Shark",
    "location": "Ocean",
    "online": true,
    "followers": 987
  }
]

For large datasets with thousands or millions of rows, those repeated keys cost extra bytes, bandwidth and storage, without adding any new information.

Perfect Indexed JSON (PI-JSON) is a simple structuring convention that reduces this overhead while still staying 100% valid JSON.

What Is Perfect Indexed JSON (PI-JSON)?

The basic idea is:

  1. Move all field names into a header object.
  2. Let that header map field_name → index (for example, "first_name": 0).
  3. Represent every data row using those index keys instead of full names.

Starting from the original JSON:

[
  {
    "first_name": "Sammy",
    "last_name": "Shark",
    "location": "Ocean",
    "online": true,
    "followers": 987
  },
  {
    "first_name": "Sammy",
    "last_name": "Shark",
    "location": "Ocean",
    "online": true,
    "followers": 987
  }
]

We can create a PI-JSON version:

[
  {
    "first_name": 0,
    "last_name": 1,
    "location": 2,
    "online": 3,
    "followers": 4
  },
  {
    "0": "Sammy",
    "1": "Shark",
    "2": "Ocean",
    "3": true,
    "4": 987
  },
  {
    "0": "Sammy",
    "1": "Shark",
    "2": "Ocean",
    "3": true,
    "4": 987
  }
]

The first object is the header:

  • "first_name" → 0
  • "last_name" → 1
  • "location" → 2
  • "online" → 3
  • "followers" → 4

All other entries are data rows that use those index keys: "0", "1", "2", "3", "4".
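
The three steps above can be sketched in a few lines of Python. This is only a minimal sketch for flat rows; the helper names build_header and to_pi_json are illustrative assumptions, not part of any library.

```python
import json


def build_header(rows):
    """Assign each field name an index in order of first appearance."""
    header = {}
    for row in rows:
        for name in row:
            # setdefault keeps the existing index if the name was already seen
            header.setdefault(name, len(header))
    return header


def to_pi_json(rows):
    """Return [header, row, row, ...] where each row uses index keys (as strings)."""
    header = build_header(rows)
    indexed = [
        {str(header[name]): value for name, value in row.items()}
        for row in rows
    ]
    return [header] + indexed


user = {
    "first_name": "Sammy",
    "last_name": "Shark",
    "location": "Ocean",
    "online": True,
    "followers": 987,
}
payload = to_pi_json([user, user])
print(json.dumps(payload, indent=2))
```

Deriving the header from the data keeps producer code short, but the resulting indices depend on the order in which keys first appear, so a fixed, hand-written header may be preferable when indices must stay stable.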

This is conceptually similar to a Comma Separated Values (CSV) file, where the first line defines the header and all subsequent rows contain only values. The key difference is that PI-JSON stays entirely within JSON: rows may omit fields, values keep their JSON types, and nested objects and arrays remain possible.

PI-JSON Is Still “Just JSON”

A key point: PI-JSON is not a new file format, not a new parser, and not a new library. It is simply a structured way to organize your existing JSON data.

  • The top level is still a normal JSON array.
  • Elements are normal JSON objects.
  • Keys are strings; values can be any JSON value (strings, numbers, booleans, null, arrays or objects).

That means:

  • You still use JSON.parse / JSON.stringify in JavaScript.
  • You still use json.loads / json.dumps in Python.
  • You still use all regular JSON libraries in Java, Go, PHP and others.

The only extra logic is a small mapping layer in your application that:

  • Converts named fields into index-based rows when encoding.
  • Maps indices back to field names when decoding.
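
As a rough illustration of the decoding half of that mapping layer, the sketch below parses a PI-JSON string with Python's standard json module and restores the field names. It assumes flat rows and a non-empty payload whose first element is the header; from_pi_json is a hypothetical helper name.

```python
import json


def from_pi_json(text):
    """Parse PI-JSON text and return rows keyed by field name (flat rows only)."""
    # Standard parsing, nothing PI-JSON specific; assumes at least a header element
    header, *rows = json.loads(text)
    # Reverse the header: "0" -> "first_name", "1" -> "last_name", ...
    index_to_name = {str(index): name for name, index in header.items()}
    return [
        {index_to_name[key]: value for key, value in row.items()}
        for row in rows
    ]


text = '[{"first_name":0,"last_name":1},{"0":"Sammy","1":"Shark"}]'
decoded = from_pi_json(text)
print(decoded)  # [{'first_name': 'Sammy', 'last_name': 'Shark'}]
```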

Why Use Perfect Indexed JSON?

  1. Index-style access, array-like feel
    Each row is essentially a set of values referenced by index keys, much like columns in a table:
    // row = { "0": "Sammy", "1": "Shark", "2": "Ocean", "3": true, "4": 987 }
    const firstName = row["0"];  // "Sammy"
    const isOnline  = row["3"];  // true
  2. Removes repeated keys
    Instead of sending "first_name", "last_name", and the rest for every row, you send them once in the header and refer to them by index afterward.
  3. Smaller JSON size
    Shorter keys and no repetition mean fewer characters and smaller payloads:
    • Less bandwidth usage.
    • Less storage space.
    • Faster network transfers.
  4. No change to JSON parsing logic
    You do not need to modify any JSON libraries. All standard tools still work: validators, linters, pretty-printers and so on.
  5. No new format to learn
    Developers already understand JSON. PI-JSON is just “JSON with a header and indexed rows”.
  6. Existing JSON ecosystem continues to work
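    Logging, tracing, Application Programming Interface (API) gateways and HTTP tooling all continue to treat PI-JSON as valid JSON. JSON Schema validation also still works, provided the schema describes the indexed form or is applied after decoding.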
  7. Minimal change to existing code
    You can confine PI-JSON to the serialization and deserialization layer while keeping your domain models in normal object form:
    // internal model
    const user = {
      first_name: "Sammy",
      last_name: "Shark",
      location: "Ocean",
      online: true,
      followers: 987
    };
    
    // only the adapter/serializer knows about PI-JSON
    
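  8. Smaller input before compression
    Even if you use gzip or Brotli, eliminating repeated keys at the source reduces the raw size. The compressed size may shrink as well, though typically by a much smaller margin, because these compressors already encode repeated strings compactly.
  9. Table-like, analytics-friendly rows
    When indices stay stable, each index corresponds to a specific column, as in a table, so consumers can select fields by position. The data is still stored row by row, so this resembles a table with a header rather than a true columnar format.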
  10. Header as a mini schema
    You can evolve the header into something richer:
    {
      "first_name": { "index": 0, "type": "string",  "nullable": false },
      "last_name":  { "index": 1, "type": "string",  "nullable": false },
      "location":   { "index": 2, "type": "string",  "nullable": true  },
      "online":     { "index": 3, "type": "boolean", "nullable": false },
      "followers":  { "index": 4, "type": "integer", "nullable": false }
    }
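
If the header is evolved into a richer form like the one above, a consumer could use it for basic checks on each indexed row. The sketch below is one possible reading of that idea: the type names come from the example header, while the mapping to Python types and the check_row helper are assumptions made for illustration.

```python
# Assumed mapping from the header's type names to Python types
PY_TYPES = {"string": str, "boolean": bool, "integer": int}


def check_row(header, row):
    """Return a list of problems found in one indexed row (empty list means OK)."""
    problems = []
    for name, spec in header.items():
        value = row.get(str(spec["index"]))
        if value is None:
            # A missing key and an explicit null are treated the same way here
            if not spec["nullable"]:
                problems.append(f"{name}: missing or null")
            continue
        expected = PY_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so reject True/False for "integer"
        wrong_type = not isinstance(value, expected)
        bool_as_int = expected is int and isinstance(value, bool)
        if wrong_type or bool_as_int:
            problems.append(f"{name}: expected {spec['type']}")
    return problems


header = {
    "first_name": {"index": 0, "type": "string", "nullable": False},
    "last_name": {"index": 1, "type": "string", "nullable": False},
    "location": {"index": 2, "type": "string", "nullable": True},
    "online": {"index": 3, "type": "boolean", "nullable": False},
    "followers": {"index": 4, "type": "integer", "nullable": False},
}

good = {"0": "Sammy", "1": "Shark", "2": None, "3": True, "4": 987}
bad = {"0": "Sammy", "1": "Shark", "3": "yes", "4": 987}
print(check_row(header, good))  # []
print(check_row(header, bad))   # ['online: expected boolean']
```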

Normal JSON vs PI-JSON: Real Size Numbers

To make the comparison fair, we can look at minified JSON (no spaces, no line breaks).

A typical row in normal JSON looks like this:

{"first_name":"Sammy","last_name":"Shark","location":"Ocean","online":true,"followers":987}

The PI-JSON header (sent once per payload) is:

{"first_name":0,"last_name":1,"location":2,"online":3,"followers":4}

And a PI-JSON data row is:

{"0":"Sammy","1":"Shark","2":"Ocean","3":true,"4":987}

Measured sizes for different row counts

Number of rows   Normal JSON length   PI-JSON length   Saved
2                185                  180              5
10               921                  620              301
100              9201                 5570             3631
1000             92001                55070            36931
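
The figures in the table can be reproduced with Python's standard json module. The sketch below assumes every row is the same Sammy Shark object shown above and counts characters of the minified text; since the data is plain ASCII, characters and bytes coincide. The helper names are illustrative.

```python
import json

ROW = {
    "first_name": "Sammy",
    "last_name": "Shark",
    "location": "Ocean",
    "online": True,
    "followers": 987,
}
# Header maps each field name to its position: {"first_name": 0, ...}
HEADER = {name: index for index, name in enumerate(ROW)}


def minified_len(value):
    """Length of the JSON text without spaces or line breaks."""
    return len(json.dumps(value, separators=(",", ":")))


def sizes(n):
    """Return (normal JSON length, PI-JSON length) for n identical rows."""
    indexed_row = {str(HEADER[name]): value for name, value in ROW.items()}
    normal = [ROW] * n
    pi = [HEADER] + [indexed_row] * n
    return minified_len(normal), minified_len(pi)


for n in (2, 10, 100, 1000):
    normal, pi = sizes(n)
    print(n, normal, pi, normal - pi)
```

For this data the lengths follow 92n + 1 for normal JSON and 55n + 70 for PI-JSON, which is why the saving approaches 37 characters per row, or about 40%, as n grows.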

With just a few rows, the difference is small. But as the dataset grows, the key repetition becomes expensive and PI-JSON brings large savings.
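
Raw size is only part of the picture when payloads are compressed in transit. A quick way to check the effect on your own data is to gzip both forms and compare, as in the sketch below. It reuses the example row as a stand-in; identical rows compress unrealistically well, so the printed numbers are not representative and no particular result is claimed here.

```python
import gzip
import json

ROW = {
    "first_name": "Sammy",
    "last_name": "Shark",
    "location": "Ocean",
    "online": True,
    "followers": 987,
}


def payloads(n):
    """Return (normal JSON bytes, PI-JSON bytes) for n copies of ROW, minified."""
    header = {name: index for index, name in enumerate(ROW)}
    indexed_row = {str(i): value for i, value in enumerate(ROW.values())}
    normal = json.dumps([ROW] * n, separators=(",", ":")).encode("utf-8")
    pi = json.dumps([header] + [indexed_row] * n, separators=(",", ":")).encode("utf-8")
    return normal, pi


for n in (10, 100, 1000):
    normal, pi = payloads(n)
    # Columns: rows, raw normal, gzipped normal, raw PI-JSON, gzipped PI-JSON
    print(n, len(normal), len(gzip.compress(normal)), len(pi), len(gzip.compress(pi)))
```

Replacing ROW with a sample of real records gives a more honest picture of how much PI-JSON still saves once compression is applied.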

Key “token” savings

If we only look at key characters for our example fields:

  • "first_name" → 10 characters
  • "last_name" → 9 characters
  • "location" → 8 characters
  • "online" → 6 characters
  • "followers" → 9 characters

Total = 42 key characters per row in normal JSON.

For 100 rows:

  • Normal JSON keys: 42 × 100 = 4200 characters.
  • PI-JSON keys: 42 + 5 × 100 = 542 characters.

That is a reduction from 4200 to 542 key characters, saving 3658 key characters just by restructuring the JSON.
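
The same arithmetic can be written as a small function, which makes it easy to try other row counts. The function name key_chars is just an illustrative choice.

```python
FIELDS = ["first_name", "last_name", "location", "online", "followers"]


def key_chars(rows):
    """Return (normal JSON key characters, PI-JSON key characters) for a row count."""
    names_total = sum(len(name) for name in FIELDS)              # 42
    index_total = sum(len(str(i)) for i in range(len(FIELDS)))   # 5 (one digit each)
    normal = names_total * rows                # names repeated in every row
    pi = names_total + index_total * rows      # names once, then indices per row
    return normal, pi


normal, pi = key_chars(100)
print(normal, pi, normal - pi)  # 4200 542 3658
```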

Comparison

The approximate percentages below summarize the impact of PI-JSON for the 100-row example.

📦 Payload size (100 rows)

  • Normal JSON: 9201 bytes (100%). Baseline payload with repeated keys on every row.
  • PI-JSON: 5570 bytes (≈61%). PI-JSON shrinks the payload to about 61% of the original size.
  • Saved: 3631 bytes (≈39%). Roughly 39% reduction for this example dataset.

🔑 Key characters only (100 rows)

  • Normal JSON keys: 4200 characters (100%). Every row repeats full field names like "first_name" and "followers".
  • PI-JSON keys: 542 characters (≈13%). The header defines field names once; data rows use short index keys like "0", "1", "2".
  • Saved: 3658 characters (≈87%). Around 87% reduction in key characters in this example.

TL;DR: PI-JSON reduces payload size by replacing repeated keys with compact, index-based references, while still being completely valid JSON.

Product List Example

Normal JSON

[
  {
    "id": 1,
    "name": "Laptop",
    "price": 899.99,
    "currency": "USD",
    "in_stock": true
  },
  {
    "id": 2,
    "name": "Mouse",
    "price": 19.99,
    "currency": "USD",
    "in_stock": false
  }
]

PI-JSON

[
  {
    "id": 0,
    "name": 1,
    "price": 2,
    "currency": 3,
    "in_stock": 4
  },
  {
    "0": 1,
    "1": "Laptop",
    "2": 899.99,
    "3": "USD",
    "4": true
  },
  {
    "0": 2,
    "1": "Mouse",
    "2": 19.99,
    "3": "USD",
    "4": false
  }
]

Nested PI-JSON vs Normal JSON (Full Indexing at All Levels)

This example shows Perfect Indexed JSON with a flat header that assigns an index to every logical field name, and data rows that use only index keys – even for nested objects. The JSON structure (nesting, arrays, partial fields) stays exactly the same; only the keys are shortened.

Normal JSON

[
  {
    "user": {
      "first": "Sam",
      "last": "Shark"
    },
    "address": {
      "city": "Ocean",
      "zip": 44221
    },
    "status": true
  },
  {
    "user": {
      "first": "Alex"
    },
    "address": {
      "zip": 11111
    },
    "meta": {
      "device": "mobile",
      "version": 12
    }
  }
]

PI-JSON version (flat header + indexed keys)

[
  {
    "user": 0,
    "address": 1,
    "status": 2,
    "meta": 3,
    "first": 4,
    "last": 5,
    "city": 6,
    "zip": 7,
    "device": 8,
    "version": 9
  },

  {
    "0": {
      "4": "Sam",
      "5": "Shark"
    },
    "1": {
      "6": "Ocean",
      "7": 44221
    },
    "2": true
  },

  {
    "0": {
      "4": "Alex"
    },
    "1": {
      "7": 11111
    },
    "3": {
      "8": "mobile",
      "9": 12
    }
  }
]

The first object is the header and appears only once. It maps every field name to a numeric index:

"user" → 0, "address" → 1, "status" → 2, "meta" → 3, "first" → 4, "last" → 5, "city" → 6, "zip" → 7, "device" → 8, "version" → 9.

All following rows are normal JSON objects, but they use index keys like "0", "1", "4", "7", and so on. A decoder uses the header to translate indices back to field names and reconstructs the original objects.

Key name token comparison (small example)

For this small two-row example, only the key text itself (without quotes and punctuation) is counted, just to show the idea. Both versions use the same ten distinct key names: user, address, status, meta, first, last, city, zip, device, version.

Metric                                Normal JSON                       PI-JSON
Key-name characters in payload        69 (names repeated in each row)   50 (each name appears once in the header)
Index-key characters in data rows     0                                 14 (one digit per key)
Total key characters                  69                                64

Even in this tiny example, PI-JSON already uses fewer key-name characters by moving the names into a single header. As the number of rows grows, this effect becomes much stronger, because normal JSON keeps repeating field names while PI-JSON reuses compact indices.
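
The key counts for this example can be checked with a short recursive helper. The sketch below simply sums the length of every object key it finds; key_name_chars is an illustrative name, and the three literals restate the rows and header shown above.

```python
def key_name_chars(value):
    """Sum the lengths of all object keys found anywhere inside value."""
    if isinstance(value, dict):
        return sum(len(key) + key_name_chars(child) for key, child in value.items())
    if isinstance(value, list):
        return sum(key_name_chars(item) for item in value)
    return 0  # primitives contribute no keys


normal_rows = [
    {"user": {"first": "Sam", "last": "Shark"},
     "address": {"city": "Ocean", "zip": 44221},
     "status": True},
    {"user": {"first": "Alex"},
     "address": {"zip": 11111},
     "meta": {"device": "mobile", "version": 12}},
]
header = {"user": 0, "address": 1, "status": 2, "meta": 3, "first": 4,
          "last": 5, "city": 6, "zip": 7, "device": 8, "version": 9}
pi_rows = [
    {"0": {"4": "Sam", "5": "Shark"}, "1": {"6": "Ocean", "7": 44221}, "2": True},
    {"0": {"4": "Alex"}, "1": {"7": 11111}, "3": {"8": "mobile", "9": 12}},
]

print(key_name_chars(normal_rows))  # 69: names repeated per row
print(key_name_chars(header))       # 50: names listed once in the header
print(key_name_chars(pi_rows))      # 14: one-digit index keys in the data rows
```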

Estimated size comparison for repeated nested structure

If this kind of nested structure is repeated many times (for example, in logs, analytics events or user activity streams), the PI-JSON header is sent once, while normal JSON repeats the full field names for every row. The figures below are the minified sizes from the flat five-field example earlier, repeated here with the percentage saved; they were not re-measured for the nested structure, so treat them as an indication of the trend:

Number of rows   Normal JSON length   PI-JSON length   Saved (bytes)   Saved (%)
2                185                  180              5               ≈2.7%
10               921                  620              301             ≈32.7%
100              9201                 5570             3631            ≈39.5%
1000             92001                55070            36931           ≈40.1%

The same effect applies to nested objects: as the number of rows grows, repeated key overhead in normal JSON increases linearly, while PI-JSON pays the cost of descriptive names once in the header and then reuses compact indices for all additional rows.

How to Encode and Decode PI-JSON in Python (Generic, Nested-Safe)

The following Python helpers work with any PI-JSON payload that uses a flat header mapping field_name → index and data rows that use index keys as strings. They support deeply nested objects and arrays.

Encoder: from normal JSON objects to PI-JSON

import json
from typing import Any, Dict, List

Header = Dict[str, int]


def encode_value(value: Any, header: Header) -> Any:
    """
    Recursively encode a normal JSON-compatible value using the header mapping.
    Any object key that exists in the header is replaced by its numeric index (as a string).
    """
    # If it's a dict, replace keys using the header map
    if isinstance(value, dict):
        encoded = {}
        for key, val in value.items():
            # If the key is in the header, use its index; otherwise keep it as-is
            if key in header:
                new_key = str(header[key])
            else:
                new_key = key
            encoded[new_key] = encode_value(val, header)
        return encoded

    # If it's a list/array, encode each element
    if isinstance(value, list):
        return [encode_value(item, header) for item in value]

    # Primitive (str, int, float, bool, None) – leave as-is
    return value


def encode_pi_json(header: Header, rows: List[Dict[str, Any]]) -> List[Any]:
    """
    Given a header (field_name → index) and a list of normal JSON objects (rows),
    return a PI-JSON payload where:

    - The first element is the header itself.
    - All subsequent elements are encoded rows with index-based keys.
    """
    encoded_rows = [encode_value(row, header) for row in rows]
    return [header] + encoded_rows


# Example usage: encode the nested user/address/meta structure
header = {
    "user": 0,
    "address": 1,
    "status": 2,
    "meta": 3,
    "first": 4,
    "last": 5,
    "city": 6,
    "zip": 7,
    "device": 8,
    "version": 9,
}

rows = [
    {
        "user": {
            "first": "Sam",
            "last": "Shark",
        },
        "address": {
            "city": "Ocean",
            "zip": 44221,
        },
        "status": True,
    },
    {
        "user": {
            "first": "Alex",
        },
        "address": {
            "zip": 11111,
        },
        "meta": {
            "device": "mobile",
            "version": 12,
        },
    },
]

payload = encode_pi_json(header, rows)
json_text = json.dumps(payload, separators=(",", ":"))  # minified PI-JSON
print(json_text)

Decoder: from PI-JSON back to normal JSON objects

import json
from typing import Any, Dict, List, Tuple

Header = Dict[str, int]


def decode_value(value: Any, index_to_name: Dict[str, str]) -> Any:
    """
    Recursively decode a PI-JSON value using the reverse header mapping.
    Any object key that matches a known index is replaced by its original field name.
    """
    # If it's a dict, translate keys using the reverse map
    if isinstance(value, dict):
        decoded = {}
        for key, val in value.items():
            new_key = index_to_name.get(key, key)  # fall back to original key if not in map
            decoded[new_key] = decode_value(val, index_to_name)
        return decoded

    # If it's a list/array, decode each element
    if isinstance(value, list):
        return [decode_value(item, index_to_name) for item in value]

    # Primitive – leave as-is
    return value


def decode_pi_json(payload: List[Any]) -> Tuple[Header, List[Dict[str, Any]]]:
    """
    Given a PI-JSON payload (first element is header, rest are rows),
    return the header and a list of decoded normal JSON objects.
    """
    if not payload:
        return {}, []

    raw_header = payload[0]
    if not isinstance(raw_header, dict):
        raise ValueError("PI-JSON payload must start with a header object")

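    # Build reverse mapping: index (as string) → field name
    index_to_name = {str(idx): name for name, idx in raw_header.items()}
    # Two names sharing one index would silently overwrite each other, so reject that
    if len(index_to_name) != len(raw_header):
        raise ValueError("PI-JSON header contains duplicate indices")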

    # Decode each row
    decoded_rows: List[Dict[str, Any]] = []
    for row in payload[1:]:
        decoded_row = decode_value(row, index_to_name)
        # Ensure we always return dicts as rows where possible
        if isinstance(decoded_row, dict):
            decoded_rows.append(decoded_row)
        else:
            decoded_rows.append({"value": decoded_row})

    return raw_header, decoded_rows


# Example usage: decode the PI-JSON produced earlier
pi_json_text = json_text  # from the encoder example above
pi_payload = json.loads(pi_json_text)

decoded_header, decoded_rows = decode_pi_json(pi_payload)
print(decoded_header)
print(decoded_rows)

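The encoder and decoder work for deeply nested objects and arrays, and they only change keys that appear in the header. Keys that are missing from the header pass through unchanged, so a literal key that happens to look like an index (for example "0") would be mistranslated by the decoder; the header should therefore cover every key in the data, or such keys should be avoided. With that caveat, PI-JSON can be introduced as a lightweight adaptation layer, while keeping the internal application models in normal, human-readable JSON form.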
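
The helpers above expect a hand-written header. As a possible extension, the header could be collected from the data itself and the result verified with a round trip. The sketch below is self-contained and uses its own compact helpers (collect_header and rename_keys are hypothetical names); the indices it assigns follow the order in which keys are first met, so they differ from the hand-written header used earlier.

```python
from typing import Any, Dict


def collect_header(value: Any, header: Dict[str, int]) -> Dict[str, int]:
    """Walk nested dicts and lists, giving each new key the next free index."""
    if isinstance(value, dict):
        for key, child in value.items():
            header.setdefault(key, len(header))
            collect_header(child, header)
    elif isinstance(value, list):
        for item in value:
            collect_header(item, header)
    return header


def rename_keys(value: Any, mapping: Dict[str, str]) -> Any:
    """Return a copy of value with every dict key looked up in mapping."""
    if isinstance(value, dict):
        return {mapping.get(k, k): rename_keys(v, mapping) for k, v in value.items()}
    if isinstance(value, list):
        return [rename_keys(v, mapping) for v in value]
    return value


rows = [
    {"user": {"first": "Sam", "last": "Shark"},
     "address": {"city": "Ocean", "zip": 44221},
     "status": True},
    {"user": {"first": "Alex"},
     "address": {"zip": 11111},
     "meta": {"device": "mobile", "version": 12}},
]

header = collect_header(rows, {})
to_index = {name: str(index) for name, index in header.items()}
to_name = {index: name for name, index in to_index.items()}

encoded = [header] + rename_keys(rows, to_index)   # PI-JSON payload
decoded = rename_keys(encoded[1:], to_name)        # back to named fields

assert decoded == rows  # the round trip restores the original rows
print(encoded[1])
```

A round-trip assertion like this is a cheap test to keep next to any PI-JSON adapter, since it catches header gaps and index collisions early.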

When PI-JSON Shines (and When It Does Not)

Great use cases

  • Large collections of similar objects (events, logs, analytics).
  • Mobile or Internet of Things (IoT) clients with limited bandwidth.
  • Backends that store or stream huge amounts of JSON.
  • Internal services where you control both producer and consumer.

Trade-offs and limitations

  • Less human-readable: index keys like "0" are not self-describing.
  • You must maintain a small encoder/decoder mapping layer.
  • Some tools that expect “nice” JSON objects may prefer full field names.
  • Header changes (adding/removing fields) must be versioned carefully.
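
On the last point, one mitigating factor is that each payload carries its own header. A consumer that always resolves fields through that header, instead of hard-coding indices, keeps working when fields are reordered or added. The sketch below illustrates this for flat rows; decode_rows is a hypothetical helper and the two payloads are made-up variants of the product example.

```python
import json


def decode_rows(text):
    """Decode flat PI-JSON text using the header found in the payload itself."""
    header, *rows = json.loads(text)
    index_to_name = {str(index): name for name, index in header.items()}
    return [{index_to_name[k]: v for k, v in row.items()} for row in rows]


# Older payload: id is index 0, name is index 1
v1 = '[{"id":0,"name":1},{"0":1,"1":"Laptop"}]'
# Newer payload: fields reordered and a price field added
v2 = '[{"name":0,"id":1,"price":2},{"0":"Laptop","1":1,"2":899.99}]'

old = decode_rows(v1)[0]
new = decode_rows(v2)[0]
print(old["id"], old["name"])  # 1 Laptop
print(new["id"], new["name"])  # 1 Laptop

# A consumer that hard-coded row["0"] as the id would read "Laptop" from v2,
# which is why header changes still need care when indices are used directly.
```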

Reusing the Perfect Indexed Pattern in XML, CSS and Other Structured Formats

The same pattern behind Perfect Indexed JSON can be applied to other formats as well:

  • Extensible Markup Language (XML)
    Use a schema-like section that declares element and attribute names once, then refer to them by index or alias inside the document for more compact representations.
  • Cascading Style Sheets (CSS)
    Define a central map of property names and reuse numeric or short aliases in compact style declarations.
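
To make the XML variant a little more concrete, here is a rough sketch using Python's xml.etree.ElementTree. Everything about the layout is an assumption made for illustration: the wrapper element pi, the names element that stores the alias table as attributes, and the alias scheme t0, t1, and so on (XML names cannot start with a digit, so bare numbers are not usable as element names). Only element names are aliased; attributes are left untouched.

```python
import copy
import xml.etree.ElementTree as ET


def encode_xml(root):
    """Return a new <pi> tree: an alias table followed by the aliased document."""
    body = copy.deepcopy(root)  # leave the caller's tree unchanged
    aliases = {}
    for element in list(body.iter()):
        # First occurrence of a tag gets the next alias: t0, t1, t2, ...
        element.tag = aliases.setdefault(element.tag, f"t{len(aliases)}")
    wrapper = ET.Element("pi")
    names = ET.SubElement(wrapper, "names")
    for name, alias in aliases.items():
        names.set(alias, name)  # stored as alias="original name"
    wrapper.append(body)
    return wrapper


def decode_xml(wrapper):
    """Rebuild the original element names from the alias table."""
    names, body = wrapper[0], copy.deepcopy(wrapper[1])
    alias_to_name = dict(names.attrib)
    for element in list(body.iter()):
        element.tag = alias_to_name[element.tag]
    return body


root = ET.fromstring(
    "<users><user><name>Sam</name></user><user><name>Alex</name></user></users>"
)
encoded = encode_xml(root)
decoded = decode_xml(encoded)
print(ET.tostring(encoded, encoding="unicode"))
assert ET.tostring(decoded) == ET.tostring(root)  # round trip restores the names
```

Unlike PI-JSON rows, an aliased XML document like this no longer matches its original schema or stylesheets, so any consumer would have to expand it first.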

However, the big advantage of PI-JSON is that you get these compression-like benefits without leaving the JSON ecosystem at all.

Conclusion: Smaller JSON Without Breaking JSON

Perfect Indexed JSON (PI-JSON) is a simple idea:

  • Extract keys into a single header.
  • Index them with numbers.
  • Use those indices for all data rows.

The result is:

  • Smaller payloads.
  • Fewer repeated key tokens.
  • No changes to parsers or core libraries.
  • Full compatibility with the JSON ecosystem.

If you are working with large JSON datasets or modern data pipelines and want to reduce size without changing the format, PI-JSON is a neat pattern to consider.