
We're open-sourcing SERAX - a file format made specifically for AI data generation


You think you've known pain?

You haven't known true, soul-shredding agony until you've tried to get an AI to reliably spit out hundreds of millions of JSON files. I can't even tell you how much of my life I have wasted trying to get AI data pipelines working at scale.

These ancient text formats are not a gift, my friends... they are cursed zombies bequeathed to us by the well-meaning but utterly unprepared creators of JSON, YAML, and XML.

The AI, bless its token-predicting heart, generates a quote inside a company name like this:

{"company": "Stark "Ironclad" Industries"}

And the parser just pukes. It errors out, and you're left digging through endless examples trying to figure out how to catch and repair them. Sorry... regex can't save your soul, and it certainly can't save your sanity. There are just too many different edge cases to handle.
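If you want to watch the puke happen yourself, here's the whole experiment with Python's standard-library json module (nothing SERAX-specific going on here):

import json

# The exact record the AI produced above: the unescaped inner quotes end
# the string early, and the parser gives up on what follows.
json.loads('{"company": "Stark "Ironclad" Industries"}')
# raises json.decoder.JSONDecodeError: Expecting ',' delimiter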

YAML is better BUT... all it takes is one freaking colon in something as innocent as a timestamp and now you have extra keys that cause QA checks to fail. 

event_time: 10:15 AM

Let's not even talk about XML... people swear by it, but it falls over dead at the mere sight of a stray unescaped <...

profits < $1M
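Same experiment with Python's standard-library ElementTree, wrapping that phrase in a throwaway <summary> element just for illustration:

import xml.etree.ElementTree as ET

# A raw, unescaped < inside element text kills the parse on the spot.
ET.fromstring("<summary>profits < $1M</summary>")
# raises xml.etree.ElementTree.ParseError: not well-formed (invalid token)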

But then a deeply recessed memory surfaced from the ancient mainframe days... those systems used special, non-printable control characters to manage data streams, keeping them clean and easily parsable because the control characters lived outside the data itself. Pair that with a line-delimited layout and you can split the stream at the line level and parse one record at a time. Funny how often we go back to old-school computing solutions when we hit a brick wall. That's when we started developing the SERAX philosophy.

What if we could resurrect those control-character, line-delimited data formats, but supercharge them for the age of AI? UTF-8 gives us access to Unicode's colossal, mostly untapped armory of over 140,000 characters... We saw that and knew it was the perfect solution: endless AI-proof "control" signals! Yes, I know 140k is not endless, so calm down, Redditors, I'm being hyperbolic... It's called artistic license.

That's when SERAX was born. We needed a text format that wasn't a curse; we needed a polymorphic data discombobulator and an AI QA enabler! We needed something that could shapeshift to handle any data we threw at it, from dense financial reports to the psychological profiles of '80s movie action heroes.

We're open-sourcing SERAX because everyone deserves to escape the tyranny of these ancient text file formats. This is a file format designed for the AI world!

1. Dead Simple Structure

SERAX records look like this:

[RecordTypeSymbol][Field1Symbol][Data1][Field2Symbol][Data2]...⏹

Here's a real example:

⟐⊶Stark "Ironclad" Industries⊷1200000⏹

What that means:

⟐ = "Company" record type

⊶ = text field containing "Stark "Ironclad" Industries"

⊷ = number field containing 1200000

⏹ = end of record

More examples:

⧊ for "Academic Papers"
Whatever record types you need..

Done... no nested brackets for the AI to lose track of, no whitespace indentation. Each line is a record: split the lines, check for the control characters, and parse with confidence, you magnificent beast!
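To make that concrete, here's a minimal parsing sketch in Python. The symbol tables below are illustrative stand-ins for whatever schema you generate for your own data, not an official SERAX mapping; the example parsers in the repo are the real reference.

import re

# Illustrative symbol tables -- in practice these come from the schema
# you generate for your own dataset.
RECORD_TYPES = {"⟐": "Company"}
FIELD_TYPES = {"⊶": ("name", str), "⊷": ("revenue", int)}
TERMINATOR = "⏹"

# Split on field symbols while keeping them:
# "⊶Stark⊷12" -> ["", "⊶", "Stark", "⊷", "12"]
FIELD_SPLIT = re.compile("([" + "".join(re.escape(s) for s in FIELD_TYPES) + "])")

def parse_line(line: str) -> dict:
    line = line.strip()
    if not line.endswith(TERMINATOR):
        raise ValueError("record is missing its terminator")
    body = line[: -len(TERMINATOR)]
    record = {"_type": RECORD_TYPES.get(body[0], "Unknown")}
    parts = FIELD_SPLIT.split(body[1:])
    for symbol, raw in zip(parts[1::2], parts[2::2]):
        field_name, caster = FIELD_TYPES[symbol]
        record[field_name] = caster(raw)
    return record

# The embedded quotes that broke JSON are just ordinary characters here.
print(parse_line('⟐⊶Stark "Ironclad" Industries⊷1200000⏹'))
# {'_type': 'Company', 'name': 'Stark "Ironclad" Industries', 'revenue': 1200000}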

2. Smart Symbols

This is the core magic: those SERAX symbols aren't just dumb delimiters like commas or brackets... they have semantic meaning.

⊷ doesn't just say "data here"... it tells your system "the data after ⊷ MUST be a number!"

So if your AI tries to put "Peanut Butter" in a revenue field, you know something's wrong.

We can catch AI hallucinations automatically without needing another expensive LLM pass just to check if the first LLM was having a bad day. 
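Here's a rough sketch of that check, reusing the same illustrative field table as the parser above (an assumption about the wiring, not the canonical implementation):

# A field symbol carries a type contract; violating it gets flagged instead
# of silently passing hallucinated data downstream.
def check_field(symbol: str, raw: str, field_types: dict) -> list[str]:
    name, caster = field_types[symbol]
    try:
        caster(raw)
        return []
    except ValueError:
        return [f"field '{name}': expected {caster.__name__}, got {raw!r}"]

FIELD_TYPES = {"⊶": ("name", str), "⊷": ("revenue", int)}
print(check_field("⊷", "1200000", FIELD_TYPES))        # [] -- clean
print(check_field("⊷", "Peanut Butter", FIELD_TYPES))  # ["field 'revenue': expected int, got 'Peanut Butter'"]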

3. Symbols That Shapeshift

Using "Prompt-Driven Semantics," ⊷ isn't stuck being 'Revenue' forever..

Examples:

● Dataset 1: ⊷ means 'Revenue'

● Dataset 2: ⊷ means 'CustomerSentimentScore'

● Dataset 3: ⊷ means 'NumberOfExistentialCrisesToday'

SERAX morphs to your data needs.
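In code, shapeshifting just means rebinding the symbol table per dataset. A tiny sketch with made-up schema dictionaries:

# The same symbol, bound to different meanings by different dataset schemas.
FINANCIAL_SCHEMA = {"⊷": ("revenue", int)}
SUPPORT_SCHEMA = {"⊷": ("customer_sentiment_score", float)}

def bind_field(symbol: str, raw: str, schema: dict):
    name, caster = schema[symbol]
    return name, caster(raw)

print(bind_field("⊷", "1200000", FINANCIAL_SCHEMA))  # ('revenue', 1200000)
print(bind_field("⊷", "0.87", SUPPORT_SCHEMA))       # ('customer_sentiment_score', 0.87)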

4. More Efficient

Because SERAX uses these information-dense symbols, it often uses fewer tokens than those verbose legacy formats.

Benefits:

● Less processing time

● Lower costs

● Fewer chances for the AI to take a wrong turn mid-output.
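A crude way to see the difference is to count characters on a single record. This is only a rough proxy; actual token savings depend on your tokenizer and your schema:

# Character counts as a rough stand-in for tokens.
json_record = '{"company": "Stark Industries", "revenue": 1200000}'
serax_record = '⟐⊶Stark Industries⊷1200000⏹'
print(len(json_record), len(serax_record))  # 51 27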

5. Graceful Degradation

When things go wrong, you don't lose everything.

If one field gets corrupted:

● We don't lose the whole record

● We salvage the good parts

● We flag the bad ones
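Roughly, the salvage step looks like this (another sketch built on the illustrative field table from earlier, not the repo's actual parsing code):

# Keep every field that parses cleanly; flag the ones that don't.
def salvage(fields, field_types):
    record, problems = {}, []
    for symbol, raw in fields:
        if symbol not in field_types:
            problems.append(f"unknown field symbol {symbol!r}")
            continue
        name, caster = field_types[symbol]
        try:
            record[name] = caster(raw)
        except ValueError:
            problems.append(f"field '{name}': could not parse {raw!r}")
    return record, problems

FIELD_TYPES = {"⊶": ("name", str), "⊷": ("revenue", int)}
good, flagged = salvage([("⊶", "Stark Industries"), ("⊷", "Peanut Butter")], FIELD_TYPES)
print(good)     # {'name': 'Stark Industries'} -- the good part survives
print(flagged)  # ["field 'revenue': could not parse 'Peanut Butter'"]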

The result? That 3% daily horror (sometimes 60%!) nosedived to 0.3%, and what was left was mostly minor, flaggable stuff. The curse was broken.

THE SERAX PHILOSOPHY

Philosophy? Maybe you caught that word earlier. Yeah, we get that it's weird to call a file format a philosophy, but what do you call a polymorphic file format that an AI creates for your specific data extraction task? Forgive the buzzword bingo... is it a file format, or a philosophy to follow for endless file specifications?

Hopefully this has gotten you interested enough to spend a few minutes looking through the SERAX repo. In it you'll find:

● A whitepaper explaining the specification philosophy.

● A prompt that generates the schema prompt for you, so you can inject it into your own prompts.

● Example parsing code covering a few different data types.

We hope you'll give it a try and tell us how dumb/brilliant/out of our minds we are on your favorite social media platform.

We hope you'll check out SERAX, and we'd love to hear whether it's useful for your AI data pipelines.
https://github.com/vantige-ai/serax