Problems


      We aim to solve a few problems:

      People Need to Understand the Hypermedia Protocol

        For the protocol to gain adoption, developers need to understand the data that gets signed and distributed. We also need to understand and document the RPCs used to communicate between p2p nodes, and the RPCs a single node may expose to its consumers (GUIs, APIs, command lines, agentic access).

        I suggest a unified system to describe both the verifiable signed data and the APIs, because they share overlapping schemas. For example, the "block" type, which describes chunks of content such as paragraphs and images in a document, is useful both for describing node RPCs and for describing the format of the verifiable signed data.
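        As a rough sketch of the idea (all field names and the schema shape here are hypothetical, not a proposal), the same block description could be checked against a block inside a signed document and against the same block arriving as an RPC payload:

```python
# Hypothetical shared "block" schema: usable both for signed documents
# and for RPC payloads that carry the same content chunks.
BLOCK_SCHEMA = {
    "type": "object",
    "fields": {
        "id": "string",
        "type": "string",   # e.g. "paragraph", "image"
        "text": "string",
    },
}

def conforms(value: dict, schema: dict) -> bool:
    """Minimal structural check: every declared field is present with the right type."""
    if schema["type"] != "object" or not isinstance(value, dict):
        return False
    kinds = {"string": str}
    return all(
        name in value and isinstance(value[name], kinds[kind])
        for name, kind in schema["fields"].items()
    )

# The same schema validates a block inside a signed document...
signed_block = {"id": "b1", "type": "paragraph", "text": "Hello"}
# ...and the identical block arriving as an RPC argument.
rpc_payload = {"id": "b1", "type": "paragraph", "text": "Hello"}
assert conforms(signed_block, BLOCK_SCHEMA)
assert conforms(rpc_payload, BLOCK_SCHEMA)
```

        The point is only that one schema definition serves two consumers; a real system would use one of the schema languages listed under Solution rather than this hand-rolled check.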

      The Hypermedia Protocol Must be Extensible

        There are many use cases which the core team cannot support, or does not want to impose on the community. For example, if someone in the community wants to experiment with a new form of media like interactive video, they should be able to write software that interacts with the network safely, without breaking other software.

        On the core team, we may also want to develop our own experiments and internal tools, while keeping the core documents+comments functionality of the protocol stable.

      Programmatic Access to Data and RPCs

        With a formal schema, we can deliver some useful tools. For example, we could build a GUI that allows us to safely experiment with the raw RPCs and data.

        For example, in the signed verifiable data, an IPFS file URL may be encoded as a string (or a raw IPLD CID). When we expose that to the user, we can offer a file picker, and if the schema implies the value should be an image, we can display a preview and make sure the user does not select a zip file.
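        A sketch of how a schema annotation could drive the UI (the "media" annotation and widget names are invented for illustration):

```python
# Hypothetical field schema: a CID encoded as a string, annotated as an image.
FIELD_SCHEMA = {"type": "string", "media": "image"}

def widget_for(field_schema: dict) -> str:
    """Choose a UI widget from the schema annotation (names are illustrative)."""
    if field_schema.get("media") == "image":
        return "image-picker"
    return "text-input"

def accepts(field_schema: dict, mime_type: str) -> bool:
    """Reject files whose MIME type does not match the schema's media hint."""
    if field_schema.get("media") == "image":
        return mime_type.startswith("image/")
    return True

assert widget_for(FIELD_SCHEMA) == "image-picker"
assert accepts(FIELD_SCHEMA, "image/png")
assert not accepts(FIELD_SCHEMA, "application/zip")
```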

        Another useful application of a robust schema system is automatically creating type-safe SDKs in many languages, lowering the barrier to entry for developers in different ecosystems to participate in the same data universe.

        Our APIs may also become more robust with a schema system. When a user sends data to an API, we can validate it against the schema and return detailed errors explaining what was incorrect.
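        For instance, a schema-aware endpoint could collect every mismatch and report them all, rather than rejecting the request with an opaque error (the comment shape below is hypothetical):

```python
def validate(value: dict, fields: dict) -> list:
    """Return human-readable errors instead of a bare rejection."""
    errors = []
    for name, kind in fields.items():
        if name not in value:
            errors.append(f"missing required field '{name}'")
        elif not isinstance(value[name], kind):
            errors.append(
                f"field '{name}' must be {kind.__name__}, got {type(value[name]).__name__}"
            )
    return errors

COMMENT_FIELDS = {"author": str, "body": str}  # hypothetical comment shape

assert validate({"author": "alice", "body": "hi"}, COMMENT_FIELDS) == []
assert validate({"author": "alice", "body": 42}, COMMENT_FIELDS) == [
    "field 'body' must be str, got int"
]
```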

      Protocol Robustness

        Currently, "verification" of the verifiable data only checks the signatures and the permissions/capabilities of the writer, to ensure they are allowed to edit the resource. It does not validate the structural integrity of the resource itself. For example, if a user uploads arbitrary data inside the content of a comment, the system will accept that comment even if it is nonsensical, and the UI will break as a result.
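        Conceptually, the fix is to add one more stage to the verification pipeline. The sketch below uses stand-in checks (the real signature and capability logic is elided, and the comment shape is hypothetical); the point is where the structural check slots in:

```python
def signature_valid(resource: dict) -> bool:
    return resource.get("sig") == "valid"       # stand-in for real signature verification

def writer_allowed(resource: dict) -> bool:
    return resource.get("writer") == "alice"    # stand-in for the capability check

def matches_schema(resource: dict) -> bool:
    # Proposed structural check: a comment's content must be a list of blocks
    # with string text (shape is hypothetical).
    content = resource.get("content")
    return isinstance(content, list) and all(
        isinstance(b, dict) and isinstance(b.get("text"), str) for b in content
    )

def verify(resource: dict) -> bool:
    # Today's pipeline stops after the first two checks; adding the structural
    # check rejects well-signed but nonsensical resources before they reach UIs.
    return signature_valid(resource) and writer_allowed(resource) and matches_schema(resource)

good = {"sig": "valid", "writer": "alice", "content": [{"text": "hi"}]}
bad = {"sig": "valid", "writer": "alice", "content": "garbage"}
assert verify(good)
assert not verify(bad)
```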

    Design Constraints

      Support Existing Data

        We have existing data in our system (DAG-CBOR encoded) which has already been signed and referenced in chains of data. This is effectively impossible to change, so we are locked into certain decisions. The solution is to add a schema system on top of the existing data: new data may reference a schema, and old data will fall back to a schema that matches our current data formats.
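        The fallback could be as simple as a lookup that defaults to a legacy schema when a record carries no reference (the "$schema" field name and registry shape here are assumptions, not decided):

```python
LEGACY_SCHEMA = {"name": "legacy-v0"}               # matches today's data formats (hypothetical)
REGISTRY = {"comment/v1": {"name": "comment/v1"}}   # hypothetical schema registry

def schema_for(record: dict) -> dict:
    """New records carry a schema reference; old records fall back to the legacy schema."""
    ref = record.get("$schema")   # field name is an assumption
    return REGISTRY.get(ref, LEGACY_SCHEMA)

assert schema_for({"$schema": "comment/v1"})["name"] == "comment/v1"
assert schema_for({"body": "old record"})["name"] == "legacy-v0"
```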

      Evolution, Partial Compatibility

        There must be a system by which the data formats can safely evolve without breaking older software, or software that only partially supports a schema.
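        One familiar ingredient of partial compatibility is that older software must tolerate and preserve fields it does not understand, rather than dropping them. A sketch, with an invented "reactions" field standing in for a newer extension:

```python
KNOWN_FIELDS = {"id", "text"}   # fields this (older) software understands

def process(block: dict) -> dict:
    """Edit only known fields; carry unknown fields through untouched so newer
    software reading the result loses nothing."""
    updated = dict(block)                         # preserve everything, including unknowns
    updated["text"] = block.get("text", "").strip()
    return updated

newer_block = {"id": "b1", "text": " hi ", "reactions": ["+1"]}   # "reactions" is newer
out = process(newer_block)
assert out["text"] == "hi"
assert out["reactions"] == ["+1"]   # unknown field survived the round trip
```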

      Extensibility

        Ideally, the community can openly build upon schemas from others in the community. For example, one member of the community might develop a "livestreaming" resource, and somebody else could create a more detailed schema which extends it, augmenting it with additional features. The augmented livestreams must still be usable by the original livestream application. We should not rely on the core team to unlock safe collaboration on new data types in the protocol.
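        The livestream example could look roughly like this (the "extends" mechanism and all field names are invented to illustrate the requirement, not a design):

```python
LIVESTREAM = {"fields": {"url": str, "title": str}}            # community-defined base schema
LIVESTREAM_EXT = {
    "extends": LIVESTREAM,
    "fields": {"chat_url": str},                               # someone else's extension
}

def all_fields(schema: dict) -> dict:
    """Flatten a schema chain: base fields plus each extension's additions."""
    base = all_fields(schema["extends"]) if "extends" in schema else {}
    return {**base, **schema["fields"]}

def usable_by_base_app(record: dict) -> bool:
    """The original livestream app only needs the base fields to be present."""
    return all(isinstance(record.get(n), t) for n, t in LIVESTREAM["fields"].items())

augmented = {"url": "https://example.net/live", "title": "Demo", "chat_url": "https://example.net/chat"}
assert usable_by_base_app(augmented)   # extended records still work in the original app
assert set(all_fields(LIVESTREAM_EXT)) == {"url", "title", "chat_url"}
```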

    Solution

      Largely undecided! Some things to explore:

        RDF and RDF Schema

        XRPC Lexicons

        JSON Schema

        GraphQL

        Protocol Buffers

        IPLD Schemas