Circling back to Joylog

August 19, 2016

Joylog is a logic language concept inspired by the concatenative/stack-based language Joy. I’m thinking about it again now because I think it matches the semantics of term-expressions.

The basic concepts are:

  • It’s a stack language but prefix rather than postfix based
  • |X  at the start of an expression Y, where X is a single word, is looked up in the current root environment.
  • If X is not defined, then the term-expression is left as is – it is not an error, it is not an undefined symbol, it is simply a non-reducible term which identifies with and reduces to only itself. There may be many such non-reducible terms which act as structured data types (or somewhat Prolog-like, but context-dependent, terms).
  • If X is a list, at the moment the behaviour is just assumed to be ‘exact match lookup as for a word’. This may change.
  • If X is defined, ie contains child items, the definition D of X (the expression to the right of that one word starting from the environment root) is prepended to the remaining expression R under the following rules:
    • if D is an |all expression, then some form of parallel lazy evaluation takes place, with each element of the |all expanded in its own stack space. The result is a |all expression
    • if R is an |all expression, then each element of D is prepended to each element of R
    • each resulting branch is recursively evaluated, with |X terms expanded, and so on. However some lazy-evaluation mechanism is used to prevent infinite loops, and while each line must be expanded to its end to produce output, only as much output as required by the consuming operation (to the left) is produced
  • If X is a built-in primitive operation O, then operation O is applied to the remainder R to produce a result R2, and R2, or as much of it as required by a consuming operation, is passed back. This may mean, eg, evaluating only the first term in an |all expression, and even that only partially.
  • Otherwise, if the first element of expression Y is not |, then Y is just (Y1 . Y2), Y1 is returned to the left, and Y2 is evaluated as for Y
  • The semantics of ‘returning to the left’, and management of |all expressions, may require a heap rather than a stack for execution, as operations need to ‘return one value but keep going’, a little like coroutines


The hope is I can have a root expression such as

|all
(parent |all (a b) (b c))
(ancestor |transitive-closure |parent)
(transitive-closure |all () (|compose-self))

where compose-self somehow does a relational self-composition operation on the remainder R (perhaps implemented by wrapping R as a subexpression, duplicating it, applying a two-place compose operator, etc).
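To make the expansion rule above a little more concrete, here is a minimal sketch in Python of just the ‘prepend the definition to the remainder’ step. It’s a toy model under my own assumptions – expressions are Python lists, a word prefixed with | is a lookup in a dict environment, and |all, laziness and built-ins are ignored entirely:

def expand(expr, env):
    # Expand leading |words until the head is no longer a defined word.
    while expr and isinstance(expr[0], str) and expr[0].startswith("|"):
        word = expr[0][1:]
        if word not in env:
            return expr                       # undefined: an irreducible term, not an error
        definition = env[word]                # definition D of X
        expr = list(definition) + expr[1:]    # prepend D to the remainder R
    return expr

env = {"parent": ["|all", ["a", "b"], ["b", "c"]]}
print(expand(["|parent", "x"], env))
# -> ['|all', ['a', 'b'], ['b', 'c'], 'x']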

There are several major differences to Joy semantics, I think, that complicate evaluation:

  • undefined operations could cause problems if used as data types and then are defined later. Perhaps we need a null |data operation to explicitly mark these as data (with or without type/schema markup)
  • prefix semantics is required for lazy evaluation, so we can’t just act directly on a stack
  • lazy evaluation is required because operations can generate recursive and infinite sets/relations. We don’t want to do more work than we need.
  • multi-branch evaluation is required – this makes operations a little like co-routines, and means the stack isn’t quite a stack
  • some kind of ‘get’ operation is required to retrieve the definition of a word as a wrapped expression, without evaluating it
  • some kind of partial evaluation operation for evaluating only one part of a wrapped expression
  • some kind of ‘match’ operation which evaluates an expression only for the purpose of confirming a match
  • operations which manipulate the environment itself, and create temporary sub-environments which can be returned as data to other operations


Sets and Conjunctions

August 14, 2016

A logical conjunction (logical AND) seems a fairly simple thing. But is it? There are, oddly enough, some rough edges even in something as fundamental as this.

Take an example, A AND B, where A and B are two logical assertions written in roughly English form: (cat is black) and (dog is hungry).

We could write this as a Prolog term: and(black(cat),hungry(dog))

We could rewrite this, using the same Prolog conventions, using term-expressions:

(|and (|black cat) (|hungry dog)). It seems fairly obvious that this encodes the same information.

However, term-expressions allow us to use the S-expression dotted form, which Prolog terms don’t allow. This means that, if we know the arity of our term (eg |and always takes two arguments), we don’t need the second pair of brackets. This can make it a bit easier to read:

(|and (|black cat) |hungry dog)

Also, one of the points of term-expressions (as well as smoothing out the awkward Prolog difference between list and term syntax, and being a bit more S-expression compliant), is to give us the tools to allow rewriting logical expressions in any order we’d like – such as an object- or at least subject-oriented form. So we could rewrite this  as:

(|and (cat black) dog hungry)

|and is now a global operator, but (cat black) and (dog hungry) are just lists – we know from context that they’re logical assertions, but being able to freely combine term and list notation means we can now have fragments of logical assertions. This could allow us to describe the semantics of objects, where ‘black’ and ‘hungry’ might mean different things for different objects.

So far so good. Let’s try adding a third term. English-like notation:

(cat is black) and (cat is white) and (dog is hungry)

rewrites in Prolog – assuming right-bunching ‘and’ – as

and(black(cat),and(white(cat),hungry(dog)))

In term-expressions of the first form this becomes

(|and (|black cat) (|and (|white cat) (|hungry dog)))

In term expressions of the second form we can see that losing the unnecessary parentheses helps a bit for readability:

(|and (|black cat) |and (|white cat) |hungry dog)

and in the third, context sensitive form, we can see that we can cluster statements related to ‘cat’ together:

(|and (cat |and (black) white) dog hungry)

which is a bit hard to parse as English, but you can hopefully see the idea: because the | symbol lets us ‘break out’ of the context at any point, we can put |and anywhere in the tree, creating an object-like collection of statements.

You can maybe now see that |all, which we’ve used before, can be built out of a series of |and operators. |and would actually be the more ‘correct’ form for a machine to process, but |all is easier for a human to read. In |all form we’d get:

(|all (cat |all (black) (white)) (dog hungry))

which, since |all is the same operator as our old ‘.’ from T-expressions, would be

(. (cat . (black) (white)) (dog hungry))

and you could see how we might rewrite that in  JSON syntax as:

{"cat": {"black": {}, "white": {}}, "dog": {"hungry": {}}}

That JSON syntax for the simplest case like this is still really gnarly, which is why I went on such a long path looking for saner, and self-consistent, alternatives.

|and, then, lets us create something like a set. But is a conjunction really a set of the statements it combines? That is, does it act like our usual idea of a set? Especially, does it contain the notion of set boundary? That’s where it gets tricky.

A AND B AND C can break down as A AND (B AND C). Or as (A AND B) AND C. The usual definition of AND doesn’t care about order. It also, generally, doesn’t care about nesting. Eg: A AND B AND C AND D is the same as (A AND B) AND (C AND D) or (A AND (B AND (C AND D))). Or even (A AND (B AND (C AND (D AND TRUE)))) – where TRUE, the vacuously true statement, is the identity element of AND, and plays the role you might expect of ‘the empty set of statements’. All of these, because AND is associative and TRUE is its identity, are identical.

Have you noticed the problem yet?

The problem is that (A AND B) sorta-kinda looks like it might be a set. It looks like one – it’s a thing that contains two statements. But so is (C AND D). If we then have (A AND B) AND (C AND D), do we have one set? Or two sets inside a third? For that matter, is A a set with one element, or an atomic element that isn’t a set at all? Where do the set boundaries go?

We could, I guess, arbitrarily decide that ‘one statement is a set with one element’, since we have TRUE (the empty conjunction) to play the part of the empty set, but then how would we represent ‘a set with one element, which is itself a set with one element’? In the usual laws of evaluating logical conjunctions, ((A)) is the same as (A AND A) is the same as A. The parens, in evaluation, disappear. Set boundaries, however, very much do not disappear. We’ve hit a roadblock. A subtle one, but a very important one.
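A two-line experiment makes the mismatch concrete. This is just a toy illustration in Python, treating statements as booleans and sets as frozensets:

A = True
print((A and A) == A)            # True: conjunction collapses, the parens evaporate

empty = frozenset()
nested = frozenset({empty})      # an empty set inside an empty set
print(nested == empty)           # False: the set boundary does not disappear
print(frozenset({A}) == A)       # False: a one-element set is not its element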

In Prolog and Kanren – and in Scheme using the AMB operator – a predicate will return a list of answers, which is a little like a set except that it could contain duplicates. A feature (or misfeature, perhaps) of Prolog’s resolution strategy is that this set/list is generated in a predictable sequence (which means it’s not trivial to parallelise – see the failure of the Japanese Fifth Generation Computer Systems project). This is almost, but probably not quite, the same as the concept of set comprehension in mathematics, where we generate a set from a formula via a recursive process of adding elements in order. (And this recursive process sometimes doesn’t terminate, if you try to put a set inside itself, etc.)

Do we, perhaps, need a set element-addition operator which is not the same as set union – and/or a logical AND-like operator which has subtly different semantics from AND? For example, perhaps with the rule that the first argument must be an element while the second argument can only be a set? So A AND B is not allowed, but A AND B AND TRUE would be allowed? And (A AND TRUE) generates a one-element set wrapping A?

This seems to me to be a problem that isn’t often explored, though it must surely be discussed in the literature somewhere. It seems taken for granted that logical conjunctions can be expressed as set operations – yet how, using only the mechanisms of logical conjunction, do you generate any object resembling a set boundary? Put an empty set – in the way in which all of mathematics is constructed – inside another empty set? In logic, that would be claiming (TRUE AND TRUE) is not the same thing as TRUE, which is an invalid statement. How do you say ‘create a new empty logical container’ in a reasonably terse and natural way in logic?

An actual logician would then, I guess, ask ‘but which logic, grasshopper?’, because there’s not just one; and I know set theory is strictly more powerful than, say, first-order logic, while being equal in power to combinatory logic – but surely even first-order logic needs a mechanism equivalent to nesting empty sets?

This problem can be worked around, if not entirely solved, in term-expressions – at the moment, the (|all) term looks like a set, but isn’t a set because if you nest it, the boundary disappears. There are two possible workarounds:

  1. Wrap that quasi-set term (|all) or (|and) inside an ordinary list – so ((|all A B)) or ((|and A . B)). The outer list doesn’t evaporate. Since lists are logical statements, or fragments of them, this is pretty much the same as wrapping that (|all) in a logical statement – which seems like the natural way of creating a setlike entity in logic: put it inside a statement. (See the sketch after this list.)
  2. Assume that a (|all) or (|and) term on the left-hand side is an atomic, opaque set-like entity and don’t automatically deconstruct it. This would include having to construct one-element sets as (|and A |null) or something similar.
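Here is a rough sketch of the first workaround, under my own modelling assumptions: term-expressions are Python tuples, and a normaliser splices nested (|all) terms into their parent |all but leaves ordinary lists alone. Wrapping the |all in a list is what stops its boundary from evaporating:

def flatten_all(term):
    # Splice nested ('|all', ...) terms into their parent |all.
    if not (isinstance(term, tuple) and term and term[0] == '|all'):
        return term
    out = ['|all']
    for item in term[1:]:
        item = flatten_all(item)
        if isinstance(item, tuple) and item and item[0] == '|all':
            out.extend(item[1:])      # bare |all: its boundary evaporates
        else:
            out.append(item)          # ordinary lists are left intact
    return tuple(out)

inner = ('|all', 'A', 'B')
print(flatten_all(('|all', inner, 'C')))     # ('|all', 'A', 'B', 'C')
print(flatten_all(('|all', (inner,), 'C')))  # ('|all', (('|all', 'A', 'B'),), 'C')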

If you take the first approach – restricting term-expressions to proper lists, and to (|all) terms that contain proper lists – you get something much like (ragged) n-ary relations or tables. But much of what I find interesting and useful about term-expressions comes from the fact that they are structured improper lists. And it feels somehow ‘wrong’ to have to use a pair when you want a set; they’re two very different, otherwise well-defined concepts. But perhaps that’s how it has to work.

It’s just odd, though. You’d think, from all we know (or have been taught) about boolean operations being interchangeable with set operations, that a logical conjunction would, in fact, be exactly interchangeable with a set. But nope. Only if you wrap it in something first.

Provenance

August 12, 2016

If we were to visualise an Internet-spanning, functional-reactive network as something like a giant distributed spreadsheet, we might decide that the standard data unit would be something like a spreadsheet cell.

A cell is an entity with two faces: it has an inside and an outside. The inside is an expression; the outside is a value. It stores both, it attaches itself to the cells that its expression most immediately evaluates to (creating them if they don’t exist), and it listens for changes to the value (not expression) of these cells. It also stores links to the cells which reference it. When any of the immediate source cells in its expression update, it recomputes its value, then notifies all the cells that reference it so they can update themselves.
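As a sanity check on that description, here is a very small sketch in Python of such a two-faced cell – expression inside, cached value outside, with change notification to dependents. All the names here are my own; it ignores cycle detection, scheduling and everything else that makes real FRP hard:

class Cell:
    def __init__(self, expr, sources=()):
        self.expr = expr                    # the 'inside': how to compute the value
        self.sources = list(sources)        # cells this expression reads
        self.dependents = []                # cells that reference this one
        for s in self.sources:
            s.dependents.append(self)
        self.value = expr()                 # the 'outside': the cached value

    def recompute(self):
        new = self.expr()
        if new != self.value:               # only propagate real value changes
            self.value = new
            for d in self.dependents:
                d.recompute()

a = Cell(lambda: 1)
b = Cell(lambda: 2)
total = Cell(lambda: a.value + b.value, sources=(a, b))
a.expr = lambda: 10
a.recompute()
print(total.value)                          # 12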

At least that’s the theory. The Functional Reactive folks are doing a lot of thinking about and prototyping of how this would all work in practice; there are a few quirks still, I think, in terms of what happens when cells (or signals, etc) go in and out of scope inside things like loops and conditions. How much do you cache, how do you optimise, how do you reestablish connections, etc, etc. This is all very important and non-trivial stuff to sort out, on top of more fundamental questions like what calculus do you build on, what core syntax do you implement in, etc.

However an idea about the cell concept makes me think in a slightly different direction: what would happen if we took both a cell’s (or function evaluation site’s) expression and value, and made them available to downstream?

Or even in a non-FRP context : what if we had a data type, let’s call it a proof, which contains both a value and a provenance: the outermost expression (function call) which computed it?

At first glance, it seems like a proof would look and act much like a promise in lazy functional programming (a reference to an expression yet to be evaluated), but wouldn’t turn into its value when evaluated. You’d want some automatic way of generally looking through the proof into its value, as every value would be a proof – but that seems simple enough. A reserved system function get_provenance would be able to access the provenance of a proof.

Why would this be useful? Well, for example, for runtime typing and security. If you get a value coming in from anywhere in the Internet, it might be useful to know what function gave it to you – what it represents, not just what it is. You might even be able to implement runtime type assertions as proofs: if you got this from the ‘int’ function, then it’s an int (assuming, that is, that ‘int’ really does generate guaranteed ints). If you got it from the ‘SQL_safe_string’ function, then it really is a SQL-safe string.

(Not entirely how type systems work, but something in this vicinity. Possibly not a good idea to put the entire call stack in the provenance; but, eg, the innermost function might be the ‘type’, while the outermost function might be… something we don’t otherwise have a name for.)
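A rough sketch of what such a proof might look like, with hypothetical names (Proof, get_provenance, sql_safe_string) standing in for whatever a real system would use – and, of course, with no attempt at unforgeability:

from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Proof:
    value: Any            # the outside: what it is
    provenance: tuple     # the inside: the expression that produced it

def get_provenance(p):
    return p.provenance

def sql_safe_string(s):
    cleaned = s.replace("'", "''")                  # toy 'sanitising' step
    return Proof(cleaned, ('sql_safe_string', s))

p = sql_safe_string("O'Brien")
print(p.value)                # O''Brien
print(get_provenance(p))      # ('sql_safe_string', "O'Brien")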

If you then think about extending functional programming slightly to ‘functions’ which are mostly large gobs of structured literal data with smaller functions embedded in them… then perhaps you can see how proofs might be able to get us something like an audit trail, or Wikipedia’s citations, directly in the code. For any given chunk of data, it starts to become really important, for assigning chains of trust, to know not just what it is, and what type it is, but where it came from. A provenance should give us all the information we require to reconstruct the actual value.

A proof would have to somehow be unforgeable by any user code, which is probably where it gets tricky; this might break clean serialisation/deserialisation. But it seems like an interesting possibility. Values which carry their own source code with them at all times?

It might perhaps give some tools for solving the identity puzzle: values could now carry names with them. We could compare with or without requiring the provenance be equal.

Functions, Relations, Types and Identities

August 12, 2016

Teasing out a further idea: it seems to me that the way we think about identity differs between functions and types (or functions and objects, or functions and Prolog/Kanren relations).

One of the major breakthroughs of functional programming languages, going back to Lisp and before it to the lambda calculus, is the understanding that the runtime creation of anonymous functions is both possible and useful. This was not at all obvious before Lisp; it’s been very slow even to filter through to modern post-OOP programming languages, though most of them now have some kind of lambda/closure/code-block construct. We all now accept that, yes, anonymous functions are a thing.

If you have two anonymous functions each created by a separate (lambda) – are they theoretically equal? Do they, in practice, compare equal? Do the results of applying those functions compare equal?

To the second: yes, the result of applying two separately defined anonymous functions with equivalent body expressions is guaranteed to be equal (or at least equivalent, in a Lisp/Scheme, if consing up list structure is involved – I told you identity is tricky). But you can test it with a predicate and say: yep, this value is the same as that.

To the first, though: I suspect in practice this is undefined and left to each individual language – much like whether two Lisp/Scheme cons calls give you ‘the same’ object depends on whether you use eq or eqv, etc. It is probably not safe to depend on two separately defined anonymous lambdas being either the same object or different objects in any actually existing Lisplike language, or anything else with anonymous functions/closures – because either the system might give you a different object, or a compiler might notice they’re the same and optimise the difference away.
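Python, for what it’s worth, behaves exactly this way: two separately written but textually identical lambdas produce equal results, yet compare as distinct objects:

f = lambda x: x + 1
g = lambda x: x + 1

print(f(41) == g(41))   # True: their results compare equal
print(f == g)           # False: as objects they are distinct
print(f is g)           # False: two separate identities at runtime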

The underlying Lambda Calculus though gives us a fairly sensible answer: if two lambdas have the same body, up to variable renaming, then yes, they are equal (in a much stronger sense than merely producing the same output for the same input).

Now imagine that we had, by extension, such a thing as a runtime-creatable anonymous type. Ignoring for the moment that we don’t have an equivalent of lambda calculus describing such things. Also ignoring that in practice, yes, we do have plenty of such things but they’re left implementation-dependent, usually using language-specific dirty tricks like reflection that make type theorists blush. Assume that you could have, as you can have a function returning a function, a function returning a type.

If we had a statement like (type (x) blah) roughly equivalent to (lambda (x) blah), and this gave us a new anonymous type parametrised by its arguments…

… would two of these types compare equal? SHOULD they compare equal, in a perfectly implemented system?

Or is the uniqueness of a type an intrinsic part of our information about it?

If a type should be unique, is this uniqueness a runtime (dynamic) construct, or a lexical construct? Should we get a new type every time we run this function, or should we get the same type from this function on each run – but a different type if we were to run the same statement in a different function?

Finally, would the result of applying or wrapping an anonymous type to an untyped data structure make that data structure unique, or non-unique?

Eg: Suppose we had a type ‘temperature-centigrade’ which, internally, was a floating point number. It seems sensible that, when making type comparisons, we should not accept just any number, or a ‘temperature-fahrenheit’, as a ‘temperature-centigrade’ – because a reason for having such a specific type would be to make sure we’re comparing the right kind of number – though we would also want to allow conversion of a numeric type to and from the more specific type, because sometimes you deliberately don’t want to know about the units being used.

However, taking that further: in the same kind of application, you might want to generalise this case to a ‘scientific unit’, being perhaps a triplet of a ‘unit’, an ‘SI prefix’ and a ‘number’; the ‘unit’ and ‘prefix’ might be subtypes of ‘string’. You might want to allow the user to enter a new string at runtime which can be a new unit or prefix (because how else are you going to extend it when the SI adds new units or prefixes?) Then, having entered and parsed that string, you’re going to want to create a new type. It will probably need to be anonymous, because it won’t exist until runtime; even worse, though, you will probably want to assign it a name, and you will want that name itself to be computed. This sort of thing is easy to do in some scripting languages; it’s very hard to do in ‘real’ languages that insist on ‘type discipline’. Such discipline in fact prevents the use of types where they would be very helpful (ie, at runtime, to manage and make assertions about runtime-generated data – remembering that on the Internet, it’s always runtime somewhere).
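For what it’s worth, ‘scripting’ languages do let you do this. A quick Python sketch of a runtime-created type whose name is itself computed – and note that two separately created types with identical contents do not compare equal:

def make_unit_type(unit_name):
    # Build a new class at runtime; the class name is computed from data.
    return type("Quantity_" + unit_name, (), {"unit": unit_name})

Celsius1 = make_unit_type("celsius")
Celsius2 = make_unit_type("celsius")
print(Celsius1.__name__)        # Quantity_celsius
print(Celsius1 == Celsius2)     # False: each call creates a distinct type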

By comparison, in the still not very well formalised but at least well discussed world of objects, we tend to assume that yes, every object has identity which is separate from its contents.  This is true even in prototype OO systems which don’t make a hard division between object and class. An object’s identity is something ineffable, which can’t be serialised with the object. Two objects can be equivalent if they represent identical data (and class, if we have one), but they can’t ever be equal.

However we’re a lot more vague about the identity and uniqueness of classes, in class-based OOP; in many cases, the identity of a class is bound up with its name, in a way that an object is very much not. In fact often in class-based OOP, an object doesn’t have a name (just a current reference, which may be a named slot of any other object); while a class may be almost nothing but a name, and some source code, because a class isn’t created at runtime but by the compiler. You may guess that I find this situation weird and wrong, and prefer prototype-based OOP as a much cleaner conceptual base.

Most programming language type systems, I think, rely on a similar intuition. If so, this means that we can’t quite reduce a type to a function-like entity. Or at least we could have anonymous types, but two anonymous types, even if they represented the same internal data structures, would not be identical.

But I’m not sure about this, and it would be great to have this nailed down. My intuition is that, in a fully coherent runtime system that can span the Internet, every kind of data needs to be able to be computed and manipulated at runtime. That includes things traditionally considered ‘not runtime data’: blocks of uncompiled source code (as different from compiled methods or closures); names and name paths (eg, dot-separated sequences of names, as in OOP); messages; function-evaluations (as different from functions). And the fewer fundamentally ‘separate’ kinds of data, the better.

(In terms of uniqueness, it may be that two objects always being considered different is purely a function of OOP languages not being pure-functional. In other words, two objects are only different because they represent different ‘storage locations’ which can be mutated. If two objects were immutable, we might all be able to agree that they were identical if they shared the same data and the same class/prototype: they would, in other words, behave identically. And that would make unique yet identically structured ‘anonymous types’ in a pure-functional system even more of an exception.)

In Prolog or Kanren relations, identity gets a bit odder again: two logical terms (possibly representing elements of a relation) will compare identical if they have the same structure. But a term is a construct that combines both the name of a relation and data which represents a part of that relation. However, relations themselves are treated differently again between Prolog and Kanren. Prolog generally binds the name to the relation; Kanren, embedded in Scheme, takes the Lisp-like approach of the name of the relation being unrelated to its content. So, presumably in Kanren, two logical terms or assertions (which could be elements of a relation) will not compare identical merely because they are members of the same relation.

Prolog doesn’t allow anonymous relations (at least, not in the core language; possibly it does in various language-dependent extensions). Kanren will, I think, allow anonymous relations – but I’m not sure it has any matching way of creating terms or assertions that refer to such anonymous relations. (Bearing in mind that a term can take any format, and even if it has the same name as a relation, it’s not until it’s actually asserted that it becomes bound to that relation; and at that point, it would only be the body of the term, not its head, that gets added to the relation. Scheme-like languages have a single name-binding mechanism, assignment of names to values; Prolog-like languages have assertion of terms and rules and unification of terms and variables, which are two very different mechanisms.)

Prolog and Kanren, furthermore, also have an object-like construct which does have a sort of unique identity – the unknown or unbound variable, which will or won’t compare equal to another depending on how you view it. It will unify to any value, but if printed, it’s a unique value in its own right, and requires external infrastructure (the variable binding environment) to create that identity.

This complicated, somewhat ill-defined situation feels like it’s missing something important.

More about identities

August 8, 2016

A followup thought:

An identity rule is something we can infer from a statement (which, with |all, includes sets of statements). It tells us what set of statements this statement can be made identical to. This may (probably will) include reducing to a most general statement.

A function definition (and evaluation) is an obvious example of an identity rule. If we have a (|lambda…) form in our set of statements, we know that (under evaluation) we can identify that with the function definitions it calls, and with any sequence of literals that match the parameters and final result.

A type (or schema) definition is a less obvious example of an identity rule! It means that, if we encounter a mismatched definition, we can reduce the entire set of statements to the null or false statement! Because it’s ‘malformed data’ or ‘type mismatch’. We’re ‘rejecting’ the data, yes, but rejection is just a case of following a most general identity rule.

Same for a logical contradiction: if a set of assertions contains a contradiction, we’re free to reduce it to a single ‘invalid’ or ‘false’ (depending on what logic we’re implementing). At least under strict interpretation rules – which we would not want to be following if we were, eg, a word processor, or even a reasonably forgiving source code editor.

A sensible human-friendly processing system, even when rejecting data, should encapsulate it in an error statement rather than reducing it entirely. But the statement still can be made identical with ‘error’.

It follows that identity rules aren’t something that come attached to statements as such (though tags suggesting or defining them may be), but to their processing context – how they’re transformed, how they’re passed to another system. In other words, identity is a three-place relation, A =_C B: ‘A is identified with B in context C’.
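A toy illustration of that three-place relation in Python, where the ‘context’ is just a normalising function applied before comparing:

def identical(a, b, context):
    # A =_C B : A is identified with B in context C.
    return context(a) == context(b)

print(identical("3.0", "3", context=float))   # True in a numeric context
print(identical("3.0", "3", context=str))     # False in a textual context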

A slightly more subtle implication of this thought, and one that satisfies me a lot, is that identity rules (in a given context) are properties of sets of statements as a whole – rather than following in linear sequence, as in functional reduction. Identity is non-monotonic: adding more statements (eg, schema restrictions) to a statement set can change its identity set radically.

In Search of a Most General Reduction Rule

August 8, 2016

Ignoring deltas for now, and thinking more generally about the ‘evaluation rule’ of a space composed of logical statements:

It seems that the idea of evaluation comes down more generally to assigning identities. In functional programming, we tend to recursively substitute every instance of a symbol or function – based on pre-defined identities from symbol definitions and lambdas in local and inherited environments – until we get to either data literals or built-in functions. Our definitions (eg (let) and (lambda) forms) tell us what terms we can identify with which; since the identification and term rewriting goes only one way, and every step ‘reduces’ a complex expression to one made from simpler terms (though more of them), this is called ‘reduction’.

There are a few tweaks possible to reduction systems: macro systems, for instance, are effectively custom evaluators of expressions, with good old Lisp 1.5 FEXPRs starting to be recognised as a true generalisation of macros; different ways of handling environments, eg dynamic vs lexical binding; and so on.

In logic, though, we often want a much less aggressive way of handling ‘identities’. The Prolog/Kanren implementation of Robinson resolution, for instance, is one such less-aggressive evaluator, which still tends to ‘reduce’ but is happy not to have to reduce all expressions to known quantities: the Prolog/Kanren ‘unknown variable’ has no equivalent in functional languages. Other logical proof systems – or symbolic maths packages, for instance – may not reduce an expression at all, but might transform an expression into a different form, more complex but useful for another kind of manipulation. So ‘reduction’ is not always what we want to do. Though it’s useful as the core of a language, we need to be able to limit its effects. A standard functional programming language will throw an error when it can’t reduce an expression. A logic programming language, however, is happy to just say ‘a term is a term’.

A core idea I’m still trying to reach for in the concept of ‘dataspace’ is that a bundle of logical statements should be a space: this means we should be able to index into this space in some organised manner and ‘find something inside’ – the idea of ‘containment’ is critical, because we also want to use this notion of ‘space’ to organise not just data but computation, and not just computation but computers and networks. This process of ‘indexing’ seems like it should be very similar to function evaluation by reduction, but since (for the reasons above) we don’t want to commit too aggressively to reduction, we have to think a bit about how we want to approach indexing.

In term-expressions, we can have arbitrary expression forms, so like (|date 2000 dec 25), which looks very much like a data-type. This is something that we probably don’t want to reduce – although we might want to transform it into other date representations. We might also want to encode freeform Web-roaming logical data like Wikidata entries: bundles of logical statements with arbitrary formats, perhaps something like ((|wp thing Eiffel Tower)  (|wp property height) (|metric metres 324))   We could see here that (|wp) might be some kind of namespace – and it’s hard to see how we could have a web-spanning logical dataspace without some kind of namespacing mechanism –  but it’s not at all a function. (Unless we chose to treat it as such.) For the purpose of analysing such logical statements, we’d like to stick with making identities: knowing that (|wp thing Eiffel Tower) is a uniquely identifiable entity. So we’d like to be able to reduce terms in small pieces, at will, and not always recursively.

For a long time I was hoping I could make functional reduction – and reduction of curried functions (functions of one argument) at that – be the one rule for establishing identities. But I don’t think that’s viable. Quotations, for example, just don’t work. Also, Lisp functions with multiple arguments (potentially arbitrary lists of arguments) are really, really nice and expressive, and they don’t map onto curried functions. Sorry, Haskell! But you’re not general enough. (Which is fairly obvious: after all, a Haskell program is not a Haskell expression. And that’s the root of the problem right there.)

Here’s one possibility:

  1. The ‘top of the namespace’ must be a single root node, eg, something like (|root). There’s probably a better word. It would in fact be (|root |all …) in all but the most trivial case, that only has one definition.
  2. Terms that we want to be functions, we define under the root. Perhaps something like (|root inc |lambda x |plus (|x) 1), if we wanted lambda syntax. Or if we wanted to do it more Prolog/Haskell style: (|root inc |all (0 . 1) (|lambda x |plus (|x) 1)) – which might have a few issues because we’re defining (0 . 1) twice, but it gives the rough outline of how we could mix literal data with computed data in one single expression using |all.
  3. Instead of |lambda terms, we could have say |lambdalist which binds the whole ‘calling expression’ to one parameter. Eg (|root add1 |lambdalist a 1 |a) will take the whole list and prepend 1, so (|add1 a b c) will identify with (1 a b c)
  4. Perhaps (|reserved) could reserve a symbol and prevent it being overwritten by later (|let) or (|lambda) blocks, eg (|root add1 |reserved |lambdalist a 1 |a)
  5. We want something like macros, but fexprs (or vaus, using John Shutt’s vau calculus) are, I think, a more robust way of getting them. At least locally: just have a form eg (|vau environment param) which receives a calling environment. Problem is… we really, really would not want to expose something like this over the net. There needs to be a way to provide a net-safe fexpr-like functionality, so that you don’t transmit your environment to remote functions. (However, if we download a vau form, and execute it locally, we’re probably mostly okay!)
  6. The trick is subtly shifting our thinking from (|a b c) meaning ‘index c from b from a from (|root)’ to ‘index (b c) from a from (|root)’. In the case of literals, this will come out the same; in the case of |lambda, also the same. But other forms might consume the entire parameter list – this means that, generally, (|(|a) b) is now no longer the same as (|a b). Which is, I think, a required tradeoff for expressiveness. (|date 2000 12 25) is not the same as (|(|date 2000) 12 25), why should it be? (A rough sketch of this indexing rule follows this list.)
  7. If we get an arbitrary irreducible term, that’s fine! In fact, not reducing automatically (recursively) might be a better approach. We could have, perhaps |= mean ‘force reduce’. So: (|root inc |lambda x |= |plus 1 (|= |x)) … but that seems like it’d get really cumbersome. It’s already a hassle to say (|x) instead of just x. If we’re inside a |lambda, we probably already want functional-programming rules applying. So we probably want forms other than |lambda to show that we want partial reduction.
  8. In the end, I think we want something slightly more general than Shutt’s Kernel language. He’s added ‘operatives’ (fexprs) to lambda’s ‘applicatives’ … but even operatives attempt to reduce, though they don’t do it recursively. So we have ‘terms that auto-reduce, and terms that reduce only one level’. We need at least ‘terms which can’t be reduced’, and we may also need ‘terms which can’t be reduced… but which must (eg, after preprocessing, can be asserted that they do) match a predefined schema’. So we need at least five things: literal sequences; |all expressions; functions; fexprs or vaus; and some kind of type definition or schema. (At least; we can, I think, implement most of logic programming relations/predicates with fexprs, but unknown variables are harder to find a home for). But it would be nicer if we could find something more general underlying fexprs, lambdas and typedefs (and/or unknowns). We can always say they’re all instances of ‘terms’, and that’s a step forward, but…
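And here, as promised above, is a rough sketch of the ‘index the whole remainder from one word under the root’ idea from point 6. It’s a toy under my own assumptions – the root environment is a Python dict, defined words consume the entire remainder, and anything undefined is simply left alone as an irreducible term:

root = {
    "date": lambda rest: ("date-literal",) + rest,    # consumes the whole remainder
    "inc":  lambda rest: (rest[0] + 1,) + rest[1:],   # lambda-like: uses one argument
}

def index(term):
    head, rest = term[0], term[1:]
    if isinstance(head, str) and head.startswith("|") and head[1:] in root:
        return root[head[1:]](rest)
    return term                                       # a term is a term: leave it be

print(index(("|inc", 41)))                      # (42,)
print(index(("|date", 2000, 12, 25)))           # ('date-literal', 2000, 12, 25)
print(index(("|wp", "thing", "Eiffel Tower")))  # left as irreducible data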

That’s a lot of words. The deeper idea I’m reaching for here is that reduction (evaluation) is only one form of assigning identities. Not everything is identical. (Trivially obvious, right?) Everything that’s complex is made out of smaller things – yet not everything that’s complex is made out of smaller things in a way that transfers identity to those things. (Perhaps not obvious, but reasonable?) Functional expressions, however, can be made identical to their definitions (in the restricted context of evaluation – and forms of pre-evaluation that include, but are not at all limited to, compilation in the classic sense). That, though, is a very heavy restriction on the information that those terms can contain – reducing a term to its simplest components is like boiling it down for its atoms. It destroys information. Sometimes (compilation) we want to do that. Sometimes (source code management) we really really don’t. Sometimes we want to do a number of operations in between (truth maintenance, typechecking, spellchecking, grammar checking, protocol conversion, metadata adding…)

Logical programming terms can also be identified with terms that contain variables, and with implications plus a database of definitions – but only in the context of a binding environment – all of which is subtly different to functional evaluation in multiple tricky ways.

The faint hope I have here is that this process of assigning identities in a set of logical statements could be automated by simply assuming that any statement reachable from the root is an identity. That is, that by declaring a root entity we’re declaring a more general form of specific statements. And that a few built-in forms might be able to tell us what kind of identity we’re dealing with. But it’s still an elusive kind of quality I’m trying to name.

Simpler Deltas

July 7, 2016

To create an interactive ‘dataspace console’ we need some kind of local data storage, and the ability for the user to make changes to it. If we want to be pure-functional about things, we don’t want actions that happen immediately; we want function-like operations which read the storage state and produce deltas. This means we need some kind of minimally marked-up ‘node’ or ‘object’ structure which shows just the changes. Similarly, if we’re doing logic-like operations within a simulated world, or almost any kind of change tracking to documents, we’ll also want some form of delta structure. The question is: what’s a simple, logically consistent format for representing these?

Neither Prolog nor OOP languages traditionally have a very good answer for this; although OOP’s concept of ‘inheritance’ comes very close. The delta format outlined a few posts back might work, but it still feels conceptually ‘clunky’: in particular, because adding an explicit ‘not’ operator seems to break the simple Boolean logic model of ‘if X is mentioned in the dataset, it’s true; if it’s not mentioned in the dataset, it’s false’. The cutlike ‘=’ operator, though, seems like it might be able to take on the role of ‘not’.

The simplest Prolog way of handling changes to a set of knowledge would be, I think, to represent a delta as a list of terms, possibly wrapped in ‘not’ for negation. But that doesn’t work for me, because I want the quality of ‘locality’ – I want data to be ‘near’ related data, because perhaps I might have my workstation or workspace as a subspace indexed from a wider space, which might be planet-wide; it’s not going to scale well if I have to have a global root node, at the DNS level of a world-wide dataweb, which is called ‘not’, and I have to ping it every time I want to negate an expression on my local desktop. And a core idea of dataspace is that syntax = structure. So we need a syntax that puts operators lexically close to where they’re used, not at the head of a list.

It’s also very annoying, and very noisy, to have to explicitly negate every fact we already know in order to convey the idea, intuitive in almost every programming paradigm other than logic programming, that ‘the new value of x is y’. Again, the cutlike ‘=’ operator seems to solve this.

Using term-expressions, we can have as many ‘operators’ as we like, embedded at any point. So perhaps for a minimal, delta-ready node syntax, we can get by with two operators, and the magic splice point meta-operator (rendering splice as |  rather than : as it’s less likely to conflict with existing uses):

|+ , to define a node with a list of successors that adds to previous expressions
|= , to define a node with a list of successors that replaces any previous or inherited value

(For production work we’d probably have hash or binary-tree versions of each, for fast searching)

If we assume (as I have earlier in T-expressions) that an expression must end in a non-null value to be ‘true’, then we can use a trailing |= to set an expression to null, effectively making it ‘false’ – but with Prolog-like Boolean search semantics, rather than any kind of three-valued logic. That seems like it might solve several problems at once.

So a delta might be:

(myfile chapter1 |= (a bunch of text…))  — replaces all of the ‘chapter1’ section of ‘myfile’ with all new content
(cat likes tuna |=) — it is no longer true, if it ever was, that the simulated cat likes tuna
(|+ (x |= 1) (y |= 2)) — assign new values to property slots x and y, in a fragment from an object definition that goes on to inherit from a class or prototype

(|+
  (player |+
    (location |= (westhouse))
    (flags |+ (hungry) (sleepy |=))
    (score |= 100))
  (westhouse light on |=)
  (player inventory lamp)
)

— a typical set of updates to an adventure game after a move. The player’s location (a single-valued attribute) has changed; their ‘hungry’ flag has been set while ‘sleepy’ has been unset; score is set to a dotted-pair of 100; the fact that a light is switched on has been set to false; the player now has a new item in their inventory.

(employee 123456 salary |= 100000) — one named field of one record in the employee table has changed
(employee 123457 |=) — one employee record has been deleted entirely
(employee 123458 |= (name Mickey Mouse) (salary |= 90000)) — one employee record is created, or completely replaced

or we could bundle all these changes as one atomic change:

(employee |+
  (123456 salary |= 100000)
  (123457 |=)
  (123458 |= (name Mickey Mouse) (salary |= 90000)))
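To check that these semantics hang together, here’s a minimal sketch of applying such deltas, under my own toy encoding: both the store and the delta are nested Python dicts, with the operator kept under an ‘op’ key. A trailing |= with no children just empties the node, which is the ‘false’ behaviour described earlier:

def apply_delta(store, delta):
    op = delta.get("op", "+")
    if op == "=":
        store.clear()                         # |= replaces everything below this node
    for key, sub in delta.get("children", {}).items():
        if isinstance(sub, dict):
            apply_delta(store.setdefault(key, {}), sub)
        else:
            store[key] = sub                  # a leaf value
    return store

db = {"employee": {"123456": {"salary": 90000}, "123457": {"salary": 50000}}}
delta = {"op": "+", "children": {
    "employee": {"op": "+", "children": {
        "123456": {"op": "+", "children": {"salary": 100000}},
        "123457": {"op": "=", "children": {}}}}}}   # deleted: set to 'false'

apply_delta(db, delta)
print(db)   # {'employee': {'123456': {'salary': 100000}, '123457': {}}}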

It looks like we would use |= quite a lot in real-world data modelling, which I think makes it a good choice as a primitive operator. A chunk of arbitrary structured data now needs a little thought as to whether it should add to or replace previous information. If we wanted to just keep the expressiveness of something like JSON, we’d probably pick |= in every case, but we now have the ability to indicate adds as well as changes; so we can represent both.

We could define |= so that it doesn’t include the ‘successor list’ feature of |+, which might make processing the expressions simpler, but then we’d have to represent ‘false’ with |= |+ rather than just |= . It’s an aesthetic choice, I think, as to which base set of operators to pick, but the argument in favour of readability seems strong.

Given a bunch of deltas, we can append them all together inside a |+  or |= list, and we could traverse that list in order to search it, stopping after we reach a |= subexpression. We could also traverse it to remove duplicate expressions (or sort it into a |tree), in order to consolidate deltas. That means we probably need at least two forms of reading an expression: one applying operators, and one not. (We’ll probably also need something like this when – if –  we get to formulae.)

It might be interesting to think about what kind of logic |+ and |= represent, as we’ve now smuggled in something roughly equivalent to Boolean ‘and’ and ‘not’, but not quite as powerful.


Rationale And Summary IV: What Might A Dataspace Be?

July 6, 2016

Stepping back from the low-level structure considerations for a moment: the original user-facing idea of a ‘dataspace’ seemed, and still seems, simple to me. Until I look closely….

Assume you have a suitably mathematically stable and implementation-independent concept of ‘object’ nailed down. (As previously described, this is harder than it looks, and we usually assume we already got this sorted out long ago; we really haven’t.) Then the following principles should apply:

  1. Everything in the data universe is an instance of ‘object’, with the possible exception of atoms like numbers and strings; but even those probably are also objects, down to some very low level representation like ‘cons cell’ or ‘machine address’ or ‘machine word’. See, eg, how Picolisp works: the virtual machine is specified with cons cell as the base element, making memory management very simple.
  2. There’s a root environment. In fact each object probably has its own attached environment. See, eg, Lua, which has  almost everything as an objectlike structure called ‘table’ with attached ‘metatable’, which serves as an environment. Environments allow us to implement something like Scheme ‘closures’, which seems required for mathematical stability.

    However, looking at Prolog (and most treatments of classical and first-order logic itself) we find that there’s a big disconnect between the world of ‘logic’ and the world of ‘functions’, focusing around this notion of ‘environment’ and ‘closure’. In short, functional theory tends to describe closures (operations or objects which carry their environment with them), while logic tends to describe formulae which refer to a centralised database of facts or definitions – ‘the state of the world now’. For this reason, logic and logic programming have generally overlooked the definition of what kind of data structure a world is – so Prolog, eg, has no standard way of applying predicates to ‘the world database’ in the way functional or OOP programmers would expect to naturally have functions that work on functions, or objects that pass objects. This is a huge conceptual gap in the foundations of logic (when applied to programming) which really needs to be crossed somehow.

    An analogous problem in functional programming is that generally, FP formalisms – starting from lambda calculus and even SKI calculus – have no representation for ‘the environment’ as data that can (and should) be acted on by functions. Yet all of real-world programming involves functions performing manipulations of environment-like structures! Eg: filesystems, databases, operating systems, and objects. Our fundamental formalisms really ought to represent the actual work we expect our programs to do; as long as we refuse to mathematically define the operations we’re doing, we’re going to find it hard to methodically reason about them.

  3. Objects should be able to include functions or formulae, which are themselves objects (ie, have exactly the same formal internal structure as other objects) and perform a query over their parent object’s attached environment. We should be able to mix and match both ‘literal data objects’ with such ‘functional’ or ‘computed’ objects at a fine-grained level, in a similar manner as we can mix literal assertions with rules in Prolog. The requirement that formulae/queries should themselves be objects is because we want the strong Lisp/Scheme (and to some extent, Prolog) property of ‘homoiconicity’, which means that a program’s structure should have the same form as its syntax. We want, in fact, programs to be able to operate on programs – because since we want to use this for arbitrary knowledge representation, we need to be able to represent chunks of knowledge talking about chunks of knowledge.

    For example: we might want to have an object which is a ‘project’, which itself contains ‘documentation’ and ‘libraries’ and then ‘code’ which ‘imports’ those libraries, while perhaps the documentation itself references both the libraries and the code, as well as references the people who created those libraries… and we want all of those objects to be made from the same kind of datalike ‘stuff’, so that yet another object could take this whole ball of wax and study it and make statements about it.

    OOP languages generally do NOT have this property; they often make a strong distinction between, eg, ‘object’, ‘source code of object’, ‘object property slot’ and ‘object method’. We’d like them to be a lot closer together. We can cope with the idea of ‘compilation’ (so that a piece of source text gets transformed successively into a cached or simplified form, down to machine code) but there must always be a link between a cached or compiled form of an object and its raw source form. See, eg, the COLA (Combined Object/Lambda Architecture) project for an example of this kind of thinking: that there should be a fundamental system ‘object’ model, and all processing should progressively transform and cache objects into objects.

    One thing this implies is that an object represents multiple queries in parallel (because a formula or query is generally rowlike or listlike in form; to be complete, we want to extend it to tablelike or setlike). An important question seems to be: should an object represent ‘logical OR or logical AND / setwise UNION or setwise INTERSECTION’ between multiple queries? It doesn’t seem quite trivial to determine. Set semantics and relational algebra suggests it should be intersection; Prolog semantics suggest it should be union. Intersection of results respects typefulness (to the extent we care about types), and generates smaller result sets; union of results respects the property that ‘every row in an object should be a setwise union’, and is faster to compute as we just add results together. Generally I’m leaning towards union.

  4. Objects should be immutable, except when they aren’t. This is a bit awkward as we now have two kinds of ‘stuff’, but it can’t easily be helped. But generally most ‘stuff’ should be immutable objects, with a few ‘variable’ or ‘signal’ typed objects which change in realtime. This is where dataflow or Functional Reactive Programming semantics seem to be required: we’d like to have this huge sea of data objects, some of which might not exist until we look at them, and then a ‘cursor’ or ‘window’ or ‘view’ object which represents the values we care about here and now, based on external signals coming in. Ideally, all our actual windows in a windowed desktop would be this kind of ‘data window’; and all the processes in our operating system would also be this kind of functionlike view. Also ideally, we’d have one signal ‘root view’ representing a physical machine which simply applies or attaches itself to any functionlike object to generate subviews.

    FRP seems like the future of programming (particularly web-scale, UI-type programming), and this is where work like the Elm web language and Racket’s FrTime language look very promising. But a lot of the core semantics of FRP seem yet to be nailed down and, like OOP, remain extremely ad-hoc and implementation-dependent. Views or signals need the ability to store a history or freeze signals (at least some of them do). We also need to model what happens as views go in or out of scope (during loops and conditional expressions, for example). If our OS processes are going to be made of these objects, how do we garbage-collect them? We need a very simple lambda-like calculus to model and understand this behaviour.

  5. Caching, security and migration are very open issues. If we have a chunk of pure-functional, datalike, objectlike stuff which can perform computations, it will be useful to migrate parts of it between different computing fabrics: inner CPU cache, local user/machine process RAM, local user/machine permanent storage, shared storage on a machine, organisation compute cloud, public clouds…  but we must allow migration only to certain spaces based on trust levels. We also need to be able to access both the ‘inside’ (definition, source code) and ‘outside’ (compiled form, API, runtime data) of such objects, but allow access only to authenticated agents.

    Currently we do all of this on a very ad-hoc basis and using manual compilation / publication processes (and public cloud computing has the potential to silently undermine every level of enterprise security). We would need to have some kind of ‘capability security’ architecture (where only authenticated users can even acquire a reference to an object) before even thinking of going live with an internet-facing system. We probably also need hard user-interface indications of ‘what object am I talking to’ and ‘where does it live’ and ‘how much should I trust it’. A Web browser gives us a little indication of this, with a URL bar, but current Web trends are rapidly subverting even this. Desktop and mobile apps have almost completely erased this line, with disastrous security implications.

  6. We should, though, have a rough idea now of what we want: a Smalltalk-like recursively-defined sea of Prolog-like literal/computed data objects, only in the FRP style, pure-functional and immutable (except for views, derived from the physical machine doing the processing). The sea of objects extends beyond any given machine, and is cached and stored and computed at various places in a worldwide grid. The construction of a worldwide grid doesn’t have to be immediate, but we should think when we create our core data structures that we’re not talking just about software running in ‘this process’ or even ‘this machine’, and impose no unnecessary restriction at the low level. However most computation should happen locally; only changes should be transmitted, and only transmitted as far as subscriber interest has been indicated. (Recent work on ‘content-centric networking’ suggests the direction the Internet should / will eventually head; we just want to add computation to content distribution.)

    In the large, at web-scale, we will need things like versioning, history-rollback, encryption and  hash codes on all objects. In fact we’ll probably want those even at desktop scale. They don’t have to be in the lowest levels of objects, but our core data definitions should assume that we can add those layers.

  7. User interfaces should be constructable by simply applying or attaching or pointing (via some kind of ‘transclusion’ mechanism) function-like objects to data-like objects. See, for instance, PowerShell’s ‘Out-GridView’ cmdlet for an example of taking arbitrary data and sending it to a function which constructs a UI at runtime, no compilation necessary. The Factor and Picolisp languages can also do similar feats (though usually using Web interfaces). We’d like this to be the rule for an entire UI ecosystem.

    This means our rule for creating interfaces should be to have a small number of orthogonal, complete representations and operations which can take any arbitrary data thrown at them – not the current way of building large, monolithic custom interfaces which are extremely complex. The Smalltalk/Squeak world probably has more experience in this domain.

  8. Both Smalltalk and Prolog implement the principle of ‘late binding’ (Prolog perhaps the most radically of any language), and that seems important to me. A system primarily designed to handle user-specified data and give control to users who want to do as little or as much programming as they choose should leave as many implementation decisions as possible up to that user, and that means baking in as little as possible at ‘compile time’. (We can still have ‘compilation’, just as we can have ‘types’, but both ideas should simply be program operations performed by programs on other programs, or on objects by other objects.)

    How to reconcile this with the pure-functional view of Scheme/Haskell that objects should carry an environment with them is an interesting problem. I think there should be two kinds of environment: an internal and an ‘external’ one. (Perhaps in the style of John Shutt’s ‘vau calculus’.) This would give our objects the power of macros – or of classical logical formulae, which like macros can access the current ‘world state’, and generate revised worlds – or sets of changes to the world.

    This is also tied up with the idea of ‘lazy evaluation’. Dataspace is big, and we don’t want to instantiate more of it than we need to. Like Prolog predicates or Haskell functions, many of our computable objectlike, datalike structures will be infinite in extent in multiple dimensions. We very much need ‘lazy’ semantics to deal with this.

    The Robinson resolution algorithm of Prolog is remarkably clever in how it handles searching large variable spaces while still being simple and reasonably predictable. I wish there were more work on bridging the gap between logic-programming and functional-programming semantics. Kanren might be part of that bridge, but there still seem to be parts missing to me.
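
Back to the lazy-evaluation point for a moment. As a very rough sketch of the flavour I mean (plain Python generators, nothing to do with Prolog or Kanren internals), an infinite relation can be a generator of tuples, a derived relation another generator over it, and the consumer decides how much actually gets computed:

from itertools import count, islice

def successor():
    # the infinite relation succ(n, n+1) over the naturals, one tuple at a time
    for n in count():
        yield (n, n + 1)

def plus_two():
    # a derived relation (succ composed with itself), still infinite, still lazy
    for (a, b) in successor():
        yield (a, b + 1)

# only as much of each relation is computed as the consumer demands
print(list(islice(successor(), 5)))   # [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
print(list(islice(plus_two(), 3)))    # [(0, 2), (1, 3), (2, 4)]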

That’s the long view. In the short term, the key ideas I want to capture are: 1) a cleanly ASCII-serialisable object specification which can describe literal data and/or program source text of any kind; 2) an object-based formula or query format that can be lazily evaluated; and 3) an FRP engine for applying 2 to 1.
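
To make (3) a little more concrete, here is a deliberately tiny sketch in Python (every name in it is invented for illustration; it is not any existing engine): facts as tuples, a query evaluated lazily over them, and subscribers re-notified whenever the facts change.

class Store:
    # a toy 'FRP engine': whenever a fact is asserted, every subscribed
    # query is re-evaluated and its subscriber is handed the new results
    def __init__(self):
        self.facts = set()
        self.subscribers = []                  # (query, callback) pairs

    def subscribe(self, query, callback):
        self.subscribers.append((query, callback))
        callback(list(query(self.facts)))      # initial results

    def assert_fact(self, fact):
        self.facts.add(tuple(fact))
        for query, callback in self.subscribers:
            callback(list(query(self.facts)))

def sitters(facts):
    # a 'query': a lazy generator over whatever facts currently exist
    return (who for (rel, who, what) in facts if rel == 'sat-on')

store = Store()
store.subscribe(sitters, print)                # prints [] immediately
store.assert_fact(('sat-on', 'cat', 'mat'))    # prints ['cat']
store.assert_fact(('sat-on', 'dog', 'log'))    # prints ['cat', 'dog'] (in some order)

The serialisable object spec (1) would be what goes into the facts; the query format (2) would be what things like ‘sitters’ get generated from.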

Summary and Rationale III: Term Expressions

July 4, 2016

So T-expressions, though they might help unify serialisation syntax and ways of thinking about ‘object’, ‘set’ and ‘array’ structured data, didn’t quite satisfy me, because I still needed to think about how to implement a language on top of them, and I knew I’d need at least one extra reserved symbol for ‘evaluatable expression’ – or several. And I really didn’t want to pollute the namespace of my lowest-level syntax, because I didn’t know what direction the higher-level language layers might finally take.

I came back to the idea of logical expressions as in Prolog. But because I wanted to refit any given Prolog program as an object-like structure (so I could add the idea of ‘location’ inside a treelike namespace to what is currently a flat-file database – which seemed important for scalability), I kept finding that expressions became really ambiguous. I wanted to say things like:

(a b c) + (a b d) = (a b BOTH c d)

But there’s no easy way to interpret that symbol ‘BOTH’! It plays the same role that ‘.’ plays in T-expressions. In T-expressions I just assumed it as part of the syntax. But could I find a way to make it something deeper – a logical expression itself?

The answer is yes. If we take one symbol, say ‘:’ (so as not to confuse it with ‘.’), and let it mean ‘everything that follows this is a logical expression’, we get a syntax which is like Prolog but inverts its relationship between logical terms and lists.

In Prolog, you have terms and lists as

this(is, a, term).
[this, is, a, list] == '.'(this, '.'(is, '.'(a, '.'(list, []))))

Leaving aside how noisy that syntax is, note that there are two separate syntaxes, and the [..] list syntax is pure sugar for a chain of '.'() nodes, which are themselves terms. And that native '.'() syntax is just hideous. But in what I’ve taken to calling ‘term-expressions’, we put this the other way around, and it all becomes much cleaner, both at the machine level and visually!

(: this is a logical term)
(this is a list)
(this is a list which ends in : a logical term)

Literally everything is now a list. A logical term is a list that starts with the special term indicator, ‘:’. (We can make that one reserved symbol anything we want.) Anything that comes after the ‘:’ is immediately identifiable as special data. A logical term may contain lists, or other logical terms. We can mix and match lists and terms in a way we couldn’t easily in standard Prolog syntax, or even in many Lisp-embedded logic-programming dialects like Picolog or Kanren.

We can use this to define data types! Eg:

(: hex a1 b2 c3 d4)
(: address_record (name mickey mouse) (address 1 Disneyland place))

etc. This one small addition to S-expressions now gives us the ability to define a huge range of arbitrarily-structured data types. It breaks down recursively very neatly! Instead of an S-expression dotted pair ending in just an atom or a number, we could now end it with any kind of structured datatype.
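
As a sanity check that this is trivially representable, here is one possible modelling in plain Python (the representation is only for illustration; ordinary lists stand in for S-expression lists, and ‘:’ is the one reserved symbol):

TERM = ':'   # the single reserved symbol

def is_term(x):
    # a logical term is a list (or list tail) beginning with ':'
    return isinstance(x, list) and x[:1] == [TERM]

hex_value = [TERM, 'hex', 'a1', 'b2', 'c3', 'd4']
address   = [TERM, 'address_record',
             ['name', 'mickey', 'mouse'],
             ['address', 1, 'Disneyland', 'place']]
# "(this is a list which ends in : a logical term)"
sentence  = ['this', 'is', 'a', 'list', 'which', 'ends', 'in',
             TERM, 'a', 'logical', 'term']

def trailing_term(x):
    # everything from the first ':' onward is read as a logical term
    return x[x.index(TERM):] if TERM in x else None

print(is_term(address))          # True
print(trailing_term(sentence))   # [':', 'a', 'logical', 'term']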

You could then take this right down to the machine level, for any arbitrary built-in datatype you want. A ‘string’, for example, might be at the machine level something like:

(: string 65 66 67) – ie, a sequence of bytes, where ‘string’ might be a hardcoded byte, and ‘:’ might be a flag bit.
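
Purely to illustrate what that could look like in code (the flag bit and tag byte here are invented, not any real format):

TERM_FLAG  = 0x80   # hypothetical flag bit meaning 'what follows is a term'
STRING_TAG = 0x01   # hypothetical hardcoded byte meaning 'string'

def encode_string(s):
    # (: string 65 66 67) -> a tagged header byte followed by the raw bytes
    return bytes([TERM_FLAG | STRING_TAG]) + s.encode('ascii')

print(list(encode_string('ABC')))   # [129, 65, 66, 67]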

We could model all of T-expressions by just adding one symbol, ‘all’. Eg:

(the cat sat on the : all (mat) (dog))

And now we can assign a formal logical interpretation to ‘: all … ‘, as a logical term, to mean ‘the set union of the following sequence of expressions’, instead of relying on defining the semantics at the level of syntax. But we also know that we haven’t polluted the global namespace – the word ‘all’ is now not a reserved word, only the symbol ‘:’ is.
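
Reading ‘: all’ as set union can be made operational in a few lines. Again just a sketch, over the list representation used above, and handling only the case where ‘: all …’ is the tail of the list:

def expand_all(expr):
    # interpret a trailing (: all e1 e2 ...) as the union of the expressions
    # you get by replacing it with each of e1, e2, ...
    if ':' in expr:
        i = expr.index(':')
        if expr[i + 1:i + 2] == ['all']:
            prefix = expr[:i]
            return {tuple(prefix + branch) for branch in expr[i + 2:]}
    return {tuple(expr)}

sentence = ['the', 'cat', 'sat', 'on', 'the', ':', 'all', ['mat'], ['dog']]
print(expand_all(sentence))
# {('the', 'cat', 'sat', 'on', 'the', 'mat'),
#  ('the', 'cat', 'sat', 'on', 'the', 'dog')}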

The downside to term-expressions is that they might be too general: now that you can put arbitrary logical terms anywhere in a list, how do you interpret them?

That’s the hard part. But at least having a syntax which allows the representation of ideas that are very hard to express in other syntaxes, and which doesn’t get in the way of thought, is, I think, a small step forward.

Summary of T-Expressions II

July 4, 2016

The core idea behind T-expressions is a search for

  1. a stable minimal semantics of ‘object’ in its simplest form: not even dealing with any issues of inheritance, method or code evaluation, but simply the most general interpretation of the data structure we think of as a ‘dictionary’ or ‘hash table’. We know how to do these, right? Every programming language has its own hash/dictionary implementation these days! Well, that’s part of the problem. Every language has its own definition of what a hash/dict is. What should be the correct, minimal definition? How do we tell if any given language is doing it right? We have such a definition for, eg, relational algebra/calculus (see Darwen and Date) – why not for objectlike data structures, given their ubiquity in data representation today?
  2. function-like semantics, to capture the intuition that an ‘object’ should also be a function. It seems pretty obvious that an object can be implemented as a function that takes each member as an argument and returns its content (perhaps plus an auxiliary function to iterate through the names of members), and functions seem very well-defined in computing. So can we just have, maybe, an object with literal members, plus members which are methods (Schemelike closures, say), plus an extra member/method for ‘computed members’?
  3. setlike semantics, so that you can take the intersection and union of two objects – a fairly common thing you’d want to do, but surprisingly hard (read: completely undefined) in most object systems. Even the most generous ones, eg prototype OO systems like JavaScript/JSON, Lua, or Io.
  4. listlike semantics, so that the minimal case of ‘one row’ can translate as ‘a list’. This seems like a logical thing you’d want so that you can degrade to Lisplike or Schemelike semantics for a programming language.
  5. a semantics of binary relations, since ‘set of lists’ looks very much like a ‘two-column table’, and the mathematics of binary relations looks like it’s pretty well nailed down, right? Well… sort of, yes. But not entirely.
  6. a data structure that can cope with ‘ragged relations’: the sort of data you get in real-world data structures, where everything isn’t neatly lined up in rows of exactly the same length. Any given object in a sea of objectlike data, for example, will generally not have exactly the same depth of sub-objects. If you try to express an XML file naturally as a Codd table, you have to do terrible things to it first.

#6 means that Codd, Date and Darwen are out of the window, sorry. Codd defined an algebra and calculus of typed, hard-coded fixed-arity tables, which is not how real-world data works. Codd tables are not recursively definable (for example: can you create a table of all the tables in your database? A table of all the databases on your computer? A table of all the computers in a network? Not just their names, but the data itself? If not: why can’t you? Why is the Codd fixed-type, fixed-arity relational model inappropriate for describing the contents of your computer above the ‘database’ level? Because the Codd definition of ‘table’ is not sufficiently general to be useful as the foundational layer of a data model, that’s why.)

#1 is, for the moment, undefined until we decide what the minimal interpretation of a hashlike/dictlike object should be.

#2 seems fine! Except there’s no well-defined functional equivalent of ‘function A set-union function B’. Which is kind of odd when you think about it. Especially if you’re used to thinking about Prolog predicates, which literally are just that. Also, if you can union two functions together, suddenly you get not quite functional but relational semantics: you can have zero, one or many values for a function call. That seems useful, especially if you’re dealing with logic, but now we’re not quite in the realm of functional programming but in something a little larger.
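
Here is the intuition in miniature (Python dicts standing in for functions; nothing more is meant by it): once two functions can be unioned, ‘calling’ the result gives you zero, one or many values, which is relational rather than functional behaviour.

def fn_union(f, g):
    # union of two 'functions', each represented as: argument -> set of results
    out = {}
    for d in (f, g):
        for k, vs in d.items():
            out.setdefault(k, set()).update(vs)
    return out

parent1 = {'a': {'b'}}                 # parent(a, b)
parent2 = {'a': {'d'}, 'b': {'c'}}     # parent(a, d), parent(b, c)
parent  = fn_union(parent1, parent2)

print(parent.get('a', set()))          # {'b', 'd'}  - many results
print(parent.get('b', set()))          # {'c'}       - exactly one
print(parent.get('z', set()))          # set()       - none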

#3 and #4 together seem to give us an obvious way in. Take a set of Lisp expressions (atoms, nils or cons cells) and call it a day. That seems fine, right? Then we could unify object, function, relational, set and list semantics in one structure!

Well, a set of conses is exactly what T-expressions are. But they don’t quite match object or function semantics. (As you might expect given that they have relational, not functional, semantics). However they’re even a bit more general than relations.

After dismissing Codd as not quite fundamental enough, I discovered Dave Childs’ Extended Set Theory (XST), which actually predates Codd – but I couldn’t quite see how his worked. So I hacked up my own.

Taking S-expression dotted pairs and just allowing zero or multiple expressions after the dot, instead of one, gives us a serialisation format that, I think, can describe a binary relation quite neatly, and is fairly simple to visualize:

(a b c) + (x y z) = (. (a b c) (x y z))
(a b c) + (a b) = (. (a b c) (a b)) or (a . (b) (b c)) or (a b . () (c))
(a b c) + (a d e) = (. (a b c) (a d e)) or (a . (b c) (d e))

Note that there are multiple ways you could union two such relations to get a third. If we wanted to be strict about it (as I think Childs is in his XST) we would only allow the first. But it’s probably useful to establish a normal form so that the most compact definition (the last one on each line) is the standard meaning.
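
A sketch of that normalisation for the two-row case (Python lists for rows, '.' as a plain marker symbol; only meant to illustrate the ‘factor out the common prefix’ rule):

def merge_rows(xs, ys):
    # union two rows into the compact form: shared prefix, then '.', then
    # the two differing continuations
    i = 0
    while i < len(xs) and i < len(ys) and xs[i] == ys[i]:
        i += 1
    if xs[i:] == ys[i:]:
        return list(xs)                # identical rows: nothing to branch on
    return list(xs[:i]) + ['.', list(xs[i:]), list(ys[i:])]

print(merge_rows(['a', 'b', 'c'], ['x', 'y', 'z']))
# ['.', ['a', 'b', 'c'], ['x', 'y', 'z']]      i.e. (. (a b c) (x y z))
print(merge_rows(['a', 'b', 'c'], ['a', 'd', 'e']))
# ['a', '.', ['b', 'c'], ['d', 'e']]           i.e. (a . (b c) (d e))
print(merge_rows(['a', 'b', 'c'], ['a', 'b']))
# ['a', 'b', '.', ['c'], []]                   i.e. (a b . (c) ())

Merging a new row into an already-branched structure is where it gets more interesting, but the two-row case is enough to show the normal form.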

Some examples of how you might use this syntax:

(sqrt 4 . 2 -2) — extending functions to multiple values
(the cat sat on the mat) + (the cat sat on the dog)  == (the cat sat on the . (mat) (dog))
(. (surname . jones) (employee_id . 12345) (salary 120000))

We’ve got something here which could replace, at a minimum, the general notion of ‘array’ and ‘dictionary’ with one notation – plus it adds set semantics, so (. a b c) is a set while (a b c) is a list. And it doesn’t add any reserved symbols over S-expressions. In the case of one ‘row’, it collapses to an ordinary S-expression list: (a b c)

The symbol () is S-expression nil. We get another symbol, though, for a ‘true null’ or ‘empty set’: (.). We can use the difference between these two to implement something like ‘true’ and ‘false’, where ‘true’ means ‘any value other than the empty set’. The empty set is a lookup that finds no results; nil is a lookup that finds ‘end of file’.
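
A toy way to see the difference, modelling a relation as a Python set of tuples (illustration only):

def lookup(rel, *prefix):
    # return the set of row-continuations after `prefix`
    n = len(prefix)
    return {row[n:] for row in rel if row[:n] == prefix}

rel = {('sqrt', 4, 2), ('sqrt', 4, -2)}

print(lookup(rel, 'sqrt', 4))      # {(2,), (-2,)} - non-empty, so 'true'
print(lookup(rel, 'sqrt', 9))      # set()         - the empty set (.), i.e. 'false'
print(lookup(rel, 'sqrt', 4, 2))   # {()}          - a result that is nil: the row ends here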

We also have, if we treat the serialisation of the setlike part (the part after the dot) literally, an ‘ordered set’ semantics, which is slightly different than just ‘set’. Perhaps more useful.

There are a few quirks, though. One thing we find is that if we allow the slightly looser semantics mentioned above, where (a b) + (a) = (. (a) (a b)) but also collapses as a normalised form to (a . () (b)), then we also find that (. a) collapses to a. Which means we lose strict setlike semantics: sets are no longer rigid boundaries, only lists (rows) within sets are. That could be problematic.

In short: I think T-expressions have great potential as a simple notation and representation for complex structured treelike, objectlike, listlike data. But since they don’t quite have objectlike or functionlike semantics, it may be harder to round-trip data out of them and back into something like JSON. (JSON is hobbled in several other ways, mainly by not allowing arbitrary objects as keys.)

And though T-expressions give me a good way of thinking about the semantics of relation-like data structures, there’s a step further we can go which might give us even more expressive power still…

