Chapter 18. Default Values

Each predicate in a LogicBlox database consists of a set of tuples. In a functional predicate, all values but one in a tuple define a key that is unique among all the tuples in the predicate. For example, a sales predicate could define a functional mapping from combinations of skus, stores, and days to the count of a product sold in a store on a given day. Example tuples of the form (sku, store, day, sales) in the sales predicate could be

“Sweater”,      “Atlanta-Midtown”,  “20150704”,   3
“T-shirt”,      “Atlanta-Midtown”,  “20150708”,  12
“Windbreaker”,  “Athens-North”,     “20150708”,   1
“T-shirt”,      “Athens-North”,     “20150702”,   4

All the possible key combinations for a functional predicate define the set of possible tuples the predicate can hold: namely those that associate some value with a valid key combination. However, not all possible key combinations must have a corresponding value. Some or all of the possible tuples can be missing from the database (for example, there is no value in the above data for “Sweater” in the “Athens-North” store on any given day). The meaning of a missing tuple is typically specific to each predicate, and depends on the requirements of the application. The set of possible tuples for a predicate that are not missing from the database is often informally referred to as the set of populated facts in the predicate.

LogiQL rules will often combine predicates using conjunctions. These conjunctions will use the intersection of populated fact keys in the predicates to produce a set of resulting tuples (i.e., an inner join, for those with relational database experience). In practice, this can mean that a LogiQL calculation doesn’t produce data that might be expected in the business context. Application developers must consider how conjunctions in rules and missing tuples in predicates affect the results produced by the rules. As an example, if the returns predicate has the following two tuples of the form (sku, store, day, returns)

“T-shirt”,  “Atlanta-Midtown”,  “20150708”,  1
“T-shirt”,  “Athens-North”,     “20150702”,  2

then the conjunction of the above sales tuples with these returns tuples contains only those tuples that have the same keys in both predicates (the tuples are of the form (sku, store, day, sales, returns)):

“T-shirt”,  “Atlanta-Midtown”,  “20150708”,  12, 1
“T-shirt”,  “Athens-North”,     “20150702”,   4, 2
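
The key intersection above can also be modeled outside LogiQL. The following Python sketch is only an illustration of the join semantics, not of how the LogicBlox engine evaluates rules. The dictionaries and the function name inner_join are introduced here for the example and are not part of any LogicBlox API.

```python
# Model each functional predicate as a dict from key tuple to value.
sales = {
    ("Sweater", "Atlanta-Midtown", "20150704"): 3,
    ("T-shirt", "Atlanta-Midtown", "20150708"): 12,
    ("Windbreaker", "Athens-North", "20150708"): 1,
    ("T-shirt", "Athens-North", "20150702"): 4,
}
returns = {
    ("T-shirt", "Atlanta-Midtown", "20150708"): 1,
    ("T-shirt", "Athens-North", "20150702"): 2,
}


def inner_join(left, right):
    """Pair up the values for keys that are populated in both predicates."""
    return {key: (left[key], right[key]) for key in left.keys() & right.keys()}


joined = inner_join(sales, returns)
for key in sorted(joined):
    print(key, joined[key])
```

Only the two T-shirt keys survive, because they are the only keys populated in both dictionaries.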

In past LogicBlox releases (i.e., 3.x), the concept of a “default value” of a predicate has been used to help manage these issues. LogicBlox version 4 releases have recently reintroduced the default value concept. In the remainder of this chapter we discuss the use of default values in more detail.

18.1. Net Sales Example

Consider a simple retail example where net_sales is computed by subtracting returns from sales. An implementation of this in LogiQL might look like the following:

// Define sku, store, and day entities (used as key types for other predicates)
sku(sk), sku_id(sk:id) -> string(id).
store(st), store_id(st:id) -> string(id).
day(d), day_id(d:id) -> string(id).

// Define sales, returns, and net_sales data predicates
sales[sk, st, d] = v -> sku(sk), store(st), day(d), decimal(v).
returns[sk, st, d] = v -> sku(sk), store(st), day(d), decimal(v).
net_sales[sk, st, d] = v -> sku(sk), store(st), day(d), decimal(v).

// Compute net_sales from sales and returns
net_sales[sk, st, d] = sales[sk, st, d] - returns[sk, st, d].

The net_sales rule above will only produce values for [sku, store, day] key combinations that have both a sales and a returns value. To get a little insight into the reasons for this, it might help to look at a slightly more verbose, but equivalent, rule for computing net_sales:

net_sales[sk, st, d] = ns <-
   sales[sk, st, d] = s,
   returns[sk, st, d] = r,
   ns = s - r.

This rule can be read as “Assign net_sales for a particular [sku, store, day] to the value ns where ns is s - r AND s is the sales value for the [sku, store, day] AND r is the returns value for the [sku, store, day]”. The conjunctions (ANDs) in LogiQL cause net_sales values to be produced for the intersection of the [sku, store, day] keys in both sales and returns. In other words, for those with traditional relational database experience, the LogiQL database engine is doing an inner join between sales and returns. If a [sku, store, day] combination has a sales value but no returns value (or vice versa), then the intersection will not contain that [sku, store, day] and no net_sales value for it will be produced.

To see this, put the above rules into a file called sales.logic, and create another file called load_data.logic that contains

+sku(sk), +sku_id[sk] = "sku_1".
+sku(sk), +sku_id[sk] = "sku_2".

+store(st), +store_id[st] = "store_A".
+store(st), +store_id[st] = "store_B".

+day(d), +day_id[d] = "20150601".
+day(d), +day_id[d] = "20150602".
+day(d), +day_id[d] = "20150603".

^sales[sk, st, d] = 10.0d <-
   sku_id[sk] = "sku_1",
   day_id[d] = "20150601",
   ( store_id[st] = "store_A"
   ; store_id[st] = "store_B"
   ).

^returns[sk, st, d] = 2.0d <-
   sku_id[sk] = "sku_1",
   store_id[st] = "store_A",
   day_id[d] = "20150601".

Then execute the following commands:

lb create --overwrite /defval
lb addblock -f sales.logic /defval
lb exec -f load_data.logic /defval
lb print /defval sales
lb print /defval returns
lb print /defval net_sales

The output should be similar to the following (same number of rows and same values produced, the key indices in brackets might differ):

$ lb print /defval sales
[10000000005] "sku_1" [10000000004] "store_A" [10000000007] "20150601" 10.00000
[10000000005] "sku_1" [10000000006] "store_B" [10000000007] "20150601" 10.00000
$ lb print /defval returns
[10000000005] "sku_1" [10000000004] "store_A" [10000000007] "20150601" 2.00000
$ lb print /defval net_sales
[10000000005] "sku_1" [10000000004] "store_A" [10000000007] "20150601" 8.00000 

In this example, the sales predicate has values for the key combinations of [sku_1, store_A, 20150601] and [sku_1, store_B, 20150601]. The returns predicate has only one value for the key combination of [sku_1, store_A, 20150601]. The intersection (inner join) between the keys of sales and returns is [sku_1, store_A, 20150601], which means that net_sales will only have one value. This is counterintuitive from a business perspective, and most likely not what was intended.

The net_sales calculation would make more sense if a missing returns or missing sales was treated as if the value were zero, producing a net_sales value if either sales OR returns had a value for a particular [sku, store, day] key. This desired behavior would use the union (outer join for those with relational database experience) of sales and returns to determine what net_sales keys should contain values. This can be accomplished in LogiQL either by using disjunctive rules or by setting default values for the three predicates involved in the calculation. Both approaches are discussed in more detail below.
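
As a rough model of the desired behavior, the Python sketch below computes net sales over the union of keys and treats a missing value as zero. It illustrates the outer-join semantics only. The function name net_sales_outer and the dictionary representation are assumptions made for this example, not LogiQL constructs.

```python
from decimal import Decimal

# Same data as load_data.logic, modeled as dicts from key tuple to value.
sales = {
    ("sku_1", "store_A", "20150601"): Decimal("10.0"),
    ("sku_1", "store_B", "20150601"): Decimal("10.0"),
}
returns = {
    ("sku_1", "store_A", "20150601"): Decimal("2.0"),
}


def net_sales_outer(sales, returns, zero=Decimal("0")):
    """Compute sales - returns over the union of keys, missing values as zero."""
    keys = sales.keys() | returns.keys()
    return {key: sales.get(key, zero) - returns.get(key, zero) for key in keys}


net_sales = net_sales_outer(sales, returns)
for key in sorted(net_sales):
    print(key, net_sales[key])
```

In this model store_B receives a net sales value of 10.0 even though it has no returns tuple.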

18.2. Disjunctive Solution

The rules in sales.logic can be altered as follows, using disjunction to tell the database engine how to compute a net_sales value if the value for either sales OR returns is missing for any [sku, store, day] key that is in the union of existing sales and returns keys.

// Define sku, store, and day entities (used as key types for other predicates)
sku(sk), sku_id(sk:id) -> string(id).
store(st), store_id(st:id) -> string(id).
day(d), day_id(d:id) -> string(id).

// Define sales, returns, and net_sales data predicates
sales[sk, st, d] = v -> sku(sk), store(st), day(d), decimal(v).
returns[sk, st, d] = v -> sku(sk), store(st), day(d), decimal(v).
net_sales[sk, st, d] = v -> sku(sk), store(st), day(d), decimal(v).

// Compute net_sales from sales and returns, filling in missing
// values with zeros
net_sales[sk, st, d] = sales[sk, st, d] - returns[sk, st, d].
net_sales[sk, st, d] = sls - ret <-
   returns[sk, st, d] = ret,
   !sales[sk, st, d] = _,
   sls = 0.0d.
net_sales[sk, st, d] = sls - ret <-
   sales[sk, st, d] = sls,
   !returns[sk, st, d] = _,
   ret = 0.0d.

After changing the sales.logic file, execute the following commands again

lb create --overwrite /defval
lb addblock -f sales.logic /defval
lb exec -f load_data.logic /defval
lb print /defval sales
lb print /defval returns
lb print /defval net_sales

to see the output expected originally

$ lb print /defval sales
[10000000006] "sku_1" [10000000000] "store_A" [10000000005] "20150601" 10.00000
[10000000006] "sku_1" [10000000003] "store_B" [10000000005] "20150601" 10.00000
$ lb print /defval returns
[10000000006] "sku_1" [10000000000] "store_A" [10000000005] "20150601" 2.00000
$ lb print /defval net_sales
[10000000006] "sku_1" [10000000000] "store_A" [10000000005] "20150601" 8.00000
[10000000006] "sku_1" [10000000003] "store_B" [10000000005] "20150601" 10.00000

Writing disjunctive rules like this is reasonable for a toy example, but is both tedious and error-prone for real applications that contain many thousands of rules. It is especially problematic for rules whose bodies reference many predicates that might not all have values for the same keys, as one must then consider all the predicate combinations to determine the correct behavior. Moreover, this approach entails potentially large storage and performance penalties: see Section 18.4, “Storage and Performance Implications”.

18.3. Default Value Solution

A better approach is to set default values for the sales, returns, and net_sales predicates. A predicate with a default value has a tuple for every combination of key values (i.e., it is a total function). In this case, specifying that each of these predicates has a default value of zero effectively means that all [sku, store, day] key combinations with missing values in the examples above will now have values of zero. Since values are defined for all [sku, store, day] key combinations, the intersection (inner join) of sales and returns in the net_sales rule will now have the same effect as the union (outer join), producing a net_sales tuple for every possible [sku, store, day] combination. Note that the LogicBlox database stores only non-default tuples: this can lead to significant space savings (see Section 18.4, “Storage and Performance Implications”).

To use default values in the net_sales example, change the sales.logic file to contain lang:defaultValue directives as follows

// Define sku, store, and day entities (used as key types for other predicates)
sku(sk), sku_id(sk:id) -> string(id).
store(st), store_id(st:id) -> string(id).
day(d), day_id(d:id) -> string(id).

// Define sales, returns, and net_sales data predicates
sales[sk, st, d] = v -> sku(sk), store(st), day(d), decimal(v).
lang:defaultValue[`sales] = 0.0d.

returns[sk, st, d] = v -> sku(sk), store(st), day(d), decimal(v).
lang:defaultValue[`returns] = 0.0d.

net_sales[sk, st, d] = v -> sku(sk), store(st), day(d), decimal(v).
lang:defaultValue[`net_sales] = 0.0d.

// Compute net_sales from sales and returns
net_sales[sk, st, d] = sales[sk, st, d] - returns[sk, st, d].

Execute the same set of commands as before.

lb create --overwrite /defval
lb addblock -f sales.logic /defval
lb exec -f load_data.logic /defval
lb print /defval sales
lb print /defval returns
lb print /defval net_sales

The output should be the same as for the disjunctive example above.

$ lb print /defval sales
[10000000004] "sku_1" [10000000005] "store_A" [10000000001] "20150601" 10.00000
[10000000004] "sku_1" [10000000007] "store_B" [10000000001] "20150601" 10.00000
$ lb print /defval returns
[10000000004] "sku_1" [10000000005] "store_A" [10000000001] "20150601" 2.00000
$ lb print /defval net_sales
[10000000004] "sku_1" [10000000005] "store_A" [10000000001] "20150601" 8.00000
[10000000004] "sku_1" [10000000007] "store_B" [10000000001] "20150601" 10.00000 

Note that the tuples that have a default value are not printed, because they would usually be very numerous.

18.4. Storage and Performance Implications

For good performance, it is important that:

  • the database does not store tuples that have a default value, for example returns values of zero, and
  • the database does not do unnecessary computations over default values, for example subtracting returns tuples with value zero from sales tuples with value zero to compute net_sales of zero.

The LogicBlox database does not physically store every logical value that may be contained in a predicate with a default value. A retail application could have millions of skus, thousands of stores, and thousands of days. If there is only one returns value for one sku at one store on one day, it makes sense to physically store only one tuple and not consume space for the missing values for [sku, store, day] key combinations. For this reason, the database only stores the non-default tuples. This kind of storage is often described as sparse storage. Apart from saving space, sparse storage has significant performance benefits: when evaluating a rule, the system need not consider billions of possible keys with default values.

In the disjunctive example above, both sales and returns consume space only for the non-default tuples. But there are extra disjunctive rules which insert zero values for any missing sales or returns values. These rules will cause the net_sales predicate to be fully populated (i.e., it will have a physically stored value for every possible key combination). This could not only take up a lot of disk space but could also introduce performance and storage problems for other rules that refer to net_sales.

For predicates with a default value, the LogicBlox database uses sparse storage: it physically stores only values that are different from the default value. For the example in Section 18.3, “Default Value Solution”, the sales, returns, and net_sales predicates will store only non-zero values, so these predicates will tend to consume much less disk space and take less computation time in rules. When executing rules whose bodies refer to default valued predicates, the system constructs a logical view of the predicate where “missing” values (those not physically stored) are replaced by the default value as needed. The optimization challenge for the database is to minimize the usage of default value tuples as much as possible.
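
The difference between what is physically stored and what is logically present can be sketched in Python. This is a simplified model under stated assumptions, not a description of the engine's internals: the names stored_returns, lookup, and logical_view are hypothetical, and the key domains are the small ones from load_data.logic.

```python
from itertools import product

# Key domains from load_data.logic.
skus = ["sku_1", "sku_2"]
stores = ["store_A", "store_B"]
days = ["20150601", "20150602", "20150603"]

DEFAULT = 0

# Sparse storage: only tuples whose value differs from the default are kept.
stored_returns = {("sku_1", "store_A", "20150601"): 2}


def lookup(stored, key, default=DEFAULT):
    """Return the stored value, or the default for a key that is not stored."""
    return stored.get(key, default)


# The logical view has a value for every possible key combination.
logical_view = {
    key: lookup(stored_returns, key) for key in product(skus, stores, days)
}

print(len(stored_returns))  # 1 physically stored tuple
print(len(logical_view))    # 12 logical tuples (2 skus * 2 stores * 3 days)
```

Materializing logical_view is exactly the cost that sparse storage avoids. With realistic domain sizes the logical view would be far too large to enumerate.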

Note that mixing predicates with and without default values can result in storage explosion similar to the one we saw in the disjunctive example. If sales and returns have a default value of zero, but net_sales does not have a default value defined, then the net_sales = sales - returns rule will end up fully populating the net_sales predicate. The rule can be written to filter out the zeros in net_sales, but it is best to consistently use or not use default values in application predicates related to each other via LogiQL rules. A predicate without a default value can usually be efficiently computed from predicates with default values by filtering out the virtual default values from the predicates used in the calculation. For example, consider a predicate called yesterday_sales that doesn’t have a default value and is computed from the sales predicate that has a default value of zero. Zero values in the sales predicate can be filtered out and excluded from the yesterday_sales values by adding an s != 0 conjunct, as in the rule below:

yesterday_sales[sk, st, yesterday] = s <-
   s = sales[sk, st, d], yesterday=day:previous[d], s != 0.

The evaluation optimization doesn’t currently apply when a condition like s != 0 is used on values produced by a rule: the condition is only effective when used on values that are inputs to the rule. For example, the following would not necessarily be evaluated very efficiently:

net_sales[sk, st, d] = ns <-
   ns = sales[sk, st, d] - returns[sk, st, d], ns != 0.

18.5. Consistent Default Values

The default values for predicates that are used in LogiQL rules must be consistent. For example, if the default value of sales is 3 and the default value of returns is 1, the default value of net_sales must be 2 (because net_sales = sales - returns). The LogiQL compiler will report errors for rules that use predicates with incompatible default values. This is not a major restriction since the most common default values (zero for numeric predicates, false for boolean predicates, empty strings) will work as expected in most cases.
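
The arithmetic behind this requirement can be checked with a small Python sketch. It shows only the consistency condition for the net_sales rule and makes no claim about how the LogiQL compiler performs its check. The function name defaults_consistent is hypothetical.

```python
def defaults_consistent(head_default, sales_default, returns_default):
    """True if net_sales = sales - returns maps the body defaults to the head default."""
    return sales_default - returns_default == head_default


print(defaults_consistent(2, 3, 1))  # True: 3 - 1 == 2
print(defaults_consistent(0, 3, 1))  # False: 3 - 1 != 0
print(defaults_consistent(0, 0, 0))  # True: the common all-zero case
```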

18.6. Data Updates

A predicate with a default value logically has a tuple for every possible key combination, so insertion and retraction operations do not make sense. Instead, always use upsert operations on such predicates. Upserting the default value is equivalent to retraction if the previous value was not the default. For example:

// insert a new value or change an existing value
^sales[sk, st, d] = 10.0d <-
   sku_id[sk] = "sku_1",
   day_id[d] = "20150601",
   store_id[st] = "store_A".

// clear (retract) a value by updating to the default value
^sales[sk, st, d] = 0.0d <-
   sku_id[sk] = "sku_2",
   day_id[d] = "20150601",
   store_id[st] = "store_B".
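
The effect of these two kinds of upsert on sparse storage can be modeled in Python. This is a minimal sketch assuming that stored tuples are kept in a dictionary. The upsert function and the value 7 used to populate the second key are introduced purely for illustration.

```python
def upsert(stored, key, value, default):
    """Set key to value; upserting the default drops the stored tuple."""
    if value == default:
        stored.pop(key, None)  # no-op if the key was already at the default
    else:
        stored[key] = value


DEFAULT = 0
sales = {}

# Insert a new value, then change it.
upsert(sales, ("sku_1", "store_A", "20150601"), 5, DEFAULT)
upsert(sales, ("sku_1", "store_A", "20150601"), 10, DEFAULT)

# Populate a second key (hypothetical value), then clear it with the default.
upsert(sales, ("sku_2", "store_B", "20150601"), 7, DEFAULT)
upsert(sales, ("sku_2", "store_B", "20150601"), 0, DEFAULT)

print(sales)
```

After these calls only the sku_1 tuple remains stored. The sku_2 key is back at its default value and occupies no space in the model.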

18.7. Caveats

A default value cannot be defined for a predicate that has any key of a primitive type (int, float, decimal, string, etc.). This is because the database system must have a finite set of potential key values for a default-valued predicate. Remember that a predicate with a default value has a logical value for every possible key combination. If such a predicate had a key type with an infinite set of possible values (like the primitive types), rules that refer to the predicate would have to logically consider an infinite number of tuples.

A few operations are not currently supported (or are not efficient) for predicates with default values:

  • min and max aggregations over predicates with default values are not implemented efficiently;
  • count aggregations are generally a problem when used on predicates with default values, and are better written as multiplications of count aggregations on mappings to higher aggregation levels. Count aggregations that exclude the default value (e.g., have F[x] != 0 in the body) are efficient.
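
The text does not spell out the count rewrite, so the Python sketch below shows one possible reading of it, under the assumption that the count is taken per store over a default-valued sales predicate. Because every key combination logically has a value, the per-store count equals the number of skus times the number of days, and can be computed without visiting any tuple. All names here are hypothetical, and the sketch models the idea rather than any LogiQL aggregation syntax.

```python
from itertools import product

skus = ["sku_1", "sku_2"]
stores = ["store_A", "store_B"]
days = ["20150601", "20150602", "20150603"]

DEFAULT = 0
stored_sales = {
    ("sku_1", "store_A", "20150601"): 10,
    ("sku_1", "store_B", "20150601"): 10,
}

# Naive count per store: walks every logical tuple, including defaults.
naive_count = {st: sum(1 for _ in product(skus, days)) for st in stores}

# Same result from multiplying the sizes of the remaining key domains.
product_count = {st: len(skus) * len(days) for st in stores}

# Count that excludes the default: only stored tuples need to be visited.
non_default_count = {st: 0 for st in stores}
for (sk, st, d), value in stored_sales.items():
    if value != DEFAULT:
        non_default_count[st] += 1

print(naive_count == product_count)
print(non_default_count)
```

The last loop corresponds to the efficient case mentioned above: a count with a condition such as F[x] != 0 only touches the sparse, physically stored tuples.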

Also note that you cannot change a default value once it has been set. Default values are specified at predicate declaration time and are fixed from then on.