Chapter 29. Data Exchange Services

29.1. Configuring Tabular Data Exchange Services

Tabular data exchange (TDX) services are HTTP services that offer delimited files for download to export data from LogicBlox workspaces (GET), and that support uploading delimited files to import data (POST/PUT). ServiceBlox provides a built-in handler for defining such tabular data exchange services at a very high level. This section describes how to configure and use these services.

To introduce TDX file services we will use a simple example of multi-dimensional sales data. Consider a workspace with the following schema defined for the hierarchy and measures.

block(`hierarchy) {
  export(`{
    sku(x), sku_id(x:s) -> string(s).
    store(x), store_id(x:s) -> string(s).
    week(x), week_id(x:s) -> string(s).
  })
} <--. 
block(`measure) {
  alias_all(`hierarchy),
  export(`{
    sales[x, y, z] = v -> sku(x), store(y), week(z), int(v).
  })
} <--. 

For this application the customer uses a delimited file for sales data, as in the following example.

SKU     | STORE       | WEEK | SALES
apples  | atlanta     | W1   | 10
oranges | atlanta     | W2   | 15
apples  | portland    | W1   | 20
oranges | portland    | W2   | 5

We shall define a tabular data exchange service to import data in this delimited file format to the sku, store, week and sales predicates. We shall also define a service to export from these predicates to a delimited file in this format.

A tabular data exchange service is defined by three parts:

  • File definition, which defines the format of a file, such as header names, column formats, optional columns, and the delimiter character that is used.

  • File binding, which specifies how columns in a delimited file are bound to predicates in the workspace. This file binding is a high-level, bi-directional specification, which means that it can be used both for import and for export.

  • Service configuration, which defines the service to be hosted by the ServiceBlox service container.

The remainder of this section describes the ServiceBlox programming interface for defining these parts. The ServiceBlox handler for tabular data exchange services uses predicates in the lb:web:delim namespaces: file definitions are declared in lb:web:delim:schema and predicate bindings in lb:web:delim:binding. To avoid cluttering the logic, and to make it more readable, it is good practice to use aliases.

29.1.1. File Definition

A delimited file is defined by creating a lb:web:delim:schema:file_definition element and populating interface predicates, and then saving this by name in lb:web:delim:schema:file_definition_by_name. Example code:

block(`files) {
  alias_all(`lb:web:delim:schema),
  alias_all(`lb:web:delim:schema_abbr),

  clauses(`{
    file_definition_by_name["sales"] = fd,
    file_definition(fd) {
      file_delimiter[] = "|",
      column_headers[] = "SKU,STORE,WEEK,SALES",
      column_formats[] = "alphanum,alphanum,alphanum,integer"
    }.
  })
} <--. 
Required file definition interface settings
file_delimiter lb:web:delim:schema

Delimiter character.

column_headers lb:web:delim:schema_abbr

Comma separated list of file headers.

column_formats lb:web:delim:schema_abbr

Comma separated list of column formats (see the table below for supported column formats).

The order of the column names in column_headers must match the order of the column formats in column_formats but does not necessarily correspond to the order of the columns in the delimited file.

Optional file definition interface settings
file_mode lb:web:delim:schema

Specification of how to handle quoting and escaping (see File Modes).

file_columns_required lb:web:delim:schema_abbr

Comma-separated list of required columns. Will make all columns that are not listed optional.

file_column_required lb:web:delim:schema_abbr

Set the column with this header as required. Will make all columns that are not listed optional.

file_columns_optional lb:web:delim:schema_abbr

Comma-separated list of optional columns (i.e., missing values are allowed).

file_column_optional lb:web:delim:schema_abbr

Set the column with this header as optional (i.e., missing values are allowed).

file_columns_can_be_absent lb:web:delim:schema_abbr

Comma-separated list of columns that can be optionally absent (i.e., missing values are allowed, and, moreover, the entire column may be missing from the file on imports).

file_column_can_be_absent lb:web:delim:schema_abbr

Set the column with this header to be optionally absent (i.e., missing values are allowed, and, moreover, the entire column may be missing from the file on imports).

file_column_format lb:web:delim:schema_abbr

Set the format of a column by its header.

column_description lb:web:delim:schema

A textual description for a column. This information is displayed by the meta-service and can be used to document file definitions.

The file_column_format setting provides an alternative way to specify column formats. For example,

file_column_formats["SKU"] = "alphanum"

specifies that the format of the SKU column is alphanumeric. The column_formats setting may be omitted, provided that file_column_format is used to specify the format of each column listed in the column_headers setting.
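
For example, the sales file definition shown earlier could equivalently specify its formats per column. The following is a sketch built only from the settings documented above, and is equivalent to the earlier definition:

file_definition_by_name["sales"] = fd,
file_definition(fd) {
  file_delimiter[] = "|",
  column_headers[] = "SKU,STORE,WEEK,SALES",
  // per-column formats replace the single column_formats string
  file_column_format["SKU"] = "alphanum",
  file_column_format["STORE"] = "alphanum",
  file_column_format["WEEK"] = "alphanum",
  file_column_format["SALES"] = "integer"
}.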

The following table lists the currently supported formats. TDX will enforce these formats. During an import, it is an error if a data value does not conform to the specified format (e.g., if -1 is bound to a positive integer column). During exports, if data in the predicate does not conform to the format, it is simply filtered out from the exported file.

Supported column formats
Syntax Description Details
alphanum

Alphanumeric string that maps to a string.

The string is trimmed. Only letters and numbers are valid.

string

A string that must not be empty.

The string is trimmed. Must not be empty after trimming.

string*

A possibly empty string.

The string is trimmed. Can be empty after trimming. Cannot be used for an optional column.

raw_string

A string that is not processed by TDX.

The string is not trimmed. Can be empty. Cannot be used for an optional column.

char

A string that has a single character.

The string is trimmed. Can be the empty character. Must have a single character.

uuid

A Universally Unique Identifier (UUID) string.

The string is trimmed. Only UUIDs are valid.

integer

Integer number.

0+

Non-negative integer.

1+ or >0

Positive integer.

decimal

Decimal number.

0.0+

Non-negative decimal.

>0.0

Positive decimal.

float

Floating-point number.

0.0f+

Non-negative float.

>0.0f

Positive float.

boolean(t;f)

Boolean value.

The literals t and f are case-insensitive specifications of the expected formats for true and false. For example, this could be (1;0), (t;f) or (true;false). Values that are different from these two options are considered invalid.

datetime(format)

Datetime value.

The value is serialized to and from string with datetime:formatTZ and datetime:parse using the format string, e.g., datetime('%m/%d/%y'). See documentation on built-ins.

date(format)

Date value.

The value is serialized to and from string with datetime:formatTZ and datetime:parse using the format string, e.g., date('%m/%d/%y'). See documentation on built-ins. Imports create datetime objects in the UTC timezone.

Format validations

Additional validations can be added to most formats, as illustrated in the code below:

  file_definition_by_name["sales"] = fd,
  file_definition(fd) {
    file_delimiter[] = "|",
    column_headers[] = "SKU,SALES",
    column_formats[] = "string([a-zA-Z0-9]*),float(>=0; <=20; precision 2)"
  }.

In the example above, the string column defines the regular expression [a-zA-Z0-9]*, indicating that it accepts only alphanumeric characters. We also have a float column defining three validations, separated by semicolons (sample rows illustrating them follow the list):

  • >=0: indicating it must be non-negative.
  • <=20: indicating it must not be greater than 20.
  • precision 2: indicating this float number must not have more than 2 digits after the decimal point.
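
As an illustration, the following rows would be accepted or rejected under the definition above (the annotations on the right are explanatory and not part of the file):

SKU    | SALES
abc123 | 19.99      accepted
abc123 | 20.5       rejected: greater than 20
abc123 | 3.141      rejected: more than 2 digits after the decimal point
ab-c   | 5          rejected: '-' is not matched by [a-zA-Z0-9]*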

The following table lists currently available validations:

Format Description
>

Greater Than (example >0). Accepted in any numerical format.

>=

Greater Than or Equal To (example >=0). Accepted in any numerical format.

<

Less Than (example <0). Accepted in any numerical format.

<=

Less Than or Equal To (example <=0). Accepted in any numerical format.

precision

Maximum number of digits after the decimal point. For example, precision 2 will accept 3, 3.1 and 3.14, but reject 3.141. Accepted in float and decimal formats. Note: integer formats accept this validation, but it has no effect.

Regular Expression

If the validation does not match any of the above, it defaults to a regular expression. Regular expressions are accepted in all formats, except boolean, date, datetime and uuid. Regular expressions are strings and must be escaped like any string literal. The following example defines a string with a simple regular expression:

  file_definition_by_name["sales"] = fd,
  file_definition(fd) {
    file_delimiter[] = "|",
    column_headers[] = "SKU,SALES",
    column_formats[] = "string([a-zA-Z0-9]*),float"
  }.

Special attention is needed if the regular expression includes comma (,) or semicolon (;) characters, which are not allowed in the above notation. If needed, a regular expression with these characters must be defined using the auxiliary format_regex predicate and referenced in the format, as illustrated below:

  file_definition_by_name["sales"] = fd,
  file_definition(fd) {
    file_delimiter[] = "|",
    column_headers[] = "SKU,SALES",
    column_formats[] = "string(exp1),float",
    format_regex["exp1"] = "[a-zA-Z]{0,9}"
  }.

Optional Columns

TDX provides support for dealing with missing values, such as the following file:

SKU     | STORE       | WEEK | YEAR | SALES
apples  | atlanta     | W1   | 2012 |
oranges | atlanta     | W2   | 2012 | 15
apples  | portland    | W1   | 2012 |
oranges | portland    | W2   | 2012 | 5

In TDX, a column can be "required", "optional", or "possibly absent". If a column is required, this means that it must be present in the file, and every row must specify a value for that column. If a column is optional, this means that it must be present in the file, but some rows may have a missing value for that column. If a column is allowed to be absent, this means that missing values are allowed, and, furthermore, the entire column may be absent from the file on imports.

By default, all columns mentioned in the file definition are required columns. To change this, we can use the file_columns_required or the file_columns_optional predicate, as shown below. When the file_columns_required predicate is used, all columns not specified in that predicate are treated as optional columns. When the file_columns_optional predicate is used, all columns not specified in that predicate are treated as required columns.

file_definition_by_name["sales"] = fd,
file_definition(fd) {
  file_delimiter[] = "|",
  column_headers[] = "SKU,STORE,WEEK,SALES",
  column_formats[] = "alphanum,alphanum,alphanum,integer",
  file_columns_required[] = "SKU,STORE,WEEK"
}.

Possibly absent columns are specified using file_columns_can_be_absent, as in the following example:

file_definition_by_name["sales-returns"] = fd,
file_definition(fd) {
  file_delimiter[] = "|",
  column_headers[] = "SKU,STORE,WEEK,SALES,RETURNS",
  column_formats[] = "alphanum,alphanum,alphanum,integer",
  file_columns_can_be_absent[] = "SALES,RETURNS"
}.

In this example, the SALES and RETURNS columns may or may not be present in the file. If they are present, then, in any given row, a value may but need not be given for each of these columns.

File Modes

TDX allows the customization of how data values are quoted and escaped in import and export services. The file mode specification consists of a string with zero or more of the following entries.

Supported file mode entries
Syntax Import Export
raw

Import text as is (will break if string contains delimiter or line break).

Export text as is (will create bad csv when string contains delimiter or line break).

quote=c

Parse c as the quote character (c must be a single character, but not a whitespace character).

Use c as quote character.

unix

Removes quotes and unescapes all escaped characters, i.e., \n, \r, \t and the quote character.

Adds quotes and escapes all characters that need escaping.

excel

Removes quotes and replaces doubled quote characters with a single quote character.

Adds quotes, doubles the quote character, and escapes \n, \r and \t.

Escaping is applied upon export, and unescaping upon import. In the unix style, escaping is done by prefixing the quote character, line-feed and carriage-return with a backslash (e.g., the value foo"bar is exported as "foo\"bar"). The backslashes are removed on imports. In excel mode, escaping is done by doubling the quote character (e.g., "foo""bar").
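
For example, in excel mode a single string column X containing an embedded quote character would round-trip roughly as follows (a sketch; this assumes the quote character is left at its default "):

// import (excel mode)
X
foo
"say ""hi"""

// export (excel mode)
X
"foo"
"say ""hi"""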

The default file mode is unix quote=" which means that by default Unix escaping rules are used, and values are enclosed in double quotes. That is, imports require values to be either not quoted or quoted with ", and exports always quote with ". For example, a file with a single string column named X behaves like the following by default:

// import
X
foo
"bar"

// export
X
"foo"
"bar"

This behavior can be changed statically with the file_mode attribute. For example, the following definition will export the file with ' as the quote character.

file_definition_by_name[name] = fd,
file_definition(fd) {
  file_delimiter[] = "|",
  file_mode[] = "quote='",
  column_headers[] = "X",
  column_formats[] = "string"
}.

This allows the following interaction (note that double quotes are now simply part of the record):

// import
X
foo
"bar"
'baz'

// export
X
'foo'
'"bar"'
'baz'

By setting the file_mode to raw, no quotes are added on export, and import values are treated completely as part of the record:

// import
X
foo
"bar"
'baz'

// export
X
foo
"bar"
'baz'

Finally, it is possible to override the static default by using a URL parameter to the TDX service:

// POST to /file?tdx_file_mode="raw"
X
foo
"bar"
'baz'

// GET to /file?tdx_file_mode="quote=*"
X
*foo*
*"bar"*
*'baz'*

29.1.2. File Binding

The file binding for a delimited file is defined by creating a lb:web:delim:binding:file_binding element, populating the interface predicates, and then saving this by name in lb:web:delim:binding:file_binding_by_name. Here is an example that shows a basic binding to one predicate:

block(`server_init) {
  alias_all(`lb:web:delim:binding),
  alias_all(`lb:web:delim:binding_abbr),

  clauses(`{
    file_binding_by_name["sales"] = fb,
    file_binding(fb) {
      file_binding_definition_name[] = "sales",
      predicate_binding_by_name["measure:sales"] =
        predicate_binding(_) {
          predicate_binding_columns[] = "SKU,STORE,WEEK,SALES"
        }
    }.
  })
} <--.

The predicate_binding_columns setting is used to map file column names onto the attributes of the predicate, in order. In this example, the columns "SKU,STORE,WEEK,SALES" map to a predicate defined as measure:sales[sku, store, week] = sales. Some of the predicate attributes, such as the keys in this example, may have entity types, provided that the entity type has a reference-mode predicate. In this case, the values in the associated file column are treated as refmode values for the entity type in question (see Section 29.1.5, “Import and Export” below).

The following table shows how TDX maps primitive LogiQL types into column formats. For each primitive type, the table lists the compatible column formats. For example, int primitives can be bound directly onto integer, 0+ and >0 columns, but int128 primitives can only be bound onto uuid columns. When the TDX generator encounters incompatible types it attempts to convert the primitive into some compatible primitive using standard from:to:convert predicates. The last column of the table shows the conversions supported by TDX. For example, an int primitive bound to a float column will be first converted using int:float:convert, and an int128 bound to uuid will be converted with int128:string:convert.

Supported column format bindings and primitive conversions
Primitive LogiQL type Column formats Primitive conversions
int integer, 0+, >0 decimal, float, string
float float, 0.0f+, >0.0f decimal, int, string
decimal decimal, 0.0+, >0.0 float, string
string raw_string, string, string*, alphanum, char, uuid boolean, datetime, decimal, float, int, int128
boolean boolean string
datetime date, datetime string
int128 uuid string
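
For example, suppose the workspace stores revenue as a decimal while the file uses a float column. Per the table above, the decimal value would be converted following the from:to:convert naming pattern (decimal:float:convert in this direction). The following is a minimal sketch with hypothetical predicate names (revenue, measure:revenue):

// Hypothetical schema: revenue is stored as a decimal in the workspace.
revenue[x] = v -> sku(x), decimal(v).

file_definition_by_name["revenue"] = fd,
file_definition(fd) {
  file_delimiter[] = "|",
  column_headers[] = "SKU,REVENUE",
  // REVENUE is a float column; the decimal value is converted by TDX
  column_formats[] = "alphanum,float"
},
file_binding_by_name["revenue"] = fb,
file_binding(fb) {
  file_binding_definition_name[] = "revenue",
  predicate_binding_by_name["measure:revenue"] =
    predicate_binding(_) {
      predicate_binding_columns[] = "SKU,REVENUE"
    }
}.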

In general, a file binding can have any number of associated predicate bindings. When multiple predicate bindings are specified, then each predicate gets populated independently on import; on export, the predicates are joined to give rise to rows of the output file (and there are some more intricacies when some of the columns have been marked as optional or some of the bindings are to default-valued predicates). See the section on import and export below for more details.

The above example binding will support import into the sales predicate, assuming that entity elements already exist in sku, store, and week. It is common to optionally add new elements to the entity types on import; this is configured by populating the lb:web:delim:binding_abbr:file_binding_entity_creation predicate. Example:

file_binding(fb) {
  ...
  file_binding_entity_creation[] = "accumulate",
  ...
}

File binding configurations may apply to all predicate bindings, as in the above entity creation example, or to individual predicate bindings.

Required file binding settings
file_binding_definition_name

Specifies the file definition that this file binding is associated with.

predicate_binding_by_name

Specifies predicate bindings associated with this file binding.

Optional file binding settings
file_binding_predicate_columns lb:web:delim:binding_abbr

Comma-separated list of column headers. Applies to all predicate bindings for this file binding.

file_binding_entity_creation lb:web:delim:binding_abbr

Set entity creation for all predicate bindings of a file binding. The supported values are:

  • none: do not create entity elements. This is the default value if no entity creation is specified.
  • accumulate: add new elements that did not exist previously.
  • ignore: skip lines with non-existing entities. This can be useful for partitioned deployments but can be dangerous as well, as there is no way to distinguish typos from entities that are not on that partition.

If entity creation is configured on the file binding, then it is recursively applied to all predicate bindings in this file binding. The setting on the predicate binding will recursively apply to all column bindings of the predicate binding.

file_binding_column_entity_creation lb:web:delim:binding_abbr

Default entity creation from the file binding to one specific argument. Applies to all predicate bindings for this file binding.

file_binding_ignore_idb_on_import lb:web:delim:binding

Specifies that predicate bindings that bind IDB predicates should be ignored on imports. Importing into IDBs is not possible because IDBs cannot be directly populated. This allows binding to IDBs that contain information to join on exports while reusing the file binding for imports into the remaining predicate bindings.

file_binding_import_default_value lb:web:delim:binding

By default, the value that represents an empty optional column on imports is the empty string. This option allows the definition of a different string to be deemed the empty value. On imports, an optional column with either the empty string or the default value is deemed missing and will not be imported. This allows the definition of "zero-stripping" policies to ignore import rows with certain values.

file_binding_export_default_value lb:web:delim:binding

By default, the value that represents an empty optional column on exports is the empty string. This option allows the definition of a different string to be deemed the empty value. On exports, an optional column without a value will be exported with the default value instead of with an empty string. This allows the definition of "zero-filling" policies.

file_binding_default_value lb:web:delim:binding

A shorthand to apply the same value to both file_binding_import_default_value and file_binding_export_default_value.
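
For example, to treat the literal "0" as the empty value in both directions (a combined "zero-stripping" and "zero-filling" policy), the shorthand could be used as follows. This is a sketch reusing the sales binding from above; it only affects optional columns, such as SALES in the earlier definition where file_columns_required lists only SKU, STORE and WEEK:

file_binding_by_name["sales"] = fb,
file_binding(fb) {
  file_binding_definition_name[] = "sales",
  // optional SALES values equal to "0" are skipped on import,
  // and missing SALES values are exported as "0"
  file_binding_default_value[] = "0",
  predicate_binding_by_name["measure:sales"] =
    predicate_binding(_) {
      predicate_binding_columns[] = "SKU,STORE,WEEK,SALES"
    }
}.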

The index argument for options that apply to a particular predicate argument is a zero-based index into the arguments of the predicate. For example, to enable entity creation for the sku and store entities use:

file_binding(fb) {
  ...
  file_binding_column_entity_creation[0] = "accumulate",
  file_binding_column_entity_creation[1] = "accumulate",
  ...
}
Options applying to individual predicate bindings
predicate_binding_columns lb:web:delim:binding_abbr

Comma-separated list of column headers. Multiple columns can be combined into a single binding component by separating them with a semicolon. A column-binding transform function must be provided to combine multiple columns into one value (see below).

predicate_binding_entity_creation lb:web:delim:binding_abbr

Set entity creation for all column bindings. See file_binding_entity_creation for the supported values. The setting on a predicate binding overrides the setting on the file binding, and will recursively apply to all column bindings of the current predicate binding.

predicate_binding_export_only lb:web:delim:binding

Specifies that this predicate binding should be applied only for exports, and should be ignored for imports. This is useful to specify export filters which do not apply for imports. See Section 29.1.5, “Import and Export” below for more details.

predicate_binding_import_only lb:web:delim:binding

Specifies that this predicate binding should be applied only for imports, and should be ignored for exports. This is useful to populate auxiliary predicates that cannot provide a bi-directional transformation. See Section 29.1.5, “Import and Export” below for more details.

predicate_binding_no_retraction_on_post lb:web:delim:binding

The default behavior on POST requests for predicate bindings that bind optional columns is to retract existing values for the keys if the value column is left empty. This behavior may be undesirable in certain circumstances. This flag can then be used to specify that empty optional columns should not cause a retraction on POSTs.

predicate_binding_filter lb:web:delim:filter

A filter to be applied on imports (use export_only predicate bindings to filter exports). The filter must be a predicate with the same arity as the columns being filtered. Only values that exist in the filter are imported on POSTs or PUTs. Furthermore, on PUTs, only values in the filter are retracted if they are not present in the new file.

column_binding_by_arg lb:web:delim:binding

Column binding to describe more precisely how column(s) from a delimited file map to an argument of a predicate.

Column binding options
column_binding_import_function, column_binding_export_function lb:web:delim:binding

Transformation functions to apply to value(s) of this column binding, see the section called “Transform Functions” below.

column_binding_entity_creation lb:web:delim:binding

Set entity creation for this column binding. See file_binding_entity_creation for the supported values.

Transform Functions

When importing a delimited file, it is often necessary or convenient to combine column values, or to perform a modification of column values prior to inserting them into predicates in a workspace. Similar manipulations may also be called for when exporting to a delimited file. TDX allows developers to specify column binding transform functions to be applied during imports and/or exports.

In the following we will introduce transform functions by means of examples. Suppose that we want to import this simple sales file and that the application model is the one shown in the following listing.

SKU     | STORE       | WEEK | YEAR | SALES
apples  | atlanta     | W1   | 2012 | 10
oranges | atlanta     | W2   | 2012 | 15
apples  | portland    | W1   | 2012 | 20
oranges | portland    | W2   | 2012 | 5

sku(x),   sku_id(x:id)   -> string(id).
store(x), store_id(x:id) -> string(id).
week(x),  week_id(x:id)  -> string(id).

sales[sku, store, week] = sales -> sku(sku), store(store), week(week), int(sales).

Now suppose that there are small mismatches between the file format and data, and the schema of our application. For example, we may want to have names of all stores in upper case characters. One possible way to transform the import file is to bind the STORE column to the refmode of sample:sales:store entities, and apply the string:upper function upon imports. This can be specified with the following file binding:

...
file_binding_by_name["sales"] = fb,
file_binding(fb) {
  file_binding_definition_name[] = "sales",
  file_binding_entity_creation[] = "accumulate",
  predicate_binding_by_name["sample:sales:store"] =
    predicate_binding(_) {
      predicate_binding_columns[] = "STORE",
      column_binding_by_arg[0] =
        column_binding(_) {
          column_binding_import_function[] = "string:upper"
        }
    }
}.
...

Note that import functions bind the values from columns to their keys, and produce a value that is then stored in the predicate being bound. This binding is thus roughly equivalent to applying string:upper[STORE] = id and then using id as the refmode of a store entity.

This file binding would create stores, but we are also interested in creating sales values. The problem now is that in the workspace weeks are identified by a concatenation of year and week id (for example, '2012-W1'). The solution is to define a concat_dash function that concatenates two strings with a dash ('-'). Then, we apply this function to the values of the YEAR and WEEK columns, and use the resulting value as the refmode of the week-of-year entity:

concat_dash[s1, s2] = s1 + "-" + s2.
lang:derivationType[`concat_dash]="Derived".

...
file_binding_by_name["sales"] = fb,
file_binding(fb) {
  file_binding_definition_name[] = "sales",
  file_binding_entity_creation[] = "accumulate",
  predicate_binding_by_name["sample:sales:sales"] =
    predicate_binding(_) {
      predicate_binding_columns[] = "SKU, STORE, YEAR;WEEK, SALES",
      column_binding_by_arg[2] =
        column_binding(_) {
          column_binding_import_function[] = "sample:sales:services:concat_dash"
        },
      column_binding_by_arg[1] =
        column_binding(_) {
          column_binding_import_function[] = "string:upper"
        }
    }
}.
...

The above file binding will concatenate YEAR and WEEK as well as apply string:upper to the STORE column. Note that column_binding_by_arg is indexed by the components in predicate_binding_columns, which are separated by commas (,). If a component contains multiple columns, because the function contains multiple arguments, they are separated by semicolons (;). Also note that any function can be used, including built-ins (such as string:upper), derived and materialized functions.

This file binding supports importing files, but it would fail for exports. The problem is that we need to specify the inverse of concat_dash, that is, how to decompose the week-of-year entity refmode to export the WEEK and YEAR columns. But before turning to this issue, let us see how we could extend the binding to export the names of store entities in lower case. This is accomplished by using the inverse export function, string:lower:

    ...
      column_binding_by_arg[1] =
        column_binding(_) {
          column_binding_import_function[] = "string:upper",
          column_binding_export_function[] = "string:lower"
        }
...

Note that export functions are applied with respect to the predicate being bound. This binding is thus roughly equivalent to applying store_id(x:id), string:lower[id] = STORE. This also means that we need inverse export functions when exporting to multiple columns, as in the case of WEEK and YEAR, because we will have to bind the entity refmode to the value of the function, and the multiple columns to the keys. So to implement our requirement, we define a split_dash function that splits a string into two strings by the '-' character, and then apply this function as an inverse to the binding. The resulting code for our example is the following:

concat_dash[s1, s2] = s1 + "-" + s2.
lang:derivationType[`concat_dash]="Derived".

split_dash[s1, s2] = s -> string(s1), string(s2), string(s).
split_dash[s1, s2] = s <-
  string:split[s, '-', 0] = s1,
  string:split[s, '-', 1] = s2.
lang:derivationType[`split_dash]="Derived".

...
file_binding_by_name["sales"] = fb,
file_binding(fb) {
  file_binding_definition_name[] = "sales",
  file_binding_entity_creation[] = "accumulate",
  predicate_binding_by_name["sample:sales:sales"] =
    predicate_binding(_) {
      predicate_binding_columns[] = "SKU, STORE, YEAR;WEEK, SALES",
      column_binding_by_arg[2] =
        column_binding(_) {
          column_binding_import_function[] = "sample:sales:services:concat_dash",
          column_binding_export_function_inverse[] = "sample:sales:services:split_dash"
        },
      column_binding_by_arg[1] =
        column_binding(_) {
          column_binding_import_function[] = "string:upper",
          column_binding_export_function[] = "string:lower"
        }
    }
}.
...

Predicate binding helpers

Adding alias_all(`lb:web:delim:tdx_helpers) to the service will expose the helper predicates below to use for TDX specification:

Predicate setting helper Description Example
binds_pred

Associate a predicate with a list of comma-separated column headers.

binds_pred["product:sku:label"] = "SKU_NBR,SKU_DESC".
accumulate_entity

Set entity creation for a column.

accumulate_entity("CLASS").
decumulate_entity

Unset entity creation for a column.

decumulate_entity("SKU").
export_only

Only bind this predicate when exporting.

export_only("product:sku").
import_only

Only bind this predicate when importing.

import_only("product:sku:imported").
transform_col_import

Transformation function to apply to column(s) on import.

transform_col_import("SHRINK_AMT", "string:upper").
transform_col_export

Transformation function to apply to column(s) on export.

transform_col_export("SHRINK_AMT", "string:lower").
transform_col

A set of two transformation functions to apply to column(s), first one on import and the second one on export.

transform_col("SHRINK_AMT", "string:upper", "string:lower").

The example below illustrates a file definition using TDX helpers.

block(`product) {

  alias_all(`lb:web:delim:tdx_helpers),
...

  clauses(`{
...

  fb(fb),
  file_binding_by_name[product_file_definition_name[]] = fb,
  file_binding(fb) {
    file_binding_definition_name[] = product_file_definition_name[],
    accumulate_entity("SKU_NBR"),
    binds_pred["product:sku:label"] = "SKU_NBR,SKU_DESC",
    binds_pred["product:sku"] = "SKU_NBR",
    binds_pred["product:sku:imported"] = "SKU_NBR",
    export_only("product:sku"),
    import_only("product:sku:imported"),

  }.

  })
} <--.

29.1.3. File Row Offsets

TDX exposes a file offset as a special (reserved) column header named TDX_OFFSET, which can then be used in predicate bindings. For example, in this code we bind the sales file, which has a single column, to the predicate measure:sales, which is a functional predicate from int to int.

sales[x] = v -> int(x), int(v).

file_definition_by_name["sales"] = fd,
file_definition(fd) {
  file_delimiter[] = "|",
  column_headers[] = "SALES",
  column_formats[] = "int"
},
file_binding_by_name["sales"] = fb,
file_binding(fb) {
  file_binding_definition_name[] = "sales",
  predicate_binding_by_name["measure:sales"] =
    predicate_binding(_) {
      predicate_binding_columns[] = "TDX_OFFSET, SALES"
    }
}.

Upon import, the offset of the rows in the file will be bound to the TDX_OFFSET column and then used to populate measure:sales. Upon export, the value in the column bound to TDX_OFFSET is ignored (equivalent to measure:sales[_] = SALES).

TDX_OFFSET works like any integer column: it can be bound to an entity by refmode, accumulated, subjected to transform functions, and so on. The value of the offset is the number of bytes from the beginning of the file up to the row being imported; it is therefore guaranteed to be unique and monotonically increasing within the file being imported, but not guaranteed to be consecutive.
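
For instance, the offset could be accumulated into a row entity whose refmode is the offset itself. The following is a sketch with hypothetical predicate and binding names (row, row_id, line, measure:line, "sales-rows"):

// Hypothetical schema: a row entity identified by its byte offset.
row(x), row_id(x:i) -> int(i).
line[x] = v -> row(x), int(v).

file_binding_by_name["sales-rows"] = fb,
file_binding(fb) {
  file_binding_definition_name[] = "sales",
  predicate_binding_by_name["measure:line"] =
    predicate_binding(_) {
      predicate_binding_columns[] = "TDX_OFFSET, SALES",
      // create a row entity for each offset seen on import
      column_binding_by_arg[0] =
        column_binding(_) {
          column_binding_entity_creation[] = "accumulate"
        }
    }
}.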

29.1.4. Service Configuration

A tabular data exchange service is configured with ServiceBlox by creating a delim_service. Example:

block(`service_config) {

  alias_all(`lb:web:config:service),
  alias_all(`lb:web:config:service_abbr),
  alias_all(`lb:web:config:delim),

  clauses(`{

    service_by_prefix["/sales"] = x,
    delim_service(x) {
      delim_file_binding[] = "sales"
    }.

  })

} <--.
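
With this configuration in place, the service at /sales can be exercised directly over HTTP, in the style of the earlier examples (the sample rows below are illustrative):

// export: GET to /sales returns the bound data in the defined format
SKU     | STORE   | WEEK | SALES
apples  | atlanta | W1   | 10
oranges | atlanta | W2   | 15

// import: POST or PUT to /sales, with a file in the same format sent as the
// request body (see Section 29.1.5 for details on retraction behavior)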

29.1.5. Import and Export

Import is performed as follows. If one of the columns mentioned in the file definition is missing from the input file, and the column was not specified as possibly absent, the input file is rejected. The file may contain additional columns that are not referred to in the configuration, and these columns are simply ignored. The order in which columns appear in the input CSV file is irrelevant: columns are identified by column name, as listed in the file header.

If a row of the file has a missing value in one of the required columns, or if the value in some column is not of the required type, the row is considered invalid (see the section called “Invalid Records” below for more detail on error handling). If the file binding includes predicate bindings that are filters, then all rows that do not pass the filter are ignored. For the remaining rows, row import is performed for each predicate binding that is not export_only and that involves only columns for which the current row has no missing optional values. Predicate bindings that bind IDBs are ignored (if file_binding_ignore_idb_on_import is set) or otherwise cause an error. Implicit conversions (such as from strings to numeric values) are performed where needed. If the predicate binding involves an import function that cannot be successfully applied to the given column values, the row is considered invalid. In particular, if one of the predicate attributes has an entity type and a value does not occur in the associated refmode predicate, either a new entity is created with the given refmode value (if entity creation is enabled via the accumulate policy), or an error message is generated (if entity creation is not enabled).

Export is performed as follows. Predicate bindings that are import_only or filters are ignored. The remaining predicate bindings are handled as follows: for each predicate that is associated with a predicate binding, a set of partial rows is produced (partial, because they only have the columns that occur in the predicate binding). This may involve applying export functions if present, and refmode lookups in case of entities. Next, a join is applied to these sets of partial rows, for all the predicates whose associated predicate binding refers only to columns of the CSV file that are required.

Exports with Optional Columns

Exports involving optional columns are a bit more complicated. For each row in such an export, all required columns have actual values, while each optional column can have an actual or a missing value. Moreover, TDX configurations with optional columns are only valid if each exported column is properly bound, i.e.:

  1. it is a required column that appears in some predicate binding in which all of the columns are required, or
  2. it is an optional column that is bound to the value of a functional predicate for which all columns bound to its keys are properly bound.

In the most common case, the proper binding of each optional column is guaranteed by a predicate binding to a functional predicate for which all columns bound to its keys are required. However, the second part of the definition above also allows for an optional column C to be bound to the value of a functional predicate F for which some of the keys are bound to other optional columns. In both cases, the set of rows to be exported is computed iteratively as follows:

  1. First we process all predicate bindings for which all columns are required, as described in the previous section, producing a set of partial rows PR with values for all required columns R.
  2. Let K be the set of known columns (initially set to R) and PK the corresponding set of rows with values for those columns (initially set to PR).
  3. While there exists some binding to a functional predicate F for which all key columns are known and the optional value column C is not known, C is added to the set K of known columns, and each row r in PK is extended with a column C whose value is determined as follows:
    1. if all key columns for F have actual values in r, these values are used for looking up whether F contains a tuple with these key values. If it does, C is filled in with the corresponding actual value from this tuple; otherwise the value for C is considered missing, and C is filled in with the empty string or other default export value for C.
    2. if some of the key columns have missing values in r, then C is considered missing and is filled in with the empty string or other default export value for C. No lookup on F is performed in this case.

At the end of this process, all columns are known, and the resulting rows in PK, each of which contains an actual or missing value for each of the columns, are exported to the CSV file. We illustrate this process through an example below:

sales[sku,week] = sales -> product:sku(sku), time:week(week), int(sales).
returns[sku,week] = returns -> product:sku(sku), time:week(week), int(returns).
...
file_definition_by_name["sales_returns"] = fd,
file_definition(fd) {
  column_headers[] = "SKU, WEEK, SALES, RETURNS",
  column_formats[] = "alphanum, alphanum, int, int",
  file_columns_optional[] = "RETURNS"
},

file_binding_by_name["sales_returns"] = fb,
file_binding(fb) {
  file_binding_definition_name[] = "sales_returns",
  predicate_binding_by_name["measure:sales"] =
    predicate_binding(_) {
      predicate_binding_columns[] = "SKU, WEEK, SALES",
    },
  predicate_binding_by_name["measure:returns"] =
    predicate_binding(_) {
      predicate_binding_columns[] = "SKU, WEEK, RETURNS",
    }
}.

In the TDX configuration above, the columns SKU, WEEK and SALES are properly bound because they appear in a predicate binding (to the predicate measure:sales) in which all columns are required. The optional column RETURNS is properly bound because it is bound to the value of the functional predicate measure:returns, and the columns SKU, WEEK which are bound to the keys of this predicate are properly bound.

The result of this export contains one row for each SKU,WEEK pair appearing in the measure:sales predicate, together with the corresponding value from that predicate for the SALES. For each such row, if the SKU,WEEK pair also appears in measure:returns, the RETURNS column of this row contains the corresponding value from this predicate, otherwise it contains an empty string (or the default export value for RETURNS). This export essentially corresponds to the left-outer-join of measure:sales with measure:returns.

Suppose now we want the export to include all SKU,WEEK pairs for which at least one of the measure:sales and measure:returns predicates has a value. In contrast to the left-outer-join described above, this corresponds to the full-outer-join of the two predicates. To export this full-outer-join, the first step would be to mark both the SALES and RETURNS columns as optional. However, the resulting TDX configuration would be invalid: even though the SALES and RETURNS optional columns are bound to values of functional predicates, the keys of those predicates are bound to the SKU and WEEK columns which do not appear in any predicate binding for which all columns are required.

A simple way to make this TDX configuration valid would be to add, for each of these columns, a new predicate binding in which all columns are required, e.g.:

  predicate_binding_by_name["product:sku"] =
    predicate_binding(_) {
      predicate_binding_columns[] = "SKU",
  predicate_binding_by_name["time:week"] =
    predicate_binding(_) {
      predicate_binding_columns[] = "WEEK",
    }

The result of this export contains one row for each SKU,WEEK pair from the cross-product of product:sku and time:week. For each such pair, the SALES and RETURNS columns contain the corresponding values from measure:sales and measure:returns, respectively, or the empty string or other default export value, if the SKU,WEEK pair does not appear in the corresponding predicate.

However, this may include a lot of SKU,WEEK pairs for which both of the SALES and RETURNS columns have missing values (such tuples also do not belong to the full-outer-join of the two predicates). Thus, to export exactly the contents of the full-outer-join we must ensure that the predicate(s) we use to properly bind the SKU and WEEK columns contain only pairs for which at least one of the predicates has a value. For instance, we can define such a predicate using the following LogiQL rule:

keysToExport(sku,week) <- 
   sales[sku,week] = _ ; returns[sku,week] = _.

We can then use this predicate to bind both of the SKU,WEEK columns, instead of binding them to product:sku and time:week, respectively:

  predicate_binding_by_name["keysToExport"] =
    predicate_binding(_) {
      predicate_binding_columns[] = "SKU, WEEK"
    }

The resulting export is also likely to be more efficient, as it avoids creating partial rows and performing lookups for a (potentially large) number of SKU,WEEK pairs that do not appear in either of the measure:sales or measure:returns predicates.

Exports with Default-valued Predicates

If the measure:sales and measure:returns predicates in the example of the previous section are default-valued predicates, the full-outer-join of their parts that have non-default values could also be exported using a TDX configuration where both the SALES and RETURNS columns are marked as required. In this case, no additional predicate bindings are needed to ensure that all columns are properly bound, and the export computes the (inner) join of the two predicates. Because default-valued predicates have a value for all possible intersections in their key space, the result of this join may contain SKU,WEEK pairs for which one or both of the predicates have the default value. For exports involving default-valued predicates, TDX does some additional filtering to avoid exporting rows for which all columns that are bound to the values of such predicates have the default value. Thus, such a TDX export produces the set of rows in the full-outer-join described in the previous section (the only difference being that in this case missing values are represented by the default value of the corresponding predicate, instead of the empty string or other default export value that was produced when using optional columns).

However, using a combination of required columns with default-valued predicates to export the full-outer-join of these predicates is likely to be inefficient because, as explained above, it essentially involves computing the join over the whole key space of these default-valued predicates, which is typically much larger than the set of intersections for which some of these predicates have non-default values. A better alternative, similar to the one described for non-default-valued predicates in the previous section, would be to:

  • mark the columns corresponding to the values of default-valued predicates as optional, and
  • bind the keys of the default-valued predicates to a new predicate that contains exactly those intersections that have a non-default value for at least one of the default-valued predicates.

For our running example, assuming the default value for measure:sales and measure:returns is 0, such a predicate can be defined using a LogiQL rule along the lines of:

keysToExport(sku,week) <- 
   sales[sku,week] != 0 ; returns[sku,week] != 0.

The SKU and WEEK columns can then be bound to the keys of the keysToExport predicate, to ensure that:

  1. all columns in the TDX configuration are properly bound,
  2. the export corresponds exactly to the full-outer-join of the (non-default parts of the) measure:sales and measure:returns predicates, and
  3. the export is performed as efficiently as possible, by only creating partial rows for intersections in keysToExport and using lookups on the default-valued predicates in order to provide values for the optional columns.

Invalid Records

Records that failed to import are reported back to the user. The format of the report is the same as in the import file, including headers, plus two additional columns that indicate the error (see the section called “Causes”). If no records failed to import, the server returns a file with only the headers. This feature is enabled by default.

Returning error records does have a small performance penalty, as the server must write and then return the file containing the bad records. If for some reason you wish to disable the feature, you can specify the --ignore-bad-records flag on lb web-client, or simply not specify an output_file or output_url in the batch. See the section called “Accessing via HTTP” for how to disable it when accessing via HTTP.

Causes

The resulting data that reports which records were not imported contains all the columns of the original import plus two additional columns describing why the records were not imported. The first additional column, CAUSE, contains a human-readable string such as "'SKU' is a required column.". The last column, CAUSE_CODE, contains a constant string value identifying the error type for easy parsing. Here are the error codes, along with their descriptions.

Error Code Description
REQUIRED_COLUMN

A column defined as required is missing from the file.

WRONG_FORMAT

An import data value is invalid with respect to the column format.

DOES_NOT_EXIST

An entity or record referenced by the data does not exist.

MALFORMED_ROW

The row could not be parsed because it contains invalid characters or a different number of columns than the rest of the file.

FAILED_FUNCTION_APPLICATION

An import function application resulted in an undefined value.

FAILED_PRIMITIVE_CONVERSION

A primitive conversion resulted in an undefined value. Conversions are used when adjusting import values for import functions and refmode types.
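
For example, importing the sales file with a non-integer SALES value might produce an error report along these lines (the exact CAUSE text may differ):

SKU     | STORE   | WEEK | SALES | CAUSE                                         | CAUSE_CODE
apples  | atlanta | W1   | ten   | 'ten' does not conform to the integer format. | WRONG_FORMAT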

Accessing via HTTP

If you are not using lb web-client or batch, you can still enable or disable this feature by using the tdx_return_errors query string parameter. Since the feature is enabled by default, simply accessing the URL as normal will return the bad records. If you do not wish to return the error records, set the tdx_return_errors query parameter equal to 0 or false. The value on_error can also be used to make the server return the error file only if import errors were detected.

Partial Imports and Aborting Transactions

TDX by default aborts an import transaction if any row fails to import. If invalid records are requested, these are still returned, even if the transaction aborts. You can optionally configure a service to allow partial imports, i.e., import good records and ignore bad records. You can configure this option by service or by individual request.

To configure a service to default to allowing partial imports, you must assert into the lb:web:config:delim:allow_partial_import predicate. This makes all transactions for the configured service import valid rows by default, even if some rows produce error messages. This setting can still be overridden by individual requests via the following methods, all of which take precedence over the lb:web:config:delim:allow_partial_import predicate.

If you are using a batch configuration, you can set the abort_on_error field of the ImportDelim message; if you are using the lb web-client command line tool, you can specify either the --abort flag or the --allow-partial-import flag; finally, if you are using raw HTTP, you can set the tdx_allow_partial_import query parameter equal to 1 or true to allow, 0 or false to abort.
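
For example, using raw HTTP in the style of the earlier examples:

// POST to /sales?tdx_allow_partial_import=true
// valid rows are imported; invalid rows are reported back without aborting

// POST to /sales?tdx_allow_partial_import=false
// the whole transaction aborts if any row fails to import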

29.2. Dynamic Tabular Data Exchange Services

While TDX services allow us to statically configure a file binding and thus create one service per file binding, we can also import and export delimited files by specifying the file binding as a request parameter named file_binding.

To host the dynamic tabular data exchange service we use the dynamic_delim_service predicate, as shown below. This configuration allows us to export or import delimited files by accessing /delim?file_binding=..., where ... is a JSON representation of the FileBinding protobuf message.

service_by_prefix["/delim"] = x,
dynamic_delim_service(x)
  <- .

29.2.1. Building FileBinding Messages

A FileBinding message describes the file structure that we want to export (FileDefinition) and how to build the file from different predicates (PredicateBinding). The optional entity_creation field allows you to set the default entity creation policy for all columns of all predicate bindings.

message FileBinding
{
  required FileDefinition file = 2;

  repeated PredicateBinding binding = 3;
  optional string entity_creation = 4;
}

The FileDefinition describes the format of the files and the options on the columns. If the file_columns_required field is empty then all columns are considered required (unless otherwise specified). Conversely, if file_columns_required is not empty, all columns that are not mentioned in it are considered optional.

message FileDefinition
{
  required string delimiter = 2;
  required string column_headers = 3;
  required string column_formats = 4;

  optional string file_columns_required = 5;
  optional string file_columns_optional = 6;
  optional string file_columns_can_be_absent = 7;
  optional string mode = 8;

}

Finally, a PredicateBinding specifies how predicate columns bind to the columns in the file definition.

message PredicateBinding
{
  required string predicate_name = 1;
  required string predicate_binding_columns = 2;
  repeated ColumnBinding column = 3;

  optional bool export = 4 [default = true];
  optional bool import = 5 [default = true];
  optional bool no_retraction_on_post = 11 [default = false];

  optional string entity_creation = 9;

  repeated Filter filter = 10;
}

Example 29.1. 

The following JSON file definition message describes a file with three columns, "PERSON", "FATHER", and "MOTHER".

"file": {
  "delimiter": "|",
  "column_headers": "PERSON,FATHER,MOTHER",
  "column_formats": "alphanum,alphanum,alphanum"
} 

The following two PredicateBinding messages specify how to build the PERSON, FATHER, and MOTHER columns from predicates person_father and person_mother:

"binding":[ {
    "predicate_name": "person_father",
    "predicate_binding_columns": "PERSON,FATHER"
  },
  {
    "predicate_name": "person_mother",
    "predicate_binding_columns": "PERSON,MOTHER"
  }] 

Combining the messages together, we build the file binding message that should be sent as the file_binding parameter to the dynamic delimited file service:

{
    "file": {
      "delimiter": "|",
      "column_headers": "PERSON,FATHER,MOTHER",
      "column_formats": "alphanum,alphanum,alphanum"
    },
    "binding": [
      {
        "predicate_name": "person_father",
        "predicate_binding_columns": "PERSON,FATHER"
      },
      {
        "predicate_name": "person_mother",
        "predicate_binding_columns": "PERSON,MOTHER"
      }
    ]
}
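
To invoke the dynamic service, this JSON is passed, URL-encoded, as the file_binding request parameter of the /delim prefix configured above (the sample row below is illustrative):

// export: GET to /delim?file_binding=<url-encoded FileBinding JSON>
PERSON | FATHER | MOTHER
alice  | bob    | carol

// import: POST to /delim?file_binding=<url-encoded FileBinding JSON>,
// with a file in the same format sent as the request body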