Chapter 29. Data Exchange Services
29.1. Configuring Tabular Data Exchange Services
Tabular data exchange (TDX) services are HTTP services that export data from LogicBlox workspaces by offering delimited files for download (GET), and that import data from uploaded delimited files (POST/PUT). ServiceBlox provides a built-in handler for defining such tabular data exchange services at a very high level. This section describes how to configure and use these services.
To introduce TDX file services we will use a simple example of multi-dimensional sales data. Consider a workspace with the following schema defined for the hierarchy and measures.
block(`hierarchy) {
  export(`{
    sku(x), sku_id(x:s) -> string(s).
    store(x), store_id(x:s) -> string(s).
    week(x), week_id(x:s) -> string(s).
  })
} <--.

block(`measure) {
  alias_all(`hierarchy),
  export(`{
    sales[x, y, z] = v -> sku(x), store(y), week(z), int(v).
  })
} <--.
For this application the customer uses a delimited file for sales data, as in the following example.
SKU | STORE | WEEK | SALES
apples | atlanta | W1 | 10
oranges | atlanta | W2 | 15
apples | portland | W1 | 20
oranges | portland | W2 | 5
We shall define a tabular data exchange service to import data in this delimited file format into the sku, store, week, and sales predicates. We shall also define a service to export data from these predicates to a delimited file in this format.
A tabular data exchange service is defined by three parts:
- File definition, which defines the format of a file, such as header names, column formats, optional columns, and the delimiter character that is used.
- File binding, which specifies how columns in a delimited file are bound to predicates in the workspace. This file binding is a high-level, bi-directional specification, which means that it can be used for both the import and the export.
- Service configuration, which defines the service to be hosted by the ServiceBlox service container.
The ServiceBlox programming interface for defining these parts follows.
The ServiceBlox handler for tabular data exchange services uses predicates in the lb:web:delim namespaces. File definitions are specified in lb:web:delim:schema and predicate bindings in lb:web:delim:binding. To avoid cluttering logic, and to make logic more readable, it is good practice to use aliases.
29.1.1. File Definition
A delimited file is defined by creating an lb:web:delim:schema:file_definition element, populating its interface predicates, and then saving it by name in lb:web:delim:schema:file_definition_by_name. Example code:
block(`files) {
  alias_all(`lb:web:delim:schema),
  alias_all(`lb:web:delim:schema_abbr),
  clauses(`{
    file_definition_by_name["sales"] = fd,
    file_definition(fd) {
      file_delimiter[] = "|",
      column_headers[] = "SKU,STORE,WEEK,SALES",
      column_formats[] = "alphanum,alphanum,alphanum,integer"
    }.
  })
} <--.
Required file definition interface settings

| Setting | Namespace | Description |
|---|---|---|
| file_delimiter | lb:web:delim:schema | Delimiter character. |
| column_headers | lb:web:delim:schema_abbr | Comma-separated list of file headers. |
| column_formats | lb:web:delim:schema_abbr | Comma-separated list of column formats (see the table below for supported column formats). |
The order of the column names in column_headers must match the order of the column formats in column_formats, but it does not necessarily correspond to the order of the columns in the delimited file.
Optional file definition interface settings

| Setting | Namespace | Description |
|---|---|---|
| file_mode | lb:web:delim:schema | Specification of how to handle quoting and escaping (see File Modes). |
| file_columns_required | lb:web:delim:schema_abbr | Comma-separated list of required columns. All columns that are not listed become optional. |
| file_column_required | lb:web:delim:schema_abbr | Sets the column with this header as required. All columns that are not listed become optional. |
| file_columns_optional | lb:web:delim:schema_abbr | Comma-separated list of optional columns (i.e., missing values are allowed). |
| file_column_optional | lb:web:delim:schema_abbr | Sets the column with this header as optional (i.e., missing values are allowed). |
| file_columns_can_be_absent | lb:web:delim:schema_abbr | Comma-separated list of columns that can be optionally absent (i.e., missing values are allowed and, moreover, the entire column may be missing from the file on imports). |
| file_column_can_be_absent | lb:web:delim:schema_abbr | Sets the column with this header to be optionally absent (i.e., missing values are allowed and, moreover, the entire column may be missing from the file on imports). |
| file_column_format | lb:web:delim:schema_abbr | Sets the format of a column by its header. |
| column_description | lb:web:delim:schema | A textual description for a column. This information is displayed by the meta-service and can be used to document file definitions. |
The file_column_format setting provides an alternative way to specify column formats. For example, file_column_format["SKU"] = "alphanum" specifies that the format of the SKU column is alphanumeric. The column_formats setting may be omitted, provided that file_column_format is used to specify the format of each column listed in the column_headers setting.
The following table lists the currently supported formats. TDX enforces these formats: during an import, it is an error if a data value does not conform to the specified format (e.g., if -1 is bound to a positive integer column); during an export, data in the predicate that does not conform to the format is simply filtered out of the exported file.
Supported column formats

| Syntax | Description | Details |
|---|---|---|
| alphanum | Alphanumeric string that maps to a string. | The string is trimmed. Only letters and numbers are valid. |
| string | A string that must not be empty. | The string is trimmed. Must not be empty after trimming. |
| string* | A possibly empty string. | The string is trimmed. Can be empty after trimming. Cannot be used for an optional column. |
| raw_string | A string that is not processed by TDX. | The string is not trimmed. Can be empty. Cannot be used for an optional column. |
| char | A string that has a single character. | The string is trimmed. Can be the empty character. Must have a single character. |
| uuid | A Universally Unique Identifier (UUID) string. | The string is trimmed. Only UUIDs are valid. |
| integer | Integer number. | |
| 0+ | Non-negative integer. | |
| 1+ or >0 | Positive integer. | |
| decimal | Decimal number. | |
| 0.0+ | Non-negative decimal. | |
| >0.0 | Positive decimal. | |
| float | Floating-point number. | |
| 0.0f+ | Non-negative float. | |
| >0.0f | Positive float. | |
| boolean(t;f) | Boolean value. | The literals given in the format (t and f in this example) represent the true and false values. |
| datetime(format) | Datetime value. | The value is serialized to and from string using the given format. |
| date(format) | Date value. | The value is serialized to and from string using the given format. |
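For example, the numeric shorthands can be used directly in column_formats; the following sketch (with hypothetical QTY and PRICE columns) requires a non-negative integer quantity and a positive decimal price:

file_definition_by_name["inventory"] = fd,
file_definition(fd) {
  file_delimiter[] = "|",
  column_headers[] = "SKU,QTY,PRICE",
  // 0+ = non-negative integer, >0.0 = positive decimal
  column_formats[] = "alphanum,0+,>0.0"
}.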
Format validations
Additional validations can be added to most formats, as illustrated in the code below:
file_definition_by_name["sales"] = fd,
file_definition(fd) {
  file_delimiter[] = "|",
  column_headers[] = "SKU,SALES",
  column_formats[] = "string([a-zA-Z0-9]*),float(>=0; <=20; precision 2)"
}.
In the example above, the string column defines a regular expression [a-zA-Z0-9]*, indicating that it only accepts alphanumeric characters. The float column defines three validations, separated by semicolons:

- >=0: the value must be non-negative.
- <=20: the value must not be greater than 20.
- precision 2: the value must not have more than 2 digits after the decimal point.
The following table lists currently available validations:
| Validation | Description |
|---|---|
| > | Greater than (for example, >0). |
| >= | Greater than or equal to (for example, >=0). |
| < | Less than (for example, <20). |
| <= | Less than or equal to (for example, <=20). |
| precision | Maximum number of digits after the decimal point (for example, precision 2). |
| Regular expression | If the validation does not match any of the above, it defaults to a regular expression. |

Regular expressions are accepted in most formats, as in the following example:

file_definition_by_name["sales"] = fd,
file_definition(fd) {
  file_delimiter[] = "|",
  column_headers[] = "SKU,SALES",
  column_formats[] = "string([a-zA-Z0-9]*),float"
}.

Special attention is needed if the regular expression includes comma (,) or semicolon (;) characters, which are not allowed in the above notation. If needed, a regular expression with these characters must be defined with format_regex and referenced by name:

file_definition_by_name["sales"] = fd,
file_definition(fd) {
  file_delimiter[] = "|",
  column_headers[] = "SKU,SALES",
  column_formats[] = "string(exp1),float",
  format_regex["exp1"] = "[a-zA-Z]{0,9}"
}.
Optional Columns
TDX provides support for dealing with missing values, such as the following file:
SKU | STORE | WEEK | YEAR | SALES
apples | atlanta | W1 | 2012 |
oranges | atlanta | W2 | 2012 | 15
apples | portland | W1 | 2012 |
oranges | portland | W2 | 2012 | 5
In TDX, a column can be "required", "optional", or "possibly absent". If a column is required, this means that it must be present in the file, and every row must specify a value for that column. If a column is optional, this means that it must be present in the file, but some rows may have a missing value for that column. If a column is allowed to be absent, this means that missing values are allowed, and, furthermore, the entire column may be absent from the file on imports.
By default, all columns mentioned in the file definition are required columns. To change this, we can use the file_columns_required or the file_columns_optional predicate, as shown below. When the file_columns_required predicate is used, all columns not specified in that predicate are treated as optional columns. When the file_columns_optional predicate is used, all columns not specified in that predicate are treated as required columns.
file_definition_by_name["sales"] = fd,
file_definition(fd) {
  file_delimiter[] = "|",
  column_headers[] = "SKU,STORE,WEEK,SALES",
  column_formats[] = "alphanum,alphanum,alphanum,integer",
  file_columns_required[] = "SKU,STORE,WEEK"
}.
Possibly absent columns are specified using file_columns_can_be_absent, as in the following example:
file_definition_by_name["sales-returns"] = fd,
file_definition(fd) {
  file_delimiter[] = "|",
  column_headers[] = "SKU,STORE,WEEK,SALES,RETURNS",
  column_formats[] = "alphanum,alphanum,alphanum,integer,integer",
  file_columns_can_be_absent[] = "SALES,RETURNS"
}.
In this example, the SALES and RETURNS columns may or may not be present in the file. If they are present, then, in any given row, a value may but need not be given for each of these columns.
File Modes
TDX allows the customization of how data values are quoted and escaped in import and export services. The file mode specification consists of a string with zero or more of the following entries.
Supported file mode entries

| Syntax | Import | Export |
|---|---|---|
| raw | Imports text as is (will break if a value contains the delimiter or a line break). | Exports text as is (will create a bad csv when a value contains the delimiter or a line break). |
| quote=c | Parses c as the quote character. | Uses c as the quote character. |
| unix | Removes quotes and unescapes all escaped characters, i.e., \n, \r, \t and the quote character. | Adds quotes and escapes all characters for which it is needed. |
| excel | Removes quotes and replaces doubled quotes with quotes. | Adds quotes, doubles the quote character, and escapes \n, \r, \t. |
Escaping is applied upon export, and unescaping upon import. In the unix style, escaping is done by prefixing the quote character, line feed, and carriage return with a backslash (e.g., "foo"bar" is replaced by "foo\"bar"). The backslashes are removed on imports. In excel mode, escaping is done by doubling the quote character (e.g., "foo""bar").
The default file mode is unix quote=", which means that by default Unix escaping rules are used, and values are enclosed in double quotes. That is, imports require values to be either unquoted or quoted with ", and exports always quote with ". For example, a file with a single string column named X behaves as follows by default:
// import
X
foo
"bar"

// export
X
"foo"
"bar"
This behavior can be changed statically with the file_mode attribute. For example, the following definition will export the file with ' as the quote character.
file_definition_by_name[name] = fd,
file_definition(fd) {
  file_delimiter[] = "|",
  file_mode[] = "quote='",
  column_headers[] = "X",
  column_formats[] = "string"
}.
This allows the following interaction (note that double quotes are now simply part of the record):
// import
X
foo
"bar"
'baz'

// export
X
'foo'
'"bar"'
'baz'
By setting the file_mode to raw, no quotes are added on export, and import values are treated entirely as part of the record:
// import
X
foo
"bar"
'baz'

// export
X
foo
"bar"
'baz'
Finally, it is possible to override the static default by using a URL parameter to the TDX service:
// POST to /file?tdx_file_mode="raw"
X
foo
"bar"
'baz'

// GET to /file?tdx_file_mode="quote=*"
X
*foo*
*"bar"*
*'baz'*
29.1.2. File Binding
The file binding for a delimited file is defined by creating an lb:web:delim:binding:file_binding element, populating its interface predicates, and then saving it by name in lb:web:delim:binding:file_binding_by_name.
Here is an example that shows a basic binding to one predicate:
block(`server_init) {
  alias_all(`lb:web:delim:binding),
  alias_all(`lb:web:delim:binding_abbr),
  clauses(`{
    file_binding_by_name["sales"] = fb,
    file_binding(fb) {
      file_binding_definition_name[] = "sales",
      predicate_binding_by_name["measure:sales"] = predicate_binding(_) {
        predicate_binding_columns[] = "SKU,STORE,WEEK,SALES"
      }
    }.
  })
} <--.
The predicate_binding_columns setting maps file column names onto the attributes of the predicate, in order. In this example, the columns "SKU,STORE,WEEK,SALES" map to a predicate defined as measure:sales[sku, store, week] = sales. Some of the predicate attributes, such as the keys in this example, may have entity types, provided that the entity type has a reference-mode predicate. In this case, the values in the associated file column are treated as refmode values for the entity type in question (see Section 29.1.5, “Import and Export” below).
The following table shows how TDX maps primitive LogiQL types to column formats. For each primitive type, the table lists the compatible column formats. For example, int primitives can be bound directly onto integer, 0+ and >0 columns, but int128 primitives can only be bound onto uuid columns. When the TDX generator encounters incompatible types, it attempts to convert the primitive into a compatible primitive using the standard from:to:convert predicates. The last column of the table shows the conversions supported by TDX. For example, an int primitive bound to a float column will first be converted using int:float:convert, and an int128 bound to a uuid column will be converted with int128:string:convert.
Supported column format bindings and primitive conversions

| Primitive LogiQL type | Column formats | Primitive conversions |
|---|---|---|
| int | integer, 0+, >0 | decimal, float, string |
| float | float, 0.0f+, >0.0f | decimal, int, string |
| decimal | decimal, 0.0+, >0.0 | float, string |
| string | raw_string, string, string*, alphanum, char, uuid | boolean, datetime, decimal, float, int, int128 |
| boolean | boolean | string |
| datetime | date, datetime | string |
| int128 | uuid | string |
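For example, if the SALES column of the earlier sales file were declared with the decimal format while measure:sales stores int values, the binding itself stays the same and TDX applies the standard conversion predicates (e.g., int:decimal:convert) where needed. A sketch:

file_definition_by_name["sales_dec"] = fd,
file_definition(fd) {
  file_delimiter[] = "|",
  column_headers[] = "SKU,STORE,WEEK,SALES",
  // SALES is a decimal column, but measure:sales holds int values
  column_formats[] = "alphanum,alphanum,alphanum,decimal"
},
file_binding_by_name["sales_dec"] = fb,
file_binding(fb) {
  file_binding_definition_name[] = "sales_dec",
  predicate_binding_by_name["measure:sales"] = predicate_binding(_) {
    predicate_binding_columns[] = "SKU,STORE,WEEK,SALES"
  }
}.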
In general, a file binding can have any number of associated predicate bindings. When multiple predicate bindings are specified, then each predicate gets populated independently on import; on export, the predicates are joined to give rise to rows of the output file (and there are some more intricacies when some of the columns have been marked as optional or some of the bindings are to default-valued predicates). See the section on import and export below for more details.
The above example binding will support import into the sales predicate, assuming that entity elements already exist in sku, store, and week. It is common to optionally add elements to the entity types on import; this is specified by populating the lb:web:delim:binding_abbr:file_binding_entity_creation predicate. Example:
file_binding(fb) {
  ...
  file_binding_entity_creation[] = "accumulate",
  ...
}
File binding configurations may apply to all predicate bindings, as in the above entity creation example, or to individual predicate bindings.
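For example, entity creation can be enabled for a single predicate binding instead of the whole file binding; the following sketch reuses the sales binding from above:

file_binding(fb) {
  ...
  predicate_binding_by_name["measure:sales"] = predicate_binding(_) {
    predicate_binding_columns[] = "SKU,STORE,WEEK,SALES",
    // applies only to the column bindings of this predicate binding
    predicate_binding_entity_creation[] = "accumulate"
  },
  ...
}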
Required file binding settings

| Setting | Namespace | Description |
|---|---|---|
| file_binding_definition_name | | Specifies the file definition that this file binding is associated with. |
| predicate_binding_by_name | | Specifies predicate bindings associated with this file binding. |
Optional file binding settings

| Setting | Namespace | Description |
|---|---|---|
| file_binding_predicate_columns | lb:web:delim:binding_abbr | Comma-separated list of column headers. Applies to all predicate bindings for this file binding. |
| file_binding_entity_creation | lb:web:delim:binding_abbr | Sets entity creation (e.g., "accumulate") for all predicate bindings of a file binding. If entity creation is configured on the file binding, it is recursively applied to all predicate bindings in this file binding. The setting on a predicate binding is in turn recursively applied to all column bindings of that predicate binding. |
| file_binding_column_entity_creation | lb:web:delim:binding_abbr | Default entity creation from the file binding for one specific argument. Applies to all predicate bindings for this file binding. |
| file_binding_ignore_idb_on_import | lb:web:delim:binding | Specifies that predicate bindings that bind IDB predicates should be ignored on imports. Importing into IDBs is not possible because IDBs cannot be directly populated. This allows binding to IDBs that contain information to join on exports, while reusing the file binding for imports into the remaining predicate bindings. |
| file_binding_import_default_value | lb:web:delim:binding | By default, the value that represents an empty optional column on imports is the empty string. This option allows a different string to be deemed the empty value. On imports, an optional column containing either the empty string or the default value is deemed missing and will not be imported. This allows the definition of "zero-stripping" policies to ignore import rows with certain values. |
| file_binding_export_default_value | lb:web:delim:binding | By default, the value that represents an empty optional column on exports is the empty string. This option allows a different string to be deemed the empty value. On exports, an optional column without a value is exported with the default value instead of the empty string. This allows the definition of "zero-filling" policies. |
| file_binding_default_value | lb:web:delim:binding | A shorthand to apply the same value to both file_binding_import_default_value and file_binding_export_default_value. |
The index argument for options that apply to a particular argument is a zero-based index into the arguments of a predicate. For example, to enable entity creation for the sku and store entities, use:
file_binding(fb) {
  ...
  file_binding_column_entity_creation[0] = "accumulate",
  file_binding_column_entity_creation[1] = "accumulate",
  ...
}
Options applying to individual predicate bindings

| Setting | Namespace | Description |
|---|---|---|
| predicate_binding_columns | lb:web:delim:binding_abbr | Comma-separated list of column headers. Multiple columns can be combined into one component with a semicolon. A column-binding transformation must be provided to combine multiple columns into one value (see below). |
| predicate_binding_entity_creation | lb:web:delim:binding_abbr | Sets entity creation for all column bindings of this predicate binding. See file_binding_entity_creation. |
| predicate_binding_export_only | lb:web:delim:binding | Specifies that this predicate binding should be applied only for exports and ignored for imports. This is useful to specify export filters that do not apply to imports. See Section 29.1.5, “Import and Export” below for more details. |
| predicate_binding_import_only | lb:web:delim:binding | Specifies that this predicate binding should be applied only for imports and ignored for exports. This is useful to populate auxiliary predicates that cannot provide a bi-directional transformation. See Section 29.1.5, “Import and Export” below for more details. |
| predicate_binding_no_retraction_on_post | lb:web:delim:binding | The default behavior on POST requests for predicate bindings that bind optional columns is to retract existing values for the keys if the value column is left empty. This behavior may be undesirable in certain circumstances. This flag can be used to specify that empty optional columns should not cause a retraction on POSTs. |
| predicate_binding_filter | lb:web:delim:filter | A filter to be applied on imports. |
| column_binding_by_arg | lb:web:delim:binding | Column binding to describe more precisely how column(s) from a delimited file map to an argument of a predicate. |
Column binding options

| Setting | Namespace | Description |
|---|---|---|
| column_binding_import_function, column_binding_export_function | lb:web:delim:binding | Transformation functions to apply to the value(s) of this column binding; see the section called “Transform Functions” below. |
| column_binding_entity_creation | lb:web:delim:binding | Sets entity creation for this column binding. See file_binding_entity_creation. |
Transform Functions
When importing a delimited file, it is often necessary or convenient to combine column values, or to perform a modification of column values prior to inserting them into predicates in a workspace. Similar manipulations may also be called for when exporting to a delimited file. TDX allows developers to specify column binding transform functions to be applied during imports and/or exports.
In the following we will introduce transform functions by means of examples. Suppose that we want to import this simple sales file and that the application model is the one shown in the following listing.
SKU | STORE | WEEK | YEAR | SALES
apples | atlanta | W1 | 2012 | 10
oranges | atlanta | W2 | 2012 | 15
apples | portland | W1 | 2012 | 20
oranges | portland | W2 | 2012 | 5
sku(x), sku_id(x:id) -> string(id).
store(x), store_id(x:id) -> string(id).
week(x), week_id(x:id) -> string(id).
sales[sku, store, week] = sales -> sku(sku), store(store), week(week), int(sales).
Now suppose that there are small mismatches between the file format and data, and the schema of our application. For example, we may want the names of all stores to be in upper-case characters. One possible way to transform the import file is to bind the STORE column to the refmode of sample:sales:store entities, and apply the string:upper function upon import. This can be specified with the following file binding:
...
file_binding_by_name["sales"] = fb,
file_binding(fb) {
  file_binding_definition_name[] = "sales",
  file_binding_entity_creation[] = "accumulate",
  predicate_binding_by_name["sample:sales:store"] = predicate_binding(_) {
    predicate_binding_columns[] = "STORE",
    column_binding_by_arg[0] = column_binding(_) {
      column_binding_import_function[] = "string:upper"
    }
  }
}.
...
Note that import functions bind the values from columns to their keys, and produce a value that is then stored in the predicate being bound. This binding is thus roughly equivalent to applying string:upper[STORE] = id and then using id as the refmode of a store entity.
This file binding would create stores, but we are also interested in creating sales values. The problem now is that in the workspace, weeks are identified by a concatenation of year and week id (for example, '2012-W1'). The solution is to define a concat_dash function that concatenates two strings with a dash ('-'). We then apply this function to the values of the YEAR and WEEK columns, and use the resulting value as the refmode of the week-of-year entity:
concat_dash[s1, s2] = s1 + "-" + s2.
lang:derivationType[`concat_dash] = "Derived".

...
file_binding_by_name["sales"] = fb,
file_binding(fb) {
  file_binding_definition_name[] = "sales",
  file_binding_entity_creation[] = "accumulate",
  predicate_binding_by_name["sample:sales:sales"] = predicate_binding(_) {
    predicate_binding_columns[] = "SKU, STORE, YEAR;WEEK, SALES",
    column_binding_by_arg[2] = column_binding(_) {
      column_binding_import_function[] = "sample:sales:services:concat_dash"
    },
    column_binding_by_arg[1] = column_binding(_) {
      column_binding_import_function[] = "string:upper"
    }
  }
}.
...
The above file binding will concatenate YEAR and WEEK as well as apply string:upper to the STORE column. Note that column_binding_by_arg is indexed by the components in predicate_binding_columns, which are separated by commas (,). If a component contains multiple columns, because the function takes multiple arguments, they are separated by semicolons (;). Also note that any function can be used, including built-ins (such as string:upper), derived, and materialized functions.
This file binding supports importing files, but it would fail for exports. The problem is that we need to specify the inverse of concat_dash, that is, how to decompose the week-of-year entity refmode in order to export the WEEK and YEAR columns. But before attending to this issue, let us see how we could extend the binding to export the names of store entities in lower case. This is accomplished by using the inverse export function string:lower:

...
column_binding_by_arg[1] = column_binding(_) {
  column_binding_import_function[] = "string:upper",
  column_binding_export_function[] = "string:lower"
}
...
Note that export functions are applied with respect to the predicate being bound. This binding is thus roughly equivalent to applying store_id(x:id), string:lower[id] = STORE. This also means that we need inverse export functions when exporting to multiple columns, as in the case of WEEK and YEAR, because we have to bind the entity refmode to the value of the function, and the multiple columns to its keys. So, to implement our requirement, we define a split_dash function that splits a string into two strings at the '-' character, and then apply this function as an inverse to the binding. The resulting code for our example is the following:
concat_dash[s1, s2] = s1 + "-" + s2.
lang:derivationType[`concat_dash] = "Derived".

split_dash[s1, s2] = s -> string(s1), string(s2), string(s).
split_dash[s1, s2] = s <-
  string:split[s, '-', 0] = s1,
  string:split[s, '-', 1] = s2.
lang:derivationType[`split_dash] = "Derived".

...
file_binding_by_name["sales"] = fb,
file_binding(fb) {
  file_binding_definition_name[] = "sales",
  file_binding_entity_creation[] = "accumulate",
  predicate_binding_by_name["sample:sales:sales"] = predicate_binding(_) {
    predicate_binding_columns[] = "SKU, STORE, YEAR;WEEK, SALES",
    column_binding_by_arg[2] = column_binding(_) {
      column_binding_import_function[] = "sample:sales:services:concat_dash",
      column_binding_export_function_inverse[] = "sample:sales:services:split_dash"
    },
    column_binding_by_arg[1] = column_binding(_) {
      column_binding_import_function[] = "string:upper",
      column_binding_export_function[] = "string:lower"
    }
  }
}.
...
Predicate binding helpers
Adding alias_all(`lb:web:delim:tdx_helpers) to the service block exposes the helper predicates below for use in TDX specifications:
| Predicate setting helper | Description | Example |
|---|---|---|
| binds_pred | Associates a predicate with a list of comma-separated column headers. | binds_pred["product:sku:label"] = "SKU_NBR,SKU_DESC". |
| accumulate_entity | Sets entity creation for a column. | accumulate_entity("CLASS"). |
| decumulate_entity | Unsets entity creation for a column. | decumulate_entity("SKU"). |
| export_only | Only binds this predicate when exporting. | export_only("product:sku"). |
| import_only | Only binds this predicate when importing. | import_only("product:sku:imported"). |
| transform_col_import | Transformation function to apply to column(s) on import. | transform_col_import("SHRINK_AMT", "string:upper"). |
| transform_col_export | Transformation function to apply to column(s) on export. | transform_col_export("SHRINK_AMT", "string:lower"). |
| transform_col | A pair of transformation functions to apply to column(s): the first on import and the second on export. | transform_col("SHRINK_AMT", "string:upper", "string:lower"). |
The example below illustrates a file binding using TDX helpers.
block(`product) {
  alias_all(`lb:web:delim:tdx_helpers),
  ...
  clauses(`{
    ...
    fb(fb),
    file_binding_by_name[product_file_definition_name[]] = fb,
    file_binding(fb) {
      file_binding_definition_name[] = product_file_definition_name[],
      accumulate_entity("SKU_NBR"),
      binds_pred["product:sku:label"] = "SKU_NBR,SKU_DESC",
      binds_pred["product:sku"] = "SKU_NBR",
      binds_pred["product:sku:imported"] = "SKU_NBR",
      export_only("product:sku"),
      import_only("product:sku:imported")
    }.
  })
} <-- .
29.1.3. File Row Offsets
TDX exposes the file offset as a special (reserved) column header named TDX_OFFSET, which can then be used in predicate bindings. For example, in the following code we bind the sales file, which has a single column, to the predicate measure:sales, a functional predicate from int to int.
sales[x] = v -> int(x), int(v).

file_definition_by_name["sales"] = fd,
file_definition(fd) {
  file_delimiter[] = "|",
  column_headers[] = "SALES",
  column_formats[] = "integer"
},
file_binding_by_name["sales"] = fb,
file_binding(fb) {
  file_binding_definition_name[] = "sales",
  predicate_binding_by_name["measure:sales"] = predicate_binding(_) {
    predicate_binding_columns[] = "TDX_OFFSET, SALES"
  }
}.
Upon import, the offset of each row in the file is bound to the TDX_OFFSET column and then used to populate measure:sales. Upon export, the value in the column bound to TDX_OFFSET is ignored (equivalent to measure:sales[_] = SALES).
TDX_OFFSET works like any integer column: it can be bound to an entity by refmode, accumulated, be the subject of transformation functions, etc. The value of the offset is the number of bytes from the beginning of the file up to the row being imported; it is therefore guaranteed to be unique and monotonically increasing within the file being imported, but not guaranteed to be consecutive.
29.1.4. Service Configuration
A tabular data exchange service is configured with ServiceBlox by creating a delim_service. Example:
block(`service_config) {
  alias_all(`lb:web:config:service),
  alias_all(`lb:web:config:service_abbr),
  alias_all(`lb:web:config:delim),
  clauses(`{
    service_by_prefix["/sales"] = x,
    delim_service(x) {
      delim_file_binding[] = "sales"
    }.
  })
} <--.
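With this configuration the service is reachable under the /sales prefix. For example (host and port depend on the ServiceBlox deployment), a GET exports the delimited file and a POST of a file in the same format imports it; assuming the sales rows shown earlier have been imported, the interaction looks roughly as follows:

// GET to /sales (export)
SKU | STORE | WEEK | SALES
apples | atlanta | W1 | 10
oranges | atlanta | W2 | 15

// POST to /sales with a delimited file in the same format (import)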
29.1.5. Import and Export
Import is performed as follows. If one of the columns mentioned in the file definition is missing from the input file, and the column was not specified as possibly absent, the input file is rejected. The file may contain additional columns that are not referred to in the configuration; these columns are simply ignored. The order in which columns appear in the input CSV file is irrelevant: columns are identified by column name, as listed in the file header.
If a row of the file has a missing value in one of the required columns, or if the value in some column is not of the required type, the row is considered invalid (see the section called “Invalid Records” below for more detail on error handling). If the file binding includes predicate bindings that are filters, then all rows that do not pass the filter are ignored. For the remaining rows, row import is performed for each predicate binding that is not export_only and that involves only columns for which the current row has no missing optional values. Predicate bindings that bind IDBs are ignored (if file_binding_ignore_idb_on_import is set) or cause an error. Implicit conversions (such as from strings to numeric values) are performed where needed. If the predicate binding involves an import function that cannot be successfully applied to the given column values, the row is considered invalid. In particular, if one of the predicate attributes has an entity type and a value does not occur in the associated refmode predicate, either a new entity is created with the given refmode value (if entity creation is enabled via the accumulate policy), or an error message is generated (if entity creation is not enabled).
Export is performed as follows. Predicate bindings that are import_only or filters are ignored. The remaining predicate bindings are handled as follows: for each predicate that is associated with a predicate binding, a set of partial rows is produced (partial, because they only have the columns that occur in the predicate binding). This may involve applying export functions if present, and refmode lookups in the case of entities. Next, a join is applied to these sets of partial rows, for all the predicates whose associated predicate binding refers only to columns of the CSV file that are required.
Exports with Optional Columns
Exports involving optional columns are a bit more complicated. For each row in such an export, all required columns have actual values, while each optional column can have either an actual or a missing value. Moreover, TDX configurations with optional columns are only valid if each exported column is properly bound, i.e.:
- it is a required column that appears in some predicate binding in which all of the columns are required, or
- it is an optional column that is bound to the value of a functional predicate for which all columns bound to its keys are properly bound.
In the most common case, the proper binding of each optional column is guaranteed by a predicate binding to a functional predicate for which all columns bound to its keys are required. However, the second part of the definition above also allows for an optional column C to be bound to the value of a functional predicate F for which some of the keys are bound to other optional columns. In both cases, the set of rows to be exported is computed iteratively as follows:
- First we process all predicate bindings for which all columns are required, as described in the previous section, producing a set of partial rows PR with values for all required columns R.
- Let K be the set of known columns (initially set to R) and PK the corresponding set of rows with values for those columns (initially set to PR).
- While there exists some binding to a functional predicate F for which all key columns are known and the optional value column C is not known, C is added to the set K of known columns, and each row r in PK is extended with a column C whose value is determined as follows:
- if all key columns for F have actual values in r, these values are used for looking up whether F contains a tuple with these key values. If it does, C is filled in with the corresponding actual value from this tuple; otherwise the value for C is considered missing, and C is filled in with the empty string or other default export value for C.
- if some of the key columns have missing values in r, then C is considered missing and is filled in with the empty string or other default export value for C. No lookup on F is performed in this case.
At the end of this process, all columns are known, and the resulting rows in PK, each of which contains an actual or missing value for each of the columns, are exported to the CSV file. We illustrate this process through an example below:
sales[sku,week] = sales -> product:sku(sku), time:week(week), int(sales).
returns[sku,week] = returns -> product:sku(sku), time:week(week), int(returns).

...
file_definition_by_name["sales_returns"] = fd,
file_definition(fd) {
  column_headers[] = "SKU, WEEK, SALES, RETURNS",
  column_formats[] = "alphanum, alphanum, integer, integer",
  file_columns_optional[] = "RETURNS"
},
file_binding_by_name["sales_returns"] = fb,
file_binding(fb) {
  file_binding_definition_name[] = "sales_returns",
  predicate_binding_by_name["measure:sales"] = predicate_binding(_) {
    predicate_binding_columns[] = "SKU, WEEK, SALES"
  },
  predicate_binding_by_name["measure:returns"] = predicate_binding(_) {
    predicate_binding_columns[] = "SKU, WEEK, RETURNS"
  }
}.
In the TDX configuration above, the columns SKU, WEEK and SALES are properly bound because they appear in a predicate binding (to the predicate measure:sales) in which all columns are required. The optional column RETURNS is properly bound because it is bound to the value of the functional predicate measure:returns, and the columns SKU and WEEK, which are bound to the keys of this predicate, are properly bound.
The result of this export contains one row for each SKU,WEEK pair appearing in the measure:sales predicate, together with the corresponding SALES value from that predicate. For each such row, if the SKU,WEEK pair also appears in measure:returns, the RETURNS column of this row contains the corresponding value from that predicate; otherwise it contains an empty string (or the default export value for RETURNS). This export essentially corresponds to the left-outer-join of measure:sales with measure:returns.
Suppose now we want the export to include all SKU,WEEK pairs for which at least one of the measure:sales and measure:returns predicates has a value. In contrast to the left-outer-join described above, this corresponds to the full-outer-join of the two predicates. To export this full-outer-join, the first step would be to mark both the SALES and RETURNS columns as optional. However, the resulting TDX configuration would be invalid: even though the SALES and RETURNS optional columns are bound to values of functional predicates, the keys of those predicates are bound to the SKU and WEEK columns, which do not appear in any predicate binding for which all columns are required.
A simple way to make this TDX configuration valid would be to add, for each of these columns, a new predicate binding in which all columns are required, e.g.:
predicate_binding_by_name["product:sku"] = predicate_binding(_) {
  predicate_binding_columns[] = "SKU"
},
predicate_binding_by_name["time:week"] = predicate_binding(_) {
  predicate_binding_columns[] = "WEEK"
}
The result of this export contains one row for each SKU,WEEK pair from the cross-product of product:sku and time:week. For each such pair, the SALES and RETURNS columns contain the corresponding values from measure:sales and measure:returns, respectively, or the empty string (or other default export value) if the SKU,WEEK pair does not appear in the corresponding predicate.
However, this may include many SKU,WEEK pairs for which both the SALES and RETURNS columns have missing values (such tuples also do not belong to the full-outer-join of the two predicates). Thus, to export exactly the contents of the full-outer-join, we must ensure that the predicate(s) we use to properly bind the SKU and WEEK columns contain only pairs for which at least one of the predicates has a value. For instance, we can define such a predicate using the following LogiQL rule:
keysToExport(sku,week) <- sales[sku,week] = _ ; returns[sku,week] = _.
We can then use this predicate to bind both the SKU and WEEK columns, instead of binding them to product:sku and time:week, respectively:
predicate_binding_by_name["keysToExport"] = predicate_binding(_) {
  predicate_binding_columns[] = "SKU, WEEK"
}
The resulting export is also likely to be more efficient, as it avoids creating partial rows and performing lookups for a (potentially large) number of SKU,WEEK pairs that do not appear in either the measure:sales or measure:returns predicate.
Exports with Default-valued Predicates
If the measure:sales and measure:returns predicates in the example of the previous section are default-valued predicates, the full-outer-join of their parts that have non-default values could also be exported using a TDX configuration in which both the SALES and RETURNS columns are marked as required. In this case, no additional predicate bindings are needed to ensure that all columns are properly bound, and the export computes the (inner) join of the two predicates. Because default-valued predicates have a value for all possible intersections in their key space, the result of this join may contain SKU,WEEK pairs for which one or both of the predicates have the default value. For exports involving default-valued predicates, TDX does some additional filtering to avoid exporting rows for which all columns that are bound to the values of such predicates have the default value. Thus, such a TDX export produces the set of rows in the full-outer-join described in the previous section (the only difference being that in this case missing values are represented by the default value of the corresponding predicate, instead of the empty string or other default export value that was produced when using optional columns).
However, using a combination of required columns with default-valued predicates to export the full-outer-join of these predicates is likely to be inefficient because, as explained above, it essentially involves computing the join over the whole key space of these default-valued predicates, which is typically much larger than the set of intersections for which some of these predicates have non-default values. A better alternative, similar to the one described for non-default-valued predicates in the previous section, would be to:
- mark the columns corresponding to the values of default-valued predicates as optional, and
- bind the keys of the default-valued predicates to a new predicate that contains exactly those intersections that have a non-default value for at least one of the default-valued predicates.
For our running example, assuming the default value for measure:sales and measure:returns is 0, such a predicate can be defined using a LogiQL rule along the lines of:
keysToExport(sku,week) <- sales[sku,week] != 0 ; returns[sku,week] != 0.
The SKU and WEEK columns can then be bound to the keys of the keysToExport predicate, to ensure that:
- all columns in the TDX configuration are properly bound,
- the export corresponds exactly to the full-outer-join of the (non-default parts of the) measure:sales and measure:returns predicates, and
- the export is performed as efficiently as possible, by only creating partial rows for intersections in keysToExport and using lookups on the default-valued predicates to provide values for the optional columns.
Invalid Records
Records that failed to import are reported back to the user. The format of the report is the same as in the import file, including headers, plus two additional columns that indicate the error (see the section called “Causes”). If no records failed to import, the server returns a file with only the headers. This feature is enabled by default.
Returning error records does have a small performance penalty, as the server must write and then return the file containing the bad records. If for some reason you wish to disable the feature, you can specify the --ignore-bad-records flag on lb web-client, or simply not specify an output_file or output_url in the batch. See the section called “Accessing via HTTP” for how to disable it when accessing via HTTP.
Causes
The resulting data that reports which records were not imported contains all the columns of the original import plus two additional columns describing why the records were not imported. The first column, CAUSE, contains a human-readable string such as "'SKU' is a required column.". The last column, CAUSE_CODE, contains a constant string value identifying the error type, for easy parsing. The error codes, along with their descriptions, are listed below.
| Error Code | Description |
|---|---|
| REQUIRED_COLUMN | A column defined as required is missing from the file. |
| WRONG_FORMAT | An import data value is invalid with respect to the column format. |
| DOES_NOT_EXIST | An entity or record referenced by the data does not exist. |
| MALFORMED_ROW | The row could not be parsed because it contains invalid characters or a different number of columns than the rest of the file. |
| FAILED_FUNCTION_APPLICATION | An import function application resulted in an undefined value. |
| FAILED_PRIMITIVE_CONVERSION | A primitive conversion resulted in an undefined value. Conversions are used when adjusting import values for import functions and refmode types. |
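For example, if the sales file from the beginning of this section contained a non-numeric value in the SALES column, the returned error file would look roughly like the following (the exact CAUSE text is illustrative):

SKU | STORE | WEEK | SALES | CAUSE | CAUSE_CODE
apples | atlanta | W1 | ten | Value 'ten' is not a valid integer. | WRONG_FORMAT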
Accessing via HTTP
If you are not using lb web-client or batch, you can still enable or disable this feature with the tdx_return_errors query string parameter. Since the feature is enabled by default, simply accessing the URL as normal will return the bad records. If you do not wish to return the error records, set the tdx_return_errors query parameter to 0 or false. Also, on_error can be used to make the server return the error file only if import errors were detected.
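For example, following the URL style used earlier for tdx_file_mode (the /sales prefix is the service configured above):

// POST to /sales (default: the error records are returned)
// POST to /sales?tdx_return_errors=false (do not return the error records)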
Partial Imports and Aborting Transactions
TDX by default aborts an import transaction if any row fails to import. If invalid records are requested, these are still returned, even if the transaction aborts. You can optionally configure a service to allow partial imports, i.e., import good records and ignore bad records. You can configure this option by service or by individual request.
To configure a service so that partial imports are allowed by default, assert the lb:web:config:delim:allow_partial_import predicate. All transactions for the configured service will then default to importing valid rows even if error messages are produced. This setting can still be overridden by individual requests via the following methods, all of which take precedence over the lb:web:config:delim:allow_partial_import predicate.
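For illustration only, assuming allow_partial_import is asserted inside the delim_service block like the other lb:web:config:delim settings (the exact arity of the predicate is not shown in this section):

service_by_prefix["/sales"] = x,
delim_service(x) {
  delim_file_binding[] = "sales",
  // assumed nullary assertion: partial imports allowed by default for this service
  allow_partial_import()
}.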
If you are using a batch configuration, you can set the abort_on_error field of the ImportDelim message; if you are using the lb web-client command line tool, you can specify either the --abort flag or the --allow-partial-import flag; finally, if you are using raw HTTP, you can set the tdx_allow_partial_import query parameter to 1 or true to allow partial imports, or to 0 or false to abort.
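For example, over raw HTTP:

// POST to /sales?tdx_allow_partial_import=true (import valid rows, ignore bad records)
// POST to /sales?tdx_allow_partial_import=false (abort the transaction if any row fails)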
29.2. Dynamic Tabular Data Exchange Services
While TDX services allow us to statically configure a file binding and thus create one service for each file binding, we can also import and export delimited files by specifying the file binding as a request parameter named file_binding.
To host the dynamic tabular data exchange service we use the dynamic_delim_service predicate, as shown below. This configuration allows us to export or import delimited files by accessing /delim?file_binding=..., where ... is a JSON representation of the FileBinding protobuf message.
service_by_prefix["/delim"] = x, dynamic_delim_service(x) <- .
29.2.1. Building FileBinding Messages
A FileBinding message describes the file structure that we want to export (FileDefinition) and how to build the file from different predicates (PredicateBinding). The optional entity_creation field allows you to set the default entity creation policy for all columns of all predicate bindings.
message FileBinding {
  required FileDefinition file = 2;
  repeated PredicateBinding binding = 3;
  optional string entity_creation = 4;
}
The FileDefinition describes the format of the file and the options on the columns. If the required field is empty, then all columns are considered required (unless otherwise specified). Conversely, if the required field is not empty, all columns that are not mentioned in this field are considered optional.
message FileDefinition {
  required string delimiter = 2;
  required string column_headers = 3;
  required string column_formats = 4;
  optional string file_columns_required = 5;
  optional string file_columns_optional = 6;
  optional string file_columns_can_be_absent = 7;
  optional string mode = 8;
}
Finally, a PredicateBinding specifies how predicate columns bind to the columns in the file definition.
message PredicateBinding {
  required string predicate_name = 1;
  required string predicate_binding_columns = 2;
  repeated ColumnBinding column = 3;
  optional bool export = 4 [default = true];
  optional bool import = 5 [default = true];
  optional bool no_retraction_on_post = 11 [default = false];
  optional string entity_creation = 9;
  repeated Filter filter = 10;
}
Example 29.1.
The following JSON file definition message describes a file with three columns, "PERSON", "FATHER", and "MOTHER".
"file": { "delimiter": "|", "column_headers": "PERSON,FATHER,MOTHER", "column_formats": "alphanum,alphanum,alphanum" }
The following two PredicateBinding messages specify how to build the PERSON, FATHER, and MOTHER columns from the predicates person_father and person_mother:
"binding":[ { "predicate_name": "person_father", "predicate_binding_columns": "PERSON,FATHER" }, { "predicate_name": "person_mother", "predicate_binding_columns": "PERSON,MOTHER" }]
Combining the messages, we build the file binding message that should be sent as the file_binding parameter to the dynamic delimited file service:
{ "file": { "delimiter": "|", "column_headers": "PERSON,FATHER,MOTHER", "column_formats": "alphanum,alphanum,alphanum" }, "binding": [ { "predicate_name": "person_father", "predicate_binding_columns": "PERSON,FATHER" }, { "predicate_name": "person_mother", "predicate_binding_columns": "PERSON,MOTHER" }] }