R library for working with Table Schema.
Table
class for working with data and schemaSchema
class for working with schemasField
class for working with schema fieldsvalidate
function for validating schema
descriptorsinfer
function that creates a schema based on a data
sampleIn order to install the latest distribution of R software to your computer you have to select one of the mirror sites of the Comprehensive R Archive Network, select the appropriate link for your operating system and follow the wizard instructions.
For windows users you can:
(Mac) OS X and Linux users may need to follow different steps depending on their system version to install R successfully and it is recommended to read the instructions on CRAN site carefully.
Even more detailed installation instructions can be found in R Installation and Administration manual.
To install RStudio, you can download RStudio Desktop with Open Source License and follow the wizard instructions:
To install the tableschema
library it is necessary to
install first devtools
library to make installation of
github libraries available.
# Install devtools package if not already
install.packages("devtools")
Install tableschema.r
# from CRAN version
install.packages("tableschema.r")
# or install the development version from github
::install_github("frictionlessdata/tableschema-r") devtools
# Install devtools package if not already
# install.packages("jsonlite")
library(jsonlite)
# Install devtools package if not already
# install.packages("future")
library(future)
# load the library using
library(tableschema.r)
Jsonlite package is internally used to convert json data to list objects. The input parameters of functions could be json strings, files or lists and the outputs are in list format to easily further process your data in R environment and exported as desired. The examples below show how to use jsonlite package to convert the output back to json adding indentation whitespace. More details about handling json you can see jsonlite documentation or vignettes here.
Moreover future
package is also used to load and create Table and Schema classes
asynchronously. To retrieve the actual result of the loaded Table or
Schema you have to use value(...)
to the variable you
stored the loaded Table/Schema. More details about future package and
sequential and parallel processing you can find here.
A table is a core concept in a tabular data world. It represents a data with a metadata (Table Schema). Let’s see how we could use it in practice.
Consider we have some local csv file. It could be inline data or
remote link - all supported by Table
class (except local
files for in-brower usage of course). But say it’s data.csv
for now:
data/cities.csv
city,location
london,"51.50,-0.11"
paris,"48.85,2.30"
rome,N/A
Let’s create and read a table. We use static Table.load
method and table.read
method with a keyed
option to get list of keyed rows:
= Table.load('inst/extdata/data.csv')
def = value(def)
table # add indentation whitespace to JSON output with jsonlite package
toJSON(table$read(keyed = TRUE), pretty = TRUE) # function from jsonlite package
## [
## {
## "city": ["london"],
## "location": ["\"51.50 -0.11\""]
## },
## {
## "city": ["paris"],
## "location": ["\"48.85 2.30\""]
## },
## {
## "city": ["rome"],
## "location": ["N/A"]
## }
## ]
= table$headers
table.headers table.headers
## [[1]]
## [1] "city"
##
## [[2]]
## [1] "location"
As we could see our locations are just a strings. But it should be
geopoints. Also Rome’s location is not available but it’s also just a
N/A
string instead of null
. First we have to
infer Table Schema:
# add indentation whitespace to JSON output with jsonlite package
toJSON(table$infer(), pretty = TRUE) # function from jsonlite package
## {
## "fields": [
## {
## "name": ["city"],
## "type": ["string"],
## "format": ["default"]
## },
## {
## "name": ["location"],
## "type": ["string"],
## "format": ["default"]
## }
## ],
## "missingValues": [
## [""]
## ]
## }
toJSON(table$schema$descriptor, pretty = TRUE) # function from jsonlite package
## {
## "fields": [
## {
## "name": ["city"],
## "type": ["string"],
## "format": ["default"]
## },
## {
## "name": ["location"],
## "type": ["string"],
## "format": ["default"]
## }
## ],
## "missingValues": [
## [""]
## ]
## }
$read(keyed = TRUE) # Fails table
Let’s fix not available location. There is a
missingValues
property in Table Schema specification. As a
first try we set missingValues
to N/A
in
table$schema$descriptor
. Schema descriptor could be changed
in-place but all changes should be commited by
table$schema$commit()
:
$schema$descriptor['missingValues'] = 'N/A'
table$schema$commit() table
## [1] TRUE
$schema$valid # false table
## [1] FALSE
$schema$errors table
## [[1]]
## [1] "Descriptor validation error:\n data.missingValues - is the wrong type"
As a good citiziens we’ve decided to check out schema descriptor
validity. And it’s not valid! We sould use an list for
missingValues
property. Also don’t forget to have an empty
string as a missing value:
$schema$descriptor[['missingValues']] = list("", 'N/A')
table$schema$commit() table
## [1] TRUE
$schema$valid # true table
## [1] TRUE
All good. It looks like we’re ready to read our data again:
$read() # or table
toJSON(table$read(), pretty = TRUE) # function from jsonlite package
Now we see that:
locations are lists with numeric lattide and longitude
Rome’s location is null
And because there are no errors on data reading we could be sure that our data is valid againt our schema. Let’s save it:
$schema$save('schema.json')
table$save('data.csv') table
Our data.csv
looks the same because it has been
stringified back to csv
format. But now we have
schema.json
:
{
"fields": [
{
"name": "city",
"type": "string",
"format": "default"
},
{
"name": "location",
"type": "geopoint",
"format": "default"
}
],
"missingValues": [
"",
"N/A"
]
}
If we decide to improve it even more we could update the schema file and then open it again. But now providing a schema path.
= Table.load('inst/extdata/data.csv', schema = 'inst/extdata/schema.json')
def = value(def)
table table
## <Table>
## Public:
## clone: function (deep = FALSE)
## headers: active binding
## infer: function (limit = 100)
## initialize: function (src, schema = NULL, strict = FALSE, headers = 1)
## iter: function (keyed, extended, cast = TRUE, relations = FALSE, stream = FALSE)
## read: function (keyed = FALSE, extended = FALSE, cast = TRUE, relations = FALSE,
## save: function (connection)
## schema: active binding
## Private:
## createRowStream_: function (src)
## createUniqueFieldsCache: function (schema)
## currentStream_: NULL
## headers_: NULL
## headersRow_: 1
## rowNumber_: 0
## schema_: Schema, R6
## src: inst/extdata/data.csv
## strict_: FALSE
## uniqueFieldsCache_: list
It was only basic introduction to the Table
class. To
learn more let’s take a look on Table
class API
reference.
A model of a schema with helpful methods for working with the schema
and supported data. Schema instances can be initialized with a schema
source as a url to a JSON file or a JSON object. The schema is initially
validated (see validate below). By default
validation errors will be stored in schema$errors
but in a
strict mode it will be instantly raised.
Let’s create a blank schema. It’s not valid because
descriptor$fields
property is required by the Table Schema
specification:
= Schema.load({})
def = value(def)
schema $valid # false schema
## [1] FALSE
$errors schema
## [[1]]
## [1] "Descriptor validation error:\n data.fields - is required"
To do not create a schema descriptor by hands we will use a
schema$infer
method to infer the descriptor from given
data:
toJSON(
$infer(helpers.from.json.to.list('[
schema ["id", "age", "name"],
["1","39","Paul"],
["2","23","Jimmy"],
["3","36","Jane"],
["4","28","Judy"]
]')), pretty = TRUE) # function from jsonlite package
## {
## "fields": [
## {
## "name": ["id"],
## "type": ["integer"]
## },
## {
## "name": ["age"],
## "type": ["integer"]
## },
## {
## "name": ["name"],
## "type": ["string"]
## }
## ]
## }
$valid # true schema
## [1] TRUE
toJSON(
$descriptor,
schemapretty = TRUE) # function from jsonlite package
## {
## "fields": [
## {
## "name": ["id"],
## "type": ["integer"],
## "format": ["default"]
## },
## {
## "name": ["age"],
## "type": ["integer"],
## "format": ["default"]
## },
## {
## "name": ["name"],
## "type": ["string"],
## "format": ["default"]
## }
## ],
## "missingValues": [
## [""]
## ]
## }
Now we have an inferred schema and it’s valid. We could cast data row against our schema. We provide a string input by an output will be cast correspondingly:
toJSON(
$castRow(helpers.from.json.to.list('["5", "66", "Sam"]')),
schemapretty = TRUE, auto_unbox = TRUE) # function from jsonlite package
## [
## 5,
## 66,
## "Sam"
## ]
But if we try provide some missing value to age
field
cast will fail because for now only one possible missing value is an
empty string. Let’s update our schema:
$castRow(helpers.from.json.to.list('["6", "N/A", "Walt"]')) schema
## Error in schema$castRow(helpers.from.json.to.list("[\"6\", \"N/A\", \"Walt\"]")): There are 1 cast errors (see following - Wrong type for header: age and value: N/A
# Cast error
$descriptor$missingValues = list('', 'NA')
schema$commit() schema
## [1] TRUE
$castRow(helpers.from.json.to.list('["6", "", "Walt"]')) schema
## [[1]]
## [1] 6
##
## [[2]]
## NULL
##
## [[3]]
## [1] "Walt"
We could save the schema to a local file. And we could continue the work in any time just loading it from the local file:
$save('schema.json')
schema= Schema.load('schema.json') schema
It was only basic introduction to the Schema
class. To
learn more let’s take a look on Schema
class API
reference.
Class represents field in the schema.
Data values can be cast to native R types. Casting a value will check the value is of the expected type, is in the correct format, and complies with any constraints imposed by a schema.
{
"name": "birthday",
"type": "date",
"format": "default",
"constraints": {
"required": true,
"minimum": "2015-05-30"
}
}
Following code will not raise the exception, despite the fact our date is less than minimum constraints in the field, because we do not check constraints of the field descriptor
= Field$new(helpers.from.json.to.list('{"name": "name", "type": "number"}'))
field = field$cast_value('12345') # cast
dateType # print the result dateType
## [1] 12345
And following example will raise exception, because we set flag ‘skip
constraints’ to false
, and our date is less than allowed by
minimum
constraints of the field. Exception will be raised
as well in situation of trying to cast non-date format values, or empty
values
tryCatch(
dateType = field$cast_value(value = '2014-05-29', constraints = FALSE),
error = function(e){# uh oh, something went wrong
})
## Error in private$castValue(...): Field character(0) can't cast value 2014-05-29 for type number with format default
Values that can’t be cast will raise an Error
exception.
Casting a value that doesn’t meet the constraints will raise an
Error
exception.
Table below shows the available types, formats and resultant value of the cast:
Type | Formats | Casting result |
---|---|---|
any | default | Any |
list | | default | | List |
boolean | default | Boolean |
date | default, any | Date |
datetime | default, any | Date |
duration | default | Duration |
geojson | default, topojson | Object |
geopoint | default, list, object | [Number, Number] |
integer | default | Number |
number | default | Number |
object | default | Object |
string | default, uri, email, binary | String |
time | default, any | Date |
year | default | Number |
yearmonth | default | [Number, Number] |
validate()
validates whether a schema is a validate Table Schema accordingly to the specifications. It does not validate data against a schema.
Given a schema descriptor validate
returns a validation
object:
= validate('inst/extdata/schema.json')
valid_errors valid_errors
## $valid
## [1] TRUE
##
## $errors
## list()
Given data source and headers infer
will return a Table
Schema as a JSON object based on the data values.
Given the data file, example.csv:
id,age,name
1,39,Paul
2,23,Jimmy
3,36,Jane
4,28,Judy
Call infer
with headers and values from the
datafile:
= infer('inst/extdata/data_infer.csv') descriptor
The descriptor
variable is now a list object that can
easily converted to JSON:
toJSON(
descriptor,pretty = TRUE
# function from jsonlite package )
## {
## "fields": [
## {
## "name": ["id"],
## "type": ["integer"],
## "format": ["default"]
## },
## {
## "name": ["age"],
## "type": ["integer"],
## "format": ["default"]
## },
## {
## "name": ["name"],
## "type": ["string"],
## "format": ["default"]
## }
## ],
## "missingValues": [
## [""]
## ]
## }
Table representation
List.<string>
Schema
AsyncIterator
|
Stream
List.<List>
| List.<Object>
Object
Boolean
List.<string>
Headers
Returns: List.<string>
- data
source headers
Schema
Schema
Returns: Schema
- table schema
instance
AsyncIterator
| Stream
Iterate through the table data
And emits rows cast based on table schema (async for loop). With a
stream
flag instead of async iterator a Node stream will be
returned. Data casting can be disabled.
Returns: AsyncIterator
|
Stream
- async iterator/stream of rows: -
[value1, value2]
- base -
{header1: value1, header2: value2}
- keyed -
[rowNumber, [header1, header2], [value1, value2]]
-
extended Throws:
TableSchemaError
raises any error occurred in this
processParam | Type | Description |
---|---|---|
keyed | boolean |
iter keyed rows |
extended | boolean |
iter extended rows |
cast | boolean |
disable data casting if false |
forceCast | boolean |
instead of raising on the first row with cast error return an error
object to replace failed row. It will allow to iterate over the whole
data file even if it’s not compliant to the schema. Example of output
stream:
[['val1', 'val2'], TableSchemaError, ['val3', 'val4'], ...] |
relations | Object |
object of foreign key references in a form of
{resource1: [{field1: value1, field2: value2}, ...], ...} .
If provided foreign key fields will checked and resolved to its
references |
stream | boolean |
return Node Readable Stream of table rows |
List.<List>
| List.<Object>
Read the table data into memory
The API is the same as
table.iter
has except for:
Returns: List.<List>
|
List.<Object>
- list of rows: -
[value1, value2]
- base -
{header1: value1, header2: value2}
- keyed -
[rowNumber, [header1, header2], [value1, value2]]
-
extended
Param | Type | Description |
---|---|---|
limit | integer |
limit of rows to read |
Object
Infer a schema for the table.
It will infer and set Table Schema to table.schema
based
on table data.
Returns: Object
- Table Schema
descriptor
Param | Type | Description |
---|---|---|
limit | number |
limit rows sample size |
Boolean
Save data source to file locally in CSV format with ,
(comma) delimiter
Returns: Boolean
- true on success
Throws:
TableSchemaError
an error if there is saving
problemParam | Type | Description |
---|---|---|
target | string |
path where to save a table data |
Table
Factory method to instantiate Table
class.
This method is async and it should be used with await keyword or as a
Promise
. If references
argument is provided
foreign keys will be checked on any reading operation.
Returns: Table
-
data table class instance Throws:
TableSchemaError
raises any error occurred in table
creation processParam | Type | Description |
---|---|---|
source | string | List.<List> |
Stream | function |
data source (one of): - local CSV file (path) - remote CSV file (url) - list of lists representing the rows - readable stream with CSV file contents - function returning readable stream with CSV file contents |
schema | string | Object |
data schema in all forms supported by Schema class |
strict | boolean |
strictness option to pass to Schema constructor |
headers | number | List.<string> |
data source headers (one of): - row number containing headers
(source should contain headers rows) - list of headers
(source should NOT contain headers rows) |
parserOptions | Object |
options to be used by CSV parser. All options listed at https://csv.js.org/parse/options/. By default
ltrim is true according to the CSV Dialect spec. |
Schema representation
Boolean
List.<Error>
Object
List.<string>
List.<Object>
List.<Field>
List.<string>
Field
| null
Field
Field
| null
List.<List>
Object
Boolean
boolean
Boolean
Validation status
It always true
in strict mode.
Returns: Boolean
- returns validation
status
List.<Error>
Validation errors
It always empty in strict mode.
Returns: List.<Error>
- returns
validation errors
Object
Descriptor
Returns: Object
- schema descriptor
List.<string>
Primary Key
Returns: List.<string>
- schema
primary key
List.<Object>
Foreign Keys
Returns: List.<Object>
- schema
foreign keys
List.<Field>
Fields
Returns: List.<Field>
- schema
fields
List.<string>
Field names
Returns: List.<string>
- schema
field names
Field
| null
Return a field
Returns: Field
| null
-
field instance if exists
Param | Type |
---|---|
fieldName | string |
Field
Add a field
Returns: Field
- added field
instance
Param | Type |
---|---|
descriptor | Object |
Field
| null
Remove a field
Returns: Field
| null
-
removed field instance if exists
Param | Type |
---|---|
name | string |
List.<List>
Cast row based on field types and formats.
Returns: List.<List>
- cast data
row
Param | Type | Description |
---|---|---|
row | List.<List> |
data row as an list of values |
failFalst | boolean |
Object
Infer and set schema.descriptor
based on data
sample.
Returns: Object
- Table Schema
descriptor
Param | Type | Description |
---|---|---|
rows | List.<List> |
list of lists representing rows |
headers | integer | List.<string> |
data sample headers (one of): - row number containing headers
(rows should contain headers rows) - list of headers
(rows should NOT contain headers rows) - defaults to 1 |
Boolean
Update schema instance if there are in-place changes in the descriptor.
Returns: Boolean
- returns true on
success and false if not modified Throws:
TableSchemaError
raises any error occurred in the
processParam | Type | Description |
---|---|---|
strict | boolean |
alter strict mode for further work |
Example
<- '{"fields": [{"name": "field", "type": "string"}]}'
descriptor <- Schema.load(descriptor)
def <- value(def)
schema $getField('name') schema
## NULL
$descriptor$fields[[1]]$type schema
## [1] "string"
$descriptor$fields[[1]]$type <-'number'
schema$descriptor$fields[[1]]$type schema
## [1] "number"
$commit() schema
## [1] TRUE
boolean
Save schema descriptor to target destination.
Returns: boolean
- returns true on
success Throws:
TableSchemaError
raises any error occurred in the
processParam | Type | Description |
---|---|---|
target | string |
path where to save a descriptor |
Schema
Factory method to instantiate Schema
class.
This method is async and it should be used with await keyword or as a
Promise
.
Returns: Schema
-
returns schema class instance Throws:
TableSchemaError
raises any error occurred in the
processParam | Type | Description |
---|---|---|
descriptor | string | Object |
schema descriptor: - local path - remote url - object |
strict | boolean |
flag to alter validation behaviour: - if false error will not be
raised and all error will be collected in schema.errors -
if strict is true any validation error will be raised immediately |
Field representation
string
string
string
boolean
Object
Object
any
boolean
Constructor to instantiate Field
class.
Returns: Field
-
returns field class instance Throws:
TableSchemaError
raises any error occured in the
processParam | Type | Description |
---|---|---|
descriptor | Object |
schema field descriptor |
missingValues | List.<string> |
an list with string representing missing values |
string
Field name
string
Field type
string
Field format
boolean
Return true if field is required
Object
Field constraints
Object
Field descriptor
any
Cast value
Returns: any
- cast value
Param | Type | Description |
---|---|---|
value | any |
value to cast |
constraints | Object | false |
boolean
Check if value can be cast
Param | Type | Description |
---|---|---|
value | any |
value to test |
constraints | Object | false |
Object
This function is async so it has to be used with await
keyword or as a Promise
.
Returns: Object
- returns
{valid, errors}
object
Param | Type | Description |
---|---|---|
descriptor | string | Object |
schema descriptor (one of): - local path - remote url - object |
Object
This function is async so it has to be used with await
keyword or as a Promise
.
Returns: Object
- returns schema
descriptor Throws:
TableSchemaError
raises any error occured in the
processParam | Type | Description |
---|---|---|
source | string | List.<List> |
Stream | function |
data source (one of): - local CSV file (path) - remote CSV file (url) - list of lists representing the rows - readable stream with CSV file contents - function returning readable stream with CSV file contents |
headers | List.<string> |
list of headers |
options | Object |
any Table.load options |
The project follows the Open Knowledge International coding standards. There are common commands to work with the project.Recommended way to get started is to create, activate and load the library environment. To install package and development dependencies into active environment:
::install_github("frictionlessdata/tableschema-r", dependencies = TRUE) devtools
To make test:
test_that(description, {
expect_equal(test, expected result)
})
To run tests:
::test() devtools
More detailed information about how to create and run tests you can find in testthat package.
In NEWS.md described only breaking and the most important changes. The full changelog could be found in nicely formatted commit history.