Getting Started

Installation

Install from PyPI using pip:

pip install canvas-data-sdk

Usage

There are two ways to use this library: you can call the API from your own code, allowing highly customized workflows, or you can perform basic operations using the included command-line utility.

Using the command-line utility

Installing the module via pip should have also installed a command-line utility called canvas-data. You can get help by using the --help option, which produces output like:

Usage: canvas-data [OPTIONS] COMMAND [ARGS]...

A command-line tool to work with Canvas Data. Command-specific help is
available at: canvas-data COMMAND --help

Options:
  -c, --config FILENAME
  --api-key TEXT
  --api-secret TEXT
  --help                 Show this message and exit.

Commands:
  get-ddl            Gets DDL for a particular version of the...
  get-dump-files     Downloads the Canvas Data files for a...
  get-schema         Gets a particular version of the Canvas Data...
  list-dumps         Lists available dumps
  unpack-dump-files  Downloads, uncompresses and re-assembles the...

The utility has several commands, which are listed in the help text above. You can get more details on each command by typing:

canvas-data COMMAND --help

For example:

canvas-data get-schema --help

Configuring the command-line utility

Two global options are required for all of the commands. You can pass them on the command line by placing them before the command:

canvas-data --api-key=XXXXX --api-secret=YYYYY COMMAND [command options]

Alternatively, you can create a YAML-formatted config file and pass it with the -c option instead. Several of the commands need to know where to store downloaded file fragments and where to store the re-assembled data files; you can specify these locations in the config file too. For example, create a config file called config.yml containing:

api_key: XXXXX
api_secret: YYYYY
download_dir: ./downloads
data_dir: ./data

Now you can use it like this:

canvas-data -c config.yml COMMAND [command options]

Setting Up Your Database

Before you can load any data into your database, you first need to create all of the tables. You may also need to re-create tables if portions of the schema change in the future.

You can use the get-ddl command to generate a Postgres- or Amazon Redshift-compatible DDL script based on the JSON-formatted schema definition provided by the Canvas Data API. It defaults to the latest version of the schema, but you can specify a different version if needed:

canvas-data -c config.yml get-ddl > recreate_tables.sql

Note that this script will contain a DROP TABLE and a CREATE TABLE statement for every table in the schema. Please be very careful when running it: it will remove all of the data from your database, and you’ll need to reload it.
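If you’re using Postgres, you can run the generated script with the standard psql client. A minimal sketch (canvas_data is a placeholder for your own database name):

psql -d canvas_data -f recreate_tables.sql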

Listing the Available Dumps

Instructure typically creates one dump per day containing the full contents of most of the tables, and incremental data for the requests table. Occasionally Instructure will produce a full dump of the requests table containing data going back to the start of your instance.

You can use the list-dumps command to see the dumps that are available:

canvas-data -c config.yml list-dumps

Details for each dump will be displayed, including the sequence and dump ID. Full-requests-table dumps will be highlighted.

Getting and Unpacking Data Files

You can fetch all of the files for a particular dump (besides the requests files; more on that later), decompress them, and re-assemble them into a single file for each table by using this command:

canvas-data -c config.yml unpack-dump-files

This command will default to fetch data from the latest dump, but you can choose a specific dump by passing the --dump-id parameter. You can limit the command to just fetch and reassemble the data files for a single table by passing the --table parameter.
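For example, to fetch and re-assemble only the course_dim table from a specific dump (a sketch; substitute a real dump ID from list-dumps):

canvas-data -c config.yml unpack-dump-files --dump-id SOME_DUMP_ID --table course_dim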

The command will create a sub-directory underneath your data directory named after the dump sequence number, and all of the data files will be stored under that.

A SQL script called reload_all.sql (or reload_<table_name>.sql if you’re just unpacking the data for a single table) will also be stored inside the dump sequence directory. It contains SQL statements that truncate all of the tables (besides the requests table) and load each of the data files into your database. This can be used as part of a daily refresh process to keep all of your tables up to date. The SQL commands are known to be compatible with Postgres and Amazon Redshift databases; YMMV with other databases.
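For example, if dump sequence 560 was unpacked into ./data/560, a Postgres refresh could look like this (the database name is a placeholder):

psql -d canvas_data -f ./data/560/reload_all.sql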

Downloading Data File Fragments

You can just download the compressed file fragments like this:

canvas-data -c config.yml get-dump-files

Note that if you later run the unpack-dump-files command, it won’t need to re-download files that you’ve already fetched using get-dump-files.

Using the API in your own code

First, create a CanvasDataAPI object. You need to supply your API key and secret. Here we assume that those are available in environment variables, but you could read them from configuration, too:

import os
from canvas_data.api import CanvasDataAPI

API_KEY = os.environ['API_KEY']
API_SECRET = os.environ['API_SECRET']

cd = CanvasDataAPI(api_key=API_KEY, api_secret=API_SECRET, download_chunk_size=1024*1024)

Now you can use this object to interact with the API as detailed below. The download_chunk_size value controls how much data is read into memory when stream-downloading files. Larger values will consume more memory; smaller values will consume more CPU. A chunk size of 1 MB (1024*1024) is probably reasonable in most setups.
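As an alternative to environment variables, you could read the credentials from the same YAML config file that the command-line utility uses. A minimal sketch, assuming the config.yml shown earlier:

import yaml
from canvas_data.api import CanvasDataAPI

# Read api_key and api_secret from the CLI-style YAML config file.
with open('config.yml') as f:
    config = yaml.safe_load(f)

cd = CanvasDataAPI(api_key=config['api_key'],
                   api_secret=config['api_secret'])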

Schemas

Instructure occasionally updates the Canvas Data schema, and each change has a version number. To retrieve all of the schema versions that are available:

schema_versions = cd.get_schema_versions()

which will return a list similar to the following:

[ {u'createdAt': u'2016-03-29T21:35:23.215Z', u'version': u'1.9.1'},
  {u'createdAt': u'2016-03-11T17:38:01.877Z', u'version': u'1.9.0'},
  {u'createdAt': u'2016-03-10T20:10:16.361Z', u'version': u'1.8.0'},
  {u'createdAt': u'2016-02-18T23:52:56.214Z', u'version': u'1.6.0'},
  ...
]

You can retrieve a specific version of the schema:

schema = cd.get_schema('1.6.0', key_on_tablenames=True)

Or you can retrieve the latest version of the schema:

schema = cd.get_schema('latest', key_on_tablenames=True)
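With key_on_tablenames=True, the returned schema can be treated as a mapping from table name to table definition. A quick sketch under that assumption:

# Assumes the schema is a dict keyed on table name when
# key_on_tablenames=True.
for table_name in sorted(schema):
    print(table_name)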

Dumps

Instructure produces nightly dumps of gzipped data files from your Canvas instance. Each nightly dump will contain the full contents of most tables, and incremental data for others (currently just the requests table). To retrieve a list of all of the nightly dumps that are available:

dumps = cd.get_dumps()

which will return a list similar to the following:

[{u'accountId': u'9999',
  u'createdAt': u'2017-04-29T02:03:38.247Z',
  u'dumpId': u'125a3cb0-2cf3-11e7-84a8-784f4352af0c',
  u'expires': 1498615418247,
  u'finished': True,
  u'numFiles': 79,
  u'schemaVersion': u'1.16.2',
  u'sequence': 560,
  u'updatedAt': u'2017-04-29T02:03:39.663Z'},
 {u'accountId': u'9999',
  u'createdAt': u'2017-04-28T02:03:05.520Z',
  u'dumpId': u'1ab0aacc-2cf3-11e7-8299-784f4352af0c',
  u'expires': 1498528985520,
  u'finished': True,
  u'numFiles': 79,
  u'schemaVersion': u'1.16.2',
  u'sequence': 559,
  u'updatedAt': u'2017-04-28T02:03:07.373Z'},
 {u'accountId': u'9999',
  u'createdAt': u'2017-04-27T01:58:08.551Z',
  u'dumpId': u'24f4d347-2cf3-11e7-b1fa-784f4352af0c',
  u'expires': 1498442288551,
  u'finished': True,
  u'numFiles': 79,
  u'schemaVersion': u'1.16.2',
  u'sequence': 558,
  u'updatedAt': u'2017-04-27T01:58:11.533Z'},
 ...
]
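Each entry carries the fields shown above, so you can select a dump programmatically. For example, to pick the most recent finished dump (a sketch based on the sample output's fields):

# Choose the finished dump with the highest sequence number.
finished_dumps = [d for d in dumps if d['finished']]
latest_dump = max(finished_dumps, key=lambda d: d['sequence'])
print(latest_dump['dumpId'], latest_dump['schemaVersion'])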

Files

You can get details on all of the files contained in a particular dump:

dump_contents = cd.get_file_urls(dump_id='125a3cb0-2cf3-11e7-84a8-784f4352af0c')

Usually you’ll just want to get the latest dump:

dump_contents = cd.get_file_urls(dump_id='latest')

The complete data for each table can be quite large, so Instructure chops it into fragments and gzips each fragment file. You can download all of the gzipped fragments for a particular dump:

files = cd.download_files(dump_id='latest',
                          include_requests=False,
                          directory='./downloads')

The requests data is very large and needs to be handled differently from the rest of the tables since it’s an incremental dump. If you want to download everything but the requests data, set the include_requests parameter to False as above.

Typically you’ll want to download the dump files for a particular table, uncompress them, and re-assemble them into a single data file that can be loaded into a table in your local data warehouse. To do this:

local_data_filename = cd.get_data_for_table(table_name='course_dim')

This defaults to downloading and re-assembling files from the latest dump, but you can optionally specify a particular dump:

local_data_filename = cd.get_data_for_table(table_name='course_dim',
                                            dump_id='125a3cb0-2cf3-11e7-84a8-784f4352af0c')
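From there you can load the file into your warehouse. A minimal sketch for Postgres using psycopg2, assuming a local database named canvas_data with the tables already created via get-ddl, and a tab-delimited data file (both assumptions; adjust for your setup):

import psycopg2

# COPY the re-assembled file into an existing course_dim table.
conn = psycopg2.connect(dbname='canvas_data')
with conn:
    with conn.cursor() as cur:
        cur.execute('TRUNCATE course_dim')
        with open(local_data_filename) as f:
            cur.copy_expert('COPY course_dim FROM STDIN', f)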