Notebook Service

1. Introduction

The TCUP Notebook Service is an interactive web-based tool for data scientists to explore, experiment with, and visualize TCUP data and to develop algorithms and models. The interface is a Python notebook, a natural working environment for data scientists, and is built on Project Jupyter.

The TCUP Notebook Service provides a library called PyTCUP. This library provides a wide range of utilities, including loading TCUP sensor data into Apache Spark, reading and writing data from the TCUP Data Lake Service, and running data science programs as TCUP Task Service tasks.

Using the PyTCUP library, the TCUP Notebook Service lets users write Python code with Apache Spark and execute machine learning tasks in a distributed Spark cluster. TCUP Notebook is a powerful browser-based tool that allows users to create and share documents containing live code, visualizations, explanatory text, and equations. In addition to PyTCUP, the entire ecosystem of Python-based libraries and tools is available to the user. This makes it easy for data scientists to work on top of TCUP data and create new value-adding algorithms and models.

Data integration and transformation are important requirements for data analysis on TCUP big data. PyTCUP ETL can be used for generic data transformation and for data preprocessing for machine learning.

TCUP big data is the primary data source for PyTCUP ETL. PyTCUP can also access data from external sources such as a PostgreSQL database, AWS S3, and Azure Blob Storage. The data is loaded into the distributed Spark cluster, where the ETL operations are performed.
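As a rough illustration, the snippet below shows what such loading looks like in plain PySpark; the PyTCUP ETL wrappers are assumed to perform similar reads under the hood, and the hosts, table names, and bucket names are placeholders, not TCUP endpoints.

    # Illustrative only: standard PySpark reads from the kinds of external
    # sources listed above. Connection details are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("tcup-etl-example").getOrCreate()

    # Read from a PostgreSQL database over JDBC (needs the PostgreSQL JDBC driver).
    pg_df = (spark.read.format("jdbc")
             .option("url", "jdbc:postgresql://db-host:5432/sensors")  # placeholder
             .option("dbtable", "public.readings")
             .option("user", "analyst")
             .option("password", "secret")
             .load())

    # Read CSV files from AWS S3 (needs the hadoop-aws package and credentials).
    s3_df = spark.read.csv("s3a://my-bucket/raw/*.csv", header=True, inferSchema=True)

    pg_df.show(5)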

1.1 Intended Audience

The intended audience of this document is anyone who wants an overview of the TCUP Notebook Service and its capabilities as part of the TCUP IoT platform. This includes:

  • Application Developer

  • Solution and Technical Architect

  • TCUP Platform Support Team

  • Infrastructure and Deployment Team

2. Key Concepts

Jupyter notebooks are a series of “cells” containing executable code or Markdown, a popular lightweight markup language, for prose descriptions. This service also integrates with the TCUP Sensor Observation Service, Task Service, and Data Lake Service. PyETL runs on Spark for distributed computing and on pandas for non-distributed computing.

To use the Notebook Service, a user needs to understand some of the basic concepts of the service, which are described in the following sections:

2.1 Notebook APIs

The APIs for tenant management, user login, authorization, and notebook Docker control are built on JupyterHub.
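For orientation, the sketch below exercises the standard JupyterHub REST API that these TCUP APIs build on; the hub URL and admin token are placeholders, and TCUP's own tenant-management and x-api-key handling are layered on top and not shown here.

    # Standard JupyterHub REST calls (the base these APIs are built on).
    import requests

    HUB = "https://notebook.example.com/hub/api"   # placeholder hub URL
    headers = {"Authorization": "token <admin-api-token>"}

    # Create a user.
    requests.post(f"{HUB}/users/alice", headers=headers)

    # Start (spawn) that user's notebook server, i.e. their Docker container.
    requests.post(f"{HUB}/users/alice/server", headers=headers)

    # Check the user's server status.
    r = requests.get(f"{HUB}/users/alice", headers=headers)
    print(r.json().get("server"))   # server URL path if running, None otherwise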

2.2 Jupyter Notebook

A browser-based workbench for interactively capturing the whole computation process: developing, documenting, and executing code, as well as communicating the results. Visualization using matplotlib and data manipulation using pandas, PySpark, NumPy, etc. are also possible.
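A typical notebook cell combines these libraries; the sample below uses synthetic data, but a DataFrame loaded from TCUP works the same way.

    # Manipulate data with pandas and plot it with matplotlib.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.DataFrame({
        "timestamp": pd.date_range("2024-01-01", periods=24, freq="h"),
        "temperature": [21.5 + 0.3 * i for i in range(24)],
    })

    df.plot(x="timestamp", y="temperature", title="Hourly temperature")
    plt.show()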

2.3 Notebook Container

User-specific Docker containers are spawned by the Notebook Service when the respective notebook servers are accessed.

2.4 PyTCUP lib

This is a Python 3 library that provides an interface to TCUP services. It currently supports the Sensor Observation Service (SOS), SOS Bulk Operations, the Task Service, the Data Lake Service, and the Spark cluster.

The following is an abridged list of the methods supported by the PyTCUP lib (a sketch follows the list):

  • Get the Spark context from the Spark cluster.

  • Get sensor observations as Spark DataFrames using Spark and the Bulk Operations service.

  • Get sensor metadata, sensor capabilities, and sensor observed properties from the TCUP Sensor Observation Service.

  • Create projects and tasks, deploy tasks, get task status, and download task files from the TCUP Task Service.
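The hypothetical sketch below indicates how such methods might be called from a notebook. The module, class, and method names are assumptions made for illustration only; consult the API Guide for the actual PyTCUP interface.

    # HYPOTHETICAL: names below are illustrative, not the real PyTCUP API.
    import pytcup                                   # assumed module name

    client = pytcup.Client(api_key="<x-api-key>")   # assumed constructor

    sc = client.get_spark_context()                 # Spark context from the cluster
    obs_df = client.get_observations(               # observations as a Spark DataFrame
        sensor_id="engine-42",
        start="2024-01-01T00:00:00Z",
        end="2024-01-31T23:59:59Z",
    )
    meta = client.get_sensor_metadata("engine-42")  # metadata from the SOS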

2.5 Apache Spark Cluster

Apache Spark is an open-source, general-purpose distributed cluster-computing framework for big data analysis. The Spark driver (client) in a notebook submits batch jobs to the cluster, which is managed by the Spark master, while in-memory computations are carried out by the Spark workers.
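A minimal sketch of connecting a notebook's driver to a standalone Spark cluster with PySpark; the master URL is a placeholder for the cluster that TCUP manages.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("spark://spark-master:7077")   # placeholder master URL
             .appName("notebook-analysis")
             .getOrCreate())

    # The driver stays in the notebook; the work runs on the Spark workers.
    print(spark.range(1_000_000).selectExpr("sum(id)").collect())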

2.6 Notebook UI Utility

The tenant admin can create/delete users and start/stop notebook Docker servers within their own group using a graphical user interface.

2.7 DataFrame

This is a two-dimensional labeled data structure with columns of potentially different types.
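For example, in pandas:

    import pandas as pd

    # Labeled columns of different types in one two-dimensional structure.
    df = pd.DataFrame({
        "sensor": ["s1", "s2", "s3"],     # string column
        "reading": [20.1, 19.8, 21.4],    # float column
        "healthy": [True, True, False],   # boolean column
    })
    print(df.dtypes)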

2.8 RDD

A Resilient Distributed Dataset (RDD) is an immutable distributed collection of objects. Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
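A short PySpark illustration of partitioned RDDs:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-example").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(100), numSlices=4)   # 4 logical partitions
    print(rdd.getNumPartitions())                   # -> 4
    print(rdd.map(lambda x: x * x).sum())           # computed across partitions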

2.9 PyETL

Extract-Transform-Load (ETL) operations using PyTCUP. PyETL can be used for generic as well as customized data transformation and for data preprocessing for machine learning. Data processing can be done in both distributed and non-distributed environments.
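To illustrate the distributed/non-distributed split that PyETL works across, the snippet below expresses the same cleaning step in pandas (single machine) and in PySpark (cluster); the PyETL wrappers themselves are not shown.

    import pandas as pd
    from pyspark.sql import SparkSession, functions as F

    data = [("s1", 20.1), ("s2", None), ("s3", 21.4)]

    # Non-distributed: pandas on a single machine.
    pdf = pd.DataFrame(data, columns=["sensor", "reading"])
    pdf["reading"] = pdf["reading"].fillna(pdf["reading"].mean())

    # Distributed: the same fill-nulls step on a Spark cluster.
    spark = SparkSession.builder.appName("pyetl-example").getOrCreate()
    sdf = spark.createDataFrame(data, ["sensor", "reading"])
    mean = sdf.agg(F.avg("reading")).first()[0]
    sdf = sdf.fillna({"reading": mean})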

2.10 Model Management

Model management is primarily done with MLflow, which keeps a record of models and their performance metrics through its model registry and tracking components. The platform makes it possible to share and describe different model versions and to deploy the latest version through built-in APIs.
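A minimal MLflow tracking-and-registry example of the kind this component relies on; the tracking URI and the registered model name are placeholders.

    import mlflow
    from sklearn.linear_model import LinearRegression

    mlflow.set_tracking_uri("http://mlflow.example.com:5000")  # placeholder

    with mlflow.start_run():
        X, y = [[0], [1], [2]], [0.1, 0.9, 2.1]
        model = LinearRegression().fit(X, y)
        mlflow.log_metric("train_score", model.score(X, y))
        # Log the model and register a new version in the model registry.
        mlflow.sklearn.log_model(model, "model", registered_model_name="engine-rul")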

3. Functional Capabilities

The Notebook Service has the following functional capabilities:

  • Get a list of sensors or features, sensor metadata, sensor capabilities, and sensor observed properties from the TCUP Sensor Observation Service.

  • Get sensor observations as Spark DataFrames for data analysis.

  • Bulk upload of sensor observations in CSV format to the Sensor Observation Service.

  • Get the Spark context from the Spark cluster.

  • Run algorithms in a distributed Spark cluster using Spark DataFrames and RDDs, leveraging the power of Spark's in-memory computation.

  • Get file URIs with metadata from the TCUP Data Lake Service.

  • Get Spark DataFrames for given files from the TCUP Data Lake Service for data analysis.

  • Write analyzed Spark DataFrames back to the TCUP Data Lake Service.

  • Create projects and tasks, deploy tasks, get task status, download task files from the TCUP Task Service, and upload Python programs to the TCUP Task Service using the PyTCUP library.

  • Develop and deploy Python programs to the TCUP Task Service using the PyTCUP library.

  • Easy use of Python machine learning libraries (e.g. scikit-learn, Spark ML) in the notebook.

  • The tenant owner can manage users and start/stop/access notebook Docker servers within their own group using the notebook GUI in the TCUP tenant portal or via REST APIs.

  • REST APIs to create/delete users and start/stop notebook servers.

  • A GUI utility for the tenant owner to view the list of users.

  • An individual, isolated Docker-based notebook environment for each user.

  • Access to the user-specific notebook from a browser with valid user credentials.

  • Visualize analytics output using pandas, matplotlib, etc.

  • PyTCUP ETL can be used for generic data transformation and for data preprocessing for machine learning. ETL runs on Spark for distributed computing and on pandas for non-distributed computing.

  • Load data to and extract data from multiple sources and different file types:

    • Data sources – TCUP SOS, TCUP Data Lake, Azure Blob Storage, Azure Files, AWS S3, PostgreSQL, HDFS, Apache HBase, NFS file systems, etc.

    • File types – CSV (including custom separators), TSV, Parquet, JSON, XLS, XML, text, HDF5, vibration data files (NI TDMS), automotive (MDF4), etc.

  • Transformations include basic and advanced operations:

    • Selection – cardinal/ordinal

    • Insertions – add, append, concatenate, derive, equate

    • Verify & filter – verify with a function & schema validation, apply expression/lambda

    • Replace & scale – fill null values & replace/scale with a function, etc.

    • Header transformations – add/rename/remove headers, get header lists, etc.

    • Conversions – apply functions/equations, cast, convert, split, pack and unpack, etc.

    • Regular expression search

    • Joining & sorting

    • Set operations – complement, difference, intersection

    • Advanced transform – transpose

  • Data plots – line, bar, scatter, histogram, box, heat map, correlation, time-series distribution, etc.

  • Derive various statistical measures (mean, variance, standard deviation, kurtosis, skewness, etc.), frequency transformations (peak-to-peak, Fast Fourier Transform, etc.), and exponent calculations from raw signal data using feature engineering, as sketched below.
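The sketch below shows this kind of feature engineering on a synthetic raw signal using NumPy and SciPy.

    import numpy as np
    from scipy.stats import kurtosis, skew

    fs = 1000                                  # sampling rate in Hz
    t = np.arange(0, 1, 1 / fs)
    signal = np.sin(2 * np.pi * 50 * t) + 0.1 * np.random.randn(t.size)

    # Statistical measures from the raw signal.
    features = {
        "mean": np.mean(signal),
        "std": np.std(signal),
        "kurtosis": kurtosis(signal),
        "skewness": skew(signal),
        "peak_to_peak": np.ptp(signal),
    }

    # Frequency-domain feature: dominant frequency from the FFT.
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(signal.size, d=1 / fs)
    features["dominant_freq_hz"] = freqs[np.argmax(spectrum[1:]) + 1]  # skip DC bin
    print(features)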

4. Purpose/Usage

TCUP Notebook is a key component of the platform, providing a containerized web-based workspace for data scientists. TCUP is a multi-tenant platform where multiple data scientists can work on the same tenant's data set. The admin for a specific tenant can create or delete notebook users under that tenant.

The TCUP admin can activate the Notebook Service for a tenant. The tenant owner has the privileges to create or delete notebook users but is not allowed to delete or get information about other tenants. Using the API key, the tenant owner can create notebook users (using the Notebook Service Users API) and start/stop notebook containers (using the Notebook Service user-server API).

After a notebook user is created, the data scientist can log in to their own notebook container, use the PyTCUP library, and connect to the Spark cluster from the Jupyter Notebook.

Data scientists can start or stop their server and log out using the Jupyter Notebook GUI. These notebook users have the least privileges, i.e. they can only obtain their own notebook Docker server's status (whether the server is started or stopped) and can start or stop their own notebook server using the API.

User credentials and key – the TCUP Notebook Service API authenticates with an x-api-key header. Along with the API key, each notebook user accesses the service with their unique user name and password.
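The hypothetical snippet below illustrates this authentication pattern; the base URL and endpoint paths are assumptions, so refer to the API Guide for the actual routes.

    # HYPOTHETICAL routes, shown only to illustrate x-api-key authentication.
    import requests

    BASE = "https://tcup.example.com/notebookservice"   # placeholder base URL
    headers = {"x-api-key": "<tenant-api-key>"}

    # Create a notebook user (assumed Users API route).
    requests.post(f"{BASE}/users", headers=headers,
                  json={"username": "alice", "password": "<password>"})

    # Start that user's notebook container (assumed user-server API route).
    requests.post(f"{BASE}/users/alice/server", headers=headers)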

Model management serves the purpose of tracking, packaging, and governing analytics models and is accessible through the TCUP Notebook interface only. TCUP Analytics models, as well as any client-specific custom models, can be managed and tracked by MLflow, whose UI is available to data scientists. The best-performing model can be served as a deployed application on the TCUP App Deployment Service and accessed through a REST interface.

5. Example

Consider a data scientist who wants to do some analysis of historical time-series data present in the TCUP platform. The data, acquired from different sources (e.g. aircraft engines), is stored in TCUP Data Lake Service data storage. A typical session proceeds as follows (a code sketch follows the steps):

  1. Notebook users visit the TCUP Notebook UI login page and log in using their user credentials.

  2. After login, the initial home screen can be viewed. Like a file browser, this is where all notebooks and files can be browsed.

  3. The users can create a new Python 3 notebook or use an existing notebook from their workspace.

  4. The users import the PyTCUP module within the workspace.

  5. They load data from the TCUP Data Lake Service using the PyTCUP library utility methods. Users can load this data into a pandas DataFrame or into a Spark DataFrame.

  6. They can use standard Python plotting libraries to plot and visualize the data within the workspace notebook cells.

  7. They develop ETL scripts and analytical models in the notebook using PyETL and PyAnalytics.

  8. They leverage both distributed (Spark) and non-distributed environments.

  9. They track models and measure model performance using MLflow.

  10. The users can save the changes as a notebook file and download it to their desktop in .ipynb, .py, .html, .pdf, .tex, .rst, or .md format.

  11. The notebook files can also be executed as TCUP Task Service tasks.
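An end-to-end sketch of steps 4–6 is shown below. The PyTCUP calls are hypothetical: the module, constructor, and method names are assumptions based on the capability list above, and the file name is a placeholder.

    # HYPOTHETICAL PyTCUP calls; only the pandas/matplotlib usage is standard.
    import pytcup                                  # step 4 (assumed module)
    import matplotlib.pyplot as plt

    client = pytcup.Client(api_key="<x-api-key>")  # assumed constructor

    # Step 5: load a Data Lake file into a pandas DataFrame (assumed method).
    df = client.datalake_read("engine_timeseries.csv", as_pandas=True)

    # Step 6: visualize with a standard plotting library.
    df.plot(x="timestamp", y="vibration")
    plt.show()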

6. Reference Documents

For more details about this service, please refer to the following documents:

  1. User Guide

  2. API Guide