
Creating a Data Source

Overview

Tecton supports connections to many different data sources. This example uses a Hive table for batch data, but the same principles apply to any raw data source, including streams. See the Data Sources overview or the Data Source API for more details.

You must register a data source with Tecton before you define features based on that data. To register a data source, follow these steps:

  1. Define a Data Source object.
  2. Apply your Data Source to Tecton using the Tecton CLI.
  3. Test the Data Source by querying it in a notebook.

This guide assumes you've already set up the permissions required for Tecton to read from the source.

Defining a Data Source

In this example, we define a BatchDataSource that contains the configuration necessary for Tecton to access our Hive user table.

Create a new file in your feature repository, and paste in the following code:

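from tecton import HiveDSConfig, BatchDataSource

fraud_users_batch = BatchDataSource(
    # Name under which the data source is registered in Tecton.
    name='users_batch',
    # Connection details: the Hive database and table to read from.
    batch_ds_config=HiveDSConfig(
        database='fraud',
        table='fraud_users'
    ),
    # Metadata used to organize the data source.
    family='fraud',
    owner='matt@tecton.ai',
    tags={'release': 'production'}
)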

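The definition above also sets metadata parameters that help organize the data source: name, family, owner, and tags. The name, users_batch, is the identifier you use later to look up the data source in a workspace.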

Applying a Data Source

So far, all we've done is write code in our local feature repository. To use the data source in Tecton, we need to apply the new definition with the Tecton CLI:

$ tecton apply
Using workspace "prod"
✅ Imported 15 Python modules from the feature repository
✅ Collecting local feature declarations
✅ Performing server-side validation of feature declarations
 ↓↓↓↓↓↓↓↓↓↓↓↓ Plan Start ↓↓↓↓↓↓↓↓↓↓

  + Create BatchDataSource
        name: users_batch

 ↑↑↑↑↑↑↑↑↑↑↑↑ Plan End ↑↑↑↑↑↑↑↑↑↑↑↑
Are you sure you want to apply this plan? [y/N]>

Enter y to apply this definition to Tecton.

Testing the Data Source in a Notebook

To verify that the data source is connected properly, use the Tecton SDK in a notebook environment:

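import tecton

# Use the workspace you applied to ("prod" in the CLI output above).
users_batch = tecton.get_workspace('prod').get_data_source('users_batch')

# Show the first 10 rows to confirm that Tecton can read the table.
users_batch.get_dataframe().to_spark().limit(10).show()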
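Beyond eyeballing a few rows, it can help to confirm that the columns you plan to build features on are actually present. The following is a minimal sketch, not part of the Tecton SDK: missing_columns is a hypothetical helper, and the column names are assumptions standing in for whatever your table really contains.

```python
from typing import Iterable, List


def missing_columns(actual: Iterable[str], expected: Iterable[str]) -> List[str]:
    """Return the expected column names that are absent from `actual`, in order."""
    present = set(actual)
    return [name for name in expected if name not in present]


# Hypothetical column names; replace them with the real columns of your table.
EXPECTED_COLUMNS = ["user_id", "signup_date"]

# In a notebook you would pass the columns of the Spark DataFrame, for example:
#   spark_df = users_batch.get_dataframe().to_spark()
#   missing = missing_columns(spark_df.columns, EXPECTED_COLUMNS)
# Here a plain list stands in for spark_df.columns so the sketch runs anywhere.
missing = missing_columns(["user_id", "signup_date", "credit_score"], EXPECTED_COLUMNS)
print(missing or "all expected columns present")
```

An empty result means every expected column was found; any names returned point to a mismatch between the table and what your features will expect.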

With a Data Source defined and verified, you are now ready to define Tecton Feature Views that make use of this data.