Tecton 0.4

Overview

Tecton 0.4 was released in June 2022. Tecton 0.4 includes the following framework improvements and changes:

  • Snowflake support
  • API simplification & improvements
  • Materialization info diffs

Snowflake Support

Tecton 0.4 includes compatibility with Snowflake for processing and storing features. Once connected to a Snowflake warehouse, users can define features in Snowflake SQL or Snowpark.

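from datetime import datetime, timedelta

from tecton import Aggregation, batch_feature_view

# `transactions` (a Data Source) and `user` (an Entity) are assumed to be defined elsewhere in the Feature Repo.
@batch_feature_view(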
    sources=[transactions],
    entities=[user],
    mode='snowflake_sql',
    aggregation_interval=timedelta(days=1),
    aggregations=[
        Aggregation(column='TRANSACTION', function='sum', time_window=timedelta(days=1)),
        Aggregation(column='TRANSACTION', function='sum', time_window=timedelta(days=7)),
        Aggregation(column='TRANSACTION', function='sum', time_window=timedelta(days=40)),
        Aggregation(column='AMT', function='mean', time_window=timedelta(days=1)),
        Aggregation(column='AMT', function='mean', time_window=timedelta(days=7)),
        Aggregation(column='AMT', function='mean', time_window=timedelta(days=40)),
    ],
    online=True,
    feature_start_time=datetime(2020, 10, 10),
    description='User transaction totals over a series of time windows, updated daily.'
)
def user_transaction_metrics(transactions):
    return f'''
        SELECT
            USER_ID,
            1 as TRANSACTION,
            AMT,
            TIMESTAMP
        FROM
            {transactions}
        '''

API Simplification and Improvements

Tecton 0.4 includes a large set of changes to simplify and improve Tecton’s declarative Feature Repository API.

SDK 0.4 maintains backwards compatibility through the tecton.compat submodule. Users can migrate from 0.3 to 0.4 without otherwise changing their Feature Repo by importing Tecton objects from tecton.compat instead of tecton.

Functional Changes

  • Removed batch_window_aggregate_feature_view and stream_window_aggregate_feature_view types.
    • batch_feature_view and stream_feature_view now support Tecton window aggregations.
    • Rationale: These object types overlapped significantly and unnecessarily increased the number of concepts that new users had to learn.
  • Changes to materialization timestamp filtering.
    • During materialization, the output of Feature Views will now be automatically filtered to the materialization period (i.e. the window of time that is being backfilled or updated incrementally at steady state).
    • Data Sources no longer require a timestamp column to be defined because the time filter is now applied on the output of the Feature View.
    • Users have two options for optimizing query performance by pushing down timestamp filtering:
      1. Handle time filtering with custom logic using the materialization_context.
      2. Use FilteredSource to have Tecton automatically filter the Data Source to the correct period before the Feature View transformation is applied.
    • Rationale: Tecton's previous timestamp filtering logic worked well when a Feature View had exactly one Data Source and that Data Source had a timestamp column that was used directly as the Feature View feature time. Outside of that case, the filtering logic was unintuitive and a frequent source of bugs. The new logic should be simpler for most users while providing more flexibility for power users.
    • See this batch feature view overview for more information.
  • Introduce “Incremental Backfilling” to Batch Feature Views.
    • incremental_backfills is a new parameter for Batch Feature Views that changes how Tecton backfills the feature view. If set to True, Tecton will backfill every period in the backfill window in its own job. In some cases (e.g. custom aggregations), this can lead to much simpler query definitions.
    • Rationale: Provide a means for users to easily and correctly implement Feature Views with custom aggregations.
    • See this guide for more info.
  • Configurable data_delay on Data Sources.
    • Replaces schedule_offset, a Feature View parameter.
    • By default, incremental (i.e. non-backfill) materialization jobs run immediately at the end of the batch schedule period. data_delay configures how long materialization jobs should wait before running after the end of a period, typically to ensure that all data has landed. For example, if a feature view has a batch_schedule of 1 day and one of the data source inputs has a data_delay of 1 hour, then incremental materialization jobs will run at 01:00 UTC (one hour after the period has ended).
    • Rationale: This parameter delays materialization due to upstream data delays, which logically fits as a Data Source property. Feature Views now inherit data delays from all dependent Data Sources.
  • Support custom names for aggregate features.
    • Allow users to set custom names for aggregate features. (Previously, users had to use Tecton auto-generated names like amount_mean_7d_1d.)
    • Example:
      @batch_feature_view(
          ...
          aggregations=[
              Aggregation(name='transaction_amount_daily_avg', column='amount', function='mean', time_window=timedelta(days=1)),
              Aggregation(name='transaction_amount_weekly_avg', column='amount', function='mean', time_window=timedelta(days=7)),
          ]
      )
      def user_transaction_counts(transactions):
          return f'''
              SELECT
                  user_id,
                  timestamp,
                  amount
              FROM {transactions}
              '''
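
The period-based behavior described in the list above (output filtering, incremental backfills, and data_delay) can be sketched in plain Python. This is an illustration, not Tecton's implementation: every helper name is hypothetical, periods are assumed to be half-open intervals [start, end), and a Feature View with several sources is assumed to wait for the largest data_delay.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical helpers for illustration only; none of these are Tecton APIs.


def split_into_periods(start, end, batch_schedule):
    """Split [start, end) into consecutive periods of length batch_schedule.

    With incremental_backfills=True, each period would get its own backfill job.
    """
    periods = []
    cursor = start
    while cursor < end:
        period_end = min(cursor + batch_schedule, end)
        periods.append((cursor, period_end))
        cursor = period_end
    return periods


def filter_to_period(rows, period_start, period_end, timestamp_field="timestamp"):
    """Keep only output rows whose timestamp falls in [period_start, period_end)."""
    return [r for r in rows if period_start <= r[timestamp_field] < period_end]


def incremental_job_run_time(period_end, data_delays):
    """Assumed rule: wait for the largest data_delay among the input Data Sources."""
    return period_end + max(data_delays, default=timedelta(0))


utc = timezone.utc
periods = split_into_periods(
    datetime(2022, 6, 1, tzinfo=utc), datetime(2022, 6, 4, tzinfo=utc), timedelta(days=1)
)
rows = [
    {"user_id": "a", "timestamp": datetime(2022, 6, 1, 23, 59, tzinfo=utc)},
    {"user_id": "b", "timestamp": datetime(2022, 6, 2, 0, 0, tzinfo=utc)},
]
# Only the first row belongs to the first one-day period.
first_period_rows = filter_to_period(rows, *periods[0])
# batch_schedule of 1 day, one source with data_delay of 1 hour.
run_time = incremental_job_run_time(periods[0][1], [timedelta(hours=1), timedelta(0)])
```

With a batch_schedule of 1 day and a data_delay of 1 hour, run_time is 01:00 UTC on the day after the period starts, which matches the data_delay example in the list.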
      

Non-functional Changes

  • Tecton data types
    • Tecton now uses tecton.types when defining Feature View schemas and Request Data Sources.
    • Example:
      from tecton import on_demand_feature_view, RequestSource
      from tecton.types import Int64, Bool, Field
      
      transaction_request = RequestSource(schema=[Field('amount', Int64)])
      
      @on_demand_feature_view(
          sources=[transaction_request],
          mode='python',
          schema=[Field('transaction_amount_is_high', Bool)],
      )
      def transaction_amount_is_high(transaction_request):
          return {'transaction_amount_is_high': transaction_request['amount'] >= 10000}
      
    • Rationale: Previously Tecton used PySpark data types to define all schemas. This made PySpark a required dependency for the Tecton SDK, but Tecton can now be used with Snowflake and without Spark. Tecton will continue to use native data types (PySpark, Snowflake, etc.) in data platform specific contexts, e.g. when providing an explicit schema for a Spark Data Source.
  • Use timedelta for duration parameters instead of pytime strings.
    • E.g. time_window=timedelta(hours=12) instead of time_window="12h".
    • Rationale: This is consistent with the API’s use of datetime objects, removes an API dependency on the PyTime implementation, and is less ambiguous.
  • Use functional style to define Feature View overrides in Feature Services.
    • Example:
      transaction_fraud_service = FeatureService(
          name="transaction_fraud_service",
          features=[
              # Select a subset of features from a feature view.
              transaction_features[["amount"]],

              # Rename a feature view and/or rebind its join keys. In this example, we want user features for both the
              # transaction sender and recipient, so include the feature view twice and bind it to two different feature
              # service join keys.
              user_features.with_name("sender_features").with_join_key_map({"user_id": "sender_id"}),
              user_features.with_name("recipient_features").with_join_key_map({"user_id": "recipient_id"}),
          ],
      )
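
When migrating duration parameters from strings to timedelta, a small conversion helper can make the mapping explicit. The following is a minimal sketch; the function name is hypothetical and the set of accepted suffixes (s, m, h, d) is an assumption, not a specification of the strings that 0.3 accepted.

```python
import re
from datetime import timedelta

# Hypothetical migration helper, not part of the Tecton SDK.
# The supported unit suffixes are an assumption made for this sketch.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def legacy_duration_to_timedelta(value):
    """Convert a string such as '12h' or '7d' to the equivalent timedelta."""
    match = re.fullmatch(r"\s*(\d+)\s*([smhd])\s*", value)
    if match is None:
        raise ValueError(f"Unsupported duration string: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


half_day = legacy_duration_to_timedelta("12h")
one_week = legacy_duration_to_timedelta("7d")
```

Rejecting anything outside the known suffixes, rather than guessing, keeps ambiguous strings from being converted silently.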
      

Parameter/Class Changes

Class Renames/Changes

0.3 Definition                          0.4 Definition
Data Sources
BatchDataSource                         BatchSource
StreamDataSource                        StreamSource
FileDSConfig                            FileConfig
HiveDSConfig                            HiveConfig
KafkaDSConfig                           KafkaConfig
KinesisDSConfig                         KinesisConfig
RedshiftDSConfig                        RedshiftConfig
RequestDataSource                       RequestSource
SnowflakeDSConfig                       SnowflakeConfig
Feature Views
@batch_window_aggregate_feature_view    @batch_feature_view
@stream_window_aggregate_feature_view   @stream_feature_view
Misc Classes
FeatureAggregation                      Aggregation
New Classes
-                                       AggregationMode
-                                       KafkaOutputStream
-                                       KinesisOutputStream
-                                       FilteredSource
Deprecated Classes in 0.3
Input                                   -
BackfillConfig                          -
MonitoringConfig                        -
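
The one-to-one renames in the table above lend themselves to a scripted rewrite. The sketch below applies them to a string of Python source using whole-word matching. The mapping is copied from the table; the helper name is hypothetical, and the script does not handle parameter changes or the deprecated classes, which still need manual edits.

```python
import re

# 0.3 -> 0.4 names copied from the rename table above.
CLASS_RENAMES = {
    "BatchDataSource": "BatchSource",
    "StreamDataSource": "StreamSource",
    "FileDSConfig": "FileConfig",
    "HiveDSConfig": "HiveConfig",
    "KafkaDSConfig": "KafkaConfig",
    "KinesisDSConfig": "KinesisConfig",
    "RedshiftDSConfig": "RedshiftConfig",
    "RequestDataSource": "RequestSource",
    "SnowflakeDSConfig": "SnowflakeConfig",
    "batch_window_aggregate_feature_view": "batch_feature_view",
    "stream_window_aggregate_feature_view": "stream_feature_view",
    "FeatureAggregation": "Aggregation",
}

# \b ensures that only whole identifiers are replaced.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, CLASS_RENAMES)) + r")\b")


def rename_classes(source):
    """Hypothetical helper: replace 0.3 class and decorator names with 0.4 names."""
    return _PATTERN.sub(lambda m: CLASS_RENAMES[m.group(1)], source)


old_source = "from tecton import BatchDataSource, HiveDSConfig\n@batch_window_aggregate_feature_view(mode='spark_sql')\n"
new_source = rename_classes(old_source)
```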

Feature View/Table Parameter Changes

0.3 Definition                          0.4 Definition
inputs                                  sources
name_override                           name
aggregation_slide_period                aggregation_interval
timestamp_key                           timestamp_field
batch_cluster_config                    batch_compute
stream_cluster_config                   stream_compute
online_config                           online_store
offline_config                          offline_store
output_schema                           schema
family                                  - (removed)
schedule_offset                         - (removed, see DataSource data_delay)
monitoring.alert_email (nested)         alert_email
monitoring.monitor_freshness (nested)   monitor_freshness
monitoring.expected_freshness (nested)  expected_freshness

Data Source Parameter Changes

0.3 Definition                      0.4 Definition
timestamp_column_name               timestamp_field
batch_ds_config                     batch_config
stream_ds_config                    stream_config
raw_batch_translator                post_processor
default_watermark_delay_threshold   watermark_delay_threshold
default_initial_stream_position     initial_stream_position
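
For code that builds Data Source arguments programmatically, the parameter renames above can be applied to a keyword-argument dictionary. This is a minimal sketch with a hypothetical helper name. The mapping is copied from the table, and the example values are placeholders rather than real configuration.

```python
# 0.3 -> 0.4 Data Source parameter names copied from the table above.
DATA_SOURCE_PARAM_RENAMES = {
    "timestamp_column_name": "timestamp_field",
    "batch_ds_config": "batch_config",
    "stream_ds_config": "stream_config",
    "raw_batch_translator": "post_processor",
    "default_watermark_delay_threshold": "watermark_delay_threshold",
    "default_initial_stream_position": "initial_stream_position",
}


def rename_params(kwargs, renames=DATA_SOURCE_PARAM_RENAMES):
    """Hypothetical helper: return a copy of kwargs using 0.4 parameter names."""
    renamed = {}
    for key, value in kwargs.items():
        new_key = renames.get(key, key)
        if new_key in renamed:
            # Refuse to guess when both the old and the new name are present.
            raise ValueError(f"Both old and new names given for {new_key!r}")
        renamed[new_key] = value
    return renamed


old_kwargs = {"name": "transactions", "timestamp_column_name": "ts"}
new_kwargs = rename_params(old_kwargs)
```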

Materialization info in tecton plan

tecton plan will now print a summary of the backfill and incremental materialization jobs that will result from applying a plan. This feature should help users avoid applying changes that trigger more new jobs than expected.

$ tecton apply
...

  + Create FeatureView
    name:            user_transaction_counts
    owner:           matt@tecton.ai
    description:     User transaction totals over a series of time windows, updated daily.
    materialization: 10 backfills, 1 recurring batch job
    > backfill:      9 Backfill jobs 2020-10-03 00:00:00 UTC to 2022-04-14 00:00:00 UTC writing to the Offline Store
                     1 Backfill job 2022-04-14 00:00:00 UTC to 2022-06-06 00:00:00 UTC writing to both the Online and Offline Store
    > incremental:   1 Recurring Batch job scheduled every 1 day writing to both the Online and Offline Store

Patch Updates

0.4.14

  • Support all aggregation modes (disabled, partial, full) for FeatureView.run() for stream feature views with aggregation_mode=AggregationMode.CONTINUOUS.

0.4.13

  • The spark_config parameter to DatabricksClusterConfig and EMRClusterConfig now allows arbitrary configs to be set (previously only a small list was allowed).

0.4.11

  • Release --suppress-recreates to suppress rematerialization for one-off refactors and migrations. See docs for full instructions on usage.

0.4.9

  • Improve point-in-time accuracy for get_historical_features(spine) on feature views with Tecton aggregates.
    • Joins for features in Batch Feature Views now account for the data delay on the input data sources. If a Batch Feature View has multiple sources, then the maximum data delay is used.
    • Joins for features in a Stream Feature View ignore the data delay for their source's batch configuration.
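
The effect of data delay on a point-in-time join can be illustrated with a small model. This sketch assumes that a period's features only become visible to a spine row once the period has ended and the data delay has elapsed, and that periods are aligned to the Unix epoch. Both are simplifying assumptions, and the helper names are hypothetical rather than Tecton APIs.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def effective_data_delay(source_delays, is_stream=False):
    """Batch Feature Views use the largest source delay; Stream Feature Views ignore it."""
    if is_stream:
        return timedelta(0)
    return max(source_delays, default=timedelta(0))


def latest_visible_period_end(spine_time, batch_schedule, data_delay):
    """Assumed model: return the end of the latest period visible at spine_time."""
    elapsed = spine_time - data_delay - EPOCH
    return EPOCH + (elapsed // batch_schedule) * batch_schedule


spine_time = datetime(2022, 6, 2, 0, 30, tzinfo=timezone.utc)
batch_delay = effective_data_delay([timedelta(hours=1), timedelta(minutes=10)])
stream_delay = effective_data_delay([timedelta(hours=1)], is_stream=True)

batch_visible = latest_visible_period_end(spine_time, timedelta(days=1), batch_delay)
stream_visible = latest_visible_period_end(spine_time, timedelta(days=1), stream_delay)
```

Under this model, a spine row at 00:30 UTC on 2 June sees batch features only up to the period ending 1 June 00:00 UTC, because the period ending 2 June is not considered available until 01:00 UTC. With the delay ignored, as for Stream Feature Views, the period ending 2 June 00:00 UTC is already visible.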

0.4.8

  • Update check to allow using Data Sources with data_delay with compat Feature Views with schedule_offset as long as they are equal.

0.4.7

  • Add support for format_string parameter in DatetimePartitionColumn.
  • Improve Python type hints and docstrings for public Python classes and decorators.

0.4.6

  • Improve error message handling on permission errors.

0.4.4

  • Add check to prevent using compat Feature Views with non-compat Data Sources that have a data_delay configured.

0.4.3

  • Adds support for instance_availability=SPOT_WITH_FALLBACK to EMRClusterConfig.

0.4.2

  • Fix bug that prevented some CLI commands from working in Windows environments.

0.4.1

  • Include all 0.3 declarative classes and functions in tecton.compat so that 0.3 repos can be updated by a single find and replace. Replace from tecton import with from tecton.compat import.
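
The find and replace described in 0.4.1 can be scripted. The sketch below rewrites the import lines in a string of Python source. The helper name is hypothetical, and it deliberately leaves other import forms (for example `import tecton`) untouched, so those would still need to be reviewed by hand.

```python
import re

# Match "from tecton import" at the start of a line, allowing indentation.
_TECTON_IMPORT = re.compile(r"^([ \t]*)from tecton import\b", flags=re.MULTILINE)


def use_compat_imports(source):
    """Hypothetical helper: point `from tecton import ...` lines at tecton.compat."""
    return _TECTON_IMPORT.sub(r"\1from tecton.compat import", source)


old_source = (
    "from tecton import Entity\n"
    "from tecton.compat import Input\n"
    "import tecton\n"
)
new_source = use_compat_imports(old_source)
```

Lines that already import from tecton.compat do not match the pattern, so running the rewrite twice yields the same result.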