Transformations

Transformations are Tecton objects that describe a set of operations on data. The operations are expressed through standard frameworks such as Spark SQL, PySpark, and Pandas.

Transformations are required to create Feature Views. Once defined, a Transformation can be reused within multiple Feature Views, or multiple Transformations can be composed within a single Feature View. Using these Transformations with your feature store provides several benefits:

  • Reusability: You can define a common Transformation — to clean up data, for example — that can be shared across all Features.
  • Feature versioning: If you change a Feature Transformation, the Feature Store increments the version of that feature and ensures that you don't accidentally mix features that were computed using two different implementations.
  • End-to-end lineage tracking and reproducibility: Since Tecton manages Transformations, it can tie feature definitions all the way through a training data set and a model that's used in production.
  • Visibility: Data scientists can examine the code to see how a feature is calculated, helping them decide whether it's appropriate to reuse for their model.

Transformation Types

Register a Python function as a Transformation in Tecton by annotating it with @transformation, and set the mode parameter to match the language used for the transformation. The current options are spark_sql, pyspark, and pandas.

Spark SQL

SQL transformations are configured with mode=spark_sql, and return a Spark SQL query.

Function inputs must be Spark DataFrames or Tecton constants. The table names in the FROM clause must be parameterized via the function's inputs.

Example

from tecton import transformation

@transformation(mode="spark_sql")
def user_has_good_credit_transformation(credit_scores):
    return f"""
        SELECT
            user_id,
            IF (credit_score > 670, 1, 0) as user_has_good_credit,
            date as timestamp
        FROM
            {credit_scores}
        """
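Because the function body is an ordinary Python f-string, the parameterization described above is just string substitution: Tecton binds the `credit_scores` input to a concrete table reference before the query runs. As a rough illustration (outside Tecton, with a made-up table name), the rendered query looks like this:

```python
# Sketch only: calling the undecorated function with a hypothetical
# table name shows the SQL string Tecton would receive.
def user_has_good_credit_transformation(credit_scores):
    return f"""
        SELECT
            user_id,
            IF (credit_score > 670, 1, 0) as user_has_good_credit,
            date as timestamp
        FROM
            {credit_scores}
        """

# "credit_scores_batch" is an illustrative placeholder, not a real table.
sql = user_has_good_credit_transformation("credit_scores_batch")
print(sql)
```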

Note that Spark SQL transformations cannot be used within an OnDemandFeatureView.

PySpark

PySpark transformations are configured with mode=pyspark, and contain Python code that is executed within a Spark context. They can also use third-party libraries in user-defined PySpark functions, provided your cluster allows third-party libraries.

Function inputs must be Spark DataFrames or Tecton constants.

Example

@transformation(mode="pyspark")
def user_has_good_credit_transformation(credit_scores):
    from pyspark.sql import functions as F

    df = credit_scores.withColumn("user_has_good_credit", \
        F.when(credit_scores["credit_score"] > 670, 1).otherwise(0))
    return df.select("user_id", \
        df["date"].alias("timestamp"), \
        "user_has_good_credit")

Note that PySpark transformations, like Spark SQL transformations, cannot be used within an OnDemandFeatureView.

Pandas

Pandas transformations are annotated with mode=pandas, and can only be used within an OnDemandFeatureView.

Function inputs must be Pandas DataFrames or Tecton constants.

Example

@transformation(mode="pandas")
def transaction_amount_is_high_transformation(transaction_request):
    import pandas as pd

    df = pd.DataFrame()
    df['transaction_amount_is_high'] = (transaction_request['amount'] >= 10000).astype('int64')
    return df
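Since the body is plain pandas, the same logic can be exercised locally on sample data. A minimal sketch (the request values below are made up for illustration):

```python
import pandas as pd

# Hypothetical request data; in production this comes from the
# OnDemandFeatureView's request source.
transaction_request = pd.DataFrame({"amount": [25.0, 15000.0, 9999.99]})

# Same logic as the transformation body above.
df = pd.DataFrame()
df["transaction_amount_is_high"] = (
    transaction_request["amount"] >= 10000
).astype("int64")

print(df["transaction_amount_is_high"].tolist())  # [0, 1, 0]
```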

Library imports

Only the Transformation function's body is registered with Tecton. This means that imports and other references from outside the Transformation function's body will result in import errors.

In order to use imported libraries, you must import Python libraries inside the Transformation function, not at the top level as you normally would. Avoid using aliases for imports (e.g. use import pandas instead of import pandas as pd).

Note

Custom PyPI library dependencies are not yet supported in Pandas Transformations. pandas and numpy are currently the only libraries supported inside Pandas Transformations.

### Valid
from tecton import transformation

@transformation(mode="pandas")
def my_transformation(request):
    import pandas

    df = pandas.DataFrame()
    df['amount_is_high'] = (request['amount'] >= 10000).astype('int64')
    return df
### Invalid - pandas is imported outside my_transformation!
from tecton import transformation
import pandas

@transformation(mode="pandas")
def my_transformation(request):
    df = pandas.DataFrame()
    df['amount_is_high'] = (request['amount'] >= 10000).astype('int64')
    return df

Any libraries used in function signatures must also be imported outside the function.

from tecton import transformation
import pandas # required for type hints on my_transformation.

@transformation(mode="pandas")
def my_transformation(request: pandas.DataFrame) -> pandas.DataFrame:
    import pandas # required for pandas.DataFrame() below.

    df = pandas.DataFrame()
    df['amount_is_high'] = (request['amount'] >= 10000).astype('int64')
    return df

Local module imports

Tecton supports local imports of certain types of objects. Functions or constants can be imported from local modules. Classes, class instances, and enums cannot be imported. Local module imports must also take place outside of the transformation definition.

### Valid
from tecton import transformation
import pandas # required for type hints on my_transformation.
from my_local_module import my_func, my_int_const, my_string_const, my_dict_const

@transformation(mode="pandas")
def my_transformation(request: pandas.DataFrame) -> pandas.DataFrame:
    import pandas # required for pandas.DataFrame() below.

    df = pandas.DataFrame()
    df[my_dict_const['resultval']] = my_func(request[my_string_const] >= my_int_const)
    return df
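For reference, a `my_local_module` compatible with the valid example above might look like the following. These contents are hypothetical, chosen to show the serializable object types (functions and constants only); the sample data at the end exercises the transformation body outside Tecton:

```python
import pandas

# --- Hypothetical contents of my_local_module ---
# Only functions and constants, which Tecton can serialize, are defined here.
def my_func(series):
    return series.astype("int64")

my_int_const = 10000
my_string_const = "amount"
my_dict_const = {"resultval": "amount_is_high"}

# --- Exercising the transformation body locally with made-up data ---
request = pandas.DataFrame({"amount": [5000, 20000]})
df = pandas.DataFrame()
df[my_dict_const["resultval"]] = my_func(request[my_string_const] >= my_int_const)

print(df["amount_is_high"].tolist())  # [0, 1]
```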
### Invalid: unsupported types
from tecton import transformation
import pandas # required for type hints on my_transformation.
from my_local_module import my_class, my_enum # unsupported types for serialization

@transformation(mode="pandas")
def my_transformation(request: pandas.DataFrame) -> pandas.DataFrame:
    import pandas # required for pandas.DataFrame() below.

    # classes cannot be imported
    df = my_class.create_dataframe()
    # enum objects cannot be imported
    df[my_enum.VAL] = 1
    return df
### Invalid: local module imports inside transformation
from tecton import transformation
import pandas # required for type hints on my_transformation.

@transformation(mode="pandas")
def my_transformation(request: pandas.DataFrame) -> pandas.DataFrame:
    import pandas # required for pandas.DataFrame() below.

    # import statements of local modules cannot be used within transformation function body
    from my_local_module import my_func

    df = pandas.DataFrame()
    df['my_val'] = my_func()

    return df

Transformations vs. Python Functions

Transformations are simply Python functions decorated with @transformation. The primary benefit of using Transformations is discoverability and reusability. Transformations are discoverable in the Web UI and can be (re)used individually using the Tecton SDK.

Transformations can depend on standard Python functions, but those functions will be embedded within the Transformation rather than registered with Tecton as top-level Transformations. The general best practice is to wrap all data transformation logic in a @transformation.
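One safe pattern for such a dependency is to define the helper inside the transformation body, so it is captured along with the body when the function is registered. A minimal sketch (shown undecorated so it runs outside Tecton; all names and the threshold are illustrative):

```python
import pandas

def amount_flags(request):
    # Embedded helper: serialized as part of this function's body,
    # not registered with Tecton as its own Transformation.
    def is_high(series, threshold=10000):
        return (series >= threshold).astype("int64")

    df = pandas.DataFrame()
    df["amount_is_high"] = is_high(request["amount"])
    return df

# Made-up sample data for illustration.
out = amount_flags(pandas.DataFrame({"amount": [100, 50000]}))
print(out["amount_is_high"].tolist())  # [0, 1]
```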

Transformations have strict rules on return types:

  • Must return a string containing a SQL query for mode="spark_sql"
  • Must return a Spark DataFrame for mode="pyspark"
  • Must return a Pandas DataFrame for mode="pandas"

Using Transformations in Feature Views

Once you've created a Transformation, the next step is to call it from a Feature View. See the Feature View Overview for more details.