🍉 Casual execution time comparison: updating rows per group in Pandas

Posted February 9, 2024 by snrsw ‐ 5 min read

Lightweight comparison of the execution time of different methods for updating rows per group in Pandas.

TL;DR

In this post, we consider the efficient methods of updating rows per group in Pandas. We compare the average execution time of different methods for different sizes of data to find the most efficient method. The result shows that the groupby method is the most efficient in many cases.

Problem and Goals

The problem of updating rows per group is described as follows:

Input:

idvalue
A1
A2
B3
B4
C5

Task:

  • Update the value per group
    • Let us update the value of id A to value * 2, id B to value * 10, and id C to value + 1.

Output:

idvalue
A2
A4
B30
B40
C6

The goal of this post is to compare the efficiency of different methods for updating rows per group in Pandas.

Experiment

Environment

All codes are executed in Google Colabratory with CPU.

Methods of updating rows per group

We consider the following methods for updating rows per group in Pandas:

  1. Using pd.DataFrame.loc
  2. Using pd.DataFrame.groupby
  3. Using pd.DataFrame.apply
  4. Using pd.DataFrame.mask
  5. Using np.where
  6. Using map and pd.DataFrame.apply

(I think these are common methods for updating rows per group in Pandas. If you know other efficient methods, please let me know.)

In detail, the following functions are used for each method: Note that id2newvalue is tasks represented by a dictionary that has the mapping from id to function for updating the value. This is defined by

def generate_id2newvalue(dataframe: pd.DataFrame) -> dict[int, Callable[[int], int]]:
    return {
        _: lambda value: int(value * np.random.rand())
        for _ in dataframe["id"].unique()
    }

Using pd.DataFrame.loc

def use_loc(df: pd.DataFrame, id2newvalue: dict[int, Callable[[int], int]]) -> None:
    for id in df["id"].unique():
        df.loc[df["id"] == id, "value"] = df.loc[df["id"] == id, "value"].map(id2newvalue[id])
    return df

Using pd.DataFrame.groupby

def use_groupby(df: pd.DataFrame, id2newvalue: dict[int, Callable[[int], int]]) -> None:
    return pd.concat(
        [
            gdf.assign(
                value=lambda df: df["value"].map(id2newvalue[g])
            )
            for g, gdf in df.groupby("id")
        ]
    )

Using pd.DataFrame.apply

def use_apply(df: pd.DataFrame, id2newvalue: dict[int, Callable[[int], int]]):
    df["value"] = df.apply(lambda row: id2newvalue[row["id"]](row["value"]), axis=1)
    return df

Using pd.DataFrame.mask

def use_pd_mask(df: pd.DataFrame, id2newvalue: dict[int, Callable[[int], int]]):
    for id in df["id"].unique():
        df["value"].mask(df["id"] == id, df["value"].map(id2newvalue[id]), inplace=True)
    return df

Using np.where

def use_np_where(df: pd.DataFrame, id2newvalue: dict[int, Callable[[int], int]]):
    for id in df["id"].unique():
        df["value"] = np.where(df["id"] == id, df["value"].map(id2newvalue[id]), df["value"])
    return df

Using map and pd.DataFrame.apply

def use_map_and_apply(df: pd.DataFrame, id2newvalue: dict[int, Callable[[int], int]]):
    df["_lambda"] = df["id"].map(id2newvalue)
    df["value"] = df.apply(lambda row: row["_lambda"](row["value"]), axis=1)
    return df

Data and its size

The size of the data is defined as a pair of the number of unique ids and the number of rows per id. The data is generated by the following functions:

def generate_dataframe(n_ids: int, n_rows_per_id: int) -> pd.DataFrame:
    return pd.concat([
        pd.DataFrame(
            {
                "id": [id] * n_rows_per_id,
                "value": [id] * n_rows_per_id
            }
        )
        for id in range(n_ids)
    ]).reset_index(drop=True)

Comparison Methods

We compare the average execution time of the above methods for different sizes of data.

The average execution time is calculated by the following functions:

# use as a decorator to measure the execution time of a function
def log_exec_time(f: Callable) -> Callable:
    def _wrapper(*args: Any, **kwargs: Any) -> Any:
        start = datetime.datetime.now()

        v = f(*args, **kwargs)

        end = datetime.datetime.now()
        exec_time = end - start
        print(f"{f.__name__} execution time: {exec_time}")
        return v, exec_time
    return _wrapper

# calculate the average execution time of a function decorated by `log_exec_time`
def average_exec_time(
    df: pd.DataFrame,
    id2newvalue: dict[int, int],
    f: Callable,
    n_experiments: int = 3
) -> datetime.timedelta:
    exec_times = []
    for _ in range(n_experiments):
        result, exec_time = f(df=df.copy(), id2newvalue=id2newvalue)
        exec_times.append(exec_time)

    exec_time_mean = sum(exec_times, datetime.timedelta(0)) / len(exec_times)
    return exec_time_mean

Results

n_idsn_rows_per_idn_experimentsloc_exec_timegroupby_exec_timeapply_exec_timepd_mask_exec_timenp_where_exec_timemap_and_apply_exec_time
1010312.6247.7032.79712.3595.3926.028
10100310.6166.95814.15925.85813.60113.542
101000328.07315.779206.288100.87799.615104.206
10010386.34947.87214.867192.608122.88919.62
1001003105.75556.326127.5861519.092104.6799.318
10010003256.553192.2221263.79933.729944.581002.69
10001031125.49462.817124.13811143.410721.8108.833
100010031411.8810.5681236.7197575.8971591025.93
1000100037240.261734.8413115.897266194927310632.3

exec time unit: ms

  • The groupby method is the most efficient method in many cases,
  • When n_rows_per_id=10 the apply method and the map and apply method are the most efficient methods,
  • The loc method is relatively efficient,
  • The other methods are less efficient,
  • The number of unique ids has a strong effect on the execution time.

References

For more details, please see the following notebook: Google Colabratory.