🍉 Casual execution time comparison: updating rows per group in Pandas

Posted February 9, 2024 by snrsw ‐ 5 min read

Lightweight comparison of the execution time of different methods for updating rows per group in Pandas.

TL;DR

In this post, we consider the efficient methods of updating rows per group in Pandas. We compare the average execution time of different methods for different sizes of data to find the most efficient method. The result shows that the groupby method is the most efficient in many cases.

Problem and Goals

The problem of updating rows per group is described as follows:

Input:

id	value
A	1
A	2
B	3
B	4
C	5

Task:

Update the value per group
- Let us update the value of id A to value * 2, id B to value * 10, and id C to value + 1.

Output:

id	value
A	2
A	4
B	30
B	40
C	6

The goal of this post is to compare the efficiency of different methods for updating rows per group in Pandas.

Experiment

Environment

All codes are executed in Google Colabratory with CPU.

Methods of updating rows per group

We consider the following methods for updating rows per group in Pandas:

Using pd.DataFrame.loc
Using pd.DataFrame.groupby
Using pd.DataFrame.apply
Using pd.DataFrame.mask
Using np.where
Using map and pd.DataFrame.apply

(I think these are common methods for updating rows per group in Pandas. If you know other efficient methods, please let me know.)

In detail, the following functions are used for each method: Note that id2newvalue is tasks represented by a dictionary that has the mapping from id to function for updating the value. This is defined by

def generate_id2newvalue(dataframe: pd.DataFrame) -> dict[int, Callable[[int], int]]:
    return {
        _: lambda value: int(value * np.random.rand())
        for _ in dataframe["id"].unique()
    }

Using `pd.DataFrame.loc`

def use_loc(df: pd.DataFrame, id2newvalue: dict[int, Callable[[int], int]]) -> None:
    for id in df["id"].unique():
        df.loc[df["id"] == id, "value"] = df.loc[df["id"] == id, "value"].map(id2newvalue[id])
    return df

Using `pd.DataFrame.groupby`

def use_groupby(df: pd.DataFrame, id2newvalue: dict[int, Callable[[int], int]]) -> None:
    return pd.concat(
        [
            gdf.assign(
                value=lambda df: df["value"].map(id2newvalue[g])
            )
            for g, gdf in df.groupby("id")
        ]
    )

Using `pd.DataFrame.apply`

def use_apply(df: pd.DataFrame, id2newvalue: dict[int, Callable[[int], int]]):
    df["value"] = df.apply(lambda row: id2newvalue[row["id"]](row["value"]), axis=1)
    return df

Using `pd.DataFrame.mask`

def use_pd_mask(df: pd.DataFrame, id2newvalue: dict[int, Callable[[int], int]]):
    for id in df["id"].unique():
        df["value"].mask(df["id"] == id, df["value"].map(id2newvalue[id]), inplace=True)
    return df

Using `np.where`

def use_np_where(df: pd.DataFrame, id2newvalue: dict[int, Callable[[int], int]]):
    for id in df["id"].unique():
        df["value"] = np.where(df["id"] == id, df["value"].map(id2newvalue[id]), df["value"])
    return df

Using `map` and `pd.DataFrame.apply`

def use_map_and_apply(df: pd.DataFrame, id2newvalue: dict[int, Callable[[int], int]]):
    df["_lambda"] = df["id"].map(id2newvalue)
    df["value"] = df.apply(lambda row: row["_lambda"](row["value"]), axis=1)
    return df

Data and its size

The size of the data is defined as a pair of the number of unique ids and the number of rows per id. The data is generated by the following functions:

def generate_dataframe(n_ids: int, n_rows_per_id: int) -> pd.DataFrame:
    return pd.concat([
        pd.DataFrame(
            {
                "id": [id] * n_rows_per_id,
                "value": [id] * n_rows_per_id
            }
        )
        for id in range(n_ids)
    ]).reset_index(drop=True)

Comparison Methods

We compare the average execution time of the above methods for different sizes of data.

The average execution time is calculated by the following functions:

# use as a decorator to measure the execution time of a function
def log_exec_time(f: Callable) -> Callable:
    def _wrapper(*args: Any, **kwargs: Any) -> Any:
        start = datetime.datetime.now()

        v = f(*args, **kwargs)

        end = datetime.datetime.now()
        exec_time = end - start
        print(f"{f.__name__} execution time: {exec_time}")
        return v, exec_time
    return _wrapper

# calculate the average execution time of a function decorated by `log_exec_time`
def average_exec_time(
    df: pd.DataFrame,
    id2newvalue: dict[int, int],
    f: Callable,
    n_experiments: int = 3
) -> datetime.timedelta:
    exec_times = []
    for _ in range(n_experiments):
        result, exec_time = f(df=df.copy(), id2newvalue=id2newvalue)
        exec_times.append(exec_time)

    exec_time_mean = sum(exec_times, datetime.timedelta(0)) / len(exec_times)
    return exec_time_mean

Results

n_ids	n_rows_per_id	n_experiments	loc_exec_time	groupby_exec_time	apply_exec_time	pd_mask_exec_time	np_where_exec_time	map_and_apply_exec_time
10	10	3	12.624	7.703	2.797	12.359	5.392	6.028
10	100	3	10.616	6.958	14.159	25.858	13.601	13.542
10	1000	3	28.073	15.779	206.288	100.877	99.615	104.206
100	10	3	86.349	47.872	14.867	192.608	122.889	19.62
100	100	3	105.755	56.326	127.586	1519.09	2104.67	99.318
100	1000	3	256.553	192.222	1263.7	9933.72	9944.58	1002.69
1000	10	3	1125.49	462.817	124.138	11143.4	10721.8	108.833
1000	100	3	1411.8	810.568	1236.71	97575.8	97159	1025.93
1000	1000	3	7240.26	1734.84	13115.8	972661	949273	10632.3

exec time unit: ms

The groupby method is the most efficient method in many cases,
When n_rows_per_id=10 the apply method and the map and apply method are the most efficient methods,
The loc method is relatively efficient,
The other methods are less efficient,
The number of unique ids has a strong effect on the execution time.

References

For more details, please see the following notebook: Google Colabratory.