🍉 Casual execution time comparison: updating rows per group in Pandas
Posted February 9, 2024 by snrsw ‐ 5 min read
Lightweight comparison of the execution time of different methods for updating rows per group in Pandas.
TL;DR
In this post, we consider the efficient methods of updating rows per group in Pandas. We compare the average execution time of different methods for different sizes of data to find the most efficient method. The result shows that the groupby method is the most efficient in many cases.
Problem and Goals
The problem of updating rows per group is described as follows:
Input:
| id | value |
|---|---|
| A | 1 |
| A | 2 |
| B | 3 |
| B | 4 |
| C | 5 |
Task:
- Update the value per group
- Let us update the value of id A to value * 2, id B to value * 10, and id C to value + 1.
Output:
| id | value |
|---|---|
| A | 2 |
| A | 4 |
| B | 30 |
| B | 40 |
| C | 6 |
The goal of this post is to compare the efficiency of different methods for updating rows per group in Pandas.
Experiment
Environment
All codes are executed in Google Colabratory with CPU.
Methods of updating rows per group
We consider the following methods for updating rows per group in Pandas:
- Using
pd.DataFrame.loc - Using
pd.DataFrame.groupby - Using
pd.DataFrame.apply - Using
pd.DataFrame.mask - Using
np.where - Using
mapandpd.DataFrame.apply
(I think these are common methods for updating rows per group in Pandas. If you know other efficient methods, please let me know.)
In detail, the following functions are used for each method: Note that id2newvalue is tasks represented by a dictionary that has the mapping from id to function for updating the value. This is defined by
def generate_id2newvalue(dataframe: pd.DataFrame) -> dict[int, Callable[[int], int]]:
return {
_: lambda value: int(value * np.random.rand())
for _ in dataframe["id"].unique()
}
Using pd.DataFrame.loc
def use_loc(df: pd.DataFrame, id2newvalue: dict[int, Callable[[int], int]]) -> None:
for id in df["id"].unique():
df.loc[df["id"] == id, "value"] = df.loc[df["id"] == id, "value"].map(id2newvalue[id])
return df
Using pd.DataFrame.groupby
def use_groupby(df: pd.DataFrame, id2newvalue: dict[int, Callable[[int], int]]) -> None:
return pd.concat(
[
gdf.assign(
value=lambda df: df["value"].map(id2newvalue[g])
)
for g, gdf in df.groupby("id")
]
)
Using pd.DataFrame.apply
def use_apply(df: pd.DataFrame, id2newvalue: dict[int, Callable[[int], int]]):
df["value"] = df.apply(lambda row: id2newvalue[row["id"]](row["value"]), axis=1)
return df
Using pd.DataFrame.mask
def use_pd_mask(df: pd.DataFrame, id2newvalue: dict[int, Callable[[int], int]]):
for id in df["id"].unique():
df["value"].mask(df["id"] == id, df["value"].map(id2newvalue[id]), inplace=True)
return df
Using np.where
def use_np_where(df: pd.DataFrame, id2newvalue: dict[int, Callable[[int], int]]):
for id in df["id"].unique():
df["value"] = np.where(df["id"] == id, df["value"].map(id2newvalue[id]), df["value"])
return df
Using map and pd.DataFrame.apply
def use_map_and_apply(df: pd.DataFrame, id2newvalue: dict[int, Callable[[int], int]]):
df["_lambda"] = df["id"].map(id2newvalue)
df["value"] = df.apply(lambda row: row["_lambda"](row["value"]), axis=1)
return df
Data and its size
The size of the data is defined as a pair of the number of unique ids and the number of rows per id. The data is generated by the following functions:
def generate_dataframe(n_ids: int, n_rows_per_id: int) -> pd.DataFrame:
return pd.concat([
pd.DataFrame(
{
"id": [id] * n_rows_per_id,
"value": [id] * n_rows_per_id
}
)
for id in range(n_ids)
]).reset_index(drop=True)
Comparison Methods
We compare the average execution time of the above methods for different sizes of data.
The average execution time is calculated by the following functions:
# use as a decorator to measure the execution time of a function
def log_exec_time(f: Callable) -> Callable:
def _wrapper(*args: Any, **kwargs: Any) -> Any:
start = datetime.datetime.now()
v = f(*args, **kwargs)
end = datetime.datetime.now()
exec_time = end - start
print(f"{f.__name__} execution time: {exec_time}")
return v, exec_time
return _wrapper
# calculate the average execution time of a function decorated by `log_exec_time`
def average_exec_time(
df: pd.DataFrame,
id2newvalue: dict[int, int],
f: Callable,
n_experiments: int = 3
) -> datetime.timedelta:
exec_times = []
for _ in range(n_experiments):
result, exec_time = f(df=df.copy(), id2newvalue=id2newvalue)
exec_times.append(exec_time)
exec_time_mean = sum(exec_times, datetime.timedelta(0)) / len(exec_times)
return exec_time_mean
Results
| n_ids | n_rows_per_id | n_experiments | loc_exec_time | groupby_exec_time | apply_exec_time | pd_mask_exec_time | np_where_exec_time | map_and_apply_exec_time |
|---|---|---|---|---|---|---|---|---|
| 10 | 10 | 3 | 12.624 | 7.703 | 2.797 | 12.359 | 5.392 | 6.028 |
| 10 | 100 | 3 | 10.616 | 6.958 | 14.159 | 25.858 | 13.601 | 13.542 |
| 10 | 1000 | 3 | 28.073 | 15.779 | 206.288 | 100.877 | 99.615 | 104.206 |
| 100 | 10 | 3 | 86.349 | 47.872 | 14.867 | 192.608 | 122.889 | 19.62 |
| 100 | 100 | 3 | 105.755 | 56.326 | 127.586 | 1519.09 | 2104.67 | 99.318 |
| 100 | 1000 | 3 | 256.553 | 192.222 | 1263.7 | 9933.72 | 9944.58 | 1002.69 |
| 1000 | 10 | 3 | 1125.49 | 462.817 | 124.138 | 11143.4 | 10721.8 | 108.833 |
| 1000 | 100 | 3 | 1411.8 | 810.568 | 1236.71 | 97575.8 | 97159 | 1025.93 |
| 1000 | 1000 | 3 | 7240.26 | 1734.84 | 13115.8 | 972661 | 949273 | 10632.3 |
exec time unit: ms
- The groupby method is the most efficient method in many cases,
- When n_rows_per_id=10 the apply method and the map and apply method are the most efficient methods,
- The loc method is relatively efficient,
- The other methods are less efficient,
- The number of unique ids has a strong effect on the execution time.
References
For more details, please see the following notebook: Google Colabratory.