# How to sum a variable by group

I have a data frame with two columns. First column contains categories such as "First", "Second", "Third", and the second column has numbers that represent the number of times I saw the specific groups from "Category".

For example:

``````Category     Frequency
First        10
First        15
First        5
Second       2
Third        14
Third        20
Second       3
``````

I want to sort the data by Category and sum all the Frequencies:

``````Category     Frequency
First        30
Second       5
Third        34
``````

How would I do this in R?

• The fastest way in base R is `rowsum`. Jan 4, 2019 at 18:58

Using `aggregate`:

``````aggregate(x\$Frequency, by=list(Category=x\$Category), FUN=sum)
Category  x
1    First 30
2   Second  5
3    Third 34
``````

In the example above, multiple dimensions can be specified in the `list`. Multiple aggregated metrics of the same data type can be incorporated via `cbind`:

``````aggregate(cbind(x\$Frequency, x\$Metric2, x\$Metric3) ...
``````

(embedding @thelatemail comment), `aggregate` has a formula interface too

``````aggregate(Frequency ~ Category, x, sum)
``````

Or if you want to aggregate multiple columns, you could use the `.` notation (works for one column too)

``````aggregate(. ~ Category, x, sum)
``````

or `tapply`:

``````tapply(x\$Frequency, x\$Category, FUN=sum)
First Second  Third
30      5     34
``````

Using this data:

``````x <- data.frame(Category=factor(c("First", "First", "First", "Second",
"Third", "Third", "Second")),
Frequency=c(10,15,5,2,14,20,3))
``````
• @AndrewMcKinlay, R uses the tilde to define symbolic formulae, for statistics and other functions. It can be interpreted as "model Frequency by Category" or "Frequency depending on Category". Not all languages use a special operator to define a symbolic function, as done in R here. Perhaps with that "natural-language interpretation" of the tilde operator, it becomes more meaningful (and even intuitive). I personally find this symbolic formula representation better than some of the more verbose alternatives. Dec 19, 2016 at 4:35
• Being new to R (and asking the same sorts of questions as the OP), I would benefit from some more detail of the syntax behind each alternative. For instance, if I have a larger source table and want to subselect just two dimensions plus summed metrics, can I adapt any of these methods? Hard to tell. Oct 28, 2018 at 10:42
• Is there anyway of maintaining an ID column? Say the categories are ordered and the ID column is `1:nrow(df)`, is it possible to keep the starting position of each category after aggregating? So the ID column would end up as, for example, 1, 3, 4, 7 after collapsing with aggregate. In my case I like `aggregate` because it works over many columns automatically. Jun 24, 2020 at 20:32

You can also use the dplyr package for that purpose:

``````library(dplyr)
x %>%
group_by(Category) %>%
summarise(Frequency = sum(Frequency))

#Source: local data frame [3 x 2]
#
#  Category Frequency
#1    First        30
#2   Second         5
#3    Third        34
``````

Or, for multiple summary columns (works with one column too):

``````x %>%
group_by(Category) %>%
summarise(across(everything(), sum))
``````

Here are some more examples of how to summarise data by group using dplyr functions using the built-in dataset `mtcars`:

``````# several summary columns with arbitrary names
mtcars %>%
group_by(cyl, gear) %>%                            # multiple group columns
summarise(max_hp = max(hp), mean_mpg = mean(mpg))  # multiple summary columns

# summarise all columns except grouping columns using "sum"
mtcars %>%
group_by(cyl) %>%
summarise(across(everything(), sum))

# summarise all columns except grouping columns using "sum" and "mean"
mtcars %>%
group_by(cyl) %>%
summarise(across(everything(), list(mean = mean, sum = sum)))

# multiple grouping columns
mtcars %>%
group_by(cyl, gear) %>%
summarise(across(everything(), list(mean = mean, sum = sum)))

# summarise specific variables, not all
mtcars %>%
group_by(cyl, gear) %>%
summarise(across(c(qsec, mpg, wt), list(mean = mean, sum = sum)))

# summarise specific variables (numeric columns except grouping columns)
mtcars %>%
group_by(gear) %>%
summarise(across(where(is.numeric), list(mean = mean, sum = sum)))
``````

For more information, including the `%>%` operator, see the introduction to dplyr.

• How fast is it when compared to the data.table and aggregate alternatives presented in other answers? Jan 23, 2015 at 14:35
• @asieira, Which is fastest and how big the difference (or if the difference is noticeable) is will always depend on your data size. Typically, for large data sets, for example some GB, data.table will most likely be fastest. On smaller data size, data.table and dplyr are often close, also depending on the number of groups. Both data,table and dplyr will be quite a lot faster than base functions, however (can well be 100-1000 times faster for some operations). Also see here Jan 23, 2015 at 14:50
• What does the "funs" refer to in the second example? Oct 8, 2019 at 19:02
• @lauren.marietta you can specify the function(s) you want to apply as summary inside the `funs()` argument of `summarise_all` and its related functions (`summarise_at`, `summarise_if`) Oct 9, 2019 at 11:52
• In case, the column name has spaces. It might not work. Using back ticks would help. Ref. stackoverflow.com/questions/22842232/… Nov 2, 2020 at 7:57

The answer provided by rcs works and is simple. However, if you are handling larger datasets and need a performance boost there is a faster alternative:

``````library(data.table)
data = data.table(Category=c("First","First","First","Second","Third", "Third", "Second"),
Frequency=c(10,15,5,2,14,20,3))
data[, sum(Frequency), by = Category]
#    Category V1
# 1:    First 30
# 2:   Second  5
# 3:    Third 34
system.time(data[, sum(Frequency), by = Category] )
# user    system   elapsed
# 0.008     0.001     0.009
``````

Let's compare that to the same thing using data.frame and the above above:

``````data = data.frame(Category=c("First","First","First","Second","Third", "Third", "Second"),
Frequency=c(10,15,5,2,14,20,3))
system.time(aggregate(data\$Frequency, by=list(Category=data\$Category), FUN=sum))
# user    system   elapsed
# 0.008     0.000     0.015
``````

And if you want to keep the column this is the syntax:

``````data[,list(Frequency=sum(Frequency)),by=Category]
#    Category Frequency
# 1:    First        30
# 2:   Second         5
# 3:    Third        34
``````

The difference will become more noticeable with larger datasets, as the code below demonstrates:

``````data = data.table(Category=rep(c("First", "Second", "Third"), 100000),
Frequency=rnorm(100000))
system.time( data[,sum(Frequency),by=Category] )
# user    system   elapsed
# 0.055     0.004     0.059
data = data.frame(Category=rep(c("First", "Second", "Third"), 100000),
Frequency=rnorm(100000))
system.time( aggregate(data\$Frequency, by=list(Category=data\$Category), FUN=sum) )
# user    system   elapsed
# 0.287     0.010     0.296
``````

For multiple aggregations, you can combine `lapply` and `.SD` as follows

``````data[, lapply(.SD, sum), by = Category]
#    Category Frequency
# 1:    First        30
# 2:   Second         5
# 3:    Third        34
``````
• +1 But 0.296 vs 0.059 isn't particularly impressive. The data size needs to be much bigger than 300k rows, and with more than 3 groups, for data.table to shine. We'll try and support more than 2 billion rows soon for example, since some data.table users have 250GB of RAM and GNU R now supports length > 2^31. Sep 9, 2013 at 10:05
• True. Turns out I don't have all that RAM though, and was simply trying to provide some evidence of data.table's superior performance. I'm sure the difference would be even larger with more data. Oct 23, 2013 at 23:22
• I had 7 mil observations dplyr took .3 seconds and aggregate() took 22 seconds to complete the operation. I was going to post it on this topic and you beat me to it!
– zazu
Nov 14, 2015 at 19:10
• There is a even shorter way to write this `data[, sum(Frequency), by = Category]`. You could use `.N` which substitutes the `sum()` function. `data[, .N, by = Category]`. Here is a useful cheatsheet: s3.amazonaws.com/assets.datacamp.com/img/blog/… Feb 22, 2017 at 11:47
• Using .N would be equivalent to sum(Frequency) only if all the values in the Frequency column were equal to 1, because .N counts the number of rows in each aggregated set (.SD). And that is not the case here. Mar 1, 2017 at 13:26

You can also use the by() function:

``````x2 <- by(x\$Frequency, x\$Category, sum)
do.call(rbind,as.list(x2))
``````

Those other packages (plyr, reshape) have the benefit of returning a data.frame, but it's worth being familiar with by() since it's a base function.

Several years later, just to add another simple base R solution that isn't present here for some reason- `xtabs`

``````xtabs(Frequency ~ Category, df)
# Category
# First Second  Third
#    30      5     34
``````

Or if you want a `data.frame` back

``````as.data.frame(xtabs(Frequency ~ Category, df))
#   Category Freq
# 1    First   30
# 2   Second    5
# 3    Third   34
``````
``````library(plyr)
ddply(tbl, .(Category), summarise, sum = sum(Frequency))
``````

If `x` is a dataframe with your data, then the following will do what you want:

``````require(reshape)
recast(x, Category ~ ., fun.aggregate=sum)
``````

While I have recently become a convert to `dplyr` for most of these types of operations, the `sqldf` package is still really nice (and IMHO more readable) for some things.

Here is an example of how this question can be answered with `sqldf`

``````x <- data.frame(Category=factor(c("First", "First", "First", "Second",
"Third", "Third", "Second")),
Frequency=c(10,15,5,2,14,20,3))

sqldf("select
Category
,sum(Frequency) as Frequency
from x
group by
Category")

##   Category Frequency
## 1    First        30
## 2   Second         5
## 3    Third        34
``````

Just to add a third option:

``````require(doBy)
summaryBy(Frequency~Category, data=yourdataframe, FUN=sum)
``````

EDIT: this is a very old answer. Now I would recommend the use of `group_by` and `summarise` from `dplyr`, as in @docendo answer.

Another solution that returns sums by groups in a matrix or a data frame and is short and fast:

``````rowsum(x\$Frequency, x\$Category)
``````
• Nicely, and indeed fast. May 1, 2020 at 23:28

I find `ave` very helpful (and efficient) when you need to apply different aggregation functions on different columns (and you must/want to stick on base R) :

e.g.

Given this input :

``````DF <-
data.frame(Categ1=factor(c('A','A','B','B','A','B','A')),
Categ2=factor(c('X','Y','X','X','X','Y','Y')),
Samples=c(1,2,4,3,5,6,7),
Freq=c(10,30,45,55,80,65,50))

> DF
Categ1 Categ2 Samples Freq
1      A      X       1   10
2      A      Y       2   30
3      B      X       4   45
4      B      X       3   55
5      A      X       5   80
6      B      Y       6   65
7      A      Y       7   50
``````

we want to group by `Categ1` and `Categ2` and compute the sum of `Samples` and mean of `Freq`.
Here's a possible solution using `ave` :

``````# create a copy of DF (only the grouping columns)
DF2 <- DF[,c('Categ1','Categ2')]

# add sum of Samples by Categ1,Categ2 to DF2
# (ave repeats the sum of the group for each row in the same group)
DF2\$GroupTotSamples <- ave(DF\$Samples,DF2,FUN=sum)

# add mean of Freq by Categ1,Categ2 to DF2
# (ave repeats the mean of the group for each row in the same group)
DF2\$GroupAvgFreq <- ave(DF\$Freq,DF2,FUN=mean)

# remove the duplicates (keep only one row for each group)
DF2 <- DF2[!duplicated(DF2),]
``````

Result :

``````> DF2
Categ1 Categ2 GroupTotSamples GroupAvgFreq
1      A      X               6           45
2      A      Y               9           40
3      B      X               7           50
6      B      Y               6           65
``````

Since `dplyr 1.0.0`, the `across()` function could be used:

``````df %>%
group_by(Category) %>%
summarise(across(Frequency, sum))

Category Frequency
<chr>        <int>
1 First           30
2 Second           5
3 Third           34
``````

If interested in multiple variables:

``````df %>%
group_by(Category) %>%
summarise(across(c(Frequency, Frequency2), sum))

Category Frequency Frequency2
<chr>        <int>      <int>
1 First           30         55
2 Second           5         29
3 Third           34        190
``````

And the selection of variables using select helpers:

``````df %>%
group_by(Category) %>%
summarise(across(starts_with("Freq"), sum))

Category Frequency Frequency2 Frequency3
<chr>        <int>      <int>      <dbl>
1 First           30         55        110
2 Second           5         29         58
3 Third           34        190        380
``````

Sample data:

``````df <- read.table(text = "Category Frequency Frequency2 Frequency3
1    First        10         10         20
2    First        15         30         60
3    First         5         15         30
4   Second         2          8         16
5    Third        14         70        140
6    Third        20        120        240
7   Second         3         21         42",
stringsAsFactors = FALSE)
``````

You could use the function `group.sum` from package Rfast.

``````Category <- Rfast::as_integer(Category,result.sort=FALSE) # convert character to numeric. R's as.numeric produce NAs.
result <- Rfast::group.sum(Frequency,Category)
names(result) <- Rfast::Sort(unique(Category)
# 30 5 34
``````

Rfast has many group functions and `group.sum` is one of them.

using `cast` instead of `recast` (note `'Frequency'` is now `'value'`)

``````df  <- data.frame(Category = c("First","First","First","Second","Third","Third","Second")
, value = c(10,15,5,2,14,20,3))

install.packages("reshape")

result<-cast(df, Category ~ . ,fun.aggregate=sum)
``````

to get:

``````Category (all)
First     30
Second    5
Third     34
``````
``````library(tidyverse)

x <- data.frame(Category= c('First', 'First', 'First', 'Second', 'Third', 'Third', 'Second'),
Frequency = c(10, 15, 5, 2, 14, 20, 3))

count(x, Category, wt = Frequency)

``````