Saturday, 17 August 2013

Using R to generate a time series of averages from a very large dataset without using for loops

I am working with a large dataset of patent data. Each row is an
individual patent, and columns contain information including application
year and number of citations in the patent.

> head(p)
  allcites appyear asscode assgnum cat cat_ocl  cclass country ddate gday gmonth
1        6    1974       2       1   6       6 2/161.4      US          6      1
2        0    1974       2       1   6       6    5/11      US          6      1
3       20    1975       2       1   6       6   5/430      US          6      1
4        4    1974       1      NA   5    <NA> 114/354                  6      1
5        1    1975       1      NA   6       6 12/142S                  6      1
6        3    1972       2       1   6       6 15/53.4      US          6      1
  gyear hjtwt       icl icl_class icl_maingroup iclnum nclaims nclass nclass_ocl
1  1976     1 A41D 1900      A41D            19      1       4      2          2
2  1976     1  A47D 701      A47D             7      1       3      5          5
3  1976     1  A47D 702      A47D             7      1      24      5          5
4  1976     1  B63B 708      B63B             7      1       7    114          9
5  1976     1  A43D 900      A43D             9      1       9     12         12
6  1976     1  B60S 304      B60S             3      1      12     15         15
   patent   pdpass state status subcat subcat_ocl subclass subclass1 subclass1_ocl
1 3930271 10030271    IL            63         63    161.4     161.4           161
2 3930272 10156902    PA            65         65     11.0        11            11
3 3930273 10112031    MO            65         65    430.0       430           331
4 3930274       NA    CA            55         NA    354.0       354             2
5 3930275       NA    NJ            63         63       NA      142S           142
6 3930276 10030276    IL            69         69     53.4      53.4            53
  subclass_ocl term_extension uspto_assignee      gdate
1          161              0         251415 1976-01-06
2           11              0         246000 1976-01-06
3          331              0          10490 1976-01-06
4            2              0              0 1976-01-06
5          142              0              0 1976-01-06
6           53              0         243840 1976-01-06

I am attempting to create a new data frame that contains the mean number
of citations (allcites) per application year (appyear), broken down by
category (cat), for patents with application years from 1970 to 2006 (the
data go all the way back to 1901). My solution works, but it feels
somewhat ad hoc and does not take advantage of the specific capabilities
of R. Here is my solution:

# Mean citations by category: one row per year, one column per category
citescat <- data.frame("chem"=numeric(37),
                       "comp"=numeric(37),
                       "drugs"=numeric(37),
                       "ee"=numeric(37),
                       "mech"=numeric(37),
                       "other"=numeric(37),
                       "year"=1970:2006)
for (i in 1:37) {      # row i holds application year i + 1969
  for (j in 1:6) {     # column j holds category code j
    citescat[i,j] <- mean(p$allcites[p$appyear==(i+1969) & p$cat==j],
                          na.rm=TRUE)
  }
}

Is there a simple way to do this without the nested for loops, ideally
one that also makes small tweaks easy? It is hard for me to pin down
exactly what I am looking for beyond that, but my code just looks ugly to
me and I suspect that there are better ways to do this in R.
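
One possible loop-free version, offered as a sketch rather than a definitive answer, uses base R's tapply to compute every (year, category) mean in a single call. The small data frame below is made-up stand-in data so that the example runs on its own, and the name `sub` is introduced here for illustration. The category labels follow the column order used in the loop version above.

```r
# Toy stand-in for the real patent data (values are invented for illustration).
p <- data.frame(
  allcites = c(6, 0, 20, 4, 1, 3, NA, 8),
  appyear  = c(1974, 1974, 1975, 1974, 1975, 1972, 1975, 1969),
  cat      = c(6, 6, 6, 5, 6, 6, 5, 1)
)

# Keep only the application years of interest (subset() also drops NA years).
sub <- subset(p, appyear >= 1970 & appyear <= 2006)

# Mean citations for every (year, category) pair in one call.
# Factors with explicit levels keep the full 37 x 6 grid even when a
# year/category combination has no patents.
m <- tapply(
  sub$allcites,
  list(year = factor(sub$appyear, levels = 1970:2006),
       cat  = factor(sub$cat, levels = 1:6)),
  mean, na.rm = TRUE
)

# Same layout as the loop version: one column per category plus a year column.
citescat <- data.frame(m, year = 1970:2006)
names(citescat)[1:6] <- c("chem", "comp", "drugs", "ee", "mech", "other")
```

One difference from the loop: a year/category cell with no patents at all comes back as NA here, whereas the loop produces NaN. Changing the year range or the grouping variable only means editing the subset condition or the factor levels, which may make small tweaks easier.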
