Hey guys, what’s up? In this R programming tutorial I’ll teach you how to create a data frame in R. So, let’s start. A data frame is built from vectors of equal length, for example:
- A character vector: containing the names (like: employee)
- A numeric vector: containing numbers (like: salary)
- A date vector: containing the dates (like: startdate)
> employee <- c('John Doe','Peter Gynn','Jolie Hope')
> salary <- c(21000, 23400, 26800)
> startdate <- as.Date(c('2010-11-1','2008-3-25','2007-3-14'))
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc")
> b = c(TRUE, FALSE, TRUE)
> df = data.frame(n, s, b) # df is a data frame
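Going back to the employee, salary, and startdate vectors: here is a minimal sketch of how they could be combined into one data frame with data.frame(). The name emp_df is only an illustrative choice for this sketch; it does not appear in the original example.

```r
# Vectors from the first example
employee  <- c("John Doe", "Peter Gynn", "Jolie Hope")
salary    <- c(21000, 23400, 26800)
startdate <- as.Date(c("2010-11-1", "2008-3-25", "2007-3-14"))

# emp_df is a hypothetical name chosen for this sketch
emp_df <- data.frame(employee, salary, startdate)

# Each vector becomes one column; each position becomes one row
dim(emp_df)
names(emp_df)
```

Because all three vectors have length 3, the result has three rows and three columns.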
Characteristics of Data Frame in R
Let’s discuss the characteristics of a data frame in the R programming language.
- Column names have to be non-empty.
- Row names have to be unique.
- A data frame can hold numeric, character, or factor type data.
- Each column has to contain the same number of data items.
To check whether a variable is a data frame or not, we can use the class() function.
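A quick sketch of how two of these rules show up in practice. The helper fails() is a hypothetical name used only here; it simply reports whether an expression raises an error.

```r
# fails() is a hypothetical helper for this sketch:
# it returns TRUE if evaluating expr raises an error, FALSE otherwise
fails <- function(expr) {
  tryCatch({ expr; FALSE }, error = function(e) TRUE)
}

# Columns of length 3 and 2 cannot be combined into one data frame
unequal_fails <- fails(data.frame(a = 1:3, b = 1:2))

# Duplicate row names are rejected
dup_rows_fail <- fails(data.frame(a = 1:2, row.names = c("r1", "r1")))

# Equal lengths and unique row names work fine
ok <- data.frame(a = 1:2, b = c("x", "y"), row.names = c("r1", "r2"))

unequal_fails
dup_rows_fail
ok
```

Note that data.frame() recycles a shorter vector when its length divides the longer one evenly, which is why the sketch uses lengths 3 and 2.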
> x
  SN Age Name
1  1  21 John
2  2  15 Dora
> typeof(x) # data frame is a special case of list
[1] "list"
> class(x)
[1] "data.frame"
In the above example, x is a list of three components, and each component is a vector of two elements. There are some useful functions for working with data frames that you should know about. I am going to show them below.
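The same checks can be done with a few base R helpers. This is only a sketch: it rebuilds a data frame like x from the printed output above, so the exact column types are an assumption.

```r
# Rebuild a data frame resembling x (column types assumed from the printout)
x <- data.frame(SN = 1:2, Age = c(21, 15), Name = c("John", "Dora"))

is.data.frame(x)   # checks the class directly
is.list(x)         # a data frame is built on top of a list
length(x)          # number of components (columns)
lengths(x)         # number of elements in each component
```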
How to Create a Data Frame in R?
Data frames can also be created by importing a dataset with read.table() or read.csv(). We will cover that in another R programming tutorial. Today we will create a data frame in R using the data.frame() function. Below, I’ll create a simple data frame df and assess its basic structure:
df <- data.frame(col1 = 1:3,
                 col2 = c("this", "is", "text"),
                 col3 = c(TRUE, FALSE, TRUE),
                 col4 = c(2.5, 4.2, pi))

# assess the structure of a data frame
str(df)
## 'data.frame':    3 obs. of  4 variables:
##  $ col1: int  1 2 3
##  $ col2: Factor w/ 3 levels "is","text","this": 3 1 2
##  $ col3: logi  TRUE FALSE TRUE
##  $ col4: num  2.5 4.2 3.14

# number of rows
nrow(df)
## [1] 3

# number of columns
ncol(df)
## [1] 4
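Above, col2 in df was converted to a column of factors. That happens because data.frame() in R versions before 4.0.0 converts character columns to factors by default. Since R 4.0.0 the default is stringsAsFactors = FALSE, so col2 would stay a character column there. Setting the stringsAsFactors = FALSE argument explicitly turns the conversion off in any version.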
df <- data.frame(col1 = 1:3,
                 col2 = c("this", "is", "text"),
                 col3 = c(TRUE, FALSE, TRUE),
                 col4 = c(2.5, 4.2, pi),
                 stringsAsFactors = FALSE)

# note how col2 now is of a character class
str(df)
## 'data.frame':    3 obs. of  4 variables:
##  $ col1: int  1 2 3
##  $ col2: chr  "this" "is" "text"
##  $ col3: logi  TRUE FALSE TRUE
##  $ col4: num  2.5 4.2 3.14
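Since the default depends on the R version, one way to get the same result everywhere is to always state stringsAsFactors yourself. A minimal sketch (the object names words, as_factor, and as_chr are made up for this illustration):

```r
words <- c("this", "is", "text")

# Ask for factors explicitly
as_factor <- data.frame(col = words, stringsAsFactors = TRUE)

# Ask for plain character columns explicitly
as_chr <- data.frame(col = words, stringsAsFactors = FALSE)

class(as_factor$col)
class(as_chr$col)

# A factor column can be turned back into text afterwards
as.character(as_factor$col)
```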
Below, we will see how we can turn multiple vectors, a list, or a matrix into a data frame:
v1 <- 1:3
v2 <- c("this", "is", "text")
v3 <- c(TRUE, FALSE, TRUE)

# convert same length vectors to a data frame using data.frame()
data.frame(col1 = v1, col2 = v2, col3 = v3)
##   col1 col2  col3
## 1    1 this  TRUE
## 2    2   is FALSE
## 3    3 text  TRUE

# convert a list to a data frame using as.data.frame()
l <- list(item1 = 1:3,
          item2 = c("this", "is", "text"),
          item3 = c(2.5, 4.2, 5.1))
l
## $item1
## [1] 1 2 3
##
## $item2
## [1] "this" "is"   "text"
##
## $item3
## [1] 2.5 4.2 5.1

as.data.frame(l)
##   item1 item2 item3
## 1     1  this   2.5
## 2     2    is   4.2
## 3     3  text   5.1

# convert a matrix to a data frame using as.data.frame()
m1 <- matrix(1:12, nrow = 4, ncol = 3)
m1
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12

as.data.frame(m1)
##   V1 V2 V3
## 1  1  5  9
## 2  2  6 10
## 3  3  7 11
## 4  4  8 12
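The matrix conversion gives default column names V1, V2, V3. If you want your own names, here is a small sketch of two possible approaches. The names a, b, c, df_a, and df_b are placeholders for this illustration.

```r
m1 <- matrix(1:12, nrow = 4, ncol = 3)

# Option 1: name the matrix columns before converting
colnames(m1) <- c("a", "b", "c")
df_a <- as.data.frame(m1)

# Option 2: convert first, then rename with names()
df_b <- as.data.frame(matrix(1:12, nrow = 4, ncol = 3))
names(df_b) <- c("a", "b", "c")

names(df_a)
names(df_b)
```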
Adding Columns to Data Frames
The cbind() function is used for adding columns to a data frame. Bear in mind that at least one of the objects being combined must already be a data frame; otherwise cbind() will produce a matrix.
df
##   col1 col2  col3     col4
## 1    1 this  TRUE 2.500000
## 2    2   is FALSE 4.200000
## 3    3 text  TRUE 3.141593

# add a new column
v4 <- c("A", "B", "C")
cbind(df, v4)
##   col1 col2  col3     col4 v4
## 1    1 this  TRUE 2.500000  A
## 2    2   is FALSE 4.200000  B
## 3    3 text  TRUE 3.141593  C
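Keep in mind that cbind() returns a new object, so df itself stays unchanged unless you store the result. As a sketch, here are two ways to keep the new column; df_cbind is a made-up name, and the smaller df is rebuilt here only to keep the example self-contained.

```r
df <- data.frame(col1 = 1:3,
                 col2 = c("this", "is", "text"),
                 stringsAsFactors = FALSE)

# Store the result of cbind() to keep the new column
df_cbind <- cbind(df, v4 = c("A", "B", "C"))

# Or assign to a new column name with $, which modifies df itself
df$v4 <- c("A", "B", "C")

names(df_cbind)
names(df)
```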
To add rows to a data frame, we can use the rbind() function. Currently our data frame df consists of integer, character, logical, and numeric variables.
df
##   col1 col2  col3     col4
## 1    1 this  TRUE 2.500000
## 2    2   is FALSE 4.200000
## 3    3 text  TRUE 3.141593

str(df)
## 'data.frame':    3 obs. of  4 variables:
##  $ col1: int  1 2 3
##  $ col2: chr  "this" "is" "text"
##  $ col3: logi  TRUE FALSE TRUE
##  $ col4: num  2.5 4.2 3.14
All elements in a vector created by c() must be of the same class, so a vector that mixes numbers, text, and logicals is coerced to a character vector. If we add such a vector as a row using rbind() and c(), it converts all the columns to the character class.
df2 <- rbind(df, c(4, "R", F, 1.1))
df2
##   col1 col2  col3             col4
## 1    1 this  TRUE              2.5
## 2    2   is FALSE              4.2
## 3    3 text  TRUE 3.14159265358979
## 4    4    R FALSE              1.1

str(df2)
## 'data.frame':    4 obs. of  4 variables:
##  $ col1: chr  "1" "2" "3" "4"
##  $ col2: chr  "this" "is" "text" "R"
##  $ col3: chr  "TRUE" "FALSE" "TRUE" "FALSE"
##  $ col4: chr  "2.5" "4.2" "3.14159265358979" "1.1"
To add rows without changing the column classes, we have to convert the items being added to a data frame whose columns have classes compatible with those of the original data frame.
adding_df <- data.frame(col1 = 4,
                        col2 = "R",
                        col3 = FALSE,
                        col4 = 1.1,
                        stringsAsFactors = FALSE)

df3 <- rbind(df, adding_df)
df3
##   col1 col2  col3     col4
## 1    1 this  TRUE 2.500000
## 2    2   is FALSE 4.200000
## 3    3 text  TRUE 3.141593
## 4    4    R FALSE 1.100000

str(df3)
## 'data.frame':    4 obs. of  4 variables:
##  $ col1: num  1 2 3 4
##  $ col2: chr  "this" "is" "text" "R"
##  $ col3: logi  TRUE FALSE TRUE FALSE
##  $ col4: num  2.5 4.2 3.14 1.1
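One more thing to watch with rbind(): the column names of the data frames have to match. Here is a small sketch with made-up data frames (wrong_names and right_names are hypothetical) showing that mismatched names raise an error, while matching names are lined up by name rather than by position.

```r
df <- data.frame(col1 = 1:2, col2 = c("a", "b"), stringsAsFactors = FALSE)

# Same number of columns, but different names
wrong_names <- data.frame(x = 3L, y = "c", stringsAsFactors = FALSE)

# Same names as df, listed in a different order
right_names <- data.frame(col2 = "c", col1 = 3L, stringsAsFactors = FALSE)

# TRUE if rbind() raises an error for the mismatched names
mismatch_fails <- tryCatch({ rbind(df, wrong_names); FALSE },
                           error = function(e) TRUE)

# Columns are matched by name, so the order in right_names does not matter
combined <- rbind(df, right_names)

mismatch_fails
combined
```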
There are better ways to join data frames together. I will discuss them in another R programming tutorial.
Adding Attributes to Data Frames
Similar to matrices, data frames have dimensions, which dim() reports. The attributes() function lists their names, row.names, and class.
# basic data frame
df
##   col1 col2  col3     col4
## 1    1 this  TRUE 2.500000
## 2    2   is FALSE 4.200000
## 3    3 text  TRUE 3.141593

dim(df)
## [1] 3 4

attributes(df)
## $names
## [1] "col1" "col2" "col3" "col4"
##
## $row.names
## [1] 1 2 3
##
## $class
## [1] "data.frame"
Our data frame df currently has no row names, only the default row numbers. We can add row names using rownames():
# add row names
rownames(df) <- c("row1", "row2", "row3")
df
##      col1 col2  col3     col4
## row1    1 this  TRUE 2.500000
## row2    2   is FALSE 4.200000
## row3    3 text  TRUE 3.141593

attributes(df)
## $names
## [1] "col1" "col2" "col3" "col4"
##
## $row.names
## [1] "row1" "row2" "row3"
##
## $class
## [1] "data.frame"
To change existing column names, we can use colnames() or names():
# add/change column names with colnames()
colnames(df) <- c("col_1", "col_2", "col_3", "col_4")
df
##      col_1 col_2 col_3    col_4
## row1     1  this  TRUE 2.500000
## row2     2    is FALSE 4.200000
## row3     3  text  TRUE 3.141593

attributes(df)
## $names
## [1] "col_1" "col_2" "col_3" "col_4"
##
## $row.names
## [1] "row1" "row2" "row3"
##
## $class
## [1] "data.frame"

# add/change column names with names()
names(df) <- c("col.1", "col.2", "col.3", "col.4")
df
##      col.1 col.2 col.3    col.4
## row1     1  this  TRUE 2.500000
## row2     2    is FALSE 4.200000
## row3     3  text  TRUE 3.141593

attributes(df)
## $names
## [1] "col.1" "col.2" "col.3" "col.4"
##
## $row.names
## [1] "row1" "row2" "row3"
##
## $class
## [1] "data.frame"
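The examples above replace every column name at once. If you only want to rename one column, a possible approach is to index into names() first. This is a minimal sketch; the new name words is just a placeholder.

```r
df <- data.frame(col1 = 1:3,
                 col2 = c("this", "is", "text"),
                 stringsAsFactors = FALSE)

# Pick the position whose name is "col2" and overwrite only that name
names(df)[names(df) == "col2"] <- "words"

names(df)
```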
Just as with vectors, matrices, and lists, we can add a comment to a data frame without affecting how it behaves:
# adding a comment attribute
comment(df) <- "adding a comment to a data frame"
attributes(df)
## $names
## [1] "col.1" "col.2" "col.3" "col.4"
##
## $row.names
## [1] "row1" "row2" "row3"
##
## $class
## [1] "data.frame"
##
## $comment
## [1] "adding a comment to a data frame"
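To read the comment back later, or to remove it, something like the following sketch should work. It uses a tiny stand-in data frame, and stored is a made-up name.

```r
df <- data.frame(col1 = 1:3)
comment(df) <- "adding a comment to a data frame"

# Read the comment back, either directly or through attr()
stored <- comment(df)
stored
attr(df, "comment")

# Assigning NULL removes the comment again
comment(df) <- NULL
is.null(comment(df))
```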
Subsetting Data Frames
A data frame has characteristics of both lists and matrices. If we subset with two vectors, it behaves like a matrix and can be subset by row and column. If we subset with a single vector, it behaves like a list and returns the selected columns with all rows.
df
##      col.1 col.2 col.3    col.4
## row1     1  this  TRUE 2.500000
## row2     2    is FALSE 4.200000
## row3     3  text  TRUE 3.141593

# subsetting by row numbers
df[2:3, ]
##      col.1 col.2 col.3    col.4
## row2     2    is FALSE 4.200000
## row3     3  text  TRUE 3.141593

# subsetting by row names
df[c("row2", "row3"), ]
##      col.1 col.2 col.3    col.4
## row2     2    is FALSE 4.200000
## row3     3  text  TRUE 3.141593

# subsetting columns like a list
df[c("col.2", "col.4")]
##      col.2    col.4
## row1  this 2.500000
## row2    is 4.200000
## row3  text 3.141593

# subsetting columns like a matrix
df[ , c("col.2", "col.4")]
##      col.2    col.4
## row1  this 2.500000
## row2    is 4.200000
## row3  text 3.141593

# subset for both rows and columns
df[1:2, c(1, 3)]
##      col.1 col.3
## row1     1  TRUE
## row2     2 FALSE

# use a vector to subset
v <- c(1, 2, 4)
df[ , v]
##      col.1 col.2    col.4
## row1     1  this 2.500000
## row2     2    is 4.200000
## row3     3  text 3.141593
Note: Subsetting a data frame with the [ operator in matrix style simplifies the result to the lowest possible dimension, so selecting a single column returns a vector. To avoid this, we have to use the drop = FALSE argument.
# simplifying results in a character vector
df[, 2]
## [1] "this" "is"   "text"

# preserving results in a 3x1 data frame
df[, 2, drop = FALSE]
##      col.2
## row1  this
## row2    is
## row3  text
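Besides the [ operator, a single column can also be pulled out with $ or [[. A short sketch with a rebuilt two-column data frame, showing which forms give back a vector and which give back a data frame:

```r
df <- data.frame(col.1 = 1:3,
                 col.2 = c("this", "is", "text"),
                 stringsAsFactors = FALSE)

# Both of these return the column as a plain vector
df$col.2
df[["col.2"]]

# List-style single brackets keep the result as a one-column data frame
df["col.2"]
```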
We can also subset data frames based on conditional statements. To illustrate, we will use the built-in mtcars data frame:
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
There are two ways to subset mtcars for all rows where mpg is greater than 20:
# using brackets
mtcars[mtcars$mpg > 20, ]
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

# using the simplified subset function
subset(mtcars, mpg > 20)
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
If we want to filter on multiple conditions, we can combine conditional statements with &. The subset() function simplifies this, because we only state the data frame once and then refer to its variables directly in the condition.
# using brackets
mtcars[mtcars$mpg > 20 & mtcars$cyl == 6, ]
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1

# using the simplified subset function
subset(mtcars, mpg > 20 & cyl == 6)
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
If we want to apply this filter and return only specified columns, we simply state the columns we want:
# using brackets
mtcars[mtcars$mpg > 20 & mtcars$cyl == 6, c("mpg", "cyl", "wt")]
##                 mpg cyl    wt
## Mazda RX4      21.0   6 2.620
## Mazda RX4 Wag  21.0   6 2.875
## Hornet 4 Drive 21.4   6 3.215

# using the simplified subset function
subset(mtcars, mpg > 20 & cyl == 6, c("mpg", "cyl", "wt"))
##                 mpg cyl    wt
## Mazda RX4      21.0   6 2.620
## Mazda RX4 Wag  21.0   6 2.875
## Hornet 4 Drive 21.4   6 3.215
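The two approaches give the same result on mtcars because it has no missing values. With NA in the filtered column they differ, as this sketch shows. The scores data frame is made up for this illustration.

```r
# Hypothetical data with one missing score
scores <- data.frame(id = 1:4, score = c(10, NA, 30, 5))

# Brackets keep a row of NAs where the condition evaluates to NA
by_bracket <- scores[scores$score > 8, ]

# subset() drops rows where the condition is NA
by_subset <- subset(scores, score > 8)

# Wrapping the condition in which() makes brackets drop them too
by_which <- scores[which(scores$score > 8), ]

nrow(by_bracket)
nrow(by_subset)
nrow(by_which)
```

So if your data may contain missing values, subset() or which() is usually the safer choice for filtering.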
Get the Structure of the Data Frame
We can see the structure of a data frame by using the str() function.
# Create the data frame.
emp.data <- data.frame(
  emp_id = 1:5,
  emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
  salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
  start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
                         "2014-05-11", "2015-03-27")),
  stringsAsFactors = FALSE
)

# Get the structure of the data frame.
str(emp.data)
After executing the above code, the results look like this:
'data.frame':   5 obs. of  4 variables:
 $ emp_id    : int  1 2 3 4 5
 $ emp_name  : chr  "Rick" "Dan" "Michelle" "Ryan" ...
 $ salary    : num  623 515 611 729 843
 $ start_date: Date, format: "2012-01-01" "2013-09-23" "2014-11-15" "2014-05-11" ...
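If you only need the class of each column rather than the full str() report, one possible shortcut is sapply() with class(). This sketch rebuilds emp.data so it runs on its own; col_classes is a made-up name.

```r
emp.data <- data.frame(
  emp_id = 1:5,
  emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
  salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
  start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
                         "2014-05-11", "2015-03-27")),
  stringsAsFactors = FALSE
)

# Apply class() to every column and collect the results by column name
col_classes <- sapply(emp.data, class)
col_classes

# summary() gives a quick numeric overview of a single column
summary(emp.data$salary)
```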
Example 2: Creating a Data Frame in R
# Create a, b, c, d variables
a <- c(10, 20, 30, 40)
b <- c('book', 'pen', 'textbook', 'pencil_case')
c <- c(TRUE, FALSE, TRUE, FALSE)
d <- c(2.5, 8, 10, 7)

# Join the variables to create a data frame
df <- data.frame(a, b, c, d)
df
Output:
##    a           b     c    d
## 1 10        book  TRUE  2.5
## 2 20         pen FALSE  8.0
## 3 30    textbook  TRUE 10.0
## 4 40 pencil_case FALSE  7.0
# Name the data frame
names(df) <- c('ID', 'items', 'store', 'price')
df
Output:
##   ID       items store price
## 1 10        book  TRUE   2.5
## 2 20         pen FALSE   8.0
## 3 30    textbook  TRUE  10.0
## 4 40 pencil_case FALSE   7.0

# Print the structure
str(df)
Output (from an R version before 4.0.0; newer versions show items as chr instead of Factor):
## 'data.frame':    4 obs. of  4 variables:
##  $ ID   : num  10 20 30 40
##  $ items: Factor w/ 4 levels "book","pen","pencil_case",..: 1 2 4 3
##  $ store: logi  TRUE FALSE TRUE FALSE
##  $ price: num  2.5 8 10 7
Today we covered many topics about how to create a data frame in the R programming language. I hope you guys understood everything clearly. So, guys, that’s all for today. Later we will discuss another topic in R programming. Till then, take care. Happy Coding!