Creating a dataframe in R || create dataframe in R programming language

Hey guys, what’s up? In this R programming tutorial I’ll teach you how to create a data frame in R. So, let’s start.

What is a Data Frame in R?

In the R programming language, a data frame is a two-dimensional data structure used for storing data tables. It is a special case of a list in which every component is a vector of the same length: each vector is a column (a variable), and each position across the vectors is a row (an observation).
A data frame looks like a matrix, but unlike a matrix, its columns can be of different types. That means one column might be numeric, another might be a factor, and a third might be character. For example, an employee table could combine:
  • A character vector: containing the names (like: employee)
  • A numeric vector: containing numbers (like: salary)
  • A date vector: containing the dates (like: startdate)
> employee <- c('John Doe','Peter Gynn','Jolie Hope')
> salary <- c(21000, 23400, 26800)
> startdate <- as.Date(c('2010-11-1','2008-3-25','2007-3-14'))
Another example:
> n = c(2, 3, 5) 
> s = c("aa", "bb", "cc") 
> b = c(TRUE, FALSE, TRUE) 
> df = data.frame(n, s, b)       # df is a data frame 
Here, variable df is a data frame containing three vectors n, s, b. 
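The same call works for the employee, salary and startdate vectors defined earlier. A minimal sketch, assuming the hypothetical name employ.data for the result (it is not part of the original example):

```r
# Vectors from the employee example
employee <- c('John Doe', 'Peter Gynn', 'Jolie Hope')
salary <- c(21000, 23400, 26800)
startdate <- as.Date(c('2010-11-1', '2008-3-25', '2007-3-14'))

# employ.data is an assumed name, used only for illustration
employ.data <- data.frame(employee, salary, startdate)

# Each vector becomes one column; each position becomes one row
dim(employ.data)
names(employ.data)
```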

Characteristics of Data Frame in R

Let’s discuss the characteristics of data frames in the R programming language.

  • Column names have to be non-empty.
  • Row names have to be unique.
  • A data frame can hold numeric, character or factor type data.
  • Each column has to contain the same number of data items.
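
Two of these rules can be checked directly. The sketch below is only an illustration: the names res and dup and the error messages returned by the handlers are made up for this example.

```r
# Columns of unequal length (3 vs 2) cannot be combined into a data frame
res <- tryCatch(data.frame(a = 1:3, b = 1:2),
                error = function(e) "error: columns differ in length")
res

# Duplicate row names are rejected as well
dup <- tryCatch(data.frame(a = 1:2, row.names = c("r1", "r1")),
                error = function(e) "error: duplicate row names")
dup
```

One detail to keep in mind: data.frame() recycles a shorter vector when its length divides the longer one exactly, so data.frame(a = 1:4, b = 1:2) does not raise an error.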

Check if a variable is a data frame or not

To check whether a variable is a data frame or not, we can use the class() function:

> x
  SN Age Name
1  1  21 John
2  2  15 Dora
> typeof(x)    # data frame is a special case of  list
[1] "list"
> class(x)
[1] "data.frame"

In the example above, x is a list of three components, and each component is a vector of two elements. There are some useful data frame functions worth knowing about; I am going to show them below.
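
Besides class(), base R also offers is.data.frame() and inherits() for the same check. A small sketch that rebuilds x from the printed output above (the exact column types of the original x are an assumption):

```r
# Rebuild the small data frame x shown above
x <- data.frame(SN = 1:2, Age = c(21, 15), Name = c("John", "Dora"))

is.data.frame(x)            # TRUE
inherits(x, "data.frame")   # TRUE
is.list(x)                  # TRUE, a data frame is also a list
length(x)                   # 3, the number of columns (list components)
```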

How to create a DataFrame in R?

To create a data frame in R, we use the data.frame() function. In practice, data frames are often created by reading in a dataset with read.table() or read.csv(); we will cover that in another R programming tutorial. Today we will create data frames with the data.frame() function. Below, I’ll create a simple data frame df and assess its basic structure:
df <- data.frame(col1 = 1:3, 
                 col2 = c("this", "is", "text"), 
                 col3 = c(TRUE, FALSE, TRUE), 
                 col4 = c(2.5, 4.2, pi))

# assess the structure of a data frame
str(df)
## 'data.frame': 3 obs. of  4 variables:
##  $ col1: int  1 2 3
##  $ col2: Factor w/ 3 levels "is","text","this": 3 1 2
##  $ col3: logi  TRUE FALSE TRUE
##  $ col4: num  2.5 4.2 3.14

# number of rows
nrow(df)
## [1] 3

# number of columns
ncol(df)
## [1] 4

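Above, col2 in df was converted to a column of factors. This happens because, in R versions before 4.0.0, data.frame() converts character columns to factors by default. By setting the stringsAsFactors = FALSE argument, we can turn this off. (Since R 4.0.0, stringsAsFactors = FALSE is the default, so col2 would already be a character column.)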

df <- data.frame(col1 = 1:3, 
                 col2 = c("this", "is", "text"), 
                 col3 = c(TRUE, FALSE, TRUE), 
                 col4 = c(2.5, 4.2, pi), 
                 stringsAsFactors = FALSE)

# note how col2 now is of a character class
str(df)
## 'data.frame': 3 obs. of  4 variables:
##  $ col1: int  1 2 3
##  $ col2: chr  "this" "is" "text"
##  $ col3: logi  TRUE FALSE TRUE
##  $ col4: num  2.5 4.2 3.14
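
Since the default for stringsAsFactors changed in R 4.0.0, stating the argument explicitly gives the same result on any version. A small sketch (words, df_chr and df_fac are names assumed for illustration):

```r
words <- c("this", "is", "text")

# Explicitly keep strings as character
df_chr <- data.frame(col2 = words, stringsAsFactors = FALSE)
class(df_chr$col2)    # "character"

# Explicitly convert strings to factors
df_fac <- data.frame(col2 = words, stringsAsFactors = TRUE)
class(df_fac$col2)    # "factor"
nlevels(df_fac$col2)  # 3
```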


Below, we will see how we can turn multiple vectors, a list, or a matrix into a data frame:

v1 <- 1:3
v2 <- c("this", "is", "text")
v3 <- c(TRUE, FALSE, TRUE)

# convert same length vectors to a data frame using data.frame()
data.frame(col1 = v1, col2 = v2, col3 = v3)
##   col1 col2  col3
## 1    1 this  TRUE
## 2    2   is FALSE
## 3    3 text  TRUE

# convert a list to a data frame using as.data.frame()
l <- list(item1 = 1:3, item2 = c("this", "is", "text"), item3 = c(2.5, 4.2, 5.1))
l
## $item1
## [1] 1 2 3
## 
## $item2
## [1] "this" "is"   "text"
## 
## $item3
## [1] 2.5 4.2 5.1

as.data.frame(l)
##   item1 item2 item3
## 1     1  this   2.5
## 2     2    is   4.2
## 3     3  text   5.1

# convert a matrix to a data frame using as.data.frame()
m1 <- matrix(1:12, nrow = 4, ncol = 3)
m1
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12

as.data.frame(m1)
##   V1 V2 V3
## 1  1  5  9
## 2  2  6 10
## 3  3  7 11
## 4  4  8 12
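
The default names V1, V2, V3 are rarely what we want. Here is a short sketch of two ways to get meaningful column names; the names a, b, c, df_m and df_v are assumptions made for this example:

```r
m1 <- matrix(1:12, nrow = 4, ncol = 3)

# Give the matrix column names before converting
colnames(m1) <- c("a", "b", "c")
df_m <- as.data.frame(m1)
names(df_m)   # "a" "b" "c"

# Or rename the default V1, V2, V3 after converting
df_v <- as.data.frame(matrix(1:12, nrow = 4, ncol = 3))
names(df_v) <- c("a", "b", "c")
names(df_v)   # "a" "b" "c"
```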

Adding column to Data Frames

The cbind() function is used for adding columns to a data frame. Bear in mind that one of the objects being combined must already be a data frame; otherwise cbind() could produce a matrix.

df
##   col1 col2  col3     col4
## 1    1 this  TRUE 2.500000
## 2    2   is FALSE 4.200000
## 3    3 text  TRUE 3.141593

# add a new column
v4 <- c("A", "B", "C")

cbind(df, v4)
##   col1 col2  col3     col4 v4
## 1    1 this  TRUE 2.500000  A
## 2    2   is FALSE 4.200000  B
## 3    3 text  TRUE 3.141593  C
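
The new column above takes its name from the vector (v4). As a sketch, here are two other common ways to add a column under a name of our choosing; col5, df_a and df_b are names assumed for illustration, and the data frame is a shortened copy of df:

```r
df <- data.frame(col1 = 1:3,
                 col2 = c("this", "is", "text"),
                 stringsAsFactors = FALSE)
v4 <- c("A", "B", "C")

# Name the new column while binding
df_a <- cbind(df, col5 = v4)
names(df_a)   # "col1" "col2" "col5"

# Or assign a new column directly with $
df_b <- df
df_b$col5 <- v4
names(df_b)   # "col1" "col2" "col5"
```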

To add rows to a data frame, we can use the rbind() function. Currently our data frame df consists of integer, character, logical, and numeric variables.

df
##   col1 col2  col3     col4
## 1    1 this  TRUE 2.500000
## 2    2   is FALSE 4.200000
## 3    3 text  TRUE 3.141593

str(df)
## 'data.frame': 3 obs. of  4 variables:
##  $ col1: int  1 2 3
##  $ col2: chr  "this" "is" "text"
##  $ col3: logi  TRUE FALSE TRUE
##  $ col4: num  2.5 4.2 3.14

All elements in a vector created by c() must be of the same class, so c(4, "R", FALSE, 1.1) becomes a character vector. If we add that vector as a row using rbind(), all the columns are converted to the character class.

df2 <- rbind(df, c(4, "R", FALSE, 1.1))

df2
##   col1 col2  col3             col4
## 1    1 this  TRUE              2.5
## 2    2   is FALSE              4.2
## 3    3 text  TRUE 3.14159265358979
## 4    4    R FALSE              1.1

str(df2)
## 'data.frame': 4 obs. of  4 variables:
##  $ col1: chr  "1" "2" "3" "4"
##  $ col2: chr  "this" "is" "text" "R"
##  $ col3: chr  "TRUE" "FALSE" "TRUE" "FALSE"
##  $ col4: chr  "2.5" "4.2" "3.14159265358979" "1.1"

If we want to add rows without changing the column types, we have to put the items being added into a data frame whose columns have the same classes as the original data frame.

adding_df <- data.frame(col1 = 4, col2 = "R", col3 = FALSE, col4 = 1.1, 
                 stringsAsFactors = FALSE)

df3 <- rbind(df, adding_df)

df3
##   col1 col2  col3     col4
## 1    1 this  TRUE 2.500000
## 2    2   is FALSE 4.200000
## 3    3 text  TRUE 3.141593
## 4    4    R FALSE 1.100000

str(df3)
## 'data.frame': 4 obs. of  4 variables:
##  $ col1: num  1 2 3 4
##  $ col2: chr  "this" "is" "text" "R"
##  $ col3: logi  TRUE FALSE TRUE FALSE
##  $ col4: num  2.5 4.2 3.14 1.1
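
Notice that col1 changed from int to num, because 4 is a double. If the integer type matters, one option is to use the integer literal 4L. A sketch, with adding_int and df4 as names assumed for illustration:

```r
df <- data.frame(col1 = 1:3,
                 col2 = c("this", "is", "text"),
                 col3 = c(TRUE, FALSE, TRUE),
                 col4 = c(2.5, 4.2, pi),
                 stringsAsFactors = FALSE)

# 4L is an integer literal, so col1 stays integer after rbind()
adding_int <- data.frame(col1 = 4L, col2 = "R", col3 = FALSE, col4 = 1.1,
                         stringsAsFactors = FALSE)
df4 <- rbind(df, adding_int)
class(df4$col1)   # "integer"
```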

There are better ways to join data frames together. I will discuss them in another R programming tutorial.

Adding Attributes to Data Frames

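Similar to matrices, data frames have dimensions, which dim() reports. Unlike a matrix, though, a data frame does not store a dim attribute; its attributes are names, row.names and class.

# basic data frame
df
##   col1 col2  col3     col4
## 1    1 this  TRUE 2.500000
## 2    2   is FALSE 4.200000
## 3    3 text  TRUE 3.141593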

dim(df)
## [1] 3 4

attributes(df)
## $names
## [1] "col1" "col2" "col3" "col4"
## 
## $row.names
## [1] 1 2 3
## 
## $class
## [1] "data.frame"

Our data frame df has no row names yet, only the default row numbers. We can add row names using rownames():

# add row names
rownames(df) <- c("row1", "row2", "row3")

df
##      col1 col2  col3     col4
## row1    1 this  TRUE 2.500000
## row2    2   is FALSE 4.200000
## row3    3 text  TRUE 3.141593

attributes(df)
## $names
## [1] "col1" "col2" "col3" "col4"
## 
## $row.names
## [1] "row1" "row2" "row3"
## 
## $class
## [1] "data.frame"

To change existing column names, we can use colnames() or names():

# add/change column names with colnames()
colnames(df) <- c("col_1", "col_2", "col_3", "col_4")

df
##      col_1 col_2 col_3    col_4
## row1     1  this  TRUE 2.500000
## row2     2    is FALSE 4.200000
## row3     3  text  TRUE 3.141593

attributes(df)
## $names
## [1] "col_1" "col_2" "col_3" "col_4"
## 
## $row.names
## [1] "row1" "row2" "row3"
## 
## $class
## [1] "data.frame"

# add/change column names with names()
names(df) <- c("col.1", "col.2", "col.3", "col.4")

df
##      col.1 col.2 col.3    col.4
## row1     1  this  TRUE 2.500000
## row2     2    is FALSE 4.200000
## row3     3  text  TRUE 3.141593

attributes(df)
## $names
## [1] "col.1" "col.2" "col.3" "col.4"
## 
## $row.names
## [1] "row1" "row2" "row3"
## 
## $class
## [1] "data.frame"
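
Both functions also work for renaming a single column instead of all of them. A minimal sketch on a shortened data frame; the new names words and id are made up for this example:

```r
df <- data.frame(col1 = 1:3, col2 = c("this", "is", "text"),
                 stringsAsFactors = FALSE)

# Rename by position
names(df)[2] <- "words"

# Rename by matching the current name
names(df)[names(df) == "col1"] <- "id"

names(df)   # "id" "words"
```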

Just like with vectors, matrices and lists, we can add a comment to a data frame without affecting how it behaves:

# adding a comment attribute
comment(df) <- "adding a comment to a data frame"

attributes(df)
## $names
## [1] "col.1" "col.2" "col.3" "col.4"
## 
## $row.names
## [1] "row1" "row2" "row3"
## 
## $class
## [1] "data.frame"
## 
## $comment
## [1] "adding a comment to a data frame"
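
The comment can be read back with comment(), and it stays out of the way when the data frame is printed. A short sketch on a one-column data frame (printed is a name assumed for illustration):

```r
df <- data.frame(col1 = 1:3)
comment(df) <- "adding a comment to a data frame"

# Read the comment back
comment(df)   # "adding a comment to a data frame"

# The comment is not shown when the data frame is printed
printed <- capture.output(print(df))
any(grepl("adding a comment", printed))   # FALSE
```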

Subsetting Data Frames

Data frames have characteristics of both lists and matrices. If we subset with two vectors (rows and columns), they behave like matrices and can be subset by row and column. If we subset with a single vector, they behave like lists and return the selected columns with all rows.

df
##      col.1 col.2 col.3    col.4
## row1     1  this  TRUE 2.500000
## row2     2    is FALSE 4.200000
## row3     3  text  TRUE 3.141593

# subsetting by row numbers
df[2:3, ]
##      col.1 col.2 col.3    col.4
## row2     2    is FALSE 4.200000
## row3     3  text  TRUE 3.141593

# subsetting by row names
df[c("row2", "row3"), ]
##      col.1 col.2 col.3    col.4
## row2     2    is FALSE 4.200000
## row3     3  text  TRUE 3.141593

# subsetting columns like a list
df[c("col.2", "col.4")]
##      col.2    col.4
## row1  this 2.500000
## row2    is 4.200000
## row3  text 3.141593

# subsetting columns like a matrix
df[ , c("col.2", "col.4")]
##      col.2    col.4
## row1  this 2.500000
## row2    is 4.200000
## row3  text 3.141593

# subset for both rows and columns
df[1:2, c(1, 3)]
##      col.1 col.3
## row1     1  TRUE
## row2     2 FALSE

# use a vector to subset
v <- c(1, 2, 4)
df[ , v]
##      col.1 col.2    col.4
## row1     1  this 2.500000
## row2     2    is 4.200000
## row3     3  text 3.141593

Note: Selecting a single column with the matrix-style [ operator simplifies the result to the lowest possible dimension, a vector. To avoid this, we have to use the drop = FALSE argument.

# simplifying results in a vector
df[, 2]
## [1] "this" "is"   "text"

# preserving results in a 3x1 data frame
df[, 2, drop = FALSE]
##      col.2
## row1  this
## row2    is
## row3  text
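
When a plain vector is what we actually want, [[ and $ extract a single column directly. A small sketch on a shortened copy of df:

```r
df <- data.frame(col.1 = 1:3, col.2 = c("this", "is", "text"),
                 stringsAsFactors = FALSE)

# [[ and $ extract a single column as a vector
df[["col.2"]]
df$col.2

# Single brackets with one index keep the data frame structure
class(df["col.2"])     # "data.frame"
class(df[["col.2"]])   # "character"
```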

We can also subset data frames based on conditional statements. To illustrate, we will use the built-in mtcars data frame:

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

If we want to subset mtcars for all rows where mpg is greater than 20, we can do it in two ways:

# using brackets
mtcars[mtcars$mpg > 20, ]
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

# using the simplified subset function
subset(mtcars, mpg > 20)
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

If we want to filter on multiple conditions, we can combine conditional statements with &. The subset() function simplifies the process: we state the data frame only once and then refer to its variables directly in the condition.

# using brackets
mtcars[mtcars$mpg > 20 & mtcars$cyl == 6, ]
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1

# using the simplified subset function
subset(mtcars, mpg > 20 & cyl == 6)
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
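
The two approaches differ in one way worth knowing about: missing values. mtcars has no NA values, so the sketch below uses a tiny made-up data frame d (the names d, id and score are assumptions for this example):

```r
d <- data.frame(id = 1:3, score = c(10, NA, 30))

# Brackets keep a row of NAs where the condition is NA
by_bracket <- d[d$score > 15, ]
nrow(by_bracket)   # 2

# subset() drops rows where the condition is NA
by_subset <- subset(d, score > 15)
nrow(by_subset)    # 1

# which() gives the same result with brackets
nrow(d[which(d$score > 15), ])   # 1
```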

If we want to perform this filtering and return only specified columns, we simply state the columns we want to return:

# using brackets
mtcars[mtcars$mpg > 20 & mtcars$cyl == 6, c("mpg", "cyl", "wt")]
##                 mpg cyl    wt
## Mazda RX4      21.0   6 2.620
## Mazda RX4 Wag  21.0   6 2.875
## Hornet 4 Drive 21.4   6 3.215

# using the simplified subset function
subset(mtcars, mpg > 20 & cyl == 6, c("mpg", "cyl", "wt"))
##                 mpg cyl    wt
## Mazda RX4      21.0   6 2.620
## Mazda RX4 Wag  21.0   6 2.875
## Hornet 4 Drive 21.4   6 3.215
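
In subset(), the third argument is called select, and it also accepts unquoted column names and ranges. A short sketch (res and res_range are names assumed for illustration):

```r
# Column names can be given unquoted through the select argument
res <- subset(mtcars, mpg > 20 & cyl == 6, select = c(mpg, cyl, wt))
names(res)   # "mpg" "cyl" "wt"
nrow(res)    # 3

# A range of adjacent columns, from mpg through disp
res_range <- subset(mtcars, mpg > 20 & cyl == 6, select = mpg:disp)
names(res_range)   # "mpg" "cyl" "disp"
```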

Get the Structure of the Data Frame

We can see the structure of a data frame by using the str() function.

# Create the data frame.
emp.data <- data.frame(
   emp_id = 1:5,
   emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
   salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
   start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
                          "2015-03-27")),
   stringsAsFactors = FALSE
)
# Get the structure of the data frame.
str(emp.data)

After executing the above code, the result looks like this:

'data.frame':   5 obs. of  4 variables:
 $ emp_id    : int  1 2 3 4 5
 $ emp_name  : chr  "Rick" "Dan" "Michelle" "Ryan" ...
 $ salary    : num  623 515 611 729 843
 $ start_date: Date, format: "2012-01-01" "2013-09-23" "2014-11-15" "2014-05-11" ...
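
If we only need the class of each column rather than the full str() listing, sapply() with class() is a compact alternative. A sketch that rebuilds emp.data so it runs on its own (col_classes is a name assumed for illustration):

```r
emp.data <- data.frame(
   emp_id = 1:5,
   emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
   salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
   start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
                          "2014-05-11", "2015-03-27")),
   stringsAsFactors = FALSE
)

# One class name per column
col_classes <- sapply(emp.data, class)
col_classes

# summary() gives a quick statistical overview of each column
summary(emp.data)
```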

Example 2 of creating dataframe in r

First, we create our data set by combining four variables of the same length.
# Create a, b, c, d variables
a <- c(10,20,30,40)
b <- c('book', 'pen', 'textbook', 'pencil_case')
c <- c(TRUE,FALSE,TRUE,FALSE)
d <- c(2.5, 8, 10, 7)
# Join the variables to create a data frame
df <- data.frame(a,b,c,d)
df


Output:

##    a           b     c    d
## 1 10        book  TRUE  2.5
## 2 20         pen FALSE  8.0
## 3 30    textbook  TRUE 10.0
## 4 40 pencil_case FALSE  7.0

Change the column names with the function names().
# Name the data frame
names(df) <- c('ID', 'items', 'store', 'price')
df


Output:

##   ID       items store price
## 1 10        book  TRUE   2.5
## 2 20         pen FALSE   8.0
## 3 30    textbook  TRUE  10.0
## 4 40 pencil_case FALSE   7.0
# Print the structure
str(df)


Output:

## 'data.frame':    4 obs. of  4 variables:
##  $ ID   : num  10 20 30 40
##  $ items: Factor w/ 4 levels "book","pen","pencil_case",..: 1 2 4 3
##  $ store: logi  TRUE FALSE TRUE FALSE
##  $ price: num  2.5 8 10 7
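
If the items column should be plain text rather than a factor, it can be converted with as.character(). A sketch that builds the same data with the column names given directly (items_df is a name assumed for illustration, and stringsAsFactors = TRUE is set explicitly so the factor appears on any R version):

```r
items_df <- data.frame(ID = c(10, 20, 30, 40),
                       items = c('book', 'pen', 'textbook', 'pencil_case'),
                       store = c(TRUE, FALSE, TRUE, FALSE),
                       price = c(2.5, 8, 10, 7),
                       stringsAsFactors = TRUE)
class(items_df$items)   # "factor"

# Convert the factor column back to character
items_df$items <- as.character(items_df$items)
class(items_df$items)   # "character"
```

Naming the columns inside data.frame() also avoids creating a variable called c, which has the same name as the c() function.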


In R versions before 4.0.0, data.frame() returns string variables as factors by default; since R 4.0.0 they stay character unless stringsAsFactors = TRUE is set.

Today we covered many topics about how to create a data frame in the R programming language. I hope you guys understood everything clearly. So, guys, that’s all for today. Later we will discuss another topic in R programming. Till then, take care. Happy coding!