Introduction to Web scrapping in R
Web scrapping/crawling takes the advantages of common html tags to extract information from a webpage. It is useful in collecting and organising structured information for human or passed to further workflow.
In R platform, rvest
is the package that includes functions to read html source code by providing page urls; with the help from a css selector as a chrome plugin SelectorGadget, which identifies tags of interest, web scrapping can be realised without html/css knowledge.
The main source I learnt this technique are as follows.
- scrapping introduction by analyticsvidhya
- rvest tutorial: scraping the web using R
- vignette(“selectorgadget”) under
rvest
.
Understand the data
The scrapping subject is from question pools of GRE Analytical Writing. There are two pools, issue and arguement, with the format varies.
- Issue Pool is a list of 152. It starts with a prompt for contextualising to a specific topic and a task for providing a direction. Task portion contains 6 potential options and it always starts with “write a response in which you discuss…”. Prompt portion varies in format and number of sentences; two-sentence prompt are claim-response type, which the two start with “claim” and “response” respectively.
- Argument Pool contains 176 entities. It also gives a background followed by task. The background varies while there are only 8 tasks which start with same “write a response in which you discuss…”. There are also two-sentence prompt and the first sentence is always “the following…”.
Approach
The strategy of scrapping this information from the web page and split them into prompt Vs task is as follows
- from html web address, extract the individual entities by using html tags
.divider-50~ p , .indented p
(assisted by SelectorGadget).
library(tidyverse)
library(openxlsx)
library(kableExtra)
library(rvest)
scraping <- function(url, css){
html_text(html_nodes(read_html(url), css))
}
url_issue <- "https://www.ets.org/gre/revised_general/prepare/analytical_writing/issue/pool"
url_argument <- "https://www.ets.org/gre/revised_general/prepare/analytical_writing/argument/pool"
## store issue and argument page as "issue" and "argument", respectively
issue <- scraping(url_issue, ".divider-50~ p , .indented p")
argument <- scraping(url_argument, ".divider-50~ p , .indented p")
- To organised the scrapped information into prompt and question, I processed the captured information in
## For issue, find the index with sentense start with "claim"
start_claim <- unname(which(sapply(issue, substr, start = 1, stop = 5 ) == "Claim"))
## paste start with "claim" item with each of the following items (started with "Reason")
issue[start_claim] <- paste(issue[start_claim], "\n",issue[start_claim+1])
## delete item start with "Reason"
issue <- issue[-(start_claim + 1)]
## now odd is prompt and even is question
issue_tbl <- matrix(issue, ncol = 2, byrow = T, dimnames = list(c(), c(c("background_info", "question")))) %>%
as_tibble() %>%
mutate(., qn_no = as.numeric(factor("question"))) # number the question
######################################################################
## for argument, double-line promt has no single/simple pattern, we identied/indexed questions.
start_response <- unname(which(sapply(argument, substr, start = 1, stop = 16) == "Write a response"))
## creating interval between each question to next.
interval <- findInterval(1:493, start_response, rightmost.closed = T, left.open = T)
interval[2] <- c(0) ## a minor adjust of the first interval.
## split the prompts out, aggregate paste neighbour with sample interval number.
interval_tbl <- tbl_df(cbind(argument, interval)) %>%
group_by(., interval) %>%
filter(., row_number() != n()) %>%
ungroup()
## Warning: `tbl_df()` is deprecated as of dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Warning in cbind(argument, interval): number of rows of result is not a multiple
## of vector length (arg 1)
bg_tbl <- aggregate(argument~interval, interval_tbl, paste, collapse = "\n") %>%
tbl_df
## split the questions out.
qn_tbl <- tbl_df(cbind(argument, interval)) %>%
group_by(., interval) %>%
filter(., row_number() == n()) %>%
ungroup()
## Warning in cbind(argument, interval): number of rows of result is not a multiple
## of vector length (arg 1)
## joining prompts and questions.
argument_tbl <- inner_join(bg_tbl, qn_tbl, by = c("interval" = "interval")) %>%
select(., -interval) %>%
mutate(., qn_no = as.numeric(factor(argument.y)))
colnames(argument_tbl) <- c("background_info", "question", "qn_no")
######################################################################
## writing in excel
# wb <- createWorkbook()
#
# sn <- "argument"
# addWorksheet(wb = wb, sheetName = sn)
# writeData(wb = wb, sheet = sn, x = argument_tbl, borders = "n")
#
# sn <- "issue"
# addWorksheet(wb = wb, sheetName = sn)
#
# writeData(wb = wb, sheet = sn, x = issue_tbl, borders = "n")
# saveWorkbook(wb, file = "./static/GRE_AW.xlsx", overwrite = TRUE)
background_info | question | qn_no |
---|---|---|
Woven baskets characterized by a particular distinctive pattern have previously been found only in the immediate vicinity of the prehistoric village of Palea and therefore were believed to have been made only by the Palean people. Recently, however, archaeologists discovered such a “Palean” basket in Lithos, an ancient village across the Brim River from Palea. The Brim River is very deep and broad, and so the ancient Paleans could have crossed it only by boat, and no Palean boats have been found. Thus it follows that the so-called Palean baskets were not uniquely Palean. | Write a response in which you discuss what specific evidence is needed to evaluate the argument and explain how the evidence would weaken or strengthen the argument. | 8 |
The following appeared as part of a letter to the editor of a scientific journal. “A recent study of eighteen rhesus monkeys provides clues as to the effects of birth order on an individual’s levels of stimulation. The study showed that in stimulating situations (such as an encounter with an unfamiliar monkey), firstborn infant monkeys produce up to twice as much of the hormone cortisol, which primes the body for increased activity levels, as do their younger siblings. Firstborn humans also produce relatively high levels of cortisol in stimulating situations (such as the return of a parent after an absence). The study also found that during pregnancy, first-time mother monkeys had higher levels of cortisol than did those who had had several offspring.” | Write a response in which you discuss one or more alternative explanations that could rival the proposed explanation and explain how your explanation(s) can plausibly account for the facts presented in the argument. | 2 |
The council of Maple County, concerned about the county’s becoming overdeveloped, is debating a proposed measure that would prevent the development of existing farmland in the county. But the council is also concerned that such a restriction, by limiting the supply of new housing, could lead to significant increases in the price of housing in the county. Proponents of the measure note that Chestnut County established a similar measure ten years ago, and its housing prices have increased only modestly since. However, opponents of the measure note that Pine County adopted restrictions on the development of new residential housing fifteen years ago, and its housing prices have since more than doubled. The council currently predicts that the proposed measure, if passed, will result in a significant increase in housing prices in Maple County. | Write a response in which you discuss what questions would need to be answered in order to decide whether the prediction and the argument on which it is based are reasonable. Be sure to explain how the answers to these questions would help to evaluate the prediction. | 5 |
The following appeared in a memo from the owner of a chain of cheese stores located throughout the United States. “For many years all the stores in our chain have stocked a wide variety of both domestic and imported cheeses. Last year, however, all of the five best-selling cheeses at our newest store were domestic cheddar cheeses from Wisconsin. Furthermore, a recent survey by Cheeses of the World magazine indicates an increasing preference for domestic cheeses among its subscribers. Since our company can reduce expenses by limiting inventory, the best way to improve profits in all of our stores is to discontinue stocking many of our varieties of imported cheese and concentrate primarily on domestic cheeses.” | Write a response in which you discuss what specific evidence is needed to evaluate the argument and explain how the evidence would weaken or strengthen the argument. | 8 |
The following appeared in a memorandum from the general manager of KNOW radio station. “Several factors indicate that radio station KNOW should shift its programming from rock-and-roll music to a continuous news format. Consider, for example, that the number of people in our listening area over fifty years of age has increased dramatically, while our total number of listeners has declined. Also, music stores in our area report decreased sales of recorded music. Finally, continuous news stations in neighboring cities have been very successful. The switch from rock-and-roll music to 24-hour news will attract older listeners and secure KNOW radio’s future.” | Write a response in which you examine the stated and/or unstated assumptions of the argument. Be sure to explain how the argument depends on these assumptions and what the implications are for the argument if the assumptions prove unwarranted. | 10 |
The following appeared in a memorandum from the manager of KNOW radio station. “Several factors indicate that KNOW radio can no longer succeed as a rock-and-roll music station. Consider, for example, that the number of people in our listening area over fifty years of age has increased dramatically, while our total number of listeners has declined. Also, music stores in our area report decreased sales of rock-and-roll music. Finally, continuous news stations in neighboring cities have been very successful. We predict that switching KNOW radio from rock-and-roll music to 24-hour news will allow the station to attract older listeners and make KNOW radio more profitable than ever.” | Write a response in which you discuss what questions would need to be answered in order to decide whether the prediction and the argument on which it is based are reasonable. Be sure to explain how the answers to these questions would help to evaluate the prediction. | 5 |
background_info | question | qn_no |
---|---|---|
To understand the most important characteristics of a society, one must study its major cities. | Write a response in which you discuss the extent to which you agree or disagree with the statement and explain your reasoning for the position you take. In developing and supporting your position, you should consider ways in which the statement might or might not hold true and explain how these considerations shape your position. | 1 |
Educational institutions have a responsibility to dissuade students from pursuing fields of study in which they are unlikely to succeed. | Write a response in which you discuss the extent to which you agree or disagree with the claim. In developing and supporting your position, be sure to address the most compelling reasons and/or examples that could be used to challenge your position. | 1 |
Scandals are useful because they focus our attention on problems in ways that no speaker or reformer ever could. | Write a response in which you discuss the extent to which you agree or disagree with the claim. In developing and supporting your position, be sure to address the most compelling reasons and/or examples that could be used to challenge your position. | 1 |
Claim: Governments must ensure that their major cities receive the financial support they need in order to thrive. Reason: It is primarily in cities that a nation’s cultural traditions are preserved and generated. | Write a response in which you discuss the extent to which you agree or disagree with the claim and the reason on which that claim is based. | 1 |
Some people believe that government funding of the arts is necessary to ensure that the arts can flourish and be available to all people. Others believe that government funding of the arts threatens the integrity of the arts. | Write a response in which you discuss which view more closely aligns with your own position and explain your reasoning for the position you take. In developing and supporting your position, you should address both of the views presented. | 1 |
Claim: In any field — business, politics, education, government — those in power should step down after five years. Reason: The surest path to success for any enterprise is revitalization through new leadership. | Write a response in which you discuss the extent to which you agree or disagree with the claim and the reason on which that claim is based. | 1 |
Output and reflection
Capturing a data set was trivial while organising it requires a fair understand about its structure. The idea of organising GRE analytical issue and argument were conceived in a chronological order, so that the way to organise argument was more generalisable because it use a shared feature in the question instead of the feature only appear in each type. Although no bench marking on code efficiency was performed thus there was no guarantee on the efficiency of the code, this post persented a real life example which required a workflow involving web scrapping and data tidyup.