The post here demonstrates an example of a Hacker News scraper with
But a small problem emerges: YC-funded startups sometimes post job ads on the front page, and the scraper then returns vectors of different lengths. Specifically, it returns no score value for such an ad. One possible remedy is to log in as a real user and then do the scraping. In R, it is implemented as:
    library(rvest)

    # Open the login page and take its first form (the login form)
    login <- 'https://news.ycombinator.com/login'
    session <- html_session(login)
    form <- html_form(session)[[1]]

    # Fill in the credentials and submit, keeping the returned session
    filled_form <- set_values(form, acct = 'MyAccountName', pw = 'MyPassword')
    session <- submit_form(session, filled_form)

    # Jump to the front page and extract each field
    content <- jump_to(session, 'https://news.ycombinator.com/')
    title <- content %>% html_nodes('a.storylink') %>% html_text()
    link_domain <- content %>% html_nodes('span.sitestr') %>% html_text()
    score <- content %>% html_nodes('span.score') %>% html_text()
    age <- content %>% html_nodes('span.age') %>% html_text()

    df <- data.frame(title = title, link_domain = link_domain, score = score, age = age)
The resulting data frame contains all 30 links on the front page of Hacker News.
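The length mismatch itself can be reproduced offline. This is only a minimal sketch: the HTML is simplified stand-in markup, and the `div.item` wrapper is an assumption made for illustration, not Hacker News' actual page structure. It shows why page-wide `html_nodes()` calls produce vectors of different lengths, and how selecting one node per item with `html_node()` would instead leave an `NA` where the score is missing.

```r
library(rvest)

# Simplified stand-in markup (an assumption, not Hacker News' real HTML):
# the second item mimics a job ad that has no score element.
page <- read_html('
  <div class="item"><a class="storylink">Story A</a><span class="score">10 points</span></div>
  <div class="item"><a class="storylink">Job ad</a></div>
  <div class="item"><a class="storylink">Story B</a><span class="score">25 points</span></div>
')

# Page-wide selection silently skips the missing score, so the lengths disagree.
n_titles <- length(html_nodes(page, 'a.storylink'))  # 3
n_scores <- length(html_nodes(page, 'span.score'))   # 2

# Selecting one node per item keeps the vectors aligned;
# an item without a score yields NA instead of being dropped.
items <- html_nodes(page, 'div.item')
df_aligned <- data.frame(
  title = items %>% html_node('a.storylink') %>% html_text(),
  score = items %>% html_node('span.score') %>% html_text(),
  stringsAsFactors = FALSE
)
```

Here `df_aligned` has three rows, and the score for the job ad row is `NA`. Whether this approach carries over to the live page depends on how the real markup groups each story's fields, which is not checked here.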