A short-n-sweet Node.js-based web scraper.

This is Part One of an ongoing series of articles describing a project that tries to shed light on one of the main problems of our contemporary society: the prices of used cars. OK, I am joking a little, but the topic remains hot and interesting to this day, and new and novel approaches, which often make it into scientific papers, appear every year.

The first part of the project is the most tedious: gathering the data. For this purpose I selected a major Serbian website specialized in selling used cars. It has a pretty good classification system (although not perfect, as we’ll see in the next posts), the site architecture is pretty scraper-friendly (some might say old school), and the sheer number of cars is a good indicator of its popularity. I wouldn’t go as far as to call it the most popular car sales site, but it is definitely in the top 2 or 3 that cover 99% of the market.

Scraping

Like many of us, I’ve done my share of web scraping. The inglorious but often necessary grunt work that has to be done in the most diverse situations or (project) phases. As Ryan Mitchell puts it with enthusiasm in her excellent book Web Scraping with Python: Collecting Data from the Modern Web:

To those who have not developed the skill, computer programming can seem like a kind of magic. If programming is magic, web scraping is wizardry: the application of magic for particularly impressive and useful—yet surprisingly effortless—feats. In my years as a software engineer, I’ve found that few programming practices capture the excitement of both programmers and laymen alike quite like web scraping. The ability to write a simple bot that collects data and streams it down a terminal or stores it in a database, while not difficult, never fails to provide a certain thrill and sense of possibility, no matter how many times you might have done it before.

This is not the kind of website where I go on about how scraping is cool, how it should be done and why you should try it. I just write about how I did it and why I did it the way I did.

Other interesting and useful books about scraping include Website Scraping with Python by Hajba, and [Practical Web Scraping for Data Science](https://www.amazon.com/Practical-Web-Scraping-Data-Science-ebook/dp/B07CH3CH51/) by Broucke.

The Data

The data is your run-of-the-mill online ads from a major Serbian car website. I am not going to say which one it was, and I will try to remove every instance of the original site’s URLs, but if you are really determined… you could easily identify the target. An excellent excuse to use the best data science meme out there. Forget the fancy statistician, mathematician, PhD-ician data science definitions. Do you know what a data scientist really is?

[image: the data scientist meme]

No, really: working with data, whether you call yourself a scientist, an analyst, a hacker or an enthusiast, really requires a lot of discipline and the willingness to go through boring, repetitive tasks.

The goal

The goal is to scrape the data and initially put it in a CSV file. After a while, I decided to store it directly in a MongoDB instance on Compass in order to be able to run the scraper periodically.

[image: project outline]

The libraries that I used are, more or less, the following:

const request = require("request-promise");
const mongoose = require("mongoose");
const cheerioAdv = require("cheerio-advanced-selectors");
// cheerio is declared with let so that it can be re-bound to the wrapped version
let cheerio = require("cheerio");
cheerio = cheerioAdv.wrap(cheerio);

Request-promise is a promise-based Node.js package that lets us fetch URLs in a systematic and compact way. Cheerio provides a set of jQuery-like selectors useful for identifying the parts of the web page that we want to extract, while cheerio-advanced-selectors adds support for CSS pseudo-selectors like :first, :last etc. Let’s say that this setup covers 90% of the cases you may encounter in your scraping endeavors.
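Just to make the wrapping step concrete, here is a tiny, hedged sanity check; the HTML snippet is made up, but the :first / :eq() pseudo-selectors are exactly what the scraper below relies on:

// Made-up HTML, just to show what the wrapped cheerio buys us:
// jQuery-style :first / :eq() pseudo-selectors work in plain selector strings.
const $demo = cheerio.load("<ul><li>one</li><li>two</li><li>three</li></ul>");
console.log($demo("li:first").text()); // "one"
console.log($demo("li:eq(2)").text()); // "three"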

Connecting to MongoDB

Nothing fancy here: a standard Mongo URI that I keep hidden in a config file, plus a couple of options.
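Something along these lines for loading the URI (the ./config/keys path and the shape of the export are my assumption; the post only says the URI is hidden in a config file):

// Hypothetical config module, e.g. config/keys.js:
//   module.exports = { mongoURI: "mongodb+srv://<user>:<password>@..." };
// The path and the property name are assumptions, not the real project's.
const db = require("./config/keys").mongoURI;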

mongoose
  .connect(db, {
    useNewUrlParser: true,
    useUnifiedTopology: true
  })
  .then(() => console.log("MongoDB Connected"))
  .catch(err => console.log(err));

Pausing the script

We do not want to hit the server too hard. It’s a lesson that everybody teaches, but I wasn’t able to learn it until I got the entire office banned from a pretty useful business-listing site. Here I use a pretty simple and rudimentary approach: I make a sleep function using plain old setTimeout and just wrap it in a JavaScript promise:

async function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

The main function - getting the details

The main scraping function is heavy and doesn’t generalize well, I’m afraid. It takes two arguments: the URL of the page to scrape (extract details from) and a timeout in milliseconds. For particularly pesky websites, you could make this function wait a random period of time by adding a bias term, e.g. wait between 2.3 and 4.7 seconds.
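A minimal sketch of what such a randomized wait could look like (randomDelay is a helper I am inventing here, and the 2300-4700 ms range is only the illustrative figure from above):

// Hypothetical helper: a fixed lower bound plus random jitter.
function randomDelay(minMs = 2300, maxMs = 4700) {
  return minMs + Math.floor(Math.random() * (maxMs - minMs));
}

// usage inside get_detail: await sleep(randomDelay()); instead of await sleep(time_out);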

async function get_detail(url, time_out) {

//  first wait for the sleeping promise
  await sleep(time_out);

// get the actual page and load it into cheerio
  const html = await request.get(url);
  const $ = await cheerio.load(html);

// initialize an empty data object
  let carData = {};

// get the pieces - the boring trial and error part
  let brand = $('span[property="name"]:eq(2)').text();
  let make = $('span[property="name"]:eq(3)').text();  
  let price = $("span.priceReal:eq(1)").text();

// some checks
  if (price.includes("EUR")) {
    price = parseInt(price.split(" ")[0].replace(".", ""));
  } else {
    price = null;
  }

// an artificial text variable to "attach" some grouped data elements to
  let frontPanel = "";
  let dataItems = "";

  let basic_data = [];
  let basic_data_elements = $("ul.basicSingleData:eq(0)>li>span");

// loop in order to populate:
  basic_data_elements.each((index, element) => {
    let item = $(element).text();
    frontPanel += item;
// I use | as a delimiter
    frontPanel += "|";

// basically check for all the items and send them to appropriate "data drawers"
    if (item.includes("godište")) {
      carData.year = parseInt(item.split(".")[0]);
    }

    if (item.includes("cm3")) {
      carData.cm3 = parseInt(item.split(".")[0]);
    }

    if (item.includes("Dizel")) {
      carData.fuel = "diesel";
    }

    if (item.includes("Benzin")) {
      carData.fuel = "petrol";
    }

    if (item.includes("TNG")) {
      carData.fuel = "LPG";
    }

    if (item.includes("kW")) {
      carData.kW = parseInt(item.split(".")[0]);
    }

// push the data into the data structure - array

    basic_data.push($(element).text());
  });

// other data structure - same principles
  const adv_data = [];
  const adv_data_elements = $("div.singleBox.singleBoxPanel").find("li");
  adv_data_elements.each((index, element) => {

    let item = $(element).text(); 
    dataItems += item;
    dataItems += "|";

    if (item.includes("Menjač")) {
      carData.gearbox = item.split(" ")[1];
    }

    if (item.includes("Nije registrovan")) {
      carData.registered = false;
    } else {
      carData.registered = true;
    }

    if (item.includes("Prešao kilometara")) {
      carData.km = parseInt(item.split(" ")[2]);
    }

    if (item.includes("Broj vrata")) {
      carData.doors = item.split(" ")[2];
    }

    if (item.includes("Snaga")) {
      carData.kW = parseInt(item.split("(")[1].slice(0, -4));
    }

// push into array
    adv_data.push($(element).text());
  });

// attach the arrays - data containers to the main object carData
  carData.frontPanel = frontPanel;
  carData.dataItems = dataItems;
  // push everything in an object for further processing
  carData.brand = brand;
  carData.make = make;
  carData.price = price;
  carData.features = adv_data;

  // add the date and the URL
  carData.timeParsed = new Date();
  carData.URL = url;

// return the data object
  return carData;
}

It is long, and if your data is complex and rich it will get longer. I probably should have split it up and given it some more structure, but that remains for some future project, I guess.

Inserting into MongoDB

The second function is much lighter. It takes the same parameters (URL and timeout), calls the previously defined get_detail function, awaits the result, and then checks if the URL is already in the database. If not, it simply inserts it.
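The Car model itself never appears in this post, so here is a minimal Mongoose schema that would fit the fields used below; every field name and type is inferred from the carData object, not taken from the real project:

// Hypothetical sketch of the Car model (e.g. models/Car.js).
// The real schema is not shown in the post; fields are inferred from carData.
const mongoose = require("mongoose");

const CarSchema = new mongoose.Schema({
  brand: String,
  make: String,
  year: Number,
  price: Number,
  km: Number,
  gearbox: String,
  doors: String,
  kW: Number,
  cm3: Number,
  url: String,
  fuel: String,
  registered: Boolean,
  frontPanel: String,
  dataItems: String,
  features: [String]
});

module.exports = mongoose.model("Car", CarSchema);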

function insertLink(url, time_out) {
// calls the get_detail and awaits the result
  get_detail(url, time_out)
    .then(carData => {
      // create new Mongoose instance of the Car model with the data
      const newCar = new Car({
        brand: carData.brand,
        make: carData.make,
        year: carData.year,
        price: carData.price,
        km: carData.km,
        gearbox: carData.gearbox,
        doors: carData.doors,
        kW: carData.kW,
        cm3: carData.cm3,
        url: carData.URL,
        fuel: carData.fuel,
        registered: carData.registered,
        frontPanel: carData.frontPanel,
        dataItems: carData.dataItems,
        features: carData.features
      });
      // try to find the url in the database:
      Car.findOne({ url: url }).then(data => {
        if (data) {
        // if exists, abort, skip
          console.log("Already in... SKIPPING");
        } else {
        // if not, save it
          // save, and only log if something actually went wrong
          newCar.save(err => {
            if (err) console.log(`Error: ${err}, SKIPPING...`);
          });
        }
      });
    })
    .catch(err => console.log(err));
}

That’s it. The last part of the script concerns the collection of URLs to parse and process.

Getting the links

We start on the last, most current page and work our way back, continuously following the site’s pagination, which in this case fortunately exists. First we need a function to gather all the links from one page, any page. The function takes two arguments: the URL of the page to get the links from, and a timeout in ms.

async function getLinks(url, time_out) {
  console.log("started...");
  await sleep(time_out);

// We start off with an empty array
// NOTE: base_urls is a bad variable name
// these are actually the urls to be processed from the page
  let base_urls = [];

// standard request / cheerio dance
  let html = await request.get(url);
  let $ = await cheerio.load(html);

// get the links - they have a class of addTitle
  let links = $("a.addTitle");
  links.each((index, element) => {
    // take the url attribute of the links, jQuery-style ftw!
    let url = $(element).attr("href");
    url = "WEBSITE URL" + url;
    base_urls.push(url);
  });

// return the array of links
  return base_urls;
}

So far so good… One last bit and we’re done.

Triggering the scraper

This particular website has a wonderful numerical page structure:

  • www.site.com/page/1
  • www.site.com/page/2
  • …

This enabled me to simplify cycling through all the pages I want, based on the number of adverts I want to get. Do not forget to use the ads-per-page parameter if available and maximize it, in order to have the same number of ads distributed over fewer pages. In general, you might have to locate the next-page link on the first page, construct the link, and follow it in order to get the next page, and so on.
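If the numeric pattern were not there, a hedged sketch of that follow-the-next-link approach could look like this (the a.next selector and the URL prefix are placeholders, not the real site's markup):

// Hypothetical alternative: discover the next page by following its link.
// "a.next" is an invented selector; "WEBSITE URL" is the same redacted prefix as above.
async function getNextPageUrl(currentUrl) {
  const html = await request.get(currentUrl);
  const $ = cheerio.load(html);
  const next = $("a.next").attr("href");
  return next ? "WEBSITE URL" + next : null;
}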

The last function takes advantage of this fact:

// this is fake, but you get the idea
const baseUrl = "www.mysite.com/per_page/6/results/";

// scrapeAll means fire up the whole procedure
// with a plain old for loop
// the function takes a single argument: the number of pages to scrape
// multiplied by 60 ads per page, you get the idea

async function scrapeAll(pages) {
  for (let i = 1; i <= pages; i++) {

// sleep between pages for safety
    await sleep(30000);
    const url = baseUrl + i;

// get all the links on the current page
    getLinks(url)
      .then(data => {
        data.forEach(el => {
// iterate through the links and call insertLink on each and everyone
          insertLink(el, 2000);
        });
      })
      .catch(err => console.log(err));
  }
}
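For completeness, the whole pipeline is then kicked off with a single call; the post never shows the actual invocation, so the page count here is an arbitrary example:

// Hypothetical entry point: ~100 pages at ~60 ads per page, so roughly 6000 ads per run.
scrapeAll(100);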

That would be it. It isn’t pretty, it isn’t perfect, but it got the job done pretty well. Of course, there were some errors, some timeout tweaking, some missed selectors, but the main idea is valid and the procedure can be applied to a myriad of websites with simpler specs. What do I mean by that? Well, sites that:

  • do not require logging in
  • do not have a captcha
  • do not make heavy use of JavaScript in order to render the content (this could be a blessing in disguise, but that’s another topic)