Sunday, August 3, 2014

Data Mining a Website with C#

For me, the first step in data mining is being able to get ALL the search results from a database with a single search. Once you can get all the information you want, you can start planning. I'm going to use http://www.suttonkersh.co.uk as my example for this article.

I already found a search that will give me 945 listings: http://www.suttonkersh.co.uk/residential_sales_gallery.php?priceMin=0&priceMax=9999999&bedroomsMin=0&bedroomsMax=99999&postcode=&sf=sales&x=34&y=14

Now for some planning, which is pretty easy once you have done it a few times. I am going to try something I have never used before for this: draw.io. Let's do a little activity diagram to see what we have going on here. (Please excuse the horrible UML skills.)

[Image: UML Diagram]

From that search page (the long link above) you can start using regular expressions to extract data. I will select all the listings on the page, right-click and select View Selection Source (I'm using Firefox). Once you have copied the source code, go to Rubular and start building those expressions.

Here is a regular expression to get the links to the details of each property: http://rubular.com/r/qiSXgY4Ala. We need not only the detail links but also the last page number, so we can keep moving forward through the search results. Here is the expression for the last page number on the initial search page: http://rubular.com/r/ualZtLASsI. With the last page number you can set up a loop to build the URL of every page in your search results. Here is an example that shoves those page URLs into an ArrayList. (I shortened the URL with periods.)

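    // Requires: using System.Collections;
    int page_num = 1;
    string url = string.Empty;
    ArrayList search_pages = new ArrayList();
    // 80 is hard-coded here, so pages 1 through 79 are added.
    while (page_num < 80)
    {
        // Build the URL for this results page (shortened with periods).
        url = "http://www.suttonkersh.co.uk/...e=" +
            page_num.ToString() + "&priceMin=0&priceMax=9999999&...ob=DESC";
        search_pages.Add(url);
        page_num++;
    }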
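The fixed loop bound could instead come from the last-page expression, and the same pass can collect the detail links. Here is a minimal sketch of that idea. Everything in it is an assumption for illustration: the `SearchPager` class, both patterns, the `property_details.php` link shape, and the "Last" pager link are hypothetical, and the real expressions are the ones saved on Rubular.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SearchPager
{
    // Hypothetical patterns; swap in the expressions built on Rubular.
    const string DetailLinkPattern = @"href=""(?<url>property_details\.php\?id=\d+)""";
    const string LastPagePattern = @"page=(?<num>\d+)[^""]*"">Last<";

    // Collects every distinct detail link found in one page of search results.
    public static List<string> ExtractDetailLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in Regex.Matches(html, DetailLinkPattern))
        {
            string url = m.Groups["url"].Value;
            if (!links.Contains(url))
                links.Add(url);
        }
        return links;
    }

    // Returns the last page number, or 1 when no pager link is found.
    public static int ExtractLastPage(string html)
    {
        Match m = Regex.Match(html, LastPagePattern);
        return m.Success ? int.Parse(m.Groups["num"].Value) : 1;
    }

    // Builds one URL per results page, from 1 through lastPage inclusive.
    // urlTemplate must contain {0} where the page number goes.
    public static List<string> BuildPageUrls(string urlTemplate, int lastPage)
    {
        var pages = new List<string>();
        for (int page = 1; page <= lastPage; page++)
            pages.Add(string.Format(urlTemplate, page));
        return pages;
    }
}
```

For example, `SearchPager.BuildPageUrls(template, SearchPager.ExtractLastPage(firstPageHtml))` would replace the hard-coded 80, where `template` is the search URL with `{0}` in place of the page number and `firstPageHtml` is the source of the initial search page.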

Details Page Expressions

Images: http://rubular.com/r/hnxQ6ITuPP
Address: http://rubular.com/r/czG9doi9oh
Bedrooms: http://rubular.com/r/eTdEvVFEG1
Property Type: http://rubular.com/r/i9IisPWU2a
Price: http://rubular.com/r/83jByplsZE
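To show how these five expressions might fit together, here is a minimal sketch that pulls each field out of a details page and collects them in one object. The patterns, the `Listing` and `DetailsParser` names, and the assumed HTML shape are all hypothetical stand-ins; the real expressions are the ones saved at the Rubular links above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Simple container for the fields scraped from one details page.
public sealed class Listing
{
    public string Address { get; set; } = "";
    public int Bedrooms { get; set; }
    public string PropertyType { get; set; } = "";
    public decimal Price { get; set; }
    public List<string> Images { get; } = new List<string>();
}

public static class DetailsParser
{
    // Returns the named group "v" from the first match, or "" when nothing matches.
    static string First(string html, string pattern)
    {
        Match m = Regex.Match(html, pattern, RegexOptions.IgnoreCase);
        return m.Success ? m.Groups["v"].Value.Trim() : "";
    }

    // Every pattern below is a hypothetical placeholder, not the author's expression.
    public static Listing Parse(string html)
    {
        var listing = new Listing();
        listing.Address = First(html, @"<h1[^>]*>(?<v>[^<]+)</h1>");
        listing.PropertyType = First(html, @"Type:\s*(?<v>[^<]+)<");

        // Missing or malformed numbers fall back to 0 instead of throwing.
        int bedrooms;
        int.TryParse(First(html, @"(?<v>\d+)\s*bedrooms?"), out bedrooms);
        listing.Bedrooms = bedrooms;

        decimal price;
        decimal.TryParse(First(html, @"&pound;(?<v>[\d,]+)"),
            NumberStyles.AllowThousands, CultureInfo.InvariantCulture, out price);
        listing.Price = price;

        // A page can carry several photos, so collect every match.
        foreach (Match m in Regex.Matches(html,
            @"<img[^>]+src=""(?<v>[^""]+\.jpg)""", RegexOptions.IgnoreCase))
        {
            listing.Images.Add(m.Groups["v"].Value);
        }
        return listing;
    }
}
```

Keeping one pattern per field mirrors the one-expression-per-field list above, so each pattern can be tested on Rubular and replaced on its own when the site's markup changes.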

Here is a sample console application written in C# that will get a single listing by the listing ID.
