Jul 20 2008

Mine information from a website

Posted by neio

Neio, July 20, 2008, in Huaqiao University

NOTICE: THIS ARTICLE ONLY CONCERNS THE WAY OF MINING INFORMATION FROM A WEBSITE. IF YOU ARE INTERESTED IN THE DATA OF IMDB, PLEASE READ THE COPYRIGHT TERMS OF IMDB.COM FIRST, AND DO NOT DO ANYTHING AGAINST THE LICENCE UNLESS YOU ARE PERMITTED TO DO SO.

IMDB.com may be the largest movie database in the world. We can get abundant information from this website, and if you are permitted, you can use the data for research. As far as I know, it holds thousands of photos of stars, and some professors have used these data for research on data mining or AI. Anyone can get information from IMDB in various ways besides the HTML website.

Here I discuss how to mine some data from IMDB (I just take IMDB as an example). If you care more about the IMDB data themselves than about the way of mining them, you can visit http://www.imdb.com/interfaces instead.

Well, to better mine photos from IMDB, we should understand how to access photo data first.

  1. Let’s visit the URL http://www.imdb.com/Sections/Gallery/Names/X (X is the prefix of the name: A-Z, etc.); there we can see the star list.
  2. If we click one of the names in this list, it jumps to http://www.imdb.com/name/nm000000/mediaindex (000000 is the id of the star), where we can see all of his/her photos if there are fewer than 50 of them. Further photos are listed at URLs like http://www.imdb.com/name/nm000000/mediaindex?page=2.
  3. If we visit http://www.imdb.com/name/nm000000/, we can see more information about this star, such as birthday, awards and alternative names.
  4. If we click one of the photos on http://www.imdb.com/name/nm000000/mediaindex, the URL redirects to http://www.imdb.com/media/rm111111111/nm000000, and the photo is there. In this URL, 111111111 is the id of the picture and 000000 is the id of the star. We can also access http://www.imdb.com/media/rm111111111/ to get the same photo, but some features will be lost.
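The URL scheme in the four steps above can be written down as a few small helpers. This is only a sketch in Python (not the C# program mentioned later); the function names are made up for illustration, and the page size of 50 is an assumption taken from the description in step 2.

```python
import math

BASE = "http://www.imdb.com"
PHOTOS_PER_PAGE = 50  # assumption: taken from the description in step 2


def gallery_index_url(prefix):
    """Star list for all names starting with `prefix` (for example "A")."""
    return f"{BASE}/Sections/Gallery/Names/{prefix}"


def star_url(star_id):
    """Overview page of a star (birthday, awards, alternative names)."""
    return f"{BASE}/name/nm{star_id}/"


def mediaindex_urls(star_id, photo_count):
    """One photo list URL per page needed to cover `photo_count` photos."""
    pages = max(1, math.ceil(photo_count / PHOTOS_PER_PAGE))
    return [
        f"{BASE}/name/nm{star_id}/mediaindex?page={page}"
        for page in range(1, pages + 1)
    ]


def photo_page_url(photo_id, star_id=None):
    """Page that shows one photo; the star id part is optional."""
    url = f"{BASE}/media/rm{photo_id}/"
    if star_id is not None:
        url += f"nm{star_id}"
    return url
```

Ids are kept as strings on purpose, so that leading zeros such as in 0000002 are not lost.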

After understanding how to get the photos, we can analyze how to automatically download all the photos we want (NOTICE: you should do this under the licence of IMDB).

  • First, download /Sections/Gallery/Names/{0}, where {0} can be replaced by one of the following:

{"1","3","4", "5","A","B","C","D","E","F","G","H","I","J",
 "K","L","M","N","O","P","Q", "R","S","T","U","V","W","X",
 "Y","Z","Aacute","Aring","Eacute","Ouml","Oslash","Uuml","THORN"}

In the downloaded HTML, you can see something like this:
<LI><A HREF='/name/nm0045198/photogallery'>Baca, Shawna</A>
<LI><A HREF='/name/nm0000002/photogallery'>Bacall, Lauren</A>
<LI><A HREF='/name/nm0045209/photogallery'>Bacall, Michael</A>
<LI><A HREF='/name/nm1832162/photogallery'>Bacalski, Roberto</A>
<LI><A HREF='/name/nm0045214/photogallery'>Bacalso, Joanna</A>
<LI><A HREF='/name/nm0045219/photogallery'>Bacarella, Mike</A>

 

Then we can write a regular expression to get the ids and the names of the stars. The regular expression can be written like this:

@"HREF='(/name/nm(\d+)/photogallery)'>(.+)</A>"
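As an illustration, the same expression can be tried out in Python on two lines of the excerpt above. This is a minimal sketch: the function name parse_star_list is hypothetical, and the non-greedy name group and the IGNORECASE flag are small additions of mine, not part of the original expression.

```python
import re

# Python version of the expression above. (.+?) is non-greedy so that several
# links on one line would still be split correctly; IGNORECASE is an added
# safety net in case the tags are written in lower case.
STAR_LINK = re.compile(
    r"HREF='(/name/nm(\d+)/photogallery)'>(.+?)</A>", re.IGNORECASE
)


def parse_star_list(html):
    """Return (id, name) pairs for every star link found in `html`."""
    return [(m.group(2), m.group(3)) for m in STAR_LINK.finditer(html)]


sample = """<LI><A HREF='/name/nm0045198/photogallery'>Baca, Shawna</A>
<LI><A HREF='/name/nm0000002/photogallery'>Bacall, Lauren</A>"""

print(parse_star_list(sample))
# [('0045198', 'Baca, Shawna'), ('0000002', 'Bacall, Lauren')]
```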

  • Second, after you get all the ids and names, you can access any star directly. Let’s download http://www.imdb.com/name/nm{0} (replace {0} with the id of the star you want to download).
In this HTML, we can see some information about the star, like:
 
<h3>Overview</h3>
<div class="info">
<h5>Date of Birth:</h5>
<a href="/OnThisDay?day=27&month=May">27 May</a>
<a href="/BornInYear?1975">1975</a>,
<a href="/BornWhere?Atlanta,%20Georgia,%20USA">
Atlanta, Georgia, USA</a>
<a class="tn15more inline" href="bio">more</a>
</div>
<div class="info">
<h5>Trivia:</h5>
Part of the band <a href="/name/nm1642036/">Outkast</a>.
<a class="tn15more inline" href="bio">more</a>
</div>

Then we can use a regular expression like:

"href=\"/OnThisDay\\?day=(\\d+)&month=(.+)\">(.+) href=\"/BornInYear\\?(\\d+)\">\\d+</a>"

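to get his/her birthday, and other regular expressions to get other information. Note that when the two links sit on separate lines, as in the excerpt above, the expression only matches if the dot is allowed to match newlines (RegexOptions.Singleline in C#).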
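A possible Python variant of the birthday expression is sketched below. It is not the exact expression from above: I tightened the month group to \w+ and made the gap between the two links non-greedy, and re.DOTALL lets that gap span a line break. The name parse_birthday is hypothetical.

```python
import re

# The day and month come from the OnThisDay link, the year from the
# BornInYear link. DOTALL allows ".*?" to cross the newline between them.
BIRTHDAY = re.compile(
    r'href="/OnThisDay\?day=(\d+)&month=(\w+)">.*?'
    r'href="/BornInYear\?(\d{4})">',
    re.DOTALL,
)


def parse_birthday(html):
    """Return (day, month name, year), or None if no birthday is found."""
    match = BIRTHDAY.search(html)
    if match is None:
        return None
    day, month, year = match.groups()
    return int(day), month, int(year)


sample = '''<h5>Date of Birth:</h5>
<a href="/OnThisDay?day=27&month=May">27 May</a>
<a href="/BornInYear?1975">1975</a>,'''

print(parse_birthday(sample))  # (27, 'May', 1975)
```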

  • Third, let’s get the photo list from IMDB for a given id by accessing /name/nm{0}/mediaindex?page={1} (replace {0} with the id of the star and {1} with the page number). In the fetched pages, we can see something like this:
<div class="thumb_list">
<a title="" href="/media/rm2901448448/nm0071275"><img alt="" height="100" width="100"  src="http://ia.media-imdb.com/images/M/MV5BMjA4MjgzMTQyOV5BMl5BanBnXkFtZTcwNTE0OTA3MQ@@._V1._CR0,0,323,323_SS100_.jpg" /></a>
<a title="" href="/media/rm62166528/nm0071275"><img alt="" height="100" width="100"  src="http://ia.media-imdb.com/images/M/MV5BMTc5NjEwNjE3MV5BMl5BanBnXkFtZTcwNDczMDE2MQ@@._V1._CR56,0,287,287_SS100_.jpg" /></a>
<a title="" href="/media/rm45389312/nm0071275"><img alt="" height="100" width="100"  src="http://ia.media-imdb.com/images/M/MV5BMTI0MTc0MDQyNF5BMl5BanBnXkFtZTcwNTczMDE2MQ@@._V1._CR0,0,267,267_SS100_.jpg" /></a>

And we can use the regular expression below to get the URLs of all the photos.
"href=\"(/media/rm(\\d+)/nm(\\d+))\">"
Do not download the images we see above, for they are only thumbnails.
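For example, the photo links of one list page could be collected like this in Python. It is a sketch only: parse_photo_links is a made-up name, and skipping duplicates is my own addition in case a page links the same photo twice.

```python
import re

# Python version of the expression above: group 1 is the relative URL,
# group 2 the photo id and group 3 the star id.
PHOTO_LINK = re.compile(r'href="(/media/rm(\d+)/nm(\d+))">')


def parse_photo_links(html):
    """Return (path, photo_id, star_id) tuples in page order, without repeats."""
    seen = set()
    links = []
    for match in PHOTO_LINK.finditer(html):
        if match.group(1) not in seen:
            seen.add(match.group(1))
            links.append(match.groups())
    return links


sample = (
    '<a title="" href="/media/rm2901448448/nm0071275"><img alt="" /></a>\n'
    '<a title="" href="/media/rm62166528/nm0071275"><img alt="" /></a>\n'
)

for path, photo_id, star_id in parse_photo_links(sample):
    print(path, photo_id, star_id)
```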
  • Fourth, we can download the big photos by accessing the URLs fetched above. After we download the HTML of those URLs, we can see the HTML code that refers to the photo image, like below:
<center><table id="principal">
<tr><td valign="middle" align="center"><img oncontextmenu="return false;" galleryimg="no" onmousedown="return false;" onmousemove="return false;" src="http://ia.media-imdb.com/images/M/MV5BNDUwODE2OTQxMF5BMl5BanBnXkFtZTYwMTE0MzM3._V1._SX268_SY400_.jpg"></td></tr>
</table></center>
<div style="margin-bottom:0.25em;">
<b>Title:</b> <a href="/title/tt0417225/">Idlewild</a><br />
<b>Names:</b> <a href="/name/nm0071275/">André Benjamin</a>, <a href="/name/nm1745736/">Paula Patton</a><br />
</div>
<span class="less-emphasis">
<b>Photo 23 of 70:</b>
Photo date: 25 August 2006
</span>
<hr />

Then we can get the photo image URL and the photo information with the regular expressions:

"galleryimg=\"no\" onmousedown=\"return false;\" onmousemove=\"return false;\"\\s+src=\"(http://ia.media-imdb.com/images/M/(.+(\\.jpg)))\"></td>"

and

"Photo date: ((\\d{1,2}) (\\w{3,12}) (\\d{4}))"
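The two expressions can be combined into one small parser for a photo page. The Python sketch below is a loosened variant rather than a literal translation: instead of spelling out every attribute between galleryimg and src, it accepts anything that stays inside the img tag. The name parse_photo_page and the dictionary keys are hypothetical.

```python
import re

# Find the src of the <img> that carries galleryimg="no"; [^>]*? keeps the
# match inside that single tag, whatever attributes sit in between.
IMAGE_URL = re.compile(
    r'galleryimg="no"[^>]*?\ssrc="(http://ia\.media-imdb\.com/images/M/[^"]+\.jpg)"'
)
PHOTO_DATE = re.compile(r"Photo date: (\d{1,2}) (\w{3,12}) (\d{4})")


def parse_photo_page(html):
    """Return the full-size image URL and the photo date (None if missing)."""
    image = IMAGE_URL.search(html)
    date = PHOTO_DATE.search(html)
    return {
        "image_url": image.group(1) if image else None,
        "photo_date": date.groups() if date else None,
    }


sample = (
    '<img oncontextmenu="return false;" galleryimg="no" '
    'onmousedown="return false;" onmousemove="return false;" '
    'src="http://ia.media-imdb.com/images/M/'
    'MV5BNDUwODE2OTQxMF5BMl5BanBnXkFtZTYwMTE0MzM3._V1._SX268_SY400_.jpg"></td></tr>\n'
    "<b>Photo 23 of 70:</b>\nPhoto date: 25 August 2006\n"
)

print(parse_photo_page(sample))
```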

  • Finally, you can actually download the photos using the URLs you have just fetched.
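This last step might look as follows in Python, using only the standard library. It is a sketch under my own assumptions, not the program described below: the names http_fetch and download_photos, the one-second pause between requests and the rule of skipping files that already exist are all illustrative choices. As said before, run something like this only if IMDB permits it.

```python
import os
import time
import urllib.request
from urllib.parse import urlparse


def http_fetch(url):
    """Download `url` and return the raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def download_photos(image_urls, dest_dir, fetch=http_fetch, delay=1.0):
    """Save each image into `dest_dir` and return the local file paths.

    Files that already exist are not fetched again. `fetch` can be swapped
    for another function, for instance in a test without network access.
    """
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for url in image_urls:
        # The last path segment of the image URL is used as the file name.
        target = os.path.join(dest_dir, os.path.basename(urlparse(url).path))
        if not os.path.exists(target):
            with open(target, "wb") as out:
                out.write(fetch(url))
            time.sleep(delay)  # pause so the server is not flooded with requests
        saved.append(target)
    return saved
```

Passing the fetch function as a parameter keeps the file handling separate from the network access, so the saving logic can be checked offline.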

As discussed above, you can fetch the photos and some information about the stars. What’s more, you can use the same approach to analyze the structure of a website and write regular expressions of your own.

However, I don’t want to discuss the details of writing a program to download them automatically. I have written one as an experiment, in C# with a Windows UI. You know, programming is easy for me.

If you are interested in the IMDB data for scientific purposes and have already gained permission from IMDB, you can contact me and I can send you a copy of my software. Or if you are interested in programming, you can share with me, too.

(Neio, July 20, 2008)