IMDb is undoubtedly the leading source of media information and a top target for web scraping by movie lovers around the world. Unfortunately, IMDb does not provide an API to access its database, so web scraping is our only resort. PHP, one of the most commonly used and powerful web development languages, makes web scraping easy with the power of PCRE (Perl Compatible Regular Expressions).
For my recent project, a Movie Catalog (http://movies.abhinayrathore.com), I needed an IMDb scraper and found one built by Tyler Hall. His version was not robust enough to scrape all kinds of movie pages, so I extended it to support different types of titles. Recently, however, IMDb changed its page template, and most of the old scrapers stopped working, including mine. So I modified my scraper to accommodate the new template and considered it my moral responsibility to contribute it back to the developer community.
This new scraper is robust enough to handle a wide variety of template modifications. Apart from the regular information, it also digs deeper to collect extra media images and release dates.
Last Updated: Oct 18, 2013
Major changes in Feb 20, 2013 version:
- The scraper now uses the combined information page to scrape the data. This page doesn't change very often, and it gives us a complete list of the individual departments.
- Added a few more entities: producers, musicians, cinematographers, editors etc. Removed metascore information. Removed small poster URL.
- You can now pass a second boolean parameter to the getMovieInfo() and getMovieInfoById() functions to disable the extra information. It is set to true by default, which may slow down the scraping. If you don't need all the extra info like Storyline, Release Dates, Recommendations or Media Images, just pass false as the second parameter to these methods. Example: $movieArray = $imdb->getMovieInfo("The Godfather", false);
- Information for individuals in the list of directors, cast, writers etc. is now in an associative array with key being the IMDb id of the individual.
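To illustrate the last point, here is a rough sketch that loops over a hand-written sample array shaped the way the list above describes, with people keyed by IMDb id. The key names ('title', 'directors', 'cast'), the sample data and the helper formatPeople() are assumptions for illustration only; check the actual output of getMovieInfo() for the real keys.

```php
<?php
// Hypothetical sample shaped like the scraper's result: people are stored in
// associative arrays keyed by their IMDb id. The key names are assumptions.
$movieArray = array(
    'title'     => 'The Godfather',
    'directors' => array('nm0000338' => 'Francis Ford Coppola'),
    'cast'      => array(
        'nm0000008' => 'Marlon Brando',
        'nm0000199' => 'Al Pacino',
    ),
);

// Hypothetical helper: build "Name (id)" labels from an id => name array.
function formatPeople(array $people)
{
    $labels = array();
    foreach ($people as $imdbId => $name) {
        $labels[] = $name . ' (' . $imdbId . ')';
    }
    return $labels;
}

echo implode(', ', formatPeople($movieArray['cast'])), "\n";
```

Because the id is the array key, you can build a link to a person's IMDb page, or look a person up directly, without a second search.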
Here is a list of all the attributes it scrapes from the IMDb page:
How to use this PHP Scraper?
Include the class file in your PHP page
Instantiate the class and get the results in an array:
$imdb = new Imdb();
$movieArray = $imdb->getMovieInfo("The Godfather");
You can try this scraper on my lab page: http://lab.abhinayrathore.com/imdb/
To download the PHP Source Code directly use this link: http://lab.abhinayrathore.com/imdb/imdb_php.htm
Fork it on GitHub: https://github.com/abhinayrathore/PHP-IMDb-Scraper
Example usage: http://lab.abhinayrathore.com/imdb/usage.htm
Proxy script for downloading or displaying Media images on your website: http://lab.abhinayrathore.com/imdb/imdbImage.txt
To implement your own IMDb Web Service API to return data in XML, JSON or JSONP format, use this script along with the API: http://lab.abhinayrathore.com/imdb/imdbWebService.htm
To implement IMDb.com's search suggestions on your website, please follow this post: http://web3o.blogspot.com/2011/10/imdb-search-suggestions-with-jquery.html
If you find any part of this scraper broken or incorrect, please drop a comment here and I’ll try to fix it as soon as possible.
IMDb has an anti-leeching policy in place for media images, so you may not be able to use some image URLs directly on your website. As a workaround, you can use a PHP proxy to display or download those images. I’ve written a small proxy script to grab the images: http://lab.abhinayrathore.com/imdb/imdbImage.txt. To use this script, just pass the image URL as a request parameter:
<img src="imdbImage.php?url=<?= urlencode($url) ?>" />
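The linked imdbImage.txt is the actual script. What follows is only a minimal sketch of the same idea, not a copy of it. It assumes the image URL arrives in the url request parameter, as in the tag above, and it only fetches from a whitelist of hosts. The host names and the helper isAllowedImageUrl() are assumptions added for illustration.

```php
<?php
// Hypothetical whitelist; adjust it to the hosts your image URLs really use.
$allowedHosts = array('ia.media-imdb.com', 'm.media-amazon.com');

// Return true only for http(s) URLs whose host is on the whitelist.
function isAllowedImageUrl($url, array $allowedHosts)
{
    if (!is_string($url)) {
        return false;
    }
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }
    $scheme = strtolower($parts['scheme']);
    if ($scheme !== 'http' && $scheme !== 'https') {
        return false;
    }
    return in_array(strtolower($parts['host']), $allowedHosts, true);
}

// This part only runs when the script is requested with ?url=...
if (isset($_GET['url'])) {
    $url = $_GET['url'];
    if (!isAllowedImageUrl($url, $allowedHosts)) {
        header('HTTP/1.1 400 Bad Request');
        exit('Invalid image URL');
    }
    $image = @file_get_contents($url); // requires allow_url_fopen
    if ($image === false) {
        header('HTTP/1.1 502 Bad Gateway');
        exit('Could not fetch image');
    }
    header('Content-Type: image/jpeg'); // assumes the images are JPEGs
    echo $image;
}
```

Restricting the hosts keeps the proxy from being used to fetch arbitrary URLs through your server.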
NOTE: For users outside the USA
IMDb will automatically redirect you to titles listed in the language used for release in your country.
To see films listed under their original titles regardless of your country or region, you will have to modify this script to scrape the titles from http://akas.imdb.com, because http://www.imdb.com will automatically redirect you to your country-specific title page.
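One way to make that change is to rewrite the host of each URL before the page is fetched. The sketch below assumes the scraper builds its request URLs as plain strings. The helper name toAkasUrl() is hypothetical and is not part of the scraper.

```php
<?php
// Hypothetical helper: swap www.imdb.com (or bare imdb.com) for akas.imdb.com.
// URLs on any other host are returned unchanged.
function toAkasUrl($url)
{
    return preg_replace(
        '#^(https?://)(?:www\.)?imdb\.com(?=/|$)#i',
        '$1akas.imdb.com',
        $url
    );
}

echo toAkasUrl('http://www.imdb.com/title/tt0068646/'), "\n";
```

Apply the helper wherever the scraper assembles an IMDb URL, so that both search and title requests go to the same host.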
Happy Scraping :)