As a big movie buff, I have a huge collection of movies and a movie database that I maintain in Microsoft Excel. The only information I keep there is the movie name, year and genre, which is not enough. Picking a movie to watch from the spreadsheet is difficult, and updating it is another painful task, so I decided to migrate all the Excel data to an online database with a lot of extra movie information and an easy-to-use interface.
The first roadblock was pulling information for all the movies from IMDb.com. So I wrote this parser in ASP.NET, which accepts a movie name as its input parameter, searches IMDb.com and returns a list of matching titles. The diagram below shows the program flow for this parser:
This parser is written in ASP.NET because that was the only handy tool I had in front of me. I know a lot of you folks would like to have this for PHP or Perl, but don't worry, this code can be easily converted because the only two important String functions I have used here are Substring and IndexOf, and PHP and Perl have equivalents of both. It took me only one day to write this code and blog entry, so you may find a few bugs and some unstructured code. Feel free to split this parser into separate functions or add more parsing code.
What it can parse:
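To show how far those two functions go, here is a minimal sketch of the "find a marker, cut out what follows" pattern that the parser repeats for every field. The helper name ExtractBetween and the sample HTML fragment are my own assumptions for illustration; they are not part of the original code.

```vbnet
Imports System

Module MarkerParsingSketch
    ' Hypothetical helper (not in the original parser): returns the text found
    ' between startMarker and endMarker, or "" when either marker is missing.
    Function ExtractBetween(ByVal source As String, ByVal startMarker As String, ByVal endMarker As String) As String
        Dim startPos As Integer = source.IndexOf(startMarker)
        If startPos < 0 Then Return ""
        ' Move past the start marker so it is not included in the result
        startPos += startMarker.Length
        Dim endPos As Integer = source.IndexOf(endMarker, startPos)
        If endPos < 0 Then Return ""
        Return source.Substring(startPos, endPos - startPos)
    End Function

    Sub Main()
        ' Made-up fragment shaped like the markup the parser looks for
        Dim html As String = "<h5>Runtime:</h5> 142 min </div>"
        ' Prints: 142 min
        Console.WriteLine(ExtractBetween(html, "</h5>", "</div>").Trim())
    End Sub
End Module
```

Returning an empty string when a marker is missing is a design choice of this sketch; calling Substring with the -1 that IndexOf returns for a missing marker would throw an exception instead.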
- Title ID
- Title
- Year
- Poster URL
- User Rating
- Directors
- Writers
- Release Date
- Genre
- Cast
- Runtime
I will walk you through the whole Parser program file by file. As the parsing is done based on HTML tags, you can figure out the details on your own.
The first file is the Default.aspx file which only contains the search field and submit button to submit the search string.
Default.aspx
<form id="form1" runat="server">
<div>
Search: <asp:TextBox ID="movie" runat="server"></asp:TextBox>
<asp:Button ID="Submit" runat="server" Text="Submit" />
<br /><br /><br />
<asp:Label runat="server" Text="" ID="CONTENT" ></asp:Label>
</div>
</form>
The second file is Default.aspx.vb, which contains the VB code for the Default.aspx page. The main function in this file is the event handler for the submit button. It loads the HTML returned by IMDb and checks whether it is the search results page or a movie page. If it is the search results page, it builds the result lists for Popular Titles, Exact Matches and Partial Matches. The second function, TitleSearch, parses all the titles in one of those sections and builds an HTML unordered list of results. The third function, MovieID, just retrieves the movie's Title ID from the HTML page.
If IMDb finds an exact match and returns the movie page directly, the page redirects to MovieInfo.aspx with the Title ID; otherwise the search results are displayed on the same page, with each result linking to the MovieInfo page.
Default.aspx.vb
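Imports System.IO ' needed for StreamReader; belongs at the very top of the code-behind file
Protected Sub Submit_Click(ByVal sender As Object, ByVal e As EventArgs) Handles Submit.Click
' URL-encode the search text so spaces and characters such as & do not break the query string
Dim searchstring As String = Server.UrlEncode(movie.Text)
Dim imdb_search As String = "http://www.imdb.com/find?s=tt&q=" & searchstring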
Dim inStream As StreamReader
Dim webRequest As Net.WebRequest
Dim webresponse As Net.WebResponse
webRequest = Net.WebRequest.Create(imdb_search)
webresponse = webRequest.GetResponse()
inStream = New StreamReader(webresponse.GetResponseStream())
Dim HTML As String = inStream.ReadToEnd()
Dim temp As String = ""
Dim List As String = ""
' If it is the search listings page
If HTML.IndexOf("<title>IMDb Title Search</title>") > 0 Then
' List of Popular Titles
If HTML.IndexOf("<b>Popular Titles</b>") > 0 Then
List = List & TitleSearch(HTML, "Popular")
End If
' List of Exact Match in Titles
If HTML.IndexOf("<b>Titles (Exact Matches)</b>") > 0 Then
List = List & TitleSearch(HTML, "Exact")
End If
' List of Partial Match in Titles
If HTML.IndexOf("<b>Titles (Partial Matches)</b>") > 0 Then
List = List & TitleSearch(HTML, "Partial")
End If
Content.Text = List
Else
Response.Redirect("MovieInfo.aspx?ID=" & MovieID(HTML))
End If
End Sub
Protected Function TitleSearch(ByVal HTML As String, ByVal Title_Type As String) As String
If Title_Type = "Popular" Then
Title_Type = "Popular Titles"
ElseIf Title_Type = "Exact" Then
Title_Type = "Titles (Exact Matches)"
ElseIf Title_Type = "Partial" Then
Title_Type = "Titles (Partial Matches)"
Else
Title_Type = "Popular Titles"
End If
Dim temp As String = ""
Dim List As String = ""
temp = HTML.Substring(HTML.IndexOf("<b>" & Title_Type & "</b>"))
' get the number of results displayed
Dim i As Integer = CInt(temp.Substring(temp.IndexOf("(Displaying ") + 12, 2))
' get the 'p' html block of Popular Titles
temp = temp.Substring(0, temp.IndexOf("</p>"))
' set a custom delimiter for parsing
temp = temp.Replace("link=/title/", "##")
List = List & "<strong>" & Title_Type & " (" & i & "):</strong><br /><ul>"
' loop through all the results
For x = 1 To i
Dim link As String = ""
Dim image As String = ""
Dim name As String = ""
Dim year As String = ""
' trim everything before the delimiter
temp = temp.Substring(temp.IndexOf("##"))
' get the link id
link = temp.Substring(temp.IndexOf("##") + 2, temp.IndexOf("/") - temp.IndexOf("##") - 2)
temp = temp.Substring(temp.IndexOf(">") + 1)
' get the Image link or Title before the </a> tag
Dim temp1 As String = temp.Substring(0, temp.IndexOf("</a>"))
' Check if poster for the title is displayed or not
If temp1.IndexOf("img src=""http://ia.media") > 0 Then
' if Poster is displayed then get the Title and Year for the movie
image = temp1
temp = temp.Substring(temp.IndexOf("##"))
link = temp.Substring(temp.IndexOf("##") + 2, temp.IndexOf("/") - temp.IndexOf("##") - 2)
temp = temp.Substring(temp.IndexOf(">") + 1)
name = temp.Substring(0, temp.IndexOf("<"))
temp = temp.Substring(temp.IndexOf("</a> ") + 6)
year = temp.Substring(0, temp.IndexOf(")"))
Else
' if Poster is not displayed then temp1 is the Title
name = temp1
temp = temp.Substring(temp.IndexOf("</a> ") + 6)
year = temp.Substring(0, temp.IndexOf(")"))
End If
temp = temp.Substring(temp.IndexOf("</tr>"))
List = List & "<li>" & image & " <a href='MovieInfo.aspx?ID=" & link & "'>" & name & "</a> (" & year & ")</li><br />"
Next x
List = List & "</ul><br />"
TitleSearch = List
End Function
Protected Function MovieID(ByVal HTML As String) As String
Dim temp As String = HTML.Substring(HTML.IndexOf("<link rel=""canonical"""), 1000)
temp = temp.Substring(temp.IndexOf("title/") + 6)
temp = temp.Substring(0, temp.IndexOf("/"))
MovieID = temp
End Function
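The MovieID function above assumes the canonical link tag is present and that at least 1000 characters follow it; otherwise Substring throws. A slightly more defensive variant could look like the sketch below. The name TryGetMovieId, the empty-string fallback and the sample tag with its placeholder ID are assumptions made for this example.

```vbnet
Imports System

Module MovieIdSketch
    ' Hypothetical variant of MovieID: returns "" instead of throwing when the
    ' canonical link or the title segment cannot be found.
    Function TryGetMovieId(ByVal html As String) As String
        Dim linkPos As Integer = html.IndexOf("<link rel=""canonical""")
        If linkPos < 0 Then Return ""
        Dim marker As String = "title/"
        ' Search only from the canonical link onwards
        Dim idStart As Integer = html.IndexOf(marker, linkPos)
        If idStart < 0 Then Return ""
        idStart += marker.Length
        ' The ID ends at the next slash
        Dim idEnd As Integer = html.IndexOf("/", idStart)
        If idEnd < 0 Then Return ""
        Return html.Substring(idStart, idEnd - idStart)
    End Function

    Sub Main()
        ' Made-up sample tag; tt0000000 is a placeholder ID
        Dim sample As String = "<head><link rel=""canonical"" href=""http://www.imdb.com/title/tt0000000/"" /></head>"
        ' Prints: tt0000000
        Console.WriteLine(TryGetMovieId(sample))
    End Sub
End Module
```

A caller can then check for an empty result and show an error message instead of redirecting to MovieInfo.aspx with a broken ID.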
The third file is the MovieInfo.aspx file which finally displays all the movie information.
MovieInfo.aspx
<form id="form1" runat="server">
<div>
<asp:Label runat="server" Text="" ID="CONTENT" ></asp:Label>
</div>
</form>
The fourth file is MovieInfo.aspx.vb, which contains the VB code that parses the movie page for a given Title ID. The Page Load event reads the Title ID from the URL query string and loads the movie page's HTML into a String for parsing. You will have to look at the actual IMDb page's HTML code to understand how this parser works. Both code-behind files need Imports System.IO at the top for StreamReader.
MovieInfo.aspx.vb
Protected Sub Page_Load(ByVal sender As Object, ByVal e As System.EventArgs) Handles Me.Load
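Dim ID As String = Request.QueryString("ID")
' Without an ID there is nothing to parse, so go back to the search page
If String.IsNullOrEmpty(ID) Then
Response.Redirect("Default.aspx")
End If
ID = ID.Trim()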
Dim URL As String = "http://www.imdb.com/title/" & ID & "/"
Dim inStream As StreamReader
Dim webRequest As Net.WebRequest
Dim webresponse As Net.WebResponse
webRequest = Net.WebRequest.Create(URL)
webresponse = webRequest.GetResponse()
inStream = New StreamReader(webresponse.GetResponseStream())
Dim HTML As String = inStream.ReadToEnd()
Dim List As String = "<strong>Movie ID:</strong> " & ID & "<br />"
Dim temp As String = ""
' Movie Name/Year
temp = HTML.Substring(HTML.IndexOf("<title>"), 1000)
Dim movie_name As String = temp.Substring(7, temp.IndexOf("(") - 8)
Dim movie_year As String = temp.Substring(temp.IndexOf("(") + 1, 4)
List = List & "<strong>Movie:</strong> " & movie_name & "<br />"
List = List & "<strong>Year:</strong> " & movie_year & "<br />"
' Movie Poster
temp = HTML.Substring(HTML.IndexOf("<div class=""photo"">"))
temp = temp.Substring(0, temp.IndexOf("</div>"))
Dim movie_poster As String = ""
If temp.IndexOf("noposter.gif") < 0 Then
movie_poster = temp.Substring(temp.IndexOf("src=""") + 5, temp.LastIndexOf("""") - temp.IndexOf("src=""") - 5)
Else
movie_poster = "http://i.media-imdb.com/images/SFd0ed3aeda7d77e6d9a8404cc3cd63dc6/intl/en/title_noposter.gif"
End If
List = List & "<strong>Poster:</strong> " & movie_poster & "<br />"
'User Rating
temp = HTML.Substring(HTML.IndexOf("<div class=""meta"">"))
temp = temp.Substring(0, temp.IndexOf("/"))
Dim movie_rating As String = ""
If temp.IndexOf("<b>") > 0 Then
movie_rating = temp.Substring(temp.IndexOf("<b>") + 3, 3)
End If
List = List & "<strong>Rating:</strong> " & movie_rating & "<br />"
' Director(s)
If HTML.IndexOf("<h5>Director") > 0 Then
temp = HTML.Substring(HTML.IndexOf("<h5>Director"))
temp = temp.Substring(0, temp.IndexOf("</div>"))
temp = temp.Substring(temp.IndexOf("</h5>") + 5)
temp = temp.Replace("<br/>", "#")
Dim cd As Integer = 1
Dim Directors As String = ""
If temp.IndexOf("<a") <> temp.LastIndexOf("<a") Then
cd = Count(temp, "<a")
If cd > 2 Then cd = 2
For x = 1 To cd
Dim dir As String = temp.Substring(temp.IndexOf(">") + 1, temp.IndexOf("</a>") - temp.IndexOf(">") - 1)
Directors = Directors & dir & ", "
temp = temp.Substring(temp.IndexOf("#") + 1, temp.Length - temp.IndexOf("#") - 1)
Next x
Else
Directors = temp.Substring(temp.IndexOf(">") + 1, temp.IndexOf("</a>") - temp.IndexOf(">") - 1)
End If
List = List & "<strong>Directors(" & cd & "):</strong> " & Directors & " <br />"
End If
' Writer(s)
If HTML.IndexOf("<h5>Writer") > 0 Then
temp = HTML.Substring(HTML.IndexOf("<h5>Writer"))
temp = temp.Substring(0, temp.IndexOf("</div>"))
temp = temp.Substring(temp.IndexOf("</h5>") + 5)
temp = temp.Replace("<br/>", "##")
Dim cw As Integer = 1
Dim Writers As String = ""
If temp.IndexOf("<a") <> temp.LastIndexOf("<a") Then
cw = Count(temp, "<a")
If cw > 2 Then cw = 2
For x = 1 To cw
Dim wrt As String = temp.Substring(temp.IndexOf(">") + 1, temp.IndexOf("</a>") - temp.IndexOf(">") - 1)
Writers = Writers & wrt & ", "
' skip the whole two-character "##" delimiter
temp = temp.Substring(temp.IndexOf("##") + 2)
Next x
Else
Writers = temp.Substring(temp.IndexOf(">") + 1, temp.IndexOf("</a>") - temp.IndexOf(">") - 1)
End If
List = List & "<strong>Writers(" & cw & "):</strong> " & Writers & " <br />"
End If
' Release Date
If HTML.IndexOf("<h5>Release Date") > 0 Then
temp = HTML.Substring(HTML.IndexOf("<h5>Release Date"))
temp = temp.Substring(0, temp.IndexOf("</div>"))
temp = temp.Substring(temp.IndexOf("</h5>") + 7)
Dim release_date As String = temp.Substring(0, temp.IndexOf("<a") - 1)
List = List & "<strong>Release Date:</strong> " & release_date & "<br />"
End If
' Genre
If HTML.IndexOf("<h5>Genre") > 0 Then
temp = HTML.Substring(HTML.IndexOf("<h5>Genre"))
temp = temp.Substring(0, temp.IndexOf("</div>"))
temp = temp.Substring(temp.IndexOf("</h5>") + 6)
temp = temp.Substring(0, temp.LastIndexOf("<a"))
temp = temp.Replace(" | ", "##")
Dim cg As Integer = 1
Dim Genre As String = ""
If temp.IndexOf("<a") <> temp.LastIndexOf("<a") Then
cg = Count(temp, "<a")
For x = 1 To cg
Dim gen As String = temp.Substring(temp.IndexOf(">") + 1, temp.IndexOf("</a>") - temp.IndexOf(">") - 1)
Genre = Genre & gen & ", "
' skip the whole two-character "##" delimiter
temp = temp.Substring(temp.IndexOf("##") + 2)
Next x
Else
Genre = temp.Substring(temp.IndexOf(">") + 1, temp.IndexOf("</a>") - temp.IndexOf(">") - 1)
End If
List = List & "<strong>Genre(" & cg & "):</strong> " & Genre & " <br />"
End If
' Cast
If HTML.IndexOf("<table class=""cast"">") > 0 Then
temp = HTML.Substring(HTML.IndexOf("<table class=""cast"">"))
temp = temp.Substring(0, temp.IndexOf("</div>"))
temp = temp.Replace("<td class=""nm"">", "##")
temp = temp.Replace("</a>", "**")
If temp.IndexOf("##") <> temp.LastIndexOf("##") Then
Dim c = Count(temp, "##")
List = List & "<strong>Cast:</strong> "
If c > 5 Then c = 5
For x = 1 To c
temp = temp.Substring(temp.IndexOf("##") + 2)
Dim cast As String = temp.Substring(temp.IndexOf(">") + 1, temp.IndexOf("**") - temp.IndexOf(">") - 1)
List = List & cast & ", "
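' move past this name only; jumping to the next "##" here would leave a single "#" behind and the next pass would skip every other cast member
temp = temp.Substring(temp.IndexOf("**") + 2)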
Next x
End If
List = List & "<br />"
End If
' Runtime
If HTML.IndexOf("<h5>Runtime") > 0 Then
temp = HTML.Substring(HTML.IndexOf("<h5>Runtime"))
temp = temp.Substring(0, temp.IndexOf("</div>"))
temp = temp.Substring(temp.IndexOf("</h5>") + 6)
Dim runtime As String = temp.Substring(0)
List = List & "<strong>Runtime:</strong> " & runtime & "<br />"
End If
List = "<img border='0' alt=""" & movie_name & """ title=""" & movie_name & """ src=""" & movie_poster & """ /><br>" & List
Content.Text = List
End Sub
' count the number of occurrences of a substring in a string
Protected Function Count(ByVal mainstring As String, ByVal searchstring As String) As Integer
Dim matches As Integer = 0
For x = 1 To mainstring.Length
If searchstring = Mid(mainstring, x, searchstring.Length) Then
matches = matches + 1
End If
Next x
Count = matches
End Function
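The Count function above compares a slice of the string at every character position. The same result can usually be had with IndexOf alone, as in the sketch below. The name CountOccurrences is made up for this example, and unlike the original it counts non-overlapping matches only, which makes no difference for markers such as "<a" or "##".

```vbnet
Imports System

Module CountSketch
    ' Hypothetical alternative to Count: jumps from match to match with IndexOf
    ' instead of testing every character position.
    Function CountOccurrences(ByVal mainString As String, ByVal searchString As String) As Integer
        If String.IsNullOrEmpty(searchString) Then Return 0
        Dim matches As Integer = 0
        Dim pos As Integer = mainString.IndexOf(searchString, StringComparison.Ordinal)
        While pos >= 0
            matches += 1
            ' Continue searching just after the match that was found
            pos = mainString.IndexOf(searchString, pos + searchString.Length, StringComparison.Ordinal)
        End While
        Return matches
    End Function

    Sub Main()
        ' Made-up fragment with two links
        ' Prints: 2
        Console.WriteLine(CountOccurrences("<a href='x'>A</a> | <a href='y'>B</a>", "<a"))
    End Sub
End Module
```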
I have only collected all the results in a String and displayed it on an HTML page, but you can use the variables to populate form fields or write the results to a database.
Feel free to use this code or modify it as you wish for your own website. If you need any tips or tricks, just leave a comment below :)
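If you want to pass the parsed values around instead of one big HTML string, a small container class is one way to do it. This is only a sketch: the MovieRecord class, its field names and the sample values are all invented for this example, and the original code keeps everything in local variables.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

' Hypothetical container for a few of the parsed fields
Public Class MovieRecord
    Public TitleId As String = ""
    Public Title As String = ""
    Public Year As String = ""
    Public Rating As String = ""
    Public Genres As New List(Of String)()
End Class

Module MovieRecordSketch
    ' Builds the same kind of HTML summary that the page writes to the label
    Function ToHtml(ByVal m As MovieRecord) As String
        Dim sb As New StringBuilder()
        sb.Append("<strong>Movie ID:</strong> ").Append(m.TitleId).Append("<br />")
        sb.Append("<strong>Movie:</strong> ").Append(m.Title).Append("<br />")
        sb.Append("<strong>Year:</strong> ").Append(m.Year).Append("<br />")
        sb.Append("<strong>Rating:</strong> ").Append(m.Rating).Append("<br />")
        ' Join avoids the trailing ", " that manual concatenation leaves behind
        sb.Append("<strong>Genre:</strong> ").Append(String.Join(", ", m.Genres.ToArray())).Append("<br />")
        Return sb.ToString()
    End Function

    Sub Main()
        ' Placeholder values, not real IMDb data
        Dim m As New MovieRecord()
        m.TitleId = "tt0000000"
        m.Title = "Example Movie"
        m.Year = "2009"
        m.Rating = "7.5"
        m.Genres.Add("Drama")
        m.Genres.Add("Thriller")
        Console.WriteLine(ToHtml(m))
    End Sub
End Module
```

The same object could be handed to a data-access routine that inserts one row per movie, which keeps the parsing code separate from the display and storage code.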
- Happy Parsing :)