Wednesday, April 29, 2009

IMDb Parser in ASP.NET

Being a big movie buff, I have a huge collection of movies and a movie database that I maintain in Microsoft Excel. The only information I keep in Excel is the movie name, year and genre, which is not enough. Deciding which movie to watch from the spreadsheet is difficult, and updating it is another painful task, so I thought of migrating all the Excel data to an online database with a lot of extra movie information and an easy-to-use interface.
The first roadblock in my way was pulling information for all the movies from IMDb.com. So I wrote this parser in ASP.NET; it accepts a movie name as the input parameter, searches IMDb.com and returns a list of matching titles. The diagram below shows the program flow for this parser:

imdb_parser

This parser is written in ASP.NET because that was the only handy tool I had in front of me. I know a lot of you folks would like to have this for PHP or Perl, but don’t worry: the code can be easily converted, because the only two important String functions I have used here are Substring and IndexOf, and both have equivalents in PHP and Perl. It took me only one day to write this code and blog entry, so you may find a few bugs and some unstructured code in here. Feel free to divide this parser into separate functions for your ease or add more parsing code.
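
To make the Substring/IndexOf idea concrete, here is a minimal sketch of a helper that returns the text between two markers. The name Between and the sample markup are assumptions made for illustration; they are not part of the parser files that follow.

```vb
Imports System

Module SubstringDemo
    ' Hypothetical helper: returns the text between the first occurrence of
    ' startMarker and the next occurrence of endMarker, or "" if either is missing.
    Function Between(ByVal source As String, ByVal startMarker As String, ByVal endMarker As String) As String
        Dim startPos As Integer = source.IndexOf(startMarker, StringComparison.Ordinal)
        If startPos < 0 Then Return ""
        startPos += startMarker.Length
        Dim endPos As Integer = source.IndexOf(endMarker, startPos, StringComparison.Ordinal)
        If endPos < 0 Then Return ""
        Return source.Substring(startPos, endPos - startPos)
    End Function

    Sub Main()
        ' Made-up sample markup, not a real IMDb page
        Dim html As String = "<h5>Runtime:</h5> 136 min </div>"
        Console.WriteLine(Between(html, "</h5>", "</div>").Trim())   ' prints: 136 min
    End Sub
End Module
```

Checking for a negative index before calling Substring avoids the exception that Substring throws when a marker is not on the page.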

What it can parse:

  1. Title ID
  2. Title
  3. Year
  4. Poster URL
  5. User Rating
  6. Directors
  7. Writers
  8. Release Date
  9. Genre
  10. Cast
  11. Runtime

I will walk you through the whole Parser program file by file. As the parsing is done based on HTML tags, you can figure out the details on your own.

The first file is the Default.aspx file which only contains the search field and submit button to submit the search string.

Default.aspx
<form id="form1" runat="server">
  <div>
  Search: <asp:TextBox ID="movie" runat="server"></asp:TextBox>
    <asp:Button ID="Submit" runat="server" Text="Submit" />
    <br /><br /><br />
    <asp:Label runat="server" Text="" ID="CONTENT" ></asp:Label>
  </div>
</form>
IMDb Search 

The second file is Default.aspx.vb, which contains the VB code for the Default.aspx page. The main function in this file is the event handler for the submit button, which loads the HTML returned by IMDb and checks whether it is the search results page or a movie page. If it is the search results page, it populates the results for Popular Titles, Exact Matches and Partial Matches. The second function is the title search function, which parses all the titles for a given title type (Popular, Exact or Partial) and builds an HTML unordered list of results. The third function simply retrieves the movie’s Title ID from the HTML page.
If an exact match is found for the searched title (the response is then a movie page rather than the search listing), the request is redirected to MovieInfo.aspx with the Title ID; otherwise the search results are displayed on the same page, each linking to the MovieInfo page.
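
The decision described above can be sketched on its own, without any network calls. This is only an illustration: the module and function names (PageRouting, IsSearchPage, ExtractTitleId) and the sample HTML are assumptions, and IMDb's real markup may differ.

```vb
Imports System

Module PageRouting
    ' Returns True when the HTML looks like the search listings page.
    ' The title text is the one the parser checks for; IMDb may change it.
    Function IsSearchPage(ByVal html As String) As Boolean
        Return html.IndexOf("<title>IMDb Title Search</title>", StringComparison.Ordinal) >= 0
    End Function

    ' Pulls the Title ID out of the canonical link tag of a movie page.
    ' Returns "" when the tag or the title/ segment is not found.
    Function ExtractTitleId(ByVal html As String) As String
        Dim tagPos As Integer = html.IndexOf("<link rel=""canonical""", StringComparison.Ordinal)
        If tagPos < 0 Then Return ""
        Dim idPos As Integer = html.IndexOf("title/", tagPos, StringComparison.Ordinal)
        If idPos < 0 Then Return ""
        idPos += 6
        Dim endPos As Integer = html.IndexOf("/", idPos, StringComparison.Ordinal)
        If endPos < 0 Then Return ""
        Return html.Substring(idPos, endPos - idPos)
    End Function

    Sub Main()
        ' Made-up sample of a movie page head section
        Dim page As String = "<head><title>The Matrix (1999)</title>" & _
            "<link rel=""canonical"" href=""http://www.imdb.com/title/tt0133093/"" /></head>"
        If IsSearchPage(page) Then
            Console.WriteLine("Show the result lists")
        Else
            Console.WriteLine("Redirect to MovieInfo.aspx?ID=" & ExtractTitleId(page))
        End If
    End Sub
End Module
```

With the sample above, the second branch runs and the redirect target ends in ID=tt0133093.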

Default.aspx.vb
Protected Sub Submit_Click(ByVal sender As Object, ByVal e As EventArgs) Handles Submit.Click
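    ' Requires "Imports System.IO" at the top of the file (for StreamReader)
    ' URL-encode the search text so spaces and symbols survive in the query string
    Dim searchstring As String = Server.UrlEncode(movie.Text)
    Dim imdb_search As String = "http://www.imdb.com/find?s=tt&q=" & searchstring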
    Dim inStream As StreamReader
    Dim webRequest As Net.WebRequest
    Dim webresponse As Net.WebResponse
    webRequest = Net.WebRequest.Create(imdb_search)
    webresponse = webRequest.GetResponse()
    inStream = New StreamReader(webresponse.GetResponseStream())
    Dim HTML As String = inStream.ReadToEnd()
    Dim temp As String = ""
    Dim List As String = ""
 
    ' If it is the search listings page
    If HTML.IndexOf("<title>IMDb Title Search</title>") > 0 Then
        ' List of Popular Titles
        If HTML.IndexOf("<b>Popular Titles</b>") > 0 Then
            List = List & TitleSearch(HTML, "Popular")
        End If
 
        ' List of Exact Match in Titles
        If HTML.IndexOf("<b>Titles (Exact Matches)</b>") > 0 Then
            List = List & TitleSearch(HTML, "Exact")
        End If
 
        ' List of Partial Match in Titles
        If HTML.IndexOf("<b>Titles (Partial Matches)</b>") > 0 Then
            List = List & TitleSearch(HTML, "Partial")
        End If
 
        Content.Text = List
    Else
        Response.Redirect("MovieInfo.aspx?ID=" & MovieID(HTML))
    End If
End Sub
 
' Build an HTML list of the results for one section of the search page
Protected Function TitleSearch(ByVal HTML As String, ByVal Title_Type As String) As String
    If Title_Type = "Popular" Then
        Title_Type = "Popular Titles"
    ElseIf Title_Type = "Exact" Then
        Title_Type = "Titles (Exact Matches)"
    ElseIf Title_Type = "Partial" Then
        Title_Type = "Titles (Partial Matches)"
    Else
        Title_Type = "Popular Titles"
    End If
 
    Dim temp As String = ""
    Dim List As String = ""
 
    temp = HTML.Substring(HTML.IndexOf("<b>" & Title_Type & "</b>"))
    ' get the number of results displayed
    Dim i As Integer = CInt(temp.Substring(temp.IndexOf("(Displaying ") + 12, 2))
    ' get the 'p' html block of Popular Titles
    temp = temp.Substring(0, temp.IndexOf("</p>"))
    ' set custom delimiter for parsing
    temp = temp.Replace("link=/title/", "##")
    List = List & "<strong>" & Title_Type & " (" & i & "):</strong><br /><ul>"
    ' loop through all the results
    For x = 1 To i
        Dim link As String = ""
        Dim image As String = ""
        Dim name As String = ""
        Dim year As String = ""
        ' trim everything before the delimiter
        temp = temp.Substring(temp.IndexOf("##"))
        ' get the link id
        link = temp.Substring(temp.IndexOf("##") + 2, temp.IndexOf("/") - temp.IndexOf("##") - 2)
        temp = temp.Substring(temp.IndexOf(">") + 1)
        ' get the Image link or Title before the </a> tag
        Dim temp1 As String = temp.Substring(0, temp.IndexOf("</a>"))
        ' Check if poster for the title is displayed or not
        If temp1.IndexOf("img src=""http://ia.media") > 0 Then
            ' if Poster is displayed then get the Title and Year for the movie
            image = temp1
            temp = temp.Substring(temp.IndexOf("##"))
            link = temp.Substring(temp.IndexOf("##") + 2, temp.IndexOf("/") - temp.IndexOf("##") - 2)
            temp = temp.Substring(temp.IndexOf(">") + 1)
            name = temp.Substring(0, temp.IndexOf("<"))
            temp = temp.Substring(temp.IndexOf("</a> ") + 6)
            year = temp.Substring(0, temp.IndexOf(")"))
        Else
            ' if Poster is not displayed then temp1 is the Title
            name = temp1
            temp = temp.Substring(temp.IndexOf("</a> ") + 6)
            year = temp.Substring(0, temp.IndexOf(")"))
        End If
        temp = temp.Substring(temp.IndexOf("</tr>"))
        List = List & "<li>" & image & " <a href='MovieInfo.aspx?ID=" & link & "'>" & name & "</a> (" & year & ")</li><br />"
    Next x
    List = List & "</ul><br />" 

    TitleSearch = List 
End Function 

' Extract the Title ID from the canonical link tag of a movie page
Protected Function MovieID(ByVal HTML As String) As String
    Dim temp As String = HTML.Substring(HTML.IndexOf("<link rel=""canonical"""), 1000)
    temp = temp.Substring(temp.IndexOf("title/") + 6)
    temp = temp.Substring(0, temp.IndexOf("/"))
    MovieID = temp
End Function

IMDb Movie Search Results

The third file is the MovieInfo.aspx file which finally displays all the movie information.

MovieInfo.aspx
<form id="form1" runat="server">
  <div>
    <asp:Label runat="server" Text="" ID="CONTENT" ></asp:Label>
  </div>
</form>

IMDb Movie Page Results

The fourth file is MovieInfo.aspx.vb, which contains the VB code to parse the movie page based on the Title ID. The Page Load event reads the Title ID from the URL query string and loads the movie page's HTML into a String to parse it. You will have to look at the actual IMDb page’s HTML code to understand how this parser works.
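
As a small taste of that parsing, here is a hedged sketch of splitting the page title into name and year. The module and procedure names are made up, the sample HTML is not a real IMDb page, and unlike the listing below it looks for the last "(" so that a movie name containing parentheses is kept whole.

```vb
Imports System

Module TitleTagDemo
    ' Splits a <title> value such as "The Matrix (1999)" into name and year.
    ' Assumes the year is the four characters after the last "(" in the tag.
    Sub ParseTitleTag(ByVal html As String, ByRef movieName As String, ByRef movieYear As String)
        movieName = ""
        movieYear = ""
        Dim startPos As Integer = html.IndexOf("<title>", StringComparison.Ordinal)
        If startPos < 0 Then Return
        startPos += 7
        Dim endPos As Integer = html.IndexOf("</title>", startPos, StringComparison.Ordinal)
        If endPos < 0 Then Return
        Dim title As String = html.Substring(startPos, endPos - startPos)
        Dim parenPos As Integer = title.LastIndexOf("("c)
        ' No parenthesis, or fewer than four characters after it: keep the whole title as the name
        If parenPos < 0 OrElse parenPos + 5 > title.Length Then
            movieName = title.Trim()
            Return
        End If
        movieName = title.Substring(0, parenPos).Trim()
        movieYear = title.Substring(parenPos + 1, 4)
    End Sub

    Sub Main()
        Dim movieName As String = ""
        Dim movieYear As String = ""
        ParseTitleTag("<html><title>The Matrix (1999)</title></html>", movieName, movieYear)
        Console.WriteLine(movieName & " / " & movieYear)   ' prints: The Matrix / 1999
    End Sub
End Module
```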

MovieInfo.aspx.vb
Protected Sub Page_Load(ByVal sender As Object, ByVal e As System.EventArgs) Handles Me.Load
    Dim ID As String = Request.QueryString("ID").Trim
    Dim URL As String = "http://www.imdb.com/title/" & ID & "/"
 
    Dim inStream As StreamReader
    Dim webRequest As Net.WebRequest
    Dim webresponse As Net.WebResponse
    webRequest = Net.WebRequest.Create(URL)
    webresponse = webRequest.GetResponse()
    inStream = New StreamReader(webresponse.GetResponseStream())
    Dim HTML As String = inStream.ReadToEnd()
 
    Dim List As String = "<strong>Movie ID:</strong> " & ID & "<br />"
    Dim temp As String = ""
 
    ' Movie Name/Year
    temp = HTML.Substring(HTML.IndexOf("<title>"), 1000)
    Dim movie_name As String = temp.Substring(7, temp.IndexOf("(") - 8)
    Dim movie_year As String = temp.Substring(temp.IndexOf("(") + 1, 4)
    List = List & "<strong>Movie:</strong> " & movie_name & "<br />"
    List = List & "<strong>Year:</strong> " & movie_year & "<br />"
 
    ' Movie Poster
    temp = HTML.Substring(HTML.IndexOf("<div class=""photo"">"))
    temp = temp.Substring(0, temp.IndexOf("</div>"))
    Dim movie_poster As String = ""
    If temp.IndexOf("noposter.gif") < 0 Then
        movie_poster = temp.Substring(temp.IndexOf("src=""") + 5, temp.LastIndexOf("""") - temp.IndexOf("src=""") - 5)
    Else
        movie_poster = "http://i.media-imdb.com/images/SFd0ed3aeda7d77e6d9a8404cc3cd63dc6/intl/en/title_noposter.gif"
    End If
    List = List & "<strong>Poster:</strong> " & movie_poster & "<br />"
 
    'User Rating
    temp = HTML.Substring(HTML.IndexOf("<div class=""meta"">"))
    temp = temp.Substring(0, temp.IndexOf("/"))
    Dim movie_rating As String = ""
    If temp.IndexOf("<b>") > 0 Then
        movie_rating = temp.Substring(temp.IndexOf("<b>") + 3, 3)
    End If
    List = List & "<strong>Rating:</strong> " & movie_rating & "<br />"
 
    ' Director(s)
    If HTML.IndexOf("<h5>Director") > 0 Then
        temp = HTML.Substring(HTML.IndexOf("<h5>Director"))
        temp = temp.Substring(0, temp.IndexOf("</div>"))
        temp = temp.Substring(temp.IndexOf("</h5>") + 5)
        temp = temp.Replace("<br/>", "#")
        Dim cd As Integer = 1
        Dim Directors As String = ""
        If temp.IndexOf("<a") <> temp.LastIndexOf("<a") Then
            cd = Count(temp, "<a")
            If cd > 2 Then cd = 2
            For x = 1 To cd
                Dim dir As String = temp.Substring(temp.IndexOf(">") + 1, temp.IndexOf("</a>") - temp.IndexOf(">") - 1)
                Directors = Directors & dir & ", "
                temp = temp.Substring(temp.IndexOf("#") + 1, temp.Length - temp.IndexOf("#") - 1)
            Next x
        Else
            Directors = temp.Substring(temp.IndexOf(">") + 1, temp.IndexOf("</a>") - temp.IndexOf(">") - 1)
        End If
        List = List & "<strong>Directors(" & cd & "):</strong> " & Directors & " <br />"
    End If
 
    ' Writer(s)
    If HTML.IndexOf("<h5>Writer") > 0 Then
        temp = HTML.Substring(HTML.IndexOf("<h5>Writer"))
        temp = temp.Substring(0, temp.IndexOf("</div>"))
        temp = temp.Substring(temp.IndexOf("</h5>") + 5)
        temp = temp.Replace("<br/>", "##")
        Dim cw As Integer = 1
        Dim Writers As String = ""
        If temp.IndexOf("<a") <> temp.LastIndexOf("<a") Then
            cw = Count(temp, "<a")
            If cw > 2 Then cw = 2
            For x = 1 To cw
                Dim wrt As String = temp.Substring(temp.IndexOf(">") + 1, temp.IndexOf("</a>") - temp.IndexOf(">") - 1)
                Writers = Writers & wrt & ", "
                ' skip past the whole two-character ## delimiter
                temp = temp.Substring(temp.IndexOf("##") + 2)
            Next x
        Else
            Writers = temp.Substring(temp.IndexOf(">") + 1, temp.IndexOf("</a>") - temp.IndexOf(">") - 1)
        End If
        List = List & "<strong>Writers(" & cw & "):</strong> " & Writers & " <br />"
    End If
 
    ' Release Date
    If HTML.IndexOf("<h5>Release Date") > 0 Then
        temp = HTML.Substring(HTML.IndexOf("<h5>Release Date"))
        temp = temp.Substring(0, temp.IndexOf("</div>"))
        temp = temp.Substring(temp.IndexOf("</h5>") + 7)
        Dim release_date As String = temp.Substring(0, temp.IndexOf("<a") - 1)
        List = List & "<strong>Release Date:</strong> " & release_date & "<br />"
    End If
 
    ' Genre
    If HTML.IndexOf("<h5>Genre") > 0 Then
        temp = HTML.Substring(HTML.IndexOf("<h5>Genre"))
        temp = temp.Substring(0, temp.IndexOf("</div>"))
        temp = temp.Substring(temp.IndexOf("</h5>") + 6)
        temp = temp.Substring(0, temp.LastIndexOf("<a"))
        temp = temp.Replace(" | ", "##")
        Dim cg As Integer = 1
        Dim Genre As String = ""
        If temp.IndexOf("<a") <> temp.LastIndexOf("<a") Then
            cg = Count(temp, "<a")
            For x = 1 To cg
                Dim gen As String = temp.Substring(temp.IndexOf(">") + 1, temp.IndexOf("</a>") - temp.IndexOf(">") - 1)
                Genre = Genre & gen & ", "
                ' skip past the whole two-character ## delimiter
                temp = temp.Substring(temp.IndexOf("##") + 2)
            Next x
        Else
            Genre = temp.Substring(temp.IndexOf(">") + 1, temp.IndexOf("</a>") - temp.IndexOf(">") - 1)
        End If
        List = List & "<strong>Genre(" & cg & "):</strong> " & Genre & " <br />"
    End If
 
 
    ' Cast
    If HTML.IndexOf("<table class=""cast"">") > 0 Then
        temp = HTML.Substring(HTML.IndexOf("<table class=""cast"">"))
        temp = temp.Substring(0, temp.IndexOf("</div>"))
        temp = temp.Replace("<td class=""nm"">", "##")
        temp = temp.Replace("</a>", "**")
        If temp.IndexOf("##") <> temp.LastIndexOf("##") Then
            Dim c = Count(temp, "##")
            List = List & "<strong>Cast:</strong> "
            If c > 5 Then c = 5
            For x = 1 To c
                temp = temp.Substring(temp.IndexOf("##") + 2)
                Dim cast As String = temp.Substring(temp.IndexOf(">") + 1, temp.IndexOf("**") - temp.IndexOf(">") - 1)
                List = List & cast & ", "
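                ' no extra trim here: the next pass advances to the following ## marker itself,
                ' and trimming half of the marker at this point made the loop skip every other cast member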
            Next x
        End If
        List = List & "<br />"
    End If
 
    ' Runtime
    If HTML.IndexOf("<h5>Runtime") > 0 Then
        temp = HTML.Substring(HTML.IndexOf("<h5>Runtime"))
        temp = temp.Substring(0, temp.IndexOf("</div>"))
        temp = temp.Substring(temp.IndexOf("</h5>") + 6)
        Dim runtime As String = temp.Substring(0)
        List = List & "<strong>Runtime:</strong> " & runtime & "<br />"
    End If
 
    List = "<img border='0' alt=""" & movie_name & """ title=""" & movie_name & """ src=""" & movie_poster & """ /><br>" & List
 
    Content.Text = List
End Sub
 
' count the number of occurrences of a substring in a string
Protected Function Count(ByVal mainstring As String, ByVal searchstring As String) As Integer
    Dim matches As Integer = 0
    For x = 1 To mainstring.Length
        If searchstring = Mid(mainstring, x, searchstring.Length) Then
            matches = matches + 1
        End If
    Next x
    Count = matches
End Function

I have only collected all the results in a String and displayed it on an HTML page, but you can use the variables to populate form fields or write the results to a database.
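
One possible way to do that is to collect the parsed values in a small object instead of one long String. The sketch below is an assumption for illustration only: the MovieRecord class, its field names and the Summary format are invented here and are not part of the parser above.

```vb
Imports System
Imports System.Collections.Generic

' Hypothetical container for the parsed fields; the names are illustrative.
Public Class MovieRecord
    Public TitleId As String = ""
    Public Name As String = ""
    Public Year As String = ""
    Public Rating As String = ""
    Public Directors As New List(Of String)
    Public Genres As New List(Of String)

    ' Joins a list with ", " and leaves no trailing separator
    Public Shared Function JoinNames(ByVal names As List(Of String)) As String
        Return String.Join(", ", names.ToArray())
    End Function

    ' One-line summary that could feed a label, a form field or a database column
    Public Function Summary() As String
        Return Name & " (" & Year & ") [" & TitleId & "] " & JoinNames(Genres)
    End Function
End Class

Module MovieRecordDemo
    Sub Main()
        ' Sample values typed in by hand, standing in for the parsed variables
        Dim m As New MovieRecord()
        m.TitleId = "tt0133093"
        m.Name = "The Matrix"
        m.Year = "1999"
        m.Genres.Add("Action")
        m.Genres.Add("Sci-Fi")
        Console.WriteLine(m.Summary())   ' prints: The Matrix (1999) [tt0133093] Action, Sci-Fi
    End Sub
End Module
```

Keeping directors and genres as lists also makes it easy to store them in separate rows later, if your database design needs that.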

Feel FREE to use this code or modify it as you wish for your own website. If you need any tips or tricks, just leave a comment below :)

- Happy Parsing :)
