Mark Pooley

Scraping Data for the Cage Page

April 25, 2015.

A year in the Cage is no small feat, and not for the weak-willed. Building a web page with data on all of Cage's movies seemed like the best way to track progress, while also giving me a place to write reviews and record quotes.

IMDB discourages/prohibits data mining, but this was a small task that ran in a fairly short amount of time. I looked at the IMDB page structure(s), scraped the relevant info, and wrote it to an HTML page and a CSV. Writing to a CSV will allow for data visualization with d3 (which I will likely do when there's enough data from watching his movies).

For this, I used Python with the BeautifulSoup, urllib2, re, and csv packages. After importing the necessary packages, we can create the soup from the main page for Nicolas Cage:


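import csv
import re
import urllib   #urlretrieve, used later to save the posters
import urllib2
#assuming BeautifulSoup 4 here; BeautifulSoup 3 is imported from the BeautifulSoup module instead
from bs4 import BeautifulSoup as bs

url = 'http://www.imdb.com/name/nm0000115/'
page = urllib2.urlopen(url)
soup = bs(page.read())
films = soup.find("div",{"class":"filmo-category-section"})

#get a BS result set of all the rows in which NC was an actor/voice
#(ids that start with "actor-tt")
fs = films.findAllNext('div',{'id': re.compile('^actor-tt')})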

Next, let's define some functions that get the information we're interested in. Most of them take either soup or row as input, depending on what is being fed into them. I like to keep functions simple and distinct when scraping; it makes it easier for me to keep track of what I'm doing.


#get the Title of the film and url for IMDB film entry
def getTitle(row):
    titleRow = row.find('a', href=True)
    title = titleRow.contents[0]
    link = 'http://www.imdb.com' + titleRow['href'] #generate a correct URL

    return [title, link]

#get year
def getYr(row):
    yr = row.find('span').contents[0]
    #the contents were messy; the four characters after the ';' are the year
    start = yr.index(';') + 1
    return yr[start:start + 4]
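
The slice in getYr depends on exactly where the ';' sits in the span text. As a hedged alternative, here is a minimal sketch that pulls the first four-digit year out of a string with a regular expression. It is written for Python 3 (unlike the Python 2 snippets in this post), and the name extract_year and the sample strings are assumptions for illustration, not part of the original script.

```python
import re

def extract_year(text):
    """Return the first standalone four-digit number in text, or None."""
    # hypothetical helper: four digits not touching any other digit
    match = re.search(r'(?<!\d)\d{4}(?!\d)', text)
    return match.group(0) if match else None

# sample strings are made up to resemble messy span contents
print(extract_year('\n&nbsp;2015\n'))
print(extract_year('&nbsp;1997/I'))
print(extract_year('announced'))
```

Returning None for rows without a year keeps the "is this a released film?" check in one place: the caller can skip anything that is not a year.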

The functions above are pretty self-explanatory. Originally, I wrote a function that just found the link to each movie poster image, but IMDB wasn't allowing those to load on my page. So, I changed the function to save a local copy of each movie poster instead.


def findImage(soup):
    #placeholder markup used whenever no poster can be found
    placeholder = '<img src="http://placehold.it/214x317">'
    try:
        img = soup.find('td', {'id': 'img_primary'}).find('img')
        link = img['src']
        link_title = str(img['title']).lower().replace(' ', '_')
        title = link_title.translate(None, ':()')

        #save a local copy of the poster and point the tag at it
        #(urlretrieve lives in urllib, not urllib2)
        jpg = urllib.urlretrieve(link, 'cage_images/{0}.jpg'.format(title))
        img['src'] = jpg[0]
    except (AttributeError, TypeError, KeyError):
        #if no image found, just stick a place holder in there
        return placeholder

    return img

There are a few other, similarly structured functions that find the rest of the information I wanted (budgets, actors, ratings, etc.). It would have been quite monotonous to write the lists and modal divs by hand for all of the movies in the Cage Canon. So, I went through the Nic Cage IMDB entry and found all the movies with a release year ≤ 2015 (he has a number in pre-production, rumored, etc.) where he was credited as an actor. All of these were written to a dictionary in Python.


#get a BS result set of all the rows in which NC was an actor/voice
fs = films.findAllNext('div',{'id': re.compile('^actor-tt')})

#containers and running totals filled in by the loop
CageDict, csvDict = {}, {}
count = totalRuntime = totalBudget = totalGross = 0

#loop through the soup and find all the titles where Cage is an actor
for row in fs:
    try:
        yr = getYr(row)
        title, link = getTitle(row)

        if yr.isdigit() and int(yr) <= 2015:
            count += 1 #increment counter

            #get soup from the corresponding film page
            page = urllib2.urlopen(link)
            print title
            soup = bs(page.read())
            imgLink = findImage(soup)
            info = findInfo(soup)
            people = findPeople(soup)
            finance = findEarnings(soup)
            rating = findRating(soup)

            CageDict[title] = [yr, title, info[2],rating[0],rating[1],info[0],info[1],people[0],people[1],finance[0],finance[1],imgLink]
            csvDict[title] = [yr, title, rating[1], rating[0],info[1], getInt(finance[0]),getInt(finance[1]),getInt(info[0]),'Unwatched']

            #aggregate run times, budgets, and gross
            totalRuntime += getInt(info[0])
            totalBudget += getInt(finance[0])
            totalGross += getInt(finance[1])

    #rudimentary/lazy error handling: report the row and move on
    except UnicodeEncodeError:
        print 'error encountered at {0}'.format(count)
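
The getInt helper used in the loop isn't listed in this post. Below is a minimal sketch of what such a helper could look like, assuming the scraped values are strings such as '$40,000,000' or '115 min'. It is written for Python 3, and the body is a hypothetical stand-in, not the original function.

```python
import re

def getInt(value):
    """Hypothetical stand-in for the getInt helper used in the loop."""
    # keep only the digits: '$40,000,000' becomes '40000000'
    # note: decimal points are dropped too, so '7.5' would become 75
    digits = re.sub(r'\D', '', str(value))
    # assumption: values with no digits at all (e.g. 'N/A') count as 0
    return int(digits) if digits else 0

# made-up sample values
print(getInt('$40,000,000'))
print(getInt('115 min'))
print(getInt('N/A'))
```

Treating missing values as 0 keeps the running totals from crashing, at the cost of hiding which films had no budget or gross listed.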

With the dictionaries created, it's easy to then write the HTML code iteratively. This made it pretty easy to test the code and check the HTML formatting. If anything was way off, I could just re-run the code and overwrite the mistakes. I did use Sublime Text to do some basic search/replace, as well as to make some minor edits. I wrote the list and the modal divs to two separate HTML documents and just added the two together, which isn't ideal, but worked just fine.


#write out cage list to html document
with open('CageList.html','wb') as f:
    f.write('<ul class="large-block-grid-5">\n')

    for row in CageDict:
        line = CageDict[row]
        #create modal link from title
        mod = line[1].replace(' ','').translate(None,'\\/#&')
        #adjacent string literals are joined before .format is applied
        f.write('<li class="unwatched">'
                '<a data-reveal-id="mod{0}">{1}'
                '</a></li>\n'.format(mod,line[-1]))

    f.write('</ul>')

#write modal div html document that will be used for
#modal reveals linked to the CageList.html doc
with open('CagePage.html','wb') as CagePage:

    for row in CageDict:
        line = CageDict[row]

        mod = line[1].replace(' ','').translate(None,'\\/#&') #create modal link from title

        CagePage.write('<div id="mod{0}" class="reveal-modal" data-reveal aria-labelledby="modalTitle" aria-hidden="true" role="dialog">\n'.format(mod))
        CagePage.write('  <div class="small-9 columns" id="modalTitle">\n')
        CagePage.write('  <table class="filmInfo">\n')
        CagePage.write('        <tr><td colspan="2"><h4><b>{0}</b>, <i>{1}</i></h4></td></tr>\n'.format(line[1],line[0])) #Title and Year
        CagePage.write('        <tr><td colspan="2"><b>Description: </b>{0}</td></tr>\n'.format(line[2].encode('utf-8'))) #Description
        CagePage.write('        <tr><td><b>MPAA Rating: </b>{0}</td><td><b>Genre: </b>{1}</td></tr>\n'.format(line[3],line[4])) #rating and genre
        CagePage.write('        <tr><td><b>Runtime:</b> {0}</td><td><b>Avg IMDB User Rating: </b>{1}</td></tr>\n'.format(line[5],line[6])) #Runtime and IMDB rating
        CagePage.write('        <tr><td><b>Director: </b>{0}</td><td><b>Actors: </b>{1}</td></tr>\n'.format(line[7],line[8].encode('utf-8'))) #director and "big name" actors
        CagePage.write('        <tr><td><b>Budget: </b>{0}</td><td><b>Gross: </b>{1}</td></tr>\n'.format(line[9],line[10])) #budget and gross for film
        CagePage.write('    </table>\n') #close table
        CagePage.write('  </div>\n') #close the 9 column div
        CagePage.write(' <div class="small-3 columns"><a>{0}</a></div>\n'.format(line[-1]))
        CagePage.write(' <a class="close-reveal-modal" aria-label="Close">&#215;</a>\n') #close 'x' in reveal box
        CagePage.write(' <div class="small-12 columns">\n')
        CagePage.write(' <br>\n')
        CagePage.write(' <table width="100%">\n')
        CagePage.write(' <tr><td><h4 align="center">Meta Cage</h4></td></tr>\n')
        CagePage.write(' <tr><td><b>The Cage Character:</b></td></tr>\n')
        CagePage.write(' <tr><td><b>Cage Hair:</b></td></tr>\n')
        CagePage.write(' <tr><td><b>Cage Rating:</b></td></tr>\n')
        CagePage.write(' </table>\n')
        CagePage.write('  <h3 id="review" align="center">Review</h3><h6 align="center"></h6><hr>\n')
        CagePage.write('  <p><i>forthcoming</i></p>\n')
        CagePage.write('  <h3 id="cagisms" align="center">Quotes &amp; Cagisms</h3><hr>\n')
        CagePage.write('  <p><i>forthcoming</i></p>\n')
        CagePage.write('  </div>\n')
        CagePage.write('</div>\n') #close main div
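
Both loops build the modal id from the title in the same way, and the list link only opens the right modal if the two ids match exactly. A minimal sketch of pulling that step into one shared helper is below. It is written for Python 3 (where str.translate takes a table built by str.maketrans), and the name modal_id is an assumption for illustration.

```python
def modal_id(title):
    """Build the 'mod...' id that ties a list item to its modal div."""
    # hypothetical helper: delete spaces and the characters \ / # &
    table = str.maketrans('', '', ' \\/#&')
    return 'mod' + title.translate(table)

print(modal_id('Face/Off'))
print(modal_id('Gone in 60 Seconds'))
```

With a helper like this, the data-reveal-id in the list and the id on the modal div come from the same call, so they can't drift apart if the set of stripped characters changes.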

Voilà! I was able to throw the two HTML documents together and start an epic "Year in the Cage". I also wrote the dictionaries to a CSV in case I feel a little nerdy and want to add some D3 visualizations in the future.
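
The CSV step isn't shown above, so here is a minimal sketch of how a dictionary shaped like csvDict could be written out with the csv module. It is written for Python 3. The column names are my own labels, read off the order of the values stored in csvDict, and the sample row is made up rather than scraped.

```python
import csv
import io

# column names are assumptions, read off the order of values in csvDict
COLUMNS = ['year', 'title', 'genre', 'mpaa_rating', 'imdb_rating',
           'budget', 'gross', 'runtime', 'status']

def write_csv(csv_dict, handle):
    """Write a header row, then one row per film, to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(COLUMNS)
    # sort by title so the row order is stable between runs
    for title in sorted(csv_dict):
        writer.writerow(csv_dict[title])

# made-up sample row, not scraped data
sample = {'Example Film': ['2000', 'Example Film', 'Action', 'PG-13',
                           '6.5', 1000000, 2000000, 100, 'Unwatched']}
buffer = io.StringIO()
write_csv(sample, buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, pass a handle opened with open('cage.csv', 'w', newline='') (the file name is hypothetical). A header row matters for D3, since d3.csv uses it to name the fields of each row.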