A year in the Cage is no small feat, and not for the weak willed. Building a web page complete with data on all of Cage's movies seemed like the best approach for tracking progress, while providing an avenue for tracking, reviewing, and recording quotes.
IMDB discourages/prohibits data mining. But, this was a small task that was able to run in a fairly short amount of time. I looked at the IMDB page structure(s) and scraped the relevant info and wrote it to an HTML page and CSV. Writing to a CSV will allow for data visualization with d3 (which I likely do when there's enough data from watching his movies).
For this, I used BeautifulSoup, urllib2, re, csv, and python packages. After importing the necessary packages, we can create the soup from the main page for Nicolas Cage
url = 'http://www.imdb.com/name/nm0000115/'
page = urllib2.urlopen(url)
soup = bs(page.read())
films = soup.find("div",{"class":"filmo-category-section"})
#get a BS result set of all the rows in which NC was an actor/voice
fs = films.findAllNext('div',{'id': re.compile('actor-tt*')})
First, let's define some functions that get the information we're interested in. Most of them will take soup or row as input depending on what is getting fed into them. I like to keep functions simple and distinct when scraping - it makes it easier for me to keep track of what I'm doing.
#get the Title of the film and url for IMDB film entry
---------------------------------------------------------------------
def getTitle(row):
titleRow = row.find('a',href=True)
title = titleRow.contents[0]
link = 'http://www.imdb.com' + titleRow['href'] #generate a correct URl
return [title,link]
#get year
def getYr(row):
yr = row.find('span').contents[0]
#the contents were messy, this finds the index of the character needed
return yr[yr.index(';')+1: yr.index(';')+5]
The function(s) above are pretty self-explanatory. Originally, I wrote a function that just found the linked movie poster image. But, IMDB wasn't allowing those to load on my page. So, I changed the function and saved a local copy of each movie poster.
def findImage(soup):
try:
img = soup.find('td',{'id': 'img_primary'})
img = img.find('img')
link = img['src']
link_title = str(img['title']).lower().replace(' ','_')
title = link_title.translate(None, ':()')
jpg = urllib.urlretrieve(link, 'cage_images/{0}.jpg'.format(title))
img['src'] = jpg[0]
if link == None:
#if no image found, just stick a place holder in there
link = '<img src="http://placehold.it/214x317">'
except AttributeError:
#if no image found, just stick a place holder in there
link = '<img src="http://placehold.it/214x317">'
return img
There are a few other functions, similarly structured to find information that I wanted (i.e. budgets, actors, ratings, etc.). It would have been quite monotonous to write/create the lists and modal divs for all of the movies in the Cage Cannon. So, I went through the Nic Cage IMDB entry and found all the movies with a release year ≤ 2015 (He has a number in pre-production, rumored, etc.) and where he was credited as an actor. All of these were written to a dictionary in python.
#get a BS result set of all the rows in which NC was an actor/voice
fs = films.findAllNext('div',{'id': re.compile('actor-tt*')})
#loop through the soup and find all the titles were Cage is an actor
for row in fs:
try:
yr = getYr(row)
title = getTitle(row)[0]
if yr.isdigit() and int(yr) ≤ 2015:
count+=1 #increment counter
#get soup from the corresponding film page
url = urllib2.urlopen(getTitle(row)[1])
print title
#print page
soup = bs(url.read())
imgLink = findImage(soup)
info = findInfo(soup)
people = findPeople(soup)
finance = findEarnings(soup)
rating = findRating(soup)
CageDict[title] = [yr, title, info[2],rating[0],rating[1],info[0],info[1],people[0],people[1],finance[0],finance[1],imgLink]
csvDict[title] = [yr, title, rating[1], rating[0],info[1], getInt(finance[0]),getInt(finance[1]),getInt(info[0]),'Unwatched']
#aggreate run times, budgets, and gross
totalRuntime += getInt(info[0])
totalBudget += getInt(finance[0])
totalGross += getInt(finance[1])
#rudimentary/lazy error handling
except UnicodeEncodeError:
print 'error encountered at {0}'.format(count)
pass
With dictionaries created, it's easy to then write the HTML code iteratively. This allowed for code testing and checking of HTML formatting pretty easily. If anything was way off, I could just re-run the code and overwrite mistakes. I did use Sublime Text to do some basic search/replace, as well as make some minor edits. I wrote the list, and modal divs to two separate HTML documents and just added the two together, which isn't ideal, but worked just fine.
#write out cage list to html document
with open('CageList.html','wb') as f:
f.write('<ul class="large-block-grid-5">\n')
for row in CageDict:
line = CageDict[row]
#create modal link from title
mod = line[1].replace(' ','').translate(None,'\/#&#')
f.write('<li class="unwatched">
<a data-reveal-id="mod{0}">{1}
</a></li>\n'.format(mod,line[-1]))
f.write('</ul>')
#write modal div html document that will be used for
#modal reveals linked to the CageList.html doc
with open('CagePage.html','wb') as CagePage:
for row in CageDict:
line = CageDict[row]
mod = line[1].replace(' ','').translate(None,'\/#&#')#create modal link from title
CagePage.write('<div id="mod{0}" class="reveal-modal" data-reveal aria-labelledby="modalTitle" aria-hidden="true" role="dialog">\n'.format(mod))
CagePage.write(' <div class="small-9 columns" id="modalTitle">\n')
CagePage.write(' <table class="filmInfo">\n')
CagePage.write(' <tr><td colspan="2"><h4><b>{0}</b>, <i>{1}</i></h4></td></tr>\n'.format(line[1],line[0])) #Title and Year
CagePage.write(' <tr><td colspan="2"><b>Description: </b>{0}</td></tr>\n'.format(line[2].encode('utf-8'))) #Description
CagePage.write(' <tr><td><b>MPAA Rating: </b>{0}</td><td><b>Genre: </b>{1}</td></tr>\n'.format(line[3],line[4])) #rating and genre
CagePage.write(' <tr><td><b>Runtime:</b> {0}</td><td><b>Avg IMDB User Rating: </b>{1}</td></tr>\n'.format(line[5],line[6])) #Runtime and IMDB rating
CagePage.write(' <tr><td><b>Director: </b>{0}</td><td><b>Actors: </b>{1}</td></tr>\n'.format(line[7],line[8].encode('utf-8'))) #director and "big name" actors
CagePage.write(' <tr><td><b>Budget: </b>{0}</td><td><b>Gross: </b>{1}</td></tr>\n'.format(line[9],line[10])) #budget and gross for film
CagePage.write(' </table>\n') #close table
CagePage.write(' </div>\n') #close the 9 column div
CagePage.write(' <div class="small-3 columns"><a>{0}</a></div>\n'.format(line[-1]))
CagePage.write(' <a class="close-reveal-modal" aria-label="Close">×</a>\n') #close 'x' in reveal box
CagePage.write(' <div class="small-12 columns">\n')
CagePage.write(' <br>\n')
CagePage.write(' <table width="100%">\n')
CagePage.write(' <tr><td><h4 align="center">Meta Cage</h4></td></tr>\n')
CagePage.write(' <tr><td><b>The Cage Character:</b></td></tr>\n')
CagePage.write(' <tr><td><b>Cage Hair:</b></td></tr>\n')
CagePage.write(' <tr><td><b>Cage Rating:</b></td></tr>\n')
CagePage.write(' </table>\n')
CagePage.write(' <h3 id="review" align="center">Review</h3><h6 align="center"></h6><hr>\n')
CagePage.write(' <p><i>forthcoming</i></p>\n')
CagePage.write(' <h3 id="cagisms" align="center">Quotes & Cagisms</h3><hr>\n')
CagePage.write(' <p><i>forthcoming</i></p>\n')
CagePage.write(' </div>\n')
CagePage.write('</div>\n') #close main div
Viola! I was able to throw the two HTML documents together and start an epic "Year in the Cage". I also wrote the dictionaries to a CSV in the event I feel a little nerdy and want to add some D3 visualizations in the future.