I'm trying to use pyparsing to grab CNN's top 5 headlines. Does anyone know of a way to do this?
Also, is it possible to grab only the URLs in between
startnews = '<div class="cnnSubHead">Latest News</div>'
endnews = '/24hours/'
My code is currently grabbing everything on the page, not just that section.
Code:
-------------------------------------------------------------
from pyparsing import Word, Suppress, CharsNotIn  # import what we need
import urllib

startnews = '<div class="cnnSubHead">Latest News</div>'
endnews = '/24hours/'

filter1 = Suppress('>')  # filter out stuff we dont want to show up
filter2 = Suppress('</a>')
pattern = filter1 + CharsNotIn('<').setResultsName('newslisting') + filter2  # setup search pattern

cnnurl = 'http://www.cnn.com/'  # url to search
cnnconnect = urllib.urlopen(cnnurl)  # connect to url
readpage = cnnconnect.read()  # read html src into list
cnnconnect.close()  # close connection to resource

for theloop,startnews,endnews in pattern.scanString(readpage):  # loop through resource
    print '[+]', theloop.newslisting  # display results
----------END CODE----------------
scripteaze
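
One thing that may explain why everything on the page is matched: scanString yields (tokens, start, end) for each match, so naming the loop variables startnews and endnews just overwrites those two strings with character offsets. They never act as a filter. A possible approach, sketched here under assumptions, is to cut the page source down to the text between the two markers first and scan only that piece. The helper name slice_between and the sample HTML are made up for illustration; the real CNN markup may differ.

```python
def slice_between(text, start_marker, end_marker):
    """Return the part of text after start_marker and before end_marker."""
    start = text.find(start_marker)
    if start == -1:
        return ''  # start marker not found: nothing to scan
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        return text[start:]  # end marker missing: keep the rest of the page
    return text[start:end]


# Made-up sample standing in for the real page source (an assumption)
sample = ('<a href="/top">Top story</a>'
          '<div class="cnnSubHead">Latest News</div>'
          '<a href="/2008/a/">Headline A</a>'
          '<a href="/2008/b/">Headline B</a>'
          '<a href="/24hours/">More</a>')

startnews = '<div class="cnnSubHead">Latest News</div>'
endnews = '/24hours/'

section = slice_between(sample, startnews, endnews)
print(section)
```

With the real page, section = slice_between(readpage, startnews, endnews) could then be passed to pattern.scanString(section) in place of readpage, using loop variable names that do not reuse startnews and endnews. To stop at five headlines, the loop could count matches and break after the fifth.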


