Discussion Board

Thread: pyparsing

  1. #1
    Registered User metasploit's Avatar
    Join Date
    Feb 2008
    Posts
    1
    I'm trying to use pyparsing to grab CNN's top 5 headlines. Does anyone know of a way to do this?

    Also, is it possible to grab only the URLs between these two markers?

    startnews = '<div class="cnnSubHead">Latest News</div>'
    endnews = '/24hours/'

    Right now my code grabs every link on the page instead:
    -------------------------------------------------------------

    Code:
    
    from urllib.request import urlopen  # Python 3 location of urlopen
    
    from pyparsing import CharsNotIn, Suppress  # import what we need
    
    startnews = '<div class="cnnSubHead">Latest News</div>'
    endnews = '/24hours/'
    
    filter1 = Suppress('>')  # drop the '>' that ends the opening <a ...> tag
    filter2 = Suppress('</a>')  # drop the closing </a> tag
    
    # link text between '>' and '</a>', available as .newslisting
    pattern = filter1 + CharsNotIn('<').setResultsName('newslisting') + filter2
    
    cnnurl = 'http://www.cnn.com/'  # url to search
    
    cnnconnect = urlopen(cnnurl)  # connect to url
    
    readpage = cnnconnect.read().decode('utf-8', 'replace')  # read html src into a string
    
    cnnconnect.close()  # close connection to resource
    
    # scanString yields (tokens, start, end); start and end are character offsets
    # of each match, so they must not reuse the startnews/endnews names above
    for tokens, start, end in pattern.scanString(readpage):  # loop through every match on the page
    
        print('[+]', tokens.newslisting)  # display results
    ----------END CODE----------------
    scripteaze

  2. #2
    Regular Contributor miohtama's Avatar
    Join Date
    Jan 2004
    Location
    Helsinki
    Posts
    376
    Since this is a pyparsing-specific question, it might get more attention on a pyparsing-related forum.
    Mikko Ohtamaa

    http://mfabrik.com
    http://blog.mfabrik.com
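
For anyone landing on this thread with the same question, here is a minimal sketch of one way to limit the scan to the part of the page between the two markers from post #1 and to stop after five links. It assumes the markers appear literally in the page source, and the helper names `slice_between` and `top_links` are made up for illustration. It uses pyparsing's `makeHTMLTags` and `SkipTo` so that the `href` of each link is available alongside the headline text; it is not a tested CNN scraper, only a sketch of the approach.

```python
from pyparsing import SkipTo, makeHTMLTags

# Markers taken from post #1; assumed to appear verbatim in the page source.
START_NEWS = '<div class="cnnSubHead">Latest News</div>'
END_NEWS = '/24hours/'


def slice_between(page, start_marker, end_marker):
    """Return the text between the two markers, or '' if either is missing."""
    start = page.find(start_marker)
    if start == -1:
        return ''
    start += len(start_marker)
    end = page.find(end_marker, start)
    if end == -1:
        return ''
    return page[start:end]


def top_links(page, limit=5):
    """Return up to `limit` (url, headline) pairs found between the markers."""
    a_start, a_end = makeHTMLTags('a')
    # <a ...> + everything up to </a> (named 'headline') + </a>
    link = a_start + SkipTo(a_end).setResultsName('headline') + a_end
    section = slice_between(page, START_NEWS, END_NEWS)
    results = []
    for tokens, _start, _end in link.scanString(section):
        # tokens.href comes from the attributes parsed by makeHTMLTags
        results.append((tokens.href, tokens.headline.strip()))
        if len(results) == limit:
            break
    return results
```

With the page source from post #1 in `readpage`, calling `top_links(readpage)` would return a list of `(url, headline)` tuples. Slicing the string first keeps the grammar simple: the parser never sees links outside the "Latest News" block, so no extra filtering on match offsets is needed.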


京ICP备05048969号  © Copyright Nokia 2013 All rights reserved