http - Scraping an ASPX website that uses cookies and login using Python


I'm trying to scrape PDFs from snl.com. I have a paid subscription and valid login credentials.

The URL of one of the PDF files is: http://www.snl.com/interactivex/file.aspx?id=10735427&keyfileformat=pdf

After logging in manually and accessing the above URL, the actual URL in the address bar when the PDF is rendered in the browser is: http://ofccolo.snl.com/cache/44d87724ce10735427.pdf?cachepath=%5c%5cdmzdoc2%5cwebcache%24%5c&t=&o=pdf&y=&d=

When I access this URL, I'm redirected to https://www.snl.com/interactivex/default.aspx, which is the login page.

I have read several threads on Python requests and tried the code below to get past the login page and handle cookies, but I still keep getting the login page. The response contains the login prompt that asks registered SNL users to log in with their email address and password.

    import sys

    import requests
    from requests.packages.urllib3 import add_stderr_logger

    add_stderr_logger()  # log every HTTP request to stderr
    s = requests.Session()  # the session keeps cookies between requests
    s.headers['User-Agent'] = 'Mozilla/5.0'

    name_form = 'username'
    password_form = 'password'
    login = {name_form: 'my_email_id', password_form: 'my_password'}

    # Submit the credentials to the login page
    login_response = s.post("https://www.snl.com/interactivex/default.aspx", data=login)
    print('l', login_response)
    for r in login_response.history:
        if r.status_code == 401:  # 401 means authentication failed
            sys.exit(1)  # abort

    # Request the PDF with the same session
    pdf_response = s.get("http://www.snl.com/interactivex/file.aspx?id=17670354&keyfileformat=pdf")

Output:

    2014-06-26 13:04:54,555 DEBUG added stderr logging handler logger: requests.packages.urllib3
    2014-06-26 13:04:54,605 INFO Starting new HTTPS connection (1): www.snl.com
    2014-06-26 13:04:55,943 DEBUG "GET /interactivex/default.aspx HTTP/1.1" 302 152
    2014-06-26 13:04:56,282 DEBUG "GET /interactivex/logincookiecheck.aspx HTTP/1.1" 302 143
    2014-06-26 13:04:56,650 DEBUG "GET /interactivex/default.aspx HTTP/1.1" 200 None
    2014-06-26 13:04:56,865 INFO Starting new HTTP connection (1): www.snl.com
    2014-06-26 13:04:57,447 DEBUG "GET /interactivex/file.aspx?id=17670354&keyfileformat=pdf HTTP/1.1" 302 143
    2014-06-26 13:04:57,788 DEBUG "GET /interactivex/default.aspx HTTP/1.1" 302 162
    2014-06-26 13:04:58,151 DEBUG "GET /interactivex/default.aspx HTTP/1.1" 200 None

I don't know how to interpret this output. When I googled response code 200, I learnt that it means OK.
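One possible way to read the log is as a chain of requests, where each 302 is a redirect and the final 200 only says that the last page in the chain was served successfully. The sketch below parses the three request lines for the PDF fetch and labels each status code. The names `LOG`, `parse_requests` and `describe` are made up for this illustration, and the regular expression is an assumption about the log layout shown in the output.

```python
import re

# Request lines for the PDF fetch, copied from the debug output.
LOG = """\
2014-06-26 13:04:57,447 DEBUG "GET /interactivex/file.aspx?id=17670354&keyfileformat=pdf HTTP/1.1" 302 143
2014-06-26 13:04:57,788 DEBUG "GET /interactivex/default.aspx HTTP/1.1" 302 162
2014-06-26 13:04:58,151 DEBUG "GET /interactivex/default.aspx HTTP/1.1" 200 None
"""

# Matches: "METHOD path HTTP/x.y" status  (case-insensitive on purpose)
REQUEST_RE = re.compile(r'"(\w+) (\S+) HTTP/[\d.]+" (\d{3})', re.IGNORECASE)


def parse_requests(log_text):
    """Return a list of (method, path, status) tuples found in the log text."""
    return [(method.upper(), path, int(status))
            for method, path, status in REQUEST_RE.findall(log_text)]


def describe(status):
    """Give a rough, human-readable meaning for an HTTP status code."""
    if 300 <= status < 400:
        return "redirect, the server sent the client to another URL"
    if 200 <= status < 300:
        return "success for this single request only"
    return "other"


for method, path, status in parse_requests(LOG):
    print(method, path, status, "->", describe(status))
```

Read this way, the request for file.aspx was redirected to default.aspx, and the 200 belongs to default.aspx (the login page), not to the PDF. In requests, the same chain should be visible through `pdf_response.history` and the final address through `pdf_response.url`.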

But when I print pdf_response.text, it returns the login page again.
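Rather than printing the whole body, a script could check what it actually received. The sketch below is only a heuristic under two assumptions: a real PDF body starts with the bytes `%PDF-`, and ending up on /interactivex/default.aspx means the session was bounced to the login page. The function name `classify_response` and the constant `LOGIN_PATH` are hypothetical helpers, not part of requests.

```python
# Path of the login page, taken from the redirect described in the question.
LOGIN_PATH = "/interactivex/default.aspx"


def classify_response(final_url, body):
    """Classify a fetched response as 'pdf', 'login' or 'other'.

    final_url: the URL after all redirects (e.g. pdf_response.url)
    body:      the raw response bytes (e.g. pdf_response.content)
    """
    if body[:5] == b"%PDF-":  # PDF files begin with this signature
        return "pdf"
    if LOGIN_PATH in final_url.lower():  # redirected back to the login page
        return "login"
    return "other"


# Offline examples with made-up bodies:
print(classify_response("http://ofccolo.snl.com/cache/example.pdf", b"%PDF-1.4 ..."))
print(classify_response("https://www.snl.com/interactivex/default.aspx", b"<html>..."))
```

With the session from the script, the call would look like `classify_response(pdf_response.url, pdf_response.content)`. Only when it returns 'pdf' would it make sense to write `pdf_response.content` to a file opened in binary mode.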

