Question about python-scripted https login

ndb · ‎01-06-2013

Hi,
does anyone know how to use python urllib2, mechanize, etc to achieve a successful scripted https login?
I'm trying to write a python script to automatically get my plus.net broadband usage, without having to log into the website every time. Would be much easier to just run a daily python script.
I've been using the python mechanize & urllib2 libs to submit the form on the plus.net https login page, but it always comes back with 'invalid password' in the https response.
Thanks in advance....

alanb · ‎24-05-2007

I have never used mechanize but I have used urllib to write a script that logs into a router to start or stop a backup connection. I needed to experiment a bit to get it to work, but it turned out to be pretty trivial.
All the work to handle HTTP authentication is done in urllib.FancyURLopener. By default it prompts the user to enter the necessary username and password. I subclassed it to hardcode both. Then I just had to submit one form by the 'POST' method to get the router to do what I wanted. Here is the guts of the code ...


import os, sys, urllib
url = 'http://192.168.0.1/'
urlpage = 'http://192.168.0.1/st_sp.htm'
username = 'router username'
password = 'router password'
connect = {'dial' : 'Connect'}
disconnect = {'hang_Up' : 'Disconnect'}
class myopener(urllib.FancyURLopener):
    def __init__(self, *args):
        urllib.FancyURLopener.__init__(self, *args)
    def prompt_user_passwd(self, host, realm): # override prompt routine to hard-code user name and password
        return (username, password)
def do_form_action(action):
    urllib._urlopener = myopener() # handles standard HTTP authentication dialogue
    data = urllib.urlencode(action) # encodes form data - action is a dictionary of form fields, sent in the body of the POST request
    urllib.urlopen(urlpage, data) # open page using POST method and submit form data

To start the backup connection I call do_form_action(connect). To stop the backup connection I call do_form_action(disconnect).
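
For anyone reading this on Python 3, where FancyURLopener is deprecated, roughly the same idea can be sketched with urllib.request and a basic-auth handler. This is only a sketch under assumptions, not the script above: it assumes the router uses HTTP Basic authentication, and the helper names build_opener_with_auth, build_form_request and do_form_action are made up for illustration. The URL and form fields are the ones from the code above.

```python
import urllib.parse
import urllib.request

URLPAGE = 'http://192.168.0.1/st_sp.htm'   # router page from the post above
CONNECT = {'dial': 'Connect'}
DISCONNECT = {'hang_Up': 'Disconnect'}


def build_opener_with_auth(url, username, password):
    """Return an opener that answers a 401 Basic challenge for `url`."""
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # realm=None means "use these credentials for any realm at this URL"
    password_mgr.add_password(None, url, username, password)
    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
    return urllib.request.build_opener(handler)


def build_form_request(url, fields):
    """Build a request whose body is the URL-encoded form fields.

    Supplying a body is what makes urllib send a POST instead of a GET.
    """
    body = urllib.parse.urlencode(fields).encode('ascii')
    return urllib.request.Request(url, data=body)


def do_form_action(opener, action):
    """Submit the form; this is the only function that touches the network."""
    with opener.open(build_form_request(URLPAGE, action)) as response:
        return response.status
```

Usage would be along the lines of opener = build_opener_with_auth('http://192.168.0.1/', username, password) followed by do_form_action(opener, CONNECT). Keeping the request building separate from the network call makes it easy to inspect what would be sent before pointing it at a real device.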

alanb · ‎24-05-2007

A few further thoughts for you ...
1. You probably should check if the web site is using standard HTTP authentication methods or something proprietary. Examining the source code for the web page you are attempting to load should give you some insight into this. If it is proprietary, you will not be able to rely on the authentication functions in urllib and mechanize.
2. Check the HTTP response headers returned after you attempt to authenticate. If you are getting a '401 Unauthorized', that is an indication that the server is using a standard HTTP authentication method. Some of the fields in that 401 response may tell you what went wrong. It may be that the library does not support the authentication method the server wants or perhaps the wrong character set is being used.
3. If the server expects HTTP authentication, your first attempt to 'GET' a page will result in a 401 response. It will include a header field called 'WWW-Authenticate', which will tell your client what authentication method to use and what realm to provide authentication for. To authenticate, your client must re-submit the 'GET' with an 'Authorization' field, containing the encoded credentials, added to the original request. The server will send a new response, typically another 401 if it is unhappy or a '200 OK' if it is happy. Python's urllib handles all this transparently. I don't know how mechanize handles it, but it would be worth investigating its debugging options to see if you can use them to confirm this interaction is happening.
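
To make the header exchange in point 3 concrete, here is a small sketch of what a client does for the simplest scheme, HTTP Basic. It is written for Python 3 and assumes Basic authentication only (Digest works differently). The function names basic_auth_header and parse_challenge are invented for illustration, and the challenge parser only copes with a single scheme and a quoted realm.

```python
import base64


def basic_auth_header(username, password):
    """Return the value of the Authorization header for HTTP Basic auth.

    The credentials are only base64-encoded, not encrypted.
    """
    credentials = '{}:{}'.format(username, password).encode('utf-8')
    return 'Basic ' + base64.b64encode(credentials).decode('ascii')


def parse_challenge(www_authenticate):
    """Split a simple challenge such as 'Basic realm="router"' into
    (scheme, realm). Returns realm=None if no realm parameter is present."""
    scheme, _, params = www_authenticate.partition(' ')
    realm = None
    for part in params.split(','):
        key, _, value = part.strip().partition('=')
        if key.lower() == 'realm':
            realm = value.strip('"')
    return scheme, realm
```

Comparing the output of basic_auth_header with the 'Authorization' line your library actually sends (via its debug output or wireshark on a plain http connection) is a quick way to confirm whether the exchange is happening at all.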

ndb · ‎01-06-2013

Thanks for the info so far, but I don't think it's http authentication. The site I'm trying to log into is this plus.net broadband usage stats:
https://www.plus.net/view_my_broadband_usage/index.php
It's just a form with username and password but submitted over an https connection (running over TLS according to wireshark).
I've got one python script that tries to use the urllib2 method to submit this form, and also another python script that tries to do it with mechanize. Neither works. urllib2 gets further but the https response that it returns contains html with 'invalid username/password'.
When I use Firefox > Firebug to check the form submission, it shows two extra values being submitted (x=<num>, y=<num>) though I don't know where these come from as the Page Source does not show any (exposed) javascript setting these values. Maybe they are just page coordinates or something?
The 2nd consideration is the cookie handling. Firefox > Firebug does show quite a few complicated-looking cookies going back to the web server, but urllib2 and mechanize are both supposed to be able to handle cookies by default?
Please let me know if you would like me to paste in my python code...

alanb · ‎24-05-2007

It is difficult for me to offer any substantive comments on mechanize, as I haven't used it, but I am wondering if your expectation of the capabilities of mechanize and urllib is correct. Perhaps you should treat what I write below as just me thinking out loud. I don't want to put a damper on your ideas, but are you sure that mechanize processes the HTML elements and other elements in the page object when you retrieve it? I ask because urllib doesn't do that.
Urllib does not emulate a web browser. I am not altogether convinced that mechanize will either. When urllib receives a page from the server, that page object is all you get. Urllib doesn't process any of the sub-elements in the document (it does not retrieve images or execute Javascript, for example). It leaves all that for the Python code to carry out (or ignore) as the programmer chooses. Which means the Python code would have to parse the HTML, find all the additional elements that have to be retrieved, then retrieve each element separately, and similarly with script elements. It is really just an implementation of the HTTP request/response mechanism. HTML processing has to be done by other code. Having briefly looked through the mechanize web site yesterday, my impression is that it does much the same as urllib but in a more structured and flexible way.
If I am correct, you may have to find a way to execute some or all of the Javascript elements from your target web page in Python in order to post a valid response to the server and get logged in. I have no idea if you can run a Javascript interpreter from Python. My urllib example only works because it is a very simple static HTML form, which I have in reality ignored and responded instead with a correctly formed POST response containing the variables that a browser would have created for that form. You may be able to do something similar, but on the other hand the server may refuse to play if certain Javascript elements have not been run.
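
As an illustration of building the POST data yourself from the form, here is a rough Python 3 sketch using only the standard library. It is an assumption-laden example, not anything taken from the plus.net page: the class FormFieldCollector and the function build_post_body are hypothetical names, and it naively collects every named <input> on the page rather than the fields of one particular form.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs from every named <input> element in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attributes = dict(attrs)
        name = attributes.get('name')
        if name:
            # Inputs with no value attribute are treated as empty strings
            self.fields[name] = attributes.get('value') or ''


def build_post_body(page_html, overrides):
    """Return URL-encoded form data: the page's own fields (including any
    hidden ones) with the caller's values, e.g. username and password,
    filled in on top."""
    collector = FormFieldCollector()
    collector.feed(page_html)
    fields = dict(collector.fields)
    fields.update(overrides)
    return urlencode(fields)
```

The point of picking up the page's own fields is that hidden inputs get submitted along with the ones you fill in, which is what a browser would do. Anything that Javascript adds at submit time would still be missing.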
The examples on the mechanize web site seem to me to be a bit too basic to get a broad understanding of its abilities. If you haven't already looked for other examples, then perhaps it would be beneficial to search for some real life examples (a simple browser implemented with mechanize would be ideal) to give yourself a better understanding of how your program needs to be structured and what callbacks mechanize expects you to handle in Python code to interact correctly with a server.
Have you tried your code with a simpler page? Ideally without any Javascript to start off. If you can get a simple page working, doing a POST or PUT request, you will at least know that your ideas about using mechanize are on the right track.

alanb · ‎24-05-2007

I take back half of what I wrote last night. I downloaded mechanize and had a look through its source code. It seems ideal for what you are trying to do. It does handle HTML. It uses the beautifulsoup module to process the HTML, along with a bit of custom code, it seems.
However, I'd still be suspicious that the Javascript elements are getting in your way. I'd be inclined to take a closer look at beautifulsoup too if I were you. I think it is likely that it will be ignoring Javascript.

alanb · ‎24-05-2007

After looking more closely at the source for mechanize, I have found a comment to the effect that it does not support Javascript - see _mechanize.py.
If you can somehow confirm that your login refusal is related to this, you might be able to use Python code to emulate what the Javascript would have done. My HTML and scripting is a bit rusty, but I'll have a quick look at the web page source later.

ndb · ‎01-06-2013

Thanks for the help and encouragement to solve this. I got it working in the end with urllib2, which seems to be more reliable than mechanize. I tested with javascript disabled and verified again with firebug. It turned out not to be a javascript issue.
It was a cookie issue. It seems the main problem was the tiny detail of having to run the form submission twice, once to get the cookie jar and again to submit the form. After a lot of time searching google, the following code works and is the only combination that appears to configure the cookie jar correctly:
import os
import urllib
import urllib2
import cookielib

cookie_filename = "cookie_file"
username = 'guest'
password = 'guest'
x = '0'
y = '0'

cj = cookielib.MozillaCookieJar(cookie_filename)
if os.path.exists(cookie_filename):
    cj.load() # reuse cookies saved by an earlier run, if there are any

opener = urllib2.build_opener(
    urllib2.HTTPRedirectHandler(),
    urllib2.HTTPHandler(debuglevel=0),
    urllib2.HTTPSHandler(debuglevel=0),
    urllib2.HTTPCookieProcessor(cj) # stores and re-sends cookies for every request
)
opener.addheaders = [('User-agent', "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:21.0) Gecko/20100101 Firefox/21.0")]

url = "https://portal.plus.net/view_my_broadband_usage/index.php"
form = { "username" : username,
         "authentication_realm" : "portal.plus.net",
         "password" : password,
         "x" : x,
         "y" : y}

encodedForm = urllib.urlencode(form)
response = opener.open(url, encodedForm) # submit once for cookie jar
response = opener.open(url, encodedForm) # and again for form
cj.save() # write the cookies collected above back to cookie_file

file_handle = open("output.html", "w")
file_handle.write(response.read())
file_handle.close()

alanb · ‎24-05-2007

Well done!
After looking at the web page I was coming round to your view that mechanize is not getting all the cookies it needs. I thought the Javascript elements in the head section may be setting some extra cookies. I was about to suggest you try logging into the member centre using a browser with Javascript disabled to see if you could eliminate it as the cause of the problem.

alanb · ‎24-05-2007

Another thought ...

Quote from: ndb
... It seems the main problem was the tiny detail of having to run the form submission twice, ...

In HTTP protocol terms, you have to do a 'GET https://portal.plus.net/view_my_broadband_usage/index.php' to retrieve the page, and cookies come along for the ride. Then you do a PUT or POST as needed by the form to submit the data.
I think that your first opener.open might be more correct without the encoded data. Without the data, urllib will do a GET operation, with data it will use POST. Though it obviously seems to work as you have it, so it looks like the server is handling the error gracefully. It's not a big problem but it could stop working in the future if the server is changed.
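
Here is a rough sketch of that GET-then-POST sequence, written for Python 3 (urllib.request and http.cookiejar are the Python 3 homes of urllib2 and cookielib). It is not tested against plus.net, and the function names make_requests, fetch_then_login and build_cookie_opener are invented for the example.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def make_requests(url, form):
    """Return (get_request, post_request) for the same URL."""
    get_request = urllib.request.Request(url)              # no data -> GET
    body = urllib.parse.urlencode(form).encode('ascii')
    post_request = urllib.request.Request(url, data=body)  # data -> POST
    return get_request, post_request


def build_cookie_opener():
    """Return (opener, jar); the opener stores and re-sends cookies via jar."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch_then_login(opener, url, form):
    """GET the page first so the cookie jar is populated, then POST the form.

    Returns the response to the POST.
    """
    get_request, post_request = make_requests(url, form)
    opener.open(get_request).close()   # first trip: collect cookies only
    return opener.open(post_request)   # second trip: submit the form
```

With this arrangement the first request carries no credentials at all, which is closer to what a browser does when it loads the login page before you type anything in.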
