Lecture 5: Databases, Cookies, Server Modules --------------------------------------------- Databases: Last time we introduced several different ways you can store state information on a web server. We talked about one in particular, where the server sends all information back at the browser. Let's visit the first two again: 1. Store state information in files on the web server. This works for small amounts of data, but you'll quickly run into some limitations if you try to make it work in a more general setting, especially if you're reading from the files. The problems are: - It's too much of a pain to organize and process data. If you use file I/O for every access, you either have to write a whole bunch of shortcut functions or do all of the file opens, closes, and reads by hand. This can be a pain, to say the least. - The Unix filesystem is efficient, but it won't compensate for you if you decide to put a lot of files in the same directory. We'll talk more about that in the Performance lecture. - Opening and closing files all the time isn't exactly fast. - Serious problems with data locking: You have to make up your own scheme for locking files so that two web server processes can't write to one file at the same time, and so that a server process can't read from one that's being written to by another process. - Security problems. 2. Store state information in memory. Faster than putting it all in files, but organization is a lot worse, and you have the problem that the information isn't very permanent, unless you back it up to the disk (and when you have all of the same problems listed above). Let's look at these two methods again. We've got two ways to store data but each has its disadvantages. We're looking for some permanent way to store data, with speedy access. We also want some nice way to organize everything. If this doesn't sound like a job for databases, I don't know what is. 3. Store state information on a database server (not necessarily one running on the web server). Databases exist partly because of the problems with 1. and 2. above. People have been working a long time on how to index large amounts of information and mirror it in memory for fast access. Since this is a really common task, you can buy general purpose database systems like Sybase and Oracle and they'll do what you want them to do (provided that you know how to make them do what you want them to). I know most of you have had a course in databases and you've heard all sorts of blah, blah, blah, blah about relational database theory and stuff. But for this class, all you need to know is "databases are just a bunch of tables." Clearly, there are some people who would be annoyed that I'm saying this. Whatever. You might suspect that a database server is a machine that runs a database server program. We're running Sybase here on a server I call skunk. In case you're wondering, skunk is a 500MHz AMD K6-2 machine with 192MB of memory. Though the extra memory is handy to have in a database server, processor speed usually doesn't matter. You also have to be careful about setting up your server, especially with a system like Sybase. You can't just throw memory into it and expect it to handle more capacity. There are fixed limits in the server configuration files, and even though you may have more physical memory in the server, if you don't set the configuration up to take advantage of it, you won't take advantage of it. On Linux, for example, the defaults for Sybase say to use 32MB of memory. This isn't even enough to handle more than 12 open connections. [show some stuff on skunk] You may wonder why database servers are so restrictive with memory. It's to make sure that they're always capable of handling a given load without crashing the machine that they're on. If database connections fail because they have no more resources, it's much better to give temporary error messages instead of making the whole server fail, leaving the cause of failure a mystery. Another lesson to learn from this is that you shouldn't blame slow or "small" hardware for problems with your software. Anyway. - Each database server configuration has a number of databases. The database server administrator determines the databases and which database users own them. - Each database has a number of tables. The database owner normally defines the tables (and who has access to them). - Each table has a number of rows of data. - Each row is split up into a number of fields. [draw a picture] Relational databases like Sybase normally use a language called SQL to make queries. Most of all of you have learned this already, but in case you haven't, look at the examples below for the idea. It's a really easy language since there are only a couple of commands: SELECT, INSERT, CREATE, DELETE, DROP, UPDATE.. What I do want to show you, though, is how to access the Sybase server in languages like Perl, Python, and PHP. Pretty much all of you have used sqsh to type sql commands in by hand, and PHP to run things in web servers. But there's much more that you can do, especially with Python and Perl. What you need to know in each case is: 1. How to open and close a Sybase connection. 2. How to send commands to that Sybase connection. 3. How to read data from the query commands into variables on your scripting language. You MUST have the SYBASE environment variable set to /opt/sybase/default in order for any of this stuff to work. It's very important to note that you'll need to set that environment variable in the startup scripts that start up your httpd. In sh, do this: SYBASE=/opt/sybase/default export SYBASE If anyone asks why their stuff isn't working, I'm going to send them to this place in the notes (you get tired of saying it over and over again). Python: You'll use the Sybase extension module for Python access. Here's a really simple example to grab all of the titles out of the pubs2 example database that comes with Sybase: #!/usr/local/bin/python import Sybase mydb = Sybase.connect('skunk', 'bri', 'blahblah') mydb.execute("use pubs2") c = mydb.cursor() c.execute("select title from titles") t = c.fetchone() while t: print t[0] t = c.fetchone() mydb.close() A few things to note here: - You get a "Connect" object from Sybase.connect. Do a dir(mydb) after you assign mydb to see its methods. - There's an execute() method for Connect objects. Do all of your Sybase and SQL commands with that method (like the mydb.execute("use pubs2")), EXCEPT SELECT commands. - For reasons which you don't want to know, you must create a cursor object with the cursor() method before you can actually get data back from a SELECT command. - Use a cursor's fetchone() method to grab one row of data. The data is returned as a list of tuples, even if there is only one element in each tuple. Hence the t[0] access. You can also slurp up all of the data at once with the fetchall() command. #!/usr/local/bin/python import Sybase mydb = Sybase.connect('skunk', 'bri', 'blahblah') mydb.execute("use pubs2") c = mydb.cursor() c.execute("select * from titles") for t in c.fetchall(): print "Title: %s. Description:\n%s" % (t[1], t[3]) print mydb.close() You probably want to fool around on the Python command line to figure out what the result of a call like fetchall() does. Perl/Sybase: It's best to illustrate this sucker by an example: #!/usr/local/bin/perl use DBI; $dbhandle = DBI->connect("dbi:Sybase:server=skunk", "bri", "blahblah"); $dbhandle->do("use pubs2") || die "whoops\n"; $sth = $dbhandle->prepare("select * from titles"); $sth->execute; while (@row = $sth->fetchrow_array) { print "Name: $row[1]\n Inventory ID: $row[0]\n"; } $dbhandle->disconnect; You can see that you need to use DBI->connect() to connect, where you put in your password, how you execute commands that don't return data versus commands that do return data, and how to extract that data. There's one more thing I'd like to say about this stuff, and it's that scripting languages are really handy for dumping massive amounts of data into a database at once. For example, this Perl script can be run on /usr/dict/words to throw all of the words ending with "y" (and its line number) into a table called somestuff: #!/usr/local/bin/perl use DBI; $dbhandle = DBI->connect("dbi:Sybase:server=skunk", "bri", "blahblah"); $dbhandle->do("use bri") || die "whoops\n"; $dbhandle->do("DROP TABLE somestuff") || print "hmm.. table didn't exist"; $dbhandle->do("CREATE TABLE somestuff (name CHAR(20), quantity INT)"); $i = 1; while (<>) { chop; if (/y$/) { $dbhandle->do("INSERT INTO somestuff VALUES(\"$_\", $i)"); } $i++; } $dbhandle->disconnect; So keep this in mind.. Cookies: -------- First, let's identify the problem. We've got a nice web site up, and we're able to personalize it for a user to a certain degree. But how does a web site see a returning user the next time they type the URL into their browser? If I go to amazon.com, how does it know that I'm Brian Ward? Well, with what we have right now, we can approach this as follows: 1. Give the user a userid and password and make them login as soon as they enter the site. 2. Tack an extra parameter on to every link and form onto every generated page identifying that user the next time the user logs in. This is kind of fuggly. You don't want to login every time you go to a web site, that's bogus. Maybe you could get around that with a special bookmark, but how secure is that? (Hint: not very.) You also can't identify a user based on the IP address of the machine that they're coming from because several users may use the same machine or proxy server. By now, you should probably be thinking that the only way to identify yourself to a web site is to send some piece of information unique to you to the server every time you make a request to the server. That piece of information is called a "cookie" on the web, and it's somewhat controversial. The browser doesn't make up the cookie; the web server does. The server puts it the HTTP header: Set-Cookie: =; path=/; domain=.example.com; expires= for example: Set-Cookie: yourcookie=3429afdasfal525; path=/; domain=.example.com; expires=Monday, 30-Apr-2001 08:00:00 GMT [Talk to the amazon.com server] When the browser gets this, it puts the information in a file on your system. For your computer science account, it's in your .netscape/cookies file. Check it out sometime. If you're generating all of the Set-Cookie header by hand, make sure you do it right. Web browsers are strict about the format, and if the cookie doesn't conform, the browser won't accept it. In particular: - The domain must match the web server's host domain. - The header must come before the Content-Type (and probably a lot of other stuff) - The expiry time must be valid. For a port, like on your web servers, I think you just tack it onto the server name.. You can set as many cookies in a header as you want. You can also set a number of parameters in the same cookie line. The web browser sends the cookie in its HTTP request whenever the information in the cookie matches the request that it's about to make. So why is this controversial? Well, it's a potential invasion of privacy. Sure, an online store can take a look at each individual item that you look at, and in which order, and so on.. but that shouldn't concern you all that much. There are two practices which might bug you: 1. Companies sharing cookies and email addresses to correlate activity. There's not much you can do about this except not deal with slimy companies, though you wouldn't believe the kinds of things some people do with credit card numbers.. 2. Companies putting image (ad) banners on their pages which are actually on other sites. When you get one of those, they can send a cookie too, and they can also figure out where the original page was. Get enough banners, you get a lot of interesting information. You can, however, do something about this: go into your preferences and set the cookies to only accept the cookies coming from the sender of the original document. That said, cookies are just really useful for stuff like shopping carts. Here's why: when you set a cookie with a web server, you can set it to a unique identifier. Then you can put that identifier along with shopping cart data into table on a database. Let's do a really simple cookie program: #!/usr/local/bin/perl print <<"DONE"; Set-Cookie: mycookie=likehi; path=/; domain=mikado.aem7.net; expires=Monday, 30-Apr-2001 08:00:00 GMT Content-type: text/html

You've been cookied.

DONE Sure enough, when we look in our .netscape/cookies file, the cookie is there. You want the domain to match the full hostname and port of your web server (you may want to set up virtual hosts on your web server to redirect anything without a fully qualified domain name to the fully qualified one). Because working with these cookies is kind of painful to do by hand (try to do it sometime.. especially with that date stuff), there are some library tools in the languages we've studied so far to help you out. Let's look at Perl's with a really simple example. (use "perldoc CGI" to find out all of the options..) #!/usr/local/bin/perl use CGI; $query = new CGI; # first see if we've got a cookie already $band = $query->cookie('favband'); # then see if we're setting the cookie to something else. $newband = $query->param('thebest'); if ($newband) { # setting a new band # make up a new cookie $cookie = $query->cookie(-name=>"favband", -value=>$newband, -expires=>"+5d", # you'll forget in 5 days -path=>"/", -domain=>"mikado.aem7.net"); print $query->header(-cookie=>$cookie); } else { # don't need to set a cookie print $query->header; } print "\n"; if ($band && $newband && ($band ne $newband)) { print "Changing your favorite from $band to $newband, I see...

\n"; print "Well, if you change your mind, click on a new one below

\n"; } elsif ($band) { print "Welcome back! I see your favorite is still $band..

\n"; print "Want to change it?

\n"; } elsif ($newband) { print "Setting your favorite to $newband..

\n"; print "Well, if you change your mind, click on a new one below

\n"; } else { print "Hi there. Pick your favorite band.

\n"; } print <<"DONE";

Ramones
Beatles
GWAR
Backstreet Boys

DONE Python: Look for the example using the Cookie module off the lecture notes on the main class page. Server Modules/PHP ------------------ Most people do web programming these days not with CGIs (which, really, when it comes down to it, are kind of a pain), but with server modules. A server module is some kind of package that the web server process loads directly into its process space with dynamic linking. So the new code is part of the web server process and the web server doesn't need to start up any new programs. Aside from this gain in efficiency, most of these server modules have a slightly different model of programming. Instead of writing a program to print out an HTML document, which is what a CGI does, these things let you write an HTML document and embed little pieces of code inside. For example, here's something in PHP:

Some numbers

Weren't they nice?

When the web server gets a PHP file, it parses the file like regular HTML until it gets to a marker, and at that point, it goes back to spitting out the HTML. During PHP execution, you use the print function to generate additional HTML output. (There are also some extra functions for working with the header a little.) Just remember: the browser does NOT see any PHP code at any time. The web server returns only HTML (or at least something which the programmer thinks is HTML). This turns out to be a really convenient. Instead of thinking about writing programs all the time, when you write these things, you can think about writing the document and the program together. These server side modules are pretty popular not just because they're easy, but also because they're fast and they've been fined-tuned to provide almost any function that you'd ever want. So PHP is an example of this. You may have heard of ASP. Well, that's just the same thing, except that since it's a microsoft thing, you normally put an awful language like Visual Basic instead of PHP inside the delimiters (also, the default ASP delimiters are <% and %>, not ). But the fact is: if you know PHP, you know ASP. There are some others, like mod_perl, mod_py, and Cold Fusion. Other server modules (which don't quite work like this) are the Java-based ones. There's another class working with these, so I won't go into those. [next part is all slides]