Lecture 5: Databases, Cookies, Server Modules
---------------------------------------------

Databases:
Last time we introduced several different ways you can store state
information on a web server. We talked about one in particular, where the
server sends all information back at the browser.

Let's visit the first two again:

1. Store state information in files on the web server.

This works for small amounts of data, but you'll quickly run into some
limitations if you try to make it work in a more general setting,
especially if you're reading from the files. The problems are:

- It's too much of a pain to organize and process data. If you use file
  I/O for every access, you either have to write a whole bunch of
  shortcut functions or do all of the file opens, closes, and reads by
  hand. This can be a pain, to say the least.
- The Unix filesystem is efficient, but it won't compensate for you if you
  decide to put a lot of files in the same directory. We'll talk more
  about that in the Performance lecture.
- Opening and closing files all the time isn't exactly fast.
- Serious problems with data locking: You have to make up your own scheme
  for locking files so that two web server processes can't write to one
  file at the same time, and so that a server process can't read from one
  that's being written to by another process.
- Security problems.

2. Store state information in memory.

Faster than putting it all in files, but organization is a lot worse, and
you have the problem that the information isn't very permanent, unless you
back it up to the disk (and when you have all of the same problems listed
above).


Let's look at these two methods again. We've got two ways to store data
but each has its disadvantages. We're looking for some permanent way to
store data, with speedy access. We also want some nice way to organize
everything.

If this doesn't sound like a job for databases, I don't know what is.

3. Store state information on a database server (not necessarily one
   running on the web server).

Databases exist partly because of the problems with 1. and 2. above. People
have been working a long time on how to index large amounts of information
and mirror it in memory for fast access. Since this is a really common
task, you can buy general purpose database systems like Sybase and Oracle
and they'll do what you want them to do (provided that you know how to make
them do what you want them to).

I know most of you have had a course in databases and you've heard all
sorts of blah, blah, blah, blah about relational database theory and stuff.
But for this class, all you need to know is "databases are just a bunch
of tables." Clearly, there are some people who would be annoyed that I'm
saying this. Whatever.

You might suspect that a database server is a machine that runs a database
server program. We're running Sybase here on a server I call skunk.

In case you're wondering, skunk is a 500MHz AMD K6-2 machine with 192MB of
memory. Though the extra memory is handy to have in a database server,
processor speed usually doesn't matter. You also have to be careful about
setting up your server, especially with a system like Sybase. You can't just
throw memory into it and expect it to handle more capacity. There are fixed
limits in the server configuration files, and even though you may have more
physical memory in the server, if you don't set the configuration up to
take advantage of it, you won't take advantage of it.

On Linux, for example, the defaults for Sybase say to use 32MB of memory.
This isn't even enough to handle more than 12 open connections.

[show some stuff on skunk]

You may wonder why database servers are so restrictive with memory. It's to
make sure that they're always capable of handling a given load without
crashing the machine that they're on. If database connections fail because
they have no more resources, it's much better to give temporary error
messages instead of making the whole server fail, leaving the cause of
failure a mystery.

Another lesson to learn from this is that you shouldn't blame slow or
"small" hardware for problems with your software.

Anyway.

- Each database server configuration has a number of databases. The database
  server administrator determines the databases and which database users own
  them.
- Each database has a number of tables. The database owner normally defines
  the tables (and who has access to them).
- Each table has a number of rows of data.
- Each row is split up into a number of fields.

[draw a picture]

Relational databases like Sybase normally use a language called SQL to make
queries. Most of all of you have learned this already, but in case you
haven't, look at the examples below for the idea. It's a really easy
language since there are only a couple of commands: SELECT, INSERT, CREATE,
DELETE, DROP, UPDATE..

What I do want to show you, though, is how to access the Sybase server in
languages like Perl, Python, and PHP. Pretty much all of you have used sqsh
to type sql commands in by hand, and PHP to run things in web servers. But
there's much more that you can do, especially with Python and Perl.

What you need to know in each case is:

1. How to open and close a Sybase connection.
2. How to send commands to that Sybase connection.
3. How to read data from the query commands into variables on your
   scripting language.

You MUST have the SYBASE environment variable set to /opt/sybase/default
in order for any of this stuff to work. It's very important to note that
you'll need to set that environment variable in the startup scripts that
start up your httpd. In sh, do this:

SYBASE=/opt/sybase/default
export SYBASE

If anyone asks why their stuff isn't working, I'm going to send them to
this place in the notes (you get tired of saying it over and over again).


Python:
You'll use the Sybase extension module for Python access.

Here's a really simple example to grab all of the titles out of the pubs2
example database that comes with Sybase:

#!/usr/local/bin/python

import Sybase
mydb = Sybase.connect('skunk', 'bri', 'blahblah')
mydb.execute("use pubs2")

c = mydb.cursor()
c.execute("select title from titles")
t = c.fetchone()
while t:
    print t[0]
    t = c.fetchone()

mydb.close()

A few things to note here:

- You get a "Connect" object from Sybase.connect. Do a dir(mydb) after you
  assign mydb to see its methods.
- There's an execute() method for Connect objects. Do all of your Sybase
  and SQL commands with that method (like the mydb.execute("use pubs2")),
  EXCEPT SELECT commands.
- For reasons which you don't want to know, you must create a cursor
  object with the cursor() method before you can actually get data back from
  a SELECT command.
- Use a cursor's fetchone() method to grab one row of data. The data is
  returned as a list of tuples, even if there is only one element in each
  tuple. Hence the t[0] access.

You can also slurp up all of the data at once with the fetchall() command.

#!/usr/local/bin/python

import Sybase
mydb = Sybase.connect('skunk', 'bri', 'blahblah')
mydb.execute("use pubs2")

c = mydb.cursor()
c.execute("select * from titles")
for t in c.fetchall():
    print "Title: %s. Description:\n%s" % (t[1], t[3])
    print

mydb.close()

You probably want to fool around on the Python command line to figure out
what the result of a call like fetchall() does.


Perl/Sybase:

It's best to illustrate this sucker by an example:

#!/usr/local/bin/perl

use DBI;
$dbhandle = DBI->connect("dbi:Sybase:server=skunk", "bri", "blahblah");
$dbhandle->do("use pubs2") || die "whoops\n";
$sth = $dbhandle->prepare("select * from titles");
$sth->execute;

while (@row = $sth->fetchrow_array) {
    print "Name: $row[1]\n  Inventory ID: $row[0]\n";
}

$dbhandle->disconnect;

You can see that you need to use DBI->connect() to connect, where you put
in your password, how you execute commands that don't return data versus
commands that do return data, and how to extract that data.

There's one more thing I'd like to say about this stuff, and it's that
scripting languages are really handy for dumping massive amounts of data
into a database at once. For example, this Perl script can be run on
/usr/dict/words to throw all of the words ending with "y" (and its line
number) into a table called somestuff:

#!/usr/local/bin/perl

use DBI;

$dbhandle = DBI->connect("dbi:Sybase:server=skunk", "bri", "blahblah");
$dbhandle->do("use bri") || die "whoops\n";
$dbhandle->do("DROP TABLE somestuff") || print "hmm.. table didn't exist";
$dbhandle->do("CREATE TABLE somestuff (name CHAR(20), quantity INT)");

$i = 1;
while (<>) {
    chop;
    if (/y$/) {
        $dbhandle->do("INSERT INTO somestuff VALUES(\"$_\", $i)");
    }
    $i++;
}

$dbhandle->disconnect;

So keep this in mind..


Cookies:
--------
First, let's identify the problem.

We've got a nice web site up, and we're able to personalize it for a user
to a certain degree. But how does a web site see a returning user the next
time they type the URL into their browser?

If I go to amazon.com, how does it know that I'm Brian Ward?

Well, with what we have right now, we can approach this as follows:

1. Give the user a userid and password and make them login as soon as they
   enter the site.
2. Tack an extra parameter on to every link and form onto every generated
   page identifying that user the next time the user logs in.

This is kind of fuggly. You don't want to login every time you go to a web
site, that's bogus. Maybe you could get around that with a special
bookmark, but how secure is that? (Hint: not very.)

You also can't identify a user based on the IP address of the machine that
they're coming from because several users may use the same machine or proxy
server.

By now, you should probably be thinking that the only way to identify
yourself to a web site is to send some piece of information unique to you
to the server every time you make a request to the server. That piece of
information is called a "cookie" on the web, and it's somewhat
controversial.

The browser doesn't make up the cookie; the web server does. The server
puts it the HTTP header:

Set-Cookie: <parameter>=<value>; path=/; domain=.example.com; expires=<expire time>

for example:

Set-Cookie: yourcookie=3429afdasfal525; path=/; domain=.example.com; expires=Monday, 30-Apr-2001 08:00:00 GMT

[Talk to the amazon.com server]

When the browser gets this, it puts the information in a file on your
system. For your computer science account, it's in your .netscape/cookies
file. Check it out sometime.

If you're generating all of the Set-Cookie header by hand, make sure you do
it right. Web browsers are strict about the format, and if the cookie
doesn't conform, the browser won't accept it. In particular:

- The domain must match the web server's host domain.
- The header must come before the Content-Type (and probably a lot of other
  stuff)
- The expiry time must be valid.

For a port, like on your web servers, I think you just tack it onto the
server name..

You can set as many cookies in a header as you want. You can also set
a number of parameters in the same cookie line.

The web browser sends the cookie in its HTTP request whenever the
information in the cookie matches the request that it's about to make.

So why is this controversial? Well, it's a potential invasion of privacy.
Sure, an online store can take a look at each individual item that you look
at, and in which order, and so on.. but that shouldn't concern you all that
much. There are two practices which might bug you:

1. Companies sharing cookies and email addresses to correlate activity.
   There's not much you can do about this except not deal with slimy
   companies, though you wouldn't believe the kinds of things some
   people do with credit card numbers..
2. Companies putting image (ad) banners on their pages which are actually on
   other sites. When you get one of those, they can send a cookie too, and
   they can also figure out where the original page was. Get enough banners,
   you get a lot of interesting information. You can, however, do something
   about this: go into your preferences and set the cookies to only accept
   the cookies coming from the sender of the original document.

That said, cookies are just really useful for stuff like shopping carts.
Here's why: when you set a cookie with a web server, you can set it to a
unique identifier. Then you can put that identifier along with shopping
cart data into table on a database.

Let's do a really simple cookie program:

#!/usr/local/bin/perl

print <<"DONE";
Set-Cookie: mycookie=likehi; path=/; domain=mikado.aem7.net; expires=Monday, 30-Apr-2001 08:00:00 GMT
Content-type: text/html

<html><head></head><body>
<h1>You've been cookied.</h1>
</body></html>
DONE

Sure enough, when we look in our .netscape/cookies file, the cookie is
there. You want the domain to match the full hostname and port of your web
server (you may want to set up virtual hosts on your web server to redirect
anything without a fully qualified domain name to the fully qualified one).

Because working with these cookies is kind of painful to do by hand (try to
do it sometime.. especially with that date stuff), there are some library
tools in the languages we've studied so far to help you out. Let's look at
Perl's with a really simple example. (use "perldoc CGI" to find out all of
the options..)

#!/usr/local/bin/perl

use CGI;
$query = new CGI;

# first see if we've got a cookie already
$band = $query->cookie('favband');

# then see if we're setting the cookie to something else.
$newband = $query->param('thebest');

if ($newband) {		# setting a new band
    # make up a new cookie
    $cookie = $query->cookie(-name=>"favband", -value=>$newband,
			     -expires=>"+5d", 	# you'll forget in 5 days
			     -path=>"/",
			     -domain=>"mikado.aem7.net");
    print $query->header(-cookie=>$cookie);
} else {		# don't need to set a cookie
    print $query->header;
}

print "<html><head></head><body>\n";

if ($band && $newband && ($band ne $newband)) {
    print "Changing your favorite from $band to $newband, I see...<p>\n";
    print "Well, if you change your mind, click on a new one below<p>\n";
} elsif ($band) {
    print "Welcome back! I see your favorite is still $band..<p>\n";
    print "Want to change it?<p>\n";
} elsif ($newband) {
    print "Setting your favorite to $newband..<p>\n";
    print "Well, if you change your mind, click on a new one below<p>\n";
} else {
    print "Hi there. Pick your favorite band.<p>\n";
}

print <<"DONE";
<form action=remem.cgi>
<input type=radio name=thebest value=Ramones>Ramones<br>
<input type=radio name=thebest value=Beatles>Beatles<br>
<input type=radio name=thebest value=GWAR>GWAR<br>
<input type=radio name=thebest value=Backstreet_Boys>Backstreet Boys<br>
<input type=submit>
</form>
<p>
</body></html>
DONE

Python: Look for the example using the Cookie module off the lecture notes
on the main class page.

Server Modules/PHP
------------------
Most people do web programming these days not with CGIs (which, really,
when it comes down to it, are kind of a pain), but with server modules.
A server module is some kind of package that the web server process loads
directly into its process space with dynamic linking. So the new code is
part of the web server process and the web server doesn't need to start up
any new programs.

Aside from this gain in efficiency, most of these server modules have a
slightly different model of programming. Instead of writing a program to
print out an HTML document, which is what a CGI does, these things let you
write an HTML document and embed little pieces of code inside. For example,
here's something in PHP:

<html><head></head><body>

<h1>Some numbers</h1>

<!-- up to here, it's been HTML -->

<?php
  /* this is now php code */
  $i = 0;
  while ($i < 10) {
    print "$i ";
    $i++;
  }
?>

<!-- back to HTML -->

<p>
Weren't they nice?<p>
</body></html>

When the web server gets a PHP file, it parses the file like regular HTML
until it gets to a <?, at which point it drops into PHP and starts running
the PHP code until the PHP parser hits the ?> marker, and at that point, it
goes back to spitting out the HTML. During PHP execution, you use the print
function to generate additional HTML output. (There are also some extra
functions for working with the header a little.)

Just remember: the browser does NOT see any PHP code at any time. The web
server returns only HTML (or at least something which the programmer thinks
is HTML).

This turns out to be a really convenient. Instead of thinking about writing
programs all the time, when you write these things, you can think about
writing the document and the program together.

These server side modules are pretty popular not just because they're easy,
but also because they're fast and they've been fined-tuned to provide
almost any function that you'd ever want.

So PHP is an example of this. You may have heard of ASP. Well, that's just
the same thing, except that since it's a microsoft thing, you normally put
an awful language like Visual Basic instead of PHP inside the delimiters
(also, the default ASP delimiters are <% and %>, not <? and ?>). But the
fact is: if you know PHP, you know ASP.

There are some others, like mod_perl, mod_py, and Cold Fusion. Other server
modules (which don't quite work like this) are the Java-based ones. There's
another class working with these, so I won't go into those.

[next part is all slides]