Web Programming Lecture 1 Notes:

Administrivia:
--------------

blah blah..

What the course is about:
-------------------------

Fundamentals of web servers, web server development.

What are you concerned about when you visit a web site?

- Ease-of-use and Readability
- Security
- Cost (human + physical resources)
- Privacy (up to how evil you want to be)
- Performance
- Reliability
- Content (up to how incompetent you want to be)

Well, how do we get there?

A web server is like a jigsaw puzzle in one way: it's got a bunch of pieces that you need to fit together. However, it's unlike any puzzle you've ever had to work on before; you need to file some pieces at the edges to make them fit, chop up other pieces, and you'll end up making some of them yourself. If this is starting to sound like a hack, that's good, because that's all the web is. For some reason, history has shown computer scientists that it's usually better to do things the wrong way.

History of the web:
-------------------

Does anyone remember the internet before the web? Well, I do. We had a lot of different kinds of communications protocols, like:

- DNS (name service)
- SMTP (mail)
- NNTP (Usenet/news)
- telnet (remote access)
- ftp (file transfer)

A significant thing to notice about all of these is that not only are they stupid and insecure, but they aren't the easiest things to use. In particular, the telnet and ftp protocols need to die a quick and horrible death.

We put software (and sometimes documents) up on our ftp servers, leaving people to figure out what the server names were - well, sometimes we told our pals. A tool called archie made looking up ftp servers easier later on. Sometimes we would post our server addresses on Usenet, along with our semi-informational rants.

This was all fine and good, except that not a lot of people used the net at that time, mainly because it generally kind of sucked.
That wasn't just because we were using stupid and insecure protocols, but also because it was kind of hard to use. People could handle email. Using rn was daunting. There were all sorts of proposed improvements, like Gopher and WAIS (both of which no one really ever used).

Then, in 1989, Tim Berners-Lee at CERN decided that he wanted to sort of do his own thing for high-energy physics-based document management. The idea was the following: the other protocols would remain in place, but there'd only be one program to access them. And this program would need some sort of glue to bind all of the information together - it required a hypertext language so that you'd be able to "link" to the actual information.

The CERN folks showed off their first web browser in late 1990 and released it early the next year. It sort of dawdled for two years. I used it during this time (using a browser called tkWWW) to find manual pages for all sorts of different operating systems. For some reason, I thought this was neat.

Then things started to happen in 1993. A student working at NCSA (yes, down in Urbana) started fiddling around and wrote a browser called Mosaic. The first version ran on Unix's X Window System. It was immediately popular for one reason: it had pictures. When they gave demos of it, the funding people liked it (because of the pictures), and NCSA decided to port it to the Mac and Windows (remember, it was 3.1 at that time). The release of all three was in late 1993, and The New York Times responded with a front-page article describing "the internet's first killer app."

Well, that was that. I remember that day; my boss came down and said, "um, it looks like we actually have to do those web pages now."

The story from there goes like this:

- Netscape Communications Corporation formed. Their web browser, Netscape, became popular immediately because it gratuitously added all sorts of extensions to make pages even prettier.
  Netscape's intentions were all about commercialization: they wanted encrypted transmission so that you could send credit card numbers.

- Linux, which had been around for two years now, managed to get its networking code working to a point where it could be used as a web server. FreeBSD also got big. It was a lot cheaper to run a web server on a crummy PC than on an expensive Sun.

- Microsoft released Windows 95, the look of which they completely ripped off from NeXT. Of course, it still wasn't any good for web servers, but it was less pathetic than Windows 3.1 in terms of its TCP/IP performance, and it came with PPP support, so having Netscape as a client on that really bolstered internet use. By this time, Microsoft had noticed that the internet had really passed them by and started work in earnest on their own web browser, Internet Explorer.

- After a couple of versions of their Secure Sockets Layer (SSL) got hit and sunk, Netscape introduced one that was a bit more secure after getting some scientists to work on the problem for them. Commercialization really took root. E-commerce got hot.

- The air got let out of the dot-com boom and the economy. A bunch of dot-coms folded. It's hard to notice these kinds of things, though: why were there so many "web designers" out there making pages that all looked the same, anyway?

- Even though the tools for creating it have changed, "dynamic content" remains popular. That's what we're going to be talking about in this course.

HTTP:
-----

When you make a connection to a web server with your browser, you're connecting to a port using a particularly silly protocol called HTTP (Hypertext Transfer Protocol). You can't just yack at a machine and expect it to know what you're talking about.

HTTP runs over TCP/IP. A TCP service sits on a particular port on a server - and a port is just a number (there's usually a name associated with it in /etc/services on a Unix machine).
You use ports to differentiate between services. 25 is SMTP (for email transport), 22 is ssh, 23 is telnet, 79 is the finger service, and so on. HTTP usually sits at port 80, but you can put it anywhere you like (and you will in this course).

You can connect to a service using the telnet program. HTTP expects a request; here's a simple example in HTTP version 1.1:

    telnet www 80
    GET / HTTP/1.1

As you can see, you get a whole bunch of stuff back, starting with some headers:

    HTTP/1.1 200 OK
    Date: Mon, 26 Mar 2001 00:48:11 GMT
    Server: Apache/1.3.12 (Unix) PHP/4.0.4
    Last-Modified: Wed, 27 Sep 2000 17:42:36 GMT
    ETag: "6b7eb-5b9-39d2318c"
    Accept-Ranges: bytes
    Content-Length: 1465
    Connection: close
    Content-Type: text/html
    X-Pad: avoid browser bug

After that, an actual document follows. Note the Content-Type: part of the header; it tells the browser about the document's format.

The earlier HTTP version, 1.0, had some severe performance issues. To speed up loading of a bunch of images, Netscape decided that it would open up a bunch of connections and pull a bunch of stuff off a server simultaneously. This hogged resources on the server side, but it also did badly on the client side; if you were on a slow modem connection and opened a bunch of connections at once, they'd get in the way of each other, making the actual page download even slower.

In addition, after each document transfer, the server would disconnect and your browser would need to go through the pain of reconnecting. That was a serious drag if you wanted to have lots of pictures on a web page.

Not only that, people noticed that you'd need a different IP address for each web server name (increasingly called distinct "sites"), even though one web daemon on a real computer is capable of serving several sites at once. People started doing stupid things like sticking five ethernet interfaces in a single machine, running different web server processes off each interface.
(I'd just like to note again that this is dumb.)

To get around these problems, HTTP 1.1 came about. Everyone uses it. It supports not only "virtual hosts" but also persistent connections. Here's an example:

    telnet www 80
    GET / HTTP/1.1
    Host: www.cs.uchicago.edu

Notice that the connection hasn't closed right away (it will in a while). You get this stuff (explain it):

    HTTP/1.1 200 OK
    Date: Mon, 26 Mar 2001 01:15:42 GMT
    Server: Apache/1.3.12 (Unix) PHP/4.0.4
    X-Powered-By: PHP/4.0.4
    Transfer-Encoding: chunked
    Content-Type: text/html

URLs:
-----

URL stands for Uniform Resource Locator. It's that thing in the bar at the top of your browser that usually starts with "http://". The thing after the // is the server name. You can specify a different port (say, 3124) here with

    http://www.example.com:3124/

Without the port number, the browser looks at port 80. Since you won't be running your servers on port 80, you're going to need to remember this.

Web servers:
------------

A web server is a daemon that sits around listening to a TCP port. When it gets an HTTP request, it parses that request and gives a response (which we've already outlined).

Of course, it needs to figure out where to find whatever the request asked for. In the old days, this was usually just a file; the server would find the file, and after the HTTP response header, just spit the file at the connection.

These days, the server may have to do more. A lot of content isn't in a static file; it's a file that the server has to look at, figure out if there's a program inside, and run that program, sending the program's output back out the port and eventually to whatever made the request.

The most popular free server is called Apache (at http://www.apache.org/).

HTML:
-----

HTML is a bunch of tags. This part of the lecture will be all slides that I ripped off from my advisor.
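
Going back to the HTTP and URLs sections for a moment: here is a minimal Python sketch of what the telnet sessions do by hand. It pulls the server name and port out of a URL (falling back to port 80), builds a GET request with a Host: header, and splits a response into status, headers, and body. The function names (split_url, build_request, parse_head, fetch) are made up for illustration, and the sketch assumes plain http:// URLs and a server that honors "Connection: close". It does not decode a chunked body, so treat it as a toy, not a real client.

```python
import socket
from urllib.parse import urlsplit


def split_url(url):
    """Return (host, port, path) for an http:// URL; the port defaults to 80."""
    parts = urlsplit(url)
    if parts.scheme != "http":
        raise ValueError("only http:// URLs are handled in this sketch")
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return parts.hostname, parts.port or 80, path


def build_request(url):
    """Build the bytes of a GET request like the ones typed into telnet."""
    host, port, path = split_url(url)
    # The Host: header names the site, which is what lets one server
    # (and one IP address) answer for several virtual hosts.
    host_header = host if port == 80 else f"{host}:{port}"
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host_header}",
        "Connection: close",  # ask the server to hang up when it is done
        "",                   # blank line ends the request headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_head(raw):
    """Split a raw response into (status_code, headers dict, body bytes)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    status = int(status_line.split()[1])  # "HTTP/1.1 200 OK" -> 200
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status, headers, body


def fetch(url, timeout=10.0):
    """Send the request over a plain TCP socket; read until the server closes."""
    host, port, _ = split_url(url)
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_request(url))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return parse_head(b"".join(chunks))
```

For example, build_request("http://www.example.com:3124/") produces a request whose Host: header reads www.example.com:3124, and fetch() on the same URL would connect to port 3124 rather than 80. That's the bit you'll need to remember when your own server isn't on port 80.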