Deconstructing URLs

September 19th, 2010 — 8:41pm

As a follow-on to my investigation (if that’s the right word) into URL dispatch in Python frameworks I thought I would look at how an application discovers, calculates or otherwise works out what URL to use to refer to its own objects. The application wants to provide a link to an object edit page, say, so it must somehow know how to formulate such a link and where to find the contextual information that places the application in the particular environment it is running in. Let’s start by deconstructing an example.

I have an application that currently uses a URL something like{identifier}/edit. This breaks down into the scheme (http:), the network location ( and a path (/a/projects/{identifier}/edit). The path is an absolute path according to RFC 1808, because it starts with “/”, and it is this path that we want to recreate somehow.
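The same breakdown can be reproduced with Python's standard library. This is only a sketch: the host example.com and the identifier 42 are placeholders I have made up for illustration, not the real values.

```python
from urllib.parse import urlsplit

# Hypothetical URL: "example.com" and "42" are placeholders, not the real site.
url = "http://example.com/a/projects/42/edit"
parts = urlsplit(url)

print(parts.scheme)  # the scheme, reported without the trailing colon
print(parts.netloc)  # the network location
print(parts.path)    # the absolute path we want to be able to recreate
```

Note that urlsplit reports the scheme as "http" rather than "http:"; the colon is treated as a separator rather than part of the scheme.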

As it happens, this path has three different elements:

  • /a/ – I used this to manage the path that cookies belong to. In principle, any path that did not begin with /a/ could be used for static files, whereas paths that did start with /a/ would be processed by an application using the cookie to manage the session. In practice, it defines a name-space that allows us to put more than one instance of an application environment onto a single net location.
  • projects/ – This points to a particular resource handler used for project management. There could be other resource handlers running in the same environment. In effect, this part of the path could be thought of as narrowing down the selection of objects referenced by the remainder of the path. This creates another name-space that distinguishes available resources. We could, potentially, have a URL that looks like{identifier}/edit where the {identifier} in this case comes from a different set of identifiers than the projects set, and the edit element implies quite different functionality.
  • {identifier}/edit – The final element of the URL that is actually interpreted by some resource handler code to support an identified object.

The important point here is that the first two parts of this URL (/a/projects/) are irrelevant to the resource handler. This is, in effect, the SCRIPT_NAME of the CGI definition, and {identifier}/edit is the PATH_INFO. Clearly the SCRIPT_NAME can be changed to reflect the context, and it can be as long or as short as required. So long as it links to the correct code to interpret PATH_INFO, the URL works.
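To make the split concrete, here is a minimal sketch of a resource handler generating its own links from nothing more than the script portion of the path, which CGI and WSGI both pass down under the name SCRIPT_NAME. The function url_for and the sample values are my own inventions for illustration, not part of any framework.

```python
from urllib.parse import quote

def url_for(environ, identifier, action=None):
    """Build an absolute path to an object, using only SCRIPT_NAME as context."""
    # Whatever prefix the server or framework chose; the handler never inspects it.
    base = environ.get("SCRIPT_NAME", "").rstrip("/")
    # Percent-encode each segment so an odd identifier cannot break the path.
    segments = [quote(str(identifier), safe="")]
    if action:
        segments.append(quote(action, safe=""))
    return base + "/" + "/".join(segments)

# Hypothetical environment, as a CGI or WSGI server might present it.
environ = {"SCRIPT_NAME": "/a/projects", "PATH_INFO": "/42/edit"}
print(url_for(environ, 42, "edit"))
```

If the same handler is mounted somewhere else, only SCRIPT_NAME changes and the generated links follow it without any change to the handler code.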

I am, of course, making an assumption here. The URL I have deconstructed is rather old fashioned in the sense that the structure seems to represent the application in some way. I could, in principle, write /a/{identifier}/projects/edit. The /a/ still has to come first, because in my example it is being interpreted by the http client for returning cookies, but projects/ can be anywhere I like. This doesn’t make much difference, except to emphasise two things: there is going to be some part (/a/) that is dependent on the server environment, and some part (projects/) that is going to be dependent on some sort of framework environment. The underlying problem remains the same – how to feed these two parts into the URL generation process without making the resource handler aware of the details.

I need to do two different things. I want to serve more than one resource type from the same environment, and I want to run more than one environment from the same net location. The second I could solve by virtual servers. I (simply?) configure the http server to direct to one place and to another. That’s fine if I have full control over the server and is probably the ‘best’ solution. The first could also be solved in the http server if the URL is strictly hierarchical, but it can’t be avoided by limiting the site to only one resource type. Generally I am at least likely to want to refer to user objects (for access control, capturing addresses, credit cards, whatever) and, say, product objects (for the users to buy). At the very least that means choosing the object names very carefully for any particular site. On different sites, ‘users’, ‘customers’, ‘clients’, ‘patients’, or the same in the singular, may be valid options, but I have to choose just one and stick with it. (A quick read of this style guide is worth it for the reminder.)

In general, there is a three-step sequence of places that might do URL dispatch: http server, application framework, and resource handler. The http server communicates with the applications it serves using CGI. For our purposes here, the SCRIPT_NAME tells us the first part of the path we eventually want to create, so that is what we must use in the next step.

The application framework could be null if all the routing is done in the http server, but we might need to provide something if we don’t have access to the server, or if we want to be reasonably dynamic. This framework will have a URL dispatcher, and this dispatcher may support named routes. The resource handler could delegate all the routing to the framework. This works fine, because the framework extracts the useful parts of the path and presents them to the resource handler as parameters. The resource handler, however, has to ask the framework to do URL generation and this locks the resource handler firmly to the framework.
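The lock-in can be seen in a toy example. The Router class below is entirely hypothetical and not modelled on any particular framework; it only shows that, with named routes, the handler has to hold a reference to the framework's router and know the route names it registered.

```python
class Router:
    """Toy named-route table (hypothetical, for illustration only)."""

    def __init__(self, prefix=""):
        self.prefix = prefix.rstrip("/")
        self.routes = {}

    def add(self, name, template):
        # Register a path template under a route name.
        self.routes[name] = template

    def url_for(self, name, **params):
        # Fill in the template and prepend the server-dependent prefix.
        return self.prefix + self.routes[name].format(**params)

router = Router(prefix="/a")
router.add("project_edit", "/projects/{identifier}/edit")

def project_links(router, identifier):
    # The handler needs both the router object and the route name:
    # this is the dependency on the framework.
    return {"edit": router.url_for("project_edit", identifier=identifier)}

print(project_links(router, 42))
```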

I rather like the idea of a resource handler that consists of a dispatcher and code combination that is dedicated to handling a single resource type. I can plug this into a framework, or serve it directly from an http server. Of course, if the handler provides a user interface to a browser, then there may be some conventions to follow, or code to share, but that would be a necessary consideration whatever was done. The framework becomes little more than a dispatcher that looks like an http server. It creates an appropriate SCRIPT_NAME to hand down to the resource handler, and the resource handler can handle the remaining parts of the path.
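As a rough sketch of what such a thin framework might look like in WSGI terms (an assumption on my part, since the post does not commit to WSGI), the standard library's wsgiref.util.shift_path_info moves one segment from PATH_INFO onto SCRIPT_NAME. The names make_dispatcher and projects_handler are hypothetical.

```python
from wsgiref.util import shift_path_info

def make_dispatcher(handlers):
    """Return a WSGI app that routes on the next path segment only."""
    def dispatcher(environ, start_response):
        # Move one segment (e.g. "projects") from PATH_INFO to SCRIPT_NAME.
        name = shift_path_info(environ)
        handler = handlers.get(name)
        if handler is None:
            start_response("404 Not Found", [("Content-Type", "text/plain")])
            return [b"Not found"]
        # The handler sees the adjusted environ, as if served by an http server.
        return handler(environ, start_response)
    return dispatcher

def projects_handler(environ, start_response):
    # A stand-in resource handler that just reports what it was given.
    body = "SCRIPT_NAME={SCRIPT_NAME} PATH_INFO={PATH_INFO}".format(**environ)
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body.encode("utf-8")]

app = make_dispatcher({"projects": projects_handler})
```

Called with SCRIPT_NAME set to /a and PATH_INFO set to /projects/42/edit, the handler receives /a/projects and /42/edit respectively, so it can build links from SCRIPT_NAME without knowing how the prefix was assembled.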

I think I’ll work more on this idea.
