Version 1 (modified by joe, 11 years ago)

Setting up a Web Server to Serve Solar Physics Data

This is a collection of notes on common issues typical of solar physics archives serving data via HTTP. All rules below are for Apache, as in the ~7 years that I (Joe Hourclé) have been doing this, I haven't run into any solar physics archives using other server software. If you're using something different, contact me and we'll work up a similar list for your software.

Serving FITS files

By default, Apache does not know what a .fits or .fts file is, and so will mark it as text/plain, which will cause many web browsers to attempt to display the file themselves instead of downloading it or prompting to open it in another program. You can fix this by adding the following to your httpd.conf file:

AddType image/fits .fts .fits .fit .FIT .FITS .FTS

If you're not serving image data in those FITS files, and so don't want to claim they're images, you can instead use:

AddType application/fits .fts .fits .fit .FIT .FITS .FTS

If you have a directory of FITS files that do not have a file extension, you can instead tell your webserver that all files in that directory, or all files matching a given pattern, are FITS files:

<Location /URI/path/to/files/>
    ForceType application/fits
</Location>
<Directory /local/filesystem/path/to/files>
    ForceType application/fits
</Directory>
<FilesMatch "regular expression">
    ForceType application/fits
</FilesMatch>

See the Apache documentation on Configurable Sections for information about other ways to selectively apply configuration, such as <DirectoryMatch> and <LocationMatch>.
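
As a concrete illustration of the <FilesMatch> form, assuming your extensionless FITS files share a common naming prefix (the "efz" prefix here is purely hypothetical; substitute whatever pattern your files actually follow):

```apache
# Hypothetical: mark all files whose names begin with "efz"
# (a made-up mission prefix) as FITS, regardless of extension.
<FilesMatch "^efz">
    ForceType application/fits
</FilesMatch>
```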

Serving CDF, HDF or NetCDF files

If you're using other scientific file formats, you may need to add one or more of the following:

AddType application/x-netcdf .nc .cdf
AddType application/x-cdf .cdf
AddType application/x-hdf .hdf .h4 .hdf4 .h5 .hdf5 .he4 .he5

Please note that you cannot copy & paste that block in as-is, as it defines .cdf as being both CDF and NetCDF; keep only the mapping that matches your files.
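
For example, if the .cdf files on your server are NASA CDF rather than NetCDF, you would drop that extension from the NetCDF line and keep the rest:

```apache
# .cdf assigned to CDF only; NetCDF keeps just .nc
AddType application/x-netcdf .nc
AddType application/x-cdf .cdf
AddType application/x-hdf .hdf .h4 .hdf4 .h5 .hdf5 .he4 .he5
```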

You can also use the ForceType directive, as explained above under FITS files.

Customized Directory Listings

If you're using Apache's FancyIndexing to serve directory listings, the default column width is often not enough to show filenames of the length used in solar physics. You can fix this with:

IndexOptions +SuppressDescription +NameWidth=*

If you have people using 'wget' or similar to scrape your directory listings, you may also want to turn off the ability to sort columns, as the default wget settings will result in multiple requests for each directory, even when nothing has changed. The following options are recommended if you expect people to try mirroring your data:

IndexOptions +FoldersFirst +TrackModified +IgnoreClient +SuppressColumnSorting 

If you wish to adjust the style of the listings, to avoid the extra overhead of Server-Side Includes, and to reduce the bandwidth from people mirroring your data, consider using JavaScript to wrap the style around the listing. For example, for STEREO's data listing, we're already using jQuery for other functionality, so we use it to rewrite the page after it's loaded, and tell the webserver to insert the HTML necessary to call the required files:

IndexHeadInsert "<script src='/jquery/jquery-1.3.2.min.js'></script><script src='/incl2/wrap_index.js'></script>"

Hardening Your Web Server

At a bare minimum, you should prevent your web server from revealing all of the modules it has installed, with their version numbers, so you're not advertising the equivalent of an unlocked window. You can do this by finding the ServerTokens directive in your httpd.conf file, and changing it to (or inserting it, if it doesn't exist):

ServerTokens Prod

For a much more in-depth list of possible changes to make, see the security benchmarks from the Center for Internet Security, and review the appropriate document for your OS, web server, and database, if applicable. Make sure to back up your configuration first, though, as some of the recommendations are overly restrictive and may break your web server. (also note that they ask you for info about yourself, but you only have to answer the yes/no questions to download files). These are things you can do to make the system more secure, not things you necessarily should do for all servers.

A few words of warning, though:

  1. Back up your configuration directory before making changes. Yes, I said it before, and yes, this is in bold for a reason.
  2. You can test the webserver's configuration on unix-like systems with apachectl -t. You should do this as you make changes, so you're not trying to figure out which of the dozens of changes broke the configuration.
  3. If you want to restart your webserver after each change, use apachectl graceful, which won't kick off people currently downloading files and make them have to restart. (although, once you're all done, you may want to do a full stop & start)
  4. After making changes, pay more attention than normal to your web server's error logs for the next few days to see if anything strange is happening. If you're using 'piped logging', you'll likely need to check in two places, as error logs when starting up are not written to the piped logs.
  5. The section on disabling the auth* modules doesn't make it clear that you 'should not' turn off the module authz_host_module, which is required for some of the other changes they recommend.
  6. If you're using any of those authentication modules and the directives were in <IfModule> blocks, turning off the modules will turn off the access restrictions. (if you're not using IfModule, it should stop the webserver from starting back up)
  7. The section "Deny Access to OS Root Directory" telling you to deny all access to <Directory />, and then turn on access to web root neglects to tell you to also give access to the location of the icons used in directory listings, the files used for error messages, etc, which are in different places in each OS. (MacOS uses /usr/share/httpd/; CentOS uses /var/www/ )
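
For item 7, a sketch of what the corrected configuration might look like with the Apache 2.2-era Order syntax, assuming CentOS-style paths (adjust the paths for your OS):

```apache
# Deny everything by default...
<Directory />
    Order Deny,Allow
    Deny from all
</Directory>
# ...then re-open the web root. On CentOS the listing icons
# and error pages live under /var/www/, so this covers them;
# on other OSes they may live elsewhere (e.g. /usr/share/httpd/)
# and need their own <Directory> blocks.
<Directory /var/www>
    Order Allow,Deny
    Allow from all
</Directory>
```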

A Note on 'htaccess' files

Turning on support for .htaccess files means that the webserver needs to look not only in the file's directory, but in each and every directory above it as well. Yes, there's caching, but it still has to check whether they've changed, or whether new ones have been introduced. It's better to use <Directory> or <Location> sections in the main httpd.conf, which is read only when the server restarts. You can turn off .htaccess checking by setting the AllowOverride directive to:

AllowOverride None
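
In context, that directive sits inside the <Directory> block for your document root (the /var/www/html path here is just an example; use your own docroot):

```apache
<Directory /var/www/html>
    # Never read .htaccess files; all policy lives in httpd.conf
    AllowOverride None
</Directory>
```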

Dealing with Problem Users / Search Spiders / Etc

I've noticed a few archives have tried putting a robots.txt file in their data directory, e.g.:

http://host.example.edu/data/robots.txt

Search engines don't look for it there. It must be served from the root of the web server to be of any use. Even then, it's only a request for them not to scrape certain sections, and won't actually stop them.
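
If you do want to ask well-behaved crawlers to stay out of your data tree, a minimal robots.txt at the server root might look like this (the /data/ path is an assumption; use the path your archive actually serves):

```
# served as http://host.example.edu/robots.txt
User-agent: *
Disallow: /data/
```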

If you have to, you can block misbehaving IP addresses in a few different ways:

Deny from 10.1.1.117

This is rarely of much use, as they'll typically just change IP addresses and do it again. You can block networks by giving a partial address, e.g.:

Deny from 10.1.1.

Although you can specify host names or domain names, it's not recommended, as it requires doing a DNS lookup for each request, which can slow things down.
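
In context, those Deny directives go inside a <Directory> or <Location> section, using the Apache 2.2-era Order syntax (Apache 2.4 replaces Order/Allow/Deny with Require); the /data/ path and addresses here are examples:

```apache
<Location /data/>
    Order Allow,Deny
    Allow from all
    # block a misbehaving host, and a misbehaving network
    Deny from 10.1.1.117
    Deny from 10.2.3.
</Location>
```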

If you're trying to identify who it might be, you can try narrowing it down with:

  1. Do a DNS lookup on the IP
  2. Use your favorite internet search engine to search for both the IP address and the DNS name (if there was one). They might've posted to a newsgroup or similar that logged the host or IP.
  3. If it didn't have a DNS name, you can try using the whois command to see which institution the IP range is registered to.
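
Steps 1 and 3 above can be sketched from the command line, assuming the host and whois tools are installed (10.1.1.117 is a hypothetical address pulled from your access logs):

```shell
# Hypothetical address from your access logs; substitute your own.
IP=10.1.1.117

# 1. Reverse DNS lookup; many addresses have no PTR record.
host "$IP" || echo "no reverse DNS for $IP"

# 3. Registration lookup: which institution owns the address range.
#    (The field names vary between regional registries.)
whois "$IP" 2>/dev/null | grep -i -E 'orgname|org-name|netname' || true
```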