Version 7 (modified by joe, 2 years ago)

added note re:iptables rule to slow abusive parallel downloads

Setting up a Web Server to Serve Solar Physics Data

This is a collection of notes on issues common to solar physics archives that serve data via HTTP. All rules below are for Apache because, in the ~7 years that I (Joe Hourclé) have been doing this, I haven't run into any solar physics archives using other server software. If you're using something different, contact me and we'll work up a similar list for your software.

Serving FITS files

By default, Apache does not know what a .fits or .fts file is, so it will mark it as text/plain, which causes many web browsers to try to display the file themselves instead of downloading it or prompting to open it in another program. You can fix this by adding the following to your httpd.conf file:

AddType image/fits .fts .fits .fit .FIT .FITS .FTS

If those FITS files don't contain image data and you'd rather not label them as images, you can instead use:

AddType application/fits .fts .fits .fit .FIT .FITS .FTS

If you don't see any AddType directives, but have a TypesConfig /path/to/mime.types, you can add to the end of the referenced file:

image/fits                                      fits fit fts

... or application/fits, as appropriate.

If you have a directory of FITS files that do not have a file extension, you can instead tell your webserver that all files in that directory, or all files matching a given pattern, are FITS files. Use whichever one of the following fits your situation:

<Location /URI/path/to/files/>
    ForceType application/fits
</Location>

<Directory /local/filesystem/path/to/files>
    ForceType application/fits
</Directory>

<FilesMatch "regular expression">
    ForceType application/fits
</FilesMatch>

See the Apache documentation on Configurable Sections for information about other ways to selectively apply configuration, such as <DirectoryMatch> and <LocationMatch>.
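As a rough sketch of the pattern-based sections mentioned above, the following applies ForceType by regular expression. Both path layouts are made up for illustration (they are not from any real archive), so adjust the patterns to your own tree:

```apache
# Hypothetical filesystem layout: /data/<instrument>/level1/... holds
# FITS files without extensions.
<DirectoryMatch "^/data/[^/]+/level1">
    ForceType application/fits
</DirectoryMatch>

# The same idea applied to the URL space instead of the filesystem.
# The /archive/<year>/fits/ URL scheme is also an assumption.
<LocationMatch "^/archive/[0-9]{4}/fits/">
    ForceType application/fits
</LocationMatch>
```

Remember that ForceType applies to everything the section matches, including any HTML or text files that happen to live there.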

Serving CDF, HDF or NetCDF files

If you're using other scientific file formats, you may need to add one or more of the following:

AddType application/x-netcdf .nc .cdf
AddType application/x-cdf .cdf
AddType application/x-hdf .hdf .h4 .hdf4 .h5 .hdf5 .he4 .he5

Please note that you cannot copy & paste that block in as-is, because it defines .cdf as being both CDF and NetCDF. Pick whichever mapping matches the files you actually serve.

You can also use the ForceType directive, as explained above under FITS files.
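One possible way to resolve the .cdf conflict, assuming your NetCDF files normally end in .nc and only one directory (a placeholder path below) holds NetCDF files named .cdf:

```apache
# Server-wide: .cdf means CDF, .nc means NetCDF
AddType application/x-netcdf .nc
AddType application/x-cdf .cdf
AddType application/x-hdf .hdf .h4 .hdf4 .h5 .hdf5 .he4 .he5

# Exception for one directory where .cdf files are really NetCDF
# (the path is a placeholder)
<Directory /local/filesystem/path/to/netcdf>
    AddType application/x-netcdf .cdf
</Directory>
```

If your archive is the other way around (mostly NetCDF with a few CDF directories), swap the two types.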

Customized Directory Listings

If you're using Apache's FancyIndexing to serve directory listings, the default width is often not enough to show the filenames of the length used in solar physics. We can fix this with:

IndexOptions +SuppressDescription +NameWidth=*

If you have people using 'wget' or similar to scrape your directory listings, you may also want to turn off the ability to sort columns, as the default wget settings will result in multiple requests for each directory, even when nothing has changed. The following options are recommended if you expect people to try mirroring your data:

IndexOptions +FoldersFirst +TrackModified +IgnoreClient +SuppressColumnSorting 
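Putting the two sets of options together for a single data directory might look something like the sketch below. The path is a placeholder, and Options +Indexes is only needed if listings aren't already enabled for that directory:

```apache
<Directory /local/filesystem/path/to/files>
    # Allow Apache to generate listings for this tree
    Options +Indexes

    # Wide enough for long filenames, no description column
    IndexOptions +FancyIndexing +SuppressDescription +NameWidth=*

    # Friendlier to people mirroring with wget or similar
    IndexOptions +FoldersFirst +TrackModified +IgnoreClient +SuppressColumnSorting
</Directory>
```

Because every option is prefixed with +, the two IndexOptions lines are merged rather than the second replacing the first.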

If you wish to adjust the style of the listings while avoiding the extra overhead of Server-Side Includes and reducing the bandwidth used by people mirroring your data, consider using JavaScript to wrap the style around the listing. For example, for STEREO's data listing, we already use jQuery for other functionality, so we use it to rewrite the page after it has loaded, and tell the webserver to insert the HTML needed to load the required files:

IndexHeadInsert "<script src='/jquery/jquery-1.3.2.min.js'></script><script src='/incl2/wrap_index.js'></script>"

Hardening Your Web Server

At a bare minimum, you should prevent your web server from revealing all of the modules that it has installed, with their version numbers, so you're not advertising the equivalent of an unlocked window. You can do this by finding the ServerTokens directive in your httpd.conf file and changing it to the following (or inserting it, if it doesn't exist):

ServerTokens Prod
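ServerTokens only controls the Server response header. As a small optional addition (my suggestion here, not part of the notes above), ServerSignature controls the footer line that Apache adds to pages it generates itself, such as error pages and directory listings:

```apache
# Only report "Apache" in the Server header
ServerTokens Prod

# Drop the version/hostname footer from server-generated pages
ServerSignature Off
```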

For a much more in-depth list of possible changes to make, see the security benchmarks from the Center for Internet Security, and review the appropriate document for your OS, web server, and database, if applicable. Make sure to back up your configuration first, though, as some of the recommendations are overly restrictive and may break your web server. (Also note that they ask you for information about yourself, but you only have to answer the yes/no questions to download the files.) These are things you can do to make the system more secure, not things you necessarily should do for all servers.

A few words of warning, though:

  1. Back up your configuration directory before making changes. Yes, I said it before, and yes, this is in bold for a reason.
  2. You can test the webserver's configuration on unix-like systems with apachectl -t. You should do this as you make changes, so you're not trying to figure out which of the dozens of changes broke the configuration.
  3. If you want to restart your webserver after each change, use apachectl graceful, which won't kick off people currently downloading files and make them start over. (Although, once you're all done, you may want to do a full stop & start.)
  4. After making changes, pay more attention than normal to your web server's error logs for the next few days to see if anything strange is happening. If you're using 'piped logging', you'll likely need to check in two places, as error logs when starting up are not written to the piped logs.
  5. The section on disabling the auth* modules doesn't make it clear that you 'should not' turn off the module authz_host_module, which is required for some of the other changes they recommend.
  6. If you're using any of those authentication modules and their directives are inside <IfModule> blocks, turning off the modules will silently turn off the access restrictions too. (If you're not using <IfModule>, the now-unknown directives should instead stop the webserver from starting back up.)
  7. The section "Deny Access to OS Root Directory" telling you to deny all access to <Directory />, and then turn on access to web root neglects to tell you to also give access to the location of the icons used in directory listings, the files used for error messages, etc, which are in different places in each OS. (MacOS uses /usr/share/httpd/; CentOS uses /var/www/ )
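To illustrate the last point, here is a minimal sketch in the Apache 2.2 style used elsewhere on this page. All three paths are assumptions based on a CentOS-like layout, so check where your OS keeps its document root, icons, and error documents:

```apache
# Deny everything by default ...
<Directory />
    Order deny,allow
    Deny from all
</Directory>

# ... then re-open the document root (assumed path)
<Directory /var/www/html>
    Order allow,deny
    Allow from all
</Directory>

# Icons used in directory listings (assumed path)
<Directory /var/www/icons>
    Order allow,deny
    Allow from all
</Directory>

# Files used for the stock error messages (assumed path)
<Directory /var/www/error>
    Order allow,deny
    Allow from all
</Directory>
```

Run apachectl -t after adding this, then load a directory listing and a deliberately bad URL to confirm the icons and error pages still show up.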

A Note on 'htaccess' files

Turning on support for .htaccess files means that the webserver needs to not only look in the file's directory, but in each and every directory above it as well. Yes, there's caching, but it still has to check if they've changed, or new ones get introduced. It's better to use <Directory> or <Location> sections in the main httpd.conf, which is read only when the server restarts. You can turn off .htaccess checking by setting the AllowOverride directive to:

AllowOverride None
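For example, settings that might otherwise have ended up in a .htaccess file can go into a section like the one below. The path is a placeholder, and the two directives inside are just examples borrowed from earlier on this page:

```apache
<Directory /local/filesystem/path/to/files>
    # Stop Apache from looking for .htaccess files here and below
    AllowOverride None

    # Settings that would otherwise have lived in .htaccess
    IndexOptions +NameWidth=*
    AddType application/fits .fts .fits .fit
</Directory>
```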

Dealing with Problem Users / Search Spiders / Etc

I've noticed a few archives have tried putting a robots.txt file in their data directory.

Search engines don't look for it there. It must be served from the root of the web server to be of any use. Even then, it's only a request for crawlers not to scrape certain sections, and won't actually stop them.

If you have to, you can block misbehaving IP addresses in a few different ways:

Deny from

This is rarely of much use, as they'll typically just change IP addresses, and do it again. You can block networks by giving a partial address, eg:

Deny from 10.1.1.

Although you can specify host names or domain names, it's not recommended as it requires doing a DNS lookup for requests, which can slow things down.
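In context, a Deny rule needs an Order line and an Allow for everyone else. A rough sketch, where the /data/ URL path and the single address are made up for illustration:

```apache
<Location "/data/">
    # Allow is evaluated first, then Deny; a match on Deny wins
    Order allow,deny
    Allow from all

    # A single address (hypothetical)
    Deny from 10.1.1.25

    # A partial address: everything in 10.1.1.*
    Deny from 10.1.1.

    # The same network written in CIDR notation
    Deny from 10.1.1.0/24
</Location>
```

The last two lines are equivalent, so you would normally keep only one of them.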

If you're trying to identify who it might be, you can try narrowing it down with:

  1. Do a DNS lookup on the IP
  2. Use your favorite internet search engine to search for both the IP address and the DNS name (if there was one). They might've posted to a newsgroup or similar that logged the host or IP.
  3. If it didn't have a DNS name, you can try using the whois command to see which institution the IP range is registered to.

Blocking by Browser / User-Agent

You can also reject requests by the client's browser identification string (the User-Agent), although clients can always change or hide it, so this isn't always a good solution. It can serve as a hint to search crawlers or wget users that their activity isn't appreciated, though.

We get hundreds of thousands of requests per day for the same file, coming from thousands of IP addresses but all from a single strange browser. (I've tried asking companies with similarly named software whether their product acts as a browser and, if so, to stop it.) We use the following to block access:

<Location "/images/latest_eit_304.gif">
    BrowserMatch ^CompanionLink badClient
    Order allow,deny
    Allow from all
    Deny from env=badClient
</Location>

Restricting Access to Local Users

If you have data that's under embargo, and you want to make it available only to local users, you can limit access to a directory such as /embargoed to specific IP addresses or to a group of IP addresses:

<Location "/embargoed/">
    Order deny,allow
    Deny from all
    Allow from
    Allow from
    Allow from
</Location>

And I know, I said the DNS lookup was bad, but in this case, hopefully the embargoed data is only a small portion of your total traffic.
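As an illustration of the three kinds of Allow values, here is the same block filled in with placeholder values. The addresses come from the ranges reserved for documentation and example.edu is a made-up domain, so none of these are real sites:

```apache
<Location "/embargoed/">
    Order deny,allow
    Deny from all

    # A single workstation (placeholder address)
    Allow from 192.0.2.17

    # A whole subnet (placeholder network)
    Allow from 198.51.100.0/24

    # Every host under a domain; this is the one that needs a DNS lookup
    Allow from .example.edu
</Location>
```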

If you want to allow the data to be accessed by local users, but also from outside with the proper username and password, you can do:

<Location "/embargoed/">
    Order deny,allow
    Deny from all
    Allow from
    Allow from
    Allow from

    AuthType Digest
    AuthName "Embargoed Data"
    AuthDigestDomain /embargoed/
    AuthDigestFile /path/to/htdigest/file
    Require valid-user

    Satisfy any
</Location>

Note that in Apache 2.2, AuthDigestFile should be changed to AuthUserFile. You can use AuthType Basic instead, but then the passwords are sent in the clear, and you'll have to change some of the lines.
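For comparison, a sketch of what the Basic variant might look like on Apache 2.2. The Allow network and the password file path are placeholders, and the file would be created with htpasswd rather than htdigest:

```apache
<Location "/embargoed/">
    Order deny,allow
    Deny from all
    # Placeholder local network
    Allow from 192.0.2.0/24

    AuthType Basic
    AuthName "Embargoed Data"
    # No AuthDigestDomain line; the user file is in htpasswd format
    AuthUserFile /path/to/htpasswd/file
    Require valid-user

    # Either a matching address or a valid login is enough
    Satisfy any
</Location>
```

Since Basic sends the password essentially in the clear, this only makes sense if the location is served over an encrypted connection.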

Also note that having usernames and passwords at NASA means your system falls under the 'Web Application' designation, and requires additional approvals through STRAW. If you're setting up a webserver at NASA, and you don't know what STRAW is, stop immediately, and talk to your local sysadmins.

Debugging Slow Web Servers

Entries into the webserver's access logs only occur once the client disconnects. This means that if you have lots of connections sitting open, you can't use the access logs to see what's going on. Apache does have a way to get some information about what's going on, however. Look for a section in your server config mentioning ExtendedStatus or server-status, and change it to read something like:

<IfModule mod_status.c>
    <Location "/server-status">
        SetHandler server-status
        Order deny,allow
        Deny from all
        Allow from
        Allow from your-ip-address
    </Location>
    ExtendedStatus On
</IfModule>

Obviously, set your-ip-address to an appropriate value, and you can set more than one Allow line to allow connections from more than one machine. You can then (after restarting the webserver) request the page http://servername/server-status, which down at the bottom will give a report that includes what requests are being processed, what IP asked for it, and how long it's been processing, so you can try to identify what might be having problems, or what connecting IP address might be doing strange things. You have to do this in advance of the request; as it requires a web server restart to turn on this feature, you can't just turn it on when you have a problem connection (unless it's a flood of problem connections that you're actively monitoring).

As the process ID is listed, you can also use this to get information about processes that you identify as problematic via the unix ps or top commands.

Slowing Abusive Parallel Downloading

There are modules that allow you to do rate limiting within the webserver, but if you have a machine using IPTables, you can limit a given IP address to only 5 connections at once using:

-A INPUT -p tcp -m tcp --dport 80 --tcp-flags FIN,SYN,RST,ACK SYN -m connlimit --connlimit-above 5 --connlimit-mask 32 -j REJECT --reject-with tcp-reset

You can also set limits per IP block by reducing --connlimit-mask. Use --connlimit-mask=24 for a 256 IP address block.