Setting up a Web Server to Serve Solar Physics Data
This is a collection of notes on common issues typical of solar physics archives serving data via HTTP. All rules below are for Apache, as over the last ~7 years that I (Joe Hourclé) have been doing this, I haven't run into any solar physics archives using other server software. If you're using something different, contact me and we'll work up a similar list for your software.
Serving FITS files
By default, Apache does not know what a .fits or .fts file is, and so will mark it as text/plain, which will cause many web browsers to attempt to display the file themselves instead of downloading it and/or prompting to view it in an alternate program. You can fix this by adding the following to your httpd.conf file:
AddType image/fits .fts .fits .fit .FIT .FITS .FTS
If those FITS files don't contain image data, and you'd rather not claim that they're images, you can instead use:
AddType application/fits .fts .fits .fit .FIT .FITS .FTS
If you have a directory of FITS files that do not have a file extension, you can instead tell your webserver that all files in that directory, or all files matching a given pattern, are FITS files:
<Location /URI/path/to/files/>
    ForceType application/fits
</Location>

<Directory /local/filesystem/path/to/files>
    ForceType application/fits
</Directory>

<FilesMatch "regular expression">
    ForceType application/fits
</FilesMatch>
See the Apache documentation on Configurable Sections for information about other ways to selectively apply configuration, such as <DirectoryMatch> and <LocationMatch>.
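As a rough illustration of the pattern-matching case, here is a minimal sketch. The directory path and the file naming scheme (an instrument prefix followed by a date and time, with no extension) are made up for the example, so substitute your own:

```apache
# Hypothetical layout: extensionless files named like "eit_20100101_0000"
# living under /data/eit on the local filesystem.
<Directory /data/eit>
    <FilesMatch "^eit_[0-9]{8}_[0-9]{4}$">
        ForceType application/fits
    </FilesMatch>
</Directory>
```

Nesting the <FilesMatch> inside a <Directory> keeps the rule from applying to similarly named files elsewhere on the server.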
Serving CDF, HDF or NetCDF files
If you're using other scientific file formats, you may need to add one or more of the following:
AddType application/x-netcdf .nc .cdf
AddType application/x-cdf .cdf
AddType application/x-hdf .hdf .h4 .hdf4 .h5 .hdf5 .he4 .he5
Please note that you cannot copy & paste that block in as-is, as it defines .cdf as being both CDF and NetCDF. Keep .cdf on whichever line matches the files you actually serve.
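For example, a version of that block with the conflict resolved might look like the following. This is only a sketch, and it assumes that your .cdf files are CDF and that your NetCDF files all use .nc; swap the extension over if your archive is the other way around:

```apache
# Assumption: .cdf means CDF on this server; NetCDF files use .nc only.
AddType application/x-netcdf .nc
AddType application/x-cdf .cdf
AddType application/x-hdf .hdf .h4 .hdf4 .h5 .hdf5 .he4 .he5
```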
You can also use the ForceType directive, as explained above under FITS files.
Customized Directory Listings
If you're using Apache's FancyIndexing to serve directory listings, the default width is often not enough to show the filenames of the length used in solar physics. We can fix this with:
IndexOptions +SuppressDescription +NameWidth=*
If you have people using 'wget' or similar to scrape your directory listings, you may also want to turn off the ability to sort columns, as the default wget settings will result in multiple requests for each directory, even when nothing has changed. The following options are recommended if you expect people to try mirroring your data:
IndexOptions +FoldersFirst +TrackModified +IgnoreClient +SuppressColumnSorting
IndexHeadInsert "<script src='/jquery/jquery-1.3.2.min.js'></script><script src='/incl2/wrap_index.js'></script>"
Hardening Your Web Server
At a bare minimum, you should prevent your web server from revealing all of the modules that it has installed, with their version numbers, so you're not advertising the equivalent of an unlocked window. You can do this by finding the ServerTokens directive in your httpd.conf file, and changing it to (or inserting it, if it doesn't exist) :
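A minimal sketch of that change, assuming you want the least revealing setting (the choice of Prod is mine, as is the optional ServerSignature line, which is not discussed above):

```apache
# Send only "Server: Apache" in response headers, with no version or module list.
ServerTokens Prod

# Optional, and an addition of mine: also drop the version footer from
# server-generated pages such as error documents and directory listings.
ServerSignature Off
```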
For a much more in-depth list of possible changes to make, see the security benchmarks from the Center for Internet Security, and review the appropriate document for your OS, web server, and database, if applicable. Make sure to back up your configuration first, though, as some of the recommendations are overly restrictive and may break your web server. (Also note that they ask you for info about yourself, but you only have to answer the yes/no questions to download files.) These are things you can do to make the system more secure, not things you necessarily should do for all servers.
A few words of warning, though:
- Back up your configuration directory before making changes. Yes, I said it before, and yes, this is in bold for a reason.
- You can test the webserver's configuration on unix-like systems with apachectl -t. You should do this as you make changes, so you're not trying to figure out which of the dozens of changes broke the configuration.
- If you want to restart your webserver after each change, use apachectl graceful, which won't kick off people currently downloading files and force them to start over. (Although, once you're all done, you may want to do a full stop & start.)
- After making changes, pay more attention than normal to your web server's error logs for the next few days to see if anything strange is happening. If you're using 'piped logging', you'll likely need to check in two places, as error logs when starting up are not written to the piped logs.
- The section on disabling the auth* modules doesn't make it clear that you 'should not' turn off the module authz_host_module, which is required for some of the other changes they recommend.
- If you're using any of those authentication modules, and the directives are in <IfModule> blocks, turning off the modules will silently turn off the access restrictions as well. (If you're not using <IfModule>, the now-unrecognized directives should stop the webserver from starting back up.)
- The section "Deny Access to OS Root Directory" telling you to deny all access to <Directory />, and then turn on access to web root neglects to tell you to also give access to the location of the icons used in directory listings, the files used for error messages, etc, which are in different places in each OS. (MacOS uses /usr/share/httpd/; CentOS uses /var/www/ )
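To make that last point concrete, here is a sketch of the root lockdown with the extra directories opened back up. It uses the same Apache 2.2 access syntax as the rest of this page, and the paths are assumptions based on a CentOS-style layout, so check where your OS actually keeps these files:

```apache
# Deny everything by default, starting at the filesystem root.
<Directory />
    Options None
    AllowOverride None
    Order deny,allow
    Deny from all
</Directory>

# Re-open the web root (hypothetical path).
<Directory /var/www/html>
    Order allow,deny
    Allow from all
</Directory>

# Re-open the icons used by directory listings (hypothetical path).
<Directory /var/www/icons>
    Order allow,deny
    Allow from all
</Directory>

# Re-open the files used for error messages (hypothetical path).
<Directory /var/www/error>
    Order allow,deny
    Allow from all
</Directory>
```

If directory listings suddenly show broken icons after the lockdown, a missing block like the second or third one is the first thing I'd check.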
A Note on 'htaccess' files
Turning on support for .htaccess files means that the webserver needs to not only look in the file's directory, but in each and every directory above it as well. Yes, there's caching, but it still has to check if they've changed, or new ones get introduced. It's better to use <Directory> or <Location> sections in the main httpd.conf, which is read only when the server restarts. You can turn off .htaccess checking by setting the AllowOverride directive to:
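A minimal sketch, assuming you want .htaccess processing off for the entire server (AllowOverride is only valid inside a <Directory> section, so it's applied at the filesystem root here):

```apache
# Stop Apache from looking for .htaccess files anywhere on the filesystem.
<Directory />
    AllowOverride None
</Directory>
```

If a more specific <Directory> block elsewhere in your config sets AllowOverride to something else, that block still wins for its own tree, so it's worth searching the config for other occurrences.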
Dealing with Problem Users / Search Spiders / Etc
I've noticed a few archives have tried putting a robots.txt file in their data directory, eg:
Search engines don't look for it there. It must be served from the root of the web server to be of any use. Even then, it's only a request that crawlers not scrape certain sections, and won't actually stop them.
If you have to, you can block misbehaving IP addresses in a few different ways:
Deny from 10.1.1.117
This is rarely of much use, as they'll typically just change IP addresses, and do it again. You can block networks by giving a partial address, eg:
Deny from 10.1.1.
Although you can specify host names or domain names, it's not recommended as it requires doing a DNS lookup for requests, which can slow things down.
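The Deny lines above use the Apache 2.2 access syntax. On Apache 2.4, where Order/Allow/Deny are only available through mod_access_compat, the same idea can be sketched with Require directives instead. The /data/ location is a made-up example:

```apache
# Apache 2.4 syntax: allow everyone except the listed address and network.
<Location "/data/">
    <RequireAll>
        Require all granted
        # Block a single misbehaving address.
        Require not ip 10.1.1.117
        # Block the whole network it came from.
        Require not ip 10.1.1.0/24
    </RequireAll>
</Location>
```

The negated lines have to sit inside <RequireAll>, as a "Require not" can only veto access and can't grant it on its own.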
If you're trying to identify who it might be, you can try narrowing it down with:
- Do a DNS lookup on the IP
- Use your favorite internet search engine to search for both the IP address and the DNS name (if there was one). They might've posted to a newsgroup or similar that logged the host or IP.
- If it didn't have a DNS name, you can try using the whois command to see which institution the IP range is registered to.
Blocking by Browser / User-Agent
You can also reject by the client's browser identification string, although they can always change or hide it, so this isn't always a good solution. It can be a hint to search crawlers or wget users that their activity isn't appreciated, though.
We get hundreds of thousands of requests per day for the same file, coming from thousands of IP addresses, all from a single strange browser. (I've tried asking companies with similarly named software whether their stuff acts as a browser, and if so, to stop it.) We use the following to block access:
<Location "/images/latest_eit_304.gif">
    BrowserMatch ^CompanionLink badClient
    Order allow,deny
    Allow from all
    Deny from env=badClient
</Location>
Restricting Access to Local Users
If you have data that's under embargo, and you want to make it available only to local users, you can limit access to a directory such as /embargoed to specific IP addresses or to a group of IP addresses:
<Location "/embargoed/">
    Order deny,allow
    Deny from all
    Allow from 10.1.1.0/24
    Allow from 10.11.12.167
    Allow from example.edu
</Location>
And I know, I said the DNS lookup was bad, but in this case, hopefully the embargoed data is only a small portion of your total traffic.
If you want to allow the data to be accessed by local users, but also from outside with the proper username and password, you can do:
<Location "/embargoed/">
    Order deny,allow
    Deny from all
    Allow from 10.1.1.0/24
    Allow from 10.11.12.167
    Allow from example.edu
    AuthType Digest
    AuthName "Embargoed Data"
    AuthDigestDomain /embargoed/
    AuthDigestFile /path/to/htdigest/file
    Require valid-user
    Satisfy any
</Location>
Note that in Apache 2.2, AuthDigestFile should be changed to AuthUserFile. You can use AuthType Basic instead, but then the passwords are sent in the clear, and you'll have to change some of the lines.
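For reference, a sketch of what the Basic variant might look like in Apache 2.2. The password file path is a placeholder, and the file would need to be created with the htpasswd utility rather than htdigest:

```apache
<Location "/embargoed/">
    Order deny,allow
    Deny from all
    Allow from 10.1.1.0/24
    Allow from 10.11.12.167
    Allow from example.edu

    # Basic auth sends credentials unencrypted unless the site is served over HTTPS.
    AuthType Basic
    AuthName "Embargoed Data"
    # Placeholder path to an htpasswd-format file.
    AuthUserFile /path/to/htpasswd/file
    Require valid-user

    # Either a matching address or a valid login is enough.
    Satisfy any
</Location>
```

The lines that change are AuthType and the password file directive, and AuthDigestDomain goes away, as it only applies to Digest.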
Debugging Slow Web Servers
Entries in the webserver's access logs are only written once a request has finished. This means that if you have lots of connections sitting open, you can't use the access logs to see what's going on. Apache does have a way to get some information about what's going on, however. Look for a section in your server config mentioning ExtendedStatus or server-status, and change it to read something like:
<IfModule mod_status.c>
    <Location "/server-status">
        SetHandler server-status
        Order deny,allow
        Deny from all
        Allow from 127.0.0.1
        Allow from your-ip-address
    </Location>
    ExtendedStatus On
</IfModule>
Obviously, set your-ip-address to an appropriate value, and you can set more than one Allow line to allow connections from more than one machine. You can then (after restarting the webserver) request the page http://servername/server-status, which down at the bottom will give a report that includes what requests are being processed, what IP asked for it, and how long it's been processing, so you can try to identify what might be having problems, or what connecting IP address might be doing strange things. You have to do this in advance of the request; as it requires a web server restart to turn on this feature, you can't just turn it on when you have a problem connection (unless it's a flood of problem connections that you're actively monitoring).
As the process ID is listed, you can also use this to get information about processes that you identify as problematic via the unix ps or top commands.
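The server-status block above uses the Apache 2.2 access directives. If you're on Apache 2.4, a roughly equivalent sketch would be the following, where 192.0.2.10 is a stand-in for your own address:

```apache
<IfModule mod_status.c>
    ExtendedStatus On
    <Location "/server-status">
        SetHandler server-status
        # Apache 2.4 syntax: access is granted if any one Require line matches.
        Require local
        # Placeholder address: replace with the machine you monitor from.
        Require ip 192.0.2.10
    </Location>
</IfModule>
```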