= Setting up a Web Server to Serve Solar Physics Data =
This is a collection of notes on common issues typical of solar physics archives serving data via HTTP. All of the rules below are for Apache, as in the ~7 years that I ([People/JoeHourcle Joe Hourclé]) have been doing this, I haven't run into any solar physics archives using other server software. If you're using something different, [People/JoeHourcle contact me] and we'll work up a similar list for your software.
== Serving FITS files ==
By default, Apache does not know what a `.fits` or `.fts` file is, and so will mark it as `text/plain`, which causes many web browsers to attempt to display the file themselves instead of downloading it or prompting to open it in another program. You can fix this by adding the following to your httpd.conf file:
{{{
AddType image/fits .fts .fits .fit .FIT .FITS .FTS
}}}
If you're not serving image data in those FITS files, and you'd rather not claim they're images, you can instead use:
{{{
AddType application/fits .fts .fits .fit .FIT .FITS .FTS
}}}
If you have a directory of FITS files that do not have a file extension, you can instead tell your webserver that all files in a given directory, or all files matching a given pattern, are FITS files. (The directory path and file patterns below are examples; adjust them to your archive's layout.)
{{{
<Directory "/data/fits">
    ForceType application/fits
</Directory>
}}}
{{{
<Files "eit_*">
    ForceType application/fits
</Files>
}}}
{{{
<FilesMatch "^[a-z]+_[0-9]{8}$">
    ForceType application/fits
</FilesMatch>
}}}
See the Apache documentation on [http://httpd.apache.org/docs/2.0/sections.html Configurable Sections] for information about other ways to selectively apply configuration, such as `<Location>` and `<LocationMatch>` sections.
== Serving CDF, HDF or NetCDF files ==
If you're using other scientific file formats, you may need to add one or more of the following:
{{{
AddType application/x-netcdf .nc .cdf
AddType application/x-cdf .cdf
AddType application/x-hdf .hdf .h4 .hdf4 .h5 .hdf5 .he4 .he5
}}}
Please note that you ''can not'' copy & paste that block in as-is, as it defines `.cdf` as being both CDF and NetCDF; you'll have to pick one or the other for that extension.
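For example, if the `.cdf` files in your archive are CDF rather than NetCDF, a non-conflicting version of that block might look like:
{{{
AddType application/x-netcdf .nc
AddType application/x-cdf .cdf
AddType application/x-hdf .hdf .h4 .hdf4 .h5 .hdf5 .he4 .he5
}}}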
You can also use the !ForceType directive, as explained above under FITS files.
== Customized Directory Listings ==
If you're using Apache's [http://httpd.apache.org/docs/2.0/mod/mod_autoindex.html#indexoptions.fancyindexing FancyIndexing] to serve directory listings, the default column width is often not enough to show filenames as long as those used in solar physics. We can fix this with:
{{{
IndexOptions +SuppressDescription +NameWidth=*
}}}
If you have people using 'wget' or similar to scrape your directory listings, you may also want to turn off the ability to sort columns, as the default wget settings will result in multiple requests for each directory, even when nothing has changed. The following options are recommended if you expect people to try mirroring your data:
{{{
IndexOptions +FoldersFirst +TrackModified +IgnoreClient +SuppressColumnSorting
}}}
If you wish to adjust the style of the folders, to avoid the extra overhead of Server-Side Includes, and to reduce the bandwidth used by people mirroring your data, consider using !JavaScript to wrap the style around the listing. For example, for [http://stereo-ssc.nascom.nasa.gov/data/ins_data/ STEREO's data listing], we're already using [http://jquery.com/ jQuery] for other functionality, so we use it to [http://stereo-ssc.nascom.nasa.gov/incl2/wrap_index.js rewrite the page after it's loaded], and tell the webserver to insert the HTML necessary to call the required files (the script paths below are examples; point them at wherever your copies of jQuery and your wrapper script actually live):
{{{
IndexHeadInsert "<script type=\"text/javascript\" src=\"/jquery.min.js\"></script><script type=\"text/javascript\" src=\"/incl2/wrap_index.js\"></script>"
}}}
== Hardening Your Web Server ==
At a bare minimum, you should prevent your web server from revealing all of the modules it has installed, along with their version numbers, so you're not advertising the equivalent of an unlocked window. You can do this by finding the !ServerTokens directive in your `httpd.conf` file and changing it to (or inserting it, if it doesn't exist):
{{{
ServerTokens Prod
}}}
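For comparison, here's roughly what the `Server` response header looks like before and after the change (the module list and version numbers here are invented for illustration; check your own server with `curl -sI http://localhost/`):
{{{
Server: Apache/2.2.15 (CentOS) DAV/2 PHP/5.3.3 mod_ssl/2.2.15    <- ServerTokens Full (the default)
Server: Apache                                                   <- ServerTokens Prod
}}}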
For a much more in-depth list of possible changes to make, see the [http://benchmarks.cisecurity.org/ security benchmarks from the Center for Internet Security], and review the appropriate document for your OS, web server, and database, if applicable. Make sure to back up your configuration first, though, as some of the recommendations are overly restrictive and may break your web server. (also note that they ask you for info about yourself, but you only have to answer the yes/no questions to download files). These are things you ''can'' do to make the system more secure, not things you necessarily ''should'' do for all servers.
A few words of warning, though:
1. '''Back up your configuration directory before making changes'''. Yes, I said it before, and yes, this is in bold for a reason.
2. You can test the webserver's configuration on unix-like systems with `apachectl -t`. You should do this as you make changes, so you're not trying to figure out which of the dozens of changes broke the configuration.
3. If you want to restart your webserver after each change, use `apachectl graceful`, which won't kick off people currently downloading files and make them have to restart. (Although, once you're all done, you may want to do a full stop & start.)
4. After making changes, pay more attention than normal to your web server's error logs for the next few days to see if anything strange is happening. If you're using 'piped logging', you'll likely need to check in two places, as error logs when starting up are not written to the piped logs.
5. The section on disabling the `auth*` modules doesn't make it clear that you ''should not'' turn off the module `authz_host_module`, which is required for some of the other changes they recommend.
6. If you're using any of those authentication modules, and the directives were in `<IfModule>` blocks, turning off the modules will silently turn off the access restrictions. (If you're ''not'' using `IfModule`, it should instead stop the webserver from starting back up.)
7. The section "Deny Access to OS Root Directory", which tells you to deny all access to `<Directory />` and then turn access back on for the web root, neglects to tell you to also give access to the location of the icons used in directory listings, the files used for error messages, etc., which are in different places on each OS. (MacOS uses `/usr/share/httpd/`; CentOS uses `/var/www/`.)
== A Note on 'htaccess' files ==
Turning on support for `.htaccess` files means that the webserver needs to look not only in the file's directory, but in each and every directory above it as well. Yes, there's caching, but it still has to check whether they've changed or whether new ones have been introduced. It's better to use `<Directory>` or `<Location>` sections in the main httpd.conf, which is read only when the server restarts. You can turn off `.htaccess` checking by setting the `AllowOverride` directive to:
{{{
AllowOverride None
}}}
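To see why this matters, here's a quick sketch of every `.htaccess` file the server has to consider for a single request, assuming a made-up `DocumentRoot` of `/var/www/html`:

```shell
# Walk from a requested file up to /, printing each .htaccess file Apache
# must check when AllowOverride is enabled. The paths are invented examples.
path=/var/www/html/data/stereo/file.fts
dir=$(dirname "$path")
while [ "$dir" != / ]; do
  echo "$dir/.htaccess"
  dir=$(dirname "$dir")
done
echo /.htaccess
```

That's six filesystem checks for one file, on every request, before the webserver even opens the file it was asked for.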
== Dealing with Problem Users / Search Spiders / Etc ==
I've noticed a few archives have tried putting a `robots.txt` file in their data directory, eg:
{{{
http://host.example.edu/data/robots.txt
}}}
Search engines don't look for them there. A `robots.txt` '''must''' be served from the root of the web server to be of any use. Even then, it's only a request that they not scrape certain sections, and won't actually stop them.
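A minimal `robots.txt`, served from the server root (the hostname and path here are examples), looks like:
{{{
# served as http://host.example.edu/robots.txt
User-agent: *
Disallow: /data/
}}}
Well-behaved crawlers will fetch this file before crawling and skip anything under a `Disallow`ed path; misbehaving ones ignore it entirely.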
If you have to, you can block misbehaving IP addresses in a few different ways:
{{{
Deny from 10.1.1.117
}}}
This is rarely of much use, as they'll typically just change IP addresses, and do it again. You can block networks by giving a partial address, eg:
{{{
Deny from 10.1.1.
}}}
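`Allow` and `Deny` also accept CIDR prefixes and netmasks, so these three lines all block the same network as the partial-address form above:
{{{
Deny from 10.1.1.
Deny from 10.1.1.0/24
Deny from 10.1.1.0/255.255.255.0
}}}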
Although you ''can'' specify host names or domain names, it's not recommended, as it requires doing a DNS lookup for each request, which can slow things down.
If you're trying to identify who it might be, you can try narrowing it down with:
1. Do a DNS lookup on the IP
2. Use your favorite internet search engine to search for both the IP address and the DNS name (if there was one). They might've posted to a newsgroup or similar that logged the host or IP.
3. If it didn't have a DNS name, you can try using the `whois` command to see which institution the IP range is registered to.
=== Blocking by Browser / User-Agent ===
You can also reject requests by the client's browser identification string, although, as clients can always change or hide it, this isn't always a good solution. It can be a hint to search crawlers or `wget` users that their activity isn't appreciated, though.
As we get hundreds of thousands of requests per day for the same file, coming from thousands of IP addresses but all from a single strange browser (I've tried asking the companies with similarly named software whether their products act as a browser, and if so, to stop it), we use the following to block access:
{{{
BrowserMatch ^CompanionLink badClient
<Location />
    Order allow,deny
    Allow from all
    Deny from env=badClient
</Location>
}}}
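If you're still trying to work out which User-Agent is hammering you in the first place, a quick tally of the agent strings in your access log will usually surface it. The log file path and its contents below are fabricated samples in the common 'combined' log format:

```shell
# Tally User-Agent strings from a combined-format access log.
# /tmp/sample_access.log and its lines are made-up examples.
cat > /tmp/sample_access.log <<'EOF'
10.1.1.5 - - [01/Jan/2014:00:00:01 -0500] "GET /data/a.fts HTTP/1.1" 200 512 "-" "CompanionLink/1.0"
10.1.1.6 - - [01/Jan/2014:00:00:02 -0500] "GET /data/a.fts HTTP/1.1" 200 512 "-" "CompanionLink/1.0"
10.2.3.4 - - [01/Jan/2014:00:00:03 -0500] "GET /data/b.fts HTTP/1.1" 200 512 "-" "Wget/1.14"
EOF
# splitting each line on double quotes, field 6 is the User-Agent
awk -F'"' '{print $6}' /tmp/sample_access.log | sort | uniq -c | sort -rn
```

The busiest agent sorts to the top of the output, which gives you a string to feed to `BrowserMatch`.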
== Restricting Access to Local Users ==
If you have data that's under embargo, and you want to make it available only to local users, you can limit access to a directory such as `/embargoed` to [http://httpd.apache.org/docs/2.0/mod/mod_access.html#allow specific IP addresses or groups of IP addresses]:
{{{
<Location /embargoed>
    Order deny,allow
    Deny from all
    Allow from 10.1.1.1/24
    Allow from 10.11.12.167
    Allow from example.edu
</Location>
}}}
And I know, I said the DNS lookup was bad, but in this case, hopefully the embargoed data is only a small portion of your total traffic.
If you want the data to be accessible by local users, but also from outside with the proper username and password, you can do:
{{{
<Location /embargoed>
    Order deny,allow
    Deny from all
    Allow from 10.1.1.1/24
    Allow from 10.11.12.167
    Allow from example.edu
    AuthType Digest
    AuthName "Embargoed Data"
    AuthDigestDomain /embargoed/
    AuthDigestFile /path/to/htdigest/file
    Require valid-user
    Satisfy any
</Location>
}}}
Note that in Apache 2.2, `AuthDigestFile` should be changed to `AuthUserFile`. You can use `AuthType Basic` instead, but then the passwords are sent in the clear, and you'll have to [http://httpd.apache.org/docs/2.0/howto/auth.html change some of the lines].
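To populate the digest file, you'd normally run `htdigest -c /path/to/htdigest/file "Embargoed Data" username` and type the password at the prompt. If you need to script it, the file format is just `user:realm:MD5(user:realm:password)`; here's a sketch (the username, password, and output path are invented examples):

```shell
# Build one htdigest-format line without htdigest's interactive prompt.
# The hash is the MD5 of "user:realm:password" (the RFC 2617 "HA1" value).
user=jdoe
realm='Embargoed Data'
pass='change-me'
hash=$(printf '%s:%s:%s' "$user" "$realm" "$pass" | md5sum | cut -d' ' -f1)
echo "$user:$realm:$hash" > /tmp/htdigest.example
cat /tmp/htdigest.example
```

Note that the realm string must match the `AuthName` in your config exactly, or the passwords won't validate.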
Also note that having usernames and passwords at NASA means your system falls under the 'Web Application' designation, and requires additional approvals through STRAW. If you're setting up a webserver at NASA and you don't know what STRAW is, stop immediately and talk to your local sysadmins.
== Debugging Slow Web Servers ==
Entries in the webserver's access logs only appear once a request has completed. This means that if you have lots of connections sitting open, you can't use the access logs to see what's going on. Apache does have a way to get some information about in-flight requests, however. Look for a section in your server config mentioning `ExtendedStatus` or `server-status`, and change it to read something like:
{{{
<Location /server-status>
    SetHandler server-status
    Order deny,allow
    Deny from all
    Allow from 127.0.0.1
    Allow from your-ip-address
</Location>
ExtendedStatus On
}}}
Obviously, set `your-ip-address` to an appropriate value, and you can set more than one `Allow` line to allow connections from more than one machine. You can then (after restarting the webserver) request the page `http://servername/server-status`, which down at the bottom will give a report that includes what requests are being processed, what IP asked for it, and how long it's been processing, so you can try to identify what might be having problems, or what connecting IP address might be doing strange things. You have to do this in advance of the request; as it requires a web server restart to turn on this feature, you can't just turn it on when you have a problem connection (unless it's a flood of problem connections that you're actively monitoring).
As the process ID is listed, you can also use this to get information about processes that you identify as problematic via the unix `ps` or `top` commands.
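For example, once `server-status` gives you a suspect PID, you can check how long that worker has been running and how much memory it's using. (Here I use the shell's own PID, `$$`, so the sketch runs anywhere; substitute the worker's PID in practice.)

```shell
# Inspect one process by PID: process ID, elapsed time,
# resident memory (KB), and command name.
ps -o pid=,etime=,rss=,comm= -p $$
```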