Changes between Initial Version and Version 1 of WebserverSetup


Timestamp:
08/12/11 11:28:52 (13 years ago)
Author:
joe

     1= Setting up a Web Server to Serve Solar Physics Data = 
     2 
     3This is a collection of notes on issues common to solar physics archives that serve data via HTTP.  All rules below are for Apache: in the ~7 years that I ([People/JoeHourcle Joe Hourclé]) have been doing this, I haven't run into any solar physics archives using other server software.  If you're using something different, [People/JoeHourcle contact me] and we'll work up a similar list for your software. 
     4 
     5== Serving FITS files == 
     6 
     7By default, Apache does not know what a `.fits` or `.fts` file is, and so will mark it as `text/plain`, which causes many web browsers to try to display the file themselves instead of downloading it or prompting to open it in another program.  You can fix this by adding the following to your `httpd.conf` file: 
     8 
     9{{{ 
     10AddType image/fits .fts .fits .fit .FIT .FITS .FTS 
     11}}} 
     12 
     13If your FITS files don't contain image data and you'd rather not claim that they're images, you can instead use: 
     14 
     15{{{ 
     16AddType application/fits .fts .fits .fit .FIT .FITS .FTS 
     17}}} 
     18 
     19 
     20If you have a directory of FITS files that do not have a file extension, you can instead tell your webserver that all files in that directory, or all files matching a given pattern, are FITS files: 
     21 
     22{{{ 
     23<Location /URI/path/to/files/> 
     24    ForceType application/fits 
     25</Location> 
     26}}} 
     27 
     28{{{ 
     29<Directory /local/filesystem/path/to/files> 
     30    ForceType application/fits 
     31</Directory> 
     32}}} 
     33 
     34{{{ 
     35<FilesMatch "regular expression"> 
     36    ForceType application/fits 
     37</FilesMatch> 
     38}}} 
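
As a hedged illustration of the pattern-based form: the regular expression below is an assumption (a hypothetical naming scheme of an 8-digit date, an underscore, and a 6-digit time, with no extension), not something taken from a real archive.  Adjust it to match your own filenames.

{{{
# Hypothetical naming scheme: YYYYMMDD_HHMMSS with no file extension,
# e.g. 20110812_112852.  The pattern is tested against the filename only.
<FilesMatch "^[0-9]{8}_[0-9]{6}$">
    ForceType application/fits
</FilesMatch>
}}}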
     39 
     40See the Apache documentation on [http://httpd.apache.org/docs/2.0/sections.html Configurable Sections] for information about other ways to selectively apply configuration, such as `<DirectoryMatch>` and `<LocationMatch>`. 
     41 
     42== Serving CDF, HDF or NetCDF files == 
     43 
     44If you're using other scientific file formats, you may need to add one or more of the following: 
     45 
     46{{{ 
     47AddType application/x-netcdf .nc .cdf 
     48AddType application/x-cdf .cdf 
     49AddType application/x-hdf .hdf .h4 .hdf4 .h5 .hdf5 .he4 .he5 
     50}}} 
     51 
     52Please note that you ''cannot'' copy & paste that block in as-is, because it defines `.cdf` as being both CDF and NetCDF; keep only the `.cdf` mapping that matches the files you serve. 
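
A sketch of one way to resolve the conflict, assuming your `.cdf` files are CDF rather than NetCDF (move the extension to the other line if the reverse is true):

{{{
# Assumption: .cdf files on this server are CDF; NetCDF files use .nc only
AddType application/x-netcdf .nc
AddType application/x-cdf .cdf
AddType application/x-hdf .hdf .h4 .hdf4 .h5 .hdf5 .he4 .he5
}}}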
     53 
     54You can also use the !ForceType directive, as explained above under FITS files. 
     55 
     56== Customized Directory Listings == 
     57 
     58If you're using Apache's [http://httpd.apache.org/docs/2.0/mod/mod_autoindex.html#indexoptions.fancyindexing FancyIndexing] to serve directory listings, the default name column is often too narrow to show the long filenames common in solar physics.  We can fix this with: 
     59 
     60{{{ 
     61IndexOptions +SuppressDescription +NameWidth=* 
     62}}} 
     63 
     64If you have people using 'wget' or similar to scrape your directory listings, you may also want to turn off the ability to sort columns, as wget's default settings will follow the column-sorting links and so request each directory multiple times, even when nothing has changed.  The following options are recommended if you expect people to try mirroring your data: 
     65 
     66{{{ 
     67IndexOptions +FoldersFirst +TrackModified +IgnoreClient +SuppressColumnSorting  
     68}}} 
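
The two sets of options can be combined for a single data directory.  This is only a sketch: the path is the placeholder used earlier on this page, and `Options +Indexes` and `+FancyIndexing` are included on the assumption that listings aren't already enabled elsewhere in your configuration.

{{{
<Directory /local/filesystem/path/to/files>
    # Assumption: directory listings are not already turned on elsewhere
    Options +Indexes
    # Wide name column, no description column
    IndexOptions +FancyIndexing +SuppressDescription +NameWidth=*
    # Mirror-friendly settings; options given with a leading + are merged
    IndexOptions +FoldersFirst +TrackModified +IgnoreClient +SuppressColumnSorting
</Directory>
}}}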
     69 
     70If you wish to adjust the style of the listings, while avoiding the extra overhead of Server-Side Includes and reducing the bandwidth used by people mirroring your data, consider using !JavaScript to wrap the style around the listing.  For example, for [http://stereo-ssc.nascom.nasa.gov/data/ins_data/ STEREO's data listing], we're already using [http://jquery.com/ jQuery] for other functionality, so we use it to [http://stereo-ssc.nascom.nasa.gov/incl2/wrap_index.js rewrite the page after it's loaded], and tell the webserver to insert the HTML needed to load the required files: 
     71{{{ 
     72IndexHeadInsert "<script src='/jquery/jquery-1.3.2.min.js'></script><script src='/incl2/wrap_index.js'></script>" 
     73}}} 
     74 
     75== Hardening Your Web Server == 
     76 
     77At a bare minimum, you should prevent your web server from revealing all of the modules that it has installed, with their version numbers, so you're not advertising the equivalent of an unlocked window.  You can do this by finding the !ServerTokens directive in your `httpd.conf` file and changing it to the following (or inserting it, if it doesn't exist): 
     78 
     79{{{ 
     80ServerTokens Prod 
     81}}} 
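
For reference, the effect on the `Server` response header looks roughly like the following.  The module names and version placeholders in the comments are made up for illustration, not output from a real server.

{{{
# With the default (ServerTokens Full), the header might look something like:
#   Server: Apache/2.2.x (Unix) mod_ssl/2.2.x OpenSSL/0.9.x
# With the setting below, only the product name is sent:
#   Server: Apache
ServerTokens Prod
}}}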
     82 
     83For a much more in-depth list of possible changes to make, see the [http://benchmarks.cisecurity.org/ security benchmarks from the Center for Internet Security], and review the appropriate document for your OS, web server, and database, if applicable.  Make sure to back up your configuration first, though, as some of the recommendations are overly restrictive and may break your web server.  (Also note that they ask you for info about yourself, but you only have to answer the yes/no questions to download files.)  These are things you ''can'' do to make the system more secure, not things you necessarily ''should'' do for all servers. 
     84 
     85A few words of warning, though:   
     86 
     871. '''Back up your configuration directory before making changes'''.  Yes, I said it before, and yes, this is in bold for a reason. 
     882. You can test the webserver's configuration on unix-like systems with `apachectl -t`.  You should do this as you make changes, so you're not trying to figure out which of the dozens of changes broke the configuration. 
     893. If you want to restart your webserver after each change, use `apachectl graceful`, which won't cut off people who are currently downloading files and force them to start over.  (Once you're all done, though, you may want to do a full stop & start.) 
     904. After making changes, pay more attention than normal to your web server's error logs for the next few days to see if anything strange is happening.  If you're using 'piped logging', you'll likely need to check in two places, as error logs when starting up are not written to the piped logs. 
     915. The section on disabling the `auth*` modules doesn't make it clear that you ''should not'' turn off the module `authz_host_module`, which is required for some of the other changes they recommend.   
     926. If you're using any of those authentication modules and the directives were in `<IfModule>` blocks, turning off the modules will silently turn off the access restrictions.  (If you're ''not'' using `IfModule`, the now-unrecognized directives should stop the webserver from starting back up.) 
     937. The section "Deny Access to OS Root Directory" tells you to deny all access to `<Directory />` and then turn on access to the web root, but neglects to tell you to also give access to the location of the icons used in directory listings, the files used for error messages, etc., which are in different places in each OS.  (MacOS uses `/usr/share/httpd/`; CentOS uses `/var/www/`) 
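
The last point can be sketched as follows.  This assumes a CentOS-style layout (`/var/www/html` for the web root, with `/var/www/icons` and `/var/www/error` beside it) and the Apache 2.0/2.2 access control directives used elsewhere on this page; check the actual paths on your own system, and note that Apache 2.4 replaces `Order`/`Allow`/`Deny` with `Require`.

{{{
# Deny everything by default
<Directory />
    Order deny,allow
    Deny from all
</Directory>

# Re-open the web root (assumed path)
<Directory /var/www/html>
    Order allow,deny
    Allow from all
</Directory>

# Also re-open the directory-listing icons and error documents (assumed paths)
<Directory /var/www/icons>
    Order allow,deny
    Allow from all
</Directory>
<Directory /var/www/error>
    Order allow,deny
    Allow from all
</Directory>
}}}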
     94 
     95== A Note on 'htaccess' files == 
     96 
     97Turning on support for `.htaccess` files means that the webserver needs to not only look in the file's directory, but in each and every directory above it as well.  Yes, there's caching, but it still has to check if they've changed, or new ones get introduced.  It's better to use `<Directory>` or `<Location>` sections in the main httpd.conf, which is read only when the server restarts.  You can turn off `.htaccess` checking by setting the `AllowOverride` directive to: 
     98 
     99{{{ 
     100AllowOverride None 
     101}}} 
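
`AllowOverride` is only valid inside a `<Directory>` section, so in practice the setting is wrapped as in the sketch below.  Applying it to `/` assumes you want `.htaccess` files ignored everywhere; use a narrower path if some directory still needs them.

{{{
# Assumption: no directory on this server relies on .htaccess files
<Directory />
    AllowOverride None
</Directory>
}}}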
     102 
     103== Dealing with Problem Users / Search Spiders / Etc == 
     104 
     105I've noticed that a few archives have tried putting a `robots.txt` file in their data directory, e.g.: 
     106 
     107{{{ 
     108http://host.example.edu/data/robots.txt 
     109}}} 
     110 
     111Search engines don't look for it there.  The file '''must''' be served from the root of the web server (e.g. `http://host.example.edu/robots.txt`) to be of any use.  Even then, it's only a request that they not scrape certain sections, and won't actually stop them. 
     112 
     113If you have to, you can block misbehaving IP addresses in a few different ways: 
     114 
     115{{{ 
     116Deny from 10.1.1.117 
     117}}} 
     118 
     119This is rarely of much use, as they'll typically just change IP addresses, and do it again.  You can block networks by giving a partial address, eg: 
     120 
     121{{{ 
     122Deny from 10.1.1. 
     123}}} 
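
`Deny` lines don't stand alone; with Apache 2.0/2.2-style access control they sit alongside an `Order` directive inside a section.  A minimal sketch, reusing the placeholder filesystem path and the example addresses from this page (the CIDR form is an equivalent way of writing the partial address):

{{{
<Directory /local/filesystem/path/to/files>
    # Allow is evaluated first, then Deny; a request matching both is denied
    Order allow,deny
    Allow from all
    # A single host
    Deny from 10.1.1.117
    # The whole 10.1.1.x network, in CIDR notation
    Deny from 10.1.1.0/24
</Directory>
}}}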
     124 
     125Although you ''can'' specify host names or domain names, it's not recommended as it requires doing a DNS lookup for requests, which can slow things down. 
     126 
     127If you're trying to identify who it might be, you can try narrowing it down with: 
     128 
     1291. Do a reverse DNS lookup on the IP address. 
     1302. Use your favorite internet search engine to search for both the IP address and the DNS name (if there was one).  They might've posted to a newsgroup or similar that logged the host or IP. 
     1313. If it didn't have a DNS name, you can try using the `whois` command to see which institution the IP range is registered to. 
     132 
     133