PDSF - General
From Computing@RNC
Revision as of 19:59, 19 November 2009
How do I contact the PDSF admin team in case of a problem?
You can always send email to consult@nersc.gov, or submit a ticket through the web interface:
Then follow "Ask nersc Consultants".
My password no longer works at PDSF
- I mistyped the password three (or more) times
For security reasons, NERSC will *lock out* an account that has three or more failed login attempts. The lockout lasts *12 hours*. If this happens to you, contact NERSC account support (1-800-666-3772, option #2) to have it reset.
- It's been a while, and I can't remember my password
Please call NERSC account support (1-800-666-3772, option #2) to have your password reset. Note that if you have not logged in for a long time (~6 months), your account may be deactivated. If this is the case, NERSC account support will ask that you re-submit your signed NERSC User Agreement.
How to create & access individual web content on PDSF
- Current & Future Model
* static content put under group writeable area: /project/projectdirs/star/www/
* accessed by http://portal.nersc.gov/project/star/
* Please add your own user area subdirectory - e.g. http://portal.nersc.gov/project/star/username
* there will not be a system wide migration; each user should migrate their own web area.
* static html only (e.g. dynamic content must be pre-generated)
- Old model is Deprecated
* static content put into $HOME/public_html
* accessible via http://pdsfweb01.nersc.gov/~username
How to use IO resources of networked file systems (*eliza*)
The networked file systems on PDSF are visible from both interactive (pdsf.nersc.gov) and batch nodes. Batch processes should always specify an IO resource in the job description. The STAR scheduler handles this more or less automatically. For explicit job submission, use:
qsub -hard -l elizaXXio=1 [script]
Here -l elizaXXio=1 identifies the network IO resource being accessed by the job (replace XX with the number of the eliza system) and assigns a resource limit of 1. Failing to supply resource limits explicitly can cause your jobs to take a larger fraction of an IO resource, degrading its overall performance to the detriment of everyone.
Users who abuse the limits will have their use of the system limited more directly. If you over-specify your resource needs, your job will likely not run. For example, if you always submit with -l eliza1io=1,eliza8io=1,eliza9io=1 just because you have used those resources in the past, your jobs can wait in the queue for a long time until all of those resources are free at the same time.
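One way to avoid over-specifying is to build the resource request from only the eliza systems a given job touches. The following is a minimal sketch; the function name eliza_resource_args is hypothetical, and it only assembles the -hard -l elizaXXio=1 form shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE or PDSF): print the qsub resource
# flags for the eliza file systems a job actually uses, given their numbers.
eliza_resource_args() {
    resources=""
    for n in "$@"; do
        # Append elizaNio=1, comma-separated after the first entry.
        resources="${resources:+$resources,}eliza${n}io=1"
    done
    # With no numbers given, print nothing so that no IO resource is requested.
    if [ -n "$resources" ]; then
        printf '%s\n' "-hard -l $resources"
    fi
}
```

A job that only reads from eliza8 and eliza9 could then be submitted with qsub $(eliza_resource_args 8 9) [script], leaving the substitution unquoted so the flags are passed as separate words.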
For more information about setting io resource usage, please see this PDSF FAQ entry.
SGE Questions
For a good overview, please see the PDSF SGE page (http://www.nersc.gov/nusers/systems/PDSF/software/SGE.php). There are several tools available to monitor your jobs. The qmon command is a graphical interface to SGE, which can be quite useful if your network connection is good. On the command line, try sgeuser for overall farm status and qstat for your individual job listings; both are discussed in the overview page linked above.
How Can I tell if a Resource is used up?
Consumable resources in SGE on PDSF (known as complexes) are configured globally but are requested by host. Thus, to determine whether a resource is available, you can run the SGE qhost command against any host known to the system. Because the resource is global, SGE reports the same value for it on every host.
qhost -F eliza8io,eliza9io,eliza13io -h pc1008
will produce output showing the availability of these IO resources. If a value is 0.0000, that resource is completely used up. If a value is greater than 0, new jobs can access that resource.
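The check described above can be scripted. The sketch below reads qhost -F output on standard input and prints one status line per eliza IO resource. The function name eliza_io_status is hypothetical, and the parsing assumes each resource appears on a line as name=value (for example gc:eliza8io=0.000000); verify this against the actual output on your system.

```shell
#!/bin/sh
# Hypothetical helper: summarize eliza IO resources from `qhost -F` output.
# Assumption: each resource is printed as "elizaNNio=<number>" somewhere on
# a line, possibly with a prefix such as "gc:".
eliza_io_status() {
    awk '
        match($0, /eliza[0-9]+io=[0-9.]+/) {
            # Isolate "elizaNNio=value" and split it into name and value.
            split(substr($0, RSTART, RLENGTH), kv, "=")
            # A value of zero means nothing is left for new jobs.
            state = (kv[2] + 0 > 0) ? "available" : "used up"
            print kv[1] ": " state
        }'
}
```

For example: qhost -F eliza8io,eliza9io,eliza13io -h pc1008 | eliza_io_status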
Batch jobs: Local scratch space ($SCRATCH)
Each node has local disk storage associated with it, accessible through $SCRATCH. It is recommended that users read and write to the scratch area while their job is running, then copy their output files to the final destination (either HPSS or GPFS disk).
SGE, the batch queue system, maintains a unique disk area for each job as scratch. The environment variable $SCRATCH is mapped to this area for each individual job. This means that users do not have to worry about their jobs running on different cores of one node interfering with each other.
It's important to remember that SGE removes this directory as soon as the job is complete. If you want to keep any output files, your job will need to archive those files before exiting.
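One way to make sure output is saved before the job exits is to copy it out from an exit handler in the job script. The following is only a sketch: save_outputs and the DEST variable are hypothetical names, the *.root pattern is an assumption about what the job writes, and a job that is killed outright may never run the handler.

```shell
#!/bin/sh
# Sketch: copy results out of $SCRATCH before the job script exits, since
# SGE removes the scratch directory once the job completes.
# DEST is a hypothetical variable naming the destination directory.
save_outputs() {
    for f in "$SCRATCH"/*.root; do
        # Skip the unexpanded pattern when no output file exists.
        [ -e "$f" ] || continue
        cp "$f" "$DEST"/ || echo "failed to copy $f" >&2
    done
}
```

In a job script, set DEST and register the handler near the top with trap save_outputs EXIT, then do the work inside $SCRATCH as usual.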
Batch jobs: I get an error when I try to create a directory under /scratch. What do I do?
The local scratch area is now managed by SGE, and users *cannot* create and maintain their own directories on /scratch. The disk area you can write to is given by the environment variable $SCRATCH or $TMPDIR. Please use these instead of /scratch/$username. For example, the job script:
#!/bin/sh

# First argument: the MuDst file to analyze
mudstfile=$1
cd $SCRATCH
pwd
root4star -q ~/analysis/macros/myAnalysis.C $mudstfile
mv myoutput.root $mudstfile.analysis.root
# Archive the result to HPSS before SGE removes $SCRATCH
hsi "cd analysis; prompt; mput $mudstfile.analysis.root"
produces output like the following:
/scratch/1135296.1.starprod.64bit.q
Warning in <TEnvRec::ChangeValue>: duplicate entry <Library.TMCParticle=libEGPythia6.so
libEG.so libGraf.so libVMC.so> for level 0; ignored
*******************************************
*                                         *
*        W E L C O M E to R O O T         *
*                                         *
*    Version 5.12/00f 23 October 2006     *
*                                         *
*  You are welcome to visit our Web site  *
*           http://root.cern.ch           *
*                                         *
*******************************************

FreeType Engine v2.1.9 used to render TrueType fonts.
Compiled on 23 July 2008 for linux with thread support.

CINT/ROOT C/C++ Interpreter version 5.16.13, June 8, 2006
Type ? for help. Commands must be C++ statements.
Enclose multiple statements between { }.
*** Float Point Exception is OFF ***
*** Start at Date : Thu Oct 15 11:08:59 2009
QAInfo:You are using STAR_LEVEL : new, ROOT_LEVEL : 5.12.00 and node : pdsf3

[clip]

***********************************************************************
* NERSC HPSS User SYSTEM (archive.nersc.gov) *
***********************************************************************
Username: aarose UID: 34500 Acct: 34500(34500) Copies: 1 Firewall: off [hsi.3.4.3 Thu Jan 29 16:10:54 PST 2009][V3.4.3_2009_01_28.05]
A:/home/s/starofl->
[clip]
How to retrieve SGE info for jobs that have finished
Accounting information can be obtained using the SGE qacct command, which by default queries the SGE accounting file $SGE_ROOT/default/common/accounting. Since the accounting file is rotated on PDSF, you will need to point qacct to the specific accounting file that covers your job. First, find the accounting file by date:
ls $SGE_ROOT/default/common/accounting.*
Then query that file:
qacct -j yourjobid -f $SGE_ROOT/default/common/accounting.yourjobrundate
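If you do not remember the run date, the two steps above can be combined into a loop that tries each rotated file in turn. This is a sketch only: find_job_accounting is a hypothetical name, and it assumes qacct exits with a non-zero status when the job is not in the file it was given.

```shell
#!/bin/sh
# Hypothetical helper: query each rotated accounting file until one of them
# contains the job. Uses only the `qacct -j <id> -f <file>` form shown above.
find_job_accounting() {
    jobid=$1
    for f in "$SGE_ROOT"/default/common/accounting.*; do
        # Skip the unexpanded pattern when no rotated file exists.
        [ -e "$f" ] || continue
        # Assumption: qacct returns non-zero if the job is not in this file.
        if qacct -j "$jobid" -f "$f" 2>/dev/null; then
            return 0
        fi
    done
    echo "job $jobid not found in any rotated accounting file" >&2
    return 1
}
```

Call it as find_job_accounting yourjobid. Scanning every file can be slow, so prefer the dated file when you know it.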