RESCOMP Archives

August 2015

RESCOMP@LISTSERV.MIAMIOH.EDU

Subject:
From: "Dhananjai M. Rao" <[log in to unmask]>
Reply-To: Research Computing Support <[log in to unmask]>, Dhananjai M. Rao
Date: Tue, 25 Aug 2015 09:16:47 -0400
Content-Type: text/plain
Parts/Attachments: text/plain (202 lines)

That should work, but prefer a loop with an upper limit on the number of
retries rather than an infinite loop. A sleep of 1 second between
attempts should be sufficient.
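
Something along these lines, for instance (just a sketch; "output_path"
here is a placeholder filled in from your traceback, and the retry limit
is arbitrary):

    import time

    output_path = "SEEDS1/PHRAIDER/ce10.chrV.s2.f3.consensus.txt"  # placeholder
    MAX_RETRIES = 10  # bounded retry: give up eventually instead of spinning forever

    for attempt in range(MAX_RETRIES):
        try:
            fp = open(output_path, "w")
            break  # success; stop retrying
        except OSError:
            time.sleep(1)  # give NFS a second to refresh the directory handle
    else:
        # the loop never hit "break", so every attempt failed
        raise RuntimeError("could not open %s after %d retries"
                           % (output_path, MAX_RETRIES))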

Now, that is just one spot. Are you going to modify all the spots,
including those in the libraries and in the C/C++ code underlying Python
(which may create intermediate JIT files as needed)? Or perhaps you are
going to do the easy/correct thing and create a separate directory for
each job?
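
For example, something along these lines (again only a sketch; it
assumes your scheduler exports a unique job ID such as PBS_JOBID, and
the directory and file names are just taken from your traceback):

    import os

    # Use the scheduler-assigned job ID (falling back to the process ID)
    # so every job writes into its own directory and no two compute
    # nodes ever update the same directory entries.
    job_id = os.environ.get("PBS_JOBID", str(os.getpid()))
    job_dir = os.path.join("SEEDS1/PHRAIDER", job_id)
    os.makedirs(job_dir)  # job_id is unique, so this directory is fresh

    output = os.path.join(job_dir, "ce10.chrV.s2.f3.consensus.txt")
    wp = open(output, "w")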

On Tue, 2015-08-25 at 09:13 -0400, Karro, John wrote:
> It's just a timing issue, correct?  So if I were to do something like:
> 
> 
>     while True:
>         try:
>             fp = open(blah blah)
>         except:
>             continue
>         break
> 
> 
> That should allow it to eventually open the file and continue on,
> correct? 
> 
> 
> Would it be better to stick a sleep statement in the exception handler --
> is this something I can count on needing a few seconds to resolve?
> 
> 
> John
> 
> ----------------------------------------------------------------------------------------------
> Dr. John Karro, Associate Professor
> Department of Computer Science and Software Engineering
> Affiliate: Department of Microbiology, Department of Statistics
> Office: Benton 205D, Miami University, Oxford, Ohio
> ----------------------------------------------------------------------------------------------
> 
> On Tue, Aug 25, 2015 at 8:59 AM, Dhananjai M. Rao <[log in to unmask]>
> wrote:
>         If they are writing to the same directory from different
>         compute nodes, then this problem can occur because the
>         directory entries are being updated.
>         
>         node #1: opens a file, so the directory's timestamps and
>         inodes have to be updated.
>         
>         node #2: tries to open a file in the same directory while the
>         inodes are being updated, and NFS has to reject the second one
>         because the directory file handles are stale.
>         
>         This is a standard issue, and no amount of additional file
>         space or speed will fix it; the only solution is to create
>         separate directories for each job.
>         
>         On Tue, 2015-08-25 at 08:57 -0400, Karro, John wrote:
>         > Yes.
>         >
>         > On Tue, Aug 25, 2015 at 8:56 AM, Dhananjai M. Rao
>         > <[log in to unmask]> wrote:
>         >         Are there multiple jobs writing to the same directory?
>         >
>         >         On Tue, 2015-08-25 at 08:53 -0400, Karro, John wrote:
>         >         > I really don't think so.  Obviously, I could have
>         >         > a bug I'm unaware of.  But that output file should
>         >         > be unique to that call to that script.  I really
>         >         > don't see how I could have screwed it up.
>         >         >
>         >         > On Tue, Aug 25, 2015 at 8:49 AM, Dhananjai M. Rao
>         >         > <[log in to unmask]> wrote:
>         >         >         Is it possible that there is some other
>         >         >         job that is writing to the same file? If
>         >         >         you are bulk scheduling jobs, is it
>         >         >         possible that you are accidentally
>         >         >         scheduling 2 jobs that are writing to the
>         >         >         same directory?
>         >         >
>         >         >         On Tue, 2015-08-25 at 08:44 -0400, Karro, John wrote:
>         >         >         > Can anyone explain to me the following
>         >         >         > OS errors occurring sporadically on
>         >         >         > Redhawk, as returned by my Python code:
>         >         >         >
>         >         >         > Traceback (most recent call last):
>         >         >         >   File "consensus_seq.py", line 84, in <module>
>         >         >         >     main(args.seq,args.elements,args.output,args.fa_output)
>         >         >         >   File "consensus_seq.py", line 45, in main
>         >         >         >     wp = open(output, "w")
>         >         >         > OSError: [Errno 116] Stale NFS file handle:
>         >         >         > 'SEEDS1/PHRAIDER/ce10.chrV.s2.f3.consensus.txt'
>         >         >         >
>         >         >         > I'm running batches of jobs, and this
>         >         >         > seems to pop up every once in a while and
>         >         >         > kill my pipeline.  The directory does
>         >         >         > exist.  And if I rerun the program (many
>         >         >         > hours later) it works fine.
>         >         >         >
>         >         >         > Any idea why this might happen?
>         >         >         >
>         >         >         >      John