Dec 10, 2013, 10:18 AM

Congratulations again on having your paper (NetCheck: Network Diagnoses from Blackbox Traces) accepted to NSDI 2014. The reviews are appended below and can also be viewed at the submission site (https://papers.usenix.org/hotcrp/nsdi14).

Competition was tough this year. We received many more submissions than last year and could accept only 38 of the 213 submissions.

Each accepted paper has been assigned a shepherd who will help revise the paper. Before the paper can be published, the shepherd needs to be satisfied that all (addressable) reviewer concerns have been addressed. Your shepherd is Firstname Lastname (email). Please reach out to them soon.

Here is a possible revision plan to follow:

Dec 16: The authors send a revision plan to the shepherd
Dec 20: The shepherd provides feedback on the plan
Feb 10: The authors send a revised version of the paper to the shepherd
Feb 17: The shepherd provides feedback on the paper
Feb 23: The authors send a close-to-final version to the shepherd
Feb 25: The shepherd approves the paper, possibly with minor changes
Feb 26: The final version is submitted

The firm deadlines in this plan are Dec 16 and Feb 26. The rest of the plan can vary based on agreement between you and your shepherd.

Submission and formatting instructions for preparing your camera-ready are available at https://www.usenix.org/conference/nsdi14/participants.

Regards,
Ion and Ratul (NSDI '14 chairs)

===========================================================================
NSDI '14 Review #118A
Updated 6 Dec 2013 5:45:18pm PST
---------------------------------------------------------------------------
Paper #118: NetCheck: Network Diagnoses from Blackbox Traces
---------------------------------------------------------------------------

Overall merit: 3. Good paper: I can accept this paper, but I will not champion it
Reviewer confidence: 2. Low confidence

===== Paper summary =====

The paper presents a new tool (called NetCheck) for diagnosing problems with network applications. It takes as input the sequence of system calls made by the participating end-systems, and it determines whether these sequences indicate an "error" (local, network, or portability-related); an "error" is a deviation from a manually defined model of the POSIX network API. The challenge lies in ordering system calls performed at different end-systems without assuming synchronized clocks; the authors address it through a simple but effective heuristic. The paper presents strong evidence that NetCheck can automatically (and accurately) diagnose real problems that were previously diagnosed manually.

===== Paper strengths =====

- A real, useful, publicly available troubleshooting tool.
- Well-justified, well-described design.
- Compelling evaluation.

===== Paper weaknesses =====

- The properties of the proposed system (what it does and does not diagnose) are not 100% clear.

===== Comments for author =====

This is undeniably useful work: any tool that automatically diagnoses real bugs is worth presenting and discussing. However, NetCheck's properties (what it can and cannot diagnose) are not 100% clear:

- The evaluation section considered only single bugs (bugs that occur one at a time). What about multiple bugs occurring one after the other? E.g., according to the authors' tech report, in one of the nonblock_connect cases reported in Table 2, a non-blocking connect was followed by a non-blocking recv, and the bug could not be identified (see the sketch below). Would such ambiguities confuse the diagnosis in multiple-bug cases?
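[Editor's note: the ambiguous pattern referenced above is, roughly, a non-blocking connect() followed immediately by a non-blocking recv(). A minimal Python sketch with a placeholder address -- this is an editorial illustration, not code from the paper or the tech report:]

    import errno
    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setblocking(False)
    try:
        s.connect(("203.0.113.10", 80))   # placeholder documentation address
    except BlockingIOError:
        pass                              # non-blocking connect in progress
    try:
        s.recv(4096)                      # recv before the connect completes
    except OSError as e:
        # On Linux this typically fails with ENOTCONN, which, from the
        # trace alone, is hard to distinguish from other failure modes.
        print(errno.errorcode[e.errno])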
- In certain (few) cases, NetCheck terminated without fully processing all syscalls. Why was that?

- In Section 6.1, paragraph 2, it says that NetCheck correctly identified the block_tcp_socket bug; in Section 6.2, it says that NetCheck did *not* identify the closethread bug. But according to the authors' tech report, these two bugs are the same(?) Most interestingly, this same bug *was* identified when Skype was tested.

As somebody who is not an expert but has a deep interest in network debugging, I feel lost in a sea of network debugging tools that address this case but not that case, and whose capabilities overlap in non-obvious ways. If the paper is accepted, it would be great if the authors summarized the design space and where existing solutions fall.

For example, there was a paper (http://conferences.sigcomm.org/sigcomm/2005/paper-BisFai.pdf) several years back from the PL community, which presented a formal model of the POSIX network interface and a way to automatically test whether a given network trace conforms to the model. This is not the same as NetCheck (for one thing, it considers a trace from a single end-system). However, the two pieces of work seem to address some common problems in slightly but not fundamentally different ways:

- Each relies on a manually derived model of the POSIX API. The two models seem to have the same goal (define correct end-system interaction), yet the model in the PL paper is significantly more complex.

- Each involves testing real traces for conformance against a model. Again, the algorithm used in the PL paper is significantly more complex.

Is the complexity of the PL paper unnecessary? Why is it that NetCheck relies on a simpler model and algorithm, yet performs accurate diagnosis? Is there a way to relate one work to the other? Could we view NetCheck as an optimized version of the PL work, one that leverages domain-specific knowledge to improve performance?

A related question: Is NetCheck the only tool that can automatically diagnose the problems mentioned in Section 6?

===========================================================================
NSDI '14 Review #118B
Updated 2 Nov 2013 1:14:22pm PDT
---------------------------------------------------------------------------
Paper #118: NetCheck: Network Diagnoses from Blackbox Traces
---------------------------------------------------------------------------

Overall merit: 2. Weak paper: This paper should be rejected, but I will not fight strongly against it
Reviewer confidence: 3. Moderate confidence

===== Paper summary =====

This paper describes NetCheck, a tool for diagnosing network problems. It relies on black-box tracing mechanisms (e.g., strace, which traces system calls). The idea is twofold: 1) totally order the set of distributed traces (without requiring a priori synchronized clocks), and 2) use a simple network model to check whether traced behavior deviates from the norm. NetCheck was able to reproduce the diagnosis of a few tens of known bugs.

===== Paper strengths =====

- Trace processing is quick.
- The approach makes sense and is effective on known problems.

===== Paper weaknesses =====

- No new problems found.
- The network model is too simplistic.
- Substantial effort is required to codify the behavior of system calls; it is unclear whether this helps with porting, and unclear how long it took.
- Unclear how long weeding out the 5-10% of false positives took.

===== Comments for author =====

Overall, this was an enjoyable paper to read.
First off, I think that the limitations are substantial. Yes, it is a black-box tool, and it is neat that you got it to do as much as it did. But then, you didn't report a single new problem. The network model is overly simplistic. How did you pick the problems you reproduced?

It is unclear if the tool works with threaded applications. It appears to, in some cases?

I wasn't able to see what the effort (in person-months) was to understand and codify the POSIX calls. How would this help, at all, in porting to a different platform?

The false positive rate is non-negligible: it is 5-10%. How long did it take to eliminate these cases?

===========================================================================
NSDI '14 Review #118C
Updated 4 Nov 2013 4:57:29pm PST
---------------------------------------------------------------------------
Paper #118: NetCheck: Network Diagnoses from Blackbox Traces
---------------------------------------------------------------------------

Overall merit: 3. Good paper: I can accept this paper, but I will not champion it
Reviewer confidence: 3. Moderate confidence

===== Paper summary =====

This paper presents a system called NetCheck that diagnoses faults by (i) using heuristics to totally order a set of traces collected by tracing tools like strace at multiple locations, and (ii) applying a simple network model to detect deviations from the (idealized) network semantics.

===== Paper strengths =====

The idea of combining system call traces from multiple end points for fault diagnosis is promising. The evaluation suggests good accuracy for the technique.

===== Paper weaknesses =====

From the paper, it's difficult to see how much diagnosis is automated as opposed to manual inspection of the collected traces (which is known to be labor intensive and error prone). Also, NetCheck seems more useful for diagnosing faults that are reproducible.

===== Comments for author =====

By combining system call traces from multiple end points, one clearly obtains a lot of information about the potential root causes of the faults. The heuristics used by the paper to totally order events seem reasonable, though they feel a little ad hoc (see the sketch after these comments). The paper has also done a good job of evaluating the technique, and the performance seems quite good. The fact that the tool is open source also makes it valuable.

A couple of potential concerns:

1) It appears that the technique is more appropriate for diagnosing faults that are reproducible. Otherwise, when the faults occur, it's difficult to ensure that one happens to have strace logs at all involved end points. In fact, I think this is a rather strong requirement. It would be useful if the paper could show whether NetCheck is useful even if one has strace logs at only a subset of the endpoints involved in the fault.

2) Automation versus manual inspection. From the paper, it's not clear how much of the diagnosis done using NetCheck is automated (as opposed to manual inspection, which is both labor intensive and error prone). In fact, most bugs reported in bug trackers are already diagnosed by hand in advance, so NetCheck already knows where to look. For the case studies involving popular apps, the root causes also seem to require a lot of manual digging to figure out.
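[Editor's note: to make the total-ordering idea concrete, here is a minimal sketch -- not NetCheck's actual code or API -- of a dependency-driven merge: repeatedly try the head syscall of each host's trace against a network model, defer any head the model rejects, and flag the run when no head is admissible.]

    from collections import deque

    def plausible_order(traces, model_accepts):
        """traces: dict host -> deque of syscalls in local order;
        model_accepts(prefix, syscall) -> bool is the network model."""
        order = []
        pending = {host: t for host, t in traces.items() if t}
        while pending:
            progressed = False
            for host, trace in list(pending.items()):
                if model_accepts(order, trace[0]):
                    order.append(trace.popleft())
                    if not trace:
                        del pending[host]
                    progressed = True
                    break
            if not progressed:
                # No host's next syscall fits the model: the kind of
                # deviation a NetCheck-style tool would diagnose.
                raise ValueError("no admissible next syscall")
        return order

    # Toy model: a recv() is admissible only after some send().
    def toy_model(prefix, call):
        return call[1] != "recv" or any(c[1] == "send" for c in prefix)

    client = deque([("c", "connect"), ("c", "send"), ("c", "close")])
    server = deque([("s", "accept"), ("s", "recv"), ("s", "close")])
    print(plausible_order({"c": client, "s": server}, toy_model))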
===========================================================================
NSDI '14 Review #118D
Updated 23 Nov 2013 7:54:37am PST
---------------------------------------------------------------------------
Paper #118: NetCheck: Network Diagnoses from Blackbox Traces
---------------------------------------------------------------------------

Overall merit: 4. Excellent paper: I will champion this paper for acceptance
Reviewer confidence: 3. Moderate confidence

===== Paper summary =====

The paper presents the NetCheck tool for debugging network problems in distributed applications. NetCheck logs system calls for unmodified applications, combines logs from multiple hosts into a plausible global log (based on domain details from the POSIX API), and analyzes the logs to uncover errors. NetCheck runs quickly and uncovered bugs in several real applications.

Overall, this is an interesting paper, and the resulting tool is clearly useful in practice. Some of the presentation could be improved (to help the reader rise above the details), but this can be addressed in revising the paper.

===== Paper strengths =====

Finding network bugs in distributed applications -- whether application bugs or problematic network behavior (severe loss, NAT boxes) -- is important and challenging. The paper gets good traction on the problem, especially given that the techniques work with unmodified applications.

===== Paper weaknesses =====

The paper sometimes gets a bit lost in the details. Some changes to the presentation in Section 4 would help separate the domain-specific assumptions and constraints from the general algorithmic problems. (These presentation issues can be addressed in revising the paper.)

===== Comments for author =====

- The example for Challenge 1 could be stronger. The two orderings are semantically equivalent, right? In either case, the recv() call is made before data has reached the receiver. A more compelling example would illustrate a more inherent ambiguity between two fundamentally different possible causes of a problem.

- The discussion of Contribution 1 says that "NetCheck assumes that syscalls are atomic". What exactly does that mean? That one non-blocking call starts and completes before the next one starts? (If so, what about multi-threaded applications?)

- Why does NetCheck form just one plausible global ordering instead of many?

- Before jumping into the algorithm in Section 4.1, it would be nice to explain the challenges and the constraints for the algorithm. That would help motivate (say) why exploring the syscall at the top of each trace is not sufficient. Similarly, the forward reference to the dependency graph makes the text hard to follow. The reader needs more of a "big picture" understanding of the algorithm before diving into the details. Maybe identify a set of concepts or principles (i.e., the networking "domain details" that constrain the interpretation of the trace) and highlight them first?

- With the domain-specific constraints for merging the local traces made explicit, does finding a plausible ordering reduce to any previously known technique?

- Elaborating more on the example rules from [53] would be nice.

- The NAT example is interesting, and may be worth mentioning earlier. It is a nice example of why merging the local traces is hard -- you cannot necessarily rely on the five-tuple to combine two sets of socket calls, at least not for failed connection attempts (a toy illustration follows below).
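[Editor's note: the five-tuple point above is easy to see with invented addresses -- once a NAT rewrites the client's source address and port, the two endpoints record different tuples for the same connection, so a merger cannot pair their socket calls by five-tuple alone:]

    # Invented addresses, for illustration only.
    client_view = ("10.0.0.5", 43211, "198.51.100.7", 8080)     # private src
    server_view = ("203.0.113.2", 60501, "198.51.100.7", 8080)  # NAT'd src

    assert client_view[2:] == server_view[2:]   # same destination...
    assert client_view[:2] != server_view[:2]   # ...but rewritten source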
===========================================================================
NSDI '14 Review #118E
Updated 25 Nov 2013 12:03:58am PST
---------------------------------------------------------------------------
Paper #118: NetCheck: Network Diagnoses from Blackbox Traces
---------------------------------------------------------------------------

Overall merit: 2. Weak paper: This paper should be rejected, but I will not fight strongly against it
Reviewer confidence: 2. Low confidence

===== Paper summary =====

NetCheck is a system that helps diagnose application-level problems related to networking -- such as misplaced system calls, IPv4/v6 compatibility issues, or NATs causing nonreachability. It collects syscall traces on two communicating hosts, uses a dependency-based algorithm to heuristically replay them in the order they happened, and then uses a set of preconfigured diagnosis rules to pick out probable causes of a fault. The paper applies NetCheck to over 40 pre-existing bugs; it flagged causes in nearly all cases (95.7%). The paper also checked bugs injected into a testbed, and diagnosed several bugs discovered organically in popular applications (an FTP client, Pidgin chat, Skype, and VirtualBox).

The direction of this paper is very interesting, and I would give it a high score, but it is not clear what a diagnosis provided by NetCheck means in practice; that is, how useful it is at diagnosing root causes.

===== Paper strengths =====

Tackling an important and interesting problem. Evaluation on real bugs. Several interesting techniques, including the replay algorithm and "encoding misconceptions" to flag unexpected behavior.

===== Paper weaknesses =====

While NetCheck is clearly successful at producing diagnoses, it is unclear what a diagnosis means -- that is, how closely NetCheck narrows down the root cause. This was rather opaque in the paper. Several aspects of the design and evaluation were confusing or not fully described.

===== Comments for author =====

The key question in my mind is how useful NetCheck's diagnoses are. One way to show the usefulness of a debugging tool is to show that it discovers new bugs. Here, NetCheck was used to diagnose bugs from an existing bug database, or bugs already noticed by users. I found it very opaque how much easier NetCheck makes the debugging process.

The evaluation shows that NetCheck "diagnoses" a large fraction of the bugs, but what exactly does "diagnose" mean? The text has nice descriptions, like "This may cause a known issue ... a close call made on the socket from a different thread will keep the socket blocking indefinitely ... NetCheck successfully diagnosed it" (this thread-close pattern is sketched after these comments). However, the diagnosis apparently doesn't mean outputting that high-level text; although I didn't see it stated clearly, I assume the definition of a successful diagnosis is mentioning one of the related syscalls in the output.

I checked the tech report [53] for examples of the NetCheck output. Looking at the bind6 output, for example, NetCheck does seem to output useful information ("...not the same IP version...") but also other information ("This may be due to the timing of the connect and listen..."). So what exactly is the definition of a diagnosis? Similarly, the definition of "root cause" was not clear; sometimes this would refer to a line of source code in an application, but I think that is not what you mean. Regardless of the definition, the early part of the paper should describe what a successful diagnosis is, to put some meaning behind a statement like "NetCheck correctly detects and diagnoses more than 90%...".
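[Editor's note: for concreteness, the thread-close behavior quoted above looks roughly like the following minimal sketch, assuming CPython on Linux; the exact behavior is platform-dependent, and this is not code from the paper:]

    import socket
    import threading
    import time

    a, b = socket.socketpair()

    def reader():
        b.recv(4096)    # blocks: no data is ever sent on the pair

    t = threading.Thread(target=reader, daemon=True)
    t.start()
    time.sleep(0.5)
    b.close()           # close() issued from a different thread
    t.join(timeout=2)
    # On Linux the blocked recv() is typically not woken by the close(),
    # so the reader thread remains stuck:
    print("reader still blocked:", t.is_alive())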
A couple of the design and evaluation choices were not clear:

- How were the 46 bugs to be reproduced chosen? Are they a random sample, or hand-picked to work well with NetCheck?

- A big chunk of the design deals with the lack of synchronized clocks. But it is often or usually possible to synchronize clocks well enough, i.e., to within less than an RTT (or at least have the NetCheck agent check the local time against a time server so it can adjust traces). Wouldn't it help to fold in this timing, so that in the common case you have accurate timing information and don't need heuristic reconstruction of the trace?

- Why do you give an integer priority value to syscalls? Why not directly follow the dependency graph, i.e., the partial ordering, from which it appears you derive the total ordering?

- The "best case" runtime evaluation is confusing. Isn't the best case for runtime that the software errors out on the first line of input? What good is a best-case guarantee?

It seems NetCheck strongly depends on the rules hand-coded into the fault diagnosis engine. That's fine; I can see such a database being an extremely useful resource when combined with NetCheck's automated diagnosis algorithms. But some discussion would help: how big is this database, how was it produced, and how long did that take? Do you think a reasonably small database covers most problems?

I appreciated the case studies of diagnosing real apps in Section 6.3.

Smaller items:

p. 5: "then (b) would have been the valid ordering": why is this the only valid ordering? Couldn't 'connection refused' happen with ordering (a)?

p. 5, last full paragraph: I didn't understand why a rejection indicates definite dependence.

p. 6: "sys call should be simulated at a later point": later in simulated time? If so, couldn't the problem be that it should have appeared *earlier* in simulated time?

===========================================================================
Comment
Paper #118: NetCheck: Network Diagnoses from Blackbox Traces
---------------------------------------------------------------------------

PC discussion summary: The PC found the direction the paper takes interesting, the design sensible, and the evaluation quite solid. However, it is not clear how much is automated and what the diagnosis results are. The paper points to [53] for identifying the diagnosis results, but even that report does not include sufficient information. Please try to clarify these points in the camera-ready.