I am Ewen Brune, a PhD student working in the Inria research group ‘DiverSE’. I am working on the state of practice of scientific software testing.
I have some questions to better understand the test suite of DEVSIM.
What is the motivation behind the choice to test at the Python package level rather than the C++ code?
The oracles of the test suite are CPU- and platform-specific.
Is it impossible to have repeatable results across configurations?
How do you validate a result on a new architecture?
Multiple C++ components are required to set up a simulation, and it is easier to coordinate them using Python. Input validation for the commands is also done in the Python API code. The preference is to test the system as a whole, as opposed to unit testing, since there is some worry that small changes to parts of the code may affect the whole application.
There are smaller tests in the testing directory that do not start a simulation, but most bug reports come from actual usage of the program.
Since the Python API allows a lot of inspection into the C++ internals, some Python tests use independent calculations to compare against the results generated by the C++ code.
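A minimal sketch of that cross-checking pattern, assuming nothing about DEVSIM's actual API: the `simulator_charge_density` function below is a stand-in for a value that would be inspected from the C++ engine, and the independent Python calculation is compared to it with a relative tolerance rather than exact equality.

```python
import math

def simulator_charge_density(n, p, ndonor, nacceptor, q=1.602176634e-19):
    """Stand-in for a quantity that would come from the C++ engine."""
    return q * (p - n + ndonor - nacceptor)

def independent_charge_density(n, p, ndonor, nacceptor):
    """The same quantity, computed independently in Python
    (deliberately with a different summation order)."""
    q = 1.602176634e-19
    return q * ((p + ndonor) - (n + nacceptor))

cpp_value = simulator_charge_density(1e16, 1e4, 1e17, 0.0)
py_value = independent_charge_density(1e16, 1e4, 1e17, 0.0)

# Compare with a relative tolerance rather than exact equality,
# since rounding differences in the last bits are expected.
assert math.isclose(cpp_value, py_value, rel_tol=1e-12)
```

The tolerance-based comparison is the important part: it makes the check robust to the last-bit differences discussed below.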
Most of the numerical results in the test suite are reported to IEEE double precision floating point, or to extended precision. Even if these values were compared to only a few digits of precision, it is not possible to get an exact match across platforms. This is due to several reasons, such as:
When using iterative methods, a different number of iterations may be reported on each platform, even for the same solver.
The Intel MKL Pardiso library is only available on x86_64, so UMFPACK is used on arm64 and aarch64 systems.
Each compiler/operating system has a different standard math library.
Platform-specific compiler options can affect the results.
Linux gcc compilation on x86_64 supports 128-bit floating point precision, while all other systems use the Boost libraries for extended precision, which are implemented with C++ templates and do not specify their accuracy as tightly.
OS- and compiler-specific issues have arisen that cannot be detected using alternative platforms. For example, specific versions of Visual Studio C++ have failed on some threading cases.
Often, it is possible to start from a related system, such as moving from Linux x86_64 to arm64. The text differences are visually inspected, and some graphical results may be compared as well.
Examples:
Moving from Ubuntu 20.04 to 22.04 for testing Linux x86_64 resulted in some small numerical differences.
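Part of that inspection could be mechanized by extracting the numbers from two text outputs and flagging only differences beyond a tolerance. This is a hypothetical sketch, not DEVSIM's actual harness; the `numeric_diff` helper and the sample output lines are invented for illustration.

```python
import math
import re

# Matches integers and floats, with optional exponent.
NUMBER = re.compile(r'[-+]?\d+\.?\d*(?:[eE][-+]?\d+)?')

def numeric_diff(golden: str, candidate: str, rel_tol=1e-10):
    """Return the pairs of numbers that differ beyond the tolerance."""
    gold = [float(m) for m in NUMBER.findall(golden)]
    cand = [float(m) for m in NUMBER.findall(candidate)]
    if len(gold) != len(cand):
        raise ValueError("outputs contain different numbers of values")
    return [(g, c) for g, c in zip(gold, cand)
            if not math.isclose(g, c, rel_tol=rel_tol)]

golden = "iteration 3 residual 1.2345678901234e-12"
candidate = "iteration 4 residual 1.2345678901301e-12"

# The iteration counts (3 vs 4) are flagged as a real difference,
# while the residuals agree within the tolerance.
mismatches = numeric_diff(golden, candidate)
```

A real harness would also need to decide which flagged values (such as iteration counts) are acceptable platform variation and which indicate a regression.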
Long term, it would be nice to have a unified system or better automated testing.
From time to time, a user will discover an issue that only presents itself in their specific use cases.
A code coverage tool should also be used to make sure that everything is being tested. However, actual usage of the software is still needed to discover new issues when it is used in novel ways.