By Edmund Chattoe-Brown, Alvaro Gil and The STREAMS Group¹
¹ Professor Graeme Ackland (School of Physics and Astronomy, University of Edinburgh), Professor Fiifi Amoako Johnson (Department of Population and Health, University of Cape Coast), Dr Daniel Amoako-Sakyi (Department of Medical Biochemistry, University of Cape Coast), Dr Rebecca S. Balira (National Institute for Medical Research, Mwanza, Tanzania), Dr Edmund Chattoe-Brown (School of Media, Communication and Sociology, University of Leicester), Professor Elizabeth David-Barrett (School of Law, Politics and Sociology, University of Sussex), Dr Martins Ekor (Department of Pharmacology, University of Cape Coast), Dr Heather Hamill (Department of Sociology, University of Oxford), PI: Professor Kate Hampshire (Department of Anthropology, University of Durham), Dr Gerry Mshana (National Institute for Medical Research, Mwanza, Tanzania), Dr Simon Mariwah (Department of Geography and Regional Planning, University of Cape Coast), Dr Adams Osman (University of Education Winneba), Dr Samuel Asiedu Owusu (Directorate of Research, Innovation and Consultancy, University of Cape Coast).
When should Agent-Based Modellers write their own code and when should they reuse or extend code that already exists? From a purely individual point of view, they should write their own code when, all things considered, it is quicker than reusing or extending existing code. This may happen because they are very fast (though not necessarily very accurate) coders, because existing code is impenetrable (either ineffectively documented or not documented at all) or because it is badly designed (so it cannot really be extended) and/or a long way from the use to which the researcher wants to put it.
The complication arises because (like many phenomena) code reuse has social as well as individual benefits. Reused code has been checked by at least one other person. This may disclose such things as bugs and potential efficiency improvements but also establishes whether the code and documentation are really accessible for general use or whether the original designers merely believe they are. (Coders always know more about their own code than they realise and therefore don’t always understand how to explain everything that is actually relevant.) Ultimately, it probably makes more sense if there is one well checked and somewhat general set of code for a particular area of modelling than a whole bunch of overlapping code of variable quality (which is what we tend to observe). But for the “supply” of code to travel in that direction, at least some individuals have to elect for code reuse. This note reports an experience of trying to do that and draws provisional conclusions from the case study (which can be supported or refuted by analysis of further cases in subsequent work).
Most of the authors (The STREAMS Group) are involved in a research project about supply chains and substandard/falsified medicines in Africa (see, for example, Ackland et al. 2019 and Hamill et al. 2019) where most of the available funding had to be allocated to fieldwork. This meant that as a modeller with a very small time involvement (but nonetheless a requirement to deliver research outputs) it seemed to the first author (Chattoe-Brown) that he would have to reuse an existing supply chain ABM (in NetLogo because that is the main language he uses) rather than build one from scratch. After looking in the NetLogo Modeling Commons (a resource for model access that can be found at <http://modelingcommons.org/>) he chose Gil (2012) as potentially suited to his purpose (and as having no feedback on its download page to suggest that there were any serious problems with it). Despite being written in NetLogo 5, it worked straightaway in NetLogo 6.1.1 (although a “step by step” version also offered by Gil and documented in French would not, though an additional advantage of NetLogo is that older versions remain available and still run so he could have made use of that code if necessary).
Rather to Chattoe-Brown’s surprise, however (given the known challenges of using undocumented code and aims of collective resources generally), the code was barely documented at all and certainly not in a way that would allow a reader to establish how it worked overall. The information provided in the Info Tab also did not fulfil that function (though it did provide a broad overview of the model’s aims).
The solution was simply to work through the code, annotating it line by line, both with regard to its procedures and “logic”. (This version of the code is available from Chattoe-Brown on request.) By the end of this process Chattoe-Brown was pretty confident that the programme did what it claimed and that he understood about 95% of it (with the remaining parts being equations with opaque variable names whose specifics did not seem crucial to the overall logic of the code). This level of confidence certainly makes him think that he can extend the model in ways that are needed for the African project (which was the original aim). For the first analysis he “parked” various add-on features (like being able to observe a specific organisation and organise “promotions” of the product) and just concentrated on the core code.
Below, therefore, are conclusions about how code needs to be documented based not on abstract principles but on the challenge of understanding a real piece of code intended for actual reuse.
- Think about variable naming: This is a trivial point (though its implications for making sense of code are surprisingly non-trivial) but a reader will probably not be able to interpret a variable called “K” even from its use in equations (particularly if these are not themselves explained). Don’t assume that the reader will know the standard abbreviations for particular variables in your field (or the standard equations linking them). Instead use variable names that attempt to self-describe like fixed_cost_of_order.
- Make sure any version of the code you make available is actually “finished”. Don’t declare variables for functions you no longer call (or even define). Don’t include commented out “scaffolding” for things you were either going to do or didn’t end up needing. (The person intending to reuse the code still has to spend time figuring these out only to discover they “lead nowhere”.)
- Establish (or follow) conventions for naming in your code. For example, give sliders variable names with words connected by underscores (like population_size) to distinguish them from “in code” variables or turtle attributes, perhaps linked by dashes (total-sales). Consider having prefixes to further distinguish attributes and variables: For example t-age is the age of a turtle while v-age is a variable holding some kind of age data. For obvious reasons (debugging as well as reading) don’t use total_sales and total-sales in the same piece of code!
- Be aware that “material” relating to model inputs (like slider values) and outputs (like the use of global variables in plotting) is much less directly accessible to the reader than material in the code itself (which can often partly be interpreted by context). Document your input and output materials specifically in the code (or, if this is more easily understood, in the Info Tab).
- Make it clear (by self-describing naming or naming conventions) which are “local” or “throwaway” variables or constructs (like loop counters or lists in the process of construction) and which are meaningful throughout the code. Establish standard names for common local variable functions like loop-counter or variable-holder to make it clearer that they are “disposable”. Local variables for list construction should be used with caution as they miss out on the support to interpretation that comes from variables that name themselves.
- If your code involves undifferentiated data structures (like lists of lists) the reader will find it very hard to infer which list elements represent what. It is much easier simply to explain in the comments that an “order” is defined as a two part list with the first element being a quantity ordered and the second a “tick for delivery due”. This is particularly important when using commands like map which, while very powerful as operators on lists, are extremely hard to parse unless you already know what the elements of a list are intended to represent.
- If code is to be made available undocumented, it is particularly important that it has “good logic”. For example, an operation (like updating a plot) should not be incorporated into a procedure serving a totally different function (like evaluating potential suppliers) just because that is convenient (because the reader may become very frustrated trying to read programme logic into that decision when it is not actually there). Procedures should not be “aggregated” with names that only make sense to the programmer (like main-sequence) but should also be self-naming (all-attribute-updating-for-wholesalers). It is particularly important that key design issues do not “get lost” in the allocation of operations across procedures. For example, the main conceptual challenge Chattoe-Brown had in understanding the Gil model was that although there were plainly “such things” as orders (a two element list of a quantity and delivery date as above) and there were procedures that could be identified as transferring products from one organisation to another and “soliciting” products, the actual procedure that “created” the orders as objects was a sub procedure of a procedure with an uninformative name. Furthermore, because orders were not indexed by the organisation placing them, one had to infer the implication of order lists being sorted in a certain way (that you could tell which order “belonged” to which wholesaler, for example, by its position in a list which corresponded to another list with the ordered identities of the wholesalers). A third example is the difficulty of interpreting procedures which take arguments. These should only be used when the argument is really something that modifies or qualifies an otherwise similar process in an intuitive way and not as a way of “code sharing” to do qualitatively very different things within the same procedure depending on the argument used.
- If the code relies on “external” materials (like libraries or data files) make sure these are briefly described and, in particular, say where they can be accessed, in case they become separated from the code itself. Otherwise, unless they are well known resources that can be tracked down independently, the code may become inoperable without them. (This seems to have happened in the Gil code with some, fortunately minor, dedicated “shapes” to display the various organisations in the supply chain.)
- Try to document “design decisions” in the Info Tab if not in the code itself. For example, the Gil model relies on building lists for nearly every event in the model (lists of successful sales, lists of lost sales and so on). These lists extend as time passes to serve as a complete record of a particular kind of “event” (like when a particular distributor is unable to fulfil its order and how big each of these lost orders was). There is thus a technical concern that such ever growing lists may raise efficiency issues if the simulation runs for a very long period and, in fact, there is a procedure that stops some lists (but perhaps surprisingly not all of them) from growing too long. In addition it is clear that the functioning of the code relies on all these various lists remaining the same length (and thus having to be “padded” with zeroes when nothing relevant happens) but the reader cannot tell whether the aim is that these very long lists will ever be used (in data analysis for example) and whether they are “padded” as part of their role as material for analysis (or just because it is less programming effort than doing lots of list length checking). The motivations for decisions like this (and their implications for improving or extending the models) are the hardest things to infer from the code itself. Another example is the code needed to “hook together” clients and suppliers in both directions (so if I choose you as my supplier, you must then record me as your client). In retrospect it is obvious what the “aim” of this part of the code is but the practical details of programming alone leave it rather hard to work out.
- If you haven’t “finished” your code (or in Chattoe-Brown’s case the comments on it), make clear how/where it is unfinished so later readers (or you yourself) can add to it. For example, I (Chattoe-Brown) use the string “xx” (which almost never occurs in natural text) to mark places where, for example, I am still unsure how a variable or procedure works. Only when I have dealt with all such instances is an article ready to submit or the code ready to circulate. This can be combined with the use of conventions, for example to mark whether or not you have yet debugged a particular procedure in your code. As with documentation generally, the aim is that mistakes should not be made because you go away from your code and come back to it later. The code should always “explain its own current state” (however incomplete or inelegant that may be).
- Think about the best “supporting material” suitable to documenting specific code. For example, in the Gil code, some variables and procedures are unique to specific organisations or positions in the supply chain while others apply to several kinds of organisation. A table would be a very effective way of showing the overall pattern and rationale here. For example, do these kinds of organisation all do this particular procedure because they are “internal” to the supply chain, i.e. having both upstream and downstream partners? Does this organisation not have this procedure because customer purchases are modelled differently from the rest of the purchase decisions in the model? Again, it is particularly hard to read out this kind of “model motivation” from the workings of the code alone.
- “Hacky” solutions (things that may be clever in programming terms) are generally harder to interpret. For example, to give a variable a value, the Gil code “strips off” a number that is part of the name of the strategy and uses that. The syntax of string processing is not necessarily what a reader expects to engage with in what is mainly a list based model. (Generally, the more different approaches a single piece of code takes – for example list processing and string processing – the fewer users will be skilled enough to read it overall and the harder it will be to interpret. Once one has explained the logic of, say, a list based approach, it may be better to stick to it as far as possible.)
- Consider having a “variable table” as part of your documentation. It is particularly fiddly to sort out which variables come from sliders (as discussed above) and which may be defined at arbitrary points in the code as a whole. It is also useful to have a sense of the overall “call structure” of code (so this procedure is strictly used as a subroutine of another while this one is called at many different points, perhaps with different arguments too.) The same applies to variables (which may simply pull a slider value into the code “once only” or be intertwined with almost every procedure.)
- Be aware of the “passing implications” of concise code (and record them even if you do not amend them). For example, in the Gil code, demand is subtracted from stock if stock is greater than demand but if stock is less than demand, stock is set to zero. This means that some demand “evaporates”. This not only has implications for the general applicability of the model (the good cannot be an essential one for example) but may also affect the simulation outcome in unexpected ways. (It is one thing to discard one unit of demand because you want 9 and the stock is 8 but this implementation will equally discard 15 units if you want 16 and the stock is just 1. “Demand evaporation” on this scale may prove to have significant implications for the overall dynamics of the model. This seems unlikely to be intended.)
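Several of the points above (self-describing names, an explicitly documented order structure, orders indexed by the organisation placing them and recording the implications of concise code) can be illustrated together in a short sketch. It is written in Python rather than NetLogo purely for illustration and is not taken from the Gil code: the names Order, serve_demand and orders_by_wholesaler, the wholesaler labels and the delivery ticks are all hypothetical. The two calls at the end reproduce the arithmetic of the last point, reporting the unmet demand rather than letting it disappear silently.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Order:
    """One order, replacing an anonymous two-element list.

    quantity_ordered:  units requested from the supplier
    tick_delivery_due: simulation tick at which delivery is due
    """
    quantity_ordered: int
    tick_delivery_due: int


def serve_demand(stock: int, demand: int) -> Tuple[int, int, int]:
    """Serve as much demand as the stock allows.

    Returns (new_stock, units_sold, unmet_demand) so that any demand
    which cannot be met is reported explicitly instead of vanishing.
    """
    units_sold = min(stock, demand)
    unmet_demand = demand - units_sold
    return stock - units_sold, units_sold, unmet_demand


# Orders are keyed by the (hypothetical) wholesaler placing them, so
# ownership does not have to be inferred from position in parallel lists.
orders_by_wholesaler: Dict[str, List[Order]] = {
    "wholesaler-0": [Order(quantity_ordered=9, tick_delivery_due=12)],
    "wholesaler-1": [Order(quantity_ordered=16, tick_delivery_due=15)],
}

# The two cases discussed in the text: want 9 with stock 8, want 16 with stock 1.
small_shortfall = serve_demand(stock=8, demand=9)    # (0, 8, 1)
large_shortfall = serve_demand(stock=1, demand=16)   # (0, 1, 15)
```

Returning the unmet demand does not decide what the model should do with it (discard it, carry it over or redirect it to another supplier) but it does make that decision visible to a later reader.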
These conclusions operate at a number of levels from the practical to the institutional:
- One can get a long way with print statements. Chattoe-Brown often found it feasible to test “hypotheses” about (unexplained) data structures by printing these out along with associated variables (so the third element of this data structure must be “ticks” because it always matches when that variable is printed out as well.) By contrast, it is still extremely hard to infer the “motivation” of an equation even if you print out material that describes how it behaves.
- Even though it involves significant effort (which is nonetheless quite educational) it is probably still quicker (unless you are a real programming ace) to document and develop a solid piece of code built by someone else than to build your own from scratch.
- The caveat emptor approach to code availability needs to be considered carefully in terms of wider community benefits. The danger is that the lower the cost of making things available, the lower the expected value of doing so. Chattoe-Brown was lucky in finding a “solid” bit of code in the NetLogo Modeling Commons but it might be (had the code reuse exercise not been conducted partly out of intellectual curiosity) that he would have done better to look at the model library of the COMSES network (which enforces minimum standards of documentation for example).
- We need to consider the institutional structures that may support the division of labour in code reuse to improve its efficiency (more improvement and consolidation of existing code and less proliferation of ad hoc overlapping code) and thus the “efficiency” of the modelling community as a whole. (There is actually little virtue in everyone building their own models from scratch unless it is part of their training for example.) If Chattoe-Brown et al. document the Gil model so it qualifies for COMSES (where it will probably be used more, as it deserves) does that remain Gil’s contribution, does it now “belong” to Chattoe-Brown et al. or is it some sort of joint venture? (Chattoe-Brown has assumed the last of these. This explains the complicated authorship of the paper and the somewhat awkward references to separate authors in the text.) When academics are busy, the right incentives to engage in certain activities are a key aspect in making them happen. Another example is the difference between measuring the quality of code by simple “downloads” and in terms of people actually running it (since maybe the download wouldn’t even run) or actively endorsing it (since even if it ran it might be no good for further use on closer examination). More generally this suggests that we might need to support more nuanced “versioning” of shared code and consider commentary (as well as the code itself) as something that can usefully be shared and communally improved. (For example, a better coder than Chattoe-Brown could perhaps add to his documentation and fill the remaining gaps when they wouldn’t be motivated to document the whole programme from scratch. There may be a “brain drain” from communal code development because the ablest coders are most likely to write their own code and therefore have the least individual incentive to contribute to collective code development activities.)
- There is generally not that much discussion of the ‘nitty gritty’ of code documentation, coding “styles” and so on through which common practices can be shared and improved. (Arguably the account given here is not “research” in the sense intended for peer reviewed publication and this may explain its relative lack of visibility.) This is particularly the case for work inductively based on practice (and thus guaranteed to be relevant) rather than on “principle” (which may be impractical or inapplicable). We would be interested to hear if others use these “tricks” or whether they have other or better ones in the same vein. In the spirit of collective action, it is clear that others could perform the kind of analysis presented here and thus support, refute or improve on our tentative conclusions.
- This article is clearly geared up to the specifics of NetLogo. It would be interesting to know whether the same (or different) specific problems arise in reusing code written in SWARM, RePast, MASON or whatever. RofASSS might be a good outlet for such related contributions (following its brief to provide, in permanent citable form, useful material that does not have an obvious existing outlet). It would also be interesting if others attempting code reuse endorsed or qualified our analysis. (Just as programmers who know their own code very well don’t always know what to explain, perhaps Chattoe-Brown’s view of what is difficult in this code may just reflect his own limitations. On the other hand, the less experience/expertise it takes to interpret and reuse code, perhaps the better that is communally.)
- In the spirit of collaborative endeavour, we would be very grateful if anyone could further illuminate any areas of the more documented version of the Gil code that we still cannot follow!
- The possibility of code reuse is a real issue in terms of efficiency and reliability and not simply an “academic” matter. By my good fortune, Gil was still at the same email address and responded very promptly and helpfully to an inquiry about some aspects of the code but freely admitted, after all these years, that he could no longer follow it either! (In another piece of research Chattoe-Brown was working on, a replication came to a dead stop because the modeller on a project had left academia and was therefore now too busy with other things to answer “academic” questions and none of the other authors really “owned” the code on which the article was based. In this situation, the supposed conclusions of the article stand but the model on which they are based fades into unverifiability.) It is not simply a cliché that undocumented code can easily become unusable. This suggests practical steps (like deleting anything from code archives of more than a certain age unless some “responsible” person can still be contacted to affirm that it should remain).
- There may be useful work to be done in assessing the extent to which, in practice, published models tend to overlap in their functionality. Is it in fact the case that most supply chain simulations do basically the same things (but based on separate ad hoc code rather than the communal creation of a robust platform)? A further issue here is that a whole set of independent models which are basically similar may create a kind of “groupthink” about what modelling in certain areas should involve. By contrast, are there certain fields where such collective work would not be feasible or useful (and why)? How do we promote communal models as an “output” where these turn out to be suitable? (This is part of a more general problem: Can we rely on individualistic academics with a competitive structure imposed on them to “communalise” their activities at appropriate points in time to avoid negative social outcomes, including waste, or do we have to find ways to somehow “encourage” this process by institutional development/design?)
ECB’s contribution to this article was as part of “Towards Realistic Computational Models of Social Influence Dynamics” a project funded through ESRC (ES/S015159/1) by ORA Round 5.
Ackland, Graeme J., Chattoe-Brown, Edmund, Hamill, Heather, Hampshire, Kate R., Mariwah, Simon and Mshana, Gerry (2019) ‘Role of Trust in a Self-Organizing Pharmaceutical Supply Chain Model with Variable Good Quality and Imperfect Information’, Journal of Artificial Societies and Social Simulation, 22(2), March, article 5, <http://jasss.soc.surrey.ac.uk/22/2/5.html>. doi:10.18564/jasss.3984
Gil, Alvaro (2012) ‘Artificial Supply Chain’, <http://modelingcommons.org/browse/one_model/3378#model_tabs_browse_info>, École Polytechnique de Montréal.
Hamill, Heather, Hampshire, Kate, Mariwah, Simon, Amoako-Sakyi, Daniel, Kyei, Abigail and Castelli, Michele (2019) ‘Managing Uncertainty in Medicine Quality in Ghana: The Cognitive and Affective Basis of Trust in a High-Risk, Low-Regulation Context’, Social Science and Medicine, 234, August, article 112369. doi:10.1016/j.socscimed.2019.112369
Wilensky, Uri (1999) ‘NetLogo’, <http://ccl.northwestern.edu/netlogo/>. Center for Connected Learning and Computer-Based Modeling, Northwestern University, Evanston, IL.
Chattoe-Brown, E., Gil, A. and The STREAMS Group (2021) How To Make Your Code “Immortal”: NetLogo Edition. Review of Artificial Societies and Social Simulation, 6th May 2021. https://rofasss.org/2021/05/06/how-to-make-your-code-immortal-netlogo-edition/