The Industry as the Lab of Software Engineering Research -
Position on Empirical Software Engineering
Our ability to construct ... needed software systems and our ability to analyze and predict the performance of the enormously
complex software systems that lie at the core of our economy are painfully inadequate. We are neither training enough professionals
to supply the needed software, nor adequately improving the efficiency and quality of our construction methods.
This quotation is from the 1999 report of President Clinton's Information Technology Advisory Committee, but it is as valid today as it was then.
Software systems form the foundation of modern information society and many of those systems are among the most complex things ever created
by man. Software engineering (SE) is about developing, maintaining, and managing high-quality software systems in a cost-effective and
predictable way. SE research studies the real-world phenomena of SE and concerns (1) the development of new technologies (process models,
methods, techniques, tools, or languages), or the modification of existing ones, to support SE activities and (2) the evaluation and
comparison of the effect of using such technology in the often very complex interaction of individuals, teams, projects, and organisations
and various types of tasks and software systems. Sciences that study real-world phenomena, that is, empirical sciences, necessarily use
empirical methods as their principal mode of inquiry. Hence, if SE research is to be scientific, it too must use empirical methods (using
logic to reason about the behaviour of software engineers is generally infeasible). Empirical research seeks to explore, describe, predict,
and explain natural, social, or cognitive phenomena by obtaining and interpreting evidence through, for example, experimentation, systematic
observation, interviews or surveys, or by the careful examination of documents or artefacts.
A particular challenge when conducting empirical studies in SE is that we want to produce results that are scientifically sound yet at the
same time relevant to engineers and managers in the software industry. Hence, there is a trade-off between control (for identifying cause-effect
relations) and realism (for making the results transferable to industrial applications).
1 Collaboration with industry
Collaboration between a research institution and a company in industry can take many forms. Note here that we interpret the term collaboration
in a wide sense, to include many kinds of more or less formal interactions. For example, Table 1 shows the numbers of companies and persons
that were involved with the Software Engineering Department at Simula Research Laboratory, which I founded and led from 2001 to 2008. (Note that
the term company also includes public service units.) The companies column refers to unique companies within each category. The people involved
may not be unique, however, because we did not record their identities as individuals. The massive number and extent of the empirical studies
conducted in that period were the main reason that the Journal of Systems and Software (JSS) ranked Simula as number 1 in the world
with respect to publications in the period 2004-2008.
Included under the category of joint research in Table 1 are the processes of writing grant applications and working actively together in research
projects that have received grants. Writing scientific articles together with people from industry reflects close collaboration.
Table 1. Forms of collaboration with industry.
Type of collaboration:
- Empirical studies with professional practitioners as participants:
  - Experiments (from 10 minutes to 2 weeks)
  - Case studies (a few days to 3 years)
  - Action research (up to 5 years)
  - Interviews (typically 1 hour)
- Co-authoring scientific papers
- Knowledge and technology transfer
- Teaching SE courses
- Guest lectures from industry at courses given by the SE department
- Acquiring consultancy work (for example, to build infrastructure and organise studies)
2 Funding of studies
I have been the overall leader of controlled experiments on a scale and with a realism never before seen in the context of SE research. Studies of
fully realistic development projects, in which several companies develop the same system independently of each other, have also never been seen
before in the research community. Crucial to the success of these studies has been the strategy of hiring consultants as participants. We pay the
consultancy companies their standard fees for individuals or a fixed price for a complete project, just like any other client. The companies have
routines for defining (small) projects with local project management, resource allocation, budgeting, invoicing, the provision of satisfactory
equipment, and so forth. In contrast, it is difficult to recruit subjects from companies that develop software in-house, because management typically
focuses all effort on the next release of their product.
In another model, the companies pay for their own time and the Norwegian Research Council funds the researchers. We successfully applied this model
in several software process improvement projects in which action research was applied. Action research attempts to provide practical value to the
client organisation while simultaneously contributing to the acquisition of new theoretical knowledge. It can be characterised as 'an iterative
process involving researchers and practitioners acting together on a particular cycle of activities, including problem diagnosis, action intervention,
and reflective learning' (Avison et al., 1999).
In a systematic review, we identified 113 experiments published in the leading journals and conferences in the field over a decade. None of the
articles reported that professionals were paid to take part in the experiments. (In three cases, students were paid to take part.) One of the
reasons that we, in the early days of Simula, were able to pay directly for conducting empirical studies, which cost up to 200,000 Euros for a single
study, was that I fully exploited the unique opportunity to use the resources of the SE department in an optimal manner. I decided to spend about
20% of the budget on such studies. This was achieved mainly at the expense of employing a larger number of researchers.
The opportunity at Simula was exceptional in this respect. However, when applying for research grants, one budgets money for positions,
equipment, and travel. There is no reason why money for conducting empirical studies could not be included in research grants, too. Of course,
this means that the grants available in our field must be increased but, given the importance of software systems in society, research projects
in SE should not be less comprehensive or funded less adequately than large projects in other disciplines, such as physics and medicine.
3 Experiment infrastructures
The logistics involved in running large-scale experiments and other studies with industry are tremendous. The participants must be registered,
the experimental materials (e.g., questionnaires, task descriptions, code, and tools) must be distributed to each participant, the progress of
the experiment needs to be controlled and monitored, the results of the experiment need to be collected and analysed, the task duration must be
recorded, payment information must be collected for hired subjects, and so on. To support these logistics, as well as the automatic recovery of
experiment sessions and backup of experimental data, we developed an Internet-based tool, called SESE (Arisholm et al., 2002), which was crucial
to the success of many of our experiments. It is built on top of a commercial human resource management system. The commercial vendor of this
system was also hired to develop SESE.
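To illustrate the kind of bookkeeping such a tool must handle, here is a minimal, hypothetical sketch of one experiment session. SESE's actual data model is not described here, so all class, field, and method names below are invented for illustration only:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

# Hypothetical bookkeeping for one experiment session; illustrative only,
# not SESE's real (commercial, HRM-based) data model.
@dataclass
class Session:
    participant_id: str
    materials: list                      # questionnaires, task descriptions, code, tools
    started: Optional[datetime] = None
    finished: Optional[datetime] = None
    answers: dict = field(default_factory=dict)

    def start(self) -> None:
        self.started = datetime.now()

    def submit(self, task: str, answer: str) -> None:
        # Collected results; the tool must also back these up and support
        # recovery of interrupted sessions.
        self.answers[task] = answer

    def finish(self) -> None:
        self.finished = datetime.now()

    def duration_seconds(self) -> float:
        # Task duration is needed both for analysis and for paying hired subjects.
        return (self.finished - self.started).total_seconds()

s = Session("p-001", ["task description", "code"])
s.start()
s.submit("change task 1", "patch submitted")
s.finish()
print(s.duration_seconds() >= 0)
```

Multiplying this bookkeeping by a hundred simultaneous, geographically distributed subjects makes clear why tool support was crucial.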
We ran studies in many countries (see Table 2) in order to (1) be able to provide enough subjects with sufficient qualifications for a
given study, (2) reduce the costs of hiring professionals, (3) provide a more representative selection of subjects than those limited to a given
country, and (4) explicitly study cultural differences in, for example, studies on outsourcing and off-shoring. SESE has been invaluable, both
for running distributed experiments in many countries (enabling flexibility regarding location and thus making it easier for professionals to
take part in the experiments) and for managing many subjects simultaneously (up to 100 subjects have taken part in an experiment at the same
time in one location).
SESE has been extended with a module that supports the collection of qualitative data obtained from SE experiments, in particular
feedback from subjects during experiments (Karahasanovic et al., 2005). Such feedback provides useful complementary data for validating data
obtained from other sources about the quality and duration of tasks, process conformance, problem-solving processes, problems with
experiments, and the subjects' perception of the experiment.
Table 2. Companies and people that participated in studies during my time at Simula.
We realised early in our use of SESE to conduct experiments that the tool could be modified to assess programmer skill in general, which would
lead to cost savings and improve the decision-making process in industry. Hence, as early as 2002 we held interviews in seven organisations about
their interest in an instrument that would support the assessment of programmers based on actual programming tasks, as opposed to simply answering
questions textually or by using multiple-choice forms. As a result of the great interest in this issue, we announced a PhD scholarship that would
focus partly on basic research in this area and partly on the development of a commercial tool for assessing programming skill. A PhD student has
developed an instrument that includes both a model (a Rasch model, which is frequently used in cognitive psychology) and a prototype tool for such
an assessment. We believe that the commercial potential is great. This project has already received more than 100,000 Euros from a Norwegian funding body.
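In the dichotomous Rasch model mentioned above, the probability that a person of ability theta solves an item of difficulty b is a logistic function of (theta - b). The following is a minimal sketch of how an ability score can be estimated from task outcomes; the difficulties and responses are made up for illustration, not data from the actual instrument:

```python
import math

def p_correct(theta, b):
    """Rasch model: probability that a person with ability theta
    solves an item (programming task) of difficulty b, both in logits."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_ability(responses, difficulties, steps=200, lr=0.1):
    """Maximum-likelihood estimate of theta by gradient ascent.
    responses[i] is 1 if task i was solved, else 0."""
    theta = 0.0
    for _ in range(steps):
        # Gradient of the log-likelihood: sum over items of (x_i - P_i)
        grad = sum(x - p_correct(theta, b)
                   for x, b in zip(responses, difficulties))
        theta += lr * grad
    return theta

# Hypothetical tasks of increasing difficulty (in logits)
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
responses = [1, 1, 1, 0, 0]      # solved the three easiest tasks
print(round(estimate_ability(responses, difficulties), 2))   # ≈ 0.59
```

The estimated ability places the programmer on the same logit scale as the tasks, which is what makes comparisons across subjects and task sets possible.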
One can also envisage many extensions of SESE. In particular, it would be useful to have an infrastructure that supports (1) experiments lasting
longer than one day, (2) many research methods (case studies and action research in addition to the present support for experiments and surveys),
(3) usability studies, and (4) integration with other study equipment, including audio and video facilities.
4 Increased realism in experiments
Empirical studies usually involve a trade-off between control and realism (Sjøberg et al., 2002). The classical method for identifying cause-effect
relations is to conduct controlled experiments (called simply experiments below) where only a few variables vary. A common criticism of SE
experiments is their lack of realism, which may deter the transfer of technology from the research community to industry. Hence, a challenge is
to increase realism while retaining a relatively high likelihood that the technology studied (the treatment) is actually the cause of the observed outcome.
Even though we can increase the realism of experiments, there remain many SE phenomena that occur in complex, real-life environments that cannot be
studied fully through experiments. We need complementary case studies, particularly when the boundaries between phenomenon and context are not clearly
evident. So, while an experiment may choose to deliberately divorce a phenomenon from its context, the case study aims to cover contextual conditions.
However, embracing more of the context into the focus of study makes it more difficult to identify what affects what. Hence, achieving control is a challenge.
Table 3. Major entities of software development.
Actor: Individual, team, project, organisation, or industry
Technology: Process model, method, technique, tool, or language
Activity: Plan, create, modify, or analyse (a software system) and associated subactivities
Software system: May be classified along many dimensions: size, complexity, application domain, business/scientific/student project, administrative/embedded/real time, etc.
The discussion below on realism will be structured around the main entities of software development. The typical situation is that an actor
applies technologies to perform certain activities on an (existing or planned) software system. Thus, the purpose of a typical SE experiment is to
compare different technologies in the context of various actors, tasks, and systems. These high-level entities or concepts, with examples of subentities,
are listed in Table 3. One may also envisage collections of entities for each (sub)entity. For example, a software system may consist of requirement
specifications, design models, source and executable code, test documents, various kinds of documentation, and so forth. Moreover, the usefulness of a
technology for a given activity may depend on characteristics of the software engineers, such as their experience, education, mental ability, personality,
motivation, and knowledge of a software system, including its application domain and technological environment.
4.1 Increased representativeness of subjects (actors)
In the review reported in (Sjøberg et al., 2005), only 517 of 5488 subjects were professionals; some were academics, but almost 90% were students.
The makeup of the samples makes it very unlikely that the results of these experiments can be generalised to an industry setting:
Practitioners are understandably sceptical of results acquired from a study of 18-year-old college freshmen ... finding 100 developers willing to participate
in such an experiment is neither cheap nor easy. But even if a researcher has the money, where do they find that many programmers? (Harrison, 2005)
I have actually funded and organised experiments with up to several hundred subjects. A large number of subjects makes it easier to obtain
a representative rather than a biased sample of the target population. Only one of the 113 experiments reported sampling from a well-defined target population
(Sjøberg et al., 2005). Moreover, many aspects of the complexity of SE, such as differences among subgroups of subjects, only manifest themselves in
controlled experiments if they involve a large number of subjects and tasks. For example, in an experiment on pair programming (where programmers work in
pairs instead of individually), we wanted to investigate whether there was a difference in the effect of pair programming with respect to seniority among
the subjects (Arisholm et al., 2007). We also wanted to test the effect of different levels of system complexity. For the conclusions drawn from a statistical
test to be sound, the test must have sufficient power (Dybå et al., 2006). The three variables pair programming (two levels), control style (two levels), and
expertise (three levels) constituted 12 groups in total. The power analysis showed that we needed at least 170 subjects (85 individuals and 85 pairs). The
approach of hiring professionals from different companies in different countries to take part made it possible to end up with 99 individuals and 98 pairs.
The results showed that the effects of pair programming were dependent on both the seniority of the subjects and the complexity of the systems: The juniors
made more correct changes to complex systems when pair programming than when programming alone, whereas the intermediates and seniors spent less time making
correct changes on simple systems. Hence, the skill or expertise level relative to the technology being evaluated must be made explicit, thus indicating the
population to which the results apply.
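The kind of a priori power analysis described above can be sketched as follows. This is an illustrative computation using conventional "medium" effect sizes (Cohen's d = 0.5 for a simple two-group comparison, Cohen's f = 0.25 for a 12-group one-way design), not the actual effect sizes or software used in the study:

```python
# Illustrative a priori power analysis: how many subjects are needed to
# detect a medium-sized effect with 80% power at alpha = 0.05?
from scipy import stats

def ttest_n_per_group(d=0.5, alpha=0.05, target_power=0.8):
    """Smallest per-group n for a two-sided two-sample t-test."""
    n = 2
    while True:
        df = 2 * n - 2
        nc = d * (n / 2) ** 0.5          # noncentrality parameter
        tcrit = stats.t.ppf(1 - alpha / 2, df)
        power = (1 - stats.nct.cdf(tcrit, df, nc)
                 + stats.nct.cdf(-tcrit, df, nc))
        if power >= target_power:
            return n
        n += 1

def anova_total_n(k=12, f=0.25, alpha=0.05, target_power=0.8):
    """Smallest total N for a one-way ANOVA with k groups."""
    n = k + 2
    while True:
        df1, df2 = k - 1, n - k
        fcrit = stats.f.ppf(1 - alpha, df1, df2)
        power = 1 - stats.ncf.cdf(fcrit, df1, df2, f * f * n)
        if power >= target_power:
            return n
        n += 1

print(ttest_n_per_group())   # ~64 subjects per group for two groups
print(anova_total_n())       # total N for a hypothetical 12-group design
```

Even under these generic assumptions, a 12-group design demands a total sample far beyond what student-based experiments typically achieve, which illustrates why hiring professionals across companies and countries was necessary.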
Most experiments in SE have individuals as the experimental unit. As shown, pairs may also be the unit. Even rarer is to have companies as the unit, although
in an experiment on software bidding we used 35 companies as the subjects (Jørgensen and Carelius, 2004).
4.2 Increased realism of the studied technology and technological environment
There are two aspects of the realism of the technology in an experiment. One is the technology being evaluated; the other is the technological environment
of the experiment. Often the technology being evaluated is developed in a research setting (frequently by the evaluators themselves) and not compared with relevant
alternative technology used in the software industry. The motivation for studying a given technology should be driven by the needs of industry. The particular technology
that we study is related to the topic of investigation. Due to the limited extent of empirical research in SE, however, there are many important topics that receive no
attention in empirical studies (Sjøberg et al., 2007; Höfer and Tichy, 2007). An overview of topics investigated in SE experiments can be found in Sjøberg et al. (2005).
There is relatively little reporting of the impact of technology environment on experimental results but it seems clear that using artificial classroom settings without
professional development tools can, in many situations, threaten the validity of the results. Note that in smaller experiments, realistic environments may also threaten
the validity of the results due to confounding effects associated with the technological environment. For example, in an experiment on Unified Modeling Language (Anda
and Sjøberg, 2005), the subjects who used a modelling tool spent more time than those who used pen and paper to obtain similar quality. Apparently, those who used the
tool spent extra time understanding how to perform tasks with it and getting the syntax correct to avoid error messages.
A review of reported experiments (Sjøberg et al., 2005) found that the use of computer tools was slightly higher than the use of pen and paper. Only about half of the
experiments, however, reported on the use of tools to support the assigned tasks. This lack of reporting may be due to a lack of awareness of, or interest in,
the relevance of this issue. Nevertheless, the community should recognise the effort and resources needed to set up PC or workstation environments with the right licences, installations,
access rights, and so forth, and to familiarise the subjects with the tools.
4.3 Increased realism of tasks
Activities in SE consist of tasks, which typically have time limits. Large development tasks can take months, while many maintenance tasks take only a couple of
hours. Nevertheless, typical tasks in SE experiments are much smaller than typical industrial tasks. In the review reported in Sjøberg et al. (2005), the median duration
of experiments was 1.0 hour for experiments in which each subject's time was taken and 2.0 hours when only an overall time for the whole experiment was given. About half
of the articles mention that the tasks used in the experiments were not representative of industrial tasks and systems with respect to size/duration, complexity,
application domain, and so forth.
The community seems to agree that it is a problem that most experiments do not resemble industrial situations, but there is no consensus on what an industrial situation
is. There are an endless number of industrial technologies, tasks and systems, so one should be careful to claim that a given technology, task, or system is representative,
because it is difficult to specify what it is representative of. In order to meet this challenge, it is necessary to develop well-defined taxonomies that contain
representative categories. This can be done by first conducting surveys, logging the activities in certain companies, consulting project information databases, and then
analysing the results. Once suitable taxonomies have been developed, experiments can take their samples from the populations that are indicated by the categories of the taxonomies.
Nevertheless, development tasks in industry usually take longer and are more complex than is the case in most experiments (as is the case with technology and systems).
Hence, in experiments that have included programming tasks, we have attempted to increase the realism by running the experiments from one day up to two weeks rather than
one or two hours (Arisholm and Sjøberg, 2004; Vokác et al., 2004; Dzidek et al., 2008). The full-realism experiment on bidding (Jørgensen and Carelius, 2004) included, by
definition, realistic estimation tasks. These tasks are typical of those involved in developing small, web-based information systems.
4.4 Increased realism of systems
Most software systems that are used in SE experiments are either constructed for the purpose of the experiment or are student projects. In the review reported in Sjøberg
et al. (2005), only 14% were commercial systems. Accordingly, the systems are generally small and simple. This is also the case for the systems used in the experiments
that I have conducted.
The experiment on pair programming described above demonstrated that system complexity, which is one attribute of a system, might have an effect. In general, however,
the research community rarely focuses on what kind of systems are used in the experiments, which may be due to the fact that we do not know how to classify or describe
them in a systematic way.
5 Increased control in case studies
A case study should be conducted when a 'how' or 'why' question is being asked about a contemporary set of events investigated within its real-life context (Yin, 2003).
The typical situation in case studies is that there are more variables of interest than data points; hence, there are many possible confounding factors that cannot be
controlled for. So, rather than attempting to control for all variables of interest, the context in which the study is conducted should be described in as much detail as possible.
Nevertheless, we have launched a new approach in SE to achieve more control in case studies. We have found that by running multiple case studies in which certain variables
are controlled across the studies, it is more likely that cause-effect relations can be identified. For example, in a study on variability and reproducibility in SE (Anda
et al., 2009), the same requirements specification was sent in a call for tenders to 81 software companies. The study was conducted in two parts. In the first part, a
randomised controlled field experiment was conducted on the bids from the 35 companies that responded (Jørgensen and Carelius, 2004). In the second part, four companies
were selected for an in-depth multiple case study in which they developed the system independently, in parallel (Anda et al., 2009). The unit of study was thus the company.
Figure 2 shows the relation between bids and outcomes, which was investigated in the controlled context. Table 4 compares the different companies with respect to the project
and product quality dimensions.
The system that was referred to previously is a web-based information system to track empirical studies. The four versions of the system had been running for two years when
they needed to be upgraded. Implementing the necessary changes and other corrections was then used as an opportunity to conduct another multiple case study on software
maintenance in which six developers from two Eastern European countries implemented the same set of changes on two systems each. The unit of study in this case was thus
the individual programmer.
Figure 2. Multiple case study with some controlled context.
Table 4. Quality of project and product.
6 Conclusions
The ultimate goal of SE research is to support practical software development. Ideally, the research community should provide guidance when industrial users pose the
question 'Which method or technology should we use in our context?' Of course, this would require sufficient information about a given setting, which in turn requires a
wide spectrum of high-quality empirical studies. While the community seems to agree that it is a problem that most experiments do not resemble industrial situations,
it is a challenge to define what an industrial situation is. Well-defined, representative taxonomies are needed.
Further progress in this area, however, requires increased competence in conducting empirical studies, improved links between academia and industry, the
promotion of common research agendas, and funding for empirical studies that is proportionate to the importance of software systems in society.
Anda, B. C. D., and D. I. K. Sjøberg. Investigating the Role of Use Cases in the Construction of Class Diagrams, Empirical Software Engineering 10(3):285-309, 2005.
Anda, B., D. I. K. Sjøberg, and A. Mockus. Variability and Reproducibility in Software Engineering: A Study of Four Companies That Developed the Same System, IEEE Transactions on Software Engineering, 35(3):407-429, 2009.
Arisholm, E., H. E. Gallis, T. Dybå, and D. I. K. Sjøberg, Evaluating Pair Programming with Respect to System Complexity and Programmer Expertise, IEEE Transactions on Software Engineering, 33(2):65-86, 2007.
Arisholm, E., and D. I. K. Sjøberg. Evaluating the Effect of a Delegated Versus Centralized Control Style on the Maintainability of Object-Oriented Software, IEEE Transactions on Software Engineering, 30(8):521-534, 2004.
Arisholm, E., D. I. K. Sjøberg, G. J. Carelius, and Y. Lindsjørn. A Web-Based Support Environment for Software Engineering Experiments, Nordic Journal of Computing 9(4):231-247, 2002.
Avison, D., F. Lau, M. Myers, and P. A. Nielsen. Action Research, Communications of the ACM, 42(1):94-97, 1999.
Dybå, T., V. B. Kampenes, and D. I. K. Sjøberg, A Systematic Review of Statistical Power in Software Engineering Experiments, Information and Software Technology, 48(8):745-755, 2006.
Dzidek, W. J., E. Arisholm, and L. C. Briand. A Realistic Empirical Evaluation of the Costs and Benefits of UML in Software Maintenance, IEEE Transactions on Software Engineering, 34(3):407-432, 2008.
Harrison, W. Skinner Wasn't a Software Engineer, Editorial, IEEE Software, May/June, 2005.
Höfer, A., and W. F. Tichy. Status of Empirical Research in Software Engineering. In: Basili et al. (eds.), Experimental Software Engineering Issues: Assessment and Future Directions, Springer-Verlag Berlin Heidelberg, Lecture Notes in Computer Science 4336, 2007.
Jørgensen, M., and G. J. Carelius. An Empirical Study of Software Project Bidding, IEEE Transactions on Software Engineering, 30(12):953-969, 2004.
Karahasanovic, A., B. C. D. Anda, E. Arisholm, S. E. Hove, M. Jørgensen, D. I. K. Sjøberg, and R. Welland. Collecting Feedback During Software Engineering Experiments, Empirical Software Engineering 10(2):113-147, 2005.
President's Information Technology Advisory Committee Report, http://www.ccic.gov/ac/report, 1999.
Sjøberg, D. I. K., B. Anda, E. Arisholm, T. Dybå, M. Jørgensen, A. Karahasanovic, E. Koren, and M. Vokác. Conducting Realistic Experiments in Software Engineering. In: First International Symposium on Empirical Software Engineering (ISESE 2002), IEEE Computer Society, pp. 17-26, 2002.
Sjøberg, D. I. K., T. Dybå, and M. Jørgensen. The Future of Empirical Methods in Software Engineering Research. In: L. Briand and A. Wolf (eds.), Future of Software Engineering '07, IEEE CS Press, pp. 358-378, 2007.
Sjøberg, D. I. K., J. E. Hannay, O. Hansen, V. B. Kampenes, A. Karahasanovic, N.-K. Liborg, and A. C. Rekdal. A Survey of Controlled Experiments in Software Engineering, IEEE Transactions on Software Engineering, 31(9): 733-753, 2005.
Vokác, M., W. Tichy, D. I. K. Sjøberg, E. Arisholm, and M. Aldrin. A Controlled Experiment Comparing the Maintainability of Programs Designed With and Without Design Patterns - A Replication in a Real Programming Environment, Empirical Software Engineering 9(3):149-195, 2004.
Yin, R. K. Case Study Research: Design and Methods, 3rd ed., Sage Publications, Thousand Oaks, CA, 2003.