Summary

The paper “Learning Independent Causal Mechanisms” (Parascandolo et al., 2018) presents a framework for uncovering independent causal mechanisms through a set of generators known as “experts”, where each “expert” models a single inverse transformation. By doing so, the authors argue, these “experts” act as modular and independent “mechanisms” that help generalization to other domains. During training, the “experts” compete to extract these mechanisms: each learns the inverse of one transformation, mapping the transformed distribution back to its “canonical” form.

To test the efficacy of their method, the authors perform experiments on the MNIST dataset and test generalizability on the Omniglot dataset.

One of the major concerns of the paper is the limited setting in which the experiments and evaluation were carried out. The authors only considered the MNIST dataset for training and the Omniglot dataset for evaluating generalization, where the mechanisms (transformations) were simple and applied one at a time, and thus did not necessarily embody the complexity of real-world samples.

According to Keras author Francois Chollet, MNIST cannot represent modern computer vision tasks (zalandoresearch/fashion-mnist, 2021). Ian Goodfellow, the creator of GANs, has also called for people to move away from the MNIST dataset due to its extreme overuse and simplicity (zalandoresearch/fashion-mnist, 2021). Using MNIST as a starting point may be reasonable, but it cannot serve as a benchmark since it does not represent real-world computer vision tasks. In the case of testing generalizability, the Omniglot dataset contains figures that are very similar to the digits in the MNIST dataset. Using a distribution with such similarity (a point further elaborated in Section 7) may not provide strong validation of the hypothesis (that learning independent mechanisms allows for generalizability) and hence calls for evaluation on more diverse datasets.

Moreover, the paper states its goal of “moving towards a form of robustness that animate intelligence excels at”. Any real-world example will involve far more complex processes, with many interdependencies existing at once. There is therefore a need to experiment in complex settings that, for instance, contain multiple simultaneous transformations rather than atomic transformations. By not validating the experiments across domains drawn from diverse distributions and settings, the paper seems to only partially achieve its goal. For instance, the authors could explore the vision and text domains by leveraging datasets such as ERASER (DeYoung et al., 2020), Open Images (Open Images V6 - Description) and CIFAR-10 (Krizhevsky, 2009). The authors do mention the need for complex settings as future work; however, this seems to be the core of what must be tackled and, if addressed, would help make the research consistent and complete.

The second major concern is that the paper seems to be highly motivated by the objective of causality but does not show any results or analysis on how the current methodology can be used for causal inference. The authors mention that the Independent Mechanisms (IM) assumption has “implications for machine learning more broadly”; if the authors leverage “causality” only as an inspiration and not as an objective of their proposed method, then it only seems fit that they present concrete empirical results and analysis on how the method compares to other approaches aimed at generalizability, such as domain adaptation methods and self-supervision models, which unfortunately has also not been clearly done.

The final major concern is that the authors do not mention how and why the particular set of mechanisms (transformations) was chosen. In real-world cases, we are often not aware of the true independent mechanisms underlying the data. In such a case, how can one determine what transformations or mechanisms to employ? How does one address the problem of confounding variables that inevitably exist?

The following sections of this critique detail the points mentioned above, along with other details that the authors could have considered. Apart from these concerns, it is worth highlighting that the method built on the assumption of independent mechanisms truly shows promise in the field of machine learning and causality. The accessible language and the logic used for uncovering the mechanisms will allow researchers to learn from and take inspiration from this work to further the hypothesis that learning independent mechanisms leads to better generalizability.


In conclusion, the paper states a clear hypothesis that learning independent causal mechanisms leads to better generalizability and life-long learning; however, due to the limited and restrictive experimental setting and the unclear treatment of causal inference within the methodology, the paper does not provide strong evidence for this hypothesis and leaves room for incremental work.

Statement of the Problem

The paper outlines the problem statement in the “Introduction” section, where emphasis is placed on the need to move towards robust models built from “modular, reusable and broadly applicable mechanisms” that can generalize well to unseen or out-of-distribution settings. This is said to be akin to how animate intelligence (such as humans) performs tasks on varying real-world data without the need to re-learn a new model every time.

To motivate this, the authors mention the need to move from machine learning models that excel at tasks on large i.i.d. datasets towards generalization across tasks. Indeed, the shift from the statistical i.i.d. assumption to robust modelling based on interventions and counterfactuals is one of the fundamental arguments for furthering causality in machine learning. It should be noted, however, that the paper does not explore interventions or counterfactuals in the proposed methodology.

A hint of the premise of this argument appears in the first line of the abstract. However, the statement could be written more clearly to make the argument stronger. It reads: “Statistical learning relies upon data sampled from a distribution, and we usually do not care what actually generated it in the first place.” The concern with this statement (more literal than logical) lies in the second part, “...do not care what actually generated it in the first place”, since it leaves room for ambiguous interpretation.

It is common knowledge that statistical modelling aims to model the true data-generating process, using what we know and observe to arrive at an approximation, subject to some form of uncertainty quantified by an error term. One does not necessarily know what the true data-generating process is, but one tries to make the best hypothesis about it and test it against the observations. This implies that one does “care” about the data-generating process.

Logically, perhaps, the authors were trying to convey the fundamental i.i.d. (independent and identically distributed) assumption made in statistics, which often disregards the true data-generating process. If that is the case, then the authors could have stated the intention and essence of the sentence explicitly, to provide a clear and strong premise for the argument towards generalization using independent causal mechanisms.

Literature Review

The authors provide references to a total of 16 research papers covering topics such as mixture of experts (5 references), unsupervised domain adaptation (2 references), causal inference and the non-i.i.d. regime (6 references), disentangling factors of variation (2 references) and non-linear ICA (1 reference).

The largest number of references is devoted to causal inference, where the authors mention their goal of “extending applications of causal inference to more complex settings” and their “aim to learn causal mechanisms and ultimately causal SEMs without supervision”. There are two concerns with this: 1) the authors do not provide any concrete experimental results or analysis on applications of causal inference within the paper, and 2) the authors mention “complex settings”, yet the experimental setting is limited to the MNIST and Omniglot datasets with simple, singly applied transformations that do not mimic “complex settings”.

The authors clearly state how the proposed method differs from the methods in their references. The proposed method also appears to rest on an implication of causal invariance; hence, it would be interesting for the authors to explicitly shed light on the difference in approach and the benefits of the proposed method over methods such as domain-invariant representation learning (Muandet et al., 2013; Ganin et al., 2016).

Hypothesis

In the paper, the authors hypothesize that by learning independent causal mechanisms, models can generalize better across multiple domains. This statement is repeated throughout the paper, and the experimental objective is coherent with it, although the datasets used for the experiments could themselves be improved. Additionally, the authors point out that learning independent mechanisms can provide more interpretability and insight in terms of causal inference. However, this claim appears scattered throughout the paper with no experimental support or analysis using the proposed methodology. This may be perceived as “incremental work”.

Method

The authors propose an “unsupervised” method to learn causal mechanisms as independent modules using “competing experts” and adversarial training. The formal setting begins with a canonical distribution P, N measurable functions (M1, ..., MN) called mechanisms (a priori unknown), and the resultant distributions stemming from the mechanisms, i.e., the transformed distributions, described as Q.
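For reference, this setting can be restated compactly as follows; the expert notation E_j is introduced here for readability and is not taken verbatim from the paper.

```latex
Q_j = M_j(P), \qquad j = 1, \dots, N
\quad\text{(each a priori unknown mechanism transforms the canonical distribution)}

E_j \approx M_j^{-1} \quad\text{so that}\quad E_j(Q_j) \approx P
\quad\text{(each trained expert ends up approximating the inverse of one mechanism)}
```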

During training, samples from both the canonical distribution and the transformed distribution are made available. The experts compete: each example from DQ (a dataset drawn i.i.d. from Q) is fed to all the experts independently and in parallel. A distribution-modelling function c is applied to the outputs of all the experts. The expert with the maximum value of c is chosen for training and inference, and that expert’s parameters are updated to maximize the objective function while the other experts remain unchanged. In this way, the experts specialize and become better at mapping a particular transformed example back to its canonical form; this learning of the mechanisms and their inverse mappings makes each expert independent of, and modular with respect to, the others.
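To make the training loop concrete, below is a minimal PyTorch-style sketch of the winner-take-all update described above. It is an illustration only: the function names, the sigmoid-based discriminator loss and the exact objective are assumptions for readability, not the authors’ implementation.

```python
import torch

def competitive_step(x_transformed, x_canonical, experts, discriminator,
                     expert_opts, disc_opt):
    """One winner-take-all training step over a batch (illustrative sketch)."""
    # 1) Feed the transformed batch through every expert in parallel and score
    #    each output with the distribution-modelling function c (the discriminator).
    with torch.no_grad():
        outputs = [E(x_transformed) for E in experts]                          # N' candidates
        scores = torch.stack([discriminator(o).squeeze(-1) for o in outputs])  # (N', B)
    winners = scores.argmax(dim=0)                                             # (B,)

    # 2) For each example, only the highest-scoring expert is updated; the others
    #    are left unchanged, which is what drives specialization.
    for j, (E, opt) in enumerate(zip(experts, expert_opts)):
        mask = winners == j
        if mask.any():
            opt.zero_grad()
            # The winning expert tries to make its outputs look canonical to c.
            loss = -discriminator(E(x_transformed[mask])).mean()
            loss.backward()
            opt.step()

    # 3) The discriminator is trained to separate true canonical samples from the
    #    experts' outputs (a standard adversarial objective, assumed here).
    disc_opt.zero_grad()
    fake = torch.cat(outputs)  # computed under no_grad, so already detached
    d_loss = (-torch.nn.functional.logsigmoid(discriminator(x_canonical)).mean()
              - torch.nn.functional.logsigmoid(-discriminator(fake)).mean())
    d_loss.backward()
    disc_opt.step()
```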

The method overall seems promising, given that it follows clearly and directly from the IM assumption, which makes it coherent. The authors also address three aspects of the method: 1) selecting an appropriate number of experts, 2) the convergence criterion, and 3) time and space complexity. Point 1) is substantiated with experimental findings that clearly support the claims. For point 3), the authors state that each expert will “in principle have a smaller architecture than a single large network” and hence “will be typically faster to execute”. This claim, however, is not supported by experiments or empirical results. The authors could provide more specific and formal definitions of “smaller architectures” and “faster” to make the statement precise and not open to subjective interpretation.

Another point of query and concern is whether the proposed methodology is truly “unsupervised”, since the method requires samples from the canonical distribution to be available during training. The authors touch upon this lightly in the “Experiments” section; however, more clarity and stronger arguments could be given.

Experiment

To test the efficacy of the method, the authors perform training on the MNIST dataset. The reasons why this dataset was chosen are not explicitly mentioned. For training, the authors consider a total of 10 transformations (mechanisms) comprising 8 directional translations, contrast inversion and the addition of noise. The training examples are preprocessed by applying scaling and zero padding. The authors use an adversarial training scheme with convolutional neural networks (CNNs) for both the experts and the discriminator.
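For concreteness, a minimal sketch of such a transformation set is given below; the shift magnitude, noise level and padding details are assumptions for illustration, not values taken from the paper.

```python
import numpy as np
from scipy.ndimage import shift

# Eight directional translations (up, down, left, right and the four diagonals).
# The 4-pixel step is an assumed magnitude, not the paper's.
STEP = 4
DIRECTIONS = [(-STEP, 0), (STEP, 0), (0, -STEP), (0, STEP),
              (-STEP, -STEP), (-STEP, STEP), (STEP, -STEP), (STEP, STEP)]

def make_mechanisms(noise_std=0.25, rng=np.random.default_rng(0)):
    """Return ten transformations to apply to zero-padded, [0, 1]-scaled images."""
    mechanisms = [lambda x, d=d: shift(x, d, cval=0.0) for d in DIRECTIONS]
    mechanisms.append(lambda x: 1.0 - x)  # contrast inversion
    mechanisms.append(lambda x: np.clip(x + noise_std * rng.standard_normal(x.shape), 0.0, 1.0))
    return mechanisms
```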

One of the major concerns here is the limited setting of the experiment. While the authors state their goal of creating “robust models” and “moving towards animate intelligence”, the experiments do not truly represent the complexity of the real-world data that such “robust” models are needed for.

The MNIST dataset considered here is simple. Convolutional networks can achieve 99.7% on MNIST, and classic machine learning algorithms easily reach 97% (Fashion-MNIST Benchmark dashboard). MNIST cannot represent modern computer vision tasks according to Francois Chollet (Keras author), and in April 2017 Ian Goodfellow (Google Brain research scientist and creator of GANs) also called for people to move away from MNIST due to its highly overused nature.

In real-world scenarios, it is highly likely that several mechanisms affect the data simultaneously, such as “lighting and position in a portrait”; however, this was not considered in the experiment. The paper misses the fundamental point that the i.i.d. assumption fails for the very same reason this experimental setting falls short: real-world data are far more complex than either the statistical assumption or the setting considered here. The authors mention a version of these points as future work; however, by not empirically addressing these fundamental aspects within the paper, they have only partially addressed the motivation of this research. Indeed, this can be taken as a starting point for incremental work.

It would be interesting to see the same experiments on complex datasets across various domains. In particular, by employing this method across multiple domains, one could study the various independent semantic and spatial mechanisms at play. A dataset like ERASER may help in validating the method on text data, while popular datasets such as CIFAR-10 provide a rich set of images on which mechanisms can be learnt to unearth invariant causal structure.

In the case of the transformations, the authors have not explicitly mentioned the rationale behind choosing these specific 10 mechanisms. Often, in real-world cases, one would not be aware of the true independent mechanisms at play. Due to the multi-modal richness of the world, it is inevitable that one faces confounding variables. In such a case, how can one determine the right kind of mechanisms (or transformations) to employ? How does one determine the number of mechanisms to consider? These are questions that the authors have not addressed in the paper and that would be worthwhile to answer.

Results & Analysis

The paper presents key results and an evaluation of generalizability on the Omniglot dataset. Three main findings are discussed: 1) the experts specialize with respect to c, 2) the transformed outputs improve a classifier, and 3) the experts learn mechanisms that generalize.

For point 1), the authors present graphs showing how experts end up specializing in, or “winning”, a single mechanism after roughly 250–750 iterations, even if initially more than one expert competed for the same mechanism. The authors give only a generic description of the observations; the results could have been substantiated further with a formal, analytical view of exactly when these “experts” start winning over the others. For instance, by performing multiple runs of the experiment and plotting the distribution of the number of iterations it took each expert to start learning a single independent mechanism, one might uncover insights into why and when the experts start specializing and how this translates in general to any dataset and mechanism under consideration.
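As an illustration of the kind of analysis suggested, the sketch below assumes that per-iteration win fractions have been logged for each expert across several runs; the data structure and the 90% threshold are assumptions, not something reported in the paper.

```python
import numpy as np
import matplotlib.pyplot as plt

def specialization_iteration(win_fraction, threshold=0.9):
    """First iteration at which an expert wins at least `threshold` of the examples
    of one mechanism (an assumed working definition of 'specialized')."""
    hits = np.flatnonzero(np.asarray(win_fraction) >= threshold)
    return int(hits[0]) if hits.size else None

def plot_specialization_histogram(win_fractions_per_run):
    """win_fractions_per_run[r][j] is the per-iteration win fraction of expert j in run r."""
    iters = [specialization_iteration(w)
             for run in win_fractions_per_run for w in run]
    iters = [i for i in iters if i is not None]
    plt.hist(iters, bins=20)
    plt.xlabel("iterations until specialization")
    plt.ylabel("count (experts x runs)")
    plt.show()
```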

For point 2), the authors show that the transformed digits (i.e., the experts’ outputs), when fed into a “pre-trained classifier”, can reach an upper bound of roughly 99% accuracy after about 600 iterations. Initially, due to the identity initialization, the accuracy starts off very small and then slowly increases as the model has seen about one third of the whole dataset once. It would be helpful if the authors gave more details of the architecture of this “standard pre-trained classifier”. Due to the abundant use of MNIST for classification tasks, most common and popular MNIST classifiers easily achieve ~99% accuracy, which almost overfits the data; hence, it would be worthwhile to experiment and empirically measure how the same transformed data performs across different architectures.
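A sketch of what such a cross-architecture check could look like is given below; the routing of each example through the highest-scoring expert mirrors the winner-take-all selection described earlier, and all function and variable names are illustrative assumptions rather than the authors’ code.

```python
import torch

@torch.no_grad()
def accuracy_across_classifiers(experts, discriminator, classifiers, loader):
    """Compare several pre-trained MNIST classifiers on the experts' reconstructions."""
    results = {name: [0, 0] for name in classifiers}  # name -> [correct, total]
    for x_transformed, y in loader:
        # Route each batch through every expert and keep the most "canonical" output.
        outputs = torch.stack([E(x_transformed) for E in experts])             # (N', B, ...)
        scores = torch.stack([discriminator(o).squeeze(-1) for o in outputs])  # (N', B)
        best = scores.argmax(dim=0)                                            # (B,)
        recon = outputs[best, torch.arange(x_transformed.size(0))]
        for name, clf in classifiers.items():
            preds = clf(recon).argmax(dim=1)
            results[name][0] += (preds == y).sum().item()
            results[name][1] += y.numel()
    return {name: correct / total for name, (correct, total) in results.items()}
```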

For point 3), the authors present results on the Omniglot dataset, where the original examples can be recovered using the “experts” trained on MNIST. However, looking at the MNIST samples (Fig. a) and the Omniglot samples (Fig. b) side by side, the two possess somewhat similar distributions and cannot be deemed datasets that completely differ from each other. For instance, the fourth row of the Omniglot samples contains figures very similar in shape to the digit ‘1’ in MNIST, and the second row is very similar to the digit ‘0’. Therefore, it is crucial that the authors evaluate generalizability on truly out-of-distribution variations, such as coloured sentences with multiple styles of words, Fashion-MNIST (Xiao, Rasul and Vollgraf, 2017) or Kuzushiji-MNIST (Clanuwat et al., 2018), to verify whether the “mechanisms” truly allow for generalization.

In the “Conclusion” of the paper, the authors touch upon working towards more complex settings and diverse domains. But it seems contradictory to their goals and motivation to leave this to “future work” rather than address it in the present work. Causality is a theme seen throughout the paper; however, the only aspect implicitly touched upon is invariance, and there is no treatment of interventions or counterfactuals. The authors conclude that the independent modules or mechanisms can be “learnt across multiple domains or tasks, added subsequently, and transferred to other problems”; this statement, however, is not proven through empirical results, since no experiments were performed across multiple domains or transferred to other problems, and it therefore stands as a weak claim within the context of the proposed method.

Research in the direction of causality and machine learning is very promising. This paper constitutes an important step in that direction, a step that can certainly be enhanced, but a good starting point for researchers to further the work of learning the causal structures that underlie our data.

References

Parascandolo, G. et al. (2018) ‘Learning Independent Causal Mechanisms’, arXiv:1712.00961 [cs, stat]. Available at: http://arxiv.org/abs/1712.00961

Xiao, H., Rasul, K. and Vollgraf, R. (2017) ‘Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms’, arXiv:1708.07747 [cs, stat]. Available at: http://arxiv.org/abs/1708.07747

Fashion-MNIST Benchmark dashboard (no date). Available at: http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/

Clanuwat, T. et al. (2018) ‘Deep Learning for Classical Japanese Literature’, arXiv:1812.01718 [cs, stat]. doi: 10.20676/00000341.

DeYoung, J. et al. (2020) ‘ERASER: A Benchmark to Evaluate Rationalized NLP Models’, in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. ACL 2020, Online: Association for Computational Linguistics, pp. 4443–4458. doi: 10.18653/v1/2020.acl-main.408.

Krizhevsky, A. (2009) ‘Learning Multiple Layers of Features from Tiny Images’, p. 60.

Muandet, K., Balduzzi, D. and Schölkopf, B. (2013) ‘Domain Generalization via Invariant Feature Representation’, in International Conference on Machine Learning. PMLR, pp. 10–18. Available at: http://proceedings.mlr.press/v28/muandet13.html

Open Images V6 - Description (no date). Available at: https://storage.googleapis.com/openimages/web/factsfigures.html

zalandoresearch/fashion-mnist (2021). Zalando Research. Available at: https://github.com/zalandoresearch/fashion-mnist