Editor's note: This article was written for the ICML 2018 Machine Learning Debates workshop. Its authors are Zachary C. Lipton, an assistant professor at Carnegie Mellon University, and Jacob Steinhardt, a graduate student at Stanford University. By the standards of such an international venue the two are still relatively junior scholars, yet their critique has recently drawn support from many established researchers and prompted serious reflection in the academic community. Last August, Zachary C. Lipton also spoke up on the question of whether arXiv preprints should appear in paper citations, arguing that even though preprints are of uneven quality and not formally published, the authors of a preprint deserve credit whenever the ideas it contains are used.

Because the original article is long, this translation paraphrases freely and omits some examples. Please bear with me.

1 Introduction

The common goal of machine learning (ML) researchers is to create and disseminate knowledge about data-driven algorithms. In a given paper, they may aim for any of several goals: a theoretical characterization, an empirical demonstration, or a system that works more accurately. Although the choice of which knowledge to pursue is subjective, once the results are in hand the paper should serve its readers: it should explain the underlying concepts clearly and make them easy to communicate. In other words, it should deliver its full value to the community.

So, what kind of paper best serves its readers? Here are some characteristics: such papers should (i) provide intuitions that aid the reader's understanding, clearly distinguished from stronger conclusions supported by evidence; (ii) describe empirical investigations that rule out alternative hypotheses; (iii) make clear the relationship between theoretical analysis and intuitive or empirical claims; and (iv) use terminology that avoids conflating concepts and helps readers understand.

Although new machine learning results keep appearing at a rapid pace, recent papers often deviate from these ideals. This article focuses on four common flaws in ML papers as a window onto troubling trends in the machine learning community:

Failure to distinguish between explanation and speculation;

Failure to identify the true sources of "progress", for example when a performance improvement clearly comes from hyperparameter tuning but the authors still emphasize modifications to the model architecture that have no real effect;

Mathiness: using mathematics in vague or suggestive ways, for example by mixing formal and informal claims;

Misuse of language, for example describing results with colloquial or evocative words rather than precisely defined technical terms.

Although the causes behind these flaws are not yet clear, the rapid expansion of the machine learning community, the resulting shortage of qualified reviewers, and rewards (citations, attention, and startup opportunities) that are misaligned with scholarly rigor are all plausible contributing factors. While the obvious remedy for each flaw is simply not to commit it, we also offer broader suggestions later in this article.

As machine learning's influence grows, the audience for research papers includes not only students but also journalists and policymakers, and this should be kept in mind when writing. By communicating more precise information with clearer writing, we can accelerate research, shorten the ramp-up time for new researchers, and play a more constructive role in public discourse.

Unfortunately, flawed scholarship can mislead the public, erode machine learning's intellectual foundations, and hold back future research. Indeed, such problems have recurred throughout the history of artificial intelligence and, more broadly, of scientific research. In 1976, Drew McDermott accused the artificial intelligence community of a lack of self-discipline, predicting that "if we don't criticize ourselves, sooner or later others will do it for us." Similar debates ran through the 1980s and 1990s, and now they have surfaced again. In other fields such as psychology, loose experimental standards have seriously weakened the discipline's authority. By contrast, the current strength of machine learning rests on a large body of rigorous research, both theoretical and empirical.

2 Disclaimer

This article aims to spark discussion; it is our response to the call for papers from the ICML Machine Learning Debates workshop. Although the views expressed here are our own, we do not claim that the problems described are universal in the machine learning community. We do not comment on the overall quality of published research, nor do we intend to single out any particular individual or institution for criticism.

This is critical introspection from insiders, not sniping from outsiders. We ourselves have fallen into these traps and may well do so again. Although we cite specific examples, our principles are to (i) use our own work as examples where possible and (ii) otherwise prefer work by more authoritative, established researchers and institutions. We are grateful to belong to a community free enough to let us voice critical views.

3 Troubling trends

For each trend in this section, we (i) describe it, (ii) give a few examples (including positive counterexamples), and (iii) explain its consequences. Since pointing out weaknesses in individual papers can be a sensitive matter, we try to keep such references to a minimum.

3.1 Explanation and speculation

Exploring a new area usually starts from intuitions that have not yet been scientifically verified or given formal definitions. We find, however, that some researchers present such unvetted intuitions as if they were established technical facts, treat them as self-evident, and then build further explanations on top of this speculation. Readers of the paper, trusting the authors' apparent professionalism, come to believe the results, and the intuition hardens into an authoritative "truth."

For example, the Google paper [33] builds an intuitive theory around "internal covariate shift." Starting from the abstract, the authors state:

During deep network training, the distribution of each layer's inputs keeps changing as the model parameters are continually updated. This forces us to use lower learning rates and careful weight initialization, which slows training, and it also makes models with saturating nonlinearities notoriously hard to train. We call this phenomenon internal covariate shift, and the solution is to normalize the inputs of each layer. (Translator's note: this paper is regarded as one of the most influential papers of 2015.)
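For readers unfamiliar with the method itself, the normalization step proposed in [33] can be summarized by the standard Batch Normalization transform below (written out here for reference; the criticism that follows concerns the explanation, not the procedure). For a mini-batch of activations x_1, ..., x_m:

\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma\,\hat{x}_i + \beta

Here \gamma and \beta are learned scale and shift parameters, and \epsilon is a small constant for numerical stability.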

From this description, the phenomenon and its causal role read like established technical facts, and the paper's approach then seems fully justified. But where is the evidence? Whatever the underlying reality, an explanation that rests on such an unclearly defined key term is hard to trust.

As another example, the paper claims that Batch Normalization improves model performance by reducing the change in the distribution of hidden activations during training, yet it never says how this change would be measured. Although later work suggests that this explanation of Batch Normalization may be inaccurate [65], the speculative account given in [33] has been repeated as fact by other researchers: "It is well known that deep neural networks are difficult to optimize due to internal covariate shift..." [60].

One of the authors of this article, Jacob Steinhardt, made the same mistake in [72] (the details are omitted in this translation). For a positive example, look at [3], a practical guide to training neural networks. Rather than preaching from authority, the author states: "Although these recommendations come from... years of experimentation and to some extent have mathematical justification, they should be challenged. They are a good starting point... but have not been formally validated, leaving many questions open."

3.2 Failure to clarify the source of "progress"

Machine learning peer review places great weight on technical novelty. To satisfy reviewers' appetites, many papers now feature complicated models and fancy mathematics. Although complex models can be legitimate contributions in themselves, they are not the only way to advance the field: clever problem formulations, careful experiments, optimization tricks, data preprocessing, large-scale hyperparameter tuning, and applying existing algorithms to new tasks all count. When a researcher achieves a breakthrough result by combining many techniques, they have an obligation to let readers know which of those techniques the result should actually be attributed to.

In many cases, the authors do propose several improvements at once, but because they fail to run proper ablation studies, the source of the "progress" is obscured. Sometimes the gains actually come from a single change. In such cases the authors appear to have done a great deal, but the work falls short of what is needed, and this false impression misleads readers into thinking every modification was necessary.

Recently, Melis et al. [54] re-evaluated several popular RNN language models using large-scale automated black-box hyperparameter tuning and found that the reported progress came from better hyperparameter tuning rather than from complex architectural innovations. With everyone on the same starting line, the original LSTM, hardly modified since 1997, was still among the best. The community might have benefited more from the details of the tuning procedure than from the distraction of the architectural variants. Similar evaluation papers on deep reinforcement learning [30] and generative adversarial networks [51] have also caused controversy. For more on this type of issue, see the ICLR 2018 workshop paper [68].

In contrast, the papers [41, 45, 77, 82] include careful ablations of their contributions, and [10, 65] retrospectively isolate the sources of gains in earlier work and arrive at new findings in the process. Of course, ablation is neither necessary nor sufficient for understanding a method, and it may be impractical under computational constraints. Beyond ablation, we can also probe the reasons behind results by checking a model's robustness and by qualitative error analysis.

Empirical studies aimed at understanding can yield insights even without a new algorithm. For example, by probing the behavior of neural networks, researchers identified their sensitivity to adversarial perturbations [74]. Careful study can also reveal the limitations of a dataset relative to stronger baseline models: the paper [11], examining a reading comprehension task over news passages, found that 73% of the questions could be answered by looking at a single sentence, while only 2% required multiple sentences. In the same spirit, simpler neural networks and linear classifiers have often outperformed the complicated neural architectures proposed for such benchmarks.

3.3 Mathiness

While drafting a paper early in his PhD, one of us (ZL) received feedback from an experienced post-doc: the paper needs more equations. The post-doc was not judging the paper's results; he was advising on how it would come across. Even if the content of a paper is hard to follow, more equations lead reviewers to credit it with exceptional technical depth.

Mathematics is an important tool for scientific communication. Used correctly, it conveys information with great precision and clarity. However, not every idea or claim lends itself to mathematical description. Natural language is an equally indispensable tool, especially for expressing intuitions and empirical claims.

When mathematics and natural language are mixed without a clear account of their relationship, both the prose and the theory can suffer: problems in the theory can be hidden behind vague definitions, while weak arguments in the prose can be propped up by the appearance of mathematical rigor. This tangling of formal and informal claims is what the economist Paul Romer called mathiness: like mathematical theory, mathiness mixes words and symbols, but instead of making tight links between them, it leaves ample room for slippage between statements in natural and formal language.

Mathiness shows up mainly in the following ways:

First, some papers abuse mathematics to project depth rather than to achieve it, forcing an appearance of profundity. The most common form is to bolt on a theorem: inserting a theorem into a paper of otherwise empirical results makes it seem more authoritative, even when the theorem has little to do with the rest of the paper. We (JS) made this mistake in [70]: the discussion of "staged strong Doeblin chains" there has almost nothing to do with the proposed algorithm, yet readers may find it very deep.

The Adam paper [35] is an excellent paper, but it also shows how pervasive this problem is. The paper presents a convergence theorem for the convex case even though it is not a convex optimization paper, so the theorem is arguably unnecessary. Later, [63] showed that the proof was in fact incorrect.

Second, some papers state claims that are neither clearly formal nor clearly informal. For example, the paper [18] argues that the difficulty of optimizing neural networks comes not from local minima but from saddle points. As evidence, it cites a statistical physics paper on Gaussian random fields [9], stating that in high dimensions "all local minima of Gaussian random fields are likely to have an error very close to that of the global minimum." This sounds like a formal proposition, but without a precise theorem it is hard to verify; a formal statement would resolve the doubt. By comparison, the finding in [18] that the local minima encountered have lower loss than the saddle points, which comes with clear explanations and experiments, is the more interesting content.

Finally, some papers invoke theory too broadly, or cite theorems whose relevance to the setting at hand is doubtful. For example, the "no free lunch" theorem is often invoked to justify the use of heuristic methods that come with no guarantees, even though the theorem does not formally rule out learning procedures with guarantees.

Although the best remedy for mathiness is simply to avoid it, some papers demonstrate by example that mathematics need not be a scourge. For instance, the recent paper [8] covers a great deal of mathematical ground in a down-to-earth way, and its derivations are clearly tied to the applied problems at hand. We strongly recommend it here; newcomers to the field can also use it as a reference for research directions.

3.4 Misuse of language

We identify three common forms of language misuse in machine learning papers: suggestive definitions, overloading of existing terminology, and suitcase words.

3.4.1 Suggestive definitions

A suggestive definition coins a new technical term whose colloquial meaning is so suggestive that readers feel they can understand it from the wording alone. Such terms often appear in anthropomorphic characterizations of tasks (reading comprehension [31], music composition [59]) and of techniques (curiosity [66], fear [48]). Many papers also name model components after human cognition, such as "thought vectors" [36] and the "consciousness prior" [4]. We are not saying these words must never be used; when properly qualified, their connection to machine learning can be a fruitful source of inspiration and expression. However, once a suggestive term is given a technical meaning, later papers have no choice but to use it, or else confuse their readers.

Moreover, describing machine learning results in terms of "human-level" performance can create misperceptions of the current state of the art. Take the "dermatologist-level classification of skin cancer" in [21]: comparing the classifier with dermatologists conceals the fact that the two are performing fundamentally different tasks. Real dermatologists face a wide variety of circumstances and must render a diagnosis despite unpredictable conditions, whereas the classifier merely achieves low error on held-out test data. Similarly, the classifier in [29] is claimed to surpass humans on the ImageNet classification task. Among so many overclaiming papers [21, 57, 75], can a single rigorous one bring the public discourse back on track?

Although deep learning papers are not the only offenders, misuse of language in this area also affects other machine learning subfields. For example, [49], which studies algorithmic "fairness", illustrates how researchers borrow legal terminology for machine learning work: most prominently, it takes a simple equation expressing statistical parity and names it "disparate impact". The resulting problem is that "fairness", "opportunity", and "discrimination" come to denote simple statistics of predictive models, and the public and policymakers are then misled about how readily ethical requirements can be incorporated into machine learning.

3.4.2 Overloading existing terminology

The second form of misuse takes an existing term with a precise meaning and uses it in an inaccurate or even contradictory way. Deconvolution, for example, formally describes the process of reversing a convolution, but in deep learning papers, especially those on autoencoders and generative adversarial networks, the word is used to mean transposed convolutions (also known as up-convolutions). When the term first appeared in a deep learning paper [79], its use was accurate, but once it was cited and generalized [78, 50] it came to denote any neural architecture that uses up-convolutions. This kind of terminology misuse creates lasting confusion: when "deconvolution" appears in a new machine learning paper, it may mean (i) its original sense, (ii) a transposed convolution, or (iii) an attempt to resolve exactly this confusion [28].
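To make the distinction concrete, here is a minimal sketch, assuming PyTorch (a framework not mentioned in the article), showing that the layer many papers call a "deconvolution" is really a transposed convolution, a learned upsampling operation rather than a mathematical inverse of any particular convolution:

import torch
import torch.nn as nn

x = torch.randn(1, 8, 16, 16)  # a batch with one 8-channel 16x16 feature map

# an ordinary strided convolution that halves the spatial resolution
down = nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1)
# the layer often labeled "deconvolution" in papers: a transposed convolution
up = nn.ConvTranspose2d(16, 8, kernel_size=3, stride=2, padding=1, output_padding=1)

h = down(x)   # shape (1, 16, 8, 8)
y = up(h)     # shape (1, 8, 16, 16): same size as x, but not x recovered

# The transposed convolution only transposes the connectivity pattern of a
# convolution; it does not invert the operation, so x and y generally differ.
print(h.shape, y.shape, torch.allclose(x, y))

The shape arithmetic matches, which is exactly why the term is tempting, but nothing here performs deconvolution in the signal-processing sense.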

Generative and discriminative models provide another example. By the usual definition, a model of the input distribution p(x) or of the joint distribution p(x, y) is generative, while a discriminative model handles the conditional distribution p(y|x). In recent papers, however, "generative model" has become a catch-all term for models that produce realistic-looking structured data. On the surface this does not seem to conflict with the definition, but it conceals several shortcomings: for example, GANs and VAEs cannot perform conditional inference (given two distinct input features x1 and x2, they cannot sample from p(x2|x1)). Building on this confusion, some have even begun to describe discriminative models as generative models of structured outputs [76]. We (ZL) made this mistake in [47].
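For reference, the conventional distinction and the conditional-inference operation mentioned above can be written as follows (a standard formulation, not taken from the article):

\text{generative: model } p(x) \text{ or } p(x, y), \qquad \text{discriminative: model } p(y \mid x)

p(x_2 \mid x_1) = \frac{p(x_1, x_2)}{\int p(x_1, x_2)\, dx_2}

Sampling from p(x_2 | x_1) requires access to the joint density (or at least its marginals), which standard GANs do not expose at all and VAEs expose only through an intractable integral; this is the sense in which producing realistic samples does not, by itself, make a model generative in the classical meaning.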

Returning to the Batch Normalization paper mentioned earlier, [33] describes covariate shift as a change in the distribution of a model's inputs. In fact, the term refers to a specific type of shift: the input distribution p(x) may change, but p(y|x) does not [27]. Moreover, owing to the misuse in [33], Google Scholar now lists Batch Normalization as a top reference for "covariate shift".
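In symbols, the standard definition referenced in [27] can be stated as follows (written out here for clarity; the notation is not from the article):

p_{\text{train}}(x) \neq p_{\text{test}}(x), \qquad p_{\text{train}}(y \mid x) = p_{\text{test}}(y \mid x)

That is, covariate shift changes only the marginal distribution of the inputs while leaving the input-to-label relationship fixed, a much narrower condition than "the input distribution of each layer changes during training."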

One consequence of overloading terminology in this way is that, by swapping concepts, we can redefine unsolved tasks so that citing past achievements becomes easy, packaging "progress" where little real progress has been made. This usually works hand in hand with suggestive definitions: language understanding and reading comprehension, once grand challenges for AI, have now come to mean making predictions on specific datasets [31].

3.4.3 Suitcase words

Finally, consider the common overuse of suitcase words in ML papers. The term was coined by Minsky in The Emotion Machine [56], published in 2007, and refers to words that pack together a variety of meanings: consciousness, thinking, attention, emotion, and feeling may arise from different mechanisms and sources, yet we lump them all together as "mental processes." Machine learning has many such words. For example, [46] points out that "interpretability" has no universally agreed-upon meaning; it appears in papers with different methods and different goals, so papers that look similar on the surface may in fact be expressing different concepts.

Generalization is another example. It can refer to the specific technical sense (generalizing from training data to test data), to transfer between closely related populations (generalizing from one group to another), or even to extrapolation (from the experimental environment to the real world). Conflating these senses leads us to overestimate the capabilities of current systems.

When suggestive definitions combine with overloaded terminology, new suitcase words tend to follow. In papers on "fairness", terms from law, philosophy, and statistics are frequently overloaded, and these notions then get swept together under a single word, "bias" [17].

In speeches or aspirational writing, suitcase words can serve a useful purpose, because they evoke an overarching idea that unites many meanings; "artificial intelligence" itself is such an aspirational term. In technical argumentation, however, overusing suitcase words leads to confusion. For example, [6] writes equations involving loosely defined terms such as "optimization power", treating quite different notions as if they were the same kind of thing.

4 Reasons behind the trend

Are the problems above really trends in the ML community? If so, what are their root causes? We can only speculate, but here are several possible contributing factors:

4.1 Complacency in the face of progress

The rapid progress of ML can give researchers the impression that strong results excuse weaknesses in the argument. They may then insert irrelevant material to prop up a conclusion, design experiments with the desired result as the goal, make exaggerated claims, or stop guarding against sloppy mathematical reasoning.

At the same time, facing a flood of similar submissions, reviewers may feel they have little choice but to accept papers with strong quantitative results despite such flaws. Indeed, even if a flawed paper is rejected this round, there is no guarantee the flaws will be noticed in the next cycle, so accepting it can seem to do little harm.

4.2 Growing pains

Since 2012, deep learning's enormous success has drawn ever more people to the field, and the ML community has expanded rapidly. Although we believe this growth is a good thing, it also has side effects.

To protect junior authors, we have mainly cited our own papers and papers from large, established institutions in this article, but not naming others does not mean the problems are absent; if anything, they are more common in junior authors' papers. Some junior authors redefine a term simply because they are unaware of its existing definition, although experienced researchers make the same mistake.

As for peer review, raising the ratio of reviewers to papers might help, but problems would remain. Inexperienced reviewers focus on a paper's novelty and are easily taken in by spurious theorems; experienced reviewers, who tend to take on more assignments, become relatively conservative, favoring papers dense with mathematical formulas while overlooking genuinely innovative work; and the many overworked reviewers in between do not even have enough time to read the manuscripts carefully, so they miss many of the problems described above.

4.3 Misaligned incentives

Reviewers are not the only ones giving authors poor incentives. As ML research draws more media attention and ML startups become commonplace, the media (deciding what to report) and investors (deciding what to fund) supply much of the motivation. The media steers the direction of ML research, and anthropomorphic descriptions of ML algorithms provide a steady stream of material for popular coverage. Take [55], which describes autoencoders as "simulating the brain", a phrase made for news headlines; or [52], which describes image generation with deep learning as "mimicking human-level understanding".

Investors have also shown a strong interest in AI research, sometimes funding a startup on the strength of a single paper. One of us (ZL) has worked with investors whose decisions about which companies to back tracked which startup results the media covered, tying financial incentives directly to media attention. We have also noticed the investment community's recent enthusiasm for chatbots, which coincides with media hype around dialogue systems and reinforcement learning.

5 Suggestions

Faced with these trends, how should we respond? What can we do to raise the community's standards of experimental practice, exposition, and theory? What can we do to make the community's knowledge easier to distill and to clear up public misunderstandings of the research?

5.1 Suggestions for authors

Ask "why" and "how" more often, not just "how well does it work". Use error analysis, ablation studies, and robustness checks in empirical papers (for example, robustness to the choice of hyperparameters and, ideally, of dataset), and read and cite the relevant literature more thoroughly.

Do not force every insight through the lens of a particular new algorithm. Even without a new algorithm, you can still generate new insights about a topic, for example the finding that a neural network trained by stochastic gradient descent can fit randomly assigned labels.

When writing a paper, ask yourself: do I believe in the system I am proposing, and would I use it in practice? This anticipates what a reviewer will think on reading the article, and it also tests whether the system really matches your own mental model.

Be clear about which problems are open and which are solved, so that readers get an accurate picture of the state of the research.

5.2 Suggestions for publishers and reviewers

Ask yourself: if the authors' results were worse, would I still accept this paper? For example, given two papers with similar conclusions, where the first achieves the improvement with one simple idea and also reports two negative results, while the second combines three ideas to achieve the same improvement (with no ablation), the first should be chosen.

Retrospective revision should also be required: exaggerated claims and irrelevant material should be deleted, and anthropomorphic expressions replaced with precise terms and notation.

Call for critical essays that challenge conventional thinking.

The peer review system itself needs further discussion: open review or anonymous review? How well do reviewers represent the values of most researchers? And what effect would such reforms have on the problems described above?

Some scattered material follows in the original, which is not translated here. Judging from the full text, these problems really are common in many papers. When this editor slogs through papers, the overloaded terminology and suitcase words are a constant torment, and in the end one may misunderstand and go on to mislead even more readers. (And why is there only one editor?)

If you have read this far, I hope the article gives us something to learn from. Whether we are beginners, researchers, or journalists, we all want to see the field of machine learning develop healthily, and we do not want hyperbole to squander the foundation of rigorous scholarship that earlier researchers left us.
