Readability in Accounting: An Ensemble Learning Approach
We employ a language model (BERTimbau) pretrained on Brazilian Portuguese to simulate a
"human reader" that reads and assigns scores to 2,210 annual financial statement notes from 225
publicly listed companies in Brazil over the span of 10 years. Additionally, we calculate the usual
readability metrics (Flesch-Kincaid reading ease, Fog index, SMOG index, Loughran-McDonald
Index) for all the notes and employ machine learning models to evaluate which readability metric
best approximates the human-like readability score produced by the language model. The
evaluation is based on feature importance, which indicates that the best proxy for the
readability of Portuguese financial text is the Loughran-McDonald Index. These findings are
in line with the literature. This research contributes to the literature
by employing novel methods (machine learning and language models) in an underexplored
field (Portuguese financial information) with a reasonably large dataset. Further research
that aggregates different language models or incorporates human experiments may be needed
to strengthen the validity of the readability measure.
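
The feature-importance comparison described above can be illustrated with a minimal sketch. The abstract does not state which importance measure is used, so permutation importance is assumed here as one common, model-agnostic choice. The feature names (`flesch`, `fog`, `smog`, `lm_index`), the numeric values, and `toy_model` are all hypothetical: the toy model is deliberately built to depend only on `lm_index`, so the example shows how the ranking surfaces an informative metric and is not a reproduction of the study's models or data.

```python
import random
from statistics import mean


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return mean((a - b) ** 2 for a, b in zip(y_true, y_pred))


def permutation_importance(predict, rows, target, n_repeats=20, seed=0):
    """Average increase in MSE when a single feature column is shuffled.

    predict: callable mapping a feature dict to a predicted score.
    rows:    list of dicts, one per document, keyed by metric name.
    target:  list of reference scores (here, the language model's scores).
    """
    rng = random.Random(seed)
    baseline = mse(target, [predict(r) for r in rows])
    importances = {}
    for name in rows[0]:
        increases = []
        for _ in range(n_repeats):
            # Shuffle only this feature's column, leaving the others intact.
            shuffled = [r[name] for r in rows]
            rng.shuffle(shuffled)
            permuted = [{**r, name: v} for r, v in zip(rows, shuffled)]
            error = mse(target, [predict(r) for r in permuted])
            increases.append(error - baseline)
        importances[name] = mean(increases)
    return importances


# Made-up illustrative values, NOT data from the study.
rows = [
    {"flesch": 42.0, "fog": 18.0, "smog": 15.0, "lm_index": 1.0},
    {"flesch": 38.0, "fog": 19.5, "smog": 16.0, "lm_index": 2.0},
    {"flesch": 45.0, "fog": 17.0, "smog": 14.5, "lm_index": 3.0},
    {"flesch": 30.0, "fog": 21.0, "smog": 17.5, "lm_index": 4.0},
    {"flesch": 35.0, "fog": 20.0, "smog": 16.5, "lm_index": 5.0},
    {"flesch": 40.0, "fog": 18.5, "smog": 15.5, "lm_index": 6.0},
]


def toy_model(row):
    # Hypothetical stand-in for a fitted ensemble; it uses only lm_index.
    return 10.0 - 2.0 * row["lm_index"]


# Reference scores play the role of the language model's readability scores.
scores = [toy_model(r) for r in rows]

importances = permutation_importance(toy_model, rows, scores)
ranking = sorted(importances, key=importances.get, reverse=True)
```

In this sketch a metric the model ignores receives an importance of exactly zero, because shuffling it leaves every prediction unchanged, while shuffling the metric the model relies on raises the error. With a real fitted ensemble, the same procedure would rank the candidate readability metrics by how much the predicted score depends on each.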