Quintilian, Institutio Oratoria IX.2.67: "But what will happen is that the judge searches out that indescribable something for himself, which perhaps he would not believe if he heard it said, and believes in what he thinks he has discovered on his own."
Some things never change.
I don’t have an opinion about emojis in bios, but when I saw your “red flag” comment, I assumed you were making some kind of pun. And I’ll make another: I don’t see any red flags in that person’s bio.
He gives this example: "A certain man directed in his will that a golden statue holding a spear be set up. The question: must the statue holding the spear be golden, or may the spear be golden on a statue of some other material?" (Ibid.)
It especially pleases me to find in Quintilian the kind of ambiguity that often vexes me: "through word placement, where it is doubtful which word should be taken with which, and most frequently when a word sits in the middle and can be pulled in either direction" (Inst. VI.9)
Now I see that it was written in archaic Latin.
Reading through the last two days' posts in my feed, I found a passage beyond my abilities: it has sigmatic future verb forms and other grammatical features unknown to me. Which is not entirely unwelcome, especially since that doesn't happen often these days.
Yesterday evening I had dinner with a friend who is skilled in Latin. We talked for two or three hours, entirely in Latin. It was delightful.
I worked in Belfast for a while. One of the admins sent out a weekly trivia quiz. Shortly after I joined, he gave it an American theme. Coworkers asked me, “What do Americans call courgettes?” I said, “I don’t know. What’s a courgette?”
The error that gave me the most trouble today was in Quintilian book 5: “…, cum pardem ējēcissent”. I think it’s “patrem”, but Perseus says “pardem” is an adverb meaning “equal”, and the Latin Library does not appear to have this section in it at all.
I’ve been thinking about how future paleographers will talk about OCR: “The misspelling of rein for rem suggests this comes from a text that was originally digitized in the 1990s…”
A sentence in Quintilian that I like: "Just as a sure hand can be content with a single weapon, an uncertain one must scatter many, so that there is also a place for luck."
I’ve actually heard some people argue that Nixon might have survived Watergate if he had managed the economy more responsibly.
I’ve long known the English word “imprimatur”, but today is the day I realized it comes directly from Latin.
As I said before, the validation views for the latest version of the model immediately looked more sensible than previous versions. It feels good to think I solved a general, fundamental data science issue while simultaneously being able to show clear business benefits to my boss.
And that means I could add the importance scores for the two variables to get a single measure of importance for the original variable. Since SHAP values are also additive, I could do the same thing for all my validation views.
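As a toy illustration (the numbers here are made up, not from the actual model): because SHAP values are additive within each row, the contributions of the two derived columns can simply be summed to recover a single contribution for the original variable.

```python
import numpy as np

# Hypothetical SHAP matrix: one row per customer, one column per feature.
# Columns 0 and 1 stand in for the numeric and categorical halves of the
# original hybrid variable (layout assumed for illustration).
shap_values = np.array([
    [0.10, -0.02, 0.30],
    [-0.05, 0.08, -0.10],
])

# Per-row contribution of the original variable = sum of its two halves.
combined = shap_values[:, 0] + shap_values[:, 1]
```

The same trick works for any disjoint split of features, which is what makes the validation views comparable to those of a single-variable model.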
Another thing I did was to create a single importance score for these hybrid variables. For the variable selection phase, I used "total gain" in XGBoost instead of the default "average gain". Totals can be meaningfully added together, unlike averages.
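In XGBoost this is `booster.get_score(importance_type="total_gain")` versus the default `"gain"`. Here is a toy sketch (made-up split gains, not real model output) of why totals combine meaningfully while averages do not:

```python
# Per-split gains for the two halves of one hybrid variable (toy numbers).
numeric_gains = [5.0, 3.0, 2.0]   # the numeric half was used in 3 splits
categorical_gains = [4.0]         # the categorical half in 1 split

# Total gain adds meaningfully across the two halves:
combined_total = sum(numeric_gains) + sum(categorical_gains)  # 14.0

# Average gain does not: the sum of the two per-half averages...
sum_of_averages = (sum(numeric_gains) / len(numeric_gains)
                   + sum(categorical_gains) / len(categorical_gains))

# ...is not the average gain of the combined variable:
true_average = combined_total / (len(numeric_gains) + len(categorical_gains))
```

Summing the averages here gives about 7.33, while the true combined average is 3.5, so only the totals give an honest single score.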
This isn’t entirely new, but it’s not built into most open-source data science packages. I wrote some code this week to successfully implement this as part of an XGBoost model, and I could immediately see the benefits when I generated validation views for my model.
The model can use the numerical variable to understand how customers with calculated values behave and use the categorical variable to make any adjustments necessary to accurately represent the behavior of the other types of customers.
The categorical variable will then record which category the customer falls in: “Regular value”, “True missing value”, “No loans”, “No loans in 12 months”, “No minimum payments”, etc.
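A minimal sketch of such a split (the function name, inputs, and exact category labels are my illustrative assumptions, not the author's actual code):

```python
import math

def split_payment_ratio(actual_paid, min_required, has_loans, has_recent_loans):
    """Turn one hybrid variable into a (numeric_value, category) pair.

    Returns NaN for the numeric part whenever the ratio cannot be
    calculated, plus a category label recording why.
    """
    if not has_loans:
        return math.nan, "No loans"
    if not has_recent_loans:
        return math.nan, "No loans in 12 months"
    if min_required is None:
        return math.nan, "True missing value"
    if min_required == 0:
        return math.nan, "No minimum payments"
    return actual_paid / min_required, "Regular value"
```

A model like XGBoost can then consume the numeric column directly (it tolerates NaN) alongside an encoded version of the category column.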
The numeric variable has the actual values for customers when it can be calculated and either a missing value or an imputed value for those customers for whom it cannot be calculated.
This week what I did was to turn this single variable into two variables: a numeric variable and a categorical variable.
This variable, which initially sounded like a numeric variable, has categories of customers for which it cannot be calculated. It's partly numeric and partly categorical. How do we handle this?
For some customers, that is a straightforward calculation: 100% or 66.7% or 125%. But for other customers it makes no sense. Some might not have any loans, some might not have any in the past 12 months, some might not have a minimum payment amount.
Take an example variable: the ratio of a customer’s actual payments to their minimum required payments on all their loans in the past 12 months.
All of this so far is stuff that you are likely to see in intro statistics and data science textbooks. One thing that rarely gets mentioned in textbooks is that some variables can be partially numeric and partially categorical. That’s what I was working on this week.
There are lots of techniques for coding categorical variables, and some algorithms (like XGBoost) do a good job of handling most of this for us.
But we have to be careful how we do this. If we’re not careful, the predictions of our model will be based more on how we coded our data rather than any actual patterns in the data itself.
However, when we want to use categorical variables in statistical or machine learning models, we have to convert them to numbers somehow, like "Red = 1, Orange = 2, etc."
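For instance (a small sketch, not the only option): naive label encoding imposes an arbitrary ordering, while one-hot encoding, e.g. via `pandas.get_dummies`, gives each category its own 0/1 column and avoids pretending that Red < Orange.

```python
import pandas as pd

colors = pd.Series(["Red", "Orange", "Red", "Blue"])

# Label encoding: one integer per category, in alphabetical order
# (Blue=0, Orange=1, Red=2) -- an ordering the colors don't really have.
labels = colors.astype("category").cat.codes

# One-hot encoding: one 0/1 indicator column per category instead.
onehot = pd.get_dummies(colors)
```

Tree-based models often cope with label encoding anyway, but for linear models the fake ordering leaks straight into the predictions.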
Categorical variables are variables for which we can’t meaningfully use numbers. It doesn’t usually make sense to say “Red < Orange” or “Red - Orange = Blue” or “Red / Orange = Green”. Categorical variables do not have any inherent ordering or any numerical relationships between the categories.
First, definitions: a numeric variable is a variable that takes on numeric values. For numeric variables, we can sensibly say things like “3 > 2” or “3 - 2 = 1” or “3 / 2 = 1.5”. It is relatively straightforward to incorporate these variables in statistical or machine learning models.