(Comments)

Original link: https://news.ycombinator.com/item?id=40097375

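In academic settings, the order in which exams are graded can affect the results. In the traditional approach, graders sit together and each one focuses on a single exercise for consistency; because everyone keeps pulling exams from the stacks, the grading order ends up effectively random. Weekly exercises are different: they are submitted to a digital repository whose folders are sorted alphabetically, which can lead to unfairness. Graders therefore recognize the importance of reshuffling the submissions every week, for several reasons: they are less tired at the start, their mood improves toward the end, and they only learn to spot common errors after seeing them, so early submissions may be judged differently from later ones. Commenters also note that history and personal experience shape beliefs and behavior, including preferential treatment and bias, even when unintentional. Randomizing the order in which work is graded therefore helps minimize such disparities.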

I work in academia. When we grade exams, the order of the exams on the stack is the order in which they were collected in the room (people can sit wherever they like). For grading, we are usually 5 people in a single room, and everyone grades a specific exercise for consistency. The exams are getting shuffled heavily, with everyone just grabbing stacks, looking for exams where "their" exercise was not yet graded, and taking them out. So basically, the order in which we grade exams can be considered random.

However, I also grade weekly exercise sheets during the semester, and these are committed into a repository, where each student has a folder that... begins with the first letter of their first name. Everyone I have ever worked with acknowledges that you have to shuffle the order in which you grade these submissions each week, for fairness. Several effects come into play: (1) you are usually less tired at the beginning, (2) your mood gets better during the last 2 sheets because you know you are done soon, (3, and crucially) at the beginning, you have not yet seen all the common errors / developed a "feeling" for them, and you might thus miss them in early submissions, but spot them immediately in later submissions.
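A minimal sketch of the weekly shuffle described above, assuming one folder per student. The folder names, the `grading_order` helper, and the seed string are hypothetical and only illustrate the idea.

```python
import random

def grading_order(folders, week, seed="course-seed"):
    """Return the folders in a pseudo-random order for the given week."""
    # Seeding with the week number makes each week's order reproducible,
    # while the orders of different weeks are unrelated to each other.
    rng = random.Random(f"{seed}-week-{week}")
    order = sorted(folders)  # start from a canonical order
    rng.shuffle(order)
    return order

folders = ["alice", "bob", "carol", "dave", "erin", "frank"]
for week in (1, 2, 3):
    print(week, grading_order(folders, week))
```

Note that an independent shuffle does not guarantee that a student lands in a different position every week; it only removes the systematic link between name and position.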

Another alphabetic effect: In elementary school, my name was on top of the list of students in my class. I remember that I often had to do some special job simply because I was the first name on this list (for example, carry a group ticket when we visited some museum, keep track of something, be the first at something where nobody wanted to be the first, with everyone watching, be the first to be graded in PE, again with everyone watching, etc.). As a fairly shy kid, this already annoyed me in first grade.

My strategy was to, like you said, grade problem by problem. Then for each problem, first find all those who got full marks. Then group the others into piles based on what mistakes they made.

This ensures that everyone who made the same mistake(s) gets the same grade. It also tends to shuffle the order of the exams after every problem.

Obviously you don’t need this strategy for simple multiple choice questions, and it’s probably also not a great fit for long-form essays. But it worked great for technical short answer problems in CS and security.
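The pile-based approach above can be sketched roughly as follows. The student IDs, mistake names, and point values are made up for illustration; the point is only that a grade is decided once per pile, not once per student.

```python
from collections import defaultdict

# Hypothetical data: for one problem, the set of mistakes found in each answer.
mistakes_by_student = {
    "s01": set(),
    "s02": {"off_by_one"},
    "s03": {"off_by_one", "missing_base_case"},
    "s04": {"off_by_one"},
    "s05": set(),
}

# Pile the answers by the exact set of mistakes they contain.
piles = defaultdict(list)
for student, mistakes in mistakes_by_student.items():
    piles[frozenset(mistakes)].append(student)

# Hypothetical deductions, decided once per mistake.
deduction = {"off_by_one": 1, "missing_base_case": 2}
max_points = 5

grades = {}
for pile, students in piles.items():
    score = max(0, max_points - sum(deduction[m] for m in pile))
    for student in students:
        grades[student] = score  # everyone in the same pile gets the same grade

print(grades)
```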

When I was a TA at CMU, we used Gradescope https://www.gradescope.com/ for this. Every exam would be scanned and divided into problems (based on a predefined template - fixed page space for answers).

Then, each problem was assigned to a TA. Either there's a predefined rubric, or you create it as you go (-1 point for mistake X, half credit for mistake Y, etc.). There's a pretty slick interface where you just read the answer, and use keyboard shortcuts to apply the relevant deductions.

It still has the issue that every time you change the rubric, you'd need to go back and re-do previously-graded instances of that problem. But it was way faster and (equally important) less tiring.

This sounds like an organisational nightmare to be honest. You'd be going through the pile of exams multiple times (at least twice) and what do you do if there are multiple mistakes that are common in a single exam question?

Also: if you're sorting into "mistakes piles" for single exercises, how can you parallelise marking of separate and independent questions?

Even at top-notch universities (public or private), when I talk to retired faculty, grading almost always comes up as a reason they don't want to teach anymore.

[Edit: not disagreeing with your point.]

Not only is it generally time intensive, you are also subject to lots of tiring back and forth with some students about their grades.

No grading is perfect, but there’s also some undercurrent of an attitude that students have paid to be there and are entitled to a certain grade.

> No grading is perfect, but there’s also some undercurrent of an attitude that students have paid to be there and are entitled to a certain grade.

Given that students have taken on hundreds of thousands of dollars in debt that they'll have to repay no matter what and on top of that a lot of jobs being completely out of reach these days without an academic degree (that for fucks sake isn't remotely required by virtually all jobs requiring it!), that's completely understandable.

Want to fix higher education? Bring the hammer down on companies abusing it as a proxy for legally discriminating against classes of society that are closely correlated with poor academic outcomes. Academic education should be reserved for the best of the best of our youth, and it should be fully paid for by the government, not simply another hurdle to pass to get a job that pays barely more than flipping burgers.

I think it is rational that students can feel entitled to that.

I also think that the vast majority of poorly paid, non-tenured professors and other teaching staff don't love being the targets of this harassment, since it's not their fault and largely out of their control, and it's not like they're getting the bulk of the tuition money. (That mostly goes to administrative expenses and sports programs.)

Heck, most adjunct faculty are often paid below minimum wage and qualify for food stamps.

> Given that students have taken on hundreds of thousands of dollars in debt that they'll have to repay no matter what and on top of that a lot of jobs being completely out of reach these days without an academic degree (that for fucks sake isn't remotely required by virtually all jobs requiring it!), that's completely understandable.

Would that my students were this engaged before the exam. Guess which students show up the most often for office hours? ... yeah, the ones that are getting the best grades.

If my students spent half as much time learning the subject as arguing with me about grades, they would be getting a higher grade than the one they are arguing for.

Online tools like Gradescope make this a little less painful (but still painful), but sometimes it's what's needed, especially on problems that are a little open-ended.

Sibling comment already said so, but I want to emphasize - this requires two run-throughs (at least).

When I was grading homework, it took about 5 hours a week per class per run through. They didn't pay me enough to make sense for it to be 10 hours.

True, but the overhead is large. I graded intro linear algebra and intro calculus, so there were a lot of students - I think 150 or so - and most of them were wrong.

Graders know that wrong homework takes much longer than correct homework to grade. It's correct? Full marks, move on. Is it wrong? Well, how wrong is it? Did they make a bad assumption, but followed it through to its conclusion? Did they forget a minus sign? Or is it complete hogwash?

So it might not be 10 hours, but still would be around 8 hours. And that's still too much.

For final exams, we used to mark across all sections of a course (so for 101 type courses, this can be hundreds to 1000s of papers).

Get all the profs and TAs together, break into groups taking one problem or set of problems. Then you random sample (each group takes a stack) to get a feel for the 'typical' errors, once that's done - you are a machine going through the stacks.

Every once in a while (not that often) you run into a novel error or approach, and the group discusses.

My CS school implemented OCR test sheets, with some exceptions, and equivalent strategies, such as test suites and benchmarks for programming assignments. This was done to avoid subjective grading, as it was a big issue even in well-intentioned cases.

Often, you still get big problems, but the set of solutions is small. It's always three options plus a fourth option (none / all). If you make a mistake you score negative points. It's not perfect, sometimes wording is ambiguous and it's unclear whether you need to tick the fourth catch-all option, but I found it better than the alternatives as it removes most arbitrariness from the process, but obviously has other issues.

Regular exams often had wildly different grading standards for the same course depending on the class, and thus on the professor who was correcting exams. This was really annoying.

A teacher friend of mine always goes through his stack twice. Once to correct all mistakes and a second time to write down points. As you said, once you have seen all mistakes you know how "bad" of a mistake it actually is.

> As you said, once you have seen all mistakes you know how "bad" of a mistake it actually is.

Crucially, this is not quite what the poster said. It’s not about stack ranking students against each other.

Say every paper makes the same subtle mistake, and you only notice it halfway through the pile. Unless you go back through them all, you’ll unfairly grade the later entries more harshly.

I think we’re talking about the same thing, but to clarify my meaning:

If you weigh the severity of students’ mistakes (or successes for that matter) in relation to each other rather than to an objective rubric, you’re effectively stack ranking them whether you mean to or not.

I'm not a big fan of putting everything in the cloud, but one of the advantages of online grading systems is that it is easier to make this kind of adjustment. The workflow goes like this: make a rubric item for a specific kind of mistake (it takes a little experience to know which mistakes are likely one-off and which ones are likely to be repeated by other students), assign X points, and later if you decide there are worse mistakes, adjust the points and that gets applied to everyone.
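A small sketch of why that adjustment is cheap in such a system, assuming scores are derived from rubric tags rather than stored per submission. The `Rubric` class and its method names are hypothetical and do not correspond to any real grading tool's API.

```python
class Rubric:
    """Rubric items are stored once; scores are computed from tags on demand."""

    def __init__(self, max_points):
        self.max_points = max_points
        self.points_off = {}  # rubric item -> deduction
        self.tags = {}        # submission id -> set of rubric items applied

    def set_item(self, item, points_off):
        self.points_off[item] = points_off

    def tag(self, submission, item):
        self.tags.setdefault(submission, set()).add(item)

    def score(self, submission):
        items = self.tags.get(submission, set())
        return max(0, self.max_points - sum(self.points_off[i] for i in items))

rubric = Rubric(max_points=10)
rubric.set_item("sign_error", 3)
rubric.tag("sub_a", "sign_error")
rubric.tag("sub_b", "sign_error")
print(rubric.score("sub_a"), rubric.score("sub_b"))  # 7 7

# Later the grader decides a sign error is minor; no re-grading is needed.
rubric.set_item("sign_error", 1)
print(rubric.score("sub_a"), rubric.score("sub_b"))  # 9 9
```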

In around the year 2000 I had an essay due that day I had forgotten, and about ten minutes of computer lab time before home room in the morning. I wrote an introduction and conclusion; then filled the remainder with copy pasted chunks of the introduction and conclusion. The thought being at least I’d get a laugh. If anyone had read the thing it would have been clear it was nonsense.

I received an 80% with no notes or markup.

I have been left wondering for the last 25 years how much student work is actually even reviewed.

I work in EdTech and every time we add a feature that requires manual teacher review of student work you will see that some teachers are VERY diligent while others never touch it.

There was this numerical calculus class at uni where the teacher forbade us to use the calculator. So I just programmed the integral on it, got the partial steps, and just wrote random numbers to fill in the substeps. Got full grade :D In the other case everybody got to pass the class, but after vacation we found the stack of exams completely untouched under a desk. The teacher had a side business to run...

We graded similarly, incidentally, when I was at U-of-M (lol). I don’t think we ever sorted by name so I don’t know if we’d have a bias effect by name unless it’s an implicit bias towards lexicographical esthetics. I won’t deny that grading fatigue can have subjective effects. I always thought we did a pretty fair and objective job. I taught Computer Architecture and we developed answer keys and grading scales before grading a single test. Of course assigning partial credit always ended up being pretty subjective. Typically though people would err in the same ways and so those would be subjectively identical. I never thought names factored into this much but, to be fair, no one ever collected data…

Finally, I guess I’ll admit that I’m probably very biased because my initials are A.B. and I’ve always gotten excellent grades, so… maybe maybe maybe

Probably it would be something like as follows:

Have a group of N graders. And a parity of k. Let's say N is 6 and k is 2. Randomly shuffle the assignments and partition the assignments into N groups.

Each grader gets assigned k of the N groups such that they share at most 1 overlap with any other grader and each group is assigned to k people. The assignment orders are shuffled for each grader. They mark up and then grade the assignments.

Then for each of the N groups, randomly shuffle the group and equally distribute the assignments to the N-k graders who did not mark that group.

Now each grader reviews the assignment grades/markups (in random order) and assigns a grade based on the k grades/markups from the previous rounds along with a rationale for the grade assigned.

From there the student receives the final assigned grade, the rationale for the grade, and the k markups. If they have a complaint they can go to the professor (who then can also see the k initial grades along with everything else) to dispute the grade for the assignment.

---

This way each TA only has to mark up (class size * k / N) assignments, and review (class size / N) assignments to assign a final grade (which should take far less time to do than the initial markups). On top of that every assignment has a guaranteed (k + 1) separate eyes on it. And then the professors can serve as an unbiased arbiter while retaining all the context from the process.

To take it an additional step further, the professors could sample a random subset of the assignments to verify the markup and grading is going properly.

And those reviews/grade adjustments can then be recorded (along with the final grade/rationales) to document how a given TA's grading deviates from the final reviewed grade or the grade the professor assigns. Likewise for a TA's final assigned grade deviating from the professor's. This would allow deviations to be mitigated over time and major deviations to be identified.
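The two rounds described in this comment could be sketched as below. This is one possible reading, not the commenter's exact procedure: groups are handed out cyclically (which keeps the overlap between any two graders at one group or fewer only for k = 2 and N of at least 3), and the review round uses a simple least-loaded rule instead of a strict equal split. The function name `plan_grading` and the assignment IDs are hypothetical.

```python
import random

def plan_grading(assignments, n_graders=6, k=2, seed=0):
    rng = random.Random(seed)
    pool = list(assignments)
    rng.shuffle(pool)
    # Partition the shuffled assignments into N groups of near-equal size.
    groups = [pool[i::n_graders] for i in range(n_graders)]

    # Round 1: grader g marks groups g, g+1, ..., g+k-1 (mod N).
    markup = {g: [] for g in range(n_graders)}
    markers_of_group = {i: [] for i in range(n_graders)}
    for g in range(n_graders):
        for j in range(k):
            i = (g + j) % n_graders
            markup[g].extend(groups[i])
            markers_of_group[i].append(g)
    for g in markup:
        rng.shuffle(markup[g])  # each grader sees their pile in random order

    # Round 2: each group is reviewed only by graders who did not mark it.
    review = {g: [] for g in range(n_graders)}
    for i, group in enumerate(groups):
        reviewers = [g for g in range(n_graders) if g not in markers_of_group[i]]
        shuffled = group[:]
        rng.shuffle(shuffled)
        for item in shuffled:
            # Assumption: hand the item to the currently least-loaded reviewer.
            reviewer = min(reviewers, key=lambda g: len(review[g]))
            review[reviewer].append(item)
    for g in review:
        rng.shuffle(review[g])
    return markup, review

assignments = [f"hw{i:03d}" for i in range(60)]
markup, review = plan_grading(assignments)
print([len(v) for v in markup.values()])  # 20 each: 60 * 2 / 6
print([len(v) for v in review.values()])
```

With 60 assignments, N = 6 and k = 2, every grader marks 20 assignments in the first round, matching the class size * k / N figure above, and every assignment is reviewed once by someone who did not mark it.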

No, the solution is for the scoring to be handled by software that doesn't exist yet. Some things have easy, objective measures of correctness. STEM is mostly this way. Others, your humanities et al, are fairly subjective.

You could probably cover most of this with an LLM, and access to a large body of graded material for a given course, provided said material was graded fairly. Generating that data would be time consuming, as, any given assignment would need to be graded by as many people as possible in order to find a fair average.

From there, it's simple comparison between your sample work and the presented work. We're probably a decade from this really being viable en masse, but, it no doubt will happen, and for better or worse we'll likely end up with EDUAAS (education as a service).

LLMs are not going to be a solution. LLMs have absolutely no concept of truth.

And not everything has an objective solution. Even those that do often have a process associated with them and factoring in that work/process is an important part of grading. Reducing that subjective grading process to only objective solutions being right is grossly reductive and disproportionately punishes students who have the process right and understand the material but make small errors. That's exactly what you don't want to do.

---

Instead the solution is to make sure each assignment gets multiple eyes on it and in a random order. Then to document biases and trends in biases so that the TAs and professors can be aware of them and mitigate them.

It's a process problem that can only be solved by a process solution. Replacing the graders with technology or reducing problems to a binary right/wrong will never ever solve this and in many cases will end up being more harmful than the biases they claim to solve would be.

The LLM can compile verbose prose down to a short summary. If the summaries of each chunk are consistent, then it’s at least structurally well written. Then you grade the summary itself.

This might come off rude on accident, but I mean genuinely without malice. When I'm writing an essay to submit to my professor/teacher, I am asked to make multiple drafts to get a proper end result that is ready to submit. Understanding that educational staff is already often overworked, should I expect _less_ from the person I receive my education from? If you acknowledge that many of the grades I receive are actually not fair to me, and there's an attempt to randomize the order that papers are graded, many of the grades that I received (whether high or low) were done partially (that is to say, the opposite of "impartially".) And there's a real concern that in your example where the submissions are committed to a repository that you need to shuffle, that my submission ends up in a similar position in the stack week after week, unless you're actually doing something to ensure my position in the stack is different between submissions. It's probably sufficient in many cases but doesn't guarantee randomness unless the algorithm to randomize submissions takes previous stack orderings into account.

I know it's not the point of your post, but I think it's worth pointing out that you're misunderstanding randomness (albeit in a very typical way). Although randomness is likely eventually (over a lot of instances) going to be the most "fair" way to distribute where your submission is in the order, it does not guarantee that it will always be different, and in fact a "random" algorithm that took previous orderings into account would be provably less random than one that didn't.

It's also worth noting that randomization in a context like this is inherently an imperfect solution to a problem that generally can't be solved perfectly. If we find out that weird ordering biases exist, I think randomization is done on the assumption that many we don't know about could also exist, that there's no clear way to mitigate them completely, and then randomizing the order per-instance is just the best we can do to ensure it's fair (Which, again, won't be perfect. Perfect isn't available)

I somewhat assumed there would be commenters suggesting the human angle as a retort. That's why I prefaced with both "this is what the teacher expects of me" and "understanding that educational staff is already often overworked." It just seems to me that the current systems aren't sufficient, and acknowledging that is what leads people to improving those systems. The above commenter suggested what they do in academia as workarounds to what the study showed, and I'm saying even that is not sufficient.

It seems like you're agreeing with me, but jumping to their defense with "people are fallible." People are fallible, that's why we build systems to take human elements out of it. Recognizing where humanity has soured something is key to that.

> [..] As a fairly shy kid, this already annoyed me in first grade.

(I suppose the cons outweighed the pros.)

Did you perceive any pros?

I suppose one way to do grades is first read through all papers to get an idea of the levels of the students. Though you still have bias/nepotism and such then. Perhaps a teamwork or committee would work, or teachers swapping classes/schools?

I had a French teacher in high school who dropped a pen on the list of students, and whoever it landed on would get rehearsal. People in the middle (waves) were fried.

Plus, there is also the issue of certain last names being common in certain cultures, leading to skewed statistics.

There are all sorts of good ways to avoid these biases. I use the same practice described above for paper exams, and grading order for eg question 2 may be affected by score on question 1, but it won’t be affected by name or ID number.

If you use Canvas or Gradescope with the default settings, it’s almost impossible to avoid this sort of bias.

Worse yet, in Gradescope you’re strongly steered towards grading with a fixed “rubric” with specific points off for each of N pre-defined errors, allowing grading to be done by TAs with little more knowledge than the students themselves, resulting in scores which have little relationship to the quality of the student answer.

It's hard when you are the only TA for 260 students who get 3 assignments per week, you must also hold free hours and you aren't allowed to go over 27 hrs each week so the school isn't breaking federal laws.

We tried a lot of things. What eventually worked was ending grades. You mastered the material or you did not; perhaps a couple of students mastered it with high marks.

Obvs this takes an administration that is OK with that, which most aren't.

When I saw the title I would have thought that the higher concentrations of Asian names starting with V, W, X, Y, Z would have led to higher grades at that end of the alphabet, and thought that effect would have eclipsed anything else.

Anecdotally, the course I grade has this effect (just looking at the average score). I have been grading this course for the last 5 years (9-10 times). Last names with L-Z score slightly higher than A-L.

Have you ever thought about just handing out a set of grades at random to random individuals to see how that shakes out? Like totally random and unjustified grades. D minus for an A+ student. A+ for fails, etc. Just random chaos. Then just score the final correctly and see the effect?

Or just having a Kafkaesque pass fail grade with no feedback for each student relative to their own performance over time with an expected growth rate applied?

> you have to shuffle the order in which you grade these submissions each week, for fairness

I don't think this is fair. It's just a more randomly distributed unfairness, rather than by a deterministic factor (like the student's name)

'Fair' would be each student is assessed independently for the work they did, rather than their mark being impacted by how early or late they were marked.

I think an important difference is that when you shuffle them, the unfairness stops being correlated across multiple assignments, so the "aggregate" unfairness over the course of the semester is much lower.

It would be essentially impossible to have something "truly" fair for open-ended questions since humans are stateful.
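A toy simulation of the decorrelation argument, under a made-up model in which the only unfairness is a position effect running linearly from -1 point for the first paper graded to +1 for the last. The model, class size, and number of weeks are assumptions for illustration, not measured values.

```python
import random

def average_bias(n_students=30, n_weeks=12, shuffle=True, seed=1):
    """Per-student position bias, averaged over the semester."""
    rng = random.Random(seed)
    students = list(range(n_students))

    def bias(position):
        # Hypothetical position effect: -1 for the first paper, +1 for the last.
        return -1 + 2 * position / (n_students - 1)

    total = [0.0] * n_students
    for _ in range(n_weeks):
        order = students[:]  # alphabetical order stands in for the fixed order
        if shuffle:
            rng.shuffle(order)
        for position, student in enumerate(order):
            total[student] += bias(position)
    return [t / n_weeks for t in total]

fixed = average_bias(shuffle=False)
shuffled = average_bias(shuffle=True)
print("worst semester-average bias, fixed order:   ", max(abs(b) for b in fixed))
print("worst semester-average bias, shuffled order:", max(abs(b) for b in shuffled))
```

In the fixed order the same students absorb the full effect every week, whereas with weekly shuffling each student's average drifts toward zero as the weeks add up. Any single week is exactly as unfair as before.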

Maybe this is a case that AI could actually do quite well.

Manually grade the answers and identify the classes of mistakes. Then hand the classes of mistakes to the AI and ask for it to determine which answers have which types of mistakes.

Once you've done that, you just need to associate a deduction for each type of mistake and do some simple math.

Imagine a question: compare the bubble sort and quicksort algorithms.

Some students might mix up the algorithms, some might give the incorrect computational complexity, some might describe them incorrectly in some way.

Manually grade some (or all of) the answers by noting the kinds of things students got wrong (e.g. the above criteria). Then, feed in to ChatGPT (or your favorite alternative) the answer + the categories of mistakes to expect.

Here's a simplified example: https://chat.openai.com/share/bf801e12-51d5-4255-9968-bbf91b...

There are many notions of "fairness", many of which are logically incompatible with each other.

In this example, I think it's kind of fair to give everyone an equal chance of being advantaged. You're not hurting anyone specifically.

Is that distinction worth making here? There’s no way to “assess independently” the work of each student without some amount of randomness. But I think that’s okay, because isn’t randomly distributed unfairness just… fairness?

Maybe related, or maybe not, but I remember when I was in K-12 school back in the 80s and early 90s, they would always seat us physically in the class front-to-back by last name. So the kids with last names starting with A-D or so would always be in front, and the kids with last names starting with U-Z would be in the back. For every class. I remember this because many of my friends had last names "near" my last name since we were always in close proximity to each other. I vaguely remember, by the time we were in high school, there were definitely more high-achieving kids with A-D last names and definitely more of the troublemakers were U-Z. Was it caused by sitting in closer proximity to the teacher and getting more teacher attention? We'll never know because this wasn't an experiment and there wasn't a control group.

"students who sit closer are more likely to be high achievers" might also be the source of most of the stereotypes of people with glasses. It took me years to realize I'm mildly shortsighted, so the first half of school I chose seats in the front half of the classroom to make reading the blackboard easier. Many of my friends had glasses and preferred to sit up front because their glasses didn't fully correct their vision.

In a somewhat reverse scenario, when I was in 4th grade (9 years old), I knew 100% that I was getting nearsighted, and I absolutely did NOT want glasses. Fortunately (debatable) we got to pick our seats so I always picked a seat in the very first row, where I could kinda-sorta-almost see what was written on the board if I squinted. And I was also way above my grade level so I was able to fake it pretty well for most of the year even when this started to fail me. My mom insisted on taking me to get my eyes checked about 2/3 of the way through the year and I couldn't fake my way through that, though, so I finally got glasses, but by that point I was used to sitting at the front of the room, so I choose front-of-room seats when possible for most of the rest of my schooling. There's probably some moral here but I don't know what it is.

I moved states and schools midway through 3rd grade and was seated alphabetically, in the back, for the first time in my life. The teachers in my previous school knew me to be a model student, so would sit me up front "to set an example."

My parents couldn't figure out for the life of them why I was suddenly struggling and thought I was having adjustment issues. I had taught myself to read when I was 3; how could I suddenly be having trouble keeping up?

It took longer to figure out because I was only nearsighted in one eye. I was tall for my grade, so as long as the person in front of me to the left was shorter than me or the teacher was writing high enough on the board, I was fine, because my left eye was fine. But when everything aligned just wrong, I was suddenly helpless, because my right eye could barely see clearly an arm's length from my face! It's a hard thing to notice when only one of your eyes isn't working very well, especially when you're 9.

I'm a teacher now, and this made me wince. It's exactly how I've been told by my parents that seating worked for them in school (India, 60s-80s) but their grading was done by semi-anonymous roll numbers.

Today I'm 99% sure all CBSE board exams (I think equivalent to A-levels?) are randomized heavily. However I did notice the name's alphabetical order effect in school, albeit in a minor way (folks with later letters were less involved in anything a teacher might need a volunteer for).

Circular shift is the trivial solution. In my high school every row moved up on Mondays and the front row moved to back. Of course you could argue the ones who started at the front on week 1 still have an advantage but it's likely not that significant.
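The weekly row rotation is just a circular shift. A tiny sketch, with hypothetical row labels:

```python
from collections import deque

rows = deque(["row A", "row B", "row C", "row D"])  # index 0 is the front row

for week in range(1, 5):
    print(week, list(rows))
    # Every row moves up one place; the front row goes to the back.
    rows.rotate(-1)
```

After as many weeks as there are rows, everyone has spent one week in each position and the cycle repeats.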

As someone whose initials are Z and W, I tend to notice alpha sort a lot. Asking a friend whose initials are A and B about this, it's not something they ever noticed.

I haven't noticed a grading/ranking difference, but far more frequently I'll hear that "oh, we ran out of item/time/etc. before we got to you", which has made me much more sensitive to issues of planning/organization.

When I was a kid marbles were the big thing, and if you were playing with them in class the teacher would put it in a big glass jar. When it was full he would call out the kids and each would get a handful.

I was last in the alphabet; this was already an issue with books we had to read; you could choose which book to read, but it was always in alphabetical order. When it was my turn there were just a few left, and certainly all the popular high-demand ones were gone.

Anyway, when it finally was my turn to get my marbles he was all out. When I asked "where's my marbles?" he just shrugged and said "all out". I must've been about 7. Lots of crying ensued and I think I got some marbles from other kids, but it wasn't about the marbles – not really.

I still don't understand how anyone can expect any different result...

Probably every single health or wellness "Find a Provider" portal lists them A-Z. That's a multi-billion dollar industry. If I was Dr. Zachary Zane, I'd change my name.

My mom made the critical mistake of marrying from first five letters down to last five letters during the police academy, only later to be released from the "we have to expose you to tear gas so you know how it feels and only use it judiciously" chamber in alphabetical order by last name.

It was 40ish years ago and I still don't think she's forgiven my dad.

My daily standup is run by the order my boss sees the participants in the JIRA board -- My first name starts with W, so I'm last in that list. Makes staying engaged the whole meeting hard...

I have a near-last-letter surname. It's not uncommon for arbitrary things to be sorted by name, but a ton of official things use surname ordering. There are also things where I tend to end up last without knowing the sorting method, but I suspect it isn't uncommon for someone to just throw in a sort somewhere (though it's also common to see people do things in LIFO order, which disadvantages people who get shit done on time... My apartment renewal does that...). I also remember getting a PCR test during covid where they binned by last name.

I can just say I do remember being last in a lot of arbitrary and official things and seeing other friends just get done with it faster and have to waste less time sitting and waiting.

As the other poster said, the order of standup and other such things. Having a "Z" means that you're usually last, and sometimes people make a point of "hey let's do it in reverse today" where I end up being first.

I remember when working on joint tasks, by the time it got to me, most of the people that worked with me had already given their updates and details. So when it was my turn, I'd say "same as A, B, C", cause they'd given all the juicy details.

Other than that, it's pretty straightforward and boring. The world doesn't magically function differently for us.

It happens in your phone contacts when you’re deciding who to talk to. You’re starting with your Abrahams, Billys and Changs, probably rarely reaching out to your Xaviers, Yusufs and Zeldas about going out tonight because you’ve already assembled a crew by the time you reach the Mimis, Natashas and Ottos.

I wouldn't be surprised. It's very natural. Probably not for that specific use case but if for some reason you are actually going through the list then it's natural

Plus, it's common for me to meet someone on a first-name basis and not find out their last name right away. And people's last names change more often than their first names. Phone sorting by first name is the way to go.

just want to add that in my lifetime that switched from being "by last name" to "by first name". So, Yusuf Ahmed and Abraham Zigfeld experienced a noticeable shift in popularity that they were totally unprepared for

This reminds me of a funny event when I was in fourth or fifth grade. When the class was supposed to stand in line we always had to sort based on last name. My last name started with Ö (last letter in the alphabet in the Nordics) so I always ended up last. Then one time, the teacher said something like "Let's reverse the order today, but wait, we also sort on the first name". My first name starts with an A so I ended up last in line anyway, much to the joy of everyone :)

Similar initials, frequently last in line, and same.

I wonder if this was the kiln of my patience and acceptance, or if people who road rage and get frustrated with waits are more likely to have earlier lettered names?

> Asking a friend whose initials are A and B about this, it's not something they ever noticed.

Kinda surprised, my last name starts with C and I was hyper-aware of this and how random it was probably all the way from kindergarten. Being a child and therefore an asshole, I was grateful for my advantage rather than thinking the system was unjust.

This seems like a good example (free of cultural baggage) of how people with privilege often don't notice that they're receiving that privilege. What seemed normal and fair to your friend turned out to be an advantage that they didn't even consider.

At my university, almost all of our marking was pseudonymised. We were assigned a random candidate number at the beginning of each year, and that is what went on our important papers/exams. The less important coursework often didn’t bother with this, and used our student numbers instead, but the general idea was the same.

We didn’t put our names on any of our work other than our dissertation (and a few trivial assignments that didn’t impact overall marks). It wasn’t that hard to de-anonymise, but it meant that the system had a bit more integrity.

It’s a really straightforward system to implement and I don’t know why it isn’t done more frequently.

I also think that our VLE sorted assignments by time of submission rather than any identifier.

Wouldn't a possible outcome here though be that it just randomly reduces grades instead of reducing them in a way that's related to the students? If the issue is the sorting the random candidate numbers would still be sorted. It solves the problem of bias related to the individual but it doesn't solve the problem of bias related to the way that the submissions are sorted.

A random identifier coupled with a random sort order seem like the way to go here.
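A minimal sketch of that combination, assuming each student already has a pseudonymous candidate number (the numbers in the usage note are made up) and that the only thing left to fix is the order in which the grader sees them. The function name is hypothetical, not part of any real VLE:

```python
import random

def grading_order(candidate_numbers, rng=None):
    """Return a fresh random grading order for one assignment.

    candidate_numbers: pseudonymous IDs, assumed to be assigned once per year.
    rng: optional random.Random instance (pass a seeded one for reproducibility).
    """
    rng = rng or random.SystemRandom()
    order = list(candidate_numbers)  # copy so the caller's list is untouched
    rng.shuffle(order)               # order no longer depends on the ID values
    return order
```

Calling `grading_order([1042, 2231, 3777, 4810])` once per assignment gives a different sequence each time, so nobody is stuck at the end of the stack just because their candidate number happens to sort last.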

University exams, this probably makes a lot of sense. After all, the exam is the exam and whether a student is well-spoken and actively participates in class shouldn't matter for an exam grade. I'm less convinced that blinded conference proposals are a good idea--an argument I've had with various people. If you know based on past experience that someone will almost certainly hit a home run, I'm less inclined to pick a random person without obvious qualifications for the same topic--although just picking friends of the committee can obviously go too far.

You could try to work around that by first grading all anonymized proposals, then grading all potential speakers without knowing their proposal. In the third round you deanonymize and look at the weighted average of the two grades. You probably still need some judgment calls because the combination of speaker and topic can be important. But the score would give you a good base to work off.

Maybe you could make it even more impartial by allowing conditional scores in the first two rounds. Like "Jim is a 6, but a 8 if his talk is about molecular biology" or "this Lessons Learnt talk is a 5, but if it's by X, Y or Z it's a 9"

Yeah, but I'm not sure conference proposals by themselves actually have a lot of value given that, in many cases (ask me how I know), the presentations won't actually exist until a week or two before the event.

Certainly a talk by X that's totally unconnected from anything they're directly involved with has less value.

I had classes like that, where at the beginning of the quarter, each student got assigned a username of the form " " and all participation was based on username from then on. Even though the usernames were seemingly random, certain usernames started gaining reputations on the class discussion forums, and students came to recognize some names.

But computer science courses tend to have very objective rubrics for grading, so I am not sure the anonymity mattered much.

I think the point is that some automated systems like Canvas may hide the names, but they're still presented in alphabetical order. Pseudonyms don't help if you don't shuffle them.

>One simple fix would be to make random order the default setting.

Fixed in the sense that the bias will be random. Presumably students graded last will still receive lower grades.

It would be less than ideal, but still an improvement over the current situation as long as the order is re-randomized for every assignment, because at least then you'd only be occasionally disadvantaged rather than consistently disadvantaged.

There are however other factors involved in the grade, which have a higher impact on the grade. Like, understanding of the material and ability to present a solution. - E - I'm mostly saying that because a bunch of comments are jumping on this as a significant bias against some students.

From my experience as a tutor, yes, this bias exists. But it won't turn a horribly wrong or an excellently correct solution into anything else.

I eventually knew my strugglers and my excellers. I'd skim the excellers first, because if they messed up, something bad was going on. Then I'd go through the strugglers to see problems. And then I'd grade the rest first in whatever order I got the sheets, then the strugglers and then the excellers. I needed the baseline to see how bad the worst ones actually do. Some exercise sheets were an accidental adventure, I can tell you.

And writing it like that, it sounds totally callous and cold. But focusing on the lower third in the exercises and communicating their struggles to the TA and prof was very appreciated by everyone, especially those students. It makes sure to get the important fundamentals right.

It won’t average out perfectly. There will still be lucky and unlucky students.

Of course it’s better than a fixed order, and if it’s easy to switch then might as well. But we should keep thinking about how we can make it even better.

Since the effect looks very small, it looks to me like it's only a problem because it adds up if it happens for every assignment for every course. I don't think it needs to average out perfectly; it looks to me like you'd have to be astronomically lucky/unlucky for it to matter if each assignment is in random order.

Some courses are only graded based on a small number of tests. I actually went to UM and a grade might be something like 30% midterm 60% final 10% homework (obviously different professors have different systems). In that case if you get unlucky just twice on the two tests you basically get the full penalty.

I'm not sure how much a +/- 0.3 (out of 100) deviation from average on a single course matters even if you end up dead first/last for both midterm and final in that example. I mean, it will matter sometimes. But it's (by far) not as big a deal as if it happens for all your courses.

Still, yes, you could flip the order from midterm to final instead of randomizing both and the effect goes to more like +/- 0.1 out of 100 for the luckiest and unluckiest.

Yes, that sort of mirror-sampling would reduce variance. The problem is, though, you need to know all the uses of randomness in order to properly counterbalance them, and these systems are already enough of a pain to use.

(For example, if you have two, you can simply swap: but what about other biases? Like if the stack is split in half and assigned to 2 graders. Or what if there are three exams? And what about balance across other courses? If you want to do variance reduction and tricks like antithetic sampling, you need to know all this in order to structure it properly. Get it wrong, and you may make things worse.)

So that's why simple random shuffling would be preferred. It allows total ignorance of all other uses (past present and future), handles all ordering biases, and can be done independent in parallel across arbitrary sets of courses/exams/grades/students.
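To make the "adds up vs. averages out" argument concrete, here is a small simulation sketch. It assumes a purely hypothetical linear position effect running from +0.3 points for the first paper graded to -0.3 for the last (the real shape of the effect is not given in the article), and compares a fixed order against a fresh shuffle per assignment:

```python
import random

def position_effect(rank, n, max_effect=0.3):
    """Hypothetical linear effect: +max_effect for rank 0, -max_effect for rank n-1."""
    return max_effect * (1 - 2 * rank / (n - 1))

def mean_effect_per_student(n_students, n_assignments, shuffle, rng):
    """Average position effect each student accumulates over all assignments."""
    totals = [0.0] * n_students
    order = list(range(n_students))
    for _ in range(n_assignments):
        if shuffle:
            rng.shuffle(order)  # independent order for every assignment
        for rank, student in enumerate(order):
            totals[student] += position_effect(rank, n_students)
    return [t / n_assignments for t in totals]

rng = random.Random(0)
fixed = mean_effect_per_student(50, 40, shuffle=False, rng=rng)
shuffled = mean_effect_per_student(50, 40, shuffle=True, rng=rng)
print("fixed order spread:   ", max(fixed) - min(fixed))        # stays at about 0.6
print("shuffled order spread:", max(shuffled) - min(shuffled))  # much smaller
```

With a fixed order the luckiest and unluckiest students stay a full 0.6 apart no matter how many assignments there are. With independent shuffles the gap shrinks as assignments accumulate, without the grading system needing to know anything about other courses or exams.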

Yep, I noticed this with myself too when I first did some grading a few months ago.

There was also the factor that the ones I graded initially did not make certain mistakes or answered in expected ways, such that when I did encounter unexpected answers/mistakes, I had to go back and rethink the grading on the papers I had graded previously. Eg if someone answered in a way that made me think an answer I considered incorrect was actually less wrong.

I only had to deal with a small class, so backtracking was doable and I graded the papers in whatever shuffled up order they were turned in, otherwise there would have definitely been a bias.

I especially noticed this when grading programming projects, because it is slightly complicated.

I’d either find that:

A bug was really common, got to re-evaluate after the first couple times I see it, apparently it is an easy mistake to make.

Or, I’d find a new bug that was pretty common, but which I didn’t know about at first. Got to update my tests and re-run everybody.

I tended to be really thorough and re-do the whole stack eventually, but it was a real pain. Could have half-assed it of course, but they spend weeks on these things, feel like I owe them honest feedback.

It would tend to lead me to “softer” grading as well, if you are lazy and only check for a couple bugs, you might take a large number of points off for each problem. Finding some problems and punishing them harshly is not very fair for those students that randomly hit the bugs you expect. If you find every bug, you can only take a couple points off per bug without tanking everybody’s score.

> I only had to deal with a small class, so backtracking was doable and I graded the papers in whatever shuffled up order they were turned in, otherwise there would have definitely been a bias.

Grading papers in submission order just introduces a different bias, though.

(For what it's worth, I'm in the same boat and I do the same, because I don't trust my ability to give the papers any true random sorting by hand, so I take the very weak randomization that the submission order gives me.)

Introducing a slight bias factor that is randomized each time results in a lower average bias compared to a bias factor that is the same every time. Plus, as these weren't take-home assignments, I think someone finishing earlier is more likely to be either someone who was already going to score well, or someone who was already going to make the most common errors.

I take tests extremely quickly, I either know the answer or guess it from what I know. I don't think about it. I was usually one of the first people to turn in tests.

I was usually (almost always) the last person to turn in assignments, I like to be one of the last people out of a door or the last person in a line (I don't like crowds).

Grading by order-turned-in would almost always mean my assignment would be one of the first or last ones graded.

I would guess that if you did a frequency analysis of people against order, you'd find there was always a certain group who turned it in first, and another group that turned it in last.

> Introducing a slight bias factor that is randomized each time results in a lower average bias compared to a bias factor that is the same every time.

That's what I'm saying—it's reasonable to believe that the submission time is correlated with other factors, such as ability or confidence (though the effect can cut both ways, with extremely able students submitting early because they finish early or late because they are extra careful, and similarly for other factors). Thus, this isn't really randomization, just correlation with another factor than the name.

This is basically the reason my kids have the last name that they do.

My last name starts with E and my wife's with Y. Bucking tradition, she didn't change her name when we got married, so when we had kids we had to decide what name to give them. We opted to hyphenate.

Historically, hyphenated last names were [Woman's last name]-[Man's last name]. However, my wife hated that her last name was near the end of the alphabet growing up.

We bucked tradition once again and put my name first, so that when sorted alphabetically they would be at the front of the list. Incidentally their first names start with A and B so that they show up at the front when sorted by first name too.

Haha, I've always enjoyed being at the end getting less attention from teachers. If the data merely shows a correlation, it may as well be explained by us at the end being under less pressure.

> Bucking tradition, she didn't change her name when we got married,

Unless you were married earlier than the 90s, I wouldn't really call that "bucking tradition" any time from, say, the mid-90s onwards.

If you really want to buck tradition, then don't get married - just live together, and have kids :-)

(After all, there's nothing more traditional than marriage, is there?)

In the US, 80% of women still take their husband’s last name.

But you hit on an important point — a lot of couples are just skipping marriage now.

We went halfway there — we bought the house together years before we got married.

It was just a simple observation. You are reading into things too much. There's no need to be so defensive. Your comment hasn't upset me in the slightest ( rationally or irrationally ) and I sincerely hope you weren't offended by mine.

> It must be exhausting being married to a woman who wants to 'buck tradition'. Why didn't she buck tradition and just name your kids 'Aa, Aa', 'Aaa, Aaa', etc and be done with it? Heck why not go all the way and let them go nameless.

You managed to combine snarky reductio ad absurdum and a gratuitous attack on his wife in three sentences. Why wouldn't someone be annoyed by that?

My last name starts with a letter at the bottom of the alphabet. I notice this all the time. Anecdote from this year: My son is in a high school class that requires constant input from the teacher on long running projects they have. The teacher reviews the projects alphabetically by surname, about 40% of the time, the teacher never gets to the bottom of the class, and asks the students to find her after school if they have issues. But the nature of the projects definitely requires proactive comments from the teacher. I ask my son to go find the teacher regardless and get a pro-active review, but not all the kids do that, and hence the potential for a lower grade.

There's a section of one of the Diary of a Wimpy Kid books that talks about this exact thing. I was reminded of it as soon as I saw the headline. The justification it comes up with is that kids with names at the front of the alphabet sit in the front of the classroom, so they get called on and learn more. It definitely turned some gears in my brain when I first read it as a teen. Here's the relevant page: https://imgur.com/a/6wIx6qg

Randomizing the grading order just hides the problem at the level of an individual course, but at least it helps in the average.

More worrying is when e.g. job candidates are discussed (often in alphabetical order) and people simply tire out near the end of the meeting. When this happens, be sure to suggest taking a break!

That 0.6 pt gap over multiple semesters is the difference between graduating with “summa cum laude” or “magna cum laude”

It’s 0.6% so it would only be if you happened to drop a letter grade as a result. Like 90.5 -> 89.9. And that would have to happen multiple times to significantly affect your GPA.

Multiple factors at play here.

1) Rubrics are often defined, but the application of the rubric is by a human. Application will shift as the grader gets a sense of the class's understanding.

2) As you get fatigued while grading, you'll make mistakes, and be less tolerant of others. Especially if you're an overworked adjunct or graduate student.

3) There are probably a lot more last names early in the alphabet so weighting is important.

My policy on this when I was a grad student was to publish the rubric, and ask all students to check their grades too.

When I studied engineering in India, we never put our names on the finals at college. Everyone gets an exam ID and that goes on the answer sheets.

Also, it is never your professor who grades you - the answer sheets are collected and lecturers/professors will correct them at the state level across all the engineering colleges in my state.

I do not know how it is now as there has been an explosion of colleges in the state. But expect the standardized tests are similarly conducted.

A lot of bachelor's degrees these days are awarded on the basis of modules with no finals. For instance when I did a course on C# a few years ago in Norway that was worth 6 points (I got full marks :-) ). If I had done another 29 modules of similar difficulty I would have got 180 points and been awarded a BSc in Computer Science.

It's quite different from the way it was when I studied physics in the 1970s when only the final counted. Annual exams only determined whether one was allowed to continue but had no effect on the class of degree that was awarded.

The thing is, it’s unclear why that effect would make you give people lower grades. Surely an equally reasonable guess is that reduced cognitive ability could make you give higher grades because you don’t notice errors?

Sometimes you see the result is wrong so you do not give any points initially and then look on the steps and try to find something that looks correct to give at least some points. The willingness to track through every step diminishes with increasing fatigue.

It depends on what you are doing and how you are grading. I’d try to not take many points off if an error is somehow “really easy to make,” but that depends on my ability to evaluate the difficulty of mistakes.

This looks like one of the classic studies that won’t reproduce. For one thing, the effect size is unreasonably large. 50% more positive words just because of sequence order would be so huge we should be able to notice it anecdotally.

Is anyone confused by "lower-ranked names"? To me this means A, B, C, but the article says "Wang said students whose surnames start with A, B, C, D or E received a 0.3-point higher grade out of 100 possible points than compared with when they were graded randomly."

So I guess "alphabetically lower ranked" means the last letters of the alphabet, not first? Confused.

This is an important observation!

The programmer's perspective and the user's perspective aren't always the same, and both need consideration. A user is going to see a list: it starts at the top, and it ends at the bottom. The first fields are higher, the later fields are lower.

Of course, if this is a sorted list, the first field will be the "lowest" value, for whatever comparison is used to sort it.

Yeah, I misunderstood this at first and then was somewhat confused by the comments until I actually clicked through and looked at the post. :-)

I can actually believe the effect going in either direction and it's small.

https://nautil.us/impossibly-hungry-judges-236688/

> we should dismiss this finding, simply because it is impossible. When we interpret how impossibly large the effect size is, anyone with even a modest understanding of psychology should be able to conclude that it is impossible that this data pattern is caused by a psychological mechanism. As psychologists, we shouldn’t teach or cite this finding, nor use it in policy decisions as an example of psychological bias in decision making.

Odd article. It simply states that the effect size is too big to be believable (it calls it repeatedly "impossible," but it doesn't seem like it can possibly mean "literally impossible" or "mathematically impossible.") It doesn't give any alternative explanations or specific ways the study is wrong. And it links to a rebuttal by the original authors where they responded to a bunch of the suggestions for data error or confounding factors and found that their results remain.

That is explained in pretty much the section I quoted. The explanation of the effect is given in the article's links.

But the article is written specifically to make the point that it should be enough to observe that it isn't possible for the effect to be real. You aren't making a good point when you cite an effect that is obviously nonsense.

I have a surname that's alphabetically low. Even at uni, the number of times I went to class and came out empty-handed because my teacher didn't get to scoring my assignment in time (at my uni, 90% of the time we have an oral discussion about it), so that I had to come back the next week while others didn't, is way too high.

I also have the theory that having an app/software starting with A, B, or an "alphabetically first" letter was noticeable in the past. Nowadays things are usually sorted "algorithmically", but it was common for stores to list searches with some alphabetical score, which meant that those apps were usually shown first.

Even now, for example, if you go to Play Store and want to know the apps that you had but are not installed, the default sorting is by name.

As a different but similar situation: I have a first name that is usually at the top when sorted alphabetically. Nowadays it's not a problem anymore, but as a kid I usually received a lot of calls from people that either misclicked or didn't know how to use a phone properly. It turned out it was because I was the first on the phonebook list.

It seems it would take less time for Instructure, Inc. (makers of the mentioned software) to fix this than it took to do this research.

Anyone know whether this is happening, and if not why not?

With huge grade inflation in US universities, all students are already getting better grades than they really deserve. The amount of gymnastics that professors do to pass all students is insane. So, no student is really receiving a lower grade.

If we changed our policy of exams from discriminative to evaluative, grading bias would be a trivial issue but here we are since we just NEED ways to fit everyone into numbers that we can easily use.

I propose one of the following:

1. Keep the present system of grading by alphabetical order

2. Record the order in which the papers are actually graded

When the grading is done, the teacher assigns a point scale (A = 90, B = 80 or whatever) but the computer does a regression fit and removes the bias.
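A rough sketch of what that regression step could look like, under two loud assumptions that are not in the proposal itself: the drift is linear in grading position, and grading order is unrelated to how good the papers actually are (if strong students cluster at one end, this would wrongly remove real differences). The function name is made up for illustration:

```python
def remove_order_trend(scores_in_grading_order):
    """Fit score = a + b * position by ordinary least squares, then subtract
    b * (position - mean position) so the class average is left unchanged.

    Returns (slope, adjusted_scores).
    """
    n = len(scores_in_grading_order)
    mean_pos = (n - 1) / 2
    mean_score = sum(scores_in_grading_order) / n
    sxx = sum((p - mean_pos) ** 2 for p in range(n))
    sxy = sum((p - mean_pos) * (s - mean_score)
              for p, s in enumerate(scores_in_grading_order))
    slope = sxy / sxx if sxx else 0.0  # a single paper has no trend to fit
    adjusted = [s - slope * (p - mean_pos)
                for p, s in enumerate(scores_in_grading_order)]
    return slope, adjusted
```

For example, `remove_order_trend([90, 89.5, 89, 88.5, 88])` finds a slope of -0.5 points per paper and returns five adjusted scores of 89 each. On a single stack of real papers the fitted slope would be very noisy, so this probably only makes sense when pooled over many assignments.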

This is a great idea! Next time I mark a stack of exams I will also note the time of day that the mark was entered. I can then cross-reference this with how long I have been sitting between breaks, since my last meal, etc, etc. Unfortunately I will not have this opportunity until mid-fall 2024.

A .3 point difference isn’t going to make a real difference to anyone’s life and is likely a wash when other yet undiscovered biases are in the mix. Unfairness and bias is a critical factor in driving people to extraordinary achievements.

> Unfairness and bias is a critical factor in driving people to extraordinary achievements.

The evidence is a strong negative correlation between bias and achievement: Extraordinary achievements are so disproportionately achieved by people in groups that are not the target of bias. Look at top government officials, SV leaders, Nobel Prize winners, etc etc - mostly white males.

The biggest targets of bias in the US, for example - probably women and black people - generally get the worst results (in areas where there is discrimination). By contrast, as an example wherever black people aren't subject to bias, such as certain forms of music and certain sports, achievement is extraordinary. Imagine all that talent and drive in other fields.

Most exam grading is not viewing the writing as a whole but rather looking for incidences of specific points to assign credit for. One could imagine an LLM be quite effective at labeling sentences as pertaining to a predefined idea at scale.

I noticed this in myself last time I was a TA. I'd go back and re-grade the first 15 assignments or so to make sure the rules were being applied consistently.

Order effects are real. I'm a prof. I notice that the longer I grade, the less motivated I am to take off points and then justify why I took off those points. It's easier just to give points and move on. (And if anybody wants to criticize this, I'll be happy to launch into a diatribe on the psychometric dumpster fire that most assignments and their associated grading scales really are.)

Also prof: me too. I'm much more likely to provide comments on the first couple of exams I grade than on the later ones.

I've found that gradescope is helpful in this regard, because it at least forces every point assignment to be matched to a rubric item. I don't have data, but I believe it makes our grading a lot more uniform compared to the pre-gradescope days. (This might be easier in grading computer science exams than in more subjective areas, though.)

The article mentions that the paper is under review, but I'm guessing the effect size is small and that individual differences between graders is very substantial. The article states:

> The researchers collected available historical data of all programs, students and assignments on Canvas from the fall 2014 semester to the summer 2022 semester.

Thousands of students X 8 years X lots of assignments per year and you get a sample size so big that it would be hard not to find statistically significant effects.
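A back-of-the-envelope sketch of that point. It uses the 0.6-point gap quoted from the article, but the standard deviation of 15 points and the group sizes are pure assumptions for illustration, and it treats grades as independent, which real data would not be:

```python
import math

def z_for_gap(gap, sd, n_per_group):
    """z-statistic for a difference between two group means,
    assuming equal group sizes and a common standard deviation."""
    return gap / (sd * math.sqrt(2 / n_per_group))

# Same 0.6-point gap, assumed SD of 15 points, growing sample sizes.
for n in (100, 10_000, 1_000_000):
    print(n, round(z_for_gap(0.6, 15.0, n), 2))
```

The gap itself never changes, but the z-statistic grows with the square root of the sample size, so a difference that is invisible in one class becomes overwhelmingly "significant" across millions of graded assignments. Significance says the effect is probably real, not that it is large.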

We know there are big disparities of academic success by ethnic group (cf the whole harvard discrimination against asians controversy), and there are also big concentrations of patronyms by ethnic groups (or at the minimum first letters that are more common in one part of the world than another). And on top of that if the university itself discriminates against certain ethnic groups in its recruitment it will reinforce this bias (like if asians students require better grades to get in, it is unsurprising those students that get in perform better than the rest).

That would be my best guess for a rationale behind that result.

What other popular systems might lead to different outcomes based on sort order? Dating site matches? Your own contact list?

Interesting category of problems...

> Wang said students whose surnames start with A, B, C, D or E received a 0.3-point higher grade out of 100 possible points than compared with when they were graded randomly. Likewise, students with later-in-the-alphabet surnames received a 0.3-point lower grade — creating a 0.6-point gap.

The hand-wringing over such a small effect size seems unwarranted. I suspect you would find similar effect sizes for other small interventions, like whether the grading took place during the week or the weekend, or in the morning vs. the evening.

I'm Indian (in the US) and I've noticed a vast majority of my Indian friends name their kids Aanav, Aanir or Aanvi etc. some of which aren't even words in any Indian language. Now I probably know why.

This reminds me of an experience I had of just the opposite: tightly-controlled consistency in writing assessments:

Almost 20 years ago I worked for a standardized test essay grading service. We graded against all sorts of secondary-level rubrics (not AP, who do their own). These would usually be from 9 - 12 grade, from every US state, and evaluating everything from reading comprehension to subject matter-specific assessment. We'd do weeks long jobs of a single test (e.g. Alabama 9th grade reading proficiency). These usually had at least 3 dimensions, and at least 4 points per dimension. We would go through a week or more of training on a rubric, then another week of 'leveling', where a manager would occasionally bring you aside and talk through why that '3' you gave on a dimension should have been a '2'.

By the end of the training, we usually had had enough discussions and encountered enough edge cases to understand the weaknesses/inconsistencies in the rubric (which we had to abide by anyway). Once we were running at full-speed, everything was still double-graded and inconsistent scores were reviewed. Sometimes graders were pulled if they still didn't get the rubric.

It was a simultaneously stimulating and very boring job, and most readers were educators themselves. I wonder how long before it disappears completely.

In my experience, it varies. I've been on interview panels where we just weren't feeling it for a number of candidates and basically told the recruiter to try harder, and eventually hit someone where we said "That's who we want. Find a way to make it happen."

It's the same with applying to jobs. The first applicants have a greater likelihood to get the job. If you're given a list of names... you're just generally more likely to pick something from the top of the list than the bottom.

I had a theory in school that this was the case for presentations too so I always forced myself to go first. No one else to compare me against, and no sitting around getting jittery.

The other benefit for being higher in the alpha order is you get the snow day calls first - 4:30 am, and get to call your friends before school calls them.

We were always woken up by my daughter screaming as her friends called her. No such luck for the post-pandemic kids.

I can explain why the kids with A names outperform the kids with Z names.

As someone whose first and last names are both very early in the alphabet, I was always called on first or second when I was in elementary school and middle school. I always had to be there early.

My friend whose name was very late in the alphabet learned he did not have to be ready for the first minute or two of class.

He would be standing near the door talking as I was quickly pulling out last night's homework, and I would be marked down for not being ready while he would later be commended for being ready when the teacher called his name.

As a teacher, I see that the kids who stand outside the door talking do not do as well as the kids who are there early.

A computer-based system like this is an opportunity to remove all personal details from an assignment while grading it. It baffles me that this isn't the default.

The database could tag every assignment with a UUID4, and present them for grading top-to-bottom in UUID lexical order, without exposing who is being graded in any way.
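Something like the following sketch, where the `submissions` dict shape and the function name are assumptions for illustration and not how Canvas or any real system stores things. Because a new UUID4 is drawn for every assignment, the lexical order is effectively a fresh random order each time:

```python
import uuid

def build_blind_queue(submissions):
    """Tag each submission with a random UUID4 and queue them in UUID order.

    submissions: dict mapping student name -> submission text (assumed shape).
    Returns (queue, lookup): queue is a list of (tag, work) pairs for the
    grader, lookup maps tag -> student and stays hidden until grading is done.
    """
    tagged = {str(uuid.uuid4()): (student, work)
              for student, work in submissions.items()}
    queue = [(tag, tagged[tag][1]) for tag in sorted(tagged)]    # no names here
    lookup = {tag: student for tag, (student, _) in tagged.items()}
    return queue, lookup
```

The grader only ever sees `queue`. Once marks are entered against the tags, `lookup` reconnects them with students, and keeping both tables around gives the audit trail for free.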

You can't fix fatigue bias, but this would distribute it randomly. It also removes the opportunity for favoritism and hostility, subconscious or otherwise, which is probably more important.

Once grading is completed, the assignments are reconnected with students. Give the profs a way to mark assignments with metadata, sometimes they need to talk to a student personally about something, this should be made easy.

Grades can't be immutable, professors need discretion in that, but it would leave an audit trail if professors maliciously modified grades (or the opposite). That should be uncommon to begin with, but both professors and students benefit from an audit trail here.

A system like this should be used whenever it's practical, and always for high-stakes tests like midterms and finals. Not making a case against oral exams here, just that when it's possible to blind the grading process, it should be.

Clearly evidence of anti-Polish bias when all the Zbigniews and Zygmunts and Wojteks get lower grades. (Or just another example of correlation vs. causation in action)

> Wang noted that for a small group of graders (about 5%) that grade from Z to A, the grade gap flips as expected

This is critical. Otherwise we could not discount some group (e.g. some ethnicity) disproportionately occupying one end of the alphabet or another.

Super interesting and important finding. I hope this gets wide visibility and universities take a break from politicking to fix the problem - presumably through enforced randomizing.

Enforced randomization isn't going to fix the problem, it just evenly distributes the problem.

Based on these results, it would mean that the graders are just getting tired/lazy/inattentive the further they get in their stack of papers to grade. That's the problem that needs to be fixed, not the order they get graded in. Enforced randomization is simply a short-term alleviation so no student(s) get disproportionately affected by this phenomenon.

> it would mean that the graders are just getting tired/lazy/inattentive the further they get in their stack of papers

Or maybe they are getting better / more picky.

I know in code reviews I often pass a few and then notice something that, I realize, was also wrong in reviews I had already approved; later reviews that day (week?) won't let it through.

I've participated in day-long and multi-day interview events for job candidates, and I see the same effect. At the beginning you don't have a frame of reference and you're more likely to question your own decision or give someone the benefit of the doubt, but by the end you're far more systematic, plus a little bit numb to the effect your decision is having.

For grading, you could probably just add a mediating factor and throw in test cases that calibrate the factor and then you curve everyone on that factor.

It'd seemingly be more work but would result in averages that are more robust to changes in stress.
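
A sketch of what that calibration might look like in Python. It assumes, which the comment above does not say, that the grader's drift is roughly linear in stack position, and that a few calibration papers with known reference scores are slipped into the stack. The function names are hypothetical.

```python
def fit_drift(calibration):
    """Least-squares line for (given - reference) as a function of position.

    calibration -- list of (position, given_score, reference_score)
    Returns (slope, intercept).
    """
    xs = [pos for pos, _, _ in calibration]
    ys = [given - ref for _, given, ref in calibration]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        # All calibration papers at the same position: only a constant offset is known.
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def correct(scores, slope, intercept):
    """Subtract the estimated drift from each (position, given_score) pair."""
    return [(pos, score - (slope * pos + intercept)) for pos, score in scores]
```

With calibration papers at positions 0, 10 and 20 that were scored 0, 2 and 4 points below their reference, the fitted drift is about -0.2 points per position, so a paper at position 20 gets roughly 4 points added back.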

Yes, and:

Additionally, universities (and, by extension departments) want grades to approximately follow a normal distribution (and yes, you in the back, their actions show they do actually want that, even if they say otherwise).

When you start grading a problem you have some idea what a "good" solution looks like, what an "ok" solution looks like, and same for "bad" solutions... If you award points based on that, the result will be a normal-ish distribution. But your idea of a good/ok/bad solution evolves as you see more papers.

There are two reasons for that:

First, you can't (ahead of time) imagine all the ways that students will invent to fuck up a problem set, and find edge cases in your grading rubric that result in unfairly-high or -low scores. As you gain experience teaching, you will anticipate more of the ways, but you will never anticipate every way.

Second, the TA/grader wants to be able to stack-rank the papers and have the scores be monotonic. The grader wants this because non-monotonic scoring triggers far more complaining than harsh scoring or picky scoring. When you come across papers that are worse than ones you've already recently graded, you assign even lower scores.

This results in a ratcheting effect with more extreme scores as you get closer to the bottom of the pile. But, since the mean score is usually a B/B-/C+ (~75-85), and since scores are usually limited to the range 0-100, this means that papers closer to the bottom will receive statistically lower scores.

Now, you could go back and re-grade the ones you've already done, but:

1. The university is officially only paying you for 20hrs/week (and requires a signed end-of-semester statement attesting to the same).

2. The assigned workload of teaching and grading doesn't permit a two-pass grading scheme while keeping within 20 hours.

3. If you complain to the graduate ombudsman about the workload needing more than 20 hours, you won't have funding next semester (so you have a prisoner's dilemma among TAs who might want to grade more fairly).

4. If you're grading (say) a final exam for a frosh/soph class, you're probably in a room with 4-8 other graders late into the night. One effective way to make your coworkers hate you is to be that guy who always finishes grading his stack last, when everyone is worried about catching the last train/bus.

Basically, all the incentives are aligned to make this happen.

That's thought-provoking - thank you.

Essentially, unless it's an old exam where the universe of bad answers is already known, you need two passes - a discovery pass followed by the grading pass.

In my case, I have to make a conscious effort to remain consistently (in)tolerant of lazy writing. It’s hard to keep on reading between the lines and giving the benefit of the doubt.

In my experience, it's not tired/lazy/inattentive, but resignation. You normally have some expectation of what students will be able to solve. Typically, these expectations are set too high. That's very common, not only for me, but for pretty much anyone I know. So over the course of grading, one adjusts the expectations down and gives partial credit earlier, for example.

I was a grader once. I guarantee if someone gives a good answer they'll get full marks even near the bottom of the stack. For BS answers I'll admit I got less generous as the hours went on.

No one's getting hurt by this system if it's randomized. It's a matter of graders giving out partial credit for wrong answers which is discretionary. Rarely students are granted a small mercy. Seems OK.

I was one of many TAs for a large math class in college (pre-calc - think high school math for college students). For uniformity, the prof had the partial credit down to a science - specifying points for getting certain aspects of the problem. For the finals, a few TAs would be assigned to a given page, for uniformity.

The fascinating thing was that the distribution of grades was about the same every year.

And I had a math prof for analysis who would give negative points for BS answers. You could say “I need X but don’t know how to prove it” in the middle of a proof, but if you made up something that was incorrect, you’d get negative points.

>For BS answers I'll admit I got less generous as the hours went on.

What do you think is the cause of this? Do you become more cynical (and less generous) because you’ve seen so many BS answers previously? Is it just that getting fatigued makes you less generous?

When I was a TA in grad school, I noticed the same. Early on I thought some BS answers were at least kind of funny, and I gave them the benefit of the doubt, maybe giving more attention to the parts that were correct. After I saw similar answers later on, the novelty wore off and I was probably less amused, so the inclination to be lenient disappeared. Sometimes I went back to previous decisions if I remembered them, to be fair, but I don't think I always remembered since the volume could be high (grading 80 exams in a row is TEDIOUS).

> Enforced randomization isn't going to fix the problem, it just evenly distributes the problem.

Evenly distributing the problem does fix the problem. Proportionality is what matters. Grading being arbitrary is fine if everyone is graded equally.

Random order would still mean a few students in the class get unlucky and land near the end the majority of the time, although over the course of all classes it would tend to even out somewhat.

It’s certainly better than fixed order.

"Randomization" is not the important part here. "Evenly distributing" is. It is absolutely possible to reorder the sequence fairly such that your scenario doesn't occur. It could even look randomized to a human observer if you want. In a trivial example case where the effect were linear, you could just switch the order back and forth, and on average every student would receive the same middle-of-group impact.

For me, I grade tests as follows. The stack is created as students turn in the test. I grade the first page in that order. The stack reverses for the second page. So on and so forth. I teach college math. I just can't imagine a system of grading done in alphabetical order.
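
To spell out why going back and forth cancels a linear effect: assume (purely for illustration) that a paper at position p in a stack of n picks up an effect e(p) = a + b p. In the reversed pass the same paper sits at position n + 1 - p, so the average over one forward and one reversed pass is

```latex
% Assumption: linear position effect e(p) = a + b p, positions p = 1, ..., n.
\[
\frac{e(p) + e(n+1-p)}{2}
  = \frac{(a + bp) + \bigl(a + b(n+1-p)\bigr)}{2}
  = a + \frac{b\,(n+1)}{2}
\]
```

which no longer depends on p. Every paper gets the middle-of-stack effect. If the real effect is not linear, this only cancels approximately.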

I also came here to say this. My only guess is that the alphabetization (by the "learning management system") is there to make filling the grades into a table "easier" for the computer or for the person handing out the results. But why is that "easier"? The system doesn't have to order them at all, or it could order by student number (same issue as alphabetical order) or by something random, which is the other (non-default) option in the "learning management system".

I feel like only the most obsessive compulsive humans would have this issue (without computer "help"), as the last thing I wanted to do as a TA was to add another step of ordering all the papers before grading them. I also always reviewed the first few papers I graded after grading the rest to make sure I was being fair, because it was obvious to me that until I saw a representative distribution of answers I couldn't do fair grading/marking.

It's a 0.6 gap from top to bottom out of a score of 100. Plus or minus a third of a percent from average. Pretty small effect. But it would add up (or, well, persist - it wouldn't get bigger) if it happens to you for every assignment for every class and that sucks.

If there's more than one assignment you can basically erase it by randomizing each separately.

If you really care beyond that then randomize for one assignment, flip it for the next, then randomize again for the next etc.
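
The randomize / flip / randomize schedule from the last two comments, as a short Python sketch. `grading_orders` is a made-up helper, and the `seed` parameter is only there so the schedule can be reproduced.

```python
import random


def grading_orders(students, num_assignments, seed=None):
    """Return one grading order per assignment.

    Even-numbered assignments (0, 2, ...) get a fresh shuffle;
    odd-numbered ones reuse the previous order, reversed.
    """
    rng = random.Random(seed)
    orders = []
    for i in range(num_assignments):
        if i % 2 == 0:
            order = list(students)
            rng.shuffle(order)       # fresh random order
        else:
            order = orders[-1][::-1]  # flip the previous assignment's order
        orders.append(order)
    return orders
```

Within each shuffle/flip pair, a student who was graded k-th from the top is graded k-th from the bottom the next time, so nobody sits at the end of the stack twice in a row.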

> graders are just getting tired/lazy/inattentive the further they get in their stack of papers to grade.

I will admit to this. Initially, my patience and tolerance for errors are significantly higher than towards the end of the grading. By the second hour of grading, I am not only mentally exhausted, my tolerance is also significantly lower.

I try to prevent this by creating a very explicit grading rubric, and I stick to it as much as possible.

Clear rubrics are the thing where possible. They aren't everywhere though. I've been on conference committees and so many different factors come into play--including how late in the day it is. But, in that case, a bunch of people are rating and commenting and there's no strict order so it probably evens out to a reasonable degree.

Even distribution would fix the problem. If grading has a subjective component, there will always be deviations from the "correct" grade. If those patterns are randomly distributed over all students, their grade averages will be comparable again.

My first thought was, "Who takes the time to sort before grading?" Computers change the world in such incredibly subtle ways. Of course, such subtleties exist without computers. This is just one case where computers make the subtleties more detectable.
