User Comments, Suggestions, or Complaints | Privacy Policy | Terms of Service | Advertising
Page rendered in 0.9519 seconds
41 querie(s) executed
You are here > Home > Hall of Merit > Discussion
| ||||||||
Hall of Merit — A Look at Baseball's All-Time Best Thursday, December 18, 2008Hall of Merit Plaque Room WebsiteHey everyone, I wanted to share some news and also ask for some help. Shock and TangoTiger (Awe was unavailable) have offered to help with building a website for the Hall of Merit’s plaque room. Shock is going to take the lead on designing the site and TangoTiger is going to take the lead on the back end database. I’ve reserved the domain name www.hallofmerit.com for this. We’ll need some help populating the database if you have some spare time and want to help out, that would be great. What I’m envisioning is something where you could search on a voter’s history, a player’s ballot history, a particular year’s election results, etc.. I’d like to have a page for every player who has ever received a vote, and and every voter who has ever voted. We could also have things like DanR’s WARP; Chris Cobb’s Negro League translations, my pitcher WARP, projected military service credit, links back to their discussion pages, etc.. I’ll leave it Shock and TangoTiger to explain what will specifically be needed, etc.. I’m really excited about this, and cannot thank them enough for offering to work on this. JoeD has the Imperial March Stuck in His Head
Posted: December 18, 2008 at 03:47 PM | 224 comment(s)
Login to Bookmark
Related News: |
BookmarksYou must be logged in to view your Bookmarks. Hot TopicsMock Hall of Fame Ballot 2024
(13 - 11:23pm, Dec 08) Last: Space Force fan 2024 Hall of Merit Ballot Discussion (191 - 7:43pm, Dec 07) Last: Howie Menckel 2024 Hall of Merit Ballot Ballot (4 - 3:10pm, Dec 07) Last: Jaack Hall of Merit Book Club (17 - 10:20am, Dec 07) Last: cookiedabookie Mock Hall of Fame 2024 Contemporary Baseball Ballot - Managers, Executives and Umpires (28 - 10:54pm, Dec 03) Last: cardsfanboy Most Meritorious Player: 2023 Results (2 - 5:01pm, Nov 29) Last: DL from MN Most Meritorious Player: 2023 Ballot (12 - 5:45pm, Nov 28) Last: kcgard2 Most Meritorious Player: 2023 Discussion (14 - 5:22pm, Nov 16) Last: Bleed the Freak Reranking First Basemen: Results (55 - 11:31pm, Nov 07) Last: Chris Cobb Mock Hall of Fame Discussion Thread: Contemporary Baseball - Managers, Executives and Umpires 2023 (15 - 8:23pm, Oct 30) Last: Srul Itza Reranking Pitchers 1893-1923: Results (7 - 9:28am, Oct 17) Last: Chris Cobb Ranking the Hall of Merit Pitchers (1893-1923) - Discussion (68 - 1:25pm, Oct 14) Last: DL from MN Reranking Pitchers 1893-1923: Ballot (13 - 2:22pm, Oct 12) Last: DL from MN Reranking Pitchers 1893-1923: Discussion (39 - 10:42am, Oct 12) Last: Guapo Reranking Shortstops: Results (7 - 8:15am, Sep 30) Last: kcgard2 |
|||||||
About Baseball Think Factory | Write for Us | Copyright © 1996-2021 Baseball Think Factory
User Comments, Suggestions, or Complaints | Privacy Policy | Terms of Service | Advertising
|
| Page rendered in 0.9519 seconds |
Reader Comments and Retorts
Go to end of page
Statements posted here are those of our readers and do not represent the BaseballThinkFactory. Names are provided by the poster and are not verified. We ask that posters follow our submission policy. Please report any inappropriate comments.
Voters
- voter_id
- voter_name
- voter_??
- voter_??
What other information do you want to capture for each voter? Note: if you can get the information from the other tables (like when he first voted, or how often he voted, etc, then we do NOT want to capture it).
Years
- year_id
- electee_count
Players
- player_id
- player_name
- player_bio (large text field)
No other player information is needed, as we can then link to the BDB/Lahman database to get playing and other bio data. Indeed, you can provide links to Baseball-Reference.com or Retrosheet.org or BaseballProspectus.com player pages.
You can, as an add-on, provide career stats based on the BDB database. For example:
Player_Voter_Evaluations
- player_id
- voter_id
- score_value
And
Player_Voter_Year_Evaluations
- player_id
- voter_id
- season_id
- score_value
So, Joe can provide say a career "wins" total for any (or all) players, and we'll find his data in the Player_Voter_Evaluations table. Dan can do similarly on an annual basis and provide it here: Player_Voter_Year_Evaluations . Other voters can do likewise. And if that above structure doesn't work for them, then we'll build an add-on table to support them.
Ballots
- voter_id
- year_id
- player_id
- ballot_rank
- voter_comment
- selection_rejected_switch
This last table is basically the one that links all the other tables. You can, at its best, be able to link this table to all the other tables, and if someone asks:
"Give me the Joe Dimino evaluation score for each player on the 1998 ballot, along with all the bio commentary for those players, and how many times each player has appeared on that ballot, and what was his highest ranking until that point, and when was he ultimately elected, if he was. And only give me for those players that John voted on."
Then with a bit of SQL, you can get your answer.
If there are other pieces of data that voters would like to capture, then what needs to be done is it ITEMIZE EVERY single piece of data that you want to record, and then we can construct a design to hold that data.
So, what needs to happen is for each voter out there to list the data you'd like to record. You should not, at this moment, consider "lists" or "totals" or "reports" or the like. Think of it like instructions for building that table: itemize all the tools and materials you need. LATER, we'll worry about how we're going to put these things together.
Tom
Those discussions can happen afterwards, when discussing reports and anything else that can be handled with instructions or algorithms.
Right now, we are only worrying about "atomic" things (things that are so basic that you can't divide it and can't be derivable from other things).
***
Is there anything else about the voters or ballots that need to be captured? Do you want to know the voter's birth year, or favorite team? Do you want to know if his expertise is history, sabermetrics, or knitting?
For the Years table, I think you'll want to add:
election_date
Try to go through some of these election ballot threads, and see what kind of data voters are providing about players or ballots, or what kind of information you are providing.
I just thought of another. For the Players table:
bio_url
plaque_url
For the Years table
discussion_url
results_url
Again, just think about anything that you can capture, so that all reference information is here.
I'll be able to get you specific votes by specific voters but only back to the early 20's. Does someone else have that data for the early years of the project?
I just spent 20 minutes attempting to drive to work and getting about 10 feet, so I guess I'll stay home today and work on this. ;-)
I can also do any database stuff required, and what have you.
I was thinking we have a players table and a ballots table, and a bridge table so we can store the balloting history of every player. eg.
Players
* player_id
- name
Ballots
* year
- 1stPlayer
- 1stPoints
- 2ndPlayer
- 2ndPoints
etc.
Player_Ballots
* player_id
* year
- 1stPlaceVotes
- 2ndPlaceVotes
...etc...
I am just thinking out loud here.
The one thing I cannot really do, by the way, is graphics. So if anyone out there wants to make logos, banners and what have you, go nuts.
This is going to be so cool. :-)
I know John Murphy went through there and rebuilt many previously destroyed posts, but I don't know if he wasn't able to find them all, or didn't have the time to find them all. If it's the later, I assume we should be OK.
Another thing that could be calculated from the individual voter ballots is an 'as voted' personal Hall of Merit, especially for voters that have participated in every election.
- Your Players table is a repeat of mine.
- Your Ballots post is a cumulative table of all voters results. If we do not have voter history for a particular ballot year, for whatever reason, then we need a table to handle that, similar to what you are stating. Is it the case that we have results, but not individual voter tallies for some years?
- Your Player_Ballots has a similar issue to the Ballots table.
This was my impression. But I suppose if we have individual results for every year then there's no need for a ballots table as the voters table would have all the necessary data.
We're only interested in the periodic table. Molecules, figuring out if you want H2O or CO2, those are derivative of the periodic table, and we're not interested for periodic table purposes. That's for later, for all the Dr. Frankenstein's out there.
discussion_url should be in the players table as well.
The discussions about individual players is my absolute favorite part of the HOM. I can't tell you guys how much I've learned about players from before my time from here!
Just as an example, when you look at the Fans Scouting Report, all that is is about 10 columns of data from each player, with some calculation behind the scenes.
So, if you have basic data like that, then it's a snap to generate web pages for them.
Dan: if you have an example of the kind of data (or formula) that you would supply, that would be helpful.
Building reports, etc. is the easy part. The only reason I haven't made the site yet is because I haven't had the data. :-)
I made a quick little er diagram just to make sure we're on teh same page:
http://img120.imageshack.us/my.php?image=homdbsv0.jpg[/img]]Diagram
...yes?
Diagram
***
Dan supplied me with his files. So, I would create a table called Rosenhack or Dan or whatever, with these columns:
YEAR
Team
Lg
Player
Age
Pos
SFrac
BWAA1
BRWAA1
FWAA1
Rep1
WARP1
WARP1/Yr
LgAdj
BWAA2
BRWAA2
FWAA2
Rep2
WARP2
WARP2/Yr
PennAdd
Salary
(Though I wouldn't use the /)
Since he has multiple versions of his data, I would include a column called "Version", as he has three different versions (so far), all showing the same data.
You'd like a DanVersions table, so that we can show what each version is:
- version_Id
- version_comment
We'd need an additional table called DanBDBTeams, so that we can link Dan's teamid to the BDB team ID, since Dan uses a different team id.
- dan_team_id
- bdb_team_id
He also has a fielding table, so we'd have DanFielding with these columns:
Year
Team
FWAA1
NRA-DERA Adjustment
All this is pretty straightforward.
I will also send you my pitcher spreadsheet.
One thing to note, I will have limited internet access from the afternoon of 12/24 through the morning of January 5. I'm going to be traveling, visiting family and friends back east, etc..
One good thing about Access is that it lets you "link" databases, rather than forcing you to import them. We could easily have an HOM database, and a RosenhEck database, and then just link the two. The user won't know the difference, and it will make life VERY EASY whenever Dan has updates. You simply delete the old database, and upload the new, all the while never touching the HOM database, as it'll simply link in to the new one automatically.
This is preferable, as you can also do the same with the BDB database, simply linking in to the MASTER, BATTING, PITCHING, FIELDING and whatever other tables are out there. And when the BDB gets updated, you don't do a thing to HOM.
Umm...we all know you're a miracle worker, Tango, but I think you're about six years late to the party here. :) Shame you never voted, by the way.
Joe mentioned in the thread header that my Negro-League translations might be part of the database.
If that is to happen, in what form would you want or need the data for Negro League MLEs?
Compared to folks who build databases, my data management and manipulation skills are VERY primitive, so I am concerned that the data I have are far from being in form suited to being uploaded into a database. With time, I can get the data into other forms (my excel spreadsheet skills are passable, and I can learn new skills as need for them arises), but I would need guidance on what format will be needed for the database.
Would I be correct in assuming that as long as the database is structured to receive the data, and the data is structured to be entered into the database, the actual uploading could happen either a bit at a time or quite a while after the initial launch of the database?
VoterAliases
- voter_id
- screen_name
- effective_date (or current_indicator)
It seems that some people have used more than one screen name. So, while you can maintain one voter_id that will never change (and it might as well be your Primer ID), we can show all the screen names of that voter. While effective_date would be the preferred field, I'm realistic that the best we can do here is to note what the voters current screen name is.
***
Chris, the only requirement is that for players that are not in the BDB database, then we need to make sure they exist somewhere to link them. So, it would seem that would need to not only link to the BDB MASTER table, but also include other master tables, so that we have a "super duper" master table that the Players table will link to.
So, you should create your own master table with:
- player_id
- player_name
and everything else that you want
When it comes time for HOM to create its own master table, it will LINK to BDB and to yours, and then do something like:
select player_id, player_name
from bdb.master
union
select player_id, player_name
from chriscobb.master
And we'd have to add validation to remove any duplicates that the UNION did not catch.
So VoterID in the Voters table and Aliases table is the number of their primer reg? So Chris Cobb = 354?
Makes sense to me, although I am not sure it is necessary. Can't we just add a ScreenName column to the Voters table?
Voters
------
ID: 354
Name: Chris Cobb
Awesome?: Yes
VoterAliases
------------
ID: 354
Alias: Chris Cobb
------------
ID: 354
Alias: Jim Sandwich
------------
ID: 354
Alias: Cocky Baloney
------------
ID: 6542
Name: Sexy Beast
And what have you. But if we are only storing the most recent ScreenName, I don't see the reason for two tables.
Here is what I have now:
http://ryanjlind.07x.net/index.html
Thanks Ryan! That attempt didn't work out, but it got me to reserve the URL, which saved time now, etc. . . . there's definitely some butterfly effect there :-)
Plaques
-------
Player_id
year_elected
position_elected
team_elected
? Something like that. I think in the current plaque room there is already designations for these.
I'm going to be a bit hesitant here, because I haven't programmed for money since the 1990s, but at least it was partially in Access and its predecessor, Microsoft Basic Professional Development System. Does your DBMS have ways to avoid what we used to call "column creep?" That is, if we want, say, extrapolated seasons for people like the Wright brothers (Harry and George) who played professionally before 1871, and also estimated negro league MLEs, does that inevitably involve two columns, both of which serve only small numbers of players, when there are no players who have both types of entry, so no player needs more than one of these two columns? If so, does the DBMS have ways of avoiding the large number of columns slowing the database to a crawl? This was a big issue when I last worked in Access, or DB2, or whatever. My understanding is that Oracle was able to get around this by going relational, but I've never worked in Oracle, and understand that it can be a b**ch.
If column creep is a problem, then we do have a practical limit on how many different types of data we can include, if we don't want slow response and crashed servers. If there is a way around it, we can have really specialized columns, like a war credit column just for 1918. If this is a ridiculously dated question, and all modern DBMS have solved column creep, than please forgive me for having not worked in the field for a decade. I just want to know the answer before I start making too many column suggestions.
Thanks! This is a huge project you've offered to take on, and I'll gladly help any way I can. Hopefully, this was help, and not just dated nonsense.
- Brock Hanke
For team_elected, for cases like Rickey or Winfield or (eventually) Raines, since that designation will only happen upon election, then putting it here is fine. Preferably, we should put it on an Players or similar table, since the HoM is not going to ask these guys what teams they want to go in as, right? It'll be predetermined.
For year_elected, you should be able to derive it from the YEARS and BALLOTS tables, and therefore, is not atomic.
Therefore, we might have a need for a PLAQUES-type table, but for now, it seems that this information can be handled elsewhere.
For what we are doing here, 10% the columns, and, I dunno, 1% the records, I'd be terribly disappointed if even the most complicated query comes back in more than 20 seconds.
Even in the event that it is slow, we can always generate a data warehouse of denormalized data, similar to how you are suggesting it, while continuing to maintain the system of record as normalized data.
Our biggest table would be the ballots table which would be years * ballots * players = about 100,000 rows or something. Which is nothing of course provided you use indexes and are smart. And the problems that you raise wouldn't affect this table anyway.
An example:
The way we have it, some rows might look like so:
Year VoterID Rank PlayerID
1930 123 1 Ruthba01
1930 123 2 Gehrlo01
1930 123 3 Lajona01
But alternatively, since there are always 10 spots on a ballot, we could have the "ranks" be their own columns. So 3 rows would be 3 ballots:
Year VoterID Player1 Player2 Player3
1930 123 Ruthba01 Gehrlo01 Lajona01
1930 5423 Gehrlo01 Ruthba01 Sislge01
1930 5685 Ruthba01 Gehrlo01 Stovha01
The former way makes the most intuitive sense to me, but the latter way would cut our record by a factor of 10 and would also work I think.
Again, just sort of thinking out loud here.
Dan's data has a primary position for every player already in a column, but he considers LF and RF the same position post-deadball era.
Tango - they way I've been calculating positions and 'hats' for players is by taking his pennants added, as opposed to straight playing time. But we did have a few players who were manually adjusted - guys like Harry Stovey and Stan Musial for example - who played more games at 1b than anywhere else, but played many more games in the OF combined than at 1B. For those guys we've used the OF position they played the most as their primary position.
So if you think it would be easier to just code it in a column, than to calculate it with that kind of intricacy, that should give your answer.
When I do my queries on the website, I just have to use a like rather than a specific match. For example, I could do:
"Select * from players where elected=yes and position like '%C%'"
Not only would that pull up all the obvious catchers like Bench, but it would also pull up someone like Torre who might be a "C, 3B." Then Torre would come up whether you clicked on C or 3B.
The guys with "OF" would just be put in as "LF,CF,RF" Personally I think it would be better to enter all these manually.
No, you always never want to expand horizontally; almost always you want to go vertical.
If you want to do a join to another table, what are you going to do? You'll need 10 joins to 10 copies of the Players table just to get the player's name.
When in doubt: vertical.
***
where pos like '%C%'"
and pos not like '?%'
Using digits would be better. So, you can have a player that's a 79 or an 8 or a 2 or a 37 or whatnot.
But, I'd favor having a "primPos" and "secPos".
Thanks. That was comforting. I worked on a DB2 system for the old Ralston Purina once. When it got done, they ran a test query. It came up correct, but it took - I'm not exaggerating - 24 hours! What had happened was that every division of the company had several one-customer-only items that they wanted an entire column for. The tables just got way out of hand. They had to redesign the whole thing. That's what I was afraid of.
BTW, I agree with your opinion regarding vertical and horizontal approaches. The one thing I learned documenting the one Oracle project I worked on was that joins are bad. Almost anything you can do to decrease the joins is good. I know one whole Oracle DBA, and he said that's still true. At least in Oracle. I don't know about modern Access.
- Brock
PS - I spent the last decade doing systems documentation instead of programing. If you need that kind of help, as opposed to coding help, ring me up. My coding is out of date because I don't know the languages any more. English hasn't changed that much. I can read code in languages I can't write in, because I don't have to know the details of small grammar and syntax.
Voters
Years
Players
Ballots
As it stands, someone sent me a voter list.
What I would prefer is that we get four people (none of them me) to take ownership of each of these 4 tables. Once I have the data, I can handle the design and cleansing and uploading.
The Years table should be the easiest, the Voters the next easiest, the Players one of moderate effort, and the Ballots one is the killer. You'll probably need three or thirteen people just for that one.
Ideally, each voter would supply his own votes for each year. That would be just lovely.
Howie Menckel: what do you have that you use to report cumulative totals? Is that something you could send to Tango?
Let's put a deadline of say Mon, Dec 22, for people to offer their inputs as to what data we should be recording. Then we can come up with an initial working design before the end of the year.
Hi, Tom.
I'll try to make sure I "think on it" before your deadline.
When I google my own name a site called WikiGonzalez comes up first. Presuming that Kent was not the author, it may be useful for someone to track down the site, or a cached version. The author covered HallofMerit aliases from 1898 until yyyy. I did use it a couple of years ago to look up a Hall of Merit voter or participant.
By the way, Tom, have you voted here?
== quoting Google
Paul Wendt - Wiki Gonzalez
Paul Wendt. From Wiki Gonzalez. A contributor and mentor for the Hall of Merit ( though never a voter). Retrieved from "http://digamma.net/btfwiki/Paul_Wendt ...
www.digamma.net/btfwiki/Paul_Wendt - 8k - Cached - Similar pages -
probably btfwiki/ includes the data that I have in mind
== just in case you were wondering
Paul Wendt - Boston, MA | Facebook
Paul Wendt (Boston, MA) is on Facebook. Facebook gives people the power to share and makes the world more open and connected.
www.facebook.com/people/Paul_Wendt/1033821430 - 21k - Cached - Similar pages -
no he isn't
If primary fielding position is derived --for career by method identical to its derivation for single season-- then Hall of Merit fielding position "homPos" should also be in the database.
For Hall of Merit members, homPos is equivalent to the key for the unique 2008 special election, "Ranking [HOM members by fieldpos]" in which he appears. Is homPos null for every nonmember? Maybe so.
53. bjhanke Posted: December 19, 2008 at 12:11 PM (#3033923)
Tom says, "Brock, I don't think we have to worry about too many columns. I work with Retrosheet data, and I've got 150 columns and 8 million rows, and my slowest queries take, I dunno, 10 minutes?"
and the computing power at hallofmerit.com, wherever that will be?
(and that may be my last contribution to the the website :-)
Another thing that could be calculated from the individual voter ballots is an 'as voted' personal Hall of Merit, especially for voters that have participated in every election.
Not really, Joe. You have to deal with players who are in the HOM but haven't made a person's PHoM yet. How can we decide where to slot them when they're not on the ballot anymore?
Thanks shock!
I think this has promise, one thing I'm not sure of though, is the picture taking up so much real estate . . . I tend to like sites like B-R and BTF more than ESPN.com for example . . . what do others think?
Points
- electee_count
- rank
- points
So, this would look like this:
2,1,24
2,2,23
2,3,18
2,4,17
...
2,15,6
3,1,24
3,2,23
...
Or whatever the point system is.
I can own this table. I saw in your constitution what the point system is, so it'll be a snap.
This way, when you link the Ballot table to the Years table to this Points table, it will automatically tell you how many points each player gets.
But when we had a year where we only elected one, it was 24-19-18-17 etc. . . . elect two was 24-23-18-17-16, elect 4 (the first year) was 24-23-22-21-16-15.
Does that change anything?
It's Friday afternoon, kind of a dead time on the interwebby thingamajiggy . . . not to mention half the country is covered with ice and snow, so I'm guessing that not too many people are working, and if they aren't working, they aren't surfing the web.
The next two weeks will be even worse I imagine . . . we'll get it done I'm sure.
So if I guess right, the ballots table will have columns like year, voter, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
Then each ballot will be a row.
I think inputting by voter would be much easier, as most ballots are fairly stable from year to year, which could mean a lot of copying and pasting, as opposed to typing, right? Especially if OCFs data has all of this post-1920?
We had fewer voters in the early years also . . . so the 'manual' years will have fewer entries at least.
Each row is a vote...
It will look something like:
Year, Voter, Rank, Player, (explanation)
"Voter" will be the Primer ID number.
I'm not sure what format we are using for player names. Will have to decide that I guess.
If OCF's data has all this then that is excellent!
Why is each vote a row?
Oh wait, so we can include the explanations too . . . wow. Maybe we add those later? That is massive.
But could you just easily make each ballot a row . . . and just add columns like player1, explanation1, player2, explanation2 . . .
The voter ID was a the handle used at the time of the vote, or a shortened version of that handle. That's going to require some hand intervention to cover cases like Marc/sunnyday2 or Don F/Pedro Feliz Navidad. I also sent Tango a document with consensus scores in it that may help a little with that part.
That's easy to match to LahmanID. We can just do a vlookup to catch most of them (with the complete set of LahmanIDs as a table). Manually correct the guys that it misses due to multiple C.Jones.
As far as the consensus scores, if we have the formula, the DB should be able to calculate those, right?
It's so that when we join tables together we only have to look at one column; the players column. If we had one ballot be a row then we would have 10 columns to store players which means 10 joins every time we reference the table, which means poor performance.
Suppose we wanted to query the DB to find out how many times a voter voted for a 1B that had 200 career HR or less.
With the current design, it's pretty simple. Something like:
select count(b.playerID) from players p, ballots b, playerStats s
where
p.priPos = '1B'
and
s.HR <= 200
and
p.playerID = s.playerID
and
p.playerID = b.playerID
If you only wanted to count instances where a voter had that player in his top 5, simply add:
and
b.rank <= 5
But with one ballot being a row, how would you do it? Like this?
select count(p.playerID) from players p, ballots b, playerStats s
where
p.priPos = '1B'
and
s.HR <= 200
and
p.playerID = s.playerID
and
p.playerID = b.player1
or
p.playerID = b.player2
or
p.playerID = b.player3
or
p.playerID = b.player4
or
p.playerID = b.player5
or
p.playerID = b.player6
...
it's ugly...
And even that wouldn't give the right answer I don't think because you are counting players instead of votes.
And it would be like this:
players p1, players p2, players p3..., players p15
And then in your where clause you match b.player1 to p1.player and b.player2 to p2.player, etc.
the hard part will be the typoes and spelling mistakes.
That is the hard part only because all the face and font variation is "hopeless", right? In effect, participants take the forum software for granted, its limitations as well as its strengths. Regarding the strengths: comments rely on typog devices such as boldface and "code" font. The database uses "printing characters" from ASCII or is it some larger set?
There is no money here, Brock :-)
[on "column creep] . . . does the DBMS have ways of avoiding the large number of columns slowing the database to a crawl? This was a big issue when I last worked in Access, or DB2, or whatever. My understanding is that Oracle was able to get around this by going relational, but I've never worked in Oracle, and understand that it can be a b**ch.
They all went relational and call themselves RDBMs or RDBMSes, but I think that's a different problem.
If column creep is a problem, then we do have a practical limit on how many different types of data we can include, if we don't want slow response and crashed servers. If there is a way around it, we can have really specialized columns, like a war credit column just for 1918. If this is a ridiculously dated question, and all modern DBMS have solved column creep
At this stage, I believe, (this fortnight) the HOM database should not include any of this extra stuff except an Extras table, at most, or perhaps no more than an note field in the Voters table that covers the Extras.
voter_id, voter_name, comment, note
33, Joe Dimino, founder and bbtf owner of the Hall of Merit, Pennants Added for mlb Pitchers (maybe forthcoming)
354, Chris Cobb, , MLE for Negro Leagues and other play outside MLB (maybe forthcoming)
"maybe forthcoming" would mean that it may be forthcoming as part of the website, probably linked to but not part of the database. Later those words might be replaced by the URL or pathname of the other database.
- voter_id
- voter_name
- voter_??
- voter_??
Perhaps that should be a people table. For example, if "Shock" is a never-voter then may yet be a listing as website designer.
: contact
Will some provide contact data here? if not directly, then bbtf username provides indirect contact in some cases; contact data may be eddress or current bbtf alias
: comment
Most tables in most databases should include a comment field. Just above I have suggested and exemplified two fields, comment and note.
YEARS
- year_id
- electee_count
: comment
Example for illustration only
year_id,electee_count,comment
1950, orginally elect two; elect two by 1930 revision
Another example
PLAYERS
- player_id
- player_name
- player_bio (large text field)
No other player information is needed,
famous last words :-)
: comment
: YearEligible
PLAYER_VOTER_EVALUATIONS
PLAYER_VOTER_YEAR_EVALUATIONS
The suggested "score_value" (one numerical dimension) is too restrictive but I suggest that this be postponed. Handle voter ratings as voters make them available.
The single most valuable "ratings" by far may be called year1 and year2 representing a kind of begin and end of the careers of those who are not in the mlb databases. I daresay a kind of begin and end date although I have not forgotten that HOM evaluations of some major league players --not only Dickey Pearce and Satchel Paige but Earl Averill-- rely on consideration of years outside their major league career timespans, but any begin or end dates for Frank Grant is more valuable, one order of magnitude more valuable, than a nonstandard begin date for Earl Averill.
(Retrosheet does not provide a page for Frank Grant and baseball-reference does not provide any but vital data --what SABR calls "biographical data" rather than "baseball records"-- and the Hall of Fame election year, which is 2006 for Grant and many others.)
BALLOTS
- voter_id
- year_id
- player_id
- ballot_rank
- voter_comment
- selection_rejected_switch
This last table is basically the one that links all the other tables.
"selection_rejected_switch"? Does that provide scope for invalid selections? I think we have only invalid ballots.
--
Unfortunately, there were some tie votes in 1898 if not thereafter, evidently two two-man ties for 15th and one five-way tie for 15th, concerning seven players because two appeared in two ties.
1898 Results (table mis-aligned)
--
Regarding tie results the question is which should be considered official ties, which may differ from which were reported as ties in the posted results. Maybe the rank-order results should be ignored for now --left out of the database design so that we are initially able to derive ties but not break them.
== note
POINTS
On the recent suggestion by Tom Tango, what data do we have for each year after joining the Years and Points tables? Something like this.
2009,3,1,24
2009,3,2,23
2009,3,3,22
. . .
2009,3,15,6
(right? )
- year_id
- electee_count
: comment
Example for illustration only
year_id,electee_count,comment
1950, orginally elect two; elect two by 1930 revision
Another example
<<
Excuse me for that. Maybe it is sufficient to illustrate the "comment" field, whose value in the example is "orginally elect two; elect two by 1930 revision". (I skipped the electee_count and the illustration is fictitious anyway and I didn't provide "Another example" as it says.
If ordered by year then electee_count shows some anomalies. Eg, why Elmer Flick elected all alone? At least some of the anomalies were introduced by revision along the way, not part of the original design. The "comment" may be used to note such things.
At stage two, multiple "comment" and "note" and such-like fields may be used to generate various footnotes if the website will have footnotes.
For example Sean Forman at baseball-reference uses the nameNote field to hold the text value of the slightly-free form note on player pages.
Bill Dinneen: "commonly misspelled Dineen"
Frank Selman: "Born with last name Sellman, Also Played Under Name Of Frank C. Williams 1871-75"
Hal Trosky: "born Harold Arthur Trojovsky"
(At the Hall of Merit, an explanation for revision and perhaps anomaly in electee_count is something like "re-calibrated to match Clark Griffith as a Hall of Fame player" or something like that.)
:)
What I'd be able to offer is perhaps an arrangement of the players that more appeals to math-phobic people.
For instance, I have "Hall of Merit starting pitcher tandems," with every example of two HOM SPs on the same team in the same year. I think there are a few with 3-4-5, and that shows up there.
Also 2B-SS combos, 1B-3B combos, 3 OF HOMers on same team, etc.
And other approaches - list of total number of HOMers per league per year, both as long lists and as short ones listing only the number figure and not the player. What it does is show you, for example, the years where about as many Negro League HOMers were in action that year as there were AL or NL HOMers.
I did one showing all the black/Hispanic HOMers by year, which when plucked out that way shows some startling imbalance (some people knew NL had more in the 1960s, but I think one year had 13 in NL and ZERO in AL. That I didn't realize).
Another list shows how many HOMers played each position that year. So you see that 1890s-1900s C scarcity, and gluts at other positions in other years. 2B goes several decades with only controversial pick Nellie Fox iirc (or at least it did at the time I did the chart).
Some of these I could update in 5 minutes, others would take longer.
I have "team by team" versions for the Mets, Yankees and Red Sox. That could be done for each franchise.
The advantage of these lists would be fun stuff that any baseball fan might enjoy, not just SABR types.
As for "total votes points," I'm just a chimp posting up a version of other people's numbers in that case. I think Ron W and someone else actually have them.
Anyway, this is great news. I'd be happy to provide whatever modest help I can give (just don't ask for an Excel spreadsheet, a pdf, a pdq or any other stuff beyond just a bunch of words typed into an email or Website posting area, lol).
Chris, the only requirement is that for players that are not in the BDB database, then we need to make sure they exist somewhere to link them. So, it would seem that would need to not only link to the BDB MASTER table, but also include other master tables, so that we have a "super duper" master table that the Players table will link to.
So, you should create your own master table with:
- player_id
- player_name
and everything else that you want
When it comes time for HOM to create its own master table, it will LINK to BDB and to yours, and then do something like:
select player_id, player_name
from bdb.master
union
select player_id, player_name
from chriscobb.master
And we'd have to add validation to remove any duplicates that the UNION did not catch.
--
In my desktop database the Master table is an extension of bbdb 2007 (thru the 2006 season).
At the moment the ID range is -1119 to 18702 where I have added everyone in [-1119, 0] and [1, 18702] covers everyone in the bbdb.
(Frank Osborn to Jim Creighton; Hank Aaron to Mark Kiger. You should look up all four if you don't know them.)
There are 344 records in the ID range [-1119, 0] and they should include everyone with a player page or a vote at the Hall of Merit. Is that everyone with a Chris Cobb mle? I doubt it but it's a start. To the point at hand, that table includes biographical data posted on the web (and found by quick web search 15-18 months ago); for "Negro Leagues" players published in Riley's encyclopedia.
In another table I have data for some more "Negro Leagues" players.
Kevin Johnson has digitized some data too.
I will try to confer with KJOK and Chris Cobb before Monday evening.
1949 Cleveland AL - Bob Lemon (22-10) and Bob Feller (15-12) and Early Wynn (11-7)
1950 Cleveland AL - Bob Lemon (23-11) and Early Wynn (18-8) and Bob Feller (16-11)
1951 Cleveland AL - Bob Feller (22-8) and Early Wynn (20-13) and Bob Lemon (17-14)
1952 Cleveland AL - Bob Lemon (22-11) and Early Wynn (23-12) and Bob Feller (9-13)
1953 Cleveland AL - Bob Lemon (23-15) and Early Wynn (17-12) and Bob Feller (10-7)
1966 Los Angeles NL - Sandy Koufax (27-9) and Don Drysdale (13-16) and Don Sutton (12-12)
(1993-99 and 2001-02 Atlanta NL will be Glavine/Smoltz/Maddux)
This is with minimum 1 IP per team game or 35 G, so 1949 Satchel Paige 31 G wouldn't quite make it. We could discuss minimums, or maybe better have asterisks to indicate who got cups of coffee or came close to qualifying. Or just have two lists, one with a minimum and one without.
The lists show unexpected stuff, I think, like no NL P fulltime HOMer duos between 1917 (Alexander and Rixey, PHI) and 1959 (here come Koufax and Drysdale).
And what tandems had the most times doing this together?
8 - Wynn and Lemon
8 - Koufax and Drysdale
8 - Gaylord Perry and Juan Marichal
7 - Feller and Lemon
7 - Christy Mathewson and Joe McGinnity
6 - Eddie Plank and Rube Waddell
6 - Ted Lyons and Red Faber
5 - Feller and Wynn
5 - Bob Gibson and Steve Carlton
5 - Alexander and Rixey
But Glavine and Smoltz one day will blow them all away with 13 (!), though missing in 2008.
Until 1970, I only see two examples where there were no repeats (makes sense I guess; you have two great pitchers on your team, you want to keep them..
They are the first one:
1882 Providence NL - Charley Radbourn (33-20) and John Ward (19-12)
And the one you might never think of:
1962 Baltimore AL - Robin Roberts (10-9) and Hoyt Wilhelm (7-10/15 SV)
Ok, maybe this suggests a "story" to go with some lists.
Some want to delve into the high-end math.
Some want to chew on some lists, and notice surprising things for themselves.
And others just want to know, "Ok, if I spent the time to study this list, what's some of the coolest stuff I'd find?"
As Tango has said, as long as the data is there, we can answer any question with the right query. It shouldn't be necessary to make these lists manually once the database is complete.
Ideally, if this REALLY gets off the ground, we can make a "PI" type feature that would allow people like yourself to have the system generate these kinds of reports without the knowledge of SQL. Then maybe you could maintain a blog on the site every week where you show us a cool list and talk about it for a bit. ;-)
Really, the possibilities are endless as long as we have data and man-power.
I haven't seen Chris chime in, but I have heard from Paul. If Paul sends me what he has, I should be able to create an OTHER MASTER or OTHER PEOPLE table that would have all of the players Chris and Eric have MLE's for. I know a couple of others (Brent for one I think) also created MLE's, but I would guess they are the same players Chris and Eric have. If not, someone can let me know that, and I'll add them.
There are for the ones in the Hall of Fame, but I would probably use a different ID system for the 'other' players.
So far I have private extensions to widely available data tables. If the HallofMerit is entering the business of maintaining one extension of the Master table, we should probably add a couple of fields. One new field should identify the source of the record(row); for this purpose it may be "Paul Wendt 2008-12-dd" rather than any account of his sources. Another field should specify whether the record matches what the bbdb or the source provided, or whether it has been edited during its HOM tenure.
It's possible that what I provide and what Kevin does with it, in conjunction with Chris and Brett and Eric, will be in some MLE database that is strictly separate from the HOM database that Tom Tango is creating :-)
Much of this will be part of any '"super duper" master table' (#32), so it must concern Tom, but maybe no one else, outside this paragraph.
(My own preference will be obvious to those who understand what I have done myself.
No "super duper master".
Instead extend the Master table by adding records.
Facilitate update of bbdb component by a device such as lahmanID<1 vs. lahmanID>0.
Since I've been doing this project for what? a month now, I'm not a good choice to think up things like table columns and IDs, but I do write, and I have proofread for a living. When it comes to checking for typoes and spelling mistakes, I'm in. - Brock
Some of Brent's MLEs will be unique, I believe. He has been the main source for minor league MLEs for players who were not also NeL players: Gavvy Cravath, Buzz Arlett, and others. These players would already be in this master database, I guess, since they also played in the majors, but Brent's MLE data should certainly be included in our database.
As an aside, I have had little to say in this conversation because its technical aspect is way over my head. I truly know nothing whatsoever about databases, so I cannot even parse a good deal of what is being said here, and I have no idea what implications it has for the preparation of the data in my possession for inclusion in the database.
I hope that when work on the database progresses a little farther, someone can tell me--in straightforward descriptive terms that I can apply to the formatting of spreadsheets--what I will need to do.
Just as an example of the kind of instructions that are too technical for me to use, see Tango's comment to me, which Paul Wendt copies in post 90 above.
You've said this in posts on other threads. Since I pass myself off as someone who explains computer things for a living, it's my duty and privilege here to say,
1. If you can do spreadsheets, you can do databases. There aren't that many differences.
2. If you've ever summed up a column or anything like that in a spreadsheet, you've crossed the border and are doing the kind of things that databases do - manipulating the data.
3. The biggest difference is that databases have clever ways of linking up different spreadsheets. Think of a database "table" as one spreadsheet. Now think of a group of such "tables." Now think of trying to link the info in one table to the info in another. That's the basic idea of a a database: linking spreadsheets (tables) to each other.
4. Here's an example. Suppose you had one spreadsheet with every batting season by every player in history. One row in the sheet equals one season of play for one player. The spreadsheet columns would be things like Player Name, Year, At-Bats, Runs, Walks, etc. Now think of a DIFFERENT spreadsheet. This one contains ballplayer names just like the first table, but also dates of birth, debut dates, dates of death, things like that that don't change from season to season, but only happen once in a career or lifetime. Those are two different spreadsheets. The second one has only one row for each player. The first one has one row for each season that player played. They are linked by the player name. What a database allows you to do is to go into spreadsheet (table) 2, locate Honus Wagner's name code, and then go to table 1 and gather up all the seasons that have, say, "Honus_Wagner" as their player name. Then you put those rows from spreadsheet 1 into a list sorted by Year. You could then display both Honus' biographical stuff and then list all his seasons. That's what BB-Ref does, essentially. They have many more tables than two, and much more data than just batting seasons, but the "database" part of it is simply the linking of the various tables together. We call those individual spreadsheets "tables." We call the links between them "joins." We call the entire collection of tables and the joins between them a "database."
Everything else that you've read where we computer types start using terms that you've never seen is just jargon. We're just talking in database shorthand. The basic concept really is just what I have here.
Did that help? Personally, I think I've made it much more simple than you seem to need. I think you should develop some confidence here. What you've done with spreadsheets is pretty impressive. I have no doubt that you could pick up database theory pretty easily. You seem to have all the talent you'll need.
- Brock
You must be Registered and Logged In to post comments.
<< Back to main