old
This commit is contained in:
		
							parent
							
								
									720f21a85b
								
							
						
					
					
						commit
						43076bcbb1
					
				
					 42 changed files with 237415 additions and 7831 deletions
				
			
		
							
								
								
									
										40
									
								
								Temporal Coherence Blog Post Caveats.md
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							|  | @ -0,0 +1,40 @@ | |||
| **3\. Caveats and Future Work** | ||||
| 
 | ||||
| Our project has some important limitations, both theoretical and empirical. The caveats below offer room for improvement in future iterations of the project, as well as new lines of investigation.  | ||||
| 
 | ||||
| **3.1 Limitations of the theoretical approach**  | ||||
| 
 | ||||
| **3.1.1 Temporal coherence as a concept** | ||||
| 
 | ||||
| 
 | ||||
| Our theoretical framework is based on the concept of temporal coherence, which we characterise as the ability to put together, consistently, different atomistic tasks in the pursuit of a more complex task. However, this is not the only possible framework for understanding AIs’ capabilities. For instance, some people have used the [t-AGI](https://www.alignmentforum.org/posts/BoA3agdkAzL6HQtQP/clarifying-and-predicting-agi) framework, which bears some resemblance to ours but is not identical. Also, we don’t offer a formal definition of the concept. One way in which we plan to expand the project further is by incorporating it more explicitly into the model proposed by Korinek and Suh. | ||||
| 
 | ||||
| One objection that could be raised to the concept of temporal coherence comes from the ‘jagged frontier’ of AI systems. This is the observation that AI capabilities are very irregular – while these systems can perform in mere seconds or minutes tasks that would take a human many hours, they struggle with seemingly easy, short tasks. We plan to incorporate this concern in our framework in the future. | ||||
| 
 | ||||
| **3.1.2 Temporal coherence as a significant bottleneck**  | ||||
| 
 | ||||
| We make an additional claim: if AIs could act coherently over arbitrary periods of time, putting together the different atomistic tasks they are already capable of performing, more economic value would be unlocked than if any other ability were ‘solved.’ | ||||
| 
 | ||||
| However, it is plausible that other bottlenecks, such as memory, context, multimodality and cooperation, are equally or more important. We aim to dedicate more time to identifying the most important bottlenecks to job automation in our future research, possibly with the help of an economic model like the O-Ring model. | ||||
| 
 | ||||
| **3.1.3 Task length as a proxy for coherence**  | ||||
| 
 | ||||
| Our measure of temporal coherence is the time it would take a human to complete a given task. The idea behind the proxy is that the longer the task, the more it requires putting together different pieces or atomistic tasks to complete it. But this is not a perfect measure. One way in which the proxy could be misleading is if a task takes a lot of time for a human to complete, not because it requires more coherence to put together different subtasks, but rather because the atomistic tasks *themselves* take a lot of time to complete. To some extent, this seems contingent on how narrowly we define atomistic tasks; if they are defined narrowly enough, as Korinek and Suh do in their original paper, then task length seems like a good proxy, because no single task would take a lot of time to complete, by definition. But there’s room for disagreement, and the measure could be improved. | ||||
| 
 | ||||
| Finally, the concern of the ‘jagged-frontier’ also applies here, because if AI systems are very irregular, it’s not clear that measuring the time it would take a *human* to complete a task is very informative of the temporal coherence required by *AIs.* | ||||
| 
 | ||||
| **3.2 Limitations of the estimates of task length**  | ||||
| 
 | ||||
| **3.2.1 LLM classification limitations**  | ||||
| 
 | ||||
| Humans have a [hard time](https://en.wikipedia.org/wiki/Planning_fallacy) estimating how long something takes, and so do LLMs. In our test bench, we evaluated the LLM’s self-consistency and found that the more ambiguous the task statement seemed, the more the model’s estimate changed between runs; even then, it switched among only a few values and stayed consistent within that small set. | ||||
| 
 | ||||
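As a rough illustration of what such a self-consistency check could look like, the sketch below summarises repeated estimates for a single task. It is not the test bench used in the project: the function name `consistency_summary` and the sample numbers are hypothetical, chosen only to show the pattern described above (few distinct values, one of them dominant).

```python
from collections import Counter


def consistency_summary(estimates_hours):
    """Summarise repeated duration estimates (in hours) for one task.

    Returns how many distinct values the model produced, the most
    frequent value, and the share of runs that returned that value.
    """
    counts = Counter(estimates_hours)
    modal_value, modal_count = counts.most_common(1)[0]
    return {
        "distinct_values": len(counts),
        "modal_value": modal_value,
        "modal_share": modal_count / len(estimates_hours),
    }


# Hypothetical repeated runs, NOT data from the project.
clear_task = [2, 2, 2, 2, 2]           # unambiguous task statement
ambiguous_task = [8, 40, 8, 40, 40]    # estimate flips between two values

clear_summary = consistency_summary(clear_task)
ambiguous_summary = consistency_summary(ambiguous_task)
print(clear_summary)
print(ambiguous_summary)
```

A small number of distinct values with a high modal share would correspond to the "consistent between a few values" behaviour; many distinct values would indicate genuinely unstable estimates.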
| We also manually estimated the time to completion for 45 task statements and compared our results with the model’s: its estimates were close enough to ours. | ||||
| 
 | ||||
| In the future, we might try to cross-validate the estimates by developing additional methods for time estimation. There is [prior work](https://www.ons.gov.uk/economy/environmentalaccounts/articles/developingamethodformeasuringtimespentongreentasks/march2022) in this area. We might also use a model equipped with estimation tools, or spend time developing a larger validation set so that we can evaluate our prompt and model choice more accurately. | ||||
| 
 | ||||
| **3.2.2 METR’s estimates**  | ||||
| 
 | ||||
| We use METR’s estimates of the growing capacity of AI agents to perform longer time-horizon tasks, with the time horizon doubling every 7 months, to arrive at the result that, as far as temporal coherence is concerned, more than 60 per cent of tasks could be performed by AI agents by 2026\. | ||||
| 
 | ||||
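The doubling claim can be made concrete with a small calculation. The sketch below only encodes the 7-month doubling time mentioned above; the starting horizon of 1 hour is a placeholder assumption for illustration, not a figure taken from METR or from our results.

```python
import math

DOUBLING_MONTHS = 7  # doubling time quoted in the text


def horizon_after(months, start_hours):
    """Time horizon (hours) after `months` of exponential growth."""
    return start_hours * 2 ** (months / DOUBLING_MONTHS)


def months_to_reach(target_hours, start_hours):
    """Months needed for the horizon to grow from start to target."""
    return DOUBLING_MONTHS * math.log2(target_hours / start_hours)


# ASSUMPTION: a 1-hour starting horizon, chosen purely for illustration.
START_HOURS = 1.0

months_needed = months_to_reach(8.0, START_HOURS)   # three doublings
horizon_check = horizon_after(months_needed, START_HOURS)
print(f"{months_needed:.0f} months to reach an 8-hour horizon")
```

Under these assumptions, reaching a full working day takes three doublings. A different starting horizon shifts the date but not the shape of the curve.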
| The problem with using those estimates is that they are based on a set of coding benchmarks, which, as [others have commented](https://epoch.ai/gradient-updates/where-is-my-ten-minute-agi), do not reflect the messiness of real-world work environments. Also, the extent to which we can extrapolate the performance of AI agents in coding tasks to other types of tasks is an open question, especially considering how reasoning models are particularly good at coding and math, as opposed to other fields. Finally, we don’t take into account the success rate variable in METR’s results, which for obvious reasons seems important when it comes to automating real-world tasks. | ||||
							
								
								
									
										53
									
								
								Temporal Coherence Blog Post Part Intro.md
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							|  | @ -0,0 +1,53 @@ | |||
| **Temporal Coherence: A Bottleneck in Automation** | ||||
| 
 | ||||
| *This research was conducted as part of the Economics of Transformative AI hackathon with Apart Research.* | ||||
| 
 | ||||
| Why hasn't AI automated away more professions? One potential explanation is that, despite their intelligence, these systems cannot (yet) act coherently over long periods of time. While AI has demonstrated impressive capabilities in solving concrete, isolated problems, maintaining consistent goals, reasoning, and plans over extended time frames has proven a more significant challenge.  | ||||
| 
 | ||||
| In this context, our project seeks to answer two questions. First, we know AI systems are getting better at completing tasks over longer time horizons, as [METR’s recent work](http://.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/) shows. But how much better will they need to get in order to start having a real impact on the economy, or at least for their automation capabilities to increase significantly?[^1] Second, supposing this ability is a crucial bottleneck preventing AI systems from automating economic tasks, how much value could be unlocked and how soon? | ||||
| 
 | ||||
| In order to answer these questions, we define a concept that tries to capture AI’s ability to maintain consistent goals and plans while pursuing some task, called *temporal coherence*, in the context of [recent work](https://www.nber.org/papers/w32255) by Anton Korinek and Donghyun Suh. Then, we argue that temporal coherence might be the AI ability that, if solved, could unlock the most economic value. With this framework in mind, we try to estimate the importance of temporal coherence in the US economy by measuring the time needed to complete all remotable tasks in the [O\*Net dataset](https://www.onetonline.org/). Our findings suggest that AI agents could have enough temporal coherence to perform any remote economic task by 2030, and that AI automation potential could increase in a fairly discontinuous manner, unlocking a great deal of economic value soon. | ||||
| 
 | ||||
| **What is Temporal Coherence?** | ||||
| 
 | ||||
| Recent discussions about the economics of AI have highlighted that while LLMs are very useful assistants and have most likely increased the productivity of many workers across the board, their effects in terms of full-job task automation have yet to be felt (see [here](https://epoch.ai/epoch-after-hours/disagreements-on-agi-timelines) and [here](https://www.dwarkesh.com/p/ege-tamay)). The explanations offered for this puzzle usually invoke phrases such as ‘agency’, ‘autonomy’, ‘coherence over long horizons’, or ‘adapt plans to simple circumstances’. Because this myriad of concepts is never clearly defined, confusion can arise.[^2] | ||||
| 
 | ||||
| To impose some conceptual clarity, we define the concept of *temporal coherence*. This concept is to be understood in the context of modern labor economics models, in particular, the task-based framework created by David Autor, Daron Acemoglu, Pascual Restrepo and others. In these models, tasks, rather than occupations, are the fundamental units of economic production.[^3] [Recent work](https://www.nber.org/papers/w32255) by Anton Korinek and Donghyun Suh refines this approach by modelling human work as composed of atomistic tasks that vary in computational complexity, conceptualising technological progress as expanding the ‘automation frontier’, gradually enabling machines to perform increasingly complex atomistic tasks. Our proposal is to add, on top of this atomistic framework, the ability to *put together different atomistic tasks in the pursuit of a more complex task*. We call this ability temporal coherence. | ||||
| 
 | ||||
| To illustrate these abstract concepts and the usefulness of temporal coherence, let us consider teaching an economics course. While occupational databases like O\*NET might list ‘teach economic theories’ as a task, this high-level label can be decomposed into subtasks: planning a syllabus, preparing lectures, delivering explanations, answering questions, recognising confusion, adjusting content dynamically, and grading assignments. Each subtask differs in computational complexity. Planning a lecture may be relatively simple, and current AI systems might already be able to do a decent job at it; dynamically adapting explanations in response to subtle student cues, however, might be more challenging. | ||||
| 
 | ||||
| But even if an AI system could perform each subtask—preparing slides or answering factual questions—it is very likely that current systems could not deliver a coherent semester-long course. Teaching an economics course requires doing each one of the subtasks consistently and coherently until the course is effectively completed. For instance, effective teaching requires maintaining thematic and conceptual consistency over time, adapting to cumulative student understanding, and revising instructional strategies based on longitudinal feedback. In short, AIs do not yet have enough temporal coherence to complete this task at present. | ||||
| 
 | ||||
| The fact that temporal coherence remains a challenge when a task involves putting together many different atomistic tasks seems a good explanation for why we haven’t seen more AI task automation yet. | ||||
| 
 | ||||
| **The economic value of temporal coherence**  | ||||
| 
 | ||||
| Having established the concept of temporal coherence, we now want to argue that, out of all the abilities which AI currently struggles with, solving temporal coherence would unlock the *most* economic value. | ||||
| 
 | ||||
| This argument comes from the observation that current AI systems are really smart, combined with an intuition about what most real-world economic activity involves in practice. | ||||
| 
 | ||||
| The intelligence of state-of-the-art LLMs can be clearly established by the results they obtain in various benchmarks. For example, reasoning models, like [OpenAI’s o3](https://openai.com/es-ES/index/introducing-o3-and-o4-mini/), seem very competent in domains like math or coding, where they get results approaching, matching or even surpassing experts in some cases.[^4] Models getting smarter and smarter rapidly is a [trend](https://ourworldindata.org/grapher/test-scores-ai-capabilities-relative-human-performance) that began with ChatGPT and continues to this day. | ||||
| 
 | ||||
| Despite these impressive results, which clearly demonstrate AIs’ ability and intelligence, it does not seem that most economically valuable tasks could be performed *just* with this ‘raw’ intelligence. ‘Unhobblings’ are necessary, to borrow Leopold Aschenbrenner’s [terminology](https://situational-awareness.ai/). What are AI systems lacking, exactly? Some possibilities include full multimodality, memory, context, cooperation abilities, and temporal coherence.[^5] | ||||
| 
 | ||||
| Considering just how intelligent these systems are, or, using our terminology, how many atomistic tasks they can perform, it seems to us that if AI systems were capable of putting those abilities to use in a coherent way in the pursuit of a goal or task (that is, if they had perfect temporal coherence), then their automation potential would vastly increase. In contrast, systems with, say, perfect multimodality but without temporal coherence would suffer from the same automation limitations that LLMs face now. The automation potential of temporal coherence just seems much greater than that of any other ability.[^6] | ||||
| 
 | ||||
| **Our Research Approach** | ||||
| 
 | ||||
| Temporal coherence is an ability that comes in degrees, and [recent work](http://.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/) by METR shows that AI systems are getting better at it,[^7] and at a rapid pace. But this, by itself, does not tell us how important these advances are in terms of AI task automation potential. It could be that, as METR’s projections show, by 2026 we have systems capable of completing tasks over an 8-hour time horizon. But if most tasks in the economy require much more coherence than this, then by 2026 temporal coherence will still be a bottleneck. To understand how big of a challenge temporal coherence will be for task automation, *we need a measure of the importance of temporal coherence in the economy*.    | ||||
| 
 | ||||
| Unfortunately, measuring temporal coherence directly is challenging, since no established metric exists yet that categorises long-term projects by saying ‘this task requires X months of coherence’ or ‘that task requires Y years of coherence’. For this reason, we use as a proxy for temporal coherence *the time it would take a human to complete a given task autonomously*. This is an imperfect measurement, but it’s supported by the simple heuristic that tasks requiring longer periods of sustained, autonomous human effort seemingly necessitate a higher degree of temporal coherence for an AI agent to successfully automate them. Further, by using this proxy, our results can be combined with METR’s results to produce additional insights on the possible economic impact of future AI agents, as we explain in the next section. | ||||
| 
 | ||||
| [^1]:  We make this clarification because many other factors are involved in actual automation of tasks. For instance, there might be other capabilities that current AI models lack which would also be needed for automation; or maybe it’s just not profitable to do so; or there’s regulation that prohibits it; etc.  | ||||
| 
 | ||||
| [^2]:  Just as an example, in the podcast linked above from Epoch AI, Ege Erdil makes a distinction between some of these concepts that is not usually made: ‘So to me it looks like the lack of common sense and lack of agency and ability to execute plans is a different competence than maintaining coherence over long context.’ This suggests clearer definitions are needed. | ||||
| 
 | ||||
| [^3]:  See [here](https://www.nber.org/papers/w30074) for a great overview of the evolution of models about wage inequality and automation. The main papers in this area can be found [here](https://economics.mit.edu/sites/default/files/publications/the%20task%20approach%202013.pdf), [here](https://www.aeaweb.org/articles?id=10.1257/jep.33.2.3), and [here](https://economics.mit.edu/sites/default/files/publications/The%20Race%20Between%20Man%20and%20Machine%20-%20Implications%20of.pdf).  | ||||
| 
 | ||||
| [^4]:  We don’t want to go over all benchmarks in this post, but some important examples include FrontierMath, SWE-bench and GPQA Diamond, in which o3 obtained impressive results according to many commentators. | ||||
| 
 | ||||
| [^5]:  This is not meant to be an exhaustive list, in part because, again, definitions get tricky and some of these abilities might be grouped together or involve other abilities depending on how you think about them.  | ||||
| 
 | ||||
| [^6]:  Some definitional issues are discussed in Section 3\. | ||||
| 
 | ||||
| [^7]:  That’s the way we read the evidence, at least, from the framework of temporal coherence we have proposed. | ||||
							
								
								
									
										40
									
								
								Temporal Coherence Blog Post Results.md
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							|  | @ -0,0 +1,40 @@ | |||
| **Methodology** | ||||
| 
 | ||||
| To estimate the importance of temporal coherence in the economy, we used the O\*NET database, which lists all occupations in the US economy and the tasks involved in each occupation. We used [Barnett 2025](https://epoch.ai/gradient-updates/consequences-of-automating-remote-work)’s classification of the remotable tasks in this database[^1]. The reason for this is that the automation potential from AI, at the moment, comes mostly from AI systems that can act in digital environments, rather than in the physical world. | ||||
| 
 | ||||
| We use a large language model to classify all remotable tasks by how much time it would take a human to perform them, which gives us a lower and an upper bound for each task – if such an estimate is possible. | ||||
| 
 | ||||
| We exclude non-estimable tasks from the estimate, as arguably, these tasks are the ones most likely to stay human. | ||||
| 
 | ||||
| **![][image1]** | ||||
| **Results** | ||||
| 
 | ||||
| Our main findings can be summarised in two simple graphs. Figure 1 shows the distribution of tasks by length as classified by the LLM. The result is clear: more than 80% of tasks can be performed by humans in less than 8 hours of autonomous work. Comparatively few tasks take more than a week, and just a handful of them go over 6 months. | ||||
| ![][image2] | ||||
| The implications of these estimates, if we take them at face value, are striking: it just doesn’t seem that AIs will have to improve that much at temporal coherence for this automation bottleneck to be ‘solved’ for most tasks. | ||||
| 
 | ||||
| This is precisely what the second graph shows. Figure 2 combines our estimates for task length with METR’s projections for the increase in coherence of AI models. The result is that by 2026, less than 40% of tasks *won’t* be automatable, if they depended only on temporal coherence. By the end of the decade, AI systems will have enough temporal coherence to essentially take over the US economy – again, if temporal coherence were the only capability left standing. Perhaps even more importantly, the change in AI automation potential looks discrete rather than continuous, going from nearly 0% of tasks in 2025 to more than 60% in 2026\. This is due to most tasks taking around 8 hours to complete. Though we don’t explore them, it’s clear this has important policy implications. | ||||
| ![][image3] | ||||
| 
 | ||||
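To show the kind of calculation that sits behind a chart like Figure 2, here is a minimal sketch that computes the share of tasks whose estimated length falls within a given AI time horizon. The task lengths below are toy numbers invented for illustration, not our O\*NET estimates, and `automatable_share` is a hypothetical helper rather than a function from our codebase.

```python
def automatable_share(task_hours, horizon_hours):
    """Share of tasks whose human completion time fits within the horizon.

    This treats temporal coherence as the only bottleneck, which is the
    same simplifying assumption made in the text.
    """
    if not task_hours:
        return 0.0
    covered = sum(1 for hours in task_hours if hours <= horizon_hours)
    return covered / len(task_hours)


# TOY DATA (hours per task), clustered around one working day.
toy_task_hours = [0.5, 2, 4, 8, 8, 8, 16, 40, 160, 1000]

share_1h = automatable_share(toy_task_hours, 1)
share_8h = automatable_share(toy_task_hours, 8)
share_64h = automatable_share(toy_task_hours, 64)
print(share_1h, share_8h, share_64h)
```

Because many of the toy tasks sit at or just below 8 hours, the share jumps sharply once the horizon crosses that value, which is the mechanism behind the discrete-looking change described above.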
| To understand how much economic value could be unlocked if temporal coherence were solved, conditional on it being the fundamental bottleneck, we investigated how much temporal coherence the five most remotable occupations required. | ||||
| 
 | ||||
| ![][image4] | ||||
| 
 | ||||
| As can be seen in the graph, in all of the top five occupations except Architecture and Engineering, 80% of tasks can be completed in a day (8 hours of work). Because AI agents could achieve this level of temporal coherence in 2026, according to METR’s projections, they could unlock a tremendous amount of economic value. To get a rough sense of the magnitude, suppose temporal coherence were enough for task automation. Then, if 80% of tasks in all of these occupations can be automated, this could amount to *$3.72 trillion* of economic value. This is of course a simplification and an upper bound, but it illustrates how quickly AI could transform the economy thanks to improvements in temporal coherence. | ||||
| One question remains: how confident are we in estimates of task duration? | ||||
| 
 | ||||
| Half of all tasks are estimated to take between two and twenty hours, and for 75% of tasks the lower bound is at most 1 day and the upper bound is at most 6 days. These estimates sound reasonable for many O\*NET tasks. | ||||
| 
 | ||||
| We find that 90% of task duration estimates have an upper to lower bound duration ratio lower than or equal to 10\. | ||||
| ![][image5] | ||||
| Tasks with very large ratios accurately reflect the ambiguity of the task description. For instance, the task ‘Conduct research in a particular field of knowledge and publish findings in professional journals, books, or electronic media’ takes between 40 hours and 1 year (216x difference). Another example is the task ‘Create or use statistical models for the analysis of genetic data’, which takes between one day and six months (180x difference). | ||||
| 
 | ||||
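The ratios quoted above can be reproduced with a short conversion sketch. The conversion factors are assumptions on our part (24-hour days, 30-day months, 12-month years, with a trimester and semester taken as 3 and 6 months); they are used here because they reproduce the 216x and 180x figures, and the helper names are hypothetical.

```python
# ASSUMED conversion factors (calendar time): 24 h/day, 30 days/month.
HOURS_PER_UNIT = {
    "minute": 1 / 60,
    "hour": 1,
    "day": 24,
    "week": 7 * 24,
    "month": 30 * 24,
    "trimester": 3 * 30 * 24,
    "semester": 6 * 30 * 24,
    "year": 12 * 30 * 24,
}


def to_hours(quantity, unit):
    """Convert a (quantity, unit) estimate into hours."""
    return quantity * HOURS_PER_UNIT[unit]


def bound_ratio(lower, upper):
    """Ratio of the upper-bound duration to the lower-bound duration."""
    return to_hours(*upper) / to_hours(*lower)


research_ratio = bound_ratio((40, "hour"), (1, "year"))
genetics_ratio = bound_ratio((1, "day"), (6, "month"))
print(research_ratio, genetics_ratio)
```

A ratio at or below 10 would place a task among the 90% of estimates described above; the two examples fall far outside that range.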
| **The value of our results** | ||||
| 
 | ||||
| Why should you care about these results? First, our estimates of the distribution of task length for the US economy point to temporal coherence – or whatever name you want to give to the capacity of people to work autonomously on a task until completion – being a *less important factor* than we expected. The fact that 80% of tasks can be completed in 8 hours or less suggests that, regardless of the actual importance of temporal coherence for unlocking economic value, this will not be a bottleneck for very long. When we combine our estimates with METR’s, which project the increasing capacity of AI agents to perform tasks over longer time horizons, we find that by 2026, temporal coherence won’t be an issue for automating more than 60% of all tasks. | ||||
| 
 | ||||
| Second, once you add in the assumption that temporal coherence is likely to be the biggest bottleneck of all the abilities that AIs currently lack (or have not perfected), our results suggest that a lot of economic value could potentially be unlocked due to increases in AI agents’ temporal coherence,[^2] and that this could happen fairly soon. This has potential policy implications that we do not explore in this piece, but that seem particularly relevant for concerns about [gradual disempowerment](https://gradual-disempowerment.ai/). | ||||
| 
 | ||||
| [^1]:  We think Barnett’s classification could be improved further; for example, by using O\*NET’s [Physical Work Conditions](https://www.onetcenter.org/content.html#cm-sub-4-C-2) annotations. | ||||
| 
 | ||||
| [^2]:  This is not guaranteed, however, because actual automation depends on other factors which we don’t analyze here, like the cost of running the AI agents and integrating them into current business, regulation and other frictions. | ||||
							
								
								
									
										507
									
								
								add_task_estimates.py
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							|  | @ -0,0 +1,507 @@ | |||
| import pandas as pd | ||||
| import litellm | ||||
| import dotenv | ||||
| import os | ||||
| import time | ||||
| import json | ||||
| import math | ||||
| import numpy as np | ||||
| 
 | ||||
| # --- Configuration --- | ||||
| MODEL = "gpt-4.1-mini"  # Make sure this model supports json_schema or structured output | ||||
| RATE_LIMIT = 5000  # Requests per minute | ||||
| CHUNK_SIZE = 300 | ||||
| SECONDS_PER_MINUTE = 60 | ||||
| FILENAME = ( | ||||
|     "tasks_with_estimates.csv"  # This CSV should contain the tasks to be processed | ||||
| ) | ||||
| 
 | ||||
| # --- Prompts and Schema --- | ||||
| SYSTEM_PROMPT = """ | ||||
| You are an expert assistant evaluating the time to completion required for job tasks. Your goal is to estimate the time range needed for a skilled human to complete the following job task remotely, without supervision. | ||||
| 
 | ||||
| Provide a lower and upper bound estimate for the time to completion. These bounds should capture the time within which approximately 80% of instances of performing this specific task are typically completed by a qualified individual. | ||||
| 
 | ||||
| Base your estimate on the provided task description, its associated activities, and the occupational context. Your estimate must be in one of the allowed units: minute, hour, day, week, month, trimester, semester, year. | ||||
| """.strip() | ||||
| 
 | ||||
| USER_MESSAGE_TEMPLATE = """ | ||||
| Please estimate the time range for the following remote task: | ||||
| 
 | ||||
| **Task Description:** {task} | ||||
| **Relevant activities for the task:** | ||||
| {dwas} | ||||
| 
 | ||||
| **Occupation Category:** {occupation_title} | ||||
| **Occupation Description:** {occupation_description} | ||||
| 
 | ||||
| Consider the complexity and the typical steps involved. | ||||
| """.strip() | ||||
| 
 | ||||
| ALLOWED_UNITS = [ | ||||
|     "minute", | ||||
|     "hour", | ||||
|     "day", | ||||
|     "week", | ||||
|     "month", | ||||
|     "trimester", | ||||
|     "semester", | ||||
|     "year", | ||||
| ] | ||||
| 
 | ||||
| SCHEMA_FOR_VALIDATION = { | ||||
|     "name": "estimate_time", | ||||
|     "strict": True,  # Enforce schema adherence | ||||
|     "schema": { | ||||
|         "type": "object", | ||||
|         "properties": { | ||||
|             "lower_bound_estimate": { | ||||
|                 "type": "object", | ||||
|                 "properties": { | ||||
|                     "quantity": { | ||||
|                         "type": "number", | ||||
|                         "description": "The numerical value for the lower bound of the estimate.", | ||||
|                     }, | ||||
|                     "unit": { | ||||
|                         "type": "string", | ||||
|                         "enum": ALLOWED_UNITS, | ||||
|                         "description": "The unit of time for the lower bound.", | ||||
|                     }, | ||||
|                 }, | ||||
|                 "required": ["quantity", "unit"], | ||||
|                 "additionalProperties": False, | ||||
|             }, | ||||
|             "upper_bound_estimate": { | ||||
|                 "type": "object", | ||||
|                 "properties": { | ||||
|                     "quantity": { | ||||
|                         "type": "number", | ||||
|                         "description": "The numerical value for the upper bound of the estimate.", | ||||
|                     }, | ||||
|                     "unit": { | ||||
|                         "type": "string", | ||||
|                         "enum": ALLOWED_UNITS, | ||||
|                         "description": "The unit of time for the upper bound.", | ||||
|                     }, | ||||
|                 }, | ||||
|                 "required": ["quantity", "unit"], | ||||
|                 "additionalProperties": False, | ||||
|             }, | ||||
|         }, | ||||
|         "required": ["lower_bound_estimate", "upper_bound_estimate"], | ||||
|         "additionalProperties": False, | ||||
|     }, | ||||
| } | ||||
| 
 | ||||
| 
 | ||||
| def save_dataframe(df_to_save, filename): | ||||
| 
 | ||||
|     """Saves the DataFrame to the specified CSV file using atomic write.""" | ||||
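|     # Define the temp path before the try block so the cleanup code in the | ||||
|     # except branch can always reference it. | ||||
|     temp_filename = filename + ".tmp" | ||||
|     try: | ||||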
|         df_to_save.to_csv(temp_filename, encoding="utf-8-sig", index=False) | ||||
|         os.replace(temp_filename, filename) | ||||
|     except Exception as e: | ||||
|         print(f"--- Error saving DataFrame to {filename}: {e} ---") | ||||
|         if os.path.exists(temp_filename): | ||||
|             try: | ||||
|                 os.remove(temp_filename) | ||||
|             except Exception as remove_err: | ||||
|                 print( | ||||
|                     f"--- Error removing temporary save file {temp_filename}: {remove_err} ---" | ||||
|                 ) | ||||
| 
 | ||||
| def create_task_estimates(): | ||||
|     try: | ||||
|         # Read the CSV | ||||
|         if os.path.exists(FILENAME): | ||||
|             df = pd.read_csv(FILENAME, encoding="utf-8-sig") | ||||
|             print(f"Successfully read {len(df)} rows from {FILENAME}.") | ||||
| 
 | ||||
|             estimate_columns_spec = { | ||||
|                 "lb_estimate_qty": float, | ||||
|                 "lb_estimate_unit": object, | ||||
|                 "ub_estimate_qty": float, | ||||
|                 "ub_estimate_unit": object, | ||||
|             } | ||||
|             save_needed = False | ||||
| 
 | ||||
|             for col_name, target_dtype in estimate_columns_spec.items(): | ||||
|                 if col_name not in df.columns: | ||||
|                     # Initialize with a type-compatible missing value | ||||
|                     if target_dtype == float: | ||||
|                         df[col_name] = np.nan | ||||
|                     else:  # object | ||||
|                         df[col_name] = pd.NA | ||||
|                     df[col_name] = df[col_name].astype(target_dtype)  # Enforce dtype | ||||
|                     print(f"Added '{col_name}' column as {df[col_name].dtype}.") | ||||
|                     save_needed = True | ||||
|                 else: | ||||
|                     # Column exists, ensure correct dtype | ||||
|                     current_pd_dtype = df[col_name].dtype | ||||
|                     expected_pd_dtype = pd.Series(dtype=target_dtype).dtype | ||||
| 
 | ||||
|                     if current_pd_dtype != expected_pd_dtype: | ||||
|                         try: | ||||
|                             if target_dtype == float: | ||||
|                                 df[col_name] = pd.to_numeric(df[col_name], errors="coerce") | ||||
|                             else:  # object | ||||
|                                 df[col_name] = df[col_name].astype(object) | ||||
|                             print( | ||||
|                                 f"Corrected dtype of '{col_name}' to {df[col_name].dtype}." | ||||
|                             ) | ||||
|                             save_needed = True | ||||
|                         except Exception as e: | ||||
|                             print( | ||||
|                                 f"Warning: Could not convert column '{col_name}' to {target_dtype}: {e}. Current dtype: {current_pd_dtype}" | ||||
|                             ) | ||||
| 
 | ||||
|                 # Standardize missing values (e.g., empty strings to NA/NaN) | ||||
|                 # Replace common missing placeholders with pd.NA first | ||||
|                 df[col_name] = df[col_name].replace(["", None], pd.NA) | ||||
|                 if target_dtype == float: | ||||
|                     # For float columns, ensure they are numeric and use np.nan after replacement | ||||
|                     df[col_name] = pd.to_numeric(df[col_name], errors="coerce") | ||||
| 
 | ||||
|             if save_needed: | ||||
|                 print(f"Saving {FILENAME} after adding/adjusting estimate columns.") | ||||
|                 save_dataframe(df, FILENAME) | ||||
|         else: | ||||
|             print( | ||||
|                 f"Error: {FILENAME} not found. Please ensure the file exists and contains task data." | ||||
|             ) | ||||
|             exit() | ||||
|     except FileNotFoundError: | ||||
|         print( | ||||
|             f"Error: {FILENAME} not found. Please ensure the file exists and contains task data." | ||||
|         ) | ||||
|         exit() | ||||
|     except Exception as e: | ||||
|         print(f"Error reading or initializing {FILENAME}: {e}") | ||||
|         exit() | ||||
| 
 | ||||
|     # --- Identify Rows to Process --- | ||||
|     # We'll check for NaN in one of the primary quantity columns. | ||||
|     unprocessed_mask = df["lb_estimate_qty"].isna() | ||||
|     if unprocessed_mask.any(): | ||||
|         start_index = unprocessed_mask.idxmax()  # Finds the index of the first True value | ||||
|         print(f"Resuming processing. First unprocessed row found at index {start_index}.") | ||||
|         df_to_process = df.loc[unprocessed_mask].copy() | ||||
|         original_indices = df_to_process.index  # Keep track of original indices | ||||
|     else: | ||||
|         print( | ||||
|             "All rows seem to have estimates already (based on 'lb_estimate_qty'). Exiting." | ||||
|         ) | ||||
|         exit() | ||||
| 
 | ||||
| 
 | ||||
|     # --- Prepare messages for batch completion (only for rows needing processing) --- | ||||
|     messages_list = [] | ||||
|     skipped_rows_indices = [] | ||||
|     valid_original_indices = [] | ||||
| 
 | ||||
|     if not df_to_process.empty: | ||||
|         required_cols = ["task", "occupation_title", "occupation_description", "dwas"] | ||||
|         print( | ||||
|             f"Preparing messages for up to {len(df_to_process)} rows starting from original index {original_indices[0] if len(original_indices) > 0 else 'N/A'}..." | ||||
|         ) | ||||
|         print(f"Checking for required columns: {required_cols}") | ||||
| 
 | ||||
|         for index, row in df_to_process.iterrows(): | ||||
|             missing_or_empty = [] | ||||
|             for col in required_cols: | ||||
|                 if col not in row or pd.isna(row[col]) or str(row[col]).strip() == "": | ||||
|                     missing_or_empty.append(col) | ||||
| 
 | ||||
|             if missing_or_empty: | ||||
|                 print( | ||||
|                     f"Warning: Skipping row original index {index} due to missing/empty required data in columns: {', '.join(missing_or_empty)}." | ||||
|                 ) | ||||
|                 skipped_rows_indices.append(index) | ||||
|                 continue | ||||
| 
 | ||||
|             try: | ||||
|                 user_message = USER_MESSAGE_TEMPLATE.format( | ||||
|                     task=row["task"], | ||||
|                     occupation_title=row["occupation_title"], | ||||
|                     occupation_description=row["occupation_description"], | ||||
|                     dwas=row["dwas"], | ||||
|                 ) | ||||
|             except KeyError as e: | ||||
|                 print( | ||||
|                     f"Error: Skipping row original index {index} due to formatting error - missing key: {e}. Check USER_MESSAGE_TEMPLATE and CSV columns." | ||||
|                 ) | ||||
|                 skipped_rows_indices.append(index) | ||||
|                 continue | ||||
| 
 | ||||
|             messages_for_row = [ | ||||
|                 {"role": "system", "content": SYSTEM_PROMPT}, | ||||
|                 {"role": "user", "content": user_message}, | ||||
|             ] | ||||
|             messages_list.append(messages_for_row) | ||||
|             valid_original_indices.append(index)  # This is the original DataFrame index | ||||
| 
 | ||||
|         print( | ||||
|             f"Prepared {len(messages_list)} valid message sets for batch completion (skipped {len(skipped_rows_indices)} rows)." | ||||
|         ) | ||||
|         if not messages_list: | ||||
|             print("No valid rows found to process after checking required data. Exiting.") | ||||
|             exit() | ||||
|     else: | ||||
|         print( | ||||
|             "No rows found needing processing (df_to_process is empty)." | ||||
|         )  # Should have been caught by earlier check | ||||
|         exit() | ||||
| 
 | ||||
| 
 | ||||
|     # --- Call batch_completion in chunks with rate limiting and periodic saving --- | ||||
|     total_messages_to_send = len(messages_list) | ||||
|     num_chunks = math.ceil(total_messages_to_send / CHUNK_SIZE) | ||||
| 
 | ||||
|     print( | ||||
|         f"\nStarting batch completion for {total_messages_to_send} items in {num_chunks} chunks..." | ||||
|     ) | ||||
| 
 | ||||
|     overall_start_time = time.time() | ||||
|     processed_count_total = 0 | ||||
| 
 | ||||
|     for i in range(num_chunks): | ||||
|         chunk_start_message_index = i * CHUNK_SIZE | ||||
|         chunk_end_message_index = min((i + 1) * CHUNK_SIZE, total_messages_to_send) | ||||
|         message_chunk = messages_list[chunk_start_message_index:chunk_end_message_index] | ||||
|         # Get corresponding original DataFrame indices for this chunk | ||||
|         chunk_original_indices = valid_original_indices[ | ||||
|             chunk_start_message_index:chunk_end_message_index | ||||
|         ] | ||||
| 
 | ||||
|         if not message_chunk: | ||||
|             continue | ||||
| 
 | ||||
|         min_idx_disp = min(chunk_original_indices) if chunk_original_indices else "N/A" | ||||
|         max_idx_disp = max(chunk_original_indices) if chunk_original_indices else "N/A" | ||||
|         print( | ||||
|             f"\nProcessing chunk {i + 1}/{num_chunks} (Messages {chunk_start_message_index + 1}-{chunk_end_message_index} of this run)..." | ||||
|             f" Corresponding to original indices: {min_idx_disp} - {max_idx_disp}" | ||||
|         ) | ||||
|         chunk_start_time = time.time() | ||||
|         responses = [] | ||||
|         try: | ||||
|             print(f"Sending {len(message_chunk)} requests for chunk {i + 1}...") | ||||
|             responses = litellm.batch_completion( | ||||
|                 model=MODEL, | ||||
|                 messages=message_chunk, | ||||
|                 response_format={ | ||||
|                     "type": "json_schema", | ||||
|                     "json_schema": SCHEMA_FOR_VALIDATION, | ||||
|                 }, | ||||
|                 num_retries=3, | ||||
|                 # request_timeout=60 # Optional: uncomment if needed | ||||
|             ) | ||||
|             print(f"Chunk {i + 1} API call completed.") | ||||
| 
 | ||||
|         except Exception as e: | ||||
|             print(f"Error during litellm.batch_completion for chunk {i + 1}: {e}") | ||||
|             responses = [None] * len( | ||||
|                 message_chunk | ||||
|             )  # Ensure responses list matches message_chunk length for processing loop | ||||
| 
 | ||||
|         # --- Process responses for the current chunk --- | ||||
|         chunk_updates = {}  # To store {original_df_index: {qty/unit data}} | ||||
|         successful_in_chunk = 0 | ||||
|         failed_in_chunk = 0 | ||||
| 
 | ||||
|         if responses and len(responses) == len(message_chunk): | ||||
|             for j, response in enumerate(responses): | ||||
|                 original_df_index = chunk_original_indices[j] | ||||
| 
 | ||||
|                 # Initialize values for this item | ||||
|                 lb_qty_val, lb_unit_val, ub_qty_val, ub_unit_val = None, None, None, None | ||||
|                 content_str = None | ||||
| 
 | ||||
|                 if response is None: | ||||
|                     print( | ||||
|                         f"Skipping processing for original index {original_df_index} due to API call failure for this item (response is None)." | ||||
|                     ) | ||||
|                     failed_in_chunk += 1 | ||||
|                     continue | ||||
| 
 | ||||
|                 try: | ||||
|                     if ( | ||||
|                         response.choices | ||||
|                         and response.choices[0].message | ||||
|                         and response.choices[0].message.content | ||||
|                     ): | ||||
|                         content_str = response.choices[0].message.content | ||||
|                         estimate_data = json.loads(content_str)  # Can raise JSONDecodeError | ||||
| 
 | ||||
|                         lower_bound_dict = estimate_data.get("lower_bound_estimate") | ||||
|                         upper_bound_dict = estimate_data.get("upper_bound_estimate") | ||||
| 
 | ||||
|                         valid_response_structure = isinstance( | ||||
|                             lower_bound_dict, dict | ||||
|                         ) and isinstance(upper_bound_dict, dict) | ||||
| 
 | ||||
|                         if valid_response_structure: | ||||
|                             lb_qty_raw = lower_bound_dict.get("quantity") | ||||
|                             lb_unit_raw = lower_bound_dict.get("unit") | ||||
|                             ub_qty_raw = upper_bound_dict.get("quantity") | ||||
|                             ub_unit_raw = upper_bound_dict.get("unit") | ||||
| 
 | ||||
|                             is_valid_item = True | ||||
|                             # Validate LB Qty | ||||
|                             if ( | ||||
|                                 not isinstance(lb_qty_raw, (int, float)) | ||||
|                                 or math.isnan(float(lb_qty_raw)) | ||||
|                                 or float(lb_qty_raw) < 0 | ||||
|                             ): | ||||
|                                 print( | ||||
|                                     f"Warning: Invalid lb_quantity for original index {original_df_index}: {lb_qty_raw}" | ||||
|                                 ) | ||||
|                                 is_valid_item = False | ||||
|                             else: | ||||
|                                 lb_qty_val = float(lb_qty_raw) | ||||
| 
 | ||||
|                             # Validate UB Qty | ||||
|                             if ( | ||||
|                                 not isinstance(ub_qty_raw, (int, float)) | ||||
|                                 or math.isnan(float(ub_qty_raw)) | ||||
|                                 or float(ub_qty_raw) < 0 | ||||
|                             ): | ||||
|                                 print( | ||||
|                                     f"Warning: Invalid ub_quantity for original index {original_df_index}: {ub_qty_raw}" | ||||
|                                 ) | ||||
|                                 is_valid_item = False | ||||
|                             else: | ||||
|                                 ub_qty_val = float(ub_qty_raw) | ||||
| 
 | ||||
|                             # Validate Units | ||||
|                             if lb_unit_raw not in ALLOWED_UNITS: | ||||
|                                 print( | ||||
|                                     f"Warning: Invalid lb_unit for original index {original_df_index}: '{lb_unit_raw}'" | ||||
|                                 ) | ||||
|                                 is_valid_item = False | ||||
|                             else: | ||||
|                                 lb_unit_val = lb_unit_raw | ||||
| 
 | ||||
|                             if ub_unit_raw not in ALLOWED_UNITS: | ||||
|                                 print( | ||||
|                                     f"Warning: Invalid ub_unit for original index {original_df_index}: '{ub_unit_raw}'" | ||||
|                                 ) | ||||
|                                 is_valid_item = False | ||||
|                             else: | ||||
|                                 ub_unit_val = ub_unit_raw | ||||
| 
 | ||||
|                             if is_valid_item: | ||||
|                                 successful_in_chunk += 1 | ||||
|                                 chunk_updates[original_df_index] = { | ||||
|                                     "lb_estimate_qty": lb_qty_val, | ||||
|                                     "lb_estimate_unit": lb_unit_val, | ||||
|                                     "ub_estimate_qty": ub_qty_val, | ||||
|                                     "ub_estimate_unit": ub_unit_val, | ||||
|                                 } | ||||
|                             else: | ||||
|                                 failed_in_chunk += ( | ||||
|                                     1  # Values remain None if not fully valid | ||||
|                                 ) | ||||
|                         else: | ||||
|                             print( | ||||
|                                 f"Warning: Missing or malformed estimate dicts in JSON for original index {original_df_index}. Content: '{content_str}'" | ||||
|                             ) | ||||
|                             failed_in_chunk += 1 | ||||
|                     else: | ||||
|                         finish_reason = ( | ||||
|                             response.choices[0].finish_reason | ||||
|                             if (response.choices and response.choices[0].finish_reason) | ||||
|                             else "unknown" | ||||
|                         ) | ||||
|                         error_message = ( | ||||
|                             response.choices[0].message.content | ||||
|                             if ( | ||||
|                                 response.choices | ||||
|                                 and response.choices[0].message | ||||
|                                 and response.choices[0].message.content | ||||
|                             ) | ||||
|                             else "No content in message." | ||||
|                         ) | ||||
|                         print( | ||||
|                             f"Warning: Received non-standard or empty response content for original index {original_df_index}. " | ||||
|                             f"Finish Reason: '{finish_reason}'. Message: '{error_message}'. Raw Choices: {response.choices}" | ||||
|                         ) | ||||
|                         failed_in_chunk += 1 | ||||
| 
 | ||||
|                 except json.JSONDecodeError: | ||||
|                     print( | ||||
|                         f"Warning: Could not decode JSON for original index {original_df_index}. Content received: '{content_str}'" | ||||
|                     ) | ||||
|                     failed_in_chunk += 1 | ||||
|                 except AttributeError as ae: | ||||
|                     print( | ||||
|                         f"Warning: Missing expected attribute processing response for original index {original_df_index}: {ae}. Response: {response}" | ||||
|                     ) | ||||
|                     failed_in_chunk += 1 | ||||
|                 except Exception as e: | ||||
|                     print( | ||||
|                         f"Warning: An unexpected error occurred processing response for original index {original_df_index}: {type(e).__name__} - {e}. Response: {response}" | ||||
|                     ) | ||||
|                     failed_in_chunk += 1 | ||||
|         else: | ||||
|             print( | ||||
|                 f"Warning: Mismatch between number of responses ({len(responses) if responses else 0}) " | ||||
|                 f"and messages sent ({len(message_chunk)}) for chunk {i + 1}, or no responses. Marking all as failed." | ||||
|             ) | ||||
|             failed_in_chunk = len( | ||||
|                 message_chunk | ||||
|             )  # All items in this chunk are considered failed if response array is problematic | ||||
| 
 | ||||
|         print( | ||||
|             f"Chunk {i + 1} processing summary: Success={successful_in_chunk}, Failed/Skipped={failed_in_chunk}" | ||||
|         ) | ||||
|         processed_count_total += successful_in_chunk | ||||
| 
 | ||||
|         # --- Update Main DataFrame and Save Periodically --- | ||||
|         if chunk_updates: | ||||
|             print( | ||||
|                 f"Updating main DataFrame with {len(chunk_updates)} new estimates for chunk {i + 1}..." | ||||
|             ) | ||||
|             for idx, estimates in chunk_updates.items(): | ||||
|                 if idx in df.index: | ||||
|                     df.loc[idx, "lb_estimate_qty"] = estimates["lb_estimate_qty"] | ||||
|                     df.loc[idx, "lb_estimate_unit"] = estimates["lb_estimate_unit"] | ||||
|                     df.loc[idx, "ub_estimate_qty"] = estimates["ub_estimate_qty"] | ||||
|                     df.loc[idx, "ub_estimate_unit"] = estimates["ub_estimate_unit"] | ||||
| 
 | ||||
|             print(f"Saving progress to {FILENAME}...") | ||||
|             save_dataframe(df, FILENAME) | ||||
|         else: | ||||
|             print(f"No successful estimates obtained in chunk {i + 1} to save.") | ||||
| 
 | ||||
|         # --- Rate Limiting Pause --- | ||||
|         chunk_end_time = time.time() | ||||
|         chunk_duration = chunk_end_time - chunk_start_time | ||||
|         print(f"Chunk {i + 1} took {chunk_duration:.2f} seconds.") | ||||
| 
 | ||||
|         if i < num_chunks - 1:  # No pause after the last chunk | ||||
|             # Calculate ideal time per request based on rate limit | ||||
|             time_per_request = SECONDS_PER_MINUTE / RATE_LIMIT if RATE_LIMIT > 0 else 0 | ||||
|             # Calculate minimum duration this chunk should have taken to respect rate limit | ||||
|             min_chunk_duration_for_rate = len(message_chunk) * time_per_request | ||||
|             # Calculate pause needed | ||||
|             pause_needed = max(0, min_chunk_duration_for_rate - chunk_duration) | ||||
| 
 | ||||
|             if pause_needed > 0: | ||||
|                 print( | ||||
|                     f"Pausing for {pause_needed:.2f} seconds to respect rate limit ({RATE_LIMIT}/min)..." | ||||
|                 ) | ||||
|                 time.sleep(pause_needed) | ||||
| 
 | ||||
|     overall_end_time = time.time() | ||||
|     total_duration_minutes = (overall_end_time - overall_start_time) / 60 | ||||
|     print( | ||||
|         f"\nBatch completion finished." | ||||
|         f" Processed {processed_count_total} new estimates in this run in {total_duration_minutes:.2f} minutes." | ||||
|     ) | ||||
| 
 | ||||
|     print(f"Performing final save to {FILENAME}...") | ||||
|     save_dataframe(df, FILENAME) | ||||
| 
 | ||||
|     print("\nScript finished.") | ||||
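The rate-limiting pause computed at the end of each chunk can be isolated into a small pure function, which makes the arithmetic easier to check. This is a minimal sketch, not the script's own helper: the name `pause_needed` and its parameters are hypothetical, and it assumes the same rule as the loop above (each request is allotted `60 / RATE_LIMIT` seconds, and the chunk sleeps for whatever part of that budget it has not already used).

```python
SECONDS_PER_MINUTE = 60


def pause_needed(num_requests: int, elapsed_seconds: float, rate_limit_per_minute: int) -> float:
    """Return how many seconds to sleep so a chunk respects a per-minute request limit.

    Hypothetical helper: mirrors the pause arithmetic in the chunk loop.
    """
    if rate_limit_per_minute <= 0:
        # No usable limit configured, so no pause is enforced.
        return 0.0
    # Minimum time the chunk should occupy at the allowed request rate.
    min_chunk_duration = num_requests * SECONDS_PER_MINUTE / rate_limit_per_minute
    # Sleep only for the part of that budget not already spent on the API call.
    return max(0.0, min_chunk_duration - elapsed_seconds)
```

With the values in this script (`CHUNK_SIZE = 10`, `RATE_LIMIT = 5000`), a chunk has a budget of 0.12 seconds, so in practice the pause only triggers when the API call returns almost instantly.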
|  | @ -1,425 +0,0 @@ | |||
| # Import necessary libraries | ||||
| import pandas as pd | ||||
| import litellm  # Ensure this is installed in your environment | ||||
| import dotenv | ||||
| import os | ||||
| import time | ||||
| import json | ||||
| import math | ||||
| import numpy as np  # Added for NaN handling | ||||
| 
 | ||||
| # Load environment variables | ||||
| dotenv.load_dotenv(override=True) | ||||
| 
 | ||||
| # --- Configuration --- | ||||
| MODEL = "gpt-4.1-mini" | ||||
| # Consider adjusting RATE_LIMIT based on the specific model's actual limits | ||||
| RATE_LIMIT = 5000  # Max requests per minute | ||||
| # Smaller chunk size results in more frequent saving but potentially slower overall processing | ||||
| CHUNK_SIZE = 10  # Process messages in chunks of this size | ||||
| SECONDS_PER_MINUTE = 60 | ||||
| # **UPDATED:** Filename changed as requested | ||||
| FILENAME = "task_to_estimate.csv"  # Use a single filename for in-place updates | ||||
| 
 | ||||
| # --- Prompts and Schema --- | ||||
| SYSTEM_PROMPT = """ | ||||
| You are an expert assistant evaluating the time required for job tasks. Your goal is to estimate the 'effective time' range needed for a skilled human to complete the following job task **remotely**, without supervision. | ||||
| 
 | ||||
| 'Effective time' is the active, focused work duration required to complete the task. Crucially, **exclude all waiting periods, delays, or time spent on other unrelated activities**. Think of it as the continuous, productive time investment needed if the worker could pause and resume instantly without cost. | ||||
| 
 | ||||
| Provide a lower and upper bound estimate for the 'effective time'. These bounds should capture the time within which approximately 80% of instances of performing this specific task are typically completed by a qualified individual. | ||||
| 
 | ||||
| You MUST output a JSON object containing the lower and upper bound estimates. Select your lower and upper bound estimates **only** from the following discrete durations: | ||||
| ['10 minutes', '30 minutes', '1 hour', '2 hours', '4 hours', '8 hours', '16 hours', '3 days', '1 week', '3 weeks', '6 weeks', '3 months', '6 months', '1 year', '3 years', '10 years'] | ||||
| 
 | ||||
| Example Output Format: | ||||
| { | ||||
|   "lower_bound_estimate": "1 hour", | ||||
|   "upper_bound_estimate": "4 hours" | ||||
| } | ||||
| 
 | ||||
| Base your estimate on the provided task description, its associated activities, and the occupational context. Only output the JSON object. | ||||
| """.strip()  # Modified prompt slightly to emphasize JSON output for response_format mode | ||||
| 
 | ||||
| # Template uses the correct column names based on previous update | ||||
| USER_MESSAGE_TEMPLATE = """ | ||||
| Please estimate the effective time range for the following remote task: | ||||
| 
 | ||||
| **Occupation Category:** {occupation_title} | ||||
| **Occupation Description:** {occupation_description} | ||||
| 
 | ||||
| **Task Description:** {task} | ||||
| **Relevant steps for the task:** | ||||
| {dwas} | ||||
| 
 | ||||
| Consider the complexity and the typical steps involved. Output ONLY the JSON object with keys "lower_bound_estimate" and "upper_bound_estimate". | ||||
| """.strip()  # Modified prompt slightly to emphasize JSON output for response_format mode | ||||
| 
 | ||||
| 
 | ||||
| ALLOWED_DURATIONS = [ | ||||
|     "10 minutes", | ||||
|     "30 minutes", | ||||
|     "1 hour", | ||||
|     "2 hours", | ||||
|     "4 hours", | ||||
|     "8 hours", | ||||
|     "16 hours", | ||||
|     "3 days", | ||||
|     "1 week", | ||||
|     "3 weeks", | ||||
|     "6 weeks", | ||||
|     "3 months", | ||||
|     "6 months", | ||||
|     "1 year", | ||||
|     "3 years", | ||||
|     "10 years", | ||||
| ] | ||||
| 
 | ||||
| # Schema definition for litellm's response_format validation | ||||
| # **REVERTED:** Using the schema definition compatible with response_format | ||||
| SCHEMA_FOR_VALIDATION = { | ||||
|     "name": "get_time_estimate", | ||||
|     "strict": True, | ||||
|     "schema": { | ||||
|         "type": "object", | ||||
|         "properties": { | ||||
|             "lower_bound_estimate": {"type": "string", "enum": ALLOWED_DURATIONS}, | ||||
|             "upper_bound_estimate": {"type": "string", "enum": ALLOWED_DURATIONS}, | ||||
|         }, | ||||
|         "required": ["lower_bound_estimate", "upper_bound_estimate"], | ||||
|         "additionalProperties": False, | ||||
|     }, | ||||
| } | ||||
| 
 | ||||
| 
 | ||||
| # --- Function to Save DataFrame In-Place --- | ||||
| def save_dataframe(df_to_save, filename): | ||||
|     """Saves the DataFrame to the specified CSV file using atomic write.""" | ||||
|     try: | ||||
|         # Use a temporary file for atomic write to prevent corruption if script crashes during save | ||||
|         temp_filename = filename + ".tmp" | ||||
|         df_to_save.to_csv(temp_filename, encoding="utf-8-sig", index=False) | ||||
|         os.replace(temp_filename, filename)  # Atomic replace | ||||
|         # print(f"--- DataFrame successfully saved to {filename} ---") # Optional: uncomment for verbose logging | ||||
|     except Exception as e: | ||||
|         print(f"--- Error saving DataFrame to {filename}: {e} ---") | ||||
|         # Clean up temp file if rename failed | ||||
|         if os.path.exists(temp_filename): | ||||
|             try: | ||||
|                 os.remove(temp_filename) | ||||
|             except Exception as remove_err: | ||||
|                 print( | ||||
|                     f"--- Error removing temporary save file {temp_filename}: {remove_err} ---" | ||||
|                 ) | ||||
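The write-then-replace pattern used by `save_dataframe` can be shown on its own, independent of pandas. The sketch below is an illustration under assumptions rather than the script's implementation: the function name `atomic_write_text` is hypothetical, and it uses `tempfile.mkstemp` in the target directory instead of a fixed `.tmp` suffix, so that two concurrent writers cannot overwrite each other's temporary file.

```python
import os
import tempfile


def atomic_write_text(path: str, text: str) -> None:
    """Write text to path so readers never see a partially written file.

    Hypothetical helper illustrating the temp-file-plus-os.replace pattern.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        # os.replace overwrites the destination in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the leftover temp file if writing or replacing failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Keeping the temporary file next to the destination matters because `os.replace` may fail when the source and destination are on different filesystems.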
| 
 | ||||
| 
 | ||||
| # --- Main Script Logic --- | ||||
| try: | ||||
|     # Read the CSV | ||||
|     if os.path.exists(FILENAME): | ||||
|         df = pd.read_csv(FILENAME, encoding="utf-8-sig") | ||||
|         print(f"Successfully read {len(df)} rows from {FILENAME}.") | ||||
|         # Check if estimate columns exist, add them if not, initialized with NaN | ||||
|         save_needed = False | ||||
|         if "lb_estimate" not in df.columns: | ||||
|             df["lb_estimate"] = np.nan | ||||
|             print("Added 'lb_estimate' column.") | ||||
|             save_needed = True | ||||
|         # Ensure column is float/object type to hold NaNs and strings | ||||
|         elif not pd.api.types.is_object_dtype( | ||||
|             df["lb_estimate"] | ||||
|         ) and not pd.api.types.is_float_dtype(df["lb_estimate"]): | ||||
|             df["lb_estimate"] = df["lb_estimate"].astype(object) | ||||
| 
 | ||||
|         if "ub_estimate" not in df.columns: | ||||
|             df["ub_estimate"] = np.nan | ||||
|             print("Added 'ub_estimate' column.") | ||||
|             save_needed = True | ||||
|         elif not pd.api.types.is_object_dtype( | ||||
|             df["ub_estimate"] | ||||
|         ) and not pd.api.types.is_float_dtype(df["ub_estimate"]): | ||||
|             df["ub_estimate"] = df["ub_estimate"].astype(object) | ||||
| 
 | ||||
|         # Fill potential empty strings or other placeholders with actual NaN for consistency | ||||
|         df["lb_estimate"] = df["lb_estimate"].replace(["", None], np.nan) | ||||
|         df["ub_estimate"] = df["ub_estimate"].replace(["", None], np.nan) | ||||
| 
 | ||||
|         if save_needed: | ||||
|             print(f"Saving {FILENAME} after adding missing estimate columns.") | ||||
|             save_dataframe(df, FILENAME) | ||||
|     else: | ||||
|         print(f"Error: {FILENAME} not found. Please ensure the file exists.") | ||||
|         exit() | ||||
| 
 | ||||
| except FileNotFoundError: | ||||
|     print(f"Error: {FILENAME} not found. Please ensure the file exists.") | ||||
|     exit() | ||||
| except Exception as e: | ||||
|     print(f"Error reading or initializing {FILENAME}: {e}") | ||||
|     exit() | ||||
| 
 | ||||
| 
 | ||||
| # --- Identify Rows to Process --- | ||||
| unprocessed_mask = df["lb_estimate"].isna() | ||||
| start_index = unprocessed_mask.idxmax()  # Finds the index of the first True value | ||||
| 
 | ||||
| if unprocessed_mask.any() and pd.isna(df.loc[start_index, "lb_estimate"]): | ||||
|     print(f"Resuming processing from index {start_index}.") | ||||
|     df_to_process = df.loc[unprocessed_mask].copy() | ||||
|     original_indices = df_to_process.index  # Keep track of original indices | ||||
| else: | ||||
|     print("All rows seem to have estimates already. Exiting.") | ||||
|     exit() | ||||
| 
 | ||||
| 
 | ||||
| # --- Prepare messages for batch completion (only for rows needing processing) --- | ||||
| messages_list = [] | ||||
| skipped_rows_indices = [] | ||||
| valid_original_indices = [] | ||||
| 
 | ||||
| if not df_to_process.empty: | ||||
|     # Use the correct column names | ||||
|     required_cols = ["task", "occupation_title", "occupation_description", "dwas"] | ||||
|     print( | ||||
|         f"Preparing messages for up to {len(df_to_process)} rows starting from index {start_index}..." | ||||
|     ) | ||||
|     print(f"Checking for required columns: {required_cols}") | ||||
| 
 | ||||
|     for index, row in df_to_process.iterrows(): | ||||
|         missing_or_empty = [] | ||||
|         for col in required_cols: | ||||
|             if col not in row or pd.isna(row[col]) or str(row[col]).strip() == "": | ||||
|                 missing_or_empty.append(col) | ||||
| 
 | ||||
|         if missing_or_empty: | ||||
|             print( | ||||
|                 f"Warning: Skipping row index {index} due to missing/empty required data in columns: {', '.join(missing_or_empty)}." | ||||
|             ) | ||||
|             skipped_rows_indices.append(index) | ||||
|             continue | ||||
| 
 | ||||
|         # Format user message using the template with correct column names | ||||
|         try: | ||||
|             user_message = USER_MESSAGE_TEMPLATE.format( | ||||
|                 task=row["task"], | ||||
|                 occupation_title=row["occupation_title"], | ||||
|                 occupation_description=row["occupation_description"], | ||||
|                 dwas=row["dwas"], | ||||
|             ) | ||||
|         except KeyError as e: | ||||
|             print( | ||||
|                 f"Error: Skipping row index {index} due to formatting error - missing key: {e}. Check USER_MESSAGE_TEMPLATE and CSV columns." | ||||
|             ) | ||||
|             skipped_rows_indices.append(index) | ||||
|             continue | ||||
| 
 | ||||
|         messages_for_row = [ | ||||
|             {"role": "system", "content": SYSTEM_PROMPT}, | ||||
|             {"role": "user", "content": user_message}, | ||||
|         ] | ||||
|         messages_list.append(messages_for_row) | ||||
|         valid_original_indices.append(index) | ||||
| 
 | ||||
|     print( | ||||
|         f"Prepared {len(messages_list)} valid message sets for batch completion (skipped {len(skipped_rows_indices)} rows)." | ||||
|     ) | ||||
|     if not messages_list: | ||||
|         print("No valid rows found to process after checking required data. Exiting.") | ||||
|         exit() | ||||
| else: | ||||
|     print("No rows found needing processing.") | ||||
|     exit() | ||||
| 
 | ||||
| 
 | ||||
| # --- Call batch_completion in chunks with rate limiting and periodic saving --- | ||||
| total_messages_to_send = len(messages_list) | ||||
| num_chunks = math.ceil(total_messages_to_send / CHUNK_SIZE) | ||||
| 
 | ||||
| print( | ||||
|     f"\nStarting batch completion for {total_messages_to_send} items in {num_chunks} chunks..." | ||||
| ) | ||||
| 
 | ||||
| overall_start_time = time.time() | ||||
| processed_count_total = 0 | ||||
| 
 | ||||
| for i in range(num_chunks): | ||||
|     chunk_start_message_index = i * CHUNK_SIZE | ||||
|     chunk_end_message_index = min((i + 1) * CHUNK_SIZE, total_messages_to_send) | ||||
|     message_chunk = messages_list[chunk_start_message_index:chunk_end_message_index] | ||||
|     chunk_original_indices = valid_original_indices[ | ||||
|         chunk_start_message_index:chunk_end_message_index | ||||
|     ] | ||||
| 
 | ||||
|     if not message_chunk: | ||||
|         continue | ||||
| 
 | ||||
|     min_idx = min(chunk_original_indices) if chunk_original_indices else "N/A" | ||||
|     max_idx = max(chunk_original_indices) if chunk_original_indices else "N/A" | ||||
|     print( | ||||
|         f"\nProcessing chunk {i + 1}/{num_chunks} (Messages {chunk_start_message_index + 1}-{chunk_end_message_index} of this run)..." | ||||
|         f" Corresponding to original indices: {min_idx} - {max_idx}" | ||||
|     ) | ||||
|     chunk_start_time = time.time() | ||||
|     responses = [] | ||||
|     try: | ||||
|         print(f"Sending {len(message_chunk)} requests for chunk {i + 1}...") | ||||
|         # Request structured output via response_format with a JSON schema | ||||
|         responses = litellm.batch_completion( | ||||
|             model=MODEL, | ||||
|             messages=message_chunk, | ||||
|             response_format={ | ||||
|                 "type": "json_schema", | ||||
|                 "json_schema": SCHEMA_FOR_VALIDATION, | ||||
|             }, | ||||
|             num_retries=3, | ||||
|             # request_timeout=60 # Optional: uncomment if needed | ||||
|         ) | ||||
|         print(f"Chunk {i + 1} API call completed.") | ||||
| 
 | ||||
|     except Exception as e: | ||||
|         print(f"Error during litellm.batch_completion for chunk {i + 1}: {e}") | ||||
|         responses = [None] * len(message_chunk) | ||||
| 
 | ||||
|     # --- Process responses for the current chunk --- | ||||
|     chunk_lb_estimates = {} | ||||
|     chunk_ub_estimates = {} | ||||
|     successful_in_chunk = 0 | ||||
|     failed_in_chunk = 0 | ||||
| 
 | ||||
|     if responses and len(responses) == len(message_chunk): | ||||
|         for j, response in enumerate(responses): | ||||
|             original_df_index = chunk_original_indices[j] | ||||
|             lb_estimate = None | ||||
|             ub_estimate = None | ||||
|             content_str = None  # Initialize for potential error logging | ||||
| 
 | ||||
|             if response is None: | ||||
|                 print( | ||||
|                     f"Skipping processing for original index {original_df_index} due to API call failure for this item/chunk." | ||||
|                 ) | ||||
|                 failed_in_chunk += 1 | ||||
|                 continue | ||||
| 
 | ||||
|             try: | ||||
|                 # Check for content in the message, not tool_calls | ||||
|                 if ( | ||||
|                     response.choices | ||||
|                     and response.choices[0].message | ||||
|                     and response.choices[0].message.content  # Check if content exists | ||||
|                 ): | ||||
|                     content_str = response.choices[0].message.content | ||||
|                     # Attempt to parse the JSON string content | ||||
|                     estimate_data = json.loads(content_str) | ||||
|                     lb_estimate = estimate_data.get("lower_bound_estimate") | ||||
|                     ub_estimate = estimate_data.get("upper_bound_estimate") | ||||
| 
 | ||||
|                     # Validate against allowed durations | ||||
|                     if ( | ||||
|                         lb_estimate in ALLOWED_DURATIONS | ||||
|                         and ub_estimate in ALLOWED_DURATIONS | ||||
|                     ): | ||||
|                         successful_in_chunk += 1 | ||||
|                     else: | ||||
|                         print( | ||||
|                             f"Warning: Invalid duration value(s) in JSON for original index {original_df_index}. LB: '{lb_estimate}', UB: '{ub_estimate}'. Setting to None." | ||||
|                         ) | ||||
|                         lb_estimate = None | ||||
|                         ub_estimate = None | ||||
|                         failed_in_chunk += 1 | ||||
|                 else: | ||||
|                     # Handle cases where the response structure is unexpected or indicates an error | ||||
|                     finish_reason = ( | ||||
|                         response.choices[0].finish_reason | ||||
|                         if (response.choices and response.choices[0].finish_reason) | ||||
|                         else "unknown" | ||||
|                     ) | ||||
|                     print( | ||||
|                         f"Warning: Received non-standard or empty response content for original index {original_df_index}. " | ||||
|                         f"Finish Reason: '{finish_reason}'. Raw Response Choices: {response.choices}" | ||||
|                     ) | ||||
|                     failed_in_chunk += 1 | ||||
| 
 | ||||
|             except json.JSONDecodeError: | ||||
|                 # Log content_str which failed parsing | ||||
|                 print( | ||||
|                     f"Warning: Could not decode JSON for original index {original_df_index}. Content received: '{content_str}'" | ||||
|                 ) | ||||
|                 failed_in_chunk += 1 | ||||
|             except AttributeError as ae: | ||||
|                 print( | ||||
|                     f"Warning: Missing expected attribute processing response for original index {original_df_index}: {ae}. Response: {response}" | ||||
|                 ) | ||||
|                 failed_in_chunk += 1 | ||||
|             except Exception as e: | ||||
|                 print( | ||||
|                     f"Warning: An unexpected error occurred processing response for original index {original_df_index}: {type(e).__name__} - {e}. Response: {response}" | ||||
|                 ) | ||||
|                 failed_in_chunk += 1 | ||||
| 
 | ||||
|             # Store successfully parsed results | ||||
|             if lb_estimate is not None: | ||||
|                 chunk_lb_estimates[original_df_index] = lb_estimate | ||||
|             if ub_estimate is not None: | ||||
|                 chunk_ub_estimates[original_df_index] = ub_estimate | ||||
| 
 | ||||
|     else: | ||||
|         print( | ||||
|             f"Warning: Mismatch between number of responses ({len(responses) if responses else 0}) " | ||||
|             f"and messages sent ({len(message_chunk)}) for chunk {i + 1}. Marking all as failed." | ||||
|         ) | ||||
|         failed_in_chunk = len(message_chunk) | ||||
| 
 | ||||
|     print( | ||||
|         f"Chunk {i + 1} processing summary: Success={successful_in_chunk}, Failed/Skipped={failed_in_chunk}" | ||||
|     ) | ||||
|     processed_count_total += successful_in_chunk | ||||
| 
 | ||||
|     # --- Update Main DataFrame and Save Periodically --- | ||||
|     if chunk_lb_estimates or chunk_ub_estimates: | ||||
|         print( | ||||
|             f"Updating main DataFrame with {len(chunk_lb_estimates)} LB and {len(chunk_ub_estimates)} UB estimates for chunk {i + 1}..." | ||||
|         ) | ||||
|         if not pd.api.types.is_object_dtype(df["lb_estimate"]): | ||||
|             df["lb_estimate"] = df["lb_estimate"].astype(object) | ||||
|         if not pd.api.types.is_object_dtype(df["ub_estimate"]): | ||||
|             df["ub_estimate"] = df["ub_estimate"].astype(object) | ||||
| 
 | ||||
|         for idx, lb in chunk_lb_estimates.items(): | ||||
|             if idx in df.index: | ||||
|                 df.loc[idx, "lb_estimate"] = lb | ||||
|         for idx, ub in chunk_ub_estimates.items(): | ||||
|             if idx in df.index: | ||||
|                 df.loc[idx, "ub_estimate"] = ub | ||||
| 
 | ||||
|         print(f"Saving progress to {FILENAME}...") | ||||
|         save_dataframe(df, FILENAME) | ||||
|     else: | ||||
|         print(f"No successful estimates obtained in chunk {i + 1} to save.") | ||||
| 
 | ||||
|     # --- Rate Limiting Pause --- | ||||
|     chunk_end_time = time.time() | ||||
|     chunk_duration = chunk_end_time - chunk_start_time | ||||
|     print(f"Chunk {i + 1} took {chunk_duration:.2f} seconds.") | ||||
| 
 | ||||
|     if i < num_chunks - 1: | ||||
|         time_per_request = SECONDS_PER_MINUTE / RATE_LIMIT if RATE_LIMIT > 0 else 0 | ||||
|         min_chunk_duration_for_rate = len(message_chunk) * time_per_request | ||||
|         pause_needed = max(0, min_chunk_duration_for_rate - chunk_duration) | ||||
| 
 | ||||
|         if pause_needed > 0: | ||||
|             print( | ||||
|                 f"Pausing for {pause_needed:.2f} seconds to respect rate limit ({RATE_LIMIT}/min)..." | ||||
|             ) | ||||
|             time.sleep(pause_needed) | ||||
| 
 | ||||
| overall_end_time = time.time() | ||||
| total_duration_minutes = (overall_end_time - overall_start_time) / 60 | ||||
| print( | ||||
|     f"\nBatch completion finished." | ||||
|     f" Processed {processed_count_total} new estimates in this run in {total_duration_minutes:.2f} minutes." | ||||
| ) | ||||
| 
 | ||||
| print(f"Performing final save check to {FILENAME}...") | ||||
| save_dataframe(df, FILENAME) | ||||
| 
 | ||||
| print("\nScript finished.") | ||||
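The rate-limiting pause at the end of each chunk comes down to a small piece of arithmetic: a chunk of `n` requests should occupy at least `n * (60 / RATE_LIMIT)` seconds, and the script sleeps for whatever part of that time has not already elapsed. The following is a minimal sketch of that calculation, not code from this commit. The function name `pause_needed` and the example values are hypothetical; `SECONDS_PER_MINUTE` is assumed to be 60, and the script's actual `RATE_LIMIT` and `CHUNK_SIZE` values are not shown in this part of the diff.

```python
SECONDS_PER_MINUTE = 60  # assumed value; the script defines its own constant elsewhere


def pause_needed(num_requests: int, elapsed_seconds: float, rate_limit_per_minute: float) -> float:
    """Return how many seconds to sleep so a chunk stays within the rate limit.

    Hypothetical helper that mirrors the pause logic of the batch loop.
    """
    if rate_limit_per_minute <= 0:
        return 0.0  # no limit configured, so never pause
    time_per_request = SECONDS_PER_MINUTE / rate_limit_per_minute
    min_chunk_duration = num_requests * time_per_request
    # Sleep only for the part of the minimum duration not already spent
    return max(0.0, min_chunk_duration - elapsed_seconds)


if __name__ == "__main__":
    # 20 requests at 60 requests/minute should occupy at least 20 s
    print(pause_needed(20, 5.0, 60))   # API call took 5 s, so pause for the remainder
    print(pause_needed(20, 30.0, 60))  # API call already took longer than the minimum
```

Note that this paces chunks on average only: all requests in a chunk are still sent together, so a provider that enforces its limit over short windows may need a smaller chunk size as well.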
							
								
								
									
										2
									
								
								agents.md
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							|  | @ -0,0 +1,2 @@ | |||
| - I use Nix. To run a command, prefix them with `nix develop .#impure -c` | ||||
| - I use uv. To add a package, use: uv add. To run a script use: uv run path/to/script | ||||
							
								
								
									
										2416
									
								
								analysis.ipynb
									
										
									
									
									
								
							
							
						
						
									
									
										
									
									
									
								
							
										
											
												File diff suppressed because one or more lines are too long
											
										
									
								
							
							
								
								
									
										563
									
								
								analysis.py
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							|  | @ -0,0 +1,563 @@ | |||
| import os | ||||
| import litellm | ||||
| import sqlite3 | ||||
| import numpy as np | ||||
| import pandas as pd | ||||
| from google.colab import userdata, files | ||||
| import seaborn as sns | ||||
| import matplotlib.pyplot as plt | ||||
| import matplotlib as mpl | ||||
| 
 | ||||
| os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY') | ||||
| os.environ['GEMINI_API_KEY'] = userdata.get('GEMINI_API_KEY') | ||||
| 
 | ||||
| occupation_major_codes = { | ||||
|     '11': 'Management', | ||||
|     '13': 'Business and Financial Operations', | ||||
|     '15': 'Computer and Mathematical Occupations', | ||||
|     '17': 'Architecture and Engineering', | ||||
|     '19': 'Life, Physical, and Social Science', | ||||
|     '21': 'Community and Social Services', | ||||
|     '23': 'Legal', | ||||
|     '25': 'Education, Training, and Library', | ||||
|     '27': 'Arts, Design, Entertainment, Sports, and Media', | ||||
|     '29': 'Healthcare Practitioners and Technical', | ||||
|     '31': 'Healthcare Support', | ||||
|     '33': 'Protective Service', | ||||
|     '35': 'Food Preparation and Serving Related', | ||||
|     '37': 'Building and Grounds Cleaning and Maintenance', | ||||
|     '39': 'Personal Care and Service', | ||||
|     '41': 'Sales and Related', | ||||
|     '43': 'Office and Administrative Support', | ||||
|     '45': 'Farming, Fishing, and Forestry', | ||||
|     '47': 'Construction and Extraction', | ||||
|     '49': 'Installation, Maintenance, and Repair', | ||||
|     '51': 'Production', | ||||
|     '53': 'Transportation and Material Moving', | ||||
|     '55': 'Military Specific' | ||||
| } | ||||
| 
 | ||||
| gray   = {'50':'#f8fafc','100':'#f1f5f9','200':'#e2e8f0', | ||||
|                    '300':'#cbd5e1','400':'#94a3b8','500':'#64748b', | ||||
|                    '600':'#475569','700':'#334155','800':'#1e293b', | ||||
|                    '900':'#0f172a','950':'#020617'} | ||||
| lime            = {'50': '#f7fee7','100': '#ecfcca','200': '#d8f999', | ||||
|                    '300': '#bbf451','400': '#9ae600','500': '#83cd00', | ||||
|                    '600': '#64a400','700': '#497d00','800': '#3c6300', | ||||
|                    '900': '#35530e','950': '#192e03'} | ||||
| 
 | ||||
| mpl.rcParams.update({ | ||||
|     'figure.facecolor' : gray['50'], | ||||
|     'axes.facecolor'   : gray['50'], | ||||
|     'axes.edgecolor'   : gray['100'], | ||||
|     'axes.labelcolor'  : gray['700'], | ||||
|     'xtick.color'      : gray['700'], | ||||
|     'ytick.color'      : gray['700'], | ||||
|     'font.family'      : 'Inter',  # falls back to DejaVu if Inter not present | ||||
|     'font.size'        : 11, | ||||
| }) | ||||
| 
 | ||||
| sns.set_style("white")         # keep minimal axes, we will remove default grid | ||||
| sns.set_context("notebook") | ||||
| 
 | ||||
| def prepare_tasks(): | ||||
|     # This dataset comes from https://epoch.ai/gradient-updates/consequences-of-automating-remote-work | ||||
|     # It labels whether each O*NET task can be done remotely or not (labeled by GPT-4o) | ||||
|     # You can download it here: https://drive.google.com/file/d/1GrHhuYIgaCCgo99dZ_40BWraz-fzo76r/view?usp=sharing | ||||
|     df_remote_status = pd.read_csv("epoch_task_data.csv") | ||||
| 
 | ||||
|     # BLS OEWS: https://www.bls.gov/oes/special-requests/oesm23nat.zip | ||||
|     df_oesm = pd.read_excel("oesm23national.xlsx") | ||||
| 
 | ||||
|     # Generated by running: uv run ./enrich_task_ratings.py | ||||
|     df_tasks = pd.read_json("task_ratings_enriched.json") | ||||
| 
 | ||||
|     # Generated by running: uv run classify_estimateability_of_tasks.py | ||||
|     df_task_estimateable = pd.read_csv("tasks_estimateable.csv").rename(columns={"task_estimateable": "estimateable"}).drop_duplicates(subset=['task'], keep='first') | ||||
| 
 | ||||
|     # After this merge, df_tasks has a remote_status column which contains either "remote" or "not remote" | ||||
|     df_tasks = pd.merge(df_tasks, df_remote_status[['Task', 'Remote']], left_on='task', right_on='Task', how='left') | ||||
|     df_tasks = df_tasks.drop('Task', axis=1).rename(columns={'Remote': 'remote_status'}) | ||||
| 
 | ||||
|     # After this merge, df_tasks has an estimateable column which contains either "ATOMIC" or "ONGOING-CONSTRAINT" | ||||
|     df_tasks = pd.merge(df_tasks, df_task_estimateable[['task', 'estimateable']], on='task', how='left') | ||||
| 
 | ||||
|     # Keep only tasks rated as important (same threshold as in preprocessing_time_estimates) | ||||
|     df_tasks = df_tasks[df_tasks['importance_average'] > 3].copy() | ||||
| 
 | ||||
|     df_tasks['onetsoc_major'] = df_tasks['onetsoc_code'].str[:2] | ||||
| 
 | ||||
|     df_remote_tasks = df_tasks[df_tasks['remote_status'] == 'remote'].copy() | ||||
| 
 | ||||
|     # Call create_task_estimates() from add_task_estimates? which creates tasks_with_estimates.csv | ||||
| 
 | ||||
| def preprocessing_time_estimates(): | ||||
|     df = pd.read_csv("tasks_with_estimates.csv") | ||||
| 
 | ||||
|     df = df[df['importance_average'] > 3].copy() | ||||
| 
 | ||||
|     # The embeddings come from running `uv run ./embed_task_description.py` | ||||
|     # Columns: ['embedding_id', 'task', 'embedding_vector'] | ||||
|     # These contain one embedding per UNIQUE task | ||||
|     df_task_embeddings = pd.read_parquet("tasks_with_embeddings.parquet").drop_duplicates(subset=['task'])[['task', 'task_embedding']].rename(columns={"task_embedding": "embedding_vector"}).copy() | ||||
| 
 | ||||
|     df = pd.merge(df, df_task_embeddings[['task', 'embedding_vector']], on='task', how='left') | ||||
|     df = pd.merge(df, df_task_estimateable[['task', 'estimateable']], on='task', how='left') | ||||
| 
 | ||||
|     df['onetsoc_major'] = df['onetsoc_code'].str[:2] | ||||
| 
 | ||||
|     def convert_to_minutes(qty, unit): | ||||
|         """Converts a quantity in a given unit to minutes.""" | ||||
|         return qty * { | ||||
|             "minute": 1, | ||||
|             "hour": 60, | ||||
|             "day": 60 * 24, | ||||
|             "week": 60 * 24 * 7, | ||||
|             "month": 60 * 24 * 30, | ||||
|             "trimester": 60 * 24 * 90, | ||||
|             "semester": 60 * 24 * 180, | ||||
|             "year": 60 * 24 * 365, | ||||
|         }[unit] | ||||
| 
 | ||||
|     df['lb_estimate_in_minutes'] = df.apply( | ||||
|         lambda row: convert_to_minutes(row['lb_estimate_qty'], row['lb_estimate_unit']), axis=1 | ||||
|     ) | ||||
|     df['ub_estimate_in_minutes'] = df.apply( | ||||
|         lambda row: convert_to_minutes(row['ub_estimate_qty'], row['ub_estimate_unit']), axis=1 | ||||
|     ) | ||||
| 
 | ||||
|     df['estimate_range'] = df.ub_estimate_in_minutes - df.lb_estimate_in_minutes | ||||
|     df['estimate_ratio'] = df.ub_estimate_in_minutes / df.lb_estimate_in_minutes | ||||
|     df['estimate_midpoint'] = (df.lb_estimate_in_minutes + df.ub_estimate_in_minutes)/2 | ||||
| 
 | ||||
|     atomic_tasks = df[df['estimateable'] == 'ATOMIC'] | ||||
|     ongoing_tasks = df[df['estimateable'] == 'ONGOING-CONSTRAINT'] | ||||
| 
 | ||||
|     with pd.option_context('display.max_columns', None): | ||||
|         display(df) | ||||
| 
 | ||||
|     # Check for empty estimates | ||||
|     if atomic_tasks['lb_estimate_in_minutes'].isnull().sum() > 0: | ||||
|         print("Missing values in 'lb_estimate_in_minutes':", atomic_tasks['lb_estimate_in_minutes'].isnull().sum()) | ||||
| 
 | ||||
|     if atomic_tasks['ub_estimate_in_minutes'].isnull().sum() > 0: | ||||
|         print("Missing values in 'ub_estimate_in_minutes':", atomic_tasks['ub_estimate_in_minutes'].isnull().sum()) | ||||
| 
 | ||||
|     # Check for impossible bounds | ||||
|     impossible_bounds = atomic_tasks[ | ||||
|         (atomic_tasks['lb_estimate_in_minutes'] <= 0) | | ||||
|         (atomic_tasks['ub_estimate_in_minutes'] <= 0) | | ||||
|         (atomic_tasks['lb_estimate_in_minutes'] > atomic_tasks['ub_estimate_in_minutes']) | ||||
|     ] | ||||
|     if not impossible_bounds.empty: | ||||
|         print("Error: Found rows with impossible bounds.") | ||||
|         with pd.option_context('display.max_colwidth', None): | ||||
|             display(impossible_bounds[['task', 'lb_estimate_in_minutes', 'ub_estimate_in_minutes', 'dwas']]) | ||||
| 
 | ||||
|     #with pd.option_context('display.max_colwidth', None): | ||||
|         #display(atomic_tasks.nlargest(20, 'ub_estimate_in_minutes')[['task', 'lb_estimate_qty', 'lb_estimate_unit', 'lb_estimate_in_minutes', 'ub_estimate_qty', 'ub_estimate_unit', 'ub_estimate_in_minutes', 'estimate_ratio']]) | ||||
| 
 | ||||
| def cell1(): | ||||
|     sns.histplot(atomic_tasks.estimate_midpoint, log_scale=True) | ||||
| 
 | ||||
| def cell2(): | ||||
|     plt.figure(figsize=(14,10)) | ||||
|     sns.boxplot( | ||||
|         data=atomic_tasks, | ||||
|         x='onetsoc_major',           # 11 = Management, 15 = Computer/Math, … | ||||
|         y='estimate_range', | ||||
|         showfliers=False | ||||
|     ) | ||||
|     plt.yscale('log')                # long tail => log scale | ||||
|     plt.xlabel('Occupation') | ||||
|     plt.ylabel('Range (upper-lower, minutes)') | ||||
|     plt.title('Spread of time-range estimates per occupation') | ||||
| 
 | ||||
|     ax = plt.gca() | ||||
|     ax.set_xticklabels([occupation_major_codes[code.get_text()] for code in ax.get_xticklabels()], rotation=60, ha='right') | ||||
| 
 | ||||
| def cell3(): | ||||
|     plt.figure(figsize=(10, 10)) | ||||
|     ax = sns.scatterplot( | ||||
|             data=atomic_tasks.replace({'onetsoc_major': occupation_major_codes}),  # Replace codes with labels | ||||
|             x='lb_estimate_in_minutes', y='ub_estimate_in_minutes', | ||||
|             alpha=0.2, edgecolor=None, hue="onetsoc_major"  # Use the labeled column for hue | ||||
|         ) | ||||
| 
 | ||||
|     # 45° reference | ||||
|     lims = (1, atomic_tasks[['lb_estimate_in_minutes','ub_estimate_in_minutes']].max().max()) | ||||
|     ax.plot(lims, lims, color='black', linestyle='--', linewidth=1) | ||||
| 
 | ||||
|     # optional helper lines: 2× and 10×, 100× ratios | ||||
|     for k in [2,10, 100]: | ||||
|         ax.plot(lims, [k*l for l in lims], | ||||
|                 linestyle=':', color='grey', linewidth=1) | ||||
| 
 | ||||
|     ax.set(xscale='log', yscale='log') | ||||
|     ax.set_xlabel('Lower-bound (min, log scale)') | ||||
|     ax.set_ylabel('Upper-bound (min, log scale)') | ||||
|     ax.set_title('Lower vs upper estimates for all tasks') | ||||
| 
 | ||||
|     # Place the legend outside the plot | ||||
|     ax.legend(bbox_to_anchor=(1, 1), loc='upper left') | ||||
| 
 | ||||
| def cell4(): | ||||
|     plt.figure(figsize=(8,4)) | ||||
|     sns.histplot(np.log10(atomic_tasks['estimate_ratio'].replace([np.inf, -np.inf], np.nan).dropna()), | ||||
|                 bins=60, kde=True) | ||||
|     plt.axvline(np.log10(10), color='red', ls='--', lw=1, label='10×') | ||||
|     plt.axvline(np.log10(1.05), color='orange', ls='--', lw=1, label='1.05×') | ||||
|     plt.axvline(0, color='black', ls='-', lw=1)          # ub = lb | ||||
|     plt.xlabel('log₁₀(upper / lower)') | ||||
|     plt.ylabel('Count') | ||||
|     plt.title('Distribution of upper:lower ratio') | ||||
|     plt.legend() | ||||
|     plt.tight_layout() | ||||
| 
 | ||||
| 
 | ||||
| def cell5(): | ||||
|     # 1. Bin lower bounds into quartiles (Q1–Q4) | ||||
|     atomic_tasks['lb_q'] = pd.qcut(atomic_tasks.lb_estimate_in_minutes, | ||||
|                         q=4, labels=['Q1 shortest','Q2','Q3','Q4 longest']) | ||||
| 
 | ||||
| 
 | ||||
|     # 2. Aggregate: median ratio per cell | ||||
|     pivot = atomic_tasks.pivot_table(index='onetsoc_major', columns='lb_q', | ||||
|                         values='estimate_ratio', aggfunc='median') | ||||
| 
 | ||||
|     # Map the index (onetsoc_major codes) to their corresponding labels | ||||
|     pivot.index = pivot.index.map(occupation_major_codes) | ||||
| 
 | ||||
| 
 | ||||
|     # 3. Visualise | ||||
|     plt.figure(figsize=(10,8)) | ||||
|     sns.heatmap(pivot, cmap='RdYlGn_r', center=2, annot=True, fmt='.1f', | ||||
|                 cbar_kws={'label':'Median upper/lower ratio'}) | ||||
|     plt.xlabel('Lower-bound quartile') | ||||
|     plt.ylabel('Occupation (major group)') | ||||
|     plt.title('Typical range width by occupation and task length') | ||||
|     plt.tight_layout() | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| def cell6(): | ||||
|     """ | ||||
|     from scipy.stats import median_abs_deviation | ||||
| 
 | ||||
|     def mad_z(series): | ||||
|         med = series.median() | ||||
|         mad = median_abs_deviation(series, scale='normal')  # ⇒ comparable to σ | ||||
|         return (series - med) / mad | ||||
| 
 | ||||
|     df['robust_z'] = df.groupby('onetsoc_code')['estimate_midpoint'].transform(mad_z) | ||||
|     """ | ||||
| 
 | ||||
|     agg = (atomic_tasks | ||||
|            .groupby('onetsoc_code')['estimate_midpoint'] | ||||
|            .agg(median='median', | ||||
|                 q1=lambda x: x.quantile(.25), | ||||
|                 q3=lambda x: x.quantile(.75), | ||||
|                 mean='mean', | ||||
|                 std='std') | ||||
|            .reset_index()) | ||||
|     agg['IQR'] = agg.q3 - agg.q1 | ||||
|     agg['CV']  = agg['std'] / agg['mean']            # coefficient of variation | ||||
| 
 | ||||
|     # merge back the group mean and std so each row can be scored | ||||
|     atomic_tasks = atomic_tasks.merge(agg[['onetsoc_code','mean','std']], on='onetsoc_code') | ||||
| 
 | ||||
| 
 | ||||
|     atomic_tasks['z'] = (atomic_tasks.estimate_midpoint - atomic_tasks['mean']) / atomic_tasks['std'] | ||||
|     outliers = atomic_tasks.loc[atomic_tasks.z.abs() > 3] | ||||
|     outliers | ||||
| 
 | ||||
| def cell7(): | ||||
|     from scipy.stats import median_abs_deviation | ||||
| 
 | ||||
|     def mad_z(series): | ||||
|         med = series.median() | ||||
|         mad = median_abs_deviation(series, scale='normal')  # ⇒ comparable to σ | ||||
|         return (series - med) / mad | ||||
| 
 | ||||
|     atomic_tasks['robust_z'] = atomic_tasks.groupby('onetsoc_code')['estimate_midpoint'].transform(mad_z) | ||||
| 
 | ||||
| def cell8(): | ||||
|     from sklearn.metrics.pairwise import cosine_similarity | ||||
|     from sklearn.neighbors import NearestNeighbors | ||||
| 
 | ||||
|     # unit-normalise embeddings | ||||
|     X = np.vstack(atomic_tasks.embedding_vector.values) | ||||
|     X = X / np.linalg.norm(X, axis=1, keepdims=True) | ||||
| 
 | ||||
|     k = 5 | ||||
|     nn = NearestNeighbors(n_neighbors=k+1, metric='cosine').fit(X) | ||||
|     dist, idx = nn.kneighbors(X)          # idx[:,0] is the task itself | ||||
|     pairs = pd.DataFrame({ | ||||
|         'i': np.repeat(np.arange(len(X)), k), | ||||
|         'j': idx[:,1:].ravel(),           # drop self-match | ||||
|         'sim': 1 - dist[:,1:].ravel()     # cosine similarity | ||||
|     }) | ||||
| 
 | ||||
|     # join time estimates | ||||
|     pairs = (pairs | ||||
|              .merge(atomic_tasks[['estimate_midpoint']], left_on='i', right_index=True) | ||||
|              .merge(atomic_tasks[['estimate_midpoint']], left_on='j', right_index=True, | ||||
|                     suffixes=('_i','_j'))) | ||||
| 
 | ||||
|     pairs['ratio'] = pairs[['estimate_midpoint_i','estimate_midpoint_j']].max(1) / \ | ||||
|                      pairs[['estimate_midpoint_i','estimate_midpoint_j']].min(1) | ||||
| 
 | ||||
|     pairs | ||||
| 
 | ||||
| def cell9(): | ||||
|     # Sample once and take both axes from the same rows | ||||
|     sample = pairs.sample(min(20_000, len(pairs))) | ||||
|     sns.scatterplot(x=sample['sim'], y=np.log10(sample['ratio']), | ||||
|                     alpha=.1) | ||||
|     plt.axhline(np.log10(2), ls=':', color='red');   # e.g. >2× difference | ||||
|     plt.xlabel('Cosine similarity'); plt.ylabel('log₁₀ time-ratio'); | ||||
|     plt.title('Are similar tasks given similar time estimates?'); | ||||
| 
 | ||||
| def cell10(): | ||||
|     import matplotlib.ticker as mtick # For percentage formatting | ||||
|     import matplotlib.colors as mcolors # For color conversion | ||||
| 
 | ||||
|     summary_data = [] | ||||
| 
 | ||||
|     for code, label in occupation_major_codes.items(): | ||||
|         occ_df = df_tasks[df_tasks['onetsoc_major'] == code] | ||||
|         total_tasks_in_occ = len(occ_df) | ||||
| 
 | ||||
|         if total_tasks_in_occ == 0: | ||||
|             continue # Skip if no tasks for this occupation | ||||
| 
 | ||||
|         # Stack 1: % that isn't equal to "remote" | ||||
|         not_remote_count = len(occ_df[occ_df['remote_status'] != 'remote']) | ||||
| 
 | ||||
|         # For the remaining remote tasks: | ||||
|         remote_df = occ_df[occ_df['remote_status'] == 'remote'] | ||||
| 
 | ||||
|         # Stack 2: % of remote + ATOMIC | ||||
|         remote_atomic_count = len(remote_df[remote_df['estimateable'] == 'ATOMIC']) | ||||
| 
 | ||||
|         # Stack 3: % of remote + ONGOING-CONSTRAINT | ||||
|         remote_ongoing_count = len(remote_df[remote_df['estimateable'] == 'ONGOING-CONSTRAINT']) | ||||
| 
 | ||||
|         summary_data.append({ | ||||
|             'onetsoc_major_code': code, | ||||
|             'occupation_label': label, | ||||
|             'count_not_remote': not_remote_count, | ||||
|             'count_remote_atomic': remote_atomic_count, | ||||
|             'count_remote_ongoing': remote_ongoing_count, | ||||
|             'total_tasks': total_tasks_in_occ | ||||
|         }) | ||||
| 
 | ||||
|     summary_df = pd.DataFrame(summary_data) | ||||
| 
 | ||||
|     # --- 3. Calculate Percentages --- | ||||
|     # Ensure total_tasks is not zero to avoid division by zero errors if an occupation had no tasks | ||||
|     summary_df = summary_df[summary_df['total_tasks'] > 0].copy() # Use .copy() to avoid SettingWithCopyWarning | ||||
| 
 | ||||
|     summary_df['pct_not_remote'] = (summary_df['count_not_remote'] / summary_df['total_tasks']) * 100 | ||||
|     summary_df['pct_remote_atomic'] = (summary_df['count_remote_atomic'] / summary_df['total_tasks']) * 100 | ||||
|     summary_df['pct_remote_ongoing'] = (summary_df['count_remote_ongoing'] / summary_df['total_tasks']) * 100 | ||||
| 
 | ||||
|     # Select columns for plotting and set index to occupation label | ||||
|     plot_df = summary_df.set_index('occupation_label')[ | ||||
|         ['pct_not_remote', 'pct_remote_atomic', 'pct_remote_ongoing'] | ||||
|     ] | ||||
| 
 | ||||
|     # Rename columns for a clearer legend | ||||
|     plot_df.columns = ['Not Remote', 'Remote + Estimable', 'Remote + Not estimable'] | ||||
| 
 | ||||
|     plot_df = plot_df.sort_values(by='Not Remote', ascending=False) | ||||
| 
 | ||||
| 
 | ||||
|     # --- 4. Plotting --- | ||||
| 
 | ||||
|     # Define the custom colors for the stacked bars. | ||||
|     # The order must match the column order in plot_df: | ||||
|     # 1. 'Not Remote' | ||||
|     # 2. 'Remote + Estimable' | ||||
|     # 3. 'Remote + Not estimable' | ||||
|     bar_colors = [gray["300"], lime["500"], lime["200"]] | ||||
| 
 | ||||
|     fig, ax = plt.subplots(figsize=(14, 10)) # Adjusted figsize for better readability | ||||
| 
 | ||||
|     plot_df.plot(kind='barh', stacked=True, ax=ax, color=bar_colors) | ||||
| 
 | ||||
|     ax.set_xlabel("Percentage of Tasks (%)", fontsize=12) | ||||
|     ax.set_ylabel("Occupation Major Group", fontsize=12) | ||||
|     ax.set_title("Task Breakdown by Occupation, Remote Status, and Estimateability", fontsize=14, pad=20) | ||||
| 
 | ||||
|     # Format x-axis as percentages | ||||
|     ax.xaxis.set_major_formatter(mtick.PercentFormatter()) | ||||
|     plt.xlim(0, 100) # Ensure x-axis goes from 0 to 100% | ||||
| 
 | ||||
|     # Remove right and top spines | ||||
|     ax.spines['right'].set_visible(False) | ||||
|     ax.spines['top'].set_visible(False) | ||||
| 
 | ||||
|     # Function to get contrasting text color | ||||
|     def get_contrasting_text_color(bg_color_hex_or_rgba): | ||||
|         """ | ||||
|         Determines if black or white text provides better contrast against a given background color. | ||||
|         bg_color_hex_or_rgba: A hex string (e.g., '#RRGGBB') or an RGBA tuple (values in [0, 1]). | ||||
|         Returns: 'black' or 'white'. | ||||
|         """ | ||||
|         # Convert to RGBA if it's a hex string or name | ||||
|         if isinstance(bg_color_hex_or_rgba, str): | ||||
|             rgba = mcolors.to_rgba(bg_color_hex_or_rgba) | ||||
|         else: | ||||
|             rgba = bg_color_hex_or_rgba | ||||
| 
 | ||||
|         r, g, b, _ = rgba # Ignore alpha for luminance calculation | ||||
|         # Approximate luminance with the Rec. 709 / sRGB weights (applied to gamma-encoded values, without linearisation) | ||||
|         # Values r, g, b should be in [0, 1] for this formula | ||||
|         luminance = 0.2126 * r + 0.7152 * g + 0.0722 * b | ||||
|         # Threshold for deciding text color | ||||
|         return 'black' if luminance > 0.55 else 'white' # Adjusted threshold slightly for better visual | ||||
| 
 | ||||
|     # Add percentages inside each bar segment | ||||
|     # Iterate through each "category" of bars (Not Remote, Remote & ATOMIC, etc.) | ||||
|     for i, container in enumerate(ax.containers): | ||||
|         # Get the color for this container/category | ||||
|         segment_color = bar_colors[i] | ||||
|         text_color = get_contrasting_text_color(segment_color) | ||||
| 
 | ||||
|         for patch in container.patches: # Iterate through each bar segment in the category | ||||
|             width = patch.get_width() | ||||
|             if width > 3:  # Only add text if segment is wide enough (e.g., >3%) | ||||
|                 x = patch.get_x() + width / 2 | ||||
|                 y = patch.get_y() + patch.get_height() / 2 | ||||
|                 ax.text(x, y, | ||||
|                         f"{width:.1f}%", | ||||
|                         ha='center', | ||||
|                         va='center', | ||||
|                         fontsize=8, # Adjust font size as needed | ||||
|                         color=text_color, | ||||
|                         fontweight='medium') # Bolder text can help | ||||
| 
 | ||||
| 
 | ||||
|     plt.legend(title="Task Category", bbox_to_anchor=(1.02, 1), loc='upper left', frameon=False) | ||||
| 
 | ||||
| def cell11(): | ||||
|     df_oesm['onetsoc_major'] = df_oesm['OCC_CODE'].str[:2] | ||||
| 
 | ||||
|     # Calculate wage bill per occupation | ||||
|     # Wage bill = Total Employment * Annual Mean Wage | ||||
|     # Ensure columns are numeric, converting non-numeric values to NaN first | ||||
|     df_oesm['TOT_EMP'] = pd.to_numeric(df_oesm['TOT_EMP'], errors='coerce') | ||||
|     df_oesm['A_MEAN'] = pd.to_numeric(df_oesm['A_MEAN'], errors='coerce') | ||||
| 
 | ||||
|     # Drop rows with NaN in necessary columns after coercion | ||||
|     df_oesm.dropna(subset=['TOT_EMP', 'A_MEAN', 'onetsoc_major'], inplace=True) | ||||
| 
 | ||||
|     df_oesm['wage_bill'] = df_oesm['TOT_EMP'] * df_oesm['A_MEAN'] | ||||
| 
 | ||||
|     # Aggregate wage bill by onetsoc_major | ||||
|     df_wage_bill_major = df_oesm.groupby('onetsoc_major')['wage_bill'].sum().reset_index() | ||||
| 
 | ||||
|     # Map major codes to titles for better plotting | ||||
|     df_wage_bill_major['OCC_TITLE_MAJOR'] = df_wage_bill_major['onetsoc_major'].map(occupation_major_codes) | ||||
| 
 | ||||
|     # Sort by wage bill for better visualization | ||||
|     df_wage_bill_major = df_wage_bill_major.sort_values('wage_bill', ascending=False) | ||||
| 
 | ||||
|     # Plotting | ||||
|     plt.figure(figsize=(12, 8)) | ||||
|     sns.barplot(x='wage_bill', y='OCC_TITLE_MAJOR', data=df_wage_bill_major, palette="viridis") | ||||
|     plt.title('Total Wage Bill per Major Occupation Group') | ||||
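|     # wage_bill is TOT_EMP * A_MEAN and is not rescaled, so the axis is in dollars, not billions | ||||
|     plt.xlabel('Total Wage Bill (USD per year)') | ||||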
|     plt.ylabel('Major Occupation Group') | ||||
|     plt.grid(axis='x', linestyle='--', alpha=0.7) | ||||
| 
 | ||||
| def cell11(): | ||||
|     # ─────────────────────────────────────────────────────────────── | ||||
|     # 1.  CUMULATIVE-DISTRIBUTION-FUNCTION (CDF) PREP | ||||
|     # ─────────────────────────────────────────────────────────────── | ||||
|     def cdf(series): | ||||
|         s = series.sort_values().reset_index(drop=True) | ||||
|         return s.values, ((s.index + 1) / len(s)) * 100 | ||||
| 
 | ||||
|     x_lb , y_lb  = cdf(atomic_tasks['lb_estimate_in_minutes']) | ||||
|     x_ub , y_ub  = cdf(atomic_tasks['ub_estimate_in_minutes']) | ||||
|     x_mid, y_mid = cdf((atomic_tasks['ub_estimate_in_minutes'] + atomic_tasks['lb_estimate_in_minutes']) / 2) | ||||
| 
 | ||||
|     # ─────────────────────────────────────────────────────────────── | ||||
|     # 2.  PLOTTING | ||||
|     # ─────────────────────────────────────────────────────────────── | ||||
|     fig, ax = plt.subplots(figsize=(10, 6)) | ||||
| 
 | ||||
|     # horizontal reference lines every 10 % | ||||
|     for y_val in range(0, 101, 10): | ||||
|         ax.axhline(y_val, color=gray['100'], linewidth=.8, zorder=1) | ||||
| 
 | ||||
|     # Plot Lower Bound CDF | ||||
|     ax.step(x_lb, y_lb, | ||||
|             where='post', | ||||
|             color=lime['300'], # lighter lime shade for the lower bound | ||||
|             linewidth=1.8, | ||||
|             linestyle='--', | ||||
|             zorder=2, | ||||
|             label='Lower bound estimate (CDF)') | ||||
| 
 | ||||
|     # Plot Upper Bound CDF | ||||
|     ax.step(x_ub, y_ub, | ||||
|             where='post', | ||||
|             color=lime['900'], # darkest lime shade for the upper bound | ||||
|             linewidth=1.8, | ||||
|             linestyle=':', | ||||
|             zorder=3, | ||||
|             label='Upper bound estimate (CDF)') | ||||
| 
 | ||||
|     # Plot Midpoint CDF (plotted last to be on top, or adjust zorder) | ||||
|     ax.step(x_mid, y_mid, | ||||
|             where='post', | ||||
|             color=lime['600'], | ||||
|             linewidth=2.2, | ||||
|             zorder=4, # Ensure it's on top of other lines if they overlap significantly | ||||
|             label='Mid-point estimate (CDF)') | ||||
| 
 | ||||
| 
 | ||||
|     # axes limits / scales | ||||
|     ax.set_ylim(0, 100) | ||||
|     ax.set_xscale('log') | ||||
| 
 | ||||
|     # y-axis ➝ percent labels | ||||
|     ax.yaxis.set_major_formatter(mpl.ticker.PercentFormatter(decimals=0)) | ||||
| 
 | ||||
| 
 | ||||
|     # place the y-label as text at the top-left, just above the plotting area (axes coordinates) | ||||
|     ax.text(-0.06, 1.03, | ||||
|             "% of tasks with temporal coherence ≤ X", | ||||
|             ha='left', va='bottom', | ||||
|             transform=ax.transAxes, | ||||
|             fontsize=12, fontweight='semibold') | ||||
| 
 | ||||
|     # custom x-ticks at human-friendly durations | ||||
|     ticks      = [1, 5, 10, 30, 60, 120, 240, 480, | ||||
|                 1440, 2880, 10080, 43200, 129600, | ||||
|                 259200, 525600] | ||||
|     ticklabels = ['1 min', '5 min', '10 min', '30 min', '1 hour', '2 hours', '4 hours', '8 hours', | ||||
|                 '1 day', '2 days', '1 week', '30 days', | ||||
|                 '90 days', '180 days', '1 year'] | ||||
| 
 | ||||
|     # Vertical reference lines for x-ticks | ||||
|     for tick in ticks: | ||||
|         ax.axvline(tick, color=gray['300'], linewidth=.8, linestyle='--', zorder=1) | ||||
| 
 | ||||
|     ax.set_xticks(ticks) | ||||
|     ax.set_xticklabels(ticklabels, rotation=45, ha='right') | ||||
| 
 | ||||
|     ax.spines['top'].set_visible(False) | ||||
|     ax.spines['right'].set_visible(False) | ||||
|     ax.spines['left'].set_edgecolor(gray['300']) | ||||
|     ax.spines['bottom'].set_edgecolor(gray['300']) | ||||
| 
 | ||||
| 
 | ||||
|     # legend | ||||
|     ax.legend(frameon=False, loc='lower right') # Keep 'lower right' or adjust as needed | ||||
| 
 | ||||
|     ax.text(0.5, -0.3, | ||||
|             'Temporal coherence (X)', | ||||
|             ha='center', va='center', | ||||
|             transform=ax.transAxes, | ||||
|             fontsize=12, fontweight='semibold') | ||||
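The `cdf` helper in the cell above sorts the estimates and pairs each one with the percentage of tasks at or below it. The sketch below shows the same calculation in plain Python for illustration only; the name `empirical_cdf` and the sample durations are assumptions, not part of the commit.

```python
def empirical_cdf(values):
    """Return sorted values and, for each, the cumulative percentage of
    observations at or below that position (mirrors the notebook's cdf helper)."""
    xs = sorted(values)
    n = len(xs)
    # Position i (0-based) covers (i + 1) of n observations.
    ys = [(i + 1) / n * 100 for i in range(n)]
    return xs, ys


# Hypothetical task durations in minutes.
durations = [30, 5, 60, 5]
xs, ys = empirical_cdf(durations)
for x, y in zip(xs, ys):
    print(f"{x:>3} min -> {y:.0f}% of tasks")
```

Tied values each get their own step, as in the pandas version, so a step plot with `where='post'` draws them as a single vertical jump.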
							
								
								
									
										0
									
								
								analysis/__init__.py
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							
							
								
								
									
										207
									
								
								analysis/data.py
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							|  | @ -0,0 +1,207 @@ | |||
| import logging | ||||
| import re | ||||
| import requests | ||||
| import shutil | ||||
| import sqlite3 | ||||
| import zipfile | ||||
| from pathlib import Path | ||||
| 
 | ||||
| # Configure logging to provide feedback during the data setup process | ||||
| logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s') | ||||
| 
 | ||||
| # --- Constants --- | ||||
| # Using a data directory at the root of the project | ||||
| DATA_DIR = Path("data") | ||||
| 
 | ||||
| # O*NET database details. We download the MySQL version and convert it to SQLite. | ||||
| ONET_MYSQL_URL = "https://www.onetcenter.org/dl_files/database/db_29_3_mysql.zip" | ||||
| DB_ZIP_PATH = DATA_DIR / "onet_mysql.zip" | ||||
| DB_FILE_PATH = DATA_DIR / "onet.db" | ||||
| EXTRACT_DIR = DATA_DIR / "onet_mysql_extracted" | ||||
| 
 | ||||
| # URLs for other required data files are in a separate text data archive. | ||||
| ONET_TEXT_URL = "https://www.onetcenter.org/dl_files/database/db_29_3_text.zip" | ||||
| TEXT_ZIP_PATH = DATA_DIR / "onet_text.zip" | ||||
| TASK_RATINGS_PATH = DATA_DIR / "Task Ratings.txt" | ||||
| DWA_REFERENCE_PATH = DATA_DIR / "DWA Reference.txt" | ||||
| 
 | ||||
| 
 | ||||
| def setup_data_and_database(): | ||||
|     """ | ||||
|     Main function to orchestrate the data setup. | ||||
|     It ensures the data directory exists, then downloads and sets up the O*NET database | ||||
|     and any other required data files. | ||||
|     """ | ||||
|     logging.info("Starting data and database setup...") | ||||
|     DATA_DIR.mkdir(exist_ok=True) | ||||
| 
 | ||||
|     _setup_onet_database() | ||||
|     _download_additional_data() | ||||
| 
 | ||||
|     logging.info("Data and database setup complete.") | ||||
| 
 | ||||
| 
 | ||||
| def _setup_onet_database(): | ||||
|     """ | ||||
|     Downloads the O*NET MySQL database, extracts it, and imports it into a | ||||
|     new SQLite database, following performance best practices from a shell script. | ||||
|     This method performs minimal text-based conversion of the MySQL dump to | ||||
|     make it compatible with SQLite before importing. | ||||
|     """ | ||||
|     if DB_FILE_PATH.exists(): | ||||
|         logging.info("O*NET database already exists at %s. Skipping setup.", DB_FILE_PATH) | ||||
|         return | ||||
| 
 | ||||
|     logging.info("O*NET database not found. Starting fresh setup.") | ||||
|     # Ensure the extraction directory is clean before use | ||||
|     if EXTRACT_DIR.exists(): | ||||
|         shutil.rmtree(EXTRACT_DIR) | ||||
|     EXTRACT_DIR.mkdir() | ||||
| 
 | ||||
|     try: | ||||
|         # 1. Download if necessary | ||||
|         if not DB_ZIP_PATH.exists(): | ||||
|             logging.info("Downloading O*NET database from %s", ONET_MYSQL_URL) | ||||
|             _download_file(ONET_MYSQL_URL, DB_ZIP_PATH) | ||||
|         else: | ||||
|             logging.info("Using existing O*NET zip file at %s", DB_ZIP_PATH) | ||||
| 
 | ||||
|         # 2. Extract | ||||
|         logging.info("Extracting O*NET database files to %s", EXTRACT_DIR) | ||||
|         with zipfile.ZipFile(DB_ZIP_PATH, 'r') as zip_ref: | ||||
|             zip_ref.extractall(EXTRACT_DIR) | ||||
| 
 | ||||
|         # 3. Combine all SQL files and convert them for SQLite | ||||
|         logging.info("Combining and converting SQL files for single transaction import...") | ||||
|         sql_files = sorted(EXTRACT_DIR.rglob('*.sql')) | ||||
|         if not sql_files: | ||||
|             raise FileNotFoundError(f"No SQL files found in {EXTRACT_DIR}") | ||||
| 
 | ||||
|         # Concatenate all files into one string | ||||
|         mysql_dump = "\n".join(sql_file.read_text(encoding='utf-8') for sql_file in sql_files) | ||||
| 
 | ||||
|         # Minimal conversion for SQLite: remove backticks and ENGINE clauses | ||||
|         sqlite_dump = mysql_dump.replace('`', '') | ||||
|         sqlite_dump = re.sub(r'\) ENGINE=InnoDB.*?;', ');', sqlite_dump, flags=re.DOTALL) | ||||
| 
 | ||||
|         full_script = f"BEGIN TRANSACTION;\n{sqlite_dump}\nCOMMIT;" | ||||
| 
 | ||||
|         # 4. Create new DB with performance PRAGMAs and import in a single transaction. | ||||
|         # These PRAGMAs apply only to the connection that sets them, so the import | ||||
|         # must run on the same connection before it is closed. | ||||
|         logging.info("Creating new SQLite database with performance settings: %s", DB_FILE_PATH) | ||||
|         conn = sqlite3.connect(DB_FILE_PATH) | ||||
|         try: | ||||
|             conn.executescript(""" | ||||
|                 PRAGMA journal_mode = OFF; | ||||
|                 PRAGMA synchronous = 0; | ||||
|                 PRAGMA cache_size = 1000000; | ||||
|                 PRAGMA locking_mode = EXCLUSIVE; | ||||
|                 PRAGMA temp_store = MEMORY; | ||||
|             """) | ||||
|             logging.info("Importing %d SQL files into database...", len(sql_files)) | ||||
|             conn.executescript(full_script) | ||||
|         finally: | ||||
|             conn.close() | ||||
|         logging.info("Database populated successfully.") | ||||
| 
 | ||||
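|         # 5. Switch to WAL journaling and optimize. | ||||
|         # Only journal_mode = WAL is stored in the database file; the other settings | ||||
|         # below are per-connection and must be set again by each new connection. | ||||
|         logging.info("Restoring reliability settings and optimizing database...") | ||||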
|         conn = sqlite3.connect(DB_FILE_PATH) | ||||
|         conn.executescript(""" | ||||
|             PRAGMA journal_mode = WAL; | ||||
|             PRAGMA synchronous = NORMAL; | ||||
|             PRAGMA locking_mode = NORMAL; | ||||
|             PRAGMA temp_store = DEFAULT; | ||||
|             PRAGMA foreign_keys = ON; | ||||
|             PRAGMA optimize; | ||||
|         """) | ||||
|         conn.execute("VACUUM;") | ||||
|         conn.close() | ||||
|         logging.info("Database setup and optimization complete.") | ||||
| 
 | ||||
|     except Exception as e: | ||||
|         logging.error("Failed during database setup: %s", e, exc_info=True) | ||||
|         if DB_FILE_PATH.exists(): | ||||
|             DB_FILE_PATH.unlink() | ||||
|         raise | ||||
|     finally: | ||||
|         # 6. Cleanup | ||||
|         logging.info("Cleaning up temporary files...") | ||||
|         if DB_ZIP_PATH.exists(): | ||||
|             DB_ZIP_PATH.unlink() | ||||
|         if EXTRACT_DIR.exists(): | ||||
|             shutil.rmtree(EXTRACT_DIR) | ||||
| 
 | ||||
| 
 | ||||
| def _download_additional_data(): | ||||
|     """ | ||||
|     Downloads and extracts supplementary data files from the O*NET text archive. | ||||
|     If the required text files already exist, this function does nothing. | ||||
|     """ | ||||
|     required_files = [TASK_RATINGS_PATH, DWA_REFERENCE_PATH] | ||||
|     if all(p.exists() for p in required_files): | ||||
|         logging.info("All required text data files already exist. Skipping download.") | ||||
|         return | ||||
| 
 | ||||
|     logging.info("One or more text data files are missing. Downloading and extracting from archive...") | ||||
|     try: | ||||
|         _download_file(ONET_TEXT_URL, TEXT_ZIP_PATH) | ||||
|         logging.info("Unzipping text data archive...") | ||||
|         with zipfile.ZipFile(TEXT_ZIP_PATH, 'r') as zip_ref: | ||||
|             # Extract only the files we need, without creating subdirectories | ||||
|             for target_path in required_files: | ||||
|                 if not target_path.exists(): | ||||
|                     # Find the corresponding file within the zip archive's directory structure | ||||
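|                     # Match on the exact base name so a longer name with the same suffix is not picked up | ||||
|                     member_name = next((m for m in zip_ref.namelist() if m.rsplit('/', 1)[-1] == target_path.name), None) | ||||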
|                     if member_name: | ||||
|                         with zip_ref.open(member_name) as source, open(target_path, 'wb') as target: | ||||
|                             target.write(source.read()) | ||||
|                         logging.info("Extracted %s", target_path.name) | ||||
|                     else: | ||||
|                         logging.warning("Could not find %s in the text data archive.", target_path.name) | ||||
| 
 | ||||
|     except requests.exceptions.RequestException as e: | ||||
|         logging.error("Failed to download O*NET text data archive: %s", e) | ||||
|         raise | ||||
|     except zipfile.BadZipFile as e: | ||||
|         logging.error("Failed to process the text data archive: %s", e) | ||||
|         raise | ||||
|     finally: | ||||
|         # Clean up the downloaded zip file | ||||
|         if TEXT_ZIP_PATH.exists(): | ||||
|             TEXT_ZIP_PATH.unlink() | ||||
|             logging.info("Cleaned up downloaded text archive zip file.") | ||||
| 
 | ||||
| 
 | ||||
| def _download_file(url, destination): | ||||
|     """ | ||||
|     Helper function to download a file from a URL, with streaming for large files. | ||||
|     """ | ||||
|     logging.info("Downloading from %s to %s", url, destination) | ||||
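|     # Without a timeout a stalled connection can hang indefinitely; 60 seconds is an assumed value. | ||||
|     with requests.get(url, stream=True, timeout=60) as r: | ||||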
|         r.raise_for_status() | ||||
|         with open(destination, 'wb') as f: | ||||
|             for chunk in r.iter_content(chunk_size=8192): | ||||
|                 f.write(chunk) | ||||
|     logging.info("Download of %s complete.", destination.name) | ||||
| 
 | ||||
| 
 | ||||
| def get_db_connection(): | ||||
|     """ | ||||
|     Establishes and returns a connection to the SQLite database. | ||||
|     Returns None if the database file does not exist. | ||||
|     """ | ||||
|     if not DB_FILE_PATH.exists(): | ||||
|         logging.error("Database file not found at %s. Run the setup process first.", DB_FILE_PATH) | ||||
|         return None | ||||
|     try: | ||||
|         conn = sqlite3.connect(DB_FILE_PATH) | ||||
|         return conn | ||||
|     except sqlite3.Error as e: | ||||
|         logging.error("Failed to connect to the database: %s", e) | ||||
|         return None | ||||
| 
 | ||||
| if __name__ == '__main__': | ||||
|     # This allows the data setup to be run directly from the command line, | ||||
|     # which is useful for initialization or debugging. | ||||
|     setup_data_and_database() | ||||
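The text-based MySQL-to-SQLite conversion used in `_setup_onet_database` can be tried on a small input. The sketch below applies the same two substitutions to a hypothetical dump (the table `task_demo` and its row are invented for illustration) and imports the result into an in-memory database. It is a simplified illustration, not the project's import path.

```python
import re
import sqlite3

# Hypothetical MySQL dump fragment; real O*NET dumps are much larger.
MYSQL_DUMP = """
CREATE TABLE `task_demo` (
  `task_id` INTEGER NOT NULL,
  `task` VARCHAR(100) NOT NULL,
  PRIMARY KEY (`task_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
INSERT INTO `task_demo` (`task_id`, `task`) VALUES (1, 'Review budgets');
"""


def mysql_to_sqlite(dump: str) -> str:
    """Strip backticks and the trailing ENGINE clause of CREATE TABLE statements."""
    converted = dump.replace('`', '')
    return re.sub(r'\) ENGINE=InnoDB.*?;', ');', converted, flags=re.DOTALL)


def import_dump(conn: sqlite3.Connection, dump: str) -> None:
    """Run the converted dump inside one explicit transaction."""
    conn.executescript(f"BEGIN TRANSACTION;\n{mysql_to_sqlite(dump)}\nCOMMIT;")


conn = sqlite3.connect(":memory:")
import_dump(conn, MYSQL_DUMP)
rows = conn.execute("SELECT task_id, task FROM task_demo").fetchall()
conn.close()
print(rows)
```

Because the conversion is a plain string replacement, it would also strip backticks that appear inside data values, and it does not handle other MySQL-specific syntax. That is acceptable only if the dump is known not to contain such cases.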
							
								
								
									
										76
									
								
								analysis/generate.py
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							|  | @ -0,0 +1,76 @@ | |||
| import importlib | ||||
| import logging | ||||
| import pkgutil | ||||
| import shutil | ||||
| from pathlib import Path | ||||
| 
 | ||||
| # The final destination for all generated outputs | ||||
| DIST_DIR = Path("dist") | ||||
| 
 | ||||
| # Configure logging | ||||
| logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s') | ||||
| 
 | ||||
| def create_all_outputs(processed_df): | ||||
|     """ | ||||
|     Dynamically discovers, imports, and runs all output generators. | ||||
| 
 | ||||
|     This function iterates through all modules in the 'analysis.generators' | ||||
|     package. For each module, it assumes there is a 'generate(data)' function, | ||||
|     which it calls with the provided preprocessed DataFrame. | ||||
| 
 | ||||
|     The generator function is expected to save its output to a temporary file | ||||
|     and return the path to that file. This function then moves the output | ||||
|     to the 'dist/' directory. | ||||
| 
 | ||||
|     Args: | ||||
|         processed_df (pd.DataFrame): The fully preprocessed data to be used | ||||
|                                      by the generator functions. | ||||
|     """ | ||||
|     logging.info("Starting output generation...") | ||||
|     DIST_DIR.mkdir(exist_ok=True) | ||||
|     logging.info(f"Output directory is '{DIST_DIR.resolve()}'") | ||||
| 
 | ||||
|     # Path to the generators package | ||||
|     from . import generators as generators_package | ||||
|     generators_path = generators_package.__path__ | ||||
|     generators_prefix = generators_package.__name__ + "." | ||||
| 
 | ||||
|     generated_files_count = 0 | ||||
| 
 | ||||
|     # Discover and run all modules in the generators package | ||||
|     for _, module_name, _ in pkgutil.iter_modules(generators_path, prefix=generators_prefix): | ||||
|         try: | ||||
|             logging.info(f"--- Running generator: {module_name} ---") | ||||
| 
 | ||||
|             # Import the generator module | ||||
|             generator_module = importlib.import_module(module_name) | ||||
| 
 | ||||
|             # Check if the module has the required 'generate' function | ||||
|             if not hasattr(generator_module, 'generate'): | ||||
|                 logging.warning(f"Generator module {module_name} does not have a 'generate' function. Skipping.") | ||||
|                 continue | ||||
| 
 | ||||
|             # Call the generator function, passing in the preprocessed data | ||||
|             generator_func = getattr(generator_module, 'generate') | ||||
|             temp_output_path = generator_func(processed_df) | ||||
| 
 | ||||
|             # If the generator returned a path, move the file to the dist directory | ||||
|             if temp_output_path and isinstance(temp_output_path, Path) and temp_output_path.exists(): | ||||
|                 # Sanitize the module name to create a valid filename | ||||
|                 base_filename = module_name.split('.')[-1] | ||||
|                 # Keep the original extension from the temp file | ||||
|                 final_filename = base_filename + temp_output_path.suffix | ||||
|                 final_output_path = DIST_DIR / final_filename | ||||
| 
 | ||||
|                 shutil.move(temp_output_path, final_output_path) | ||||
|                 logging.info(f"Successfully generated '{final_output_path.name}'") | ||||
|                 generated_files_count += 1 | ||||
|             else: | ||||
|                 logging.warning(f"Generator {module_name} did not return a valid output file path. Nothing was saved.") | ||||
| 
 | ||||
|         except Exception as e: | ||||
|             logging.error(f"Failed to run generator {module_name}. Error: {e}", exc_info=True) | ||||
|             # Continue to the next generator | ||||
| 
 | ||||
|     logging.info(f"--- Output generation complete. Total files generated: {generated_files_count} ---") | ||||
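`create_all_outputs` relies on a simple contract: each generator module exposes `generate(data)`, writes a temporary file, and returns its `Path`. The runner then moves that file into `dist/` under the module's own name. The sketch below shows that contract with a hypothetical generator and a simplified stand-in for the move step; `collect`, `row_count_demo`, and the sample data are assumptions for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def generate(data):
    """Hypothetical generator: writes a one-line summary and returns its path."""
    temp_path = Path(tempfile.gettempdir()) / "row_count_demo.txt"
    temp_path.write_text(f"rows: {len(data)}\n", encoding="utf-8")
    return temp_path


def collect(output_path: Path, module_name: str, dist_dir: Path) -> Path:
    """Simplified stand-in for the runner's move step: name the final file
    after the last component of the module name, keeping the extension."""
    dist_dir.mkdir(exist_ok=True)
    final_path = dist_dir / (module_name.split('.')[-1] + output_path.suffix)
    shutil.move(str(output_path), str(final_path))
    return final_path


# Use a throwaway directory instead of the project's real dist/ folder.
dist = Path(tempfile.mkdtemp()) / "dist"
final = collect(generate([1, 2, 3]), "analysis.generators.row_count_demo", dist)
print(final.name)
```

Naming the output after the module means each generator can produce only one file per run. A generator that needs several outputs would require a different return convention.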
							
								
								
									
										0
									
								
								analysis/generators/__init__.py
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							
							
								
								
									
										119
									
								
								analysis/generators/estimate_lower_vs_upper_bounds.py
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							|  | @ -0,0 +1,119 @@ | |||
| import seaborn as sns | ||||
| import matplotlib.pyplot as plt | ||||
| from pathlib import Path | ||||
| import tempfile | ||||
| import logging | ||||
| import pandas as pd | ||||
| import numpy as np | ||||
| 
 | ||||
| # Copied from other generators for modularity. This dictionary maps | ||||
| # O*NET major occupation group codes to human-readable labels. | ||||
| OCCUPATION_MAJOR_CODES = { | ||||
|     '11': 'Management', | ||||
|     '13': 'Business & Financial', | ||||
|     '15': 'Computer & Mathematical', | ||||
|     '17': 'Architecture & Engineering', | ||||
|     '19': 'Life, Physical, & Social Science', | ||||
|     '21': 'Community & Social Service', | ||||
|     '23': 'Legal', | ||||
|     '25': 'Education, Training, & Library', | ||||
|     '27': 'Arts, Design, & Media', | ||||
|     '29': 'Healthcare Practitioners', | ||||
|     '31': 'Healthcare Support', | ||||
|     '33': 'Protective Service', | ||||
|     '35': 'Food Preparation & Serving', | ||||
|     '37': 'Building & Grounds Maintenance', | ||||
|     '39': 'Personal Care & Service', | ||||
|     '41': 'Sales & Related', | ||||
|     '43': 'Office & Admin Support', | ||||
|     '45': 'Farming, Fishing, & Forestry', | ||||
|     '47': 'Construction & Extraction', | ||||
|     '49': 'Installation, Maintenance, & Repair', | ||||
|     '51': 'Production', | ||||
|     '53': 'Transportation & Material Moving', | ||||
|     '55': 'Military Specific', | ||||
| } | ||||
| 
 | ||||
| 
 | ||||
| def generate(processed_df: pd.DataFrame): | ||||
|     """ | ||||
|     Generates a scatter plot comparing lower vs. upper time estimates for tasks. | ||||
| 
 | ||||
|     This corresponds to 'cell3' from the original analysis notebook. It helps | ||||
|     visualize the relationship and spread between the lower and upper bounds | ||||
|     of time estimates across different occupation groups. | ||||
| 
 | ||||
|     Args: | ||||
|         processed_df (pd.DataFrame): The preprocessed data. Expected columns: | ||||
|                                      'lb_estimate_in_minutes', | ||||
|                                      'ub_estimate_in_minutes', 'onetsoc_major'. | ||||
| 
 | ||||
|     Returns: | ||||
|         Path: The path to the generated temporary image file, or None on failure. | ||||
|     """ | ||||
|     logging.info("Generating plot of lower vs. upper time estimates...") | ||||
| 
 | ||||
|     # --- Data Validation and Preparation --- | ||||
|     required_cols = ['lb_estimate_in_minutes', 'ub_estimate_in_minutes', 'onetsoc_major'] | ||||
|     if not all(col in processed_df.columns for col in required_cols): | ||||
|         logging.error(f"Missing one or more required columns: {required_cols}. Cannot generate plot.") | ||||
|         return None | ||||
| 
 | ||||
|     df = processed_df.copy() | ||||
| 
 | ||||
|     # For log scaling, both lower and upper bounds must be positive. | ||||
|     df = df[(df['lb_estimate_in_minutes'] > 0) & (df['ub_estimate_in_minutes'] > 0)] | ||||
|     if df.empty: | ||||
|         logging.warning("No data with positive lower and upper estimates available to plot.") | ||||
|         return None | ||||
| 
 | ||||
|     # Replace the major code with its readable label for the hue legend. | ||||
|     df['occupation_label'] = df['onetsoc_major'].map(OCCUPATION_MAJOR_CODES) | ||||
| 
 | ||||
|     # --- Plotting --- | ||||
|     try: | ||||
|         plt.figure(figsize=(12, 10)) | ||||
|         ax = sns.scatterplot( | ||||
|             data=df, | ||||
|             x='lb_estimate_in_minutes', | ||||
|             y='ub_estimate_in_minutes', | ||||
|             alpha=0.2, | ||||
|             edgecolor=None, | ||||
|             hue="occupation_label"  # Use the labeled column for the legend | ||||
|         ) | ||||
| 
 | ||||
|         # Determine limits for the 45° reference line | ||||
|         # Use the maximum of both columns to create a square plot | ||||
|         max_val = df[['lb_estimate_in_minutes', 'ub_estimate_in_minutes']].max().max() | ||||
|         lims = (df[['lb_estimate_in_minutes', 'ub_estimate_in_minutes']].min().min(), max_val) | ||||
|         ax.plot(lims, lims, color='black', linestyle='--', linewidth=1, label='Upper = Lower') | ||||
| 
 | ||||
|         # Add helper lines for constant ratios (2x, 10x, 100x) | ||||
|         for k in [2, 10, 100]: | ||||
|             ax.plot(lims, [k * l for l in lims], | ||||
|                     linestyle=':', color='grey', linewidth=0.8, label=f'Upper = {k}x Lower') | ||||
| 
 | ||||
|         ax.set(xscale='log', yscale='log', xlim=lims, ylim=lims) | ||||
|         ax.set_xlabel('Lower-bound Estimate (minutes, log scale)', fontsize=12) | ||||
|         ax.set_ylabel('Upper-bound Estimate (minutes, log scale)', fontsize=12) | ||||
|         ax.set_title('Lower vs. Upper Time Estimates for All Tasks', fontsize=16) | ||||
| 
 | ||||
|         # Place the legend outside the plot to avoid obscuring data | ||||
|         ax.legend(bbox_to_anchor=(1.02, 1), loc='upper left', title='Occupation / Ratio') | ||||
| 
 | ||||
|         # --- File Saving --- | ||||
|         temp_dir = tempfile.gettempdir() | ||||
|         temp_path = Path(temp_dir) / "estimate_lower_vs_upper_bounds.png" | ||||
| 
 | ||||
|         # Use bbox_inches='tight' to ensure the external legend is included in the saved image. | ||||
|         plt.savefig(temp_path, dpi=300, bbox_inches='tight') | ||||
|         logging.info(f"Successfully saved plot to temporary file: {temp_path}") | ||||
| 
 | ||||
|         return temp_path | ||||
| 
 | ||||
|     except Exception as e: | ||||
|         logging.error(f"An error occurred while generating the plot: {e}", exc_info=True) | ||||
|         return None | ||||
|     finally: | ||||
|         plt.close() | ||||
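The dotted helper lines in the scatter plot mark where the upper bound is 2x, 10x, and 100x the lower bound. As a rough illustration of how a single task relates to those lines, the sketch below assigns a task to the first band its ratio fits in. The function `ratio_band` and the sample values are hypothetical and are not part of the generator.

```python
def ratio_band(lb, ub, bands=(2, 10, 100)):
    """Return the tightest band (as a label) that contains ub / lb,
    or None when either bound is non-positive, matching the plot's filter."""
    if lb <= 0 or ub <= 0:
        return None
    ratio = ub / lb
    for k in bands:
        if ratio <= k:
            return f"<= {k}x"
    return f"> {bands[-1]}x"


# Hypothetical (lower, upper) estimates in minutes.
for lb, ub in [(10, 15), (10, 100), (1, 500), (0, 5)]:
    print(lb, ub, ratio_band(lb, ub))
```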
							
								
								
									
										86
									
								
								analysis/generators/estimate_ratio_distribution.py
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							|  | @ -0,0 +1,86 @@ | |||
| import seaborn as sns | ||||
| import matplotlib.pyplot as plt | ||||
| import numpy as np | ||||
| import pandas as pd | ||||
| from pathlib import Path | ||||
| import tempfile | ||||
| import logging | ||||
| 
 | ||||
| def generate(processed_df: pd.DataFrame): | ||||
|     """ | ||||
|     Generates a histogram of the log-ratio of upper to lower time estimates. | ||||
| 
 | ||||
|     This corresponds to 'cell4' from the original analysis notebook. It shows | ||||
|     the distribution of how many times larger the upper estimate is compared | ||||
|     to the lower estimate. | ||||
| 
 | ||||
|     Args: | ||||
|         processed_df (pd.DataFrame): The preprocessed data. Expected columns: | ||||
|                                      'lb_estimate_in_minutes', | ||||
|                                      'ub_estimate_in_minutes'. | ||||
| 
 | ||||
|     Returns: | ||||
|         Path: The path to the generated temporary image file, or None on failure. | ||||
|     """ | ||||
|     logging.info("Generating distribution plot of estimate ratios...") | ||||
| 
 | ||||
|     # --- Data Validation and Preparation --- | ||||
|     required_cols = ['lb_estimate_in_minutes', 'ub_estimate_in_minutes'] | ||||
|     if not all(col in processed_df.columns for col in required_cols): | ||||
|         logging.error(f"Missing one or more required columns: {required_cols}. Cannot generate plot.") | ||||
|         return None | ||||
| 
 | ||||
|     df = processed_df.copy() | ||||
| 
 | ||||
|     # Keep only rows where both bounds are positive: a zero lower bound makes the | ||||
|     # ratio undefined, and a zero upper bound makes log10(ratio) equal to -inf. | ||||
|     df = df[(df['lb_estimate_in_minutes'] > 0) & (df['ub_estimate_in_minutes'] > 0)].copy() | ||||
|     df['estimate_ratio'] = df['ub_estimate_in_minutes'] / df['lb_estimate_in_minutes'] | ||||
| 
 | ||||
|     # Guard against overflow to infinity (huge upper bound, tiny lower bound) and | ||||
|     # drop rows whose ratio is missing. | ||||
|     df['estimate_ratio'] = df['estimate_ratio'].replace([np.inf, -np.inf], np.nan) | ||||
|     df.dropna(subset=['estimate_ratio'], inplace=True) | ||||
| 
 | ||||
|     if df.empty: | ||||
|         logging.warning("No valid data available to plot the estimate ratio distribution.") | ||||
|         return None | ||||
| 
 | ||||
|     # --- Plotting --- | ||||
|     try: | ||||
|         plt.figure(figsize=(10, 6)) | ||||
| 
 | ||||
|         # We plot the log10 of the ratio to better visualize the wide distribution | ||||
|         log_ratio = np.log10(df['estimate_ratio']) | ||||
| 
 | ||||
|         sns.histplot(log_ratio, bins=60, kde=True) | ||||
| 
 | ||||
|         # Add vertical lines for reference points | ||||
|         # log10(1) = 0, which is where upper bound equals lower bound | ||||
|         plt.axvline(x=0, color='black', linestyle='-', linewidth=1.5, label='1x (Upper = Lower)') | ||||
|         # A small ratio, e.g., 5% difference | ||||
|         plt.axvline(x=np.log10(1.05), color='orange', linestyle='--', linewidth=1, label='1.05x ratio') | ||||
|         # A 10x ratio | ||||
|         plt.axvline(x=np.log10(10), color='red', linestyle='--', linewidth=1, label='10x ratio') | ||||
| 
 | ||||
|         plt.xlabel('log₁₀(Upper Estimate / Lower Estimate)', fontsize=12) | ||||
|         plt.ylabel('Number of Tasks', fontsize=12) | ||||
|         plt.title('Distribution of Time Estimate Ratios', fontsize=16) | ||||
|         plt.legend() | ||||
|         plt.grid(axis='y', linestyle='--', alpha=0.7) | ||||
|         plt.tight_layout() | ||||
| 
 | ||||
|         # --- File Saving --- | ||||
|         temp_dir = tempfile.gettempdir() | ||||
|         temp_path = Path(temp_dir) / "estimate_ratio_distribution.png" | ||||
|         plt.savefig(temp_path, dpi=300) | ||||
|         logging.info(f"Successfully saved plot to temporary file: {temp_path}") | ||||
| 
 | ||||
|         return temp_path | ||||
| 
 | ||||
|     except Exception as e: | ||||
|         logging.error(f"An error occurred while generating the plot: {e}", exc_info=True) | ||||
|         return None | ||||
|     finally: | ||||
|         plt.close() | ||||
|  | @ -0,0 +1,135 @@ | |||
| import seaborn as sns | ||||
| import matplotlib.pyplot as plt | ||||
| import pandas as pd | ||||
| import numpy as np | ||||
| from pathlib import Path | ||||
| import tempfile | ||||
| import logging | ||||
| 
 | ||||
| # This mapping helps translate the O*NET 2-digit major group codes | ||||
| # into human-readable labels for the plot's y-axis. | ||||
| OCCUPATION_MAJOR_CODES = { | ||||
|     '11': 'Management', | ||||
|     '13': 'Business & Financial', | ||||
|     '15': 'Computer & Mathematical', | ||||
|     '17': 'Architecture & Engineering', | ||||
|     '19': 'Life, Physical, & Social Science', | ||||
|     '21': 'Community & Social Service', | ||||
|     '23': 'Legal', | ||||
|     '25': 'Education, Training, & Library', | ||||
|     '27': 'Arts, Design, & Media', | ||||
|     '29': 'Healthcare Practitioners', | ||||
|     '31': 'Healthcare Support', | ||||
|     '33': 'Protective Service', | ||||
|     '35': 'Food Preparation & Serving', | ||||
|     '37': 'Building & Grounds Maintenance', | ||||
|     '39': 'Personal Care & Service', | ||||
|     '41': 'Sales & Related', | ||||
|     '43': 'Office & Admin Support', | ||||
|     '45': 'Farming, Fishing, & Forestry', | ||||
|     '47': 'Construction & Extraction', | ||||
|     '49': 'Installation, Maintenance, & Repair', | ||||
|     '51': 'Production', | ||||
|     '53': 'Transportation & Material Moving', | ||||
|     '55': 'Military Specific', | ||||
| } | ||||
| 
 | ||||
| 
 | ||||
| def generate(processed_df: pd.DataFrame): | ||||
|     """ | ||||
|     Generates a heatmap of the median estimate ratio by occupation and task length quartile. | ||||
| 
 | ||||
|     This corresponds to 'cell5' from the original analysis notebook. It shows | ||||
|     how the ratio between upper and lower time estimates varies across | ||||
|     different occupations and for tasks of different typical lengths (binned | ||||
|     into quartiles). | ||||
| 
 | ||||
|     Args: | ||||
|         processed_df (pd.DataFrame): The preprocessed data. Expected columns: | ||||
|                                      'lb_estimate_in_minutes', | ||||
|                                      'ub_estimate_in_minutes', 'onetsoc_major'. | ||||
| 
 | ||||
|     Returns: | ||||
|         Path: The path to the generated temporary image file, or None on failure. | ||||
|     """ | ||||
|     logging.info("Generating heatmap of estimate ratios by occupation and task length...") | ||||
| 
 | ||||
|     # --- Data Validation and Preparation --- | ||||
|     required_cols = ['lb_estimate_in_minutes', 'ub_estimate_in_minutes', 'onetsoc_major'] | ||||
|     if not all(col in processed_df.columns for col in required_cols): | ||||
|         logging.error(f"Missing one or more required columns: {required_cols}. Cannot generate plot.") | ||||
|         return None | ||||
| 
 | ||||
|     df = processed_df.copy() | ||||
| 
 | ||||
|     # Calculate the estimate ratio. Rows with a non-positive lower bound are dropped | ||||
|     # because the ratio is undefined there; .copy() avoids SettingWithCopyWarning. | ||||
|     df = df[df['lb_estimate_in_minutes'] > 0].copy() | ||||
|     df['estimate_ratio'] = df['ub_estimate_in_minutes'] / df['lb_estimate_in_minutes'] | ||||
|     df.replace([np.inf, -np.inf], np.nan, inplace=True) | ||||
|     df.dropna(subset=['estimate_ratio'], inplace=True) | ||||
| 
 | ||||
|     if df.empty: | ||||
|         logging.warning("No valid data available for the ratio heatmap.") | ||||
|         return None | ||||
| 
 | ||||
|     # 1. Bin lower bounds into quartiles (Q1–Q4). | ||||
|     # duplicates='drop' merges repeated bin edges, but the four fixed labels below then | ||||
|     # no longer match the bin count and qcut raises ValueError, which is caught here. | ||||
|     try: | ||||
|         df['lb_q'] = pd.qcut( | ||||
|             df.lb_estimate_in_minutes, | ||||
|             q=4, | ||||
|             labels=['Q1 (Shortest)', 'Q2', 'Q3', 'Q4 (Longest)'], | ||||
|             duplicates='drop' | ||||
|         ) | ||||
|     except ValueError as e: | ||||
|         logging.error(f"Could not bin data into quartiles: {e}. There might not be enough unique values.") | ||||
|         return None | ||||
| 
 | ||||
| 
 | ||||
|     # 2. Aggregate: median ratio per cell (occupation x task length quartile) | ||||
|     pivot = df.pivot_table( | ||||
|         index='onetsoc_major', | ||||
|         columns='lb_q', | ||||
|         values='estimate_ratio', | ||||
|         aggfunc='median' | ||||
|     ) | ||||
| 
 | ||||
|     # Map the index (onetsoc_major codes) to their corresponding readable labels | ||||
|     pivot.index = pivot.index.map(OCCUPATION_MAJOR_CODES) | ||||
|     pivot.dropna(inplace=True) # Drop occupations with no data in some quartiles for a cleaner plot | ||||
| 
 | ||||
|     if pivot.empty: | ||||
|         logging.warning("Pivot table is empty after processing. Cannot generate heatmap.") | ||||
|         return None | ||||
| 
 | ||||
|     # --- Plotting --- | ||||
|     try: | ||||
|         plt.figure(figsize=(12, 10)) | ||||
|         sns.heatmap( | ||||
|             pivot, | ||||
|             cmap='RdYlGn_r',  # Reversed Red-Yellow-Green: green = narrow range, red = wide range | ||||
|             center=2,         # Midpoint (yellow) of the colormap sits at a ratio of 2 | ||||
|             annot=True,       # Show the median values in the cells | ||||
|             fmt='.1f',        # Format annotations to one decimal place | ||||
|             linewidths=.5, | ||||
|             cbar_kws={'label': 'Median Upper/Lower Estimate Ratio'} | ||||
|         ) | ||||
|         plt.xlabel('Task Length (based on lower-bound quartile)', fontsize=12) | ||||
|         plt.ylabel('Occupation Major Group', fontsize=12) | ||||
|         plt.title('Typical Estimate Range Width by Occupation and Task Length', fontsize=16) | ||||
|         plt.tight_layout() | ||||
| 
 | ||||
|         # --- File Saving --- | ||||
|         temp_dir = tempfile.gettempdir() | ||||
|         temp_path = Path(temp_dir) / "ratio_heatmap_by_occupation_and_task_length.png" | ||||
|         plt.savefig(temp_path, dpi=300) | ||||
|         logging.info(f"Successfully saved plot to temporary file: {temp_path}") | ||||
| 
 | ||||
|         return temp_path | ||||
| 
 | ||||
|     except Exception as e: | ||||
|         logging.error(f"An error occurred while generating the heatmap: {e}", exc_info=True) | ||||
|         return None | ||||
|     finally: | ||||
|         plt.close() | ||||
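
The quartile step in the heatmap generator passes four fixed labels to `pd.qcut` together with `duplicates='drop'`. A minimal sketch of one way to keep the labels in step with the bins that actually survive, assuming a hypothetical helper name `bin_into_quantiles` (not part of this commit) and generic `Q1…Qn` labels instead of the plot's own label text:

```python
import pandas as pd


def bin_into_quantiles(values: pd.Series, q: int = 4) -> pd.Series:
    """Hypothetical helper: quantile-bin `values`, labelling only the bins that exist."""
    # First pass: ask only for the bin edges that remain after duplicate
    # edges are dropped (labels=False returns integer bin codes).
    _, edges = pd.qcut(values, q=q, labels=False, retbins=True, duplicates="drop")
    n_bins = len(edges) - 1
    labels = [f"Q{i + 1}" for i in range(n_bins)]
    # Second pass: the label count now matches the surviving bin count.
    return pd.qcut(values, q=q, labels=labels, duplicates="drop")


skewed = pd.Series([1, 1, 1, 1, 1, 2, 3, 4])  # many ties at the low end
spread = pd.Series(range(1, 9))               # eight distinct values
skewed_bins = bin_into_quantiles(skewed)
spread_bins = bin_into_quantiles(spread)
print(list(skewed_bins.cat.categories))
print(list(spread_bins.cat.categories))
```

With heavily tied data the result has fewer than four columns, so a heatmap built on it would show fewer quantile bins rather than failing. The degenerate case where every value is identical is not handled in this sketch.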
							
								
								
									
										161
									
								
								analysis/generators/task_breakdown_by_occupation.py
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
							|  | @ -0,0 +1,161 @@ | |||
| import pandas as pd | ||||
| import matplotlib.pyplot as plt | ||||
| import matplotlib.ticker as mtick | ||||
| import matplotlib.colors as mcolors | ||||
| from pathlib import Path | ||||
| import tempfile | ||||
| import logging | ||||
| 
 | ||||
| # This mapping helps translate the O*NET 2-digit major group codes | ||||
| # into human-readable labels for the plot's y-axis. | ||||
| OCCUPATION_MAJOR_CODES = { | ||||
|     '11': 'Management', | ||||
|     '13': 'Business & Financial', | ||||
|     '15': 'Computer & Mathematical', | ||||
|     '17': 'Architecture & Engineering', | ||||
|     '19': 'Life, Physical, & Social Science', | ||||
|     '21': 'Community & Social Service', | ||||
|     '23': 'Legal', | ||||
|     '25': 'Education, Training, & Library', | ||||
|     '27': 'Arts, Design, & Media', | ||||
|     '29': 'Healthcare Practitioners', | ||||
|     '31': 'Healthcare Support', | ||||
|     '33': 'Protective Service', | ||||
|     '35': 'Food Preparation & Serving', | ||||
|     '37': 'Building & Grounds Maintenance', | ||||
|     '39': 'Personal Care & Service', | ||||
|     '41': 'Sales & Related', | ||||
|     '43': 'Office & Admin Support', | ||||
|     '45': 'Farming, Fishing, & Forestry', | ||||
|     '47': 'Construction & Extraction', | ||||
|     '49': 'Installation, Maintenance, & Repair', | ||||
|     '51': 'Production', | ||||
|     '53': 'Transportation & Material Moving', | ||||
|     '55': 'Military Specific', | ||||
| } | ||||
| 
 | ||||
| # Define colors to match the original notebook's palette. | ||||
| # These are standard hex codes for gray and lime shades. | ||||
| BAR_COLORS = [ | ||||
|     '#D1D5DB', # gray-300 | ||||
|     '#84CC16', # lime-500 | ||||
|     '#D9F99D', # lime-200 | ||||
| ] | ||||
| 
 | ||||
| 
 | ||||
| def _get_contrasting_text_color(bg_color_hex): | ||||
|     """ | ||||
|     Determines if black or white text provides better contrast against a given background color. | ||||
|     """ | ||||
|     try: | ||||
|         rgba = mcolors.to_rgba(bg_color_hex) | ||||
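|         # Approximate luminance with the Rec. 709 / sRGB weights. The channels are used | ||||
|         # as-is (gamma-encoded), so this is a heuristic rather than true relative luminance. | ||||
|         luminance = 0.2126 * rgba[0] + 0.7152 * rgba[1] + 0.0722 * rgba[2] | ||||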
|         return 'black' if luminance > 0.55 else 'white' | ||||
|     except ValueError: | ||||
|         return 'black' # Default to black if color is invalid | ||||
| 
 | ||||
| 
 | ||||
| def generate(processed_df: pd.DataFrame): | ||||
|     """ | ||||
|     Generates a stacked bar chart breaking down tasks by remote status and estimability. | ||||
| 
 | ||||
|     This corresponds to 'cell10' from the original analysis notebook. It shows, | ||||
|     for each occupation, the percentage of tasks that are not remote, remote and | ||||
|     estimable, or remote and not estimable. | ||||
| 
 | ||||
|     Args: | ||||
|         processed_df (pd.DataFrame): The preprocessed data. Expected columns: | ||||
|                                      'onetsoc_major', 'remote_status', 'estimateable'. | ||||
| 
 | ||||
|     Returns: | ||||
|         Path: The path to the generated temporary image file, or None on failure. | ||||
|     """ | ||||
|     logging.info("Generating task breakdown by occupation plot...") | ||||
| 
 | ||||
|     # --- Data Validation --- | ||||
|     required_cols = ['onetsoc_major', 'remote_status', 'estimateable'] | ||||
|     if not all(col in processed_df.columns for col in required_cols): | ||||
|         logging.error(f"Missing one or more required columns: {required_cols}. Cannot generate plot.") | ||||
|         return None | ||||
| 
 | ||||
|     df = processed_df.copy() | ||||
| 
 | ||||
|     # --- Data Summarization --- | ||||
|     summary_data = [] | ||||
|     for code, label in OCCUPATION_MAJOR_CODES.items(): | ||||
|         occ_df = df[df['onetsoc_major'] == code] | ||||
|         total_tasks = len(occ_df) | ||||
|         if total_tasks == 0: | ||||
|             continue | ||||
| 
 | ||||
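|         # 'ATOMIC' tasks are reported as estimable and 'ONGOING-CONSTRAINT' tasks as | ||||
|         # not estimable; remote tasks with any other value fall into neither remote bucket. | ||||
|         not_remote_count = len(occ_df[occ_df['remote_status'] != 'remote']) | ||||
|         remote_df = occ_df[occ_df['remote_status'] == 'remote'] | ||||
|         remote_atomic_count = len(remote_df[remote_df['estimateable'] == 'ATOMIC']) | ||||
|         remote_ongoing_count = len(remote_df[remote_df['estimateable'] == 'ONGOING-CONSTRAINT']) | ||||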
| 
 | ||||
|         summary_data.append({ | ||||
|             'occupation_label': label, | ||||
|             'count_not_remote': not_remote_count, | ||||
|             'count_remote_atomic': remote_atomic_count, | ||||
|             'count_remote_ongoing': remote_ongoing_count, | ||||
|             'total_tasks': total_tasks | ||||
|         }) | ||||
| 
 | ||||
|     if not summary_data: | ||||
|         logging.warning("No data available to generate the task breakdown plot.") | ||||
|         return None | ||||
| 
 | ||||
|     summary_df = pd.DataFrame(summary_data) | ||||
| 
 | ||||
|     # --- Percentage Calculation --- | ||||
|     summary_df['pct_not_remote'] = (summary_df['count_not_remote'] / summary_df['total_tasks']) * 100 | ||||
|     summary_df['pct_remote_atomic'] = (summary_df['count_remote_atomic'] / summary_df['total_tasks']) * 100 | ||||
|     summary_df['pct_remote_ongoing'] = (summary_df['count_remote_ongoing'] / summary_df['total_tasks']) * 100 | ||||
| 
 | ||||
|     plot_df = summary_df.set_index('occupation_label')[ | ||||
|         ['pct_not_remote', 'pct_remote_atomic', 'pct_remote_ongoing'] | ||||
|     ] | ||||
|     plot_df.columns = ['Not Remote', 'Remote & Estimable', 'Remote & Not Estimable'] | ||||
|     plot_df = plot_df.sort_values(by='Not Remote', ascending=False) | ||||
| 
 | ||||
| 
 | ||||
|     # --- Plotting --- | ||||
|     try: | ||||
|         fig, ax = plt.subplots(figsize=(14, 10)) | ||||
|         plot_df.plot(kind='barh', stacked=True, ax=ax, color=BAR_COLORS, width=0.8) | ||||
| 
 | ||||
|         ax.set_xlabel("Percentage of Tasks", fontsize=12) | ||||
|         ax.set_ylabel("Occupation Major Group", fontsize=12) | ||||
|         ax.set_title("Task Breakdown by Occupation, Remote Status, and Estimability", fontsize=16, pad=20) | ||||
|         ax.xaxis.set_major_formatter(mtick.PercentFormatter()) | ||||
|         ax.set_xlim(0, 100) | ||||
|         ax.spines['right'].set_visible(False) | ||||
|         ax.spines['top'].set_visible(False) | ||||
| 
 | ||||
|         # Add percentage labels inside each bar segment | ||||
|         for i, container in enumerate(ax.containers): | ||||
|             text_color = _get_contrasting_text_color(BAR_COLORS[i]) | ||||
|             for patch in container.patches: | ||||
|                 width = patch.get_width() | ||||
|                 if width > 3:  # Only label segments wider than 3% | ||||
|                     x = patch.get_x() + width / 2 | ||||
|                     y = patch.get_y() + patch.get_height() / 2 | ||||
|                     ax.text(x, y, f"{width:.1f}%", ha='center', va='center', | ||||
|                             fontsize=8, color=text_color, fontweight='medium') | ||||
| 
 | ||||
|         ax.legend(title="Task Category", bbox_to_anchor=(1.02, 1), loc='upper left', frameon=False) | ||||
| 
 | ||||
|         # --- File Saving --- | ||||
|         temp_dir = tempfile.gettempdir() | ||||
|         temp_path = Path(temp_dir) / "task_breakdown_by_occupation.png" | ||||
|         plt.savefig(temp_path, dpi=300, bbox_inches='tight') | ||||
|         logging.info(f"Successfully saved plot to temporary file: {temp_path}") | ||||
| 
 | ||||
|         return temp_path | ||||
| 
 | ||||
|     except Exception as e: | ||||
|         logging.error(f"An error occurred while generating the plot: {e}", exc_info=True) | ||||
|         return None | ||||
|     finally: | ||||
|         plt.close() | ||||
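
The label-color helper in this file thresholds a weighted sum of the gamma-encoded channels. As a point of comparison, here is a minimal sketch of a WCAG-style variant that linearizes the sRGB channels first and then picks whichever of black or white gives the higher contrast ratio. The function names are hypothetical, the sketch only accepts `#RRGGBB` strings, and it uses the sRGB linearization threshold of 0.04045:

```python
def relative_luminance(hex_color: str) -> float:
    """Relative luminance of an '#RRGGBB' color (0.0 = black, 1.0 = white)."""
    hex_color = hex_color.lstrip("#")
    channels = [int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    # Undo the sRGB gamma encoding before applying the luminance weights.
    linear = [
        c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
        for c in channels
    ]
    r, g, b = linear
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrasting_text_color(hex_color: str) -> str:
    """Return 'black' or 'white', whichever contrasts more with the background."""
    lum = relative_luminance(hex_color)
    contrast_with_white = 1.05 / (lum + 0.05)   # white has luminance 1.0
    contrast_with_black = (lum + 0.05) / 0.05   # black has luminance 0.0
    return "black" if contrast_with_black >= contrast_with_white else "white"


for color in ("#D1D5DB", "#84CC16", "#000000"):
    print(color, contrasting_text_color(color))
```

For the light bar colors used in this file both approaches pick black text; they can differ for mid-tone backgrounds, where the fixed 0.55 cutoff is a judgment call.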
							
								
								
									
										74
									
								
								analysis/generators/task_estimate_distribution.py
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
							|  | @ -0,0 +1,74 @@ | |||
| import seaborn as sns | ||||
| import matplotlib.pyplot as plt | ||||
| from pathlib import Path | ||||
| import tempfile | ||||
| import logging | ||||
| import pandas as pd | ||||
| 
 | ||||
| def generate(processed_df: pd.DataFrame): | ||||
|     """ | ||||
|     Generates a histogram of the task time estimate midpoints. | ||||
| 
 | ||||
|     This generator corresponds to 'cell1' from the original analysis notebook. | ||||
|     It visualizes the distribution of the calculated midpoint of time estimates | ||||
|     for all tasks on a logarithmic scale to handle the wide range of values. | ||||
| 
 | ||||
|     Args: | ||||
|         processed_df (pd.DataFrame): The preprocessed data, expected to contain | ||||
|                                      'lb_estimate_in_minutes' and | ||||
|                                      'ub_estimate_in_minutes' columns. | ||||
| 
 | ||||
|     Returns: | ||||
|         Path: The path to the generated temporary image file, or None if | ||||
|               generation fails. | ||||
|     """ | ||||
|     logging.info("Generating task estimate distribution plot...") | ||||
| 
 | ||||
|     # --- Data Validation and Preparation --- | ||||
|     required_cols = ['lb_estimate_in_minutes', 'ub_estimate_in_minutes'] | ||||
|     if not all(col in processed_df.columns for col in required_cols): | ||||
|         logging.error( | ||||
|             f"Required columns {required_cols} not found in the DataFrame. " | ||||
|             "Cannot generate plot." | ||||
|         ) | ||||
|         return None | ||||
| 
 | ||||
|     # Create a copy to avoid modifying the original DataFrame | ||||
|     df = processed_df.copy() | ||||
| 
 | ||||
|     # Calculate the midpoint from lower and upper bounds, as was done in the notebook | ||||
|     df['estimate_midpoint'] = (df['lb_estimate_in_minutes'] + df['ub_estimate_in_minutes']) / 2 | ||||
| 
 | ||||
|     # For log scaling, we must use positive values. Filter out any non-positive midpoints. | ||||
|     df = df[df['estimate_midpoint'] > 0] | ||||
|     if df.empty: | ||||
|         logging.warning("No data with positive estimate midpoints available to plot.") | ||||
|         return None | ||||
| 
 | ||||
|     # --- Plotting --- | ||||
|     try: | ||||
|         plt.figure(figsize=(10, 6)) | ||||
|         ax = sns.histplot(data=df, x='estimate_midpoint', log_scale=True) | ||||
| 
 | ||||
|         ax.set_title('Distribution of Task Time Estimate Midpoints', fontsize=16) | ||||
|         ax.set_xlabel('Estimate Midpoint (minutes, log scale)', fontsize=12) | ||||
|         ax.set_ylabel('Number of Tasks', fontsize=12) | ||||
|         plt.tight_layout() | ||||
| 
 | ||||
|         # --- File Saving --- | ||||
|         # Create a temporary file to save the plot. The orchestrator (`generate.py`) | ||||
|         # will move this to the final 'dist/' directory. | ||||
|         temp_dir = tempfile.gettempdir() | ||||
|         temp_path = Path(temp_dir) / "task_estimate_distribution.png" | ||||
| 
 | ||||
|         plt.savefig(temp_path, dpi=300) | ||||
|         logging.info(f"Successfully saved plot to temporary file: {temp_path}") | ||||
| 
 | ||||
|         return temp_path | ||||
| 
 | ||||
|     except Exception as e: | ||||
|         logging.error(f"An error occurred while generating the plot: {e}", exc_info=True) | ||||
|         return None | ||||
|     finally: | ||||
|         # Close the figure to free up memory, which is crucial when running many generators. | ||||
|         plt.close() | ||||
							
								
								
									
										134
									
								
								analysis/generators/temporal_coherence_cdf.py
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
							|  | @ -0,0 +1,134 @@ | |||
| import pandas as pd | ||||
| import numpy as np | ||||
| import matplotlib.pyplot as plt | ||||
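| import matplotlib as mpl | ||||
| import matplotlib.ticker  # explicit import: mpl.ticker is used below and should not rely on pyplot loading it | ||||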
| from pathlib import Path | ||||
| import tempfile | ||||
| import logging | ||||
| 
 | ||||
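| # Replicating the color palette from the original notebook for consistency. | ||||
| # The hex values match Tailwind CSS colors; the dictionary keys are local labels and | ||||
| # do not always equal the Tailwind shade number (the actual shade is noted per line). | ||||
| GRAY_PALETTE = { | ||||
|     '100': '#F3F4F6',  # Tailwind gray-100 | ||||
|     '300': '#D1D5DB',  # Tailwind gray-300 | ||||
| } | ||||
| LIME_PALETTE = { | ||||
|     '300': '#D9F99D',  # Tailwind lime-200 | ||||
|     '600': '#A3E635',  # Tailwind lime-400, a mid-tone lime | ||||
|     '900': '#4D7C0F',  # Tailwind lime-700, a dark lime/green | ||||
| } | ||||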
| 
 | ||||
| 
 | ||||
| def _calculate_cdf(series: pd.Series): | ||||
|     """ | ||||
|     Calculates the empirical Cumulative Distribution Function (CDF) for a series. | ||||
|     Returns the sorted values and their corresponding cumulative percentages. | ||||
|     """ | ||||
|     # Drop NA values and ensure the series is sorted | ||||
|     s = series.dropna().sort_values().reset_index(drop=True) | ||||
|     # Calculate cumulative percentage: (index + 1) / total_count | ||||
|     cdf_y = ((s.index + 1) / len(s)) * 100 | ||||
|     return s.values, cdf_y | ||||
| 
 | ||||
| 
 | ||||
| def generate(processed_df: pd.DataFrame): | ||||
|     """ | ||||
|     Generates a Cumulative Distribution Function (CDF) plot for task time estimates. | ||||
| 
 | ||||
|     This corresponds to the second 'cell11' from the original notebook. It plots | ||||
|     the CDF for the lower-bound, upper-bound, and mid-point of time estimates, | ||||
|     showing the percentage of tasks that can be completed within a certain time. | ||||
| 
 | ||||
|     Args: | ||||
|         processed_df (pd.DataFrame): The preprocessed data. Expected columns: | ||||
|                                      'lb_estimate_in_minutes', | ||||
|                                      'ub_estimate_in_minutes'. | ||||
| 
 | ||||
|     Returns: | ||||
|         Path: The path to the generated temporary image file, or None on failure. | ||||
|     """ | ||||
|     logging.info("Generating temporal coherence CDF plot...") | ||||
| 
 | ||||
|     # --- Data Validation and Preparation --- | ||||
|     required_cols = ['lb_estimate_in_minutes', 'ub_estimate_in_minutes'] | ||||
|     if not all(col in processed_df.columns for col in required_cols): | ||||
|         logging.error(f"Missing one or more required columns: {required_cols}. Cannot generate plot.") | ||||
|         return None | ||||
| 
 | ||||
|     df = processed_df.copy() | ||||
| 
 | ||||
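|     # Log scale requires positive values. .copy() lets the midpoint column be added | ||||
|     # below without a SettingWithCopyWarning. | ||||
|     df = df[(df['lb_estimate_in_minutes'] > 0) & (df['ub_estimate_in_minutes'] > 0)].copy() | ||||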
|     if df.empty: | ||||
|         logging.warning("No data with positive estimates available to generate CDF plot.") | ||||
|         return None | ||||
| 
 | ||||
|     # Calculate mid-point estimate | ||||
|     df['midpoint_estimate'] = (df['lb_estimate_in_minutes'] + df['ub_estimate_in_minutes']) / 2 | ||||
| 
 | ||||
|     # Prepare data for CDF plots | ||||
|     x_lb, y_lb = _calculate_cdf(df['lb_estimate_in_minutes']) | ||||
|     x_ub, y_ub = _calculate_cdf(df['ub_estimate_in_minutes']) | ||||
|     x_mid, y_mid = _calculate_cdf(df['midpoint_estimate']) | ||||
| 
 | ||||
|     # --- Plotting --- | ||||
|     try: | ||||
|         fig, ax = plt.subplots(figsize=(12, 8)) | ||||
| 
 | ||||
|         # --- Grid and Reference Lines --- | ||||
|         # Horizontal reference lines for percentages | ||||
|         for y_val in range(0, 101, 10): | ||||
|             ax.axhline(y_val, color=GRAY_PALETTE['100'], linewidth=0.8, zorder=1) | ||||
| 
 | ||||
|         # Vertical reference lines for human-friendly durations | ||||
|         ticks = [1, 5, 10, 30, 60, 120, 240, 480, 1440, 2880, 10080, 43200] | ||||
|         for tick in ticks: | ||||
|             ax.axvline(tick, color=GRAY_PALETTE['300'], linewidth=0.8, linestyle='--', zorder=1) | ||||
| 
 | ||||
|         # --- CDF Plots --- | ||||
|         ax.step(x_lb, y_lb, where='post', color=LIME_PALETTE['300'], linewidth=1.8, linestyle='--', zorder=2, label='Lower-bound Estimate (CDF)') | ||||
|         ax.step(x_ub, y_ub, where='post', color=LIME_PALETTE['900'], linewidth=1.8, linestyle=':', zorder=3, label='Upper-bound Estimate (CDF)') | ||||
|         ax.step(x_mid, y_mid, where='post', color=LIME_PALETTE['600'], linewidth=2.2, zorder=4, label='Mid-point Estimate (CDF)') | ||||
| 
 | ||||
|         # --- Axes Configuration --- | ||||
|         ax.set_ylim(0, 100) | ||||
|         ax.set_xscale('log') | ||||
| 
 | ||||
|         # Custom x-ticks for durations | ||||
|         ticklabels = ['1 min', '5 min', '10 min', '30 min', '1 hr', '2 hrs', '4 hrs', '8 hrs', '1 day', '2 days', '1 week', '30 days'] | ||||
|         ax.set_xticks(ticks) | ||||
|         ax.set_xticklabels(ticklabels, rotation=45, ha='right') | ||||
|         ax.minorticks_off() # Turn off minor ticks for clarity with custom grid | ||||
| 
 | ||||
|         # Format y-axis as percentages | ||||
|         ax.yaxis.set_major_formatter(mpl.ticker.PercentFormatter(decimals=0)) | ||||
| 
 | ||||
|         # --- Spines and Labels --- | ||||
|         for spine in ['top', 'right']: | ||||
|             ax.spines[spine].set_visible(False) | ||||
|         for spine in ['left', 'bottom']: | ||||
|             ax.spines[spine].set_edgecolor(GRAY_PALETTE['300']) | ||||
| 
 | ||||
|         # Use ax.text for more control over label placement than ax.set_ylabel/xlabel | ||||
|         ax.text(-0.07, 1.02, "% of tasks with duration ≤ X", transform=ax.transAxes, | ||||
|                 fontsize=12, fontweight='semibold', va='bottom') | ||||
|         ax.text(0.5, -0.25, 'Task Duration (X)', transform=ax.transAxes, | ||||
|                 fontsize=12, fontweight='semibold', ha='center') | ||||
| 
 | ||||
|         ax.legend(frameon=False, loc='lower right') | ||||
|         fig.suptitle('Cumulative Distribution of Task Time Estimates', fontsize=16, y=0.96) | ||||
|         plt.tight_layout(rect=[0, 0, 1, 0.95]) # Adjust layout to make space for suptitle | ||||
| 
 | ||||
|         # --- File Saving --- | ||||
|         temp_dir = tempfile.gettempdir() | ||||
|         temp_path = Path(temp_dir) / "temporal_coherence_cdf.png" | ||||
|         plt.savefig(temp_path, dpi=300, bbox_inches='tight') | ||||
|         logging.info(f"Successfully saved plot to temporary file: {temp_path}") | ||||
| 
 | ||||
|         return temp_path | ||||
| 
 | ||||
|     except Exception as e: | ||||
|         logging.error(f"An error occurred while generating the CDF plot: {e}", exc_info=True) | ||||
|         return None | ||||
|     finally: | ||||
|         plt.close() | ||||
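
The `_calculate_cdf` helper above sorts the values and assigns each one the cumulative percentage `(index + 1) / n`. A dependency-free sketch of the same idea, with a hypothetical `percent_at_or_below` lookup added for reading a single point off the curve (both function names and the sample durations are illustrative, not taken from the dataset):

```python
import math
from bisect import bisect_right


def empirical_cdf(values):
    """Return sorted values and the cumulative percentage reached at each one."""
    xs = sorted(v for v in values if v is not None and not math.isnan(v))
    n = len(xs)
    ys = [100.0 * (i + 1) / n for i in range(n)]
    return xs, ys


def percent_at_or_below(sorted_values, threshold):
    """Share (in %) of values that are <= threshold; expects sorted input."""
    if not sorted_values:
        return 0.0
    return 100.0 * bisect_right(sorted_values, threshold) / len(sorted_values)


durations = [30, 5, float("nan"), 60, 5]  # minutes; NaN is dropped
xs, ys = empirical_cdf(durations)
print(xs, ys)
print(percent_at_or_below(xs, 29))
```

Drawing `xs` against `ys` with a `where='post'` step, as the generator does, holds each percentage until the next observed duration, which is what an empirical CDF should do.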
							
								
								
									
										112
									
								
								analysis/generators/time_estimate_spread_by_occupation.py
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
							|  | @ -0,0 +1,112 @@ | |||
| import seaborn as sns | ||||
| import matplotlib.pyplot as plt | ||||
| from pathlib import Path | ||||
| import tempfile | ||||
| import logging | ||||
| import pandas as pd | ||||
| 
 | ||||
| # Based on O*NET SOC 2018 structure, this mapping helps translate | ||||
| # the 2-digit major group codes into human-readable labels. | ||||
| OCCUPATION_MAJOR_CODES = { | ||||
|     '11': 'Management', | ||||
|     '13': 'Business & Financial', | ||||
|     '15': 'Computer & Mathematical', | ||||
|     '17': 'Architecture & Engineering', | ||||
|     '19': 'Life, Physical, & Social Science', | ||||
|     '21': 'Community & Social Service', | ||||
|     '23': 'Legal', | ||||
|     '25': 'Education, Training, & Library', | ||||
|     '27': 'Arts, Design, & Media', | ||||
|     '29': 'Healthcare Practitioners', | ||||
|     '31': 'Healthcare Support', | ||||
|     '33': 'Protective Service', | ||||
|     '35': 'Food Preparation & Serving', | ||||
|     '37': 'Building & Grounds Maintenance', | ||||
|     '39': 'Personal Care & Service', | ||||
|     '41': 'Sales & Related', | ||||
|     '43': 'Office & Admin Support', | ||||
|     '45': 'Farming, Fishing, & Forestry', | ||||
|     '47': 'Construction & Extraction', | ||||
|     '49': 'Installation, Maintenance, & Repair', | ||||
|     '51': 'Production', | ||||
|     '53': 'Transportation & Material Moving', | ||||
|     '55': 'Military Specific', | ||||
| } | ||||
| 
 | ||||
| 
 | ||||
| def generate(processed_df: pd.DataFrame): | ||||
|     """ | ||||
|     Generates a box plot showing the spread of time-range estimates per occupation. | ||||
| 
 | ||||
|     This corresponds to 'cell2' from the original analysis notebook. It visualizes | ||||
|     the distribution of the difference between upper and lower time estimates for | ||||
|     each major occupational group. | ||||
| 
 | ||||
|     Args: | ||||
|         processed_df (pd.DataFrame): The preprocessed data. Expected columns: | ||||
|                                      'lb_estimate_in_minutes', | ||||
|                                      'ub_estimate_in_minutes', 'onetsoc_major'. | ||||
| 
 | ||||
|     Returns: | ||||
|         Path: The path to the generated temporary image file, or None on failure. | ||||
|     """ | ||||
|     logging.info("Generating plot of time estimate spread by occupation...") | ||||
| 
 | ||||
|     # --- Data Validation and Preparation --- | ||||
|     required_cols = ['lb_estimate_in_minutes', 'ub_estimate_in_minutes', 'onetsoc_major'] | ||||
|     if not all(col in processed_df.columns for col in required_cols): | ||||
|         logging.error(f"Missing one or more required columns: {required_cols}. Cannot generate plot.") | ||||
|         return None | ||||
| 
 | ||||
|     df = processed_df.copy() | ||||
| 
 | ||||
|     # Calculate the estimate range. | ||||
|     df['estimate_range'] = df['ub_estimate_in_minutes'] - df['lb_estimate_in_minutes'] | ||||
| 
 | ||||
|     # For log scaling, we need positive values. Filter out any non-positive ranges. | ||||
|     df = df[df['estimate_range'] > 0] | ||||
|     if df.empty: | ||||
|         logging.warning("No data with a positive estimate range available to plot.") | ||||
|         return None | ||||
| 
 | ||||
|     # Sort by the major code to ensure a consistent plot order | ||||
|     df = df.sort_values('onetsoc_major') | ||||
| 
 | ||||
|     # --- Plotting --- | ||||
|     try: | ||||
|         plt.figure(figsize=(14, 10)) | ||||
| 
 | ||||
|         ax = sns.boxplot( | ||||
|             data=df, | ||||
|             x='onetsoc_major', | ||||
|             y='estimate_range', | ||||
|             showfliers=False  # Outliers are excluded for a clearer view of the main distribution | ||||
|         ) | ||||
| 
 | ||||
|         plt.yscale('log')  # The long tail of the data makes a log scale more readable | ||||
|         plt.xlabel('Occupation Major Group', fontsize=12) | ||||
|         plt.ylabel('Time Estimate Range (upper - lower, in minutes, log scale)', fontsize=12) | ||||
|         plt.title('Spread of Time-Range Estimates by Occupation', fontsize=16) | ||||
| 
 | ||||
|         # Replace numeric x-tick labels (e.g., '11', '15') with meaningful text labels | ||||
|         ax.set_xticklabels( | ||||
|             [OCCUPATION_MAJOR_CODES.get(code.get_text(), code.get_text()) for code in ax.get_xticklabels()], | ||||
|             rotation=60, | ||||
|             ha='right' # Align rotated labels correctly | ||||
|         ) | ||||
| 
 | ||||
|         plt.tight_layout() | ||||
| 
 | ||||
|         # --- File Saving --- | ||||
|         temp_dir = tempfile.gettempdir() | ||||
|         temp_path = Path(temp_dir) / "time_estimate_spread_by_occupation.png" | ||||
|         plt.savefig(temp_path, dpi=300, bbox_inches='tight') | ||||
|         logging.info(f"Successfully saved plot to temporary file: {temp_path}") | ||||
| 
 | ||||
|         return temp_path | ||||
| 
 | ||||
|     except Exception as e: | ||||
|         logging.error(f"An error occurred while generating the plot: {e}", exc_info=True) | ||||
|         return None | ||||
|     finally: | ||||
|         plt.close() | ||||
							
								
								
									
										150
									
								
								analysis/generators/wage_bill_by_occupation.py
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							|  | @ -0,0 +1,150 @@ | |||
| import seaborn as sns | ||||
| import matplotlib.pyplot as plt | ||||
| import matplotlib.ticker as mticker | ||||
| import pandas as pd | ||||
| from pathlib import Path | ||||
| import tempfile | ||||
| import logging | ||||
| 
 | ||||
| # Assuming data.py is in the same package and provides this function | ||||
| from ..data import get_db_connection | ||||
| 
 | ||||
| # This mapping helps translate the O*NET 2-digit major group codes | ||||
| # into human-readable labels for the plot's y-axis. | ||||
| OCCUPATION_MAJOR_CODES = { | ||||
|     '11': 'Management', | ||||
|     '13': 'Business & Financial', | ||||
|     '15': 'Computer & Mathematical', | ||||
|     '17': 'Architecture & Engineering', | ||||
|     '19': 'Life, Physical, & Social Science', | ||||
|     '21': 'Community & Social Service', | ||||
|     '23': 'Legal', | ||||
|     '25': 'Education, Training, & Library', | ||||
|     '27': 'Arts, Design, & Media', | ||||
|     '29': 'Healthcare Practitioners', | ||||
|     '31': 'Healthcare Support', | ||||
|     '33': 'Protective Service', | ||||
|     '35': 'Food Preparation & Serving', | ||||
|     '37': 'Building & Grounds Maintenance', | ||||
|     '39': 'Personal Care & Service', | ||||
|     '41': 'Sales & Related', | ||||
|     '43': 'Office & Admin Support', | ||||
|     '45': 'Farming, Fishing, & Forestry', | ||||
|     '47': 'Construction & Extraction', | ||||
|     '49': 'Installation, Maintenance, & Repair', | ||||
|     '51': 'Production', | ||||
|     '53': 'Transportation & Material Moving', | ||||
|     '55': 'Military Specific', | ||||
| } | ||||
| 
 | ||||
| 
 | ||||
| def generate(processed_df: pd.DataFrame): | ||||
|     """ | ||||
|     Generates a bar plot of the total wage bill per major occupation group. | ||||
| 
 | ||||
|     This corresponds to the first 'cell11' from the original analysis notebook. | ||||
|     It calculates the total wage bill (Total Employment * Annual Mean Wage) for | ||||
|     each occupation and aggregates it by major occupation group. This generator | ||||
|     loads its data directly from the O*NET database. | ||||
| 
 | ||||
|     Args: | ||||
|         processed_df (pd.DataFrame): The preprocessed data (not used in this generator, | ||||
|                                      but required by the function signature). | ||||
| 
 | ||||
|     Returns: | ||||
|         Path: The path to the generated temporary image file, or None on failure. | ||||
|     """ | ||||
|     logging.info("Generating plot of total wage bill by occupation...") | ||||
|     conn = None | ||||
|     try: | ||||
|         # --- Data Loading --- | ||||
|         # This generator needs specific data that is not in the main preprocessed_df. | ||||
|         # It loads occupational employment and wage data directly from the database. | ||||
|         conn = get_db_connection() | ||||
|         if conn is None: | ||||
|             raise ConnectionError("Could not get database connection.") | ||||
| 
 | ||||
|         # This data is stored in a long format in the `occupation_level_metadata` table. | ||||
|         # We need to query this table and pivot it to get employment and wage columns. | ||||
|         query = "SELECT onetsoc_code, item, response FROM occupation_level_metadata WHERE item IN ('Employment', 'Annual Mean Wage')" | ||||
|         try: | ||||
|             df_meta = pd.read_sql_query(query, conn) | ||||
| 
 | ||||
|             # Pivot the table to create 'Employment' and 'Annual Mean Wage' columns | ||||
|             df_oesm = df_meta.pivot(index='onetsoc_code', columns='item', values='response').reset_index() | ||||
|             logging.info("Pivoted occupation metadata. Columns are: %s", df_oesm.columns.tolist()) | ||||
| 
 | ||||
|             # Rename for consistency with the original notebook's code | ||||
|             df_oesm.rename(columns={ | ||||
|                 'onetsoc_code': 'OCC_CODE', | ||||
|                 'Employment': 'TOT_EMP', | ||||
|                 'Annual Mean Wage': 'A_MEAN' | ||||
|             }, inplace=True) | ||||
|         except (pd.io.sql.DatabaseError, KeyError) as e: | ||||
|             logging.error(f"Failed to query or pivot occupation metadata: {e}", exc_info=True) | ||||
|             return None | ||||
| 
 | ||||
| 
 | ||||
|         # --- Data Preparation --- | ||||
|         # Create a 'major group' code from the first two digits of the SOC code | ||||
|         df_oesm['onetsoc_major'] = df_oesm['OCC_CODE'].str[:2] | ||||
| 
 | ||||
|         # Ensure wage and employment columns are numeric, coercing errors to NaN | ||||
|         df_oesm['TOT_EMP'] = pd.to_numeric(df_oesm['TOT_EMP'], errors='coerce') | ||||
|         df_oesm['A_MEAN'] = pd.to_numeric(df_oesm['A_MEAN'], errors='coerce') | ||||
| 
 | ||||
|         # Drop rows with missing data in critical columns | ||||
|         df_oesm.dropna(subset=['TOT_EMP', 'A_MEAN', 'onetsoc_major'], inplace=True) | ||||
| 
 | ||||
|         # Calculate the wage bill for each occupation | ||||
|         df_oesm['wage_bill'] = df_oesm['TOT_EMP'] * df_oesm['A_MEAN'] | ||||
| 
 | ||||
|         # Aggregate the wage bill by major occupation group | ||||
|         df_wage_bill_major = df_oesm.groupby('onetsoc_major')['wage_bill'].sum().reset_index() | ||||
| 
 | ||||
|         # Map the major codes to readable titles for plotting | ||||
|         df_wage_bill_major['OCC_TITLE_MAJOR'] = df_wage_bill_major['onetsoc_major'].map(OCCUPATION_MAJOR_CODES) | ||||
|         df_wage_bill_major.dropna(subset=['OCC_TITLE_MAJOR'], inplace=True)  # Drop codes that have no label in OCCUPATION_MAJOR_CODES | ||||
| 
 | ||||
|         # Sort by wage bill for a more informative plot | ||||
|         df_wage_bill_major = df_wage_bill_major.sort_values('wage_bill', ascending=False) | ||||
| 
 | ||||
|         if df_wage_bill_major.empty: | ||||
|             logging.warning("No data available to generate the wage bill plot.") | ||||
|             return None | ||||
| 
 | ||||
| 
 | ||||
|         # --- Plotting --- | ||||
|         plt.figure(figsize=(12, 10)) | ||||
|         ax = sns.barplot(x='wage_bill', y='OCC_TITLE_MAJOR', data=df_wage_bill_major, palette="viridis", orient='h') | ||||
|         ax.set_title('Total Wage Bill per Major Occupation Group', fontsize=16, pad=15) | ||||
|         ax.set_xlabel('Total Wage Bill (in USD)', fontsize=12) | ||||
|         ax.set_ylabel('Major Occupation Group', fontsize=12) | ||||
|         ax.grid(axis='x', linestyle='--', alpha=0.7) | ||||
| 
 | ||||
|         # Format the x-axis to be more readable (e.g., "$2.0T" for trillions) | ||||
|         def format_billions(x, pos): | ||||
|             if x >= 1e12: | ||||
|                 return f'${x*1e-12:.1f}T' | ||||
|             if x >= 1e9: | ||||
|                 return f'${x*1e-9:.0f}B' | ||||
|             return f'${x*1e-6:.0f}M' | ||||
|         ax.xaxis.set_major_formatter(mticker.FuncFormatter(format_billions)) | ||||
| 
 | ||||
|         plt.tight_layout() | ||||
| 
 | ||||
|         # --- File Saving --- | ||||
|         temp_dir = tempfile.gettempdir() | ||||
|         temp_path = Path(temp_dir) / "wage_bill_by_occupation.png" | ||||
|         plt.savefig(temp_path, dpi=300) | ||||
|         logging.info(f"Successfully saved plot to temporary file: {temp_path}") | ||||
| 
 | ||||
|         return temp_path | ||||
| 
 | ||||
|     except Exception as e: | ||||
|         logging.error(f"An error occurred while generating the wage bill plot: {e}", exc_info=True) | ||||
|         return None | ||||
|     finally: | ||||
|         plt.close() | ||||
|         if conn: | ||||
|             conn.close() | ||||
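
The long-to-wide pivot and the wage-bill aggregation used by this generator can be sketched on a toy table. This is a minimal illustration only: the three occupation codes and all employment and wage figures below are invented for the example and do not come from the O*NET database.

```python
import pandas as pd

# Hypothetical rows in the same long format as `occupation_level_metadata`.
long_df = pd.DataFrame(
    [
        ("11-1011.00", "Employment", "100"),
        ("11-1011.00", "Annual Mean Wage", "200000"),
        ("15-1252.00", "Employment", "1000"),
        ("15-1252.00", "Annual Mean Wage", "120000"),
        ("15-2011.00", "Employment", "10"),
        ("15-2011.00", "Annual Mean Wage", "100000"),
    ],
    columns=["onetsoc_code", "item", "response"],
)

# One row per occupation, one column per item.
wide = long_df.pivot(index="onetsoc_code", columns="item", values="response").reset_index()

# Responses arrive as text, so coerce them to numbers before multiplying.
wide["Employment"] = pd.to_numeric(wide["Employment"], errors="coerce")
wide["Annual Mean Wage"] = pd.to_numeric(wide["Annual Mean Wage"], errors="coerce")

# Wage bill per occupation, then summed per 2-digit major group.
wide["wage_bill"] = wide["Employment"] * wide["Annual Mean Wage"]
wide["onetsoc_major"] = wide["onetsoc_code"].str[:2]
by_major = wide.groupby("onetsoc_major")["wage_bill"].sum()

print(by_major)
```

Note that `pivot` raises a `ValueError` if the same (`onetsoc_code`, `item`) pair appears more than once, so duplicated metadata rows would need to be de-duplicated first.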
							
								
								
									
										64
									
								
								analysis/main.py
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							|  | @ -0,0 +1,64 @@ | |||
| import logging | ||||
| import sys | ||||
| 
 | ||||
| # Since this file is inside the 'analysis' package, we use relative imports | ||||
| # to access the other modules within the same package. | ||||
| from . import data | ||||
| from . import preprocess | ||||
| from . import generate | ||||
| 
 | ||||
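| # Configure logging for the entire application. | ||||
| # This setup applies to loggers in the data, preprocess, and generate modules as well. | ||||
| # preprocess.py also calls logging.basicConfig() at import time, which would turn this | ||||
| # call into a no-op; force=True (Python 3.8+) replaces any handlers installed earlier. | ||||
| logging.basicConfig( | ||||
|     level=logging.INFO, | ||||
|     format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', | ||||
|     stream=sys.stdout, | ||||
|     force=True | ||||
| ) | ||||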
| 
 | ||||
| def main(): | ||||
|     """ | ||||
|     The main entry point for the entire analysis pipeline. | ||||
| 
 | ||||
|     This function orchestrates the three main stages of the analysis: | ||||
|     1. Data Setup: Downloads and prepares the necessary raw data and database. | ||||
|     2. Preprocessing: Cleans, enriches, and transforms the raw data into an | ||||
|        analysis-ready DataFrame. | ||||
|     3. Output Generation: Runs all registered generators to produce figures, | ||||
|        tables, and other outputs, saving them to the 'dist/' directory. | ||||
|     """ | ||||
|     logger = logging.getLogger(__name__) | ||||
|     logger.info("=================================================") | ||||
|     logger.info("  STARTING ECONTAI ANALYSIS PIPELINE  ") | ||||
|     logger.info("=================================================") | ||||
| 
 | ||||
|     try: | ||||
|         # Stage 1: Set up the data and database | ||||
|         logger.info("--- STAGE 1: DATA SETUP ---") | ||||
|         data.setup_data_and_database() | ||||
|         logger.info("--- DATA SETUP COMPLETE ---") | ||||
| 
 | ||||
|         # Stage 2: Run the preprocessing pipeline | ||||
|         logger.info("--- STAGE 2: PREPROCESSING ---") | ||||
|         processed_dataframe = preprocess.run_preprocessing() | ||||
|         logger.info("--- PREPROCESSING COMPLETE ---") | ||||
| 
 | ||||
|         # Stage 3: Generate all outputs | ||||
|         logger.info("--- STAGE 3: OUTPUT GENERATION ---") | ||||
|         generate.create_all_outputs(processed_dataframe) | ||||
|         logger.info("--- OUTPUT GENERATION COMPLETE ---") | ||||
| 
 | ||||
|         logger.info("=================================================") | ||||
|         logger.info("  ANALYSIS PIPELINE COMPLETED SUCCESSFULLY  ") | ||||
|         logger.info("=================================================") | ||||
| 
 | ||||
|     except Exception: | ||||
|         logger.critical("An unrecoverable error occurred during the pipeline execution.", exc_info=True) | ||||
|         # Exit with a non-zero status code to indicate failure, which is useful for automation. | ||||
|         sys.exit(1) | ||||
| 
 | ||||
| 
 | ||||
| # This allows the script to be run from the command line using `python -m analysis.main`, | ||||
| # executed from the directory that contains the `analysis/` package. The `-m` flag matters | ||||
| # because it puts the current directory on `sys.path` and runs this file as a module of the | ||||
| # `analysis` package, which is what lets the relative imports (e.g., `from . import data`) resolve. | ||||
| if __name__ == '__main__': | ||||
|     main() | ||||
							
								
								
									
										160
									
								
								analysis/preprocess.py
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							|  | @ -0,0 +1,160 @@ | |||
| import logging | ||||
| import pandas as pd | ||||
| import numpy as np | ||||
| from scipy.stats import median_abs_deviation | ||||
| from .data import get_db_connection | ||||
| 
 | ||||
| # Configure logging | ||||
| logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s') | ||||
| 
 | ||||
| 
 | ||||
| def _convert_to_minutes(level: float) -> float: | ||||
|     """ | ||||
|     Converts O*NET 'Frequency' scale values (levels) to estimated minutes per day. | ||||
|     This logic is derived from the `preprocessing_time_estimates` function | ||||
|     in the original analysis notebook. | ||||
|     """ | ||||
|     if pd.isna(level): | ||||
|         return 0 | ||||
|     # This mapping is an interpretation of the O*NET frequency scale. | ||||
|     return { | ||||
|         1: 0,       # Yearly or less | ||||
|         2: 2,       # Several times a year | ||||
|         3: 10,      # Several times a month | ||||
|         4: 30,      # Several times a week | ||||
|         5: 120,     # Daily | ||||
|         6: 240,     # Several times a day | ||||
|         7: 480,     # Hourly or more | ||||
|     }.get(int(level), 0) | ||||
| 
 | ||||
| 
 | ||||
| def _mad_z_score(series: pd.Series) -> pd.Series: | ||||
|     """ | ||||
|     Calculates the robust Z-score using Median Absolute Deviation (MAD). | ||||
|     This function is derived from 'cell7' of the original analysis. | ||||
|     """ | ||||
|     if series.isnull().all(): | ||||
|         return pd.Series([np.nan] * len(series), index=series.index) | ||||
| 
 | ||||
|     median = series.median() | ||||
|     # scale='normal' makes MAD comparable to the standard deviation for a normal distribution. | ||||
|     mad = median_abs_deviation(series.dropna(), scale='normal') | ||||
|     if mad == 0: | ||||
|         return pd.Series([np.nan] * len(series), index=series.index) | ||||
|     return (series - median) / mad | ||||
| 
 | ||||
| 
 | ||||
| def run_preprocessing() -> pd.DataFrame: | ||||
|     """ | ||||
|     Main orchestrator for the preprocessing pipeline. | ||||
| 
 | ||||
|     This function faithfully reproduces the data transformation pipeline from the | ||||
|     original `analysis.py` script, including the `preprocessing_time_estimates` | ||||
|     and cell-specific data manipulations. | ||||
| 
 | ||||
|     Returns: | ||||
|         pd.DataFrame: A fully preprocessed DataFrame ready for the generators. | ||||
|     """ | ||||
|     logging.info("Starting data preprocessing...") | ||||
|     conn = None | ||||
|     try: | ||||
|         conn = get_db_connection() | ||||
|         if conn is None: | ||||
|             raise ConnectionError("Could not establish database connection.") | ||||
| 
 | ||||
|         # --- 1. Load Data from Database --- | ||||
|         # Fetch all necessary tables to build the initial DataFrame. | ||||
|         logging.info("Loading data from O*NET database...") | ||||
|         task_ratings_df = pd.read_sql_query("SELECT * FROM task_ratings", conn) | ||||
|         task_statements_df = pd.read_sql_query("SELECT * FROM task_statements", conn) | ||||
|         occupations_df = pd.read_sql_query("SELECT * FROM occupation_data", conn) | ||||
| 
 | ||||
|         # --- 2. Initial Merge --- | ||||
|         # Merge the tables to create a comprehensive base DataFrame. | ||||
|         # Merging on both 'onetsoc_code' and 'task_id' is crucial to avoid | ||||
|         # creating duplicate columns from the overlapping 'onetsoc_code'. | ||||
|         logging.info("Merging base tables...") | ||||
|         tasks_df = pd.merge(task_ratings_df, task_statements_df, on=['onetsoc_code', 'task_id']) | ||||
|         tasks_df = pd.merge(tasks_df, occupations_df, on='onetsoc_code') | ||||
| 
 | ||||
|         # --- 3. Create "Atomic Tasks" and Time Estimates (from `preprocessing_time_estimates`) --- | ||||
|         # This is the core of the analysis, focusing on tasks with frequency ratings. | ||||
|         logging.info("Filtering for 'atomic tasks' (scale_id='FR') and calculating time estimates...") | ||||
|         # Strip whitespace from scale_id to ensure the filter works correctly. | ||||
|         tasks_df['scale_id'] = tasks_df['scale_id'].str.strip() | ||||
|         atomic_tasks = tasks_df[tasks_df['scale_id'] == 'FR'].copy() | ||||
| 
 | ||||
|         # Convert frequency confidence intervals into minutes/day | ||||
|         atomic_tasks['lb_estimate_in_minutes'] = atomic_tasks['lower_ci_bound'].apply(_convert_to_minutes) | ||||
|         atomic_tasks['ub_estimate_in_minutes'] = atomic_tasks['upper_ci_bound'].apply(_convert_to_minutes) | ||||
|         atomic_tasks['estimate_midpoint'] = (atomic_tasks['lb_estimate_in_minutes'] + atomic_tasks['ub_estimate_in_minutes']) / 2 | ||||
| 
 | ||||
|         # --- 4. Add Derived Columns for Analysis (from `cell` logic) --- | ||||
|         logging.info("Adding derived columns for analysis...") | ||||
| 
 | ||||
|         # Add `onetsoc_major` for grouping by occupation category | ||||
|         atomic_tasks['onetsoc_major'] = atomic_tasks['onetsoc_code'].str[:2] | ||||
| 
 | ||||
|         # Calculate estimate_range and estimate_ratio used in several plots | ||||
|         atomic_tasks['estimate_range'] = atomic_tasks['ub_estimate_in_minutes'] - atomic_tasks['lb_estimate_in_minutes'] | ||||
| 
 | ||||
|         # To calculate ratio, ensure lower bound is positive to avoid division by zero | ||||
|         lb_positive = atomic_tasks['lb_estimate_in_minutes'] > 0 | ||||
|         atomic_tasks['estimate_ratio'] = np.nan | ||||
|         atomic_tasks.loc[lb_positive, 'estimate_ratio'] = atomic_tasks['ub_estimate_in_minutes'] / atomic_tasks['lb_estimate_in_minutes'] | ||||
| 
 | ||||
|         # --- 5. Calculate Outlier Scores (from `cell6` and `cell7`) --- | ||||
|         logging.info("Calculating standard and robust Z-scores for outlier detection...") | ||||
| 
 | ||||
|         # Standard Z-score | ||||
|         grouped_stats = atomic_tasks.groupby('onetsoc_code')['estimate_midpoint'].agg(['mean', 'std']) | ||||
|         atomic_tasks = atomic_tasks.merge(grouped_stats, on='onetsoc_code', how='left') | ||||
| 
 | ||||
|         # Calculate Z-score, avoiding division by zero if std is 0 | ||||
|         non_zero_std = atomic_tasks['std'].notna() & (atomic_tasks['std'] != 0) | ||||
|         atomic_tasks['z_score'] = np.nan | ||||
|         atomic_tasks.loc[non_zero_std, 'z_score'] = \ | ||||
|             (atomic_tasks.loc[non_zero_std, 'estimate_midpoint'] - atomic_tasks.loc[non_zero_std, 'mean']) / atomic_tasks.loc[non_zero_std, 'std'] | ||||
| 
 | ||||
|         # Robust Z-score (using MAD) | ||||
|         atomic_tasks['robust_z_score'] = atomic_tasks.groupby('onetsoc_code')['estimate_midpoint'].transform(_mad_z_score) | ||||
| 
 | ||||
|         # --- 6. Prepare for other generators --- | ||||
|         # NOTE: The data for the 'task_breakdown_by_occupation' generator, specifically | ||||
|         # the 'remote_status' and 'estimateable' columns, is not available in the O*NET | ||||
|         # database. This data was likely loaded from a separate file (e.g., 'tasks_clean.parquet') | ||||
|         # in the original notebook. For now, we will add placeholder columns. | ||||
|         atomic_tasks['remote_status'] = 'unknown' | ||||
|         atomic_tasks['estimateable'] = 'unknown' | ||||
| 
 | ||||
| 
 | ||||
|         logging.info("Data preprocessing complete.") | ||||
|         return atomic_tasks | ||||
| 
 | ||||
|     except Exception as e: | ||||
|         logging.error("An error occurred during preprocessing: %s", e, exc_info=True) | ||||
|         # Return an empty DataFrame on failure to prevent downstream errors | ||||
|         return pd.DataFrame() | ||||
|     finally: | ||||
|         if conn: | ||||
|             conn.close() | ||||
|             logging.info("Database connection closed.") | ||||
| 
 | ||||
| if __name__ == '__main__': | ||||
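|     # This allows the preprocessing to be run directly for testing or debugging, | ||||
|     # e.g. with `python -m analysis.preprocess` (the relative import of `.data` requires `-m`). | ||||
|     # Note: Requires data to be set up first by running data.py. | ||||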
|     try: | ||||
|         processed_data = run_preprocessing() | ||||
|         if not processed_data.empty: | ||||
|             print("Preprocessing successful. DataFrame shape:", processed_data.shape) | ||||
|             print("Columns:", processed_data.columns.tolist()) | ||||
|             print(processed_data.head()) | ||||
|             # Save to a temporary file to inspect the output | ||||
|             output_path = "temp_preprocessed_data.csv" | ||||
|             processed_data.to_csv(output_path, index=False) | ||||
|             print(f"Sample output saved to {output_path}") | ||||
|         else: | ||||
|             print("Preprocessing failed or resulted in an empty DataFrame.") | ||||
| 
 | ||||
|     except (FileNotFoundError, ConnectionError) as e: | ||||
|         logging.error("Failed to run preprocessing: %s", e) | ||||
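
To see why the pipeline computes a MAD-based score alongside the standard Z-score, the sketch below applies both to a small made-up series with one extreme value. It is an illustration under stated assumptions rather than the module's code: it uses only pandas, the helper name `robust_z` is hypothetical, and the constant `0.6744897501960817` (the 0.75 quantile of the standard normal) stands in for SciPy's `scale='normal'` option.

```python
import pandas as pd

# Made-up midpoints (minutes/day) for one occupation, with one extreme task.
midpoints = pd.Series([10.0, 12.0, 11.0, 13.0, 240.0])

# Standard Z-score: the outlier inflates both the mean and the standard deviation.
standard_z = (midpoints - midpoints.mean()) / midpoints.std()


def robust_z(series: pd.Series) -> pd.Series:
    """MAD-based Z-score (hypothetical helper mirroring the idea of _mad_z_score)."""
    median = series.median()
    # Dividing by ~0.6745 rescales the MAD so it is comparable to a standard
    # deviation under normality (assumed equivalent to scale='normal' in SciPy).
    mad = (series - median).abs().median() / 0.6744897501960817
    if mad == 0:
        return pd.Series(float("nan"), index=series.index)
    return (series - median) / mad


robust = robust_z(midpoints)

print(pd.DataFrame({"midpoint": midpoints, "z": standard_z, "robust_z": robust}))
```

In this toy series the extreme value gets a standard Z-score below 2, because it drags the mean and standard deviation toward itself, while its robust score is far larger. That gap is the usual motivation for reporting both.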
							
								
								
									
										2429
									
								
								archive/analysis.ipynb
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							
										
											
												File diff suppressed because one or more lines are too long
											
										
									
								
							
							
								
								
									
										21699
									
								
								archive/bck_estimates.csv
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							
										
											
												File diff suppressed because it is too large
											
										
									
								
							
							
								
								
									
										4935
									
								
								archive/data_enrichment.ipynb
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							
										
											
												File diff suppressed because one or more lines are too long
											
										
									
								
							
							
								
								
									
										24
									
								
								archive/loss.py
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							|  | @ -0,0 +1,24 @@ | |||
| import numpy as np | ||||
| 
 | ||||
| 
 | ||||
| def calc_loss(df): | ||||
|     """ | ||||
|     Geometric-mean log error between prediction bands and golden bands. | ||||
|     Assumes all columns are strictly positive. | ||||
| 
 | ||||
|     Parameters | ||||
|     ---------- | ||||
|     df : pandas.DataFrame | ||||
|         Must contain the columns: | ||||
|             - 'pred_lower', 'pred_upper' | ||||
|             - 'golden_lower', 'golden_upper' | ||||
| 
 | ||||
|     Returns | ||||
|     ------- | ||||
|     float | ||||
|         Scalar loss value (always >= 1.0; 1.0 means the bands match exactly, smaller is better). | ||||
|     """ | ||||
|     # Element-wise absolute log-ratios | ||||
|     loss_lower = np.abs(np.log(df["pred_lower"] / df["golden_lower"])) | ||||
|     loss_upper = np.abs(np.log(df["pred_upper"] / df["golden_upper"])) | ||||
| 
 | ||||
|     # Average the two means, then exponentiate | ||||
|     loss = np.exp(0.5 * (loss_lower.mean() + loss_upper.mean())) | ||||
|     return loss | ||||
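
A short worked example may help when reading this loss. The sketch below repeats the function body (with the NumPy import it needs) so that it runs on its own, and evaluates it on two invented tables: one where the predicted bands equal the golden bands, and one where a bound is off by a factor of two in half of the rows. All numbers are made up for illustration.

```python
import numpy as np
import pandas as pd


def calc_loss(df: pd.DataFrame) -> float:
    # Same computation as archive/loss.py: mean absolute log-ratio per bound,
    # averaged over the two bounds, then exponentiated.
    loss_lower = np.abs(np.log(df["pred_lower"] / df["golden_lower"]))
    loss_upper = np.abs(np.log(df["pred_upper"] / df["golden_upper"]))
    return float(np.exp(0.5 * (loss_lower.mean() + loss_upper.mean())))


# Predictions identical to the golden bands.
perfect = pd.DataFrame(
    {
        "pred_lower": [10.0, 20.0],
        "pred_upper": [60.0, 80.0],
        "golden_lower": [10.0, 20.0],
        "golden_upper": [60.0, 80.0],
    }
)

# One lower bound and one upper bound are off by a factor of two.
off_by_two = pd.DataFrame(
    {
        "pred_lower": [10.0, 20.0],
        "pred_upper": [30.0, 80.0],
        "golden_lower": [10.0, 40.0],
        "golden_upper": [60.0, 80.0],
    }
)

loss_perfect = calc_loss(perfect)
loss_off = calc_loss(off_by_two)
print(loss_perfect, loss_off)
```

Because the result is an exponentiated mean of absolute log-ratios, it reads as a typical multiplicative error: a perfect match gives 1.0 rather than 0, and being off by a factor of two in half of the rows gives the square root of 2.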
							
								
								
									
										334
									
								
								archive/onet_explorer_app.py
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							|  | @ -0,0 +1,334 @@ | |||
| import streamlit as st | ||||
| import sqlite3 | ||||
| import pandas as pd | ||||
| import graphviz | ||||
| import textwrap | ||||
| 
 | ||||
| # --- Database Setup --- | ||||
| DB_FILE = "onet.database" | ||||
| 
 | ||||
| 
 | ||||
| @st.cache_resource | ||||
| def get_db_connection(): | ||||
|     """Establishes a connection to the SQLite database.""" | ||||
|     conn = sqlite3.connect(DB_FILE) | ||||
|     conn.row_factory = sqlite3.Row  # Access columns by name | ||||
|     return conn | ||||
| 
 | ||||
| 
 | ||||
| @st.cache_data | ||||
| def get_occupations(_conn): | ||||
|     """Fetches all occupations from the database.""" | ||||
|     df = pd.read_sql_query( | ||||
|         "SELECT onetsoc_code, title FROM occupation_data ORDER BY title", _conn | ||||
|     ) | ||||
|     return df | ||||
| 
 | ||||
| 
 | ||||
| @st.cache_data | ||||
| def get_iwas_for_occupation(_conn, onetsoc_code): | ||||
|     """ | ||||
|     Fetches IWAs for a given occupation. | ||||
|     An occupation is linked to Work Activities (element_id in work_activities table). | ||||
|     These Work Activity element_ids are then used in iwa_reference to find associated IWAs. | ||||
|     """ | ||||
|     query = """ | ||||
|     SELECT DISTINCT | ||||
|         ir.iwa_id, | ||||
|         ir.iwa_title | ||||
|     FROM work_activities wa | ||||
|     JOIN iwa_reference ir ON wa.element_id = ir.element_id | ||||
|     WHERE wa.onetsoc_code = ? | ||||
|     ORDER BY ir.iwa_title; | ||||
|     """ | ||||
|     df = pd.read_sql_query(query, _conn, params=(onetsoc_code,)) | ||||
|     return df | ||||
| 
 | ||||
| 
 | ||||
| @st.cache_data | ||||
| def get_dwas_for_iwas(_conn, iwa_ids): | ||||
|     """Fetches DWAs for a list of IWA IDs.""" | ||||
|     if not iwa_ids: | ||||
|         return pd.DataFrame() | ||||
|     placeholders = ",".join( | ||||
|         "?" for _ in iwa_ids | ||||
|     )  # Create one placeholder for each IWA ID | ||||
|     query = f""" | ||||
|     SELECT DISTINCT | ||||
|         dr.dwa_id, | ||||
|         dr.dwa_title, | ||||
|         dr.iwa_id  -- to link back to the IWA | ||||
|     FROM dwa_reference dr | ||||
|     WHERE dr.iwa_id IN ({placeholders}) | ||||
|     ORDER BY dr.dwa_title; | ||||
|     """ | ||||
|     df = pd.read_sql_query(query, _conn, params=iwa_ids) | ||||
|     return df | ||||
| 
 | ||||
| 
 | ||||
| @st.cache_data | ||||
| def get_tasks_for_dwas(_conn, onetsoc_code, dwa_ids): | ||||
|     """Fetches tasks for a given occupation and list of DWA IDs.""" | ||||
|     if not dwa_ids: | ||||
|         return pd.DataFrame() | ||||
|     placeholders = ",".join( | ||||
|         "?" for _ in dwa_ids | ||||
|     )  # Create one placeholder for each DWA ID | ||||
|     query = f""" | ||||
|     SELECT DISTINCT | ||||
|         ts.task_id, | ||||
|         ts.task, | ||||
|         t2d.dwa_id -- to link back to the DWA | ||||
|     FROM task_statements ts | ||||
|     JOIN tasks_to_dwas t2d ON ts.task_id = t2d.task_id | ||||
|     WHERE ts.onetsoc_code = ? AND t2d.dwa_id IN ({placeholders}) | ||||
|     ORDER BY ts.task; | ||||
|     """ | ||||
|     # The parameters list should first contain onetsoc_code, then all DWA IDs. | ||||
|     params = [onetsoc_code] + dwa_ids | ||||
|     df = pd.read_sql_query(query, _conn, params=params) | ||||
|     return df | ||||
| 
 | ||||
| 
 | ||||
| def smart_wrap(text, width=40): | ||||
|     """Wraps text for better display in graph nodes.""" | ||||
|     return "\n".join( | ||||
|         textwrap.wrap( | ||||
|             text, | ||||
|             width=width, | ||||
|             break_long_words=True, | ||||
|             replace_whitespace=False, | ||||
|             drop_whitespace=False, | ||||
|         ) | ||||
|     ) | ||||
| 
 | ||||
| 
 | ||||
| # --- Streamlit App Layout --- | ||||
| st.set_page_config(layout="wide") | ||||
| 
 | ||||
| # Check if database file exists | ||||
| try: | ||||
|     # Attempt to open for binary read to check existence and basic readability | ||||
|     with open(DB_FILE, "rb") as f: | ||||
|         pass | ||||
|     conn = get_db_connection() | ||||
| except FileNotFoundError: | ||||
|     st.error( | ||||
|         f"Database file '{DB_FILE}' not found. Please ensure it is in the same directory as the script." | ||||
|     ) | ||||
|     st.stop() | ||||
| except sqlite3.Error as e: | ||||
|     st.error(f"Error connecting to or reading the database '{DB_FILE}': {e}") | ||||
|     st.info( | ||||
|         "Please ensure the database file is a valid SQLite database and not corrupted." | ||||
|     ) | ||||
|     st.stop() | ||||
| 
 | ||||
| 
 | ||||
| st.title("O*NET Occupation Hierarchy Explorer") | ||||
| st.markdown(""" | ||||
| This application visualizes the relationships between Occupations, Intermediate Work Activities (IWAs), | ||||
| Detailed Work Activities (DWAs), and Task Statements from the O*NET database. | ||||
| Select an occupation from the control panel on the left to view its hierarchical breakdown. | ||||
| """) | ||||
| 
 | ||||
| # --- Left Column (Control Panel) for Occupation Selection --- | ||||
| col1, col2 = st.columns([0.3, 0.7], gap="large") | ||||
| 
 | ||||
| with col1: | ||||
|     st.header("Control Panel") | ||||
|     occupations_df = get_occupations(conn) | ||||
| 
 | ||||
|     if occupations_df.empty: | ||||
|         st.warning("No occupations found in the database.") | ||||
|         st.stop() | ||||
| 
 | ||||
|     # Create a display string with code and title for the selectbox | ||||
|     occupations_df["display_name"] = ( | ||||
|         occupations_df["title"] + " (" + occupations_df["onetsoc_code"] + ")" | ||||
|     ) | ||||
| 
 | ||||
|     search_term = st.text_input( | ||||
|         "Search for an occupation:", placeholder="E.g., Software Developer" | ||||
|     ) | ||||
| 
 | ||||
|     if search_term: | ||||
|         # Match the search term as a literal string (regex=False), so characters such as | ||||
|         # "+", "*", "?" or "(" in the input cannot raise a regex error or change the match. | ||||
|         filtered_occupations = occupations_df[ | ||||
|             occupations_df["title"].str.contains( | ||||
|                 search_term, case=False, regex=False | ||||
|             ) | ||||
|             | occupations_df["onetsoc_code"].str.contains( | ||||
|                 search_term, case=False, regex=False | ||||
|             ) | ||||
|         ] | ||||
|     else: | ||||
|         filtered_occupations = occupations_df | ||||
| 
 | ||||
|     if not filtered_occupations.empty: | ||||
|         # Sort filtered occupations for consistent display in selectbox | ||||
|         filtered_occupations_sorted = filtered_occupations.sort_values("display_name") | ||||
|         selected_occupation_display_name = st.selectbox( | ||||
|             "Choose an occupation:", | ||||
|             options=filtered_occupations_sorted["display_name"], | ||||
|             index=0,  # Default to the first item | ||||
|         ) | ||||
| 
 | ||||
|         # Get the onetsoc_code and title from the selected display name | ||||
|         selected_row = occupations_df[ | ||||
|             occupations_df["display_name"] == selected_occupation_display_name | ||||
|         ].iloc[0] | ||||
|         selected_onetsoc_code = selected_row["onetsoc_code"] | ||||
|         selected_occupation_title = selected_row["title"] | ||||
|     else: | ||||
|         st.warning("No occupations match your search term.") | ||||
|         selected_onetsoc_code = None | ||||
|         selected_occupation_title = None | ||||
| 
 | ||||
| # --- Main Area for Graph Display --- | ||||
| with col2: | ||||
|     st.header("Occupation Graph") | ||||
|     if selected_onetsoc_code: | ||||
|         st.subheader( | ||||
|             f"Displaying: {selected_occupation_title} ({selected_onetsoc_code})" | ||||
|         ) | ||||
| 
 | ||||
|         iwas_df = get_iwas_for_occupation(conn, selected_onetsoc_code) | ||||
| 
 | ||||
|         if iwas_df.empty: | ||||
|             st.info( | ||||
|                 "No Intermediate Work Activities (IWAs) found directly linked for this occupation." | ||||
|             ) | ||||
|         else: | ||||
|             graph = graphviz.Digraph( | ||||
|                 comment=f"O*NET Hierarchy for {selected_onetsoc_code}" | ||||
|             ) | ||||
|             graph.attr( | ||||
|                 rankdir="LR", | ||||
|                 splines="spline", | ||||
|                 concentrate="false", | ||||
|                 nodesep="0.5", | ||||
|                 ranksep="0.8", | ||||
|             ) | ||||
| 
 | ||||
|             # Occupation Node | ||||
|             occ_node_id = f"occ_{selected_onetsoc_code.replace('.', '_')}"  # Ensure ID is valid for DOT | ||||
|             occ_label = smart_wrap( | ||||
|                 f"Occupation: {selected_occupation_title}\n({selected_onetsoc_code})", | ||||
|                 width=30, | ||||
|             ) | ||||
|             graph.node( | ||||
|                 occ_node_id, | ||||
|                 label=occ_label, | ||||
|                 shape="ellipse", | ||||
|                 style="filled", | ||||
|                 fillcolor="skyblue", | ||||
|             ) | ||||
| 
 | ||||
|             # Fetch DWAs | ||||
|             iwa_ids = iwas_df["iwa_id"].tolist() | ||||
|             dwas_df = get_dwas_for_iwas(conn, iwa_ids) | ||||
| 
 | ||||
|             dwa_ids_for_tasks = [] | ||||
|             if not dwas_df.empty: | ||||
|                 dwa_ids_for_tasks = dwas_df["dwa_id"].unique().tolist() | ||||
| 
 | ||||
|             # Fetch Tasks | ||||
|             tasks_df = get_tasks_for_dwas( | ||||
|                 conn, selected_onetsoc_code, dwa_ids_for_tasks | ||||
|             ) | ||||
| 
 | ||||
|             # Add IWA Nodes and Edges | ||||
|             for _, iwa_row in iwas_df.iterrows(): | ||||
|                 iwa_node_id = f"iwa_{str(iwa_row['iwa_id']).replace('.', '_')}" | ||||
|                 iwa_label = smart_wrap( | ||||
|                     f"IWA: {iwa_row['iwa_title']}\n(ID: {iwa_row['iwa_id']})", width=35 | ||||
|                 ) | ||||
|                 graph.node( | ||||
|                     iwa_node_id, | ||||
|                     label=iwa_label, | ||||
|                     shape="box", | ||||
|                     style="filled", | ||||
|                     fillcolor="khaki", | ||||
|                 ) | ||||
|                 graph.edge(occ_node_id, iwa_node_id) | ||||
| 
 | ||||
|                 # Add DWA Nodes and Edges (for this IWA) | ||||
|                 current_iwa_dwas = dwas_df[dwas_df["iwa_id"] == iwa_row["iwa_id"]] | ||||
|                 for _, dwa_row in current_iwa_dwas.iterrows(): | ||||
|                     dwa_node_id = f"dwa_{str(dwa_row['dwa_id']).replace('.', '_')}" | ||||
|                     dwa_label = smart_wrap( | ||||
|                         f"DWA: {dwa_row['dwa_title']}\n(ID: {dwa_row['dwa_id']})", | ||||
|                         width=40, | ||||
|                     ) | ||||
|                     graph.node( | ||||
|                         dwa_node_id, | ||||
|                         label=dwa_label, | ||||
|                         shape="box", | ||||
|                         style="filled", | ||||
|                         fillcolor="lightcoral", | ||||
|                     ) | ||||
|                     graph.edge(iwa_node_id, dwa_node_id) | ||||
| 
 | ||||
|                     # Add Task Nodes and Edges (for this DWA and Occupation) | ||||
|                     current_dwa_tasks = tasks_df[ | ||||
|                         tasks_df["dwa_id"] == dwa_row["dwa_id"] | ||||
|                     ] | ||||
|                     for _, task_row in current_dwa_tasks.iterrows(): | ||||
|                         # Ensure task_id is a string and valid for DOT | ||||
|                         task_id_str = str(task_row["task_id"]).split(".")[ | ||||
|                             0 | ||||
|                         ]  # Handle decimal task_ids if they appear | ||||
|                         task_node_id = f"task_{task_id_str}" | ||||
|                         task_label = smart_wrap( | ||||
|                             f"Task: {task_row['task']}\n(ID: {task_id_str})", width=50 | ||||
|                         ) | ||||
|                         graph.node( | ||||
|                             task_node_id, | ||||
|                             label=task_label, | ||||
|                             shape="note", | ||||
|                             style="filled", | ||||
|                             fillcolor="lightgray", | ||||
|                         ) | ||||
|                         graph.edge(dwa_node_id, task_node_id) | ||||
| 
 | ||||
|             if ( | ||||
|                 not graph.body or len(graph.body) <= 1 | ||||
|             ):  # Check if any nodes were actually added beyond the occupation | ||||
|                 st.info( | ||||
|                     "No hierarchical data (IWAs, DWAs, Tasks) to display for this occupation after initial selection." | ||||
|                 ) | ||||
|             else: | ||||
|                 try: | ||||
|                     st.graphviz_chart(graph, use_container_width=True) | ||||
|                     with st.expander("View Data Tables for Selected Occupation"): | ||||
|                         st.markdown("##### Intermediate Work Activities (IWAs)") | ||||
|                         st.dataframe(iwas_df, use_container_width=True) | ||||
|                         if not dwas_df.empty: | ||||
|                             st.markdown("##### Detailed Work Activities (DWAs)") | ||||
|                             st.dataframe(dwas_df, use_container_width=True) | ||||
|                         if not tasks_df.empty: | ||||
|                             st.markdown("##### Task Statements") | ||||
|                             st.dataframe(tasks_df, use_container_width=True) | ||||
| 
 | ||||
|                 except Exception as e: | ||||
|                     st.error( | ||||
|                         f"Could not render the graph. Graphviz might not be installed correctly or there's an issue with the graph data: {e}" | ||||
|                     ) | ||||
|                     st.text("Graphviz DOT source (for debugging):") | ||||
|                     st.code(graph.source, language="dot") | ||||
|     else: | ||||
|         st.info("Select an occupation from the control panel to see its graph.") | ||||
| 
 | ||||
| # Instructions to run the app: | ||||
| # 1. Save this code as a Python file (e.g., onet_explorer_app.py). | ||||
| # 2. Ensure the 'onet.database' file is in the same directory. | ||||
| # 3. Install the required libraries: pip install streamlit pandas graphviz | ||||
| # 4. Open your terminal or command prompt, navigate to the directory, and run: | ||||
| #    streamlit run onet_explorer_app.py | ||||
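
The occupation search in the control panel is a case-insensitive substring match on user input. As a minimal sketch of that idea outside pandas, the snippet below uses the standard library's `re.escape` so that every regex metacharacter in the search term is treated literally. The helper name `filter_titles` and the sample titles are hypothetical and exist only for illustration; they are not part of the app or of O*NET.

```python
import re


def filter_titles(titles, search_term):
    """Return the titles that contain search_term as a literal, case-insensitive substring."""
    if not search_term:
        # An empty search box means "no filter", mirroring the app's else branch.
        return list(titles)
    # re.escape neutralises all regex metacharacters, not just brackets and parentheses.
    pattern = re.compile(re.escape(search_term), re.IGNORECASE)
    return [title for title in titles if pattern.search(title)]


# Hypothetical titles, chosen to include characters that are special in regexes.
titles = [
    "Software Developers",
    "Chief Executives",
    "Engineers (All Other)",
    "C++ Instructors",
]

print(filter_titles(titles, "c++"))
print(filter_titles(titles, "(all"))
```

Escaping the whole term (or disabling regex matching altogether) avoids both pattern errors on input like `c++` and accidental wildcard matches on input like `.`.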
							
								
								
									
										352
									
								
								archive/schema.txt
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							|  | @ -0,0 +1,352 @@ | |||
| CREATE TABLE content_model_reference ( | ||||
|   element_id CHARACTER VARYING(20) NOT NULL, | ||||
|   element_name CHARACTER VARYING(150) NOT NULL, | ||||
|   description CHARACTER VARYING(1500) NOT NULL, | ||||
|   PRIMARY KEY (element_id)); | ||||
| CREATE TABLE job_zone_reference ( | ||||
|   job_zone DECIMAL(1,0) NOT NULL, | ||||
|   name CHARACTER VARYING(50) NOT NULL, | ||||
|   experience CHARACTER VARYING(300) NOT NULL, | ||||
|   education CHARACTER VARYING(500) NOT NULL, | ||||
|   job_training CHARACTER VARYING(300) NOT NULL, | ||||
|   examples CHARACTER VARYING(500) NOT NULL, | ||||
|   svp_range CHARACTER VARYING(25) NOT NULL, | ||||
|   PRIMARY KEY (job_zone)); | ||||
| CREATE TABLE occupation_data ( | ||||
|   onetsoc_code CHARACTER(10) NOT NULL, | ||||
|   title CHARACTER VARYING(150) NOT NULL, | ||||
|   description CHARACTER VARYING(1000) NOT NULL, | ||||
|   PRIMARY KEY (onetsoc_code)); | ||||
| CREATE TABLE scales_reference ( | ||||
|   scale_id CHARACTER VARYING(3) NOT NULL, | ||||
|   scale_name CHARACTER VARYING(50) NOT NULL, | ||||
|   minimum DECIMAL(1,0) NOT NULL, | ||||
|   maximum DECIMAL(3,0) NOT NULL, | ||||
|   PRIMARY KEY (scale_id)); | ||||
| CREATE TABLE ete_categories ( | ||||
|   element_id CHARACTER VARYING(20) NOT NULL, | ||||
|   scale_id CHARACTER VARYING(3) NOT NULL, | ||||
|   category DECIMAL(3,0) NOT NULL, | ||||
|   category_description CHARACTER VARYING(1000) NOT NULL, | ||||
|   PRIMARY KEY (element_id, scale_id, category), | ||||
|   FOREIGN KEY (element_id) REFERENCES content_model_reference(element_id), | ||||
|   FOREIGN KEY (scale_id) REFERENCES scales_reference(scale_id)); | ||||
| CREATE TABLE level_scale_anchors ( | ||||
|   element_id CHARACTER VARYING(20) NOT NULL, | ||||
|   scale_id CHARACTER VARYING(3) NOT NULL, | ||||
|   anchor_value DECIMAL(3,0) NOT NULL, | ||||
|   anchor_description CHARACTER VARYING(1000) NOT NULL, | ||||
|   FOREIGN KEY (element_id) REFERENCES content_model_reference(element_id), | ||||
|   FOREIGN KEY (scale_id) REFERENCES scales_reference(scale_id)); | ||||
| CREATE TABLE occupation_level_metadata ( | ||||
|   onetsoc_code CHARACTER(10) NOT NULL, | ||||
|   item CHARACTER VARYING(150) NOT NULL, | ||||
|   response CHARACTER VARYING(75), | ||||
|   n DECIMAL(4,0), | ||||
|   percent DECIMAL(4,1), | ||||
|   date_updated DATE NOT NULL, | ||||
|   FOREIGN KEY (onetsoc_code) REFERENCES occupation_data(onetsoc_code)); | ||||
| CREATE TABLE survey_booklet_locations ( | ||||
|   element_id CHARACTER VARYING(20) NOT NULL, | ||||
|   survey_item_number CHARACTER VARYING(4) NOT NULL, | ||||
|   scale_id CHARACTER VARYING(3) NOT NULL, | ||||
|   FOREIGN KEY (element_id) REFERENCES content_model_reference(element_id), | ||||
|   FOREIGN KEY (scale_id) REFERENCES scales_reference(scale_id)); | ||||
| CREATE TABLE task_categories ( | ||||
|   scale_id CHARACTER VARYING(3) NOT NULL, | ||||
|   category DECIMAL(3,0) NOT NULL, | ||||
|   category_description CHARACTER VARYING(1000) NOT NULL, | ||||
|   PRIMARY KEY (scale_id, category), | ||||
|   FOREIGN KEY (scale_id) REFERENCES scales_reference(scale_id)); | ||||
| CREATE TABLE work_context_categories ( | ||||
|   element_id CHARACTER VARYING(20) NOT NULL, | ||||
|   scale_id CHARACTER VARYING(3) NOT NULL, | ||||
|   category DECIMAL(3,0) NOT NULL, | ||||
|   category_description CHARACTER VARYING(1000) NOT NULL, | ||||
|   PRIMARY KEY (element_id, scale_id, category), | ||||
|   FOREIGN KEY (element_id) REFERENCES content_model_reference(element_id), | ||||
|   FOREIGN KEY (scale_id) REFERENCES scales_reference(scale_id)); | ||||
| CREATE TABLE abilities ( | ||||
|   onetsoc_code CHARACTER(10) NOT NULL, | ||||
|   element_id CHARACTER VARYING(20) NOT NULL, | ||||
|   scale_id CHARACTER VARYING(3) NOT NULL, | ||||
|   data_value DECIMAL(5,2) NOT NULL, | ||||
|   n DECIMAL(4,0), | ||||
|   standard_error DECIMAL(7,4), | ||||
|   lower_ci_bound DECIMAL(7,4), | ||||
|   upper_ci_bound DECIMAL(7,4), | ||||
|   recommend_suppress CHARACTER(1), | ||||
|   not_relevant CHARACTER(1), | ||||
|   date_updated DATE NOT NULL, | ||||
|   domain_source CHARACTER VARYING(30) NOT NULL, | ||||
|   FOREIGN KEY (onetsoc_code) REFERENCES occupation_data(onetsoc_code), | ||||
|   FOREIGN KEY (element_id) REFERENCES content_model_reference(element_id), | ||||
|   FOREIGN KEY (scale_id) REFERENCES scales_reference(scale_id)); | ||||
| CREATE TABLE education_training_experience ( | ||||
|   onetsoc_code CHARACTER(10) NOT NULL, | ||||
|   element_id CHARACTER VARYING(20) NOT NULL, | ||||
|   scale_id CHARACTER VARYING(3) NOT NULL, | ||||
|   category DECIMAL(3,0), | ||||
|   data_value DECIMAL(5,2) NOT NULL, | ||||
|   n DECIMAL(4,0), | ||||
|   standard_error DECIMAL(7,4), | ||||
|   lower_ci_bound DECIMAL(7,4), | ||||
|   upper_ci_bound DECIMAL(7,4), | ||||
|   recommend_suppress CHARACTER(1), | ||||
|   date_updated DATE NOT NULL, | ||||
|   domain_source CHARACTER VARYING(30) NOT NULL, | ||||
|   FOREIGN KEY (onetsoc_code) REFERENCES occupation_data(onetsoc_code), | ||||
|   FOREIGN KEY (element_id) REFERENCES content_model_reference(element_id), | ||||
|   FOREIGN KEY (scale_id) REFERENCES scales_reference(scale_id), | ||||
|   FOREIGN KEY (element_id, scale_id, category) REFERENCES ete_categories(element_id, scale_id, category)); | ||||
| CREATE TABLE interests ( | ||||
|   onetsoc_code CHARACTER(10) NOT NULL, | ||||
|   element_id CHARACTER VARYING(20) NOT NULL, | ||||
|   scale_id CHARACTER VARYING(3) NOT NULL, | ||||
|   data_value DECIMAL(5,2) NOT NULL, | ||||
|   date_updated DATE NOT NULL, | ||||
|   domain_source CHARACTER VARYING(30) NOT NULL, | ||||
|   FOREIGN KEY (onetsoc_code) REFERENCES occupation_data(onetsoc_code), | ||||
|   FOREIGN KEY (element_id) REFERENCES content_model_reference(element_id), | ||||
|   FOREIGN KEY (scale_id) REFERENCES scales_reference(scale_id)); | ||||
| CREATE TABLE job_zones ( | ||||
|   onetsoc_code CHARACTER(10) NOT NULL, | ||||
|   job_zone DECIMAL(1,0) NOT NULL, | ||||
|   date_updated DATE NOT NULL, | ||||
|   domain_source CHARACTER VARYING(30) NOT NULL, | ||||
|   FOREIGN KEY (onetsoc_code) REFERENCES occupation_data(onetsoc_code), | ||||
|   FOREIGN KEY (job_zone) REFERENCES job_zone_reference(job_zone)); | ||||
| CREATE TABLE knowledge ( | ||||
|   onetsoc_code CHARACTER(10) NOT NULL, | ||||
|   element_id CHARACTER VARYING(20) NOT NULL, | ||||
|   scale_id CHARACTER VARYING(3) NOT NULL, | ||||
|   data_value DECIMAL(5,2) NOT NULL, | ||||
|   n DECIMAL(4,0), | ||||
|   standard_error DECIMAL(7,4), | ||||
|   lower_ci_bound DECIMAL(7,4), | ||||
|   upper_ci_bound DECIMAL(7,4), | ||||
|   recommend_suppress CHARACTER(1), | ||||
|   not_relevant CHARACTER(1), | ||||
|   date_updated DATE NOT NULL, | ||||
|   domain_source CHARACTER VARYING(30) NOT NULL, | ||||
|   FOREIGN KEY (onetsoc_code) REFERENCES occupation_data(onetsoc_code), | ||||
|   FOREIGN KEY (element_id) REFERENCES content_model_reference(element_id), | ||||
|   FOREIGN KEY (scale_id) REFERENCES scales_reference(scale_id)); | ||||
| CREATE TABLE skills ( | ||||
|   onetsoc_code CHARACTER(10) NOT NULL, | ||||
|   element_id CHARACTER VARYING(20) NOT NULL, | ||||
|   scale_id CHARACTER VARYING(3) NOT NULL, | ||||
|   data_value DECIMAL(5,2) NOT NULL, | ||||
|   n DECIMAL(4,0), | ||||
|   standard_error DECIMAL(7,4), | ||||
|   lower_ci_bound DECIMAL(7,4), | ||||
|   upper_ci_bound DECIMAL(7,4), | ||||
|   recommend_suppress CHARACTER(1), | ||||
|   not_relevant CHARACTER(1), | ||||
|   date_updated DATE NOT NULL, | ||||
|   domain_source CHARACTER VARYING(30) NOT NULL, | ||||
|   FOREIGN KEY (onetsoc_code) REFERENCES occupation_data(onetsoc_code), | ||||
|   FOREIGN KEY (element_id) REFERENCES content_model_reference(element_id), | ||||
|   FOREIGN KEY (scale_id) REFERENCES scales_reference(scale_id)); | ||||
| CREATE TABLE task_statements ( | ||||
|   onetsoc_code CHARACTER(10) NOT NULL, | ||||
|   task_id DECIMAL(8,0) NOT NULL, | ||||
|   task CHARACTER VARYING(1000) NOT NULL, | ||||
|   task_type CHARACTER VARYING(12), | ||||
|   incumbents_responding DECIMAL(4,0), | ||||
|   date_updated DATE NOT NULL, | ||||
|   domain_source CHARACTER VARYING(30) NOT NULL, | ||||
|   PRIMARY KEY (task_id), | ||||
|   FOREIGN KEY (onetsoc_code) REFERENCES occupation_data(onetsoc_code)); | ||||
| CREATE TABLE task_ratings ( | ||||
|   onetsoc_code CHARACTER(10) NOT NULL, | ||||
|   task_id DECIMAL(8,0) NOT NULL, | ||||
|   scale_id CHARACTER VARYING(3) NOT NULL, | ||||
|   category DECIMAL(3,0), | ||||
|   data_value DECIMAL(5,2) NOT NULL, | ||||
|   n DECIMAL(4,0), | ||||
|   standard_error DECIMAL(7,4), | ||||
|   lower_ci_bound DECIMAL(7,4), | ||||
|   upper_ci_bound DECIMAL(7,4), | ||||
|   recommend_suppress CHARACTER(1), | ||||
|   date_updated DATE NOT NULL, | ||||
|   domain_source CHARACTER VARYING(30) NOT NULL, | ||||
|   FOREIGN KEY (onetsoc_code) REFERENCES occupation_data(onetsoc_code), | ||||
|   FOREIGN KEY (task_id) REFERENCES task_statements(task_id), | ||||
|   FOREIGN KEY (scale_id) REFERENCES scales_reference(scale_id), | ||||
|   FOREIGN KEY (scale_id, category) REFERENCES task_categories(scale_id, category)); | ||||
| CREATE TABLE work_activities ( | ||||
|   onetsoc_code CHARACTER(10) NOT NULL, | ||||
|   element_id CHARACTER VARYING(20) NOT NULL, | ||||
|   scale_id CHARACTER VARYING(3) NOT NULL, | ||||
|   data_value DECIMAL(5,2) NOT NULL, | ||||
|   n DECIMAL(4,0), | ||||
|   standard_error DECIMAL(7,4), | ||||
|   lower_ci_bound DECIMAL(7,4), | ||||
|   upper_ci_bound DECIMAL(7,4), | ||||
|   recommend_suppress CHARACTER(1), | ||||
|   not_relevant CHARACTER(1), | ||||
|   date_updated DATE NOT NULL, | ||||
|   domain_source CHARACTER VARYING(30) NOT NULL, | ||||
|   FOREIGN KEY (onetsoc_code) REFERENCES occupation_data(onetsoc_code), | ||||
|   FOREIGN KEY (element_id) REFERENCES content_model_reference(element_id), | ||||
|   FOREIGN KEY (scale_id) REFERENCES scales_reference(scale_id)); | ||||
| CREATE TABLE work_context ( | ||||
|   onetsoc_code CHARACTER(10) NOT NULL, | ||||
|   element_id CHARACTER VARYING(20) NOT NULL, | ||||
|   scale_id CHARACTER VARYING(3) NOT NULL, | ||||
|   category DECIMAL(3,0), | ||||
|   data_value DECIMAL(5,2) NOT NULL, | ||||
|   n DECIMAL(4,0), | ||||
|   standard_error DECIMAL(7,4), | ||||
|   lower_ci_bound DECIMAL(7,4), | ||||
|   upper_ci_bound DECIMAL(7,4), | ||||
|   recommend_suppress CHARACTER(1), | ||||
|   not_relevant CHARACTER(1), | ||||
|   date_updated DATE NOT NULL, | ||||
|   domain_source CHARACTER VARYING(30) NOT NULL, | ||||
|   FOREIGN KEY (onetsoc_code) REFERENCES occupation_data(onetsoc_code), | ||||
|   FOREIGN KEY (element_id) REFERENCES content_model_reference(element_id), | ||||
|   FOREIGN KEY (scale_id) REFERENCES scales_reference(scale_id), | ||||
|   FOREIGN KEY (element_id, scale_id, category) REFERENCES work_context_categories(element_id, scale_id, category)); | ||||
| CREATE TABLE work_styles ( | ||||
|   onetsoc_code CHARACTER(10) NOT NULL, | ||||
|   element_id CHARACTER VARYING(20) NOT NULL, | ||||
|   scale_id CHARACTER VARYING(3) NOT NULL, | ||||
|   data_value DECIMAL(5,2) NOT NULL, | ||||
|   n DECIMAL(4,0), | ||||
|   standard_error DECIMAL(7,4), | ||||
|   lower_ci_bound DECIMAL(7,4), | ||||
|   upper_ci_bound DECIMAL(7,4), | ||||
|   recommend_suppress CHARACTER(1), | ||||
|   date_updated DATE NOT NULL, | ||||
|   domain_source CHARACTER VARYING(30) NOT NULL, | ||||
|   FOREIGN KEY (onetsoc_code) REFERENCES occupation_data(onetsoc_code), | ||||
|   FOREIGN KEY (element_id) REFERENCES content_model_reference(element_id), | ||||
|   FOREIGN KEY (scale_id) REFERENCES scales_reference(scale_id)); | ||||
| CREATE TABLE work_values ( | ||||
|   onetsoc_code CHARACTER(10) NOT NULL, | ||||
|   element_id CHARACTER VARYING(20) NOT NULL, | ||||
|   scale_id CHARACTER VARYING(3) NOT NULL, | ||||
|   data_value DECIMAL(5,2) NOT NULL, | ||||
|   date_updated DATE NOT NULL, | ||||
|   domain_source CHARACTER VARYING(30) NOT NULL, | ||||
|   FOREIGN KEY (onetsoc_code) REFERENCES occupation_data(onetsoc_code), | ||||
|   FOREIGN KEY (element_id) REFERENCES content_model_reference(element_id), | ||||
|   FOREIGN KEY (scale_id) REFERENCES scales_reference(scale_id)); | ||||
| CREATE TABLE iwa_reference ( | ||||
|   element_id CHARACTER VARYING(20) NOT NULL, | ||||
|   iwa_id CHARACTER VARYING(20) NOT NULL, | ||||
|   iwa_title CHARACTER VARYING(150) NOT NULL, | ||||
|   PRIMARY KEY (iwa_id), | ||||
|   FOREIGN KEY (element_id) REFERENCES content_model_reference(element_id)); | ||||
| CREATE TABLE dwa_reference ( | ||||
|   element_id CHARACTER VARYING(20) NOT NULL, | ||||
|   iwa_id CHARACTER VARYING(20) NOT NULL, | ||||
|   dwa_id CHARACTER VARYING(20) NOT NULL, | ||||
|   dwa_title CHARACTER VARYING(150) NOT NULL, | ||||
|   PRIMARY KEY (dwa_id), | ||||
|   FOREIGN KEY (element_id) REFERENCES content_model_reference(element_id), | ||||
|   FOREIGN KEY (iwa_id) REFERENCES iwa_reference(iwa_id)); | ||||
| CREATE TABLE tasks_to_dwas ( | ||||
|   onetsoc_code CHARACTER(10) NOT NULL, | ||||
|   task_id DECIMAL(8,0) NOT NULL, | ||||
|   dwa_id CHARACTER VARYING(20) NOT NULL, | ||||
|   date_updated DATE NOT NULL, | ||||
|   domain_source CHARACTER VARYING(30) NOT NULL, | ||||
|   FOREIGN KEY (onetsoc_code) REFERENCES occupation_data(onetsoc_code), | ||||
|   FOREIGN KEY (task_id) REFERENCES task_statements(task_id), | ||||
|   FOREIGN KEY (dwa_id) REFERENCES dwa_reference(dwa_id)); | ||||
| CREATE TABLE emerging_tasks ( | ||||
|   onetsoc_code CHARACTER(10) NOT NULL, | ||||
|   task CHARACTER VARYING(1000) NOT NULL, | ||||
|   category CHARACTER VARYING(8) NOT NULL, | ||||
|   original_task_id DECIMAL(8,0), | ||||
|   date_updated DATE NOT NULL, | ||||
|   domain_source CHARACTER VARYING(30) NOT NULL, | ||||
|   FOREIGN KEY (onetsoc_code) REFERENCES occupation_data(onetsoc_code), | ||||
|   FOREIGN KEY (original_task_id) REFERENCES task_statements(task_id)); | ||||
| CREATE TABLE related_occupations ( | ||||
|   onetsoc_code CHARACTER(10) NOT NULL, | ||||
|   related_onetsoc_code CHARACTER(10) NOT NULL, | ||||
|   relatedness_tier CHARACTER VARYING(50) NOT NULL, | ||||
|   related_index DECIMAL(3,0) NOT NULL, | ||||
|   FOREIGN KEY (onetsoc_code) REFERENCES occupation_data(onetsoc_code), | ||||
|   FOREIGN KEY (related_onetsoc_code) REFERENCES occupation_data(onetsoc_code)); | ||||
| CREATE TABLE unspsc_reference ( | ||||
|   commodity_code DECIMAL(8,0) NOT NULL, | ||||
|   commodity_title CHARACTER VARYING(150) NOT NULL, | ||||
|   class_code DECIMAL(8,0) NOT NULL, | ||||
|   class_title CHARACTER VARYING(150) NOT NULL, | ||||
|   family_code DECIMAL(8,0) NOT NULL, | ||||
|   family_title CHARACTER VARYING(150) NOT NULL, | ||||
|   segment_code DECIMAL(8,0) NOT NULL, | ||||
|   segment_title CHARACTER VARYING(150) NOT NULL, | ||||
|   PRIMARY KEY (commodity_code)); | ||||
| CREATE TABLE alternate_titles ( | ||||
|   onetsoc_code CHARACTER(10) NOT NULL, | ||||
|   alternate_title CHARACTER VARYING(250) NOT NULL, | ||||
|   short_title CHARACTER VARYING(150), | ||||
|   sources CHARACTER VARYING(50) NOT NULL, | ||||
|   FOREIGN KEY (onetsoc_code) REFERENCES occupation_data(onetsoc_code)); | ||||
| CREATE TABLE sample_of_reported_titles ( | ||||
|   onetsoc_code CHARACTER(10) NOT NULL, | ||||
|   reported_job_title CHARACTER VARYING(150) NOT NULL, | ||||
|   shown_in_my_next_move CHARACTER(1) NOT NULL, | ||||
|   FOREIGN KEY (onetsoc_code) REFERENCES occupation_data(onetsoc_code)); | ||||
| CREATE TABLE technology_skills ( | ||||
|   onetsoc_code CHARACTER(10) NOT NULL, | ||||
|   example CHARACTER VARYING(150) NOT NULL, | ||||
|   commodity_code DECIMAL(8,0) NOT NULL, | ||||
|   hot_technology CHARACTER(1) NOT NULL, | ||||
|   in_demand CHARACTER(1) NOT NULL, | ||||
|   FOREIGN KEY (onetsoc_code) REFERENCES occupation_data(onetsoc_code), | ||||
|   FOREIGN KEY (commodity_code) REFERENCES unspsc_reference(commodity_code)); | ||||
| CREATE TABLE tools_used ( | ||||
|   onetsoc_code CHARACTER(10) NOT NULL, | ||||
|   example CHARACTER VARYING(150) NOT NULL, | ||||
|   commodity_code DECIMAL(8,0) NOT NULL, | ||||
|   FOREIGN KEY (onetsoc_code) REFERENCES occupation_data(onetsoc_code), | ||||
|   FOREIGN KEY (commodity_code) REFERENCES unspsc_reference(commodity_code)); | ||||
| CREATE TABLE abilities_to_work_activities ( | ||||
|   abilities_element_id CHARACTER VARYING(20) NOT NULL, | ||||
|   work_activities_element_id CHARACTER VARYING(20) NOT NULL, | ||||
|   FOREIGN KEY (abilities_element_id) REFERENCES content_model_reference(element_id), | ||||
|   FOREIGN KEY (work_activities_element_id) REFERENCES content_model_reference(element_id)); | ||||
| CREATE TABLE abilities_to_work_context ( | ||||
|   abilities_element_id CHARACTER VARYING(20) NOT NULL, | ||||
|   work_context_element_id CHARACTER VARYING(20) NOT NULL, | ||||
|   FOREIGN KEY (abilities_element_id) REFERENCES content_model_reference(element_id), | ||||
|   FOREIGN KEY (work_context_element_id) REFERENCES content_model_reference(element_id)); | ||||
| CREATE TABLE skills_to_work_activities ( | ||||
|   skills_element_id CHARACTER VARYING(20) NOT NULL, | ||||
|   work_activities_element_id CHARACTER VARYING(20) NOT NULL, | ||||
|   FOREIGN KEY (skills_element_id) REFERENCES content_model_reference(element_id), | ||||
|   FOREIGN KEY (work_activities_element_id) REFERENCES content_model_reference(element_id)); | ||||
| CREATE TABLE skills_to_work_context ( | ||||
|   skills_element_id CHARACTER VARYING(20) NOT NULL, | ||||
|   work_context_element_id CHARACTER VARYING(20) NOT NULL, | ||||
|   FOREIGN KEY (skills_element_id) REFERENCES content_model_reference(element_id), | ||||
|   FOREIGN KEY (work_context_element_id) REFERENCES content_model_reference(element_id)); | ||||
| CREATE TABLE riasec_keywords ( | ||||
|   element_id CHARACTER VARYING(20) NOT NULL, | ||||
|   keyword CHARACTER VARYING(150) NOT NULL, | ||||
|   keyword_type CHARACTER VARYING(20) NOT NULL, | ||||
|   FOREIGN KEY (element_id) REFERENCES content_model_reference(element_id)); | ||||
| CREATE TABLE basic_interests_to_riasec ( | ||||
|   basic_interests_element_id CHARACTER VARYING(20) NOT NULL, | ||||
|   riasec_element_id CHARACTER VARYING(20) NOT NULL, | ||||
|   FOREIGN KEY (basic_interests_element_id) REFERENCES content_model_reference(element_id), | ||||
|   FOREIGN KEY (riasec_element_id) REFERENCES content_model_reference(element_id)); | ||||
| CREATE TABLE interests_illus_activities ( | ||||
|   element_id CHARACTER VARYING(20) NOT NULL, | ||||
|   interest_type CHARACTER VARYING(20) NOT NULL, | ||||
|   activity CHARACTER VARYING(150) NOT NULL, | ||||
|   FOREIGN KEY (element_id) REFERENCES content_model_reference(element_id)); | ||||
| CREATE TABLE interests_illus_occupations ( | ||||
|   element_id CHARACTER VARYING(20) NOT NULL, | ||||
|   interest_type CHARACTER VARYING(20) NOT NULL, | ||||
|   onetsoc_code CHARACTER(10) NOT NULL, | ||||
|   FOREIGN KEY (element_id) REFERENCES content_model_reference(element_id), | ||||
|   FOREIGN KEY (onetsoc_code) REFERENCES occupation_data(onetsoc_code)); | ||||
| CREATE TABLE sqlite_stat1(tbl,idx,stat); | ||||
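
The schema above links occupations to tasks (`task_statements`), tasks to Detailed Work Activities (`tasks_to_dwas`, `dwa_reference`), and DWAs to Intermediate Work Activities (`iwa_reference`). The following is a minimal sketch of walking that chain with the standard `sqlite3` module. It assumes a trimmed-down, in-memory copy of those tables with simplified column types and omitted columns; the rows are made-up placeholders rather than O*NET data, and the helper `hierarchy_rows` is hypothetical, not the query used by the explorer app.

```python
import sqlite3

# In-memory database holding a simplified subset of the tables above.
# Column types are reduced to TEXT/INTEGER and all rows are placeholders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE occupation_data (onetsoc_code TEXT PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE iwa_reference (iwa_id TEXT PRIMARY KEY, iwa_title TEXT NOT NULL);
CREATE TABLE dwa_reference (
  dwa_id TEXT PRIMARY KEY,
  iwa_id TEXT NOT NULL REFERENCES iwa_reference(iwa_id),
  dwa_title TEXT NOT NULL);
CREATE TABLE task_statements (
  task_id INTEGER PRIMARY KEY,
  onetsoc_code TEXT NOT NULL REFERENCES occupation_data(onetsoc_code),
  task TEXT NOT NULL);
CREATE TABLE tasks_to_dwas (
  onetsoc_code TEXT NOT NULL REFERENCES occupation_data(onetsoc_code),
  task_id INTEGER NOT NULL REFERENCES task_statements(task_id),
  dwa_id TEXT NOT NULL REFERENCES dwa_reference(dwa_id));

INSERT INTO occupation_data VALUES ('00-0000.00', 'Example Occupation');
INSERT INTO iwa_reference VALUES ('IWA-1', 'Example intermediate activity');
INSERT INTO dwa_reference VALUES ('DWA-1', 'IWA-1', 'Example detailed activity');
INSERT INTO task_statements VALUES (1, '00-0000.00', 'Example task A');
INSERT INTO task_statements VALUES (2, '00-0000.00', 'Example task B');
INSERT INTO tasks_to_dwas VALUES ('00-0000.00', 1, 'DWA-1');
INSERT INTO tasks_to_dwas VALUES ('00-0000.00', 2, 'DWA-1');
""")


def hierarchy_rows(connection, onetsoc_code):
    """Return (iwa_title, dwa_title, task) tuples for one occupation."""
    query = """
        SELECT i.iwa_title, d.dwa_title, t.task
        FROM tasks_to_dwas AS td
        JOIN task_statements AS t ON t.task_id = td.task_id
        JOIN dwa_reference AS d ON d.dwa_id = td.dwa_id
        JOIN iwa_reference AS i ON i.iwa_id = d.iwa_id
        WHERE td.onetsoc_code = ?
        ORDER BY i.iwa_id, d.dwa_id, t.task_id
    """
    # The occupation code is bound as a parameter rather than formatted into the SQL.
    return connection.execute(query, (onetsoc_code,)).fetchall()


rows = hierarchy_rows(conn, "00-0000.00")
for iwa_title, dwa_title, task in rows:
    print(f"{iwa_title} -> {dwa_title} -> {task}")
```

Starting the join from `tasks_to_dwas` means only activities that actually have a task for the given occupation are returned, which matches the occupation-scoped hierarchy the graph is meant to show.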
							
								
								
									
										21699
									
								
								archive/tasks_estimateable.csv
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							
										
											
												File diff suppressed because it is too large
											
										
									
								
							
							
								
								
									
										21699
									
								
								archive/tasks_with_estimates.csv
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							
										
											
												File diff suppressed because it is too large
											
										
									
								
							
							
								
								
									
										411
									
								
								classify_estimateability_of_tasks.py
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							|  | @ -0,0 +1,411 @@ | |||
| import pandas as pd | ||||
| import litellm | ||||
| import dotenv | ||||
| import os | ||||
| import time | ||||
| import json | ||||
| import math | ||||
| 
 | ||||
| # Load environment variables | ||||
| dotenv.load_dotenv(override=True) | ||||
| 
 | ||||
| # litellm._turn_on_debug() # Optional debugging | ||||
| 
 | ||||
| # --- Configuration --- | ||||
| MODEL = "gpt-4.1-mini"  # Make sure this model supports json_schema or structured output | ||||
| RATE_LIMIT = 5000  # Requests per minute | ||||
| CHUNK_SIZE = 300  # Number of unique tasks per API call | ||||
| SECONDS_PER_MINUTE = 60 | ||||
| 
 | ||||
| # File configuration | ||||
| CLASSIFICATION_FILENAME = "tasks_estimateable.csv"  # Output file with classifications | ||||
| TASK_SOURCE_FOR_INIT_FILENAME = "tasks_with_estimates.csv" | ||||
| OUTPUT_COLUMN_NAME = "task_estimateable" | ||||
| SOURCE_FILTER_COLUMN = "remote_status" | ||||
| SOURCE_FILTER_VALUE = "remote" | ||||
| 
 | ||||
| # --- Prompts and Schema --- | ||||
| SYSTEM_PROMPT_CLASSIFY = """ | ||||
| Classify the provided O*NET task into one of these categories: | ||||
|  -  ATOMIC (schedulable): A single, clearly-bounded activity, typically lasting minutes, hours, or a few days. | ||||
|  -  ONGOING-CONSTRAINT (background role/ethical rule): A continuous responsibility or behavioural norm with no schedulable duration (e.g., “follow confidentiality rules,” “serve as department head”). | ||||
| """.strip() | ||||
| 
 | ||||
| USER_MESSAGE_TEMPLATE_CLASSIFY = "Task: {task}" | ||||
| 
 | ||||
| CLASSIFICATION_CATEGORIES = ["ATOMIC", "ONGOING-CONSTRAINT"] | ||||
| 
 | ||||
| SCHEMA_FOR_CLASSIFICATION = { | ||||
|     "name": "classify_task_type", | ||||
|     "strict": True, | ||||
|     "schema": { | ||||
|         "type": "object", | ||||
|         "properties": { | ||||
|             "task_category": { | ||||
|                 "type": "string", | ||||
|                 "enum": CLASSIFICATION_CATEGORIES, | ||||
|                 "description": "The classification of the task (ATOMIC or ONGOING-CONSTRAINT).", | ||||
|             } | ||||
|         }, | ||||
|         "required": ["task_category"], | ||||
|         "additionalProperties": False, | ||||
|     }, | ||||
| } | ||||
| 
 | ||||
| 
 | ||||
| def save_dataframe(df_to_save, filename): | ||||
|     """Saves the DataFrame to the specified CSV file using atomic write.""" | ||||
|     # Build the temp path before `try` so the except block can always reference it. | ||||
|     temp_filename = filename + ".tmp" | ||||
|     try: | ||||
|         df_to_save.to_csv(temp_filename, encoding="utf-8-sig", index=False) | ||||
|         os.replace(temp_filename, filename) | ||||
|     except Exception as e: | ||||
|         print(f"--- Error saving DataFrame to {filename}: {e} ---") | ||||
|         if os.path.exists(temp_filename): | ||||
|             try: | ||||
|                 os.remove(temp_filename) | ||||
|             except Exception as remove_err: | ||||
|                 print( | ||||
|                     f"--- Error removing temporary save file {temp_filename}: {remove_err} ---" | ||||
|                 ) | ||||
| 
 | ||||
| 
 | ||||
| # --- Load or Initialize DataFrame --- | ||||
| try: | ||||
|     if os.path.exists(CLASSIFICATION_FILENAME): | ||||
|         df = pd.read_csv(CLASSIFICATION_FILENAME, encoding="utf-8-sig") | ||||
|         print(f"Successfully read {len(df)} rows from {CLASSIFICATION_FILENAME}.") | ||||
| 
 | ||||
|         save_needed_after_load = False | ||||
|         if OUTPUT_COLUMN_NAME not in df.columns: | ||||
|             df[OUTPUT_COLUMN_NAME] = pd.NA | ||||
|             print(f"Added '{OUTPUT_COLUMN_NAME}' column.") | ||||
|             save_needed_after_load = True | ||||
| 
 | ||||
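|         # Reassign rather than use inplace=True on a column selection, which may not update df under copy-on-write. | ||||
|         df[OUTPUT_COLUMN_NAME] = df[OUTPUT_COLUMN_NAME].replace("", pd.NA) | ||||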
| 
 | ||||
|         if df[OUTPUT_COLUMN_NAME].dtype != object and not isinstance( | ||||
|             df[OUTPUT_COLUMN_NAME].dtype, pd.StringDtype | ||||
|         ): | ||||
|             try: | ||||
|                 df[OUTPUT_COLUMN_NAME] = df[OUTPUT_COLUMN_NAME].astype(object) | ||||
|                 print( | ||||
|                     f"Corrected dtype of '{OUTPUT_COLUMN_NAME}' to {df[OUTPUT_COLUMN_NAME].dtype}." | ||||
|                 ) | ||||
|                 save_needed_after_load = True | ||||
|             except Exception as e: | ||||
|                 print( | ||||
|                     f"Warning: Could not convert column '{OUTPUT_COLUMN_NAME}' to object: {e}." | ||||
|                 ) | ||||
| 
 | ||||
|         if "task" not in df.columns: | ||||
|             print( | ||||
|                 f"Error: {CLASSIFICATION_FILENAME} must contain a 'task' column for processing." | ||||
|             ) | ||||
|             exit() | ||||
| 
 | ||||
|         if save_needed_after_load: | ||||
|             print(f"Saving {CLASSIFICATION_FILENAME} after adding/adjusting column.") | ||||
|             save_dataframe(df, CLASSIFICATION_FILENAME) | ||||
|     else: | ||||
|         print( | ||||
|             f"{CLASSIFICATION_FILENAME} not found. Attempting to create it from {TASK_SOURCE_FOR_INIT_FILENAME}." | ||||
|         ) | ||||
|         if not os.path.exists(TASK_SOURCE_FOR_INIT_FILENAME): | ||||
|             print( | ||||
|                 f"Error: Source file {TASK_SOURCE_FOR_INIT_FILENAME} not found. Cannot create {CLASSIFICATION_FILENAME}." | ||||
|             ) | ||||
|             exit() | ||||
| 
 | ||||
|         df_source = pd.read_csv(TASK_SOURCE_FOR_INIT_FILENAME, encoding="utf-8-sig") | ||||
| 
 | ||||
|         required_source_cols_for_init = ["task", SOURCE_FILTER_COLUMN] | ||||
|         missing_source_cols = [ | ||||
|             col for col in required_source_cols_for_init if col not in df_source.columns | ||||
|         ] | ||||
|         if missing_source_cols: | ||||
|             print( | ||||
|                 f"Error: Source file {TASK_SOURCE_FOR_INIT_FILENAME} is missing required columns for initialization: {', '.join(missing_source_cols)}." | ||||
|             ) | ||||
|             exit() | ||||
| 
 | ||||
|         df_source_filtered = df_source[ | ||||
|             df_source[SOURCE_FILTER_COLUMN] == SOURCE_FILTER_VALUE | ||||
|         ].copy() | ||||
| 
 | ||||
|         if df_source_filtered.empty: | ||||
|             print( | ||||
|                 f"Warning: No tasks with '{SOURCE_FILTER_COLUMN}' == '{SOURCE_FILTER_VALUE}' found in {TASK_SOURCE_FOR_INIT_FILENAME}. " | ||||
|                 f"{CLASSIFICATION_FILENAME} will be created with schema but no tasks to classify initially." | ||||
|             ) | ||||
| 
 | ||||
|         df = df_source_filtered[["task"]].copy() | ||||
|         df[OUTPUT_COLUMN_NAME] = pd.NA | ||||
|         df[OUTPUT_COLUMN_NAME] = df[OUTPUT_COLUMN_NAME].astype(object) | ||||
| 
 | ||||
|         print( | ||||
|             f"Created {CLASSIFICATION_FILENAME} using tasks from {TASK_SOURCE_FOR_INIT_FILENAME} " | ||||
|             f"(where {SOURCE_FILTER_COLUMN}='{SOURCE_FILTER_VALUE}'). New file has {len(df)} tasks." | ||||
|         ) | ||||
|         save_dataframe(df, CLASSIFICATION_FILENAME) | ||||
| 
 | ||||
| except FileNotFoundError: | ||||
|     print("Error: A required file was not found. Please check paths.") | ||||
|     exit() | ||||
| except Exception as e: | ||||
|     print(f"Error during DataFrame loading or initialization: {e}") | ||||
|     exit() | ||||
| 
 | ||||
| 
 | ||||
| # --- Identify Unique Tasks to Process --- | ||||
| if df.empty: | ||||
|     print(f"{CLASSIFICATION_FILENAME} is empty. Nothing to process. Exiting.") | ||||
|     exit() | ||||
| 
 | ||||
| initial_unprocessed_mask = df[OUTPUT_COLUMN_NAME].isna() | ||||
| 
 | ||||
| if not initial_unprocessed_mask.any(): | ||||
|     print( | ||||
|         f"All tasks in {CLASSIFICATION_FILENAME} seem to have been classified already. Exiting." | ||||
|     ) | ||||
|     exit() | ||||
| 
 | ||||
| # Filter for rows that are unprocessed AND have a valid 'task' string | ||||
| valid_tasks_to_consider_df = df[ | ||||
|     initial_unprocessed_mask & df["task"].notna() & (df["task"].str.strip() != "") | ||||
| ] | ||||
| 
 | ||||
| if valid_tasks_to_consider_df.empty: | ||||
|     print( | ||||
|         "No valid, unclassified tasks found to process (after filtering out empty/NaN task descriptions). Exiting." | ||||
|     ) | ||||
|     exit() | ||||
| 
 | ||||
| unique_task_labels_for_api = ( | ||||
|     valid_tasks_to_consider_df["task"].drop_duplicates().tolist() | ||||
| ) | ||||
| total_rows_to_update_potentially = len( | ||||
|     df[initial_unprocessed_mask] | ||||
| )  # Count all rows that are NA | ||||
| 
 | ||||
| print( | ||||
|     f"Found {total_rows_to_update_potentially} total rows in {CLASSIFICATION_FILENAME} needing classification." | ||||
| ) | ||||
| print( | ||||
|     f"Identified {len(unique_task_labels_for_api)} unique, valid task labels to send to the API." | ||||
| ) | ||||
| 
 | ||||
| 
 | ||||
| # --- Prepare messages for batch completion (only for unique task labels) --- | ||||
| messages_list = [] | ||||
| print(f"Preparing messages for {len(unique_task_labels_for_api)} unique task labels...") | ||||
| 
 | ||||
| for task_label in unique_task_labels_for_api: | ||||
|     # task_label is already guaranteed to be non-empty and not NaN from the filtering above | ||||
|     user_message = USER_MESSAGE_TEMPLATE_CLASSIFY.format(task=task_label) | ||||
|     messages_for_task = [ | ||||
|         {"role": "system", "content": SYSTEM_PROMPT_CLASSIFY}, | ||||
|         {"role": "user", "content": user_message}, | ||||
|     ] | ||||
|     messages_list.append(messages_for_task) | ||||
| 
 | ||||
| print(f"Prepared {len(messages_list)} message sets for batch completion.") | ||||
| if ( | ||||
|     not messages_list | ||||
| ):  # Should only happen if unique_task_labels_for_api was empty, caught above | ||||
|     print( | ||||
|         "No messages prepared, though unique tasks were identified. This is unexpected. Exiting." | ||||
|     ) | ||||
|     exit() | ||||
| 
 | ||||
| 
 | ||||
| # --- Call batch_completion in chunks with rate limiting and periodic saving --- | ||||
| total_unique_tasks_to_send = len( | ||||
|     messages_list | ||||
| )  # Same as len(unique_task_labels_for_api) | ||||
| num_chunks = math.ceil(total_unique_tasks_to_send / CHUNK_SIZE) | ||||
| 
 | ||||
| print( | ||||
|     f"\nStarting batch classification for {total_unique_tasks_to_send} unique task labels in {num_chunks} chunks..." | ||||
| ) | ||||
| 
 | ||||
| overall_start_time = time.time() | ||||
| processed_rows_count_total = 0  # Counts actual rows updated in the DataFrame | ||||
| 
 | ||||
| for i in range(num_chunks): | ||||
|     chunk_start_message_index = i * CHUNK_SIZE | ||||
|     chunk_end_message_index = min((i + 1) * CHUNK_SIZE, total_unique_tasks_to_send) | ||||
| 
 | ||||
|     message_chunk = messages_list[chunk_start_message_index:chunk_end_message_index] | ||||
|     # Get corresponding unique task labels for this chunk | ||||
|     chunk_task_labels = unique_task_labels_for_api[ | ||||
|         chunk_start_message_index:chunk_end_message_index | ||||
|     ] | ||||
| 
 | ||||
|     if not message_chunk:  # Should not happen if loop range is correct | ||||
|         continue | ||||
| 
 | ||||
|     print( | ||||
|         f"\nProcessing chunk {i + 1}/{num_chunks} (Unique Task Labels {chunk_start_message_index + 1}-{chunk_end_message_index} of this run)..." | ||||
|     ) | ||||
|     chunk_start_time = time.time() | ||||
|     responses = [] | ||||
|     try: | ||||
|         print( | ||||
|             f"Sending {len(message_chunk)} requests (for unique tasks) for chunk {i + 1}..." | ||||
|         ) | ||||
|         responses = litellm.batch_completion( | ||||
|             model=MODEL, | ||||
|             messages=message_chunk, | ||||
|             response_format={ | ||||
|                 "type": "json_schema", | ||||
|                 "json_schema": SCHEMA_FOR_CLASSIFICATION, | ||||
|             }, | ||||
|             num_retries=3, | ||||
|         ) | ||||
|         print(f"Chunk {i + 1} API call completed.") | ||||
| 
 | ||||
|     except Exception as e: | ||||
|         print(f"Error during litellm.batch_completion for chunk {i + 1}: {e}") | ||||
|         responses = [None] * len(message_chunk) | ||||
| 
 | ||||
|     # --- Process responses for the current chunk --- | ||||
|     # chunk_task_classifications stores {task_label: classification_category} | ||||
|     chunk_task_classifications = {} | ||||
|     successful_api_calls_in_chunk = 0 | ||||
|     failed_api_calls_in_chunk = 0 | ||||
| 
 | ||||
|     if responses and len(responses) == len(message_chunk): | ||||
|         for j, response in enumerate(responses): | ||||
|             current_task_label = chunk_task_labels[ | ||||
|                 j | ||||
|             ]  # The unique task label for this response | ||||
|             content_str = None | ||||
| 
 | ||||
|             if response is None: | ||||
|                 print( | ||||
|                     f"API call failed for task label '{current_task_label}' (response is None)." | ||||
|                 ) | ||||
|                 failed_api_calls_in_chunk += 1 | ||||
|                 continue | ||||
| 
 | ||||
|             try: | ||||
|                 if ( | ||||
|                     response.choices | ||||
|                     and response.choices[0].message | ||||
|                     and response.choices[0].message.content | ||||
|                 ): | ||||
|                     content_str = response.choices[0].message.content | ||||
|                     classification_data = json.loads(content_str) | ||||
|                     category_raw = classification_data.get("task_category") | ||||
| 
 | ||||
|                     if category_raw in CLASSIFICATION_CATEGORIES: | ||||
|                         successful_api_calls_in_chunk += 1 | ||||
|                         chunk_task_classifications[current_task_label] = category_raw | ||||
|                     else: | ||||
|                         print( | ||||
|                             f"Warning: Invalid or missing task_category for task label '{current_task_label}': '{category_raw}'. Content: '{content_str}'" | ||||
|                         ) | ||||
|                         failed_api_calls_in_chunk += 1 | ||||
|                 else: | ||||
|                     finish_reason = ( | ||||
|                         response.choices[0].finish_reason | ||||
|                         if (response.choices and response.choices[0].finish_reason) | ||||
|                         else "unknown" | ||||
|                     ) | ||||
|                     error_message = ( | ||||
|                         response.choices[0].message.content | ||||
|                         if (response.choices and response.choices[0].message) | ||||
|                         else "No content in message." | ||||
|                     ) | ||||
|                     print( | ||||
|                         f"Warning: Received non-standard or empty response content for task label '{current_task_label}'. " | ||||
|                         f"Finish Reason: '{finish_reason}'. Message: '{error_message}'. Raw Choices: {response.choices}" | ||||
|                     ) | ||||
|                     failed_api_calls_in_chunk += 1 | ||||
| 
 | ||||
|             except json.JSONDecodeError: | ||||
|                 print( | ||||
|                     f"Warning: Could not decode JSON for task label '{current_task_label}'. Content received: '{content_str}'" | ||||
|                 ) | ||||
|                 failed_api_calls_in_chunk += 1 | ||||
|             except AttributeError as ae: | ||||
|                 print( | ||||
|                     f"Warning: Missing attribute processing response for task label '{current_task_label}': {ae}. Response: {response}" | ||||
|                 ) | ||||
|                 failed_api_calls_in_chunk += 1 | ||||
|             except Exception as e: | ||||
|                 print( | ||||
|                     f"Warning: Unexpected error processing response for task label '{current_task_label}': {type(e).__name__} - {e}. Response: {response}" | ||||
|                 ) | ||||
|                 failed_api_calls_in_chunk += 1 | ||||
|     else: | ||||
|         print( | ||||
|             f"Warning: Mismatch between #responses ({len(responses) if responses else 0}) " | ||||
|             f"and #messages sent ({len(message_chunk)}) for chunk {i + 1}, or no responses. Marking all API calls in chunk as failed." | ||||
|         ) | ||||
|         failed_api_calls_in_chunk = len(message_chunk) | ||||
| 
 | ||||
|     # --- Update Main DataFrame and Save Periodically --- | ||||
|     rows_updated_this_chunk = 0 | ||||
|     if chunk_task_classifications: | ||||
|         print( | ||||
|             f"Updating main DataFrame with classifications for {len(chunk_task_classifications)} unique tasks from chunk {i + 1}..." | ||||
|         ) | ||||
|         for task_label, category in chunk_task_classifications.items(): | ||||
|             # Update all rows in the main df that match this task_label AND are still NA in the output column | ||||
|             update_condition = (df["task"] == task_label) & ( | ||||
|                 df[OUTPUT_COLUMN_NAME].isna() | ||||
|             ) | ||||
|             num_rows_for_this_task_label = df[update_condition].shape[0] | ||||
| 
 | ||||
|             if num_rows_for_this_task_label > 0: | ||||
|                 df.loc[update_condition, OUTPUT_COLUMN_NAME] = category | ||||
|                 rows_updated_this_chunk += num_rows_for_this_task_label | ||||
| 
 | ||||
|         print( | ||||
|             f"Updated {rows_updated_this_chunk} rows in the DataFrame based on this chunk's API responses." | ||||
|         ) | ||||
|         print(f"Saving progress to {CLASSIFICATION_FILENAME}...") | ||||
|         save_dataframe(df, CLASSIFICATION_FILENAME) | ||||
|     else: | ||||
|         print( | ||||
|             f"No successful API classifications obtained in chunk {i + 1} to update DataFrame or save." | ||||
|         ) | ||||
| 
 | ||||
|     print( | ||||
|         f"Chunk {i + 1} API summary: Successful Calls={successful_api_calls_in_chunk}, Failed/Skipped Calls={failed_api_calls_in_chunk}. " | ||||
|         f"Rows updated in DataFrame this chunk: {rows_updated_this_chunk}" | ||||
|     ) | ||||
|     processed_rows_count_total += rows_updated_this_chunk | ||||
| 
 | ||||
|     # --- Rate Limiting Pause --- | ||||
|     chunk_end_time = time.time() | ||||
|     chunk_duration = chunk_end_time - chunk_start_time | ||||
|     print(f"Chunk {i + 1} (API calls and DF update) took {chunk_duration:.2f} seconds.") | ||||
| 
 | ||||
|     if i < num_chunks - 1: | ||||
|         time_per_request = SECONDS_PER_MINUTE / RATE_LIMIT if RATE_LIMIT > 0 else 0 | ||||
|         min_chunk_duration_for_rate = ( | ||||
|             len(message_chunk) * time_per_request | ||||
|         )  # Based on API calls made | ||||
|         pause_needed = max(0, min_chunk_duration_for_rate - chunk_duration) | ||||
| 
 | ||||
|         if pause_needed > 0: | ||||
|             print( | ||||
|                 f"Pausing for {pause_needed:.2f} seconds to respect rate limit ({RATE_LIMIT}/min)..." | ||||
|             ) | ||||
|             time.sleep(pause_needed) | ||||
| 
 | ||||
| overall_end_time = time.time() | ||||
| total_duration_minutes = (overall_end_time - overall_start_time) / 60 | ||||
| print( | ||||
|     f"\nBatch classification finished." | ||||
|     f" Updated {processed_rows_count_total} rows in '{CLASSIFICATION_FILENAME}' with new classifications in this run." | ||||
|     f" Total duration: {total_duration_minutes:.2f} minutes." | ||||
| ) | ||||
| 
 | ||||
| print(f"Performing final save to {CLASSIFICATION_FILENAME}...") | ||||
| save_dataframe(df, CLASSIFICATION_FILENAME) | ||||
| 
 | ||||
| print("\nScript finished.") | ||||
|  | @ -1,85 +0,0 @@ | |||
| #!/usr/bin/env bash | ||||
| 
 | ||||
| # Set database name and directories | ||||
| ONET_DB_NAME="onet.database" | ||||
| ONET_ZIP_URL="https://www.onetcenter.org/dl_files/database/db_29_1_mysql.zip" | ||||
| ONET_ZIP_FILE="db_29_1_mysql.zip" | ||||
| ONET_EXTRACT_DIR="db_29_1_mysql" | ||||
| 
 | ||||
| # Download O*NET database only if not already downloaded | ||||
| if [ ! -f "$ONET_ZIP_FILE" ]; then | ||||
|     echo "Downloading O*NET database from $ONET_ZIP_URL" | ||||
|     curl -L -o "$ONET_ZIP_FILE" "$ONET_ZIP_URL" || wget -O "$ONET_ZIP_FILE" "$ONET_ZIP_URL" | ||||
| 
 | ||||
|     if [ $? -ne 0 ]; then | ||||
|         echo "Failed to download O*NET database" | ||||
|         exit 1 | ||||
|     fi | ||||
| else | ||||
|     echo "Using existing O*NET database zip file" | ||||
| fi | ||||
| 
 | ||||
| # Extract downloaded zip file only if extraction directory doesn't exist | ||||
| if [ ! -d "$ONET_EXTRACT_DIR" ]; then | ||||
|     echo "Extracting O*NET database files" | ||||
|     unzip -o "$ONET_ZIP_FILE" | ||||
| 
 | ||||
|     if [ $? -ne 0 ]; then | ||||
|         echo "Failed to extract O*NET database files" | ||||
|         exit 1 | ||||
|     fi | ||||
| else | ||||
|     echo "Using existing extracted O*NET database files" | ||||
| fi | ||||
| 
 | ||||
| # Remove existing database if it exists | ||||
| if [ -f "$ONET_DB_NAME" ]; then | ||||
|     echo "Removing existing database" | ||||
|     rm "$ONET_DB_NAME" | ||||
| fi | ||||
| 
 | ||||
| # Create a new SQLite database and import with settings tuned for fast bulk loading. | ||||
| # These PRAGMAs apply per connection, so they are issued in the same sqlite3 session as the import. | ||||
| echo "Creating new SQLite database: $ONET_DB_NAME with performance settings" | ||||
| echo "Executing SQL files in alphabetical order (single transaction mode)" | ||||
| sqlite3 "$ONET_DB_NAME" << EOF | ||||
| PRAGMA journal_mode = OFF; | ||||
| PRAGMA synchronous = 0; | ||||
| PRAGMA cache_size = 1000000; | ||||
| PRAGMA locking_mode = EXCLUSIVE; | ||||
| PRAGMA temp_store = MEMORY; | ||||
| PRAGMA foreign_keys = ON; | ||||
| BEGIN TRANSACTION; | ||||
| $(find "$ONET_EXTRACT_DIR" -name "*.sql" | sort | xargs cat) | ||||
| COMMIT; | ||||
| EOF | ||||
| 
 | ||||
| # Check if the execution was successful | ||||
| if [ $? -ne 0 ]; then | ||||
|     echo "Error executing SQL files in batch transaction" | ||||
|     exit 1 | ||||
| else | ||||
|     echo "Database populated successfully. Restoring reliability settings..." | ||||
| 
 | ||||
|     # Restore reliability-focused settings after import | ||||
|     sqlite3 "$ONET_DB_NAME" << EOF | ||||
| PRAGMA journal_mode = WAL; | ||||
| PRAGMA synchronous = NORMAL; | ||||
| PRAGMA locking_mode = NORMAL; | ||||
| PRAGMA temp_store = DEFAULT; | ||||
| PRAGMA foreign_keys = ON; | ||||
| PRAGMA optimize; | ||||
| VACUUM; | ||||
| EOF | ||||
| 
 | ||||
|     if [ $? -ne 0 ]; then | ||||
|         echo "Warning: Failed to restore reliability settings, but database is populated" | ||||
|     else | ||||
|         echo "Reliability settings restored successfully" | ||||
|     fi | ||||
| 
 | ||||
|     echo "O*NET database created and optimized successfully!" | ||||
| fi | ||||
							
								
								
									
										2088
									
								
								data/DWA Reference.txt
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							
										
											
												File diff suppressed because it is too large
											
										
									
								
							
							
								
								
									
										158752
									
								
								data/Task Ratings.txt
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							
										
											
												File diff suppressed because it is too large
											
										
									
								
							
							
								
								
									
										
											BIN
										
									
								
								data/onet.db
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
										
									
								
							
										
											Binary file not shown.
										
									
								
							
										
											
												File diff suppressed because one or more lines are too long
											
										
									
								
							|  | @ -5,22 +5,29 @@ description = "Add your description here" | |||
| readme = "README.md" | ||||
| requires-python = ">=3.13" | ||||
| dependencies = [ | ||||
|     "coloraide>=4.6", | ||||
|     "dotenv>=0.9.9", | ||||
|     "graphviz>=0.20.3", | ||||
|     "jupyter>=1.1.1", | ||||
|     "litellm==1.67.0", | ||||
|     "matplotlib>=3.10.3", | ||||
|     "notebook>=7.4.1", | ||||
|     "openai>=1.76.0", | ||||
|     "openpyxl>=3.1.5", | ||||
|     "pandas>=2.2.3", | ||||
|     "pyarrow>=20.0.0", | ||||
|     "pyvis>=0.3.2", | ||||
|     "requests>=2.32.3", | ||||
|     "scipy", | ||||
|     "seaborn>=0.13.2", | ||||
|     "tenacity>=9.1.2", | ||||
|     "tqdm>=4.67.1", | ||||
| ] | ||||
| 
 | ||||
| 
 | ||||
| [tool.pytest.ini_options] | ||||
| pythonpath="src" | ||||
| addopts="-v" | ||||
| pythonpath = "src" | ||||
| addopts = "-v" | ||||
| asyncio_mode = "auto" | ||||
| 
 | ||||
| [tool.black] | ||||
|  |  | |||
							
								
								
									
										302
									
								
								uv.lock
									
										
									
										generated
									
									
									
								
							
							
						
						
									
									
									
									
								
							|  | @ -264,6 +264,15 @@ wheels = [ | |||
|     { url = "https://files.pythonhosted.org/packages/7e/d4/7ebdbd03970677812aac39c869717059dbb71a4cfc033ca6e5221787892c/click-8.1.8-py3-none-any.whl", hash = "sha256:63c132bbbed01578a06712a2d1f497bb62d9c1c0d329b7903a866228027263b2", size = 98188, upload_time = "2024-12-21T18:38:41.666Z" }, | ||||
| ] | ||||
| 
 | ||||
| [[package]] | ||||
| name = "coloraide" | ||||
| version = "4.6" | ||||
| source = { registry = "https://pypi.org/simple" } | ||||
| sdist = { url = "https://files.pythonhosted.org/packages/38/26/229d409cf6f74ea731d85f1bd2d762f7c1c6555687431ead8532a076ccce/coloraide-4.6.tar.gz", hash = "sha256:e41ce60210476d88e18093a32973a2f046bc3a4e778871bb14e7f2f817f88f6a", size = 22260485, upload_time = "2025-04-22T14:32:04.837Z" } | ||||
| wheels = [ | ||||
|     { url = "https://files.pythonhosted.org/packages/c7/76/5aafb032453bdd3d0e94306648472d021252a9e44fcd06661382df64586a/coloraide-4.6-py3-none-any.whl", hash = "sha256:1a39e52b2189e13c374deddac8f7f92e6ed06ad450de95dc537cc858aeff6c22", size = 261386, upload_time = "2025-04-22T14:32:01.71Z" }, | ||||
| ] | ||||
| 
 | ||||
| [[package]] | ||||
| name = "colorama" | ||||
| version = "0.4.6" | ||||
|  | @ -285,6 +294,46 @@ wheels = [ | |||
|     { url = "https://files.pythonhosted.org/packages/e6/75/49e5bfe642f71f272236b5b2d2691cf915a7283cc0ceda56357b61daa538/comm-0.2.2-py3-none-any.whl", hash = "sha256:e6fb86cb70ff661ee8c9c14e7d36d6de3b4066f1441be4063df9c5009f0a64d3", size = 7180, upload_time = "2024-03-12T16:53:39.226Z" }, | ||||
| ] | ||||
| 
 | ||||
| [[package]] | ||||
| name = "contourpy" | ||||
| version = "1.3.2" | ||||
| source = { registry = "https://pypi.org/simple" } | ||||
| dependencies = [ | ||||
|     { name = "numpy" }, | ||||
| ] | ||||
| sdist = { url = "https://files.pythonhosted.org/packages/66/54/eb9bfc647b19f2009dd5c7f5ec51c4e6ca831725f1aea7a993034f483147/contourpy-1.3.2.tar.gz", hash = "sha256:b6945942715a034c671b7fc54f9588126b0b8bf23db2696e3ca8328f3ff0ab54", size = 13466130, upload_time = "2025-04-15T17:47:53.79Z" } | ||||
| wheels = [ | ||||
|     { url = "https://files.pythonhosted.org/packages/2e/61/5673f7e364b31e4e7ef6f61a4b5121c5f170f941895912f773d95270f3a2/contourpy-1.3.2-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:de39db2604ae755316cb5967728f4bea92685884b1e767b7c24e983ef5f771cb", size = 271630, upload_time = "2025-04-15T17:38:19.142Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/ff/66/a40badddd1223822c95798c55292844b7e871e50f6bfd9f158cb25e0bd39/contourpy-1.3.2-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:3f9e896f447c5c8618f1edb2bafa9a4030f22a575ec418ad70611450720b5b08", size = 255670, upload_time = "2025-04-15T17:38:23.688Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/1e/c7/cf9fdee8200805c9bc3b148f49cb9482a4e3ea2719e772602a425c9b09f8/contourpy-1.3.2-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:71e2bd4a1c4188f5c2b8d274da78faab884b59df20df63c34f74aa1813c4427c", size = 306694, upload_time = "2025-04-15T17:38:28.238Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/dd/e7/ccb9bec80e1ba121efbffad7f38021021cda5be87532ec16fd96533bb2e0/contourpy-1.3.2-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:de425af81b6cea33101ae95ece1f696af39446db9682a0b56daaa48cfc29f38f", size = 345986, upload_time = "2025-04-15T17:38:33.502Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/dc/49/ca13bb2da90391fa4219fdb23b078d6065ada886658ac7818e5441448b78/contourpy-1.3.2-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:977e98a0e0480d3fe292246417239d2d45435904afd6d7332d8455981c408b85", size = 318060, upload_time = "2025-04-15T17:38:38.672Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/c8/65/5245ce8c548a8422236c13ffcdcdada6a2a812c361e9e0c70548bb40b661/contourpy-1.3.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:434f0adf84911c924519d2b08fc10491dd282b20bdd3fa8f60fd816ea0b48841", size = 322747, upload_time = "2025-04-15T17:38:43.712Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/72/30/669b8eb48e0a01c660ead3752a25b44fdb2e5ebc13a55782f639170772f9/contourpy-1.3.2-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:c66c4906cdbc50e9cba65978823e6e00b45682eb09adbb78c9775b74eb222422", size = 1308895, upload_time = "2025-04-15T17:39:00.224Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/05/5a/b569f4250decee6e8d54498be7bdf29021a4c256e77fe8138c8319ef8eb3/contourpy-1.3.2-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:8b7fc0cd78ba2f4695fd0a6ad81a19e7e3ab825c31b577f384aa9d7817dc3bef", size = 1379098, upload_time = "2025-04-15T17:43:29.649Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/19/ba/b227c3886d120e60e41b28740ac3617b2f2b971b9f601c835661194579f1/contourpy-1.3.2-cp313-cp313-win32.whl", hash = "sha256:15ce6ab60957ca74cff444fe66d9045c1fd3e92c8936894ebd1f3eef2fff075f", size = 178535, upload_time = "2025-04-15T17:44:44.532Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/12/6e/2fed56cd47ca739b43e892707ae9a13790a486a3173be063681ca67d2262/contourpy-1.3.2-cp313-cp313-win_amd64.whl", hash = "sha256:e1578f7eafce927b168752ed7e22646dad6cd9bca673c60bff55889fa236ebf9", size = 223096, upload_time = "2025-04-15T17:44:48.194Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/54/4c/e76fe2a03014a7c767d79ea35c86a747e9325537a8b7627e0e5b3ba266b4/contourpy-1.3.2-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:0475b1f6604896bc7c53bb070e355e9321e1bc0d381735421a2d2068ec56531f", size = 285090, upload_time = "2025-04-15T17:43:34.084Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/7b/e2/5aba47debd55d668e00baf9651b721e7733975dc9fc27264a62b0dd26eb8/contourpy-1.3.2-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:c85bb486e9be652314bb5b9e2e3b0d1b2e643d5eec4992c0fbe8ac71775da739", size = 268643, upload_time = "2025-04-15T17:43:38.626Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/a1/37/cd45f1f051fe6230f751cc5cdd2728bb3a203f5619510ef11e732109593c/contourpy-1.3.2-cp313-cp313t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:745b57db7758f3ffc05a10254edd3182a2a83402a89c00957a8e8a22f5582823", size = 310443, upload_time = "2025-04-15T17:43:44.522Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/8b/a2/36ea6140c306c9ff6dd38e3bcec80b3b018474ef4d17eb68ceecd26675f4/contourpy-1.3.2-cp313-cp313t-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:970e9173dbd7eba9b4e01aab19215a48ee5dd3f43cef736eebde064a171f89a5", size = 349865, upload_time = "2025-04-15T17:43:49.545Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/95/b7/2fc76bc539693180488f7b6cc518da7acbbb9e3b931fd9280504128bf956/contourpy-1.3.2-cp313-cp313t-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:c6c4639a9c22230276b7bffb6a850dfc8258a2521305e1faefe804d006b2e532", size = 321162, upload_time = "2025-04-15T17:43:54.203Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/f4/10/76d4f778458b0aa83f96e59d65ece72a060bacb20cfbee46cf6cd5ceba41/contourpy-1.3.2-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:cc829960f34ba36aad4302e78eabf3ef16a3a100863f0d4eeddf30e8a485a03b", size = 327355, upload_time = "2025-04-15T17:44:01.025Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/43/a3/10cf483ea683f9f8ab096c24bad3cce20e0d1dd9a4baa0e2093c1c962d9d/contourpy-1.3.2-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:d32530b534e986374fc19eaa77fcb87e8a99e5431499949b828312bdcd20ac52", size = 1307935, upload_time = "2025-04-15T17:44:17.322Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/78/73/69dd9a024444489e22d86108e7b913f3528f56cfc312b5c5727a44188471/contourpy-1.3.2-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:e298e7e70cf4eb179cc1077be1c725b5fd131ebc81181bf0c03525c8abc297fd", size = 1372168, upload_time = "2025-04-15T17:44:33.43Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/0f/1b/96d586ccf1b1a9d2004dd519b25fbf104a11589abfd05484ff12199cca21/contourpy-1.3.2-cp313-cp313t-win32.whl", hash = "sha256:d0e589ae0d55204991450bb5c23f571c64fe43adaa53f93fc902a84c96f52fe1", size = 189550, upload_time = "2025-04-15T17:44:37.092Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/b0/e6/6000d0094e8a5e32ad62591c8609e269febb6e4db83a1c75ff8868b42731/contourpy-1.3.2-cp313-cp313t-win_amd64.whl", hash = "sha256:78e9253c3de756b3f6a5174d024c4835acd59eb3f8e2ca13e775dbffe1558f69", size = 238214, upload_time = "2025-04-15T17:44:40.827Z" }, | ||||
| ] | ||||
| 
 | ||||
| [[package]] | ||||
| name = "cycler" | ||||
| version = "0.12.1" | ||||
| source = { registry = "https://pypi.org/simple" } | ||||
| sdist = { url = "https://files.pythonhosted.org/packages/a9/95/a3dbbb5028f35eafb79008e7522a75244477d2838f38cbb722248dabc2a8/cycler-0.12.1.tar.gz", hash = "sha256:88bb128f02ba341da8ef447245a9e138fae777f6a23943da4540077d3601eb1c", size = 7615, upload_time = "2023-10-07T05:32:18.335Z" } | ||||
| wheels = [ | ||||
|     { url = "https://files.pythonhosted.org/packages/e7/05/c19819d5e3d95294a6f5947fb9b9629efb316b96de511b418c53d245aae6/cycler-0.12.1-py3-none-any.whl", hash = "sha256:85cef7cff222d8644161529808465972e51340599459b8ac3ccbac5a854e0d30", size = 8321, upload_time = "2023-10-07T05:32:16.783Z" }, | ||||
| ] | ||||
| 
 | ||||
| [[package]] | ||||
| name = "debugpy" | ||||
| version = "1.8.14" | ||||
|  | @ -372,6 +421,23 @@ wheels = [ | |||
|     { url = "https://files.pythonhosted.org/packages/4d/36/2a115987e2d8c300a974597416d9de88f2444426de9571f4b59b2cca3acc/filelock-3.18.0-py3-none-any.whl", hash = "sha256:c401f4f8377c4464e6db25fff06205fd89bdd83b65eb0488ed1b160f780e21de", size = 16215, upload_time = "2025-03-14T07:11:39.145Z" }, | ||||
| ] | ||||
| 
 | ||||
| [[package]] | ||||
| name = "fonttools" | ||||
| version = "4.58.4" | ||||
| source = { registry = "https://pypi.org/simple" } | ||||
| sdist = { url = "https://files.pythonhosted.org/packages/2e/5a/1124b2c8cb3a8015faf552e92714040bcdbc145dfa29928891b02d147a18/fonttools-4.58.4.tar.gz", hash = "sha256:928a8009b9884ed3aae17724b960987575155ca23c6f0b8146e400cc9e0d44ba", size = 3525026, upload_time = "2025-06-13T17:25:15.426Z" } | ||||
| wheels = [ | ||||
|     { url = "https://files.pythonhosted.org/packages/d4/4f/c05cab5fc1a4293e6bc535c6cb272607155a0517700f5418a4165b7f9ec8/fonttools-4.58.4-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:5f4a64846495c543796fa59b90b7a7a9dff6839bd852741ab35a71994d685c6d", size = 2745197, upload_time = "2025-06-13T17:24:40.645Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/3e/d3/49211b1f96ae49308f4f78ca7664742377a6867f00f704cdb31b57e4b432/fonttools-4.58.4-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:e80661793a5d4d7ad132a2aa1eae2e160fbdbb50831a0edf37c7c63b2ed36574", size = 2317272, upload_time = "2025-06-13T17:24:43.428Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/b2/11/c9972e46a6abd752a40a46960e431c795ad1f306775fc1f9e8c3081a1274/fonttools-4.58.4-cp313-cp313-manylinux1_x86_64.manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:fe5807fc64e4ba5130f1974c045a6e8d795f3b7fb6debfa511d1773290dbb76b", size = 4877184, upload_time = "2025-06-13T17:24:45.527Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/ea/24/5017c01c9ef8df572cc9eaf9f12be83ad8ed722ff6dc67991d3d752956e4/fonttools-4.58.4-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:b610b9bef841cb8f4b50472494158b1e347d15cad56eac414c722eda695a6cfd", size = 4939445, upload_time = "2025-06-13T17:24:47.647Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/79/b0/538cc4d0284b5a8826b4abed93a69db52e358525d4b55c47c8cef3669767/fonttools-4.58.4-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:2daa7f0e213c38f05f054eb5e1730bd0424aebddbeac094489ea1585807dd187", size = 4878800, upload_time = "2025-06-13T17:24:49.766Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/5a/9b/a891446b7a8250e65bffceb248508587958a94db467ffd33972723ab86c9/fonttools-4.58.4-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:66cccb6c0b944496b7f26450e9a66e997739c513ffaac728d24930df2fd9d35b", size = 5021259, upload_time = "2025-06-13T17:24:51.754Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/17/b2/c4d2872cff3ace3ddd1388bf15b76a1d8d5313f0a61f234e9aed287e674d/fonttools-4.58.4-cp313-cp313-win32.whl", hash = "sha256:94d2aebb5ca59a5107825520fde596e344652c1f18170ef01dacbe48fa60c889", size = 2185824, upload_time = "2025-06-13T17:24:54.324Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/98/57/cddf8bcc911d4f47dfca1956c1e3aeeb9f7c9b8e88b2a312fe8c22714e0b/fonttools-4.58.4-cp313-cp313-win_amd64.whl", hash = "sha256:b554bd6e80bba582fd326ddab296e563c20c64dca816d5e30489760e0c41529f", size = 2236382, upload_time = "2025-06-13T17:24:56.291Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/0b/2f/c536b5b9bb3c071e91d536a4d11f969e911dbb6b227939f4c5b0bca090df/fonttools-4.58.4-py3-none-any.whl", hash = "sha256:a10ce13a13f26cbb9f37512a4346bb437ad7e002ff6fa966a7ce7ff5ac3528bd", size = 1114660, upload_time = "2025-06-13T17:25:13.321Z" }, | ||||
| ] | ||||
| 
 | ||||
| [[package]] | ||||
| name = "fqdn" | ||||
| version = "1.5.1" | ||||
|  | @ -433,6 +499,15 @@ wheels = [ | |||
|     { url = "https://files.pythonhosted.org/packages/44/4b/e0cfc1a6f17e990f3e64b7d941ddc4acdc7b19d6edd51abf495f32b1a9e4/fsspec-2025.3.2-py3-none-any.whl", hash = "sha256:2daf8dc3d1dfa65b6aa37748d112773a7a08416f6c70d96b264c96476ecaf711", size = 194435, upload_time = "2025-03-31T15:27:07.028Z" }, | ||||
| ] | ||||
| 
 | ||||
| [[package]] | ||||
| name = "graphviz" | ||||
| version = "0.20.3" | ||||
| source = { registry = "https://pypi.org/simple" } | ||||
| sdist = { url = "https://files.pythonhosted.org/packages/fa/83/5a40d19b8347f017e417710907f824915fba411a9befd092e52746b63e9f/graphviz-0.20.3.zip", hash = "sha256:09d6bc81e6a9fa392e7ba52135a9d49f1ed62526f96499325930e87ca1b5925d", size = 256455, upload_time = "2024-03-21T07:50:45.772Z" } | ||||
| wheels = [ | ||||
|     { url = "https://files.pythonhosted.org/packages/00/be/d59db2d1d52697c6adc9eacaf50e8965b6345cc143f671e1ed068818d5cf/graphviz-0.20.3-py3-none-any.whl", hash = "sha256:81f848f2904515d8cd359cc611faba817598d2feaac4027b266aa3eda7b3dde5", size = 47126, upload_time = "2024-03-21T07:50:43.091Z" }, | ||||
| ] | ||||
| 
 | ||||
| [[package]] | ||||
| name = "h11" | ||||
| version = "0.16.0" | ||||
|  | @ -650,6 +725,15 @@ wheels = [ | |||
|     { url = "https://files.pythonhosted.org/packages/41/9f/3500910d5a98549e3098807493851eeef2b89cdd3032227558a104dfe926/json5-0.12.0-py3-none-any.whl", hash = "sha256:6d37aa6c08b0609f16e1ec5ff94697e2cbbfbad5ac112afa05794da9ab7810db", size = 36079, upload_time = "2025-04-03T16:33:11.927Z" }, | ||||
| ] | ||||
| 
 | ||||
| [[package]] | ||||
| name = "jsonpickle" | ||||
| version = "4.1.0" | ||||
| source = { registry = "https://pypi.org/simple" } | ||||
| sdist = { url = "https://files.pythonhosted.org/packages/03/4f/1dde1e344dc41c40bc3f0eb721d7ddc5fed827bf518ba410c369f6bbaa07/jsonpickle-4.1.0.tar.gz", hash = "sha256:d417d6d693a63fb137e53334164aba618d18aca05a4fd025ff01c2ec134ae4c8", size = 318466, upload_time = "2025-05-21T19:40:19.02Z" } | ||||
| wheels = [ | ||||
|     { url = "https://files.pythonhosted.org/packages/0e/b5/8d90bb4951a1e821d0b4e559edb3c70918b8954b566c7eb9211846a48c47/jsonpickle-4.1.0-py3-none-any.whl", hash = "sha256:763f837a0b2586b45424d9a07108a798d9feac52f3a152606336f7f9e1a22ffa", size = 46615, upload_time = "2025-05-21T19:40:13.344Z" }, | ||||
| ] | ||||
| 
 | ||||
| [[package]] | ||||
| name = "jsonpointer" | ||||
| version = "3.0.0" | ||||
|  | @ -898,6 +982,42 @@ wheels = [ | |||
|     { url = "https://files.pythonhosted.org/packages/64/7a/f2479ba401e02f7fcbd3fc6af201eac888eaa188574b8e9df19452ab4972/jupyterlab_widgets-3.0.14-py3-none-any.whl", hash = "sha256:54c33e3306b7fca139d165d6190dc6c0627aafa5d14adfc974a4e9a3d26cb703", size = 213999, upload_time = "2025-04-10T13:00:38.626Z" }, | ||||
| ] | ||||
| 
 | ||||
| [[package]] | ||||
| name = "kiwisolver" | ||||
| version = "1.4.8" | ||||
| source = { registry = "https://pypi.org/simple" } | ||||
| sdist = { url = "https://files.pythonhosted.org/packages/82/59/7c91426a8ac292e1cdd53a63b6d9439abd573c875c3f92c146767dd33faf/kiwisolver-1.4.8.tar.gz", hash = "sha256:23d5f023bdc8c7e54eb65f03ca5d5bb25b601eac4d7f1a042888a1f45237987e", size = 97538, upload_time = "2024-12-24T18:30:51.519Z" } | ||||
| wheels = [ | ||||
|     { url = "https://files.pythonhosted.org/packages/79/b3/e62464a652f4f8cd9006e13d07abad844a47df1e6537f73ddfbf1bc997ec/kiwisolver-1.4.8-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:1c8ceb754339793c24aee1c9fb2485b5b1f5bb1c2c214ff13368431e51fc9a09", size = 124156, upload_time = "2024-12-24T18:29:45.368Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/8d/2d/f13d06998b546a2ad4f48607a146e045bbe48030774de29f90bdc573df15/kiwisolver-1.4.8-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:54a62808ac74b5e55a04a408cda6156f986cefbcf0ada13572696b507cc92fa1", size = 66555, upload_time = "2024-12-24T18:29:46.37Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/59/e3/b8bd14b0a54998a9fd1e8da591c60998dc003618cb19a3f94cb233ec1511/kiwisolver-1.4.8-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:68269e60ee4929893aad82666821aaacbd455284124817af45c11e50a4b42e3c", size = 65071, upload_time = "2024-12-24T18:29:47.333Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/f0/1c/6c86f6d85ffe4d0ce04228d976f00674f1df5dc893bf2dd4f1928748f187/kiwisolver-1.4.8-cp313-cp313-manylinux_2_12_i686.manylinux2010_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:34d142fba9c464bc3bbfeff15c96eab0e7310343d6aefb62a79d51421fcc5f1b", size = 1378053, upload_time = "2024-12-24T18:29:49.636Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/4e/b9/1c6e9f6dcb103ac5cf87cb695845f5fa71379021500153566d8a8a9fc291/kiwisolver-1.4.8-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:3ddc373e0eef45b59197de815b1b28ef89ae3955e7722cc9710fb91cd77b7f47", size = 1472278, upload_time = "2024-12-24T18:29:51.164Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/ee/81/aca1eb176de671f8bda479b11acdc42c132b61a2ac861c883907dde6debb/kiwisolver-1.4.8-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:77e6f57a20b9bd4e1e2cedda4d0b986ebd0216236f0106e55c28aea3d3d69b16", size = 1478139, upload_time = "2024-12-24T18:29:52.594Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/49/f4/e081522473671c97b2687d380e9e4c26f748a86363ce5af48b4a28e48d06/kiwisolver-1.4.8-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:08e77738ed7538f036cd1170cbed942ef749137b1311fa2bbe2a7fda2f6bf3cc", size = 1413517, upload_time = "2024-12-24T18:29:53.941Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/8f/e9/6a7d025d8da8c4931522922cd706105aa32b3291d1add8c5427cdcd66e63/kiwisolver-1.4.8-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:a5ce1e481a74b44dd5e92ff03ea0cb371ae7a0268318e202be06c8f04f4f1246", size = 1474952, upload_time = "2024-12-24T18:29:56.523Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/82/13/13fa685ae167bee5d94b415991c4fc7bb0a1b6ebea6e753a87044b209678/kiwisolver-1.4.8-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:fc2ace710ba7c1dfd1a3b42530b62b9ceed115f19a1656adefce7b1782a37794", size = 2269132, upload_time = "2024-12-24T18:29:57.989Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/ef/92/bb7c9395489b99a6cb41d502d3686bac692586db2045adc19e45ee64ed23/kiwisolver-1.4.8-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:3452046c37c7692bd52b0e752b87954ef86ee2224e624ef7ce6cb21e8c41cc1b", size = 2425997, upload_time = "2024-12-24T18:29:59.393Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/ed/12/87f0e9271e2b63d35d0d8524954145837dd1a6c15b62a2d8c1ebe0f182b4/kiwisolver-1.4.8-cp313-cp313-musllinux_1_2_ppc64le.whl", hash = "sha256:7e9a60b50fe8b2ec6f448fe8d81b07e40141bfced7f896309df271a0b92f80f3", size = 2376060, upload_time = "2024-12-24T18:30:01.338Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/02/6e/c8af39288edbce8bf0fa35dee427b082758a4b71e9c91ef18fa667782138/kiwisolver-1.4.8-cp313-cp313-musllinux_1_2_s390x.whl", hash = "sha256:918139571133f366e8362fa4a297aeba86c7816b7ecf0bc79168080e2bd79957", size = 2520471, upload_time = "2024-12-24T18:30:04.574Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/13/78/df381bc7b26e535c91469f77f16adcd073beb3e2dd25042efd064af82323/kiwisolver-1.4.8-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:e063ef9f89885a1d68dd8b2e18f5ead48653176d10a0e324e3b0030e3a69adeb", size = 2338793, upload_time = "2024-12-24T18:30:06.25Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/d0/dc/c1abe38c37c071d0fc71c9a474fd0b9ede05d42f5a458d584619cfd2371a/kiwisolver-1.4.8-cp313-cp313-win_amd64.whl", hash = "sha256:a17b7c4f5b2c51bb68ed379defd608a03954a1845dfed7cc0117f1cc8a9b7fd2", size = 71855, upload_time = "2024-12-24T18:30:07.535Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/a0/b6/21529d595b126ac298fdd90b705d87d4c5693de60023e0efcb4f387ed99e/kiwisolver-1.4.8-cp313-cp313-win_arm64.whl", hash = "sha256:3cd3bc628b25f74aedc6d374d5babf0166a92ff1317f46267f12d2ed54bc1d30", size = 65430, upload_time = "2024-12-24T18:30:08.504Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/34/bd/b89380b7298e3af9b39f49334e3e2a4af0e04819789f04b43d560516c0c8/kiwisolver-1.4.8-cp313-cp313t-macosx_10_13_universal2.whl", hash = "sha256:370fd2df41660ed4e26b8c9d6bbcad668fbe2560462cba151a721d49e5b6628c", size = 126294, upload_time = "2024-12-24T18:30:09.508Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/83/41/5857dc72e5e4148eaac5aa76e0703e594e4465f8ab7ec0fc60e3a9bb8fea/kiwisolver-1.4.8-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:84a2f830d42707de1d191b9490ac186bf7997a9495d4e9072210a1296345f7dc", size = 67736, upload_time = "2024-12-24T18:30:11.039Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/e1/d1/be059b8db56ac270489fb0b3297fd1e53d195ba76e9bbb30e5401fa6b759/kiwisolver-1.4.8-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:7a3ad337add5148cf51ce0b55642dc551c0b9d6248458a757f98796ca7348712", size = 66194, upload_time = "2024-12-24T18:30:14.886Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/e1/83/4b73975f149819eb7dcf9299ed467eba068ecb16439a98990dcb12e63fdd/kiwisolver-1.4.8-cp313-cp313t-manylinux_2_12_i686.manylinux2010_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:7506488470f41169b86d8c9aeff587293f530a23a23a49d6bc64dab66bedc71e", size = 1465942, upload_time = "2024-12-24T18:30:18.927Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/c7/2c/30a5cdde5102958e602c07466bce058b9d7cb48734aa7a4327261ac8e002/kiwisolver-1.4.8-cp313-cp313t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:2f0121b07b356a22fb0414cec4666bbe36fd6d0d759db3d37228f496ed67c880", size = 1595341, upload_time = "2024-12-24T18:30:22.102Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/ff/9b/1e71db1c000385aa069704f5990574b8244cce854ecd83119c19e83c9586/kiwisolver-1.4.8-cp313-cp313t-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:d6d6bd87df62c27d4185de7c511c6248040afae67028a8a22012b010bc7ad062", size = 1598455, upload_time = "2024-12-24T18:30:24.947Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/85/92/c8fec52ddf06231b31cbb779af77e99b8253cd96bd135250b9498144c78b/kiwisolver-1.4.8-cp313-cp313t-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:291331973c64bb9cce50bbe871fb2e675c4331dab4f31abe89f175ad7679a4d7", size = 1522138, upload_time = "2024-12-24T18:30:26.286Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/0b/51/9eb7e2cd07a15d8bdd976f6190c0164f92ce1904e5c0c79198c4972926b7/kiwisolver-1.4.8-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:893f5525bb92d3d735878ec00f781b2de998333659507d29ea4466208df37bed", size = 1582857, upload_time = "2024-12-24T18:30:28.86Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/0f/95/c5a00387a5405e68ba32cc64af65ce881a39b98d73cc394b24143bebc5b8/kiwisolver-1.4.8-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:b47a465040146981dc9db8647981b8cb96366fbc8d452b031e4f8fdffec3f26d", size = 2293129, upload_time = "2024-12-24T18:30:30.34Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/44/83/eeb7af7d706b8347548313fa3a3a15931f404533cc54fe01f39e830dd231/kiwisolver-1.4.8-cp313-cp313t-musllinux_1_2_i686.whl", hash = "sha256:99cea8b9dd34ff80c521aef46a1dddb0dcc0283cf18bde6d756f1e6f31772165", size = 2421538, upload_time = "2024-12-24T18:30:33.334Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/05/f9/27e94c1b3eb29e6933b6986ffc5fa1177d2cd1f0c8efc5f02c91c9ac61de/kiwisolver-1.4.8-cp313-cp313t-musllinux_1_2_ppc64le.whl", hash = "sha256:151dffc4865e5fe6dafce5480fab84f950d14566c480c08a53c663a0020504b6", size = 2390661, upload_time = "2024-12-24T18:30:34.939Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/d9/d4/3c9735faa36ac591a4afcc2980d2691000506050b7a7e80bcfe44048daa7/kiwisolver-1.4.8-cp313-cp313t-musllinux_1_2_s390x.whl", hash = "sha256:577facaa411c10421314598b50413aa1ebcf5126f704f1e5d72d7e4e9f020d90", size = 2546710, upload_time = "2024-12-24T18:30:37.281Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/4c/fa/be89a49c640930180657482a74970cdcf6f7072c8d2471e1babe17a222dc/kiwisolver-1.4.8-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:be4816dc51c8a471749d664161b434912eee82f2ea66bd7628bd14583a833e85", size = 2349213, upload_time = "2024-12-24T18:30:40.019Z" }, | ||||
| ] | ||||
| 
 | ||||
| [[package]] | ||||
| name = "litellm" | ||||
| version = "1.67.0" | ||||
|  | @ -948,6 +1068,37 @@ wheels = [ | |||
|     { url = "https://files.pythonhosted.org/packages/4f/65/6079a46068dfceaeabb5dcad6d674f5f5c61a6fa5673746f42a9f4c233b3/MarkupSafe-3.0.2-cp313-cp313t-win_amd64.whl", hash = "sha256:e444a31f8db13eb18ada366ab3cf45fd4b31e4db1236a4448f68778c1d1a5a2f", size = 15739, upload_time = "2024-10-18T15:21:42.784Z" }, | ||||
| ] | ||||
| 
 | ||||
| [[package]] | ||||
| name = "matplotlib" | ||||
| version = "3.10.3" | ||||
| source = { registry = "https://pypi.org/simple" } | ||||
| dependencies = [ | ||||
|     { name = "contourpy" }, | ||||
|     { name = "cycler" }, | ||||
|     { name = "fonttools" }, | ||||
|     { name = "kiwisolver" }, | ||||
|     { name = "numpy" }, | ||||
|     { name = "packaging" }, | ||||
|     { name = "pillow" }, | ||||
|     { name = "pyparsing" }, | ||||
|     { name = "python-dateutil" }, | ||||
| ] | ||||
| sdist = { url = "https://files.pythonhosted.org/packages/26/91/d49359a21893183ed2a5b6c76bec40e0b1dcbf8ca148f864d134897cfc75/matplotlib-3.10.3.tar.gz", hash = "sha256:2f82d2c5bb7ae93aaaa4cd42aca65d76ce6376f83304fa3a630b569aca274df0", size = 34799811, upload_time = "2025-05-08T19:10:54.39Z" } | ||||
| wheels = [ | ||||
|     { url = "https://files.pythonhosted.org/packages/3b/c1/23cfb566a74c696a3b338d8955c549900d18fe2b898b6e94d682ca21e7c2/matplotlib-3.10.3-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:9f2efccc8dcf2b86fc4ee849eea5dcaecedd0773b30f47980dc0cbeabf26ec84", size = 8180318, upload_time = "2025-05-08T19:10:20.426Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/6c/0c/02f1c3b66b30da9ee343c343acbb6251bef5b01d34fad732446eaadcd108/matplotlib-3.10.3-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:3ddbba06a6c126e3301c3d272a99dcbe7f6c24c14024e80307ff03791a5f294e", size = 8051132, upload_time = "2025-05-08T19:10:22.569Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/b4/ab/8db1a5ac9b3a7352fb914133001dae889f9fcecb3146541be46bed41339c/matplotlib-3.10.3-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:748302b33ae9326995b238f606e9ed840bf5886ebafcb233775d946aa8107a15", size = 8457633, upload_time = "2025-05-08T19:10:24.749Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/f5/64/41c4367bcaecbc03ef0d2a3ecee58a7065d0a36ae1aa817fe573a2da66d4/matplotlib-3.10.3-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:a80fcccbef63302c0efd78042ea3c2436104c5b1a4d3ae20f864593696364ac7", size = 8601031, upload_time = "2025-05-08T19:10:27.03Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/12/6f/6cc79e9e5ab89d13ed64da28898e40fe5b105a9ab9c98f83abd24e46d7d7/matplotlib-3.10.3-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:55e46cbfe1f8586adb34f7587c3e4f7dedc59d5226719faf6cb54fc24f2fd52d", size = 9406988, upload_time = "2025-05-08T19:10:29.056Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/b1/0f/eed564407bd4d935ffabf561ed31099ed609e19287409a27b6d336848653/matplotlib-3.10.3-cp313-cp313-win_amd64.whl", hash = "sha256:151d89cb8d33cb23345cd12490c76fd5d18a56581a16d950b48c6ff19bb2ab93", size = 8068034, upload_time = "2025-05-08T19:10:31.221Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/3e/e5/2f14791ff69b12b09e9975e1d116d9578ac684460860ce542c2588cb7a1c/matplotlib-3.10.3-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:c26dd9834e74d164d06433dc7be5d75a1e9890b926b3e57e74fa446e1a62c3e2", size = 8218223, upload_time = "2025-05-08T19:10:33.114Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/5c/08/30a94afd828b6e02d0a52cae4a29d6e9ccfcf4c8b56cc28b021d3588873e/matplotlib-3.10.3-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:24853dad5b8c84c8c2390fc31ce4858b6df504156893292ce8092d190ef8151d", size = 8094985, upload_time = "2025-05-08T19:10:35.337Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/89/44/f3bc6b53066c889d7a1a3ea8094c13af6a667c5ca6220ec60ecceec2dabe/matplotlib-3.10.3-cp313-cp313t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:68f7878214d369d7d4215e2a9075fef743be38fa401d32e6020bab2dfabaa566", size = 8483109, upload_time = "2025-05-08T19:10:37.611Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/ba/c7/473bc559beec08ebee9f86ca77a844b65747e1a6c2691e8c92e40b9f42a8/matplotlib-3.10.3-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:f6929fc618cb6db9cb75086f73b3219bbb25920cb24cee2ea7a12b04971a4158", size = 8618082, upload_time = "2025-05-08T19:10:39.892Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/d8/e9/6ce8edd264c8819e37bbed8172e0ccdc7107fe86999b76ab5752276357a4/matplotlib-3.10.3-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:6c7818292a5cc372a2dc4c795e5c356942eb8350b98ef913f7fda51fe175ac5d", size = 9413699, upload_time = "2025-05-08T19:10:42.376Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/1b/92/9a45c91089c3cf690b5badd4be81e392ff086ccca8a1d4e3a08463d8a966/matplotlib-3.10.3-cp313-cp313t-win_amd64.whl", hash = "sha256:4f23ffe95c5667ef8a2b56eea9b53db7f43910fa4a2d5472ae0f72b64deab4d5", size = 8139044, upload_time = "2025-05-08T19:10:44.551Z" }, | ||||
| ] | ||||
| 
 | ||||
| [[package]] | ||||
| name = "matplotlib-inline" | ||||
| version = "0.1.7" | ||||
|  | @ -1076,6 +1227,15 @@ wheels = [ | |||
|     { url = "https://files.pythonhosted.org/packages/a0/c4/c2971a3ba4c6103a3d10c4b0f24f461ddc027f0f09763220cf35ca1401b3/nest_asyncio-1.6.0-py3-none-any.whl", hash = "sha256:87af6efd6b5e897c81050477ef65c62e2b2f35d51703cae01aff2905b1852e1c", size = 5195, upload_time = "2024-01-21T14:25:17.223Z" }, | ||||
| ] | ||||
| 
 | ||||
| [[package]] | ||||
| name = "networkx" | ||||
| version = "3.4.2" | ||||
| source = { registry = "https://pypi.org/simple" } | ||||
| sdist = { url = "https://files.pythonhosted.org/packages/fd/1d/06475e1cd5264c0b870ea2cc6fdb3e37177c1e565c43f56ff17a10e3937f/networkx-3.4.2.tar.gz", hash = "sha256:307c3669428c5362aab27c8a1260aa8f47c4e91d3891f48be0141738d8d053e1", size = 2151368, upload_time = "2024-10-21T12:39:38.695Z" } | ||||
| wheels = [ | ||||
|     { url = "https://files.pythonhosted.org/packages/b9/54/dd730b32ea14ea797530a4479b2ed46a6fb250f682a9cfb997e968bf0261/networkx-3.4.2-py3-none-any.whl", hash = "sha256:df5d4365b724cf81b8c6a7312509d0c22386097011ad1abe274afd5e9d3bbc5f", size = 1723263, upload_time = "2024-10-21T12:39:36.247Z" }, | ||||
| ] | ||||
| 
 | ||||
| [[package]] | ||||
| name = "notebook" | ||||
| version = "7.4.1" | ||||
|  | @ -1174,11 +1334,11 @@ wheels = [ | |||
| 
 | ||||
| [[package]] | ||||
| name = "packaging" | ||||
| version = "25.0" | ||||
| version = "24.2" | ||||
| source = { registry = "https://pypi.org/simple" } | ||||
| sdist = { url = "https://files.pythonhosted.org/packages/a1/d4/1fc4078c65507b51b96ca8f8c3ba19e6a61c8253c72794544580a7b6c24d/packaging-25.0.tar.gz", hash = "sha256:d443872c98d677bf60f6a1f2f8c1cb748e8fe762d2bf9d3148b5599295b0fc4f", size = 165727, upload_time = "2025-04-19T11:48:59.673Z" } | ||||
| sdist = { url = "https://files.pythonhosted.org/packages/d0/63/68dbb6eb2de9cb10ee4c9c14a0148804425e13c4fb20d61cce69f53106da/packaging-24.2.tar.gz", hash = "sha256:c228a6dc5e932d346bc5739379109d49e8853dd8223571c7c5b55260edc0b97f", size = 163950, upload_time = "2024-11-08T09:47:47.202Z" } | ||||
| wheels = [ | ||||
|     { url = "https://files.pythonhosted.org/packages/20/12/38679034af332785aac8774540895e234f4d07f7545804097de4b666afd8/packaging-25.0-py3-none-any.whl", hash = "sha256:29572ef2b1f17581046b3a2227d5c611fb25ec70ca1ba8554b24b0e69331a484", size = 66469, upload_time = "2025-04-19T11:48:57.875Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/88/ef/eb23f262cca3c0c4eb7ab1933c3b1f03d021f2c48f54763065b6f0e321be/packaging-24.2-py3-none-any.whl", hash = "sha256:09abb1bccd265c01f4a3aa3f7a7db064b36514d2cba19a2f694fe6150451a759", size = 65451, upload_time = "2024-11-08T09:47:44.722Z" }, | ||||
| ] | ||||
| 
 | ||||
| [[package]] | ||||
|  | @ -1238,6 +1398,36 @@ wheels = [ | |||
|     { url = "https://files.pythonhosted.org/packages/9e/c3/059298687310d527a58bb01f3b1965787ee3b40dce76752eda8b44e9a2c5/pexpect-4.9.0-py2.py3-none-any.whl", hash = "sha256:7236d1e080e4936be2dc3e326cec0af72acf9212a7e1d060210e70a47e253523", size = 63772, upload_time = "2023-11-25T06:56:14.81Z" }, | ||||
| ] | ||||
| 
 | ||||
| [[package]] | ||||
| name = "pillow" | ||||
| version = "11.2.1" | ||||
| source = { registry = "https://pypi.org/simple" } | ||||
| sdist = { url = "https://files.pythonhosted.org/packages/af/cb/bb5c01fcd2a69335b86c22142b2bccfc3464087efb7fd382eee5ffc7fdf7/pillow-11.2.1.tar.gz", hash = "sha256:a64dd61998416367b7ef979b73d3a85853ba9bec4c2925f74e588879a58716b6", size = 47026707, upload_time = "2025-04-12T17:50:03.289Z" } | ||||
| wheels = [ | ||||
|     { url = "https://files.pythonhosted.org/packages/36/9c/447528ee3776e7ab8897fe33697a7ff3f0475bb490c5ac1456a03dc57956/pillow-11.2.1-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:fdec757fea0b793056419bca3e9932eb2b0ceec90ef4813ea4c1e072c389eb28", size = 3190098, upload_time = "2025-04-12T17:48:23.915Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/b5/09/29d5cd052f7566a63e5b506fac9c60526e9ecc553825551333e1e18a4858/pillow-11.2.1-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:b0e130705d568e2f43a17bcbe74d90958e8a16263868a12c3e0d9c8162690830", size = 3030166, upload_time = "2025-04-12T17:48:25.738Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/71/5d/446ee132ad35e7600652133f9c2840b4799bbd8e4adba881284860da0a36/pillow-11.2.1-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:7bdb5e09068332578214cadd9c05e3d64d99e0e87591be22a324bdbc18925be0", size = 4408674, upload_time = "2025-04-12T17:48:27.908Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/69/5f/cbe509c0ddf91cc3a03bbacf40e5c2339c4912d16458fcb797bb47bcb269/pillow-11.2.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:d189ba1bebfbc0c0e529159631ec72bb9e9bc041f01ec6d3233d6d82eb823bc1", size = 4496005, upload_time = "2025-04-12T17:48:29.888Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/f9/b3/dd4338d8fb8a5f312021f2977fb8198a1184893f9b00b02b75d565c33b51/pillow-11.2.1-cp313-cp313-manylinux_2_28_aarch64.whl", hash = "sha256:191955c55d8a712fab8934a42bfefbf99dd0b5875078240943f913bb66d46d9f", size = 4518707, upload_time = "2025-04-12T17:48:31.874Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/13/eb/2552ecebc0b887f539111c2cd241f538b8ff5891b8903dfe672e997529be/pillow-11.2.1-cp313-cp313-manylinux_2_28_x86_64.whl", hash = "sha256:ad275964d52e2243430472fc5d2c2334b4fc3ff9c16cb0a19254e25efa03a155", size = 4610008, upload_time = "2025-04-12T17:48:34.422Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/72/d1/924ce51bea494cb6e7959522d69d7b1c7e74f6821d84c63c3dc430cbbf3b/pillow-11.2.1-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:750f96efe0597382660d8b53e90dd1dd44568a8edb51cb7f9d5d918b80d4de14", size = 4585420, upload_time = "2025-04-12T17:48:37.641Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/43/ab/8f81312d255d713b99ca37479a4cb4b0f48195e530cdc1611990eb8fd04b/pillow-11.2.1-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:fe15238d3798788d00716637b3d4e7bb6bde18b26e5d08335a96e88564a36b6b", size = 4667655, upload_time = "2025-04-12T17:48:39.652Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/94/86/8f2e9d2dc3d308dfd137a07fe1cc478df0a23d42a6c4093b087e738e4827/pillow-11.2.1-cp313-cp313-win32.whl", hash = "sha256:3fe735ced9a607fee4f481423a9c36701a39719252a9bb251679635f99d0f7d2", size = 2332329, upload_time = "2025-04-12T17:48:41.765Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/6d/ec/1179083b8d6067a613e4d595359b5fdea65d0a3b7ad623fee906e1b3c4d2/pillow-11.2.1-cp313-cp313-win_amd64.whl", hash = "sha256:74ee3d7ecb3f3c05459ba95eed5efa28d6092d751ce9bf20e3e253a4e497e691", size = 2676388, upload_time = "2025-04-12T17:48:43.625Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/23/f1/2fc1e1e294de897df39fa8622d829b8828ddad938b0eaea256d65b84dd72/pillow-11.2.1-cp313-cp313-win_arm64.whl", hash = "sha256:5119225c622403afb4b44bad4c1ca6c1f98eed79db8d3bc6e4e160fc6339d66c", size = 2414950, upload_time = "2025-04-12T17:48:45.475Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/c4/3e/c328c48b3f0ead7bab765a84b4977acb29f101d10e4ef57a5e3400447c03/pillow-11.2.1-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:8ce2e8411c7aaef53e6bb29fe98f28cd4fbd9a1d9be2eeea434331aac0536b22", size = 3192759, upload_time = "2025-04-12T17:48:47.866Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/18/0e/1c68532d833fc8b9f404d3a642991441d9058eccd5606eab31617f29b6d4/pillow-11.2.1-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:9ee66787e095127116d91dea2143db65c7bb1e232f617aa5957c0d9d2a3f23a7", size = 3033284, upload_time = "2025-04-12T17:48:50.189Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/b7/cb/6faf3fb1e7705fd2db74e070f3bf6f88693601b0ed8e81049a8266de4754/pillow-11.2.1-cp313-cp313t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:9622e3b6c1d8b551b6e6f21873bdcc55762b4b2126633014cea1803368a9aa16", size = 4445826, upload_time = "2025-04-12T17:48:52.346Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/07/94/8be03d50b70ca47fb434a358919d6a8d6580f282bbb7af7e4aa40103461d/pillow-11.2.1-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:63b5dff3a68f371ea06025a1a6966c9a1e1ee452fc8020c2cd0ea41b83e9037b", size = 4527329, upload_time = "2025-04-12T17:48:54.403Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/fd/a4/bfe78777076dc405e3bd2080bc32da5ab3945b5a25dc5d8acaa9de64a162/pillow-11.2.1-cp313-cp313t-manylinux_2_28_aarch64.whl", hash = "sha256:31df6e2d3d8fc99f993fd253e97fae451a8db2e7207acf97859732273e108406", size = 4549049, upload_time = "2025-04-12T17:48:56.383Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/65/4d/eaf9068dc687c24979e977ce5677e253624bd8b616b286f543f0c1b91662/pillow-11.2.1-cp313-cp313t-manylinux_2_28_x86_64.whl", hash = "sha256:062b7a42d672c45a70fa1f8b43d1d38ff76b63421cbbe7f88146b39e8a558d91", size = 4635408, upload_time = "2025-04-12T17:48:58.782Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/1d/26/0fd443365d9c63bc79feb219f97d935cd4b93af28353cba78d8e77b61719/pillow-11.2.1-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:4eb92eca2711ef8be42fd3f67533765d9fd043b8c80db204f16c8ea62ee1a751", size = 4614863, upload_time = "2025-04-12T17:49:00.709Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/49/65/dca4d2506be482c2c6641cacdba5c602bc76d8ceb618fd37de855653a419/pillow-11.2.1-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:f91ebf30830a48c825590aede79376cb40f110b387c17ee9bd59932c961044f9", size = 4692938, upload_time = "2025-04-12T17:49:02.946Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/b3/92/1ca0c3f09233bd7decf8f7105a1c4e3162fb9142128c74adad0fb361b7eb/pillow-11.2.1-cp313-cp313t-win32.whl", hash = "sha256:e0b55f27f584ed623221cfe995c912c61606be8513bfa0e07d2c674b4516d9dd", size = 2335774, upload_time = "2025-04-12T17:49:04.889Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/a5/ac/77525347cb43b83ae905ffe257bbe2cc6fd23acb9796639a1f56aa59d191/pillow-11.2.1-cp313-cp313t-win_amd64.whl", hash = "sha256:36d6b82164c39ce5482f649b437382c0fb2395eabc1e2b1702a6deb8ad647d6e", size = 2681895, upload_time = "2025-04-12T17:49:06.635Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/67/32/32dc030cfa91ca0fc52baebbba2e009bb001122a1daa8b6a79ad830b38d3/pillow-11.2.1-cp313-cp313t-win_arm64.whl", hash = "sha256:225c832a13326e34f212d2072982bb1adb210e0cc0b153e688743018c94a2681", size = 2417234, upload_time = "2025-04-12T17:49:08.399Z" }, | ||||
| ] | ||||
| 
 | ||||
| [[package]] | ||||
| name = "platformdirs" | ||||
| version = "4.3.7" | ||||
|  | @ -1342,6 +1532,32 @@ wheels = [ | |||
|     { url = "https://files.pythonhosted.org/packages/8e/37/efad0257dc6e593a18957422533ff0f87ede7c9c6ea010a2177d738fb82f/pure_eval-0.2.3-py3-none-any.whl", hash = "sha256:1db8e35b67b3d218d818ae653e27f06c3aa420901fa7b081ca98cbedc874e0d0", size = 11842, upload_time = "2024-07-21T12:58:20.04Z" }, | ||||
| ] | ||||
| 
 | ||||
| [[package]] | ||||
| name = "pyarrow" | ||||
| version = "20.0.0" | ||||
| source = { registry = "https://pypi.org/simple" } | ||||
| sdist = { url = "https://files.pythonhosted.org/packages/a2/ee/a7810cb9f3d6e9238e61d312076a9859bf3668fd21c69744de9532383912/pyarrow-20.0.0.tar.gz", hash = "sha256:febc4a913592573c8d5805091a6c2b5064c8bd6e002131f01061797d91c783c1", size = 1125187, upload_time = "2025-04-27T12:34:23.264Z" } | ||||
| wheels = [ | ||||
|     { url = "https://files.pythonhosted.org/packages/9b/aa/daa413b81446d20d4dad2944110dcf4cf4f4179ef7f685dd5a6d7570dc8e/pyarrow-20.0.0-cp313-cp313-macosx_12_0_arm64.whl", hash = "sha256:a15532e77b94c61efadde86d10957950392999503b3616b2ffcef7621a002893", size = 30798501, upload_time = "2025-04-27T12:30:48.351Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/ff/75/2303d1caa410925de902d32ac215dc80a7ce7dd8dfe95358c165f2adf107/pyarrow-20.0.0-cp313-cp313-macosx_12_0_x86_64.whl", hash = "sha256:dd43f58037443af715f34f1322c782ec463a3c8a94a85fdb2d987ceb5658e061", size = 32277895, upload_time = "2025-04-27T12:30:55.238Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/92/41/fe18c7c0b38b20811b73d1bdd54b1fccba0dab0e51d2048878042d84afa8/pyarrow-20.0.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:aa0d288143a8585806e3cc7c39566407aab646fb9ece164609dac1cfff45f6ae", size = 41327322, upload_time = "2025-04-27T12:31:05.587Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/da/ab/7dbf3d11db67c72dbf36ae63dcbc9f30b866c153b3a22ef728523943eee6/pyarrow-20.0.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:b6953f0114f8d6f3d905d98e987d0924dabce59c3cda380bdfaa25a6201563b4", size = 42411441, upload_time = "2025-04-27T12:31:15.675Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/90/c3/0c7da7b6dac863af75b64e2f827e4742161128c350bfe7955b426484e226/pyarrow-20.0.0-cp313-cp313-manylinux_2_28_aarch64.whl", hash = "sha256:991f85b48a8a5e839b2128590ce07611fae48a904cae6cab1f089c5955b57eb5", size = 40677027, upload_time = "2025-04-27T12:31:24.631Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/be/27/43a47fa0ff9053ab5203bb3faeec435d43c0d8bfa40179bfd076cdbd4e1c/pyarrow-20.0.0-cp313-cp313-manylinux_2_28_x86_64.whl", hash = "sha256:97c8dc984ed09cb07d618d57d8d4b67a5100a30c3818c2fb0b04599f0da2de7b", size = 42281473, upload_time = "2025-04-27T12:31:31.311Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/bc/0b/d56c63b078876da81bbb9ba695a596eabee9b085555ed12bf6eb3b7cab0e/pyarrow-20.0.0-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:9b71daf534f4745818f96c214dbc1e6124d7daf059167330b610fc69b6f3d3e3", size = 42893897, upload_time = "2025-04-27T12:31:39.406Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/92/ac/7d4bd020ba9145f354012838692d48300c1b8fe5634bfda886abcada67ed/pyarrow-20.0.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:e8b88758f9303fa5a83d6c90e176714b2fd3852e776fc2d7e42a22dd6c2fb368", size = 44543847, upload_time = "2025-04-27T12:31:45.997Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/9d/07/290f4abf9ca702c5df7b47739c1b2c83588641ddfa2cc75e34a301d42e55/pyarrow-20.0.0-cp313-cp313-win_amd64.whl", hash = "sha256:30b3051b7975801c1e1d387e17c588d8ab05ced9b1e14eec57915f79869b5031", size = 25653219, upload_time = "2025-04-27T12:31:54.11Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/95/df/720bb17704b10bd69dde086e1400b8eefb8f58df3f8ac9cff6c425bf57f1/pyarrow-20.0.0-cp313-cp313t-macosx_12_0_arm64.whl", hash = "sha256:ca151afa4f9b7bc45bcc791eb9a89e90a9eb2772767d0b1e5389609c7d03db63", size = 30853957, upload_time = "2025-04-27T12:31:59.215Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/d9/72/0d5f875efc31baef742ba55a00a25213a19ea64d7176e0fe001c5d8b6e9a/pyarrow-20.0.0-cp313-cp313t-macosx_12_0_x86_64.whl", hash = "sha256:4680f01ecd86e0dd63e39eb5cd59ef9ff24a9d166db328679e36c108dc993d4c", size = 32247972, upload_time = "2025-04-27T12:32:05.369Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/d5/bc/e48b4fa544d2eea72f7844180eb77f83f2030b84c8dad860f199f94307ed/pyarrow-20.0.0-cp313-cp313t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:7f4c8534e2ff059765647aa69b75d6543f9fef59e2cd4c6d18015192565d2b70", size = 41256434, upload_time = "2025-04-27T12:32:11.814Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/c3/01/974043a29874aa2cf4f87fb07fd108828fc7362300265a2a64a94965e35b/pyarrow-20.0.0-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:3e1f8a47f4b4ae4c69c4d702cfbdfe4d41e18e5c7ef6f1bb1c50918c1e81c57b", size = 42353648, upload_time = "2025-04-27T12:32:20.766Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/68/95/cc0d3634cde9ca69b0e51cbe830d8915ea32dda2157560dda27ff3b3337b/pyarrow-20.0.0-cp313-cp313t-manylinux_2_28_aarch64.whl", hash = "sha256:a1f60dc14658efaa927f8214734f6a01a806d7690be4b3232ba526836d216122", size = 40619853, upload_time = "2025-04-27T12:32:28.1Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/29/c2/3ad40e07e96a3e74e7ed7cc8285aadfa84eb848a798c98ec0ad009eb6bcc/pyarrow-20.0.0-cp313-cp313t-manylinux_2_28_x86_64.whl", hash = "sha256:204a846dca751428991346976b914d6d2a82ae5b8316a6ed99789ebf976551e6", size = 42241743, upload_time = "2025-04-27T12:32:35.792Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/eb/cb/65fa110b483339add6a9bc7b6373614166b14e20375d4daa73483755f830/pyarrow-20.0.0-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:f3b117b922af5e4c6b9a9115825726cac7d8b1421c37c2b5e24fbacc8930612c", size = 42839441, upload_time = "2025-04-27T12:32:46.64Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/98/7b/f30b1954589243207d7a0fbc9997401044bf9a033eec78f6cb50da3f304a/pyarrow-20.0.0-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:e724a3fd23ae5b9c010e7be857f4405ed5e679db5c93e66204db1a69f733936a", size = 44503279, upload_time = "2025-04-27T12:32:56.503Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/37/40/ad395740cd641869a13bcf60851296c89624662575621968dcfafabaa7f6/pyarrow-20.0.0-cp313-cp313t-win_amd64.whl", hash = "sha256:82f1ee5133bd8f49d31be1299dc07f585136679666b502540db854968576faf9", size = 25944982, upload_time = "2025-04-27T12:33:04.72Z" }, | ||||
| ] | ||||
| 
 | ||||
| [[package]] | ||||
| name = "pycparser" | ||||
| version = "2.22" | ||||
|  | @ -1403,6 +1619,15 @@ wheels = [ | |||
|     { url = "https://files.pythonhosted.org/packages/8a/0b/9fcc47d19c48b59121088dd6da2488a49d5f72dacf8262e2790a1d2c7d15/pygments-2.19.1-py3-none-any.whl", hash = "sha256:9ea1544ad55cecf4b8242fab6dd35a93bbce657034b0611ee383099054ab6d8c", size = 1225293, upload_time = "2025-01-06T17:26:25.553Z" }, | ||||
| ] | ||||
| 
 | ||||
| [[package]] | ||||
| name = "pyparsing" | ||||
| version = "3.2.3" | ||||
| source = { registry = "https://pypi.org/simple" } | ||||
| sdist = { url = "https://files.pythonhosted.org/packages/bb/22/f1129e69d94ffff626bdb5c835506b3a5b4f3d070f17ea295e12c2c6f60f/pyparsing-3.2.3.tar.gz", hash = "sha256:b9c13f1ab8b3b542f72e28f634bad4de758ab3ce4546e4301970ad6fa77c38be", size = 1088608, upload_time = "2025-03-25T05:01:28.114Z" } | ||||
| wheels = [ | ||||
|     { url = "https://files.pythonhosted.org/packages/05/e7/df2285f3d08fee213f2d041540fa4fc9ca6c2d44cf36d3a035bf2a8d2bcc/pyparsing-3.2.3-py3-none-any.whl", hash = "sha256:a749938e02d6fd0b59b356ca504a24982314bb090c383e3cf201c95ef7e2bfcf", size = 111120, upload_time = "2025-03-25T05:01:24.908Z" }, | ||||
| ] | ||||
| 
 | ||||
| [[package]] | ||||
| name = "python-dateutil" | ||||
| version = "2.9.0.post0" | ||||
|  | @ -1442,6 +1667,20 @@ wheels = [ | |||
|     { url = "https://files.pythonhosted.org/packages/81/c4/34e93fe5f5429d7570ec1fa436f1986fb1f00c3e0f43a589fe2bbcd22c3f/pytz-2025.2-py2.py3-none-any.whl", hash = "sha256:5ddf76296dd8c44c26eb8f4b6f35488f3ccbf6fbbd7adee0b7262d43f0ec2f00", size = 509225, upload_time = "2025-03-25T02:24:58.468Z" }, | ||||
| ] | ||||
| 
 | ||||
| [[package]] | ||||
| name = "pyvis" | ||||
| version = "0.3.2" | ||||
| source = { registry = "https://pypi.org/simple" } | ||||
| dependencies = [ | ||||
|     { name = "ipython" }, | ||||
|     { name = "jinja2" }, | ||||
|     { name = "jsonpickle" }, | ||||
|     { name = "networkx" }, | ||||
| ] | ||||
| wheels = [ | ||||
|     { url = "https://files.pythonhosted.org/packages/ab/4b/e37e4e5d5ee1179694917b445768bdbfb084f5a59ecd38089d3413d4c70f/pyvis-0.3.2-py3-none-any.whl", hash = "sha256:5720c4ca8161dc5d9ab352015723abb7a8bb8fb443edeb07f7a322db34a97555", size = 756038, upload_time = "2023-02-24T20:29:46.758Z" }, | ||||
| ] | ||||
| 
 | ||||
| [[package]] | ||||
| name = "pywin32" | ||||
| version = "310" | ||||
|  | @ -1615,6 +1854,49 @@ wheels = [ | |||
|     { url = "https://files.pythonhosted.org/packages/2d/e5/22865285789f3412ad0c3d7ec4dc0a3e86483b794be8a5d9ed5a19390900/rpds_py-0.24.0-cp313-cp313t-win_amd64.whl", hash = "sha256:675269d407a257b8c00a6b58205b72eec8231656506c56fd429d924ca00bb350", size = 237354, upload_time = "2025-03-26T14:54:33.199Z" }, | ||||
| ] | ||||
| 
 | ||||
| [[package]] | ||||
| name = "scipy" | ||||
| version = "1.16.0" | ||||
| source = { registry = "https://pypi.org/simple" } | ||||
| dependencies = [ | ||||
|     { name = "numpy" }, | ||||
| ] | ||||
| sdist = { url = "https://files.pythonhosted.org/packages/81/18/b06a83f0c5ee8cddbde5e3f3d0bb9b702abfa5136ef6d4620ff67df7eee5/scipy-1.16.0.tar.gz", hash = "sha256:b5ef54021e832869c8cfb03bc3bf20366cbcd426e02a58e8a58d7584dfbb8f62", size = 30581216, upload_time = "2025-06-22T16:27:55.782Z" } | ||||
| wheels = [ | ||||
|     { url = "https://files.pythonhosted.org/packages/46/95/0746417bc24be0c2a7b7563946d61f670a3b491b76adede420e9d173841f/scipy-1.16.0-cp313-cp313-macosx_10_14_x86_64.whl", hash = "sha256:e9f414cbe9ca289a73e0cc92e33a6a791469b6619c240aa32ee18abdce8ab451", size = 36418162, upload_time = "2025-06-22T16:19:56.3Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/19/5a/914355a74481b8e4bbccf67259bbde171348a3f160b67b4945fbc5f5c1e5/scipy-1.16.0-cp313-cp313-macosx_12_0_arm64.whl", hash = "sha256:bbba55fb97ba3cdef9b1ee973f06b09d518c0c7c66a009c729c7d1592be1935e", size = 28465985, upload_time = "2025-06-22T16:20:01.238Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/58/46/63477fc1246063855969cbefdcee8c648ba4b17f67370bd542ba56368d0b/scipy-1.16.0-cp313-cp313-macosx_14_0_arm64.whl", hash = "sha256:58e0d4354eacb6004e7aa1cd350e5514bd0270acaa8d5b36c0627bb3bb486974", size = 20737961, upload_time = "2025-06-22T16:20:05.913Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/93/86/0fbb5588b73555e40f9d3d6dde24ee6fac7d8e301a27f6f0cab9d8f66ff2/scipy-1.16.0-cp313-cp313-macosx_14_0_x86_64.whl", hash = "sha256:75b2094ec975c80efc273567436e16bb794660509c12c6a31eb5c195cbf4b6dc", size = 23377941, upload_time = "2025-06-22T16:20:10.668Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/ca/80/a561f2bf4c2da89fa631b3cbf31d120e21ea95db71fd9ec00cb0247c7a93/scipy-1.16.0-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:6b65d232157a380fdd11a560e7e21cde34fdb69d65c09cb87f6cc024ee376351", size = 33196703, upload_time = "2025-06-22T16:20:16.097Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/11/6b/3443abcd0707d52e48eb315e33cc669a95e29fc102229919646f5a501171/scipy-1.16.0-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:1d8747f7736accd39289943f7fe53a8333be7f15a82eea08e4afe47d79568c32", size = 35083410, upload_time = "2025-06-22T16:20:21.734Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/20/ab/eb0fc00e1e48961f1bd69b7ad7e7266896fe5bad4ead91b5fc6b3561bba4/scipy-1.16.0-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:eb9f147a1b8529bb7fec2a85cf4cf42bdfadf9e83535c309a11fdae598c88e8b", size = 35387829, upload_time = "2025-06-22T16:20:27.548Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/57/9e/d6fc64e41fad5d481c029ee5a49eefc17f0b8071d636a02ceee44d4a0de2/scipy-1.16.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:d2b83c37edbfa837a8923d19c749c1935ad3d41cf196006a24ed44dba2ec4358", size = 37841356, upload_time = "2025-06-22T16:20:35.112Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/7c/a7/4c94bbe91f12126b8bf6709b2471900577b7373a4fd1f431f28ba6f81115/scipy-1.16.0-cp313-cp313-win_amd64.whl", hash = "sha256:79a3c13d43c95aa80b87328a46031cf52508cf5f4df2767602c984ed1d3c6bbe", size = 38403710, upload_time = "2025-06-22T16:21:54.473Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/47/20/965da8497f6226e8fa90ad3447b82ed0e28d942532e92dd8b91b43f100d4/scipy-1.16.0-cp313-cp313t-macosx_10_14_x86_64.whl", hash = "sha256:f91b87e1689f0370690e8470916fe1b2308e5b2061317ff76977c8f836452a47", size = 36813833, upload_time = "2025-06-22T16:20:43.925Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/28/f4/197580c3dac2d234e948806e164601c2df6f0078ed9f5ad4a62685b7c331/scipy-1.16.0-cp313-cp313t-macosx_12_0_arm64.whl", hash = "sha256:88a6ca658fb94640079e7a50b2ad3b67e33ef0f40e70bdb7dc22017dae73ac08", size = 28974431, upload_time = "2025-06-22T16:20:51.302Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/8a/fc/e18b8550048d9224426e76906694c60028dbdb65d28b1372b5503914b89d/scipy-1.16.0-cp313-cp313t-macosx_14_0_arm64.whl", hash = "sha256:ae902626972f1bd7e4e86f58fd72322d7f4ec7b0cfc17b15d4b7006efc385176", size = 21246454, upload_time = "2025-06-22T16:20:57.276Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/8c/48/07b97d167e0d6a324bfd7484cd0c209cc27338b67e5deadae578cf48e809/scipy-1.16.0-cp313-cp313t-macosx_14_0_x86_64.whl", hash = "sha256:8cb824c1fc75ef29893bc32b3ddd7b11cf9ab13c1127fe26413a05953b8c32ed", size = 23772979, upload_time = "2025-06-22T16:21:03.363Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/4c/4f/9efbd3f70baf9582edf271db3002b7882c875ddd37dc97f0f675ad68679f/scipy-1.16.0-cp313-cp313t-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:de2db7250ff6514366a9709c2cba35cb6d08498e961cba20d7cff98a7ee88938", size = 33341972, upload_time = "2025-06-22T16:21:11.14Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/3f/dc/9e496a3c5dbe24e76ee24525155ab7f659c20180bab058ef2c5fa7d9119c/scipy-1.16.0-cp313-cp313t-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:e85800274edf4db8dd2e4e93034f92d1b05c9421220e7ded9988b16976f849c1", size = 35185476, upload_time = "2025-06-22T16:21:19.156Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/ce/b3/21001cff985a122ba434c33f2c9d7d1dc3b669827e94f4fc4e1fe8b9dfd8/scipy-1.16.0-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:4f720300a3024c237ace1cb11f9a84c38beb19616ba7c4cdcd771047a10a1706", size = 35570990, upload_time = "2025-06-22T16:21:27.797Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/e5/d3/7ba42647d6709251cdf97043d0c107e0317e152fa2f76873b656b509ff55/scipy-1.16.0-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:aad603e9339ddb676409b104c48a027e9916ce0d2838830691f39552b38a352e", size = 37950262, upload_time = "2025-06-22T16:21:36.976Z" }, | ||||
|     { url = "https://files.pythonhosted.org/packages/eb/c4/231cac7a8385394ebbbb4f1ca662203e9d8c332825ab4f36ffc3ead09a42/scipy-1.16.0-cp313-cp313t-win_amd64.whl", hash = "sha256:f56296fefca67ba605fd74d12f7bd23636267731a72cb3947963e76b8c0a25db", size = 38515076, upload_time = "2025-06-22T16:21:45.694Z" }, | ||||
| ] | ||||
| 
 | ||||
| [[package]] | ||||
| name = "seaborn" | ||||
| version = "0.13.2" | ||||
| source = { registry = "https://pypi.org/simple" } | ||||
| dependencies = [ | ||||
|     { name = "matplotlib" }, | ||||
|     { name = "numpy" }, | ||||
|     { name = "pandas" }, | ||||
| ] | ||||
| sdist = { url = "https://files.pythonhosted.org/packages/86/59/a451d7420a77ab0b98f7affa3a1d78a313d2f7281a57afb1a34bae8ab412/seaborn-0.13.2.tar.gz", hash = "sha256:93e60a40988f4d65e9f4885df477e2fdaff6b73a9ded434c1ab356dd57eefff7", size = 1457696, upload_time = "2024-01-25T13:21:52.551Z" } | ||||
| wheels = [ | ||||
|     { url = "https://files.pythonhosted.org/packages/83/11/00d3c3dfc25ad54e731d91449895a79e4bf2384dc3ac01809010ba88f6d5/seaborn-0.13.2-py3-none-any.whl", hash = "sha256:636f8336facf092165e27924f223d3c62ca560b1f2bb5dff7ab7fad265361987", size = 294914, upload_time = "2024-01-25T13:21:49.598Z" }, | ||||
| ] | ||||
| 
 | ||||
| [[package]] | ||||
| name = "send2trash" | ||||
| version = "1.8.3" | ||||
|  | @ -1665,28 +1947,42 @@ name = "sprint-econtai" | |||
| version = "0.1.0" | ||||
| source = { virtual = "." } | ||||
| dependencies = [ | ||||
|     { name = "coloraide" }, | ||||
|     { name = "dotenv" }, | ||||
|     { name = "graphviz" }, | ||||
|     { name = "jupyter" }, | ||||
|     { name = "litellm" }, | ||||
|     { name = "matplotlib" }, | ||||
|     { name = "notebook" }, | ||||
|     { name = "openai" }, | ||||
|     { name = "openpyxl" }, | ||||
|     { name = "pandas" }, | ||||
|     { name = "pyarrow" }, | ||||
|     { name = "pyvis" }, | ||||
|     { name = "requests" }, | ||||
|     { name = "scipy" }, | ||||
|     { name = "seaborn" }, | ||||
|     { name = "tenacity" }, | ||||
|     { name = "tqdm" }, | ||||
| ] | ||||
| 
 | ||||
| [package.metadata] | ||||
| requires-dist = [ | ||||
|     { name = "coloraide", specifier = ">=4.6" }, | ||||
|     { name = "dotenv", specifier = ">=0.9.9" }, | ||||
|     { name = "graphviz", specifier = ">=0.20.3" }, | ||||
|     { name = "jupyter", specifier = ">=1.1.1" }, | ||||
|     { name = "litellm", specifier = "==1.67.0" }, | ||||
|     { name = "matplotlib", specifier = ">=3.10.3" }, | ||||
|     { name = "notebook", specifier = ">=7.4.1" }, | ||||
|     { name = "openai", specifier = ">=1.76.0" }, | ||||
|     { name = "openpyxl", specifier = ">=3.1.5" }, | ||||
|     { name = "pandas", specifier = ">=2.2.3" }, | ||||
|     { name = "pyarrow", specifier = ">=20.0.0" }, | ||||
|     { name = "pyvis", specifier = ">=0.3.2" }, | ||||
|     { name = "requests", specifier = ">=2.32.3" }, | ||||
|     { name = "scipy" }, | ||||
|     { name = "seaborn", specifier = ">=0.13.2" }, | ||||
|     { name = "tenacity", specifier = ">=9.1.2" }, | ||||
|     { name = "tqdm", specifier = ">=4.67.1" }, | ||||
| ] | ||||
|  |  | |||
	 Félix Dorn